Gibbs Collapsed Sampling for Latent Dirichlet Allocation on Spark

Zhuolin Qiu, Bin Wu, Bai Wang, Le Yu
Proceedings of the 3rd International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications, PMLR 36:17-28, 2014.

Abstract

In this paper we implement a collapsed Gibbs sampling method for the widely used latent Dirichlet allocation (LDA) model on Spark. Spark is a fast in-memory cluster computing framework for large-scale data processing that has attracted wide attention in the Big Data community and is well suited to iterative and interactive algorithms. Our approach splits the dataset into P*P partitions and, following rules that avoid sampling conflicts, shuffles and recombines these partitions into P sub-datasets, each containing P partitions; the sub-datasets are then processed one after another, with the P partitions within each sub-dataset sampled in parallel. Although this scheme increases the number of iterations, it reduces data communication overhead, makes good use of Spark’s efficient iterative execution, and yields significant speedups on large-scale datasets in our experiments.
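To make the partition schedule concrete, here is a minimal sketch in Scala (Spark's native language), assuming the common "diagonal" rule for conflict-free block scheduling; the object and method names are illustrative and not taken from the paper. Documents are split into P row blocks and the vocabulary into P column blocks, giving P*P partitions; sub-dataset s collects the P blocks (p, (p + s) mod P), so no two blocks in the same sub-dataset share a document row or a word column, and their samplers never touch the same document-topic or word-topic counts.

    object ConflictFreeSchedule {
      // For sub-dataset s, worker p samples block (p, (p + s) % P).
      // Within a sub-dataset, all row indices differ and all column
      // indices differ, so the P blocks can be sampled in parallel
      // without conflicting count updates.
      def subDatasets(P: Int): Seq[Seq[(Int, Int)]] =
        (0 until P).map { s =>
          (0 until P).map(p => (p, (p + s) % P))
        }

      def main(args: Array[String]): Unit = {
        val P = 4
        subDatasets(P).zipWithIndex.foreach { case (blocks, s) =>
          println(s"sub-dataset $s: " + blocks.mkString(" "))
        }
      }
    }

With P = 4 this prints four sub-datasets of four blocks each; one pass over all P sub-datasets visits every block of the P*P grid exactly once, which is why a full Gibbs sweep costs P parallel sub-iterations.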

Cite this Paper


BibTeX
@InProceedings{pmlr-v36-qiu14,
  title     = {Gibbs Collapsed Sampling for Latent Dirichlet Allocation on Spark},
  author    = {Qiu, Zhuolin and Wu, Bin and Wang, Bai and Yu, Le},
  booktitle = {Proceedings of the 3rd International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications},
  pages     = {17--28},
  year      = {2014},
  editor    = {Fan, Wei and Bifet, Albert and Yang, Qiang and Yu, Philip S.},
  volume    = {36},
  series    = {Proceedings of Machine Learning Research},
  address   = {New York, New York, USA},
  month     = {24 Aug},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v36/qiu14.pdf},
  url       = {https://proceedings.mlr.press/v36/qiu14.html},
  abstract  = {In this paper we implement a collapsed Gibbs sampling method for the widely used latent Dirichlet allocation (LDA) model on Spark. Spark is a fast in-memory cluster computing framework for large-scale data processing that has attracted wide attention in the Big Data community and is well suited to iterative and interactive algorithms. Our approach splits the dataset into P*P partitions and, following rules that avoid sampling conflicts, shuffles and recombines these partitions into P sub-datasets, each containing P partitions; the sub-datasets are then processed one after another, with the P partitions within each sub-dataset sampled in parallel. Although this scheme increases the number of iterations, it reduces data communication overhead, makes good use of Spark’s efficient iterative execution, and yields significant speedups on large-scale datasets in our experiments.}
}
EndNote
%0 Conference Paper
%T Gibbs Collapsed Sampling for Latent Dirichlet Allocation on Spark
%A Zhuolin Qiu
%A Bin Wu
%A Bai Wang
%A Le Yu
%B Proceedings of the 3rd International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications
%C Proceedings of Machine Learning Research
%D 2014
%E Wei Fan
%E Albert Bifet
%E Qiang Yang
%E Philip S. Yu
%F pmlr-v36-qiu14
%I PMLR
%P 17--28
%U https://proceedings.mlr.press/v36/qiu14.html
%V 36
%X In this paper we implement a collapsed Gibbs sampling method for the widely used latent Dirichlet allocation (LDA) model on Spark. Spark is a fast in-memory cluster computing framework for large-scale data processing that has attracted wide attention in the Big Data community and is well suited to iterative and interactive algorithms. Our approach splits the dataset into P*P partitions and, following rules that avoid sampling conflicts, shuffles and recombines these partitions into P sub-datasets, each containing P partitions; the sub-datasets are then processed one after another, with the P partitions within each sub-dataset sampled in parallel. Although this scheme increases the number of iterations, it reduces data communication overhead, makes good use of Spark’s efficient iterative execution, and yields significant speedups on large-scale datasets in our experiments.
RIS
TY - CPAPER
TI - Gibbs Collapsed Sampling for Latent Dirichlet Allocation on Spark
AU - Zhuolin Qiu
AU - Bin Wu
AU - Bai Wang
AU - Le Yu
BT - Proceedings of the 3rd International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications
DA - 2014/08/13
ED - Wei Fan
ED - Albert Bifet
ED - Qiang Yang
ED - Philip S. Yu
ID - pmlr-v36-qiu14
PB - PMLR
DP - Proceedings of Machine Learning Research
VL - 36
SP - 17
EP - 28
L1 - http://proceedings.mlr.press/v36/qiu14.pdf
UR - https://proceedings.mlr.press/v36/qiu14.html
AB - In this paper we implement a collapsed Gibbs sampling method for the widely used latent Dirichlet allocation (LDA) model on Spark. Spark is a fast in-memory cluster computing framework for large-scale data processing that has attracted wide attention in the Big Data community and is well suited to iterative and interactive algorithms. Our approach splits the dataset into P*P partitions and, following rules that avoid sampling conflicts, shuffles and recombines these partitions into P sub-datasets, each containing P partitions; the sub-datasets are then processed one after another, with the P partitions within each sub-dataset sampled in parallel. Although this scheme increases the number of iterations, it reduces data communication overhead, makes good use of Spark’s efficient iterative execution, and yields significant speedups on large-scale datasets in our experiments.
ER -
APA
Qiu, Z., Wu, B., Wang, B. & Yu, L. (2014). Gibbs Collapsed Sampling for Latent Dirichlet Allocation on Spark. Proceedings of the 3rd International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications, in Proceedings of Machine Learning Research 36:17-28. Available from https://proceedings.mlr.press/v36/qiu14.html.

Related Material

Download PDF: http://proceedings.mlr.press/v36/qiu14.pdf