Online Latent Dirichlet Allocation with Infinite Vocabulary

Ke Zhai, Jordan Boyd-Graber
Proceedings of the 30th International Conference on Machine Learning, PMLR 28(1):561-569, 2013.

Abstract

Topic models based on latent Dirichlet allocation (LDA) assume a vocabulary that is fixed a priori. This is reasonable in batch settings, but not when data are revealed over time, as with streaming/online algorithms. To address this gap, we extend LDA by drawing topics from a Dirichlet process whose base distribution is a distribution over all strings, rather than from a finite Dirichlet. We develop inference using online variational inference and, because we can only consider a finite number of words for each truncated topic, propose heuristics to dynamically organize, expand, and contract the set of words in our vocabulary truncation. We show that our model successfully incorporates new words as it encounters them and that it outperforms online LDA in evaluations of topic quality and classification performance.
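The abstract's two key moves, drawing each topic from a Dirichlet process over all strings and maintaining a finite, dynamically updated truncation of that topic, can be illustrated concretely. Below is a minimal sketch, not the authors' implementation: base_draw stands in for a toy character-level base distribution G0, stick_breaking_topic is a standard truncated stick-breaking draw from DP(alpha, G0), and expand_and_contract is a hypothetical stand-in for the paper's organize/expand/contract heuristics. All function names and parameter values are illustrative assumptions.

import random
import string

def base_draw(stop_prob=0.3, rng=random):
    """Toy base distribution G0 over all strings: append lowercase
    letters until a geometric stopping event. Illustrative only; the
    paper's G0 is a distribution over strings, but not necessarily this one."""
    chars = [rng.choice(string.ascii_lowercase)]
    while rng.random() > stop_prob:
        chars.append(rng.choice(string.ascii_lowercase))
    return "".join(chars)

def stick_breaking_topic(alpha=1.0, truncation=20, rng=random):
    """Truncated stick-breaking draw of one topic from DP(alpha, G0):
    a probability distribution over strings, not over a fixed vocabulary."""
    topic, remaining = {}, 1.0
    for i in range(truncation):
        beta = rng.betavariate(1.0, alpha)
        weight = remaining * beta
        remaining *= 1.0 - beta
        if i == truncation - 1:
            weight += remaining  # fold leftover stick mass into the last atom
        word = base_draw(rng=rng)
        topic[word] = topic.get(word, 0.0) + weight  # merge repeated draws
    return topic

def expand_and_contract(topic, new_word, capacity=20, new_mass=0.01):
    """Hypothetical stand-in for the paper's truncation heuristics:
    admit a newly observed word with a little mass, then prune the
    lowest-weight words back to `capacity` and renormalize."""
    if new_word not in topic:
        topic[new_word] = new_mass                       # expand
    ranked = sorted(topic.items(), key=lambda kv: -kv[1])
    kept = dict(ranked[:capacity])                       # contract
    total = sum(kept.values())
    return {w: p / total for w, p in kept.items()}       # renormalize

topic = stick_breaking_topic()
topic = expand_and_contract(topic, "hashtag")  # a previously unseen term
print(sorted(topic.items(), key=lambda kv: -kv[1])[:5])

The point of the sketch is that a topic's support is a set of strings rather than indices into a predefined vocabulary, so a previously unseen term can be admitted on the fly while low-weight words are pruned to keep the truncation bounded.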

Cite this Paper


BibTeX
@InProceedings{pmlr-v28-zhai13,
  title     = {Online Latent {D}irichlet Allocation with Infinite Vocabulary},
  author    = {Zhai, Ke and Boyd-Graber, Jordan},
  booktitle = {Proceedings of the 30th International Conference on Machine Learning},
  pages     = {561--569},
  year      = {2013},
  editor    = {Dasgupta, Sanjoy and McAllester, David},
  volume    = {28},
  number    = {1},
  series    = {Proceedings of Machine Learning Research},
  address   = {Atlanta, Georgia, USA},
  month     = {17--19 Jun},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v28/zhai13.pdf},
  url       = {https://proceedings.mlr.press/v28/zhai13.html}
}
Endnote
%0 Conference Paper
%T Online Latent Dirichlet Allocation with Infinite Vocabulary
%A Ke Zhai
%A Jordan Boyd-Graber
%B Proceedings of the 30th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2013
%E Sanjoy Dasgupta
%E David McAllester
%F pmlr-v28-zhai13
%I PMLR
%P 561--569
%U https://proceedings.mlr.press/v28/zhai13.html
%V 28
%N 1
RIS
TY - CPAPER
TI - Online Latent Dirichlet Allocation with Infinite Vocabulary
AU - Ke Zhai
AU - Jordan Boyd-Graber
BT - Proceedings of the 30th International Conference on Machine Learning
DA - 2013/02/13
ED - Sanjoy Dasgupta
ED - David McAllester
ID - pmlr-v28-zhai13
PB - PMLR
DP - Proceedings of Machine Learning Research
VL - 28
IS - 1
SP - 561
EP - 569
L1 - http://proceedings.mlr.press/v28/zhai13.pdf
UR - https://proceedings.mlr.press/v28/zhai13.html
ER -
APA
Zhai, K., & Boyd-Graber, J. (2013). Online Latent Dirichlet Allocation with Infinite Vocabulary. Proceedings of the 30th International Conference on Machine Learning, in Proceedings of Machine Learning Research 28(1):561-569. Available from https://proceedings.mlr.press/v28/zhai13.html.
