Online Latent Dirichlet Allocation with Infinite Vocabulary

Ke Zhai, Jordan Boyd-Graber
Proceedings of the 30th International Conference on Machine Learning, PMLR 28(1):561-569, 2013.

Abstract

Topic models based on latent Dirichlet allocation (LDA) assume a vocabulary that is fixed a priori. This is reasonable in batch settings, but not when data are revealed over time, as with streaming/online algorithms. To address this gap, we extend LDA by drawing topics from a Dirichlet process whose base distribution is a distribution over all strings, rather than from a finite Dirichlet. We develop inference using online variational inference and, because we can only consider a finite number of words for each truncated topic, propose heuristics to dynamically organize, expand, and contract the set of words in our vocabulary truncation. We show that our model successfully incorporates new words as it encounters them and that it outperforms online LDA in evaluations of topic quality and classification performance.
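The abstract's two key moves, drawing each topic from a Dirichlet process over all strings and maintaining a finite, dynamically updated truncation of that topic, can be illustrated concretely. Below is a minimal sketch, not the authors' implementation: base_draw stands in for a toy character-level base distribution G0, stick_breaking_topic is a standard truncated stick-breaking draw from DP(alpha, G0), and expand_and_contract is a hypothetical stand-in for the paper's organize/expand/contract heuristics. All function names and parameter values are illustrative assumptions.

import random
import string

def base_draw(stop_prob=0.3, rng=random):
    """Toy base distribution G0 over all strings: append lowercase
    letters until a geometric stopping event. Illustrative only; the
    paper's G0 is a distribution over strings, but not necessarily this one."""
    chars = [rng.choice(string.ascii_lowercase)]
    while rng.random() > stop_prob:
        chars.append(rng.choice(string.ascii_lowercase))
    return "".join(chars)

def stick_breaking_topic(alpha=1.0, truncation=20, rng=random):
    """Truncated stick-breaking draw of one topic from DP(alpha, G0):
    a probability distribution over strings, not over a fixed vocabulary."""
    topic, remaining = {}, 1.0
    for i in range(truncation):
        beta = rng.betavariate(1.0, alpha)
        weight = remaining * beta
        remaining *= 1.0 - beta
        if i == truncation - 1:
            weight += remaining  # fold leftover stick mass into the last atom
        word = base_draw(rng=rng)
        topic[word] = topic.get(word, 0.0) + weight  # merge repeated draws
    return topic

def expand_and_contract(topic, new_word, capacity=20, new_mass=0.01):
    """Hypothetical stand-in for the paper's truncation heuristics:
    admit a newly observed word with a little mass, then prune the
    lowest-weight words back to `capacity` and renormalize."""
    if new_word not in topic:
        topic[new_word] = new_mass                       # expand
    ranked = sorted(topic.items(), key=lambda kv: -kv[1])
    kept = dict(ranked[:capacity])                       # contract
    total = sum(kept.values())
    return {w: p / total for w, p in kept.items()}       # renormalize

topic = stick_breaking_topic()
topic = expand_and_contract(topic, "hashtag")  # a previously unseen term
print(sorted(topic.items(), key=lambda kv: -kv[1])[:5])

The point of the sketch is that a topic's support is a set of strings rather than indices into a predefined vocabulary, so a previously unseen term can be admitted on the fly while low-weight words are pruned to keep the truncation bounded.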

Cite this Paper


BibTeX
@InProceedings{pmlr-v28-zhai13,
  title     = {Online Latent {D}irichlet Allocation with Infinite Vocabulary},
  author    = {Zhai, Ke and Boyd-Graber, Jordan},
  booktitle = {Proceedings of the 30th International Conference on Machine Learning},
  pages     = {561--569},
  year      = {2013},
  editor    = {Dasgupta, Sanjoy and McAllester, David},
  volume    = {28},
  number    = {1},
  series    = {Proceedings of Machine Learning Research},
  address   = {Atlanta, Georgia, USA},
  month     = {17--19 Jun},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v28/zhai13.pdf},
  url       = {https://proceedings.mlr.press/v28/zhai13.html}
}
Endnote
%0 Conference Paper
%T Online Latent Dirichlet Allocation with Infinite Vocabulary
%A Ke Zhai
%A Jordan Boyd-Graber
%B Proceedings of the 30th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2013
%E Sanjoy Dasgupta
%E David McAllester
%F pmlr-v28-zhai13
%I PMLR
%P 561--569
%U https://proceedings.mlr.press/v28/zhai13.html
%V 28
%N 1
RIS
TY - CPAPER
TI - Online Latent Dirichlet Allocation with Infinite Vocabulary
AU - Ke Zhai
AU - Jordan Boyd-Graber
BT - Proceedings of the 30th International Conference on Machine Learning
DA - 2013/02/13
ED - Sanjoy Dasgupta
ED - David McAllester
ID - pmlr-v28-zhai13
PB - PMLR
DP - Proceedings of Machine Learning Research
VL - 28
IS - 1
SP - 561
EP - 569
L1 - http://proceedings.mlr.press/v28/zhai13.pdf
UR - https://proceedings.mlr.press/v28/zhai13.html
ER -
APA
Zhai, K., & Boyd-Graber, J. (2013). Online Latent Dirichlet Allocation with Infinite Vocabulary. Proceedings of the 30th International Conference on Machine Learning, in Proceedings of Machine Learning Research 28(1):561-569. Available from https://proceedings.mlr.press/v28/zhai13.html.
