Distributed Representations of Sentences and Documents

Quoc Le, Tomas Mikolov
Proceedings of the 31st International Conference on Machine Learning, PMLR 32(2):1188-1196, 2014.

Abstract

Many machine learning algorithms require the input to be represented as a fixed-length feature vector. For text, one of the most common representations is bag-of-words. Despite their popularity, bag-of-words models have two major weaknesses: they lose the ordering of the words and they ignore the semantics of the words. For example, "powerful," "strong," and "Paris" are treated as equally distant. In this paper, we propose an unsupervised algorithm that learns vector representations of sentences and text documents. This algorithm represents each document by a dense vector which is trained to predict words in the document. Its construction gives our algorithm the potential to overcome the weaknesses of bag-of-words models. Empirical results show that our technique outperforms bag-of-words models as well as other techniques for text representation. Finally, we achieve new state-of-the-art results on several text classification and sentiment analysis tasks.
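
The method described in this paper is widely known as Paragraph Vector, and the gensim library implements it as Doc2Vec. The Python snippet below is a minimal sketch of training document vectors and inferring a fixed-length vector for unseen text; it assumes gensim >= 4.0 is installed, and the toy corpus and hyperparameter values are illustrative choices, not settings from the paper.

# Minimal sketch of the Paragraph Vector method via gensim's Doc2Vec.
# Assumes gensim >= 4.0; corpus and hyperparameters are toy values.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [
    "the movie was powerful and moving",
    "a strong performance by the lead actor",
    "we spent a week sightseeing in paris",
]

# Each document gets a unique tag; its dense vector is trained to
# predict the words that occur in that document.
documents = [
    TaggedDocument(words=text.split(), tags=[i])
    for i, text in enumerate(corpus)
]

model = Doc2Vec(
    documents,
    vector_size=50,   # dimensionality of the document vectors
    window=2,         # context window for the word-prediction task
    min_count=1,      # keep every word in this toy corpus
    dm=1,             # 1 = PV-DM (distributed memory); 0 = PV-DBOW
    epochs=40,
)

# Infer a fixed-length vector for an unseen document.
vec = model.infer_vector("a powerful and strong film".split())
print(vec.shape)  # (50,)

The inferred vectors can then serve as fixed-length features for any standard classifier, which is the setting in which the abstract reports its text classification and sentiment analysis results.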

Cite this Paper


BibTeX
@InProceedings{pmlr-v32-le14,
  title     = {Distributed Representations of Sentences and Documents},
  author    = {Le, Quoc and Mikolov, Tomas},
  booktitle = {Proceedings of the 31st International Conference on Machine Learning},
  pages     = {1188--1196},
  year      = {2014},
  editor    = {Xing, Eric P. and Jebara, Tony},
  volume    = {32},
  number    = {2},
  series    = {Proceedings of Machine Learning Research},
  address   = {Beijing, China},
  month     = {22--24 Jun},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v32/le14.pdf},
  url       = {https://proceedings.mlr.press/v32/le14.html}
}
Endnote
%0 Conference Paper
%T Distributed Representations of Sentences and Documents
%A Quoc Le
%A Tomas Mikolov
%B Proceedings of the 31st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2014
%E Eric P. Xing
%E Tony Jebara
%F pmlr-v32-le14
%I PMLR
%P 1188--1196
%U https://proceedings.mlr.press/v32/le14.html
%V 32
%N 2
RIS
TY  - CPAPER
TI  - Distributed Representations of Sentences and Documents
AU  - Quoc Le
AU  - Tomas Mikolov
BT  - Proceedings of the 31st International Conference on Machine Learning
DA  - 2014/06/18
ED  - Eric P. Xing
ED  - Tony Jebara
ID  - pmlr-v32-le14
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 32
IS  - 2
SP  - 1188
EP  - 1196
L1  - http://proceedings.mlr.press/v32/le14.pdf
UR  - https://proceedings.mlr.press/v32/le14.html
ER  -
APA
Le, Q. & Mikolov, T. (2014). Distributed Representations of Sentences and Documents. Proceedings of the 31st International Conference on Machine Learning, in Proceedings of Machine Learning Research 32(2):1188-1196. Available from https://proceedings.mlr.press/v32/le14.html.