Large Scale Distributed Semi-Supervised Learning Using Streaming Approximation

Sujith Ravi, Qiming Diao
Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, PMLR 51:519-528, 2016.

Abstract

Traditional graph-based semi-supervised learning (SSL) approaches are not suited for massive data and large label scenarios since they scale linearly with the number of edges |E| and distinct labels m. To deal with the large label size problem, recent works propose sketch-based methods to approximate the label distribution per node thereby achieving a space reduction from O(m) to O(\log m), under certain conditions. In this paper, we present a novel streaming graph-based SSL approximation that effectively captures the sparsity of the label distribution and further reduces the space complexity per node to O(1). We also provide a distributed version of the algorithm that scales well to large data sizes. Experiments on real-world datasets demonstrate that the new method achieves better performance than existing state-of-the-art algorithms with significant reduction in memory footprint. Finally, we propose a robust graph augmentation strategy using unsupervised deep learning architectures that yields further significant quality gains for SSL in natural language applications.
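The sketch-based baseline the abstract mentions is commonly realized with a Count-Min Sketch: instead of storing all m label weights per node exactly, each node keeps a small depth × width counter table, giving roughly O(log m) space at the cost of bounded over-estimation. The minimal sketch below illustrates only that baseline idea, not the paper's own O(1) streaming method; all class and parameter names are illustrative, not from the paper.

```python
import hashlib

class CountMinSketch:
    """Approximate a node's label-weight distribution in sublinear space.

    Illustrative only: a generic Count-Min Sketch, not the paper's
    implementation. `width` and `depth` are hypothetical parameters.
    """

    def __init__(self, width=64, depth=4):
        self.width = width
        self.depth = depth
        # depth rows of width counters, all initially zero.
        self.table = [[0.0] * width for _ in range(depth)]

    def _index(self, label, row):
        # One deterministic hash per row, derived from a row-salted digest.
        digest = hashlib.md5(f"{row}:{label}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def update(self, label, weight):
        # Add `weight` of probability mass for `label` in every row.
        for row in range(self.depth):
            self.table[row][self._index(label, row)] += weight

    def estimate(self, label):
        # Minimum over rows; never under-estimates the true mass,
        # and the over-estimate shrinks as `width` grows.
        return min(self.table[row][self._index(label, row)]
                   for row in range(self.depth))

cms = CountMinSketch()
cms.update("city", 0.7)
cms.update("person", 0.3)
est = cms.estimate("city")  # at least 0.7; close to it when collisions are rare
```

Note the one-sided error: `estimate` can only over-count (hash collisions add mass from other labels), which is why the abstract's O(log m) bound holds only "under certain conditions" on the sketch dimensions relative to the label distribution.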

Cite this Paper


BibTeX
@InProceedings{pmlr-v51-ravi16,
  title     = {Large Scale Distributed Semi-Supervised Learning Using Streaming Approximation},
  author    = {Ravi, Sujith and Diao, Qiming},
  booktitle = {Proceedings of the 19th International Conference on Artificial Intelligence and Statistics},
  pages     = {519--528},
  year      = {2016},
  editor    = {Gretton, Arthur and Robert, Christian C.},
  volume    = {51},
  series    = {Proceedings of Machine Learning Research},
  address   = {Cadiz, Spain},
  month     = {09--11 May},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v51/ravi16.pdf},
  url       = {https://proceedings.mlr.press/v51/ravi16.html},
  abstract  = {Traditional graph-based semi-supervised learning (SSL) approaches are not suited for massive data and large label scenarios since they scale linearly with the number of edges |E| and distinct labels m. To deal with the large label size problem, recent works propose sketch-based methods to approximate the label distribution per node thereby achieving a space reduction from O(m) to O(\log m), under certain conditions. In this paper, we present a novel streaming graph-based SSL approximation that effectively captures the sparsity of the label distribution and further reduces the space complexity per node to O(1). We also provide a distributed version of the algorithm that scales well to large data sizes. Experiments on real-world datasets demonstrate that the new method achieves better performance than existing state-of-the-art algorithms with significant reduction in memory footprint. Finally, we propose a robust graph augmentation strategy using unsupervised deep learning architectures that yields further significant quality gains for SSL in natural language applications.}
}
Endnote
%0 Conference Paper
%T Large Scale Distributed Semi-Supervised Learning Using Streaming Approximation
%A Sujith Ravi
%A Qiming Diao
%B Proceedings of the 19th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2016
%E Arthur Gretton
%E Christian C. Robert
%F pmlr-v51-ravi16
%I PMLR
%P 519--528
%U https://proceedings.mlr.press/v51/ravi16.html
%V 51
%X Traditional graph-based semi-supervised learning (SSL) approaches are not suited for massive data and large label scenarios since they scale linearly with the number of edges |E| and distinct labels m. To deal with the large label size problem, recent works propose sketch-based methods to approximate the label distribution per node thereby achieving a space reduction from O(m) to O(\log m), under certain conditions. In this paper, we present a novel streaming graph-based SSL approximation that effectively captures the sparsity of the label distribution and further reduces the space complexity per node to O(1). We also provide a distributed version of the algorithm that scales well to large data sizes. Experiments on real-world datasets demonstrate that the new method achieves better performance than existing state-of-the-art algorithms with significant reduction in memory footprint. Finally, we propose a robust graph augmentation strategy using unsupervised deep learning architectures that yields further significant quality gains for SSL in natural language applications.
RIS
TY  - CPAPER
TI  - Large Scale Distributed Semi-Supervised Learning Using Streaming Approximation
AU  - Sujith Ravi
AU  - Qiming Diao
BT  - Proceedings of the 19th International Conference on Artificial Intelligence and Statistics
DA  - 2016/05/02
ED  - Arthur Gretton
ED  - Christian C. Robert
ID  - pmlr-v51-ravi16
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 51
SP  - 519
EP  - 528
L1  - http://proceedings.mlr.press/v51/ravi16.pdf
UR  - https://proceedings.mlr.press/v51/ravi16.html
AB  - Traditional graph-based semi-supervised learning (SSL) approaches are not suited for massive data and large label scenarios since they scale linearly with the number of edges |E| and distinct labels m. To deal with the large label size problem, recent works propose sketch-based methods to approximate the label distribution per node thereby achieving a space reduction from O(m) to O(\log m), under certain conditions. In this paper, we present a novel streaming graph-based SSL approximation that effectively captures the sparsity of the label distribution and further reduces the space complexity per node to O(1). We also provide a distributed version of the algorithm that scales well to large data sizes. Experiments on real-world datasets demonstrate that the new method achieves better performance than existing state-of-the-art algorithms with significant reduction in memory footprint. Finally, we propose a robust graph augmentation strategy using unsupervised deep learning architectures that yields further significant quality gains for SSL in natural language applications.
ER  -
APA
Ravi, S. & Diao, Q. (2016). Large Scale Distributed Semi-Supervised Learning Using Streaming Approximation. Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 51:519-528. Available from https://proceedings.mlr.press/v51/ravi16.html.