Stochastic Neighbor Compression

Matt Kusner, Stephen Tyree, Kilian Weinberger, Kunal Agrawal
Proceedings of the 31st International Conference on Machine Learning, PMLR 32(2):622-630, 2014.

Abstract

We present Stochastic Neighbor Compression (SNC), an algorithm that compresses a dataset for the purpose of k-nearest neighbor (kNN) classification. Given training data, SNC learns a much smaller synthetic dataset that minimizes the stochastic 1-nearest neighbor classification error on the training data. This approach has several appealing properties: due to its small size, the compressed set speeds up kNN testing drastically (by up to several orders of magnitude in our experiments); it makes the kNN classifier substantially more robust to label noise; on 4 of 7 datasets it yields lower test error than kNN on the entire training set, even at compression ratios as low as 2%; and finally, SNC yields impressive speed-ups over kNN even when both methods are used with ball-tree data structures, hashing, and LMNN dimensionality reduction, demonstrating that SNC is complementary to existing state-of-the-art techniques for accelerating kNN classification and leads to substantial further improvements.
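
To make the learning objective concrete, below is a minimal NumPy sketch of stochastic 1-nearest-neighbor compression as described in the abstract. It treats the compressed set Z as free parameters with fixed labels, models the probability that each training point's stochastic nearest neighbor shares its label under a Gaussian soft assignment with width parameter gamma, and minimizes the resulting negative log-likelihood. The function name snc_compress, the per-class random initialization, the choice of gamma, and the plain gradient-descent optimizer are illustrative assumptions, not the authors' exact procedure.

import numpy as np

def snc_compress(X, y, m_per_class, gamma=1.0, lr=0.1, iters=200, seed=0):
    """Learn a small synthetic reference set (Z, zy) for 1-NN classification.

    Sketch: Z is initialized from random training points and moved so as to
    maximize the probability that each training point's stochastic nearest
    neighbor in Z carries the correct label.
    """
    rng = np.random.default_rng(seed)
    Z, zy = [], []
    for c in np.unique(y):  # m_per_class random training points per class
        idx = rng.choice(np.flatnonzero(y == c), size=m_per_class, replace=False)
        Z.append(X[idx])
        zy.append(np.full(m_per_class, c))
    Z, zy = np.concatenate(Z).astype(float), np.concatenate(zy)
    same = (y[:, None] == zy[None, :])  # n x m mask: reference j shares x_i's label

    for _ in range(iters):
        diff = X[:, None, :] - Z[None, :, :]            # n x m x d
        W = np.exp(-gamma**2 * (diff ** 2).sum(-1))     # Gaussian soft assignments
        S = W.sum(axis=1, keepdims=True)                # total mass per point
        Sp = (W * same).sum(axis=1, keepdims=True)      # mass on same-label refs
        # p_i = Sp_i / S_i; the objective is -sum_i log p_i. Its gradient pulls
        # same-label references toward x_i and pushes wrong-label ones away.
        coef = (same / (Sp + 1e-12) - 1.0 / (S + 1e-12)) * W   # n x m
        grad = -2 * gamma**2 * (coef[:, :, None] * diff).sum(axis=0)  # m x d
        Z -= lr * grad

    return Z, zy

At test time one would run ordinary 1-NN (or kNN) search against the returned (Z, zy) instead of the full training set, e.g. Z, zy = snc_compress(X_train, y_train, m_per_class=20); the speed-up comes from the search cost scaling with the compressed size rather than with n.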

Cite this Paper


BibTeX
@InProceedings{pmlr-v32-kusner14,
  title     = {Stochastic Neighbor Compression},
  author    = {Kusner, Matt and Tyree, Stephen and Weinberger, Kilian and Agrawal, Kunal},
  booktitle = {Proceedings of the 31st International Conference on Machine Learning},
  pages     = {622--630},
  year      = {2014},
  editor    = {Xing, Eric P. and Jebara, Tony},
  volume    = {32},
  number    = {2},
  series    = {Proceedings of Machine Learning Research},
  address   = {Beijing, China},
  month     = {22--24 Jun},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v32/kusner14.pdf},
  url       = {https://proceedings.mlr.press/v32/kusner14.html}
}
Endnote
%0 Conference Paper
%T Stochastic Neighbor Compression
%A Matt Kusner
%A Stephen Tyree
%A Kilian Weinberger
%A Kunal Agrawal
%B Proceedings of the 31st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2014
%E Eric P. Xing
%E Tony Jebara
%F pmlr-v32-kusner14
%I PMLR
%P 622--630
%U https://proceedings.mlr.press/v32/kusner14.html
%V 32
%N 2
RIS
TY  - CPAPER
TI  - Stochastic Neighbor Compression
AU  - Matt Kusner
AU  - Stephen Tyree
AU  - Kilian Weinberger
AU  - Kunal Agrawal
BT  - Proceedings of the 31st International Conference on Machine Learning
DA  - 2014/06/18
ED  - Eric P. Xing
ED  - Tony Jebara
ID  - pmlr-v32-kusner14
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 32
IS  - 2
SP  - 622
EP  - 630
L1  - http://proceedings.mlr.press/v32/kusner14.pdf
UR  - https://proceedings.mlr.press/v32/kusner14.html
ER  -
APA
Kusner, M., Tyree, S., Weinberger, K. & Agrawal, K. (2014). Stochastic Neighbor Compression. Proceedings of the 31st International Conference on Machine Learning, in Proceedings of Machine Learning Research 32(2):622-630. Available from https://proceedings.mlr.press/v32/kusner14.html.