In Defense of Minhash over Simhash

Anshumali Shrivastava, Ping Li
Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, PMLR 33:886-894, 2014.

Abstract

MinHash and SimHash are the two widely adopted Locality Sensitive Hashing (LSH) algorithms for large-scale data processing applications. Deciding which LSH to use for a particular problem at hand is an important question, which has no clear answer in the existing literature. In this study, we provide a theoretical answer (validated by experiments) that MinHash virtually always outperforms SimHash when the data are binary, as is common in practice, for example in search. The collision probability of MinHash is a function of the resemblance similarity (R), while the collision probability of SimHash is a function of the cosine similarity (S). To provide a common basis for comparison, we evaluate retrieval results in terms of S for both MinHash and SimHash. This evaluation is valid because we can prove that MinHash is a valid LSH with respect to S, using the general inequality S² ≤ R ≤ S/(2−S). Our worst-case analysis shows that MinHash significantly outperforms SimHash in the high-similarity region. Interestingly, our intensive experiments reveal that MinHash is also substantially better than SimHash even on datasets where most of the data points are not very similar to each other. This is partly because, in practical data, R ≥ S/(z−S) often holds, where z is only slightly larger than 2 (e.g., z ≤ 2.1). Our restricted worst-case analysis, assuming S/(z−S) ≤ R ≤ S/(2−S), shows that MinHash indeed significantly outperforms SimHash even in the low-similarity region. We believe the results in this paper will provide valuable guidelines for search in practice, especially when the data are sparse.
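A quick numerical sketch of the two similarities and the inequality above (a minimal Python illustration, not from the paper; the set sizes and the permutation-based MinHash estimate are illustrative assumptions):

```python
import math
import random

def resemblance(A, B):
    """Resemblance (Jaccard) similarity R = |A ∩ B| / |A ∪ B| of two sets."""
    return len(A & B) / len(A | B)

def cosine(A, B):
    """Cosine similarity S of the binary indicator vectors of sets A and B."""
    return len(A & B) / math.sqrt(len(A) * len(B))

def minhash(S_, perm):
    """MinHash value of set S_ under a permutation given as element -> rank."""
    return min(perm[x] for x in S_)

random.seed(42)
universe = list(range(500))
A = set(random.sample(universe, 120))
B = set(random.sample(universe, 150))
R, S = resemblance(A, B), cosine(A, B)

# The paper's general inequality: S^2 <= R <= S / (2 - S)
assert S**2 <= R <= S / (2 - S) + 1e-12

# MinHash collides with probability exactly R; estimate it empirically
# over random permutations of the universe.
trials, collisions = 5000, 0
for _ in range(trials):
    ranks = list(range(len(universe)))
    random.shuffle(ranks)
    perm = dict(zip(universe, ranks))
    if minhash(A, perm) == minhash(B, perm):
        collisions += 1
print(f"R = {R:.3f}, empirical MinHash collision rate = {collisions / trials:.3f}")
```

The upper bound S/(2−S) is tight when |A| = |B| (by the AM–GM inequality applied to the denominators), which is why the bound is informative for near-duplicate detection on similarly sized sets.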

Cite this Paper


BibTeX
@InProceedings{pmlr-v33-shrivastava14,
  title     = {{In Defense of Minhash over Simhash}},
  author    = {Shrivastava, Anshumali and Li, Ping},
  booktitle = {Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics},
  pages     = {886--894},
  year      = {2014},
  editor    = {Kaski, Samuel and Corander, Jukka},
  volume    = {33},
  series    = {Proceedings of Machine Learning Research},
  address   = {Reykjavik, Iceland},
  month     = {22--25 Apr},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v33/shrivastava14.pdf},
  url       = {https://proceedings.mlr.press/v33/shrivastava14.html},
  abstract  = {MinHash and SimHash are the two widely adopted Locality Sensitive Hashing (LSH) algorithms for large-scale data processing applications. Deciding which LSH to use for a particular problem at hand is an important question, which has no clear answer in the existing literature. In this study, we provide a theoretical answer (validated by experiments) that MinHash virtually always outperforms SimHash when the data are binary, as common in practice such as search. The collision probability of MinHash is a function of \emph{resemblance} similarity ($\mathcal{R}$), while the collision probability of SimHash is a function of \emph{cosine} similarity ($\mathcal{S}$). To provide a common basis for comparison, we evaluate retrieval results in terms of $\mathcal{S}$ for both MinHash and SimHash. This evaluation is valid as we can prove that MinHash is a valid LSH with respect to $\mathcal{S}$, by using a general inequality $\mathcal{S}^2 \leq \mathcal{R} \leq \frac{\mathcal{S}}{2-\mathcal{S}}$. Our \textbf{worst case} analysis can show that MinHash significantly outperforms SimHash in the \textbf{high similarity} region. Interestingly, our intensive experiments reveal that MinHash is also substantially better than SimHash even in datasets where most of the data points are not too similar to each other. This is partly because, in practical data, often $\mathcal{R} \geq \frac{\mathcal{S}}{z-\mathcal{S}}$ holds where $z$ is only slightly larger than 2 (e.g., $z \leq 2.1$). Our \textbf{restricted worst case} analysis by assuming $\frac{\mathcal{S}}{z-\mathcal{S}} \leq \mathcal{R} \leq \frac{\mathcal{S}}{2-\mathcal{S}}$ shows that MinHash indeed significantly outperforms SimHash even in the \textbf{low similarity} region. We believe the results in this paper will provide valuable guidelines for search in practice, especially when the data are sparse.}
}
APA
Shrivastava, A., & Li, P. (2014). In Defense of Minhash over Simhash. Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 33:886-894. Available from https://proceedings.mlr.press/v33/shrivastava14.html.
