MinHash-rs

A Rust implementation of MinHash trying to be parsimonious with memory.

What is MinHash?

MinHash is a probabilistic data structure used to estimate the similarity between two sets. It is based on the observation that if we hash two sets of objects, the probability that the hashes agree is equal to the Jaccard similarity between the two sets.

How does it work?

MinHash works by hashing the elements of a set and keeping track of the minimum hash value for each hash function. The probability that the minimum hash value of two sets is the same is equal to the Jaccard similarity between the two sets. By using multiple hash functions, we can estimate the Jaccard similarity between two sets by averaging the probability that the minimum hash value of the two sets is the same.

Using this crate

As usual, just add the following to your Cargo.toml file, altough remember to check out the benchmark results below before going for MinHash over HyperLogLog.

[dependencies]
minhash-rs = "0.1.0"

Reason for this implementation

I wanted to benchmark how well does MinHash estimates the Jaccard similarity between two sets and how well does it compare with other methods such as HyperLogLog. The implementations I have found used more memory than it was necessary by the data structure, and I wanted to compare the performance of MinHash with other methods using the same amount of memory. Additionally, oftencase the methods were not optimized in any way shape or form, and I wanted to compare as fairly as possible MinHash with my rather well optimized implementation of HyperLogLog. I have benchmarked MinHash on many different universe sizes, you can find the Jupyter Notebook here.

You can find the raw benchmark results in the tests folder, the compressed CSVs. I will keep adding datapoints for a while to be extra sure, but the results seem to be already quite clear.

After days of benchmarking, it seems the results are in and indeed MinHash does not seem to be ever preferable to HyperLogLog at a comparable memory requirement.

Here is the most comprehensive visualization (find more in the Jupyter Notebook), where the x axis is the memory in bits (log scale) and the y axis is the product of the MSE of estimating the Jaccard similarity multiplied by the time required to estimate the Jaccard similarity, also in log scale. The lower the better.

Name		Name	Last commit message	Last commit date
Latest commit History 24 Commits
.github/workflows		.github/workflows
benches		benches
figures		figures
src		src
tests		tests
.gitignore		.gitignore
Cargo.toml		Cargo.toml
LICENSE		LICENSE
MinHash Jaccard benchmarks.ipynb		MinHash Jaccard benchmarks.ipynb
README.md		README.md
minhash_diagram.drawio		minhash_diagram.drawio
minhash_diagram.jpg		minhash_diagram.jpg

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

MinHash-rs

What is MinHash?

How does it work?

Using this crate

Reason for this implementation

About

Releases

Packages

Contributors 2

Languages

License

LucaCappelletti94/minhash-rs

Folders and files

Latest commit

History

Repository files navigation

MinHash-rs

What is MinHash?

How does it work?

Using this crate

Reason for this implementation

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages