UNIT 2 Big Data Mining and Analytics
Nearest Neighbor Search (NNS)
Key Concepts
1. Distance Metrics:
○ Euclidean Distance: Commonly used in continuous spaces, defined as $\sqrt{\sum_{i=1}^{n}(x_i - y_i)^2}$.
○ Manhattan Distance: Sum of absolute differences, $\sum_{i=1}^{n}|x_i - y_i|$.
○ Cosine Similarity: Measures the cosine of the angle between two vectors, $\frac{\vec{x} \cdot \vec{y}}{\|\vec{x}\|\,\|\vec{y}\|}$.
○ Hamming Distance: Used for binary data, counting differing positions between
two strings of equal length.
2. Types of Nearest Neighbor Search:
○ Exact Nearest Neighbor: Guarantees finding the closest point.
○ Approximate Nearest Neighbor (ANN): Provides a solution that is close to the
exact nearest neighbor but with improved performance.
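The distance metrics listed above map directly to code. The following is a minimal plain-Python sketch; the vectors and strings are illustrative, not taken from the notes.

```python
import math

def euclidean(x, y):
    # Square root of the sum of squared coordinate differences.
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def manhattan(x, y):
    # Sum of absolute coordinate differences.
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def cosine_similarity(x, y):
    # Dot product divided by the product of the vector norms.
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi ** 2 for xi in x))
    norm_y = math.sqrt(sum(yi ** 2 for yi in y))
    return dot / (norm_x * norm_y)

def hamming(x, y):
    # Number of positions where two equal-length sequences differ.
    return sum(1 for xi, yi in zip(x, y) if xi != yi)

if __name__ == "__main__":
    a, b = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
    print(euclidean(a, b), manhattan(a, b), cosine_similarity(a, b))
    print(hamming("10110", "10011"))
```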
Search Techniques and Data Structures
1. Linear Search:
○ The simplest form where the distance between the query and all points in the
dataset is calculated.
○ Time Complexity: $O(n)$, where $n$ is the number of points.
2. Data Structures for Efficient NNS:
○ KD-Tree (k-Dimensional Tree):
■ Space-partitioning data structure for organizing points in a k-dimensional
space.
■ Suitable for low-dimensional spaces.
■ Construction: $O(n \log n)$
■ Search: $O(\log n)$ in the best case.
○ Ball Tree:
■ Uses hyperspheres to partition the space.
■ Suitable for high-dimensional spaces.
■ Construction: $O(n \log n)$
■ Search: Efficient in high dimensions.
○ VP-Tree (Vantage-Point Tree):
■ Organizes data points based on their distance from a selected vantage
point.
■ Suitable for metric spaces.
○ LSH (Locality-Sensitive Hashing):
■ Projects data into lower-dimensional spaces using hash functions.
■ Suitable for high-dimensional and large-scale datasets.
■ Search: Sub-linear time complexity.
3. Approximate Nearest Neighbor Algorithms:
○ ANN-BallTree:
■ A variant of Ball Trees optimized for approximate search.
○ Product Quantization:
■ Compresses the data points into a product space, enabling fast distance
computation.
○ Hierarchical Navigable Small World (HNSW):
■ Graph-based approach for ANN.
■ Constructs a navigable small-world graph where search efficiency is high.
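As a hedged illustration of how these structures are used in practice, the sketch below relies on scikit-learn's NearestNeighbors class (an assumed dependency, not mentioned in the notes); its algorithm parameter switches between brute-force linear search, a KD-tree, and a Ball tree over the same data.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)
data = rng.random((10_000, 8))    # 10,000 points in 8 dimensions
query = rng.random((1, 8))        # a single query point

# Same API, different index structure underneath each run.
for algorithm in ("brute", "kd_tree", "ball_tree"):
    nn = NearestNeighbors(n_neighbors=5, algorithm=algorithm)
    nn.fit(data)
    distances, indices = nn.kneighbors(query)
    print(algorithm, indices[0], distances[0].round(3))
```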
Practical Considerations
1. Curse of Dimensionality:
○ As dimensionality increases, the distance between points becomes less
meaningful.
○ High-dimensional spaces often require specialized algorithms like LSH or HNSW.
2. Choice of Distance Metric:
○ The choice of distance metric can significantly impact the performance and
accuracy of the search.
○ Must align with the nature of the data and the problem domain.
3. Dimensionality Reduction:
○ Techniques like PCA (Principal Component Analysis) or t-SNE (t-Distributed
Stochastic Neighbor Embedding) can reduce dimensionality while preserving
important relationships.
4. Scalability:
○ For very large datasets, distributed computing frameworks like Apache Spark or
libraries like Faiss (Facebook AI Similarity Search) can be used to handle
scalability issues.
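Building on the scalability point, here is a minimal sketch assuming the Faiss library is installed; it builds a flat (exact) L2 index, the starting point from which Faiss's approximate and distributed index types scale to much larger datasets. The sizes and dimensions are illustrative.

```python
import numpy as np
import faiss  # assumed dependency (faiss-cpu package)

d = 64                                                   # vector dimensionality
xb = np.random.random((100_000, d)).astype("float32")    # database vectors
xq = np.random.random((5, d)).astype("float32")          # query vectors

index = faiss.IndexFlatL2(d)          # exact L2 index; IVF/HNSW variants trade accuracy for speed
index.add(xb)                         # add the database vectors
distances, ids = index.search(xq, 5)  # 5 nearest neighbors per query
print(ids)
```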
Applications of Nearest Neighbor Search
● Recommendation Systems: Suggest items similar to the ones a user has shown interest in.
● Image Retrieval: Find images similar to a query image.
● Anomaly Detection: Identify unusual patterns by finding points that do not have close
neighbors.
● Genomics: Find genetic sequences similar to a given sequence.
Shingling of Documents
Shingling is a technique used in text processing to convert documents into sets of contiguous
subsequences, which are called shingles or k-grams. This method is crucial in various
applications like document similarity detection, plagiarism detection, and near-duplicate web
page identification.
Key Concepts
1. Shingle Definition:
○ A shingle is a substring of length k extracted from a document. It is often a
sequence of characters or words.
○ The choice of k depends on the granularity of similarity detection required.
Common values are 5 to 10 for characters, or 2 to 5 for words.
2. Character-based Shingling:
○ This method treats documents as a sequence of characters.
○ Example: For the string "abcde" and k=2, the shingles are {"ab", "bc", "cd", "de"}.
3. Word-based Shingling:
○ This method treats documents as a sequence of words.
○ Example: For the sentence "the quick brown fox" and k=2, the shingles are {"the
quick", "quick brown", "brown fox"}.
4. Hashing Shingles:
○ Shingles can be hashed into numerical values to reduce the dimensionality and
facilitate efficient storage and comparison.
○ Common hashing functions include MD5, SHA-1, or any other suitable hash
function.
5. Use of Shingling in Similarity Detection:
○ By converting documents into sets of shingles, we can compare the sets to
measure document similarity.
○ Jaccard similarity coefficient is often used to compare the similarity between two
sets of shingles.
Shingling Procedure
1. Preprocessing:
○ Remove punctuation, convert to lowercase, and handle special characters to
ensure uniformity.
○ Tokenize the document into characters or words based on the chosen shingling
method.
2. Generating Shingles:
○ Extract contiguous substrings (shingles) of length k from the preprocessed
document.
○ Store these shingles as a set or list for further processing.
3. Hashing and Representation:
○ Optionally, hash the shingles to represent them as numerical values.
○ This helps in efficient storage and quick computation.
Given Document: "The quick brown fox jumps over the lazy dog."
Preprocessing:
● Convert to lowercase: "the quick brown fox jumps over the lazy dog"
● Remove punctuation: "the quick brown fox jumps over the lazy dog"
● Character shingles (k = 5): {"the q", "he qu", "e qui", " quic", "quick", "uick ", "ick b", "ck br", "k bro", " brow", "brown", "rown ", "own f", "wn fo", "n fox", " fox ", "fox j", "ox ju", "x jum", " jump", "jumps", "umps ", "mps o", "ps ov", "s ove", " over", "over ", "ver t", "er th", "r the", " the ", "the l", "he la", "e laz", " lazy", "lazy ", "azy d", "zy do", "y dog"}
● Word shingles (k = 2): {"the quick", "quick brown", "brown fox", "fox jumps", "jumps over", "over the", "the lazy", "lazy dog"}
Applications of Shingling
1. Near-Duplicate Detection:
○ Identifying near-duplicate web pages to reduce redundancy in search engine
indices.
○ Plagiarism detection in academic papers.
2. Clustering Documents:
○ Grouping similar documents together based on shingle similarity.
○ Useful in organizing large document repositories.
3. Data Deduplication:
○ Reducing storage requirements by identifying and removing duplicate or
near-duplicate data entries.
Similarity-Preserving Summaries
In big data mining and analytics, efficiently comparing large datasets is crucial. Traditional methods can be computationally expensive and impractical for large volumes of data. Similarity-preserving summaries, built with techniques such as locality-sensitive hashing (LSH) and sketching, provide efficient ways to approximate similarity measures and make the comparison of large datasets feasible.
Key Concepts
1. Similarity Measures:
○ Jaccard Similarity: Measures the similarity between two sets as the size of the
intersection divided by the size of the union.
○ Cosine Similarity: Measures the cosine of the angle between two non-zero
vectors in an inner product space.
○ Euclidean Distance: Measures the straight-line distance between two points in
Euclidean space.
○ Hamming Distance: Measures the number of positions at which the
corresponding symbols are different.
2. MinHash:
○ Objective: Approximate Jaccard similarity between sets.
○ How it Works: Uses random permutations of the universe of elements and
creates signatures for sets. The probability that the min-hash values of two sets
are equal is equal to their Jaccard similarity.
3. Locality-Sensitive Hashing (LSH):
○ Objective: To hash input items so that similar items map to the same "buckets"
with high probability.
○ Types:
■ MinHash LSH: For Jaccard similarity.
■ Cosine LSH: For cosine similarity.
■ E2LSH: For Euclidean distance.
○ Procedure:
■ Generate multiple hash functions.
■ Items are hashed into buckets using these functions.
■ Similar items end up in the same or nearby buckets.
4. Sketching Techniques:
○ Objective: To create compact representations (sketches) of data that preserve
similarity.
○ Count-Min Sketch: Estimates the frequency of elements in a data stream.
○ HyperLogLog: Estimates the cardinality (number of distinct elements) of a set.
○ Bloom Filter: Probabilistic data structure to test membership of an element in a
set, with a certain false positive rate.
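As a hedged sketch of the MinHash idea described above, the code below replaces explicit random permutations with random hash functions of the form (a·x + b) mod p, a standard practical substitute; the sets and the number of permutations (128) are illustrative.

```python
import random

PRIME = 4294967311  # prime larger than any 32-bit hash value

def make_hash_funcs(num_perm, seed=1):
    # Each (a, b) pair simulates one random permutation via (a*x + b) mod PRIME.
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(num_perm)]

def minhash_signature(items, hash_funcs):
    hashed = [hash(x) & 0xFFFFFFFF for x in items]
    return [min((a * h + b) % PRIME for h in hashed) for a, b in hash_funcs]

def estimated_jaccard(sig1, sig2):
    # The fraction of signature positions that agree approximates the Jaccard similarity.
    return sum(s1 == s2 for s1, s2 in zip(sig1, sig2)) / len(sig1)

funcs = make_hash_funcs(128)
set_a = {"the quick", "quick brown", "brown fox", "fox jumps"}
set_b = {"the quick", "quick brown", "brown cat", "cat jumps"}
print(estimated_jaccard(minhash_signature(set_a, funcs), minhash_signature(set_b, funcs)))
```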
Applications
1. Near-Duplicate Detection:
○ Identifying duplicate or near-duplicate documents, images, or records.
○ Applications in web crawling, plagiarism detection, and data deduplication.
2. Approximate Nearest Neighbors (ANN):
○ Finding data points that are approximately closest to a given point in
high-dimensional spaces.
○ Applications in recommendation systems, image retrieval, and clustering.
3. Data Compression and Storage:
○ Reducing the storage requirements by summarizing the data while preserving
essential information.
○ Efficient storage of large datasets and streaming data.
4. Data Stream Analysis:
○ Real-time analysis of data streams using sketches to keep track of important
statistics.
○ Applications in network monitoring, financial data analysis, and online analytics.
Implementation Details
1. MinHash Implementation:
○ Define multiple random permutations of the element space.
○ For each set, apply each permutation and record the minimum value (min-hash
signature).
○ Compare signatures of two sets to estimate Jaccard similarity.
2. LSH Implementation:
○ Choose appropriate hash functions for the similarity measure.
○ Hash each item into buckets.
○ Use multiple hash tables to improve the accuracy.
3. Sketching Implementation:
○ Count-Min Sketch:
■ Initialize a matrix with zero values.
■ Use multiple hash functions to map elements to columns.
■ Update counts for incoming elements.
■ Estimate frequency by taking the minimum value across hash functions.
○ Bloom Filter:
■ Initialize a bit array with zero values.
■ Use multiple hash functions to map elements to positions in the bit array.
■ Set bits to 1 for each element.
■ Check membership by verifying all relevant bits are set.
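The Count-Min Sketch and Bloom Filter steps just listed translate almost directly into code. This minimal sketch simulates independent hash functions by salting Python's built-in hash, an illustrative shortcut rather than a production choice.

```python
class CountMinSketch:
    def __init__(self, width=1000, depth=5):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _positions(self, item):
        # One column index per row, derived from a salted hash.
        return [hash((row, item)) % self.width for row in range(self.depth)]

    def add(self, item, count=1):
        for row, col in enumerate(self._positions(item)):
            self.table[row][col] += count

    def estimate(self, item):
        # Taking the minimum across rows limits the over-count from collisions.
        return min(self.table[row][col] for row, col in enumerate(self._positions(item)))


class BloomFilter:
    def __init__(self, size=10_000, num_hashes=4):
        self.size, self.num_hashes = size, num_hashes
        self.bits = [0] * size

    def _positions(self, item):
        return [hash((i, item)) % self.size for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # False means definitely absent; True may be a false positive.
        return all(self.bits[pos] for pos in self._positions(item))


cms, bloom = CountMinSketch(), BloomFilter()
for word in ["spark", "hadoop", "spark", "flink", "spark"]:
    cms.add(word)
    bloom.add(word)
print(cms.estimate("spark"), bloom.might_contain("hadoop"), bloom.might_contain("storm"))
```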
Locality-Sensitive Hashing (LSH)
Locality-Sensitive Hashing (LSH) is often described as a dimension-reduction technique for high-dimensional data.
It is particularly useful for approximate nearest neighbor search in large datasets. The goal of
LSH is to hash input items so that similar items map to the same buckets with high probability,
making it easier to find similar pairs.
Key Concepts
1. Hash Functions: Functions that map input data to fixed-size values (hash codes).
2. Buckets: Groups or bins where hashed items are placed.
3. Similarity: A measure of how alike two items are, often determined by distance metrics
like Euclidean distance, Cosine similarity, or Jaccard index.
LSH Families
LSH families are collections of hash functions with the property that similar items are more likely
to be hashed to the same bucket.
Steps in LSH
1. Choose an LSH family suited to the similarity measure (e.g., MinHash for Jaccard similarity, random hyperplanes for cosine similarity).
2. Hash every item into buckets using several hash functions, typically grouped into multiple hash tables.
3. For a query, hash it with the same functions and compare it only against the items that land in the same buckets.
Applications
● Near-duplicate detection of documents and web pages.
● Approximate nearest neighbor search in recommendation and retrieval systems.
● Large-scale clustering and data deduplication.
Advantages
● Efficiency: Reduces the need to compare all pairs of items, making the search process
faster.
● Scalability: Suitable for very large datasets due to its probabilistic nature and efficient
querying.
Challenges
● Parameter Tuning: Selecting appropriate parameters for hash functions and number of
hash tables.
● Approximation: LSH provides approximate results, which may not always be sufficient
depending on the application.
Example
Consider a dataset of documents. To find similar documents using LSH with MinHash:
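One hedged way to carry out this example is with the third-party datasketch library (an assumed dependency); its MinHash and MinHashLSH classes implement the signature and bucketing steps. The document names, shingle size, and threshold below are illustrative.

```python
from datasketch import MinHash, MinHashLSH  # assumed third-party dependency

documents = {
    "doc1": "the quick brown fox jumps over the lazy dog",
    "doc2": "the quick brown fox leaps over the lazy dog",
    "doc3": "big data mining uses locality sensitive hashing",
}

def minhash_of(text, num_perm=128):
    m = MinHash(num_perm=num_perm)
    # Feed each word 2-shingle into the signature.
    words = text.split()
    for shingle in {" ".join(words[i:i + 2]) for i in range(len(words) - 1)}:
        m.update(shingle.encode("utf8"))
    return m

# Index every document, then query with doc1 to find candidates above the threshold.
lsh = MinHashLSH(threshold=0.5, num_perm=128)
signatures = {name: minhash_of(text) for name, text in documents.items()}
for name, sig in signatures.items():
    lsh.insert(name, sig)
print(lsh.query(signatures["doc1"]))   # expected to include doc1 and likely doc2
```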
1. Hashing Basics
● Hashing: A technique to convert data of arbitrary size to fixed-size values using hash
functions.
● Hash Function: A function that takes input data (documents) and returns a fixed-size
string or number, typically called a hash code or hash value.
2. Hash Functions
● Properties:
○ Deterministic: The same input always produces the same output.
○ Uniformity: Evenly distributes hash values.
○ Efficiency: Computes hash value quickly.
○ Pre-image Resistance: Difficult to reverse-engineer the input from the hash
value.
● Common Hash Functions:
○ MD5: Produces a 128-bit hash value. Widely used but vulnerable to collisions.
○ SHA-1: Produces a 160-bit hash value. More secure than MD5 but now
considered weak against collision attacks.
○ SHA-256: Produces a 256-bit hash value. Part of the SHA-2 family, providing
stronger security.
● MinHash Algorithm:
○ Convert documents into sets (e.g., sets of words, shingles).
○ Apply multiple hash functions to each element in the set.
○ Record the minimum hash value for each hash function.
○ Compare the minhash signatures of two documents to estimate their similarity.
● Jaccard Similarity:
○ Measures similarity between finite sets.
○ $\text{Jaccard Similarity}(A, B) = \frac{|A \cap B|}{|A \cup B|}$
○ MinHash approximates the Jaccard similarity efficiently.
● Purpose: Group similar items into the same bucket with high probability.
● Procedure:
○ Choose a hash function that maps similar documents to the same buckets.
○ Use multiple hash functions to increase accuracy.
○ Compare only documents within the same bucket to find similar pairs.
● Applications:
○ Plagiarism detection.
○ Duplicate content detection in web crawlers.
○ Nearest neighbor search in high-dimensional spaces.
6. Applications of Hashing
● Web Search Engines: Use hashing to index and retrieve similar web pages.
● Data Cleaning: Remove duplicates in large datasets.
● Spam Detection: Identify spam emails by comparing hash values of content.
Distance Measures
In the context of Big Data Mining and Analytics, distance measures are crucial for various tasks
such as clustering, classification, and anomaly detection. Here's an overview of different
distance measures commonly used in this field:
1. Euclidean Distance
● Definition: The straight-line distance between two points in Euclidean space.
● Formula: $d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$
● Usage: The default choice for continuous numeric data, e.g., in k-means clustering and k-NN.
2. Manhattan Distance
● Definition: The sum of the absolute differences of the coordinates.
● Formula: $d(p, q) = \sum_{i=1}^{n} |p_i - q_i|$
● Usage: Useful for grid-like data and when large single-coordinate differences should count less than under Euclidean distance.
3. Minkowski Distance
● Definition: A generalized distance measure that includes both Euclidean and Manhattan
distances as special cases.
● Formula: $d(p, q) = \left( \sum_{i=1}^{n} |p_i - q_i|^r \right)^{1/r}$
○ For $r = 1$, it becomes the Manhattan distance.
○ For $r = 2$, it becomes the Euclidean distance.
● Usage: Flexible measure that can be tuned based on the value of $r$.
4. Cosine Similarity
● Definition: Measures the cosine of the angle between two non-zero vectors in an inner
product space.
● Formula: $\text{cosine\_similarity}(p, q) = \frac{p \cdot q}{\|p\|\,\|q\|}$, where $p \cdot q$ is the dot product and $\|p\|$, $\|q\|$ are the magnitudes of the vectors.
● Usage: Commonly used in text mining and information retrieval, particularly for
measuring document similarity.
5. Jaccard Index
● Definition: Measures similarity between finite sets as the size of the intersection divided by the size of the union.
● Formula: $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$
● Usage: Common for set-valued and binary data, e.g., comparing shingle sets of documents.
6. Hamming Distance
● Definition: Measures the number of positions at which the corresponding symbols are
different.
● Formula: For two strings of equal length, $d(p, q) = \sum_{i=1}^{n} \delta(p_i, q_i)$, where $\delta(p_i, q_i)$ is 1 if $p_i \neq q_i$ and 0 otherwise.
● Usage: Useful for categorical data and in error detection and correction in coding theory.
7. Mahalanobis Distance
● Definition: A distance measure that accounts for the correlations of the data set and is
scale-invariant.
● Formula: $d(p, q) = \sqrt{(p - q)^T S^{-1} (p - q)}$, where $S$ is the covariance matrix of the data.
● Usage: Suitable for multivariate data and when the data distribution is not spherical.
8. Chebyshev Distance
● Definition: The maximum absolute difference between the coordinates of two points.
● Formula: $d(p, q) = \max_{i} |p_i - q_i|$
● Usage: Used when the largest single-coordinate difference matters most, as in chessboard (king-move) distance.
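A hedged sketch of several of these measures using SciPy's scipy.spatial.distance module (an assumed dependency); the vectors and the small random dataset used for the covariance matrix are illustrative.

```python
import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

print(distance.euclidean(x, y))        # straight-line distance
print(distance.cityblock(x, y))        # Manhattan distance
print(distance.minkowski(x, y, p=3))   # Minkowski distance with r = 3
print(distance.cosine(x, y))           # cosine *distance*, i.e. 1 - cosine similarity
print(distance.chebyshev(x, y))        # maximum coordinate difference
print(distance.hamming(list("10110"), list("10011")))  # fraction (not count) of differing positions

# Mahalanobis distance needs the inverse covariance matrix of a dataset.
data = np.random.default_rng(0).normal(size=(100, 3))
VI = np.linalg.inv(np.cov(data, rowvar=False))
print(distance.mahalanobis(x, y, VI))
```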
Here’s a comprehensive breakdown of the key concepts, theories, and applications related to
Locality Sensitive Functions:
Components:
● Hash Functions: These are designed to ensure that similar items have a higher
probability of being mapped to the same hash value.
● Buckets: The hashed values are stored in buckets, facilitating quick retrieval of similar
items.
Applications:
● Near-duplicate detection, approximate nearest neighbor search, clustering, and recommendation, as discussed throughout this unit.
Similarity Measures: Different similarity measures are used depending on the type of data and
the application:
● Jaccard Similarity: Used for sets and measures the size of the intersection divided by
the size of the union of the sets.
● Cosine Similarity: Commonly used in text analysis, measuring the cosine of the angle
between two vectors.
● Hamming Distance: Used for binary vectors, counting the number of positions at which
the corresponding bits are different.
● Euclidean Distance: Measures the "straight line" distance between two points in
Euclidean space.
Family of Hash Functions: For a hash function $h$ to be locality sensitive with respect to a similarity measure $S$, it must satisfy the following conditions for any two points $p$ and $q$:
● If $S(p, q)$ is high, then the probability that $h(p) = h(q)$ is high.
● If $S(p, q)$ is low, then the probability that $h(p) = h(q)$ is low.
● MinHashing: A technique where multiple random permutations of the input sets are
used, and the minimum value in each permutation is taken to form the hash value.
● Random Hyperplanes: Randomly generate hyperplanes and use them to split the data.
The sign of the dot product of the vector with the hyperplane normal determines the
hash.
● P-Stable Distributions: Using projections of data points onto random vectors drawn
from a p-stable distribution (like Gaussian distribution for L2 norm).
LSH Algorithm:
1. Choose Parameters: Select the number of hash functions $k$ and the number of hash tables $L$.
2. Generate Hash Functions: Create $L$ groups of $k$ hash functions each.
3. Hash Data: For each data point, compute the hash values using the hash functions and insert the data point into the corresponding buckets in all $L$ hash tables.
4. Querying: To find similar items for a query, hash the query point and retrieve candidate
points from the corresponding buckets in the hash tables. These candidates are then
checked for actual similarity.
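A minimal sketch of this algorithm for cosine similarity using random hyperplanes (NumPy assumed): k sign bits form each table's bucket key, and L independent tables provide the OR-style amplification; all sizes below are illustrative.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(7)
dim, k, L = 32, 12, 6                     # dimensionality, bits per table, number of tables
planes = rng.normal(size=(L, k, dim))     # L groups of k random hyperplanes

def hash_keys(vector):
    # Sign of each projection gives one bit; k bits form the bucket key for each table.
    bits = planes @ vector > 0            # boolean array of shape (L, k)
    return [tuple(row) for row in bits]

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Hash every data point into all L tables.
data = rng.normal(size=(5_000, dim))
tables = [defaultdict(list) for _ in range(L)]
for idx, vec in enumerate(data):
    for table, key in zip(tables, hash_keys(vec)):
        table[key].append(idx)

# Query: gather candidates from matching buckets, then verify with the true similarity.
query = data[0] + 0.05 * rng.normal(size=dim)
candidates = set()
for table, key in zip(tables, hash_keys(query)):
    candidates.update(table[key])
if candidates:
    best = max(candidates, key=lambda i: cosine(data[i], query))
    print(len(candidates), best)          # best is expected to be index 0
else:
    print("no candidates found; decrease k or increase L")
```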
Efficiency:
● Time Complexity: LSH can reduce the query time to sublinear with respect to the
number of points in the dataset, making it scalable for large datasets.
● Space Complexity: It requires additional space for storing the hash tables, but this
trade-off is often acceptable given the efficiency gains.
Multi-Probe LSH: Instead of probing a single bucket in each hash table, this method probes
multiple buckets, increasing the recall without significantly increasing the number of hash tables.
Cross Polytope LSH: A more recent development that uses the geometry of cross polytopes
for high-dimensional nearest neighbor search, offering improved performance over traditional
LSH methods.
Scalable and Parallel LSH: Implementations that leverage parallel computing and distributed
systems to handle extremely large datasets efficiently.
Locality-Sensitive Hashing (LSH) is a technique used for approximate nearest neighbor search
in high-dimensional spaces. It aims to hash input items in such a way that similar items are
mapped to the same buckets with high probability, while dissimilar items are mapped to different
buckets.
LSH Families
LSH families are collections of hash functions that ensure the property of locality sensitivity.
There are various types of LSH families, each suited to a particular kind of data and distance metric: MinHash for Jaccard similarity, random hyperplane hashing for cosine similarity, and p-stable projections for Euclidean distance. Whichever family is chosen, an LSH index is built and queried as follows:
1. Hash Tables: Multiple hash functions from the chosen LSH family are combined to
create a hash table. Each input vector is hashed using these functions, and the result is
concatenated to form a composite hash value which is then used to index the hash
table.
2. Amplification: Combining several hash functions within each table (an AND construction) lowers the chance of false positives, while building multiple hash tables and treating an item as a candidate if it collides in at least one of them (an OR construction) lowers the chance of false negatives.
3. Query Process:
○ Given a query point, compute its hash value using the same hash functions.
○ Retrieve all points in the buckets corresponding to the query hash value.
○ Perform a final distance computation to find the actual nearest neighbors from
the retrieved points.
● Advantages:
○ Efficient for high-dimensional data.
○ Sub-linear query time complexity in terms of the number of data points.
○ Can handle different distance metrics by choosing appropriate LSH families.
● Challenges:
○ Choosing the right parameters (e.g., number of hash functions, width parameter)
is non-trivial.
○ Performance can degrade with extremely high-dimensional data due to the curse
of dimensionality.
○ May require a large number of hash tables for high accuracy, leading to
increased memory usage.
Applications of LSH
● Near-duplicate detection of documents, images, and web pages, plagiarism detection, and data deduplication.
● Approximate nearest neighbor search in recommendation systems, image retrieval, and clustering.
Similarity Measures and Related Methods
A range of similarity measures and method families support these applications:
● Cosine Similarity: Measures the cosine of the angle between two vectors. It's widely
used in text mining and recommendation systems.
● Euclidean Distance: Measures the straight-line distance between two points in
Euclidean space. It's used when data points are represented as vectors.
● Jaccard Similarity: Measures similarity between finite sample sets, often used in
document similarity and set matching.
Clustering Algorithms:
● k-Means and Hierarchical Clustering: Group data points using a chosen distance measure such as Euclidean or cosine distance.
Dimensionality Reduction:
● PCA and t-SNE: Project high-dimensional data into lower-dimensional spaces while preserving the similarity structure relevant to the analysis.
Graph-Based Methods:
● Graph Similarity: Measures how similar two graphs are based on structural or
attribute-based features.
● Graph Clustering Algorithms: Identifies communities in graphs based on similarity
measures among nodes.
Nearest Neighbor Methods:
● k-Nearest Neighbors (k-NN): Finds k data points that are closest to a query point based
on a similarity measure.
● Locality-Sensitive Hashing (LSH): Efficiently finds approximate nearest neighbors by
hashing similar items into the same buckets.
Deep Learning-Based Methods:
● Siamese Networks: Learns embeddings for pairs of data points where similarity is
measured in the embedding space.
● Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs):
Used in tasks such as image and sequence similarity, respectively.
Anomaly Detection:
● Distance-Based Methods: Detect anomalies based on the distance of data points from
their neighbors.
● Density-Based Methods: Identify anomalies as data points in sparse regions of the
dataset.
Collaborative Filtering:
● User-Based and Item-Based Filtering: Recommend items by finding users or items similar to the target, typically using similarity measures such as cosine similarity or Pearson correlation.