UNIT 2 Big Data Mining and Analytics
Nearest Neighbor Search (NNS)
Key Concepts
1. Distance Metrics:
○ Euclidean Distance: Commonly used in continuous spaces, defined as $\sqrt{\sum_{i=1}^{n}(x_i - y_i)^2}$.
○ Manhattan Distance: Sum of absolute differences, $\sum_{i=1}^{n}|x_i - y_i|$.
○ Cosine Similarity: Measures the cosine of the angle between two vectors, $\frac{\vec{x} \cdot \vec{y}}{\|\vec{x}\|\,\|\vec{y}\|}$.
○ Hamming Distance: Used for binary data, counting differing positions between
two strings of equal length.
2. Types of Nearest Neighbor Search:
○ Exact Nearest Neighbor: Guarantees finding the closest point.
○ Approximate Nearest Neighbor (ANN): Provides a solution that is close to the
exact nearest neighbor but with improved performance.
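The distance metrics listed above map directly to code. The following is a minimal plain-Python sketch; the vectors and strings are illustrative, not taken from the notes.

```python
import math

def euclidean(x, y):
    # Square root of the sum of squared coordinate differences.
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def manhattan(x, y):
    # Sum of absolute coordinate differences.
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def cosine_similarity(x, y):
    # Dot product divided by the product of the vector norms.
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi ** 2 for xi in x))
    norm_y = math.sqrt(sum(yi ** 2 for yi in y))
    return dot / (norm_x * norm_y)

def hamming(x, y):
    # Number of positions where two equal-length sequences differ.
    return sum(1 for xi, yi in zip(x, y) if xi != yi)

if __name__ == "__main__":
    a, b = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
    print(euclidean(a, b), manhattan(a, b), cosine_similarity(a, b))
    print(hamming("10110", "10011"))
```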
Search Techniques and Data Structures
1. Linear Search:
○ The simplest form where the distance between the query and all points in the
dataset is calculated.
○ Time Complexity: $O(n)$, where $n$ is the number of points.
2. Data Structures for Efficient NNS:
○ KD-Tree (k-Dimensional Tree):
■ Space-partitioning data structure for organizing points in a k-dimensional
space.
■ Suitable for low-dimensional spaces.
■ Construction: $O(n \log n)$
■ Search: $O(\log n)$ in the best case.
○ Ball Tree:
■ Uses hyperspheres to partition the space.
■ Suitable for high-dimensional spaces.
■ Construction: $O(n \log n)$
■ Search: Efficient in high dimensions.
○ VP-Tree (Vantage-Point Tree):
■ Organizes data points based on their distance from a selected vantage
point.
■ Suitable for metric spaces.
○ LSH (Locality-Sensitive Hashing):
■ Projects data into lower-dimensional spaces using hash functions.
■ Suitable for high-dimensional and large-scale datasets.
■ Search: Sub-linear time complexity.
3. Approximate Nearest Neighbor Algorithms:
○ ANN-BallTree:
■ A variant of Ball Trees optimized for approximate search.
○ Product Quantization:
■ Compresses the data points into a product space, enabling fast distance
computation.
○ Hierarchical Navigable Small World (HNSW):
■ Graph-based approach for ANN.
■ Constructs a navigable small-world graph where search efficiency is high.
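As a hedged illustration of how these structures are used in practice, the sketch below relies on scikit-learn's NearestNeighbors class (an assumed dependency, not mentioned in the notes); its algorithm parameter switches between brute-force linear search, a KD-tree, and a Ball tree over the same data.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)
data = rng.random((10_000, 8))    # 10,000 points in 8 dimensions
query = rng.random((1, 8))        # a single query point

# Same API, different index structure underneath each run.
for algorithm in ("brute", "kd_tree", "ball_tree"):
    nn = NearestNeighbors(n_neighbors=5, algorithm=algorithm)
    nn.fit(data)
    distances, indices = nn.kneighbors(query)
    print(algorithm, indices[0], distances[0].round(3))
```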
Practical Considerations
1. Curse of Dimensionality:
○ As dimensionality increases, the distance between points becomes less
meaningful.
○ High-dimensional spaces often require specialized algorithms like LSH or HNSW.
2. Choice of Distance Metric:
○ The choice of distance metric can significantly impact the performance and
accuracy of the search.
○ Must align with the nature of the data and the problem domain.
3. Dimensionality Reduction:
○ Techniques like PCA (Principal Component Analysis) or t-SNE (t-Distributed
Stochastic Neighbor Embedding) can reduce dimensionality while preserving
important relationships.
4. Scalability:
○ For very large datasets, distributed computing frameworks like Apache Spark or
libraries like Faiss (Facebook AI Similarity Search) can be used to handle
scalability issues.
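Building on the scalability point, here is a minimal sketch assuming the Faiss library is installed; it builds a flat (exact) L2 index, the starting point from which Faiss's approximate and distributed index types scale to much larger datasets. The sizes and dimensions are illustrative.

```python
import numpy as np
import faiss  # assumed dependency (faiss-cpu package)

d = 64                                                   # vector dimensionality
xb = np.random.random((100_000, d)).astype("float32")    # database vectors
xq = np.random.random((5, d)).astype("float32")          # query vectors

index = faiss.IndexFlatL2(d)          # exact L2 index; IVF/HNSW variants trade accuracy for speed
index.add(xb)                         # add the database vectors
distances, ids = index.search(xq, 5)  # 5 nearest neighbors per query
print(ids)
```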
Applications of Nearest Neighbor Search
● Recommendation Systems: Suggest items similar to the ones a user has shown interest in.
● Image Retrieval: Find images similar to a query image.
● Anomaly Detection: Identify unusual patterns by finding points that do not have close
neighbors.
● Genomics: Find genetic sequences similar to a given sequence.
Shingling of Documents
Shingling is a technique used in text processing to convert documents into sets of contiguous
subsequences, which are called shingles or k-grams. This method is crucial in various
applications like document similarity detection, plagiarism detection, and near-duplicate web
page identification.
Key Concepts
1. Shingle Definition:
○ A shingle is a substring of length k extracted from a document. It is often a
sequence of characters or words.
○ The choice of k depends on the granularity of similarity detection required.
Common values are 5 to 10 for characters, or 2 to 5 for words.
2. Character-based Shingling:
○ This method treats documents as a sequence of characters.
○ Example: For the string "abcde" and k=2, the shingles are {"ab", "bc", "cd", "de"}.
3. Word-based Shingling:
○ This method treats documents as a sequence of words.
○ Example: For the sentence "the quick brown fox" and k=2, the shingles are {"the
quick", "quick brown", "brown fox"}.
4. Hashing Shingles:
○ Shingles can be hashed into numerical values to reduce the dimensionality and
facilitate efficient storage and comparison.
○ Common hashing functions include MD5, SHA-1, or any other suitable hash
function.
5. Use of Shingling in Similarity Detection:
○ By converting documents into sets of shingles, we can compare the sets to
measure document similarity.
○ Jaccard similarity coefficient is often used to compare the similarity between two
sets of shingles.
Shingling Procedure
1. Preprocessing:
○ Remove punctuation, convert to lowercase, and handle special characters to
ensure uniformity.
○ Tokenize the document into characters or words based on the chosen shingling
method.
2. Generating Shingles:
○ Extract contiguous substrings (shingles) of length k from the preprocessed
document.
○ Store these shingles as a set or list for further processing.
3. Hashing and Representation:
○ Optionally, hash the shingles to represent them as numerical values.
○ This helps in efficient storage and quick computation.
Given Document: "The quick brown fox jumps over the lazy dog."
Preprocessing:
● Convert to lowercase: "the quick brown fox jumps over the lazy dog"
● Remove punctuation: "the quick brown fox jumps over the lazy dog"
● Character shingles (k = 5): {"the q", "he qu", "e qui", " quic", "quick", "uick ", "ick b", "ck br", "k bro", " brow", "brown", "rown ", "own f", "wn fo", "n fox", " fox ", "fox j", "ox ju", "x jum", " jump", "jumps", "umps ", "mps o", "ps ov", "s ove", " over", "over ", "ver t", "er th", "r the", " the ", "the l", "he la", "e laz", " lazy", "lazy ", "azy d", "zy do", "y dog"}
● Word shingles (k = 2): {"the quick", "quick brown", "brown fox", "fox jumps", "jumps over", "over the", "the lazy", "lazy dog"}
Applications of Shingling
1. Near-Duplicate Detection:
○ Identifying near-duplicate web pages to reduce redundancy in search engine
indices.
○ Plagiarism detection in academic papers.
2. Clustering Documents:
○ Grouping similar documents together based on shingle similarity.
○ Useful in organizing large document repositories.
3. Data Deduplication:
○ Reducing storage requirements by identifying and removing duplicate or
near-duplicate data entries.
Similarity-Preserving Summaries
In big data mining and analytics, efficiently comparing large datasets is crucial. Traditional methods can be computationally expensive and impractical for large volumes of data. Similarity-preserving summaries, built with techniques such as locality-sensitive hashing (LSH) and sketching, provide efficient ways to approximate similarity measures and make the comparison of large datasets feasible.
Key Concepts
1. Similarity Measures:
○ Jaccard Similarity: Measures the similarity between two sets as the size of the
intersection divided by the size of the union.
○ Cosine Similarity: Measures the cosine of the angle between two non-zero
vectors in an inner product space.
○ Euclidean Distance: Measures the straight-line distance between two points in
Euclidean space.
○ Hamming Distance: Measures the number of positions at which the
corresponding symbols are different.
2. MinHash:
○ Objective: Approximate Jaccard similarity between sets.
○ How it Works: Uses random permutations of the universe of elements and
creates signatures for sets. The probability that the min-hash values of two sets
are equal is equal to their Jaccard similarity.
3. Locality-Sensitive Hashing (LSH):
○ Objective: To hash input items so that similar items map to the same "buckets"
with high probability.
○ Types:
■ MinHash LSH: For Jaccard similarity.
■ Cosine LSH: For cosine similarity.
■ E2LSH: For Euclidean distance.
○ Procedure:
■ Generate multiple hash functions.
■ Items are hashed into buckets using these functions.
■ Similar items end up in the same or nearby buckets.
4. Sketching Techniques:
○ Objective: To create compact representations (sketches) of data that preserve
similarity.
○ Count-Min Sketch: Estimates the frequency of elements in a data stream.
○ HyperLogLog: Estimates the cardinality (number of distinct elements) of a set.
○ Bloom Filter: Probabilistic data structure to test membership of an element in a
set, with a certain false positive rate.
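As a hedged sketch of the MinHash idea described above, the code below replaces explicit random permutations with random hash functions of the form (a·x + b) mod p, a standard practical substitute; the sets and the number of permutations (128) are illustrative.

```python
import random

PRIME = 4294967311  # prime larger than any 32-bit hash value

def make_hash_funcs(num_perm, seed=1):
    # Each (a, b) pair simulates one random permutation via (a*x + b) mod PRIME.
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(num_perm)]

def minhash_signature(items, hash_funcs):
    hashed = [hash(x) & 0xFFFFFFFF for x in items]
    return [min((a * h + b) % PRIME for h in hashed) for a, b in hash_funcs]

def estimated_jaccard(sig1, sig2):
    # The fraction of signature positions that agree approximates the Jaccard similarity.
    return sum(s1 == s2 for s1, s2 in zip(sig1, sig2)) / len(sig1)

funcs = make_hash_funcs(128)
set_a = {"the quick", "quick brown", "brown fox", "fox jumps"}
set_b = {"the quick", "quick brown", "brown cat", "cat jumps"}
print(estimated_jaccard(minhash_signature(set_a, funcs), minhash_signature(set_b, funcs)))
```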
Applications
1. Near-Duplicate Detection:
○ Identifying duplicate or near-duplicate documents, images, or records.
○ Applications in web crawling, plagiarism detection, and data deduplication.
2. Approximate Nearest Neighbors (ANN):
○ Finding data points that are approximately closest to a given point in
high-dimensional spaces.
○ Applications in recommendation systems, image retrieval, and clustering.
3. Data Compression and Storage:
○ Reducing the storage requirements by summarizing the data while preserving
essential information.
○ Efficient storage of large datasets and streaming data.
4. Data Stream Analysis:
○ Real-time analysis of data streams using sketches to keep track of important
statistics.
○ Applications in network monitoring, financial data analysis, and online analytics.
Implementation Details
1. MinHash Implementation:
○ Define multiple random permutations of the element space.
○ For each set, apply each permutation and record the minimum value (min-hash
signature).
○ Compare signatures of two sets to estimate Jaccard similarity.
2. LSH Implementation:
○ Choose appropriate hash functions for the similarity measure.
○ Hash each item into buckets.
○ Use multiple hash tables to improve the accuracy.
3. Sketching Implementation:
○ Count-Min Sketch:
■ Initialize a matrix with zero values.
■ Use multiple hash functions to map elements to columns.
■ Update counts for incoming elements.
■ Estimate frequency by taking the minimum value across hash functions.
○ Bloom Filter:
■ Initialize a bit array with zero values.
■ Use multiple hash functions to map elements to positions in the bit array.
■ Set bits to 1 for each element.
■ Check membership by verifying all relevant bits are set.
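The Count-Min Sketch and Bloom Filter steps just listed translate almost directly into code. This minimal sketch simulates independent hash functions by salting Python's built-in hash, an illustrative shortcut rather than a production choice.

```python
class CountMinSketch:
    def __init__(self, width=1000, depth=5):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _positions(self, item):
        # One column index per row, derived from a salted hash.
        return [hash((row, item)) % self.width for row in range(self.depth)]

    def add(self, item, count=1):
        for row, col in enumerate(self._positions(item)):
            self.table[row][col] += count

    def estimate(self, item):
        # Taking the minimum across rows limits the over-count from collisions.
        return min(self.table[row][col] for row, col in enumerate(self._positions(item)))


class BloomFilter:
    def __init__(self, size=10_000, num_hashes=4):
        self.size, self.num_hashes = size, num_hashes
        self.bits = [0] * size

    def _positions(self, item):
        return [hash((i, item)) % self.size for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # False means definitely absent; True may be a false positive.
        return all(self.bits[pos] for pos in self._positions(item))


cms, bloom = CountMinSketch(), BloomFilter()
for word in ["spark", "hadoop", "spark", "flink", "spark"]:
    cms.add(word)
    bloom.add(word)
print(cms.estimate("spark"), bloom.might_contain("hadoop"), bloom.might_contain("storm"))
```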
Locality-Sensitive Hashing (LSH)
Locality-Sensitive Hashing (LSH) is often described as a dimension-reduction technique for high-dimensional data.
It is particularly useful for approximate nearest neighbor search in large datasets. The goal of
LSH is to hash input items so that similar items map to the same buckets with high probability,
making it easier to find similar pairs.
Key Concepts
1. Hash Functions: Functions that map input data to fixed-size values (hash codes).
2. Buckets: Groups or bins where hashed items are placed.
3. Similarity: A measure of how alike two items are, often determined by distance metrics
like Euclidean distance, Cosine similarity, or Jaccard index.
LSH Families
LSH families are collections of hash functions with the property that similar items are more likely
to be hashed to the same bucket.
Steps in LSH
1. Choose an LSH family suited to the similarity measure (e.g., MinHash for Jaccard similarity, random hyperplanes for cosine similarity).
2. Hash every item into buckets using several hash functions, typically grouped into multiple hash tables.
3. For a query, hash it with the same functions and compare it only against the items that land in the same buckets.
Applications
● Near-duplicate detection of documents and web pages.
● Approximate nearest neighbor search in recommendation and retrieval systems.
● Large-scale clustering and data deduplication.
Advantages
● Efficiency: Reduces the need to compare all pairs of items, making the search process
faster.
● Scalability: Suitable for very large datasets due to its probabilistic nature and efficient
querying.
Challenges
● Parameter Tuning: Selecting appropriate parameters for hash functions and number of
hash tables.
● Approximation: LSH provides approximate results, which may not always be sufficient
depending on the application.
Example
Consider a dataset of documents. To find similar documents using LSH with MinHash:
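One hedged way to carry out this example is with the third-party datasketch library (an assumed dependency); its MinHash and MinHashLSH classes implement the signature and bucketing steps. The document names, shingle size, and threshold below are illustrative.

```python
from datasketch import MinHash, MinHashLSH  # assumed third-party dependency

documents = {
    "doc1": "the quick brown fox jumps over the lazy dog",
    "doc2": "the quick brown fox leaps over the lazy dog",
    "doc3": "big data mining uses locality sensitive hashing",
}

def minhash_of(text, num_perm=128):
    m = MinHash(num_perm=num_perm)
    # Feed each word 2-shingle into the signature.
    words = text.split()
    for shingle in {" ".join(words[i:i + 2]) for i in range(len(words) - 1)}:
        m.update(shingle.encode("utf8"))
    return m

# Index every document, then query with doc1 to find candidates above the threshold.
lsh = MinHashLSH(threshold=0.5, num_perm=128)
signatures = {name: minhash_of(text) for name, text in documents.items()}
for name, sig in signatures.items():
    lsh.insert(name, sig)
print(lsh.query(signatures["doc1"]))   # expected to include doc1 and likely doc2
```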
1. Hashing Basics
● Hashing: A technique to convert data of arbitrary size to fixed-size values using hash
functions.
● Hash Function: A function that takes input data (documents) and returns a fixed-size
string or number, typically called a hash code or hash value.
2. Hash Functions
● Properties:
○ Deterministic: The same input always produces the same output.
○ Uniformity: Evenly distributes hash values.
○ Efficiency: Computes hash value quickly.
○ Pre-image Resistance: Difficult to reverse-engineer the input from the hash
value.
● Common Hash Functions:
○ MD5: Produces a 128-bit hash value. Widely used but vulnerable to collisions.
○ SHA-1: Produces a 160-bit hash value. More secure than MD5 but now
considered weak against collision attacks.
○ SHA-256: Produces a 256-bit hash value. Part of the SHA-2 family, providing
stronger security.
● MinHash Algorithm:
○ Convert documents into sets (e.g., sets of words, shingles).
○ Apply multiple hash functions to each element in the set.
○ Record the minimum hash value for each hash function.
○ Compare the minhash signatures of two documents to estimate their similarity.
● Jaccard Similarity:
○ Measures similarity between finite sets.
○ $\text{Jaccard Similarity}(A, B) = \frac{|A \cap B|}{|A \cup B|}$
○ MinHash approximates the Jaccard similarity efficiently.
● Purpose: Group similar items into the same bucket with high probability.
● Procedure:
○ Choose a hash function that maps similar documents to the same buckets.
○ Use multiple hash functions to increase accuracy.
○ Compare only documents within the same bucket to find similar pairs.
● Applications:
○ Plagiarism detection.
○ Duplicate content detection in web crawlers.
○ Nearest neighbor search in high-dimensional spaces.
6. Applications of Hashing
● Web Search Engines: Use hashing to index and retrieve similar web pages.
● Data Cleaning: Remove duplicates in large datasets.
● Spam Detection: Identify spam emails by comparing hash values of content.
Distance Measures
In the context of Big Data Mining and Analytics, distance measures are crucial for various tasks
such as clustering, classification, and anomaly detection. Here's an overview of different
distance measures commonly used in this field:
1. Euclidean Distance
● Definition: The straight-line distance between two points in Euclidean space.
● Formula: $d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$
● Usage: The default choice for continuous numeric data, e.g., in k-means clustering and k-NN.
2. Manhattan Distance
● Definition: The sum of the absolute differences of the coordinates.
● Formula: $d(p, q) = \sum_{i=1}^{n} |p_i - q_i|$
● Usage: Useful for grid-like data and when large single-coordinate differences should count less than under Euclidean distance.
3. Minkowski Distance
● Definition: A generalized distance measure that includes both Euclidean and Manhattan
distances as special cases.
● Formula: $d(p, q) = \left( \sum_{i=1}^{n} |p_i - q_i|^r \right)^{1/r}$
○ For $r = 1$, it becomes the Manhattan distance.
○ For $r = 2$, it becomes the Euclidean distance.
● Usage: Flexible measure that can be tuned based on the value of $r$.
4. Cosine Similarity
● Definition: Measures the cosine of the angle between two non-zero vectors in an inner
product space.
● Formula: $\text{cosine\_similarity}(p, q) = \frac{p \cdot q}{\|p\|\,\|q\|}$, where $p \cdot q$ is the dot product and $\|p\|$, $\|q\|$ are the magnitudes of the vectors.
● Usage: Commonly used in text mining and information retrieval, particularly for
measuring document similarity.
5. Jaccard Index
● Definition: Measures similarity between finite sets as the size of the intersection divided by the size of the union.
● Formula: $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$
● Usage: Common for set-valued and binary data, e.g., comparing shingle sets of documents.
6. Hamming Distance
● Definition: Measures the number of positions at which the corresponding symbols are
different.
● Formula: For two strings of equal length, $d(p, q) = \sum_{i=1}^{n} \delta(p_i, q_i)$, where $\delta(p_i, q_i)$ is 1 if $p_i \neq q_i$ and 0 otherwise.
● Usage: Useful for categorical data and in error detection and correction in coding theory.
7. Mahalanobis Distance
● Definition: A distance measure that accounts for the correlations of the data set and is
scale-invariant.
● Formula: $d(p, q) = \sqrt{(p - q)^T S^{-1} (p - q)}$, where $S$ is the covariance matrix of the data.
● Usage: Suitable for multivariate data and when the data distribution is not spherical.
8. Chebyshev Distance
● Definition: The maximum absolute difference between the coordinates of two points.
● Formula: $d(p, q) = \max_{i} |p_i - q_i|$
● Usage: Used when the largest single-coordinate difference matters most, as in chessboard (king-move) distance.
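A hedged sketch of several of these measures using SciPy's scipy.spatial.distance module (an assumed dependency); the vectors and the small random dataset used for the covariance matrix are illustrative.

```python
import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

print(distance.euclidean(x, y))        # straight-line distance
print(distance.cityblock(x, y))        # Manhattan distance
print(distance.minkowski(x, y, p=3))   # Minkowski distance with r = 3
print(distance.cosine(x, y))           # cosine *distance*, i.e. 1 - cosine similarity
print(distance.chebyshev(x, y))        # maximum coordinate difference
print(distance.hamming(list("10110"), list("10011")))  # fraction (not count) of differing positions

# Mahalanobis distance needs the inverse covariance matrix of a dataset.
data = np.random.default_rng(0).normal(size=(100, 3))
VI = np.linalg.inv(np.cov(data, rowvar=False))
print(distance.mahalanobis(x, y, VI))
```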
Here’s a comprehensive breakdown of the key concepts, theories, and applications related to
Locality Sensitive Functions:
Components:
● Hash Functions: These are designed to ensure that similar items have a higher
probability of being mapped to the same hash value.
● Buckets: The hashed values are stored in buckets, facilitating quick retrieval of similar
items.
Applications:
● Near-duplicate detection, approximate nearest neighbor search, clustering, and recommendation, as discussed throughout this unit.
Similarity Measures: Different similarity measures are used depending on the type of data and
the application:
● Jaccard Similarity: Used for sets and measures the size of the intersection divided by
the size of the union of the sets.
● Cosine Similarity: Commonly used in text analysis, measuring the cosine of the angle
between two vectors.
● Hamming Distance: Used for binary vectors, counting the number of positions at which
the corresponding bits are different.
● Euclidean Distance: Measures the "straight line" distance between two points in
Euclidean space.
Family of Hash Functions: For a hash function $h$ to be locality sensitive with respect to a similarity measure $S$, it must satisfy the following conditions for any two points $p$ and $q$:
● If $S(p, q)$ is high, then the probability that $h(p) = h(q)$ is high.
● If $S(p, q)$ is low, then the probability that $h(p) = h(q)$ is low.
● MinHashing: A technique where multiple random permutations of the input sets are
used, and the minimum value in each permutation is taken to form the hash value.
● Random Hyperplanes: Randomly generate hyperplanes and use them to split the data.
The sign of the dot product of the vector with the hyperplane normal determines the
hash.
● P-Stable Distributions: Using projections of data points onto random vectors drawn
from a p-stable distribution (like Gaussian distribution for L2 norm).
LSH Algorithm:
1. Choose Parameters: Select the number of hash functions $k$ and the number of hash tables $L$.
2. Generate Hash Functions: Create $L$ groups of $k$ hash functions each.
3. Hash Data: For each data point, compute the hash values using the hash functions and insert the data point into the corresponding buckets in all $L$ hash tables.
4. Querying: To find similar items for a query, hash the query point and retrieve candidate
points from the corresponding buckets in the hash tables. These candidates are then
checked for actual similarity.
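A minimal sketch of this algorithm for cosine similarity using random hyperplanes (NumPy assumed): k sign bits form each table's bucket key, and L independent tables provide the OR-style amplification; all sizes below are illustrative.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(7)
dim, k, L = 32, 12, 6                     # dimensionality, bits per table, number of tables
planes = rng.normal(size=(L, k, dim))     # L groups of k random hyperplanes

def hash_keys(vector):
    # Sign of each projection gives one bit; k bits form the bucket key for each table.
    bits = planes @ vector > 0            # boolean array of shape (L, k)
    return [tuple(row) for row in bits]

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Hash every data point into all L tables.
data = rng.normal(size=(5_000, dim))
tables = [defaultdict(list) for _ in range(L)]
for idx, vec in enumerate(data):
    for table, key in zip(tables, hash_keys(vec)):
        table[key].append(idx)

# Query: gather candidates from matching buckets, then verify with the true similarity.
query = data[0] + 0.05 * rng.normal(size=dim)
candidates = set()
for table, key in zip(tables, hash_keys(query)):
    candidates.update(table[key])
if candidates:
    best = max(candidates, key=lambda i: cosine(data[i], query))
    print(len(candidates), best)          # best is expected to be index 0
else:
    print("no candidates found; decrease k or increase L")
```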
Efficiency:
● Time Complexity: LSH can reduce the query time to sublinear with respect to the
number of points in the dataset, making it scalable for large datasets.
● Space Complexity: It requires additional space for storing the hash tables, but this
trade-off is often acceptable given the efficiency gains.
Multi-Probe LSH: Instead of probing a single bucket in each hash table, this method probes
multiple buckets, increasing the recall without significantly increasing the number of hash tables.
Cross Polytope LSH: A more recent development that uses the geometry of cross polytopes
for high-dimensional nearest neighbor search, offering improved performance over traditional
LSH methods.
Scalable and Parallel LSH: Implementations that leverage parallel computing and distributed
systems to handle extremely large datasets efficiently.
Locality-Sensitive Hashing (LSH) is a technique used for approximate nearest neighbor search
in high-dimensional spaces. It aims to hash input items in such a way that similar items are
mapped to the same buckets with high probability, while dissimilar items are mapped to different
buckets.
LSH Families
LSH families are collections of hash functions that ensure the property of locality sensitivity.
There are various types of LSH families, each suited to a particular kind of data and distance metric: MinHash for Jaccard similarity, random hyperplane hashing for cosine similarity, and p-stable projections for Euclidean distance. Whichever family is chosen, an LSH index is built and queried as follows:
1. Hash Tables: Multiple hash functions from the chosen LSH family are combined to
create a hash table. Each input vector is hashed using these functions, and the result is
concatenated to form a composite hash value which is then used to index the hash
table.
2. Amplification: Combining several hash functions within each table (an AND construction) lowers the chance of false positives, while building multiple hash tables and treating an item as a candidate if it collides in at least one of them (an OR construction) lowers the chance of false negatives.
3. Query Process:
○ Given a query point, compute its hash value using the same hash functions.
○ Retrieve all points in the buckets corresponding to the query hash value.
○ Perform a final distance computation to find the actual nearest neighbors from
the retrieved points.
● Advantages:
○ Efficient for high-dimensional data.
○ Sub-linear query time complexity in terms of the number of data points.
○ Can handle different distance metrics by choosing appropriate LSH families.
● Challenges:
○ Choosing the right parameters (e.g., number of hash functions, width parameter)
is non-trivial.
○ Performance can degrade with extremely high-dimensional data due to the curse
of dimensionality.
○ May require a large number of hash tables for high accuracy, leading to
increased memory usage.
Applications of LSH
● Near-duplicate detection of documents, images, and web pages, plagiarism detection, and data deduplication.
● Approximate nearest neighbor search in recommendation systems, image retrieval, and clustering.
Similarity Measures and Related Methods
A range of similarity measures and method families support these applications:
● Cosine Similarity: Measures the cosine of the angle between two vectors. It's widely
used in text mining and recommendation systems.
● Euclidean Distance: Measures the straight-line distance between two points in
Euclidean space. It's used when data points are represented as vectors.
● Jaccard Similarity: Measures similarity between finite sample sets, often used in
document similarity and set matching.
Clustering Algorithms:
● k-Means and Hierarchical Clustering: Group data points using a chosen distance measure such as Euclidean or cosine distance.
Dimensionality Reduction:
● PCA and t-SNE: Project high-dimensional data into lower-dimensional spaces while preserving the similarity structure relevant to the analysis.
Graph-Based Methods:
● Graph Similarity: Measures how similar two graphs are based on structural or
attribute-based features.
● Graph Clustering Algorithms: Identifies communities in graphs based on similarity
measures among nodes.
Nearest Neighbor Methods:
● k-Nearest Neighbors (k-NN): Finds k data points that are closest to a query point based
on a similarity measure.
● Locality-Sensitive Hashing (LSH): Efficiently finds approximate nearest neighbors by
hashing similar items into the same buckets.
Deep Learning-Based Methods:
● Siamese Networks: Learns embeddings for pairs of data points where similarity is
measured in the embedding space.
● Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs):
Used in tasks such as image and sequence similarity, respectively.
Anomaly Detection:
● Distance-Based Methods: Detect anomalies based on the distance of data points from
their neighbors.
● Density-Based Methods: Identify anomalies as data points in sparse regions of the
dataset.
Collaborative Filtering:
● User-Based and Item-Based Filtering: Recommend items by finding users or items similar to the target, typically using similarity measures such as cosine similarity or Pearson correlation.