Project for the Cloud Computing course at the University of Pisa (MSc in Computer Engineering).
The aim of the project was to build Bloom filters over the average ratings of the movies listed in the IMDb datasets. In particular, we had to build one Bloom filter per average rating, with ratings rounded to the closest integer value.
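For context on how each per-rating filter can be sized, here is a minimal Python sketch of the standard Bloom filter formulas for `m` (bits) and `k` (hash functions) given `n` expected keys and a false positive probability `p`, together with the rating rounding; the names and example values are illustrative assumptions, not taken from the project code.

```python
import math

def bloom_parameters(n: int, p: float) -> tuple[int, int]:
    """Standard Bloom filter sizing for n expected keys and
    false positive probability p."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))  # bits in the filter
    k = max(1, round((m / n) * math.log(2)))              # number of hash functions
    return m, k

# Each title goes into the filter of its rounded average rating.
print(round(7.6))                        # -> 8, i.e. the title belongs to filter #8
print(bloom_parameters(100_000, 0.0001)) # -> (1917012, 13)
```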
The project consists of:
- an implementation of the MapReduce Bloom filter construction algorithm using the Hadoop framework
- an implementation of the MapReduce Bloom filter construction algorithm using the Spark framework
In the Hadoop implementation, we had to use the following classes:
- `org.apache.hadoop.mapreduce.lib.input.NLineInputFormat`: splits N lines of input as one split
- `org.apache.hadoop.util.hash.Hash.MURMUR_HASH`: the hash function family to use (see the sketch below)
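For illustration, the `k` bit positions of a key can be derived from a MurmurHash family by varying the seed. The sketch below uses the third-party `mmh3` Python package as a stand-in for Hadoop's `Hash.MURMUR_HASH`; the helper name and the seeding scheme are assumptions, not the project's actual code.

```python
import mmh3  # third-party MurmurHash3 bindings, stand-in for Hadoop's MURMUR_HASH

def bit_positions(key: str, k: int, m: int) -> list[int]:
    """Map a key to k positions in an m-bit filter by seeding
    MurmurHash with the hash function index (0..k-1)."""
    return [mmh3.hash(key, seed=i) % m for i in range(k)]

print(bit_positions("tt0000001", k=13, m=1_917_012))
```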
In the Spark implementation, we had to use/implement analogous classes.
To start the execution on the namenode:
```
hadoop jar hadoop-bloom-filters-1.0-SNAPSHOT.jar it.unipi.hadoop.Main data.tsv output_dir 100000 0.0001 1
```
where:
- `data.tsv` is the path of the input file on HDFS
- `output_dir` is the name of the output directory
- `100000` is the number of lines of each split
- `0.0001` is the chosen false positive probability p
- `1` is the version of the job to put into execution. In version `1` the Mapper emits an array of Bloom filters, exploiting the in-mapper combiner pattern; in version `2` the Mapper emits the bit positions (set to one) of the Bloom filter (see the sketch after this list)
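As a rough illustration of the difference between the two versions, here they are as plain Python generators (not the actual Hadoop Mapper code); `parse` is a hypothetical helper, `bit_positions` is the helper sketched above, and the record layout is assumed.

```python
def parse(line: str):
    # Hypothetical helper; assumes the IMDb title.ratings.tsv layout:
    # tconst <tab> averageRating <tab> numVotes
    tconst, avg_rating, _ = line.split("\t")
    return tconst, float(avg_rating)

# Version 1: in-mapper combining -- accumulate one filter per rounded rating
# across the whole split, then emit each (rating, filter) pair once.
def mapper_v1(lines, m, k):
    filters = {}
    for line in lines:
        title_id, rating = parse(line)
        bits = filters.setdefault(round(rating), [0] * m)
        for pos in bit_positions(title_id, k, m):
            bits[pos] = 1
    yield from filters.items()

# Version 2: emit only the positions set to one for each record;
# the Reducer ORs them into the final filter of each rating.
def mapper_v2(lines, m, k):
    for line in lines:
        title_id, rating = parse(line)
        yield round(rating), bit_positions(title_id, k, m)
```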
To start the execution with Spark:
```
spark-submit --master yarn main.py data.tsv aggregate_by_key false 0
```
where:
- `aggregate_by_key` is the type of job to put into execution (see the sketch below)
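As a minimal sketch of what an `aggregate_by_key` job can look like, the snippet below builds one Bloom filter per rounded rating with PySpark's `RDD.aggregateByKey`; the column layout, filter parameters, and helper names are assumptions for illustration, not the project's `main.py`, and one byte per bit is used for simplicity.

```python
import mmh3  # as in the earlier sketch
from pyspark import SparkContext

sc = SparkContext(appName="bloom-filter-sketch")
m, k = 1_917_012, 13  # illustrative values from the sizing sketch above

def bit_positions(key, k, m):
    return [mmh3.hash(key, seed=i) % m for i in range(k)]

def add_title(bits, title_id):
    # seqFunc: set this title's k positions in the partial filter
    for pos in bit_positions(title_id, k, m):
        bits[pos] = 1
    return bits

def merge(a, b):
    # combFunc: OR two partial filters together
    return bytearray(x | y for x, y in zip(a, b))

filters = (
    sc.textFile("data.tsv")
      .filter(lambda line: not line.startswith("tconst"))  # skip the TSV header, if present
      .map(lambda line: line.split("\t"))
      .map(lambda f: (round(float(f[1])), f[0]))  # assumed columns: tconst, averageRating, ...
      .aggregateByKey(bytearray(m), add_title, merge)
)
print(filters.keys().collect())
```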
- Biagio Cornacchia, [email protected]
- Giacomo Pacini, [email protected]
- Elisa De Filomeno, [email protected]
- Matteo Pierucci, [email protected]