Project for the Cloud Computing course at the University of Pisa (MSc in Computer Engineering).
The aim of the project was to build Bloom filters over the average ratings of the movies listed in the IMDb datasets. In particular, we had to build one Bloom filter per average rating, with ratings rounded to the closest integer value.
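For context on how each per-rating filter can be sized, here is a minimal Python sketch of the standard Bloom filter formulas for `m` (bits) and `k` (hash functions) given `n` expected keys and a false positive probability `p`, together with the rating rounding; the names and example values are illustrative assumptions, not taken from the project code.

```python
import math

def bloom_parameters(n: int, p: float) -> tuple[int, int]:
    """Standard Bloom filter sizing for n expected keys and
    false positive probability p."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))  # bits in the filter
    k = max(1, round((m / n) * math.log(2)))              # number of hash functions
    return m, k

# Each title goes into the filter of its rounded average rating.
print(round(7.6))                        # -> 8, i.e. the title belongs to filter #8
print(bloom_parameters(100_000, 0.0001)) # -> (1917012, 13)
```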
The project consists of:
- an implementation of the MapReduce Bloom filter construction algorithm using the Hadoop framework
- an implementation of the MapReduce Bloom filter construction algorithm using the Spark framework
In the Hadoop implementation, we had to use the following classes:
- `org.apache.hadoop.mapreduce.lib.input.NLineInputFormat`: splits N lines of input as one split
- `org.apache.hadoop.util.hash.Hash.MURMUR_HASH`: the hash function family to use (see the sketch below)
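For illustration, the `k` bit positions of a key can be derived from a MurmurHash family by varying the seed. The sketch below uses the third-party `mmh3` Python package as a stand-in for Hadoop's `Hash.MURMUR_HASH`; the helper name and the seeding scheme are assumptions, not the project's actual code.

```python
import mmh3  # third-party MurmurHash3 bindings, stand-in for Hadoop's MURMUR_HASH

def bit_positions(key: str, k: int, m: int) -> list[int]:
    """Map a key to k positions in an m-bit filter by seeding
    MurmurHash with the hash function index (0..k-1)."""
    return [mmh3.hash(key, seed=i) % m for i in range(k)]

print(bit_positions("tt0000001", k=13, m=1_917_012))
```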
In the Spark implementation, we had to use/implement analogous classes.
To start the execution on the namenode:
```
hadoop jar hadoop-bloom-filters-1.0-SNAPSHOT.jar it.unipi.hadoop.Main data.tsv output_dir 100000 0.0001 1
```
where:
- `data.tsv` is the path of the input file on HDFS
- `output_dir` is the name of the output directory
- `100000` is the number of lines of each split
- `0.0001` is the chosen false positive probability p
- `1` is the version of the job to put into execution. In version `1` the Mapper emits an array of Bloom filters, exploiting the in-mapper combiner pattern; in version `2` the Mapper emits the bit positions (set to one) of the Bloom filter (see the sketch after this list)
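As a rough illustration of the difference between the two versions, here they are as plain Python generators (not the actual Hadoop Mapper code); `parse` is a hypothetical helper, `bit_positions` is the helper sketched above, and the record layout is assumed.

```python
def parse(line: str):
    # Hypothetical helper; assumes the IMDb title.ratings.tsv layout:
    # tconst <tab> averageRating <tab> numVotes
    tconst, avg_rating, _ = line.split("\t")
    return tconst, float(avg_rating)

# Version 1: in-mapper combining -- accumulate one filter per rounded rating
# across the whole split, then emit each (rating, filter) pair once.
def mapper_v1(lines, m, k):
    filters = {}
    for line in lines:
        title_id, rating = parse(line)
        bits = filters.setdefault(round(rating), [0] * m)
        for pos in bit_positions(title_id, k, m):
            bits[pos] = 1
    yield from filters.items()

# Version 2: emit only the positions set to one for each record;
# the Reducer ORs them into the final filter of each rating.
def mapper_v2(lines, m, k):
    for line in lines:
        title_id, rating = parse(line)
        yield round(rating), bit_positions(title_id, k, m)
```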
To start the execution with Spark:
```
spark-submit --master yarn main.py data.tsv aggregate_by_key false 0
```
where:
- `aggregate_by_key` is the type of job to put into execution (see the sketch below)
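As a minimal sketch of what an `aggregate_by_key` job can look like, the snippet below builds one Bloom filter per rounded rating with PySpark's `RDD.aggregateByKey`; the column layout, filter parameters, and helper names are assumptions for illustration, not the project's `main.py`, and one byte per bit is used for simplicity.

```python
import mmh3  # as in the earlier sketch
from pyspark import SparkContext

sc = SparkContext(appName="bloom-filter-sketch")
m, k = 1_917_012, 13  # illustrative values from the sizing sketch above

def bit_positions(key, k, m):
    return [mmh3.hash(key, seed=i) % m for i in range(k)]

def add_title(bits, title_id):
    # seqFunc: set this title's k positions in the partial filter
    for pos in bit_positions(title_id, k, m):
        bits[pos] = 1
    return bits

def merge(a, b):
    # combFunc: OR two partial filters together
    return bytearray(x | y for x, y in zip(a, b))

filters = (
    sc.textFile("data.tsv")
      .filter(lambda line: not line.startswith("tconst"))  # skip the TSV header, if present
      .map(lambda line: line.split("\t"))
      .map(lambda f: (round(float(f[1])), f[0]))  # assumed columns: tconst, averageRating, ...
      .aggregateByKey(bytearray(m), add_title, merge)
)
print(filters.keys().collect())
```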
- Biagio Cornacchia, [email protected]
- Giacomo Pacini, [email protected]
- Elisa De Filomeno, [email protected]
- Matteo Pierucci, [email protected]