This is the final project for the 'Management and Analysis of Physics Dataset (MOD. B)' course in the 'Physics of Data' master program, University of Padua.
Group 1: Paolo Lapo Cerni, Emanuele Quaglio, Lorenzo Vigorelli, Arman Singh Bains
This project aims to adapt the well-known K-Means clustering algorithm to MapReduce-like architectures, exploiting the parallelization capabilities offered by distributed systems. K-Means consists of two stages: the initialization and the Lloyd iterations. A proper initialization is crucial to obtain good results. The state-of-the-art K-Means++ initialization yields a set of initial centroids close to the optimal one, but it is not easily parallelizable. K-Means|| has recently been proposed to overcome this issue.
Main reference: Bahmani, Bahman, et al. "Scalable k-means++." arXiv preprint arXiv:1203.6402 (2012).
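The initialization works by oversampling: starting from one random point, each round adds to the candidate set every point sampled with probability proportional to its squared distance from the current candidates. The sketch below is a hypothetical, self-contained PySpark illustration of this phase, assuming the data are stored as an RDD of NumPy arrays; the function names and parameters are ours and do not come from the repository's library.

```python
import numpy as np

# Hypothetical sketch of the K-Means|| oversampling phase (Bahmani et al., 2012);
# names and parameters are illustrative, not the repository's actual functions.

def closest_sq_dist(point, centers):
    """Squared Euclidean distance from `point` to its nearest center."""
    return min(float(np.sum((point - c) ** 2)) for c in centers)

def kmeans_parallel_init(rdd, l, n_rounds, seed=42):
    """Oversample candidate centers from an RDD of NumPy arrays.

    l        : oversampling factor (roughly l points expected per round)
    n_rounds : number of sampling rounds, O(log(initial cost)) in the paper
    """
    # 1) start from a single, uniformly sampled point
    centers = rdd.takeSample(False, 1, seed)

    for _ in range(n_rounds):
        bc = rdd.context.broadcast(centers)

        # 2) attach to every point its squared distance to the current centers
        with_cost = rdd.map(lambda p: (p, closest_sq_dist(p, bc.value))).cache()
        phi = with_cost.values().sum()  # current clustering cost

        # 3) sample each point independently with probability l * d^2(p, C) / phi
        sampled = (with_cost
                   .filter(lambda pc: np.random.random() < l * pc[1] / phi)
                   .keys()
                   .collect())
        centers.extend(sampled)

        with_cost.unpersist()
        bc.unpersist()

    # 4) the (> k) candidates are then weighted by the number of points they
    #    attract and reduced to exactly k centers with a local K-Means++ step,
    #    omitted here for brevity.
    return centers
```

The candidate set returned by this phase typically contains roughly l * n_rounds points, which are then reduced to k centers before the Lloyd iterations start.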
The main results are organized in the ParallelInitializationRdd notebook, which also includes a brief exploration of the dataset and the time-efficiency analysis.
The BenchmarkComputation folder contains the code to run the analysis. The code is split into three files, depending on whether the RDDs are persisted, persisted and then unpersisted, or not persisted at all during the computation.
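The snippet below is a generic illustration, not the repository's benchmark code, of what the three strategies mean in PySpark terms.

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

# Illustrative sketch of the three caching strategies compared in the benchmark.
spark = SparkSession.builder.appName("persist-demo").getOrCreate()
sc = spark.sparkContext

points = sc.parallelize(range(1_000_000)).map(lambda x: (x % 10, float(x)))

# (a) persist: keep the RDD in memory, so repeated actions reuse it
points.persist(StorageLevel.MEMORY_ONLY)
for _ in range(3):
    points.reduceByKey(lambda a, b: a + b).collect()

# (b) persist + unpersist: additionally free the cached blocks once done
points.unpersist()

# (c) no persist: every action recomputes the whole lineage from scratch
for _ in range(3):
    points.reduceByKey(lambda a, b: a + b).collect()
```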
The functions are divided into three files inside the internalLibrary folder. To access them directly, you can also use the instaFunctions.py file.
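For example, assuming the repository root is on the Python path, an import might look like the following; the function name is a placeholder.

```python
# Hypothetical usage: helpers can be imported from the internalLibrary package
# or directly from instaFunctions.py (the name below is a placeholder, not
# necessarily defined in the repository).
from instaFunctions import initialize_centroids  # placeholder name
```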
The Data folder contains the logs of each run, stored as a nested structure of dictionaries.
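As a hypothetical example of how such a log could be inspected, the sketch below assumes the dictionaries were serialized to JSON; the file name and keys are placeholders.

```python
import json
from pathlib import Path

# Hypothetical example of inspecting one run log, assuming the nested
# dictionaries were serialized to JSON; file name and keys are placeholders.
run_log = json.loads((Path("Data") / "example_run.json").read_text())

def walk(node, prefix=""):
    """Recursively print the leaf values (e.g. timings) of a nested dict."""
    if isinstance(node, dict):
        for key, value in node.items():
            walk(value, f"{prefix}/{key}")
    else:
        print(prefix, node)

walk(run_log)
```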
We use PySpark as the engine to distribute the analysis, exploiting CloudVeneto computational resources.
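A minimal sketch of how a SparkSession might be attached to such a setup is shown below; the master URL and resource settings are placeholders, not the actual CloudVeneto configuration.

```python
from pyspark.sql import SparkSession

# Minimal sketch of attaching to a standalone Spark cluster; the master URL
# and resource settings are placeholders, not the values used on CloudVeneto.
spark = (SparkSession.builder
         .appName("kmeans-parallel")
         .master("spark://<master-node-ip>:7077")
         .config("spark.executor.memory", "2g")
         .config("spark.executor.cores", "2")
         .getOrCreate())
sc = spark.sparkContext
```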
Our cluster has a master node and two workers. Each machine has
Especially if you have little storage available, you can run into unexpected errors. It is crucial to monitor the running processes with
ps aux | grep spark
and, if necessary, kill them via their PID. For instance:
sudo kill pid1
To test the code locally, you can use the docker-compose.yml file to simulate a cluster. The Docker image used in this setup will be pulled from a remote Docker registry.
By default, only one worker is created. To spawn N workers, run:
docker compose up --scale spark-worker=N
This Compose setup exposes the Jupyter Notebook service on port 8889, while the Spark cluster dashboard is available at localhost:8080.
To shut down the cluster, type:
docker compose down