
Apache Spark Data Structure Performance Evaluator

This educational project, carried out for the Μ111 - Big Data Management course at NKUA during Spring 2023, compares the execution times of RDD (Resilient Distributed Datasets) and DataFrame data structures in Apache Spark. Based on user input, a designated query is executed using either the RDD or the DataFrame API, reading data stored in either CSV or Parquet format.

Requirements

  • OpenJDK 8
  • Apache Hadoop 2.7.7
  • Apache Spark 2.4.4
  • Python 3.5.2

Workflow

When a query is executed, the system works as follows:

  1. Data is fetched from the Hadoop Distributed File System (HDFS) using the data_loader.py class, considering the specified file format (CSV or Parquet).
  2. The loaded data is passed to the query_executor.py class, which executes the designated query based on the user's choice of data structure (RDD or DataFrame).
  3. The system measures and records the execution time for each query, providing insights into the performance differences between RDD and DataFrame data structures.
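
A minimal sketch of this flow is shown below. The helper names, HDFS path, and column names are purely illustrative assumptions; the real logic lives in data_loader.py and query_executor.py.

    import time

    from pyspark.sql import SparkSession

    # Hypothetical sketch of the load -> execute -> time flow; paths and
    # column names are assumptions, not the project's actual code.
    spark = SparkSession.builder.appName("benchmark-sketch").getOrCreate()

    def load(path, file_format):
        """Read a dataset from HDFS in the requested format."""
        if file_format == "csv":
            return spark.read.csv(path, header=True, inferSchema=True)
        return spark.read.parquet(path)

    def run_query(df, struct):
        """Execute a sample aggregation and measure its wall-clock time."""
        start = time.time()
        if struct == "rdd":
            # RDD path: drop to the underlying RDD and aggregate manually.
            result = (df.rdd
                        .map(lambda row: (row["movie_id"], float(row["rating"])))
                        .groupByKey()
                        .mapValues(lambda ratings: sum(ratings) / len(ratings))
                        .collect())
        else:
            # DataFrame path: let Spark plan the same aggregation.
            result = df.groupBy("movie_id").avg("rating").collect()
        return result, time.time() - start

    ratings = load("hdfs:///project2023/ratings.csv", "csv")   # path assumed
    _, seconds = run_query(ratings, "dataframe")
    print("Execution time: %.2f s" % seconds)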

Dataset

The dataset can be downloaded via the Dropbox link shown in the "Usage" section below.

The dataset contains the following CSVs:

  1. movies.csv (id, name, description, release_year, duration, cost, revenue, popularity)
  2. ratings.csv (id, movie_id, rating, timestamp)
  3. movie_genres.csv (movie_id, genre)
  4. employeesR.csv (employee_id, name, department_id)
  5. departmentsR.csv (department_id, name)

The current implementation contains 5 pre-defined queries.
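
For illustration, the schema that schemas.py would hold for ratings.csv could be expressed as follows; the field names come from the list above, while the types are assumptions.

    from pyspark.sql.types import (FloatType, IntegerType, LongType,
                                   StructField, StructType)

    # Assumed types for ratings.csv (id, movie_id, rating, timestamp);
    # the authoritative definitions live in schemas.py.
    ratings_schema = StructType([
        StructField("id", IntegerType(), True),
        StructField("movie_id", IntegerType(), True),
        StructField("rating", FloatType(), True),
        StructField("timestamp", LongType(), True),
    ])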

Usage

Assuming Apache Hadoop and Spark are running properly on the target system, follow these steps:

  1. Open a terminal window and navigate to './src'.
  2. Download the dataset:
     wget -O ../datasets/project2023.tar https://www.dropbox.com/s/c10t67glk60wpha/project2023.tar.gz?dl=0
  3. Prepare HDFS:
     spark-submit benchmark.py -f hdfs_setup -data project2023
  4. Extract the dataset and store the CSVs in HDFS:
     spark-submit benchmark.py -f save_csv -data project2023
  5. Convert the CSVs to Parquet and store them in HDFS:
     spark-submit benchmark.py -f save_parquet -data project2023
  6. Run a query. The following command saves the result in the '../output' dir:
     spark-submit benchmark.py -f query -file csv -struct rdd -idx_q 1 -data project2023 -v 1 > ../output/result.txt

To print the results in the terminal instead:

spark-submit benchmark.py -f query -file csv -struct rdd -idx_q 1 -data project2023 -v 1

How to

How to define my own queries (transformations) for a new dataset?

  1. Compress your CSVs into a .tar file and store it in the '/datasets' dir.
  2. Define a schema for each CSV in schemas.py and update the schema_map dictionary.
  3. Define your transformation methods (e.g. query_6, query_7, etc.) in query_executor.py, within the parent class and its subclasses, and update the transform_map dictionary (see the sketch below).
  4. Update query_data_map.json. The key is your query index (e.g. 6, 7, ...) and the value is a list of the CSVs that the query requires.
  5. Define your custom printing function in printer.py, within the parent class and its subclasses, and update the printer_map dictionary.
  6. You're all set! Follow steps 3 to 6 in "Usage", but instead of "project2023" use the name of your dataset (.tar) file.
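
As a rough illustration of steps 2 to 4, a hypothetical query_6 (average rating per genre) and the corresponding map entries might look like this; the class layout and key names beyond those listed above are assumptions.

    # Hypothetical sketch; the real class hierarchy in query_executor.py may differ.
    def query_6(dataframes):
        """Average rating per genre, joining ratings with movie_genres."""
        ratings = dataframes["ratings"]
        genres = dataframes["movie_genres"]
        return (ratings.join(genres, ratings.movie_id == genres.movie_id)
                       .groupBy("genre")
                       .avg("rating"))

    # transform_map would then gain an entry for the new index, e.g.:
    #     transform_map = {..., 6: query_6}
    # and query_data_map.json an entry listing the CSVs the query needs:
    #     { "6": ["ratings", "movie_genres"] }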

Insights

Experimentation with DataFrames, RDDs, and the two file formats (CSV and Parquet) reveals clear patterns. For smaller datasets, the performance differences between DataFrames and RDDs, and likewise between CSV and Parquet, were negligible: their respective optimizations simply do not pay off at low data volumes.

When scaling up to larger datasets, however, as evidenced in the second query, the picture changes dramatically. Both DataFrames and, to an even greater extent, the Parquet format delivered substantial reductions in execution time. The contrast shows where DataFrames and Parquet earn their keep: not in small, everyday tasks, but in processing large datasets, where Catalyst query optimization (DataFrames) and columnar storage (Parquet) come into play. For data-intensive workloads, the measured gains make both the clear choice.

License

This project is licensed under the MIT License - see the LICENSE file for details.
