This educational project, carried out for the Μ111 - Big Data Management course at NKUA during Spring 2023, compares the execution times of RDD (Resilient Distributed Datasets) and DataFrame data structures in Apache Spark. Based on user input, a designated query is executed with either an RDD or a DataFrame, reading the data from either CSV or Parquet files.
- OpenJDK 8
- Apache Hadoop 2.7.7
- Apache Spark 2.4.4
- Python 3.5.2
When a query is executed, the system works as follows:
- Data is fetched from the Hadoop Distributed File System (HDFS) via data_loader.py, using the specified file format (CSV or Parquet).
- The loaded data is passed to query_executor.py, which runs the designated query using the user's chosen data structure (RDD or DataFrame).
- The system measures and records the execution time for each query, providing insights into the performance differences between RDD and DataFrame data structures.
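A minimal sketch of this timing comparison, assuming a simple aggregation over ratings.csv; the paths, query, and variable names below are illustrative and do not reflect the actual benchmark.py API:

```python
import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-vs-dataframe-benchmark").getOrCreate()

# Load the same data twice: once as a DataFrame, once as an RDD of Rows.
ratings_df = spark.read.csv("hdfs:///project2023/ratings.csv", header=True, inferSchema=True)
ratings_rdd = ratings_df.rdd

# DataFrame version of a simple aggregation (average rating per movie).
start = time.time()
ratings_df.groupBy("movie_id").avg("rating").collect()
df_elapsed = time.time() - start

# Equivalent RDD version using key-value aggregation.
start = time.time()
(ratings_rdd
    .map(lambda row: (row["movie_id"], (row["rating"], 1)))
    .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
    .mapValues(lambda s: s[0] / s[1])
    .collect())
rdd_elapsed = time.time() - start

print("DataFrame: {:.2f}s, RDD: {:.2f}s".format(df_elapsed, rdd_elapsed))
```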
The dataset can be downloaded via the wget command given in the Usage section below.
The dataset contains the following CSVs:
- movies.csv (id, name, description, release_year, duration, cost, revenue, popularity)
- ratings.csv (id, movie_id, rating, timestamp)
- movie_genres.csv (movie_id, genre)
- employeesR.csv (employee_id, name, department_id)
- departmentsR.csv (department_id, name)
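Each of these CSVs gets an explicit Spark schema in schemas.py (see the extension steps below). As a rough illustration only, with types assumed from the column names rather than taken from the project's actual definitions, a schema for ratings.csv could look like this:

```python
from pyspark.sql.types import StructType, StructField, IntegerType, FloatType, LongType

# Assumed schema for ratings.csv (id, movie_id, rating, timestamp);
# the actual types in schemas.py may differ.
ratings_schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("movie_id", IntegerType(), True),
    StructField("rating", FloatType(), True),
    StructField("timestamp", LongType(), True),
])
```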
The current implementation contains 5 pre-defined queries.
Assuming Apache Hadoop and Spark are running properly on the target system, follow these steps:
- Open a terminal window and navigate to './src'.
- Download the dataset:
wget -O ../datasets/project2023.tar https://www.dropbox.com/s/c10t67glk60wpha/project2023.tar.gz?dl=0
- Prepare HDFS:
spark-submit benchmark.py -f hdfs_setup -data project2023
- Extract dataset and store CSVs in HDFS:
spark-submit benchmark.py -f save_csv -data project2023
- Convert CSVs to Parquet and store in HDFS (a sketch of the underlying conversion is shown after these steps):
spark-submit benchmark.py -f save_parquet -data project2023
- Run a query. The following command saves the result in '../output' dir:
spark-submit benchmark.py -f query -file csv -struct rdd -idx_q 1 -data project2023 -v 1 > ../output/result.txt
To print the results in the terminal instead:
spark-submit benchmark.py -f query -file csv -struct rdd -idx_q 1 -data project2023 -v 1
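For context, the save_parquet step boils down to reading each CSV and rewriting it in Parquet. A minimal sketch of that conversion, with illustrative paths that do not reflect the layout managed by benchmark.py:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# Read a CSV from HDFS and rewrite it as Parquet alongside it.
df = spark.read.csv("hdfs:///project2023/movies.csv", header=True, inferSchema=True)
df.write.mode("overwrite").parquet("hdfs:///project2023/movies.parquet")
```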
- Compress your CSVs into a .tar file and store it in the '/datasets' dir.
- Define a schema for each CSV in schemas.py. Update the schema_map dictionary.
- Define your transformation methods (e.g. query_6, query_7, etc.) in query_executor.py, within the parent class and its subclasses (a sketch of this step is shown after this list). Update the transform_map dictionary.
- Update query_data_map.json. The key is your query index (e.g. 6, 7, ...) and the value is a list of the CSVs that the query requires.
- Define your custom printing function in printer.py, within the parent class and its subclasses. Update the printer_map dictionary.
- You're all set! Follow steps 3 to 6 in "Usage", but this time, instead of "project2023", use the name of your dataset (.tar) file.
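As an illustration of the transformation step, the sketch below adds a hypothetical query_6; the class layout in query_executor.py and the exact structure of transform_map are assumptions based on the steps above, not the project's actual code:

```python
# Illustrative sketch only: names and structure are assumed.

class DataFrameQueryExecutor(object):
    """Assumed subclass holding the DataFrame implementations of each query."""

    def query_6(self, movies_df):
        # Hypothetical new query: the ten most profitable movies.
        return (movies_df
                .withColumn("profit", movies_df["revenue"] - movies_df["cost"])
                .orderBy("profit", ascending=False)
                .limit(10))

# Register the new index so it can be dispatched via -idx_q 6
# (the actual mapping format in transform_map may differ).
transform_map = {
    # ... existing entries for queries 1-5 ...
    6: "query_6",
}
```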
Experiments with DataFrames, RDDs, and the two file formats (CSV and Parquet) reveal clear differences in behaviour. For smaller datasets, we observed negligible performance differences between DataFrames and RDDs, and likewise between CSV and Parquet: their respective optimizations do not pay off at low data volumes.
With larger datasets, however, as seen in the second query, the results changed markedly. DataFrames, and to an even greater extent the Parquet format, showed substantial efficiency gains. This contrast highlights where DataFrames and Parquet earn their place in Big Data workloads: not in small, day-to-day data tasks, but in processing large datasets, where their optimized designs (query optimization for DataFrames, columnar storage for Parquet) come to the fore. The reduction in execution time we measured on the larger datasets supports adopting DataFrames and Parquet for data-intensive work.
This project is licensed under the MIT License - see the LICENSE file for details.