- Dealing with large datasets and diverse data sources can be challenging when applying traditional machine learning techniques.
- Spark, a distributed processing engine that builds on the MapReduce model and adds in-memory computation, addresses these challenges in big data processing.
- This project focuses on classification and clustering in Spark MLlib using airlines data.
- Implementation includes Decision tree classifier, Random forest classifier, and K-Means clustering algorithms.
- Dataset location: s3://airlines123/airline/data.zip
- Language: Python
- Package: PySpark
- Services: Spark
- File Names:
  - DecisionTree.ipynb
  - RandomForest.ipynb
  - K_means.ipynb
- Datasets:
  - data.zip
  - Social_Network_Ads.csv
- Execute using Python script:
  - `<spark_path> spark-submit <file_path>`
  - `<spark_path>`: Path to Spark installation
  - `<file_path>`: Path to the script file
  - Example: `C:\Users\admin\Desktop\spark\bin>spark-submit C:\Users\admin\Desktop\sparkml\DecisionTree.py`
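The `spark-submit` invocation above can also be assembled from Python, which is handy when running several scripts in a row. This is only a sketch: the helper names `build_spark_submit_command` and `run_script` are hypothetical and not part of this project, and `/opt/spark/bin` is a placeholder path. On Windows the launcher is typically `spark-submit.cmd`, so the executable name may need adjusting.

```python
import os
import subprocess


def build_spark_submit_command(spark_bin_dir, script_path):
    """Return the argument list for `<spark_path> spark-submit <file_path>`."""
    # spark_bin_dir is the directory that contains the spark-submit launcher.
    executable = os.path.join(spark_bin_dir, "spark-submit")
    return [executable, script_path]


def run_script(spark_bin_dir, script_path):
    """Run the script through spark-submit and raise if it exits with an error."""
    command = build_spark_submit_command(spark_bin_dir, script_path)
    return subprocess.run(command, check=True)


if __name__ == "__main__":
    # Placeholder paths; replace with the local Spark install and script.
    print(build_spark_submit_command("/opt/spark/bin", "DecisionTree.py"))
```

Passing the command as a list rather than a single string avoids shell quoting problems when paths contain spaces.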
- Modular Code
- Create a virtual environment
- Install requirements: `pip install -r requirements.txt`
- Run code: `python DecisionTree.py`
- Check output for all visualizations
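For orientation, the decision tree classifier listed in the overview could be set up in Spark MLlib roughly as follows. This is a minimal sketch and not the code in `DecisionTree.py`: the function names, the 80/20 split, and the seed are assumptions, and it assumes every feature column is numeric and the label column holds non-negative class indices. The caller supplies the CSV path and column names, since the dataset schema is not described here.

```python
def select_feature_columns(columns, label_col, exclude=()):
    """Return every column that is neither the label nor explicitly excluded."""
    skipped = {label_col, *exclude}
    return [c for c in columns if c not in skipped]


def train_decision_tree(csv_path, label_col, exclude=(), seed=42):
    """Fit a decision tree on a CSV file and return accuracy on a held-out split."""
    # Imported lazily so the helper above works without a Spark installation.
    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import DecisionTreeClassifier
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator

    spark = SparkSession.builder.appName("DecisionTreeSketch").getOrCreate()
    # Assumption: the file has a header row; rows with missing values are dropped.
    df = spark.read.csv(csv_path, header=True, inferSchema=True).dropna()

    # MLlib estimators expect all features packed into one vector column.
    features = select_feature_columns(df.columns, label_col, exclude)
    assembler = VectorAssembler(inputCols=features, outputCol="features")
    tree = DecisionTreeClassifier(labelCol=label_col, featuresCol="features")
    pipeline = Pipeline(stages=[assembler, tree])

    # Assumed 80/20 train/test split.
    train, test = df.randomSplit([0.8, 0.2], seed=seed)
    model = pipeline.fit(train)
    predictions = model.transform(test)

    evaluator = MulticlassClassificationEvaluator(
        labelCol=label_col, predictionCol="prediction", metricName="accuracy"
    )
    accuracy = evaluator.evaluate(predictions)
    spark.stop()
    return accuracy
```

The same pipeline shape should carry over to the other two algorithms: `RandomForestClassifier` from `pyspark.ml.classification` can replace the tree stage, and `KMeans` from `pyspark.ml.clustering` can replace it for clustering, in which case no label column is needed.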