Advance your data skills by mastering Apache Spark. Using the Spark Python API, PySpark, you will leverage parallel computation with large datasets, and get ready for high-performance machine learning. From cleaning data to creating features and implementing machine learning models, you'll execute end-to-end workflows with Spark. The track ends with building a recommendation engine using the popular MovieLens dataset and the Million Songs dataset.
1. Introduction to Big Data analysis with Spark Notebook
Fundamentals of Big Data and an introduction to Spark as a distributed computing framework
- Main components: Spark Core and Spark's built-in libraries (Spark SQL, Spark MLlib, GraphX, and Spark Streaming)
- PySpark: Apache Spark's Python API to execute Spark jobs
- PySpark shell: For developing interactive applications in Python
- Spark modes: Local and cluster mode
2. Programming in PySpark RDDs Notebook
Introduction to RDDs, features of RDDs, methods of creating RDDs, and RDD operations (Transformations and Actions)
- Transformations: map(), flatMap(), filter(), union()
- Actions: collect(), take(), first(), count()
- Pair RDD Transformations: reduceByKey(), groupByKey(), sortByKey(), join()
- Pair RDD Actions: countByKey(), collectAsMap()
- Advanced RDD Actions: reduce(), saveAsTextFile()
- Project: Write code that calculates the most common words
3. PySpark SQL & DataFrames Notebook
Introduction to Spark SQL, DataFrame abstraction, creating DataFrames, DataFrame operations and visualizing Big Data through DataFrames
- DataFrame Transformations: select(), filter(), groupBy(), orderBy(), dropDuplicates(), and withColumnRenamed()
- DataFrame Actions: head(), show(), count(), columns, and describe()
- Data Visualization: hist(), distplot(), pandas_histogram(), toPandas(), HandySpark
- Project: Exploratory data analysis (EDA) on the "FIFA 2018 World Cup Player" dataset
4. Machine Learning with PySpark MLlib Notebook
Introduction to Spark MLlib, the three C's of Machine Learning (Collaborative filtering, Classification and Clustering)