These are the IPython notebooks I used for a 4-day training course on Python and Spark for data science, given in December 2015 to a Data Minded client. The audience consisted of experienced data analysts, familiar with technologies like R and SPSS, but who had never used Python and had never worked on a Hadoop cluster.
The content is mildly redacted to remove all references to the actual client, but is otherwise unchanged.
Each day consisted of working through a series of IPython notebooks. Exercises are interspersed throughout. The last notebook of each day contains solutions to that day's exercises.
The objectives of the training were to:
- Learn the fundamentals of Python
- Learn the fundamentals of its statistical and machine learning packages
- Learn Apache Spark using Python
- Learn how to apply these technologies in a live Hadoop cluster
Before the start of the course, we required the following software to be installed on students' laptops:
- Anaconda 2.4.1 64-bit for Windows. The packages in this version of Anaconda included:
- Python 2.7.11
- IPython 4.0.1
- NumPy 1.9.3
- SciPy 0.16.0
- Matplotlib 1.5.0
- Pandas 0.17.1
- Seaborn 0.6.0
- Scikit-learn 0.17
- Apache Spark 1.2.0. This version was chosen to match the client's production cluster, even though the latest release at the time of the course was 1.5.2.
- JDK 7u79.
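As a quick sanity check of such a setup, a snippet along the following lines can confirm that the packages are importable and report their versions (the version numbers to expect are the ones listed above):

```python
# Quick environment check: import the required packages and print their versions.
import sys
import numpy, scipy, matplotlib, pandas, seaborn, sklearn

print(sys.version)  # should report Python 2.7.x
for module in (numpy, scipy, matplotlib, pandas, seaborn, sklearn):
    print("%s %s" % (module.__name__, module.__version__))
```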
The four days covered the following content.
This day was optional and intended for people with very limited programming experience and/or no Python experience.
At the end of this day, the students were able to:
- Start and run Python programs interactively with the Python CLI
- Use an IDE to write programs and execute them, including command line arguments
- Create notebooks locally and on a server
- Import libraries
- Store data in variables and understand their scope
- Know the standard operators
- Control the flow of a program
- Perform common string operations such as concatenation, substring extraction, and replacement
- Use the appropriate data structures
- Use functions to structure a program
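To give a flavour of the level aimed for, a toy script like the one below touches on most of these fundamentals: variables, control flow, string operations, a dictionary as data structure, and a function. Names and data are made up for illustration.

```python
# A toy script covering the day-0 fundamentals: variables, control flow,
# string operations, a dictionary, and a function.

def label_scores(scores, threshold=10):
    """Return a list of 'name: high/low' strings for a dict of name -> score."""
    labels = []
    for name, score in scores.items():
        if score >= threshold:
            labels.append(name + ": high")
        else:
            labels.append(name + ": low")
    return labels

scores = {"alice": 12, "bob": 7}
print(", ".join(label_scores(scores)))        # e.g. "alice: high, bob: low"
print("hello world".replace("world", "python").upper())
```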
On Day 1, we discussed several of the powerful statistical and machine learning libraries in Python. It was purposely a very hands-on introduction, and we did not dive into the mathematics behind any of the algorithms.
At the end of this day, the students were able to:
- Import and export data in CSV format
- Use NumPy/SciPy to perform mathematical computations
- Slice and dice data
- Use pandas to wrangle data
- Plot data and perform exploratory analysis
- Use scikit-learn
- Perform regression analysis in Python
- Perform classification analysis in Python
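The snippet below sketches the kind of end-to-end workflow these outcomes add up to: reading a CSV with pandas, a bit of slicing, and a simple scikit-learn regression. The file and column names are hypothetical, chosen only for illustration.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.cross_validation import train_test_split  # module path as of scikit-learn 0.17

# Load a (made-up) CSV file and explore it.
df = pd.read_csv("measurements.csv")
print(df.describe())
print(df[df["temperature"] > 20].head())  # slice rows with a boolean filter

# Fit a simple linear regression on two assumed columns.
X = df[["temperature"]].values
y = df["yield"].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print(model.score(X_test, y_test))  # R^2 on the held-out set
```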
On the second day, we dove into Spark, focusing on the essential parts. After a brief introduction to Spark Core, we explored Spark SQL and Spark MLlib.
At the end of this day, the students were able to:
- Understand the role of Spark and PySpark in the ecosystem
- Run Spark locally from a shell
- Run Spark locally in IPython notebooks
- Do a word count on an input file
- Load data in Spark SQL
- Query data in Spark SQL
- Use Spark MLlib to perform regression and classification analyses at scale
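As an illustration of these outcomes, here is a minimal sketch against the Spark 1.2-era RDD and SQLContext APIs; the input file and table contents are made up.

```python
from operator import add
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext("local", "day2-demo")

# Word count on a (made-up) local text file.
counts = (sc.textFile("input.txt")
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(add))
print(counts.take(10))

# Spark SQL, 1.2 style: infer a schema from an RDD of Rows, register it as a
# temporary table, and query it with SQL.
sqlContext = SQLContext(sc)
people = sc.parallelize([Row(name="Alice", age=34), Row(name="Bob", age=45)])
schema_people = sqlContext.inferSchema(people)
schema_people.registerTempTable("people")
print(sqlContext.sql("SELECT name FROM people WHERE age > 40").collect())
```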
In this last day, we set up a small Cloudera Hadoop cluster on AWS and explored how everything we had learned could be run in a cluster environment. The second half of the day was set aside for an open-ended project. Possible projects included:
- setting up a machine learning pipeline on data from the UCI Machine Learning Repository;
- implementing a machine learning algorithm using Spark Core;
- testing to what extent Spark running times scale linearly with data size.
At the end of this day, the students were able to:
- Run Python scripts on the cluster from a shell and from IPython notebooks
- Use Spark to read from and write to HDFS
- Use Spark SQL to read data from and write data to Hive
- Understand how YARN works
- Submit Spark jobs to the cluster
- Use Spark, Spark SQL and Spark MLlib to run algorithms on large-scale data
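To give an idea of what this looks like in practice, here is a minimal, illustrative script that reads from and writes to HDFS; the paths and the submit command are placeholders, not the client's actual setup.

```python
# hdfs_roundtrip.py -- submitted to the cluster with something along the lines of:
#   spark-submit --master yarn-client hdfs_roundtrip.py
# (paths below are placeholders)
from pyspark import SparkContext

sc = SparkContext(appName="hdfs-roundtrip")

# Read a text file from HDFS, drop empty lines, and write the result back to HDFS.
lines = sc.textFile("hdfs:///user/training/input/sample.txt")
non_empty = lines.filter(lambda line: line.strip() != "")
non_empty.saveAsTextFile("hdfs:///user/training/output/sample_clean")

sc.stop()
```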