Python Data Science

Description

A collection of data science scripts for data analysis in Python.

Python libraries used:

NumPy
scikit-learn
Pandas
Seaborn
Matplotlib

Installation

To install all of the libraries, run the commands in the "install.txt" file. These are:

sudo apt-get install python-pip
sudo pip install numpy scipy
sudo pip install pandas
sudo apt-get install python-matplotlib
sudo pip install -U scikit-learn
sudo pip install tabulate

Files

helpers.py: Helper functions adapted from the Python Machine Learning repository.
explore_wine_data.py: Exploratory data analysis of the wine dataset from sklearn using visualisations: histogram, scatterplot, bee swarm plot, and cumulative distribution function.
statistics_iris.py: Computes statistics of the iris dataset features: histogram, min, max, median, mean, and variance.

Information

Exploratory Data Analysis

Histogram: A histogram is a graphical method of displaying the distribution of a single quantitative variable. The values of the variable run along the x axis and the frequency of those values is shown on the y axis. The distinguishing feature of a histogram is that the data is grouped into "bins", which are intervals on the x axis; the height of each bar is the number of data points that fall in that bin.
Scatterplot: A scatter plot is a graphical method of displaying the relationship between two feature variables. Each feature variable is assigned an axis, and each data point in the dataset is plotted at the position given by its feature values.
Beeswarm Plot: A beeswarm plot is a two-dimensional visualisation technique where data points are plotted relative to a fixed reference axis so that no two data points overlap. It is useful when we wish to see not only the measured value for each data point, but also the distribution of these values.
Cumulative Distribution Function: The cumulative distribution function (CDF), evaluated at x, is the probability that a variable takes a value less than or equal to x. For a dataset, the empirical CDF gives the fraction of data points whose feature value is less than or equal to x.
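
The binning behind a histogram and the "less than or equal to x" rule behind a cumulative distribution function can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the code from explore_wine_data.py; the function names histogram_counts and ecdf and the sample values are made up for this example.

```python
from bisect import bisect_right


def histogram_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal, so bins would have zero width")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value is placed in the last bin rather than a new one.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return counts, edges


def ecdf(values):
    """Return a function F where F(x) is the fraction of values <= x."""
    ordered = sorted(values)
    n = len(ordered)
    return lambda x: bisect_right(ordered, x) / n


# Hypothetical measurements, invented for illustration only.
sample = [1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]

counts, edges = histogram_counts(sample, n_bins=4)
F = ecdf(sample)

print(counts)   # bar heights of a 4-bin histogram
print(F(3.0))   # fraction of the sample that is <= 3.0
```

Here counts holds the bar heights a plotting library would draw, and F(3.0) answers the question "what fraction of the data has a value of at most 3.0?".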

Statistics

Mean and Median: Both give a "center" value for a particular feature variable. The mean is the arithmetic average and uses every value; the median is the middle value of the sorted data and is much more robust to outliers, which can pull the mean far away from the majority of the values.
Variance and Standard Deviation: Useful for seeing how much a feature variable varies across all examples in a dataset, i.e. whether most of the values for this feature are similar or very different. The standard deviation is the square root of the variance, so it is expressed in the same units as the feature.
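
A small sketch of these statistics, using Python's built-in statistics module rather than the code in statistics_iris.py. The values are made up for illustration, and the population variance (dividing by n) is an assumption here; statistics.variance would give the sample variance (dividing by n - 1) instead.

```python
import statistics

# Hypothetical feature values, invented for illustration only.
values = [10, 11, 12, 13, 14]
with_outlier = values + [100]

# Without the outlier, mean and median agree.
mean_before = statistics.mean(values)
median_before = statistics.median(values)

# A single extreme value drags the mean upward, while the median barely moves.
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

# Population variance and standard deviation of the original values.
pvar = statistics.pvariance(values)
pstd = statistics.pstdev(values)

print(mean_before, median_before)
print(mean_after, median_after)
print(pvar, pstd)
```

Adding the single value 100 moves the mean from 12 to roughly 26.7, while the median only moves from 12 to 12.5, which is the robustness to outliers described above.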

GeorgeSeif/Data-Science-Python
