Commit 9fd5e60

Added Pandas DS info
1 parent 6add207 commit 9fd5e60


README.md

Lines changed: 35 additions & 3 deletions
@@ -29,20 +29,52 @@ To install all of the libraries, run the commands in the "install.txt" file. The
## Information

### Visualisations
- **Histogram:** A histogram is a graphical method of displaying quantitative data. A histogram displays a single quantitative variable along the x axis and the frequency of that variable on the y axis. The distinguishing feature of a histogram is that data is grouped into "bins", which are intervals on the x axis.
- **Scatterplot:** A scatter plot is a graphical method of displaying the relationship between data points. Each feature variable is assigned an axis. Each data point in the dataset is then plotted based on its feature values.
- **Beeswarm Plot:** A beeswarm plot is a two-dimensional visualisation technique where data points are plotted relative to a fixed reference axis so that no two data points overlap. The beeswarm plot is a useful technique when we wish to see not only the measured values of interest for each data point, but also the distribution of these values.
- **Cumulative Distribution Function:** The cumulative distribution function (CDF) is the probability that a variable takes a value less than or equal to x. For example, we may wish to see what percentage of the data has a certain feature variable that is less than or equal to x.
- **Bar Plots:** Classical bar plots are good for visualising and comparing different data statistics, especially statistics of feature variables.
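
The sketch below is purely an illustration of how these plot types could be produced with pandas, matplotlib and seaborn; the DataFrame `df` and its column names are made up for the example.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical example data: two numeric features and one category.
df = pd.DataFrame({
    "feature_a": np.random.normal(0, 1, 200),
    "feature_b": np.random.normal(5, 2, 200),
    "category": np.random.choice(["group_1", "group_2"], 200),
})

df["feature_a"].plot.hist(bins=20)                    # histogram: values grouped into bins
df.plot.scatter(x="feature_a", y="feature_b")         # scatterplot: one axis per feature

plt.figure()
sns.swarmplot(data=df, x="category", y="feature_a")   # beeswarm: non-overlapping points

# Empirical CDF: fraction of the data less than or equal to each value.
sorted_vals = np.sort(df["feature_a"])
plt.figure()
plt.plot(sorted_vals, np.arange(1, len(sorted_vals) + 1) / len(sorted_vals))

# Bar plot comparing a statistic of a feature across groups.
df.groupby("category")["feature_b"].mean().plot.bar()
plt.show()
```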

### Statistics
- **Mean and Median:** Both of these show a type of "average" or "center" value for a particular feature variable. The mean is the more literal and precise center; however, the median is much more robust to outliers, which may pull the mean far away from the majority of the values.
- **Variance and Standard Deviation:** Useful for seeing to what degree the feature variable of a dataset varies across all examples, i.e. are most of the values for this particular feature variable similar across the dataset, or are they all very different.
- **Kurtosis:** Measures the "sharpness" of a distribution. If a distribution has a high kurtosis value (> 3), its data is highly concentrated around the same value, with heavy tails; a normal distribution has a kurtosis of 3; if K < 3, the values of the distribution are more spread out.
- **Skewness:** Measures the asymmetry of a distribution. Positive skewness means the values are concentrated on the left (lower end) with a longer tail to the right; negative skewness means the values are concentrated on the right (higher end) with a longer tail to the left.
- **Covariance Matrix:** The covariance of two variables measures how "correlated" they are. If the two variables have a positive covariance, then when one variable increases so does the other; with a negative covariance the values of the feature variables change in opposite directions. The magnitude of the covariance reflects how strongly the features are linearly related: a large magnitude means the two features move together very consistently, while a value near zero means the linear relationship is weak.
- **PCA Dimensionality Reduction:** Principal Component Analysis (PCA) is a technique commonly used for dimensionality reduction. PCA computes the directions (principal components) along which the data has the highest variance. Since these directions capture the most variance, they also hold most of the information that the data represents. Therefore we can project the data onto them, reducing the dimensionality of the data, which makes analysis easier and clearer.
- **Data Shuffling:** Shuffling the data prior to applying a machine learning algorithm generally improves performance, since it removes any bias introduced by the ordering of the examples.
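
A minimal sketch of how these statistics might be computed with pandas, NumPy and scikit-learn; the DataFrame `df` below is random placeholder data.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical all-numeric dataset.
df = pd.DataFrame(np.random.randn(100, 4), columns=["f1", "f2", "f3", "f4"])

print(df.mean(), df.median())   # "center" of each feature
print(df.var(), df.std())       # spread of each feature
print(df.kurtosis())            # note: pandas reports excess kurtosis (normal -> 0, not 3)
print(df.skew())                # asymmetry of each feature
print(df.cov())                 # covariance matrix of the features

# PCA: project the data onto the 2 directions of highest variance.
reduced = PCA(n_components=2).fit_transform(df.values)
print(reduced.shape)            # (100, 2)

# Shuffle the rows before handing the data to a learning algorithm.
shuffled = df.sample(frac=1).reset_index(drop=True)
```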


### Pandas Data Science

#### Basic Dataset Information
**Read in a CSV dataset:** `pd.read_csv("csv_file")` (the older `pd.DataFrame.from_csv("csv_file")` is deprecated)
**Read in an Excel dataset:** `pd.read_excel("excel_file")`
**Basic dataset feature info:** `df.info()`
**Basic dataset statistics:** `print(df.describe())`
**Print dataframe in a table:** `print(tabulate(print_table, headers=headers))` where "print_table" is a list of lists and "headers" is a list of the string headers (requires `from tabulate import tabulate`)

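Roughly, these calls fit together as in the sketch below; `"data.csv"` is a placeholder file name and `tabulate` must be installed separately.

```python
import pandas as pd
from tabulate import tabulate

df = pd.read_csv("data.csv")   # "data.csv" is a hypothetical file

df.info()                      # column names, dtypes and non-null counts
print(df.describe())           # count, mean, std, min, quartiles, max per numeric column

# Pretty-print the first rows: a list of lists plus a list of header strings.
print(tabulate(df.head().values.tolist(), headers=list(df.columns)))
```
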
#### Basic Data Handling
**Drop missing data:** `df.dropna(axis=0, how='any')` Returns the dataframe with rows omitted where any (or, with how='all', all) of the values are missing
**Replace missing data:** `df.replace(to_replace=None, value=None)` Replaces the values given in "to_replace" with "value"
**Check for NaNs:** `pd.isnull(object)` Detects missing values (NaN in numeric arrays, None/NaN in object arrays)
**Drop a feature:** `df.drop('feature_variable_name', axis=1)` axis is either 0 for rows or 1 for columns
**Convert object type to float:** `pd.to_numeric(df["feature_name"], errors='coerce')` Converts object types to numeric so that computations can be performed on them
**Convert DF to numpy array:** `df.to_numpy()` (the older `df.as_matrix()` is deprecated)
**Get first "n" rows:** `df.head([n])`
**Get a column by feature name:** `df.loc[:, "feature_name"]` or simply `df["feature_name"]`

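A minimal cleaning pass using these calls, on a small made-up DataFrame (the column names are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical dataset: a numeric column stored as strings, plus some missing values.
df = pd.DataFrame({
    "age": ["23", "31", "unknown", "45"],
    "score": [0.5, np.nan, 0.7, 0.9],
    "notes": ["a", "b", "c", "d"],
})

print(pd.isnull(df))                                    # True wherever a value is missing

df["age"] = pd.to_numeric(df["age"], errors="coerce")   # non-numeric strings become NaN
df = df.drop("notes", axis=1)                           # drop a feature column
df = df.replace(to_replace=np.nan, value=0)             # replace missing values with 0
df = df.dropna(axis=0, how="any")                       # drop any rows still missing data

print(df.head(2))                                       # first n rows
print(df.loc[:, "age"])                                 # select a column by feature name
arr = df.to_numpy()                                      # DataFrame -> NumPy array
```
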
#### Basic Plotting
**Area plot:** `df.plot.area([x, y])`
**Vertical bar plot:** `df.plot.bar([x, y])`
**Horizontal bar plot:** `df.plot.barh([x, y])`
**Boxplot:** `df.plot.box([by])`
**Histogram:** `df.plot.hist([by, bins])`
**Line plot:** `df.plot.line([x, y])`
**Pie chart:** `df.plot.pie([y])`

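Each of these returns a matplotlib Axes; a small illustrative example with made-up data:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical monthly figures.
df = pd.DataFrame({
    "month": [1, 2, 3, 4],
    "sales": [10, 14, 9, 20],
    "returns": [1, 2, 1, 3],
})

df.plot.area(x="month", y=["sales", "returns"])   # stacked area plot
df.plot.bar(x="month", y="sales")                 # vertical bar plot
df.plot.barh(x="month", y="sales")                # horizontal bar plot
df[["sales", "returns"]].plot.box()               # boxplot of each column
df["sales"].plot.hist(bins=5)                     # histogram
df.plot.line(x="month", y="sales")                # line plot
df.plot.pie(y="sales")                            # pie chart (one wedge per row)
plt.show()
```
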
### Examples
![alt text](https://github.com/GeorgeSeif/Data-Science-Python/blob/master/Images/explore_wine_scattermatrix.png)
