
Commit 311799f

Covariance boston
1 parent 75f6a28 commit 311799f


3 files changed: +53 -0 lines changed


README.md

Lines changed: 2 additions & 0 deletions
@@ -24,6 +24,7 @@ To install all of the libraries, run the commands in the "install.txt" file. The
- **helpers.py:** Helper functions, adapted from the [Python Machine Learning](https://github.com/GeorgeSeif/Python-Machine-Learning) repository
- **explore_wine_data.py:** Exploratory data analysis of the wine dataset from sklearn using visualisations. Includes data analysis using histogram, scatterplot, bee swarm plot, and cumulative distribution function.
- **statistics_iris.py:** Compute various statistics of the iris dataset features such as histogram, min, max, median, mean, and variance.
+- **covariance_boston.py:** Compute the covariance matrix of the Boston Housing dataset. A covariance matrix can sometimes give faster insight into which variables are related than creating scatter plots can.

## Information

@@ -36,6 +37,7 @@ To install all of the libraries, run the commands in the "install.txt" file. The
#### Statistics
- **Mean and Median:** Both of these show a type of "average" or "center" value for a particular feature variable. The mean is the more literal and precise center; however, the median is much more robust to outliers, which can pull the mean far away from the majority of the values.
- **Variance and Standard Deviation:** Useful for seeing to what degree a feature variable varies across all examples, i.e. are most of the values for this particular feature similar across the dataset, or are they all very different?
+- **Covariance Matrix:** The covariance of two variables measures how "correlated" they are. If two variables have a positive covariance, then when one variable increases the other tends to increase as well; with a negative covariance the two variables tend to change in opposite directions. The magnitude of the covariance indicates how strongly the features move together: a large magnitude (relative to the scale of the features) means the two variables track each other closely, while a value near zero means there is little linear relationship. A short numpy sketch of this idea follows the diff below.


#### Examples
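
The covariance description added above is easiest to see with numbers. Below is a minimal numpy sketch; the two toy arrays are made up purely for illustration and are not taken from the Boston data.

import numpy as np

# Two made-up feature columns, used only to illustrate the sign of covariance.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_up = np.array([2.0, 4.1, 5.9, 8.2, 10.0])   # rises with x
y_down = np.array([9.8, 8.1, 6.0, 3.9, 2.2])  # falls as x rises

print(np.cov(x, y_up)[0, 1])    # positive covariance
print(np.cov(x, y_down)[0, 1])  # negative covariance

In covariance_boston.py, np.cov(train_data.T) applies the same computation to every pair of feature columns at once; because the Boston Housing data has 13 features, the result is a 13 x 13 matrix.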

covariance_boston.py

Lines changed: 51 additions & 0 deletions
@@ -0,0 +1,51 @@
import pandas as pd
import numpy as np
from scipy import stats
import seaborn as sns
from tabulate import tabulate
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston

import helpers

# NOTE that load_boston() returns a dictionary-like Bunch object
boston_data = load_boston()

train_data = np.array(boston_data.data)
train_labels = np.array(boston_data.target)

num_features = boston_data.data.shape[1]
unique_labels = np.unique(train_labels)
num_classes = len(unique_labels)


print("The boston dataset has " + str(num_features) + " features")
print(boston_data.feature_names)



# Put everything into a Pandas DataFrame
data = pd.DataFrame(data=np.c_[train_data], columns=boston_data.feature_names)
# print(tabulate(data, headers='keys', tablefmt='psql'))



# Compute the covariance matrix (np.cov expects variables as rows, hence the transpose)
cov_mat_boston = np.cov(train_data.T)
print("Covariance matrix")
print(cov_mat_boston)



# Normalize the data and then recompute the covariance matrix
normalized_train_data = helpers.normalize_data(train_data)
normalized_cov_mat_boston = np.cov(normalized_train_data.T)
print("Normalized data covariance matrix")
print(normalized_cov_mat_boston)



# Create a scatterplot matrix of the features, coloured by the CRIM (crime rate) column
fig = sns.pairplot(data=data, hue='CRIM')

plt.show()
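
The script depends on helpers.normalize_data, which appears in this commit only as the compiled helpers.pyc, so its exact behaviour is not visible in the diff. Below is a minimal sketch of a min-max normalization helper, assuming that is roughly what the repository's helper does; the real implementation may differ.

import numpy as np

def normalize_data(data):
    # Scale each feature column into the [0, 1] range (min-max normalization).
    data = np.asarray(data, dtype=float)
    col_min = data.min(axis=0)
    col_range = data.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # guard against constant columns
    return (data - col_min) / col_range

Normalizing before recomputing the covariance matrix puts every feature on a comparable scale, so the magnitudes in the second matrix can be compared across feature pairs more directly.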

helpers.pyc

3.1 KB
Binary file not shown.
