
Commit 311799f

Covariance boston
1 parent 75f6a28 commit 311799f


3 files changed: +53 -0 lines changed


README.md

Lines changed: 2 additions & 0 deletions
@@ -24,6 +24,7 @@ To install all of the libraries, run the commands in the "install.txt" file. The
- **helpers.py:** Helper functions, adapted from the [Python Machine Learning](https://github.com/GeorgeSeif/Python-Machine-Learning) repository
- **explore_wine_data.py:** Exploratory data analysis of the wine dataset from sklearn using visualisations. Includes data analysis using histogram, scatterplot, bee swarm plot, and cumulative distribution function.
- **statistics_iris.py:** Compute various statistics of the iris dataset features such as histogram, min, max, median, mean, and variance.
+- **covariance_boston.py:** Compute the covariance matrix of the Boston Housing dataset. A covariance matrix can sometimes give faster insight into which variables are related than creating scatter plots can.

## Information

@@ -36,6 +37,7 @@ To install all of the libraries, run the commands in the "install.txt" file. The
#### Statistics
- **Mean and Median:** Both of these show a type of "average" or "center" value for a particular feature variable. The mean is the more literal and precise center; however, the median is much more robust to outliers, which can pull the mean far away from the majority of the values.
- **Variance and Standard Deviation:** Useful for seeing to what degree a feature variable varies across all examples, i.e. are most of the values for this particular feature similar across the dataset, or are they all very different?
+- **Covariance Matrix:** The covariance of two variables measures how "correlated" they are. If two variables have a positive covariance, then when one variable increases the other tends to increase as well; with a negative covariance the two variables tend to change in opposite directions. The magnitude of the covariance indicates how strongly the features move together: a large magnitude (relative to the scale of the features) means the two variables track each other closely, while a value near zero means there is little linear relationship. A short numpy sketch of this idea follows the diff below.


#### Examples
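
The covariance description added above is easiest to see with numbers. Below is a minimal numpy sketch; the two toy arrays are made up purely for illustration and are not taken from the Boston data.

import numpy as np

# Two made-up feature columns, used only to illustrate the sign of covariance.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_up = np.array([2.0, 4.1, 5.9, 8.2, 10.0])   # rises with x
y_down = np.array([9.8, 8.1, 6.0, 3.9, 2.2])  # falls as x rises

print(np.cov(x, y_up)[0, 1])    # positive covariance
print(np.cov(x, y_down)[0, 1])  # negative covariance

In covariance_boston.py, np.cov(train_data.T) applies the same computation to every pair of feature columns at once; because the Boston Housing data has 13 features, the result is a 13 x 13 matrix.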

covariance_boston.py

Lines changed: 51 additions & 0 deletions
@@ -0,0 +1,51 @@
import pandas as pd
import numpy as np
from scipy import stats
import seaborn as sns
from tabulate import tabulate
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston

import helpers

# NOTE that load_boston() returns a dictionary-like Bunch object
boston_data = load_boston()

train_data = np.array(boston_data.data)
train_labels = np.array(boston_data.target)

num_features = boston_data.data.shape[1]
unique_labels = np.unique(train_labels)
num_classes = len(unique_labels)


print("The boston dataset has " + str(num_features) + " features")
print(boston_data.feature_names)



# Put everything into a Pandas DataFrame
data = pd.DataFrame(data=np.c_[train_data], columns=boston_data.feature_names)
# print(tabulate(data, headers='keys', tablefmt='psql'))



# Compute the covariance matrix (np.cov expects variables as rows, hence the transpose)
cov_mat_boston = np.cov(train_data.T)
print("Covariance matrix")
print(cov_mat_boston)



# Normalize the data and then recompute the covariance matrix
normalized_train_data = helpers.normalize_data(train_data)
normalized_cov_mat_boston = np.cov(normalized_train_data.T)
print("Normalized data covariance matrix")
print(normalized_cov_mat_boston)



# Create a scatterplot matrix of the features, coloured by the CRIM (crime rate) column
fig = sns.pairplot(data=data, hue='CRIM')

plt.show()
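
The script depends on helpers.normalize_data, which appears in this commit only as the compiled helpers.pyc, so its exact behaviour is not visible in the diff. Below is a minimal sketch of a min-max normalization helper, assuming that is roughly what the repository's helper does; the real implementation may differ.

import numpy as np

def normalize_data(data):
    # Scale each feature column into the [0, 1] range (min-max normalization).
    data = np.asarray(data, dtype=float)
    col_min = data.min(axis=0)
    col_range = data.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # guard against constant columns
    return (data - col_min) / col_range

Normalizing before recomputing the covariance matrix puts every feature on a comparable scale, so the magnitudes in the second matrix can be compared across feature pairs more directly.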

helpers.pyc

3.1 KB
Binary file not shown.
