
Commit efd7efd

Added ML algorithms
1 parent 311799f commit efd7efd

12 files changed: 1241 additions, 103 deletions

README.md

Lines changed: 8 additions & 1 deletion
@@ -5,6 +5,7 @@ A collection of data science scripts for data analysis in Python.
 
 Python libraries used:
 - Numpy
+- Scipy
 - Scikit Learn
 - Pandas
 - Seaborn
@@ -25,10 +26,14 @@ To install all of the libraries, run the commands in the "install.txt" file. The
 - **explore_wine_data.py:** Exploratory data analysis of the wine dataset from sklearn using visualisations: histogram, scatterplot, bee swarm plot, and cumulative distribution function.
 - **statistics_iris.py:** Computes various statistics of the iris dataset features, such as histogram, min, max, median, mean, and variance.
 - **covariance_boston.py:** Computes the covariance matrix of the Boston Housing dataset. Such a matrix can often give faster insight into which variables are related than creating a scatter plot for every pair of features.
+- **linear_regression.py:** Linear regression on the Boston Housing dataset. Includes data shuffling and normalization, with both a from-scratch implementation and a Scikit Learn version.
+- **logistic_regression.py:** Logistic regression on the wine dataset. Includes data shuffling and normalization, with both a from-scratch implementation and a Scikit Learn version.
+- **pca_logistic_regression.py:** Logistic regression with Principal Component Analysis (PCA) for dimensionality reduction on the wine dataset. Includes data shuffling and normalization, with both a from-scratch implementation and a Scikit Learn version.
+- **kmeans.py, kmediods.py, k_nearnest_neighbor.py, mean_shift.py, dbscan.py:** Different clustering methods applied to the iris dataset. Each includes data shuffling and normalization, with both a from-scratch implementation and a Scikit Learn version (a minimal sketch of this shared pattern follows the list).
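As an aside for this record (not part of the committed README), a minimal sketch of the shuffle-normalise-fit pattern the scripts above share, here using Scikit Learn's logistic regression on the wine dataset; all variable names are illustrative only:

```python
import numpy as np
from sklearn import datasets, preprocessing, linear_model

# Load the wine dataset and shuffle the samples
wine = datasets.load_wine()
shuffle_order = np.random.permutation(len(wine.target))
data, labels = wine.data[shuffle_order], wine.target[shuffle_order]

# Normalize the features, then fit and score the classifier
data = preprocessing.scale(data)
clf = linear_model.LogisticRegression(max_iter=1000)
clf.fit(data, labels)
print("Training accuracy =", clf.score(data, labels))
```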
 
 ## Information
 
-#### Exploratory Data Analysis
+#### Visualisations
 - **Histogram:** A histogram is a graphical method of displaying quantitative data. A histogram displays a single quantitative variable along the x axis and the frequency of that variable on the y axis. The distinguishing feature of a histogram is that the data is grouped into "bins", which are intervals on the x axis. (A short plotting sketch follows this list.)
 - **Scatterplot:** A scatter plot is a graphical method of displaying the relationship between two feature variables. Each feature variable is assigned an axis, and each data point in the dataset is then plotted based on its feature values.
 - **Beeswarm Plot:** A beeswarm plot is a two-dimensional visualisation technique where data points are plotted relative to a fixed reference axis so that no two data points overlap. The beeswarm plot is useful when we wish to see not only the measured values of interest for each data point, but also the distribution of these values.
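A minimal sketch (added here only for illustration, not part of the commit) of how these three plot types might be produced with matplotlib and seaborn on the iris dataset; the column choices are arbitrary:

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets

iris = datasets.load_iris()
sepal_length = iris.data[:, 0]
sepal_width = iris.data[:, 1]

# Histogram: bin a single quantitative variable and count values per bin
plt.hist(sepal_length, bins=20)
plt.show()

# Scatterplot: one feature per axis, one marker per data point
plt.scatter(sepal_length, sepal_width)
plt.show()

# Beeswarm plot: non-overlapping points, grouped by class along the reference axis
sns.swarmplot(x=iris.target, y=sepal_length)
plt.show()
```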
@@ -38,6 +43,8 @@ To install all of the libraries, run the commands in the "install.txt" file. The
 - **Mean and Median:** Both of these show a type of "average" or "center" value for a particular feature variable. The mean is the more literal and precise center; however, the median is much more robust to outliers, which may pull the mean far away from the majority of the values.
 - **Variance and Standard Deviation:** Useful for seeing to what degree a feature variable of the dataset varies across all examples, i.e. whether most of the values for this particular feature variable are similar across the dataset or very different.
 - **Covariance Matrix:** The covariance of two variables measures how "correlated" they are. If the two variables have a positive covariance, then when one variable increases so does the other; with a negative covariance the values of the feature variables change in opposite directions. The magnitude of the covariance indicates how strongly the features vary together: a high magnitude means that when one feature variable changes, the other tends to change with it, while a low magnitude means the linear relationship is weak. (See the sketch after this list.)
+- **PCA Dimensionality Reduction:** Principal Component Analysis (PCA) is a technique commonly used for dimensionality reduction. PCA computes the directions (principal components) along which the data has the highest variance. Since these directions capture the most variance, they also hold most of the information that the data represents. We can therefore project the data onto them, reducing its dimensionality and making analysis easier and clearer.
+- **Data Shuffling:** Shuffling the data prior to applying a machine learning algorithm removes any ordering bias in the dataset and generally improves performance, particularly when splitting the data into training and test sets or training with stochastic updates.
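As a rough illustration only (not in the commit), a short sketch of the last three items using NumPy and Scikit Learn on the iris dataset:

```python
import numpy as np
from sklearn import datasets, decomposition

iris = datasets.load_iris()
data = iris.data

# Covariance matrix: rows are samples, so set rowvar=False
cov_matrix = np.cov(data, rowvar=False)

# PCA: project the data onto the directions of highest variance
pca = decomposition.PCA(n_components=2)
reduced_data = pca.fit_transform(data)

# Data shuffling: randomly permute the sample order before training
shuffled_data = data[np.random.permutation(len(data))]
```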
#### Examples

covariance_boston.py

Lines changed: 1 addition & 1 deletion
@@ -6,7 +6,7 @@
 import matplotlib.pyplot as plt
 from sklearn.datasets import load_boston
 
-import helpers as helpers
+import ml_helpers as helpers
 
 # NOTE that this loads as a dictionary
 boston_data = load_boston()

dbscan.py

Lines changed: 150 additions & 0 deletions
@@ -0,0 +1,150 @@
import numpy as np
from sklearn import datasets, decomposition, cluster
import ml_helpers


class DBSCAN():
    def __init__(self, radius=1, min_samples=5):
        self.radius = radius
        self.min_samples = min_samples
        # List of arrays (clusters) containing sample indices
        self.clusters = []
        self.visited_samples = []
        # Hashmap {"sample_index": [neighbor1, neighbor2, ...]}
        self.neighbors = {}
        self.data = None  # Dataset

    # Return a list of neighboring samples.
    # A sample_2 is considered a neighbor of sample_1 if the distance between
    # them is smaller than radius
    def _get_neighbors(self, sample_i):
        neighbors = []
        for _sample_i, _sample in enumerate(self.data):
            if _sample_i != sample_i and ml_helpers.euclidean_distance(self.data[sample_i], _sample) < self.radius:
                neighbors.append(_sample_i)
        return np.array(neighbors)

    # Recursive method which expands the cluster until we have reached the border
    # of the dense area (density determined by radius and min_samples)
    def _expand_cluster(self, sample_i, neighbors):
        cluster = [sample_i]
        # Iterate through neighbors
        for neighbor_i in neighbors:
            if not neighbor_i in self.visited_samples:
                self.visited_samples.append(neighbor_i)
                # Fetch the sample's distant neighbors
                self.neighbors[neighbor_i] = self._get_neighbors(neighbor_i)
                # Make sure the neighbor's neighbors are more than min_samples
                if len(self.neighbors[neighbor_i]) >= self.min_samples:
                    # Choose neighbors of neighbor except for sample
                    distant_neighbors = self.neighbors[neighbor_i][
                        np.where(self.neighbors[neighbor_i] != sample_i)]
                    # Add the neighbor's neighbors as neighbors of sample
                    self.neighbors[sample_i] = np.concatenate(
                        (self.neighbors[sample_i], distant_neighbors))
                    # Expand the cluster from the neighbor
                    expanded_cluster = self._expand_cluster(
                        neighbor_i, self.neighbors[neighbor_i])
                    # Add expanded cluster to this cluster
                    cluster = cluster + expanded_cluster
            if not neighbor_i in np.array(self.clusters):
                cluster.append(neighbor_i)
        return cluster

    # Return the sample labels as the index of the cluster in which they are
    # contained
    def _get_cluster_labels(self):
        # Set default value to number of clusters
        # Will make sure all outliers have same cluster label
        labels = len(self.clusters) * np.ones(np.shape(self.data)[0])
        for cluster_i, cluster in enumerate(self.clusters):
            for sample_i in cluster:
                labels[sample_i] = cluster_i
        return labels

    # DBSCAN
    def predict(self, data):
        self.data = data
        n_samples = np.shape(self.data)[0]
        # Iterate through samples and expand clusters from them
        # if they have more neighbors than self.min_samples
        for sample_i in range(n_samples):
            if sample_i in self.visited_samples:
                continue
            self.visited_samples.append(sample_i)
            self.neighbors[sample_i] = self._get_neighbors(sample_i)
            if len(self.neighbors[sample_i]) >= self.min_samples:
                # Sample has more neighbors than self.min_samples => expand
                # cluster from sample
                new_cluster = self._expand_cluster(
                    sample_i, self.neighbors[sample_i])
                # Add cluster to list of clusters
                self.clusters.append(new_cluster)

        # Get the resulting cluster labels
        cluster_labels = self._get_cluster_labels()
        return cluster_labels


# Get the training data
# Import the Iris flower dataset
iris = datasets.load_iris()
train_data = np.array(iris.data)
train_labels = np.array(iris.target)
num_features = train_data.shape[1]

# Apply PCA to the data to reduce its dimensionality
pca = decomposition.PCA(n_components=3)
pca.fit(train_data)
train_data = pca.transform(train_data)


# *********************************************
# Apply DBSCAN Clustering MANUALLY
# *********************************************
# Create the DBSCAN Clustering Object
unique_labels = np.unique(train_labels)
num_classes = len(unique_labels)
clf = DBSCAN(radius=1, min_samples=5)

predicted_labels = clf.predict(train_data)


# Compute the training accuracy
Accuracy = 0
for index in range(len(train_labels)):
    # Compare the predicted cluster label with the ground-truth class label
    current_label = train_labels[index]
    predicted_label = predicted_labels[index]

    if current_label == predicted_label:
        Accuracy += 1

Accuracy /= len(train_labels)

# Print the result
print("Manual DBSCAN Classification Accuracy = ", Accuracy)

# *********************************************
# Apply DBSCAN Clustering using Sklearn
# *********************************************
# Create the DBSCAN Clustering Object
unique_labels = np.unique(train_labels)
num_classes = len(unique_labels)
clf = cluster.DBSCAN(eps=1, min_samples=5)

predicted_labels = clf.fit_predict(train_data)


# Compute the training accuracy
Accuracy = 0
for index in range(len(train_labels)):
    # Compare the predicted cluster label with the ground-truth class label
    current_label = train_labels[index]
    predicted_label = predicted_labels[index]

    if current_label == predicted_label:
        Accuracy += 1

Accuracy /= len(train_labels)

# Print the result
print("Sklearn DBSCAN Classification Accuracy = ", Accuracy)

helpers.py

Lines changed: 0 additions & 101 deletions
This file was deleted.
