
Commit efd7efd

Added ML algorithms
1 parent 311799f commit efd7efd

12 files changed: 1241 additions, 103 deletions

README.md

Lines changed: 8 additions & 1 deletion
@@ -5,6 +5,7 @@ A collection of data science scripts for data analysis in Python.
 
 Python libraries used:
 - Numpy
+- Scipy
 - Scikit Learn
 - Pandas
 - Seaborn
@@ -25,10 +26,14 @@ To install all of the libraries, run the commands in the "install.txt" file. The
 - **explore_wine_data.py:** Exploratory data analysis of the wine dataset from sklearn using visualisations: histogram, scatterplot, bee swarm plot, and cumulative distribution function.
 - **statistics_iris.py:** Computes various statistics of the iris dataset features, such as histogram, min, max, median, mean, and variance.
 - **covariance_boston.py:** Computes the covariance matrix of the Boston Housing dataset. Such a matrix can often give faster insight into which variables are related than creating a scatter plot for every pair of features.
+- **linear_regression.py:** Linear regression on the Boston Housing dataset. Includes data shuffling and normalization, with both a from-scratch implementation and a Scikit Learn version.
+- **logistic_regression.py:** Logistic regression on the wine dataset. Includes data shuffling and normalization, with both a from-scratch implementation and a Scikit Learn version.
+- **pca_logistic_regression.py:** Logistic regression with Principal Component Analysis (PCA) for dimensionality reduction on the wine dataset. Includes data shuffling and normalization, with both a from-scratch implementation and a Scikit Learn version.
+- **kmeans.py, kmediods.py, k_nearnest_neighbor.py, mean_shift.py, dbscan.py:** Different clustering methods applied to the iris dataset. Each includes data shuffling and normalization, with both a from-scratch implementation and a Scikit Learn version (a minimal sketch of this shared pattern follows the list).
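As an aside for this record (not part of the committed README), a minimal sketch of the shuffle-normalise-fit pattern the scripts above share, here using Scikit Learn's logistic regression on the wine dataset; all variable names are illustrative only:

```python
import numpy as np
from sklearn import datasets, preprocessing, linear_model

# Load the wine dataset and shuffle the samples
wine = datasets.load_wine()
shuffle_order = np.random.permutation(len(wine.target))
data, labels = wine.data[shuffle_order], wine.target[shuffle_order]

# Normalize the features, then fit and score the classifier
data = preprocessing.scale(data)
clf = linear_model.LogisticRegression(max_iter=1000)
clf.fit(data, labels)
print("Training accuracy =", clf.score(data, labels))
```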
 
 ## Information
 
-#### Exploratory Data Analysis
+#### Visualisations
 - **Histogram:** A histogram is a graphical method of displaying quantitative data. A histogram displays a single quantitative variable along the x axis and the frequency of that variable on the y axis. The distinguishing feature of a histogram is that the data is grouped into "bins", which are intervals on the x axis. (A short plotting sketch follows this list.)
 - **Scatterplot:** A scatter plot is a graphical method of displaying the relationship between two feature variables. Each feature variable is assigned an axis, and each data point in the dataset is then plotted based on its feature values.
 - **Beeswarm Plot:** A beeswarm plot is a two-dimensional visualisation technique where data points are plotted relative to a fixed reference axis so that no two data points overlap. The beeswarm plot is useful when we wish to see not only the measured values of interest for each data point, but also the distribution of these values.
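A minimal sketch (added here only for illustration, not part of the commit) of how these three plot types might be produced with matplotlib and seaborn on the iris dataset; the column choices are arbitrary:

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets

iris = datasets.load_iris()
sepal_length = iris.data[:, 0]
sepal_width = iris.data[:, 1]

# Histogram: bin a single quantitative variable and count values per bin
plt.hist(sepal_length, bins=20)
plt.show()

# Scatterplot: one feature per axis, one marker per data point
plt.scatter(sepal_length, sepal_width)
plt.show()

# Beeswarm plot: non-overlapping points, grouped by class along the reference axis
sns.swarmplot(x=iris.target, y=sepal_length)
plt.show()
```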
@@ -38,6 +43,8 @@ To install all of the libraries, run the commands in the "install.txt" file. The
 - **Mean and Median:** Both of these show a type of "average" or "center" value for a particular feature variable. The mean is the more literal and precise center; however, the median is much more robust to outliers, which may pull the mean far away from the majority of the values.
 - **Variance and Standard Deviation:** Useful for seeing to what degree a feature variable of the dataset varies across all examples, i.e. whether most of the values for this particular feature variable are similar across the dataset or very different.
 - **Covariance Matrix:** The covariance of two variables measures how "correlated" they are. If the two variables have a positive covariance, then when one variable increases so does the other; with a negative covariance the values of the feature variables change in opposite directions. The magnitude of the covariance indicates how strongly the features vary together: a high magnitude means that when one feature variable changes, the other tends to change with it, while a low magnitude means the linear relationship is weak. (See the sketch after this list.)
+- **PCA Dimensionality Reduction:** Principal Component Analysis (PCA) is a technique commonly used for dimensionality reduction. PCA computes the directions (principal components) along which the data has the highest variance. Since these directions capture the most variance, they also hold most of the information that the data represents. We can therefore project the data onto them, reducing its dimensionality and making analysis easier and clearer.
+- **Data Shuffling:** Shuffling the data prior to applying a machine learning algorithm removes any ordering bias in the dataset and generally improves performance, particularly when splitting the data into training and test sets or training with stochastic updates.
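As a rough illustration only (not in the commit), a short sketch of the last three items using NumPy and Scikit Learn on the iris dataset:

```python
import numpy as np
from sklearn import datasets, decomposition

iris = datasets.load_iris()
data = iris.data

# Covariance matrix: rows are samples, so set rowvar=False
cov_matrix = np.cov(data, rowvar=False)

# PCA: project the data onto the directions of highest variance
pca = decomposition.PCA(n_components=2)
reduced_data = pca.fit_transform(data)

# Data shuffling: randomly permute the sample order before training
shuffled_data = data[np.random.permutation(len(data))]
```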
#### Examples

covariance_boston.py

Lines changed: 1 addition & 1 deletion
@@ -6,7 +6,7 @@
 import matplotlib.pyplot as plt
 from sklearn.datasets import load_boston
 
-import helpers as helpers
+import ml_helpers as helpers
 
 # NOTE that this loads as a dictionary
 boston_data = load_boston()

dbscan.py

Lines changed: 150 additions & 0 deletions
@@ -0,0 +1,150 @@
import numpy as np
from sklearn import datasets, decomposition, cluster
import ml_helpers


class DBSCAN():
    def __init__(self, radius=1, min_samples=5):
        self.radius = radius
        self.min_samples = min_samples
        # List of arrays (clusters) containing sample indices
        self.clusters = []
        self.visited_samples = []
        # Hashmap {"sample_index": [neighbor1, neighbor2, ...]}
        self.neighbors = {}
        self.data = None  # Dataset

    # Return a list of neighboring samples.
    # A sample_2 is considered a neighbor of sample_1 if the distance between
    # them is smaller than radius
    def _get_neighbors(self, sample_i):
        neighbors = []
        for _sample_i, _sample in enumerate(self.data):
            if _sample_i != sample_i and ml_helpers.euclidean_distance(self.data[sample_i], _sample) < self.radius:
                neighbors.append(_sample_i)
        return np.array(neighbors)

    # Recursive method which expands the cluster until we have reached the border
    # of the dense area (density determined by radius and min_samples)
    def _expand_cluster(self, sample_i, neighbors):
        cluster = [sample_i]
        # Iterate through neighbors
        for neighbor_i in neighbors:
            if not neighbor_i in self.visited_samples:
                self.visited_samples.append(neighbor_i)
                # Fetch the sample's distant neighbors
                self.neighbors[neighbor_i] = self._get_neighbors(neighbor_i)
                # Make sure the neighbor's neighbors are more than min_samples
                if len(self.neighbors[neighbor_i]) >= self.min_samples:
                    # Choose neighbors of neighbor except for sample
                    distant_neighbors = self.neighbors[neighbor_i][
                        np.where(self.neighbors[neighbor_i] != sample_i)]
                    # Add the neighbor's neighbors as neighbors of sample
                    self.neighbors[sample_i] = np.concatenate(
                        (self.neighbors[sample_i], distant_neighbors))
                    # Expand the cluster from the neighbor
                    expanded_cluster = self._expand_cluster(
                        neighbor_i, self.neighbors[neighbor_i])
                    # Add expanded cluster to this cluster
                    cluster = cluster + expanded_cluster
            if not neighbor_i in np.array(self.clusters):
                cluster.append(neighbor_i)
        return cluster

    # Return the sample labels as the index of the cluster in which they are
    # contained
    def _get_cluster_labels(self):
        # Set default value to number of clusters
        # Will make sure all outliers have same cluster label
        labels = len(self.clusters) * np.ones(np.shape(self.data)[0])
        for cluster_i, cluster in enumerate(self.clusters):
            for sample_i in cluster:
                labels[sample_i] = cluster_i
        return labels

    # DBSCAN
    def predict(self, data):
        self.data = data
        n_samples = np.shape(self.data)[0]
        # Iterate through samples and expand clusters from them
        # if they have more neighbors than self.min_samples
        for sample_i in range(n_samples):
            if sample_i in self.visited_samples:
                continue
            self.visited_samples.append(sample_i)
            self.neighbors[sample_i] = self._get_neighbors(sample_i)
            if len(self.neighbors[sample_i]) >= self.min_samples:
                # Sample has more neighbors than self.min_samples => expand
                # cluster from sample
                new_cluster = self._expand_cluster(
                    sample_i, self.neighbors[sample_i])
                # Add cluster to list of clusters
                self.clusters.append(new_cluster)

        # Get the resulting cluster labels
        cluster_labels = self._get_cluster_labels()
        return cluster_labels


# Get the training data
# Import the Iris flower dataset
iris = datasets.load_iris()
train_data = np.array(iris.data)
train_labels = np.array(iris.target)
num_features = train_data.shape[1]

# Apply PCA to the data to reduce its dimensionality
pca = decomposition.PCA(n_components=3)
pca.fit(train_data)
train_data = pca.transform(train_data)


# *********************************************
# Apply DBSCAN Clustering MANUALLY
# *********************************************
# Create the DBSCAN Clustering Object
unique_labels = np.unique(train_labels)
num_classes = len(unique_labels)
clf = DBSCAN(radius=1, min_samples=5)

predicted_labels = clf.predict(train_data)


# Compute the training accuracy
Accuracy = 0
for index in range(len(train_labels)):
    # Compare the predicted cluster label with the ground-truth class label
    current_label = train_labels[index]
    predicted_label = predicted_labels[index]

    if current_label == predicted_label:
        Accuracy += 1

Accuracy /= len(train_labels)

# Print the result
print("Manual DBSCAN Classification Accuracy = ", Accuracy)

# *********************************************
# Apply DBSCAN Clustering using Sklearn
# *********************************************
# Create the DBSCAN Clustering Object
unique_labels = np.unique(train_labels)
num_classes = len(unique_labels)
clf = cluster.DBSCAN(eps=1, min_samples=5)

predicted_labels = clf.fit_predict(train_data)


# Compute the training accuracy
Accuracy = 0
for index in range(len(train_labels)):
    # Compare the predicted cluster label with the ground-truth class label
    current_label = train_labels[index]
    predicted_label = predicted_labels[index]

    if current_label == predicted_label:
        Accuracy += 1

Accuracy /= len(train_labels)

# Print the result
print("Sklearn DBSCAN Classification Accuracy = ", Accuracy)

helpers.py

Lines changed: 0 additions & 101 deletions
This file was deleted.
