Commit a6dc955

Added statistics helpers
1 parent 03975cc commit a6dc955

2 files changed: +62 -1 lines changed


README.md

Lines changed: 2 additions & 1 deletion
@@ -23,7 +23,8 @@ To install all of the libraries, run the commands in the "install.txt" file. The
 
 ## Files
 - **ml_helpers.py:** Machine Learning helper functions. Adapted from my [Python Machine Learning](https://github.com/GeorgeSeif/Python-Machine-Learning) repository
-- **plt_helpers:** Helper functions to make plotting easy in Matplotlib.
+- **plt_helpers.py:** Helper functions to make plotting easy in Matplotlib.
+- **statistics_helpers.py:** Helper functions for computing dataset statistics
 - **explore_wine_data.py:** Exploratory data analysis of the wine dataset from sklearn using visualisations. Includes data analysis using histogram, scatterplot, bee swarm plot, and cumulative distribution function.
 - **statistics_iris.py:** Compute various statistics of the iris dataset features such as histogram, min, max, median, mean, and variance.
 - **covariance_boston.py:** Compute the covariance matrix of the Boston Housing dataset. These matrices can sometimes give faster insight into which variables are related rather than creating scatter plots.
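
The README entry for covariance_boston.py says a covariance matrix can show which variables are related without drawing scatter plots. As a rough illustration of that idea, here is a minimal pure-Python sketch. The `covariance_matrix` function and the three toy columns are assumptions made up for this example; they are not taken from the repository or from the Boston Housing dataset.

```python
import math


def mean(x):
    return sum(x) / len(x)


def covariance(x, y):
    """Sample covariance of two equal-length lists (divides by n - 1)."""
    x_bar, y_bar = mean(x), mean(y)
    products = [(x_i - x_bar) * (y_i - y_bar) for x_i, y_i in zip(x, y)]
    return sum(products) / (len(x) - 1)


def covariance_matrix(columns):
    """Hypothetical helper: entry [i][j] is the covariance of column i and column j."""
    return [[covariance(a, b) for b in columns] for a in columns]


# Made-up toy columns, for illustration only.
rooms = [4.0, 5.0, 6.0, 7.0]
price = [10.0, 14.0, 18.0, 22.0]
age = [30.0, 20.0, 25.0, 5.0]

matrix = covariance_matrix([rooms, price, age])
for row in matrix:
    print(["{:8.3f}".format(value) for value in row])
```

In the printed matrix, a positive off-diagonal entry (rooms against price) means the two columns tend to rise together, a negative one (rooms against age) means one tends to fall as the other rises, and each diagonal entry is that column's variance.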

statistics_helpers.py

Lines changed: 60 additions & 0 deletions
@@ -0,0 +1,60 @@
+import math
+from collections import Counter  # required by mode()
+
+def mean(x):
+    return sum(x) / len(x)
+
+def median(v):
+    """finds the 'middle-most' value of v"""
+    n = len(v)
+    sorted_v = sorted(v)
+    midpoint = n // 2
+
+    if n % 2 == 1:
+        # if odd, return the middle value
+        return sorted_v[midpoint]
+    else:
+        # if even, return the average of the middle values
+        lo = midpoint - 1
+        hi = midpoint
+        return (sorted_v[lo] + sorted_v[hi]) / 2
+
+def quantile(x, p):
+    """returns the value at fraction p (0 <= p <= 1) of the sorted data"""
+    p_index = min(int(p * len(x)), len(x) - 1)  # clamp so p = 1.0 stays in range
+    return sorted(x)[p_index]
+
+def mode(x):
+    """returns a list, might be more than one mode"""
+    counts = Counter(x)
+    max_count = max(counts.values())
+    return [x_i for x_i, count in counts.items()  # items() replaces Python 2's iteritems()
+            if count == max_count]
+
+def data_range(x):
+    return max(x) - min(x)
+
+def variance(x):
+    """assumes x has at least two elements"""
+    n = len(x)
+    deviations = de_mean(x)  # de_mean and sum_of_squares are not defined in this file
+    return sum_of_squares(deviations) / (n - 1)
+
+def standard_deviation(x):
+    return math.sqrt(variance(x))
+
+def interquartile_range(x):
+    return quantile(x, 0.75) - quantile(x, 0.25)
+
+
+def covariance(x, y):
+    n = len(x)
+    return dot(de_mean(x), de_mean(y)) / (n - 1)  # dot is not defined in this file
+
+def correlation(x, y):
+    stdev_x = standard_deviation(x)
+    stdev_y = standard_deviation(y)
+    if stdev_x > 0 and stdev_y > 0:
+        return covariance(x, y) / stdev_x / stdev_y
+    else:
+        return 0  # if no variation, correlation is zero
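
Note that `variance` and `covariance` in statistics_helpers.py call `de_mean`, `sum_of_squares`, and `dot`, none of which are defined or imported anywhere in this diff. A minimal sketch of what they might look like, assuming the conventional definitions (subtract the mean from every element, sum the squared elements, sum the pairwise products). The three helper bodies below are an assumption for illustration, not the author's code:

```python
import math


def mean(x):
    return sum(x) / len(x)


def de_mean(x):
    """Assumed helper: shift x so that its mean is zero."""
    x_bar = mean(x)
    return [x_i - x_bar for x_i in x]


def sum_of_squares(v):
    """Assumed helper: v_1 * v_1 + ... + v_n * v_n."""
    return sum(v_i * v_i for v_i in v)


def dot(v, w):
    """Assumed helper: v_1 * w_1 + ... + v_n * w_n."""
    return sum(v_i * w_i for v_i, w_i in zip(v, w))


def variance(x):
    """Sample variance as in the diff; assumes x has at least two elements."""
    return sum_of_squares(de_mean(x)) / (len(x) - 1)


def covariance(x, y):
    """Sample covariance as in the diff; assumes x and y have equal length."""
    return dot(de_mean(x), de_mean(y)) / (len(x) - 1)


print(variance([1, 2, 3, 4]))
print(covariance([1, 2, 3, 4], [2, 4, 6, 8]))
```

With definitions along these lines placed in (or imported into) statistics_helpers.py, `variance`, `standard_deviation`, `covariance`, and `correlation` would no longer raise `NameError` when called.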
