Pandas is a very popular tool for data analysis. It comes with many useful features built in, it's battle-tested and widely accepted. However, pandas is not always the best tool for the job.
SQL databases have been around since the 1970s. Some of the smartest people in the world have worked on making it easy to slice, dice, fetch and manipulate data quickly and efficiently. SQL databases have come such a long way that many developers and data scientists have lost track of what they can do with the database they already have!
In this article I demonstrate how to use SQL to perform fast and efficient data analysis.
SQL vs Pandas Performance
Imagine a simple table with 1M users, each with a username and an indication of whether the user was activated or not. A simple data analysis task would be to answer how many activated and inactivated users there are.
Create the table and populate with random data:
CREATE TABLE users (
id SERIAL PRIMARY KEY,
username TEXT NOT NULL,
activated BOOLEAN NOT NULL
);
INSERT INTO users (
username,
activated
)
SELECT
md5(random()::text)::text AS username,
random() > 0.9 AS activated
FROM
generate_series(1, 1000000);
Set up a Python virtual environment and install the dependencies:
$ python -m venv venv
$ source venv/bin/activate
$ pip install psycopg2 pandas memory-profiler
To produce benchmark results, create a script with the following pattern:
# bench.py
from memory_profiler import profile

@profile
def run():
    # TODO: Replace with code to benchmark
    print('do work!')

if __name__ == "__main__":
    run()
Execute your script from the command line and view the line-by-line results:
$ python bench.py
do work!
Filename: bench.py
Line # Mem usage Increment Occurences Line Contents
============================================================
3 38.9 MiB 38.9 MiB 1 @profile
4 def run():
5 # TODO: Replace with code to benchmark
6 38.9 MiB 0.0 MiB 1 print('do work!')
You can find more details about this method in this article or in the memory-profiler
documentation.
Let's start with a naive approach using pandas:
import psycopg2
import pandas as pd

connection = psycopg2.connect(dbname='db')

with connection.cursor() as cursor:
    cursor.execute('SELECT * FROM users')
    df = pd.DataFrame(
        cursor.fetchall(),
        columns=['id', 'username', 'activated'],
    )

result = df.groupby(by='activated').count()
print(result)
The script uses psycopg2 to create a connection to the database. It then fetches data from the users table into a pandas dataframe, and calls groupby
to get the counts for activated and inactivated users.
This script produces the following output:
id username
activated
False 900029 900029
True 99971 99971
We got an answer to our question, but at what cost?
Pandas True Cost
Let's execute this function again, but this time look at the memory usage:
(venv) $ python test_pandas_naive.py
id username
activated
False 900029 900029
True 99971 99971
Filename: test_pandas_naive.py
Line # Mem usage Increment Occurences Line Contents
============================================================
3 38.8 MiB 38.8 MiB 1 @profile
4 def run():
5 41.2 MiB 2.3 MiB 1 import psycopg2
6 78.2 MiB 37.1 MiB 1 import pandas as pd
7
8 78.6 MiB 0.4 MiB 1 connection = psycopg2.connect(dbname='db')
9 78.6 MiB 0.0 MiB 1 with connection.cursor() as cursor:
10 179.9 MiB 101.3 MiB 1 cursor.execute('SELECT * FROM users')
11 386.0 MiB 12.7 MiB 2 df = pd.DataFrame(
12 373.4 MiB 193.5 MiB 1 cursor.fetchall(),
13 373.4 MiB 0.0 MiB 1 columns=['id', 'username', 'activated'],
14 )
15
16 386.0 MiB 0.0 MiB 1 result = df.groupby(by='activated').count()
17 386.0 MiB 0.0 MiB 1 print(result)
To view the memory usage of the program we use the package memory-profiler. We used this technique in the past to find the fastest way to load data into PostgreSQL using Python.
The output shows the overall memory usage at each line of the program, as well as the additional memory added by each line in the "Increment" column. The output for this program reveals some interesting findings:
- Pandas alone consumes ~37MB of memory: just importing pandas, before even doing anything with it, consumes a significant amount of memory. For comparison, importing psycopg2 only adds 2.3MB of memory to the program.
- Fetching the data into memory consumed an additional ~300MB: when we fetched the data into memory, and then into a pandas dataframe, the program occupied an additional ~300MB. For reference, the size of the table in the database is only 65MB.
If we ignore the 38MB consumed by the profiler itself, the program consumed 347MB of memory, and executing this script without the profiler took 1.101s to complete.
Removing Unnecessary Data
Our quick analysis showed that fetching the data consumed the most memory. To optimize that, we can try to fetch less data. For example, we don't really use the username column, so we can avoid fetching it from the database:
(venv) $ python test_pandas.py
id
activated
False 900029
True 99971
Filename: test_pandas.py
# Mem usage Increment Occurences Line Contents
========================================================
3 38.7 MiB 38.7 MiB 1 @profile
4 def run():
5 41.0 MiB 2.3 MiB 1 import psycopg2
6 78.3 MiB 37.2 MiB 1 import pandas as pd
7
8 78.6 MiB 0.3 MiB 1 connection = psycopg2.connect(dbname='db')
9 78.6 MiB 0.0 MiB 1 with connection.cursor() as cursor:
10 132.1 MiB 53.6 MiB 1 cursor.execute('SELECT id, activated FROM users')
11 142.1 MiB -90.7 MiB 2 df = pd.DataFrame(
12 232.8 MiB 100.7 MiB 1 cursor.fetchall(),
13 232.8 MiB 0.0 MiB 1 columns=['id', 'activated'],
14 )
15
16 142.1 MiB 0.0 MiB 1 result = df.groupby(by='activated').count()
17 142.1 MiB 0.0 MiB 1 print(result)
By explicitly providing a list of columns to the query and fetching only what we need, the program now consumes only 232MB, or 193MB without the overhead of the profiler. This is an improvement from the previous attempt which consumed 347MB of memory.
Executing the script without the profiler took 0.839s compared to the previous program which took 1.1s.
Aggregating in the Database
Most of the memory in the program is still consumed by the data being fetched. What if, instead of fetching the data and aggregating it with pandas, we aggregate the data in the database and create a pandas dataframe from the results?
$ python test_db.py
activated cnt
0 False 900029
1 True 99971
Filename: test_db.py
# Mem usage Increment Occurences Line Contents
========================================================
3 38.6 MiB 38.6 MiB 1 @profile
4 def run():
5 41.0 MiB 2.3 MiB 1 import psycopg2
6 78.0 MiB 37.0 MiB 1 import pandas as pd
7
8 78.3 MiB 0.4 MiB 1 connection = psycopg2.connect(dbname='db')
9 78.3 MiB 0.0 MiB 1 with connection.cursor() as cursor:
10 78.3 MiB 0.0 MiB 1 cursor.execute('''
11 SELECT activated, count(*) AS cnt
12 FROM users
13 GROUP BY activated
14 ''')
15 78.3 MiB 0.0 MiB 2 result = pd.DataFrame(
16 78.3 MiB 0.0 MiB 1 cursor.fetchall(),
17 78.3 MiB 0.0 MiB 1 columns=['activated', 'cnt'],
18 )
19 79.3 MiB 1.0 MiB 1 print(result)
This is a big leap compared to the previous attempt. Doing the processing in the database and fetching only the aggregated results consumed just 79MB of memory, or ~40MB if we remove the overhead of the profiler.
Executing the script without the profiler took 0.380s, which is twice as fast as the previous program which took 0.839s.
Removing Pandas
At this point the only significant memory hog is pandas itself. Just for fun and reference, let's see what the program consumes without pandas:
$ python test_db_plain.py
([(False, 900029), (True, 99971)],)
Filename: test_db_plain.py
# Mem usage Increment Occurences Line Contents
========================================================
3 38.9 MiB 38.9 MiB 1 @profile
4 def run():
5 41.2 MiB 2.4 MiB 1 import psycopg2
6
7 41.7 MiB 0.5 MiB 1 connection = psycopg2.connect(dbname='db')
8 41.7 MiB 0.0 MiB 1 with connection.cursor() as cursor:
9 41.7 MiB 0.0 MiB 1 cursor.execute('''
10 SELECT activated, count(*) AS cnt
11 FROM users
12 GROUP BY activated
13 ''')
14 41.7 MiB 0.0 MiB 1 result = cursor.fetchall(),
15
16 41.7 MiB 0.0 MiB 1 print(result)
After removing pandas and keeping the results as a python list of tuples, the program consumes 41MB of memory, or just 2.8MB if we ignore the profiler overhead. This is a huge difference from where we started!
The timing is also much lower. Without the profiler this program completes in just 0.114s. That's 70% less than the previous attempt using pandas, and overall 90% faster than the first program.
Results Summary
This is a summary of the results:
| Program | Peak Memory | Memory Diff (vs. previous) | Runtime | Runtime Diff (vs. previous) |
|---|---|---|---|---|
| Pandas with entire table | 347 MB | | 1.101s | |
| Pandas with only necessary data | 193 MB | -44% | 0.839s | -23% |
| Pandas with aggregation in database | 40 MB | -80% | 0.380s | -54% |
| No Pandas, aggregation in database | 2.3 MB | -94% | 0.114s | -70% |
This benchmark does not mention the memory consumed by the database itself - this is intentional. Databases usually consume a configurable amount of memory, and then manage allocations between different buffers and system components internally. Over the years, databases have gotten pretty good at managing their memory so you won't have to. Whether you decide to use the database or not, the memory is already paid for, so you might as well use it!
Pandas and SQL: Better Together!
Programs that consume a lot of memory are a huge pain. Developers need powerful development environments, iterations are slower and the entire process takes more time. From an infrastructure perspective, resources cost money, and the more you scale the more you have to pay. The costs pile up pretty quickly.
All of this is not to say that Pandas is unnecessary, or that it can be replaced. Pandas provides great benefits and has proven itself to be incredibly valuable. The same can be said for databases.
To take advantage of both worlds and create lightweight programs that are also fast, use SQL and Pandas together!
I'm focusing on Pandas and Numpy because they are the most popular, but the concepts described in the article apply to other tools and languages such as R, Julia, Matlab, SAS and so on. To make the argument even more compelling, I include interactive Hex Notebooks you can experiment with on your own.
Basics
The SQL query language was invented more than 40 years ago, and it is the most popular language for querying relational data. SQL is defined in an ANSI standard but there are still subtle differences between popular database engines such as PostgreSQL, MySQL, Oracle, SQL Server and others.
These are the common clauses of an SQL query:
SELECT <expressions>
FROM <tables>
JOIN <to other table> ON <join condition>
WHERE <predicates>
GROUP BY <expressions>
HAVING <predicate>
ORDER BY <expressions>
LIMIT <number of rows>
In PostgreSQL only the SELECT
clause is really mandatory, so you can mix and match to do what you want.
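For example, this is a complete, valid query consisting of nothing but a SELECT clause (a trivial illustration of my own, not one of the article's examples):
SELECT 1 + 1 AS result;
 result
────────
      2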
Common Table Expressions
It's sometimes useful to split a large query into smaller steps. Using SQL, you can define a common table expression, or "CTE" for short, with the WITH clause:
WITH emails AS (
SELECT '[email protected]' AS email
)
SELECT * FROM emails;
email
───────────────────
[email protected]
You can have multiple CTEs in a single query, and they can even depend on each other:
WITH emails AS (
SELECT '[email protected]' AS email
),
normalized_emails AS (
SELECT lower(email) AS email FROM emails
)
SELECT * FROM normalized_emails;
email
───────────────────
[email protected]
Common table expressions are a great way to split a big query into smaller chunks, perform recursive queries and even to cache intermediate results!
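For example, here is a minimal sketch of a recursive CTE (my own illustration, not used elsewhere in the article) that generates a short sequence by referencing itself:
WITH RECURSIVE t AS (
    SELECT 1 AS n
    UNION ALL
    SELECT n + 1 FROM t WHERE n < 3
)
SELECT * FROM t;
 n
───
 1
 2
 3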
Generating Data
Being able to generate data is very handy. Sometimes you need to generate data for practice, sometimes you need to generate a time series or a small table to join to. There are several ways to generate data in SQL:
UNION ALL
WITH dt AS (
SELECT 1 AS id, 'haki' AS name
UNION ALL
SELECT 2, 'benita'
)
SELECT * FROM dt;
id │ name
────┼────────
1 │ haki
2 │ benita
Using UNION ALL
you can combine, or concatenate, the results of multiple queries.
Concatenating query results is very common, but it can be a bit tedious for generating data.
VALUES LIST
WITH dt AS (
SELECT * FROM (
VALUES
(1, 'haki'),
(2, 'benita')
) AS t(id, name)
)
SELECT * from dt;
Using the VALUES
keyword you can provide a list of rows, and then define names and types using a "table alias list" t(..)
. The t
can be any name. Using a VALUES
list is very useful when you need to generate small sets of data, or as the documentation calls it, a "constants table".
UNNEST
To generate small sets of one dimensional data, you can unnest
a PostgreSQL array:
WITH dt AS (
SELECT unnest(array[1, 2]) AS n
)
SELECT * FROM dt;
n
───
1
2
This is more restrictive than VALUES as it can only produce a one-dimensional table of a single data type, but we are going to use it later.
GENERATE_SERIES
To generate large amounts of data, PostgreSQL provides a table function called generate_series
:
WITH dt AS (
SELECT * FROM generate_series(0, 5) AS t(n)
)
SELECT * FROM dt;
n
──
0
1
2
3
4
5
The function generate_series
accepts three arguments: start, stop and step. In the example above we did not specify a step, so the default 1
was used. We can provide a different step to generate a different series:
WITH dt AS (
SELECT * FROM generate_series(
0, -- start
10, -- stop
2 -- step
) AS t(n)
)
SELECT * FROM dt;
n
────
0
2
4
6
8
10
To generate a list of even numbers, we set the step to 2.
The function generate_series
is not restricted to integers, it can be used for other types as well. One common example is generating date ranges:
WITH daterange AS (
SELECT *
FROM generate_series(
'2021-01-01 UTC'::timestamptz, -- start
'2021-01-02 UTC'::timestamptz, -- stop
interval '1 hour' -- step
) AS t(hh)
) SELECT * FROM daterange;
hh
────────────────────────
2021-01-01 00:00:00+00
2021-01-01 01:00:00+00
2021-01-01 02:00:00+00
...
2021-01-01 22:00:00+00
2021-01-01 23:00:00+00
2021-01-02 00:00:00+00
To generate a 24 hour range we provided generate_series
with a start and end date, and set the step to a 1 hour interval.
GENERATE_SERIES with row numbers
As mentioned above, generate_series
is a "table function". There is a little trick with table functions to include row numbers in the result:
WITH daterange AS (
SELECT *
FROM generate_series(
'2021-01-01'::timestamptz, -- start
'2021-01-02'::timestamptz, -- stop
interval '1 hour' -- step
) WITH ORDINALITY AS t(hh, n)
) SELECT * FROM daterange;
hh │ n
────────────────────────┼────
2021-01-01 00:00:00+00 │ 1
2021-01-01 01:00:00+00 │ 2
2021-01-01 02:00:00+00 │ 3
...
2021-01-01 22:00:00+00 │ 23
2021-01-01 23:00:00+00 │ 24
2021-01-02 00:00:00+00 │ 25
Using WITH ORDINALITY
, the results now include another column with the row number.
Random
To generate random numbers PostgreSQL provides a random
function that returns a value between 0 and 1:
SELECT random();
0.5917508391168769
To generate values in different ranges you can use random
in an expression:
-- Random float between 0 and 100
SELECT random() * 100;
59.17508391168769
-- Random integer between 1 and 100
SELECT ceil(random() * 100);
59
-- Random integer between 11 and 100
SELECT 10 + ceil(random() * 90);
59
It's a common mistake to use round instead of ceil or floor to generate a range of integers. Using round may produce an uneven distribution. Consider the following query to generate random integers in the range 0 - 3 using round instead of ceil:
SELECT
round(random() * 3) AS n,
count(*)
FROM
generate_series(0, 1000)
GROUP BY
1;
n │ count
──┼───────
0 │ 150
1 │ 328
2 │ 341
3 │ 182
Notice how the values 0 and 3 come up less often than 1 and 2. Using round, random values less than 0.5 are rounded down to 0 and random values greater than 2.5 are rounded up to 3, while, for example, random values between 0.5 and 1.5 are rounded to 1. The edge values each get only a half-sized interval, which makes them less likely to come up.
This problem can be solved by always rounding in one direction, either up or down. Consider the same query using ceil:
SELECT
ceil(random() * 3) AS n,
count(*)
FROM
generate_series(0, 1000)
GROUP BY
1;
n │ count
──┼───────
1 │ 328
2 │ 339
3 │ 334
Using ceil
produces more evenly distributed random numbers.
Random Choice
You can use the random
function to pick a random value from a list of values:
SELECT
(array['red', 'green', 'blue'])[ceil(random() * 3)] AS color
FROM
generate_series(1, 5);
color
───────
green
green
blue
green
blue
The expression defines an array of colors, and then uses random
to get a random element from the array. Notice that in PostgreSQL, arrays start at 1:
-- In PostgreSQL arrays start at 1
SELECT (array['red', 'green', 'blue'])[1];
array
───────
red
Sampling
Sampling a random portion of a table is very common when training a model. A simple way to fetch a random portion of a table is to combine random with LIMIT:
db=# WITH sample AS (
SELECT *
FROM users
ORDER BY random() LIMIT 10000
)
SELECT count(*) FROM sample;
count
───────
10000
(1 row)
Time: 205.643 ms
To sample 10K random rows from the table you first sort in a random order, and then take the first 10K rows.
Using random
to sample data is great, but for very large datasets it can be inefficient. PostgreSQL provides other methods of sampling a proportion of a table, which are more suited for large tables.
PostgreSQL provides two sampling methods, SYSTEM
and BERNOULLI
. To sample a table, use the TABLESAMPLE
keyword in the FROM
clause, and provide the sampling method along with its arguments. For example, sampling 10% of the table using the SYSTEM
sampling method:
db=# WITH sample AS (
SELECT *
FROM users TABLESAMPLE SYSTEM(10)
)
SELECT count(*) FROM sample;
count
───────
95400
(1 row)
Time: 13.690 ms
The SYSTEM
sampling method works by sampling blocks rather than rows, which makes it very fast. The table we sampled contains 1M rows, and the sample returned slightly less than 100K rows. For large datasets it's not uncommon to compromise accuracy for performance.
Another sampling method provided by PostgreSQL is BERNOULLI
:
db=# WITH sample AS (
SELECT *
FROM users TABLESAMPLE BERNOULLI(10)
)
SELECT count(*) FROM sample;
count
────────
100364
(1 row)
Time: 54.593 ms
Unlike the SYSTEM
sampling method, BERNOULLI
works at the row level which makes it a bit slower, but the results are better distributed.
These are the timings for sampling 10% of a table with 1M rows using the different sampling methods:
| Sampling Method | Timing |
|---|---|
| random() | 205ms |
| BERNOULLI | 54ms |
| SYSTEM | 13ms |
If you need to sample from a large table consider using TABLESAMPLE
.
Example: Train / Test Split with SQL
A common task when analyzing data is to split a dataset for training and testing. The training dataset is used to train the model, and the test dataset is used to evaluate the model.
To put what you've seen so far into practice, generate a transaction table with some random data:
CREATE TABLE transaction AS
SELECT
id,
'2021-01-01'::date + interval '1 day' * ceil(random() * 365) as billed_at,
round(10 + 90 * random()) as charged_amount,
random() > 0.6 as reported_as_fraud
FROM
generate_series(1, 10) AS id
ORDER BY
1;
The transaction table includes the date and amount of each transaction, and an indication of whether the transaction was reported as fraudulent.
Before we move on, let's break it down:
SELECT '2021-01-01'::date + interval '1 day' * ceil(random() * 365) as billed_at;
Add a random number of days between 1 and 365 to January 1st, 2021 to produce a random date in that year.
SELECT round(10 + 90 * random()) as charged_amount;
Produce a random round charged amount between 10 and 100.
SELECT random() > 0.6 as reported_as_fraud;
Produce the parameter we want to estimate. In our fake data, we want ~40% fraudulent transactions, so using an expression we produce a boolean value that evaluates to true roughly 40% of the time.
This is what the data looks like:
db=# SELECT * FROM transaction;
id │ billed_at │ charged_amount │ reported_as_fraud
────┼─────────────────────┼────────────────┼───────────────────
1 │ 2021-05-22 00:00:00 │ 54 │ t
2 │ 2021-05-31 00:00:00 │ 63 │ f
3 │ 2021-11-11 00:00:00 │ 26 │ t
4 │ 2021-07-04 00:00:00 │ 64 │ t
5 │ 2021-02-27 00:00:00 │ 90 │ t
6 │ 2021-05-21 00:00:00 │ 20 │ t
7 │ 2021-07-29 00:00:00 │ 69 │ t
8 │ 2021-02-24 00:00:00 │ 20 │ f
9 │ 2021-05-07 00:00:00 │ 36 │ f
10 │ 2021-05-05 00:00:00 │ 38 │ f
To test a model which classifies transactions as fraudulent, we want to split the table into a training and test datasets. One way to do that is adding a column, but we are going to create two separate tables instead.
To create a table similar to an existing table in PostgreSQL, you can use the following commands:
CREATE TABLE transaction_training AS TABLE transaction WITH NO DATA;
CREATE TABLE transaction_test AS TABLE transaction WITH NO DATA;
This is a really handy syntax! We simply tell PostgreSQL to create a table similar to another table, but with no data.
Next, we want to split the data in the transaction
table between transaction_training
and transaction_test
. We want our training set to include 80% of the rows, in this case 8 rows:
WITH
training_transaction_ids AS (
INSERT INTO transaction_training
SELECT * FROM transaction
ORDER BY random() LIMIT 8
RETURNING id
)
INSERT INTO transaction_test
SELECT * FROM transaction
WHERE id NOT IN (SELECT id FROM training_transaction_ids);
To populate data for training we select from the transaction
table, shuffle the rows using ORDER BY random()
and then insert into transaction_training
just the first 8 rows.
To insert only the remaining rows into the test table, we keep the ids of the training rows by specifying RETURNING id
in a common table expression (the WITH
clause). We then insert rows into transaction_test
and exclude rows in training_transaction_ids
. For more on this technique check out how to implement complete processes using WITH
and RETURNING
.
This is the result:
db=# SELECT * FROM transaction_training;
id │ billed_at │ charged_amount │ reported_as_fraud
────┼─────────────────────┼────────────────┼───────────────────
6 │ 2021-05-21 00:00:00 │ 20 │ t
4 │ 2021-07-04 00:00:00 │ 64 │ t
5 │ 2021-02-27 00:00:00 │ 90 │ t
2 │ 2021-05-31 00:00:00 │ 63 │ f
10 │ 2021-05-05 00:00:00 │ 38 │ f
3 │ 2021-11-11 00:00:00 │ 26 │ t
9 │ 2021-05-07 00:00:00 │ 36 │ f
7 │ 2021-07-29 00:00:00 │ 69 │ t
(8 rows)
db=# SELECT * FROM transaction_test;
id │ billed_at │ charged_amount │ reported_as_fraud
────┼─────────────────────┼────────────────┼───────────────────
1 │ 2021-05-22 00:00:00 │ 54 │ t
8 │ 2021-02-24 00:00:00 │ 20 │ f
(2 rows)
And there you have it, a training dataset and a test dataset with SQL, directly in the database!
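As a quick sanity check (my own addition, not part of the original flow), you can verify that the two sets do not overlap:
SELECT count(*)
FROM transaction_training
JOIN transaction_test USING (id);
 count
───────
     0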
Descriptive Statistics
When you get a fresh data set, the first thing you usually want to do is get familiar with it. Some people call this "Exploratory data analysis", or EDA for short. Pandas, as well as other languages and tools, provide some utility functions to produce descriptive statistics.
Describing a Series
Describing a numeric series using pandas:
>>> import pandas as pd
>>> s = pd.Series([1, 2, 3])
>>> s.describe()
count 3.0
mean 2.0
std 1.0
min 1.0
25% 1.5
50% 2.0
75% 2.5
max 3.0
dtype: float64
To generate descriptive statistics in SQL, you can use the following query:
WITH s AS (
SELECT * FROM (VALUES (1), (2), (3)) AS t(n)
)
SELECT
count(*),
avg(n),
stddev(n),
min(n),
percentile_cont(array[0.25, 0.5, 0.75]) WITHIN GROUP (ORDER BY n),
max(n)
FROM
s;
count │ avg │ stddev │ min │ percentile_cont │ max
──────┼────────┼───────────┼─────┼─────────────────┼─────
3 │ 2.0000 │ 1.0000000 │ 1 │ {1.5,2,2.5} │ 3
Basic aggregate functions in SQL produced a similar output to that of Pandas. The interesting part here is the function percentile_cont.
The function percentile_cont
is an Ordered-Set Aggregate Function, meaning, it operates with respect to some order. To illustrate, in the query above you can replace both min
and max
with percentile_cont
:
WITH s AS (
SELECT * FROM (VALUES (1), (2), (3)) AS t(n)
)
SELECT
percentile_cont(array[
0, -- <--- min
0.25, 0.5, 0.75,
1 -- <--- max
]) WITHIN GROUP (ORDER BY n)
FROM
s;
percentile_cont
─────────────────
{1,1.5,2,2.5,3}
Another common use for percentile_cont
is to find the median of a sequence of numbers:
WITH s AS (SELECT * FROM generate_series(1, 10) AS t(n))
SELECT
percentile_disc(0.5) WITHIN GROUP (ORDER BY n),
percentile_cont(0.5) WITHIN GROUP (ORDER BY n)
FROM
s;
percentile_disc │ percentile_cont
─────────────────┼─────────────────
5 │ 5.5
The query demonstrates two types of medians:
- percentile_disc returns an actual value from the table, the smallest one that at least 50% of the values are less than or equal to. Notice that 5 is present in the table.
- percentile_cont returns a value that 50% of the values are less than, interpolating between adjacent values when needed. The value 5.5 is not present in the table; it's the point between 5 and 6 that divides the values in the table down the middle (0.5).
Both functions can accept an array of values, in which case they will return a corresponding list of results.
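For example, passing an array of fractions to both functions (a small sketch extending the queries above):
WITH s AS (SELECT * FROM generate_series(1, 10) AS t(n))
SELECT
    percentile_disc(array[0.25, 0.5, 0.75]) WITHIN GROUP (ORDER BY n) AS disc,
    percentile_cont(array[0.25, 0.5, 0.75]) WITHIN GROUP (ORDER BY n) AS cont
FROM
    s;
  disc   │      cont
─────────┼─────────────────
 {3,5,8} │ {3.25,5.5,7.75}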
Describing a Categorical Series
Previously we described a list of numbers. This time we want to describe a list of categorical values. For example, pandas will produce the following output:
>>> s = pd.Series(['a', 'a', 'b', 'c'])
>>> s.describe()
count 4
unique 3
top a
freq 2
dtype: object
Using SQL we can produce similar output:
WITH s AS (SELECT unnest(array['a', 'a', 'b', 'c']) AS v)
SELECT
count(*),
count(DISTINCT V) AS unique,
mode() WITHIN GROUP (ORDER BY V) AS top
FROM
s;
count │ unique │ top
───────┼────────┼─────
4 │ 3 │ a
To calculate the number of unique values you used COUNT(DISTINCT ...)
. To get the value that appears most often in the series, i.e. the one with the highest frequency, you used another ordered set function called mode
.
Subtotals
Another useful technique for analyzing data is producing subtotals. We already saw how to apply aggregate functions over the whole table, but how about multiple aggregation levels in the same query?
Let's imagine a table of employees. For each employee we keep the name, the role and the department they work at:
WITH emp AS (
SELECT * FROM (VALUES
('Haki', 'R&D', 'Manager'),
('Dan', 'R&D', 'Developer'),
('Jax', 'R&D', 'Developer'),
('George', 'Sales', 'Manager'),
('Bill', 'Sales', 'Developer'),
('David', 'Sales', 'Developer')
) AS t(
name, department, role
)
)
SELECT * FROM emp;
name │ department │ role
────────┼────────────┼───────────
Haki │ R&D │ Manager
Dan │ R&D │ Developer
Jax │ R&D │ Developer
George │ Sales │ Manager
Bill │ Sales │ Developer
David │ Sales │ Developer
To find the number of employees with each role in each department, you can use GROUP BY
:
WITH emp AS ( /* ... */ )
SELECT department, role, COUNT(*)
FROM emp
GROUP BY department, role;
department │ role │ count
────────────┼───────────┼───────
R&D │ Developer │ 2
R&D │ Manager │ 1
Sales │ Manager │ 1
Sales │ Developer │ 2
Rollup
What if we also want to get the number of employees in each department, across all roles:
WITH emp AS ( /* ... */ )
SELECT department, role, COUNT(*)
FROM emp
GROUP BY department, ROLLUP(role);
department │ role │ count
────────────┼───────────┼───────
R&D │ Developer │ 2
R&D │ Manager │ 1
Sales │ Manager │ 1
Sales │ Developer │ 2
R&D │ ¤ │ 3 -- <-- Total for R&D
Sales │ ¤ │ 3 -- <-- Total for Sales
Notice the use of the ROLLUP subclause in the GROUP BY clause.
To add a subtotal for each department, we keep department as a regular grouping column and tell the database to "roll up" the role field. The database then adds two additional aggregate rows, one for each department.
The database can actually produce subtotals at several levels. For example, to add the grand total of the number of employees across all departments, we can tell the database to "roll up" the department field as well:
WITH emp AS ( /* ... */ )
SELECT department, role, COUNT(*)
FROM emp
GROUP BY ROLLUP(department, role);
department │ role │ count
────────────┼───────────┼───────
¤ │ ¤ │ 6 -- <-- Grand total
R&D │ Developer │ 2
R&D │ Manager │ 1
Sales │ Manager │ 1
Sales │ Developer │ 2
R&D │ ¤ │ 3 -- <-- Total for R&D
Sales │ ¤ │ 3 -- <-- Total for Sales
The query now includes several subtotals. To identify the aggregate level for each row, use the function GROUPING
:
WITH emp AS ( /* ... */ )
SELECT
department, role, COUNT(*),
GROUPING(department) AS department_subtotal,
GROUPING(department, role) AS grand_total
FROM emp
GROUP BY ROLLUP(department), role;
department │ role │ count │ department_subtotal │ grand_total
────────────┼───────────┼───────┼─────────────────────┼─────────────
Sales │ Developer │ 2 │ 0 │ 0
Sales │ Manager │ 1 │ 0 │ 0
R&D │ Developer │ 2 │ 0 │ 0
R&D │ Manager │ 1 │ 0 │ 0
¤ │ Manager │ 2 │ 1 │ 2
¤ │ Developer │ 4 │ 1 │ 2
Cube
When talking about subtotals, or aggregates at multiple levels, OLAP usually comes to mind. An OLAP cube is a technique where all the subtotals are pre-calculated to make retrieval faster. We could get all the possible combinations by spelling them out ourselves, but there is an easier way to do that:
WITH emp AS ( /* ... */ )
SELECT department, role, COUNT(*)
FROM emp
GROUP BY CUBE(department, role);
department │ role │ count
────────────┼───────────┼───────
¤ │ ¤ │ 6 -- <-- Grand Total
R&D │ Developer │ 2
R&D │ Manager │ 1
Sales │ Manager │ 1
Sales │ Developer │ 2
R&D │ ¤ │ 3 -- <-- Subtotal for R&D department
Sales │ ¤ │ 3 -- <-- Subtotal for Sales department
¤ │ Manager │ 2 -- <-- Subtotal for Manager role
¤ │ Developer │ 4 -- <-- Subtotal for Developer role
CUBE generates subtotals for all possible combinations. In the example above, using CUBE added an additional subtotal for each role.
Grouping Sets
Both CUBE
and ROLLUP
are syntactic sugar of GROUPING SETS
:
WITH emp AS ( /* ... */ )
SELECT department, role, COUNT(*)
FROM emp
GROUP BY GROUPING SETS (
(), -- <-- Grand total
(role), -- <-- Subtotal by role
(department), -- <-- Subtotal by department
(role, department) -- <-- No aggregation, the row itself
);
department │ role │ count
────────────┼───────────┼───────
¤ │ ¤ │ 6
Sales │ Developer │ 2
Sales │ Manager │ 1
R&D │ Developer │ 2
R&D │ Manager │ 1
¤ │ Manager │ 2
¤ │ Developer │ 4
R&D │ ¤ │ 3
Sales │ ¤ │ 3
Using GROUPING SETS
you can tell the database exactly which subtotals to generate. The query above is generating all possible combinations, so it's equivalent to CUBE
.
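For reference, the ROLLUP(department, role) query from above could be written out explicitly like this (an equivalent form, following the expansion rules):
WITH emp AS ( /* ... */ )
SELECT department, role, COUNT(*)
FROM emp
GROUP BY GROUPING SETS (
    (department, role),
    (department),
    ()
);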
Pivot Tables
Pivot tables are a technique to reshape data, and pandas includes a very powerful pivot_table
function:
>>> import pandas as pd
>>> df = pd.DataFrame({
... 'name': ['Haki', 'Dan', 'Jax', 'George', 'Bill', 'David'],
... 'department': ['R&D', 'R&D', 'R&D', 'Sales', 'Sales', 'Sales',],
... 'role': ['Manager', 'Developer', 'Developer', 'Manager', 'Developer', 'Developer'],
... })
>>> pd.pivot_table(df, values='name', index='role', columns='department', aggfunc='count')
department R&D Sales
role
Developer 2 2
Manager 1 1
Conditional Expressions
To recreate the "pivot" above in SQL, do the following:
WITH emp AS (
SELECT * FROM (VALUES
('Haki', 'R&D', 'Manager'),
('Dan', 'R&D', 'Developer'),
('Jax', 'R&D', 'Developer'),
('George', 'Sales', 'Manager'),
('Bill', 'Sales', 'Developer'),
('David', 'Sales', 'Developer')
) AS t(
name, department, role
)
)
SELECT
role,
SUM(CASE department WHEN 'R&D' THEN 1 ELSE 0 END) as "R&D",
SUM(CASE department WHEN 'Sales' THEN 1 ELSE 0 END) as "Sales"
FROM
emp
GROUP BY
role;
role │ R&D │ Sales
───────────┼─────┼───────
Manager │ 1 │ 1
Developer │ 2 │ 2
Using CASE, you constructed a "conditional expression" that returns 1 only for rows of a specific department, so that SUM counts only that department. By adding a conditional expression for every department, you "reshaped" the data into a pivot table.
Aggregate Expressions
Using CASE
is flexible but it's a bit tedious. Applying conditions on aggregates is so useful that SQL added special syntax for it:
WITH emp AS ( /* ... */ )
SELECT
role,
COUNT(*) FILTER (WHERE department = 'R&D') as "R&D",
COUNT(*) FILTER (WHERE department = 'Sales') as "Sales"
FROM
emp
GROUP BY
role;
This way of reshaping data into "pivot tables" is very common for visualizing and analyzing data. There is no doubt Pandas is more flexible and convenient when it comes to pivot tables. However, the process of turning rows into columns is also very common in ETL processes, where data is denormalized.
The bottom line is that Pandas may be better for quickly analyzing and visualizing small sets of data, but ETL processes may benefit from doing this work in the database using conditional or aggregate expressions.
Running and Cumulative Aggregation
Aggregations over a sliding window are very common, usually on a time series. For example, traders use moving averages as an indication of a stock's trend, running sums can be used to backtest an anomaly detection strategy and so on.
To illustrate, take this table listing daily temperatures:
WITH temperatures AS (
SELECT * FROM (VALUES
('2021-01-01'::date, 10),
('2021-01-02'::date, 12),
('2021-01-03'::date, 13),
('2021-01-04'::date, 14),
('2021-01-05'::date, 18),
('2021-01-06'::date, 15),
('2021-01-07'::date, 16),
('2021-01-08'::date, 17)
) as t(t, c)
)
SELECT * FROM temperatures;
t │ c
────────────┼────
2021-01-01 │ 10
2021-01-02 │ 12
2021-01-03 │ 13
2021-01-04 │ 14
2021-01-05 │ 18
2021-01-06 │ 15
2021-01-07 │ 16
2021-01-08 │ 17
Window Functions
Say you want to compare each day to the hottest day ever:
WITH temperatures AS ( /* ... */ )
SELECT
*,
MAX(c) OVER (PARTITION BY 1) AS hottest_temperature
FROM
temperatures;
t │ c │ hottest_temperature
────────────┼────┼─────────────────────
2021-01-01 │ 10 │ 18
2021-01-02 │ 12 │ 18
2021-01-03 │ 13 │ 18
2021-01-04 │ 14 │ 18
2021-01-05 │ 18 │ 18
2021-01-06 │ 15 │ 18
2021-01-07 │ 16 │ 18
2021-01-08 │ 17 │ 18
By adding the OVER (PARTITION ...)
clause to the aggregate function MAX
, you turned it into a window function. Window functions operate on a set of rows determined by the PARTITION
clause. Since you used a constant value PARTITION BY 1
, the function operates on all the rows.
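As a side note, an equivalent way to write this is with an empty OVER () clause, which also makes the window span all the rows of the result (an alternative form, not the one used above):
WITH temperatures AS ( /* ... */ )
SELECT
    *,
    MAX(c) OVER () AS hottest_temperature
FROM
    temperatures;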
To complete the query, use the result of the window function in an expression:
WITH temperatures AS ( /* ... */ )
SELECT
*,
(c::float / MAX(c) OVER (PARTITION BY 1) - 1) * 100 AS compared_to_hottest_day
FROM
temperatures;
t │ c │ compared_to_hottest_day
────────────┼────┼────────────────────────
2021-01-01 │ 10 │ -44.44444444444444
2021-01-02 │ 12 │ -33.333333333333336
2021-01-03 │ 13 │ -27.77777777777778
2021-01-04 │ 14 │ -22.22222222222222
2021-01-05 │ 18 │ 0
2021-01-06 │ 15 │ -16.666666666666664
2021-01-07 │ 16 │ -11.111111111111116
2021-01-08 │ 17 │ -5.555555555555558
If you're not sure why we cast the temperature to float, make sure to read my tip about dividing integers in SQL.
Sliding Window
Comparing each day's temperature against the hottest temperature ever can be useful, but more often than not, you want to compare a value to a limited period, or in other words, a sliding window.
To find the highest temperature in the last three days for example, you can add a frame clause:
WITH temperatures AS ( /* ... */ )
SELECT
*,
MAX(c) OVER (
ORDER BY t
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
) AS hottest_temperature_last_three_days
FROM
temperatures;
t │ c │ hottest_temperature_last_three_days
────────────┼────┼─────────────────────────────────────
2021-01-01 │ 10 │ 10
2021-01-02 │ 12 │ 12
2021-01-03 │ 13 │ 13
2021-01-04 │ 14 │ 14
2021-01-05 │ 18 │ 18
2021-01-06 │ 15 │ 18
2021-01-07 │ 16 │ 18
2021-01-08 │ 17 │ 17
Once again you used a window function, but this time you added a frame clause to it, stating the window should include 2 previous rows and the current one.
The frame syntax is very flexible, and it is not restricted to ROWS
. The query above can be expressed using a range frame as well:
WITH temperatures AS ( /* ... */ )
SELECT
*,
MAX(c) OVER (
ORDER BY t
RANGE BETWEEN '2 days' PRECEDING AND '0 days' FOLLOWING
) AS hottest_temperature_last_three_days
FROM
temperatures;
t │ c │ hottest_temperature_last_three_days
────────────┼────┼─────────────────────────────────────
2021-01-01 │ 10 │ 10
2021-01-02 │ 12 │ 12
2021-01-03 │ 13 │ 13
2021-01-04 │ 14 │ 14
2021-01-05 │ 18 │ 18
2021-01-06 │ 15 │ 18
2021-01-07 │ 16 │ 18
2021-01-08 │ 17 │ 17
Notice how nice the RANGE
syntax is... it reads like an actual sentence!
Linear Regression
Another common tool for analyzing data is linear regression.
For example, performing linear regression using Pandas and Scipy:
>>> import pandas as pd
>>> import scipy.stats
>>> df = pd.DataFrame([[1.2, 1], [2, 1.8], [3.1, 2.9]])
>>> slope, intercept, r_value, p_value, std_err = scipy.stats.linregress(df[0], df[1])
>>> print(slope, intercept, r_value, p_value, std_err)
1.0 -0.2000000000000004 1.0 9.003163161571059e-11 0.0
Most developers probably don't expect the database to have statistical functions, but PostgreSQL does:
WITH t AS (SELECT * FROM (VALUES
(1.2, 1.0),
(2.0, 1.8),
(3.1, 2.9)
) AS t(x, y))
SELECT
regr_slope(y, x) AS slope,
regr_intercept(y, x) AS intercept,
sqrt(regr_r2(y, x)) AS r
FROM
t;
slope │ intercept │ r
────────────────────┼──────────────────────┼───
1.0000000000000002 │ -0.20000000000000048 │ 1
Using statistical aggregate functions in PostgreSQL we got results similar to scipy.
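Once the database has computed the coefficients, predicting y for a new x is just an expression. A minimal sketch of my own, extending the query above with an arbitrary x value of 4.0:
WITH t AS (SELECT * FROM (VALUES
    (1.2, 1.0),
    (2.0, 1.8),
    (3.1, 2.9)
) AS t(x, y)),
fit AS (
    SELECT
        regr_slope(y, x) AS slope,
        regr_intercept(y, x) AS intercept
    FROM
        t
)
SELECT
    -- y = slope * x + intercept, for x = 4.0
    slope * 4.0 + intercept AS predicted_y
FROM
    fit;
-- predicted_y is approximately 3.8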
Interpolation
Data cleaning is an important part of any data job, and handling missing values is a big part of that.
Fill with Constant
The simplest way to fill in missing data is with some constant value. Using Pandas for example, this is done using the fillna
function:
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame(['A', 'B', np.NaN, 'D', np.NaN, np.NaN, 'G'])
>>> df.fillna('X')
0
0 A
1 B
2 X
3 D
4 X
5 X
6 G
In SQL, if your missing values are NULL, you can use a conditional expression such as CASE, or the shorter COALESCE function:
WITH tb AS (
SELECT * FROM (VALUES
(1, 'A' ),
(2, 'B' ),
(3, null),
(4, 'D' ),
(5, null),
(6, null),
(7, 'G' )
) AS t(n, v)
)
SELECT
n,
coalesce(v, 'X') AS v
FROM
tb;
n │ v
───┼───
1 │ A
2 │ B
3 │ X
4 │ D
5 │ X
6 │ X
7 │ G
The function COALESCE
accepts any number of arguments and returns the first one that is not NULL.
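For example (a trivial illustration):
SELECT coalesce(null, null, 'X') AS v;
 v
───
 X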
Back and Forward Fill
Filling values with constants is easy, but not always possible. Another common interpolation technique is filling empty values with previous or following non-missing values.
Pandas offers several variations on back and forward filling, for example:
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame(['A', 'B', np.NaN, 'D', np.NaN, np.NaN, 'G'])
>>> df.fillna(method='ffill') # or df.ffill()
0
0 A
1 B
2 B
3 D
4 D
5 D
6 G
>>> df.fillna(method='bfill') # or df.bfill() or df.backfill()
0
0 A
1 B
2 D
3 D
4 G
5 G
6 G
To achieve the same using SQL, you can use a subquery:
WITH tb AS (
SELECT * FROM (VALUES
(1, 'A' ),
(2, 'B' ),
(3, null),
(4, 'D' ),
(5, null),
(6, null),
(7, 'G' )
) AS t(n, v)
)
SELECT
*,
-- Find the previous not null value
coalesce(v, (
SELECT v
FROM tb AS tb_
WHERE tb_.n < tb.n AND v IS NOT NULL
ORDER BY n DESC
LIMIT 1
)) AS ffill_v,
-- Find the next not null value
coalesce(v, (
SELECT v
FROM tb as tb_
WHERE tb_.n > tb.n AND v IS NOT NULL
ORDER BY n ASC
LIMIT 1
)) as bfill_v
FROM
tb;
n │ v │ ffill_v │ bfill_v
───┼───┼─────────┼─────────
1 │ A │ A │ A
2 │ B │ B │ B
3 │ ¤ │ B │ D
4 │ D │ D │ D
5 │ ¤ │ D │ G
6 │ ¤ │ D │ G
7 │ G │ G │ G
The SQL version is a bit longer, but it is fairly expressive, and it gives great flexibility.
NOTE: It's tempting to use the window functions LEAD and LAG here, but these functions can only fill single-row gaps. Once you have more than one consecutive missing row, LEAD and LAG may leave you with missing values.
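To see why, here is a minimal sketch (my own, not from the article) of filling with LAG when two consecutive rows are missing:
WITH tb AS (
    SELECT * FROM (VALUES
        (1, 'A' ),
        (2, null),
        (3, null),
        (4, 'D' )
    ) AS t(n, v)
)
SELECT
    n,
    v,
    coalesce(v, lag(v) OVER (ORDER BY n)) AS lag_fill
FROM
    tb;
 n │ v │ lag_fill
───┼───┼──────────
 1 │ A │ A
 2 │ ¤ │ A
 3 │ ¤ │ ¤   -- <-- the gap remains
 4 │ D │ D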
Linear Interpolation
Another common interpolation technique for discrete data is linear interpolation.
>>> import pandas as pd
... import numpy as np
...
... df = pd.DataFrame([
... (np.datetime64('2021-01-01'), 10),
... (np.datetime64('2021-01-02'), 12),
... (np.datetime64('2021-01-03'), np.NaN),
... (np.datetime64('2021-01-04'), 14),
... (np.datetime64('2021-01-05'), np.NaN),
... (np.datetime64('2021-01-06'), np.NaN),
... (np.datetime64('2021-01-07'), 18),
... (np.datetime64('2021-01-08'), 15)
... ], columns=('t', 'c'))
>>> # Assume data is evenly distributed
>>> df['c'].interpolate('linear')
0 10.000000
1 12.000000
2 13.000000
3 14.000000
4 15.333333
5 16.666667
6 18.000000
7 15.000000
Linear interpolation works by filling missing values along a straight line between two known coordinates (x1, y1) and (x2, y2). In this case, the two known coordinates are the last and the next known date and temperature.
You already found the next and previous known value for each row when you implemented back and forward fill:
WITH
temperatures AS (
SELECT * FROM (VALUES
('2021-01-01'::date, 10),
('2021-01-02'::date, 12),
('2021-01-03'::date, null),
('2021-01-04'::date, 14),
('2021-01-05'::date, null),
('2021-01-06'::date, null),
('2021-01-07'::date, 18),
('2021-01-08'::date, 15)
) as t(t, c)
),
temperatures_with_previous_values AS (
SELECT
*,
-- Last known temperature
(
SELECT
array[extract('epoch' FROM t), c]
FROM
temperatures as temperatures_
WHERE
temperatures_.t < temperatures.t
AND c IS NOT NULL
ORDER BY
temperatures_.t DESC
LIMIT 1
) AS last_known_temperature,
-- Next known temperature
(
SELECT
array[extract('epoch' FROM t), c]
FROM
temperatures as temperatures_
WHERE
temperatures_.t > temperatures.t
AND c IS NOT NULL
ORDER BY
temperatures_.t ASC
LIMIT 1
) AS next_known_temperature
FROM
temperatures
)
SELECT * FROM temperatures_with_previous_values;
t │ c │ last_known_temperature │ next_known_temperature
────────────┼────┼────────────────────────┼────────────────────────
2021-01-01 │ 10 │ ¤ │ {1609545600,12}
2021-01-02 │ 12 │ {1609459200,10} │ {1609718400,14}
2021-01-03 │ ¤ │ {1609545600,12} │ {1609718400,14}
2021-01-04 │ 14 │ {1609545600,12} │ {1609977600,18}
2021-01-05 │ ¤ │ {1609718400,14} │ {1609977600,18}
2021-01-06 │ ¤ │ {1609718400,14} │ {1609977600,18}
2021-01-07 │ 18 │ {1609718400,14} │ {1610064000,15}
2021-01-08 │ 15 │ {1609977600,18} │ ¤
There are two main differences here from what you've done before:
- You converted the date to a number: this is called "epoch", the number of seconds since 1970. To convert the date you used the function extract('epoch' FROM t).
- You keep two values from the previous and next rows: to implement linear interpolation we need coordinates, which are both the date and the temperature. To return multiple values from the previous row, you constructed an array: array[extract('epoch' FROM t), c].
To calculate a missing value with two known coordinates using linear interpolation, use the following formula:
y = y0 + (x - x0) * ((y1 - y0) / (x1 - x0))
You already have all the data available, so just organize it a bit:
WITH
temperatures AS ( /* ... */ ),
temperatures_with_previous_values AS ( /* ... */ ),
-- This step is just for convenience
temperatures_prep AS (
SELECT
t,
c,
extract('epoch' from t) as x,
last_known_temperature[1] as x0,
last_known_temperature[2] as y0,
next_known_temperature[1] as x1,
next_known_temperature[2] as y1
FROM
temperatures_with_previous_values
)
SELECT t, c, x, x0, y0, x1, y1 FROM temperatures_prep;
t │ c │ x │ x0 │ y0 │ x1 │ y1
────────────┼────┼────────────┼────────────┼────┼────────────┼────
2021-01-01 │ 10 │ 1609459200 │ ¤ │ ¤ │ 1609545600 │ 12
2021-01-02 │ 12 │ 1609545600 │ 1609459200 │ 10 │ 1609718400 │ 14
2021-01-03 │ ¤ │ 1609632000 │ 1609545600 │ 12 │ 1609718400 │ 14
2021-01-04 │ 14 │ 1609718400 │ 1609545600 │ 12 │ 1609977600 │ 18
2021-01-05 │ ¤ │ 1609804800 │ 1609718400 │ 14 │ 1609977600 │ 18
2021-01-06 │ ¤ │ 1609891200 │ 1609718400 │ 14 │ 1609977600 │ 18
2021-01-07 │ 18 │ 1609977600 │ 1609718400 │ 14 │ 1610064000 │ 15
2021-01-08 │ 15 │ 1610064000 │ 1609977600 │ 18 │ ¤ │ ¤
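As a sanity check (a hand calculation of my own, not part of the original queries), plugging the 2021-01-03 row into the formula gives:
-- x = 1609632000 (2021-01-03), x0 = 1609545600, y0 = 12, x1 = 1609718400, y1 = 14
SELECT 12 + (1609632000 - 1609545600) * (14 - 12)::float / (1609718400 - 1609545600) AS interpolated_c;
 interpolated_c
────────────────
             13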
Now that you have all the data neatly organized, you can use the formula to calculate missing values:
WITH
temperatures AS ( /* ... */ ),
temperatures_with_previous_values AS ( /* ... */ ),
temperatures_prep AS ( /* ... */ )
SELECT
t,
c,
CASE
WHEN c IS NOT NULL THEN c
ELSE y0 + (x - x0) * ((y1 - y0) / (x1 - x0))
END AS interpolated_c
FROM
temperatures_prep
;
t │ c │ interpolated_c
────────────┼────┼────────────────────
2021-01-01 │ 10 │ 10
2021-01-02 │ 12 │ 12
2021-01-03 │ ¤ │ 13
2021-01-04 │ 14 │ 14
2021-01-05 │ ¤ │ 15.333333333333334
2021-01-06 │ ¤ │ 16.666666666666668
2021-01-07 │ 18 │ 18
2021-01-08 │ 15 │ 15
And there it is, the missing temperatures were filled using linear interpolation. For reference, this is the complete query, with all of the steps combined:
WITH
temperatures AS (
SELECT * FROM (VALUES
('2021-01-01'::date, 10),
('2021-01-02'::date, 12),
('2021-01-03'::date, null),
('2021-01-04'::date, 14),
('2021-01-05'::date, null),
('2021-01-06'::date, null),
('2021-01-07'::date, 18),
('2021-01-08'::date, 15)
) as t(t, c)
),
temperatures_with_previous_values AS (
SELECT
*,
-- Last known temperature
(
SELECT
ARRAY[EXTRACT('epoch' FROM t), c]
FROM
temperatures as temperatures_
WHERE
temperatures_.t < temperatures.t
AND c IS NOT NULL
ORDER BY
temperatures_.t DESC
LIMIT 1
) AS last_known_temperature,
-- Next known temperature
(
SELECT
ARRAY[EXTRACT('epoch' FROM t), c]
FROM
temperatures as temperatures_
WHERE
temperatures_.t > temperatures.t
AND c IS NOT NULL
ORDER BY
temperatures_.t ASC
LIMIT 1
) AS next_known_temperature
FROM
temperatures
),
-- This step is just for convenience
temperatures_prep AS (
SELECT
t,
c,
extract('epoch' from t) as x,
last_known_temperature[1] as x0,
last_known_temperature[2] as y0,
next_known_temperature[1] as x1,
next_known_temperature[2] as y1
FROM
temperatures_with_previous_values
)
SELECT
t,
c,
CASE
WHEN c IS NOT NULL THEN c
ELSE y0 + (x - x0) * ((y1 - y0) / (x1 - x0))
END AS interpolated_c
FROM
temperatures_prep;
Binning
Binning, or "bucketing", is a technique to group values together.
Custom Binning
Custom binning is most common for categorical data or for discrete data when ranges are pre-determined.
Imagine you have a table with student grades, and you want to classify them into letter grades A-F:
WITH grades AS (
SELECT round(70 + sin(n) * 30)::int AS grade
FROM generate_series(1, 100) AS n
)
SELECT
CASE
WHEN grade < 60 THEN 'F'
WHEN grade < 70 THEN 'D'
WHEN grade < 80 THEN 'C'
WHEN grade < 90 THEN 'B'
ELSE 'A'
END AS letter_grade,
COUNT(*)
FROM
grades
GROUP BY
letter_grade
ORDER BY
letter_grade;
letter_grade │ count
──────────────┼───────
A │ 29
B │ 10
C │ 12
D │ 10
F │ 39
Custom binning can also use expressions to categorize data into custom groups. In the American grade system for example, the letter grade can also be calculated based on the percentile, and not the absolute grade:
WITH grades AS (
SELECT round(70 + sin(n) * 30)::int AS grade
FROM generate_series(1, 100) AS n
),
percent_grades AS (
SELECT percent_rank() OVER (ORDER BY grade) as percent_grade
FROM grades
)
SELECT
CASE
WHEN percent_grade < 0.6 THEN 'F'
WHEN percent_grade < 0.7 THEN 'D'
WHEN percent_grade < 0.8 THEN 'C'
WHEN percent_grade < 0.9 THEN 'B'
ELSE 'A'
END AS letter_grade,
COUNT(*)
FROM
percent_grades
GROUP BY
letter_grade
ORDER BY
letter_grade;
To find the relative grade of every student based on the grades of all other students, you used the window function percent_rank
. The function returns a value between 0 and 1 that represents the rank of the current row relative to all other rows.
Custom binning is mostly useful for when the data is familiar, or within a known set of values. When exploring unknown or unbound sets of data there are other binning techniques you can use.
Equal Height Binning
Say you need to divide students to groups based on their grade, and you want every group to have roughly the same number of students. To achieve this, PostgreSQL provides a function called NTILE
:
WITH grades AS (
SELECT round(70 + sin(n) * 30)::int AS grade
FROM generate_series(1, 100) AS n
),
grades_with_tiles AS (
SELECT
*,
ntile(10) OVER (ORDER BY grade) AS bucket
FROM
grades
)
SELECT
min(grade) AS from_grade,
max(grade) AS to_grade,
count(*) AS cnt,
bucket
FROM
grades_with_tiles
GROUP BY
bucket
ORDER BY
from_grade;
from_grade │ to_grade │ cnt │ bucket
────────────┼──────────┼─────┼────────
40 │ 41 │ 10 │ 1
41 │ 45 │ 10 │ 2
47 │ 53 │ 10 │ 3
53 │ 61 │ 10 │ 4
61 │ 70 │ 10 │ 5
71 │ 79 │ 10 │ 6
79 │ 87 │ 10 │ 7
89 │ 95 │ 10 │ 8
95 │ 99 │ 10 │ 9
99 │ 100 │ 10 │ 10
Dividing values into bins or buckets with roughly the same frequency is called "Equal Height Binning". Notice how each group holds exactly 10 rows.
The function NTILE is a window function. It accepts the number of buckets, in this case 10, and an ORDER BY clause that determines how the values are divided into buckets. Window functions can't be used as a GROUP BY key, so you need to use either a subquery or a CTE to add the "bucket" field.
Equal Width Binning
So far you divided students to groups based on arbitrary letter grades (custom binning) and to equally sized groups based on their grades (equal height binning). None of these grouping techniques gives you a good sense of the data distribution. One way to visualize the data and get a sense of how grades are distributed is using a histogram.
To draw a histogram you need to divide grades into equal width ranges. Grades range from 0 to 100, so you can split the range to 10 bars of 10 each:
WITH grades AS (
SELECT round(70 + sin(n) * 30)::int AS grade
FROM generate_series(1, 100) AS n
)
SELECT
floor((grade - 1) / 10) as bucket,
min(grade) from_grade,
max(grade) to_grade,
count(*)
FROM
grades
GROUP BY
bucket
ORDER BY
bucket;
bucket │ from_grade │ to_grade │ count
────────┼────────────┼──────────┼───────
3 │ 40 │ 40 │ 7
4 │ 41 │ 50 │ 20
5 │ 51 │ 58 │ 12
6 │ 61 │ 70 │ 11
7 │ 71 │ 79 │ 11
8 │ 81 │ 90 │ 14
9 │ 92 │ 100 │ 25
To assign each grade to the right bucket we used a little arithmetic. This worked out nicely because the arithmetic here is fairly simple, but what if you wanted smaller buckets? Say 20 buckets of width 5, or 25 buckets of width 4? That would have made the calculation more complicated.
To simplify the task of assigning values into equal width buckets within a predefined range, PostgreSQL provides the function width_bucket
:
WITH grades AS (
SELECT round(70 + sin(n) * 30)::int AS grade
FROM generate_series(1, 100) AS n
)
SELECT
width_bucket(grade, 0 ,101, 20) as bucket,
(width_bucket(grade, 0 ,101, 20) - 1) * 5 as low_bound,
width_bucket(grade, 0 ,101, 20) * 5 as high_bound,
count(*)
FROM
grades
GROUP BY
bucket
ORDER BY
bucket;
bucket │ low_bound │ high_bound │ count
───────┼───────────┼────────────┼───────
8 │ 35 │ 40 │ 7
9 │ 40 │ 45 │ 13
10 │ 45 │ 50 │ 7
11 │ 50 │ 55 │ 8
12 │ 55 │ 60 │ 4
13 │ 60 │ 65 │ 7
14 │ 65 │ 70 │ 4
15 │ 70 │ 75 │ 7
16 │ 75 │ 80 │ 4
17 │ 80 │ 85 │ 6
18 │ 85 │ 90 │ 8
19 │ 90 │ 95 │ 7
20 │ 95 │ 100 │ 18
The function width_bucket
accepts the value to assign, two arguments for the lower and higher bounds of the range, and the number of buckets to divide the range into.
You should be careful with the values you set for the higher and lower bounds. The higher bound of the range is exclusive, meaning, if you set the higher bound to 100, grades that equal 100 will be considered out of range, and will result in an additional bucket. This is why the query above uses 101 as the higher bound.
To calculate the higher bound of each bucket, we multiply the index of the bucket by the width. To get the lower bound, we multiply the width by the index minus one, or in other words, take the higher bound of the previous bucket.
Histograms are great for visualization, but if you try to draw a histogram from the result above you won't be able to get a real sense of the distribution, because you might have gaps. Notice, for example, how the results above start at bucket 8, which is not the first bucket. This is because in our grades table, no one got a grade lower than 40.
The function width_bucket is useful, but since we already do most of the hard work ourselves, we might as well generate the buckets on our own using generate_series:
SELECT
bucket,
(bucket - 1) * 5 + 1 AS low_bound,
bucket * 5 AS high_bound
FROM
generate_series(1, 20) AS bucket;
bucket │ low_bound │ high_bound
───────┼───────────┼────────────
1 │ 1 │ 5
2 │ 6 │ 10
3 │ 11 │ 15
4 │ 16 │ 20
5 │ 21 │ 25
6 │ 26 │ 30
7 │ 31 │ 35
8 │ 36 │ 40
9 │ 41 │ 45
10 │ 46 │ 50
11 │ 51 │ 55
12 │ 56 │ 60
13 │ 61 │ 65
14 │ 66 │ 70
15 │ 71 │ 75
16 │ 76 │ 80
17 │ 81 │ 85
18 │ 86 │ 90
19 │ 91 │ 95
20 │ 96 │ 100
The query generates 20 ranges of width 5 covering the range 1 to 100. To create the histogram, use this table as an axis and join it to the grades:
WITH grades AS (
SELECT round(70 + sin(n) * 30)::int AS grade
FROM generate_series(1, 100) AS n
),
buckets AS (
SELECT
bucket,
(bucket - 1) * 5 + 1 AS low_bound,
bucket * 5 AS high_bound
FROM
generate_series(1, 20) AS bucket
)
SELECT
bucket,
low_bound,
high_bound,
COUNT(grade) AS cnt
FROM
buckets
LEFT JOIN grades ON grade BETWEEN low_bound AND high_bound
GROUP BY
bucket, low_bound, high_bound
ORDER BY
bucket;
bucket │ low_bound │ high_bound │ cnt
───────┼───────────┼────────────┼─────
1 │ 1 │ 5 │ 0
2 │ 6 │ 10 │ 0
3 │ 11 │ 15 │ 0
4 │ 16 │ 20 │ 0
5 │ 21 │ 25 │ 0
6 │ 26 │ 30 │ 0
7 │ 31 │ 35 │ 0
8 │ 36 │ 40 │ 7
9 │ 41 │ 45 │ 13
10 │ 46 │ 50 │ 7
11 │ 51 │ 55 │ 8
12 │ 56 │ 60 │ 4
13 │ 61 │ 65 │ 7
14 │ 66 │ 70 │ 4
15 │ 71 │ 75 │ 7
16 │ 76 │ 80 │ 4
17 │ 81 │ 85 │ 6
18 │ 86 │ 90 │ 8
19 │ 91 │ 95 │ 7
20 │ 96 │ 100 │ 18
To make sure you don't have any gaps in the data, you LEFT JOINed the grades to the generated axis table buckets. As a result, some rows do not have a grade. COUNT(*) counts rows, so buckets with no grades would still return 1 (the single row produced by the LEFT JOIN). To overcome that, count only rows with grades using COUNT(grade).
To finish off with a bang, you can enhance your query with a little ascii chart to display the histogram straight in the terminal:
WITH grades AS (
SELECT round(70 + sin(n) * 30)::int AS grade
FROM generate_series(1, 100) AS n
),
buckets AS (
SELECT
bucket,
(bucket - 1) * 5 + 1 AS low_bound,
bucket * 5 AS high_bound
FROM
generate_series(1, 20) AS bucket
)
SELECT
low_bound || ' - ' || high_bound as bounds,
COUNT(grade) AS cnt,
repeat('■', COUNT(grade)::int) as chart
FROM
buckets
LEFT JOIN grades ON grade BETWEEN low_bound AND high_bound
GROUP BY
bucket, low_bound, high_bound
ORDER BY
bucket;
bounds │ cnt │ chart
─────────┼─────┼────────────────────
1 - 5 │ 0 │
6 - 10 │ 0 │
11 - 15 │ 0 │
16 - 20 │ 0 │
21 - 25 │ 0 │
26 - 30 │ 0 │
31 - 35 │ 0 │
36 - 40 │ 7 │ ■■■■■■■
41 - 45 │ 13 │ ■■■■■■■■■■■■■
46 - 50 │ 7 │ ■■■■■■■
51 - 55 │ 8 │ ■■■■■■■■
56 - 60 │ 4 │ ■■■■
61 - 65 │ 7 │ ■■■■■■■
66 - 70 │ 4 │ ■■■■
71 - 75 │ 7 │ ■■■■■■■
76 - 80 │ 4 │ ■■■■
81 - 85 │ 6 │ ■■■■■■
86 - 90 │ 8 │ ■■■■■■■■
91 - 95 │ 7 │ ■■■■■■■
96 - 100 │ 18 │ ■■■■■■■■■■■■■■■■■■
And there you have it! A histogram with SQL, right there in your terminal...
Take Away
If there is one thing you take away from this article, it should be this: use the best tool for the job! Pandas is great, and SQL is also great. Each has its strengths and weaknesses, and you have a better chance of creating an optimal data pipeline if you know both!