Adapt Postgres queries¶
In this guide, you'll learn how to adapt Postgres queries to run on ClickHouse®, looking at a specific example.
Postgres is great as an OLTP database, but if you need real-time queries on hundreds of millions of rows, it's not going to be fast enough. ClickHouse is. Most queries from Postgres will look very similar on ClickHouse.
Haki Benita has a great blog post on how to do lots of operations like pivot tables, subtotals, linear regression, binning, or interpolation on Postgres, coming from a Pandas background. In this guide, you'll see how to adapt most of the Postgres queries from Haki's post to run on ClickHouse.
Prerequisites¶
You don't need an active Tinybird Workspace to read through this guide, but it's a good idea to read Haki's post first so you're familiar with the examples. In addition, a working knowledge of ClickHouse and SQL is required.
Common table expressions¶
ClickHouse supports CTEs. Both the WITH <expression> AS <identifier>
as well as the WITH <identifier> AS <subquery expression>
syntaxes are supported.
WITH emails AS ( SELECT '[email protected]' AS email ) SELECT * FROM emails ; WITH emails AS ( SELECT '[email protected]' AS email ) SELECT * FROM emails Query id: 6b234b03-6dc4-4ddf-8454-b03d34b75b60 ┌─email─────────────┐ │ [email protected] │ └───────────────────┘ 1 rows in set. Elapsed: 0.014 sec.
A CHAINED CTE ON POSTGRES
WITH emails AS ( SELECT '[email protected]' AS email ), normalized_emails AS ( SELECT lower(email) AS email FROM emails ) SELECT * FROM normalized_emails; WITH emails AS ( SELECT '[email protected]' AS email ), normalized_emails AS ( SELECT lower(email) AS email FROM emails ) SELECT * FROM normalized_emails Query id: c511a113-1852-4a9f-90bf-99d33eba8254 ┌─email─────────────┐ │ [email protected] │ └───────────────────┘ 1 rows in set. Elapsed: 0.127 sec.
On Tinybird¶
For now, we only support the WITH <expression> AS <identifier> syntax. So the previous queries would have to be rewritten like this:
There's a difference with CTEs on Postgres VS ClickHouse. In Postgres, as the original post says, "CTEs are a great way to split a big query into smaller chunks, perform recursive queries and even to cache intermediate results". On ClickHouse, CTEs can only return one row, so those intermediate results can't have multiple rows. For a similar result, on ClickHouse you have to use subqueries.
A common pattern is returning a tuple of groupArrays in the CTE, so you can return more than one row in the form of arrays. Then consume the results in the main query with transform for instance or arrayJoin.
In Tinybird, Pipes act like notebooks where each Node is a subquery and you can refer to the results of one Node in another Node. It's great to see intermediate results and reduce the complexity of your queries. If you'd like to try it out, sign up here.
Generating data¶
As you see in the original article, in Postgres there are several ways to do it:
Union all¶
This works the same in ClickHouse as in Postgres:
UNION ALL ALSO WORKS
WITH dt AS ( SELECT 1 AS id, 'haki' AS name UNION ALL SELECT 2, 'benita' ) SELECT * FROM dt; WITH dt AS ( SELECT 1 AS id, 'haki' AS name UNION ALL SELECT 2, 'benita' ) SELECT * FROM dt Query id: e755e5a5-5e5b-4e8a-a262-935f9946d45d ┌─id─┬─name───┐ │ 2 │ benita │ └────┴────────┘ ┌─id─┬─name─┐ │ 1 │ haki │ └────┴──────┘ 2 rows in set. Elapsed: 0.051 sec.
The VALUES keyword won't work on ClickHouse to select data, only to insert it.
Joining data¶
The join syntax from Postgres will work on ClickHouse, but typically the kinds of analytical data that you'll store on ClickHouse will be orders of magnitude bigger than what you'd store on Postgres, and this would make your joins slow. There are ways to make JOINs faster, check the best practices for writing faster SQL queries or contact us for guidance.
Unnest - arrayJoin¶
arrayJoin is the ClickHouse equivalent of unnest on Postgres. So this Postgres query:
ON CLICKHOUSE, UNNEST EXPANDS AN ARRAY INTO ROWS
WITH dt AS ( SELECT unnest(array[1, 2]) AS n ) SELECT * FROM dt;
Would be rewritten on ClickHouse like this:
ON CLICKHOUSE, ARRAYJOIN EXPANDS AN ARRAY INTO ROWS
SELECT arrayJoin([1, 2]) AS dt
Generating series of data¶
Generate_series¶
The generate_series doesn't exist on ClickHouse, but with the numbers function we can accomplish a lot as well. This is its basic usage:
NUMBERS PRODUCES ROWS
SELECT * FROM numbers(10)
A similar result can be obtained with the range function, that returns arrays. If we only provide an argument, it behaves like numbers. And with range we can also specify a start, end and step:
RANGE PRODUCES ARRAYS
SELECT range(10), range(0, 10, 2)
This, combined with arrayJoin lets us do the same as generate_series:
RANGE OF INTEGERS USING START, END AND STEP
SELECT arrayJoin(range(0, 10, 2)) AS number
Generating time series¶
generate_series can produce results with other types different than integers, while range only outputs integers. But with some smart logic we can achieve the same results. For example, on Postgres you'd generate a with a datetime for each hour in a day this way, as in the original post:
THE GENERATE_SERIES FUNCTION OF POSTGRES
WITH daterange AS ( SELECT * FROM generate_series( '2021-01-01 UTC'::timestamptz, -- start '2021-01-02 UTC'::timestamptz, -- stop interval '1 hour' -- step ) AS t(hh) ) SELECT * FROM daterange; hh ──────────────────────── 2021-01-01 00:00:00+00 2021-01-01 01:00:00+00 2021-01-01 02:00:00+00 ... 2021-01-01 22:00:00+00 2021-01-01 23:00:00+00 2021-01-02 00:00:00+00
Generate a time series specifying the start date and the number of intervals¶
On ClickHouse, you can achieve the same this way:
TIME SERIES ON CLICKHOUSE GIVEN START DATE, NUMBER OF INTERVALS AND INTERVAL SIZE
WITH toDate('2021-01-01') as start SELECT addHours(toDate(start), number) AS hh FROM ( SELECT arrayJoin(range(0, 24)) AS number )
Generate a time series specifying the start and end date and the step¶
Another way of doing the same thing:
TIME SERIES ON CLICKHOUSE GIVEN THE START AND END DATE AND THE STEP SIZE
WITH toStartOfDay(toDate('2021-01-01')) AS start, toStartOfDay(toDate('2021-01-02')) AS end SELECT arrayJoin(arrayMap(x -> toDateTime(x), range(toUInt32(start), toUInt32(end), 3600))) as hh
Generate a time series using timeSlots¶
Using the timeSlots function, we can specify the start (DateTime), duration (seconds) and step (seconds) and it generates an array of DateTime values.
TIME SERIES ON CLICKHOUSE USING THE TIMESLOTS FUNCTION
WITH toDateTime('2021-01-01 00:00:00') AS start SELECT arrayJoin(timeSlots(start, toUInt32(24 * 3600), 3600)) AS hh
Generating a random value¶
The rand function in ClickHouse is akin to random in Postgres, with the difference that rand returns a random UInt32 number between 0 and 4294967295 (). So to get random floats between 0 and 1 like random, you have to divide the result by 4294967295.
GENERATING A RANDOM VALUE ON CLICKHOUSE WITH THE RAND FUNCTION
SELECT rand() random_int, random_int / 4294967295 random_float
To get more than one row, you'd simply do
GENERATING SEVERAL RANDOM VALUES ON CLICKHOUSE
SELECT rand() random_int, random_int / 4294967295 random_float FROM numbers(100)
Generating random integers within a range¶
You would use the floor or ceil function (not round, for the reasons explained here) in addition to the result of rand multiplied by the max of the range of integers you want to generate, like this:
GENERATING SEVERAL RANDOM INTEGERS IN A GIVEN RANGE ON CLICKHOUSE
SELECT ceil(rand() / 4294967295 * 3) AS n FROM numbers(10)
And here you can see that the distribution is uniform (this wouldn't happen if you had use round):
THE DISTRIBUTION IS UNIFORM
SELECT ceil(rand() / 4294967295 * 3) AS n, count(*) FROM numbers(10000) GROUP BY n ORDER BY n
Sampling data from a list¶
This is how you'd take samples with replacement from a list in Postgres:
RANDOM CHOICE IN POSTGRES
SELECT (array['red', 'green', 'blue'])[ceil(random() * 3)] AS color FROM generate_series(1, 5);
In ClickHouse, this is how you'd do it:
RANDOM CHOICE IN CLICKHOUSE
SELECT ['red', 'green', 'blue'][toInt32(ceil(rand() / 4294967295 * 3))] AS color FROM numbers(5)
To get only one value, you'd remove the FROM numbers(5) part. Note that to define an array on ClickHouse you can do it either calling array('red', 'green', 'blue') or with ['red', 'green', 'blue'] like in the code snippet.
Sampling data from a table¶
Sorting data by rand() can be used to get a random sample, like here:
SAMPLE DATA USING RAND
SELECT * FROM events_mat ORDER BY rand() LIMIT 100
But this is slow, as a full scan of the table has to be run here.
A more efficient way to do it is using the SAMPLE clause. You can pass an integer to it (should be large enough, typically above 1000000)
THE SAMPLE CLAUSE, PASSING AN INTEGER
SELECT * FROM events_mat SAMPLE 1000000
And you can also pass a float between 0 and 1, to indicate the fraction of the data that will be sampled.
THE SAMPLE CLAUSE, PASSING A FLOAT
SELECT * FROM events_mat SAMPLE 0.01
Descriptive statistics on a numeric series¶
ClickHouse also comes with lots of statistical function, like Postgres does (see this section of the original post). The first query, written on Postgres this way:
DESCRIPTIVE STATISTICS ON POSTGRES
WITH s AS ( SELECT * FROM (VALUES (1), (2), (3)) AS t(n) ) SELECT count(*), avg(n), stddev(n), min(n), percentile_cont(array[0.25, 0.5, 0.75]) WITHIN GROUP (ORDER BY n), max(n) FROM s; count │ avg │ stddev │ min │ percentile_cont │ max ──────┼────────┼───────────┼─────┼─────────────────┼───── 3 │ 2.0000 │ 1.0000000 │ 1 │ {1.5,2,2.5} │ 3
Would be done on ClickHouse like this:
DESCRIPTIVE STATISTICS ON CLICKHOUSE
SELECT count(*), avg(n), stddevSamp(n), min(n), quantiles(0.25, 0.5, 0.75)(n), max(n) FROM (SELECT arrayJoin([1,2,3]) AS n)
Descriptive statistics on categorical series¶
ClickHouse can also be used to get some statistics from discrete values. While on Postgres you'd do this:
DESCRIPTIVE STATISTICS OF CATEGORICAL VALUES ON POSTGRES
WITH s AS (SELECT unnest(array['a', 'a', 'b', 'c']) AS v) SELECT count(*), count(DISTINCT V) AS unique, mode() WITHIN GROUP (ORDER BY V) AS top FROM s; count │ unique │ top ───────┼────────┼───── 4 │ 3 │ a
On ClickHouse you'd do:
DESCRIPTIVE STATISTICS OF CATEGORICAL VALUES ON CLICKHOUSE
SELECT count(*) AS count, uniq(v) AS unique, topK(1)(v) top FROM (SELECT arrayJoin(['a', 'b', 'c', 'd']) AS v)
uniq will provide approximate results when your data is very big. If you need exact results you can use uniqExact, but be aware that uniq will generally be faster than uniqExact. Check out the topK docs as well if you're interested.
As a side note, if you have categorical columns, most likely you'll have better performance and lower storage cost data types. The performance of using LowCardinality will be better than using the base data types even on columns with more than a few millions of different values. This is what Instana found out, as well - read their full post here:
"When we came across the LowCardinality data type the first time, it seemed like nothing we could use. We assumed that our data is just not homogeneous enough to be able to use it. But when looking at it recently again, it turns out we were very wrong. The name LowCardinality is slightly misleading. It actually can be understood as a dictionary. And according to our tests, it still performs better and is faster even when the column contains millions of different values"
Subtotals and aggregations¶
The same operations done in this section of Haki's post can be done with ClickHouse.
Given a table that contains this data:
EMPLOYEES
SELECT * FROM employees
Finding the number of employees with each role is straightforward, same syntax as on Postgres:
EMPLOYEES PER ROLE
SELECT department, role, count(*) count FROM employees GROUP BY department, role
Using rollup and cube¶
The ROLLUP modifier is also available on ClickHouse, although the syntax is slightly different than on Postgres. This query on Postgres:
GROUP BY WITH ROLLUP ON POSTGRES
SELECT department, role, COUNT(*) FROM employees GROUP BY ROLLUP(department, role);
would be written on ClickHouse like this:
GROUP BY WITH ROLLUP ON CLICKHOUSE
SELECT department, role, COUNT(*) FROM employees GROUP BY department, role WITH ROLLUP
It allows you to have more subtotals (but not all). To have all the subtotals for all the possible combinations of grouping keys, you need to use the CUBE modifier:
GROUP BY WITH CUBE ON CLICKHOUSE
SELECT department, role, COUNT(*) FROM employees GROUP BY department, role WITH CUBE
Pivot tables and conditional expressions¶
Pivot tables let you reshape data when you want typically a column with keys, a column with categories and a column with values, and you want to aggregate those values and use the categories column as columns of a new table.
On Postgres you could do it this way:
PIVOT TABLE CREATED MANUALLY ON POSTGRES
SELECT role, COUNT(*) FILTER (WHERE department = 'R&D') as "R&D", COUNT(*) FILTER (WHERE department = 'Sales') as "Sales" FROM employees GROUP BY role; role │ R&D │ Sales ───────────┼─────┼─────── Manager │ 1 │ 1 Developer │ 2 │ 2
On ClickHouse, you could do the same this way:
PIVOT TABLE CREATED MANUALLY ON POSTGRES
SELECT role, countIf(department = 'R&D') as "R&D", countIf(department = 'R&D') as "Sales" FROM employees GROUP BY role
The original post doesn't mention this, but Postgres has a very convenient crosstab function that lets you what we've done in one line. If the number of the categories to pivot is too large, you can imagine how long this query could be become if done manually and how handy the crosstab function could get. Something like this is not available yet on ClickHouse, unfortunately.
Running and Cumulative Aggregations¶
Aggregations over sliding windows are a common solution. This can be done with Window functions.
This also can be done with the groupArrayMovingSum and groupArrayMovingAvg functions, available in stable releases since a long time ago already. This is an example of its usage:
Given this dataset:
GOOGLE TRENDS DATA FOR THE TERM 'AMAZON'
SELECT date, amazon as value FROM amazon_trends
We could compute a 7-day moving average of value like this:
7-DAY MOVING AVERAGE
SELECT * FROM (SELECT groupArray(date) as date_arr, groupArray(value) as value_arr, groupArrayMovingAvg(7)(value) mov_avg FROM (SELECT date, amazon as value FROM amazon_trends)) ARRAY JOIN *
The periods parameter is optional. If you omit it, all the previous rows are used for the aggregation.
Linear regression¶
Given this data
DATA TO PERFORM LINEAR REGRESSION
SELECT arrayJoin([[1.2, 1], [2, 1.8], [3.1, 2.9]])[1] x, arrayJoin([[1.2, 1], [2, 1.8], [3.1, 2.9]])[2] y
On Postgres we can see in the the original post that you'd do linear regression like this:
LINEAR REGRESSION ON POSTGRES
WITH t AS (SELECT * FROM (VALUES (1.2, 1.0), (2.0, 1.8), (3.1, 2.9) ) AS t(x, y)) SELECT regr_slope(y, x) AS slope, regr_intercept(y, x) AS intercept, sqrt(regr_r2(y, x)) AS r FROM t; slope │ intercept │ r ────────────────────┼──────────────────────┼─── 1.0000000000000002 │ -0.20000000000000048 │ 1
There's not a function on ClickHouse like regr_r2 that gives you the R2 coefficient, but it wouldn't be hard to calculate it yourself as the formula is simple.
Filling null values.¶
This part is called "interpolation" in Haki's post. Filling null values with Pandas is a one-liner. Imagine we have this table
A TABLE WITH A INT AND A STRING COLUMN, WITH SOME NULL VALUES
SELECT * FROM num_str
Fill null values with a constant value¶
The way to replace all the null values by a constant value is using the coalesce function, that works in ClickHouse the same way it does in Postgres, using coalesce:
FILLING NULL VALUES WITH A CONSTANT
SELECT n, coalesce(v, 'X') AS v FROM num_str
Back and forward filling data¶
This is also a one-liner in Pandas. On the section of the original post, the author does this using correlated subqueries, but those aren't supported yet on ClickHouse. Fortunately, ClickHouse comes with a lot of powerful array functions like groupArray, arrayFill and arrayReverseFill.
groupArray, as the other aggregate functions on ClickHouse, skips null values. So the solution involves replacing them by another value (make sure that the new value doesn't appear in the column before). This is done with the ifNull function. Add some array magic in, and this is how you'd do it:
BACK AND FORWARD FILLING VALUES
SELECT values.1 n, values.2 v, values.3 v_ffill, values.4 v_bfill FROM (SELECT arrayJoin( arrayZip( groupArray(n) AS n, arrayMap(x -> x != 'wadus' ? x : null, groupArray(v_nulls_replaced)) AS v, arrayFill(x -> x != 'wadus', groupArray(v_nulls_replaced)) AS v_ffill, arrayReverseFill(x -> x != 'wadus', groupArray(v_nulls_replaced)) AS v_bfill ) ) values FROM (SELECT *, ifNull(v, 'wadus') v_nulls_replaced FROM num_str ORDER BY n ASC) )
To understand what's going on here, import this Pipe with a step-by-step explanation and the results of the transformations that are taking place. Tinybird lets you run each subquery in a Node of a notebook-like UI (we call them Pipes). This lets you build and debug complex queries in a cleaner way. If you'd like to use it, sign up here.
Filling gaps in time-series, reshaping indexes.¶
Sometimes, you'll group a time-series by a Date or DateTime column, and it can happen that the intervals between rows are not always the same because there were no values found for some dates or datetimes. In Pandas, the solution would be creating a new date_range index and then re-indexing the original Series/DataFrame with that index, as explained here.
On ClickHouse, the same can be accomplished with the WITH FILL modifier. Here's a simple example of it:
FILLING GAPS IN TIME ON CLICKHOUSE
SELECT toDate((number * 2) * 86400) AS d1, 'string_value' as string_col, toInt32(rand() / exp2(32) * 100) as n FROM numbers(10) ORDER BY d1 WITH FILL STEP 1
The STEP 1
part is not necessary here as it's the default, but know that you can set a different value than 1.
Linear interpolation¶
Imagine you have a table like this, containing a time-series with some rows missing. We could fill those missing gaps with the WITH FILL expression previously shown, but that way we'd just get zeroes when there's a missing value, while the actual missing value is probably closer to the previous and the next values than to zero.
TIME-SERIES WITH MISSING ROWS
SELECT *, bar(value, 0, 100, 20) FROM trends_with_gaps ORDER BY date WITH FILL STEP 1
Linear interpolation is the simplest way to fill those missing values. In it, missing values are replaced by the average of the previous and the next known values. On Postgres, Haki's post explains how to do it here.
On ClickHouse, this can be done with arrays. Check out this great post by Altinity for an in-depth explanation. This is how it'd be done with this dataset:
INTERPOLATING VALUES
SELECT date, value, value_interpolated, bar(value_interpolated, 0, 100, 20) AS value_interpolated_bar FROM ( SELECT groupArray(date) AS dt_arr, groupArray(value) AS value_arr, arrayFill(x -> ((x.1) > 0), arrayZip(value_arr, dt_arr)) AS value_lower, arrayReverseFill(x -> ((x.1) > 0), arrayZip(value_arr, dt_arr)) AS value_upper, arrayMap((l, u, v, dt) -> if(v > 0, v, (l.1) + ((((u.1) - (l.1)) / ((u.2) - (l.2))) * (dt - (l.2)))), value_lower, value_upper, value_arr, dt_arr) AS value_interpolated FROM ( SELECT * FROM trends_with_gaps ORDER BY date WITH FILL STEP 1 ) ) ARRAY JOIN dt_arr AS date, value_interpolated, value_arr AS value
For a step-by-step explanation of how this works, and to see how you could construct this query iteratively with a notebook-like interface on Tinybird, import this Pipe.
Binning and histograms¶
The original post talks about custom binning, equal-width and equal-height binning. The way to do custom binning is very similar on ClickHouse, which also supports CASE statements. Equal height binning could be achieved with the quantiles function, already described before, in the descriptive statistics section. The most interesting use-case of equal-width binning is creating histograms, which is very easy on ClickHouse. It even comes with a histogram function, which receives a number of bins and the data, and returns a list of tuples containing the lower and upper bounds of each bucket, as well as its height:
THE HISTOGRAM FUNCTION
SELECT histogram(10)(value) AS values FROM trends_with_gaps
Then, extracting each values from the tuples and even having a visual representation of the data can be done like this:
Running ClickHouse without worrying about it¶
Tinybird lets you do real-time analytics on huge amounts of data, powered by ClickHouse, without having to worry about scalability, hosting or maintaining any ClickHouse clusters. With it, you can ingest huge datasets and streaming data, analyze it with SQL and publish dynamic API Endpoints on those queries a couple of clicks.
Our product is already being used by some big companies and we've recently been featured on Techcrunch. To use Tinybird, sign up here