Long story short, since PostgreSQL 14 to_tsquery('pg_class') becomes 'pg' <-> 'class' instead of 'pg' & 'class' (commit 0c4f355c6a). That is, for instance, in PostgreSQL 13 and earlier to_tsquery('pg_class') matches to_tsvector('a class of pg'). But since PostgreSQL 14 it doesn't match anymore, while it still matches to_tsvector('pg_class') and to_tsvector('pg*class').
This is an incompatible change, which affects FTS users, but we had to do it in order to fix phrase search design problems.
The story started with a bug: to_tsvector('pg_class pg') didn't match websearch_to_tsquery('"pg_class pg"').
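A minimal reproduction of that report might look like this (a sketch; the PostgreSQL 13 result follows from the description above):

```sql
SELECT to_tsvector('pg_class pg') @@ websearch_to_tsquery('"pg_class pg"');
-- PostgreSQL 13 and earlier: false, although the query is the exact text of the document
```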
Looks strange! Naturally, when you search for some text in quotes, you expect it to match at least the exact same text in the document. But it doesn't. My first idea was that it's just a bug in the websearch_to_tsquery() function, but to_tsquery() appears to have the same problem: to_tsquery('pg_class <-> pg') doesn't match to_tsvector('pg_class pg') either.
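The same sketch with the explicit phrase operator (again, the PostgreSQL 13 behavior described above):

```sql
SELECT to_tsvector('pg_class pg') @@ to_tsquery('pg_class <-> pg');
-- PostgreSQL 13 and earlier: false
```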
I was surprised that although phrase search arrived many years ago, such basic things don't work. Looking under the hood, both websearch_to_tsquery('"pg_class pg"') and to_tsquery('pg_class <-> pg') compile into ( 'pg' & 'class' ) <-> 'pg'.
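You can check the compiled tsquery directly (a sketch; on PostgreSQL 13 both expressions produce the form quoted above):

```sql
SELECT websearch_to_tsquery('"pg_class pg"'),
       to_tsquery('pg_class <-> pg');
-- both yield ( 'pg' & 'class' ) <-> 'pg' on PostgreSQL 13
```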
This tsquery expects both pg and class to be one position to the left of another pg. That means both pg and class need to reside in the same position. In principle, that's possible, for instance, when a single word is split into two synonyms by a fulltext dictionary. But that's not our case. When we parse the pg_class pg text, each word gets its position sequentially. No two of them reside in the same position.
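You can see the sequential positions in the tsvector itself (a sketch; the positions follow from how the default parser splits the text at the underscore, as described above):

```sql
SELECT to_tsvector('pg_class pg');
-- 'class':2 'pg':1,3  (every word occupies its own position)
```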
Why does tsquery parsing work this way? Historically, in PostgreSQL fulltext search to_tsquery('pg_class') compiles into 'pg' & 'class'. Therefore, pg and class don't have to appear together. Before phrase search, that was the only way to process this query once we split pg_class into pg and class. Thus, querying compound words was a bit relaxed. But now, when combined with phrase search, it becomes unreasonably strict.
My original intention was to choose the way to compile pg_class depending on the context. With a phrase search operator nearby, pg_class should become 'pg' <-> 'class', but stay 'pg' & 'class' in the rest of the cases. But this way required invasive refactoring of tsquery processing, taking more time than I could spend on this bug.
Fortunately, Tom Lane came up with a proposal to always compile pg_class into 'pg' <-> 'class'. Thus, now both websearch_to_tsquery('"pg_class pg"') and to_tsquery('pg_class <-> pg') compile into 'pg' <-> 'class' <-> 'pg'. And both of them match to_tsvector('pg_class pg'). That is a win!
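On PostgreSQL 14 the same checks now succeed (a sketch mirroring the examples above):

```sql
SELECT to_tsvector('pg_class pg') @@ websearch_to_tsquery('"pg_class pg"'),
       to_tsvector('pg_class pg') @@ to_tsquery('pg_class <-> pg');
-- both return true on PostgreSQL 14
```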
This approach makes all queries involving compound words more strict. But first, it appears to be the only easy way to fix this design bug. Second, it is probably a better way to handle compound words themselves.
And AFAICS, this approach seems to be the right one. Thanks to it, yet another phrase search bug appears to be quite easy to fix.
Happy phrase searching in PostgreSQL 14! Hopefully, we will manage without further incompatible changes :)
It seems a good idea to change grey psql output to a lovely rainbow in honor of IDAHOT day. Thankfully, there is the lolcat utility, which is very easy to install on both Linux and Mac OS.
Having lolcat installed, you can set it up as a psql pager and get lovely rainbow psql output!
PostgreSQL has an extension to jsonpath: the ** operator, which explores arbitrary depth, finding your values everywhere. At the same time, there is the lax mode, defined by the standard, providing a "relaxed" way of working with json. In lax mode, accessors automatically unwrap arrays; missing keys don't trigger errors; etc. In short, it appears that the ** operator and lax mode weren't designed to be used together :)
The story started with a bug report. The simplified version is below. The jsonpath query is intended to select the value of the key "y" everywhere. But it appears to select these values twice.
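A sketch of such a query (hypothetical data; the duplication shows up for values wrapped in arrays):

```sql
SELECT jsonb_path_query(
    '{"a": [{"y": 1}], "b": {"y": 2}}',
    'lax $.**.y');
-- returns 1, 1, 2: the value inside the array is selected twice
```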
This case looks like a bug. But is it? Let's dig into the details and split the jsonpath query into two parts: one containing the ** operator and another having the key accessor.
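Roughly like this (same hypothetical data as above):

```sql
SELECT jsonb_path_query('{"a": [{"y": 1}]}', 'lax $.**');
-- selects the whole object, the array, the object inside it, and the scalar 1
SELECT jsonb_path_query('[{"y": 1}]', 'lax $.y');
-- in lax mode the key accessor unwraps the array and returns 1
```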
As you can see, the ** operator selects every child in the json document, as expected. The key accessor extracts corresponding values both from the objects themselves and from their wrapping arrays. And that's also expected in the lax mode. So, it appears there is no bug; everything works as designed, although it's surprising for users.
Finally, I’ve committed a paragraph to the
docs,
which explicitly clarifies this issue.
It seems that lax mode and the ** operator just aren't designed to be used together. If you need the ** operator, you can use strict mode, and everything is intuitively correct.
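A sketch of the strict-mode variant; the silent flag is passed here to suppress the structural errors that strict mode raises for descendants that aren't objects or lack the "y" key:

```sql
SELECT jsonb_path_query(
    '{"a": [{"y": 1}], "b": {"y": 2}}',
    'strict $.**.y',
    '{}',   -- vars
    true);  -- silent
-- returns 1 and 2, each exactly once
```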
The world changes. The ARM architecture breaks into new areas of computing. Only a decade ago, only your mobile, router, or other specialized device could be ARM-based, while your desktop and server were typically x86-based. Nowadays, your new MacBook is ARM-based, and your EC2 instance could be ARM as well.
In mid-2020, Amazon made Graviton2 instances publicly available. The maximum number of CPU cores there is 64. This is where it becomes interesting to check PostgreSQL scalability. It's exciting to check because ARM implements atomic operations using a pair of load/store instructions. So, in a sense, ARM is just like Power, where I've previously seen a significant effect from platform-specific atomics optimizations.
But on the other hand, ARM 8.1 defines the LSE instruction set, which, in particular, provides a way to implement an atomic operation in a single instruction (just like x86). What would be better: a special optimization, which puts custom logic between the load and store instructions, or just a simple loop of LSE CAS instructions? I've tried them both.
You can see the results of read-only and read-write pgbench on the graphs below (details on the experiments are here). pg14-devel-lwlock-ldrex-strex is the patched PostgreSQL with the special load/store optimization for lwlock, and pg14-devel-lse is PostgreSQL compiled with LSE support enabled.
You can see that the load/store optimization gives a substantial positive effect, but LSE rocks here!
So, if you’re running PostgreSQL on graviton2 instance, make sure you’ve binaries compiled with LSE support (see the instruction) because the effect is dramatic.
BTW, it appears that none of these optimizations has a noticeable effect on the performance of Apple M1. Probably, M1 has a smart enough internal optimizer to recognize that these different implementations are equivalent. And it was surprising that LSE usage might give a small negative effect on Kunpeng 920. It was discouraging to learn of an ARM processor where a single-instruction operation is slower than its multiple-instruction equivalent. Hopefully, processor architects will fix this in new Kunpeng processors.
In general, we see that different ARM implementations now have different performance characteristics and different effects from optimizations. Hopefully, this is a growing pain, and it will be overcome soon.
Update: As Krunal Bauskar pointed out in the comments, LSE instructions are still faster than the load/store option on Kunpeng 920. Different timings might cause the regression. For instance, with LSE instructions, we could simply reach the regression caused by another bottleneck faster.
Long story short, using PostgreSQL 11 with a RUM index you can do both TOP-N queries and COUNT(*) for non-selective FTS queries without fetching all the results from the heap (which means much faster). Are you bored yet? If not, please read the detailed description below.
On November 1st, 2017, Tom Lane committed a patch enabling bitmap scans to behave like index-only scans when possible. In particular, since PostgreSQL 11, COUNT(*) queries can be evaluated using bitmap scans without accessing the heap when the corresponding bit in the visibility map is set. This patch was written by Alexander Kuzmenkov and reviewed by Alexey Chernyshov (both are my Postgres Pro colleagues), and it was heavily revised by Tom Lane.
This commit might seem to be just another planner and executor optimization, nice but not deserving much attention. However, under detailed consideration this patch appears to be a significant step towards making full text search in PostgreSQL work the right way.
I’ve started working on FTS improvements in 2012. That time I realized that GIN index is good for selective FTS queries, when number of matching results is low. See the example below: GIN did great work for us by returning just few dozens of matching rows very fast. The rest operations including relevance calculation and sorting are also fast, because they are performed over very small row set.
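A sketch of such a query, assuming a hypothetical posts table with a precomputed fts tsvector column and a GIN index on it:

```sql
SELECT id, ts_rank(fts, q) AS rank
FROM posts, to_tsquery('english', 'rare & term') q
WHERE fts @@ q          -- the GIN index returns only a few rows here
ORDER BY rank DESC
LIMIT 10;
```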
But the situation is different if the FTS query is not selective and the number of matching rows is high. Then we have to fetch all those rows from the heap, calculate the relevance for each of them, and sort them. And although we only need the TOP-10 rows, this query takes a lot of time.
How can we improve this situation? If we could get results from the index pre-ordered by relevance, then we would be able to evaluate a TOP-N query without fetching the whole set of matching rows from the heap. Unfortunately, that appears to be impossible for a GIN index, which stores only the fact of occurrence of specific terms in a document. But if we had additional information about term positions in the index, then it might work. That information would be enough to calculate the relevance based on index information alone.
Thus, I’ve proposed proposed a set of patches to GIN index. Some improvements were committed including index compression and index search optimization. However, additional information storage for GIN index wasn’t committed, because it alters GIN index structure too much.
Fortunately, we have extensible index access methods in PostgreSQL 9.6. And that enables us to implement the things which weren't committed to GIN, and more, as a separate index access method: RUM. Using RUM, one can execute a TOP-N FTS query much faster without fetching all the matching rows from the heap.
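A sketch with the same hypothetical posts table; RUM provides the <=> ordering operator, so the TOP-N query can be served directly from the index:

```sql
CREATE EXTENSION rum;
CREATE INDEX posts_fts_rum_idx ON posts USING rum (fts rum_tsvector_ops);

SELECT id
FROM posts
WHERE fts @@ to_tsquery('english', 'common & term')
ORDER BY fts <=> to_tsquery('english', 'common & term')
LIMIT 10;
```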
However, the problem persisted if you needed the total count of matching rows. Then the PostgreSQL executor still had to fetch all the matching rows from the heap in order to check their visibility. So, if you need the total number of resulting rows for pagination, it might still be very slow.
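The count query itself is trivial; the point is how it is executed. Before PostgreSQL 11 it had to visit the heap for every matching row, while since PostgreSQL 11 a bitmap scan can skip heap access for all-visible pages (a sketch with the same hypothetical table):

```sql
SELECT count(*)
FROM posts
WHERE fts @@ to_tsquery('english', 'common & term');
```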
For sure, some modern UIs use techniques like continuous scrolling, which don't require showing the full number of results to the user. Also, one can use the planner's estimate of the number of resulting rows, which typically matches the actual number within an order of magnitude. But nevertheless, slow counting of the total number of results was a problem for many RUM users.
It’s not very widely known, but PostgreSQL is gathering statistics for indexed expressions. See following example.
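A sketch reproducing the setup described below (table and index names are mine):

```sql
CREATE TABLE test AS
    SELECT random() AS x, random() AS y
    FROM generate_series(1, 1000000);
ANALYZE test;

EXPLAIN ANALYZE SELECT * FROM test WHERE x + y < 0.01;
-- estimated rows: ~333333 (1/3 of the table), actual rows: a few dozen
```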
We created a table with two columns, x and y, whose values are independently and uniformly distributed from 0 to 1. Although we analyzed the table, the PostgreSQL optimizer estimates the selectivity of the x + y < 0.01 qual as 1/3. You can see that this estimate is not even close to reality: we actually selected 56 rows instead of the 333333 rows estimated. This estimate comes from a rough assumption that the < operator selects 1/3 of rows unless something more precise is known. Of course, the planner could do better in this case. For instance, it could try to calculate a histogram for x + y from the separate histograms for x and y. However, the PostgreSQL optimizer doesn't perform such costly and complex computations for now.
The situation changes once we define an index on x + y.
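Continuing the sketch above:

```sql
CREATE INDEX test_x_plus_y_idx ON test ((x + y));
ANALYZE test;

EXPLAIN ANALYZE SELECT * FROM test WHERE x + y < 0.01;
-- the row estimate drops to the hundreds (641 in the run described below)
```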
Besides the index getting used for this query, there is a much more accurate estimate of the number of rows selected by x + y < 0.01. The estimate improved because PostgreSQL now gathers separate statistics for the x + y expression. You can check that by querying the system catalog.
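A sketch of such a catalog query; statistics for an indexed expression show up in pg_stats under the index's name:

```sql
SELECT tablename, attname, most_common_vals, histogram_bounds
FROM pg_stats
WHERE tablename = 'test_x_plus_y_idx';
```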
So, there are a histogram, most common values, etc. for the x + y expression, and that leads to a more accurate selectivity estimate for x + y < 0.01. However, there is still a one-order-of-magnitude error (641 rows estimated instead of 56). Could we improve that? Yes, PostgreSQL has a statistics-gathering target parameter, which is tunable per column using the ALTER TABLE … SET STATISTICS … command. Using this command, you may tune the size of the statistics arrays.
But, uhhhh, in our case we have no column, we have an indexed expression. That appears to be a problem, since there is no documented way to tune the statistics target for it…
Nevertheless, it appears to be possible. There is a gotcha which allows advanced DBAs to do that.
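The gotcha, sketched against the index created above (expr is the internal column name discussed below):

```sql
ALTER INDEX test_x_plus_y_idx ALTER COLUMN expr SET STATISTICS 10000;
ANALYZE test;

EXPLAIN ANALYZE SELECT * FROM test WHERE x + y < 0.01;
-- estimated rows: 69, actual rows: 56
```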
That works. When we collect statistics arrays of size 10000, the estimate becomes 69 rows. That's only a 23% estimation error, which is more than good enough for query planning.
But… what the hell is ALTER INDEX ... SET STATISTICS ...?! There is nothing like this in the PostgreSQL documentation!
Let’s understand this situation step by step.
ALTER INDEX
and ALTER TABLE
share the same bison rule.ALTER INDEX
is not applicable are filtered runtime.ALTER INDEX ... SET STATISTICS ...
is not forbidden and works the same way as ALTER TABLE ... SET STATISTICS ...
does.expr
, expr1
, expr2
…There was some short discussion about that in pgsql-hackers mailing lists. The conclusion was that this should be documented, but it’s not yet done. I also think that we should invent some better syntax for that instead of usage of internal column names.
Today I gave the talk "Our answer to Uber" at United Dev Conf, Minsk. The slides can be found on SlideShare. In my talk I attempted to review Uber's notes and summarize the community's efforts to overcome the highlighted shortcomings.
United Dev Conf is a quite big IT conference with more than 700 attendees. I'd like to note that interest in PostgreSQL is quite high. The room was almost full during my talk. Also, after the talk I spent about an hour answering questions.
I think that Minsk is a very attractive place for IT events. It has everything required: lovely venues, good and inexpensive hotels, and developed infrastructure. Additionally, Belarus introduced 5-day visa-free travel for citizens of 80 countries, which made conference attendance much easier for many people. It would be nice to have PGDay.Minsk one day.
Faceted search is a very popular buzzword nowadays. In short, the specialty of faceted search is that its results are organized per category. Popular search engines provide special support for faceted search.
Let’s see what PostgreSQL can do in this field. At first, let’s formalize our task. For each category which have matching documents we want to obtain:
For sure, it’s possible to query such data using multiple per category SQL queries. But we’ll make it in a single SQL query. That also would be faster in majority of cases. The query below implements faceted search over PostgreSQL mailing lists archives using window functions and CTE. Usage of window function is essential while CTE was used for better query readability.
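A sketch of such a query, assuming a hypothetical messages table with list, subject, and a precomputed fts tsvector column (the original query ran against the real archive schema):

```sql
WITH msg AS (
    SELECT
        list,
        subject,
        ts_rank(fts, q) AS rank,
        count(*) OVER (PARTITION BY list) AS total,
        row_number() OVER (PARTITION BY list
                           ORDER BY ts_rank(fts, q) DESC) AS rn
    FROM messages, to_tsquery('english', 'index & bloat') q
    WHERE fts @@ q
)
SELECT json_object_agg(list,
           json_build_object('total', total, 'top', top))
FROM (
    SELECT list, total,
           json_agg(json_build_object('subject', subject, 'rank', rank)
                    ORDER BY rank DESC) AS top
    FROM msg
    WHERE rn <= 5
    GROUP BY list, total
) per_list;
```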
The resulting JSON document contains the total count of matching mailing list messages and the TOP-5 relevant messages for each list.
In the plan of this query, we can see that the message_body_idx GIN index is scanned only once, and this is great.
Thus, it appears that nothing prevents you from implementing trendy kinds of search using good old SQL and powerful features of PostgreSQL, including fulltext search, JSON support, window functions, etc.
When dealing with partitioned tables, we can't always select the relevant partitions during query planning. Naturally, during query planning you can't know values which come from a subquery or from the outer part of a nested loop join. Nevertheless, it would be ridiculous to scan all the partitions in such cases.
This is why my Postgres Professional colleague Dmitry Ivanov developed a new custom executor node for pg_pathman: RuntimeAppend. This node behaves like the regular Append node: it contains a set of child nodes which should be appended. However, RuntimeAppend has one distinction: on each run it selects only the relevant children to append, based on parameter values.
Let’s consider example: join of journal
table which contains row per each
30 seconds of year partitioned by day, and q
table which refers 1000 random
rows of journal
table. Without RuntimeAppend optimizer selects Hash Join
plan.
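A sketch of that setup, assuming pg_pathman's create_range_partitions() (the table layout and names are mine):

```sql
CREATE EXTENSION pg_pathman;

CREATE TABLE journal (dt timestamp NOT NULL, level integer, msg text);
INSERT INTO journal
    SELECT g, (random() * 6)::integer, md5(g::text)
    FROM generate_series('2015-01-01'::timestamp,
                         '2015-12-31 23:59:30'::timestamp,
                         '30 seconds'::interval) g;
SELECT create_range_partitions('journal', 'dt',
                               '2015-01-01'::timestamp, interval '1 day');

-- 1000 random rows referencing the partitioning key
CREATE TABLE q AS SELECT dt FROM journal ORDER BY random() LIMIT 1000;

EXPLAIN ANALYZE SELECT * FROM q JOIN journal USING (dt);
```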
The Hash Join takes 256 milliseconds to execute and 29 milliseconds to plan. The relatively high planning time is expected because all the partitions are present in the plan. It's surprising that the optimizer didn't select a Nested Loop join. Let's force it to do so with enable_hashjoin = off and enable_mergejoin = off.
The Nested Loop join takes 456 milliseconds to execute. This is even worse, but it is understandable, because we have to scan each partition of journal for each row of q.
Finally, let’s enable RuntimeAppend.
The Nested Loop join with RuntimeAppend takes only about 9 milliseconds to execute! Such fast execution is possible because RuntimeAppend scans only the one relevant partition of journal for each row of q.
Nevertheless, all the partitions are still present in the plan, and the planning time is still quite high. This relatively high planning time might not be so significant for prepared statements or long OLAP queries.
However, long planning time appears to be not the only problem. We ran a benchmark where the RuntimeAppend node returns just a few rows in a prepared statement. Although high planning time doesn't affect prepared statements, TPS was a few times lower than without partitioning. After running perf, we got this flamegraph. The flamegraph shows that we spend very significant time locking and unlocking every partition. Naturally, locking 365 partitions doesn't use fast-path locking and turns out to be a significant overhead.
Thus, we see how huge a benefit runtime partition selection can bring. However, in the current design, having all the partitions in the plan causes high overhead. The solution could be found in redesigning partition locking. We are researching this problem now. It's likely that it can't be solved within the boundaries of an extension, and a proper solution requires hacking the PostgreSQL core.
For people who are actively working with psql, it frequently happens that you want to draw a graph for the table you're currently seeing. Typically, that means a cycle of actions: exporting the data, importing it into a graph drawing tool, and drawing the graph itself. It appears that this process can be automated: the graph can be drawn by typing a single command directly in psql. See an example on the screenshot below.
It might seem like magic, but actually there is absolutely no magic. iTerm2 supports image inlining since version 3, which is currently in beta. Thus, if we output an image surrounded by the corresponding escape sequences, it will appear in the terminal. On the psql side, we need to redirect output to a script which does that. We can define a macro to simplify this, like in one of my previous posts.
And finally, we need a pg_graph script which parses psql output, draws the graph, and puts it into stdout. I wrote one using Python and matplotlib. It recognizes the first column as the series of X-values and the rest of the columns as series of Y-values. If the first column contains only decimal values, it draws a plot chart; otherwise, it draws a bar chart.
Thereby, it’s not hard to teach psql to do more things. Also, we can consider some improvements to psql including:
\g
which would make it easier to parse psql
output from scripts;PostgreSQL scalability on multicore and multisocket machines became a subject of optimization long time ago once such machines became widely used. This blog post shows brief history of vertical scalability improvements between versions 8.0 and 8.4. PostgreSQL 9.2 had very noticeable scalability improvement. Thanks to fast path locking and other optimizations it becomes possible to achieve more than 350 000 TPS in select-only pgbench test. The latest stable release PostgreSQL 9.5 also contain significant scalability advancements including LWLock improvement which allows achieving about 400 000 TPS in select-only pgbench test.
The Postgres Professional company also became involved in scalability optimization. In partnership with IBM, we researched PostgreSQL scalability on modern Power8 servers. The results of this research were published in the popular Russian blog habrahabr (Google translated version). As a brief result of this research, we identified two ways to improve PostgreSQL scalability on Power8: lock-free pinning and unpinning of buffers based on atomic operations, and a Power-specific implementation of LWLocks based on direct use of load-linked/store-conditional instructions instead of a compare-and-swap loop.
Optimization #1 appears to give a huge benefit on big Intel servers as well, while optimization #2 is Power-specific. After long rounds of optimization, cleaning, and testing, #1 was finally committed by Andres Freund.
On the graph above, the following PostgreSQL versions were compared:
The alignment issues are worth some explanation. Initially, I complained about a performance regression introduced by commit 5364b357, which increased the number of clog buffers. That was strange in itself, because a read-only benchmark shouldn't look into the clog thanks to hint bits. As expected, it appears that clog buffers don't really affect read-only performance directly; 5364b357 just changed the layout of shared memory structures.
It appears that the read-only benchmark became very sensitive to the layout of shared memory structures. As a result, performance varies significantly depending on shared_buffers, max_connections, and other options which influence shared memory distribution. When I gave Andres access to that big machine, he very quickly found a way to take care of the performance irregularity: make all PGXACTs fully cacheline-aligned. Without this patch, SnapshotResetXmin() dirties a processor cacheline containing multiple PGXACTs. With this patch, SnapshotResetXmin() dirties a cacheline with only a single PGXACT. Thus, GetSnapshotData() has many fewer cache misses. That was a surprising and good lesson for me. I knew that alignment influences performance, but I didn't expect this influence to be so huge. The PGXACT cacheline alignment issue was discovered after the feature freeze for 9.6. That means it will be a subject for 9.7 development. Nevertheless, 9.6 has a very noticeable scalability improvement.
Therefore, a single PostgreSQL instance delivers more than 1 million TPS, and one could say that PostgreSQL opens a new era of millions of TPS.
P.S. I’d like to thank:
PostgreSQL 9.6 receives proper support for extensible index access methods. And that's good news, because Postgres was initially designed to support them.
“It is imperative that a user be able to construct new access methods to provide efficient access to instances of nontraditional base types”
Michael Stonebraker, Jeff Anton, Michael Hirohama. Extendability in POSTGRES, IEEE Data Eng. Bull. 10 (2), pp. 16-23, 1987
That was a huge piece of work consisting of multiple steps, including the CREATE ACCESS METHOD command, which provides a legal way to insert into pg_am with support for dependencies and pg_dump/pg_restore; it was committed by Alvaro Herrera.

I am very thankful for the efforts of the committers and reviewers who made it possible to include these features into PostgreSQL.
However, end users don’t really care about this infrastructure. They do care about features we can provide on the base of this infrastructure. Actually, we would be able to have index access methods which are:
Also, I consider this work (together with FDW) as an approach to pluggable storage engines. I will speak about this during my talk at PGCon 2016.
Recently, Robert Haas committed a patch which allows seeing more detailed information about the current wait event of a process. In particular, users will be able to see whether a process is waiting for a heavyweight lock, a lightweight lock (either an individual one or a tranche), or a buffer pin. The full list of wait events is available in the documentation. Hopefully, there will be more wait events in future releases.
It’s nice to see current wait event of the process, but just one snapshot is not very descriptive and definitely not enough to do any conclusion. But we can use sampling for collecting suitable statistics. This is why I’d like to present pg_wait_sampling which automates gathering sampling statistics of wait events. pg_wait_sampling enables you to gather statistics for graphs like the one below.
Let me explain how I drew this graph. pg_wait_sampling samples wait events into two destinations: history and profile. The history is an in-memory ring buffer and the profile is an in-memory hash table with accumulated statistics. We're going to use the second one to see the intensity of wait events over time periods.
At first, let’s create table for accumulated statistics. I’m doing these experiments on my laptop, and for the simplicity this table will live in the instance under monitoring. But note, that such table could live on the another server. I’d even say it’s preferable to place such data to another server.
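A sketch of such a table, mirroring the columns of the pg_wait_sampling_profile view plus a timestamp (the names are mine):

```sql
CREATE TABLE profile_log (
    ts          timestamptz,
    event_type  text,
    event       text,
    count       int8
);
```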
Secondly, I wrote a function to copy data from the pg_wait_sampling_profile view into the profile_log table and reset the profile data. This function returns the number of rows inserted into the profile_log table. Also, this function discards the pid number and groups the data by wait event, but it doesn't necessarily have to be so.
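A sketch of such a function, assuming the profile_log table above and pg_wait_sampling's pg_wait_sampling_reset_profile() function:

```sql
CREATE OR REPLACE FUNCTION write_profile_log() RETURNS integer AS $$
DECLARE
    result integer;
BEGIN
    -- discard pid: group the profile by wait event only
    INSERT INTO profile_log
        SELECT current_timestamp, event_type, event, sum(count)
        FROM pg_wait_sampling_profile
        WHERE event IS NOT NULL
        GROUP BY event_type, event;
    GET DIAGNOSTICS result = ROW_COUNT;
    PERFORM pg_wait_sampling_reset_profile();
    RETURN result;
END
$$ LANGUAGE plpgsql;
```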
And then I ran a psql session where I set up a watch of this function. Monitoring of our system has started. For real usage, it's better to schedule this command using cron or something similar.
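For instance, like this (a sketch; the 10-second interval matches the description below):

```sql
SELECT write_profile_log();
\watch 10
```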
We can see that write_profile_log returns 0. That means we didn't insert anything into profile_log. And this is right, because the system is not under load now. Let's create some load using pgbench.
In the parallel session, we can see that write_profile_log starts to insert data into the profile_log table.
Finally, let’s examine the profile_log table.
How do we interpret this data? In the first row, we can see that the count for the tuple lock at 14:03:19 is 41. The pg_wait_sampling collector samples wait events every 10 ms, while the write_profile_log function writes a snapshot of the profile every 10 s. Thus, there were 1000 samples per backend during this period. Taking into account that there were 10 backends serving pgbench, we can read the first row as "from 14:03:09 to 14:03:19 the backends spent about 0.41% of their time waiting for a tuple lock".
That’s it. This blog post shows how you can setup a wait event monitoring of your database using pg_wait_sampling extension with PostgreSQL 9.6. This example was given just for introduction and it is simplified in many ways. But experienced DBAs would easily adopt it for their setups.
P.S. Every kind of monitoring has some overhead. The overhead of wait monitoring was the subject of hot debates on the mailing lists. This is why features like exposing wait event parameters and measuring each wait event individually are not yet in 9.6. But sampling also has overhead. I hope pg_wait_sampling will be a starting point to show, by comparison, that other approaches are not that bad, so that we finally get something way more advanced for 9.7.
Recently, pg_pathman received support for UPDATE and DELETE queries. Because of some peculiarities of PostgreSQL query planner hooks, UPDATE and DELETE planning is accelerated only when exactly one partition is touched by the query. Otherwise, regular slow inheritance query planning is used. However, the case when UPDATE or DELETE touches only one partition seems to be the most common and the one most in need of optimization.
Also, I’d like to share some benchmark. This benchmark consists of operations on journal table with about 1 M records for year partitioned by day. For sure, this is kind of toy example, because nobody will split so small amount of data into so many partitions. But it is still good to see partitioning overhead. Performance of following operations was compared:
The following partitioning methods were compared: a plain single table without partitioning, pg_partman, and pg_pathman.
Benchmarks were done on a server with 2 x Intel Xeon CPU X5675 @ 3.07GHz and 24 GB of memory, with fsync = off, in 10 threads. See the results below.
Test name | single table, TPS | pg_partman, TPS | pg_pathman, TPS |
---|---|---|---|
Select one row | 47973 | 1084 | 41775 |
Select whole one partition | 2302 | 704 | 2556 |
Insert one row | 34401 | 7969 | 25859 |
Update one row | 32769 | 202 | 29289 |
I can make the following highlights from these results.
See this gist for SQL-scripts used for benchmarking.
P.S. This post is not a criticism of pg_partman. It was developed long before the extensibility mechanisms which pg_pathman uses were created. And it is a great extension which has served for many years.
In my previous post I introduced pg_pathman as an extension which accelerates query planning over partitioned tables. In this post I would like to cover another aspect of pg_pathman: it not only produces plans faster, it also produces better plans. Thanks to that, query execution with pg_pathman becomes much faster in some cases.
When you search a partitioned table with some filter conditions, pg_pathman adapts the filter to each individual partition. Therefore, each partition receives only those filter conditions which are useful to check against it.
Let me illustrate this with an example. At first, let's see what happens with filter conditions when dealing with the PostgreSQL inheritance mechanism. Let's make a partitioned table using inheritance.
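A sketch of an inheritance-based setup matching the partitions discussed below (monthly partitions of 2015; an index on ts in each partition is assumed as well):

```sql
CREATE TABLE test (id serial, ts timestamp NOT NULL);
CREATE TABLE test_1 (CHECK (ts >= '2015-01-01' AND ts < '2015-02-01')) INHERITS (test);
CREATE TABLE test_2 (CHECK (ts >= '2015-02-01' AND ts < '2015-03-01')) INHERITS (test);
CREATE TABLE test_3 (CHECK (ts >= '2015-03-01' AND ts < '2015-04-01')) INHERITS (test);
CREATE TABLE test_4 (CHECK (ts >= '2015-04-01' AND ts < '2015-05-01')) INHERITS (test);
CREATE TABLE test_5 (CHECK (ts >= '2015-05-01' AND ts < '2015-06-01')) INHERITS (test);
CREATE TABLE test_6 (CHECK (ts >= '2015-06-01' AND ts < '2015-07-01')) INHERITS (test);
```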
And then fill it with test data.
Then let’s try to select rows from two time intervals.
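The query with two intervals, matching the filter condition analyzed later in this post:

```sql
EXPLAIN (COSTS OFF)
SELECT * FROM test
WHERE (ts >= '2015-02-01' AND ts < '2015-03-15')
   OR (ts >= '2015-05-15' AND ts < '2015-07-01');
```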
We can see that the filter condition was passed to each partition as is. But actually, it could be simplified a lot. For instance, table test_2 could be scanned without any filter condition at all, because all of its rows match. The filter condition for test_3 could be simplified to ts < '2015-03-15', so the BitmapOr is not necessary.
Let’s try the same example with pg_pathman. Firstly create test table and its partitions.
Then insert test data into the table. pg_pathman automatically creates a trigger which distributes data between partitions, just like pg_partman does.
And finally try the same query with pg_pathman.
We can see that pg_pathman selects the same partitions, but the query plan becomes way simpler. Now, test_2 is scanned without the useless filter condition. test_3 is scanned using just the ts < '2015-03-15' filter condition. Thanks to that, a plain Index Scan is used instead of a BitmapOr. And similar improvements were applied to the rest of the partitions.
How was this simplification possible? The common fear here is that such simplification could be computationally expensive in the general case. But since pg_pathman is intended to decrease query planning time, it's very important to keep all transformations cheap and simple. And such a cheap and simple transformation algorithm really exists.
Let’s see how it works on simple example. The filter condition (ts >=
'2015-02-01' AND ts < '2015-03-15') OR (ts >= '2015-05-15' AND ts <
'2015-07-01')
have following tree representation.
Leaf nodes of the tree are simple conditions. Non-leaf nodes are logical operators which form complex conditions. For a particular partition, each filter condition (either simple or complex) falls into one of three classes.
1. The filter condition is always true for rows of this partition (t). For instance, the condition ts >= '2015-04-15' is always true for the partition ts >= 2015-05-01 AND ts < 2015-06-01.
2. The filter condition could be either true or false for rows of this partition (m). For instance, the condition ts >= '2015-03-15' could be either true or false for the partition ts >= 2015-03-01 AND ts < 2015-04-01.
3. The filter condition is always false for rows of this partition (f). For instance, the condition ts <= '2015-02-01' is always false for the partition ts >= 2015-04-01 AND ts < 2015-05-01.
We can mark each tree node with a vector of classes describing how the corresponding condition is treated against each partition. These vectors can be filled bottom-up: for leaf nodes first, and then for non-leaf nodes using ternary logic.
It’s evident that only conditions which could be either true or false (m) are useful for filtering. Conditions which are always true or always false shouldn’t be presented in the partitions filter. Using produced three we can now produce filter conditions for each partition.
- For the ts >= 2015-01-01 AND ts < 2015-02-01 partition, the whole filter condition is false. So, skip it.
- For the ts >= 2015-02-01 AND ts < 2015-03-01 partition, the whole filter condition is true. So, scan it without a filter.
- For the ts >= 2015-03-01 AND ts < 2015-04-01 partition, the filter condition tree is reduced to the following tree. Therefore, this partition will be scanned with the ts < '2015-03-15' filter.
- For the ts >= 2015-04-01 AND ts < 2015-05-01 partition, the whole filter condition is false. So, skip it.
- For the ts >= 2015-05-01 AND ts < 2015-06-01 partition, the filter condition tree is reduced to the following tree. Therefore, this partition will be scanned with the ts >= '2015-05-15' filter.
- For the ts >= 2015-06-01 AND ts < 2015-07-01 partition, the whole filter condition is true. So, scan it without a filter.
This is how filter condition processing works in pg_pathman. The explanation could be a bit exhausting to read, but I hope you feel enlightened by understanding how it works. Let me remind you that pg_pathman is an open source extension for PostgreSQL 9.5 in beta-release stage. I encourage everyone interested to try it and share their feedback.
Partitioning in PostgreSQL is traditionally implemented using table inheritance. Table inheritance allows the planner to include in the plan only those child tables (partitions) which are compatible with the query. At the same time, a lot of partition management work remains on users: creating inherited tables, writing a trigger which selects the appropriate partition for row insertion, etc. In order to automate this work, the pg_partman extension was written. Also, there is upcoming work on declarative partitioning by Amit Langote for the PostgreSQL core.
At Postgres Professional we noticed a performance problem of inheritance-based partitioning. The problem is that the planner selects child tables compatible with the query by a linear scan. Thus, a query which selects just one row from one partition can be much slower to plan than to execute. This fact discourages many users, and this is why we're working on a new PostgreSQL extension: pg_pathman.
pg_pathman caches partition metadata and uses the set_rel_pathlist hook in order to replace the child table selection mechanism with its own. Thanks to this, a binary search over a sorted array is used for range partitioning and a hash table lookup for hash partitioning. Therefore, the time spent on partition selection appears to be negligible in comparison with forming the resulting plan nodes. See the postgrespro blog post for performance benchmarks.
pg_pathman is now in beta-release status, and we encourage all interested users to try it and give us feedback. pg_pathman is compatible with PostgreSQL 9.5 and distributed under the PostgreSQL license. In the future, we're planning to enhance the functionality of pg_pathman with the following features.
Although pg_pathman is useful here and now, we want this functionality to eventually become part of the PostgreSQL core. This is why we are going to join the work on declarative partitioning by Amit Langote, which has excellent DDL infrastructure, and fill it with effective internal algorithms.
Users of the jsonb datatype frequently complain that it lacks statistics. Indeed, today jsonb statistics are just the default scalar statistics, which are suitable for selectivity estimation of the <, <=, =, >=, > operators. But people search jsonb documents using the @> operator, expressions with the -> operator, jsquery, etc. This is why the selectivity estimation which people typically get in their queries is just a stub. This can lead to wrong query plans and bad performance. And it made us introduce hints in the jsquery extension.
Thus, the problem is clear. But the right solution is still unclear, at least for me. Let me discuss the evident approaches to jsonb statistics and their limitations.
The first candidate for good selectivity estimation is the @> operator. Really, @> is a builtin operator with GIN index support. The first idea that comes to mind is to collect the most frequent paths and their frequencies as jsonb statistics. In order to understand the idea of paths better, let's consider how GIN jsonb_path_ops works. jsonb_path_ops is a builtin GIN operator class, which is the most suitable for the jsonb @> operator.
A path is a sequence of key names, array indexes, and a referenced value. For instance, the document {"a": [{"b": "xyz", "c": true}, 10], "d": {"e": [7, false]}} would be decomposed into the following set of paths.
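Following the rule just described (array indexes are written as # here, as explained below), the decomposition is:

```
"a".#."b"."xyz"
"a".#."c".true
"a".#.10
"d"."e".#.7
"d"."e".#.false
```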
In this representation of paths, array indexes are replaced with #. That allows our search to be agnostic to them, like the @> operator is. Thus, when we have such a decomposition, we can say that if a @> b, then the paths of a are a superset of the paths of b. If we intersect the posting lists of the search argument's paths, we get a list of candidates for the search result. This is how jsonb_path_ops works.
The same idea could be applied to jsonb statistics. We could decompose each jsonb document into a set of paths and then collect the frequencies of the most common individual paths. Such statistics fit perfectly into the current PostgreSQL system catalog and look very similar to the statistics for tsvectors and arrays, which are decomposed into lexemes and elements correspondingly. Such statistics of most common paths could look like the following table.
Path | Frequency |
---|---|
“a”.#.”b”.”xyz” | 0.55 |
“d”.”e”.#.77 | 0.43 |
“a”.#.”b”.”def” | 0.35 |
“d”.”e”.#.100 | 0.22 |
“d”.”f”.true | 0.1 |
Having such statistics, we can estimate the selectivity of the @> operator as the product of the frequencies of the search argument's paths. For paths which are not in the most common list, we can use some default "rare frequency". Also, we use the quite rough assumption that path appearances are independent. Let's be honest: this assumption is just wrong. However, this is a typical assumption we have to use during query planning. Finally, we don't need an absolutely accurate cost. Matching the order of magnitude can be considered quite a good result.
There is also another source of inaccuracy I'd like to mention. Let's consider an example.
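One possible pair of documents consistent with the discussion below (the concrete values are mine):

```sql
SELECT '[{"x": 1, "y": 2}, {"x": 3, "y": 4}]'::jsonb AS a,
       '[{"x": 1, "y": 4}, {"x": 3, "y": 2}]'::jsonb AS b;
```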
Both a and b are decomposed into the same set of paths.
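With the sample documents above, that shared set of paths (ignoring array indexes) is:

```
#."x".1
#."y".2
#."x".3
#."y".4
```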
However, neither a @> b nor b @> a holds. Since we ignored array indexes in the paths, we also ignore whether values belong to the same array element or not. This leads to false positives in GIN and to overestimations in the statistics.
This approach is not limited to the @> operator. We can produce estimations for queries with complex logic. An example in jsquery could be ("abc" = 1 OR "xyz".# = "hij") AND NOT "def" = false.
However, such statistics can hardly estimate the selectivity of the <, <=, >=, > operators over jsonb values. For instance, in order to estimate the jsquery "x" > 1, we can only count the most common paths which match this condition. But we lack histograms. This is a serious obstacle to getting accurate estimates, and it makes us search for a better solution.
Another idea for jsonb statistics comes from the assumption that almost every "schemaless" dataset can be easily represented in a schema of tables. Assuming this, we would like our selectivity estimates for searches in jsonb documents to be as good as those for searches in plain tables.
Let’s consider this on the example. The following json document could represent the information about order in e-commerce.
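A sketch of such a document, reconstructed from the tables below:

```json
{
  "id": 1,
  "contact": "John Smith",
  "phone": "212 555-1234",
  "address": "10021-3100, 21 2nd Street, New York",
  "products": [
    {"article": "XF56120", "name": "Sunglasses", "price": 500, "quantity": 1},
    {"article": "AT10789", "name": "T-Shirt", "price": 100, "quantity": 2}
  ]
}
```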
The same information could be represented in the following couple of tables.
id | contact | phone | address |
---|---|---|---|
1 | John Smith | 212 555-1234 | 10021-3100, 21 2nd Street, New York |
order_id | article | name | price | quantity |
---|---|---|---|---|
1 | XF56120 | Sunglasses | 500 | 1 |
1 | AT10789 | T-Shirt | 100 | 2 |
What kind of statistics would be collected by PostgreSQL in the second case? It would be most common values and a histogram for each attribute. Most common values (MCVs) are the values which occur in the column most frequently. The frequencies of those values are collected and stored as well. A histogram is described by an array of bounds. Each bucket between adjacent bounds is assumed to contain an equal number of column values, excluding MCVs (a so-called equi-depth histogram).
With some simplification such statistics could be represented in the following table.
Table | Attribute | Most common values | Histogram |
---|---|---|---|
order | contact | {“John Smith”: 0.05, “James Johnson”: 0.01} | [“Anthony Anderson”, “Lisa Baker”, “Sandra Phillips”] |
product | price | {“100”: 0.1, “10”: 0.08, “50”: 0.05, “150”: 0.03} | [0, 12.5, 45.5, 250, 1000] |
product | quantity | {“1”: 0.5, “2”: 0.2, “3”: 0.05, “5”: 0.01} | [0, 4, 7, 9, 10] |
……. | ……… | …………………………………………. | …………………………………………….. |
What if we replace the table and attribute with the path of keys where the corresponding value can be found in the json document?
Key path | Most common values | Histogram |
---|---|---|
contact | {“John Smith”: 0.05, “James Johnson”: 0.01} | [“Anthony Anderson”, “Lisa Baker”, “Sandra Phillips”] |
products.#.price | {“100”: 0.1, “10”: 0.08, “50”: 0.05, “150”: 0.03} | [0, 12.5, 45.5, 250, 1000] |
products.#.quantity | {“1”: 0.5, “2”: 0.2, “3”: 0.05, “5”: 0.01} | [0, 4, 7, 9, 10] |
………………. | …………………………………………. | …………………………………………….. |
This kind of statistics seems to be comprehensive enough. It could produce fine estimations for queries like products.#.price > 100. However, there is still a bunch of open problems here.
Typical json documents we meet in applications are really well structured, like the example above. However, there are cases when they are not. First, someone could easily put values into keys. Let me illustrate this with the following example: products becomes an object where the article is used as a key.
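A sketch of what that restructured document might look like (reusing the sample order above):

```json
{
  "products": {
    "XF56120": {"name": "Sunglasses", "price": 500, "quantity": 1},
    "AT10789": {"name": "T-Shirt", "price": 100, "quantity": 2}
  }
}
```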
In this case, we find that the cardinality of key paths is very high. Thus, we would be unable to collect suitable statistics for each key path. However, we could consider such a situation as a user mistake and advise users to restructure their documents.
There is still a kind of documents which don't fit this model, not because of a user mistake, but because of their nature. Imagine json-formatted query plans stored in a table. Plans can have unlimited levels of nesting, and correspondingly the cardinality of key paths can be very high.
Some objects stored inside jsonb documents could require special statistics. For instance, point coordinates could be represented in json as {"x": 11.3, "y": 27.0}. But the statistics we would need in this case are not separate statistics for x and y. We would need something special for geometrical objects, like 2D histograms.
Another problem is fitting this model into the PostgreSQL system catalog. pg_statistic assumes that the statistics of an attribute are represented by a few arrays. However, in this model we have to store a few arrays per key path. For sure, we could do a trick by storing an array of jsonb or something like that, but that would be a kludge. It would be nice to store each key path in a separate row of pg_statistic. This would require significant changes in statistics handling, though.
These were just my current thoughts about jsonb statistics. Probably, someone will come up with much better ideas. But I'm not sure we can find an ideal solution which would fit everyone's needs. We can see that current developments in multivariate statistics use a pluggable approach: the user can turn on a specific method for a specific set of columns. We could end up with something similar for jsonb: simple basic statistics plus various kinds of pluggable statistics for specific needs.
While hacking PostgreSQL, it's very useful to know the pid of the backend you are working with. You need to know the pid of the process to attach a debugger, profiler, etc. Luckily, .psqlrc provides us an elegant way to define shortcuts for psql. Using the config line below, one can find out the backend pid just by typing :pid.
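One possible form of such a line (a sketch of the trick, not necessarily the exact line from my .psqlrc):

```
\set pid 'SELECT pg_backend_pid();'
```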
In 9.6, it becomes possible to even include the backend pid in the psql prompt. However, it's possible to automate more complex actions in psql. I've configured my psql to run gdb attached to the current backend in a new tab of iTerm2 just by typing :gdb.
The :gdb command selects the pid of the current backend and pipes it to the input of the pg_debug script.
pg_debug extracts the pid from its input and then runs an OSA script which starts gdb in a new tab of iTerm2.
This script works for Mac OS X and iTerm2, but the same approach should work for other platforms and terminal emulators.