<![CDATA[Alexander Korotkov's blog]]> 2021-05-28T22:52:41+03:00 https://akorotkov.github.io/?utm_medium=social&utm_source=rss <![CDATA[Alexander Korotkov]]> Octopress <![CDATA[PostgreSQL 14: Substantial Change to Fulltext Query Parsing]]> 2021-05-22T14:12:00+03:00 https://akorotkov.github.io/blog/2021/05/22/pg-14-query-parsing/?utm_medium=social&utm_source=rss <![CDATA[

Long story short, since PostgreSQL 14, to_tsquery('pg_class') becomes 'pg' <-> 'class' instead of 'pg' & 'class' (commit 0c4f355c6a). That is, for instance, in PostgreSQL 13 and earlier to_tsquery('pg_class') matches to_tsvector('a class of pg'). Since PostgreSQL 14 it doesn't match anymore, but it still matches to_tsvector('pg_class') and to_tsvector('pg*class'). This is an incompatible change which affects full-text search users, but we had to make it in order to fix phrase search design problems.
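To illustrate the change with a concrete pair of queries (the expected results follow directly from the behavior described above):

-- PostgreSQL 13 and earlier: true, because 'pg' & 'class' may match anywhere in the document.
-- PostgreSQL 14: false, because 'pg' <-> 'class' requires the lexemes to be adjacent.
SELECT to_tsvector('a class of pg') @@ to_tsquery('pg_class');

-- True in both versions: pg_class parses into adjacent lexemes 'pg' and 'class'.
SELECT to_tsvector('pg_class') @@ to_tsquery('pg_class');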

The story started with a bug: to_tsvector('pg_class pg') didn't match websearch_to_tsquery('"pg_class pg"').

# select to_tsvector('pg_class pg') @@
         websearch_to_tsquery('"pg_class pg"');
 ?column?
----------
 f

Looks strange! Naturally, when you search for some text in quotes, you expect it to match at least the exact same text in the document. But it doesn't. My first idea was that it was just a bug in the websearch_to_tsquery() function, but to_tsquery() appears to have the same problem: to_tsquery('pg_class <-> pg') doesn't match to_tsvector('pg_class pg') either.

# select to_tsvector('pg_class pg') @@
         to_tsquery('pg_class <-> pg');
 ?column?
----------
 f

I was surprised that although phrase search arrived many years ago, such basic things still didn't work.

Looking under the hood, both websearch_to_tsquery('"pg_class pg"') and to_tsquery('pg_class <-> pg') compile into ( 'pg' & 'class' ) <-> 'pg'.

# select websearch_to_tsquery('"pg_class pg"'),
         to_tsquery('pg_class <-> pg');
    websearch_to_tsquery     |         to_tsquery
-----------------------------+-----------------------------
 ( 'pg' & 'class' ) <-> 'pg' | ( 'pg' & 'class' ) <-> 'pg'

This tsquery expects both pg and class to be exactly one position to the left of another pg. That means pg and class need to reside in the same position. In principle, that's possible, for instance when a single word is split into two synonyms by a full-text dictionary. But that's not our case. When we parse the text pg_class pg, each word gets its position sequentially; no two of them reside in the same position.

# select to_tsvector('pg_class pg');
    to_tsvector
--------------------
 'class':2 'pg':1,3
(1 row)

Why does tsquery parsing work this way? Historically, PostgreSQL full-text search compiles to_tsquery('pg_class') into 'pg' & 'class'; therefore, pg and class don't have to appear together. Before phrase search, that was the only way to process this query once we split pg_class into pg and class. Thus, querying compound words was a bit relaxed. But now, combined with phrase search, it becomes unreasonably strict.

My original intention was to choose how to compile pg_class depending on the context: with a phrase search operator nearby, pg_class would become 'pg' <-> 'class', and 'pg' & 'class' in the rest of the cases. But that approach required invasive refactoring of tsquery processing, taking more time than I could spend on this bug.

Fortunately, Tom Lane came up with a proposal to always compile pg_class into 'pg' <-> 'class'. Thus, now both websearch_to_tsquery('"pg_class pg"') and to_tsquery('pg_class <-> pg') compile into 'pg' <-> 'class' <-> 'pg', and both of them match to_tsvector('pg_class pg'). That is a win!

# select websearch_to_tsquery('"pg_class pg"'),
         to_tsquery('pg_class <-> pg');
    websearch_to_tsquery    │         to_tsquery
────────────────────────────┼───────────────────────────
 'pg' <-> 'class' <-> 'pg'  │ 'pg' <-> 'class' <-> 'pg'

# select to_tsvector('pg_class pg') @@ websearch_to_tsquery('"pg_class pg"'),
         to_tsvector('pg_class pg') @@ to_tsquery('pg_class <-> pg');
 ?column? │ ?column?
──────────┼──────────
 t        │ t

This approach makes all queries involving compound words stricter. But first, it appears to be the only easy way to fix this design bug. Second, it is probably a better way to handle compound words themselves.

And AFAICS, this approach seems to be the right one. Thanks to it, yet another phrase search bug turned out to be quite easy to fix.

Happy phrase searching in PostgreSQL 14! Hopefully, we will manage without further incompatible changes :)

]]>
<![CDATA[Rainbow Your Psql Output]]> 2021-05-17T23:30:00+03:00 https://akorotkov.github.io/blog/2021/05/17/rainbow-psql-output/?utm_medium=social&utm_source=rss <![CDATA[

It seems like a good idea to change the grey psql output to a lovely rainbow in honor of IDAHOT day. Thankfully, there is the lolcat utility, which is very easy to install on Linux and Mac OS.

Linux

$ sudo snap install lolcat

Mac OS

$ brew install lolcat

Having lolcat installed, you can set it up as a psql pager and get lovely rainbow psql output!

\pset pager always
\setenv PAGER 'lolcat -f | less -iMSx4R -FX'
]]>
<![CDATA[Jsonpath: ** Operator and Lax Mode Aren't Meant to Be Together.]]> 2021-05-06T18:10:00+03:00 https://akorotkov.github.io/blog/2021/05/06/jsonpath-double-asterisk-lax/?utm_medium=social&utm_source=rss <![CDATA[

PostgreSQL has an extension to jsonpath: the ** operator, which explores arbitrary depth to find your values everywhere. At the same time, there is the lax mode, defined by the standard, which provides a "relaxed" way of working with json. In lax mode, accessors automatically unwrap arrays, missing keys don't trigger errors, etc. In short, it appears that the ** operator and lax mode aren't designed to be used together :)
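As a reminder of what lax mode does on its own, here is a small sketch with a plain key accessor (strict mode requires addressing the array explicitly):

-- Lax (default) mode automatically unwraps the outer array: returns "a".
SELECT jsonb_path_query('[{"x": "a"}]'::jsonb, '$.x');

-- Strict mode raises an error: a member accessor can't be applied to an array.
SELECT jsonb_path_query('[{"x": "a"}]'::jsonb, 'strict $.x');

-- In strict mode the array has to be unwrapped explicitly: returns "a".
SELECT jsonb_path_query('[{"x": "a"}]'::jsonb, 'strict $[*].x');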

The story started with a bug report; a simplified version is below. The jsonpath query is intended to select the values of key "x" everywhere, but it appears to select each of these values twice.

# SELECT * FROM jsonb_path_query('[{"x": "a", "y": [{"x":"b"}]}]'::jsonb,
                                 '$.**.x');
 jsonb_path_query
------------------
 "a"
 "a"
 "b"
 "b"
(4 rows)

This case looks like a bug. But is it? Let's dig into the details by splitting the jsonpath query into two parts: one containing the ** operator and the other containing the key accessor.

# SELECT var,
         jsonb_path_query_array(var, '$.x') key_x
  FROM jsonb_path_query('[{"x": "a", "y": [{"x":"b"}]}]'::jsonb,
                        '$.**') var;
               var               | key_x
---------------------------------+-------
 [{"x": "a", "y": [{"x": "b"}]}] | ["a"]
 {"x": "a", "y": [{"x": "b"}]}   | ["a"]
 "a"                             | []
 [{"x": "b"}]                    | ["b"]
 {"x": "b"}                      | ["b"]
 "b"                             | []
(6 rows)

As you can see, the ** operator selects every child in the json document, as expected. The key accessor then extracts the corresponding values both from the objects themselves and from their wrapping arrays, and that's also expected in lax mode. So, it appears there is no bug; everything works as designed, although it's surprising for users.

Finally, I've committed a paragraph to the docs which explicitly clarifies this issue. It seems that lax mode and the ** operator just aren't designed to be used together. If you need the ** operator, you can use strict mode, and everything is intuitively correct.

# SELECT * FROM jsonb_path_query('[{"x": "a", "y": [{"x":"b"}]}]'::jsonb,
                                 'strict $.**.x');
 jsonb_path_query
------------------
 "a"
 "b"
(2 rows)
]]>
<![CDATA[Dramatic Effect of LSE Instructions for PostgreSQL on Graviton2 Instances]]> 2021-04-30T03:10:00+03:00 https://akorotkov.github.io/blog/2021/04/30/arm/?utm_medium=social&utm_source=rss <![CDATA[

The world changes. The ARM architecture is breaking into new areas of computing. Only a decade ago, only your mobile, router, or another specialized device would be ARM-based, while your desktop and server were typically x86-based. Nowadays, your new MacBook is ARM-based, and your EC2 instance could be ARM as well.

In mid-2020, Amazon made Graviton2 instances publicly available. The maximum number of CPU cores there is 64, and that is where it becomes interesting to check PostgreSQL scalability. It's exciting to check because ARM traditionally implements atomic operations using a load/store pair. So, in a sense, ARM is just like Power, where I've previously seen a significant effect from platform-specific atomics optimizations.

On the other hand, ARMv8.1 defines a set of LSE instructions, which, in particular, provide a way to implement an atomic operation in a single instruction (just like x86). What would be better: a special optimization, which puts custom logic between the load and store instructions, or just a simple loop of LSE CAS instructions? I've tried them both.

You can see the results of read-only and read-write pgbench runs on the graphs below (details on the experiments are here). pg14-devel-lwlock-ldrex-strex is the patched PostgreSQL with the special load/store optimization for lwlocks; pg14-devel-lse is PostgreSQL compiled with LSE support enabled.

You can see that the load/store optimization gives a substantial positive effect, but LSE rocks here!

So, if you’re running PostgreSQL on graviton2 instance, make sure you’ve binaries compiled with LSE support (see the instruction) because the effect is dramatic.

BTW, it appears that none of these optimizations have a noticeable effect on the performance of the Apple M1. Probably, the M1 has a smart enough internal optimizer to recognize that these different implementations are equivalent. And it was surprising that LSE usage might give a small negative effect on the Kunpeng 920. It was discouraging for me to learn of an ARM processor where a single-instruction operation is slower than its multi-instruction equivalent. Hopefully, processor architects will fix this in new Kunpeng processors.

In general, we see that different ARM implementations now have different performance characteristics and respond differently to optimizations. Hopefully, this is a growing pain that will be overcome soon.

Update: As Krunal Bauskar pointed out in the comments, LSE instructions are still faster than the load/store option on Kunpeng 920. Different timings might cause the regression: for instance, with LSE instructions we could simply reach, sooner, a regression caused by another bottleneck.

]]>
<![CDATA[Full Text Search Done (Almost) Right in PostgreSQL 11]]> 2018-02-17T18:20:00+03:00 https://akorotkov.github.io/blog/2018/02/17/fulltext-search-made-almost-right/?utm_medium=social&utm_source=rss <![CDATA[

Long story short, using PostgreSQL 11 with a RUM index, you can do both TOP-N queries and COUNT(*) for non-selective FTS queries without fetching all the results from the heap (which means much faster). Are you bored yet? If not, please read the detailed description below.

On November 1st, 2017, Tom Lane committed a patch enabling bitmap scans to behave like index-only scans when possible. In particular, since PostgreSQL 11, COUNT(*) queries can be evaluated using bitmap scans without accessing the heap when the corresponding bit in the visibility map is set. This patch was written by Alexander Kuzmenkov and reviewed by Alexey Chernyshov (both are my Postgres Pro colleagues), and it was heavily revised by Tom Lane.

This commit might seem to be just another planner and executor optimization, nice but not deserving much attention. However, on closer consideration, this patch appears to be a significant step toward getting full-text search in PostgreSQL done the right way.

I started working on FTS improvements in 2012. At that time, I realized that a GIN index is good for selective FTS queries, when the number of matching results is low. See the example below: GIN did a great job for us by returning just a few thousand matching rows very fast. The remaining operations, including relevance calculation and sorting, are also fast, because they are performed over a very small row set.

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM pgmail
WHERE fts @@ plainto_tsquery('english', 'exclusion constraint')
ORDER BY ts_rank_cd(fts, plainto_tsquery('english', 'exclusion constraint')) DESC
LIMIT 10;
                                                               QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=144.26..144.28 rows=10 width=784) (actual time=320.142..320.149 rows=10 loops=1)
   Buffers: shared hit=7138 read=7794
   ->  Sort  (cost=144.26..144.32 rows=25 width=784) (actual time=320.141..320.147 rows=10 loops=1)
         Sort Key: (ts_rank_cd(fts, '''exclus'' & ''constraint'''::tsquery)) DESC
         Sort Method: top-N heapsort  Memory: 38kB
         Buffers: shared hit=7138 read=7794
         ->  Bitmap Heap Scan on pgmail  (cost=44.20..143.72 rows=25 width=784) (actual time=5.232..315.302 rows=3357 loops=1)
               Recheck Cond: (fts @@ '''exclus'' & ''constraint'''::tsquery)
               Heap Blocks: exact=2903
               Buffers: shared hit=7138 read=7794
               ->  Bitmap Index Scan on pgmail_fts_idx  (cost=0.00..44.19 rows=25 width=0) (actual time=3.689..3.689 rows=3357 loops=1)
                     Index Cond: (fts @@ '''exclus'' & ''constraint'''::tsquery)
                     Buffers: shared hit=11 read=23
 Planning time: 0.176 ms
 Execution time: 320.213 ms
(15 rows)

But the situation is different if the FTS query is not selective and the number of matching rows is high. Then we have to fetch all those rows from the heap, calculate relevance for each of them, and sort them. And although we only need the TOP-10 rows, this query takes a lot of time.

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM pgmail
WHERE fts @@ plainto_tsquery('english', 'Tom Lane')
ORDER BY ts_rank_cd(fts, plainto_tsquery('english', 'Tom Lane')) DESC
LIMIT 10;
                                                                 QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=144.26..144.28 rows=10 width=784) (actual time=18110.231..18110.236 rows=10 loops=1)
   Buffers: shared hit=1358323 read=399077
   ->  Sort  (cost=144.26..144.32 rows=25 width=784) (actual time=18110.229..18110.231 rows=10 loops=1)
         Sort Key: (ts_rank_cd(fts, '''tom'' & ''lane'''::tsquery)) DESC
         Sort Method: top-N heapsort  Memory: 44kB
         Buffers: shared hit=1358323 read=399077
         ->  Bitmap Heap Scan on pgmail  (cost=44.20..143.72 rows=25 width=784) (actual time=70.267..17895.628 rows=224568 loops=1)
               Recheck Cond: (fts @@ '''tom'' & ''lane'''::tsquery)
               Rows Removed by Index Recheck: 266782
               Heap Blocks: exact=39841 lossy=79307
               Buffers: shared hit=1358323 read=399077
               ->  Bitmap Index Scan on pgmail_fts_idx  (cost=0.00..44.19 rows=25 width=0) (actual time=63.914..63.914 rows=224568 loops=1)
                     Index Cond: (fts @@ '''tom'' & ''lane'''::tsquery)
                     Buffers: shared hit=41 read=102
 Planning time: 0.131 ms
 Execution time: 18110.376 ms
(16 rows)

How can we improve this situation? If we could get results from the index pre-ordered by relevance, then we would be able to evaluate a TOP-N query without fetching the whole set of matching rows from the heap. Unfortunately, that appears to be impossible for a GIN index, which stores only the fact that specific terms occur in a document. But if we have additional information about term positions in the index, then it might work: that information would be enough to calculate relevance based on the index alone.

Thus, I proposed a set of patches to the GIN index. Some improvements were committed, including index compression and index search optimization. However, additional information storage for GIN wasn't committed, because it alters the GIN index structure too much.

Fortunately, we have extensible index access methods in PostgreSQL 9.6. That enables us to implement the things which weren't committed to GIN, and more, as a separate index access method: RUM. Using RUM, one can execute a TOP-N FTS query much faster without fetching all the matching rows from the heap.
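The index used in the example below can be created roughly like this (a sketch assuming the rum extension is installed; the opclass name is taken from the RUM documentation):

CREATE EXTENSION rum;
-- pgmail_idx is the index referenced in the plan below.
CREATE INDEX pgmail_idx ON pgmail USING rum (fts rum_tsvector_ops);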

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM pgmail
WHERE fts @@ plainto_tsquery('english', 'Tom Lane')
ORDER BY fts <=> plainto_tsquery('english', 'Tom Lane')
LIMIT 10;
                                                                QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=48.00..83.25 rows=10 width=1523) (actual time=242.974..248.366 rows=10 loops=1)
   Buffers: shared hit=809 read=25, temp read=187 written=552
   ->  Index Scan using pgmail_idx on pgmail  (cost=48.00..193885.14 rows=54984 width=1523) (actual time=242.972..248.358 rows=10 loops=1)
         Index Cond: (fts @@ '''tom'' & ''lane'''::tsquery)
         Order By: (fts <=> '''tom'' & ''lane'''::tsquery)
         Buffers: shared hit=809 read=25, temp read=187 written=552
 Planning time: 14.709 ms
 Execution time: 312.794 ms
(8 rows)

However, the problem persisted if you needed to get the total count of matching rows. The PostgreSQL executor still had to fetch all the matching rows from the heap in order to check their visibility. So, if you need the total number of resulting rows for pagination, it might still be very slow.

EXPLAIN (ANALYZE, BUFFERS)
SELECT COUNT(*) FROM pgmail
WHERE fts @@ plainto_tsquery('english', 'Tom Lane');
                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=118931.46..118931.47 rows=1 width=8) (actual time=36263.708..36263.709 rows=1 loops=1)
   Buffers: shared hit=800692 read=348338
   ->  Bitmap Heap Scan on pgmail  (cost=530.19..118799.14 rows=52928 width=0) (actual time=74.724..36195.946 rows=224568 loops=1)
         Recheck Cond: (fts @@ '''tom'' & ''lane'''::tsquery)
         Rows Removed by Index Recheck: 266782
         Heap Blocks: exact=39841 lossy=79307
         Buffers: shared hit=800692 read=348338
         ->  Bitmap Index Scan on pgmail_fts_idx  (cost=0.00..516.96 rows=52928 width=0) (actual time=67.467..67.467 rows=224568 loops=1)
               Index Cond: (fts @@ '''tom'' & ''lane'''::tsquery)
               Buffers: shared hit=41 read=102
 Planning time: 0.210 ms
 Execution time: 36263.790 ms
(12 rows)

For sure, some modern UIs use techniques like continuous scrolling, which don't require showing the full number of results to the user. Also, one can use the planner's estimate of the number of resulting rows, which typically matches the actual number within an order of magnitude. Nevertheless, slow counting of the total number of results was a problem for many RUM users. Since PostgreSQL 11, thanks to the patch mentioned at the beginning of this post, a bitmap scan can skip heap fetches when the visibility map allows it, and the same COUNT(*) query over the RUM index becomes fast.

EXPLAIN (ANALYZE, BUFFERS)
SELECT COUNT(*) FROM pgmail
WHERE fts @@ plainto_tsquery('english', 'Tom Lane');
                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=121794.28..121794.29 rows=1 width=8) (actual time=132.336..132.336 rows=1 loops=1)
   Buffers: shared hit=404
   ->  Bitmap Heap Scan on pgmail  (cost=558.13..121656.82 rows=54984 width=0) (actual time=83.676..116.889 rows=224568 loops=1)
         Recheck Cond: (fts @@ '''tom'' & ''lane'''::tsquery)
         Heap Blocks: exact=119148
         Buffers: shared hit=404
         ->  Bitmap Index Scan on pgmail_idx  (cost=0.00..544.38 rows=54984 width=0) (actual time=61.459..61.459 rows=224568 loops=1)
               Index Cond: (fts @@ '''tom'' & ''lane'''::tsquery)
               Buffers: shared hit=398
 Planning time: 0.183 ms
 Execution time: 133.885 ms
(11 rows)
]]>
<![CDATA[ALTER INDEX ... SET STATISTICS ...???]]> 2017-05-31T18:20:00+03:00 https://akorotkov.github.io/blog/2017/05/31/alter-index-weird/?utm_medium=social&utm_source=rss <![CDATA[

It's not very widely known, but PostgreSQL gathers statistics for indexed expressions. See the following example.

CREATE TABLE test AS (SELECT random() x, random() y FROM generate_series(1,1000000));
ANALYZE test;

EXPLAIN ANALYZE SELECT * FROM test WHERE x + y < 0.01;
                                                QUERY PLAN
-----------------------------------------------------------------------------------------------------------
 Seq Scan on test  (cost=0.00..20406.00 rows=333333 width=16) (actual time=1.671..113.693 rows=56 loops=1)
   Filter: ((x + y) < '0.01'::double precision)
   Rows Removed by Filter: 999944

We created a table with two columns, x and y, whose values are independently and uniformly distributed from 0 to 1. Although we analyzed the table, the PostgreSQL optimizer estimates the selectivity of the x + y < 0.01 qual as 1/3 (333333 of 1000000 rows). You can see that this estimate is not even close to reality: we actually selected 56 rows instead of the estimated 333333. This estimate comes from a rough assumption that the < operator selects 1/3 of rows unless something more precise is known. Of course, the planner could do something better in this case. For instance, it could try to calculate a histogram for x + y from the separate histograms for x and y. However, the PostgreSQL optimizer doesn't perform such costly and complex computations for now.

The situation changes once we define an index on x + y.

CREATE INDEX test_idx ON test ((x + y));
ANALYZE test;

EXPLAIN ANALYZE SELECT * FROM test WHERE x + y < 0.01;
                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on test  (cost=13.39..1838.32 rows=641 width=16) (actual time=0.040..0.107 rows=56 loops=1)
   Recheck Cond: ((x + y) < '0.01'::double precision)
   Heap Blocks: exact=56
   ->  Bitmap Index Scan on test_idx  (cost=0.00..13.23 rows=641 width=0) (actual time=0.028..0.028 rows=56 loops=1)
         Index Cond: ((x + y) < '0.01'::double precision)

Besides the index getting used for this query, there is a much more accurate estimate for the number of rows selected by x + y < 0.01. The estimate is improved because PostgreSQL now gathers separate statistics for the x + y expression. You can check that by querying the system catalog.

SELECT * FROM pg_stats WHERE tablename = 'test_idx';
-[ RECORD 1 ]----------+--------------------------------------------------------------------------------------------------------------------------------------------
schemaname             | public
tablename              | test_idx
attname                | expr
inherited              | f
null_frac              | 0
avg_width              | 8
n_distinct             | -0.999863
most_common_vals       | {0.262215601745993,0.319712610449642,0.3959802063182,0.404356196057051,0.40578526025638,0.437070866115391,0.462984828744084,0.4651908758096
most_common_freqs      | {2e-06,2e-06,2e-06,2e-06,2e-06,2e-06,2e-06,2e-06,2e-06,2e-06,2e-06,2e-06,2e-06,2e-06,2e-06,2e-06,2e-06,2e-06,2e-06,2e-06,2e-06,2e-06,2e-06,
histogram_bounds       | {0.00104234321042895,0.0141074191778898,0.0200657406821847,0.0247588600032032,0.0284962640143931,0.0315022920258343,0.0346860070712864,0.03
correlation            | -0.00176553
most_common_elems      | NULL
most_common_elem_freqs | NULL
elem_count_histogram   | NULL

So, there are a histogram, most common values, etc. for the x + y expression, and that leads to a more accurate selectivity estimate for x + y < 0.01. However, there is still an order-of-magnitude error (641 rows estimated instead of 56). Could we improve that? Yes: PostgreSQL has a statistics-gathering target parameter which is tunable per column using the ALTER TABLE ... SET STATISTICS ... command. Using this command, you may tune the size of the statistics arrays.
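For an ordinary column, that looks like the following (a quick illustration on the test table above; the target value 500 is arbitrary):

ALTER TABLE test ALTER COLUMN x SET STATISTICS 500;
ANALYZE test;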

But, uhhhh, in our case we have no column; we have an indexed expression. That appears to be a problem, since there is no documented way to tune the statistics target for that…

Nevertheless, it appears to be possible. There is a gotcha which allows advanced DBAs to do that.

ALTER INDEX test_idx ALTER COLUMN expr SET STATISTICS 10000;
ANALYZE test;

EXPLAIN ANALYZE SELECT * FROM test WHERE x + y < 0.01;
                                                    QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on test  (cost=4.96..258.61 rows=69 width=16) (actual time=0.022..0.074 rows=56 loops=1)
   Recheck Cond: ((x + y) < '0.01'::double precision)
   Heap Blocks: exact=56
   ->  Bitmap Index Scan on test_idx  (cost=0.00..4.94 rows=69 width=0) (actual time=0.014..0.014 rows=56 loops=1)
         Index Cond: ((x + y) < '0.01'::double precision)

That works. When we collect statistics arrays of size 10000, the estimate becomes 69 rows. That's only a 23% estimation error, which is more than good enough for query planning.

But… What the hell is ALTER INDEX ... SET STATISTICS ...?! There is nothing like this in the PostgreSQL documentation!

Let’s understand this situation step by step.

  1. ALTER INDEX and ALTER TABLE share the same bison grammar rule.
  2. Cases where ALTER INDEX is not applicable are filtered out at runtime.
  3. ALTER INDEX ... SET STATISTICS ... is not forbidden and works the same way as ALTER TABLE ... SET STATISTICS ... does.
  4. Indexed expressions are internally named as attributes: expr, expr1, expr2, and so on (see the query sketch just after this list).
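To see those internal attribute names, and the statistics target we have just set, one can peek into pg_attribute; a small sketch for the test_idx index used above:

SELECT attname, attstattarget
FROM pg_attribute
WHERE attrelid = 'test_idx'::regclass;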

There was a short discussion about that in the pgsql-hackers mailing list. The conclusion was that this should be documented, but it's not done yet. I also think that we should invent some better syntax for this instead of relying on internal column names.

]]>
<![CDATA["Our Answer to Uber" Talk at United Dev Conf, Minsk]]> 2017-04-08T00:20:00+03:00 https://akorotkov.github.io/blog/2017/04/08/uber-answer/?utm_medium=social&utm_source=rss <![CDATA[

Today I gave the talk "Our answer to Uber" at United Dev Conf in Minsk. The slides can be found on SlideShare. In my talk, I attempted to review Uber's notes and summarize community efforts to overcome the highlighted shortcomings.

United Dev Conf is quite a big IT conference, with more than 700 attendees. I'd like to note that interest in PostgreSQL is quite high: the room was almost full during my talk. Also, after the talk I spent about an hour continuously answering questions.

I think that Minsk is a very attractive place for IT events. It has everything required: lovely venues, good and inexpensive hotels, and developed infrastructure. Additionally, Belarus introduced 5-day visa-free travel for citizens of 80 countries, which made conference attendance much easier for many people. It would be nice to have a PGDay.Minsk one day.

]]>
<![CDATA[Faceted Search in the Single PostgreSQL Query]]> 2016-06-17T14:20:00+03:00 https://akorotkov.github.io/blog/2016/06/17/faceted-search/?utm_medium=social&utm_source=rss <![CDATA[

Faceted search is a very popular buzzword nowadays. In short, the specialty of faceted search is that its results are organized per category. Popular search engines receive special support for faceted search.

Let's see what PostgreSQL can do in this field. First, let's formalize our task. For each category which has matching documents, we want to obtain:

  • Total number of matching documents;
  • TOP N matching documents.

For sure, it's possible to query such data using multiple per-category SQL queries. But we'll do it in a single SQL query, which would also be faster in the majority of cases. The query below implements faceted search over the PostgreSQL mailing list archives using window functions and a CTE. The window functions are essential, while the CTE is used only for better query readability.

Faceted search SQL query
/*
 * Select all matching messages, calculate rank within list and total count
 * within list using window functions.
 */
WITH msg AS (
    SELECT
        message_id,
        subject,
        list,
        RANK() OVER (
            PARTITION BY list
            ORDER BY ts_rank_cd(body_tsvector,  plainto_tsquery('index bloat')), id
        ) rank,
        COUNT(*) OVER (PARTITION BY list) cnt
    FROM messages
    WHERE body_tsvector @@ plainto_tsquery('index bloat')
),
/* Aggregate messages and count per list into json. */
lst AS (
    SELECT
        list,
        jsonb_build_object(
            'count', cnt,
            'results', jsonb_agg(
                jsonb_build_object(
                    'message_id', message_id,
                    'subject', subject
        ))) AS data
    FROM msg
    WHERE rank <= 5
    GROUP by list, cnt
)
/* Aggregate per list data into single json */
SELECT  jsonb_object_agg(list, data)
FROM    lst;

The resulting JSON document contains the total count of matching messages and the TOP 5 most relevant messages for each mailing list.

Faceted search JSON result
{
  "pgsql-admin": {
    "count": 263,
    "results": [
      {"message_id": "CACjxUsMUWkY1Z2K2A6yVdF88GT3xcFw5ofWTR6r1zqLUYu0WzA@mail.gmail.com", "subject": "Re: Slow planning time"},
      {"message_id": "[email protected]", "subject": "Re: Finetuning Autovacuum"},
      {"message_id": "[email protected]", "subject": "Re: blocking automatic vacuum"},
      {"message_id": "[email protected]", "subject": "Re: Vacuum Full"},
      {"message_id": "[email protected]", "subject": "Re: postgres bogged down beyond tolerance"
      }
    ]
  },
/*................................................................................*/
  "pgsql-advocacy": {
    "count": 8,
    "results": [
      {"message_id": "[email protected]", "subject": "Re: Press Release"},
      {"message_id": "[email protected]", "subject": "Re: [HACKERS] Increased company involvement"},
      {"message_id": "[email protected]", "subject": "Search and archives still out of sync"},
      {"message_id": "[email protected]", "subject": "Re: postgresql publication"},
      {"message_id": "[email protected]", "subject": "Re: postgresql publication"
      }
    ]
  }
}

In the plan of this query, we can see that the message_body_idx GIN index is scanned only once, and this is great.

Plan of faceted search SQL query
                                                                   QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=2369.50..2369.51 rows=1 width=114) (actual time=34.232..34.232 rows=1 loops=1)
   CTE msg
 ->  WindowAgg  (cost=2087.93..2354.30 rows=491 width=336) (actual time=30.925..33.087 rows=2486 loops=1)
       ->  WindowAgg  (cost=2087.93..2222.96 rows=491 width=336) (actual time=30.716..32.020 rows=2486 loops=1)
             ->  Sort  (cost=2087.93..2089.16 rows=491 width=336) (actual time=30.711..30.838 rows=2486 loops=1)
                   Sort Key: messages.list, (ts_rank_cd(messages.body_tsvector, plainto_tsquery('index bloat'::text))), messages.id
                   Sort Method: quicksort  Memory: 582kB
                   ->  Bitmap Heap Scan on messages  (cost=48.05..2065.98 rows=491 width=336) (actual time=3.037..24.345 rows=2486 loops=1)
                         Recheck Cond: (body_tsvector @@ plainto_tsquery('index bloat'::text))
                         Heap Blocks: exact=2044
                         ->  Bitmap Index Scan on message_body_idx  (cost=0.00..47.93 rows=491 width=0) (actual time=2.723..2.723 rows=2486 loo
                               Index Cond: (body_tsvector @@ plainto_tsquery('index bloat'::text))
   CTE lst
 ->  HashAggregate  (cost=12.69..13.69 rows=67 width=540) (actual time=34.090..34.133 rows=14 loops=1)
       Group Key: msg.list, msg.cnt
       ->  CTE Scan on msg  (cost=0.00..11.05 rows=164 width=540) (actual time=30.928..33.879 rows=68 loops=1)
             Filter: (rank <= 5)
             Rows Removed by Filter: 2418
   ->  CTE Scan on lst  (cost=0.00..1.34 rows=67 width=114) (actual time=34.092..34.140 rows=14 loops=1)
 Planning time: 0.380 ms
 Execution time: 34.357 ms

Thus, it appears that nothing prevents you from implementing trendy kinds of search using good old SQL and the powerful features of PostgreSQL, including full-text search, JSON support, window functions, etc.

]]>
<![CDATA[RuntimeAppend in Pg_pathman: Achievements and New Challenges]]> 2016-06-15T15:00:00+03:00 https://akorotkov.github.io/blog/2016/06/15/pg_pathman-runtime-append/?utm_medium=social&utm_source=rss <![CDATA[

When dealing with partitioned tables, we can't always select the relevant partitions during query planning. Naturally, during planning you can't know the values which come from a subquery or from the outer side of a nested loop join. Nevertheless, it would be ridiculous to scan all the partitions in such cases.

This is why my Postgres Professional colleague Dmitry Ivanov developed a new custom executor node for pg_pathman: RuntimeAppend. This node behaves like a regular Append node: it contains a set of child nodes which should be appended. However, RuntimeAppend has one distinction: on each run it selects only the relevant children to append, based on the parameter values.

Let's consider an example: a join of the journal table, which contains a row for every 30 seconds of a year and is partitioned by day, and the q table, which references 1000 random rows of the journal table. Without RuntimeAppend, the optimizer selects a Hash Join plan.

Regular Append: Hash Join
# EXPLAIN ANALYZE SELECT * FROM q JOIN journal j ON q.dt = j.dt;
                                                          QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=27.50..25442.51 rows=1000 width=56) (actual time=0.479..252.506 rows=1000 loops=1)
   Hash Cond: (j.dt = q.dt)
   ->  Append  (cost=0.00..21463.01 rows=1051201 width=49) (actual time=0.005..152.258 rows=1051201 loops=1)
         ->  Seq Scan on journal_1 j  (cost=0.00..58.80 rows=2880 width=49) (actual time=0.004..0.247 rows=2880 loops=1)
         ->  Seq Scan on journal_2 j_1  (cost=0.00..58.80 rows=2880 width=49) (actual time=0.001..0.208 rows=2880 loops=1)
         ->  Seq Scan on journal_3 j_2  (cost=0.00..58.80 rows=2880 width=49) (actual time=0.001..0.197 rows=2880 loops=1)
...............................................................................................................................
         ->  Seq Scan on journal_366 j_365  (cost=0.00..1.01 rows=1 width=49) (actual time=0.001..0.001 rows=1 loops=1)
   ->  Hash  (cost=15.00..15.00 rows=1000 width=8) (actual time=0.185..0.185 rows=1000 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 48kB
         ->  Seq Scan on q  (cost=0.00..15.00 rows=1000 width=8) (actual time=0.003..0.074 rows=1000 loops=1)
 Planning time: 29.262 ms
 Execution time: 256.337 ms
(374 rows)

The Hash Join takes 256 milliseconds to execute and 29 milliseconds to plan. The relatively high planning time is expected, because all the partitions are present in the plan. It's surprising that the optimizer didn't select a Nested Loop join. Let's force it to do so with enable_hashjoin = off and enable_mergejoin = off.
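In the session, that is simply:

SET enable_hashjoin = off;
SET enable_mergejoin = off;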

Regular Append: Nested Loop
# EXPLAIN ANALYZE SELECT * FROM q JOIN journal j ON q.dt = j.dt;
                                                                      QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.28..170817.00 rows=1000 width=56) (actual time=1.091..452.658 rows=1000 loops=1)
   ->  Seq Scan on q  (cost=0.00..15.00 rows=1000 width=8) (actual time=0.006..0.158 rows=1000 loops=1)
   ->  Append  (cost=0.28..167.14 rows=366 width=49) (actual time=0.218..0.438 rows=1 loops=1000)
         ->  Index Scan using journal_1_dt_idx on journal_1 j  (cost=0.28..0.46 rows=1 width=49) (actual time=0.001..0.001 rows=0 loops=1000)
               Index Cond: (dt = q.dt)
         ->  Index Scan using journal_2_dt_idx on journal_2 j_1  (cost=0.28..0.46 rows=1 width=49) (actual time=0.001..0.001 rows=0 loops=1000)
               Index Cond: (dt = q.dt)
         ->  Index Scan using journal_3_dt_idx on journal_3 j_2  (cost=0.28..0.46 rows=1 width=49) (actual time=0.001..0.001 rows=0 loops=1000)
               Index Cond: (dt = q.dt)
......................................................................................................................................................
         ->  Index Scan using journal_366_dt_idx on journal_366 j_365  (cost=0.12..0.15 rows=1 width=49) (actual time=0.001..0.001 rows=0 loops=1000)
               Index Cond: (dt = q.dt)
 Planning time: 29.922 ms
 Execution time: 456.140 ms
(737 rows)

The Nested Loop join takes 456 milliseconds to execute. This is even worse, but it is understandable, because we have to scan each partition of journal for each row of q.

Finally, let’s enable RuntimeAppend.

RuntimeAppend
# EXPLAIN ANALYZE SELECT * FROM q JOIN journal j ON q.dt = j.dt;
                                                                   QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.28..481.67 rows=1000 width=56) (actual time=0.041..9.911 rows=1000 loops=1)
   ->  Seq Scan on q  (cost=0.00..15.00 rows=1000 width=8) (actual time=0.005..0.079 rows=1000 loops=1)
   ->  Custom Scan (RuntimeAppend)  (cost=0.28..0.46 rows=1 width=49) (actual time=0.003..0.003 rows=1 loops=1000)
         ->  Index Scan using journal_330_dt_idx on journal_330 j  (cost=0.28..0.46 rows=1 width=49) (actual time=0.003..0.003 rows=1 loops=5)
               Index Cond: (dt = q.dt)
         ->  Index Scan using journal_121_dt_idx on journal_121 j  (cost=0.28..0.46 rows=1 width=49) (actual time=0.004..0.004 rows=1 loops=1)
               Index Cond: (dt = q.dt)
         ->  Index Scan using journal_37_dt_idx on journal_37 j  (cost=0.28..0.46 rows=1 width=49) (actual time=0.003..0.003 rows=1 loops=4)
               Index Cond: (dt = q.dt)
................................................................................................................................................
         ->  Index Scan using journal_355_dt_idx on journal_355 j  (cost=0.28..0.46 rows=1 width=49) (actual time=0.003..0.003 rows=1 loops=1)
               Index Cond: (dt = q.dt)
 Planning time: 30.775 ms
 Execution time: 8.615 ms
(687 rows)

The Nested Loop join with RuntimeAppend takes only about 9 milliseconds to execute! Such fast execution is possible because RuntimeAppend scans only the one relevant partition of journal for each row of q.

Nevertheless, all the partitions are still present in the plan, and planning time remains quite high. This relatively high planning time may be insignificant for prepared statements or long OLAP queries.

However, long planning time appears not to be the only problem. We ran a benchmark where the RuntimeAppend node returns just a few rows in a prepared statement. Although high planning time doesn't affect prepared statements, TPS was a few times lower than it was without partitioning. After running perf, we got this flamegraph. It shows that we spend very significant time locking and unlocking every partition. Naturally, locking 365 partitions can't use fast-path locking and appears to be a significant overhead.

Thus, we see how huge the benefit of runtime partition selection could be. However, in the current design, having all the partitions in the plan causes high overhead. A solution could be found in redesigning partition locking. We are researching this problem now. It's likely this problem can't be solved within the boundaries of an extension, and a proper solution requires hacking the PostgreSQL core.

]]>
<![CDATA[Drawing Graphs Directly in Psql]]> 2016-06-09T16:45:00+03:00 https://akorotkov.github.io/blog/2016/06/09/psql-graph/?utm_medium=social&utm_source=rss <![CDATA[

For people who actively work with psql, it frequently happens that you want to draw a graph for the table you're currently seeing. Typically, that means a cycle of actions: exporting the data, importing it into a graph-drawing tool, and drawing the graph itself. It appears that this process can be automated: a graph can be drawn by typing a single command directly in psql. See the example in the screenshot below.

It might seem like magic, but actually there is no magic at all. iTerm2 has supported image inlining since version 3, which is currently in beta. Thus, if we output an image surrounded by the corresponding escape sequences, it will appear in the terminal. From the psql side, we need to redirect output to a script which does that. We can define a macro to simplify this, as in one of my previous posts.

\set graph '\\g |pg_graph'

And finally, we need a pg_graph script which parses psql output, draws the graph, and writes it to stdout. I wrote one using Python and matplotlib. It treats the first column as the series of X-values and the rest of the columns as series of Y-values. If the first column contains only decimal values, it draws a plot chart; otherwise, it draws a bar chart.
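With a pg_graph script available in PATH, usage might look like this (a hypothetical sketch: the query output is piped to the script, either explicitly with \g as shown here or by ending the query with the :graph macro defined above):

-- The first column provides the X values; the remaining columns provide Y series.
SELECT x, sin(x), cos(x)
FROM generate_series(0, 6.28, 0.01) AS x \g |pg_graph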

Thereby, it's not hard to teach psql to do more things. Also, we could consider some improvements to psql, including:

  • Add an output format option for \g which would make it easier to parse psql output from scripts;
  • Provide an elegant way to pass parameters into psql macros.
]]>
<![CDATA[PostgreSQL Scalability: Towards Millions TPS]]> 2016-05-09T12:50:00+03:00 https://akorotkov.github.io/blog/2016/05/09/scalability-towards-millions-tps/?utm_medium=social&utm_source=rss <![CDATA[

PostgreSQL scalability on multicore and multisocket machines became a subject of optimization a long time ago, once such machines became widely used. This blog post shows a brief history of vertical scalability improvements between versions 8.0 and 8.4. PostgreSQL 9.2 had a very noticeable scalability improvement: thanks to fast-path locking and other optimizations, it became possible to achieve more than 350 000 TPS in the select-only pgbench test. The latest stable release, PostgreSQL 9.5, also contains significant scalability advancements, including an LWLock improvement which allows achieving about 400 000 TPS in the select-only pgbench test.

The Postgres Professional company also became involved in scalability optimization. In partnership with IBM, we researched PostgreSQL scalability on modern Power8 servers. The results of this research were published in the popular Russian blog habrahabr (Google translated version). As a brief result of this research, we identified two ways to improve PostgreSQL scalability on Power8:

  1. Implement Pin/UnpinBuffer() using CAS operations instead of buffer header spinlock;
  2. Optimize LWLockAttemptLock() in assembly to make fewer loops for changing lwlock state.

Optimization #1 appears to give a huge benefit on big Intel servers as well, while optimization #2 is Power-specific. After long rounds of optimization, cleanup, and testing, #1 was finally committed by Andres Freund.

On the graph above, the following PostgreSQL versions were compared:

  1. 9.5.2 release – peak is 540 000 TPS with 60 clients,
  2. 9.6 master (more precisely 59455018) – peak is 1 064 000 TPS with 110 clients,
  3. 9.6 master where all PGXACTs were full cacheline aligned – peak is 1 722 000 TPS with 200 clients.

Alignment issues are worth some explanation. Initially, I complained about a performance regression introduced by commit 5364b357, which increases the number of clog buffers. That was strange by itself, because a read-only benchmark shouldn't look into the clog thanks to hint bits. As expected, it appears that clog buffers don't really affect read-only performance directly; 5364b357 just changed the layout of shared memory structures.

It appears that the read-only benchmark became very sensitive to the layout of shared memory structures. As a result, performance varies significantly depending on shared_buffers, max_connections, and other options which influence shared memory distribution. When I gave Andres access to that big machine, he very quickly found a way to take care of the performance irregularity: make all PGXACTs fully cacheline-aligned. Without this patch, SnapshotResetXmin() dirties a processor cache line containing multiple PGXACTs; with this patch, SnapshotResetXmin() dirties a cache line with only a single PGXACT. Thus, GetSnapshotData() has far fewer cache misses. That was a surprising and good lesson for me: I knew that alignment influences performance, but I didn't expect the influence to be so huge. The PGXACT cacheline alignment issue was discovered after the feature freeze for 9.6, which means it will be a subject for 9.7 development. Nevertheless, 9.6 has a very noticeable scalability improvement.

Therefore, a single PostgreSQL instance delivers more than 1 million TPS, and one could say that PostgreSQL opens a new era of millions of TPS.

P.S. I’d like to thank:

  • Andres Freund, co-author and committer of the patch;
  • My Postgres Pro colleagues: Dmitry Vasilyev, who ran a lot of benchmarks, and Yuriy Zhuravlev, who wrote the original proof of concept of this patch;
  • Dilip Kumar and Robert Haas, who helped with testing.
]]>
<![CDATA[Extensible Access Methods Are Committed to 9.6]]> 2016-04-06T15:26:00+03:00 https://akorotkov.github.io/blog/2016/04/06/extensible-access-methods/?utm_medium=social&utm_source=rss <![CDATA[

PostgreSQL 9.6 received proper support for extensible index access methods. And that's good news, because Postgres was initially designed to support them.

“It is imperative that a user be able to construct new access methods to provide efficient access to instances of nontraditional base types”

Michael Stonebraker, Jeff Anton, Michael Hirohama. Extendability in POSTGRES, IEEE Data Eng. Bull. 10 (2), pp. 16-23, 1987

That was a huge piece of work consisting of multiple steps.

  1. Rework of the access method interface so that access method internals are hidden from the SQL level and handled at the C level. Besides helping custom access method support, this refactoring is good by itself. Committed by Tom Lane.
  2. The CREATE ACCESS METHOD command, which provides a legal way to insert into pg_am with support for dependencies and pg_dump/pg_restore. Committed by Alvaro Herrera.
  3. The generic WAL interface, which gives custom access methods a way to be WAL-logged. Each built-in access method has its own type of WAL records, but a custom access method shouldn't, because that could affect reliability. Generic WAL records represent the difference between pages in a general way, as the result of a per-byte comparison of the original and modified images of the page. For sure, this is not as efficient as a dedicated WAL record type, but there is no other choice under the restrictions we have. Committed by Teodor Sigaev.
  4. The bloom contrib module, which is an example of a custom index access method using the generic WAL interface. This contrib module is essential for testing the infrastructure described above; also, this access method could be useful by itself (see the usage sketch just after this list). Committed by Teodor Sigaev.
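For example, once the extension is installed, the bloom access method can be used like any other index type (a minimal sketch based on the contrib/bloom documentation; the table is hypothetical):

CREATE EXTENSION bloom;  -- registers the access method via CREATE ACCESS METHOD under the hood
CREATE TABLE tbloom (i1 int, i2 int, i3 int);
CREATE INDEX tbloom_idx ON tbloom USING bloom (i1, i2, i3);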

I am very thankful for the efforts of the committers and reviewers who made it possible to include these features in PostgreSQL.

However, end users don't really care about this infrastructure; they care about the features we can provide on top of it. Actually, we will be able to have index access methods which are:

  • Too hard to add to PostgreSQL core. For instance, we presented fast FTS in 2012. We have 2 of 4 GIN features committed to core, and it seems to be a very long way until the rest of the features get into core. But since 9.6 we can provide them as an extension.
  • Not patent-free. There are some interesting data structures which are covered by patents (the Fractal Tree index, for example). This is why they couldn't be added to PostgreSQL core. Since 9.6, they can be provided without a fork.

Also, I consider this work (together with FDW) as a step towards pluggable storage engines. I will speak about this during my talk at PGCon 2016.

]]>
<![CDATA[Monitoring Wait Events in PostgreSQL 9.6]]> 2016-03-25T18:00:00+03:00 https://akorotkov.github.io/blog/2016/03/25/wait_monitoring_9_6/?utm_medium=social&utm_source=rss <![CDATA[

Recently, Robert Haas committed a patch which allows seeing more detailed information about the current wait event of a process. In particular, the user will be able to see whether a process is waiting for a heavyweight lock, a lightweight lock (either individual or tranche), or a buffer pin. The full list of wait events is available in the documentation. Hopefully, there will be more wait events in future releases.
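In 9.6, the current wait event is exposed through pg_stat_activity, so a quick look at what the sessions are waiting on is a single query away:

SELECT pid, wait_event_type, wait_event, state, query
FROM pg_stat_activity;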

It's nice to see the current wait event of a process, but just one snapshot is not very descriptive and is definitely not enough to draw any conclusions. But we can use sampling to collect suitable statistics. This is why I'd like to present pg_wait_sampling, which automates gathering sampling statistics of wait events. pg_wait_sampling enables you to gather statistics for graphs like the one below.

Let me explain how I drew this graph. pg_wait_sampling samples wait events into two destinations: history and profile. History is an in-memory ring buffer, and profile is an in-memory hash table with accumulated statistics. We're going to use the second one to see the intensity of wait events over time periods.

First, let's create a table for the accumulated statistics. I'm doing these experiments on my laptop, and for simplicity this table will live in the instance under monitoring. But note that such a table could live on another server; I'd even say it's preferable to place such data on another server.

CREATE TABLE profile_log (
    ts         timestamp,
    event_type text,
    event      text,
    count      int8);

Second, I wrote a function to copy data from the pg_wait_sampling_profile view into the profile_log table and reset the profile data. This function returns the number of rows inserted into the profile_log table. It also discards the pid number and groups the data by wait event; it doesn't necessarily have to be this way.

CREATE OR REPLACE FUNCTION write_profile_log() RETURNS integer AS $$
DECLARE
    result integer;
BEGIN
    INSERT INTO profile_log
        SELECT current_timestamp, event_type, event, SUM(count)
        FROM pg_wait_sampling_profile
        WHERE event IS NOT NULL
        GROUP BY event_type, event;
    GET DIAGNOSTICS result = ROW_COUNT;
    PERFORM pg_wait_sampling_reset_profile();
    RETURN result;
END
$$
LANGUAGE 'plpgsql';

Then I run a psql session where I set up a watch of this function. Monitoring of our system has started. For real usage, it's better to schedule this command using cron or something similar.

smagen@postgres=# SELECT write_profile_log();
 write_profile_log
-------------------
                 0
(1 row)

smagen@postgres=# \watch 10
Fri Mar 25 14:03:09 2016 (every 10s)

 write_profile_log
-------------------
                 0
(1 row)

We can see that write_profile_log returns 0. That means we didn't insert anything into profile_log, and this is correct because the system is not under load now. Let's create some load using pgbench.

$ pgbench -i -s 10 postgres
$ pgbench -j 10 -c 10 -M prepared -T 60 postgres

In the parallel session, we can see that write_profile_log starts inserting data into the profile_log table.

Fri Mar 25 14:04:19 2016 (every 10s)
 write_profile_log
-------------------
                 9
(1 row)

Finally, let’s examine the profile_log table.

 SELECT * FROM profile_log;
             ts             |  event_type   |       event       | count
----------------------------+---------------+-------------------+-------
 2016-03-25 14:03:19.286394 | Lock          | tuple             |    41
 2016-03-25 14:03:19.286394 | LWLockTranche | lock_manager      |     1
 2016-03-25 14:03:19.286394 | LWLockTranche | buffer_content    |    68
 2016-03-25 14:03:19.286394 | LWLockTranche | wal_insert        |     3
 2016-03-25 14:03:19.286394 | LWLockNamed   | WALWriteLock      |    68
 2016-03-25 14:03:19.286394 | Lock          | transactionid     |   331
 2016-03-25 14:03:19.286394 | LWLockNamed   | ProcArrayLock     |     8
 2016-03-25 14:03:19.286394 | LWLockNamed   | WALBufMappingLock |     5
 2016-03-25 14:03:19.286394 | LWLockNamed   | CLogControlLock   |     1
........................................................................

How do we interpret these data? In the first row, we can see that the count for the tuple lock at 14:03:19 is 41. The pg_wait_sampling collector samples wait events every 10 ms, while the write_profile_log function writes a snapshot of the profile every 10 s. Thus, there were 1000 samples per backend during this period, or 10000 samples in total for the 10 backends serving pgbench. We can read the first row as "from 14:03:09 to 14:03:19, backends spent about 0.41% of their time waiting for a tuple lock".
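Following the same arithmetic, a rough percentage of backend time per wait event can be computed directly from profile_log (a sketch assuming the 10 ms sampling period, 10 s snapshot interval, and 10 pgbench backends described above):

SELECT ts, event_type, event,
       -- 1000 samples per backend per 10 s snapshot, times 10 backends
       round(count * 100.0 / (1000 * 10), 2) AS pct_of_backend_time
FROM profile_log
ORDER BY ts, count DESC;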

That's it. This blog post shows how you can set up wait event monitoring of your database using the pg_wait_sampling extension with PostgreSQL 9.6. This example is just an introduction and is simplified in many ways, but experienced DBAs can easily adapt it for their setups.

P.S. Every kind of monitoring has some overhead. The overhead of wait monitoring was the subject of hot debates in the mailing lists. This is why features like exposing wait event parameters and measuring each wait event individually are not yet in 9.6. But sampling also has overhead. I hope pg_wait_sampling will be a starting point to show by comparison that other approaches are not that bad, and that finally we will have something far more advanced for 9.7.

]]>
<![CDATA[Pg_pathman UPDATE and DELETE Support and Benchmark]]> 2016-03-18T12:20:00+03:00 https://akorotkov.github.io/blog/2016/03/18/pg_pathman-update-delete-benchmark/?utm_medium=social&utm_source=rss <![CDATA[

Recently, pg_pathman received support for UPDATE and DELETE queries. Because of some peculiarities of PostgreSQL query planner hooks, UPDATE and DELETE planning is accelerated only when a single partition is touched by the query. Otherwise, the regular slow inheritance query planning is used. However, the case when UPDATE or DELETE touches only one partition seems to be the most common and the most in need of optimization.

Also, I'd like to share a benchmark. This benchmark consists of operations on a journal table with about 1 M records for a year, partitioned by day. For sure, this is kind of a toy example, because nobody would split such a small amount of data into so many partitions, but it is still good for seeing the partitioning overhead. The performance of the following operations was compared:

  • Select a single row by its timestamp,
  • Select the data for a whole day (one whole partition),
  • Insert one row with a random timestamp,
  • Update one row with a random timestamp.

The following partitioning methods were compared:

  • Single table, no partitioning,
  • pg_partman extension,
  • pg_pathman extension.

Benchmarks were done on a server with 2 x Intel Xeon CPU X5675 @ 3.07GHz and 24 GB of memory, with fsync = off, in 10 threads. See the results below.

 Test name                  | single table, TPS | pg_partman, TPS | pg_pathman, TPS
----------------------------+-------------------+-----------------+-----------------
 Select one row             |             47973 |            1084 |           41775
 Select whole one partition |              2302 |             704 |            2556
 Insert one row             |             34401 |            7969 |           25859
 Update one row             |             32769 |             202 |           29289

I can make the following highlights from these results.

  • pg_pathman is dramatically faster than pg_partman, because pg_pathman uses planner hooks for faster query planning while pg_partman uses the built-in inheritance mechanism.
  • When selecting or updating a single row, pg_pathman is almost as fast as a plain table. The difference for insertion of a single row is slightly bigger, because a trigger is used for that.
  • The difference between pg_partman and pg_pathman when selecting a whole partition is not as dramatic as when selecting one row. This is because planning time becomes less significant in comparison with execution time.
  • Inserting random rows with pg_pathman is still much faster than with pg_partman, although both of them use a trigger on the parent relation. However, pg_pathman uses a fast C function for partition selection.
  • Selecting a whole partition when the table is partitioned by pg_pathman is slightly faster than selecting the same rows from a plain table. This is because a sequential scan was used for selecting the whole partition, while an index scan was used for selecting part of the plain table. When the amount of data is big and doesn't fit in cache, this difference is expected to be much bigger.

See this gist for the SQL scripts used for benchmarking.

  • create_*.sql creates the journal table using the various partitioning methods.
  • select_one.sql, select_day.sql, insert.sql and update.sql are pgbench scripts (a hedged sketch of one of them is given below).
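For readers who don’t want to open the gist, here is a rough sketch of what the “select one row by its timestamp” pgbench script could look like. The table layout (one record per minute of the year) and the column names are my assumptions made for illustration only; the actual scripts are in the gist.

select_one.sql (hypothetical sketch, pgbench 9.5 syntax)
\setrandom delta 0 525599
SELECT * FROM journal WHERE ts = '2015-01-01'::timestamp + :delta * interval '1 minute';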

P.S. This post is not a criticism of pg_partman. It was developed long before the extensibility mechanisms that pg_pathman uses were created, and it is a great extension which has served users for many years.

]]>
<![CDATA[How Does Pg_pathman Handle Filter Conditions?]]> 2016-03-14T11:10:00+03:00 https://akorotkov.github.io/blog/2016/03/14/pg_pathman-condition-processing/?utm_medium=social&utm_source=rss <![CDATA[

In my previous post I introduced pg_pathman as an extension which accelerates query planning over partitioned tables. In this post I would like to cover another aspect of pg_pathman: it not only produces plans faster, it also produces better plans. Thanks to this, query execution with pg_pathman becomes much faster in some cases.

When you query a partitioned table with some filter condition, pg_pathman adapts this filter to each individual partition. Therefore, each partition receives only those filter conditions which are useful to check against it.

Let me illustrate this with an example. At first, let’s see what happens with filter conditions when the PostgreSQL inheritance mechanism is used.

Let’s create a partitioned table using inheritance.

CREATE TABLE test (ts timestamp NOT NULL, title text);
CREATE INDEX test_ts_idx ON test (ts);
CREATE TABLE test_1 (LIKE test INCLUDING INDEXES, CHECK ( ts >= '2015-01-01' AND ts < '2015-02-01' )) INHERITS (test);
CREATE TABLE test_2 (LIKE test INCLUDING INDEXES, CHECK ( ts >= '2015-02-01' AND ts < '2015-03-01' )) INHERITS (test);
CREATE TABLE test_3 (LIKE test INCLUDING INDEXES, CHECK ( ts >= '2015-03-01' AND ts < '2015-04-01' )) INHERITS (test);
CREATE TABLE test_4 (LIKE test INCLUDING INDEXES, CHECK ( ts >= '2015-04-01' AND ts < '2015-05-01' )) INHERITS (test);
CREATE TABLE test_5 (LIKE test INCLUDING INDEXES, CHECK ( ts >= '2015-05-01' AND ts < '2015-06-01' )) INHERITS (test);
CREATE TABLE test_6 (LIKE test INCLUDING INDEXES, CHECK ( ts >= '2015-06-01' AND ts < '2015-07-01' )) INHERITS (test);

And then fill it with test data.

INSERT INTO test_1 (SELECT '2015-01-01'::timestamp + i * interval '1 minute', md5(i::text) FROM generate_series(0, 1440 * 31 - 1) i);
INSERT INTO test_2 (SELECT '2015-02-01'::timestamp + i * interval '1 minute', md5(i::text) FROM generate_series(0, 1440 * 28 - 1) i);
INSERT INTO test_3 (SELECT '2015-03-01'::timestamp + i * interval '1 minute', md5(i::text) FROM generate_series(0, 1440 * 31 - 1) i);
INSERT INTO test_4 (SELECT '2015-04-01'::timestamp + i * interval '1 minute', md5(i::text) FROM generate_series(0, 1440 * 30 - 1) i);
INSERT INTO test_5 (SELECT '2015-05-01'::timestamp + i * interval '1 minute', md5(i::text) FROM generate_series(0, 1440 * 31 - 1) i);
INSERT INTO test_6 (SELECT '2015-06-01'::timestamp + i * interval '1 minute', md5(i::text) FROM generate_series(0, 1440 * 30 - 1) i);

Then let’s try to select rows from two time intervals.

# EXPLAIN SELECT * FROM test WHERE (ts >= '2015-02-01' AND ts < '2015-03-15') OR (ts >= '2015-05-15' AND ts < '2015-07-01');
                                                                                                                                    QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Append  (cost=0.00..5028.22 rows=128059 width=41)
   ->  Seq Scan on test  (cost=0.00..0.00 rows=1 width=40)
         Filter: (((ts >= '2015-02-01 00:00:00'::timestamp without time zone) AND (ts < '2015-03-15 00:00:00'::timestamp without time zone)) OR ((ts >= '2015-05-15 00:00:00'::timestamp without time zone) AND (ts < '2015-07-01 00:00:00'::timestamp without time zone)))
   ->  Seq Scan on test_2  (cost=0.00..1183.40 rows=40320 width=41)
         Filter: (((ts >= '2015-02-01 00:00:00'::timestamp without time zone) AND (ts < '2015-03-15 00:00:00'::timestamp without time zone)) OR ((ts >= '2015-05-15 00:00:00'::timestamp without time zone) AND (ts < '2015-07-01 00:00:00'::timestamp without time zone)))
   ->  Bitmap Heap Scan on test_3  (cost=444.46..1266.02 rows=20178 width=41)
         Recheck Cond: (((ts >= '2015-02-01 00:00:00'::timestamp without time zone) AND (ts < '2015-03-15 00:00:00'::timestamp without time zone)) OR ((ts >= '2015-05-15 00:00:00'::timestamp without time zone) AND (ts < '2015-07-01 00:00:00'::timestamp without time zone)))
         ->  BitmapOr  (cost=444.46..444.46 rows=20178 width=0)
               ->  Bitmap Index Scan on test_3_ts_idx  (cost=0.00..430.07 rows=20178 width=0)
                     Index Cond: ((ts >= '2015-02-01 00:00:00'::timestamp without time zone) AND (ts < '2015-03-15 00:00:00'::timestamp without time zone))
               ->  Bitmap Index Scan on test_3_ts_idx  (cost=0.00..4.30 rows=1 width=0)
                     Index Cond: ((ts >= '2015-05-15 00:00:00'::timestamp without time zone) AND (ts < '2015-07-01 00:00:00'::timestamp without time zone))
   ->  Seq Scan on test_5  (cost=0.00..1310.80 rows=24360 width=41)
         Filter: (((ts >= '2015-02-01 00:00:00'::timestamp without time zone) AND (ts < '2015-03-15 00:00:00'::timestamp without time zone)) OR ((ts >= '2015-05-15 00:00:00'::timestamp without time zone) AND (ts < '2015-07-01 00:00:00'::timestamp without time zone)))
   ->  Seq Scan on test_6  (cost=0.00..1268.00 rows=43200 width=41)
         Filter: (((ts >= '2015-02-01 00:00:00'::timestamp without time zone) AND (ts < '2015-03-15 00:00:00'::timestamp without time zone)) OR ((ts >= '2015-05-15 00:00:00'::timestamp without time zone) AND (ts < '2015-07-01 00:00:00'::timestamp without time zone)))
(16 rows)

We can see that the filter condition was passed to each partition as is. But actually it could be simplified a lot. For instance, the table test_2 could be scanned without any filter condition at all, because all of its rows match. The filter condition for test_3 could be simplified to ts < '2015-03-15', so the BitmapOr would not be necessary.

Let’s try the same example with pg_pathman. First, create the test table and its partitions.

CREATE TABLE test (ts timestamp NOT NULL, title text);
CREATE INDEX test_ts_idx ON test (ts);
SELECT create_range_partitions('test', 'ts', '2015-01-01'::timestamp, '1 month'::interval, 6);

Then insert test data into the table. pg_pathman automatically creates a trigger which distributes the data between partitions, just like pg_partman does.

INSERT INTO test (SELECT '2015-01-01'::timestamp + i * interval '1 minute', md5(i::text) FROM generate_series(0, 1440 * 181 - 1) i);

And finally try the same query with pg_pathman.

# EXPLAIN SELECT * FROM test WHERE (ts >= '2015-02-01' AND ts < '2015-03-15') OR (ts >= '2015-05-15' AND ts < '2015-07-01');
                                     QUERY PLAN
------------------------------------------------------------------------------------
 Append  (cost=0.00..3248.59 rows=0 width=0)
   ->  Seq Scan on test_2  (cost=0.00..780.20 rows=0 width=0)
   ->  Index Scan using test_3_ts_idx on test_3  (cost=0.29..767.99 rows=0 width=0)
         Index Cond: (ts < '2015-03-15 00:00:00'::timestamp without time zone)
   ->  Seq Scan on test_5  (cost=0.00..864.40 rows=0 width=0)
         Filter: (ts >= '2015-05-15 00:00:00'::timestamp without time zone)
   ->  Seq Scan on test_6  (cost=0.00..836.00 rows=0 width=0)
(7 rows)

We can see that pg_pathman selects the same partitions, but the query plan becomes way simpler. Now test_2 is scanned without a useless filter condition, and test_3 is scanned using just the ts < '2015-03-15' filter condition. Thanks to this, a plain Index Scan is used instead of BitmapOr. Similar simplifications were applied to the rest of the partitions.

How is this simplification possible? The common fear here is that such a simplification could be computationally expensive in the general case. But since pg_pathman is intended to decrease query planning time, it’s very important to keep all transformations cheap and simple. And such a cheap and simple transformation algorithm really exists.

Let’s see how it works on a simple example. The filter condition (ts >= '2015-02-01' AND ts < '2015-03-15') OR (ts >= '2015-05-15' AND ts < '2015-07-01') has the following tree representation.

Leaf nodes of the tree are simple conditions. Non-leaf nodes are the logical operators which form complex conditions. For a particular partition, each filter condition (either simple or complex) falls into one of three classes.

  1. Filter condition is always true for rows of this partition (t). For instance, the condition ts >= '2015-04-15' is always true for the partition ts >= 2015-05-01 AND ts < 2015-06-01.

  2. Filter condition could be either true or false for rows of this partition (m). For instance, the condition ts >= '2015-03-15' could be either true or false for the partition ts >= 2015-03-01 AND ts < 2015-04-01.

  3. Filter condition is always false for rows of this partition (f). For instance, the condition ts <= '2015-02-01' is always false for the partition ts >= 2015-04-01 AND ts < 2015-05-01.

We can mark each tree node with a vector of classes showing how the corresponding condition is classified against each partition. These vectors can be filled bottom-up: first for the leaf nodes, and then for the non-leaf nodes using tri-state logic.

It’s evident that only conditions which could be either true or false (m) are useful for filtering. Conditions which are always true or always false shouldn’t be present in a partition’s filter. Using the produced tree we can now derive filter conditions for each partition (a small SQL sketch of the leaf-node classification follows the list below).

  1. For the ts >= 2015-01-01 AND ts < 2015-02-01 partition, the whole filter condition is false. So, skip it.

  2. For the ts >= 2015-02-01 AND ts < 2015-03-01 partition, the whole filter condition is true. So, scan it without a filter.

  3. For the ts >= 2015-03-01 AND ts < 2015-04-01 partition, the filter condition tree would be reduced into the following tree.

    Therefore, this partition will be scanned with the ts < '2015-03-15' filter.

  4. For the ts >= 2015-04-01 AND ts < 2015-05-01 partition, the whole filter condition is false. So, skip it.

  5. For the ts >= 2015-05-01 AND ts < 2015-06-01 partition, the filter condition tree would be reduced into the following tree.

    Therefore, this partition will be scanned with the ts >= '2015-05-15' filter.

  6. For the ts >= 2015-06-01 AND ts < 2015-07-01 partition, the whole filter condition is true. So, scan it without a filter.
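To make the leaf-node classification a bit more concrete, here is a minimal SQL sketch of it. This is not pg_pathman’s actual code (which is written in C and works over its cached partition metadata); the inline VALUES list below simply stands in for that metadata. It classifies the single condition ts < '2015-03-15' against three of the partitions above.

-- 't' = always true (drop from the filter), 'f' = always false,
-- 'm' = maybe (keep in the partition's filter)
SELECT p.part_name,
       CASE
           WHEN p.hi <= '2015-03-15'::timestamp THEN 't'
           WHEN p.lo >= '2015-03-15'::timestamp THEN 'f'
           ELSE 'm'
       END AS cond_class
FROM (VALUES ('test_2', '2015-02-01'::timestamp, '2015-03-01'::timestamp),
             ('test_3', '2015-03-01', '2015-04-01'),
             ('test_5', '2015-05-01', '2015-06-01')) AS p(part_name, lo, hi);

Non-leaf nodes then combine the classes of their children using tri-state AND/OR. Only the subtrees classified as m end up in a partition’s filter: if the whole tree is classified as t, the partition is scanned without a filter, and if it is classified as f, the partition is skipped.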

This is how filter condition processing works in pg_pathman. The explanation might be a bit exhausting to read, but I hope you feel enlightened by understanding how it works. Let me remind you that pg_pathman is an open-source extension for PostgreSQL 9.5 in the beta-release stage. I encourage everyone interested to try it and share feedback.

]]>
<![CDATA[Pg_pathman Beta Release]]> 2016-03-04T17:10:00+03:00 https://akorotkov.github.io/blog/2016/03/04/pg_pathman-beta-release/?utm_medium=social&utm_source=rss <![CDATA[

Partitioning in PostgreSQL is traditionally implemented using table inheritance. Table inheritance allows the planner to include in the plan only those child tables (partitions) which are compatible with the query. At the same time, a lot of partition management work remains on the user: creating inherited tables, writing a trigger which selects the appropriate partition for row insertion, etc. The pg_partman extension was written in order to automate this work. Also, there is upcoming work on declarative partitioning by Amit Langote for the PostgreSQL core.

At Postgres Professional we noticed a performance problem of inheritance-based partitioning. The problem is that the planner selects the child tables compatible with the query using a linear scan. Thus, a query which selects just one row from one partition can take much longer to plan than to execute. This fact discourages many users, and this is why we’re working on a new PostgreSQL extension: pg_pathman.

pg_pathman caches partition meta-information and uses the set_rel_pathlist hook in order to replace the mechanism of child table selection with its own. Thanks to this, a binary search over a sorted array is used for range partitioning and a hash table lookup for hash partitioning. Therefore, the time spent on partition selection appears to be negligible in comparison with forming the resulting plan nodes. See the postgrespro blog post for performance benchmarks.

pg_pathman is now in beta-release status and we encourage all interested users to try it and give us feedback. pg_pathman is compatible with PostgreSQL 9.5 and distributed under the PostgreSQL license. In the future we’re planning to enhance the functionality of pg_pathman with the following features.

  • Run-time selection of partitions using custom nodes (useful for nested loops and prepared statements);
  • Optimization of ordering output from partitioned tables (useful for merge join and order by);
  • Optimization of hash join when both tables are partitioned by join key;
  • LIST-partitioning;
  • HASH-partitioning by attributes of any hashable type.

Although pg_pathman is useful here and now, we want this functionality to eventually become part of the PostgreSQL core. This is why we are going to join the work on declarative partitioning by Amit Langote, which has excellent DDL infrastructure, and complement it with effective internal algorithms.

]]>
<![CDATA[Thoughts About Jsonb Statistics]]> 2015-09-07T11:30:00+03:00 https://akorotkov.github.io/blog/2015/09/07/jsonb_statistics/?utm_medium=social&utm_source=rss <![CDATA[

Introduction

Users of the jsonb datatype frequently complain that it lacks statistics. Currently, jsonb statistics is just the default scalar statistics, which is suitable for selectivity estimation of the <, <=, =, >=, > operators. But people search jsonb documents using the @> operator, expressions with the -> operator, jsquery etc. This is why the selectivity estimates people typically get for their queries are just a stub. This can lead to wrong query plans and bad performance, and it is what made us introduce hints in the jsquery extension.

Thus, the problem is clear. But the right solution is still unclear, at least to me. Let me discuss the obvious approaches to jsonb statistics and their limitations.

Collect just frequent paths

The first candidate for good selectivity estimation is the @> operator. Really, @> is a built-in operator with GIN index support. The first idea that comes to mind is to collect the most frequent paths and their frequencies as jsonb statistics. In order to understand the idea of paths better, let’s consider how GIN jsonb_path_ops works. jsonb_path_ops is a built-in GIN operator class which is the most suitable one for the jsonb @> operator.

A path is a sequence of key names, array indexes and a referenced value. For instance, the document {"a": [{"b": "xyz", "c": true}, 10], "d": {"e": [7, false]}} would be decomposed into the following set of paths.

"a".#."b"."xyz"
"a".#."c".true
"a".#.10
"d"."e".#.7
"d"."e".#.false

In this representation of paths, array indexes are replaced with #. That makes the search agnostic to them, just like the @> operator is. Thus, once we have such a decomposition, we can say that if a @> b then the paths of a are a superset of the paths of b. If we intersect the posting lists of the search argument’s paths, we get a list of candidates for the search result. This is how jsonb_path_ops works.

The same idea could be applied to jsonb statistics. We could decompose each jsonb document into a set of paths and then collect the frequencies of the most common individual paths. Such statistics fits the current PostgreSQL system catalog perfectly and looks very similar to the statistics for tsvectors and arrays, which are decomposed into lexemes and elements correspondingly. Such most-common-paths statistics could look like the following table.

Path            | Frequency
----------------+----------
"a".#."b"."xyz" | 0.55
"d"."e".#.77    | 0.43
"a".#."b"."def" | 0.35
"d"."e".#.100   | 0.22
"d"."f".true    | 0.1

Having such statistics, we can estimate the selectivity of the @> operator as the product of the frequencies of the search argument’s paths. For paths which are not in the most common list, we can use some default “rare frequency”. Also, we use the quite rough assumption that path appearances are independent. Let’s be honest: this assumption is just wrong. However, this is a typical assumption we have to make during query planning. Finally, we don’t need an absolutely accurate cost: getting the order of magnitude right can be considered a quite good result.
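For the statistics above, a query decomposed into the paths "a".#."b"."xyz" and "d"."e".#.77 would get the estimate 0.55 * 0.43 ≈ 0.24. A minimal SQL sketch of this rule is below; the path_stats table and the default frequency of 0.001 are assumptions of mine, not an existing catalog.

-- Hedged sketch: selectivity of @> as a product of path frequencies,
-- assuming a hypothetical table path_stats(path text, freq float8)
-- filled with the most common paths from the table above.
SELECT exp(sum(ln(coalesce(s.freq, 0.001)))) AS estimated_selectivity
FROM unnest(ARRAY['"a".#."b"."xyz"', '"d"."e".#.77']) AS q(path)
LEFT JOIN path_stats s ON s.path = q.path;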

There is also another source of inaccuracy I’d like to mention. Let’s consider an example.

a = [{"x": [1]}, {"x": [2]}]
b = [{"x": [1,2]}]

Both a and b are decomposed into the same set of paths.

#."x".1
#."x".2

However, neither a @> b nor b @> a holds. Since we ignored array indexes in the paths, we also ignore whether values belong to the same array element or not. This leads to false positives in GIN and to overestimations by the statistics.

This approach is not limited to the @> operator only. We can produce estimates for queries with complex logic. An example in jsquery could be ("abc" = 1 OR "xyz".# = "hij") AND NOT "def" = false.

However, such statistics can hardly estimate the selectivity of the <, <=, >=, > operators over jsonb values. For instance, in order to estimate the jsquery "x" > 1 we can only count the most common paths which match this condition, but we lack histograms. This is a serious obstacle to getting accurate estimates, and it makes us search for a better solution.

Collect scalar statistics for each key path

Another idea for jsonb statistics comes from the assumption that almost every “schemaless” dataset can be easily represented by a schema of tables. Assuming this, we would like our selectivity estimates for searches in jsonb documents to be as good as those for searches in plain tables.

Let’s consider this on an example. The following json document could represent the information about an order in e-commerce.

{
  "id": 1,
  "contact": "John Smith",
  "phone": "212 555-1234",
  "address": "10021-3100, 21 2nd Street, New York",
  "products":
  [
    {
      "article": "XF56120",
      "name": "Sunglasses",
      "price": 500,
      "quantity": 1
    },
    {
      "article": "AT10789",
      "name": "T-Shirt",
      "price": 100,
      "quantity": 2
    }
  ]
}

The same information could be represented by the following pair of tables.

id | contact    | phone        | address
---+------------+--------------+-------------------------------------
 1 | John Smith | 212 555-1234 | 10021-3100, 21 2nd Street, New York

order_id | article | name       | price | quantity
---------+---------+------------+-------+---------
       1 | XF56120 | Sunglasses |   500 |        1
       1 | AT10789 | T-Shirt    |   100 |        2

What kind of statistics would be collected by PostgreSQL in the second case? It would be the most common values and a histogram for each attribute. Most common values (MCVs) are the values which occur in the column most frequently; the frequencies of those values are collected and stored as well. The histogram is described by an array of bounds, where each bound is assumed to contain an equal number of column values, excluding MCVs (a so-called equi-depth histogram).
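If the data really were stored in such plain tables, these statistics could be inspected directly through the pg_stats system view; the table and column names below refer to the hypothetical tables of this example.

-- View the MCVs and histogram bounds that ANALYZE would collect
-- for the hypothetical "product" table from the example above.
SELECT tablename, attname, most_common_vals, most_common_freqs, histogram_bounds
FROM pg_stats
WHERE tablename = 'product' AND attname IN ('price', 'quantity');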

With some simplification such statistics could be represented in the following table.

Table   | Attribute | Most common values                                | Histogram
--------+-----------+---------------------------------------------------+------------------------------------------------------
order   | contact   | {"John Smith": 0.05, "James Johnson": 0.01}       | ["Anthony Anderson", "Lisa Baker", "Sandra Phillips"]
product | price     | {"100": 0.1, "10": 0.08, "50": 0.05, "150": 0.03} | [0, 12.5, 45.5, 250, 1000]
product | quantity  | {"1": 0.5, "2": 0.2, "3": 0.05, "5": 0.01}        | [0, 4, 7, 9, 10]
...     | ...       | ...                                               | ...

What if we replace table and attribute with the path of keys at which the corresponding value can be found in the json document?

Key path            | Most common values                                | Histogram
--------------------+---------------------------------------------------+------------------------------------------------------
contact             | {"John Smith": 0.05, "James Johnson": 0.01}       | ["Anthony Anderson", "Lisa Baker", "Sandra Phillips"]
products.#.price    | {"100": 0.1, "10": 0.08, "50": 0.05, "150": 0.03} | [0, 12.5, 45.5, 250, 1000]
products.#.quantity | {"1": 0.5, "2": 0.2, "3": 0.05, "5": 0.01}        | [0, 4, 7, 9, 10]
...                 | ...                                               | ...

This kind of statistics seems to be comprehensive enough. It could produce fine estimates for queries like products.#.price > 100.

However, there is still a bunch of open problems here.

  • Typical json documents we meet in applications are really well structured, like the example above. However, there are cases when they are not. At first, someone could easily put values into keys. Let me illustrate this with the following example: products becomes an object where the article is used as a key.

    In this case we would find that the cardinality of key paths is very high, so we would be unable to collect suitable statistics for each key path. However, we could consider such a situation a user mistake and advise users to restructure their documents.

    There are still kinds of documents which don’t fit this model not because of a user mistake but because of their nature. Imagine json-formatted query plans stored in a table. Plans could have unlimited levels of nesting, and correspondingly the cardinality of key paths could be very high.

{
  "id": 1,
  "contact": "John Smith",
  "phone": "212 555-1234",
  "address": "10021-3100, 21 2nd Street, New York",
  "products":
  {
    "XF56120":
    {
      "name": "Sunglasses",
      "price": 500,
      "quantity": 1
    },
    "AT10789":
    {
      "name": "T-Shirt",
      "price": 100,
      "quantity": 2
    }
  }
}
  • Some objects stored inside jsonb documents could require special statistics. For instance, point coordinates could be represented in json as {"x": 11.3, "y": 27.0}. But the statistics we would need in this case is not separate statistics for x and y; we would need something special for geometrical objects, like 2D histograms.

  • Another problem is fitting this model into the PostgreSQL system catalog. pg_statistic assumes that the statistics for an attribute is represented by a few arrays. However, in this model we have to store a few arrays per key path. For sure, we could do a trick by storing an array of jsonb or something like that, but it would be a kludge. It would be nice to store each key path in a separate row of pg_statistic, though this would require significant changes in statistics handling.

Conclusion

These were just my current thoughts about jsonb statistics. Probably someone will come up with much better ideas. But I’m not sure we can find an ideal solution which would fit everyone’s needs. We can see that the current developments in multivariate statistics use a pluggable approach: the user can turn on a specific method for a specific set of columns. We could end up with something similar for jsonb: simple basic statistics plus various kinds of pluggable statistics for specific needs.

]]>
<![CDATA[Psql Command to Attach Gdb to Backend]]> 2015-08-26T18:00:00+03:00 https://akorotkov.github.io/blog/2015/08/26/psql-gdb-attach/?utm_medium=social&utm_source=rss <![CDATA[

While hacking PostgreSQL it’s very useful to know the pid of the backend you are working with. You need the pid of the process to attach a debugger, profiler etc. Luckily, .psqlrc provides us an elegant way to define shortcuts for psql. Using the config line below, one can find out the backend pid just by typing :pid.

.psqlrc
\set pid 'SELECT pg_backend_pid();'
=# :pid
 pg_backend_pid
----------------
          99038
(1 row)

In 9.6 it becomes possible to even include the backend pid into the psql prompt.
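For example, something like this in .psqlrc should do it: the %p escape prints the pid of the connected backend, while the rest of the prompt layout is just my preference.

.psqlrc
\set PROMPT1 '%p %/%R%# '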

However, it’s possible to automate more complex actions in psql. I’ve configured my psql to run gdb attached to the current backend in a new tab of iTerm2 just by typing :gdb.

The :gdb command selects the pid of the current backend and pipes it to the input of the pg_debug script.

.psqlrc
\set gdb 'SELECT pg_backend_pid() \\g |pg_debug'

pg_debug extracts the pid from its input and then runs an OSA script which starts gdb in a new tab of iTerm2.

pg_debug
#!/bin/bash

IFS=''

while read line
do
	# Extended display off
	if [[ $line =~ ^\ +([0-9]+) ]]; then
		PID=${BASH_REMATCH[1]}
		break
	fi
	# Extended display on
	if [[ $line =~ ^pg_backend_pid.*\ ([0-9]+) ]]; then
		PID=${BASH_REMATCH[1]}
		break
	fi
done

# Open gdb session
osascript -e "
tell application \"iTerm\"
	activate
	tell the current terminal
		set mysession to (the current session)
		launch session \"Default Session\"
		tell the last session
			write text \"gdb --pid=$PID -x <(echo continue)\"
		end tell
		select mysession
	end tell
end tell"

This script works for Mac OS X and iTerm2, but the same approach should work for other platforms and terminal emulators.
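For instance, on Linux the osascript part of pg_debug could be replaced with something like the line below. This is an untested sketch of mine; xterm is just an example, so substitute your favorite terminal emulator and its flags.

# Hedged sketch: open gdb in a new xterm window instead of an iTerm2 tab,
# reusing the $PID variable extracted by the loop above.
xterm -e bash -c "gdb --pid=$PID -x <(echo continue)" &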

]]>