SlideShare a Scribd company logo
Indexing	
  Complex	
  PostgreSQL	
  
Data	
  Types
Jonathan	
  S.	
  Katz	
  -­‐	
  PGDay	
  UK	
  2014	
  -­‐	
  July	
  9,	
  2014
About
• Jonathan	
  S.	
  Katz	
  
– CTO,	
  VenueBook	
  
– Co-­‐Organizer,	
  NYC	
  PostgreSQL	
  User	
  Group	
  
– PGConf	
  NYC	
  2015	
  
• Mar	
  25	
  -­‐	
  27,	
  2015	
  
• New	
  York	
  Marriott	
  Downtown	
  
• http://nyc.pgconf.us	
  
– @jkatz05
2
Quick	
  Overview
• Introductory	
  Talk	
  with	
  demos	
  and	
  fun	
  
• B-­‐trees	
  
• GiST:	
  Generalized	
  Search	
  Trees	
  
• GIN:	
  Generalized	
  Inverted	
  Index	
  
• SP-­‐GiST:	
  Space	
  Partitioned	
  Generalized	
  Search	
  
Trees
3
Assumptions
• PostgreSQL	
  9.3+	
  
• most	
  will	
  be	
  9.0+	
  
• PostGIS	
  2.0+	
  
• Believe	
  it	
  will	
  work	
  for	
  most	
  available	
  
versions	
  
• PostgreSQL	
  9.4	
  beta?
4
The	
  Beginning
5
The	
  Beginning
6
4
3
8
7
2
9
1
6
10
6
Find	
  3
7
4
3
8
7
2
9
1
6
10
5
Find	
  3
8
4
3
8
7
2
9
1
6
10
5
Find	
  3
9
4
3
8
7
2
9
1
6
10
5
Find	
  10
10
4
3
8
7
2
9
1
6
10
5
Find	
  10
11
4
3
8
7
2
9
1
6
10
5
Find	
  10
12
4
3
8
7
2
9
1
6
10
5
Find	
  10
13
4
3
8
7
2
9
1
6
10
5
Find	
  10
14
4
3
8
7
2
9
1
6
10
5
Find	
  10
15
4
3
8
7
2
9
1
6
10
5
Find	
  10
16
4
3
8
7
2
9
1
6
10
5
Find	
  10
17
4
3
8
7
2
9
1
6
10
5
Find	
  10
18
4
3
8
7
2
9
1
6
10
5
Find	
  10
19
4
3
8
7
2
9
1
6
10
5
What	
  We	
  Learned
• Without	
  any	
  data	
  structure	
  around	
  search,	
  we	
  
rely	
  on	
  "hope"	
  
• Assumed	
  "unique	
  values"	
  and	
  "equality"	
  
– would	
  have	
  to	
  scan	
  all	
  rows	
  otherwise	
  
• …and	
  what	
  about:	
  
– INSERT	
  
– UPDATE	
  
– DELETE
20
What	
  We	
  Need
• Need	
  a	
  data	
  structure	
  for	
  search	
  that:	
  
– allows	
  efficient	
  lookups	
  
– plays	
  nicely	
  with	
  disk	
  I/O	
  
– does	
  not	
  take	
  too	
  long	
  for	
  updates	
  
21
B-­‐Trees
• "default"	
  index	
  
• quick	
  traversal	
  to	
  leaf	
  nodes	
  
• leaf	
  nodes	
  pre-­‐sorted	
  
• node	
  size	
  designed	
  to	
  fit	
  in	
  disk	
  block	
  size	
  
– "degree"	
  of	
  nodes	
  has	
  max-­‐size	
  
• theoretical	
  performance	
  
– reads:	
  O(log	
  n)	
  
– writes:	
  O(log	
  n)	
  
– space:	
  O(n)
22
B-­‐Trees	
  and	
  PostgreSQL
• supports	
  
– <=,	
  <,	
  =,	
  >,	
  >=	
  
– BETWEEN,	
  IN	
  
– IS	
  NOT	
  NULL,	
  IS	
  NULL	
  
– LIKE	
  in	
  specific	
  case	
  of	
  ‘plaintext%’	
  
– ~	
  in	
  specific	
  case	
  of	
  ‘^plaintext’	
  
– ILIKE	
  and	
  ~*	
  if	
  pattern	
  starts	
  with	
  nonalpha	
  characters	
  
• does	
  not	
  support	
  
• IS	
  NOT	
  DISTINCT	
  FROM
23
B-­‐Trees	
  and	
  PostgreSQL
• data	
  types	
  supported	
  
– any	
  data	
  type	
  with	
  all	
  the	
  equality	
  operators	
  defined	
  
– number	
  types	
  
• integer,	
  numeric,	
  decimal	
  
– text	
  
• char,	
  varchar,	
  text	
  
– date	
  /	
  times	
  
• timestamptz,	
  timestamp,	
  date,	
  time,	
  timetz,	
  interval	
  
- arrays,	
  ranges
24
Demo	
  Specs	
  +	
  Configuration
• Hardware	
  specs	
  
– 2.3GHz	
  Intel	
  i7	
  2x4	
  core	
  
– 16GB	
  RAM	
  DDR3	
  1600MHz	
  
– Apple	
  SSD	
  6Gbps	
  
• postgresql.conf	
  (9.3,	
  9.4	
  beta	
  1)	
  
– shared_buffers	
  =	
  1GB	
  
– work_mem	
  =	
  64MB	
  
– maintenance_work_mem	
  =	
  1024MB	
  
– effective_cache_size	
  =	
  8GB
25
Demo	
  #1:	
  Basic	
  Indexing	
  Plans
26
Demo	
  #1	
  Notes
• Index	
  maintenance	
  
• VACUUM	
  –	
  "cleans	
  up"	
  after	
  writes	
  on	
  table	
  /
indexes	
  
– ANALYZE	
  –	
  keeps	
  statistics	
  up-­‐to-­‐date	
  for	
  planner	
  
!
VACUUM ANALYZE tablename;	
  
!
• Good	
  idea	
  to	
  leave	
  autovacuum	
  on
27
Indexing	
  in	
  Production
• CREATE	
  INDEX	
  CONCURRENTLY	
  
• REINDEX	
  
– corruption,	
  bloat,	
  invalid	
  
• FILLFACTOR	
  
– 10	
  –	
  100	
  
– default:	
  90	
  
– strategy:	
  lower	
  %	
  ::	
  write-­‐activity	
  
• TABLESPACE	
  
• NULLS	
  LAST,	
  NULLS	
  FIRST
28
Demo	
  #2:	
  Partial	
  Indexes
29
CREATE INDEX indexname ON tablename (columnname)
WHERE somecondition;
Demo	
  #2	
  Notes
• Partial	
  Indexes	
  are	
  
– good	
  if	
  known	
  to	
  query	
  limited	
  subset	
  of	
  table	
  
– take	
  up	
  less	
  space	
  
– allow	
  for	
  much	
  quicker	
  writes	
  
• Like	
  all	
  good	
  things,	
  do	
  not	
  overuse	
  and	
  
saturate	
  your	
  I/O
30
Unique	
  Indexes
• only	
  for	
  B-­‐trees	
  
• NULL	
  not	
  unique	
  
• use	
  UNIQUE	
  constraints	
  –	
  automatically	
  create	
  
indexes	
  
!
CREATE TABLE foo (bar int UNIQUE);	
  
-- or	
  
CREATE UNIQUE INDEX foo_bar_idx ON foo (bar);	
  
ALTER TABLE foo ADD CONSTRAINT a_unique
USING INDEX a_idx;
31
Multi-­‐Column	
  Indexes
• Useful	
  for	
  
– querying	
  two	
  columns	
  that	
  are	
  frequently	
  queried	
  
together	
  
– enforcing	
  UNIQUEness	
  across	
  columns	
  
• n.b.	
  creating	
  UNIQUE	
  constraint	
  on	
  table	
  creates	
  UNIQUE	
  
INDEX	
  
• PostgreSQL	
  supports	
  
– up	
  to	
  32	
  columns	
  
– B-­‐tree,	
  GiST,	
  GIN	
  
• Be	
  careful	
  of	
  how	
  you	
  choose	
  initial	
  column	
  order!
32
Multi-­‐Column	
  Indexes
33
CREATE INDEX multicolumn_idx ON
tablename (col1, col2);
!
!
!
CREATE UNIQUE INDEX pn_idx ON
phone_numbers (country_code, national_number)
WHERE extension IS NULL
Demo	
  #3	
  Notes
• Multi-­‐column	
  indexes	
  can	
  be	
  
– efficient	
  for	
  speed	
  +	
  space	
  
– inefficient	
  with	
  performance	
  
• Usage	
  depends	
  on	
  your	
  application	
  needs
34
Expression	
  Indexes
• can	
  index	
  on	
  expressions	
  to	
  speed	
  up	
  lookups	
  
– e.g.	
  case	
  insensitive	
  email	
  addresses	
  
– can	
  use	
  functions	
  or	
  scalars	
  
• (x * y) / 100	
  
• COALESCE(first_name, '') || ' ' ||
COALESCE(last_name, '')	
  
• LOWER(email_address)	
  
• tradeoff:	
  slower	
  writes
35
Demo	
  #4:	
  Expression	
  Indexes
36
Demo	
  #4	
  Notes
• fast	
  lookups,	
  slow	
  updates	
  
• size	
  on	
  disk	
  
• cool	
  data	
  types	
  
– geometric	
  
– JSON
37
Geometric	
  Data	
  Types
38
Geometric	
  Data	
  Types
CREATE TABLE points (coord point);
!
CREATE INDEX points_idx ON points (coord);
ERROR: data type point has no default
operator class for access method "btree"
HINT: You must specify an operator class
for the index or define a default operator
class for the data type.
39
GiST
• "generalized	
  search	
  tree"	
  
• infrastructure	
  that	
  provides	
  template	
  to	
  create	
  arbitrary	
  
indexing	
  schemes	
  
– supports	
  concurrency,	
  logging,	
  searching	
  –	
  only	
  have	
  to	
  define	
  
behavior	
  
– user-­‐defined	
  operator	
  class	
  
• <<,	
  &<,	
  &>,	
  >>,	
  <<|,	
  &<|,	
  |&>,	
  |>>,	
  @>,	
  <@,	
  ~=,	
  &&	
  
– have	
  to	
  implement	
  functions	
  in	
  interface	
  
• supports	
  lossless	
  +	
  lossy	
  indexes	
  
• provides	
  support	
  for	
  "nearest-­‐neighbor"	
  queries	
  –	
  
"KNN-­‐Gist"
40
CREATE INDEX points_coord_gist_idx ON points
USING gist(coord)
Demo	
  #5:	
  Geometric	
  GiST	
  +	
  KNN-­‐Gist
41
Demo	
  #5	
  Notes
• GiST	
  indexes	
  on	
  geometric	
  types	
  radically	
  
speedup	
  reads	
  
• Writes	
  are	
  slower	
  due	
  to	
  distance	
  calculation	
  
• Index	
  size	
  can	
  be	
  very	
  big
42
PostGIS
• For	
  when	
  you	
  are	
  doing	
  real	
  things	
  with	
  shapes
43• (and	
  geographic	
  information	
  systems)
PostGIS	
  +	
  Indexes
• B-­‐Tree?	
  
• R-­‐Tree?	
  
• PostGIS	
  docs	
  do	
  not	
  recommend	
  using	
  just	
  
an	
  R-­‐Tree	
  index	
  
• GiST	
  
• overlaps!	
  	
  containment!	
  
• uses	
  a	
  combination	
  of	
  GiST	
  +	
  R-­‐Tree
44
PostGIS	
  +	
  GiST
45
2-­‐D
CREATE INDEX zipcodes_geom_gist_idx ON zipcodes
USING gist(geom);
N-­‐D	
  (PostGIS	
  2.0+
CREATE INDEX zipcodes_geom_gist_idx ON zipcodes
USING gist(geom gist_geometry_ops_nd);
Example	
  -­‐	
  USA	
  Zipcode	
  Boundaries
• 33,120	
  rows	
  
• geom:	
  MultiPolygon	
  
• 52MB	
  without	
  indexes	
  
• With	
  geometry	
  GiST	
  +	
  integer	
  B-­‐Tree:	
  869MB
46
What	
  Zipcode	
  Is	
  My	
  Office	
  In?
• Geocoded	
  Address	
  
• Lat,Long	
  =	
  40.7356197,-­‐73.9891102	
  
• PostGIS:	
  POINT(-­‐73.9891102	
  40.7356197)	
  
• 4269	
  -­‐	
  “SRID”	
  -­‐	
  unique	
  ID	
  for	
  coordinate	
  system	
  
definitions
47
SELECT zcta5ce10 AS zipcode
FROM zipcodes
WHERE ST_Contains(
geom, --MultiPolygon
ST_GeomFromText('POINT(-73.9891102 40.7356197)', 4269)
);
What	
  Zipcode	
  Is	
  My	
  Office	
  In?
• No	
  Index
48
Seq Scan on zipcodes (cost=0.00..15382.00 rows=1 width=6) (actual
time=64.780..5153.485 rows=1 loops=1)
Filter: ((geom &&
'0101000020AD100000F648DE944D7F52C08CE54CC9285E4440'::geometry) AND
_st_contains(geom,
'0101000020AD100000F648DE944D7F52C08CE54CC9285E4440'::geometry))
Rows Removed by Filter: 33119
Total runtime: 5153.505 ms
What	
  Zipcode	
  Is	
  My	
  Office	
  In?
• Here’s	
  the	
  GiST:
49
Index Scan using zipcodes_geom_gist on zipcodes (cost=0.28..8.54
rows=1 width=6) (actual time=0.120..0.207 rows=1 loops=1)
Index Cond: (geom &&
'0101000020AD100000F648DE944D7F52C08CE54CC9285E4440'::geometry)
Filter: _st_contains(geom,
'0101000020AD100000F648DE944D7F52C08CE54CC9285E4440'::geometry)
Rows Removed by Filter: 1
!
Total runtime: 0.235 ms
PostGIS	
  +	
  GiST	
  Conclusion
• 5153ms	
  =>	
  0.235ms	
  =	
  21,900x	
  speedup	
  :-­‐)
50
SELECT zcta5ce10 AS zipcode
FROM zipcodes
WHERE ST_Contains(
geom,
ST_GeomFromText('POINT(-73.9891102 40.7356197)',
4269)
);
!
zipcode
---------
10003
Full	
  Text	
  Search
• PostgreSQL	
  offers	
  full	
  text	
  search	
  with	
  the	
  
tsearch2	
  engine	
  
– algorithms	
  for	
  performing	
  FTS	
  
– to_tsvector('english',	
  content)	
  @@	
  
to_tsquery('irish	
  &	
  conference	
  |	
  meeting')	
  
– provides	
  indexing	
  capabilities	
  for	
  efficient	
  search
51
Test	
  Data	
  Set
• Wikipedia	
  English	
  category	
  titles	
  –	
  all	
  
1,823,644	
  that	
  I	
  downloaded
52
Full-­‐Text	
  Search:	
  Basics
53
SELECT title
FROM category
WHERE
to_tsvector('english', title) @@ to_tsquery('united & kingdom’);
!
title
-----
Lists of railway stations in the United Kingdom
Political history of the United Kingdom
Military of the United Kingdom
United Kingdom constitution
Television channels in the United Kingdom
United Kingdom
Roman Catholic secondary schools in the United Kingdom
[results truncated]
!
!
QUERY PLAN
------------
Seq Scan on category (cost=0.00..49262.77 rows=46 width=29) (actual time=21.900..16809.890
rows=8810 loops=1)
Filter: (to_tsvector('english'::regconfig, title) @@ to_tsquery('united & kingdom'::text))
Rows Removed by Filter: 1814834
!
Total runtime: 16811.108 ms
Full-­‐Text	
  Search	
  +	
  GiST
54
CREATE INDEX category_title_gist_idx ON category
USING gist(to_tsvector('english', title));
!
SELECT title
FROM category
WHERE to_tsvector('english', title) @@ to_tsquery('united & kingdom');
QUERY PLAN
-------------
Bitmap Heap Scan on category (cost=4.77..182.47 rows=46 width=29) (actual time=75.517..180.650
rows=8810 loops=1)
Recheck Cond: (to_tsvector('english'::regconfig, title) @@ to_tsquery('united &
kingdom'::text))
-> Bitmap Index Scan on category_title_gist_idx (cost=0.00..4.76 rows=46 width=0) (actual
time=74.687..74.687 rows=8810 loops=1)
Index Cond: (to_tsvector('english'::regconfig, title) @@ to_tsquery('united &
kingdom’::text))
!
Total runtime: 181.354 ms
Full	
  Text	
  Search	
  +	
  GiST
• GiST	
  indexes	
  can	
  produce	
  false	
  positives	
  
– "documents"	
  represented	
  by	
  fixed	
  length	
  signature	
  
• words	
  are	
  hashed	
  into	
  single	
  bits	
  and	
  concatenated	
  
– when	
  false	
  positive	
  occurs,	
  row	
  is	
  returned	
  and	
  
checked	
  to	
  see	
  if	
  false	
  match	
  
• Extra	
  validations	
  =	
  performance	
  degradation
55
Performance	
  Summary	
  with	
  GiST
• initial	
  index	
  build	
  takes	
  awhile	
  =>	
  slow	
  writes	
  
• reads	
  are	
  quick	
  
• Table	
  size:	
  271MB	
  
• Index	
  size:	
  83MB
56
GIN	
  Index
• "generalized	
  inverted	
  index"	
  
• supports	
  searching	
  within	
  composite	
  data	
  
– arrays,	
  full-­‐text	
  documents,	
  hstore	
  
• key	
  is	
  stored	
  once	
  and	
  points	
  to	
  composites	
  it	
  is	
  
contained	
  in	
  
• like	
  GiST,	
  provides	
  index	
  infrastructure	
  to	
  extend	
  GIN	
  
based	
  on	
  behavior	
  
– supports	
  operators	
  <@,	
  @>,	
  =,	
  &&	
  
• GIN	
  performance	
  ⬄	
  log(#	
  unique	
  things)
57
Full	
  Text	
  Search	
  +	
  GIN
58
CREATE INDEX category_title_gin_idx ON category
USING gin(to_tsvector('english', title));
!
EXPLAIN ANALYZE SELECT title FROM category WHERE to_tsvector('english',
title) @@ to_tsquery('united & kingdom');
!
QUERY PLAN
-------
Bitmap Heap Scan on category (cost=28.36..206.06 rows=46 width=29) (actual time=8.864..14.674
rows=8810 loops=1)
Recheck Cond: (to_tsvector('english'::regconfig, title) @@ to_tsquery('united &
kingdom'::text))
-> Bitmap Index Scan on category_title_gin_idx (cost=0.00..28.35 rows=46 width=0) (actual
time=7.905..7.905 rows=8810 loops=1)
Index Cond: (to_tsvector('english'::regconfig, title) @@ to_tsquery('united &
kingdom'::text))
!
!
Total runtime: 15.157 ms
Performance	
  Summary	
  with	
  GIN
• index	
  build	
  was	
  much	
  quicker	
  
• significant	
  speedup	
  from	
  no	
  index	
  
– (12,000ms	
  =>	
  15ms)	
  
• significant	
  speedup	
  from	
  GiST	
  
– (181ms	
  =>	
  15ms)	
  
• Table	
  size:	
  271MB	
  
• Index	
  size:	
  
• 9.3:	
  71MB	
  
• 9.4	
  beta	
  1:	
  	
  40MB
59
What	
  Was	
  Not	
  Discussed
• Word	
  density	
  
– prior	
  to	
  9.3,	
  performance	
  issues	
  with	
  greater	
  word	
  
density	
  
• Type	
  of	
  text	
  data	
  –	
  phrases	
  vs	
  paragraphs
60
Full	
  Text	
  Search	
  –	
  GiST	
  vs	
  GIN
• Reads	
  
– overall,	
  GIN	
  should	
  win	
  
• Writes	
  
– traditionally,	
  GiST	
  has	
  better	
  performance	
  for	
  
writes	
  
– GIN	
  
• FASTUPDATE	
  
• 9.4:	
  compression
61
Regular	
  Expression	
  Indexes
• Added	
  in	
  9.3	
  
• Support	
  for	
  LIKE/ILIKE	
  wildcard	
  indexes	
  in	
  9.1	
  
– title	
  LIKE	
  '%ab%e'	
  
• Uses	
  pg_trgm	
  extension	
  +	
  GIN	
  
!
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX category_title_regex_idx ON
category USING GIN(title gin_trgm_ops);
62
Regular	
  Expressions	
  -­‐	
  No	
  Index
63
EXPLAIN ANALYZE SELECT title FROM category
WHERE title ~ '(([iI]sland(s)?)|([pP]eninsula))$';
!
QUERY PLAN
----------
Seq Scan on category (cost=0.00..40144.55 rows=182 width=29) (actual
time=2.509..4260.792 rows=5878 loops=1)
Filter: (title ~ '(([iI]sland(s)?)|([pP]eninsula))$'::text)
Rows Removed by Filter: 1817766
!
Total runtime: 4261.204 ms
Regular	
  Expressions	
  -­‐	
  Indexed
64
CREATE INDEX category_title_regex_idx ON category
USING gin(title gin_trgm_ops);
!
EXPLAIN ANALYZE SELECT title FROM category
WHERE title ~ '(([iI]sland(s)?)|([pP]eninsula))$';
QUERY PLAN
-----------
Bitmap Heap Scan on category (cost=197.41..871.77 rows=182 width=29) (actual
time=107.445..146.713 rows=5878 loops=1)
Recheck Cond: (title ~ '(([iI]sland(s)?)|([pP]eninsula))$'::text)
Rows Removed by Index Recheck: 4712
-> Bitmap Index Scan on category_title_regex_idx (cost=0.00..197.37 rows=182 width=0) (actual
time=106.645..106.645 rows=10590 loops=1)
Index Cond: (title ~ '(([iI]sland(s)?)|([pP]eninsula))$'::text)
!
!
Total runtime: 147.026 ms
Range	
  Types
• stores	
  range	
  data	
  
– 	
  1	
  to	
  6	
  
– 2013-­‐10-­‐29	
  –	
  2013-­‐11-­‐2	
  
• easy-­‐to-­‐use	
  operators	
  to	
  check	
  inclusion,	
  
overlaps	
  
• built-­‐in	
  types:	
  integers,	
  numerics,	
  dates,	
  
timestamps	
  
• extensible
65
Range	
  Type	
  Examples
66
--find all ranges that overlap with [100, 200)
!
SELECT * FROM ranges WHERE int4range(100, 200) && range;
!
range
-----------
[10,102)
[13,102)
[18,108)
[32,101)
[34,134)
[37,123)
[43,111)
[46,132)
[48,107)
[results trunctated]
!QUERY PLAN
-----------
Seq Scan on ranges (cost=0.00..14239.86 rows=7073 width=32) (actual time=0.018..185.411 rows=143 loops
Filter: ('[100,200)'::int4range && range)
Rows Removed by Filter: 999857
!
Total runtime: 185.439 ms
Range	
  Types	
  +	
  GiST
67
CREATE INDEX ranges_range_gist_idx ON ranges USING gist(range);
!
EXPLAIN ANALYZE SELECT * FROM ranges WHERE
int4range(100, 200) && range;
!
QUERY PLAN
------------
Bitmap Heap Scan on ranges (cost=5.29..463.10 rows=130 width=13)
(actual time=0.120..0.135 rows=144 loops=1)
Recheck Cond: ('[100,200)'::int4range && range)
-> Bitmap Index Scan on ranges_range_gist_idx (cost=0.00..5.26
rows=130 width=0) (actual time=0.109..0.109 rows=144 loops=1)
Index Cond: ('[100,200)'::int4range && range)
!
!
Total runtime: 0.168 ms
SP-­‐GiST
• space-­‐partitioned	
  generalized	
  search	
  tree	
  
• ideal	
  for	
  non-­‐balanced	
  data	
  structures	
  
– k-­‐d	
  trees,	
  quad-­‐trees,	
  suffix	
  trees	
  
– divides	
  search	
  space	
  into	
  partitions	
  of	
  unequal	
  size	
  
• matching	
  partitioning	
  rule	
  =	
  fast	
  search	
  
• traditionally	
  for	
  "in-­‐memory"	
  transactions,	
  
converted	
  to	
  play	
  nicely	
  with	
  I/O
68
Range	
  Types:	
  GiST	
  vs	
  SP-­‐Gist
CREATE TABLE ranges AS
SELECT
int4range(
(random()*5)::int,
(random()*5)::int + 5
) AS range
FROM generate_series(1,<N>) x;
!
SELECT * FROM ranges WHERE range <operator>
int4range(3,6);
69
N	
  =	
  1,000,000
70
CREATE INDEX ranges_range_spgist_idx ON ranges USING spgist(range);
ERROR: unexpected spgdoinsert() failure
Fixed in 9.3.2
GiST	
  Used GiST	
  Time SP-Gist Used SP-GiST
Time= Yes 121 Yes 37
&& No 257	
  No 260
@> No 223 No 223
<@ Yes 163	
  Yes 111
<< Yes 95 Yes 5
>> Yes 95 Yes 25
&< No 184 No 185
&> No 203 No 203
Range	
  Types:	
  GiST	
  vs	
  SP-­‐GiST
CREATE TABLE ranges AS
SELECT
int4range(x, x + (random()*5)::int + 5)
AS range
FROM generate_series(1,<N>) x;
71
N	
  =	
  250,000
72
GiST	
  Used GiST	
  Time SP-­‐GiST	
  Used SP-­‐GiST	
  Time
= Yes 0.5 Yes 0.7
&& Yes 0.3 Yes 0.3
@> Yes 0.3 Yes 0.3
<@ Yes 0.06 Yes 0.25
<< No 40 Yes 0.2
>> No 60 No 60
&< Yes 0.3 Yes 0.2
&> No 74 No 61
GiST	
  vs	
  SP-­‐GiST:	
  Space
73
GiST	
  Clustered SP-­‐GiST	
  Clustered GiST	
  Sparse SP-­‐GiST	
  Sparse
100K	
  Size 6MB 5MB 6MB 11MB
100K	
  Time 0.5s .4s 2.5s 7.8s
250K	
  Size 15MB 12MB 15MB 28MB
250K	
  Time 1.5s 1.1s 6.3s 47.2s
500K	
  Size 30MB 25MB 30MB 55MB
500K	
  Time 3.1s 3.0s 13.9s 192s
1MM	
  Size 59MB 52MB! 60MB 110MB
1MM	
  Time 5.1s 5.7s 29.2 777s
Integer	
  Arrays
74
CREATE UNLOGGED TABLE int_arrays AS
SELECT ARRAY[x, x + 1, x + 2] AS data
FROM generate_series(1,1000000) x;
!
CREATE INDEX int_arrays_data_idx ON
int_arrays (data);
!
CREATE INDEX int_arrays_data_gin_idx ON
int_arrays USING GIN(data);
B-­‐Tree(?)	
  +	
  Integer	
  Arrays
75
EXPLAIN ANALYZE
SELECT *
FROM int_arrays
WHERE 5432 = ANY (data);
QUERY PLAN
-----------
Seq Scan on int_arrays (cost=0.00..30834.00 rows=5000 width=33) (actual
time=1.260..159.197 rows=3 loops=1)
Filter: (5432 = ANY (data))
Rows Removed by Filter: 999997
!
Total runtime: 159.222 ms
GIN	
  +	
  Integer	
  Arrays
76
EXPLAIN ANALYZE
SELECT *
FROM int_arrays
WHERE ARRAY[5432] <@ data;
QUERY PLAN
-----------
Bitmap Heap Scan on int_arrays (cost=70.75..7680.14 rows=5000 width=33)
(actual time=0.020..0.021 rows=3 loops=1)
Recheck Cond: ('{5432}'::integer[] <@ data)
-> Bitmap Index Scan on int_arrays_data_gin_idx (cost=0.00..69.50
rows=5000 width=0) (actual time=0.014..0.014 rows=3 loops=1)
Index Cond: ('{5432}'::integer[] <@ data)
!
Total runtime: 0.045 ms
Hash	
  Indexes
• only	
  work	
  with	
  "="	
  operator	
  
• are	
  still	
  not	
  WAL	
  logged	
  as	
  of	
  9.4	
  beta	
  1	
  
– not	
  crash	
  safe	
  
– not	
  replicated
77
btree_gin
78
CREATE EXTENSION IF NOT EXISTS btree_gin;
!
CREATE UNLOGGED TABLE numbers AS
SELECT (random() * 2000)::int AS a FROM generate_series(1, 2000000) x;
!
CREATE INDEX numbers_gin_idx ON numbers USING gin(a);
!
EXPLAIN ANALYZE SELECT * FROM numbers WHERE a = 1000;
!
QUERY PLAN
------------
Bitmap Heap Scan on numbers (cost=113.50..9509.26 rows=10000 width=4) (actual
time=0.388..1.459 rows=991 loops=1)
Recheck Cond: (a = 1000)
-> Bitmap Index Scan on numbers_gin_idx (cost=0.00..111.00 rows=10000 width=0)
(actual time=0.232..0.232 rows=991 loops=1)
Index Cond: (a = 1000)
!
Total runtime: 1.563 ms
btree_gin	
  vs	
  btree
79
-- btree
SELECT pg_size_pretty(pg_total_relation_size('numbers_idx'));
pg_size_pretty
----------------
43 MB
!
!
!
-- GIN
SELECT pg_size_pretty(pg_total_relation_size('numbers_gin_idx'));
pg_size_pretty
----------------
16 MB
• Only	
  use	
  GIN	
  over	
  btree	
  if	
  you	
  have	
  a	
  lot	
  of	
  duplicate	
  entries
hstore	
  -­‐	
  the	
  PostgreSQL	
  Key-­‐Value	
  
Store
80
CREATE EXTENSION IF NOT EXISTS hstore;
!
CREATE UNLOGGED TABLE keypairs AS
SELECT
(x || ' => ' || (x + (random() * 5)::int))::hstore AS data
FROM generate_series(1,1000000) x;
SELECT pg_size_pretty(pg_relation_size('keypairs'));
!
!
SELECT * FROM keypairs WHERE data ? ‘3';
data
----------
"3"=>"4"
!
EXPLAIN ANALYZE SELECT * FROM keypairs WHERE data ? ‘3';
QUERY PLAN
-----------
Seq Scan on keypairs (cost=0.00..19135.06 rows=950 width=32) (actual time=0.065..208.808
rows=1 loops=1)
Filter: (data ? '3'::text)
Rows Removed by Filter: 999999
!
Total runtime: 208.825 ms
hstore	
  -­‐	
  the	
  PostgreSQL	
  Key-­‐Value	
  
Store
81
CREATE INDEX keypairs_data_gin_idx ON keypairs
USING gin(data);
!
EXPLAIN ANALYZE SELECT * FROM keypairs WHERE data ? ‘3';
!
QUERY PLAN
-----------
Bitmap Heap Scan on keypairs (cost=27.75..2775.66 rows=1000 width=24) (actual
time=0.044..0.045 rows=1 loops=1)
Recheck Cond: (data ? '3'::text)
-> Bitmap Index Scan on keypairs_data_gin_idx (cost=0.00..27.50 rows=1000 width=0)
(actual time=0.039..0.039 rows=1 loops=1)
Index Cond: (data ? '3'::text)
!
Total runtime: 0.071 ms
JSONB:	
  Coming	
  in	
  9.4
82
INSERT INTO documents
SELECT row_to_json(ROW(x, x + 2, x + 3))::jsonb
FROM generate_series(1,1000000) x;
!
!
CREATE INDEX documents_data_gin_idx ON documents
USING gin(data jsonb_path_ops);
!
!
!
SELECT * FROM documents WHERE data @> '{ "f1": 10 }';
data
--------------------------------
{"f1": 10, "f2": 12, "f3": 13}
!
!
Execution time: 0.084 ms
Awesome	
  vs	
  WTF:	
  
A	
  Note	
  On	
  Operator	
  Indexability
83
EXPLAIN ANALYZE SELECT * FROM documents WHERE data @> '{ "f1": 10 }';
!
QUERY PLAN
-----------
Bitmap Heap Scan on documents (cost=27.75..3082.65 rows=1000 width=66) (actual time=0.029..0.030
rows=1 loops=1)
Recheck Cond: (data @> '{"f1": 10}'::jsonb)
Heap Blocks: exact=1
-> Bitmap Index Scan on documents_data_gin_idx (cost=0.00..27.50 rows=1000 width=0) (actual
time=0.014..0.014 rows=1 loops=1)
Index Cond: (data @> '{"f1": 10}'::jsonb)
!
Execution time: 0.084 ms
!
EXPLAIN ANALYZE SELECT * FROM documents WHERE '{ "f1": 10 }' <@ data;
!
QUERY PLAN
-----------
Seq Scan on documents (cost=0.00..24846.00 rows=1000 width=66) (actual time=0.015..245.924
rows=1 loops=1)
Filter: ('{"f1": 10}'::jsonb <@ data)
Rows Removed by Filter: 999999
!
Execution time: 245.947 ms
For	
  More	
  Information…
• http://www.postgresql.org/docs/current/static/
indexes.html	
  
• http://www.postgresql.org/docs/current/static/
gist.html	
  
• http://www.postgresql.org/docs/current/static/
gin.html	
  
• http://www.postgresql.org/docs/current/static/
spgist.html	
  
• GiST	
  +	
  GIN	
  +	
  Full	
  Text	
  Search:	
  
– http://www.postgresql.org/docs/current/static/textsearch-­‐
indexes.html
84
Conclusion
• Postgres	
  has	
  *a	
  lot*	
  of	
  different	
  types	
  of	
  
indexes,	
  and	
  variations	
  on	
  each	
  of	
  its	
  engines	
  
• Extensions	
  make	
  use	
  of	
  PostgreSQL	
  indexes	
  
– PostGIS	
  
• Need	
  to	
  understand	
  where	
  index	
  usage	
  is	
  
appropriate	
  in	
  your	
  application
85
Thanks!
• http://nyc.pgconf.us	
  
• Twitter:	
  @jkatz05
86

More Related Content

Indexing Complex PostgreSQL Data Types

  • 1. Indexing  Complex  PostgreSQL   Data  Types Jonathan  S.  Katz  -­‐  PGDay  UK  2014  -­‐  July  9,  2014
  • 2. About • Jonathan  S.  Katz   – CTO,  VenueBook   – Co-­‐Organizer,  NYC  PostgreSQL  User  Group   – PGConf  NYC  2015   • Mar  25  -­‐  27,  2015   • New  York  Marriott  Downtown   • http://nyc.pgconf.us   – @jkatz05 2
  • 3. Quick  Overview • Introductory  Talk  with  demos  and  fun   • B-­‐trees   • GiST:  Generalized  Search  Trees   • GIN:  Generalized  Inverted  Index   • SP-­‐GiST:  Space  Partitioned  Generalized  Search   Trees 3
  • 4. Assumptions • PostgreSQL  9.3+   • most  will  be  9.0+   • PostGIS  2.0+   • Believe  it  will  work  for  most  available   versions   • PostgreSQL  9.4  beta? 4
  • 20. What  We  Learned • Without  any  data  structure  around  search,  we   rely  on  "hope"   • Assumed  "unique  values"  and  "equality"   – would  have  to  scan  all  rows  otherwise   • …and  what  about:   – INSERT   – UPDATE   – DELETE 20
  • 21. What  We  Need • Need  a  data  structure  for  search  that:   – allows  efficient  lookups   – plays  nicely  with  disk  I/O   – does  not  take  too  long  for  updates   21
  • 22. B-­‐Trees • "default"  index   • quick  traversal  to  leaf  nodes   • leaf  nodes  pre-­‐sorted   • node  size  designed  to  fit  in  disk  block  size   – "degree"  of  nodes  has  max-­‐size   • theoretical  performance   – reads:  O(log  n)   – writes:  O(log  n)   – space:  O(n) 22
  • 23. B-­‐Trees  and  PostgreSQL • supports   – <=,  <,  =,  >,  >=   – BETWEEN,  IN   – IS  NOT  NULL,  IS  NULL   – LIKE  in  specific  case  of  ‘plaintext%’   – ~  in  specific  case  of  ‘^plaintext’   – ILIKE  and  ~*  if  pattern  starts  with  nonalpha  characters   • does  not  support   • IS  NOT  DISTINCT  FROM 23
  • 24. B-­‐Trees  and  PostgreSQL • data  types  supported   – any  data  type  with  all  the  equality  operators  defined   – number  types   • integer,  numeric,  decimal   – text   • char,  varchar,  text   – date  /  times   • timestamptz,  timestamp,  date,  time,  timetz,  interval   - arrays,  ranges 24
  • 25. Demo  Specs  +  Configuration • Hardware  specs   – 2.3GHz  Intel  i7  2x4  core   – 16GB  RAM  DDR3  1600MHz   – Apple  SSD  6Gbps   • postgresql.conf  (9.3,  9.4  beta  1)   – shared_buffers  =  1GB   – work_mem  =  64MB   – maintenance_work_mem  =  1024MB   – effective_cache_size  =  8GB 25
  • 26. Demo  #1:  Basic  Indexing  Plans 26
  • 27. Demo  #1  Notes • Index  maintenance   • VACUUM  –  "cleans  up"  after  writes  on  table  / indexes   – ANALYZE  –  keeps  statistics  up-­‐to-­‐date  for  planner   ! VACUUM ANALYZE tablename;   ! • Good  idea  to  leave  autovacuum  on 27
  • 28. Indexing  in  Production • CREATE  INDEX  CONCURRENTLY   • REINDEX   – corruption,  bloat,  invalid   • FILLFACTOR   – 10  –  100   – default:  90   – strategy:  lower  %  ::  write-­‐activity   • TABLESPACE   • NULLS  LAST,  NULLS  FIRST 28
  • 29. Demo  #2:  Partial  Indexes 29 CREATE INDEX indexname ON tablename (columnname) WHERE somecondition;
  • 30. Demo  #2  Notes • Partial  Indexes  are   – good  if  known  to  query  limited  subset  of  table   – take  up  less  space   – allow  for  much  quicker  writes   • Like  all  good  things,  do  not  overuse  and   saturate  your  I/O 30
  • 31. Unique  Indexes • only  for  B-­‐trees   • NULL  not  unique   • use  UNIQUE  constraints  –  automatically  create   indexes   ! CREATE TABLE foo (bar int UNIQUE);   -- or   CREATE UNIQUE INDEX foo_bar_idx ON foo (bar);   ALTER TABLE foo ADD CONSTRAINT a_unique USING INDEX a_idx; 31
  • 32. Multi-­‐Column  Indexes • Useful  for   – querying  two  columns  that  are  frequently  queried   together   – enforcing  UNIQUEness  across  columns   • n.b.  creating  UNIQUE  constraint  on  table  creates  UNIQUE   INDEX   • PostgreSQL  supports   – up  to  32  columns   – B-­‐tree,  GiST,  GIN   • Be  careful  of  how  you  choose  initial  column  order! 32
  • 33. Multi-­‐Column  Indexes 33 CREATE INDEX multicolumn_idx ON tablename (col1, col2); ! ! ! CREATE UNIQUE INDEX pn_idx ON phone_numbers (country_code, national_number) WHERE extension IS NULL
  • 34. Demo  #3  Notes • Multi-­‐column  indexes  can  be   – efficient  for  speed  +  space   – inefficient  with  performance   • Usage  depends  on  your  application  needs 34
  • 35. Expression  Indexes • can  index  on  expressions  to  speed  up  lookups   – e.g.  case  insensitive  email  addresses   – can  use  functions  or  scalars   • (x * y) / 100   • COALESCE(first_name, '') || ' ' || COALESCE(last_name, '')   • LOWER(email_address)   • tradeoff:  slower  writes 35
  • 36. Demo  #4:  Expression  Indexes 36
  • 37. Demo  #4  Notes • fast  lookups,  slow  updates   • size  on  disk   • cool  data  types   – geometric   – JSON 37
  • 39. Geometric  Data  Types CREATE TABLE points (coord point); ! CREATE INDEX points_idx ON points (coord); ERROR: data type point has no default operator class for access method "btree" HINT: You must specify an operator class for the index or define a default operator class for the data type. 39
  • 40. GiST • "generalized  search  tree"   • infrastructure  that  provides  template  to  create  arbitrary   indexing  schemes   – supports  concurrency,  logging,  searching  –  only  have  to  define   behavior   – user-­‐defined  operator  class   • <<,  &<,  &>,  >>,  <<|,  &<|,  |&>,  |>>,  @>,  <@,  ~=,  &&   – have  to  implement  functions  in  interface   • supports  lossless  +  lossy  indexes   • provides  support  for  "nearest-­‐neighbor"  queries  –   "KNN-­‐Gist" 40 CREATE INDEX points_coord_gist_idx ON points USING gist(coord)
  • 41. Demo  #5:  Geometric  GiST  +  KNN-­‐Gist 41
  • 42. Demo  #5  Notes • GiST  indexes  on  geometric  types  radically   speedup  reads   • Writes  are  slower  due  to  distance  calculation   • Index  size  can  be  very  big 42
  • 43. PostGIS • For  when  you  are  doing  real  things  with  shapes 43• (and  geographic  information  systems)
  • 44. PostGIS  +  Indexes • B-­‐Tree?   • R-­‐Tree?   • PostGIS  docs  do  not  recommend  using  just   an  R-­‐Tree  index   • GiST   • overlaps!    containment!   • uses  a  combination  of  GiST  +  R-­‐Tree 44
  • 45. PostGIS  +  GiST 45 2-­‐D CREATE INDEX zipcodes_geom_gist_idx ON zipcodes USING gist(geom); N-­‐D  (PostGIS  2.0+ CREATE INDEX zipcodes_geom_gist_idx ON zipcodes USING gist(geom gist_geometry_ops_nd);
  • 46. Example  -­‐  USA  Zipcode  Boundaries • 33,120  rows   • geom:  MultiPolygon   • 52MB  without  indexes   • With  geometry  GiST  +  integer  B-­‐Tree:  869MB 46
  • 47. What  Zipcode  Is  My  Office  In? • Geocoded  Address   • Lat,Long  =  40.7356197,-­‐73.9891102   • PostGIS:  POINT(-­‐73.9891102  40.7356197)   • 4269  -­‐  “SRID”  -­‐  unique  ID  for  coordinate  system   definitions 47 SELECT zcta5ce10 AS zipcode FROM zipcodes WHERE ST_Contains( geom, --MultiPolygon ST_GeomFromText('POINT(-73.9891102 40.7356197)', 4269) );
  • 48. What  Zipcode  Is  My  Office  In? • No  Index 48 Seq Scan on zipcodes (cost=0.00..15382.00 rows=1 width=6) (actual time=64.780..5153.485 rows=1 loops=1) Filter: ((geom && '0101000020AD100000F648DE944D7F52C08CE54CC9285E4440'::geometry) AND _st_contains(geom, '0101000020AD100000F648DE944D7F52C08CE54CC9285E4440'::geometry)) Rows Removed by Filter: 33119 Total runtime: 5153.505 ms
  • 49. What  Zipcode  Is  My  Office  In? • Here’s  the  GiST: 49 Index Scan using zipcodes_geom_gist on zipcodes (cost=0.28..8.54 rows=1 width=6) (actual time=0.120..0.207 rows=1 loops=1) Index Cond: (geom && '0101000020AD100000F648DE944D7F52C08CE54CC9285E4440'::geometry) Filter: _st_contains(geom, '0101000020AD100000F648DE944D7F52C08CE54CC9285E4440'::geometry) Rows Removed by Filter: 1 ! Total runtime: 0.235 ms
  • 50. PostGIS  +  GiST  Conclusion • 5153ms  =>  0.235ms  =  21,900x  speedup  :-­‐) 50 SELECT zcta5ce10 AS zipcode FROM zipcodes WHERE ST_Contains( geom, ST_GeomFromText('POINT(-73.9891102 40.7356197)', 4269) ); ! zipcode --------- 10003
  • 51. Full  Text  Search • PostgreSQL  offers  full  text  search  with  the   tsearch2  engine   – algorithms  for  performing  FTS   – to_tsvector('english',  content)  @@   to_tsquery('irish  &  conference  |  meeting')   – provides  indexing  capabilities  for  efficient  search 51
  • 52. Test  Data  Set • Wikipedia  English  category  titles  –  all   1,823,644  that  I  downloaded 52
  • 53. Full-­‐Text  Search:  Basics 53 SELECT title FROM category WHERE to_tsvector('english', title) @@ to_tsquery('united & kingdom’); ! title ----- Lists of railway stations in the United Kingdom Political history of the United Kingdom Military of the United Kingdom United Kingdom constitution Television channels in the United Kingdom United Kingdom Roman Catholic secondary schools in the United Kingdom [results truncated] ! ! QUERY PLAN ------------ Seq Scan on category (cost=0.00..49262.77 rows=46 width=29) (actual time=21.900..16809.890 rows=8810 loops=1) Filter: (to_tsvector('english'::regconfig, title) @@ to_tsquery('united & kingdom'::text)) Rows Removed by Filter: 1814834 ! Total runtime: 16811.108 ms
  • 54. Full-­‐Text  Search  +  GiST 54 CREATE INDEX category_title_gist_idx ON category USING gist(to_tsvector('english', title)); ! SELECT title FROM category WHERE to_tsvector('english', title) @@ to_tsquery('united & kingdom'); QUERY PLAN ------------- Bitmap Heap Scan on category (cost=4.77..182.47 rows=46 width=29) (actual time=75.517..180.650 rows=8810 loops=1) Recheck Cond: (to_tsvector('english'::regconfig, title) @@ to_tsquery('united & kingdom'::text)) -> Bitmap Index Scan on category_title_gist_idx (cost=0.00..4.76 rows=46 width=0) (actual time=74.687..74.687 rows=8810 loops=1) Index Cond: (to_tsvector('english'::regconfig, title) @@ to_tsquery('united & kingdom’::text)) ! Total runtime: 181.354 ms
  • 55. Full  Text  Search  +  GiST • GiST  indexes  can  produce  false  positives   – "documents"  represented  by  fixed  length  signature   • words  are  hashed  into  single  bits  and  concatenated   – when  false  positive  occurs,  row  is  returned  and   checked  to  see  if  false  match   • Extra  validations  =  performance  degradation 55
  • 56. Performance  Summary  with  GiST • initial  index  build  takes  awhile  =>  slow  writes   • reads  are  quick   • Table  size:  271MB   • Index  size:  83MB 56
  • 57. GIN  Index • "generalized  inverted  index"   • supports  searching  within  composite  data   – arrays,  full-­‐text  documents,  hstore   • key  is  stored  once  and  points  to  composites  it  is   contained  in   • like  GiST,  provides  index  infrastructure  to  extend  GIN   based  on  behavior   – supports  operators  <@,  @>,  =,  &&   • GIN  performance  ⬄  log(#  unique  things) 57
  • 58. Full  Text  Search  +  GIN 58 CREATE INDEX category_title_gin_idx ON category USING gin(to_tsvector('english', title)); ! EXPLAIN ANALYZE SELECT title FROM category WHERE to_tsvector('english', title) @@ to_tsquery('united & kingdom'); ! QUERY PLAN ------- Bitmap Heap Scan on category (cost=28.36..206.06 rows=46 width=29) (actual time=8.864..14.674 rows=8810 loops=1) Recheck Cond: (to_tsvector('english'::regconfig, title) @@ to_tsquery('united & kingdom'::text)) -> Bitmap Index Scan on category_title_gin_idx (cost=0.00..28.35 rows=46 width=0) (actual time=7.905..7.905 rows=8810 loops=1) Index Cond: (to_tsvector('english'::regconfig, title) @@ to_tsquery('united & kingdom'::text)) ! ! Total runtime: 15.157 ms
  • 59. Performance  Summary  with  GIN • index  build  was  much  quicker   • significant  speedup  from  no  index   – (12,000ms  =>  15ms)   • significant  speedup  from  GiST   – (181ms  =>  15ms)   • Table  size:  271MB   • Index  size:   • 9.3:  71MB   • 9.4  beta  1:    40MB 59
  • 60. What  Was  Not  Discussed • Word  density   – prior  to  9.3,  performance  issues  with  greater  word   density   • Type  of  text  data  –  phrases  vs  paragraphs 60
  • 61. Full  Text  Search  –  GiST  vs  GIN • Reads   – overall,  GIN  should  win   • Writes   – traditionally,  GiST  has  better  performance  for   writes   – GIN   • FASTUPDATE   • 9.4:  compression 61
  • 62. Regular  Expression  Indexes • Added  in  9.3   • Support  for  LIKE/ILIKE  wildcard  indexes  in  9.1   – title  LIKE  '%ab%e'   • Uses  pg_trgm  extension  +  GIN   ! CREATE EXTENSION IF NOT EXISTS pg_trgm; CREATE INDEX category_title_regex_idx ON category USING GIN(title gin_trgm_ops); 62
  • 63. Regular  Expressions  -­‐  No  Index 63 EXPLAIN ANALYZE SELECT title FROM category WHERE title ~ '(([iI]sland(s)?)|([pP]eninsula))$'; ! QUERY PLAN ---------- Seq Scan on category (cost=0.00..40144.55 rows=182 width=29) (actual time=2.509..4260.792 rows=5878 loops=1) Filter: (title ~ '(([iI]sland(s)?)|([pP]eninsula))$'::text) Rows Removed by Filter: 1817766 ! Total runtime: 4261.204 ms
  • 64. Regular  Expressions  -­‐  Indexed 64 CREATE INDEX category_title_regex_idx ON category USING gin(title gin_trgm_ops); ! EXPLAIN ANALYZE SELECT title FROM category WHERE title ~ '(([iI]sland(s)?)|([pP]eninsula))$'; QUERY PLAN ----------- Bitmap Heap Scan on category (cost=197.41..871.77 rows=182 width=29) (actual time=107.445..146.713 rows=5878 loops=1) Recheck Cond: (title ~ '(([iI]sland(s)?)|([pP]eninsula))$'::text) Rows Removed by Index Recheck: 4712 -> Bitmap Index Scan on category_title_regex_idx (cost=0.00..197.37 rows=182 width=0) (actual time=106.645..106.645 rows=10590 loops=1) Index Cond: (title ~ '(([iI]sland(s)?)|([pP]eninsula))$'::text) ! ! Total runtime: 147.026 ms
  • 65. Range  Types • stores  range  data   –  1  to  6   – 2013-­‐10-­‐29  –  2013-­‐11-­‐2   • easy-­‐to-­‐use  operators  to  check  inclusion,   overlaps   • built-­‐in  types:  integers,  numerics,  dates,   timestamps   • extensible 65
  • 66. Range  Type  Examples 66 --find all ranges that overlap with [100, 200) ! SELECT * FROM ranges WHERE int4range(100, 200) && range; ! range ----------- [10,102) [13,102) [18,108) [32,101) [34,134) [37,123) [43,111) [46,132) [48,107) [results trunctated] !QUERY PLAN ----------- Seq Scan on ranges (cost=0.00..14239.86 rows=7073 width=32) (actual time=0.018..185.411 rows=143 loops Filter: ('[100,200)'::int4range && range) Rows Removed by Filter: 999857 ! Total runtime: 185.439 ms
  • 67. Range  Types  +  GiST 67 CREATE INDEX ranges_range_gist_idx ON ranges USING gist(range); ! EXPLAIN ANALYZE SELECT * FROM ranges WHERE int4range(100, 200) && range; ! QUERY PLAN ------------ Bitmap Heap Scan on ranges (cost=5.29..463.10 rows=130 width=13) (actual time=0.120..0.135 rows=144 loops=1) Recheck Cond: ('[100,200)'::int4range && range) -> Bitmap Index Scan on ranges_range_gist_idx (cost=0.00..5.26 rows=130 width=0) (actual time=0.109..0.109 rows=144 loops=1) Index Cond: ('[100,200)'::int4range && range) ! ! Total runtime: 0.168 ms
  • 68. SP-­‐GiST • space-­‐partitioned  generalized  search  tree   • ideal  for  non-­‐balanced  data  structures   – k-­‐d  trees,  quad-­‐trees,  suffix  trees   – divides  search  space  into  partitions  of  unequal  size   • matching  partitioning  rule  =  fast  search   • traditionally  for  "in-­‐memory"  transactions,   converted  to  play  nicely  with  I/O 68
  • 69. Range  Types:  GiST  vs  SP-­‐Gist CREATE TABLE ranges AS SELECT int4range( (random()*5)::int, (random()*5)::int + 5 ) AS range FROM generate_series(1,<N>) x; ! SELECT * FROM ranges WHERE range <operator> int4range(3,6); 69
  • 70. N  =  1,000,000 70 CREATE INDEX ranges_range_spgist_idx ON ranges USING spgist(range); ERROR: unexpected spgdoinsert() failure Fixed in 9.3.2 GiST  Used GiST  Time SP-Gist Used SP-GiST Time= Yes 121 Yes 37 && No 257  No 260 @> No 223 No 223 <@ Yes 163  Yes 111 << Yes 95 Yes 5 >> Yes 95 Yes 25 &< No 184 No 185 &> No 203 No 203
  • 71. Range  Types:  GiST  vs  SP-­‐GiST CREATE TABLE ranges AS SELECT int4range(x, x + (random()*5)::int + 5) AS range FROM generate_series(1,<N>) x; 71
  • 72. N  =  250,000 72 GiST  Used GiST  Time SP-­‐GiST  Used SP-­‐GiST  Time = Yes 0.5 Yes 0.7 && Yes 0.3 Yes 0.3 @> Yes 0.3 Yes 0.3 <@ Yes 0.06 Yes 0.25 << No 40 Yes 0.2 >> No 60 No 60 &< Yes 0.3 Yes 0.2 &> No 74 No 61
  • 73. GiST  vs  SP-­‐GiST:  Space 73 GiST  Clustered SP-­‐GiST  Clustered GiST  Sparse SP-­‐GiST  Sparse 100K  Size 6MB 5MB 6MB 11MB 100K  Time 0.5s .4s 2.5s 7.8s 250K  Size 15MB 12MB 15MB 28MB 250K  Time 1.5s 1.1s 6.3s 47.2s 500K  Size 30MB 25MB 30MB 55MB 500K  Time 3.1s 3.0s 13.9s 192s 1MM  Size 59MB 52MB! 60MB 110MB 1MM  Time 5.1s 5.7s 29.2 777s
  • 74. Integer  Arrays 74 CREATE UNLOGGED TABLE int_arrays AS SELECT ARRAY[x, x + 1, x + 2] AS data FROM generate_series(1,1000000) x; ! CREATE INDEX int_arrays_data_idx ON int_arrays (data); ! CREATE INDEX int_arrays_data_gin_idx ON int_arrays USING GIN(data);
  • 75. B-­‐Tree(?)  +  Integer  Arrays 75 EXPLAIN ANALYZE SELECT * FROM int_arrays WHERE 5432 = ANY (data); QUERY PLAN ----------- Seq Scan on int_arrays (cost=0.00..30834.00 rows=5000 width=33) (actual time=1.260..159.197 rows=3 loops=1) Filter: (5432 = ANY (data)) Rows Removed by Filter: 999997 ! Total runtime: 159.222 ms
  • 76. GIN  +  Integer  Arrays 76 EXPLAIN ANALYZE SELECT * FROM int_arrays WHERE ARRAY[5432] <@ data; QUERY PLAN ----------- Bitmap Heap Scan on int_arrays (cost=70.75..7680.14 rows=5000 width=33) (actual time=0.020..0.021 rows=3 loops=1) Recheck Cond: ('{5432}'::integer[] <@ data) -> Bitmap Index Scan on int_arrays_data_gin_idx (cost=0.00..69.50 rows=5000 width=0) (actual time=0.014..0.014 rows=3 loops=1) Index Cond: ('{5432}'::integer[] <@ data) ! Total runtime: 0.045 ms
  • 77. Hash  Indexes • only  work  with  "="  operator   • are  still  not  WAL  logged  as  of  9.4  beta  1   – not  crash  safe   – not  replicated 77
  • 78. btree_gin 78 CREATE EXTENSION IF NOT EXISTS btree_gin; ! CREATE UNLOGGED TABLE numbers AS SELECT (random() * 2000)::int AS a FROM generate_series(1, 2000000) x; ! CREATE INDEX numbers_gin_idx ON numbers USING gin(a); ! EXPLAIN ANALYZE SELECT * FROM numbers WHERE a = 1000; ! QUERY PLAN ------------ Bitmap Heap Scan on numbers (cost=113.50..9509.26 rows=10000 width=4) (actual time=0.388..1.459 rows=991 loops=1) Recheck Cond: (a = 1000) -> Bitmap Index Scan on numbers_gin_idx (cost=0.00..111.00 rows=10000 width=0) (actual time=0.232..0.232 rows=991 loops=1) Index Cond: (a = 1000) ! Total runtime: 1.563 ms
  • 79. btree_gin  vs  btree 79 -- btree SELECT pg_size_pretty(pg_total_relation_size('numbers_idx')); pg_size_pretty ---------------- 43 MB ! ! ! -- GIN SELECT pg_size_pretty(pg_total_relation_size('numbers_gin_idx')); pg_size_pretty ---------------- 16 MB • Only  use  GIN  over  btree  if  you  have  a  lot  of  duplicate  entries
  • 80. hstore  -­‐  the  PostgreSQL  Key-­‐Value   Store 80 CREATE EXTENSION IF NOT EXISTS hstore; ! CREATE UNLOGGED TABLE keypairs AS SELECT (x || ' => ' || (x + (random() * 5)::int))::hstore AS data FROM generate_series(1,1000000) x; SELECT pg_size_pretty(pg_relation_size('keypairs')); ! ! SELECT * FROM keypairs WHERE data ? ‘3'; data ---------- "3"=>"4" ! EXPLAIN ANALYZE SELECT * FROM keypairs WHERE data ? ‘3'; QUERY PLAN ----------- Seq Scan on keypairs (cost=0.00..19135.06 rows=950 width=32) (actual time=0.065..208.808 rows=1 loops=1) Filter: (data ? '3'::text) Rows Removed by Filter: 999999 ! Total runtime: 208.825 ms
  • 81. hstore  -­‐  the  PostgreSQL  Key-­‐Value   Store 81 CREATE INDEX keypairs_data_gin_idx ON keypairs USING gin(data); ! EXPLAIN ANALYZE SELECT * FROM keypairs WHERE data ? ‘3'; ! QUERY PLAN ----------- Bitmap Heap Scan on keypairs (cost=27.75..2775.66 rows=1000 width=24) (actual time=0.044..0.045 rows=1 loops=1) Recheck Cond: (data ? '3'::text) -> Bitmap Index Scan on keypairs_data_gin_idx (cost=0.00..27.50 rows=1000 width=0) (actual time=0.039..0.039 rows=1 loops=1) Index Cond: (data ? '3'::text) ! Total runtime: 0.071 ms
  • 82. JSONB:  Coming  in  9.4 82 INSERT INTO documents SELECT row_to_json(ROW(x, x + 2, x + 3))::jsonb FROM generate_series(1,1000000) x; ! ! CREATE INDEX documents_data_gin_idx ON documents USING gin(data jsonb_path_ops); ! ! ! SELECT * FROM documents WHERE data @> '{ "f1": 10 }'; data -------------------------------- {"f1": 10, "f2": 12, "f3": 13} ! ! Execution time: 0.084 ms
  • 83. Awesome  vs  WTF:   A  Note  On  Operator  Indexability 83 EXPLAIN ANALYZE SELECT * FROM documents WHERE data @> '{ "f1": 10 }'; ! QUERY PLAN ----------- Bitmap Heap Scan on documents (cost=27.75..3082.65 rows=1000 width=66) (actual time=0.029..0.030 rows=1 loops=1) Recheck Cond: (data @> '{"f1": 10}'::jsonb) Heap Blocks: exact=1 -> Bitmap Index Scan on documents_data_gin_idx (cost=0.00..27.50 rows=1000 width=0) (actual time=0.014..0.014 rows=1 loops=1) Index Cond: (data @> '{"f1": 10}'::jsonb) ! Execution time: 0.084 ms ! EXPLAIN ANALYZE SELECT * FROM documents WHERE '{ "f1": 10 }' <@ data; ! QUERY PLAN ----------- Seq Scan on documents (cost=0.00..24846.00 rows=1000 width=66) (actual time=0.015..245.924 rows=1 loops=1) Filter: ('{"f1": 10}'::jsonb <@ data) Rows Removed by Filter: 999999 ! Execution time: 245.947 ms
  • 84. For  More  Information… • http://www.postgresql.org/docs/current/static/ indexes.html   • http://www.postgresql.org/docs/current/static/ gist.html   • http://www.postgresql.org/docs/current/static/ gin.html   • http://www.postgresql.org/docs/current/static/ spgist.html   • GiST  +  GIN  +  Full  Text  Search:   – http://www.postgresql.org/docs/current/static/textsearch-­‐ indexes.html 84
  • 85. Conclusion • Postgres  has  *a  lot*  of  different  types  of   indexes,  and  variations  on  each  of  its  engines   • Extensions  make  use  of  PostgreSQL  indexes   – PostGIS   • Need  to  understand  where  index  usage  is   appropriate  in  your  application 85
  • 86. Thanks! • http://nyc.pgconf.us   • Twitter:  @jkatz05 86

Editor's Notes

  1. 001.sql