How to Read OSM Data with DuckDB

A deep dive into OpenStreetMap data structure and how to utilize it in a scalable way

Kamil Raczycki
Towards Data Science
29 min readMar 2, 2024

--

Dall-E 3 image: Adorable and cute 3D render duck studying a paper map, bright sky, with blur background, high quality, 8k

This article will provide an in-depth look at how to read OpenStreetMap data using the DuckDB database.

The steps described in this guide will allow the reader to load the OSM data using the Monaco example divided into nodes, ways, and relations.

The final result of OSM elements read using the DuckDB engine. From the left: nodes, ways and relations. Generated by the author using GeoPandas library.

A basic knowledge of the SQL language is expected to fully understand the steps described in this tutorial. Most of the GIS-related operations and special joins are described further in the article.

Outline of the article

  • What is OSM? — introduction to the OSM service.
  • OpenStreetMap data model — definition of objects used in the OSM.
  • Reading OSM data — basic operations on the data using DuckDB.
  • Constructing point geometries from the nodes
  • Constructing linestring and polygon geometries from the ways
  • Constructing polygon and multi-polygon geometries from the relations
  • Examples of badly defined relation objects
  • QuackOSM — a hassle-free tool for reading OSM data

What is OSM?

OpenStreetMap (OSM) is the most popular free map of the world and it's kept alive by a growing base of volunteers and contributors.

The data collected and built by the community is available publicly for free and commercial purposes, so many companies, academic researchers and individual developers use this resource in their projects. All data is provided under the Open Data Commons Open Database License (ODbL).

The data can be accessed in multiple ways:

The most space-efficient file type in which the data is stored is Protocolbuffer Binary Format with an extension *.osm.pbf . You can read more about it here.

You can also read this short article about OpenStreetMap from Eugenia Anello:

OpenStreetMap data model

This section is based on OSM Wiki page about Elements

Conceptually the data in OpenStreetMap is split into 3 components:

Nodes represent points in space. They are represented by a pair of coordinates in a WGS84 Coordinate Reference System — longitude and latitude. Nodes can be used to define a single feature on a map (eg. bench, lamp post, tree) or be used with other nodes to represent a shape of a way.

Example of a node — a tree in the park. Screenshot from the OpenStreetMap by the author.

Ways are shapes representing a polyline by using an ordered list of nodes. Those polylines can be open and represent roads, canals, and walls, or they can be closed to form a polygon and represent buildings, forests, lakes or other simple shapes.

Example of a way — a part of a highway road. Screenshot from the OpenStreetMap by the author.

Relations represent relationships between multiple objects and data elements defined in the OSM. This could be for example a bus route with ways showing roads on which the bus travels and nodes showing stops of a route, or a multi polygon with holes represented by at least 2 ways: these can be an outer polygon and an inner polygon.

An example of a relation— a hotel building outline with holes. Screenshot from the OpenStreetMap by the author.

Each element can, but doesn’t have to, have tags attached. Tags describe the meaning of the element. They are composed of a key and a value. There is no fixed dictionary of those values, but users should stay within conventions documented in the OSM Wiki.

Additionally, each element has an ID that is unique in a given element type space (so there can be a node with ID = 100, a way with ID = 100 and a relation with ID = 100).

Reading OSM data

Many tools allow users to transform the OSM data model to file formats commonly used in the GIS domain, such as GDAL. These tools are automatically reconstructing geometries from the raw data. We will try to read it and reconstruct geometries manually.

Examples below show how to access raw data and are written in SQL using DuckDB engine with Spatial extension. All queries with a full Jupyer notebook can be accessed in the GitHub repository.
You can run the notebook in the parallel or you can install the DuckDB engine and open the CLI to execute the queries there.

Getting the data

For simplicity and easy access, examples are focused entirely on the Monaco region. You can download the current extract from the Geofabrik download server: https://download.geofabrik.de/europe/monaco.html (and click monaco-latest.osm.pbf download link)

Familiarisation with the data structure

To start, we will use the DESCRIBEfunction to get information about the columns:

DESCRIBE SELECT * FROM ST_READOSM('monaco-latest.osm.pbf');

┌─────────────┬──────────────────────────────────────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │ column_type │ null │ key │ default │ extra │
│ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │
├─────────────┼──────────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┤
│ kind │ ENUM('node', 'way', 'relation', 'changeset') │ YES │ NULL │ NULL │ NULL │
│ id │ BIGINT │ YES │ NULL │ NULL │ NULL │
│ tags │ MAP(VARCHAR, VARCHAR) │ YES │ NULL │ NULL │ NULL │
│ refs │ BIGINT[] │ YES │ NULL │ NULL │ NULL │
│ lat │ DOUBLE │ YES │ NULL │ NULL │ NULL │
│ lon │ DOUBLE │ YES │ NULL │ NULL │ NULL │
│ ref_roles │ VARCHAR[] │ YES │ NULL │ NULL │ NULL │
│ ref_types │ ENUM('node', 'way', 'relation')[] │ YES │ NULL │ NULL │ NULL │
└─────────────┴──────────────────────────────────────────────┴─────────┴─────────┴─────────┴─────────┘

There are 8 columns returned by the ST_READOSMfunction:

  • kind — this is the type of an element. It can also have a value of changesetrepresenting the changes after editing the existing element in the OSM. Extracts from the Geofabrik download server don’t contain changesets, so we don’t have to think about them.
  • id — the element identifier.
  • tags — map (or a dictionary) of two strings: a tag key and a tag value.
  • refs — list of member IDs related to the element. Nodes should have this list empty and ways and relations can’t have it empty.
  • lat and lon — latitude and longitude of a node. Ways and relations should have these fields empty since only nodes can have coordinates in the OSM.
  • ref_roles and ref_types — a list of additional information about members: what role is assigned to the member and what type is it (node, way or relation).

Counting the elements

Let’s see how many elements there are in total and per element type.

SELECT COUNT(*) as total_elements
FROM ST_READOSM('monaco-latest.osm.pbf');

┌────────────────┐
│ total_elements │
│ int64 │
├────────────────┤
│ 35782 │
└────────────────┘

SELECT kind, COUNT(*) as total_elements
FROM ST_READOSM('monaco-latest.osm.pbf')
GROUP BY 1;

┌──────────────────────────────────────────────┬────────────────┐
│ kind │ total_features │
│ enum('node', 'way', 'relation', 'changeset') │ int64 │
├──────────────────────────────────────────────┼────────────────┤
│ node │ 30643 │
│ way │ 4849 │
│ relation │ 290 │
└──────────────────────────────────────────────┴────────────────┘

Looking at the elements

Let’s check the examples of the data for each element type.

SELECT *
FROM ST_READOSM('monaco-latest.osm.pbf')
WHERE kind = 'node'
LIMIT 5;

┌──────────────────────┬──────────┬─────────────────────────────┬─────────┬────────────────────┬────────────────────┬───────────┬───────────────────────────────────┐
│ kind │ id │ tags │ refs │ lat │ lon │ ref_roles │ ref_types │
│ enum('node', 'way'… │ int64 │ map(varchar, varchar) │ int64[] │ double │ double │ varchar[] │ enum('node', 'way', 'relation')[] │
├──────────────────────┼──────────┼─────────────────────────────┼─────────┼────────────────────┼────────────────────┼───────────┼───────────────────────────────────┤
│ node │ 21911883 │ │ │ 43.737117500000004 │ 7.422909300000001 │ │ │
│ node │ 21911886 │ {crossing=zebra, crossing… │ │ 43.737239900000006 │ 7.423498500000001 │ │ │
│ node │ 21911894 │ │ │ 43.737773100000005 │ 7.4259193 │ │ │
│ node │ 21911901 │ │ │ 43.737762100000005 │ 7.4267099000000005 │ │ │
│ node │ 21911908 │ │ │ 43.7381906 │ 7.426961 │ │ │
└──────────────────────┴──────────┴─────────────────────────────┴─────────┴────────────────────┴────────────────────┴───────────┴───────────────────────────────────┘

SELECT *
FROM ST_READOSM('monaco-latest.osm.pbf')
WHERE kind = 'way'
LIMIT 5;

┌──────────────────────┬─────────┬──────────────────────┬─────────────────────────────────────────┬────────┬────────┬───────────┬───────────────────────────────────┐
│ kind │ id │ tags │ refs │ lat │ lon │ ref_roles │ ref_types │
│ enum('node', 'way'… │ int64 │ map(varchar, varch… │ int64[] │ double │ double │ varchar[] │ enum('node', 'way', 'relation')[] │
├──────────────────────┼─────────┼──────────────────────┼─────────────────────────────────────────┼────────┼────────┼───────────┼───────────────────────────────────┤
│ way │ 4097656 │ {highway=secondary… │ [21912089, 7265761724, 1079750744, 21… │ │ │ │ │
│ way │ 4098197 │ {highway=tertiary,… │ [21918500, 10723812919, 1204288480, 2… │ │ │ │ │
│ way │ 4224972 │ {highway=residenti… │ [25177418, 4939779378, 4939779381, 49… │ │ │ │ │
│ way │ 4224978 │ {access=no, addr:c… │ [25178088, 25178091, 25178045, 251780… │ │ │ │ │
│ way │ 4226740 │ {highway=secondary… │ [25192130, 6444966483, 1737389127, 64… │ │ │ │ │
└──────────────────────┴─────────┴──────────────────────┴─────────────────────────────────────────┴────────┴────────┴───────────┴───────────────────────────────────┘

SELECT *
FROM ST_READOSM('monaco-latest.osm.pbf')
WHERE kind = 'relation'
LIMIT 5;

┌──────────────────────┬───────┬──────────────────────┬──────────────────────┬────────┬────────┬──────────────────────┬─────────────────────────────────────────────┐
│ kind │ id │ tags │ refs │ lat │ lon │ ref_roles │ ref_types │
│ enum('node', 'way'… │ int64 │ map(varchar, varch… │ int64[] │ double │ double │ varchar[] │ enum('node', 'way', 'relation')[] │
├──────────────────────┼───────┼──────────────────────┼──────────────────────┼────────┼────────┼──────────────────────┼─────────────────────────────────────────────┤
│ relation │ 7385 │ {ISO3166-2=FR-06, … │ [1701090139, 32665… │ │ │ [admin_centre, lab… │ [node, node, way, way, way, way, way, way… │
│ relation │ 8654 │ {ISO3166-2=FR-PAC,… │ [26761400, 1251610… │ │ │ [admin_centre, lab… │ [node, node, way, way, way, way, way, way… │
│ relation │ 11980 │ {ISO3166-1=FR, ISO… │ [17807753, 1363947… │ │ │ [admin_centre, lab… │ [node, node, relation, relation, relation… │
│ relation │ 36990 │ {ISO3166-1=MC, ISO… │ [1790048269, 77077… │ │ │ [admin_centre, out… │ [node, way, way, way, way, way, way, way,… │
│ relation │ 90352 │ {admin_level=2, bo… │ [770774507, 377944… │ │ │ [outer, outer, out… │ [way, way, way, way, way, way, way, way, … │
└──────────────────────┴───────┴──────────────────────┴──────────────────────┴────────┴────────┴──────────────────────┴─────────────────────────────────────────────┘

Now we can see how the elements are defined: nodes have coordinates, ways have refs lists filled with node IDs and relations have the most complicated structure with refs lists filled with IDs and ref_types lists showing which ID correspond to which element type. Additionally, ref_roles have information about the role of the member (admin_centre, label, inner, outer).

Constructing point geometries from the nodes

Now that we know what the structure looks like, we can start building some geometries. Starting with nodes should be the easiest since it’s just a pair of latitudes and longitudes in the WGS84 Coordinate Reference System.

We should only extract nodes with at least one tag attached since those have any semantical meaning for analytical purposes. Nodes without any tags could be used to construct ways in the later stages.

SELECT
id,
tags,
ST_POINT(lon, lat) geometry
FROM ST_READOSM('monaco-latest.osm.pbf')
WHERE kind = 'node'
AND tags IS NOT NULL
AND cardinality(tags) > 0;

┌─────────────┬─────────────────────────────────────────────────────────┬──────────────────────────────────────────────┐
│ id │ tags │ geometry │
│ int64 │ map(varchar, varchar) │ geometry │
├─────────────┼─────────────────────────────────────────────────────────┼──────────────────────────────────────────────┤
│ 21911886 │ {crossing=zebra, crossing:island=no, crossing_ref=zeb… │ POINT (7.423498500000001 43.737239900000006) │
│ 21912962 │ {crossing=zebra, crossing_ref=zebra, highway=crossing} │ POINT (7.426912100000001 43.737912800000004) │
│ 21914341 │ {crossing=uncontrolled, crossing_ref=zebra, highway=c… │ POINT (7.4233732 43.737010000000005) │
│ 21915639 │ {highway=traffic_signals} │ POINT (7.4256003 43.7404449) │
│ 21917308 │ {bus=yes, name=Monte-Carlo (Casino), public_transport… │ POINT (7.4259854 43.740984700000006) │
│ 21918329 │ {crossing=marked, crossing:markings=yes, highway=cros… │ POINT (7.427889100000001 43.7423616) │
│ 21918589 │ {crossing=marked, crossing:island=yes, crossing:marki… │ POINT (7.4317478 43.7472774) │
│ 21918697 │ {crossing=marked, crossing:island=no, crossing:markin… │ POINT (7.432645000000001 43.747892900000004) │
│ 21918939 │ {bus=yes, name=Portier, public_transport=stop_position} │ POINT (7.430429800000001 43.741472800000004) │
│ 21919093 │ {crossing=marked, crossing:markings=yes, crossing_ref… │ POINT (7.4352171 43.748160000000006) │
│ · │ · │ · │
│ · │ · │ · │
│ · │ · │ · │
│ 11450012980 │ {direction=forward, highway=give_way} │ POINT (7.416853100000001 43.735809700000004) │
│ 11450012981 │ {direction=forward, highway=give_way} │ POINT (7.416783000000001 43.735872900000004) │
│ 11450012982 │ {direction=forward, highway=give_way} │ POINT (7.416664900000001 43.735887000000005) │
│ 11450012983 │ {direction=forward, highway=give_way} │ POINT (7.4166968 43.7356945) │
│ 11451922579 │ {bus=yes, highway=bus_stop, name=Larvotto, public_tra… │ POINT (7.435032400000001 43.7481639) │
│ 11451922580 │ {bus=yes, highway=bus_stop, name=Grimaldi Forum, publ… │ POINT (7.4311343 43.7442067) │
│ 11451922581 │ {bench=yes, bus=yes, highway=bus_stop, name=Portier, … │ POINT (7.430357300000001 43.742209100000004) │
│ 11451922582 │ {bench=no, bin=no, bus=yes, highway=bus_stop, lit=yes… │ POINT (7.4107674 43.7296193) │
│ 11451922600 │ {direction=backward, highway=give_way} │ POINT (7.4105622 43.7291648) │
│ 11452057060 │ {direction=backward, highway=give_way} │ POINT (7.419164 43.737116) │
├─────────────┴─────────────────────────────────────────────────────────┴──────────────────────────────────────────────┤
│ 3167 rows (20 shown) 3 columns │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
Nodes plotted on a map. Generated by the author using GeoPandas library.

After filtering the nodes based on tags, we are left with 3167 points.
It’s around 10% of the total number of nodes in the source file.

Constructing linestring and polygon geometries from the ways

With nodes out of the way 😉, let’s focus now on ways. Ways can take the form of linestrings or polygons. Let’s focus on linestrings first and then we will assign proper geometry types.

To construct the ways we have to do multiple operations:

  • Select matching ways with tags.
  • Unnest all nodes refs for each way element and keep them in the proper order (remember - nodes refs order matter!).
  • Select required nodes with geometries.
  • Group nodes per way ID and construct a linestring geometry using ST_MakeLinefunction.
  • Join constructed geometries with tags.

Let’s start with selecting ways with tags.

CREATE TEMP TABLE matching_ways AS 
SELECT id, tags
FROM ST_READOSM('monaco-latest.osm.pbf')
WHERE kind = 'way' AND tags IS NOT NULL AND cardinality(tags) > 0;

SELECT * FROM matching_ways;

┌────────────┬─────────────────────────────────────────────────────────────────┐
│ id │ tags │
│ int64 │ map(varchar, varchar) │
├────────────┼─────────────────────────────────────────────────────────────────┤
│ 4097656 │ {highway=secondary, lanes=2, lit=yes, maxspeed=30, name=Avenu… │
│ 4098197 │ {highway=tertiary, lanes=2, lit=yes, name=Boulevard d`Italie,… │
│ 4224972 │ {highway=residential, lit=yes, name=Avenue des Papalins, onew… │
│ 4224978 │ {access=no, addr:country=MC, leisure=pitch, lit=yes, sport=so… │
│ 4226740 │ {highway=secondary, lanes=2, lit=yes, maxspeed=50, name=Boule… │
│ 4227155 │ {area=yes, highway=pedestrian, lit=yes, name=Place du Palais,… │
│ 4227156 │ {access=no, bus=yes, demolished:highway=service, oneway=yes, … │
│ 4227157 │ {highway=residential, name=Rue des Remparts, oneway=yes, surf… │
│ 4227158 │ {highway=residential, name=Rue Philibert Florence, oneway=yes… │
│ 4227164 │ {highway=secondary, lanes=2, lit=yes, name=Virage Antony Nogu… │
│ · │ · │
│ · │ · │
│ · │ · │
│ 1230247124 │ {bench=no, bin=yes, bus=yes, check_date=2023-12-10, covered=y… │
│ 1230273018 │ {access=yes, leisure=playground, wheelchair=yes} │
│ 1230273019 │ {highway=footway, surface=paving_stones} │
│ 1230398878 │ {abandoned:railway=rail, highway=primary, lanes=2, lit=yes, m… │
│ 1233828725 │ {highway=service, lit=yes, name=Place Moulins, oneway=yes} │
│ 1233828726 │ {highway=tertiary, junction=roundabout, lanes=2, lit=yes, non… │
│ 1233853668 │ {highway=secondary, lanes=2, lit=yes, maxspeed=50, name=Boule… │
│ 1238339144 │ {cycleway:left=lane, highway=tertiary, lanes=2, lit=yes, maxs… │
│ 1238339145 │ {cycleway:left=lane, highway=tertiary, lanes=2, lit=yes, maxs… │
│ 1239415633 │ {footway=sidewalk, highway=footway} │
├────────────┴─────────────────────────────────────────────────────────────────┤
│ 4786 rows (20 shown) 2 columns │
└──────────────────────────────────────────────────────────────────────────────┘

Now we will unnest refs lists and split them into individual rows. We will also utilize DuckDB’s indexing functions to remember the order of elements in the original list.

CREATE TEMP TABLE matching_ways_with_nodes_refs AS
SELECT id, UNNEST(refs) as ref, UNNEST(range(length(refs))) as ref_idx
FROM ST_READOSM('monaco-latest.osm.pbf')
SEMI JOIN matching_ways USING (id)
WHERE kind = 'way';

SELECT * FROM matching_ways_with_nodes_refs;

┌───────────┬────────────┬─────────┐
│ id │ ref │ ref_idx │
│ int64 │ int64 │ int64 │
├───────────┼────────────┼─────────┤
│ 4097656 │ 21912089 │ 0 │
│ 4097656 │ 7265761724 │ 1 │
│ 4097656 │ 1079750744 │ 2 │
│ 4097656 │ 2104793864 │ 3 │
│ 4097656 │ 6340961560 │ 4 │
│ 4097656 │ 1110560507 │ 5 │
│ 4097656 │ 21912093 │ 6 │
│ 4097656 │ 6340961559 │ 7 │
│ 4097656 │ 21912095 │ 8 │
│ 4097656 │ 7265762803 │ 9 │
│ · │ · │ · │
│ · │ · │ · │
│ · │ · │ · │
│ 151154357 │ 7927855820 │ 2 │
│ 151154358 │ 1868015912 │ 0 │
│ 151154358 │ 1868015910 │ 1 │
│ 151154358 │ 1868015900 │ 2 │
│ 151154358 │ 4437858221 │ 3 │
│ 151154358 │ 1868015889 │ 4 │
│ 151154358 │ 1790048390 │ 5 │
│ 151154358 │ 1868015881 │ 6 │
│ 151154358 │ 1868015894 │ 7 │
│ 151154358 │ 1868015896 │ 8 │
├───────────┴────────────┴─────────┤
│ ? rows (>9999 rows, 20 shown) │
└──────────────────────────────────┘

As you can see, there are now multiple rows per way element and multiple ref values. Ref_idx represents the original order in the refs list.

You can also see the SEMI JOINclause in the query. This is a special join available in the DuckDB, that just filters rows without actually joining the second table. You can read more about it in the official documentation:

Now we can select the required nodes based on the refs from the previous step:

CREATE TEMP TABLE required_nodes_with_geometries AS
SELECT id, ST_POINT(lon, lat) geometry
FROM ST_READOSM('monaco-latest.osm.pbf') nodes
SEMI JOIN matching_ways_with_nodes_refs
ON nodes.id = matching_ways_with_nodes_refs.ref
WHERE kind = 'node';

SELECT * FROM required_nodes_with_geometries;

┌────────────┬──────────────────────────────────────────────┐
│ id │ geometry │
│ int64 │ geometry │
├────────────┼──────────────────────────────────────────────┤
│ 4547239708 │ POINT (7.4219302 43.7307441) │
│ 4547239709 │ POINT (7.421943700000001 43.730763) │
│ 4547239710 │ POINT (7.421885100000001 43.7308797) │
│ 4547239711 │ POINT (7.421869 43.730828200000005) │
│ 4547239712 │ POINT (7.4218596 43.7307865) │
│ 4547239713 │ POINT (7.4219166 43.7307701) │
│ 4547239714 │ POINT (7.4220242 43.730742) │
│ 4547239715 │ POINT (7.4220765 43.730725500000005) │
│ 4547239716 │ POINT (7.4220677 43.730680400000004) │
│ 4547239717 │ POINT (7.422105200000001 43.730662) │
│ · │ · │
│ · │ · │
│ · │ · │
│ 9983243582 │ POINT (7.427689600000001 43.732439400000004) │
│ 9983243583 │ POINT (7.4238021 43.7298837) │
│ 9983243584 │ POINT (7.4272774 43.7321503) │
│ 9983243585 │ POINT (7.4287133 43.7403807) │
│ 9983243586 │ POINT (7.429299500000001 43.740351100000005) │
│ 9983243587 │ POINT (7.4265819 43.7399094) │
│ 9983243588 │ POINT (7.4275342 43.7323072) │
│ 9983243589 │ POINT (7.427783700000001 43.732368) │
│ 9983243590 │ POINT (7.428839600000001 43.7402718) │
│ 9983243591 │ POINT (7.4269439 43.739707700000004) │
├────────────┴──────────────────────────────────────────────┤
│ ? rows (>9999 rows, 20 shown) 2 columns │
└───────────────────────────────────────────────────────────┘

Now we can construct the full linestrings:

CREATE TEMP TABLE matching_ways_linestrings AS
SELECT
matching_ways.id,
matching_ways.tags,
ST_MakeLine(list(nodes.geometry ORDER BY ref_idx ASC)) linestring
FROM matching_ways
JOIN matching_ways_with_nodes_refs
ON matching_ways.id = matching_ways_with_nodes_refs.id
JOIN required_nodes_with_geometries nodes
ON matching_ways_with_nodes_refs.ref = nodes.id
GROUP BY 1, 2;

SELECT * FROM matching_ways_linestrings;

┌────────────┬──────────────────────┬──────────────────────────────────────────┐
│ id │ tags │ linestring │
│ int64 │ map(varchar, varch… │ geometry │
├────────────┼──────────────────────┼──────────────────────────────────────────┤
│ 4227212 │ {highway=residenti… │ LINESTRING (7.424979 43.73150620000000… │
│ 4227230 │ {access=yes, admin… │ LINESTRING (7.4220576 43.7321018, 7.42… │
│ 4227273 │ {highway=secondary… │ LINESTRING (7.4214168 43.7365583000000… │
│ 4229325 │ {highway=secondary… │ LINESTRING (7.4201113 43.7373194000000… │
│ 4229900 │ {highway=secondary… │ LINESTRING (7.4167819 43.7299324, 7.41… │
│ 4230116 │ {highway=footway, … │ LINESTRING (7.4302666 43.7467818, 7.43… │
│ 24672725 │ {highway=residenti… │ LINESTRING (7.4356746 43.7516752, 7.43… │
│ 25082701 │ {area=yes, highway… │ LINESTRING (7.421602 43.7370201, 7.421… │
│ 31900562 │ {area=yes, floatin… │ LINESTRING (7.422541900000001 43.73529… │
│ 37794470 │ {admin_level=2, bo… │ LINESTRING (7.438411400000001 43.74942… │
│ · │ · │ · │
│ · │ · │ · │
│ · │ · │ · │
│ 1203859766 │ {landuse=grass} │ LINESTRING (7.4172849 43.7324023000000… │
│ 1203859775 │ {landuse=grass} │ LINESTRING (7.417039300000001 43.73209… │
│ 1218054879 │ {highway=footway, … │ LINESTRING (7.4257143 43.7307631000000… │
│ 1229465608 │ {amenity=shelter, … │ LINESTRING (7.417716100000001 43.73075… │
│ 1230102086 │ {footway=crossing,… │ LINESTRING (7.414346 43.72708160000000… │
│ 1230123519 │ {footway=link, hig… │ LINESTRING (7.4238967 43.7370229, 7.42… │
│ 1202747620 │ {crossing=unmarked… │ LINESTRING (7.420526100000001 43.72671… │
│ 1203858289 │ {crossing=uncontro… │ LINESTRING (7.4179021 43.7344211000000… │
│ 1203861575 │ {amenity=parking} │ LINESTRING (7.430036100000001 43.74015… │
│ 1203861579 │ {landuse=construct… │ LINESTRING (7.4323636 43.7423150000000… │
├────────────┴──────────────────────┴──────────────────────────────────────────┤
│ 4786 rows (20 shown) 3 columns │
└──────────────────────────────────────────────────────────────────────────────┘
Ways linestrings plotted on a map. Generated by the author using GeoPandas library.

After doing all of those operations, we have ways in the linestring form. Now we have to select the ways that are supposed to be polygons. We can do this based on tag values. Unfortunately, there is no single source of truth for this operation. We can look at the page from the OSM wiki to see one of the definitions created by the community.

A screenshot from the Overpass turbo/Polygon Features wiki page. Taken on 2024-01-17.

As you can see, this list is quite long, so for brevity we will only check if the linestring forms a closed loop and if the area tag value is not ‘no’.

WITH way_polygon_feature AS (
SELECT
id,
(
-- if first and last nodes are the same
ST_Equals(ST_StartPoint(linestring), ST_EndPoint(linestring))
-- if the element doesn't have any tags leave it as a Linestring
AND tags IS NOT NULL
-- if the element is specifically tagged 'area':'no' -> LineString
AND NOT (
list_contains(map_keys(tags), 'area')
AND list_extract(map_extract(tags, 'area'), 1) = 'no'
)
) AS is_polygon
FROM matching_ways_linestrings
)
SELECT
matching_ways_linestrings.id,
matching_ways_linestrings.tags,
(CASE
WHEN is_polygon
THEN ST_MakePolygon(linestring)
ELSE linestring
END)::GEOMETRY AS geometry
FROM matching_ways_linestrings
JOIN way_polygon_feature
ON matching_ways_linestrings.id = way_polygon_feature.id;

┌────────────┬──────────────────────┬──────────────────────────────────────────┐
│ id │ tags │ geometry │
│ int64 │ map(varchar, varch… │ geometry │
├────────────┼──────────────────────┼──────────────────────────────────────────┤
│ 4226740 │ {highway=secondary… │ LINESTRING (7.422170500000001 43.73286… │
│ 4227198 │ {highway=residenti… │ LINESTRING (7.420338900000001 43.73223… │
│ 4227216 │ {highway=service, … │ LINESTRING (7.423448400000001 43.73268… │
│ 4227242 │ {highway=secondary… │ LINESTRING (7.4221786 43.7322418000000… │
│ 4227272 │ {highway=secondary… │ LINESTRING (7.422117500000001 43.73702… │
│ 4229292 │ {foot=no, highway=… │ LINESTRING (7.41643 43.7311373, 7.4168… │
│ 4229577 │ {highway=residenti… │ LINESTRING (7.4251499 43.7397216, 7.42… │
│ 4230122 │ {alt_name=rue R. P… │ LINESTRING (7.428823400000001 43.74602… │
│ 25082741 │ {highway=secondary… │ LINESTRING (7.4231414 43.7369886000000… │
│ 25722909 │ {highway=footway, … │ LINESTRING (7.423659300000001 43.73134… │
│ · │ · │ · │
│ · │ · │ · │
│ · │ · │ · │
│ 1089844256 │ {handrail=no, high… │ LINESTRING (7.4229453 43.7326080000000… │
│ 1089844296 │ {highway=footway} │ LINESTRING (7.427017200000001 43.74044… │
│ 1202570722 │ {access=private, l… │ POLYGON ((7.4263241 43.7390763, 7.4263… │
│ 946167565 │ {disused:leisure=p… │ POLYGON ((7.4276993 43.739335100000005… │
│ 1089844271 │ {highway=footway} │ LINESTRING (7.4277667 43.732816, 7.427… │
│ 335731716 │ {highway=footway} │ LINESTRING (7.417764200000001 43.73439… │
│ 779454018 │ {building=resident… │ POLYGON ((7.4217522 43.7277202, 7.4218… │
│ 943330855 │ {landuse=grass} │ POLYGON ((7.4151844 43.7332252, 7.4152… │
│ 1089844229 │ {highway=steps, in… │ LINESTRING (7.4274525 43.7328052, 7.42… │
│ 49209608 │ {addr:city=Monaco,… │ POLYGON ((7.4223448 43.7304076, 7.4221… │
├────────────┴──────────────────────┴──────────────────────────────────────────┤
│ 4786 rows (20 shown) 3 columns │
└──────────────────────────────────────────────────────────────────────────────┘

Now we can see the mix of linestring and polygon geometries in the final result. Of course, the predicate for the is_polygon column could (or even should) be extended by the logic mentioned above regarding the values of the tags.

Compared to the image above, many more filled polygons are now visible on the map.

Ways geometries plotted on a map. Generated by the author using GeoPandas library.

Constructing polygon and multi-polygon geometries from the relations

Relations are utilized in OSM to group multiple other elements into a single object. Here we will focus solely on the (multi-) polygons. These specific elements have a type tag with one of two values: boundary, and multipolygon.

This kind of object is the most complex to reconstruct the geometry for and here is the list of steps we have to do:

  • Select relations with proper type value.
  • Unnest all refs related to the relation and keep only way refs — we only need related way refs to reconstruct a polygon.
  • Select required ways with linestring geometries — here we can utilize steps from constructing ways.
  • Assign an ‘outer’ role to the way ref if it’s null and check if any ref from the relation has the role ‘outer’ — if a relation has no ‘outer’ refs then treat all of them as ‘outer’.
  • Group all linestrings per ‘outer’ and ‘inner’ role and merge them into a single multilinestring — many relations are defined with multiple single linestrings that only together create a closed polygon.
  • Split multilinestrings into single closed-loop linestrings and save them as polygons.
  • Split geometries into ‘outer’ and ‘inner’ polygons. These can be extracted from the ref_role column of the relation object.
  • For each ‘outer’ polygon, select all ‘inner’ polygons that are fully within it and make holes in this polygon.
  • Make a union of all ‘outer’ polygons with holes.

Let’s start with selecting relations with matching tag values.

CREATE TEMP TABLE matching_relations AS 
SELECT id, tags
FROM ST_READOSM('monaco-latest.osm.pbf')
WHERE kind = 'relation' AND len(refs) > 0
AND tags IS NOT NULL AND cardinality(tags) > 0
AND list_contains(map_keys(tags), 'type')
AND list_has_any(map_extract(tags, 'type'), ['boundary', 'multipolygon']);

SELECT * FROM matching_relations;

┌──────────┬───────────────────────────────────────────────────────────────────┐
│ id │ tags │
│ int64 │ map(varchar, varchar) │
├──────────┼───────────────────────────────────────────────────────────────────┤
│ 7385 │ {ISO3166-2=FR-06, admin_level=6, alt_name:de=Meeralpen, border_… │
│ 8654 │ {ISO3166-2=FR-PAC, admin_level=4, border_type=region, boundary=… │
│ 36990 │ {ISO3166-1=MC, ISO3166-1:alpha2=MC, ISO3166-1:alpha3=MCO, admin… │
│ 174558 │ {admin_level=8, boundary=administrative, name=Roquebrune-Cap-Ma… │
│ 174562 │ {admin_level=8, boundary=administrative, name=Beausoleil, name:… │
│ 174956 │ {admin_level=8, alt_name:it=Capo d`Aglio, boundary=administrati… │
│ 174958 │ {admin_level=8, boundary=administrative, name=La Turbie, name:i… │
│ 393226 │ {building=castle, castle_type=palace, charge=10€, email=visites… │
│ 393481 │ {building=school, name=Pensionnat des Dames de Saint-Maur, type… │
│ 1124039 │ {ISO3166-1=MC, ISO3166-1:alpha2=MC, ISO3166-1:alpha3=MCO, ISO31… │
│ · │ · │
│ · │ · │
│ · │ · │
│ 14399505 │ {area=yes, floating=yes, man_made=pier, type=multipolygon} │
│ 16248281 │ {building=apartments, name=F, type=multipolygon} │
│ 16248282 │ {building=apartments, name=E, type=multipolygon} │
│ 16248283 │ {building=apartments, name=D, type=multipolygon} │
│ 16248284 │ {building=apartments, name=C, type=multipolygon} │
│ 16248285 │ {building=apartments, name=B, type=multipolygon} │
│ 16248286 │ {building=apartments, name=A, type=multipolygon} │
│ 16250182 │ {addr:country=MC, alt_name=Parc de la Roseraie, leisure=park, n… │
│ 16261416 │ {natural=water, type=multipolygon, water=pond} │
│ 16467322 │ {admin_level=2, boundary=land_area, land_area=administrative, n… │
├──────────┴───────────────────────────────────────────────────────────────────┤
│ 78 rows (20 shown) 2 columns │
└──────────────────────────────────────────────────────────────────────────────┘

Now we will unnest refs lists and split them into individual rows. Like with ways, here we will also utilize DuckDB’s indexing functions to remember the order of elements in the original list. Additionally, we will only keep the way refs.

CREATE TEMP TABLE matching_relations_with_ways_refs AS
WITH unnested_relation_refs AS (
SELECT
r.id,
UNNEST(refs) as ref,
UNNEST(ref_types) as ref_type,
UNNEST(ref_roles) as ref_role,
UNNEST(range(length(refs))) as ref_idx
FROM ST_READOSM('monaco-latest.osm.pbf') r
SEMI JOIN matching_relations USING (id)
WHERE kind = 'relation'
)
SELECT id, ref, ref_role, ref_idx
FROM unnested_relation_refs
WHERE ref_type = 'way';

SELECT * FROM matching_relations_with_ways_refs;

┌──────────┬───────────┬──────────┬─────────┐
│ id │ ref │ ref_role │ ref_idx │
│ int64 │ int64 │ varchar │ int64 │
├──────────┼───────────┼──────────┼─────────┤
│ 7385 │ 30836152 │ outer │ 2 │
│ 7385 │ 889278953 │ outer │ 3 │
│ 7385 │ 889278956 │ outer │ 4 │
│ 7385 │ 889278957 │ outer │ 5 │
│ 7385 │ 889278958 │ outer │ 6 │
│ 7385 │ 889278962 │ outer │ 7 │
│ 7385 │ 960087656 │ outer │ 8 │
│ 7385 │ 30836155 │ outer │ 9 │
│ 7385 │ 889278965 │ outer │ 10 │
│ 7385 │ 889278963 │ outer │ 11 │
│ · │ · │ · │ · │
│ · │ · │ · │ · │
│ · │ · │ · │ · │
│ 11278320 │ 35150936 │ outer │ 197 │
│ 11278320 │ 466647567 │ outer │ 198 │
│ 11278320 │ 214008684 │ outer │ 199 │
│ 11278320 │ 466647569 │ outer │ 200 │
│ 11278320 │ 466647568 │ outer │ 201 │
│ 11278320 │ 214008689 │ outer │ 202 │
│ 11278320 │ 263776194 │ outer │ 203 │
│ 11278320 │ 31124893 │ outer │ 204 │
│ 11278320 │ 31124895 │ outer │ 205 │
│ 11278320 │ 31124899 │ outer │ 206 │
├──────────┴───────────┴──────────┴─────────┤
│ ? rows (>9999 rows, 20 shown) 4 columns │
└───────────────────────────────────────────┘

In the next step, we will construct linestrings for the ways required by the relations. The query below compresses almost full logic of reading ways in one go (getting required nodes, constructing points and grouping them into linestrings):

CREATE TEMP TABLE required_ways_linestrings AS
WITH ways_required_by_relations_with_nodes_refs AS (
SELECT id, UNNEST(refs) as ref, UNNEST(range(length(refs))) as ref_idx
FROM ST_READOSM('monaco-latest.osm.pbf') ways
SEMI JOIN matching_relations_with_ways_refs
ON ways.id = matching_relations_with_ways_refs.ref
WHERE kind = 'way'
),
nodes_required_by_relations_with_geometries AS (
SELECT id, ST_POINT(lon, lat) geometry
FROM ST_READOSM('monaco-latest.osm.pbf') nodes
SEMI JOIN ways_required_by_relations_with_nodes_refs
ON nodes.id = ways_required_by_relations_with_nodes_refs.ref
WHERE kind = 'node'
)
SELECT
ways.id,
ST_MakeLine(list(nodes.geometry ORDER BY ref_idx ASC)) linestring
FROM ways_required_by_relations_with_nodes_refs ways
JOIN nodes_required_by_relations_with_geometries nodes
ON ways.ref = nodes.id
GROUP BY 1;

SELECT * FROM required_ways_linestrings;

┌────────────┬─────────────────────────────────────────────────────────────────┐
│ id │ linestring │
│ int64 │ geometry │
├────────────┼─────────────────────────────────────────────────────────────────┤
│ 37794470 │ LINESTRING (7.438411400000001 43.749422300000006, 7.4384056 4… │
│ 87878917 │ LINESTRING (7.418041100000001 43.7256933, 7.4181547 43.725613… │
│ 94252430 │ LINESTRING (7.4141873 43.729494100000004, 7.414203400000001 4… │
│ 94399618 │ LINESTRING (7.4239717 43.740543100000004, 7.4238252 43.740372… │
│ 94399736 │ LINESTRING (7.423881400000001 43.7402971, 7.4239411 43.740265… │
│ 104863462 │ LINESTRING (7.424840000000001 43.731281800000005, 7.424917000… │
│ 120114110 │ LINESTRING (7.420872 43.7370979, 7.4207787 43.7370534, 7.4206… │
│ 156242249 │ LINESTRING (7.4294728 43.739895600000004, 7.429579500000001 4… │
│ 165636038 │ LINESTRING (7.419717100000001 43.732398700000005, 7.419772900… │
│ 169297863 │ LINESTRING (7.422815300000001 43.7321967, 7.4234109 43.732263… │
│ · │ · │
│ · │ · │
│ · │ · │
│ 24874398 │ LINESTRING (7.418523400000001 43.7247599, 7.419012800000001 4… │
│ 92627424 │ LINESTRING (7.4166521 43.7322122, 7.4162303 43.73173910000000… │
│ 94452829 │ LINESTRING (7.4317066 43.7469647, 7.4317998 43.7470371, 7.431… │
│ 335740502 │ LINESTRING (7.4185151 43.7353633, 7.418442000000001 43.735298… │
│ 398377193 │ LINESTRING (7.420872 43.7370979, 7.420939000000001 43.7371298… │
│ 405529436 │ LINESTRING (7.417534900000001 43.7258914, 7.4175347 43.725833… │
│ 423632259 │ LINESTRING (7.4212208 43.7320944, 7.421461300000001 43.732057… │
│ 572935479 │ LINESTRING (7.427508100000001 43.7394168, 7.427496700000001 4… │
│ 586494707 │ LINESTRING (7.4270357 43.739109000000006, 7.4269512 43.739155… │
│ 1202570715 │ LINESTRING (7.426448400000001 43.739491400000006, 7.4264682 4… │
├────────────┴─────────────────────────────────────────────────────────────────┤
│ 141 rows (20 shown) 2 columns │
└──────────────────────────────────────────────────────────────────────────────┘

After creating the required linestrings, now we can join them with relations data. We will also make sure that the required ref_role is properly parsed — fill in the empty values or replace them if the relation has incorrectly defined ref_roles in the OSM database.

CREATE TEMP TABLE matching_relations_with_ways_linestrings AS
WITH unnested_relations_with_way_linestrings AS (
SELECT
r.id,
COALESCE(r.ref_role, 'outer') as ref_role,
r.ref,
w.linestring::GEOMETRY as geometry
FROM matching_relations_with_ways_refs r
JOIN required_ways_linestrings w
ON w.id = r.ref
ORDER BY r.id, r.ref_idx
),
any_outer_refs AS (
-- check if any way attached to the relation has the `outer` role
SELECT id, bool_or(ref_role == 'outer') has_any_outer_refs
FROM unnested_relations_with_way_linestrings
GROUP BY id
)
SELECT
unnested_relations_with_way_linestrings.* EXCLUDE (ref_role),
-- if none of the way refs has `outer` role - treat each ref as `outer`
CASE WHEN any_outer_refs.has_any_outer_refs
THEN unnested_relations_with_way_linestrings.ref_role
ELSE 'outer'
END as ref_role
FROM unnested_relations_with_way_linestrings
JOIN any_outer_refs
ON any_outer_refs.id = unnested_relations_with_way_linestrings.id;

SELECT * FROM matching_relations_with_ways_linestrings;

┌──────────┬───────────┬────────────────────────────────────────────┬──────────┐
│ id │ ref │ geometry │ ref_role │
│ int64 │ int64 │ geometry │ varchar │
├──────────┼───────────┼────────────────────────────────────────────┼──────────┤
│ 7385 │ 772081595 │ LINESTRING (7.415291600000001 43.7234393… │ outer │
│ 7385 │ 212810311 │ LINESTRING (7.4120416 43.7280955, 7.4123… │ outer │
│ 7385 │ 176533407 │ LINESTRING (7.412670800000001 43.7316348… │ outer │
│ 7385 │ 37811853 │ LINESTRING (7.4127007 43.7346954, 7.4126… │ outer │
│ 7385 │ 37794471 │ LINESTRING (7.428754700000001 43.7460174… │ outer │
│ 7385 │ 398372186 │ LINESTRING (7.436885 43.7519173, 7.43677… │ outer │
│ 7385 │ 37794470 │ LINESTRING (7.438411400000001 43.7494223… │ outer │
│ 7385 │ 770774507 │ LINESTRING (7.439171000000001 43.7490109… │ outer │
│ 8654 │ 772081595 │ LINESTRING (7.415291600000001 43.7234393… │ outer │
│ 8654 │ 212810311 │ LINESTRING (7.4120416 43.7280955, 7.4123… │ outer │
│ · │ · │ · │ · │
│ · │ · │ · │ · │
│ · │ · │ · │ · │
│ 11546878 │ 840890844 │ LINESTRING (7.420754 43.735872900000004,… │ outer │
│ 11546878 │ 840890848 │ LINESTRING (7.4207413 43.7356673, 7.4207… │ outer │
│ 16467322 │ 772081595 │ LINESTRING (7.415291600000001 43.7234393… │ outer │
│ 16467322 │ 212810311 │ LINESTRING (7.4120416 43.7280955, 7.4123… │ outer │
│ 16467322 │ 176533407 │ LINESTRING (7.412670800000001 43.7316348… │ outer │
│ 16467322 │ 37811853 │ LINESTRING (7.4127007 43.7346954, 7.4126… │ outer │
│ 16467322 │ 37794471 │ LINESTRING (7.428754700000001 43.7460174… │ outer │
│ 16467322 │ 398372186 │ LINESTRING (7.436885 43.7519173, 7.43677… │ outer │
│ 16467322 │ 37794470 │ LINESTRING (7.438411400000001 43.7494223… │ outer │
│ 16467322 │ 770774507 │ LINESTRING (7.439171000000001 43.7490109… │ outer │
├──────────┴───────────┴────────────────────────────────────────────┴──────────┤
│ 410 rows (20 shown) 4 columns │
└──────────────────────────────────────────────────────────────────────────────┘

As you can see, there are multiple ways with linestrings assigned to each relation. Let’s look at an example to see how they look on the map:

A single relation (5986437) with colour-coded ways that are part of it. Generated by the author using GeoPandas library.

To create full polygons, we have to utilize a ST_LineMerge function, that will combine a list of linestrings (you can compare it to tying pieces of string together). You can read more about this operation here:

As an additional validation step, we will check if the produced linestrings have at least 4 points and if the first point equals the last point, before converting them into polygons:

CREATE TEMP TABLE matching_relations_with_merged_polygons AS
WITH merged_linestrings AS (
SELECT
id,
ref_role,
UNNEST(
ST_Dump(ST_LineMerge(ST_Collect(list(geometry)))),
recursive := true
),
FROM matching_relations_with_ways_linestrings
GROUP BY id, ref_role
),
relations_with_linestrings AS (
SELECT
id,
ref_role,
-- ST_Dump returns column named `geom`
geom AS geometry,
row_number() OVER (PARTITION BY id) as geometry_id
FROM
merged_linestrings
-- discard linestrings with less than 4 points
WHERE ST_NPoints(geom) >= 4
),
valid_relations AS (
SELECT id, is_valid
FROM (
SELECT
id,
bool_and(
-- Check if start point equals the end point
ST_Equals(ST_StartPoint(geometry), ST_EndPoint(geometry))
) is_valid
FROM relations_with_linestrings
GROUP BY id
)
WHERE is_valid = true
)
SELECT
id,
ref_role,
ST_MakePolygon(geometry) geometry,
geometry_id
FROM relations_with_linestrings
SEMI JOIN valid_relations
ON relations_with_linestrings.id = valid_relations.id;

SELECT * FROM matching_relations_with_merged_polygons;

┌──────────┬──────────┬──────────────────────────────────────────┬─────────────┐
│ id │ ref_role │ geometry │ geometry_id │
│ int64 │ varchar │ geometry │ int64 │
├──────────┼──────────┼──────────────────────────────────────────┼─────────────┤
│ 1369631 │ inner │ POLYGON ((7.438232200000001 43.7511057… │ 1 │
│ 1369631 │ outer │ POLYGON ((7.438008600000001 43.7502850… │ 2 │
│ 1484217 │ outer │ POLYGON ((7.423010700000001 43.7307028… │ 1 │
│ 1484217 │ inner │ POLYGON ((7.423114600000001 43.7305994… │ 2 │
│ 5986436 │ outer │ POLYGON ((7.428754700000001 43.7460174… │ 1 │
│ 8269572 │ outer │ POLYGON ((7.4287048 43.737701400000006… │ 1 │
│ 8269572 │ inner │ POLYGON ((7.4284492 43.7375604, 7.4286… │ 2 │
│ 16248286 │ outer │ POLYGON ((7.426306200000001 43.7391565… │ 1 │
│ 16248286 │ inner │ POLYGON ((7.4263241 43.7390763, 7.4263… │ 2 │
│ 16261416 │ inner │ POLYGON ((7.417829200000001 43.731382,… │ 1 │
│ · │ · │ · │ · │
│ · │ · │ · │ · │
│ · │ · │ · │ · │
│ 8280869 │ inner │ POLYGON ((7.426753300000001 43.7388868… │ 2 │
│ 8280869 │ inner │ POLYGON ((7.4270357 43.739109000000006… │ 3 │
│ 8280869 │ inner │ POLYGON ((7.4271374 43.739011500000004… │ 4 │
│ 8280869 │ inner │ POLYGON ((7.4272666 43.7389165, 7.4273… │ 5 │
│ 11384697 │ outer │ POLYGON ((7.427075 43.7315167, 7.42749… │ 1 │
│ 11384697 │ inner │ POLYGON ((7.426917700000001 43.7313266… │ 2 │
│ 11384697 │ inner │ POLYGON ((7.427001400000001 43.7313937… │ 3 │
│ 11546878 │ outer │ POLYGON ((7.4207413 43.7356673, 7.4207… │ 1 │
│ 11538023 │ inner │ POLYGON ((7.426528200000001 43.7329359… │ 1 │
│ 11538023 │ outer │ POLYGON ((7.426554 43.7328276, 7.42654… │ 2 │
├──────────┴──────────┴──────────────────────────────────────────┴─────────────┤
│ 88 rows (20 shown) 4 columns │
└──────────────────────────────────────────────────────────────────────────────┘

Let’s see the previous example after this operation. We should expect two separate polygons to be present:

A single relation (5986437) with merged ways as two separate polygons. Generated by the author using GeoPandas library.

I’ve mentioned previously the outer and inner ‘ref_types’ of ways that create a relation. Here you can see what it looks like:

A single relation (8280869) with merged ways grouped into outer and inner polygons. Generated by the author using GeoPandas library.

The roles mean that the inner ways are the ‘holes’ within outer polygons and we have to reproduce this step to make proper geometries.

Let’s focus now on splitting the polygons into groups with and without holes, using the ST_Within predicate, which checks if a geometry is fully within another.

CREATE TEMP TABLE matching_relations_with_outer_polygons_with_holes AS
WITH outer_polygons AS (
SELECT id, geometry_id, geometry
FROM matching_relations_with_merged_polygons
WHERE ref_role = 'outer'
), inner_polygons AS (
SELECT id, geometry_id, geometry
FROM matching_relations_with_merged_polygons
WHERE ref_role = 'inner'
)
SELECT
op.id,
op.geometry_id,
ST_Difference(any_value(op.geometry), ST_Union_Agg(ip.geometry)) geometry
FROM outer_polygons op
JOIN inner_polygons ip
ON op.id = ip.id AND ST_WITHIN(ip.geometry, op.geometry)
GROUP BY op.id, op.geometry_id;

CREATE TEMP TABLE matching_relations_with_outer_polygons_without_holes AS
WITH outer_polygons AS (
SELECT id, geometry_id, geometry
FROM matching_relations_with_merged_polygons
WHERE ref_role = 'outer'
)
SELECT
op.id,
op.geometry_id,
op.geometry
FROM outer_polygons op
ANTI JOIN matching_relations_with_outer_polygons_with_holes opwh
ON op.id = opwh.id AND op.geometry_id = opwh.geometry_id;

SELECT * FROM matching_relations_with_outer_polygons_with_holes
UNION ALL
SELECT * FROM matching_relations_with_outer_polygons_without_holes;

┌──────────┬─────────────┬─────────────────────────────────────────────────────┐
│ id │ geometry_id │ geometry │
│ int64 │ int64 │ geometry │
├──────────┼─────────────┼─────────────────────────────────────────────────────┤
│ 16248286 │ 1 │ POLYGON ((7.4263139 43.7391602, 7.426324900000001… │
│ 1369192 │ 1 │ POLYGON ((7.4226247 43.7402251, 7.4227848 43.7396… │
│ 8147763 │ 1 │ POLYGON ((7.438223000000001 43.7477296, 7.4382403… │
│ 393226 │ 1 │ POLYGON ((7.4204802 43.731016700000005, 7.4203979… │
│ 11484092 │ 4 │ POLYGON ((7.417896900000001 43.724827000000005, 7… │
│ 11384697 │ 1 │ POLYGON ((7.4274934 43.731250700000004, 7.4267196… │
│ 8269572 │ 1 │ POLYGON ((7.4287166 43.7376724, 7.4287948 43.7376… │
│ 16261416 │ 3 │ POLYGON ((7.4178309 43.731391, 7.417877000000001 … │
│ 16248284 │ 1 │ POLYGON ((7.426637500000001 43.7394678, 7.4266485… │
│ 16248285 │ 1 │ POLYGON ((7.426114600000001 43.739267600000005, 7… │
│ · │ · │ · │
│ · │ · │ · │
│ · │ · │ · │
│ 11546879 │ 1 │ POLYGON ((7.4209316 43.7356234, 7.421037 43.73562… │
│ 2220206 │ 1 │ POLYGON ((7.4120416 43.7280955, 7.412103900000001… │
│ 5986438 │ 1 │ POLYGON ((7.4191482 43.738887500000004, 7.4199833… │
│ 2221178 │ 1 │ POLYGON ((7.415860400000001 43.7313885, 7.416195 … │
│ 2221179 │ 1 │ POLYGON ((7.422280000000001 43.7367186, 7.4227584… │
│ 2220208 │ 1 │ POLYGON ((7.41849 43.730922400000004, 7.4185156 4… │
│ 2254506 │ 1 │ POLYGON ((7.432995900000001 43.7447102, 7.4329125… │
│ 5986437 │ 1 │ POLYGON ((7.4307593 43.745595, 7.4306621 43.74541… │
│ 5986437 │ 2 │ POLYGON ((7.4343358 43.7457085, 7.4341337 43.7455… │
│ 11546878 │ 1 │ POLYGON ((7.4207413 43.7356673, 7.4207413 43.7357… │
├──────────┴─────────────┴─────────────────────────────────────────────────────┤
│ 47 rows (20 shown) 3 columns │
└──────────────────────────────────────────────────────────────────────────────┘

In this query, there is utilized another special join — ANTI JOIN. This one filters out all the rows on the left-side table that are joined by the right-side table.

The last step that is needed is to merge all the polygons for a single relation using the ST_Union_Agg operation. It will combine all polygons into multipolygons (if there is more than one) and produce a single geometry.

WITH unioned_outer_geometries AS (
SELECT id, geometry
FROM matching_relations_with_outer_polygons_with_holes
UNION ALL
SELECT id, geometry
FROM matching_relations_with_outer_polygons_without_holes
),
final_geometries AS (
SELECT id, ST_Union_Agg(geometry) geometry
FROM unioned_outer_geometries
GROUP BY id
)
SELECT r_g.id, r.tags, r_g.geometry
FROM final_geometries r_g
JOIN matching_relations r
ON r.id = r_g.id;

┌──────────┬──────────────────────┬────────────────────────────────────────────┐
│ id │ tags │ geometry │
│ int64 │ map(varchar, varch… │ geometry │
├──────────┼──────────────────────┼────────────────────────────────────────────┤
│ 393226 │ {building=castle, … │ POLYGON ((7.4204802 43.731016700000005, … │
│ 393481 │ {building=school, … │ POLYGON ((7.423992800000001 43.731841800… │
│ 1369191 │ {building=apartmen… │ POLYGON ((7.4231004 43.7412894, 7.423314… │
│ 1369192 │ {building=apartmen… │ POLYGON ((7.4226247 43.7402251, 7.422784… │
│ 1369193 │ {building=apartmen… │ POLYGON ((7.424 43.7405491, 7.4240401 43… │
│ 1369195 │ {building=apartmen… │ POLYGON ((7.418679900000001 43.738187100… │
│ 1369631 │ {addr:housenumber=… │ POLYGON ((7.4379729 43.7502505, 7.437937… │
│ 1369632 │ {addr:city=Monaco,… │ POLYGON ((7.4317121 43.74709, 7.4318277 … │
│ 1484190 │ {addr:city=Monaco,… │ POLYGON ((7.425469 43.731494000000005, 7… │
│ 1484217 │ {building=retail, … │ POLYGON ((7.423202300000001 43.7307117, … │
│ · │ · │ · │
│ · │ · │ · │
│ · │ · │ · │
│ 14399505 │ {area=yes, floatin… │ POLYGON ((7.4195986 43.7265293, 7.419595… │
│ 16248281 │ {building=apartmen… │ POLYGON ((7.4260438 43.739789800000004, … │
│ 16248282 │ {building=apartmen… │ POLYGON ((7.426243100000001 43.7396824, … │
│ 16248283 │ {building=apartmen… │ POLYGON ((7.426438200000001 43.739575200… │
│ 16248284 │ {building=apartmen… │ POLYGON ((7.426637500000001 43.7394678, … │
│ 16248285 │ {building=apartmen… │ POLYGON ((7.426114600000001 43.739267600… │
│ 16248286 │ {building=apartmen… │ POLYGON ((7.4263139 43.7391602, 7.426324… │
│ 16250182 │ {addr:country=MC, … │ POLYGON ((7.418348300000001 43.7256979, … │
│ 16261416 │ {natural=water, ty… │ POLYGON ((7.4178309 43.731391, 7.4178770… │
│ 11546878 │ {addr:city=Monaco,… │ POLYGON ((7.4207413 43.7356673, 7.420741… │
├──────────┴──────────────────────┴────────────────────────────────────────────┤
│ 46 rows (20 shown) 3 columns │
└──────────────────────────────────────────────────────────────────────────────┘

Here is the previous relation example, now with holes:

A single relation (8280869) — a polygon with holes. Generated by the author using GeoPandas library.
Relations geometries plotted on a map. Generated by the author using GeoPandas library.

Examples of badly defined relation objects

As OpenStreetMap data is mainly added by the community, there are examples where geometries are not properly defined. The OSM wiki describes rules for map makers on how to add geometries to a map.

This section will outline some common mistakes with examples.

Two overlapping ‘outer’ ways in the relation

This building is defined by two shapes: a rectangle and almost a circle. Since these two overlap, you can see how the rendering engine created a gap where there probably shouldn’t be any.

Screenshot from the OpenStreetMap by the author.

No ‘outer’ way in the relation

Here you can see a paintball field with way members defined as ‘Main Building’ and 4 ‘Arenas’. These should be all defined as outer ways.

Screenshot from the OpenStreetMap by the author.

Two overlapping or touching ‘inner’ ways

Screenshot from the OpenStreetMap by the author.

If you are interested in reading more about how these can be fixed, look into this repository:

QuackOSM — a hassle-free tool for reading OSM data

To end this article I want to highlight a library that can automatically download the OSM data, filter it by the geometry or using OSM tags and save it as a GeoParquet file that can be easily integrated into more scalable solutions. The library is written in Python and is open-source.

You can install it with a single command: pip install quackosm[cli].

This tutorial contains a simplified version of the queries used in the QuackOSM 🦆, but these won’t scale very well for the bigger regions. The library can easily parse whole countries such as France on a consumer-grade PC if you need to. You can of course utilize the DuckDB engine later on the prepared GeoParquet file using SPATIAL extension.

All of the steps defined here can be replaced with a single line of code:

>>> import quackosm as qosm
>>> qosm.get_features_gdf('monaco-latest.osm.pbf')
tags geometry
feature_id
way/834616137 {'highway': 'footway'} LINESTRING (7.42133 43.72711, 7.42134 43.72710...
way/408817108 {'addr:country': 'MC', 'building': 'yes', 'nam... POLYGON ((7.43533 43.74965, 7.43534 43.74955, ...
way/686435440 {'highway': 'footway'} LINESTRING (7.41381 43.73258, 7.41388 43.73266...
node/7799514898 {'name': 'Cino Bar', 'shop': 'kiosk'} POINT (7.42702 43.73118)
way/143214794 {'building': 'yes'} POLYGON ((7.42788 43.74437, 7.42786 43.74437, ...
... ... ...
way/161882794 {'highway': 'secondary', 'lanes': '2', 'lit': ... LINESTRING (7.42142 43.73656, 7.42147 43.73662...
way/1082330089 {'highway': 'steps', 'incline': 'down'} LINESTRING (7.42438 43.72990, 7.42440 43.72988...
way/834313093 {'highway': 'service', 'smoothness': 'intermed... LINESTRING (7.42709 43.73256, 7.42708 43.73262...
way/94399451 {'addr:country': 'MC', 'building': 'yes'} POLYGON ((7.41678 43.73699, 7.41661 43.73689, ...
node/2462515787 {'crossing': 'uncontrolled', 'highway': 'cross... POINT (7.41656 43.73208)

[7940 rows x 2 columns]
>>> qosm.convert_pbf_to_gpq('monaco-latest.osm.pbf')
PosixPath('files/monaco-latest_nofilter_noclip_compact.geoparquet')

Disclaimer

I’m the author of the QuackOSM library.

--

--

Published in Towards Data Science

Your home for data science and AI. The world’s leading publication for data science, data analytics, data engineering, machine learning, and artificial intelligence professionals.

Written by Kamil Raczycki

Spatial Data Scientist engaged in the open source community. Joining the machine learning domain with geospatial data. https://www.linkedin.com/in/raczyckikamil

Responses (1)