You have a file, possibly a huge CSV, and you want to import its content into your database. There are lots of options for doing this, but how do you decide which one to use? More often than not the real question is how much time the bulk load will take. I found myself asking the same thing a few days back when I had to design a data ingestion process for PostgreSQL in which we needed to bulk load around 250GB of data from CSV files every 24 hours.
The go-to solution for bulk loading into PostgreSQL is the native COPY command. But one limitation of the COPY command is that it requires the CSV file to be placed on the server. So I decided to do a simple comparison of bulk loading options and techniques.
In short, I wanted to see the performance difference between loading data into a standard table versus an unlogged table, and to compare the loading time of loading into a table that already has an index against the drop-index->load->recreate-index approach.
Moreover, I wanted to measure the performance of the COPY command, the client-side \copy command, loading through file_fdw, and pg_bulkload for each of the above options.
Database and system settings
Since the intention was to do a relative performance comparison of the different data loading techniques and options, my personal MacBook Pro running macOS Catalina with 16GB of RAM, a 2.7 GHz Quad-Core Intel Core i7 processor, and a 500 GB SSD was good enough to serve the purpose.
For the database I compiled PostgreSQL v12 from source with the default configure options. I left most of the configuration parameters at their default values and only changed the settings below.
shared_buffers = 2GB
work_mem = 400MB
maintenance_work_mem = 640MB
Sample data and schema
For the purpose of this exercise, I downloaded a sample CSV file with 5 million rows from http://eforexcel.com/wp/downloads-18-sample-csv-files-data-sets-for-testing-sales/. The file has 14 columns and is 624MB in size.
To keep things simple I created a sales_record table in PostgreSQL with a one-to-one mapping to the CSV file columns.
CREATE TABLE sales_record ( region VARCHAR, country VARCHAR, item_type VARCHAR, sales_channel VARCHAR, order_priority CHAR, order_date DATE, order_id INT, ship_date DATE, unit_sold INT, unit_price FLOAT, unit_cost FLOAT, total_revenue FLOAT, total_cost FLOAT, total_profit FLOAT );
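One of the comparisons mentioned earlier is loading into a standard table versus an unlogged table. A minimal sketch of one way to switch the same table between the two modes (the exact setup used for the tests may have differed):
-- Skip WAL for this table during the unlogged-table runs
ALTER TABLE sales_record SET UNLOGGED;
-- ... run the bulk load ...
-- Switch back to a normal, crash-safe table (note: this rewrites the table and generates WAL)
ALTER TABLE sales_record SET LOGGED;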
Along with that, I also wanted to see the impact of having an index on bulk load performance, so for the tests that require an index I created a btree index on the country column.
CREATE INDEX country_idx ON sales_record USING btree (country);
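For the drop-index->load->recreate-index runs, the index is simply dropped before the load and rebuilt afterwards. A minimal sketch of that sequence:
-- Drop the index so rows are inserted without any index maintenance overhead
DROP INDEX IF EXISTS country_idx;
-- ... bulk load the data using COPY, \copy, file_fdw, or pg_bulkload ...
-- Rebuild the index in a single pass once the data is in place
CREATE INDEX country_idx ON sales_record USING btree (country);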
Load using the COPY command
COPY moves data between PostgreSQL tables and standard file-system files. The command comes in two variants, COPY TO and COPY FROM. The former copies a table's contents to a file, while we will use the latter to load data into a table from a file.
COPY sales_record FROM '/Users/muhammadusama/work/data/5m_Sales_Records.csv' CSV HEADER;
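One simple way to time an individual run from psql is the \timing meta-command. A minimal sketch (not necessarily how the numbers below were collected):
-- Enable per-statement timing in psql
\timing on
-- Start from an empty table between runs (assumption about the test procedure)
TRUNCATE sales_record;
COPY sales_record FROM '/Users/muhammadusama/work/data/5m_Sales_Records.csv' CSV HEADER;
-- psql reports "Time: ... ms" when the COPY finishes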
Load using psql ‘\copy’
\copy is a psql operation that runs an SQL COPY command, but instead of the server reading or writing the specified file, psql (the client) reads or writes the file and routes the data between the server and the local file system. This means that file accessibility and privileges are those of the local user, not the server, and no SQL superuser privileges are required.
\copy sales_record FROM '/Users/muhammadusama/work/data/5m_Sales_Records.csv' csv header;
Through file_fdw
The foreign-data wrapper file_fdw can be used to access data files in the server's file system, or to execute programs on the server and read their output. We can also use file_fdw to load data from a CSV file into PostgreSQL tables.
-- Create file_fdw extension and foreign server
CREATE EXTENSION file_fdw;
CREATE SERVER file_fdw_server FOREIGN DATA WRAPPER file_fdw;
-- Define the foreign table that points to our CSV file
CREATE FOREIGN TABLE foreign_sales_record (
region VARCHAR,
country VARCHAR,
item_type VARCHAR,
sales_channel VARCHAR,
order_priority CHAR,
order_date DATE,
order_id INT,
ship_date DATE,
unit_sold INT,
unit_price FLOAT,
unit_cost FLOAT,
total_revenue FLOAT,
total_cost FLOAT,
total_profit FLOAT) SERVER file_fdw_server
OPTIONS (
format 'csv',
header 'true',
filename '/Users/muhammadusama/work/data/5m_Sales_Records.csv',
delimiter ',',
null '');
-- Copy the data from foreign table to local table
INSERT INTO sales_record SELECT * from foreign_sales_record;
Although file_fdw is not expected to be as fast as the COPY command when it comes to loading data, it provides a lot of flexibility for pre-processing the data before loading.
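For example, since the foreign table can be queried like any other table, rows can be filtered or transformed on the way in. A hedged sketch where the WHERE clause and expressions are purely illustrative:
-- Load only one sales channel, normalising the country name and recomputing the profit on the way in
INSERT INTO sales_record
SELECT region,
       upper(country),                        -- illustrative transformation
       item_type,
       sales_channel,
       order_priority,
       order_date,
       order_id,
       ship_date,
       unit_sold,
       unit_price,
       unit_cost,
       total_revenue,
       total_cost,
       unit_sold * (unit_price - unit_cost)   -- recompute total_profit
FROM foreign_sales_record
WHERE sales_channel = 'Online';               -- illustrative filter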
pg_bulkload tool
pg_bulkload is also a very interesting option when it comes to high-speed data loading. It's an open-source tool that achieves its performance by bypassing shared buffers and WAL logging.
-- CREATE pg_bulkload extension
$ bin/psql -c "CREATE EXTENSION pg_bulkload" postgres
-- Create control file with appropriate contents
$ more sample_csv.ctl
WRITER = PARALLEL
OUTPUT = public.sales_record # [<schema_name>.]table_name
INPUT = /Users/muhammadusama/work/data/5m_Sales_Records.csv # Input data location (absolute path)
TYPE = CSV # Input file type
QUOTE = "\"" # Quoting character
ESCAPE = \ # Escape character for Quoting
DELIMITER = "," # Delimiter
-- Execute pg_bulkload utility
$ bin/pg_bulkload -d postgres -h localhost sample_csv.ctl
Results
The chart below shows the time taken by each tool/command to load 5 million rows from the CSV file.
Conclusion
Each data loading method has its own pros and cons, which may make it the preferred choice for a particular use case. But when it comes to raw performance, pg_bulkload is a clear winner, with COPY and \copy lining up behind it, while file_fdw comes in last.
No matter which data loading method we use, loading into an indexed table is always slower, so do consider the drop-index->load->recreate-index approach when you have a huge amount of data to load.
The comparison of all the tools was apples to apples, with both client and server running on the same machine, so \copy had no network overhead. When the PostgreSQL server and client are on different machines, the \copy command may not perform as well as in the results above.
Muhammad Usama is a database architect / PostgreSQL consultant at HighGo Software and also a Pgpool-II core committer. Usama has been involved with database development (PostgreSQL) since 2006. He is a core committer for the open source middleware project Pgpool-II and has played a pivotal role in driving and enhancing the product. Prior to coming to open source development, Usama was doing software design and development with a main focus on system-level embedded development. After joining EnterpriseDB, an enterprise PostgreSQL company, in 2006, he started his career in open source development, specifically in PostgreSQL and Pgpool-II. He is a major contributor to the Pgpool-II project and has contributed to many performance and high availability related features.