We’re extremely pleased to announce the release of nanoparquet 0.3.0. nanoparquet is a new R package that reads Parquet files into data frames, and writes data frames to Parquet files.
You can install it from CRAN with:
install.packages("nanoparquet")
This blog post covers the features and limitations of nanoparquet, as well as our future plans.
What is Parquet?
Parquet is a file format for storing data on disk. It is specifically designed for large data sets, read-heavy workloads and data analysis. The most important features of Parquet are:
- Columnar. Data is stored column-wise, so whole columns (or large chunks of columns) are easy to read quickly. Columnar storage allows better compression, fast operations on a subset of columns, and easy ways of removing columns or adding new columns to a data file.
- Binary. A Parquet file is not a text file. Each Parquet data type is stored in a well-defined binary storage format, leaving no ambiguity about how fields are parsed.
- Rich types. Parquet supports a small set of low-level data types with well-specified storage formats and encodings. On top of the low-level types, it implements several higher-level logical types, like UTF-8 strings, time stamps, JSON strings, ENUM types (factors), etc.
- Well supported. At this point Parquet is well supported across modern languages like R, Python, Rust, Java, Go, etc. In particular, Apache Arrow handles Parquet files very well, and has bindings to many languages. DuckDB is a very portable tool that reads and writes Parquet files, or even opens a set of Parquet files as a database.
- Performant. Parquet columns may use various encodings and compression to keep data files as small as possible. When running an analytical query on a subset of the data, the Parquet format makes it easy to skip the columns and/or rows that are irrelevant.
- Concurrency built in. A Parquet file can be divided into row groups. Parquet readers can read, uncompress and decode row groups in parallel. Parquet writers can encode and compress row groups in parallel. Even a single column may be divided into multiple pages, which can be (un)compressed, encoded and decoded in parallel.
- Missing values. Support for missing values is built into the Parquet format.
Why did we create nanoparquet?
Although Parquet is well supported by modern languages, the complexity of the format often outweighs its benefits for smaller data sets. Many of the tools that support Parquet are geared towards larger, out-of-memory data sets, so there is a perception that Parquet is only for big data. These tools typically take longer to compile or install, and often feel too heavy for in-memory data analysis.
With nanoparquet, we wanted a smaller tool that has no dependencies and is easy to install. Our goal is to facilitate the adoption of Parquet for smaller data sets, especially for teams that share data between multiple environments, e.g. R, Python and Java.
nanoparquet Features
These are some of the nanoparquet features that we are most excited about.
- Lightweight. nanoparquet has no package or system dependencies other than a C++11 compiler. It compiles in about 30 seconds into an R package that is less than a megabyte in size.
- Reads many Parquet files. nanoparquet::read_parquet() supports reading most Parquet files. In particular, it supports all Parquet encodings and, at the time of writing, three compression codecs: Snappy, Gzip and Zstd. Make sure you read “Limitations” below.
- Writes many R data types. nanoparquet::write_parquet() supports writing most R data frames. In particular, missing values are handled properly, factor columns are kept as factors, and temporal types are encoded correctly. Make sure you read “Limitations” below.
- Type mappings. nanoparquet has a well-defined set of type mapping rules. Use the parquet_column_types() function to see how read_parquet() and write_parquet() map Parquet and R types for a file or a data frame.
- Metadata queries. nanoparquet has a number of functions that allow you to query the metadata and schema without reading in the full data set.
Examples
Reading a Parquet file
The nanoparquet R package contains an example Parquet file. We are going to use it to demonstrate how the package works.
If the pillar package is loaded, then nanoparquet data frames are pretty-printed.
library(nanoparquet)
library(pillar)
udf <- system.file("extdata/userdata1.parquet", package = "nanoparquet")
Before actually reading the file, let’s look up some metadata about it and see how its columns will be mapped to R types:
parquet_info(udf)
#> # A data frame: 1 × 7
#> file_name num_cols num_rows num_row_groups file_size parquet_version
#> <chr> <int> <dbl> <int> <dbl> <int>
#> 1 /Users/gaborcsardi… 13 1000 1 73217 1
#> # ℹ 1 more variable: created_by <chr>
parquet_column_types(udf)
#> # A data frame: 13 × 6
#> file_name name type r_type repetition_type logical_type
#> * <chr> <chr> <chr> <chr> <chr> <I<list>>
#> 1 /Users/gaborcsa… regi… INT64 POSIX… REQUIRED <TIMESTAMP(TRUE, micros)>
#> 2 /Users/gaborcsa… id INT32 integ… REQUIRED <INT(32, TRUE)>
#> 3 /Users/gaborcsa… firs… BYTE… chara… OPTIONAL <STRING>
#> 4 /Users/gaborcsa… last… BYTE… chara… REQUIRED <STRING>
#> 5 /Users/gaborcsa… email BYTE… chara… OPTIONAL <STRING>
#> 6 /Users/gaborcsa… gend… BYTE… factor OPTIONAL <STRING>
#> 7 /Users/gaborcsa… ip_a… BYTE… chara… REQUIRED <STRING>
#> 8 /Users/gaborcsa… cc BYTE… chara… OPTIONAL <STRING>
#> 9 /Users/gaborcsa… coun… BYTE… chara… REQUIRED <STRING>
#> 10 /Users/gaborcsa… birt… INT32 Date OPTIONAL <DATE>
#> 11 /Users/gaborcsa… sala… DOUB… double OPTIONAL <NULL>
#> 12 /Users/gaborcsa… title BYTE… chara… OPTIONAL <STRING>
#> 13 /Users/gaborcsa… comm… BYTE… chara… OPTIONAL <STRING>
For every Parquet column we see its low-level Parquet data type in type, e.g. INT64 or BYTE_ARRAY. r_type is the R type that read_parquet() will create for that column. If repetition_type is REQUIRED, then that column cannot contain missing values; OPTIONAL columns may have missing values. logical_type is the higher-level Parquet data type.
E.g. the first column is a UTC timestamp (because of the TRUE), in microseconds. It is stored as a 64-bit integer in the file, and it will be converted to a POSIXct object by read_parquet().
To actually read the file into a data frame, call read_parquet():
ud1 <- read_parquet(udf)
ud1
#> # A data frame: 1,000 × 13
#> registration id first_name last_name email gender ip_address cc
#> <dttm> <int> <chr> <chr> <chr> <fct> <chr> <chr>
#> 1 2016-02-03 07:55:29 1 Amanda Jordan ajord… Female 1.197.201… 6759…
#> 2 2016-02-03 17:04:03 2 Albert Freeman afree… Male 218.111.1… NA
#> 3 2016-02-03 01:09:31 3 Evelyn Morgan emorg… Female 7.161.136… 6767…
#> 4 2016-02-03 00:36:21 4 Denise Riley drile… Female 140.35.10… 3576…
#> 5 2016-02-03 05:05:31 5 Carlos Burns cburn… NA 169.113.2… 5602…
#> 6 2016-02-03 07:22:34 6 Kathryn White kwhit… Female 195.131.8… 3583…
#> 7 2016-02-03 08:33:08 7 Samuel Holmes sholm… Male 232.234.8… 3582…
#> 8 2016-02-03 06:47:06 8 Harry Howell hhowe… Male 91.235.51… NA
#> 9 2016-02-03 03:52:53 9 Jose Foster jfost… Male 132.31.53… NA
#> 10 2016-02-03 18:29:47 10 Emily Stewart estew… Female 143.28.25… 3574…
#> # ℹ 990 more rows
#> # ℹ 5 more variables: country <chr>, birthdate <date>, salary <dbl>,
#> # title <chr>, comments <chr>
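As an illustrative spot check (output not shown here), the column classes of ud1 line up with the r_type values that parquet_column_types() reported above:
# registration should be a POSIXct, gender a factor and birthdate a Date,
# matching the r_type column reported by parquet_column_types()
sapply(ud1[c("registration", "gender", "birthdate")], function(x) class(x)[1])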
Writing a Parquet file
To show write_parquet(), we’ll use the flights data in the nycflights13 package:
library(nycflights13)
flights
#> # A tibble: 336,776 × 19
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 2013 1 1 517 515 2 830 819
#> 2 2013 1 1 533 529 4 850 830
#> 3 2013 1 1 542 540 2 923 850
#> 4 2013 1 1 544 545 -1 1004 1022
#> 5 2013 1 1 554 600 -6 812 837
#> 6 2013 1 1 554 558 -4 740 728
#> 7 2013 1 1 555 600 -5 913 854
#> 8 2013 1 1 557 600 -3 709 723
#> 9 2013 1 1 557 600 -3 838 846
#> 10 2013 1 1 558 600 -2 753 745
#> # ℹ 336,766 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#> # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#> # hour <dbl>, minute <dbl>, time_hour <dttm>
First we check how the columns of flights will be mapped to Parquet types:
parquet_column_types(flights)
#> # A data frame: 19 × 6
#> file_name name type r_type repetition_type logical_type
#> <chr> <chr> <chr> <chr> <chr> <I<list>>
#> 1 NA year INT32 integ… REQUIRED <INT(32, TRUE)>
#> 2 NA month INT32 integ… REQUIRED <INT(32, TRUE)>
#> 3 NA day INT32 integ… REQUIRED <INT(32, TRUE)>
#> 4 NA dep_time INT32 integ… OPTIONAL <INT(32, TRUE)>
#> 5 NA sched_dep_t… INT32 integ… REQUIRED <INT(32, TRUE)>
#> 6 NA dep_delay DOUB… double OPTIONAL <NULL>
#> 7 NA arr_time INT32 integ… OPTIONAL <INT(32, TRUE)>
#> 8 NA sched_arr_t… INT32 integ… REQUIRED <INT(32, TRUE)>
#> 9 NA arr_delay DOUB… double OPTIONAL <NULL>
#> 10 NA carrier BYTE… chara… REQUIRED <STRING>
#> 11 NA flight INT32 integ… REQUIRED <INT(32, TRUE)>
#> 12 NA tailnum BYTE… chara… OPTIONAL <STRING>
#> 13 NA origin BYTE… chara… REQUIRED <STRING>
#> 14 NA dest BYTE… chara… REQUIRED <STRING>
#> 15 NA air_time DOUB… double OPTIONAL <NULL>
#> 16 NA distance DOUB… double REQUIRED <NULL>
#> 17 NA hour DOUB… double REQUIRED <NULL>
#> 18 NA minute DOUB… double REQUIRED <NULL>
#> 19 NA time_hour INT64 POSIX… REQUIRED <TIMESTAMP(TRUE, micros)>
This looks fine, so we go ahead and write out the file. By default it will be Snappy-compressed, and many columns will be dictionary encoded.
write_parquet(flights, "flights.parquet")
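As a quick, illustrative sanity check, you can read the file straight back with read_parquet() and compare its shape to the original data:
# Round-trip check: the file we just wrote should have the same dimensions
# (336,776 rows, 19 columns) as the flights data frame.
flights2 <- read_parquet("flights.parquet")
identical(dim(flights2), dim(flights))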
Parquet metadata
Use parquet_schema() to see the schema of a Parquet file. The schema also includes “internal” Parquet columns: every Parquet file is a tree in which columns may be children of “internal” columns. nanoparquet currently only supports flat files, which consist of a single internal root column; all other columns are leaf columns and children of the root:
parquet_schema("flights.parquet")
#> # A data frame: 20 × 11
#> file_name name type type_length repetition_type converted_type
#> <chr> <chr> <chr> <int> <chr> <chr>
#> 1 flights.parquet schema NA NA NA NA
#> 2 flights.parquet year INT32 NA REQUIRED INT_32
#> 3 flights.parquet month INT32 NA REQUIRED INT_32
#> 4 flights.parquet day INT32 NA REQUIRED INT_32
#> 5 flights.parquet dep_time INT32 NA OPTIONAL INT_32
#> 6 flights.parquet sched_dep_t… INT32 NA REQUIRED INT_32
#> 7 flights.parquet dep_delay DOUB… NA OPTIONAL NA
#> 8 flights.parquet arr_time INT32 NA OPTIONAL INT_32
#> 9 flights.parquet sched_arr_t… INT32 NA REQUIRED INT_32
#> 10 flights.parquet arr_delay DOUB… NA OPTIONAL NA
#> 11 flights.parquet carrier BYTE… NA REQUIRED UTF8
#> 12 flights.parquet flight INT32 NA REQUIRED INT_32
#> 13 flights.parquet tailnum BYTE… NA OPTIONAL UTF8
#> 14 flights.parquet origin BYTE… NA REQUIRED UTF8
#> 15 flights.parquet dest BYTE… NA REQUIRED UTF8
#> 16 flights.parquet air_time DOUB… NA OPTIONAL NA
#> 17 flights.parquet distance DOUB… NA REQUIRED NA
#> 18 flights.parquet hour DOUB… NA REQUIRED NA
#> 19 flights.parquet minute DOUB… NA REQUIRED NA
#> 20 flights.parquet time_hour INT64 NA REQUIRED TIMESTAMP_MIC…
#> # ℹ 5 more variables: logical_type <I<list>>, num_children <int>, scale <int>,
#> # precision <int>, field_id <int>
To see more information about a Parquet file, use parquet_metadata():
parquet_metadata("flights.parquet")
#> $file_meta_data
#> # A data frame: 1 × 5
#> file_name version num_rows key_value_metadata created_by
#> <chr> <int> <dbl> <I<list>> <chr>
#> 1 flights.parquet 1 336776 <tbl [1 × 2]> https://github.com/gaborc…
#>
#> $schema
#> # A data frame: 20 × 11
#> file_name name type type_length repetition_type converted_type
#> <chr> <chr> <chr> <int> <chr> <chr>
#> 1 flights.parquet schema NA NA NA NA
#> 2 flights.parquet year INT32 NA REQUIRED INT_32
#> 3 flights.parquet month INT32 NA REQUIRED INT_32
#> 4 flights.parquet day INT32 NA REQUIRED INT_32
#> 5 flights.parquet dep_time INT32 NA OPTIONAL INT_32
#> 6 flights.parquet sched_dep_t… INT32 NA REQUIRED INT_32
#> 7 flights.parquet dep_delay DOUB… NA OPTIONAL NA
#> 8 flights.parquet arr_time INT32 NA OPTIONAL INT_32
#> 9 flights.parquet sched_arr_t… INT32 NA REQUIRED INT_32
#> 10 flights.parquet arr_delay DOUB… NA OPTIONAL NA
#> 11 flights.parquet carrier BYTE… NA REQUIRED UTF8
#> 12 flights.parquet flight INT32 NA REQUIRED INT_32
#> 13 flights.parquet tailnum BYTE… NA OPTIONAL UTF8
#> 14 flights.parquet origin BYTE… NA REQUIRED UTF8
#> 15 flights.parquet dest BYTE… NA REQUIRED UTF8
#> 16 flights.parquet air_time DOUB… NA OPTIONAL NA
#> 17 flights.parquet distance DOUB… NA REQUIRED NA
#> 18 flights.parquet hour DOUB… NA REQUIRED NA
#> 19 flights.parquet minute DOUB… NA REQUIRED NA
#> 20 flights.parquet time_hour INT64 NA REQUIRED TIMESTAMP_MIC…
#> # ℹ 5 more variables: logical_type <I<list>>, num_children <int>, scale <int>,
#> # precision <int>, field_id <int>
#>
#> $row_groups
#> # A data frame: 1 × 7
#> file_name id total_byte_size num_rows file_offset total_compressed_size
#> <chr> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 flights.parq… 0 5732430 336776 NA NA
#> # ℹ 1 more variable: ordinal <int>
#>
#> $column_chunks
#> # A data frame: 19 × 19
#> file_name row_group column file_path file_offset offset_index_offset
#> <chr> <int> <int> <chr> <dbl> <dbl>
#> 1 flights.parquet 0 0 NA 23 NA
#> 2 flights.parquet 0 1 NA 111 NA
#> 3 flights.parquet 0 2 NA 323 NA
#> 4 flights.parquet 0 3 NA 6738 NA
#> 5 flights.parquet 0 4 NA 468008 NA
#> 6 flights.parquet 0 5 NA 893557 NA
#> 7 flights.parquet 0 6 NA 1312660 NA
#> 8 flights.parquet 0 7 NA 1771896 NA
#> 9 flights.parquet 0 8 NA 2237931 NA
#> 10 flights.parquet 0 9 NA 2653250 NA
#> 11 flights.parquet 0 10 NA 2847249 NA
#> 12 flights.parquet 0 11 NA 3374563 NA
#> 13 flights.parquet 0 12 NA 3877832 NA
#> 14 flights.parquet 0 13 NA 3966418 NA
#> 15 flights.parquet 0 14 NA 4264662 NA
#> 16 flights.parquet 0 15 NA 4639410 NA
#> 17 flights.parquet 0 16 NA 4976781 NA
#> 18 flights.parquet 0 17 NA 5120936 NA
#> 19 flights.parquet 0 18 NA 5427022 NA
#> # ℹ 13 more variables: offset_index_length <int>, column_index_offset <dbl>,
#> # column_index_length <int>, type <chr>, encodings <I<list>>,
#> # path_in_schema <I<list>>, codec <chr>, num_values <dbl>,
#> # total_uncompressed_size <dbl>, total_compressed_size <dbl>,
#> # data_page_offset <dbl>, index_page_offset <dbl>,
#> # dictionary_page_offset <dbl>
The output includes the schema, as above, but also data about the row groups (write_parquet() currently always writes a single row group) and the column chunks. There is one column chunk per column in each row group.
The column chunk information also tells you whether a column chunk is dictionary-encoded, which encodings it uses, its size, etc.
cc <- parquet_metadata("flights.parquet")$column_chunks
cc[, c("column", "encodings", "dictionary_page_offset")]
#> # A data frame: 19 × 3
#> column encodings dictionary_page_offset
#> <int> <I<list>> <dbl>
#> 1 0 <chr [3]> 4
#> 2 1 <chr [3]> 48
#> 3 2 <chr [3]> 181
#> 4 3 <chr [3]> 1445
#> 5 4 <chr [3]> 463903
#> 6 5 <chr [3]> 891412
#> 7 6 <chr [3]> 1306995
#> 8 7 <chr [3]> 1767223
#> 9 8 <chr [3]> 2235594
#> 10 9 <chr [3]> 2653154
#> 11 10 <chr [3]> 2831850
#> 12 11 <chr [3]> 3352496
#> 13 12 <chr [3]> 3877796
#> 14 13 <chr [3]> 3965856
#> 15 14 <chr [3]> 4262597
#> 16 15 <chr [3]> 4638461
#> 17 16 <chr [3]> 4976675
#> 18 17 <chr [3]> 5120660
#> 19 18 <chr [3]> 5379476
cc[["encodings"]][1:3]
#> [[1]]
#> [1] "PLAIN" "RLE" "RLE_DICTIONARY"
#>
#> [[2]]
#> [1] "PLAIN" "RLE" "RLE_DICTIONARY"
#>
#> [[3]]
#> [1] "PLAIN" "RLE" "RLE_DICTIONARY"
Limitations
nanoparquet 0.3.0 has a number of limitations.
- Only flat tables. read_parquet() can only read flat tables, i.e. Parquet files without nested columns. (Technically all Parquet files are nested, and nanoparquet supports exactly one level of nesting: a single meta column that contains all other columns.) Similarly, write_parquet() will not write list columns.
- Unsupported Parquet types. read_parquet() currently reads some Parquet types, e.g. FLOAT16 and INTERVAL, as raw vectors in a list column. See the manual for details.
- No encryption. Encrypted Parquet files are not supported.
- Missing compression codecs. LZO, BROTLI and LZ4 compression are not yet supported.
- No statistics. nanoparquet does not read or write statistics, e.g. minimum and maximum values, from or to Parquet files.
- No checksums. nanoparquet does not currently check or write checksums.
- No Bloom filters. nanoparquet does not currently support reading or writing Bloom filters from or to Parquet files.
- May be slow for large files. Being single-threaded and not fully optimized, nanoparquet is probably not well suited to large data sets. It should be fine for a couple of gigabytes, and may be fine as long as all of the data fits comfortably into memory.
- Single row group. write_parquet() always creates a single row group, which is not optimal for large files.
- Automatic encoding. It is currently not possible to choose column encodings manually in write_parquet().
We are planning to address these limitations while keeping nanoparquet as lean as possible. In particular, if you find a Parquet file that nanoparquet cannot read, please report an issue in our issue tracker!
Other tools for Parquet files
If you run into some of these limitations, chances are you are dealing with a larger data set, and you will probably benefit from using tools geared towards larger Parquet files. Luckily, you have several options.
In R
Apache Arrow
You can usually install the arrow package from CRAN. Note, however, that some CRAN builds are suboptimal at the time of writing, e.g. the macOS builds lack Parquet support, so it is best to install arrow from R-universe on these platforms.
Call arrow::read_parquet() to read Parquet files, and arrow::write_parquet() to write them. You can also use arrow::open_dataset() to open (one or more) Parquet files and perform queries on them without loading all data into memory.
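For example, here is a sketch (assuming the flights.parquet file written above, and the dplyr package) of a lazy query where only the aggregated result is collected into memory:
library(dplyr)
# Open the Parquet file as an Arrow dataset and query it lazily; only the
# result of collect() is read into R.
ds <- arrow::open_dataset("flights.parquet")
ds |>
  filter(origin == "JFK") |>
  count(dest) |>
  collect()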
DuckDB
DuckDB is an excellent tool that handles Parquet files seamlessly. You can install the duckdb R package from CRAN.
To read a Parquet file into an R data frame with DuckDB, call
df <- duckdb:::sql("FROM 'file.parquet'")
Alternatively, you can open (one or more) Parquet files and query them as a DuckDB database, potentially without reading all data into memory at once.
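As an illustration, again using the flights.parquet file from above, you can run SQL against the file through DBI, and only the query result is returned to R:
con <- DBI::dbConnect(duckdb::duckdb())
# DuckDB scans the Parquet file in place; only the aggregated result
# comes back as an R data frame.
DBI::dbGetQuery(
  con,
  "SELECT origin, COUNT(*) AS n FROM 'flights.parquet' GROUP BY origin"
)
DBI::dbDisconnect(con, shutdown = TRUE)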
Here is an example that shows how to put an R data frame into a (temporary) DuckDB database, and how to export it to Parquet:
drv <- duckdb::duckdb()
con <- DBI::dbConnect(drv)
on.exit(DBI::dbDisconnect(con), add = TRUE)
DBI::dbWriteTable(con, "mtcars", mtcars)
DBI::dbExecute(con, DBI::sqlInterpolate(con,
"COPY mtcars TO ?filename (FORMAT 'parquet', COMPRESSION 'snappy')",
filename = 'mtcars.parquet'
))
In Python
There are at least three good options to handle Parquet files in Python. Just like for R, the first two are Apache Arrow and DuckDB. You can also try the fastparquet Python package for a potentially lighter solution.
Acknowledgements
nanoparquet would not exist without the work of Hannes Mühleisen on miniparquet, which had similar goals but is now discontinued. nanoparquet is a fork of miniparquet.
nanoparquet also contains code and test Parquet files from DuckDB, Apache Parquet, Apache Arrow and Apache Thrift, as well as libraries from Google, Facebook and others. See the COPYRIGHTS file in the repository for the full details.