Apache Arrow nanoarrow 0.5.0 Release


Published 27 May 2024
By The Apache Arrow PMC (pmc)

The Apache Arrow team is pleased to announce the 0.5.0 release of Apache Arrow nanoarrow. This release covers 79 resolved issues from 9 contributors.

Release Highlights

The primary focus of the nanoarrow 0.5.0 release was expanding the initial Python bindings that were released in 0.4.0. The nanoarrow Python package can now create and consume most Arrow data types, arrays, and array streams, including conversion to/from objects compatible with the Python buffer protocol and conversion to/from lists of Python objects.

The nanoarrow 0.5.0 release also includes updates to its build configuration to make it possible to use nanoarrow with FetchContent in projects with a wider variety of CMake usage. In addition to CMake, nanoarrow now supports the Meson build system. Thanks to @vyasr and @WillAyd for contributing these changes!

In the R bindings, support for reading IPC streams is now accessible with read_nanoarrow()!

Finally, build system helpers and helpers to reconcile modern C++ usage with nanorrow C structures (e.g., iterating over an ArrowArrayStream or ArrowArray using a range-for loop) were added to nanoarrow.hpp. Thanks to @bkeitz for contributing these changes!

See the Changelog for a detailed list of contributions to this release.

Breaking Changes

Most changes included in the nanoarrow 0.5.0 release will not break downstream code; however, several changes in the C library are breaking changes to previous behaviour.

  • ArrowBufferResize() and ArrowBitmapResize() now adjust size_bytes/ size_bits in addition to capacity_bytes/buffer.capacity_bytes. Preivously these functions only adjusted the capacity of the underlying buffer which caused some understandable confusion even though this behaviour was documented. This change affects all usage of ArrowBufferReisze() and ArrowBitmapResize() that increased the size of the underlying buffer (i.e., usage where shrink_to_fit was non zero should be unaffected).
  • ArrowBufferReset() now always calls the allocator’s free() callback. Previously, a call to the free() callback was skipped if the pointer was NULL; however, this led to some confusion and made it easy to accidentally leak a custom deallocator whose pointer happened to be NULL.
  • As a consequence of the above, it is now mandatory to call ArrowBufferInit() before calling ArrowBufferReset(). There was some existing usage of nanoarrow that zero-ed the memory for an ArrowBuffer and then (sometimes) called ArrowBufferReset(). Preivously this was a no-op; however, after 0.5.0 this will crash. This is consistent with other structures in the nanoarrow C library (which require an initialization before it is safe to reset/release them).

Python bindings

The nanoarrow Python bindings are distributed as the nanoarrow package on PyPI and conda-forge:

pip install nanoarrow
conda install nanoarrow -c conda-forge

High level users can use the Schema, Array, and ArrayStream classes to interact with data types, arrays, and array streams:

import nanoarrow as na

na.int32()
#> <Schema> int32

na.Array([1, 2, 3], na.int32())
#> nanoarrow.Array<int32>[3]
#> 1
#> 2
#> 3

url = "https://github.com/apache/arrow-experiments/raw/main/data/arrow-commits/arrow-commits.arrows"
na.ArrayStream.from_url(url)
#> nanoarrow.ArrayStream<non-nullable struct<commit: string, time: timestamp('us', 'UTC'), files: int3...>

Low-level users can use c_schema(), c_array(), and c_array_stream() to interact with thin wrappers around the Arrow C Data interface structures:

na.c_schema(pa.decimal128(10, 3))
#> <nanoarrow.c_schema.CSchema decimal128(10, 3)>
#> - format: 'd:10,3'
#> - name: ''
#> - flags: 2
#> - metadata: NULL
#> - dictionary: NULL
#> - children[0]:

na.c_array(["one", "two", "three", None], na.string())
#> <nanoarrow.c_array.CArray string>
#> - length: 4
#> - offset: 0
#> - null_count: 1
#> - buffers: (4754305168, 4754307808, 4754310464)
#> - dictionary: NULL
#> - children[0]:

All nanoarrow type/array-like objects implement the Arrow PyCapsule interface for both producing and consuming and are zero-copy interchangeable with pyarrow objects in many cases:

import pyarrow as pa

pa.field(na.int32())
#> pyarrow.Field<: int32>

na.Schema(pa.string())
#> <Schema> string

pa.array(na.Array([4, 5, 6], na.int32()))
#> <pyarrow.lib.Int32Array object at 0x11b552500>
#> [
#>   4,
#>   5,
#>   6
#> ]

na.Array(pa.array([10, 11, 12]))
#> nanoarrow.Array<int64>[3]
#> 10
#> 11
#> 12

For a more detailed tour of the nanoarrow Python bindings, see the Getting started in Python guide and the Python API reference.

C/C++

The nanoarrow 0.5.0 release includes a number of bugfixes and improvements to the core C library and C++ helpers.

First, the CMake build system was refactored to enable FetchContent to work in a wider variety of develop/build/install scenarios. In most cases, CMake-based projects should be able to add the nanoarrow C library as a dependency with:

include(FetchContent)
fetchcontent_declare(nanoarrow
                     GIT_REPOSITORY https://github.com/apache/arrow-nanoarrow.git
                     GIT_TAG  apache-arrow-nanoarrow-0.5.0
                     GIT_SHALLOW TRUE)
fetchcontent_makeavailable(nanoarrow)

add_executable(some_target ...)
target_link_libraries(some_target nanoarrow::nanoarrow)

Projects using the Meson build system can install nanoarrow from WrapDB using:

mkdir -p subprojects
meson wrap install nanoarrow

…and use dependency('nanoarrow') to add the dependency:

nanoarrow_dep = dependency('nanoarrow')
example_exec = executable('some_target',
                          ...,
                          dependencies: [nanoarrow_dep])

Finally, a set of C++ range/view helpers were added to smooth out some of more verbose aspects of working with nanoarrow in C++. While the new helpers are targeted at more than just nanoarrow’s tests, they have been particularly helpful in allowing nanoarrow’s tests to be more less repetitive and more effective. For example, one particularly verbose test was collapsed to:

#include <gtest/gtest.h>
#include <gmock/gmock-matchers.h>
#include <nanoarrow/nanoarrow_gtest_util.hpp>
#include <nanoarrow/nanoarrow.hpp>

nanoarrow::UniqueArrayStream array_stream;
// ... populate array_stream
nanoarrow::ViewArrayStream array_stream_view(array_stream.get());

for (ArrowArray& array : array_stream_view) {
  EXPECT_THAT(nanoarrow::ViewArrayAs<int32_t>(&array), ElementsAre(1234));
}

EXPECT_EQ(array_stream_view.count(), 1);
EXPECT_EQ(array_stream_view.code(), NANOARROW_OK);
EXPECT_STREQ(array_stream_view.error()->message, "");

See the new section in the C++ API reference for details.

R bindings

The nanoarrow R bindings are distributed as the nanoarrow package on CRAN.

Whereas nanoarrow has had an IPC reader supporting most features of the IPC streaming format since 0.3.0, the R bindings did not implement bindings until this release. The 0.5.0 release of the R package includes read_nanoarrow() as an entrypoint to reading streams from various sources including URLs, filenames, and R connections:

library(nanoarrow)

url <- "https://github.com/apache/arrow-experiments/raw/main/data/arrow-commits/arrow-commits.arrows"

read_nanoarrow(url) |>
  tibble::as_tibble()
#> # A tibble: 15,487 × 5
#>    commit                                time                files merge message
#>    <chr>                                 <dttm>              <int> <lgl> <chr>
#>  1 49cdb0fe4e98fda19031c864a18e6156c6ed… 2024-03-07 02:00:52     2 FALSE GH-403…
#>  2 1d966e98e41ce817d1f8c5159c0b9caa4de7… 2024-03-06 21:51:34     1 FALSE GH-403…
#>  3 96f26a89bd73997f7532643cdb27d04b7097… 2024-03-06 20:29:15     1 FALSE GH-402…
#>  4 ee1a8c39a55f3543a82fed900dadca791f6e… 2024-03-06 07:46:45     1 FALSE GH-403…
#>  5 3d467ac7bfae03cf2db09807054c5672e195… 2024-03-05 16:13:32     1 FALSE GH-201…
#>  6 ef6ea6beed071ed070daf03508f4c14b4072… 2024-03-05 14:53:13    20 FALSE GH-403…
#>  7 53e0c745ad491af98a5bf18b67541b12d779… 2024-03-05 12:31:38     2 FALSE GH-401…
#>  8 3ba6d286caad328b8572a3b9228045da8c8d… 2024-03-05 08:15:42     6 FALSE GH-400…
#>  9 4ce9a5edd2710fb8bf0c642fd0e3863b01c2… 2024-03-05 07:56:25     2 FALSE GH-401…
#> 10 2445975162905bd8d9a42ffc9cd0daa0e19d… 2024-03-05 01:04:20     1 FALSE GH-403…
#> # ℹ 15,477 more rows

In developing the Python bindings, it became clear that a representation of a Arrow C++’s ChunkedArray was an important concept to represent. Whereas the Python bindings have the Array class to provide this structure, the R bindings had only the nanoarrow_array as a thin wrapper around the Arrow C Data interface. When developing the geospatial extension GeoArrow for R, a data structure that maintained chunked Arrow memory as an R vector was needed as an intermediary between an Arrow-native source and an R-native destination. This experimental structure can be created with as_nanoarrow_vctr():

library(nanoarrow)

array <- as_nanoarrow_array(c("one", "two", "three"))
convert_array(array, nanoarrow_vctr())
#> <nanoarrow_vctr string[3]>
#> [1] "one"   "two"   "three"

Contributors

This release consists of contributions from 9 contributors in addition to the invaluable advice and support of the Apache Arrow developer mailing list.

$ git shortlog -sn apache-arrow-nanoarrow-0.5.0.dev..apache-arrow-nanoarrow-0.5.0 | grep -v "GitHub Actions"
  67  Dewey Dunnington
  3  Dirk Eddelbuettel
  3  Joris Van den Bossche
  2  William Ayd
  1  Alenka Frim
  1  Benjamin Kietzman
  1  Max Conradt
  1  Vyas Ramasubramani
  1  eitsupi