Apache Arrow nanoarrow 0.5.0 Release
Published 27 May 2024
By The Apache Arrow PMC (pmc)
The Apache Arrow team is pleased to announce the 0.5.0 release of Apache Arrow nanoarrow. This release covers 79 resolved issues from 9 contributors.
Release Highlights
The primary focus of the nanoarrow 0.5.0 release was expanding the initial Python bindings that were released in 0.4.0. The nanoarrow Python package can now create and consume most Arrow data types, arrays, and array streams, including conversion to/from objects compatible with the Python buffer protocol and conversion to/from lists of Python objects.
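To make "objects compatible with the Python buffer protocol" concrete, the following standard-library sketch inspects such an object through a memoryview. It does not import nanoarrow, and the helper name describe_buffer is hypothetical. It only illustrates the kind of metadata (element format, item size, length) that a buffer-protocol consumer can read without copying the data.

```python
import array


def describe_buffer(obj):
    """Summarize a buffer-protocol object (hypothetical helper, not nanoarrow API)."""
    # memoryview() accepts any object that exposes the buffer protocol and
    # does not copy the underlying memory.
    with memoryview(obj) as view:
        return {
            "format": view.format,      # struct-style type code, e.g. "i"
            "itemsize": view.itemsize,  # bytes per element
            "length": len(view),        # number of elements for a 1-D buffer
            "values": view.tolist(),    # Python objects, for display only
        }


ints = array.array("i", [1, 2, 3])
info = describe_buffer(ints)
print(info)
```

Objects such as array.array, bytes, and NumPy arrays all expose this protocol; the Python API reference describes which nanoarrow entry points accept them.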
The nanoarrow 0.5.0 release also includes updates to its build configuration to make it possible to use nanoarrow with FetchContent in projects with a wider variety of CMake usage. In addition to CMake, nanoarrow now supports the Meson build system. Thanks to @vyasr and @WillAyd for contributing these changes!
In the R bindings, support for reading IPC streams is now accessible with read_nanoarrow()!
Finally, build system helpers and helpers to reconcile modern C++ usage with nanoarrow C structures (e.g., iterating over an ArrowArrayStream or ArrowArray using a range-for loop) were added to nanoarrow.hpp. Thanks to @bkeitz for contributing these changes!
See the Changelog for a detailed list of contributions to this release.
Breaking Changes
Most changes included in the nanoarrow 0.5.0 release will not break downstream code; however, several changes in the C library are breaking changes to previous behaviour.
- ArrowBufferResize() and ArrowBitmapResize() now adjust size_bytes/size_bits in addition to capacity_bytes/buffer.capacity_bytes. Previously, these functions only adjusted the capacity of the underlying buffer, which caused some understandable confusion even though this behaviour was documented. This change affects all usage of ArrowBufferResize() and ArrowBitmapResize() that increased the size of the underlying buffer (i.e., usage where shrink_to_fit was non-zero should be unaffected).
- ArrowBufferReset() now always calls the allocator’s free() callback. Previously, a call to the free() callback was skipped if the pointer was NULL; however, this led to some confusion and made it easy to accidentally leak a custom deallocator whose pointer happened to be NULL.
- As a consequence of the above, it is now mandatory to call ArrowBufferInit() before calling ArrowBufferReset(). There was some existing usage of nanoarrow that zeroed the memory for an ArrowBuffer and then (sometimes) called ArrowBufferReset(). Previously this was a no-op; however, after 0.5.0 this will crash. This is consistent with other structures in the nanoarrow C library (which require initialization before it is safe to reset/release them).
Python bindings
The nanoarrow Python bindings are distributed as the nanoarrow package on PyPI and conda-forge:
pip install nanoarrow
conda install nanoarrow -c conda-forge
High-level users can use the Schema, Array, and ArrayStream classes to interact with data types, arrays, and array streams:
import nanoarrow as na
na.int32()
#> <Schema> int32
na.Array([1, 2, 3], na.int32())
#> nanoarrow.Array<int32>[3]
#> 1
#> 2
#> 3
url = "https://github.com/apache/arrow-experiments/raw/main/data/arrow-commits/arrow-commits.arrows"
na.ArrayStream.from_url(url)
#> nanoarrow.ArrayStream<non-nullable struct<commit: string, time: timestamp('us', 'UTC'), files: int3...>
Low-level users can use c_schema(), c_array(), and c_array_stream() to interact with thin wrappers around the Arrow C Data interface structures:
import pyarrow as pa
na.c_schema(pa.decimal128(10, 3))
#> <nanoarrow.c_schema.CSchema decimal128(10, 3)>
#> - format: 'd:10,3'
#> - name: ''
#> - flags: 2
#> - metadata: NULL
#> - dictionary: NULL
#> - children[0]:
na.c_array(["one", "two", "three", None], na.string())
#> <nanoarrow.c_array.CArray string>
#> - length: 4
#> - offset: 0
#> - null_count: 1
#> - buffers: (4754305168, 4754307808, 4754310464)
#> - dictionary: NULL
#> - children[0]:
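The format and flags fields in the c_schema() output above follow the Arrow C Data interface conventions: decimal types use a format string of the form d:precision,scale with an optional bit width, and flags is a bit field. The following sketch decodes both using only the standard library; the helper names parse_decimal_format and decode_flags are hypothetical and are not part of nanoarrow.

```python
# Flag bits defined by the Arrow C Data interface for ArrowSchema.flags.
ARROW_FLAG_DICTIONARY_ORDERED = 1
ARROW_FLAG_NULLABLE = 2
ARROW_FLAG_MAP_KEYS_SORTED = 4


def parse_decimal_format(fmt):
    """Parse 'd:precision,scale[,bitwidth]' (hypothetical helper)."""
    prefix, _, params = fmt.partition(":")
    if prefix != "d" or not params:
        raise ValueError(f"not a decimal format string: {fmt!r}")
    parts = [int(p) for p in params.split(",")]
    if len(parts) == 2:
        precision, scale = parts
        bitwidth = 128  # the two-parameter form denotes decimal128
    elif len(parts) == 3:
        precision, scale, bitwidth = parts
    else:
        raise ValueError(f"unexpected decimal parameters: {fmt!r}")
    return {"precision": precision, "scale": scale, "bitwidth": bitwidth}


def decode_flags(flags):
    """Return the names of the flag bits that are set (hypothetical helper)."""
    names = []
    if flags & ARROW_FLAG_DICTIONARY_ORDERED:
        names.append("dictionary_ordered")
    if flags & ARROW_FLAG_NULLABLE:
        names.append("nullable")
    if flags & ARROW_FLAG_MAP_KEYS_SORTED:
        names.append("map_keys_sorted")
    return names


print(parse_decimal_format("d:10,3"))
print(decode_flags(2))
```

Read this way, the output above describes a nullable decimal128 with precision 10 and scale 3.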
All nanoarrow type/array-like objects implement the Arrow PyCapsule interface for both producing and consuming and are zero-copy interchangeable with pyarrow objects in many cases:
import pyarrow as pa
pa.field(na.int32())
#> pyarrow.Field<: int32>
na.Schema(pa.string())
#> <Schema> string
pa.array(na.Array([4, 5, 6], na.int32()))
#> <pyarrow.lib.Int32Array object at 0x11b552500>
#> [
#> 4,
#> 5,
#> 6
#> ]
na.Array(pa.array([10, 11, 12]))
#> nanoarrow.Array<int64>[3]
#> 10
#> 11
#> 12
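The interchange shown above works because producers advertise themselves through the dunder methods named by the Arrow PyCapsule interface (__arrow_c_schema__, __arrow_c_array__, and __arrow_c_stream__). As a rough sketch of how a consumer might detect them, the following standard-library example checks an object for those methods. The helper arrow_capsule_roles and the FakeArray class are hypothetical and exist only for this demonstration; this is not how nanoarrow or pyarrow are necessarily implemented.

```python
# Dunder methods named by the Arrow PyCapsule interface.
_PROTOCOL_METHODS = {
    "schema": "__arrow_c_schema__",
    "array": "__arrow_c_array__",
    "stream": "__arrow_c_stream__",
}


def arrow_capsule_roles(obj):
    """Return which PyCapsule export roles obj advertises (hypothetical helper)."""
    return [
        role
        for role, method in _PROTOCOL_METHODS.items()
        if callable(getattr(obj, method, None))
    ]


class FakeArray:
    """Stand-in used only for this demonstration; it exports no real data."""

    def __arrow_c_array__(self, requested_schema=None):
        raise NotImplementedError("demonstration only")


print(arrow_capsule_roles(FakeArray()))
print(arrow_capsule_roles([1, 2, 3]))
```

A plain list advertises none of the roles, which is why converting lists of Python objects is a separate code path from PyCapsule-based interchange.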
For a more detailed tour of the nanoarrow Python bindings, see the Getting started in Python guide and the Python API reference.
C/C++
The nanoarrow 0.5.0 release includes a number of bugfixes and improvements to the core C library and C++ helpers.
First, the CMake build system was refactored to enable FetchContent to work in a wider variety of develop/build/install scenarios. In most cases, CMake-based projects should be able to add the nanoarrow C library as a dependency with:
include(FetchContent)
FetchContent_Declare(nanoarrow
                     GIT_REPOSITORY https://github.com/apache/arrow-nanoarrow.git
                     GIT_TAG apache-arrow-nanoarrow-0.5.0
                     GIT_SHALLOW TRUE)
FetchContent_MakeAvailable(nanoarrow)
add_executable(some_target ...)
target_link_libraries(some_target nanoarrow::nanoarrow)
Projects using the Meson build system can install nanoarrow from WrapDB using:
mkdir -p subprojects
meson wrap install nanoarrow
…and use dependency('nanoarrow') to add the dependency:
nanoarrow_dep = dependency('nanoarrow')
example_exec = executable('some_target',
                          ...,
                          dependencies: [nanoarrow_dep])
Finally, a set of C++ range/view helpers was added to smooth out some of the more verbose aspects of working with nanoarrow in C++. While the new helpers are targeted at more than just nanoarrow’s tests, they have been particularly helpful in making those tests less repetitive and more effective. For example, one particularly verbose test was collapsed to:
#include <gtest/gtest.h>
#include <gmock/gmock-matchers.h>
#include <nanoarrow/nanoarrow_gtest_util.hpp>
#include <nanoarrow/nanoarrow.hpp>
nanoarrow::UniqueArrayStream array_stream;
// ... populate array_stream
nanoarrow::ViewArrayStream array_stream_view(array_stream.get());
for (ArrowArray& array : array_stream_view) {
  // ElementsAre lives in the testing namespace provided by gmock
  EXPECT_THAT(nanoarrow::ViewArrayAs<int32_t>(&array), testing::ElementsAre(1234));
}
EXPECT_EQ(array_stream_view.count(), 1);
EXPECT_EQ(array_stream_view.code(), NANOARROW_OK);
EXPECT_STREQ(array_stream_view.error()->message, "");
See the new section in the C++ API reference for details.
R bindings
The nanoarrow R bindings are distributed as the nanoarrow package on CRAN.
Whereas nanoarrow has had an IPC reader supporting most features of the IPC streaming format since 0.3.0, the R package did not expose it until this release. The 0.5.0 release of the R package includes read_nanoarrow() as an entry point for reading streams from various sources including URLs, filenames, and R connections:
library(nanoarrow)
url <- "https://github.com/apache/arrow-experiments/raw/main/data/arrow-commits/arrow-commits.arrows"
read_nanoarrow(url) |>
  tibble::as_tibble()
#> # A tibble: 15,487 × 5
#> commit time files merge message
#> <chr> <dttm> <int> <lgl> <chr>
#> 1 49cdb0fe4e98fda19031c864a18e6156c6ed… 2024-03-07 02:00:52 2 FALSE GH-403…
#> 2 1d966e98e41ce817d1f8c5159c0b9caa4de7… 2024-03-06 21:51:34 1 FALSE GH-403…
#> 3 96f26a89bd73997f7532643cdb27d04b7097… 2024-03-06 20:29:15 1 FALSE GH-402…
#> 4 ee1a8c39a55f3543a82fed900dadca791f6e… 2024-03-06 07:46:45 1 FALSE GH-403…
#> 5 3d467ac7bfae03cf2db09807054c5672e195… 2024-03-05 16:13:32 1 FALSE GH-201…
#> 6 ef6ea6beed071ed070daf03508f4c14b4072… 2024-03-05 14:53:13 20 FALSE GH-403…
#> 7 53e0c745ad491af98a5bf18b67541b12d779… 2024-03-05 12:31:38 2 FALSE GH-401…
#> 8 3ba6d286caad328b8572a3b9228045da8c8d… 2024-03-05 08:15:42 6 FALSE GH-400…
#> 9 4ce9a5edd2710fb8bf0c642fd0e3863b01c2… 2024-03-05 07:56:25 2 FALSE GH-401…
#> 10 2445975162905bd8d9a42ffc9cd0daa0e19d… 2024-03-05 01:04:20 1 FALSE GH-403…
#> # ℹ 15,477 more rows
In developing the Python bindings, it became clear that an analogue of Arrow C++’s ChunkedArray was an important concept to represent. Whereas the Python bindings have the Array class to provide this structure, the R bindings had only the nanoarrow_array as a thin wrapper around the Arrow C Data interface. When developing the geospatial extension GeoArrow for R, a data structure that maintained chunked Arrow memory as an R vector was needed as an intermediary between an Arrow-native source and an R-native destination. This experimental structure can be created with as_nanoarrow_vctr():
library(nanoarrow)
array <- as_nanoarrow_array(c("one", "two", "three"))
convert_array(array, nanoarrow_vctr())
#> <nanoarrow_vctr string[3]>
#> [1] "one" "two" "three"
Contributors
This release consists of contributions from 9 contributors in addition to the invaluable advice and support of the Apache Arrow developer mailing list.
$ git shortlog -sn apache-arrow-nanoarrow-0.5.0.dev..apache-arrow-nanoarrow-0.5.0 | grep -v "GitHub Actions"
67 Dewey Dunnington
3 Dirk Eddelbuettel
3 Joris Van den Bossche
2 William Ayd
1 Alenka Frim
1 Benjamin Kietzman
1 Max Conradt
1 Vyas Ramasubramani
1 eitsupi