Skip to content

Commit

Permalink
Zarr notebook (rapidsai#261)
Browse files Browse the repository at this point in the history
Fixes rapidsai#143

cc. @ivirshup @jakirkham

Authors:
  - Mads R. B. Kristensen (https://github.com/madsbk)

Approvers:
  - Alexey Kamenev (https://github.com/Alexey-Kamenev)
  - Benjamin Zaitlen (https://github.com/quasiben)

URL: rapidsai#261
  • Loading branch information
madsbk authored Aug 14, 2023
1 parent 6915392 commit 622a525
Show file tree
Hide file tree
Showing 3 changed files with 372 additions and 2 deletions.
1 change: 1 addition & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -17,3 +17,4 @@ docs/build/
cpp/doxygen/html/
.mypy_cache
.hypothesis
.ipynb_checkpoints
9 changes: 7 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -67,7 +67,7 @@ conda env create --name kvikio-dev --file conda/environments/all_cuda-118_arch-x

### C++ (build from source)

To build the C++ example, go to the `cpp` subdiretory and run:
To build the C++ example run:

```
./build.sh libkvikio
Expand All @@ -81,7 +81,7 @@ Then run the example:

### Python (build from source)

To build and install the extension, go to the `python` subdirectory and run:
To build and install the extension run:

```
./build.sh kvikio
Expand All @@ -103,6 +103,11 @@ python benchmarks/single-node-io.py

## Examples


### Notebooks
- [How to read and write GPU memory directly to/from Zarr files](notebooks/zarr.ipynb)


### C++

```c++
Expand Down
364 changes: 364 additions & 0 deletions notebooks/zarr.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,364 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 23,
"id": "7a060f7d-9a0c-4763-98df-7dc82409c6ba",
"metadata": {},
"outputs": [],
"source": [
"\"\"\"\n",
"In this tutorial, we will show how to use KvikIO to read and write GPU memory directly to/from Zarr files.\n",
"\"\"\"\n",
"import json\n",
"import shutil\n",
"import numpy\n",
"import cupy\n",
"import zarr\n",
"import kvikio\n",
"import kvikio.zarr\n",
"from kvikio.nvcomp_codec import NvCompBatchCodec\n",
"from numcodecs import LZ4"
]
},
{
"cell_type": "markdown",
"id": "99f4d25b-2006-4026-8629-1accafb338ef",
"metadata": {},
"source": [
"We need to set three Zarr arguments: \n",
" - `meta_array`: in order to make Zarr read into GPU memory (instead of CPU memory), we set the `meta_array` argument to an empty CuPy array. \n",
" - `store`: we need to use a GPU compatible Zarr Store, which will be KvikIO’s GDS store in our case. \n",
" - `compressor`: finally, we need to use a GPU compatible compressor (or `None`). KvikIO provides a nvCOMP compressor `kvikio.nvcomp_codec.NvCompBatchCodec` that we will use."
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "c179c24a-766e-4e09-83c5-349868042576",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(<zarr.core.Array (10,) int64>,\n",
" NvCompBatchCodec(algorithm='lz4', options={}),\n",
" <kvikio.zarr.GDSStore at 0x7fd42021ac20>)"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Let's create a new Zarr array using KvikIO's GDS store and LZ4 compression\n",
"z = zarr.array(\n",
" cupy.arange(10), \n",
" chunks=2, \n",
" store=kvikio.zarr.GDSStore(\"my-zarr-file.zarr\"), \n",
" meta_array=cupy.empty(()),\n",
" compressor=NvCompBatchCodec(\"lz4\"),\n",
" overwrite=True,\n",
")\n",
"z, z.compressor, z.store"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"cupy.ndarray"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# And because we set the `meta_array` argument, reading the Zarr array returns a CuPy array\n",
"type(z[:])"
]
},
{
"cell_type": "markdown",
"id": "549ded39-1053-4f82-a8a7-5a2ee999a4a1",
"metadata": {},
"source": [
"From this point onwards, `z` can be used just like any other Zarr array."
]
},
{
"cell_type": "code",
"execution_count": 26,
"id": "8221742d-f15c-450a-9701-dc8c05326126",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([1, 2, 3, 4, 5, 6, 7, 8])"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"z[1:9]"
]
},
{
"cell_type": "code",
"execution_count": 27,
"id": "f0c451c1-a240-4b26-a5ef-6e70a5bbeb55",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([42, 43, 44, 45, 46, 47, 48, 49, 50, 51])"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"z[:] + 42"
]
},
{
"cell_type": "markdown",
"id": "7797155f-40f4-4c50-b704-2356ca64cba3",
"metadata": {},
"source": [
"### GPU compression / CPU decompression"
]
},
{
"cell_type": "markdown",
"id": "a0029deb-19b9-4dbb-baf0-ce4b199605a5",
"metadata": {},
"source": [
"In order to read GPU-written Zarr file into a NumPy array, we simply open that file **without** setting the `meta_array` argument:"
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "399f23f7-4475-496a-a537-a7163a35c888",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(numpy.ndarray,\n",
" kvikio.nvcomp_codec.NvCompBatchCodec,\n",
" array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]))"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"z = zarr.open_array(kvikio.zarr.GDSStore(\"my-zarr-file.zarr\"))\n",
"type(z[:]), type(z.compressor), z[:]"
]
},
{
"cell_type": "markdown",
"id": "8e9f31d5",
"metadata": {},
"source": [
"And we don't need to use `kvikio.zarr.GDSStore` either:"
]
},
{
"cell_type": "code",
"execution_count": 29,
"id": "4b1f46b2",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(numpy.ndarray,\n",
" kvikio.nvcomp_codec.NvCompBatchCodec,\n",
" array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]))"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"z = zarr.open_array(\"my-zarr-file.zarr\")\n",
"type(z[:]), type(z.compressor), z[:]"
]
},
{
"cell_type": "markdown",
"id": "f10fd704-35f7-46b7-aabe-ea68fb2bf88d",
"metadata": {},
"source": [
"However, the above use `NvCompBatchCodec(\"lz4\")` for decompression. In the following, we will show how to read Zarr file written and compressed using a GPU on the CPU.\n",
"\n",
"Some algorithms, such as LZ4, can be used interchangeably on CPU and GPU but Zarr will always use the compressor used to write the Zarr file. We are working with the Zarr team to fix this shortcoming but for now, we will use a workaround where we _patch_ the metadata manually."
]
},
{
"cell_type": "code",
"execution_count": 30,
"id": "d980361a-e132-4f29-ab13-cbceec5bbbb5",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(numpy.ndarray, numcodecs.lz4.LZ4, array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]))"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Read the Zarr metadata and replace the compressor with a CPU implementation of LZ4\n",
"store = zarr.DirectoryStore(\"my-zarr-file.zarr\") # We could also have used kvikio.zarr.GDSStore\n",
"meta = json.loads(store[\".zarray\"])\n",
"meta[\"compressor\"] = LZ4().get_config()\n",
"store[\".zarray\"] = json.dumps(meta).encode() # NB: this changes the Zarr metadata on disk\n",
"\n",
"# And then open the file as usually\n",
"z = zarr.open_array(store)\n",
"type(z[:]), type(z.compressor), z[:]"
]
},
{
"cell_type": "markdown",
"id": "8ea73705",
"metadata": {},
"source": [
"### CPU compression / GPU decompression\n",
"\n",
"Now, let's try the otherway around."
]
},
{
"cell_type": "code",
"execution_count": 31,
"id": "c9b2d56a",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(<zarr.core.Array (10,) int64>,\n",
" LZ4(acceleration=1),\n",
" <zarr.storage.DirectoryStore at 0x7fd351e7a9b0>)"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import numcodecs\n",
"# Let's create a new Zarr array using the default compression.\n",
"z = zarr.array(\n",
" numpy.arange(10), \n",
" chunks=2, \n",
" store=\"my-zarr-file.zarr\", \n",
" overwrite=True,\n",
" # The default (CPU) implementation of LZ4 codec.\n",
" compressor=numcodecs.registry.get_codec({\"id\": \"lz4\"})\n",
")\n",
"z, z.compressor, z.store"
]
},
{
"cell_type": "markdown",
"id": "dedd4623",
"metadata": {},
"source": [
"Again, we will use a workaround where we _patch_ the metadata manually."
]
},
{
"cell_type": "code",
"execution_count": 32,
"id": "ac3f30b1",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(cupy.ndarray,\n",
" kvikio.nvcomp_codec.NvCompBatchCodec,\n",
" array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]))"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Read the Zarr metadata and replace the compressor with a GPU implementation of LZ4\n",
"store = kvikio.zarr.GDSStore(\"my-zarr-file.zarr\") # We could also have used zarr.DirectoryStore\n",
"meta = json.loads(store[\".zarray\"])\n",
"meta[\"compressor\"] = NvCompBatchCodec(\"lz4\").get_config()\n",
"store[\".zarray\"] = json.dumps(meta).encode() # NB: this changes the Zarr metadata on disk\n",
"\n",
"# And then open the file as usually\n",
"z = zarr.open_array(store, meta_array=cupy.empty(()))\n",
"type(z[:]), type(z.compressor), z[:]"
]
},
{
"cell_type": "code",
"execution_count": 33,
"id": "80682922-b7b0-4b08-b595-228c2b446a78",
"metadata": {},
"outputs": [],
"source": [
"# Clean up\n",
"shutil.rmtree(\"my-zarr-file.zarr\", ignore_errors=True)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.11"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

0 comments on commit 622a525

Please sign in to comment.