Commit beeb322

chore(docs): add comparision to awkarray and zarr
1 parent 79ad236 commit beeb322

1 file changed

+53
-0
lines changed

docs/get-started/what-is.md

Lines changed: 53 additions & 0 deletions
@@ -40,6 +40,59 @@ DocArray is designed to maximize the local experience, with the requirement of c
| Rich functions for data types ||||||

There are two other packages that people often compare DocArray to, yet I haven't used them extensively. It would be unfair to put them in the above list, so here is a dedicated section for them.

## To AwkwardArray
[AwkwardArray](https://awkward-array.org/quickstart.html) is a library for manipulating JSON/dict data via NumPy idioms. Instead of working with dynamically typed Python objects, AwkwardArray stores the data contiguously and operates on it with precompiled routines. Hence, it is highly efficient.

DocArray and AwkwardArray are designed for different purposes. DocArray comes from the context of deep learning engineering, which works on a stream of multi/cross-modal Documents. AwkwardArray comes from particle physics, where high-performance number-crunching is the priority. Both share the idea of having a generic data structure, but they are designed differently to maximize productivity in their own domains. This results in different feature sets.

When it comes to speed, AwkwardArray is fast at column access whereas DocArray is fast at row access (streaming):
```python
import awkward as ak
import numpy as np
from docarray import DocumentArray
from toytime import TimeContext

# Build 100,000 empty Documents and give each a 64-dim embedding and a text
da = DocumentArray.empty(100_000)
da.embeddings = np.random.random([len(da), 64])

da.texts = [f'hello {j}' for j in range(len(da))]

# Convert the DocumentArray into an AwkwardArray via a list of dicts
ak_array = ak.from_iter(da.to_list())

# Row access: iterate over every element
with TimeContext('iter via DocArray'):
    for d in da:
        pass

with TimeContext('iter via awkward'):
    for r in ak_array:
        pass

# Column access: read the text field of all elements at once
with TimeContext('access text via DocArray'):
    da.texts

with TimeContext('access text via awkward'):
    ak_array['text']
```
```text
iter via DocArray ...	0.004s
iter via awkward ...	1.664s
access text via DocArray ...	0.031s
access text via awkward ...	0.000s
```
As the example shows, a DocumentArray can be converted into an AwkwardArray by passing the output of `.to_list()` to `ak.from_iter()`.
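The row-versus-column trade-off measured above can be pictured with plain Python containers. The sketch below is only a simplified mental model and does not reflect how either library is implemented internally; the helper names `row_from_columns` and `column_from_rows` are made up for illustration.

```python
# Row-oriented layout: one dict per record (similar in spirit to a list of Documents)
rows = [{'id': j, 'text': f'hello {j}'} for j in range(5)]

# Column-oriented layout: one list per field
columns = {
    'id': [r['id'] for r in rows],
    'text': [r['text'] for r in rows],
}


def row_from_columns(cols, i):
    """Rebuild record i from the columnar layout; this touches every column."""
    return {name: values[i] for name, values in cols.items()}


def column_from_rows(recs, name):
    """Collect one field from the row layout; this touches every record."""
    return [r[name] for r in recs]


# Cheap in each layout: a single direct lookup
third_row = rows[2]
all_texts = columns['text']

# Costlier in each layout: the same data has to be assembled piece by piece
same_row = row_from_columns(columns, 2)
same_texts = column_from_rows(rows, 'text')
```

Each layout makes one access pattern a direct lookup and the other a reconstruction, which is consistent with the direction of the timings shown earlier.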
## To Zarr
[Zarr](https://zarr.readthedocs.io/en/stable/) is a format for the storage of chunked, compressed, N-dimensional arrays. I have known Zarr for quite a long time; to me it is the package to reach for when a `numpy.ndarray` is too big to fit into memory. Zarr provides a comprehensive set of functions that allow one to chunk, compress, and stream large ndarrays. From that perspective, Zarr, like `numpy.ndarray`, focuses on numerical representation and computation.

In DocArray, the basic element one works with is a Document, not an `ndarray`. Support for `ndarray` is important, but it is not the full story: in the context of deep learning engineering, an `ndarray` is often an intermediate representation of a Document that is used for computing and then thrown away. Therefore, having a consistent data structure that can live *long enough* to cover creating, storing, computing, transferring, returning and rendering is a motivation behind DocArray.
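To make the chunk-and-compress idea mentioned for Zarr more concrete, here is a minimal sketch that uses only the Python standard library. It does not use Zarr's API or on-disk format; `ChunkedStore`, `chunk_size` and `read_chunk` are hypothetical names chosen for this illustration.

```python
import zlib
from array import array


class ChunkedStore:
    """Toy chunked, compressed 1-D float array (illustrative only, not Zarr)."""

    def __init__(self, values, chunk_size):
        self.chunk_size = chunk_size
        self.length = len(values)
        # Compress each fixed-size chunk independently of the others
        self.chunks = [
            zlib.compress(array('d', values[i:i + chunk_size]).tobytes())
            for i in range(0, len(values), chunk_size)
        ]

    def read_chunk(self, index):
        """Decompress a single chunk and leave all other chunks untouched."""
        out = array('d')
        out.frombytes(zlib.decompress(self.chunks[index]))
        return out.tolist()

    def __getitem__(self, i):
        # Only the chunk that contains element i is decompressed
        if not 0 <= i < self.length:
            raise IndexError(i)
        chunk_index, offset = divmod(i, self.chunk_size)
        return self.read_chunk(chunk_index)[offset]


store = ChunkedStore([float(i) for i in range(10_000)], chunk_size=1_000)
value = store[4321]  # decompresses chunk 4 only
```

The unit being stored here is a block of numbers, which is the contrast drawn above: such a store says nothing about the text, metadata or rendering of the Document those numbers came from.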
## To Jina Users
Jina 2.0-2.6 *kind of* have their own "DocArray", with very similar `Document` and `DocumentArray` interfaces. However, it is an old design and codebase. Since then, many redesigns and improvements have been made to boost its efficiency, usability and portability. DocArray is now an independent package that other frameworks such as future Jina 3.x and Finetuner will rely on.
