docs/get-started/what-is.md: 53 additions & 0 deletions
@@ -40,6 +40,59 @@ DocArray is designed to maximize the local experience, with the requirement of c
| Rich functions for data types |✅|❌| ❌ |✅|❌|

There are two other packages that people often compare DocArray to, yet I haven't used them extensively. It would be unfair to put them in the table above, so here is a dedicated section for them.
## To AwkwardArray
[AwkwardArray](https://awkward-array.org/quickstart.html) is a library for manipulating JSON/dict data via NumPy idioms. Instead of working with dynamically typed Python objects, AwkwardArray converts the data into contiguous arrays and operates on them with precompiled routines. Hence, it is highly efficient.

DocArray and AwkwardArray are designed for different purposes. DocArray comes from the context of deep learning engineering, which works on a stream of multi/cross-modal Documents. AwkwardArray comes from particle physics, where high-performance number-crunching is the priority. Both share the idea of having a generic data structure, but are designed differently to maximize productivity in their own domains. This results in different feature sets.

When it comes to speed, AwkwardArray is fast at column access whereas DocArray is fast at row access (streaming):
```python
import awkward as ak
import numpy as np
from docarray import DocumentArray
from toytime import TimeContext

# Build 100,000 Documents, each with a 64-dim embedding and a short text
da = DocumentArray.empty(100_000)
da.embeddings = np.random.random([len(da), 64])

da.texts = [f'hello {j}' for j in range(len(da))]

# Convert the DocumentArray to an Awkward Array via a plain Python list
ak_array = ak.from_iter(da.to_list())

# Row access: iterate over the elements one by one
with TimeContext('iter via DocArray'):
    for d in da:
        pass

with TimeContext('iter via awkward'):
    for r in ak_array:
        pass

# Column access: fetch the text field of all elements at once
with TimeContext('access text via DocArray'):
    da.texts

with TimeContext('access text via awkward'):
    ak_array['text']
```

```text
iter via DocArray ...	0.004s
iter via awkward ...	1.664s
access text via DocArray ...	0.031s
access text via awkward ...	0.000s
```
As the code above shows, you can convert a DocumentArray into an AwkwardArray by passing the output of `.to_list()` to `ak.from_iter()`.
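
One way to read the timings above is as a difference in memory layout. The following is a toy sketch in plain Python, assuming a list of dicts as a stand-in for a row-oriented container and a dict of lists as a stand-in for a column-oriented one. The names `rows`, `columns` and the four helper functions are hypothetical and only illustrate the idea; neither DocArray nor AwkwardArray is implemented this way internally.

```python
# Toy model only: plain Python containers standing in for the two layouts.

# Row-oriented layout: one record object per element.
rows = [{'text': f'hello {j}', 'score': j * 0.5} for j in range(5)]

# Column-oriented layout: one sequence per field, holding the same data.
columns = {
    'text': [r['text'] for r in rows],
    'score': [r['score'] for r in rows],
}


def iter_rows_from_rows(rows):
    """Row access on a row layout: each record already exists, just yield it."""
    yield from rows


def iter_rows_from_columns(columns):
    """Row access on a column layout: each record is rebuilt field by field."""
    n = len(next(iter(columns.values())))
    for i in range(n):
        yield {name: values[i] for name, values in columns.items()}


def get_column_from_rows(rows, name):
    """Column access on a row layout: every record has to be visited."""
    return [r[name] for r in rows]


def get_column_from_columns(columns, name):
    """Column access on a column layout: a single lookup, no traversal."""
    return columns[name]


# Both layouts hold the same data; only the cost of each access pattern differs.
assert list(iter_rows_from_columns(columns)) == list(iter_rows_from_rows(rows))
assert get_column_from_rows(rows, 'text') == get_column_from_columns(columns, 'text')
```

In this sketch, iterating over records is free in the row layout but requires rebuilding every record in the column layout, while fetching one field is a single lookup in the column layout but a full traversal in the row layout. This mirrors the direction of the measurements, not their magnitude.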
## To Zarr
[Zarr](https://zarr.readthedocs.io/en/stable/) is a format for the storage of chunked, compressed, N-dimensional arrays. I have known Zarr for quite a long time; to me, it is the package to use when a `numpy.ndarray` is too big to fit into memory. Zarr provides a comprehensive set of functions for chunking, compressing, and streaming large N-dimensional arrays. From that perspective, Zarr, like `numpy.ndarray`, focuses on numerical representation and computation.

In DocArray, the basic element one works with is a Document, not an `ndarray`. Support for `ndarray` is important, but it is not the full story: in the context of deep learning engineering, an `ndarray` is often an intermediate representation of a Document, used for computing and then thrown away. Therefore, having a consistent data structure that can live *long enough* to cover creating, storing, computing, transferring, returning and rendering is a motivation behind DocArray.
## To Jina Users
Jina 2.0-2.6 *kind of* has its own "DocArray", with a very similar `Document` and `DocumentArray` interface. However, it is an old design and codebase. Since then, many redesigns and improvements have been made to boost its efficiency, usability and portability. DocArray is now an independent package that other frameworks, such as the future Jina 3.x and Finetuner, will rely on.