README.md: 20 additions & 18 deletions
@@ -12,15 +12,15 @@

 <!-- start elevator-pitch -->

-DocArray is a library for nested, unstructured data such as text, image, audio, video, 3D mesh. It allows deeplearning engineers to efficiently process, embed, search, recommend, store, transfer the data with Pythonic API.
+DocArray is a library for nested, unstructured data such as text, image, audio, video, or 3D mesh. It allows deep-learning engineers to efficiently process, embed, search, recommend, store, and transfer the data with a Pythonic API.

 🌌 **All data types**: super-expressive data structure for representing complicated/mixed/nested text, image, video, audio, 3D mesh data.

-🐍 **Pythonic experience**: designed to be as easy as Python list. If you know how to Python, you know how to DocArray. Intuitive idioms and type annotation simplify the code you write.
+🐍 **Pythonic experience**: designed to be as easy as a Python list. If you know how to Python, you know how to DocArray. Intuitive idioms and type annotation simplify the code you write.

-🧑‍🔬 **Data science powerhouse**: greatly accelerate data scientists work on embedding, matching, visualizing, evaluating via Torch/Tensorflow/ONNX/PaddlePaddle on CPU/GPU.
+🧑‍🔬 **Data science powerhouse**: greatly accelerate data scientists' work on embedding, matching, visualizing, evaluating via Torch/TensorFlow/ONNX/PaddlePaddle on CPU/GPU.

-🚡 **Portable**: ready-to-wire at anytime with efficient and compact serialization from/to Protobuf, bytes, base64, JSON, CSV, dataframe.
+🚡 **Portable**: ready-to-wire at anytime with efficient and compact serialization from/to Protobuf, bytes, base64, JSON, CSV, DataFrame.

 <!-- end elevator-pitch -->

@@ -50,7 +50,7 @@ DocArray consists of two simple concepts:

 ### A 10-liners text matching

-We search for top-5 similar sentences of <kbd>she smiled too much</kbd> in "Pride and Prejudice".
+Let's search for top-5 similar sentences of <kbd>she smiled too much</kbd> in "Pride and Prejudice".

-Here the feature embedding is done by simple [feature hashing](https://en.wikipedia.org/wiki/Feature_hashing) and distance metric is [Jaccard distance](https://en.wikipedia.org/wiki/Jaccard_index). You got better embedding? Of course you do! Looking forward to seeing your results.
+Here the feature embedding is done by simple [feature hashing](https://en.wikipedia.org/wiki/Feature_hashing) and distance metric is [Jaccard distance](https://en.wikipedia.org/wiki/Jaccard_index). You have better embeddings? Of course you do! We look forward to seeing your results!
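
To make the two ideas in that sentence concrete, here is a minimal pure-Python sketch of feature hashing followed by Jaccard distance. It is not DocArray's implementation: the function names `feature_hash`, `jaccard_distance` and `top_k`, the 256-bucket vector size, the MD5-based hash and the tokenizer are all assumptions chosen for illustration.

```python
import hashlib
import re
from typing import List, Tuple


def feature_hash(text: str, n_dim: int = 256) -> List[int]:
    """Map a sentence to a fixed-size vector of hashed token counts."""
    vec = [0] * n_dim
    for token in re.findall(r"[a-z0-9']+", text.lower()):
        # hashlib is stable across runs, unlike Python's salted built-in hash()
        digest = hashlib.md5(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") % n_dim
        vec[bucket] += 1
    return vec


def jaccard_distance(a: List[int], b: List[int]) -> float:
    """Return 1 - |A & B| / |A | B| over the sets of non-zero buckets."""
    set_a = {i for i, v in enumerate(a) if v}
    set_b = {i for i, v in enumerate(b) if v}
    union = set_a | set_b
    if not union:
        return 0.0  # two empty inputs are treated as identical
    return 1.0 - len(set_a & set_b) / len(union)


def top_k(query: str, sentences: List[str], k: int = 5) -> List[Tuple[float, str]]:
    """Rank sentences by Jaccard distance to the query, closest first."""
    q = feature_hash(query)
    scored = [(jaccard_distance(q, feature_hash(s)), s) for s in sentences]
    return sorted(scored)[:k]
```

Calling `top_k("she smiled too much", sentences)` on a list of sentences returns up to five `(distance, sentence)` pairs. Because distinct tokens can collide in the same bucket, the distances are approximate; a larger `n_dim` reduces collisions at the cost of a longer vector.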

 ### A complete workflow of visual search

-Let's use DocArray and [Totally Looks Like](https://sites.google.com/view/totally-looks-like-dataset) dataset to build simple meme image search. The dataset contains 6016 image-pairs stored in `/left` and `/right`. Images that shares the same filename are perceptually similar. For example,
+Let's use DocArray and the [Totally Looks Like](https://sites.google.com/view/totally-looks-like-dataset) dataset to build a simple meme image search. The dataset contains 6,016 image-pairs stored in `/left` and `/right`. Images that share the same filename are perceptually similar. For example:

 <table>
 <thead>
@@ -100,11 +100,11 @@ Let's use DocArray and [Totally Looks Like](https://sites.google.com/view/totall
 </tbody>
 </table>

-Our problem is given an image from `/left` and find its most-similar image in `/right` (without looking at the filename of course).
+Our problem is given an image from `/left`, can we find its most-similar image in `/right`? (without looking at the filename of course).

 ### Load images

-First load images and preprocess them with standard computer vision techniques:
+First load images and pre-process them with standard computer vision techniques:

 ```python
 from docarray import DocumentArray
@@ -124,7 +124,7 @@ left_da.plot_image_sprites()

 ### Apply preprocessing

-Let's do some standard computer vision preprocessing:
+Let's do some standard computer vision pre-processing:

 ```python
 from docarray import Document
@@ -133,12 +133,12 @@ def preproc(d: Document):
     return (d.load_uri_to_image_blob() # load
             .set_image_blob_shape((200, 200)) # resize all to 200x200
             .set_image_blob_normalization() # normalize color
-            .set_image_blob_channel_axis(-1, 0)) # switch color axis for the pytorch model later
+            .set_image_blob_channel_axis(-1, 0)) # switch color axis for the PyTorch model later

 left_da.apply(preproc)
 ```

-Did I mention `apply` work in parallel?
+Did I mention `apply` works in parallel?

 ### Embed images

@@ -147,10 +147,10 @@ Now convert images into embeddings using a pretrained ResNet50:
 ```python
 import torchvision
 model = torchvision.models.resnet50(pretrained=True) # load ResNet50
-left_da.embed(model, device='cuda') # embed via GPU to speedup
+left_da.embed(model, device='cuda') # embed via GPU to speed up
 ```

-This step takes ~30 seconds on GPU. Beside PyTorch, you can also use Tensorflow, PaddlePaddle, ONNX models in `.embed(...)`.
+This step takes ~30 seconds on GPU. Beside PyTorch, you can also use TensorFlow, PaddlePaddle, or ONNX models in `.embed(...)`.

 ### Visualize embeddings

@@ -216,7 +216,7 @@ Better see it.
 <a href="https://docarray.jina.ai"><img src="https://github.com/jina-ai/docarray/blob/main/.github/README-img/9nn.png?raw=true" alt="Visualizing top-9 matches using DocArray API" height="250px"></a>
 </p>

-What we did here is reverting the preprocessing steps (i.e. switching axis and normalizing) on the copied matches, so that one can visualize them using image sprites.
+What we did here is revert the preprocessing steps (i.e. switching axis and normalizing) on the copied matches, so that you can visualize them using image sprites.

 ### Quantitative evaluation

@@ -252,21 +252,23 @@ recall@5 0.0573470744680851

 More metrics can be used such as `precision_at_k`, `ndcg_at_k`, `hit_at_k`.
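
For readers unfamiliar with these metrics, the sketch below shows one common way to define three of them over a ranked list of match filenames. The function names mirror the metric names above, but the bodies are illustrative assumptions rather than DocArray's own code, and conventions vary between libraries (for example, whether precision divides by `k` or by the number of results returned).

```python
from typing import Sequence, Set


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant (divides by k)."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for item in ranked[:k] if item in relevant) / k


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant items that appear in the top-k results."""
    if not relevant:
        return 0.0
    return sum(1 for item in ranked[:k] if item in relevant) / len(relevant)


def hit_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """1.0 if at least one relevant item is in the top-k results, else 0.0."""
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0


# Hypothetical query: the true match "a.jpg" is ranked second.
ranked = ["b.jpg", "a.jpg", "c.jpg"]
relevant = {"a.jpg"}
print(precision_at_k(ranked, relevant, 2))  # 0.5
print(recall_at_k(ranked, relevant, 2))     # 1.0
print(hit_at_k(ranked, relevant, 1))        # 0.0
```

When every query has exactly one relevant item, as with one same-named image per query in this dataset, recall@k and hit@k coincide under these definitions.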

-If you think a pretrained ResNet50 is good enough, let me tell you with [Finetuner](https://github.com/jina-ai/finetuner) one could do much better in just 10 extra lines of code. [Here is how](https://finetuner.jina.ai/get-started/totally-looks-like/).
+If you think a pretrained ResNet50 is good enough, let me tell you with [Finetuner](https://github.com/jina-ai/finetuner) you could do much better in just 10 extra lines of code. [Here is how](https://finetuner.jina.ai/get-started/totally-looks-like/).


 ### Save results

-You can save a DocumentArray to binary, JSON, dict, dataframe, CSV or Protobuf message with/without compression. In its simplest form,
+You can save a DocumentArray to binary, JSON, dict, DataFrame, CSV or Protobuf message with/without compression. In its simplest form,

 ```python
 left_da.save('left_da.bin')
 ```

 To reuse it, do `left_da = DocumentArray.load('left_da.bin')`.

+
 If you want to transfer a DocumentArray from one machine to another or share it with your colleagues, you can do:

+
 ```python
 left_da.push(token='my_shared_da')
 ```
@@ -295,6 +297,6 @@ Intrigued? That's only scratching the surface of what DocArray is capable of. [R

 ## Join Us

-DocArray is backed by [Jina AI](https://jina.ai) and licensed under [Apache-2.0](./LICENSE). [We are actively hiring](https://jobs.jina.ai) AI engineers, solution engineers to build the next neural search ecosystem in opensource.
+DocArray is backed by [Jina AI](https://jina.ai) and licensed under [Apache-2.0](./LICENSE). [We are actively hiring](https://jobs.jina.ai) AI engineers, solution engineers to build the next neural search ecosystem in open-source.