
Commit 995dbd3

docs(readme): polish (#16)
1 parent 0a88beb commit 995dbd3

1 file changed: +20 additions, −18 deletions

README.md

Lines changed: 20 additions & 18 deletions
@@ -12,15 +12,15 @@

<!-- start elevator-pitch -->

- DocArray is a library for nested, unstructured data such as text, image, audio, video, 3D mesh. It allows deep learning engineers to efficiently process, embed, search, recommend, store, transfer the data with Pythonic API.
+ DocArray is a library for nested, unstructured data such as text, image, audio, video, or 3D mesh. It allows deep-learning engineers to efficiently process, embed, search, recommend, store, and transfer the data with a Pythonic API.

🌌 **All data types**: super-expressive data structure for representing complicated/mixed/nested text, image, video, audio, 3D mesh data.

- 🐍 **Pythonic experience**: designed to be as easy as Python list. If you know how to Python, you know how to DocArray. Intuitive idioms and type annotation simplify the code you write.
+ 🐍 **Pythonic experience**: designed to be as easy as a Python list. If you know how to Python, you know how to DocArray. Intuitive idioms and type annotation simplify the code you write.

- 🧑‍🔬 **Data science powerhouse**: greatly accelerate data scientists work on embedding, matching, visualizing, evaluating via Torch/Tensorflow/ONNX/PaddlePaddle on CPU/GPU.
+ 🧑‍🔬 **Data science powerhouse**: greatly accelerate data scientists' work on embedding, matching, visualizing, evaluating via Torch/TensorFlow/ONNX/PaddlePaddle on CPU/GPU.

- 🚡 **Portable**: ready-to-wire at anytime with efficient and compact serialization from/to Protobuf, bytes, base64, JSON, CSV, dataframe.
+ 🚡 **Portable**: ready-to-wire at anytime with efficient and compact serialization from/to Protobuf, bytes, base64, JSON, CSV, DataFrame.

<!-- end elevator-pitch -->

@@ -50,7 +50,7 @@ DocArray consists of two simple concepts:

### A 10-liners text matching

- We search for top-5 similar sentences of <kbd>she smiled too much</kbd> in "Pride and Prejudice".
+ Let's search for top-5 similar sentences of <kbd>she smiled too much</kbd> in "Pride and Prejudice".

```python
from docarray import Document, DocumentArray
@@ -75,11 +75,11 @@ print(q.matches[:5, ('text', 'scores__jaccard__value')])
[0.3333333333333333, 0.6666666666666666, 0.7, 0.7272727272727273, 0.75]]
```

- Here the feature embedding is done by simple [feature hashing](https://en.wikipedia.org/wiki/Feature_hashing) and distance metric is [Jaccard distance](https://en.wikipedia.org/wiki/Jaccard_index). You got better embedding? Of course you do! Looking forward to seeing your results.
+ Here the feature embedding is done by simple [feature hashing](https://en.wikipedia.org/wiki/Feature_hashing) and distance metric is [Jaccard distance](https://en.wikipedia.org/wiki/Jaccard_index). You have better embeddings? Of course you do! We look forward to seeing your results!

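For readers skimming this diff, here is a minimal sketch of the matching described above (an illustration, not part of the commit). It assumes the docarray 0.x APIs `embed_feature_hashing()` and `match()`, and uses a tiny stand-in corpus instead of the novel:

```python
from docarray import Document, DocumentArray

# Stand-in corpus; the README builds it from the sentences of "Pride and Prejudice".
corpus = DocumentArray(
    Document(text=s)
    for s in (
        'she smiled too often',
        'he smiled and walked away',
        'it is a truth universally acknowledged',
    )
)
corpus.apply(lambda d: d.embed_feature_hashing())  # feature-hashing embeddings

q = Document(text='she smiled too much').embed_feature_hashing()
q.match(corpus, metric='jaccard', use_scipy=True)  # Jaccard distance via the scipy backend

print(q.matches[:5, ('text', 'scores__jaccard__value')])
```
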
### A complete workflow of visual search

- Let's use DocArray and [Totally Looks Like](https://sites.google.com/view/totally-looks-like-dataset) dataset to build simple meme image search. The dataset contains 6016 image-pairs stored in `/left` and `/right`. Images that shares the same filename are perceptually similar. For example,
+ Let's use DocArray and the [Totally Looks Like](https://sites.google.com/view/totally-looks-like-dataset) dataset to build a simple meme image search. The dataset contains 6,016 image-pairs stored in `/left` and `/right`. Images that share the same filename are perceptually similar. For example:

<table>
<thead>
@@ -100,11 +100,11 @@ Let's use DocArray and [Totally Looks Like](https://sites.google.com/view/totall
</tbody>
</table>

- Our problem is given an image from `/left` and find its most-similar image in `/right` (without looking at the filename of course).
+ Our problem is given an image from `/left`, can we find its most-similar image in `/right`? (without looking at the filename of course).

### Load images

- First load images and preprocess them with standard computer vision techniques:
+ First load images and pre-process them with standard computer vision techniques:

```python
from docarray import DocumentArray
@@ -124,7 +124,7 @@ left_da.plot_image_sprites()

### Apply preprocessing

- Let's do some standard computer vision preprocessing:
+ Let's do some standard computer vision pre-processing:

```python
from docarray import Document
@@ -133,12 +133,12 @@ def preproc(d: Document):
    return (d.load_uri_to_image_blob() # load
            .set_image_blob_shape((200, 200)) # resize all to 200x200
            .set_image_blob_normalization() # normalize color
-           .set_image_blob_channel_axis(-1, 0)) # switch color axis for the pytorch model later
+           .set_image_blob_channel_axis(-1, 0)) # switch color axis for the PyTorch model later

left_da.apply(preproc)
```

- Did I mention `apply` work in parallel?
+ Did I mention `apply` works in parallel?

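Since the diff only shows fragments of the load and preprocessing code, here is the step assembled as a sketch (the `left/*.jpg` glob is an assumption; everything else mirrors the fragments visible above and in the hunk headers):

```python
from docarray import Document, DocumentArray

left_da = DocumentArray.from_files('left/*.jpg')  # assumed glob over the /left folder
left_da.plot_image_sprites()                      # preview the collection, as in the hunk header above


def preproc(d: Document):
    return (
        d.load_uri_to_image_blob()            # load
        .set_image_blob_shape((200, 200))     # resize all to 200x200
        .set_image_blob_normalization()       # normalize color
        .set_image_blob_channel_axis(-1, 0)   # switch color axis for the PyTorch model later
    )


left_da.apply(preproc)  # applied to every Document; per the README, this runs in parallel
```
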
### Embed images

@@ -147,10 +147,10 @@ Now convert images into embeddings using a pretrained ResNet50:
```python
import torchvision
model = torchvision.models.resnet50(pretrained=True) # load ResNet50
- left_da.embed(model, device='cuda') # embed via GPU to speedup
+ left_da.embed(model, device='cuda') # embed via GPU to speed up
```

- This step takes ~30 seconds on GPU. Beside PyTorch, you can also use Tensorflow, PaddlePaddle, ONNX models in `.embed(...)`.
+ This step takes ~30 seconds on GPU. Beside PyTorch, you can also use TensorFlow, PaddlePaddle, or ONNX models in `.embed(...)`.

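Presumably the right-hand images get the same load, preprocess, and embed treatment before the two sides are matched; a sketch under that assumption (the glob, variable names, and `limit` value are illustrative, not shown in this diff):

```python
right_da = DocumentArray.from_files('right/*.jpg')  # assumed glob over the /right folder
right_da.apply(preproc)
right_da.embed(model, device='cuda')

# For every left image, store its nearest right images in `.matches`.
left_da.match(right_da, limit=9)
```
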
### Visualize embeddings

@@ -216,7 +216,7 @@ Better see it.
<a href="https://docarray.jina.ai"><img src="https://github.com/jina-ai/docarray/blob/main/.github/README-img/9nn.png?raw=true" alt="Visualizing top-9 matches using DocArray API" height="250px"></a>
</p>

- What we did here is reverting the preprocessing steps (i.e. switching axis and normalizing) on the copied matches, so that one can visualize them using image sprites.
+ What we did here is revert the preprocessing steps (i.e. switching axis and normalizing) on the copied matches, so that you can visualize them using image sprites.

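A sketch of the reverting described above. This block is not part of the diff; the helper `set_image_blob_inv_normalization()` and the index `8` are assumptions based on the docarray 0.x API:

```python
(
    DocumentArray(left_da[8].matches, copy=True)        # copy so the originals stay preprocessed
    .apply(
        lambda d: d.set_image_blob_channel_axis(0, -1)  # undo the axis switch
                   .set_image_blob_inv_normalization()  # undo the color normalization (assumed helper)
    )
    .plot_image_sprites()
)
```
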
### Quantitative evaluation

@@ -252,21 +252,23 @@ recall@5 0.0573470744680851

More metrics can be used such as `precision_at_k`, `ndcg_at_k`, `hit_at_k`.

- If you think a pretrained ResNet50 is good enough, let me tell you with [Finetuner](https://github.com/jina-ai/finetuner) one could do much better in just 10 extra lines of code. [Here is how](https://finetuner.jina.ai/get-started/totally-looks-like/).
+ If you think a pretrained ResNet50 is good enough, let me tell you with [Finetuner](https://github.com/jina-ai/finetuner) you could do much better in just 10 extra lines of code. [Here is how](https://finetuner.jina.ai/get-started/totally-looks-like/).

### Save results

- You can save a DocumentArray to binary, JSON, dict, dataframe, CSV or Protobuf message with/without compression. In its simplest form,
+ You can save a DocumentArray to binary, JSON, dict, DataFrame, CSV or Protobuf message with/without compression. In its simplest form,

```python
left_da.save('left_da.bin')
```

To reuse it, do `left_da = DocumentArray.load('left_da.bin')`.

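The other formats mentioned above follow the same pattern; a sketch, assuming docarray 0.x conversion helpers (the `compress` option is an assumption):

```python
json_str = left_da.to_json()              # JSON string
df = left_da.to_dataframe()               # pandas DataFrame
raw = left_da.to_bytes(compress='gzip')   # compressed binary (compress option assumed)

same_da = DocumentArray.from_bytes(raw, compress='gzip')  # round-trip back
```
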
+
If you want to transfer a DocumentArray from one machine to another or share it with your colleagues, you can do:

+
```python
left_da.push(token='my_shared_da')
```
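
On the receiving machine, the counterpart would presumably be a pull; a sketch assuming the docarray 0.x push/pull pair:

```python
from docarray import DocumentArray

left_da = DocumentArray.pull(token='my_shared_da')
```
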
@@ -295,6 +297,6 @@ Intrigued? That's only scratching the surface of what DocArray is capable of. [R

## Join Us

- DocArray is backed by [Jina AI](https://jina.ai) and licensed under [Apache-2.0](./LICENSE). [We are actively hiring](https://jobs.jina.ai) AI engineers, solution engineers to build the next neural search ecosystem in opensource.
+ DocArray is backed by [Jina AI](https://jina.ai) and licensed under [Apache-2.0](./LICENSE). [We are actively hiring](https://jobs.jina.ai) AI engineers, solution engineers to build the next neural search ecosystem in open-source.

<!-- end support-pitch -->
