
Commit 81c436a

numb3r3, winstonww, and hanxiao authored
docs(array): add docs for find function (#197)
Co-authored-by: winstonww <[email protected]>
Co-authored-by: Han Xiao <[email protected]>
1 parent 6427f42 commit 81c436a

8 files changed: 226 additions and 15 deletions


docarray/array/queryset/lookup.py

Lines changed: 15 additions & 6 deletions
@@ -107,13 +107,22 @@ def lookup(key, val, doc: 'Document') -> bool:
     elif last == 'size':
         return iff_not_none(value, lambda y: len(y) == val)
     elif last == 'exists':
-        if value is None:
-            return True != val
-        elif isinstance(value, (str, bytes)):
-            return (value == '' or value == b'') != val
+        if not isinstance(val, bool):
+            raise ValueError(
+                '$exists operator can only accept True/False as value for comparison'
+            )
+
+        if '__' in get_key:
+            is_empty = False
+            try:
+                is_empty = not value
+            except:
+                # ndarray-like will end up here
+                pass
+
+            return is_empty != val
         else:
-            return True == val
-        # return (value is None or value == '' or value == b'') != val
+            return (get_key in doc.non_empty_fields) == val
     else:
         # return value == val
         raise ValueError(
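
Read on its own, the new `$exists` branch distinguishes two cases: nested keys (those containing `__`) are judged by the truthiness of the resolved value, while top-level keys are checked against `doc.non_empty_fields`. The following is a minimal, self-contained sketch of that logic, not the library code itself. `exists_lookup` is a hypothetical stand-in for the relevant part of `lookup`, the already-resolved `value` is passed in directly, and `non_empty_fields` is a plain tuple standing in for the Document property.

```python
def exists_lookup(get_key, value, val, non_empty_fields=()):
    """Hypothetical stand-in mirroring the `$exists` branch of the diff above."""
    if not isinstance(val, bool):
        raise ValueError(
            '$exists operator can only accept True/False as value for comparison'
        )

    if '__' in get_key:
        # Nested key (e.g. tags__name): decide by the truthiness of the value.
        is_empty = False
        try:
            is_empty = not value
        except Exception:
            # Objects with ambiguous truthiness (ndarray-like) count as present.
            pass
        return is_empty != val

    # Top-level key: defer to the collection of non-empty fields.
    return (get_key in non_empty_fields) == val


print(exists_lookup('tags__name', 'test', True))  # True
print(exists_lookup('tags__bar', '', True))       # False
print(exists_lookup('tags__z', 0, True))          # False
print(exists_lookup('text', None, False, non_empty_fields=('id', 'tags')))  # True
```

Because the nested case relies on truthiness, falsy values such as `0` are treated as non-existent. That is presumably why the test fixture in `tests/unit/array/test_lookup.py` changes `'z': 0` to `'z': 1` in this commit.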

docarray/array/queryset/parser.py

Lines changed: 8 additions & 4 deletions
@@ -51,11 +51,15 @@ def _parse_lookups(data: Dict = {}, root_node: Optional[LookupNode] = None):
                     f'The operator {key} is not supported yet, please double check the given filters!'
                 )
         else:
-            items = list(value.items())
-            if len(items) == 0:
-                raise ValueError(f'The query is illegal: {data}')
+            if not value or not isinstance(value, dict):
+                raise ValueError(
+                    '''Not a valid query. It should follow the format:
+                    { <field1>: { <operator1>: <value1> }, ... }
+                    '''
+                )

-            elif len(items) == 1:
+            items = list(value.items())
+            if len(items) == 1:
                 op, val = items[0]
                 if op in LOGICAL_OPERATORS:
                     if op == '$not':
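
The effect of the reordered check is easiest to see in isolation. In the removed code, `value.items()` ran before any validation, so a non-dict value such as the string in `{'modality': 'D'}` would presumably have surfaced as an `AttributeError` rather than a descriptive `ValueError`. The sketch below reproduces only the new guard; `validate_field_query` is a hypothetical helper name, not a function in `parser.py`.

```python
def validate_field_query(value):
    """Hypothetical helper mirroring the guard added to `_parse_lookups`."""
    if not value or not isinstance(value, dict):
        raise ValueError(
            'Not a valid query. It should follow the format: '
            '{ <field1>: { <operator1>: <value1> }, ... }'
        )
    return list(value.items())


print(validate_field_query({'$eq': 'D'}))  # [('$eq', 'D')]

# Empty dicts and non-dict values are all rejected with the same error type.
for bad in ({}, 'D', 42, None):
    try:
        validate_field_query(bad)
    except ValueError:
        print(f'rejected: {bad!r}')
```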

docs/fundamentals/document/index.md

Lines changed: 1 addition & 0 deletions
@@ -4,6 +4,7 @@

 A Document object has a predefined data schema, shown below. Each attribute can be set and read with dot notation, as with any Python object.

+(doc-fields)=
 | Attribute | Type | Description |
 |-------------|-----------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------|
 | id | string | A hexdigest that represents a unique document ID |

docs/fundamentals/documentarray/access-elements.md

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 (access-elements)=
-# Access Elements
+# Access Documents

 This is probably my favorite chapter so far. Readers who have come this far may ask: okay, you re-implemented the Python list and coined it DocumentArray, so what's the big deal?

Lines changed: 159 additions & 0 deletions
@@ -0,0 +1,159 @@
+(find-documentarray)=
+# Query by Conditions
+
+We can use {meth}`~docarray.array.mixins.find.FindMixin.find` to select Documents from a DocumentArray based on the conditions specified in a `query` object. One can use `da.find(query)` to filter Documents and get nearest neighbours from `da`:
+
+- To filter Documents, the `query` object is a Python dictionary that defines the filtering conditions using a [MongoDB](https://docs.mongodb.com/manual/reference/operator/query/)-like query language.
+- To find nearest neighbours, the `query` object needs to be an NdArray-like object, a Document, or a DocumentArray that defines embeddings. One can also use the `.match()` function for this purpose; there is a minor interface difference between the two functions, which is described {ref}`in the next chapter<match-documentarray>`.
+
+Let's see some examples in action. First, let's prepare a DocumentArray we will use.
+
+```python
+from jina import Document, DocumentArray
+
+da = DocumentArray([Document(text='journal', weight=25, tags={'h': 14, 'w': 21, 'uom': 'cm'}, modality='A'),
+                    Document(text='notebook', weight=50, tags={'h': 8.5, 'w': 11, 'uom': 'in'}, modality='A'),
+                    Document(text='paper', weight=100, tags={'h': 8.5, 'w': 11, 'uom': 'in'}, modality='D'),
+                    Document(text='planner', weight=75, tags={'h': 22.85, 'w': 30, 'uom': 'cm'}, modality='D'),
+                    Document(text='postcard', weight=45, tags={'h': 10, 'w': 15.25, 'uom': 'cm'}, modality='A')])
+
+da.summary()
+```
+
+```text
+                  Documents Summary
+
+  Length                 5
+  Homogenous Documents   True
+  Common Attributes      ('id', 'text', 'tags', 'weight', 'modality')
+
+                     Attributes Summary
+
+  Attribute   Data type   #Unique values   Has empty value
+ ──────────────────────────────────────────────────────────
+  id          ('str',)    5                False
+  weight      ('int',)    5                False
+  modality    ('str',)    2                False
+  tags        ('dict',)   5                False
+  text        ('str',)    5                False
+```
+
+## Filter with query operators
+
+A query filter document can use the query operators to specify conditions in the following form:
+
+```text
+{ <field1>: { <operator1>: <value1> }, ... }
+```
+
+Here `field1` is {ref}`any field name<doc-fields>` of a Document object. To access nested fields, one can use the dunder expression. For example, `tags__timestamp` accesses the `doc.tags['timestamp']` field.
+
+`value1` can be either a user-given Python object or a substitution field in curly brackets, `{field}`.
+
+Finally, `operator1` can be one of the following:
+
+| Query Operator | Description |
+|----------------|-------------|
+| `$eq` | Equal to (number, string) |
+| `$ne` | Not equal to (number, string) |
+| `$gt` | Greater than (number) |
+| `$gte` | Greater than or equal to (number) |
+| `$lt` | Less than (number) |
+| `$lte` | Less than or equal to (number) |
+| `$in` | Is in an array |
+| `$nin` | Not in an array |
+| `$regex` | Match the specified regular expression |
+| `$size` | Match array/dict fields that have the specified size. `$size` does not accept ranges of values. |
+| `$exists` | Matches documents that have the specified field. An empty string is also treated as non-existent. |
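
To make the `{ <field>: { <operator>: <value> } }` form and the dunder lookup concrete, here is a minimal pure-Python sketch that evaluates such a query against plain dictionaries. It is an illustration under stated assumptions, not DocArray's implementation: `OPERATORS`, `get_field`, and `matches` are hypothetical names, plain dicts stand in for Documents, several fields in one query are assumed to be ANDed, `$regex` is assumed to behave like `re.search`, and `$exists` is left out.

```python
import operator
import re

# Hypothetical operator table for illustration only.
OPERATORS = {
    '$eq': operator.eq,
    '$ne': operator.ne,
    '$gt': operator.gt,
    '$gte': operator.ge,
    '$lt': operator.lt,
    '$lte': operator.le,
    '$in': lambda field, arr: field in arr,
    '$nin': lambda field, arr: field not in arr,
    '$regex': lambda field, pattern: re.search(pattern, field) is not None,
    '$size': lambda field, n: len(field) == n,
}


def get_field(doc, key):
    """Resolve a dunder path such as 'tags__h' against nested dicts."""
    value = doc
    for part in key.split('__'):
        value = value[part]
    return value


def matches(doc, query):
    """True if `doc` satisfies every {field: {operator: value}} pair in `query`."""
    return all(
        OPERATORS[op](get_field(doc, field), expected)
        for field, condition in query.items()
        for op, expected in condition.items()
    )


docs = [
    {'text': 'journal', 'tags': {'h': 14, 'w': 21, 'uom': 'cm'}, 'modality': 'A'},
    {'text': 'paper', 'tags': {'h': 8.5, 'w': 11, 'uom': 'in'}, 'modality': 'D'},
    {'text': 'planner', 'tags': {'h': 22.85, 'w': 30, 'uom': 'cm'}, 'modality': 'D'},
]

print([d['text'] for d in docs if matches(d, {'modality': {'$eq': 'D'}})])
# ['paper', 'planner']
print([d['text'] for d in docs if matches(d, {'tags__h': {'$gt': 10}})])
# ['journal', 'planner']
print([d['text'] for d in docs if matches(d, {'text': {'$regex': '^p'}, 'tags': {'$size': 3}})])
# ['paper', 'planner']
```
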
+
+
+For example, to select all `modality='D'` Documents,
+
+```python
+r = da.find({'modality': {'$eq': 'D'}})
+
+pprint(r.to_dict(exclude_none=True))  # pretty-print; requires `from pprint import pprint`
+```
+
+```text
+[{'id': '92aee5d665d0c4dd34db10d83642aded',
+  'modality': 'D',
+  'tags': {'h': 8.5, 'uom': 'in', 'w': 11.0},
+  'text': 'paper',
+  'weight': 100.0},
+ {'id': '1a9d2139b02bc1c7842ecda94b347889',
+  'modality': 'D',
+  'tags': {'h': 22.85, 'uom': 'cm', 'w': 30.0},
+  'text': 'planner',
+  'weight': 75.0}]
+```
+
+To select all Documents whose `.tags['h']>10`,
+
+```python
+r = da.find({'tags__h': {'$gt': 10}})
+```
+
+```text
+[{'id': '4045a9659875fd1299e482d710753de3',
+  'modality': 'A',
+  'tags': {'h': 14.0, 'uom': 'cm', 'w': 21.0},
+  'text': 'journal',
+  'weight': 25.0},
+ {'id': 'cf7691c445220b94b88ff116911bad24',
+  'modality': 'D',
+  'tags': {'h': 22.85, 'uom': 'cm', 'w': 30.0},
+  'text': 'planner',
+  'weight': 75.0}]
+```
+
+Besides using a predefined value, one can also use a substitution with `{field}`; note the curly brackets. For example,
+
+```python
+r = da.find({'tags__h': {'$gt': '{tags__w}'}})
+```
+
+```text
+[{'id': '44c6a4b18eaa005c6dbe15a28a32ebce',
+  'modality': 'A',
+  'tags': {'h': 14.0, 'uom': 'cm', 'w': 10.0},
+  'text': 'journal',
+  'weight': 25.0}]
+```
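
One way to picture substitution: a string value wrapped in curly brackets names another field of the same Document, and it is resolved per Document before the comparison runs. The sketch below illustrates that reading with plain dictionaries. It is an assumption about the semantics rather than the library's code; `resolve` and `get_field` are hypothetical helpers, and the two sample records are made up for illustration.

```python
import re

# Matches a whole-string placeholder such as '{tags__w}'.
_PLACEHOLDER = re.compile(r'^\{(\w+)\}$')


def get_field(doc, key):
    """Resolve a dunder path such as 'tags__w' against nested dicts."""
    value = doc
    for part in key.split('__'):
        value = value[part]
    return value


def resolve(doc, value):
    """Replace a '{field}' placeholder with that field's value in `doc`."""
    if isinstance(value, str):
        m = _PLACEHOLDER.match(value)
        if m:
            return get_field(doc, m.group(1))
    return value  # ordinary values pass through unchanged


docs = [
    {'text': 'portrait', 'tags': {'h': 14, 'w': 10}},
    {'text': 'landscape', 'tags': {'h': 8.5, 'w': 11}},
]

# Equivalent in spirit to {'tags__h': {'$gt': '{tags__w}'}}
taller_than_wide = [
    d['text'] for d in docs
    if get_field(d, 'tags__h') > resolve(d, '{tags__w}')
]
print(taller_than_wide)  # ['portrait']
```
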
+
+
+
+## Combine multiple conditions
+
+
+You can combine multiple conditions using the following operators:
+
+| Boolean Operator | Description                                        |
+|------------------|----------------------------------------------------|
+| `$and`           | Join query clauses with a logical AND              |
+| `$or`            | Join query clauses with a logical OR               |
+| `$not`           | Inverts the effect of a query expression           |
+
+
+
+```python
+r = da.find({'$or': [{'weight': {'$eq': 45}}, {'modality': {'$eq': 'D'}}]})
+```
+
+```text
+[{'id': '22985b71b6d483c31cbe507ed4d02bd1',
+  'modality': 'D',
+  'tags': {'h': 8.5, 'uom': 'in', 'w': 11.0},
+  'text': 'paper',
+  'weight': 100.0},
+ {'id': 'a071faf19feac5809642e3afcd3a5878',
+  'modality': 'D',
+  'tags': {'h': 22.85, 'uom': 'cm', 'w': 30.0},
+  'text': 'planner',
+  'weight': 75.0},
+ {'id': '411ecc70a71a3f00fc3259bf08c239d1',
+  'modality': 'A',
+  'tags': {'h': 10.0, 'uom': 'cm', 'w': 15.25},
+  'text': 'postcard',
+  'weight': 45.0}]
+```
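
The nesting of `$and` and `$or` can be read as a small recursive evaluation: a logical operator maps to a list of sub-queries, and each sub-query is evaluated the same way. The sketch below shows that idea on plain dictionaries. `evaluate` is a hypothetical function, only `$eq` is handled at the leaves to keep it short, and `$not` is omitted because its exact placement in a query is not shown in this section.

```python
def evaluate(doc, query):
    """Recursively evaluate a query that may contain $and / $or clauses."""
    results = []
    for key, value in query.items():
        if key == '$and':
            results.append(all(evaluate(doc, q) for q in value))
        elif key == '$or':
            results.append(any(evaluate(doc, q) for q in value))
        else:
            # Leaf condition; only $eq is handled in this sketch.
            results.append(doc[key] == value['$eq'])
    return all(results)


docs = [
    {'text': 'journal', 'weight': 25, 'modality': 'A'},
    {'text': 'paper', 'weight': 100, 'modality': 'D'},
    {'text': 'postcard', 'weight': 45, 'modality': 'A'},
]

either = {'$or': [{'weight': {'$eq': 45}}, {'modality': {'$eq': 'D'}}]}
print([d['text'] for d in docs if evaluate(d, either)])
# ['paper', 'postcard']

both = {'$and': [{'modality': {'$eq': 'A'}}, {'weight': {'$eq': 45}}]}
print([d['text'] for d in docs if evaluate(d, both)])
# ['postcard']
```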

docs/fundamentals/documentarray/index.md

Lines changed: 2 additions & 1 deletion
@@ -36,9 +36,10 @@ serialization
 access-elements
 access-attributes
 embedding
+find
 matching
 evaluation
 parallelization
 visualization
 post-external
-```
+```

docs/fundamentals/documentarray/matching.md

Lines changed: 39 additions & 2 deletions
@@ -3,10 +3,31 @@

 ```{important}

-{meth}`~docarray.array.mixins.match.MatchMixin.match` supports both CPU & GPU.
+{meth}`~docarray.array.mixins.match.MatchMixin.match` and {meth}`~docarray.array.mixins.find.FindMixin.find` support both CPU & GPU.
 ```

-Once `.embeddings` is set, one can use {func}`~docarray.array.mixins.match.MatchMixin.match` function to find the nearest-neighbour Documents from another DocumentArray (or itself) based on their `.embeddings`.
+Once `.embeddings` is set, one can use the {meth}`~docarray.array.mixins.find.FindMixin.find` or {func}`~docarray.array.mixins.match.MatchMixin.match` function to find the nearest-neighbour Documents from another DocumentArray (or itself) based on their `.embeddings` and a distance metric.
+
+
+## Difference between find and match
+
+Though both `.find()` and `.match()` are about finding the nearest neighbours of a given "query" and both accept similar arguments, there are some differences between them:
+
+##### Which side is the query on?
+- `.find()` always requires the query on the right-hand side. Say you have a DocumentArray with one million Documents; to find one query's nearest neighbours, you should write `one_million_docs.find(query)`;
+- `.match()` assumes the query is on the left-hand side. `A.match(B)` semantically means "A matches against B and saves the results to A". So with `.match()` you should write `query.match(one_million_docs)`.
+
+##### What is the type of the query?
+- The query (RHS) in `.find()` can be a plain NdArray-like object, a single Document, or a DocumentArray.
+- The query (LHS) in `.match()` can be either a Document or a DocumentArray.
+
+##### What is the return?
+- `.find()` returns a list of DocumentArrays, each of which corresponds to one element/row in the query.
+- `.match()` does not return anything. Match results are stored in the left-hand side's `.matches`.
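
The three differences above (which side holds the query, what the query may be, and what comes back) can be mimicked with a toy nearest-neighbour search over plain dictionaries. This is only a sketch of the calling conventions: the free functions `find`, `match`, and `nearest` are hypothetical stand-ins and not the DocArray methods, the dicts stand in for Documents, and Euclidean distance is computed with the standard library.

```python
import math


def nearest(index, embedding, limit):
    """Return the `limit` entries of `index` closest to `embedding` (Euclidean)."""
    return sorted(index, key=lambda d: math.dist(d['embedding'], embedding))[:limit]


def find(index, queries, limit=2):
    """find-style: the index comes first, and one result list is returned per query."""
    return [nearest(index, q['embedding'], limit) for q in queries]


def match(queries, index, limit=2):
    """match-style: the queries come first, results are stored in place, nothing is returned."""
    for q in queries:
        q['matches'] = nearest(index, q['embedding'], limit)


index = [
    {'id': 'a', 'embedding': (0.0, 0.0)},
    {'id': 'b', 'embedding': (1.0, 1.0)},
    {'id': 'c', 'embedding': (5.0, 5.0)},
]
queries = [{'id': 'q', 'embedding': (0.9, 0.9)}]

found = find(index, queries)  # index on the left, query on the right
print([[d['id'] for d in hits] for hits in found])  # [['b', 'a']]

print(match(queries, index))  # None: results live on the queries instead
print([d['id'] for d in queries[0]['matches']])  # ['b', 'a']
```
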
+
+In the example below, we will use `.match()` to describe the feature. But keep in mind that `.find()` should always work by simply switching the right-hand and left-hand sides.
+
+## Example

 The following example finds for each element in `da1` the three closest Documents from the elements in `da2` according to Euclidean distance.

@@ -104,6 +125,22 @@ match emb = (0, 0) 1.0

 ````

+The above example, written with `.find()`:
+
+```python
+da2.find(da1, metric='euclidean', limit=3)
+```
+
+or simply:
+
+```python
+da2.find(np.array(
+    [[0, 0, 0, 0, 1],
+     [1, 0, 0, 0, 0],
+     [1, 1, 1, 1, 0],
+     [1, 2, 2, 1, 0]]), metric='euclidean', limit=3)
+```
+
 The following metrics are supported:

 | Metric | Frameworks |

tests/unit/array/test_lookup.py

Lines changed: 1 addition & 1 deletion
@@ -11,7 +11,7 @@ def doc():
         tags={
             'x': 0.1,
             'y': 1.5,
-            'z': 0,
+            'z': 1,
             'name': 'test',
             'bar': '',
             'labels': ['a', 'b', 'test'],
