Merged

Changes from 1 commit (35 commits total):
- caa00fd chore: first pr (jupyterjazz, Jun 28, 2023)
- b45e3a6 docs: modify hnsw (jupyterjazz, Jul 6, 2023)
- cad4e60 Merge branch 'main' into docs-self-contained-indices (jupyterjazz, Jul 6, 2023)
- 11bda62 docs: rough versions of inmemory and hnsw (jupyterjazz, Jul 6, 2023)
- 96319ca chore: update branch (jupyterjazz, Jul 6, 2023)
- f5825f8 docs: weaviate v1 (jupyterjazz, Jul 6, 2023)
- 8aaedbe docs: elastic v1 (jupyterjazz, Jul 17, 2023)
- 4a3e25c docs: introduction page (jupyterjazz, Jul 17, 2023)
- db77beb docs: redis v1 (jupyterjazz, Jul 17, 2023)
- 82afb99 docs: qdrant v1 (jupyterjazz, Jul 17, 2023)
- befc786 docs: validate intro inmemory and hnsw examples (jupyterjazz, Jul 17, 2023)
- 9bdb0dc docs: validate elastic and qdrant examples (jupyterjazz, Jul 17, 2023)
- 64f83bf docs: validate code examples for redis and weaviate (jupyterjazz, Jul 18, 2023)
- 759900c Merge branch 'main' into docs-self-contained-indices (jupyterjazz, Jul 19, 2023)
- 60cd4d4 chore: merge recent updates (jupyterjazz, Jul 19, 2023)
- ca25feb docs: milvus v1 (jupyterjazz, Jul 19, 2023)
- 7fef5d8 Merge branch 'main' into docs-self-contained-indices (jupyterjazz, Jul 24, 2023)
- fe572da docs: validate milvus code (jupyterjazz, Jul 24, 2023)
- 10bc14b docs: make redis and milvus visible (jupyterjazz, Jul 24, 2023)
- 6199a2a docs: refine vol1 (jupyterjazz, Jul 26, 2023)
- fa8f919 Merge branch 'main' into docs-self-contained-indices (jupyterjazz, Jul 26, 2023)
- c257a4e docs: refine vol2 (jupyterjazz, Jul 26, 2023)
- ccf17e1 chore: pull recent updates (jupyterjazz, Jul 26, 2023)
- f3ca77c docs: update api reference (jupyterjazz, Jul 27, 2023)
- 21e3ad2 Merge branch 'main' into docs-self-contained-indices (jupyterjazz, Jul 27, 2023)
- e6ef9c4 docs: apply suggestions (jupyterjazz, Jul 31, 2023)
- 19045ec docs: separate nested data section (jupyterjazz, Jul 31, 2023)
- 5736334 Merge branch 'main' into docs-self-contained-indices (jupyterjazz, Jul 31, 2023)
- 41c7307 docs: apply suggestions vol2 (jupyterjazz, Jul 31, 2023)
- a32a1e5 fix: nested data imports (jupyterjazz, Jul 31, 2023)
- 8a8aa33 Merge branch 'main' into docs-self-contained-indices (jupyterjazz, Aug 1, 2023)
- ef0b7ef docs: apply johannes suggestions (jupyterjazz, Aug 1, 2023)
- 6818688 chore: merge conflicts (jupyterjazz, Aug 1, 2023)
- 9268161 docs: apply suggestions (jupyterjazz, Aug 1, 2023)
- b402802 docs: app sgg (jupyterjazz, Aug 1, 2023)
docs: validate intro inmemory and hnsw examples
Signed-off-by: jupyterjazz <[email protected]>
jupyterjazz committed Jul 17, 2023
commit befc786c43f5b61a9cb3edf835b5b7c27eb59306
21 changes: 10 additions & 11 deletions docs/user_guide/storing/docindex.md
@@ -38,12 +38,12 @@ Currently, DocArray supports the following vector databases:
- [Qdrant](https://qdrant.tech/) | [Docs](index_qdrant.md)
- [Elasticsearch](https://www.elastic.co/elasticsearch/) v7 and v8 | [Docs](index_elastic.md)
- [HNSWlib](https://github.com/nmslib/hnswlib) | [Docs](index_hnswlib.md)
-- InMemoryExactNNSearch | [Docs](index_in_memory.md)
+- InMemoryExactNNIndex | [Docs](index_in_memory.md)


## Basic Usage

-For this user guide you will use the [InMemoryExactNNSearch][docarray.index.backends.in_memory.InMemoryExactNNSearch]
+For this user guide you will use the [InMemoryExactNNIndex][docarray.index.backends.in_memory.InMemoryExactNNIndex]
because it doesn't require you to launch a database server. Instead, it will store your data locally.

!!! note "Using a different vector database"
@@ -52,14 +52,13 @@ because it doesn't require you to launch a database server. Instead, it will sto

!!! note "InMemory-specific settings"
The following sections explain the general concept of Document Index by using
-[InMemoryExactNNSearch][docarray.index.backends.in_memory.InMemoryExactNNSearch] as an example.
-For InMemory-specific settings, check out the [InMemoryExactNNSearch][docarray.index.backends.in_memory.InMemoryExactNNSearch] documentation
+`InMemoryExactNNIndex` as an example.
+For InMemory-specific settings, check out the `InMemoryExactNNIndex` documentation
[here](index_in_memory.md).


```python
from docarray import BaseDoc, DocList
-from docarray.index import HnswDocumentIndex
+from docarray.index import InMemoryExactNNIndex
from docarray.typing import NdArray
import numpy as np

@@ -72,13 +71,13 @@ class MyDoc(BaseDoc):
# Create documents (using dummy/random vectors)
docs = DocList[MyDoc](MyDoc(title=f'title #{i}', price=i, embedding=np.random.rand(128)) for i in range(10))

-# Initialize a new HnswDocumentIndex instance and add the documents to the index.
-doc_index = HnswDocumentIndex[MyDoc](workdir='./my_index')
+# Initialize a new InMemoryExactNNIndex instance and add the documents to the index.
+doc_index = InMemoryExactNNIndex[MyDoc]()
doc_index.index(docs)

# Perform a vector search.
query = np.ones(128)
-retrieved_docs = doc_index.find(query, search_field='embedding', limit=10)
+retrieved_docs, scores = doc_index.find(query, search_field='embedding', limit=10)

# Perform filtering (price < 5)
query = {'price': {'$lt': 5}}
@@ -87,9 +86,9 @@ filtered_docs = doc_index.filter(query, limit=10)
# Perform a hybrid search - combining vector search with filtering
query = (
doc_index.build_query() # get empty query object
-.find(np.ones(128), search_field='embedding') # add vector similarity search
+.find(query=np.ones(128), search_field='embedding') # add vector similarity search
.filter(filter_query={'price': {'$gte': 2}}) # add filter search
.build() # build the query
)
-results = doc_index.execute_query(query)
+retrieved_docs, scores = doc_index.execute_query(query)
```
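To make the corrected `find` return signature above concrete, here is a minimal sketch of what an exact nearest-neighbor ("brute force") index does under the hood: score every stored vector against the query and return the top-k hits along with their scores. This is an illustrative toy in plain Python, not the DocArray implementation; the names `cosine_sim` and `exact_nn_find` are assumptions for this sketch.

```python
# Toy exact-NN search: brute-force cosine similarity plus top-k selection.
import math


def cosine_sim(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)


def exact_nn_find(vectors, query, limit=10):
    # Score every stored vector, then keep the `limit` most similar ones.
    scored = [(i, cosine_sim(v, query)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hits = exact_nn_find(vectors, query=[1.0, 0.0], limit=2)
print(hits[0])  # (0, 1.0): the identical vector is the best match
```

This also shows why the real API returns a `(documents, scores)` pair: the ranking and the similarity values are produced together.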
Review comment (Member):
I think we should here again add a big fat link to all the backend documentation pages and tell people that they can get more detailed information there

16 changes: 8 additions & 8 deletions docs/user_guide/storing/index_hnswlib.md
@@ -38,7 +38,7 @@ class MyDoc(BaseDoc):
docs = DocList[MyDoc](MyDoc(title=f'title #{i}', embedding=np.random.rand(128)) for i in range(10))

# Initialize a new HnswDocumentIndex instance and add the documents to the index.
-doc_index = HnswDocumentIndex[MyDoc](workdir='./my_index')
+doc_index = HnswDocumentIndex[MyDoc](work_dir='./my_index')
doc_index.index(docs)

# Perform a vector search.
@@ -326,8 +326,8 @@ query = (
)

# execute the combined query and return the results
-results = db.execute_query(query)
-print(f'{results=}')
+retrieved_docs, scores = db.execute_query(query)
+print(f'{retrieved_docs=}')
```

In the example above you can see how to form a hybrid query that combines vector similarity search and filtered search
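The hybrid pattern described above can be sketched as a metadata pre-filter followed by exact vector search over the surviving candidates. This is a plain-Python illustration of the idea, not the HNSWlib or DocArray implementation; `hybrid_find` and `predicate` are names assumed for this sketch.

```python
# Toy hybrid search: filter on metadata first, then rank by similarity.
import math


def cosine_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)


def hybrid_find(docs, query_vec, predicate, limit=10):
    # 1) filter step: keep only docs whose metadata passes the predicate.
    candidates = [d for d in docs if predicate(d)]
    # 2) vector step: rank the survivors by similarity to the query.
    candidates.sort(key=lambda d: cosine_sim(d["embedding"], query_vec), reverse=True)
    return candidates[:limit]


docs = [{"price": p, "embedding": [p, 1.0]} for p in range(5)]
results = hybrid_find(docs, query_vec=[1.0, 0.0], predicate=lambda d: d["price"] >= 2)
assert all(d["price"] >= 2 for d in results)  # no filtered-out doc can rank
```

Real engines often interleave these two steps for speed, but the observable contract is the same: every returned hit satisfies both the filter and the ranking.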
@@ -534,7 +534,7 @@ class YouTubeVideoDoc(BaseDoc):


# create a Document Index
-doc_index = HnswDocumentIndex[YouTubeVideoDoc](work_dir='/tmp2')
+doc_index = HnswDocumentIndex[YouTubeVideoDoc](work_dir='./tmp2')

# create some data
index_docs = [
@@ -611,7 +611,7 @@ class MyDoc(BaseDoc):


# create a Document Index
-doc_index = HnswDocumentIndex[MyDoc](work_dir='/tmp3')
+doc_index = HnswDocumentIndex[MyDoc](work_dir='./tmp3')

# create some data
index_docs = [
@@ -676,7 +676,7 @@ Now we can instantiate our Index and index some data.

```python
docs = DocList[MyDoc](
-[MyDoc(embedding=np.random.rand(10), text=f'I am the first version of Document {i}') for i in range(100)]
+[MyDoc(embedding=np.random.rand(128), text=f'I am the first version of Document {i}') for i in range(100)]
)
index = HnswDocumentIndex[MyDoc]()
index.index(docs)
@@ -686,7 +686,7 @@ assert index.num_docs() == 100
Now we can find relevant documents

```python
-res = index.find(query=docs[0], search_field='tens', limit=100)
+res = index.find(query=docs[0], search_field='embedding', limit=100)
assert len(res.documents) == 100
for doc in res.documents:
assert 'I am the first version' in doc.text
@@ -705,7 +705,7 @@ assert index.num_docs() == 100
When we retrieve them again we can see that their text attribute has been updated accordingly

```python
-res = index.find(query=docs[0], search_field='tens', limit=100)
+res = index.find(query=docs[0], search_field='embedding', limit=100)
assert len(res.documents) == 100
for doc in res.documents:
assert 'I am the second version' in doc.text
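The reindex-to-update behavior exercised in the assertions above can be illustrated with a toy index: many document indexes implement "update" as an upsert keyed by document id, so indexing a document with an existing id overwrites the stored copy instead of adding a new one. This `TinyIndex` class is a hypothetical sketch, not DocArray code.

```python
# Toy upsert-by-id index: re-indexing a doc with the same id overwrites it.
class TinyIndex:
    def __init__(self):
        self._store = {}  # id -> document dict

    def index(self, docs):
        for doc in docs:
            self._store[doc["id"]] = doc  # same id -> overwrite (upsert)

    def num_docs(self):
        return len(self._store)


idx = TinyIndex()
docs = [{"id": str(i), "text": f"first version of Document {i}"} for i in range(100)]
idx.index(docs)
assert idx.num_docs() == 100

# Mutate the documents and re-index them under the same ids.
for doc in docs:
    doc["text"] = doc["text"].replace("first", "second")
idx.index(docs)

assert idx.num_docs() == 100  # count unchanged: updates, not duplicates
```

This is why the doc count stays at 100 after re-indexing in the example above, while the `text` attribute changes.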
88 changes: 18 additions & 70 deletions docs/user_guide/storing/index_in_memory.md
@@ -39,7 +39,7 @@ doc_index.index(docs)

# Perform a vector search.
query = np.ones(128)
-retrieved_docs = doc_index.find(query, search_field='embedding', limit=10)
+retrieved_docs, scores = doc_index.find(query, search_field='embedding', limit=10)
```

## Initialize
@@ -99,7 +99,7 @@ You can work around this problem by subclassing the predefined Document and addi
embedding: NdArray[128]


-db = InMemoryExactNNIndex[MyDoc](work_dir='test_db')
+db = InMemoryExactNNIndex[MyDoc]()
```

=== "Using Field()"
@@ -114,7 +114,7 @@ You can work around this problem by subclassing the predefined Document and addi
embedding: AnyTensor = Field(dim=128)


-db = InMemoryExactNNIndex[MyDoc](work_dir='test_db3')
+db = InMemoryExactNNIndex[MyDoc]()
```

Once the schema of your Document Index is defined in this way, the data that you are indexing can be either of the
@@ -126,11 +126,11 @@ The [next section](#index) goes into more detail about data indexing, but note t
from docarray import DocList

# data of type TextDoc
-data = DocList[TextDoc](
+data = DocList[MyDoc](
[
-TextDoc(text='hello world', embedding=np.random.rand(128)),
-TextDoc(text='hello world', embedding=np.random.rand(128)),
-TextDoc(text='hello world', embedding=np.random.rand(128)),
+MyDoc(text='hello world', embedding=np.random.rand(128)),
+MyDoc(text='hello world', embedding=np.random.rand(128)),
+MyDoc(text='hello world', embedding=np.random.rand(128)),
]
)

@@ -338,8 +338,8 @@ query = (
)

# execute the combined query and return the results
-results = db.execute_query(query)
-print(f'{results=}')
+retrieved_docs, scores = db.execute_query(query)
+print(f'{retrieved_docs=}')
```

In the example above you can see how to form a hybrid query that combines vector similarity search and filtered search
@@ -403,7 +403,7 @@ If you want to set configurations globally, i.e. for all vector fields in your D

```python
from collections import defaultdict
-from docarray.typing import AbstractTensor
+from docarray.typing.tensor.abstract_tensor import AbstractTensor
new_doc_index = InMemoryExactNNIndex[MyDoc](
default_column_config=defaultdict(
dict,
@@ -461,24 +461,24 @@ from docarray.typing import ImageUrl, VideoUrl, AnyTensor
# define a nested schema
class ImageDoc(BaseDoc):
url: ImageUrl
-tensor: AnyTensor = Field(space='cosine', dim=64)
+tensor: AnyTensor = Field(space='cosine_sim', dim=64)


class VideoDoc(BaseDoc):
url: VideoUrl
-tensor: AnyTensor = Field(space='cosine', dim=128)
+tensor: AnyTensor = Field(space='cosine_sim', dim=128)


class YouTubeVideoDoc(BaseDoc):
title: str
description: str
thumbnail: ImageDoc
video: VideoDoc
-tensor: AnyTensor = Field(space='cosine', dim=256)
+tensor: AnyTensor = Field(space='cosine_sim', dim=256)


# create a Document Index
-doc_index = InMemoryExactNNIndex[YouTubeVideoDoc](work_dir='/tmp2')
+doc_index = InMemoryExactNNIndex[YouTubeVideoDoc]()

# create some data
index_docs = [
@@ -540,18 +540,18 @@ The `MyDoc` contains a `DocList` of `VideoDoc`, which contains a `DocList` of `I
```python
class ImageDoc(BaseDoc):
url: ImageUrl
-tensor_image: AnyTensor = Field(space='cosine', dim=64)
+tensor_image: AnyTensor = Field(space='cosine_sim', dim=64)


class VideoDoc(BaseDoc):
url: VideoUrl
images: DocList[ImageDoc]
-tensor_video: AnyTensor = Field(space='cosine', dim=128)
+tensor_video: AnyTensor = Field(space='cosine_sim', dim=128)


class MyDoc(BaseDoc):
docs: DocList[VideoDoc]
-tensor: AnyTensor = Field(space='cosine', dim=256)
+tensor: AnyTensor = Field(space='cosine_sim', dim=256)


# create a Document Index
@@ -601,56 +601,4 @@ root_docs, sub_docs, scores = doc_index.find_subindex(
root_docs, sub_docs, scores = doc_index.find_subindex(
np.ones(64), subindex='docs__images', search_field='tensor_image', limit=3
)
```
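The subindex search ending above (`find_subindex` on `docs__images`) can be illustrated with a toy version: flatten the nested documents into their own searchable collection while remembering which root each one came from, so a match on a nested field maps back to its root document. This is a hedged plain-Python sketch, not the DocArray implementation; `find_subindex`, `roots`, and the dict layout are assumptions of this sketch.

```python
# Toy subindex: nested images are searched directly, roots recovered by id.
import math


def cosine_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)


def find_subindex(roots, query, limit=3):
    # Flatten: every nested image becomes a (root_id, image) entry.
    flat = [(root["id"], img) for root in roots for img in root["images"]]
    # Rank the flattened entries by similarity of the nested tensor.
    flat.sort(key=lambda pair: cosine_sim(pair[1]["tensor"], query), reverse=True)
    top = flat[:limit]
    root_ids = [rid for rid, _ in top]
    sub_docs = [img for _, img in top]
    return root_ids, sub_docs


roots = [
    {"id": "a", "images": [{"tensor": [1.0, 0.0]}, {"tensor": [0.0, 1.0]}]},
    {"id": "b", "images": [{"tensor": [1.0, 1.0]}]},
]
root_ids, sub_docs = find_subindex(roots, query=[1.0, 0.0], limit=2)
print(root_ids)  # ['a', 'b']
```

The key design point is the stored back-reference: the subindex returns both the matched sub-documents and the ids needed to fetch their roots.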
-
-### Update elements
-In order to update a Document inside the index, you only need to reindex it with the updated attributes.
-
-First lets create a schema for our Index
-```python
-import numpy as np
-from docarray import BaseDoc, DocList
-from docarray.typing import NdArray
-from docarray.index import InMemoryExactNNIndex
-class MyDoc(BaseDoc):
-    text: str
-    embedding: NdArray[128]
-```
-Now we can instantiate our Index and index some data.
-
-```python
-docs = DocList[MyDoc](
-    [MyDoc(embedding=np.random.rand(10), text=f'I am the first version of Document {i}') for i in range(100)]
-)
-index = InMemoryExactNNIndex[MyDoc]()
-index.index(docs)
-assert index.num_docs() == 100
-```
-
-Now we can find relevant documents
-
-```python
-res = index.find(query=docs[0], search_field='tens', limit=100)
-assert len(res.documents) == 100
-for doc in res.documents:
-    assert 'I am the first version' in doc.text
-```
-
-and update all of the text of this documents and reindex them
-
-```python
-for i, doc in enumerate(docs):
-    doc.text = f'I am the second version of Document {i}'
-
-index.index(docs)
-assert index.num_docs() == 100
-```
-
-When we retrieve them again we can see that their text attribute has been updated accordingly
-
-```python
-res = index.find(query=docs[0], search_field='tens', limit=100)
-assert len(res.documents) == 100
-for doc in res.documents:
-    assert 'I am the second version' in doc.text
-```