Merged
Changes from 1 commit
35 commits
caa00fd
chore: first pr
jupyterjazz Jun 28, 2023
b45e3a6
docs: modify hnsw
jupyterjazz Jul 6, 2023
cad4e60
Merge branch 'main' into docs-self-contained-indices
jupyterjazz Jul 6, 2023
11bda62
docs: rough versions of inmemory and hnsw
jupyterjazz Jul 6, 2023
96319ca
chore: update branch
jupyterjazz Jul 6, 2023
f5825f8
docs: weaviate v1
jupyterjazz Jul 6, 2023
8aaedbe
docs: elastic v1
jupyterjazz Jul 17, 2023
4a3e25c
docs: introduction page
jupyterjazz Jul 17, 2023
db77beb
docs: redis v1
jupyterjazz Jul 17, 2023
82afb99
docs: qdrant v1
jupyterjazz Jul 17, 2023
befc786
docs: validate intro inmemory and hnsw examples
jupyterjazz Jul 17, 2023
9bdb0dc
docs: validate elastic and qdrant examples
jupyterjazz Jul 17, 2023
64f83bf
docs: validate code examples for redis and weaviate
jupyterjazz Jul 18, 2023
759900c
Merge branch 'main' into docs-self-contained-indices
jupyterjazz Jul 19, 2023
60cd4d4
chore: merge recent updates
jupyterjazz Jul 19, 2023
ca25feb
docs: milvus v1
jupyterjazz Jul 19, 2023
7fef5d8
Merge branch 'main' into docs-self-contained-indices
jupyterjazz Jul 24, 2023
fe572da
docs: validate milvus code
jupyterjazz Jul 24, 2023
10bc14b
docs: make redis and milvus visible
jupyterjazz Jul 24, 2023
6199a2a
docs: refine vol1
jupyterjazz Jul 26, 2023
fa8f919
Merge branch 'main' into docs-self-contained-indices
jupyterjazz Jul 26, 2023
c257a4e
docs: refine vol2
jupyterjazz Jul 26, 2023
ccf17e1
chore: pull recent updates
jupyterjazz Jul 26, 2023
f3ca77c
docs: update api reference
jupyterjazz Jul 27, 2023
21e3ad2
Merge branch 'main' into docs-self-contained-indices
jupyterjazz Jul 27, 2023
e6ef9c4
docs: apply suggestions
jupyterjazz Jul 31, 2023
19045ec
docs: separate nested data section
jupyterjazz Jul 31, 2023
5736334
Merge branch 'main' into docs-self-contained-indices
jupyterjazz Jul 31, 2023
41c7307
docs: apply suggestions vol2
jupyterjazz Jul 31, 2023
a32a1e5
fix: nested data imports
jupyterjazz Jul 31, 2023
8a8aa33
Merge branch 'main' into docs-self-contained-indices
jupyterjazz Aug 1, 2023
ef0b7ef
docs: apply johannes suggestions
jupyterjazz Aug 1, 2023
6818688
chore: merge conflicts
jupyterjazz Aug 1, 2023
9268161
docs: apply suggestions
jupyterjazz Aug 1, 2023
b402802
docs: app sgg
jupyterjazz Aug 1, 2023
docs: apply johannes suggestions
Signed-off-by: jupyterjazz <[email protected]>
jupyterjazz committed Aug 1, 2023
commit ef0b7ef869cf332b22f27f18f5da51e837ebedfb
27 changes: 21 additions & 6 deletions docs/user_guide/storing/docindex.md
@@ -58,49 +58,52 @@ This doesn't require a database server - rather, it saves your data locally.
For a deeper understanding, please look into its [documentation](index_in_memory.md).

### Define document schema and create data
The following code snippet defines a document schema using the `BaseDoc` class. Each document consists of a title (a string),
a price (an integer), and an embedding (a 128-dimensional array). It also creates a list of ten documents with dummy titles,
prices ranging from 0 to 9, and randomly generated embeddings.
```python
from docarray import BaseDoc, DocList
from docarray.index import InMemoryExactNNIndex
from docarray.typing import NdArray
import numpy as np


# Define the document schema.
class MyDoc(BaseDoc):
    title: str
    price: int
    embedding: NdArray[128]


# Create documents (using dummy titles and random vectors)
docs = DocList[MyDoc](
    MyDoc(title=f'title #{i}', price=i, embedding=np.random.rand(128))
    for i in range(10)
)
```

### Initialize the Document Index and add data
Here we initialize an `InMemoryExactNNIndex` instance with the document schema defined previously, and add the created documents to this index.
Member comment: can we add 1-2 sentences of explanation to these code snippets? i think they are very self-explanatory, but personally as a user i don't like code snippets without words around them :)

```python
# Initialize a new InMemoryExactNNIndex instance and add the documents to the index.
doc_index = InMemoryExactNNIndex[MyDoc]()
doc_index.index(docs)
```

### Perform a vector similarity search
Now, let's perform a similarity search on the document embeddings using a query vector of ones.
As a result, we'll retrieve the top 10 most similar documents and their corresponding similarity scores.
```python
# Perform a vector search.
query = np.ones(128)
retrieved_docs, scores = doc_index.find(query, search_field='embedding', limit=10)
```

### Filter documents
In this segment, we filter the indexed documents based on their price field, specifically retrieving documents with a price less than 5.
```python
# Perform filtering (price < 5)
query = {'price': {'$lt': 5}}
filtered_docs = doc_index.filter(query, limit=10)
```

### Combine different search methods
The final snippet combines the vector similarity search and filtering operations into a single query.
We first perform a similarity search on the document embeddings and then apply a filter to return only those documents with a price greater than or equal to 2.
```python
# Perform a hybrid search - combining vector search with filtering
query = (
    doc_index.build_query()  # get empty query object
    .find(query=np.ones(128), search_field='embedding')  # add vector similarity search
    .filter(filter_query={'price': {'$gte': 2}})  # add filter for price >= 2
    .build()  # build the query
)
retrieved_docs, scores = doc_index.execute_query(query)
```
Member comment: I think we should here again add a big fat link to all the backend documentation pages and tell people that they can get more detailed information there


## Learn more
The code snippets presented above just scratch the surface of what a Document Index can do.
To learn more and get the most out of `DocArray`, take a look at the detailed guides for the vector database backends you're interested in:

- [Weaviate](https://weaviate.io/) | [Docs](index_weaviate.md)
- [Qdrant](https://qdrant.tech/) | [Docs](index_qdrant.md)
- [Elasticsearch](https://www.elastic.co/elasticsearch/) v7 and v8 | [Docs](index_elastic.md)
- [Redis](https://redis.com/) | [Docs](index_redis.md)
- [Milvus](https://milvus.io/) | [Docs](index_milvus.md)
- [HNSWlib](https://github.com/nmslib/hnswlib) | [Docs](index_hnswlib.md)
- InMemoryExactNNIndex | [Docs](index_in_memory.md)
117 changes: 107 additions & 10 deletions docs/user_guide/storing/index_elastic.md
@@ -35,6 +35,9 @@ but will also work for [ElasticV7DocIndex][docarray.index.backends.elasticv7.ElasticV7DocIndex]


## Basic usage
This snippet demonstrates the basic usage of [ElasticDocIndex][docarray.index.backends.elastic.ElasticDocIndex]. It defines a document schema with a title and an embedding,
creates ten dummy documents with random embeddings, initializes an instance of [ElasticDocIndex][docarray.index.backends.elastic.ElasticDocIndex] to index these documents,
and performs a vector similarity search to retrieve the top 10 most similar documents to a given query vector.

```python
from docarray import BaseDoc, DocList
@@ -186,23 +189,44 @@ db.index(data)

## Index

Now that you have a Document Index, you can add data to it using the [`index()`][docarray.index.abstract.BaseDocIndex.index] method.
The `num_docs()` method returns the total number of documents in the index.

```python
from docarray import DocList

# create some random data
docs = DocList[SimpleDoc]([SimpleDoc(tensor=np.ones(128)) for _ in range(64)])

doc_index.index(docs)

print(f'number of docs in the index: {doc_index.num_docs()}')
```

As you can see, `DocList[SimpleDoc]` and `ElasticDocIndex[SimpleDoc]` both have `SimpleDoc` as a parameter.
This means that they share the same schema, and in general, both the Document Index and the data that you want to store need to have compatible schemas.

!!! question "When are two schemas compatible?"
    The schemas of your Document Index and data need to be compatible with each other.

    Let's say A is the schema of your Document Index and B is the schema of your data.
    There are a few rules that determine if schema A is compatible with schema B.
    If _any_ of the following are true, then A and B are compatible:

    - A and B are the same class
    - A and B have the same field names and field types
    - A and B have the same field names, and, for every field, the type of B is a subclass of the type of A

    In particular, this means that you can easily [index predefined documents](#using-a-predefined-document-as-schema) into a Document Index.



## Vector search

Now that you have indexed your data, you can perform vector similarity search using the [`find()`][docarray.index.abstract.BaseDocIndex.find] method.

You can use the `limit` argument to configure how many documents to return.

@@ -211,14 +235,87 @@
This can lead to poor performance when the search involves many vectors.
[ElasticDocIndex][docarray.index.backends.elastic.ElasticDocIndex] does not have this limitation.

=== "Search by Document"

    ```python
    # create a query document
    query = SimpleDoc(tensor=np.ones(128))

    # find similar documents
    matches, scores = doc_index.find(query, search_field='tensor', limit=5)

    print(f'{matches=}')
    print(f'{scores=}')
    ```

=== "Search by raw vector"

    ```python
    # create a query vector
    query = np.random.rand(128)

    # find similar documents
    matches, scores = doc_index.find(query, search_field='tensor', limit=5)

    print(f'{matches=}')
    print(f'{scores=}')
    ```

To perform a vector search, you need to specify a `search_field`. This is the field that serves as the
basis of comparison between your query and the documents in the Document Index.

In this particular example you only have one vector field (`tensor`), so the choice is trivial.
In general, you could have multiple fields of type `NdArray`, `TorchTensor`, or `TensorFlowTensor`, and you can
choose which one to use for the search.

The [`find()`][docarray.index.abstract.BaseDocIndex.find] method returns a named tuple containing the closest
matching documents and their associated similarity scores.

When searching on the subindex level, you can use the [`find_subindex()`][docarray.index.abstract.BaseDocIndex.find_subindex] method, which returns a named tuple containing the subindex documents, similarity scores and their associated root documents.

How these scores are calculated depends on the backend, and can usually be [configured](#configuration).


### Batched search

You can also search for multiple documents at once, in a batch, using the [find_batched()][docarray.index.abstract.BaseDocIndex.find_batched] method.

=== "Search by Documents"

    ```python
    # create some query documents
    queries = DocList[SimpleDoc](
        SimpleDoc(tensor=np.random.rand(128)) for i in range(3)
    )

    # find similar documents
    matches, scores = doc_index.find_batched(queries, search_field='tensor', limit=5)

    print(f'{matches=}')
    print(f'{scores=}')
    ```

=== "Search by raw vectors"

    ```python
    # create some query vectors
    query = np.random.rand(3, 128)

    # find similar documents
    matches, scores = doc_index.find_batched(query, search_field='tensor', limit=5)

    print(f'{matches=}')
    print(f'{scores=}')
    ```

The [find_batched()][docarray.index.abstract.BaseDocIndex.find_batched] method returns a named tuple containing
a list of `DocList`s, one for each query, containing the closest matching documents and their similarity scores.



## Filter

3 changes: 3 additions & 0 deletions docs/user_guide/storing/index_hnswlib.md
@@ -24,6 +24,9 @@ It stores vectors on disk in [hnswlib](https://github.com/nmslib/hnswlib), and s
- [MilvusDocumentIndex][docarray.index.backends.milvus.MilvusDocumentIndex]

## Basic usage
This snippet demonstrates the basic usage of [HnswDocumentIndex][docarray.index.backends.hnswlib.HnswDocumentIndex]. It defines a document schema with a title and an embedding,
creates ten dummy documents with random embeddings, initializes an instance of [HnswDocumentIndex][docarray.index.backends.hnswlib.HnswDocumentIndex] to index these documents,
and performs a vector similarity search to retrieve the top 10 most similar documents to a given query vector.

```python
from docarray import BaseDoc, DocList
3 changes: 3 additions & 0 deletions docs/user_guide/storing/index_in_memory.md
@@ -21,6 +21,9 @@ utilizes DocArray's [`find()`][docarray.utils.find.find] and [`filter_docs()`][d


## Basic usage
This snippet demonstrates the basic usage of [InMemoryExactNNIndex][docarray.index.backends.in_memory.InMemoryExactNNIndex]. It defines a document schema with a title and an embedding,
creates ten dummy documents with random embeddings, initializes an instance of [InMemoryExactNNIndex][docarray.index.backends.in_memory.InMemoryExactNNIndex] to index these documents,
and performs a vector similarity search to retrieve the top 10 most similar documents to a given query vector.

```python
from docarray import BaseDoc, DocList
39 changes: 39 additions & 0 deletions docs/user_guide/storing/index_milvus.md
@@ -12,6 +12,10 @@ focusing on special features and configurations of Milvus.


## Basic usage
This snippet demonstrates the basic usage of [MilvusDocumentIndex][docarray.index.backends.milvus.MilvusDocumentIndex]. It defines a document schema with a title and an embedding,
creates ten dummy documents with random embeddings, initializes an instance of [MilvusDocumentIndex][docarray.index.backends.milvus.MilvusDocumentIndex] to index these documents,
and performs a vector similarity search to retrieve the top 10 most similar documents to a given query vector.

!!! note "Single search field requirement"
    To use vector search, you must set `is_embedding=True` on exactly one field.
    This is a constraint of Milvus, which permits a single vector per data object.
@@ -215,8 +219,43 @@ When searching on the subindex level, you can use the [`find_subindex()`][docarray.index.abstract.BaseDocIndex.find_subindex] method

How these scores are calculated depends on the backend, and can usually be [configured](#configuration).

### Batched Search

You can also search for multiple documents at once, in a batch, using the [find_batched()][docarray.index.abstract.BaseDocIndex.find_batched] method.

=== "Search by documents"

    ```python
    # create some query documents
    queries = DocList[MyDoc](
        MyDoc(embedding=np.random.rand(128), text=f'query {i}') for i in range(3)
    )

    # find similar documents
    matches, scores = doc_index.find_batched(queries, limit=5)

    print(f'{matches=}')
    print(f'{matches[0].text=}')
    print(f'{scores=}')
    ```

=== "Search by raw vectors"

    ```python
    # create some query vectors
    query = np.random.rand(3, 128)

    # find similar documents
    matches, scores = doc_index.find_batched(query, limit=5)

    print(f'{matches=}')
    print(f'{matches[0].text=}')
    print(f'{scores=}')
    ```

The [find_batched()][docarray.index.abstract.BaseDocIndex.find_batched] method returns a named tuple containing
a list of `DocList`s, one for each query, containing the closest matching documents and their similarity scores.


## Filter

40 changes: 40 additions & 0 deletions docs/user_guide/storing/index_qdrant.md
@@ -12,6 +12,10 @@ based on the [Qdrant](https://qdrant.tech/) vector search engine.


## Basic usage
This snippet demonstrates the basic usage of [QdrantDocumentIndex][docarray.index.backends.qdrant.QdrantDocumentIndex]. It defines a document schema with a title and an embedding,
creates ten dummy documents with random embeddings, initializes an instance of [QdrantDocumentIndex][docarray.index.backends.qdrant.QdrantDocumentIndex] to index these documents,
and performs a vector similarity search to retrieve the top 10 most similar documents to a given query vector.

```python
from docarray import BaseDoc, DocList
from docarray.index import QdrantDocumentIndex
@@ -253,8 +257,44 @@ When searching on the subindex level, you can use the [`find_subindex()`][docarray.index.abstract.BaseDocIndex.find_subindex] method

How these scores are calculated depends on the backend, and can usually be [configured](#configuration).

### Batched Search

You can also search for multiple documents at once, in a batch, using the [find_batched()][docarray.index.abstract.BaseDocIndex.find_batched] method.

=== "Search by documents"

    ```python
    # create some query documents
    queries = DocList[MyDoc](
        MyDoc(embedding=np.random.rand(128), text=f'query {i}') for i in range(3)
    )

    # find similar documents
    matches, scores = doc_index.find_batched(queries, search_field='embedding', limit=5)

    print(f'{matches=}')
    print(f'{matches[0].text=}')
    print(f'{scores=}')
    ```

=== "Search by raw vectors"

    ```python
    # create some query vectors
    query = np.random.rand(3, 128)

    # find similar documents
    matches, scores = doc_index.find_batched(query, search_field='embedding', limit=5)

    print(f'{matches=}')
    print(f'{matches[0].text=}')
    print(f'{scores=}')
    ```

The [find_batched()][docarray.index.abstract.BaseDocIndex.find_batched] method returns a named tuple containing
a list of `DocList`s, one for each query, containing the closest matching documents and their similarity scores.


## Filter

You can filter your documents by using the `filter()` or `filter_batched()` method with a corresponding filter query.