Merged
35 commits
caa00fd
chore: first pr
jupyterjazz Jun 28, 2023
b45e3a6
docs: modify hnsw
jupyterjazz Jul 6, 2023
cad4e60
Merge branch 'main' into docs-self-contained-indices
jupyterjazz Jul 6, 2023
11bda62
docs: rough versions of inmemory and hnsw
jupyterjazz Jul 6, 2023
96319ca
chore: update branch
jupyterjazz Jul 6, 2023
f5825f8
docs: weaviate v1
jupyterjazz Jul 6, 2023
8aaedbe
docs: elastic v1
jupyterjazz Jul 17, 2023
4a3e25c
docs: introduction page
jupyterjazz Jul 17, 2023
db77beb
docs: redis v1
jupyterjazz Jul 17, 2023
82afb99
docs: qdrant v1
jupyterjazz Jul 17, 2023
befc786
docs: validate intro inmemory and hnsw examples
jupyterjazz Jul 17, 2023
9bdb0dc
docs: validate elastic and qdrant examples
jupyterjazz Jul 17, 2023
64f83bf
docs: validate code examples for redis and weaviate
jupyterjazz Jul 18, 2023
759900c
Merge branch 'main' into docs-self-contained-indices
jupyterjazz Jul 19, 2023
60cd4d4
chore: merge recent updates
jupyterjazz Jul 19, 2023
ca25feb
docs: milvus v1
jupyterjazz Jul 19, 2023
7fef5d8
Merge branch 'main' into docs-self-contained-indices
jupyterjazz Jul 24, 2023
fe572da
docs: validate milvus code
jupyterjazz Jul 24, 2023
10bc14b
docs: make redis and milvus visible
jupyterjazz Jul 24, 2023
6199a2a
docs: refine vol1
jupyterjazz Jul 26, 2023
fa8f919
Merge branch 'main' into docs-self-contained-indices
jupyterjazz Jul 26, 2023
c257a4e
docs: refine vol2
jupyterjazz Jul 26, 2023
ccf17e1
chore: pull recent updates
jupyterjazz Jul 26, 2023
f3ca77c
docs: update api reference
jupyterjazz Jul 27, 2023
21e3ad2
Merge branch 'main' into docs-self-contained-indices
jupyterjazz Jul 27, 2023
e6ef9c4
docs: apply suggestions
jupyterjazz Jul 31, 2023
19045ec
docs: separate nested data section
jupyterjazz Jul 31, 2023
5736334
Merge branch 'main' into docs-self-contained-indices
jupyterjazz Jul 31, 2023
41c7307
docs: apply suggestions vol2
jupyterjazz Jul 31, 2023
a32a1e5
fix: nested data imports
jupyterjazz Jul 31, 2023
8a8aa33
Merge branch 'main' into docs-self-contained-indices
jupyterjazz Aug 1, 2023
ef0b7ef
docs: apply johannes suggestions
jupyterjazz Aug 1, 2023
6818688
chore: merge conflicts
jupyterjazz Aug 1, 2023
9268161
docs: apply suggestions
jupyterjazz Aug 1, 2023
b402802
docs: app sgg
jupyterjazz Aug 1, 2023
docs: apply suggestions vol2
Signed-off-by: jupyterjazz <[email protected]>
jupyterjazz committed Jul 31, 2023
commit 41c73079154aaadaec193ffed8c28976b6ced622
2 changes: 1 addition & 1 deletion docs/user_guide/storing/docindex.md
@@ -1,6 +1,6 @@
# Introduction

-A Document Index lets you store your Documents and search through them using vector similarity.
+A Document Index lets you store your documents and search through them using vector similarity.

This is useful if you want to store a bunch of data, and at a later point retrieve documents that are similar to
some query that you provide.
2 changes: 1 addition & 1 deletion docs/user_guide/storing/first_step.md
@@ -25,7 +25,7 @@ This section covers the following three topics:

## Document Index

-A Document Index lets you store your Documents and search through them using vector similarity.
+A Document Index lets you store your documents and search through them using vector similarity.

This is useful if you want to store a bunch of data, and at a later point retrieve documents that are similar to
a query that you provide.
28 changes: 14 additions & 14 deletions docs/user_guide/storing/index_elastic.md
@@ -34,7 +34,7 @@ The following example is based on [ElasticDocIndex][docarray.index.backends.elas
but will also work for [ElasticV7DocIndex][docarray.index.backends.elasticv7.ElasticV7DocIndex].


-## Basic Usage
+## Basic usage

```python
from docarray import BaseDoc, DocList
@@ -123,16 +123,16 @@ class SimpleDoc(BaseDoc):
doc_index = ElasticDocIndex[SimpleDoc](hosts='http://localhost:9200')
```

-### Using a predefined Document as schema
+### Using a predefined document as schema

-DocArray offers a number of predefined Documents, like [ImageDoc][docarray.documents.ImageDoc] and [TextDoc][docarray.documents.TextDoc].
+DocArray offers a number of predefined documents, like [ImageDoc][docarray.documents.ImageDoc] and [TextDoc][docarray.documents.TextDoc].
If you try to use these directly as a schema for a Document Index, you will get unexpected behavior:
Depending on the backend, an exception will be raised, or no vector index for ANN lookup will be built.

-The reason for this is that predefined Documents don't hold information about the dimensionality of their `.embedding`
+The reason for this is that predefined documents don't hold information about the dimensionality of their `.embedding`
field. But this is crucial information for any vector database to work properly!

-You can work around this problem by subclassing the predefined Document and adding the dimensionality information:
+You can work around this problem by subclassing the predefined document and adding the dimensionality information:

=== "Using type hint"
```python
@@ -197,12 +197,12 @@ doc_index.index(index_docs)
print(f'number of docs in the index: {doc_index.num_docs()}')
```

-## Vector Search
+## Vector search

The `.find()` method is used to find the nearest neighbors of a vector.

You need to specify the `search_field` that is used when performing the vector search.
-This is the field that serves as the basis of comparison between your query and indexed Documents.
+This is the field that serves as the basis of comparison between your query and indexed documents.

You can use the `limit` argument to configure how many documents to return.
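
To make the roles of `search_field` and `limit` concrete, here is a minimal sketch of a brute-force nearest-neighbor lookup in plain Python. It is not how `ElasticDocIndex` works internally (the backend delegates the search to Elasticsearch); `ToyDoc`, `cosine_similarity` and `toy_find` are hypothetical names used only for illustration, and cosine similarity is an assumed metric.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class ToyDoc:
    """Hypothetical stand-in for a document with a text and a vector field."""

    text: str
    embedding: List[float]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def toy_find(
    docs: Sequence[ToyDoc],
    query: Sequence[float],
    search_field: str,
    limit: int = 10,
) -> Tuple[List[ToyDoc], List[float]]:
    """Compare `query` against the `search_field` of every document,
    then return the `limit` best matches together with their scores."""
    scored = [(cosine_similarity(getattr(doc, search_field), query), doc) for doc in docs]
    # Sort on the score only, highest similarity first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    top = scored[:limit]
    return [doc for _, doc in top], [score for score, _ in top]


docs = [
    ToyDoc(text='east', embedding=[1.0, 0.0]),
    ToyDoc(text='north', embedding=[0.0, 1.0]),
    ToyDoc(text='north-east', embedding=[1.0, 1.0]),
]

matches, scores = toy_find(docs, query=[1.0, 0.1], search_field='embedding', limit=2)
print([doc.text for doc in matches])
```

The sketch scans every document, which is exactly what an approximate nearest neighbor index avoids; it only shows which field is compared and how `limit` truncates the result.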

@@ -324,7 +324,7 @@ docs = doc_index.filter(query)
```


-## Text Search
+## Text search

In addition to vector similarity search, the Document Index interface offers methods for text search:
[text_search()][docarray.index.abstract.BaseDocIndex.text_search],
@@ -351,7 +351,7 @@ docs, scores = doc_index.text_search(query, search_field='text')
```


-## Hybrid Search
+## Hybrid search

Document Index supports atomic operations for vector similarity search, text search and filter search.

@@ -389,7 +389,7 @@ You can also manually build a valid ES query and directly pass it to the `execut

## Access documents

-To access a document, you need to specify its `id`. You can also pass a list of `id` to access multiple documents.
+To access a document, you need to specify its `id`. You can also pass a list of `id`s to access multiple documents.

```python
# access a single Doc
@@ -422,8 +422,8 @@ The following configs can be set in `DBConfig`:
| Name | Description | Default |
|-------------------|----------------------------------------------------------------------------------------------------------------------------------------|-------------------------|
| `hosts` | Hostname of the Elasticsearch server | `http://localhost:9200` |
-| `es_config` | Other ES [configuration options](https://www.elastic.co/guide/en/elasticsearch/client/python-api/8.6/config.html) in a Dict and pass to `Elasticsearch` client constructor, e.g. `cloud_id`, `api_key` | None |
-| `index_name` | Elasticsearch index name, the name of Elasticsearch index object | None. Data will be stored in an index named after the Document type used as schema. |
+| `es_config` | Other ES [configuration options](https://www.elastic.co/guide/en/elasticsearch/client/python-api/8.6/config.html) in a Dict and pass to `Elasticsearch` client constructor, e.g. `cloud_id`, `api_key` | `None` |
+| `index_name` | Elasticsearch index name, the name of Elasticsearch index object | `None`. Data will be stored in an index named after the Document type used as schema. |
| `index_settings` | Other [index settings](https://www.elastic.co/guide/en/elasticsearch/reference/8.6/index-modules.html#index-modules-settings) in a Dict for creating the index | dict |
| `index_mappings` | Other [index mappings](https://www.elastic.co/guide/en/elasticsearch/reference/8.6/mapping.html) in a Dict for creating the index | dict |
| `default_column_config` | The default configurations for every column type. | dict |
@@ -473,10 +473,10 @@ print(f'number of docs in the persisted index: {doc_index2.num_docs()}')
```


-## Nested Data and Subindex Search
+## Nested data and subindex search

The examples provided primarily operate on a basic schema where each field corresponds to a straightforward type such as `str` or `NdArray`.
However, it is also feasible to represent and store nested documents in a Document Index, including scenarios where a document
contains a `DocList` of other documents.

-Go to [Nested Data](nested_data.md) section to learn more.
+Go to the [Nested Data](nested_data.md) section to learn more.
42 changes: 21 additions & 21 deletions docs/user_guide/storing/index_hnswlib.md
@@ -23,7 +23,7 @@ It stores vectors on disk in [hnswlib](https://github.com/nmslib/hnswlib), and s
- [RedisDocumentIndex][docarray.index.backends.redis.RedisDocumentIndex]
- [MilvusDocumentIndex][docarray.index.backends.milvus.MilvusDocumentIndex]

-## Basic Usage
+## Basic usage

```python
from docarray import BaseDoc, DocList
@@ -83,16 +83,16 @@ the database will store vectors with 128 dimensions.
for you. This is supported for all Document Index backends. No need to convert your tensors to NumPy arrays manually!


-### Using a predefined Document as schema
+### Using a predefined document as schema

-DocArray offers a number of predefined Documents, like [ImageDoc][docarray.documents.ImageDoc] and [TextDoc][docarray.documents.TextDoc].
+DocArray offers a number of predefined documents, like [ImageDoc][docarray.documents.ImageDoc] and [TextDoc][docarray.documents.TextDoc].
If you try to use these directly as a schema for a Document Index, you will get unexpected behavior:
Depending on the backend, an exception will be raised, or no vector index for ANN lookup will be built.

-The reason for this is that predefined Documents don't hold information about the dimensionality of their `.embedding`
+The reason for this is that predefined documents don't hold information about the dimensionality of their `.embedding`
field. But this is crucial information for any vector database to work properly!

-You can work around this problem by subclassing the predefined Document and adding the dimensionality information:
+You can work around this problem by subclassing the predefined document and adding the dimensionality information:

=== "Using type hint"
```python
@@ -178,10 +178,10 @@ This means that they share the same schema, and in general, both the Document In
- A and B have the same field names and field types
- A and B have the same field names, and, for every field, the type of B is a subclass of the type of A

-In particular, this means that you can easily [index predefined Documents](#using-a-predefined-document-as-schema) into a Document Index.
+In particular, this means that you can easily [index predefined documents](#using-a-predefined-document-as-schema) into a Document Index.
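
The two compatibility rules listed above (same field names with the same types, or same field names where every type in B is a subclass of the corresponding type in A) can be sketched with the standard library alone. This is an illustration of the rule as stated, not DocArray's actual validation code; `schemas_compatible`, `fields_compatible` and the toy classes are hypothetical, and the sketch only handles plain classes as field types.

```python
import inspect
from typing import get_type_hints


def fields_compatible(type_a, type_b) -> bool:
    """True if the types are equal, or if type_b is a subclass of type_a."""
    if type_a == type_b:
        return True
    return inspect.isclass(type_a) and inspect.isclass(type_b) and issubclass(type_b, type_a)


def schemas_compatible(schema_a, schema_b) -> bool:
    """Check schema B against schema A using the two rules above."""
    hints_a = get_type_hints(schema_a)
    hints_b = get_type_hints(schema_b)
    # Both rules require identical field names.
    if set(hints_a) != set(hints_b):
        return False
    return all(fields_compatible(hints_a[name], hints_b[name]) for name in hints_a)


# Toy field types: Vector128 specializes Vector.
class Vector(list):
    pass


class Vector128(Vector):
    pass


class SchemaA:
    text: str
    embedding: Vector


class SchemaB:
    text: str
    embedding: Vector128


class SchemaC:
    title: str
    embedding: Vector


print(schemas_compatible(SchemaA, SchemaB))  # B's field types subclass A's
print(schemas_compatible(SchemaB, SchemaA))  # the rule is not symmetric
print(schemas_compatible(SchemaA, SchemaC))  # field names differ
```

Note that the subclass rule is directional: a schema whose field is the more specific `Vector128` satisfies one that declares `Vector`, but not the other way round.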


-## Vector Search
+## Vector search

Now that you have indexed your data, you can perform vector similarity search using the [`find()`][docarray.index.abstract.BaseDocIndex.find] method.

@@ -194,7 +194,7 @@ to find similar documents within the Document Index:
# create a query Document
query = MyDoc(embedding=np.random.rand(128), text='query')

-# find similar Documents
+# find similar documents
matches, scores = db.find(query, search_field='embedding', limit=5)

print(f'{matches=}')
@@ -208,15 +208,15 @@ to find similar documents within the Document Index:
# create a query vector
query = np.random.rand(128)

-# find similar Documents
+# find similar documents
matches, scores = db.find(query, search_field='embedding', limit=5)

print(f'{matches=}')
print(f'{matches.text=}')
print(f'{scores=}')
```

-To succesfully peform a vector search, you need to specify a `search_field`. This is the field that serves as the
+To perform a vector search, you need to specify a `search_field`. This is the field that serves as the
basis of comparison between your query and the documents in the Document Index.

In this particular example you only have one field (`embedding`) that is a vector, so you can trivially choose that one.
@@ -298,19 +298,19 @@ for doc in cheap_books:



-## Text Search
-
-In addition to vector similarity search, the Document Index interface offers methods for text search:
-[text_search()][docarray.index.abstract.BaseDocIndex.text_search],
-as well as the batched version [text_search_batched()][docarray.index.abstract.BaseDocIndex.text_search_batched].
+## Text search

!!! note
The [HnswDocumentIndex][docarray.index.backends.hnswlib.HnswDocumentIndex] implementation does not support text search.

To see how to perform text search, you can check out other backends that offer support.

+In addition to vector similarity search, the Document Index interface offers methods for text search:
+[text_search()][docarray.index.abstract.BaseDocIndex.text_search],
+as well as the batched version [text_search_batched()][docarray.index.abstract.BaseDocIndex.text_search_batched].


-## Hybrid Search
+## Hybrid search

Document Index supports atomic operations for vector similarity search, text search and filter search.

@@ -349,7 +349,7 @@ The kinds of atomic queries that can be combined in this way depends on the back
Some backends can combine text search and vector search, while others can perform filters and vectors search, etc.


-## Access Documents
+## Access documents

To retrieve a document from a Document Index you don't necessarily need to perform a fancy search.

@@ -371,7 +371,7 @@ docs = db[ids] # get by list of ids
```


-## Delete Documents
+## Delete documents

In the same way you can access Documents by `id`, you can also delete them:

@@ -390,7 +390,7 @@ del db[ids[0]] # del by single id
del db[ids[1:]] # del by list of ids
```

-## Update Documents
+## Update documents
In order to update a Document inside the index, you only need to re-index it with the updated attributes.

First, let's create a schema for our Document Index:
@@ -572,10 +572,10 @@ If the location already contains data from a previous session, it will be access



-## Nested Data and Subindex Search
+## Nested data and subindex search

The examples provided primarily operate on a basic schema where each field corresponds to a straightforward type such as `str` or `NdArray`.
However, it is also feasible to represent and store nested documents in a Document Index, including scenarios where a document
contains a `DocList` of other documents.

-Go to [Nested Data](nested_data.md) section to learn more.
+Go to the [Nested Data](nested_data.md) section to learn more.