Merged
26 commits
d43057b
feat: json and dict for docvec
JohannesMessner May 22, 2023
c45bfca
test: add tests
JohannesMessner May 22, 2023
564d144
test: add docvec to dict test
JohannesMessner May 22, 2023
76f9c8e
feat: to from dataframe for docvec
JohannesMessner May 22, 2023
73a1ac7
test: dataframe docvec tests
JohannesMessner May 22, 2023
f83fb4f
feat: to from csv for docvec
JohannesMessner May 22, 2023
ca8dc12
test: test csv with docvec
JohannesMessner May 22, 2023
2b52b1e
Merge branch 'main' into feat-docvec-io
JohannesMessner Jun 14, 2023
b115637
feat: pickle serialization for docvec
JohannesMessner Jun 14, 2023
bd86985
feat: protbuf array serialization for docvec
JohannesMessner Jun 14, 2023
c280ff2
test: test base64 deser for docvec
JohannesMessner Jun 14, 2023
ad881cf
test: test save and load for docvec
JohannesMessner Jun 14, 2023
4b1b533
feat: docvec json column wise
JohannesMessner Jun 19, 2023
60e651e
Merge branch 'main' into feat-docvec-io
JohannesMessner Jun 19, 2023
f9c97ec
Merge branch 'main' into feat-docvec-io
JohannesMessner Jun 20, 2023
0603fc5
test: add test for docvec json
JohannesMessner Jun 20, 2023
c6ace8e
test: add tensor type arg
JohannesMessner Jun 20, 2023
51719b2
fix: mypy stuff
JohannesMessner Jun 26, 2023
ad5f5bd
fix: raising of error when needed
JohannesMessner Jun 26, 2023
200dbac
fix: more exception raising
JohannesMessner Jun 26, 2023
8d1f446
fix: mypy
JohannesMessner Jun 26, 2023
6815720
refactor: don't expose to/from csv for docvec
JohannesMessner Jun 26, 2023
6b5ddc7
test: adjust tests
JohannesMessner Jun 26, 2023
587c20a
docs: add documentation for docvec io
JohannesMessner Jun 27, 2023
663f17d
Merge branch 'main' into feat-docvec-io
JohannesMessner Jun 27, 2023
7d035fb
Merge branch 'main' into feat-docvec-io
JohannesMessner Jun 28, 2023
docs: add documentation for docvec io
Signed-off-by: Johannes Messner <[email protected]>
JohannesMessner committed Jun 27, 2023
commit 587c20a7d916283c0c08062b33dc01fa7a5309e4
173 changes: 167 additions & 6 deletions docs/user_guide/sending/serialization.md
@@ -54,7 +54,8 @@ assert doc == new_doc # True

## DocList

When sending or storing [`DocList`][docarray.array.doc_list.doc_list.DocList], you need to use serialization. [`DocList`][docarray.array.doc_list.doc_list.DocList] supports multiple ways to serialize the data.
When sending or storing [`DocList`][docarray.array.doc_list.doc_list.DocList], you need to use serialization.
[`DocList`][docarray.array.doc_list.doc_list.DocList] supports multiple ways to serialize the data.

### JSON

@@ -193,7 +194,7 @@ dl_from_bytes = DocList[SimpleDoc].from_bytes(
)
```

## CSV
### CSV

- [`to_csv()`][docarray.array.doc_list.io.IOMixinArray.to_csv] serializes a [`DocList`][docarray.array.doc_list.doc_list.DocList] to a CSV file.
- [`from_csv()`][docarray.array.doc_list.io.IOMixinArray.from_csv] deserializes a [`DocList`][docarray.array.doc_list.doc_list.DocList] from a CSV file.
@@ -215,7 +216,7 @@ dl_from_csv = DocList[SimpleDoc].from_csv('simple-dl.csv')
print(dl_from_csv)
```

## Pandas.Dataframe
### Pandas.Dataframe

- [`from_dataframe()`][docarray.array.doc_list.io.IOMixinArray.from_dataframe] loads a [`DocList`][docarray.array.doc_list.doc_list.DocList] from a [Pandas Dataframe](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html).
- [`to_dataframe()`][docarray.array.doc_list.io.IOMixinArray.to_dataframe] saves a [`DocList`][docarray.array.doc_list.doc_list.DocList] to a [Pandas Dataframe](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html).
@@ -237,10 +238,40 @@ print(dl_from_dataframe)

## DocVec

When sending or storing [`DocVec`][docarray.array.doc_list.doc_list.DocVec], you need to use protobuf serialization.
For sending or storing data, [`DocVec`][docarray.array.doc_vec.doc_vec.DocVec] offers an interface very similar to that of
[`DocList`][docarray.array.doc_list.doc_list.DocList].

!!! note
We plan to add more serialization formats in the future, notably JSON.
### JSON

- [`to_json()`][docarray.array.doc_list.io.IOMixinArray.to_json] serializes a [`DocVec`][docarray.array.doc_vec.doc_vec.DocVec] to JSON. It returns the binary representation of the JSON object.
- [`from_json()`][docarray.array.doc_list.io.IOMixinArray.from_json] deserializes a [`DocVec`][docarray.array.doc_vec.doc_vec.DocVec] from JSON. It can load from either a `str` or `binary` representation of the JSON object.

In contrast to [DocList's JSON format](#json-1), `DocVec.to_json()` outputs column-oriented JSON, with one array of values per field:

```python
from docarray import BaseDoc, DocVec


class SimpleDoc(BaseDoc):
    text: str


dv = DocVec[SimpleDoc]([SimpleDoc(text=f'doc {i}') for i in range(2)])

with open('simple-dv.json', 'wb') as f:
    json_dv = dv.to_json()
    print(json_dv)
    f.write(json_dv)

with open('simple-dv.json', 'r') as f:
    dv_load_from_json = DocVec[SimpleDoc].from_json(f.read())
    print(dv_load_from_json)
```

```output
b'{"tensor_columns":{},"doc_columns":{},"docs_vec_columns":{},"any_columns":{"id":["005a208a0a9a368c16bf77913b710433","31d65f02cb94fc9756c57b0dbaac3a2c"],"text":["doc 0","doc 1"]}}'
<DocVec[SimpleDoc] (length=2)>
```

### Protobuf

@@ -306,5 +337,135 @@ assert isinstance(dv_from_proto_numpy.tensor, NdArray)
!!! note
    Serialization to protobuf is not supported for union types involving `BaseDoc` types.

### Base64

When transferring data over the network, use `Base64` format to serialize the [DocVec][docarray.array.doc_vec.doc_vec.DocVec].
Serializing a [DocVec][docarray.array.doc_vec.doc_vec.DocVec] in Base64 supports both the `pickle` and `protobuf` protocols.
You can also choose different compression methods.

- [`to_base64()`][docarray.array.doc_list.io.IOMixinArray.to_base64] serializes a [DocVec][docarray.array.doc_vec.doc_vec.DocVec] to Base64.
- [`from_base64()`][docarray.array.doc_list.io.IOMixinArray.from_base64] deserializes a [DocVec][docarray.array.doc_vec.doc_vec.DocVec] from Base64.

You can choose between multiple compression methods: `lz4`, `bz2`, `lzma`, `zlib`, and `gzip`.

```python
from docarray import BaseDoc, DocVec


class SimpleDoc(BaseDoc):
    text: str


dv = DocVec[SimpleDoc]([SimpleDoc(text=f'doc {i}') for i in range(2)])

base64_repr_dv = dv.to_base64(compress=None, protocol='pickle')

dv_from_base64 = DocVec[SimpleDoc].from_base64(
    base64_repr_dv, compress=None, protocol='pickle'
)
```
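
As a rough illustration of why Base64 suits network transfer, the following standard-library sketch pickles a payload, compresses it, and encodes the resulting bytes as ASCII-safe text. It is not DocArray's actual wire format: the `payload` dict is a stand-in, and the order of the steps is an assumption made for the example.

```python
import base64
import pickle
import zlib

# Stand-in for the data to send; not a real DocVec.
payload = {'text': ['doc 0', 'doc 1']}

# Serialize -> compress -> encode as ASCII text that is safe to embed in JSON or HTTP.
raw = pickle.dumps(payload)
encoded = base64.b64encode(zlib.compress(raw)).decode('ascii')

# Reverse the steps on the receiving side.
restored = pickle.loads(zlib.decompress(base64.b64decode(encoded)))
print(restored == payload)
```

Only unpickle data from sources you trust, since loading a pickle can execute arbitrary code.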

### Save binary

These methods **serialize and save** your data:

- [`save_binary()`][docarray.array.doc_list.io.IOMixinArray.save_binary] saves a [DocVec][docarray.array.doc_vec.doc_vec.DocVec] to a binary file.
- [`load_binary()`][docarray.array.doc_list.io.IOMixinArray.load_binary] loads a [DocVec][docarray.array.doc_vec.doc_vec.DocVec] from a binary file.

You can choose between multiple compression methods: `lz4`, `bz2`, `lzma`, `zlib`, and `gzip`.

```python
from docarray import BaseDoc, DocVec


class SimpleDoc(BaseDoc):
    text: str


dv = DocVec[SimpleDoc]([SimpleDoc(text=f'doc {i}') for i in range(2)])

# Save and load must use the same file name.
dv.save_binary('simple-dv.pickle', compress=None, protocol='pickle')

dv_from_binary = DocVec[SimpleDoc].load_binary(
    'simple-dv.pickle', compress=None, protocol='pickle'
)
```

In the above snippet, the [DocVec][docarray.array.doc_vec.doc_vec.DocVec] is stored as the file `simple-dv.pickle`.

### Bytes

These methods just serialize your data, without saving it to a file:

- [`to_bytes()`][docarray.array.doc_list.io.IOMixinArray.to_bytes] saves a [DocVec][docarray.array.doc_vec.doc_vec.DocVec] to a byte object.
- [`from_bytes()`][docarray.array.doc_list.io.IOMixinArray.from_bytes] loads a [DocVec][docarray.array.doc_vec.doc_vec.DocVec] from a byte object.

!!! note
    These methods are used under the hood by [`save_binary()`][docarray.array.doc_list.io.IOMixinArray.save_binary] and [`load_binary()`][docarray.array.doc_list.io.IOMixinArray.load_binary] to serialize data before writing it to a binary file and to deserialize it after reading. You can also use them directly to work with byte objects.

Like working with binary files:

- You can use `protocol` to choose between `pickle` and `protobuf`.
- You can use multiple compression methods: `lz4`, `bz2`, `lzma`, `zlib`, and `gzip`.

```python
from docarray import BaseDoc, DocVec


class SimpleDoc(BaseDoc):
    text: str


dv = DocVec[SimpleDoc]([SimpleDoc(text=f'doc {i}') for i in range(2)])

bytes_dv = dv.to_bytes(protocol='pickle', compress=None)

dv_from_bytes = DocVec[SimpleDoc].from_bytes(bytes_dv, compress=None, protocol='pickle')
```
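
To get a feel for the trade-off between compression methods, the sketch below compares the standard-library codecs on a pickled stand-in payload. It does not call DocArray, and `lz4` is left out because it requires a third-party package; the payload and the `codecs` table are assumptions made for the example.

```python
import bz2
import gzip
import lzma
import pickle
import zlib

# Stand-in payload: repetitive text compresses well.
raw = pickle.dumps({'text': [f'doc {i}' for i in range(1000)]})

# Each entry maps a method name to its (compress, decompress) pair.
codecs = {
    'bz2': (bz2.compress, bz2.decompress),
    'lzma': (lzma.compress, lzma.decompress),
    'zlib': (zlib.compress, zlib.decompress),
    'gzip': (gzip.compress, gzip.decompress),
}

sizes = {}
for name, (compress, decompress) in codecs.items():
    blob = compress(raw)
    assert decompress(blob) == raw  # every codec must round-trip losslessly
    sizes[name] = len(blob)

print(f'uncompressed: {len(raw)} bytes')
for name, size in sizes.items():
    print(f'{name}: {size} bytes')
```

Which method is best depends on the data and on whether speed or size matters more, so it is worth measuring on a representative sample.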

### CSV

!!! warning
    [`DocVec`][docarray.array.doc_vec.doc_vec.DocVec] does not support `.to_csv()` or `.from_csv()`.
    This is because CSV is a row-based format while DocVec has a column-based data layout.
    To overcome this, you can convert your [`DocVec`][docarray.array.doc_vec.doc_vec.DocVec]
    to a [`DocList`][docarray.array.doc_list.doc_list.DocList].

```python
from docarray import BaseDoc, DocList, DocVec


class SimpleDoc(BaseDoc):
    text: str


dv = DocVec[SimpleDoc]([SimpleDoc(text=f'doc {i}') for i in range(2)])

dv.to_doc_list().to_csv('simple-dl.csv')
dv_from_csv = DocList[SimpleDoc].from_csv('simple-dl.csv').to_doc_vec()
```

For more details, see the [DocList section on CSV serialization](#csv).
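
The mismatch between the two layouts can be seen in a small standard-library sketch: writing column-based data as CSV means transposing it into rows first, and reading it back means regrouping the rows into columns. The `columns` dict is a stand-in for illustration and is not how DocVec stores its data internally.

```python
import csv
import io

# Column-based layout: one list of values per field.
columns = {'id': ['a', 'b'], 'text': ['doc 0', 'doc 1']}

# CSV is written row by row, so the columns are transposed with zip().
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(columns.keys())
writer.writerows(zip(*columns.values()))
csv_text = buffer.getvalue()

# Reading yields one dict per row; regroup the values by field name.
reader = csv.DictReader(io.StringIO(csv_text))
restored = {name: [] for name in reader.fieldnames}
for row in reader:
    for name, value in row.items():
        restored[name].append(value)

print(restored == columns)
```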

### Pandas.Dataframe

- [`from_dataframe()`][docarray.array.doc_list.io.IOMixinArray.from_dataframe] loads a [DocVec][docarray.array.doc_vec.doc_vec.DocVec] from a [Pandas Dataframe](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html).
- [`to_dataframe()`][docarray.array.doc_list.io.IOMixinArray.to_dataframe] saves a [DocVec][docarray.array.doc_vec.doc_vec.DocVec] to a [Pandas Dataframe](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html).

```python
from docarray import BaseDoc, DocVec


class SimpleDoc(BaseDoc):
    text: str


dv = DocVec[SimpleDoc]([SimpleDoc(text=f'doc {i}') for i in range(2)])

df = dv.to_dataframe()
dv_from_dataframe = DocVec[SimpleDoc].from_dataframe(df)
print(dv_from_dataframe)
```