Update CAGRA serialization #1755

Merged: 4 commits, Aug 21, 2023

@benfred benfred commented Aug 18, 2023

This changes the serialization format of saved CAGRA indices in two ways:

  • The dtype is now written in the first 4 bytes of the serialized file, to match the IVF methods and to make it easier to deduce the dtype from Python ([FEA] Improve CAGRA serialization #1729).
  • Writing out the dataset with the index is now optional. Since many use cases already have the dataset written out separately, this lets us save disk space by not writing an extra copy of the input dataset. If include_dataset=false is given, you have to call index.update_dataset to set the dataset yourself after loading.
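
To illustrate the first point, here is a minimal sketch of how a Python caller might peek at the dtype before deciding which typed loader to use. It rests on assumptions that the description above does not confirm: that the 4-byte header holds a NumPy-style dtype string (such as `<f4`) padded with NUL bytes, and that float32, int8 and uint8 are the relevant types. `peek_dtype` and `KNOWN_DTYPES` are hypothetical names, not part of pylibraft.

```python
import io

# Hypothetical mapping from NumPy-style dtype strings to readable names.
# The tag encoding is an assumption; the PR description only states that
# the dtype occupies the first 4 bytes of the serialized file.
KNOWN_DTYPES = {"<f4": "float32", "|i1": "int8", "|u1": "uint8"}


def peek_dtype(stream):
    """Read the 4-byte dtype header from a binary stream and return a name."""
    header = stream.read(4)
    if len(header) < 4:
        raise ValueError("file too short to contain a dtype header")
    # Drop the assumed NUL padding before decoding the tag.
    tag = header.rstrip(b"\x00").decode("ascii")
    if tag not in KNOWN_DTYPES:
        raise ValueError(f"unrecognized dtype tag {tag!r}")
    return KNOWN_DTYPES[tag]


# Stand-in for an on-disk index: header followed by an opaque payload.
fake_index = io.BytesIO(b"<f4\x00" + b"opaque index payload")
dtype_name = peek_dtype(fake_index)
```

With a real file, the same function could be called on `open(filename, "rb")`, after which the caller would dispatch to the loader matching `dtype_name`.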

@benfred benfred requested review from a team as code owners August 18, 2023 23:02
@benfred benfred added improvement Improvement / enhancement to an existing function breaking Breaking change labels Aug 18, 2023
@github-actions github-actions bot added cpp python and removed improvement Improvement / enhancement to an existing function breaking Breaking change labels Aug 18, 2023
@benfred benfred added improvement Improvement / enhancement to an existing function breaking Breaking change labels Aug 18, 2023
@benfred benfred linked an issue Aug 18, 2023 that may be closed by this pull request

@cjnolet cjnolet left a comment

Changes look great overall. A few minor things.

@@ -706,6 +781,8 @@ def save(filename, Index index, handle=None):
Name of the file.
index : Index
Trained CAGRA index.
include_dataset : bool
Whether or not to write out the dataset along with the index

It might be useful to mention the implication here to make it more obvious for the uninformed, e.g. a warning that the dataset can get quite large, so it's advisable to set this to false to shrink the serialized index.
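
As a rough illustration of the size concern, the space taken by the raw dataset copy can be estimated from its shape and element size. This is a back-of-the-envelope sketch: `dataset_bytes` is a hypothetical helper, the example shape is made up, and any extra space for headers or padding in the real serialized file is ignored.

```python
def dataset_bytes(n_rows, dim, itemsize):
    """Bytes needed to store a dense row-major dataset of the given shape."""
    return n_rows * dim * itemsize


# Example: one million 128-dimensional float32 vectors (4 bytes per element).
size = dataset_bytes(1_000_000, 128, 4)  # 512,000,000 bytes
print(f"dataset copy: {size / 1e6:.0f} MB")
```

The estimate grows linearly with the row count, so a dataset a hundred times larger would add roughly a hundred times as much to the serialized index when the dataset is included.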

@@ -258,6 +258,13 @@ struct index : ann::index {
dataset.data_handle(), dataset.extent(0), dataset.extent(1), dataset.extent(1));
}
}
void update_dataset(raft::resources const& res,

We probably want to keep these const mdspans. If this is because of python, can we use make_const_mdspan() in that layer?

@benfred (author) replied:

I agree that automatically discarding const would be bad, but this does the opposite: it automatically adds const (converting a non-const mdspan to a const mdspan), which I feel is something our APIs should allow.

The issue I have is that Cython kinda sucks at respecting const qualifiers, which is why all our Cython APIs use non-const mdspans right now. If I try to add a get_const_hmv_float (to parallel the non-const get_hmv_float we have now), I get an error message from Cython, where it doesn't recognize const float as a type inside template parameters:

      Error compiling Cython file:
      ------------------------------------------------------------
      ...
          if cai.dtype != np.float32:
              raise TypeError("dtype %s not supported" % cai.dtype)
          if check_shape and len(cai.shape) != 2:
              raise ValueError("Expected a 2D array, got %d D" % len(cai.shape))
          shape = (cai.shape[0], cai.shape[1] if len(cai.shape) == 2 else 1)
          return make_host_matrix_view[const float, int64_t, row_major](
                                             ^
      ------------------------------------------------------------
      
      /home/ben/code/raft/python/pylibraft/pylibraft/common/mdspan.pyx:232:39: Expected ']', found 'float'

I can get around this by adding a Cython typedef (like ctypedef const float const_float), but that introduces the need for other hacks later on: Cython treats const_float and const float as separate types, so when we define update_dataset for Cython in c_cagra.pxd I can't just use const T as the type, and have to introduce a new template param. I've done this in the last commit. Let me know what you think.


if dataset_ai.from_cai:
self.index[0].update_dataset(deref(handle_),
get_dmv_float(dataset_ai,

Here is where we could use make_const_mdspan. It would simplify things so that we don't need to make non-const functions everywhere (which kind of circumvents the const functions).


if dataset_ai.from_cai:
self.index[0].update_dataset(deref(handle_),
get_dmv_int8(dataset_ai,

make_const_mdspan here too.

_check_input_array(dataset_ai, [np.dtype("ubyte")])

if dataset_ai.from_cai:
self.index[0].update_dataset(deref(handle_),

make_const_mdspan


@cjnolet cjnolet left a comment

LGTM!


cjnolet commented Aug 21, 2023

/merge

@rapids-bot rapids-bot bot merged commit ea9d395 into rapidsai:branch-23.10 Aug 21, 2023
Labels
breaking Breaking change cpp improvement Improvement / enhancement to an existing function python
Projects
Development

Successfully merging this pull request may close these issues.

[FEA] Improve CAGRA serialization
2 participants