Add ANN bench scripts to generate ground truth #1967

Merged: 9 commits, Nov 7, 2023

@tfeher tfeher commented Nov 7, 2023

Add a command line based tool that can read a dataset and generate ground truth files with exact neighbors using raft brute force knn.
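
For readers unfamiliar with what "exact neighbors" means here, the following is a minimal CPU sketch of brute-force k-nearest-neighbor search in NumPy. It is not the tool's implementation (the PR uses RAFT's GPU brute-force knn); the function name `exact_knn` and its signature are hypothetical and only illustrate the computation for the two metrics most commonly used with RAFT ANN.

```python
import numpy as np


def exact_knn(dataset, queries, k, metric="sqeuclidean"):
    """Hypothetical helper: exact k nearest neighbors by exhaustive search.

    Returns (distances, indices), each of shape (n_queries, k).
    """
    dataset = np.asarray(dataset, dtype=np.float32)
    queries = np.asarray(queries, dtype=np.float32)
    dots = queries @ dataset.T  # (n_queries, n_rows) inner products

    if metric == "sqeuclidean":
        # ||q - x||^2 = ||q||^2 - 2 q.x + ||x||^2
        q_norms = (queries ** 2).sum(axis=1)[:, None]
        x_norms = (dataset ** 2).sum(axis=1)[None, :]
        scores = q_norms - 2.0 * dots + x_norms
    elif metric == "inner_product":
        # A larger inner product means "closer", so negate for ascending sort.
        scores = -dots
    else:
        raise ValueError(f"unsupported metric in this sketch: {metric}")

    # Sort every row fully and keep the k best; fine for small inputs.
    indices = np.argsort(scores, axis=1, kind="stable")[:, :k]
    distances = np.take_along_axis(scores, indices, axis=1)
    if metric == "inner_product":
        distances = -distances  # report the actual inner products
    return distances, indices
```

A full sort per query is the simplest correct choice for a sketch; a GPU implementation would use a top-k selection instead.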

@tfeher tfeher requested a review from a team as a code owner November 7, 2023 12:44
@github-actions github-actions bot added the python label Nov 7, 2023

tfeher commented Nov 7, 2023

usage: generate_groundtruth [-h] [--queries QUERIES] [--output OUTPUT] [--n_queries N_QUERIES] [-N ROWS] [-D COLS] [--dtype DTYPE] [-k K] [--metric METRIC] dataset

Generate true neighbors using exact NN search. The input and output files are in big-ann-benchmark's binary format.

positional arguments:
  dataset               input dataset file name

options:
  -h, --help            show this help message and exit
  --queries QUERIES     Queries file name, or one of 'random-choice' or 'random' (default). 'random-choice': select n_queries vectors from the input dataset. 'random': generate n_queries as uniform random numbers.
  --output OUTPUT       output directory name (default current dir)
  --n_queries N_QUERIES
                        Number of queries to generate (if no query file is given). Default: 10000.
  -N ROWS, --rows ROWS  use only first N rows from dataset, by default the whole dataset is used
  -D COLS, --cols COLS  number of features (dataset columns). Default: read from dataset file.
  --dtype DTYPE         Dataset dtype. When not specified, it is derived from the file extension. Supported types: 'float32', 'float16', 'uint8', 'int8'
  -k K                  Number of neighbors (per query) to calculate
  --metric METRIC       Metric to use while calculating distances. Valid metrics are those that are accepted by pylibraft.neighbors.brute_force.knn. Most commonly used with RAFT ANN are 'sqeuclidean' and 'inner_product'

Example usage
    # With existing query file
    python -m raft-ann-bench.generate_groundtruth /dataset/base.fbin --output=groundtruth_dir --queries=/dataset/query.public.10K.fbin

    # With randomly generated queries
    python -m raft-ann-bench.generate_groundtruth /dataset/base.fbin --output=groundtruth_dir --queries=random --n_queries=10000

    # Using only a subset of the dataset. Define queries by randomly
    # selecting vectors from the (subset of the) dataset.
    python -m raft-ann-bench.generate_groundtruth /dataset/base.fbin --rows=2000000 --cols=128 --output=groundtruth_dir --queries=random-choice --n_queries=10000
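
The two query modes described in the help text ('random-choice' and 'random') can be sketched roughly as follows. This is an illustration only: the function name `make_queries`, the `seed` parameter, sampling without replacement, and the [0, 1) value range are assumptions, not details confirmed by the PR.

```python
import numpy as np


def make_queries(dataset, mode, n_queries, seed=0):
    """Hypothetical helper mirroring the two --queries modes in the help text."""
    rng = np.random.default_rng(seed)
    if mode == "random-choice":
        # Pick n_queries distinct rows of the dataset as query vectors.
        # Assumption: sampling is done without replacement.
        rows = rng.choice(dataset.shape[0], size=n_queries, replace=False)
        return dataset[rows]
    if mode == "random":
        # Draw uniform random vectors with the dataset's dimensionality.
        # Assumption: values lie in [0, 1) and are stored as float32.
        return rng.random((n_queries, dataset.shape[1]), dtype=np.float32)
    raise ValueError(f"unknown query mode: {mode}")
```

With 'random-choice', every query has an exact match in the dataset at distance zero, which is worth keeping in mind when interpreting recall numbers.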
 

@tfeher tfeher added the feature request, non-breaking, and Vector Search labels Nov 7, 2023

@cjnolet cjnolet left a comment

This is going to be very useful to have in the raft-ann-bench code, thanks for adding! Hopefully we will soon also be able to provide a way for users to convert their own datasets into the proper binary format to be loaded by raft-ann-bench.
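
As a rough idea of what such a conversion involves, here is a minimal sketch of writing and reading the big-ann-benchmark style binary layout with NumPy. It assumes the commonly documented layout: two unsigned 32-bit integers (number of rows, number of columns) followed by the row-major data. The helper names `write_bin` and `read_bin` and the extension-to-dtype table are hypothetical and are not taken from this PR.

```python
import os

import numpy as np

# Assumed mapping from file extension to element type.
SUFFIX_TO_DTYPE = {
    ".fbin": np.float32,
    ".f16bin": np.float16,
    ".u8bin": np.uint8,
    ".i8bin": np.int8,
}


def write_bin(path, array):
    """Write a 2D array as: uint32 n_rows, uint32 n_cols, then row-major data."""
    array = np.ascontiguousarray(array)
    with open(path, "wb") as f:
        np.asarray(array.shape, dtype=np.uint32).tofile(f)
        array.tofile(f)


def read_bin(path, dtype=None, rows=None):
    """Read a file written by write_bin, optionally only the first `rows` rows."""
    if dtype is None:
        # Derive the element type from the file extension.
        dtype = SUFFIX_TO_DTYPE[os.path.splitext(path)[1]]
    with open(path, "rb") as f:
        n_rows, n_cols = (int(v) for v in np.fromfile(f, dtype=np.uint32, count=2))
        if rows is not None:
            n_rows = min(n_rows, rows)
        data = np.fromfile(f, dtype=dtype, count=n_rows * n_cols)
    return data.reshape(n_rows, n_cols)
```

Converting an existing in-memory dataset would then amount to a call such as `write_bin("base.fbin", my_array.astype(np.float32))`, where `my_array` is a placeholder for the user's data.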

@tfeher tfeher left a comment

Thanks Corey for the review, I have updated the PR accordingly.

@tfeher tfeher left a comment

Thanks @cjnolet for the second review, fixed the issues.

@cjnolet cjnolet left a comment

LGTM!

cjnolet commented Nov 7, 2023

/merge

@rapids-bot rapids-bot bot merged commit dbfa2ea into rapidsai:branch-23.12 Nov 7, 2023
60 checks passed
benfred pushed a commit to benfred/raft that referenced this pull request Nov 8, 2023
Add a command line based tool that can read a dataset and generate ground truth files with exact neighbors using raft brute force knn.

Authors:
  - Tamas Bela Feher (https://github.com/tfeher)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: rapidsai#1967