Fix ANN bench latency #1940

Merged Nov 3, 2023 (23 commits)

@tfeher tfeher commented Oct 31, 2023

This adds an explicit latency column to the benchmark.

@tfeher tfeher requested a review from a team as a code owner October 31, 2023 01:09
@github-actions github-actions bot added the cpp label Oct 31, 2023
@tfeher tfeher added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Vector Search and removed cpp labels Oct 31, 2023
@tfeher tfeher requested a review from a team as a code owner October 31, 2023 01:11
tfeher commented Oct 31, 2023

The initial version of this PR adds a helper ANN benchmark class, FixLatencyWorkload. It can be configured to spend
10 ms sleeping on the CPU or on the GPU, which is useful for illustrating the shortcomings of the current benchmark counters.

./RAFT_IVF_PQ_ANN_BENCH --search --data_prefix=/data --benchmark_filter=fix_latency --benchmark_min_time=100x --benchmark_counters_tabular=true --mode=latency fix_latency.json
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                               Time             CPU   Iterations        GPU    Latency     Recall end_to_end items_per_second          k  n_queries total_queries    use_gpu
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
fix_latency/0/process_time/real_time/threads:1       10.2 ms        0.032 ms          100  0.0101664  0.0101838          0    1.01838        982.414/s         10         10          1000          0
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                               Time             CPU   Iterations        GPU    Latency     Recall end_to_end items_per_second          k  n_queries sync_stream total_queries    use_gpu
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
fix_latency/1/process_time/real_time/threads:1      0.024 ms        0.024 ms          100   13.5014u  0.0104866          0    1.04866       408.261k/s         10         10           0          1000          1
fix_latency/2/process_time/real_time/threads:1       10.5 ms         10.5 ms          100  0.0104682  0.0104769          0    1.04769        954.496/s         10         10           1          1000          1

We can see the following issues:

  • CPU: measures user time plus system time, which excludes idle time. A thread is also idle while it waits for a GPU sync, so CPU is not a correct measure of latency.

  • GPU: if the GPU library does not sync the stream, the GPU counter does not reflect the real duration (see fix_latency/1, which reports 13.5 µs for the 10 ms workload).
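The CPU issue can be reproduced without any GPU. Below is a minimal sketch, assuming a POSIX-like system where std::clock reports process CPU time. The names Timing and time_sleeping_workload are hypothetical and only stand in for the CPU variant of the fixed-latency workload.

```cpp
#include <cassert>
#include <chrono>
#include <ctime>
#include <thread>

// Wall-clock and process CPU time (both in seconds) spent in one call.
struct Timing {
  double wall_s;
  double cpu_s;
};

// Hypothetical stand-in for the CPU variant of the fixed-latency workload:
// the thread sleeps, so it is idle rather than computing.
inline Timing time_sleeping_workload(std::chrono::milliseconds sleep_for)
{
  assert(sleep_for.count() >= 0);

  const auto wall_start        = std::chrono::steady_clock::now();
  const std::clock_t cpu_start = std::clock();

  std::this_thread::sleep_for(sleep_for);

  const std::clock_t cpu_end = std::clock();
  const auto wall_end        = std::chrono::steady_clock::now();

  return {std::chrono::duration<double>(wall_end - wall_start).count(),
          static_cast<double>(cpu_end - cpu_start) / CLOCKS_PER_SEC};
}
```

Calling time_sleeping_workload(std::chrono::milliseconds(10)) should give a wall time of at least 10 ms while the CPU time stays close to zero, which is the same pattern as the Time and CPU columns of fix_latency/0.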

tfeher commented Oct 31, 2023

In throughput mode, Time is not a correct measure of latency. Time is the wall-clock time difference divided by the total number of iterations. Since the iterations run in parallel, Time effectively becomes the latency of a single iteration divided by the number of threads.
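As a worked example of that relationship, here is a small sketch. The function names are hypothetical, and it assumes every thread runs the same number of iterations over the same wall-clock window.

```cpp
#include <cassert>

// "Time" as described above: wall-clock duration divided by the total number
// of iterations summed over all threads.
inline double reported_time_s(double wall_clock_s, long total_iterations)
{
  assert(total_iterations > 0);
  return wall_clock_s / static_cast<double>(total_iterations);
}

// Latency of one batch: with concurrent threads, each thread performs only
// total_iterations / threads iterations during that wall-clock window.
inline double batch_latency_s(double wall_clock_s, long total_iterations, int threads)
{
  assert(threads > 0 && total_iterations > 0);
  assert(total_iterations % threads == 0);
  return wall_clock_s / static_cast<double>(total_iterations / threads);
}
```

Plugging in the fix_latency/0 threads:8 row from the throughput-mode output in this comment (end_to_end 1.02196 s, 800 iterations), reported_time_s gives about 1.28 ms, matching the Time column, while batch_latency_s gives about 10.2 ms, which is the sleep time we actually configured.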

I believe we want to define vector search Latency as the wall-clock time for finishing a batch. This PR adds such a column to the benchmark output. Time gives us almost what we need, but it is divided by the number of threads. Unfortunately the timers are private, so we cannot access them from the benchmark state. Instead, we can take the end_to_end duration and divide it by the iteration count to get the average latency (note the kAvgThreads flag, which averages the counter over the threads).
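The averaging can be sketched in plain C++ as follows. This is only an illustration of the arithmetic, not the code in this PR: ThreadResult and thread_averaged_latency_s are hypothetical names, and the function emulates what a thread-averaged counter does (sum the per-thread values, then divide by the thread count).

```cpp
#include <cassert>
#include <vector>

// What one benchmark thread measured (hypothetical record for this sketch).
struct ThreadResult {
  double end_to_end_s;  // wall-clock duration of the thread's whole loop
  long iterations;      // batches processed by this thread
};

// Per-thread latency is end_to_end / iterations; the reported value is the
// mean of those per-thread latencies.
inline double thread_averaged_latency_s(const std::vector<ThreadResult>& per_thread)
{
  assert(!per_thread.empty());
  double sum = 0.0;
  for (const ThreadResult& r : per_thread) {
    assert(r.iterations > 0);
    sum += r.end_to_end_s / static_cast<double>(r.iterations);
  }
  return sum / static_cast<double>(per_thread.size());
}
```

For two threads that each ran 100 iterations in 1.0 s and 1.2 s, this yields roughly 11 ms, independent of how many threads ran side by side.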

./RAFT_IVF_PQ_ANN_BENCH --search --data_prefix=/data --benchmark_filter=fix_latency --benchmark_min_time=100x  --benchmark_counters_tabular=true --mode=throughput /workspace1/config/fix_latency.json

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                Time             CPU   Iterations        GPU    Latency     Recall end_to_end items_per_second          k  n_queries total_queries    use_gpu
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
fix_latency/0/process_time/real_time/threads:1        10.2 ms        0.032 ms          100  0.0101596  0.0101772          0    1.01772        983.058/s         10         10          1000          0
fix_latency/0/process_time/real_time/threads:2        5.10 ms        0.046 ms          200  0.0101801  0.0102127          0    1.02172       1.95936k/s         10         10            2k          0
fix_latency/0/process_time/real_time/threads:4        2.55 ms        0.057 ms          400  0.0101708   0.010227          0    1.02396       3.91614k/s         10         10            4k          0
fix_latency/0/process_time/real_time/threads:8        1.28 ms        0.064 ms          800  0.0101425  0.0102338          0    1.02196       7.84236k/s         10         10            8k          0
fix_latency/0/process_time/real_time/threads:16      0.633 ms        0.039 ms         1600  0.0101036  0.0101894          0    1.01989       15.8025k/s         10         10           16k          0
fix_latency/0/process_time/real_time/threads:32      0.316 ms        0.033 ms         3200  0.0100886  0.0102426          0    1.01987       31.6601k/s         10         10           32k          0
fix_latency/0/process_time/real_time/threads:48      0.211 ms        0.033 ms         4800   0.010088  0.0102979          0    1.02507       47.4775k/s         10         10           48k          0
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                Time             CPU   Iterations        GPU    Latency     Recall end_to_end items_per_second          k  n_queries sync_stream total_queries    use_gpu
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
fix_latency/1/process_time/real_time/threads:1       0.025 ms        0.025 ms          100   13.8858u  0.0104901          0    1.04901       404.242k/s         10         10           0          1000          1
fix_latency/1/process_time/real_time/threads:2       0.013 ms        0.023 ms          200   8.14784u  0.0104827          0    1.04866       797.795k/s         10         10           0            2k          1
fix_latency/1/process_time/real_time/threads:4       0.014 ms        0.041 ms          400   10.9731u  0.0104859          0    1.04857       709.049k/s         10         10           0            4k          1
fix_latency/1/process_time/real_time/threads:8       0.368 ms        0.657 ms          800   105.973u  0.0296742          0    2.96675       27.1756k/s         10         10           0            8k          1
fix_latency/1/process_time/real_time/threads:16      0.945 ms         12.0 ms         1600   1.43132m  0.0289549          0    2.89458       10.5867k/s         10         10           0           16k          1
fix_latency/1/process_time/real_time/threads:32      0.516 ms         16.1 ms         3200   1023.67u  0.0180644          0    1.80312       19.3707k/s         10         10           0           32k          1
fix_latency/1/process_time/real_time/threads:48      0.354 ms         16.4 ms         4800   790.397u   0.017988          0    1.79329        28.237k/s         10         10           0           48k          1
fix_latency/2/process_time/real_time/threads:1        10.5 ms         10.5 ms          100  0.0104754  0.0104842          0    1.04842        953.833/s         10         10           1          1000          1
fix_latency/2/process_time/real_time/threads:2        5.24 ms         10.5 ms          200  0.0104645  0.0104835          0    1.04876       1.90857k/s         10         10           1            2k          1
fix_latency/2/process_time/real_time/threads:4        2.62 ms         10.5 ms          400  0.0104443  0.0104819          0    1.04864       3.81809k/s         10         10           1            4k          1
fix_latency/2/process_time/real_time/threads:8        1.31 ms         10.4 ms          800   0.010427  0.0104937          0    1.04861       7.63035k/s         10         10           1            8k          1
fix_latency/2/process_time/real_time/threads:16      0.655 ms         10.4 ms         1600  0.0103763  0.0105217          0    1.05113       15.2569k/s         10         10           1           16k          1
fix_latency/2/process_time/real_time/threads:32      0.328 ms         10.3 ms         3200  0.0103018  0.0105683          0     1.0542       30.4933k/s         10         10           1           32k          1
fix_latency/2/process_time/real_time/threads:48      0.219 ms         10.3 ms         4800  0.0103413   0.010683          0    1.06139        45.568k/s         10         10           1           48k          1

tfeher commented Oct 31, 2023

  • TODO: add a command line arg to control the number of threads. This would be useful when running a larger set of benchmarks or when profiling.

cjnolet commented Nov 1, 2023

> TODO: add a command line arg to control the number of threads. This would be useful when running a larger set of benchmarks or when profiling.

It would be nice to even be able to set the min/max threads (and set them equal if one wants to run only a single-threaded experiment).
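One possible shape for such an argument is a "min:max" value. The sketch below is a hypothetical parser, not the benchmark's actual implementation. Accepting a bare number as min == max is an assumption made here for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>

// Parses one positive integer and rejects trailing characters.
inline int parse_positive_int(const std::string& text)
{
  std::size_t used = 0;
  const int value  = std::stoi(text, &used);  // throws on non-numeric input
  if (used != text.size() || value < 1) {
    throw std::invalid_argument("expected a positive integer: " + text);
  }
  return value;
}

// Hypothetical parser for a thread-count argument of the form "min:max".
// A single value "n" is treated as "n:n" (an assumption of this sketch).
inline std::pair<int, int> parse_thread_range(const std::string& spec)
{
  const std::size_t colon = spec.find(':');
  int lo = 0;
  int hi = 0;
  if (colon == std::string::npos) {
    lo = hi = parse_positive_int(spec);
  } else {
    lo = parse_positive_int(spec.substr(0, colon));
    hi = parse_positive_int(spec.substr(colon + 1));
  }
  if (hi < lo) { throw std::invalid_argument("max threads is below min threads"); }
  assert(lo >= 1 && hi >= lo);
  return std::make_pair(lo, hi);
}
```

With this sketch, "1:24" yields the range (1, 24) and "8" yields (8, 8), which covers the single-threaded case by setting both ends equal.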

@github-actions github-actions bot removed the python label Nov 2, 2023

@cjnolet cjnolet left a comment

LGTM!

@tfeher tfeher added the 5 - DO NOT MERGE Hold off on merging; see PR for details label Nov 2, 2023
tfeher commented Nov 2, 2023

I am still investigating a bug. Please do not merge until it's fixed. Update: fixed.

@tfeher tfeher removed the 5 - DO NOT MERGE Hold off on merging; see PR for details label Nov 2, 2023
tfeher commented Nov 2, 2023

Fixed the problem. Pending CI, it is ready to merge. Example output:

./RAFT_IVF_FLAT_ANN_BENCH --search --data_prefix=/data --benchmark_filter=raft_ivf_flat --benchmark_out_format=csv --benchmark_out=res.csv  --override_kv=n_queries:10 --benchmark_counters_tabular=true --mode=throughput --threads=1:24 --benchmark_min_warmup_time=1 /workspace1/config/deep-10M.json
...
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                          Time             CPU   Iterations        GPU    Latency     Recall end_to_end items_per_second          k  n_queries     nprobe total_queries
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
raft_ivf_flat.nlist5K/0/process_time/real_time/threads:1       0.111 ms        0.111 ms         6335   101.801u   110.565u    0.90378   0.700427       90.4461k/s         10         10         50        63.35k
raft_ivf_flat.nlist5K/0/process_time/real_time/threads:2       0.075 ms        0.146 ms         9250   133.635u   151.153u    0.90378   0.699437       132.519k/s         10         10         50         92.5k
raft_ivf_flat.nlist5K/0/process_time/real_time/threads:4       0.052 ms        0.187 ms        13560   177.321u   206.465u    0.90378   0.699672       194.025k/s         10         10         50        135.6k
raft_ivf_flat.nlist5K/0/process_time/real_time/threads:8       0.040 ms        0.250 ms        17696   276.013u   324.323u    0.90378   0.717858       249.676k/s         10         10         50       176.96k
raft_ivf_flat.nlist5K/0/process_time/real_time/threads:16      0.037 ms        0.331 ms        18528    508.35u   596.448u    0.90378   0.689932       271.407k/s         10         10         50       185.28k
raft_ivf_flat.nlist5K/0/process_time/real_time/threads:24      0.037 ms        0.400 ms        18144   757.152u   892.142u    0.90377   0.671828       272.899k/s         10         10         50       181.44k

It is recommended to use --benchmark_min_warmup_time=1. Otherwise the first benchmark is not warmed up, and the relative performance across different numbers of threads is misleading.
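To illustrate why the warm-up matters, here is a minimal sketch of a timing helper that runs the workload untimed for a minimum period before measuring. It is not how the benchmark library implements the flag. TimedRun and time_with_warmup are hypothetical names.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Result of one measurement (hypothetical type for this sketch).
struct TimedRun {
  double avg_latency_s;  // mean wall-clock time per timed call
  long warmup_calls;     // untimed calls made before measuring
};

// Calls `work` repeatedly for at least `min_warmup` without timing it, then
// times `iterations` further calls. One-time costs (lazy initialization, cold
// caches) are paid during warm-up instead of polluting the first measurement.
inline TimedRun time_with_warmup(const std::function<void()>& work,
                                 std::chrono::duration<double> min_warmup,
                                 int iterations)
{
  assert(iterations > 0);
  using clock = std::chrono::steady_clock;

  long warmup_calls       = 0;
  const auto warmup_start = clock::now();
  do {
    work();
    ++warmup_calls;
  } while (clock::now() - warmup_start < min_warmup);

  const auto start = clock::now();
  for (int i = 0; i < iterations; ++i) {
    work();
  }
  const auto end = clock::now();

  return {std::chrono::duration<double>(end - start).count() / iterations, warmup_calls};
}
```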

@github-actions github-actions bot added the python label Nov 2, 2023
cjnolet commented Nov 3, 2023

/merge

auto duration = std::chrono::duration_cast<std::chrono::duration<double>>(end - start).count();
state.counters.insert({{"end_to_end", duration}});
}
cudaDeviceSynchronize();

CPU-only builds fail at this point:

benchmark.hpp:299:3: error: there are no arguments to 'cudaDeviceSynchronize' that depend on a template parameter, so a declaration of 'cudaDeviceSynchronize' must be available [-fpermissive]
  299 |   cudaDeviceSynchronize()

I thought we had cuda_stub.hpp to help with this, but it does not work. If we are in a hurry we could disable the sync here, but it would be better to fix the CUDA stubs.
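For reference, the error comes from two-phase name lookup: the call takes no argument that depends on a template parameter, so the name must already be declared where the template is defined. A minimal sketch of the stub idea follows. The macro EXAMPLE_CPU_ONLY and the function finish_search are hypothetical, and the real cuda_stub.hpp may be organized differently.

```cpp
#include <cassert>

// Hypothetical switch for this sketch; the real build may use another macro.
#ifndef EXAMPLE_CPU_ONLY
#define EXAMPLE_CPU_ONLY 1
#endif

#if EXAMPLE_CPU_ONLY
// Minimal stand-ins so that code calling the CUDA runtime still compiles.
using cudaError_t = int;
constexpr cudaError_t cudaSuccess = 0;
inline cudaError_t cudaDeviceSynchronize() { return cudaSuccess; }
#else
#include <cuda_runtime_api.h>
#endif

// cudaDeviceSynchronize() has no argument that depends on T, so the compiler
// resolves the name when the template is defined. A declaration (real or
// stub) must therefore be visible above this point.
template <typename T>
T finish_search(T result)
{
  const cudaError_t status = cudaDeviceSynchronize();
  assert(status == cudaSuccess);
  (void)status;  // silence the unused-variable warning when NDEBUG is set
  return result;
}
```

In a CPU-only build the stub makes the sync a no-op, so the template compiles unchanged for both build flavors.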

@rapids-bot rapids-bot bot merged commit b21cad3 into rapidsai:branch-23.12 Nov 3, 2023
60 checks passed
benfred pushed a commit to benfred/raft that referenced this pull request Nov 8, 2023
This adds explicit latency column to the benchmark.

Authors:
  - Tamas Bela Feher (https://github.com/tfeher)
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: rapidsai#1940