Brief
Add another workspace memory resource that does not have an explicit memory limit. That is, after the change we have the following:

- `rmm::mr::get_current_device_resource()` is the default for all allocations, as before. It is used for allocations with unlimited lifetime, e.g. those returned to the user.
- `raft::get_workspace_resource()` is for temporary allocations and is forced to have a fixed size, as before. However, it becomes smaller and should be used only for allocations that do not scale with the problem size. It defaults to a thin layer on top of the `current_device_resource`.
- `raft::get_large_workspace_resource()` (new) is for temporary allocations that can scale with the problem size. Unlike the `workspace_resource`, its size is not fixed. By default, it points to the `current_device_resource`, but the user can set it to something backed by host memory (e.g. managed memory) to avoid OOM exceptions when there is not enough device memory left (see the sketch after this list).
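As a minimal sketch of that last point: the snippet below backs the large workspace with managed memory. It assumes a setter `raft::resource::set_large_workspace_resource` paired with the getter named above; the exact namespace, header, and signature may differ.

```cpp
#include <raft/core/device_resources.hpp>

#include <rmm/mr/device/managed_memory_resource.hpp>

#include <memory>

int main()
{
  raft::device_resources res;

  // Managed (unified) memory can page between device and host, so large
  // temporary allocations degrade gracefully instead of throwing OOM.
  auto managed_mr = std::make_shared<rmm::mr::managed_memory_resource>();

  // Assumed setter corresponding to raft::get_large_workspace_resource().
  raft::resource::set_large_workspace_resource(res, managed_mr);

  // ... run an algorithm whose scalable temporaries now use managed memory ...
}
```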
Problem

We have a list of issues/preferences/requirements, some of which contradict each other:

- Temporary allocations should go through `rmm::mr::pool_memory_resource` for performance reasons (to avoid lots of `cudaMalloc` calls in the loops); see the pooling sketch after this list.
- Not all temporary allocations can go through the fixed-size `workspace_resource`, because some of them scale with the problem size and would inevitably fail with OOM at some point.
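For context on the first point, the standard RMM pooling pattern looks roughly like this; the initial pool size here is an arbitrary placeholder.

```cpp
#include <rmm/mr/device/cuda_memory_resource.hpp>
#include <rmm/mr/device/per_device_resource.hpp>
#include <rmm/mr/device/pool_memory_resource.hpp>

int main()
{
  // Reserve a chunk of device memory once; later allocations are carved out
  // of the pool instead of calling cudaMalloc in every loop iteration.
  rmm::mr::cuda_memory_resource cuda_mr;
  rmm::mr::pool_memory_resource<rmm::mr::cuda_memory_resource> pool_mr{
    &cuda_mr, 1ull << 30 /* initial pool size: 1 GiB (placeholder) */};
  rmm::mr::set_current_device_resource(&pool_mr);

  // ... all RMM-based device allocations now come from the pool ...
}
```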
Solution

I propose to split the workspace memory into two: the existing `workspace_resource` stays small and fixed-size, serving temporary allocations that do not scale with the problem size, while the new `large_workspace_resource` has no size limit and takes the temporary allocations that do. A hypothetical routing helper is sketched below.
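This sketch shows how an algorithm could route each temporary buffer by whether its size scales with the input. The helper `alloc_temp`, its boolean flag, and the exact header paths are illustrative assumptions, not part of the RAFT API; the getters are the ones named in the Brief.

```cpp
#include <raft/core/device_resources.hpp>
#include <raft/core/resource/cuda_stream.hpp>
#include <raft/core/resource/device_memory_resource.hpp>

#include <rmm/device_buffer.hpp>

// Hypothetical helper: route a temporary buffer to the right workspace.
rmm::device_buffer alloc_temp(raft::device_resources const& res,
                              std::size_t bytes,
                              bool scales_with_problem_size)
{
  auto* mr = scales_with_problem_size
               ? raft::resource::get_large_workspace_resource(res)  // unlimited
               : raft::resource::get_workspace_resource(res);       // fixed size
  return rmm::device_buffer{bytes, raft::resource::get_cuda_stream(res), mr};
}
```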
Notes: