Generally I would recommend […]; however, under the following conditions you may get better performance by doing a "staging belt" yourself:
Edit: if you go this path, there is also an extra benefit that you can overwrite sections of your target buffers multiple times in the same submission. With […]
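For example (an illustrative snippet; `stagingA`, `stagingB`, and `dst` are assumed buffers, not names from this thread), two copies recorded into one command encoder may target the same destination range and execute in submission order, so the later copy wins:

```ts
// Both copies target dst[256..320); they run in order within this one
// submission, so the contents of stagingB end up in the buffer.
const encoder = device.createCommandEncoder();
encoder.copyBufferToBuffer(stagingA, 0, dst, 256, 64);
encoder.copyBufferToBuffer(stagingB, 0, dst, 256, 64);
device.queue.submit([encoder.finish()]);
```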
The actual native buffer can be persistently mapped for all you know, so don't worry about this.
Not sure I understand that difficulty/guessing. You maintain a pool of staging buffers that are mapped; once you use one of them, you request it for mapping again, and when the promise resolves you return it to the pool. There shouldn't be any guesswork here.
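A minimal sketch of that pool, assuming a `device: GPUDevice` and a fixed chunk size; the class and method names are illustrative, not an existing API:

```ts
// Illustrative "staging belt": a pool of mapped staging buffers, recycled
// via mapAsync once the GPU has consumed them.
class StagingBelt {
  private free: GPUBuffer[] = [];     // currently mapped, ready to write into
  private inFlight: GPUBuffer[] = []; // used this frame, awaiting re-map

  constructor(private device: GPUDevice, private chunkSize: number) {}

  // Copy `data` into `dst` at `dstOffset` through a mapped staging buffer.
  // Assumes data.byteLength <= chunkSize and is 4-byte aligned.
  write(encoder: GPUCommandEncoder, dst: GPUBuffer, dstOffset: number, data: ArrayBuffer): void {
    const staging =
      this.free.pop() ??
      this.device.createBuffer({
        size: this.chunkSize,
        usage: GPUBufferUsage.MAP_WRITE | GPUBufferUsage.COPY_SRC,
        mappedAtCreation: true, // freshly created buffers start out mapped
      });

    new Uint8Array(staging.getMappedRange(0, data.byteLength)).set(new Uint8Array(data));
    staging.unmap();
    encoder.copyBufferToBuffer(staging, 0, dst, dstOffset, data.byteLength);
    this.inFlight.push(staging);
  }

  // Call after queue.submit(): request mapping again and return each buffer
  // to the pool once its promise resolves.
  recall(): void {
    for (const buf of this.inFlight) {
      buf.mapAsync(GPUMapMode.WRITE).then(() => this.free.push(buf));
    }
    this.inFlight = [];
  }
}
```

Per frame: call `belt.write(...)` while encoding, then `device.queue.submit([...])`, then `belt.recall()` so the used buffers get re-mapped for a later frame.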
This is strictly inferior to […].
Year 2024, macOS on an M1, wgpu-rs implementation, Metal backend.
So I have been digging through the old issues trying to determine what the optimal/recommended path is for per-frame buffer uploads (i.e. for MVP/normal matrices and dynamic vertex arrays). I can identify three paths to take. NOTE: I am currently assuming non-unified memory, as there does not seem to be a method of determining when staging buffers are unnecessary; therefore I am assuming `MAP_WRITE` and `VERTEX`/`UNIFORM` etc. are mutually exclusive.

1. Create the staging buffers every frame with `createBuffer` and `mappedAtCreation`, then destroy each buffer after the copy to the GPU-private buffer is complete (sketched below). Unless there is some management of allocation behind the scenes (which is not mentioned in the spec), this seems like a bad idea because of the constant allocation/deallocation. As far as I can tell it's not like OpenGL, where buffer orphaning is a fast path.
2. Issue a direct write into the GPU-private buffer with `writeBuffer` (sketched below). This seems very similar to `glBufferSubData`, which is controversial since it is a fast path on NVIDIA but a slow path on Intel. Not sure whether that is the case with WebGPU, as its actual behavior is not in the spec.
3. Map (using `mapAsync`) and unmap round-robin staging buffers every frame. The difficulty of this is guessing which ones are required for the frame, since it's an async operation and thus cannot be performed inside the main render function. It's not impossible, just difficult in some circumstances, though the ability to map a sub-range synchronously once the `Promise` is resolved partially negates this. The bigger issue is that mapping and unmapping is traditionally slow. Going back to OpenGL, NVIDIA does not recommend this approach, though Intel seems to prefer it over direct writes like `glBufferSubData`.

In other APIs the recommendation from all vendors is to use persistent buffer mapping, as it is the fast path for all IHVs. This is neither currently supported nor does there seem to be any plan to support it in WebGPU (am I wrong?). Therefore, are there any recommendations on what the recommended path on WebGPU should be, especially since there is no way to determine which IHV it's running on and adjust accordingly?
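For concreteness, here is a minimal sketch of paths 1 and 2, assuming a `device: GPUDevice`, a destination buffer `dst` created with `COPY_DST` usage, and per-frame data in a `Float32Array`; the function names are illustrative only. Path 3 is essentially the pooled `mapAsync` approach sketched earlier in the thread.

```ts
// Path 2: direct write with writeBuffer; the implementation decides how to stage it.
function uploadWithWriteBuffer(device: GPUDevice, dst: GPUBuffer, data: Float32Array): void {
  device.queue.writeBuffer(dst, 0, data);
}

// Path 1: throwaway staging buffer created mapped, destroyed after the copy is submitted.
function uploadWithThrowawayStaging(device: GPUDevice, dst: GPUBuffer, data: Float32Array): void {
  const staging = device.createBuffer({
    size: data.byteLength,
    usage: GPUBufferUsage.COPY_SRC,
    mappedAtCreation: true,
  });
  new Float32Array(staging.getMappedRange()).set(data);
  staging.unmap();

  const encoder = device.createCommandEncoder();
  encoder.copyBufferToBuffer(staging, 0, dst, 0, data.byteLength);
  device.queue.submit([encoder.finish()]);

  // destroy() after submit is fine: the allocation stays alive until the
  // in-flight copy has completed.
  staging.destroy();
}
```

Both variants upload the same bytes; they differ only in whether the staging allocation is explicit or left to the implementation.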