Minutes 2018 07 25

GPU Web 2018-07-25

Chair: Corentin

Scribe: Ken

Location: Google Hangout

Minutes from last meeting

TL;DR

Compute passes (issue 64)
- List of advantages of passes around synchronization, scheduling and type safety.
- Concern that compute passes aren’t necessary and more verbose.
- Need to take a look at making passes separate objects for type safety. Explore the spectrum between Vulkan-style single commandbuffer and Metal style where a sequence of passes are submitted.
- Discussion whether we want to do automatic offloading to compte/copy queues.
Render target aliasing
- Explanation of the semantics of residency in D3D12
- Using “tiled resources” would help with fragmentation but not broadly available
- Discussion about the difference between purge/unpurge and smart allocator

Tentative agenda

Compute passes
Render target aliasing
Agenda for next meeting

Attendance

Apple
- Dean Jackson
- Myles C. Maxfield
- Thomas Denney
Google
- Corentin Wallez
- Kai Ninomiya
- Ken Russell
Intel
- Yunchao He
Microsoft
- Chas Boyd
- Rafael Cintron
Mozilla
- Dzmitry Malyshau
- Jeff Gilbert
Elviss Strazdiņš
Markus Siglreithmaier

Administrative items

CW: Going to TPAC?
DJ: Some might for CSS (Myles) but WebGPU shouldn’t go
MM: Some people could meet at the HLSL BOF at SIGGRAPH
MM: Continuing work on WHLSL, one design question is how to do shader modularity. Not important to discuss it now but good for people to take a look.
https://github.com/gpuweb/WHLSL/issues/1

Compute passes

https://github.com/gpuweb/gpuweb/issues/64
DM: not specifically for compute passes but also for blit passes.
- Want to try to figure out other sides of the problem when making a decision.
- Resource binding: may or may not want to scope resources between passes
- If it does help impl on D3D12 to scope resources to passes…
MM: we discussed this a year ago. We agreed that state would not be inherited between them (passes?)
DM: that’s correct. If we do remove compute passes then there’s no option to “not inherit” the state because you don’t have a pass anymore. It is related to the issue that we had before but now we have a better understanding of what a pass is.
- We did make a decision. There’s an opinion that having passes would help the implementation. I’m not convinced of that.
CW: first aspect was resource binding inheritance.
- Metal allows to bundle descriptor allocations to the pass.
DM: that’s just because the passes are best represented by Metal.
DM: synchronization / validation.
- We ended up agreeing that usage would be constant across the pass. That’s cool. But we then decided we want UAV barriers between different dispatches in compute pass. What does const usage give us in this case?
- Since we can not tell users there’s no sync work between dispatches any more (unless you want to educate users on possible syncs between dispatches – slightly lower level than API we’re aiming for), then might as well say everything’s synchronized.
- MM: or, nothing’s synchronized.
DM: scheduling:
- Metal appears to use multiple hardware queues implicitly, based on the type of passes. Haven’t observed this though. Good opportunity for us to try to take advantage of multiple queues this way.
DM: type safety:
- Metal passes give you type safety. Also applies to BlitEncoder pass, which doesn’t gain anything from the other aspects discussed here
CW: my answer wasn’t completely unbiased. In our impl, having passes for things (descriptor upload, synchronization, interating over commands) the passes helped.
- When you have chunks of work that are very well-defined.
- Advantage on D3D backend: resources use all bindgroups in pass. Then can make sure you upload everything before beginning of pass. Similar to putting in barriers.
- CW: uploading is when in D3D you create descriptors in a CPU DescriptorHeap and then memcpy them to a GPU DescHeap. Or in Vulkan, writing DescriptorSets.
- DM: for D3D12, at the place where the user binds or uses descriptor sets, need all of them to be in the same heap. You’re saying that by extending this scope to the whole pass you get an easier implementation? Seems artificial. Might as well extend to the whole command buffer. Pass isn’t giving you anything except lifetime of bindings.
- CW: that’s a good point.
- JG: if you have the whole command buffer that you’re encoding anyway, passes do help you limit that to a smaller section. But for compute do you need that limitation?
- CW: for descriptors it’s maybe not as beneficial as for memory barriers.
CW: what are drawbacks of passes? Though DM’s review is unbiased it shows a number of things you can do with passes. Not sure we’ve talked about explicit advantages of having passes.
- Makes user experience a little more verbose/constrained, e.g. if you want to do copies in the middle of compute passes.
JG: if we don’t need it, let’s not add it.
CW: doesn’t seem that constraining. In the worst case you could end a compute pass eagerly if you don’t know what the follow up code will do.
JG: if you end them early and the impl is using them for hints, then if impl doesn’t do structuring across compute passes then they don’t help you. You have to do command buffer-level global optimizations anyway.
CW: don’t think you do. Seems heavy handed. Helps impl in the majority of cases, and adds type safety.
JG: in the present case it doesn’t add type safety. Gives runtime safety because you can try to execute dispatch outside a compute pass and it’ll give you an error. But you don’t have a separate ComputePass object which you need for type safety. ComputePasses are state when encoding CommandBuffers. Haven’t seen a proposal making ComputePasses actual objects.
CW: seems if they’re real objects then you incur a bit more garbage collection? What if you use the Compute Pass after you’ve ended it? Would need some Lifetime parameterization of types like in Rust.
JG: don’t see them as necessary for MVP.
DM: compute passes in particular?
JG: yes.
MM: the biggest use case we’ve seen is when apps want to do fairly unrelated sets of work. You setup your resources for one, relevant to first chunk of work, then set up second set and do second work. Basically, the first item in this, resource binding inheritance, is the major use case.
- The way Metal does it, you bind resources independently – collection of textures, buffers, etc. We have DescriptorSets but it’s the same idea. You can attach a collection of these sets.
KR: question about automatically inserting ComputePasses
CW: in Dawn for example we already have some smarts about grouping together adjacent DispatchCompute calls.
CW: there’s no strong argument either way. Mostly a matter of feeling.
DM: still a matter of how much overhead you get by putting different dispatches in different compute passes in Metal.
MM: I knew this a few months ago but don’t remember.
DM: I can see two different extremes for the passes. Either Metal-like passes for everything, and CommandBuffers are a collection of different passes. Narrow binding and synchronization scopes. But would really need us to have separate types. Don’t see why we would go this way and not have separate types for the passes.
- Or Vulkan-style, where your unit of resource scope is a command buffer. We can expect the users to submit more command buffers. They can do any blit/compute, etc. operations in a cmdbuf, including starting a render pass. Maybe 2 types of cmdbufs? A series of blit/compute commands, or your whole cmdbuf is a render pass.
- They don’t intersect much. You can do compute/blit inside a render pass, and (something else) without introducing synchronization.
- Also seems fairly sound.
CW: all of these approaches are sound. Hard to know.
KR: seems more symmetric to have these passes because we know we need render passes (and maybe sub-passes).
CW: the world where we know we need a copy pass is not a good thing for WebGPU. If you have this and we have multi-queue then anything put into a copy pass can be run on the copy queue. Not necessarily true. “Clear this texture” has to be in a copy pass but can’t be put on the copy queue.
DM: if we have explicit passes then we’re in a better position to not expose multiple queues.
CW: a bit scary.
DM: we know the dependencies between them.
CW: we don’t know how much work each pass is. Don’t know whether offloading to a different queue and paying the cost of synchronizing. Better if the user tells us.
KN: part of the problem is that we can’t introspect the GPU’s scheduler. Hard to imagine a scheduler that would work well.
DM: don’t think our current approach is as sound as the two approaches I suggested.
- We have passes and we do not, because the blit pass doesn’t make much sense. We have synchronization and we do not, because compute passes are not synchronized between dispatches.
MM: if we went with creating an object for each pass, would the same lack of guarantees about draw calls in a render pass apply to dispatches in a compute pass?
CW: right now we all agree it’s a good idea to not give developers footguns. That doesn’t change the sync properties of compute passes.
CW: let’s discuss this again next week and try to make progress on the Github issue.
MM: we have a little confusion about Mozilla’s idea here. Sounds like there are 3 options and Moz is arguing for 2 out of the 3?
DM: haven’t made up my mind about what we should do. Will follow up on comment on the issue suggesting a more concrete proposal.

Render-target aliasing

https://github.com/gpuweb/gpuweb/issues/63
CW: Idea of render graph. Complicated. Vulkan developers don’t use the subpass graph even if it’s not exactly the same. The graph will need to be rebuilt every frame.
- Also Purge/Unpurge idea. Related to optimized allocator.
- Discussion about residency in D3D12.
MM: some questions about how DX works.
- Confused about how eviction works. I have a draw call that might read one of a bunch of resources. If I don’t evict any of those resources but the video memory manager says it’s too big and has to kick one out, do I get device loss? Block?
- RC: you’re responsible for making resident any VRAM resources you need.
  - If memory manager has to evict stuff it will; will cause thrashing.
  - If you no longer need it, you evict it.
- MM: way to know you’re in the thrashing situation?
  - RC: no, no callback or anything.
- MM: so when you say “evict” is that a hint, or does it evict it now? Similarly, “make resident”?
  - RC: both are function calls, not flags. Up to VMM to do this lazily. Will make sure things are resident at the point of the next draw call. Imp’t in the bindless world because you don’t know what it’s going to read or write in the next draw call.
- KR: what if one draw call relies on more resources than can fit on the GPU?
  - RC: think MakeResident fails.
- MM: so you don’t get context loss?
  - RC: no.
- JG: effectively that means MakeResident sets up state you expect to be true, but things you made resident might have gotten evicted?
- RC: it’s a virtualized OS. CPU mem works the same way. It makes sure that the things you say MakeResident will be there when you draw. In the old days, the driver and runtime knew which things you’d used recently and not. Could page things out intelligently. In new world it has no idea. Might kick things out you’re about to use. Some D3D developers are surprised, they make a bunch of things resident, expect the driver to manage things like before and it behaves poorly.
- MM: Metal model almost identical. If you don’t use bindless facilities in Metal then Metal knows what you’re touching and makes them resident. If you use bindless then you have to mark all the resources you will touch by calling “UseFor”. System can evict resources any time it wants.
- KR: and there what happens if you’ve oversubscribed?
- MM: not sure.
- CW: sounds fine to lose the context.
- JG: maybe can’t create a state where that’s true? If things get evicted behind your back for a time when you’re doing something else, there’s still enough room. Residency being fallible should mean…
- KR: argument about a draw call touching tons of textures that don’t all fit.
- JG: in order to make the draw call you have to make all of them resident. If not enough room then MakeResident will fail.
- MM: is that to say that “MakeResident” means “can it fit on the GPU ever”?
- CW: no.
- JG: MakeResident is an imperative, page this in right now. If you can’t, then MakeResident will fail. If resources not in use by your program, those can be paged out while another program is running. Everyone can use all their data, but there will be the thrashing Rafael talked about.
- RC: yes. When you evict stuff you have to use fences to make sure you don’t evict something you were using. Else, undefined behavior.
- MM: so if the call fails you might want to just try again?
- JG: not if your program hasn’t done anything different. If you call MakeResident and it fails and you don’t change anything, the OS has already tried transparently evicting stuff and it didn’t work. You need to evict other stuff to make progress.
CW: think we understand residency better but not sure why we’re talking about it for render target aliasing.
- JG: it’s one way to do aliasing.
- RC: RT aliasing has to do with reusing the resources. This thing is definitely no longer in use, so reuse it behind the scenes.
CW: that’s in the case where we do things magically.
RC: V1 could be, you create a texture with a descriptor, we allocate and make resident the memory, but we don’t reuse memory between RTs.
MM: one impl strategy for Discard: make all of these resources use the virtual memory system. Discard throws away physical pages, later reclaim them. Isn’t it that there are D3D12 devices that don’t support virtual memory?
RC: are you talking about residency?
CW: using tiled resources to implement aliasing.
RC: from talking with them, called “reserved resource”, really tiled resource, you have to call UpdateTileMappings, but it’s the same thing. The API gives it to you, you reserve it, nothing is committed, you have to allocate tiles yourself.
- Can have one texture go across multiple heaps.
CW: that’s a general thing that would help with mem alloc and fragmentation. If we had discard/undiscard/purge/unpurge (really fast alloc/dealloc), don’t have to worry about fragmentation any more.
MM: do most D3D12 devices support this virtual memory subsystem feature?
RC: I think yes.
CB: was introduced in D3D11. Not sure how widespread it is.
CW: think Doom used it for megatextures. Can’t rely on it though. Have to support mobile platforms that don’t have this sort of memory control. Great optimization on desktop platforms though.
MM: there are 3 impl strategies I’ve come across (for solution “C”). First: Rafael’s. Second; literally moving the resource. Another: maybe Metal-specific: purgability bit inside Metal resource. If OS decides it’s low on memory, it’ll look for purgeable resources.
CW: my worry is that option C is like virtual memory management and the others are hard to get right. Would require invasive changes to application code.
MM: are you feeling we should drop this feature on the floor? Or option D, make malloc/free super fast and have no new API?
CW: for V1 I think we need something.
...more discussion…
MM: if we end up with no new API then if the app is written like at the top of the issue then Rafael is correct. But in solution C, purge = delete resource, then if you have super fast memory manager it’d be better. But slow memory manager = awful.
RC: so purge is the same as deleting it altogether? have to make it again?
CW: could also have a hint...not saying we don’t need an API for this eventually.
MM: like the idea of super-discard.
RC: that’s what I thought purge was.
CW: we want a discard/store operation anyway, because we can take advantage of it on mobile GPUs. Then they’re cleared the next time we use it anyway. But you can implement super-discard in the webgpu impl.
MM: you specify that on the current render pass, but the point at which you can discard is the point where the current render pass is reading from.
KR: I like the idea of texture heaps. Strong hint that you want these things to use the same storage.
CB: agree, that’s why MSFT did it. It’s like the union of resources.
CW: that’s what Frostbite did. Saw gains from it. Seems early to do it.
KR: from the Metal use cases it sounds like some apps won’t work on mobile without it.
JG: do between MVP and V1?

Agenda for next meeting

Passes
Move along anything we haven’t finished in the issues
- e.g. render target aliasing.
If newer topics, can put them at the start.
CB: there is the HLSL BOF at SIGGRAPH, right after the Khronos ones.
Fall F2F is Thursday September 27.
Please click on link about shader modularity and provide your opinions!
A priori next week the meeting next week will be one hour earlier, 1 PM Pacific.

Minutes 2018 07 25

GPU Web 2018-07-25

Minutes from last meeting

TL;DR

Tentative agenda

Attendance

Administrative items

Compute passes

Render-target aliasing

Agenda for next meeting

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!