Minutes 2018 08 08
Chair: Corentin
Scribe: Ken
Location: Google Meet
- Discussion around passes issue 64
  - Metal cannot submit command buffers to a different queue and can’t replay them
  - At least a syscall per command buffer, with a callback.
  - Discussion of recording at submit time or not.
  - Proposal diverges from native APIs; would have to be designed very carefully.
  - Need to talk about reuse and bundles / secondary cmd buf.
  - (again) discussion about separate compute pass or not.
  - Command buffers could aggregate passes and provide reusability.
  - Mozilla will discuss and agree internally.
- Uploading data to textures
  - No equivalent of glTexImage2D because mapping buffers is at least as fast.
  - Browser will be the arbiter of residency.
- Agenda for next meeting
  - Passes
Attendance:
- Apple
  - Myles C. Maxfield
  - Thomas Denney
- Google
  - Corentin Wallez
  - Dan Sinclair
  - Kai Ninomiya
  - Ken Russell
- Intel
  - Aleksandar Stojilijkovic
- Microsoft
  - Rafael Cintron
- Mozilla
  - Dzmitry Malyshau
  - Jeff Gilbert
- Yandex
  - Kirill Dmitrenko
- Elviss Strazdiņš
- https://github.com/gpuweb/gpuweb/issues/64
- MM: answers to questions from last week:
- Once a cmdbuf is recorded, what can it be submitted to?
- A: only the queue it was created from. Not different queue on same device, not different device.
- What facilities does Metal have for replaying previously recorded commands?
- A: no facilities. Not possible with Metal.
- CW: two ways to parallelize cmdbuf recording:
  - Create multiple cmdbufs at the same time
  - Use parallel RenderEncoder to have multiple encoders record to the same render pass.
- MM: correct.
- DM: you have new indirect cmdbuf. Is it reusable?
- MM: nope.
- MM: was one detail: if the number of cmdbufs increases, what happens on Metal?
- Every cmdbuf causes a syscall to commit it. However some drivers try to coalesce them. Unclear what the ratio of syscall increase is.
- Interesting: completion of cmdbuf also has a callback which can occur. In order to cause that callback to run, there’s another interrupt from the GPU to CPU, and that interrupt has significant cost.
- Would be good if not every cmdbuf had a completed callback.
- Either author can opt in, saying that this buffer doesn’t need a callback; or use a fence approach.
- CW: another possibility: there’s a bunch of things submitted to the queue corresponding to Metal cmdbufs. WebGPU implementation would only set a callback on the last one. App would know not to submit many times per frame.
- DM: we control the callbacks. We won’t necessarily have a callback per cmdbuf.
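The strategy CW and DM sketch here can be illustrated with a small hypothetical snippet; `NativeCommandBuffer` and `submitAll` are invented names standing in for an implementation's wrapper over Metal command buffers, not any real or proposed API:

```typescript
// Hypothetical implementation sketch: attach a completion callback only to
// the last native command buffer in a submit, so the costly GPU-to-CPU
// interrupt is paid once per submit rather than once per command buffer.
interface NativeCommandBuffer {
  // Stand-in for Metal's addCompletedHandler + commit pair.
  commit(onCompleted?: () => void): void;
}

function submitAll(cmdbufs: NativeCommandBuffer[], onAllCompleted: () => void): void {
  cmdbufs.forEach((cb, i) => {
    // Command buffers on one queue complete in submission order, so a
    // callback on the last one implies all earlier ones are done too.
    const isLast = i === cmdbufs.length - 1;
    cb.commit(isLast ? onAllCompleted : undefined);
  });
}
```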
- MM: one implicit thing: one possible way of having multiple passes but no cmdbuf would be: instead of recording each command synchronously, remember which one was encoded. Then batch multiple encoders into one Metal cmdbuf. But not possible because we don’t want to defer – want to reduce latency. Should say this publicly.
- DM: some controversy about this because Dawn is going to record at submit time. We (Firefox) hope to record live, but not clear whether possible due to validation and threading.
- MM: is it currently impossible to record cmds synchronously?
- CW: not impossible, but recording synchronously prevents doing a lot of driver magic. Can’t do residency stuff yourself for example. Everything has to be in the same position as when command was recorded. Ex: DescriptorHeaps on D3D that you have to build just-in-time because you don’t know how many descriptors you’ll need. On D3D we may limit ourselves to having 500,000 Descriptors max. Makes everything more complicated and limited.
- JG: recording live doesn’t mean recording everything live – just most things.
- CW: it’s hard.
- JG: agree. But not impossible.
- DM: should take on early recording as a good thing to have, and leave it on the table for the design. Having passes as cmdbufs doesn’t seem like it blocks live recording.
- MM: if we want to have live recording, then every pass must be backed with one platform cmdbuf. And only way to do multiple passes -> one cmdbuf is to not do live recording.
- CW: yes. And this is an issue for Metal because there’s a high fixed cost per cmdbuf.
- MM: were discussing with Metal team and they didn’t say it was a crazy idea; willing to try it.
- DM: don’t think the cost is that high. Have been profiling on Metal. Submitting 100 cmdbufs didn’t have an impact. Don’t have the specific benchmarks to show the group, but based on personal experience don’t think that eliminating passes will be a blocker.
- CW: we don’t have a perf concern for this proposal on Metal. Some concern that it’s different from every native API.
- JG: much less safe to go our own way. Other people have already launched products and made these choices. (More people than we’ve been in contact with.) And they converged upon similar ideas in many situations. Going our own way should only be done with a lot of forethought, and I’m nervous about this. If they did it this way and we don’t see why, then we should consider doing it that way.
- CW: separate passes as objects: people designing things are closer to hardware. Cmdbuf is a hw-level concept. We’re a level higher than that. If someone can make a better abstraction than that it’s this group.
- JG: the more you diverge the more unknown divergences you have and the more you can’t establish by analogy. Right now can look at existing specs and see what they do. If we keep moving away from the native specs we’ll have to think carefully about the corner cases.
- RC: agree with Jeff. If impl on D3D is to have each pass be its own cmdbuf, each time you open a cmdbuf you have to reset a lot of state. Putting everything in a big cmdbuf lets you amortize this cost.
- JG: in particular; you can record and heuristically possibly reuse cmdbufs. The caching is a heuristic and we want to avoid heuristics if we can, and more toward the explicit APIs where everything is predictable.
- DM: your point is counter to Rafael’s. RC says we can be more implicit with the caching of root signatures and state between passes (because D3D12 doesn’t have passes). More explicit for our API would be to start passes clean; they set the state they need. This means resetting the root signature.
- JG: if you want to take multiple passes and merge them into a cmdbuf for reuse, you have to have a heuristic about what you can reuse, how big chunks you can reuse, etc. Have to make a choice. Cmdbuf recording is more direct.
- CW: that’s assuming that you can only reuse an entire submit, not individual passes.
- JG: correct. If you submit a big chain of passes and some subsections are patterns you want to reuse, currently (pure passes proposal) there’s no way to reuse those.
- DM: that’s a good concern. Comes down to our secondary and parallel command encoder and bundle story. Opens the scope of the question quite a bit.
- JG: it has to. Not talking about how to expose passes, but how to expose commands.
- MM: if the goal is to give a hint to the runtime that these commands are related, cmdbuf is probably not the right abstraction, because it’s a hw-level feature.
- JG: it’s an api level feature. It’s a parallel to the driver’s cmdbuf but not the same thing.
- MM: maybe we need hint API calls.
- CW: probably not hints but something enforceable. If you want to reuse the pass/cmdbuf you have to tag it reusable.
- JG: the minute you tag something reusable you have cmdbufs.
- CW: no, you say each is reusable.
- MM: the minute you say it’s reusable it doesn’t work on Metal.
- CW: rather, just tagging it at the API level.
- JG: still prevents merging of passes into a single cmdbuf. Also, don’t need to reuse Metal’s encoders directly; can record live and also store off the commands we recorded on the side.
- CW: why do you want to reuse a sequence of passes?
- JG: is it not valuable to submit fewer command buffers?
- CW: not really. Theoretically fewer command buffers is less overhead, but overhead is low.
- JG: not clear.
- RC: on D3D, say you have 4 cmdbufs. Probably the same as 1 cmdbuf with everything. But 1 cmdbuf with everything, you can set the root signature once and commit a lot of stuff.
- CW: binding / setting root signature is cheap, because you can’t change parts of the root signature.
- RC: still, more API calls, “set this”, “set that”. Have not done the analysis on this though. More overhead we’ll send to the driver. If possibility for reuse, having single cmdbuf is better than multiple ones sharing the same state.
- MM: sounds like this proposal has two pieces. Other one is making a “work” pass including both blit and compute operations.
- RC: right. Should we have passes for anything except render passes in general?
- DM: CW’s extension was to not have a work pass. Have different passes for compute and transfers.
- CW: one concern with compute passes is that they increase the burden on the developer. But if you look at everything as passes it’s simpler; that’s one reason we think compute passes are better than no compute passes. Other reason: bigger work granularity, better boundaries between things.
- RC: does having compute passes make it easier or harder to insert UAV barriers?
- CW: it does make UAV barriers easier b/c you only have to check things used in the same compute pass. Also makes transitions easier because you never have them inside compute passes.
- Going back to earlier discussion: do we want Begin/EndComputePass? It’s annoying b/c of more details for the app, but would make this cleaner.
- All isomorphic – it’s all the same. Just matter of how we want to express it. Not sure how to reach agreement.
- RC: what are the details of what the developer has to say in a compute pass? “I’m only going to read or write to these”? Will it insert UAV barriers just in case?
- CW: inside compute pass will look at whether UAV was touched by previous dispatch and put a UAV barrier in.
- RC: OK. So if you have to do that inside a compute pass anyway then what is it about a compute pass that makes validation easier?
- CW: the amount of work you have to look at is smaller. Say you have a UAV and you want to populate it and then do read/write. Populate = copy. Transition to Copy usage. Copy. Then transition. Do read/write. When you insert UAV barrier you know that it’s been done … before the copy? You know that you only have usage transitions outside the compute pass.
- RC: so if you have two compute passes next to each other, the API will put a UAV barrier between them no matter what?
- CW: yes, it would put it at the end of one or the beginning of the other. But if you have a copy between them, it would know that no UAV barrier is needed between the compute passes, since the copy’s usage transitions already synchronize.
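To make the bookkeeping CW describes concrete, here is a minimal sketch of the per-pass tracking; every identifier is invented for illustration, not proposed API:

```typescript
// Illustrative only: tracks UAV hazards between consecutive dispatches
// inside a single compute pass. "UAVResource" is a stand-in handle type.
type UAVResource = string;

class ComputePassUAVTracker {
  private touchedByPrevious = new Set<UAVResource>();

  // Returns the resources that need a UAV barrier before this dispatch.
  beforeDispatch(uavsUsed: UAVResource[]): UAVResource[] {
    const needBarrier = uavsUsed.filter(r => this.touchedByPrevious.has(r));
    // Usage transitions (e.g. copy <-> storage) only happen outside the
    // pass, which is what keeps this per-dispatch check small.
    this.touchedByPrevious = new Set(uavsUsed);
    return needBarrier;
  }
}
```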
- DM: follow-up proposal. Background on why cmdbufs in our API don’t make too much sense: if we have passes for everything then cmd submission is modeled after Metal. Cmdbuf there is sequence of passes with callbacks for them. Unit of execution is a pass, not a cmdbuf.
- CW: and cmdbufs are just reservation of order inside Metal queue.
- DM: I do buy the argument that if you want to reuse passes then you should bake in the order inside the passes. Put all the sync we want between passes. What if we use that to expose the reusability as a feature that we maybe ship after MVP? Technically you create a pass, record it, commit it. Or you can create a cmdbuf, and everything in it will be reusable. Cmdbuf will imply reusability.
- JG: then you’re specing things both with and without cmdbufs. It’s more work for everyone.
- DM: reusable vs. not is already two ways. Do you think that providing a flag to cmdbuf “reusable or not” is much simpler? Better than saying cmdbuf implies reusability?
- JG: yes. There’s only one way to do things. Don’t have different rules for submission order.
- DM: still two different code paths. Have to test both if you want to reuse both.
- JG: just a flag to test. Less spec-level work. Not two ways to submit things.
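For concreteness, here is one hypothetical shape the two options in this exchange could take; every name below is invented, and none of this is real or proposed API:

```typescript
// Option A (DM): creating a command buffer implies reusability; bare passes
// are one-shot. Option B (JG): a single submission path with an explicit flag.
interface OneShotPass {
  commit(): void; // record once, commit, done
}

interface ReusableCommandBuffer {
  submit(): void; // Option A: reusable by construction, may be submitted many times
}

interface FlaggedCommandBuffer {
  submit(): void; // Option B: legal to resubmit only if created with { reusable: true }
}

interface Device {
  beginPass(): OneShotPass;                     // Option A: one-shot path
  createCommandBuffer(): ReusableCommandBuffer; // Option A: reusable path
  createFlaggedCommandBuffer(desc: { reusable: boolean }): FlaggedCommandBuffer; // Option B
}
```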
- CW: having separate passes is probably easier to test than having separate command buffers. Independent passes -> only one way to chain work together. If cmdbufs, can have multiple.
- KR: seems more complex to try to eliminate these concepts to me
- JG: seems like we’re making things more complex to test, etc.
- DM: we don’t have to ship it in MVP. Can do the same thing as with memory heaps. Things are allocated instantly, without heaps. Here are the passes, use them. When we figure out reusable command buffers, secondary command buffers, and bundles, we can roll them out.
- JG: it changes queue submission and cmd submission. Changes fundamental concepts. It’s a big change.
- DM: a cmdbuf is a sequence of passes. Not a fundamental change.
- JG: it is a big change if you’re live-encoding cmdbufs.
- CW: let’s cool down on this. Come up with concrete proposals and comment on the issue.
- JG: it needs to be more compelling to deviate from prior art.
- CW: ok, good to know. Can try to address this next time. And if we don’t then your concern still stands.
- MM: sounds like DM and JG need to work this out
- JG: this is the natural forum for it.
- DM: does Apple have an official position?
- MM: not really. First issue, should we have more or fewer passes. Willing to experiment with both. Q of should we have compute / work passes, no strong opinion either.
- CW: basically our opinion too. Slight bias toward compute passes. Need to experiment with either traditional cmdbufs or passes.
- RC: D3D team’s feedback: render passes yes, useful, needed for tilers. Compute: seem unnecessary. If unnecessary, don’t make developers type things they don’t need to. But they sometimes forget that WebGPU API is going to do implicit barriers. That’s why I keep asking from a validation perspective, does having passes help us? If it does let’s have them. If not and there’s no other reason except making things pretty then let’s not have them.
- DM: are there any plans for compute inside render passes?
- RC: compute ops inside rendering? Yes. There are tiers you can ask for. Tier 0: everything emulated. Tier 1: rendering ops optimized, compute not. Tier 2: both render and compute are optimized in render passes. Yes, can do both. Have details like putting all barriers at the beginning.
- CW: to clarify: is that Tier 0 = hardware doesn’t have render passes? Tier 1, you can do render passes but HW can’t do compute at the same time? So compute will be split off and done at beginning or end. Tier 2, HW supports it and can schedule it. Is this right?
- RC: (reads from D3D12 spec. Discussion about Tier 0, 1, 2.)
- MM: but it’s not true for Vulkan/Metal that you can do compute in render passes.
- DM: it’s going to be true for Vulkan too. Will our pass concept soon become obsolete if APIs start supporting compute in render passes?
- JG: thought we were targeting Android phones without updates.
- CW: we should keep this discussion for the next time. But Kirill has questions from the mailing list about how uploads work.
- CW: you upload to a buffer, ideally mapping it, and then copy to a texture
- Reasoning: there has to be some form of swizzling done by the driver, so the driver performs a copy anyway. Mapping the buffer is preferred because it always avoids extra copies.
- Does this make sense / everyone agree?
- JG: yes.
- DM: main way of getting data to the GPU is making a staging buffer and doing a copy?
- CW: yes, including for texture data.
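In WebGPU as it eventually shipped (long after this meeting), the staging pattern CW describes looks roughly like the sketch below; the rgba8unorm format and the assumption that bytesPerRow meets the 256-byte row-alignment rule are mine:

```typescript
// Sketch of "map a buffer, then copy to the texture": write texel data into
// a mapped staging buffer and let the driver swizzle during the copy.
function uploadTexture(device: GPUDevice, texture: GPUTexture,
                       data: Uint8Array, width: number, height: number): void {
  const bytesPerRow = width * 4; // rgba8unorm; assumes 256-byte alignment holds
  const staging = device.createBuffer({
    size: bytesPerRow * height,
    usage: GPUBufferUsage.COPY_SRC,
    mappedAtCreation: true,
  });
  new Uint8Array(staging.getMappedRange()).set(data);
  staging.unmap();

  const encoder = device.createCommandEncoder();
  encoder.copyBufferToTexture(
    { buffer: staging, bytesPerRow },
    { texture },
    { width, height },
  );
  device.queue.submit([encoder.finish()]);
}
```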
- RC: are we always going to have staging concept for images, 2D canvas, etc.?
- CW: those will probably be magic.
- CW: wanted to give Kirill chance to ask questions.
- KD: there are 3 questions on the email. One about data.
- Extra stuff about mapping buffers, etc. comes from APIs exposing different types of memory and how it’s visible (CPU, GPU, slow/not slow). Since we don’t expose it maybe we don’t need the concept?
- Maybe I’m wrong, but Metal only has a method for updating texture with some bytes?
- JG: buffers in Metal are implicitly mapped
- DM: Metal has the concept but doesn’t expose it as explicitly as other APIs. You can tell Metal to map it first and synchronize between them. Can use managed memory on Metal.
- In regard to mapping and copying: the fact that we’re running in the browser means that we’ll have slow host-visible memory that the user operates on. Whether there’s another level between that and GPU-local memory is an abstraction. Still applicable to the browser scenario.
- MM: is the concern mainly about performance?
- KD: initial concern was, does it make sense to expose this to the application, or just make a magic method like glTexImage2D which just accepts bytes?
- JG: mapping has a number of benefits. Perf-wise, it’ll be just as good if not better than a synchronous method, which has to copy the bytes first. The way TexImage2D is usually implemented is by copying all bytes of the source buffer out, so the data has to be either fully uploaded or copied out by the time you return. Also useful for being able to map shared memory between processes in browsers that have a remote GPU implementation.
- KD: makes sense.
- KR: This also allows you to pipeline copies to a texture that’s currently in flight. There’s significant driver complexity in OpenGL to deal with this.
- JG: to your question about evicting resources: we were just talking about this.
- CW: we talked about what WebGPU impls can do to be nice to the system. For streaming applications, like to keep everything in cache as much as possible. Tag parts of memory less useful than others. Good candidate for either V1 feature, or extension. Not for MVP though.
- KD: perf concern. In explicit APIs if you allocate a resource it’s resident (in VRAM, on hardware). In managed mode the impl would have control. Seems like difference between OpenGL and these other APIs.
- CW: the browser will have to be the arbiter here. Choose (for example) that foreground tab will have higher priority for residency than background tabs. Hopefully that’s good enough. Might have discardable memory eventually but that’s more complex.
- Next week’s meeting cancelled
- Week after: try to have yet another conversation about passes. Would like to make progress on other fronts as well.
- Start talking about things like parameters for copying from textures to textures?
- DM: so exciting.
- MM: best done with a proposal. CW did you just volunteer?
- CW: (grimace) yes. Will make a proposal for copies. Don’t know enough about multisampling to handle it.
- DM: do we need multisampling?
- KR: yes. It’s an area where there are big differences between tilers and desktop.
- MM: so explicit resolve call?
- CW: we need to figure this out. Explicit? Implicit at end of pass?
- JG: do we have any result from the scheduling Doodle?
- CW: I will ping people and send result to the group.