GPU Crash Debugging and Reporting
A video demonstrating the various tools is available at https://youtu.be/WIUhaVXFDkc
Parts of this page have moved to the Atom Renderer Guide. As more changes are made here, please also plan to submit them to the o3de.org repository. See the Contributing to O3DE Documentation Guide for submitting changes.
NOTE: A "device removal" in the logs counts as a GPU crash.
So you've encountered a crash in the GPU 🐞, what are your options?
If you want to debug the issue yourself, follow the steps in the GPU crash debugging section further down. If not, the most helpful thing you can do is file an issue and provide as much information as possible. Standard reporting practices apply (minimal reproducible steps are awesome), but it's also recommended that you try an alternative RHI (instructions below) and enable validation to see whether additional information is produced that you could include in the bug report. For users with multiple GPUs (e.g. multiple discrete GPUs, or an integrated GPU), instructions for switching the selected GPU are described below as well.
When reporting a crash, please indicate your GPU vendor (e.g. Nvidia, AMD, Intel, Apple), device name (e.g. RTX2080), and driver version. Also indicate which RHI you were using (e.g. Vulkan, D3D12, Metal).
For machines that have multiple GPUs, you can tell the runtime to prioritize one device over another with a CLI option. At the command line, pass -forceAdapter="your_device". Here, your_device can be any case-insensitive partial match for the device you wish to select.
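To make the matching rule concrete, the sketch below mimics a case-insensitive partial match over a list of adapter names. It is an illustration only, not the engine's selection code: the helper name `pick_adapter` and the adapter names are hypothetical, and the real runtime may resolve multiple matches differently.

```python
def pick_adapter(adapter_names, requested):
    """Return the first adapter whose name contains `requested`, ignoring case.

    Hypothetical helper for illustration only; the engine's actual
    selection logic may differ (for example in how ties are resolved).
    """
    needle = requested.lower()
    for name in adapter_names:
        if needle in name.lower():
            return name
    return None  # nothing matched the requested substring


# Example adapter names (assumed, not queried from real hardware).
adapters = ["Intel(R) UHD Graphics 630", "NVIDIA GeForce RTX 2080"]
selected = pick_adapter(adapters, "rtx")
```

Under this reading, -forceAdapter="rtx" and -forceAdapter="RTX 2080" would both pick the same device on the example machine.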
You can select a non-default RHI using the -rhi command line option. For example, -rhi=vulkan selects the Vulkan RHI. Currently this only applies on Windows, where D3D12 is the default backend. To make Vulkan your default RHI through the settings registry, add the following, which gives Vulkan top priority:
{
    "O3DE":
    {
        "Atom":
        {
            "RHI":
            {
                "FactoryManager":
                {
                    "factoriesPriority": [ "vulkan" ]
                }
            }
        }
    }
}
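If you prefer to generate this settings file from a script, a minimal sketch follows. The file name `rhi_priority.setreg` and the target directory are assumptions for illustration; where settings registry files are picked up depends on your project setup, so place the file wherever your project already keeps its registry settings.

```python
import json
import tempfile
from pathlib import Path


def write_rhi_priority_setreg(directory, priorities, file_name="rhi_priority.setreg"):
    """Write a settings registry file that sets the RHI factory priority.

    `file_name` and `directory` are hypothetical choices for this sketch.
    """
    settings = {
        "O3DE": {
            "Atom": {
                "RHI": {
                    "FactoryManager": {
                        "factoriesPriority": list(priorities)
                    }
                }
            }
        }
    }
    path = Path(directory) / file_name
    path.write_text(json.dumps(settings, indent=4), encoding="utf-8")
    return path


# Demonstrate in a temporary directory so no project files are touched.
with tempfile.TemporaryDirectory() as tmp:
    setreg_path = write_rhi_priority_setreg(tmp, ["vulkan"])
    written = json.loads(setreg_path.read_text(encoding="utf-8"))
```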
Driver validation is enabled by default on debug builds. To enable RHI-level validation explicitly, pass one of the following values to the CLI option -rhi-device-validation=. For Vulkan, be sure that you have the SDK installed before enabling device validation (SDK link).
-rhi-device-validation="enable"
-rhi-device-validation="verbose"
-rhi-device-validation="gpu"
The CLI option takes precedence over other behavior across all configurations.
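When collecting information for a bug report, it can help to script the launch so the same flags are used on every run. The sketch below only assembles an argument list from the options described on this page; the function name `build_launch_args` and the executable name are hypothetical. The quotes shown in the option examples are shell quoting, so they are omitted when the arguments are passed as a list.

```python
# Values accepted by -rhi-device-validation, as listed on this page.
VALIDATION_MODES = ("enable", "verbose", "gpu")


def build_launch_args(executable, rhi=None, validation=None, adapter=None):
    """Assemble a launch command line (hypothetical helper for illustration)."""
    args = [executable]
    if rhi:
        args.append(f"-rhi={rhi}")
    if validation is not None:
        if validation not in VALIDATION_MODES:
            raise ValueError(f"unknown validation mode: {validation!r}")
        args.append(f"-rhi-device-validation={validation}")
    if adapter:
        args.append(f"-forceAdapter={adapter}")
    return args


# "MyProject.GameLauncher" is a placeholder executable name.
launch_args = build_launch_args(
    "MyProject.GameLauncher", rhi="vulkan", validation="gpu", adapter="rtx"
)
```

The resulting list can be handed to `subprocess.run(launch_args)` once the executable name is replaced with a real launcher.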
You have a few options when trying to debug GPU TDRs (Timeout Detection and Recovery) or device removed errors.
- DX12
  - DRED: how to enable and use DRED logs is described at https://www.o3de.org/docs/atom-guide/dev-guide/troubleshoot/
  - Aftermath
    - Download the Aftermath SDK.
    - Enable the Aftermath code either through the CMake variable LY_AFTERMATH_PATH or by adding an environment variable called ATOM_AFTERMATH_PATH, pointing it to the path where the SDK was downloaded and unzipped.
    - Reconfigure and recompile. You should now have Aftermath support.
    - If a GPU crash or TDR occurs, and as long as the computer did not restart or blue-screen, the O3DE application writes an Aftermath dump to the same folder as the executable. This dump file can be opened in the NVIDIA Nsight Graphics application.
    - Aftermath does not work well with PIX and RenderDoc. To disable it, use either of the two methods below:
      - Remove the environment variable ATOM_AFTERMATH_PATH or disable the CMake variable.
      - Rename or delete the folder where the SDK was downloaded.
- Vulkan
  - Aftermath: same steps as DX12.
- Metal
  - The debug configuration automatically enables enhanced command buffer errors, so the error log provided by the driver should be printed.
- CPU GPU lockstep mode (applicable to all backends): Another option is to run the application in a special mode where the CPU runs in lockstep with the GPU, so that when the GPU crashes the CPU breaks right after. In this mode each pass is forced into a separate command list, the name of the last executing pass is cached, and a dialog box showing that name appears once a GPU crash has occurred. To enable this mode, follow these steps:
  - Uncomment AZ_FORCE_CPU_GPU_INSYNC in Gems/Atom/RHI/Code/Include/Atom/RHI.Reflect/Base.h
  - Recompile and run.
  - When the GPU crashes, the CPU also crashes or hangs, which lets you inspect the main thread; it should have called execute/commit on the work related to the pass that crashed the GPU.
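The idea behind lockstep mode can be illustrated with a small simulation. This is a conceptual sketch only, not Atom's implementation: the exception type, the `submit_and_wait` callback, and the pass names are all hypothetical stand-ins. The point is that submitting one pass at a time and waiting for it to finish means the cached pass name identifies the work that was running when the device was lost.

```python
class DeviceRemovedError(RuntimeError):
    """Stand-in for a GPU crash / device removal (hypothetical)."""


def run_passes_in_lockstep(passes, submit_and_wait):
    """Submit each pass separately and wait for it before submitting the next.

    Returns the name of the pass that was executing when the device was
    lost, or None if every pass completed.
    """
    last_pass_name = None
    for name, work in passes:
        last_pass_name = name      # cache the name before submitting
        try:
            submit_and_wait(work)  # CPU blocks until the GPU finishes this pass
        except DeviceRemovedError:
            return last_pass_name  # the cached name points at the culprit
    return None


def fake_submit_and_wait(work):
    """Simulated GPU submission: work marked 'bad' crashes the device."""
    if work == "bad":
        raise DeviceRemovedError("device removed")


# Hypothetical frame with three passes; the second one crashes the GPU.
frame = [("DepthPrePass", "ok"), ("ShadowPass", "bad"), ("ForwardPass", "ok")]
crashed_pass = run_passes_in_lockstep(frame, fake_submit_and_wait)
```

Without lockstep, several passes may be in flight when the crash is detected, which is why the CPU call stack alone usually does not identify the failing pass.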