CUDA Pro Tip: Improve NVIDIA Visual Profiler Loading of Large Profiles

Post updated on December 10, 2024. NVIDIA has deprecated nvprof and NVIDIA Visual Profiler and these tools are not supported on current GPU architectures. The original post still applies to previous GPU architectures, up to and including Volta. For Volta and newer architectures, profile your applications with NVIDIA Nsight Compute and NVIDIA Nsight Systems. For more information about how to transition to use current GPU profiling tools, see Migrating to Nsight Tools.

Some applications launch many tiny kernels, making them prone to very large (100s of megabytes or larger) nvprof timeline dumps, even for application runs of only a handful of seconds.

Such nvprof files may not even load when you try to import them into NVIDIA Visual Profiler (NVVP). One symptom of this problem is that when you choose Finish on the import screen, NVVP stalls for a minute or so and then returns to the import screen, asking you to choose Finish again. In other cases, loading a large file can take many hours.

This problem is caused by the Java maximum heap size set in the libnvvp/nvvp.ini file of the CUDA Toolkit installation. The profiler configures the Java VM to cap the heap at 1 GB so that it works even on systems with little physical memory. Although 1 GB is an improvement over the 512 MB setting used in earlier CUDA versions, it is still not enough for some applications, because the profiler's memory footprint can be 4-5x the input file size or more.
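To put the 4-5x figure into numbers, here is a back-of-the-envelope sketch. The helper names are illustrative, and the factor is the rough rule of thumb from this post, not a measured constant.

```python
# Rule of thumb from this post: NVVP's memory footprint can be 4-5x the
# size of the profile file it imports. The factor is an estimate, not a constant.

def estimated_footprint_mb(profile_mb: float, factor: float = 5.0) -> float:
    """Approximate heap NVVP needs to import a profile of profile_mb megabytes."""
    return profile_mb * factor


def largest_profile_mb(heap_mb: float, factor: float = 5.0) -> float:
    """Approximate largest profile (MB) that fits in a heap capped at heap_mb."""
    return heap_mb / factor


if __name__ == "__main__":
    for heap_mb in (512, 1024):  # older and current -Xmx defaults
        low = largest_profile_mb(heap_mb, factor=5.0)
        high = largest_profile_mb(heap_mb, factor=4.0)
        print(f"-Xmx{heap_mb}m: profiles up to roughly {low:.0f}-{high:.0f} MB")
    need_low = estimated_footprint_mb(300, factor=4.0)
    need_high = estimated_footprint_mb(300, factor=5.0)
    print(f"300 MB profile: roughly {need_low:.0f}-{need_high:.0f} MB of heap")
```

With the 1 GB cap, this estimate tops out at profiles of roughly 205-256 MB. That matches the range where the import problems described above begin.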

Many modern workstations have far more than 1 GB of physical memory, so you can adjust this setting to suit your needs and your system's physical memory, which improves NVVP's ability to import larger data files. Out of the box, the nvvp.ini configuration file looks like the following. The relevant line is -Xmx1024m:

-data
@user.home/nvvp_workspace
-vm
../jre/bin/java
-vmargs
-Xmx1024m
-Dorg.eclipse.swt.browser.DefaultType=mozilla

The primary goal, then, is to change 1024m to something bigger. The size you pick depends on your situation.
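The change touches a single line, so it can be scripted. The following Python sketch rewrites the -Xmx line and keeps a backup of the original file. set_max_heap and patch_file are hypothetical helpers, not part of the CUDA Toolkit.

```python
import re
from pathlib import Path

# A whole line holding an -Xmx option such as "-Xmx1024m" or "-Xmx22g".
# The lookahead tolerates CRLF line endings and leaves the "\r" in place.
XMX_LINE = re.compile(r"^-Xmx\d+[kKmMgG]?(?=\r?$)", re.MULTILINE)


def set_max_heap(ini_text: str, new_value: str) -> str:
    """Return ini_text with its -Xmx line replaced by -Xmx<new_value>.

    new_value is a JVM size string such as "8g" or "8192m".
    """
    if not re.fullmatch(r"\d+[kKmMgG]?", new_value):
        raise ValueError(f"not a JVM size: {new_value!r}")
    updated, count = XMX_LINE.subn(f"-Xmx{new_value}", ini_text)
    if count != 1:
        raise ValueError(f"expected exactly one -Xmx line, found {count}")
    return updated


def patch_file(path: Path, new_value: str) -> None:
    """Rewrite path in place, saving the original next to it as <name>.bak."""
    original = path.read_text()
    updated = set_max_heap(original, new_value)  # validate before writing anything
    path.with_name(path.name + ".bak").write_text(original)
    path.write_text(updated)
```

For example, patch_file(Path("/usr/local/cuda/libnvvp/nvvp.ini"), "22g") assumes the default Linux install prefix. Adjust the path to your toolkit's libnvvp directory, and expect to need administrator rights. The helper requires exactly one -Xmx match so that it does not silently edit the wrong line if the file layout differs.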

For example, my workstation has 24 GB of system memory, and I happen to know that I won’t need to run any other memory-intensive applications at the same time as NVVP, so it’s okay for NVVP to take up the vast majority of that space. So I might pick, say, 22 GB as the maximum heap size, leaving a few gigabytes for the OS, GUI, and any other programs that might be running.
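The sizing above is simple arithmetic: physical memory minus a reserve for everything else. The sketch below does that subtraction and can also read the total from the machine. It assumes a POSIX system where Python's os.sysconf exposes SC_PAGE_SIZE and SC_PHYS_PAGES (true on Linux, not available on Windows). The 2 GB default reserve mirrors the example and is a judgment call, not a rule.

```python
import os

GIB = 2**30  # JVM size suffixes are binary: -Xmx22g means 22 * 2**30 bytes


def physical_memory_bytes() -> int:
    """Total physical memory as reported by POSIX sysconf (Linux; not Windows)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")


def heap_cap_gb(physical_bytes: int, reserve_gb: int = 2) -> int:
    """Whole GiB left for NVVP after holding back reserve_gb for the OS and GUI."""
    cap = physical_bytes // GIB - reserve_gb
    if cap < 1:
        raise ValueError("no memory left for NVVP after the reserve")
    return cap


if __name__ == "__main__":
    # The workstation from the text: 24 GB of RAM with about 2 GB held back.
    print(f"-Xmx{heap_cap_gb(24 * GIB, reserve_gb=2)}g")
    print(f"this machine reports {physical_memory_bytes() / GIB:.1f} GiB of RAM")
```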

While you’re at it, you can make a few other configuration tweaks as well:

Increase the initial heap size (the size Java starts with) to 2 GB. (-Xms)
Tell Java to run in 64-bit mode rather than 32-bit mode (only works on 64-bit systems). This is required if you want heap sizes >4 GB. (-d64)
Enable Java's concurrent mark-sweep (CMS) garbage collector in incremental mode, which helps both to reduce the memory required for a given input size and to handle out-of-memory errors more gracefully. (-XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode)
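The three tweaks above can be combined into a small generator for the whole file. build_nvvp_ini is a hypothetical helper. It copies the stock paths and browser property and mirrors the line order of the modified file in this post, rather than any separately documented format. It also encodes the two constraints stated here: -Xms should not exceed -Xmx, and heaps above 4 GB need 64-bit mode. Be aware that newer Java releases deprecated and later removed CMS and its incremental mode, so these GC flags only make sense for the older runtimes NVVP targets.

```python
# Hypothetical helper: assembles nvvp.ini text with the tweaks listed above.
# Paths and the browser property are copied from the stock file; the line
# order mirrors the modified file shown in this post.

def build_nvvp_ini(xms_gb: int, xmx_gb: int,
                   use_64bit: bool = True, use_cms: bool = True) -> str:
    if xms_gb > xmx_gb:
        raise ValueError("initial heap (-Xms) must not exceed maximum heap (-Xmx)")
    if xmx_gb > 4 and not use_64bit:
        raise ValueError("heap sizes above 4 GB need 64-bit mode (-d64)")

    lines = ["-data", "@user.home/nvvp_workspace", "-vm", "../jre/bin/java"]
    if use_64bit:
        lines.append("-d64")
    lines.append("-vmargs")
    lines += [f"-Xms{xms_gb}g", f"-Xmx{xmx_gb}g"]  # initial and maximum heap
    if use_cms:
        # Concurrent mark-sweep collector, run in incremental mode.
        lines += ["-XX:+UseConcMarkSweepGC", "-XX:+CMSIncrementalMode"]
    lines.append("-Dorg.eclipse.swt.browser.DefaultType=mozilla")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_nvvp_ini(xms_gb=2, xmx_gb=22), end="")
```

Writing the output over libnvvp/nvvp.ini still requires administrator access, and keeping a copy of the stock file makes it easy to revert.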

I changed my nvvp.ini file to the following. Tailor the -Xmx setting to your available system memory and input size as described earlier, but use at least 5-6x the input file size. On most CUDA installations, modifying this file requires administrator or root access.
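The 5-6x guideline can be turned into a minimum -Xmx value. This sketch rounds up to whole gigabytes and rejects a requirement that exceeds the memory you are willing to give NVVP. minimum_xmx_gb and its defaults are illustrative. MB here means 1024-based megabytes, to match the JVM's m and g suffixes.

```python
import math


def minimum_xmx_gb(profile_mb: float, cap_gb: int, factor: float = 6.0) -> int:
    """Smallest whole-GB -Xmx that is factor x the profile size, checked against cap_gb.

    factor=6.0 takes the upper end of the 5-6x guideline; cap_gb is the
    amount of memory you have decided NVVP may use on this machine.
    """
    needed_gb = math.ceil(profile_mb * factor / 1024)
    if needed_gb > cap_gb:
        raise ValueError(
            f"a {profile_mb} MB profile wants about {needed_gb} GB of heap, "
            f"but only {cap_gb} GB is available"
        )
    return needed_gb


if __name__ == "__main__":
    for profile_mb in (300, 800, 2500):
        print(f"{profile_mb} MB profile -> at least -Xmx{minimum_xmx_gb(profile_mb, cap_gb=22)}g")
```

A larger value, such as the 22 GB used here, leaves more headroom above this minimum.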

-data
@user.home/nvvp_workspace
-vm
../jre/bin/java
-d64
-vmargs
-Xms2g
-Xmx22g
-XX:+UseConcMarkSweepGC
-XX:+CMSIncrementalMode
-Dorg.eclipse.swt.browser.DefaultType=mozilla

Thanks to this change, I can load profiles of hundreds of megabytes in file size in seconds instead of hours.
