TL;DR I want to support a build of the libc++ library targeting AMDGPU and NVPTX devices in the same vein as libc for GPUs — The LLVM C Library. This will make standard C++ runtime features available on the GPU.

Motivation

Currently, offloading languages like CUDA / HIP / OpenMP can only support C++ features on the GPU that are marked constexpr. Providing a functioning C++ standard library on the GPU would allow more C++ programs to run on the GPU without modification. This is also attractive for efforts like CUDA and HIP stdpar that blur the line between GPU and CPU code.

Obviously, many features of the C++ standard library aren't particularly valuable on the GPU, but if we can compile them, we may as well provide them.

Overview

The proposed build process is to use the same support we use for the LLVM C library for GPUs. This mostly amounts to treating GPU libraries as cross-compilation targets (see Compiler Explorer for an example). This allows us to compile C++ for the GPU with minimal modifications and distribute it as a standard LTO-enabled static library.

Enabling the build will be done using the LLVM_RUNTIME_TARGETS support for cross-compiling to multiple target architectures. This option creates a separate CMake job for each target architecture. Individual arguments can then be passed to these builds using the RUNTIMES_<target>_<option> syntax. This allows us to stack libc++ on top of the existing libc and compiler-rt projects that can also be built for the GPU. The invocation currently looks something like this:

$ cmake ../llvm -GNinja -DCMAKE_BUILD_TYPE=Release                                \
    -DLLVM_ENABLE_PROJECTS='clang;lld' -C ../libcxx/cmake/caches/GPU.cmake        \
    -DRUNTIMES_nvptx64-nvidia-cuda_LLVM_ENABLE_RUNTIMES='compiler-rt;libc;libcxx' \
    -DRUNTIMES_amdgcn-amd-amdhsa_LLVM_ENABLE_RUNTIMES='compiler-rt;libc;libcxx'   \
    -DLLVM_RUNTIME_TARGETS='default;amdgcn-amd-amdhsa;nvptx64-nvidia-cuda'

This will use the LLVM_ENABLE_PER_RUNTIME_DIR option to install the libraries into separate per-target directories so they do not conflict with each other, for example:

<install>/lib/amdgcn-amd-amdhsa/libc.a
<install>/lib/amdgcn-amd-amdhsa/libc++.a
<install>/lib/clang/<major>/lib/amdgcn-amd-amdhsa/libclang_rt.builtins.a

Expected Support

The GPU build will include a config that disables C++ library features that are not easily implemented on the GPU. This currently excludes exceptions, RTTI, filesystem, threads, random device, and locales. Features like streams currently do not work due to locale.h not being implemented in the LLVM libc.
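For illustration, a cache file disabling these features might contain settings along these lines. This is a sketch based on libc++'s existing LIBCXX_ENABLE_* options; the actual GPU cache file may use a different set of knobs.

```cmake
# Sketch of a GPU cache file disabling features that are hard to
# implement on the GPU (names based on existing libc++ options).
set(LIBCXX_ENABLE_EXCEPTIONS OFF CACHE BOOL "")
set(LIBCXX_ENABLE_RTTI OFF CACHE BOOL "")
set(LIBCXX_ENABLE_FILESYSTEM OFF CACHE BOOL "")
set(LIBCXX_ENABLE_THREADS OFF CACHE BOOL "")
set(LIBCXX_ENABLE_RANDOM_DEVICE OFF CACHE BOOL "")
set(LIBCXX_ENABLE_LOCALIZATION OFF CACHE BOOL "")
```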

Required changes

The libc++ project uses the __config header to control which features are available depending on the target. We simply need to provide checks on the __NVPTX__ and __AMDGPU__ macros, similar to how things like BSD or musl are handled. So far, most of the complexity is hidden in the libc project, and I have only needed to disable a handful of features. I have made a series of patches, awaiting the response to this RFC, that allow us to build a functional libc++ on the GPU: [libcxx] Add cache file for the GPU build by jhuber6 · Pull Request #99348 · llvm/llvm-project · GitHub.
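As a rough illustration, such a check in __config might look like the following. The particular _LIBCPP_HAS_NO_* macros gated here are assumptions for the sketch, not the actual patch contents:

```cpp
// Hypothetical sketch of a GPU target check in libc++'s __config,
// modeled on the existing per-platform checks; the gated macros
// shown here are illustrative.
#if defined(__NVPTX__) || defined(__AMDGPU__)
#  define _LIBCPP_HAS_NO_THREADS
#  define _LIBCPP_HAS_NO_LOCALIZATION
#  define _LIBCPP_HAS_NO_FILESYSTEM
#endif
```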

We will need a new configuration to tell libc++ to build using the LLVM C library in-tree, similar to libunwind or libc++abi. This is currently being handled by @petrhosek.

Testing

The goal is to set up a pre-commit CI interface for building and testing the GPU support and add it to the existing builds of the GPU C library. The first step will be to make a CI interface that simply builds the project and then work on getting tests to run.

Testing will be done using the same interface the LLVM C library uses in Testing the GPU C library — The LLVM C Library. To summarize, we use a minimal GPU runtime and startup utilities so the GPU behaves like a normal hosted target that can execute a main function. This will allow us to compile and run trivial unit tests.

The libc++ project has a similar interface using the -Dexecutor option that uses a separate executable to launch the tests. We can simply pass this to the amdhsa-loader after linking in the crt1.o file that contains the _start definition from the C library.

It’s likely that there will be tests that must be disabled; we could add a // UNSUPPORTED: gpu annotation for the lit tests as needed. I will need to investigate the test suite thoroughly to see if there are any unit tests that depend on external libraries.

Future Work

Ideally this project will be merged into existing offloading languages. This may require modifying headers to declare certain things as available with __device__ or similar. The locale.h interface will need to be implemented, at least for the default locale, in the LLVM C library. Allocation functions like aligned_alloc also need to be added to the LLVM C library’s GPU build.

@jdoerfert @Artem-B @yxsamliu @ldionne @philnik @JonChesterfield @mordante @petrhosek @michaelrj-google


I figured I’d give a completely useless example of the type of stuff that’s working right now.

#include <algorithm>
#include <array>
#include <cstdio>
#include <random>
#include <string>

int main(int argc, char **argv) {
  std::mt19937 generator(__builtin_amdgcn_workitem_id_x());

  std::uniform_int_distribution<int> dist(1, 100);
  std::array<int, 8> arr;
  for (auto &x : arr)
    x = dist(generator);

  std::sort(std::begin(arr), std::end(arr));
  std::string str("thread " + std::to_string(__builtin_amdgcn_workitem_id_x()) + ":");
  for (auto &x : arr)
    str += " " + std::to_string(x);
  fprintf(stdout, "%s\n", str.c_str());
}

This can be compiled directly and then run on my gfx1030 card.

$ clang++ test.cpp --target=amdgcn-amd-amdhsa -mcpu=native -fno-exceptions -flto -lc -lc++ -lm $INSTALL/lib/amdgcn-amd-amdhsa/crt1.o -I$INSTALL/include/c++/v1 -I$INSTALL/include/amdgcn-amd-amdhsa/c++/v1/ -O2
$ amdhsa-loader --threads 5 a.out
thread 0: 10 22 45 48 65 68 68 84
thread 1: 6 10 13 38 65 73 76 80
thread 2: 8 16 23 41 44 73 76 83
thread 3: 1 4 20 22 25 57 73 75
thread 4: 2 10 47 51 56 70 73 88

Good stuff. Yep, keen - there’s a lot of capability available from adding relatively few ifdefs to libc++. Presumably build cxxabi and not libunwind.

Having lit run programs under an interpreter sounds useful independent of this. People routinely run test cases through qemu, so lit knowing how to drive that seems like a good thing. I’m unsure what notation would be preferred. One thing that could be useful is running every program through a loader that looks at the binary to decide what to do with it, though run-inter: or similar would be less invasive.


+1

This can strongly enhance the capability of all C++-based offloading languages.


The libc++ test suite already has support for running programs with an external emulator. In the past I’ve used this to cross-compile for FreeBSD on Linux and then configure the libc++ test executor to use ssh.py which copies the test over to a remote system and runs it there. I would imagine it is easy to add another executor script that simply uses your emulator.
It may have changed since I last ran it but you used to be able to invoke lit with -Dexecutor=/path/to/ssh.py.

That’s good; it should not be too difficult to use this since the LLVM libc project already uses LIBC_GPU_LOADER_EXECUTABLE to do pretty much the same thing. I should be able to set up the compilation flags for the GPU target as needed and import the crt1.o file that contains the _start code from the libc project. I will also need some logic to tell it which GPU architecture to use.

One thing I’m somewhat concerned about is the compilation time, since large GPU programs tend to have very long link times. There’s also the issue of the system running out of resources when running a bunch of GPU tests at once. However, we do that for the OpenMP tests (they also use lit) so it may not be a huge issue.

Here’s some initial results running the libc++ test suite on an AMDGPU target. Getting the test suite to run required adding a new config file and setting LIBCXX_TEST_CONFIG="gpu-libc++-shared.cfg.in". Getting the tests to run required LIBCXX_TEST_PARAMS="executor=amdhsa-loader".

Testing Time: 3337.23s

Total Discovered Tests: 9729
  Unsupported      : 2738 (28.14%)
  Passed           : 6122 (62.93%)
  Expectedly Failed:   54 (0.56%)
  Failed           :  815 (8.38%)

First observation is that it takes quite a while to both compile and run all of the tests. There were about four remaining tests I had to kill manually because they were taking too long (sort and pow tests). For various reasons, we currently need these builds to use monolithic LTO, which drastically inflates the link times (over a minute is common, while the average is about ten seconds). This is a known issue for GPU targets.

The failing tests generally fell into a few categories; I’ll try to reduce the backend issues further.

  1. Lack of a c++abi leaving __cxa_pure_virtual or __cxa_guard_acquire unresolved.
  2. Backend failures on [AMDGPU] Backend code generation fails on internal function at O0 · Issue #64863 · llvm/llvm-project · GitHub .
  3. Backend / SDAG failures on "Memory operands expect pointer values" from VisitInlineAssembly (This seems to be more unhandled asm memory constraints).
  4. The GPU libc not providing posix_memalign yet.
  5. Lack of facos, fasin, fatan, fcosh, fexp, flog, flog10, fsinh, ftan, ftanh libcalls. (We don’t support any math libcalls in the GPU backends).
  6. Inline assembly failing with “couldn’t allocate output register for constraint ‘r’”.
  7. unistd.h not being found because the GPU does not export it.
  8. Any use of std::signal.
  9. Reference to unresolved using declaration for atan2l and others. The GPU does not support long double at all so we do not export any of these functions.
  10. Various runtime failures (Running tests in parallel probably exhausted resources on my machine a few times)

This isn’t a comprehensive list, but it’s a start at least.

I got a build of libc++abi working on the GPU with the following configuration as well. This provides a lot of the missing symbols; however, things like __cxa_guard_acquire are going to be broken for any non-trivial case because we can’t implement a generic mutex on the GPU.

set(LIBCXXABI_BAREMETAL ON CACHE BOOL "")
set(LIBCXXABI_ENABLE_SHARED OFF CACHE BOOL "")
set(LIBCXXABI_ENABLE_EXCEPTIONS OFF CACHE BOOL "") 
set(LIBCXXABI_ENABLE_THREADS OFF CACHE BOOL "") 
set(LIBCXXABI_ENABLE_NEW_DELETE_DEFINITIONS OFF CACHE BOOL "") 
set(LIBCXXABI_USE_LLVM_UNWINDER OFF CACHE BOOL "")
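To illustrate why __cxa_guard_acquire matters: the compiler wraps the one-time initialization of every function-local static in guard calls so that concurrent first calls are safe. A minimal host-side sketch of the semantics (plain C++, nothing GPU-specific):

```cpp
#include <cassert>

// A function-local static is initialized exactly once, on first call.
// The compiler emits __cxa_guard_acquire/__cxa_guard_release around
// this initialization so that concurrent first calls are safe. On a
// GPU with no generic mutex, that guarantee cannot be provided.
static int init_count = 0;

int get_value() {
  static int value = ++init_count; // guarded one-time initialization
  return value;
}
```

On the GPU, two threads hitting get_value() for the first time simultaneously could both attempt the initialization, which is exactly the non-trivial case that breaks.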

Thanks for the RFC. The overall plan looks reasonable to me, however one important point I’d like to mention is that we probably don’t want to add UNSUPPORTED: gpu or anything of the like. Instead, we want to find out what fundamental capabilities are not supported on GPUs and make generic opt-outs for them. This is important for maintenance and reusability of these knobs, and I encourage you to think that way as you make progress.

Similarly, we want to keep the number of #ifdefs for GPU-specific things extremely low. Instead, we want to figure out what other general-purpose _LIBCPP_HAS_NO_<foo> might need to be introduced to support GPUs.

Apart from that, the general plan seems sensible to me. I assume the compiler used to build for these targets is a regular upstream Clang?

Getting this build to work only required a few really minor tweaks. I added _LIBCPP_TARGETING_GPU, which is only used to force it to use clock_gettime and not try to use isatty. There’s also a single check I needed to disable the __start_*/__stop_* symbols from being generated, because these aren’t supported at all for NVPTX. These are really minor, so I don’t think they warrant a new config since we already hard-code stuff like ifdef __OpenBSD__ in the same places I added these checks; see here, here, and here.

Getting the entire suite to run clean is going to be a lot of work; I’ll want to be able to change a few things. Right now I have issues with inline assembly using invalid constraints, as well as uses of posix_memalign instead of aligned_alloc. For others, it might be worthwhile to put in the test config that certain directories are disabled; then the GPU can just set some list of those.
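Since lit test configs are Python, a directory opt-out in a GPU config could hypothetically look like the fragment below. config.excludes and config.available_features are standard lit config attributes, but the feature name and directory list here are purely illustrative:

```python
# Hypothetical fragment of a GPU lit config; directory names are
# illustrative. `config.excludes` skips matching directories and
# files during test discovery.
if "gpu" in config.available_features:
    config.excludes = ["localization", "thread", "filesystems"]
```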

Yes, all of this goes through the runtime support. With the patches I linked above applied I use this cache file to be able to build it [libcxx] Add cache file for the GPU build by jhuber6 · Pull Request #99348 · llvm/llvm-project · GitHub. We will always use upstream clang for GPU builds, so it makes it easier to work with.

So, I think right now it would be best to merge all of my functional changes + cache files. That way I can get a CI bot set up to at least build the libc++ and libc++abi projects. Then I will merge the GPU test config support with the expectation that many tests will fail. I’ll then slowly fix things to hopefully get that number lower and lower. Does this sound reasonable?

Thanks for the comments, I just landed the last patch required for basic building support. Additional support for running the test suite on the GPU will follow. I’ll also try to better document what’s actually expected to work at the moment.

For now, this can be built by adding the following CMake options to your build.

    -DRUNTIMES_nvptx64-nvidia-cuda_CACHE_FILES=${LLVM_PROJ}/libcxx/cmake/caches/NVPTX.cmake \
    -DRUNTIMES_amdgcn-amd-amdhsa_CACHE_FILES=${LLVM_PROJ}/libcxx/cmake/caches/AMDGPU.cmake  \
    -DRUNTIMES_nvptx64-nvidia-cuda_LLVM_ENABLE_RUNTIMES='compiler-rt;libcxxabi;libcxx;libc' \
    -DRUNTIMES_amdgcn-amd-amdhsa_LLVM_ENABLE_RUNTIMES='compiler-rt;libcxxabi;libcxx;libc'   \
    -DLLVM_RUNTIME_TARGETS='default;amdgcn-amd-amdhsa;nvptx64-nvidia-cuda'

Compiling directly for the GPU, the way the unit tests will do it, can be done as follows:

$ clang++ hello.cpp -stdlib=libc++ --target=amdgcn-amd-amdhsa -flto -mcpu=native ./lib/amdgcn-amd-amdhsa/crt1.o -lc -lc++ -lc++abi; amdhsa-loader a.out
Hello World

Compiling for offloading languages like CUDA or HIP will require more work to define these functions as __host__ __device__.
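A rough sketch of what that could look like is below. The _LIBCPP_HOST_DEVICE macro name is an assumption for illustration; it expands to nothing outside CUDA/HIP compilation, so the header remains valid plain C++:

```cpp
#include <cassert>

// Hypothetical annotation macro: __host__ __device__ are CUDA/HIP
// keywords that make a function callable from both host and device
// code; for plain C++ compilation the macro expands to nothing.
#if defined(__CUDACC__) || defined(__HIP__)
#  define _LIBCPP_HOST_DEVICE __host__ __device__
#else
#  define _LIBCPP_HOST_DEVICE
#endif

// Example of a header function annotated for host and device use.
_LIBCPP_HOST_DEVICE inline int clamp_to_byte(int x) {
  return x < 0 ? 0 : (x > 255 ? 255 : x);
}
```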
