Tested with: AMD Radeon PRO W7900 (Navi 31 GPU / RDNA3 architecture). Other GPUs will probably work great with small tweaks, but I don't know what those tweaks might be. Please discover for yourself and report back!
Here's a very basic shell script that does all the things described below. Use it at your own risk. DO understand what it does. Otherwise, read the detailed instructions.
In fact, we really recommend reading the detailed instructions anyway. There's some good info in there.
```bash
#! /bin/bash
# This script hasn't been tested end-to-end. Be prepared to troubleshoot, or just run commands one-by-one.
# First, create a folder to hold your Invoke installation, like: mkdir ~/invoke; cd ~/invoke.
# you'll need at least 30GB of disk space for this to be useful - use your best judgement.
# Create and activate your venv.
python -m venv .venv
source .venv/bin/activate
# install invoke from pypi
pip install invokeai --extra-index-url https://download.pytorch.org/whl/nightly/rocm6.2
# remove cpu-only torch/torchvision and install ROCm-specific versions
pip uninstall -y torch torchvision
# YES, install THESE SPECIFIC VERSIONS until torch/rocm bugs are fixed, at least for the newer GPUs. See linked Github issues at the end of this gist.
pip install --pre torch==2.6.0.dev20240913 torchvision==0.20.0.dev20240913+rocm6.2 --index-url https://download.pytorch.org/whl/nightly/rocm6.2
# install rocm-compatible bitsandbytes. (UPDATE: it probably won't work anyway for now, missing some libs. This is fine.)
pip uninstall -y bitsandbytes
pip install bitsandbytes@git+https://github.com/ROCm/bitsandbytes.git
### We're done installing things. READ BELOW for the possibly required environment variables before running the software.
```
- Ensure that your AMD GPU drivers are correctly installed, the `amdgpu` kernel module is loaded, both `rocm-smi` and `amd-smi` return information about the GPU, and you're overall happy and convinced that your GPU's compute capability is working correctly. Highly recommend `nvtop` for monitoring your GPU; despite the name, it works with AMD just fine. A quick sanity check is sketched below.
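
  As a minimal sanity check (a sketch, assuming the ROCm CLI tools are already on your PATH), something like this should show your GPU before you go any further:

  ```bash
  # Both tools should list the GPU (the W7900 in my case).
  rocm-smi
  amd-smi list   # amd-smi ships with recent ROCm releases; its subcommands may vary between versions
  # Optional: live monitoring of GPU load and VRAM while you generate.
  nvtop
  ```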
- Install ROCm 6.2 or newer (I tested with 6.2.2).
- ⚠️ Do NOT use the official Invoke installer - it has no idea what to do about any of this. We're doing this old-school 😎
- ⚠️ Do NOT use Docker - for now. We love it very much, but it's out of scope for AMD stuff at the moment. It should be relatively easy to make it work using these steps, though.
- Create and activate a new virtual environment using Python 3.11. Other versions may work; try if you wish.

  ```bash
  python -m venv .venv
  source .venv/bin/activate
  ```
  - Will 3.10 work? Maybe, maybe not.
  - Will 3.12 work? Probably. Try it and let us know!
  - Can I use `pyenv`? Please do!! `pyenv local 3.11` - see the sketch after this list.
  - Can I use `uv`? Yes, but YMMV and you might need to use a combination of `uv` and `pip`. We don't recommend it at this time, just for sanity's sake (but we ❤️ `uv`!).
  - You will likely need more than 20GB of disk space just for the virtualenv, most of it for pytorch and the rocm+nvidia libs (yes, they are still needed - this is still CUDA under the hood).
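
  If you go the `pyenv` route, the flow is roughly this (a sketch, assuming pyenv is installed and its shims are on your PATH; recent pyenv resolves a bare `3.11` to the latest 3.11.x, older versions may want an explicit patch version):

  ```bash
  pyenv install 3.11      # skip if a suitable 3.11.x is already installed
  pyenv local 3.11        # pin this directory to Python 3.11
  python --version        # should report 3.11.x
  python -m venv .venv
  source .venv/bin/activate
  ```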
- With your virtual environment activated, install the Invoke python package:

  ```bash
  pip install invokeai --extra-index-url https://download.pytorch.org/whl/nightly/rocm6.2
  ```

  - This just installed `torch` with CPU support only. This is expected; we will fix it next (you can verify with the snippet below).
  - We're pointing at the pytorch-rocm nightly index because otherwise we'd get `torch` with CUDA support, which is useless to us here.
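
  To see for yourself that the torch you just got is the CPU-only build (a quick, optional check):

  ```bash
  # A CPU-only wheel reports no HIP version ("None") and no usable GPU ("False").
  python -c "import torch; print(torch.__version__, torch.version.hip, torch.cuda.is_available())"
  ```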
- Uninstall `torch` and `torchvision`:

  ```bash
  pip uninstall -y torch torchvision
  ```
- Install `torch` and `torchvision` with ROCm support:

  ```bash
  pip install --pre torch==2.6.0.dev20240913 torchvision==0.20.0.dev20240913+rocm6.2 --index-url https://download.pytorch.org/whl/nightly/rocm6.2
  ```

  ⚠️ Pay attention: `--index-url`, NOT `--extra-index-url`. The latter will just get you CPU torch again. Also, `--pre` is required because we're installing pre-release versions of `torch` and `torchvision`.

  - This will install known-working ROCm 6.2-specific `torch` v2.6.0 and `torchvision` v0.20.0 (as of this writing). See the verification snippet after this item.
  - You will see a scary red warning like this:

    ```
    ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
    invokeai 5.2.0 requires torch==2.4.1, but you have torch 2.5.0 which is incompatible.
    invokeai 5.2.0 requires torchvision==0.19.1, but you have torchvision 0.20.0 which is incompatible.
    ```

    This is fine. (But for the record, that's why we can't (yet) use `uv` or make this process less painful by bumping our `pyproject.toml` dependencies.)
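
  Before moving on, it's worth confirming that this torch build actually sees the GPU (a minimal check; the device name will differ on other cards):

  ```bash
  # On a working ROCm build this prints the HIP version, True, and your GPU's name.
  python -c "import torch; print(torch.version.hip, torch.cuda.is_available(), torch.cuda.get_device_name(0))"
  ```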
- Uninstall the currently installed `bitsandbytes` and install a ROCm-compatible version from the ROCm fork:

  ```bash
  pip uninstall -y bitsandbytes
  pip install bitsandbytes@git+https://github.com/ROCm/bitsandbytes.git
  ```

  Same as above, you will see a scary red warning. This is fine (I think) - please report back if it isn't.

  (UPDATE: it seems that even doing it this way, there are some missing libraries needed for BNB. Stay tuned. FOR NOW, ignore the warning; this is fine.)
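
  If you're curious where `bitsandbytes` currently stands on your box, a bare import is enough to surface the missing-library complaints (purely diagnostic - Invoke will still run without it):

  ```bash
  # Either prints the installed version, or errors/warns about the ROCm libraries it can't find.
  python -c "import bitsandbytes as bnb; print(bnb.__version__)"
  ```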
- The following may or may not be needed for your specific GPU, or some other values may be required. This was determined experimentally for the AMD W7900 specifically. You can probably run all of the below `export` commands as a script - a launcher sketch follows the run instructions below.

  ```bash
  # Set your GFX version as appropriate for the GPU. If this is set wrong, the GPU won't be recognized.
  # for 7xxx cards
  export HSA_OVERRIDE_GFX_VERSION=11.0.0
  # for 6xxx cards
  export HSA_OVERRIDE_GFX_VERSION=10.3.0

  # This controls HIP memory allocation, and you may need to experiment with some of these values
  # to get the best performance (or even to generate at all).
  # Expandable segments help prevent memory fragmentation. They are not supported on all
  # architectures, but it doesn't hurt to have them enabled.
  export PYTORCH_HIP_ALLOC_CONF=expandable_segments:True,garbage_collection_threshold:0.8,max_split_size_mb:512

  # This avoids the `Attempting to use hipBLASLt on a unsupported architecture!` errors.
  # Might not be needed for all architectures. This is related to the bug that has to be solved
  # by installing the specific Torch prerelease versions (see the FAQ).
  export TORCH_BLAS_PREFER_HIPBLASLT=0

  # This helps with performance. In my tests, and in conjunction with the other parameters above,
  # this significantly increases generation speed (1.09it/s to 2.8it/s with SDXL on the W7900).
  export TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1
  ```
After all of the above is done, we can run Invoke using the usual CLI:

```bash
invokeai-web
```

You should see your GPU being detected, like:

```
[2024-10-20 01:50:53,780]::[InvokeAI]::INFO --> Using torch device: AMD Radeon PRO W7900 Dual Slot
```

Open the browser, go to http://localhost:9090, install some models, INVOKE!! 🎉
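
One convenient way to keep the environment variables and the launch command together is a tiny wrapper script next to your venv. This is only a sketch - the filename is made up, and the values are the ones that worked on the W7900 above, so adjust for your card:

```bash
#!/bin/bash
# run-invoke.sh (hypothetical name): activate the venv, set the ROCm-related
# environment variables, then start the Invoke web UI.
cd "$(dirname "$0")"
source .venv/bin/activate

export HSA_OVERRIDE_GFX_VERSION=11.0.0   # 7xxx cards - adjust for your GPU
export PYTORCH_HIP_ALLOC_CONF=expandable_segments:True,garbage_collection_threshold:0.8,max_split_size_mb:512
export TORCH_BLAS_PREFER_HIPBLASLT=0
export TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1

exec invokeai-web
```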
- Does Flux work?
  - Not yet, or at least not in my testing. But that's our next frontier.
- There is a very long delay between the very first progress image (step) and the subsequent generation steps. Is this normal?
  - Yes, it seems to be normal and has long been the case on AMD GPUs.
  - There is also a delay when the VAE decodes the final latents into the image (at the very end of generation). It can take a minute or more. BE PATIENT.
  - This should only happen on the first generation, and may have something to do with GPU memory management. More research is needed here - see the monitoring snippet below.
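
  If you want to watch what the GPU is doing during that first slow step (purely observational; `nvtop` works too, as noted above):

  ```bash
  # Refresh VRAM usage once a second while a generation is running.
  watch -n 1 rocm-smi --showmeminfo vram
  ```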
- Can I make it go faster?
  - It won't be as fast as a recent NVIDIA GPU, but please do experiment with different settings and let us know.
  - I get `~2.82it/s` with SDXL on the AMD Radeon PRO W7900. Not great, not terrible.
  - Try setting the attention type to `sliced` instead of the default `torch-sdp` for a minor speed-up (see the config sketch below). Experiment with the slice size, but expect incorrect values to produce garbage.
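
  The attention settings live in `invokeai.yaml` in your Invoke root directory. A hedged example of what that might look like - the key names below are assumed from InvokeAI's config schema, so double-check them against the version you installed:

  ```bash
  # Hypothetical: append the attention settings to an existing invokeai.yaml
  # (run from your Invoke root, where invokeai.yaml lives), then restart invokeai-web.
  cat >> invokeai.yaml <<'EOF'
  attention_type: sliced
  attention_slice_size: auto
  EOF
  ```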
- Can all of this be scripted/automated?
  - Of course (most of it, anyway). See the top of this post for an early attempt. But making it reliable and smoothly workable for all combinations of GPU and system configuration is non-trivial. We do appreciate pull requests!
- Why do we need to install those oddly specific versions of `torch` and `torchvision`?
  - Because these are the versions that were proven to work, at the time of this writing, on the AMD W7900 GPU we have at our disposal. Otherwise you may get all sorts of unpleasant errors and no one will be able to help you. You can confirm what you ended up with using the snippet below. Further reading:
    - pytorch/pytorch#138067 (fixed as of 5 hours ago at the time of this writing, but not yet in the daily build)
    - ROCm/hipBLASLt#1243 (the required upstream fix for this issue, still open as of this writing)
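
  To double-check which versions actually ended up in your venv (handy when reporting problems):

  ```bash
  # Show exactly which torch/torchvision builds pip resolved.
  pip show torch torchvision | grep -E '^(Name|Version)'
  ```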
- Why can't we "just" add these dependencies to the regular Invoke `pyproject.toml`?
  - Because we don't want to "just" cause pain and suffering to everyone who has their installs stable with the currently pinned dependencies, especially on Windows, where AMD doesn't work at all at the moment. Also because of non-trivial dependency relationships in general (see the point above, and the `bitsandbytes` notes). But we'll get there! "Walk before we run", etc.
- "I definitely know for a fact that some or all of this is incorrect!!!1!"
  - Great - please tell us what worked for you. The more data points the better. All of this may vary significantly based on system configuration, and what works for one user may not work for another. Newer GPUs are more likely to have "quirks"; older GPUs may be more stable. Let us know and we'll update the knowledge base.
Please report all of your experiences with this setup in the InvokeAI Discord. We love to hear from our AMD user community.