CUDA: experimental native mxfp4 support for blackwell [WIP] #17906
base: master
Conversation
Nice speedup.
Master: Device 0: NVIDIA GeForce RTX 5070 Ti, compute capability 12.0, VMM: yes
PR: Device 0: NVIDIA GeForce RTX 5070 Ti, compute capability 12.0, VMM: yes
ggml/src/ggml-cuda/common.cuh
Outdated
if (sign > 0.0f) {
    return static_cast<uint8_t>(best_i);       // 0..7
} else {
    return static_cast<uint8_t>(best_i | 0x8); // 8..15
}
I think it would be slightly more optimal to extract the sign bit from x, do a bit shift, and a logical and.
More generally, there are FP4 conversion intrinsics in the CUDA math API but I'm not sure whether they would be of use.
    return 0;
}

const uint8_t sign_bit = x < 0.0f ? 0x8 : 0;
I don't know if the compiler is smart enough to do this optimization but I meant to transplant the sign bit directly without the use of conditional statements at all. So cast the float to an unsigned integer, shift 28 bits to the right, and apply & 0x8.
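A minimal sketch of that branchless variant, assuming a standalone helper (the name fp4_sign_bit is made up for illustration): reinterpret the float's bits, shift the sign bit down to position 3, and mask it.

```cuda
#include <cstdint>

// Hypothetical helper illustrating the suggestion above; not code from this PR.
static __device__ __forceinline__ uint8_t fp4_sign_bit(const float x) {
    const uint32_t bits = __float_as_uint(x); // reinterpret the float, bit 31 is the sign
    return (uint8_t) ((bits >> 28) & 0x8);    // 0x8 if the sign bit is set, 0 otherwise
}
```

The result can then be OR'd onto the 3-bit magnitude index in place of the conditional.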
ggml/src/ggml-cuda/mmq.cuh
Outdated
}

#define MMQ_MMA_TILE_X_K_Q8_0 (2*MMQ_TILE_NE_K + 2*MMQ_TILE_NE_K/QI8_0 + 4)
#define MMQ_MMA_TILE_X_K_FP4  (MMQ_TILE_NE_K + MMQ_TILE_NE_K / QI8_0)
The resulting value is correct, but I don't think you should calculate it like this since it will be confusing. It would be better to use something like MMQ_TILE_NE_K + 4, though ideally you would replace the hardcoded value with something that indicates where it comes from.
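A hedged sketch of that suggestion (the *_EXTRA macro name is invented for illustration; only the final value of MMQ_TILE_NE_K + 4 matters):

```cuda
// Illustrative only: name the extra 4 elements per row explicitly and document
// where the value comes from, instead of deriving it from QI8_0.
#define MMQ_MMA_TILE_X_K_FP4_EXTRA 4 // placeholder; define this from the quantity it actually represents
#define MMQ_MMA_TILE_X_K_FP4       (MMQ_TILE_NE_K + MMQ_MMA_TILE_X_K_FP4_EXTRA)
```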
I used 512 as MMQ_ITER_K so that all tile sizes remain the same, and it seems to be faster than the previous version.
120f is most likely what you want - to also cover the DGX Spark (which is sm_121).
Hey everyone, hope we can get working on this together. I have been working on an NVFP4 implementation for llama.cpp on Blackwell for months now (just for fun, in my spare time). I was never sure I would actually be able to do it (this is my first time working on kernels), so I stayed quiet, but I kept at it just to see if I could, and I fixed the last major bug tonight! It's been weeks of fighting junk tokens and mystery bugs, and it was far more complicated than it first seemed compared to the MXFP4 code I was looking at. But finally it's all actually working, and it's flying! I've made a mess, so I have a lot of cleanup to do before I post anything. I'm using the PTX call with mxf4nvf4.block_scale.scale_vec::4X (m16n8k64), with e2m1/e2m1/UE4M3 scales per 16 weights and a global FP32 scale, and I've put in a bunch of other things. Working so far: CPU fallback, MMQ (native MMA), and on-the-fly quantization of activations to NVFP4 (prefill); MMVQ is still nvfp4 x q8_1, but I might try it with NVFP4 too. I have more work to do and will share a lot more soon, I just wanted to share the good news that it produced output for the first time! :)) I never saw this PR until just now, otherwise I would have gotten on board sooner.
@michaelw9999 for adding NVFP4 support I would first suggest creating a PR which adds the NVFP4 quantization itself, i.e. the same way it exists for MXFP4. After this PR (and a PR which generally introduces NVFP4 into ggml) I think it should be easy to add support, as only the on-the-fly activation quantization should change and everything else should be relatively the same. I would not suggest using the x4 instruction for now, as it will lead to a lot of changes and be harder to review.
I will clean it up a lot and try to make it as clear and clean as possible! I've spent so much time getting it to 4X that letting that go would be sad and set me back a lot of effort; nothing should come anywhere near messing with anything else. I could put up a PR for the NVFP4 quantization quite shortly (I have GGML_TYPE_NVFP4 = 40). It's only 11pm here, so I'm usually at this for a few more hours into the night, and I'm quite motivated tonight!
It is ultimately your call as to how to go about the PR; I would've added NVFP4 support after this PR, but I will review whatever you post. The final authority on merging this part of the code is @JohannesGaessler.
@michaelw9999 thank you for your work and interest in contributing to llama.cpp/ggml. For context, is the scope you're targeting only models like GPT-OSS that are already released as MXFP4, or more general NVFP4 quantization for any model?
Hello JohannesGaessler! What I have been working on is "true NVFP4" for any model, not really MXFP4, so these are not at all the same thing. There can certainly be some overlap and commonality via the new Blackwell PTX MMA functions with built-in block scaling, and we can share some common helper functions between the two to reduce bloat, but they really are different quantizations. I've written an entire new set of conversion functions for dealing with the intrinsic fp4 and fp8 types (from cuda_fp4.h and cuda_fp8.h). There were many issues to work out with unsigned/signed handling, scaling/clamping/NaN, and pre-existing GGML_CUDA_xx functions that didn't work for this, so I had to redo a lot of things (and hopefully I can go back and undo the ones that are no longer necessary).

My understanding of the NVFP4 vs. "FP4" difference (unfortunately very confusing naming) is that NVFP4 uses one scale per block of 16 weights, with either E4M3 scales (4x) or E8M0 scales (2x), whereas MXFP4 uses one E8M0 scale per 32 weights, so NVFP4 should have higher precision (and be very, very fast on Blackwell SM120+). The NVFP4 implementation I've written is hopefully true enough to the format to be convertible from any already trained or quantized NVFP4 model, for full compatibility. I've been working with Qwen3 for the last few months and started on Nemotron3 yesterday (surprised they did not release an NVFP4 model...).

I've made two entirely different approaches: the first never worked at all; for the second I started by mimicking Q6_K as it seemed the closest relative, so I set it up like a K-quant with a superblock: #define QK_NVFP4 16. Perhaps we ought to move this to a different thread, since it's now deviating a bit from the MXFP4 side?
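For readers unfamiliar with the format being described, here is a purely hypothetical sketch of what a 16-value NVFP4 block could look like in ggml terms; the struct name and layout are illustrative only and are not actual code from this PR or from the work described above.

```cuda
#include <stdint.h>

// Hypothetical block layout based only on the description above: 16 e2m1 (FP4)
// values per block, one FP8 (E4M3) scale per block, and a tensor-wide FP32 scale
// that would be stored elsewhere.
#define QK_NVFP4 16

typedef struct {
    uint8_t d;                 // per-block scale in FP8 E4M3 encoding
    uint8_t qs[QK_NVFP4 / 2];  // 16 e2m1 values packed two per byte
} block_nvfp4;                 // 9 bytes per 16 weights
```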
I would suggest we get on a call or similar to discuss details. To be 100% clear: even if you invest months of work, that does not mean that it will make sense to merge (all of) your changes. In fact, if you invest a lot of time I think it's even more important to discuss requirements with maintainers. I am hosting a Mumble server that we can use to talk (contact me via the email on my GitHub page), but I am open to other means of communication as well.
JohannesGaessler left a comment:
If at all possible, rewrite the code using MXFP4 constants like QI_MXFP4 rather than QI8_0.
If we start adding Blackwell-specific features we should start adding Blackwell to the virtual architectures that are being built for a non-native build. To my knowledge the FP4 instructions are forward-compatible so we should build the lowest possible virtual architecture for it.
We may also consider building the real architectures for convenience but this is a different matter. The tradeoff is that users won't have to do JIT compilation on the first run but additional CPU time is needed for the builds, particularly in our CI. I am not managing the CI so input from @ggerganov would be appreciated on this matter.
I'll try to understand the implications of more native, virtual, and real CUDA builds and provide input. At the moment I'm not familiar enough with the specifics and tradeoffs to comment on this topic. I think we can decide this later; it's not necessary for this PR, correct?
The FP4 matrix multiply instructions are not forward compatible and are very likely to change with future architectures. See 9.7.14.5.14. Multiply-and-Accumulate Instruction: mma of the PTX ISA manual.
Note that the current approach already ships with very broken performance on Hopper and datacenter Blackwell today. Maybe the path forward is to wait for cuTile C++ to ship; that should help.
TL;DR on ISA compatibility: sm_120 has the baseline forward compatible feature set, sm_120a PTX is only compatible with compute_120 and sm_120f PTX is compatible with compute_12x.
@ggerganov which real architectures to ship can be decided independently of this PR; it's only relevant for convenience of users vs. convenience of our CI.
Force-pushed from 43c262b to bdde328.
Currently WIP, trying to add native FP4 support for Blackwell and beyond. To compile, -DCMAKE_CUDA_ARCHITECTURES="120f" is required. Blackwell has an m16n8k64 instruction for 4-bit types (mxfp4, nvfp4 and int4) which advertises 2x throughput compared to int8 tensor cores. At the moment this PR is 25% faster than master on PP (an earlier version was 10% slower than master). The other issue is that we quantize activations to mxfp4 instead of q8, which leads to failures in test-backend-ops; however, PPL tests are okay with this change (though this does not rule out correctness issues).

TODO:
(benchmark table on RTX Pro 6000 Blackwell)