CUDA: experimental native mxfp4 support for blackwell [WIP] #17906
base: master
Conversation
Nice speedup.
Master: Device 0: NVIDIA GeForce RTX 5070 Ti, compute capability 12.0, VMM: yes
PR: Device 0: NVIDIA GeForce RTX 5070 Ti, compute capability 12.0, VMM: yes
ggml/src/ggml-cuda/common.cuh
Outdated
if (sign > 0.0f) {
    return static_cast<uint8_t>(best_i);       // 0..7
} else {
    return static_cast<uint8_t>(best_i | 0x8); // 8..15
}
I think it would be slightly more optimal to extract the sign bit from x, do a bit shift, and a logical and.
More generally, there are FP4 conversion intrinsics in the CUDA math API but I'm not sure whether they would be of use.
    return 0;
}

const uint8_t sign_bit = x < 0.0f ? 0x8 : 0;
I don't know if the compiler is smart enough to do this optimization but I meant to transplant the sign bit directly without the use of conditional statements at all. So cast the float to an unsigned integer, shift 28 bits to the right, and apply & 0x8.
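A minimal sketch of that branchless variant, assuming a standalone helper (the name fp4_sign_bit is made up for illustration): reinterpret the float's bits, shift the sign bit down to position 3, and mask it.

```cuda
#include <cstdint>

// Hypothetical helper illustrating the suggestion above; not code from this PR.
static __device__ __forceinline__ uint8_t fp4_sign_bit(const float x) {
    const uint32_t bits = __float_as_uint(x); // reinterpret the float, bit 31 is the sign
    return (uint8_t) ((bits >> 28) & 0x8);    // 0x8 if the sign bit is set, 0 otherwise
}
```

The result can then be OR'd onto the 3-bit magnitude index in place of the conditional.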
ggml/src/ggml-cuda/mmq.cuh
Outdated
}

#define MMQ_MMA_TILE_X_K_Q8_0 (2*MMQ_TILE_NE_K + 2*MMQ_TILE_NE_K/QI8_0 + 4)
#define MMQ_MMA_TILE_X_K_FP4  (MMQ_TILE_NE_K + MMQ_TILE_NE_K / QI8_0)
The resulting value is correct, but I don't think you should calculate it like this since it will be confusing. It would be better to use something like MMQ_TILE_NE_K + 4, though ideally you would replace the hardcoded value with something that indicates where it comes from.
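A hedged sketch of that suggestion (the *_EXTRA macro name is invented for illustration; only the final value of MMQ_TILE_NE_K + 4 matters):

```cuda
// Illustrative only: name the extra 4 elements per row explicitly and document
// where the value comes from, instead of deriving it from QI8_0.
#define MMQ_MMA_TILE_X_K_FP4_EXTRA 4 // placeholder; define this from the quantity it actually represents
#define MMQ_MMA_TILE_X_K_FP4       (MMQ_TILE_NE_K + MMQ_MMA_TILE_X_K_FP4_EXTRA)
```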
I used 512 as MMQ_ITER_K so that all tile sizes remain the same, and it seems to be faster than the previous version.
120f is most likely what you want - to also cover the DGX Spark (which is sm_121).
Hey everyone, hope we can get working on this together. I have been working on an NVFP4 implementation for llama.cpp on Blackwell for months now (just for fun, in my spare time). I was never sure I would actually be able to do it (this is my first time working on kernels), so I stayed quiet, but I kept at it just to see if I could, and I fixed the last major bug tonight! It's been weeks of fighting junk tokens and mystery bugs, and it was far more complicated than it first seemed compared to the MXFP4 code I was looking at. But finally it's all actually working, and it's flying! I've made a mess, so I have a lot of cleanup to do before I post anything. I'm using the PTX call with mxf4nvf4.block_scale.scale_vec::4X (m16n8k64), with e2m1/e2m1/UE4M3 scales per 16 weights and a global FP32 scale, and I've put in a bunch of other things. Working so far: CPU fallback, MMQ (native MMA), and on-the-fly quantization of activations to NVFP4 (prefill); MMVQ is still nvfp4 x q8_1, but I might try it with NVFP4 too. I have more work to do and will share a lot more soon, I just wanted to share the good news that it produced output for the first time! :)) I never saw this PR until just now, otherwise I would have gotten on board sooner.
@michaelw9999 for adding NVFP4 support I would first suggest creating a PR which adds the NVFP4 quantization itself, i.e. the same way it exists for MXFP4. After this PR (and a PR which generally introduces NVFP4 into ggml) I think it should be easy to add support, as only the on-the-fly activation quantization should change and everything else should be relatively the same. I would not suggest using the x4 instruction for now, as it will lead to a lot of changes and be harder to review.
I will clean it up a lot and try to make it as clear and clean as possible! I've spent so much time getting it to 4X that letting that go would be sad and set me back a lot of effort; nothing should come anywhere near messing with anything else. I could put up a PR for the NVFP4 quantization quite shortly (I have GGML_TYPE_NVFP4 = 40). It's only 11pm here, so I'm usually at this for a few more hours into the night, and I'm quite motivated tonight!
It is ultimately your call as to how to go about the PR; I would've added NVFP4 support after this PR, but I will review whatever you post. The final authority on merging this part of the code is @JohannesGaessler.
@michaelw9999 thank you for your work and interest in contributing to llama.cpp/ggml. For context, is the scope you're targeting only models like GPT-OSS that are already released as MXFP4, or more general NVFP4 quantization for any model?
Hello JohannesGaessler! What I have been working on is "true NVFP4" for any model, not really MXFP4, so these are not at all the same thing. There can certainly be some overlap and commonality via the new Blackwell PTX MMA functions with built-in block scaling, and we can share some common helper functions between the two to reduce bloat, but they really are different quantizations. I've written an entire new set of conversion functions for dealing with the intrinsic fp4 and fp8 types (from cuda_fp4.h and cuda_fp8.h). There were many issues to work out with unsigned/signed handling, scaling/clamping/NaN, and pre-existing GGML_CUDA_xx functions that didn't work for this, so I had to redo a lot of things (and hopefully I can go back and undo the ones that are no longer necessary).

My understanding of the NVFP4 vs. "FP4" difference (unfortunately very confusing naming) is that NVFP4 uses one scale per block of 16 weights, with either E4M3 scales (4x) or E8M0 scales (2x), whereas MXFP4 uses one E8M0 scale per 32 weights, so NVFP4 should have higher precision (and be very, very fast on Blackwell SM120+). The NVFP4 implementation I've written is hopefully true enough to the format to be convertible from any already trained or quantized NVFP4 model, for full compatibility. I've been working with Qwen3 for the last few months and started on Nemotron3 yesterday (surprised they did not release an NVFP4 model...).

I've made two entirely different approaches: the first never worked at all; for the second I started by mimicking Q6_K as it seemed the closest relative, so I set it up like a K-quant with a superblock: #define QK_NVFP4 16. Perhaps we ought to move this to a different thread, since it's now deviating a bit from the MXFP4 side?
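For readers unfamiliar with the format being described, here is a purely hypothetical sketch of what a 16-value NVFP4 block could look like in ggml terms; the struct name and layout are illustrative only and are not actual code from this PR or from the work described above.

```cuda
#include <stdint.h>

// Hypothetical block layout based only on the description above: 16 e2m1 (FP4)
// values per block, one FP8 (E4M3) scale per block, and a tensor-wide FP32 scale
// that would be stored elsewhere.
#define QK_NVFP4 16

typedef struct {
    uint8_t d;                 // per-block scale in FP8 E4M3 encoding
    uint8_t qs[QK_NVFP4 / 2];  // 16 e2m1 values packed two per byte
} block_nvfp4;                 // 9 bytes per 16 weights
```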
I would suggest we get on a call or similar to discuss details. To be 100% clear: even if you invest months of work, that does not mean that it will make sense to merge (all of) your changes. In fact, if you invest a lot of time I think it's even more important to discuss requirements with maintainers. I am hosting a Mumble server that we can use to talk (contact me via the email on my GitHub page), but I am open to other means of communication as well.
JohannesGaessler left a comment:
If at all possible, rewrite the code using MXFP4 constants like QI_MXFP4 rather than QI8_0.
If we start adding Blackwell-specific features we should start adding Blackwell to the virtual architectures that are being built for a non-native build. To my knowledge the FP4 instructions are forward-compatible so we should build the lowest possible virtual architecture for it.
We may also consider building the real architectures for convenience but this is a different matter. The tradeoff is that users won't have to do JIT compilation on the first run but additional CPU time is needed for the builds, particularly in our CI. I am not managing the CI so input from @ggerganov would be appreciated on this matter.
I'll try to understand the implications of more native, virtual, and real CUDA builds and provide input. At the moment I'm not familiar enough with the specifics and tradeoffs to comment on this topic. I think we can decide this later; it's not necessary for this PR, correct?
The FP4 matrix multiply instructions are not forward compatible and are very likely to change with future architectures. See 9.7.14.5.14. Multiply-and-Accumulate Instruction: mma of the PTX ISA manual.
Note that the current approach already ships with very broken performance on Hopper and datacenter Blackwell today. Maybe the path forward is to wait for cuTile C++ to ship; that should help.
TL;DR on ISA compatibility: sm_120 has the baseline forward compatible feature set, sm_120a PTX is only compatible with compute_120 and sm_120f PTX is compatible with compute_12x.
@ggerganov which real architectures to ship can be decided independently of this PR; it's only relevant for convenience of users vs. convenience of our CI.
Force-pushed from 43c262b to bdde328.
Currently WIP, trying to add native FP4 support for Blackwell and beyond. To compile, -DCMAKE_CUDA_ARCHITECTURES="120f" is required. Blackwell has an m16n8k64 instruction for 4-bit types (mxfp4, nvfp4 and int4) which advertises 2x throughput compared to int8 tensor cores. At the moment this PR is 25% faster than master on PP (an earlier version was 10% slower than master). The other issue is that we quantize activations to mxfp4 instead of q8, which leads to failures in test-backend-ops; however, PPL tests are okay with this change (though this does not rule out correctness issues).

TODO:
(benchmark table on RTX Pro 6000 Blackwell)