
GPT OSS MXFP4 SM120 integration #6216

Draft
rolson24 wants to merge 10 commits into modular:main from
rolson24:gpt-oss-sm120-integration

Conversation


@rolson24 rolson24 commented Mar 19, 2026

Summary

Draft stacked PR on top of sm120-attention-correctness.

This PR brings over the GPT-OSS product-facing fixes that were needed after the
SM120 attention correctness work:

  • GPT-OSS config/quant loading fixes so openai/gpt-oss-20b resolves and loads
    correctly in MAX
  • Harmony reasoning/response formatting fixes so GPT-OSS responses are surfaced
    cleanly through serve
  • GPT-OSS MXFP4 grouped matmul enablement needed for the reduced-memory MoE path

This branch is intentionally scoped to GPT-OSS integration on top of the
attention branch. I have kept the RoPE-only kernel changes out of this PR for
now.

I’m opening this early for visibility. If preferred, I can split this further
into:

  • serving/config changes
  • MXFP4 grouped matmul enablement

NOTE: This implementation is not heavily optimized; however, the MXFP4 MoE kernel does allow GPT-OSS to run on an RTX 5090 without running out of memory.
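For reviewers unfamiliar with the format: MXFP4 follows the OCP Microscaling spec, where each block of 32 FP4 (E2M1) elements shares a single E8M0 power-of-two scale. The sketch below is a hypothetical pure-Python reference decoder, not the kernel in this PR; the nibble packing order (low nibble first) is an assumption.

```python
# Hypothetical MXFP4 reference decoder (not the kernel in this PR).
# FP4 E2M1 magnitudes, indexed by the low 3 bits (exp << 1 | mantissa):
FP4_E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def decode_fp4(nibble: int) -> float:
    """Decode one 4-bit E2M1 value (bit 3 is the sign)."""
    sign = -1.0 if nibble & 0x8 else 1.0
    return sign * FP4_E2M1_VALUES[nibble & 0x7]

def decode_mxfp4_block(packed: bytes, scale_e8m0: int) -> list[float]:
    """Decode packed nibbles (low nibble first -- an assumption) with a
    shared E8M0 scale: an 8-bit biased exponent, no sign or mantissa."""
    scale = 2.0 ** (scale_e8m0 - 127)
    out = []
    for byte in packed:
        out.append(decode_fp4(byte & 0xF) * scale)
        out.append(decode_fp4(byte >> 4) * scale)
    return out
```

The memory win comes from storing ~4.25 bits per weight (4-bit values plus one shared byte of scale per 32 elements) instead of 16, which is what lets the 20B MoE weights fit on a 32 GB RTX 5090.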

Testing

Validated on top of sm120-attention-correctness with:

  • ./bazelw --batch test //max/tests/tests/pipelines/lib:test_reasoning --test_output=errors
  • ./bazelw --batch test //max/kernels/test/gpu/nn:test_flash_attention.mojo.test --test_output=errors

GPT-OSS MXFP4 smoke test:

  • launched MAX serve with:
    LD_LIBRARY_PATH=/usr/local/lib/ollama/cuda_v13:${LD_LIBRARY_PATH:-} \
    MAX_ENABLE_EXPERIMENTAL_MXFP4_GROUPED_MATMUL=1 \
    ./bazelw run //max/python/max/entrypoints:pipelines -- \
      serve --model-path openai/gpt-oss-20b --port 8126 \
      --device-memory-utilization 1.0 \
      --max-batch-size 1 \
      --max-batch-input-tokens 1024 \
      --max-batch-total-tokens 1024 \
      --max-length 1024
    
  • verified that the server starts and responds coherently
  • with max_tokens=128, the prompt Say exactly: hello world returns:
    • content = "hello world"
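For anyone reproducing the smoke test, a request like the following can be sent against the server started above, assuming MAX serve exposes an OpenAI-compatible /v1/chat/completions endpoint on the chosen port (the payload fields here are standard OpenAI chat-completions parameters, not anything specific to this PR):

```python
import json
import urllib.request

def build_chat_request(prompt: str, max_tokens: int = 128,
                       base_url: str = "http://localhost:8126"):
    """Build the smoke-test request against the serve endpoint."""
    payload = {
        "model": "openai/gpt-oss-20b",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

# Against a live server, the content field checked above comes back as:
# with urllib.request.urlopen(build_chat_request("Say exactly: hello world")) as r:
#     print(json.load(r)["choices"][0]["message"]["content"])
```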

Checklist

  • PR is small and focused — consider splitting larger changes into a
    sequence of smaller PRs
  • I ran ./bazelw run format to format my changes
  • I added or updated tests to cover my changes
  • If AI tools assisted with this contribution, I have included an
    Assisted-by: trailer in my commit message or this PR description
    (see ../AI_TOOL_POLICY.md)

Assisted-by: OpenAI Codex

rolson24 added 10 commits March 18, 2026 11:39
BEGIN_PUBLIC
[Kernel][GPU] Fix SM120 GPT-OSS attention correctness

Fix ragged and decode attention behavior on SM120-class GPUs, including
sink-aware softmax handling and fused RoPE correctness in the live attention
path.

This keeps the changes focused on kernel correctness and the associated GPU
test gating.
END_PUBLIC

Signed-off-by: Raif Olson <[email protected]>
BEGIN_PUBLIC
[Kernel][GPU] Fix sink attention test buffer lifetime

Keep the sink-weight device buffer alive across the async flash attention
launch and readback in the sink test, and allow the sink path to execute on
SM120 so the test covers that configuration.
END_PUBLIC

Signed-off-by: Raif Olson <[email protected]>
BEGIN_PUBLIC
[Kernel][GPU] Tighten SM120 sink test changes

Restore the SM100-only assertion on the specialized prefill path and keep
only the minimal sink-buffer lifetime fix in the sink attention test.
END_PUBLIC

Signed-off-by: Raif Olson <[email protected]>
BEGIN_PUBLIC
[Kernel][GPU] Split RoPE changes from attention branch

Restore the RoPE kernels to main on the SM120 attention branch so the
flash attention correctness work can be reviewed independently.
END_PUBLIC

Signed-off-by: Raif Olson <[email protected]>
BEGIN_PUBLIC
[Kernel][GPU] Restrict attention fallback to SM120

Narrow the sink and ragged attention correctness fallbacks to SM120-class
GPUs instead of applying them to all non-SM90 and non-SM100 NVIDIA devices.
END_PUBLIC

Signed-off-by: Raif Olson <[email protected]>
BEGIN_PUBLIC
[Kernel][GPU] Improve GPT-OSS loading and Harmony serving

Fix GPT-OSS config and quantization resolution, and add Harmony reasoning
parsing so GPT-OSS responses are surfaced cleanly through the serving stack.

This keeps the serving and model-loading improvements isolated from the lower
level kernel and MXFP4 execution changes.
END_PUBLIC

Signed-off-by: Raif Olson <[email protected]>
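The Harmony parsing mentioned above separates the model's reasoning ("analysis" channel) from the user-visible answer ("final" channel). A minimal sketch of that idea, assuming the special-token spellings from the published Harmony format (the actual parser in this PR may be token-ID based and differ in detail):

```python
import re

# Hedged sketch: split a Harmony-formatted completion into per-channel text.
# Matches "<|channel|>NAME<|message|>BODY" pairs, where BODY runs until the
# next special token (e.g. <|end|>, <|return|>, <|start|>) or end of string.
CHANNEL_RE = re.compile(
    r"<\|channel\|>(?P<channel>\w+)<\|message\|>(?P<body>.*?)(?=<\|\w+\|>|$)",
    re.DOTALL,
)

def split_harmony(text: str) -> dict[str, str]:
    """Return {channel_name: concatenated body text}."""
    out: dict[str, str] = {}
    for m in CHANNEL_RE.finditer(text):
        out[m.group("channel")] = out.get(m.group("channel"), "") + m.group("body")
    return out
```

With this separation, serve can surface only the "final" channel as the response content, which is what the smoke test above checks.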
BEGIN_PUBLIC
[Kernel][GPU] Enable GPT-OSS MXFP4 grouped matmul path

Add the grouped MXFP4 matmul plumbing needed by GPT-OSS MoE, including the
TileTensor-based grouped matmul implementation, custom-op registration, and the
Python integration used by the GPT-OSS model path.

This keeps the GPT-OSS MXFP4 enablement isolated from unrelated serving and
attention fixes.
END_PUBLIC

Signed-off-by: Raif Olson <[email protected]>
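For reference, the semantics the grouped matmul commit targets are simple to state even though the kernel is not: each token row is routed to one expert, and each expert applies its own weight matrix. A dense pure-Python reference (ignoring quantization, which the real MXFP4 kernel handles by dequantizing on the fly) pins down the expected numerics:

```python
def grouped_matmul_ref(x, weights, expert_ids):
    """Dense reference for grouped matmul semantics (not the kernel itself).

    x:          tokens x d_in activation rows
    weights:    experts x d_in x d_out per-expert weight matrices
    expert_ids: one expert index per token row
    """
    out = []
    for row, e in zip(x, expert_ids):
        w = weights[e]  # this token's expert weight matrix
        out.append([sum(row[k] * w[k][j] for k in range(len(row)))
                    for j in range(len(w[0]))])
    return out
```

A production kernel groups tokens by expert so each expert's (quantized) weights are loaded once per group rather than per token; this reference only fixes the answer the kernel must reproduce.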