
GPT OSS MXFP4 SM120 integration #6216

Draft
rolson24 wants to merge 10 commits into modular:main from
rolson24:gpt-oss-sm120-integration

Conversation


@rolson24 rolson24 commented Mar 19, 2026

Summary

Draft stacked PR on top of sm120-attention-correctness.

This PR brings over the GPT-OSS product-facing fixes that were needed after the
SM120 attention correctness work:

  • GPT-OSS config/quant loading fixes so openai/gpt-oss-20b resolves and loads
    correctly in MAX
  • Harmony reasoning/response formatting fixes so GPT-OSS responses are surfaced
    cleanly through serve
  • GPT-OSS MXFP4 grouped matmul enablement needed for the reduced-memory MoE path

This branch is intentionally scoped to GPT-OSS integration on top of the
attention branch. I have kept the RoPE-only kernel changes out of this PR for
now.

I’m opening this early for visibility. If preferred, I can split this further
into:

  • serving/config changes
  • MXFP4 grouped matmul enablement

NOTE: This implementation is not heavily optimized; however, the MXFP4 MoE kernel does allow GPT-OSS to run on an RTX 5090 without running out of memory.
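For reviewers unfamiliar with the format: MXFP4 follows the OCP Microscaling spec, where each block of 32 FP4 (E2M1) elements shares a single E8M0 power-of-two scale. The sketch below is a hypothetical pure-Python reference decoder, not the kernel in this PR; the nibble packing order (low nibble first) is an assumption.

```python
# Hypothetical MXFP4 reference decoder (not the kernel in this PR).
# FP4 E2M1 magnitudes, indexed by the low 3 bits (exp << 1 | mantissa):
FP4_E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def decode_fp4(nibble: int) -> float:
    """Decode one 4-bit E2M1 value (bit 3 is the sign)."""
    sign = -1.0 if nibble & 0x8 else 1.0
    return sign * FP4_E2M1_VALUES[nibble & 0x7]

def decode_mxfp4_block(packed: bytes, scale_e8m0: int) -> list[float]:
    """Decode packed nibbles (low nibble first -- an assumption) with a
    shared E8M0 scale: an 8-bit biased exponent, no sign or mantissa."""
    scale = 2.0 ** (scale_e8m0 - 127)
    out = []
    for byte in packed:
        out.append(decode_fp4(byte & 0xF) * scale)
        out.append(decode_fp4(byte >> 4) * scale)
    return out
```

The memory win comes from storing ~4.25 bits per weight (4-bit values plus one shared byte of scale per 32 elements) instead of 16, which is what lets the 20B MoE weights fit on a 32 GB RTX 5090.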

Testing

Validated on top of sm120-attention-correctness with:

  • ./bazelw --batch test //max/tests/tests/pipelines/lib:test_reasoning --test_output=errors
  • ./bazelw --batch test //max/kernels/test/gpu/nn:test_flash_attention.mojo.test --test_output=errors

GPT-OSS MXFP4 smoke test:

  • launched MAX serve with:
    LD_LIBRARY_PATH=/usr/local/lib/ollama/cuda_v13:${LD_LIBRARY_PATH:-} \
    MAX_ENABLE_EXPERIMENTAL_MXFP4_GROUPED_MATMUL=1 \
    ./bazelw run //max/python/max/entrypoints:pipelines -- \
      serve --model-path openai/gpt-oss-20b --port 8126 \
      --device-memory-utilization 1.0 \
      --max-batch-size 1 \
      --max-batch-input-tokens 1024 \
      --max-batch-total-tokens 1024 \
      --max-length 1024
    
  • verified that the server starts and responds coherently
  • with max_tokens=128, the prompt Say exactly: hello world returns:
    • content = "hello world"
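For anyone reproducing the smoke test, a request like the following can be sent against the server started above, assuming MAX serve exposes an OpenAI-compatible /v1/chat/completions endpoint on the chosen port (the payload fields here are standard OpenAI chat-completions parameters, not anything specific to this PR):

```python
import json
import urllib.request

def build_chat_request(prompt: str, max_tokens: int = 128,
                       base_url: str = "http://localhost:8126"):
    """Build the smoke-test request against the serve endpoint."""
    payload = {
        "model": "openai/gpt-oss-20b",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

# Against a live server, the content field checked above comes back as:
# with urllib.request.urlopen(build_chat_request("Say exactly: hello world")) as r:
#     print(json.load(r)["choices"][0]["message"]["content"])
```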

Checklist

  • PR is small and focused — consider splitting larger changes into a
    sequence of smaller PRs
  • I ran ./bazelw run format to format my changes
  • I added or updated tests to cover my changes
  • If AI tools assisted with this contribution, I have included an
    Assisted-by: trailer in my commit message or this PR description
    (see ../AI_TOOL_POLICY.md)

Assisted-by: OpenAI Codex

rolson24 added 10 commits March 18, 2026 11:39
BEGIN_PUBLIC
[Kernel][GPU] Fix SM120 GPT-OSS attention correctness

Fix ragged and decode attention behavior on SM120-class GPUs, including
sink-aware softmax handling and fused RoPE correctness in the live attention
path.

This keeps the changes focused on kernel correctness and the associated GPU
test gating.
END_PUBLIC

Signed-off-by: Raif Olson <[email protected]>
BEGIN_PUBLIC
[Kernel][GPU] Fix sink attention test buffer lifetime

Keep the sink-weight device buffer alive across the async flash attention
launch and readback in the sink test, and allow the sink path to execute on
SM120 so the test covers that configuration.
END_PUBLIC

Signed-off-by: Raif Olson <[email protected]>
BEGIN_PUBLIC
[Kernel][GPU] Tighten SM120 sink test changes

Restore the SM100-only assertion on the specialized prefill path and keep
only the minimal sink-buffer lifetime fix in the sink attention test.
END_PUBLIC

Signed-off-by: Raif Olson <[email protected]>
BEGIN_PUBLIC
[Kernel][GPU] Split RoPE changes from attention branch

Restore the RoPE kernels to main on the SM120 attention branch so the
flash attention correctness work can be reviewed independently.
END_PUBLIC

Signed-off-by: Raif Olson <[email protected]>
BEGIN_PUBLIC
[Kernel][GPU] Restrict attention fallback to SM120

Narrow the sink and ragged attention correctness fallbacks to SM120-class
GPUs instead of applying them to all non-SM90 and non-SM100 NVIDIA devices.
END_PUBLIC

Signed-off-by: Raif Olson <[email protected]>
BEGIN_PUBLIC
[Kernel][GPU] Improve GPT-OSS loading and Harmony serving

Fix GPT-OSS config and quantization resolution, and add Harmony reasoning
parsing so GPT-OSS responses are surfaced cleanly through the serving stack.

This keeps the serving and model-loading improvements isolated from the lower
level kernel and MXFP4 execution changes.
END_PUBLIC

Signed-off-by: Raif Olson <[email protected]>
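The Harmony parsing mentioned above separates the model's reasoning ("analysis" channel) from the user-visible answer ("final" channel). A minimal sketch of that idea, assuming the special-token spellings from the published Harmony format (the actual parser in this PR may be token-ID based and differ in detail):

```python
import re

# Hedged sketch: split a Harmony-formatted completion into per-channel text.
# Matches "<|channel|>NAME<|message|>BODY" pairs, where BODY runs until the
# next special token (e.g. <|end|>, <|return|>, <|start|>) or end of string.
CHANNEL_RE = re.compile(
    r"<\|channel\|>(?P<channel>\w+)<\|message\|>(?P<body>.*?)(?=<\|\w+\|>|$)",
    re.DOTALL,
)

def split_harmony(text: str) -> dict[str, str]:
    """Return {channel_name: concatenated body text}."""
    out: dict[str, str] = {}
    for m in CHANNEL_RE.finditer(text):
        out[m.group("channel")] = out.get(m.group("channel"), "") + m.group("body")
    return out
```

With this separation, serve can surface only the "final" channel as the response content, which is what the smoke test above checks.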
BEGIN_PUBLIC
[Kernel][GPU] Enable GPT-OSS MXFP4 grouped matmul path

Add the grouped MXFP4 matmul plumbing needed by GPT-OSS MoE, including the
TileTensor-based grouped matmul implementation, custom-op registration, and the
Python integration used by the GPT-OSS model path.

This keeps the GPT-OSS MXFP4 enablement isolated from unrelated serving and
attention fixes.
END_PUBLIC

Signed-off-by: Raif Olson <[email protected]>
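For reference, the semantics the grouped matmul commit targets are simple to state even though the kernel is not: each token row is routed to one expert, and each expert applies its own weight matrix. A dense pure-Python reference (ignoring quantization, which the real MXFP4 kernel handles by dequantizing on the fly) pins down the expected numerics:

```python
def grouped_matmul_ref(x, weights, expert_ids):
    """Dense reference for grouped matmul semantics (not the kernel itself).

    x:          tokens x d_in activation rows
    weights:    experts x d_in x d_out per-expert weight matrices
    expert_ids: one expert index per token row
    """
    out = []
    for row, e in zip(x, expert_ids):
        w = weights[e]  # this token's expert weight matrix
        out.append([sum(row[k] * w[k][j] for k in range(len(row)))
                    for j in range(len(w[0]))])
    return out
```

A production kernel groups tokens by expert so each expert's (quantized) weights are loaded once per group rather than per token; this reference only fixes the answer the kernel must reproduce.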