BEGIN_PUBLIC
[Kernel][GPU] Fix SM120 GPT-OSS attention correctness

Fix ragged and decode attention behavior on SM120-class GPUs, including sink-aware softmax handling and fused RoPE correctness in the live attention path. This keeps the changes focused on kernel correctness and the associated GPU test gating.
END_PUBLIC
Signed-off-by: Raif Olson <[email protected]>
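For context on the sink-aware softmax mentioned above: in the GPT-OSS attention-sink formulation, a learned per-head sink logit participates in the softmax denominator but carries no value vector, so some probability mass is deliberately discarded. A minimal Python sketch of that math — the function name and list-based shapes are illustrative, not the kernel's API:

```python
import math

def sink_softmax(scores, sink_logit):
    """Softmax over attention scores with an extra sink logit in the
    denominator. The sink has no value vector, so its probability mass
    is dropped rather than returned."""
    m = max(max(scores), sink_logit)            # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    denom = sum(exps) + math.exp(sink_logit - m)
    return [e / denom for e in exps]

weights = sink_softmax([2.0, 1.0, 0.5], sink_logit=1.5)
# The weights sum to strictly less than 1: the missing mass went to the sink.
assert 0.0 < sum(weights) < 1.0
```

With a very negative sink logit the extra denominator term vanishes and this reduces to an ordinary softmax, which is one easy sanity check for a kernel implementing it.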
BEGIN_PUBLIC
[Kernel][GPU] Fix sink attention test buffer lifetime

Keep the sink-weight device buffer alive across the async flash attention launch and readback in the sink test, and allow the sink path to execute on SM120 so the test covers that configuration. END_PUBLIC
Signed-off-by: Raif Olson <[email protected]>
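The buffer-lifetime bug fixed here is a classic async-launch pitfall: the launch call returns immediately, and if the host reclaims the staging buffer before the kernel actually runs, the readback observes garbage. A host-side Python sketch of the pattern, using a thread to stand in for the asynchronous kernel (the real test is Mojo/GPU code; all names here are illustrative):

```python
import threading

def launch_async_read(buf, kernel_may_run):
    """Stand-in for an async flash-attention launch: a worker thread reads
    `buf` only after the host has already returned from the "launch" call."""
    result = {}
    def worker():
        kernel_may_run.wait()           # the GPU gets around to the kernel...
        result["readback"] = list(buf)  # ...and reads the buffer only now
    t = threading.Thread(target=worker)
    t.start()
    return t, result

# Buggy pattern: the sink-weight buffer is reused right after the launch,
# so the late-running "kernel" sees zeros instead of the sink weights.
sinks = bytearray([1, 2, 3, 4])
gate = threading.Event()
t, bad = launch_async_read(sinks, gate)
sinks[:] = b"\x00\x00\x00\x00"          # buffer dies before the kernel runs
gate.set(); t.join()

# Fixed pattern: keep the buffer alive until after synchronization.
sinks = bytearray([1, 2, 3, 4])
gate = threading.Event()
t, good = launch_async_read(sinks, gate)
gate.set(); t.join()                    # synchronize first
sinks[:] = b"\x00\x00\x00\x00"          # only now may the buffer be reclaimed

assert bad["readback"] == [0, 0, 0, 0]
assert good["readback"] == [1, 2, 3, 4]
```

The fix in the test is the second shape: the device buffer must outlive both the launch and the readback, not just the launch call itself.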
BEGIN_PUBLIC
[Kernel][GPU] Tighten SM120 sink test changes

Restore the SM100-only assertion on the specialized prefill path and keep only the minimal sink-buffer lifetime fix in the sink attention test.
END_PUBLIC
Signed-off-by: Raif Olson <[email protected]>
BEGIN_PUBLIC
[Kernel][GPU] Split RoPE changes from attention branch

Restore the RoPE kernels to main on the SM120 attention branch so the flash attention correctness work can be reviewed independently.
END_PUBLIC
Signed-off-by: Raif Olson <[email protected]>
BEGIN_PUBLIC
[Kernel][GPU] Restrict attention fallback to SM120

Narrow the sink and ragged attention correctness fallbacks to SM120-class GPUs instead of applying them to all non-SM90 and non-SM100 NVIDIA devices.
END_PUBLIC
Signed-off-by: Raif Olson <[email protected]>
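The gating change above swaps an exclusion test for a positive arch check. A hypothetical Python sketch of the two predicates (these functions are illustrative, not the repo's actual dispatch code; SM numbers follow the usual major*10 convention, e.g. 89 for Ada, 120 for Blackwell-class consumer parts):

```python
def old_fallback_gate(sm_arch: int) -> bool:
    # The over-broad predicate this change replaces: any NVIDIA part
    # that is not SM90 or SM100 took the correctness fallback, sweeping
    # in older architectures that never needed it.
    return sm_arch not in (90, 100)

def is_sm120_class(sm_arch: int) -> bool:
    # The narrowed predicate: only SM120-class GPUs (120, 121, ...)
    # take the sink/ragged correctness fallback.
    return 120 <= sm_arch < 130

# SM89 was wrongly caught by the old gate but is excluded by the new one;
# SM120, the target of this branch, is caught by both.
assert old_fallback_gate(89) and not is_sm120_class(89)
assert old_fallback_gate(120) and is_sm120_class(120)
```

Gating on the architectures that are known to need the fallback, rather than on everything that is not known-good, keeps future architectures on the fast path by default.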
BEGIN_PUBLIC
[Kernel][GPU] Improve GPT-OSS loading and Harmony serving

Fix GPT-OSS config and quantization resolution, and add Harmony reasoning parsing so GPT-OSS responses are surfaced cleanly through the serving stack. This keeps the serving and model-loading improvements isolated from the lower level kernel and MXFP4 execution changes.
END_PUBLIC
Signed-off-by: Raif Olson <[email protected]>
BEGIN_PUBLIC
[Kernel][GPU] Enable GPT-OSS MXFP4 grouped matmul path

Add the grouped MXFP4 matmul plumbing needed by GPT-OSS MoE, including the TileTensor-based grouped matmul implementation, custom-op registration, and the Python integration used by the GPT-OSS model path. This keeps the GPT-OSS MXFP4 enablement isolated from unrelated serving and attention fixes.
END_PUBLIC
Signed-off-by: Raif Olson <[email protected]>
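As background on the data format consumed by this path: MXFP4, per the OCP Microscaling Formats spec, packs blocks of 32 FP4 (E2M1) elements that share a single E8M0 power-of-two scale. A dequantization sketch in Python — illustrative only, since the grouped matmul kernel consumes the packed format directly rather than materializing floats:

```python
# The eight non-negative FP4 E2M1 magnitudes, indexed by the low 3 bits
# of each 4-bit code; the high bit is the sign.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def dequant_mxfp4(codes, scale_e8m0):
    """Dequantize one MXFP4 block: each 4-bit code decodes to a signed
    E2M1 value, scaled by the block's shared E8M0 factor 2**(scale - 127).
    (Special encodings such as the E8M0 NaN byte are ignored here.)"""
    scale = 2.0 ** (scale_e8m0 - 127)
    return [(-1.0 if c & 0x8 else 1.0) * E2M1_MAGNITUDES[c & 0x7] * scale
            for c in codes]

vals = dequant_mxfp4([0x1, 0x9, 0x7], scale_e8m0=128)  # block scale = 2.0
assert vals == [1.0, -1.0, 12.0]
```

The one-scale-per-32-elements layout is what makes the MoE weights small enough to fit a 20B model on a single consumer GPU, at the cost of very coarse (4-bit) per-element precision.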
Summary
Draft stacked PR on top of `sm120-attention-correctness`.

This PR brings over the GPT-OSS product-facing fixes that were needed after the SM120 attention correctness work:

- `openai/gpt-oss-20b` resolves and loads correctly in MAX
- GPT-OSS responses surface cleanly through serve

This branch is intentionally scoped to GPT-OSS integration on top of the attention branch. I have kept the RoPE-only kernel changes out of this PR for now.

I'm opening this early for visibility. If preferred, I can split this further into:
NOTE: This implementation is not heavily optimized; however, the MXFP4 MoE kernel does allow GPT-OSS to run on an RTX 5090 without running out of memory.
Testing
Validated on top of `sm120-attention-correctness` with:

```
./bazelw --batch test //max/tests/tests/pipelines/lib:test_reasoning --test_output=errors
./bazelw --batch test //max/kernels/test/gpu/nn:test_flash_attention.mojo.test --test_output=errors
```

GPT-OSS MXFP4 smoke test:

```
LD_LIBRARY_PATH=/usr/local/lib/ollama/cuda_v13:${LD_LIBRARY_PATH:-} \
MAX_ENABLE_EXPERIMENTAL_MXFP4_GROUPED_MATMUL=1 \
./bazelw run //max/python/max/entrypoints:pipelines -- \
  serve --model-path openai/gpt-oss-20b --port 8126 \
  --device-memory-utilization 1.0 \
  --max-batch-size 1 \
  --max-batch-input-tokens 1024 \
  --max-batch-total-tokens 1024 \
  --max-length 1024
```

Checklist
- sequence of smaller PRs
- `Assisted-by:` trailer in my commit message or this PR description (see `../AI_TOOL_POLICY.md`)
Assisted-by: OpenAI Codex