Fix matmul_kernel #5701

miscalculation53 · 2025-12-22T10:37:07Z

Change the test code of test_matmul_1_sram.mojo so that (M, N, K)-s of all relative sizes are tested.
Fix the kernel of test_matmul_1_sram.mojo.
Add a test of matmul_kernel similar to that of test_matmul_1_sram.mojo.
Fix matmul_kernel.

…ar into matmul-kernel-fix

github-actions · 2025-12-22T10:37:16Z

Thank you for your submission, we really appreciate it. Like many open-source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution. You can sign the CLA by just posting a Pull Request Comment same as the below format.

I have read the CLA Document and I hereby sign the CLA

_{You can retrigger this bot by commenting recheck in this Pull Request.}_{Posted by the CLA Assistant Lite bot.}

Copilot

Pull request overview

This PR aims to fix the matmul_kernel to correctly handle matrix dimensions that are not divisible by the tile size. The changes update the tiling logic to properly handle partial tiles when K is not a multiple of tile_size.

Key changes:

Modified kernel logic to compute current_tile_size for partial tiles instead of using a pre-computed remainder
Parameterized test functions to accept (M, N, K) dimensions and test multiple size combinations
Added a new test file test_matmul_kernel.mojo to validate the kernel implementation

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 9 comments.

File	Description
max/kernels/test/gpu/linalg/test_matmul_kernel.mojo	New test file that validates `matmul_kernel` with non-divisible matrix dimensions (513, 502, 511)
max/kernels/test/gpu/examples/test_matmul_1_sram.mojo	Updated to parameterize test with compile-time M, N, K dimensions and fixed partial tile handling
max/kernels/src/linalg/matmul/gpu/init.mojo	Fixed `matmul_kernel` to properly handle partial tiles by computing tile size dynamically

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2025-12-22T10:42:47Z

max/kernels/test/gpu/examples/test_matmul_1_sram.mojo

+                )
        else:
-            a_val = (
+            a_shared[localRow * UInt(current_tile_size) + localCol] = (


The shared memory indexing for matrix A is incorrect. The shared memory is allocated with fixed dimensions of tile_size * tile_size, so the row stride should always be tile_size, not current_tile_size. Using current_tile_size as the stride will cause incorrect memory layout when handling partial tiles. The indexing should be a_shared[localRow * UInt(tile_size) + localCol].

Copilot · 2025-12-22T10:42:47Z

max/kernels/test/gpu/examples/test_matmul_1_sram.mojo

+                    a[Int(row), offset + Int(localCol)] if (
+                        row < UInt(M)
+                    ) else 0.0
+                )


When handling partial tiles, if the condition offset + Int(localCol) < K is false, no value is written to the corresponding location in a_shared. However, this location may still be read during the computation loop (line 120), potentially using uninitialized or stale data. To ensure correctness, a zero value should be written when the condition is false, similar to how the full_tile case handles out-of-bounds accesses.

Suggested change

)

)

else:

# For out-of-bounds columns in a partial tile, write zero to avoid

# reading uninitialized or stale data during the compute loop.

a_shared[localRow * UInt(current_tile_size) + localCol] = 0.0

Copilot · 2025-12-22T10:42:48Z

max/kernels/src/linalg/matmul/gpu/__init__.mojo

+        for kk in range(current_tile_size):
            result += (
-                a_shared[localRow * UInt(tile_size) + UInt(kk)].cast[s_type]()
+                a_shared[localRow * UInt(current_tile_size) + UInt(kk)].cast[s_type]()


The shared memory indexing for matrix A is incorrect when reading during computation. The shared memory is allocated with fixed dimensions of tile_size * tile_size, so the row stride should always be tile_size, not current_tile_size. The indexing should be a_shared[localRow * UInt(tile_size) + UInt(kk)].

Copilot · 2025-12-22T10:42:48Z

max/kernels/test/gpu/examples/test_matmul_1_sram.mojo

+        for k in range(current_tile_size):
            result += a_shared.load(
-                localRow * UInt(tile_size) + UInt(k)
+                localRow * UInt(current_tile_size) + UInt(k)


The shared memory indexing for matrix A is incorrect when reading during computation. The shared memory is allocated with fixed dimensions of tile_size * tile_size, so the row stride should always be tile_size, not current_tile_size. The indexing should be a_shared.load(localRow * UInt(tile_size) + UInt(k)).

Copilot · 2025-12-22T10:42:48Z

max/kernels/test/gpu/linalg/test_matmul_kernel.mojo

+    for i in range(M - 10, M):
+        for j in range(N - 10, N):
+            if c_host[i, j] != Float32(K):
+                print(
+                    "Fail at index = [",
+                    i,
+                    ",",
+                    j,
+                    "] the value is",
+                    c_host[i, j],
+                    "the golden value is",
+                    K,
+                )
+                failed = True


The test only validates the last 10x10 block of the result matrix (lines 82-95), which provides inadequate test coverage. Bugs in the kernel could go undetected if they only affect elements outside this region. The test should validate all elements of the result matrix to ensure the matmul kernel works correctly across the entire output.

Copilot · 2025-12-22T10:42:49Z

max/kernels/src/linalg/matmul/gpu/__init__.mojo

+                    a[Int(row), Int(offset + Int(localCol))] if (
+                        row < UInt(m)
+                    ) else 0.0
+                )


When handling partial tiles, if the condition offset + Int(localCol) < k is false, no value is written to the corresponding location in a_shared. However, this location may still be read during the computation loop (line 166), potentially using uninitialized or stale data. To ensure correctness, a zero value should be written when the condition is false, similar to how the full_tile case handles out-of-bounds accesses.

Suggested change

)

)

else:

a_shared[localRow * UInt(current_tile_size) + localCol] = 0.0

Copilot · 2025-12-22T10:42:49Z

max/kernels/src/linalg/matmul/gpu/__init__.mojo

-                row < UInt(m) and offset + Int(localCol) < k
-            ) else 0.0
+            if offset + Int(localCol) < k:
+                a_shared[localRow * UInt(current_tile_size) + localCol] = (


The shared memory indexing for matrix A is incorrect. The shared memory is allocated with fixed dimensions of tile_size * tile_size (as seen in the allocation), so the row stride should always be tile_size, not current_tile_size. Using current_tile_size as the stride will cause incorrect memory layout when handling partial tiles. The indexing should be a_shared[localRow * UInt(tile_size) + localCol].

Copilot · 2025-12-22T10:42:49Z

max/kernels/src/linalg/matmul/gpu/__init__.mojo

-            a_val = (
-                a[Int(row), Int(offset + Int(localCol))] if row
-                < UInt(m) else 0.0
+            a_shared[localRow * UInt(current_tile_size) + localCol] = (


The shared memory indexing for matrix A is incorrect. The shared memory is allocated with fixed dimensions of tile_size * tile_size, so the row stride should always be tile_size, not current_tile_size. Using current_tile_size as the stride will cause incorrect memory layout when handling partial tiles. The indexing should be a_shared[localRow * UInt(tile_size) + localCol].

Copilot · 2025-12-22T10:42:49Z

max/kernels/test/gpu/examples/test_matmul_1_sram.mojo

-                row < UInt(M) and offset + Int(localCol) < K
-            ) else 0.0
+            if offset + Int(localCol) < K:
+                a_shared[localRow * UInt(current_tile_size) + localCol] = (


The shared memory indexing for matrix A is incorrect. The shared memory is allocated with fixed dimensions of tile_size * tile_size (lines 54-58), so the row stride should always be tile_size, not current_tile_size. Using current_tile_size as the stride will cause incorrect memory layout when handling partial tiles. The indexing should be a_shared[localRow * UInt(tile_size) + localCol].

miscalculation53 added 10 commits December 22, 2025 18:05

tests for 3**3 types of (M, N, K)

415cdb2

fix matmul_sram

e1cdf57

add a test for matmul_kernel

6aae024

fix matmul_kernel

d7a94e0

tests for 3**3 types of (M, N, K)

743d50c

fix matmul_sram

e734e8f

add a test for matmul_kernel

5d88ba6

fix matmul_kernel

0d82344

Merge branch 'matmul-kernel-fix' of github.com:miscalculation53/modul…

2f41810

…ar into matmul-kernel-fix

delete unnecessary imports of test_matmul_kernel.mojo

7b44412

Copilot AI review requested due to automatic review settings December 22, 2025 10:37

miscalculation53 requested a review from a team as a code owner December 22, 2025 10:37

Copilot started reviewing on behalf of miscalculation53 December 22, 2025 10:37 View session

Copilot AI reviewed Dec 22, 2025

View reviewed changes

JoeLoser assigned abduld Dec 22, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix matmul_kernel #5701

Fix matmul_kernel #5701

miscalculation53 commented Dec 22, 2025

Uh oh!

github-actions bot commented Dec 22, 2025

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Dec 22, 2025

Uh oh!

Copilot AI Dec 22, 2025

Uh oh!

Copilot AI Dec 22, 2025

Uh oh!

Copilot AI Dec 22, 2025

Uh oh!

Copilot AI Dec 22, 2025

Uh oh!

Copilot AI Dec 22, 2025

Uh oh!

Copilot AI Dec 22, 2025

Uh oh!

Copilot AI Dec 22, 2025

Uh oh!

Copilot AI Dec 22, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

-                )
+                )
+            else:
+                # For out-of-bounds columns in a partial tile, write zero to avoid
+                # reading uninitialized or stale data during the compute loop.
+                a_shared[localRow * UInt(current_tile_size) + localCol] = 0.0

Fix matmul_kernel #5701

Are you sure you want to change the base?

Fix matmul_kernel #5701

Conversation

miscalculation53 commented Dec 22, 2025

Uh oh!

github-actions bot commented Dec 22, 2025

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Dec 22, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 22, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 22, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 22, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 22, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 22, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 22, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 22, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 22, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants