Conversation

@jeffbolznv (Collaborator)

This implements a variation of the perf logger: rather than timing each operation individually, with effectively a barrier in between, it puts the timing boundaries where we already synchronize and times the groups of work that normally overlap. This can be useful for understanding whether individual operations need to be optimized or whether the group is already running efficiently.
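
A rough CPU-side sketch of the grouping idea, with made-up op names, costs, and sync points (the actual backend uses Vulkan timestamp queries at its existing synchronization points, not std::chrono):

#include <chrono>
#include <cstdio>
#include <string>
#include <thread>
#include <vector>

// Toy model of the concurrent logging mode. Op names, costs, and the
// sync flag are hypothetical; they only illustrate the grouping.
struct op_info {
    std::string name;
    int cost_us;     // pretend GPU cost
    bool sync_after; // true where the backend already synchronizes
};

int main() {
    std::vector<op_info> graph = {
        {"MUL_MAT", 60, false}, // these can overlap on the GPU...
        {"MUL_MAT", 60, false},
        {"ROPE",    15, true},  // ...until this pre-existing sync point
        {"ADD",     10, true},
    };

    // Concurrent mode: one timing span per sync boundary, attributed to
    // the whole group of potentially overlapping ops. Per-op mode would
    // instead read the clock (and effectively wait) after every op.
    std::string group;
    auto t0 = std::chrono::steady_clock::now();
    for (const op_info & op : graph) {
        group += (group.empty() ? "" : ", ") + op.name;
        std::this_thread::sleep_for(std::chrono::microseconds(op.cost_us));
        if (op.sync_after) {
            auto t1 = std::chrono::steady_clock::now();
            long long us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
            std::printf("%s: %lld us\n", group.c_str(), us);
            group.clear();
            t0 = t1;
        }
    }
    return 0;
}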

GGML_VK_PERF_LOGGER_CONCURRENT=1 enables the new mode (when GGML_VK_PERF_LOGGER is also set).

GGML_VK_SYNC_LOGGER=1 replaces the ENABLE_SYNC_LOGGING compile-time switch.
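
For example, assuming a llama.cpp build with the Vulkan backend (llama-bench shown here; model.gguf is a placeholder):

GGML_VK_PERF_LOGGER=1 GGML_VK_PERF_LOGGER_CONCURRENT=1 ./llama-bench -m model.gguf
GGML_VK_SYNC_LOGGER=1 ./llama-bench -m model.gguf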

jeffbolznv requested a review from 0cc4m as a code owner on December 11, 2025 at 19:42
github-actions bot added the Vulkan (issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels on Dec 11, 2025
@0cc4m (Collaborator) commented Dec 16, 2025

I see surprisingly little difference when using the new concurrency mode:

Without:

Vulkan Timings:
ADD: 59 x 10.73 us
CPY: 1 x 3.24 us
FLASH_ATTN_EXT dst(64,9,512,1),  q(64,512,9,1),  k(64,512,3,1),  v(64,512,3,1),  m(512,512,1,1): 30 x 116.092 us
GET_ROWS: 2 x 7.42 us
GLU: 30 x 16.988 us
MUL_MAT q4_0 m=1536 n=512 k=576: 58 x 62.821 us (14408.8 GFLOPS/s)
MUL_MAT q4_0 m=192 n=512 k=576: 60 x 0.541 us (209017 GFLOPS/s)
MUL_MAT q4_0 m=576 n=512 k=1536: 29 x 80.838 us (11203.5 GFLOPS/s)
MUL_MAT q4_0 m=576 n=512 k=576: 60 x 42.123 us (8058.33 GFLOPS/s)
MUL_MAT_ADD MUL_MAT_VEC q4_0 m=576 n=1 k=1536: 1 x 5.32 us (332.499 GFLOPS/s)
MUL_MAT_VEC q4_0 m=1536 n=1 k=576: 2 x 7.4 us (238.91 GFLOPS/s)
MUL_MAT_VEC q8_0 m=49152 n=1 k=576: 1 x 156.56 us (361.356 GFLOPS/s)
RMS_NORM_MUL RMS_NORM(576,1,1,1): 2 x 4.2 us
RMS_NORM_MUL RMS_NORM(576,512,1,1): 59 x 13.974 us
ROPE: 30 x 16.757 us
ROPE_VIEW_SET_ROWS ROPE: 30 x 11.48 us
SET_ROWS: 30 x 7.737 us
Total time: 15280.2 us.

With:

Vulkan Timings:
ADD: 59 x 10.035 us
CPY, ROPE, ROPE_VIEW_SET_ROWS ROPE, SET_ROWS: 1 x 97.72 us
FLASH_ATTN_EXT dst(64,9,512,1),  q(64,512,9,1),  k(64,512,3,1),  v(64,512,3,1),  m(512,512,1,1): 30 x 120.124 us
GET_ROWS: 2 x 2.92 us
GLU: 30 x 18.262 us
MUL_MAT q4_0 m=1536 n=512 k=576, MUL_MAT q4_0 m=1536 n=512 k=576: 29 x 121.631 us (14884 GFLOPS/s)
MUL_MAT q4_0 m=576 n=512 k=1536: 29 x 74.968 us (12080.8 GFLOPS/s)
MUL_MAT q4_0 m=576 n=512 k=576: 30 x 39.66 us (8558.84 GFLOPS/s)
MUL_MAT q4_0 m=576 n=512 k=576, MUL_MAT q4_0 m=192 n=512 k=576, MUL_MAT q4_0 m=192 n=512 k=576: 30 x 44.03 us (12848.8 GFLOPS/s)
MUL_MAT_ADD MUL_MAT_VEC q4_0 m=576 n=1 k=1536: 1 x 5.56 us (318.147 GFLOPS/s)
MUL_MAT_VEC q4_0 m=1536 n=1 k=576, MUL_MAT_VEC q4_0 m=1536 n=1 k=576: 1 x 15 us (235.725 GFLOPS/s)
RMS_NORM_MUL RMS_NORM(576,1,1,1): 2 x 2.68 us
RMS_NORM_MUL RMS_NORM(576,512,1,1): 59 x 15.433 us
ROPE, ROPE_VIEW_SET_ROWS ROPE, SET_ROWS: 29 x 23.008 us
Total time: 14663.2 us.

Is this expected?

@jeffbolznv (Collaborator, Author)

It depends on the model; some don't have much concurrency. Which model did you try?

@0cc4m (Collaborator) commented Dec 16, 2025

That was just some SmolLM-135M that I had lying around. But I also tried gpt-oss 20B and didn't see much difference. I think I just misunderstood what you are doing here: the operations are still timed individually, except when they run concurrently.

@jeffbolznv (Collaborator, Author)

Yeah, it should match the groupings you see if you set GGML_VK_SYNC_LOGGER=1. If an op doesn't run concurrently, then it still gets reported by itself. And fusions still get reported as a single op.
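
As a reading aid for the grouped output above, the GFLOPS figure appears to account for all the matmuls in a group: for "MUL_MAT q4_0 m=1536 n=512 k=576, MUL_MAT q4_0 m=1536 n=512 k=576: 29 x 121.631 us (14884 GFLOPS/s)", two matmuls at 2*m*n*k FLOPs each come to about 1.81e9 FLOPs, and 1.81e9 FLOPs / 121.631 us is roughly 14.9 TFLOPS, consistent with the reported number.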
