
Conversation

@albmalamd (Contributor) commented Dec 11, 2025

Fixes #155714

Problem

AsyncAllreduceCUDAHostWork in ProcessGroupGloo was producing incorrect results when performing allreduce operations on CUDA/HIP tensors, causing test_ddp_apply_optim_in_backward to fail. The problem is common to both CUDA and HIP, but it was hidden by a different CI setup (torchvision was not installed, so the test execution was incomplete).

Root Cause

ProcessGroupGloo is a CPU-based collective communication backend. For CUDA tensors, it:

  1. Copies tensors from GPU to pinned CPU memory (async)
  2. Performs allreduce on CPU using Gloo
  3. Copies results back to GPU (async); a rough sketch of this round-trip is shown below
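
The sketch below illustrates that round-trip using raw CUDA runtime calls rather than PyTorch's c10 stream/event abstractions. It is a hypothetical, minimal example: the buffer names and the gloo_allreduce_cpu placeholder are invented for illustration and are not the actual ProcessGroupGloo code.

```cuda
// Hypothetical sketch of the GPU -> pinned CPU -> Gloo -> GPU round-trip.
#include <cuda_runtime.h>

// Placeholder standing in for Gloo's CPU-side allreduce (hypothetical).
static void gloo_allreduce_cpu(float* /*buf*/, size_t /*n*/) { /* ... */ }

void allreduce_via_cpu(float* d_tensor, size_t n, cudaStream_t stream) {
  float* h_pinned = nullptr;
  cudaMallocHost(&h_pinned, n * sizeof(float));   // pinned CPU staging buffer

  // 1. Async copy GPU -> pinned CPU memory.
  cudaMemcpyAsync(h_pinned, d_tensor, n * sizeof(float),
                  cudaMemcpyDeviceToHost, stream);
  cudaEvent_t copy_done;
  cudaEventCreate(&copy_done);
  cudaEventRecord(copy_done, stream);

  // The CPU thread must wait here until the copy has landed in h_pinned;
  // this wait is exactly what the Root Cause / Fix sections below discuss.
  cudaEventSynchronize(copy_done);

  // 2. Allreduce on the CPU using Gloo.
  gloo_allreduce_cpu(h_pinned, n);

  // 3. Async copy the reduced result back to the GPU.
  cudaMemcpyAsync(d_tensor, h_pinned, n * sizeof(float),
                  cudaMemcpyHostToDevice, stream);

  // Drain the stream before releasing the pinned buffer.
  cudaStreamSynchronize(stream);
  cudaEventDestroy(copy_done);
  cudaFreeHost(h_pinned);
}
```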

The bug was in the synchronize() method, which used:

    events[i].block(c10::impl::VirtualGuardImpl(device.type()).getStream(device));

  • event.block(stream) calls cudaStreamWaitEvent()/hipStreamWaitEvent(), which provides only GPU-GPU synchronization
  • it makes a CUDA stream wait for the event but does not block the CPU thread; since Gloo needs the data on the CPU side, this is insufficient (see the sketch below)
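
To make the failure mode concrete, here is a hedged sketch in raw CUDA runtime terms (not the actual c10 code; the function and variable names are invented). cudaStreamWaitEvent() only enqueues a wait on another GPU stream and returns immediately, so the host thread is free to read the pinned buffer while the device-to-host copy is still in flight.

```cuda
// Hypothetical illustration of why event.block(stream) is not enough here.
#include <cuda_runtime.h>

void wait_like_event_block(cudaStream_t current_stream, cudaEvent_t copy_done) {
  // Roughly what event.block(current_stream) boils down to: current_stream
  // will not run further GPU work until copy_done fires...
  cudaStreamWaitEvent(current_stream, copy_done, 0);
  // ...but this call returns immediately and the CPU thread keeps going.
  // If Gloo now reads the pinned host buffer, it can race with the
  // still-running device-to-host copy and observe stale data.
}
```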

Fix

Changed to event.synchronize():

  • this calls cudaEventSynchronize()/hipEventSynchronize(), which blocks the CPU thread until the event completes, providing proper CPU-GPU synchronization and ensuring the GPU copy operations have actually completed before Gloo accesses the data (sketched below)
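
In the same hypothetical terms, the fixed wait blocks the calling CPU thread itself (an illustrative sketch, not the actual c10::Event implementation):

```cuda
// Hypothetical illustration of the event.synchronize() behaviour.
#include <cuda_runtime.h>

void wait_like_event_synchronize(cudaEvent_t copy_done) {
  // Blocks the calling CPU thread until all GPU work recorded before
  // copy_done has finished, so the pinned host buffer is guaranteed to
  // hold the copied values before Gloo starts the CPU-side allreduce.
  cudaEventSynchronize(copy_done);
}
```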

@pytorch-bot (bot) commented Dec 11, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/170239

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 5f54666 with merge base beb6c77:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot (bot) added the release notes: distributed (c10d) label Dec 11, 2025
@albmalamd force-pushed the fix_test_ddp_apply_optim_in_backward branch from 0786b7e to b24ae97 on December 11, 2025 21:59
@bdhirsh requested a review from d4l3k on December 12, 2025 16:25
@bdhirsh (Contributor) commented Dec 12, 2025

cc @d4l3k: who would be a suitable reviewer for this change? (I tentatively assigned you, but feel free to re-assign.)

@bdhirsh added the triaged label Dec 12, 2025
@jerrymannil added the ciflow/trunk, ciflow/periodic-rocm-mi200, ciflow/periodic-rocm-mi300, and ciflow/rocm-mi300 labels Dec 12, 2025
@jeffdaily added the keep-going label Dec 12, 2025
@jeffdaily (Collaborator) commented:

@albmalamd can you please provide more information in your PR description? I know the two of us discussed this on a call but the lack of context here makes it hard for others to review. Thanks.

@albmalamd (Contributor, Author) commented:

hey @jeffdaily, I've added a description. Please have a look. Thank you!

@albmalamd force-pushed the fix_test_ddp_apply_optim_in_backward branch from b24ae97 to 5f54666 on December 16, 2025 16:38
@pytorch-bot (bot) removed the ciflow/trunk, ciflow/rocm-mi300, ciflow/periodic-rocm-mi300, and ciflow/periodic-rocm-mi200 labels Dec 16, 2025
@jataylo added the ciflow/periodic-rocm-mi300 and ciflow/periodic-rocm-mi200 labels Dec 16, 2025
@pytorch-bot (bot) commented Dec 16, 2025

To add the ciflow label ciflow/periodic-rocm-mi200 please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

@pytorch-bot (bot) commented Dec 16, 2025

To add the ciflow label ciflow/periodic-rocm-mi300 please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

@pytorch-bot (bot) removed the ciflow/periodic-rocm-mi200 and ciflow/periodic-rocm-mi300 labels Dec 16, 2025
@jataylo added the ciflow/inductor-rocm-mi300 and ciflow/periodic-rocm-mi200 labels Dec 16, 2025
@jeffdaily (Collaborator) commented:

@bdhirsh / @d4l3k would appreciate a review. The PR description is very helpful.

@d4l3k (Member) commented Dec 17, 2025

I can see why this would solve it, but the copy is being done non-blocking, so you should be able to do this without a CPU synchronization.

By the time run() completes, all of the CPU operations are complete, and the copy (HtoD) should follow stream semantics.
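
For context on the stream-semantics argument, here is a hypothetical sketch (names invented; not the ProcessGroupGloo implementation): an async HtoD copy and any later GPU work enqueued on the same stream, or on a stream that waits on an event recorded after the copy, are ordered by the GPU, so downstream GPU consumers see the reduced values without the host ever blocking.

```cuda
// Illustration of stream-ordering for the copy back to the GPU (hypothetical).
#include <cuda_runtime.h>

__global__ void consume(const float* data, size_t n) {
  // Placeholder for whatever GPU work reads the allreduced result.
  (void)data; (void)n;
}

void copy_back_and_consume(float* d_tensor, const float* h_pinned, size_t n,
                           cudaStream_t copy_stream,
                           cudaStream_t compute_stream) {
  // Enqueue the HtoD copy; the host does not wait.
  cudaMemcpyAsync(d_tensor, h_pinned, n * sizeof(float),
                  cudaMemcpyHostToDevice, copy_stream);

  cudaEvent_t copied;
  cudaEventCreate(&copied);
  cudaEventRecord(copied, copy_stream);

  // GPU-GPU ordering is sufficient on this side: compute_stream waits for
  // the copy, so the kernel sees the new values without a CPU sync.
  cudaStreamWaitEvent(compute_stream, copied, 0);
  consume<<<1, 256, 0, compute_stream>>>(d_tensor, n);

  cudaEventDestroy(copied);
}
```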



Development

Successfully merging this pull request may close these issues.

DISABLED test_ddp_apply_optim_in_backward (__main__.TestDistBackendWithSpawn)

7 participants