
Conversation

@albmalamd (Contributor) commented Dec 11, 2025

Fixes #155714

Problem

AsyncAllreduceCUDAHostWork in ProcessGroupGloo was producing incorrect results when performing allreduce operations on CUDA/HIP tensors, causing test_ddp_apply_optim_in_backward to fail. The problem is common to both CUDA and HIP, but it was hidden by a different CI setup (torchvision was not installed, so the test execution was incomplete).

Root Cause

ProcessGroupGloo is a CPU-based collective communication backend. For CUDA tensors, it:

  1. Copies tensors from GPU to pinned CPU memory (async)
  2. Performs allreduce on CPU using Gloo
  3. Copies results back to GPU (async); a rough sketch of this round-trip is shown below
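
The sketch below illustrates that round-trip using raw CUDA runtime calls rather than PyTorch's c10 stream/event abstractions. It is a hypothetical, minimal example: the buffer names and the gloo_allreduce_cpu placeholder are invented for illustration and are not the actual ProcessGroupGloo code.

```cuda
// Hypothetical sketch of the GPU -> pinned CPU -> Gloo -> GPU round-trip.
#include <cuda_runtime.h>

// Placeholder standing in for Gloo's CPU-side allreduce (hypothetical).
static void gloo_allreduce_cpu(float* /*buf*/, size_t /*n*/) { /* ... */ }

void allreduce_via_cpu(float* d_tensor, size_t n, cudaStream_t stream) {
  float* h_pinned = nullptr;
  cudaMallocHost(&h_pinned, n * sizeof(float));   // pinned CPU staging buffer

  // 1. Async copy GPU -> pinned CPU memory.
  cudaMemcpyAsync(h_pinned, d_tensor, n * sizeof(float),
                  cudaMemcpyDeviceToHost, stream);
  cudaEvent_t copy_done;
  cudaEventCreate(&copy_done);
  cudaEventRecord(copy_done, stream);

  // The CPU thread must wait here until the copy has landed in h_pinned;
  // this wait is exactly what the Root Cause / Fix sections below discuss.
  cudaEventSynchronize(copy_done);

  // 2. Allreduce on the CPU using Gloo.
  gloo_allreduce_cpu(h_pinned, n);

  // 3. Async copy the reduced result back to the GPU.
  cudaMemcpyAsync(d_tensor, h_pinned, n * sizeof(float),
                  cudaMemcpyHostToDevice, stream);

  // Drain the stream before releasing the pinned buffer.
  cudaStreamSynchronize(stream);
  cudaEventDestroy(copy_done);
  cudaFreeHost(h_pinned);
}
```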

The bug was in the synchronize() method, which used:

    events[i].block(c10::impl::VirtualGuardImpl(device.type()).getStream(device));

  • event.block(stream) calls cudaStreamWaitEvent()/hipStreamWaitEvent(), which provides only GPU-GPU synchronization
  • it makes a CUDA stream wait for the event but does not block the CPU thread; since Gloo needs the data on the CPU side, this is insufficient (see the sketch below)
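
To make the failure mode concrete, here is a hedged sketch in raw CUDA runtime terms (not the actual c10 code; the function and variable names are invented). cudaStreamWaitEvent() only enqueues a wait on another GPU stream and returns immediately, so the host thread is free to read the pinned buffer while the device-to-host copy is still in flight.

```cuda
// Hypothetical illustration of why event.block(stream) is not enough here.
#include <cuda_runtime.h>

void wait_like_event_block(cudaStream_t current_stream, cudaEvent_t copy_done) {
  // Roughly what event.block(current_stream) boils down to: current_stream
  // will not run further GPU work until copy_done fires...
  cudaStreamWaitEvent(current_stream, copy_done, 0);
  // ...but this call returns immediately and the CPU thread keeps going.
  // If Gloo now reads the pinned host buffer, it can race with the
  // still-running device-to-host copy and observe stale data.
}
```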

Fix

Changed to event.synchronize():

  • this calls cudaEventSynchronize()/hipEventSynchronize(), which blocks the CPU thread until the event completes, providing proper CPU-GPU synchronization and ensuring the GPU copy operations have actually completed before Gloo accesses the data (sketched below)
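
In the same hypothetical terms, the fixed wait blocks the calling CPU thread itself (an illustrative sketch, not the actual c10::Event implementation):

```cuda
// Hypothetical illustration of the event.synchronize() behaviour.
#include <cuda_runtime.h>

void wait_like_event_synchronize(cudaEvent_t copy_done) {
  // Blocks the calling CPU thread until all GPU work recorded before
  // copy_done has finished, so the pinned host buffer is guaranteed to
  // hold the copied values before Gloo starts the CPU-side allreduce.
  cudaEventSynchronize(copy_done);
}
```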

@pytorch-bot (bot) commented Dec 11, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/170239

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 5f54666 with merge base beb6c77:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot (bot) added the release notes: distributed (c10d) label Dec 11, 2025
@albmalamd force-pushed the fix_test_ddp_apply_optim_in_backward branch from 0786b7e to b24ae97 on December 11, 2025 21:59
@bdhirsh requested a review from d4l3k on December 12, 2025 16:25
@bdhirsh (Contributor) commented Dec 12, 2025

cc @d4l3k: who would be a suitable reviewer for this change? (I tentatively assigned you, but feel free to re-assign.)

@bdhirsh added the triaged label Dec 12, 2025
@jerrymannil added the ciflow/trunk, ciflow/periodic-rocm-mi200, ciflow/periodic-rocm-mi300, and ciflow/rocm-mi300 labels Dec 12, 2025
@jeffdaily added the keep-going label Dec 12, 2025
@jeffdaily (Collaborator) commented:

@albmalamd can you please provide more information in your PR description? I know the two of us discussed this on a call but the lack of context here makes it hard for others to review. Thanks.

@albmalamd (Contributor, Author) commented:

hey @jeffdaily, I've added a description. Please have a look. Thank you!

@albmalamd force-pushed the fix_test_ddp_apply_optim_in_backward branch from b24ae97 to 5f54666 on December 16, 2025 16:38
@pytorch-bot (bot) removed the ciflow/trunk, ciflow/rocm-mi300, ciflow/periodic-rocm-mi300, and ciflow/periodic-rocm-mi200 labels Dec 16, 2025
@jataylo added the ciflow/periodic-rocm-mi300 and ciflow/periodic-rocm-mi200 labels Dec 16, 2025
@pytorch-bot (bot) commented Dec 16, 2025

To add the ciflow label ciflow/periodic-rocm-mi200 please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

@pytorch-bot (bot) commented Dec 16, 2025

To add the ciflow label ciflow/periodic-rocm-mi300 please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

@pytorch-bot (bot) removed the ciflow/periodic-rocm-mi200 and ciflow/periodic-rocm-mi300 labels Dec 16, 2025
@jataylo added the ciflow/inductor-rocm-mi300 and ciflow/periodic-rocm-mi200 labels Dec 16, 2025
@jeffdaily (Collaborator) commented:

@bdhirsh / @d4l3k would appreciate a review. The PR description is very helpful.

@d4l3k (Member) commented Dec 17, 2025

I can see why this would solve it, but the copy is being done non-blocking, so you should be able to do this without a CPU synchronization.

By the time run() completes, all of the CPU operations are complete, and the copy (HtoD) should follow stream semantics.
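
For context on the stream-semantics argument, here is a hypothetical sketch (names invented; not the ProcessGroupGloo implementation): an async HtoD copy and any later GPU work enqueued on the same stream, or on a stream that waits on an event recorded after the copy, are ordered by the GPU, so downstream GPU consumers see the reduced values without the host ever blocking.

```cuda
// Illustration of stream-ordering for the copy back to the GPU (hypothetical).
#include <cuda_runtime.h>

__global__ void consume(const float* data, size_t n) {
  // Placeholder for whatever GPU work reads the allreduced result.
  (void)data; (void)n;
}

void copy_back_and_consume(float* d_tensor, const float* h_pinned, size_t n,
                           cudaStream_t copy_stream,
                           cudaStream_t compute_stream) {
  // Enqueue the HtoD copy; the host does not wait.
  cudaMemcpyAsync(d_tensor, h_pinned, n * sizeof(float),
                  cudaMemcpyHostToDevice, copy_stream);

  cudaEvent_t copied;
  cudaEventCreate(&copied);
  cudaEventRecord(copied, copy_stream);

  // GPU-GPU ordering is sufficient on this side: compute_stream waits for
  // the copy, so the kernel sees the new values without a CPU sync.
  cudaStreamWaitEvent(compute_stream, copied, 0);
  consume<<<1, 256, 0, compute_stream>>>(d_tensor, n);

  cudaEventDestroy(copied);
}
```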



Development

Successfully merging this pull request may close these issues.

DISABLED test_ddp_apply_optim_in_backward (__main__.TestDistBackendWithSpawn)

7 participants