rework test_multi_gpu_data_parallel_forward #31087

@ydshieh

Description

Currently, test_multi_gpu_data_parallel_forward is problematic and is skipped in the tests for several models. In many cases the failure looks like a CUDA issue (CUDA error: misaligned address, etc.). With different NVIDIA software or hardware and/or different torch versions, the test might pass, fail, or pass but cause many subsequent tests to fail.

It currently uses nn.DataParallel, which is no longer recommended. In the long term, we should try DistributedDataParallel and see how this test behaves.

See #31086 for an example.
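For reference, here is a minimal sketch of the kind of nn.DataParallel forward check discussed above. It is not the actual test from the repository. The class name `DataParallelForwardSketch`, the helpers `gpu_count` and `run_sketch`, and the tiny `nn.Linear` stand-in model are all hypothetical. The test skips itself when torch is missing or fewer than two CUDA devices are visible, and it synchronizes every device before asserting so that an asynchronous CUDA error is more likely to surface in this test than in a later one.

```python
import importlib.util
import unittest

# torch is treated as optional so this module still imports without it.
HAS_TORCH = importlib.util.find_spec("torch") is not None
if HAS_TORCH:
    import torch
    from torch import nn


def gpu_count():
    """Number of visible CUDA devices, or 0 when torch or CUDA is unavailable."""
    if not HAS_TORCH or not torch.cuda.is_available():
        return 0
    return torch.cuda.device_count()


class DataParallelForwardSketch(unittest.TestCase):
    # Hypothetical stand-in for a per-model multi-GPU forward test.
    @unittest.skipUnless(gpu_count() >= 2, "needs at least 2 CUDA devices")
    def test_data_parallel_forward(self):
        torch.manual_seed(0)
        # Assumption: a tiny linear layer replaces the real model under test.
        model = nn.Linear(8, 4).to("cuda:0").eval()
        inputs = torch.randn(4, 8, device="cuda:0")

        with torch.no_grad():
            expected = model(inputs)                 # single-GPU reference
            actual = nn.DataParallel(model)(inputs)  # batch split across GPUs

        # Synchronize every device so pending asynchronous CUDA errors
        # are raised here instead of in a subsequent test.
        for index in range(torch.cuda.device_count()):
            torch.cuda.synchronize(index)

        self.assertEqual(actual.shape, expected.shape)
        self.assertTrue(torch.allclose(actual, expected, atol=1e-5))


def run_sketch():
    """Run the sketch and return the unittest result object."""
    loader = unittest.defaultTestLoader
    suite = loader.loadTestsFromTestCase(DataParallelForwardSketch)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Calling `run_sketch()` reports the test as skipped on machines with fewer than two GPUs. A DistributedDataParallel variant would additionally need a process group and one process per device, which this sketch does not attempt.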
