don't use no_sync when deepspeed doesn't support it for certain zero stages #35157

winglian · 2024-12-09T03:50:40Z

What does this PR do?

DeepSpeed 0.16 adds assertions that prevent using no_sync with ZeRO stages 2 and 3. See https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/engine.py#L1986-L2004

People appear to be reporting this in deepspeedai/DeepSpeed#6793, and I'm assuming they are all using accelerate/transformers, since downgrading to DeepSpeed 0.15.4 makes it "work" for them.

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

@muellerzr

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

winglian · 2024-12-09T15:17:44Z

We might have to broaden this to cover DeepSpeed for all ZeRO stages? https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/engine.py#L2208-L2209

muellerzr

Since we should disable it for all DS (apparently), let's just go ahead and do that. I'll apply a similar fix in Accelerator.

cc @SunMarc

muellerzr · 2024-12-11T02:34:14Z

src/transformers/trainer.py

                    context = (
                        functools.partial(self.accelerator.no_sync, model=model)
-                        if i != len(batch_samples) - 1
+                        if i != len(batch_samples) - 1 and not disable_deepspeed_no_sync


Suggested change

-                        if i != len(batch_samples) - 1 and not disable_deepspeed_no_sync
+                        if i != len(batch_samples) - 1 and not self.accelerator.distributed_type == DistributedType.DEEPSPEED
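A minimal, self-contained sketch of the context selection this suggestion describes, for readers who want to see the behaviour in isolation. `ToyAccelerator`, `pick_context`, and `run_accumulation` are hypothetical names made up for illustration, and `DistributedType` here is a local stand-in, not an import from accelerate. Falling back to `contextlib.nullcontext` is an assumption, since the hunk above is cut off before the `else` branch.

```python
import contextlib
import functools
from enum import Enum


class DistributedType(Enum):
    # Hypothetical local stand-in for the enum referenced in the suggestion.
    MULTI_GPU = "MULTI_GPU"
    DEEPSPEED = "DEEPSPEED"


class ToyAccelerator:
    """Hypothetical accelerator stand-in that counts how often no_sync is entered."""

    def __init__(self, distributed_type):
        self.distributed_type = distributed_type
        self.no_sync_calls = 0

    @contextlib.contextmanager
    def no_sync(self, model):
        # Mimic the reported failure mode: entering no_sync under DeepSpeed raises.
        assert self.distributed_type != DistributedType.DEEPSPEED, "no_sync unsupported"
        self.no_sync_calls += 1
        yield


def pick_context(accelerator, model, i, num_batches):
    """Return a context-manager factory for micro-batch i of num_batches."""
    is_last = i == num_batches - 1
    is_deepspeed = accelerator.distributed_type == DistributedType.DEEPSPEED
    if not is_last and not is_deepspeed:
        # Skip gradient synchronisation on non-final micro-batches.
        return functools.partial(accelerator.no_sync, model=model)
    # Assumed fallback: a no-op context, so gradients sync as usual.
    return contextlib.nullcontext


def run_accumulation(accelerator, num_batches):
    """Walk one accumulation window and report how many times no_sync was used."""
    for i in range(num_batches):
        context = pick_context(accelerator, None, i, num_batches)
        with context():
            pass  # forward/backward for micro-batch i would go here
    return accelerator.no_sync_calls
```

With three micro-batches, the non-DeepSpeed toy enters `no_sync` twice, while the DeepSpeed toy never enters it and so never hits the assertion.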

chuangzhidan · 2025-01-16

May I ask which version of transformers includes this fix? Mine is 4.46.0, and I see the same problem with DeepSpeed 0.16.

HuggingFaceDocBuilderDev · 2024-12-11T02:59:38Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

SunMarc

LGTM with Zach's suggestion!

muellerzr

Thanks! We've also confirmed this fixes all the failures users reported with DeepSpeed. cc @ArthurZucker for the final review once wing has run the quality checks ;)

ArthurZucker

sorry for breaking this ... and thanks for the fix!

ArthurZucker · 2024-12-13T16:49:32Z

Can you just run make fixup?

AetherPrior · 2025-01-20T06:27:22Z

Strangely, this issue still exists on deepspeed==0.16.2. Has this fix been pushed to a stable release yet?

ArthurZucker · 2025-01-21T15:56:36Z

This is in 4.48!
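A small sketch for checking an installed version against the release named in this reply. `parse_release`, `includes_fix`, and `installed_version` are hypothetical helper names; the (4, 48) threshold is taken only from the reply above. Later reports in this thread say the error can still appear on 4.48 and 4.49, so passing this check does not by itself rule out the problem.

```python
from importlib import metadata


def parse_release(version):
    """Parse the leading 'major.minor' of a version string into a tuple of ints."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def includes_fix(transformers_version, first_fixed=(4, 48)):
    """True if the version is at or above the release said to contain the fix."""
    return parse_release(transformers_version) >= first_fixed


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    for name in ("transformers", "accelerate", "deepspeed"):
        print(name, installed_version(name))
```

For example, `includes_fix("4.46.0")` is false and `includes_fix("4.48.0")` is true.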

fangpings · 2025-02-19T22:48:43Z

With transformers==4.48.0, accelerate==1.2.1 and deepspeed==0.16.3, I still see this issue.

jianguoz · 2025-03-04T07:05:45Z

Hi @ArthurZucker @muellerzr, we still face the same issue with transformers==4.48.0/4.49.0, accelerate==1.2.1 and deepspeed==0.16.3. Could you check this?

error: no_sync context manager is incompatible with gradient partitioning logic of ZeRO stage 3

SunMarc · 2025-03-04T12:53:39Z

Can you share the full traceback along with a minimal reproducer, @jianguoz? Thanks!

jianguoz · 2025-03-04T19:07:03Z

Hi @SunMarc @ArthurZucker @muellerzr, below are my ZeRO stage 3 config and the error output when fine-tuning a mistral_small_24b or llama model on 8 GPUs. It only works with gradient_accumulation=1. I believe people will face the same issue when training a model with the latest transformers>=4.48.0, deepspeed>=0.16.0, and accelerate==1.2.1/1.4.0.

[2025-03-04 19:03:35,551] [INFO] [config.py:991:print_user_config]   json = {
    "bf16": {
        "enabled": true
    }, 
    "optimizer": {
        "type": "AdamW", 
        "params": {
            "lr": 2e-05, 
            "weight_decay": 0
        }
    }, 
    "scheduler": {
        "type": "WarmupDecayLR", 
        "params": {
            "warmup_min_lr": 1e-06, 
            "warmup_max_lr": 2e-05, 
            "warmup_num_steps": 100, 
            "total_num_steps": 1.259000e+03
        }
    }, 
    "zero_optimization": {
        "stage": 3, 
        "overlap_comm": true, 
        "contiguous_gradients": true, 
        "sub_group_size": 1.000000e+09, 
        "reduce_bucket_size": 2.621440e+07, 
        "stage3_prefetch_bucket_size": 0, 
        "stage3_param_persistence_threshold": 5.120000e+04, 
        "stage3_max_live_parameters": 1.000000e+09, 
        "stage3_max_reuse_distance": 1.000000e+09, 
        "stage3_gather_16bit_weights_on_model_save": true
    }, 
    "gradient_accumulation_steps": 3, 
    "gradient_clipping": 1.0, 
    "steps_per_print": inf, 
    "train_batch_size": 48, 
    "train_micro_batch_size_per_gpu": 2, 
    "wall_clock_breakdown": false, 
    "fp16": {
        "enabled": false
    }
}
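A quick consistency check on the batch-size fields in the config above, as a sketch. It uses the relation DeepSpeed documents for these fields (train_batch_size equals the per-GPU micro-batch size times the gradient accumulation steps times the number of GPUs) and assumes the 8 GPUs mentioned in the comment. The function names are hypothetical.

```python
def expected_train_batch_size(micro_batch_per_gpu, grad_accum_steps, world_size):
    """Global batch size implied by micro-batch size, accumulation steps, and GPU count."""
    return micro_batch_per_gpu * grad_accum_steps * world_size


def non_final_micro_batches(grad_accum_steps):
    """Micro-batches per optimizer step that are not the last in the accumulation window."""
    return max(grad_accum_steps - 1, 0)


# Values copied from the printed config; world_size is an assumption (8 GPUs).
config = {
    "train_batch_size": 48,
    "train_micro_batch_size_per_gpu": 2,
    "gradient_accumulation_steps": 3,
}
world_size = 8

implied = expected_train_batch_size(
    config["train_micro_batch_size_per_gpu"],
    config["gradient_accumulation_steps"],
    world_size,
)
print(implied == config["train_batch_size"])  # True: 2 * 3 * 8 == 48
print(non_final_micro_batches(config["gradient_accumulation_steps"]))  # 2
```

With gradient_accumulation_steps=1 there are no non-final micro-batches, so the `i != len(batch_samples) - 1` branch from the diff is never taken. That would be consistent with the report that only gradient_accumulation=1 works.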

output

step = 0 -- rank 0: error: no_sync context manager is incompatible with gradient partitioning logic of ZeRO stage 3
  0%|                                                                                                                                     | 1/10072 [01:15<210:10:00, 75.13s/it]step = 1 -- rank 0: error: step = 0 -- rank 1: error: no_sync context manager is incompatible with gradient partitioning logic of ZeRO stage 3
no_sync context manager is incompatible with gradient partitioning logic of ZeRO stage 3
  0%|                                                                                                                                     | 1/10072 [01:15<210:30:16, 75.25s/it]step = 0 -- rank 2: error:step = 0 -- rank 5: error:  step = 0 -- rank 3: error:no_sync context manager is incompatible with gradient partitioning logic of ZeRO stage 3no_sync context manager is incompatible with gradient partitioning logic of ZeRO stage 3step = 0 -- rank 6: error: 
step = 0 -- rank 7: error:
no_sync context manager is incompatible with gradient partitioning logic of ZeRO stage 3  
no_sync context manager is incompatible with gradient partitioning logic of ZeRO stage 3no_sync context manager is incompatible with gradient partitioning logic of ZeRO stage 3

step = 0 -- rank 4: error: no_sync context manager is incompatible with gradient partitioning logic of ZeRO stage 3
step = 1 -- rank 1: error: no_sync context manager is incompatible with gradient partitioning logic of ZeRO stage 3
  0%|                                                                                                                                     | 1/10072 [01:15<210:23:32, 75.21s/it]step = 1 -- rank 2: error: step = 1 -- rank 5: error:no_sync context manager is incompatible with gradient partitioning logic of ZeRO stage 3 
no_sync context manager is incompatible with gradient partitioning logic of ZeRO stage 3
step = 1 -- rank 4: error: no_sync context manager is incompatible with gradient partitioning logic of ZeRO stage 3
step = 1 -- rank 7: error:step = 1 -- rank 6: error:  no_sync context manager is incompatible with gradient partitioning logic of ZeRO stage 3no_sync context manager is incompatible with gradient partitioning logic of ZeRO stage 3

jianguoz · 2025-03-06T19:50:21Z

Hi @SunMarc, are there any updates on this issue? I see others reporting the same problem in #34984.

SunMarc · 2025-03-13T17:01:48Z

cc @XuehaiPan

jianguoz · 2025-03-26T23:15:59Z

Hi @SunMarc, any update on this incompatibility between Hugging Face and DeepSpeed?

don't use no_sync when deepspeed doesn't support it for certain zero stages

63acec4

winglian force-pushed the deepspeed-no-sync-zero branch from 7ef3da1 to 63acec4 on December 9, 2024 03:52

winglian mentioned this pull request Dec 9, 2024

upgrade deepspeed to 0.16.1 axolotl-ai-cloud/axolotl#2157

Merged

chore: lint

37773bd

muellerzr reviewed Dec 11, 2024

muellerzr mentioned this pull request Dec 11, 2024

Discrepancy in Training Loss Behavior with Gradient Accumulation using DeepSpeed #34694

Closed

4 tasks

SunMarc approved these changes Dec 11, 2024

fix no_sync context for deepspeed across all zero types

41434e6

muellerzr approved these changes Dec 12, 2024

SunMarc requested a review from ArthurZucker December 13, 2024 13:17

ArthurZucker approved these changes Dec 13, 2024

chore: lint

647eccb

ArthurZucker merged commit add53e2 into huggingface:main Dec 13, 2024
1 of 5 checks passed

inkcherry mentioned this pull request Dec 19, 2024

AssertionError: no sync context manager is incompatible with gradientpartitioning logic of ZeRo stage 3 deepspeedai/DeepSpeed#6793

Open

randydl mentioned this pull request Dec 27, 2024

Please upgrade transformers and deepspeed version in requirements.txt hiyouga/LLaMA-Factory#6460

Closed

nzw0301 mentioned this pull request Dec 30, 2024

No use no_sync context manager when using gradient accumulation w/ deepspeed's zero stage 2 or 3 via accelerate #34984

Open

SunMarc mentioned this pull request Apr 7, 2025

DeepSpeed ZeRO Stage 2/3 incompatibility with no_sync context manager huggingface/accelerate#3481

Closed
