Accelerate x Trainer issue tracker: #33345

ArthurZucker · 2024-09-06T10:05:26Z

A bunch of issues are a bit stale, and @SunMarc + @muellerzr are a bit short on bandwidth!
Thus we would love to have community support to solve the following:

Help needed

Feature request

Support for Multiple Datasets and Domain-Specific Loss Calculation in Trainer #30725

Replied with potential fix and following

WizKnight · 2024-09-06T11:42:20Z

@ArthurZucker Hey there!👋 I'm new to this repository and excited to learn and contribute. Please let me know if there are any good starting points or tasks where I can be of assistance.

ArthurZucker · 2024-09-06T12:28:04Z

Any of these issues that have the Good First Issue label should be fairly easy! 🤗

irislin1006 · 2024-09-06T15:54:46Z

Hi @ArthurZucker, I'm a first-time contributor, but I would love to take issue #31734 as a start 👍

[Update on 2024/09/07] Handled and replied in the issue

nnilayy · 2024-09-06T22:22:38Z

Hi there👋 @ArthurZucker, I handled issue #31439; hope that helps 🤗.

WizKnight · 2024-09-07T09:37:08Z

Hi there👋 @ArthurZucker, I'll handle issue #28124.

irislin1006 · 2024-09-08T02:53:44Z

Hi there👋 @ArthurZucker, I would like to take #32312 😀

[Update on 2024/09/09] Handled and replied in the issue

SunMarc · 2024-09-10T15:02:24Z

cc @matthewdouglas

godspeed5 · 2024-09-13T19:10:56Z

I opened PR #31268 as a fix for issue #30819. I think some discussion is needed there, @amyeroberts.

WizKnight · 2024-09-16T13:56:05Z

Hey @amyeroberts, just wanted to check in on issue #28124. It seems like @muellerzr already tackled it with his fix in #30169.
Should I still work on this further, or is it good to go as is?

Thanks!

amyeroberts · 2024-09-16T18:05:41Z

Hi @WizKnight - best to ask @muellerzr (ideally on the relevant PR / issue to avoid pinging everyone here) about the status of those. I can see that the PR in #30169 wasn't merged due to inactivity -- pending a response to these questions.

In general, if something has just been closed by the GitHub stale bot, and not because of a clear decision not to pursue the PR or a clear rejection from the review process, you're free to pick up the work :)

SunMarc · 2024-09-27T16:16:38Z

cc @MekkCyber

P-Potdar · 2024-10-01T19:20:55Z

Hey @SunMarc and @muellerzr,

I'd love to contribute to this project and help resolve some of the issues mentioned here, especially the DeepSpeed Zero3-related bugs. I’ve already gone through some of the issues and identified potential starting points for solutions. I'll be focusing on these:

Training hangs at the first gradient syncing of an MoE model while using DeepSpeed (#30911)
Trainer doesn't save evaluation metrics (#33733)
CUDA RuntimeError: Unspecified Launch Failure during Training (#30913)

I'll submit PRs with proposed fixes and updates soon. Thank you for the opportunity to contribute!

Also, if there are any specific guidelines or areas where help is most needed, feel free to point me in the right direction!

Looking forward to collaborating on this during Hacktoberfest 🎉

b423016 · 2024-10-02T16:57:32Z

Hey @SunMarc and @muellerzr, I would be happy to contribute to the issue
Trainer doesn't save evaluation metrics (#33733)

ArthurZucker · 2024-10-03T15:05:21Z

Awesome! Just added the tag to make sure it works for everyone! 🥳

Thejaggeddevil · 2024-10-11T08:18:33Z

I want to work on #29348. Please assign this to me.

eeshan15 · 2024-10-11T19:49:41Z

Hey @ArthurZucker, will this be counted in Hacktoberfest?

ArthurZucker · 2024-10-22T13:30:47Z

Yes, given that there is the tag!
We don't assign issues: the first PR that is up will be reviewed. If it goes stale, anyone can take it over, and if no PR is linked, you can also create one 🤗

naba89 · 2024-11-30T11:43:35Z

Hi,
I don't think #28469 has been fixed yet. Facing this even in 4.46.3.

I have a hacky workaround here: custom_trainer
It works on the repro-script, but might not cover all cases.

SunMarc · 2024-12-02T15:58:48Z

> I don't think #28469 has been fixed yet. Facing this even in 4.46.3.

Just reopened! If you have a fix, would you like to open a PR so that we can have a look? Thanks!

ArthurZucker added the Good First Issue, trainer, Good Second Issue, DeepSpeed, Good Difficult Issue, and Accelerate labels Sep 6, 2024

irislin1006 mentioned this issue Sep 6, 2024

Cannot find the best model after training #31734

Closed

nnilayy mentioned this issue Sep 6, 2024

Memory leak when using CLIPTextModel #31439

Closed

amyeroberts added the PyTorch FSDP label Sep 13, 2024

ArthurZucker added the HACKTOBERFEST-ACCEPTED label Oct 3, 2024

zeus2611 mentioned this issue Oct 24, 2024

Fix batch size handling in prediction_loop for DataLoaderShard #34343

Merged
