Accelerate x Trainer issue tracker: #33345

Open
22 of 43 tasks
ArthurZucker opened this issue Sep 6, 2024 · 19 comments

@ArthurZucker
Collaborator

ArthurZucker commented Sep 6, 2024

A bunch of issues are a bit stale, and @SunMarc + @muellerzr are a bit short on bandwidth!
Thus we would love to have community support to solve the following:

Help needed

Feature request

Replied with potential fix and following

ArthurZucker added the Good First Issue, trainer, Good Second Issue, DeepSpeed, Good Difficult Issue, and Accelerate labels on Sep 6, 2024

@WizKnight

@ArthurZucker Hey there!👋 I'm new to this repository and excited to learn and contribute. Please let me know if there are any good starting points or tasks where I can be of assistance.

@ArthurZucker
Collaborator Author

Any of these issues that have the Good First Issue label should be fairly easy! 🤗

@irislin1006

irislin1006 commented Sep 6, 2024

Hi @ArthurZucker, I'm a first-time contributor, but I would love to take issue #31734 as a start 👍

[Update on 2024/09/07] Handled and replied in the issue

@nnilayy
Contributor

nnilayy commented Sep 6, 2024

Hi there 👋 @ArthurZucker, I handled issue #31439. Hope that helps 🤗

@WizKnight

Hi there 👋 @ArthurZucker, I'll handle issue #28124.

@irislin1006

irislin1006 commented Sep 8, 2024

Hi there👋 @ArthurZucker, I would like to take #32312 😀

[Update on 2024/09/09] Handled and replied in the issue

@SunMarc
Member

SunMarc commented Sep 10, 2024

cc @matthewdouglas

@godspeed5

I opened PR #31268 as a fix for issue #30819. I think some discussion is needed there, @amyeroberts.

@WizKnight

Hey @amyeroberts, just wanted to check in on issue #28124. It seems like @muellerzr already tackled it with his fix in #30169.
Should I still work on this further, or is it good to go as is?

Thanks!

@amyeroberts
Collaborator

Hi @WizKnight - best to ask @muellerzr (ideally on the relevant PR / issue to avoid pinging everyone here) about the status of those. I can see that the PR in #30169 wasn't merged due to inactivity -- pending a response to these questions.

In general, if something has just been closed by the GitHub stale bot, and not because of a clear decision not to pursue the PR or a clear rejection from the review process, you're free to pick up the work :)

@SunMarc
Member

SunMarc commented Sep 27, 2024

cc @MekkCyber

@P-Potdar

P-Potdar commented Oct 1, 2024

Hey @SunMarc and @muellerzr,

I'd love to contribute to this project and help resolve some of the issues mentioned here, especially the DeepSpeed Zero3-related bugs. I’ve already gone through some of the issues and identified potential starting points for solutions. I'll be focusing on these:

Training hangs at the first gradient syncing of an MoE model while using DeepSpeed (#30911)
Trainer doesn't save evaluation metrics (#33733)
CUDA RuntimeError: Unspecified Launch Failure during Training (#30913)

I'll submit PRs with proposed fixes and updates soon. Thank you for the opportunity to contribute!

Also, if there are any specific guidelines or areas where help is most needed, feel free to point me in the right direction!

Looking forward to collaborating on this during Hacktoberfest 🎉

@b423016

b423016 commented Oct 2, 2024

Hey @SunMarc and @muellerzr, I would be happy to contribute to the issue "Trainer doesn't save evaluation metrics" (#33733).

@ArthurZucker
Collaborator Author

Awesome! Just added the tag to make sure it works for everyone! 🥳

@Thejaggeddevil

I want to work on #29348. Please assign this to me.

@eeshan15

Hey @ArthurZucker, will this be counted in Hacktoberfest?

@ArthurZucker
Collaborator Author

Yes, given that there is the tag!
We don't assign issues. The first PR that is up will be reviewed; if it goes stale, anyone can take it over; and if no PR is linked, you can also create one 🤗

@naba89

naba89 commented Nov 30, 2024

Hi,
I don't think #28469 has been fixed yet. Facing this even in 4.46.3.

I have a hacky workaround here: custom_trainer
It works on the repro script, but might not cover all cases.

@SunMarc
Member

SunMarc commented Dec 2, 2024

I don't think #28469 has been fixed yet. Facing this even in 4.46.3.

Just reopened! If you have a fix, would you like to open a PR so that we can have a look? Thanks!

13 participants