Accelerate x Trainer issue tracker: #33345
Comments
@ArthurZucker Hey there!👋 I'm new to this repository and excited to learn and contribute. Please let me know if there are any good starting points or tasks where I can be of assistance. |
Any of these issues that have the |
Hi @ArthurZucker, I'm a first-time contributor, but I would love to take issue #31734 as a start 👍 [Update on 2024/09/07] Handled and replied in the issue |
Hi there👋 @ArthurZucker, Handled issue #31439, hope that helps🤗. |
Hi there👋 @ArthurZucker, I'll handle the issue #28124 |
Hi there👋 @ArthurZucker, I would like to take #32312 😀 [Update on 2024/09/09] Handled and replied in the issue |
I had opened PR #31268 as a fix for issue #30819. I think some discussion is needed on there @amyeroberts |
Hey @amyeroberts, just wanted to check in on issue #28124. It seems like @muellerzr already tackled it with his fix in #30169. Thanks! |
Hi @WizKnight - best to ask @muellerzr (ideally on the relevant PR / issue to avoid pinging everyone here) on the status of those. I can see in #30169 the PR wasn't merged in due to inactivity -- pending a response to these questions. In general, if something has just been closed by the GitHub stale bot, and not because of a clear decision not to pursue the PR or a clear rejection from the review process, you're free to pick up the work :) |
cc @MekkCyber |
Hey @SunMarc and @muellerzr, I'd love to contribute to this project and help resolve some of the issues mentioned here, especially the DeepSpeed Zero3-related bugs. I’ve already gone through some of the issues and identified potential starting points for solutions. I'll be focusing on these: Training hangs at the first gradient syncing of an MoE model while using DeepSpeed (#30911). Also, if there are any specific guidelines or areas where help is most needed, feel free to point me in the right direction! Looking forward to collaborating on this during Hacktoberfest 🎉 |
Hey @SunMarc and @muellerzr I would be happy to contribute the issue |
Awesome! Just added the tag to make sure it works for everyone! 🥳 |
I want to work on #29348. Please assign this to me. |
Hey @ArthurZucker, will this be counted in Hacktoberfest? |
Yes, given that there is the tag! |
Hi, I have a hacky workaround here: custom_trainer |
Just reopened! If you have a fix, would you like to open a PR so that we can have a look? Thanks! |
A bunch of issues are a bit stale, and @SunMarc + @muellerzr are a bit short on bandwidth!
Thus we would love to have community support to solve the following:
Help needed
dataloader_persistent_workers=True causes fork-bomb due to repeated creation of eval_dataloader #28469
Feature request
Replied with potential fix and following
data_seed in TrainingArguments is unused #31818 followed by @MekkCyber
resume_from_checkpoint function fails because "There seems to be not a single sample in your epoch_iterator" #26413 followed by @muupan, @muellerzr and @SunMarc