streaming multipack for pretraining dataset #959
Conversation
Here's a patch file I used to test a C4 pretraining dataset with TinyLlama. Multi-GPU doesn't work with this currently, since I think it needs a proper data collator to pad the samples to the same sequence length.
Would this streaming feature work with S3, GCS, Azure Blob Storage?
This PR is ready for review and should resolve #1026. @mhenrichsen
Confirmed working on a single GPU. Currently fails on multi-GPU.
Squashed commits:

* [Feat] streaming multipack
* WIP make continued pretraining work with multipack
* fix up hardcoding, lint
* fix dict check
* update test for updated pretraining multipack code
* fix hardcoded data collator fix for multipack pretraining
* fix the collator to be the max length for multipack pretraining
* don't bother with latest tag for test
* cleanup docker build/test

Co-authored-by: [email protected] <jinwonkim>
Co-authored-by: Wing Lian <[email protected]>
This pull request introduces support for multipacking in streaming pretraining datasets.
Due to the immense size of these datasets, traditional methods of loading them entirely into memory are not feasible.
The proposed solution aims to enhance efficiency and scalability.
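The core idea can be illustrated with a minimal sketch: tokenized samples arrive one at a time from a streaming source and are greedily concatenated into fixed-length blocks, so no full dataset ever needs to live in memory. This is a simplified illustration, not the PR's actual implementation; the function name, the `eos_id` separator, and the greedy split across block boundaries are all assumptions.

```python
def pack_stream(token_streams, seq_len, eos_id=2):
    """Greedily pack tokenized samples from an iterator into blocks of
    exactly seq_len tokens (illustrative sketch, not the PR's code).
    Samples are separated by eos_id; a partial trailing block is dropped."""
    buffer = []
    for tokens in token_streams:
        buffer.extend(tokens)
        buffer.append(eos_id)  # mark the document boundary
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]
            buffer = buffer[seq_len:]

# toy usage: three short "documents" packed into blocks of 8 tokens
docs = [[1, 5, 6], [7, 8], [9, 10, 11, 12, 13]]
blocks = list(pack_stream(iter(docs), seq_len=8))
```

A real multipack implementation additionally tracks per-sample boundaries so attention is not computed across packed documents; this sketch only shows the memory-bounded packing itself.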
A guide for writing the config is still needed.
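Until such a guide exists, a rough sketch of what a config might look like is shown below. The key names follow common axolotl conventions (`pretraining_dataset`, `sequence_len`, `sample_packing`), but the exact keys and values here are assumptions, not taken from this PR:

```yaml
# Hypothetical config sketch for streaming pretraining with multipack.
# All keys/values below are illustrative and may differ from the final PR.
base_model: TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T
pretraining_dataset: allenai/c4   # streamed, never fully loaded into memory
sequence_len: 2048
sample_packing: true
micro_batch_size: 1
max_steps: 1000
```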
This multipack implementation does not use BatchSamplerDataCollatorForSeq2Seq; it only uses DataCollatorForSeq2Seq, because of how the Hugging Face datasets map function works.
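The relevant collator behavior is simply padding every sample in a batch to a common length. The dependency-free stand-in below mimics that behavior so the batching constraint is concrete; it is not the transformers DataCollatorForSeq2Seq itself, and the `pad_id`/`label_pad` values are assumptions.

```python
def pad_to_max(features, pad_id=0, label_pad=-100):
    """Illustrative stand-in for DataCollatorForSeq2Seq-style padding:
    pad every sample's input_ids to the longest sample in the batch,
    masking out padding in attention_mask and labels."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        ids = f["input_ids"]
        pad = max_len - len(ids)
        batch["input_ids"].append(ids + [pad_id] * pad)
        batch["attention_mask"].append([1] * len(ids) + [0] * pad)
        # padding positions get label_pad so they are ignored by the loss
        batch["labels"].append(ids + [label_pad] * pad)
    return batch
```

For multipack pretraining the PR pads to the fixed maximum sequence length rather than to the longest sample, so every packed block in a multi-GPU batch has an identical shape.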