Conversation

@jinwonkim93
Contributor

This pull request introduces support for multipacking in streaming pretraining datasets.
Because these datasets are far too large to load entirely into memory, the proposed solution applies sample packing on the fly as the data streams in, improving training efficiency without requiring the full corpus in memory.

A guide for writing the config is still needed.
This multipack path does not use BatchSamplerDataCollatorForSeq2Seq; it only uses DataCollatorForSeq2Seq, because the packing is handled through the Hugging Face datasets map function. A rough sketch of the idea follows below.
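Roughly, the approach looks like the sketch below (not the PR's actual code; the dataset, model, column names, and the simple concatenate-and-chunk packing are placeholder assumptions for illustration):

```python
# Sketch: stream a pretraining corpus with 🤗 datasets and pack tokens into
# fixed-length sequences inside a batched map call. Because packing happens in
# map, a plain DataCollatorForSeq2Seq is enough at batch time.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # placeholder model
max_seq_len = 2048  # placeholder; should match sequence_len in the training config

# streaming=True yields an IterableDataset, so the corpus is never fully loaded into memory
raw = load_dataset("allenai/c4", "en", split="train", streaming=True)

def tokenize_and_pack(batch):
    # Tokenize a batch of documents, concatenate them into one token stream,
    # then slice the stream into max_seq_len-sized chunks (simplified packing).
    ids = []
    for text in batch["text"]:
        ids.extend(tokenizer(text)["input_ids"] + [tokenizer.eos_token_id])
    chunks = [ids[i : i + max_seq_len] for i in range(0, len(ids) - max_seq_len + 1, max_seq_len)]
    return {"input_ids": chunks, "labels": [c[:] for c in chunks]}

packed = raw.map(tokenize_and_pack, batched=True, remove_columns=["text", "timestamp", "url"])
```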

@winglian
Collaborator

winglian commented Jan 5, 2024

Here's a patch file I used to test a C4 pretraining dataset with TinyLlama. Multi-GPU doesn't currently work with this, since I think it needs a proper data collator to pad the samples to the same sequence length.
patch0.patch
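One likely direction for the multi-GPU issue (a sketch under my assumptions, not the exact fix that landed) is to configure the collator to pad every packed sample to a fixed length, so all ranks produce identically shaped batches:

```python
# Sketch: pad every packed sample to the same fixed length so distributed
# ranks see identical tensor shapes.
from transformers import AutoTokenizer, DataCollatorForSeq2Seq

tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # placeholder model
max_seq_len = 2048  # placeholder; should match the training sequence length

collator = DataCollatorForSeq2Seq(
    tokenizer,
    padding="max_length",   # pad to a fixed length, not just the longest sample in the batch
    max_length=max_seq_len, # identical shapes across ranks avoid collective-op mismatches
    return_tensors="pt",
)
```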

@casper-hansen
Contributor

Would this streaming feature work with S3, GCS, or Azure Blob Storage?

@winglian
Collaborator

winglian commented Jan 5, 2024

This PR is ready for review and should resolve #1026. @mhenrichsen

@mhenrichsen
Collaborator

Confirmed working on a single GPU. Currently fails on multi-GPU.

@winglian winglian changed the title [WIP] streaming multipack for pretraining dataset streaming multipack for pretraining dataset Jan 6, 2024
@winglian winglian merged commit 553c80f into axolotl-ai-cloud:main Jan 6, 2024
djsaunde pushed a commit that referenced this pull request Dec 17, 2024
* [Feat] streaming multipack

* WIP make continued pretraining work w multipack

* fix up hardcoding, lint

* fix dict check

* update test for updated pretraining multipack code

* fix hardcoded data collator fix for multipack pretraining

* fix the collator to be the max length for multipack pretraining

* don't bother with latest tag for test

* cleanup docker build/test

---------

Co-authored-by: [email protected] <jinwonkim>
Co-authored-by: Wing Lian <[email protected]>