streaming multipack for pretraining dataset #959
Conversation
Here's a patch file I used to test a C4 pretraining dataset with TinyLlama. Multi-GPU doesn't work with this currently, since I think it needs a proper data collator to pad the samples to the same sequence length.
Would this streaming feature work with S3, GCS, Azure Blob Storage?
This PR is ready for review and should resolve #1026. @mhenrichsen
Confirmed working on a single GPU. Currently fails on multi-GPU.
Squashed commits:

* [Feat] streaming multipack
* WIP make continued pretraining work with multipack
* fix up hardcoding, lint
* fix dict check
* update test for updated pretraining multipack code
* fix hardcoded data collator fix for multipack pretraining
* fix the collator to be the max length for multipack pretraining
* don't bother with latest tag for test
* cleanup docker build/test

Co-authored-by: [email protected] <jinwonkim>
Co-authored-by: Wing Lian <[email protected]>
This pull request introduces support for multipacking in streaming pretraining datasets.
Due to the immense size of these datasets, traditional methods of loading them entirely into memory are not feasible.
The proposed solution aims to enhance efficiency and scalability.
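The core idea can be illustrated with a minimal sketch: tokenized samples arrive one at a time from a streaming source and are greedily concatenated into fixed-length blocks, so no full dataset ever needs to live in memory. This is a simplified illustration, not the PR's actual implementation; the function name, the `eos_id` separator, and the greedy split across block boundaries are all assumptions.

```python
def pack_stream(token_streams, seq_len, eos_id=2):
    """Greedily pack tokenized samples from an iterator into blocks of
    exactly seq_len tokens (illustrative sketch, not the PR's code).
    Samples are separated by eos_id; a partial trailing block is dropped."""
    buffer = []
    for tokens in token_streams:
        buffer.extend(tokens)
        buffer.append(eos_id)  # mark the document boundary
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]
            buffer = buffer[seq_len:]

# toy usage: three short "documents" packed into blocks of 8 tokens
docs = [[1, 5, 6], [7, 8], [9, 10, 11, 12, 13]]
blocks = list(pack_stream(iter(docs), seq_len=8))
```

A real multipack implementation additionally tracks per-sample boundaries so attention is not computed across packed documents; this sketch only shows the memory-bounded packing itself.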
A guide for writing the config is still needed.
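Until such a guide exists, a rough sketch of what a config might look like is shown below. The key names follow common axolotl conventions (`pretraining_dataset`, `sequence_len`, `sample_packing`), but the exact keys and values here are assumptions, not taken from this PR:

```yaml
# Hypothetical config sketch for streaming pretraining with multipack.
# All keys/values below are illustrative and may differ from the final PR.
base_model: TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T
pretraining_dataset: allenai/c4   # streamed, never fully loaded into memory
sequence_len: 2048
sample_packing: true
micro_batch_size: 1
max_steps: 1000
```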
This multipack implementation does not use BatchSamplerDataCollatorForSeq2Seq; it only uses DataCollatorForSeq2Seq, because of how the Hugging Face datasets map function works.
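The relevant collator behavior is simply padding every sample in a batch to a common length. The dependency-free stand-in below mimics that behavior so the batching constraint is concrete; it is not the transformers DataCollatorForSeq2Seq itself, and the `pad_id`/`label_pad` values are assumptions.

```python
def pad_to_max(features, pad_id=0, label_pad=-100):
    """Illustrative stand-in for DataCollatorForSeq2Seq-style padding:
    pad every sample's input_ids to the longest sample in the batch,
    masking out padding in attention_mask and labels."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        ids = f["input_ids"]
        pad = max_len - len(ids)
        batch["input_ids"].append(ids + [pad_id] * pad)
        batch["attention_mask"].append([1] * len(ids) + [0] * pad)
        # padding positions get label_pad so they are ignored by the loss
        batch["labels"].append(ids + [label_pad] * pad)
    return batch
```

For multipack pretraining the PR pads to the fixed maximum sequence length rather than to the longest sample, so every packed block in a multi-GPU batch has an identical shape.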