Add Llama2 7B FSDP demo #165

Status: Open. Wants to merge 76 commits into main.
Conversation

abhibyreddi (Collaborator) commented Oct 24, 2024:

The demo code performs data loading, listing, and checkpoint saving with Dataflux. A follow-up PR will add support for loading the saved checkpoints with Dataflux.

  • Tests pass - tested manually on a GCE VM with GPUs
  • Appropriate changes to documentation are included in the PR
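
For orientation, here is a minimal sketch of the listing/loading path described above, written against the plain google-cloud-storage client rather than Dataflux's optimized APIs; the bucket name, prefix, and uint16 token dtype are illustrative assumptions, not taken from this PR.

```python
import numpy as np
import torch
from google.cloud import storage

client = storage.Client()
bucket_name = "my-training-bucket"  # placeholder, not the demo's bucket
bucket = client.bucket(bucket_name)

# Listing: enumerate tokenized shards under a prefix (placeholder prefix).
shard_names = [
    blob.name
    for blob in client.list_blobs(bucket_name, prefix="redpajama-sample/")
    if blob.name.endswith(".bin")
]

# Loading: read one shard's bytes straight from GCS (no local disk) and view them
# as token ids; uint16 is an assumption about how the shards were packed.
raw = bucket.blob(shard_names[0]).download_as_bytes()
tokens = torch.from_numpy(np.frombuffer(raw, dtype=np.uint16).astype(np.int64))
print(tokens.shape)
```

The demo routes the same idea through Dataflux so that listing and downloads are accelerated; the snippet only shows where GCS replaces the local filesystem.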

@abhibyreddi abhibyreddi marked this pull request as ready for review November 2, 2024 01:05
@abhibyreddi abhibyreddi requested a review from a team as a code owner November 2, 2024 01:05
@abhibyreddi abhibyreddi requested review from Yash9060, jdnurme and MattIrv and removed request for Yash9060 November 2, 2024 01:05
abhibyreddi (Collaborator, author):

Matt & JD, all the files introduced in this PR are borrowed from here. I've called out the changes I made in the README file.

I tested this by training the 7B model on the sample dataset (~24GB) for 500 iterations. I also added the output I got when I prompted the resulting model (it's very bad, but it works!).

demo/llama/dataset.py (resolved review thread)
The code in this directory trains the Llama 7B model on [Huggingface's RedPajama dataset](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T-Sample/tree/main). All the code in this directory is borrowed from [Lightning AI's lit-llama repo](https://github.com/Lightning-AI/lit-llama). Changes have been made in appropriate places to fetch the data from GCS instead of reading from disk.

This demo has been tested on a GCE instance with `2` Nvidia `H100` GPUs.

Collaborator:

Let's scope the demo to just FSDP checkpointing, if possible.

Collaborator (author):

Made changes to support saving checkpoints with Dataflux. Updated README with details. Could you take another look?

Collaborator:

The README doesn't appear to cover checkpointing at all. Can you update it?

I think the training portion is not that interesting for README purposes, since it doesn't involve Dataflux at all (does it?)

Collaborator (author):

I added a note in the "Run the pre-training script" section about implementing a custom strategy that makes it possible to save checkpoints using Dataflux. What other details would be good to mention here?
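
For context, here is a minimal sketch of the shape such a custom strategy can take. The class, argument names, and writer wiring are illustrative assumptions rather than the PR's implementation; the sketch assumes the save path goes through torch.distributed.checkpoint's save/async_save API (recent PyTorch releases), with a GCS-backed storage writer plugged in where bytes should be redirected to GCS.

```python
import time

import torch.distributed.checkpoint as dcp
from lightning.fabric.strategies import FSDPStrategy


class GCSCheckpointFSDPStrategy(FSDPStrategy):
    """Illustrative sketch: route checkpoint saves through torch.distributed.checkpoint,
    where the storage writer (a GCS-backed writer in the real demo) decides where bytes go."""

    def __init__(self, storage_writer, use_async: bool = False, **kwargs):
        super().__init__(**kwargs)
        self._writer = storage_writer
        self._use_async = use_async

    def save_checkpoint(self, path, state, storage_options=None, filter=None):
        start_time = time.time()
        # Turn Fabric's {name: module/optimizer/other} mapping into plain state dicts.
        # A real implementation also has to handle FSDP-sharded parameters and optimizer state.
        converted_state = {
            key: obj.state_dict() if hasattr(obj, "state_dict") else obj
            for key, obj in state.items()
        }
        if self._use_async:
            dcp.async_save(converted_state, checkpoint_id=str(path), storage_writer=self._writer)
        else:
            dcp.save(converted_state, checkpoint_id=str(path), storage_writer=self._writer)
        print(f"save_checkpoint took {time.time() - start_time:.2f}s")
```

Training code would then pass an instance of such a strategy to lightning.Fabric, just as with the built-in strategies.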

jdnurme and others added 9 commits November 8, 2024 19:27
* add simple llama load benchmark and results

* address comments

* comments
* Update save/load print statements

* Account for save/load only
* Remove model and path arguments from FSDP strategy constructors

* Fix incorrect argument order in gcs writer/reader
* Added initial naive Async FSDP strategy

* refactor script to support multiple trainer.fit executions for average execution time.

* Add logging statement to async_save. Add return docstring to updated init_process() method.

* Added a section to the README with details on how to run the async demo.

* - Refactor DatafluxFSDPStrategy to use arg for async behavior instead of creating a child class.
- Refactor the save_checkpoint_helper to just modify the checkpoint dict. This allows easier control over save/async_save in DatafluxFSDPStrategy.
- Add more benchmark logging.
- Use a custom model class for adding simulated blocking workflow.

* fix typo

* get rank from env var after it's set instead of returning. Fix var naming.

* remove return type hint.

* Updated README to reflect review feedback

* Improve readme docs.

* more doc cleanup and logging improvements.

* Use env var for accessing rank. Update logging strings.

* Fix docstring accuracy.

* further docstring fix

* remove irrelevant optimizer choice
* updated readme to include async and multinode features

* Address comments
* Update multi-node benchmarking readme

* Add commands for all the strategies currently supported

* Update numbering

* Add more info

* Fix typo

* Reword

* Fix typos

* Address comments

* Resolve merge conflicts

* Fix note

* Fix note

* Fix note

* Fix note

* Fix note

* Remove note

* Fix type

* Add a note about gcsfuse deployment

* Fix headings

* Final commit
@abhibyreddi changed the title from "Add FSDP demo" to "Add Llama2 7B FSDP demo" on Nov 13, 2024
abhibyreddi (Collaborator, author):

@MattIrv, @jdnurme, @Yash9060: I want to call your attention to demo/llama/strategies.py. The DatafluxFSDPStrategy class there is similar to the class with the same name in demo/lightning/checkpoint/multinode/strategies.py.

The one introduced in this PR inherits from lightning.fabric.strategies.FSDPStrategy, while the existing one inherits from lightning.pytorch.strategies.FSDPStrategy. They are similar but not the same.
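
For quick orientation, the two base classes live at different import paths in Lightning 2.x, which is why the two strategies cannot simply be shared as-is (the aliases below are only for illustration):

```python
# Strategy added in this PR: driven by lightning.Fabric (lit-llama's training loop).
from lightning.fabric.strategies import FSDPStrategy as FabricFSDPStrategy

# Existing strategy in demo/lightning/checkpoint/multinode/strategies.py: driven by lightning.Trainer.
from lightning.pytorch.strategies import FSDPStrategy as TrainerFSDPStrategy
```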

demo/llama/README.md (outdated review thread, resolved)
demo/llama/dataset.py (outdated review thread, resolved)
    self._async_save(converted_state, path, writer)
else:
    self._save(converted_state, path, writer)
duration_ms = (time.time() - start_time) * 1000
Collaborator:

Any specific reason why we need to have the time in milliseconds? I think everywhere else in the codebase we use seconds.

Collaborator (author):

This has been copied over from demo/lightning/checkpoint/multinode/strategies.py.

        self.checkpoint_group = dist.new_group(
            default_ranks, backend=self.process_group_backend)

    def save_checkpoint(
Collaborator:

Add a link to the save_checkpoint source code? (Also, I think this is coming from Lightning Fabric; any reason why we are using Lightning Fabric's save_checkpoint instead of the plain Lightning checkpoint?)

Collaborator (author):

lit-llama's training code does not use lightning.Trainer, and Fabric only accepts custom strategies that inherit from one of the classes defined in lightning.fabric.strategies. The other option I had was to rewrite the training code to use Lightning's LightningModule; this seemed simpler.
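
As a concrete illustration of that constraint, a Fabric-driven setup takes the strategy instance directly, so the custom class only has to satisfy the lightning.fabric.strategies interface. The class below is a placeholder standing in for the PR's demo/llama/strategies.py strategy, not its actual code.

```python
import lightning as L
from lightning.fabric.strategies import FSDPStrategy


class DatafluxFSDPStrategy(FSDPStrategy):
    """Placeholder for the strategy defined in demo/llama/strategies.py (see the sketch above)."""


fabric = L.Fabric(accelerator="cuda", devices=2, strategy=DatafluxFSDPStrategy())
fabric.launch()
# lit-llama's loop then runs under Fabric, e.g.:
#   model, optimizer = fabric.setup(model, optimizer)
#   train_dataloader = fabric.setup_dataloaders(train_dataloader)
```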

Collaborator:

This appears to mostly be a copy of https://github.com/Lightning-AI/lit-llama/blob/main/lit_llama/packed_dataset.py. Is there some way to only override the pieces we need from that code without reimplementing everything?

This might apply to strategies.py and train.py here too.
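
One way to act on this suggestion is to keep the upstream class and override only the I/O hook, as sketched below. UpstreamPackedDataset is a self-contained stand-in for lit-llama's class, and the _read_bytes hook name is hypothetical; the real override point would be wherever the upstream iterator opens a shard file.

```python
from google.cloud import storage


class UpstreamPackedDataset:
    """Stand-in for lit-llama's packed dataset: everything except the I/O hook stays upstream."""

    def __init__(self, filenames):
        self.filenames = filenames

    def _read_bytes(self, filename: str) -> bytes:
        # Upstream behavior: read a shard from local disk.
        with open(filename, "rb") as f:
            return f.read()


class GCSPackedDataset(UpstreamPackedDataset):
    """Overrides only the I/O hook so shards are fetched from GCS instead of local disk."""

    def __init__(self, blob_names, bucket_name: str):
        super().__init__(blob_names)
        self._bucket = storage.Client().bucket(bucket_name)

    def _read_bytes(self, blob_name: str) -> bytes:
        return self._bucket.blob(blob_name).download_as_bytes()
```

Whether this is practical depends on how cleanly the upstream code separates file reading from chunk decoding.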

Collaborator:

Is there a way to avoid adding yet another reimplementation of these strategies and reuse the existing ones?

    gradient_accumulation_iters, devices)


def train(
Collaborator:

Can the underlying implementation be reused here? It looks like there might not be any changes.



@torch.no_grad()
def validate(fabric: L.Fabric, model: torch.nn.Module,
Collaborator:

Same comment here and for several of the functions below: does this really need to be reimplemented?
