Wondering: if I want to train a model similar to llama-70b from scratch on 2 GPUs with 24 GB of memory each, will tinygrad automatically split the model for training? I am talking about model parallelism, not data parallelism.

---
Sharding the model is supported; we don't explicitly distinguish model parallelism from data parallelism. See `tinygrad/test/test_multitensor.py`, line 25 at ce46a7e:

`# shard_x is "data parallel"`

It's not automatic and you need to specify how you want to shard the model. This example shards 70B llama onto 6 GPUs for serving, and you would need a lot more GPUs for training. See `tinygrad/examples/llama.py`, line 283 at ce46a7e:

`for k,v in nn.state.get_state_dict(model).items():`
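To make the two modes concrete, here's a rough sketch of how the same `shard`/`shard_` call covers both. The two-layer model, device list, shapes, and per-weight axis choices below are made up for illustration, and it assumes two devices of your default backend are actually available; llama.py does the equivalent by picking an axis for each real weight based on its key name.

```python
# Illustrative sketch only: ToyMLP, the shapes, and the axis choices are assumptions,
# not code from the tinygrad repo.
from tinygrad import Tensor, Device, nn

GPUS = tuple(f"{Device.DEFAULT}:{i}" for i in range(2))  # e.g. ("CUDA:0", "CUDA:1")

class ToyMLP:
  def __init__(self):
    self.w1 = nn.Linear(256, 1024, bias=False)
    self.w2 = nn.Linear(1024, 256, bias=False)
  def __call__(self, x: Tensor) -> Tensor:
    return self.w2(self.w1(x).relu())

model = ToyMLP()

# "model parallel": split each weight matrix across the GPUs, Megatron-style
# (first layer split along its output dim, second layer along its input dim)
for k, v in nn.state.get_state_dict(model).items():
  v.shard_(GPUS, axis=0 if k.startswith("w1") else 1)

x = Tensor.rand(32, 256).shard(GPUS, axis=None)  # replicate the input on every GPU
print(model(x).shape)                            # (32, 256)

# "data parallel" is the same call with the roles flipped: replicate the weights
# (axis=None) and shard the input along its batch axis instead:
#   x = Tensor.rand(32, 256).shard(GPUS, axis=0)
```

The only knob is the `axis` argument: `None` replicates a tensor on every device, an integer splits it along that dimension, and whether that amounts to "data parallel" or "model parallel" just depends on whether you split the activations or the weights.

---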
Just want to know your thoughts on tinygrad's `shard` vs PyTorch + Accelerate (Hugging Face).