Description
I was trying to train Polycoder on the preconfigured dataset from the checkpoint checkpoints-2-7B. Following the instructions in the repo (changing only the configs as appropriate), I ran:
sudo python ./deepy.py train.py -d configs 2-7B.yml local_setup.yml
which gave the following error:
RuntimeError: CUDA out of memory. Tried to allocate 1.86 GiB (GPU 0; 23.70 GiB total capacity; 20.49 GiB already allocated; 1.74 GiB free; 20.50 GiB reserved in total by PyTorch)
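For context on why the allocation fails, here is a rough back-of-the-envelope memory estimate for training a 2.7B-parameter model with Adam in mixed precision. The per-parameter byte counts below are common rules of thumb (fp16 weights and gradients plus fp32 master weights and Adam moments), not exact figures for this repo's setup:

```python
# Rough memory estimate for training a 2.7B-parameter model with Adam in
# mixed precision. Byte counts per parameter are rule-of-thumb assumptions,
# not measured values for this configuration.

def training_memory_gib(n_params: int) -> float:
    bytes_per_param = (
        2    # fp16 weights
        + 2  # fp16 gradients
        + 4  # fp32 master copy of weights
        + 4  # fp32 Adam first moment
        + 4  # fp32 Adam second moment
    )
    return n_params * bytes_per_param / 1024**3

print(f"{training_memory_gib(2_700_000_000):.1f} GiB")  # ≈ 40.2 GiB
```

Under these assumptions the weights and optimizer states alone need roughly 40 GiB, well over the 23.70 GiB the error reports as the GPU's total capacity, before any activations are counted.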
Interestingly, nvidia-smi reports the GPU's full 25 GB as free.
I tried reducing the batch size; the only batch-size setting I found in the config files was train_micro_batch_size_per_gpu: 8 in 2-7B.yml.
I lowered it from 8 to 4, and then to 1, but got the same error in both cases.
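For reference, this is a hedged sketch of the batch-size knobs in a gpt-neox/DeepSpeed-style config such as 2-7B.yml; the key names follow the DeepSpeed convention, and the values are illustrative rather than the repo's actual contents:

```yaml
{
  # per-GPU micro batch actually resident in memory at once;
  # shrinking this reduces activation memory only, not the memory
  # used by model weights and optimizer states
  "train_micro_batch_size_per_gpu": 1,

  # effective batch = micro batch x accumulation steps x data-parallel GPUs
  "gradient_accumulation_steps": 8,
}
```

This may explain why lowering train_micro_batch_size_per_gpu from 8 to 1 did not change the error: the micro batch affects activation memory, while the allocation that fails appears dominated by fixed model state.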
I am running all of this in Docker, per the containerized setup instructions.
Appreciate any help!