CUDA out of memory error on training

I was trying to train PolyCoder on the [preconfigured dataset](https://github.com/frankxu2004/gpt-neox#datasets), starting from the checkpoint `checkpoints-2-7B`. I used the following command, as per the [instructions](https://github.com/frankxu2004/gpt-neox#containerized-setup) in the repo (only changing the configs as appropriate):

`sudo python ./deepy.py train.py -d configs 2-7B.yml  local_setup.yml`

which gave the following error:

`RuntimeError: CUDA out of memory. Tried to allocate 1.86 GiB (GPU 0; 23.70 GiB total capacity; 20.49 GiB already allocated; 1.74 GiB free; 20.50 GiB reserved in total by PyTorch)`
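
To make the numbers in that message easier to compare, here is a small sketch that pulls them apart. The helper name `parse_oom` is hypothetical and the regular expressions assume the exact message layout shown above; other PyTorch versions may word it differently.

```python
import re

# The error message exactly as reported above.
ERROR = (
    "RuntimeError: CUDA out of memory. Tried to allocate 1.86 GiB "
    "(GPU 0; 23.70 GiB total capacity; 20.49 GiB already allocated; "
    "1.74 GiB free; 20.50 GiB reserved in total by PyTorch)"
)


def parse_oom(message):
    """Extract the GiB figures from an OOM message of this shape (hypothetical helper)."""
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) GiB",
        "total": r"([\d.]+) GiB total capacity",
        "allocated": r"([\d.]+) GiB already allocated",
        "free": r"([\d.]+) GiB free",
        "reserved": r"([\d.]+) GiB reserved",
    }
    return {key: float(re.search(pat, message).group(1)) for key, pat in patterns.items()}


stats = parse_oom(ERROR)
# How far the failed request overshoots what is still free on the device.
shortfall = stats["requested"] - stats["free"]

print(f"Held by this process: {stats['allocated']:.2f} of {stats['total']:.2f} GiB")
print(f"Request exceeds free memory by {shortfall:.2f} GiB")
```

Read this way, the training process itself already holds about 20.5 GiB at the moment of failure, so a "free" reading from `nvidia-smi` presumably reflects the GPU before (or after) the run rather than during it.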

Interestingly, `nvidia-smi` reports that the full ~25 GB of our GPU is free.

I tried updating the batch size, and the only place I found to change it in the config files was `train_micro_batch_size_per_gpu: 8` in `2-7B.yml`.

It was 8; I changed it to 4 and then to 1, but got the same error in both cases.
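
One possible reason the batch size makes no difference is that the model states alone may not fit, independent of activations. The back-of-the-envelope sketch below assumes about 2.7 billion parameters (inferred from the `2-7B` name), mixed-precision training with Adam, a single GPU, and no optimizer-state partitioning or CPU offloading. None of these assumptions are confirmed by the configs, so treat the result as a rough estimate only.

```python
GIB = 1024 ** 3

# Assumption: parameter count inferred from the "2-7B" config name.
N_PARAMS = 2.7e9

# Assumption: the usual mixed-precision Adam layout, in bytes per parameter.
BYTES_PER_PARAM = {
    "fp16 weights": 2,
    "fp16 gradients": 2,
    "fp32 master weights": 4,
    "Adam momentum (fp32)": 4,
    "Adam variance (fp32)": 4,
}

# Total capacity taken from the error message above.
GPU_CAPACITY_GIB = 23.70


def to_gib(n_params, bytes_per_param):
    """Memory in GiB for n_params values stored at bytes_per_param each."""
    return n_params * bytes_per_param / GIB


for name, nbytes in BYTES_PER_PARAM.items():
    print(f"{name:>22}: {to_gib(N_PARAMS, nbytes):6.2f} GiB")

total_bytes = sum(BYTES_PER_PARAM.values())
model_state_gib = to_gib(N_PARAMS, total_bytes)
print(f"{'model states total':>22}: {model_state_gib:6.2f} GiB")
print(f"fits in {GPU_CAPACITY_GIB} GiB: {model_state_gib < GPU_CAPACITY_GIB}")
```

Under these assumptions the weights, gradients, and optimizer states come to roughly 40 GiB before any activations are counted, which would explain why shrinking `train_micro_batch_size_per_gpu` does not change the outcome on a single ~24 GiB card.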

I am running all this in docker, as per the [containerized setup instructions](https://github.com/frankxu2004/gpt-neox#containerized-setup).

Appreciate any help!
