!!! note
    We only test on Ampere GPUs (e.g., A100s or 30xx series). If it works with JAX, it should work, though. We have done limited testing on H100 GPUs, but we do not have regular access to them.
We have two installation options for Levanter:

- Using a Virtual Environment: This is the simplest option if you don't have root access to your machine (and don't have rootless Docker installed).
- Using a Docker Container: This is the best way to achieve the fastest training speeds, because the Docker container includes TransformerEngine, and Levanter uses TransformerEngine's FusedAttention implementation to accelerate training.
If you just want to get going quickly in a virtual environment, the short version is:

```bash
virtualenv -p python3.10 levanter
source levanter/bin/activate
pip install --upgrade "jax[cuda12_pip]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
git clone https://github.com/stanford-crfm/levanter.git
cd levanter
pip install -e .
```
We recommend using a virtual environment to install Levanter.
You can use either `virtualenv` or `conda` to create a virtual environment. Here are the steps for creating a virtual environment with `virtualenv`:

```bash
virtualenv -p python3.10 levanter
source levanter/bin/activate
```

Or, with `conda`:

```bash
conda create --name levanter python=3.10 pip
conda activate levanter
```
Next, install JAX. Please refer to the JAX Installation Guide for current instructions. Below are two options that worked as of March 2024.
```bash
# CUDA 12 installation
pip install --upgrade "jax[cuda12_pip]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html

# CUDA 11 installation
pip install --upgrade "jax[cuda11_pip]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
```
You can install Levanter either from PyPI or from source. We recommend installing from source.
```bash
git clone https://github.com/stanford-crfm/levanter.git
cd levanter
pip install -e .
```
Alternatively, you can install Levanter from PyPI, though this package is frequently out of date, so we recommend installing from source:

```bash
pip install levanter
```
By default, Levanter logs training runs to Weights and Biases. You can sign up for a free WandB account at https://wandb.ai/site.
You can obtain an API token from Weights and Biases and use it to log into your WandB account on the command line as follows:

```bash
wandb login ${YOUR TOKEN HERE}
```
For more information on getting set up with Weights and Biases, visit https://wandb.ai/site.
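If you can't run an interactive login (for example, inside a batch script or a non-interactive container), setting the `WANDB_API_KEY` environment variable is a standard WandB alternative:

```bash
export WANDB_API_KEY=${YOUR TOKEN HERE}
```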
If you do not want to use WandB, you can disable it by running:

```bash
wandb offline
```
You can also use TensorBoard for logging. See the Tracker documentation for more information.
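As a rough illustration of what switching trackers looks like, the sketch below selects TensorBoard in the trainer config. Treat the exact keys (`type`, `logdir`) as assumptions and check the Tracker documentation for the authoritative options:

```yaml
trainer:
  tracker:
    type: tensorboard
    logdir: logs   # assumed option name; directory where TensorBoard event files are written
```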
To take advantage of the fastest training speeds Levanter has to offer, we recommend using the official Docker container built by NVIDIA's JAX Toolbox team. The image is continuously updated with the latest versions of JAX, CUDA, TransformerEngine, and Levanter. Training speeds are accelerated by TransformerEngine's FusedAttention implementation, which requires a TransformerEngine installation in your environment. Luckily, we can use a Docker container that already has Levanter and TransformerEngine installed for us.
To check if you have Docker installed, run:

```bash
sudo docker --version
```
If it is not installed, you can follow the installation instructions on their website.
You'll also need to have the `nvidia-container-toolkit` installed. You can follow the installation instructions on their website.
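Once both are installed, a quick sanity check is to run `nvidia-smi` from inside a container and confirm your GPUs are visible (any image works here, since the toolkit injects the driver utilities; we use the Levanter image as an example):

```bash
sudo docker run --rm --gpus=all ghcr.io/nvidia/jax:levanter nvidia-smi
```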
This step is technically optional, since the container will be downloaded the first time you run it, but you can pull it ahead of time with the following command:

```bash
sudo docker pull ghcr.io/nvidia/jax:levanter
```
If you just want to use Levanter out of the box to train models, these are the Docker setup steps you should follow.
If you're interested in actively developing Levanter while using a Docker container, see the Developing in a GPU Docker Container guide.
To run a Docker container interactively, you can use the following command:

```bash
sudo docker run -it --gpus=all --shm-size=16g ghcr.io/nvidia/jax:levanter
```
Then, you can run training commands from within your Docker container as follows:

```bash
python -m levanter.main.train_lm \
    --config_path /opt/levanter/config/gpt2_small.yaml
```
You can also run a job in a Docker container with the following command:

```bash
sudo docker run \
    --gpus=all \
    --shm-size=16g \
    -i ghcr.io/nvidia/jax:levanter \
    python -m levanter.main.train_lm \
    --config_path /opt/levanter/config/gpt2_small.yaml
```
For more information on how to train models in Levanter, see our User Guide.
If you are planning to add to or extend Levanter for your own use case, follow these Docker setup steps.
First, clone the Levanter repository:

```bash
git clone https://github.com/stanford-crfm/levanter.git
```
Then run an interactive Docker container with your Levanter directory mounted as a volume. For example, if your Levanter repo is located at `/nlp/src/username/levanter`, then run the command below to make that directory accessible to the Docker container:

```bash
sudo docker run -it --gpus=all -v /nlp/src/username/levanter:/levanter --shm-size=16g ghcr.io/nvidia/jax:levanter
```
Once your container starts, the Levanter repo you cloned will be available at `/levanter`. You should `cd` into the `levanter` directory and run the install command for Levanter from there:

```bash
cd /levanter
pip install -e .
```
Now, you should be able to run training jobs in this container using the version of Levanter from your mounted directory:

```bash
python src/levanter/main/train_lm.py \
    --config_path config/gpt2_small.yaml
```
- To use the Levanter datasets available on Google Cloud within a Docker container, you need to install gcloud and log in inside the Docker container. See the Google Cloud Setup instructions at the top of Getting Started on TPU VMs; a rough sketch is also shown after this list.
- If you are using a Docker container on the Stanford NLP cluster, you need to check which GPUs have been allocated to you within your Slurm job. Run `nvidia-smi` before you start your Docker container and note the `Bus-Id` for each GPU. Then, after starting your Docker container, run `nvidia-smi` again to discover the indices of the GPUs you've been allocated within the full node. The GPU index is listed to the left of the GPU name in the leftmost column. Run `export CUDA_VISIBLE_DEVICES=[YOUR GPU INDICES]` so the container will only use your allocated GPUs and not all the GPUs on the node. For example, if you are using GPUs `[2, 3, 4, 5]`, you would run `export CUDA_VISIBLE_DEVICES=2,3,4,5`.
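As promised above, here is a rough sketch of installing and authenticating gcloud inside the container. The exact flow (and whether you use user credentials or a service account) depends on your setup, so treat this as illustrative and defer to the Google Cloud Setup instructions:

```bash
# Install the Google Cloud SDK (interactive installer), then reload the shell so gcloud is on PATH
curl -sSL https://sdk.cloud.google.com | bash
exec -l $SHELL

# Log in, and set up application-default credentials for client libraries
gcloud auth login
gcloud auth application-default login
```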
For more details on how to configure training runs, please see the Getting Started Training guide. Here are some examples of running a job.
```bash
python -m levanter.main.train_lm --config config/gpt2_small
```
Here's a simple example of running a job on a single node with Slurm (using `srun`). This example assumes you have cloned the Levanter repository and are in the root directory of the repository.
```bash
srun --account=nlp --cpus-per-task=128 --gpus-per-node=8 --job-name=levanter-multi-1 --mem=1000G --open-mode=append --partition=sphinx --time=14-0 infra/run-slurm.sh python src/levanter/main/train_lm.py --config_path config/gpt2_small.yaml
```
This example uses `sbatch` to submit a job to Slurm, and assumes you have cloned and installed the Levanter repository.
NVIDIA recommends using this method (one process per GPU, rather than one process per node) for best performance.
Save the following script as `my-job.sh`:

```bash
#!/bin/bash
#SBATCH --cpus-per-task=4
#SBATCH --job-name=levanter-test
#SBATCH --ntasks-per-node=8
#SBATCH --gres=gpu:8
#SBATCH --output=levanter_%j.log
#SBATCH --mem=16G

## On the Stanford NLP cluster you might need this:
export PATH=$(echo $PATH | sed 's|:/usr/local/cuda/bin||')

## Activate your virtual environment
source levanter/bin/activate

srun python -m levanter.main.train_lm --config config/gpt2_small_fast --trainer.per_device_parallelism -1
```
Then, submit the job with `sbatch`:

```bash
sbatch my-job.sh
```
For multi-GPU training, you additionally need to have nvidia-fabricmanager installed and running on each of your nodes:

```bash
sudo apt-get install cuda-drivers-fabricmanager
sudo systemctl start nvidia-fabricmanager
```
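You can check that the service came up (and enable it across reboots) with standard systemctl commands:

```bash
sudo systemctl enable nvidia-fabricmanager
sudo systemctl status nvidia-fabricmanager
```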
If you are using a Docker container to train your model, your `docker run` command should look similar to this:

```bash
sudo docker run -it --network=host -v ~/src/levanter/cache:/cache -v /home/user/levanter:/levanter --gpus=all --shm-size=16g ghcr.io/nvidia/jax:levanter
```
The main difference between the command here and the one found in the GPU Docker Development Guide is the `--network=host` argument. This tells the Docker container to use the host machine's network instead of the default Docker `bridge` network. Using `host` is the easiest way to do multi-node networking with Docker and should be sufficient for your training purposes. Please see Docker's host and bridge network documentation for more information.
We use JAX Distributed to help manage multi-node training in Levanter. On each node you can run a command like the following to kick off a training job:
```bash
NCCL_DEBUG=INFO python src/levanter/main/train_lm.py \
    --config_path config/gpt2_7b.yaml \
    --trainer.ray.auto_start_cluster false \
    --trainer.per_device_parallelism -1 \
    --trainer.distributed.num_processes 4 \
    --trainer.distributed.local_device_ids "[0,1,2,3,4,5,6,7]" \
    --trainer.distributed.coordinator_address 12.345.678.91:2403 \
    --trainer.distributed.process_id 0
```
This will start a 4-node job where each node has 8 GPUs.

- `--trainer.distributed.num_processes` sets the number of nodes used in this training run.
- `--trainer.distributed.local_device_ids` sets the IDs of the local GPUs to use on this specific node.
- `--trainer.distributed.coordinator_address` is the IP address and port number of the node that will be leading the training run. All other nodes should have network access to the IP address and port set by this argument, and the same value should be used for this argument in every node's run command.
- `--trainer.distributed.process_id` is the process ID of the current node. If the node is the coordinator for the training run (its IP address was the one specified at `--trainer.distributed.coordinator_address`), its process ID needs to be set to zero. All other nodes in the training run should have a unique integer ID between 1 and `num_processes` - 1.
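For example, the second node in the same 4-node run would use an identical command, changing only the process ID (only the coordinator uses ID 0):

```bash
NCCL_DEBUG=INFO python src/levanter/main/train_lm.py \
    --config_path config/gpt2_7b.yaml \
    --trainer.ray.auto_start_cluster false \
    --trainer.per_device_parallelism -1 \
    --trainer.distributed.num_processes 4 \
    --trainer.distributed.local_device_ids "[0,1,2,3,4,5,6,7]" \
    --trainer.distributed.coordinator_address 12.345.678.91:2403 \
    --trainer.distributed.process_id 1
```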
When the above command is run on the coordinator node, it will block until all other processes connect to it. All the other nodes will connect to the coordinator node before they can begin training. All other training run arguments have the same meaning as with single-node runs. We recommend increasing your `--trainer.train_batch_size` value when you scale from single-node to multi-node training, since this is the global batch size for your training job and you've now increased your compute capacity.
Here is an updated Slurm script example where we've added `#SBATCH --nodes=2`.
NOTE: This script hasn't been tested yet.
```bash
#!/bin/bash
#SBATCH --cpus-per-task=4
#SBATCH --job-name=levanter-test
#SBATCH --ntasks-per-node=8
#SBATCH --gres=gpu:8
#SBATCH --output=levanter_%j.log
#SBATCH --mem=16G
#SBATCH --nodes=2

# On the Stanford NLP cluster, you might need this:
export PATH=$(echo $PATH | sed 's|:/usr/local/cuda/bin||')

CONTAINER_PATH="ghcr.io/nvidia/jax:levanter"
TRAINING_COMMAND="python -m levanter.main.train_lm --config_path config/gpt2_7b.yaml --trainer.ray.auto_start_cluster false --trainer.per_device_parallelism -1"

srun docker run --gpus=all --shm-size=16g --rm $CONTAINER_PATH $TRAINING_COMMAND
```
If you're using Slurm (with Pyxis), you won't need to provide the distributed arguments described in the previous section; JAX/Levanter will infer them for you.
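As a purely hypothetical sketch (the image-reference syntax and available flags depend on your site's Pyxis/enroot configuration), a Pyxis-based launch might look something like:

```bash
# Hypothetical example; adjust the image reference to match your Pyxis/enroot setup
srun --container-image=ghcr.io#nvidia/jax:levanter \
    python -m levanter.main.train_lm --config config/gpt2_small_fast --trainer.per_device_parallelism -1
```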
In Levanter, you can switch between using TPUs and GPUs in the middle of a training run. See our tutorial on Switching Hardware Mid-Training Run to learn more.
On H100 GPUs, you can train with FP8 precision. To do this, you just need to add the following to your config:
```yaml
trainer:
  # ...
  fp8: true
```
For details on how it works, see the Haliax FP8 docs and Transformer Engine's FP8 docs.
For solutions to common problems, please see the FAQ.
See the FAQ entry, but some variant of this should work:

```bash
export PATH=$(echo $PATH | sed 's|:/usr/local/cuda/bin||')
```

The issue is that the system-installed CUDA is being used instead of the CUDA installed by JAX.