Improve tensor parallel memory usage #35202

Open
Nan2018 opened this issue Dec 11, 2024 · 0 comments
Labels
Feature request: Request for a new feature

Comments

Nan2018 commented Dec 11, 2024

Feature request

Thanks to #34184, we can use tensor parallelism (TP) for Llama with a one-line change. However, the current implementation loads the whole model onto the GPU of every rank before applying TP, which significantly increases the memory footprint.
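To make the cost concrete, here is a rough, weight-only estimate of what one GPU holds in each case. It is only a sketch: the parameter count (about 8B) and the float32 weight size are assumptions, and it ignores activations, the CUDA context, and any weights the TP plan leaves unsharded, so it is not meant to reproduce measured numbers.

```python
def weight_bytes_per_gpu(n_params: int, bytes_per_param: int, tp_size: int, shard: bool) -> float:
    """Rough weight-only footprint on a single GPU.

    shard=False models "every rank first loads the whole model on its GPU";
    shard=True models "each rank only ever holds its 1/tp_size slice".
    """
    full = n_params * bytes_per_param
    return full / tp_size if shard else full


N_PARAMS = 8_000_000_000  # assumption: roughly 8B parameters
BYTES_PER_PARAM = 4       # assumption: float32 weights
TP_SIZE = 2               # two GPUs, one rank per GPU

GIB = 1024 ** 3
full_copy = weight_bytes_per_gpu(N_PARAMS, BYTES_PER_PARAM, TP_SIZE, shard=False)
sharded = weight_bytes_per_gpu(N_PARAMS, BYTES_PER_PARAM, TP_SIZE, shard=True)

print(f"full copy per GPU:     {full_copy / GIB:.1f} GiB")
print(f"sharded slice per GPU: {sharded / GIB:.1f} GiB")
```

Under these assumptions, the peak per GPU is set by the full copy, not by the sharded slice that remains after TP is applied.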

Motivation

We can load the model on the CPU before applying TP. I tested this with Llama 3.1 8B on 2 GPUs, and memory usage dropped from 60G to less than 20G. Below is my test script:

import os

import torch
from torch.distributed import device_mesh
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"

# RANK is set by the launcher (e.g. torchrun); on a single node it matches the GPU index
rank = int(os.environ["RANK"])
device = torch.device(f"cuda:{rank}")
torch.cuda.set_device(device)
torch.distributed.init_process_group("nccl")

# Load the weights on the CPU first so that no rank places the full model on its GPU
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cpu",
)
num_gpus = torch.cuda.device_count()
# 1-D device mesh with one TP rank per visible GPU
tp_mesh = device_mesh.init_device_mesh("cuda", (num_gpus,), mesh_dim_names=("tp",))
model.tensor_parallel(tp_mesh)
model.to(device)  # needed for weights and buffers that are not included by the TP plan

tokenizer = AutoTokenizer.from_pretrained(model_id)
prompt = "Can I help"
inputs = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

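with torch.no_grad():  # inference only, so skip autograd bookkeeping
    outputs = model(inputs)
# Greedy next-token prediction from the logits at the last position
print(tokenizer.decode(outputs.logits.squeeze()[-1].argmax()))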
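One possible way to compare the two loading paths is to print each rank's peak allocated CUDA memory at the end of the script. The helper below is a sketch, not how the numbers above were measured: `report_peak_memory` and `format_gib` are hypothetical names, and `torch.cuda.max_memory_allocated` only counts memory held by PyTorch tensors, so it will read lower than `nvidia-smi`.

```python
import os
from typing import Optional

try:
    import torch
except ImportError:  # keeps the formatting helper usable without PyTorch
    torch = None


def format_gib(num_bytes: int) -> str:
    """Render a byte count in GiB with two decimals."""
    return f"{num_bytes / 1024 ** 3:.2f} GiB"


def report_peak_memory(tag: str) -> Optional[str]:
    """Print this rank's peak allocated CUDA memory; return None without CUDA."""
    if torch is None or not torch.cuda.is_available():
        return None
    rank = int(os.environ.get("RANK", "0"))
    peak = torch.cuda.max_memory_allocated()  # bytes on the current device
    line = f"[rank {rank}] {tag}: peak allocated {format_gib(peak)}"
    print(line)
    return line
```

Calling `report_peak_memory("after forward")` as the last line of the script, once with `device_map="cpu"` and once without it, gives a per-rank figure for each path.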

Your contribution

We can set device_map to "cpu" in PreTrainedModel.from_pretrained when tp_plan is not None, and apply TP at the end, after the weights are loaded.
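The proposed ordering can be sketched in plain Python. This only illustrates the control flow and is not the actual `from_pretrained` code: `resolve_device_map` and `load_steps` are hypothetical helpers, and the real implementation would need to decide how to treat a `device_map` the user passed explicitly.

```python
from typing import Any, List, Optional


def resolve_device_map(device_map: Optional[Any], tp_plan: Optional[Any]) -> Optional[Any]:
    """Hypothetical helper: force a CPU load whenever a TP plan is requested."""
    if tp_plan is not None:
        return "cpu"
    return device_map


def load_steps(device_map: Optional[Any], tp_plan: Optional[Any]) -> List[str]:
    """Hypothetical helper: list the loading steps in the proposed order."""
    target = resolve_device_map(device_map, tp_plan)
    steps = [f"load weights with device_map={target!r}"]
    if tp_plan is not None:
        # TP runs last, so each GPU only ever receives its own shard
        steps.append("apply the tensor parallel plan")
        steps.append("move remaining unsharded weights and buffers to the rank's GPU")
    return steps


for step in load_steps(device_map=None, tp_plan="auto"):
    print(step)
```

This version overrides any `device_map` when a TP plan is set, which matches the proposal as written. Honoring an explicit `device_map` instead would be a one-line change in `resolve_device_map`.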

Happy to discuss this and work on a PR.

CC @kwen2501 @ArthurZucker

Nan2018 added the Feature request label Dec 11, 2024