---
title: Cloud Training Quickstart
description: Welcome to the Cloud Quickstart guide for Simplifine's Train Engine. This guide will help you get up and running quickly with training models in the cloud using Distributed Data Parallel (DDP).
icon: cloud
---

{/* TODO - Update with the config schema */}

## Prerequisites

Before you begin, ensure you have the following:

- A Simplifine API key.
- Access to a cloud GPU (L4 or A100) supported by Simplifine.

To obtain a Simplifine API key with free credit, express interest here.

## Setting Up the Client

The first step is to initialize the Client class with your API key and GPU type. This client will handle communication with the Simplifine servers.

```python
from simplifine_alpha.train_utils import Client

# Initialize the client with your API key and GPU type
api_key = ''  # Enter your Simplifine API key here
gpu_type = 'a100'  # 'l4' or 'a100'
client = Client(api_key=api_key, gpu_type=gpu_type)
```

Replace `''` with your actual Simplifine API key. The `gpu_type` can be `'l4'` or `'a100'`.
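Once the client is initialized, a quick sanity check is to list your jobs. This is our own suggestion, assuming `get_all_jobs` (used later in this guide for monitoring) succeeds only with a valid key:

```python
# If the API key is valid, listing jobs should succeed
jobs = client.get_all_jobs()
print(f'{len(jobs)} jobs found')
```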

## Training a Model with Distributed Data Parallel (DDP)

The example below shows how to use DDP to distribute the training process across multiple GPUs.

### Step 1: Define Your Training Job with DDP

You can train a model using the `sft_train_cloud` method and enable DDP by setting the `use_ddp` parameter to `True`.

```python
client.sft_train_cloud(
    job_name='ddp_job',
    model_name='EleutherAI/gpt-neo-125M',
    dataset_name='my_dataset',
    data_from_hf=True,
    keys=['title', 'abstract', 'explanation'],
    data={'title': ['title 1', 'title 2'], 'abstract': ['abstract 1', 'abstract 2'], 'explanation': ['explanation 1', 'explanation 2']},
    template='### TITLE: {title}\n ### ABSTRACT: {abstract}\n ###EXPLANATION: {explanation}',
    response_template='###EXPLANATION:',
    use_zero=False,
    use_ddp=True
)
```
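To see how `keys`, `data`, and `template` fit together, here is a minimal sketch of how one training example could be rendered from the first row of `data`. The rendering below is our own illustration, not Simplifine's internal code:

```python
# Illustrative only: render the prompt template with the first row of data,
# assuming one training example is built per index across the data lists.
data = {
    'title': ['title 1', 'title 2'],
    'abstract': ['abstract 1', 'abstract 2'],
    'explanation': ['explanation 1', 'explanation 2'],
}
template = '### TITLE: {title}\n ### ABSTRACT: {abstract}\n ###EXPLANATION: {explanation}'

row = {key: values[0] for key, values in data.items()}
print(template.format(**row))
# ### TITLE: title 1
#  ### ABSTRACT: abstract 1
#  ###EXPLANATION: explanation 1
```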

### Step 2: Monitor Your Jobs

After submitting a job, you can check the status of your jobs. The status will be one of: `completed`, `in progress`, or `pending`.

```python
# List your jobs and print the five most recent
status = client.get_all_jobs()
for num, i in enumerate(status[-5:]):
    print(f'Job {num}: {i}')
```

<Accordion title="Output Example">

```
Job 0: {'job_id': '544bb4f0-206f-43b7-850e-5e1e9f7b4d23', 'job_name': 'job-4', 'status': 'completed'}
Job 1: {'job_id': 'bde91132-9776-41ae-89f9-855dfb116a91', 'job_name': 'ddp_job', 'status': 'completed'}
Job 2: {'job_id': 'a1ff54dd-5ee2-4e35-9e78-6868f63dad37', 'job_name': 'zero_example_cloud', 'status': 'completed'}
Job 3: {'job_id': '543d3bc3-3ce4-4af6-9f9a-6c0823dcc9b0', 'job_name': 'ddp_job', 'status': 'in progress'}
Job 4: {'job_id': '5d55d46a-7793-4c06-9cef-279f03a0f953', 'job_name': 'job_1', 'status': 'pending'}
```

</Accordion>
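If you care about one job in particular, you can filter the list returned by `get_all_jobs` by name. A minimal sketch; `latest_job_with_name` is our own hypothetical helper, not part of the library:

```python
# Hypothetical helper: pick out the most recent job with a given name
# from the list returned by get_all_jobs.
def latest_job_with_name(jobs, name):
    matches = [job for job in jobs if job['job_name'] == name]
    return matches[-1] if matches else None

ddp_job = latest_job_with_name(client.get_all_jobs(), 'ddp_job')
if ddp_job is not None:
    print(ddp_job['status'])  # 'completed', 'in progress', or 'pending'
```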

### Step 3: Retrieve Training Logs

You can retrieve the logs for any job to check detailed information about the training process.

```python
# Fetch the training logs for the most recent job
job_id = status[-1]['job_id']
logs = client.get_train_logs(job_id)
print(logs['response'])
```

<Accordion title="Output Example">

```
[2024-07-28 18:13:08,712] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
...
{'train_runtime': 6.5488, 'train_samples_per_second': 73.295, 'train_steps_per_second': 9.162, 'train_loss': 0.1135852018992106, 'epoch': 1.0}
```

</Accordion>
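Since a job reports `pending` or `in progress` until training finishes, you may want to poll before moving on to the download step. A minimal sketch using only `get_all_jobs` and the status strings shown above; the helper and the 30-second interval are our own choices, not part of the library:

```python
import time

# Sketch: poll the job list until the given job reports 'completed'
def wait_for_completion(client, job_id, interval=30):
    while True:
        current = next((job for job in client.get_all_jobs()
                        if job['job_id'] == job_id), None)
        if current is not None and current['status'] == 'completed':
            return current
        time.sleep(interval)

job = wait_for_completion(client, job_id)
```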

### Step 4: Download the Trained Model

Once your model has finished training, you can download it using the `download_model` function.

```python
import os

# Create a folder to store the model (no error if it already exists)
os.makedirs('/content/sf_trained_model', exist_ok=True)

# Download and save the model
client.download_model(job_id=job_id, extract_to='/content/sf_trained_model')
```

<Accordion title="Output Example">

```
Directory downloaded successfully and saved to /content/sf_trained_model/5d55d46a-7793-4c06-9cef-279f03a0f953.zip
Model unzipped successfully to /content/sf_trained_model
Model downloaded, unzipped, and zip file deleted successfully!
```

</Accordion>

You can stop an ongoing job if needed:

```python
stop_running_job = False
if stop_running_job:
    job_id = status[-1]['job_id']
    client.stop_job(job_id)
```

### Step 5: Load and Use the Trained Model

Finally, you can load the trained model and tokenizer to generate text.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the trained model and tokenizer from the download directory
path = '/content/sf_trained_model'
sf_model = AutoModelForCausalLM.from_pretrained(path)
sf_tokenizer = AutoTokenizer.from_pretrained(path)

# Build a prompt in the same format used during training
input_example = '''### TITLE: title 1\n ### ABSTRACT: abstract 1\n ###EXPLANATION: '''

input_example = sf_tokenizer(input_example, return_tensors='pt')

output = sf_model.generate(input_example['input_ids'],
                           attention_mask=input_example['attention_mask'],
                           max_length=30,
                           eos_token_id=sf_tokenizer.eos_token_id,
                           early_stopping=True,
                           pad_token_id=sf_tokenizer.eos_token_id)

print(sf_tokenizer.decode(output[0]))
```
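To query the model with new inputs, you can reuse the training template up to the response marker. A minimal sketch; `explain` is our own hypothetical helper built from the `generate` call above:

```python
# Hypothetical helper: build a prompt in the training format and generate
# an explanation for a new title/abstract pair.
def explain(title, abstract, max_length=60):
    prompt = f'### TITLE: {title}\n ### ABSTRACT: {abstract}\n ###EXPLANATION: '
    inputs = sf_tokenizer(prompt, return_tensors='pt')
    output = sf_model.generate(inputs['input_ids'],
                               attention_mask=inputs['attention_mask'],
                               max_length=max_length,
                               eos_token_id=sf_tokenizer.eos_token_id,
                               pad_token_id=sf_tokenizer.eos_token_id)
    return sf_tokenizer.decode(output[0], skip_special_tokens=True)

print(explain('title 1', 'abstract 1'))
```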

This quickstart guide covered setting up the client, training a model with DDP, monitoring jobs, downloading the trained model, and using it for inference.