Running Training Jobs with Ray Jobs
ML training workloads with Ray.
Distributed computing has become a cornerstone of modern machine learning (ML), enabling the processing of large datasets and running complex models efficiently. Ray, an open-source framework, simplifies this by offering an intuitive way to distribute computations. For those in the Google Cloud ecosystem, Ray can be set up using KubeRay on Google Kubernetes Engine (GKE) or as a managed service with Ray on Vertex AI. This guide will walk you through running your first training job using Ray Jobs, focusing on defining the job, the submission process, and how Ray manages the underlying infrastructure.
Why Use Ray for ML Training?
Ray offers a streamlined way to distribute computations across clusters, allowing you to focus on your ML models instead of the complexities of managing clusters. Its Ray Jobs API is especially useful for running training jobs because it automatically handles the creation of Ray clusters, runs the computation, and tears down the resources when the job completes.
Defining Your Ray Training Job
Before submitting a job with the Ray Jobs API, you need to define the Python script that contains your training logic. Here’s an example of a simple training script, `train_model.py`:
```python
import ray
import time
import random

# Initialize Ray; this will automatically connect to the Ray cluster when run with Ray Jobs
ray.init(address='auto')

@ray.remote
def train_step(epoch):
    """Simulate a single training step."""
    time.sleep(random.uniform(0.5, 1.0))  # Simulate computation time
    return f"Completed training for epoch {epoch}"

def train_model(num_epochs=5):
    """Run the training process."""
    results = []
    for epoch in range(1, num_epochs + 1):
        result = ray.get(train_step.remote(epoch))
        results.append(result)
        print(result)
    print("Training complete.")

if __name__ == "__main__":
    train_model()
```
In this script:
- `ray.init(address='auto')`: Automatically connects to the Ray cluster, which is created when you submit the job.
- `@ray.remote`: Wraps the `train_step` function, allowing it to be executed as a distributed task.
- `ray.get()`: Retrieves the results from distributed tasks, blocking until they complete.
This script simulates a simple training process in which each training step runs as a Ray task on a worker in the cluster.
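Note that because the loop calls `ray.get()` on each step before launching the next, the epochs still execute one at a time. When steps are independent of each other, a common Ray pattern is to submit all tasks first and collect the results at the end. The sketch below is a minimal, hypothetical variation on the script above (the `train_step` body is a placeholder, and a running Ray cluster is assumed):

```python
import ray

# Connects to the existing cluster when run via Ray Jobs.
ray.init(address="auto")

@ray.remote
def train_step(epoch):
    """Placeholder for a single training step."""
    return f"Completed training for epoch {epoch}"

def train_model_parallel(num_epochs=5):
    """Submit every step up front, then wait for all results at once."""
    futures = [train_step.remote(epoch) for epoch in range(1, num_epochs + 1)]
    for result in ray.get(futures):  # blocks until all tasks have finished
        print(result)
    print("Training complete.")

if __name__ == "__main__":
    train_model_parallel()
```

Whether this applies to your workload depends on the training loop: steps with sequential dependencies, such as epochs that update shared model state, cannot be parallelized this naively.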
Submitting the Ray Job
Once the training script is ready, you can submit it using the Ray Jobs API. The process differs slightly depending on whether you’re using Ray on Vertex AI or KubeRay on GKE.
Submitting the Job with Ray on Vertex AI
1. Set up a Ray cluster by following Managing Ray Clusters on Vertex AI.
2. Set Up the Environment: Ensure that the Google Cloud SDK and the Vertex AI SDK with its Ray extension are installed:
```bash
pip install "google-cloud-aiplatform[ray]"
```
3. Submit the Job: Assuming you have already set up a Google Cloud project and enabled the Vertex AI API, you can submit the job directly:
```bash
export PROJECT_ID='YOUR_PROJECT_ID'
export CLUSTER_NAME='ray-cluster'
export REGION='us-central1'
export RESOURCE_NAME="projects/$PROJECT_ID/locations/$REGION/persistentResources/$CLUSTER_NAME"

ray job submit --address="vertex_ray://$RESOURCE_NAME" --working-dir=. -- python train_model.py
```
4. Here, Ray executes the training script (`train_model.py`) and then terminates the cluster once the job is complete. This is ideal for workloads that need to run once and then shut down to avoid unnecessary resource consumption.
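If you would rather submit from Python than from the CLI, the Vertex AI SDK's Ray integration also lets you point Ray's `JobSubmissionClient` at the same `vertex_ray://` address. The sketch below is a hedged illustration rather than the canonical path: it assumes `google-cloud-aiplatform[ray]` is installed, that the environment variables from the CLI example are set, and that importing the SDK's `vertex_ray` module is what registers the `vertex_ray://` scheme in your SDK version:

```python
import os

from google.cloud import aiplatform  # installed via google-cloud-aiplatform[ray]
import vertex_ray  # assumption: this import registers the vertex_ray:// address scheme
from ray.job_submission import JobSubmissionClient

# Reuses the environment variables from the CLI example above.
resource_name = os.environ["RESOURCE_NAME"]
aiplatform.init(project=os.environ["PROJECT_ID"], location=os.environ["REGION"])

client = JobSubmissionClient(f"vertex_ray://{resource_name}")
job_id = client.submit_job(
    entrypoint="python train_model.py",
    runtime_env={"working_dir": "./"},  # upload the local script to the cluster
)
print(f"Submitted job with ID: {job_id}")
```

This mirrors the KubeRay submission shown in the next section, so you can keep one submission code path and change only the cluster address.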
Submitting the Job with KubeRay on GKE
If you prefer the flexibility of using KubeRay on GKE, follow these steps:
1. Follow my KubeRay on GKE guide to set up a KubeRay cluster.
2. Port-Forward to the Ray Service: This allows local communication with the Ray cluster’s job API.
```bash
kubectl port-forward svc/raycluster-kuberay-head-svc 10001:10001
```
3. Submit the Job: Use the `JobSubmissionClient` to submit the `train_model.py` script to the Ray cluster:
```python
from ray.job_submission import JobSubmissionClient

# Connect to the port-forwarded Ray head service.
client = JobSubmissionClient("ray://localhost:10001")

job_id = client.submit_job(
    entrypoint="python train_model.py",
    runtime_env={
        "working_dir": "./",
        # Replace CLUSTER_RAY_VERSION with the Ray version running on the cluster.
        "pip": ["numpy", "setuptools<70.0.0", "xgboost", "ray==CLUSTER_RAY_VERSION"],
    },
)
print(f"Submitted job with ID: {job_id}")
```
- `runtime_env`: Defines the working directory and dependencies. You can install your pip requirements here.
- `CLUSTER_RAY_VERSION`: Ensures the Ray version matches between the client and the cluster.
4. Monitor the Job: Use the Ray dashboard URL provided in the output to monitor job progress and resource utilization.
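As an alternative to the dashboard, the same `JobSubmissionClient` can poll the job and fetch its logs programmatically. Here is a minimal sketch; the job ID is a placeholder for the value printed at submission time:

```python
import time

from ray.job_submission import JobSubmissionClient, JobStatus

client = JobSubmissionClient("ray://localhost:10001")
job_id = "raysubmit_XXXXXXXXXXXX"  # placeholder: use the ID returned by submit_job()

# Poll until the job reaches a terminal state.
while True:
    status = client.get_job_status(job_id)
    print(f"Job status: {status}")
    if status in {JobStatus.SUCCEEDED, JobStatus.FAILED, JobStatus.STOPPED}:
        break
    time.sleep(5)

# Print everything the job wrote to stdout and stderr.
print(client.get_job_logs(job_id))
```

The Ray Jobs CLI offers the same information via `ray job status` and `ray job logs` if you prefer the command line.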
This setup allows you to create a Ray cluster, run a job, and tear it down afterward if desired, giving you flexibility in resource management while leveraging GKE’s managed Kubernetes environment.
Monitoring and Managing the Job
While the job is running, you can monitor its progress using the Ray Dashboard.
KubeRay
You can set up port forwarding to access the dashboard:
```bash
kubectl port-forward svc/raycluster-kuberay-head-svc 8265:8265
```
Access the Ray dashboard by navigating to `http://localhost:8265` in your browser. The dashboard provides real-time insights into task execution, cluster utilization, and resource allocation.
Ray on Vertex AI
You can access the dashboard through the Google Cloud console:
- In the Google Cloud console, go to the Ray on Vertex AI page.
- In the row for the cluster you created, select the more actions menu.
- Select the Ray OSS dashboard link. The dashboard opens in another tab.
Benefits of Using Ray Jobs for ML Training
- Automated Cluster Management: The Ray Jobs API handles the creation and teardown of clusters, ensuring that you only use resources when you need them.
- Scalable Computation: Ray’s ability to distribute training tasks across multiple workers enables faster training times, especially for large datasets.
- Cost Efficiency: By automatically terminating clusters after job completion, you can minimize cloud costs, only paying for the compute time you use.
Conclusion
Ray’s ability to manage distributed tasks through the Ray Jobs API simplifies the process of running ML training workloads. Whether you choose the managed service approach with Ray on Vertex AI or the more flexible KubeRay setup on GKE, you can efficiently scale your training processes without the overhead of managing infrastructure. With the provided script and submission process, you’re ready to run your first training job using Ray and take advantage of the power of distributed ML training.
See my other posts for more on Ray:
- Part 1: Running Ray on Google Cloud
- Part 2: Managing Ray Clusters on Vertex AI
- Part 3: Managing Ray Clusters with KubeRay on GKE
- Part 4: This article
- Part 5: Serving a model with Ray Serve
Stay tuned for more advanced tutorials, where we’ll cover optimizing resource configurations and integrating with other Google Cloud services to streamline your training workflows!