Adds Ray Workflow: Multiple Run Support, Distributed Hyperparameter Tuning, and Consistent Setup Across Local/Cloud #1301

glvov-bdai · 2024-10-25T09:45:30Z

Description

This PR adds Ray support. Building on the existing Hydra support, it enables the following, among other things:

Several training runs at once in parallel or consecutively with minimal interaction
Using the same training setup everywhere (on cloud and local) with minimal overhead
Tuning hyperparameters
Tuning hyperparameters in parallel on multiple GPUs and/or multiple GPU Nodes
Simultaneously tuning model hyperparameters for different environments/agents
Resource Isolation
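
As a rough illustration of how a hyperparameter sweep can ride on Hydra-style command-line overrides, the sketch below expands a small search space into one override list per run. It is not the code in this PR: the function name `expand_overrides` and the keys (`agent.learning_rate`, `env.num_envs`) are hypothetical, and only the `key=value` override syntax comes from Hydra itself.

```python
from itertools import product


def expand_overrides(search_space):
    """Expand {key: [values]} into one list of Hydra-style overrides per run.

    Hypothetical helper for illustration; not part of this PR.
    """
    keys = sorted(search_space)  # sort for a deterministic run order
    runs = []
    for combo in product(*(search_space[k] for k in keys)):
        runs.append([f"{k}={v}" for k, v in zip(keys, combo)])
    return runs


# Hypothetical keys; real names depend on the task's Hydra config.
space = {"agent.learning_rate": [1e-4, 1e-3], "env.num_envs": [512, 1024]}
runs = expand_overrides(space)
for overrides in runs:
    print(" ".join(overrides))
```

Each override list could then be appended to a training command, with a scheduler such as Ray dispatching every run to its own GPU.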

I know this PR seems huge, but most of the code diff is config files, argparser code, documentation, and comments.

My main project at BDAI is changing from RL to LfD effective November 1st, so I'm posting this PR as early as possible so I have as much time as possible to address comments.

It would be much appreciated if the NVIDIA folks could work with me to get this reviewed ASAP. I realize that this is a pretty big PR, but I also think that it adds a lot of cool functionality, and the merging process will go much more smoothly if I am able to devote time to this while at work. Thanks! ;)

Fixes #1190, #1213

Type of change

New feature (non-breaking change which adds functionality)
This change requires a documentation update

Screenshots

Checklist

I have run the pre-commit checks with ./isaaclab.sh --format
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
I have updated the changelog and the corresponding version in the extension's config/extension.toml file
I have added my name to the CONTRIBUTORS.md or my name already exists there

Co-authored-by: James Smith <[email protected]> Signed-off-by: garylvov <[email protected]>

Signed-off-by: glvov-bdai <[email protected]>

docs/source/features/ray.rst

jsmith-bdai

LGTM, thanks for adding!

sky-bro · 2025-03-12T07:13:21Z

Hi, have you tried using runtime environments to specify different pip packages? For me it always throws an error saying that it failed to set up the runtime environment because some packages could not be found.
https://docs.ray.io/en/latest/ray-core/handling-dependencies.html#runtime-environments

But if I log in to the node, run python, and import the package, no error occurs.

Does this have something to do with the PYTHONPATH variable as in /isaac-sim/setup_python_env.sh?
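
One way to narrow this down is to run the same diagnostic in both places (interactively on the node and inside a Ray task) and compare the results. The sketch below uses only the standard library; `describe_python_env` is a hypothetical helper, and whether `PYTHONPATH` is actually the cause here is an assumption to test, not a confirmed diagnosis.

```python
import importlib.util
import os
import sys


def describe_python_env(package_names):
    """Report which interpreter is running and which packages it can find.

    Hypothetical helper for debugging; not part of Isaac Lab or Ray.
    """
    return {
        "executable": sys.executable,
        "pythonpath": os.environ.get("PYTHONPATH", ""),
        # find_spec returns None when a top-level module is not importable
        "found": {
            name: importlib.util.find_spec(name) is not None
            for name in package_names
        },
    }


# The second name is a deliberately nonexistent package for contrast.
report = describe_python_env(["json", "surely_not_installed_pkg_xyz"])
print(report)
```

Wrapping the function with `ray.remote` and calling it on the cluster would give the worker-side view to compare against the interactive shell; a different `executable` or `pythonpath` between the two would point at the environment setup.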

garylvov and others added 30 commits September 27, 2024 00:14

start

67122dc

add feature extraction

2d207b5

blank

1aa2832

further

e4c395f

add args

c53d987

Merge branch 'isaac-sim:main' into feature/hyperparam_tune

50862cc

formatting

905bed1

tweaks

6909501

fix

2577827

allow jobs to actually get scheduled

c21b2f5

add dockerfile

ceba315

formatting

7771439

tweaks

6563e1f

get gcp cluster working with ray, and isaac

b27092f

make bash command consistent

b94fe87

tweaks

9a525ec

formatting

e6e9f85

formatting

1187ca0

fix argparser

83cb89a

formatting

1885d0b

cleanup command

653b8ae

start argparser

dc9fb3f

sync

5f9f0dd

Merge branch 'isaac-sim:main' into feature/hyperparam_tune

3cde9e4

formatting

db975bc

cherrypick ResNet Cart from PR

c80d278

add extra point in readme

7fd0169

add note about saving

4dd48b1

fixes

873ea54

Merge branch 'isaac-sim:main' into feature/hyperparam_tune

93fbff3

garylvov and others added 10 commits November 4, 2024 20:46

Update source/standalone/workflows/ray/isaac_ray_tune.py

8311bc6

Co-authored-by: James Smith <[email protected]> Signed-off-by: garylvov <[email protected]>

Update source/standalone/workflows/ray/isaac_ray_tune.py

9302017

Co-authored-by: James Smith <[email protected]> Signed-off-by: garylvov <[email protected]>

address james' comments

34afd2a

delete old file and fix imports

70aa958

format

5ba620d

change top level to be caps'

80b5df5

fix docstrings and typos

5027656

Merge branch 'main' into feature/hyperparam_tune

b293b48

fix weird bolding thing

aa92a9d

fix emphasize lines and included files in rst

34f0908

glvov-bdai requested a review from jsmith-bdai November 5, 2024 04:11

glvov-bdai and others added 11 commits November 7, 2024 13:23

Merge branch 'main' into feature/hyperparam_tune

0856cd6

Merge branch 'main' into feature/hyperparam_tune

7cc587a

Merge branch 'main' into feature/hyperparam_tune

a38de8e

Merge branch 'main' into feature/hyperparam_tune

30a63ff

Merge branch 'main' into feature/hyperparam_tune

ad8161d

Signed-off-by: glvov-bdai <[email protected]>

Merge branch 'main' into feature/hyperparam_tune

1f14c86

Merge branch 'main' into feature/hyperparam_tune

027e529

Add clarification about local with and without docker

a9d690a

Signed-off-by: glvov-bdai <[email protected]>

remove trailing whitespace

86e190f

Signed-off-by: glvov-bdai <[email protected]>

remove erroneous comment

9b74068

Signed-off-by: glvov-bdai <[email protected]>

Merge branch 'main' into feature/hyperparam_tune

299f9f3

kellyguo11 approved these changes Dec 13, 2024

glvov-bdai added 3 commits December 13, 2024 13:16

Merge branch 'main' into feature/hyperparam_tune

9879b3f

add linux only and break down install

57c6e69

fix typo in dep

5be17f0

jsmith-bdai approved these changes Dec 13, 2024

jsmith-bdai merged commit 286e1ee into isaac-sim:main Dec 13, 2024
5 checks passed

giulioturrisi mentioned this pull request Jan 14, 2025

Usage IsaacLabExtension with Ray Tuner isaac-sim/IsaacLabExtensionTemplate#47

Closed
