[RFC] Support running in different modes. #29

Open
@jovany-wang

Description

I propose supporting running RayFed jobs in single-controller mode.

I'd like to propose two options for how we start up the single-controller cluster, connect to it, and run our jobs.

Option 1

Add a new CLI toolkit to start the cluster. It is just a thin wrapper around the Ray CLI, for example:

A. running single-controller mode

> rayfed start --head --mode=single-controller --party=ALICE  # node1, listening on 1.2.3.4:5555
> rayfed start --address="1.2.3.4:5555" --party=ALICE  # node2, connecting to the node1
> rayfed start --address="1.2.3.4:5555" --party=BOB  # node3, connecting to the node1
> rayfed start --address="1.2.3.4:5555" --party=BOB  # node4, connecting to the node1

The job then runs in single-controller mode automatically:

# main.py
fed.init(address="1.2.3.4:5555", xxx)
# Nothing needs to be changed in this job script.

B. running multiple-controller mode

> rayfed start --head --mode=multiple-controller --party=ALICE  # node1, listening on 1.2.3.4:5555
> rayfed start --address="1.2.3.4:5555" --party=ALICE  # node2, connecting to the node1
> rayfed start --head --mode=multiple-controller --party=BOB  # node3, listening on 5.6.7.8:6666
> rayfed start --address="5.6.7.8:6666" --party=BOB  # node4, connecting to the node3

Then run the following script in each of the 2 clusters:

# main.py
fed.init(address="1.2.3.4:5555", xxx)
# Nothing needs to be changed in this job script.
# in node2
> python main.py --party=ALICE
# in node3
> python main.py --party=BOB
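
As a rough illustration of the "thin wrapper" idea in option 1, the sketch below translates the proposed `rayfed start` options into a `ray start` argument list. The function name `build_ray_start_args`, the `capacity` parameter, and the `_PARTY_` prefix constant are hypothetical; the resource naming is borrowed from the `_PARTY_ALICE` example in option 2. How `--mode` would be recorded in the cluster is not specified in this proposal, so it is left out.

```python
import json
from typing import List, Optional

# Assumed resource-name prefix, taken from the `_PARTY_ALICE` example in option 2.
PARTY_RESOURCE_PREFIX = "_PARTY_"


def build_ray_start_args(
    party: str,
    head: bool = False,
    address: Optional[str] = None,
    capacity: int = 9999,
) -> List[str]:
    """Translate hypothetical `rayfed start` options into a `ray start` argv."""
    # A node is either the head or it joins an existing head, never both or neither.
    if head == (address is not None):
        raise ValueError("pass exactly one of head=True or address")

    args = ["ray", "start"]
    if head:
        args.append("--head")
    else:
        args.append(f"--address={address}")

    # Tag the node with its party through a Ray custom resource (JSON-encoded).
    resources = {f"{PARTY_RESOURCE_PREFIX}{party}": capacity}
    args.append(f"--resources={json.dumps(resources)}")
    return args
```

The returned list could be handed to `subprocess.run(args, check=True)`, which avoids shell quoting issues with the JSON value.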

Option 2

No new toolkit is needed, but we should tell users to add some extra arguments when starting the Ray cluster.
For example,

A. running single-controller mode

> ray start --head --resources='{"_PARTY_ALICE": 9999}'  # node1, listening on 1.2.3.4:5555
> ray start --address="1.2.3.4:5555" --resources='{"_PARTY_ALICE": 9999}'  # node2, connecting to the node1
> ray start --address="1.2.3.4:5555" --resources='{"_PARTY_BOB": 9999}'  # node3, connecting to the node1
> ray start --address="1.2.3.4:5555" --resources='{"_PARTY_BOB": 9999}'  # node4, connecting to the node1

Then pass the mode explicitly when calling fed.init():

# main.py
fed.init(address="1.2.3.4:5555", mode="single-controller", xxx)
# Nothing else needs to be changed in this job script.
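
One way the party-tagged nodes could be used in single-controller mode is to have each task request a small amount of its party's custom resource, so Ray only places it on that party's nodes. The helper below is a minimal sketch under that assumption; `party_options` is a hypothetical name, and this proposal does not say that RayFed schedules tasks this way.

```python
from typing import Dict


def party_options(party: str, amount: float = 1) -> Dict[str, Dict[str, float]]:
    """Build keyword arguments that request a share of a party's custom resource."""
    # The `_PARTY_<NAME>` key mirrors the resource names used in the `ray start` commands above.
    return {"resources": {f"_PARTY_{party}": amount}}


# Usage with Ray (not executed here), assuming `train` is a @ray.remote function:
#   train.options(**party_options("ALICE")).remote(data)
```

With 9999 units of the resource per node, the requested amount acts as a placement tag rather than a real capacity limit.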

B. running multiple-controller mode

> ray start --head # node1, listening on 1.2.3.4:5555
> ray start --address="1.2.3.4:5555" # node2, connecting to the node1
> ray start --head # node3, listening on 5.6.7.8:6666
> ray start --address="5.6.7.8:6666" # node4, connecting to the node3

Then pass the mode explicitly when calling fed.init() (it can be omitted if we provide a default value):

# main.py
fed.init(address="1.2.3.4:5555", mode="multiple-controller", xxx)
# Nothing else needs to be changed in this job script.
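
The optional `mode` argument could be handled roughly as follows. This is a sketch only: `resolve_mode` is a hypothetical helper, and choosing "multiple-controller" as the default is an assumption, since the proposal only says a default could be provided.

```python
from typing import Optional

SINGLE_CONTROLLER = "single-controller"
MULTIPLE_CONTROLLER = "multiple-controller"

# Assumption: multiple-controller is the default; the proposal does not fix this.
DEFAULT_MODE = MULTIPLE_CONTROLLER


def resolve_mode(mode: Optional[str] = None) -> str:
    """Return the mode to run in, falling back to the default when omitted."""
    if mode is None:
        return DEFAULT_MODE
    if mode not in (SINGLE_CONTROLLER, MULTIPLE_CONTROLLER):
        raise ValueError(f"unknown mode: {mode!r}")
    return mode
```

Rejecting unknown values early keeps a typo such as `mode="single"` from silently selecting the default.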
