Description
I'm proposing that support running rayfed job in single-controller mode.
I'd like to propose 2 options on how we startup the single-controller cluster and how we connect to the cluster and run our jobs.
option 1
Add a new cli toolkit to start the cluster, it just wrapper the ray cli toolkit, for example:
A. running single-controller mode
> rayfed start --head --mode=single-controller --party=ALICE # node1, listening on 1.2.3.4:5555
> rayfed start --address="1.2.3.4:5555" --party=ALICE # node2, connecting to the node1
> rayfed start --address="1.2.3.4:5555" --party=BOB # node3, connecting to the node1
> rayfed start --address="1.2.3.4:5555" --party=BOB # node4, connecting to the node1
And then, the job could be run in single controller mode automatically:
# main.py
fed.init(address="1.2.3.4:5555", xxx)
# Nothing need to be changed in this job script.
B. running multiple-controller mode
> rayfed start --head --mode=multiple-controller --party=ALICE # node1, listening on 1.2.3.4:5555
> rayfed start --address="1.2.3.4:5555" --party=ALICE # node2, connecting to the node1
> rayfed start --head --mode=multiple-controller --party=BOB # node3, listening on 5.6.7.8:6666
> rayfed start --address="5.6.7.8:6666" --party=BOB # node4, connecting to the node3
And then, you run the following script in 2 clusters:
# main.py
fed.init(address="1.2.3.4:5555", xxx)
# nothing need to be changed in this job script.
# in node2
> python main.py --party=ALICE
# in node3
> python main.py --party=BOB
option 2
No need to add a new toolkit, but we should tell users that add some extra arguments when starting up the Ray cluster.
For example,
A. running single-controller mode
> ray start --head --resources={"_PARTY_ALICE", 9999} # node1, listening on 1.2.3.4:5555
> ray start --address="1.2.3.4:5555" --resources={"_PARTY_ALICE", 9999} # node2, connecting to the node1
> ray start --address="1.2.3.4:5555" --resources={"_PARTY_BOB", 9999} # node3, connecting to the node1
> ray start --address="1.2.3.4:5555" --resources={"_PARTY_BOB", 9999} # node4, connecting to the node1
And then, add the extra mode info when fed.init()
:
# main.py
fed.init(address="1.2.3.4:5555", mode="single-controller", xxx)
# Nothing need to be changed in this job script.
A. running multiple-controller mode
> ray start --head # node1, listening on 1.2.3.4:5555
> ray start --address="1.2.3.4:5555" # node2, connecting to the node1
> ray start --head # node3, listening on 5.6.7.8:6666
> ray start --address="5.6.7.8:6666" # node4, connecting to the node3
And then, add the extra mode info when fed.init()
(And we could ignore it if we provide a default value):
# main.py
fed.init(address="1.2.3.4:5555", mode="multiple-controller", xxx)
# Nothing need to be changed in this job script.