
Training fails with big models on some setup #1251

Open

mvesin opened this issue Nov 25, 2024 · 3 comments
Labels: bug (this issue is about reporting and resolving a suspected bug), candidate (an individual developer submits a work request to the team: extension proposal, bug, other request)
mvesin commented Nov 25, 2024

On some setups, training with big models fails:

  1. the node raises `Unexpected error raised by researcher gRPC server in Sender <AioRpcError of RPC that terminated with: status = StatusCode.INTERNAL` and the researcher indicates the node disconnected;

  2. the researcher freezes after completing a training round, having previously logged the error `Message to send is older than 300 seconds. Discard message.`

mvesin added the bug and candidate labels Nov 25, 2024
mvesin commented Nov 25, 2024

Case 1 seems to occur when the delay between (1) the server finishing sending the task request to the node and (2) the node finishing processing that task request, sending another task request, and the researcher receiving and processing it, exceeds GRPC_SERVER_TASK_WAIT_TIMEOUT.

So it is a combination of model size, network delay, and node host delay.

It is confirmed to be solved by increasing GRPC_SERVER_TASK_WAIT_TIMEOUT.

Case 2 may be due to the need to increase MAX_SEND_DURATION together with GRPC_SERVER_TASK_WAIT_TIMEOUT: keep it above GRPC_SERVER_TASK_WAIT_TIMEOUT so that the server retries message sends, and below GRPC_CLIENT_TASK_REQUEST_TIMEOUT, since a message send cannot last longer than the connection's maximum duration.
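The retry behaviour described above can be sketched as a toy loop. This is an illustrative model only, not Fed-BioMed code: `send_with_retries` and `flaky_send` are hypothetical names, and the timeout values are shortened for demonstration.

```python
import time

# Toy model of the server-side retry window: the server keeps retrying a
# message send until MAX_SEND_DURATION elapses. Each attempt waits up to
# GRPC_SERVER_TASK_WAIT_TIMEOUT for the node to poll, so MAX_SEND_DURATION
# must exceed that timeout for at least one retry to be possible.
GRPC_SERVER_TASK_WAIT_TIMEOUT = 2   # seconds, shortened for illustration
MAX_SEND_DURATION = 5               # must stay > the wait timeout

def send_with_retries(try_send) -> bool:
    """Retry `try_send` until it succeeds or MAX_SEND_DURATION elapses."""
    deadline = time.monotonic() + MAX_SEND_DURATION
    while time.monotonic() < deadline:
        if try_send(timeout=GRPC_SERVER_TASK_WAIT_TIMEOUT):
            return True
    return False  # message discarded, as in the "older than 300 seconds" log

attempts = []
def flaky_send(timeout):
    # Simulated send that succeeds on the second attempt.
    attempts.append(timeout)
    return len(attempts) >= 2

print(send_with_retries(flaky_send))  # True: the retry saved the message
```

With MAX_SEND_DURATION at or below GRPC_SERVER_TASK_WAIT_TIMEOUT, the deadline could expire before a retry happens and the message would be discarded.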

mvesin commented Nov 25, 2024

To support setups with very big models and network/host latency, we may target higher values for these parameters, e.g.:

```
GRPC_SERVER_TASK_WAIT_TIMEOUT = 1200
MAX_SEND_DURATION = 4800
GRPC_CLIENT_TASK_REQUEST_TIMEOUT = 21600
```
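As a quick sanity check, these proposed values respect the ordering constraint from the earlier comment (a throwaway snippet, not project code):

```python
# Proposed values from this comment.
GRPC_SERVER_TASK_WAIT_TIMEOUT = 1200      # 20 min
MAX_SEND_DURATION = 4800                  # 80 min
GRPC_CLIENT_TASK_REQUEST_TIMEOUT = 21600  # 6 h

# Above the wait timeout (server send retries remain possible) and
# below the client request timeout (a send stays shorter than the
# maximum connection duration).
assert GRPC_SERVER_TASK_WAIT_TIMEOUT < MAX_SEND_DURATION < GRPC_CLIENT_TASK_REQUEST_TIMEOUT
print("ordering holds")
```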

srcansiz (Member) commented:

> To support setups with very big models and network/host latency, we may target higher values for these parameters, e.g.:
>
> GRPC_SERVER_TASK_WAIT_TIMEOUT = 1200
> MAX_SEND_DURATION = 4800
> GRPC_CLIENT_TASK_REQUEST_TIMEOUT = 21600

Hi @mvesin,

Thanks for the issue, and explaining the possible causes and the solutions.

For case 1: a design similar to the one implemented for node-to-node communication (NodeToNodeRouter) could be an option. The handler submits the task to the router as soon as it is received. Processing the task then no longer blocks the node from polling for new incoming tasks, so the node stays connected. After this modification, the only remaining concern would be the message transfer itself, which depends on transfer speed.
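The hand-off pattern described here can be sketched with an asyncio queue. All names below are hypothetical, not Fed-BioMed API: the point is only that the handler enqueues and returns immediately, so the polling loop is never blocked by slow task processing.

```python
import asyncio

async def handler(queue: asyncio.Queue, task: dict) -> None:
    # Hand the task off immediately; never block the polling loop.
    await queue.put(task)

async def worker(queue: asyncio.Queue, results: list) -> None:
    # Process tasks in the background, one at a time.
    while True:
        task = await queue.get()
        await asyncio.sleep(0)  # stand-in for real (slow) processing
        results.append(task["id"])
        queue.task_done()

async def main() -> list:
    queue: asyncio.Queue = asyncio.Queue()
    results: list = []
    w = asyncio.create_task(worker(queue, results))
    # The "polling loop": receiving new tasks is never delayed by processing.
    for i in range(3):
        await handler(queue, {"id": i})
    await queue.join()  # wait until all queued tasks are processed
    w.cancel()
    return results

print(asyncio.run(main()))  # [0, 1, 2]
```

With this shape, the node keeps its long-poll connection alive regardless of how long an individual task takes, which matches the stated goal of case 1.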

To address such issues and make them more explicit to the end user, introducing a maximum message size for transfers could be a good option, along with fields in the configuration file that let the user set these timeouts relative to the maximum message size they allow. These parameters could be set in the researcher configuration. End users can anticipate network delays after a few tests and configure the timeouts accordingly.
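One way to picture "timeouts relative to the max message size" is to derive them from the worst-case transfer time. The helper below is purely illustrative: `derive_timeouts`, its parameters, and the margin factors are assumptions, not Fed-BioMed configuration keys.

```python
# Hypothetical helper: derive the three timeouts from a user-configured
# maximum message size and a measured/estimated transfer speed.
# Margin factors are illustrative and chosen so that the ordering
# wait < send duration < client request timeout always holds.

def derive_timeouts(max_message_bytes: int,
                    bytes_per_second: float,
                    margin: float = 4.0) -> dict:
    """Scale the timeouts (in seconds) to the worst-case transfer time."""
    transfer_s = max_message_bytes / bytes_per_second
    wait = margin * transfer_s
    return {
        "GRPC_SERVER_TASK_WAIT_TIMEOUT": wait,
        "MAX_SEND_DURATION": 4 * wait,           # > wait: retries possible
        "GRPC_CLIENT_TASK_REQUEST_TIMEOUT": 18 * wait,  # > send duration
    }

# Example: a 1.5 GB model over a ~5 MB/s link (300 s worst-case transfer)
# reproduces the values proposed earlier in this thread.
t = derive_timeouts(max_message_bytes=1_500_000_000, bytes_per_second=5_000_000)
print(t)  # wait = 1200.0, send = 4800.0, client request = 21600.0
```

A few test transfers would give the user a realistic `bytes_per_second`, after which the configuration could be filled in from a rule like this.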
