mvesin opened this issue on Nov 25, 2024 · 3 comments
On some setups, training with big models fails:
1. The node raises
   Unexpected error raised by researcher gRPC server in Sender <AioRpcError of RPC that terminated with: status = StatusCode.INTERNAL
   and the researcher indicates that the node disconnected.
2. The researcher freezes after completing a training round, having previously logged the error message
   Message to send is older than 300 seconds. Discard message.
mvesin added the labels bug (this issue is about reporting and resolving a suspected bug) and candidate (an individual developer submits a work request to the team: extension proposal, bug, other request) on Nov 25, 2024
Case 1 seems to occur when the delay exceeds GRPC_SERVER_TASK_WAIT_TIMEOUT between two points: (1) the server finishes sending the task request to the node, and (2) the node finishes processing that task request and sends another task request, which the researcher receives and processes.
So it depends on a combination of model size, network delay, and node host delay.
Increasing GRPC_SERVER_TASK_WAIT_TIMEOUT is confirmed to solve it.
Case 2 may be due to the need to increase MAX_SEND_DURATION in line with GRPC_SERVER_TASK_WAIT_TIMEOUT: keep it above GRPC_SERVER_TASK_WAIT_TIMEOUT (so that the server retries sending messages) and below GRPC_CLIENT_TASK_REQUEST_TIMEOUT (sending a message cannot take longer than the maximum connection duration).
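The ordering constraint described for case 2 can be sketched as a small consistency check. This is only an illustration: the function name `check_timeout_ordering` and the numeric values in the usage lines are hypothetical, and it does not reflect where or how these constants are actually defined in the code base.

```python
def check_timeout_ordering(server_task_wait: float,
                           max_send_duration: float,
                           client_task_request: float) -> list:
    """Return a list of violated constraints (empty if the values are consistent).

    The arguments stand for GRPC_SERVER_TASK_WAIT_TIMEOUT, MAX_SEND_DURATION and
    GRPC_CLIENT_TASK_REQUEST_TIMEOUT, all in seconds. Hypothetical helper.
    """
    problems = []
    # Sends must outlive the server task wait, otherwise no retry can happen.
    if max_send_duration <= server_task_wait:
        problems.append(
            "MAX_SEND_DURATION should be above GRPC_SERVER_TASK_WAIT_TIMEOUT"
        )
    # A send cannot last longer than the connection it travels on.
    if max_send_duration >= client_task_request:
        problems.append(
            "MAX_SEND_DURATION should be below GRPC_CLIENT_TASK_REQUEST_TIMEOUT"
        )
    return problems


# Usage with made-up values (seconds): the first set is consistent, the second is not.
print(check_timeout_ordering(600, 900, 3600))
print(check_timeout_ordering(600, 300, 3600))
```

Running such a check when the timeouts are changed would flag the situation where GRPC_SERVER_TASK_WAIT_TIMEOUT was raised to fix case 1 but MAX_SEND_DURATION was left unchanged.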
Thanks for the issue, and for explaining the possible causes and solutions.
For case 1, a design similar to the one implemented for node-to-node communication (NodeToNodeRouter) could be an option: the handler submits the task to the router as soon as it is received. Processing the task then no longer blocks the node from polling for new incoming tasks, so the node stays connected. After this modification, the only remaining potential problem is the message transfer itself, which depends on transfer speed.
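The idea of handing tasks to a router so that polling is never blocked can be sketched with plain asyncio. This is a minimal illustration under assumptions: `TaskRouter`, `poll_tasks` and the task names are hypothetical stand-ins and do not reproduce the actual NodeToNodeRouter API.

```python
import asyncio


class TaskRouter:
    """Hypothetical router: queues tasks and processes them in the background."""

    def __init__(self):
        self._queue = asyncio.Queue()
        self.processed = []

    def submit(self, task):
        # Returns immediately; the caller is free to poll again.
        self._queue.put_nowait(task)

    async def run(self):
        # Background worker: takes tasks one by one and processes them.
        while True:
            task = await self._queue.get()
            await asyncio.sleep(0.05)  # simulate a long-running task
            self.processed.append(task)
            self._queue.task_done()

    async def join(self):
        # Wait until every submitted task has been processed.
        await self._queue.join()


async def poll_tasks(router, incoming):
    """Simulated polling loop: hand each task over, then poll again right away."""
    polled = []
    for task in incoming:
        router.submit(task)
        polled.append(task)
        await asyncio.sleep(0)  # yield control, as a real network poll would
    return polled


async def demo():
    router = TaskRouter()
    worker = asyncio.create_task(router.run())
    polled = await poll_tasks(router, ["train-1", "train-2", "train-3"])
    done_when_polling_finished = len(router.processed)
    await router.join()
    worker.cancel()
    return polled, done_when_polling_finished, router.processed


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In this sketch the polling loop receives all three tasks before any of them has finished processing, which is the property that would keep the node connected while a long task runs.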
To address such issues and make them more explicit to the end user, introducing a maximum message size for transfers could be a nice option, along with configuration file fields that let the user set these timeouts relative to the maximum message size they allow. These parameters can be set in the researcher configuration. End users can anticipate network delays after a few tests and configure the timeouts accordingly.
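One way to relate timeouts to a maximum message size is sketched below. Everything here is an assumption made for illustration: the `TransferSettings` class, its field names, the safety factor, and the example numbers are hypothetical and are not existing configuration options.

```python
from dataclasses import dataclass


@dataclass
class TransferSettings:
    """Hypothetical researcher-side settings for message transfer."""
    max_message_size_bytes: int            # largest message the user allows
    measured_bandwidth_bytes_per_s: float  # estimated from a few test runs
    safety_factor: float = 2.0             # margin for network and host delays

    def transfer_time_budget(self) -> float:
        """Seconds to allow for sending the largest permitted message."""
        return (self.safety_factor * self.max_message_size_bytes
                / self.measured_bandwidth_bytes_per_s)

    def check_message(self, size_bytes: int) -> None:
        """Reject a message early instead of letting it time out mid-transfer."""
        if size_bytes > self.max_message_size_bytes:
            raise ValueError(
                f"message of {size_bytes} bytes exceeds the configured maximum "
                f"of {self.max_message_size_bytes} bytes"
            )


# Usage with made-up numbers: a 1 GB limit on a link measured at 10 MB/s.
settings = TransferSettings(1_000_000_000, 10_000_000)
print(settings.transfer_time_budget())
settings.check_message(500_000_000)  # within the limit, no error
```

The resulting budget could then serve as a lower bound when choosing the timeouts discussed above, so that an oversized model produces an explicit error rather than a disconnection.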