
Training fails with big models on some setup #1251

Open

mvesin opened this issue Nov 25, 2024 · 3 comments
Labels: bug (this issue is about reporting and resolving a suspected bug), candidate (an individual developer submits a work request to the team: extension proposal, bug, other request)
mvesin commented Nov 25, 2024

On some setups, training with big models fails:

  1. the node raises `Unexpected error raised by researcher gRPC server in Sender <AioRpcError of RPC that terminated with: status = StatusCode.INTERNAL` and the researcher indicates the node disconnected;

  2. the researcher freezes after completing a training round, having previously logged the error `Message to send is older than 300 seconds. Discard message.`

mvesin added the bug and candidate labels Nov 25, 2024
mvesin commented Nov 25, 2024

Case 1 seems to occur when the delay between (1) the server finishing sending the task request to the node and (2) the node finishing processing that task request, sending another task request, and the researcher receiving and processing it, exceeds GRPC_SERVER_TASK_WAIT_TIMEOUT.

So it is a combination of model size, network delay, and node host delay.

It is confirmed to be solved by increasing GRPC_SERVER_TASK_WAIT_TIMEOUT.

Case 2 may be due to the need to increase MAX_SEND_DURATION together with GRPC_SERVER_TASK_WAIT_TIMEOUT: keep it above GRPC_SERVER_TASK_WAIT_TIMEOUT so that the server retries message sends, and below GRPC_CLIENT_TASK_REQUEST_TIMEOUT, since a message send cannot last longer than the connection's maximum duration.
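The retry behaviour described above can be sketched as a toy loop. This is an illustrative model only, not Fed-BioMed code: `send_with_retries` and `flaky_send` are hypothetical names, and the timeout values are shortened for demonstration.

```python
import time

# Toy model of the server-side retry window: the server keeps retrying a
# message send until MAX_SEND_DURATION elapses. Each attempt waits up to
# GRPC_SERVER_TASK_WAIT_TIMEOUT for the node to poll, so MAX_SEND_DURATION
# must exceed that timeout for at least one retry to be possible.
GRPC_SERVER_TASK_WAIT_TIMEOUT = 2   # seconds, shortened for illustration
MAX_SEND_DURATION = 5               # must stay > the wait timeout

def send_with_retries(try_send) -> bool:
    """Retry `try_send` until it succeeds or MAX_SEND_DURATION elapses."""
    deadline = time.monotonic() + MAX_SEND_DURATION
    while time.monotonic() < deadline:
        if try_send(timeout=GRPC_SERVER_TASK_WAIT_TIMEOUT):
            return True
    return False  # message discarded, as in the "older than 300 seconds" log

attempts = []
def flaky_send(timeout):
    # Simulated send that succeeds on the second attempt.
    attempts.append(timeout)
    return len(attempts) >= 2

print(send_with_retries(flaky_send))  # True: the retry saved the message
```

With MAX_SEND_DURATION at or below GRPC_SERVER_TASK_WAIT_TIMEOUT, the deadline could expire before a retry happens and the message would be discarded.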

mvesin commented Nov 25, 2024

To support setups with very big models and network/host latency, we may target higher values for these parameters, e.g.:

```
GRPC_SERVER_TASK_WAIT_TIMEOUT = 1200
MAX_SEND_DURATION = 4800
GRPC_CLIENT_TASK_REQUEST_TIMEOUT = 21600
```
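As a quick sanity check, these proposed values respect the ordering constraint from the earlier comment (a throwaway snippet, not project code):

```python
# Proposed values from this comment.
GRPC_SERVER_TASK_WAIT_TIMEOUT = 1200      # 20 min
MAX_SEND_DURATION = 4800                  # 80 min
GRPC_CLIENT_TASK_REQUEST_TIMEOUT = 21600  # 6 h

# Above the wait timeout (server send retries remain possible) and
# below the client request timeout (a send stays shorter than the
# maximum connection duration).
assert GRPC_SERVER_TASK_WAIT_TIMEOUT < MAX_SEND_DURATION < GRPC_CLIENT_TASK_REQUEST_TIMEOUT
print("ordering holds")
```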

srcansiz (Member) commented:

> To support setups with very big models and network/host latency, we may target higher values for these parameters, e.g.:
>
> GRPC_SERVER_TASK_WAIT_TIMEOUT = 1200
> MAX_SEND_DURATION = 4800
> GRPC_CLIENT_TASK_REQUEST_TIMEOUT = 21600

Hi @mvesin,

Thanks for the issue, and explaining the possible causes and the solutions.

For case 1: a design similar to the one implemented for node-to-node communication (NodeToNodeRouter) could be an option. The handler submits the task to the router as soon as it is received. Processing the task then no longer blocks the node from polling for new incoming tasks, so the node stays connected. After this modification, the only remaining concern would be the message transfer itself, which depends on transfer speed.
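The hand-off pattern described here can be sketched with an asyncio queue. All names below are hypothetical, not Fed-BioMed API: the point is only that the handler enqueues and returns immediately, so the polling loop is never blocked by slow task processing.

```python
import asyncio

async def handler(queue: asyncio.Queue, task: dict) -> None:
    # Hand the task off immediately; never block the polling loop.
    await queue.put(task)

async def worker(queue: asyncio.Queue, results: list) -> None:
    # Process tasks in the background, one at a time.
    while True:
        task = await queue.get()
        await asyncio.sleep(0)  # stand-in for real (slow) processing
        results.append(task["id"])
        queue.task_done()

async def main() -> list:
    queue: asyncio.Queue = asyncio.Queue()
    results: list = []
    w = asyncio.create_task(worker(queue, results))
    # The "polling loop": receiving new tasks is never delayed by processing.
    for i in range(3):
        await handler(queue, {"id": i})
    await queue.join()  # wait until all queued tasks are processed
    w.cancel()
    return results

print(asyncio.run(main()))  # [0, 1, 2]
```

With this shape, the node keeps its long-poll connection alive regardless of how long an individual task takes, which matches the stated goal of case 1.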

To address such issues and make them more explicit to the end user, introducing a maximum message size for transfers could be a good option, along with fields in the configuration file that let the user set these timeouts relative to the maximum message size they allow. These parameters could be set in the researcher configuration. End users can anticipate network delays after a few tests and configure the timeouts accordingly.
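One way to picture "timeouts relative to the max message size" is to derive them from the worst-case transfer time. The helper below is purely illustrative: `derive_timeouts`, its parameters, and the margin factors are assumptions, not Fed-BioMed configuration keys.

```python
# Hypothetical helper: derive the three timeouts from a user-configured
# maximum message size and a measured/estimated transfer speed.
# Margin factors are illustrative and chosen so that the ordering
# wait < send duration < client request timeout always holds.

def derive_timeouts(max_message_bytes: int,
                    bytes_per_second: float,
                    margin: float = 4.0) -> dict:
    """Scale the timeouts (in seconds) to the worst-case transfer time."""
    transfer_s = max_message_bytes / bytes_per_second
    wait = margin * transfer_s
    return {
        "GRPC_SERVER_TASK_WAIT_TIMEOUT": wait,
        "MAX_SEND_DURATION": 4 * wait,           # > wait: retries possible
        "GRPC_CLIENT_TASK_REQUEST_TIMEOUT": 18 * wait,  # > send duration
    }

# Example: a 1.5 GB model over a ~5 MB/s link (300 s worst-case transfer)
# reproduces the values proposed earlier in this thread.
t = derive_timeouts(max_message_bytes=1_500_000_000, bytes_per_second=5_000_000)
print(t)  # wait = 1200.0, send = 4800.0, client request = 21600.0
```

A few test transfers would give the user a realistic `bytes_per_second`, after which the configuration could be filled in from a rule like this.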
