Fix data race in ML training during host stop#21844
Merged
stelfrag merged 1 commit intonetdata:masterfrom Mar 2, 2026
Merged
Conversation
Contributor
There was a problem hiding this comment.
No issues found across 2 files
Confidence score: 5/5
- Automated review surfaced no issues in the provided summaries.
- No files require special attention.
Architecture diagram
sequenceDiagram
participant Host as ML Host Controller
participant State as ML Host State
participant Worker as ML Worker Thread
participant Dim as ML Dimension State
Note over Host, Dim: Host Shutdown Sequence (ml_host_stop)
Host->>State: NEW: Set ml_running = false
Note right of State: Prevents new ML tasks from starting immediately
par Concurrent Execution
Worker->>Dim: ml_dimension_update_models()
Dim->>Dim: Lock slock
Dim->>State: Check ml_running status
alt NEW: ml_running is false
Dim->>Dim: Set training_in_progress = false
Dim-->>Worker: Early Exit
else ml_running is true
Dim->>Dim: Standard model update flow
end
Dim->>Dim: Unlock slock
and State Cleanup
Host->>Dim: Lock slock
Host->>Dim: CHANGED: Clear km_contexts
Note right of Dim: Prevents use of stale models during reset
Host->>Dim: Unlock slock
end
Note over Worker, Dim: Model Inlining Flow (ml_worker_add_existing_model)
Worker->>Worker: Prepare model data
Worker->>Dim: CHANGED: Assign Dim->kmeans (Outside slock)
Note over Worker, Dim: Safe: per-host work is serialized via worker queue
Worker->>Dim: ml_dimension_update_models()
…afety in k-means dimension handling
1ada3fc to
6b9f8b4
Compare
Contributor
There was a problem hiding this comment.
Pull request overview
This PR aims to eliminate a shutdown/reset-time data race in the ML training pipeline by preventing model publication/updates while a host is being stopped and by tightening the rules around when k-means model state is updated.
Changes:
- Set
ml_running = falseat the start ofml_host_stop()to block new ML activity during reset. - During host stop, clear
dim->km_contexts(rather than reinitializingdim->kmeans) to avoid cross-thread writes toDim->kmeans. - In
ml_dimension_update_models(), bail out early when the host is not running and ensuretraining_in_progressis cleared; in the worker, rely on per-host queue serialization to assignDim->kmeansoutside the dimension spinlock.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
src/ml/ml_public.cc |
Disables ML earlier during host stop and resets per-dimension model context state safely. |
src/ml/ml.cc |
Prevents model updates when the host is stopped and adjusts k-means assignment/update flow to avoid cross-thread writes. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
thiagoftsm
approved these changes
Mar 2, 2026
Contributor
thiagoftsm
left a comment
There was a problem hiding this comment.
After few hours, PR is running as expected. LGTM!
stelfrag
added a commit
to stelfrag/netdata
that referenced
this pull request
Mar 16, 2026
Prevent concurrent ML activity during host reset and improve thread safety in k-means dimension handling (cherry picked from commit cee7787)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Summary by cubic
Block new ML work during host stop and guard k-means updates to fix a data race. Prevents crashes and inconsistent models on shutdown.
Written for commit 6b9f8b4. Summary will update on new commits.