Fix data race in ML training during host stop by stelfrag · Pull Request #21844 · netdata/netdata

stelfrag · 2026-02-28T07:52:08Z

Summary

Prevent concurrent ML activity during host reset and improve thread safety in k-means dimension handling

Summary by cubic

Block new ML work during host stop and guard k-means updates to fix a data race. Prevents crashes and inconsistent models on shutdown.

Bug Fixes
- Disable ML at the start of ml_host_stop; clear km_contexts instead of reinitializing kmeans.
- In ml_dimension_update_models, bail out when the host isn’t running and reset training_in_progress.
- Move Dim->kmeans assignment outside the lock; rely on the serialized worker queue to avoid cross-thread writes.

Written for commit 6b9f8b4. Summary will update on new commits.

cubic-dev-ai

No issues found across 2 files

Confidence score: 5/5

Automated review surfaced no issues in the provided summaries.
No files require special attention.

Architecture diagram

sequenceDiagram
    participant Host as ML Host Controller
    participant State as ML Host State
    participant Worker as ML Worker Thread
    participant Dim as ML Dimension State

    Note over Host, Dim: Host Shutdown Sequence (ml_host_stop)

    Host->>State: NEW: Set ml_running = false
    Note right of State: Prevents new ML tasks from starting immediately

    par Concurrent Execution
        Worker->>Dim: ml_dimension_update_models()
        Dim->>Dim: Lock slock
        Dim->>State: Check ml_running status
        alt NEW: ml_running is false
            Dim->>Dim: Set training_in_progress = false
            Dim-->>Worker: Early Exit
        else ml_running is true
            Dim->>Dim: Standard model update flow
        end
        Dim->>Dim: Unlock slock
    and State Cleanup
        Host->>Dim: Lock slock
        Host->>Dim: CHANGED: Clear km_contexts
        Note right of Dim: Prevents use of stale models during reset
        Host->>Dim: Unlock slock
    end

    Note over Worker, Dim: Model Inlining Flow (ml_worker_add_existing_model)

    Worker->>Worker: Prepare model data
    Worker->>Dim: CHANGED: Assign Dim->kmeans (Outside slock)
    Note over Worker, Dim: Safe: per-host work is serialized via worker queue
    Worker->>Dim: ml_dimension_update_models()
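
The safety note in the diagram rests on one property: if all work for a host flows through a queue drained by a single thread, then a field written only by that thread needs no lock. The following is a minimal sketch of that idea, not netdata's actual worker. `SerialQueue`, `DimSketch`, and `run_demo` are hypothetical names made up for illustration.

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical single-consumer queue: tasks run one at a time, in FIFO order,
// on one worker thread.
class SerialQueue {
public:
    SerialQueue() : worker_([this] { run(); }) {}

    ~SerialQueue() {
        {
            std::lock_guard<std::mutex> guard(m_);
            done_ = true;
        }
        cv_.notify_one();
        worker_.join();  // remaining tasks are drained before the thread exits
    }

    void push(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> guard(m_);
            tasks_.push(std::move(task));
        }
        cv_.notify_one();
    }

private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(m_);
                cv_.wait(lock, [this] { return done_ || !tasks_.empty(); });
                if (tasks_.empty())
                    return;  // stop requested and nothing left to run
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();  // runs outside the queue lock, never concurrently with another task
        }
    }

    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> tasks_;
    bool done_ = false;
    std::thread worker_;  // declared last so the members above exist before it starts
};

// Hypothetical stand-in for a dimension whose model is touched only by the worker.
struct DimSketch {
    std::vector<double> kmeans;
    int updates = 0;
};

// Queues 100 unlocked writes to the same object and returns how many ran.
int run_demo() {
    DimSketch dim;
    {
        SerialQueue queue;
        for (int i = 0; i < 100; ++i) {
            queue.push([&dim, i] {
                dim.kmeans.assign(2, static_cast<double>(i));  // no lock taken here
                ++dim.updates;
            });
        }
    }  // destructor drains the queue and joins the worker
    return dim.updates;
}
```

The guarantee holds only while every writer goes through the queue. A write from another thread, such as a stop path that reinitializes the same field, breaks it, which is presumably why the stop path in this PR clears `km_contexts` under the lock instead of touching `Dim->kmeans`.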

Copilot

Pull request overview

This PR aims to eliminate a shutdown/reset-time data race in the ML training pipeline by preventing model publication/updates while a host is being stopped and by tightening the rules around when k-means model state is updated.

Changes:

- Set ml_running = false at the start of ml_host_stop() to block new ML activity during reset.
- During host stop, clear dim->km_contexts (rather than reinitializing dim->kmeans) to avoid cross-thread writes to Dim->kmeans.
- In ml_dimension_update_models(), bail out early when the host is not running and ensure training_in_progress is cleared.
- In the worker, rely on per-host queue serialization to assign Dim->kmeans outside the dimension spinlock.
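
The first three changes can be illustrated with a small sketch of the "flip the flag first, then re-check it under the lock" pattern. This is a simplified model under stated assumptions, not the code from the PR: `HostSketch`, `DimensionSketch`, `update_models`, and `host_stop` are hypothetical names, `std::mutex` stands in for the dimension spinlock, and a vector of integers stands in for the model contexts.

```cpp
#include <atomic>
#include <mutex>
#include <vector>

// Hypothetical stand-in for the per-host ML state.
struct HostSketch {
    std::atomic<bool> ml_running{true};
};

// Hypothetical stand-in for the per-dimension ML state.
struct DimensionSketch {
    HostSketch *host = nullptr;
    std::mutex slock;                   // plays the role of the dimension spinlock
    bool training_in_progress = false;
    std::vector<int> km_contexts;       // placeholder for published model contexts
};

// Publishes a model unless the host is stopping.
// Returns true if the model was published, false on early exit.
bool update_models(DimensionSketch &dim, int model) {
    std::lock_guard<std::mutex> guard(dim.slock);
    if (!dim.host->ml_running.load()) {
        dim.training_in_progress = false;  // do not leave the flag stuck
        return false;
    }
    dim.km_contexts.push_back(model);
    dim.training_in_progress = false;
    return true;
}

// Stops ML for a host: block new work first, then clear state under each lock.
void host_stop(HostSketch &host, const std::vector<DimensionSketch *> &dims) {
    host.ml_running.store(false);
    for (DimensionSketch *dim : dims) {
        std::lock_guard<std::mutex> guard(dim->slock);
        dim->km_contexts.clear();
    }
}
```

Because the flag is cleared before any dimension is locked, an update that acquires the lock after the stop path has cleared that dimension always sees `ml_running == false` and exits without publishing. An update that wins the lock first may still publish, but the stop path clears that result when it takes the lock next.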

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| `src/ml/ml_public.cc` | Disables ML earlier during host stop and resets per-dimension model context state safely. |
| `src/ml/ml.cc` | Prevents model updates when the host is stopped and adjusts k-means assignment/update flow to avoid cross-thread writes. |

thiagoftsm

After a few hours, the PR is running as expected. LGTM!

Prevent concurrent ML activity during host reset and improve thread safety in k-means dimension handling (cherry picked from commit cee7787)

github-actions bot added the area/ml Machine Learning Related Issues label Feb 28, 2026

cubic-dev-ai bot reviewed Feb 28, 2026

Prevent concurrent ML activity during host reset and improve thread safety in k-means dimension handling (6b9f8b4)

stelfrag force-pushed the fix_kmeans_init branch from 1ada3fc to 6b9f8b4 Compare February 28, 2026 08:08

stelfrag requested a review from thiagoftsm February 28, 2026 08:36

stelfrag marked this pull request as ready for review February 28, 2026 08:36

stelfrag requested a review from vkalintiris as a code owner February 28, 2026 08:36

ilyam8 requested a review from Copilot February 28, 2026 09:15

Copilot started reviewing on behalf of ilyam8 February 28, 2026 09:16

Copilot AI reviewed Feb 28, 2026

thiagoftsm approved these changes Mar 2, 2026

stelfrag merged commit cee7787 into netdata:master Mar 2, 2026
146 checks passed

stelfrag deleted the fix_kmeans_init branch March 2, 2026 14:34

stelfrag mentioned this pull request Mar 16, 2026

Patch release 2.9.1 #21954

Draft

stelfrag merged 1 commit into netdata:master from stelfrag:fix_kmeans_init
