Skip to content

Fix data race in ML training during host stop#21844

Merged
stelfrag merged 1 commit intonetdata:masterfrom
stelfrag:fix_kmeans_init
Mar 2, 2026
Merged

Fix data race in ML training during host stop#21844
stelfrag merged 1 commit intonetdata:masterfrom
stelfrag:fix_kmeans_init

Conversation

@stelfrag
Copy link
Copy Markdown
Collaborator

@stelfrag stelfrag commented Feb 28, 2026

Summary
  • Prevent concurrent ML activity during host reset and improve thread safety in k-means dimension handling

Summary by cubic

Block new ML work during host stop and guard k-means updates to fix a data race. Prevents crashes and inconsistent models on shutdown.

  • Bug Fixes
    • Disable ML at the start of ml_host_stop; clear km_contexts instead of reinitializing kmeans.
    • In ml_dimension_update_models, bail out when the host isn’t running and reset training_in_progress.
    • Move Dim->kmeans assignment outside the lock; rely on the serialized worker queue to avoid cross-thread writes.

Written for commit 6b9f8b4. Summary will update on new commits.

@github-actions github-actions bot added the area/ml Machine Learning Related Issues label Feb 28, 2026
Copy link
Copy Markdown
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No issues found across 2 files

Confidence score: 5/5

  • Automated review surfaced no issues in the provided summaries.
  • No files require special attention.
Architecture diagram
sequenceDiagram
    participant Host as ML Host Controller
    participant State as ML Host State
    participant Worker as ML Worker Thread
    participant Dim as ML Dimension State

    Note over Host, Dim: Host Shutdown Sequence (ml_host_stop)

    Host->>State: NEW: Set ml_running = false
    Note right of State: Prevents new ML tasks from starting immediately

    par Concurrent Execution
        Worker->>Dim: ml_dimension_update_models()
        Dim->>Dim: Lock slock
        Dim->>State: Check ml_running status
        alt NEW: ml_running is false
            Dim->>Dim: Set training_in_progress = false
            Dim-->>Worker: Early Exit
        else ml_running is true
            Dim->>Dim: Standard model update flow
        end
        Dim->>Dim: Unlock slock
    and State Cleanup
        Host->>Dim: Lock slock
        Host->>Dim: CHANGED: Clear km_contexts
        Note right of Dim: Prevents use of stale models during reset
        Host->>Dim: Unlock slock
    end

    Note over Worker, Dim: Model Inlining Flow (ml_worker_add_existing_model)

    Worker->>Worker: Prepare model data
    Worker->>Dim: CHANGED: Assign Dim->kmeans (Outside slock)
    Note over Worker, Dim: Safe: per-host work is serialized via worker queue
    Worker->>Dim: ml_dimension_update_models()
Loading

@stelfrag stelfrag requested a review from thiagoftsm February 28, 2026 08:36
@stelfrag stelfrag marked this pull request as ready for review February 28, 2026 08:36
@ilyam8 ilyam8 requested a review from Copilot February 28, 2026 09:15
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR aims to eliminate a shutdown/reset-time data race in the ML training pipeline by preventing model publication/updates while a host is being stopped and by tightening the rules around when k-means model state is updated.

Changes:

  • Set ml_running = false at the start of ml_host_stop() to block new ML activity during reset.
  • During host stop, clear dim->km_contexts (rather than reinitializing dim->kmeans) to avoid cross-thread writes to Dim->kmeans.
  • In ml_dimension_update_models(), bail out early when the host is not running and ensure training_in_progress is cleared; in the worker, rely on per-host queue serialization to assign Dim->kmeans outside the dimension spinlock.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
src/ml/ml_public.cc Disables ML earlier during host stop and resets per-dimension model context state safely.
src/ml/ml.cc Prevents model updates when the host is stopped and adjusts k-means assignment/update flow to avoid cross-thread writes.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown
Contributor

@thiagoftsm thiagoftsm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After few hours, PR is running as expected. LGTM!

@stelfrag stelfrag merged commit cee7787 into netdata:master Mar 2, 2026
146 checks passed
@stelfrag stelfrag deleted the fix_kmeans_init branch March 2, 2026 14:34
@stelfrag stelfrag mentioned this pull request Mar 16, 2026
stelfrag added a commit to stelfrag/netdata that referenced this pull request Mar 16, 2026
Prevent concurrent ML activity during host reset and improve thread safety in k-means dimension handling

(cherry picked from commit cee7787)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/ml Machine Learning Related Issues

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants