umbrella: mgr/dashboard: Add hardware monitoring dashboard using node proxy metrics by afreen23 · Pull Request #69926 · ceph/ceph

afreen23 · 2026-07-02T21:20:47Z

backport tracker: https://tracker.ceph.com/issues/77917

backport of #69750 #69753 #69757
parent tracker: https://tracker.ceph.com/issues/77723

this backport was staged using ceph-backport.sh version 16.0.0.6848
find the latest version at https://github.com/ceph/ceph/blob/main/src/script/ceph-backport.sh

Signed-off-by: Afreen Misbah <[email protected]> (cherry picked from commit aa8f054)

Signed-off-by: Afreen Misbah <[email protected]> (cherry picked from commit 0514014)

The hardware metrics exporter was reading cephadm's private KV store directly via get_store_prefix('mgr/cephadm/node_proxy/data'). This had two problems:

1. get_store_prefix() is module-scoped, so from the prometheus module it searched under prometheus's own namespace instead of cephadm's, resulting in zero metrics being exported despite metric definitions appearing at /metrics.
2. The firmware key was accessed as 'firmwares' (plural) but the stored field is 'firmware' (singular), causing all firmware version metrics to be silently empty.

Use node_proxy_fullreport() and node_proxy_firmware() via the OrchestratorClientMixin instead. This routes through cephadm's NodeProxyCache, which handles KV access and firmware key compat correctly. Follows the same pattern as set_cephadm_daemon_status_metrics() and get_smb_metadata().

Signed-off-by: Afreen Misbah <[email protected]>
Assisted-by: Claude
Signed-off-by: Afreen Misbah <[email protected]>
(cherry picked from commit 6bb07c6)
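The two failure modes described in this commit message can be reproduced with a toy model. The sketch below is not Ceph code: `ToyKVStore`, its `mgr/<module>/` key layout, the `get_prefix_as()` helper, and the sample report are hypothetical and exist only to show why a module-scoped prefix lookup and a misspelled key both fail silently rather than raising an error.

```python
import json
from typing import Dict


class ToyKVStore:
    """Hypothetical stand-in for a KV store that namespaces keys per module."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def set_as(self, module: str, key: str, value: str) -> None:
        # Assumption: every key is stored under the owning module's namespace.
        self._data[f"mgr/{module}/{key}"] = value

    def get_prefix_as(self, module: str, prefix: str) -> Dict[str, str]:
        # The prefix is resolved relative to the *calling* module's namespace.
        scoped = f"mgr/{module}/{prefix}"
        return {k: v for k, v in self._data.items() if k.startswith(scoped)}


store = ToyKVStore()
store.set_as(
    "cephadm",
    "node_proxy/data/host1",
    json.dumps({"firmware": {"bios": "1.2.3"}}),  # made-up sample report
)

# Problem 1: the lookup runs in the caller's namespace, so it matches nothing.
seen_by_prometheus = store.get_prefix_as("prometheus", "mgr/cephadm/node_proxy/data")
seen_by_cephadm = store.get_prefix_as("cephadm", "node_proxy/data")

# Problem 2: a plural key with a default value hides the mismatch.
report = json.loads(next(iter(seen_by_cephadm.values())))
wrong_key = report.get("firmwares", {})
right_key = report.get("firmware", {})
```

In both cases the caller receives an empty mapping instead of an exception, which matches the symptom described above: metric definitions are present but no samples are exported.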

- Fix fan repeating panels: set multi=true on fan_speeds template variable so Grafana generates one panel per fan instead of one
- Remove TACH-only regex filter on fan_speeds template and AVG Cooling query so all system fans are visible regardless of naming
- Replace duplicate Power Control panel with Network health panel
- Fix NVMe drive count to use storage_capacity_bytes{protocol=NVMe} instead of counting temperature sensors (inaccurate proxy)
- Normalize all hostname filters to regex match (=~) for consistency
- Register hardware.libsonnet in dashboards.libsonnet so the dashboard JSON is generated during ceph-mixin builds
- Add temperatures category to prometheus health metrics loop

Signed-off-by: Afreen Misbah <[email protected]>
Assisted-by: Claude
Signed-off-by: Afreen Misbah <[email protected]>
(cherry picked from commit 04207ca)

…emory_capacity_bytes

Prometheus naming conventions require base units (bytes, not MiB).

Signed-off-by: Afreen Misbah <[email protected]>
(cherry picked from commit ae0567e)
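A metric renamed from a `_mib` suffix to a `_bytes` suffix should also carry a value in bytes. The commit message does not show how the exporter handles the value, so the helper below is only a sketch of the conversion; `mib_to_bytes` is a hypothetical name.

```python
MIB = 1024 * 1024  # bytes per mebibyte


def mib_to_bytes(capacity_mib: float) -> int:
    """Convert a capacity reported in MiB to the base unit, bytes."""
    return int(capacity_mib * MIB)


# A 16 GiB DIMM reported as 16384 MiB.
dimm_bytes = mib_to_bytes(16384)
```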

- CPU/NVMe temp: change from stat to gauge panels with colored arc thresholds (green/yellow/red) and proper °C units
- Health panels: wrap queries in max() aggregation to show single worst-case value instead of overlapping series
- Pie charts: switch to donut style with visible legends
- Fan RPM: use locale unit for comma formatting instead of short, which auto-scaled to "K"
- Fix temperature panel units
- Add Device List and Platform Firmware table panels

Fixes https://tracker.ceph.com/issues/77723

Signed-off-by: Afreen Misbah <[email protected]>
(cherry picked from commit 5dd6807)
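The effect of the max() wrapper on the health panels can be emulated in a few lines. This is a sketch under one assumption that the commit message implies but does not state: health is encoded numerically with larger values meaning worse states. The sample series and the `worst_case` helper are made up for illustration.

```python
from typing import Dict, List, Tuple

# One sample per series: (labels, value).
# Assumed encoding: 0 = OK, 1 = warning, 2 = critical.
Sample = Tuple[Dict[str, str], float]


def worst_case(samples: List[Sample]) -> float:
    """Collapse many series into one value, like an outer max() in a query."""
    return max(value for _, value in samples)


health_series: List[Sample] = [
    ({"hostname": "host1", "component": "cpu0"}, 0.0),
    ({"hostname": "host1", "component": "fan3"}, 2.0),
    ({"hostname": "host1", "component": "psu1"}, 1.0),
]

panel_value = worst_case(health_series)
```

Without the aggregation, a single-value panel would receive three overlapping series; with it, the panel shows only the most severe state.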

- Export `firmware_version` labels from `node_proxy_storage_capacity_bytes` metrics for use in the device firmware panel
- Improve iteration performance
- Drop the redundant node_proxy_firmware() RPC, since firmware data is already present in the fullreport response, and remove unnecessary loops
- Replace health if/elif chain with HEALTH_STATUS_MAP dict lookup
- Rename metrics to ceph_hardware_*, fix component labels and add tests
- Add unit test cases
- Add serial name and slot info to capacity metric
- Fix health metrics showing ID instead of component name

Signed-off-by: Afreen Misbah <[email protected]>
(cherry picked from commit be32430)
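The if/elif to dict-lookup refactor mentioned above follows a common pattern. Only the name HEALTH_STATUS_MAP comes from the commit message; the status strings, the numeric values, the `UNKNOWN_STATUS` fallback, and both function names in this sketch are assumptions made for illustration.

```python
from typing import Dict

# Hypothetical contents; the real mapping is not shown in this PR page.
HEALTH_STATUS_MAP: Dict[str, int] = {"ok": 0, "warning": 1, "critical": 2}
UNKNOWN_STATUS = 3  # hypothetical fallback for unrecognized states


def health_value_chain(status: str) -> int:
    """Before: one branch per status string."""
    status = status.lower()
    if status == "ok":
        return 0
    elif status == "warning":
        return 1
    elif status == "critical":
        return 2
    return UNKNOWN_STATUS


def health_value_lookup(status: str) -> int:
    """After: a single table lookup with an explicit default."""
    return HEALTH_STATUS_MAP.get(status.lower(), UNKNOWN_STATUS)
```

With a lookup table, adding a new status becomes a one-line change to the dict and the mapping itself can be unit-tested as data.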

- Update tests to check for queries in nested row panels
- Fix tox tests
- Add named tuple

Signed-off-by: Afreen Misbah <[email protected]>
(cherry picked from commit aba8698)

Signed-off-by: Afreen Misbah <[email protected]> (cherry picked from commit 42b3a7d)

Signed-off-by: Afreen Misbah <[email protected]> (cherry picked from commit 4be833e)

Signed-off-by: Afreen Misbah <[email protected]> (cherry picked from commit b231e6f)

Add comprehensive Grafana dashboard for monitoring hardware compression metrics from FCM-enabled drives.

Signed-off-by: Afreen Misbah <[email protected]>
(cherry picked from commit da21940)

github-actions

Automated Backport Parity Review - Multiple PRs Detected

This backport appears to pull commits from multiple main PRs including: PR #69750, PR #69753, PR #69757.

This must be made explicit in the backport PR description. Furthermore, each backport tracker ticket associated with these main PRs must be linked to this PR.

Automated Redmine Linkage Audit

The following tracking irregularities were found:

Orphaned Main PR: Could not find a Redmine tracker for main PR #69753. Please create a ticket, set its 'Pull Request ID', populate the 'Backports' field, and ensure it is in the 'Pending Backport' state.
Orphaned Main PR: Could not find a Redmine tracker for main PR #69757. Please create a ticket, set its 'Pull Request ID', populate the 'Backports' field, and ensure it is in the 'Pending Backport' state.

Commit Parity Visualizer

| BACKPORT PR #69926 | SOURCE PR | SOURCE STATUS |
|---|---|---|
| `f0efeed` mgr/prometheus: add node-proxy hardware metrics exporter | PR #69750 | `aa8f054` mgr/prometheus: add node-proxy hardware metrics exporter |
| `116e142` monitoring: add hardware metrics Grafana dashboard | | `0514014` monitoring: add hardware metrics Grafana dashboard |
| `65d352f` mgr/prometheus: use orchestrator API for node-proxy hardware metrics | | `6bb07c6` mgr/prometheus: use orchestrator API for node-proxy hardware metrics |
| `e5c0768` monitoring: fix hardware Grafana dashboard and health metrics | | `04207ca` monitoring: fix hardware Grafana dashboard and health metrics |
| `89f7cd0` mgr/prometheus: Rename node_proxy_memory_capacity_mib to node_proxy_memory_capacity_bytes | | `ae0567e` mgr/prometheus: Rename node_proxy_memory_capacity_mib to node_proxy_memory_capacity_bytes |
| `989fec5` monitoring: improve hardware Grafana dashboard panels | | `5dd6807` monitoring: improve hardware Grafana dashboard panels |
| `4892843` mgr/prometheus: refactor hardware metrics | | `be32430` mgr/prometheus: refactor hardware metrics |
| `9bb0a74` mgr/promethues: fixed tests and refactor module.py | | `aba8698` mgr/promethues: fixed tests and refactor module.py |
| `feebe5b` mointoring: Add hardware alerts | PR #69757 | `42b3a7d` mointoring: Add hardware alerts |
| `3934f57` monitoring: add PSU temperature graph to hardware dashboard | PR #69753 | `4be833e` monitoring: add PSU temperature graph to hardware dashboard |
| `e649ace` monitoring: add panel descriptions and fix tooltip labels | | `b231e6f` monitoring: add panel descriptions and fix tooltip labels |
| `c8368fa` monitoring: add Ceph Hardware - Compression dashboard | | `da21940` monitoring: add Ceph Hardware - Compression dashboard |
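The pairing of backport and source commits shown above relies on the "(cherry picked from commit …)" trailer that `git cherry-pick -x` appends to each message. How the audit bot extracts it is not shown on this page; the regex and `cherry_pick_sources` helper below are a hypothetical sketch of one way to recover the source hashes from a commit message.

```python
import re
from typing import List

# Matches the trailer appended by `git cherry-pick -x`, with an
# abbreviated or full hexadecimal commit hash.
CHERRY_PICK_RE = re.compile(r"\(cherry picked from commit ([0-9a-f]{7,40})\)")


def cherry_pick_sources(message: str) -> List[str]:
    """Return every source commit hash named in a commit message."""
    return CHERRY_PICK_RE.findall(message)


message = (
    "monitoring: add Ceph Hardware - Compression dashboard\n\n"
    "Add comprehensive Grafana dashboard for monitoring hardware "
    "compression metrics from FCM-enabled drives.\n\n"
    "(cherry picked from commit da21940)\n"
)
sources = cherry_pick_sources(message)
```

A message with no trailer yields an empty list, which a parity check could flag as a commit with no identifiable source.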

🛟 Need Help?

If you need technical help resolving these issues, please consult with the Component Lead. If you need administrative overrides, please see the #ceph-upstream-releases channel on Slack and request a review from the @ceph/ceph-release-manager.

📋 Component Lead / Release Manager

To override the audit failure, apply releng-audit-override label or comment /audit override.

⚠️ Note: Automated audit checks will be suspended on future pushes to prevent comment spam while you work.

When you are ready for a new audit, please remove the releng-audit-fail label or comment /audit retest.

CI Run Log: View Workflow Details

afreen23 added 8 commits July 3, 2026 02:50

mgr/prometheus: add node-proxy hardware metrics exporter

f0efeed

Signed-off-by: Afreen Misbah <[email protected]> (cherry picked from commit aa8f054)

monitoring: add hardware metrics Grafana dashboard

116e142

Signed-off-by: Afreen Misbah <[email protected]> (cherry picked from commit 0514014)

mgr/prometheus: Rename node_proxy_memory_capacity_mib to node_proxy_m…

89f7cd0

…emory_capacity_bytes

Prometheus naming conventions require base units (bytes, not MiB).

Signed-off-by: Afreen Misbah <[email protected]>
(cherry picked from commit ae0567e)

mgr/promethues: fixed tests and refactor module.py

9bb0a74

- Update tests to check for queries in nested row panels
- Fix tox tests
- Add named tuple

Signed-off-by: Afreen Misbah <[email protected]>
(cherry picked from commit aba8698)

afreen23 requested a review from a team as a code owner July 2, 2026 21:20

afreen23 requested review from Pegonzal and devikab25 July 2, 2026 21:20

afreen23 added this to the umbrella milestone Jul 2, 2026

afreen23 added the dashboard label Jul 2, 2026

github-project-automation Bot added this to Ceph-Dashboard Jul 2, 2026

github-project-automation Bot moved this to New in Ceph-Dashboard Jul 2, 2026

github-actions Bot added monitoring pybind labels Jul 2, 2026

afreen23 added needs-qa and removed pybind monitoring labels Jul 2, 2026

github-actions Bot added the releng-audit-pass Release engineering: passed backport verification audit. label Jul 2, 2026

afreen23 added 4 commits July 3, 2026 13:10

mointoring: Add hardware alerts

feebe5b

Signed-off-by: Afreen Misbah <[email protected]> (cherry picked from commit 42b3a7d)

monitoring: add PSU temperature graph to hardware dashboard

3934f57

Signed-off-by: Afreen Misbah <[email protected]> (cherry picked from commit 4be833e)

monitoring: add panel descriptions and fix tooltip labels

e649ace

Signed-off-by: Afreen Misbah <[email protected]> (cherry picked from commit b231e6f)

monitoring: add Ceph Hardware - Compression dashboard

c8368fa

Add comprehensive Grafana dashboard for monitoring hardware compression metrics from FCM-enabled drives.

Signed-off-by: Afreen Misbah <[email protected]>
(cherry picked from commit da21940)

github-actions Bot added monitoring pybind and removed releng-audit-pass Release engineering: passed backport verification audit. labels Jul 3, 2026

github-actions Bot reviewed Jul 3, 2026

github-actions Bot added the releng-audit-fail Release engineering: failed backport verification audit. label Jul 3, 2026

nizamial09 approved these changes Jul 3, 2026

github-project-automation Bot moved this from New to Reviewer approved in Ceph-Dashboard Jul 3, 2026

umbrella: mgr/dashboard: Add hardware monitoring dashboard using node proxy metrics #69926
afreen23 wants to merge 12 commits into ceph:umbrella from afreen23:wip-77917-umbrella

2 participants
