umbrella: mgr/dashboard: Add hardware monitoring dashboard using node proxy metrics#69926

Open
afreen23 wants to merge 12 commits into ceph:umbrella from afreen23:wip-77917-umbrella

@afreen23 afreen23 commented Jul 2, 2026

backport tracker: https://tracker.ceph.com/issues/77917


backport of #69750 #69753 #69757
parent tracker: https://tracker.ceph.com/issues/77723

this backport was staged using ceph-backport.sh version 16.0.0.6848
find the latest version at https://github.com/ceph/ceph/blob/main/src/script/ceph-backport.sh

afreen23 added 8 commits July 3, 2026 02:50
Signed-off-by: Afreen Misbah <[email protected]>
(cherry picked from commit aa8f054)
Signed-off-by: Afreen Misbah <[email protected]>
(cherry picked from commit 0514014)
The hardware metrics exporter was reading cephadm's private KV store
directly via get_store_prefix('mgr/cephadm/node_proxy/data'). This
had two problems:

1. get_store_prefix() is module-scoped, so from the prometheus module
   it searched under prometheus's own namespace instead of cephadm's,
   resulting in zero metrics being exported despite metric definitions
   appearing at /metrics.

2. The firmware key was accessed as 'firmwares' (plural) but the
   stored field is 'firmware' (singular), causing all firmware version
   metrics to be silently empty.

Use node_proxy_fullreport() and node_proxy_firmware() via the
OrchestratorClientMixin instead. This routes through cephadm's
NodeProxyCache which handles KV access and firmware key compat
correctly. Follows the same pattern as set_cephadm_daemon_status_metrics()
and get_smb_metadata().
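
The two problems described above can be illustrated with a toy model. This is only a sketch: `ToyKVStore`, its key layout, and the sample report are hypothetical stand-ins, not the actual mgr or cephadm implementation.

```python
from typing import Dict


class ToyKVStore:
    """Hypothetical key-value store shared by several manager modules."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def set(self, module: str, key: str, value: str) -> None:
        # Each key is stored under the owning module's namespace (assumed layout).
        self._data[f"mgr/{module}/{key}"] = value

    def get_prefix(self, module: str, prefix: str) -> Dict[str, str]:
        # The lookup is scoped to the *calling* module's namespace.
        scoped = f"mgr/{module}/{prefix}"
        return {k: v for k, v in self._data.items() if k.startswith(scoped)}


store = ToyKVStore()
store.set("cephadm", "node_proxy/data/host1", '{"firmware": {}}')

# Problem 1: a lookup made from the "prometheus" namespace finds nothing,
# even though the prefix names cephadm's data.
from_prometheus = store.get_prefix("prometheus", "mgr/cephadm/node_proxy/data")
from_cephadm = store.get_prefix("cephadm", "node_proxy/data")

# Problem 2: a misspelled key with a default value fails silently.
report = {"firmware": {"bios": "1.2.3"}}
wrong = report.get("firmwares", {})   # plural key: silently empty
right = report.get("firmware", {})    # singular key: the stored data
```

In this model, going through an API owned by the module that wrote the data avoids both the namespace mismatch and the need to know the exact stored field name.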

Signed-off-by: Afreen Misbah <[email protected]>
Assisted-by: Claude
Signed-off-by: Afreen Misbah <[email protected]>
(cherry picked from commit 6bb07c6)
- Fix fan repeating panels: set multi=true on fan_speeds template
  variable so Grafana generates one panel per fan instead of one
- Remove TACH-only regex filter on fan_speeds template and AVG
  Cooling query so all system fans are visible regardless of naming
- Replace duplicate Power Control panel with Network health panel
- Fix NVMe drive count to use storage_capacity_bytes{protocol=NVMe}
  instead of counting temperature sensors (inaccurate proxy)
- Normalize all hostname filters to regex match (=~) for consistency
- Register hardware.libsonnet in dashboards.libsonnet so the
  dashboard JSON is generated during ceph-mixin builds
- Add temperatures category to prometheus health metrics loop
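
A small sketch of why counting temperature sensors is an inaccurate proxy for the NVMe drive count. The sample data, label names other than `protocol`, and the helper functions are hypothetical and only illustrate the idea; the real dashboard does this in a Prometheus query.

```python
from typing import Dict, List

# Hypothetical capacity samples: one per drive, labelled with its protocol.
capacity_samples: List[Dict[str, str]] = [
    {"hostname": "host1", "device": "nvme0", "protocol": "NVMe"},
    {"hostname": "host1", "device": "nvme1", "protocol": "NVMe"},
    {"hostname": "host1", "device": "sda", "protocol": "SATA"},
]

# Hypothetical temperature samples: nvme0 exposes three sensors, nvme1 none.
temperature_samples: List[Dict[str, str]] = [
    {"hostname": "host1", "sensor": "nvme0_composite"},
    {"hostname": "host1", "sensor": "nvme0_sensor1"},
    {"hostname": "host1", "sensor": "nvme0_sensor2"},
    {"hostname": "host1", "sensor": "cpu0"},
]


def count_nvme_by_capacity(samples: List[Dict[str, str]]) -> int:
    """Count drives whose capacity sample carries protocol=NVMe."""
    return sum(1 for s in samples if s.get("protocol") == "NVMe")


def count_nvme_by_sensors(samples: List[Dict[str, str]]) -> int:
    """Proxy count: every temperature sensor whose name starts with 'nvme'."""
    return sum(1 for s in samples if s.get("sensor", "").startswith("nvme"))


actual = count_nvme_by_capacity(capacity_samples)
proxy = count_nvme_by_sensors(temperature_samples)
```

With this data the proxy overcounts, because sensors and drives are not in one-to-one correspondence.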

Signed-off-by: Afreen Misbah <[email protected]>
Assisted-by: Claude
Signed-off-by: Afreen Misbah <[email protected]>
(cherry picked from commit 04207ca)
…emory_capacity_bytes

Prometheus naming conventions require base units (bytes not MiB)
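
As a minimal sketch of the unit change, a value reported in MiB can be converted to bytes before it is exported. The helper name `mib_to_bytes` is hypothetical and is not taken from the module.

```python
MIB = 1024 * 1024  # bytes in one mebibyte


def mib_to_bytes(capacity_mib: float) -> int:
    """Convert a capacity in MiB to the base unit (bytes)."""
    return int(capacity_mib * MIB)


# A 16 GiB DIMM reported as 16384 MiB.
dimm_bytes = mib_to_bytes(16384)
```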

Signed-off-by: Afreen Misbah <[email protected]>
(cherry picked from commit ae0567e)
- CPU/NVMe temp: change from stat to gauge panels with colored
  arc thresholds (green/yellow/red) and proper °C units
- Health panels: wrap queries in max() aggregation to show
  single worst-case value instead of overlapping series
- Pie charts: switch to donut style with visible legends
- Fan RPM: use locale unit for comma formatting instead of
  short which auto-scaled to "K"
- Fix temperature panels units
- Add Device List and Platform Firmware table panels
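
The effect of wrapping the health queries in max() can be sketched in Python. This assumes health is encoded as a number where a higher value means a worse state; that encoding and the sample values are assumptions, not taken from the dashboard.

```python
from typing import Dict


def worst_case(series: Dict[str, int]) -> int:
    """Collapse per-component health values into one worst-case value.

    Mirrors what a max() aggregation does over several series. Returns 0
    for an empty input, which is a choice made for this sketch only.
    """
    return max(series.values(), default=0)


# Hypothetical encoding: 0 = OK, 1 = warning, 2 = critical.
samples = {"fan1": 0, "fan2": 2, "psu1": 1}
worst = worst_case(samples)
```

A single value keeps the panel readable: one degraded component is enough to change the displayed state.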

Fixes https://tracker.ceph.com/issues/77723

Signed-off-by: Afreen Misbah <[email protected]>
(cherry picked from commit 5dd6807)
- Export `firmware_version` labels from `node_proxy_storage_capacity_bytes` metrics for use in the device firmware panel
- Improve performance: drop the redundant node_proxy_firmware() RPC, since firmware data is already
  present in the fullreport response, and remove unnecessary loops
- Replace the health if/elif chain with a HEALTH_STATUS_MAP dict lookup
- Rename metrics to ceph_hardware_*, fix component labels and add tests
- Add unit test cases
- Add serial name and slot info to the capacity metric
- Fix health metrics showing the ID instead of the component name
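
Replacing an if/elif chain with a dict lookup might look like the following sketch. Only the name HEALTH_STATUS_MAP comes from the commit message; the keys, numeric values, fallback constant, and helper function are assumptions made for illustration.

```python
# Assumed mapping from status strings to metric values.
HEALTH_STATUS_MAP = {
    "ok": 0,
    "warning": 1,
    "critical": 2,
}

# Hypothetical fallback for statuses that are missing from the map.
UNKNOWN_HEALTH = 3


def health_to_value(status: str) -> int:
    """Translate a health string into a numeric metric value."""
    return HEALTH_STATUS_MAP.get(status.strip().lower(), UNKNOWN_HEALTH)


values = [health_to_value(s) for s in ("OK", "Warning", "critical", "absent")]
```

Supporting a new status then means adding one dictionary entry instead of another branch.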

Signed-off-by: Afreen Misbah <[email protected]>
(cherry picked from commit be32430)
- Update tests to check for queries in nested row panels
- fix tox tests
- add named tuple

Signed-off-by: Afreen Misbah <[email protected]>
(cherry picked from commit aba8698)
@afreen23 afreen23 requested a review from a team as a code owner July 2, 2026 21:20
@afreen23 afreen23 requested review from Pegonzal and devikab25 July 2, 2026 21:20
@afreen23 afreen23 added this to the umbrella milestone Jul 2, 2026
@github-actions github-actions Bot added the releng-audit-pass Release engineering: passed backport verification audit. label Jul 2, 2026
afreen23 added 4 commits July 3, 2026 13:10
Signed-off-by: Afreen Misbah <[email protected]>
(cherry picked from commit 42b3a7d)
Signed-off-by: Afreen Misbah <[email protected]>
(cherry picked from commit 4be833e)
Signed-off-by: Afreen Misbah <[email protected]>
(cherry picked from commit b231e6f)
Add comprehensive Grafana dashboard for monitoring hardware compression
metrics from FCM-enabled drives.

Signed-off-by: Afreen Misbah <[email protected]>
(cherry picked from commit da21940)
@github-actions github-actions Bot added monitoring pybind and removed releng-audit-pass Release engineering: passed backport verification audit. labels Jul 3, 2026

@github-actions github-actions Bot left a comment


Automated Backport Parity Review - Multiple PRs Detected

This backport appears to pull commits from multiple main PRs including: PR #69750, PR #69753, PR #69757.

This must be made explicit in the backport PR description. Furthermore, each backport tracker ticket associated with these main PRs must be linked to this PR.


Automated Redmine Linkage Audit

The following tracking irregularities were found:

  • Orphaned Main PR: Could not find a Redmine tracker for main PR #69753. Please create a ticket, set its 'Pull Request ID', populate the 'Backports' field, and ensure it is in the 'Pending Backport' state.
  • Orphaned Main PR: Could not find a Redmine tracker for main PR #69757. Please create a ticket, set its 'Pull Request ID', populate the 'Backports' field, and ensure it is in the 'Pending Backport' state.

Commit Parity Visualizer

| BACKPORT PR #69926 | SOURCE PR | SOURCE STATUS |
|---|---|---|
| f0efeed mgr/prometheus: add node-proxy hardware metrics exporter | PR #69750 | aa8f054 mgr/prometheus: add node-proxy hardware metrics exporter |
| 116e142 monitoring: add hardware metrics Grafana dashboard | | 0514014 monitoring: add hardware metrics Grafana dashboard |
| 65d352f mgr/prometheus: use orchestrator API for node-proxy hardware metrics | | 6bb07c6 mgr/prometheus: use orchestrator API for node-proxy hardware metrics |
| e5c0768 monitoring: fix hardware Grafana dashboard and health metrics | | 04207ca monitoring: fix hardware Grafana dashboard and health metrics |
| 89f7cd0 mgr/prometheus: Rename node_proxy_memory_capacity_mib to node_proxy_memory_capacity_bytes | | ae0567e mgr/prometheus: Rename node_proxy_memory_capacity_mib to node_proxy_memory_capacity_bytes |
| 989fec5 monitoring: improve hardware Grafana dashboard panels | | 5dd6807 monitoring: improve hardware Grafana dashboard panels |
| 4892843 mgr/prometheus: refactor hardware metrics | | be32430 mgr/prometheus: refactor hardware metrics |
| 9bb0a74 mgr/promethues: fixed tests and refactor module.py | | aba8698 mgr/promethues: fixed tests and refactor module.py |
| feebe5b mointoring: Add hardware alerts | PR #69757 | 42b3a7d mointoring: Add hardware alerts |
| 3934f57 monitoring: add PSU temperature graph to hardware dashboard | PR #69753 | 4be833e monitoring: add PSU temperature graph to hardware dashboard |
| e649ace monitoring: add panel descriptions and fix tooltip labels | | b231e6f monitoring: add panel descriptions and fix tooltip labels |
| c8368fa monitoring: add Ceph Hardware - Compression dashboard | | da21940 monitoring: add Ceph Hardware - Compression dashboard |

🛟 Need Help?

If you need technical help resolving these issues, please consult with the Component Lead. If you need administrative overrides, please see the #ceph-upstream-releases channel on Slack and request a review from the @ceph/ceph-release-manager.

📋 Component Lead / Release Manager

To override the audit failure, apply releng-audit-override label or comment /audit override.


⚠️ Note: Automated audit checks will be suspended on future pushes to prevent comment spam while you work.

When you are ready for a new audit, please remove the releng-audit-fail label or comment /audit retest.

CI Run Log: View Workflow Details

@github-actions github-actions Bot added the releng-audit-fail Release engineering: failed backport verification audit. label Jul 3, 2026
@github-project-automation github-project-automation Bot moved this from New to Reviewer approved in Ceph-Dashboard Jul 3, 2026

Labels

dashboard monitoring needs-qa pybind releng-audit-fail Release engineering: failed backport verification audit.

Projects

Status: Reviewer approved


2 participants