tentacle: mgr/dashboard: Add hardware monitoring dashboard using node proxy metrics#69925

Open
afreen23 wants to merge 12 commits into
ceph:tentacle from
afreen23:wip-77916-tentacle

@afreen23 afreen23 commented Jul 2, 2026

Contributor

backport tracker: https://tracker.ceph.com/issues/77916


backport of #69750 #69753 #69757
parent tracker: https://tracker.ceph.com/issues/77723

this backport was staged using ceph-backport.sh version 16.0.0.6848
find the latest version at https://github.com/ceph/ceph/blob/main/src/script/ceph-backport.sh

afreen23 added 8 commits July 3, 2026 02:48
Signed-off-by: Afreen Misbah <[email protected]>
(cherry picked from commit aa8f054)
Signed-off-by: Afreen Misbah <[email protected]>
(cherry picked from commit 0514014)
The hardware metrics exporter was reading cephadm's private KV store
directly via get_store_prefix('mgr/cephadm/node_proxy/data'). This
had two problems:

1. get_store_prefix() is module-scoped, so from the prometheus module
   it searched under prometheus's own namespace instead of cephadm's,
   resulting in zero metrics being exported despite metric definitions
   appearing at /metrics.

2. The firmware key was accessed as 'firmwares' (plural) but the
   stored field is 'firmware' (singular), causing all firmware version
   metrics to be silently empty.

Use node_proxy_fullreport() and node_proxy_firmware() via the
OrchestratorClientMixin instead. This routes through cephadm's
NodeProxyCache which handles KV access and firmware key compat
correctly. Follows the same pattern as set_cephadm_daemon_status_metrics()
and get_smb_metadata().
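
The two problems described above can be reproduced with a toy model. The following is a minimal sketch, not the actual ceph-mgr implementation: `ToyKVStore`, its `mgr/<module>/` key layout, and the sample report are hypothetical stand-ins chosen only to show why a module-scoped prefix lookup and a misspelled key both fail silently.

```python
from typing import Dict


class ToyKVStore:
    """Hypothetical stand-in for a KV store shared by several manager modules."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def set(self, module: str, key: str, value: str) -> None:
        # Assumption: every key is namespaced by the module that owns it.
        self._data[f"mgr/{module}/{key}"] = value

    def get_prefix(self, module: str, prefix: str) -> Dict[str, str]:
        # Lookups are namespaced by the *calling* module, not the owning one.
        full = f"mgr/{module}/{prefix}"
        return {k: v for k, v in self._data.items() if k.startswith(full)}


store = ToyKVStore()
store.set("cephadm", "node_proxy/data/host1", '{"firmware": {}}')

# Problem 1: the caller is "prometheus", so the prefix is resolved under
# prometheus's namespace and nothing matches.
from_prometheus = store.get_prefix("prometheus", "mgr/cephadm/node_proxy/data")
# The owning module finds the entry with a prefix relative to its namespace.
from_cephadm = store.get_prefix("cephadm", "node_proxy/data")

# Problem 2: a plural key with a default value hides the mismatch.
report = {"firmware": {"bios": {"version": "1.2.3"}}}  # made-up sample data
wrong = report.get("firmwares", {})  # empty dict, no error raised
right = report.get("firmware", {})   # the stored field
```

In both cases no exception is raised: the caller just receives an empty result, which matches the symptom of metric definitions appearing at /metrics with no samples.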

Signed-off-by: Afreen Misbah <[email protected]>
Assisted-by: Claude
Signed-off-by: Afreen Misbah <[email protected]>
(cherry picked from commit 6bb07c6)
- Fix fan repeating panels: set multi=true on fan_speeds template
  variable so Grafana generates one panel per fan instead of one
- Remove TACH-only regex filter on fan_speeds template and AVG
  Cooling query so all system fans are visible regardless of naming
- Replace duplicate Power Control panel with Network health panel
- Fix NVMe drive count to use storage_capacity_bytes{protocol=NVMe}
  instead of counting temperature sensors (inaccurate proxy)
- Normalize all hostname filters to regex match (=~) for consistency
- Register hardware.libsonnet in dashboards.libsonnet so the
  dashboard JSON is generated during ceph-mixin builds
- Add temperatures category to prometheus health metrics loop

Signed-off-by: Afreen Misbah <[email protected]>
Assisted-by: Claude
Signed-off-by: Afreen Misbah <[email protected]>
(cherry picked from commit 04207ca)
…emory_capacity_bytes

Prometheus naming conventions require base units (bytes not MiB)
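
As an illustration of the unit change, a minimal sketch of the conversion follows. The helper name `mib_to_bytes` and the assumption that the source value is reported in MiB are hypothetical; the actual exporter code may differ.

```python
MIB = 1024 * 1024  # bytes per mebibyte


def mib_to_bytes(mib: float) -> int:
    """Convert a capacity in MiB to bytes, the base unit expected by *_bytes metrics."""
    return int(mib * MIB)


# Example: a 16 GiB DIMM reported as 16384 MiB.
dimm_bytes = mib_to_bytes(16384)
```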

Signed-off-by: Afreen Misbah <[email protected]>
(cherry picked from commit ae0567e)
- CPU/NVMe temp: change from stat to gauge panels with colored
  arc thresholds (green/yellow/red) and proper °C units
- Health panels: wrap queries in max() aggregation to show
  single worst-case value instead of overlapping series
- Pie charts: switch to donut style with visible legends
- Fan RPM: use locale unit for comma formatting instead of
  short which auto-scaled to "K"
- Fix temperature panels units
- Add Device List and Platform Firmware table panels

Fixes https://tracker.ceph.com/issues/77723

Signed-off-by: Afreen Misbah <[email protected]>
(cherry picked from commit 5dd6807)
- export `firmware_version` labels from `node_proxy_storage_capacity_bytes` metrics for use in the device firmware panel
- improve performance: drop the redundant node_proxy_firmware() RPC (firmware data is already
  present in the fullreport response) and remove unnecessary loops
- replace health if/elif chain with HEALTH_STATUS_MAP dict lookup
- rename metrics to ceph_hardware_*, fix component labels and add tests
- add unit test cases
- add serial name, slot info to capacity metric
- fix health metrics showing ID instead of component name
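
The dict-lookup refactor mentioned above might look roughly like the sketch below. Only the name HEALTH_STATUS_MAP comes from the commit message; the status strings, the numeric values, the `UNKNOWN_HEALTH` fallback, and the `health_to_value` helper are assumptions made for illustration.

```python
from typing import Any

# Assumed mapping: higher numbers mean worse health.
HEALTH_STATUS_MAP = {
    "ok": 0,
    "warning": 1,
    "critical": 2,
}
UNKNOWN_HEALTH = -1  # hypothetical fallback for missing or unrecognized statuses


def health_to_value(status: Any) -> int:
    """Map a health string to a number with one lookup instead of an if/elif chain."""
    if not isinstance(status, str):
        return UNKNOWN_HEALTH
    return HEALTH_STATUS_MAP.get(status.strip().lower(), UNKNOWN_HEALTH)


values = [health_to_value(s) for s in ("OK", "Warning", "Critical", None, "bogus")]
```

A lookup table keeps the mapping in one place, so adding a status means adding one entry rather than another branch.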

Signed-off-by: Afreen Misbah <[email protected]>
(cherry picked from commit be32430)
- Update tests to check for queries in nested row panels
- fix tox tests
- add named tuple

Signed-off-by: Afreen Misbah <[email protected]>
(cherry picked from commit aba8698)
@afreen23 afreen23 requested a review from a team as a code owner July 2, 2026 21:19
@afreen23 afreen23 requested review from aaSharma14 and nizamial09 July 2, 2026 21:19
@afreen23 afreen23 added this to the tentacle milestone Jul 2, 2026
@github-actions github-actions Bot added the releng-audit-pass label (Release engineering: passed backport verification audit.) Jul 2, 2026
afreen23 added 4 commits July 3, 2026 13:14
Signed-off-by: Afreen Misbah <[email protected]>
(cherry picked from commit 42b3a7d)
Signed-off-by: Afreen Misbah <[email protected]>
(cherry picked from commit 4be833e)
Signed-off-by: Afreen Misbah <[email protected]>
(cherry picked from commit b231e6f)
Add comprehensive Grafana dashboard for monitoring hardware compression
metrics from FCM-enabled drives.

Signed-off-by: Afreen Misbah <[email protected]>
(cherry picked from commit da21940)
@github-actions github-actions Bot added the releng-audit-fail label (Release engineering: failed backport verification audit.) and removed the releng-audit-pass label (Release engineering: passed backport verification audit.) Jul 3, 2026

@github-actions github-actions Bot left a comment

Automated Backport Parity Review - Multiple PRs Detected

This backport appears to pull commits from multiple main PRs including: PR #69750, PR #69753, PR #69757.

This must be made explicit in the backport PR description. Furthermore, each backport tracker ticket associated with these main PRs must be linked to this PR.


Automated Redmine Linkage Audit

The following tracking irregularities were found:

  • Orphaned Main PR: Could not find a Redmine tracker for main PR #69753. Please create a ticket, set its 'Pull Request ID', populate the 'Backports' field, and ensure it is in the 'Pending Backport' state.
  • Orphaned Main PR: Could not find a Redmine tracker for main PR #69757. Please create a ticket, set its 'Pull Request ID', populate the 'Backports' field, and ensure it is in the 'Pending Backport' state.

Commit Parity Visualizer

| BACKPORT PR #69925 | SOURCE PR | SOURCE | STATUS |
| --- | --- | --- | --- |
| fa73567 mgr/prometheus: add node-proxy hardware metrics exporter | PR #69750 | aa8f054 mgr/prometheus: add node-proxy hardware metrics exporter | |
| 21760ca monitoring: add hardware metrics Grafana dashboard | | 0514014 monitoring: add hardware metrics Grafana dashboard | |
| 930d4c9 mgr/prometheus: use orchestrator API for node-proxy hardware metrics | | 6bb07c6 mgr/prometheus: use orchestrator API for node-proxy hardware metrics | |
| a14d762 monitoring: fix hardware Grafana dashboard and health metrics | | 04207ca monitoring: fix hardware Grafana dashboard and health metrics | |
| 5fcb966 mgr/prometheus: Rename node_proxy_memory_capacity_mib to node_proxy_memory_capacity_bytes | | ae0567e mgr/prometheus: Rename node_proxy_memory_capacity_mib to node_proxy_memory_capacity_bytes | |
| 3c6698a monitoring: improve hardware Grafana dashboard panels | | 5dd6807 monitoring: improve hardware Grafana dashboard panels | |
| 833ddcf mgr/prometheus: refactor hardware metrics | | be32430 mgr/prometheus: refactor hardware metrics | |
| a0edbc9 mgr/promethues: fixed tests and refactor module.py | | aba8698 mgr/promethues: fixed tests and refactor module.py | |
| a545338 mointoring: Add hardware alerts | PR #69757 | 42b3a7d mointoring: Add hardware alerts | |
| db676ba monitoring: add PSU temperature graph to hardware dashboard | PR #69753 | 4be833e monitoring: add PSU temperature graph to hardware dashboard | |
| 4cc0df6 monitoring: add panel descriptions and fix tooltip labels | | b231e6f monitoring: add panel descriptions and fix tooltip labels | |
| c1067f8 monitoring: add Ceph Hardware - Compression dashboard | | da21940 monitoring: add Ceph Hardware - Compression dashboard | |

🛟 Need Help?

If you need technical help resolving these issues, please consult with the Component Lead. If you need administrative overrides, please see the #ceph-upstream-releases channel on Slack and request a review from the @ceph/ceph-release-manager.

📋 Component Lead / Release Manager

To override the audit failure, apply releng-audit-override label or comment /audit override.


⚠️ Note: Automated audit checks will be suspended on future pushes to prevent comment spam while you work.

When you are ready for a new audit, please remove the releng-audit-fail label or comment /audit retest.

CI Run Log: View Workflow Details

@github-project-automation github-project-automation Bot moved this from New to Reviewer approved in Ceph-Dashboard Jul 3, 2026
Labels

dashboard, monitoring, needs-qa, pybind, releng-audit-fail (Release engineering: failed backport verification audit.)

Projects

Status: Reviewer approved
