umbrella: mgr/dashboard: Add hardware monitoring dashboard using node proxy metrics #69926
afreen23 wants to merge 12 commits into
Signed-off-by: Afreen Misbah <[email protected]> (cherry picked from commit aa8f054)
Signed-off-by: Afreen Misbah <[email protected]> (cherry picked from commit 0514014)
The hardware metrics exporter was reading cephadm's private KV store
directly via get_store_prefix('mgr/cephadm/node_proxy/data'). This
had two problems:
1. get_store_prefix() is module-scoped, so from the prometheus module
it searched under prometheus's own namespace instead of cephadm's,
resulting in zero metrics being exported despite metric definitions
appearing at /metrics.
2. The firmware key was accessed as 'firmwares' (plural) but the
stored field is 'firmware' (singular), causing all firmware version
metrics to be silently empty.
Use node_proxy_fullreport() and node_proxy_firmware() via the
OrchestratorClientMixin instead. This routes through cephadm's
NodeProxyCache which handles KV access and firmware key compat
correctly. Follows the same pattern as set_cephadm_daemon_status_metrics()
and get_smb_metadata().
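The two failure modes described above can be reproduced with a toy model. This is only a sketch: `ToyStore`, its methods, the `mgr/<module>/` key layout, and the sample report are hypothetical stand-ins chosen for illustration, not the real ceph-mgr KV API.

```python
from typing import Any, Dict


class ToyStore:
    """Hypothetical module-scoped KV store (illustration only, not the Ceph API)."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}

    def set(self, module: str, key: str, value: Any) -> None:
        # Every key is stored under the calling module's own namespace.
        self._data[f"mgr/{module}/{key}"] = value

    def get_prefix(self, module: str, prefix: str) -> Dict[str, Any]:
        # The lookup is scoped to the calling module as well.
        scoped = f"mgr/{module}/{prefix}"
        return {k: v for k, v in self._data.items() if k.startswith(scoped)}


store = ToyStore()
store.set("cephadm", "node_proxy/data/host1", {"firmware": {"bios": "1.2.3"}})

# Problem 1: another module passing the "full" key searches its own namespace.
from_prometheus = store.get_prefix("prometheus", "mgr/cephadm/node_proxy/data")
from_cephadm = store.get_prefix("cephadm", "node_proxy/data")

# Problem 2: a misspelled key with a default fails silently.
report = from_cephadm["mgr/cephadm/node_proxy/data/host1"]
wrong_key = report.get("firmwares", {})
right_key = report.get("firmware", {})

print(from_prometheus, wrong_key, right_key)
```

In this model the cross-module lookup and the plural key both return empty dicts without raising, which matches the "silently empty" symptoms in the commit message.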
Signed-off-by: Afreen Misbah <[email protected]>
Assisted-by: Claude
Signed-off-by: Afreen Misbah <[email protected]>
(cherry picked from commit 6bb07c6)
- Fix fan repeating panels: set multi=true on fan_speeds template
variable so Grafana generates one panel per fan instead of a single panel
- Remove TACH-only regex filter on fan_speeds template and AVG
Cooling query so all system fans are visible regardless of naming
- Replace duplicate Power Control panel with Network health panel
- Fix NVMe drive count to use storage_capacity_bytes{protocol="NVMe"}
instead of counting temperature sensors (an inaccurate proxy)
- Normalize all hostname filters to regex match (=~) for consistency
- Register hardware.libsonnet in dashboards.libsonnet so the
dashboard JSON is generated during ceph-mixin builds
- Add temperatures category to prometheus health metrics loop
Signed-off-by: Afreen Misbah <[email protected]>
Assisted-by: Claude
Signed-off-by: Afreen Misbah <[email protected]>
(cherry picked from commit 04207ca)
…emory_capacity_bytes
Prometheus naming conventions require base units (bytes, not MiB).
Signed-off-by: Afreen Misbah <[email protected]>
(cherry picked from commit ae0567e)
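A rename from a `_mib` to a `_bytes` suffix only holds up if the exported value is converted too. A minimal sketch of that conversion, assuming the upstream report gives memory capacity in MiB (the helper name `mib_to_bytes` is hypothetical):

```python
MIB = 1024 * 1024  # bytes per mebibyte


def mib_to_bytes(capacity_mib: int) -> int:
    """Convert a capacity in MiB to the base unit (bytes)."""
    return capacity_mib * MIB


print(mib_to_bytes(16384))  # 16 GiB expressed in bytes
```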
- CPU/NVMe temp: change from stat to gauge panels with colored arc thresholds (green/yellow/red) and proper °C units
- Health panels: wrap queries in max() aggregation to show single worst-case value instead of overlapping series
- Pie charts: switch to donut style with visible legends
- Fan RPM: use locale unit for comma formatting instead of short which auto-scaled to "K"
- Fix temperature panels units
- Add Device List and Platform Firmware table panels
Fixes https://tracker.ceph.com/issues/77723
Signed-off-by: Afreen Misbah <[email protected]>
(cherry picked from commit 5dd6807)
- Export `firmware_version` labels from `node_proxy_storage_capacity_bytes` metrics for use in the device firmware panel
- Improve iteration performance
- Drop the redundant node_proxy_firmware() RPC, since firmware data is already present in the fullreport response, and remove unnecessary loops
- Replace the health if/elif chain with a HEALTH_STATUS_MAP dict lookup
- Rename metrics to ceph_hardware_*, fix component labels, and add tests
- Add unit test cases
- Add serial name and slot info to the capacity metric
- Fix health metrics showing the ID instead of the component name
Signed-off-by: Afreen Misbah <[email protected]>
(cherry picked from commit be32430)
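The if/elif-to-dict change mentioned above might look roughly like the following sketch. Only the name HEALTH_STATUS_MAP comes from the commit message; the status strings, the numeric values, the fallback value, and the `health_to_value` helper are assumptions made for illustration.

```python
# Assumed mapping: the real keys and values in the PR may differ.
HEALTH_STATUS_MAP = {
    "ok": 0,
    "warning": 1,
    "critical": 2,
}
UNKNOWN_HEALTH = 3  # hypothetical fallback for unrecognized statuses


def health_to_value(status: object) -> int:
    """Map a reported health string to a numeric gauge value via dict lookup."""
    key = str(status).strip().lower()
    return HEALTH_STATUS_MAP.get(key, UNKNOWN_HEALTH)


print([health_to_value(s) for s in ("OK", "Warning", "critical", None)])
```

A dict lookup keeps the mapping in one place, so adding a status means adding one entry rather than another branch.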
- Update tests to check for queries in nested row panels
- Fix tox tests
- Add named tuple
Signed-off-by: Afreen Misbah <[email protected]>
(cherry picked from commit aba8698)
Signed-off-by: Afreen Misbah <[email protected]> (cherry picked from commit 42b3a7d)
Signed-off-by: Afreen Misbah <[email protected]> (cherry picked from commit 4be833e)
Signed-off-by: Afreen Misbah <[email protected]> (cherry picked from commit b231e6f)
Add comprehensive Grafana dashboard for monitoring hardware compression metrics from FCM-enabled drives.
Signed-off-by: Afreen Misbah <[email protected]>
(cherry picked from commit da21940)
Automated Backport Parity Review - Multiple PRs Detected
This backport appears to pull commits from multiple main PRs including: PR #69750, PR #69753, PR #69757.
This must be made explicit in the backport PR description. Furthermore, each backport tracker ticket associated with these main PRs must be linked to this PR.
Automated Redmine Linkage Audit
The following tracking irregularities were found:
- Orphaned Main PR: Could not find a Redmine tracker for main PR #69753. Please create a ticket, set its 'Pull Request ID', populate the 'Backports' field, and ensure it is in the 'Pending Backport' state.
- Orphaned Main PR: Could not find a Redmine tracker for main PR #69757. Please create a ticket, set its 'Pull Request ID', populate the 'Backports' field, and ensure it is in the 'Pending Backport' state.
Commit Parity Visualizer
| BACKPORT PR #69926 | SOURCE PR | SOURCE STATUS |
|---|---|---|
| f0efeed mgr/prometheus: add node-proxy hardware metrics exporter | PR #69750 | aa8f054 mgr/prometheus: add node-proxy hardware metrics exporter |
| 116e142 monitoring: add hardware metrics Grafana dashboard | | 0514014 monitoring: add hardware metrics Grafana dashboard |
| 65d352f mgr/prometheus: use orchestrator API for node-proxy hardware metrics | | 6bb07c6 mgr/prometheus: use orchestrator API for node-proxy hardware metrics |
| e5c0768 monitoring: fix hardware Grafana dashboard and health metrics | | 04207ca monitoring: fix hardware Grafana dashboard and health metrics |
| 89f7cd0 mgr/prometheus: Rename node_proxy_memory_capacity_mib to node_proxy_memory_capacity_bytes | | ae0567e mgr/prometheus: Rename node_proxy_memory_capacity_mib to node_proxy_memory_capacity_bytes |
| 989fec5 monitoring: improve hardware Grafana dashboard panels | | 5dd6807 monitoring: improve hardware Grafana dashboard panels |
| 4892843 mgr/prometheus: refactor hardware metrics | | be32430 mgr/prometheus: refactor hardware metrics |
| 9bb0a74 mgr/promethues: fixed tests and refactor module.py | | aba8698 mgr/promethues: fixed tests and refactor module.py |
| feebe5b mointoring: Add hardware alerts | PR #69757 | 42b3a7d mointoring: Add hardware alerts |
| 3934f57 monitoring: add PSU temperature graph to hardware dashboard | PR #69753 | 4be833e monitoring: add PSU temperature graph to hardware dashboard |
| e649ace monitoring: add panel descriptions and fix tooltip labels | | b231e6f monitoring: add panel descriptions and fix tooltip labels |
| c8368fa monitoring: add Ceph Hardware - Compression dashboard | | da21940 monitoring: add Ceph Hardware - Compression dashboard |
🛟 Need Help?
If you need technical help resolving these issues, please consult with the Component Lead. If you need administrative overrides, please see the #ceph-upstream-releases channel on Slack and request a review from the @ceph/ceph-release-manager.
📋 Component Lead / Release Manager
To override the audit failure, apply releng-audit-override label or comment /audit override.
When you are ready for a new audit, please remove the releng-audit-fail label or comment /audit retest.
CI Run Log: View Workflow Details
backport tracker: https://tracker.ceph.com/issues/77917
backport of #69750 #69753 #69757
parent tracker: https://tracker.ceph.com/issues/77723
this backport was staged using ceph-backport.sh version 16.0.0.6848
find the latest version at https://github.com/ceph/ceph/blob/main/src/script/ceph-backport.sh