(OVH Real Time Monitoring probes)
This repository contain OVH RTM probes and packaging script. It depends on the default implementation of the ovh-rtm-metrics-toolkit package and the additional tools noderig and beamium.
Real Time Monitoring is composed of 2 packages:
-
ovh-metrics-toolkit: configure beamium and noderig to push monitoring metrics and probes result's to OVH monitoring platform.
-
ovh-rtm-binaries: copies OVH monitoring probes into /usr/bin/rtm*
For details about Noderig see the main repository
For details about Beamium see the main repository
By installing ovh-rtm-metrics-toolkit package you will be able to have a metrics based monitoring solution for your server. (only working for baremetal and public cloud). (public cloud users can only use Insight to retrieve their metrics)
- Displayed on OVHCloud web manager:
- on API:
- But you can also create your own Grafana monitoring dashboard to display metrics values in real time: (recommended method)
How to proceed: RTM on Grafana
Please refer to OVH docs: https://docs.ovh.com/gb/en/dedicated/install-rtm/
http://last.public.ovh.rtm.snap.mirrors.ovh.net/
OVH datacenters are composed of many type of servers, each running differents OSes with different components. This monitoring solution try to be compatible with the most part of them and thus still currently under development.
Feel free to comment or contribute!
RTM collects real time monitoring data (based on noderig default collectors) on CPU, LOAD, RAM, DISK, NET.
RTM probes are perl scripts. Results are mainly available on ovh API. Located in /usr/bin/rtm*, they are launched at differents intervals. It depends on which noderig external collectors folders they are linked. You can see links located in noderig external collector:
/opt/noderig/
├── 3600
│ ├── rtmHourly -> /usr/bin/rtmHourly
│ └── rtmRaidCheck -> /usr/bin/rtmRaidCheck
└── 43200
└── rtmHardware -> /usr/bin/rtmHardware
Thoses scripts exposes monitoring results as prometheus format (https://github.com/prometheus/docs/blob/master/content/docs/instrumenting/exposition_formats.md).
This probe is executed each 3600s (1hour). It collects informations like the uptime, load average, memory usage, current rtm version installed,the top processes, open ports and the number of ongoing processes.
{"metric":"rtm.info.rtm.version","timestamp":1582207693,"value":"1.0.11"}
{"metric":"rtm.info.uptime","timestamp":1582207693,"value":"11400912"}
{"metric":"rtm.hostname","timestamp":1582207693,"value":"ns10000"}
{"metric":"os.load.processesactive","timestamp":1582207693,"value":"1"}
{"metric":"os.load.processesup","timestamp":1582207693,"value":"650"}
{"metric":"rtm.info.mem.top_mem_1_name","timestamp":1582207693,"value":"/usr/bin/syncthing"}
{"metric":"rtm.info.mem.top_mem_1_size","timestamp":1582207693,"value":"4833764"}
{"metric":"rtm.info.mem.top_mem_2_name","timestamp":1582207693,"value":"/usr/bin/smbd"}
{"metric":"rtm.info.mem.top_mem_2_size","timestamp":1582207693,"value":"1871772"}
{"metric":"rtm.info.mem.top_mem_3_name","timestamp":1582207693,"value":"/usr/bin/smbd"}
{"metric":"rtm.info.mem.top_mem_3_size","timestamp":1582207693,"value":"1510164"}
{"metric":"rtm.info.mem.top_mem_4_name","timestamp":1582207693,"value":"/usr/sbin/named"}
{"metric":"rtm.info.mem.top_mem_4_size","timestamp":1582207693,"value":"1025712"}
{"metric":"rtm.info.mem.top_mem_5_name","timestamp":1582207693,"value":"/usr/sbin/rsyslogd"}
{"metric":"rtm.info.mem.top_mem_5_size","timestamp":1582207693,"value":"361880"}
{"metric":"rtm.info.tcp.listen.ip-0-0-0-0.port-79.uid","timestamp":1582207693,"value":"111"}
{"metric":"rtm.info.tcp.listen.ip-0-0-0-0.port-79.pid","timestamp":1582207693,"value":"4851"}
{"metric":"rtm.info.tcp.listen.ip-0-0-0-0.port-79.username","timestamp":1582207693,"value":"oco"}
{"metric":"rtm.info.tcp.listen.ip-0-0-0-0.port-79.exe","timestamp":1582207693,"value":"/usr/bin/perl"}
{"metric":"rtm.info.tcp.listen.ip-0-0-0-0.port-79.cmdline","timestamp":1582207693,"value":"perl"}
{"metric":"rtm.info.tcp.listen.ip-0-0-0-0.port-79.procname","timestamp":1582207693,"value":"perl"}
This probe is executed each 43200s (12h). It collects information on the hardware such as the motherboard, PCI devices, disk health (S.M.A.R.T data), etc. Also collects some information on the software, such as the kernel and BIOS version.
{"metric":"rtm.hw.mb.manufacture","timestamp":1582208062,"value":"Supermicro"}
{"metric":"rtm.hw.mb.name","timestamp":1582208062,"value":"X10SRi-F"}
{"metric":"rtm.hw.mb.serial","timestamp":1582208062,"value":"NM175S506822"}
{"metric":"rtm.info.bios_date","timestamp":1582208062,"value":"12/17/2015"}
{"metric":"rtm.info.bios_version","timestamp":1582208062,"value":"2.0"}
{"metric":"rtm.info.bios_vendor","timestamp":1582208062,"value":"American Megatrends Inc."}
{"metric":"rtm.info.release.os","timestamp":1582208062,"value":"Ubuntu 16.04 xenial"}
{"metric":"rtm.info.kernel.version","timestamp":1582208062,"value":"#193-Ubuntu SMP Tue Sep 17 17:42:52 UTC 2019"}
{"metric":"rtm.info.kernel.release","timestamp":1582208062,"value":"4.4.0-165-generic"}
{"metric":"rtm.hw.cpu.name","timestamp":1582208062,"value":"Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz"}
{"metric":"rtm.hw.cpu.number","timestamp":1582208062,"value":"12"}
{"metric":"rtm.hw.cpu.cache","timestamp":1582208062,"value":"15360 KB"}
{"metric":"rtm.hw.cpu.mhz","timestamp":1582208062,"value":"1212.750"}
{"metric":"rtm.info.check.vm","timestamp":1582208062,"value":"False"}
{"metric":"rtm.info.check.oops","timestamp":1582208062,"value":"False"}
{"metric":"rtm.hw.mem.bank-P0-Node0-Channel0-Dimm1-DIMMA2","timestamp":1582208062,"value":"No Module Installed"}
{"metric":"rtm.hw.mem.bank-P0-Node0-Channel1-Dimm1-DIMMB2","timestamp":1582208062,"value":"No Module Installed"}
{"metric":"rtm.hw.mem.bank-P0-Node0-Channel2-Dimm0-DIMMC1","timestamp":1582208062,"value":"16384"}
{"metric":"rtm.hw.mem.bank-P0-Node0-Channel1-Dimm0-DIMMB1","timestamp":1582208062,"value":"16384"}
{"metric":"rtm.hw.mem.bank-P0-Node0-Channel2-Dimm1-DIMMC2","timestamp":1582208062,"value":"No Module Installed"}
{"metric":"rtm.hw.mem.bank-P0-Node0-Channel3-Dimm1-DIMMD2","timestamp":1582208062,"value":"No Module Installed"}
{"metric":"rtm.hw.mem.bank-P0-Node0-Channel3-Dimm0-DIMMD1","timestamp":1582208062,"value":"16384"}
{"metric":"rtm.hw.mem.bank-P0-Node0-Channel0-Dimm0-DIMMA1","timestamp":1582208062,"value":"16384"}
{"metric":"rtm.hw.lspci.pci.ff-15-2","timestamp":1582208062,"value":"8086:6fb6"}
....
{"metric":"rtm.info.hdd.sda.capacity","timestamp":1582208062,"value":"4.00 TB"}
{"metric":"rtm.info.hdd.sda.link_type","timestamp":1582208062,"value":"sata"}
{"metric":"rtm.info.hdd.sda.firmware","timestamp":1582208062,"value":"A5GNT920"}
{"metric":"rtm.info.hdd.sda.dmesg.io.errors","timestamp":1582208062,"value":"0"}
{"metric":"rtm.info.hdd.sda.disk_type","timestamp":1582208062,"value":"hdd"}
{"metric":"rtm.info.hdd.sda.iostat.busy","timestamp":1582208062,"value":"3.34"}
{"metric":"rtm.info.hdd.sda.model","timestamp":1582208062,"value":"HGST HUS726040ALA610"}
{"metric":"rtm.info.hdd.sda.serial","timestamp":1582208062,"value":"K3GDA42B"}
{"metric":"rtm.info.hdd.sda.temperature","timestamp":1582208062,"value":"37"}
{"metric":"rtm.info.hdd.sda.iostat.read.per.sec","timestamp":1582208062,"value":"5.11"}
{"metric":"rtm.info.hdd.sda.iostat.writekb.per.sec","timestamp":1582208062,"value":"108.74"}
{"metric":"rtm.info.hdd.sda.iostat.write.merged.per.sec","timestamp":1582208062,"value":"1.38"}
{"metric":"rtm.info.hdd.sda.iostat.write.avg.wait","timestamp":1582208062,"value":"5.39"}
{"metric":"rtm.info.hdd.sda.iostat.read.avg.wait","timestamp":1582208062,"value":"9.25"}
{"metric":"rtm.info.hdd.sda.iostat.read.merged.per.sec","timestamp":1582208062,"value":"0.01"}
{"metric":"rtm.info.hdd.sda.iostat.readkb.per.sec","timestamp":1582208062,"value":"616.81"}
{"metric":"rtm.info.hdd.sda.iostat.write.per.sec","timestamp":1582208062,"value":"3.19"}
{"metric":"rtm.info.hdd.sda.smart.highest-temperature","timestamp":1582208062,"value":"47"}
{"metric":"rtm.info.hdd.sda.smart.bytes-read","timestamp":1582208062,"value":"21596906627584"}
{"metric":"rtm.info.hdd.sda.smart.udma-crc-error","timestamp":1582208062,"value":"0"}
{"metric":"rtm.info.hdd.sda.smart.link-failures","timestamp":1582208062,"value":"0"}
{"metric":"rtm.info.hdd.sda.smart.temperature","timestamp":1582208062,"value":"37"}
{"metric":"rtm.info.hdd.sda.smart.offline-uncorrectable","timestamp":1582208062,"value":"0"}
{"metric":"rtm.info.hdd.sda.smart.percentage-used","timestamp":1582208062,"value":"0"}
{"metric":"rtm.info.hdd.sda.smart.realocated-event-count","timestamp":1582208062,"value":"0"}
{"metric":"rtm.info.hdd.sda.smart.reallocated-sector-count","timestamp":1582208062,"value":"0"}
{"metric":"rtm.info.hdd.sda.smart.reported-corrected","timestamp":1582208062,"value":"-1"}
{"metric":"rtm.info.hdd.sda.smart.power-cycles","timestamp":1582208062,"value":"23"}
{"metric":"rtm.info.hdd.sda.smart.power-on-hours","timestamp":1582208062,"value":"10521"}
{"metric":"rtm.info.hdd.sda.smart.global-health","timestamp":1582208062,"value":"1"}
{"metric":"rtm.info.hdd.sda.smart.logged-error-count","timestamp":1582208062,"value":"0"}
{"metric":"rtm.info.hdd.sda.smart.current-pending-sector","timestamp":1582208062,"value":"0"}
{"metric":"rtm.info.hdd.sda.smart.reported-uncorrect","timestamp":1582208062,"value":"0"}
{"metric":"rtm.info.hdd.sda.smart.lowest-temperature","timestamp":1582208062,"value":"19"}
{"metric":"rtm.info.hdd.sda.smart.bytes-written","timestamp":1582208062,"value":"10071769343488"}
{"metric":"rtm.info.hdd.sda.smart.time","timestamp":1582208062,"value":"0"}
{"metric":"rtm.info.hdd.sda.smart.command-timeout","timestamp":1582208062,"value":"-1"}
This probe is executed each 3600s (1hour). It collects information on RAID health's (if available).
{"metric":"rtm.hw.scsiraid.unit.md3.vol0.capacity","timestamp":1582208405,"value":"24.4 GB"}
{"metric":"rtm.hw.scsiraid.unit.md3.vol0.phys","timestamp":1582208405,"value":"3"}
{"metric":"rtm.hw.scsiraid.unit.md3.vol0.type","timestamp":1582208405,"value":"raid1"}
{"metric":"rtm.hw.scsiraid.unit.md3.vol0.status","timestamp":1582208405,"value":"active"}
{"metric":"rtm.hw.scsiraid.unit.md3.vol0.flags","timestamp":1582208405,"value":"clean"}
{"metric":"rtm.hw.scsiraid.port.md3.vol0.sda3.capacity","timestamp":1582208405,"value":"24.4 GB"}
{"metric":"rtm.hw.scsiraid.port.md3.vol0.sda3.status","timestamp":1582208405,"value":"active"}
{"metric":"rtm.hw.scsiraid.port.md3.vol0.sda3.flags","timestamp":1582208405,"value":"sync"}
{"metric":"rtm.hw.scsiraid.port.md3.vol0.sdb3.capacity","timestamp":1582208405,"value":"24.4 GB"}
{"metric":"rtm.hw.scsiraid.port.md3.vol0.sdb3.status","timestamp":1582208405,"value":"active"}
{"metric":"rtm.hw.scsiraid.port.md3.vol0.sdb3.flags","timestamp":1582208405,"value":"sync"}
Only available with nvidia cards. Collects information and metrics on nvidia hardware.
*need nvidia-smi driver and application installed
Instructions on how to contribute to OVH RTM are available on the Contributing page.