User Details
- User Since: Oct 3 2014, 8:06 AM (536 w, 1 d)
- Availability: Available
- IRC Nick: godog
- LDAP User: Filippo Giunchedi
- MediaWiki User: FGiunchedi (WMF) [ Global Accounts ]
Tue, Jan 7
See also this, re: statuspage and IPv6 support: https://community.atlassian.com/t5/Statuspage-questions/Enabling-IPv6-IPv4-dualstack-for-a-status-page/qaq-p/2847737
Thu, Dec 19
No longer valid, I think; also, the MXes now use postfix
This has been done at some point in the host overview dashboard; sample query: node_cpu_frequency_hertz{instance=~"titan1001:.*"} or node_cpu_scaling_frequency_hertz{instance=~"titan1001:.*"}
I believe this is done for all intents and purposes; what do you think, @Clement_Goubert?
I'll boldly resolve this task; we haven't seen a recurrence
The check has moved to Prometheus in T370506: Replace check_ripe_atlas with prometheus alert
Thank you all for looking into this -- let's indeed see how 3m (or larger) goes and if that is satisfactory!
Wed, Dec 18
Indeed the underlying data/samples are there as expected: I tested this theory by removing all functions and looking at the raw data, which indeed has no gaps. I noticed the interval used for the (i)rate calculation is 2m, which I believe is the culprit, in the sense that we're scraping every 1m and the raw data points (i.e. two) might fall outside the window examined by irate(...[2m]) for a given step, therefore returning no points for that step. Switching the rate calculation from 2m to 5m widens the window to look for data and eliminates the gaps; note the values are unchanged since rate/irate always return per-second values. Let me know what you think! cc @CDanis
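As a minimal sketch of the change (hypothetical counter name, assuming the 1m scrape interval mentioned above; the actual dashboard query differs):

# 2m window: with 1m scraping there are at most ~2 samples per step, and a single
# late or missed scrape leaves fewer than the two points irate needs, hence the gaps
irate(some_requests_total{instance="example:9100"}[2m])

# 5m window: roughly 5 samples per step, so a missed scrape no longer empties the
# window; values are unchanged because irate still returns per-second rates
irate(some_requests_total{instance="example:9100"}[5m])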
We can certainly try the caching bucket in thanos-store, since it is easy to do. I'd be happy to be wrong, although I don't think it's going to help much, since the problem in my mind is the quantity of data that the thanos components (thanos-store, thanos-query, thanos-query-frontend) have to process.
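As a rough way to quantify how much object-storage work thanos-store is doing, something along these lines (a sketch using the standard Thanos objstore metric; the job label is an assumption and may differ in our setup):

# object storage operations per second issued by thanos-store, broken down by operation
sum by (operation) (rate(thanos_objstore_bucket_operations_total{job="thanos-store"}[5m]))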
No worries at all @cmooney, I've opened T382396: Investigate gnmic metric gaps and counters going to zero to investigate/follow up on the two issues you mentioned
Tue, Dec 17
Thank you for the extensive explanation, @cmooney! Yes, let's definitely go over the issues you outlined in the gnmi task, and I'm happy to assist! Also thank you for the dashboards, I'll take a look; no cheekiness has been detected
I took a look as well at the general performance degradation when switching away from recording rules (i.e. increased CPU and network bandwidth on titan[12]001). Despite the attempts at optimizing what we have with @herron (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1103365?usp=search https://gerrit.wikimedia.org/r/c/operations/puppet/+/1104690?usp=search https://gerrit.wikimedia.org/r/c/operations/puppet/+/1104678?usp=search https://gerrit.wikimedia.org/r/c/operations/puppet/+/1103352?usp=search), I'm not seeing a significant change.
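For context, the kind of work a recording rule front-loads (names below are purely illustrative): the rule evaluates the aggregation once per evaluation interval, whereas dashboards issuing the raw expression make Prometheus recompute it on every refresh, which is consistent with the extra CPU/network we're seeing on the titan hosts.

# expression a recording rule such as instance:some_requests:rate5m would precompute
# once per evaluation interval, instead of once per dashboard panel refresh
sum by (instance) (rate(some_requests_total[5m]))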
Mon, Dec 16
Thank you @Scott_French! I'm happy to help with ops-maint-gcal.js changes, feel free to send reviews my way
Fri, Dec 13
I took another look at the dashboards and it looks to me like we now have all the interesting switch port metrics in Prometheus via gnmi for cloudsw devices. Specifically, port in/out bytes and discards seem to be the most (or only) graphed metrics. Does that track, @cmooney @dcaro? If so I'll start a test conversion of https://grafana.wikimedia.org/d/613dNf3Gz/wmcs-ceph-eqiad-performance to swap out graphite for prometheus for said metrics.
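For the conversion I'd expect queries roughly along these lines; the metric names below are illustrative only and need to be checked against what gnmi actually exports into Prometheus:

# per-port inbound traffic in bytes/s on cloudsw devices (hypothetical metric name)
rate(gnmi_interfaces_interface_state_counters_in_octets{instance=~"cloudsw.*"}[5m])

# per-port inbound discards/s (hypothetical metric name)
rate(gnmi_interfaces_interface_state_counters_in_discards{instance=~"cloudsw.*"}[5m])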
Dec 5 2024
FWIW, as stated now the task is not actionable; I recommend setting the alerts up as instructed at https://wikitech.wikimedia.org/wiki/Alertmanager#Grafana_alerts
Dec 4 2024
I'm optimistically resolving this since I don't think we've seen it recently
On the thanos-compact side, the signals for backlogged operations are the following metrics:
For the purposes of routing instance alerts to their owner team we have been using role_owner: joining it on instance will attach the correct team label, and thus alertmanager can apply the correct routing. Taking NodeTextfileStale as inspiration, the critical expression would be: ( node_nf_conntrack_entries / node_nf_conntrack_entries_limit * 100 >= 90 ) * on (instance) group_left(team) role_owner . Since the team label will come from the expression, it can be removed from the alert itself. Hope that helps!
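For readability, the same expression spelled out (a sketch; threshold and labels as per the prose above):

# fire when the conntrack table is at or above 90% of its limit ...
(
  node_nf_conntrack_entries / node_nf_conntrack_entries_limit * 100 >= 90
)
# ... and attach the owning team by joining role_owner on instance; group_left(team)
# copies the team label onto the result, which alertmanager then uses for routing
* on (instance) group_left(team) role_owner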
Dec 2 2024
Indeed, grafana alerts are to be sent to alertmanager (i.e. centrally), and from there alerts can trigger email notifications. Moving to the cxserver folks; I'm assuming an alert misconfiguration (?)
Sep 24 2024
Nice! Thank you @hnowlan, resolving as we're done
Sep 16 2024
Doh! You are quite right, the relevant alert is MaxConntrack
Thank you @ssingh. In addition to the cloudlb hosts, there are also centrallog hosts running bullseye + 0.8. At any rate, I tried installing the bookworm anycast-healthchecker 0.9 package on bullseye on a test host and it seems to work as expected (i.e. a rebuild for bullseye doesn't seem to be needed)
@ssingh from T370068: Upgrade anycast-healthchecker to 0.9.8 (from 0.9.1-1+wmf12u1) I couldn't find any obvious blocker to having 0.9 on bullseye. Do you remember if there were any obvious blockers for that to happen? Thank you
Also note that prometheus-postgres-exporter has gained support for replication monitoring since 0.12.0. This is IMHO the proper solution, though it will require >= trixie unless we choose to backport the package instead: https://github.com/prometheus-community/postgres_exporter/blob/master/CHANGELOG.md#0120--2023-03-21
I think it is an option that can/should be considered, since it would mean future-proofing the postgresql monitoring infrastructure.
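As an example of what that would enable, a sketch of a replication-lag alert expression; the exact metric name (e.g. pg_replication_lag_seconds) is an assumption and should be double-checked against the exporter version we'd deploy:

# alert if a replica is more than 5 minutes behind its primary (threshold illustrative)
pg_replication_lag_seconds > 300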
Resolving since we haven't run into further problems with envoy metrics ingestion
Sep 12 2024
From my investigation so far on IRC:
script is deployed!
Sep 11 2024
The Cloud-Services project tag is not intended to have any tasks. Please check the list at https://phabricator.wikimedia.org/project/profile/832/ and replace it with a more specific project tag for this task. Thanks!
Sep 10 2024
This is done. I've set the service as non-paging since we're using it for the prometheus web interface (i.e. humans), whereas the http service is paging since that's for automated access
This is done! The only other use of graphite_threshold is 'zuul_gearman_wait_queue', which will be addressed as part of T233089: Export zuul metrics to Prometheus. There are of course alerts for graphite itself left, which will be removed together with graphite.
Sep 9 2024
I have merged the ported mw alerts (thank you @Clement_Goubert !) and changed the referenced dashboard at https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts to show the prometheus metrics.