
fgiunchedi (Filippo Giunchedi)
/* No comment */

Projects (17)

User Details

User Since
Oct 3 2014, 8:06 AM (536 w, 1 d)
Availability
Available
IRC Nick
godog
LDAP User
Filippo Giunchedi
MediaWiki User
FGiunchedi (WMF) [ Global Accounts ]

Recent Activity

Thu, Jan 9

andrea.denisse awarded T178690: Better organization for SRE grafana dashboards a Love token.
Thu, Jan 9, 9:31 PM · Observability-Metrics, User-CDanis, SRE
fgiunchedi created T383309: rsyslog receiver on centrallog hosts misplaces some log host entries.
Thu, Jan 9, 9:53 AM · Observability-Logging

Wed, Jan 8

fgiunchedi renamed T243065: Provision plaintext syslog collectors in PoPs from Provision plaintext syslog collectors in esams/ulsfo/eqsin to Provision plaintext syslog collectors in PoPs.
Wed, Jan 8, 3:40 PM · Patch-For-Review, Observability-Logging, SRE
fgiunchedi created T383232: Move a selection of Prometheus instances to new Prometheus hw in eqiad/codfw.
Wed, Jan 8, 2:44 PM · SRE Observability (FY2024/2025-Q2), Observability-Metrics
fgiunchedi created T383223: Resolve prometheus k8s-mlstaging and k8s-dse port conflict.
Wed, Jan 8, 1:44 PM · SRE Observability (FY2024/2025-Q2), Observability-Metrics

Tue, Jan 7

fgiunchedi created T383118: One day earlier date in french auto-thankyou confirmation.
Tue, Jan 7, 10:58 AM · FR-AutoTY-Email
fgiunchedi added a comment to T372457: Remove librenms -> graphite integration, replace with gnmi.

Thank you for the feedback @dcaro, appreciate it! We did resolve the gaps issue by extending the rate() window in T382396 and that indeed fixed the issue.

Tue, Jan 7, 10:43 AM · Cloud-VPS, SRE Observability (FY2024/2025-Q2), cloud-services-team
fgiunchedi added a comment to T382851: www.wikimediastatus.net does not support IPv6.

see also this re: statuspage and ipv6 support https://community.atlassian.com/t5/Statuspage-questions/Enabling-IPv6-IPv4-dualstack-for-a-status-page/qaq-p/2847737

Tue, Jan 7, 9:01 AM · IPv6, Incident Tooling

Fri, Dec 20

fgiunchedi added a comment to T368953: Thanos Cache Tuning.

Re: reverting liftwing to previous recording rules, the trouble is it'd be a revert to a broken state in Pyrra where the SLOs would be recording bad values and would activate several burn alerts. If we do need to revert, the safest thing to do would be to offboard the liftwing SLOs, but it'd be nice to avoid that if we can. Would it be fair to prep an offboard patch and keep it in our back pocket in case of issues?

Fri, Dec 20, 9:57 AM · Patch-For-Review, Observability-Metrics

Thu, Dec 19

fgiunchedi closed T251155: replace check_ripe_atlas Python script with a check_prometheus backed by atlasexporter data, a subtask of T167689: Add RIPE atlas data to Prometheus, as Resolved.
Thu, Dec 19, 3:03 PM · observability, SRE
fgiunchedi closed T251155: replace check_ripe_atlas Python script with a check_prometheus backed by atlasexporter data as Resolved.

Done in T370506: Replace check_ripe_atlas with prometheus alert

Thu, Dec 19, 3:03 PM · Observability-Metrics, Infrastructure-Foundations, netops, SRE
fgiunchedi closed T275867: Add exim queue size to grafana graph, a subtask of T297144: large MX queues should page, as Invalid.
Thu, Dec 19, 3:01 PM · observability, Mail, Sustainability (Incident Followup), SRE, Infrastructure-Foundations
fgiunchedi closed T275867: Add exim queue size to grafana graph as Invalid.

No longer valid I think, also MXes now use postfix

Thu, Dec 19, 3:01 PM · SRE-Sprint-Week-Sustainability-March2023, Observability-Metrics, Infrastructure-Foundations, Sustainability (Incident Followup), Mail
fgiunchedi closed T286768: node_cpu_frequency_hertz metric no longer present in Bullseye as Resolved.

Has been done at some point in host overview dashboard, sample query: node_cpu_frequency_hertz{instance=~"titan1001:.*"} or node_cpu_scaling_frequency_hertz{instance=~"titan1001:.*"}
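
The `or` in the sample query acts as a fallback between the two metric names. A minimal Python sketch of that behaviour follows, assuming series are matched by label set with the metric name ignored. The port number and the frequency values are made up for illustration.

```python
def promql_or(left, right):
    """Mimic PromQL's `or`: keep every series from `left`, then add series
    from `right` whose label set does not already appear in `left`."""
    result = dict(left)
    for labels, value in right.items():
        result.setdefault(labels, value)
    return result


# Hypothetical samples keyed by (instance, cpu); port and values are assumed.
# This host exposes only the newer metric name, so the left side is empty.
node_cpu_frequency_hertz = {}
node_cpu_scaling_frequency_hertz = {
    ("titan1001:9100", "0"): 2.1e9,
    ("titan1001:9100", "1"): 2.4e9,
}

merged = promql_or(node_cpu_frequency_hertz, node_cpu_scaling_frequency_hertz)
for (instance, cpu), hertz in sorted(merged.items()):
    print(f"{instance} cpu={cpu} {hertz / 1e9:.1f} GHz")
```

Whichever metric a host exports, the panel gets one series per CPU, which is why a single query can cover hosts on either side of the rename.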

Thu, Dec 19, 2:59 PM · Observability-Metrics, SRE
fgiunchedi closed T286768: node_cpu_frequency_hertz metric no longer present in Bullseye, a subtask of T275873: Prepare our base system layer for Debian 11/bullseye, as Resolved.
Thu, Dec 19, 2:59 PM · SRE
fgiunchedi added a comment to T365265: Create a per-release deployment of statsd-exporter for mw-on-k8s.

I believe this is done for all intents and purposes, what do you think @Clement_Goubert ?

Thu, Dec 19, 2:52 PM · SRE Observability (FY2024/2025-Q2), Patch-For-Review, MW-on-K8s, serviceops, Observability-Metrics
fgiunchedi closed T366492: grafana-server exploding in memory as Invalid.

I'll boldly resolve this task, we haven't seen a reoccurrence

Thu, Dec 19, 2:50 PM · Observability-Metrics
fgiunchedi closed T312108: icinga: check_ripe_atlas.py raises 502: Bad Gateway as Declined.

The check has moved to Prometheus in T370506: Replace check_ripe_atlas with prometheus alert

Thu, Dec 19, 2:41 PM · Icinga, Observability-Alerting
fgiunchedi added a comment to T368953: Thanos Cache Tuning.

Initial results with Thanos store caching bucket enabled look promising. I'm seeing reductions in duration, errors and network/socket utilization, and a slight decrease in CPU utilization

{F58028229} {F58028264}{F58028221} {F58028224}

There are also some increases in disk read/write (see above) and GC time; we should be able to tune these out if they become a problem

{F58028233}

I added new panels to track Thanos store caching bucket metrics here https://grafana-rw.wikimedia.org/d/c2b5ccc9-0c16-45ae-99bc-a244e3f73808/thanos-cache-overview

Since we're starting to see some positive effects I think it's worth trying to tune this further. There's still room for improvement in iter and get hits, and in bucket cache evictions overall. Also you can see Thanos store quickly consumed all the memory available for cache

Thu, Dec 19, 10:13 AM · Patch-For-Review, Observability-Metrics
fgiunchedi added a comment to T382396: Investigate gnmic metric gaps and counters going to zero.

Thank you all for looking into this -- let's indeed see how 3m (or larger) goes and if that is satisfactory!

Thu, Dec 19, 9:44 AM · netops, Infrastructure-Foundations, SRE
fgiunchedi awarded T233089: Export zuul metrics to Prometheus a Party Time token.
Thu, Dec 19, 8:00 AM · Release-Engineering-Team (Seen), Patch-For-Review, Continuous-Integration-Infrastructure, observability, SRE

Wed, Dec 18

fgiunchedi updated subscribers of T382396: Investigate gnmic metric gaps and counters going to zero.

Indeed the underlying data/samples are there as expected: I tested this theory by removing all functions and looking at the raw data, which indeed has no gaps. I noticed the interval used for the (i)rate calculation is 2m, which I believe is the culprit: we're scraping data every 1m, and the two raw data points needed might fall outside the window examined by irate(...[2m]) for a given interval, therefore returning no points for that interval. Switching the rate calculation from 2m to 5m widens the window to look for data and eliminates the gaps; note the values are unchanged since rate/irate always return per-second values. Let me know what you think! cc @CDanis
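
The window effect described above can be reproduced with a small simulation. This is only a sketch under stated assumptions: the sample timestamps are invented (a 60s interval with one sample arriving 10s late), the counter grows at a constant 10 units per second, and the window is treated as left-open. It is not the actual gnmic data.

```python
from bisect import bisect_right

# Hypothetical sample timestamps (seconds): a 60s scrape interval where one
# sample lands 10s late (250 instead of 240). No sample is missing.
TIMES = [0, 60, 120, 180, 250, 300, 360, 420, 480]
# Counter growing at exactly 10 units per second.
SAMPLES = [(t, 10 * t) for t in TIMES]


def irate(samples, t, window):
    """Per-second rate from the last two samples in (t - window, t].

    Returns None when fewer than two samples fall inside the window, which
    is what shows up as a gap on a graph. The left-open window boundary is a
    simplifying assumption of this sketch.
    """
    times = [ts for ts, _ in samples]
    lo = bisect_right(times, t - window)  # first sample newer than t - window
    hi = bisect_right(times, t)           # one past the last sample at or before t
    if hi - lo < 2:
        return None
    (t0, v0), (t1, v1) = samples[hi - 2], samples[hi - 1]
    return (v1 - v0) / (t1 - t0)


def evaluate(window, steps=range(120, 481, 60)):
    """Evaluate irate at each step; return (gap timestamps, set of rates)."""
    results = {t: irate(SAMPLES, t, window) for t in steps}
    gaps = [t for t, r in results.items() if r is None]
    rates = {r for r in results.values() if r is not None}
    return gaps, rates


for window in (120, 300):
    gaps, rates = evaluate(window)
    print(f"window={window}s gaps at {gaps} rates={sorted(rates)}")
```

With these invented timestamps the 2m window leaves one evaluation step with a single sample and therefore no result, while the 5m window always finds at least two. The computed per-second rate is the same in both cases.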

Wed, Dec 18, 1:37 PM · netops, Infrastructure-Foundations, SRE
fgiunchedi added a comment to T302995: Transition to Pyrra for SLO Visualization and Management.

We can certainly try the caching bucket in thanos store, since it is easy to do. I'd be happy to be wrong although I don't think that's going to help anything, since the problem in my mind is the quantity of data that thanos components (thanos-store, thanos-query, thanos-query-frontend) have to process.

Wed, Dec 18, 9:35 AM · Patch-For-Review, User-herron, Observability-Metrics
fgiunchedi added a comment to T369384: Productionize gnmic network telemetry pipeline.

No worries at all @cmooney, I've opened T382396: Investigate gnmic metric gaps and counters going to zero to investigate/followup on the two issues you mentioned

Wed, Dec 18, 9:13 AM · netops, Infrastructure-Foundations, SRE
fgiunchedi created T382396: Investigate gnmic metric gaps and counters going to zero.
Wed, Dec 18, 9:12 AM · netops, Infrastructure-Foundations, SRE

Tue, Dec 17

fgiunchedi added a comment to T372457: Remove librenms -> graphite integration, replace with gnmi.

Thank you for the extensive explanation @cmooney ! Yes definitely let's go over the issues you outlined in the gnmi task and I'm happy to assist! Also thank you for the dashboards, I'll take a look and no cheekiness has been detected

Tue, Dec 17, 1:56 PM · Cloud-VPS, SRE Observability (FY2024/2025-Q2), cloud-services-team
fgiunchedi added a comment to T302995: Transition to Pyrra for SLO Visualization and Management.

I took a look as well at the general performance degradation when switching away from recording rules (i.e. increased CPU and network bandwidth on titan[12]001). Despite the attempts at optimizing what we have with @herron (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1103365?usp=search https://gerrit.wikimedia.org/r/c/operations/puppet/+/1104690?usp=search https://gerrit.wikimedia.org/r/c/operations/puppet/+/1104678?usp=search https://gerrit.wikimedia.org/r/c/operations/puppet/+/1103352?usp=search) I'm not seeing a significant change.

Tue, Dec 17, 1:42 PM · Patch-For-Review, User-herron, Observability-Metrics

Mon, Dec 16

fgiunchedi added a comment to T381680: The ops-maint-gcal.js script is missing support for some vendors.

Thank you @Scott_French! I'm happy to help with ops-maint-gcal.js changes, feel free to send reviews my way

Mon, Dec 16, 8:55 AM · SRE-Unowned

Fri, Dec 13

fgiunchedi updated subscribers of T372457: Remove librenms -> graphite integration, replace with gnmi.

I took another look at the dashboards and it looks to me like we now have all interesting switch port metrics in Prometheus via gnmi for cloudsw devices. Specifically port in/out bytes and discards seem the most/only graphed metrics. Does that track @cmooney @dcaro ? If so I'll start a test conversion of https://grafana.wikimedia.org/d/613dNf3Gz/wmcs-ceph-eqiad-performance to swap out graphite for prometheus for said metrics.

Fri, Dec 13, 11:27 AM · Cloud-VPS, SRE Observability (FY2024/2025-Q2), cloud-services-team

Dec 9 2024

fgiunchedi created T381771: Occasional saturation of asw2-b-eqiad / cr port uplink and cache upload usage.
Dec 9 2024, 1:26 PM · Patch-For-Review, Traffic

Dec 6 2024

fgiunchedi moved T370153: Move kafka-mirror Prometheus-based alerts from Icinga to alerts.git from Inbox to In Progress on the SRE Observability (FY2024/2025-Q2) board.
Dec 6 2024, 2:41 PM · SRE Observability (FY2024/2025-Q2), Patch-For-Review
fgiunchedi moved T381561: deployment site for prometheus::blackbox::check::(icmp|http|tcp) is not driven by the $site parameter from Inbox to In Progress on the SRE Observability (FY2024/2025-Q2) board.
Dec 6 2024, 2:40 PM · Patch-For-Review, SRE Observability (FY2024/2025-Q2), Observability-Alerting

Dec 5 2024

fgiunchedi added a comment to T376535: Setup alert for LPL team Grafana dashboards.

FWIW as stated now the task is not actionable, I recommend setting alerts up as instructed by https://wikitech.wikimedia.org/wiki/Alertmanager#Grafana_alerts

Dec 5 2024, 9:25 AM · LPL Essential (LPL Essential 2024 Nov-Dec), CX-cxserver

Dec 4 2024

fgiunchedi closed T320594: Flapping probes for centrallog2001 as Invalid.

I'm optimistically resolving this since I don't think we've seen it recently

Dec 4 2024, 1:52 PM · SRE Observability (FY2024/2025-Q2), Observability-Logging
fgiunchedi moved T381466: storage sizing for Thanos appears to be underestimated (compactor unable to perform its operations due to low concurrency) from Inbox to In Progress on the SRE Observability (FY2024/2025-Q2) board.
Dec 4 2024, 1:48 PM · SRE Observability (FY2024/2025-Q2), Observability-Metrics
fgiunchedi added a comment to T381466: storage sizing for Thanos appears to be underestimated (compactor unable to perform its operations due to low concurrency).

On the thanos-compact side the signal for backlogged operations are the following metrics:

Dec 4 2024, 11:27 AM · SRE Observability (FY2024/2025-Q2), Observability-Metrics
fgiunchedi added a comment to T367065: Move profile::idp::client::httpd::site checks to Prometheus blackbox probes.

To some extent I question the value of these checks. It's more or less just an "isalive" check for Apache and does not reveal anything about the underlying service.

Dec 4 2024, 10:30 AM · Infrastructure-Foundations, Observability-Alerting
fgiunchedi added a comment to T374827: Port conntrack utilization alert to Prometheus/Alertmanager.

For the purposes of routing instance alerts to their owner team we have been using role_owner variable, joining said variable on instance will attach the correct team label and thus alertmanager can apply the correct routing. Taking NodeTextfileStale as inspiration, the critical expression would be: ( node_nf_conntrack_entries / node_nf_conntrack_entries_limit * 100 >= 90 ) * on (instance) group_left(team) role_owner . Since team label will be in the expression, it can be removed from the alert itself. Hope that helps!
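
To illustrate what the join in that expression does, here is a minimal Python sketch. It assumes role_owner is a series with value 1 per instance carrying a team label, so the multiplication leaves the utilization value unchanged and only attaches the label. The host names, counts and team names below are hypothetical.

```python
def conntrack_alerts(entries, limits, role_owner, threshold=90.0):
    """Sketch of:
    (entries / limit * 100 >= threshold) * on (instance) group_left(team) role_owner

    `role_owner` maps instance -> team and stands in for a series whose
    value is assumed to be 1, so only the team label is added.
    """
    alerts = []
    for instance, used in entries.items():
        limit = limits.get(instance)
        if not limit:
            continue  # no matching limit series: the division yields nothing
        utilization = used / limit * 100
        if utilization < threshold:
            continue  # the >= filter drops series below the threshold
        team = role_owner.get(instance)
        if team is None:
            continue  # no role_owner match: the join drops the series
        alerts.append({"instance": instance, "team": team, "value": utilization})
    return alerts


# Hypothetical inputs for illustration only.
entries = {"host1001:9100": 950, "host1002:9100": 100, "host1003:9100": 990}
limits = {"host1001:9100": 1000, "host1002:9100": 1000, "host1003:9100": 1000}
role_owner = {"host1001:9100": "team-a", "host1002:9100": "team-b"}

for alert in conntrack_alerts(entries, limits, role_owner):
    print(alert)
```

In this sketch a host without a role_owner entry produces no alert at all, which is worth keeping in mind when relying on the join for routing.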

Dec 4 2024, 10:27 AM · Patch-For-Review, SRE Observability (FY2024/2025-Q2), Infrastructure-Foundations, Observability-Alerting
fgiunchedi moved T370772: Prometheus eqiad/codfw hw expansion architecture options from Inbox to In Progress on the SRE Observability (FY2024/2025-Q2) board.
Dec 4 2024, 10:13 AM · SRE Observability (FY2024/2025-Q2), Observability-Metrics
fgiunchedi moved T371087: Configure Prometheus instance centrally from Inbox to In Progress on the SRE Observability (FY2024/2025-Q2) board.
Dec 4 2024, 10:13 AM · Patch-For-Review, SRE Observability (FY2024/2025-Q2), Observability-Metrics

Dec 2 2024

fgiunchedi edited projects for T376535: Setup alert for LPL team Grafana dashboards, added: CX-cxserver; removed Grafana.

Indeed, grafana alerts are to be sent to alertmanager (i.e. centrally) and from there alerts can trigger email notifications. Moving to cxserver folks, I'm assuming an alert misconfiguration (?)

Dec 2 2024, 2:14 PM · LPL Essential (LPL Essential 2024 Nov-Dec), CX-cxserver
fgiunchedi placed T316022: Clean up check_ssl checks from puppet also covered by blackbox prober up for grabs.
Dec 2 2024, 9:41 AM · Observability-Alerting, User-fgiunchedi

Sep 25 2024

fgiunchedi edited projects for T375634: Allow to route generic alerts (like ProbeDown) to team receivers, added: observability; removed Observability-Alerting.
Sep 25 2024, 2:50 PM · Abstract Wikipedia team, Observability-Alerting
fgiunchedi updated the task description for T370153: Move kafka-mirror Prometheus-based alerts from Icinga to alerts.git.
Sep 25 2024, 11:15 AM · SRE Observability (FY2024/2025-Q2), Patch-For-Review
fgiunchedi added a comment to T371426: Create log-based alerting.

Wow, looks great 🤩! Yes, that sounds like a plan, TY!
How do we route these alerts to trigger our Slack?

Sep 25 2024, 9:18 AM · Abstract Wikipedia team (25Q2 (Oct–Dec)), function-orchestrator, function-evaluator, Observability-Logging
andrea.denisse awarded T375085: mtail 3.0.0~rc50-1+b6 leaks memory on centrallog2002 a 100 token.
Sep 25 2024, 1:22 AM · SRE Observability (FY2024/2025-Q1)

Sep 24 2024

fgiunchedi created T375522: graphite hosts almost saturate their 1gbit NICs.
Sep 24 2024, 2:06 PM · SRE Observability (FY2024/2025-Q2)
fgiunchedi renamed T370526: Remove load_average check for ms-be/thanos-be from Port to Prometheus load_average check to Remove load_average check for ms-be/thanos-be.
Sep 24 2024, 1:10 PM · SRE Observability (FY2024/2025-Q2), SRE-swift-storage, Observability-Alerting
fgiunchedi created P69400 (An Untitled Masterwork).
Sep 24 2024, 12:50 PM
fgiunchedi closed T374860: Retire mw_wikiversion_difference check as Resolved.

Nice! Thank you @hnowlan, resolving as we're done

Sep 24 2024, 9:33 AM · serviceops, SRE Observability (FY2024/2025-Q1), Observability-Alerting
fgiunchedi closed T374860: Retire mw_wikiversion_difference check, a subtask of T321808: Port all Icinga checks to Prometheus/Alertmanager, as Resolved.
Sep 24 2024, 9:33 AM · SRE Observability (FY2024/2025-Q2), Observability-Alerting
fgiunchedi created T375475: debmonitor could provide users with cumin and/or debdeploy pre-made config/command.
Sep 24 2024, 9:09 AM · SRE-tools, Infrastructure-Foundations
fgiunchedi removed a watcher for SRE-OnFire: fgiunchedi.
Sep 24 2024, 7:27 AM

Sep 20 2024

fgiunchedi created T375271: Remove uwsgi::app unit monitoring.
Sep 20 2024, 12:40 PM · Patch-For-Review, Observability-Alerting

Sep 19 2024

fgiunchedi removed a subtask for T349619: Migrate roles to puppet7: T351624: Probes for centrallog hosts fail to validate with "x509: issuer name does not match subject from issuing certificate".
Sep 19 2024, 10:10 AM · Data-Platform-SRE (2024.06.17 - 2024.07.07), serviceops, collaboration-services, SRE-tools, Puppet-Core, Puppet (Puppet 7.0), Infrastructure-Foundations, SRE
fgiunchedi removed a parent task for T351624: Probes for centrallog hosts fail to validate with "x509: issuer name does not match subject from issuing certificate": T349619: Migrate roles to puppet7.
Sep 19 2024, 10:10 AM · Patch-For-Review, User-fgiunchedi, Observability-Logging, SRE
fgiunchedi added a project to T351710: ossl rsyslog errors post-migration: Observability-Logging.
Sep 19 2024, 10:06 AM · SRE Observability (FY2024/2025-Q2), Observability-Logging, User-fgiunchedi, Patch-For-Review, SRE, observability
fgiunchedi removed a subtask for T324623: Switch rsyslog from gtls to ossl: T351710: ossl rsyslog errors post-migration.
Sep 19 2024, 10:06 AM · User-MoritzMuehlenhoff, Cloud-VPS, cloud-services-team, Patch-For-Review, SRE, observability, User-dcaro
fgiunchedi removed a parent task for T351710: ossl rsyslog errors post-migration: T324623: Switch rsyslog from gtls to ossl.
Sep 19 2024, 10:06 AM · SRE Observability (FY2024/2025-Q2), Observability-Logging, User-fgiunchedi, Patch-For-Review, SRE, observability
fgiunchedi created T375166: Port PDU checks to Prometheus/Alertmanager.
Sep 19 2024, 9:32 AM · Observability-Alerting
fgiunchedi updated the task description for T375066: Audit hosts in 'misc' cluster.
Sep 19 2024, 9:16 AM · Observability-Metrics, SRE
fgiunchedi updated the task description for T375066: Audit hosts in 'misc' cluster.
Sep 19 2024, 9:11 AM · Observability-Metrics, SRE
fgiunchedi added a comment to T375066: Audit hosts in 'misc' cluster.

Would it make sense to have clusters named after a team to group all machines owned by a specific subteam?

Or would that go against the purpose of clusters and they should strictly group servers by service / puppet role?

Sep 19 2024, 8:03 AM · Observability-Metrics, SRE

Sep 18 2024

fgiunchedi added a comment to T374842: Retire anycast_healthchecker Icinga check.

Reading through the comments I found:

For centrallog we'll do the in place upgrade to Bookworm as part of T353912, I'd say sooner rather than later. Not sure about cloudlb* hosts, thus cc cloud-services-team to see if they do have indeed an ETA for cloudlb on Bookworm or not.

I've created a subtask for the cloudlb bookworm upgrade, it's not in our immediate plans but shouldn't be too hard: T375082: Upgrade cloudlb hosts to bookworm.

Sep 18 2024, 2:37 PM · SRE Observability (FY2024/2025-Q2), Observability-Alerting
fgiunchedi updated the task description for T375085: mtail 3.0.0~rc50-1+b6 leaks memory on centrallog2002.
Sep 18 2024, 2:34 PM · SRE Observability (FY2024/2025-Q1)
fgiunchedi created T375085: mtail 3.0.0~rc50-1+b6 leaks memory on centrallog2002.
Sep 18 2024, 2:33 PM · SRE Observability (FY2024/2025-Q1)
fgiunchedi updated the task description for T375066: Audit hosts in 'misc' cluster.
Sep 18 2024, 1:37 PM · Observability-Metrics, SRE
fgiunchedi added a comment to T374842: Retire anycast_healthchecker Icinga check.

On perhaps a related note, while it is true that many of the things the script is doing are taken care of by the self-healing nature of anycast-hc in 0.9 and beyond, I do wonder if there is value in keeping this script around. anycast-hc is a critical part of our infrastructure especially on the DNS boxes because we announce all ns[0-2] routes from there. In a hypothetical case where it is not running, this would mean that all our nameservers are down. This leads me to wonder: in the absence of this script, what kind of alert will be generated if anycast-hc is not running, because it failed to start or was manually stopped for whatever reason?

One such alert can be the BGP one but there are many of them and the SNR is quite terrible. The other alert can be the systemd service not running but that again is shadowed by other such alerts. Does my concern make sense? In which case I am thinking perhaps we can restrict the scope of the script but still keep it around because if we get rid of this, there isn't a single dedicated script to check if anycast-hc is running, and in case it failed or didn't start, we probably won't catch it.

Sep 18 2024, 1:31 PM · SRE Observability (FY2024/2025-Q2), Observability-Alerting
fgiunchedi added a project to T375066: Audit hosts in 'misc' cluster: observability.
Sep 18 2024, 10:21 AM · Observability-Metrics, SRE
fgiunchedi updated the task description for T375066: Audit hosts in 'misc' cluster.
Sep 18 2024, 10:21 AM · Observability-Metrics, SRE
fgiunchedi updated the task description for T375066: Audit hosts in 'misc' cluster.
Sep 18 2024, 10:18 AM · Observability-Metrics, SRE
fgiunchedi updated the task description for T375066: Audit hosts in 'misc' cluster.
Sep 18 2024, 10:13 AM · Observability-Metrics, SRE
fgiunchedi created T375066: Audit hosts in 'misc' cluster.
Sep 18 2024, 9:57 AM · Observability-Metrics, SRE

Sep 17 2024

fgiunchedi added a comment to T374916: Port Categories lag / ping checks to Prometheus/Alertmanager.

We could perhaps adapt modules/query_service/files/monitor/prometheus-blazegraph-exporter.py to take care of running this query by possibly re-using the same gauge blazegraph_lastupdated but adapting the query depending on the namespace it's running.

Sep 17 2024, 3:40 PM · Data-Platform-SRE (2025.01.11 - 2025.01.31), Patch-For-Review, SRE Observability (FY2024/2025-Q2), MW-1.43-notes (1.43.0-wmf.25; 2024-10-01), Wikidata, Discovery-Search, Wikidata-Query-Service, Observability-Alerting
fgiunchedi created T374916: Port Categories lag / ping checks to Prometheus/Alertmanager.
Sep 17 2024, 9:37 AM · Data-Platform-SRE (2025.01.11 - 2025.01.31), Patch-For-Review, SRE Observability (FY2024/2025-Q2), MW-1.43-notes (1.43.0-wmf.25; 2024-10-01), Wikidata, Discovery-Search, Wikidata-Query-Service, Observability-Alerting
fgiunchedi added a comment to T374842: Retire anycast_healthchecker Icinga check.

Thank you @ssingh, there's also centrallog hosts running bullseye + 0.8 in addition to cloudlb hosts, at any rate I tried installing the bookworm anycast-healthchecker 0.9 package on bullseye on a test host and it seems to work as expected (i.e. a rebuild for bullseye doesn't seem to be needed)

Ah, that's interesting. I mean it's not surprising it worked given the code but I am not sure if we want to extend it like that. Or maybe, it's fine and we do this all the time! (I know of some cases in which we do it but I am not sure if it is frowned upon.)

Sep 17 2024, 9:28 AM · SRE Observability (FY2024/2025-Q2), Observability-Alerting
fgiunchedi added a project to T374842: Retire anycast_healthchecker Icinga check: cloud-services-team.
Sep 17 2024, 9:28 AM · SRE Observability (FY2024/2025-Q2), Observability-Alerting

Sep 16 2024

fgiunchedi added a comment to T374827: Port conntrack utilization alert to Prometheus/Alertmanager.

Doh! You are quite right, the relevant alert is MaxConntrack

Sep 16 2024, 3:24 PM · Patch-For-Review, SRE Observability (FY2024/2025-Q2), Infrastructure-Foundations, Observability-Alerting
fgiunchedi created T374860: Retire mw_wikiversion_difference check.
Sep 16 2024, 3:04 PM · serviceops, SRE Observability (FY2024/2025-Q1), Observability-Alerting
fgiunchedi added a comment to T374842: Retire anycast_healthchecker Icinga check.

Thank you @ssingh, there's also centrallog hosts running bullseye + 0.8 in addition to cloudlb hosts, at any rate I tried installing the bookworm anycast-healthchecker 0.9 package on bullseye on a test host and it seems to work as expected (i.e. a rebuild for bullseye doesn't seem to be needed)

Sep 16 2024, 2:50 PM · SRE Observability (FY2024/2025-Q2), Observability-Alerting
fgiunchedi updated subscribers of T374842: Retire anycast_healthchecker Icinga check.

@ssingh from T370068: Upgrade anycast-healthchecker to 0.9.8 (from 0.9.1-1+wmf12u1) I couldn't find any obvious blocker to have 0.9 on bullseye. Do you remember if there were any obvious blockers for that to happen? Thank you

Sep 16 2024, 2:00 PM · SRE Observability (FY2024/2025-Q2), Observability-Alerting
fgiunchedi created T374842: Retire anycast_healthchecker Icinga check.
Sep 16 2024, 1:56 PM · SRE Observability (FY2024/2025-Q2), Observability-Alerting
fgiunchedi added a comment to T374839: Port postgresql replication check to Prometheus/Alertmanager.

Also note that prometheus-postgres-exporter since 0.12.0 has gained support for replication monitoring. This is IMHO the proper solution, though it will require >= trixie unless we choose to backport the package instead: https://github.com/prometheus-community/postgres_exporter/blob/master/CHANGELOG.md#0120--2023-03-21
It is an option that can/should be considered I think since it would mean future-proofing the postgresql monitoring infrastructure.

Sep 16 2024, 1:15 PM · SRE Observability (FY2024/2025-Q2), PostgreSQL, Observability-Alerting
fgiunchedi created T374839: Port postgresql replication check to Prometheus/Alertmanager.
Sep 16 2024, 1:13 PM · SRE Observability (FY2024/2025-Q2), PostgreSQL, Observability-Alerting
fgiunchedi created T374827: Port conntrack utilization alert to Prometheus/Alertmanager.
Sep 16 2024, 10:18 AM · Patch-For-Review, SRE Observability (FY2024/2025-Q2), Infrastructure-Foundations, Observability-Alerting
fgiunchedi created T374823: Port netbox reports checks to Prometheus/Alertmanager.
Sep 16 2024, 9:37 AM · SRE Observability (FY2024/2025-Q2), Infrastructure-Foundations, netbox, Observability-Alerting
fgiunchedi created T374821: Replace or delete dumps_store_load_average.
Sep 16 2024, 9:26 AM · Data-Platform-SRE (2024.09.28 - 2024.10.18), SRE Observability (FY2024/2025-Q1), Observability-Alerting
fgiunchedi closed T359633: Strategy for Envoy metrics and Prometheus as Resolved.

Resolving since we haven't run into further problems with envoy metrics ingestion

Sep 16 2024, 8:15 AM · User-fgiunchedi, Observability-Metrics, MW-on-K8s

Sep 13 2024

fgiunchedi created T374711: keyholder-proxy doesn't restart on config change.
Sep 13 2024, 12:33 PM · User-Elukey, Puppet, Keyholder, Infrastructure-Foundations, SRE

Sep 12 2024

fgiunchedi added a comment to T374599: cloud: prometheus: investigate weirdness with metrics and alertmanager.

From my investigation so far on IRC:

Sep 12 2024, 10:01 AM · SRE Observability (FY2024/2025-Q2), User-aborrero, cloud-services-team
fgiunchedi closed T372411: Automation to find / summarize "orphaned" traces, a subtask of T366750: Basic data quality work for traces, as Resolved.
Sep 12 2024, 9:04 AM · Patch-For-Review, Observability-Tracing
fgiunchedi closed T372411: Automation to find / summarize "orphaned" traces as Resolved.

script is deployed!

Sep 12 2024, 9:04 AM · Patch-For-Review, Observability-Tracing

Sep 11 2024

fgiunchedi created T374513: Lint problems for NeutronAgentDownForLong and NeutronAgentDown.

The Cloud-Services project tag is not intended to have any tasks. Please check the list on https://phabricator.wikimedia.org/project/profile/832/ and replace it with a more specific project tag to this task. Thanks!

Sep 11 2024, 7:52 AM · cloud-services-team (FY2024/2025-Q1-Q2)

Sep 10 2024

lmata awarded T350597: Audit and prioritize metrics for conversion to statslib that are used for graphite-based alerting a Party Time token.
Sep 10 2024, 1:10 PM · SRE Observability (FY2024/2025-Q1), User-fgiunchedi, Observability-Metrics
fgiunchedi closed T326657: Add prometheus-https load balancer as Resolved.

This is done, I've set the service as non-paging since we're using it for the prometheus web interface (i.e. humans) whereas the http service is paging since that's for automated access

Sep 10 2024, 12:31 PM · Patch-For-Review, SRE Observability (FY2024/2025-Q1), Observability-Metrics
fgiunchedi closed T350597: Audit and prioritize metrics for conversion to statslib that are used for graphite-based alerting as Resolved.

This is done! The only other use of graphite_threshold is 'zuul_gearman_wait_queue' which will be addressed as part of T233089: Export zuul metrics to Prometheus. There are of course graphite-itself alerts left, which will be removed together with graphite.

Sep 10 2024, 10:53 AM · SRE Observability (FY2024/2025-Q1), User-fgiunchedi, Observability-Metrics
fgiunchedi closed T350597: Audit and prioritize metrics for conversion to statslib that are used for graphite-based alerting, a subtask of T343020: Converting MediaWiki Metrics to StatsLib, as Resolved.
Sep 10 2024, 10:51 AM · Patch-For-Review, SRE Observability (FY2024/2025-Q2), Observability-Metrics
fgiunchedi updated the task description for T350597: Audit and prioritize metrics for conversion to statslib that are used for graphite-based alerting.
Sep 10 2024, 10:50 AM · SRE Observability (FY2024/2025-Q1), User-fgiunchedi, Observability-Metrics

Sep 9 2024

jcrespo awarded T356788: thanos-query probedown due to OOM of both eqiad titan frontends a Like token.
Sep 9 2024, 3:20 PM · SRE Observability (FY2024/2025-Q1), Sustainability (Incident Followup), SRE, observability
fgiunchedi updated subscribers of T350597: Audit and prioritize metrics for conversion to statslib that are used for graphite-based alerting.

I have merged the ported mw alerts (thank you @Clement_Goubert !) and changed the referenced dashboard at https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts to show the prometheus metrics.

Sep 9 2024, 2:31 PM · SRE Observability (FY2024/2025-Q1), User-fgiunchedi, Observability-Metrics