User Details
- User Since: Aug 10 2021, 2:29 PM (170 w, 2 d)
- Availability: Available
- IRC Nick: arnoldokoth
- LDAP User: AOkoth
- MediaWiki User: AOkoth (WMF) [ Global Accounts ]
Yesterday
We settled on what a deployment for this could look like: an initContainer that shares an emptyDir volume with the serving container will fetch the data and place it in the volume, and an rsync sidecar sharing the same volume will regularly update the contents. We also agreed that the k8s-aux cluster is a better fit for this.
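For illustration, a minimal sketch of that pattern; the names, images and rsync source below are placeholders, not the actual deployment-charts values:

# Hypothetical sketch of the initContainer + emptyDir + rsync-sidecar pattern above.
# All names, images and the rsync source are placeholders.
kubectl apply --dry-run=client -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app
spec:
  replicas: 1
  selector:
    matchLabels: {app: example-app}
  template:
    metadata:
      labels: {app: example-app}
    spec:
      volumes:
        - name: data
          emptyDir: {}
      initContainers:
        - name: fetch-data                  # seeds the volume before the app starts
          image: example/fetcher:latest
          command: ["sh", "-c", "rsync -a rsync://source.example/data/ /data/"]
          volumeMounts:
            - {name: data, mountPath: /data}
      containers:
        - name: app                         # serving container reads the shared volume
          image: example/app:latest
          volumeMounts:
            - {name: data, mountPath: /data, readOnly: true}
        - name: rsync-refresh               # sidecar keeping the volume up to date
          image: example/fetcher:latest
          command: ["sh", "-c", "while true; do rsync -a rsync://source.example/data/ /data/; sleep 3600; done"]
          volumeMounts:
            - {name: data, mountPath: /data}
EOF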
Tue, Nov 5
Ack @MoritzMuehlenhoff. I've got some new ideas to test out that don't rely on a persistent volume. I'll get back to you on the auto-deploy bit, as that's not something that currently happens.
Mon, Nov 4
@AntiCompositeNumber We compared the load time with a test instance we have, which holds a relatively small amount of data, and there the load time is roughly 2 seconds. So I'd be inclined to believe it might have something to do with the number of queues one has access to, as you've mentioned.
Thanks @MoritzMuehlenhoff In that case, I think that rules out the option of running the script in CI (or elsewhere really).
Sep 26 2024
Decom'd vrts1001 and vrts2001.
Sep 23 2024
Pretty much, @Dzahn. It works fine now.
Sep 18 2024
{P69260}
Sep 2 2024
@Dzahn I think those are stored in the database; going by the vrts_aliases script, they are the ones queried on L64.
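For context, the addresses can also be listed straight from the database, e.g. (assuming the stock OTRS/Znuny schema; this is only an illustration, not necessarily the exact query on L64 of the script):

# Hypothetical illustration against the stock OTRS/Znuny system_address table;
# not necessarily the query the vrts_aliases script actually runs.
mysql otrs -e "SELECT value0 AS email, value1 AS display_name FROM system_address WHERE valid_id = 1;"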
Aug 29 2024
Database is reachable from new hosts:
Aug 27 2024
Managed to collect a few metrics. The ticket count covers the queues specified in the report, filtered to tickets from the beginning of this year.
sql_vrts_queue_validity{col="count",database="otrs",driver="mysql",host="tcp()",name="invalid",sql_job="vrts_sql_metrics",user="otrs"} 201
sql_vrts_queue_validity{col="count",database="otrs",driver="mysql",host="tcp()",name="valid",sql_job="vrts_sql_metrics",user="otrs"} 200
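For reference, a hedged guess at the kind of query behind that queue-validity metric (illustrative only; the real prometheus-sql-exporter job may differ):

# Illustrative only; the actual sql_job definition may differ.
mysql otrs -e "SELECT IF(valid_id = 1, 'valid', 'invalid') AS name, COUNT(*) AS count FROM queue GROUP BY name;"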
Aug 21 2024
Works as expected now.
aokoth@vrts1001:~$ curl -s http://localhost:9237/metrics | grep -v "^#"
<redacted>
sql_invalid_queues{col="count",database="otrs",driver="mysql",host="tcp(<redacted>)",sql_job="vrts_sql_metrics",user="otrs"} 201
sql_valid_queues{col="count",database="otrs",driver="mysql",host="tcp(<redacted>)",sql_job="vrts_sql_metrics",user="otrs"} 200
Aug 15 2024
Created a follow-up for the search issue here (T372586).
This is done for the production host. I can't say with certainty that it fixed the search issue.
Deployed this on vrts1001 using these queries as a test. However, both report 0, which is clearly not the case, as running the queries against the database returns different results:
Aug 12 2024
Sure. Scheduled this for Thursday this week (should be enough time to test before then).
Jul 25 2024
@Jclark-ctr I've updated Puppet. The desired RAID level is 10.
Jul 23 2024
Upgrade complete.
aokoth@vrts1001:~$ ls -l /opt/
total 44
lrwxrwxrwx 1 root root 16 Jul 23 18:10 otrs -> /opt/znuny-6.5.9
Jul 22 2024
Tested the cookbook on the replica. It didn't work all the way through (mostly because of how the replica is set up), but the issue with the proxy is now resolved. Scheduled the update for 23rd July @ 9PM UTC.
Jul 18 2024
{P66830}
Jul 17 2024
@Jelto @fgiunchedi Yeah, the last ones I got were also addressed to our team. :)
Jul 16 2024
I think I found the setting that affects this. It's configured on the VRTS dashboard and was initially set to root@localhost. Updated it to our team's mailing list. Let's see where the next one goes.
Restart didn't work. Just got another email. 🤔
Though I noticed that none of the services on the host (i.e. vrts-daemon and exim4) were restarted after the alias change. I've manually restarted them, so let's see where the next one goes.
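For the record, the manual restart was along these lines (unit names taken from the comment above; adjust if the actual units differ):

# Unit names assumed from the services mentioned above.
sudo systemctl restart vrts-daemon.service exim4.service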
@jhathaway Any idea how we can re-route these to a different email?
@Marostegui Just seen this too. :(
Jul 15 2024
Thinking we can merge this into (T362933) and all subsequent ones at least until we get the new hardware? cc: @LSobanski
Looks good now. Emails to root on the vrts host will now be sent to sre-collab.
Oooh. Sure. So the desired outcome is complete silence? Let me look into it.
Just confirmed that this is indeed a bug, but one with no negative effect (other than the noise, of course) that can safely be ignored. In a nutshell, there are just two isolated processes that are not aware of what the other is up to.
Reading up on this behaviour, it keeps coming back to https://manpages.debian.org/unstable/clamav-daemon/clamd.conf.5.en.html#ConcurrentDatabaseReload, which we've already disabled.
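For reference, a quick way to double-check the setting on the host (path assumes the Debian clamav-daemon package layout):

# Path assumes the Debian clamav-daemon package; output should show the option disabled.
grep -i 'ConcurrentDatabaseReload' /etc/clamav/clamd.conf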
Usual suspects. 🙃
aokoth@vrts1001:~$ less /var/log/exim4/paniclog.1
2024-07-12 00:50:46 1sS4Uo-00GGJz-Fq malware acl condition: clamd /var/run/clamav/clamd.ctl : unable to connect to UNIX socket (/var/run/clamav/clamd.ctl): Connection refused
2024-07-12 00:50:46 1sS4Uo-00GGK3-Pd malware acl condition: clamd /var/run/clamav/clamd.ctl : unable to connect to UNIX socket (/var/run/clamav/clamd.ctl): Connection refused
Jul 12 2024
Thanks @Dzahn I have filed a ticket with Znuny to get a better understanding of what might be causing this. I'll update once I get a response.
Jul 11 2024
I ran into some unexpected "bugs" in the cookbook so I'll re-schedule this.
Jul 10 2024
Upgrade scheduled for 11th July, 2024 @ 1800 UTC.
This is resolved now.
Jul 7 2024
@Marostegui Thank you for opening this. This might be a bug, but I'll confirm with Znuny.
Jul 5 2024
@LSobanski I also just tried to reproduce the error but was unsuccessful. https://ticket.wikimedia.org/otrs/index.pl?Action=AgentTicketZoom;TicketID=13244279;ArticleID=16159676
Jun 27 2024
@Dzahn Yeah, I have not tracked that down yet. From the forums it looks like it appears in different places and not just when installing a package.
Jun 26 2024
Permission errors seem to be resolved by running the script with the otrs user.
aokoth@vrts1001:/tmp$ sudo -u otrs /opt/otrs/bin/otrs.Console.pl Admin::Package::Install /tmp/WikimediaTemplates-1.0.18.opm
Installing package...
+----------------------------------------------------------------------------+
| WikimediaTemplates-1.0.18 | Pre-Install Information
Jun 21 2024
I removed the package manually from the dashboard, then followed the steps above to re-install it. It looks like it got installed, but not cleanly:
aokoth@vrts1001:/tmp$ sudo -u www-data /opt/otrs/bin/otrs.Console.pl Admin::Package::Install /tmp/WikimediaTemplates-1.0.18.opm
Installing package...
+----------------------------------------------------------------------------+
| WikimediaTemplates-1.0.18 | Pre-Install Information
Jun 20 2024
{P65250}
May 30 2024
@Volans Understood.
May 28 2024
Run with comma added manually. Thanks @Dzahn
May 28 19:44:49 mx1001 systemd[1]: Starting Generate VRTS aliases file for Exim...
May 28 19:45:31 mx1001 systemd[1]: generate_vrts_aliases.service: Succeeded.
May 28 19:45:31 mx1001 systemd[1]: Finished Generate VRTS aliases file for Exim.
May 28 19:45:31 mx1001 systemd[1]: generate_vrts_aliases.service: Consumed 1.301s CPU time.
@Volans Not even a little? :)
Last run:
May 28 18:18:41 mx1001 systemd[1]: Starting Generate VRTS aliases file for Exim...
May 28 18:18:43 mx1001 vrts_aliases[1605430]: ERROR:/usr/local/bin/vrts_aliases:email is handled by postfix alias: [email protected]
May 28 18:18:57 mx1001 vrts_aliases[1605430]: ERROR:/usr/local/bin/vrts_aliases:email is handled by gsuite: [email protected]
May 28 18:19:24 mx1001 systemd[1]: generate_vrts_aliases.service: Succeeded.
May 28 18:19:24 mx1001 systemd[1]: Finished Generate VRTS aliases file for Exim.
May 28 18:19:24 mx1001 systemd[1]: generate_vrts_aliases.service: Consumed 1.094s CPU time.
May 20 2024
One of the options discussed that could help fix this is reducing the email attachment size, which could be the cause of the sudden spikes in memory use by clamav. From reading the forums, it seems this limit might be influenced by a database setting (max_allowed_packet) or the WebMaxFileUpload configuration parameter, though I have yet to confirm this officially.
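In the meantime, a couple of hedged checks for those two knobs (WebMaxFileUpload is a SysConfig setting, so it's easiest to inspect via the admin UI):

# Current packet limit on the database side; whatever value the server reports.
mysql otrs -e "SHOW VARIABLES LIKE 'max_allowed_packet';"
# WebMaxFileUpload lives in SysConfig (Admin > System Configuration) rather than a config file.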
Apr 30 2024
Never mind. Got some insight from @eoghan
Hi, I was going through the steps listed at https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Rename_while_reimaging to re-image lists1004 and ran into this issue when I got to the cookbook sre.hosts.provision --no-dhcp --no-users lists1004 step. It gets stuck on this prompt:
==> Unable to auto-detect NIC with link. Pick the one to set PXE on: ['NIC.Embedded.1-1-1', 'NIC.Embedded.2-1-1']
Not sure how to resolve. Kindly assist.