RCA: 2020-02-11 High db insert rate caused by enabling instance-level service templates on projects
Incident: production#1651 (closed)
Summary
Due to a bug, instance-level service templates were enabled on every project creation, causing thousands of services to be created for each new project (with URL validation performed for each service, adding delays). This caused a high db insert rate and increased latencies for web.
Additionally, because the same template was copied to many projects, Slack notifications from other projects were sent to one single project, which is a security issue: https://gitlab.com/gitlab-com/gl-security/secops/operations/issues/650
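To make the failure mode concrete, here is a minimal, hypothetical sketch (model, attribute, and method names are illustrative, not the actual GitLab code) of the amplification: every service flagged as an instance-level template is copied into each new project, and each copy performs URL validation inside the surrounding database transaction.

```ruby
# Hypothetical sketch of the amplification, not the actual GitLab code.
def copy_instance_templates(project)
  Service.transaction do
    # Because of the trigger described in the root cause analysis below,
    # every copy ended up flagged as an instance-level template again, so
    # this set kept growing with each project creation.
    Service.where(instance: true).find_each do |template|
      service = template.dup
      service.project = project
      service.instance = false # the rename trigger set this back to true

      # URL validation (see lib/gitlab/url_blocker.rb:111) resolves the
      # service URL on save, so each insert into the `services` table waits
      # on the network while the transaction stays open.
      service.save!
    end
  end
end
```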
- Service(s) affected : ServiceWeb ServicePostgres
- Team attribution : ~"group::ecosystem"
- Minutes downtime or degradation : 153m (06:31 - 09:04)
To calculate the duration of the event, use the Platform Metrics Dashboard to look at Apdex and SLO violations.
Impact & Metrics
Start with the following:
- What was the impact of the incident? (i.e. service outage, sub-service brown-out, exposure of sensitive data, ...)
- Who was impacted by this incident? (i.e. external customers, internal customers, specific teams, ...)
- How did the incident impact customers? (i.e. preventing them from doing X, incorrect display of Y, ...)
- How many attempts were made to access the impacted service/feature?
- How many customers were affected?
- How many customers tried to access the impacted service/feature?
Include any additional metrics that are of relevance.
Provide any relevant graphs that could help understand the impact of the incident and its dynamics.
DB inserts into the `services` table: https://dashboards.gitlab.net/d/RZmbBr7mk/gitlab-triage?orgId=1&fullscreen&panelId=8&from=1581399974013&to=1581413232942
Web saturation: https://dashboards.gitlab.net/d/general-service/general-service-platform-metrics?orgId=1&fullscreen&panelId=10&from=1581400421276&to=1581413397765
Detection & Response
Start with the following:
- How was the incident detected?
- The EOC was paged because of db replication lag caused by the high insert rate; marquee customer alerts also triggered
- Did alarming work as expected?
- We did not get SLO violation alerts; possibly the impact wasn't high enough to trigger them - this needs investigation
- How long did it take from the start of the incident to its detection?
- 45m until db replication alert fired (06:31 - 07:16)
- How long did it take from detection to remediation?
- 108m (07:16 - 09:04)
- Were there any issues with the response to the incident? (i.e. bastion host used to access the service was not available, relevant team member wasn't page-able, ...)
Root Cause Analysis
We experienced high web latencies.
- Why? Because we suddenly started to create many services for each project creation, doing URL lookups for each service in `lib/gitlab/url_blocker.rb:111`, which took a lot of time (during a db transaction inserting into the `services` table).
- Why? Because the `instance` attribute was set to `true` on the services created for each new project.
- Why? Because a db migration with a column rename created a trigger to set `instance = template`, which re-enabled the previously disabled attribute (a sketch of such a trigger follows this list).
- Why? Because we had two versions of the codebase running at the same time: one would set `template` to `false` when creating a new service, and the other would set `instance` to `false` but leave `template` as `true` (and therefore repopulate `instance` as `true` based on the trigger).
- Why? Because the service code is old and does some things we probably would not accept now, such as not having a unique constraint on `project_id` and service type. We have validations at the application level, not at the database level.
What went well
Start with the following:
- Identify the things that worked well or as expected.
- Any additional call-outs for what went particularly well.
What can be improved
Start with the following:
- Using the root cause analysis, explain what can be improved to prevent this from happening again.
- Is there anything that could have been done to improve the detection or time to detection?
- Is there anything that could have been done to improve the response or time to response?
- Is there an existing issue that would have either prevented this incident or reduced the impact?
- Did we have any indication or beforehand knowledge that this incident might take place?
Corrective actions
- The immediate corrective action was to revert the change.
- We will add database constraints as suggested in https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/9176#note_286149490 (see the sketch below).
- Add a suggestion to the column-renaming docs about splitting migration MRs.
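As a sketch of what the database constraint from the second corrective action could look like (migration and index names are assumptions; the actual change should follow the discussion in the linked issue), a unique index on `project_id` and `type` would enforce at the database level what is currently only validated at the application level.

```ruby
# Hypothetical sketch of a database-level uniqueness constraint on services;
# names are illustrative, not the actual GitLab migration.
class SketchAddUniqueIndexOnServicesProjectIdAndType < ActiveRecord::Migration[6.0]
  disable_ddl_transaction!

  def up
    add_index :services, [:project_id, :type],
              unique: true,
              algorithm: :concurrently,
              name: 'index_services_on_project_id_and_type_unique'
  end

  def down
    remove_index :services, name: 'index_services_on_project_id_and_type_unique'
  end
end
```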