RCA: 2020-02-11 High db insert rate caused by enabling instance-level service templates on projects
Incident: production#1651 (closed)
Summary
Due to a bug, instance-level service templates were enabled on every project creation, causing thousands of services to be created for each new project (with URL validation performed for each service, adding delays). This caused a high db insert rate and increased latencies for web.
Additionally, because the same template was copied to many projects, Slack notifications from other projects were sent to one single project, which is a security issue: https://gitlab.com/gitlab-com/gl-security/secops/operations/issues/650
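To make the failure mode concrete, here is a minimal, hypothetical sketch (model, attribute, and method names are illustrative, not the actual GitLab code) of the amplification: every service flagged as an instance-level template is copied into each new project, and each copy performs URL validation inside the surrounding database transaction.

```ruby
# Hypothetical sketch of the amplification, not the actual GitLab code.
def copy_instance_templates(project)
  Service.transaction do
    # Because of the trigger described in the root cause analysis below,
    # every copy ended up flagged as an instance-level template again, so
    # this set kept growing with each project creation.
    Service.where(instance: true).find_each do |template|
      service = template.dup
      service.project = project
      service.instance = false # the rename trigger set this back to true

      # URL validation (see lib/gitlab/url_blocker.rb:111) resolves the
      # service URL on save, so each insert into the `services` table waits
      # on the network while the transaction stays open.
      service.save!
    end
  end
end
```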
- Service(s) affected : ServiceWeb ServicePostgres
- Team attribution : ~"group::ecosystem"
- Minutes downtime or degradation : 153m (06:31 - 09:04)
To calculate the duration of the event, use the Platform Metrics Dashboard to look at Apdex and SLO violations.
Impact & Metrics
Start with the following:
- What was the impact of the incident? (i.e. service outage, sub-service brown-out, exposure of sensitive data, ...)
- Who was impacted by this incident? (i.e. external customers, internal customers, specific teams, ...)
- How did the incident impact customers? (i.e. preventing them from doing X, incorrect display of Y, ...)
- How many attempts were made to access the impacted service/feature?
- How many customers were affected?
- How many customers tried to access the impacted service/feature?
Include any additional metrics that are of relevance.
Provide any relevant graphs that could help understand the impact of the incident and its dynamics.
DB inserts into the `services` table: https://dashboards.gitlab.net/d/RZmbBr7mk/gitlab-triage?orgId=1&fullscreen&panelId=8&from=1581399974013&to=1581413232942
Web saturation: https://dashboards.gitlab.net/d/general-service/general-service-platform-metrics?orgId=1&fullscreen&panelId=10&from=1581400421276&to=1581413397765
Detection & Response
Start with the following:
- How was the incident detected?
- The EOC was paged because of db replication lag caused by the high insert rate; marquee customer alerts also triggered
- Did alarming work as expected?
- We did not get SLO violation alerts; possibly the impact wasn't high enough to trigger them - this needs investigation
- How long did it take from the start of the incident to its detection?
- 45m until db replication alert fired (06:31 - 07:16)
- How long did it take from detection to remediation?
- 108m (07:16 - 09:04)
- Were there any issues with the response to the incident? (i.e. bastion host used to access the service was not available, relevant team member wasn't page-able, ...)
Root Cause Analysis
We experienced high web latencies.
- Why? Because we suddenly started to create many services for each project creation, doing URL lookups for each service in `lib/gitlab/url_blocker.rb:111`, which took a lot of time (during a db transaction inserting into the `services` table).
- Why? Because the `instance` attribute was set to `true` on the services created for each new project.
- Why? Because a db migration with a column rename created a trigger to set `instance = template`, which re-enabled the previously disabled attribute (a sketch of such a trigger follows this list).
- Why? Because we had two versions of the codebase running at the same time: one would set `template` to `false` when creating a new service, and the other would set `instance` to `false` but leave `template` as `true` (and therefore repopulate `instance` as `true` based on the trigger).
- Why? Because the service code is old and does some things we probably would not accept now, such as not having a unique constraint on `project_id` and service type. We have validations at the application level, not at the database level.
What went well
Start with the following:
- Identify the things that worked well or as expected.
- Any additional call-outs for what went particularly well.
What can be improved
Start with the following:
- Using the root cause analysis, explain what can be improved to prevent this from happening again.
- Is there anything that could have been done to improve the detection or time to detection?
- Is there anything that could have been done to improve the response or time to response?
- Is there an existing issue that would have either prevented this incident or reduced the impact?
- Did we have any indication or beforehand knowledge that this incident might take place?
Corrective actions
- The immediate corrective action was to revert the change.
- We will add database constraints as suggested in https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/9176#note_286149490 (see the sketch below).
- Add a suggestion to the column-renaming docs about splitting migration MRs.
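As a sketch of what the database constraint from the second corrective action could look like (migration and index names are assumptions; the actual change should follow the discussion in the linked issue), a unique index on `project_id` and `type` would enforce at the database level what is currently only validated at the application level.

```ruby
# Hypothetical sketch of a database-level uniqueness constraint on services;
# names are illustrative, not the actual GitLab migration.
class SketchAddUniqueIndexOnServicesProjectIdAndType < ActiveRecord::Migration[6.0]
  disable_ddl_transaction!

  def up
    add_index :services, [:project_id, :type],
              unique: true,
              algorithm: :concurrently,
              name: 'index_services_on_project_id_and_type_unique'
  end

  def down
    remove_index :services, name: 'index_services_on_project_id_and_type_unique'
  end
end
```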