<h1>Incident Report</h1>
<p><strong>RESOLVED:</strong> Customers experienced connectivity issues with multiple Google Cloud services in zone us-east5-c.</p>
<p>Incident began at <strong>2025-03-29 12:53</strong> and ended at <strong>2025-03-29 19:15</strong> <span>(all times are <strong>US/Pacific</strong>).</span></p>
<h2>Summary:</h2>
<p>On Saturday, 29 March 2025, multiple Google Cloud Services in the us-east5-c zone experienced degraded service or unavailability for a duration of 6 hours and 10 minutes. To our Google Cloud customers whose services were impacted during this disruption, we sincerely apologize. This is not the level of quality and reliability we strive to offer you, and we are taking immediate steps to improve the platform’s performance and availability.</p>
<h2>Root Cause:</h2>
<p>The root cause of the service disruption was a loss of utility power in the affected zone. This power outage triggered a cascading failure within the uninterruptible power supply (UPS) system responsible for maintaining power to the zone during such events. The UPS system, which relies on batteries to bridge the gap between utility power loss and generator power activation, experienced a critical battery failure.</p>
<p>This failure rendered the UPS unable to perform its core function of ensuring continuous power to the system. As a direct consequence of the UPS failure, virtual machine instances within the affected zone lost power and went offline, resulting in service downtime for customers. The power outage and subsequent UPS failure also triggered a series of secondary issues, including packet loss within the us-east5-c zone, which impacted network communication and performance. Additionally, a limited number of storage disks within the zone became unavailable during the outage.</p>
<h2>Remediation and Prevention:</h2>
<p>Google engineers were alerted to the incident by internal monitoring at 12:54 US/Pacific on Saturday, 29 March and immediately started an investigation.</p>
<p>Google engineers diverted traffic away from the impacted location to partially mitigate impact for some services that did not have zonal resource dependencies. Engineers bypassed the failed UPS and restored power via generator by 14:49 US/Pacific on Saturday, 29 March. The majority of Google Cloud services recovered shortly thereafter. A few services experienced longer restoration times as manual actions were required in some cases to complete full recovery.</p>
<p>Google is committed to preventing a repeat of this issue in the future and is completing the following actions:</p>
<ul>
<li>Harden the cluster power failure and recovery path to achieve a predictable and faster time-to-serving after power is restored.</li>
<li>Audit systems that did not automatically fail over and close any gaps that prevented automatic failover.</li>
<li>Work with our uninterruptible power supply (UPS) vendor to understand and remediate issues in the battery backup system.</li>
</ul>
<p>Google is committed to quickly and continually improving our technology and operations to prevent service disruptions. We appreciate your patience and apologize again for the impact to your organization. We thank you for your business.</p>
<h2>Detailed Description of Impact:</h2>
<p>Customers experienced degraded service or unavailability for multiple Google Cloud products in the us-east5-c zone, with varying impact and severity as noted below:</p>
<p><strong>AlloyDB for PostgreSQL:</strong> A few clusters experienced transient unavailability during the failover. Two impacted clusters did not fail over automatically and required manual intervention from Google engineers to complete the failover.</p>
<p><strong>BigQuery:</strong> A few customers in the impacted region experienced brief unavailability of the product from 12:57 to 13:19 US/Pacific.</p>
<p><strong>Cloud Bigtable:</strong> The outage resulted in increased errors and latency for a few customers from 12:47 to 19:37 US/Pacific.</p>
<p><strong>Cloud Composer:</strong> External streaming jobs for a few customers experienced increased latency for a period of 16 minutes.</p>
<p><strong>Cloud Dataflow:</strong> Streaming and batch jobs saw brief periods of performance degradation. 17% of streaming jobs experienced degradation from 12:52 US/Pacific to 13:08 US/Pacific, while 14% of batch jobs experienced degradation from 15:42 US/Pacific to 16:00 US/Pacific.</p>
<p><strong>Cloud Filestore:</strong> All basic, high scale, and zonal instances in us-east5-c were unavailable, and all enterprise and regional instances in us-east5 operated in degraded mode, from 12:54 to 18:47 US/Pacific on Saturday, 29 March 2025.</p>
<p><strong>Cloud Firestore:</strong> Limited impact of approximately 2 minutes, during which customers experienced elevated unavailability and latency while jobs were rerouted automatically.</p>
<p><strong>Cloud Identity and Access Management:</strong> A few customers experienced slightly elevated latency or errors while retrying requests for a short period of time.</p>
<p><strong>Cloud Interconnect:</strong> All us-east5 attachments connected to zone1 were unavailable for 2 hours and 7 minutes.</p>
<p><strong>Cloud Key Management Service:</strong> Customers experienced 5XX errors for a brief period of time (less than 4 minutes). Google engineers rerouted the traffic to healthy cells shortly after the power loss to mitigate the impact.</p>
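<p><em>Illustrative example (not part of the incident response):</em> one client-side way to ride out a brief window of 5XX errors like this is to retry only transient server-side failures with exponential backoff. The sketch below assumes the google-cloud-kms Python client; the project, key ring, and key names are hypothetical placeholders.</p>
<pre>
# Illustrative sketch only: retry transient 5XX errors from Cloud KMS with
# exponential backoff, assuming the google-cloud-kms Python client.
from google.api_core import exceptions, retry
from google.cloud import kms

client = kms.KeyManagementServiceClient()

# Retry only transient server-side errors (HTTP 500/503).
transient_retry = retry.Retry(
    predicate=retry.if_exception_type(
        exceptions.InternalServerError,
        exceptions.ServiceUnavailable,
    ),
    initial=1.0,    # first delay, in seconds
    maximum=10.0,   # cap on the delay between attempts
    multiplier=2.0, # exponential backoff factor
)

# Hypothetical key name for illustration only.
key_name = (
    "projects/example-project/locations/us-east5/"
    "keyRings/example-ring/cryptoKeys/example-key"
)

response = client.encrypt(
    request={"name": key_name, "plaintext": b"example payload"},
    retry=transient_retry,
)
print(len(response.ciphertext), "ciphertext bytes")
</pre>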
<p><strong>Google Kubernetes Engine:</strong> Customers experienced terminations of their nodes in us-east5-c. Some zonal clusters in us-east5-c experienced loss of connectivity to their control plane. No impact was observed for nodes or control planes outside of us-east5-c.</p>
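<p><em>Illustrative example:</em> the sketch below shows one way a customer could have confirmed which nodes were affected, by listing nodes carrying the standard topology.kubernetes.io/zone label for us-east5-c and checking their Ready condition. It assumes the official Kubernetes Python client and an existing kubeconfig; it is not a Google-provided procedure.</p>
<pre>
# Illustrative sketch only: list nodes in a specific zone and report whether
# they are Ready, using the official Kubernetes Python client.
from kubernetes import client, config

TARGET_ZONE = "us-east5-c"  # zone affected in this incident

config.load_kube_config()  # or config.load_incluster_config() inside a pod
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    zone = (node.metadata.labels or {}).get("topology.kubernetes.io/zone", "unknown")
    if zone != TARGET_ZONE:
        continue
    ready = "Unknown"
    for cond in node.status.conditions or []:
        if cond.type == "Ready":
            ready = cond.status  # "True", "False", or "Unknown"
    print(f"{node.metadata.name}\tzone={zone}\tReady={ready}")
</pre>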
<p><strong>Cloud NAT:</strong> Customers experienced a transient control plane outage affecting new VM creation and/or dynamic port allocation.</p>
<p><strong>Cloud Router:</strong> Cloud Router was unavailable for up to 30 seconds while leadership shifted to other clusters. This downtime was within the thresholds of most customers' graceful restart configurations (60 seconds).</p>
<p><strong>Cloud SQL:</strong> Based on monitoring data, 318 zonal instances experienced 3 hours of downtime in the us-east5-c zone. All external high-availability instances successfully failed out of the impacted zone.</p>
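<p><em>Illustrative example:</em> whether a Cloud SQL instance rides out a zonal outage largely depends on its availability type (ZONAL vs. REGIONAL). The hedged sketch below reads that setting through the Cloud SQL Admin API using google-api-python-client; the project and instance names are hypothetical placeholders.</p>
<pre>
# Illustrative sketch only: check a Cloud SQL instance's availability type
# (ZONAL or REGIONAL) and current zone via the Cloud SQL Admin API.
from googleapiclient import discovery

sqladmin = discovery.build("sqladmin", "v1")  # uses Application Default Credentials
instance = (
    sqladmin.instances()
    .get(project="example-project", instance="example-instance")
    .execute()
)
print("availabilityType:", instance["settings"]["availabilityType"])
print("gceZone:", instance.get("gceZone", "n/a"))
</pre>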
<p><strong>Cloud Spanner:</strong> Customers in the us-east5 region may have seen errors or increased latency for a few minutes after 12:52 US/Pacific, when the cluster first failed.</p>
<p><strong>Cloud VPN:</strong> A few legacy customers experienced loss of session connectivity for up to 5 minutes.</p>
<p><strong>Compute Engine:</strong> Customers experienced instance unavailability and inability to manage instances in us-east5-c from 12:54 to 18:30 US/Pacific on Saturday, 29 March 2025.</p>
<p><strong>Managed Service for Apache Kafka:</strong> CreateCluster and some UpdateCluster requests (those that increased capacity configuration) had a 100% error rate in the region, surfacing as INTERNAL errors or timeouts. Based on our monitoring, the impact was limited to one customer who attempted to use these methods during the incident.</p>
<p><strong>Memorystore for Redis:</strong> High availability instances failed over to healthy zones during the incident. 12 instances required manual intervention to bring back provisioned capacity. All instances were recovered by 19:28 US/Pacific.</p>
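<p><em>Illustrative example:</em> a client configured to retry transient connection errors generally sees a short failover window as a few retried commands rather than immediate application errors. The sketch below assumes the redis-py client; the endpoint address is a hypothetical placeholder for a Memorystore instance's private IP.</p>
<pre>
# Illustrative sketch only: configure redis-py to retry transient connection
# errors with exponential backoff so a brief failover surfaces as retries.
import redis
from redis.backoff import ExponentialBackoff
from redis.retry import Retry

r = redis.Redis(
    host="10.0.0.3",        # hypothetical Memorystore endpoint
    port=6379,
    socket_timeout=2.0,
    retry=Retry(ExponentialBackoff(cap=5.0, base=0.1), retries=5),
    retry_on_error=[redis.exceptions.ConnectionError,
                    redis.exceptions.TimeoutError],
)

r.set("healthcheck", "ok")
print(r.get("healthcheck"))
</pre>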
<p><strong>Persistent Disk:</strong> Customers experienced very high I/O latency, including stalled I/O operations or errors in some disks in us-east5-c from 12:54 US/Pacific to 20:45 US/Pacific on Saturday, 29 March 2025. Other products using PD or communicating with impacted PD devices experienced service issues with varied symptoms.</p>
<p><strong>Secret Manager:</strong> Customers experienced 5XX errors for a brief period of time (less than 4 minutes). Google engineers rerouted the traffic to healthy cells shortly after the power loss to mitigate the impact.</p>
<p><strong>Virtual Private Cloud:</strong> Virtual machine instances running in the us-east5-c zone were unable to reach the network, and services were partially unavailable from the impacted zone. Where applicable, customers were able to fail over workloads to other Cloud zones.</p>
<hr>
<p>Affected products: AlloyDB for PostgreSQL, Cloud Firestore, Google BigQuery, Google Cloud Bigtable, Google Cloud Composer, Google Cloud Dataflow, Google Cloud Dataproc, Google Cloud SQL, Google Compute Engine, Google Kubernetes Engine, Hybrid Connectivity, Identity and Access Management, Persistent Disk, Virtual Private Cloud (VPC)</p>
<p>Affected locations: Columbus (us-east5)</p>