
Cloudflare Experiences Major Incident in November, Resulting in Log Loss

Cloudflare has confirmed that on November 14th it experienced an incident affecting Cloudflare Logs, in which 55% of logs were lost over a 3.5-hour period.

The incident impacted most customers using the service. A misconfiguration triggered a cascading series of system failures and exposed weaknesses in handling unexpected spikes in demand. Jamie Herre, Tom Walwyn, Christian Endres, Gabriele Viglianisi, Mik Kocikowski, and Rian van der Merwe explain:

On a typical day, Cloudflare sends about 4.5 trillion individual event logs to customers. Although this represents less than 10% of the over 50 trillion total customer event logs processed, it presents unique challenges of scale when building a reliable and fault-tolerant system.

To deliver logs from tens of thousands of servers in over 330 cities worldwide, Cloudflare developed Logpush, a Golang service designed to collect logs into files of predictable size and push them to customers while scaling automatically with usage. The internal Buftee service provides a buffer for each Logpush job, containing 100% of the logs generated by the zone or account. Logpush reads logs from these buffers and pushes them in batches to various customer-configured destinations, with over 600 million batches processed daily.

In the article, the team highlights what went wrong on November 14th, detailing the systems involved, the failures experienced, and the actions that Cloudflare plans to take moving forward. The authors acknowledge:

We made a change to support an additional dataset for Logpush. This required adding a new configuration to be provided to Logfwdr in order for it to know which customers’ logs to forward for this new stream. (...) A bug in this system resulted in a blank configuration being provided to Logfwdr.

Although the team identified the mistake and reverted the change in under 5 minutes, this failure triggered a second, latent bug in Logfwdr, causing a massive overload that rendered Buftee unresponsive. Nermin Smajic, senior corporate cybersecurity advisor at ESET, comments:

This incident exemplifies why cybersecurity is not just about preventing external threats, but also about maintaining robust, resilient internal systems that can withstand complex technical challenges.

Recovering from the incorrect Buftee configuration took Cloudflare several hours. The authors clarify:

When Logfwdr began to send event logs for all customers, Buftee began to create buffers for each one as those logs arrived (...) This massive increase, resulting in roughly 40 times more buffers, is not something we’ve provisioned Buftee clusters to handle.

Source: Cloudflare blog

Lorin Hochstein, staff software engineer at Airbnb and author of Surfing Complexity, observes:

Cloudflare consistently generates the highest quality public incident writeups of any tech company. Their latest is no exception. (...) Automated safety mechanisms themselves add complexity, and we are no better at implementing bug-free safety code than we are at implementing bug-free feature code.

While Cloudflare's operational team promises to implement more alerts to ensure these specific misconfigurations are impossible to miss, they acknowledge that mistakes and misconfigurations are inevitable. They emphasize that the goal of all Cloudflare systems should be to respond to such issues predictably and gracefully.

 
