
‘We Let the Internet Down Today:’ Cloudflare Admits Massive Outage Was Caused by Internal Error

  • Writer: WGON
  • 1 hour ago

Cloudflare brought major portions of the internet to its knees on Tuesday morning, causing widespread service disruptions for Elon Musk’s X and a wide range of websites, apps, and even video games. The company now admits the failure was entirely its own fault, the result of a programming mistake.


Breitbart News reported yesterday that America woke up to an internet full of error messages and websites that didn’t work. The widespread outages were quickly traced to Cloudflare, which many companies use to provide a “shield” between their servers and the internet at large.


In a recently published blog post, Cloudflare explained that the service outages were caused by internal programming problems, stating categorically that “The issue was not caused, directly or indirectly, by a cyber attack or malicious activity of any kind.”


According to the company, the outage was triggered by an internal change made to database access permissions used by its Bot Management system. This change inadvertently caused the database to generate a “feature configuration” file, used by its machine learning models, that was double the expected size. When this oversized configuration file propagated across Cloudflare’s global network, it exceeded a hardcoded size limit in the software, causing the bot management module to fail entirely. This cascaded into widespread failures of Cloudflare’s core traffic proxy, which is responsible for routing all customer traffic.
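The failure mode described here, a hard limit tripped by an oversized file, can be illustrated with a minimal Python sketch. This is not Cloudflare’s code: the limit of 200 entries, the one-feature-per-line file format, and every name below are assumptions made purely for illustration. The sketch contrasts a strict reload, where the error escapes and takes the module down, with a reload that keeps the last known good configuration.

```python
MAX_FEATURES = 200  # hypothetical hard limit; the article does not give the real number


class FeatureLimitExceeded(Exception):
    """Raised when a feature file has more entries than the limit allows."""


def parse_features(lines, limit=MAX_FEATURES):
    """Parse one feature name per line (assumed format), rejecting oversized files."""
    features = [line.strip() for line in lines if line.strip()]
    if len(features) > limit:
        raise FeatureLimitExceeded(
            f"{len(features)} features exceeds limit of {limit}"
        )
    return features


class BotModule:
    """Toy stand-in for a module that periodically reloads its feature file."""

    def __init__(self, initial_lines):
        self.features = parse_features(initial_lines)

    def reload_strict(self, lines):
        # Any parsing error propagates to the caller, so one bad file
        # brings down whatever depends on this module.
        self.features = parse_features(lines)

    def reload_with_fallback(self, lines):
        # On a bad file, keep serving with the last known good features.
        try:
            self.features = parse_features(lines)
            return True
        except FeatureLimitExceeded:
            return False


good_file = [f"feature_{i}" for i in range(120)]
oversized_file = good_file + good_file  # double the expected size

module = BotModule(good_file)
accepted = module.reload_with_fallback(oversized_file)
print(accepted, len(module.features))
```

In the fallback path the oversized file is rejected and the module keeps running on its previous configuration, which is roughly the effect engineers later achieved by hand when they deployed a known good version of the file.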


Compounding the problem, the failures manifested inconsistently because the database was only partially updated with the permission change. This resulted in the oversized file being generated intermittently every five minutes as database queries executed on updated and non-updated parts of the database cluster. The failures initially led Cloudflare engineers to suspect a distributed denial-of-service (DDoS) attack performed by bad actors, a theory that was completely disproven as they dug deeper.
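Why a partially updated cluster produces on-again, off-again failures can be shown with a small simulation. It is only a sketch under stated assumptions: that each generation cycle happens to run on one node, that updated nodes produce a file twice the normal size, and that the limit is 200 entries. None of these specifics, nor the function names, come from Cloudflare.

```python
import random


def generate_feature_file(node_updated, base_features):
    """Return the feature list a node would produce.

    Assumption: a node that already has the permission change
    returns every entry twice, doubling the file size.
    """
    if node_updated:
        return list(base_features) * 2
    return list(base_features)


def simulate(cycles, updated_fraction, base_features, limit, seed=0):
    """Run one generation per five-minute cycle on a randomly chosen node.

    Returns a list of booleans: True if that cycle's file fits the limit.
    """
    rng = random.Random(seed)
    outcomes = []
    for _ in range(cycles):
        node_updated = rng.random() < updated_fraction
        feature_file = generate_feature_file(node_updated, base_features)
        outcomes.append(len(feature_file) <= limit)
    return outcomes


base = [f"feature_{i}" for i in range(120)]

# Half the cluster updated: good and bad files alternate unpredictably.
for cycle, ok in enumerate(simulate(12, 0.5, base, limit=200)):
    print(f"t+{cycle * 5:02d} min: {'ok' if ok else 'FAILED'}")
```

With no nodes updated every cycle succeeds, and with all nodes updated every cycle fails. Only the in-between state produces the flapping pattern that, according to the article, initially looked like an attack.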


Although the failure of Cloudflare’s servers affected many customers directly, still more had problems when third-party services integrated with Cloudflare, such as customer login systems that use its Turnstile CAPTCHA, also failed.


Cloudflare engineers stopped the outage by about 10:00 a.m. ET by blocking the generation of the oversized file and manually deploying a known good version across their network. Although the company claimed that services were restored after that, many websites had problems for many hours past the fix. The outage is generally considered to have lasted six hours, with one industry expert estimating it cost $15 billion an hour for Cloudflare’s customers.


Following the incident, Cloudflare’s CEO Matthew Prince published an apology, calling the outage unacceptable and deeply painful to their entire team given Cloudflare’s critical role in the internet ecosystem. The company is now conducting a thorough internal review to identify process gaps, harden systems against future configuration failures, improve debugging and observability, and implement more granular feature kill switches.


Read more at the Cloudflare Blog here.

 
 
 
