What Is Wrong With Facebook Tonight

Early today Facebook was down or unreachable for many of you for about 2.5 hours. This is the worst outage we've had in over four years, and we wanted first of all to apologize for it. We also wanted to provide more technical detail on what happened and share one big lesson learned.

The key flaw that caused this outage to be so severe was an unfortunate handling of an error condition. An automated system for verifying configuration values ended up causing much more damage than it fixed.

The intent of the automated system is to check for configuration values that are invalid in the cache and replace them with updated values from the persistent store. This works well for a transient problem with the cache, but it doesn't work when the persistent store itself is invalid.
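The repair behavior described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not Facebook's actual code: the names `is_valid`, `get_config`, and `count_store_queries`, the dict-backed cache and store, and the validity rule are all hypothetical.

```python
# Hypothetical sketch of a cache "verify and repair" read path.
# The dict-backed cache/store and the validity rule are assumptions for illustration.

def is_valid(value):
    """Assumed validity rule: a config value must be a non-negative integer."""
    return isinstance(value, int) and value >= 0


def get_config(key, cache, store, stats):
    """Return the config value for key, repairing the cache from the store if needed."""
    value = cache.get(key)
    if is_valid(value):
        return value
    # Cached value looks invalid: fetch the persistent copy and overwrite the cache.
    stats["store_queries"] += 1
    value = store[key]
    cache[key] = value
    return value


def count_store_queries(cache, store, reads):
    """Issue `reads` lookups for one key and count how many hit the persistent store."""
    stats = {"store_queries": 0}
    for _ in range(reads):
        get_config("max_conn", cache, store, stats)
    return stats["store_queries"]


# Transient cache problem: the store is good, so a single query repairs the cache.
transient = count_store_queries({"max_conn": None}, {"max_conn": 10}, reads=1000)

# Invalid persistent copy: the "repair" never succeeds, so every read hits the store.
persistent = count_store_queries({"max_conn": None}, {"max_conn": -1}, reads=1000)
```

In this toy model the transient case costs one store query in total, while the invalid persistent copy turns every one of the 1000 reads into a store query.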

Today we made a change to the persistent copy of a configuration value that was interpreted as invalid. This meant that every single client saw the invalid value and attempted to fix it. Because the fix involves making a query to a cluster of databases, that cluster was quickly overwhelmed by hundreds of thousands of queries a second.

To make matters worse, every time a client got an error while attempting to query one of the databases, it interpreted the error as an invalid value and deleted the corresponding cache key. This meant that even after the original problem had been fixed, the stream of queries continued. As long as the databases failed to service some of the requests, they were causing even more requests to themselves. We had entered a feedback loop that didn't allow the databases to recover.
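A toy simulation can make this loop concrete. It is a deliberately simplified model, not a reconstruction of the real system: it assumes a single shared cache key, a cluster that can answer at most `capacity` queries per tick, and a root cause that has already been fixed, so successful queries return good values. The function name `simulate` and all numbers are hypothetical.

```python
# Toy model of the feedback loop: a failed query is mistaken for an invalid value,
# so the shared cache key is deleted and every client queries the cluster again.
# The tick model, client count, and capacity are illustrative assumptions.

def simulate(clients, capacity, ticks, delete_on_error):
    """Return the number of database queries issued in each tick."""
    cache_populated = False  # the shared cache key starts out missing
    history = []
    for _ in range(ticks):
        # On a cache miss every client queries the cluster; on a hit none do.
        queries = 0 if cache_populated else clients
        history.append(queries)
        succeeded = min(queries, capacity)
        failed = queries - succeeded
        if succeeded > 0:
            cache_populated = True   # a successful query refills the cache
        if failed > 0 and delete_on_error:
            cache_populated = False  # an error is treated as bad data: key deleted
    return history


# Errors delete the cache key: load never drops, even though the store is healthy.
looping = simulate(clients=1000, capacity=100, ticks=5, delete_on_error=True)

# Errors leave the cache alone: one successful refill ends the storm.
recovering = simulate(clients=1000, capacity=100, ticks=5, delete_on_error=False)
```

Under these assumptions `looping` stays at 1000 queries in every tick, while `recovering` drops to zero after the first tick. The only difference between the two runs is whether an error is allowed to delete the cached value.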

The way to stop the feedback cycle was quite painful: we had to stop all traffic to this database cluster, which meant turning off the site. Once the databases had recovered and the root cause had been fixed, we slowly allowed more people back onto the site.

This got the site back up and running today, and for now we've turned off the system that attempts to correct configuration values. We're exploring new designs for this configuration system, following design patterns of other systems at Facebook that deal more gracefully with feedback loops and transient spikes.
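The post does not say which designs are being considered. As one generic illustration of handling errors more gracefully, the sketch below treats "the store could not answer" differently from "the value is invalid": on an error it keeps the last cached value and waits exponentially longer before asking again. `StoreError`, `ConfigReader`, `refresh`, and the delay constants are all hypothetical names and values chosen for this example.

```python
# Hypothetical sketch: keep the cached value when the store errors out,
# and back off exponentially before querying the store again.
# StoreError, ConfigReader, and the delay constants are illustrative assumptions.

class StoreError(Exception):
    """Raised by the (hypothetical) fetch function when the store cannot answer."""


class ConfigReader:
    def __init__(self, fetch, base_delay=1.0, max_delay=60.0):
        self.fetch = fetch            # callable: key -> value, may raise StoreError
        self.base_delay = base_delay  # seconds to wait after the first failure
        self.max_delay = max_delay    # upper bound on the wait
        self.cache = {}
        self.failures = 0
        self.retry_at = 0.0           # earliest time the store may be queried again

    def refresh(self, key, now):
        """Try to re-read key from the store at time `now`; return the best known value."""
        if now < self.retry_at:
            return self.cache.get(key)  # still backing off: serve what we have
        try:
            value = self.fetch(key)
        except StoreError:
            # Do not delete the cached value; wait longer after each failure.
            self.failures += 1
            delay = min(self.max_delay, self.base_delay * 2 ** (self.failures - 1))
            self.retry_at = now + delay
            return self.cache.get(key)
        self.failures = 0
        self.cache[key] = value
        return value


attempts = []

def failing_fetch(key):
    """Stand-in for an overloaded cluster: records the attempt, then fails."""
    attempts.append(key)
    raise StoreError("cluster overloaded")


reader = ConfigReader(failing_fetch)
reader.cache["max_conn"] = 10  # last known good value
values = [reader.refresh("max_conn", now=float(t)) for t in range(8)]
```

In this run the reader keeps returning the last known value on all eight calls but queries the failing store only four times (at t = 0, 1, 3, and 7), so an unhealthy cluster sees less load over time instead of more. A production version would typically also add random jitter to the delay so that many clients do not retry in lockstep.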

We apologize again for the site outage, and we want you to know that we take the performance and reliability of Facebook very seriously.