Facebook, You're Doing It Wrong

Early today Facebook was down or unreachable for many of you for about 2.5 hours. This is the worst outage we have had in over four years, and we wanted first of all to apologize for it. We also wanted to provide more technical detail on what happened and share one big lesson learned.

What's Wrong With Facebook

The key flaw that caused this outage to be so severe was an unfortunate handling of an error condition. An automated system for verifying configuration values ended up causing much more damage than it fixed.

The intent of the automated system is to check for configuration values that are invalid in the cache and replace them with updated values from the persistent store. This works well for a transient problem with the cache, but it doesn't work when the persistent store itself holds an invalid value.
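To make the mechanism concrete, here is a minimal sketch of that check-and-repair logic in Python. Everything in it is an assumption made for illustration, not the real system: the names `is_valid`, `get_config`, and `CountingStore`, the dict-based cache, and the rule that a value is valid if it is a non-empty string.

```python
def is_valid(value):
    """Hypothetical validity rule: a config value must be a non-empty string."""
    return isinstance(value, str) and value != ""


class CountingStore:
    """Stand-in for the persistent store; counts the queries it receives."""

    def __init__(self, data):
        self.data = dict(data)
        self.queries = 0

    def query(self, key):
        self.queries += 1
        return self.data[key]


def get_config(key, cache, store):
    """Return the value for key, repairing the cache from the store if needed."""
    value = cache.get(key)
    if is_valid(value):
        return value
    # Cached value is missing or invalid: refresh it from the persistent store.
    fresh = store.query(key)
    cache[key] = fresh
    return fresh


def reads_hitting_store(stored_value, n_reads=1000):
    """Count how many of n_reads reads end up querying the persistent store."""
    store = CountingStore({"feature_flag": stored_value})
    cache = {"feature_flag": ""}  # the cache starts with a corrupted entry
    for _ in range(n_reads):
        get_config("feature_flag", cache, store)
    return store.queries
```

In this sketch, a valid stored value is fetched once and every later read is served from the cache. If the stored value is itself invalid, the repair never sticks and every read becomes a query to the store.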

Today we made a change to the persistent copy of a configuration value that was interpreted as invalid. This meant that every single client saw the invalid value and attempted to fix it. Because the fix involves making a query to a cluster of databases, that cluster was quickly overwhelmed by hundreds of thousands of queries per second.

To make matters worse, every time a client got an error while attempting to query one of the databases, it interpreted the error as an invalid value and deleted the corresponding cache key. This meant that even after the original problem had been fixed, the stream of queries continued. As long as the databases failed to service some of the requests, they were causing even more requests to themselves. We had entered a feedback loop that didn't allow the databases to recover.
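The feedback loop can be illustrated with a toy simulation in Python. This is a sketch under stated assumptions, not a model of the real infrastructure: `Database`, `repair`, and `simulate` are hypothetical names, time is divided into discrete ticks, the database can serve only a fixed number of queries per tick, and all clients in a tick are assumed to read the cache before any repair lands.

```python
class DatabaseError(Exception):
    """Raised when the simulated database cannot serve a query."""


class Database:
    """Toy database that can serve only `capacity` queries per tick."""

    def __init__(self, value, capacity):
        self.value = value
        self.capacity = capacity
        self.seen = 0

    def start_tick(self):
        self.seen = 0

    def query(self, key):
        self.seen += 1
        if self.seen > self.capacity:
            raise DatabaseError("overloaded")
        return self.value


def repair(key, cache, db):
    """Flawed repair: a database error is handled like an invalid value."""
    try:
        cache[key] = db.query(key)
    except DatabaseError:
        # Deleting the key forces the next round of reads to query again.
        cache.pop(key, None)


def simulate(n_clients, capacity, ticks):
    """Return the number of database queries issued in each tick."""
    cache = {}  # the key is missing, as if the bad value was just removed
    db = Database("good-value", capacity)
    history = []
    for _ in range(ticks):
        db.start_tick()
        # Assumption: every client reads the cache before any repair lands,
        # so either all of them query the database or none of them do.
        queries = n_clients if "config" not in cache else 0
        for _ in range(queries):
            repair("config", cache, db)
        history.append(queries)
    return history
```

With `simulate(1000, 100, 5)` the database sees 1000 queries on every tick: the queries it fails to serve keep deleting the key that the successful ones restore. With `simulate(50, 100, 5)` the load fits within capacity, the key is restored on the first tick, and no further queries are issued. In this toy model the loop only ends once traffic drops below what the database can serve.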

The way to stop the feedback cycle was quite painful: we had to stop all traffic to this database cluster, which meant turning off the site. Once the databases had recovered and the root cause had been fixed, we slowly allowed more people back onto the site.

This got the site back up and running today, and for now we've turned off the system that attempts to correct configuration values. We're exploring new designs for this configuration system, following design patterns of other systems at Facebook that deal more gracefully with feedback loops and transient spikes.

We apologize again for the site outage, and we want you to know that we take the performance and reliability of Facebook very seriously.