On the 4th of October 2021, Facebook’s platforms (Facebook, WhatsApp and Instagram) were all down for nearly six hours, in the company’s worst-ever outage. But the chances are you knew this already! Whether you were refreshing your Instagram, waiting for a message on WhatsApp or simply caught all the press coverage, it would have been hard to miss.
So, why are we writing about it too? Well, we think there’s an interesting lesson in there that can be applied to anyone who runs a business. And yes, it’s to do with partnerships! But before we get onto that…
What actually happened?
According to details released by Facebook, the outage was caused by an error made during routine maintenance.
To cut a long story short, Facebook has loads of data centres. Some are big buildings that house millions of machines, storing data and keeping Facebook’s platforms running. Others are smaller buildings that connect Facebook’s network to the wider internet (and, therefore, the people using its platforms).
All these data centres are connected to each other by a network Facebook built called the ‘backbone.’ This is made up of thousands and thousands of miles of cables that literally connect all the data centres to one another. When you use Facebook, WhatsApp or Instagram, your device sends a request to your nearest small data centre, which then uses the backbone network to communicate with a larger data centre to get your app the info it needs. Clever, right?!
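To make that flow a bit more concrete, here’s a tiny Python sketch of the routing idea described above. The site names and mapping are entirely made up for illustration; this is not Facebook’s real topology, just the edge-to-backbone hop in miniature:

```python
# Illustrative only: a toy map from a small "edge" data centre
# to the big data centre it reaches over the backbone network.
# (All names here are invented for this example.)
EDGE_TO_BACKBONE = {
    "edge-london": "datacentre-lulea",
    "edge-newyork": "datacentre-prineville",
}

def handle_request(nearest_edge, resource):
    """Route a user's request: edge site -> backbone -> big data centre."""
    big_dc = EDGE_TO_BACKBONE[nearest_edge]
    return f"{resource} served by {big_dc} via {nearest_edge}"

print(handle_request("edge-london", "news_feed"))
# → news_feed served by datacentre-lulea via edge-london
```

The key point the toy model captures: every request depends on that middle hop over the backbone. Take the backbone away, and the edge sites can’t reach anything.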
Now for the problem. This huge network needs maintenance. And, to do that, Facebook’s engineers often need to take part of the backbone offline. On the 4th of October, during one of these routine maintenance jobs, a command accidentally took down all of the connections in the backbone network, disconnecting all the data centres from one another and from the wider internet. Oops!
To make matters worse, this then caused another problem. One of the jobs of the smaller data centres is to respond to DNS queries. Basically, when you type in a web address, it needs to be translated into the IP address of a specific server. And, so that the rest of the internet knows where to find the servers that do this translation, the data centre advertises the routes to them using something called the Border Gateway Protocol (BGP).
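That translation step is simpler than it sounds. Here’s a toy Python version of what a DNS lookup does conceptually; the records below use reserved documentation IP addresses, not Facebook’s real ones:

```python
# A toy illustration of DNS: translate a human-readable name
# into a server's IP address. The addresses are from the
# 203.0.113.0/24 block reserved for documentation examples.
DNS_RECORDS = {
    "facebook.com": "203.0.113.5",
    "whatsapp.com": "203.0.113.6",
}

def resolve(hostname):
    """Return the IP address for a hostname, or None if unknown."""
    return DNS_RECORDS.get(hostname)

print(resolve("facebook.com"))  # → 203.0.113.5
```

In the real world your device asks a DNS server to do this lookup, which is exactly why it matters that the internet can find those DNS servers in the first place.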
Facebook’s DNS servers (the servers that fulfil this job) were operating fine. But they’re set up so that, if they can’t communicate with the data centres, they withdraw their BGP advertisements so that they can’t be found. That’s because, usually, losing contact with the data centres is a sign of an unhealthy network connection.
In this case, because the whole backbone was down, all the DNS servers declared themselves as unhealthy and withdrew their BGP advertisements. This meant that it became impossible for the rest of the internet to find them, even though they were operational. Eek!
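The failure mode boils down to a few lines of logic. This is a purely illustrative Python sketch (the function and data centre names are our own invention, not Facebook’s code), but it captures the design decision: each DNS server advertises itself via BGP only while it can reach the backbone.

```python
# Illustrative sketch of the DNS servers' health check: if a
# server can't reach ANY data centre over the backbone, it
# assumes its own link is bad and withdraws its BGP routes.
def should_advertise(reachable_datacentres):
    """Advertise our routes via BGP only if the backbone looks healthy."""
    return len(reachable_datacentres) > 0

# Normal day: data centres are reachable, so the server stays visible.
print(should_advertise(["prineville", "lulea"]))  # → True

# 4th of October: the backbone itself is down, so EVERY DNS
# server sees zero reachable data centres and withdraws at once.
print(should_advertise([]))  # → False
```

The check is sensible for one unhealthy server; the trouble is that a backbone-wide failure made every server fail the same check simultaneously, taking Facebook off the map entirely.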
Why did it take them so long to fix the issue?
Even though that might all sound like a massive issue, none of it should actually be that complicated to fix. So, why did it take Facebook nearly six hours to get to the root of the problem and get Facebook’s platforms back up and running?
Well, it turns out that Facebook’s services weren’t just down for users, but for Facebook’s employees too! No kidding: Facebook’s employees actually rely on an internal, business version of Facebook, known as Workplace, for the majority of their communication.
Worse still, Facebook’s engineers needed access to the network that was down in order to diagnose the problem and apply the fix. Without the network working, the engineers couldn’t access Facebook’s data centres in the way they usually would. And, the problems with DNS meant that most of the tools they’d normally use to diagnose and fix issues were broken too!
In other words, the Facebook team had inadvertently created a single point of failure – exactly what network designers do their best to avoid!
In the end, Facebook had to send engineers to its data centres in person to diagnose the issues and restart the systems, a time-consuming workaround, as these facilities and the hardware inside them are deliberately hard to access and modify for security reasons!
So, what’s the lesson?
Despite what it might sound like, we’re not telling you all this just to have a rant at Facebook or to laugh at their misfortune. Outages due to maintenance errors aren’t uncommon, and the main thing is to have systems and protocols in place that limit how often they occur and the damage they cause when they do.
However, Facebook made one key mistake that we can all learn from. The fact that it relied on its own systems to manage and protect its own systems made it extremely vulnerable and meant that a small issue became a big one.
As a business, it can be tempting to try to own all of your core technologies and processes so that you’re not relying on anyone else. Just look at Apple, which recently abandoned a long-term supply chain partnership with Intel when it decided to start designing its own microprocessors for its Macs! But contrary to what you might think, relying on others isn’t necessarily a weakness.
If Facebook had partnered with technical providers rather than depending on its own technologies to manage every aspect of its business, would the error have occurred? Maybe. But would it have been easier to resolve? Definitely!
Even if it had partnered with an external company to regularly audit its network, it’s just possible that the structural issue would have been flagged up earlier, potentially preventing a major outage like this from occurring. Sometimes, if you’re too close to a problem, it’s easy to miss obvious things that an external pair of eyes might be able to spot immediately!
The moral of the story? Strategic partnerships can make you stronger. Not only can they help prevent you from being reliant on a single point of failure, but they can also give you another perspective and lead to more options when things do go wrong. The phrase ‘stronger together than we are alone’ definitely applies here!
We’re big believers in the fact that mistakes don’t define a person (or organisation). Instead, it’s how you learn from them that counts. And it looks like Facebook engineering VP Santosh Janardhan agrees. In his recent blog post, he states that ‘Every failure like this is an opportunity to learn and get better, and there’s plenty for us to learn from this one.’
That said, we also believe that the best learning takes place when we look beyond the four walls of our own organisations and seek to collaborate with others. Forming strategic partnerships with other businesses means gaining skills, perspectives and expertise that you don’t have in-house. In the case of technology, it also means not putting all your eggs in one basket, so that you’re less dependent on a single point of failure.
Ready to find your ideal collaborators? Just sign up with Breezy to find hundreds of relevant Leads.