While the Queen once referred to an entire year as her ‘annus horribilis’, Latin for ‘horrible year’, the BBC’s technology correspondent, Rory Cellan-Jones, described seven days in October as Facebook’s ‘worst week’. Not only was the tech giant dealing with the fallout from a whistleblower, but its entire service went offline for six hours, along with its other platforms, Instagram and WhatsApp.
Many of us just assumed it was our wi-fi and did a lot of turning things off and back on again… but a quick visit to Twitter soon explained everything. There on the social media channel’s timeline was a tweet which simply said: ‘Hello literally everyone’.
As Rory Cellan-Jones explained: “Go back 15 years and it was relatively common for young web platforms like Facebook to fall off the internet for a day or even longer. Nobody got that irate. It is a mark of how the world has changed that even if a platform goes offline for an hour there is an enormous hullabaloo, because billions of people now expect to be connected 24/7.”
While for many of us it was a case of going ‘cold turkey’ or dipping into Twitter, some businesses now rely entirely on social media for their sales. There were brands which had arranged Facebook Live events for that evening, and influencers who couldn’t influence.
But what happened? A simple explanation is that Facebook’s systems stopped talking to the wider internet. Facebook itself explained that ‘configuration changes on the backbone routers that co-ordinate network traffic between our data centers caused issues that interrupted this communication. This had a cascading effect… bringing our services to a halt’.
The internet isn’t just one thing; it is made up of hundreds of thousands of networks, and firms such as Facebook run their own large networks within it. The system that lets these networks find one another is the Border Gateway Protocol (BGP). To direct people to the websites they want to visit, BGP looks at all of the available paths that data could travel and picks the best route.
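To make that route-selection idea concrete, here is a deliberately simplified sketch in Python. Real BGP weighs many attributes beyond path length, and the prefix and AS numbers below are reserved documentation values rather than real networks.

```python
# Toy model of one BGP idea: of several advertised paths to the same
# destination, prefer the one crossing the fewest networks (autonomous
# systems). Prefix and AS numbers are reserved documentation values.
routes = {
    "203.0.113.0/24": [
        [64500, 64501, 64496],  # three networks to cross
        [64510, 64496],         # two networks - the shorter path wins
    ],
}

def best_path(prefix):
    """Return the advertised path crossing the fewest networks, or None."""
    candidates = routes.get(prefix, [])
    return min(candidates, key=len) if candidates else None

print(best_path("203.0.113.0/24"))  # -> [64510, 64496]
```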
This particular breakdown happened when Facebook suddenly stopped providing the information BGP needed to function: the broadcasting of its routes simply disappeared, all traffic died instantly, and nobody’s computers had any way of connecting to Facebook, WhatsApp or Instagram.
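One way to see what that meant in practice: with no routes advertised, even the servers that answer name lookups for these domains were unreachable, so connections failed before they were ever attempted. A quick check like the sketch below (standard Python, harmless to run today) would simply have printed errors during the outage.

```python
import socket

# Try to resolve each name to an IP address. During the outage, the
# servers answering these lookups sat inside Facebook's withdrawn
# network, so the except branch fired instead of a successful lookup.
for host in ("facebook.com", "whatsapp.com", "instagram.com"):
    try:
        print(host, "->", socket.gethostbyname(host))
    except socket.gaierror as err:
        print(host, "-> lookup failed:", err)
```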
The situation then became more complicated because, apparently, many of Facebook’s engineers were still working from home. When people arrived at the office in California to sort out the issue, they couldn’t access the building: the doors are connected to Facebook’s system, so they had failed as well…
This issue follows a major outage in June, which led to a number of high-profile websites going down for around an hour, including the UK government website, Amazon, Twitch, Reddit, the Financial Times and the New York Times. Many people initially assumed it was a security or hacking issue. However, it transpired that the cloud computing provider, Fastly, was behind the problems.
Fastly is an American cloud computing services provider which operates servers at strategic points around the world to help customers move and store content close to their end users.
The firm said there had been issues with its global content delivery network (CDN), which it was fixing. In a statement, it said: "We identified a service configuration that triggered disruption across our POPs (points of presence) globally and have disabled that configuration."
What many people will be asking, though, is how quite so many sites could be affected.
CDNs, such as those provided by Fastly, are the links that make the internet run smoothly. Sometimes these links break, but it is rare for a break to cause quite so much disruption. We need CDNs to help web pages load quickly: if somebody in the US tries to load a website hosted in the UK, there will inevitably be a short delay while the data crosses the Atlantic. Such delays are not acceptable in today’s fast-paced world, so a CDN stores copies of the page on servers closer to the visitor, letting them access the same content locally.
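As a rough sketch of that ‘serve it locally’ idea, assume a CDN that simply routes each visitor to the point of presence (POP) with the lowest measured latency. The city names and millisecond figures below are invented for illustration.

```python
def nearest_pop(latencies_ms):
    """Pick the point of presence with the lowest measured latency."""
    return min(latencies_ms, key=latencies_ms.get)

# A visitor in the US gets the copy cached in New York rather than
# waiting on a round trip to the UK origin server. Figures are invented.
us_visitor = {"London": 95.0, "New York": 12.0, "Singapore": 230.0}
print(nearest_pop(us_visitor))  # -> New York
```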
Fastly blamed the blackout on a software bug which, it said, was triggered when one of its customers had changed their settings. This led to 85% of its network returning errors.
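Fastly has not published the code involved, but a common defence against this class of failure is to validate, and then canary, a customer’s configuration change before it reaches every POP, so that a bad setting breaks one node rather than most of the network. A minimal sketch of the validation step, with invented field names:

```python
def validate_config(config):
    """Return a list of problems; an empty list means the config looks safe."""
    problems = []
    if "origin" not in config:
        problems.append("missing origin server")
    if config.get("cache_ttl", 0) < 0:
        problems.append("negative cache TTL")
    return problems

candidate = {"origin": "https://origin.example.com", "cache_ttl": 3600}
issues = validate_config(candidate)
if issues:
    print("Rejected:", "; ".join(issues))
else:
    print("Accepted: deploy to a single POP first, then roll out gradually")
```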
While most organisations don’t receive the same levels of web traffic as Amazon or the BBC, it is worth being aware of the infrastructure sitting behind your website and, ideally, arranging an audit of it to identify weak points and improve resilience against an outage.
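As a very lightweight first step towards that kind of audit, even a scheduled check like the sketch below, which uses the third-party requests library and a placeholder URL, will tell you when your site stops responding. Dedicated monitoring services do far more, but the principle is the same.

```python
import requests  # third-party: pip install requests

def site_is_up(url, timeout=5.0):
    """Return True if the URL answers with a non-error status (below 400)."""
    try:
        return requests.get(url, timeout=timeout).ok
    except requests.RequestException:
        return False

print(site_is_up("https://www.example.com"))  # replace with your own site
```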
If you’d like to explore this subject more, then do get in touch with our team at Lake Solutions.