What exactly is a cloud outage? In this beginner’s guide, we go through all you need to know: what actually happens, what the impact can be, and a few examples.
Despite the sheer size and scope of cloud technology, cloud outages can’t be avoided. It’s not an infallible technology, despite it being robust and reliable. But when cloud computing goes down, what is actually happening? What causes the outages to happen? And can it take us all down with it? If you’re new to cloud technology, the prospect of a cloud outage can be pretty unnerving.
In this article, we’ll go through everything you need to know about cloud outages. We’ll define what they are and why it’s important to get to grips with them. We’ll also provide a few examples of what can happen when systems go awry.
Here’s what we’ll be discussing:
What Happens During a Cloud Outage?
What are the Common Causes of Cloud Outages?
Why is it Important to Understand What a Cloud Outage is?
What are the Knock-On Effects?
Which Cloud Provider Has the Most Outages?
Different Examples of Cloud Outages
Help is at Hand
A cloud outage is the name given to a period of time when a cloud infrastructure isn’t fully available. A cloud outage may affect part of the infrastructure or the whole thing.
The result of cloud outages is downtime or unexpected behaviour. Users might not be able to use services or resources that are based in the infrastructure. This can be a pretty big blow for organisations and services that depend on the cloud to do, well, anything.
Service Level Agreements (SLAs) from cloud providers set expectations for cloud performance. These agreements state the minimum level of service a cloud provider will offer. This is usually expressed as availability, typically 99.9% of the time, which leaves room for a small amount of downtime.
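To put that 99.9% figure in perspective, here is a minimal Python sketch of the arithmetic. The function name and the 30-day period are our own assumptions for illustration; real SLAs define their own measurement periods and exclusions.

```python
def allowed_downtime_minutes(availability_percent: float, period_days: float = 30) -> float:
    """Return the maximum downtime (in minutes) that an availability target allows.

    Hypothetical helper for illustration only: it assumes the period is a
    fixed number of days and that every minute counts towards availability.
    """
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


if __name__ == "__main__":
    # Compare a few common availability targets over a 30-day period.
    for target in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% over 30 days allows about {minutes:.1f} minutes of downtime")
```

Under these assumptions, a 99.9% target still allows roughly 43 minutes of downtime in a 30-day month, so an SLA reduces the chance of outages rather than ruling them out.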
That said, outages have been increasing and the consequences of cloud outages are getting worse. Over half of outages in 2021 resulted in “substantial financial, operational and/or reputational damage”. Over 60% of outages resulted in over $100k of total losses last year.
So what is actually causing these outages? Let’s take a closer look.
There’s no single cause of a cloud outage. Some causes are more preventable than others, and some come down to really bad luck. Here are some of the main causes:
Human error is a fact of life. According to the Uptime Institute’s report, 79% of data centre outages in 2021 were due to human error. These are usually due to a staff member forgetting to do something, or doing it incorrectly. It can all come down to pressing the wrong button.
Back in 2017, an AWS outage caused Instagram, IMDb, and Quora to go offline for several hours. This was caused by an employee trying to debug an issue with the billing system. In doing so, they accidentally took a bunch of servers offline - and took a bunch of colossal apps with them.
A 2016 report from the Ponemon Institute found that cybercrime is a fast-growing cause of outages.
Cybersecurity is a serious concern. Cloud outages caused by cyber attacks are often distributed denial-of-service (DDoS) attacks. During a DDoS attack, a cloud’s functionality is disrupted by an overwhelming increase in traffic. This ends up overloading the cloud infrastructure, resulting in a cloud outage.
As is the case with anything digital, glitches and bugs can crop up, and sometimes these cause outages.
GCP hit the news with an outage in November 2021 for this reason. It turned out to be caused by an internal network glitch. This ended up taking down Spotify, Home Depot and Etsy.
Some cloud providers may depend on third parties for their networking functionality. If there are issues with these systems, there can be a domino effect. And this can result in a cloud outage.
For any digital system, maintenance is required to keep things ticking over. Normally users are given warnings about any maintenance that is scheduled.
Hurricanes, storms, and other large-scale natural events can impact nearby data centres. Last year, an electrical storm in Northern Virginia caused an outage in AWS’s US-East-1 region. This ended up affecting Slack and Epic Games.
The Uptime Institute reports that power issues are the most common cause of outages.
Data centres require a lot of power. In 2018, it was estimated that data centres used roughly 1% of the world’s energy output. The power for these is normally provided by third parties, like the National Grid. Electrical outages happen when there’s not enough power being provided for these data centres.
Cloud outages can have a ripple effect across all corners of an organisation. Understanding what they are, what causes them, and how to prevent them can save a lot of headaches.
As the Uptime Institute reports, the majority of outages are probably preventable.
As mentioned earlier, things not working for a while isn’t the only outcome of a cloud outage. Effects can be felt across a whole company.
While not all causes can be mitigated, their impact can be minimised. Processes can be implemented to reduce the risk of human error, for example. And different cloud vendors can be used so that you don’t have all your eggs in one basket, and so on.
Taking this preventative approach to managing cloud outages can reduce the risk of:
Unexpected costs - The Uptime Institute reported that ⅔ of outage incidents in 2020 cost over USD $100,000, and ⅓ of these cost over USD $1 million.
Loss of income - No access to services means not being able to trade.
Reputational damage and disgruntled customers - Customers can get very cross because of downtime as they expect a high level of service. This article from Forbes notes how AWS’s recent outages have resulted in a loss of customers.
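The “eggs in one basket” point above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `first_healthy`, `primary_fetch`, and `secondary_fetch` are hypothetical names, and the two fetch functions stand in for calls to services hosted with different cloud vendors. A real setup would also need health checks, timeouts, and data replication.

```python
from typing import Callable, Sequence, Tuple


def first_healthy(providers: Sequence[Tuple[str, Callable[[], str]]]) -> Tuple[str, str]:
    """Try each provider in order; return (name, result) from the first that succeeds."""
    errors = []
    for name, call in providers:
        try:
            return name, call()
        except Exception as exc:  # deliberately broad: any failure triggers failover
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


def primary_fetch() -> str:
    # Hypothetical stand-in for a service on the primary vendor, simulating an outage.
    raise ConnectionError("simulated outage")


def secondary_fetch() -> str:
    # Hypothetical stand-in for the same service hosted with a second vendor.
    return "ok from secondary"


if __name__ == "__main__":
    name, result = first_healthy([("primary", primary_fetch), ("secondary", secondary_fetch)])
    print(f"served by {name}: {result}")
```

The design choice here is simply ordered fallback: the primary is always tried first, and the secondary is only used when the primary raises an error.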
We’ve discussed what cloud outages are and their impact - but which provider has the most outages?
There is currently no evidence to show which cloud provider has the most outages. It’s mostly the big “oops” stories that make the news. However, these normally come from the big three of the industry - AWS, Google Cloud, and Microsoft Azure.
You can keep an eye on any current outages here.
The effects of cloud outages can range from “oh that’s annoying” to “OH MY GOD!”. Here are a few horror stories that have hit the headlines to show you what we mean.
Back in May 2020, a cloud outage took down the digital application Slack for 48 minutes because of load balancing problems. It was only 48 minutes but it’s still not ideal when you have a deadline approaching!
In another case, an AWS outage impacted service availability for customers in the US-East-1 region. This was all due to networking issues. This Amazon data centre outage affected applications including Heroku, and even GitHub went down. Should we also mention that this happened on Friday the 13th?
A Google Cloud outage happened during summer 2021. It was in their newest region - australia-southeast2. This is thought to have been caused by transient voltage - basically a big, short power surge. It was all resolved within an hour, but 23 Google Cloud services were impacted.
Last December, there was also an Azure outage that took down the entirety of Microsoft 365. Just take a moment to consider that. Luckily, it was resolved that same evening.
There was an enormous global AWS outage last year. Back in December, the outage saw Amazon, Netflix, Tinder, McDonald’s, Disney+ and Roku taken out.
A very rare Apple outage happened in March. This took down Apple Music, iCloud, and the App Store. It’s said to have been caused by a DNS problem.
With cloud technology, the good outweighs the bad. And the opportunities and services it offers give organisations many options.
So when it comes to cloud outages - they’re certainly something you don’t want to happen, but they shouldn’t be a reason not to invest in the cloud or cloud migration. There are things you can do to prepare for cloud outages. These can help reduce outage impact.
Make a plan of who to contact and what to do when something happens. Simply having a plan in place can change how quickly you can resolve any issues in the future because you have a protocol to follow and can work on autopilot.
Actually try it out. Run a regular, quick playthrough of the plan with all stakeholders. These fire drills bake in the process for the relevant people, find hiccups in your plan before they can really do damage in a live fire, and normalise the processes and protocols. You can always expand the plan based on what you learn as well.
You can also look at added services that provide additional protection. Divio has a partnership with Cloudflare that allows us to offer the gated Enterprise-grade plan to our Gold and Platinum customers. We recently helped a partner set this up for a client who went from regular outages every few days to zero outages after we turned the service on.
In addition, look at your partners and their own plans and protocols. Some cloud compliance certifications will confirm this sort of planning is baked in. Divio, for example, has ISO 27001 certification, and this level of disaster recovery is a requirement for our work. It’s something Divio did anyway before ISO certification, and something we exceed today in our approach to disaster recovery. But that doesn’t change the fact that two simple first steps can really help you prevent problems for your business. If those first steps are being followed by your partners too, even better. You don’t have to go to extreme lengths to see most of the benefits of prevention.
Remember: prevention is all about preparation. And this is where we come in. We can give you clear advice on your cloud setup and/or help you jump ahead on the prevention route with bundled cybersecurity services. Divio Gold and Platinum Enterprise clients save over $5,000 a month on Cloudflare and other bundled services, so don't hesitate to reach out to us to learn more or just to get some clear and actionable cloud advice.