Top causes of system downtime and how devops prepare to avert a crisis

These last few weeks have been a challenge for devops teams everywhere. For months, everyone watched as the novel Covid-19 coronavirus swept across the world, and governments have called for people to stay home to help stop the spread of the disease. Nobody knew exactly how it would affect business and our everyday lives, and hardly anyone expected a virus to force us into isolation and move so many of our routine activities online. This sudden surge in traffic puts a lot of strain on systems and exposes major weaknesses that would otherwise go unnoticed. Making sure your systems are ready for unforeseeable events is an essential part of your devops strategy.

From daily meetings to your kids’ school lessons to virtual museum visits, traffic is up across the board. Keeping these services up and running is now more important than ever. Perhaps the biggest surges are in video conferencing services, remote collaboration tools, streaming services and online payments. Having a plan for what to do when a surge in traffic threatens to crash your site is vital to prevent downtime and service disruptions.


1. Reasons for the rapid increase in traffic

As many companies scrambled to figure out how to go fully remote in the face of orders to isolate, many daily activities moved online. Some examples include:

1. Meetings — both for work and personal

Video conferencing and messenger services have replaced face-to-face meetings.

2. Grocery shopping and eating out

We’re ordering more groceries online and arranging food delivery through couriers.

3. Payments

In Poland, for example, payment providers have raised the no-PIN transaction limit to the Polish zloty equivalent of $25.00 so that 80% of all payments can be made without touching a PIN pad or banknotes. This is very hygienic but threatens to overload the payment system. A lot of shopping has also moved to e-commerce sites, straining their capacity.

4. Daily news

Demand for the latest updates has increased readership on news sites.

5. Streaming videos

Cinema closures have driven demand for streaming services.

6. Outings

We’re visiting museums and galleries virtually.

7. School

Lessons are going ahead in many parts of the world via e-learning platforms.

Whenever you have a rush of traffic in a short amount of time, it tests whether or not the developers who designed the platforms did their jobs well. If a system crashes under the strain of increased traffic, chances are there wasn’t enough planning and foresight in development.


2. Latest examples of system downtime

How do millions of people sitting at home affect the use of network services? One local example involved an online news portal that first reported that all schools in Poznań would be closed for two weeks due to quarantine. As the news broke, the website became unavailable because its resources were too limited and the company was unable to react quickly. The site was handling only a little over 13,000 users at the same time, in a city of half a million people. Should a load like that be a barrier for your business?

Another, more dramatic case happened last week, when the stock trading application Robinhood failed due to a surge in traffic. This failure prevented users from accessing their accounts and selling their stocks as prices fell, leaving many with huge losses. The loss of user trust and credibility, not to mention the drastic losses for users themselves, is immense. Here are just a few other scenarios that can happen.


3. System downtime – what could possibly go wrong?

1. An increased number of visits can kill the server

When a single machine runs out of resources, your content becomes unavailable. Shared hosting is definitely not a solution here. A better option is a cloud service provider such as Amazon Web Services, Microsoft Azure or Google Cloud Platform, which has enough resources for a devops team to scale up when needed. Alternatively, you can use a powerful dedicated server in a well-known data center.

2. Poorly designed databases may not withstand a sudden increase in data

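Let’s say that the number of orders in a store has increased and, with each subsequent order, the database responds more slowly. A schema that was never designed for growth, for example one with no indexes on frequently queried columns, will slow down further as the data piles up.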
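
To make the slowdown concrete, here is a minimal sketch using Python's built-in sqlite3 module. The orders table, its customer_id column and the index name are assumptions made up for illustration, not taken from any real store. The function asks SQLite how it plans to fetch one customer's orders, with and without an index.

```python
import sqlite3


def order_lookup_plan(with_index):
    """Return SQLite's query plan for fetching one customer's orders."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: one row per order, many orders per customer.
    conn.execute(
        "CREATE TABLE orders ("
        "id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
    )
    conn.executemany(
        "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
        [(i % 1000, 9.99) for i in range(10_000)],
    )
    if with_index:
        # The index lets SQLite jump straight to the matching rows.
        conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
    rows = conn.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer_id = ?",
        (42,),
    ).fetchall()
    conn.close()
    # The last column of each plan row is a human-readable description.
    return " ".join(row[-1] for row in rows)


if __name__ == "__main__":
    print("without index:", order_lookup_plan(with_index=False))
    print("with index:   ", order_lookup_plan(with_index=True))
```

Without the index, the plan is a full table scan, so every lookup gets slower as orders accumulate. With the index, the plan becomes a search on idx_orders_customer. The exact wording of the plan varies between SQLite versions.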

3. Poorly written code needs a lot of computing power for simple tasks

With increased traffic, costs can rise disproportionately to profits. To avoid this, write tests during development and carry out stress tests before going to production. 
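
A stress test does not have to be elaborate to be useful. The sketch below uses only the Python standard library and is an illustration rather than a replacement for a dedicated load-testing tool. The names http_get and stress_test, and the default request counts, are our own assumptions. It fires a fixed number of requests from a pool of worker threads and reports failures and latency.

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def http_get(url, timeout=10.0):
    """Fetch a URL and report whether it answered with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            response.read()
            return response.status == 200
    except OSError:
        # Covers URLError, HTTPError, timeouts and refused connections.
        return False


def stress_test(request_fn, total_requests=200, concurrency=20):
    """Call request_fn total_requests times from `concurrency` threads."""
    if total_requests < 1 or concurrency < 1:
        raise ValueError("total_requests and concurrency must be at least 1")

    def timed_call(_):
        start = time.perf_counter()
        ok = bool(request_fn())
        return ok, time.perf_counter() - start

    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(timed_call, range(total_requests)))
    wall_seconds = time.perf_counter() - started

    latencies = sorted(elapsed for _, elapsed in results)
    return {
        "requests": total_requests,
        "failures": sum(1 for ok, _ in results if not ok),
        # Simple nearest-rank approximation of the 95th percentile.
        "p95_seconds": latencies[int(0.95 * (len(latencies) - 1))],
        "max_seconds": latencies[-1],
        "requests_per_second": total_requests / wall_seconds,
    }
```

You could call it as stress_test(lambda: http_get("https://staging.example.com/"), total_requests=500, concurrency=50), where the URL is a placeholder for your own staging environment. Only run this against systems you own, and raise the load step by step while watching the failure count and the 95th percentile latency.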

4. Self-hosting instead of using the cloud

Many companies and publishers still host their resources on their own servers. Nowadays, the cloud offers the flexibility to respond to urgent needs. In this model you only pay for what you use: you can start another server at any time and shut it down when the traffic drops. It’s also possible to automate this process. So why not use it?
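
The automation usually comes down to a simple rule that a scheduler evaluates every few minutes. Below is a minimal sketch of one such rule, proportional scaling toward a target CPU utilization. The function name, the 60% target and the server limits are assumptions chosen for illustration. In practice you would configure your cloud provider's own autoscaling service instead of writing this yourself.

```python
import math


def desired_servers(current_servers, cpu_percent, target_percent=60,
                    min_servers=1, max_servers=20):
    """Return how many servers are needed to bring average CPU to the target.

    cpu_percent is the average utilization across the current servers.
    """
    if current_servers < 1:
        raise ValueError("current_servers must be at least 1")
    if target_percent <= 0:
        raise ValueError("target_percent must be positive")
    if cpu_percent < 0:
        raise ValueError("cpu_percent cannot be negative")

    # Total load stays the same, so the server count scales with
    # the ratio of observed utilization to target utilization.
    needed = math.ceil(current_servers * cpu_percent / target_percent)

    # Never drop below the safety minimum or exceed the budget cap.
    return max(min_servers, min(max_servers, needed))
```

For example, four servers running at 90% CPU against a 60% target give a result of six servers, while six servers idling at 20% can shrink to two. The maximum acts as a cost ceiling, so a runaway traffic spike cannot start an unlimited number of machines.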

5. Saving on infrastructure can lead to system failure

Small, cheap resources cannot cope with a sudden surge. Suppose someone hosts a website on their own server and has only a small reserve of bandwidth. Increasing bandwidth is not possible overnight. Instead, use a cloud provider or a data center.


4. System downtime – all too common mistakes

1. No support when the website is on fire

Companies often order a website or e-commerce shop and later just let it run without any devops support. When traffic increases, no one is able to help. At Espeo we offer support for our software in the production environment so that you are not left alone in such a case.

2. Old technologies make the product inaccessible

An example would be one of the Polish e-learning platforms that still uses Adobe Flash extensions. Browsers no longer officially support these, and Flash reaches its end of life later this year. Now that schools are closed, it turns out that getting the service to work is beyond the skills of most young users.

3. Weak security

Today, the standard is to use the HTTPS protocol, which relies on TLS certificates (still commonly called SSL certificates). It provides a secure connection between the user and the provider. Without encrypted transmission, users may reject the site, especially when it handles payments and the personal data needed for an order.

On the other hand, websites are sometimes vulnerable to attacks because the code relies on open source components that are not updated on time. At Espeo we put a lot of effort into keeping systems updated. Our services include, among others, scanning running resources and monitoring servers so we can prevent attacks and keep software stable.

4. “Tests are not needed” sentiment

Many software houses cave to pressure from clients eager to rush a program to production. But it’s a huge mistake to think that you don’t need to test software. Simulating an outage in a test is far easier and cheaper than dealing with a real one. Testing takes some long-term planning and upfront costs, but it’s much more cost-effective than handling a crisis in production. At Espeo our quality assurance specialists test each project. Depending on the scope, our devops team can run many different kinds of tests to prevent problems in the future.

5. Bad architecture

“A single instance will deal with everything” is a bad concept. Keeping everything in one place will fail sooner or later. The key is to multiply resources and keep the database and the website apart from each other. At Espeo we advise clients to set up environments with load balancers and to take advantage of scalability and master-slave database replication.
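
To illustrate how replication spreads the load, here is a minimal sketch of a router that sends writes to the master (primary) database and distributes reads across the replicas in round-robin order. The class name and the host names in the usage note are hypothetical, and the rule that only SELECT statements count as reads is a deliberate simplification. Real deployments usually rely on a database proxy or the driver's own support for this.

```python
import itertools


class ReplicatedDatabaseRouter:
    """Pick a database host: writes go to the primary, reads to replicas."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = list(replicas)
        self.healthy = set(self.replicas)
        self._round_robin = itertools.cycle(self.replicas)

    def mark_down(self, replica):
        """Stop sending reads to a replica that failed a health check."""
        self.healthy.discard(replica)

    def mark_up(self, replica):
        """Resume sending reads to a recovered replica."""
        if replica in self.replicas:
            self.healthy.add(replica)

    def route(self, sql):
        """Return the host that should execute the given SQL statement."""
        words = sql.split(None, 1)
        is_read = bool(words) and words[0].upper() == "SELECT"
        if not is_read or not self.healthy:
            # Writes, unknown statements and the no-replica case
            # all fall back to the primary.
            return self.primary
        # Walk the cycle at most once, skipping replicas marked down.
        for _ in range(len(self.replicas)):
            candidate = next(self._round_robin)
            if candidate in self.healthy:
                return candidate
        return self.primary
```

With router = ReplicatedDatabaseRouter("db-primary", ["db-replica-1", "db-replica-2"]), an INSERT is routed to db-primary while consecutive SELECT statements alternate between the two replicas. Keep in mind that replicas can lag slightly behind the primary, so a read that must see a write made a moment ago should still go to the primary.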


5. Final thoughts on how to avert a system downtime crisis

Long story short, be prepared! Assume the worst-case scenario and prepare a solution for it before it happens. At Espeo, we put a lot of effort into using our experience to design solutions right the first time during development for our clients.

The biggest part of preparing for a crisis is making sure you have everything in place to respond quickly. As the coronavirus has shown us, unexpected events like this can have a huge impact on business and on the software we rely on every day. Making sure your software can handle a rapid increase in traffic, and then quickly return to normal, will save you time, money, and reputation.

Want to learn more about devops and testing services? Drop us a line and we’ll get back to you shortly.
