The holiday shopping season is coming to a close, but the fear of outages will continue to loom over businesses and developers throughout the new year. Top companies such as Apple, Slack, Southwest Airlines, and Uber faced some of the worst outages in history this year, and they had nothing to do with the holiday season.
According to BigPanda, an incident-management process provider, Southwest Airlines faced a 12-hour outage resulting in 2,300 canceled flights, 4,500 delayed flights, and about US$54 million in losses; Apple’s trusted services such as iCloud, App Store, iTunes and Apple TV faced up to nine hours of outages; Slack’s web servers were overwhelmed, resulting in two hours of outages; and Uber faced a 20-minute global outage, causing its competitor Lyft to see an increase in interest.
While outages can’t always be avoided, there are some ways in which developers and businesses can prevent them from happening to mission-critical aspects of their solutions. Ajay Nilaver, vice president of products for BigPanda, laid out some ways businesses can avoid tech outages:
Start with the application’s architecture: You want to start with a high-level understanding of the problem space you are trying to address, the solutions you want to put together, and how you want to translate that into the application itself.
According to Nilaver, BigPanda uses a microservices architecture in order to stay ahead of the customer curve and re-architect portions of its product as needed. “Bottom line: No matter what choice you make in terms of the architecture and underlying infrastructure, you still have to use design principles and operation principles and figure out how to approach them,” he said.
In addition, you want your architecture to be resilient because you should expect things to fail. This can also help businesses prepare for things outside their control, such as DDoS attacks or other attacks from the outside, according to Nilaver.
Embrace the new notion of ChatOps: Developers live in collaboration systems today, according to Nilaver. When developers are deploying code multiple times a day into pre-production environments, testing all their code, and pushing it into production (all while living in collaboration tools), they are able to get constant feedback from monitoring systems about how well their code push went and how things are looking.
Look at your key performance indicators and metrics: KPIs coupled with rich dashboarding capabilities will allow the operations team to look at the state and health of the overall application at any point in time. According to Nilaver, you want to anticipate issues before they happen and minimize downtime as much as possible. Having fine-grained insight into your application can help you catch issues early and respond quickly enough to avoid an outage.
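As one illustration of catching issues early from a KPI, the sketch below tracks an error rate over a sliding window of recent requests and flags when it crosses a threshold. It is a minimal example, not something described by Nilaver or BigPanda: the class name ErrorRateMonitor, the window size, and the 5 percent threshold are all assumptions chosen for illustration.

```python
from collections import deque


class ErrorRateMonitor:
    """Hypothetical KPI tracker: error rate over the most recent requests."""

    def __init__(self, window_size: int = 100, threshold: float = 0.05):
        # A bounded deque drops the oldest result once the window is full.
        self.results = deque(maxlen=window_size)
        self.threshold = threshold  # assumed alerting threshold

    def record(self, ok: bool) -> None:
        """Record whether one request succeeded."""
        self.results.append(ok)

    def error_rate(self) -> float:
        """Fraction of failed requests in the current window."""
        if not self.results:
            return 0.0
        return self.results.count(False) / len(self.results)

    def should_alert(self) -> bool:
        """True when the error rate is strictly above the threshold."""
        return self.error_rate() > self.threshold
```

In practice a real monitoring system would feed a check like this from live request logs and route the alert to the operations team; here the caller simply invokes record() after each request and polls should_alert().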
Start small: By rolling out to a small subset of your install base, you can wait until you know everything is going well and then expand the features to a broader set.
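A common way to roll out to a small subset is to assign each user to a stable bucket and enable the feature only for buckets below a chosen percentage. The sketch below shows that idea under stated assumptions: the function names rollout_bucket and is_enabled, the use of SHA-256, and the 100-bucket scheme are illustrative choices, not a method attributed to Nilaver.

```python
import hashlib


def rollout_bucket(user_id: str) -> int:
    """Map a user ID to a stable bucket in the range 0-99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(user_id: str, rollout_percent: int) -> bool:
    """Return True if this user falls inside the current rollout percentage.

    Because the bucket is derived from the user ID alone, a user who is
    enabled at 5 percent stays enabled when the rollout grows to 50 percent.
    """
    return rollout_bucket(user_id) < rollout_percent
```

Raising rollout_percent from a small value toward 100 expands the feature to a broader set of users without reshuffling who already has it.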
Use performance instances: A performance instance is a separate instance where regular load testing and simulated load at scale can be performed. This can help a business figure out how well its infrastructure is scaling and help it in planning, according to Nilaver.
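To make the idea of simulated load concrete, here is a minimal sketch that sends concurrent requests to a stand-in handler and summarizes latency. Everything named here is an assumption for illustration: handle_request is a hypothetical placeholder for a call into a performance instance, and the request count and concurrency are arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import median, quantiles


def handle_request(payload: int) -> int:
    """Hypothetical stand-in for a call to the system under test."""
    time.sleep(0.001)  # simulate a small amount of work
    return payload * 2


def timed_call(payload: int) -> float:
    """Run one request and return its latency in seconds."""
    start = time.perf_counter()
    handle_request(payload)
    return time.perf_counter() - start


def run_load_test(total_requests: int, concurrency: int) -> dict:
    """Issue requests from a thread pool and summarize the latencies."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(timed_call, range(total_requests)))
    # quantiles(..., n=20) returns 19 cut points; the last is the 95th percentile.
    p95 = quantiles(latencies, n=20)[-1]
    return {
        "requests": len(latencies),
        "median_s": median(latencies),
        "p95_s": p95,
    }
```

Running the same test at increasing concurrency levels and comparing the median and 95th-percentile latency gives a rough picture of how the system scales, which can feed capacity planning.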
Embrace a continuous learning culture: Even if you check everything off your to-do list, there is still room for error. It is good to have a culture that allows you to learn from your mistakes. “Having a culture where you do retrospective analysis if something went wrong—whether that is a process improvement, tooling improvement or maybe a combination of both—already built into your philosophy will help everything be more controlled and planned out,” said Nilaver. Continue to learn from your mistakes, and you will continue to improve your overall process, he explained.
Have a backup plan: For example, if you are a global company that has users logging in from all over the world, and you see users are experiencing issues in one region, you want to be able to quickly detect that and move that traffic to a region that can serve them. “For a period of time they may feel slowdown or experience some minutes of downtime, but your architecture and online processes should be able to get end users back up and running. In the best-case scenario, end users will not even be able to detect the fact that there was an issue,” Nilaver said.
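The regional failover described above can be sketched as a simple routing decision: serve users from their preferred region when it is healthy, and otherwise fall back to the next healthy region. This is an illustrative sketch only; the function name pick_region, the region names, and the shape of the health map are assumptions, and a real system would populate health from its monitoring data.

```python
def pick_region(preferred: str, health: dict, fallback_order: list) -> str:
    """Choose the region that should serve a user.

    health maps a region name to True (healthy) or False (degraded).
    Regions missing from the map are treated as unhealthy.
    """
    if health.get(preferred, False):
        return preferred
    # Walk the fallback list in priority order, skipping the failed region.
    for region in fallback_order:
        if region != preferred and health.get(region, False):
            return region
    raise RuntimeError("no healthy region available")


# Example usage with hypothetical region names:
health = {"eu-west": False, "us-east": True, "ap-south": True}
chosen = pick_region("eu-west", health, ["eu-west", "us-east", "ap-south"])
```

In this example the preferred region is marked unhealthy, so traffic moves to the first healthy region in the fallback order.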