Thanksgiving through Black Friday is a busy season for e-retailers. A lot of revenue is generated during this period, and if a site goes down, it’s not just one person who will have some explaining to do. The IT team, project managers and the CEO will have to answer to shareholders and explain why so much money was lost on one of the best sales days of the year.

Even tech-savvy companies can experience website and application crashes. The reason is that application and web delivery is complex, especially when a site sees a traffic increase of tens of thousands of users, all trying to have a great shopping experience.

Gus Robertson, CEO of software company NGINX, recommended asking your software team questions such as how many users can be handled at once, how you plan to scale when traffic soars, who will monitor traffic, how you can recover, and who has your back. Discussing these questions with your team could save the company money and keep customers on its sites and in its applications.

SD Times: What are the things you can do right to make sure you are set up and prepared for the holiday?
Robertson: Things we recommend, number one, is monitoring. If you are not monitoring the application and the website, then you’re not going to see where the errors are. I think a lot of companies are monitoring the infrastructure but not necessarily the actual application itself, or vice versa. And you want to not just see the traffic flowing at the front, but you also want to track that traffic that flows through the entire application to the weakest link. The weakest link may be an individual service or a product catalog. I think too many companies only monitor the infrastructure and not the app, or they’re not monitoring the app the whole way through to various components where they can identify the weak areas.

The second thing is that you want to also track response time, because at the end of the day, what you want to do is deliver the best user experience possible. If you have low performance of the front end to the client, you’ll lose that client and they’ll go somewhere else. You’re tracking the response time to that individual; it’s critical in making sure you are delivering that experience.
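As a rough illustration of tracking response time through each component to find the weakest link, here is a minimal Python sketch. The component names and timings are hypothetical sample data, not figures from the interview, and the nearest-rank percentile is just one common way to summarize latency.

```python
import math
from collections import defaultdict


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def weakest_link(timings_ms, pct=95):
    """Return (component, latency_ms) for the component whose
    pct-th percentile response time is the highest."""
    by_component = defaultdict(list)
    for component, elapsed_ms in timings_ms:
        by_component[component].append(elapsed_ms)
    scores = {name: percentile(vals, pct) for name, vals in by_component.items()}
    worst = max(scores, key=scores.get)
    return worst, scores[worst]


# Hypothetical timings: (component name, response time in milliseconds).
SAMPLE_TIMINGS = [
    ("front_end", 40), ("front_end", 55), ("front_end", 48),
    ("product_catalog", 120), ("product_catalog", 900), ("product_catalog", 150),
    ("payment_gateway", 200), ("payment_gateway", 210), ("payment_gateway", 190),
]

if __name__ == "__main__":
    print(weakest_link(SAMPLE_TIMINGS))
```

In a real deployment the timings would come from whatever monitoring is already in place; the point of the sketch is only that latency is recorded per component, not just at the front of the site.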

How can you track response time?
You want to load test for concurrency, which is making sure you can handle the number of users coming into the site. That’s [like] making sure that you’ve got enough staff in your store to handle the number of customers coming in. Brick-and-mortar retailers think of that logically: okay, I need extra staff to handle the cash registers or answer any queries. In the same way, you need to know that the number of users coming to a website is going to be far greater, and ask: can I handle that many users at the front of my website?
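A concurrency load test can be sketched in a few lines of Python. This is only an illustration: `simulated_request` is a hypothetical stand-in that sleeps for about 10 ms, and a real test would replace it with an actual request to the site, usually through a dedicated load-testing tool.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def simulated_request(user_id):
    """Hypothetical stand-in for one user's request.
    Replace with a real HTTP call when testing an actual site."""
    start = time.perf_counter()
    time.sleep(0.01)  # pretend the server takes about 10 ms
    return time.perf_counter() - start


def run_load_test(request_fn, concurrent_users):
    """Issue one request per simulated user, all in parallel,
    and return the observed latencies in seconds."""
    with ThreadPoolExecutor(max_workers=concurrent_users) as pool:
        return list(pool.map(request_fn, range(concurrent_users)))


def summarize(latencies):
    """Basic summary of a list of latencies."""
    ordered = sorted(latencies)
    return {
        "requests": len(ordered),
        "median_s": ordered[len(ordered) // 2],
        "max_s": ordered[-1],
    }


if __name__ == "__main__":
    # Step the concurrency up and watch how the latencies change.
    for users in (10, 50, 100):
        print(users, summarize(run_load_test(simulated_request, users)))
```

Stepping the user count upward, as in the loop at the bottom, shows the point at which response times start to degrade.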

You need to look at your load testing and at how you can optimize and tune for performance. There are various things like caching, compression and other smart tools that improve performance at the front end of the site. The other thing is you want to look at workflows. For example, you may load test the front of the site to make sure people can get in, but if you are not load testing the very back end like the payment gateway, you may have an application performing well, but nobody can actually check out of your shopping cart.

[Make] sure you think about the workflows that people have through your site and how they get to check out, and [load test] all along the way. The point I want to make is don’t underestimate the load based on last year’s demand. The load you had last year may be very different from the load you get this year. You want to give yourself as much headroom as possible so you are prepared for any volume of traffic, even far greater than what you would expect from last year’s traffic.
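The headroom and workflow points can be made concrete with some simple arithmetic. The sketch below is illustrative only: last year's peak, the growth factor, the headroom factor and the per-step capacities are all invented numbers, and the step names are hypothetical.

```python
import math


def target_capacity(last_year_peak, growth_factor, headroom_factor):
    """Concurrent users to plan for: last year's peak, scaled by
    expected growth and then by a safety margin."""
    return math.ceil(last_year_peak * growth_factor * headroom_factor)


def steps_below_target(step_capacity, target):
    """Workflow steps whose load-tested capacity falls short of the target."""
    return [step for step, capacity in step_capacity.items() if capacity < target]


# Hypothetical load-test results: users each workflow step handled.
WORKFLOW_CAPACITY = {
    "home_page": 60000,
    "product_page": 45000,
    "cart": 30000,
    "payment_gateway": 12000,
}

if __name__ == "__main__":
    # Assumed figures: 10,000 peak users last year, 50% growth, 2x headroom.
    target = target_capacity(10000, 1.5, 2.0)
    print(target)                                         # 30000
    print(steps_below_target(WORKFLOW_CAPACITY, target))  # ['payment_gateway']
```

With these made-up numbers the front of the site passes comfortably while the payment gateway does not, which is the situation where shoppers can get in but cannot check out.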

Can companies reflect on last year’s mistakes and successes in order to improve testing this holiday season?
Every company should look at last year, look at what worked and what didn’t work, look for the vulnerabilities that were hit, look for things that didn’t break but were maybe hitting their limits, and make sure this year you address those weakest points. The weakest link in the chain will break the chain. As I said before, I think people too often focus on one aspect; very few people go to the next level of depth and actually test things like the shopping cart, the checkout and so forth.

What is the developer’s role during the holiday season?
Scaling the site is one of the core areas, and there are a number of ways to do that. Number one is around the architecture itself. Many websites and applications today are still written as monolithic apps, which means in order to scale them, you basically create clones of the monoliths to scale, and that’s not really the most efficient way to scale a website today.

A more appropriate way is a distributed architecture. Many people use the term “microservices” to describe distributed architecture, and the benefit of this approach is that you can independently scale the components of the site that are getting hit the most. So, for example, you can independently scale the shopping cart, or you can independently scale the product catalog or the product pages. You can independently scale the caching and all of the images that people are looking [at].
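To illustrate what scaling components independently means in practice, the following Python sketch sizes each service from its own load. The service names, request rates and per-instance capacities are hypothetical, and real autoscaling would be handled by the deployment platform rather than a script like this.

```python
import math


def replicas_needed(load_rps, capacity_per_instance_rps):
    """Number of instances each service needs for its own load,
    with a minimum of one instance per service."""
    return {
        service: max(1, math.ceil(load / capacity_per_instance_rps[service]))
        for service, load in load_rps.items()
    }


# Hypothetical peak load per service, in requests per second.
PEAK_LOAD = {"shopping_cart": 900, "product_catalog": 4000, "image_cache": 12000}

# Hypothetical capacity of a single instance of each service.
INSTANCE_CAPACITY = {"shopping_cart": 300, "product_catalog": 500, "image_cache": 4000}

if __name__ == "__main__":
    print(replicas_needed(PEAK_LOAD, INSTANCE_CAPACITY))
    # {'shopping_cart': 3, 'product_catalog': 8, 'image_cache': 3}
```

Each service gets only the instances its own traffic requires, whereas a monolith would have to be cloned as a whole to relieve its busiest part.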

This is where the developer role is so important, because between now and the holidays you are not going to be able to re-architect your site. What we would suggest is not making any dramatic changes to your site, because those dramatic changes are not tested, and they’re likely going to be the things that cause problems. Once the holiday season is over, make sure you’ve looked at the architecture of your site and that your developers have used modern techniques like microservices to design their architecture and their application so that it does scale.

What if a team has done all of the testing needed and the site still experiences a crash or another issue?
[Even with] a site’s best intentions and all the preparation in the world, sometimes things just go wrong. Have a failure plan in place, and make sure you have a disaster recovery site that you have also load tested and checked for performance. I think a lot of times people don’t do that. There is so much noise that goes around on Twitter and social media, so make sure you have a plan in place to handle how you inform customers, keep them updated, and win them back after the site’s back up and running. Using social media to your advantage is really the point there.

At the end of the day, it’s all about trying to give the user the right experience, and the user needs to be able to come to the site, have great performance, get to the set up pages, select and go to the cart, check out, and then they are happy. That’s what a consumer wants, and that simplified user experience is really what you are trying to provide.