When chat apps or other websites critical to getting a day’s work done experience an outage, panic sets in. Today, some teams felt that stress when Slack briefly had connectivity issues: its API requests were returning errors and chat was behaving slowly.
Ironically, some users were not even able to access Slack’s status page to learn what was happening with it. But, for those who could read it, this is the message they were given:
“Slack’s web servers are being overwhelmed at the moment and we’re working to restore full capacity and get everyone back to using Slack. API requests may respond in error and chat may behave quite slowly in the meantime.”
According to Greg Brail, chief architect at Apigee, an API-management platform, capacity problems at Slack could mean several things. A database could be overloaded, or there may simply be a bottleneck somewhere in its technology stack.
“The thing about the Internet is there is no in-house load testing environment that is going to simulate exactly what happens on the Internet, so all of these companies have to learn the hard way that getting something to run reliably at scale on the Internet is hard work,” said Brail.
Because Slack’s API is used by third-party apps, it’s important to have visibility into what’s going on, especially when a capacity problem breaks out. Brail said that when this happens, teams need to ask themselves what actually caused the problem.
“It’s important that you have visibility into which app, which developer, which version of the app, which platform is driving the traffic,” he said. “One of the questions you can ask yourself [is] what changed? Is it just that my business has changed? Or has some app [had] a new version pushed out, and maybe that new version of the app is causing the problem?”
Even though the public doesn’t know exactly what went wrong under the hood, the incident shines a light on the fact that businesses rely on APIs, and when there are issues, even brief ones, they can have a huge impact on users, according to Brail.
For other companies that want to make sure they can handle the API traffic, he said having an API-management platform in place allows teams to understand the traffic per application and how to control it.
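As a rough illustration of the kind of per-application visibility Brail describes, the sketch below tallies requests by app, version, and platform and flags any source that exceeds a quota. The log fields, the app names, and the `over_quota` helper are all hypothetical; they stand in for whatever an API-management platform would actually record and are not taken from Slack or Apigee.

```python
from collections import Counter


def traffic_by_source(requests):
    """Count requests per (app, version, platform) tuple."""
    return Counter((r["app"], r["version"], r["platform"]) for r in requests)


def over_quota(counts, quota):
    """Return (source, count) pairs above the quota, busiest first."""
    return [(src, n) for src, n in counts.most_common() if n > quota]


# Hypothetical request log: a newly released version generates most of the load.
sample = (
    [{"app": "notes-bot", "version": "2.0", "platform": "ios"}] * 5
    + [{"app": "notes-bot", "version": "1.9", "platform": "ios"}] * 2
    + [{"app": "dash", "version": "1.0", "platform": "web"}]
)

counts = traffic_by_source(sample)
flagged = over_quota(counts, quota=3)
for source, n in flagged:
    print(source, n)
```

Grouping by version as well as by app is what lets a team answer the "what changed?" question: a spike tied to one new release looks different from growth spread across every client.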
“Everybody makes mistakes and has problems scaling, and anyone that thinks they are immune is not running a high-volume service on the Internet,” said Brail. “But what we learned from the world of API management is that you need to have a mechanism in place so you have visibility into which applications are using the API, and how much volume is coming in.”
And it helps to have a status page that is not hosted in the same place as your web server.