Malicious agents can crash a website by implementing a DDoS—a Distributed Denial of Service Attack—against a server. So can sloppy programmers.

Take, for example, the National Weather Service’s website, which is operated by the United States National Oceanic and Atmospheric Administration, or NOAA. On August 29, the service went down, hard, as single rogue Android app overwhelmed the NOAA’s servers.

As far as anyone knows, there was nothing deliberately malicious about the Android app, and of course there is nothing specific to Android in this situation. However, the app in question was making service requests of the NOAA server’s public APIs every few milliseconds. With hundreds, thousands or tens of thousands of instances of that app running simultaneously, the NOAA system collapsed.

There is plenty of blame to go around. Let’s start with the app developer.

Certainly the app developer was sloppy, sloppy, sloppy. I can imagine that the app worked great in testing, when only one or two instances of the app were running at any one time on a simulator or on actual devices. Scale it up—boom! This is a case where manual code reviews may have found the problem. Maybe not.

Alternatively, the app developer could have checked to see if the public APIs it required (such as NOAA’s weather API) could handle the anticipated load. However, if the coders didn’t write the software correctly, load testing may not have sufficed. For example, say that the design of the app was to pull data every 10 seconds. If the programmers accidentally set up the data retrieval to pull the data every 10 milliseconds, the load would be 1,000x greater than anticipated. Every 10 seconds, no problem. Every 10 milliseconds, big problem. Boom!

This is a nasty bug, to be sure. Compilers, libraries, test systems, all would verify that the software ran correctly, because it did run correctly. In the scenario I’ve painted, it simply wasn’t coded to meet the design. The bug might have been spotted if someone noticed a very high number of external API calls, or again, perhaps during a manual code review. Otherwise, it’s not hard to see how it would slip through the crack.

Let’s talk about NOAA now. In 2004, the weather service beefed up its Internet loads in anticipation of Hurricane Charley, contracting with Akamai to host some of its busiest Web pages, using distributed edge caching to reduce the load. This worked well, and Akamai continued to work with NOAA. It’s unclear if Akamai also fronted public API calls; my guess is that those were passed straight through to the National Weather Service servers.

NOAA’s biggest problem is that it has little control over external applications that use its public APIs. Even so, Akamai was still in the circuit and, fortunately, was able to help with the response to the Aug. 29 accidental DDoS situation. At that time, the National Weather Service put out a bulletin on its NIDS messaging service that said:

TO – ALL CUSTOMERS SUBJECT – POINT FORECAST ISSUES. WE ARE PROVIDING NOTICE TO ALL THAT NIDS HAS IDENTIFIED AN ABUSING ANDROID APP THAT IS IMPACTING FORECAST.WEATHER.GOV. WE HAVE FORCED ALL SITES TO ZONES WHILE WE WORK WITH THE DEVELOPER. AKAMAI IS BEING ENGAGED TO BLOCK THE APPLICATION. WE CONTINUE TO WORK ON THIS ISSUE AND APPRECIATE YOUR PATIENCE AS WE WORK TO RESOLVE THIS ISSUE.

Kudos to NOAA for responding quickly and transparently to this issue. Still, this appalling situation—that a single DDoS attack could cripple such a vital service—is unacceptable. Imagine if this had been a malicious attack, rather than an accidental coding error, and if the attacker was able to modify the attack in real time to go around Akamai’s attempts to block the traffic.

What could NOAA have done differently? For best results, DDoS attacks must be blocked within the network before they reach (and overwhelm) the server. Therefore, DDoS detection and blocking systems should already have been in place.

In addition, there should also have been detection systems within the API servers themselves, with the ability to detect potential attacks due to abnormally high volumes of requests from a specific app, raise alarms, and also drop such requests (which is fast and takes few resources), instead of servicing them (which is slow and takes more resources). Perfect? No. DDoS scenarios are nasty and messy. No matter how you slice it, though, a single misbehaving app should never be able to crash your server.

Do you have code or resources in place to detect and react to a DDoS situation? Write me at alan@camdenassociates.com.