When an event like the CrowdStrike failure brings much of the world to its knees, there’s a lot to unpack. Why did it happen? How did it happen? Could it have been prevented?

On the most recent episode of our weekly podcast, What the Dev?, we spoke with Arthur Hicken, chief evangelist at the testing company Parasoft, about all of that and whether we’ll learn from the incident. 

Here’s an edited and abridged version of that conversation:

AH: I think that is the key topic right now: lessons not learned — not that it’s been long enough for us to prove that we haven’t learned anything. But sometimes I think, “Oh, this is going to be the one where we’re going to get better, we’re going to do things better.” And then other times, I look back at statements from Dijkstra in the 1970s and think maybe we’re not going to learn now. My favorite Dijkstra quote is, “If debugging is the act of removing bugs from software, then programming is the act of putting them in.” It’s a good, funny statement, but I think it’s also key to one of the important things that went wrong with CrowdStrike.

We have this mentality now, and there are a lot of different names for it — fail fast, run fast, break fast — that certainly makes sense in a prototyping phase, or in a place where nothing is really at stake when failure happens. Obviously, in this case it matters. Even with a video game, you can lose a ton of money, right? But you generally don’t kill people when a video game breaks because it did a bad update.

David Rubinstein, editor-in-chief of SD Times: You talk about how we keep having these catastrophic failures, and we keep not learning from them. But aren’t they all a little different in certain ways? You had Log4j, which you thought would be the thing that would finally make people pay more attention. And then we get CrowdStrike, but they’re not all the same type of problem?

AH: Yeah, that is true. I would say Log4j was kind of insidious, partly because we didn’t recognize how many people use it. Logging is one of those topics people worry less about. I think there is a similarity between Log4j and CrowdStrike, and that is that we have become complacent: software is built without an understanding of what the rigors are for quality, right? With Log4j, we didn’t know who built it, for what purpose, or what it was suitable for. And with CrowdStrike, perhaps they hadn’t really thought about what happens if your antivirus software makes your computer go belly up on you. What if that computer is doing scheduling for hospitals or 911 services or things like that?

And so what we’ve seen is that safety-critical systems are being impacted by software that was never built with that in mind. One of the things to think about is, can we learn something from how we build safety-critical software, or what I like to call good software? Software meant to be reliable and robust, meant to operate under bad conditions.

I think that’s a really interesting point. Would it have hurt CrowdStrike to have built their software to better standards? And the answer is it wouldn’t have. I’d posit that if they were building better software, speed would not be negatively impacted, and they’d spend less time testing and finding problems.

DR: You’re talking about safety-critical software. Back in the day, that seemed to be the purview of what they were calling embedded systems, which really couldn’t fail. They were running planes and medical devices and things that really were life and death. So is it possible that some of those principles could be carried over into today’s software development? Or did you need those specific RTOSes to ensure that kind of thing?

AH: There’s certainly something to be said for a proper hardware and software stack, but even in the absence of that, you can take a standard laptop with your OS of choice on it and still build software that is robust. I have a little slide up on my other monitor from a joint webinar with CERT a couple of years ago, and one of the studies we used there found that 64% of the vulnerabilities in the NIST database are programming errors, and 51% of those are what they like to call classic errors. I look at what we just saw with CrowdStrike as a classic error. Buffer overflows, null pointer reads, uninitialized data, integer overflows: these are what they call classic errors.
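
To make those “classic errors” concrete, here is a minimal C sketch of two of them, an unchecked integer overflow and a read of uninitialized memory. These are hypothetical illustrations, not code from CrowdStrike or any real product:

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical sketches of two "classic" error categories; not drawn
 * from any real product code. */

uint8_t *alloc_records(uint32_t count, uint32_t record_size)
{
    /* Classic error: this 32-bit multiplication can silently wrap, so a
     * huge 'count' can yield a tiny allocation that later writes overflow. */
    uint32_t total = count * record_size;
    return malloc(total);
}

int sum_flags(size_t n, const int *flags)
{
    int sum;                     /* classic error: never initialized        */
    for (size_t i = 0; i < n; i++)
        sum += flags[i];         /* reads 'sum' before it is ever assigned  */
    return sum;
}
```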

And they obviously had an effect. We don’t have full visibility into what went wrong, right? We get what they tell us. But it appears there was a buffer overflow caused by reading a config file, and one can argue about the effort and performance impact of protecting against buffer overflows, of paying attention to every piece of data. On the other hand, how long had that buffer overflow been sitting in that code? To me, a piece of code that’s responding to an arbitrary configuration file is something you have to check. You just have to check it.
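
As an illustration of what “checking” an arbitrary configuration file can look like, here is a minimal C sketch that assumes a made-up format (a field count followed by that many 32-bit fields). It shows the principle, not a reconstruction of CrowdStrike’s parser:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified config parsing: every value that comes from
 * the file is validated before it drives any memory access. */

#define MAX_FIELDS 32

struct config {
    uint32_t field_count;                 /* value originates in the file */
    uint32_t fields[MAX_FIELDS];
};

/* Returns 0 on success, -1 if the file's header doesn't match its contents. */
int parse_config(const uint8_t *buf, size_t len, struct config *out)
{
    if (buf == NULL || out == NULL || len < sizeof(uint32_t))
        return -1;

    uint32_t declared;
    memcpy(&declared, buf, sizeof declared);   /* how many fields the file claims */

    if (declared > MAX_FIELDS)
        return -1;                             /* more than we can hold */
    if (len < sizeof(uint32_t) + (size_t)declared * sizeof(uint32_t))
        return -1;                             /* fewer bytes than the header claims */

    out->field_count = declared;
    memcpy(out->fields, buf + sizeof(uint32_t),
           (size_t)declared * sizeof(uint32_t));
    return 0;
}
```

The extra cost is a handful of comparisons per parse, which is the trade-off being weighed here against how long an unchecked read can sit undetected in the code.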

The question that would keep me up at night if I were on the team at CrowdStrike is: okay, we find it, we fix it, and then, where else is this exact problem? Are we going to go and look and find six, or 60, or 600 other potential bugs sitting in the code, only exposed because of an external input?

DR: How much of this comes down to technical debt, where you have these things that linger in the code that never get cleaned up, and things are just kind of built on top of them? And now we’re in an environment where if a developer is actually looking to eliminate that and not writing new code, they’re seen as not being productive. How much of that is feeding into these problems that we’re having?

AH: That’s a problem with our current common belief about what technical debt is, right? I mean, the original metaphor is solid: the idea that stupid things you’re doing now, or things you’re failing to do now, will come back to haunt you in the future. But simply running some kind of static analyzer and calling every unaddressed issue technical debt is not helpful. And not every tool can find buffer overflows that don’t yet exist. There are certainly static analyzers that can look for design patterns that would allow buffer overflows, or enforce design patterns that disallow them. In other words, they look for the existence of a size check. And those are the kinds of findings that, when people are dealing with technical debt, they tend to call false positives. Good design patterns are almost always viewed as false positives by developers.
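
As a rough sketch of the kind of design-pattern rule being described, here is what an analyzer might flag versus what satisfies a “size check must exist” rule. The function names and the rule itself are hypothetical:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical illustration of the "size check" design pattern a static
 * analyzer can enforce; neither function comes from a real codebase. */

#define NUM_PARAMS 16

static uint32_t params[NUM_PARAMS];

/* What an analyzer would flag: the index arrives from outside and is used
 * directly. There is no overflow today because callers "always" pass a
 * valid index, which is exactly the unstated invariant a developer may
 * defend as a false positive when the warning appears. */
uint32_t get_param_unchecked(size_t idx)
{
    return params[idx];
}

/* What the rule asks for: an explicit bounds check, so a change in the
 * input can't silently become an out-of-bounds read. */
int get_param_checked(size_t idx, uint32_t *out)
{
    if (out == NULL || idx >= NUM_PARAMS)
        return -1;
    *out = params[idx];
    return 0;
}
```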

So again, we have to change the way we think; we have to build better software. Dodge said back in, I think it was the 1920s, that you can’t test quality into a product. And the mentality in the software industry is that if we just test it a little more, we can somehow find the bugs. There are some things that are very difficult to protect against, but buffer overflows, integer overflows, uninitialized memory, and null pointer dereferences are not rocket science.

