Amid all the noise about generative AI and software development, we haven’t seen much thoughtful discussion of software testing specifically. We’ve been experimenting with ChatGPT’s test-writing capabilities and wanted to share our findings. In short: we conclude that ChatGPT is only somewhat useful for writing tests today, but we expect that to change dramatically in the next few years, and developers should be thinking now about how to future-proof their careers.
We’re the cofounders of Codecov, a company acquired by Sentry that specializes in code coverage, so we’re no strangers to testing. For the past two months, we’ve been exploring the ability of ChatGPT and other generative AI tools to write unit tests. Our exploration primarily involved providing ChatGPT with the code for a particular function or class along with its coverage information. We then prompted ChatGPT to write unit tests for any part of the provided code that was uncovered, and checked whether the generated tests actually exercised the uncovered lines.
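For readers who want to try something similar, here is a minimal sketch of that loop, assuming a coverage.py JSON report produced by `coverage run -m pytest && coverage json`. The module path and the `ask_chatgpt()` helper are illustrative placeholders, not our actual tooling or a real API.

```python
# Minimal sketch: build an LLM prompt from a module's source plus the lines
# that coverage.py reports as never executed. Paths and ask_chatgpt() are
# hypothetical stand-ins for whatever project layout and LLM client you use.
import json
from pathlib import Path

def build_prompt(module_path: str, coverage_json: str = "coverage.json") -> str:
    """Combine a module's source with its uncovered line numbers into a prompt."""
    report = json.loads(Path(coverage_json).read_text())
    missing = report["files"][module_path]["missing_lines"]  # lines never executed
    source = Path(module_path).read_text()
    return (
        "Here is a Python module:\n\n"
        f"{source}\n\n"
        f"Lines {missing} are not covered by the existing test suite. "
        "Write pytest unit tests that exercise those lines."
    )

# generated_tests = ask_chatgpt(build_prompt("myproject/parser.py"))
# We then ran the generated tests and re-measured coverage to see whether the
# previously uncovered lines were in fact exercised.
```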
We’ve found that ChatGPT can reliably handle 30% to 50% of test writing today, but the tests it handles well are mostly the easy ones: tests of trivial functions and relatively straightforward code paths. This suggests that ChatGPT is of limited use for test writing right now, since organizations with any amount of testing culture will typically have written their most straightforward tests already. Where generative AI will be most helpful in the future is in correctly testing more complex code paths, freeing developer time and attention for more challenging problems.
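To make “straightforward” concrete, here is an illustrative example (not taken from our actual experiments) of the kind of function, and the kind of generated test, that ChatGPT handles reliably today:

```python
# Illustrative only: a trivial function with an obvious happy path and one
# error branch, plus the sort of pytest tests ChatGPT produces for it.
import pytest

def apply_discount(price: float, percent: float) -> float:
    """Return price reduced by percent, rejecting out-of-range discounts."""
    if percent < 0 or percent > 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (1 - percent / 100), 2)

def test_apply_discount_basic():
    assert apply_discount(100.0, 25) == 75.0

def test_apply_discount_invalid_percent():
    with pytest.raises(ValueError):
        apply_discount(100.0, 150)
```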
However, we have already seen improvements in the quality of test generation, and we expect this trend to continue in the coming years. First, very large, tech-forward organizations like Netflix, Google, and Microsoft are likely to build models for internal use trained on their own systems and libraries. This should allow them to achieve substantially better results, and the economics are too compelling for them not to do so. Given the rapid rates of improvement that we’re seeing from generative AI programs, a well-trained LLM could be writing a large portion of these companies’ software tests in the near future.
Further out, in the next three to five years, we anticipate that all organizations will be impacted. The companies developing generative AI tools – whether Scale AI, Google, Microsoft, or someone else – will train models to better understand code, and once AI is smart enough to understand the structure of code and how it executes, there is no reason that future generations of AI tools won’t be able to handle all unit testing. (Google made an announcement along these lines just last month.) In addition, Microsoft’s ownership of GitHub gives it an enormous platform for easily distributing AI coding tools to millions of software developers, meaning large-scale adoption can happen very quickly.
Whether the world will be ready for fully automated testing is another question. Much like self-driving cars, we expect that AI will be able to write 100% of code before humans are 100% ready to trust it. In other words, even when AI can handle all unit testing, organizations will still want humans as a backstop to review any code that AI has written, and may still prefer human-authored tests for the most critical code paths. Developers will also still want metrics like code coverage to verify that AI-written tests actually exercise the code they claim to. Trust may take a long time to build.
Looking further out, AI may redefine how we approach software testing entirely. Rather than generating and executing automated tests, the testing framework may be the AI itself. It’s not out of the question that a sufficiently advanced and well-trained AI with access to enough computing resources could simply exercise all code paths for us, report any executions that fail, and recommend fixes for those failing paths, or just automatically correct them in the course of analyzing and executing the code. This could obviate the need for software testing in the traditional sense altogether.
In any event, it’s likely that in the coming years AI will be able to do much of the work that developers do today, testing included. This could be bad news for junior engineers, but it remains to be seen how this will play out. We can also imagine a scenario in which “AI + junior engineers” could do the work of a mid-level engineer at lower cost, so it’s unclear who will be most affected.
Whatever the case, it’s important to experiment with these tools now if you’re not doing so already. Ideally, your organization is already providing opportunities to test generative AI tools and determine how they can make teams more productive and efficient, now or in the near future. Every company should be doing this. If that’s not the case where you work, you should still be experimenting with your own code on your own time.
One way to think about the role AI will fill is to think of it as a junior developer. If you want to stay “above the algorithm” and have a continuing role alongside AI, pay attention to where junior developers tend to fail today, because that’s where humans will be needed.
The ability to review code will always be important. Instead of writing code, think of your role as a reviewer or mentor, the person who supervises the AI and helps it to improve. But whatever you do, don’t ignore it, because it’s clear to us that change is coming and our roles are all going to shift.