We’re on the brink of a seismic shift in software development, with AI-powered code generation and refactoring tools positioned to reshape how developers write, maintain, and optimize code. Organizations everywhere are evaluating and implementing AI tools to deliver more features faster, bridge skill gaps, improve code quality, reduce technical debt, and save costs. But is today’s AI really ready for the scale and precision demanded by enterprise-level codebases?

AI’s Role in Software Development: Promise and Pitfalls

The primary use of AI in coding right now is in code authorship—creating new code with assistants such as GitHub Copilot. These tools have proven that AI can make coding faster and improve developer productivity by providing relevant suggestions. Yet, when it comes to maintaining and refactoring complex codebases at scale, generative AI (GenAI) has clear limitations. Each edit it suggests requires developer oversight, which can work for generating new code in isolated tasks but becomes unwieldy across extensive, interconnected systems.

Unlike traditional programming or even code generation tasks, refactoring at scale requires transforming code in thousands of locations within a codebase, potentially across repositories with millions or billions of lines. GenAI models are not built for this level of transformation; they are designed to generate the most probable output from their immediate context, which inherently limits their accuracy at scale. Even a 0.01% error rate adds up: a transformation touching one million call sites would still produce roughly 100 incorrect edits, each a potential source of critical errors, costly debugging cycles, and rollbacks.

In one instance, a senior developer using Copilot accepted a misspelled configuration property (JAVE_HOME instead of JAVA_HOME) that caused a deployment failure. AI suggestions often contain subtle but impactful errors like this, and even seasoned developers fall victim to them, even in authorship scenarios that edit only a single file at a time.
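
To illustrate why such a typo is so easy to miss, here is a minimal sketch (not the actual incident): the misspelled name simply resolves to nothing at runtime, so the failure surfaces far from the typo, at deployment time.

```java
// Minimal sketch: the misspelled environment variable resolves to null,
// so nothing fails until the deployment actually runs this check.
public class JavaHomeCheck {
    public static void main(String[] args) {
        String javaHome = System.getenv("JAVE_HOME"); // typo: should be "JAVA_HOME"
        if (javaHome == null) {
            // The failure surfaces here, far from the one-letter mistake.
            throw new IllegalStateException("JAVA_HOME is not set");
        }
        System.out.println("Using JDK at " + javaHome);
    }
}
```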

Refactoring and analyzing code at scale requires more than quick suggestions. It requires precision, dependability, and broad visibility across a codebase—all areas where GenAI, which is inherently probabilistic and suggestive, falls short. For true mass-scale impact, we need a level of accuracy and consistency that today’s GenAI alone can’t yet provide.

Beyond Copilots: Mass-Scale Refactoring Needs a Different Approach

One thing we know is that large language models (LLMs) are data-hungry, yet there’s a shortage of source code data to feed them. Code-as-text and even Abstract Syntax Tree (AST) representations are insufficient for extracting data about a codebase. Code has a unique structure, strict grammar, and intricate dependencies, with type information that only a compiler can deterministically resolve. These elements contain valuable insights for AI, yet remain invisible in text and syntax representations of source code.

This means AI needs access to a better data source for code, such as the Lossless Semantic Tree (LST), which retains type attribution and dependencies from the source code. LSTs provide a machine-readable representation of code that enables precise and deterministic handling of code analysis and transformations, an essential step toward truly scalable code refactoring.
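
As a concrete illustration, here is a minimal search visitor written against OpenRewrite's visitor API (the class name and matcher pattern are illustrative choices): because the LST carries compiler-resolved type attribution, the matcher finds calls by their fully qualified type rather than by their spelling in the source text.

```java
import org.openrewrite.ExecutionContext;
import org.openrewrite.java.JavaIsoVisitor;
import org.openrewrite.java.MethodMatcher;
import org.openrewrite.java.tree.J;
import org.openrewrite.marker.SearchResult;

// Finds calls by compiler-resolved type, not by text. A plain text search
// for "info(" would also match unrelated methods that share the name;
// the LST's type attribution rules those out deterministically.
public class FindLoggerInfoCalls extends JavaIsoVisitor<ExecutionContext> {
    private static final MethodMatcher INFO =
            new MethodMatcher("org.slf4j.Logger info(..)");

    @Override
    public J.MethodInvocation visitMethodInvocation(J.MethodInvocation method,
                                                    ExecutionContext ctx) {
        if (INFO.matches(method)) {
            return SearchResult.found(method); // mark the match for review
        }
        return super.visitMethodInvocation(method, ctx);
    }
}
```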

Additionally, AI models can be augmented using techniques such as Retrieval-Augmented Generation (RAG) and tool calling, which enable models to work effectively at scale across entire codebases.
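
As a rough sketch of the retrieval side (the index shape is an assumption here, and how embeddings are produced is left out entirely): rank indexed code snippets by similarity to the query and send only the top few to the model, rather than the whole repository.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// RAG sketch: rank indexed code snippets by cosine similarity to the
// query embedding and keep only the top k for the model's context.
// Producing the embeddings themselves is assumed given and out of scope.
public class CodeRetriever {
    private final Map<String, float[]> index; // snippet -> embedding

    public CodeRetriever(Map<String, float[]> index) {
        this.index = index;
    }

    public List<String> topK(float[] query, int k) {
        return index.entrySet().stream()
                .sorted(Comparator.comparingDouble(
                        (Map.Entry<String, float[]> e) -> -cosine(query, e.getValue())))
                .limit(k)
                .map(Map.Entry::getKey)
                .toList();
    }

    private static double cosine(float[] a, float[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb) + 1e-9);
    }
}
```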

The newest technique for building agentic experiences is tool calling. It allows the model to drive a natural language human-computer interaction while invoking tools such as a calculator to do math or a deterministic OpenRewrite recipe (i.e., a validated code transformation or search pattern) to extract data about, and take action on, the code. This enables experiences such as describing dependencies in use, upgrading frameworks, fixing vulnerabilities, and locating where a piece of business logic is defined (e.g., where is the payment processing code?)—all at scale across many repositories while producing accurate results.
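
Here is a deliberately simplified sketch of that loop. The ChatModel and RecipeRunner interfaces are hypothetical stand-ins for a real LLM client and a recipe execution service; only the two recipe names offered as tools are real OpenRewrite recipes.

```java
import java.util.List;
import java.util.Map;

// Hypothetical interfaces: stand-ins for a real LLM client and an
// OpenRewrite execution service. Only the recipe names below are real.
interface ChatModel {
    ToolCall chat(String userMessage, List<String> availableTools);
}

record ToolCall(String toolName, Map<String, String> arguments) {}

interface RecipeRunner {
    String run(String recipeName, Map<String, String> options);
}

public class RefactorAgent {
    private final ChatModel model;
    private final RecipeRunner recipes;

    public RefactorAgent(ChatModel model, RecipeRunner recipes) {
        this.model = model;
        this.recipes = recipes;
    }

    public String handle(String request) {
        // The model chooses which deterministic recipe to invoke; the
        // recipe, not the model, performs the actual search or edit.
        ToolCall call = model.chat(request, List.of(
                "org.openrewrite.java.search.FindMethods",
                "org.openrewrite.java.migrate.UpgradeToJava17"));
        return recipes.run(call.toolName(), call.arguments());
    }
}
```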

AI in Mass-Scale Code Changes: Trust, Security, and Cost

For any AI implementation at scale, organizations must address three key concerns: trust, security, and cost.

  1. Trust: Implementing accurate guardrails is essential to scale with confidence. Using OpenRewrite recipes and LSTs, for instance, allows AI to operate within the guardrails of tested, rules-based transformations, building a foundation of trust with developers.
  2. Security: Proprietary code is a valuable asset, and security is paramount. While third-party AI hosting can pose risks, a dedicated, self-hosted AI instance ensures that code remains secure, providing confidence for enterprise teams handling sensitive IP.
  3. Cost: Mass-scale AI is resource-intensive, with substantial computational demands. Using strategies like RAG can save significant costs and time—and improve the quality of output. Also, by selectively deploying models and techniques based on task-specific needs, you can control costs without sacrificing performance (see the sketch after this list).
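
A sketch of what that task-based selection might look like (the model names are placeholders, not real endpoints): route cheap models to retrieval and explanation, and reserve a larger model for the cases where code must actually be generated.

```java
// Hypothetical cost-aware routing: cheap models for retrieval and
// explanation, a larger model only when code must be generated.
public class ModelRouter {
    enum Task { SEARCH, EXPLAIN, GENERATE }

    static String pick(Task task) {
        return switch (task) {
            case SEARCH -> "small-embedding-model";  // cheapest per token
            case EXPLAIN -> "mid-size-chat-model";
            case GENERATE -> "frontier-code-model";  // reserved for hard cases
        };
    }
}
```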

Leveraging AI for Code Responsibly at Scale

We will continue to see LLMs improve, but their limitation will always be the data, particularly for coding use cases. Organizations must approach mass-scale refactoring with a balanced view—leveraging AI’s strengths but anchoring it in the rigor and structure necessary for precision at scale. Only then can we move beyond the hype and truly unlock AI’s potential in the world of large-scale software engineering.
