If the 1990s was the internet era and the 2010s was the smartphone era, then it's clear that the 2020s will be defined by Large Language Models (LLMs) and AI tools. In a decade where nearly every field is being reshaped by advancements in AI, software testing is no exception.
Software testing is the foundation upon which software quality assurance is built. The development of testing techniques has a long and storied history stretching back to the early days of software development itself.
Mike Cohn introduced the concept of the "testing pyramid" in "Succeeding with Agile," defining a three-tiered approach for bucketing tests. The pyramid is a deliberately chosen shape, emphasizing the ideal proportions of each type of test in a balanced testing strategy.
* Unit tests, designed to test an individual, isolated "unit" of code, such as a function, form the base of the pyramid and are usually written alongside the code itself (a minimal example follows this list).
* Integration tests, the middle layer, are designed to test the interactions between units of code, ensuring that the individual "gears" mesh together properly.
* End-to-End tests, at the top, are designed to validate entire user journeys across the code flow, ensuring the whole machine runs smoothly.
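As a minimal sketch of the base of the pyramid, the JUnit 5 example below tests a single pure function in isolation. The `PriceCalculator` class is a hypothetical unit under test, not taken from any real codebase.

```java
import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.assertEquals;

// Hypothetical unit under test: a pure function with no external dependencies.
class PriceCalculator {
    static int applyDiscount(int priceInCents, int percentOff) {
        return priceInCents - (priceInCents * percentOff / 100);
    }
}

// Base-of-the-pyramid test: fast, isolated, and written alongside the code itself.
class PriceCalculatorTest {
    @Test
    void appliesTenPercentDiscount() {
        assertEquals(900, PriceCalculator.applyDiscount(1000, 10));
    }
}
```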
Given the nature of each test type, the quantity needed of each decreases as we move up the pyramid. Unit tests are cheap to add during development, while integration and end-to-end tests become increasingly complex to set up and slow to execute. This pyramid has been a venerated and useful framing for striking the right balance between testing setup, maintenance costs, and speed of execution since its inception.
The question of automated test generation was first raised in the 1970s. A gap of several decades followed between that initial research and test generation tools actually becoming available for use in the 2010s, primarily built on search-based metaheuristics.
The past few years have seen a radical departure from the previous approach with the shift to LLM-based tools. Ideally, each approach to automation should not only meet code coverage goals but also, for practical purposes, integrate seamlessly with industrial-scale continuous deployment workflows. The latter wasn't really the case until AI came along.
This article discusses the upsides and shortcomings of both software test generation techniques, i.e., search-based and LLM-based, with a particular focus on the recent advances in the LLM-based testing area.
Then we break down approaches to combining these automation techniques with static analysis tools and manual testing.
Finally, we'll draw some conclusions on how to build an effective testing strategy in this new era using these advancements in LLM-based predictive AI, maximizing utilization of precious manual testing resources while pointing towards a reshaping of the traditional testing pyramid.
Key Test Generation Techniques
Search-Based Software Test Generation
Search-based test generation approaches are usually built upon either genetic algorithms (metaheuristics inspired by natural selection) or local search optimization techniques such as simulated annealing.
Regardless of the underlying strategy, search algorithms share a key component: the "fitness function," i.e., the goal criterion used to guide the algorithm towards better solutions. Code coverage, though simplistic, is an often-used metric for gauging how good a test suite is, and is therefore a commonly used fitness function when generating tests with search-based approaches.
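As an illustration only, here is a heavily simplified sketch of a genetic-algorithm loop guided by a branch-coverage fitness function. The `TestSuite` and `CoverageRunner` interfaces are hypothetical placeholders; real tools implement this far more elaborately.

```java
import java.util.*;

// Simplified sketch of a genetic algorithm guided by a coverage-based fitness function.
// TestSuite and CoverageRunner are hypothetical placeholders for real tooling.
interface TestSuite {
    TestSuite mutate(Random rng);                      // small random change to the suite
    TestSuite crossover(TestSuite other, Random rng);  // recombine two parent suites
}

interface CoverageRunner {
    double branchCoverage(TestSuite suite);            // fraction of branches executed, in [0, 1]
}

final class SearchBasedGenerator {
    // Fitness function: the criterion that steers the search (here, branch coverage).
    static double fitness(TestSuite suite, CoverageRunner runner) {
        return runner.branchCoverage(suite);
    }

    static TestSuite evolve(List<TestSuite> population, CoverageRunner runner,
                            int generations, Random rng) {
        Comparator<TestSuite> byFitness =
                Comparator.comparingDouble((TestSuite s) -> fitness(s, runner)).reversed();

        for (int g = 0; g < generations; g++) {
            population.sort(byFitness);                          // fittest suites first
            int survivors = Math.max(1, population.size() / 2);  // keep the fitter half
            List<TestSuite> next = new ArrayList<>(population.subList(0, survivors));
            while (next.size() < population.size()) {            // refill via crossover + mutation
                TestSuite p1 = next.get(rng.nextInt(survivors));
                TestSuite p2 = next.get(rng.nextInt(survivors));
                next.add(p1.crossover(p2, rng).mutate(rng));
            }
            population = next;
        }
        population.sort(byFitness);
        return population.get(0);                                // best suite found within the budget
    }
}
```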
In practical applications of this technique, several open source tools have been developed, with EvoSuite being a popular option using a genetic-algorithm approach to generate unit tests for Java code. EvoSuite lets users pick different code coverage criteria to optimize, including line or branch-based coverage.
Search strategies lend themselves particularly well to optimizing whole test suites in pursuit of the fitness function. This comes at a computational cost, since evaluating whole suites against coverage criteria at each generation step is expensive, and at a comprehension cost, since most such algorithms generate tests that still need significant post-processing to achieve human readability. Despite search techniques being researched and used for decades, these costs have prevented them from becoming ubiquitous in industry applications.
LLM-Based Test Generation
In recent years, through advancements in Large Language Model-based Software Engineering (LLMSE), code and test generation tooling has been gaining traction by the day. Google Research made significant advances in applying LLMs to code generation, starting with inline code completion using predictive AI.
This is a neat application, not just from the standpoint of applying LLMs but also from a user experience perspective, since it feels like a natural extension of autocompletion. Reports from adoption in enterprise contexts indicate a 37% acceptance rate of AI code completion suggestions, which also contributed to completing 50% of code characters overall, excluding copy-pasted characters. The process self-improves through a feedback loop between model improvements, context construction heuristics, and tuning models on previous accept/reject outcomes, so the system learns from practical user behavior.
Test generation can be considered a subfield of LLMSE, with the key components of an LLM-based test generation strategy including inputs such as the code under test, prompt generators, test validation, and prompt refiners that tune the generated tests in a feedback loop. Compared to search-based strategies, this technique is still in its infancy, but it has gained traction because prompt refinement over predictive AI output produces human-readable tests requiring little post-processing.
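A minimal sketch of that generate-validate-refine loop might look like the following; the `LlmClient` and `TestValidator` interfaces and the acceptance criteria here are illustrative assumptions, not any specific tool's API.

```java
// Sketch of an LLM-based generate-validate-refine loop for unit tests.
// LlmClient and TestValidator are hypothetical abstractions, not a real vendor API.
interface LlmClient {
    String complete(String prompt);                    // returns generated test source
}

interface TestValidator {
    ValidationResult validate(String testSource);      // compile and run the candidate tests
}

record ValidationResult(boolean builds, boolean passes, double coverageGain, String diagnostics) {}

final class LlmTestGenerator {
    static String generate(String codeUnderTest, LlmClient llm, TestValidator validator,
                           int maxAttempts) {
        // Prompt generator: the code under test plus instructions to write unit tests.
        String prompt = "Write JUnit tests for the following code:\n" + codeUnderTest;

        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            String candidate = llm.complete(prompt);
            ValidationResult result = validator.validate(candidate);

            // Test validation: keep only tests that build, pass, and add coverage.
            if (result.builds() && result.passes() && result.coverageGain() > 0) {
                return candidate;
            }

            // Prompt refiner: feed diagnostics back into the next attempt.
            prompt = "The previous tests failed validation with:\n" + result.diagnostics()
                    + "\nRewrite them for this code:\n" + codeUnderTest;
        }
        return null;                                    // no candidate survived validation
    }
}
```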
The first industrial-scale deployment of this technique, backed by significant quality assurance improvements, came in early 2024 with Meta introducing TestGen-LLM to auto-generate unit tests for their codebase. The evaluation from their first trials on Instagram Reels and Stories concluded with overwhelmingly positive results: 57% of TestGen-LLM's generated tests passed reliably, and 25% of them increased coverage. Overall, engineers accepted 73% of TestGen-LLM's recommended tests, accelerating future explorations in this area. The tool is backed by Meta's internally developed LLMs, but the ensemble approach itself is LLM-agnostic. This level of scaling is a significant milestone for industrial deployment of test generation tools, unlocking a clear path to changing continuous developer workflows by making unit tests easy to automate at scale.
Additionally, LLM-generated tests have proven themselves to be a nifty trick to cover the "last mile" gaps when combined with a search-based algorithm.
A test suite generated via search techniques might get close to its coverage goal but get stuck in local optima. In this case, an LLM prompting and feedback loop can help it escape. TestSpark is an open-source IntelliJ plugin that attempts to do precisely that for Java and Kotlin workflows, using EvoSuite to generate a test suite and letting users modify the suite by providing prompts and selecting specific LLM models to cover gaps.
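A hybrid pipeline along these lines could be sketched as follows: run the search-based tool first, then prompt an LLM only for the branches the search left uncovered. All the interfaces below are hypothetical stand-ins rather than TestSpark's or EvoSuite's actual APIs.

```java
import java.util.List;

// Sketch of a hybrid pipeline: search-based generation first, then LLM prompts
// targeted at whatever the search could not reach. All types are hypothetical.
interface SearchEngine {
    String generateSuite(String classUnderTest);       // e.g., an EvoSuite-style run
}

interface CoverageTool {
    CoverageReport measure(String classUnderTest, String testSuite);
}

record CoverageReport(double covered, List<String> uncoveredBranches) {}

interface Llm {
    String complete(String prompt);
}

final class HybridTestGenerator {
    static String generate(String classUnderTest, SearchEngine search, CoverageTool coverage,
                           Llm llm, double targetCoverage) {
        // Step 1: let the search-based tool get as far as it can toward the coverage goal.
        String suite = search.generateSuite(classUnderTest);
        CoverageReport report = coverage.measure(classUnderTest, suite);
        if (report.covered() >= targetCoverage) {
            return suite;
        }

        // Step 2: ask the LLM for the "last mile" - tests for the branches still uncovered.
        String prompt = "Write additional JUnit tests for this class:\n" + classUnderTest
                + "\nCover only these uncovered branches:\n"
                + String.join("\n", report.uncoveredBranches());
        return suite + "\n" + llm.complete(prompt);
    }
}
```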
Generative and predictive AI are not without shortcomings and concerns, especially given their relative infancy compared to predecessor automation techniques. Their outputs are only as good as the quality of the training data, and they absorb the inherent biases of their training sets. There are also concerns that model training encourages memorization rather than learning. Model training and refinement are continuously developing areas of research, as is careful curation of diverse training data in an attempt to smooth over these limitations.
Building the Foundations of a Sustainable Testing Strategy
Automating unit test generation at scale has catapulted us into the next era of software quality assurance. Considering the full picture to build a successful testing strategy involves combining the right test automation techniques with static analysis tools and manual end-to-end testing.
Static analysis tooling provides an early signal for detecting memory leaks, security vulnerabilities, etc., without running the code. Then come tests that run the code, split into buckets according to the original testing pyramid. Human domain expertise is still essential to this holistic strategy. What automation, particularly in the sphere of unit testing, has done is flatten the testing pyramid, expanding the reach of unit testing with fewer resources. This frees up human expertise to focus on the narrower but more complex tiers of the pyramid - integration and end-to-end testing. The top of the pyramid becomes the key focus of human intervention.
The field of software quality assurance is obsessed with marginal optimizations and stacking improvements on top of one another. But recent advancements in AI-based automation have given us the opportunity to make a subtle but important reframe: taking a step back and considering the combination of tools and techniques that underpin a successful, cohesive quality assurance strategy.
Categorizing these techniques, understanding how they work, and how to effectively view them through the lens of a modified testing pyramid is one approach to doing just that.