github repo

There is growing interest in using LLMs to generate unit tests. They have shown potential to produce tests that look natural and resemble those written by developers, with sane variable names and typical assertions.

However, despite some initial success and praise from AI enthusiasts, generating tests with LLMs involves challenges that differ from general code generation. Junjie Wang et al. [1] describe some of these challenges much better than I could. Some of them relate to the project’s testing style, such as the choice of test libraries or mocking versus faking.

While working on a personal project, I used ChatGPT to generate tests, which proved to be quite helpful. To generate quality tests, my method involved manually gathering relevant info from the repo, feeding it to ChatGPT for test generation, and then handling post-generation tasks such as fixing imports, running the tests, and addressing any errors. If the tests failed, I either used ChatGPT again for corrections or manually fixed them before verifying their validity.

After some time, I came to the conclusion that, when provided with all the necessary information, ChatGPT can generate very good tests about 90% of the time and serve as an effective “pair programmer”. This observation motivated me to automate the entire process. My goal was to experiment with different prompts, explore various code mining strategies, and find ways to leverage LLMs for even better test generation.

Following the automation of test generation, I began developing a custom “AI coding assistant” as a VS Code extension tailored to my needs (that I have no plans to ever release). This extension essentially builds upon the strategies I had been using in the project I’m discussing.

Similar approaches have already been explored:

  • Michele Tufano et al. [2] pretrained a BART transformer model on a corpus of test cases mapped to their corresponding focal methods, used the focal method’s whole class as focal context, and fine-tuned the model on the unit test generation task
  • Max Schafer et al. [3] developed a tool called “TestPilot”, which builds prompts from the focal function’s signature, body, documentation, and usage examples. Additionally, if the test run fails, they add the error messages to the prompt and ask the model to fix the test (formally called the “validate-and-fix paradigm”)
  • Yinghao Chen et al. [4] developed “ChatUniTest”, a unit test generation framework for Java that uses an off-the-shelf LLM (ChatGPT 3.5) to generate tests. They construct the prompt by parsing the project and exploring the Abstract Syntax Tree (AST) to gather imports, the focal method’s class, signature, body, and method invocations (with their source code). They also use the LLM to repair the test if there is a compile or runtime error

My approach

Description

Overview of the test generation technique used

Parse project

To parse the project, I used Go’s golang.org/x/tools/go/packages library, which did all the hard work. This package made it straightforward to load, parse, and analyze Go source code, though I still had to familiarise myself with some concepts to be able to extract info about the focal method. A valuable resource that helped me a lot was the repo https://github.com/golang/example/tree/master/gotypes, which explains some of the core concepts of the go/types package. Once I understood these concepts, it was easier to rely only on the standard library’s documentation, which is really well-written.
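A minimal sketch of what this loading step looks like (the Mode flags and the Load call are the real go/packages API; the patterns and error handling are illustrative):

```go
package main

import (
	"fmt"
	"log"

	"golang.org/x/tools/go/packages"
)

func main() {
	cfg := &packages.Config{
		Mode: packages.NeedName | packages.NeedFiles | packages.NeedSyntax |
			packages.NeedTypes | packages.NeedTypesInfo | packages.NeedImports,
		Dir: ".", // project root
	}
	pkgs, err := packages.Load(cfg, "./...")
	if err != nil {
		log.Fatal(err)
	}
	for _, pkg := range pkgs {
		// pkg.Syntax holds the parsed ASTs; pkg.TypesInfo holds the type
		// information needed later to resolve identifiers in the focal method.
		fmt.Println(pkg.PkgPath, len(pkg.Syntax), "files")
	}
}
```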

After I parsed the project, I identified the focal method (or function under test) and gathered its context. This included documentation, function signature, function body, types of all used identifiers, and the bodies of all invoked functions. By collecting the types of all used identifiers, I also implicitly cover the function’s parameters and results. This detailed information was then incorporated into the prompts.
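As a rough illustration of how the identifier types can be collected from that parsed context (the helper and package names are mine, not from the project):

```go
package testgen

import (
	"go/ast"

	"golang.org/x/tools/go/packages"
)

// collectIdentifierTypes walks the focal function's body and records the type
// of every identifier it uses, relying on the Uses map produced by the type
// checker (pkg.TypesInfo).
func collectIdentifierTypes(pkg *packages.Package, fn *ast.FuncDecl) map[string]string {
	found := map[string]string{}
	ast.Inspect(fn.Body, func(n ast.Node) bool {
		ident, ok := n.(*ast.Ident)
		if !ok {
			return true
		}
		if obj := pkg.TypesInfo.Uses[ident]; obj != nil {
			found[ident.Name] = obj.Type().String()
		}
		return true
	})
	return found
}
```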

The hardest part was probably identifying tests relevant to the one I wanted ChatGPT to generate. I assumed that if ChatGPT is given example tests that partially resemble the unit test we want, it will generate higher-quality tests. For example, maybe a similar setup is required, or existing mock or fake implementations could be reused. Finding tests similar to the one we’re trying to generate could guide ChatGPT to follow the same standards as the existing tests.

I tried 3 strategies; the first two yielded poor results, so I kept the third:

  • Find tests based on code similarity
    • If focal method A is similar to function B, and B has tests, use B’s tests as examples. I defined “code similarity” as the Hamming distance between two functions with all used identifiers abstracted away.
    • This selection of tests didn’t work well, probably because of the code similarity algorithm used
  • Find tests that use interfaces similar to the ones used in the focal method
    • This one seemed promising, but on public repos it often led to the selection of tests that were poor examples
  • Random selection of test files near the focal method
    • The final approach, which worked reasonably well, was to choose 3-4 test files within a common package root path (usually a depth of 2-3); a sketch follows this list
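A rough sketch of that third strategy (the directory depth and the sample size are my own illustrative choices, not exact values from the project):

```go
package testgen

import (
	"io/fs"
	"math/rand"
	"path/filepath"
	"strings"
)

// nearbyTestFiles walks a couple of directories above the focal file and
// returns up to max randomly chosen *_test.go files to use as examples.
func nearbyTestFiles(focalFile string, max int) ([]string, error) {
	root := filepath.Dir(filepath.Dir(focalFile)) // common package root, ~2 levels up
	var tests []string
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if !d.IsDir() && strings.HasSuffix(path, "_test.go") {
			tests = append(tests, path)
		}
		return nil
	})
	if err != nil {
		return nil, err
	}
	rand.Shuffle(len(tests), func(i, j int) { tests[i], tests[j] = tests[j], tests[i] })
	if len(tests) > max {
		tests = tests[:max]
	}
	return tests, nil
}
```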

To deal with interfaces and mocking, I prompted GPT to generate its own fake implementations for interfaces, with the rationale that if the LLM could create a decent test using its own fakes, manually swapping them out for existing fakes in the project would not be too difficult.
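To make the idea concrete, this is the kind of hand-rolled fake I expected the model to produce (the interface and names are made up for illustration):

```go
package service

// UserStore is a hypothetical interface used by some focal method.
type UserStore interface {
	GetName(id string) (string, error)
}

// fakeUserStore is the sort of fake I prompted ChatGPT to generate when the
// project did not already provide one; swapping it later for an existing fake
// is usually a small, mechanical change.
type fakeUserStore struct {
	names map[string]string
}

func (f *fakeUserStore) GetName(id string) (string, error) {
	return f.names[id], nil
}
```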

Also, to make sure the LLM takes into account the existing tests for the focal method, I included in the prompt the test file corresponding to the focal method’s file.

Generate test

Using the parsed context and examples of existing tests, I constructed a prompt and used ChatGPT to generate the test.

From my observations, ChatGPT puts more emphasis on the final part of the prompt, so I ordered the elements of the focal context as follows (a prompt-assembly sketch follows the list):

  • Example tests
  • Test file corresponding to the focal method’s file
  • Types of all used identifiers and the bodies of all invoked functions
    • I aggregated all definitions under the same package, hoping that GPT would use the package prefix correctly
  • Focal method
  • Instructions
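A minimal sketch of assembling the prompt in that order (the struct and function names are hypothetical; only the ordering of the sections mirrors the list above):

```go
package testgen

import "strings"

// FocalContext groups everything mined from the project for one focal method.
type FocalContext struct {
	ExampleTests  []string // nearby test files used as examples
	ExistingTests string   // the focal file's own *_test.go, if present
	Definitions   string   // identifier types + bodies of invoked functions, grouped by package
	FocalMethod   string   // doc comment, signature and body of the function under test
	Instructions  string   // final instructions for the model
}

// buildPrompt concatenates the sections so that the most important ones
// (focal method and instructions) appear last.
func buildPrompt(ctx FocalContext) string {
	var b strings.Builder
	for _, t := range ctx.ExampleTests {
		b.WriteString("// Example test file:\n" + t + "\n\n")
	}
	b.WriteString("// Existing tests for the focal file:\n" + ctx.ExistingTests + "\n\n")
	b.WriteString("// Relevant definitions:\n" + ctx.Definitions + "\n\n")
	b.WriteString("// Function under test:\n" + ctx.FocalMethod + "\n\n")
	b.WriteString(ctx.Instructions)
	return b.String()
}
```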

Validate

Next, I parsed ChatGPT’s response, wrote the test to the corresponding file (in Go, the common practice is to write tests in a file with the same name but ending in _test.go), fixed any import issues using Go tools, and ran the test.
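A sketch of this validate step, assuming goimports is the Go tool used for import fixing (a real implementation might append to an existing _test.go file instead of overwriting it):

```go
package testgen

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
	"strings"
)

// validate writes the generated test next to the focal file, fixes imports
// with goimports, and runs the tests of that package.
func validate(focalFile, generatedTest string) error {
	testFile := strings.TrimSuffix(focalFile, ".go") + "_test.go"
	if err := os.WriteFile(testFile, []byte(generatedTest), 0o644); err != nil {
		return err
	}
	if out, err := exec.Command("goimports", "-w", testFile).CombinedOutput(); err != nil {
		return fmt.Errorf("goimports: %v: %s", err, out)
	}
	if out, err := exec.Command("go", "test", "./"+filepath.Dir(focalFile)).CombinedOutput(); err != nil {
		return fmt.Errorf("go test: %v\n%s", err, out)
	}
	return nil
}
```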

Repair

If compile or runtime errors occurred, I used an LLM to repair the generated test by providing it with the error messages.
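A minimal sketch of such a repair loop (generateFix stands in for the LLM call; maxAttempts is an arbitrary cap, not a value from the project):

```go
package testgen

import (
	"os"
	"os/exec"
)

// generateFix is a hypothetical LLM call: it receives the current test source
// and the compile/test output and returns a repaired version.
func generateFix(testSrc, errOutput string) string {
	// ... prompt the LLM with testSrc and errOutput here ...
	return testSrc
}

// repair reruns the tests, feeding error output back to the model until the
// test passes or the attempt budget is exhausted.
func repair(pkgPath, testFile, testSrc string, maxAttempts int) (string, bool) {
	for i := 0; i < maxAttempts; i++ {
		out, err := exec.Command("go", "test", pkgPath).CombinedOutput()
		if err == nil {
			return testSrc, true // the test compiles and passes
		}
		testSrc = generateFix(testSrc, string(out))
		if werr := os.WriteFile(testFile, []byte(testSrc), 0o644); werr != nil {
			return testSrc, false
		}
	}
	return testSrc, false
}
```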


As you can see, my approach is not unique and has been used in various forms. However, there were no available projects in Go (the language I primarily use) that I could use to experiment with different prompts or code mining strategies.

One might ask: given that most LLMs today have a 128k-token context window, why not just prompt with the entire project? Simply because it’s expensive to do so. From my experiments, generating a test required on average about 4k tokens for the input and merely 400 tokens for the output. Using gpt-4o-mini (priced at $0.15 per 1M input tokens and $0.60 per 1M output tokens), this translates to roughly 4,000 × $0.00000015 + 400 × $0.0000006 ≈ $0.00084 per test.

Empirical evaluation

Todo

Golang for LLMs

I experimented with the prompt a lot, and came to the conclusion that Golang is particularly well-suited for LLMs:

  • Golang is a simple language
  • The standard library is widely used and feature-rich
  • A handful of third-party libraries are used almost everywhere, so the training set contains many usages of the same libraries
  • From what I’ve seen, 90% of unit tests follow similar patterns and principles
  • There are no deprecated Go versions. This one is pretty important IMO. There are some deprecated functions that ChatGPT favors (like those in io/ioutil, see the example below), but so far I haven’t come across a deprecated function that was actually removed from the language.
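For example (io/ioutil has been deprecated since Go 1.16, yet it still compiles, and os.ReadFile is the drop-in replacement the model sometimes misses):

```go
package main

import (
	"fmt"
	"io/ioutil" // deprecated since Go 1.16, but still part of the standard library
	"os"
)

func main() {
	old, _ := ioutil.ReadFile("go.mod") // what ChatGPT often generates
	cur, _ := os.ReadFile("go.mod")     // the current, non-deprecated equivalent
	fmt.Println(len(old) == len(cur))
}
```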

Limitations and problems

  • ChatGPT is designed for chat interactions. Using an LLM to complete the beginning of a test function might yield better results. (See [3])
  • This method focuses on white-box testing.
  • Obviously, this is not designed for tests within tests. While ChatGPT can handle this, a code completion tool might be more effective. When I manually tested ChatGPT for this use case, it often attempted to rewrite the test (instead of completing it), even when instructed not to.

Future work

First of all, I believe the test selection process has the greatest potential for achieving better results, but it is a challenging problem. Too many cases need to be defined to ensure consistent outcomes, and for now, randomly choosing tests has yielded very good results with minimal effort.

A second thing I would like to experiment with is LLM-based “program repair”. From my experience (though I don’t have data to support this), ChatGPT has been quite effective at identifying bugs in code, so a program repair technique could potentially be leveraged to determine whether the test or the code is incorrect.

I’d also like to explore using an LLM that performs code completion, as this might produce better results.