Many were caught by surprise when OpenAI revealed the coding capabilities of their model (and, of course, SWEs started to panic that they would lose their jobs). However, there was already a significant body of research pointing toward the eventual creation of an intelligent code generation tool…it’s just that most of us were not aware of it.
This is not a technical post, but rather aims to be (or at least, I hope so) an interesting exploration of studies from over a decade ago that highlighted the deterministic characteristics of software code.
This image was generated by DALL·E-3. It’s fascinating how effective style prompts are in producing high-quality images. I wonder if “styles” could also be used for code generation, perhaps to explore whether two tests, written in varying styles, converge to the same result.
I was also surprised that deep learning models can generate high-quality code, and I was even more surprised that they do so one word (token) at a time. I had the impression that writing software is a creative endeavour, and that it’s enough to create reusable software libraries, components and frameworks to minimise redundant development effort. I also thought that generating code from just a description and some context was impossible, since a statistical model can’t capture the nuances of software development, with its inherent complexity and need for human intuition.
But I was wrong. LLMs can do it very well. However, the idea that an intelligent system (read: statistical model) can generate code was not a novelty unveiled by OpenAI. In fact, it was hinted at many years ago that code is highly repetitive and predictable, and follows a simpler distribution than natural language.
In the following paragraphs, I’ll give an overview of some papers that I enjoyed reading on the topic. In order, these are:
- A Study of the Uniqueness of Source Code [1]
- On the Naturalness of Software [4]
- The Plastic Surgery Hypothesis [6]
Code is predictable, natural and reusable
Uniqueness of software
In a paper published in 2010 titled “A Study of the Uniqueness of Source Code” [1], the authors present the results of the first study to address the uniqueness of source code across a wide range of software projects. They define uniqueness in terms of “syntactic redundancy”, a metric calculated by determining how much of a given software project can be reconstructed using pieces of code from a corpus (i.e. a large collection of source code). This corpus was compiled from approximately 6000 open source software projects and consisted of 420 million lines of code. The study analysed software at various levels of granularity to understand at what point software becomes unique.
In this study “granularity” refers to the size of the code segments being compared, measured in tokens, where a token could be a keyword, variable name, operator, etc. The levels of granularity ranged from very fine (a few tokens, equivalent to about one line of code) to much coarser (dozens of tokens, representing larger blocks of code).
To focus on measuring “incidental similarity” in source code rather than “trivial redundancy” (i.e. exact copies of functions or files), the authors applied some filters to refine the dataset. They deleted all import statements, removed duplicate files identified by content, and omitted files with identical names (the last filter adds some control for cases in which copied files were slightly adapted or are different but similar versions).
Also, considering that two lines of code can be similar but differ only in naming conventions or coding style, such as the order of two parameters or of an operator in an expression, the researchers abstracted away all identifiers and used the Hamming distance [2] (where the difference is counted in tokens) to find “approximate matches” between two sequences.
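To make this concrete, here is a minimal sketch of that matching idea: identifiers are replaced with a placeholder, and two token sequences count as an approximate match if their token-level Hamming distance stays under a threshold. The hand-written token lists, the abridged keyword set and the function names are my own simplifications, not the paper’s actual pipeline.

```python
from typing import List

# Abridged keyword set, just for the example; a real lexer would know them all.
JAVA_KEYWORDS = {"if", "else", "for", "while", "return", "int", "new", "public"}

def abstract_identifiers(tokens: List[str]) -> List[str]:
    """Replace non-keyword identifiers with a placeholder so that two lines
    differing only in naming compare as equal."""
    def is_identifier(tok: str) -> bool:
        return (tok[0].isalpha() or tok[0] == "_") and tok not in JAVA_KEYWORDS
    return ["ID" if is_identifier(tok) else tok for tok in tokens]

def hamming_distance(a: List[str], b: List[str]) -> int:
    """Token-level Hamming distance between two equal-length token sequences."""
    return sum(x != y for x, y in zip(a, b))

def is_approximate_match(a: List[str], b: List[str], max_distance: int = 1) -> bool:
    """Sequences 'match' if, after abstraction, they differ in at most
    `max_distance` token positions (and have the same length)."""
    if len(a) != len(b):
        return False
    return hamming_distance(abstract_identifiers(a), abstract_identifiers(b)) <= max_distance

# The same loop written with different variable names matches exactly after abstraction.
line_a = ["for", "(", "int", "i", "=", "0", ";", "i", "<", "n", ";", "i", "++", ")"]
line_b = ["for", "(", "int", "k", "=", "0", ";", "k", "<", "m", ";", "k", "++", ")"]
print(is_approximate_match(line_a, line_b, max_distance=0))  # True
```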
It was surprising to see the results.
Digging into Java, at a granularity of approximately one line of code (6 tokens), the sequences are almost entirely redundant (98.3% similarity). This suggests that individual lines of code are not unique, which is not surprising given the simple grammar rules of programming languages.
At roughly three lines of Java code (20 tokens), exact matching revealed that 60% of the code is redundant. When they compared lines with a little “wiggle room” [3] for differences (a Hamming distance of 1), the redundancy jumped to 80%. Allowing a Hamming distance of 2, the experiments suggest that 90% of the code looks familiar.
Extending to six lines of Java (or 35 tokens), 20% of the code was similar. With Hamming distances of 1 and 2, the figures rose to 30% and 40% respectively, still showing a significant amount of repetition.
The research highlighted a clear pattern: redundancy is near total at the line level and remains significant through the range of approximately one to six lines. So, aiming to write something truly original? You’ll need to stretch beyond the six-line mark.
Naturalness of software
The naturalness of software (i.e. the observation that code tends to be repetitive and predictable) was studied in a 2012 paper titled “On the Naturalness of Software” [4].
The premise of this research is that, despite the complexity and expressive richness of natural language, what people write and speak tends to be regular and predictable. This predictability led to the rapid adoption of statistical models in NLP tasks [8] and to great success in text summarisation, search and completion. Funny thing: Fred Jelinek allegedly claimed that “every time a linguist leaves our group, the performance of our speech recognition system goes up” [9].
The authors argued that coding is also human-generated, so it would make sense that the code written by programmers, which may seem complex, is also regular and predictable.
To check their hypothesis, they trained an n-gram model [5] on 90% of the dataset (i.e. the corpus of code) to estimate the “language’s distribution”, then tested the model’s predictions on the remaining 10%, hoping for low cross-entropy (a measure of how “surprised” the model is by “unseen” tokens, i.e. how well it can guess the next token).
The researchers stripped comments from code and conducted lexical analysis on the projects to create token sequences for the language model.
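As a rough illustration of this setup, here is a minimal sketch: lex the code into tokens, count n-grams on a 90% training split, and measure the average surprise (cross-entropy, in bits per token) on the held-out 10%. The toy corpus, the add-one smoothing and the function names are my own assumptions; the paper uses a real lexer and more sophisticated smoothing.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

def train_ngram(tokens: List[str], n: int):
    """Count n-grams and their (n-1)-token contexts over a token stream."""
    ngrams: Dict[Tuple[Tuple[str, ...], str], int] = defaultdict(int)
    contexts: Dict[Tuple[str, ...], int] = defaultdict(int)
    for i in range(len(tokens) - n + 1):
        context, word = tuple(tokens[i:i + n - 1]), tokens[i + n - 1]
        ngrams[(context, word)] += 1
        contexts[context] += 1
    return ngrams, contexts

def cross_entropy(tokens: List[str], n: int, ngrams, contexts, vocab_size: int) -> float:
    """Average number of bits needed to predict each held-out token from its
    n-1 preceding tokens, with add-one (Laplace) smoothing."""
    total, count = 0.0, 0
    for i in range(len(tokens) - n + 1):
        context, word = tuple(tokens[i:i + n - 1]), tokens[i + n - 1]
        p = (ngrams[(context, word)] + 1) / (contexts[context] + vocab_size)
        total -= math.log2(p)
        count += 1
    return total / count

# Toy "lexed" corpus: a very repetitive token stream, split 90% train / 10% test.
corpus = ("for ( int i = 0 ; i < n ; i ++ ) { sum += a [ i ] ; } " * 50).split()
split = int(0.9 * len(corpus))
train, test = corpus[:split], corpus[split:]
ngrams, contexts = train_ngram(train, n=3)
print(cross_entropy(test, 3, ngrams, contexts, vocab_size=len(set(train))))
```

Lower numbers mean the next token is easier to guess; the paper’s point is that this number drops much faster for code than for English as n grows.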
Their findings show that n-gram models can in fact capture patterns in software. Software’s cross-entropy decreases significantly with higher n-gram orders (i.e. higher ‘n’, increasing the number of tokens considered in each sequence when building the statistical model), much more so than for natural language. This suggests that Java programs have more repetitive local context than English text.
An interesting finding was that models trained on one repository showed lower cross-entropy when tested on the same repository versus different ones, suggesting that the regularity captured by the models reflects unique, project-specific patterns beyond just programming language syntax. In other words, the models capture the “naturalness” or repetitiveness specific to each project. You may find that intuitive: each project has its own vocabulary, specific local patterns of iteration, field access, method calls etc.
One last observation was that, when testing whether n-gram models can capture similarities within and differences between project domains (Unix packages), there appeared to be a lot of local regularity repeated within application domains, and much less across application domains. Some domains, like “Web”, appeared to have a very high level of regularity (and lower self cross-entropy).
This paper was a very interesting read, showing that code exhibits “natural” characteristics similar to human language. This fact mainly suggests the potential of using language models for code summarisation, searching and completion tasks.
The Plastic Surgery Hypothesis
The last paper reviewed here is titled “The Plastic Surgery Hypothesis” [6], published in 2014. The researchers tried to demonstrate the hypothesis “Changes to a codebase contain snippets that already exist in the codebase at the time of the change, and these snippets can be efficiently found and exploited”.
The authors collected a dataset of 12 Java software projects. They retrieved the commit history of the projects from 2004 to 2012, including all related Jira tasks (title, description, author, state etc.). They filtered out some changes, such as non-source code file changes or comment adjustments. As a result, they analysed a total of 15723 commits.
To test the hypothesis, they divided each code change into smaller units named “snippets” or “grafts”. Each snippet corresponds to one line of code.
They then looked for exact matches of these snippets (ignoring whitespace) within three search spaces (a rough sketch of this search follows the list below):
- the parent version of the code (i.e., HEAD~1),
- non-parental ancestors of the code,
- other projects
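As a hedged illustration of what searching for grafts might look like, here is a minimal sketch: normalise whitespace, pool every line from the search spaces, and compute the fraction of a commit’s lines that already exist somewhere in that pool. The function names and the tiny example are hypothetical and mine, not the paper’s actual tooling, which works over the projects’ full commit histories.

```python
from typing import Iterable, List

def normalize(line: str) -> str:
    """Compare snippets while ignoring whitespace (one line of code = one graft)."""
    return "".join(line.split())

def graftability(commit_lines: List[str], search_spaces: Iterable[List[str]]) -> float:
    """Fraction of a commit's lines that already exist verbatim (modulo whitespace)
    in at least one search space: the parent version, non-parental ancestors,
    or other projects."""
    grafts = set()
    for space in search_spaces:
        grafts.update(normalize(line) for line in space)
    added = [normalize(line) for line in commit_lines if normalize(line)]
    if not added:
        return 0.0
    return sum(1 for line in added if line in grafts) / len(added)

# Hypothetical example: two of the commit's three lines already exist in the parent version.
parent_version = ["int total = 0;", "for (Item it : items) {", "    total += it.price();", "}"]
commit = ["int total = 0;", "total += it.price();", "log.info(total);"]
print(graftability(commit, [parent_version]))  # ~0.67
```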
The results were:
- 42% of commits are 50% graftable within a project, and 10% of commits can be entirely reconstructed using grafts from that project
- Only 16% of commits are completely novel
- The size and type of the commit (such as bug fixes, improvements, or new features) explain little of the variance in graftability
- 30% of the components of commits are present within the same file
- Contiguous snippets, where sequences of adjacent lines are reused, frequently appear in both the source and target locations of code changes
The last result refers to the observation that copied blocks of code, consisting of sequences of lines that are next to each other, often recur within the same software project. The average size of a contiguous graft was about 2.5 lines.
Here, “graftable” refers to the proportion of a code change (commit) that can be reconstructed from existing code snippets (grafts) found within the defined search spaces.
This hypothesis was the foundation of automatic program repair (i.e. bug-fixing) tools based on genetic programming. In the current landscape, it also underscores the potential for using LLMs to efficiently address issues within a codebase.
Toward automatic software synthesis
These research findings on code naturalness, reusability and predictability make it unsurprising that LLMs have been able to produce effective recommendations for code generation.
Actually, these observations have underpinned the growth of a generate-and-test approach called “genetic improvement” [7]. I haven’t done much reading on genetic improvement, but my understanding is that it shares the same scientific principles as LLM-based code generation (which makes sense, because we also use LLMs in a generate-and-test loop). Maybe the progression of “genetic improvement” is a subject worth reading about to understand how we can use LLMs better in software engineering.
“Aye, these 50 lines of code are very similar to 50 lines of code found online; if only the difference could be automated!”
This is a common critique of LLM-based autocompletion. In my opinion, if the next 50 lines closely match what we need, then autocompleting the parts that are very similar, with an engineer supervising and writing the differences, is actually a practical approach.
Many functions we write already exist, but they are hard to find and choose from. For example, if ‘ParseImageName(absolutePath string)’ has been implemented 1000 times, how do we pick the right one? We might review 50 of them, notice that 90% are 90% suitable, and then create our own version based on them. We’re essentially looking for the implementations’ distribution and then making some adjustments. It’s no surprise that this process has been partially automated, since it follows a pattern that is clear yet too complex to define precisely.
Of course, any guessed token needs to be reviewed and adjusted by a software engineer. The studies I’ve linked show that there is a high chance the next tokens can be correctly predicted, which is why there is so much research interest in predicting them.