Many were caught by surprise when OpenAI revealed the coding capabilities of their model (and ofc started to panic SWEs will lose their jobs). However, there was a significant body of research indicating the eventual creation of an intelligent code generation tool…it’s just that most of us were not aware of it.

This is not a technical post, but rather aims to be (or at least, I hope) an interesting exploration of studies from over a decade ago that highlighted the deterministic characteristics of software code.

 

Generate an image about large language models that write code. The style should be “sketch”

This image was generated by DALL·E-3. It’s fascinating how effective style prompts are in producing high-quality images. I wonder if “styles” could also be used for code generation, perhaps to explore whether two tests, written in varying styles, converge to the same result.

 

I was also surprised that a deep learning model can generate high quality code, and I was even more surprised that they do so one word (token) at a time. I had the impression that software writing is a creative endeavour, and it’s enough to create reusable software libraries, components and frameworks to minimise redundant development effort. I also thought that generating code just from a description and some context was impossible, since a statistical model can’t capture the nuances of software development, with its inherent complexity and need for human intuition.

But I was wrong. LLMs can do it very well. However, the idea that an intelligent system (read statistical model) can generate code was not a novelty unveiled by OpenAI. In fact, it has been hinted many years ago that code is highly repetitive and predictable, and follows a simpler distribution than natural language.

It’s challenging to say why. My guess is that the nature of software development, being a product of human effort, is inherently limited by our cognitive constraints. On top of that we should also consider that programming languages have a simple formal grammar and software engineers impose homogeneity through adhering to style conventions. At a high-level, every software is unique, but when broken down into smaller components, there is a high likelihood that each piece has already been conceived by someone else, is available in a open-source repository and can be replicated through automatic code generation.

In the following paragraphs, I’ll give an overview of some papers that I enjoyed reading on the topic. In order, these are:

  • A Study of the Uniqueness of Source Code [1]
  • On the Naturalness of Software [4]
  • The Plastic Surgery Hypothesis [6]

 

Code is predictable, natural and reusable

Uniqueness of software

In a paper published in 2010 titled “A Study of the Uniqueness of Source Code” [1], the authors present the results of the first study to address the uniqueness of source code across a wide range software projects. They define uniqueness in terms of “syntactic redundancy,” a metric calculated by determining how much of a given software project can be reconstructed using pieces of code from a corpus (i.e. a large collection of source code). This corpus was compiled from approximately 6,000 open source software projects and consisted of 420 million lines of code. The study analysed software at various levels of granularity to understand at what point software becomes unique.

In this study “granularity” refers to the size of the code segments being compared, measured in tokens, where a token could be a keyword, variable name, operator, etc. The levels of granularity ranged from very fine (a few tokens, equivalent to about one line of code) to much coarser (dozens of tokens, representing larger blocks of code).

To focus on measuring “incidental similarity” in source code rather than “trivial redundancy” (i.e. exact copies of functions or files), the authors implemented some filters to refine the dataset. They deleted all import statements, removed duplicated files identified by content and omitted files with identical names (the last one aims to add some control for the cases in which copied files are very slightly adapted or are of different but similar versions).

Also, considering that two lines of code can be similar, but only differ from naming conventions or code writing style, such as the order of two parameters or an operator in an expression, the researchers abstracted away all identifiers and used hamming distance [2] to find “approximate matches” between two sequences.

It was surprising to see the results.

Digging into Java, at a granularity of approximatively one line of code (6 tokens), the sequences are nearly wholly redundant (98.3% similarity). This suggests the possibility that individual lines of code are not unique, which is not-surprising given the simple grammar rules of programming languages.

Just three lines of java code (20 tokens), revealed that 60% of code is redundant. When they compared lines with a little “wiggle room” [3] for differences (a hamming distance of 1), the redundancy is those lines jumped to 80%. Allow a hamming distance of 2, the experiments suggests that 90% of the code looks familiar.

Extending to six lines of Java (or 35 tokens), 20% of the code was similar. And with a hamming distance of 1 and 2, the figures rose to 30% and 40% respectively, showing a significant amount of repetition.

The research highlighted a clear pattern: redundancy is near total at the line level and remains significant through the range of approximately one to six lines. So, aiming to write something truly original? You’ll need to stretch beyond the six-line mark.

 

Naturalness of software

The naturalness of software (i.e. code, as natural language, is likely to be repetitive and predictable) was studied in a paper titled “On the Naturalness of Software” [4] in 2012. The authors argued that if human languages is known to be predictable (), coding should be too, since both are human-generated.

To check their hypothesis, they applied an n-gram model [5] with Kneser-Ney smoothing on 90% of a corpus (made of Java and Ubuntu projects) to model the language distribution, then tested this model’s prediction on the remaining 10%, hoping for low-entropy (i.e. how “surprised” a model is when given “unseen” tokens and how well they can guess the next token).

The researchers stripped comments from code and conducted lexical analysis on the projects to create token sequences for the language model.

Their findings show that n-gram models CAN capture patterns in software. Software’s cross-entropy decreases significantly with higher n-gram orders (i.e. higher ’n’ - increasing the number of tokens considered in sequences when building statistical models), a lot more so than natural language. This suggests a higher repetitive local context in Java programs compared to English text.

An interesting finding was that models trained on one repository showed lower cross-entropy when tested on the same repository versus different ones, suggesting that the regularity captured by the models reflects unique, project-specific patterns beyond just programming language syntax. In other words, the models capture the “naturalness” or repetitiveness specific to each project. You may find that intuitive: each project has its own vocabulary, specific local patterns of iteration, field access, method calls etc.

One last observation was that, when tested if n-gram models can capture similarities within and differences between project domains, there appeared to be a lot of local regularity repeated within application domains, and much less to across application domains. Some domains, like the “Web”, appeared to have a very high-level of regularity (and lower self-entropy).

This paper was a very interesting read, showing that code exhibits “natural” characteristics similar to human language. This fact mainly suggests the potential of using language models for code summarisation, searching and completion tasks.

 

The Plastic Surgery Hypothesis

The last paper reviewed here is titled “The Plastic Surgery Hypothesis” [6], published in 2014. The researchers tried to demonstrate the hypothesis “Changes to a codebase contain snippets that already exist in the codebase at the time of the change, and these snippets can be efficiently found and exploited”.

The authors collected a dataset of 12 Java software projects. They retrieved the commit history of the projects from 2004 to 2012, including all related Jira tasks (title, description, author, state etc.). They filtered out some changes, such as non-source code file changes or comment adjustments. As a result, they analysed a total of 15723 commits.

To test the hypothesis, they divided each code change into smaller units named “snippets” or “grafts.” They then looked for exact matches of these snippets (ignoring whitespace) within three spaces:

  • the parent version of the code (i.e., HEAD~1),
  • non-parental ancestors of the code,
  • other projects

The results were:

  • 42% of commits are 50% graftable, and 10% of commits can be entirely reconstructed using grafts
  • Only 16% of commits are completely novel
  • The size and type of the commit (such as bug fixes, improvements, or new features) have little impact on graftability variance
  • 30% of the components of commits are present within the same file

“graftable” refers to the proportion or percentage of a code change (commit) that can be reconstructed using existing code snippets (grafts) found within defined search spaces.

This hypothesis was the foundation of automatic repair tools that used genetic programming to search for repairs. In the current landscape, it also underscores the potential for using LLMs to efficiently address issues within a codebase

 

Toward automatic software synthesis

There research findings on code naturalness, reusability and predictability, make it unsurprising that LLMs have been able to exploit the same predictable reusability to produce effective recommendations for code generations.

Of course, these observations have underpinned the growth of generate-and-test approaches to repair and genetic improvement [7].

Maybe the genetic improvement approaches that have been studied to far can provide a scientific framework within which to understand and evaluate LLM-based code generation better. Both technologies share the “generate-and-test” approach to program transformation and code generation, potentially making much of the existing work on genetic improvement directly applicable to LLM-based code generation.