AI Can Write Code Like Humans—Bugs and All

New tools that help developers write software also make the kinds of mistakes humans do.
Illustration: Elena Lacey

Some software developers are now letting artificial intelligence help write their code. They’re finding that AI is just as flawed as humans.

Last June, GitHub, a subsidiary of Microsoft that provides tools for hosting and collaborating on code, released a beta version of a program that uses AI to assist programmers. Start typing a command, a database query, or a request to an API, and the program, called Copilot, will guess your intent and write the rest.

Alex Naka, a data scientist at a biotech firm who signed up to test Copilot, says the program can be very helpful, and it has changed the way he works. “It lets me spend less time jumping to the browser to look up API docs or examples on Stack Overflow,” he says. “It does feel a little like my work has shifted from being a generator of code to being a discriminator of it.”

But Naka has found that errors can creep into his code in different ways. “There have been times where I've missed some kind of subtle error when I accept one of its proposals,” he says. “And it can be really hard to track this down, perhaps because it seems like it makes errors that have a different flavor than the kind I would make.”

The risks of AI generating faulty code may be surprisingly high. Researchers at NYU recently analyzed code generated by Copilot and found that, for certain tasks where security is crucial, the code contains security flaws around 40 percent of the time.

The figure “is a little bit higher than I would have expected,” says Brendan Dolan-Gavitt, a professor at NYU involved with the analysis. “But the way Copilot was trained wasn’t actually to write good code—it was just to produce the kind of text that would follow a given prompt.”
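To give a sense of the kind of flaw at issue, here is a minimal Python sketch of one classic security bug, SQL injection, next to a safer version. It is an illustration only: it is not taken from the NYU analysis or from Copilot's output, and the table, the function names, and the input string are all hypothetical.

```python
import sqlite3


def make_demo_db():
    """Build a throwaway in-memory database (hypothetical schema for illustration)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])
    return conn


def find_user_unsafe(conn, name):
    # Vulnerable: the caller's input is pasted directly into the SQL text,
    # so a crafted name can change the meaning of the query.
    query = f"SELECT id, name FROM users WHERE name = '{name}'"
    return conn.execute(query).fetchall()


def find_user_safe(conn, name):
    # Safer: a parameterized query keeps the input as data, never as SQL.
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (name,)
    ).fetchall()


if __name__ == "__main__":
    conn = make_demo_db()
    crafted = "x' OR '1'='1"
    print("unsafe:", find_user_unsafe(conn, crafted))
    print("safe:  ", find_user_safe(conn, crafted))
```

Both functions look plausible at a glance, which is part of the problem: the unsafe version returns every row for the crafted input, while the parameterized version returns nothing.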

Despite such flaws, Copilot and similar AI-powered tools may herald a sea change in the way software developers write code. There’s growing interest in using AI to help automate more mundane work. But Copilot also highlights some of the pitfalls of today’s AI techniques.

While analyzing the code made available for a Copilot plugin, Dolan-Gavitt found that it included a list of restricted phrases. These were apparently introduced to prevent the system from blurting out offensive messages or copying well-known code written by someone else.

Oege de Moor, vice president of research at GitHub and one of the developers of Copilot, says security has been a concern from the start. He says the percentage of flawed code cited by the NYU researchers is only relevant for a subset of code where security flaws are more likely.

De Moor invented CodeQL, a tool used by the NYU researchers that automatically identifies bugs in code. He says GitHub recommends that developers use Copilot together with CodeQL to ensure their work is safe.

The GitHub program is built on top of an AI model developed by OpenAI, a prominent AI company doing cutting-edge work in machine learning. That model, called Codex, consists of a large artificial neural network trained to predict the next characters in both text and computer code. The algorithm ingested billions of lines of code stored on GitHub—not all of it perfect—in order to learn how to write code.

OpenAI has built its own AI coding tool on top of Codex that can perform some stunning coding tricks. It can turn a typed instruction, such as “Create an array of random variables between 1 and 100 and then return the largest of them,” into working code in several programming languages.
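For readers curious what that instruction might turn into, here is a hand-written Python sketch of one reasonable reading of it. It is not actual Codex output; the function name, the default of 10 values, and the optional seed are assumptions made for illustration.

```python
import random


def largest_random_value(count=10, low=1, high=100, seed=None):
    """Create `count` random integers in [low, high] and return them with the largest.

    `count` and `seed` are illustrative assumptions; the quoted instruction
    specifies only the range of 1 to 100.
    """
    rng = random.Random(seed)  # a private generator, so a seed gives repeatable runs
    values = [rng.randint(low, high) for _ in range(count)]  # randint includes both ends
    return values, max(values)


if __name__ == "__main__":
    values, largest = largest_random_value()
    print(values)
    print("Largest:", largest)
```

Even a request this small leaves details open, such as how many values to create and whether they should be whole numbers, which is the kind of gap a developer still has to check.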

Another version of the same OpenAI program, called GPT-3, can generate coherent text on a given subject, but it can also regurgitate offensive or biased language learned from the darker corners of the web.

Copilot and Codex have led some developers to wonder if AI might automate them out of work. In fact, as Naka’s experience shows, developers need considerable skill to use the program, as they often must vet or tweak its suggestions.

Hammond Pearce, a postdoctoral researcher at NYU involved with the analysis of Copilot code, says the program sometimes produces problematic code because it doesn’t fully understand what a piece of code is trying to do. “Vulnerabilities are often caused by a lack of context that a developer needs to know,” he says.

Some developers worry that AI is already picking up bad habits. “We have worked hard as an industry to get away from copy-pasting solutions, and now Copilot has created a supercharged version of that,” says Maxim Khailo, a software developer who has experimented with using AI to generate code but has not tried Copilot.

Khailo says it might be possible for hackers to mess with a program like Copilot. “If I was a bad actor, what I would do would be to create vulnerable code projects on GitHub, artificially boost their popularity by buying GitHub stars on the black market, and hope that it will become part of the corpus for the next training round.”

Both GitHub and OpenAI say that, on the contrary, their AI coding tools are only likely to become less error-prone. OpenAI says it vets projects and code both manually and using automated tools.

De Moor at GitHub says recent updates to Copilot should have reduced the frequency of security vulnerabilities. He adds that his team is exploring other ways of improving Copilot's output. One is to remove bad examples from the data that the underlying AI model learns from. Another may be to use reinforcement learning, an AI technique that has produced some impressive results in games and other areas, to automatically spot bad output, including previously unseen examples. “Enormous improvements are happening,” he says. “It’s almost unimaginable what it will look like in a year.”

