The mechanics of AI-assisted development

It is extremely uncommon nowadays for a developer to write code unassisted by some form of tool. Editors support syntax highlighting, linting, IntelliSense, and more. However, it is still up to the developer to write the actual code and come up with algorithms to solve problems.

AI is changing that. By training a model on billions of lines of code and then providing it with the proper context, an AI assistant can suggest the code it thinks you are planning to write next. In practice, this behaves more like an advanced version of autocomplete than like asking another developer how to finish a block of code, but the advantage is that you only need to tweak an accepted suggestion rather than write the whole chunk yourself.

Because running the model is computationally expensive, both Amazon's (AWS) CodeWhisperer and GitHub's Copilot run in the cloud: context from your editor of choice is sent to a remote service, which responds with suggestions. This introduces latency into the suggestion process and raises concerns about code being sent over the network, but it also means anyone can use these tools regardless of how powerful their hardware is.

Differences between CodeWhisperer and Copilot

While the implementation details of CodeWhisperer and Copilot are not publicly available, we can talk about a few of the important differences to consider if you are choosing between the two.

  • Where your code is sent: If you are working on a project that will ultimately be deployed to AWS, it may not bother you as much to use CodeWhisperer, since any context sent to the assistant is going to the "same" location your code is deployed to. If your code is hosted on GitHub, the same logic applies to Copilot.
  • What code the model was trained on: Copilot is under fire for being trained on, and sometimes even suggesting, code from GPL projects (a copyleft license requiring derivative works to be distributed under the same terms, which effectively makes such code off-limits for most commercial usage) as well as code under other potentially problematic licenses. While Amazon is not entirely clear about what code CodeWhisperer was trained on, there is an emphasis on being more AWS focused as well as pulling from open source (with no clarification on whether this includes GPL code).
  • Current cost of use: CodeWhisperer is currently in a free preview, and it is unclear what it will cost in the future. Copilot also started out as free for its beta program, but now costs either $10/month or $100/year.

Beyond these concerns, we can look at a comparison of how they perform in the real world.

Comparison of developer experience

In order to compare CodeWhisperer and Copilot, I used both tools in Visual Studio Code (VS Code) on a series of examples I came up with last year when evaluating Copilot for the first time. The examples include:

  • Creating and searching a binary tree.
  • Testing the implementation of the binary tree and the search method.
  • Creating a boilerplate ExpressJS server.
  • Creating a series of events and expected states for a reducer function.
  • Mapping and filtering of data.
  • Tic Tac Toe in Python.
  • (New) AWS Lambda implementation.

I have included some details of what the experience was like below, though this is based entirely on my own experience in VS Code alone, so others may see different results with different use cases and development environments.
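
For context, the binary tree exercise starts from an implementation along these lines (a minimal sketch for illustration, not the exact file used in testing):

```js
// Minimal binary search tree used as the starting context for suggestions.
class Tree {
  constructor(value) {
    this.value = value;
    this.left = null;
    this.right = null;
  }

  // Insert a value, descending left for smaller values and right otherwise.
  insert(value) {
    const side = value < this.value ? 'left' : 'right';
    if (this[side]) {
      this[side].insert(value);
    } else {
      this[side] = new Tree(value);
    }
  }

  // Return true if the value exists anywhere in the tree.
  search(value) {
    if (value === this.value) return true;
    const next = value < this.value ? this.left : this.right;
    return next ? next.search(value) : false;
  }
}

module.exports = Tree;
```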

CodeWhisperer

Poor experiences

Right out of the gate, CodeWhisperer was more annoying than helpful. Its suggestions often introduced syntax problems, such as duplicated semicolons or brackets, or even attempted to replace existing code with a functional equivalent. I am unsure whether this is a problem only in VS Code or whether it happens in other editors as well.

Suggestion duplicates the declaration of a close quote, function signature, and opening bracket.

While this is not the worst problem in the world, it happened consistently and was very distracting when actually trying to write code. Beyond this, the suggestions were often too generic to be helpful without further prompting. The experience would be better if suggestions were either more pertinent or at least less frequent, appearing only when they could be more specific.

Suggestion fails to pick up on 'Tree' as a prompt for writing a test, as well as when the test title is changed to 'should be searchable'.

When I attempted to add an import statement for the binary tree implementation to see if that would provide enough context, CodeWhisperer actually crashed. This somehow resulted in my delete key not working, which required a restart of VS Code to fix. I was unable to reproduce this issue.

Import statement causes a failure to run command, effectively crashing CodeWhisperer and VS Code.

Once I had added the construction of the binary tree to the test, the suggestions became more relevant, such as suggesting tree.add(5) (even though the method was actually called insert). However, it was very easy to get CodeWhisperer stuck in a loop of sorts, where each suggestion was extremely similar to the last. For example, I had to explicitly stop accepting suggestions for which numbers to add to the tree; otherwise it would keep suggesting lines of the form tree.add(x); instead of building a reasonably sized tree for testing. To its credit, it avoided inserting duplicate numbers for a long time and at times followed some semblance of a pattern, but it ultimately did end up suggesting repeated lines.

The same problem occurred after switching from writing test setup to writing assertions. CodeWhisperer seemed to decide that it needed to be extremely robust when testing the search functionality by not only testing numbers not present in the tree but also every built-in JavaScript type.

Excessive looping both when setting up the tree for testing and when writing assertions. In the final suggestion, 25 has already been added to the tree but is suggested again. Similarly, some of the expectations are duplicated, such as DataView on lines 40 and 57.
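
The looping assertions had roughly this shape (a reconstruction for illustration, not CodeWhisperer's verbatim output; it assumes a Jest-style runner and the tree sketch shown earlier):

```js
const Tree = require('./tree'); // hypothetical path to the implementation

test('Tree should be searchable', () => {
  const tree = new Tree(10);
  [5, 15, 25].forEach((n) => tree.insert(n));

  // A few reasonable checks...
  expect(tree.search(25)).toBe(true);
  expect(tree.search(100)).toBe(false);
  // ...before drifting into built-in JavaScript types...
  expect(tree.search(Infinity)).toBe(false);
  expect(tree.search(NaN)).toBe(false);
  expect(tree.search(DataView)).toBe(false);
  // ...and eventually repeating earlier expectations, e.g. DataView again.
  expect(tree.search(DataView)).toBe(false);
});
```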

NOTE: When I attempted this same approach a few days later, CodeWhisperer still had trouble with the initial prompting, but once the import statement was added, it was able to suggest the full test it came up with previously, including the duplicated assertions. This could indicate that because I accepted those suggestions before, the model learned to group them into a single suggestion, effectively fitting this particular context to exactly my use case. Attempting the same test in a new VS Code window produced a much less robust answer, with only 2 assertions instead of more than 30. This leads me to believe that it was using the project context to see that a similar file already existed and then effectively copying that file's content.

While the above focuses almost purely on the binary tree test case, many of the same problems appeared in the other contexts as well, such as:

  • Creating an Express server devolved into importing more and more dependencies; I eventually stopped it after the pattern const socketIOClientServerClient<XX> = require('socket.io-client-server-client-<XX>') emerged. (The boilerplate the exercise was aiming for is sketched after this list.)
  • Attempting to create an AWS Lambda function also quickly devolved into excessive importing.
  • Creating test data devolved into setting obscure property names and eventually started repeating organizerInstagram: 'XXXXXXXXXX', after several other organizer<X> fields.
  • Overwrote the expected formatting of test data despite a clear example of the desired format. The suggestion was even displayed with the correct formatting, so accepting it violated expectations. This happened every time, even after I corrected the formatting of previously accepted suggestions.
  • Struggled to vary string test data. In particular, when prompted with the names 'John' and 'Jane' and eventually 'Jack', it would ultimately only suggest 'John' for every following suggestion.
  • Did not pick up on a key change in test data from an event of type: 'ENTER' to type: 'LEAVE' and still suggested using type: 'ENTER' for future events.
  • Got caught in multiple comment loops when describing parts of the Tic Tac Toe game.
  • Throughout all examples, consistently suggested duplicated characters such as closing brackets (}), causing the syntax highlighter to complain and requiring manual fixes.
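
For reference, the boilerplate the Express exercise was aiming for is only a handful of lines (a standard minimal server, shown here for comparison rather than as CodeWhisperer's output):

```js
// Minimal ExpressJS boilerplate: one route, one listener.
const express = require('express');

const app = express();
const port = 3000;

app.get('/', (req, res) => {
  res.send('Hello World!');
});

app.listen(port, () => {
  console.log(`Server listening on port ${port}`);
});
```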

Good experiences

Despite all these negative experiences, there were some instances where CodeWhisperer did behave as expected and there is at least one feature that does not exist in Copilot (yet).

Perhaps the most impressive suggestion (although Copilot was also able to do this) was correctly configuring a reducer based on test data. It correctly suggested the action types and corresponding results for guests entering and leaving a space.

The reducer function from lines 9 to 18 was a suggestion generated by CodeWhisperer that correctly tracks whether guests are present based on 'ENTER' and 'LEAVE' events.
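
The suggested reducer was along these lines (a hedged reconstruction; the state shape and field names reflect my test data rather than verbatim output):

```js
// Tracks which guests are present based on 'ENTER' and 'LEAVE' events.
const guestsReducer = (state = [], action) => {
  switch (action.type) {
    case 'ENTER':
      return [...state, action.name];
    case 'LEAVE':
      return state.filter((name) => name !== action.name);
    default:
      return state;
  }
};

module.exports = guestsReducer;
```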

It was also able to come up with several working lines for the Tic Tac Toe game before getting stuck in a comment loop, and it showed off one of its unique features: responsible code referencing. When CodeWhisperer makes a suggestion it recognizes as coming directly from the training data, it presents the license the code was written under and the ability to go look at the referenced code in context.

Example of accepted Tic Tac Toe code as well as CodeWhisperer's reference code call out.

Granted, the particular example here is far from unique, and you could probably find multiple projects that use this exact line of code. But with more specific or longer suggestions, this feature could help you both avoid referencing code under a protective license and check the original context of the code to make sure it applies to your situation.

Another feature AWS highlights about CodeWhisperer is its specific focus on writing code that utilizes AWS APIs. While I did not take the time to test this robustly, the suggestions I received when writing a Lambda function contained more lines on average and did correctly reference AWS APIs.

CodeWhisperer's generated Lambda code that deletes an entry from a DynamoDB table.
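
The suggestion followed the standard AWS SDK pattern for JavaScript; a sketch of the same idea is below (the table name and key field are hypothetical placeholders, not CodeWhisperer's exact output):

```js
// Lambda handler that deletes an item from a DynamoDB table.
const AWS = require('aws-sdk');

const dynamo = new AWS.DynamoDB.DocumentClient();

exports.handler = async (event) => {
  // Assumes the incoming event carries the id of the entry to remove.
  await dynamo
    .delete({
      TableName: 'guests', // placeholder table name
      Key: { id: event.id },
    })
    .promise();

  return { statusCode: 200, body: JSON.stringify({ deleted: event.id }) };
};
```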

Copilot

Poor experiences

The experience with Copilot was generally pleasant, but there were a few frustrating points.

Perhaps the most confusing part was trying to identify when Copilot would correctly use the project context as opposed to just the current file. For example, it clearly remembered I was writing a binary tree when I started the test file, immediately making relevant suggestions such as the title of the describe block and the first test. However, when attempting to map and filter data with a test file open that had example calls and expectations, Copilot stubbornly stuck to its own understanding of the problem and required much more explicit commenting to work in the right direction.

Test description, title, and body written by Copilot for testing a binary tree, with no extra prompting necessary.
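
The generated test looked roughly like this (a reconstruction assuming a Jest-style runner and the tree sketch from earlier, not Copilot's verbatim output):

```js
const Tree = require('./tree'); // hypothetical path to the implementation

describe('Tree', () => {
  test('should be searchable', () => {
    const tree = new Tree(5);
    tree.insert(3);
    tree.insert(8);

    expect(tree.search(3)).toBe(true);
    expect(tree.search(7)).toBe(false);
  });
});
```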

Another negative experience: when Copilot thought it had reached the end of a file, it seemed to forget that it was writing code and what language it was working in. When writing the ExpressJS app, it decided to continue the file as though it could now write directly to the terminal, suggesting node server.js as a valid line in a JavaScript file.

Copilot suggests the line 'node server.js' in a JavaScript file.

In the Tic Tac Toe file written in Python, it suggested a comment without the # to prefix it.

Python comment suggested without the '#' as a prefix.

While these suggestions are easy to ignore, it is a bit concerning that it seems to forget what is effectively the most important piece of context: that it is still writing code in a particular language.

Good experiences

I have more experience using Copilot than CodeWhisperer, which is not surprising considering it has been available for a year longer. Outside of the experiments I ran to directly compare the two assistants, Copilot has felt much like an advanced version of autocomplete when writing code.

  • It is consistent in the amount of time it takes to make a suggestion.
  • It seamlessly integrates as I type, allowing me to continue typing into the suggestion without it disappearing or jumping.
  • It generally feels as though it gets out of the way when necessary, working to cut down on boilerplate code but allowing me to solve the harder problems on my own.

For example, when writing test data for a reducer that has people enter and leave a space, Copilot was able to take a single name ("John") and produce three other unique names, recognize that "leave" was the opposite of "enter", and order the events so that someone leaves before the final person joins, ensuring the events are not grouped together by type.

Copilot generated test events for guests entering and leaving a space.
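
The generated events looked roughly like this (a reconstruction; "John" was the only name I supplied, and "Jill" here stands in for whichever fourth name Copilot chose):

```js
// Unique names, 'LEAVE' inferred as the opposite of 'ENTER', and a leave
// event placed before the final person joins so the types are interleaved.
const events = [
  { type: 'ENTER', name: 'John' },
  { type: 'ENTER', name: 'Jane' },
  { type: 'ENTER', name: 'Jack' },
  { type: 'LEAVE', name: 'Jane' },
  { type: 'ENTER', name: 'Jill' },
];

module.exports = events;
```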

This type of mundane task would normally take under a minute for such a small dataset, but when this kind of setup is required for dozens of tests, the time adds up. Additionally, Copilot (like a developer) looks at the context of a suggestion, meaning it is highly likely to maintain consistent choices for values, such as reusing names across multiple datasets.

In addition, Copilot has a much stronger opinion about when a file should end, which helped it avoid generating pointless new code after completing exercises such as Tic Tac Toe. Once the game had been fully generated, it did suggest adding more comments, but made no further suggestions for the code itself.

Tic Tac Toe actually saw some of the most improvement since my last round of testing. Copilot fixed the main problem from before, where it used indexes 0 through 8 for the board internally but presented the user with a choice of 1 through 9. This meant one square could not be chosen at all, and moves did not land where the user expected. This time, it chose to use the numbers 1 through 9 for both, explicitly stating it would ignore index 0.

Copilot consistently used 1-9 to represent the Tic Tac Toe board, even including a comment explicitly saying it is ignoring index 0.
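
The indexing fix amounts to reserving an unused slot (the exercise itself was in Python; the same idea is sketched in JavaScript here for consistency with the other examples):

```js
// Allocate 10 slots and intentionally ignore index 0 so the user's choice
// of squares 1-9 maps directly onto the board without any translation.
const board = new Array(10).fill(' ');

function placeMove(square, symbol) {
  board[square] = symbol; // square is the user's 1-9 input, used as-is
}

placeMove(5, 'X'); // the user's "5" lands exactly where they expect
```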

What the future holds

As of this article, there is no information on AWS's long-term plans for CodeWhisperer. At some point it will probably follow in Copilot's footsteps and become a paid product, perhaps discounted for those with AWS accounts. CodeWhisperer is now integrated into the AWS Lambda console as a preview and can be enabled today, so AWS may be planning a tighter integration strategy as an incentive to use its tool over others.

Meanwhile, GitHub is working on a version of Copilot specifically for organizations, though there are not many details on what this means beyond more control from an organizational perspective and some form of bulk licensing. They did, however, release a write-up on how Copilot impacts developer productivity and happiness, which claims developers using Copilot could finish a task 55% faster than developers without it.

The technology Copilot is based on, OpenAI's Codex, is starting to be applied in many other ways, including quick terminal command search, code explanation, and converting Figma designs into code. Codex is in turn based on GPT-3, which is more general and is used for arbitrary text completion and insertion in a human-like way. GPT-3 was released in 2020, and while GPT-4 has been announced, no release date has been specified at this time.

Alternatives

In terms of competition from other organizations, Google recently released a paper about its own internal experimentation with AI-based code completion, which saw a 6% reduction in coding iteration time (the time between builds and tests) when the model was trained on Google's own multi-language monorepo. There is no indication yet that this model will become a publicly available tool, but the results and methods are certainly being factored into the future of other tools.

Salesforce has also released CodeGen, an open-source tool that aims to translate natural language into code as a form of conversational AI programming. One of the bigger implications of this tool is the ability for programmers to work in new programming languages without necessarily needing to learn as much upfront to be productive, since a problem can be described in natural language regardless of the desired implementation language. The model and code being publicly available also means you can host an instance yourself rather than relying on Microsoft or AWS. This has led to the creation of FauxPilot, an alternative to Copilot that allows you to avoid sending telemetry data to Microsoft, though it is still subject to the same scrutiny around code licensing.

Tabnine has already been in the AI code assistant space for a few years, though with different approaches from those of Copilot, CodeWhisperer, and CodeGen. They recently announced new AI models that enable similar features, such as whole-line, full-function, and natural-language-to-code completions. Of particular interest is their privacy policy: they never train on code with potentially problematic licenses, helping you avoid the currently murky legal implications. They also provide team-specific model training to keep your own code and completions from leaking into, or being polluted by, a more publicly used model.

Replit, an online IDE, has also started a beta program for its AI assistant GhostWriter. They put particular emphasis on writing code on mobile devices, which is a bit of a strange selling point but is admittedly a neglected use case that, with AI doing most of the typing, could become more feasible. The model is also based on Salesforce's CodeGen, and they seem to have plans to make it uniquely tailored to the Replit experience.

Other applications of AI

It is clear this space is getting a lot of attention right now, though it remains to be seen whether there will be any more significant jumps in performance. There is also a strong possibility of better open-source alternatives becoming available, much like the situation in text-to-image with OpenAI's DALL-E 2 (closed source) and Stable Diffusion (open source).

Also worth noting is the level of improvement in other spaces, such as the differences between DALL-E 1 and DALL-E 2 (released only a year apart) and the rapid progress being made on Stable Diffusion. Within the past month, projects have appeared that handle not just text-to-image but also text-to-video and text-to-3D. There are of course rough edges, but it was not long ago that ideas like these would have seemed far-fetched.

A year ago, I was confident that technology like Copilot would not become a replacement for developers. With the jumps occurring in other AI spaces over such short time windows, it may not be long before AI that currently generates at most a few lines of code can generate whole files or even whole projects. Granted, there is a long way to go in guaranteeing correctness and verifying that such a project meets the desired requirements, but when AI can get 80% to 90% of the way there in a short amount of time, the job of a programmer starts to look very different from what it is today.

Technologies