A few weeks ago GitHub released Copilot, a tool that uses AI to suggest code to developers. Developers can write a comment in their code and Copilot will automatically write the code it thinks is appropriate. It’s an impressive example of the power of AI, but it has left many developers and members of the open-source community upset and worried about what it means for the future of open source.
One issue is that the tool has, in many cases, copied an existing function verbatim rather than using AI to create something new. For example, Armin Ronacher, director of engineering at Sentry and the creator of Flask, tweeted a GIF of himself using Copilot in which it reproduced the famous fast inverse square root function from the video game Quake.
Leonora Tindall, a free software enthusiast and co-author of Programming Rust, reached out to GitHub asking if her GPL code was used in the training set, and the company’s support team responded: “All public GitHub code was used in training. We don’t distinguish by license type.” When SD Times reached out to GitHub to confirm what code Copilot was trained on, the company declined to comment.
“I, like many others, have shared work on GitHub under the General Public License, which as you may know is a copyright-based license that allows anyone to use the shared code for whatever they want, so long as, 1) they give credit to the original author and 2) anything based on it is also shared, publicly, under the GPL. Microsoft (through GitHub) has fulfilled neither of these requirements,” Tindall said. “Their argument is that copying things is fair use as long as the thing you’re copying it into is a machine learning dataset, and subsequently a machine learning model. It’s clear that copying is happening, since people have been able to get Copilot to emit, verbatim, very novel and unique code (for instance, the Quake fast inverse square root function).”
According to Tobie Langel, an open-source and web standards consultant, the GPL was largely created to prevent exactly the kind of thing Copilot is doing. “I understand why people are upset that it’s legal to use — or considered legally acceptable of a risk of using — GPL content to train a model of that nature. I understand why this is upsetting. It’s upsetting because of the intent of what the GPL is about,” said Langel.
Ronacher believes that Copilot is largely in the clear under current copyright law, but argues that many elements of copyright law ought to be revisited. Langel also feels that Copilot is legally sound, based on the conversations he’s had with IP lawyers so far.
The question of what is and isn’t copyrightable is complicated, in part because different people draw the line in different places.
“I think a lot of programmers think there’s a difference between taking one small function and using it without attribution or taking a whole file and using it without attribution. Even from a copyright perspective, there are differences about what’s the minimum level of creation that actually falls under copyright,” said Ronacher.
For example, a trivial expression like a+b wouldn’t be copyrightable, but something more complex and original could be. Stripped of its comments and reduced to its minimum, the fast inverse square root function in Quake is still only two lines. “But it’s such a memorable and well-known function that it’s hard to argue that this is not copyrighted because it’s also very complex and into the creation of this a lot of thought went,” said Ronacher.
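For reference, the function in question really is that compact: a bit-level reinterpretation of the float, one magic constant, and a single Newton–Raphson refinement step. A Python transliteration of the widely published C original (using struct to reinterpret the float’s bits, since Python has no type punning) looks like this:

```python
import struct

def q_rsqrt(number: float) -> float:
    """Approximate 1/sqrt(number) using the Quake fast inverse square root trick."""
    # Reinterpret the float's bits as a 32-bit integer (the "evil bit hack").
    i = struct.unpack('<i', struct.pack('<f', number))[0]
    # The famous magic constant shifts the exponent to form a first guess.
    i = 0x5F3759DF - (i >> 1)
    # Reinterpret the integer bits back as a float.
    y = struct.unpack('<f', struct.pack('<i', i))[0]
    # One Newton-Raphson iteration sharpens the estimate to ~0.2% error.
    y = y * (1.5 - 0.5 * number * y * y)
    return y
```

Even condensed this far, the interplay of the magic constant and the bit shift is exactly the kind of non-obvious creative work Ronacher is pointing at.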
There is a threshold on what is copyrightable versus what isn’t, but it’s hard for humans to determine where that line is, and even harder for a machine to do that, Ronacher said.
“I don’t think being upset about Copilot is synonymous with being for a hard-line stance on copyright. A lot of us free software types are pretty anti-copyright, but since we have to play by those rules, we think the big companies should have to obey them also,” said Tindall.
Langel believes that Copilot won’t be the final breaking point for addressing some of the issues in open source, just another drop in the bucket. “I think these issues have been increasingly brought up, whether it be ICE using open-source software, there’s lots that has been happening and that’s sort of increasingly creating awareness of these issues across the community. But I don’t think we’re at a breaking point and I’m quite convinced that this isn’t the breaking point,” said Langel.
Another issue people have with Copilot is the potential for monetization, Ronacher explained. “It obviously requires infrastructure to run … The moment it turns into someone profiting on someone else’s community contributions, then it’s going to get really tricky. The commercial aspect here is opening some wounds that the open-source community has never really solved, which is that open source and commercial interests have trouble overlapping,” Ronacher said.
Langel pointed out that large companies already profit from open-source code that developers wrote for free, by building products and services on top of those open-source solutions. He also noted that these companies profit from user data that is produced on a daily basis, such as location data provided when people walk from place to place.
“The location data that you’re giving Apple or Google when you’re walking around is to some degree, just as valuable as the open-source code produced by engineers as part of their professional work or their hobby,” said Langel.
In addition to monetization, Langel believes that Copilot’s sheer scale is another reason so many developers are uneasy about the tool.
“Is this bothering us because open source moved from 40-100 engineers that really care about a piece of software working on it together and suddenly it’s like this is a new tool built on all of the software in the world to be used by all of the developers in the world and have potentially GitHub or Microsoft profit from it? And so the core question to me is one of scale and one of moving from tight-knit small communities to global and you’re essentially bumping into the same issues that are of concern elsewhere about globalization, about people having a hard time finding their place in this increasingly large space in which they’re operating,” said Langel.
The issues and controversy aside, Ronacher thinks that Copilot is largely a positive learning tool because it cuts down on the time a developer spends on problems and provides insights into what developers have done before.
“The whole point of creation is to do new things and not to reinvent what somebody else already did. This idea that the actually valuable thing is doing the new thing, not something done by someone else already before,” said Ronacher.
Still, Ronacher admits that in its current state he doesn’t find Copilot all that useful for most of a developer’s day-to-day work. It is valuable for certain use cases, such as small, occasional tasks where a developer might not remember exactly how to do certain things.
“[If there’s] something that you have done in the past and you have to do it again, but you haven’t done it for like a year, you might forget some of these things,” said Ronacher. “And it’s sufficiently good at helping you with that. So for instance if you have to insert a bunch of rows into the database, you might have forgotten how the database driver’s API is. Typically you go to the documentation to do that, but GitHub Copilot for the most part is actually able to autocomplete a whole bunch of relatively simple statements so that you actually don’t have to go to the documentation. And I think for all of these purposes it’s really good, and for all of these purposes it also doesn’t generate code that’s copyrightable because it’s highly specific to the situation.”
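The kind of boilerplate Ronacher describes might look like this minimal sketch using Python’s built-in sqlite3 module (the table and rows here are invented for illustration): a parameterized bulk insert whose exact incantation is easy to forget after a year away from the API, yet is highly specific to the situation.

```python
import sqlite3

# In-memory database purely for illustration; schema is invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")

rows = [
    (1, "Ada", "ada@example.com"),
    (2, "Grace", "grace@example.com"),
]

# The easy-to-forget part: executemany with placeholder syntax for a bulk insert.
conn.executemany("INSERT INTO users (id, name, email) VALUES (?, ?, ?)", rows)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
```

Code like this is, as Ronacher notes, too situation-specific to plausibly raise copyright concerns — it is glue, not creative work.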