The US Copyright Office has shared a ‘Pre-Publication Version’ of its report on whether or not training AI with existing content constitutes fair use under American copyright law. The report was made public shortly before the sacking of Copyright Office boss Shira Perlmutter by Donald Trump, which has proven to be something of a distraction. But the report itself is noteworthy.
This is the third report the Copyright Office has published following its extensive consultation on AI, the previous two dealing with digital replicas and whether AI-generated works should enjoy copyright protection. It left by far the most contentious issue to last.
The copyright industries - including the music industry - are adamant that AI companies must get permission from relevant copyright owners before using existing content to train generative AI models. However, many AI companies are equally adamant that AI training is fair use under US law, meaning no permission is required.
This dispute is central to numerous lawsuits filed by copyright owners against AI companies in the US courts, including those filed by record labels and music publishers against Suno, Udio and Anthropic.
In its report, the Copyright Office first considers if and when AI companies actually make copies of existing works. Although it is debatable if copying occurs during the actual training process or when an AI model outputs new content - and the report summarises that debate - it seems clear existing works are copied when an AI company first puts together a training dataset.
“The steps required to produce a training dataset containing copyrighted works clearly implicate the right of reproduction”, the report states, adding “developers make multiple copies of works by downloading them; transferring them across storage mediums; converting them to different formats; and creating modified versions or including them in filtered subsets”.
But is that copying fair use? The Copyright Office is keen to stress that the answer to that question is a massive “it depends”, and that the courts will need to assess various things to decide if the fair use defence stands in any one legal battle. However, it then provides some discussion and guidance that those courts could consult when making that assessment.
It begins by reminding everyone of the four criteria courts always use to assess whether or not the use of a copyright-protected work is fair use.
First, the purpose and character of the use, including whether it’s a ‘transformative’ use. Second, the nature of the copyrighted work. Third, the amount and substantiality of the portion taken. And fourth, the effect of the use upon the potential market for or value of the copyright-protected work.
The report reckons that the first and fourth factors are most important in the context of AI training. At least some uses of copyright-protected works in AI training are likely to be transformative, making those uses more likely to be fair use under the first factor. However, factor four then needs to be considered, ie the impact the AI has on the value of the copyright-protected work.
This more or less aligns with opinions recently expressed by the judge overseeing one of the most advanced AI copyright lawsuits, between a group of authors and Meta in relation to its Llama AI model.
In a recent court session, he seemed to acknowledge that Meta's use of the authors’ works was transformative, but then questioned whether it could possibly be fair use given Llama is “dramatically changing - you might even say obliterating - the market for that person’s work”.
If many of these cases are going to ultimately swing on the fourth factor of fair use, that poses an important question.
Should the court only consider the direct impact of the AI on each specific work in the training dataset - so in the Meta case the question is: will the outputs of the generative AI reduce the value of the copyright in each individual book contained in the training dataset?
Or can the court consider the general impact of the AI on entire categories of works - so the question is: will the outputs of the generative AI reduce the value of the copyright in books in general, including future books?
Comedian Sarah Silverman is involved in the Meta legal battle. Will Llama generating new written works result in her books making less money? Possibly not. Although, given some AI companies are already licensing works for training data, thus creating a new licensing market, you could say Silverman is losing licensing income if some AI companies are allowed to use her works without negotiating a deal.
However, the authors’ case is stronger if the court can consider the impact of Llama on the market for books more generally, including future books that Silverman may write. This is something considered in the Copyright Office’s report, which ultimately comes down on the copyright owners’ side on this crucial question.
“A number of commenters contended that courts should consider the harms caused where a generative AI model’s outputs, even if not substantially similar to a specific copyrighted work, compete in the market for that type of work”, it states.
It then cites the submission from the Copyright Alliance, which says “with generative AI, the harm is often to a creator’s overall body of work or even the market more broadly”, adding “these harms all impact the creator’s incentives and they should be considered under a factor four analysis”.
Unsurprisingly, AI companies have argued that “fourth factor analysis” should only consider “harm to markets for the specific copyrighted work”. The report quotes AI company Scenario, which argues that extending factor four analysis “to whether the AI system competes in the market for a general class of works” could have “unintended and potentially detrimental consequences”.
But the Office then writes, “while we acknowledge this is uncharted territory, in the Office’s view, the fourth factor should not be read so narrowly”.
The Copyright Act “on its face encompasses any ‘effect’ upon the potential market”, it states, adding, “the speed and scale at which AI systems generate content pose a serious risk of diluting markets for works of the same kind as in their training data”, and “that means more competition for sales of an author’s works”.
In conclusion, the Copyright Office’s report says, “various uses of copyrighted works in AI training are likely to be transformative”, but “the extent to which they are fair will depend on what works were used, from what source, for what purpose, and with what controls on the outputs”, because all of those things “can affect the market”, and therefore impact on the fourth factor for assessing fair use.
Homing in on the purpose of the output in particular, the report’s conclusion says that “when a model is deployed for purposes such as analysis or research, the outputs are unlikely to substitute for expressive works used in training”, which makes the case for fair use much stronger.
However, it goes on, “making commercial use of vast troves of copyrighted works to produce expressive content that competes with them in existing markets, especially where this is accomplished through illegal access, goes beyond established fair use boundaries”.
That point is interesting in that it probably means that some copyright owners will find it easier to defeat the fair use defence than others. So while judgements made in the first copyright AI cases to get to trial will certainly impact on other cases, there may well be subtle but crucial differences in those other cases which mean that the fair use defence is stronger or weaker.
The report ultimately concludes that AI training may or may not be fair use depending on various specific factors, which means the Copyright Office doesn’t really come down strongly on one side or the other in this debate. However, it does arguably strengthen the arguments of copyright owners more than AI companies in many of the ongoing legal disputes.
That said, as noted, officially the version of the report in circulation is a ‘Pre-Publication Version’, possibly rushed out ahead of the sacking of Perlmutter. It remains to be seen if the new leadership at the Copyright Office makes any changes before the final version is made public.