Copyright Office Report on AI Training and Fair Use

Can AI systems use copyrighted materials under the fair use doctrine?
The fair use doctrine is a provision of copyright law that allows limited use of copyrighted material in certain situations without permission from the copyright holder. Courts apply a four-factor test to determine fair use: the purpose and character of the use, the nature of the copyrighted work, the amount used, and the effect of the use upon the market for the work.

Some of the purposes for which fair use is generally allowed are criticism, comment, news reporting, teaching, scholarship, and research. So, for example, quotes from a copyrighted book can be published in a critique of the book without having to obtain permission from the copyright holder. These purposes are general guidelines: use for one of them is not automatically fair, and uses for other purposes may still be considered fair.

Generative artificial intelligence (AI) systems are trained on large amounts of data, including copyrighted materials. Can this training be considered fair use and so avoid infringement? That question was the subject of a recent report published by the United States Copyright Office (the USCO).

Do AI Models Infringe Copyright in Their Training Data?
The USCO report finds that generative AI systems implicate several of the exclusive rights granted to copyright owners. The report also addresses whether a model’s weights are themselves infringing copies of the underlying works. This point is disputed: developers assert that models are merely composed of strings of numbers, while others point out that models sometimes generate outputs substantially similar to the training data. The report concludes that where generated outputs are substantially similar to the training data, there is a strong argument that copying the model’s weights implicates the reproduction and derivative work rights in the original works.

The Application of the Fair-Use Defense
The key issue in analyzing fair use has been whether the use is “transformative”. There is a spectrum: where the output is based on a diverse dataset, the use is more likely to be transformative, but where a model is trained to generate outputs substantially similar to copyrighted works, the use is “at best, modestly transformative”. The USCO rejected the argument that AI training is inherently transformative because it is analogous to human learning. The report asserts that the analogy rests on a faulty premise, namely that fair use is a defense for any act done for the purpose of learning. A student, for example, could not rely on fair use to copy all of the books in a library. The analogy also fails because humans retain only imperfect impressions of the works they experience, filtered through their unique personalities and limitations. AI training, by contrast, involves the creation of perfect copies. The structure of exclusive copyright rights is premised on certain human limitations.

More than 40 cases are currently pending on the issue of AI systems using copyrighted materials. There will not be a single answer to whether unauthorized use is fair use. Discussion is ongoing about some form of licensing scheme for AI training data, but for now, the USCO’s recommendation is to allow the licensing market to continue to develop without government intervention.