AI’s Secret Ingredient: Text, Data, and Fair Use

Data Mining Includes Copyrighted Materials

Text and data mining (TDM) means collecting large volumes of text and data, often from the internet, and analyzing them computationally. Large language models are trained on that data to identify patterns and generate new text. But gathering text at that scale quickly raises copyright concerns, as it inevitably steps on copyright holders' toes.

Fair Use as a Justification for Using Copyrighted Materials

However, copyright law offers a possible safety valve: the fair use doctrine. Under fair use, some uses of copyrighted material do not require the copyright holder's permission. In the US, many large technology companies rely on fair use to justify training AI systems on copyrighted works.

Fair use allows limited use of copyrighted material in specific circumstances, but it is never automatic or guaranteed. Instead, courts apply a case-by-case analysis using four factors. Those factors are:

the purpose and character of the use,
the nature of the copyrighted work,
the amount and substantiality of the portion used, and
the effect on the market for the original work.

Class Action Responses to the Fair Use Justification

Several class actions argue that AI training is not fair use. Their claim focuses on the scale of copying involved.

So far, courts have largely disagreed, finding AI training permissible when the works are used for pattern recognition. However, that conclusion depends on the models not reproducing the works' expressive content.

Still, the legal landscape remains unsettled, and several cases are ongoing. Future rulings could require broader licensing frameworks for AI training.

Fair use has been a recurring topic in discussions of AI. Here are some previous articles on the developments:
Fair Use and AI Training: Buy it or Bye-Bye

AI Training, Fair Use, and Meta