AI’s secret ingredient: Text, Data, and Fair Use

Text and data mining (TDM) is the automated collection and analysis of large volumes of text and data, much of it gathered from the internet. Large language models and other AI systems extract patterns from that data and use them to generate text.

Accessing large amounts of text will, of course, very quickly start stepping on copyright holders’ toes. But there is a mechanism by which copyrighted materials can sometimes be used without explicit permission: the fair use doctrine.

In the US, many large tech companies rely on the fair use doctrine to justify accessing copyrighted material and using it to train their AI systems.

Fair use allows the use of copyrighted materials in some instances. It is never blanket permission; it is a case-by-case analysis based on four factors: 1) the purpose and character of the use, 2) the nature of the copyrighted work, 3) the amount and substantiality of the portion used, and 4) the effect on the market for the original work.

There are class action suits arguing that AI training is not fair use because it involves mass copying. For the most part, though, courts have so far found training LLMs on copyrighted material acceptable, inasmuch as the works are used for pattern recognition rather than to reproduce their expressive content. But again, cases are ongoing, so this may change, which could require some sort of larger-scale licensing arrangements.