Meta faces copyright infringement claims over AI training dataset

Meta's data set used for artificial intelligence training is under scrutiny due to copyright infringement allegations. This raises questions about data usage standards in the AI industry.

Meta Platforms, the parent entity of Facebook and Instagram, is currently embroiled in a legal battle over copyright infringement claims. The lawsuit, spearheaded by notable authors such as Sarah Silverman and Michael Chabon, centers on the accusation that Meta used thousands of copyrighted books without proper authorization to train its artificial intelligence language model, Llama.

Despite warnings from their legal team about the potential risks of using pirated books, Meta allegedly continued with the contentious dataset. This situation was further complicated by the emergence of chat logs. These logs revealed conversations between Meta-affiliated researcher Tim Dettmers and the legal department, discussing the legality of using book files for AI training. The logs indicated that the legal team advised against the immediate use of such data, particularly those with active copyrights. There was also a debate on whether the use of this data could fall under the fair use doctrine, a legal principle in the U.S. that protects certain unlicensed uses of copyrighted works.

Meta faces copyright infringement claims over AI training dataset

The lawsuit against Meta, which began over the summer, has recently seen significant developments. It was consolidated, merging two separate legal actions. A recent twist in the case saw a California judge dismissing part of the Silverman lawsuit last month. This led to the authors seeking amendments to their claims, signaling an evolving legal scenario.

The repercussions of this lawsuit could extend far beyond Meta, potentially affecting the entire AI industry. Success in these lawsuits may lead to increased development costs for AI models, as companies might face more scrutiny and demands for compensation from content creators. Moreover, new regulations in Europe could force AI companies, including Meta, to reveal the data sources for their AI models, adding to their legal challenges.

At the heart of this controversy is Meta’s Llama models, especially the latest version, Llama 2, released in the summer. While Meta disclosed that the first version was trained using the “Books3 section of ThePile,” the training data specifics for Llama 2 remain undisclosed. This lack of transparency has fueled the ongoing dispute, which stands as a potential game-changer in the generative AI software market.