Details Emerge from Writers’ Lawsuit Against Meta
The ongoing legal battle between a group of writers and Meta has revealed new details about the company’s use of pirated books, particularly from the Russian shadow library LibGen, to train its artificial intelligence models. According to Ars Technica, Meta acknowledged using torrents to download the LibGen dataset, which comprises tens of millions of pirated books.
For the first time, unredacted emails from Meta have surfaced, indicating that the company downloaded “at least 81.7 terabytes of data from several shadow libraries via the Anna’s Archive site,” including at least 35.7 terabytes from Z-Library. The emails also show that “Meta had previously downloaded 80.6 terabytes of data from LibGen.”
“The scale of Meta’s illegal torrent scheme is staggering,” the writers noted. They emphasized that “much smaller acts of data piracy—only 0.008 percent of the copyrighted works copied by Meta—led judges to refer the case to the U.S. Attorney’s Office for a criminal investigation.”
In earlier proceedings, Meta sought to prevent the disclosure of its use of pirated books for training its AI models. However, a judge rejected the company’s request, finding that Meta’s insistence on redacting the materials was aimed not at protecting its business interests but at “avoiding negative publicity.”
Meta previously disclosed in a research paper that it trained its large language model, Llama, on portions of Books3, a dataset of approximately 196,000 books scraped from the internet. However, the company had not publicly acknowledged sourcing data directly from LibGen before this lawsuit.