Meta is in hot water, with internal emails showing the company torrented more than 80 TB of copyrighted and pirated books from questionable online databases.
Meta is one of several companies in hot water over how it trains its AI models, with a legal case accusing the company of pirating tens of millions of books from questionable sources.
The plaintiffs describe the extent of Meta’s actions in their complaint:
“However it is done, torrenting pirated works is flagrantly illegal. And the magnitude of Meta’s unlawful torrenting scheme is astonishing: just last spring, Meta torrented at least 81.7 terabytes of data across multiple shadow libraries through the site Anna’s Archive, including at least 35.7 terabytes of data from Z-Library and LibGen. Pritt Decl., Ex. H. Meta also previously torrented 80.6 terabytes of data from LibGen (Sci-Mag).”
The plaintiffs emphasize the legal ramifications of Meta’s actions, especially compared to established legal precedent:
“Vastly smaller acts of data piracy—just .008% of the amount of copyrighted works Meta pirated—have resulted in Judges referring the conduct to the U.S. Attorneys’ office for criminal investigation.”
Meta’s Emails Undermine the Company’s Case
The extent of Meta’s activities wasn’t fully known until internal emails were made public, painting a damning picture of the company’s actions.
Below are excerpts from the court filing (thanks to Ars Technica for hosting the document):
Melanie Kambadur stated on a message chain, “I don’t think we should use pirated material. I really need to draw a line there.” The four messages that follow are redacted.
Joelle Pineau responds to Eleonora Presani’s statement that “using pirated material should be beyond our ethical threshold.” Ms. Pineau then asks, “You think it’s problematic to use even for this phase?” followed by a redacted sentence. Presani then says “SciHub, ResearchGate, LibGen are basically like PirateBay or something like that, they are distributing content that is protected by copyright and they’re infringing it.”
This document appears to be notes from a January 2023 meeting that Mark Zuckerberg attended. It is heavily redacted, including a large section titled “Legal Escalations.” Immediately after that section the document states “[Zuckerberg] wants to move this stuff forward,” and “we need to find a way to unblock all this.”
Nikolay Bashlykov suggested that Meta conceal its downloading of LibGen data using a VPN (“Can we load libgen data using Meta IP ranges? Or should we use some vpn?”). All three bullet points that follow are redacted.
In an internal message, Nikolay Bashlykov expresses concern about using Meta IP addresses “to load through torrents pirate content,” and says, “torrenting from a corporate laptop doesn’t feel right :).” A response from David Esiobu is redacted.
This document contains admissions that Meta knew that LibGen was pirated (i.e., illegal) and expresses concern over what will happen if regulators learn that Meta is training Llama on pirated copyrighted data. The “Legal Risk” section is entirely redacted.
This document shows Meta employees deciding not to use “FB [Facebook] infra[structure]” for its “data downloading” from pirated databases in order to “avoid[] risk of tracing back the seeder/downloader [] from FB servers.”
The emails even implicate OpenAI for allegedly engaging in identical behavior.
On a message thread, Sy Choudhury discusses OpenAI’s use of LibGen. An update from “the partnerships side” is redacted.
On a message chain, Erin Murray explains that OpenAI’s model is likely trained on Smashwords and LibGen. The latter half of her message, addressed to Beau James, an in-house counsel, is redacted.
The above communications paint a damning picture of a company and its executives that knowingly crossed the line in an effort to train its AI models, even taking action to limit the fallout and legal repercussions should its actions ever be discovered.
from WebProNews https://ift.tt/El4ZL9n