The Atlantic's AI Watchdog Turns Its Lens on YouTube Training Data

Journalist Alex Reisner, writing for The Atlantic, has identified millions of YouTube videos used without creators' explicit consent to train generative AI video tools — the latest in a line of dataset investigations published under the outlet's AI Watchdog banner.
The September 2025 piece extends a body of work Reisner has built at The Atlantic tracking the provenance of generative AI training corpora. In 2023, he obtained a dataset of more than 191,000 books that Meta had used without permission to train its generative AI systems, a disclosure that became a reference point in copyright litigation around large language models. The YouTube investigation follows the same methodological thread: obtain or reconstruct the dataset, identify the rights holders, and make the results searchable.
The searchable dimension matters. The Atlantic's AI Watchdog section gives users a direct interface to query the datasets Reisner and colleagues have surfaced, letting individual creators, authors, and publishers check whether their work appears in corpora that fed commercial models. That kind of first-person discoverability has practical weight in an environment where class-action suits against AI developers increasingly depend on named plaintiffs who can document specific infringement.
YouTube as a training source for video generation models is not a surprise to anyone who has followed the generative video space closely. Scraped video at scale is the path of least resistance for teams building diffusion-based or transformer-based video synthesis systems — the platform hosts an extraordinary density of labeled, captioned, and temporally rich content spanning virtually every visual domain. What Reisner's reporting adds is specificity: not the inference that YouTube was scraped, but documented evidence of which videos, at what scale, and in service of which products.
The practical exposure here runs along several axes. For individual creators, the question is whether their content contributed commercial value to a model from which they received nothing. For the AI developers involved, the question is whether ingesting publicly accessible video constitutes fair use under U.S. copyright law. Courts have not resolved that question cleanly for text, let alone for audiovisual works, which carry distinct performance and synchronization rights and are hosted under platform terms of service that explicitly prohibit scraping for machine learning. YouTube's own terms have barred this use for years; whether that prohibition has contractual or tortious teeth against downstream model developers, rather than just the party that did the scraping, is a live legal question.
The books precedent is instructive here, though not dispositive. The Books3 dataset that Reisner exposed in 2023 consisted of text stripped from a shadow library. Video scraped from YouTube is materially different: the rights stack is more complex, the file sizes and bandwidth demands are orders of magnitude larger, and the relationship between training data and model output is harder to characterize as simple reproduction. A diffusion model trained on video frames does not store and retrieve those frames the way a search cache does — but that technical nuance has not yet been translated into settled legal doctrine.
The AI Watchdog's search tool also changes the power dynamics of this conversation. Until recently, most rights holders had no practical means of knowing whether their work had entered a training corpus. Discovery in litigation is expensive and slow. A public database that answers that question for free, at scale, shifts leverage toward claimants before a single lawsuit is filed, and it could increase the volume of litigation considerably.
The broader trajectory here is one of progressive exposure. Training data opacity was, for a few years, a largely unchallenged norm in the industry. Reisner's 2023 books piece cracked that open for text. The YouTube investigation does the same for video. Each disclosure makes the next one easier to pursue, both in the press and in court, and each searchable database The Atlantic publishes adds to an accumulating public record that courts, regulators, and legislators can draw on.
Generative AI's data foundations have always been its most legally and ethically contested surface. The industry built quickly on the assumption that scale would outpace scrutiny. Scrutiny is catching up.


