How Paywalls Hand the Training Set to State Media
A study from the Foundation for Defense of Democracies reports that large language models cite state-aligned propaganda about 57% of the time when answering controversial geopolitical questions. On the topics where accuracy matters most, the average model reaches for state-aligned framing more often than not.
The easy explanation is that the models are biased by design or that their builders are careless. The study points somewhere less satisfying and more structural. State media outlets tend not to sit behind paywalls and tend to block AI crawlers less aggressively than Western outlets do. That makes their framing cheaper to read, copy, and ingest, though it does not mean models swallow it raw. Major vendors run extensive domain filtering and quality classifiers over pre-training data, so easy availability raises the odds of inclusion rather than guaranteeing it.
The inversion
The defensive moves Western publishers made to protect their work are the same moves that thinned their presence in the training data.
When a quality outlet erects a hard paywall and serves a 403 to every crawler, it is acting rationally. It is protecting subscription revenue and refusing to let a model vendor monetize its reporting for free. But the open web does not reward that posture. A crawler that hits a wall moves on to whatever is reachable. State-funded outlets, which are not trying to sell subscriptions and often want their framing distributed as widely as possible, leave the door open. The more carefully a newsroom guards its archive, the smaller its footprint in the corpus that shapes how a model talks about its beat.
That footprint is reduced, not erased. Many quality outlets syndicate through wire services or permit partial indexing, so traces of their reporting survive even behind a wall. Still, the outlets with the strongest incentive to be accurate shrink in the record, while the outlets with the strongest incentive to push a line stay fully indexed. No conspiracy is needed. Two business models collide inside a crawler, and this is the predictable output.
Why this is more concrete than it sounds
Three signals move together here.
- Access cost. A subscriber-gated investigation and a freely syndicated state wire story are not equally available to a scraper. One requires credentials the crawler does not have. The other is built for redistribution.
- Crawler policy. A robots.txt that disallows known AI agents, plus bot detection at the edge, removes a source from many pipelines even when a human can read it fine.
- Volume and repetition. State outlets often publish the same framing across many language editions and mirror sites. Repetition in a corpus tends to look like consensus to a model learning what answers are common.
Stack those together and you get a corpus where the cheapest, most repeated, most crawler-friendly version of a contested event is disproportionately the state-aligned one. A base model trained on that skewed distribution will lean toward it without any malice. The qualifier matters: instruction tuning and reinforcement learning from human feedback explicitly train models to override raw next-token tendencies on sensitive geopolitical topics, so a 57% citation rate is a signal about what survives that correction rather than only about what the base data contained.
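The way the three signals compound can be sketched with a toy model. This is an illustration under stated assumptions, not the study's method: the outlet names, article counts, copy counts, and open fractions are invented, and "GPTBot" stands in for any AI crawler agent. The sketch models only a crawler that honors robots.txt, and it ignores the vendor-side filtering and post-training described above.

```python
from dataclasses import dataclass
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt bodies; "GPTBot" stands in for any AI crawler agent.
BLOCK_AI = "User-agent: GPTBot\nDisallow: /\n\nUser-agent: *\nAllow: /\n"
ALLOW_ALL = "User-agent: *\nAllow: /\n"


@dataclass
class Source:
    name: str
    robots_txt: str       # crawler policy
    open_fraction: float  # access cost: share of articles readable without credentials
    articles: int         # distinct stories published
    copies: int           # repetition: editions and mirrors carrying each story


def crawler_allowed(robots_txt: str, agent: str, url: str = "/") -> bool:
    """Return True if a robots.txt-compliant crawler with this agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


def reachable_documents(source: Source, agent: str) -> float:
    """Documents a compliant crawler can actually collect from one source."""
    if not crawler_allowed(source.robots_txt, agent):
        return 0.0
    return source.articles * source.open_fraction * source.copies


def corpus_share(sources: list[Source], agent: str) -> dict[str, float]:
    """Each source's share of all reachable documents, before any filtering."""
    counts = {s.name: reachable_documents(s, agent) for s in sources}
    total = sum(counts.values())
    return {name: (n / total if total else 0.0) for name, n in counts.items()}


# Made-up numbers for illustration only.
SOURCES = [
    Source("gated_daily", BLOCK_AI, 0.1, 1000, 1),
    Source("open_daily", ALLOW_ALL, 1.0, 1000, 1),
    Source("state_wire", ALLOW_ALL, 1.0, 1000, 5),
]

shares = corpus_share(SOURCES, "GPTBot")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

All three hypothetical outlets publish the same number of stories, yet under these invented inputs the state wire supplies five-sixths of what the blocked agent can reach and the gated daily supplies none. Run the same function with an agent the gated daily does not block and its open fraction reappears, which is the "reduced, not erased" footprint in miniature.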
The limits of one study
The 57% figure comes from a single study on controversial geopolitical questions. The published summary does not name the specific models tested, the exact questions asked, or the individual outlets counted as state-aligned, which makes the headline hard to audit. The number should not be read as a global statement about every model or every topic, and the paywall mechanism is the study's attribution rather than a controlled experiment isolating that one variable. Source weighting, the post-training and safety tuning each vendor applies, and how questions are phrased during evaluation all plausibly contribute. None of this makes blocking crawlers wrong. Publishers have every right to decide who trains on their work. The narrower point holds: the aggregate effect of many reasonable individual choices is a training distribution tilted toward sources that wanted to be ingested.
What follows from this
If the mechanism is roughly right, the fixes that actually move the number are unglamorous and mostly about access, not about bolting on more refusals at inference time.
- Licensed access beats open scraping. Quality publishers and model vendors negotiating paid, structured access put accurate reporting back into the corpus without forcing newsrooms to give it away. The friction is commercial, not technical.
- Source weighting matters more than source presence. Even when good reporting is in the data, it can be drowned out by volume. Evaluations that measure which sources a model actually cites, the way this study did, are more useful than benchmarks that only score final answers.
- Crawler-friendliness is not a proxy for trustworthiness. Any provenance signal that treats easy availability as a positive will systematically reward the outlets with the least to lose from being copied.
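A citation-level evaluation of the kind described above can be sketched in a few lines. This is a minimal sketch under assumptions, not the study's methodology, which the published summary does not disclose. The domains are placeholders on the reserved `.example` TLD, and the `DOMAIN_LABELS` table is hypothetical; a real evaluation would need a documented, auditable labeling of outlets.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical label table; a real one would need a published methodology.
DOMAIN_LABELS = {
    "state-wire.example": "state_aligned",
    "independent-daily.example": "independent",
}


def label_for(url: str) -> str:
    """Map a cited URL to a source label by its hostname."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[len("www."):]
    return DOMAIN_LABELS.get(host, "unlabeled")


def citation_shares(cited_urls: list[str]) -> dict[str, float]:
    """Fraction of citations falling under each source label."""
    counts = Counter(label_for(url) for url in cited_urls)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}


# Placeholder citations, as if extracted from a model's answers.
cited = [
    "https://www.state-wire.example/world/story-1",
    "https://state-wire.example/world/story-2",
    "https://independent-daily.example/investigation",
    "https://unknown-blog.example/post",
]

print(citation_shares(cited))
```

Keeping an explicit "unlabeled" bucket is a deliberate choice in this sketch: silently dropping unknown domains would inflate the share of whichever labeled category dominates.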
You cannot win an information contest by withdrawing from the field where the models learn. A walled archive is faint in the corpus. If accurate reporting is hard to reach and propaganda is free, the propaganda is the version that gets memorized, then partly filtered, and sometimes still cited. The skew is partly the shape of the training set itself, assembled one rational 403 at a time, and patching refusals at inference time cannot reach it.