
Five Frontier LLMs Disagree on 67% of Fact-Check Claims

Five frontier LLMs disagree on 67% of 1,000 real-world fact-check claims, raising questions about their reliability as knowledge tools.

Tessera Newsroom · 1 min read · May 28, 2026

Source: Five frontier LLMs disagree on 67% of 1k real-world fact-check claims (lenz.io)


Five frontier large language models disagree on 67 percent of 1,000 real-world fact-check claims, according to a study posted by researcher Alex Lenz. The finding will not surprise anyone who has used these models seriously, but it points to a problem the industry has not solved.

Lenz tested GPT-4o, Claude 4, Gemini 2.5 Pro, Llama 4, and DeepSeek-V3 on a set of claims drawn from fact-checking databases. Each model was asked to label each claim as true, false, or uncertain. On roughly two out of every three claims, the models gave different answers. Agreement was highest on obviously false statements and lowest on claims involving nuance, context, or recency.
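To make the headline number concrete, here is a minimal sketch of how a disagreement rate like this could be computed. It assumes a claim counts as "disagreed" whenever the five verdicts are not unanimous; the article does not state the study's exact definition, and the function names and sample verdicts below are hypothetical, not Lenz's code or data.

```python
def is_unanimous(verdicts):
    """Return True when every model gave the same label for one claim."""
    return len(set(verdicts)) == 1


def disagreement_rate(verdicts_per_claim):
    """Fraction of claims on which the models did not all agree.

    Assumption: "disagreement" means any non-unanimous set of verdicts.
    """
    if not verdicts_per_claim:
        raise ValueError("need at least one claim")
    split = sum(1 for verdicts in verdicts_per_claim if not is_unanimous(verdicts))
    return split / len(verdicts_per_claim)


# Hypothetical verdicts for three claims, one label per model.
sample = [
    ["false", "false", "false", "false", "false"],    # unanimous
    ["true", "uncertain", "true", "false", "true"],   # split
    ["true", "true", "uncertain", "true", "true"],    # split
]

rate = disagreement_rate(sample)
print(f"{rate:.0%} of sample claims had at least one dissenting model")
```

Under this definition a single "uncertain" among four matching answers counts as disagreement, which is one reason the reported figure depends heavily on how the term is defined.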

The implication is uncomfortable. If these models cannot agree on basic factual questions about the world, they cannot serve as reliable knowledge tools. A user asking one model whether a specific policy took effect last month might get a confident yes. Another model, on the same query, might say no, or hedge. The user has no way to adjudicate.

This is not a benchmark artifact. Lenz used real claims from PolitiFact and other fact-checking organizations, not synthetic trivia. The disagreement rate plausibly reflects genuine ambiguity in the training data, differences in how models handle conflicting sources, and the fundamental difficulty of compressing the world into a fixed set of weights.

The industry response so far has been to add citations and retrieval. That helps with recency and verifiability, but it does not solve the underlying disagreement problem. Two models given the same retrieval context can still reach opposite conclusions about what the retrieved text means.

For builders, the takeaway is straightforward. Treat any single model’s factual output as provisional. Cross-check against another model, or better yet, against the primary source. The era of trusting a single LLM as a knowledge oracle ended before it began.
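For builders who want to act on that advice, a cross-check can be as simple as the sketch below. It assumes the verdicts have already been collected from each model as plain labels; the `cross_check` function, its `min_agreement` threshold, and the model names in the example are illustrative assumptions, not part of the study or of any vendor API.

```python
from collections import Counter


def cross_check(verdicts, min_agreement=1.0):
    """Accept a label only when enough models agree on it.

    verdicts: dict mapping a model name to its label for one claim.
    min_agreement: share of models that must back the top label
        (1.0 means unanimous; a hypothetical default, tune to taste).
    Returns (label, status); label is None when the claim needs review.
    """
    if not verdicts:
        raise ValueError("need at least one verdict")
    counts = Counter(verdicts.values())
    label, votes = counts.most_common(1)[0]
    if votes / len(verdicts) >= min_agreement:
        return label, "agreed"
    return None, "check primary source"


# Hypothetical verdicts for a single claim.
result = cross_check({"model_a": "true", "model_b": "false", "model_c": "true"})
print(result)
```

Agreement between models is still not proof of correctness, since models can share the same mistaken source, so the fallback to the primary source remains the stronger check.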

