Five frontier large language models disagree on 67 percent of 1,000 real-world fact-check claims, according to a study posted by researcher Alex Lenz. The finding will not surprise anyone who has used these models seriously, but it points to a problem the industry has not solved.
Lenz tested GPT-4o, Claude 4, Gemini 2.5 Pro, Llama 4, and DeepSeek-V3 on a set of claims drawn from fact-checking databases. The models were asked to determine whether each claim was true, false, or uncertain. On roughly two out of every three claims, the models gave different answers. Agreement was highest on obviously false statements and lowest on claims involving nuance, context, or recency.
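To make the headline number concrete, here is a minimal sketch of how a disagreement rate like this could be computed. It assumes that "disagree" means the models did not all return the same label for a claim; the study's exact definition is not given here, and the model names and verdicts below are made up for illustration.

```python
from typing import Dict, List


def disagreement_rate(verdicts: List[Dict[str, str]]) -> float:
    """Fraction of claims on which the models did not all give the same label.

    Each item maps a model name to its label ("true", "false", "uncertain").
    Assumption: a claim counts as a disagreement if at least one model differs.
    """
    if not verdicts:
        raise ValueError("need at least one claim")
    # A claim is "split" when its labels collapse to more than one distinct value.
    split = sum(1 for per_model in verdicts if len(set(per_model.values())) > 1)
    return split / len(verdicts)


# Hypothetical verdicts for three claims from three placeholder models.
sample = [
    {"model_a": "false", "model_b": "false", "model_c": "false"},
    {"model_a": "true", "model_b": "uncertain", "model_c": "true"},
    {"model_a": "true", "model_b": "false", "model_c": "uncertain"},
]
rate = disagreement_rate(sample)
print(f"{rate:.0%} of claims had at least one dissenting model")
```

Under this definition a single dissenting model is enough to mark a claim as disputed, so the rate climbs as more models are added. A stricter measure, such as the share of claims with no majority label, would give a lower figure on the same data.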
The implication is uncomfortable. If these models cannot agree on basic factual questions about the world, they cannot serve as reliable knowledge tools. A user asking one model whether a specific policy took effect last month might get a confident yes. Another model, given the same query, might say no, or hedge. The user has no way to adjudicate between them.
This is not a benchmark artifact. Lenz used real claims from PolitiFact and other fact-checking organizations, not synthetic trivia. The disagreement rate reflects genuine ambiguity in the training data, differences in how models handle conflicting sources, and the fundamental difficulty of compressing the world into a fixed set of weights.
The industry response so far has been to add citations and retrieval. That helps with recency and verifiability, but it does not solve the underlying disagreement problem. Two models given the same retrieval context can still reach opposite conclusions about what the retrieved text means.
For builders, the takeaway is straightforward. Treat any single model’s factual output as provisional. Cross-check against another model, or better yet, against the primary source. The era of trusting a single LLM as a knowledge oracle ended before it began.
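One way to act on that advice is sketched below, under stated assumptions. The function name `cross_check` and the placeholder model names are hypothetical, and the answers are assumed to have been collected already as "true", "false", or "uncertain" labels. The sketch accepts an answer only provisionally when every model agrees, and otherwise flags the claim for a check against the primary source.

```python
from typing import Dict, Optional, Tuple


def cross_check(answers: Dict[str, str]) -> Tuple[Optional[str], bool]:
    """Return (provisional_label, needs_primary_source_check).

    `answers` maps a model name to its label for one claim.
    Unanimous answers are kept as provisional; any split returns no label
    and asks for a primary-source check instead.
    """
    if not answers:
        raise ValueError("need at least one model answer")
    labels = set(answers.values())
    if len(labels) == 1:
        # Agreement is not proof: the label is still only provisional.
        return labels.pop(), False
    return None, True


# Hypothetical answers from two placeholder models.
agreed = cross_check({"model_a": "false", "model_b": "false"})
disputed = cross_check({"model_a": "true", "model_b": "uncertain"})
print(agreed)    # label kept, no review flag
print(disputed)  # no label, flagged for review
```

Requiring unanimity is a deliberately conservative choice. Models that share training data can agree and still be wrong, so the primary source remains the stronger check.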