Research / T-2026-5798

Law Professors Prefer AI Answers to Their Peers' in Stanford Study

Tessera Newsroom · 4 min read · June 3, 2026

Source: AI outperforms law professors in Stanford Law study (law.stanford.edu)

A study from Stanford Law School published this week found that law professors overwhelmingly prefer AI-generated answers to student questions over responses written by their fellow instructors. In blind evaluations of nearly 3,000 anonymized comparisons across 16 U.S. law schools, AI responses won 75% of head-to-head matchups. The finding challenges a core assumption about large language models: that they cannot handle domains requiring judgment, nuance, and reasoned argument.

The study, titled “Law Professors Prefer AI Over Peer Answers,” was led by Stanford Law Professor Julian Nyarko and co-authored with researchers from Yale, NYU, the University of Chicago, and other institutions. Participants created 40 representative contract law questions that students might ask after class or during office hours. Each professor wrote their own answer and then evaluated responses without knowing whether they came from AI or another professor. The AI systems performed comparably to the best human instructor in the study.

“We were frankly surprised by the magnitude of the results,” Nyarko said in the press release. “These weren’t just simple questions with obvious answers. Many of them required synthesizing complex material, applying it to new situations, and explaining legal concepts in ways that would help students develop their own analytical skills.”

The most striking finding may not be the win rate itself but the harm rate. Professors flagged AI responses as pedagogically harmful only 3.5% of the time, compared to 12% for peer-written answers. That is roughly a 3.4x difference. The AI was not just preferred. It was trusted.

The study is notable because most previous AI evaluations have focused on subjects with clear right-or-wrong answers, like multiple-choice bar exam questions or medical licensing tests. Legal reasoning demands something different: careful analysis of competing arguments, defensible conclusions, and the ability to explain concepts in a way that builds a student’s own analytical skills.

“In most fields where AI gets tested, there’s a right answer. In law, there often isn’t,” said Sarath Sanga, co-author and professor at Yale Law School. “Two opposing arguments can both be good. What we wanted to know is whether AI can meet the latent professional standard that lawyers use to evaluate each other’s arguments. In this case, the answer was yes.”

The research team took precautions to ensure validity. They calibrated AI responses to match the length and structure of human answers. They used multiple evaluation methods. They had professors assess whether responses might mislead or confuse students. The study also examined specific AI models, including commercial tutoring systems and Google’s NotebookLM, finding varying levels of performance. Even when context limitations affected AI responses, professors still frequently preferred them to human-written alternatives.

This is not a story about AI replacing law professors. It is a story about AI meeting a professional standard that legal educators use to evaluate each other. That is a different bar entirely.

Nyarko cautioned against wholesale adoption. “Our study evaluates the quality of answers given by AI tools. But how to implement these tools to most effectively improve student learning is still an open question,” he said. “The conversation should shift from whether AI can give accurate, high quality responses to how we can deploy it responsibly to the benefit of our students.”

The study arrives as law schools nationwide grapple with integrating AI tools while maintaining rigorous academic standards. Some institutions have embraced AI experimentation. Others remain cautious about risks including hallucinations, overreliance, and the erosion of critical thinking skills.

The implications extend beyond legal education. If AI can outperform domain experts in a field defined by ambiguity and reasoned argument, the same dynamic may apply to other judgment-rich professions: medicine, journalism, policy analysis, management consulting. The question is not whether AI can handle nuance. The question is whether the profession’s own evaluation methods can keep up.

For AI builders, the study offers a useful benchmark. The researchers did not just ask whether the AI got the facts right. They asked whether the AI met a professional standard for reasoning, explanation, and pedagogical safety. That is a harder test than most benchmarks in the field. And the AI passed.

The study also points to a practical opportunity. If AI tutors can provide high-quality, on-demand support in judgment-rich fields, they may broaden access to expert guidance. First author Alejandro Salinas, a researcher at Nyarko’s liftlab, emphasized this point: “We find that, when evaluated by legal educators, AI tutors can offer high-quality, on-demand support that complements classroom instruction, and may broaden access to expert guidance.”

The open question is deployment. The study shows that AI can produce answers that professors prefer. It does not show that those answers improve student learning outcomes over time. That is a separate study, and a harder one to run.

What the study does show is that blanket skepticism of AI in professional education is hard to sustain. The data does not support it. The burden of proof has shifted. The conversation now is about how, not whether.
