A study from Stanford Law School published this week found that law professors overwhelmingly prefer AI-generated answers to student questions over responses written by their fellow instructors. In blind evaluations of nearly 3,000 anonymized comparisons across 16 U.S. law schools, AI responses won 75% of head-to-head matchups. The finding challenges a core assumption about large language models: that they cannot handle domains requiring judgment, nuance, and reasoned argument.
The study, titled “Law Professors Prefer AI Over Peer Answers,” was led by Stanford Law Professor Julian Nyarko and co-authored with researchers from Yale, NYU, the University of Chicago, and other institutions. Participants created 40 representative contract law questions that students might ask after class or during office hours. Each professor wrote their own answer and then evaluated responses without knowing whether they came from AI or another professor. The AI systems performed comparably to the best human instructor in the study.
“We were frankly surprised by the magnitude of the results,” Nyarko said in the press release. “These weren’t just simple questions with obvious answers. Many of them required synthesizing complex material, applying it to new situations, and explaining legal concepts in ways that would help students develop their own analytical skills.”
The most striking finding may not be the win rate itself but the harm rate. Professors flagged AI responses as pedagogically harmful only 3.5% of the time, compared with 12% for peer-written answers. That is roughly a 3.4x difference. The AI was not just preferred. It was trusted.
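The size of that gap can be checked with simple arithmetic. The sketch below uses only the two harm rates reported above; the variable names are illustrative and do not come from the study.

```python
# Harm rates reported in the article, expressed as percentages.
ai_harm_rate = 3.5      # share of AI responses flagged as pedagogically harmful
peer_harm_rate = 12.0   # share of peer-written answers flagged as harmful

# Relative gap: how many times more often peer answers were flagged.
ratio = peer_harm_rate / ai_harm_rate

# Absolute gap, in percentage points.
gap_points = peer_harm_rate - ai_harm_rate

print(f"Peer answers were flagged {ratio:.1f}x as often as AI answers")
print(f"Absolute gap: {gap_points:.1f} percentage points")
```

Both figures describe the same data: the ratio is the relative gap and the percentage-point difference is the absolute one.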
The study is notable because previous AI evaluations have focused on subjects with clear right-or-wrong answers, like multiple-choice bar exam questions or medical licensing tests. Legal reasoning demands something different: careful analysis of competing arguments, defensible conclusions, and the ability to explain concepts in a way that builds a student’s own analytical skills.
“In most fields where AI gets tested, there’s a right answer. In law, there often isn’t,” said Sarath Sanga, co-author and professor at Yale Law School. “Two opposing arguments can both be good. What we wanted to know is whether AI can meet the latent professional standard that lawyers use to evaluate each other’s arguments. In this case, the answer was yes.”
The research team took precautions to ensure validity. They calibrated AI responses to match the length and structure of human answers. They used multiple evaluation methods. They had professors assess whether responses might mislead or confuse students. The study also examined specific AI models, including commercial tutoring systems and Google’s NotebookLM, finding varying levels of performance. Even when context limitations affected AI responses, professors still frequently preferred them to human-written alternatives.
This is not a story about AI replacing law professors. It is a story about AI meeting a professional standard that legal educators use to evaluate each other. That is a different bar entirely.
Nyarko cautioned against wholesale adoption. “Our study evaluates the quality of answers given by AI tools. But how to implement these tools to most effectively improve student learning is still an open question,” he said. “The conversation should shift from whether AI can give accurate, high quality responses to how we can deploy it responsibly to the benefit of our students.”
The study arrives as law schools nationwide grapple with integrating AI tools while maintaining rigorous academic standards. Some institutions have embraced AI experimentation. Others remain cautious about risks including hallucinations, overreliance, and the erosion of critical thinking skills.
The implications extend beyond legal education. If expert evaluators prefer AI answers to those of their own peers in a field defined by ambiguity and reasoned argument, the same dynamic may apply to other judgment-rich professions: medicine, journalism, policy analysis, management consulting. The question is not whether AI can handle nuance. The question is whether the profession’s own evaluation methods can keep up.
For AI builders, the study offers a useful benchmark. The researchers did not just ask whether the AI got the facts right. They asked whether the AI met a professional standard for reasoning, explanation, and pedagogical safety. That is a harder test than most benchmarks in the field. And the AI passed.
The study also points to a practical opportunity. If AI tutors can provide high-quality, on-demand support in judgment-rich fields, they may broaden access to expert guidance. First author Alejandro Salinas, a researcher at Nyarko’s liftlab, emphasized this point: “We find that, when evaluated by legal educators, AI tutors can offer high-quality, on-demand support that complements classroom instruction, and may broaden access to expert guidance.”
The open question is deployment. The study shows that AI can produce answers that professors prefer. It does not show that those answers improve student learning outcomes over time. That is a separate study, and a harder one to run.
What the study does show is that blanket skepticism of AI in professional education is hard to sustain. The data does not support it. The burden of proof has shifted. The conversation now is about how, not whether.