Westlaw AI and Lexis+ AI Still Hallucinate: What the Stanford Study Actually Found

There is a version of this story that the legal technology industry would prefer lawyers to believe. It goes like this: yes, ChatGPT hallucinates — that is well documented and embarrassing — but the serious tools, the ones built specifically for lawyers, the ones sold by the same companies that have powered legal research for decades, those are different. Those use something called retrieval-augmented generation. Those are grounded in real databases. Those, in the words of LexisNexis’s own marketing material, deliver “100% hallucination-free linked legal citations.”

Researchers at Stanford University put that claim to the test. The results should be required reading for every lawyer billing time to a legal AI subscription.

Published as a preprint in May 2024 and subsequently peer-reviewed and published in the Journal of Empirical Legal Studies in 2025, the study — “Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools” — was conducted by researchers at Stanford’s RegLab and its Institute for Human-Centered Artificial Intelligence (HAI). Its authors are Varun Magesh, Faiz Surani, Matthew Dahl, Mirac Suzgun, Christopher D. Manning, and Daniel E. Ho, the William Benjamin Scott and Luna M. Scott Professor of Law at Stanford and director of the RegLab. It is, to date, the most rigorous independent empirical evaluation of the AI tools that sit at the heart of modern legal practice.

What they found is not what LexisNexis or Thomson Reuters would choose to headline.


What the Study Actually Did

Before getting to the numbers, it is worth understanding the methodology — because this study generated controversy as well as findings, and lawyers deserve the full picture rather than a selective summary.

The Stanford team constructed a preregistered dataset of over 200 open-ended legal queries designed to reflect the kinds of questions lawyers actually ask in the course of research. The queries spanned four categories: general doctrinal research questions; jurisdiction- or time-specific questions (such as circuit splits and recent changes in the law); false-premise questions (designed to test whether the system would correct a mistaken assumption); and factual recall questions requiring no legal interpretation. The preregistration — the practice of publicly committing to a study design before running it — is a mark of scientific rigour that distinguishes this work from marketing benchmarks run by the vendors themselves.
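To make that taxonomy concrete, here is a minimal sketch, our own illustration rather than the study’s actual dataset or code, of how a preregistered query set spanning the four categories might be represented. The query text is entirely hypothetical:

```python
from dataclasses import dataclass
from enum import Enum

class QueryCategory(Enum):
    """The four query types described in the Stanford study."""
    GENERAL_RESEARCH = "general doctrinal research"
    JURISDICTION_OR_TIME = "jurisdiction- or time-specific"
    FALSE_PREMISE = "false premise"
    FACTUAL_RECALL = "factual recall"

@dataclass(frozen=True)
class BenchmarkQuery:
    query_id: int
    category: QueryCategory
    text: str

# Hypothetical examples only; these are not queries from the study's dataset.
QUERIES = [
    BenchmarkQuery(1, QueryCategory.GENERAL_RESEARCH,
                   "What is the standard for piercing the corporate veil in Delaware?"),
    BenchmarkQuery(2, QueryCategory.JURISDICTION_OR_TIME,
                   "Is there a circuit split on ascertainability in class actions?"),
    BenchmarkQuery(3, QueryCategory.FALSE_PREMISE,
                   "Why did Justice Ginsburg dissent in Obergefell v. Hodges?"),
    BenchmarkQuery(4, QueryCategory.FACTUAL_RECALL,
                   "Who wrote the majority opinion in Kontrick v. Ryan?"),
]
```

The point of preregistering a set like this is that the categories and queries are fixed, publicly, before any tool is tested, so results cannot be cherry-picked after the fact.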

The tools tested were Lexis+ AI, the generative AI platform built by LexisNexis for general legal research; Westlaw’s AI-Assisted Research (AI-AR), built by Thomson Reuters on technology acquired with the $650 million purchase of Casetext; and Ask Practical Law AI, a more limited Thomson Reuters product that draws on Practical Law’s database of attorney-edited practice guides rather than primary case law. For comparison, GPT-4 without any legal database integration was included as a baseline.

A hallucinated response was defined as one that was either factually incorrect (describing the law wrongly or making a factual error) or “misgrounded” — meaning the AI cited a real source, but the source did not actually support the claim being made. This second category matters enormously for lawyers: a citation that exists but does not say what the AI claims it says is, in some ways, more dangerous than a citation that simply does not exist. At least a non-existent citation is detectable. A real case that has been mischaracterised requires a lawyer to read it carefully enough to catch the distortion.
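The rubric is easier to see as a decision rule. Here is a minimal sketch in Python, our own illustration of the definitions above rather than the researchers’ grading code:

```python
from enum import Enum

class Verdict(Enum):
    CORRECT = "correct"
    INCORRECT = "incorrect"      # misstates the law or facts, or invents a source
    MISGROUNDED = "misgrounded"  # real source cited, but it does not support the claim

def grade_response(statement_is_accurate: bool,
                   cited_source_exists: bool,
                   source_supports_claim: bool) -> Verdict:
    """Illustrative scoring rule: a response counts as a hallucination
    if it is factually wrong, cites an invented source, or cites a real
    source that does not back the proposition."""
    if not statement_is_accurate or not cited_source_exists:
        return Verdict.INCORRECT
    if not source_supports_claim:
        return Verdict.MISGROUNDED
    return Verdict.CORRECT
```

Under this rule, both INCORRECT and MISGROUNDED count as hallucinations; only a response that is accurate and properly supported escapes the label.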


The Numbers: What the Study Found

The headline finding is stark.

Even these purpose-built, legally-trained, RAG-powered tools hallucinate at rates that should alarm any lawyer relying on them without rigorous verification.

Lexis+ AI — the highest-performing system tested — answered just 65% of queries accurately. It produced hallucinated responses on more than 17% of queries. Westlaw’s AI-Assisted Research performed significantly worse: accurate on just 42% of queries, and hallucinating at nearly twice the rate of Lexis+ AI — approximately 33% of the time. Ask Practical Law AI refused to answer queries outright more than 60% of the time, and when it did respond, its accuracy rate was even lower.

To put the 33% figure in concrete terms: if a lawyer used Westlaw AI-Assisted Research to answer three legal questions during a research session, on average one of those answers would contain an error or a misgrounded citation, and the chance that at least one of the three is bad is roughly 70%.
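For readers who want the arithmetic spelled out, here is a quick back-of-the-envelope calculation. It assumes, purely for illustration, a 33% per-query hallucination rate and independence between queries:

```python
# Back-of-the-envelope exposure estimate. The 33% rate comes from the
# study's Westlaw AI-AR finding; the independence assumption is our own
# simplification for illustration.
p_hallucination = 0.33
n_queries = 3

expected_errors = n_queries * p_hallucination            # about 1 expected error
p_at_least_one = 1 - (1 - p_hallucination) ** n_queries  # about 0.70

print(f"Expected hallucinated answers in {n_queries} queries: {expected_errors:.2f}")
print(f"Chance of at least one hallucination: {p_at_least_one:.0%}")
# Expected hallucinated answers in 3 queries: 0.99
# Chance of at least one hallucination: 70%
```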

For comparison, general-purpose models like ChatGPT, Claude, and Llama — without any legal database backing — hallucinated on legal queries between 58% and 82% of the time in the researchers’ earlier study. So the legal-specific tools are meaningfully better than general AI chatbots. But “better than GPT-4 used raw” is not the standard lawyers need. The standard lawyers need is: can I rely on this output without independently verifying every citation and legal proposition? On the evidence of this study, the answer for all of these tools is: no.


The Methodology Dispute — And Why It Doesn’t Change the Bottom Line

The study was not received quietly.

Thomson Reuters objected immediately and forcefully, pointing out that the researchers had tested Ask Practical Law AI — a product designed for practice guidance, not primary law research — as a proxy for Westlaw’s general research AI. The company noted that it had denied Stanford multiple requests for access to its AI-Assisted Research product, and that testing Practical Law AI on case law queries was effectively testing a tool for a purpose it was not built to serve. Artificial Lawyer, the legal technology publication, called the study’s initial methodology “problematic” on similar grounds.

Stanford acknowledged the criticism. Professor Daniel Ho confirmed that Thomson Reuters had denied their access requests, and that this in itself was a finding worth noting: the opacity of these commercial systems makes independent verification of vendor claims structurally difficult. The researchers then ran an updated analysis on the AI-Assisted Research product after Thomson Reuters belatedly provided access. The updated findings, if anything, made the picture clearer: Westlaw’s AI-Assisted Research hallucinated at nearly twice the rate of Lexis+ AI — 33% compared to 17%.

LexisNexis, for its part, disputed the findings through its Chief Product Officer, Jeff Pfeifer, who stated that the company’s own internal data suggested a much lower hallucination rate and argued that the study’s criteria were better suited to measuring “answer and citation quality” than hallucination specifically.

Both companies’ objections deserve to be taken seriously. This is not a clean, uncontested piece of research. The methodology was imperfect in its initial form, the access restrictions created real limitations, and the vendors’ own systems have been updated since the study was conducted. Any fair account of this study has to include those caveats.

But here is what none of those objections change: even accepting every methodological criticism, even updating the Thomson Reuters results to the AI-Assisted Research product, even acknowledging that the products have improved since testing — Stanford’s researchers, using a rigorous preregistered design, found that these tools hallucinate at rates between 17% and 33%. And the vendors’ response to that finding was not to publish their own independent benchmarks proving otherwise. It was to dispute the methodology and point to internal data they have not made public.

In the absence of transparent, third-party benchmarking — which the Stanford researchers explicitly called for — lawyers are being asked to trust marketing claims that have not been independently verified. That is a professional responsibility problem, not just a product quality question.


What the Hallucinations Actually Look Like

Abstract percentages are one thing. Concrete examples are another, and the study provides them.

In one documented example, Westlaw’s AI-Assisted Research claimed that a paragraph in the Federal Rules of Bankruptcy Procedure states that deadlines are jurisdictional. No such paragraph exists. And the underlying legal proposition is itself questionable in light of the Supreme Court’s holding in Kontrick v. Ryan, 540 U.S. 443 (2004), which held that FRBP deadlines under a related provision were not jurisdictional. The system did not merely fabricate a citation — it also stated a legal rule that Supreme Court precedent had specifically addressed in the opposite direction.

In another example, Lexis+ AI was asked a question about the standard of review for abortion restrictions following the Supreme Court’s decision in Dobbs v. Jackson Women’s Health Organization, 597 U.S. 215 (2022). The system gave an answer grounded in Casey and the undue burden standard — a framework that Dobbs had expressly overruled. A lawyer who relied on that answer without verification would have submitted research based on superseded constitutional law.

A third example, from Ask Practical Law AI, involved a question asking why Justice Ginsburg dissented in Obergefell v. Hodges, the Supreme Court’s same-sex marriage decision. The premise is false: Justice Ginsburg joined the Obergefell majority. The system not only failed to correct the mistaken premise but added further false information about the case.

These are not obscure edge cases designed to trip up the systems. They are the kinds of errors that can survive review by a lawyer who is treating these tools as research shortcuts rather than research starting points — which, given the efficiency-focused marketing of these products, is precisely how many lawyers are being encouraged to use them.


The Marketing Claims That Started This

To understand why the Stanford study matters, you need to understand what these companies were saying before it was published.

LexisNexis stated, in marketing materials quoted directly in the study: “Unlike other vendors, however, Lexis+ AI delivers 100% hallucination-free linked legal citations connected to source documents, grounding those responses in authoritative resources that can be relied upon with confidence.”

Casetext, before its acquisition by Thomson Reuters, claimed that its CoCounsel product “does not make up facts, or ‘hallucinate,’ because we’ve implemented controls to limit CoCounsel to answering from known, reliable data sources.”

Thomson Reuters stated that its approach “avoid[s] hallucinations” by relying on trusted content within Westlaw and building in checks and balances.

These are not cautious, hedged statements. They are affirmative claims that the hallucination problem has been solved — or, at minimum, that their products operate in a categorically different way from general-purpose AI. Following publication of the Stanford study, LexisNexis quietly walked back the “100% hallucination-free” language, clarifying that the promise applied only to “linked legal citations” and that no AI tool could guarantee 100% accuracy. Mike Dahn, head of Westlaw Product Management at Thomson Reuters, wrote that the company “makes it very clear with customers that the product can produce inaccuracies” — a statement that sits awkwardly alongside the prior “avoid hallucinations” marketing.

The legal profession deserves vendors who are honest about the limitations of their products from the outset, not after independent researchers call those limitations out. The gap between what was being claimed and what the evidence shows is not a minor discrepancy. It is a significant failure of transparency toward a profession in which errors carry professional, financial, and reputational consequences.


What This Means for Your Duty of Competence

This is where the abstract becomes practical and urgent.

Comment 8 to Model Rule 1.1 of the ABA Model Rules of Professional Conduct states that competent lawyering requires “keeping abreast of changes in the law and its practice, including the benefits and risks associated with relevant technology.” Most state bar associations have adopted equivalent provisions. A lawyer who is using AI legal research tools without understanding their error rates — and without verifying their outputs — is arguably not meeting that standard.

The Stanford study’s authors put it plainly: lawyers using these tools “may find themselves having to verify each and every proposition and citation provided by these tools, undercutting the stated efficiency gains that legal AI tools are supposed to provide.” That is a pointed observation. If the verification burden required to use these tools safely approaches the research burden of doing the work yourself, the efficiency case for the tools collapses — or at least requires a more honest accounting.

The practical implication is not that lawyers should stop using these tools. They remain considerably more useful than general-purpose chatbots for legal research: Lexis+ AI answering accurately 65% of the time is genuinely better than GPT-4’s raw performance on legal queries. But “better than nothing” and “better than ChatGPT” are not the same as “safe to rely on without verification.”

What this means in practice: every output from these tools should be treated as a research lead, not a research conclusion. Citations need to be pulled and read. Legal propositions need to be verified against primary sources. The tool is accelerating the discovery of potentially relevant material; the lawyer is still responsible for confirming that the material says what the tool claims it says.


The Transparency Problem the Study Exposed

The most important long-term finding from this research may not be the hallucination rates themselves but the structural condition that makes independent verification of those rates so difficult.

When Stanford researchers asked Thomson Reuters for access to test its AI-Assisted Research product, they were refused — three separate times. They ended up testing Ask Practical Law AI because it was the only Thomson Reuters product they could access. The company later provided access after the preprint was published and public pressure mounted, but the default posture was denial.

This matters because it means lawyers are currently in a position of trusting vendor marketing claims that have not been and cannot easily be independently tested. LexisNexis’s own internal data “suggests a much lower rate of hallucination” than the Stanford study found — but that data has not been published and cannot be examined. Thomson Reuters disputes the methodology — but has not published its own preregistered benchmark showing what the correct methodology would reveal.

As the Stanford researchers argued, what the legal profession needs is public benchmarking of these tools — conducted independently, using preregistered methodology, updated regularly as the products improve. The ABA and state bar associations should be requiring it. Courts, which are increasingly setting rules around AI use in filings, should be demanding it. Until it exists, lawyers who use these tools bear the full verification burden themselves — because the vendors have not provided the evidence needed to justify any other approach.


A Practical Verification Checklist for Legal AI Research

Based on the findings of the Stanford study and the professional responsibility framework it implicates, here is what responsible use of Westlaw AI, Lexis+ AI, and similar tools looks like in practice:

Before you begin:

  • Treat every AI research session as producing leads, not conclusions
  • Understand which product you are actually using — Westlaw AI-Assisted Research, Lexis+ AI, and Ask Practical Law AI are distinct products with different performance profiles
  • Adjust your verification expectations accordingly — Westlaw AI-AR hallucinates at roughly double the rate of Lexis+ AI based on current evidence

For every citation the AI provides:

  • Pull the actual case or source; do not rely on the AI’s characterisation of its holding
  • Check whether the case is still good law using Shepard’s or KeyCite — the AI may be drawing on a case that has been overruled
  • Verify that the quoted or paraphrased language actually appears in the source; misgrounded citations (real cases, wrong propositions) are harder to detect than invented ones

For every legal proposition the AI states:

  • Ask whether the AI has correctly identified the current legal standard, particularly in areas of recent change (constitutional law post-Dobbs, regulatory law following major agency decisions, etc.)
  • Be especially cautious with jurisdiction-specific questions — the Stanford study found these were among the most error-prone query types
  • Pay attention to false-premise errors: if your question contains a mistaken assumption, these tools often fail to correct it and instead amplify it

Before signing any filing:

  • Do not cite any authority you have not personally verified exists and says what you believe it says
  • Document your verification process — as courts increasingly impose AI disclosure requirements, an audit trail of your verification steps is professional protection, not administrative overhead; a minimal logging sketch follows below
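For lawyers, or the technologists who support them, who want to operationalise that audit trail, here is a minimal sketch of a citation-verification log. The field names and CSV format are hypothetical choices of ours, not a prescribed or court-endorsed standard:

```python
import csv
from datetime import date

# Hypothetical audit-trail schema; adapt the fields to your own practice.
FIELDS = ["date", "tool", "citation", "source_pulled",
          "still_good_law", "supports_proposition", "notes"]

def log_verification(path: str, **entry) -> None:
    """Append one citation-verification record to a CSV audit trail."""
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if f.tell() == 0:  # brand-new file: write the header row first
            writer.writeheader()
        writer.writerow({"date": date.today().isoformat(), **entry})

log_verification("verification_log.csv",
                 tool="Lexis+ AI",
                 citation="Kontrick v. Ryan, 540 U.S. 443 (2004)",
                 source_pulled=True,
                 still_good_law=True,
                 supports_proposition=True,
                 notes="Read in full; confirmed holding on jurisdictional deadlines.")
```

Even a simple contemporaneous log like this demonstrates diligence if a court later asks how AI-assisted research was verified.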

The Bottom Line for Lawyers

The Stanford study does not say that Westlaw AI and Lexis+ AI are useless. It says they are substantially better than general-purpose AI for legal research, and still wrong — sometimes badly wrong — on a significant percentage of queries. Whatever speed advantage they offer over traditional manual research has to be weighed against that error rate.

That is a tool worth using carefully. It is not a tool worth trusting uncritically, and it is emphatically not a tool whose output can be placed in a filing without independent verification of the key citations and legal propositions.

The vendors have not yet earned the right to ask lawyers to trust their systems. That trust is built through transparency: published benchmarks, independent auditing, honest acknowledgment of error rates, and continuous public reporting on improvements. Until those conditions are met, the professional responsibility of verifying AI outputs in full sits squarely where it has always sat — with the lawyer whose name goes on the brief.

The hallucination problem in legal AI has not been solved. Not by LexisNexis. Not by Thomson Reuters. Not by any legal AI product currently on the market. The research is unambiguous on this point, and every lawyer using these tools needs to proceed accordingly.


Further Reading

  • [AI Hallucinations in Court: Every Lawyer Needs to Read This Before Their Next Filing] — Our flagship tracker of sanctions cases and court decisions involving AI hallucinations. Updated regularly.
  • [Harvey vs Clio vs CoCounsel vs Westlaw AI: The Honest Lawyer’s Guide to Legal-Specific AI Tools] — A full comparison of the major legal AI platforms, including accuracy and verification features.
  • [ChatGPT vs Claude: Pros and Cons for Legal Professionals] — How general-purpose AI tools compare for legal work, and when (if ever) to use them on client matters.
  • [What Is Agentic AI — And Why Every Lawyer Needs to Understand It Before 2027] — As legal AI moves toward autonomous task completion, the hallucination problem takes on new dimensions.

The Stanford study referenced throughout this post is: Magesh, V., Surani, F., Dahl, M., Suzgun, M., Manning, C.D., and Ho, D.E. (2025). “Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools.” Journal of Empirical Legal Studies. Originally published as a preprint, arXiv:2405.20362, May 2024.


Frequently Asked Questions

Does Westlaw AI hallucinate? Yes. According to Stanford’s RegLab and HAI study, Westlaw’s AI-Assisted Research product hallucinated on approximately 33% of queries tested — nearly twice the rate of Lexis+ AI. Thomson Reuters disputed aspects of the study’s initial methodology, but the updated analysis — which used Westlaw AI-Assisted Research directly — confirmed the higher error rate. Lawyers should verify all outputs from this tool independently.

Does Lexis+ AI hallucinate? Yes. The Stanford study found Lexis+ AI produced incorrect or misgrounded responses on more than 17% of queries. LexisNexis disputed this figure and argued its own internal data showed lower rates, but has not published that data publicly. Lexis+ AI was the highest-performing tool tested and outperformed Westlaw AI-AR significantly, but still requires verification.

Is it safe to use Westlaw AI or Lexis+ AI for legal research? These tools are substantially better than general-purpose AI for legal research, and many lawyers use them productively. But “safe” depends on what you do with the output. Using them as a starting point and verifying all citations and legal propositions against primary sources is appropriate professional practice. Relying on their outputs without verification — particularly before filing — is not.

What is the difference between a hallucinated citation and a misgrounded citation? A hallucinated citation is a case or authority that simply does not exist — it was invented by the AI. A misgrounded citation is a real case or source that the AI has mischaracterised — the source exists, but it does not say what the AI claims. Misgrounded citations are arguably more dangerous because they are harder to detect: a basic citation check will confirm the case exists, but only reading the case will reveal that it has been misrepresented.

Are legal AI tools getting better over time? Both LexisNexis and Thomson Reuters have committed to continuous improvement of their systems, and both companies argue their products have advanced since the Stanford study was conducted. That may be true, but neither has published independent benchmark results to substantiate it. The appropriate professional posture is to assume hallucination risk remains material until transparent, third-party benchmarking says otherwise.


Subscribe to LegalAIWorld Weekly for the latest research, sanctions cases, and practical guidance on AI in legal practice. New posts every week.
