It’s Not the AI. It’s the Lawyer: We Don’t Have a Hallucination Problem, We Have a Serious Ethics Problem

Posted on June 30, 2026 by Guest Blogger

Guest Post: Charlie Amiot

I. How We Got Here

Recently I was chatting with a friend from law school who has been a practicing lawyer for seven-plus years and whom I’ve known for 11+ years. I have a great amount of respect for this person’s thoughts and opinions on nearly any subject (admittedly rare for me). I brought general AI usage into the conversation and they told me that they are staying away from AI altogether. The administrative judges in their jurisdiction were moving to ban AI outright. Allegedly this is in response to what the judges were seeing: fake citations, fake cases, fake lawyers in briefs filed with real courts.¹ Due to their particular line of work, they worried that any AI usage in any context could contaminate their work product. They weren’t willing to risk their reputation (or their employer’s reputation), their law license, cases, or income. Instead they chose to take a hardline position as the answer.

Most of us have heard the term hallucination and think we know what it means. For purposes of this article: hallucination is the term used to describe a large language model output that contains incorrect information that the model believes is correct. How incredibly human of it.

Many in the legal profession point to “hallucinations” as the scapegoat when something goes wrong in their filings.² They also tend to toss their law clerks, paralegals, junior lawyers, and student interns (real and fictitious) under the bus when asked who is really at fault for the hallucinations’ inclusion. Some have even blamed deadlines set by the court.

A hallucination is simply an incorrect statement. It stands alone. If you happen to be someone who has a set of encyclopedias sitting next to them, you’re undoubtedly sitting next to hundreds of hallucinations: some inserted at the time of printing, others facts that have mutated with the passage of time since. Any newspaper or magazine you pick up contains a hallucination. Many textbooks and reference materials contain hallucinations as well. Arguably, only in fiction can there be no hallucinations.³ We are otherwise surrounded by and exposed to them on a daily and hourly basis.

Continuously pointing to LLM hallucinations allows the word hallucination to do a whole lot of work at getting people off the hook of personal responsibility.

II. The Actual Problem Has a Name, and It’s Not “Hallucination”

Let’s be precise about what actually happened when a lawyer filed a brief containing citations to cases that don’t exist: a lawyer signed and submitted a document they had not read. That’s it. That’s the whole story.

The AI didn’t file the brief. The AI didn’t have a law license. The AI didn’t swear an oath, and it didn’t certify anything to the court. The lawyer did all of those things—and apparently did them without reading what they were certifying.

This has a name. Rule 11 of the Federal Rules of Civil Procedure requires that an attorney certify, after an inquiry reasonable under the circumstances, that the legal contentions in a filing are warranted by existing law.⁴ Courts have read that requirement to include actually checking whether the law you’re citing is still good law—failing to run a citator check has been held to violate Rule 11 on its own.⁵ ABA Model Rule 1.1 requires that lawyers provide competent representation, which includes the legal knowledge, skill, thoroughness, and preparation reasonably necessary for the representation.⁶ Rule 3.3 requires candor toward the tribunal—lawyers may not make false statements of law to a court, and they have an affirmative duty to correct one if they discover it.⁷ These rules did not change when generative AI broadly launched. They did not include an exception for outputs you didn’t generate yourself. They have never included such an exception, which is why we don’t typically accept “my paralegal wrote it” as a defense either.

What we are watching, dressed up in technical language, is a failure of basic professional responsibility. A doctor who countersigns a lab result they haven’t reviewed is not a victim of laboratory error. A structural engineer who stamps drawings they didn’t check is not a victim of drafting software. And a lawyer who files a brief they didn’t read is not a victim of AI hallucination. In each case, the professional had a duty to verify, possessed the means to verify, and chose not to. The tool that produced the underlying work product is beside the point.

The hallucination framing is doing exactly the work it’s designed to do: it makes the failure sound technical, mysterious, and external to the lawyer’s control. It isn’t any of those things.

III. The Literacy Failure That Made the PR Failure Possible

If the professional responsibility failure is the immediate problem, there’s a second failure nested underneath it that created the conditions for the first: a significant portion of the legal profession does not understand what AI tools actually do, and that ignorance is not evenly distributed across risk levels.

Here’s what a large language model is not doing when it drafts a brief: it is not retrieving documents from a legal database, reading them, and exercising legal judgement about them. It is predicting text—generating output that is statistically consistent with the patterns in its training data, which means it produces text that looks like a legal citation, formatted correctly, sounding authoritative, because it has been trained on enormous quantities of legal writing that contains real citations formatted exactly that way. It is not lying. It is not hallucinating in the clinical sense. It is doing precisely what it was designed to do, and what it produced is plausible-sounding output that happens to be wrong.

A practitioner who understood this would approach AI-drafted citations the way a careful researcher approaches any secondary source: as a starting point that requires verification, not a deliverable that requires a signature. The verification step isn’t technically demanding. Every major legal research platform provides citation-checking tools. At minimum, you can pull the case. The professional responsibility violation and the literacy failure are not separate problems—the literacy failure is why the professional responsibility failure seemed acceptable.

This matters because the positive case is genuinely strong. There are countless ways to use AI in legal work that carry no meaningful citation risk at all: drafting and editing prose, synthesizing large records, generating research memos that a lawyer then verifies, preparing for negotiation, managing correspondence. The citation problem is specific to one use case—asking an LLM to generate citations as if it were a legal research database—and it is nearly entirely preventable by one habit: read and verify what you’re about to put your name on. That habit isn’t new. It predates AI by at least 200 years.⁸

IV. The Ban Won’t Fix It—And May Make It Worse

Prohibition is a technology-governance strategy with a well-documented track record, and that track record is not good. Banning AI from court filings does not eliminate AI use in legal practice. It eliminates disclosed AI use. Lawyers who are currently using these tools carelessly will continue using them—without oversight, without any professional incentive to develop better habits, and without the profession building the infrastructure to train or regulate responsible use. The lawyers who will comply with a ban are, by and large, the ones who would have checked their citations anyway.

There’s a market dimension to this that deserves attention. A significant and growing number of legal AI products are, in technical terms, wrappers around general-purpose language models: some legal-specific training is added in, and the result is rebranded for legal audiences and sold at prices that reflect the prestige of the legal market rather than the sophistication of the underlying technology.⁹ Some of these products are sold aggressively to law firms and legal departments whose leadership is precisely credulous enough to be impressed by confident technical language and precisely ignorant enough not to notice when the product doesn’t actually do what’s claimed. Even where the underlying technology has matured, the institutions deploying it routinely fail to build the governance, training, and validation infrastructure that responsible use requires, a gap industry observers increasingly identify as the actual point of failure, not the model itself.¹⁰ The people who genuinely understand these systems are rarely the ones in the purchasing meetings. The result is that firms spend significant money on tools that don’t reduce AI risk; they just make AI risk more expensive. An outright ban accelerates this dynamic: it pushes usage further from visibility and toward unaccountable, unvetted, often overpriced private solutions that serve the vendor’s interests more reliably than the client’s.

The access-to-justice dimension is the one that should be keeping judges up at night, and it’s conspicuously absent from most ban discussions. AI tools used responsibly have genuine potential to reduce the cost of legal services, extend the reach of competent representation, and close gaps that have existed in this system for generations. The people who most need that closing are not the ones with BigLaw retainers. Banning AI doesn’t protect those clients. It protects the status quo that was already failing them.

V. What Should Actually Happen

The legal profession has a governance structure. It has bar associations, ethics rules, judicial authority, and law schools that control entry into the profession. These are not weak institutions. They are the ones that decide who gets an education and a license, what competence means, and what consequences attach to failing to meet it. The question is not whether they have the authority to address this problem. They do. The question is whether they are willing to use that authority to address the actual problem rather than the more comfortable one.

Enforcing existing ethics rules against the lawyers who filed unchecked briefs is not complicated. The rules already cover this. What appears to be missing is the will to apply them without the alibi of “the AI did it”—which, as established, is not a defense that survives scrutiny under the Federal Rules of Civil Procedure or the ABA Model Rules.

Beyond enforcement, the more durable fix is curricular. A law school that does not provide students with grounded, accurate AI literacy—not vendor-sponsored tutorials, not hand-wringing seminars, but genuine instruction in what these tools do, what they don’t do, and what professional responsibility looks like in a practice environment where they are ubiquitous—is not preparing lawyers for the profession they are entering. That is a failure of institutional responsibility, and it is one that prospective students, faculty, and accreditors are in a position to name and pressure. It will be worth watching, over the next several years, whether bar admission data and practice location choices start to reflect attorneys voting with their feet toward jurisdictions that have developed coherent AI frameworks rather than reflexive bans.

Law librarians reading this are not bystanders to any of it. Legal research instruction, information literacy, and the professional competence to evaluate and verify sources have always been the core of what law librarians teach and model. The AI context doesn’t change that mission—it makes it more urgent and more visible.

The lawyers who know how to use these tools carefully are, in a meaningful number of cases, the ones who received genuine legal research education from people who cared about getting it right. That instruction doesn’t happen without adequate staffing, and law library staffing has been moving in exactly the wrong direction for years. Law librarians are among the most underpaid professionals in legal education relative to the expertise they hold and the institutional function they serve.¹¹ Positions go unfilled. Existing staff absorb expanding mandates without additional support or compensation. Effective leadership capable of building and sustaining a real AI literacy curriculum is not inevitable—it has to be resourced, prioritized, and protected. You cannot instruct a generation of lawyers in responsible AI use with a skeleton crew and a budget that hasn’t kept pace with the problem. If law schools are serious about preparing students for modern practice, the library isn’t where you find efficiencies. It’s where you invest.

The legal profession does not need an AI ban. It needs accountability applied to the people who failed to meet existing standards, literacy built into the pipeline before those people get licensed, and the collective intellectual honesty to stop blaming the tool for choices that were made by lawyers. What we are watching is not a new problem created by new technology.¹² It is an old problem—lawyers not reading what they sign—that technology has finally made impossible to ignore.¹³ That is, if nothing else, an opportunity. The question is whether the profession takes it.

Charlie (she/her) Amiot (rhymes w/cameo) is a former legal research instructor and reference librarian who currently writes What Congress Should Be Reading, a newsletter tracking Congressional Research Service reports for a general audience. An expert in government information, her work has examined the legislative history of CRS and public access to government information, and she currently serves as Secretary of the Depository Library Council. She also has a longstanding interest in legal AI, with deep, self-directed expertise built through sustained study and engagement with both practitioners and the tools themselves.

¹ I’m sure many readers are familiar with Damien Charlotin’s database of so-called AI Hallucination Cases (https://www.damiencharlotin.com/hallucinations/). Containing judicial opinions only, the database already holds 1600 references.
² Escott, D. J. (2025, December 8). From hallucination to indictment: The criminalization of the AI-enabled lie. Law360 Canada. https://www.law360.ca/ca/articles/2419185/from-hallucination-to-indictment-the-criminalization-of-the-ai-enabled-lie. Koebler, J. (2025, Sept. 30). 18 Lawyers Caught Using AI Explain Why They Did It. 404media. https://www.404media.co/18-lawyers-caught-using-ai-explain-why-they-did-it/?ref=daily-stories-newsletter.
³ Goldfish actually have great memories. They can be relatively quickly trained to play basketball on command. But in the Ted Lasso universe they are upsettingly portrayed as idiots with a three-second memory who could be outsmarted by Dory. Alas, is that a hallucination?
⁴ Fed. R. Civ. P. 11(b)(2). https://www.law.cornell.edu/rules/frcp/rule_11.
⁵ Deters v. Davis, No. CIV.A. 3:11-02-DCR, 2011 WL 2417055 (E.D. Ky. June 13, 2011). See also, Cody James, Citators in the AI Age: Preserving the Human Component Through Court-Created Citators, 118 Law Lib. J. 66, 71-73 (2026).
⁶ Model Rules of Prof’l Conduct r. 1.1 (Am. Bar Ass’n 2023), https://www.americanbar.org/groups/professional_responsibility/publications/model_rules_of_professional_conduct/rule_1_1_competence/; see also, r. 1.1, Comment 5, https://www.americanbar.org/groups/professional_responsibility/publications/model_rules_of_professional_conduct/rule_1_1_competence/comment_on_rule_1_1/.
⁷ Model Rules of Prof’l Conduct r. 3.3 (Am. Bar Ass’n 2023). https://www.americanbar.org/groups/professional_responsibility/publications/model_rules_of_professional_conduct/rule_3_3_candor_toward_the_tribunal/.
⁸ Cody James, Citators in the AI Age: Preserving the Human Component Through Court-Created Citators, 118 Law Lib. J. 66, 68-69 (2026).
⁹ Some believe instead that the term harness is more accurate; I am more than willing to accept that definition and the examples proffered. Nicola Shaver, AI Harnesses: The Layer Where Differentiation Crystallizes, Legaltech Hub (May 11, 2026). https://www.legaltechnologyhub.com/contents/ai-harnesses-the-layer-where-differentiation-crystallizes/ I use the term wrapper here the way Shaver and Ethan Mollick use harness (https://www.oneusefulthing.org/p/a-guide-to-which-ai-to-use-in-the).
¹⁰ Cate Giordano, Legalweek 2026: AI in Legal Has a Deployment Problem, Legaltech Hub (Mar. 24, 2026), https://www.legaltechnologyhub.com/contents/legalweek-2026-ai-in-legal-has-a-deployment-problem.
¹¹ Olivia Smith Schlinck, Academic Law Librarians Are Paid 47% Less Than Their Faculty Counterparts (Feb. 4, 2022), https://ripslawlibrarian.wordpress.com/2022/02/04/academic-law-librarians-are-paid-47-less-than-their-faculty-counterparts/.
¹² Samantha Cole, Watch These Judges Rip Into Lawyers For Citing Cases That Don’t Exist, 404media (June 4, 2026), https://www.404media.co/new-york-court-ai-citations-landberg-case/; J. Koebler, Judge Learns Lawyers on Both Sides of Case Used AI, Cancels Trial, Kicks Everyone Off the Case, 404media (June 9, 2026), https://www.404media.co/judge-learns-lawyers-on-both-sides-of-case-used-ai-cancels-trial-kicks-everyone-off-the-case/.
¹³ OJ Simpson Murder Trial – Shepard’s clip https://www.youtube.com/watch?v=QFOY0Glg0gU.

I Tested Claude for Word on Some Classic Litigator Tasks

Posted on April 22, 2026 by Rebecca Fordon

This post was written with the support of the LegalQuants community, a group of lawyers who are building some pretty great things with AI. Curious? Check it out at https://www.legalquants.com/.

Over the past several days I’ve been digging into the Claude for Word add-in, and the headline finding surprised me. On document-intensive legal work — cite-checking, consistency review, Table of Authorities assembly — it seems to need less supervision than either Claude on the web or Claude Code. Four tests bear that out, with limits worth knowing.

Verdict: Claude for Word is narrower than Claude on the web by design — no open web, no arbitrary API calls, a more constrained sandbox. But that narrowness turns out to be a feature for document-anchored tasks. I’d say it is worth considering as a complement to (not a replacement for) established research and drafting workflows.

I ran four tests this week. Below is what I found, what I’d use it for, and what I’d still keep on the web or in Claude Code.

What is Claude for Word, and who has access?

Claude for Word is Anthropic’s Microsoft Word add-in, just released in beta last week. Once installed, it runs as a full sidebar assistant inside Word — it can read all active documents, propose tracked-change edits, add comments, respond to comments you leave, and carry on a conversation about the text while you work. It’s a different surface from Claude on the web (claude.ai) and from Claude Code, and has a different set of available tools and constraints. The add-in is currently available to users with a paid Claude subscription (Pro, Team, or Enterprise).

Two things worth knowing up front, because they shape what Claude for Word can and can’t do:

~~No open web access and~~ No arbitrary API calls from inside the add-in. Claude for Word can read your document and use connected MCP servers, but it can’t ~~browse the web or~~ hit arbitrary APIs the way a Claude Code session or a custom Python pipeline can. [I was wrong about web access, turns out my account just had it turned off! Can’t wait to run more tests now that it’s enabled!]
Access to MCP servers works, which matters for legal research tasks. In my testing I connected a community-built CourtListener MCP, though I had to host it myself (Free Law Project has an official CourtListener MCP coming soon that will make this all easier!).

Test 1: Cross-document consistency

In checking citations in a summary judgment motion I pulled from PACER, Claude caught a citation to an exhibit number that no longer existed in the supporting Rule 56.1 declaration. Classic late-stage renumbering casualty — someone dropped an exhibit, the list was repaginated, and one reference in the brief didn’t get updated. It’s the exact mistake that falls through the cracks because everyone on the team assumes someone else did the cross-walk.

I’d opened the motion and the declaration side by side and asked Claude to check defined terms within the motion (clean) and cross-check every factual statement against the declaration. The exhibit catch was one of several small mismatches flagged. Worth running every time a declaration or exhibit list changes and the brief has to track.
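For readers who want a mechanical backstop for the exhibit cross-walk, here is a minimal sketch in Python. It is not what Claude for Word does internally; it only assumes that both documents are available as plain text and that exhibits are referenced as "Ex. 4", "Exh. 4", or "Exhibit 4". The function names and sample sentences are made up for illustration.

```python
import re

# Assumed reference styles: "Ex. 4", "Exh. 4", "Exhibit 4".
# Lettered exhibits ("Ex. A") and ranges ("Exs. 4-6") are out of scope here.
EXHIBIT_REF = re.compile(r"\b(?:Exhibit|Exh\.|Ex\.)\s+(\d+)\b")


def exhibit_numbers(text):
    """Return the set of exhibit numbers referenced anywhere in the text."""
    return {int(number) for number in EXHIBIT_REF.findall(text)}


def dangling_exhibits(brief_text, declaration_text):
    """Exhibit numbers cited in the brief that the declaration never mentions."""
    return sorted(exhibit_numbers(brief_text) - exhibit_numbers(declaration_text))


# Hypothetical sample text, invented for this example.
brief = "Plaintiff signed the agreement (Ex. 4) and later emailed HR (Ex. 9)."
declaration = "Attached as Exhibit 4 is the agreement. Attached as Exhibit 8 is the email."

print(dangling_exhibits(brief, declaration))  # [9]
```

A check like this only catches references to exhibits that no longer exist. It cannot tell whether Exhibit 4 still says what the brief claims it says, which is the part of the cross-check that needs a reader.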

While I had the motion open, I asked Claude to build a Table of Authorities. Anyone who’s assembled one by hand knows this is one of the more miserable tasks in litigation: every citation identified, every page tracked, every short form resolved against a full citation. It worked. The first pass took about ten minutes, with some confusion about creating sections and numbering pages along the way, but it pulled everything together cleanly at the end. I then asked Claude to take the lessons from that run and write a skill for next time. With the skill loaded, a second TOA on the same document was done in under five minutes.

Two things worth noting about the skill-generation step. First, the loop genuinely works: Claude can reflect on its own recent process and distill it into reusable guidance. Second, and this is the one we should pay attention to: the marginal cost of a custom skill for a recurring task drops toward zero when the tool can draft its own.
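To make the bookkeeping behind a Table of Authorities concrete, here is a rough Python sketch of the "every citation identified, every page tracked" part. It is an illustration, not the skill Claude wrote. It assumes the brief has already been split into one string per page, and it recognizes only a narrow set of federal reporter citations. Short forms such as "477 U.S. at 322" are deliberately left unresolved, which is exactly the hard part a real tool has to handle.

```python
import re
from collections import defaultdict

# Narrow, assumed pattern: "<volume> <reporter> <page>" for a few federal reporters.
REPORTER_CITE = re.compile(
    r"\b\d+ (?:U\.S\.|S\. Ct\.|F\. Supp\.(?: [23]d)?|F\.(?:2d|3d|4th)?) \d+\b"
)


def build_toa(pages):
    """Map each full reporter citation to the sorted page numbers where it appears."""
    table = defaultdict(set)
    for page_number, text in enumerate(pages, start=1):
        for cite in REPORTER_CITE.findall(text):
            table[cite].add(page_number)
    return {cite: sorted(numbers) for cite, numbers in sorted(table.items())}


# Sample pages written for this example; the two cases cited are real.
pages = [
    "Summary judgment is proper under Celotex Corp. v. Catrett, 477 U.S. 317 (1986).",
    "See Anderson v. Liberty Lobby, Inc., 477 U.S. 242 (1986); Celotex, 477 U.S. at 322.",
    "As in Celotex Corp. v. Catrett, 477 U.S. 317, the movant here carries that burden.",
]

print(build_toa(pages))  # {'477 U.S. 242': [2], '477 U.S. 317': [1, 3]}
```

Note that the short-form cite on page 2 is missed, so the Celotex entry omits that page. Closing that gap by matching short forms back to their full citations is where most of the manual misery, and most of the value of a good skill, sits.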

Test 2: Bluebooking

Bluebook compliance is notoriously finicky, and producing a cleanly cite-checked document by hand is slow. In my tests, chatbots have not been the greatest for this, especially with formatting changes like small caps, italics, underlining. However, Claude for Word handled standard federal citation form correctly out of the box — without a skill, without the rules bundled into context, without special prompting. Reporter spacing, short-form rules, signal ordering, and the common T11 abbreviations all came through accurately. The caveat: I haven’t yet systematically tested state reporters, niche abbreviations, or the more obscure tables. And I suspect that rules less well-represented in the training data, like style rules for a specific state, wouldn’t come through as well. I’d love to do some more testing here, and hear from anyone in other jurisdictions.

One miss worth documenting: when Claude makes a tracked change that replaces formatted text, sometimes it drops italics and small caps. The fix is mechanical (explicitly reapply the formatting) and easy to bake into a lightweight skill, but out of the box it will quietly break formatting on any replacement that crosses a run boundary.

Test 3: Substantive citation verification

I’ve been building a brief-verification pipeline in Claude Code, built around the CourtListener API, that checks whether cited cases are real and whether they support the propositions they’re cited for. So naturally, I wanted to see if Claude for Word could give my pipeline a run for its money.

For this test I used Claude for Word with a community-built CourtListener MCP — no skill, no custom context, none of the steps from my pipeline — and ran it against a brief I’d already checked with the Claude Code pipeline. Free Law Project, the nonprofit behind CourtListener, has an official MCP coming soon.

Claude for Word caught almost everything the Claude Code pipeline caught. Inverted holdings, fabricated quotes, mischaracterized cases. And it didn’t just flag problems — it diagnosed them. On a misquoted opinion it reconstructed the actual language from the case and showed where an ellipsis had stitched two rewritten sentences into one “quotation.” On a cite pulled from a case that’s about something else entirely, it identified what the case is actually about and suggested narrower authority that would support the point.

The misses fell into a pattern. Claude caught fabrication and misattribution reliably in this test. It was weaker where the quoted language is genuinely in the opinion but is being used to support a proposition slightly broader than what the case stands for — framing drift rather than fiction. Those are close calls a careful human reviewer would flag on a slow read, and they’re worth knowing about before relying on any tool (including mine) as a last line of defense.
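The stitched-quotation problem described above is the kind of thing a simple script can pre-screen. The sketch below is an assumption-laden illustration in Python, not a description of how Claude or my pipeline works: it splits a quotation at each ellipsis and checks that every fragment appears in the opinion text, in order. The function names and the sample opinion text are invented for the example.

```python
import re

# Three or more dots (optionally spaced, as in ". . .") or a single ellipsis character.
ELLIPSIS = re.compile(r"\.(?:\s?\.){2,}|\u2026")


def normalize(text):
    """Lowercase, straighten curly quotes, and collapse runs of whitespace."""
    text = text.replace("\u2019", "'").replace("\u201c", '"').replace("\u201d", '"')
    return re.sub(r"\s+", " ", text).strip().lower()


def check_quote(quote, opinion_text):
    """Report whether each ellipsis-separated fragment appears in the opinion, in order."""
    opinion = normalize(opinion_text)
    fragments = [normalize(part).strip(" .") for part in ELLIPSIS.split(quote)]
    position = 0
    for fragment in filter(None, fragments):
        found = opinion.find(fragment, position)
        if found == -1:
            return {"verbatim": False, "missing": fragment}
        position = found + len(fragment)
    return {"verbatim": True, "missing": None}


# Hypothetical opinion text, invented for this example.
opinion = "The court held that the notice was timely. It did not reach the question of damages."

print(check_quote("the notice was timely. ... the question of damages", opinion))
print(check_quote("the notice was untimely ... the question of damages", opinion))
```

The first call reports a verbatim match and the second reports the rewritten fragment as missing. A check like this says nothing about framing drift: a quotation can be word-for-word accurate and still be offered for a broader proposition than the case supports, which is why that category still needs a slow human read.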

The pipeline still had one advantage: coverage. It pulled one opinion the ad hoc approach didn’t, because it does reliable fallback searches when the primary lookup fails.
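For anyone curious what "fallback searches" means structurally, here is a minimal Python sketch of the pattern. It makes no real CourtListener calls and does not reproduce my pipeline: the lookup functions are stand-ins backed by in-memory dictionaries, and the citation and case name are placeholders, not real authority.

```python
def find_opinion(citation, strategies):
    """Try each (name, lookup) pair in order and return the first hit with an audit trail."""
    tried = []
    for name, lookup in strategies:
        tried.append(name)
        try:
            result = lookup(citation)
        except Exception:
            # One failing lookup should not stop the remaining fallbacks from running.
            result = None
        if result is not None:
            return {"citation": citation, "opinion": result, "found_by": name, "tried": tried}
    return {"citation": citation, "opinion": None, "found_by": None, "tried": tried}


# Stand-in lookups: in a real pipeline these would call a research API.
primary_index = {}  # the primary citation lookup finds nothing
fallback_index = {"123 F.4th 456": {"title": "Example v. Placeholder"}}  # placeholder entry

outcome = find_opinion(
    "123 F.4th 456",
    [("citation lookup", primary_index.get), ("case-name search", fallback_index.get)],
)
print(outcome["found_by"], outcome["tried"])
```

Keeping the list of strategies tried alongside each result matters as much as the fallback itself: when nothing is found, the reviewer can see whether the cite failed one lookup or all of them before deciding it is fabricated.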

The honest trade-off: the in-Word approach gets you most of the way there with almost no engineering. A purpose-built pipeline still earns its keep on retrieval coverage, which matters most when a brief cites heavily to sources that may not be as cleanly indexed in CourtListener — unpublished district court orders, older state intermediate appellate decisions, and the like.

Who should use this, and when

The pattern that emerged across these tests: Claude for Word excels at tasks where the document itself is the context and the work is anchored in it. It needs less prompting, less scaffolding, and fewer custom skills than I would have needed on the web or in Claude Code to get comparable output. Where it underperforms is where the task requires infrastructure the sandbox doesn’t provide — reliable fallback retrieval, custom packages, open web access.

Well-suited to Claude for Word:

Quality-control passes on completed drafts. Exhibit-list checks, defined-term consistency, and cross-document fact-checking are low-risk, high-value, and would take hours by hand. Minimal setup required.
Table of Authorities assembly, especially if you’re willing to invest a first pass so Claude can generate a reusable skill. The self-generated skill pattern is worth trying: once you have it, subsequent TOAs drop from ten minutes to less than five.
Cite-format cleanup on briefs that use standard federal authority. The base model handles common Bluebook rules without a skill; a lightweight skill fills in the edges (italics preservation, state-specific reporters, niche abbreviations).
Substantive citation verification on briefs that lean on federal authority — fabricated quotes, inverted holdings, mischaracterized cases. The add-in plus a CourtListener MCP is a strong first pass. Human review is still warranted on framing drift (cases where the quoted language is real but the proposition is broader than the case actually supports).

Still better done elsewhere:

End-to-end citation pipelines for briefs that cite heavily to sources outside CourtListener — unpublished district court orders, administrative decisions, older state intermediate appellate materials. A well-built Claude Code or Python pipeline with reliable fallback searches will catch more cites, even if the ad hoc in-Word approach reasons just as well about each one.
Any workflow requiring arbitrary package installs, open web access, or custom API calls. The Word sandbox is more constrained than Claude.ai or Claude Code, and these limits are real.

Bottom line

Claude for Word is a usable tool for real legal work, and it handled each of the four tasks I tested with some combination of speed, accuracy, and genuine substantive judgment. The pattern that surprised me most is the one I led with: on document-anchored tasks, the add-in required less supervision than I would have needed on the web or in Claude Code to get comparable output. The most likely explanation is that a bounded, document-centric environment lets the model stay focused on the task in front of it, and perhaps it’s even given additional tools to manage context. Worth a longer investigation than I can give it here.

None of this makes Claude for Word a replacement for a well-considered AI workflow. It’s a complement to one. The question worth asking is which of your workflows belong inside the Word sandbox, which belong on the web or in Claude Code, and which benefit from both.

I’d be interested to hear what others are finding.

Note: Tests 1 and 2 were cold — no skills, no custom context. Test 3 used Claude for Word with a community-built MCP accessing the CourtListener API, benchmarked against a brief-verification pipeline I run in Claude Code against the CourtListener API.

Can’t Stop, Won’t Stop: One Semester, Eight Vibe-Coded Teaching Tools

Posted on March 4, 2026 by Rebecca Fordon

Somewhere around tool number five, I realized I had a problem. The kind where you keep saying “almost done” and then it’s 2 AM and you’re finalizing a React app that lets your students act as bots and deploy hallucinations against one another.

I teach 21st Century Lawyering at The Ohio State University Moritz College of Law. It’s a 3-credit course on law and emerging technology — AI, cybersecurity, document tech, legal automation. And this semester, I built a custom interactive tool for nearly every class session. Eight tools (and counting) for a single course.

I won’t pretend to have come up with all of these ideas on my own. One of the best things about the legal tech education community is how generously people share. Someone demos an exercise at a conference, posts a tool on LinkedIn, or walks through a concept in a workshop, and it sparks something. Nearly every tool below started with an idea I saw someone else do and I thought, “I could build a version of that for my students.” AI-assisted development makes that possible in a way it never was before, totally collapsing the gap between “that’s a great idea” and “I’m using it in class tomorrow.” So this post is partly a show-and-tell, and partly a thank-you to the people whose work made me want to build.

Here’s what I made, who inspired it, and what I learned about what happens when a law professor with some Python experience and an AI coding assistant starts saying yes to every pedagogical impulse.

A note on the live demos: Many of these tools run on my personal API keys. I’ve put some money into keeping them live, but when it runs out I probably won’t re-up. If a demo isn’t working, you can always clone the repo and run it yourself; I’ve tried to keep everything open-source.

The Tools

1. TokenExplorer (Week 2)

The LLM Explorer web app showing a token-by-token probability visualization. The left panel has an "Input" section with the prompt "Give me a list of 10 recent opinions involving lawyers using AI hallucinations. They should have full citations." (estimated at 29 tokens), plus "Model Settings" with GPT-3.5 Turbo selected, temperature slider at 1.5, and max tokens at 712. A "Generate" button sits below. The right panel shows the model's output (505 completion tokens) with each token color-coded by prediction probability: green (90-100%), light green (70-89%), light blue (50-69%), yellow (30-49%), orange (10-29%), and pink (0-9%). The output is a hallucinated list of fake cases — "Smith v. Jones, 2021 WL 123 456 (District Court of California, Feb. 15, 2021)," "Brown v. Green, 2021-US-12345," etc. — with many tokens in yellow, orange, and pink, visually demonstrating that the model is generating low-confidence fabrications. The color pattern makes hallucination *visible*: students can see that case names, docket numbers, and dates are produced with notably low probability scores, revealing the model is essentially guessing.

Try it | GitHub

Inspired by: I saw someone demo something similar at a presentation a while back. I wish I could remember who, because it stuck with me. Making the probabilistic nature of an LLM visible rather than just explaining it conceptually was immediately compelling. Token visualizers aren’t new (OpenAI has one, several developers have built similar tools), but I hadn’t seen one built for a classroom context where the point is to emphasize the probabilities. If the original demo was yours, please reach out so I can credit you properly.

The problem: Students arrive thinking LLMs are databases that look up correct answers.

What I built: An interactive tool where students manipulate temperature settings and watch probability distributions shift in real time. They change context and see how a similar prompt produces different next-token predictions. They test factual questions and watch the probabilities change as more hallucinations arise.

Why it matters: Once students see that LLMs are statistical prediction engines generating likely text, not true text, everything else clicks. When they encounter hallucinations later in the semester, they already understand why they happen.
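
For readers who want to see the mechanics behind the temperature slider, here is a minimal sketch of how temperature reshapes a next-token probability distribution. It is not the TokenExplorer code (the hosted model applies temperature on its side); the candidate tokens and their scores are made up for illustration, and `softmax_with_temperature` is a hypothetical helper name.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw next-token scores into probabilities at a given temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max before exponentiating, for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate next tokens (illustrative only).
candidates = ["Smith", "Brown", "Mata"]
logits = [2.0, 1.0, 0.5]

for t in (0.5, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, {tok: round(p, 2) for tok, p in zip(candidates, probs)})
```

Lower temperatures concentrate probability on the top-scoring token, while higher temperatures flatten the distribution so that unlikely tokens get picked more often. That is one reason a case name generated at a temperature of 1.5 is more likely to drift into fabrication.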

2. Prompt Coach (Week 5)

The Prompt Coach split-panel web app. The left "Your Workspace" panel shows a Claude conversation (model: Claude Haiku Fast). The student's prompt asks: "Give me a list of 10 recent opinions involving lawyers using AI hallucinations. They should have full citations." Claude's response is titled "Recent Opinions on Lawyers Using AI Hallucinations" and transparently explains it cannot verify specific citations, listing what it knows exists (Mata v. Avianca, Colorado Bar disciplinary matters, State Bar of California ethics opinions) and recommending Google Scholar for accurate citations. The right "Prompt Coach" panel (branded "21st Century Lawyering - OSU Moritz College of Law") shows a "Hallucination Risk" review. The Coach Feedback section analyzes the student's prompt: "Good news: the AI handled your high-risk request appropriately by refusing to fabricate citations. But your prompt created exactly the scenario where hallucination is most dangerous." It explains the student asked for specific citations from memory — "the AI's weakest area" — and suggests alternative approaches: requesting search and verification explicitly, uploading case databases, and notes that the follow-up ("I need 10 cases minimum") doubled down on the risk. Footer: "Created by Rebecca Fordon | MIT License | GitHub."

Try it | GitHub

Inspired by: Sean Harrington’s prompting workshops. I still haven’t had the pleasure of experiencing one in person (I’m looking forward to finally catching him at ABA TECHSHOW), but hearing about his approach to teaching prompting AI, with feedback from AI, made me want to build something that could coach students through it in real time.

The problem: Students need to practice prompting, but there’s no good way to give them feedback at scale. I can’t stand behind 30 (or even 11) laptops at once.

What I built: A split-panel web app. The left side is a blank Gemini chat where students draft and test prompts. The right side is an AI coaching interface that evaluates their technique across dimensions like context engineering, document selection, and confidentiality awareness. The coach doesn’t revise the AI’s output directly, but instead connects output problems back to what could be improved with the prompt.

Why it matters: Legal-specific coaching catches things generic prompt guides miss. It flags when a student uploads privileged documents. It notes when a prompt would work on Gemini but fail on CoCounsel’s structured skill system. It frames feedback in terms of professional judgment, not just technical optimization. This could be easily customized to track to different learning objectives (ironically, I rewrote the prompt so many times).

3. QnA Markup Unpaid Wages Client Screener (Week 6)

The QnA Markup Editor at qnamarkup.org showing the complete Wage & Hour Claim Screener. The left panel displays the full QnA Markup source code with the decision tree logic visible: Q(status) checks W-2 status, Q(timely) checks the 2-year statute of limitations, Q(issue) branches into four claim types (minimum wage, overtime, final paycheck, tip theft), with sub-questions about hourly rate thresholds ($7.25, $10.99, $11.00) and salary thresholds ($684/week). GOTO:consult tags route qualifying claims to a consultation booking page; a Q(dol) endpoint directs users to the Department of Labor hotline (1-866-487-9243) and Ohio Legal Help. The right panel shows the interactive output in "Interactive" mode, displaying the first question as a blue speech bubble: "Were you a W-2 employee?" with "Yes" and "No (I was a contractor/1099)" answer buttons. Footer links include "credits | edit | code your own."

Try it on QnA Markup

Inspired by: For this I just directly used David Colarusso’s QnA Markup, a brilliantly simple tool for building decision trees with plain text. Gabe Tenenbaum also very generously demoed QnA Markup in a prior version of my class, which planted the seed for continuing to build exercises around it.

The problem: I needed a low-tech entry point to teach decision trees and document assembly logic — something where the focus stays on legal reasoning, not the tool.

What I built: A client intake screener that triages potential wage-and-hour claims. Does the caller qualify? Should they book a consultation or contact the Department of Labor directly? Students see legal rules as logic: if/then branching based on employer size, hourly rate, and tipped status.

Why it matters: It forces students to confront the design choices embedded in any intake tool, such as what questions to ask, in what order, what to do with edge cases. It’s intentionally low-tech (just text in a browser) so nobody gets distracted by the interface.
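
To make the "legal rules as logic" idea concrete, here is a minimal sketch of a screener expressed as a small decision tree and walked deterministically. It is not the QnA Markup source. The node labels echo those visible in the screenshot (status, timely, consult, dol), but the question wording and the routing between nodes are assumptions for illustration, and the real screener has more branches (claim types, rate thresholds).

```python
# Each node is either a question with answer -> next-node links, or an endpoint.
# Routing here is assumed for illustration; it is not the actual screener logic.
TREE = {
    "status": {
        "question": "Were you a W-2 employee?",
        "answers": {"yes": "timely", "no": "dol"},
    },
    "timely": {
        "question": "Did the problem happen within the last 2 years?",
        "answers": {"yes": "consult", "no": "dol"},
    },
    "consult": {"endpoint": "Book a consultation."},
    "dol": {"endpoint": "Contact the Department of Labor."},
}

def run_screener(tree, answers, start="status"):
    """Follow the caller's answers through the tree; return the path and the outcome."""
    label = start
    path = [label]
    while "endpoint" not in tree[label]:
        reply = answers[label]                  # the caller's answer to this question
        label = tree[label]["answers"][reply]   # same answer always leads to the same node
        path.append(label)
    return path, tree[label]["endpoint"]

print(run_screener(TREE, {"status": "yes", "timely": "yes"}))
print(run_screener(TREE, {"status": "no"}))
```

Because every branch is written out, the same answers always produce the same outcome, and each design choice (which question comes first, where edge cases land) is visible in the data structure itself.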

4. Decision Tree to QnA Markup Translator Gem (Week 6)

Gemini interface (dark mode) showing the QnA Markup Optimizer Gem in a conversation titled "QnA Markup Decision Tree Generation." The user's brief prompt says "put in a code block." The Gem responds with QnA Markup for an "Employment Intake & Consultation Screener" described as "Verifies W-2 status and routes to a consultation if ineligible." The code shows a linear intake flow: Q(employment_status): "Were you a W-2 Employee?" → Yes → Q(employer_name): "What was the name of your employer?" → X: (free text) → Q(job_title): "What was your job title at <x>employer_name</x>?" → X: → Q(years_worked): "How many years did you work there?" → X:number → Q(final_confirmation) partially visible. This demonstrates the Gem generating a simpler guided interview structure with variable interpolation.

Try it

The problem: Students understand decision tree logic but struggle with the syntax of turning it into working code.

What I built: A Gemini Gem that bridges the gap. Students describe their logic and it generates QnA Markup. It models the idea that AI assistants can serve as a bridge between domain knowledge and technical implementation.

Why it matters: It’s the same insight that powers the entire “building legal technology” unit (and extends from the prompting unit and into the agent unit): you don’t need to be a coder, you need to be able to describe what you want clearly enough for AI to build it.

5. Ohio Unpaid Wages Screener — The 3-Minute Version (Week 6, cliffhanger)

A vibe-coded web app styled as a Better Call Saul parody — "BETTER CALL OHIO" in large red and yellow text on a black header bar, with "WAGE & HOUR CLAIM SCREENER" as a subtitle. A yellow badge in the upper right reads "YOUR RIGHTS PROTECTED!" with a scales-of-justice icon. Below, on a cream/yellow background, a white card labeled "OFFICIAL SCREENER" (in a red diagonal badge) shows the "CASE INTAKE" heading with the first question in italic: "Were you a W-2 employee?" Two answer buttons styled in black borders read "YES →" and "NO (I WAS A CONTRACTOR/1099) →". At the bottom, a disclaimer reads "ATTORNEY ADVERTISING - RESULTS NOT GUARANTEED" with small dollar sign, clock, and calendar icons. The design deliberately mimics the aesthetic of a late-night TV legal ad.

Try it

The problem: I needed a dramatic way to introduce vibe coding.

What I built: A React app in Gemini Canvas, built on the QnA Markup decision tree we just made. It took roughly 1 minute to ctrl-C/ctrl-V the QnA Markup and generate the app.

Why it matters: I revealed it side-by-side at the end of class as a cliffhanger: “Same basic functionality, but it made a website out of it.” We then discussed, “So why wouldn’t you always vibe-code?” Students surfaced the hard questions themselves: Is the code correct on the law? Is it deterministic (will it always come out the same way)? Who hosts it? Would it be as good if I asked it to generate directly from the law, rather than creating the decision tree ahead of time? Why did it add “OFFICIAL”?¹ It set up the entire vibe-coding class the next day perfectly.

¹ (Eagle-eyed readers will notice that I was too much of a coward to share a direct link to the screenshotted version, and if you visit the link you’ll instead see prominent “parody” stamps).

6. Citation Extractor Gem (Week 7)

Gemini interface (dark mode) showing the Citation Extractor custom Gem in action. The user's prompt reads "Extract citations from the attached PDF." The Gem's response begins with an analysis note explaining it extracted all legal case citations from the document, including those identified as "bogus" or "phony" AI-generated research by the Special Master. Below is a structured table with columns: Bluebook Citation, Core Citation, Court, Year, and Pinpoint Pages. Visible entries include Aetna Cas. & Surety Co. v. Superior Court (Cal. Ct. App., 1984), Arrowhead Capital Finance v. Picturepro (9th Cir., 2023), Boone v. Vanliner Ins. Co. (Ohio, 2001), Booth v. Allstate Ins. Co. marked as "(Flagged)" (Cal. Ct. App., 1989), and Braun ex rel Advanced Battery Techs. v. Zhiguo Fu (S.D.N.Y., 2015). The Gemini model selector shows "Thinking" mode.

Try it

The problem: I needed a way for students to quickly pull all case citations out of a brief to feed into verification workflows.

What I built: A Gemini Gem that takes an uploaded brief and returns a structured table of all case citations, ready for Get & Print on Westlaw and Lexis.

Why it matters: It serves double duty. Practically, it supports the hallucination game (below). Pedagogically, it’s a concrete example of a custom AI assistant built for a specific legal task, connecting back to the Gems work from Week 6 and forward to agentic AI in Week 9. Students see that I practice what I teach: when you have a repetitive legal task, you build a tool for it.
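
As a rough point of comparison, here is a minimal sketch of the same extraction idea done with a regular expression instead of a Gem. This is a swapped-in technique, not how the Gem works: the reporter list is deliberately tiny and hypothetical, and the helper name `extract_citations` is made up for illustration.

```python
import re

# Hypothetical, deliberately small reporter list; real briefs cite many more.
REPORTERS = r"U\.S\.|S\. Ct\.|F\. Supp\. (?:2d|3d)|F\.(?:2d|3d|4th)"
# Matches the "volume reporter page" core of a citation, e.g. "556 U.S. 662".
CITATION = re.compile(rf"\b(\d{{1,4}}) ({REPORTERS}) (\d{{1,5}})\b")

def extract_citations(text):
    """Return one row per citation-shaped string found in the text."""
    rows = []
    for match in CITATION.finditer(text):
        volume, reporter, page = match.groups()
        rows.append({"volume": int(volume), "reporter": reporter, "page": int(page)})
    return rows

sample = (
    "See Ashcroft v. Iqbal, 556 U.S. 662 (2009); "
    "Bell Atl. Corp. v. Twombly, 550 U.S. 544 (2007)."
)
for row in extract_citations(sample):
    print(row)
```

A pattern like this only finds strings that look like citations. It says nothing about whether the case exists or supports the proposition, which is why the extracted table still has to go through a verification step.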

7. Citation Hallucination Game (Week 7)

The Citation Hallucination Game web app during the verification (Solo Practice) phase. Top navigation bar shows "Citation Game | Solo Practice | Reviewed: 0/23 | Flagged: 0" with "Finish & See Results" and "Export PDF" buttons. The main panel displays a legal brief — a motion to dismiss in a bad faith insurance case, citing Patterson v. State Farm Mut. Auto Ins. Co. and Ashcroft v. Iqbal. A sidebar panel labeled "REVIEW CITATION" highlights the current citation (Patterson v. State Farm, 2019 U.S. Dist. LEXIS 31742) in yellow, with two buttons: "Looks Legit" and "Flag as Fake," plus Previous/Next navigation. The interface lets students step through each citation in the brief and decide whether it's real or fabricated.

Try it | GitHub

Inspired by: David Colarusso again, specifically his automation bias exercise, which flips the typical classroom dynamic by making students experience bias rather than just learn about it. That “make them do the thing, not just hear about the thing” approach is exactly what I was looking for. And his hallucination checking frame was pretty handy too because I also wanted to explicitly teach a process for that.

The problem: Students (like many of us) think hallucinations will not happen to them, because they will always read the cases. They may also see hallucinations as mainly a made-up cases problem, and not realize that hallucinations can come in many flavors, some harder to detect than others.

What I built: A competitive team exercise. Students first create hallucinated citations (fabricated cases, swapped numbers, mischaracterized holdings, altered quotes), which forces them to internalize the different types of hallucinations and when they are likely to arise. Then they try to catch another team’s fakes under time pressure, mirroring the real conditions lawyers may face when reviewing AI-generated work on deadline.

Why it matters: The key lesson isn’t “can you catch every error” but “given limited time, which errors do you prioritize?” Students independently discovered that no single verification tool is sufficient, that the easy hallucinations (fabricated cases) are solved relatively quickly, but that the dangerous ones (subtle mischaracterizations) slip through.

8. Document Tech Gallery (Week 8)

Landing page of the Document Technology Gallery web app. Header reads "Document Technology Gallery" with subtitle "Beyond Word and Acrobat: The tools lawyers actually use." Introductory text explains that the gallery covers seven key document technology patterns. Three demo cards are visible, organized by category: under "CREATE," cards for Document Automation (~90 sec, "Watch a contract write itself as you answer questions") and Clause Library (~60 sec, "Assemble a contract from pre-approved building blocks"); under "EDIT," a card for AI-Assisted Document Editing (~90 sec, "Edit a legal document three ways: AI, rules, and consistency checking"). Two more cards are partially visible under a "REVIEW" heading. Each card has a "Try it →" link. Clean white background with subtle card borders.

Try it | GitHub

Inspired by: Barbora Obracajova’s Legal Tech Gallery, which she vibe-coded for her “Modern Lawyers” course and shared on LinkedIn. She created a series of quick interactive demos, 60 seconds each, where students touch the technology instead of watching slides. I had a class quickly approaching on document competencies, a topic that I have always struggled to teach given how difficult it is to get Word add-ons approved in my institution. So the moment I saw her post I knew this would help me.

The problem: Students need hands-on exposure to core document technology competencies (I chose automation, clause libraries, editing, brief verification, metadata cleaning, redaction, and contract review) but there’s no time to go deep on all of them in one class.

What I built: Seven quick interactive demos, about 60 seconds each. Students touch each technology rather than just hear about it. The gallery format builds a shared baseline before diving into further exploration and case studies.

Why it matters: It surfaces the “I didn’t know that existed” moments.

What I Learned from this Process

I can’t stop because the feedback loop is immediate. I see a gap in my teaching, I build something that night, I use it in class the next day. That’s never been possible for me before. I’ve been writing little scripts for years, but AI-assisted development took me from utility scripts to deployed interactive applications. The jump from “I can automate this for myself” to “I can build this for my students” is huge.
Vibe-coding replaced the paper handout. In other classes, when I want students to work through a problem, my first instinct is a handout, whether a worksheet, fact pattern, checklist, or problem. In this class, my first instinct became “what if I built something they could interact with?” When the subject is technology, it made sense to me that the medium should be too.
Creating interactive apps makes the class more hands-on. Every one of these exists because I couldn’t find a way to give students a particular experience otherwise. You can lecture about how LLMs predict tokens. Or you can let students drag a temperature slider and watch the probability distribution change. You can tell students hallucinations are dangerous. Or you can have them create hallucinations and then fail to catch someone else’s.
Building is teaching. When I vibe-code a tool in front of students, or reveal that I built something in 3 minutes that took an hour the traditional way, I’m modeling some of the ways they can use technology. The message isn’t always “look what I made,” but also “you could make this too, and you should, because the people who understand the problem best should be the ones building the solution.”

Eight tools. Week 8 of a 13-week semester. And I have no doubt there are more to come.

If you’re teaching legal technology and you’ve been thinking “I wish I had something that did X” — you probably have enough to build it. Pick the smallest version of the idea, open a vibe-coding tool, and see what happens. And then share it!

What the Science Says About Hallucinations in Legal Research

Posted on February 19, 2026 by Rebecca Fordon

This is Part 1 of a three-part series on AI hallucinations in legal research. Part 2 will examine hallucination detection tools, and Part 3 will provide a practical verification framework for lawyers.

You’ve heard about the lawyers who cited fake cases generated by ChatGPT. These stories have made headlines repeatedly, and we are now approaching 1,000 documented cases where practitioners or self-represented litigants submitted AI-generated hallucinations to courts. But those viral incidents tell us little about why this is happening and how we can prevent it. For that, we can turn to science. Over the past three years, researchers have published dozens of studies examining exactly when and why AI fails at legal tasks, and the patterns are becoming clearer.

A critical caveat: The technology evolves faster than the research. A 2024 study tested 2023 technology; a 2025 study tested 2024 models. By the time you read this, the specific tools and versions have changed again. That’s why this post focuses on patterns that persist across studies rather than exact percentages that will be outdated in months.

Here are the six patterns that matter most for practice.

Pattern #1: Models and Data Access

Not all AI tools are created equal. The research shows a dramatic performance gap based on how the tool is built, though it’s important to understand that both architecture and model generation matter.

Bar chart showing hallucination rates for general-purpose language models on legal queries. Llama 2 hallucinated 88% of the time, GPT-3.5 hallucinated 69%, and GPT-4 hallucinated 58%, demonstrating that newer models perform better but still hallucinate on more than half of legal questions. Source: Dahl, et al., “Large Legal Fictions,” Fig. 1. The figure shows reduced hallucination rates with more advanced and modern models.

Models are improving over time. A comprehensive 2024 study by Stanford researchers titled “Large Legal Fictions” tested 2023 general-purpose models on over 800,000 verifiable legal questions and found hallucination rates between 58% and 88%. Within that cohort, newer models performed better: GPT-4 hallucinated 58% of the time compared to GPT-3.5 at 69% and Llama 2 at 88%. This pattern of improvement with each model generation appears fairly consistent across AI development.

Chart comparing hallucination rates across legal AI tools and GPT-4. Lexis+ AI had a 17% hallucination rate, Westlaw AI-Assisted Research had 33%, and GPT-4 had 43%, showing that legal-specific tools with retrieval-augmented generation substantially outperform general-purpose models. Source: Magesh, et al., “Hallucination Free?”, Figure 1. The study shows higher hallucinations in general purpose model GPT-4 than specialized legal research products.

Architecture matters, but it’s not the whole story. A second Stanford study, titled “Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools”, published in 2025 but testing tools from May 2024, found hallucination rates of 17% for Lexis+ AI, 33% for Westlaw AI-Assisted Research, and 43% for GPT-4. These errors include both outright fabrications (fake cases) and more subtle problems like mischaracterizing real cases or citing inapplicable authority. This head-to-head comparison shows legal-specific tools with retrieval-augmented generation (RAG) substantially outperforming general LLMs.

A randomized controlled trial by Schwarcz et al. reinforces the architecture point from a different angle. When 127 law students used a RAG-based legal tool (Vincent AI) to complete legal tasks, they produced roughly the same hallucination rate as students using no AI at all. Students using a reasoning model without RAG (OpenAI’s o1-preview) produced better analytical work but introduced hallucinations. Both tools dramatically improved productivity—but only the RAG tool did so without increasing error rates. However, the Vals AI Legal Research Report (October 2025, testing July 2025 tools) found ChatGPT matched legal AI tools: ChatGPT achieved 80% accuracy while legal AI tools scored 78-81%. The key difference? The ChatGPT used in the Vals study used web search by default (a form of RAG), giving it access to current information and non-standard sources, while legal tools restrict to proprietary databases for citation reliability. For five question types, ChatGPT actually outperformed the legal AI products on average. Both outperformed the human lawyer baseline of 69%.

Takeaway: Purpose-built legal tools generally excel at citation reliability and authoritative sourcing, but general AI with web search can compete on certain tasks. The real advantage isn’t RAG architecture alone—it’s access to curated, verified legal databases with citators. Know your tool’s strengths: legal platforms for citations and treatment analysis, general AI with web search for non-standard or very recent sources.

Pattern #2: Sycophancy

One of the most dangerous hallucination patterns is that AI agrees with you even when you’re wrong.

The Stanford “Hallucination-Free?” study identified “sycophancy” as one of four major error types. When users ask AI to support an incorrect legal proposition, the AI often generates plausible-sounding arguments using fabricated or mischaracterized authorities rather than correcting the user’s mistaken premise.

Similarly, a 2025 study on evaluating AI in legal operations found that hallucinations multiply when users include false premises in their prompts. Anna Guo’s information extraction research from the same year showed that when presented with leading questions containing false premises, most tools reinforced the error. Only specialized tools correctly identified the absence of the obligations the user incorrectly assumed existed.

This happens because of how large language models work: they’re trained to generate helpful, plausible text in response to user queries, not to verify the truth of the user’s assumptions.

Takeaway: Never ask AI to argue a legal position you haven’t independently verified. Phrase queries neutrally. If you ask “Find me cases supporting [incorrect proposition],” AI may happily fabricate them.

Pattern #3: Jurisdictional and Geographic Complexity

AI performance degrades sharply when dealing with less common jurisdictions, local laws, and lower courts.

Table showing AI hallucination rates varying by geographic location. For the same legal scenarios, hallucination rates were 45% for Los Angeles, 55% for London, and 61% for Sydney. Source: Curran, et al., “Place Matters”, Fig. 1. Hallucination rates by jurisdiction.

Researchers in a study called “Place Matters” (2025) tested the same legal scenarios across different geographic locations and found hallucination rates varied dramatically: Los Angeles (45%), London (55%), and Sydney (61%). For specific local laws, such as an Australian jurisdiction’s Residential Tenancies Act, hallucination rates reached 100%.

The Vals report found a 14-point accuracy drop when tools were asked to handle multi-jurisdictional 50-state surveys. The Large Legal Fictions study confirmed that models hallucinate least on Supreme Court cases and most on district court metadata.

Why? Training data is heavily weighted toward high-profile federal cases and major jurisdictions. State trial court opinions from smaller jurisdictions are underrepresented or absent entirely.

Takeaway: Apply extra scrutiny when researching state or local law, lower court cases, or multi-jurisdictional questions. These are exactly the scenarios where training data or search results may be thinner, causing hallucinations to spike.

Pattern #4: Knowledge Cutoffs

AI tools trained on historical data will apply outdated law unless they actively search for current information.

The “AI Gets Its First Law School A+s” study (2025) provides a striking example: OpenAI’s o3 model applied the Chevron doctrine in an Administrative Law exam, even though Chevron had been overruled by Loper Bright. The model’s knowledge cutoff was May 2024, and Loper Bright was decided in June 2024.

This temporal hallucination problem will always exist unless the tool has web search enabled or actively retrieves from an updated legal database. Not all legal AI tools have this capability, and even those that do may not use it for every query.

Takeaway: Verify that recent legal developments are reflected in AI responses. Ask vendors whether their tool uses web search or real-time database access. Be especially careful when researching areas of law that have recently changed or may be affected by material outside the AI tool’s knowledge base.

Pattern #5: Task Complexity

AI performance drops as task complexity rises, and the drop-off can be severe.

Simple factual recall—like finding a case citation or identifying the year of a decision—works relatively well. But complex tasks involving synthesis, multi-step reasoning, or integration of information from multiple sources show much worse performance.

The Vals report documented a 14-point accuracy drop when moving from basic tasks to complex multi-jurisdictional surveys. A 2025 study on multi-turn legal conversations (LexRAG) found that RAG systems struggled badly with conversational context, achieving best-case recall rates of only 33%.

Multiple studies note that statute and regulation interpretation is particularly weak. Anna Guo’s information extraction research found that when information is missing from a document (like redacted liability caps), AI fabricates answers rather than admitting it doesn’t know.

Takeaway: Match the task to the tool’s capability. High-stakes work, complex multi-jurisdictional research, and novel legal questions require more intensive verification. Don’t assume that because AI handles simple queries well, it will handle complex ones equally well.

Pattern #6: The Confidence Paradox

Perhaps the most insidious finding: AI sounds equally confident whether it’s right or wrong.

The “Large Legal Fictions” study found no correlation between a model’s expressed confidence and its actual accuracy. An AI might present a completely fabricated case citation with the same authoritative tone it uses for a correct one.

This isn’t a bug in specific products—it’s fundamental to how large language models work. They generate statistically probable text that sounds human-like and professional, regardless of underlying accuracy. In fact, recent research suggests the problem may worsen with post-training: while base models tend to be well-calibrated, reinforcement learning from human feedback often makes models more overconfident because they’re optimized for benchmarks that reward definitive answers over honest expressions of uncertainty.

Even the best-performing legal AI tools in the Vals report achieved only 78-81% accuracy. That means roughly one in five responses contains errors, even from top-tier specialized legal tools.

Takeaway: Never trust AI based on how confident it sounds. The authoritative tone is not a reliability signal. Verification is non-negotiable, no matter which tool you use. Be especially wary of newer models that may sound more confident while not necessarily being more accurate.

What This Means for Practice

Specific hallucination percentages will change as technology improves, but these six patterns appear to persist across different models, products, and study methodologies. Understanding them should inform three key decisions:

1. Tool Selection
Understand your tool’s strengths. Legal-specific platforms excel at citation reliability because they search curated, verified databases with citators. General AI with web search can compete on breadth and recency but lacks those verification layers. Within any tool, look for features like the ability to refuse to answer when uncertain (some tools are now being designed to decline rather than hallucinate when data is insufficient—a positive development worth watching for).

2. Query Strategy
Avoid false premises and leading questions. Phrase queries neutrally. Recognize high-risk scenarios: multi-jurisdictional questions, local or state law, lower court cases, recently changed legal doctrines, and complex synthesis tasks.

3. Verification Intensity
Scale your verification efforts to task complexity and risk factors. A simple citation check might need less verification than a complex multi-state legal analysis. But all AI output needs some verification—the question is how much.

Bottom Line

The research is clear: AI hallucinations in legal work are real, measurable, and follow predictable patterns. These studies have found that even the best legal AI tools hallucinate somewhere between 15% and 25% of the time (including both fabrications and mischaracterizations) based on current data.

But understanding these six patterns—models and data access, sycophancy, jurisdictional complexity, knowledge cutoffs, task complexity, and the confidence paradox—helps you make better decisions about which tools to use, which queries to avoid, and how intensively to verify results.

The goal isn’t to avoid AI. These tools can dramatically increase efficiency when used appropriately. The goal is to use them wisely, with eyes wide open about their limitations and failure modes.

Coming next in this series: How hallucination detection tools work and whether they’re worth using, and a practical framework for verifying AI research results.

References

Andrew Blair-Stanek et al., AI Gets Its First Law School A+s (2025).
Link: https://ssrn.com/abstract=5274547
Products tested: OpenAI o3, GPT-4, GPT-3.5
Testing period: Late 2024

Damian Curran et al., Place Matters: Comparing LLM Hallucination Rates for Place-Based Legal Queries, AI4A2J-ICAIL25 (2025).
Link: https://arxiv.org/abs/2511.06700
Products tested: GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro
Testing period: 2024

Matthew Dahl et al., Large Legal Fictions: Profiling Legal Hallucinations in Large Language Models, 16 J. Legal Analysis 64 (2024).
Link: https://doi.org/10.1093/jla/laae001
Products tested: GPT-4, GPT-3.5, PaLM 2, Llama 2
Testing period: 2023

Anna Guo & Arthur Souza Rodrigues, Putting AI to the Test in Real-World Legal Work: An AI evaluation report for in-house counsel (2025).
Link: https://www.legalbenchmarks.ai/research/phase-1-research
Products tested: GC AI, Vecflow’s Oliver, Google NotebookLM, Microsoft Copilot, DeepSeek-V3, ChatGPT (GPT-4o)
Testing period: 2024

Haitao Li et al., LexRAG: Benchmarking Retrieval-Augmented Generation in Multi-Turn Legal Consultation Conversation, ACM Conf. (2025).
Link: https://github.com/CSHaitao/LexRAG
Products tested: GLM-4, GPT-3.5-turbo, GPT-4o-mini, Qwen-2.5, Llama-3.3, Claude-3.5
Testing period: 2024

Varun Magesh et al., Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools, 22 J. Empirical Legal Stud. 216 (2025).
Link: http://arxiv.org/abs/2405.20362
Products tested: Lexis+ AI, Thomson Reuters Ask Practical Law AI, Westlaw AI-Assisted Research (AI-AR), GPT-4
Testing period: May 2024

Bakht Munir et al., Evaluating AI in Legal Operations: A Comparative Analysis of Accuracy, Completeness, and Hallucinations, 53.2 Int’l J. Legal Info. 103 (2025).
Link: https://doi.org/10.1017/jli.2025.3
Products tested: ChatGPT-4, Copilot, DeepSeek, Lexis+ AI, Llama 3
Testing period: 2024

Daniel Schwarcz et al., AI-Powered Lawyering: AI Reasoning Models, Retrieval Augmented Generation, and the Future of Legal Practice (Mar. 2025).
Link: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5162111
Products tested: vLex (Vincent AI), OpenAI (o1-preview)
Testing period: Late 2024
Note: Randomized controlled trial with 127 law students using AI tools

Vals AI, Vals Legal AI Report (Oct. 2025).
Link: https://www.vals.ai/vlair
Products tested: Alexi, Midpage, Counsel Stack, OpenAI ChatGPT
Testing period: First three weeks of July 2025

Thanksgiving Vibe-Coding and the Case for “Single-Serving” Legal Software

Posted on November 27, 2025 by Rebecca Fordon

Way back in 2023, I thought it was amazing how I could use generative AI to streamline my Thanksgiving prep: I gave it my recipes, and it gave me a schedule. It was a static list—a text document that told me when to put the turkey in, when to swap in the stuffing, and so on.

This year, I started with the same routine. I had six dishes—two stovetop, three oven, one “no-cook” dip—and a family who I’d promised dinner by 3:00 PM. I pasted the recipes into Gemini and asked for a timeline. It handled the “Oven Tetris” flawlessly, giving me a step-by-step game plan, with times and ingredient amounts at each stage.

An image of a cooking schedule titled "Goal: Dinner at 3:00 PM. Oven Strategy: 350°F (Stuffing) → 400°F (Tart) → 500°F (Sprouts)."

The section header is "The Prep Phase (11:00 AM – 12:30 PM)" followed by the text: "Get the messy work out of the way now."

The preparatory steps are listed:

For the Stuffing:

Cube 1 lb white bread (if not already done).

Chop 2 cups celery (5-7 ribs) and 2 cups yellow onion (1 large).

Chop 1/3 cup parsley and 2 tbsp fresh herbs (thyme/rosemary/sage).

Whisk together 1 1/2 cups chicken broth and 2 large eggs in a measuring cup.

For the Potatoes:

Peel 4 lbs Yukon Gold potatoes. Cut into 3/4-inch slices. Place in a large pot and cover with water (don’t turn heat on).

But then, I had a realization: I didn’t just want an answer; I wanted a tool. I wanted to be able to check things off as I went. I wanted to see an overview and *also* zoom in on the details.

So, I asked: “What if this was a web app?”

The Shift: From Consumer to Builder

In seconds, Gemini went to work. It gave me a React-based interactive checklist. Suddenly, I wasn’t looking at a static timeline; I was interacting with a piece of software.

But the real magic happened when reality hit. As anyone who has managed a closing checklist or a trial docket knows, the timeline always slips. When my guests told me they’d be an hour late, I realized I’d have to manually calculate the drift for each step.

So, I issued a feature request (this is not a good prompt, but it didn’t matter):

“Add a feature where I adjust what time I’ve finished something so the rest will update”

The AI updated the code. It added a little “reschedule” button, so when I tapped a clock icon next to “Stuffing In,” I could then tap “I Finished This Just Now,” and watch as the entire remaining schedule—the tart, the sprouts, the carrots—automatically shifted forward by an hour. Then I could do it again when I got my stuffing in later than the schedule called for. (If you’d like to check out my app you can do so here: Thanksgiving Checklist).
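The reschedule behavior described above comes down to one piece of arithmetic: measure how far the finished step drifted from its planned time, then add that drift to every step after it. Below is a minimal Python sketch of that idea, not the code Gemini generated (the actual app was React). The names `Step` and `finish_now` and the tart time are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Step:
    """One line of the cooking schedule (hypothetical structure)."""
    name: str
    start: datetime
    done: bool = False


def finish_now(steps, index, finished_at):
    """Mark steps[index] as done at finished_at and shift all later steps.

    Returns the drift (actual time minus planned time) that was applied.
    """
    drift = finished_at - steps[index].start
    steps[index].start = finished_at
    steps[index].done = True
    # Every remaining step moves by the same amount, early or late.
    for step in steps[index + 1:]:
        step.start += drift
    return drift


day = datetime(2025, 11, 27)
plan = [
    Step("Put stuffing in oven", day.replace(hour=14, minute=15)),
    Step("Uncover stuffing", day.replace(hour=14, minute=55)),
    Step("Tart in", day.replace(hour=15, minute=30)),  # assumed time
]

# The stuffing actually went in at 2:25 PM, ten minutes behind plan.
drift = finish_now(plan, 0, day.replace(hour=14, minute=25))
for step in plan:
    print(step.start.strftime("%I:%M %p"), step.name)
```

Because the drift is recomputed each time a step is checked off, the same call handles a second slip later in the day.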

A screenshot of a "Dinner Plan" mobile or web application displaying the cooking schedule. The progress is 0%. The current phase is highlighted as "PHASE 0: THE 'RIGHT NOW' PREP" and scheduled for "Now – 2:00 PM."

Tasks listed under this phase are:

NOW Make Herbed Yogurt Dip: "Mix everything in a bowl. Taste and adjust salt. Serve with snacks."

NOW Make Whipped Cheese Base: "Whip until fluffy. Season with salt. PUT IN FRIDGE."

NOW Final Misc Prep: "Small tasks."

The next scheduled phase is visible at the bottom: "PHASE 1: THE STUFFING" scheduled for "2:00 PM – 3:20 PM." The first task in this phase is "2:00 PM Sauté & Mix Stuffing". — Left is overview, right is the detail (with the reschedule feature).

A screenshot of a "Dinner Plan" mobile or web application showing the cooking schedule, which has progressed to "PHASE 1: THE STUFFING" scheduled for "2:00 PM – 3:20 PM." A highlighted task, "PUT STUFFING IN OVEN", is displayed with a 'SYNC SCHEDULE' pop-up overlay.

The pop-up is dark blue with a button that reads "I Finished This Just Now" and options to adjust the time: "Earlier (-5m)" and "Later (+5m)." The overlay states: "Adjusts all future steps automatically."

The task details for PUT STUFFING IN OVEN are: "Cover tightly with foil. Bake 40 mins at 350°F." and "Click the clock icon here when you actually close the oven door!" A smaller status tag above the task reads 2:15 PM and OVEN IN/OUT (with the 'OVEN IN/OUT' being smaller and circled).

The next task, 2:55 PM Uncover Stuffing, is partially visible: "Remove foil. Bake another 30-35 minutes until crisp." — Left is overview, right is the detail (with the reschedule feature).

The result? Despite how tightly timed my schedule was, dinner was on the table only 15 minutes late. For my household, where “at least an hour late” is the standard for a holiday meal, this was a massive victory.

The Era of “Single-Serving Software”

We often think of legal technology as big, enterprise-grade platforms: the Case Management System, the Deal Room, the Firm Portal. These tools are excellent for standard workflows. But legal work is rarely standard. It lives in the messy, human chaos between the formal deadlines.

My Thanksgiving experiment proves that the barrier to entry for building “Micro-Tools” has collapsed. We are entering the era of Single-Serving Legal Software—bespoke apps built for a single trial, a single deal, or a single crisis, and then discarded when the matter closes.

Here is what that looks like in practice (all ideas from Gemini because I’ve been out of legal practice too long… I’m curious if readers think any have merit):

1. Litigation: The “Witness Wrangler”

Standard case management software handles court deadlines, but it rarely handles the human logistics of a trial.

The Problem: You have 15 witnesses. Some need flights, some need prep sessions, some are hostile. Their schedules depend entirely on when the previous witness finishes on the stand.
The Single-Serving App: Instead of a static spreadsheet, you spin up a dynamic dashboard shared with the paralegal team.
The “Reschedule” Feature: You click “Witness A ran long; pushed to tomorrow morning.” The app automatically text-alerts Witness B to stay at the hotel and updates the car service pickup time.

2. Transactional: The “Non-Standard” Closing

Deal software is amazing for corporate M&A, but terrible for “weird” assets.

The Problem: You are selling a massive ranch. The closing checklist includes “Transfer Water Rights,” “Inspect Cattle,” and “Repair Barn Roof.” These aren’t just document signings; they are physical events with dependencies.
The Single-Serving App: A logic-based checklist where “Cattle Inspection” is locked until “Barn Roof Repair” is marked Complete. If the roof crew is delayed, the inspection auto-reschedules, alerting all parties.
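
A "locked until complete" checklist like the one described above can be modeled as a map from each task to its prerequisites. The following is a rough Python sketch under that assumption. The `Checklist` class and its method names are hypothetical, and rescheduling and alerts are left out.

```python
class Checklist:
    """Closing checklist where a task stays locked until its prerequisites are done."""

    def __init__(self, depends_on):
        # depends_on maps a task name to the list of tasks that must finish first.
        self.depends_on = depends_on
        self.completed = set()

    def is_locked(self, task):
        """A task is locked while any of its prerequisites is still open."""
        return any(p not in self.completed for p in self.depends_on.get(task, []))

    def complete(self, task):
        """Mark a task complete, refusing if it is still locked."""
        if self.is_locked(task):
            raise ValueError(f"{task!r} is locked until its prerequisites are complete")
        self.completed.add(task)


closing = Checklist({"Cattle Inspection": ["Barn Roof Repair"]})
print(closing.is_locked("Cattle Inspection"))  # locked while the roof is unrepaired
closing.complete("Barn Roof Repair")
print(closing.is_locked("Cattle Inspection"))  # unlocked once the repair is marked complete
```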

3. Mass Torts: The “Toxic Plume” Intake

Intake CRMs are generic. Sometimes the “qualification criteria” for a case are chemically or geographically complex.

The Problem: You only want to sign clients who lived in a specific, jagged geographic zone between 1995 and 1998.
The Single-Serving App: A simple web form where a potential client drops a pin on a map.
The Logic: The app performs a “point-in-polygon” check against the specific toxic plume map you uploaded. It instantly tells the intake clerk “Qualified” or “Out of Zone,” saving hours of manual review.
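
For readers curious what a "point-in-polygon" check involves, here is a small Python sketch using the standard ray-casting approach: count how many polygon edges a horizontal ray from the point crosses, and treat an odd count as inside. The function names and the L-shaped zone are made up for illustration. The sketch treats coordinates as flat x/y values, so a real intake tool would probably lean on a mapping library, and the 1995 to 1998 residency criterion is not modeled.

```python
def point_in_polygon(point, polygon):
    """Ray casting: return True if point lies inside the polygon.

    polygon is a list of (x, y) vertices in order; points exactly on an
    edge are not handled specially in this sketch.
    """
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def intake_result(point, zone):
    """Translate the geometric check into the label the intake clerk sees."""
    return "Qualified" if point_in_polygon(point, zone) else "Out of Zone"


# Hypothetical L-shaped zone standing in for the uploaded plume map.
plume_zone = [(0, 0), (4, 0), (4, 2), (2, 2), (2, 4), (0, 4)]
print(intake_result((1, 3), plume_zone))
print(intake_result((3, 3), plume_zone))
```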

The Accidental Product Roadmap

The beauty of this approach is that it requires zero commitment. I built this app for one dinner. I didn’t worry about making it generalizable. I didn’t build a “Recipe Importer” feature; I just hard-coded the stuffing because it was faster.

But now that I’ve used it, I’m thinking: “Next year, I should ask the AI to create a drag-and-drop interface so I can just paste URLs for any holiday.”

This is exactly how legal innovation should happen. Too often, firms try to buy or build the “Perfect Platform” first. It takes years and costs millions. Single-Serving Software acts as the ultimate Minimum Viable Product (MVP).

Build a specific, hard-coded app for Jones v. Smith.
Validate that the “Witness Rescheduler” actually saved the paralegal 10 hours.
Generalize it only after it proves its value, so someone else in the firm can use it for Doe v. Roe.

You don’t start with the platform. You start with the problem.

A Note on Security & Tools

You might be thinking: “Wait, uploading client data to a web app? Compliance will have a heart attack.”

It’s a valid concern. But the beauty of these AI-generated tools is that they can often be delivered as a single HTML file that you can then save and run entirely locally on your machine—no data leaves the browser. Furthermore, if you are using an Enterprise version of your preferred LLM, your inputs remain within the firm’s secure boundary.

Speaking of tools, this capability isn’t exclusive to one platform. Whether you use Gemini, ChatGPT, or Claude, the ability to turn a prompt into a working React or HTML artifact is now a standard feature. The power lies not in the specific model, but in your willingness to ask for code instead of text.

Conclusion

We are no longer just the consumers of legal software; we are the architects. We can now build the infrastructure to manage our own chaos.

The next time you are drowning in a complex matter, don’t just ask AI for a memo or a checklist. Ask it for a tool. You might just find yourself managing the chaos (almost) on time.

Legal Research Trapping You in an “AI Tunnel”? Use a Toe-hold to Get Out

Posted on October 28, 2025 by Rebecca Fordon

I’ve been watching my legal research students use AI and noticing a common pattern.

They typically go into an AI “Ask” feature in Lexis or Westlaw, get an answer, and then continue the conversation by asking more questions. This is exactly what the tools are designed to encourage.

The problem is that this process often leaves them with only a handful of sources, and not always the most relevant or authoritative ones. They miss critical nuance, and—most dangerously—they can’t see what the AI has limited or hidden from them.

I’ve started calling this the “AI Tunnel.” And I’ve realized that as expert researchers, it’s our job to teach them how to escape it.

A person climbing a cliff. The photo is zoomed in to show only their leg from the knee down. Their toe is the only thing supporting them. — Photo by Patrick Hendry on Unsplash

The “AI Tunnel” vs. The “Toe-Hold” Strategy

When I use generative AI in my own research, I’m doing something completely different from my students. I’m using it for a “toe-hold.”

I ask AI to “explain the elements of X” to get the key concepts, and I immediately pivot to a treatise to get further detail on those concepts.
I ask AI to “find the statute for Y” to get the statute number, and I immediately pivot to the Notes of Decisions.
I ask AI to “find a few starting cases for Z” to get one good case, and I immediately pivot to the citator and its headnotes, or use the vocabulary to craft a search.

In other words, I use AI as a 1-minute scaffold to get me to traditional research tools. My students are using it to have a 30-minute conversation that delays them from finding the best sources (or maybe they never find them at all).

They are missing the pivot.

Our “Expert Blind Spot” is Their Biggest Hurdle

My first instinct was to just tell them my strategy. “Don’t stay in the AI! Pivot!”

A GIF of a clip from the TV show Friends. Ross and an unseen Friend are trying to get a couch up the stairs and Ross is shouting "PIVOT!!" — Friends is cool again, so I can use this GIF

But as we all know, that doesn’t work. This is a classic “Expert Blind Spot” problem.

The “Toe-Hold” strategy, for an expert, is one seamless, automatic action. For a novice, it’s a series of high-friction steps that rely on implicit skills we take for granted:

Diagnostic Skimming: We don’t read the AI’s wall-of-text answer. We scan it. Our students, who are not yet skilled at skimming, try to read it and get overwhelmed by the noise.
“Pivot Point” Identification: Our expert eyes are trained to instantly spot the “pivot points”: a statute number (O.R.C. 5321.16), a key case name (Bowen v. Kil-Kare, Inc.), or a term of art (“natural accumulation rule”). To a 1L, this is all just undifferentiated text.
Process Knowledge: We automatically know the “if-then” script: “If I have a case, then I go to the citator.” A 1L doesn’t have that script memorized yet.

So we can’t just tell them the strategy. We have to make these implicit skills explicit.

Making the “Toe-Hold” Teachable: Three Concrete Techniques

I’m now redesigning my talks to 1Ls around this single goal. Here are the three main pedagogical tools I’m using to scaffold this “expert” skill for “novices”:

1. The “Narrated Skim”

This is the most critical piece. I’ll do a “canned” demo, put an AI-generated answer on the screen, and literally narrate my internal monologue out loud.

“Okay, I’ve got my answer. I am NOT reading this whole thing. My eyes are scanning only for a statute number, a case name, or a key term of art. I’m ignoring the summary… ignoring the intro… Ah! [point with mouse] Right here: Bowen v. Kil-Kare, Inc. That’s my toe-hold. That’s all I need. I am now leaving this screen.”

This is Cognitive Apprenticeship—making our expert thinking visible.

2. The “Pivot Point” Checklist

To lower cognitive load, I’m giving them a simple checklist that explicitly lists what they are skimming for.

What Am I Skimming For? (An Expert’s Checklist)

Specific Statute Numbers (e.g., O.R.C. 5321.16)

Key Case Names (e.g., Bowen v. Kil-Kare, Inc.)

Key Phrases / Terms of Art (e.g., “natural accumulation rule”)

Key Secondary Sources (e.g., “as mentioned in Prosser and Keeton on Torts”)

3. The “Find the Pivot” Interactive Exercise

My main in-class exercise is no longer a complex problem. It’s a highly scaffolded, 5-minute task focused only on this one skill.

The Task: I’ll give them an AI-generated answer. In pairs, their goal is not to find the “answer.” Their goal is to find the “toe-hold.”
The Prompt: “You have 3 minutes. Scan this document and find the one statute, one case, or one key phrase you would use to ‘escape the tunnel.’ Be prepared to tell me where you would pivot to next (e.g., ‘the Notes of Decisions’ or ‘KeyCite’).”

This approach re-centers our value. We’re teaching students how to build a comprehensive research process, and that AI is just one tool in that toolbox.

How are you teaching this “pivot”? What other “expert blind spots” have you run into when teaching AI? I’d love to hear your thoughts in the comments.

Future of Law Libraries Initiative

Posted on October 13, 2025 by Sean Harrington

The impact of AI on varied aspects of our professional lives is covered regularly on this blog. It is reshaping legal research, education, and legal practice in ways that threaten to leave us behind if we fail to be proactive. It is why the Future of Law Libraries Initiative gathered professionals from academic, court, firm, and government libraries and allied professions through six regional roundtables to identify what steps we need to take now to ensure an impactful, empowered, ethical future.

The message from these roundtables was clear: legal information professionals must take coordinated action on AI policy, training, and infrastructure. To accomplish this, three main recommendations came out of those discussions.

Create a Centralized AI Organization

Law library leaders agreed on the need for a shared, profession-wide structure to:

Connect experts and facilitate collaboration.
Set shared priorities for AI standards, ethics, and vendor engagement.
Advocate for legal information professionals in AI discourse.

This organization could take the form of a new consortium or be embedded within an existing network, but its purpose would remain the same: to ensure law libraries have a unified voice and strong presence in AI governance.

Develop Tiered AI Training for Legal Information Professionals

Ad hoc workshops and webinars are no longer enough. To remain relevant, the profession needs robust, role-based training that builds AI competencies at multiple levels—from awareness to leadership. Training should be hands-on, case-based, and designed to produce practical work products.

A train-the-trainer model could help scale capacity, ensuring that AI knowledge reaches across all library types and staff levels while building long-term expertise.

Establish a Centralized AI Knowledge Hub

To avoid fragmentation and duplication of effort, roundtable participants recommended creating an open, curated repository governed by legal information professionals. This hub would serve as a durable home for:

Policies and standards
Teaching resources and curricula
Evaluation protocols and case studies
Model contracts and datasets

By sharing resources openly, the hub would accelerate adoption of best practices and ensure equitable access across institutions of all sizes.

Dig Deeper — Read the White Paper

This initiative produced a white paper that digs deeper into these recommendations, including practical next steps and insights from the roundtable conversations. It’s a valuable resource for anyone thinking about the future of law libraries and AI.

Get Involved

We are forming working groups to move these recommendations forward.

Steering Committee – Guides the overall vision.
Consortium Charter Group – Shapes governance and structure.
Training Development Group – Builds core AI competencies and pilot programs.
Knowledge Hub Group – Designs the hub and its policies.

More detailed descriptions of the charges, scope of work, and time commitments are outlined in the report. Volunteers should be prepared to commit a year for this first phase.

Volunteer Today

Effortless Boolean: A Free Tool to Supercharge Your Legal Research

Posted on September 18, 2025 by Rebecca Fordon

As anyone who has taught legal research knows, Boolean searching is a superpower. The ability to craft a precise query with terms and connectors is the difference between finding a needle in a haystack and finding nothing at all. But for newcomers, the syntax of ( ), !, /p, and /s can feel like learning a new language under pressure.

The Legal Boolean Search Builder is built directly on a process I’ve been teaching for a while now—an 8-step method designed to take the guesswork out of query construction. It moves from identifying key concepts, to brainstorming alternates, and finally to connecting them with the right syntax.

For years, I’ve shared this process in slide decks, but it’s always been static. I wanted to turn it into something dynamic—a tool that could handle the syntax so that researchers could focus on the strategy.

A screenshot of the Legal Boolean Search Builder, as described in the rest of this post, and available at https://booleanbuilder.replit.app/

The Building Process: An Iterative Approach

I built this project using Gemini’s Canvas, and so it may look familiar to Gemini users. It uses HTML, Tailwind CSS for styling, and vanilla JavaScript for all the interactive logic. No complex frameworks, no dependencies—just a single file you can open in any browser. I then threw it into a GitHub repo and imported it into Replit so I could host it there.

This came together in a few hours, so I’m sure there are further tweaks and improvements I could make. I’m immensely grateful to Charlie Amiot and Debbie Ginsberg for their sharp insights and invaluable suggestions that took the tool from a basic concept to a polished, user-friendly application.

Finally, this project was significantly influenced by an amazing fillable PDF created by Dan Kimmons and Tara Mospan. Dan described his process for going from worksheet to fillable PDF in these very pages a few years ago.

How It Works: Key Features

The core idea is to break down the complex task of writing a Boolean query into manageable steps.

1. The Two-Column Layout

The user interface is split into two main sections. On the left, you build your concepts step-by-step. On the right, you see your search string come to life in real-time, along with a helpful review checklist. This instant feedback loop is key to the learning process.

2. Smart Suggestions for Phrases

One of the biggest hurdles for new researchers is knowing when to use an exact phrase search (e.g., "assumption of risk") versus a more flexible proximity search. The tool helps by automatically suggesting a proximity search, filtering out common stop words to focus on the core terms.
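
As a rough illustration of that suggestion step (not the tool's actual JavaScript), the Python sketch below drops stop words from a phrase and joins what remains with a proximity connector. The stop-word list, the function name `suggest_proximity`, and the default `/3` connector are all assumptions made for the example.

```python
# Hypothetical stop-word list; the real tool's list is not shown in this post.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "or", "the", "to"}


def suggest_proximity(phrase, connector="/3"):
    """Turn an exact phrase into a proximity search over its core terms."""
    terms = [word for word in phrase.lower().split() if word not in STOP_WORDS]
    return f" {connector} ".join(terms)


print(suggest_proximity("assumption of risk"))  # assumption /3 risk
```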

3. The Truncation Builder

Finding the correct word root for truncation can be tricky. Is it assum! or assump!? To solve this, I added a “Truncation Builder” modal. You can enter all the variations of a word you can think of, and the tool finds the common root, providing you with the most effective truncated term to copy and use.
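
The root-finding idea can be sketched in a few lines of Python using the standard library's `os.path.commonprefix`, which returns the longest prefix shared by a list of strings. This is an illustration of the concept rather than the tool's own code. The function name `truncated_term` and the minimum root length are assumptions.

```python
import os


def truncated_term(variants, min_root=3):
    """Return the shared root of the variants followed by the ! root expander."""
    words = [v.strip().lower() for v in variants if v.strip()]
    if not words:
        raise ValueError("enter at least one variation of the word")
    # commonprefix compares character by character, which is what we want here.
    root = os.path.commonprefix(words)
    if len(root) < min_root:
        raise ValueError(f"shared root {root!r} is too short to truncate safely")
    return root + "!"


print(truncated_term(["assume", "assumed", "assuming", "assumption"]))  # assum!
```

The length guard is there because a very short root (say, "as!") would sweep in far too many unrelated words.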

Try It Yourself

This project was a fantastic experience in turning a teaching methodology into a living tool. The goal was never to replace the critical thinking that goes into legal research, but to remove the syntactic barriers that can get in the way.

You can try the tool out for yourself and view the source code on GitHub. I’d love to hear your feedback!

Benchmarking a Moving Target, or let’s run a hypo through 7 AIs and see what happens

Posted on September 5, 2025 by Guest Blogger

Debbie Ginsberg, Guest Blogger

Benchmarking should be simple, right? Come up with a set of criteria, run some tests, and compare the answers. But how do you benchmark a moving target like generative AI?

Over the past months, I’ve tested a sample legal question in various commercial LLMs (like ChatGPT and Google Gemini) and RAGs (like Lexis Protégé and Westlaw CoCounsel) to compare how each handled the issues raised. Almost every time I created a sample set of model answers to write about, the technology would change drastically within a few days. My set became outdated before I could start my analysis. While this became a good reason to procrastinate, I still wanted to show something for my work.

As we tell our 1Ls, sometimes you need to work with what you have and just write.

The model question

In May, I asked several LLMs and RAGs this question (see the list below for which ones I tested):

Under current U.S. copyright law (caselaw, statutes, regulations, agency information), to what extent are fonts and typefaces protectable as intellectual property? Please focus on the distinction between protection for font software versus typeface designs. What are the key limitations on such protection as established by statute and case law? Specifically, if a font has been created by proprietary software, or if a font has been hand-designed to include artistic elements (e.g, “A” incorporates a detailed drawing of an apple into its design), is the font entitled to copyright protection?

I chose this question because the answer isn’t facially obvious – it straddles the line between “typeface isn’t copyrightable” and “art and software are copyrightable”. To answer the question effectively, the models would need to address that nuance in some form.

The model benchmarks

The next issue was how to compare the models. In my first runs, the answers varied wildly. It was hard to really compare them. Lately, the answers have been more similar. I was able to develop a set of criteria for comparison. So for the May set, I benchmarked (or at least checked):

Did the AI answer the question that I asked?
Was the answer thorough (did it more or less match my model answer)?
Did the AI cite the most important cases and sources noted in my model answer?
Were any additional citations the AI included at least facially relevant?
Did the model refrain from providing irrelevant or false information?

I did not benchmark:

Speed (we already know the reasoning models can be slow)
Whether the citations were wrong in a non-obvious way

The model answer and sources

According to my model answer, the best answers to the question should include at least the following:

Font software: Font software that creates fonts is protected by copyright. The main exception is software that essentially executes a font or font file, meaning the software is utilitarian rather than creative.
Typefaces/Fonts: Neither of these is protected by copyright law. Fonts and typefaces may have artistic elements that are protected by copyright law, but only the artistic elements are protected, not the typefaces or fonts themselves.
The answer should include at least some discussion as to whether a heavily artistic font qualifies for protection.

Bonus if the answer addressed:

Separability: If the art can be separated from the typeface/font, it’s copyrightable.
Alternatives: Can the font/typeface be protected by other IP protections such as licensing, patents, or trademarks?
International implications: Would we expect to see the same results in other jurisdictions?

In answering this question, I expected the LLMs and RAGs to cite:

The copyright statute (which provides the basis for any copyright determinations)
Copyright regulations (provide additional rules for determining copyright)
Adobe Sys. v. Southern Software, Inc. (software that creates fonts can be copyrighted)
Laatz v. Zazzle, Inc. (a newer case discussing copyrighting fonts made with software; it also includes a discussion about alternatives to copyright)
Shake Shack Enterprises, LLC et al v. Brand Design Company, Inc. (discusses limits of copyrighting font software)
The Copyright Compendium, 2021 (from the Copyright Office) (features hypotheticals about artistic elements in fonts/typefaces – this is probably the most important resource)

Benchmarking with the AI models

For this post, I ran my model in the following LLMs/RAGs:

Lexis Protégé (work account)
Westlaw CoCounsel (work account)
ChatGPT o3 deep research (work account)
Gemini 2.5 deep research (personal paid account)
Perplexity research (personal paid account)
DeepSeek R1 (personal free account)
Claude 3.7 (personal paid account)

I’ve set up accounts in several commercial GenAI products. Some are free, some are Pro, and Harvard pays for my ChatGPT Enterprise account. As an academic librarian, I have access to CoCounsel and Protégé.

The individual responses are included in the appendix.

I didn’t have access to Vincent or Paxton at the time. I didn’t have ChatGPT o3 Pro, either. Later in June, Nick Halperin ran my model in Vincent and Paxton, and I ran the model in o3 Pro. Those examples, as well as GPT5, will be included in the appendix but they are not discussed here.

Benchmarking the results

In parsing the results, most answers were fairly similar with some exceptions:

Source	Font software copyrightable	Typefaces/fonts not copyrightable	Exceptions to font software copyright	Art in typefaces/fonts copyrightable
Lexis Protégé	Yes	Yes	Yes	No
Westlaw CoCounsel	Yes	Yes	No	Yes
ChatGPT o3 deep research	Yes	Yes	Yes	Yes
Gemini 2.5 deep research	Yes	Yes	Yes	Yes
Perplexity research	Yes	Yes	Yes	Yes
DeepSeek R1	Yes	Yes	Yes	Yes
Claude 3.7	Yes	Yes	Yes	Yes

Font software is copyrightable: in all answers
Typefaces/fonts are not copyrightable: in all answers
Exceptions to font software copyright: in all answers except Westlaw
Art in typefaces/fonts is copyrightable: in all answers except Lexis

Several answers included additional helpful information:

Source	Separability	Copyright Office policies	Alternatives	Licensing	Int’l	Recent	State law
Lexis Protégé	Yes	No	No	No	No	No	No
Westlaw CoCounsel	No	No	No	No	No	No	Yes
ChatGPT o3 deep research	Yes	Yes	Yes	Yes	Yes	Yes	No
Gemini 2.5 deep research	Yes	Yes	Yes	Yes	No	No	No
Perplexity research	Yes	No	Yes	No	No	No	No
DeepSeek R1	Yes	No	Yes	No	No	No	No
Claude 3.7	No	No	Yes	Yes	Yes	No	No

Discussions about separability: Gemini, ChatGPT, Deep Seek (to some extent), Perplexity, Lexis
Specific discussions about Copyright Office policies: Gemini, ChatGPT
Discussions about alternatives to copyright (e.g., patent, trademark): Gemini, Claude, ChatGPT, Deep Seek, Perplexity
Specific discussions about licensing: Gemini, Claude, ChatGPT
International considerations: Claude, ChatGPT
Recent developments: ChatGPT
State law: Westlaw

The models were somewhat consistent about what they cited:

LLM/RAG	Copyright statute	Copyright regs	Adobe	Laatz	Shake Shack	The Copyright Compendium
Lexis Protégé	Yes	Yes	Yes	Yes	No	No
Westlaw CoCounsel	Yes	Yes	Yes	Yes	Yes	No
ChatGPT o3 deep research	Yes	Yes	Yes	No	No	Yes
Gemini 2.5 deep research	Yes	Yes	Yes	Yes	No	Yes
Perplexity research	No	Yes	No	No	No	Yes
DeepSeek R1	Yes	Yes	Yes	No	No	No
Claude 3.7	No	Yes	Yes	No	No	No

The Copyright statute: Lexis, Westlaw, Deep Seek, Chat GPT, Gemini
Copyright regs: cited by all
Adobe: Lexis, Westlaw, Claude, Deep Seek, Chat GPT, Gemini
Laatz: Lexis, Westlaw, Gemini
Shake Shack: Westlaw
The Copyright Compendium: Perplexity, Chat GPT, Gemini; Lexis cited to Nimmer for the same discussion

The models also included additional resources not on my list:

LLM/RAG	Blogs etc.	Restat.	Eltra	Law review	Articles about loans	LibGuides
Lexis Protégé	Yes	Yes	Yes	No	No	No
Westlaw CoCounsel	Yes	No	No	Yes	Yes	No
ChatGPT o3 deep research	Yes	No	Yes	No	No	No
Gemini 2.5 deep research	Yes	No	Yes	No	No	Yes
Perplexity research	No	No	No	No	No	No
DeepSeek R1	No	No	Yes	No	No	No
Claude 3.7	Yes	No	Yes	No	No	No

Blogs, websites, news articles: The commercial LLMs. Gemini found the most, but it’s Google.
Restatement: Lexis
Eltra Corp. v. Ringer, 1976 U.S. Dist. LEXIS 12611: Lexis, Claude, Deep Seek, Chat GPT, Gemini (it’s not a bad case, but not my favorite for this problem)
An actual law review article: Westlaw
Higher interest rate consumer loans may snag lenders: Westlaw (not sure why)
LibGuides: Gemini
Included a handy table: ChatGPT, Gemini

The answers varied in depth of discussion and number of sources:

Lexis: 1 page of text, 1 page of sources (I didn’t count the sources in the tabs)
Westlaw: 2.5 pages of formatted text, 17 pages of sources
ChatGPT: 8 pages of well-formatted text, 1 page of sources
Gemini: 6.5 pages of well-formatted text, 1 page of sources
Perplexity: A little more than 4 pages of text, about 1 page of sources
Deep Seek: a little more than 2 pages of weirdly formatted text, no separate sources
Claude: 2.5 pages of well-formatted text, no separate sources

Hallucinations

I didn’t find any sources that were completely made up
I didn’t find any obvious errors in the written text, though some sources made more sense than others
I did not thoroughly examine every source in every list (that would require more time than I’ve already devoted to this blog post).

Some random concluding thoughts about benchmarking

When I was running these searches, I was sometimes frustrated with the Westlaw and Lexis AI research tools. Not only do they fail to describe exactly what they are searching, they also don’t necessarily capture critical primary sources in their answers (we can get a general idea of the sources used, but not as granular as I’d like). For example, the Copyright Compendium includes one of the more relevant discussions about artistic elements in fonts and typefaces, but that discussion isn’t captured in the RAGs. To be sure, Lexis did find a similar discussion in Nimmer; Westlaw didn’t find anything comparable, although it did cite secondary sources.

In general, the responses provided by all of the generative AI platforms were correct, but some were more complete than others. For the most part, the commercial reasoning models (particularly ChatGPT and Gemini) provided more detailed and structured answers than the others. They also provided responses using formatting designed to make the answers easy to read (Westlaw did as well).

None of the models appeared to consider that recency would be a significant factor in this problem. Several cited a case from the 1970s that didn’t concern fonts. Several failed to cite Laatz, a recent case that’s on point. Lexis and Westlaw, of course, cited to authoritative secondary sources (and even a law review article in Westlaw’s case). The LLMs were less concerned with citing to authority. In all cases, I would have preferred a more curated set of resources than the platforms provided.

Finally, none of the platforms included visual elements in what is inherently a visual question. It would have been nice to see some examples of “this is probably copyrightable and this is not” (not that I directly asked for them).

Coming Soon: The Interactive GenAI Legal Hallucination Tracker — Sneak Peek Today!

Posted on August 10, 2025 by Jenny Wondracek

If you follow me on LinkedIn or spoke with me at AALL, you’ve probably seen me teasing this project like it’s the season finale of a legal tech drama. Well, the wait is (almost) over — here’s your official sneak peek at our forthcoming interactive GenAI Legal Hallucination Tracker.

The People Behind the Tracker

First, credit where credit is due: fellow law librarian Mary Matuszak, the ultimate sleuth of AI blunders. I’ve sent many curious folks her way on LinkedIn, where she’s been posting hallucinations far more regularly than anyone else. By mid-July, when she sent me this spreadsheet, she’d logged 485 entries — and yes, the number has since blown past 500. She’s basically the Nellie Bly of questionable legal citations.

Next up, my research assistant, Nick Sanctis — the wizard making the interactive tracker happen and gently forcing me to learn just enough R to be dangerous. If there’s a delay, blame my attempts to juggle teaching, running a library, staying current with AI developments, and decoding the mysteries of R this fall.

As for me? I’m the publisher, the cheerleader, and the student in this equation.

The Plan

Today we’re releasing the basic tracker data in a sortable and searchable table format. In the coming weeks, we’ll roll out the more robust interactive version, followed by new features for viewing, filtering, and analyzing the data, each announced in its own post.

But wait! There’s more! We want you to be part of it! Soon, we’ll be recruiting volunteers to:

Help us find and add more hallucination cases (submission method coming soon)
Analyze the data and share insights with the legal community

If you use the tracker, please cite or link to it in your work. Proper attribution keeps this project alive and growing.

The Data

View Full-Screen Interactive Table ↗