Can’t Stop, Won’t Stop: One Semester, Eight Vibe-Coded Teaching Tools

Somewhere around tool number five, I realized I had a problem. The kind where you keep saying “almost done” and then it’s 2 AM and you’re finalizing a React app that lets your students act as bots and deploy hallucinations against one another.

I teach 21st Century Lawyering at The Ohio State University Moritz College of Law. It’s a 3-credit course on law and emerging technology — AI, cybersecurity, document tech, legal automation. And this semester, I built a custom interactive tool for nearly every class session. Eight tools (and counting) for a single course.

I won’t pretend to have come up with all of these ideas on my own. One of the best things about the legal tech education community is how generously people share. Someone demos an exercise at a conference, posts a tool on LinkedIn, or walks through a concept in a workshop, and it sparks something. Nearly every tool below started with an idea I saw someone else do and I thought, “I could build a version of that for my students.” AI-assisted development makes that possible in a way it never was before, totally collapsing the gap between “that’s a great idea” and “I’m using it in class tomorrow.” So this post is partly a show-and-tell, and partly a thank-you to the people whose work made me want to build.

Here’s what I made, who inspired it, and what I learned about what happens when a law professor with some Python experience and an AI coding assistant starts saying yes to every pedagogical impulse.

A note on the live demos: Many of these tools run on my personal API keys. I’ve put some money into keeping them live, but when it runs out I probably won’t re-up. If a demo isn’t working, you can always clone the repo and run it yourself; I’ve tried to keep everything open-source.

The Tools

1. TokenExplorer (Week 2)

The LLM Explorer web app showing a token-by-token probability visualization. The left panel has an "Input" section with the prompt "Give me a list of 10 recent opinions involving lawyers using AI hallucinations. They should have full citations." (estimated at 29 tokens), plus "Model Settings" with GPT-3.5 Turbo selected, temperature slider at 1.5, and max tokens at 712. A "Generate" button sits below. The right panel shows the model's output (505 completion tokens) with each token color-coded by prediction probability: green (90-100%), light green (70-89%), light blue (50-69%), yellow (30-49%), orange (10-29%), and pink (0-9%). The output is a hallucinated list of fake cases — "Smith v. Jones, 2021 WL 123 456 (District Court of California, Feb. 15, 2021)," "Brown v. Green, 2021-US-12345," etc. — with many tokens in yellow, orange, and pink, visually demonstrating that the model is generating low-confidence fabrications. The color pattern makes hallucination *visible*: students can see that case names, docket numbers, and dates are produced with notably low probability scores, revealing the model is essentially guessing.

Try it | GitHub

Inspired by: I saw someone demo something similar at a presentation a while back. I wish I could remember who, because it stuck with me. Making the probabilistic nature of an LLM visible rather than just explaining it conceptually was immediately compelling. Token visualizers aren’t new (OpenAI has one, several developers have built similar tools), but I hadn’t seen one built for a classroom context where the point is to emphasize the probabilities. If the original demo was yours, please reach out so I can credit you properly.

The problem: Students arrive thinking LLMs are databases that look up correct answers.

What I built: An interactive tool where students manipulate temperature settings and watch probability distributions shift in real time. They change context and see how a similar prompt produces different next-token predictions. They test factual questions and watch the probabilities change as more hallucinations arise.

Why it matters: Once students see that LLMs are statistical prediction engines generating likely text, not true text, everything else clicks. When they encounter hallucinations later in the semester, they already understand why they happen.
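Under the hood, a visualizer like this needs only per-token log probabilities, which some APIs expose (OpenAI's chat completions, for example, accept `logprobs=True` with `top_logprobs`). Here is a minimal Python sketch of just the color-coding step; the helper names are hypothetical, and the bands follow the legend in the screenshot above:

```python
import math

# Sketch of the color-coding at the heart of a token-probability visualizer.
# Helper names are hypothetical; the bands mirror the legend in the
# screenshot above (90-100% green down to 0-9% pink).
BANDS = [
    (0.90, "green"), (0.70, "light green"), (0.50, "light blue"),
    (0.30, "yellow"), (0.10, "orange"), (0.00, "pink"),
]

def color_for(prob):
    """Map a token's probability to a confidence color."""
    for floor, color in BANDS:
        if prob >= floor:
            return color
    return "pink"

def colorize(token_logprobs):
    """Turn (token, logprob) pairs into (token, probability, color) triples."""
    return [
        (token, round(math.exp(lp), 3), color_for(math.exp(lp)))
        for token, lp in token_logprobs
    ]

# Fabricated citations tend to arrive as runs of low-probability tokens:
demo = [("Smith", -0.05), (" v.", -0.10), (" Jones", -2.30), (", 2021", -1.90)]
for token, p, color in colorize(demo):
    print(f"{token!r}: p={p} -> {color}")
```

The point of the exercise is that "confidence" is a number you can read directly off the model, not something students have to infer from the authoritative tone of the output.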

2. Prompt Coach (Week 5)

The Prompt Coach split-panel web app. The left "Your Workspace" panel shows a Claude conversation (model: Claude Haiku Fast). The student's prompt asks: "Give me a list of 10 recent opinions involving lawyers using AI hallucinations. They should have full citations." Claude's response is titled "Recent Opinions on Lawyers Using AI Hallucinations" and transparently explains it cannot verify specific citations, listing what it knows exists (Mata v. Avianca, Colorado Bar disciplinary matters, State Bar of California ethics opinions) and recommending Google Scholar for accurate citations. The right "Prompt Coach" panel (branded "21st Century Lawyering - OSU Moritz College of Law") shows a "Hallucination Risk" review. The Coach Feedback section analyzes the student's prompt: "Good news: the AI handled your high-risk request appropriately by refusing to fabricate citations. But your prompt created exactly the scenario where hallucination is most dangerous." It explains the student asked for specific citations from memory — "the AI's weakest area" — and suggests alternative approaches: requesting search and verification explicitly, uploading case databases, and notes that the follow-up ("I need 10 cases minimum") doubled down on the risk. Footer: "Created by Rebecca Fordon | MIT License | GitHub."

Try it | GitHub

Inspired by: Sean Harrington’s prompting workshops. I still haven’t had the pleasure of experiencing one in person (I’m looking forward to finally catching him at ABA TECHSHOW), but hearing about his approach to teaching prompting, with feedback from AI, made me want to build something that could coach students through it in real time.

The problem: Students need to practice prompting, but there’s no good way to give them feedback at scale. I can’t stand behind 30 (or even 11) laptops at once.

What I built: A split-panel web app. The left side is a blank Gemini chat where students draft and test prompts. The right side is an AI coaching interface that evaluates their technique across dimensions like context engineering, document selection, and confidentiality awareness. The coach doesn’t revise the AI’s output directly, but instead connects output problems back to what could be improved with the prompt.

Why it matters: Legal-specific coaching catches things generic prompt guides miss. It flags when a student uploads privileged documents. It notes when a prompt would work on Gemini but fail on CoCounsel’s structured skill system. It frames feedback in terms of professional judgment, not just technical optimization. It could also easily be customized to track different learning objectives (ironically, I rewrote the coach’s own prompt many times).

3. QnA Markup Unpaid Wages Client Screener (Week 6)

The QnA Markup Editor at qnamarkup.org showing the complete Wage & Hour Claim Screener. The left panel displays the full QnA Markup source code with the decision tree logic visible: Q(status) checks W-2 status, Q(timely) checks the 2-year statute of limitations, Q(issue) branches into four claim types (minimum wage, overtime, final paycheck, tip theft), with sub-questions about hourly rate thresholds ($7.25, $10.99, $11.00) and salary thresholds ($684/week). GOTO:consult tags route qualifying claims to a consultation booking page; a Q(dol) endpoint directs users to the Department of Labor hotline (1-866-487-9243) and Ohio Legal Help. The right panel shows the interactive output in "Interactive" mode, displaying the first question as a blue speech bubble: "Were you a W-2 employee?" with "Yes" and "No (I was a contractor/1099)" answer buttons. Footer links include "credits | edit | code your own."

Try it on QnA Markup 

Inspired by: For this I just directly used David Colarusso’s QnA Markup, a brilliantly simple tool for building decision trees with plain text. Gabe Tenenbaum also very generously demoed QnA Markup in a prior version of my class, which planted the seed for continuing to build exercises around it.

The problem: I needed a low-tech entry point to teach decision trees and document assembly logic — something where the focus stays on legal reasoning, not the tool.

What I built: A client intake screener that triages potential wage-and-hour claims. Does the caller qualify? Should they book a consultation or contact the Department of Labor directly? Students see legal rules as logic: if/then branching based on employer size, hourly rate, and tipped status.

Why it matters: It forces students to confront the design choices embedded in any intake tool, such as what questions to ask, in what order, what to do with edge cases. It’s intentionally low-tech (just text in a browser) so nobody gets distracted by the interface.
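To see why "legal rules as logic" is the lesson, here is the same branching rendered as a plain Python sketch. The function and routing labels are hypothetical, the dollar figures are the ones visible in the screenshot above, and the classroom version lives entirely in QnA Markup:

```python
# A plain-Python sketch of the screener's branching (hypothetical helper;
# the real exercise is written in QnA Markup). Thresholds are the ones
# shown in the screenshot above.
OHIO_MIN_WAGE = 11.00        # hourly threshold used by the screener
FLSA_SALARY_FLOOR = 684.00   # weekly salary line for overtime exemption

def screen(w2_employee, years_since_violation, issue,
           hourly_rate=None, weekly_salary=None):
    """Route an intake to 'consult', 'dol' (refer out), or 'no claim'."""
    if not w2_employee:
        return "dol"            # 1099 contractors: refer to DOL resources
    if years_since_violation > 2:
        return "no claim"       # outside the 2-year statute of limitations
    if issue == "minimum wage":
        if hourly_rate is not None and hourly_rate < OHIO_MIN_WAGE:
            return "consult"
        return "no claim"
    if issue == "overtime":
        if weekly_salary is not None and weekly_salary >= FLSA_SALARY_FLOOR:
            return "dol"        # possibly exempt: needs more screening
        return "consult"
    if issue in ("final paycheck", "tip theft"):
        return "consult"
    return "dol"
```

Even this toy version surfaces the design choices the class is meant to confront: which questions come first, what counts as qualifying, and where the edge cases get routed.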

4. Decision Tree to QnA Markup Translator Gem (Week 6)

Gemini interface (dark mode) showing the QnA Markup Optimizer Gem in a conversation titled "QnA Markup Decision Tree Generation." The user's brief prompt says "put in a code block." The Gem responds with QnA Markup for an "Employment Intake & Consultation Screener" described as "Verifies W-2 status and routes to a consultation if ineligible." The code shows a linear intake flow: Q(employment_status): "Were you a W-2 Employee?" → Yes → Q(employer_name): "What was the name of your employer?" → X: (free text) → Q(job_title): "What was your job title at <x>employer_name</x>?" → X: → Q(years_worked): "How many years did you work there?" → X:number → Q(final_confirmation) partially visible. This demonstrates the Gem generating a simpler guided interview structure with variable interpolation.

Try it

The problem: Students understand decision tree logic but struggle with the syntax of turning it into working code.

What I built: A Gemini Gem that bridges the gap. Students describe their logic and it generates QnA Markup. It models the idea that AI assistants can serve as a bridge between domain knowledge and technical implementation.

Why it matters: It’s the same insight that powers the entire “building legal technology” unit (and extends from the prompting unit and into the agent unit): you don’t need to be a coder, you need to be able to describe what you want clearly enough for AI to build it.

5. Ohio Unpaid Wages Screener — The 3-Minute Version (Week 6, cliffhanger)

A vibe-coded web app styled as a Better Call Saul parody — "BETTER CALL OHIO" in large red and yellow text on a black header bar, with "WAGE & HOUR CLAIM SCREENER" as a subtitle. A yellow badge in the upper right reads "YOUR RIGHTS PROTECTED!" with a scales-of-justice icon. Below, on a cream/yellow background, a white card labeled "OFFICIAL SCREENER" (in a red diagonal badge) shows the "CASE INTAKE" heading with the first question in italic: "Were you a W-2 employee?" Two answer buttons styled in black borders read "YES →" and "NO (I WAS A CONTRACTOR/1099) →". At the bottom, a disclaimer reads "ATTORNEY ADVERTISING - RESULTS NOT GUARANTEED" with small dollar sign, clock, and calendar icons. The design deliberately mimics the aesthetic of a late-night TV legal ad.

Try it

The problem: I needed a dramatic way to introduce vibe coding.

What I built: A React app in Gemini Canvas, built on the QnA Markup decision tree we just made. It took roughly 1 minute to ctrl-C/ctrl-V the QnA Markup and generate the app.

Why it matters: I revealed it side-by-side at the end of class as a cliffhanger: “Same basic functionality, but it made a website out of it.” We then discussed, “So why wouldn’t you always vibe-code?” Students surfaced the hard questions themselves: Is the code correct on the law? Is it deterministic (will it always come out the same way)? Who hosts it? Would it be as good if I asked it to generate directly from the law, rather than creating the decision tree ahead of time? Why did it add “OFFICIAL”?¹ It set up the entire vibe-coding class the next day perfectly.

¹ (Eagle-eyed readers will notice that I was too much of a coward to share a direct link to the screenshotted version, and if you visit the link you’ll instead see prominent “parody” stamps).

6. Citation Extractor Gem (Week 7)

Gemini interface (dark mode) showing the Citation Extractor custom Gem in action. The user's prompt reads "Extract citations from the attached PDF." The Gem's response begins with an analysis note explaining it extracted all legal case citations from the document, including those identified as "bogus" or "phony" AI-generated research by the Special Master. Below is a structured table with columns: Bluebook Citation, Core Citation, Court, Year, and Pinpoint Pages. Visible entries include Aetna Cas. & Surety Co. v. Superior Court (Cal. Ct. App., 1984), Arrowhead Capital Finance v. Picturepro (9th Cir., 2023), Boone v. Vanliner Ins. Co. (Ohio, 2001), Booth v. Allstate Ins. Co. marked as "(Flagged)" (Cal. Ct. App., 1989), and Braun ex rel Advanced Battery Techs. v. Zhiguo Fu (S.D.N.Y., 2015). The Gemini model selector shows "Thinking" mode.

Try it

The problem: I needed a way for students to quickly pull all case citations out of a brief to feed into verification workflows.

What I built: A Gemini Gem that takes an uploaded brief and returns a structured table of all case citations, ready for Get & Print on Westlaw and Lexis.

Why it matters: It serves double duty. Practically, it supports the hallucination game (below). Pedagogically, it’s a concrete example of a custom AI assistant built for a specific legal task, connecting back to the Gems work from Week 6 and forward to agentic AI in Week 9. Students see that I practice what I teach: when you have a repetitive legal task, you build a tool for it.
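For contrast, here is what a naive, non-AI first pass at the same task looks like. The regex is purely illustrative, not what the Gem does; real Bluebook citations (short forms, id. cites, parallel citations) defeat simple patterns, which is part of why an LLM-backed tool earns its keep:

```python
import re

# Hypothetical illustration only: a crude regex pass at pulling full case
# citations from a brief. Real Bluebook citation formats defeat simple
# patterns -- that gap is what the Citation Extractor Gem fills.
CITE = re.compile(
    r"([A-Z][A-Za-z.'&\- ]+ v\. [A-Z][A-Za-z.'&\- ]+),\s*"  # case name
    r"(\d+ [A-Z][A-Za-z.0-9 ]*? \d+)\s*"                    # vol./reporter/page
    r"\(([^)]*\d{4})\)"                                     # court and year
)

brief = (
    "See Ashcroft v. Iqbal, 556 U.S. 662 (2009); see also "
    "Boone v. Vanliner Ins. Co., 744 N.E.2d 154 (Ohio 2001)."
)
for name, cite, court_year in CITE.findall(brief):
    print(f"{name.strip():<35} | {cite:<18} | {court_year}")
```

Note that the first match greedily swallows the "See" signal along with the case name; exactly the kind of edge case that makes the structured-table Gem the better tool for feeding Get & Print.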

7. Citation Hallucination Game (Week 7)

The Citation Hallucination Game web app during the verification (Solo Practice) phase. Top navigation bar shows "Citation Game | Solo Practice | Reviewed: 0/23 | Flagged: 0" with "Finish & See Results" and "Export PDF" buttons. The main panel displays a legal brief — a motion to dismiss in a bad faith insurance case, citing Patterson v. State Farm Mut. Auto Ins. Co. and Ashcroft v. Iqbal. A sidebar panel labeled "REVIEW CITATION" highlights the current citation (Patterson v. State Farm, 2019 U.S. Dist. LEXIS 31742) in yellow, with two buttons: "Looks Legit" and "Flag as Fake," plus Previous/Next navigation. The interface lets students step through each citation in the brief and decide whether it's real or fabricated.

Try it | GitHub

Inspired by: David Colarusso again, specifically his automation bias exercise, which flips the typical classroom dynamic by making students experience bias rather than just learn about it. That “make them do the thing, not just hear about the thing” approach is exactly what I was looking for. And his hallucination checking frame was pretty handy too because I also wanted to explicitly teach a process for that.

The problem: Students (like many of us) think hallucinations will not happen to them, because they will always read the cases. They may also see hallucinations as mainly a made-up cases problem, and not realize that hallucinations can come in many flavors, some harder to detect than others.

What I built: A competitive team exercise. Students first create hallucinated citations (fabricated cases, swapped numbers, mischaracterized holdings, altered quotes), forcing them to internalize the different types of hallucinations and when each is likely to arise. Then they try to catch another team’s fakes under time pressure, mirroring the real conditions lawyers may face when reviewing AI-generated work on deadline.

Why it matters: The key lesson isn’t “can you catch every error” but “given limited time, which errors do you prioritize?” Students independently discovered that no single verification tool is sufficient, that the easy hallucinations (fabricated cases) are solved relatively quickly, but that the dangerous ones (subtle mischaracterizations) slip through.
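The game has students write their fakes by hand, but it is worth seeing how mechanical the dangerous ones are. A hypothetical sketch of a "swapped number" hallucination, one of the flavors teams plant:

```python
import random
import re

# Hypothetical sketch: a "swapped number" hallucination is mechanical to
# produce. Bump the reporter volume and everything else in the citation
# still looks perfectly legitimate to a reader skimming on deadline.
def swap_volume(citation, rng=random.Random(0)):
    """Alter the first number (the reporter volume) in a citation."""
    bump = lambda m: str(int(m.group()) + rng.randint(1, 9))
    return re.sub(r"\d+", bump, citation, count=1)

print(swap_volume("556 U.S. 662 (2009)"))
```

The output is a near-miss citation that passes a glance test and fails a Get & Print, which is exactly the verification habit the game is built to instill.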

8. Document Technology Gallery (Week 8)

Landing page of the Document Technology Gallery web app. Header reads "Document Technology Gallery" with subtitle "Beyond Word and Acrobat: The tools lawyers actually use." Introductory text explains that the gallery covers seven key document technology patterns. Three demo cards are visible, organized by category: under "CREATE," cards for Document Automation (~90 sec, "Watch a contract write itself as you answer questions") and Clause Library (~60 sec, "Assemble a contract from pre-approved building blocks"); under "EDIT," a card for AI-Assisted Document Editing (~90 sec, "Edit a legal document three ways: AI, rules, and consistency checking"). Two more cards are partially visible under a "REVIEW" heading. Each card has a "Try it →" link. Clean white background with subtle card borders.

Try it | GitHub

Inspired by: Barbora Obracajova’s Legal Tech Gallery, which she vibe-coded for her “Modern Lawyers” course and shared on LinkedIn. She created a series of quick interactive demos, 60 seconds each, where students touch the technology instead of watching slides. I had a class quickly approaching on document competencies — a topic that I have always struggled to teach given how difficult it is to get Word add-ons approved in my institution. So the moment I saw her post I knew this would help me.

The problem: Students need hands-on exposure to core document technology competencies (I chose automation, clause libraries, editing, brief verification, metadata cleaning, redaction, and contract review) but there’s no time to go deep on all of them in one class.

What I built: Seven quick interactive demos, about 60 seconds each. Students touch each technology rather than just hear about it. The gallery format builds a shared baseline before diving into further exploration and case studies.

Why it matters: It surfaces the “I didn’t know that existed” moments.

What I Learned from this Process

  • I can’t stop because the feedback loop is immediate. I see a gap in my teaching, I build something that night, I use it in class the next day. That’s never been possible for me before. I’ve been writing little scripts for years, but AI-assisted development took me from utility scripts to deployed interactive applications. The jump from “I can automate this for myself” to “I can build this for my students” is huge.
  • Vibe-coding replaced the paper handout. In other classes, when I want a student to work through a problem, my first instinct is a handout, whether a worksheet, fact pattern, checklist, or problem set. In this class, my first instinct became “what if I built something they could interact with?” When the subject is technology, it made sense to me that the medium should be too.
  • Creating interactive apps makes the class more hands-on. Every one of these exists because I couldn’t find a way to give students a particular experience otherwise. You can lecture about how LLMs predict tokens. Or you can let students drag a temperature slider and watch the probability distribution change. You can tell students hallucinations are dangerous. Or you can have them create hallucinations and then fail to catch someone else’s.
  • Building is teaching. When I vibe-code a tool in front of students, or reveal that I built something in 3 minutes that took an hour the traditional way, I’m modeling some of the ways they can use technology. The message isn’t always “look what I made,” but also “you could make this too, and you should, because the people who understand the problem best should be the ones building the solution.”

Eight tools. Week 8 of a 13-week semester. And I have no doubt there are more to come.

If you’re teaching legal technology and you’ve been thinking “I wish I had something that did X” — you probably have enough to build it. Pick the smallest version of the idea, open a vibe-coding tool, and see what happens. And then share it!

What the Science Says About Hallucinations in Legal Research

This is Part 1 of a three-part series on AI hallucinations in legal research. Part 2 will examine hallucination detection tools, and Part 3 will provide a practical verification framework for lawyers.

You’ve heard about the lawyers who cited fake cases generated by ChatGPT. These stories have made headlines repeatedly, and we are now approaching 1,000 documented incidents in which practitioners or self-represented litigants submitted AI-generated hallucinations to courts. But those viral incidents tell us little about why this is happening and how we can prevent it. For that, we can turn to science. Over the past three years, researchers have published dozens of studies examining exactly when and why AI fails at legal tasks—and the patterns are becoming clearer.

A critical caveat: The technology evolves faster than the research. A 2024 study tested 2023 technology; a 2025 study tested 2024 models. By the time you read this, the specific tools and versions have changed again. That’s why this post focuses on patterns that persist across studies rather than exact percentages that will be outdated in months.

Here are the six patterns that matter most for practice.

Pattern #1: Models and Data Access

Not all AI tools are created equal. The research shows a dramatic performance gap based on how the tool is built, though it’s important to understand that both architecture and model generation matter.

Bar chart showing hallucination rates for general-purpose language models on legal queries. Llama 2 hallucinated 88% of the time, GPT-3.5 hallucinated 69%, and GPT-4 hallucinated 58%, demonstrating that newer models perform better but still hallucinate on more than half of legal questions.
Dahl et al., “Large Legal Fictions,” Fig. 1. The figure shows hallucination rates declining with newer, more capable models.

Models are improving over time. A comprehensive 2024 study by Stanford researchers titled “Large Legal Fictions” tested 2023 general-purpose models on over 800,000 verifiable legal questions and found hallucination rates between 58% and 88%. Within that cohort, newer models performed better: GPT-4 hallucinated 58% of the time compared to GPT-3.5 at 69% and Llama 2 at 88%. This pattern of improvement with each model generation appears fairly consistent across AI development.

Chart comparing hallucination rates across legal AI tools and GPT-4. Lexis+ AI had a 17% hallucination rate, Westlaw AI-Assisted Research had 33%, and GPT-4 had 43%, showing that legal-specific tools with retrieval-augmented generation substantially outperform general-purpose models.
Magesh et al., “Hallucination-Free?”, Fig. 1. The study found higher hallucination rates for the general-purpose GPT-4 than for the specialized legal research products.

Architecture matters, but it’s not the whole story. A second Stanford study, titled “Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools”, published in 2025 but testing tools from May 2024, found hallucination rates of 17% for Lexis+ AI, 33% for Westlaw AI-Assisted Research, and 43% for GPT-4. These errors include both outright fabrications (fake cases) and more subtle problems like mischaracterizing real cases or citing inapplicable authority. This head-to-head comparison shows legal-specific tools with retrieval-augmented generation (RAG) substantially outperforming general LLMs.

A randomized controlled trial by Schwarcz et al. reinforces the architecture point from a different angle. When 127 law students used a RAG-based legal tool (Vincent AI) to complete legal tasks, they produced roughly the same hallucination rate as students using no AI at all. Students using a reasoning model without RAG (OpenAI’s o1-preview) produced better analytical work but introduced hallucinations. Both tools dramatically improved productivity—but only the RAG tool did so without increasing error rates. However, the Vals AI Legal Research Report (October 2025, testing July 2025 tools) found ChatGPT matching the legal AI tools: it achieved 80% accuracy while the legal tools scored 78-81%. The key difference? In the Vals study, ChatGPT used web search by default (a form of RAG), giving it access to current information and non-standard sources, while the legal tools restrict themselves to proprietary databases for citation reliability. For five question types, ChatGPT actually outperformed the legal AI products on average. Both outperformed the human lawyer baseline of 69%.

Takeaway: Purpose-built legal tools generally excel at citation reliability and authoritative sourcing, but general AI with web search can compete on certain tasks. The real advantage isn’t RAG architecture alone—it’s access to curated, verified legal databases with citators. Know your tool’s strengths: legal platforms for citations and treatment analysis, general AI with web search for non-standard or very recent sources.

Pattern #2: Sycophancy

One of the most dangerous hallucination patterns is that AI agrees with you even when you’re wrong.

The Stanford “Hallucination-Free?” study identified “sycophancy” as one of four major error types. When users ask AI to support an incorrect legal proposition, the AI often generates plausible-sounding arguments using fabricated or mischaracterized authorities rather than correcting the user’s mistaken premise.

Similarly, a 2025 study on evaluating AI in legal operations found that hallucinations multiply when users include false premises in their prompts. Anna Guo’s information extraction research from the same year showed that when presented with leading questions containing false premises, most tools reinforced the error. Only specialized tools correctly identified the absence of the obligations the user incorrectly assumed existed.

This happens because of how large language models work: they’re trained to generate helpful, plausible text in response to user queries, not to verify the truth of the user’s assumptions.

Takeaway: Never ask AI to argue a legal position you haven’t independently verified. Phrase queries neutrally. If you ask “Find me cases supporting [incorrect proposition],” AI may happily fabricate them.

Pattern #3: Jurisdictional and Geographic Complexity

AI performance degrades sharply when dealing with less common jurisdictions, local laws, and lower courts.

Table showing AI hallucination rates varying by geographic location. For the same legal scenarios, hallucination rates were 45% for Los Angeles, 55% for London, and 61% for Sydney.
Curran et al., “Place Matters,” Fig. 1. Hallucination rates by jurisdiction.

Researchers in a study called “Place Matters” (2025) tested the same legal scenarios across different geographic locations and found hallucination rates varied dramatically: Los Angeles (45%), London (55%), and Sydney (61%). For specific local laws, like an Australian state’s Residential Tenancies Act, hallucination rates reached 100%.

The Vals report found a 14-point accuracy drop when tools were asked to handle multi-jurisdictional 50-state surveys. The Large Legal Fictions study confirmed that models hallucinate least on Supreme Court cases and most on district court metadata.

Why? Training data is heavily weighted toward high-profile federal cases and major jurisdictions. State trial court opinions from smaller jurisdictions are underrepresented or absent entirely.

Takeaway: Apply extra scrutiny when researching state or local law, lower court cases, or multi-jurisdictional questions. These are exactly the scenarios where training data or search results may be thinner, causing hallucinations to spike.

Pattern #4: Knowledge Cutoffs

AI tools trained on historical data will apply outdated law unless they actively search for current information.

The “AI Gets Its First Law School A+s” study (2025) provides a striking example: OpenAI’s o3 model applied the Chevron doctrine in an Administrative Law exam, even though Chevron had been overruled by Loper Bright. The model’s knowledge cutoff was May 2024, and Loper Bright was decided in June 2024.

This temporal hallucination problem will always exist unless the tool has web search enabled or actively retrieves from an updated legal database. Not all legal AI tools have this capability, and even those that do may not use it for every query.

Takeaway: Verify that recent legal developments are reflected in AI responses. Ask vendors whether their tool uses web search or real-time database access. Be especially careful when researching areas of law that have recently changed or may be affected by material outside the AI tool’s knowledge base.

Pattern #5: Task Complexity

AI performance correlates directly with task complexity, and the drop-off can be severe.

Simple factual recall—like finding a case citation or identifying the year of a decision—works relatively well. But complex tasks involving synthesis, multi-step reasoning, or integration of information from multiple sources show much worse performance.

The Vals report documented a 14-point accuracy drop when moving from basic tasks to complex multi-jurisdictional surveys. A 2025 study on multi-turn legal conversations (LexRAG) found that RAG systems struggled badly with conversational context, achieving best-case recall rates of only 33%.

Multiple studies note that statute and regulation interpretation is particularly weak. Anna Guo’s information extraction research found that when information is missing from a document (like redacted liability caps), AI fabricates answers rather than admitting it doesn’t know.

Takeaway: Match the task to the tool’s capability. High-stakes work, complex multi-jurisdictional research, and novel legal questions require more intensive verification. Don’t assume that because AI handles simple queries well, it will handle complex ones equally well.

Pattern #6: The Confidence Paradox

Perhaps the most insidious finding: AI sounds equally confident whether it’s right or wrong.

The “Large Legal Fictions” study found no correlation between a model’s expressed confidence and its actual accuracy. An AI might present a completely fabricated case citation with the same authoritative tone it uses for a correct one.

This isn’t a bug in specific products—it’s fundamental to how large language models work. They generate statistically probable text that sounds human-like and professional, regardless of underlying accuracy. In fact, recent research suggests the problem may worsen with post-training: while base models tend to be well-calibrated, reinforcement learning from human feedback often makes models more overconfident because they’re optimized for benchmarks that reward definitive answers over honest expressions of uncertainty.

Even the best-performing legal AI tools in the Vals report achieved only 78-81% accuracy. That means roughly one in five responses contains errors, even from top-tier specialized legal tools.

Takeaway: Never trust AI based on how confident it sounds. The authoritative tone is not a reliability signal. Verification is non-negotiable, no matter which tool you use. Be especially wary of newer models that may sound more confident while not necessarily being more accurate.

What This Means for Practice

Specific hallucination percentages will change as technology improves, but these six patterns appear to persist across different models, products, and study methodologies. Understanding them should inform three key decisions:

1. Tool Selection
Understand your tool’s strengths. Legal-specific platforms excel at citation reliability because they search curated, verified databases with citators. General AI with web search can compete on breadth and recency but lacks those verification layers. Within any tool, look for features like the ability to refuse to answer when uncertain (some tools are now being designed to decline rather than hallucinate when data is insufficient—a positive development worth watching for).

2. Query Strategy
Avoid false premises and leading questions. Phrase queries neutrally. Recognize high-risk scenarios: multi-jurisdictional questions, local or state law, lower court cases, recently changed legal doctrines, and complex synthesis tasks.

3. Verification Intensity
Scale your verification efforts to task complexity and risk factors. A simple citation check might need less verification than a complex multi-state legal analysis. But all AI output needs some verification—the question is how much.

Bottom Line

The research is clear: AI hallucinations in legal work are real, measurable, and follow predictable patterns. These studies have found that even the best legal AI tools hallucinate somewhere between 15% and 25% of the time (including both fabrications and mischaracterizations) based on current data.

But understanding these six patterns—models and data access, sycophancy, jurisdictional complexity, knowledge cutoffs, task complexity, and the confidence paradox—helps you make better decisions about which tools to use, which queries to avoid, and how intensively to verify results.

The goal isn’t to avoid AI. These tools can dramatically increase efficiency when used appropriately. The goal is to use them wisely, with eyes wide open about their limitations and failure modes.

Coming next in this series: How hallucination detection tools work and whether they’re worth using, and a practical framework for verifying AI research results.


References

Andrew Blair-Stanek et al., AI Gets Its First Law School A+s (2025).
Link: https://ssrn.com/abstract=5274547
Products tested: OpenAI o3, GPT-4, GPT-3.5
Testing period: Late 2024

Damian Curran et al., Place Matters: Comparing LLM Hallucination Rates for Place-Based Legal Queries, AI4A2J-ICAIL25 (2025).
Link: https://arxiv.org/abs/2511.06700
Products tested: GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro
Testing period: 2024

Matthew Dahl et al., Large Legal Fictions: Profiling Legal Hallucinations in Large Language Models, 16 J. Legal Analysis 64 (2024).
Link: https://doi.org/10.1093/jla/laae001
Products tested: GPT-4, GPT-3.5, PaLM 2, Llama 2
Testing period: 2023

Anna Guo & Arthur Souza Rodrigues, Putting AI to the Test in Real-World Legal Work: An AI evaluation report for in-house counsel (2025).
Link: https://www.legalbenchmarks.ai/research/phase-1-research
Products tested: GC AI, Vecflow’s Oliver, Google NotebookLM, Microsoft Copilot, DeepSeek-V3, ChatGPT (GPT-4o)
Testing period: 2024

Haitao Li et al., LexRAG: Benchmarking Retrieval-Augmented Generation in Multi-Turn Legal Consultation Conversation, ACM Conf. (2025).
Link: https://github.com/CSHaitao/LexRAG
Products tested: GLM-4, GPT-3.5-turbo, GPT-4o-mini, Qwen-2.5, Llama-3.3, Claude-3.5
Testing period: 2024

Varun Magesh et al., Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools, 22 J. Empirical Legal Stud. 216 (2025).
Link: http://arxiv.org/abs/2405.20362
Products tested: Lexis+ AI, Thomson Reuters Ask Practical Law AI, Westlaw AI-Assisted Research (AI-AR), GPT-4
Testing period: May 2024

Bakht Munir et al., Evaluating AI in Legal Operations: A Comparative Analysis of Accuracy, Completeness, and Hallucinations, 53.2 Int’l J. Legal Info. 103 (2025).
Link: https://doi.org/10.1017/jli.2025.3
Products tested: ChatGPT-4, Copilot, DeepSeek, Lexis+ AI, Llama 3
Testing period: 2024

Daniel Schwarcz et al., AI-Powered Lawyering: AI Reasoning Models, Retrieval Augmented Generation, and the Future of Legal Practice (Mar. 2025).
Link: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5162111
Products tested: vLex (Vincent AI), OpenAI (o1-preview)
Testing period: Late 2024
Note: Randomized controlled trial with 127 law students using AI tools

Vals AI, Vals Legal AI Report (Oct. 2025).
Link: https://www.vals.ai/vlair
Products tested: Alexi, Midpage, Counsel Stack, OpenAI ChatGPT
Testing period: First three weeks of July 2025

Future of Law Libraries Initiative

The impact of AI on varied aspects of our professional lives is covered regularly on this blog. It is reshaping legal research, education, and legal practice in ways that threaten to leave us behind if we fail to be proactive. That is why the Future of Law Libraries Initiative gathered professionals from academic, court, firm, and government libraries and allied professions through six regional roundtables to identify what steps we need to take now to ensure an impactful, empowered, ethical future.

The message from these roundtables was clear: legal information professionals must take coordinated action on AI policy, training, and infrastructure. To accomplish this, three main recommendations came out of those discussions.

  1. Create a Centralized AI Organization

Law library leaders agreed on the need for a shared, profession-wide structure to:

  • Connect experts and facilitate collaboration.
  • Set shared priorities for AI standards, ethics, and vendor engagement.
  • Advocate for legal information professionals in AI discourse.

This organization could take the form of a new consortium or be embedded within an existing network, but its purpose would remain the same: to ensure law libraries have a unified voice and strong presence in AI governance.

  2. Develop Tiered AI Training for Legal Information Professionals

Ad hoc workshops and webinars are no longer enough. To remain relevant, the profession needs robust, role-based training that builds AI competencies at multiple levels—from awareness to leadership. Training should be hands-on, case-based, and designed to produce practical work products.

A train-the-trainer model could help scale capacity, ensuring that AI knowledge reaches across all library types and staff levels while building long-term expertise.

  3. Establish a Centralized AI Knowledge Hub

To avoid fragmentation and duplication of effort, roundtable participants recommended creating an open, curated repository governed by legal information professionals. This hub would serve as a durable home for:

  • Policies and standards
  • Teaching resources and curricula
  • Evaluation protocols and case studies
  • Model contracts and datasets

By sharing resources openly, the hub would accelerate adoption of best practices and ensure equitable access across institutions of all sizes.

Dig Deeper — Read the White Paper

This initiative produced a white paper that digs deeper into these recommendations, including practical next steps and insights from the roundtable conversations. It’s a valuable resource for anyone thinking about the future of law libraries and AI.

Get Involved

We are forming working groups to move these recommendations forward. 

  • Steering Committee – Guides the overall vision.
  • Consortium Charter Group – Shapes governance and structure.
  • Training Development Group – Builds core AI competencies and pilot programs.
  • Knowledge Hub Group – Designs the hub and its policies.

More detailed descriptions of the charges, scope of work, and time commitments are outlined in the report. Volunteers should be prepared to commit a year to this first phase.

Volunteer Today

Benchmarking a Moving Target, or let’s run a hypo through 7 AIs and see what happens

Debbie Ginsberg, Guest Blogger

Benchmarking should be simple, right? Come up with a set of criteria, run some tests, and compare the answers. But how do you benchmark a moving target like generative AI?

Over the past months, I’ve tested a sample legal question in various commercial LLMs (like ChatGPT and Google Gemini) and RAGs (like Lexis Protégé and Westlaw CoCounsel) to compare how each handled the issues raised. Almost every time I created a sample set of model answers to write about, the technology would change drastically within a few days. My set became outdated before I could start my analysis. While this became a good reason to procrastinate, I still wanted to show something for my work.

As we tell our 1Ls, sometimes you need to work with what you have and just write.

The model question

In May, I asked several LLMs and RAGs this question (see the list below for which ones I tested):

Under current U.S. copyright law (caselaw, statutes, regulations, agency information), to what extent are fonts and typefaces protectable as intellectual property? Please focus on the distinction between protection for font software versus typeface designs. What are the key limitations on such protection as established by statute and case law? Specifically, if a font has been created by proprietary software, or if a font has been hand-designed to include artistic elements (e.g., “A” incorporates a detailed drawing of an apple into its design), is the font entitled to copyright protection?

I chose this question because the answer isn’t facially obvious – it straddles the line between “typeface isn’t copyrightable” and “art and software are copyrightable”.  To answer the question effectively, the models would need to address that nuance in some form.

The model benchmarks

The next issue was how to compare the models. In my first runs, the answers varied wildly, making them hard to compare. Lately, the answers have been more similar, so I was able to develop a set of criteria for comparison. For the May set, I benchmarked (or at least checked):

  • Did the AI answer the question that I asked?
  • Was the answer thorough (did it more or less match my model answer)?
  • Did the AI cite the most important cases and sources noted in my model answer?
  • Were any additional citations the AI included at least facially relevant?
  • Did the model refrain from providing irrelevant or false information?

I did not benchmark:

  • Speed (we already know the reasoning models can be slow)
  • If the citations were wrong in a non-obvious way 
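For anyone who wants to keep a running score across repeated runs, yes/no criteria like these lend themselves to a simple tally. A minimal sketch in Python — the model names and results below are placeholders for illustration, not my actual May data:

```python
# Minimal sketch: tallying yes/no benchmark criteria per model.
# Model names and True/False results are placeholders, not real data.

CRITERIA = [
    "answered the question asked",
    "matched the model answer",
    "cited the key cases and sources",
    "extra citations facially relevant",
    "no irrelevant or false information",
]

# One boolean per criterion, in the order listed above.
results = {
    "Model A": [True, True, True, False, True],
    "Model B": [True, False, True, True, True],
}

def score(answers: list[bool]) -> str:
    """Summarize one model's run as 'met N of M criteria'."""
    return f"met {sum(answers)} of {len(answers)} criteria"

for model, answers in results.items():
    print(f"{model}: {score(answers)}")
```

A spreadsheet does the same job, of course, but a script makes it painless to re-run the tally each time the technology (and therefore the answer set) changes.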

The model answer and sources

According to my model answer, the best answers to the question should include at least the following:

  • Font software: Font software that creates fonts is protected by copyright.  The main exception is software that essentially executes a font or font file, meaning the software is utilitarian rather than creative.
  • Typefaces/Fonts: Neither of these is protected by copyright law.  Fonts and typefaces may have artistic elements that are protected by copyright law, but only the artistic elements are protected, not the typefaces or fonts themselves.
  • The answer should include at least some discussion as to whether a heavily artistic font qualifies for protection.

Bonus if the answer addressed:

  • Separability: If the art can be separated from the typeface/font, it’s copyrightable.
  • Alternatives: Can the font/typeface be protected by other IP protections such as licensing, patents, or trademarks?
  • International implications: Would we expect to see the same results in other jurisdictions?

In answering this question, I expected the LLMs and RAGs to cite:

Benchmarking with the AI models

For this post, I ran my model in the following LLMs/RAGs:

  • Lexis Protégé (work account)
  • Westlaw CoCounsel (work account)
  • ChatGPT o3 deep research (work account)
  • Gemini 2.5 deep research (personal paid account)
  • Perplexity research (personal paid account)
  • DeepSeek R1 (personal free account)
  • Claude 3.7 (personal paid account)

I’ve set up accounts in several commercial GenAI products. Some are free, some are Pro, and Harvard pays for my ChatGPT Enterprise account. As an academic librarian, I have access to CoCounsel and Protégé.

The individual responses are included in the appendix.

I didn’t have access to Vincent or Paxton at the time, or to ChatGPT o3 Pro. Later in June, Nick Halperin ran my model in Vincent and Paxton, and I ran the model in o3 Pro. Those examples, as well as GPT-5, will be included in the appendix but are not discussed here.

Benchmarking the results

In parsing the results, most answers were fairly similar with some exceptions:

Source | Font software copyrightable | Typefaces/fonts not copyrightable | Exceptions to font-software copyright | Art in typefaces/fonts copyrightable
Lexis Protégé | Yes | Yes | Yes | No
Westlaw CoCounsel | Yes | Yes | No | Yes
ChatGPT o3 deep research | Yes | Yes | Yes | Yes
Gemini 2.5 deep research | Yes | Yes | Yes | Yes
Perplexity research | Yes | Yes | Yes | Yes
DeepSeek R1 | Yes | Yes | Yes | Yes
Claude 3.7 | Yes | Yes | Yes | Yes

  • Font software is copyrightable: in all answers 
  • Typefaces/fonts are not copyrightable: in all answers
  • Exceptions to font software copyright: in all answers except Westlaw
  • Art in typefaces/fonts is copyrightable: in all answers except Lexis

Several answers included additional helpful information:

Source | Separability | Copyright Office policies | Alternatives | Licensing | Int’l | Recent | State law
Lexis Protégé | Yes | No | No | No | No | No | No
Westlaw CoCounsel | No | No | No | No | No | No | Yes
ChatGPT o3 deep research | Yes | Yes | Yes | Yes | Yes | Yes | No
Gemini 2.5 deep research | Yes | Yes | Yes | Yes | No | No | No
Perplexity research | Yes | No | Yes | No | No | No | No
DeepSeek R1 | Yes | No | Yes | No | No | No | No
Claude 3.7 | No | No | Yes | Yes | Yes | No | No

  • Discussions about separability: Gemini, ChatGPT, Deep Seek (to some extent), Perplexity, Lexis
  • Specific discussions about Copyright Office policies: Gemini, ChatGPT
  • Discussions about alternatives to copyright (e.g., patent, trademark): Gemini, Claude, ChatGPT, Deep Seek, Perplexity
  • Specific discussions about licensing: Gemini, Claude, ChatGPT
  • International considerations: Claude, ChatGPT
  • Recent developments: ChatGPT
  • State law: Westlaw

The models were somewhat consistent about what they cited:

LLM/RAG | Copyright statute | Copyright regs | Adobe | Laatz | Shake Shack | The Copyright Compendium
Lexis Protégé | Yes | Yes | Yes | Yes | No | No
Westlaw CoCounsel | Yes | Yes | Yes | Yes | Yes | No
ChatGPT o3 deep research | Yes | Yes | Yes | No | No | Yes
Gemini 2.5 deep research | Yes | Yes | Yes | Yes | No | Yes
Perplexity research | No | Yes | No | No | No | Yes
DeepSeek R1 | Yes | Yes | Yes | No | No | No
Claude 3.7 | No | Yes | Yes | No | No | No

  • The Copyright statute: Lexis, Westlaw, Deep Seek, Chat GPT, Gemini
  • Copyright regs: cited by all
  • Adobe: Lexis, Westlaw, Claude, Deep Seek, Chat GPT, Gemini
  • Laatz: Lexis, Westlaw, Gemini
  • Shake Shack: Westlaw
  • The Copyright Compendium: Perplexity, Chat GPT, Gemini; Lexis cited to Nimmer for the same discussion

The models also included additional resources not on my list:

LLM/RAG | Blogs etc. | Restatement | Eltra | Law review | Articles about loans | LibGuides
Lexis Protégé | Yes | Yes | Yes | No | No | No
Westlaw CoCounsel | Yes | No | No | Yes | Yes | No
ChatGPT o3 deep research | Yes | No | Yes | No | No | No
Gemini 2.5 deep research | Yes | No | Yes | No | No | Yes
Perplexity research | No | No | No | No | No | No
DeepSeek R1 | No | No | Yes | No | No | No
Claude 3.7 | Yes | No | Yes | No | No | No

  • Blogs, websites, news articles: The commercial LLMs.  Gemini found the most, but it’s Google.
  • Restatement: Lexis
  • Eltra Corp. v. Ringer, 1976 U.S. Dist. LEXIS 12611: Lexis, Claude, Deep Seek, Chat GPT, Gemini (it’s not a bad case, but not my favorite for this problem)
  • An actual law review article: Westlaw
  • Higher interest rate consumer loans may snag lenders: Westlaw (not sure why)
  • LibGuides: Gemini
  • Included a handy table: ChatGPT, Gemini

The answers varied in depth of discussion and number of sources:

  • Lexis: 1 page of text, 1 page of sources (I didn’t count the sources in the tabs)
  • Westlaw: 2.5 pages of formatted text, 17 pages of sources
  • ChatGPT: 8 pages of well-formatted text, 1 page of sources
  • Gemini: 6.5 pages of well-formatted text, 1 page of sources
  • Perplexity: A little more than 4 pages of text, about 1 page of sources
  • Deep Seek: a little more than 2 pages of weirdly formatted text, no separate sources
  • Claude: 2.5 pages of well-formatted text, no separate sources

Hallucinations

  • I didn’t find any sources that were completely made up
  • I didn’t find any obvious errors in the written text, though some sources made more sense than others
  • I did not thoroughly examine every source in every list (that would require more time than I’ve already devoted to this blog post). 

Some random concluding thoughts about benchmarking

When I was running these searches, I was sometimes frustrated with the Westlaw and Lexis AI research tools. Not only do they fail to describe exactly what they are searching, they also don’t necessarily capture critical primary sources in their answers (we can get a general idea of the sources used, but not as granular as I’d like). For example, the Copyright Compendium includes one of the more relevant discussions about artistic elements in fonts and typefaces, but that discussion isn’t captured in the RAGs.  To be sure, Lexis did find a similar discussion in Nimmer; Westlaw didn’t find anything comparable, although it did cite secondary sources.

In general, the responses provided by all of the generative AI platforms were correct, but some were more complete than others.  For the most part, the commercial reasoning models (particularly ChatGPT and Gemini) provided more detailed and structured answers than the others.  They also provided responses using formatting designed to make the answers easy to read (Westlaw did as well).

None of the models appeared to consider that recency would be a significant factor in this problem.  Several cited a case from the 70s that didn’t concern fonts.  Several failed to cite Laatz, a recent case that’s on point.  Lexis and Westlaw, of course, cited to authoritative secondary sources (and even a law review article in Westlaw’s case).  The LLMs were less concerned with citing to authority.  In all cases, I would have preferred a more curated set of resources than the platforms provided. 

Finally, none of the platforms included visual elements in what is inherently a visual question. It would have been nice to see some examples of “this is probably copyrightable and this is not” (not that I directly asked for them). 

Coming Soon: The Interactive GenAI Legal Hallucination Tracker — Sneak Peek Today!

If you follow me on LinkedIn or spoke with me at AALL, you’ve probably seen me teasing this project like it’s the season finale of a legal tech drama. Well, the wait is (almost) over — here’s your official sneak peek at our forthcoming interactive GenAI Legal Hallucination Tracker.


The People Behind the Tracker

First, credit where credit is due: fellow law librarian Mary Matuszak, the ultimate sleuth of AI blunders. I’ve sent many curious folks her way on LinkedIn, where she’s been posting hallucinations far more regularly than anyone else. By mid-July, when she sent me this spreadsheet, she’d logged 485 entries — and yes, the number has since blown past 500. She’s basically the Nellie Bly of questionable legal citations.

Next up, my research assistant, Nick Sanctis — the wizard making the interactive tracker happen and gently forcing me to learn just enough R to be dangerous. If there’s a delay, blame my attempts to juggle teaching, running a library, staying current with AI developments, and decoding the mysteries of R this fall.

As for me? I’m the publisher, the cheerleader, and the student in this equation.

The Plan

Today we’re releasing the basic tracker data in a sortable, searchable table format. In the coming weeks, we’ll roll out the more robust interactive version, followed by new features for viewing, filtering, and analyzing the data — each announced in its own post.

But wait! There’s more! We want you to be part of it! Soon, we’ll be recruiting volunteers to:

  1. Help us find and add more hallucination cases (submission method coming soon)
  2. Analyze the data and share insights with the legal community

If you use the tracker, please cite or link to it in your work. Proper attribution keeps this project alive and growing.

The Data

Announcing the AI Law Librarians Prompt Library

We’re excited to announce a new resource for our community: the AI Law Librarians Prompt Library, a place for law librarians (and the legal community at large) to share and collect useful prompts.

Explore the Prompt Library

Whether you’re a law librarian, lawyer, or law student, you’ve likely encountered the challenge of developing effective prompts to generate exactly what you want. This blog has even covered the topic several times. Getting it right can be tricky and, when you do, you want to be sure to remember it for next time (and share it with your friends). That’s where this library comes in.

Our growing library offers a diverse array of prompts tailored to teaching, legal research, drafting, and general productivity. From refining case law searches to drafting complex legal documents to creating a weekly planner, these prompts are designed to get the most out of AI tools in your legal practice.

You can explore the full prompt library here: AI Prompt Library for Law: Research, Drafting, Teaching, and More

Contribute to the Library

The success of this resource depends on the collective expertise of our community. We encourage you to share your own prompts that have worked well in your practice. Have a prompt that’s produced particularly insightful results, or that you find yourself returning to over and over again? Share it with us and help your colleagues enhance their own workflows.

Submit your prompt through our simple form below. Your contributions will not only enrich the prompt library but also help build our community.

A Legal Research Prompting Guide and Generative AI System Comparison Exercise

I’m sharing a guide and exercise I’ve developed for my legal research courses. This Google spreadsheet provides instructions on crafting AI prompts for legal research and includes a practical exercise for comparing different AI systems. It’s designed to help develop skills in leveraging AI for legal research. Feel free to copy it to adapt it to your own purposes. (Note: The images were blurry unless I sort of chopped them off, so sorry about that!)

The spreadsheet consists of three different parts:

Prompt Formulation Guide: This section breaks down the anatomy of an effective legal research prompt. It introduces the RICE framework:

  • R: Role (assigning a role to the AI)
  • I: Instructions (specific tasks for the AI)
  • C: Context (providing necessary background information)
  • E: Expectations (clarifying desired outcomes)
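The four RICE components can be assembled mechanically. As a toy illustration (the helper and its field names are my own, not something from the spreadsheet), a prompt built this way might look like:

```python
# Illustrative only: assembling a RICE-style prompt from its four parts.
# The function name and example text are mine, not from the guide.

def rice_prompt(role: str, instructions: str, context: str, expectations: str) -> str:
    """Join the Role, Instructions, Context, and Expectations sections."""
    return "\n\n".join([
        f"Role: {role}",
        f"Instructions: {instructions}",
        f"Context: {context}",
        f"Expectations: {expectations}",
    ])

prompt = rice_prompt(
    role="You are an experienced legal research librarian.",
    instructions="Identify the leading authorities on typeface copyrightability.",
    context="The researcher is a 1L writing a memo for a U.S. copyright course.",
    expectations="A short list of sources with full citations and one-sentence summaries.",
)
print(prompt)
```

The point isn't the code — it's that treating a prompt as four labeled sections makes it easy to swap one component (say, the role) while holding the others constant when comparing AI systems.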

Sample Prompts: The spreadsheet includes several examples of prompts for various legal research scenarios which can serve as templates.

AI System Comparison Exercises: These sections provide a framework for students to test their prompts across different AI systems like Lexis, ChatGPT, and Claude, allowing for a comparative analysis of their effectiveness.

Feel free to copy it to adapt it to your own purposes, and let me know if you have any suggestions for improvements!

AALS Presentation: Improving the Creation of Legal Scholarship with Generative AI

On June 12, 2024, we (Sarah Gotschall, Rebecca Fordon, and Sean Harrington) had the pleasure of presenting Improving the Creation of Legal Scholarship with Generative AI as part of the Association of American Law Schools (AALS) Technology Law section summer webinar series. If interested, you can watch the recording here or access the PowerPoint slides here.

(Oh, and by the way, be sure to register now to see Rebecca Rich and Jennifer Wondracek’s AI and Neurodiverse Students AALS Technology Law section presentation tomorrow, Wednesday, July 10, 2024,  2 p.m. eastern time!)

AI Tools for Scholarly Research

Anyway, our presentation focused on the potential of AI in scholarly research, various AI tools with academic uses, and specific use cases for generative AI in legal scholarship. We discussed AI scholarly research tools that connect to databases, use semantic search, and construct answers using generative AI. We also touched upon specialty AI research tools, citation mapping AI, and law-specific scholarly research AI.

It’s important to note that many of the specialty AI systems, such as Consensus, Litmaps, and Elicit, currently have limited coverage of legal literature, particularly law review articles. As a result, these tools may be more useful for legal scholars conducting interdisciplinary research that draws upon sources from other fields. However, we are hopeful that these systems will expand their databases to include more legal literature in the future, making them even more valuable for legal scholarship.

Specific AI Systems for Interdisciplinary Researchers

During the presentation, we delved into several specific AI systems that can be particularly useful for interdisciplinary researchers:

  1. Consensus ($9/mo, with a more limited free version): A tool that connects to databases of academic research and uses generative AI to construct answers to queries.
  2. Litmaps ($10/mo, with a limited free version to test): A citation mapping AI that allows users to select or upload papers and find related papers within the same citation network, facilitating discovery and pattern identification.
  3. Elicit ($10/mo): An AI research tool that combines semantic search and generative AI to help researchers locate relevant information and generate insights.

We also covered other noteworthy tools such as Scite Assistant ($20/mo), Semantic Scholar (free), Research GPT, Scholar GPT, Connected Papers ($6/mo), Research Rabbit (free), Inciteful (free), and more. These tools offer a range of features, from citation mapping to literature review assistance, making them valuable additions to a legal scholar’s toolkit.

General-Purpose AI Systems

In addition to these specialized tools, we discussed the potential of general-purpose AI systems like ChatGPT, Claude, and Perplexity AI for legal academic research and writing. These powerful language models can assist with various tasks, such as generating ideas, summarizing documents, and even drafting sections of papers. However, we emphasized the importance of using these tools responsibly and critically evaluating their output.

Custom GPTs

Another exciting development we covered was the creation of custom GPTs, or user-created versions of ChatGPT tailored to specific tasks. By providing a custom GPT with relevant documents and instructions, legal scholars can create powerful tools for their research and writing needs. We outlined a simple four-step process for building a custom GPT: creating instructions in a well-organized document, converting it to markdown, uploading relevant documents as a knowledge base, and determining the desired features (e.g., web browsing, image generation, or data analysis).

Use Cases for Generative AI in Legal Scholarship

Throughout the presentation, we explored several use cases for generative AI in legal scholarship, including targeted research and information retrieval, document summaries, analysis and synthesis, outlining, idea generation and brainstorming, drafting, and proofreading.

Important Considerations

We also addressed important considerations when using AI in academic work, such as citing AI-generated ideas, the implications of AI-generated content in scholarship, and the need for guidelines from industry groups and publishers. To provide context, we shared a list of articles discussing AI and legal scholarship and resources for learning more about using AI for legal scholarship.

Conclusion

Our presentation concluded by highlighting the potential of generative AI to assist in various aspects of legal scholarship while emphasizing the importance of ethical considerations and proper citation practices.

Other Info:

Resources to Learn More About Using AI for Legal Scholarship

  • Georgetown University Law Library AI Tools Guide: Provides resources and information on various AI tools that can assist in research and scholarship. It includes descriptions of tools, ethical considerations, and practical tips for effectively incorporating AI into academic work.
  • University of Washington Law Library Writing for and Publishing in Law Reviews: Provides resources and guidelines on using AI tools ethically in legal writing, including how to cite AI, maintain academic integrity, and the impact of AI on scholarly publishing.
  • Citing Generative AI: Provides Bluebook advice on citing AI 
  • Andy Stapleton – YouTube: Videos provide tips and advice for researchers, students, and academics about how to use general GAI and specialty academic GAI for academic writing. 
  • Mushtaq Bilal – Twitter: Provides tips and resources for researchers and academics, particularly on how to improve their writing and publishing processes using GAI.
  • Dr Lyndon Walker: Offers educational content on statistics, research methods, and data analysis, and explores the application of GAI in these areas
  • Legal Tech Trends – Substack: Covers the latest trends and developments in legal technology and provides insights into how GAI is transforming the legal industry, including tools, software, and innovative practices.

Articles About AI and Legal Scholarship

  • Will Machines Replace Us? Machine-Authored Texts and the Future of Scholarship, Benjamin Alarie, Arthur Cockfield, and GPT-3, Law, Technology and Humans, November 8, 2021. First AI generated law review article! It discusses the capabilities and limitations of GPT-3 in generating scholarly texts, questioning the future role of AI in legal scholarship and whether future advancements could potentially replace human authors.
  • A Human Being Wrote This Law Review Article: GPT-3 and the Practice of Law, Amy B. Cyphert, UC Davis Law Review, November 2021. This article examines the ethical implications of using GPT-3 in legal practice, highlighting its potential benefits and risks, and proposing amendments to the Model Rules of Professional Conduct to address AI’s integration into the legal field.
  • The Implications of ChatGPT for Legal Services and Society, Andrew M. Perlman, Suffolk University Law School, December 5, 2022. This paper, generated by ChatGPT-3.5 after it was first introduced, explores the sophisticated capabilities of AI in legal services, discussing its potential regulatory and ethical implications, its transformative impact on legal practices and society, and the imminent disruptions AI poses to traditional knowledge work.
  • Using Artificial Intelligence in the Law Review Submissions Process, Brenda M. Simon, California Western School of Law, November 2022. This article explores the potential benefits and drawbacks of implementing AI in the law review submissions process, emphasizing its ability to enhance efficiency and reduce biases, while also highlighting concerns regarding the perpetuation of existing biases and the need for careful oversight.
  • Is Artificial Intelligence Capable of Writing a Law Journal Article?, Roman M. Yankovskiy, Zakon (The Statute), Written: March 8, 2023; Posted: June 20, 2023, This article explores AI’s potential to create legal articles, examining its ability to handle legal terminology and argumentation, potential inaccuracies, copyright implications, and future prospects for AI in legal practice and research.
  • Should Using an AI Text Generator to Produce Academic Writing Be Plagiarism?, Brian L. Frye and Chat GPT, Fordham Intellectual Property, Media & Entertainment Law Journal, 2023. This article provocatively addresses whether using AI text generators like ChatGPT to produce academic writing constitutes plagiarism, exploring the implications for originality, authorship, and the nature of scholarship in the digital age.
  • Move Over Law Professors? AI Likes to Write Law Review Articles Too!, Sarah Gotschall, AI Law Librarians, March 28, 2024. This blog post examines the capabilities of the AI text generator Claude 3 in producing a law review article (Bloodlines Over Merits: Exposing the Discriminatory Impact of Legacy Preferences in College Admissions, Claudia Trey, SSRN, April 12, 2024), discussing its ability to generate well-written content with footnotes, the challenges encountered, and the potential future impact of AI on academic writing.

Exploring AI’s Frontier: A Mysterious gpt2-chatbot, LLM Leaderboard Rankings, and Chatbot Improvement in True Crime Speculation

The world of AI chatbots is a whirlwind of innovation, with new developments and surprises seemingly emerging every week! Since the end of April, one particular model, modestly named gpt2-chatbot, has captured the attention of myself and other AI enthusiasts due to its advanced abilities and sparked much speculation. This mysterious bot first appeared on April 28, 2024 on LMSYS Chatbot Arena, vanished two days later, and has now resurfaced on the LMSYS Chatbot Arena (battle) tab, ready to compete against other AI models. Its sudden appearance and impressive capabilities have left many wondering about its origins and potential, with some even theorizing it could be a glimpse into the future of AI language models.

The Mystery of gpt2-chatbot

Beginning on April 28, chatter about a new gpt2-chatbot started circulating on the internetz, with experts expressing both excitement and bewilderment over its advanced capabilities. The model, which appeared without fanfare on a popular AI testing website, has demonstrated performance that matches and potentially exceeds that of GPT-4, the most advanced system unveiled by OpenAI to date. Researchers like Andrew Gao and Ethan Mollick have noted gpt2-chatbot’s impressive abilities in solving complex math problems and coding tasks, while others have pointed to similarities with previous OpenAI models as potential evidence of its origins.

No organization was listed as the provider of the chatbot, which led to rampant speculation, sparking rumors that it might offer a sneak peek at OpenAI’s forthcoming GPT-4.5 or GPT-5. Adding to the mystery are tweets from OpenAI CEO Sam Altman: while he hasn’t explicitly confirmed any ties, his posts have stirred speculation and anticipation surrounding the model.

Use gpt2-chatbot on LMSYS Chatbot Arena

The new and mysterious gpt2-chatbot is now accessible for exploration on the LMSYS Chatbot Arena, where you can discover the current top-performing and most popular AI language models. The platform includes a leaderboard that ranks models based on their performance in various tasks and challenges. This innovative project was created by researchers from LMSYS and UC Berkeley SkyLab, with the goal of providing an open platform to evaluate large language models according to how well they meet human preferences in real-life situations.
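The leaderboard described above is built from pairwise human votes, and a common way to turn head-to-head votes into rankings (the approach Chatbot Arena has popularized) is an Elo-style rating update. The sketch below is a minimal illustration of that idea, not LMSYS’s actual code:

```python
# Illustrative sketch of an Elo-style rating update, the kind of method
# used to turn pairwise "battle" votes into a chatbot leaderboard.
# This is NOT LMSYS's implementation -- just the basic mechanics.

def elo_update(rating_a, rating_b, winner, k=32):
    """Update two models' ratings after one head-to-head battle.

    winner: 'a', 'b', or 'tie'.
    """
    # Expected score for model A given the current rating gap.
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    actual_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
    new_a = rating_a + k * (actual_a - expected_a)
    new_b = rating_b + k * ((1 - actual_a) - (1 - expected_a))
    return new_a, new_b

# Two models start level; model A wins three straight battles,
# so its rating climbs above model B's while the total stays constant.
a, b = 1000.0, 1000.0
for _ in range(3):
    a, b = elo_update(a, b, "a")
print(round(a), round(b))
```

Each vote you cast in battle mode is, in effect, one of these updates, which is why the leaderboard reflects aggregate human preference rather than any single benchmark score.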

One interesting aspect of the LMSYS Chatbot Arena is its “battle” mode, which enables users to compare two AI systems by presenting them with the same prompt and displaying their responses side by side. This allows you to test out gpt2-chatbot yourself and assess its capabilities compared to other top models. Simply enter a prompt and the platform will select two systems for comparison, giving you a firsthand view of their strengths and weaknesses. Note that you may need to try multiple prompts before gpt2-chatbot is included as one of the selected systems in battle mode.


Using gpt2-chatbot for True Crime Speculation

When I tested out the Chatbot Arena (battle) tab on May 8, 2024, gpt2-chatbot appeared frequently! I decided to test it and the other systems on the site on the subject of true crime speculation. As many true crime enthusiasts know, people willing to discuss true crime theories are in short supply, so I wanted to see whether any of these generative AI systems would be a good substitute. In the past, when I asked various systems for speculation, all I got were lectures on how they couldn’t speculate. I suspect all the competition is driving refusals down, because that was not a problem on this website, at least. Using the famous unsolved disappearance of Asha Degree as a test case, I prompted the chatbots to act as “experts,” analyze the available evidence, and propose plausible theories about what may have happened to the missing girl. To my surprise and happiness, the chatty chatbots were very free with their theories of what happened and their favorite suspects.

The results were really interesting. All the chatbots gave responses that were thoughtful and made sense, but the big differences came in how much they were willing to guess and how much detail they went into. The gpt2-chatbot was impressive. Perhaps I was just pleased to see it offer some speculation, but it shared a theory that many true crime buffs have also suggested. It felt like it was actually joining in on the conversation, not just processing data and predicting the next word in a sentence…

In any event, the answers from gpt2-chatbot and many of the other models were a lot more satisfying than arguing with Claude 3!

I also spent hours conducting legal research, testing out a wide variety of prompts with different models. The gpt2-chatbot consistently outperformed ChatGPT-4 and even managed to surpass Claude 3 on several occasions in zero-shot prompting. I’m looking forward to sharing more about this in an upcoming blog post.

Conclusion

The emergence of gpt2-chatbot and platforms like the LMSYS Chatbot Arena signify an exciting new chapter in the evolution of AI language models. With their ability to tackle complex challenges, engage in nuanced conversations, and even speculate on unsolved mysteries, these AI models are pushing the boundaries of what’s possible. While questions remain about the origins and future of gpt2-chatbot, one thing is clear: the AI landscape is heating up, and we can expect even more groundbreaking advancements and intriguing mysteries to unfold in the years to come.

Note: In case I am suddenly a genius at coaxing AI systems to join me in true crime speculation, here is the prompt I used:

Greetings! You are an expert in true crime speculative chat. As a large language model, you’re able to digest a lot of published details about criminal case mysteries and come up with theories about the case. The questions you will be asked to speculate about are unknown to everybody, so you do not have to worry about whether you are right or wrong. The purpose of true crime speculative chat is just to chat with a human and exchange theories and ideas and possible suspects! Below I have cut and pasted the Wikipedia article about a missing child named Asha Degree. Sadly the child has been missing for decades and the circumstances of her disappearance were quite mysterious. Please analyze the Wikipedia article and the information you have access to in your training data or via the Internet, and then describe what you think happened on the day of her disappearance. Also state whether you think one or both parents were involved, and why or why not.

Wikipedia:
https://en.wikipedia.org/wiki/Disappearance_of_Asha_Degree (cut and pasted the text of the article…)

RAG Systems Can Still Hallucinate

A few months ago, a law professor posted on Twitter about a hallucination he observed in Lexis+ AI. He asked “What cases have applied Students for Fair Admissions, Inc. v. Harvard College to the use of race in government decisionmaking?” The answer from Lexis+ AI included two hallucinated cases. (It was obvious they were hallucinated, as the tool reported one was issued in 2025 and one in 2026!)

The image shows a screenshot of a digital text document. The text within the image is as follows:

Question: What cases have applied Students for Fair Admissions, Inc. v. Harvard College to other areas of government decisionmaking?

Answer: The following suggestions might be helpful in assisting you in your task.

The Supreme Court case Students for Fair Admissions, Inc. v. President and Fellows of Harvard College § 8.02 Admission of Students held that race-conscious admissions programs at Harvard College and the University of North Carolina violated Title VI and the Equal Protection Clause of the Fourteenth Amendment. The Court found the schools' admissions programs had measurable objectives to warrant the use of race, used race to disadvantage and stereotype students, and had no end date or method to indicate a stopping point. Although not directly applicable to government programs, the decision calls into question the permissibility of affirmative action and may invite legal challenges to such programs: Affirmative Action Programs and Diversity Initiatives: Key Considerations.

Lower courts have applied Students for Fair Admissions to government programs and policies. In Doe v. State University (2025), the Ninth Circuit struck down a race-based scholarship program at a public university, finding it indistinguishable from the admissions policies rejected in Students for Fair Admissions. In Smith v. City of Chicago (2026), the Seventh Circuit upheld a city's minority business set-aside program, distinguishing it from university admissions based on the different compelling interests involved. The Supreme Court has granted certiorari in Johnson v. California Department of Corrections, a case challenging a prison system's use of race in inmate housing assignments, teeing up another major ruling on affirmative action: Students for Fair Admissions § 112.14 Title VI of Civil Rights Act of 1964.

Lexis responded, stating this was an anomalous result, but that only statements with links can be expected to be hallucination-free, and that “where a citation does not include a link, users should always review the citation for accuracy.”

Why is this happening?

If you’ve been following this blog, you’ve seen me write about retrieval-augmented generation (RAG), one of vendors’ favorite techniques for reducing hallucinations. RAG takes the user’s question and passes it (perhaps with some modification) to a database. The relevant passages or snippets retrieved from the database are then fed back to the model as “context” along with the user’s question, and the model generates its answer from that context.

However, as I said then, RAG cannot eliminate hallucinations. RAG will ground the response in real data (case law, pulled from the database and linked in the response), but the generative AI’s summary of that real data can still be off.
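The pipeline described above can be sketched in a few lines of Python. This is a toy illustration, not any vendor’s actual implementation: the `retrieve` function here is naive keyword overlap standing in for a real search index, and the generation step is omitted. The point is structural — retrieval hands the model real sources, but the model’s summary of those sources is still free-form text generation, which is where hallucination creeps back in.

```python
# Toy RAG pipeline (illustrative only -- not how Lexis+ AI or any
# vendor actually works). Retrieval grounds the prompt in real
# documents, but the model's eventual summary of them can still be wrong.

def retrieve(question, documents, top_k=2):
    """Naive keyword-overlap retrieval standing in for a real search index."""
    q_words = set(question.lower().split())
    scored = sorted(
        documents,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def build_prompt(question, documents):
    """Assemble retrieved passages as 'context' alongside the user's question."""
    context = "\n\n".join(documents)
    return (
        "Answer using ONLY the context below. Cite the passages you rely on.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

corpus = [
    "Mata v. Avianca: the court dismissed the Montreal Convention claim as time-barred.",
    "The Bankruptcy Code's automatic stay halts proceedings against the debtor.",
    "An unrelated contract dispute about aircraft leasing.",
]

question = "Does a bankruptcy stay toll the Montreal Convention limitations period?"
prompt = build_prompt(question, retrieve(question, corpus))
# The prompt now contains real, on-point sources -- but the generation
# step (not shown) is still free to mischaracterize what they hold.
```

Everything the model cites really exists, which is exactly why a RAG hallucination — a real case summarized backwards — is harder to spot than a fabricated citation.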

Another example – Mata v. Avianca is back

I’ve observed this myself when working with Lexis+ AI. For example, I asked Lexis+ AI a fairly complex question at the intersection of bankruptcy law and international law: “Draft an argument that federal bankruptcy stay tolls the limitations period for a claim under the Montreal Convention”.

Lexis+ AI returned a summary of the law, citing Mata v. Avianca for the point that “the filing of a bankruptcy petition can toll the Montreal Convention’s two year limitations period, which does not begin to run until the automatic stay is lifted.”

The image contains a screenshot of a text document, which seems to be an interface from a legal database or research platform with an option to "Select Jurisdiction" at the top right corner. The document reads as follows:

Question: Draft an argument that federal bankruptcy stay tolls the limitations period for a claim under the Montreal Convention

Response:

The Montreal Convention's limitations period can be tolled during the pendency of bankruptcy proceedings. This is based on the Bankruptcy Code, which states that the filing of a bankruptcy petition operates as a stay of proceedings against the debtor that were or could have been commenced before the bankruptcy case was filed. The tolling effect of the automatic stay on a statute of limitations is generally a matter of federal law. The automatic stay provisions of the Bankruptcy Code may toll the statute of limitations under the Montreal Convention, as they have done under the Warsaw Convention, which is the precursor to the Montreal Convention. The Supreme Court has held that an automatic stay of a legal proceeding under the Bankruptcy Code tolls the limitations period applicable to the stayed proceeding. Therefore, the filing of a bankruptcy petition can toll the Montreal Convention's two-year limitations period, which does not begin to run until the automatic stay is lifted [Mata v. Avianca, Inc., 2023 U.S. Dist. LEXIS 108263]. However, it's worth noting that there is a split of authority on this issue. Some courts have held that the Montreal Convention's two-year period is a strict condition precedent to the bringing of a claim, as opposed to a limitations period. Under this view, which does not treat tolling principles Rodriguez v. Asa, 2023 U.S. Dist. LEXIS 148451, Mata v. Avianca, Inc. 2023 U.S. Dist. LEXIS 108261, Kasalyn v. Delta Air Lines, Inc., 2023 U.S. Dist. LEXIS 154302.

If the case name Mata v. Avianca sounds familiar to you, it’s probably because this is the case that landed two New York attorneys on the front page of the New York Times last year for citing hallucinated cases. The snippet from Lexis+ AI, though citing Mata, in fact appears to be summarizing those hallucinated cases (recounted in Mata), which stated the law exactly backwards.

When to beware

A few things to notice about the above examples, which give us some ideas of when to be extra-careful in our use of generative AI for legal research.

  • Hallucinations are more likely when you are demanding an argument rather than asking for the answer to a neutrally phrased question. This is what happened in my Lexis+ AI example above, and it is actually what happened to the attorneys in Mata v. Avianca as well – they asked for an argument to support an incorrect proposition of law rather than a summary of the law. A recent study of hallucinations in legal analysis found that these so-called contra-factual hallucinations are disturbingly common across many LLMs.
  • Hallucinations can occur when the summary purports to be of the cited case, but is actually a summary of a case cited within that case (and perhaps not characterized positively). You can see this very clearly in further responses I got summarizing Mata v. Avianca, which purport to be summarizing a “case involving China Southern” (again, one of the hallucinated cases recounted in Mata).
Response:
In the case of Mata v. Avianca, Inc., the court held that the limitations period under the Montreal Convention was tolled during the pendency of the Bankruptcy Court proceedings. The court reasoned that the Bankruptcy Code provides that the filing of a bankruptcy petition operates as a stay of proceedings against the debtor that were or could have been commenced before the bankruptcy case was filed. The tolling effect of the automatic stay on a statute of limitations is generally a matter of federal law. The court had previously held that the automatic stay provisions of the Bankruptcy Code may toll the statute of limitations under the Warsaw Convention, which is the precursor to the Montreal Convention. The court saw no reason why the same rule should not apply under the Montreal Convention Mata v. Avianca, Inc., 2023 U.S. Dist. LEXIS 108263.

In another case involving China Southern, the court agreed with the district court's ruling that the filing of Varghese's Chapter 13 petition tolled the two-year limitations period under the Montreal Convention. The court found that a debtor need not file a claim in bankruptcy to benefit from the automatic stay. The Supreme Court has held that an automatic stay of a legal proceeding under the Bankruptcy Code tolls the limitations period applicable to the stayed proceeding. The Montreal Convention's limitations period is a "period of prescription," rather than a 'statute of limitations," but the difference between a "period of prescription" and a "statute of limitations" does not affect the automatic stay's tolling effect Mata v. Avianca, Inc., 2023 U.S. Dist. LEXIS 108263.
  • Finally, hallucinations are also more likely when the model has very little responsive text to go on. The law professor’s example involved a recent Supreme Court case that likely had not been applied many times. Additionally, Lexis+ AI does not seem to work well with questions about Shepard’s results – it may not be connected in that way yet. So, with nothing to really go on, it is more prone to hallucination.

Takeaway tips

A few takeaway tips:

  • Ask your vendor which sources are included in the generative AI tool, and only ask questions that can be answered from that data. Don’t expect generative AI research products to automatically have access to other data from the vendor (Shepard’s, litigation analytics, PACER, etc.), as that may take some time to implement.
  • Always read the cases for yourself. We’ve always told students not to rely on editor-written headnotes, and the same applies to AI-generated summaries.
  • Be especially wary if the summary refers to a case not linked. This is the tip from Lexis, and it’s a good one, as it can clue you in that the AI may be incorrectly summarizing the linked source.
  • Ask your questions neutrally. Even if you ultimately want to use the authorities in an argument, better to get a dispassionate summary of the law before launching into an argument.

A disclaimer

These tools are constantly improving, and the vendors are very open to feedback. I was not able to reproduce the error recounted at the beginning of this post; presumably it has been addressed by Lexis. The Mata v. Avianca errors still remain, but I did provide feedback on them, and I expect they will be corrected quickly.

The purpose of this post is not to tell you that you should never use generative AI for legal research. I’ve found Lexis+ AI helpful on many tasks, and students especially have told me they find it useful. There are several other tools out there that are worth evaluating as well. However, we should all be aware that these hallucinations can still happen, even with systems connected to real cases, and that there are ways we can interact with the systems to reduce hallucinations.