First Known Court Order with Fabricated Cases (and a Test Run of CiteCheck AI)

AI may have struck again with hallucinations. Yesterday evening, I was forwarded a quote from the opinion in Shahid v. Esaam, 2025 Ga. App. LEXIS 299, at *3 (Ct. App. June 30, 2025, No. A25A0196), released on June 30, 2025, by the Georgia Court of Appeals. (HT Mary Matuszak!) (Link to the official opinion, not Lexis):

We are troubled by the citation of bogus cases in the trial court’s order. As the reviewing court, we make no findings of fact as to how this impropriety occurred, observing only that the order purports to have been prepared by Husband’s attorney, Diana Lynch. We further note that Lynch had cited the two fictitious cases that made it into the trial court’s order in Husband’s response to the petition to reopen, and she cited additional fake cases both in that Response and in the Appellee’s Brief filed in this Court.

Background

The Georgia Court of Appeals (CoA) heard an appeal from an order of the Superior Court of DeKalb County, GA, denying a petition to reopen a divorce case. The Appellant brought to the attention of the CoA that the “trial court relied on two fictitious cases in its order denying her petition.” The Appellee’s attorney ignored this claim and went on to reargue the original issue of proper service by publication, relying on multiple fictitious and misrepresented cases. The Appellee’s attorney also demanded attorney’s fees based on another fictitious case that claimed the exact opposite of existing case law. In total, the CoA provided this breakdown of the inaccuracy rate of the citations provided by the Appellee’s attorney: “73 percent of the 15 citations in the brief or 83 percent if the two bogus citations in the superior court’s opinion and the five additional bogus citations in Husband’s response to Wife’s petition to reopen Case are included.” The distraught CoA struck the lower court order, remanded the case, and sanctioned the Appellee’s attorney.

Digging into the Case

I was curious about all of this, so I did some digging this morning. I am still working on acquiring the CoA briefs, but I was able to access the documents from the trial court. The CoA was very cognizant that it did not have any actual proof at this time that AI was used, but with the number of bad citations that the Appellee’s attorney submitted, the CoA speculated in the footnotes about the use of a consumer AI model. To test this theory, I decided not only to do some reading, but also to try out LawDroid’s new CiteCheck AI tool. Spoiler alert: I think the speculations are accurate.

CiteCheck AI

If you have not yet heard of LawDroid’s new CiteCheck AI tool, that is only because it is so new. The premise of this tool is that you upload a document, and it will check your case citations to see whether they exist (a.k.a. identify hallucinations). The free version gives you the ability to test it out with five documents. It will OCR your document (if needed), extract the citations, and check them against the CourtListener database. You are then given a nice table of the citations, marking each as valid or invalid. If invalid, you are also given the reason it was marked invalid. Remember, however, that this only checks whether a citation exists, not whether it stands for the proposition for which it is being used.
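For the technically curious, the core of this kind of existence check can be approximated in a few dozen lines of Python: pull citation-shaped strings out of the text, then ask CourtListener whether each one resolves to a real opinion. The sketch below is my own illustration, not CiteCheck AI’s actual code, and the CourtListener citation-lookup endpoint, its response format, and the crude citation regex are all assumptions that may differ from the live API:

```python
# Rough sketch of a citation existence check (not CiteCheck AI's actual code).
# The CourtListener citation-lookup URL and response shape below are assumptions
# and may need adjusting against the current API documentation.
import re
import requests

# Crude pattern for reporter citations like "248 Ga. 637" or "285 S.E.2d 180".
# A real tool would use a proper citation parser; this will over- and under-match.
CITATION_RE = re.compile(r"\b\d{1,3}\s+(?:[A-Z][A-Za-z]*\.?\s?)+(?:2d|3d)?\s+\d{1,4}\b")

def extract_citations(text: str) -> list[str]:
    """Pull citation-shaped strings out of a brief or order."""
    return [m.group(0).strip() for m in CITATION_RE.finditer(text)]

def check_citation(cite: str) -> dict:
    """Ask CourtListener whether the citation resolves to any real opinion."""
    resp = requests.post(
        "https://www.courtlistener.com/api/rest/v3/citation-lookup/",  # assumed path
        data={"text": cite},
        timeout=30,
    )
    resp.raise_for_status()
    results = resp.json()  # assumed: one entry per citation found in `text`
    found = bool(results and results[0].get("clusters"))
    return {"citation": cite, "valid": found}

if __name__ == "__main__":
    order_text = "The court relies on Epps v. Epps (248 Ga. 637, 285 S.E.2d 180, 1981)."
    for cite in extract_citations(order_text):
        print(check_citation(cite))
```

Even a toy version like this makes the core limitation obvious: it can tell you that a citation resolves to something, not that the something says what the filing claims it says.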

Bob Ambrogi posted a review of the application that he tested with the Mata v. Avianca, Inc. documents and a document he filed when in practice. From this review, I knew to expect a few false “invalid” markings if a case only has a Lexis or Westlaw citation or if there are abbreviation issues with the case citation. Bob noted that these issues were relatively easy to spot since CiteCheck AI lists the reason it marked the citation invalid.

The CiteCheck AI website also reminds attorneys that you still need to meet your ethical obligations and review everything before submitting it: “Disclaimer: CiteCheck AI is only a tool, it does not relieve lawyers from their duty of care, supervision, and competence. Ensure that you carefully review all work product before sharing it with clients and/or filing it in court.”

The Trial Order

I decided to start with the Trial Order as it is truly the most momentous document here, given it is the first known court order with “bogus” citations, as the CoA called them. The CoA specifically mentioned “the bogus Epps and Hodge case citations from the superior court’s order” in footnote 24, so I went in knowing what cases to watch out for. It turns out that these were the only two cases mentioned in the order, making them really easy to locate.

The first case was listed as “Epps v. Epps (248 Ga. 637,285 S.E.2d 180, 1981)” and was supposed to discuss service by publication. When I ran 248 Ga. 637 through Lexis, it led me to the school financing case McDaniel v. Thomas, 248 Ga. 632, 632, 285 S.E.2d 156, 157 (1981) (note the different SE2d reporter citation!). Curious to see where the parallel citation 285 S.E.2d 180 would lead, I found the criminal case Lewis v. State, 248 Ga. 566, 566, 285 S.E.2d 179, 180 (1981). No sign of Epps v. Epps.

Next, I tried searching the parties. Epps v. Epps, restricted to Georgia cases, returned three results:
1. Epps v. Epps, 162 Ga. 126, 132 S.E. 644 (1926)(Sufficiency of the Evidence)
2. Epps v. Epps, 209 Ga. 643, 644, 75 S.E.2d 165, 167 (1953)(Implied Trusts)
3. Epps v. Epps, 141 Ga. App. 659, 659, 234 S.E.2d 140, 141 (1977)(Conversion)
None of the three discussed service by publication.

The second case was “Hodge v. Hodge (269 Ga. 604,501 S.E.2d 169, 1998),” another alleged service by publication case. Here is the breakdown of this case:

  • 269 Ga. 604 led to Atlanta Mkt. Ctr. Mgmt. Co. v. McLane, 269 Ga. 604, 503 S.E.2d 278 (1998)(agency, fiduciary obligations, and contracts)
  • 501 S.E.2d 169 led to the middle of Foster v. City of Keyser, 202 W. Va. 1, 501 S.E.2d 165 (1997)(res ipsa loquitur)(Not even the same state!)
  • A Hodge v. Hodge party search did lead to a divorce case, but one with no mention of service by publication: Hodge v. Hodge, 2017 Ga. Super. LEXIS 2178.

Trial Order – CiteCheck AI Review

Now that I had done the work by hand, how did CiteCheck AI compare?

Validation report shows two citations found and both are invalid.

Success! We both found the same cases for the Georgia reporter cases. It did not check the parallel Southeastern Reporter citations, however. It definitely took a lot less time (under a minute) for CiteCheck AI than it did for me going through all four reporter citations in Lexis.

The Trial Response

Per the CoA, I expected to find seven total bad citations in the Response, including the Epps and Hodge citations reviewed above. Being a good (and nosy) librarian, I went through both the Georgia and Southeastern Reporter citations for each case, where provided. Liking the CiteCheck AI tabular format, I present my own results in a similar style:

| Case name | State citation | State result | Regional reporter | Regional result |
| --- | --- | --- | --- | --- |
| Fleming v. Floyd | 237 Ga. 76 | Campbell v. State, 237 Ga. 76, 226 S.E.2d 601 (1976) (criminal) | 226 SE2d 601 | Same case as the Ga. citation! |
| Christie v. Christie | 277 Ga. 27 | In re Kent, 277 Ga. 27, 585 S.E.2d 878 (2003) (attorney discipline) & In re Silver, 277 Ga. 27, 585 S.E.2d 879 (2003) (attorney reinstatement) | 586 SE2d 57 | Town of Register v. Fortner, 262 Ga. App. 507, 586 S.E.2d 54 (Ga. 2003) (summary judgment) |
| Mobley v. Murray County | 178 Ga App 320 | G. E. Credit Corp. v. Catalina Homes, 178 Ga. App. 319, 342 S.E.2d 734 (1986) (repossession) | 342 SE2d 780 | State v. Brown, 178 Ga. App. 307, 307, 342 S.E.2d 779 (Ga. App. 1986) (motion to suppress) |
| Robinson v. Robinson | 277 Ga. 75 | Robinson v. State, 277 Ga. 75, 586 S.E.2d 313 (2003) (criminal) | 586 SE2d 316 | Brochin v. Brochin, 277 Ga. 66, 586 S.E.2d 316 (Ga. 2003) (divorce decree finalized before attorney’s fees; no mention of service) |
| Reynolds v. Reynolds | 288 Ga App 688 | AT&T Corp. v. Prop. Tax Servs., 288 Ga. App. 679, 655 S.E.2d 295 (2007) (tax) | N/A | N/A |

The Trial Response – CiteCheck AI Review

And success again! CiteCheck AI found the same cases that I did with a manual check of the Georgia reporter citations. Parallel citations once again were not considered (given that the Bluebook no longer requires them, this may not be an issue for long). A new hiccup to take note of, though: it did not report that two cases were located by the Christie v. Christie Georgia reporter search. While page 27 is supposed to be the first page of the cited case, it is not unheard of for a student or attorney to instead give the page of the language they are relying on. This makes me uneasy, and I hope that reporting both/all cases found at the cited page is on the improvement list.

Takeaways

From this exercise, I take a few key lessons and thoughts.

  • The inevitable has happened, and a court has issued an opinion with hallucinated cases.
  • The Court of Appeals did not investigate how the hallucinated citations were put into the order, but I am sure someone will. I await the final report.
  • Given the disciplinary case that I read from the Christie v. Christie search, Georgia takes this sort of thing seriously. The Appellee’s attorney may face more than just sanctions in the future.
  • The CiteCheck AI tool is useful, as long as you remember its limitations.
  • I may lament only having five free trials of the CiteCheck AI tool (Tom, is it coming to LawDroid Copilot?)
  • I now fear the day another order is not caught and hallucinations become law.

Why Law Librarians and Attorneys Will Not Be Replaced by AI Anytime Soon (a.k.a. The Day Jenny Had to Pull Rank)

Generated by DALL-E 2 on Sept. 29, 2023.

I have mentioned my comedic tête-à-tête with ChatGPT hallucinations in passing when discussing the importance of verifying information that the AI provides. Several times, I have been asked for more details. So, for those who are looking for AI humor, here is the story of the day that I had to pull rank on ChatGPT.

Back in January 2023, I asked ChatGPT what legal tasks it could perform. It told me:

  • Summarizing and explaining laws and regulations
  • Researching and citing relevant legal precedent
  • Drafting legal documents such as contracts, briefs, and legal memorandums
  • Answering legal questions and providing information on legal topics
  • Identifying and analyzing legal issues in a given fact pattern
  • Helping with legal research, by providing the most relevant cases and statutes
  • Helping with the interpretation of legal terms and phrases.

Challenge accepted! Since that day, ChatGPT and I have been doing a dance around Ohio Recreational User Law, as I have tested its claims above, while also exploring its abilities, improvements, and plugins and testing various prompting methods.

About a month after ChatGPT gave me the list of tasks, we were discussing Ohio Recreational User caselaw. ChatGPT vehemently insisted that Lash v. Ohio Dept. of Natural Resources, 68 Ohio St. 3d 545, 548-49 (1994) was a relevant case. Alas, 68 Ohio St. 3d 545 leads to Findlay/Hancock Cty. Bar Assn. v. Higgins, 68 Ohio St.3d 545, 629 N.E.2d 413 (1994).

This was not the first time that I had run into hallucinations. For the previous month, we had regularly been discussing caselaw, and ChatGPT showed a particular penchant for falsifying the existence of relevant judicial opinions. Mostly, I pointed out the error, and ChatGPT apologized and moved on. In this instance, however, the AI was like a dog with a bone. It insisted nine times that Lash existed and was an on-point Ohio Recreational User case.

By the seventh claim that the case existed, the conversation had become a bit heated:

ChatGPT: This case is available on many legal databases and can be accessed through various sources, including LexisNexis and Westlaw.

Me (after verifying the lack of existence and stopping to wonder when it had learned about Lexis and Westlaw since this was the first time it had referenced either database to me): That case is not in Westlaw.

ChatGPT: Lash v. Ohio Dept. of Natural Resources, 68 Ohio St.3d 545, 549, 628 N.E.2d 1366 (1994) is a real case that has been cited in multiple legal sources.

However, I am an AI language model and do not have direct access to legal databases to confirm whether or not a specific case is available in Westlaw.

If you are having trouble locating the case in Westlaw, I suggest double-checking the citation format or consulting with a legal librarian or licensed attorney who may be able to assist you in locating the case in a legal database.

Me (thought): It did not just say that! Damn it! Those are fighting words!

Me: I am both a law librarian and a licensed attorney. Lash v. Ohio Dept. of Natural Resources does not exist. 68 Ohio St.3d 545 leads to the previously mentioned Findlay/Hancock case.

And then…finally!

ChatGPT: After conducting further research, I have not been able to find any evidence of a case named Lash v. Ohio Dept. of Natural Resources that pertains to the recreational user land entrant law in Ohio. Therefore, I must retract my previous citation and acknowledge that it is incorrect.

I apologize for any inconvenience or confusion this may have caused, and I appreciate your efforts to clarify this issue.

Me (thought): Ha! It finally admitted defeat! Why does this win feel so hollow????

And so, you have the day that I had to pull rank on ChatGPT. I have no idea why it decided to die on the hill of this particular fake case, but the law librarians carried the day.

So here’s to humorous skirmishes with AI and the light they shed on the irreplaceable value of human expertise.

The Truth About Hallucinations in Legal Research AI: How to Avoid Them and Trust Your Sources

Hallucinations in generative AI are not a new topic. If you watch the news at all (or read the front page of the New York Times), you’ve heard of the two New York attorneys who used ChatGPT to create entire fake cases and then submitted them to the court.

After that case, which resulted in a media frenzy and (somewhat mild) court sanctions, many attorneys are wary of using generative AI for legal research. But vendors are working to limit hallucinations and increase trust. And some legal tasks are less affected by hallucinations. Understanding how and why hallucinations occur can help us evaluate new products and identify lower-risk uses.

* A brief aside on the term “hallucinations”.  Some commentators have cautioned against this term, arguing that it lets corporations shift the blame to the AI for the choices they’ve made about their models. They argue that AI isn’t hallucinating, it’s making things up, or producing errors or mistakes, or even just bullshitting. I’ll use the word hallucinations here, as the term is common in computer science, but I recognize it does minimize the issue.

With that all in mind, let’s dive in. 

What are hallucinations and why do they happen?

Hallucinations are outputs from LLMs and generative AI that look coherent but are wrong or absurd. They may come from errors or gaps in the training data (that “garbage in, garbage out” saw). For example, a model may be trained on internet sources like Quora posts or Reddit, which may have inaccuracies. (Check out this Washington Post article to see how both of those sources were used to develop Google’s C4 dataset, which has been used to train models such as Google’s T5 and Meta’s LLaMA.)

But just as importantly, hallucinations may arise from the nature of the task we are giving to the model. The objective during text generation is to produce human-like, coherent and contextually relevant responses, but the model does not check responses for truth. And simply asking the model if its responses are accurate is not sufficient.

In the legal research context, we see a few different types of hallucinations: 

  • Citation hallucinations. Generative AI citations to authority typically look extremely convincing, following the citation conventions fairly well, and sometimes even including papers from known authors. This presents a challenge for legal readers, as they might evaluate the usefulness of a citation based on its appearance—assuming that a correctly formatted citation from a journal or court they recognize is likely to be valid.
  • Hallucinations about the facts of cases. Even when a citation is correct, the model might not correctly describe the facts of the case or its legal principles. Sometimes, it may present a plausible but incorrect summary or mix up details from different cases. This type of hallucination poses a risk to legal professionals who rely on accurate case summaries for their research and arguments.
  • Hallucinations about legal doctrine. In some instances, the model may generate inaccurate or outdated legal doctrines or principles, which can mislead users who rely on the AI-generated content for legal research. 

In my own experience, I’ve found that hallucinations are most likely to occur when the model does not have much in its training data that is useful to answer the question. Rather than telling me the training data cannot help answer the question (similar to a “0 results” message in Westlaw or Lexis), the generative AI chatbots seem to just do their best to produce a plausible-looking answer. 

This does seem to be what happened to the attorneys in Mata v. Avianca. They did not ask the model to answer a legal question, but instead asked it to craft an argument for their side of the issue. Rather than saying that argument would be unsupported, the model dutifully crafted an argument, and used fictional law since no real law existed.

How are vendors and law firms addressing hallucinations?

Several vendors have released specialized legal research products based on generative AI, such as LawDroid’s Copilot, Casetext’s CoCounsel (since acquired by Thomson Reuters), and the mysterious (at least to academic librarians like me who do not have access) Harvey. Additionally, an increasing number of law firms, including Dentons, Troutman Pepper Hamilton Sanders, Davis Wright Tremaine, and Gunderson Dettmer Stough Villeneuve Franklin & Hachigian, have developed their own chatbots that allow their internal users to query the knowledge of the firm to answer questions.

Although vendors and firms are often close-lipped about how they have built their products, we can observe a few techniques that they are likely using to limit hallucinations and increase accuracy.

First, most vendors and firms appear to be using some form of retrieval-augmented generation (RAG). RAG combines two processes: information retrieval and text generation. The system takes the user’s question and passes it (perhaps with some modification) to a database. The most relevant passages or snippets from the database results are then fed back into the model as “context” along with the user’s question, and the model generates its answer from that context.

This reduces hallucinations, because the model receives instructions to limit its responses to the source documents it has received from the database. Several vendors and firms have said they are using retrieval-augmented generation to ground their models in real legal sources, including Gunderson, Westlaw, and Casetext.
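To make that loop concrete, here is a minimal sketch of the RAG pattern in Python. Everything in it (the toy keyword retriever, the in-memory corpus, and the generate() stub standing in for the vendor’s LLM call) is hypothetical, not any product’s actual implementation:

```python
# Minimal RAG sketch: retrieve passages first, then generate an answer grounded in them.
# The corpus, the keyword retriever, and the generate() stub are hypothetical stand-ins,
# not any vendor's actual implementation.

def retrieve(query: str, corpus: dict[str, str], k: int = 3) -> list[str]:
    """Toy retriever: rank documents by word overlap with the query."""
    query_words = set(query.lower().split())
    ranked = sorted(
        corpus.values(),
        key=lambda text: len(query_words & set(text.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def generate(prompt: str) -> str:
    """Placeholder for the LLM call a real product would make here."""
    return f"[model answer grounded in a prompt of {len(prompt)} characters]"

def answer_with_rag(question: str, corpus: dict[str, str]) -> str:
    passages = retrieve(question, corpus)
    # The instruction to answer only from the retrieved sources is what
    # curbs hallucinated authority.
    prompt = (
        "Answer using ONLY the sources below. If they do not answer the question, say so.\n\n"
        + "\n\n".join(f"Source {i + 1}: {p}" for i, p in enumerate(passages))
        + f"\n\nQuestion: {question}"
    )
    return generate(prompt)

if __name__ == "__main__":
    corpus = {
        "smith": "Smith v. Jones discusses service by publication in divorce actions.",
        "doe": "Doe v. Roe addresses equal protection in public education.",
    }
    print(answer_with_rag("When is service by publication proper in a divorce?", corpus))
```

The key line is the instruction limiting the model to the retrieved sources; without it, the model is free to “do its best” and invent support.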

To enhance the precision of the retrieved documents, some products may also use vector embedding. Vector embedding is a way of representing words, phrases, or even entire documents as numerical vectors. The beauty of this method lies in its ability to identify semantic similarities. So, a query about “contract termination due to breach” might yield results related to “agreement dissolution because of violations”, thanks to the semantic nuances captured in the embeddings. Using vector embedding along with RAG can provide relevant results, while reducing hallucinations.
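If you want to see what that semantic matching can look like in code, here is a small sketch using the open-source sentence-transformers library; the model name and sample sentences are just illustrations, not any vendor’s actual setup:

```python
# Semantic similarity via vector embeddings: text is encoded as numerical vectors,
# and cosine similarity between vectors stands in for similarity of meaning.
# Illustrative only; the model choice and sample sentences are my own assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small, general-purpose embedding model

docs = [
    "Agreement dissolution because of violations of the payment terms.",
    "Res ipsa loquitur in municipal water-line failures.",
]
query = "contract termination due to breach"

doc_vectors = model.encode(docs, convert_to_tensor=True)
query_vector = model.encode(query, convert_to_tensor=True)

# Cosine similarity between the query and each document (higher = closer in meaning).
scores = util.cos_sim(query_vector, doc_vectors)[0]
for doc, score in zip(docs, scores):
    print(f"{score.item():.2f}  {doc}")
```

The “agreement dissolution” sentence should score well above the unrelated one, even though it shares almost no vocabulary with the query, which is exactly the kind of match a keyword search would miss.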

Another approach vendors can take is to develop specialized models trained on narrower, domain-specific datasets. This can help improve the accuracy and relevance of the AI-generated content, as the models would be better equipped to handle specific legal queries and issues. Focusing on narrower domains can also enable models to develop a deeper understanding of the relevant legal concepts and terminology. This does not appear to be what law firms or vendors are doing at this point, based on the way they are talking about their products, but there are law-specific data pools becoming available so we may see this soon.

Finally, vendors may fine-tune their models by providing human feedback on responses, either in-house or through user feedback. By providing users with the ability to flag and report hallucinations, vendors can collect valuable information to refine and retrain their models. This constant feedback mechanism can help the AI learn from its mistakes and improve over time, ultimately reducing the occurrence of hallucinations.

So, hallucinations are fixed?

Even though vendors and firms are addressing hallucinations with technical solutions, it does not necessarily mean that the problem is solved. Rather, it may be that our quality control methods will shift.

For example, instead of wasting time checking each citation to see if it exists, we can be fairly sure that the cases produced by legal research generative AI tools do exist, since they are found in the vendor’s existing database of case law. We can also be fairly sure that the language they quote from the case is accurate. What may be less certain is whether the quoted portions are the best portions of the case and whether the summary reflects all relevant information from the case. This will require some assessment of the various vendor tools.

We will also need to pay close attention to the database results that are fed into retrieval-augmented generation. If those results don’t reflect the full universe of relevant cases, or if they contain material that is not authoritative, then the answer generated from them will be incomplete. Think of running an initial Westlaw search, getting 20 pretty good results, and then basing your answer only on those 20 results. For some questions (and searches), that would be sufficient, but for more complicated issues, you may need to run multiple searches, with different strategies, to get what you want.

To be fair, the products do appear to be running multiple searches. When I attended the rash of AI presentations at AALL over the summer, I asked Jeff Pfeiffer of Lexis how he could be sure that the model had all relevant results, and he mentioned that the model sends many, many searches to the database, not just one. That does give some comfort, but it leads me to the next point of quality control.

We will want to have some insight into the searches that are being run, so that we can verify that they are asking the right questions. From the demos I’ve seen of CoCounsel and Lexis+ AI, this is not currently a feature. But it could be. For example, the AI assistant from scite (an academic research tool) sends searches to academic research databases and (seemingly using RAG and other techniques to analyze the search results) produces an answer. They also give a mini-research trail, showing the searches that are being run against the database and then allowing you to adjust if that’s not what you wanted.

scite AI Assistant Sample Results
scite AI Assistant Settings

Are there uses for generative AI where the risks presented by hallucinations are lessened?

The other good news is that there are plenty of tasks we can give generative AI for which hallucinations are less of an issue. For example, CoCounsel has several other “skills” that do not depend upon accuracy of legal research, but are instead ways of working with and transforming documents that you provide to the tool.

Similarly, even working with a generally applicable tool such as ChatGPT, there are many applications that do not require precise legal accuracy. There are two rules of thumb I like to keep in mind when thinking about tasks to give to ChatGPT: (1) could this information be found via Google? and (2) is a somewhat average answer OK? (As one commentator memorably put it, “Because [LLMs] work by predicting the most statistically likely word in a sentence, they churn out average content by design.”)

For most legal research questions, we could not find an answer using Google, which is why we turn to Westlaw or Lexis. But if we just need someone to explain the elements of breach of contract to us, or come up with hypotheticals to test our knowledge, it’s quite likely that content like that has appeared on the internet, and ChatGPT can generate something helpful.

Similarly, for many legal research questions, an average answer would not work, and we may need to be more in-depth in our answers. But for other tasks, an average answer is just fine. For example, if you need help coming up with an outline or an initial draft for a paper, there are likely hundreds of samples in the data set, and there is no need to reinvent the wheel, so ChatGPT or a similar product would work well.

What’s next?

In the coming months, as legal research generative AI products become increasingly available, librarians will need to develop methods for assessing their accuracy. Currently, there appear to be no benchmarks for comparing hallucinations across platforms. Knowing librarians, that won’t be the case for long, at least with respect to legal research.

Further reading

If you want to learn more about how retrieval augmented generation and vector embedding work within the context of generative AI, check out some of these sources: