The Truth About Hallucinations in Legal Research AI: How to Avoid Them and Trust Your Sources

Hallucinations in generative AI are not a new topic. If you watch the news at all (or read the front page of the New York Times), you’ve heard of the two New York attorneys who used ChatGPT to fabricate entire cases and then submitted them to the court.

After that case, which resulted in a media frenzy and (somewhat mild) court sanctions, many attorneys are wary of using generative AI for legal research. But vendors are working to limit hallucinations and increase trust. And some legal tasks are less affected by hallucinations. Understanding how and why hallucinations occur can help us evaluate new products and identify lower-risk uses.

* A brief aside on the term “hallucinations”.  Some commentators have cautioned against this term, arguing that it lets corporations shift the blame to the AI for the choices they’ve made about their models. They argue that AI isn’t hallucinating, it’s making things up, or producing errors or mistakes, or even just bullshitting. I’ll use the word hallucinations here, as the term is common in computer science, but I recognize it does minimize the issue.

With that all in mind, let’s dive in. 

What are hallucinations and why do they happen?

Hallucinations are outputs from LLMs and generative AI that look coherent but are wrong or absurd. They may come from errors or gaps in the training data (that “garbage in, garbage out” saw). For example, a model may be trained on internet sources like Quora posts or Reddit, which may have inaccuracies. (Check out this Washington Post article to see how both of those sources were used to develop Google’s C4 dataset, which has been used to train many large language models.)

But just as importantly, hallucinations may arise from the nature of the task we are giving to the model. The objective during text generation is to produce human-like, coherent and contextually relevant responses, but the model does not check responses for truth. And simply asking the model if its responses are accurate is not sufficient.

In the legal research context, we see a few different types of hallucinations: 

  • Citation hallucinations. Generative AI citations to authority typically look extremely convincing, following the citation conventions fairly well, and sometimes even including papers from known authors. This presents a challenge for legal readers, as they might evaluate the usefulness of a citation based on its appearance—assuming that a correctly formatted citation from a journal or court they recognize is likely to be valid.
  • Hallucinations about the facts of cases. Even when a citation is correct, the model might not correctly describe the facts of the case or its legal principles. Sometimes, it may present a plausible but incorrect summary or mix up details from different cases. This type of hallucination poses a risk to legal professionals who rely on accurate case summaries for their research and arguments.
  • Hallucinations about legal doctrine. In some instances, the model may generate inaccurate or outdated legal doctrines or principles, which can mislead users who rely on the AI-generated content for legal research. 

In my own experience, I’ve found that hallucinations are most likely to occur when the model does not have much in its training data that is useful to answer the question. Rather than telling me the training data cannot help answer the question (similar to a “0 results” message in Westlaw or Lexis), the generative AI chatbots seem to just do their best to produce a plausible-looking answer. 

This does seem to be what happened to the attorneys in Mata v. Avianca. They did not ask the model to answer a legal question, but instead asked it to craft an argument for their side of the issue. Rather than saying that argument would be unsupported, the model dutifully crafted an argument, and used fictional law since no real law existed.

How are vendors and law firms addressing hallucinations?

Several vendors have released specialized legal research products based on generative AI, such as LawDroid’s CoPilot, Casetext’s CoCounsel (since acquired by Thomson Reuters), and the mysterious (at least to academic librarians like me who do not have access) Harvey. Additionally, an increasing number of law firms, including Dentons, Troutman Pepper Hamilton Sanders, Davis Wright Tremaine, and Gunderson Dettmer Stough Villeneuve Franklin & Hachigian, have developed their own chatbots that allow their internal users to query the knowledge of the firm to answer questions.

Although vendors and firms are often close-lipped about how they have built their products, we can observe a few techniques that they are likely using to limit hallucinations and increase accuracy.

First, most vendors and firms appear to be using some form of retrieval-augmented generation (RAG). RAG combines two processes: information retrieval and text generation. The system takes the user’s question and passes it (perhaps with some modification) to a database of trusted sources. Relevant passages or snippets are then pulled from the search results and fed to the model as “context” along with the user’s question, and the model generates its answer from that context.

This reduces hallucinations, because the model receives instructions to limit its responses to the source documents it has received from the database. Several vendors and firms have said they are using retrieval-augmented generation to ground their models in real legal sources, including Gunderson, Westlaw, and Casetext.
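
To make this concrete, here is a minimal sketch of that retrieve-then-generate loop. The helper names (search_case_law, extract_relevant_passage, call_llm) are hypothetical placeholders, not any vendor’s actual code; the point is the shape of the process and the grounding instruction in the prompt.

def retrieval_augmented_answer(question: str) -> str:
    # Step 1: retrieval. Send the question (perhaps rewritten) to a database
    # of real legal sources. search_case_law is a hypothetical placeholder.
    documents = search_case_law(question, limit=20)

    # Step 2: pull the passages most relevant to the question out of the results.
    passages = [extract_relevant_passage(doc, question) for doc in documents]

    # Step 3: generation. Hand the passages to the model as context, with an
    # instruction to answer only from those sources -- the grounding that
    # reduces hallucinations.
    prompt = (
        "Answer the question using only the numbered sources below. "
        "If the sources do not answer it, say so.\n\n"
        + "\n\n".join(f"[{i+1}] {p}" for i, p in enumerate(passages))
        + f"\n\nQuestion: {question}"
    )
    return call_llm(prompt)  # hypothetical call to the underlying model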

To enhance the precision of the retrieved documents, some products may also use vector embedding. Vector embedding is a way of representing words, phrases, or even entire documents as numerical vectors. The beauty of this method lies in its ability to identify semantic similarities. So, a query about “contract termination due to breach” might yield results related to “agreement dissolution because of violations”, thanks to the semantic nuances captured in the embeddings. Using vector embedding along with RAG can provide relevant results, while reducing hallucinations.
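
As a rough illustration (not how any particular vendor does it), here is a sketch using the open-source sentence-transformers library: semantically related phrases end up as nearby vectors, and cosine similarity makes that closeness measurable.

from sentence_transformers import SentenceTransformer
import numpy as np

# Any embedding model would work; this small open-source one is just for illustration.
model = SentenceTransformer("all-MiniLM-L6-v2")

def cosine_similarity(a, b):
    # 1.0 = pointing in the same direction (very similar); near 0 = unrelated.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = model.encode("contract termination due to breach")
similar = model.encode("agreement dissolution because of violations")
unrelated = model.encode("speed limits on interstate highways")

print(cosine_similarity(query, similar))    # noticeably higher
print(cosine_similarity(query, unrelated))  # noticeably lower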

Another approach vendors can take is to develop specialized models trained on narrower, domain-specific datasets. This can help improve the accuracy and relevance of the AI-generated content, as the models would be better equipped to handle specific legal queries and issues. Focusing on narrower domains can also enable models to develop a deeper understanding of the relevant legal concepts and terminology. This does not appear to be what law firms or vendors are doing at this point, based on the way they are talking about their products, but there are law-specific data pools becoming available so we may see this soon.

Finally, vendors may fine-tune their models by providing human feedback on responses, either in-house or through user feedback. By providing users with the ability to flag and report hallucinations, vendors can collect valuable information to refine and retrain their models. This constant feedback mechanism can help the AI learn from its mistakes and improve over time, ultimately reducing the occurrence of hallucinations.

So, hallucinations are fixed?

Even though vendors and firms are addressing hallucinations with technical solutions, it does not necessarily mean that the problem is solved. Rather, it may be that our quality control methods will shift.

For example, instead of wasting time checking each citation to see if it exists, we can be fairly sure that the cases produced by legal research generative AI tools do exist, since they are found in the vendor’s existing database of case law. We can also be fairly sure that the language they quote from the case is accurate. What may be less certain is whether the quoted portions are the best portions of the case and whether the summary reflects all relevant information from the case. This will require some assessment of the various vendor tools.

We will also need to pay close attention to the database results that are fed into retrieval-augmented generation. If those results don’t reflect the full universe of relevant cases, or contain material that is not authoritative, then the answer generated from those results will be incomplete. Think of running an initial Westlaw search, getting 20 pretty good results, and then basing your answer only on those 20 results. For some questions (and searches), that would be sufficient, but for more complicated issues, you may need to run multiple searches, with different strategies, to get what you want.

To be fair, the products do appear to be running multiple searches. When I attended the rash of AI presentations at AALL over the summer, I asked Jeff Pfeiffer of Lexis how he could be sure that the model had all relevant results, and he mentioned that the model sends many, many searches to the database, not just one. That gives some comfort, but it leads me to the next point of quality control.
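
Here is a rough sketch of what “many, many searches” might look like under the hood. The helpers (rewrite_query, search_case_law) are hypothetical placeholders, and real products almost certainly do something more sophisticated, but the idea is simply to pool de-duplicated results from several reformulations of the question before generating an answer.

def gather_context(question: str) -> list:
    # Hypothetical: ask the model for several reformulations of the question
    # (synonyms, broader terms, jurisdiction-specific phrasings, etc.).
    queries = [question] + rewrite_query(question, n_variants=5)

    seen_ids, pooled = set(), []
    for q in queries:
        for doc in search_case_law(q, limit=20):  # hypothetical search call
            if doc.id not in seen_ids:            # de-duplicate across searches
                seen_ids.add(doc.id)
                pooled.append(doc)
    # The pooled results become the context for retrieval-augmented generation;
    # if this pool misses relevant cases, the answer will too.
    return pooled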

We will want to have some insight into the searches that are being run, so that we can verify that they are asking the right questions. From the demos I’ve seen of CoCounsel and Lexis+ AI, this is not currently a feature. But it could be. For example, the AI assistant from scite (an academic research tool) sends searches to academic research databases and (seemingly using RAG and other techniques to analyze the search results) produces an answer. They also give a mini-research trail, showing the searches that are being run against the database and then allowing you to adjust if that’s not what you wanted.

scite AI Assistant Sample Results
scite AI Assistant Settings

Are there uses for generative AI where the risks presented by hallucinations are lessened?

The other good news is that there are plenty of tasks we can give generative AI for which hallucinations are less of an issue. For example, CoCounsel has several other “skills” that do not depend upon accuracy of legal research, but are instead ways of working with and transforming documents that you provide to the tool.

Similarly, even working with a generally applicable tool such as ChatGPT, there are many applications that do not require precise legal accuracy. There are two rules of thumb I like to keep in mind when thinking about tasks to give to ChatGPT: (1) could this information be found via Google? and (2) is a somewhat average answer ok? (As one commentator memorably put it, “Because [LLMs] work by predicting the most statistically likely word in a sentence, they churn out average content by design.”)

For most legal research questions, we could not find an answer using Google, which is why we turn to Westlaw or Lexis. But if we just need someone to explain the elements of breach of contract to us, or come up with hypotheticals to test our knowledge, it’s quite likely that content like that has appeared on the internet, and ChatGPT can generate something helpful.

Similarly, for many legal research questions, an average answer would not work, and we may need to be more in-depth in our answers. But for other tasks, an average answer is just fine. For example, if you need help coming up with an outline or an initial draft for a paper, there are likely hundreds of samples in the data set, and there is no need to reinvent the wheel, so ChatGPT or a similar product would work well.

What’s next?

In the coming months, as legal research generative AI products become increasingly available, librarians will need to develop methods for assessing their accuracy. Currently, there appear to be no benchmarks to compare hallucinations across platforms. Knowing librarians, that won’t be the case for long, at least with respect to legal research.

Further reading

If you want to learn more about how retrieval augmented generation and vector embedding work within the context of generative AI, check out some of these sources:

Gesundheit ChatGPT! Flu Snot Prompting?

Somewhat recently, during a webinar on generative AI, when the speaker Joe Regalia mentioned “flu snot” prompting, I was momentarily confused. What was that? Flu shot? Flu snot? I rewound a couple of times until I figured out he was saying “few shot” prompting. Looking for some examples of few-shot learning in the legal research/writing context, I Googled around and found his excellent article entitled ChatGPT and Legal Writing: The Perfect Union on the write.law website.

What Exactly is Few Shot Prompting?

It turns out that few-shot prompting is a technique for improving the performance of chatbots like ChatGPT by supplying a small set of examples (a few!) to guide its answers. This involves offering the AI several prompts with corresponding ideal responses, allowing it to generate more targeted and customized outputs. The purpose of this approach is to provide ChatGPT (or other generative AI) with explicit examples that reflect your desired tone, style, or level of detail.

Legal Research/Writing Prompting Advice from write.law

To learn more, I turned to Regalia’s detailed article which provides his comprehensive insights into legal research/writing prompts and illuminates various prompting strategies, including:

Zero Shot Learning/Prompting

This refers to a language model’s ability to tackle a novel task with no examples supplied, relying only on its linguistic comprehension and what it learned during pre-training. GPT performs well on many zero-shot tasks thanks to its broad pre-training. (Perhaps unsurprisingly, one-shot learning involves providing the system with just one example.)

Few-Shot Learning/Prompting

Few-shot learning involves feeding GPT several illustrative prompts and responses that echo your desired output. These guiding examples wield more influence than mere parameters because they offer GPT a clear directive of your expectations. Even a singular example can be transformative in guiding its responses.

As an example of few-shot learning, he explains that if you want ChatGPT to improve verbs in your sentence, you can supply a few examples in a prompt like the following:

My sentence: The court issued a ruling on the motion. 
Better sentence: The court ruled on the motion. 
My sentence: The deadline was not met by the lawyers. 
Better sentence: The lawyers missed the deadline. 
My sentence: The court’s ruling is not released. [now enter the sentence you actually want to improve, hit enter, and GPT will take over]
[GPT’s response] Better sentence: The court has not ruled yet [usually a much-improved version, but you may need to follow up with GPT a few times to get great results like this]
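
The same few-shot pattern can also be set up programmatically, by alternating example “user” and “assistant” messages before the sentence you actually want improved. Here is a minimal sketch, assuming the openai Python client (v1-style) and a placeholder model name; the examples mirror the prompt above.

from openai import OpenAI

client = OpenAI()  # assumes your API key is configured in the environment

# The "shots": example sentences paired with the ideal rewrites.
examples = [
    ("The court issued a ruling on the motion.", "The court ruled on the motion."),
    ("The deadline was not met by the lawyers.", "The lawyers missed the deadline."),
]

messages = [{"role": "system",
             "content": "Rewrite each sentence with stronger, more direct verbs."}]
for original, improved in examples:
    messages.append({"role": "user", "content": f"My sentence: {original}"})
    messages.append({"role": "assistant", "content": f"Better sentence: {improved}"})

# Now the sentence we actually want improved.
messages.append({"role": "user",
                 "content": "My sentence: The court's ruling is not released."})

response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
print(response.choices[0].message.content)  # e.g., "Better sentence: The court has not ruled yet."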

And Much More Prompting Advice!

Regalia’s website offers an abundance of insights, as you can see from the extensive list of topics covered in his article. Get background information on how generative AI systems operate, and dive into subjects like chain-of-thought prompting, assigning roles to ChatGPT, using parameters, and much more.

  • What Legal Writers Need to Know About GPT
  • Chat GPT’s Strengths Out of the Box
  • Chat GPTs Current Weaknesses and Limitations
  • Getting Started with Chat GPT
  • Prompt Engineering for Legal Writers
  • Legal Writing Prompts You Can Use with GPT
  • Using GPT to Improve Your Writing
  • More GPT Legal Writing Examples for Inspiration
  • Key GPT Terms to Know
  • Final Thoughts for GPT and Legal Writers

Experimenting With Few-Shot Prompting Before I Knew the Name

Back in June 2023, I first started dabbling in few-shot prompting without even knowing it had a name, after I came across a Forbes article titled Train ChatGPT To Write Like You In 5 Easy Steps. Intrigued, I wondered if I could use this technique to easily generate a profusion of blog posts in my own personal writing style!!

I followed the article’s instructions, copying and pasting a few of my favorite blog posts into ChatGPT to show it the tone and patterns in my writing that I wanted it to emulate. The result was interesting, but in my humble opinion, the chatty chatbot failed to pick up on my convoluted conversational (and to me, rather humorous) approach. They say that getting good results from generative AI is an iterative process, so I repeatedly tried to convey that I am funny using a paragraph from a blog post:

  • Prompt: Further information. I try to be funny. Here is an example: During a text exchange with my sister complaining about our family traits, I unthinkingly quipped, “You can’t take the I out of inertia.” Lurching sideways in my chair, I excitedly wondered if this was only an appropriate new motto for the imaginary Gotschall family crest, or whether I had finally spontaneously coined a new pithy saying!? Many times have I Googled, hoping in vain, and vainly hoping, to have hit upon a word combo unheard of in Internet history and clever/pithy enough to be considered a saying, only to find that there’s nothing new under the virtual sun.

Fail! Sadly, my efforts were to no avail; it just didn’t sound much like me… (However, that didn’t stop me from asking ChatGPT to write a conclusion for this blog post!)

Conclusion

For those keen to delve deeper into the intricacies of legal research, writing, and the intersection with AI, checking out the resources on write.law is a must. The platform offers a wealth of information, expert insights, and practical advice that can be immensely valuable for both novices and seasoned professionals.

Why Law Librarians?

Some of you reading this may be skeptical that these new AI technologies are 1) within your skillset and/or 2) worth the effort to learn. I’m the congenital optimist who is here to win you over. These tools are on the verge of revolutionizing the field of law (once they get out of their prototype phase) and I can’t think of a better group of people on law school campuses, in government organizations, and in law firms to evaluate and implement these technologies. Law Librarians (traditionally) have two crucial skill sets that make us well-suited to take the lead here:

  • We understand how information is organized and
  • We understand how information is used in the research and practice of law.

David Shapiro is an AI YouTuber with ~70k subscribers who develops and trains LLMs from scratch. Do you see what he lists as the number one discipline people need to learn to use these tools? Computer science skills rank third on his list, while “Librarianship and Information Science” sits at #1.

This dude gets it.

Many of the tips that David Shapiro provides in that video for people creating custom LLMs will be absolutely obvious to law librarians because we live and breathe these every day at our jobs: taxonomies, data organization, “source of truth,” etc. Whether in the tech services department or research instruction, we are well-versed in organizing and finding information.

We already have many of the data structures in place that could be easily used by these technologies. Besides constructing the initial models, our role will be pivotal in continuously updating and assessing their effectiveness. Moreover, we will provide vital guidance on the proper utilization of these tools.

Does this list look like something your Technical Services department does? Can you think of anyone else in your organization who would be better at making knowledge graphs, indexes, or tables of contents for legal materials? Who would be better suited than your Research and Instruction team to teach newcomers how to interact with these tools to get the information that they need? Who in your organization is best positioned to teach (or already teaches) information literacy? I would argue that nobody can do it better than law librarians (not even computer science people).

Now What?

Let’s mobilize a push to collaborate on these tools. We need to get groups of law librarians together who are interested in rolling up their sleeves and digging into the nitty-gritty of creating, auditing, and using LLMs. I am a member of LIT-SIS in AALL and maybe we need a special caucus to address this specific technology. Additionally, we can get consortiums of schools together in each state to develop our own LLMs – outside of the subscription-based products that will roll out for Lexis and Westlaw. Anything we build ourselves will have the needs of our community at the forefront. We can build in all of the transparency, privacy, and accuracy that may be lacking in commercial models. Schools can build tools that would not be commercially viable at firms. Firms and courts could build specialized tools to achieve their unique workflows. It opens up many options that are not available if we’re stuck with the one-size-fits-all nature of Lexis and Westlaw subscriptions.

There are open-source models that come close to competing with GPT-4 (ChatGPT’s underlying model), and new models show up every day.

There are many options to create, train, and locally run custom LLMs as long as you have the data. As David Shapiro said in the video, “data is the oil of the information age,” and law libraries are deep wells of the type of data that could be used to accurately train these services. Additionally, when you are locally hosting an LLM, many of the concerns surrounding privacy, permissions, and student data completely evaporate because you are in control of what information is being sent and stored.

To do all of this, we need organization, collaboration, and funding. Individually this could be difficult but if we band together in consortium, we can get a lot done.

Students

Students are an incredible resource in this area. Many of them come to law school with computer science and data science backgrounds and can help with the creation and development of these models. They need mentors and organizers to help focus their efforts, provide resources, and nurture their creativity. In addition, they provide a deep reservoir of diverse voices and experiences that may not occur to people who have spent decades in academia, the public sector, or law firms. We can bring in students to have competitions to create their own LLM apps for law practice and access to justice initiatives. We can fund fellowships to do work at schools, courts, and firms. We can bring them under our wing to usher in the next generation of tech-savvy law librarians. We can leverage the excitement and energy associated with these new tools to attract new talent into our field – I skimmed TikTok, and the #ChatGPT hashtag has around 7.7 billion views. To do that, we need to brainstorm together so that we can get these programs in place.

In Sum

As the torchbearers in this promising venture, it’s time for us, the law librarians, to step up and show the world our unmatched prowess in harnessing the potential of LLMs in law, weaving our expert knowledge in information science, law, and emerging technology. Let us band together, utilizing the rich data reserves at our disposal, and carve out a future where legal technology is not just efficient and transparent, but also a collaborative masterpiece fostered by our relentless pursuit of innovation and excellence.