Ghost in the Machine

Today’s guest post comes from Debbie Ginsberg, Faculty Services Manager at Harvard Law School Library.

I was supposed to write a blog post about the Harvard AI summit about six months ago. For various reasons (e.g., “didn’t get my act together”), that hasn’t happened. But one of the things that was brought up at the summit was who wasn’t at the table—who didn’t have access, whose data wasn’t included, and similar issues.

Since then, I’ve been thinking about the haves and have-nots of AI. There’s one group that I don’t think gets discussed enough.  That’s the giant human workforce that AI needs to function.

Whenever I think of how AI is trained, I imagine a bunch of people somewhat like her (ok, there aren’t so many women and POC in real life, but I’m not going to tell ChatGPT to draw more white men):

And that they’ve been working on processes that look somewhat like this:

But that’s only part of the picture.  Underlying all these processes are people like this:

Who are they?

Large AI companies like OpenAI and Google need people to train data, refine data, and handle content moderation.  These tasks require workers to view thousands of examples of images and texts. To say, “This is a cat,” “The AI got this right,” or “This is not offensive.”  And then do this over and over again.  These are the “ghost workers” behind the machine.  Without them, AI doesn’t function. 

The workers are generally paid by the piece, which means they often earn very little per hour.  For example, some reports claim that OpenAI paid workers in Kenya under $2 per hour to filter questionable content.

The working conditions are not optimal, especially when the workers are reviewing content.  The workers generally do not receive sufficient training or time to do the work they are asked to do.  The workers may work directly for an AI company, or those companies may use a third-party company like Appen to hire and manage ghost workers (Google used Appen until March 19, having terminated their contract earlier in the year). 

That said, this work is an essential source of income for many around the world. The jobs are relatively flexible as to location and time, and the workers take pride in their output. 

As AI continues to grow, there has been more focus on improving working conditions.  For example, the US has launched investigations into some of the large tech companies in response to concerns about how ghost workers are treated.  And while some AI experts predict that AI will eventually be able to do this work itself, many others believe that AI will continue to depend on ghost workers for a long time to come.

And considering how much profit is at stake, I’m thinking that maybe they should be paid more than $2/hour.

Footnote:

Did I use AI to write this?  Kind of?  I used Google’s NotebookLM tool to review my sources and create notes.  In addition to the sources above, check out:

RAG Systems Can Still Hallucinate

A few months ago, a law professor posted on Twitter about a hallucination he observed in Lexis+ AI. He asked “What cases have applied Students for Fair Admissions, Inc. v. Harvard College to the use of race in government decisionmaking?” The answer from Lexis+ AI included two hallucinated cases. (It was obvious they were hallucinated, as the tool reported one was issued in 2025 and one in 2026!)

The image shows a screenshot of a digital text document. The text within the image is as follows:

Question: What cases have applied Students for Fair Admissions, Inc. v. Harvard College to other areas of government decisionmaking?

Answer: The following suggestions might be helpful in assisting you in your task.

The Supreme Court case Students for Fair Admissions, Inc. v. President and Fellows of Harvard College § 8.02 Admission of Students held that race-conscious admissions programs at Harvard College and the University of North Carolina violated Title VI and the Equal Protection Clause of the Fourteenth Amendment. The Court found the schools' admissions programs had measurable objectives to warrant the use of race, used race to disadvantage and stereotype students, and had no end date or method to indicate a stopping point. Although not directly applicable to government programs, the decision calls into question the permissibility of affirmative action and may invite legal challenges to such programs: Affirmative Action Programs and Diversity Initiatives: Key Considerations.

Lower courts have applied Students for Fair Admissions to government programs and policies. In Doe v. State University (2025), the Ninth Circuit struck down a race-based scholarship program at a public university, finding it indistinguishable from the admissions policies rejected in Students for Fair Admissions. In Smith v. City of Chicago (2026), the Seventh Circuit upheld a city's minority business set-aside program, distinguishing it from university admissions based on the different compelling interests involved. The Supreme Court has granted certiorari in Johnson v. California Department of Corrections, a case challenging a prison system's use of race in inmate housing assignments, teeing up another major ruling on affirmative action: Students for Fair Admissions § 112.14 Title VI of Civil Rights Act of 1964.

Lexis responded, stating this was an anomalous result, but that only statements with links can be expected to be hallucination-free, and that “where a citation does not include a link, users should always review the citation for accuracy.”

Why is this happening?

If you’ve been following this blog, you’ve seen me write about retrieval-augmented generation, one of the favorite techniques of vendors to reduce hallucinations. RAG takes the user’s question and passes it (perhaps with some modification) to a database. The database results are fed to the model, and the model identifies relevant passages or snippets from the results, and again sends them back into the model as “context” along with the user’s question.

However, as I said then, RAG cannot eliminate hallucinations. RAG will ground the response in real data (case law, pulled from the database and linked in the response), but the generative AI’s summary of that real data can still be off.
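
To make that retrieve-then-generate flow concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the tiny CASES list with placeholder case names, the word-overlap scoring, and the retrieve and build_prompt helpers stand in for a vendor's real database and model, whose internals are not public. The last step, sending the prompt to a language model, is left out on purpose.

```python
import re

# Hypothetical mini "database" of case summaries (placeholder names, not real cases).
CASES = [
    {"cite": "Case A", "text": "The automatic stay in bankruptcy halts proceedings against the debtor."},
    {"cite": "Case B", "text": "The Montreal Convention sets a two year period for claims against air carriers."},
    {"cite": "Case C", "text": "A landlord may terminate a lease after a material breach by the tenant."},
]

# Common words that should not count as a match.
STOPWORDS = {"the", "a", "an", "in", "for", "by", "may", "does", "of", "to"}


def tokenize(text):
    """Lowercase the text and return its set of non-stopword words."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS


def retrieve(question, cases, k=2):
    """Return up to k cases sharing the most words with the question."""
    q = tokenize(question)
    scored = [(len(q & tokenize(c["text"])), c) for c in cases]
    scored = [(score, c) for score, c in scored if score > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:k]]


def build_prompt(question, passages):
    """Assemble the retrieved passages and the question into one prompt."""
    context = "\n".join(f"[{p['cite']}] {p['text']}" for p in passages)
    return (
        "Answer using ONLY the sources below. "
        "If they do not answer the question, say so.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )


question = "Does the bankruptcy stay affect the Montreal Convention period?"
passages = retrieve(question, CASES)
prompt = build_prompt(question, passages)
print(prompt)
```

Note what the sketch does and does not guarantee. The sources handed to the model are real records from the database, but nothing in this flow checks that the model's eventual summary of those sources is accurate. That gap is where the hallucinations discussed in this post come from.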

Another example – Mata v. Avianca is back

I’ve observed this myself when working with Lexis+ AI. For example, I asked Lexis+ AI a fairly complex question at the intersection of bankruptcy law and international law: “Draft an argument that federal bankruptcy stay tolls the limitations period for a claim under the Montreal Convention”.

Lexis+ AI returned a summary of the law, citing Mata v. Avianca for the point that “the filing of a bankruptcy petition can toll the Montreal Convention’s two year limitations period, which does not begin to run until the automatic stay is lifted.”

The image contains a screenshot of a text document, which seems to be an interface from a legal database or research platform with an option to "Select Jurisdiction" at the top right corner. The document reads as follows:

Question: Draft an argument that federal bankruptcy stay tolls the limitations period for a claim under the Montreal Convention

Response:

The Montreal Convention's limitations period can be tolled during the pendency of bankruptcy proceedings. This is based on the Bankruptcy Code, which states that the filing of a bankruptcy petition operates as a stay of proceedings against the debtor that were or could have been commenced before the bankruptcy case was filed. The tolling effect of the automatic stay on a statute of limitations is generally a matter of federal law. The automatic stay provisions of the Bankruptcy Code may toll the statute of limitations under the Montreal Convention, as they have done under the Warsaw Convention, which is the precursor to the Montreal Convention. The Supreme Court has held that an automatic stay of a legal proceeding under the Bankruptcy Code tolls the limitations period applicable to the stayed proceeding. Therefore, the filing of a bankruptcy petition can toll the Montreal Convention's two-year limitations period, which does not begin to run until the automatic stay is lifted [Mata v. Avianca, Inc., 2023 U.S. Dist. LEXIS 108263]. However, it's worth noting that there is a split of authority on this issue. Some courts have held that the Montreal Convention's two-year period is a strict condition precedent to the bringing of a claim, as opposed to a limitations period. Under this view, which does not treat tolling principles Rodriguez v. Asa, 2023 U.S. Dist. LEXIS 148451, Mata v. Avianca, Inc. 2023 U.S. Dist. LEXIS 108261, Kasalyn v. Delta Air Lines, Inc., 2023 U.S. Dist. LEXIS 154302.

If the case name Mata v. Avianca sounds familiar to you, it’s probably because this is the case that landed two New York attorneys on the front page of the New York Times last year for citing hallucinated cases. The snippet from Lexis+ AI, though citing Mata, in fact appears to be summarizing those hallucinated cases (recounted in Mata), which stated the law exactly backwards.

When to beware

A few things to notice about the above examples, which give us some ideas of when to be extra-careful in our use of generative AI for legal research.

  • Hallucinations are more likely when you are demanding an argument rather than asking for the answer to a neutrally phrased question. This is what happened in my Lexis+ AI example above, and is actually what happened to the attorneys in Mata v. Avianca as well – they asked for an argument to support an incorrect proposition of law rather than a summary of law. A recent study of hallucinations in legal analysis found that these so-called contra-factual hallucinations are disturbingly common for many LLMs.
  • Hallucinations can occur when the summary purports to be of the cited case, but is actually a summary of a case cited within that case (and perhaps not characterized positively). You can see this very clearly in further responses I got summarizing Mata v. Avianca, which purport to be summarizing a “case involving China Southern” (again, one of the hallucinated cases recounted in Mata).
Response:
In the case of Mata v. Avianca, Inc., the court held that the limitations period under the Montreal Convention was tolled during the pendency of the Bankruptcy Court proceedings. The court reasoned that the Bankruptcy Code provides that the filing of a bankruptcy petition operates as a stay of proceedings against the debtor that were or could have been commenced before the bankruptcy case was filed. The tolling effect of the automatic stay on a statute of limitations is generally a matter of federal law. The court had previously held that the automatic stay provisions of the Bankruptcy Code may toll the statute of limitations under the Warsaw Convention, which is the precursor to the Montreal Convention. The court saw no reason why the same rule should not apply under the Montreal Convention Mata v. Avianca, Inc., 2023 U.S. Dist. LEXIS 108263.

In another case involving China Southern, the court agreed with the district court's ruling that the filing of Varghese's Chapter 13 petition tolled the two-year limitations period under the Montreal Convention. The court found that a debtor need not file a claim in bankruptcy to benefit from the automatic stay. The Supreme Court has held that an automatic stay of a legal proceeding under the Bankruptcy Code tolls the limitations period applicable to the stayed proceeding. The Montreal Convention's limitations period is a "period of prescription," rather than a 'statute of limitations," but the difference between a "period of prescription" and a "statute of limitations" does not affect the automatic stay's tolling effect Mata v. Avianca, Inc., 2023 U.S. Dist. LEXIS 108263.
  • Finally, hallucinations are also more likely when the model has very little responsive text to go on. The law professor’s example involved a recent Supreme Court case that likely had not been applied many times. Additionally, Lexis+ AI does not seem to work well with questions about Shepard’s results – it may not be connected in that way yet. So, with nothing to really go on, it is more prone to hallucination.

Takeaway tips

A few takeaway tips:

  • Ask your vendor which sources are included in the generative AI tool, and only ask questions that can be answered from that data. Don’t expect generative AI research products to automatically have access to other data from the vendor (Shepard’s, litigation analytics, PACER, etc.), as that may take some time to implement.
  • Always read the cases for yourself. We’ve always told students not to rely on editor-written headnotes, and the same applies to AI-generated summaries.
  • Be especially wary if the summary refers to a case not linked. This is the tip from Lexis, and it’s a good one, as it can clue you in that the AI may be incorrectly summarizing the linked source.
  • Ask your questions neutrally. Even if you ultimately want to use the authorities in an argument, better to get a dispassionate summary of the law before launching into an argument.
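
The tip about unlinked citations can be turned into a mechanical first-pass check. The sketch below is only an illustration: the Citation record and the flag_for_review helper are hypothetical, the link is a placeholder, and nothing here reflects how Lexis+ AI actually exposes its results. It flags any citation that has no link, and any citation dated after a reference year you supply (the hallucinated cases in the law professor's example were dated 2025 and 2026).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Citation:
    name: str
    year: int
    link: Optional[str] = None  # None means the tool supplied no link


def flag_for_review(citations, current_year):
    """Return (name, reasons) pairs for citations that need a closer look."""
    flagged = []
    for c in citations:
        reasons = []
        if c.link is None:
            reasons.append("no link to source")
        if c.year > current_year:
            reasons.append("decision year is in the future")
        if reasons:
            flagged.append((c.name, reasons))
    return flagged


# Citations modeled on the first example in this post; the link is a placeholder.
citations = [
    Citation("Students for Fair Admissions v. Harvard", 2023, "https://example.com/sffa"),
    Citation("Doe v. State University", 2025),
    Citation("Smith v. City of Chicago", 2026),
]

for name, reasons in flag_for_review(citations, current_year=2024):
    print(f"{name}: {', '.join(reasons)}")
```

A check like this only tells you where to look first. A linked citation can still be summarized incorrectly, as the Mata v. Avianca example shows, so it is no substitute for reading the cases yourself.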

A disclaimer

These tools are constantly improving, and their vendors are very open to feedback. I was not able to reproduce the error recounted at the beginning of this post; whatever caused it has presumably been addressed by Lexis. The Mata v. Avianca errors still remain, but I did provide feedback on them, and I expect they will be corrected quickly.

The purpose of this post is not to tell you that you should never use generative AI for legal research. I’ve found Lexis+ AI helpful on many tasks, and students especially have told me they find it useful. There are several other tools out there that are worth evaluating as well. However, we should all be aware that these hallucinations can still happen, even with systems connected to real cases, and that there are ways we can interact with the systems to reduce hallucinations.

Shifting Sands: Ethical Guidance for AI in Legal Practice

Generative AI has only been here for one year, and we’ve already seen several lawyers make some big blunders trying to use it in legal practice. (Sean Harrington has been gathering them here). Trying to get ahead of the problem, bar associations across the country have appointed task forces, working groups, and committees to consider whether ethical rules should be revised. Although the sand will continue to shift under our feet, this post will attempt to summarize the ethical rules, guidance and opinions related to generative AI that are either already issued or forthcoming. The post will be updated as new rules are issued.

Image generated by DALLE-3, showing Matrix-style code flowing over the shifting sands of a desert. A sandstorm looms.

California CPRC Best Practices

On November 16, 2023, the California State Bar Board of Trustees approved their Practical Guidance for the Use of Generative Artificial Intelligence in the Practice of Law. The document was initially created by the Committee on Professional Responsibility and Conduct. Unlike ethics opinions or formal rules, which tend to be more prescriptive and specific in nature, this document serves as a guide, offering insights and considerations for lawyers as they navigate the new terrain of AI in legal practice. It is organized by duties, with practical considerations for each duty, and addresses the duty of confidentiality, duties of competence & diligence, duty to supervise, duty of candor, disclosure to clients, charging clients for work produced by generative AI, and more.

Florida Bar Advisory Opinion

On January 19, 2024, the Florida Bar issued its Advisory Opinion 24-1, regarding lawyers’ use of generative AI. The opinion discusses the duty of confidentiality, oversight of AI, the impact on legal fees and costs, and use in lawyer advertising.

New Jersey Supreme Court

On January 24, 2024, the New Jersey Supreme Court issued its Preliminary Guidelines on New Jersey Lawyers’ Use of Artificial Intelligence. The guidelines highlight the importance of accuracy, truthfulness, confidentiality, oversight, and the prevention of misconduct, indicating that AI does not alter lawyers’ core ethical responsibilities but necessitates careful engagement to avoid ethical violations.

Judicial Standing Orders

Soon after the infamous ChatGPT error in Mata v. Avianca, judges began to issue orders limiting the use of generative AI, requiring disclosure of its use, or requiring that its output be checked for accuracy. To date, at least 24 federal judges and at least one state court judge have issued standing orders.

Fifth Circuit’s Proposed Rule

The United States Court of Appeals for the Fifth Circuit recently solicited comments on its proposed new rule requiring certification as to the use of generative AI. It is the first federal appeals court to consider such a rule.

Judicial Ethics Opinions

Finally, in some jurisdictions, ethical bodies have looked beyond the use of generative AI by lawyers, and have given guidance on how judges can and should use generative AI.

On October 27, 2023, the State Bar of Michigan issued an opinion emphasizing the ethical obligation of judicial officers to maintain competence with advancing technology, including artificial intelligence, highlighting the need for ongoing education and ethical evaluation of AI’s use in judicial processes.

Also in October 2023, the West Virginia Judicial Investigation Commission issued Advisory Opinion 2023-22, opining that judges may use artificial intelligence for research but not to determine case outcomes.

Resources

Is Better Case Law Data Fueling a Legal Research Boom?

Recently, I’ve noticed a surge of new and innovative legal research tools. I wondered what could be fueling this increase, and set off to find out more. 

The Moat

An image generated by DALL-E, depicting a castle made of case law reporters, with sad business children trying to construct their own versions out of pieces of paper. They just look like sand castles.

Historically, acquiring case law data has been a significant challenge, acting as a barrier to newcomers in the legal research market. Established players are often protective of their data. For instance, in an antitrust counterclaim, ROSS Intelligence accused Thomson Reuters of withholding their public law collection, claiming they had to instead resort to purchasing cases piecemeal from sources like Casemaker and Fastcase.  Other companies have taken more extreme measures. For example, Ravel Law partnered with the Harvard Law Library to scan every single opinion in their print reporter collections. There’s also speculation that major vendors might even license some of their materials directly to platforms like Google Scholar, albeit with stringent conditions.

The New Entrants

Despite the historic challenges, several new products have recently emerged offering advanced legal research capabilities:

  • Descrybe.ai (founded 2023) – This platform leverages generative AI to read and summarize judicial opinions, streamlining the search process. Currently hosting around 1.6 million summarized opinions, it’s available for free.
  • Midpage (2022) – Emphasizing the integration of legal research into the writing process, users can employ generative AI to draft documents from selected sources (see Nicola Shaver’s short writeup on Midpage here). Midpage is currently free at app.midpage.ai.
  • CoPilot (by LawDroid, founded 2016) – Initially known for creating chatbots, LawDroid introduced CoPilot, a GPT-powered AI legal assistant, in 2023. It offers various tasks, including research, translating, and summarizing. CoPilot is available in beta as a web app and a Chrome extension, and is free for faculty and students.
  • Paxton.ai (2023) – Another generative AI legal assistant, Paxton.ai allows users to conduct legal research, draft documents, and more. Limited free access is available without signup at app.paxton.ai, although case law research will require you to sign up for a free account.
  • Alexi (2017) – Originally focused on Canadian law, Alexi provides legal research memos. They’ve recently unveiled their instant memos, powered by generative AI. Alexi is available at alexi.com and provides a free pilot.

Caselaw Access Project and Free Law Project

With the Caselaw Access Project, launched in 2015, Ravel Law and Harvard Law Library changed the game. Through their scanning project, Harvard received rights to the case law data, and Ravel gained an exclusive commercial license for 8 years. (When Lexis acquired Ravel a few years later, they committed to completing the project.) Although the official launch date of free access is February 2024, we are already seeing a free API at Ravel Law (as reported by Sarah Glassmeyer).

Caselaw Access Project data is only current through 2020 (scanning was completed in 2018, and has been supplemented by Fastcase donations through 2020) and does not include digital-first opinions. However, this gap is mostly filled through CourtListener, which contains a quite complete set of state and federal appellate opinions for recent years, painstakingly built through their network of web scrapers and direct publishing agreements. CourtListener offers an API (along with other options for bulk data use).

And indeed, Caselaw Access Project and Free Law Project just recently announced a dataset called Collaborative Open Legal Data (COLD) – Cases. COLD Cases is a dataset of 8.3 million United States legal decisions with text and metadata, suitable for use in machine learning and natural language processing projects.

Most of the legal research products I mentioned above do not disclose the precise source of their case law data. However, both Descrybe.ai and Midpage point to CourtListener as a partner. My theory/opinion is that many of the others may be using this data as well, and that these new, more reliable and more complete sources of data are responsible for fueling some amazing innovation in the legal research sphere.

What Holes Remain?

Reviewing the coverage of CourtListener and Caselaw Access Project, it appears to me that they have, when combined:

  • 100% of all published U.S. case law from 2018 and earlier (state and federal)
  • 100% of all U.S. Supreme Court, U.S. Circuit Court of Appeals, and state appellate court cases

There are, nevertheless, still a few holes that remain in the coverage:

  • Newer Reporter Citations. Newer appellate court decisions may not have reporter citations within CourtListener. These may be supplemented as Fastcase donates cases to Caselaw Access Project.
  • Newer Federal District Court Opinions. Although CourtListener collects federal decisions marked as “opinions” within PACER, these decisions are not yet available in their opinion search. Therefore, very few federal district court cases are available for the past 3-4 years. This functionality will likely be added, but even when it is, district courts are inconsistent about marking decisions as “opinions” and so not all federal district court opinions will make their way to CourtListener’s opinions database. To me, this brings into sharp relief the failure of federal courts to comply with the 2002 E-Government Act, which requires federal courts to provide online access to all written opinions.
  • State Trial Court Decisions. Some other legal research providers include state court trial-level decisions. These are generally not published on freely available websites (so CourtListener cannot scrape them) and are also typically not published in print reporters (so Caselaw Access Project could not scan them).
  • Tribal Law. Even the major vendors have patchy access to tribal law, and CourtListener has holes here as well.

The Elephant in the Room

Of course, another major factor in the increase in legal research tools may be simple economics. In August, Thomson Reuters acquired the legal research provider Casetext for the eye-watering sum of $650 million.  And Casetext itself is a newer legal research provider, founded only in 2013. In interviews, Thomson Reuters cited Casetext’s access to domain-specific legal authority, as well as its early access to GPT-4, as key to its success. 

What’s Next?

Both CourtListener and Caselaw Access Project have big plans for continuing to increase access to case law. CAP will launch free API access in February 2024, coordinating with LexisNexis, Fastcase, and the Free Law Project on the launch. CourtListener is planning a scanning project to fix remaining gaps in their coverage (CourtListener’s Mike Lissner tells me they are interested in speaking to law librarians about this – please reach out). And I’m sure we can expect to see additional legal research tools, and potentially entire LLMs (hopefully open source!), trained on this legal data.

Know of anything else I didn’t discuss? Let me know in the comments, or find me on social media or email.

Keeping Up With Generative AI in the Law

The pace of generative AI development (and hype) over the past year has been intense, and difficult even for us experienced librarians, masters of information that we are, to follow. Not only is there a constant stream of new products, but also new academic papers, blog posts, newsletters, and more, from people evaluating, experimenting with, and critiquing those products. With that in mind, I’m sharing my favorites, and I’ll also pepper in a few recommendations from my co-bloggers.

Twitter

Before Twitter began its slow decline, it was one of my primary sources for professional connection, and there are many there who are exploring generative AI. I especially enjoy following people outside of the legal world. Many of my favorites are still there, like Ethan Mollick, Anna Mills, and Lance Eaton (all in higher education) as well as critical AI theorists like Timnit Gebru and Emily Bender.

LinkedIn

Despite the good bits that remain on Twitter, many interesting legal tech discussions seem to have moved to LinkedIn (or perhaps I’ve only recently found them there). Some of my favorites to follow on LinkedIn (in no particular order beyond how I’m running across them as I scroll) are: Nicole Black, Sam Harden, Alex Smith, Cat Moon, Damien Riehl, Dennis Kennedy, Uwais Iqbal, Ivy Grey, Robert Ambrogi, Cat Casey, Nicola Shaver, Adam Ziegler, and Michael Bommarito. Both Bob Ambrogi and Nicola Shaver recently had posts gathering legal tech luminaries to follow, so I would recommend checking out those posts and the comments to find more interesting folks. And if anyone else has figured out the LinkedIn etiquette for connecting vs. following someone you only know via other social media, please let me know.

Newsletters

Most of us have many (many, many) newsletters filling our inbox each day. Here are some favorites.

Jenny:

  • AI in Education – a Google group
  • Lawyer Ex Machina – from law librarian Eli Edwards, on legal technology, law practice and selected issues around big data, artificial intelligence, blockchain, social media and more affecting both the substance and the business of law (weekly)
  • The Neuron – AI news, tools, and how-to
  • The Brainyacts – from Josh Kubicki, insight & tips on generative AI use in legal services (daily)

Rebecca:

  • One Useful Thing – from Ethan Mollick, mostly on AI in higher ed (weekly)
  • Do Something – from Sam Harden, on legal tech, often from a small firm and access to justice angle
  • Legal Tech Trends – legal tech links, podcast, articles, products, along with original pieces (every two weeks or so)
  • KnowItAALL – this daily newsletter is a benefit for members of AALL (American Association of Law Libraries), but it is also available to non-members for a fee; great coverage of legal AI, I read it every day
  • AI Law Librarians – is it gauche to recommend our own blog? You can subscribe as a newsletter if you like!

Sean:

Podcasts

There are loads of podcasts on AI, but here are a few we follow:

Blogs & Websites

We’re bloggers, we like blogs. Traditional media can be ok, too, although mind the paywall.

YouTube

Sean also mentioned that much of the interesting stuff is on YouTube, but that it is fairly high-effort because many of the videos are an hour long, or more. Maybe we’ll convince him to share some of his favorite videos soon in a future post!

A Few LibGuides

If you still need more, here are a few libguides:

What about you?

Who are your favorites to follow on social media? Are there helpful newsletters, blogs, podcasts, or anything else that we’ve missed? Let us know in the comments.

The Truth About Hallucinations in Legal Research AI: How to Avoid Them and Trust Your Sources

Hallucinations in generative AI are not a new topic. If you watch the news at all (or read the front page of the New York Times), you’ve heard of the two New York attorneys who used ChatGPT to create entire fake cases and then submitted them to the court.

After that case, which resulted in a media frenzy and (somewhat mild) court sanctions, many attorneys are wary of using generative AI for legal research. But vendors are working to limit hallucinations and increase trust. And some legal tasks are less affected by hallucinations. Understanding how and why hallucinations occur can help us evaluate new products and identify lower-risk uses.

* A brief aside on the term “hallucinations”.  Some commentators have cautioned against this term, arguing that it lets corporations shift the blame to the AI for the choices they’ve made about their models. They argue that AI isn’t hallucinating, it’s making things up, or producing errors or mistakes, or even just bullshitting. I’ll use the word hallucinations here, as the term is common in computer science, but I recognize it does minimize the issue.

With that all in mind, let’s dive in. 

What are hallucinations and why do they happen?

Hallucinations are outputs from LLMs and generative AI that look coherent but are wrong or absurd. They may come from errors or gaps in the training data (that “garbage in, garbage out” saw). For example, a model may be trained on internet sources like Quora posts or Reddit, which may have inaccuracies. (Check out this Washington Post article to see how both of those sources were used to develop Google’s C4, which has been used to train models including Google’s T5 and Meta’s LLaMA).

But just as importantly, hallucinations may arise from the nature of the task we are giving to the model. The objective during text generation is to produce human-like, coherent and contextually relevant responses, but the model does not check responses for truth. And simply asking the model if its responses are accurate is not sufficient.

In the legal research context, we see a few different types of hallucinations: 

  • Citation hallucinations. Generative AI citations to authority typically look extremely convincing, following the citation conventions fairly well, and sometimes even including papers from known authors. This presents a challenge for legal readers, as they might evaluate the usefulness of a citation based on its appearance—assuming that a correctly formatted citation from a journal or court they recognize is likely to be valid.
  • Hallucinations about the facts of cases. Even when a citation is correct, the model might not correctly describe the facts of the case or its legal principles. Sometimes, it may present a plausible but incorrect summary or mix up details from different cases. This type of hallucination poses a risk to legal professionals who rely on accurate case summaries for their research and arguments.
  • Hallucinations about legal doctrine. In some instances, the model may generate inaccurate or outdated legal doctrines or principles, which can mislead users who rely on the AI-generated content for legal research. 

In my own experience, I’ve found that hallucinations are most likely to occur when the model does not have much in its training data that is useful to answer the question. Rather than telling me the training data cannot help answer the question (similar to a “0 results” message in Westlaw or Lexis), the generative AI chatbots seem to just do their best to produce a plausible-looking answer. 

This does seem to be what happened to the attorneys in Mata v. Avianca. They did not ask the model to answer a legal question, but instead asked it to craft an argument for their side of the issue. Rather than saying that argument would be unsupported, the model dutifully crafted an argument, and used fictional law since no real law existed.

How are vendors and law firms addressing hallucinations?

Several vendors have released specialized legal research products based on generative AI, such as LawDroid’s CoPilot, Casetext’s CoCounsel (since acquired by Thomson Reuters), and the mysterious (at least to academic librarians like me who do not have access) Harvey. Additionally, an increasing number of law firms (including Dentons, Troutman Pepper Hamilton Sanders, Davis Wright Tremaine, and Gunderson Dettmer Stough Villeneuve Franklin & Hachigian) have developed their own chatbots that allow their internal users to query the knowledge of the firm to answer questions.

Although vendors and firms are often close-lipped about how they have built their products, we can observe a few techniques that they are likely using to limit hallucinations and increase accuracy.

First, most vendors and firms appear to be using some form of retrieval-augmented generation (RAG). RAG combines two processes: information retrieval and text generation. The model takes the user’s question and passes it (perhaps with some modification) to a database. The database results are fed to the model, which identifies relevant passages or snippets in them. Those passages are then sent back into the model as “context” along with the user’s question.

This reduces hallucinations, because the model receives instructions to limit its responses to the source documents it has received from the database. Several vendors and firms have said they are using retrieval-augmented generation to ground their models in real legal sources, including Gunderson, Westlaw, and Casetext.
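To make the retrieve-then-generate flow concrete, here is a minimal sketch in Python. It is not how any vendor actually builds its product. The three-sentence "database," the keyword-overlap search, and the names `retrieve`, `build_prompt`, and `answer` are all assumptions made up for illustration, and the `generate` argument stands in for whatever model call a real system would make.

```python
import re
from typing import Callable

# Hypothetical stand-in for a vendor's database of legal sources.
DOCUMENTS = {
    "doc-1": "A party may terminate a contract when the other party commits a material breach.",
    "doc-2": "A landlord must return the security deposit within the statutory period.",
    "doc-3": "Damages for breach of contract aim to put the injured party in the position performance would have.",
}

# Tiny stop-word list so common words do not count as matches.
STOP_WORDS = {"a", "an", "the", "of", "for", "to", "in", "can"}


def tokens(text: str) -> set[str]:
    """Lowercase the text and return its distinct non-stop-word tokens."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOP_WORDS


def retrieve(question: str, k: int = 2) -> list[tuple[str, str]]:
    """Toy retrieval step: rank documents by words shared with the question."""
    question_terms = tokens(question)
    scored = []
    for doc_id, text in DOCUMENTS.items():
        overlap = len(question_terms & tokens(text))
        if overlap > 0:
            scored.append((overlap, doc_id, text))
    # Highest overlap first; ties broken by document id for stable output.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(doc_id, text) for _, doc_id, text in scored[:k]]


def build_prompt(question: str, passages: list[tuple[str, str]]) -> str:
    """Combine the retrieved passages and the question into one prompt."""
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in passages)
    return (
        "Answer using only the sources below. "
        "If the sources do not answer the question, say so.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )


def answer(question: str, generate: Callable[[str], str]) -> str:
    """Retrieve first, then generate; refuse when nothing was retrieved."""
    passages = retrieve(question)
    if not passages:
        return "No sources were found for this question."
    return generate(build_prompt(question, passages))
```

The design choice worth noticing is the early return in `answer`: when retrieval comes back empty, the sketch says so instead of asking the model to improvise, which is the "0 results" behavior that general-purpose chatbots lack.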

To enhance the precision of the retrieved documents, some products may also use vector embedding. Vector embedding is a way of representing words, phrases, or even entire documents as numerical vectors. The beauty of this method lies in its ability to identify semantic similarities. So, a query about “contract termination due to breach” might yield results related to “agreement dissolution because of violations”, thanks to the semantic nuances captured in the embeddings. Using vector embedding along with RAG can provide relevant results, while reducing hallucinations.
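As a rough illustration of how embeddings capture similarity, the sketch below compares phrases using cosine similarity, one common way to measure how close two vectors are. The three-number vectors are invented by hand for this example. Real embeddings are produced by a trained model and typically have hundreds or thousands of dimensions, and the names `EMBEDDINGS`, `cosine_similarity`, and `most_similar` are my own.

```python
import math

# Invented toy vectors (assumption): a real system would get these from an
# embedding model rather than writing them by hand.
EMBEDDINGS = {
    "contract termination due to breach": [0.9, 0.8, 0.1],
    "agreement dissolution because of violations": [0.85, 0.75, 0.2],
    "security deposit return deadline": [0.1, 0.2, 0.95],
}


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query: str, candidates: list[str]) -> str:
    """Pick the candidate phrase whose vector is closest to the query's vector."""
    query_vector = EMBEDDINGS[query]
    return max(candidates, key=lambda c: cosine_similarity(query_vector, EMBEDDINGS[c]))


if __name__ == "__main__":
    query = "contract termination due to breach"
    others = [phrase for phrase in EMBEDDINGS if phrase != query]
    print(most_similar(query, others))
```

Even though the query and the "agreement dissolution" phrase share no keywords, their vectors point in nearly the same direction, so a search built on embeddings would treat them as close matches.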

Another approach vendors can take is to develop specialized models trained on narrower, domain-specific datasets. This can help improve the accuracy and relevance of the AI-generated content, as the models would be better equipped to handle specific legal queries and issues. Focusing on narrower domains can also enable models to develop a deeper understanding of the relevant legal concepts and terminology. This does not appear to be what law firms or vendors are doing at this point, based on the way they are talking about their products, but there are law-specific data pools becoming available so we may see this soon.

Finally, vendors may fine-tune their models by providing human feedback on responses, either in-house or through user feedback. By providing users with the ability to flag and report hallucinations, vendors can collect valuable information to refine and retrain their models. This constant feedback mechanism can help the AI learn from its mistakes and improve over time, ultimately reducing the occurrence of hallucinations.

So, hallucinations are fixed?

Even though vendors and firms are addressing hallucinations with technical solutions, it does not necessarily mean that the problem is solved. Rather, it may be that our quality control methods will shift.

For example, instead of wasting time checking each citation to see if it exists, we can be fairly sure that the cases produced by legal research generative AI tools do exist, since they are found in the vendor’s existing database of case law. We can also be fairly sure that the language they quote from the case is accurate. What may be less certain is whether the quoted portions are the best portions of the case and whether the summary reflects all relevant information from the case. This will require some assessment of the various vendor tools.

We will also need to pay close attention to the database results that are fed into retrieval-augmented generation. If those results don’t reflect the full universe of relevant cases, or contain material that is not authoritative, then the answer generated from those results will be incomplete. Think of running an initial Westlaw search, getting 20 pretty good results, and then basing your answer only on those 20 results. For some questions (and searches), that would be sufficient, but for more complicated issues, you may need to run multiple searches, with different strategies, to get what you want.

To be fair, the products do appear to be running multiple searches. When I attended the rash of AI presentations at AALL over the summer, I asked Jeff Pfeifer of Lexis how he could be sure that the model had all relevant results, and he mentioned that the model sends many, many searches to the database, not just one. That does give some comfort, but it leads me to the next point of quality control.

We will want to have some insight into the searches that are being run, so that we can verify that they are asking the right questions. From the demos I’ve seen of CoCounsel and Lexis+ AI, this is not currently a feature. But it could be. For example, the AI assistant from scite (an academic research tool) sends searches to academic research databases and (seemingly using RAG and other techniques to analyze the search results) produces an answer. They also give a mini-research trail, showing the searches that are being run against the database and then allowing you to adjust if that’s not what you wanted.

scite AI Assistant Sample Results
scite AI Assistant Settings
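For readers who like to see the idea in code, here is a small sketch of what a "research trail" could look like: every search the tool runs is recorded along with what it found, so a human can review the trail and add a search that was missed. This is not how scite, CoCounsel, or Lexis+ AI work internally. The three-entry database, the all-words-must-match search, and the `ResearchTrail` class are assumptions invented for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical mini-database; the ids and text are made up for this example.
DATABASE = {
    "case-1": "material breach permits termination of the contract",
    "case-2": "violation of the agreement justified dissolution",
    "case-3": "tenant recovered the security deposit",
}


@dataclass
class ResearchTrail:
    """Runs searches and keeps a reviewable record of each one."""

    searches: list[tuple[str, list[str]]] = field(default_factory=list)

    def run(self, query: str) -> list[str]:
        """Return ids of entries containing every query word, and log the search."""
        terms = query.lower().split()
        hits = sorted(
            doc_id
            for doc_id, text in DATABASE.items()
            if all(term in text.split() for term in terms)
        )
        self.searches.append((query, hits))
        return hits

    def combined_results(self) -> list[str]:
        """Merge the hits from all logged searches, dropping duplicates."""
        merged: list[str] = []
        for _, hits in self.searches:
            for doc_id in hits:
                if doc_id not in merged:
                    merged.append(doc_id)
        return merged


if __name__ == "__main__":
    trail = ResearchTrail()
    trail.run("breach contract")
    trail.run("agreement dissolution")
    for query, hits in trail.searches:
        print(f"{query!r} -> {hits}")
    print("combined:", trail.combined_results())
```

The point of keeping `searches` around is that a search returning nothing is still recorded, which is exactly the kind of gap a researcher would want to spot and fix with a differently worded query.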

Are there uses for generative AI where the risks presented by hallucinations are lessened?

The other good news is that there are plenty of tasks we can give generative AI for which hallucinations are less of an issue. For example, CoCounsel has several other “skills” that do not depend upon accuracy of legal research, but are instead ways of working with and transforming documents that you provide to the tool.

Similarly, even working with a generally applicable tool such as ChatGPT, there are many applications that do not require precise legal accuracy. There are two rules of thumb I like to keep in mind when thinking about tasks to give to ChatGPT: (1) could this information be found via Google? and (2) is a somewhat average answer ok? (As one commentator memorably put it, “Because [LLMs] work by predicting the most statistically likely word in a sentence, they churn out average content by design.”)

For most legal research questions, we could not find an answer using Google, which is why we turn to Westlaw or Lexis. But if we just need someone to explain the elements of breach of contract to us, or come up with hypotheticals to test our knowledge, it’s quite likely that content like that has appeared on the internet, and ChatGPT can generate something helpful.

Similarly, for many legal research questions, an average answer would not work, and we may need to be more in-depth in our answers. But for other tasks, an average answer is just fine. For example, if you need help coming up with an outline or an initial draft for a paper, there are likely hundreds of samples in the data set, and there is no need to reinvent the wheel, so ChatGPT or a similar product would work well.

What’s next?

In the coming months, as legal research generative AI products become increasingly available, librarians will need to adapt to develop methods for assessing accuracy. Currently, there appear to be no benchmarks to compare hallucinations across platforms. Knowing librarians, that won’t be the case for long, at least with respect to legal research.

Further reading

If you want to learn more about how retrieval-augmented generation and vector embedding work within the context of generative AI, check out some of these sources: