Evaluating Generative AI for Legal Research: A Benchmarking Project

This is a post from multiple authors: Rebecca Fordon (The Ohio State University), Deborah Ginsberg (Harvard Law Library), Sean Harrington (University of Oklahoma), and Christine Park (Harvard Law Library)

In late 2023, several legal research databases and start-up competitors announced their versions of ChatGPT-like products, each professing that theirs would be the latest and greatest. Since then, law librarians have evaluated and tested these products ad hoc, offering meaningful anecdotal evidence of their experience, much of which can be found on this blog and others. However, one-time evaluations can be time-consuming and inconsistent across the board. Certain tools might work better for particular tasks or subject matters than others, and coming up with different test questions and tasks takes time that many librarians might not have in their daily schedules.

It is difficult to test large language models (LLMs) without back-end access to run evaluations. So to test the abilities of these products, librarians can use prompt engineering to figure out how to get desired results (controlling statutes, key cases, drafts of a memo, etc.). Some models are more successful than others at achieving specific results. However, as these models update and change, evaluations of their efficacy can change as well. Therefore, we plan to propose a typology of legal research tasks based on existing computer and information science scholarship and draft corresponding questions using the typology, with rubrics others can use to score the tools they use.
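To make the rubric idea concrete, here is a minimal sketch of how a weighted scoring rubric might work. The criterion names and weights below are hypothetical placeholders, not the rubric we plan to publish:

```python
# Hypothetical rubric: criterion names and weights are illustrative only.
RUBRIC = {
    "finds_controlling_statute": 3,   # weight for each criterion
    "cites_key_cases": 3,
    "no_hallucinated_citations": 4,
}

def score_response(checks: dict[str, bool]) -> float:
    """Convert pass/fail judgments against the rubric into a 0-100 score."""
    earned = sum(weight for name, weight in RUBRIC.items() if checks.get(name))
    total = sum(RUBRIC.values())
    return 100 * earned / total

# A grader marks which criteria the tool's answer satisfied.
result = score_response({
    "finds_controlling_statute": True,
    "cites_key_cases": True,
    "no_hallucinated_citations": False,
})
```

A shared structure like this is what would let different librarians score different tools and still compare results.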

Although we ultimately plan to develop this project into an academic paper, we share here to solicit thoughts about our approach and connect with librarians who may have research problem samples to share.

Difficulty of Evaluating LLMs

Let’s break down some of the tough challenges with evaluating LLMs, particularly when it comes to their use in the legal field. First off, there’s this overarching issue of transparency—or rather, the lack thereof. We often hear about the “black box” nature of these models: you toss in your data, and a result pops out, but what happens in between remains a mystery. Open-source models allow us to leverage tools to quantify things like retrieval accuracy, text generation precision, and semantic similarity. We are unlikely to get the back-end access we need to perform these evaluations. Even if we did, the layers of advanced prompting and the combination of tools employed by vendors behind the scenes could render these evaluations essentially useless.
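To illustrate one of the metrics mentioned above, here is a toy sketch of semantic similarity scoring. Real evaluations use embedding models; this bag-of-words cosine similarity is a crude stand-in meant only to show the shape of the computation:

```python
from collections import Counter
import math

def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity over bag-of-words vectors — a crude stand-in for
    the embedding-based semantic similarity used in real LLM evaluations."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

reference = "the statute of limitations for breach of contract is six years"
generated = "a six year limitations period applies to breach of contract claims"
score = cosine_similarity(reference, generated)  # between 0.0 and 1.0
```

Without back-end access, even a metric this simple can only be run on outputs, which is exactly the limitation described above.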

Even considering only the underlying models (e.g., GPT-4 vs. Claude), there is no standardized method to evaluate the performance of LLMs across different platforms, leading to inconsistencies. Many different leaderboards evaluate the performance of LLMs in various ways (frequently based on specific subtasks). This is kind of like trying to grade essays from unrelated classes without a rubric—what’s top-notch in one context might not cut it in another. As these technologies evolve, keeping our benchmarks up-to-date and relevant is becoming an ongoing challenge, and without uniform standards, comparing one LLM’s performance to another can feel like comparing apples to oranges.

Then there’s the psychological angle—our human biases. Paul Callister’s work sheds light on this by discussing how cognitive biases can lead us to over-rely on AI, sometimes without questioning its efficacy for our specific needs. Combine this with the output-based evaluation approach, and we’re setting ourselves up for potentially frustrating misunderstandings and errors. The bottom line is that we need some sort of framework for the average user to assess the output.

One note on methods of evaluation: just before publishing this blog post, we learned of a new study from a group of researchers at Stanford, testing the claims of legal research vendors that their retrieval-augmented generation (RAG) products are “hallucination-free.” The group created a benchmarking dataset of 202 queries, many of which were chosen for their likelihood of producing hallucinations. (For example, jurisdiction/time-specific and treatment questions were vulnerable to RAG-induced hallucinations, whereas false premise and factual recall questions were known to induce hallucinations in LLMs without RAG.) The researchers also proposed a unique way of scoring responses to measure hallucinations, as well as a typology of hallucinations. While this is an important advance in the field and provides a way to continue to test for hallucinations in legal research products, we believe hallucinations are not the only weakness in such tools. Our work aims to focus on the concrete applications of these LLMs and probe into the unique weaknesses and strengths of these tools. 

The Current State of Prompt Engineering

Since the major AI products were released without a manual, we’ve all had to figure out how to use these tools from scratch. The best tool we have so far is prompt engineering. Over time, users have refined various templates to better organize questions and leverage some of the more surprising ways that AI works.

As it turns out, many of the prompt templates, tips, and tricks we use with the general commercial LLMs don’t carry over well into the legal AI sphere, at least with the commercial databases we have access to. For example, because the legal AIs we’ve tested so far won’t ask you questions, researchers may not be able to have extensive conversations with the AI (or any conversation for some of them). So that means we must devise new types of prompts that will work in the legal AI sphere, and possibly work only in the AI sphere.

We should be able to design effective prompts easily because the data set the AIs use is limited. But it’s not always clear exactly what sources the AI is using. Some databases may list how many cases they have for a certain court by year; others may say “selected cases before 1980” without explaining how they were selected. And even when the databases provide coverage information, it may not be clear exactly which of those materials the AI can access.

We still need to determine what prompt templates will be most effective across legal databases. More testing is needed. However, we are limited to the specific databases we can access. While most (all?) academic law librarians have access to Lexis+ AI, Westlaw has yet to release its research product to academics. 

Developing a Task Typology

Many of us may have the intuition that there are some legal research tasks for which generative AI tools are more helpful than others. For example, we may find that generative AI is great for getting a working sense of a topic, but not as great for synthesizing a rule from multiple sources. But if we wanted to test that intuition and measure how well AI performed on different tasks, we would need to first define those tasks. This is similar, by the way, to how the LegalBench project approached benchmarking legal analysis—they atomized the IRAC process for legal analysis down to component tasks that they could then measure.

After looking at the legal research literature (in particular Paul Callister’s “problem typing” schemata and AALL’s Principles and Standards for Legal Research Competency), we are beginning to assemble a list of tasks for which legal researchers might use generative AI. We will then group these tasks according to where they fall in an information retrieval schema for search, following Marchionini (2006) & White (2024), into Find tasks (which require a simple lookup), Learn & Investigate tasks (which require sifting through results, determining relevance, and following threads), and Create, Synthesize, and Summarize tasks (a new type of task for which generative AI is well-suited).

Notably, a single legal research project may contain multiple tasks. Here are a few sample projects applying a preliminary typology:

Again, we may have an initial intuition that generative AI legal research platforms, as they exist today, are not particularly helpful for some of these subtasks. For example, Lexis+AI currently cannot retrieve (let alone analyze) all citing references to a particular case. Nor could we necessarily be certain from, say, CoCounsel’s output, that it contained all cases on point. Part of the problem is that we cannot tell which tasks the platforms are performing, or the data that they have included or excluded in generating their responses. By breaking down problems into their component tasks, and assessing competency on both the whole problem and the tasks, we hope to test our intuitions.

Future Research

We plan on continually testing these LLMs using the framework we develop to identify which tasks are suitable for AIs and which are not. Additionally, we will draft questions and provide rubrics for others to use, so that they can grade AI tools. We believe that other legal AI users will find value in this framework and rubric. 

ABA TECHSHOW 2024 Review

Since so many of the AI Law Librarians team were able to attend this year, we thought we would combine some of our thoughts (missed you Sarah!) about this yearly legal technology conference.

Sean

Startup Alley

We arrived in Chicago on a chilly Wednesday morning, amid an Uber & Lyft strike, with plenty of time to take the train from the airport to our hotel. After an obligatory trip to Giordano’s, our students were ready to head over to the Start-up Pitch Competition. I sat with co-blogger Rebecca Fordon during the competition and we traded opinions on the merits of the start-up pitches. We both come from the academic realm and were interested in seeing the types of products that move the needle for attorneys working at firms.

I was familiar with many of the products because I spend a decent portion of my time demo’ing legal tech as part of my current role. It was stiff competition and there were many outstanding options to choose from. Once all of the pitches were done, the audience voted, and then Bob Ambrogi announced the winners. To my great surprise and pleasure, AltFee won! For the uninitiated, AltFee is “a product that helps law firms replace the billable hour with fixed-fee pricing.” This was very interesting to me because I have long thought that LLMs could mean the death knell of the billable hour in certain legal sectors. This was, at least, confirmation that the attorneys attending the TECHSHOW have this on their radar and are thinking through how they are going to solve this problem.

TECHSHOW Sessions

This year’s schedule of sessions was noticeably heavy on AI-related topics. This was great for me because I’m super interested in this technology and how it is being implemented in the day-to-day life of practitioners. I saw sessions on everything from case management software, to discovery, to marketing, kinda everything.

An especially inspiring couple of sessions for me featured Judge Scott Schlegel of the Fifth Circuit Court of Appeal in Louisiana. Judge Schlegel is the first judge that I’ve seen make fantastic use of AI in United States Courts for access to justice. I am passionate about this topic and have been fishing for grants to try to implement a handful of projects that I have, so it was phenomenal to see that there are judges out there who are willing to be truly innovative. Any initiative for access to justice in the courts would require the buy-in of many stakeholders, so having someone like Judge Schlegel to point to as a proof of concept could be crucial in getting my projects off the ground. After hearing his presentations I wished that every court in the US had a version of him to advocate for these changes. Importantly, none of his projects require tons of funding or software development. They are small, incremental improvements that could greatly help regular people navigate the court system – while, in many cases, improving the daily lives of the court staff and judges who have to juggle huge caseloads. Please feel free to email grant opportunities in this vein if you see them: sharrington@ou.edu.

Side Quest: Northwestern Law AI Symposium

In the weeks leading up to the TECHSHOW I received an invite from Prof. Daniel Linna to attend Northwestern University’s AI and Law: Navigating the Legal Landscape of Artificial Intelligence Symposium. I took a frigid hike down to the school in the morning to attend a few sessions before returning to the TECHSHOW in the afternoon. It was a fantastic event with a great mix of attorneys, law professors, and computer science developers.

I was able to see Professor Harry Surden’s introductory session on how LLMs work in legal applications. While this information was not “new” to me per se (since I frequently give a similar presentation), he presented this complicated topic in an engaging, clear, and nuanced way. He’s obviously a veteran professor and expert in this area, and so his presentation is much better than mine. He gave me tons of ideas on how to improve my own presentations to summarize and analogize these computer science topics to legal professionals, for which I was very grateful.

The second session was a panel that included Sabine Brunswicker, JJ Prescott, and Harry Surden. All were engaged in fascinating projects using AI in the law and I encourage you to take a look through their publications to get a better sense of what the pioneers in our field are doing to make use of these technologies in their research.

Our Students

Each year our school funds a cohort of students to attend the TECHSHOW and this year was no different. This is my first year going with them and I wasn’t sure how much value they would get out of it since they don’t have a ton of experience working in firms using these tools. Was this just a free trip to Chicago or was this pedagogically useful to them?

I will cut to the chase and say that they found this tremendously useful and loved every session that they attended. Law school can (sometimes) get a little disconnected from the day-to-day practice of law and this is a great way to bridge that gap and give the students a sense of what tools attorneys use daily to do their jobs. You’d think that all of the sexy AI-related stuff would be attractive to students but the best feedback came from sessions on basic office applications like MS Outlook and MS Word. Students are definitely hungry for this type of content if you are trying to think through workshops related to legal technology.

In addition to the sessions, the students greatly appreciated the networking opportunities. The TECHSHOW is not overly stuffy and formal and I think they really liked the fact that they could, for example, find an attorney at a big firm working in M&A and pick their brain at an afterparty to get the unfiltered truth about a specific line of work. All of the students said they would go again and I’m going to try to find ways to get even more students to attend next year. If your school ends up bringing students in the future, please reach out to me and we can have our students get together at the event.

Jenny

Jenny live-tweeted the ABA TECHSHOW’s 60 Apps in 60 Minutes and provided links. You can follow her on this exciting journey starting with this tweet:

Rebecca

One of the most impactful sessions for me was titled “Revitalize Your Law Firm’s Knowledge Management with AI,” with Ben Schorr (Microsoft) and Catherine Sanders Reach (North Carolina Bar Association).  To drive home why KM matters so much, they shared the statistic that knowledge workers spend a staggering 2.5 hours a day just searching for what they need. That resonated with me, as I can recall spending hours as a junior associate looking for precedent documents within my document management system. Even as a librarian, I often spend time searching for previous work that either I or a colleague has done.

To me, knowledge management is one of the most exciting potential areas to apply AI, because it’s such a difficult problem that firms have been struggling with for decades. The speaker mentioned hurdles like data silos (e.g., particular practice areas sharing only among themselves), a culture of hoarding information, and the challenges of capturing and organizing vast amounts of data, such as emails and scanned documents with poor OCR. 

The speakers highlighted several AI tools that are attempting to address these issues through improved search going beyond keywords, automating document analysis to aid in categorizing documents, and suggesting related documents. They mentioned Microsoft Copilot, along with process tools like Process Street, Trainual, and Notion. Specific tools like Josef allow users to ask questions of HR documents and policies, rather than hunting for the appropriate documents.

Artificial Intelligence and the Future of Law Libraries Roundtable Events

South Central Roundtable

OU Law volunteered to host the South Central “Artificial Intelligence and the Future of Law Libraries” roundtable, and so I was fortunate enough to be allowed to attend. This is the third iteration of a national conversation on what the new AI technologies could mean for the future of law libraries and (more broadly) law librarianship. I thought I would fill you in on my experience and explain a little about the purpose and methodology of the event. The event follows the Chatham House Rule, so I cannot give you specifics about what anybody said, but I can give you an idea of the theme and process that we worked through.

Law Library Director Kenton Brice of OU Law elected to partner with Associate Dean for Library and Technology Greg Ivy and SMU to host the event in Dallas, TX because it was more accessible for many of the people that we wanted to attend. I’d never been to SMU and it’s a beautiful campus in an adorable part of Dallas – here’s a rad stinger I made in Premiere Pro:

Not cleared with SMU’s marketing department

TL;DR: If you get invited, I would highly recommend that you go. I found it enormously beneficial.

History and Impetus

The event is the brainchild of Cas Laskowski (hereinafter “Cas”), Head of Research, Data & Instruction, Director of the Law Library Fellows Program, and Technology & Empirical Librarian at the University of Arizona. They hosted the inaugural session through U of A’s Washington, DC campus. You may have seen the Dewey B. Strategic article about it since Jean O’Grady was in attendance. The brilliant George H. Pike at Northwestern University hosted the second in the series in Chicago. I know people who have attended each of these sessions and the feedback has been resoundingly positive.

The goal of this collaborative initiative is to provide guidance to law libraries across the country as we work to strategically incorporate artificial intelligence into our operations and plan for the future of our profession. 

Cas, from the U of A Website

Methodology

The event takes the entire day and it’s emotionally exhausting, in the best way possible. We were broken into tables of 6 participants. The participants were hand-selected based on their background and experience so that each table had a range of different viewpoints and perspectives.

Then the hosts (in our case, Kenton Brice and Cas Laskowski) walked us through a series of “virtuous cycle, vicious cycle” exercises. They, thankfully, started with the vicious cycle so that each session could end on a positive, virtuous-cycle note. At the end, each table chose a speaker who summarized the opinions discussed so that the entire room could benefit from the conversations. Apparently, this is an exercise done at places like the United Nations to triage and prepare for future events. This process went on through 3 full cycles, and then we had about an hour of open discussion at the end. We got there at 8am and had breakfast and lunch on-site (both great – thank you Greg Ivy and SMU catering) because it took the entire day.

We had a great mix of academic, government, and private-sector participants represented at the event, and the diversity of stakeholders and experiences made for robust and thought-provoking conversation. Many times I would hear perspectives that had never occurred to me and would have my assumptions challenged to refine my own ideas about what the future might look like. Additionally, the presence of people with extensive expertise in specific domains, such as antitrust, copyright, the intricacies of AmLaw 100 firms, and the particular hurdles faced in government roles, enriched the discussions with a depth and nuance that is rare to find. Any one of these areas can require years of experience, so having a wide range of experts to answer questions allowed you to really “get into the weeds” and think things through thoroughly.

My Experience

I tend to be (perhaps overly) optimistic about the future of these technologies and so it was nice to have my optimism tempered and refined by people who have serious concerns about what the future of law libraries might look like. While the topics presented were necessarily contentious, everybody was respectful and kind in their feedback. We had plenty of time for everybody to speak (so you didn’t feel like you were struggling to get a word in).

You’d think that 8 hours of talking about these topics would be enough but we nearly ran over on every exercise. People have a lot of deep thoughts, ideas, and concerns about the state and future of our industry. Honestly, I would have been happy to have this workshop go on for several days and cover even more topics if that was possible. I learned so much and gained so much value from the people at my table that it was an incredibly efficient way to get input and share ideas.

Unlike other conferences and events that I’ve attended, this one felt revolutionary – as in, we truly need to change the status quo in a big way and start getting to work on new ways to tackle these issues. “Disruptive” has become an absolute buzzword inside of Silicon Valley and academia, but now we have something truly disruptive and we need to do something about it. Bringing all these intelligent people together in one room fosters an environment where disparate, fragmented ideas can crystallize into actionable plans, enabling us to support each other through these changes.

The results from all of these roundtables are going to be published in a global White Paper once the series has concluded. Each roundtable has different regions and people involved and I can’t wait to see the final product and hear what other roundtables had to say about these important issues. More importantly, I can’t wait to be involved in the future projects and initiatives that this important workshop series creates.

I echo Jean O’Grady: If you get the call, go.

New Resources for Teaching with Legal AI and Keeping Up with the Latest Research

Today’s guest post comes from the University of Denver Sturm College of Law’s John Bliss. Professor Bliss has been kind enough to share some resources that he has crafted to help teach AI to lawyers and law students. In addition, he has launched a new blog which would likely be of interest to our audience so we are happy to host this cross-over event.

Teaching

I recently posted to SSRN a forthcoming article on Teaching Law in the Age of Generative AI, which draws from early experiments with AI-integrated law teaching, surveys of law students and faculty, and the vast new literature on teaching with generative AI across educational contexts. I outline a set of considerations to weigh when deciding how to incorporate this tech in the legal curriculum. And I suggest classroom exercises, assignments, and policies. See https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4682456.

Blog

I’m also organizing a blog that provides up-to-date analysis of research on AI’s legal capabilities. You can subscribe at http://ai-lawyering.blog. Let me know if you’re interested in contributing. The motivation for the blog is that legal AI is a fast-moving field. It is all too common that our discussions are based on outdated and inaccurate information. Empirical findings are often misconstrued in mass and social media. The blog aims to address this issue by reviewing the latest high-quality research, emphasizing empirical studies of AI capabilities as well as scholarship on the implications of this technology for lawyers and other stakeholders in the legal system.

New Program on Generative AI for Legal Practitioners

I’m working with a non-profit teaching lawyers and other members of the legal profession about generative AI: http://oak-academy.org. Just last week, we held our first session with a group of lawyers, law students, and academics. It seemed to go well!

I look forward to continuing conversations on these topics. Please feel free to reach out—jbliss@law.du.edu

Review: vLex’s Vincent AI

Vincent is vLex’s response to implementing AI into legal research and it’s the most impressive one that I’ve seen for legal research.  Damien Riehl was kind enough to give us a personalized demonstration (thanks for setting that up, Jenny!) and it was a real treat to be able to ask questions about it in real-time.  I would say that the best way to see this in action is to schedule a demo for yourself but if you want to hear my hot-takes about the platform, please keep reading. 

Vincent is Really Cool 

Interface 

Many times when you engage with these models they feel like a complete black box.  You put in some text, 🪄 presto-chango 🪄, and then it spits something back to you that seems related to what you put into it.  Vincent instead offers you a fairly controlled interface that is centered around what you typically need for something like real-world legal research.  It doesn’t look like a “chatbot,” sandbox-type experience; it feels more like a tool that a professional would use. 

You Can Tell Where It Gets the Information

This is huge because almost everything you need is on one page immediately.  You ask it to draft a legal research memo and the cases are just to the right of the memo.  The relevant portions of the cases have been summarized and presented there for you.  A tool tells you how confident Vincent is that this is close to your request.  Everything below 70% is dropped.  You can toggle between cases, regs, statutes, and secondary materials available.  Everything that could require a deeper dive has a hyperlink.  You can get a sense of what this looks like from vLex’s website about Vincent here: https://vlex.com/vincent-ai.  
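Vincent’s internal scoring is not public, so purely as an illustration of the visible behavior — a per-source confidence indicator with everything under 70% dropped — a filter like this captures the idea:

```python
# Illustrative sketch only: vLex has not published how Vincent scores sources.
# This models the user-visible behavior of dropping results below 70% confidence.
def filter_by_confidence(results: list[dict], threshold: float = 0.70) -> list[dict]:
    """Keep only retrieved sources whose relevance confidence meets the cutoff."""
    return [r for r in results if r["confidence"] >= threshold]

retrieved = [
    {"citation": "Case A", "confidence": 0.92},
    {"citation": "Case B", "confidence": 0.71},
    {"citation": "Case C", "confidence": 0.55},  # below threshold, dropped
]
kept = filter_by_confidence(retrieved)
```

Surfacing the score and the cutoff to the researcher, rather than silently blending weak sources into the answer, is a big part of why the interface feels trustworthy.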

Multi-Stage Prompting 

vLex is probably best known for its deep archive of primary international materials.  Vincent uses this to great effect (since we know that many of these NLP technologies started as translation tools).  You can enter a natural language question in English, Vincent will translate it, run the search in the home country’s language, and then provide you with both the original text (so you could translate it yourself) and an English (or whatever) language translation.  Sexy stuff for you FCIL researchers. Also, this is substantially more powerful than something that simply tries to grind through many iterations of similar keyword searches in other languages.   

It’s also a notable example of multistage prompting and retrieval in legal research.  You can see that your request is fed through not one prompt but many complex chains to produce high-quality, useful output. The tools for US caselaw are similar: your query is turned into several different prompts that run off in different directions through the vLex database to retrieve information. Some prompts search through cases, statutes, regs, and their secondary materials to see what is useful; others might summarize cases as they relate to your query; other prompts find counterarguments; another evaluates them for confidence on your specific subject; and a final prompt summarizes all of this information into a neat little report for you. In summary, they’re making great use of the technology’s potential by deploying it in many different ways. The final product is sort of a fabricated, personalized secondary source created by running tons of prompts over the underlying primary materials. In fact, Damien calls this a “Me-tise” 😂 (apologies to Damien if I stole his punchline) and he foresees it being a powerful new tool for legal researchers. I’ve been bullish on the fabrication of secondary materials since I first saw what these things could do, so it was exciting to see a precursor of this in action. 
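The fan-out-then-synthesize pattern described above can be sketched in a few lines. This is a toy mock, not vLex’s actual pipeline; `call_llm` stands in for a real model API:

```python
# Toy sketch of multi-stage prompting — NOT vLex's implementation.
# `call_llm` is a stub standing in for a real LLM API call.
def call_llm(prompt: str) -> str:
    return f"[model output for: {prompt[:40]}...]"

def research_pipeline(query: str) -> str:
    # Stage 1: fan out several specialized prompts over the same query.
    stages = {
        "cases": f"Find cases relevant to: {query}",
        "statutes": f"Find statutes and regulations relevant to: {query}",
        "counterarguments": f"Identify counterarguments to: {query}",
    }
    partials = {name: call_llm(prompt) for name, prompt in stages.items()}
    # Stage 2: a final prompt synthesizes the partial results into one report.
    summary_prompt = "Summarize into a research memo:\n" + "\n".join(
        f"{name}: {text}" for name, text in partials.items()
    )
    return call_llm(summary_prompt)

memo = research_pipeline("Is a liquidated damages clause enforceable in Ohio?")
```

Chaining stages like this is why the output reads like a personalized secondary source rather than a single raw model completion.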

Damien let us know that behind the scenes they are using a combination of various LLMs to achieve these results and cut costs when possible: Claude, Llama 2 (Meta), and GPT-4. We met with him shortly after the OpenAI controversy and he pointed out that they are able to swap models in vLex if necessary.

Secondary Materials and Market Share 

We have all come to love and rely on specific secondary materials that exist in Westlaw and Lexis. vLex’s acquisition of Fastcase meant that they acquired a huge, fantastic database of primary US materials. The one pain point for people who may be interested in switching from Westlaw/Lexis to Fastcase was the relative dearth of secondary materials available. The features that I saw last week in vLex may fill that need for some users and it will be interesting to see if people are lured away from their favorite practice guide or treatise published by Lexis or Thomson Reuters because a robot can now do some of that work summarizing and analyzing vast quantities of primary law. It will also be interesting to see if Lexis and Westlaw will roll out these types of features, since they could be in direct competition with their robust (and pricey) secondary materials offerings.

Before I get a slew of angry emails: I recognize that a traditional secondary material does much more than summarize cases, statutes, and regulations, but it does some of that (also remember we’re still in the infancy of this technology for legal research). If that is all the researcher needs, then these tools could work as a replacement for some people (and they don’t rely on monthly updates – they do this on demand). That may allow some people to cut ties with Lexis and Westlaw in a way that could shake up the industry and disrupt the status quo. It could also be incredibly powerful for something like a 50-state survey or even surveys across many different countries. Feel free to let me know what an ignoramus I am in the comments if I am missing something here.

Outstanding Questions 

Price 

I’ll dive right in where you all have questions, “Can we afford this thing?”  Dunno and it depends (super satisfying, I know).  The difficulty here is that these things are still very expensive to operate.  The more sophisticated the model, the larger the database, the more complex the stages of prompting, the various modalities (scanning documents, reading the screen, etc.) – the more it costs them.  They are all trying to figure out how to create a pricing structure where they can 1) offer it to the widest audience possible and 2) remain profitable.  As we know, their primary source of revenue is the big firms and so the product is currently only available in paid beta for select companies. 

Damien and vLex are both refreshingly upfront and clear about this.  No hand-waving or sales talk, which I think is why so many people in our industry look to people like Damien for information about these technologies as they are developed.  Damien mentioned that they are taking the “democratize the law” call to action from Fastcase seriously and are looking for ways to make it affordable on the academic market.  

Possible Future Options 

This is all complete speculation on my part, but some sort of limited version of the platform seems like it could be reasonable for the academic market (like BLaw does with their dockets): limited uses per day, limited uses per account, a “lesser” account with limited features, etc. As the market stands today, academic law libraries have access to a limited version of Lexis+ AI, trial access to Casetext CoCounsel (unless you’re willing to pay), no access to Westlaw Copilot, no access to Harvey AI, and no access to vLex. I anticipate all of that will change as the prices come down. The point of frustration is obviously that we want to be able to evaluate these tools so that we can teach them to students, in addition to using them ourselves so that we can benefit from the technology.

In conclusion, Vincent by vLex represents a significant step forward in AI-driven legal research. Its sophisticated multi-stage prompting, transparent sourcing, and potential for generating secondary materials on demand make it a formidable tool. The future of Vincent and similar AI platforms in the academic and broader legal research community is certainly something to watch closely.

Demystifying LLMs: Crafting Multiple Choice Questions from Law Outlines

In today’s post, we’ll explore how legal educators and law students can use Large Language Models (LLMs) like ChatGPT and Claude to create multiple-choice questions (MCQs) from a law school outline.

Understanding the Process

My first attempt at this was simply to ask the LLM the best way to make MCQs, but the feedback wasn’t particularly helpful, so I did some digging. Anthropic recently shed light on their method of generating multiple-choice questions, and it’s a technique that could be immensely beneficial for test preparation – besides being a useful way to conceptualize how to make effective use of the models for studying. They utilize XML tags, which may sound technical, but in essence these are just simple markers used to structure content. Let’s break the process down into something you can understand and use, even if you’re not a wizard at Technical Services who is comfortable with XML.

Imagine you have a law school outline on federal housing regulations. You want to test your understanding or help students review for exams. Here’s how an LLM can assist you:

STEP 1: Prepare Your Outline

Ensure that your outline is detailed and organized. It should contain clear sections, headings, and bullet points that delineate topics and subtopics. This structure will help the LLM understand and navigate your content. If you’re comfortable using XML or Markdown, this can be exceptionally helpful. Internally, the model identifies the XML tags and the text they contain, using this structure to generate new content. It recognizes the XML tags as markers that indicate the start and end of different types of information, helping it to distinguish between questions and answers.

The model uses the structure provided by the XML tags to understand the format of the data you’re presenting.

STEP 2: Uploading the Outline

Upload your outline into the platform that you’re using. Most platforms that host LLMs will allow you to upload a document directly, or you may need to copy and paste the text into a designated area.

STEP 3: Crafting a General Prompt

You can write a general prompt that instructs the LLM to read through your outline and identify key points to generate questions. For example:

“Please read the uploaded outline on federal housing regulations and create multiple-choice questions with four answer options each. Focus on the main topics and legal principles outlined in the document.”

STEP 4: Utilizing Advanced Features

Some LLMs have advanced features that can take structured or semi-structured data and understand the formatting. These models can sometimes infer the structure of a document without explicit XML or Markdown tags. For instance, you might say:

“Using the headings and subheadings as topics, generate multiple-choice questions that test the key legal concepts found under each section.”

AND/OR

Give the model some examples with XML tags, so it can better replicate what you would like – a technique called “few-shot prompting”:

<Question>
What are "compliance costs" in HUD regulations?
</Question>
<Answers>
1. Fines for non-compliance.
2. Costs associated with adhering to HUD regulations.
3. Expenses incurred during HUD inspections.
4. Overheads for HUD compliance training.
</Answers>

The more examples you give, the better it’s going to be.
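If you plan to give the model several tagged examples, the markup is easy to assemble with a short script rather than typing it by hand. This is my own illustrative sketch – the tag names simply mirror the example above, and any consistent scheme would work:

```python
# Assemble a few-shot MCQ prompt using XML-style tags, as described above.
# The <Question>/<Answers> tag names mirror the example in this post.

def format_example(question, answers):
    """Wrap one question and its numbered answer options in XML-style tags."""
    numbered = "\n".join(f"{i}. {a}" for i, a in enumerate(answers, start=1))
    return f"<Question>\n{question}\n</Question>\n<Answers>\n{numbered}\n</Answers>"

def build_prompt(instruction, examples):
    """Combine an instruction with formatted few-shot examples."""
    shots = "\n\n".join(format_example(q, a) for q, a in examples)
    return f"{instruction}\n\n{shots}"

examples = [
    ('What are "compliance costs" in HUD regulations?',
     ["Fines for non-compliance.",
      "Costs associated with adhering to HUD regulations.",
      "Expenses incurred during HUD inspections.",
      "Overheads for HUD compliance training."]),
]
prompt = build_prompt(
    "Generate multiple-choice questions in the same format as these examples:",
    examples,
)
print(prompt)
```

Paste the resulting text into the chat interface along with your outline; adding more (question, answers) pairs to the list gives the model more examples to imitate.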

AND/OR

You can also use the LLM to add these XML tags for you, depending on the size of your outline and the context limit of the model you are using (OpenAI recently expanded their limit dramatically). Give it a prompt asking it to apply tags, show it an example of the types of tags you would like for your content, and then tell the model to do the same with the rest of your outline:

<LawSchoolOutline>
    <CourseTitle>Constitutional Law</CourseTitle>
    <Section>
        <SectionTitle>Executive Power</SectionTitle>
        <Content>
            <SubSection>
                <SubSectionTitle>Definition and Scope</SubSectionTitle>
                <Paragraph>
                    Executive power is vested in the President of the United States and is defined as the authority to enforce laws and ensure they are implemented as intended by Congress.
                </Paragraph>
            </SubSection>
        </Content>
    </Section>
</LawSchoolOutline>

STEP 5: Refining the Prompt

It is very rare that my first try with any of these tools produces fantastic output. It is often a “conversation with a robot sidekick” (as many of you have heard me say at my presentations) and requires you to nudge the model to create better and better output.

If the initial questions need refinement, you can provide the LLM with more specific instructions. For example:

“For each legal case mentioned in the outline, create a question that covers the main issue and possible outcomes, along with incorrect alternatives that are plausible but not correct according to the case facts.”

Replicating the Process

Students can replicate this process for other classes using the same prompt. The trick here is to stay as consistent as possible with the way that you structure and tag your outlines. It might feel like a lot of work on the front end to create 5+ examples, apply tags, etc. but remember that this is something that can be reused later! If you get a really good MCQ prompt, you could use it for every class outline that you have and continue to refine it going forward.

Big Brother

This week, OpenAI announced new features to their platform at their first keynote event, including a new GPT-4 Turbo with 128K context, GPT-4 Turbo with Vision, the DALL·E 3 API, and more. They also announced their Assistants API, which includes their own retrieval-augmented generation (RAG) pipeline. Today, we will focus on OpenAI’s entry into the RAG market.

At the surface level, RAG boils down to text generation models like ChatGPT retrieving data, such as documents, to assist users with question answering, summarization, and so on. Behind the scenes, however, other factors are at play, such as vector databases, document chunking, and embedding models. Most RAG pipelines rely on an external vector database and require compute to create the embeddings. What OpenAI’s retrieval tool brings to the table is an all-encompassing RAG system, eliminating the need for an external database and the compute required to create and store the embeddings. Whether OpenAI’s retrieval system is optimal is a story for another day. Today we are focusing on the data implications.
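To make those moving parts concrete, here is a toy sketch of the retrieval half of a RAG pipeline: chunk the documents, "embed" each chunk, and rank chunks against a query by cosine similarity. Real pipelines swap the bag-of-words stand-in below for a learned embedding model and keep the vectors in a database (Qdrant and the like); nothing here reflects OpenAI's actual implementation.

```python
# Toy RAG retrieval: chunk -> embed -> rank by cosine similarity.
# The bag-of-words Counter is a stand-in for a real dense embedding model.
import math
from collections import Counter

def chunk(text, size=25):
    """Split text into chunks of roughly `size` words each."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text):
    """Stand-in embedding: a sparse vector of word counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

docs = (
    "A landlord must maintain the premises in habitable condition. "
    "Tenants may withhold rent when repairs are ignored. "
    "Contract formation requires offer acceptance and consideration."
)
top = retrieve("when can a tenant withhold rent", chunk(docs, size=9))
```

The retrieved chunks are then stuffed into the model's prompt as context – that is the "augmented generation" half of RAG.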

Data is the new currency fueling the new economy. Big Tech aims to take control of that economy by ingesting organizations’ private data, including IP, leading to a “monolithic system” that completely controls users’ data. Google, Microsoft, Adobe, and OpenAI are now offering indemnification to their users against potential copyright infringement lawsuits related to generative AI, aiming to protect their business model by encouraging more favorable legal precedents. This strategy is underscored by the argument that both the input (ideas, which are uncopyrightable) and the output (machine-generated expressions, deemed uncopyrightable by the US Copyright Office) of generative AI processes do not constitute copyright infringement.

The consequences of Big Tech having their way could be dire, leading us to a cyberpunk dystopia that none of us want to live in. Technology and its algorithms would be in charge, and our personal data could be used to manipulate us. Our data reveals our interests, private health information, location status, and more. When algorithms feed us only limited, targeted information based on our existing interests and views, they restrict the outside influence and diversity of opinion that are crucial to freedom of thought. Organizations must not contribute to a cyberpunk dystopia where Big Tech becomes Big Brother. Furthermore, companies put their employees, clients, and stakeholders at risk when handing data to Big Tech, which favors the role of tortfeasor over that of the good Samaritan who faithfully complies with consumer privacy laws.

To prevent Big Brother, organizations should implement their own RAG pipelines. Open-source frameworks such as LlamaIndex, Qdrant, and LangChain can be used to create powerful RAG pipelines with your privacy and interests protected. LLMWare has also released an open-source RAG pipeline and domain-specific embedding models. Generative AI is a powerful tool that can enhance our lives, but in the wrong hands the cyberpunk nightmare can become a reality. The ease of using prebuilt, turnkey systems such as those offered by OpenAI is appealing. However, the long-term risks of entrusting our valuable data to corporations, without a regulatory framework or protections, point in a potentially perilous direction.

Beware the Legal Bot: Spooky Stories of AI in the Courtroom

The “ChatGPT Attorney” case has drawn much attention, but it’s not the only example of lawyers facing problems with AI use. This blog will compile other instances where attorneys have gotten into trouble for incorporating AI into their practice. Updates will be made as new cases or suggestions arise, providing a centralized resource for both legal educators and practicing attorneys (or it can be used to update a LibGuide 😉). I’ll also add this to one of our menus or headings for easy access.

Attorney Discipline 

Park v. Kim, No. 22-2057, 2024 WL 332478 (2d Cir. Jan. 30, 2024)

“Attorney Jae S. Lee. Lee’s reply brief in this case includes a citation to a non-existent case, which she admits she generated using the artificial intelligence tool ChatGPT. Because citation in a brief to a non-existent case suggests conduct that falls below the basic obligations of counsel, we refer Attorney Lee to the Court’s Grievance Panel, and further direct Attorney Lee to furnish a copy of this decision to her client, Plaintiff-Appellant Park.”

Mata v. Avianca, Inc. (1:22-cv-01461) District Court, S.D. New York 

I will not belabor the ChatGPT attorney (since it has been covered by real journalists like the NYT) – only provide links to the underlying dockets in case you need them since I get asked for them fairly often:

(Fireworks start at the May 4, 2023 OSC)

Zachariah Crabill, Colorado Springs 

In a less publicized case from Colorado, an attorney, Zachariah Crabill, relied on ChatGPT to draft a legal motion, only to find out later that the cited cases were fictitious. Unfortunately, the court filings are not accessible through El Paso County’s records or Bloomberg Law. If any Colorado law librarians can obtain these documents, please contact me, and I’ll update this post accordingly.

News articles: 

Zachariah was subsequently sanctioned and suspended:

Ex Parte Allen Michael Lee, No. 10-22-00281-CR, 2023 WL 4624777 (Tex. Crim. App. July 19, 2023)

An opinion by Chief Justice Tom Gray explains that Allen Michael Lee faces charges related to child sexual assault, with bail set at $400,000, which he has not been able to post. Lee sought a bail reduction through a pre-trial habeas corpus application, but the court denied it, leading Lee to argue that the denial was an abuse of discretion due to excessive initial bail. However, his appeal was critiqued for inadequate citation, as the cases he referenced either did not exist or were unrelated to his arguments.

Updates:

David Wagner, This Prolific LA Eviction Law Firm Was Caught Faking Cases In Court. Did They Misuse AI?, LAist (Oct 12, 2023)
Submitted by my co-author Rebecca Fordon

Cuddy Law Firm in New York has been submitting exhibits of transcripts of interactions with ChatGPT with their motions for attorneys’ fees (essentially a back-and-forth to zero in on what is a reasonable rate) in several cases in the S.D.N.Y. [This is an ongoing action, and we’re waiting to see if it is allowed.]
Submitted by reader Jason as a comment (very much appreciated, Jason)

A Spooky Glimpse into the Future

In 2019, Canadian Judge Whitten reduced an attorney’s requested fees on the grounds that the attorney had not utilized AI technology:

The decision concerned a request for attorneys’ fees and expenses by defendant, Port Dalhousie Vitalization Corporation (PDVC). The court granted summary judgment in PDVC’s favor against a woman who sued PDVC after she slipped and fell at an Ontario bar for which PDVC was the landlord. The bar, My Cottage BBQ and Brew, defaulted in the case. In his ruling, Justice Whitten mentioned that the use of AI in legal research would have reduced the amount of time one of the attorneys for the defendant would have spent preparing his client’s case. 

https://www.lexisnexis.com/community/insights/legal/b/thought-leadership/posts/judge-slams-attorney-for-not-using-ai-in-court

In domains where AI can significantly expedite workflows, it could indeed become standard practice for judges to scrutinize fee requests more rigorously. Attorneys might be expected to leverage the latest technological tools to carry out tasks more efficiently, thereby justifying their fees. In this scenario, sticking to traditional, manual methods could be perceived as inefficient, and therefore, not cost-effective, leading to fee reductions. This has led many people to wonder if AI will expedite the decline of the billable hour (for more on that please see this fantastic discussion on 3 Geeks and a Law Blog, AI-Pocalypse: The Shocking Impact on Law Firm Profitability).

We hope that you have a Happy Halloween!

Coding with ChatGPT: A Journey to Create A Dynamic Legal Research Aid

I haven’t quite gotten this whole ChatGPT thing. I’ve attended the webinars and the AALL sessions. I generally understand what it’s doing beneath the hood. But I haven’t been able to find a need in my life for ChatGPT to fill. The most relevant sessions for me were the AALS Technology Law Summer Webinar Series with Tracy Norton of Louisiana State University. She has real-world day-to-day examples of when she has been able to utilize ChatGPT, including creating a writing schedule and getting suggestions on professional development throughout a career. Those still just didn’t tip the balance for me.

A few weeks ago, I presented to one of our legal clinics and demonstrated a form that our Associate Director, Tara Mospan, created for crafting an efficient search query. At its heart, the form is a visual representation of how terms and connectors work with each other: five columns of five boxes, where each column represents variations of a term, with connectors between the columns. For a drunk driving case, the term in the first box could be car, and below that we would put synonyms like vehicle or automobile. The second column could include drunk, inebriated, and intoxicated. We would then choose the connector between the columns, whether it be AND, w/p, w/s, or w/#. Finally, we write out the whole search query at the bottom: (car OR vehicle OR automobile) w/s (drunk OR inebriated OR intoxicated).
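The combine-the-columns logic the worksheet captures is simple enough to sketch in a few lines of Python. To be clear, this is my own illustration of the idea, not the student's Excel macro or the form's actual code:

```python
# Sketch of the worksheet's logic: each column is a list of synonyms OR'd
# together, and a connector joins adjacent columns.
def build_query(columns, connector="w/s"):
    parts = []
    for terms in columns:
        terms = [t for t in terms if t]  # ignore empty boxes
        if not terms:
            continue
        # Parentheses only when a column holds more than one term.
        parts.append(f"({' OR '.join(terms)})" if len(terms) > 1 else terms[0])
    return f" {connector} ".join(parts)

query = build_query([
    ["car", "vehicle", "automobile"],
    ["drunk", "inebriated", "intoxicated"],
])
print(query)  # (car OR vehicle OR automobile) w/s (drunk OR inebriated OR intoxicated)
```

The JavaScript embedded in the finished PDF does essentially the same thing, just wired to form fields and drop-downs instead of Python lists.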

Created years ago by Tara Mospan, this worksheet is loved by ASU Law students who frequently request copies from the law librarians even years after they use it for Legal Research and Writing.

After the presentation, I offered a student some extra copies of the form. She said no, that I presented to her legal writing class last year and she was so taken with the form that she had recreated it in Excel. Not only that, she used macros to transform the entered terms into a final query. I was impressed and asked her to send me a copy. It was exactly as she had described, using basic commands to put the terms together, with OR between terms within a column, and drop downs of connectors. She had taken our static form and transformed it into a dynamic utility.

An ASU Law student recreated the Crafting an Efficient Search PDF using Excel so that it had drop-downs.

Now I was inspired: What if I could combine the features of her Excel document with the clean layout of our PDF form? Finally, I saw a use for ChatGPT in my own life. I had read about how well ChatGPT does with programming and it seemed like the perfect application. It could help me create a fillable PDF, with nuanced JavaScript code to make it easy to use and visually appealing.

I went into ChatGPT and wrote out my initial command:

I am trying to create a fillable PDF. It will consist of five columns of text boxes, and each column will have five boxes. Search terms will be placed in the boxes, although not necessarily in every box. There will be a text box at the bottom where the terms from the boxes above will be combined into a string. When there are entries in multiple boxes in a column, I want the output to put a set of parentheses around the terms and the word OR between each term.

ChatGPT immediately gave me a list of steps, including the JavaScript code for the results box. I excitedly followed the directions to the letter, saved my document, and tested it out. I typed car into the first box and…nothing. It didn’t show up in the results box. I told ChatGPT the problem:

The code does not seem to be working. When I enter terms in the boxes, the text box at the bottom doesn’t display anything.

And this began our back and forth. The whole process took around four hours. I would explain what I wanted, it would provide code, and I would test it. When there were errors, I would note the errors and it would try again. A couple times, the fix to a minor error would start snowballing into a major error, and I would need to go back to the last working version and start over from there. It was a lot like having a programming expert working with you, if they had infinite patience but sometimes lacked basic understanding of what you were asking.

For many things, I had to go step-by-step to work through a problem. Take the connectors, for example. I initially just had AND between them as a placeholder. I asked it to replace the AND with a drop-down menu to choose the connector. The first implementation of this ended up replacing OR between the synonyms instead of the second needed search term. We went back and forth until the connector option worked between the first two columns of terms. Then we worked through the connector between columns two and three, and so on.

At times, it was slow going, but it was still much faster than learning enough JavaScript to program it myself. ChatGPT was also able to easily program minor changes that made the form much more attractive, like not having parentheses appear unless there are two terms in a column, and not displaying the connector unless there are terms entered on both sides of it. And I was able to add a “clear form” button at the end that cleared all of the boxes and reverted the connectors back to the AND option, with only one exchange with ChatGPT.

Overall, it was an excellent introduction to at least one function of AI. I started with a specific idea and ended up with a tangible product that functioned as I initially desired. It was a bit more labor intensive than the articles I’ve read led me to believe, but the end result works better than I ever would have imagined. And more than anything, it has gotten me to start thinking about other projects and possibilities to try with ChatGPT.

Unlocking the Power of Semantic Searches in the Legal Domain

The language of law has many layers. Legal facts are more than objective truths; they tell the story and ultimately decide who wins or loses. A statute can have multiple interpretations, and those interpretations depend on factors like the judge, context, purpose, and history of the statute. Legal language has distinct features, including rare legal terms of art like “restrictive covenant,” “promissory estoppel,” “tort,” and “novation.” This complex legal terminology poses challenges for normal semantic search queries.  

Vector databases represent an exciting new trend, and for good reason. Rather than relying on traditional Boolean logic, semantic search leverages word associations by creating embeddings and storing them in a vector database. In machine learning and natural language processing, embeddings depict words or sentences as dense vectors of real numbers in a continuous vector space. This numerical representation of text is typically generated by a model that tokenizes the text and learns embeddings from the data. Vectors capture the contextual and semantic meaning of each word. When a user makes a semantic query, the search system works to interpret their intent and context. The system then breaks the query into individual words or tokens, converts them into vector representations using embedding models, and returns ranked results based on their relevance. Unlike Boolean search, which requires specific syntax (“AND”, “OR”, etc.), semantic search allows for queries in natural language and opens up a whole new world of potential when searches are not constrained by the rules of exact text matching. 
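To see why vectors help, consider a toy example. The three-dimensional "embeddings" below are invented for illustration – real models learn hundreds of dimensions from data – but they show how synonyms that share no characters can still land close together in vector space:

```python
# Toy illustration: exact-text matching treats "car" and "automobile" as
# unrelated strings, but their (invented) embedding vectors are close,
# while an unrelated term like "tort" is far from both.
import math

vectors = {
    "car":        [0.9, 0.1, 0.0],
    "automobile": [0.85, 0.15, 0.05],
    "tort":       [0.0, 0.2, 0.95],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Boolean search sees two different strings...
assert "car" != "automobile"
# ...but cosine similarity over the vectors recovers the relationship.
sim_syn = cosine(vectors["car"], vectors["automobile"])
sim_diff = cosine(vectors["car"], vectors["tort"])
```

Here sim_syn comes out near 1.0 while sim_diff is near 0, which is exactly the signal a semantic search system ranks on.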

However, legal language differs from everyday language. The large number of technical terms, the careful precision, and the fluid interpretations inherent in law mean that semantic search systems may fail to grasp the context and nuances of legal queries. The interconnected and evolving nature of legal concepts poses challenges in neatly mapping them into an embedding space representation. One potential way to improve semantic search in the legal domain is by enhancing the underlying embedding models. Embedding models are often trained on generalized corpora like Wikipedia, giving them a broad but shallow understanding of law. This surface-level comprehension proves insufficient for legal queries, which may seem simple but have layers of nuance. For example, when asked to retrieve the key facts of a case, an embedding model might struggle to discern what facts are relevant versus extraneous details.  

The model may also fail to distinguish between majority and dissenting opinions due to a lack of the legal background needed to make such differentiations. Training models on domain-specific legal data represents one promising approach to overcoming these difficulties. By training on in-depth legal corpora, embeddings could better capture the subtleties of legal language, ideas, and reasoning. For example, Legal-BERT (BERT stands for Bidirectional Encoder Representations from Transformers) was pre-trained on the CaseHOLD dataset. This corpus is large (37GB), representing 3,446,187 legal decisions across all federal and state courts – larger than the BookCorpus/Wikipedia corpus originally used to train the BERT model. When tested on LexGLUE, a benchmark dataset for evaluating the performance of NLP methods on legal tasks, Legal-BERT performed better than ChatGPT.  

Semantic search shows promise for transforming legal research, but realizing its full potential in the legal domain poses challenges. Legal language is complex and can make it difficult for generalized embedding models to grasp the nuances of legal queries. However, recent optimized legal embedding models indicate these hurdles can be overcome by training on ample in-domain data. Still, comprehensively encoding the interconnected, evolving nature of legal doctrines into a unified embedding space remains an open research problem. Hybrid approaches combining Boolean and vector models are a promising new frontier that many researchers are exploring. 

Realizing the full potential of semantic search for law remains an ambitious goal requiring innovative techniques. But the payoff could be immense – responsive, accurate AI assistance for case law research and analysis. While still in its promising infancy, the continued maturation of semantic legal search could profoundly augment the capabilities of legal professionals. A shift from generic to domain-specific models holds promise.