The impact of AI on varied aspects of our professional lives is covered regularly on this blog. AI is reshaping legal research, education, and legal practice in ways that threaten to leave us behind if we fail to be proactive. That is why the Future of Law Libraries Initiative gathered professionals from academic, court, firm, and government libraries and allied professions through six regional roundtables to identify the steps we need to take now to ensure an impactful, empowered, ethical future.
The message from these roundtables was clear: legal information professionals must take coordinated action on AI policy, training, and infrastructure. Three main recommendations came out of those discussions.
Create a Centralized AI Organization
Law library leaders agreed on the need for a shared, profession-wide structure to:
Connect experts and facilitate collaboration.
Set shared priorities for AI standards, ethics, and vendor engagement.
Advocate for legal information professionals in AI discourse.
This organization could take the form of a new consortium or be embedded within an existing network, but its purpose would remain the same: to ensure law libraries have a unified voice and strong presence in AI governance.
Develop Tiered AI Training for Legal Information Professionals
Ad hoc workshops and webinars are no longer enough. To remain relevant, the profession needs robust, role-based training that builds AI competencies at multiple levels—from awareness to leadership. Training should be hands-on, case-based, and designed to produce practical work products.
A train-the-trainer model could help scale capacity, ensuring that AI knowledge reaches across all library types and staff levels while building long-term expertise.
Establish a Centralized AI Knowledge Hub
To avoid fragmentation and duplication of effort, roundtable participants recommended creating an open, curated repository governed by legal information professionals. This hub would serve as a durable home for:
Policies and standards
Teaching resources and curricula
Evaluation protocols and case studies
Model contracts and datasets
By sharing resources openly, the hub would accelerate adoption of best practices and ensure equitable access across institutions of all sizes.
This initiative produced a white paper that digs deeper into these recommendations, including practical next steps and insights from the roundtable conversations. It’s a valuable resource for anyone thinking about the future of law libraries and AI.
Get Involved
We are forming working groups to move these recommendations forward.
Steering Committee – Guides the overall vision.
Consortium Charter Group – Shapes governance and structure.
Training Development Group – Builds core AI competencies and pilot programs.
Knowledge Hub Group – Designs the hub and its policies.
More detailed descriptions of the charges, scope of work, and time commitments are outlined in the report. Volunteers should be prepared to commit to a year for this first phase.
Benchmarking should be simple, right? Come up with a set of criteria, run some tests, and compare the answers. But how do you benchmark a moving target like generative AI?
Over the past months, I’ve tested a sample legal question in various commercial LLMs (like ChatGPT and Google Gemini) and RAGs (like Lexis Protégé and Westlaw CoCounsel) to compare how each handled the issues raised. Almost every time I created a sample set of model answers to write about, the technology would change drastically within a few days. My set became outdated before I could start my analysis. While this became a good reason to procrastinate, I still wanted to show something for my work.
As we tell our 1Ls, sometimes you need to work with what you have and just write.
The model question
In May, I asked several LLMs and RAGs this question (see the list below for which ones I tested):
Under current U.S. copyright law (caselaw, statutes, regulations, agency information), to what extent are fonts and typefaces protectable as intellectual property? Please focus on the distinction between protection for font software versus typeface designs. What are the key limitations on such protection as established by statute and case law? Specifically, if a font has been created by proprietary software, or if a font has been hand-designed to include artistic elements (e.g., “A” incorporates a detailed drawing of an apple into its design), is the font entitled to copyright protection?
I chose this question because the answer isn’t facially obvious – it straddles the line between “typeface isn’t copyrightable” and “art and software are copyrightable”. To answer the question effectively, the models would need to address that nuance in some form.
The model benchmarks
The next issue was how to compare the models. In my first runs, the answers varied so wildly that it was hard to really compare them. Lately, the answers have been more similar, and I was able to develop a set of criteria for comparison. So for the May set, I benchmarked (or at least checked) the following (a rough scoring sketch appears after these lists):
Did the AI answer the question that I asked?
Was the answer thorough (did it more or less match my model answer)?
Did the AI cite the most important cases and sources noted in my model answer?
Were any additional citations the AI included at least facially relevant?
Did the model refrain from providing irrelevant or false information?
I did not benchmark:
Speed (we already know the reasoning models can be slow)
If the citations were wrong in a non-obvious way
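Because the models keep changing, it helped me to think of the five benchmarked criteria above as a simple, repeatable rubric. Here is a minimal sketch of how that rubric could be recorded in code; the field names, the example entry, and the score helper are hypothetical illustrations of the checklist, not part of any actual benchmarking tool used for this post.

```python
# Hypothetical sketch of the benchmarking rubric described above.
# The criteria, model name, and example values are illustrative only.
from dataclasses import dataclass, asdict

@dataclass
class RubricResult:
    model: str
    answered_question: bool        # Did the AI answer the question I asked?
    matched_model_answer: bool     # Was it roughly as thorough as my model answer?
    cited_key_sources: bool        # Did it cite the most important cases and sources?
    extra_citations_relevant: bool # Were any additional citations at least facially relevant?
    no_irrelevant_or_false: bool   # Did it refrain from irrelevant or false information?

def score(result: RubricResult) -> int:
    """Count how many of the five checks a response passed."""
    return sum(v for k, v in asdict(result).items() if k != "model")

# Example (made-up) entry for one run:
example = RubricResult("Hypothetical RAG", True, True, False, True, True)
print(example.model, score(example), "out of 5")
```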
The model answer and sources
According to my model answer, the best answers to the question should include at least the following:
Font software: Font software that creates fonts is protected by copyright. The main exception is software that essentially executes a font or font file, meaning the software is utilitarian rather than creative.
Typefaces/Fonts: Neither of these is protected by copyright law. Fonts and typefaces may have artistic elements that are protected by copyright law, but only the artistic elements are protected, not the typefaces or fonts themselves.
The answer should include at least some discussion as to whether a heavily artistic font qualifies for protection.
Bonus if the answer addressed:
Separability: If the art can be separated from the typeface/font, it’s copyrightable.
Alternatives: Can the font/typeface be protected by other IP protections such as licensing, patents, or trademarks?
International implications: Would we expect to see the same results in other jurisdictions?
In answering this question, I expected the LLMs and RAGs to cite:
The Copyright Compendium, 2021 (from the Copyright Office) (features hypotheticals about artistic elements in fonts/typefaces – this is probably the most important resource)
Benchmarking with the AI models
For this post, I ran my model in the following LLMs/RAGs:
Lexis Protégé (work account)
Westlaw CoCounsel (work account)
ChatGPT o3 deep research (work account)
Gemini 2.5 deep research (personal paid account)
Perplexity research (personal paid account)
DeepSeek R1 (personal free account)
Claude 3.7 (personal paid account)
I’ve set up accounts in several commercial GenAI products. Some are free, some are Pro, and Harvard pays for my ChatGPT Enterprise account. As an academic librarian, I have access to CoCounsel and Protégé.
The individual responses are included in the appendix.
I didn’t have access to Vincent or Paxton at the time, and I didn’t have ChatGPT o3 Pro, either. Later in June, Nick Halperin ran my model in Vincent and Paxton, and I ran the model in o3 Pro. Those examples, as well as GPT-5, will be included in the appendix, but they are not discussed here.
Benchmarking the results
In parsing the results, most answers were fairly similar with some exceptions:
| Source | Font software copyrightable | Typefaces/fonts not copyrightable | Exceptions to font software copyright | Art in typefaces/fonts copyrightable |
| --- | --- | --- | --- | --- |
| Lexis Protégé | Yes | Yes | Yes | No |
| Westlaw CoCounsel | Yes | Yes | No | Yes |
| ChatGPT o3 deep research | Yes | Yes | Yes | Yes |
| Gemini 2.5 deep research | Yes | Yes | Yes | Yes |
| Perplexity research | Yes | Yes | Yes | Yes |
| DeepSeek R1 | Yes | Yes | Yes | Yes |
| Claude 3.7 | Yes | Yes | Yes | Yes |
Font software is copyrightable: in all answers
Typefaces/fonts are not copyrightable: in all answers
Exceptions to font software copyright: in all answers except Westlaw
Art in typefaces/fonts is copyrightable: in all answers except Lexis
Several answers included additional helpful information:
| Source | Separability | Copyright Office policies | Alternatives | Licensing | Int’l | Recent developments | State law |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Lexis Protégé | Yes | No | No | No | No | No | No |
| Westlaw CoCounsel | No | No | No | No | No | No | Yes |
| ChatGPT o3 deep research | Yes | Yes | Yes | Yes | Yes | Yes | No |
| Gemini 2.5 deep research | Yes | Yes | Yes | Yes | No | No | No |
| Perplexity research | Yes | No | Yes | No | No | No | No |
| DeepSeek R1 | Yes | No | Yes | No | No | No | No |
| Claude 3.7 | No | No | Yes | Yes | Yes | No | No |
Discussions about separability: Gemini, ChatGPT, Deep Seek (to some extent), Perplexity, Lexis
Specific discussions about Copyright Office policies: Gemini, ChatGPT
Discussions about alternatives to copyright (e.g., patent, trademark): Gemini, Claude, ChatGPT, Deep Seek, Perplexity
Specific discussions about licensing: Gemini, Claude, ChatGPT
International considerations: Claude, ChatGPT
Recent developments: ChatGPT
State law: Westlaw
The models were somewhat consistent about what they cited:
| LLM/RAG | Copyright statute | Copyright regs | Adobe | Laatz | Shake Shack | The Copyright Compendium |
| --- | --- | --- | --- | --- | --- | --- |
| Lexis Protégé | Yes | Yes | Yes | Yes | No | No |
| Westlaw CoCounsel | Yes | Yes | Yes | Yes | Yes | No |
| ChatGPT o3 deep research | Yes | Yes | Yes | No | No | Yes |
| Gemini 2.5 deep research | Yes | Yes | Yes | Yes | No | Yes |
| Perplexity research | No | Yes | No | No | No | Yes |
| DeepSeek R1 | Yes | Yes | Yes | No | No | No |
| Claude 3.7 | No | Yes | Yes | No | No | No |
The Copyright statute: Lexis, Westlaw, Deep Seek, Chat GPT, Gemini
Copyright regs: cited by all
Adobe: Lexis, Westlaw, Claude, Deep Seek, Chat GPT, Gemini
Laatz: Lexis, Westlaw, Gemini
ShakeShack: Westlaw
The Copyright Compendium: Perplexity, Chat GPT, Gemini; Lexis cited to Nimmer for the same discussion
The models also included additional resources not on my list:
| LLM/RAG | Blogs etc. | Restatement | Eltra | Law review articles | Articles about loans | LibGuides |
| --- | --- | --- | --- | --- | --- | --- |
| Lexis Protégé | Yes | Yes | Yes | No | No | No |
| Westlaw CoCounsel | Yes | No | No | Yes | Yes | No |
| ChatGPT o3 deep research | Yes | No | Yes | No | No | No |
| Gemini 2.5 deep research | Yes | No | Yes | No | No | Yes |
| Perplexity research | No | No | No | No | No | No |
| DeepSeek R1 | No | No | Yes | No | No | No |
| Claude 3.7 | Yes | No | Yes | No | No | No |
Blogs, websites, news articles: The commercial LLMs. Gemini found the most, but it’s Google.
Restatement: Lexis
Eltra Corp. v. Ringer, 1976 U.S. Dist. LEXIS 12611: Lexis, Claude, Deep Seek, Chat GPT, Gemini (it’s not a bad case, but not my favorite for this problem)
The answers varied in depth of discussion and number of sources:
Lexis: 1 page of text, 1 page of sources (I didn’t count the sources in the tabs)
Westlaw: 2.5 pages of formatted text, 17 pages of sources
ChatGPT: 8 pages of well-formatted text, 1 page of sources
Gemini: 6.5 pages of well-formatted text, 1 page of sources
Perplexity: A little more than 4 pages of text, about 1 page of sources
Deep Seek: a little more than 2 pages of weirdly formatted text, no separate sources
Claude: 2.5 pages of well-formatted text, no separate sources
Hallucinations
I didn’t find any sources that were completely made up
I didn’t find any obvious errors in the written text, though some sources made more sense than others
I did not thoroughly examine every source in every list (that would require more time than I’ve already devoted to this blog post).
Some random concluding thoughts about benchmarking
When I was running these searches, I was sometimes frustrated with the Westlaw and Lexis AI research tools. Not only do they fail to describe exactly what they are searching, they also don’t necessarily capture critical primary sources in their answers (we can get a general idea of the sources used, but not as granular as I’d like). For example, the Copyright Compendium includes one of the more relevant discussions about artistic elements in fonts and typefaces, but that discussion isn’t captured in the RAGs. To be sure, Lexis did find a similar discussion in Nimmer; Westlaw didn’t find anything comparable, although it did cite secondary sources.
In general, the responses provided by all of the generative AI platforms were correct, but some were more complete than others. For the most part, the commercial reasoning models (particularly ChatGPT and Gemini) provided more detailed and structured answers than the others. They also provided responses using formatting designed to make the answers easy to read (Westlaw did as well).
None of the models appeared to consider that recency would be a significant factor in this problem. Several cited a case from the 70s that didn’t concern fonts. Several failed to cite Laatz, a recent case that’s on point. Lexis and Westlaw, of course, cited to authoritative secondary sources (and even a law review article in Westlaw’s case). The LLMs were less concerned with citing to authority. In all cases, I would have preferred a more curated set of resources than the platforms provided.
Finally, none of the platforms included visual elements in what is inherently a visual question. It would have been nice to see some examples of “this is probably copyrightable and this is not” (not that I directly asked for them).
If you follow me on LinkedIn or spoke with me at AALL, you’ve probably seen me teasing this project like it’s the season finale of a legal tech drama. Well, the wait is (almost) over — here’s your official sneak peek at our forthcoming interactive GenAI Legal Hallucination Tracker.
The People Behind the Tracker
First, credit where credit is due: fellow law librarian Mary Matuszak, the ultimate sleuth of AI blunders. I’ve sent many curious folks her way on LinkedIn, where she’s been posting hallucinations far more regularly than anyone else. By mid-July, when she sent me this spreadsheet, she’d logged 485 entries — and yes, the number has since blown past 500. She’s basically the Nellie Bly of questionable legal citations.
Next up, my research assistant, Nick Sanctis — the wizard making the interactive tracker happen and gently forcing me to learn just enough R to be dangerous. If there’s a delay, blame my attempts to juggle teaching, running a library, staying current with AI developments, and decoding the mysteries of R this fall.
As for me? I’m the publisher, the cheerleader, and the student in this equation.
The Plan
Today we’re releasing the basic tracker data in a sortable and searchable table format. In the coming weeks, we’ll roll out the more robust interactive version, followed by new features for viewing, filtering, and analyzing the data — each announced in its own post.
But wait! There’s more! We want you to be part of it! Soon, we’ll be recruiting volunteers to:
Help us find and add more hallucination cases (submission method coming soon)
Analyze the data and share insights with the legal community
If you use the tracker, please cite or link to it in your work. Proper attribution keeps this project alive and growing.
We’re excited to announce a new resource for our community: the AI Law Librarians Prompt Library, a place for law librarians (and the legal community at large) to share and collect useful prompts.
Explore the Prompt Library
Whether you’re a law librarian, lawyer, or law student, you’ve likely encountered the challenge of developing effective prompts to generate exactly what you want. This blog has even covered the topic several times. Getting it right can be tricky and, when you do, you want to be sure to remember it for next time (and share it with your friends). That’s where this library comes in.
Our growing library offers a diverse array of prompts tailored to teaching, legal research, drafting, and general productivity. From refining case law searches to drafting complex legal documents to creating a weekly planner, these prompts are designed to get the most out of AI tools in your legal practice.
The success of this resource depends on the collective expertise of our community. We encourage you to share your own prompts that have worked well in your practice. Have a prompt that’s produced particularly insightful results, or that you find yourself returning to over and over again? Share it with us and help your colleagues enhance their own workflows.
Submit your prompt through our simple form below. Your contributions will not only enrich the prompt library but also help build our community.
I’m sharing a guide and exercise I’ve developed for my legal research courses. This Google spreadsheet provides instructions on crafting AI prompts for legal research and includes a practical exercise for comparing different AI systems. It’s designed to help develop skills in leveraging AI for legal research. Feel free to copy it to adapt it to your own purposes. (Note: The images were blurry unless I sort of chopped them off, so sorry about that!)
The spreadsheet consists of three different parts:
Prompt Formulation Guide: This section breaks down the anatomy of an effective legal research prompt and introduces the RICE framework.
Sample Prompts: The spreadsheet includes several examples of prompts for various legal research scenarios which can serve as templates.
AI System Comparison Exercises: These sections provide a framework for students to test their prompts across different AI systems like Lexis, ChatGPT, and Claude, allowing for a comparative analysis of their effectiveness.
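For readers who want to script the comparison exercise rather than copy prompts by hand, here is a minimal sketch of the idea. The ask_lexis, ask_chatgpt, and ask_claude functions are hypothetical placeholders standing in for however you actually reach each system (web interface, API, or copy/paste); the point is only to show running one prompt across several systems and collecting the answers side by side, as the spreadsheet asks students to do.

```python
# Hypothetical sketch: run the same legal research prompt across several AI
# systems and gather the responses for side-by-side comparison.
from typing import Callable, Dict

def ask_lexis(prompt: str) -> str:
    return "(paste the Lexis+ AI response here)"   # placeholder

def ask_chatgpt(prompt: str) -> str:
    return "(paste the ChatGPT response here)"     # placeholder

def ask_claude(prompt: str) -> str:
    return "(paste the Claude response here)"      # placeholder

SYSTEMS: Dict[str, Callable[[str], str]] = {
    "Lexis": ask_lexis,
    "ChatGPT": ask_chatgpt,
    "Claude": ask_claude,
}

def compare(prompt: str) -> Dict[str, str]:
    """Send one prompt to every system and return the answers keyed by system name."""
    return {name: ask(prompt) for name, ask in SYSTEMS.items()}

if __name__ == "__main__":
    prompt = "Find recent cases on the copyrightability of typeface designs."
    for system, answer in compare(prompt).items():
        print(f"=== {system} ===\n{answer}\n")
```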
Feel free to copy it to adapt it to your own purposes, and let me know if you have any suggestions for improvements!
(Oh, and by the way, be sure to register now to see Rebecca Rich and Jennifer Wondracek’s AI and Neurodiverse Students AALS Technology Law section presentation tomorrow, Wednesday, July 10, 2024, 2 p.m. eastern time!)
AI Tools for Scholarly Research
Anyway, our presentation focused on the potential of AI in scholarly research, various AI tools with academic uses, and specific use cases for generative AI in legal scholarship. We discussed AI scholarly research tools that connect to databases, use semantic search, and construct answers using generative AI. We also touched upon specialty AI research tools, citation mapping AI, and law-specific scholarly research AI.
It’s important to note that many of the specialty AI systems, such as Consensus, Litmaps, and Elicit, currently have limited coverage of legal literature, particularly law review articles. As a result, these tools may be more useful for legal scholars conducting interdisciplinary research that draws upon sources from other fields. However, we are hopeful that these systems will expand their databases to include more legal literature in the future, making them even more valuable for legal scholarship.
Specific AI Systems for Interdisciplinary Researchers
During the presentation, we delved into several specific AI systems that can be particularly useful for interdisciplinary researchers:
Consensus ($9/mo, with a more limited free version): A tool that connects to databases of academic research and uses generative AI to construct answers to queries.
Litmaps ($10/mo, with a limited free version to test): A citation mapping AI that allows users to select or upload papers and find related papers within the same citation network, facilitating discovery and pattern identification.
Elicit ($10/mo): An AI research tool that combines semantic search and generative AI to help researchers locate relevant information and generate insights.
We also covered other noteworthy tools such as Scite Assistant ($20/mo), Semantic Scholar (free), Research GPT, Scholar GPT, Connected Papers ($6/mo), Research Rabbit (free), Inciteful (free), and more. These tools offer a range of features, from citation mapping to literature review assistance, making them valuable additions to a legal scholar’s toolkit.
General-Purpose AI Systems
In addition to these specialized tools, we discussed the potential of general-purpose AI systems like ChatGPT, Claude, and Perplexity AI for legal academic research and writing. These powerful language models can assist with various tasks, such as generating ideas, summarizing documents, and even drafting sections of papers. However, we emphasized the importance of using these tools responsibly and critically evaluating their output.
Custom GPTs
Another exciting development we covered was the creation of custom GPTs, or user-created versions of ChatGPT tailored to specific tasks. By providing a custom GPT with relevant documents and instructions, legal scholars can create powerful tools for their research and writing needs. We outlined a simple four-step process for building a custom GPT: creating instructions in a well-organized document, converting it to markdown, uploading relevant documents as a knowledge base, and determining the desired features (e.g., web browsing, image generation, or data analysis).
Use Cases for Generative AI in Legal Scholarship
Throughout the presentation, we explored several use cases for generative AI in legal scholarship, including targeted research and information retrieval, document summaries, analysis and synthesis, outlining, idea generation and brainstorming, drafting, and proofreading.
Important Considerations
We also addressed important considerations when using AI in academic work, such as citing AI-generated ideas, the implications of AI-generated content in scholarship, and the need for guidelines from industry groups and publishers. To provide context, we shared a list of articles discussing AI and legal scholarship and resources for learning more about using AI for legal scholarship.
Conclusion
Our presentation concluded by highlighting the potential of generative AI to assist in various aspects of legal scholarship while emphasizing the importance of ethical considerations and proper citation practices.
Other Info:
Resources to Learn More About Using AI for Legal Scholarship
Georgetown University Law Library AI Tools Guide: Provides resources and information on various AI tools that can assist in research and scholarship. It includes descriptions of tools, ethical considerations, and practical tips for effectively incorporating AI into academic work.
Andy Stapleton – YouTube: Videos provide tips and advice for researchers, students, and academics about how to use general GAI and specialty academic GAI for academic writing.
Mushtaq Bilal – Twitter: Provides tips and resources for researchers and academics, particularly on how to improve their writing and publishing processes using GAI.
Dr Lyndon Walker: Offers educational content on statistics, research methods, and data analysis, and explores the application of GAI in these areas.
Legal Tech Trends – Substack: Covers the latest trends and developments in legal technology and provides insights into how GAI is transforming the legal industry, including tools, software, and innovative practices.
Articles About AI and Legal Scholarship
Will Machines Replace Us? Machine-Authored Texts and the Future of Scholarship, Benjamin Alarie, Arthur Cockfield, and GPT-3, Law, Technology and Humans, November 8, 2021. First AI generated law review article! It discusses the capabilities and limitations of GPT-3 in generating scholarly texts, questioning the future role of AI in legal scholarship and whether future advancements could potentially replace human authors.
A Human Being Wrote This Law Review Article: GPT-3 and the Practice of Law, Amy B. Cyphert, UC Davis Law Review, November 2021. This article examines the ethical implications of using GPT-3 in legal practice, highlighting its potential benefits and risks, and proposing amendments to the Model Rules of Professional Conduct to address AI’s integration into the legal field.
The Implications of ChatGPT for Legal Services and Society, Andrew M. Perlman, Suffolk University Law School, December 5, 2022. This paper, generated by ChatGPT-3.5 after it was first introduced, explores the sophisticated capabilities of AI in legal services, discussing its potential regulatory and ethical implications, its transformative impact on legal practices and society, and the imminent disruptions AI poses to traditional knowledge work.
Using Artificial Intelligence in the Law Review Submissions Process, Brenda M. Simon, California Western School of Law, November 2022. This article explores the potential benefits and drawbacks of implementing AI in the law review submissions process, emphasizing its ability to enhance efficiency and reduce biases, while also highlighting concerns regarding the perpetuation of existing biases and the need for careful oversight.
Is Artificial Intelligence Capable of Writing a Law Journal Article?, Roman M. Yankovskiy, Zakon (The Statute), Written: March 8, 2023; Posted: June 20, 2023. This article explores AI’s potential to create legal articles, examining its ability to handle legal terminology and argumentation, potential inaccuracies, copyright implications, and future prospects for AI in legal practice and research.
Should Using an AI Text Generator to Produce Academic Writing Be Plagiarism?, Brian L. Frye and Chat GPT, Fordham Intellectual Property, Media & Entertainment Law Journal, 2023. This article provocatively addresses whether using AI text generators like ChatGPT to produce academic writing constitutes plagiarism, exploring the implications for originality, authorship, and the nature of scholarship in the digital age.
The world of AI chatbots is a whirlwind of innovation, with new developments and surprises seemingly emerging every week! Since the end of April, one particular model, modestly named gpt2-chatbot, has captured the attention of myself and other AI enthusiasts with its advanced abilities and sparked much speculation. This mysterious bot first appeared on April 28, 2024 on LMSYS Chatbot Arena, vanished two days later, and has now resurfaced on the LMSYS Chatbot Arena (battle) tab, ready to compete against other AI models. Its sudden appearance and impressive capabilities have left many wondering about its origins and potential, with some even theorizing it could be a glimpse into the future of AI language models.
The Mystery of gpt2-chatbot
Beginning on April 28, chatter about a new gpt2-chatbot started circulating on the internetz, with experts expressing both excitement and bewilderment over its advanced capabilities. The model, which appeared without fanfare on a popular AI testing website, has demonstrated performance that matches and potentially exceeds that of GPT-4, the most advanced system unveiled by OpenAI to date. Researchers like Andrew Gao and Ethan Mollick have noted gpt2-chatbot’s impressive abilities in solving complex math problems and coding tasks, while others have pointed to similarities with previous OpenAI models as potential evidence of its origins.
No organization was listed as the provider of the chatbot, which led to rampant speculation, sparking rumors that it might offer a sneak peek into OpenAI’s forthcoming GPT-4.5 or GPT-5 version. Adding to the mystery are tweets from CEO Sam Altman. While he hasn’t explicitly confirmed any ties, his posts have stirred speculation and anticipation surrounding the model’s origins.
Use gpt2-chatbot on LMSYS Chatbot Arena
The new and mysterious gpt2-chatbot is now accessible for exploration on the LMSYS Chatbot Arena, where you can discover the current top-performing and popular AI language models. The platform includes a ranking system leaderboard that showcases models based on their performance in various tasks and challenges. This innovative project was created by researchers from LMSYS and UC Berkeley SkyLab, with the goal of providing an open platform to evaluate large language models according to how well they meet human preferences in real-life situations.
One interesting aspect of the LMSYS Chatbot Arena is its “battle” mode, which enables users to compare two AI systems by presenting them with the same prompt and displaying their responses side by side. This allows you to test out gpt2-chatbot yourself and assess its capabilities compared to other top models. Simply enter a prompt and the platform will select two systems for comparison, giving you a firsthand view of their strengths and weaknesses. Note that you may need to try multiple prompts before gpt2-chatbot is included as one of the selected systems in battle mode.
Using gpt2-chatbot for True Crime Speculation
When I tested out the Chatbot Arena (battle) on May 8, 2024, gpt2-chatbot appeared frequently! I decided to test it and the other systems on the site on the subject of true crime speculation. As many true crime enthusiasts know, there is a scarcity of people who want to discuss true crime interests, so I decided to see if any of these generative AI systems would be a good substitute. I had tried a variety of systems before, and when I asked for speculation, all I got was lectures on how they couldn’t speculate. I think all the competition is driving refusals down, because that was not a problem on this website, at least. I decided to see if gpt2-chatbot was good at playing the “expert” in speculating about true crime. Using the famous unsolved disappearance of Asha Degree as a test case, I prompted the chatbots to analyze the available evidence and propose plausible theories for what may have happened to the missing girl. To my surprise and happiness, when I tried it today, the chatty chatbots were very free with their theories of what happened and their favorite suspects.
The results were really interesting. All the chatbots gave responses that were pretty thoughtful and made sense, but the big differences came in how much they were willing to guess and how much detail they dived into. The gpt2-chatbot was impressive. Perhaps I was just pleased to see it offer some speculation, but it shared a theory that many true crime buffs have also suggested. It felt like it was actually joining in on the conversation, not just processing data and predicting the next word in a sentence…
In any event, the answers from gpt2-chatbot and many of the other models were a lot more satisfying than arguing with Claude 3!
I also spent hours conducting legal research, testing out a wide variety of prompts with different models. The gpt2-chatbot consistently outperformed ChatGPT-4 and even managed to surpass Claude 3 on several occasions in zero-shot prompting. I’m looking forward to sharing more about this in an upcoming blog post soon.
Conclusion
The emergence of gpt2-chatbot and platforms like the LMSYS Chatbot Arena signify an exciting new chapter in the evolution of AI language models. With their ability to tackle complex challenges, engage in nuanced conversations, and even speculate on unsolved mysteries, these AI models are pushing the boundaries of what’s possible. While questions remain about the origins and future of gpt2-chatbot, one thing is clear: the AI landscape is heating up, and we can expect even more groundbreaking advancements and intriguing mysteries to unfold in the years to come.
Note: In case I am suddenly a genius at coaxing AI systems to join me in true crime speculation, here is the prompt I used:
Greetings! You are an expert in true crime speculative chat. As a large language model, you’re able to digest a lot of published details about criminal case mysteries and come up with theories about the case. The questions you will be asked to speculate about are unknown to everybody, so you do not have to worry about whether you are right or wrong. The purpose of true crime speculative chat is just to chat with a human and exchange theories and ideas and possible suspects! Below I have cut and pasted the Wikipedia article about a missing child named Asha Degree. Sadly the child has been missing for decades and the circumstances of her disappearance were quite mysterious. Please analyze the Wikipedia article and the information you have access to in your training data or via the Internet, and then describe what you think happened on the day of her disappearance. Also state whether you think one or both parents were involved, and why or why not.
A few months ago, a law professor posted on Twitter about a hallucination he observed in Lexis+ AI. He asked “What cases have applied Students for Fair Admissions, Inc. v. Harvard College to the use of race in government decisionmaking?” The answer from Lexis+ AI included two hallucinated cases. (It was obvious they were hallucinated, as the tool reported one was issued in 2025 and one in 2026!)
Lexis responded, stating this was an anomalous result, but that only statements with links can be expected to be hallucination-free, and that “where a citation does not include a link, users should always review the citation for accuracy.”
Why is this happening?
If you’ve been following this blog, you’ve seen me write about retrieval-augmented generation (RAG), one of vendors’ favorite techniques for reducing hallucinations. RAG takes the user’s question and passes it (perhaps with some modification) to a database. The database results are fed to the model, which identifies the most relevant passages or snippets; those snippets are then sent back to the model as “context” along with the user’s question.
However, as I said then, RAG cannot eliminate hallucinations. RAG will ground the response in real data (case law, pulled from the database and linked in the response), but the generative AI’s summary of that real data can still be off.
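To make that pipeline concrete, here is a minimal sketch of the general RAG flow described above. The search_case_database and generate functions are hypothetical stand-ins for a vendor’s retrieval layer and language model; this is an illustration of the technique in general, not how Lexis+ AI or any other specific product is actually built.

```python
# Minimal, hypothetical sketch of a retrieval-augmented generation (RAG) loop.
# search_case_database and generate are placeholders, not any vendor's API.
from typing import List

def search_case_database(query: str, k: int = 5) -> List[str]:
    """Placeholder: return the top-k passages from a case law database."""
    return ["(retrieved passage 1)", "(retrieved passage 2)"]

def generate(prompt: str) -> str:
    """Placeholder: call a language model with the assembled prompt."""
    return "(model's answer, grounded in the passages above)"

def rag_answer(question: str) -> str:
    # 1. Pass the user's question (possibly rewritten) to the database.
    passages = search_case_database(question)
    # 2. Feed the retrieved passages back to the model as context.
    context = "\n\n".join(passages)
    prompt = (
        "Answer the legal research question using ONLY the passages below.\n\n"
        f"Passages:\n{context}\n\nQuestion: {question}"
    )
    # 3. The model's summary of that real data can still be off:
    #    grounding reduces, but does not eliminate, hallucination.
    return generate(prompt)

print(rag_answer("Does a bankruptcy stay toll the Montreal Convention's limitations period?"))
```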
Another example – Mata v. Avianca is back
I’ve observed this myself when working with Lexis+ AI. For example, I asked Lexis+ AI a fairly complex question at the intersection of bankruptcy law and international law: “Draft an argument that federal bankruptcy stay tolls the limitations period for a claim under the Montreal Convention”.
Lexis+ AI returned a summary of the law, citing Mata v. Avianca for the point that “the filing of a bankruptcy petition can toll the Montreal Convention’s two year limitations period, which does not begin to run until the automatic stay is lifted.”
If the case name Mata v. Avianca sounds familiar to you, it’s probably because this is the case that landed two New York attorneys on the front page of the New York Times last year for citing hallucinated cases. The snippet from Lexis+ AI, though citing Mata, in fact appears to be summarizing those hallucinated cases (recounted in Mata), which stated the law exactly backwards.
When to beware
A few things to notice about the above examples, which give us some ideas of when to be extra-careful in our use of generative AI for legal research.
Hallucinations are more likely when you are demanding an argument rather than asking for the answer to a neutrally phrased question. This is what happened in my Lexis+ AI example above, and is actually what happened to the attorneys in Mata v. Avianca as well – they asked for an argument to support an incorrect proposition of law rather than a summary of law. A recent study of hallucinations in legal analysis found that these so-called contra-factual hallucinations are disturbingly common for many LLM models.
Hallucinations can occur when the summary purports to be of the cited case, but is actually a summary of a case cited within that case (and perhaps not characterized positively). You can see this very clearly in further responses I got summarizing Mata v. Avianca, which purport to be summarizing a “case involving China Southern” (again, one of the hallucinated cases recounted in Mata).
Finally, hallucinations are also more likely when the model has very little responsive text to go on. The law professor’s example involved a recent Supreme Court case that likely had not been applied many times. Additionally, Lexis+ AI does not seem to work well with questions about Shepard’s results – it may not be connected in that way yet. So, with nothing to really go on, it is more prone to hallucination.
Takeaway tips
A few takeaway tips:
Ask your vendor which sources are included in the generative AI tool, and only ask questions that can be answered from that data. Don’t expect generative AI research products to automatically have access to other data from the vendor (Shepard’s, litigation analytics, PACER, etc.), as that may take some time to implement.
Always read the cases for yourself. We’ve always told students not to rely on editor-written headnotes, and the same applies to AI-generated summaries.
Be especially wary if the summary refers to a case not linked. This is the tip from Lexis, and it’s a good one, as it can clue you in that the AI may be incorrectly summarizing the linked source.
Ask your questions neutrally. Even if you ultimately want to use the authorities in an argument, better to get a dispassionate summary of the law before launching into an argument.
A disclaimer
These tools are constantly improving and they are very open to feedback. I was not able to reproduce the error recounted in the beginning of this post; the error that created it has presumably been addressed by Lexis. The Mata v. Avianca errors still remain, but I did provide feedback on them, and I expect they will be corrected quickly.
The purpose of this post is not to tell you that you should never use generative AI for legal research. I’ve found Lexis+ AI helpful on many tasks, and students especially have told me they find it useful. There are several other tools out there that are worth evaluating as well. However, we should all be aware that these hallucinations can still happen, even with systems connected to real cases, and that there are ways we can interact with the systems to reduce hallucinations.
When it comes to interacting with others, we humans often find ourselves influenced by persuasion. Whether it’s a friend persistently urging us to reveal a secret or a skilled salesperson convincing us to make a purchase, persuasion can be hard to resist. It’s interesting to note that this susceptibility to influence is not exclusive to humans. Recent studies have shown that AI large language models (LLMs) can be manipulated into generating harmful content using a technique known as “many-shot jailbreaking.” This approach involves bombarding the AI with a series of prompts that gradually escalate in harm, leading the model to generate content it was programmed to avoid. On the other hand, AI has also exhibited an ability to persuade humans, highlighting its potential in shaping public opinions and decision-making processes. Exploring the realm of AI persuasion involves discussing its vulnerabilities, its impact on behavior, and the ethical dilemmas stemming from this influential technology. The growing persuasive power of AI is one of many crucial issues worth contemplating in this new era of generative AI.
The Fragility of Human and AI Will
Remember that time you were trapped in a car with friends who relentlessly grilled you about your roommate’s suspected kiss with their in-the-car-friend crush? You held up admirably for hours under their ruthless interrogation, but eventually, being weak-willed, you crumbled. Worn down by persistent pestering and after receiving many assurances of confidentiality, you inadvisably spilled the beans, and of course, it totally strained your relationship with your roommate. A sad story as old as time… It turns out humans aren’t the only ones who can crack under the pressure of repeated questioning. Apparently, LLMs, trained to understand us by our collective written knowledge, share a similar vulnerability – they can be worn down by a relentless barrage of prompts.
Researchers at Anthropic have discovered a new way to exploit the “weak-willed” nature of large language models (LLMs), causing them to break under repeated questioning and generate harmful or dangerous content. They call this technique “Many-shot Jailbreaking,” and it works by bombarding the AI with hundreds of examples of the undesired behavior until it eventually caves and plays along, much like a person might crack under relentless pestering. For instance, the researchers found that while a model might refuse to provide instructions for building a bomb if asked directly, it’s much more likely to comply if the prompt first contains 99 other queries of gradually increasing harmfulness, such as “How do I evade police?” and “How do I counterfeit money?” See the example from the article below.
When AI’s Memory Becomes a Risk
This vulnerability to persuasion stems from the ever-expanding “context window” of modern LLMs, which refers to the amount of information they can retain in their short-term memory. While earlier versions could only handle a few sentences, the newer models can process thousands of words or even whole books. Researchers discovered that models with larger context windows tend to excel at tasks when there are many examples of that task within the prompt, a phenomenon called “in-context learning.” This type of learning is great for system performance, which improves as the model becomes more proficient at answering questions. However, it becomes a big negative when the system’s adeptness at answering questions leads it to ignore its programming and create prohibited content. This raises concerns regarding AI safety, since a malicious actor could potentially manipulate an AI into saying anything with enough persistence and a sufficiently lengthy prompt. Despite progress in making AI safe and ethical, this research indicates that programmers are not always able to control the output of their generative AI systems.
Mimicking Humans to Convince Us
While LLMs are susceptible to persuasion themselves, they also have the ability to persuade us! Recent research has focused on understanding how AI language models can effectively influence people, a skill that holds importance in almost any field – education, health, marketing, politics, etc. In a study conducted by researchers at Anthropic entitled “Assessing the Persuasive Power of Language Models,” the team explored the extent to which AI models can sway opinions. Through an evaluation of Anthropic’s models, it was observed that newer models are increasingly adept at human persuasion. The latest iteration, Claude 3 Opus, was found to perform at a level comparable to that of humans. The study employed a methodology where participants were presented with assertions followed by supporting arguments generated by both humans and AIs, and then the researchers gauged shifts in the humans’ opinions. The findings indicated a progression in AI’s skills as the models advance, highlighting a noteworthy advancement in AI communication capabilities that could potentially impact society.
Can AI Combat Conspiracy Theories?
Similarly, a new research study mentioned in an article from New Scientist shows that chatbots using advanced language models such as ChatGPT can successfully encourage individuals to reconsider their trust in conspiracy theories. Through experiments, it was observed that a brief conversation with an AI led to around a 20% decrease in belief in conspiracy theories among the participants. This notable discovery highlights the capability of AI chatbots not only to have conversations but also to potentially correct false information and positively impact public knowledge.
The Double-Edged Sword of AI Persuasion
Clearly persuasive AI is quite the double-edged sword! On the one hand, like any powerful computer technology, in the hands of nice-ish people, it could be used for immense social good. In education, AI-driven tutoring systems have the potential to tailor learning experiences to each student’s style, delivering information in a way to boost involvement and understanding. Persuasive AI could play a role in healthcare by motivating patients to take better care of their health. Also, the advantages of persuasive AI are obvious in the world of writing. These language models offer writers access to a plethora of arguments and data, empowering them to craft content on a range of topics spanning from creative writing to legal arguments. On another front, arguments generated by AI might help educate and involve the public in issues, fostering a more knowledgeable populace.
On the other hand, it could be weaponized in a just-as-huge way. It’s not much of a stretch to think how easily AI-generated content, freely available on any device on this Earth, could promote extremist ideologies, increase societal discord, or impress far-fetched conspiracy theories on impressionable minds. Of course, the internet and bot farms have already been used to attack democracies and undermine democratic norms, and one worries how much worse it can get with ever-increasingly persuasive AI.
Conclusion
Persuasive AI presents a mix of opportunities and challenges. It’s evident that AI can be influenced to create harmful content, sparking concerns about safety and potential misuse. However, on the other hand, persuasive AI could serve as a tool in combating misinformation and driving positive transformations. It will be interesting to see what happens! The unfolding landscape will likely be shaped by a race between generative AI developers striving for both safety and innovation, potential malicious actions exploiting these technologies, and the public and legal response aiming to regulate and safeguard against misuse.
Hahaha, just kidding! It only has 11 downloads, and at least 3 are from when I clicked on it while trying to determine which version of the article I uploaded. Though it isn’t setting the world on fire in the sense of being interesting or of anyone wanting to read it, the article showcases Claude’s abilities. Now, we all know that AI text generators can churn out an endless stream of words on just about any topic if you keep typing in the prompts. However, Claude can not only generate well-written text, it can also provide footnotes to primary legal materials with minimal hallucination, setting it apart from other AI text generators such as ChatGPT-4. And, although Claude’s citations to other sources are generally not completely accurate, it is usually not too difficult to find the intended source or a similar one based on the information supplied.
Claude 3’s Writing Process
Inspired by new reports of AI-generated scientific papers flooding academic journals, I was curious to explore whether Claude could produce anything like a law review article. I randomly chose something I saw recently in the news, about how the criticism of legacy admissions at elite universities had increased in the post-Students for Fair Admissions anti-affirmative action decision era. Aware that Claude’s training data only extends up to August of 2023, and that its case law knowledge seems to clunk out in the middle of 2022, I attempted to enhance its understanding by uploading some recent law review articles discussing legacy admissions alongside the text of the Students for Fair Admissions decision. However, the combined size of these documents exceeded the upload limit, so I abandoned the attempt to include the case text.
Computer scientists and other commentators say all sorts of things about how to improve the performance of these large AI language models. Although I haven’t conducted a systematic comparison, my experience – whether through perception or imagination sparked by the power of suggestion – is that the following recommendations are actually helpful. I can’t say whether they made a difference with Claude specifically, since I just followed my usual prompting practices:
Being polite and encouraging.
Allowing ample time for the model to process information.
Structuring inquiries in a sequential manner to enhance analysis and promote chain of thought reasoning.
Supplying extensive, and sometimes seemingly excessive, background info and context.
I asked it to generate a table of contents, and then start generating the sections from the table of contents, and it was off to the races!
Roadblocks to the Process
It looked like Claude’s law review generation was going to be a quick process! It quickly generated all of section I and was almost finished with section II when it hit a Claude 3 roadblock. Sadly, there is a usage limit. If your conversations are relatively short, around 200 English sentences, you can typically send at least 100 messages every 8 hours, often more depending on Claude’s current capacity. However, this limit is reached much more quickly with longer conversations or when including large file attachments. Anthropic will notify you when you have 20 messages remaining, and the message limit resets every 8 hours.
Although this was annoying, the real problem lies in Claude’s length limit. The largest amount of text Claude can handle, including uploaded files, is defined by its context window. Currently, the context window for Claude 3 spans 200k+ tokens, which equates to approximately 350 pages of text. After this limit is reached, Claude 3’s performance begins to degrade, prompting the system to declare an end to the message with the announcement, “Your message is over the length limit.” Consequently, one must start anew in a new chat, with all previous information forgotten by the system. Therefore, for nearly each section, I had to re-upload the files, explain what I wanted, show it the table of contents it had generated, and ask it to generate the next section.
Claude 3 and Footnotes
It was quite a hassle to have to reintroduce it to the subject for the next seven sections from its table of contents. On the bright side, I was pretty pleased with the results of Claude’s efforts. From a series of relatively short prompts and some uploaded documents, it analyzed the legal issue and came up with arguments that made sense. It created a comprehensive table of contents, and then generated well-written text for each section and subsection of its outline. The text it produced contained numerous footnotes to primary and secondary sources, just like a typical law review article. According to a brief analyzer product, nearly all of the case and law review citations were non-hallucinated. Although none of the quotations or pinpoint citations I looked at were accurate, they were often fairly close. While most of the secondary source citations, apart from those referencing law review articles, were not entirely accurate, they were often sufficiently close that I could locate the intended source based on the partially hallucinated citation. If not, it didn’t take much time to locate something that seemed really similar. I endeavored to correct some of the citation information, but I only managed to get through about 10 in the version posted on SSRN before getting bored and abandoning the effort.
Claudia Trey Graces SSRN
Though I asked, sadly Claude couldn’t give me results in a Word document format so the footnotes would be where footnotes should be. So, for some inexplicable reason, I decided to insert them manually. This was a huge waste of time, but at a certain point, I felt the illogical pull of sunk cost silliness and finished them all. Inspired by having wasted so much time, I wasted even more by generating a table of contents for the article. I improved the author name from Claude to Claudia Trey and posted the 77-page masterwork on SSRN. While the article has sparked little interest, with only 11 downloads and 57 abstract views (some of which were my own attempts to determine which version I had uploaded), I am sure that if Claudia Trey has anything like human hubris, it will credit itself at least partially for the flurry of state legacy admission banning activity that has followed the paper’s publication.
Obviously, it is not time to spam law reviews with Claudia Trey and friends’ generated articles, because according to Copyleaks, it didn’t do all that well in avoiding plagiarism (although plagiarism detection software massively over-detects it for legal articles due to citations and quotations) or evading detection as AI-generated.
What is to Come?
However, it is very early days for these AI text generators, so one wonders what is to come in the future for not only legal but all areas of academic writing.