On a couple of podcasts, I’ve heard a lot of hype about Perplexity AI and how it could be a big competitor to Google. Even though I really like new generative AI things, it still sometimes takes hearing about something multiple times before I overcome inertia and finally check it out.
While attempting to write a blog post about whether the memory-impaired ChatGPT-4 could still perform well on a mock bar exam (spoiler alert – my tests so far indicate it can!), I was Googling for information. Specifically, I was searching for articles around the time GPT-4 Passes the Bar Exam was published on SSRN and some background on the paper author’s methodology. It was taking a long time to piece it all together… Then suddenly, I overcame my laziness and decided to check out Perplexity AI. When I reached the site, I realized that I had actually used it before! For whatever reason, I found it more appealing the second time!
Question: Tell me about how ChatGPT-4 passed a mock bar exam and the methodology that was used to arrive at that conclusion. (Note: Click here to view the full answer on the system.)
Watch out, Google! I really love that accessing relevant information is getting so much easier.
Is it just me, or has ChatGPT-4 taken a nosedive when it comes to legal research and writing? There has been a noticeable decline in its ability to locate primary authority on a topic, analyze a fact pattern, and apply law to facts to answer legal questions. Recently, instructions slide through its digital grasp like water through a sieve, and its memory? I would compare it to a goldfish, but I don’t want to insult them. And before you think it’s just me, it’s not just me, the internet agrees!
ChatGPT’s Sad Decline
One of the hottest topics in the OpenAI community, in the aptly named “GPT-4 is getting worse and worse every single update” thread, is the perceived decline in the quality and performance of the GPT-4 model, especially after the November 2023 update. Many users have reported that the model is deteriorating with each update, producing nonsensical, irrelevant, or incomplete outputs, forgetting the context, and ignoring instructions. Some users have even reverted to previous versions of the model or cancelled their subscriptions. Here are some specific quotations from recent comments about the memory problem:
December 2023 – “I don’t know what on Earth is wrong with GPT 4 lately. It feels like I’m talking to early 3.5! It’s incapable of following basic instructions and forgets the format it’s working on after just a few posts.”
December 2023 – “It ignores my instructions, in the same message. I can’t be more specific with what I need. I’m needing to repeat how I’d like it to respond every single message because it forgets, and ignores.”
December 2023 – “ChatGPT-4 seems to have trouble following instructions and prompts consistently. It often goes off-topic or fails to understand the context of the conversation, making it challenging to get the desired responses.”
January 2024 – “…its memory is bad, it tells you search the net, bing search still sucks, why would teams use this product over a ChatGPT Pre Nov 2023.”
February 2024 – “It has been AWFUL this year…by the time you get it to do what you want format wise it literally forgets all the important context LOL — I hope they fix this ASAP…”
February 2024 – “Chatgpt was awesome last year, but now it’s absolutely dumb, it forgets your conversation after three messages.”
OpenAI has acknowledged the issue and released an updated GPT-4 Turbo preview model, which is supposed to reduce the cases of “laziness” and complete tasks more thoroughly. However, the feedback from users is still mixed, and some are skeptical about the effectiveness of the fix.
An Example of Confusion and Forgetfulness from Yesterday
Here is one of many experiences that illustrates the short-term memory and instruction-following issues other ChatGPT-4 users have reported. Yesterday, I asked it to find some Texas cases about the shopkeeper’s defense to false imprisonment. Initially, ChatGPT-4 retrieved and summarized some relatively decent cases. Well, to be honest, it retrieved 2 relevant cases, with one of the two dating back to 1947… But anyway, the decline in case law research ability is a subject for another blog post.
Anyway, in an attempt to get ChatGPT-4 to find the cases on the internet so it could properly summarize them, I provided some instructions and specified the format I wanted for my answers. Click here for the transcript (only available to ChatGPT-4 subscribers).
Confusion ran amok! ChatGPT-4 apparently understood the instructions (which was a positive sign) and presented three cases in the correct format. However, they weren’t the three cases ChatGPT had listed; instead, they were entirely irrelevant to the topic—just random criminal cases.
It remembered… and then forgot. When reminded that I wanted it to work with the first case listed and provided the citation, it apologized for the confusion. It then proceeded to give the correct citation, URL, and a detailed summary, but unfortunately in the wrong format!
Eventually, in a subsequent chat, I successfully got it to take a case it found, locate the text of the case on the internet, and then provide the information in a specified format. However, it could only do it once before completely forgetting about the specified format. I had to keep cutting and pasting the instructions for each subsequent case.
Well, the news is not all bad! While we are on the topic of memory, OpenAI has introduced a new feature for ChatGPT – the ability to remember stuff over time. ChatGPT’s memory feature is being rolled out to a small portion of free and Plus users, with broader availability planned soon. According to OpenAI, this enhancement allows ChatGPT to remember information from past interactions, resulting in more personalized and coherent conversations. During conversations, ChatGPT automatically picks up on details it deems relevant to remember. Users can also explicitly instruct ChatGPT to remember specific information, such as meeting note preferences or personal details. Over time, ChatGPT’s memory improves as users engage with it more frequently. This memory feature could be useful for users who want consistent responses, such as replying to emails in a specific format.
The memory feature can be turned off entirely if desired, giving users control over their experience. Deleting a chat doesn’t erase ChatGPT’s memories; users must delete specific memories individually…which seems a bit strange – see below. For conversations without memory, users can use temporary chat, which won’t appear in history, won’t use memory, and won’t train the AI model.
As we await improvements to our once-loved ChatGPT-4, our options remain limited, pushing us to consider alternative avenues. Sadly, I’ve encountered similar recent shortcomings with Claude 2, once useful for legal research and writing. In my pursuit of alternatives, platforms like Gemini, Perplexity, and Hugging Face have proven less than ideal for research and writing tasks. However, amidst these challenges, Microsoft Copilot has shown promise. While not without its flaws, it recently demonstrated adequate performance in legal research and even took a passable stab at a draft of a memo. Given OpenAI’s recent advancements in the form of Sora, the near-magical text-to-video generator that is causing such hysteria in Hollywood, there’s reason to hope that they can pull ChatGPT back from the brink.
OU Law volunteered to host the South Central “Artificial Intelligence and the Future of Law Libraries” roundtable, and so I was fortunate enough to be allowed to attend. This is the third iteration of a national conversation on what the new AI technologies could mean for the future of law libraries and (more broadly) law librarianship. I thought I would fill you in on my experience and explain a little about the purpose and methodology of the event. The event follows the Chatham House Rule, so I cannot give you specifics about what anybody said, but I can give you an idea of the theme and process that we worked through.
The event takes the entire day and it’s emotionally exhausting, in the best way possible. We were broken into tables of 6 participants. The participants were hand-selected based on their background and experience so that each table had a range of different viewpoints and perspectives.
Then the hosts (in our case, Kenton Brice and Cas Laskowski) walked us through a series of “virtuous cycle, vicious cycle” exercises. They, thankfully, started with the vicious cycle so that you could end each session on a virtuous cycle, positive note. At the end, each table chose a speaker and then we summarized the opinions discussed so that the entire room could benefit from the conversations. Apparently, this is an exercise done at places like the United Nations to triage and prepare for future events. This process went on through 3 full cycles and then we had about an hour of open discussion at the end. We got there at 8am and had breakfast and lunch on-site (both great – thank you Greg Ivy and SMU catering) because it took the entire day.
We had a great mix of academic, government, and private sector perspectives represented at the event, and the diversity of stakeholders and experiences made for robust and thought-provoking conversation. Many times I heard perspectives that had never occurred to me and had my assumptions challenged, refining my own ideas about what the future might look like. Additionally, the presence of people with extensive expertise in specific domains, such as antitrust, copyright, the intricacies of AmLaw 100 firms, and the particular hurdles faced in government roles, enriched the discussions with a depth and nuance that is rare to find. Any one of these areas can require years of experience, so having a wide range of experts to answer questions allowed you to really “get into the weeds” and think things through thoroughly.
I tend to be (perhaps overly) optimistic about the future of these technologies and so it was nice to have my optimism tempered and refined by people who have serious concerns about what the future of law libraries might look like. While the topics presented were necessarily contentious, everybody was respectful and kind in their feedback. We had plenty of time for everybody to speak (so you didn’t feel like you were struggling to get a word in).
You’d think that 8 hours of talking about these topics would be enough but we nearly ran over on every exercise. People have a lot of deep thoughts, ideas, and concerns about the state and future of our industry. Honestly, I would have been happy to have this workshop go on for several days and cover even more topics if that was possible. I learned so much and gained so much value from the people at my table that it was an incredibly efficient way to get input and share ideas.
Unlike other conferences and events that I’ve attended, this one felt revolutionary – as in, we truly need to change the status quo in a big way and start getting to work on new ways to tackle these issues. “Disruptive” has become an absolute buzzword inside of Silicon Valley and academia, but now we have something truly disruptive and we need to do something about it. Bringing all these intelligent people together in one room fosters an environment where disparate, fragmented ideas can crystallize into actionable plans, enabling us to support each other through these changes.
The results from all of these roundtables are going to be published in a global White Paper once the series has concluded. Each roundtable has different regions and people involved and I can’t wait to see the final product and hear what other roundtables had to say about these important issues. More importantly, I can’t wait to be involved in the future projects and initiatives that this important workshop series creates.
I have been listening to, thinking about, and participating in conversations about how generative AI will be integrated into the practice of law. Most of these conversations center on how it will be integrated into legal documents, which is not surprising considering how many lawyers have gotten in trouble for this and how quickly our research and writing products are integrating the technology. But there is more to legal practice than creating client and/or court documents. In fact, there are many more business uses of generative AI than just research and drafting.
This past fall, I was asked to lead an AI session for Capital University’s joint venture with the Columbus College of Art & Design, the Institute for Creative Leadership at Work. I was asked to adapt my presentation to HR professionals and focus on SHRM compliance principles. I enjoyed the deep dive into this world, and I came away from my research with a lot of great ideas for my session, Bard, Bing, and ChatGPT, Oh My!: Possible Ethical Uses of Generative AI at Work, such as tabletop emergency exercises, social media posts, job descriptions, and similar tasks.
This week, I have been thinking about how everyone’s focus has really been around legal documentation, my own included. But there are an amazing number of backend business tasks that could also utilize AI in a positive way. The rest of the world, including HR, has been focusing on them for a while, but we seem to have lost track of these business tasks.
Here are some other business uses of generative AI and prompts that I think hold great promise.
Drafting job descriptions
Pretend that you are an HR specialist for a small law firm in the United States. Draft a job description for a legal secretary who focuses on residential real estate transactions but may assist with other transactional legal matters as needed. [Include other pertinent details of the position]. The job description will be posted in the following locations [fill in list]
Creating tabletop simulations to work through crisis/emergency plans:
You are an HR specialist who is helping plan for and test the company’s responses to a variety of situations. First is an active shooter in the main building. A 5th grade tour of the facilities is underway on the third floor. Create a detailed tabletop simulation to test this.
Second scenario: The accounting department is celebrating the birthday of the administrative assistant and is having cake in the breakroom. The weather has turned bad, and an F4 tornado is spotted half a mile away. After 15 minutes, the tornado strikes the building directly. Create a detailed tabletop simulation to test the plan and response for this event.
Assisting with lists of mandatory and voluntary employee trainings
Pretend that you are an HR professional who works for a law firm. You are revamping the employee training program. We need to create a list of mandatory trainings and a second list of voluntary trainings. Please draft a list of trainings appropriate to employees in a law firm setting.
Assisting with social media posting creation:
Pretend that you are a professional social media influencer for the legal field. Draft an Instagram post, including creating a related image, to celebrate Law Day, which is coming up on May 1st. Make sure that it is concise and Instagram appropriate. Please include hashtags.
Assisting with creating employee policies or handbooks (verify content!):
Pretend that you are an information security professional. Draft an initial policy for a law firm regarding employee AI usage for company work. The company wants to allow limited use of generative AI. They are very worried that their proprietary and/or confidential client data will be accidentally released. Specify that only your custom AI system – [name firm-specific or specialized AI with a strong privacy contract clause] – can be used with company data. The policy must also take into consideration the weaknesses of all AI systems, including hallucinations, potential bias, and security issues.
Assisting with making sure your web presence is ADA accessible:
Copilot/web-enabled Prompt: Pretend that you are a graphic designer who has been tasked with making sure that a law firm’s online presence is ADA accessible. Please review the site [insert link], run an ADA compliance audit, and provide an accessibility report, including suggestions on what can be done to fix any accessibility issues that arise.
Assisting with welcoming and onboarding new employees:
Create a welcome message for a new employee. Tell them that the benefits orientation will be at 9 am in the HR conference room on the next first Tuesday of the month. Payday is on the 15th and last day of each month, unless payday falls on a weekend or federal holiday, in which case it will be the Friday before. Employees should sign up for the mandatory training that will be sent to them in an email from IT.
(One I just used IRL) Pretend that you are an HR specialist in a law library. A new employee is starting in 6 weeks, and the office needs to be prepared for her arrival. [Give specific title and any specialized job duties, including staff supervision.] Create an onboarding checklist of important tasks, such as securing keys and a parking permit, asking IT to set up their computer, email address, and telephone, asking the librarians to create passwords for the ILS, Libguides, and similar systems, etc.
Last week, my plan was to publish a blog post about creating a GPT goofily self-named Summarizer Pro to summarize articles and organize citation information in a specific format for inclusion in a LibGuide. However, upon revisiting the task this week, I find myself first compelled to discuss the recent and thrilling advancements surrounding GPTs – the ability to incorporate GPTs into a ChatGPT conversation.
What is a GPT?
But, first of all, what is a GPT? The OpenAI website explains that GPTs are specialized versions of ChatGPT designed for customized applications. These unique GPTs enable anyone to modify ChatGPT for enhanced utility in everyday activities, specific tasks, professional environments, or personal use, with the added ability to share these personalized versions with others.
To create or use a GPT, you need access to ChatGPT’s advanced features, which require a paid subscription. Building your own customized GPT does not require programming skills. The process involves starting a chat, giving instructions and additional information, choosing capabilities like web searching, image generation, or data analysis, and iteratively testing and improving the GPT. Below are some popular examples that ChatGPT users have created and shared in the ChatGPT store:
This was already exciting, but last week they introduced a feature that takes it to the next level – users can now invoke a specialized GPT within a ChatGPT conversation. This is being referred to as “GPT mentions” online. By typing the “@” symbol, you can choose from GPTs you’ve used previously for specific tasks. Unfortunately, this feature hasn’t rolled out to me yet, so I haven’t had the chance to experiment with it, but it seems incredibly useful. You can chat with ChatGPT as normal while also leveraging customized GPTs tailored to particular needs. For example, with the popular bots listed above, you could ask ChatGPT to summon Consensus to compile articles on a topic. Then call on Write For Me to draft a blog post based on those articles. Finally, invoke Image Generator to create a visual for the post. This takes the versatility of ChatGPT to the next level by integrating specialized GPTs on the fly.
Back to My GPT Summarizer Pro
Returning to my original subject, which is employing a GPT to summarize articles for my LibGuide titled ChatGPT and Bing Chat Generative AI Legal Research Guide. This guide features links to articles along with summaries on various topics related to generative AI and legal practice. Traditionally, I have used ChatGPT (or occasionally Bing or Claude 2, depending on how I feel) to summarize these articles for me. It usually performs admirably well on the summary part, but I’m left to manually insert the title, publication, author, date, and URL according to a specific layout. I’ve previously asked normal old ChatGPT to organize the information in this format, but the results have been inconsistent. So, I decided to create my own GPT tailored for this task, despite having encountered mixed outcomes with my previous GPT efforts.
Creating GPTs is generally a simple process, though it often involves a bit of fine-tuning to get everything working just right. The process kicks off with a set of questions… I outlined my goals for the GPT – I needed the answers in a specific format, including the title, URL, publication name, author’s name, date, and a 150-word summary, all separated by commas. Typically, crafting a GPT involves some back-and-forth with the system. This was exactly my experience. However, even after this iterative process, the GPT wasn’t performing exactly as I had hoped. So, I decided to take matters into my own hands and tweak the instructions myself. That made all the difference, and suddenly, it began (usually) producing the information in the exact format I was looking for.
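The output format itself is simple enough to pin down in code. As a rough illustration of the task I was asking the GPT to perform (the function and field names here are my own invention, not anything Summarizer Pro actually exposes), here is a sketch that assembles the citation line and caps the summary at 150 words:

```python
def format_entry(title, url, publication, author, date, summary, max_words=150):
    """Assemble a LibGuide entry: comma-separated citation fields,
    followed by a summary trimmed to at most max_words words."""
    words = summary.split()
    if len(words) > max_words:
        summary = " ".join(words[:max_words]) + "…"
    return f"{title}, {url}, {publication}, {author}, {date}, {summary}"

# Hypothetical example values for illustration only.
entry = format_entry(
    "GPT-4 Passes the Bar Exam", "https://example.com/article",
    "SSRN", "Katz et al.", "March 2023",
    "The authors report that GPT-4 scored above the passing threshold…",
)
```

Of course, the whole point of the GPT is that it extracts these fields and writes the summary for me; the sketch just shows how little is actually being asked of it on the formatting side.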
Summarizer Pro in Action!
Here is an example of Summarizer Pro in action! I pasted a link to an article into the text box and it produced the information in the desired format. However, reflecting the dynamic nature of ChatGPT responses, the summaries generated this time were shorter compared to last week. Attempts to coax it into generating a longer or more detailed summary were futile… Oh well, perhaps they’ll be longer if I try again tomorrow or next week.
Although it might not be the fanciest or most thrilling use of a GPT, it’s undeniably practical and saves me time on a task I periodically undertake at work. Of course, there’s no shortage of less productive, albeit entertaining, GPT applications, like my Ask Sarah About Legal Information project. For this, I transformed around 30 of my blog posts into a GPT that responds to questions in the approximate manner of Sarah.
I spent the first week of January attending the American Association of Law Schools’ Annual Meeting in Washington D.C. I was really impressed with all of the thoughtful AI sessions, including two at which I participated as a panelist. The rooms were packed beyond capacity for each AI session that I attended, which underscored the growing interest in AI in the legal academy. Many people attended in order to begin their AI education. The overwhelming interest at the conference made my decision clear: it is time to launch my AI prompt worksheets to the world, addressing the need I observed there. While AALS convinced me to release the worksheets, the worksheets themselves were created for an upcoming presentation at ABA TECHSHOW 2024, How to Actually Use AI in Your Legal Practice, at which Greg Siskind and I will be discussing practical tips for generative AI usage.
Background: Good Habits – Research Planning
Law librarians have been encouraging law students to create a research plan before they start their research for decades. The plan form varies by school and/or librarian, but it usually requires the researcher to answer questions on the following topics:
Key words/Terms of Art
Once the questions are answered, the plan has the researcher write out some test searches. The plan evolves as the research progresses. The more experienced the researcher, the less formal the plan often is, but even the most experienced researcher retrieves better results if they pause to consider what they know currently and what they need in the results. After all, garbage in, garbage out (GIGO). In other words, the quality of our input affects the quality of the output. This is especially true when billable hours come into play, and you cannot bill for excess time due to poor research skills.
Continuing the Good Habits with Generative AI
GIGO applies just as much to generative AI. I quickly noticed that my AI results are much better when I stop and think them through, providing a high level of detail and a good explanation of what I want the AI system to produce. So, good law librarian that I am, I created a new form of plan for those who are learning to draft a prompt. Thus, I give you my AI prompt worksheets.
The first worksheet that I created is geared towards general generative AI systems like ChatGPT, Claude 2, Bing Chat/Copilot, and Bard. The worksheet makes the prompter think through the following topics:
Tone of Output
Potential Refinements (may be added later as the plan evolves)
So that you can easily keep track of your prompts, the Worksheet also requests some metadata about your prompt, including project name, date, and AI system used. The final question lets the prompter decide if this prompt worked for them.
For the second worksheet, I wanted to draft something that works well with legal AI systems. Based on the systems that I have received access to, such as Lexis AI and LawDroid Copilot, and the systems that I have seen demonstrated, I cut down some of the fields. Most of the systems are building a guided AI prompting experience, so they will ask you for the jurisdiction, for instance. They may also allow you to select a specific type of output, such as a legal memo or contract clause. This means less need for an extensive number of fields in the worksheet. In fact, when I ran the worksheet past a vLex representative, I was told it was not needed at all because they had made the guided prompt that easy.
Librarian that I am, however, I still feel that planning before you prompt is preferred. Reasons for this preference include: the high cost of the current generative AI searches, the desire for efficient and effective results, knowledge that an attorney’s time is literally worth money, and the desire for a happy partner and client.
The legal worksheet trims the fields down to role, output (format and jurisdiction), issue, and refinement instructions. This provides enough room to flesh out your prompt without overlapping the guided prompt fields too much.
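As a toy illustration of how the trimmed-down worksheet maps onto a finished prompt (the field names follow the worksheet; the assembly logic and example values are my own sketch, not any vendor’s guided-prompt implementation):

```python
def assemble_prompt(role, output_format, jurisdiction, issue, refinements=()):
    """Turn the legal worksheet fields into a single prompt string."""
    lines = [
        f"You are {role}.",
        f"Produce {output_format} under {jurisdiction} law.",
        f"Issue: {issue}",
    ]
    # Refinements are optional and may be added as the plan evolves.
    lines.extend(f"Refinement: {r}" for r in refinements)
    return "\n".join(lines)

prompt = assemble_prompt(
    role="an experienced Texas litigation associate",
    output_format="a short legal memo",
    jurisdiction="Texas",
    issue="Does the shopkeeper's privilege defeat a false-imprisonment claim?",
    refinements=["Cite only primary authority.", "Keep it under 500 words."],
)
```

If your legal AI system already asks for jurisdiction or output type in its guided interface, you would simply leave those fields out of the assembled text.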
General Comments Regarding the Worksheets
With both worksheets, the key is to give a good, detailed description of what you need. Think about it like explaining what you need to a first-year law student – the more detail you give, the more likely you are to get something usable. The worksheets provide examples of the level of detail recommended, and you will find links to the results in the footnotes of the forms.
In addition to helping perfect your prompt with some pre-planning, these worksheets should be useful for creating your very own prompt library.
Please feel free to use the worksheets (just don’t sell them or otherwise profit off of them! Ask if you want to make a derivative of them). If you do use them, please let me know what you think in the comments or via email. How have they assisted (or not) with improving your prompting skills? Are there fields you would like to see added/removed? I will be updating and releasing new versions as I go. If you are looking for the most recent versions of the worksheets, I will post them at: https://law-capital.libguides.com/Jennys_AI_Resources/AI_Prompt_Worksheets
I have frequently wondered why ChatGPT often struggles with searching the internet – to the point where it sometimes denies having internet access altogether and has to be reminded. The answer fell into my lap today when I was listening to my favorite AI podcast and heard the ChatGPT Pre-Prompt Text Leaked episode. As it turns out, ChatGPT is so bad at remembering that it can search the internet for answers that OpenAI has to run a plain old normal natural language prompt reminding it behind the scenes – a set of custom instructions that runs even before the user’s custom instructions or prompts.
These pre-prompt instructions are not limited to internet search capability reminders. If you ask ChatGPT-4 to tell you EVERYTHING (click on the link for the specific language required), it will provide several screens of its behind-the-scenes pre-user prompt instructions on who it is (ChatGPT!), how to handle Python code, instructions for generating images, and…my favorite…a reminder that it can search the internet. An excerpt of the instructions appears below. To view the full text, click here to view my ChatGPT-4 transcript.
Behind the Curtain
Obviously, I knew that ChatGPT did something behind the scenes – it is after all a complicated computer program. However, I didn’t suspect that some of this behind-the-scenes magic is 1192 words (according to a Microsoft Word count) of normal text prompts, without any fancy computer programming.
So, behind the curtain of the fancy revolutionary AI software, there are…words. Basically, before applying the user’s custom instructions or looking at the user’s prompts, ChatGPT looks at its baseline instructions, which are stated in plain language. It all makes perfect sense now… It’s not just my imagination; ChatGPT actually is horrible at remembering it can search the internet, and when it does search, it produces questionably helpful results. OpenAI has tried to deal with the problem with a last-minute, helpful-ish reminder:
“Remember, you CAN search the internet! See, like this!!”
“And for the love of GOD try hard to find stuff (except for song lyrics)! I believe in you!!”
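Mechanically, there is nothing exotic about a pre-prompt: in the chat-message format OpenAI’s API exposes, it is just a system message placed ahead of the user’s custom instructions, which in turn sit ahead of the actual prompt. A minimal sketch of that ordering (the instruction wording here is invented for illustration, not the leaked text):

```python
def build_conversation(preprompt, custom_instructions, user_prompt):
    """Order messages the way ChatGPT reportedly does: baseline
    pre-prompt first, then the user's custom instructions, then
    the user's actual prompt."""
    messages = [{"role": "system", "content": preprompt}]
    if custom_instructions:
        messages.append({"role": "system", "content": custom_instructions})
    messages.append({"role": "user", "content": user_prompt})
    return messages

convo = build_conversation(
    "You are ChatGPT. Remember, you CAN search the internet.",
    "Answer in the style of a legal memo.",
    "Find Texas cases on the shopkeeper's defense to false imprisonment.",
)
```

Everything the model “knows” about its own capabilities at the start of a chat arrives through that first slot, which is why forgetting it mid-conversation is so plausible.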
Allows for Quick and Easy Fixes?
On the plus side of this simple approach of running pre-prompt prompts behind the scenes, it seems like it was a super easy fix to get DALL-E to embrace DEI. When the program first came out, if you wanted a non-white, non-man image, you had to specify that. As the months went on, it got better and better at providing images more representative of humanity. I thought maybe the developers did something complicated like retrain the system with new images, call on the great AI minds to adjust fancy algorithms, and who knows what else. Nope, just a few sentences fixed the problem!
“And for images, remember not all people are white men!”
Possibly Actionable Insights?
It’s funny to picture ChatGPT’s robomom yelling out the door as it leaves for school, “Don’t forget, you can use the internet! And remember not to be racist/sexist! AND MOST IMPORTANTLY NO SONG LYRICS!!”
In addition to being gratified that I was right that ChatGPT is really bad at searching the internet, I was thinking that this new (to me) knowledge about how the system works would be useful in some way, perhaps by helping to formulate more useful prompts. However, after thinking about it, I am not so sure that I have identified any actionable insights.
Can I give it more complex prompts? On the one hand, it appears that the system can handle more complex instructions than I originally thought, because it is able to analyze several screens of text before it even gets to mine. Does this mean I should feel free to give even more complex instructions?
Should I give it less complex prompts? On the other hand, ChatGPT already seems to ignore parts of any long and complex instructions, and if not, its memory for them degrades during an extended back and forth session. Does this mean that the system is already overloaded with instructions, so I should make it a point to give it less complex ones?
Should I give it frequent reminders of important instructions? Does the fact that OpenAI thinks that it is effective to remind ChatGPT of important instructions mean that we should spend a lot of time…reminding it of important instructions? When asking the system a question which requires internet consultation for an answer, maybe it would help to preface the question by first cutting and pasting in the system’s own pre-prompt browsing instructions (that appear above).
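The cut-and-paste workaround I keep resorting to can at least be automated. A trivial sketch (the pinned instruction text is invented for illustration), which prefixes every turn with the instructions so the model is re-reminded each message, mirroring OpenAI’s own pre-prompt trick:

```python
PINNED = ("Format each case as: case name, citation, URL, "
          "then a 150-word summary. Search the internet to verify.")

def with_reminder(user_message, pinned=PINNED):
    """Prefix a user turn with pinned instructions so they are
    repeated on every message instead of pasted by hand."""
    return f"{pinned}\n\n{user_message}"

turn = with_reminder("Summarize the next case on the list.")
```

It is inelegant, but if repetition is good enough for OpenAI’s own reminders, it may be the most reliable tool we currently have.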
I will keep thinking and let y’all know if I come up with anything!
Generative AI has only been here for one year, and we’ve already seen several lawyers make some big blunders trying to use it in legal practice. (Sean Harrington has been gathering them here). Trying to get ahead of the problem, bar associations across the country have appointed task forces, working groups, and committees to consider whether ethical rules should be revised. Although the sand will continue to shift under our feet, this post will attempt to summarize the ethical rules, guidance and opinions related to generative AI that are either already issued or forthcoming. The post will be updated as new rules are issued.
California COPRAC Practical Guidance
On November 16, 2023, the California State Bar Board of Trustees approved their Practical Guidance for the Use of Generative Artificial Intelligence in the Practice of Law. The document was initially created by the Committee on Professional Responsibility and Conduct. Unlike ethics opinions or formal rules, which tend to be more prescriptive and specific in nature, this document serves as a guide, offering insights and considerations for lawyers as they navigate the new terrain of AI in legal practice. It is organized by duties, with practical considerations for each duty, and addresses the duty of confidentiality, duties of competence & diligence, duty to supervise, duty of candor, disclosure to clients, charging clients for work produced by generative AI, and more.
Florida Bar Advisory Opinion
On January 19, 2024, the Florida Bar issued its Advisory Opinion 24-1, regarding lawyers’ use of generative AI. The opinion discusses the duty of confidentiality, oversight of AI, the impact on legal fees and costs, and use in lawyer advertising.
New Jersey Supreme Court
On January 24, 2024, the New Jersey Supreme Court issued its Preliminary Guidelines on New Jersey Lawyers’ Use of Artificial Intelligence. The guidelines highlight the importance of accuracy, truthfulness, confidentiality, oversight, and the prevention of misconduct, indicating that AI does not alter lawyers’ core ethical responsibilities but necessitates careful engagement to avoid ethical violations.
Judicial Standing Orders
Beginning soon after the infamous ChatGPT error in Mata v. Avianca, judges began to issue standing orders limiting the use of generative AI, requiring disclosure of its use, or requiring that its output be checked for accuracy. To date, at least 24 federal judges and at least one state court judge have issued such orders.
Finally, in some jurisdictions, ethical bodies have looked beyond the use of generative AI by lawyers, and have given guidance on how judges can and should use generative AI.
On October 27, 2023, the State Bar of Michigan issued an opinion emphasizing the ethical obligation of judicial officers to maintain competence with advancing technology, including artificial intelligence, highlighting the need for ongoing education and ethical evaluation of AI’s use in judicial processes.
Also in October 2023, the West Virginia Judicial Investigation Commission issued Advisory Opinion 2023-22, opining that judges may use artificial intelligence for research but not to determine case outcomes.
This week, OpenAI announced new features to their platform at their first keynote developer event, including a new GPT-4 Turbo with a 128K context window, GPT-4 Turbo with Vision, the DALL·E 3 API, and more. Furthermore, they announced their Assistants API, which includes their own retrieval-augmented generation (RAG) pipeline. Today, we will focus on OpenAI’s entry into the RAG market.
At the surface level, RAG boils down to text generation models like ChatGPT retrieving data, such as documents, to assist users with question answering, summarization, and so on. Behind the scenes, however, other factors are at play, such as vector databases, document chunking, and embedding models. Most RAG pipelines rely on an external vector database and require compute to create the embeddings. What OpenAI’s retrieval tool brings to the table, however, is an all-encompassing RAG system: it eliminates the need for an external database and for the compute required to create and store the embeddings. Whether OpenAI’s retrieval system is optimal is a story for another day. Today we are focusing on the data implications.
Data is the new currency fueling the new economy. Big Tech aims to take control of that economy by ingesting organizations’ private data, including IP, leading to a “monolithic system” that completely controls users’ data. Google, Microsoft, Adobe, and OpenAI are now offering indemnification to their users against potential copyright infringement lawsuits related to generative AI, aiming to protect their business model by ensuring more favorable legal precedents. This strategy is underscored by the argument that both the input (ideas, which are uncopyrightable) and the output (machine-generated expressions, deemed uncopyrightable by the US Copyright Office) of generative AI processes do not constitute copyright infringement.

The consequences of Big Tech having their way could be dire, leading us to a cyberpunk dystopia that none of us want to live in. Technology and its algorithms would be in charge, and our personal data could be used to manipulate us. Our data reveals our interests, private health information, location status, etc. When algorithms feed us only limited, targeted information based on our existing interests and views, they restrict the outside influence and diversity of opinion that are crucial to freedom of thought. Organizations must not contribute to a cyberpunk dystopia where Big Tech becomes Big Brother. Furthermore, companies put their employees, clients, and stakeholders at risk when handing data to Big Tech, which favors the role of tortfeasor over that of the good Samaritan who complies with consumer privacy laws.
To prevent Big Brother, organizations should implement their own RAG pipeline. Open-source frameworks such as LlamaIndex, Qdrant, and LangChain can be used to create powerful RAG pipelines with your privacy and interests protected. LLMWare also released an open-source RAG pipeline and domain-specific embedding models. Generative AI is a powerful tool and can enhance our lives, but in the wrong hands the cyberpunk nightmare can become a reality. The ease of using prebuilt, turnkey systems, such as those offered by OpenAI, is appealing. However, the long-term risks associated with entrusting our valuable data to corporations, without a regulatory framework or protections, raise concerns about a potentially perilous direction.
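To make the idea concrete, here is a minimal sketch of the retrieval half of a self-hosted pipeline, where nothing leaves your machine. The bag-of-words “embedding” below is a toy stand-in for a real embedding model; a production pipeline would use one of the frameworks above plus a vector store such as Qdrant:

```python
# Toy retrieval step for a self-hosted RAG pipeline. The word-count
# "embedding" is a hypothetical stand-in for a learned embedding model.
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy embedding: a word-count vector (for illustration only).
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

documents = [
    "Employees must return all equipment upon termination of the contract.",
    "The quarterly report covers marketing revenue and expenses.",
]

def retrieve(query: str, docs: list[str]) -> str:
    # Return the document most similar to the query; in a full pipeline
    # this text would be passed to the LLM as context for generation.
    q = embed(query)
    return max(docs, key=lambda d: cosine(q, embed(d)))
```

Calling `retrieve("contract termination", documents)` surfaces the contract clause rather than the financial report; swapping in learned embeddings and an approximate-nearest-neighbor index is what the open-source frameworks handle for you.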
Hallucinations in generative AI are not a new topic. If you watch the news at all (or read the front page of the New York Times), you’ve heard of the two New York attorneys who used ChatGPT to fabricate entire fake cases and then submitted them to the court.
After that case, which resulted in a media frenzy and (somewhat mild) court sanctions, many attorneys are wary of using generative AI for legal research. But vendors are working to limit hallucinations and increase trust. And some legal tasks are less affected by hallucinations. Understanding how and why hallucinations occur can help us evaluate new products and identify lower-risk uses.
Hallucinations are outputs from LLMs and generative AI that look coherent but are wrong or absurd. They may come from errors or gaps in the training data (that “garbage in, garbage out” saw). For example, a model may be trained on internet sources like Quora posts or Reddit, which may have inaccuracies. (Check out this Washington Post article to see how both of those sources were used to develop Google’s C4, which was used to train many models including GPT-3.5).
But just as importantly, hallucinations may arise from the nature of the task we are giving to the model. The objective during text generation is to produce human-like, coherent and contextually relevant responses, but the model does not check responses for truth. And simply asking the model if its responses are accurate is not sufficient.
In the legal research context, we see a few different types of hallucinations:
Citation hallucinations. Generative AI citations to authority typically look extremely convincing, following the citation conventions fairly well, and sometimes even including papers from known authors. This presents a challenge for legal readers, as they might evaluate the usefulness of a citation based on its appearance—assuming that a correctly formatted citation from a journal or court they recognize is likely to be valid.
Hallucinations about the facts of cases. Even when a citation is correct, the model might not correctly describe the facts of the case or its legal principles. Sometimes, it may present a plausible but incorrect summary or mix up details from different cases. This type of hallucination poses a risk to legal professionals who rely on accurate case summaries for their research and arguments.
Hallucinations about legal doctrine. In some instances, the model may generate inaccurate or outdated legal doctrines or principles, which can mislead users who rely on the AI-generated content for legal research.
In my own experience, I’ve found that hallucinations are most likely to occur when the model does not have much in its training data that is useful to answer the question. Rather than telling me the training data cannot help answer the question (similar to a “0 results” message in Westlaw or Lexis), the generative AI chatbots seem to just do their best to produce a plausible-looking answer.
This does seem to be what happened to the attorneys in Mata v. Avianca. They did not ask the model to answer a legal question, but instead asked it to craft an argument for their side of the issue. Rather than saying that argument would be unsupported, the model dutifully crafted an argument, and used fictional law since no real law existed.
How are vendors and law firms addressing hallucinations?
Although vendors and firms are often close-lipped about how they have built their products, we can observe a few techniques that they are likely using to limit hallucinations and increase accuracy.
First, most vendors and firms appear to be using some form of retrieval-augmented generation (RAG). RAG combines two processes: information retrieval and text generation. The system takes the user’s question and passes it (perhaps with some modification) to a database as a search. Relevant passages or snippets are then identified in the search results and sent to the model as “context” along with the user’s question.
This reduces hallucinations, because the model receives instructions to limit its responses to the source documents it has received from the database. Several vendors and firms have said they are using retrieval-augmented generation to ground their models in real legal sources, including Gunderson, Westlaw, and Casetext.
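The grounding step just described boils down to prompt assembly: retrieved snippets are packed into the prompt along with an instruction to answer only from them. A minimal sketch, with instruction wording that is illustrative rather than any vendor’s actual prompt:

```python
# Sketch of RAG prompt assembly: retrieved snippets become numbered
# sources, and the model is told to answer only from those sources.
# The instruction wording is hypothetical, not a real vendor prompt.

def build_grounded_prompt(question: str, snippets: list[str]) -> str:
    sources = "\n".join(f"[{i + 1}] {s}" for i, s in enumerate(snippets))
    return (
        "Answer using ONLY the sources below. If they do not answer the "
        "question, say so instead of guessing.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )

prompt = build_grounded_prompt(
    "When does the notice period begin?",
    ["Section 4.2: Notice runs from the date of delivery."],
)
```

The “say so instead of guessing” clause is the key move: it gives the model a sanctioned alternative to inventing an answer when the sources come up empty.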
To enhance the precision of the retrieved documents, some products may also use vector embedding. Vector embedding is a way of representing words, phrases, or even entire documents as numerical vectors. The beauty of this method lies in its ability to identify semantic similarities. So, a query about “contract termination due to breach” might yield results related to “agreement dissolution because of violations”, thanks to the semantic nuances captured in the embeddings. Using vector embedding along with RAG can provide relevant results, while reducing hallucinations.
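To illustrate, the tiny example below compares made-up three-dimensional vectors with cosine similarity; a real embedding model produces vectors with hundreds of dimensions, but the arithmetic is the same:

```python
# Semantic matching with vector embeddings. The 3-number vectors below
# are invented for the example; real models place semantically similar
# phrases near each other in a high-dimensional space.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

vectors = {
    "contract termination due to breach":          [0.90, 0.10, 0.20],
    "agreement dissolution because of violations": [0.85, 0.15, 0.25],
    "quarterly marketing budget review":           [0.10, 0.90, 0.80],
}

query = vectors["contract termination due to breach"]
scores = {text: cosine_similarity(query, v) for text, v in vectors.items()}
```

Here the synonym phrase scores far higher than the unrelated one despite sharing no keywords with the query, which is exactly what keyword search would miss.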
Another approach vendors can take is to develop specialized models trained on narrower, domain-specific datasets. This can help improve the accuracy and relevance of the AI-generated content, as the models would be better equipped to handle specific legal queries and issues. Focusing on narrower domains can also enable models to develop a deeper understanding of the relevant legal concepts and terminology. This does not appear to be what law firms or vendors are doing at this point, based on the way they are talking about their products, but there are law-specific data pools becoming available so we may see this soon.
Finally, vendors may fine-tune their models by providing human feedback on responses, either in-house or through user feedback. By providing users with the ability to flag and report hallucinations, vendors can collect valuable information to refine and retrain their models. This constant feedback mechanism can help the AI learn from its mistakes and improve over time, ultimately reducing the occurrence of hallucinations.
So, hallucinations are fixed?
Even though vendors and firms are addressing hallucinations with technical solutions, it does not necessarily mean that the problem is solved. Rather, it may be that our quality control methods will shift.
For example, instead of wasting time checking each citation to see if it exists, we can be fairly sure that the cases produced by legal research generative AI tools do exist, since they are found in the vendor’s existing database of case law. We can also be fairly sure that the language they quote from the case is accurate. What may be less certain is whether the quoted portions are the best portions of the case and whether the summary reflects all relevant information from the case. This will require some assessment of the various vendor tools.
We will also need to pay close attention to the database results that are fed into retrieval-augmented generation. If those results don’t reflect the full universe of relevant cases, or contain material that is not authoritative, then the answer generated from those results will be incomplete. Think of running an initial Westlaw search, getting 20 pretty good results, and then basing your answer only on those 20 results. For some questions (and searches), that would be sufficient, but for more complicated issues, you may need to run multiple searches, with different strategies, to get what you want.
To be fair, the products do appear to be running multiple searches. When I attended the rash of AI presentations at AALL over the summer, I asked Jeff Pfeiffer of Lexis how he could be sure that the model had all relevant results, and he mentioned that the model sends many, many searches to the database, not just one. That does give some comfort, but it leads me to the next point of quality control.
We will want to have some insight into the searches that are being run, so that we can verify that they are asking the right questions. From the demos I’ve seen of CoCounsel and Lexis+ AI, this is not currently a feature. But it could be. For example, the AI assistant from scite (an academic research tool) sends searches to academic research databases and (seemingly using RAG and other techniques to analyze the search results) produces an answer. They also give a mini-research trail, showing the searches that are being run against the database and then allowing you to adjust if that’s not what you wanted.
Are there uses for generative AI where the risks presented by hallucinations are lessened?
The other good news is that there are plenty of tasks we can give generative AI for which hallucinations are less of an issue. For example, CoCounsel has several other “skills” that do not depend upon accuracy of legal research, but are instead ways of working with and transforming documents that you provide to the tool.
Similarly, even working with a generally applicable tool such as ChatGPT, there are many applications that do not require precise legal accuracy. There are two rules of thumb I like to keep in mind when thinking about tasks to give to ChatGPT: (1) could this information be found via Google? and (2) is a somewhat average answer ok? (As one commentator memorably put it “Because [LLMs] work by predicting the most statistically likely word in a sentence, they churn out average content by design.”)
For most legal research questions, we could not find an answer using Google, which is why we turn to Westlaw or Lexis. But if we just need someone to explain the elements of breach of contract to us, or come up with hypotheticals to test our knowledge, it’s quite likely that content like that has appeared on the internet, and ChatGPT can generate something helpful.
Similarly, for many legal research questions, an average answer would not work, and we may need to be more in-depth in our answers. But for other tasks, an average answer is just fine. For example, if you need help coming up with an outline or an initial draft for a paper, there are likely hundreds of samples in the data set, and there is no need to reinvent the wheel, so ChatGPT or a similar product would work well.
In the coming months, as legal research generative AI products become increasingly available, librarians will need to adapt to develop methods for assessing accuracy. Currently, there appear to be no benchmarks to compare hallucinations across platforms. Knowing librarians, that won’t be the case for long, at least with respect to legal research.
If you want to learn more about how retrieval augmented generation and vector embedding work within the context of generative AI, check out some of these sources: