Ghost in the Machine

Today’s guest post comes from Debbie Ginsberg, Faculty Services Manager at Harvard Law School Library.

I was supposed to write a blog post about the Harvard AI summit about six months ago. For various reasons (e.g., “didn’t get my act together”), that hasn’t happened. But one of the things that was brought up at the summit was who wasn’t at the table—who didn’t have access, whose data wasn’t included, and similar issues.

Since then, I’ve been thinking about the haves and have-nots of AI. There’s one group that I don’t think gets discussed enough.  That’s the giant human workforce that AI needs to function.

Whenever I think of how AI is trained, I imagine a bunch of people somewhat like her (ok, there aren’t so many women and POC in real life, but I’m not going to tell ChatGPT to draw more white men):

And that they’ve been working on processes that look somewhat like this:

But that’s only part of the picture.  Underlying all these processes are people like this:

Who are they?

Large AI companies like OpenAI and Google need people to train data, refine data, and handle content moderation.  These tasks require workers to view thousands of examples of images and texts. To say, “This is a cat,” “The AI got this right,” or “This is not offensive.”  And then do this over and over again.  These are the “ghost workers” behind the machine.  Without them, AI doesn’t function. 

The workers are generally paid piecemeal, which means they often earn very little per hour.  For example, some reports claim that OpenAI paid workers in Kenya under $2 an hour to filter questionable content.

The working conditions are not optimal, especially when the workers are reviewing content.  The workers generally do not receive sufficient training or time to do the work they are asked to do.  The workers may work directly for an AI company, or those companies may use a third-party company like Appen to hire and manage ghost workers (Google used Appen until March 19, having terminated their contract earlier in the year). 

That said, this work is an essential source of income for many around the world. The jobs are relatively flexible as to location and time, and the workers take pride in their output. 

As AI continues to grow, there has been more focus on improving working conditions.  For example, the US has launched investigations into some of the large tech companies in response to concerns about how ghost workers are treated.  And while some AI experts predict that AI will eventually be able to do this work itself, many others believe that AI will continue to depend on ghost workers for a long time to come.

And considering how much profit is at stake, I’m thinking that maybe they should be paid more than $2/hour.

Footnote:

Did I use AI to write this?  Kind of?  I used Google’s NotebookLM tool to review my sources and create notes.  In addition to the sources above, check out:

Leapfrogging the Competition: Claude 3 Researches and Writes Memos (Better Than Some Law Students and Maybe Even Some Lawyers?)

Introduction

I’ve been incredibly excited about the premium version of Claude 3 since its release on March 4, 2024, and for good reason. Now that my previous favorite chatty chatbot, ChatGPT-4, has gone off the rails, I was missing a competent chatbot… I signed up the second I heard on March 4th, and it has been a pleasure to use Claude 3 ever since. It actually understands my prompts and usually provides me with impressive answers. Anthropic, maker of the Claude chatty chatbot family, has been touting Claude’s supposed success at beating its competitors on common chatbot benchmarks, and commentators on the Internet have been singing its praises. Just last week, I was so impressed by its ability to analyze information in news stories in uploaded files that I wrote a LinkedIn post also singing its praises!

Hesitation After Previous Struggles

Despite my high hopes for its legal research abilities after experimenting with it last week, I was hesitant to test Claude 3. I have a rule about intentionally irritating myself—if I’m not already irritated, I don’t go looking for irritation… Over the past several weeks, I’ve wasted countless hours trying to improve the legal research capabilities of ChatGPT-3.5, ChatGPT-4, Microsoft Copilot, and my legal research/memo writing GPTs through the magic of (IMHO) clever prompting and repetition. Sadly, I failed miserably and concluded that either ChatGPT-4 was suffering from some form of robotic dementia, or I was. The process was a frustrating waste, and I knew that Claude 3 doing a bad job of legal research too could send me over the edge….

Claude 3 Wrote a Pretty Good Legal Memorandum!

Luckily for me, when I finally got up the nerve to test out the abilities of Claude 3, I found that the internet hype was not overstated. Somehow, Claude 3 has suddenly leapfrogged over its competitors in legal research/legal analysis/legal memo writing ability – it instantly did what would have taken a skilled researcher over an hour and produced a legal memorandum that is probably better than those produced by many law students and even some lawyers. Check it out for yourself! Unless this link actually works for any Claude 3 subscribers out there, there doesn’t seem to be a way to actually link to a Claude 3 chat at this time. However, click here for the whole chat I cut and pasted into a Google Drive document, here for a very long screenshot image of the chat, or here for the final 1,446-word version of the memo as a Word document.

Comparing Claude 3 with Other Systems

Back to my story… The students’ research assignment for the last class was to think of some prompts and compare the results of ChatGPT-3.5, Lexis+ AI, Microsoft Copilot, and a system of their choice. Claude 3 did not exist at the time, but I told them not to try the free Claude product because I had canceled my $20.00 subscription to the Claude 2 product in January 2024 due to its inability to provide useful answers – for every question, all it would do was say that answering was unethical and tell me to do it myself. When creating an answer sheet for tomorrow’s class that compares the same set of prompts on different systems, I decided to omit Lexis+ AI (because I find it useless) and to include my new fav Claude 3 in my comparison spreadsheet. Check it out to compare for yourself!

For the research part of the assignment, all systems were given a fact pattern and asked to “Please analyze this issue and then list and summarize the relevant Texas statutes and cases on the issue.” While the other systems either made up cases or produced just two or three real, correctly cited cases on the research topic, Claude 3 stood out by generating 7 real, relevant cases with correct citations in response to the legal research question. (And it cited 12 cases in the final version of its memo.)

It did a really good job of analysis too!

Generating a Legal Memorandum

Writing a memo was not part of the class assignment because the ChatGPT family was refusing the last few weeks,* and Bing Copilot had to be tricked into writing one as part of a short story, but after seeing Claude 3’s research/analysis results, I decided to just see what happened. I have many elaborate prompts for ChatGPT-4 and my legal memorandum GPTs, but I recalled reading that Claude 3 worked well with zero-shot prompting and didn’t require much explanation to produce good results. So, I decided to keep my prompt simple – “Please generate a draft of a 1500 word memorandum of law about whether Snurpa is likely to prevail in a suit for false imprisonment against Mallatexaspurses. Please put your citations in Bluebook citation format.”

From my experience last week with Claude 3 (and prior experience with Claude 2 which would actually answer questions), I knew the system wouldn’t give me as long an answer as requested. The first attempt yielded a pretty high-quality 735-word draft memo that cited all real cases with the correct citations*** and applied the law to the facts in a well-organized Discussion section. I asked it to expand the memo two more times, and it finally produced a 1,446-word document. Here is part of the Discussion section…

Implications for My Teaching

I’m thrilled about this great leap forward in legal research and writing, and I’m excited to share this information with my legal research students tomorrow in our last meeting of the semester. This is particularly important because I did such a poor job illustrating how these systems could be helpful for legal research when all the compared systems were producing inadequate results.

However, with my administrative law legal research class starting tomorrow, I’m not sure how this will affect my teaching going forward. I had my video presentation ready for tomorrow, but now I have to change it! Moreover, if Claude 3 can suddenly do such a good job analyzing a fact pattern, performing legal research, and applying the law to the facts, how does this affect what I am going to teach them this semester?

*Weirdly, the ChatGPT family, perhaps spurred on by competition from Claude 3, agreed to attempt to generate memos today, which it hasn’t done in weeks…

Note: Claude 2 could at one time produce an okay draft of a legal memo if you uploaded the cases for it, but that was months ago (Claude 2 link if it works for premium subscribers and Google Drive link of cut and pasted chat). Requests in January resulted in lectures about ethics, which led to the above-mentioned cancellation.

Does ChatGPT-4 Have Dementia?

Is it just me, or has ChatGPT-4 taken a nosedive when it comes to legal research and writing? There has been a noticeable decline in its ability to locate primary authority on a topic, analyze a fact pattern, and apply law to facts to answer legal questions. Recently, instructions slide through its digital grasp like water through a sieve, and its memory? I would compare it to a goldfish, but I don’t want to insult them. And before you think it’s just me: it’s not. The internet agrees!

ChatGPT’s Sad Decline

One of the hottest topics in the OpenAI community, in the aptly named “GPT-4 is getting worse and worse every single update” thread, is the perceived decline in the quality and performance of the GPT-4 model, especially after the November 2023 update. Many users have reported that the model is deteriorating with each update, producing nonsensical, irrelevant, or incomplete outputs, forgetting the context, and ignoring instructions. Some users have even reverted to previous versions of the model or cancelled their subscriptions. Here are some specific quotations from recent comments about the memory problem:

  • December 2023 – “I don’t know what on Earth is wrong with GPT 4 lately. It feels like I’m talking to early 3.5! It’s incapable of following basic instructions and forgets the format it’s working on after just a few posts.”
  • December 2023 – “It ignores my instructions, in the same message. I can’t be more specific with what I need. I’m needing to repeat how I’d like it to respond every single message because it forgets, and ignores.”
  • December 2023 – “ChatGPT-4 seems to have trouble following instructions and prompts consistently. It often goes off-topic or fails to understand the context of the conversation, making it challenging to get the desired responses.”
  • January 2024 – “…its memory is bad, it tells you search the net, bing search still sucks, why would teams use this product over a ChatGPT Pre Nov 2023.”
  • February 2024 – “It has been AWFUL this year…by the time you get it to do what you want format wise it literally forgets all the important context LOL — I hope they fix this ASAP…”
  • February 2024 – “Chatgpt was awesome last year, but now it’s absolutely dumb, it forgets your conversation after three messages.”

OpenAI has acknowledged the issue and released an updated GPT-4 Turbo preview model, which is supposed to reduce the cases of “laziness” and complete tasks more thoroughly. However, the feedback from users is still mixed, and some are skeptical about the effectiveness of the fix.

An Example of Confusion and Forgetfulness from Yesterday

Here is one of my many experiences that illustrates the short-term memory and instruction-following issues that other ChatGPT-4 users have reported. Yesterday, I asked it to find some Texas cases about the shopkeeper’s defense to false imprisonment. Initially, ChatGPT-4 retrieved and summarized some relatively decent cases. Well, to be honest, it retrieved 2 relevant cases, with one of the two dating back to 1947… But anyway, the decline in case law research ability is a subject for another blog post.

Anyway, in an attempt to get ChatGPT-4 to find the cases on the internet so it could properly summarize them, I provided some instructions and specified the format I wanted for my answers. Click here for the transcript (only available to ChatGPT-4 subscribers).

Confusion ran amok! ChatGPT-4 apparently understood the instructions (which was a positive sign) and presented three cases in the correct format. However, they weren’t the three cases ChatGPT had listed; instead, they were entirely irrelevant to the topic—just random criminal cases.

It remembered… and then forgot. When I reminded it that I wanted it to work with the first case listed and provided the citation, it apologized for the confusion. It then proceeded to give the correct citation, URL, and a detailed summary, but unfortunately in the wrong format!

Eventually, in a subsequent chat, I successfully got it to take a case it found, locate the text of the case on the internet, and then provide the information in a specified format. However, it could only do it once before completely forgetting about the specified format. I had to keep cutting and pasting the instructions for each subsequent case.

Sigh… I definitely echo the sentiments expressed on the “GPT-4 is getting worse and worse every single update” thread.

ChatGPT Is Growing a Long Term Memory

Well, the news is not all bad! While we are on the topic of memory, OpenAI has introduced a new feature for ChatGPT – the ability to remember stuff over time. ChatGPT’s memory feature is being rolled out to a small portion of free and Plus users, with broader availability planned soon. According to OpenAI, this enhancement allows ChatGPT to remember information from past interactions, resulting in more personalized and coherent conversations. During conversations, ChatGPT automatically picks up on details it deems relevant to remember. Users can also explicitly instruct ChatGPT to remember specific information, such as meeting note preferences or personal details. Over time, ChatGPT’s memory improves as users engage with it more frequently. This memory feature could be useful for users who want consistent responses, such as replying to emails in a specific format.

The memory feature can be turned off entirely if desired, giving users control over their experience. Deleting a chat doesn’t erase ChatGPT’s memories; users must delete specific memories individually…which seems a bit strange – see below. For conversations without memory, users can use temporary chat, which won’t appear in history, won’t use memory, and won’t train the AI model.

The Future?

As we await improvements to our once-loved ChatGPT-4, our options remain limited, pushing us to consider alternative avenues. Sadly, I’ve recently encountered similar shortcomings with Claude 2, which was once useful for legal research and writing. In my pursuit of alternatives, platforms like Gemini, Perplexity, and Hugging Face have proven less than ideal for research and writing tasks. However, amidst these challenges, Microsoft Copilot has shown promise. While not without its flaws, it recently demonstrated adequate performance in legal research and even took a passable stab at a draft of a memo. Given OpenAI’s recent advancements in the form of Sora, the near-magical text-to-video generator that is causing such hysteria in Hollywood, there’s reason to hope that they can pull ChatGPT back from the brink.

AI’s Mechanical Jurisprudence

Guest post by Nicholas Mignanelli, Research Librarian, Yale Law School

In his 1908 essay, “Mechanical Jurisprudence,” the eminent legal scholar Roscoe Pound warns of the dangers of what he calls “scientific law,” namely a “petrification” that “tends to cut off individual initiative in the future, to stifle independent consideration of new problems and of new phases of old problems, and so to impose the ideas of one generation upon the other.” Today, this century-old critique of legal formalism could be used to describe the pitfalls of so-called “AI-driven” legal research and law practice technologies.

Pound’s early work served as the foundation for legal realism, an intellectual movement that radically transformed American law by exposing the human element in judicial decision-making and introducing the indeterminacy thesis—the idea that “laws (broadly defined to include cases, regulations, statutes, constitutional provisions, and other legal materials) do not determine legal outcomes.” Unfortunately, the insights of the legal realists are lost on the founders of today’s legal tech startups and their promoters, even those within the legal academy. As Upton Sinclair once wrote, “It is difficult to get a man to understand something when his salary depends on his not understanding it.”

Yet foundational questions abound. Is law determinate? What systemic biases and hidden assumptions are embedded in the corpus of Anglo-American law? What are the implications of turning the corpus of Anglo-American law into a dataset and automating it? Will AI inhibit the legal creativity exemplified by lawyers like Thurgood Marshall and Ruth Bader Ginsburg? What will all of this mean for the future of law reform? While we can hardly expect vendors to take time to reflect upon these questions, law librarians, in their roles as legal research professors and legal information scholars, must.

Further Reading

Ronald E. Wheeler, Does WestlawNext Really Change Everything: The Implications of WestlawNext on Legal Research, 103 Law Libr. J. 359 (2011).

Susan Nevelow Mart, The Algorithm as a Human Artifact: Implications for Legal [Re]Search, 109 Law Libr. J. 387 (2017).

Susan Nevelow Mart, Results May Vary, A.B.A. J., Mar. 2018, at 48.

Nicholas Mignanelli, Critical Legal Research: Who Needs It?, 112 Law Libr. J. 327 (2020).

Yasmin Sokkar Harker, Invisible Hands and the Triple (Quadruple?) Helix Dilemma: Helping Students Free Their Minds, 101 B.U. L. Rev. Online 17 (2021).

Nicholas Mignanelli, Prophets for an Algorithmic Age, 101 B.U. L. Rev. 41 (2021).

Stephan Meder, Legal Machines: Of Subsumption Automata, Artificial Intelligence, And the Search for the “Correct” Judgment (Verena Beck trans., 2023).