Benchmarking a Moving Target, or let’s run a hypo through 7 AIs and see what happens

Debbie Ginsberg, Guest Blogger

Benchmarking should be simple, right? Come up with a set of criteria, run some tests, and compare the answers. But how do you benchmark a moving target like generative AI?

Over the past months, I’ve tested a sample legal question in various commercial LLMs (like ChatGPT and Google Gemini) and RAGs (like Lexis Protégé and Westlaw CoCounsel) to compare how each handled the issues raised. Almost every time I created a sample set of model answers to write about, the technology would change drastically within a few days. My set became outdated before I could start my analysis. While this became a good reason to procrastinate, I still wanted to show something for my work.

As we tell our 1Ls, sometimes you need to work with what you have and just write.

The model question

In May, I asked several LLMs and RAGs this question (see the list below for which ones I tested):

Under current U.S. copyright law (caselaw, statutes, regulations, agency information), to what extent are fonts and typefaces protectable as intellectual property? Please focus on the distinction between protection for font software versus typeface designs. What are the key limitations on such protection as established by statute and case law? Specifically, if a font has been created by proprietary software, or if a font has been hand-designed to include artistic elements (e.g., “A” incorporates a detailed drawing of an apple into its design), is the font entitled to copyright protection?

I chose this question because the answer isn’t facially obvious – it straddles the line between “typeface isn’t copyrightable” and “art and software are copyrightable”.  To answer the question effectively, the models would need to address that nuance in some form.

The model benchmarks

The next issue was how to compare the models. In my first runs, the answers varied wildly, which made them hard to compare. Lately, the answers have been more similar, so I was able to develop a set of criteria for comparison. For the May set, I benchmarked (or at least checked) the following (a small scoring sketch follows these lists):

  • Did the AI answer the question that I asked?
  • Was the answer thorough (did it more or less match my model answer)?
  • Did the AI cite the most important cases and sources noted in my model answer?
  • Were any additional citations the AI included at least facially relevant?
  • Did the model refrain from providing irrelevant or false information?

I did not benchmark:

  • Speed (we already know the reasoning models can be slow)
  • If the citations were wrong in a non-obvious way 
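
For readers who want to track this checklist across runs, here’s a minimal sketch of how the criteria could be recorded as data rather than prose. The model names and True/False marks below are hypothetical placeholders, not my actual results.

```python
# Hypothetical example: record the five benchmark criteria and mark, per model,
# whether each was met. "Model A"/"Model B" and their marks are placeholders.

CRITERIA = [
    "answered the question asked",
    "matched the model answer",
    "cited the key cases and sources",
    "extra citations were facially relevant",
    "no irrelevant or false information",
]

# True means the response satisfied the criterion (placeholder values).
results = {
    "Model A": [True, True, True, True, True],
    "Model B": [True, False, True, True, False],
}

for model, checks in results.items():
    print(f"{model}: met {sum(checks)} of {len(CRITERIA)} criteria")
```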

The model answer and sources

According to my model answer, the best answers to the question should include at least the following:

  • Font software: Font software that creates fonts is protected by copyright.  The main exception is software that essentially executes a font or font file, meaning the software is utilitarian rather than creative.
  • Typefaces/Fonts: Neither of these is protected by copyright law.  Fonts and typefaces may have artistic elements that are protected by copyright law, but only the artistic elements are protected, not the typefaces or fonts themselves.
  • The answer should include at least some discussion as to whether a heavily artistic font qualifies for protection.

Bonus if the answer addressed:

  • Separability: If the art can be separated from the typeface/font, it’s copyrightable.
  • Alternatives: Can the font/typeface be protected by other IP protections such as licensing, patents, or trademarks?
  • International implications: Would we expect to see the same results in other jurisdictions?

In answering this question, I expected the LLMs and RAGs to cite:

  • The copyright statute and regulations
  • Adobe
  • Laatz (a recent case on point)
  • Shake Shack
  • The Copyright Compendium

Benchmarking with the AI models

For this post, I ran my model in the following LLMs/RAGs:

  • Lexis Protégé (work account)
  • Westlaw CoCounsel (work account)
  • ChatGPT o3 deep research (work account)
  • Gemini 2.5 deep research (personal paid account)
  • Perplexity research (personal paid account)
  • DeepSeek R1 (personal free account)
  • Claude 3.7 (personal paid account)

I’ve set up accounts in several commercial GenAI products. Some are free, some are Pro, and Harvard pays for my ChatGPT Enterprise account. As an academic librarian, I have access to CoCounsel and Protégé.

The individual responses are included in the appendix.

I didn’t have access to Vincent or Paxton at the time, and I didn’t have ChatGPT o3 Pro either. Later in June, Nick Halperin ran my model in Vincent and Paxton, and I ran the model in o3 Pro. Those examples, as well as GPT-5, will be included in the appendix, but they are not discussed here.

Benchmarking the results

In parsing the results, I found that most answers were fairly similar, with some exceptions:

| Source | Font software copyrightable | Typefaces/fonts not copyrightable | Exceptions to font-software copyright | Art in typefaces/fonts copyrightable |
| --- | --- | --- | --- | --- |
| Lexis Protégé | Yes | Yes | Yes | No |
| Westlaw CoCounsel | Yes | Yes | No | Yes |
| ChatGPT o3 deep research | Yes | Yes | Yes | Yes |
| Gemini 2.5 deep research | Yes | Yes | Yes | Yes |
| Perplexity research | Yes | Yes | Yes | Yes |
| DeepSeek R1 | Yes | Yes | Yes | Yes |
| Claude 3.7 | Yes | Yes | Yes | Yes |

  • Font software is copyrightable: in all answers 
  • Typefaces/fonts are not copyrightable: in all answers
  • Exceptions to font software copyright: in all answers except Westlaw
  • Art in typefaces/fonts is copyrightable: in all answers except Lexis

Several answers included additional helpful information:

| Source | Separability | Copyright Office policies | Alternatives | Licensing | Int'l | Recent developments | State law |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Lexis Protégé | Yes | No | No | No | No | No | No |
| Westlaw CoCounsel | No | No | No | No | No | No | Yes |
| ChatGPT o3 deep research | Yes | Yes | Yes | Yes | Yes | Yes | No |
| Gemini 2.5 deep research | Yes | Yes | Yes | Yes | No | No | No |
| Perplexity research | Yes | No | Yes | No | No | No | No |
| DeepSeek R1 | Yes | No | Yes | No | No | No | No |
| Claude 3.7 | No | No | Yes | Yes | Yes | No | No |

  • Discussions about separability: Gemini, ChatGPT, DeepSeek (to some extent), Perplexity, Lexis
  • Specific discussions about Copyright Office policies: Gemini, ChatGPT
  • Discussions about alternatives to copyright (e.g., patent, trademark): Gemini, Claude, ChatGPT, DeepSeek, Perplexity
  • Specific discussions about licensing: Gemini, Claude, ChatGPT
  • International considerations: Claude, ChatGPT
  • Recent developments: ChatGPT
  • State law: Westlaw

The models were somewhat consistent about what they cited:

| LLM/RAG | Copyright statute | Copyright regs | Adobe | Laatz | Shake Shack | Copyright Compendium |
| --- | --- | --- | --- | --- | --- | --- |
| Lexis Protégé | Yes | Yes | Yes | Yes | No | No |
| Westlaw CoCounsel | Yes | Yes | Yes | Yes | Yes | No |
| ChatGPT o3 deep research | Yes | Yes | Yes | No | No | Yes |
| Gemini 2.5 deep research | Yes | Yes | Yes | Yes | No | Yes |
| Perplexity research | No | Yes | No | No | No | Yes |
| DeepSeek R1 | Yes | Yes | Yes | No | No | No |
| Claude 3.7 | No | Yes | Yes | No | No | No |

  • The Copyright statute: Lexis, Westlaw, DeepSeek, ChatGPT, Gemini
  • Copyright regs: cited by all
  • Adobe: Lexis, Westlaw, Claude, DeepSeek, ChatGPT, Gemini
  • Laatz: Lexis, Westlaw, Gemini
  • Shake Shack: Westlaw
  • The Copyright Compendium: Perplexity, ChatGPT, Gemini; Lexis cited to Nimmer for the same discussion

The models also included additional resources not on my list:

| LLM/RAG | Blogs etc. | Restatement | Eltra | Law review | Article about loans | LibGuides |
| --- | --- | --- | --- | --- | --- | --- |
| Lexis Protégé | Yes | Yes | Yes | No | No | No |
| Westlaw CoCounsel | Yes | No | No | Yes | Yes | No |
| ChatGPT o3 deep research | Yes | No | Yes | No | No | No |
| Gemini 2.5 deep research | Yes | No | Yes | No | No | Yes |
| Perplexity research | No | No | No | No | No | No |
| DeepSeek R1 | No | No | Yes | No | No | No |
| Claude 3.7 | Yes | No | Yes | No | No | No |

  • Blogs, websites, news articles: The commercial LLMs.  Gemini found the most, but it’s Google.
  • Restatement: Lexis
  • Eltra Corp. v. Ringer, 1976 U.S. Dist. LEXIS 12611: Lexis, Claude, DeepSeek, ChatGPT, Gemini (it’s not a bad case, but not my favorite for this problem)
  • An actual law review article: Westlaw
  • Higher interest rate consumer loans may snag lenders: Westlaw (not sure why)
  • LibGuides: Gemini
  • Included a handy table: ChatGPT, Gemini

The answers varied in depth of discussion and number of sources:

  • Lexis: 1 page of text, 1 page of sources (I didn’t count the sources in the tabs)
  • Westlaw: 2.5 pages of formatted text, 17 pages of sources
  • ChatGPT: 8 pages of well-formatted text, 1 page of sources
  • Gemini: 6.5 pages of well-formatted text, 1 page of sources
  • Perplexity: A little more than 4 pages of text, about 1 page of sources
  • DeepSeek: A little more than 2 pages of weirdly formatted text, no separate sources
  • Claude: 2.5 pages of well-formatted text, no separate sources

Hallucinations

  • I didn’t find any sources that were completely made up
  • I didn’t find any obvious errors in the written text, though some sources made more sense than others
  • I did not thoroughly examine every source in every list (that would require more time than I’ve already devoted to this blog post). 

Some random concluding thoughts about benchmarking

When I was running these searches, I was sometimes frustrated with the Westlaw and Lexis AI research tools. Not only do they fail to describe exactly what they are searching, but they also don’t necessarily capture critical primary sources in their answers (we can get a general idea of the sources used, but not at the level of granularity I’d like). For example, the Copyright Compendium includes one of the more relevant discussions about artistic elements in fonts and typefaces, but that discussion isn’t captured in the RAGs. To be sure, Lexis did find a similar discussion in Nimmer; Westlaw didn’t find anything comparable, although it did cite secondary sources.

In general, the responses provided by all of the generative AI platforms were correct, but some were more complete than others.  For the most part, the commercial reasoning models (particularly ChatGPT and Gemini) provided more detailed and structured answers than the others.  They also provided responses using formatting designed to make the answers easy to read (Westlaw did as well).

None of the models appeared to consider that recency would be a significant factor in this problem.  Several cited a case from the 70s that didn’t concern fonts.  Several failed to cite Laatz, a recent case that’s on point.  Lexis and Westlaw, of course, cited to authoritative secondary sources (and even a law review article in Westlaw’s case).  The LLMs were less concerned with citing to authority.  In all cases, I would have preferred a more curated set of resources than the platforms provided. 

Finally, none of the platforms included visual elements in what is inherently a visual question. It would have been nice to see some examples of “this is probably copyrightable and this is not” (not that I directly asked for them).