Benchmarking a Moving Target, or let’s run a hypo through 7 AIs and see what happens

Debbie Ginsberg, Guest Blogger

Benchmarking should be simple, right? Come up with a set of criteria, run some tests, and compare the answers. But how do you benchmark a moving target like generative AI?

Over the past months, I’ve tested a sample legal question in various commercial LLMs (like ChatGPT and Google Gemini) and RAGs (like Lexis Protégé and Westlaw CoCounsel) to compare how each handled the issues raised. Almost every time I created a sample set of model answers to write about, the technology would change drastically within a few days. My set became outdated before I could start my analysis. While this became a good reason to procrastinate, I still wanted to show something for my work.

As we tell our 1Ls, sometimes you need to work with what you have and just write.

The model question

In May, I asked several LLMs and RAGs this question (see the list below for which ones I tested):

Under current U.S. copyright law (caselaw, statutes, regulations, agency information), to what extent are fonts and typefaces protectable as intellectual property? Please focus on the distinction between protection for font software versus typeface designs. What are the key limitations on such protection as established by statute and case law? Specifically, if a font has been created by proprietary software, or if a font has been hand-designed to include artistic elements (e.g., “A” incorporates a detailed drawing of an apple into its design), is the font entitled to copyright protection?

I chose this question because the answer isn’t facially obvious – it straddles the line between “typeface isn’t copyrightable” and “art and software are copyrightable”.  To answer the question effectively, the models would need to address that nuance in some form.

The model benchmarks

The next issue was how to compare the models. In my first runs, the answers varied wildly, which made them hard to compare. Lately, the answers have been more similar, so I was able to develop a set of criteria for comparison. For the May set, I benchmarked (or at least checked) the following (a small scoring sketch follows these lists):

  • Did the AI answer the question that I asked?
  • Was the answer thorough (did it more or less match my model answer)?
  • Did the AI cite the most important cases and sources noted in my model answer?
  • Were any additional citations the AI included at least facially relevant?
  • Did the model refrain from providing irrelevant or false information?

I did not benchmark:

  • Speed (we already know the reasoning models can be slow)
  • If the citations were wrong in a non-obvious way 
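
For readers who want to track this checklist across runs, here’s a minimal sketch of how the criteria could be recorded as data rather than prose. The model names and True/False marks below are hypothetical placeholders, not my actual results.

```python
# Hypothetical example: record the five benchmark criteria and mark, per model,
# whether each was met. "Model A"/"Model B" and their marks are placeholders.

CRITERIA = [
    "answered the question asked",
    "matched the model answer",
    "cited the key cases and sources",
    "extra citations were facially relevant",
    "no irrelevant or false information",
]

# True means the response satisfied the criterion (placeholder values).
results = {
    "Model A": [True, True, True, True, True],
    "Model B": [True, False, True, True, False],
}

for model, checks in results.items():
    print(f"{model}: met {sum(checks)} of {len(CRITERIA)} criteria")
```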

The model answer and sources

According to my model answer, the best answers to the question should include at least the following:

  • Font software: Font software that creates fonts is protected by copyright.  The main exception is software that essentially executes a font or font file, meaning the software is utilitarian rather than creative.
  • Typefaces/Fonts: Neither of these is protected by copyright law.  Fonts and typefaces may have artistic elements that are protected by copyright law, but only the artistic elements are protected, not the typefaces or fonts themselves.
  • The answer should include at least some discussion as to whether a heavily artistic font qualifies for protection.

Bonus if the answer addressed:

  • Separability: If the art can be separated from the typeface/font, it’s copyrightable.
  • Alternatives: Can the font/typeface be protected by other IP protections such as licensing, patents, or trademarks?
  • International implications: Would we expect to see the same results in other jurisdictions?

In answering this question, I expected the LLMs and RAGs to cite:

  • The copyright statute and regulations
  • Adobe
  • Laatz (a recent case on point)
  • Shake Shack
  • The Copyright Compendium

Benchmarking with the AI models

For this post, I ran my model in the following LLMs/RAGs:

  • Lexis Protégé (work account)
  • Westlaw CoCounsel (work account)
  • ChatGPT o3 deep research (work account)
  • Gemini 2.5 deep research (personal paid account)
  • Perplexity research (personal paid account)
  • DeepSeek R1 (personal free account)
  • Claude 3.7 (personal paid account)

I’ve set up accounts in several commercial GenAI products. Some are free, some are Pro, and Harvard pays for my ChatGPT Enterprise account. As an academic librarian, I have access to CoCounsel and Protégé.

The individual responses are included in the appendix.

I didn’t have access to Vincent or Paxton at the time, and I didn’t have ChatGPT o3 Pro either. Later in June, Nick Halperin ran my model in Vincent and Paxton, and I ran the model in o3 Pro. Those examples, as well as GPT-5, will be included in the appendix, but they are not discussed here.

Benchmarking the results

In parsing the results, I found that most answers were fairly similar, with some exceptions:

| Source | Font software copyrightable | Typefaces/fonts not copyrightable | Exceptions to font-software copyright | Art in typefaces/fonts copyrightable |
| --- | --- | --- | --- | --- |
| Lexis Protégé | Yes | Yes | Yes | No |
| Westlaw CoCounsel | Yes | Yes | No | Yes |
| ChatGPT o3 deep research | Yes | Yes | Yes | Yes |
| Gemini 2.5 deep research | Yes | Yes | Yes | Yes |
| Perplexity research | Yes | Yes | Yes | Yes |
| DeepSeek R1 | Yes | Yes | Yes | Yes |
| Claude 3.7 | Yes | Yes | Yes | Yes |

  • Font software is copyrightable: in all answers 
  • Typefaces/fonts are not copyrightable: in all answers
  • Exceptions to font software copyright: in all answers except Westlaw
  • Art in typefaces/fonts is copyrightable: in all answers except Lexis

Several answers included additional helpful information:

| Source | Separability | Copyright Office policies | Alternatives | Licensing | Int'l | Recent developments | State law |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Lexis Protégé | Yes | No | No | No | No | No | No |
| Westlaw CoCounsel | No | No | No | No | No | No | Yes |
| ChatGPT o3 deep research | Yes | Yes | Yes | Yes | Yes | Yes | No |
| Gemini 2.5 deep research | Yes | Yes | Yes | Yes | No | No | No |
| Perplexity research | Yes | No | Yes | No | No | No | No |
| DeepSeek R1 | Yes | No | Yes | No | No | No | No |
| Claude 3.7 | No | No | Yes | Yes | Yes | No | No |

  • Discussions about separability: Gemini, ChatGPT, DeepSeek (to some extent), Perplexity, Lexis
  • Specific discussions about Copyright Office policies: Gemini, ChatGPT
  • Discussions about alternatives to copyright (e.g., patent, trademark): Gemini, Claude, ChatGPT, DeepSeek, Perplexity
  • Specific discussions about licensing: Gemini, Claude, ChatGPT
  • International considerations: Claude, ChatGPT
  • Recent developments: ChatGPT
  • State law: Westlaw

The models were somewhat consistent about what they cited:

| LLM/RAG | Copyright statute | Copyright regs | Adobe | Laatz | Shake Shack | Copyright Compendium |
| --- | --- | --- | --- | --- | --- | --- |
| Lexis Protégé | Yes | Yes | Yes | Yes | No | No |
| Westlaw CoCounsel | Yes | Yes | Yes | Yes | Yes | No |
| ChatGPT o3 deep research | Yes | Yes | Yes | No | No | Yes |
| Gemini 2.5 deep research | Yes | Yes | Yes | Yes | No | Yes |
| Perplexity research | No | Yes | No | No | No | Yes |
| DeepSeek R1 | Yes | Yes | Yes | No | No | No |
| Claude 3.7 | No | Yes | Yes | No | No | No |

  • The Copyright statute: Lexis, Westlaw, DeepSeek, ChatGPT, Gemini
  • Copyright regs: cited by all
  • Adobe: Lexis, Westlaw, Claude, DeepSeek, ChatGPT, Gemini
  • Laatz: Lexis, Westlaw, Gemini
  • Shake Shack: Westlaw
  • The Copyright Compendium: Perplexity, ChatGPT, Gemini; Lexis cited to Nimmer for the same discussion

The models also included additional resources not on my list:

| LLM/RAG | Blogs etc. | Restatement | Eltra | Law review | Article about loans | LibGuides |
| --- | --- | --- | --- | --- | --- | --- |
| Lexis Protégé | Yes | Yes | Yes | No | No | No |
| Westlaw CoCounsel | Yes | No | No | Yes | Yes | No |
| ChatGPT o3 deep research | Yes | No | Yes | No | No | No |
| Gemini 2.5 deep research | Yes | No | Yes | No | No | Yes |
| Perplexity research | No | No | No | No | No | No |
| DeepSeek R1 | No | No | Yes | No | No | No |
| Claude 3.7 | Yes | No | Yes | No | No | No |

  • Blogs, websites, news articles: The commercial LLMs.  Gemini found the most, but it’s Google.
  • Restatement: Lexis
  • Eltra Corp. v. Ringer, 1976 U.S. Dist. LEXIS 12611: Lexis, Claude, DeepSeek, ChatGPT, Gemini (it’s not a bad case, but not my favorite for this problem)
  • An actual law review article: Westlaw
  • Higher interest rate consumer loans may snag lenders: Westlaw (not sure why)
  • LibGuides: Gemini
  • Included a handy table: ChatGPT, Gemini

The answers varied in depth of discussion and number of sources:

  • Lexis: 1 page of text, 1 page of sources (I didn’t count the sources in the tabs)
  • Westlaw: 2.5 pages of formatted text, 17 pages of sources
  • ChatGPT: 8 pages of well-formatted text, 1 page of sources
  • Gemini: 6.5 pages of well-formatted text, 1 page of sources
  • Perplexity: A little more than 4 pages of text, about 1 page of sources
  • DeepSeek: A little more than 2 pages of weirdly formatted text, no separate sources
  • Claude: 2.5 pages of well-formatted text, no separate sources

Hallucinations

  • I didn’t find any sources that were completely made up
  • I didn’t find any obvious errors in the written text, though some sources made more sense than others
  • I did not thoroughly examine every source in every list (that would require more time than I’ve already devoted to this blog post). 

Some random concluding thoughts about benchmarking

When I was running these searches, I was sometimes frustrated with the Westlaw and Lexis AI research tools. Not only do they fail to describe exactly what they are searching, but they also don’t necessarily capture critical primary sources in their answers (we can get a general idea of the sources used, but not at the level of granularity I’d like). For example, the Copyright Compendium includes one of the more relevant discussions about artistic elements in fonts and typefaces, but that discussion isn’t captured in the RAGs. To be sure, Lexis did find a similar discussion in Nimmer; Westlaw didn’t find anything comparable, although it did cite secondary sources.

In general, the responses provided by all of the generative AI platforms were correct, but some were more complete than others.  For the most part, the commercial reasoning models (particularly ChatGPT and Gemini) provided more detailed and structured answers than the others.  They also provided responses using formatting designed to make the answers easy to read (Westlaw did as well).

None of the models appeared to consider that recency would be a significant factor in this problem.  Several cited a case from the 70s that didn’t concern fonts.  Several failed to cite Laatz, a recent case that’s on point.  Lexis and Westlaw, of course, cited to authoritative secondary sources (and even a law review article in Westlaw’s case).  The LLMs were less concerned with citing to authority.  In all cases, I would have preferred a more curated set of resources than the platforms provided. 

Finally, none of the platforms included visual elements in what is inherently a visual question. It would have been nice to see some examples of “this is probably copyrightable and this is not” (not that I directly asked for them).