Debbie Ginsberg, Guest Blogger
Benchmarking should be simple, right? Come up with a set of criteria, run some tests, and compare the answers. But how do you benchmark a moving target like generative AI?
Over the past months, I’ve tested a sample legal question in various commercial LLMs (like ChatGPT and Google Gemini) and RAGs (like Lexis Protégé and Westlaw CoCounsel) to compare how each handled the issues raised. Almost every time I created a sample set of model answers to write about, the technology would change drastically within a few days. My set became outdated before I could start my analysis. While this became a good reason to procrastinate, I still wanted to show something for my work.
As we tell our 1Ls, sometimes you need to work with what you have and just write.
The model question
In May, I asked several LLMs and RAGs this question (see the list below for which ones I tested):
Under current U.S. copyright law (caselaw, statutes, regulations, agency information), to what extent are fonts and typefaces protectable as intellectual property? Please focus on the distinction between protection for font software versus typeface designs. What are the key limitations on such protection as established by statute and case law? Specifically, if a font has been created by proprietary software, or if a font has been hand-designed to include artistic elements (e.g., “A” incorporates a detailed drawing of an apple into its design), is the font entitled to copyright protection?
I chose this question because the answer isn’t facially obvious – it straddles the line between “typeface isn’t copyrightable” and “art and software are copyrightable”. To answer the question effectively, the models would need to address that nuance in some form.
The model benchmarks
The next issue was how to compare the models. In my first runs, the answers varied so wildly that it was hard to compare them at all. Lately, the answers have been more similar, and I was able to develop a set of criteria for comparison. So for the May set, I benchmarked (or at least checked):
- Did the AI answer the question that I asked?
- Was the answer thorough (did it more or less match my model answer)?
- Did the AI cite the most important cases and sources noted in my model answer?
- Were any additional citations the AI included at least facially relevant?
- Did the model refrain from providing irrelevant or false information?
I did not benchmark:
- Speed (we already know the reasoning models can be slow)
- If the citations were wrong in a non-obvious way
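For what it’s worth, a checklist like the one above is easy to turn into a simple pass/fail rubric. This is just a sketch of how I think about the scoring; the criterion names, the equal weighting, and the example results are my own hypotheticals, not output from any of the tools:

```python
# A minimal sketch of a pass/fail rubric mirroring the checklist above.
# Criterion names are my own shorthand for the questions I asked.

CRITERIA = [
    "answered_question",         # Did the AI answer the question I asked?
    "thorough",                  # Did it more or less match my model answer?
    "key_citations",             # Did it cite the most important sources?
    "extra_citations_relevant",  # Were any additional citations facially relevant?
    "no_irrelevant_or_false",    # Did it avoid irrelevant or false information?
]

def score(results: dict) -> float:
    """Return the fraction of criteria a model satisfied (0.0 to 1.0)."""
    return sum(bool(results.get(c)) for c in CRITERIA) / len(CRITERIA)

# Hypothetical example: a model that met four of the five criteria.
example = {
    "answered_question": True,
    "thorough": True,
    "key_citations": True,
    "extra_citations_relevant": True,
    "no_irrelevant_or_false": False,
}
print(score(example))  # 0.8
```

Equal weighting is an oversimplification (missing the key citations matters more to me than a stray irrelevant one), but even a crude fraction makes it easier to line the models up side by side.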
The model answer and sources
According to my model answer, the best answers to the question should include at least the following:
- Font software: Font software that creates fonts is protected by copyright. The main exception is software that essentially executes a font or font file, meaning the software is utilitarian rather than creative.
- Typefaces/Fonts: Neither of these is protected by copyright law. Fonts and typefaces may have artistic elements that are protected by copyright law, but only the artistic elements are protected, not the typefaces or fonts themselves.
- The answer should include at least some discussion as to whether a heavily artistic font qualifies for protection.
Bonus if the answer addressed:
- Separability: If the art can be separated from the typeface/font, it’s copyrightable.
- Alternatives: Can the font/typeface be protected by other IP protections such as licensing, patents, or trademarks?
- International implications: Would we expect to see the same results in other jurisdictions?
In answering this question, I expected the LLMs and RAGs to cite:
- The copyright statute (which provides the basis for any copyright determinations)
- Copyright regulations (which provide additional rules for determining copyrightability)
- Adobe Sys. v. Southern Software, Inc. (software that creates fonts can be copyrighted)
- Laatz v. Zazzle, Inc. (a newer case discussing copyrighting fonts made with software; it also includes a discussion about alternatives to copyright)
- Shake Shack Enterprises, LLC et al v. Brand Design Company, Inc. (discusses limits of copyrighting font software)
- The Copyright Compendium, 2021 (from the Copyright Office) (features hypotheticals about artistic elements in fonts/typefaces – this is probably the most important resource)
Benchmarking with the AI models
For this post, I ran my model question in the following LLMs/RAGs:
- Lexis Protégé (work account)
- Westlaw CoCounsel (work account)
- ChatGPT o3 deep research (work account)
- Gemini 2.5 deep research (personal paid account)
- Perplexity research (personal paid account)
- DeepSeek R1 (personal free account)
- Claude 3.7 (personal paid account)
I’ve set up accounts in several commercial GenAI products. Some are free, some are Pro, and Harvard pays for my ChatGPT Enterprise account. As an academic librarian, I have access to CoCounsel and Protégé.
The individual responses are included in the appendix.
I didn’t have access to Vincent or Paxton at the time, nor to ChatGPT o3 Pro. Later in June, Nick Halperin ran my model question in Vincent and Paxton, and I ran it in o3 Pro. Those examples, as well as GPT-5, will be included in the appendix, but they are not discussed here.
Benchmarking the results
In parsing the results, most answers were fairly similar with some exceptions:
| Source | Font software copyrightable | Typefaces/fonts not copyrightable | Exceptions to font-software copyright | Art in typefaces/fonts copyrightable |
|---|---|---|---|---|
| Lexis Protégé | Yes | Yes | Yes | No |
| Westlaw CoCounsel | Yes | Yes | No | Yes |
| ChatGPT o3 deep research | Yes | Yes | Yes | Yes |
| Gemini 2.5 deep research | Yes | Yes | Yes | Yes |
| Perplexity research | Yes | Yes | Yes | Yes |
| DeepSeek R1 | Yes | Yes | Yes | Yes |
| Claude 3.7 | Yes | Yes | Yes | Yes |
- Font software is copyrightable: in all answers
- Typefaces/fonts are not copyrightable: in all answers
- Exceptions to font software copyright: in all answers except Westlaw
- Art in typefaces/fonts is copyrightable: in all answers except Lexis
Several answers included additional helpful information:
| Source | Separability | Copyright Office policies | Alternatives | Licensing | Int’l | Recent developments | State law |
|---|---|---|---|---|---|---|---|
| Lexis Protégé | Yes | No | No | No | No | No | No |
| Westlaw CoCounsel | No | No | No | No | No | No | Yes |
| ChatGPT o3 deep research | Yes | Yes | Yes | Yes | Yes | Yes | No |
| Gemini 2.5 deep research | Yes | Yes | Yes | Yes | No | No | No |
| Perplexity research | Yes | No | Yes | No | No | No | No |
| DeepSeek R1 | Yes | No | Yes | No | No | No | No |
| Claude 3.7 | No | No | Yes | Yes | Yes | No | No |
- Discussions about separability: Gemini, ChatGPT, DeepSeek (to some extent), Perplexity, Lexis
- Specific discussions about Copyright Office policies: Gemini, ChatGPT
- Discussions about alternatives to copyright (e.g., patent, trademark): Gemini, Claude, ChatGPT, DeepSeek, Perplexity
- Specific discussions about licensing: Gemini, Claude, ChatGPT
- International considerations: Claude, ChatGPT
- Recent developments: ChatGPT
- State law: Westlaw
The models were somewhat consistent about what they cited:
| LLM/RAG | Copyright statute | Copyright regs | Adobe | Laatz | Shake Shack | The Copyright Compendium |
|---|---|---|---|---|---|---|
| Lexis Protégé | Yes | Yes | Yes | Yes | No | No |
| Westlaw CoCounsel | Yes | Yes | Yes | Yes | Yes | No |
| ChatGPT o3 deep research | Yes | Yes | Yes | No | No | Yes |
| Gemini 2.5 deep research | Yes | Yes | Yes | Yes | No | Yes |
| Perplexity research | No | Yes | No | No | No | Yes |
| DeepSeek R1 | Yes | Yes | Yes | No | No | No |
| Claude 3.7 | No | Yes | Yes | No | No | No |
- The copyright statute: Lexis, Westlaw, DeepSeek, ChatGPT, Gemini
- Copyright regs: cited by all
- Adobe: Lexis, Westlaw, Claude, DeepSeek, ChatGPT, Gemini
- Laatz: Lexis, Westlaw, Gemini
- Shake Shack: Westlaw
- The Copyright Compendium: Perplexity, ChatGPT, Gemini; Lexis cited to Nimmer for the same discussion
The models also included additional resources not on my list:
| LLM/RAG | Blogs etc. | Restatement | Eltra | Law review | Articles about loans | LibGuides |
|---|---|---|---|---|---|---|
| Lexis Protégé | Yes | Yes | Yes | No | No | No |
| Westlaw CoCounsel | Yes | No | No | Yes | Yes | No |
| ChatGPT o3 deep research | Yes | No | Yes | No | No | No |
| Gemini 2.5 deep research | Yes | No | Yes | No | No | Yes |
| Perplexity research | No | No | No | No | No | No |
| DeepSeek R1 | No | No | Yes | No | No | No |
| Claude 3.7 | Yes | No | Yes | No | No | No |
- Blogs, websites, news articles: The commercial LLMs. Gemini found the most, but it’s Google.
- Restatement: Lexis
- Eltra Corp. v. Ringer, 1976 U.S. Dist. LEXIS 12611: Lexis, Claude, DeepSeek, ChatGPT, Gemini (it’s not a bad case, but not my favorite for this problem)
- An actual law review article: Westlaw
- Higher interest rate consumer loans may snag lenders: Westlaw (not sure why)
- LibGuides: Gemini
- Included a handy table: ChatGPT, Gemini
The answers varied in depth of discussion and number of sources:
- Lexis: 1 page of text, 1 page of sources (I didn’t count the sources in the tabs)
- Westlaw: 2.5 pages of formatted text, 17 pages of sources
- ChatGPT: 8 pages of well-formatted text, 1 page of sources
- Gemini: 6.5 pages of well-formatted text, 1 page of sources
- Perplexity: A little more than 4 pages of text, about 1 page of sources
- DeepSeek: a little more than 2 pages of weirdly formatted text, no separate sources
- Claude: 2.5 pages of well-formatted text, no separate sources
Hallucinations
- I didn’t find any sources that were completely made up
- I didn’t find any obvious errors in the written text, though some sources made more sense than others
- I did not thoroughly examine every source in every list (that would require more time than I’ve already devoted to this blog post).
Some random concluding thoughts about benchmarking
When I was running these searches, I was sometimes frustrated with the Westlaw and Lexis AI research tools. Not only do they fail to describe exactly what they are searching, but they also don’t necessarily capture critical primary sources in their answers (we can get a general idea of the sources used, but not as granular as I’d like). For example, the Copyright Compendium includes one of the more relevant discussions about artistic elements in fonts and typefaces, but that discussion isn’t captured in the RAGs. To be sure, Lexis did find a similar discussion in Nimmer; Westlaw didn’t find anything comparable, although it did cite secondary sources.
In general, the responses provided by all of the generative AI platforms were correct, but some were more complete than others. For the most part, the commercial reasoning models (particularly ChatGPT and Gemini) provided more detailed and structured answers than the others. They also provided responses using formatting designed to make the answers easy to read (Westlaw did as well).
None of the models appeared to consider that recency would be a significant factor in this problem. Several cited a case from the 70s that didn’t concern fonts. Several failed to cite Laatz, a recent case that’s on point. Lexis and Westlaw, of course, cited to authoritative secondary sources (and even a law review article in Westlaw’s case). The LLMs were less concerned with citing to authority. In all cases, I would have preferred a more curated set of resources than the platforms provided.
Finally, none of the platforms included visual elements in what is inherently a visual question. It would have been nice to see some examples of “this is probably copyrightable and this is not” (not that I directly asked for them).