About Our Research Group
The AI Law Librarians Benchmarking Group is a collaborative team of law librarians dedicated to systematically evaluating generative AI tools for legal research. As legal tech vendors rapidly release AI-powered products—each claiming to transform research practices—we saw a need for a consistent, thoughtful, and ongoing approach to testing these tools in real-world legal research scenarios. Rather than relying on isolated or anecdotal use, our group is developing a framework grounded in legal research pedagogy and information science to assess how these tools perform across defined task types.
The group includes a few of your AI Law Librarians bloggers, along with several other librarians. The full group is:
- Rebecca Fordon, The Ohio State University Moritz Law Library
- Jonathan Franklin, University of Washington School of Law
- Deborah Ginsberg, Harvard Law Library
- Nick Hafen, BYU Law School
- Sean Harrington, University of Oklahoma Law Library
- Christine Park, Harvard Law Library
Together, we are building a typology of legal research tasks, designing evaluation prompts and rubrics, and documenting our findings to share with the broader legal and library communities. We welcome collaboration and feedback from others working at the intersection of law, AI, and information literacy. Our introduction post is here.
Legal Research AI Benchmarking Studies / Comparisons
- UMN/UMich Randomized Controlled Trial – Daniel Schwarcz et al., AI-Powered Lawyering: AI Reasoning Models, Retrieval Augmented Generation, and the Future of Legal Practice (Mar. 2025).
- Stanford 2025 Legal Research Hallucination Study – Varun Magesh et al., Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools, 22 Journal of Empirical Legal Studies 216 (2025).
- AI Smackdown – Bob Ambrogi, In ‘AI Smackdown,’ Law Librarians Compare Legal AI Research Platforms, Finding Distinct Strengths and Limitations, LawSites (Feb. 14, 2025).
Legal Reasoning and Task Benchmarking
- Stanford 2024 Legal Hallucination Study – Matthew Dahl et al., Large Legal Fictions: Profiling Legal Hallucinations in Large Language Models, 16 Journal of Legal Analysis 64 (2024).
- LawBench – Zhiwei Fei et al., LawBench: Benchmarking Legal Knowledge of Large Language Models, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (Nov. 2024).
- LegalBench – Neel Guha et al., LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models, 36 Advances in Neural Information Processing Systems 44123 (Dec. 2023).
- Vals – Vals Legal AI Report, https://www.vals.ai/vlair (Feb. 2025).
Vendor Reports about Benchmarking
- Mike Dahn, Thomson Reuters Best Practices for Benchmarking AI for Legal Research, Thomson Reuters Institute, https://www.thomsonreuters.com/en-us/posts/innovation/thomson-reuters-best-practices-for-benchmarking-ai-for-legal-research/ (Feb. 12, 2025).
- Paxton AI Achieves 94%+ Accuracy on Stanford Hallucination Benchmark | PAXTON, https://www.paxton.ai/post/paxton-ai-achieves-94-accuracy-on-stanford-hallucination-benchmark (July 24, 2024).
- Introducing the Paxton AI Citator: Setting New Benchmarks in Legal Research | PAXTON, https://www.paxton.ai/post/introducing-the-paxton-ai-citator-setting-new-benchmarks-in-legal-research (July 8, 2024).
- Introducing BigLaw Bench, Harvey, https://www.harvey.ai/blog/introducing-biglaw-bench (Aug. 29, 2024).