Jennifer Ballard, Good Journey Consulting
In November 2025, Stephen Embry of Above the Law issued a wake-up call to outside counsel: in-house lawyers are increasingly using AI tools and experiencing efficiency gains, and it is only a matter of time before they require the same from their outside lawyers.
However, identifying one or more AI tools that will be the best fit for law practice is a daunting task. There are hundreds of AI tools for lawyers crowding the market, and there is significant hype surrounding the AI industry in general. Fortunately, there is a growing number of independent efforts to evaluate the real-world utility of AI tools for lawyers. These studies offer data about AI tools at fixed points in time that can be used to make better-informed decisions about AI tool selection.
Independent evaluations of AI tools for lawyers
Below are summaries of seven such independent studies, which individually and collectively reveal helpful insights into where it may (or may not) currently be worthwhile to integrate AI tools into your practice.
Contract drafting study
A September 2025 contract drafting study from Legalbenchmarks.ai, a collaboration between legal professionals, AI experts, and researchers, evaluated 13 AI tools (seven legal industry AI tools and six general-purpose AI tools) against a human baseline that consisted of in-house commercial lawyers with an average of 10 years of working experience. The legal industry AI tools included in the study were August, Brackets, GC AI, InstaSpace, SimpleDocs, Wordsmith, and an anonymous tool, while the general-purpose tools were ChatGPT, Claude, Copilot, Gemini, Le Chat, and Qwen. The study found that some AI tools outperformed the human baseline in producing reliable first drafts of contracts. The study did not find a meaningful difference in the output reliability or output usefulness between the general-purpose and legal industry AI tools. The top performing tools for output were Gemini, ChatGPT, GC AI, Brackets, August, and SimpleDocs. The study concluded that while the legal industry AI tools were not outperforming general-purpose AI tools on output, they were beginning to differentiate themselves with workflow and support functionalities for lawyers, such as integrating with Microsoft Word, and offering clause libraries and templates. The most meaningful differentiator the study found among the legal industry AI tools was whether the tool integrated with existing workflow and technology. For workflow integration or support, the top performers were Brackets, GC AI, and SimpleDocs. You can read this study here.
Information extraction study
A second study from Legalbenchmarks.ai, released in April 2025, focused on information extraction tasks for in-house lawyers. This study evaluated six AI tools: two legal industry AI tools (GC AI and Vecflow’s Oliver), two general-purpose AI assistant tools (Google’s Notebook LM and Microsoft Copilot), and two general-purpose LLM chatbots (DeepSeek and ChatGPT). All of the AI tools were scored on both accuracy and usefulness. The two legal industry AI tools, GC AI and Oliver, received the highest combined scores, and the study concluded that while general-purpose AI tools could match legal industry AI tools in accuracy, the legal industry AI tools delivered more value in usability and workflow integration. You can read this study here.
Vals Legal AI Report
In February 2025, Vals AI, a platform that seeks to advance generative AI with independent and scalable evaluation infrastructure, released the Vals Legal AI Report (VLAIR), which evaluated four legal industry AI tools (CoCounsel, Harvey Assistant, Oliver, and Vincent AI) and compared the results to a lawyer control group. The tools were evaluated across up to seven tasks commonly performed by lawyers (each company could opt into as many of the task evaluations as desired). One or more AI tools beat the lawyer control group on four tasks (document extraction, document question-answering, document summarization, and transcript analysis), while the lawyer control group surpassed the AI tools on two tasks (redlining and EDGAR research) and matched the highest performing tool on one task (chronology generation). Harvey Assistant, which participated in six of the seven tasks, had the strongest performance, receiving the top score on five tasks and the second-place score on one task, and beating or matching the lawyer control group in five tasks. This study can be accessed here.
VLAIR—Legal Research
In October 2025, Vals AI released an extension of VLAIR focusing on legal research. VLAIR—Legal Research evaluated three legal industry AI tools (Alexi, Counsel Stack, and Midpage), as well as ChatGPT and a human baseline of lawyers from one law firm who were all experienced in conducting legal research. The study involved 200 legal research questions. The AI tools and the lawyer baseline were each given a weighted score, with 50% of the score given to accuracy, while 40% was given to authoritativeness, meaning whether the response was supported by citations to proper sources, and 10% of the score was given to appropriateness, meaning whether the response was easily understood and could be shared as-is with others. The study found that the legal industry AI tools received the highest weighted scores, ranging from 76% to 78%, followed by ChatGPT at 74%, with the lawyer baseline scoring the lowest at 69%. Counsel Stack had the highest score of the legal industry AI tools.
Notably, the study found that when the AI tools outperformed the lawyer baseline, they did so by a large margin. Of the 200 questions included in the study, AI tools outperformed the lawyer baseline on 150 of the questions, and the average point margin was 31%. In contrast, when the lawyer baseline outperformed the AI tools, it was by an average point margin of 9%, and typically involved questions concerning complex multi-jurisdictional analysis, judgment-based synthesis, or when a deeper understanding of context was necessary. You can read this study in its entirety here.
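The weighting described above for the VLAIR legal research study (50% accuracy, 40% authoritativeness, 10% appropriateness) can be illustrated with a short Python sketch. Only the three weights come from the study as summarized here; the function name, the 0 to 100 scale for each criterion, and the example scores are hypothetical and are not taken from the study's data or methodology.

```python
# Weights as summarized in this article; everything else is illustrative.
WEIGHTS = {
    "accuracy": 0.50,           # was the answer correct?
    "authoritativeness": 0.40,  # was it supported by citations to proper sources?
    "appropriateness": 0.10,    # was it easily understood and shareable as-is?
}


def weighted_score(component_scores: dict[str, float]) -> float:
    """Combine per-criterion scores (assumed 0-100) into one weighted score."""
    missing = WEIGHTS.keys() - component_scores.keys()
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    return sum(WEIGHTS[name] * component_scores[name] for name in WEIGHTS)


# Hypothetical per-criterion scores for a single tool (not study data).
example = {"accuracy": 80.0, "authoritativeness": 75.0, "appropriateness": 70.0}
print(round(weighted_score(example), 2))
```

Because accuracy and authoritativeness together carry 90% of the weight, a tool that writes clearly but cites poorly would still score low under this scheme.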
Vals AI LegalBench contributions
In 2023, researchers created a benchmark called LegalBench, which included 162 legal reasoning tasks evaluated across 20 large language models (LLMs). Benchmarks are datasets and tasks that have been standardized to measure the capabilities of an AI model across an industry. Vals AI contributed to the LegalBench benchmark with a December 2025 update, which evaluated 92 AI models on legal tasks, finding that the top performing AI models were: (1) Gemini 3 Pro (87.04% accuracy), (2) Gemini 3 Flash (86.86% accuracy), and (3) GPT 5 (86.02% accuracy). You can read more about Vals AI’s contribution to LegalBench here.
Law student study
The University of Minnesota published a study in March 2025 called AI-Powered Lawyering: AI Reasoning Models, Retrieval Augmented Generation, and the Future of Legal Practice. In this study, law students tested two AI tools on six legal tasks: Vincent AI, a legal industry AI tool refined using retrieval augmented generation (RAG), and OpenAI’s o1-preview, an AI reasoning model. The study found that one or both AI tools significantly enhanced the quality of the legal work, compared to the legal work performed without AI, in five out of six tasks: (1) drafting an email for a client, (2) drafting a legal memo for a partner, (3) analyzing a complaint and drafting a written analysis, (4) drafting a motion to consolidate, and (5) drafting a persuasive letter. Additionally, the study found that both AI tools significantly boosted productivity in the same five tasks, with particular strength in tasks like analyzing complaints and drafting persuasive letters. Neither tool demonstrated improvement in quality or efficiency for the sixth task, drafting a non-disclosure agreement. The study noted that this was the only task where participants were provided a general template to use in their response, which may have reduced the potential for AI-driven quality improvement. You can read this study in its entirety here.
Legal research hallucination study
Stanford RegLab published a preprint study in May 2024 called Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools. This study tested OpenAI’s GPT-4 along with three legal industry AI tools refined with RAG: Westlaw’s AI-Assisted Research, Ask Practical Law AI (both Thomson Reuters products), and Lexis+ AI, concluding that all four tools hallucinate. The hallucination rates of the RAG-based AI tools tested in the study were lower than that of GPT-4 (which the study found hallucinated 43% of the time) yet remained substantial. The study found that Westlaw’s AI-Assisted Research hallucinated one-third of the time, while Ask Practical Law AI and Lexis+ AI produced hallucinations in more than one of every six responses. LexisNexis and Thomson Reuters both responded that their internal testing and customer feedback demonstrated higher rates of accuracy than the study results, with Thomson Reuters asserting an accuracy rate of approximately 90% for its AI-Assisted Research tool. While the results of this study are already dated given the swift progression of AI developments, the Stanford study identified as its most important takeaway that the legal industry needs thorough and transparent benchmarks and evaluations of AI tools. This study can be accessed here.
What insights do these studies collectively provide?
When these studies are considered collectively, it becomes evident that lawyers should not summarily dismiss AI tools. Several independent studies have now concluded that using an AI tool to perform certain tasks may elevate a lawyer’s work through improved quality and/or efficiency. Tasks that were found by the studies to benefit from the use of an AI tool included contract drafting, document extraction, document question-answering, document summarization, transcript analysis, drafting emails and letters, drafting legal memos, analyzing complaints, drafting motions, and some legal research tasks.
In contrast, tasks where AI tools did not add value within the parameters of the studies included redlining, EDGAR research, and chronology generation. While the Minnesota law student study did not find added value in using AI tools to draft a non-disclosure agreement when the students were provided a general template to use in their response, lawyers can compare this finding to the more recent Legalbenchmarks.ai contract drafting study finding that some AI tools outperformed the human baseline of commercial lawyers with 10 years of experience in producing reliable first drafts of contracts. Additionally, lawyers can consider testing one or more AI tools for contract drafting to draw their own conclusions.
Over time, the findings from these studies can also be used to evaluate how AI tools are evolving. For example, when the findings of Vals AI’s LegalBench contributions are compared to the Stanford hallucination study, it appears that the accuracy of OpenAI’s GPT AI models has improved significantly since May 2024 (December 2025: 86.02% accuracy for GPT 5; May 2024: 57% accuracy for GPT-4, the complement of the 43% hallucination rate the Stanford study reported). This is notable in part because many legal industry AI tools use OpenAI’s models and their competitors’ models as their underlying infrastructure.
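The arithmetic behind that comparison can be sketched in a few lines of Python. The 43% hallucination rate and the 86.02% accuracy figure come from the studies summarized above; treating every non-hallucinated response as accurate is a simplifying assumption, and the function and variable names are hypothetical. Because the two studies measured different tasks in different ways, the resulting gap is best read as a rough indicator.

```python
def accuracy_from_hallucination_rate(hallucination_rate_pct: float) -> float:
    """Approximate accuracy as the complement of a hallucination rate.

    Assumption: any response that is not a hallucination counts as accurate.
    """
    return 100.0 - hallucination_rate_pct


# Figures reported in the studies discussed in this article.
gpt4_may_2024 = accuracy_from_hallucination_rate(43.0)  # Stanford study, GPT-4
gpt5_dec_2025 = 86.02                                   # Vals AI LegalBench update, GPT 5

# Difference expressed in percentage points, not as a relative change.
gain_points = gpt5_dec_2025 - gpt4_may_2024
print(f"Approximate gain: {gain_points:.2f} percentage points")
```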
Some of the studies concluded that it is a toss-up whether you can presently get better output from a general-purpose AI tool or a legal industry AI tool. Further, some of the studies note that legal industry AI tools are distinguishing themselves from the general-purpose AI models by offering better workflow integration and support. Additionally, lawyers should know that some legal industry AI tools may offer more data privacy and security advantages than consumer-grade general-purpose AI tools.
What else should lawyers consider when evaluating AI tool options?
Lawyers should be prepared to distinguish between independent studies, such as the ones discussed above, and in-house evaluations by the companies making AI tools for lawyers. Some AI tool studies are conducted by AI companies themselves and publicized for marketing purposes. While an AI tool company’s evaluations of its own product may provide useful data, it’s important to be mindful of the source of any data utilized for decision-making purposes.
Additionally, while the studies highlighted above have yielded helpful insights, the evaluations conducted to date have only assessed the tip of the iceberg. There are many uses for AI in legal practice and hundreds of AI tools for lawyers that have not been independently evaluated. This means that lawyers who will evaluate AI tool solutions beyond the tools and tasks included in the studies covered in this article should be prepared to do their own testing to determine if an AI tool is a good match for their organization.
Finally, AI tool selection should not begin and end with considering the AI tool options available. Instead, lawyers should start the AI tool selection process by gaining an understanding of the many possible uses that AI tools currently offer and prioritizing the technology issues experienced by their organizations. AI tools for legal research command significant attention in the legal industry, yet many lawyers have not taken time to consider whether legal research is really the highest priority technology issue that their organization needs to address with an AI tool.
Once a lawyer has clarity about where improved technology solutions are most needed in their unique practice, the information in this article becomes most useful, and better-informed decisions can be made about which AI tools deserve further consideration. Further evaluation of an AI tool prior to final selection may include testing the AI tool to assess its real-world performance and should always include a risk assessment of the AI tool’s data privacy and security policies to confirm alignment with a lawyer’s professional responsibilities. ♦