Introduction to LawBench: Benchmarking Legal Knowledge of Large Language Models
In the digital age, large language models (LLMs) have made significant strides across many domains. Despite this success, it remains unclear how proficient they are in specialized fields such as law, where tasks are both highly technical and critical for public safety. To address this gap, LawBench offers a comprehensive benchmark for evaluating the legal aptitude of these models. For more details, you can check out the LawBench paper.
Overview
LawBench is designed to provide a precise evaluation of a large language model's legal capabilities. Its tasks are organized along three dimensions of judicial cognition, with 20 diverse tasks selected to thoroughly assess model abilities. Unlike existing benchmarks that focus solely on multiple-choice questions, LawBench incorporates a variety of task types closely tied to real-world legal applications, including legal entity recognition, reading comprehension, crime amount calculation, and legal consultation.
Recognizing that current large models may refuse to respond to certain legal queries due to safety policies, or may misinterpret instructions, LawBench introduces a separate metric: the "waive rate." This metric measures how often a model refuses to provide an answer or fails to follow the instruction properly. The benchmark reports results for 51 language models, comprising 20 multilingual models, 22 Chinese-oriented models, and 9 legal-specialized models.
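To make the idea concrete, here is a minimal sketch of how such an abstention rate could be computed over a set of model outputs. This is an illustration only, not LawBench's official implementation; the refusal-detection heuristic and the marker phrases below are assumptions chosen for demonstration.

```python
# Hypothetical refusal markers used only for this illustration.
REFUSAL_MARKERS = ["I cannot", "I'm sorry", "无法回答", "抱歉"]

def waive_rate(predictions: list[str]) -> float:
    """Fraction of model outputs that are empty or look like refusals."""
    if not predictions:
        return 0.0
    waived = sum(
        1 for p in predictions
        if not p.strip() or any(marker in p for marker in REFUSAL_MARKERS)
    )
    return waived / len(predictions)

# Example: one of four outputs is a refusal, so the waive rate is 0.25.
print(waive_rate(["第三百八十二条……", "I cannot provide legal advice.", "贪污罪", "有期徒刑三年"]))
```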
Dataset
LawBench’s dataset spans 20 distinct tasks that cover three levels of cognitive ability:
- Legal Knowledge Memorization: Evaluates whether LLMs can recall essential legal concepts, terminologies, statutes, and facts.
- Legal Knowledge Understanding: Tests whether LLMs can grasp the entities, events, and relationships in legal texts and understand their significance.
- Legal Knowledge Application: Assesses whether LLMs can correctly apply their legal knowledge and reasoning to solve real-world legal tasks encountered in downstream applications.
Each task in the benchmark consists of 500 examples, sourced from various authoritative databases and repositories relevant to the Chinese legal system.
Here are some sample tasks included in LawBench:
- Legal Knowledge Memorization: Tasks like legal statute recitation and knowledge-based Q&A.
- Legal Knowledge Understanding: Tasks like proofreading documents and identifying legal disputes.
- Legal Knowledge Application: Tasks like predicting legal statutes based on facts and calculating crime amounts.
Data Format
LawBench data is stored in JSON format, which is easy to handle programmatically. Each task's data lives in a <task_id>.json file in the data folder and can be loaded with a JSON parser, yielding a list of dictionary entries.
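The sketch below shows one way to load a single task file under the layout described above (a data folder containing <task_id>.json files, each holding a list of dictionaries). The task id in the usage comment is hypothetical; check the repository for the actual file names and record keys.

```python
import json
from pathlib import Path

def load_task(task_id: str, data_dir: str = "data") -> list[dict]:
    """Return the list of example dictionaries for a single LawBench task."""
    path = Path(data_dir) / f"{task_id}.json"  # e.g. data/<task_id>.json
    with path.open(encoding="utf-8") as f:
        return json.load(f)

# Usage (hypothetical task id):
# examples = load_task("3-2")
# print(len(examples), examples[0].keys())
```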
Model Evaluation
LawBench reports results for models grouped into categories such as multilingual LLMs, listing their parameter sizes, training techniques, and access methods. These include widely used models such as MPT, LLaMA, and their Instruct variants. The listing indicates whether each model has undergone supervised fine-tuning (SFT) or reinforcement learning from human feedback (RLHF), giving a clear picture of its readiness for legal tasks.
With this structured and extensive testing platform, LawBench is a pivotal tool for understanding and improving the legal reasoning capabilities of large language models. It bridges the gap between technological advancement and practical implementation in law, thereby fostering developments that make legal AI more reliable, accessible, and integrated into everyday legal practices.