LawBench is a comprehensive benchmark designed to evaluate the legal knowledge and reasoning capabilities of Large Language Models (LLMs). It addresses the gap in understanding LLM performance in the specialized and safety-critical legal domain, targeting researchers and developers working with LLMs for legal applications. The benchmark provides a structured way to assess how well LLMs can recall, understand, and apply legal knowledge, offering insights into their reliability for legal tasks.
How It Works
LawBench assesses three dimensions of judicial cognition: legal knowledge memorization, understanding, and application. It comprises 20 distinct tasks, each with 500 examples, covering a wider range of real-world legal scenarios than typical multiple-choice benchmarks. Tasks include legal entity recognition, reading comprehension, crime amount calculation, and legal consultation. A dedicated "abstention rate" metric measures how often a model refuses to answer or fails to understand a query, since safety policies can lead LLMs to decline legal questions.
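As a rough illustration of how an abstention rate can be computed, here is a minimal sketch assuming a marker-based refusal heuristic; the refusal markers and record format are illustrative assumptions, not LawBench's actual detection logic:

```python
# Minimal sketch of an abstention-rate metric: the fraction of model
# outputs that are refusals or non-answers rather than attempts.
# The refusal markers below are illustrative assumptions.
REFUSAL_MARKERS = ["无法回答", "cannot answer", "I'm sorry", "作为一个AI"]

def abstention_rate(predictions: list[str]) -> float:
    """Return the share of predictions that abstain from answering."""
    if not predictions:
        return 0.0
    abstained = sum(
        any(marker in pred for marker in REFUSAL_MARKERS)
        for pred in predictions
    )
    return abstained / len(predictions)

print(abstention_rate(["根据刑法第二百六十四条...", "无法回答", "I'm sorry, I can't help."]))
# 2 of 3 outputs abstain -> 0.666...
```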
Quick Start & Requirements
Required Python packages: rouge_chinese==1.0.3, cn2an==0.5.22, ltp==4.2.13, OpenCC==1.1.6, python-Levenshtein==0.21.1, pypinyin==0.49.0, tqdm==4.64.1, timeout_decorator==0.5.0.
Place model predictions in the expected predictions/zero_shot folder structure, then run python main.py -i ../predictions/zero_shot -o ../predictions/zero_shot/results.csv from the evaluation directory. Task data is stored in the data folder as JSON files.
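To sanity-check that a prediction file is well-formed before scoring, a small sketch like the following may help; the per-record schema (index-keyed entries with "origin_prompt", "prediction", and "refr" fields) is an assumption and should be verified against the example files shipped with the repository:

```python
import json
from pathlib import Path

# Hypothetical sanity check for one task's prediction file before scoring.
# The schema below is an assumption; compare against the repo's examples.
def check_prediction_file(path: Path, expected_n: int = 500) -> None:
    records = json.loads(path.read_text(encoding="utf-8"))
    assert len(records) == expected_n, f"{path.name}: {len(records)} records"
    for idx, rec in records.items():
        for field in ("origin_prompt", "prediction", "refr"):
            assert field in rec, f"{path.name}[{idx}] missing {field!r}"

# Each LawBench task has 500 examples, hence the default expected_n.
for task_file in Path("../predictions/zero_shot").glob("*.json"):
    check_prediction_file(task_file)
```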
Highlighted Details
Maintenance & Community
The project is associated with the paper "LawBench: Benchmarking Legal Knowledge of Large Language Models" by Fei et al. (2023). External contributors are welcomed for dataset expansion and model evaluation. Contact information for further collaboration is available.
Licensing & Compatibility
The licensing is complex, as LawBench is a composite dataset. Users are required to adhere to the licenses of the original data creators for each task. Specific license details for each task are available via the task list.
Limitations & Caveats
The project acknowledges that ROUGE-L may not be ideal for evaluating long-form generation and plans to explore LLM-based, task-specific metrics. It also aims to develop better strategies to prevent data contamination, as some models might have been trained on parts of the test data.
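For reference, the rouge_chinese dependency listed above computes ROUGE scores on pre-segmented text along these lines; jieba is used here purely for illustration (LawBench lists ltp among its dependencies for segmentation):

```python
from rouge_chinese import Rouge
import jieba  # segmentation for illustration; LawBench itself lists ltp

# ROUGE expects space-separated tokens, so Chinese text is segmented first.
hypothesis = " ".join(jieba.cut("被告人的行为构成盗窃罪"))
reference = " ".join(jieba.cut("被告人构成盗窃罪"))

rouge = Rouge()
scores = rouge.get_scores(hypothesis, reference)
print(scores[0]["rouge-l"]["f"])  # ROUGE-L F1 on the segmented strings
```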