LawBench is a comprehensive benchmark designed to evaluate the legal knowledge and reasoning capabilities of Large Language Models (LLMs). It addresses the gap in understanding LLM performance in the specialized and safety-critical legal domain, targeting researchers and developers working with LLMs for legal applications. The benchmark provides a structured way to assess how well LLMs can recall, understand, and apply legal knowledge, offering insights into their reliability for legal tasks.
How It Works
LawBench assesses three dimensions of judicial cognition: legal knowledge memorization, understanding, and application. It comprises 20 distinct tasks, each with 500 examples, covering a wider range of real-world legal scenarios than typical multiple-choice benchmarks. Tasks include legal entity recognition, reading comprehension, crime amount calculation, and legal consultation. A dedicated "abstention rate" metric measures how often a model refuses to answer or fails to understand a query, since safety policies can lead LLMs to decline legal questions.
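As a rough illustration of how an abstention rate can be computed, here is a minimal sketch assuming a marker-based refusal heuristic; the refusal markers and record format are illustrative assumptions, not LawBench's actual detection logic:

```python
# Minimal sketch of an abstention-rate metric: the fraction of model
# outputs that are refusals or non-answers rather than attempts.
# The refusal markers below are illustrative assumptions.
REFUSAL_MARKERS = ["无法回答", "cannot answer", "I'm sorry", "作为一个AI"]

def abstention_rate(predictions: list[str]) -> float:
    """Return the share of predictions that abstain from answering."""
    if not predictions:
        return 0.0
    abstained = sum(
        any(marker in pred for marker in REFUSAL_MARKERS)
        for pred in predictions
    )
    return abstained / len(predictions)

print(abstention_rate(["根据刑法第二百六十四条...", "无法回答", "I'm sorry, I can't help."]))
# 2 of 3 outputs abstain -> 0.666...
```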
Quick Start & Requirements
Required Python packages: rouge_chinese==1.0.3, cn2an==0.5.22, ltp==4.2.13, OpenCC==1.1.6, python-Levenshtein==0.21.1, pypinyin==0.49.0, tqdm==4.64.1, timeout_decorator==0.5.0.
Place model predictions in the expected predictions/zero_shot folder structure, then run python main.py -i ../predictions/zero_shot -o ../predictions/zero_shot/results.csv from the evaluation directory. Task data is stored in the data folder as JSON files.
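To sanity-check that a prediction file is well-formed before scoring, a small sketch like the following may help; the per-record schema (index-keyed entries with "origin_prompt", "prediction", and "refr" fields) is an assumption and should be verified against the example files shipped with the repository:

```python
import json
from pathlib import Path

# Hypothetical sanity check for one task's prediction file before scoring.
# The schema below is an assumption; compare against the repo's examples.
def check_prediction_file(path: Path, expected_n: int = 500) -> None:
    records = json.loads(path.read_text(encoding="utf-8"))
    assert len(records) == expected_n, f"{path.name}: {len(records)} records"
    for idx, rec in records.items():
        for field in ("origin_prompt", "prediction", "refr"):
            assert field in rec, f"{path.name}[{idx}] missing {field!r}"

# Each LawBench task has 500 examples, hence the default expected_n.
for task_file in Path("../predictions/zero_shot").glob("*.json"):
    check_prediction_file(task_file)
```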
Highlighted Details
Maintenance & Community
The project is associated with the paper "LawBench: Benchmarking Legal Knowledge of Large Language Models" by Fei et al. (2023). External contributors are welcomed for dataset expansion and model evaluation. Contact information for further collaboration is available.
Licensing & Compatibility
The licensing is complex, as LawBench is a composite dataset. Users are required to adhere to the licenses of the original data creators for each task. Specific license details for each task are available via the task list.
Limitations & Caveats
The project acknowledges that ROUGE-L may not be ideal for evaluating long-form generation and plans to explore LLM-based, task-specific metrics. It also aims to develop better strategies to prevent data contamination, as some models might have been trained on parts of the test data.
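For reference, the rouge_chinese dependency listed above computes ROUGE scores on pre-segmented text along these lines; jieba is used here purely for illustration (LawBench lists ltp among its dependencies for segmentation):

```python
from rouge_chinese import Rouge
import jieba  # segmentation for illustration; LawBench itself lists ltp

# ROUGE expects space-separated tokens, so Chinese text is segmented first.
hypothesis = " ".join(jieba.cut("被告人的行为构成盗窃罪"))
reference = " ".join(jieba.cut("被告人构成盗窃罪"))

rouge = Rouge()
scores = rouge.get_scores(hypothesis, reference)
print(scores[0]["rouge-l"]["f"])  # ROUGE-L F1 on the segmented strings
```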