LawBench by open-compass

Benchmark for legal LLMs

created 1 year ago
355 stars

Top 79.7% on sourcepulse

View on GitHub
Project Summary

LawBench is a comprehensive benchmark designed to evaluate the legal knowledge and reasoning capabilities of Large Language Models (LLMs). It addresses the gap in understanding LLM performance in the specialized and safety-critical legal domain, targeting researchers and developers working with LLMs for legal applications. The benchmark provides a structured way to assess how well LLMs can recall, understand, and apply legal knowledge, offering insights into their reliability for legal tasks.

How It Works

LawBench simulates three dimensions of judicial cognition: legal knowledge memory, understanding, and application. It comprises 20 distinct tasks, each with 500 examples, covering a wider range of real-world legal scenarios than typical multiple-choice benchmarks. Tasks include legal entity recognition, reading comprehension, crime amount calculation, and legal consultation. A unique "abstention rate" metric is also introduced to measure how often a model refuses to answer or fails to understand a query, since built-in safety policies can lead LLMs to decline legal questions.
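In practice, the abstention rate can be read as the fraction of test examples for which the model returns nothing usable or explicitly declines. A minimal sketch of such a metric in Python; the refusal markers and input format are illustrative assumptions, not LawBench's actual scoring code:

    # Illustrative abstention-rate sketch; LawBench's real scoring lives in its evaluation scripts.
    REFUSAL_MARKERS = ("无法回答", "抱歉", "I cannot", "I'm sorry")  # hypothetical refusal phrases

    def abstention_rate(predictions):
        """Fraction of predictions that are empty or look like refusals."""
        def abstains(text):
            text = text.strip()
            return not text or any(marker in text for marker in REFUSAL_MARKERS)
        return sum(abstains(p) for p in predictions) / len(predictions) if predictions else 0.0

    # Two of four predictions abstain, so the rate is 0.5.
    print(abstention_rate(["甲构成盗窃罪", "", "抱歉，我无法回答这个问题。", "依据刑法第264条处三年以下有期徒刑"]))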

Quick Start & Requirements

  • Installation: The project requires specific Python packages: rouge_chinese==1.0.3, cn2an==0.5.22, ltp==4.2.13, OpenCC==1.1.6, python-Levenshtein==0.21.1, pypinyin==0.49.0, tqdm==4.64.1, timeout_decorator==0.5.0.
  • Evaluation: To evaluate model predictions, place the prediction files in a folder such as ../predictions/zero_shot and run python main.py -i ../predictions/zero_shot -o ../predictions/zero_shot/results.csv from the evaluation directory.
  • Data: Datasets are stored in the data folder as JSON files (a loading sketch follows this list).
  • Resources: No specific hardware requirements are mentioned, but running evaluations on 51 LLMs suggests significant computational resources may be needed for comprehensive testing.
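A minimal sketch of loading one task's data file, assuming a hypothetical file name (1-1.json), a top-level list of records, and instruction/question/answer field names; check these against the actual files in data before relying on them:

    import json
    from pathlib import Path

    # Hypothetical task file and record schema; verify against the repository's data folder.
    task_file = Path("data") / "1-1.json"
    with task_file.open(encoding="utf-8") as f:
        examples = json.load(f)  # assumed to be a list of example records

    print(f"{task_file.name}: {len(examples)} examples")
    first = examples[0]
    for key in ("instruction", "question", "answer"):  # assumed field names
        print(key, "->", str(first.get(key, ""))[:80])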

Highlighted Details

  • Evaluates 51 LLMs, including multilingual, Chinese-specific, and legal-specific models.
  • Includes a novel "abstention rate" metric to capture model refusal or misunderstanding.
  • Tasks are categorized into legal knowledge memory, understanding, and application.
  • Provides detailed performance tables for zero-shot and one-shot evaluations across various models and tasks.

Maintenance & Community

The project is associated with the paper "LawBench: Benchmarking Legal Knowledge of Large Language Models" by Fei et al. (2023). External contributors are welcomed for dataset expansion and model evaluation. Contact information for further collaboration is available.

Licensing & Compatibility

The licensing is complex, as LawBench is a composite dataset. Users are required to adhere to the licenses of the original data creators for each task. Specific license details for each task are available via the task list.

Limitations & Caveats

The project acknowledges that ROUGE-L may not be ideal for evaluating long-form generation and plans to explore LLM-based, task-specific metrics. It also aims to develop better strategies to prevent data contamination, as some models might have been trained on parts of the test data.
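For context, the long-form tasks lean on ROUGE-L via the rouge_chinese dependency listed above. A minimal scoring sketch, assuming the standard Rouge().get_scores interface and crude per-character tokenization (LawBench's own scripts may segment text differently):

    from rouge_chinese import Rouge  # pinned in the project's requirements

    def char_tokenize(text):
        # Per-character tokenization so the scorer sees space-separated tokens;
        # a word segmenter would likely give scores closer to the benchmark's.
        return " ".join(text.replace(" ", ""))

    prediction = "被告人构成盗窃罪，判处有期徒刑一年。"
    reference = "被告人的行为构成盗窃罪，应当判处有期徒刑一年。"
    scores = Rouge().get_scores(char_tokenize(prediction), char_tokenize(reference))
    print(scores[0]["rouge-l"]["f"])  # ROUGE-L F1 for this prediction/reference pair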

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
Star History
27 stars in the last 90 days
