Legal reasoning benchmark for evaluating LLMs
LegalBench is an open-science initiative to create and maintain a comprehensive benchmark for evaluating the legal reasoning capabilities of large language models. It targets researchers and practitioners in AI and law, aiming to drive innovation in legal NLP and assess the safety and reliability of LLMs in legal contexts.
How It Works
LegalBench comprises 162 distinct tasks, each with an associated dataset of input-output pairs designed to test a specific legal reasoning skill. Tasks are contributed through a crowd-sourcing effort involving legal professionals and academics, which ensures coverage of diverse legal domains, text types, and reasoning challenges. LLMs are evaluated on each task by measuring how accurately their generated outputs match the gold answers.
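A minimal sketch of that evaluation loop is shown below. It assumes the Hugging Face distribution of the benchmark (`nguha/legalbench`), the `abercrombie` task, and `text`/`answer` column names; actual column names, splits, and scoring conventions vary by task, and `my_model` is a hypothetical stand-in for the LLM under evaluation.

```python
# Sketch: score one LegalBench task by exact-match accuracy.
# Assumptions: dataset id "nguha/legalbench", task config "abercrombie",
# and "text"/"answer" columns -- adjust these to the task you select.
from datasets import load_dataset


def my_model(prompt: str) -> str:
    """Hypothetical stand-in for the LLM being evaluated."""
    raise NotImplementedError


task = load_dataset("nguha/legalbench", "abercrombie", split="test")

correct = 0
for example in task:
    prediction = my_model(example["text"])
    # Compare the model's output to the gold label, ignoring case/whitespace.
    correct += int(prediction.strip().lower() == example["answer"].strip().lower())

print(f"accuracy: {correct / len(task):.3f}")
```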
Quick Start & Requirements
Highlighted Details
Maintenance & Community
The project is an ongoing effort with active community involvement. Contact information for questions and contributions, along with links to related projects and research papers, is provided.
Licensing & Compatibility
LegalBench is a collection of datasets with varying licenses, and users must adhere to the specific license set by each dataset's creator. A notebook is provided to help select tasks based on their license information.
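The kind of license-based filtering that notebook supports might look like the sketch below. This is only an illustration: the metadata file name, column names, and license strings are assumptions, not the repository's actual layout.

```python
# Hypothetical sketch: select only tasks whose datasets carry permissive licenses.
# "task_metadata.csv" and its "task"/"license" columns are assumed for illustration.
import pandas as pd

tasks = pd.read_csv("task_metadata.csv")
permissive = tasks[tasks["license"].isin(["CC BY 4.0", "MIT"])]
print(permissive["task"].tolist())
```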
Limitations & Caveats
The benchmark's composition is subject to the ongoing crowd-sourcing effort, meaning its scope and coverage will evolve. Users must manage the licensing complexities of the individual datasets within the benchmark.