harvey-labs  by harveyai

Benchmark for evaluating LLM agents in legal domains

Created 1 month ago
342 stars

Top 80.7% on SourcePulse

Project Summary

Harvey LAB is an open-source benchmark for evaluating and improving how Large Language Model (LLM) agents perform realistic legal tasks. It addresses the need for objective assessment in legal AI by pairing a curated dataset of tasks with an execution harness. It is aimed at engineers, researchers, and power users who develop or deploy AI agents for legal support and want to benchmark performance and find weak spots in complex legal workflows.

How It Works

LAB comprises two main components: a dataset of legal tasks, each with agent instructions, relevant documents, and a scoring rubric; and an execution harness that runs agents against these tasks and evaluates the results. The benchmark aims to simulate "real legal work" in "realistic environments," illustrated by a walkthrough of an M&A data-room assignment. Evaluation combines "all-pass rubric scoring" with an "LLM judge."
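To make the scoring idea concrete, here is a minimal Python sketch. It assumes that "all-pass" means a task counts as passed only when every rubric criterion passes; the README snippet does not spell this out. The names `RubricCriterion`, `LegalTask`, and `all_pass` are hypothetical and are not taken from LAB's codebase, and the per-criterion verdicts stand in for whatever the judge actually produces.

```python
from dataclasses import dataclass, field


@dataclass
class RubricCriterion:
    """One rubric line item (hypothetical shape, not LAB's real schema)."""
    description: str


@dataclass
class LegalTask:
    """Bundles the three parts the summary names: instructions, documents, rubric."""
    instructions: str
    documents: list[str]
    rubric: list[RubricCriterion] = field(default_factory=list)


def all_pass(verdicts: dict[str, bool], task: LegalTask) -> bool:
    """Return True only if every rubric criterion has a passing verdict.

    Assumption: a criterion with no verdict counts as a failure, and a task
    with an empty rubric cannot pass.
    """
    return bool(task.rubric) and all(
        verdicts.get(criterion.description, False) for criterion in task.rubric
    )


# Illustrative task; the criteria text is made up for this example.
task = LegalTask(
    instructions="Review the data room and flag change-of-control clauses.",
    documents=["supply_agreement.pdf", "lease.pdf"],
    rubric=[
        RubricCriterion("Identifies the change-of-control clause"),
        RubricCriterion("Cites the correct agreement"),
    ],
)

# One failing criterion fails the whole task under the all-pass assumption.
partial = {
    "Identifies the change-of-control clause": True,
    "Cites the correct agreement": False,
}
print(all_pass(partial, task))
```

Under this reading, partial credit does not exist at the task level, which makes scores stricter than an averaged rubric would be.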

Quick Start & Requirements

A full walkthrough covering setup, task inspection, agent execution, scoring, and report review is available in docs/tutorial.md. Additional documentation detailing architecture, evaluation methodology, and contribution guidelines can be found at docs/architecture.md, docs/evaluation.md, and docs/contributing.md, respectively. Specific installation commands and non-default prerequisites are not detailed in the provided README snippet.

Highlighted Details

  • Benchmark designed for "real legal work" and "realistic environments."
  • Dataset includes tasks with agent instructions, documents, and rubrics.
  • Execution harness facilitates agent runs and performance evaluation.
  • Evaluation methodology features "all-pass rubric scoring" and "LLM judge behavior."
  • Includes a practical example: a "realistic M&A data-room assignment."

Maintenance & Community

Harvey LAB is an "ongoing project" with plans to "consistently add to and refine the task set and execution harness." The project actively encourages community contributions, including adding new tasks, model adapters, evaluation improvements, and documentation. Specific community channels or contributor details were not provided in the README snippet.

Licensing & Compatibility

The provided README snippet does not specify the project's license type or any compatibility notes for commercial use or closed-source linking.

Limitations & Caveats

As an "ongoing project," Harvey LAB is under continuous development, and its task set and execution harness are expected to change. Functionality, APIs, and evaluation metrics may therefore shift between versions.

Health Check

Last Commit: 3 days ago
Responsiveness: Inactive
Pull Requests (30d): 49
Issues (30d): 3

Star History

339 stars in the last 30 days
