financebench by patronus-ai

Benchmark for financial question answering with LLMs

Created 2 years ago
256 stars

Top 98.5% on SourcePulse

Project Summary

A new benchmark suite, FinanceBench, addresses the critical need for evaluating Large Language Models (LLMs) in open-book financial question answering. It targets researchers and engineers developing AI solutions for the financial sector, offering a standardized method to assess LLM capabilities and identify current limitations, thereby guiding future development and adoption decisions.

How It Works

FinanceBench comprises 10,231 ecologically valid financial questions, complete with human-annotated answers and evidence strings, designed to establish a minimum performance standard for LLMs. The repository provides an open-source sample of 150 annotated examples, alongside two JSONL files detailing questions and document metadata. These can be loaded and joined using Python's pandas library, facilitating the evaluation of LLM performance on real-world financial queries.

Quick Start & Requirements

  • Data Loading: Use pandas to load and join the provided JSONL files:

    import pandas as pd

    # Load the open-source question sample and the document metadata
    df_questions = pd.read_json("data/financebench_open_source.jsonl", lines=True)
    df_meta = pd.read_json("data/financebench_document_information.jsonl", lines=True)

    # Join on the shared document identifier
    df_full = pd.merge(df_questions, df_meta, on="doc_name")

  • Prerequisites: Python, pandas. Financial source documents (PDFs) are located in /pdfs/, and model evaluation results are in /results/.
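The load-and-join step above can be sketched end-to-end on synthetic rows. The column names `question`, `answer`, and `doc_name` follow the open-source sample, but treat the exact schema as an assumption and inspect `df_questions.columns` on the real files; the company names and values below are toy data, not benchmark content:

```python
import pandas as pd

# Toy question rows shaped like the open-source sample
# (doc_name is the join key shared with the metadata file).
df_questions = pd.DataFrame([
    {"question": "What was ACME's FY2020 revenue?", "answer": "$1.2B",
     "doc_name": "ACME_2020_10K"},
    {"question": "What is Globex's FY2021 net margin?", "answer": "8%",
     "doc_name": "GLOBEX_2021_10K"},
])

# Toy document metadata keyed by the same doc_name
df_meta = pd.DataFrame([
    {"doc_name": "ACME_2020_10K", "company": "ACME", "doc_type": "10K"},
    {"doc_name": "GLOBEX_2021_10K", "company": "Globex", "doc_type": "10K"},
])

# Inner join: each question row gains its document's metadata
df_full = pd.merge(df_questions, df_meta, on="doc_name")
print(df_full[["question", "company", "doc_type"]])
```

The same `pd.merge(..., on="doc_name")` call works unchanged on the real JSONL files once they are loaded with `pd.read_json(..., lines=True)`.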
Highlighted Details

  • Features an open-source sample of 150 annotated examples from a comprehensive dataset of 10,231 financial questions.
  • Evaluated 16 state-of-the-art LLM configurations, including GPT-4-Turbo, Llama2, and Claude2, with and without retrieval systems.
  • Reveals significant LLM limitations: GPT-4-Turbo with retrieval incorrectly answered or refused 81% of questions in a sample.
  • Augmentation techniques like longer context windows improve performance but introduce unrealistic latency for enterprise use.
  • All tested models exhibit critical weaknesses, such as hallucinations, hindering their suitability for enterprise financial applications.
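
The 81% failure figure above combines incorrect answers with refusals. A minimal sketch of tallying such a rate from graded responses; the three-way labels and the `failure_rate` helper are hypothetical illustrations, not FinanceBench's actual grading protocol:

```python
from collections import Counter

def failure_rate(labels):
    """Fraction of graded responses that were incorrect or refused.

    Assumes each response has been labeled "correct", "incorrect",
    or "refusal" (a hypothetical grading scheme for illustration).
    """
    counts = Counter(labels)
    return (counts["incorrect"] + counts["refusal"]) / len(labels)

# Toy graded run over 10 questions
labels = ["correct", "incorrect", "refusal", "incorrect", "incorrect",
          "correct", "refusal", "incorrect", "incorrect", "incorrect"]
print(f"{failure_rate(labels):.0%}")  # prints "80%" for this toy sample
```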

Maintenance & Community

  • For inquiries regarding the full FinanceBench dataset or evaluation, contact: contact@patronus.ai.

Licensing & Compatibility

  • The repository's README does not specify a software license. Clarify licensing with the maintainers before relying on the dataset or code for commercial use or closed-source integration.

Limitations & Caveats

  • The provided repository contains only a sample (150 examples) of the full FinanceBench dataset.
  • Current LLMs demonstrate substantial limitations on the benchmark, including hallucinations, and the augmentations that help (such as longer context windows) add latency that is impractical under enterprise constraints; without further development they remain unsuitable for many critical financial QA tasks.
Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 13 stars in the last 30 days

Explore Similar Projects

Starred by Dan Guido (Cofounder of Trail of Bits), Chip Huyen (Author of "AI Engineering" and "Designing Machine Learning Systems"), and 4 more.

llm-council by karpathy

A multi-LLM collaborative framework for enhanced question answering

Top 1.9% on SourcePulse · 15k stars · Created 3 months ago · Updated 3 months ago