financebench by patronus-ai

Benchmark for financial question answering with LLMs

Created 2 years ago
256 stars

Top 98.5% on SourcePulse

Project Summary

A new benchmark suite, FinanceBench, addresses the critical need for evaluating Large Language Models (LLMs) in open-book financial question answering. It targets researchers and engineers developing AI solutions for the financial sector, offering a standardized method to assess LLM capabilities and identify current limitations, thereby guiding future development and adoption decisions.

How It Works

FinanceBench comprises 10,231 ecologically valid financial questions, complete with human-annotated answers and evidence strings, designed to establish a minimum performance standard for LLMs. The repository provides an open-source sample of 150 annotated examples, alongside two JSONL files detailing questions and document metadata. These can be loaded and joined using Python's pandas library, facilitating the evaluation of LLM performance on real-world financial queries.

Quick Start & Requirements

  • Data Loading: Use pandas to load and join the provided JSONL files:

    import pandas as pd

    # Load the open-source question sample and the document metadata
    df_questions = pd.read_json("data/financebench_open_source.jsonl", lines=True)
    df_meta = pd.read_json("data/financebench_document_information.jsonl", lines=True)

    # Join on the shared document identifier
    df_full = pd.merge(df_questions, df_meta, on="doc_name")

  • Prerequisites: Python, pandas. Financial source documents (PDFs) are located in /pdfs/, and model evaluation results are in /results/.
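The load-and-join step above can be sketched end-to-end on synthetic rows. The column names `question`, `answer`, and `doc_name` follow the open-source sample, but treat the exact schema as an assumption and inspect `df_questions.columns` on the real files; the company names and values below are toy data, not benchmark content:

```python
import pandas as pd

# Toy question rows shaped like the open-source sample
# (doc_name is the join key shared with the metadata file).
df_questions = pd.DataFrame([
    {"question": "What was ACME's FY2020 revenue?", "answer": "$1.2B",
     "doc_name": "ACME_2020_10K"},
    {"question": "What is Globex's FY2021 net margin?", "answer": "8%",
     "doc_name": "GLOBEX_2021_10K"},
])

# Toy document metadata keyed by the same doc_name
df_meta = pd.DataFrame([
    {"doc_name": "ACME_2020_10K", "company": "ACME", "doc_type": "10K"},
    {"doc_name": "GLOBEX_2021_10K", "company": "Globex", "doc_type": "10K"},
])

# Inner join: each question row gains its document's metadata
df_full = pd.merge(df_questions, df_meta, on="doc_name")
print(df_full[["question", "company", "doc_type"]])
```

The same `pd.merge(..., on="doc_name")` call works unchanged on the real JSONL files once they are loaded with `pd.read_json(..., lines=True)`.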
Highlighted Details

  • Features an open-source sample of 150 annotated examples from a comprehensive dataset of 10,231 financial questions.
  • Evaluated 16 state-of-the-art LLM configurations, including GPT-4-Turbo, Llama2, and Claude2, with and without retrieval systems.
  • Reveals significant LLM limitations: GPT-4-Turbo with retrieval incorrectly answered or refused 81% of questions in a sample.
  • Augmentation techniques like longer context windows improve performance but introduce unrealistic latency for enterprise use.
  • All tested models exhibit critical weaknesses, such as hallucinations, hindering their suitability for enterprise financial applications.
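
The 81% failure figure above combines incorrect answers with refusals. A minimal sketch of tallying such a rate from graded responses; the three-way labels and the `failure_rate` helper are hypothetical illustrations, not FinanceBench's actual grading protocol:

```python
from collections import Counter

def failure_rate(labels):
    """Fraction of graded responses that were incorrect or refused.

    Assumes each response has been labeled "correct", "incorrect",
    or "refusal" (a hypothetical grading scheme for illustration).
    """
    counts = Counter(labels)
    return (counts["incorrect"] + counts["refusal"]) / len(labels)

# Toy graded run over 10 questions
labels = ["correct", "incorrect", "refusal", "incorrect", "incorrect",
          "correct", "refusal", "incorrect", "incorrect", "incorrect"]
print(f"{failure_rate(labels):.0%}")  # prints "80%" for this toy sample
```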

Maintenance & Community

  • For inquiries regarding the full FinanceBench dataset or evaluation, contact: contact@patronus.ai.

Licensing & Compatibility

  • The repository's README does not specify a software license. Clarify licensing with the maintainers before relying on the dataset or code for commercial use or closed-source integration.

Limitations & Caveats

  • The provided repository contains only a sample (150 examples) of the full FinanceBench dataset.
  • Current LLMs demonstrate substantial limitations on the benchmark, including hallucinations, and the augmentations that help (such as longer context windows) add latency that is impractical under enterprise constraints; without further development they remain unsuitable for many critical financial QA tasks.
Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 13 stars in the last 30 days

Explore Similar Projects

Starred by Dan Guido (Cofounder of Trail of Bits), Chip Huyen (Author of "AI Engineering" and "Designing Machine Learning Systems"), and 4 more.

llm-council by karpathy

A multi-LLM collaborative framework for enhanced question answering

Top 1.9% on SourcePulse · 15k stars · Created 3 months ago · Updated 3 months ago