FinEval  by SUFE-AIFLM-Lab

Financial LLM evaluation benchmark

Created 2 years ago
254 stars

Top 99.3% on SourcePulse

Project Summary

FinEval is a pioneering Chinese benchmark dataset designed to comprehensively evaluate the professional capabilities and safety of Large Language Models (LLMs) in the specialized, risk-sensitive financial industry. It addresses the uncertainty surrounding LLM performance on complex financial tasks by providing over 26,000 questions spanning diverse task types. FinEval targets researchers and developers who need to assess and improve LLMs for financial applications, offering a robust foundation for advancing AI in finance.

How It Works

FinEval employs a rigorous construction methodology, combining quantitative methods with extensive research, summarization, and manual screening. The benchmark encompasses six key financial domains: Academic Knowledge, Industry Knowledge, Security Knowledge, Financial Agent, Financial Multimodal Capabilities, and Financial Rigor Testing. It features a variety of question formats, including multiple-choice, short-answer, reasoning, and retrieval-based tasks. Evaluation uses zero-shot, few-shot, and Chain-of-Thought prompting strategies, with results presented on text and multimodal performance leaderboards.
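The few-shot evaluation style described above can be illustrated with a small sketch that assembles a multiple-choice prompt from worked examples plus a target question. The `build_prompt` helper and the sample questions are illustrative only, not FinEval's actual evaluation code or data:

```python
# Sketch of few-shot multiple-choice prompt construction, in the style
# of FinEval's evaluation setup. All names and questions here are
# hypothetical examples, not taken from the FinEval dataset.

def build_prompt(examples, question, choices):
    """Assemble a few-shot multiple-choice prompt string."""
    parts = []
    for ex in examples:
        parts.append(f"Question: {ex['question']}")
        for label, choice in zip("ABCD", ex["choices"]):
            parts.append(f"{label}. {choice}")
        parts.append(f"Answer: {ex['answer']}\n")
    # Append the target question with an open "Answer:" slot for the model.
    parts.append(f"Question: {question}")
    for label, choice in zip("ABCD", choices):
        parts.append(f"{label}. {choice}")
    parts.append("Answer:")
    return "\n".join(parts)

examples = [{
    "question": "Which ratio measures short-term liquidity?",
    "choices": ["Current ratio", "Debt-to-equity", "P/E ratio", "ROE"],
    "answer": "A",
}]
prompt = build_prompt(
    examples,
    "What does ROE stand for?",
    ["Rate of exchange", "Return on equity",
     "Ratio of earnings", "Revenue over expenses"],
)
print(prompt)
```

A zero-shot variant is the same call with an empty `examples` list; Chain-of-Thought prompting would additionally insert a reasoning passage before each `Answer:` line.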

Quick Start & Requirements

  • Installation: Requires Python 3.8 and Conda. Installation involves creating a Conda environment, cloning the repository, and installing dependencies via pip install -r requirements.txt.
  • Dataset: Downloadable via Hugging Face or provided links (URLs not fully specified in README snippet). Requires unzipping into the code/data directory.
  • Documentation: Official website: https://fineval.readthedocs.io/zh_CN/latest/. Paper: https://arxiv.org/abs/2308.09975. Hugging Face dataset: https://huggingface.co/datasets/SUFE-AIFLM-Lab/FinEval.
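The installation steps above can be sketched as a short shell session. The repository URL is inferred from the project name and may differ; the environment name `fineval` is an arbitrary choice:

```shell
# Create and activate a Conda environment with the required Python version.
conda create -n fineval python=3.8 -y
conda activate fineval

# Clone the repository (URL assumed from the project/org names) and
# install the Python dependencies listed in the repo.
git clone https://github.com/SUFE-AIFLM-Lab/FinEval.git
cd FinEval
pip install -r requirements.txt

# After downloading the dataset archive, unzip it into code/data
# (replace the archive name with the file you actually downloaded).
# unzip FinEval-data.zip -d code/data
```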

Highlighted Details

  • Features over 26,000 diverse question types, simulating real-world financial scenarios.
  • Comprehensive coverage across six critical financial domains, including academic knowledge, industry applications, security, agent capabilities, multimodal reasoning, and output rigor.
  • Provides detailed leaderboards for text and multimodal LLM performance, enabling direct comparison of state-of-the-art models.
  • Evaluates advanced LLM capabilities such as financial text summarization, investment advice generation, API invocation, and multimodal financial data analysis.

Maintenance & Community

The provided README does not detail specific maintenance contributors, community channels (e.g., Discord, Slack), or active development roadmaps.

Licensing & Compatibility

  • License: Apache-2.0.
  • Compatibility: The Apache-2.0 license is permissive and generally compatible with commercial use and closed-source linking.

Limitations & Caveats

  • Data Availability: Only the "Financial Academic Knowledge" dataset is currently open-sourced. Data for "Financial Industry Knowledge," "Financial Security Knowledge," and "Financial Agents" requires explicit authorization for evaluation.
  • Setup: Specific download URLs for the dataset are not fully provided in the README snippet, requiring users to locate them via Hugging Face or contact the maintainers.
Health Check

  • Last Commit: 7 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 6 stars in the last 30 days
