financial-datasets by virattt

Financial dataset generator for LLM Q&A

Created 1 year ago

392 stars

Top 73.4% on SourcePulse

1 Expert Loves This Project

Didier Lopes

Founder of OpenBB

Project Summary

This Python library enables the creation of question-and-answer financial datasets from various text sources, including 10-K filings, PDFs, and general text documents, using Large Language Models (LLMs). It is designed for researchers and developers working with LLMs in the financial domain, aiming to simplify the generation of realistic, context-rich financial Q&A pairs.

How It Works

The library uses LLMs (gpt-4-turbo is the model mentioned) to process financial documents and extract relevant information. Users can provide raw text, a PDF URL, or a company ticker and year for a 10-K filing. The DatasetGenerator class then orchestrates the LLM calls to generate question-answer pairs, each with its supporting context from the source material. This automates the otherwise laborious manual creation of datasets for financial NLP tasks.
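
The flow described above can be sketched as follows. This is a minimal illustration and not the library's actual implementation: the helper names (chunk_text, generate_dataset, fake_llm), the character-based chunking, and the way the question budget is spread over chunks are all assumptions made for the example. A stub stands in for the real LLM call so the sketch runs offline.

```python
from typing import Callable, Dict, List

# Hypothetical type: takes a text chunk and a question count, returns Q&A dicts.
QAFn = Callable[[str, int], List[Dict[str, str]]]


def chunk_text(text: str, chunk_size: int = 1000) -> List[str]:
    """Split text into consecutive chunks of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def generate_dataset(texts: List[str], ask_llm: QAFn, max_questions: int = 10,
                     chunk_size: int = 1000) -> List[Dict[str, str]]:
    """Collect up to max_questions items, each with question, answer, context."""
    chunks = [c for t in texts for c in chunk_text(t, chunk_size) if c.strip()]
    items: List[Dict[str, str]] = []
    for index, chunk in enumerate(chunks):
        remaining = max_questions - len(items)
        if remaining <= 0:
            break
        # Spread the remaining budget over the chunks not yet processed.
        chunks_left = len(chunks) - index
        quota = -(-remaining // chunks_left)  # ceiling division
        for pair in ask_llm(chunk, quota)[:quota]:
            items.append({"question": pair["question"],
                          "answer": pair["answer"],
                          "context": chunk})
    return items[:max_questions]


def fake_llm(chunk: str, n: int) -> List[Dict[str, str]]:
    """Stand-in for a real LLM call (assumption: not part of the library)."""
    return [{"question": f"Question {i + 1} about: {chunk[:20]}",
             "answer": f"Answer {i + 1}"} for i in range(n)]


source_text = "Revenue grew 8% year over year. " * 50
dataset = generate_dataset([source_text], fake_llm,
                           max_questions=5, chunk_size=400)
```

Keeping the chunk as the context of every pair it produced is what lets a reader trace each answer back to the source passage.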

Quick Start & Requirements

Primary install: pip install financial-datasets
Prerequisites: OpenAI API key.
Links: Colab examples

Highlighted Details

Supports generation from raw text, PDF URLs, and 10-K filings via ticker and year.
Output format is JSON, with each entry containing a question, answer, and context.
Allows specifying which sections of a 10-K filing to process (e.g., "Item 1", "Item 7").
Lets users limit the number of generated questions (max_questions).
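
Since each output entry carries a question, an answer, and a context, a consumer might check those fields before using the data. The sketch below assumes the JSON is a top-level array of objects; the exact layout the library writes is not confirmed here, and load_items and the sample data are hypothetical.

```python
import json
from typing import Dict, List

REQUIRED_KEYS = ("question", "answer", "context")


def load_items(raw_json: str) -> List[Dict[str, str]]:
    """Parse a JSON array and keep entries whose required fields are non-empty strings."""
    data = json.loads(raw_json)
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of Q&A entries")
    valid = []
    for entry in data:
        if isinstance(entry, dict) and all(
                isinstance(entry.get(key), str) and entry[key].strip()
                for key in REQUIRED_KEYS):
            valid.append({key: entry[key] for key in REQUIRED_KEYS})
    return valid


# Illustrative sample only, not real library output.
sample = json.dumps([
    {"question": "What was net revenue?", "answer": "$10 million",
     "context": "Net revenue was $10 million in the fiscal year."},
    {"question": "Incomplete entry", "answer": ""},
])
items = load_items(sample)
```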

Maintenance & Community

The project is open-source and welcomes contributions via issues and pull requests.
No specific community channels (Discord/Slack) or notable contributors are listed in the README.

Licensing & Compatibility

License: MIT License.
Compatibility: The permissive MIT license allows commercial use and integration into closed-source projects.

Limitations & Caveats

The library relies on external LLM APIs (OpenAI's gpt-4-turbo is the one mentioned), which require an API key and incur costs. The quality and accuracy of the generated datasets depend on the LLM's performance and the clarity of the input financial documents.
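
Because accuracy depends on the LLM, a lightweight sanity check on generated entries can help catch obvious problems. The sketch below is one possible heuristic and not a feature of the library: it flags entries whose answer contains a number that does not appear in the supporting context. The function name and sample entries are hypothetical.

```python
import re
from typing import Dict

# Matches integers and simple decimal or comma-grouped figures, e.g. 8, 4.2, 1,200.
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def numbers_grounded(item: Dict[str, str]) -> bool:
    """Return True if every number in the answer also appears in the context."""
    context_numbers = set(NUMBER.findall(item["context"]))
    return all(n in context_numbers for n in NUMBER.findall(item["answer"]))


# Illustrative entries only, not real library output.
context = "Revenue was $4.2 billion, up 8%."
entries = [
    {"question": "What was revenue?", "answer": "$4.2 billion", "context": context},
    {"question": "How much did revenue grow?", "answer": "12%", "context": context},
]
flagged = [entry for entry in entries if not numbers_grounded(entry)]
```

A check like this only catches numeric mismatches; an answer with no figures passes trivially, so it complements manual review and does not replace it.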

Health Check

Last Commit

1 year ago

Responsiveness

Inactive

Star History

7 stars in the last 30 days