financial-datasets  by virattt

Financial dataset generator for LLM Q&A

created 1 year ago
356 stars

Top 79.5% on sourcepulse

Project Summary

This Python library uses large language models (LLMs) to generate financial question-and-answer datasets from text sources such as 10-K filings, PDFs, and plain text documents. It targets researchers and developers working with LLMs in the financial domain and aims to simplify the generation of realistic, context-rich Q&A pairs.

How It Works

The library uses an LLM (the README mentions gpt-4-turbo) to process financial documents and extract relevant information. Users supply raw text, a PDF URL, or a company ticker and year for a 10-K filing. The DatasetGenerator class then orchestrates the LLM calls that produce question-answer pairs, each with its supporting context from the source material. This automates the otherwise manual work of building datasets for financial NLP tasks.
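
The flow described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the library's implementation: `chunk_text`, `generate_qa_pairs`, and the `llm` callable are hypothetical names, the fixed-size chunking is an assumed strategy, and the LLM is replaced by a stub so the example runs offline without an API key.

```python
from typing import Callable, Dict, List

QAPair = Dict[str, str]


def chunk_text(text: str, chunk_size: int = 200) -> List[str]:
    """Split text into chunks of at most chunk_size characters (hypothetical helper)."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def generate_qa_pairs(
    texts: List[str],
    llm: Callable[[str], QAPair],
    max_questions: int = 10,
) -> List[QAPair]:
    """Request one Q&A pair per chunk, stopping once max_questions is reached."""
    items: List[QAPair] = []
    for text in texts:
        for chunk in chunk_text(text):
            if len(items) >= max_questions:
                return items
            pair = llm(chunk)
            # Store the source chunk alongside the pair as supporting context.
            items.append({
                "question": pair["question"],
                "answer": pair["answer"],
                "context": chunk,
            })
    return items


def stub_llm(chunk: str) -> QAPair:
    """Stand-in for a real LLM call; returns a canned pair (assumption for the demo)."""
    return {"question": "What does this passage state?", "answer": chunk[:40]}


if __name__ == "__main__":
    sample = ["Revenue was $10 million in 2023, up from $8 million in 2022."]
    for item in generate_qa_pairs(sample, stub_llm, max_questions=5):
        print(item)
```

In a real run the stub would be replaced by a function that prompts the model and parses its reply. The orchestration loop itself would stay the same.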

Quick Start & Requirements

  • Primary install: pip install financial-datasets
  • Prerequisites: OpenAI API key.
  • Links: Colab examples

Highlighted Details

  • Supports generation from raw text, PDF URLs, and 10-K filings via ticker and year.
  • Output format is JSON, with each entry containing a question, answer, and context.
  • Allows specifying which sections of a 10-K filing to process (e.g., "Item 1", "Item 7").
  • Lets users limit the number of generated questions (max_questions).
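
The entry shape described in the list above could be checked with a small validator like the one below. The field names question, answer, and context come from the description. The sample record, the `validate_entries` helper, and the assumption that the output is a plain JSON list of entries are illustrative and not confirmed by the source.

```python
import json
from typing import Any, Dict, List

REQUIRED_FIELDS = ("question", "answer", "context")

# Made-up sample data in the assumed layout: a JSON list of entries.
SAMPLE_JSON = """
[
  {
    "question": "What was total revenue in fiscal 2023?",
    "answer": "$10 million",
    "context": "Total revenue was $10 million in fiscal 2023."
  }
]
"""


def validate_entries(raw: str) -> List[Dict[str, Any]]:
    """Parse a JSON list and check that each entry has non-empty string fields."""
    entries = json.loads(raw)
    if not isinstance(entries, list):
        raise ValueError("expected a JSON list of entries")
    for i, entry in enumerate(entries):
        for field in REQUIRED_FIELDS:
            value = entry.get(field) if isinstance(entry, dict) else None
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"entry {i} is missing a non-empty '{field}'")
    return entries


if __name__ == "__main__":
    entries = validate_entries(SAMPLE_JSON)
    print(f"{len(entries)} valid entries")
```

A check like this is useful before fine-tuning or evaluation, since a malformed LLM response can otherwise slip into the dataset unnoticed.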

Maintenance & Community

  • The project is open-source and welcomes contributions via issues and pull requests.
  • No specific community channels (Discord/Slack) or notable contributors are listed in the README.

Licensing & Compatibility

  • License: MIT License.
  • Compatibility: Permissive MIT license allows for commercial use and integration into closed-source projects.

Limitations & Caveats

The library depends on external LLM APIs (the README mentions OpenAI's gpt-4-turbo), which require an API key and incur usage costs. The quality and accuracy of the generated datasets depend on the LLM's performance and on the clarity of the input financial documents.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0

Star History

  • 11 stars in the last 90 days
