AI-powered dialogue extraction for building character datasets from novels
Summary
This repository provides a tool for extracting structured dialogue datasets from novels using various AI platforms. It addresses the need for automated dataset creation for conversational AI and NLP research, enabling users to generate role-dialogue pairs from unstructured text. The primary benefit is streamlining the process of building specialized datasets from literary sources, supporting multiple AI backends for flexibility.
How It Works
The project uses a modular architecture to process novel texts. It applies token-based chunking to keep long documents within model context limits and sends each chunk to a supported AI model (DeepSeek, OpenAI, SiliconFlow, Kimi, or custom endpoints) for dialogue extraction. The core logic identifies role and dialogue fields and outputs them in JSON format. Key advantages include multi-platform AI integration, concurrent processing via multi-threading for speed, chunk tracking for source traceability, and detailed statistical analysis of the extracted data.
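The pipeline described above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the function names (chunk_tokens, extract_from_chunk, extract_all) are hypothetical, whitespace-separated words stand in for real model tokens, a regular expression stands in for the AI model call, and dialogue_index is assumed to count dialogues within a chunk.

```python
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Assumption: a toy pattern for lines like  Alice: "Hello."  replaces the model call.
DIALOGUE_PATTERN = re.compile(r'(\w+): "([^"]+)"')


def chunk_tokens(text, max_tokens):
    """Split text into chunks of at most max_tokens whitespace-separated tokens."""
    tokens = text.split()
    return [
        {"chunk_id": start // max_tokens,
         "text": " ".join(tokens[start:start + max_tokens])}
        for start in range(0, len(tokens), max_tokens)
    ]


def extract_from_chunk(chunk):
    """Stand-in for the AI call: return role/dialogue records for one chunk."""
    matches = DIALOGUE_PATTERN.findall(chunk["text"])
    return [
        {"role": role, "dialogue": dialogue,
         "chunk_id": chunk["chunk_id"], "dialogue_index": index}
        for index, (role, dialogue) in enumerate(matches)
    ]


def extract_all(text, max_tokens=6, workers=4):
    """Chunk the text, process chunks on a thread pool, and flatten the results."""
    chunks = chunk_tokens(text, max_tokens)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in chunk order, so output order is stable.
        per_chunk = list(pool.map(extract_from_chunk, chunks))
    return [record for records in per_chunk for record in records]


SAMPLE = 'Alice: "Hello there." Bob: "Hi, Alice." Alice: "Shall we go?"'

if __name__ == "__main__":
    records = extract_all(SAMPLE)
    for record in records:
        print(record)
    # Simple per-role statistics over the extracted records.
    print(Counter(record["role"] for record in records))
```

A real implementation would count tokens with the target model's tokenizer and would need to handle dialogue that straddles a chunk boundary, which this sketch ignores.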
Quick Start & Requirements
Clone the repository: git clone https://github.com/KMnO4-zx/extract-dialogue.git and cd extract-dialogue.
Install dependencies: pip install -r requirements.txt.
Configure: copy .env.example to .env and set LLM_PLATFORM and relevant API keys (e.g., DEEPSEEK_API, OPENAI_API_KEY).
Run: python dialogue_extractor.py <your_novel.txt> --stats.
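Before running the extractor, it can help to confirm that the .env file actually defines the settings named above. The following is a hedged pre-flight sketch, not part of the project: parse_env and missing_settings are hypothetical helpers, the parser handles only simple KEY=VALUE lines, and the value shown for LLM_PLATFORM is a placeholder because the accepted platform names are not stated here.

```python
def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_settings(values, required=("LLM_PLATFORM",)):
    """Return the required names that are absent or empty."""
    return [name for name in required if not values.get(name)]


# Placeholder content; real values come from your own .env file.
EXAMPLE_ENV = """
# platform selection
LLM_PLATFORM=<platform-name>
DEEPSEEK_API=
"""

if __name__ == "__main__":
    settings = parse_env(EXAMPLE_ENV)
    print(missing_settings(settings, ("LLM_PLATFORM", "DEEPSEEK_API")))
```

To check a real file, read it with pathlib.Path(".env").read_text() and pass the result to parse_env.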
Highlighted Details
Chunk tracking: chunk_id and dialogue_index pinpoint where each dialogue originates in the text.
Maintenance & Community
The project encourages contributions via Issues and Pull Requests, with guidelines provided for code style, testing, and documentation. The primary community and issue tracking hub is the GitHub repository.
Licensing & Compatibility
The project is released under the permissive MIT License, allowing for broad use, modification, and distribution, including in commercial applications.
Limitations & Caveats
Extraction accuracy may require tuning of AI model parameters (e.g., temperature) or selection of specific AI platforms known for long-text processing. The effectiveness can vary based on the novel's writing style and complexity. Functionality is dependent on the availability and cost of external AI service APIs.