extract-dialogue by KMnO4-zx

AI-powered dialogue extraction for building character datasets from novels

Created 2 years ago

341 stars

Top 81.4% on SourcePulse

Project Summary

Summary

This repository provides a tool for extracting structured dialogue datasets from novels using various AI platforms. It addresses the need for automated dataset creation for conversational AI and NLP research, enabling users to generate role-dialogue pairs from unstructured text. The primary benefit is streamlining the process of building specialized datasets from literary sources, supporting multiple AI backends for flexibility.

How It Works

The project leverages a modular architecture to process novel texts. It employs token-based chunking to manage context for long documents and sends these chunks to supported AI models (DeepSeek, OpenAI, SiliconFlow, Kimi, or custom endpoints) for dialogue extraction. The core logic identifies role and dialogue fields, outputting them in a JSON format. Key advantages include multi-platform AI integration, concurrent processing via multi-threading for speed, and features like chunk tracking for source traceability and detailed statistical analysis of the extracted data.

Quick Start & Requirements

Clone the repository: git clone https://github.com/KMnO4-zx/extract-dialogue.git and cd extract-dialogue.
Install dependencies: pip install -r requirements.txt.
Configure API keys: Copy env.example to .env and set LLM_PLATFORM and relevant API keys (e.g., DEEPSEEK_API, OPENAI_API_KEY).
Run extraction: python dialogue_extractor.py <your_novel.txt> --stats.
- Prerequisites: Python, pip, and API keys for chosen AI services.
- Links: Project page: https://github.com/KMnO4-zx/extract-dialogue

Highlighted Details

Multi-Platform AI Support: Integrates with DeepSeek, OpenAI, SiliconFlow, Kimi, and custom OpenAI-compatible API endpoints.
Concurrent Processing: Enhances speed through multi-threading, with configurable thread counts (default 8).
Chunk Management: Outputs include chunk_id and dialogue_index for precise tracking of dialogue origins within the text.
Detailed Statistics: Generates comprehensive analytics on dialogue counts, role distribution, average dialogue length, and processing metrics.
Flexible Output: Options to include/exclude chunk IDs, sort output, and generate legacy formats.

Maintenance & Community

The project encourages contributions via Issues and Pull Requests, with guidelines provided for code style, testing, and documentation. The primary community and issue tracking hub is the GitHub repository.

Licensing & Compatibility

The project is released under the permissive MIT License, allowing for broad use, modification, and distribution, including in commercial applications.

Limitations & Caveats

Extraction accuracy may require tuning of AI model parameters (e.g., temperature) or selection of specific AI platforms known for long-text processing. The effectiveness can vary based on the novel's writing style and complexity. Functionality is dependent on the availability and cost of external AI service APIs.

Health Check

Last Commit

6 months ago

Responsiveness

Inactive

Pull Requests (30d)

Issues (30d)

Star History

7 stars in the last 30 days