CLI tool for processing large text files with GPT models
This project provides a Python script that processes large text files with OpenAI's GPT models (GPT-3.5 and GPT-4) by splitting them into manageable chunks. It is aimed at tasks such as translation, information extraction, and summarization on documents that exceed the models' token limits.
How It Works
The script divides the input text into chunks based on a specified chunk size and token limit, then sends those chunks to the OpenAI API in parallel using multithreading. Responses are collected and consolidated into a single output file. Rate limits are handled with exponential backoff plus jitter, and each chunk is retried up to a default limit of three failures.
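A minimal sketch of this flow, assuming the OpenAI v1 Python client; the function names (split_into_chunks, process_chunk) and the flag-to-parameter mapping are illustrative, not taken from the repository's actual code:

# Illustrative sketch of the chunk-and-fan-out approach described above.
# Not the repository's code; names and parameter choices are assumptions.
import os
import random
import time
from concurrent.futures import ThreadPoolExecutor

import tiktoken
from openai import OpenAI, RateLimitError

client = OpenAI(api_key=os.environ["OPENAI_KEY"])
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

def split_into_chunks(text: str, chunk_size: int) -> list[str]:
    # Split the text into pieces of at most chunk_size tokens.
    tokens = enc.encode(text)
    return [enc.decode(tokens[i:i + chunk_size])
            for i in range(0, len(tokens), chunk_size)]

def process_chunk(chunk: str, prompt: str, max_retries: int = 3) -> str:
    # Send one chunk to the API, backing off exponentially with jitter
    # on rate-limit errors, up to the default limit of three failures.
    for attempt in range(max_retries):
        try:
            resp = client.chat.completions.create(
                model="gpt-3.5-turbo",
                messages=[{"role": "user", "content": f"{prompt}\n\n{chunk}"}],
                temperature=0.5,
                max_tokens=200,
            )
            return resp.choices[0].message.content
        except RateLimitError:
            time.sleep(2 ** attempt + random.random())
    raise RuntimeError("chunk failed after retries")

chunks = split_into_chunks(open("input.txt").read(), chunk_size=500)
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(
        lambda c: process_chunk(c, "Translate English to French:"), chunks))
open("output.txt", "w").write("\n".join(results))

Because pool.map preserves input order, the consolidated output stays aligned with the original chunk order even though the API calls run concurrently.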
Quick Start & Requirements
pip install openai tiktoken tqdm
Requires an OpenAI API key, set via the OPENAI_KEY environment variable. The tool is driven entirely from the command line:

python main.py -i INPUT_FILE -o OUTPUT_FILE -l LOG_FILE -m MODEL -c CHUNKSIZE -t TOKENS -v TEMPERATURE -p PROMPT

For example:
python main.py -i input.txt -o output.txt -l log.txt -m 'gpt-3.5-turbo' -c 500 -t 200 -v 0.5 -p 'Translate English to French:'
Highlighted Details
The model, chunk size, token limit, temperature, and prompt are all configurable via command-line flags; progress is displayed with tqdm, and each run writes a log to the file passed with -l.
Maintenance & Community
The project is a personal repository by andrewgcodes. No specific community channels or roadmap are detailed in the README, the last update was roughly two years ago, and the repository appears inactive.
Licensing & Compatibility
No license or compatibility information is detailed in the README.
Limitations & Caveats
The script retries each failed API call up to three times by default, which may be insufficient under sustained rate limiting or high error rates. Chunk size should be kept under 4000 tokens to avoid OpenAI API errors, and the prompt's length counts toward that same limit.
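For instance, the combined budget can be checked up front with tiktoken (already a dependency); fits_token_budget is a hypothetical helper, and 4000 is the ceiling cited above:

# Sketch of a pre-flight check that the prompt plus chunk fits the limit.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

def fits_token_budget(prompt: str, chunk: str, limit: int = 4000) -> bool:
    # The prompt counts toward the same limit as the chunk itself.
    return len(enc.encode(prompt)) + len(enc.encode(chunk)) <= limit

assert fits_token_budget("Translate English to French:", "Bonjour tout le monde.")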