lightspeedGPT by andrewgcodes

CLI tool for processing large text files with GPT models

Created 2 years ago · 269 stars · Top 96.2% on sourcepulse

Project Summary

This project provides a Python script that processes large text files with OpenAI's GPT models (GPT-3.5 and GPT-4) by splitting them into manageable chunks. It targets tasks such as translation, information extraction, and summarization on documents that exceed a model's context window.

How It Works

The script divides the input text into smaller segments based on a specified chunk size and token limit. It then sends these chunks to the OpenAI API concurrently using multithreading, allowing for parallel processing. Responses are collected and consolidated into an output file. The implementation includes exponential backoff with jitter for handling OpenAI rate limits, with a default retry limit of three failures.
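
As a rough sketch of that pipeline (the function names here are illustrative, not taken from the repository), chunking by token count and fanning the chunks out over a thread pool might look like this in Python:

    import concurrent.futures
    import tiktoken

    def split_into_chunks(text, chunk_size, model="gpt-3.5-turbo"):
        """Yield pieces of text that are at most chunk_size tokens long."""
        enc = tiktoken.encoding_for_model(model)
        tokens = enc.encode(text)
        for start in range(0, len(tokens), chunk_size):
            yield enc.decode(tokens[start:start + chunk_size])

    def process_all(text, chunk_size, call_api, max_workers=8):
        """Apply call_api to every chunk in parallel; map() preserves input order."""
        chunks = list(split_into_chunks(text, chunk_size))
        with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
            return list(pool.map(call_api, chunks))

Threads suit this workload because each request spends most of its time waiting on the network rather than on the CPU.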

Quick Start & Requirements

  • Install: pip install openai tiktoken tqdm
  • Prerequisites: Python 3.6+; an OpenAI API key set as the OPENAI_KEY environment variable; the script is run from the command line.
  • Usage: python main.py -i INPUT_FILE -o OUTPUT_FILE -l LOG_FILE -m MODEL -c CHUNKSIZE -t TOKENS -v TEMPERATURE -p PROMPT
  • Example: python main.py -i input.txt -o output.txt -l log.txt -m 'gpt-3.5-turbo' -c 500 -t 200 -v 0.5 -p 'Translate English to French:'
  • Documentation: GitHub Repository

Highlighted Details

  • Processes arbitrarily large inputs by splitting them into chunks.
  • Utilizes multithreading for parallel API calls.
  • Implements exponential backoff with jitter for rate-limit resilience (sketched after this list).
  • Supports GPT-3.5 and GPT-4 models.
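
A sketch of the backoff pattern named above, assuming the pre-1.0 openai Python package that was current when the project was written (call_with_backoff is an illustrative name, not one from the repo):

    import random
    import time
    import openai  # pre-1.0 SDK assumed; newer versions expose different error classes

    def call_with_backoff(messages, model="gpt-3.5-turbo", max_retries=3):
        """Retry rate-limited calls, doubling the wait and adding random jitter."""
        delay = 1.0
        for attempt in range(max_retries):
            try:
                return openai.ChatCompletion.create(model=model, messages=messages)
            except openai.error.RateLimitError:
                if attempt == max_retries - 1:
                    raise  # give up after the third failure, matching the stated default
                time.sleep(delay + random.uniform(0, delay))  # jitter de-synchronizes threads
                delay *= 2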

Maintenance & Community

The project is a personal repository by andrewgcodes. No specific community channels or roadmap are detailed in the README.

Licensing & Compatibility

  • License: MIT
  • Compatibility: Permissive MIT license allows for commercial use and integration into closed-source projects.

Limitations & Caveats

The script retries each failed API call up to three times by default, which may be insufficient under sustained rate limiting. Keep the maximum chunk size under 4000 tokens to avoid OpenAI API errors, and remember that the prompt's tokens count toward the same limit.
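
One quick way to sanity-check a chunk size before a run, using the same tiktoken library the script depends on (the 4000 figure comes from the caveat above, not from a value read out of the code):

    import tiktoken

    enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
    prompt = "Translate English to French:"
    chunk = open("input.txt", encoding="utf-8").read()[:2000]  # stand-in for one chunk
    used = len(enc.encode(prompt)) + len(enc.encode(chunk))
    assert used < 4000, f"{used} tokens exceeds the 4000-token budget"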

Health Check

  • Last commit: 2 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 3 stars in the last 90 days
