CLI tool for processing large text files with GPT models
This project provides a Python script that processes large text files with OpenAI's GPT models (GPT-3.5 and GPT-4) by splitting them into manageable chunks. It is aimed at tasks such as translation, information extraction, and summarization on documents that exceed the models' token limits.
How It Works
The script divides the input text into chunks based on a specified chunk size and token limit, then sends those chunks to the OpenAI API in parallel using multithreading. Responses are collected and consolidated into a single output file. Rate limits are handled with exponential backoff plus jitter, and each chunk is retried up to a default limit of three failures.
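A minimal sketch of this flow, assuming the OpenAI v1 Python client; the function names (split_into_chunks, process_chunk) and the flag-to-parameter mapping are illustrative, not taken from the repository's actual code:

# Illustrative sketch of the chunk-and-fan-out approach described above.
# Not the repository's code; names and parameter choices are assumptions.
import os
import random
import time
from concurrent.futures import ThreadPoolExecutor

import tiktoken
from openai import OpenAI, RateLimitError

client = OpenAI(api_key=os.environ["OPENAI_KEY"])
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

def split_into_chunks(text: str, chunk_size: int) -> list[str]:
    # Split the text into pieces of at most chunk_size tokens.
    tokens = enc.encode(text)
    return [enc.decode(tokens[i:i + chunk_size])
            for i in range(0, len(tokens), chunk_size)]

def process_chunk(chunk: str, prompt: str, max_retries: int = 3) -> str:
    # Send one chunk to the API, backing off exponentially with jitter
    # on rate-limit errors, up to the default limit of three failures.
    for attempt in range(max_retries):
        try:
            resp = client.chat.completions.create(
                model="gpt-3.5-turbo",
                messages=[{"role": "user", "content": f"{prompt}\n\n{chunk}"}],
                temperature=0.5,
                max_tokens=200,
            )
            return resp.choices[0].message.content
        except RateLimitError:
            time.sleep(2 ** attempt + random.random())
    raise RuntimeError("chunk failed after retries")

chunks = split_into_chunks(open("input.txt").read(), chunk_size=500)
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(
        lambda c: process_chunk(c, "Translate English to French:"), chunks))
open("output.txt", "w").write("\n".join(results))

Because pool.map preserves input order, the consolidated output stays aligned with the original chunk order even though the API calls run concurrently.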
Quick Start & Requirements
pip install openai tiktoken tqdm
Requires an OpenAI API key, set via the OPENAI_KEY environment variable. The tool is driven entirely from the command line:

python main.py -i INPUT_FILE -o OUTPUT_FILE -l LOG_FILE -m MODEL -c CHUNKSIZE -t TOKENS -v TEMPERATURE -p PROMPT

For example:
python main.py -i input.txt -o output.txt -l log.txt -m 'gpt-3.5-turbo' -c 500 -t 200 -v 0.5 -p 'Translate English to French:'
Highlighted Details
The model, chunk size, token limit, temperature, and prompt are all configurable via command-line flags; progress is displayed with tqdm, and each run writes a log to the file passed with -l.
Maintenance & Community
The project is a personal repository by andrewgcodes. No specific community channels or roadmap are detailed in the README, the last update was roughly two years ago, and the repository appears inactive.
Licensing & Compatibility
No license or compatibility information is detailed in the README.
Limitations & Caveats
The script retries each failed API call up to three times by default, which may be insufficient under sustained rate limiting or high error rates. Chunk size should be kept under 4000 tokens to avoid OpenAI API errors, and the prompt's length counts toward that same limit.
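For instance, the combined budget can be checked up front with tiktoken (already a dependency); fits_token_budget is a hypothetical helper, and 4000 is the ceiling cited above:

# Sketch of a pre-flight check that the prompt plus chunk fits the limit.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

def fits_token_budget(prompt: str, chunk: str, limit: int = 4000) -> bool:
    # The prompt counts toward the same limit as the chunk itself.
    return len(enc.encode(prompt)) + len(enc.encode(chunk)) <= limit

assert fits_token_budget("Translate English to French:", "Bonjour tout le monde.")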