wilpel — Caveman Compression: semantic compression for LLM contexts
Summary
Caveman Compression provides semantically lossless compression for Large Language Model (LLM) contexts, aimed at engineers and researchers. It strips predictable grammar while preserving factual content, significantly reducing token counts so that more information fits within an LLM's context window.
How It Works
The core approach leverages LLMs' ability to reconstruct predictable linguistic elements. By removing grammatical function words ("a", "the", "is"), connectives ("therefore", "however"), and filler, the method retains the unpredictable parts: facts, numbers, names, and technical terms. Because it eliminates only what an LLM can reliably infer, it achieves substantial token reduction (up to 58%) while preserving meaning and enabling denser context. Three methods are offered: LLM-based for maximum savings, MLM-based for offline predictability-aware compression, and NLP-based for free, offline, multilingual rule-based compression.
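The rule-based idea can be sketched in a few lines of Python. This is a hypothetical illustration of dropping predictable function words, not the project's actual implementation (which, per the README, uses spaCy or model-based predictability scoring rather than a fixed stopword list):

```python
# Sketch of the "strip predictable words" idea. The PREDICTABLE set and
# caveman_compress() are illustrative names, not part of the project's API.
PREDICTABLE = {
    "a", "an", "the", "is", "are", "was", "were",
    "therefore", "however", "thus", "of", "to", "that",
}

def caveman_compress(text: str) -> str:
    """Remove grammar and connective words an LLM can re-infer,
    keeping content-bearing tokens (facts, numbers, names, terms)."""
    kept = [
        tok for tok in text.split()
        if tok.lower().strip(".,;:") not in PREDICTABLE
    ]
    return " ".join(kept)

print(caveman_compress(
    "The server is restarted, therefore the cache was cleared."
))
# → "server restarted, cache cleared."
```

The compressed string remains reconstructable by an LLM because only high-predictability tokens were removed; the real NLP-based method makes this decision per-language using part-of-speech information instead of a hard-coded list.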
Quick Start & Requirements
Python 3.8+ is required. Installation involves creating a virtual environment and installing dependencies via pip install -r requirements.txt (LLM), requirements-nlp.txt (NLP), or requirements-mlm.txt (MLM). The LLM-based method needs an OpenAI API key configured in a .env file; the NLP and MLM methods need a spaCy language model (e.g., en_core_web_sm). The README links to Quick Start, Examples, Benchmarks, and Spec.
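Put together, the setup described above looks roughly like this; the requirements filenames and .env key match the README, but verify them against the repository before use:

```shell
# Create and activate a virtual environment (Python 3.8+)
python -m venv .venv
source .venv/bin/activate

# Install dependencies for the method you want
pip install -r requirements.txt        # LLM-based (maximum savings)
# pip install -r requirements-nlp.txt  # NLP-based (free, offline)
# pip install -r requirements-mlm.txt  # MLM-based (offline, ~500MB model)

# NLP/MLM methods: download a spaCy language model
python -m spacy download en_core_web_sm

# LLM method only: provide an OpenAI API key via .env
echo "OPENAI_API_KEY=your-key-here" > .env
```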
Maintenance & Community
The project is authored by William Peltomäki. No specific details regarding active maintenance, community channels (like Discord/Slack), or notable contributors are present in the README.
Licensing & Compatibility
The project is released under the MIT license. This permissive license allows for broad compatibility, including commercial use and integration into closed-source projects without significant restrictions.
Limitations & Caveats
Caveman Compression is not suitable for user-facing content, marketing copy, legal documents, or emotional communication. The LLM-based method incurs API costs, while the MLM-based method requires downloading a ~500MB model. The NLP-based method offers lower compression rates compared to the other two.
Last activity: 4 months ago (marked Inactive on SourcePulse).