ml-clara by Apple

Bridging retrieval and generation for efficient RAG

Created 2 months ago
912 stars

Top 39.9% on SourcePulse

Project Summary

CLaRa addresses limitations in Retrieval-Augmented Generation (RAG) by unifying retrieval and generation optimization through continuous latent reasoning and efficient document compression. It targets researchers and engineers seeking to improve RAG efficiency and semantic preservation, offering significant compression rates (32x-64x) while maintaining high performance on question-answering tasks.
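To make the compression idea concrete, here is a minimal PyTorch sketch of pooling a document's token embeddings into a 32x smaller set of continuous latent vectors. It is an illustration only, not the CLaRa compressor; the LatentCompressor class, dimensions, and chunk-mean-plus-cross-attention scheme are assumptions.

    import torch
    import torch.nn as nn

    class LatentCompressor(nn.Module):
        # Illustrative only: compress T token embeddings into T // ratio
        # continuous latent vectors via chunk-mean queries + cross-attention.
        def __init__(self, d_model: int = 768, ratio: int = 32, n_heads: int = 8):
            super().__init__()
            self.ratio = ratio
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

        def forward(self, token_embs: torch.Tensor) -> torch.Tensor:
            # token_embs: (batch, T, d_model) -> latents: (batch, T // ratio, d_model)
            batch, seq_len, d_model = token_embs.shape
            n_latents = max(1, seq_len // self.ratio)
            # Mean-pool each chunk of `ratio` tokens to form latent queries,
            # then let each query attend over the full document.
            chunks = token_embs[:, : n_latents * self.ratio, :]
            queries = chunks.reshape(batch, n_latents, self.ratio, d_model).mean(dim=2)
            latents, _ = self.attn(queries, token_embs, token_embs)
            return latents

    docs = torch.randn(2, 512, 768)              # two documents of 512 token embeddings
    latents = LatentCompressor(ratio=32)(docs)
    print(latents.shape)                          # torch.Size([2, 16, 768]) -- 32x fewer vectors

A generator that consumes the 16 latent vectors instead of the original 512 token embeddings sees a 32x shorter context, which is the efficiency gain the project targets.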

How It Works

CLaRa employs a novel three-stage training approach to overcome disjoint optimization and semantic bias in compressed representations. Stage 1 (Compression Pretraining) uses a Salient Compressor Pretraining (SCP) framework with QA-based supervision to retain key semantics. Stage 2 (Compression Instruction Tuning) fine-tunes the compressor on instruction-following tasks. Stage 3 (End-to-End Fine-tuning) jointly trains a reranker and generator in a shared continuous space using a differentiable top-k estimator, unifying retrieval and generation optimization to avoid redundant encoding.
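The differentiable top-k step can be approximated in several ways; the snippet below uses a straight-through softmax relaxation purely to illustrate how a generation loss can backpropagate through a hard document selection into the reranker. The differentiable_topk function, the toy loss, and all shapes are assumptions, not CLaRa's actual estimator.

    import torch
    import torch.nn.functional as F

    def differentiable_topk(scores: torch.Tensor, k: int, tau: float = 1.0) -> torch.Tensor:
        # Straight-through relaxation (illustrative, not CLaRa's exact estimator):
        # forward pass uses a hard {0,1} top-k mask, backward pass uses softmax gradients.
        soft = F.softmax(scores / tau, dim=-1)
        topk_idx = scores.topk(k, dim=-1).indices
        hard = torch.zeros_like(scores).scatter_(-1, topk_idx, 1.0)
        return hard + soft - soft.detach()

    # Toy joint step: reranker scores select k compressed documents and the
    # generator-side loss backpropagates through the relaxed mask into the reranker.
    scores = torch.randn(1, 8, requires_grad=True)          # reranker scores for 8 candidate docs
    doc_latents = torch.randn(1, 8, 16)                     # compressed document representations
    mask = differentiable_topk(scores, k=2)
    context = (mask.unsqueeze(-1) * doc_latents).sum(dim=1) # fused context fed to the generator
    loss = context.pow(2).mean()                            # stand-in for the generation loss
    loss.backward()
    print(scores.grad)                                      # non-zero: the reranker receives gradient

Because selection and generation share the same continuous document representations, a single backward pass updates both components, which is what "unifying retrieval and generation optimization" refers to.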

Quick Start & Requirements

Setup involves cloning the repository, creating a conda environment (python=3.10), and installing dependencies (pip install -r requirements.txt). Key requirements include PyTorch >= 2.0, Transformers >= 4.20, DeepSpeed >= 0.18, Flash Attention 2, and Accelerate. Data must be prepared in JSONL format for each stage. Training is initiated via shell scripts (scripts/train_pretraining.sh, scripts/train_instruction_tuning.sh, scripts/train_stage_end_to_end.sh). A video instruction guide is available: https://youtu.be/al2VoAKn8GU.
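A condensed sketch of that setup flow is shown below; the repository URL and the conda environment name are assumptions, while the script names come from the description above.

    # Sketch of the setup flow described above; repo URL and env name are assumptions.
    git clone https://github.com/apple/ml-clara.git
    cd ml-clara

    conda create -n clara python=3.10 -y
    conda activate clara
    pip install -r requirements.txt   # PyTorch, Transformers, DeepSpeed, Flash Attention 2, Accelerate

    # Prepare JSONL data for each stage, then launch the stages in order:
    bash scripts/train_pretraining.sh
    bash scripts/train_instruction_tuning.sh
    bash scripts/train_stage_end_to_end.sh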

Highlighted Details

  • Achieves 32x-64x document compression while preserving semantic information for accurate answer generation.
  • Outperforms established RAG compression baselines (PISCO, LLMLingua-2) on multiple QA benchmarks.
  • Unifies retrieval and generation optimization in a shared continuous latent space, mitigating redundant encoding and disjoint training objectives.

Maintenance & Community

The implementation is built on the OpenRLHF framework. Models are available on Hugging Face, though no direct link is provided. No community channels (Discord/Slack) or roadmap are documented.

Licensing & Compatibility

The license is not specified in the README, which is a potential blocker for commercial adoption or integration.

Limitations & Caveats

This is a research-oriented project; production readiness is not explicitly stated. The lack of clear licensing information is a significant adoption hurdle. Multi-stage training and specific dependencies (e.g., Flash Attention 2) may complicate setup.

Health Check

  • Last Commit: 3 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 6
  • Star History: 473 stars in the last 30 days
