Unified toolkit for document retrieval across modalities, languages, and scale
Tevatron is a unified toolkit for building and deploying neural document retrieval systems, supporting large-scale, multilingual, and multimodal data. It enables researchers and practitioners to efficiently train and fine-tune dense retrievers using parameter-efficient methods like LoRA, and integrates with libraries such as DeepSpeed, vLLM, and FlashAttention for optimized performance.
How It Works
Tevatron builds on libraries such as DeepSpeed, vLLM, and FlashAttention for efficient large-scale training and inference on GPUs and TPUs. It supports parameter-efficient fine-tuning (PEFT) via LoRA, allowing adaptation of large language models to retrieval tasks at reduced computational cost. The toolkit handles data preparation, model encoding, and similarity search, offering flexibility for both textual and multimodal retrieval scenarios.
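The sketch below is a minimal illustration of these pieces, not Tevatron's actual API: it wraps a Hugging Face encoder with a PEFT LoRA adapter, mean-pools token embeddings into normalized vectors, and runs a FAISS inner-product search. The backbone name, pooling strategy, and hyperparameters are illustrative assumptions.

```python
# Minimal sketch (illustrative, not Tevatron's API): a LoRA-adapted encoder
# plus FAISS similarity search. Names and hyperparameters are assumptions.
import faiss
import torch
import torch.nn.functional as F
from peft import LoraConfig, get_peft_model
from transformers import AutoModel, AutoTokenizer

model_name = "bert-base-uncased"  # stand-in backbone for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
base = AutoModel.from_pretrained(model_name)

# LoRA: train only low-rank adapters on the attention projections.
config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.1,
                    target_modules=["query", "value"])  # BERT module names
model = get_peft_model(base, config)

@torch.no_grad()
def encode(texts):
    """Mean-pool token embeddings into one normalized vector per text."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state              # [batch, seq, dim]
    mask = batch["attention_mask"].unsqueeze(-1).float()   # mask out padding
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
    return F.normalize(pooled, dim=-1).numpy()

corpus = ["Tevatron trains dense retrievers.",
          "FAISS performs efficient similarity search."]
doc_emb = encode(corpus)
index = faiss.IndexFlatIP(doc_emb.shape[1])  # inner product = cosine on unit vectors
index.add(doc_emb)

scores, ids = index.search(encode(["how do I train a dense retriever?"]), 2)
print(ids[0], scores[0])  # ranked corpus indices with similarity scores
```

Because only the adapter weights receive gradients, the same pattern scales to multi-billion-parameter backbones where full fine-tuning would be impractical.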
Quick Start & Requirements
For the PyTorch backend, install the dependencies and then the toolkit itself:

```bash
pip install transformers datasets peft deepspeed accelerate faiss-cpu
pip install -e .
```

For the JAX backend, run `pip install transformers datasets flax optax`, followed by cloning and installing magix and GradCache, then `pip install -e .`. A jax-toolbox Docker image can be used for a pre-configured JAX environment.

Requirements: transformers, datasets, peft, deepspeed, accelerate, faiss-cpu (for PyTorch); flax, optax, magix, GradCache (for JAX).

Data is expected in jsonl format: training files pair a query with positive/negative documents, and corpus files contain a docid and text; image fields are optional. Prepared datasets are available on Hugging Face (e.g., Tevatron/msmarco-passage-aug).
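For concreteness, a single training line and a single corpus line could look like the following; the exact field names (e.g., positive_passages) are assumptions and may differ across versions. A training example:

```json
{"query_id": "q0", "query": "what is dense retrieval", "positive_passages": [{"docid": "d1", "text": "Dense retrieval encodes queries and documents into vectors."}], "negative_passages": [{"docid": "d9", "text": "Sparse retrieval relies on exact term matching."}]}
```

And a corpus entry:

```json
{"docid": "d1", "text": "Dense retrieval encodes queries and documents into vectors."}
```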
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The previous version of the toolkit remains available on the v1 branch.