tevatron by texttron

Unified toolkit for document retrieval across modalities, languages, and scale

created 3 years ago
676 stars

Top 51.0% on sourcepulse

View on GitHub
Project Summary

Tevatron is a unified toolkit for building and deploying neural document retrieval systems, supporting large-scale, multilingual, and multimodal data. It enables researchers and practitioners to efficiently train and fine-tune dense retrievers using parameter-efficient methods like LoRA, integrating with libraries such as DeepSpeed, vLLM, and FlashAttention for optimized performance.

How It Works

Tevatron leverages DeepSpeed, FlashAttention, and vLLM for efficient large-scale training and inference on GPUs and TPUs. It supports parameter-efficient fine-tuning (PEFT) via LoRA, allowing large language models to be adapted to retrieval tasks at reduced computational cost. The toolkit handles data preparation, model encoding, and similarity search, offering flexibility for both textual and multimodal retrieval scenarios.
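As a rough illustration of the PEFT approach, the sketch below wraps a Hugging Face model with a LoRA adapter using the peft library. The model name and LoRA hyperparameters are illustrative assumptions, not Tevatron's defaults or its actual training entry point.

```python
# Minimal sketch: attach a LoRA adapter to a base model with peft.
# Model name and hyperparameters are illustrative, not Tevatron's own config.
from transformers import AutoModel
from peft import LoraConfig, get_peft_model

base = AutoModel.from_pretrained("mistralai/Mistral-7B-v0.1")
lora_config = LoraConfig(
    r=16,                # low-rank dimension of the adapter
    lora_alpha=32,       # scaling factor applied to the adapter output
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```

Because only the adapter weights receive gradients, the memory and compute cost of fine-tuning a 7B-parameter retriever drops substantially compared with full fine-tuning.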

Quick Start & Requirements

  • PyTorch (GPU): pip install transformers datasets peft deepspeed accelerate faiss-cpu && pip install -e .
  • JAX (TPU): Requires JAX, magix, and GradCache. Run pip install transformers datasets flax optax, clone and install magix and GradCache, then install the package with pip install -e .
  • JAX (GPU): Recommended to use NVIDIA's jax-toolbox Docker image.
  • Dependencies: PyTorch or JAX, HuggingFace transformers, datasets, peft, deepspeed, accelerate, faiss-cpu (for PyTorch), flax, optax, magix, GradCache (for JAX).
  • Resources: Training a Mistral-7B model on MSMARCO passage dataset with LoRA takes ~70 hours on 4xA6000 GPUs or ~110 hours on 1xA100 GPU. TPU training is faster (~35 hours on v4-8 TPU).
  • Data Format: jsonl for training (query plus positive/negative documents) and for the corpus (docid, text); image fields are optional. See the example after this list.
  • Datasets: Integrates with HuggingFace datasets (e.g., Tevatron/msmarco-passage-aug).
  • Docs: Tevatron
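The snippet below sketches the jsonl layout described above. The exact field names (query, positive_passages, negative_passages, docid, text) are assumptions based on that description, not a verified schema.

```python
# Illustrative sketch of the training and corpus jsonl records; field names are assumed.
import json

train_example = {
    "query": "what is dense retrieval?",
    "positive_passages": [
        {"docid": "d1", "text": "Dense retrieval encodes queries and documents into vectors ..."}
    ],
    "negative_passages": [
        {"docid": "d7", "text": "An unrelated passage used as a hard negative ..."}
    ],
}
corpus_entry = {"docid": "d1", "text": "Dense retrieval encodes queries and documents into vectors ..."}

with open("train.jsonl", "w") as f:
    f.write(json.dumps(train_example) + "\n")
with open("corpus.jsonl", "w") as f:
    f.write(json.dumps(corpus_entry) + "\n")
```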

Highlighted Details

  • Supports training billion-scale LLM neural retrievers on GPUs and TPUs.
  • Integrates with vLLM, DeepSpeed, FlashAttention, and gradient accumulation for efficient training and inference.
  • Provides self-contained HuggingFace datasets for multimodal and multilingual retrieval.
  • Directly loads and fine-tunes state-of-the-art embedding models such as BGE-Embedding and Instruct-E5 (see the encode-and-search sketch below).
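To make the encode-then-search flow concrete, the hedged sketch below embeds a query and a few documents with an off-the-shelf BGE model and searches them with faiss-cpu. It mirrors the workflow described above but is not Tevatron's own encoding or retrieval CLI.

```python
# Hedged sketch: encode with a Hugging Face embedding model, search with FAISS.
import faiss
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-base-en-v1.5")
model = AutoModel.from_pretrained("BAAI/bge-base-en-v1.5").eval()

def encode(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**batch)
    emb = out.last_hidden_state[:, 0]  # CLS pooling, as BGE recommends
    return torch.nn.functional.normalize(emb, dim=-1).numpy()

docs = ["Tevatron trains dense retrievers.", "FAISS performs vector similarity search."]
doc_emb = encode(docs)

index = faiss.IndexFlatIP(doc_emb.shape[1])  # inner product == cosine on normalized vectors
index.add(doc_emb)
scores, ids = index.search(encode(["how to train a dense retriever"]), 2)
print(ids, scores)
```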

Maintenance & Community

  • Active development, with v2.0 being the current focus. Some v1 features are not yet migrated.
  • Contact: Luyu Gao (luyug@cs.cmu.edu), Xueguang Ma (x93ma@uwaterloo.ca).
  • Issue tracker available for toolkit-specific questions.

Licensing & Compatibility

  • The repository does not explicitly state a license in the README. The citation lists authors from CMU and Waterloo, suggesting academic research origins.

Limitations & Caveats

  • Tevatron v2.0 is still migrating features from v1; users needing v1 functionality should check out the v1 branch.
  • The README does not specify a license, which could impact commercial use or integration into closed-source projects.

Health Check

  • Last commit: 4 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 2
  • Issues (30d): 3

Star History

  • 88 stars in the last 90 days
