tevatron by texttron

Unified toolkit for document retrieval across modalities, languages, and scale

created 3 years ago
676 stars

Top 51.0% on sourcepulse

View on GitHub
Project Summary

Tevatron is a unified toolkit for building and deploying neural document retrieval systems, supporting large-scale, multilingual, and multimodal data. It enables researchers and practitioners to efficiently train and fine-tune dense retrievers using parameter-efficient methods like LoRA, integrating with libraries such as DeepSpeed, vLLM, and FlashAttention for optimized performance.

How It Works

Tevatron leverages DeepSpeed, FlashAttention, and vLLM for efficient large-scale training and inference on GPUs and TPUs. It supports parameter-efficient fine-tuning (PEFT) via LoRA, allowing large language models to be adapted to retrieval tasks at reduced computational cost. The toolkit handles data preparation, model encoding, and similarity search, offering flexibility for both textual and multimodal retrieval scenarios.
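As a rough illustration of the PEFT approach, the sketch below wraps a Hugging Face model with a LoRA adapter using the peft library. The model name and LoRA hyperparameters are illustrative assumptions, not Tevatron's defaults or its actual training entry point.

```python
# Minimal sketch: attach a LoRA adapter to a base model with peft.
# Model name and hyperparameters are illustrative, not Tevatron's own config.
from transformers import AutoModel
from peft import LoraConfig, get_peft_model

base = AutoModel.from_pretrained("mistralai/Mistral-7B-v0.1")
lora_config = LoraConfig(
    r=16,                # low-rank dimension of the adapter
    lora_alpha=32,       # scaling factor applied to the adapter output
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```

Because only the adapter weights receive gradients, the memory and compute cost of fine-tuning a 7B-parameter retriever drops substantially compared with full fine-tuning.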

Quick Start & Requirements

  • PyTorch (GPU): pip install transformers datasets peft deepspeed accelerate faiss-cpu && pip install -e .
  • JAX (TPU): Requires JAX, magix, and GradCache. Run pip install transformers datasets flax optax, clone and install magix and GradCache, then install the package with pip install -e .
  • JAX (GPU): Recommended to use NVIDIA's jax-toolbox Docker image.
  • Dependencies: PyTorch or JAX, HuggingFace transformers, datasets, peft, deepspeed, accelerate, faiss-cpu (for PyTorch), flax, optax, magix, GradCache (for JAX).
  • Resources: Training a Mistral-7B model on MSMARCO passage dataset with LoRA takes ~70 hours on 4xA6000 GPUs or ~110 hours on 1xA100 GPU. TPU training is faster (~35 hours on v4-8 TPU).
  • Data Format: jsonl for training (query plus positive/negative documents) and for the corpus (docid, text); image fields are optional. See the example after this list.
  • Datasets: Integrates with HuggingFace datasets (e.g., Tevatron/msmarco-passage-aug).
  • Docs: Tevatron
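The snippet below sketches the jsonl layout described above. The exact field names (query, positive_passages, negative_passages, docid, text) are assumptions based on that description, not a verified schema.

```python
# Illustrative sketch of the training and corpus jsonl records; field names are assumed.
import json

train_example = {
    "query": "what is dense retrieval?",
    "positive_passages": [
        {"docid": "d1", "text": "Dense retrieval encodes queries and documents into vectors ..."}
    ],
    "negative_passages": [
        {"docid": "d7", "text": "An unrelated passage used as a hard negative ..."}
    ],
}
corpus_entry = {"docid": "d1", "text": "Dense retrieval encodes queries and documents into vectors ..."}

with open("train.jsonl", "w") as f:
    f.write(json.dumps(train_example) + "\n")
with open("corpus.jsonl", "w") as f:
    f.write(json.dumps(corpus_entry) + "\n")
```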

Highlighted Details

  • Supports training billion-scale LLM neural retrievers on GPUs and TPUs.
  • Integrates with vLLM, DeepSpeed, FlashAttention, and gradient accumulation for efficient training and inference.
  • Provides self-contained HuggingFace datasets for multimodal and multilingual retrieval.
  • Directly loads and fine-tunes state-of-the-art embedding models such as BGE-Embedding and Instruct-E5 (see the encode-and-search sketch below).
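To make the encode-then-search flow concrete, the hedged sketch below embeds a query and a few documents with an off-the-shelf BGE model and searches them with faiss-cpu. It mirrors the workflow described above but is not Tevatron's own encoding or retrieval CLI.

```python
# Hedged sketch: encode with a Hugging Face embedding model, search with FAISS.
import faiss
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-base-en-v1.5")
model = AutoModel.from_pretrained("BAAI/bge-base-en-v1.5").eval()

def encode(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**batch)
    emb = out.last_hidden_state[:, 0]  # CLS pooling, as BGE recommends
    return torch.nn.functional.normalize(emb, dim=-1).numpy()

docs = ["Tevatron trains dense retrievers.", "FAISS performs vector similarity search."]
doc_emb = encode(docs)

index = faiss.IndexFlatIP(doc_emb.shape[1])  # inner product == cosine on normalized vectors
index.add(doc_emb)
scores, ids = index.search(encode(["how to train a dense retriever"]), 2)
print(ids, scores)
```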

Maintenance & Community

  • Active development, with v2.0 being the current focus. Some v1 features are not yet migrated.
  • Contact: Luyu Gao (luyug@cs.cmu.edu), Xueguang Ma (x93ma@uwaterloo.ca).
  • Issue tracker available for toolkit-specific questions.

Licensing & Compatibility

  • The repository does not explicitly state a license in the README. The citation lists authors from CMU and Waterloo, suggesting academic research origins.

Limitations & Caveats

  • Tevatron v2.0 is still migrating features from v1; users needing v1 functionality should check out the v1 branch.
  • The README does not specify a license, which could impact commercial use or integration into closed-source projects.

Health Check

  • Last commit: 4 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 2
  • Issues (30d): 3

Star History

  • 88 stars in the last 90 days
