atlas by facebookresearch

Research code for retrieval-augmented language models

created 2 years ago
544 stars

Top 59.4% on sourcepulse

View on GitHub
Project Summary

Atlas is a research code repository for few-shot learning with retrieval-augmented language models, targeting NLP researchers and practitioners. It enables joint pre-training of dense retrievers and encoder-decoder language models, achieving state-of-the-art results on tasks like Natural Questions with significantly fewer parameters than larger models.

How It Works

Atlas jointly trains a dense retriever (Contriever) and a fusion-in-decoder (FiD) language model (T5). It performs retrieval on-the-fly during training and inference using a custom distributed GPU index. This approach allows for efficient handling of large corpora (up to 400M passages) and dynamic index refreshing, optimizing retrieval accuracy and training stability.
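
The retrieve-then-read loop described above can be sketched with toy NumPy embeddings standing in for Contriever and the FiD reader (the real system uses learned encoders and a distributed GPU index; the corpus size, `dim`, and `k` here are illustrative assumptions, not Atlas defaults):

```python
import numpy as np

def retrieve(query_vec, passage_matrix, k):
    """Exact inner-product search: score every passage, keep the top-k."""
    scores = passage_matrix @ query_vec           # (num_passages,)
    top = np.argsort(-scores)[:k]                 # highest scores first
    return top, scores[top]

# Toy stand-ins for Contriever embeddings (illustrative only).
rng = np.random.default_rng(0)
passages = rng.normal(size=(1000, 64))            # 1000 passages, dim 64
query = rng.normal(size=64)

idx, scores = retrieve(query, passages, k=5)

# Fusion-in-decoder idea: each retrieved passage is encoded independently
# together with the query, then the per-passage encodings are concatenated
# so the decoder attends over all of them jointly.
encoded = [np.concatenate([query, passages[i]]) for i in idx]
fused = np.stack(encoded)                         # (k, 2 * dim) decoder input
print(fused.shape)
```

In Atlas the retriever and reader are trained jointly, so the passage embeddings change during training; this is what makes the index-refresh machinery below necessary.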

Quick Start & Requirements

  • Install: Clone the repo, create a conda environment, and install dependencies: git clone https://github.com/facebookresearch/atlas.git && cd atlas && conda create --name atlas-env python=3.8 && conda activate atlas-env && conda install pytorch==1.11.0 cudatoolkit=11.3 -c pytorch && conda install -c pytorch faiss-gpu=1.7.2 cudatoolkit=11.3 && pip install -r requirements.txt
  • Prerequisites: Python 3.8, Fairscale (0.4.6), Transformers (4.18.0), NumPy (1.22.4), FAISS (1.7.2), PyTorch (1.11.0), CUDA 11.3. GPU acceleration is essential.
  • Resources: Training can involve large datasets (e.g., 400M passages) and models up to 11B parameters, requiring substantial GPU memory and compute.
  • Docs: Getting Started
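
The install one-liner above, broken into individual steps for readability (versions as pinned in the Quick Start; run in order):

```shell
git clone https://github.com/facebookresearch/atlas.git
cd atlas
conda create --name atlas-env python=3.8
conda activate atlas-env
conda install pytorch==1.11.0 cudatoolkit=11.3 -c pytorch
conda install -c pytorch faiss-gpu=1.7.2 cudatoolkit=11.3
pip install -r requirements.txt
```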

Highlighted Details

  • Supports training of large fusion-in-decoder models up to 11B parameters.
  • Enables end-to-end retrieval-augmented training over custom corpora.
  • Features a fast, parallel distributed GPU-based exact and approximate nearest neighbor search.
  • Offers strategies to manage "stale" retrieval indices, including query-side finetuning and reranking.
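
The reranking strategy for stale indices can be sketched as a two-stage search: cheap candidate generation against the prebuilt (stale) index, then rescoring of the small candidate set with the current retriever (NumPy stand-ins; the drift model and the candidate/result sizes are illustrative assumptions, not Atlas APIs):

```python
import numpy as np

rng = np.random.default_rng(1)
dim, n = 32, 500

passages_stale = rng.normal(size=(n, dim))   # embeddings frozen at index-build time
drift = 0.1 * rng.normal(size=(n, dim))
passages_fresh = passages_stale + drift      # what the current retriever would produce
query = rng.normal(size=dim)

# Stage 1: candidate generation from the stale index (top-50).
cand = np.argsort(-(passages_stale @ query))[:50]

# Stage 2: rerank only the candidates with fresh embeddings (top-5).
fresh_scores = passages_fresh[cand] @ query
reranked = cand[np.argsort(-fresh_scores)][:5]
print(reranked)
```

Reranking only a small candidate set avoids re-embedding the full corpus on every retriever update, which is the cost that makes a naive full index refresh impractical at 400M-passage scale.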

Maintenance & Community

  • Status: No longer maintained; research code provided as-is.
  • Contributors: Primarily Facebook AI Research (FAIR).

Licensing & Compatibility

  • Code License: CC-BY-NC (Non-Commercial) for most code. Apache 2.0 for specific Hugging Face transformer components.
  • Data License: CC-BY-SA for Wikipedia-derived corpora and indices.
  • Commercial Use: The CC-BY-NC license restricts commercial use.

Limitations & Caveats

The repository is explicitly marked as "NO LONGER MAINTAINED," meaning no future updates or bug fixes are expected. The CC-BY-NC license restricts commercial applications.

Health Check
Last commit

1 year ago

Responsiveness

1+ week

Pull Requests (30d)
0
Issues (30d)
0
Star History
13 stars in the last 90 days
