atlas by facebookresearch

Research code for retrieval-augmented language models

created 2 years ago
544 stars

Top 59.4% on sourcepulse

View on GitHub
Project Summary

Atlas is a research code repository for few-shot learning with retrieval-augmented language models, targeting NLP researchers and practitioners. It enables joint pre-training of dense retrievers and encoder-decoder language models, achieving state-of-the-art results on tasks like Natural Questions with significantly fewer parameters than larger models.

How It Works

Atlas jointly trains a dense retriever (Contriever) and a fusion-in-decoder (FiD) language model (T5). It performs retrieval on-the-fly during training and inference using a custom distributed GPU index. This approach allows for efficient handling of large corpora (up to 400M passages) and dynamic index refreshing, optimizing retrieval accuracy and training stability.
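
The retrieve-then-read loop described above can be sketched with toy NumPy embeddings standing in for Contriever and the FiD reader (the real system uses learned encoders and a distributed GPU index; the corpus size, `dim`, and `k` here are illustrative assumptions, not Atlas defaults):

```python
import numpy as np

def retrieve(query_vec, passage_matrix, k):
    """Exact inner-product search: score every passage, keep the top-k."""
    scores = passage_matrix @ query_vec           # (num_passages,)
    top = np.argsort(-scores)[:k]                 # highest scores first
    return top, scores[top]

# Toy stand-ins for Contriever embeddings (illustrative only).
rng = np.random.default_rng(0)
passages = rng.normal(size=(1000, 64))            # 1000 passages, dim 64
query = rng.normal(size=64)

idx, scores = retrieve(query, passages, k=5)

# Fusion-in-decoder idea: each retrieved passage is encoded independently
# together with the query, then the per-passage encodings are concatenated
# so the decoder attends over all of them jointly.
encoded = [np.concatenate([query, passages[i]]) for i in idx]
fused = np.stack(encoded)                         # (k, 2 * dim) decoder input
print(fused.shape)
```

In Atlas the retriever and reader are trained jointly, so the passage embeddings change during training; this is what makes the index-refresh machinery below necessary.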

Quick Start & Requirements

  • Install: Clone the repo, create a conda environment, and install dependencies: git clone https://github.com/facebookresearch/atlas.git && cd atlas && conda create --name atlas-env python=3.8 && conda activate atlas-env && conda install pytorch==1.11.0 cudatoolkit=11.3 -c pytorch && conda install -c pytorch faiss-gpu=1.7.2 cudatoolkit=11.3 && pip install -r requirements.txt
  • Prerequisites: Python 3.8, Fairscale (0.4.6), Transformers (4.18.0), NumPy (1.22.4), FAISS (1.7.2), PyTorch (1.11.0), CUDA 11.3. GPU acceleration is essential.
  • Resources: Training can involve large datasets (e.g., 400M passages) and models up to 11B parameters, requiring substantial GPU memory and compute.
  • Docs: Getting Started
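
The install one-liner above, broken into individual steps for readability (versions as pinned in the Quick Start; run in order):

```shell
git clone https://github.com/facebookresearch/atlas.git
cd atlas
conda create --name atlas-env python=3.8
conda activate atlas-env
conda install pytorch==1.11.0 cudatoolkit=11.3 -c pytorch
conda install -c pytorch faiss-gpu=1.7.2 cudatoolkit=11.3
pip install -r requirements.txt
```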

Highlighted Details

  • Supports training of large fusion-in-decoder models up to 11B parameters.
  • Enables end-to-end retrieval-augmented training over custom corpora.
  • Features a fast, parallel distributed GPU-based exact and approximate nearest neighbor search.
  • Offers strategies to manage "stale" retrieval indices, including query-side finetuning and reranking.
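
The reranking strategy for stale indices can be sketched as a two-stage search: cheap candidate generation against the prebuilt (stale) index, then rescoring of the small candidate set with the current retriever (NumPy stand-ins; the drift model and the candidate/result sizes are illustrative assumptions, not Atlas APIs):

```python
import numpy as np

rng = np.random.default_rng(1)
dim, n = 32, 500

passages_stale = rng.normal(size=(n, dim))   # embeddings frozen at index-build time
drift = 0.1 * rng.normal(size=(n, dim))
passages_fresh = passages_stale + drift      # what the current retriever would produce
query = rng.normal(size=dim)

# Stage 1: candidate generation from the stale index (top-50).
cand = np.argsort(-(passages_stale @ query))[:50]

# Stage 2: rerank only the candidates with fresh embeddings (top-5).
fresh_scores = passages_fresh[cand] @ query
reranked = cand[np.argsort(-fresh_scores)][:5]
print(reranked)
```

Reranking only a small candidate set avoids re-embedding the full corpus on every retriever update, which is the cost that makes a naive full index refresh impractical at 400M-passage scale.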

Maintenance & Community

  • Status: No longer maintained; research code provided as-is.
  • Contributors: Primarily Facebook AI Research (FAIR).

Licensing & Compatibility

  • Code License: CC-BY-NC (Non-Commercial) for most code. Apache 2.0 for specific Hugging Face transformer components.
  • Data License: CC-BY-SA for Wikipedia-derived corpora and indices.
  • Commercial Use: The CC-BY-NC license restricts commercial use.

Limitations & Caveats

The repository is explicitly marked as "NO LONGER MAINTAINED," meaning no future updates or bug fixes are expected. The CC-BY-NC license restricts commercial applications.

Health Check
Last commit

1 year ago

Responsiveness

1+ week

Pull Requests (30d)
0
Issues (30d)
0
Star History
13 stars in the last 90 days
