large_concept_model by facebookresearch

Language modeling research paper in a sentence representation space

Created 1 year ago

2,323 stars

Top 19.4% on SourcePulse

4 Experts Love This Project

Elvis Saravia

Founder of DAIR.AI

Omar Sanseviero

DevRel at Google DeepMind

Travis Fischer

Founder of Agentic

Georgios Konstantopoulos

CTO, General Partner at Paradigm

Project Summary

The Large Concept Model (LCM) repository offers an implementation of language modeling in a sentence representation space, aimed at researchers and practitioners interested in novel sequence-to-sequence architectures. It performs auto-regressive sentence prediction over explicit, language-agnostic "concepts" drawn from the SONAR embedding space, which covers multilingual text and speech.

How It Works

LCMs operate on sentence embeddings, treating each embedded sentence as a "concept" rather than working with individual tokens. The repository includes implementations of two approaches: Mean Squared Error (MSE) regression and diffusion-based generation. Working at the concept level aims to capture higher-level semantic meaning than token-level prediction, which may make language generation more robust and interpretable.
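
As a rough illustration of the MSE-regression idea described above, the sketch below treats a document as a sequence of fixed-size sentence vectors and scores a next-sentence predictor by mean squared error. It is not the repository's implementation: the vectors are toy 3-dimensional stand-ins for SONAR embeddings, and `predict_next` (a plain average of the context) is a hypothetical placeholder for a trained model.

```python
from typing import List, Sequence

Vector = List[float]


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two vectors of equal length."""
    if len(pred) != len(target):
        raise ValueError("vectors must have the same dimension")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def predict_next(context: Sequence[Vector]) -> Vector:
    """Hypothetical placeholder model: average the context vectors.

    A real LCM would be a trained network; this only fixes the interface.
    """
    dim = len(context[0])
    return [sum(vec[i] for vec in context) / len(context) for i in range(dim)]


def autoregressive_mse(document: Sequence[Vector]) -> float:
    """Average next-concept MSE over every prefix of the document."""
    if len(document) < 2:
        raise ValueError("need at least two sentence vectors")
    losses = [
        mse(predict_next(document[:t]), document[t])
        for t in range(1, len(document))
    ]
    return sum(losses) / len(losses)


# Toy "document" of three sentence vectors (stand-ins for SONAR embeddings).
doc = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.5, 0.5, 0.0]]
loss = autoregressive_mse(doc)
```

The point of the sketch is the shape of the objective: each step conditions on the preceding sentence vectors and is penalized by its distance to the true next vector, with no token vocabulary involved.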

Quick Start & Requirements

Installation: Recommended via uv (uv sync --extra cpu --extra eval --extra data) or pip. GPU support requires manual installation of compatible torch and fairseq2 versions (e.g., uv pip install torch==2.5.1 --extra-index-url https://download.pytorch.org/whl/cu121 and uv pip install fairseq2==v0.3.0rc1 --pre --extra-index-url https://fair.pkg.atmeta.com/fairseq2/whl/rc/pt2.5.1/cu121).
Prerequisites: Python, fairseq2 (release candidate), torch, nltk (for evaluation). A CUDA-capable GPU is recommended for training and inference.
Data Preparation: Input text must be split into sentences, and each sentence embedded with SONAR. A sample pipeline is provided.
Documentation: Blog, Paper, fairseq2.
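
The data-preparation requirement above (split text into sentences, then embed each sentence) can be sketched as follows. This is a minimal illustration, not the repository's sample pipeline: the regex splitter is a naive stand-in for a proper sentence segmenter, and `embed_fn` is a hypothetical callable marking where a SONAR text encoder would be plugged in.

```python
import re
from typing import Callable, Dict, List

Embedder = Callable[[List[str]], List[List[float]]]


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def prepare_document(text: str, embed_fn: Embedder) -> List[Dict[str, object]]:
    """Return one record per sentence holding its text and its embedding."""
    sentences = split_sentences(text)
    vectors = embed_fn(sentences)
    if len(vectors) != len(sentences):
        raise ValueError("embed_fn must return one vector per sentence")
    return [{"text": s, "embedding": v} for s, v in zip(sentences, vectors)]


def toy_embedder(sentences: List[str]) -> List[List[float]]:
    """Hypothetical stand-in for SONAR: [character count, word count]."""
    return [[float(len(s)), float(len(s.split()))] for s in sentences]


records = prepare_document("Concepts are sentences. They are embedded!", toy_embedder)
```

Passing the embedder in as a parameter keeps the segmentation step testable without a GPU or model download; in practice the toy embedder would be replaced by a real SONAR encoder call.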

Highlighted Details

The reported experiments use 1.6B-parameter models trained on roughly 1.3T tokens.
SONAR embedding space supports up to 200 text and 57 speech languages.
Includes recipes for reproducing training and fine-tuning of MSE and Two-tower diffusion LCMs.
Evaluation scripts for benchmarking and comparison with LLMs are available.

Maintenance & Community

Developed by the "LCM team" at Meta AI.
Codebase relies on fairseq2.
Contribution guidelines are provided.

Licensing & Compatibility

MIT License. Permissive for commercial use and closed-source linking.

Limitations & Caveats

Models operating in a quantized SONAR space are noted as "coming soon".
Requires specific release candidate versions of fairseq2.
Reproducing the paper's experimental setup requires careful attention to resource requirements (e.g., SLURM jobs).
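
Because the project depends on specific torch and fairseq2 builds, it can help to check the environment before running a recipe. The sketch below uses only the standard library's `importlib.metadata`; the pins in `EXPECTED` are taken from the GPU install example in Quick Start and are an assumption about what a given setup needs, not an official compatibility list. The helper names are hypothetical.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    found: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            # Drop any local build tag such as "+cu121" before comparing.
            found[name] = metadata.version(name).split("+")[0]
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def mismatches(expected: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Return the packages whose installed version differs from the pin."""
    actual = installed_versions(expected)
    return {
        name: actual[name]
        for name, pin in expected.items()
        if actual[name] != pin
    }


# Assumed pins, written in normalized form (no leading "v").
EXPECTED = {"torch": "2.5.1", "fairseq2": "0.3.0rc1"}
report = mismatches(EXPECTED)
for name, version in report.items():
    print(f"{name}: expected {EXPECTED[name]}, found {version}")
```

An empty `report` means both pins match; a value of `None` means the package is not installed at all.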

Health Check

Last Commit

11 months ago

Responsiveness

Inactive

Star History

17 stars in the last 30 days