large_concept_model  by facebookresearch

Research code for language modeling in a sentence representation space

created 7 months ago
2,254 stars

Top 20.5% on sourcepulse

Project Summary

The Large Concept Models (LCM) repository implements language modeling in a sentence representation space, targeting researchers and practitioners interested in novel sequence-to-sequence architectures. It enables auto-regressive sentence prediction using explicit, language-agnostic "concepts" derived from the SONAR embedding space, which supports multilingual text and speech.

How It Works

LCMs operate on sentence embeddings, treating each embedding as one concept. The repository includes implementations of two generation approaches: Mean Squared Error (MSE) regression and diffusion. Modeling at the concept level aims to capture higher-level semantic meaning than token-level prediction, with the goal of more robust and interpretable language generation.
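
As a rough illustration of auto-regressive prediction with an MSE objective, the sketch below scores a toy "document" of sentence embeddings. It is not the repository's code: the 2-dimensional vectors are made up, and `predict_next` is a hypothetical running-mean stand-in for the actual model. Only the shape of the objective (predict embedding t+1 from embeddings 0..t, penalize squared error) is meant to carry over.

```python
# Toy illustration only: a real LCM consumes SONAR sentence embeddings and
# uses a learned network. Here the embeddings are hand-written vectors and
# the "model" is a hypothetical running-mean predictor.

def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

def predict_next(context):
    """Hypothetical stand-in for the model: average the context embeddings."""
    dim = len(context[0])
    return [sum(vec[i] for vec in context) / len(context) for i in range(dim)]

def autoregressive_mse_loss(embeddings):
    """Average MSE of predicting embedding t from embeddings 0..t-1."""
    losses = []
    for t in range(1, len(embeddings)):
        pred = predict_next(embeddings[:t])
        losses.append(mse(pred, embeddings[t]))
    return sum(losses) / len(losses)

# One "document" of three sentence embeddings (made-up 2-d values).
doc = [[0.0, 0.0], [2.0, 0.0], [1.0, 3.0]]
loss = autoregressive_mse_loss(doc)
print(loss)
```

Each prediction sees only the preceding sentence embeddings, which is what makes the setup auto-regressive at the sentence level rather than the token level.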

Quick Start & Requirements

  • Installation: Recommended via uv (uv sync --extra cpu --extra eval --extra data) or pip. GPU support requires manual installation of compatible torch and fairseq2 versions (e.g., uv pip install torch==2.5.1 --extra-index-url https://download.pytorch.org/whl/cu121 and uv pip install fairseq2==v0.3.0rc1 --pre --extra-index-url https://fair.pkg.atmeta.com/fairseq2/whl/rc/pt2.5.1/cu121).
  • Prerequisites: Python, fairseq2 (release candidate), torch, nltk (for evaluation). GPU with CUDA is recommended for training/inference.
  • Data Preparation: Requires data processed into sentences and embedded using SONAR. A sample pipeline is provided.
  • Documentation: Blog, Paper, fairseq2.
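
The data preparation step above (split text into sentences, then embed each sentence) can be sketched as follows. This is an assumption-laden outline, not the sample pipeline shipped with the repository: `split_sentences` is a naive regex splitter, and `placeholder_embed` is a hypothetical hash-based stand-in that only mimics the interface of a sentence encoder such as SONAR.

```python
import hashlib
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def placeholder_embed(sentence, dim=4):
    """Hypothetical stand-in for a SONAR encoder.

    Returns deterministic hash-derived floats in [0, 1]; the values carry
    no semantic meaning and only illustrate the sentence -> vector step.
    """
    digest = hashlib.sha256(sentence.encode("utf-8")).digest()
    return [digest[i] / 255.0 for i in range(dim)]

def prepare(text):
    """Return a list of (sentence, embedding) pairs for one document."""
    return [(s, placeholder_embed(s)) for s in split_sentences(text)]

records = prepare("LCMs predict sentences. Each sentence is one concept! Is that clear?")
for sentence, vector in records:
    print(sentence, len(vector))
```

In a real pipeline the placeholder would be replaced by an actual encoder call, and the resulting per-sentence vectors would form the input sequence for training.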

Highlighted Details

  • Supports 1.6B parameter models trained on 1.3T tokens.
  • SONAR embedding space supports up to 200 text and 57 speech languages.
  • Includes recipes for reproducing training and fine-tuning of MSE and Two-tower diffusion LCMs.
  • Evaluation scripts for benchmarking and comparison with LLMs are available.

Maintenance & Community

  • Developed by the "LCM team" at Meta AI.
  • Codebase relies on fairseq2.
  • Contribution guidelines are provided.

Licensing & Compatibility

  • MIT License. Permissive for commercial use and closed-source linking.

Limitations & Caveats

  • Diffusion models operating in a quantized space are noted as "coming soon".
  • Requires specific release candidate versions of fairseq2.
  • Reproducing the paper's experimental setup requires careful attention to resource requirements (e.g., SLURM jobs).

Health Check

  • Last commit: 6 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 1

Star History

  • 139 stars in the last 90 days
