calm by shaochenze

Continuous Autoregressive Language Models for efficient text generation

Created 2 months ago · 686 stars · Top 49.7% on SourcePulse
View on GitHub: https://github.com/shaochenze/calm

Project Summary

CALM (Continuous Autoregressive Language Models) introduces a paradigm shift to overcome the token-by-token generation bottleneck in Large Language Models (LLMs). Instead of predicting one discrete token per step, the model predicts a single continuous vector representing an entire chunk of K tokens, which substantially improves training and inference efficiency. The approach also opens a new scaling dimension for LLMs, termed "semantic bandwidth," making it relevant to researchers and practitioners seeking more efficient and scalable language models.

How It Works

CALM employs a two-stage process. First, a high-fidelity autoencoder compresses K tokens into a continuous vector and reconstructs them with near-perfect accuracy. Second, a continuous-domain language model performs autoregressive prediction in this vector space. This method reduces the number of autoregressive steps by a factor of K, leading to substantial efficiency gains and enabling scaling based on semantic bandwidth.
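
In code, the idea looks roughly like the sketch below. This is a conceptual illustration only, with made-up module names and sizes and a plain linear layer standing in for the generative head; the actual repository uses its own architectures and energy-based, diffusion, or flow-matching heads.

```python
# Conceptual sketch of CALM's two-stage setup -- not the repository's actual API.
# (1) An autoencoder maps each chunk of K tokens to one continuous vector.
# (2) An autoregressive model predicts the next chunk vector from previous ones.
# All names, sizes, and the linear "head" below are illustrative assumptions.
import torch
import torch.nn as nn

K, VOCAB, D = 4, 32000, 512  # chunk size, vocabulary size, latent width (made up)

class ChunkAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D)
        self.enc = nn.Linear(K * D, D)         # K token embeddings -> 1 vector
        self.dec = nn.Linear(D, K * VOCAB)     # 1 vector -> K sets of token logits

    def encode(self, tokens):                  # tokens: (batch, K)
        return self.enc(self.embed(tokens).flatten(1))

    def decode(self, z):                       # z: (batch, D)
        return self.dec(z).view(-1, K, VOCAB)  # (batch, K, VOCAB) logits

class ContinuousAR(nn.Module):
    """Autoregressive model over chunk vectors: each step emits one vector,
    so generating N tokens takes only N / K autoregressive steps."""
    def __init__(self):
        super().__init__()
        layer = nn.TransformerEncoderLayer(D, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(D, D)            # stand-in for a generative head

    def forward(self, z_prev):                 # z_prev: (batch, steps, D)
        mask = nn.Transformer.generate_square_subsequent_mask(z_prev.size(1))
        return self.head(self.backbone(z_prev, mask=mask))

tokens = torch.randint(0, VOCAB, (2, K))       # two chunks of K tokens each
ae, lm = ChunkAutoencoder(), ContinuousAR()
z = ae.encode(tokens)                          # (2, D): one vector per chunk
next_z = lm(z.unsqueeze(1))[:, -1]             # predict the following chunk vector
next_tokens = ae.decode(next_z).argmax(-1)     # map it back to K discrete tokens
```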

Quick Start & Requirements

  • Installation: Clone the repository (git clone https://github.com/shaochenze/calm.git) and install dependencies (pip install -r requirements.txt).
  • Data Preparation: Download and process "the pile-uncopyrighted" dataset using bash data/get_data.sh. Requires at least 2.5TB of free disk space.
  • Training: Two main stages (chained end-to-end in the sketch after this list):
    1. Train the autoencoder (bash train/train_autoencoder.sh).
    2. Train the CALM language model using energy-based training (bash train/train_energy.sh).
  • Alternative Training: Scripts for Diffusion and Flow Matching generative heads are available (train/train_diffusion.sh, train/train_flow.sh).
  • Baseline: A standard autoregressive Transformer baseline can be trained (train/train_ar.sh).
  • Evaluation: Use bash train/eval_energy.sh to evaluate checkpoints.
  • Prerequisites: Python, PyTorch (implied by torchrun), sufficient disk space (2.5TB+), and multi-GPU setup (e.g., 8 GPUs per node). Scripts utilize bf16 for mixed-precision training.
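
All of the steps above are shell scripts. A minimal Python driver that chains them in the documented order might look like the following; the script paths come from the list above, while the single-run orchestration and error handling are assumptions, and actually running it presumes the repository is cloned, dependencies are installed, and the disk and GPU requirements are met.

```python
# Hypothetical end-to-end driver for the documented workflow. The script paths
# are the ones listed above; combining them into one run is an assumption.
import subprocess

STEPS = [
    "bash data/get_data.sh",            # download/process the dataset (~2.5TB of disk)
    "bash train/train_autoencoder.sh",  # stage 1: train the K-token autoencoder
    "bash train/train_energy.sh",       # stage 2: energy-based CALM language model
    "bash train/eval_energy.sh",        # evaluate the resulting checkpoints
]

for cmd in STEPS:
    print(f"Running: {cmd}")
    subprocess.run(cmd, shell=True, check=True)  # abort immediately if a step fails
```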

Highlighted Details

  • Ultra-Efficient: Dramatically improves training and inference efficiency by reducing autoregressive steps by a factor of K.
  • New Scaling Axis: Introduces "semantic bandwidth" (K) as a dimension for LLM scaling, beyond parameters and data.
  • Likelihood-Free Toolkit: Provides algorithms for continuous-domain modeling, including a robust autoencoder, Energy-Based Training, the BrierLM evaluation metric (illustrated after this list), and Temperature Sampling.
  • Performance Claims: Pre-trained CALM models reach BrierLM scores of 5.72 (CALM-M, 371M), 6.58 (CALM-L, 735M), and 8.53 (CALM-XL, 1.82B); the standard autoregressive baseline is expected to reach roughly 6.05.
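
For intuition on the likelihood-free evaluation idea: the classical Brier score of a categorical prediction can be estimated without access to the model's probabilities, using only two independent samples. The sketch below shows that construction; BrierLM's exact definition (its aggregation and scaling) is given in the paper and repository and may differ.

```python
# Illustration of likelihood-free Brier estimation -- not BrierLM itself.
import torch

def brier_score(probs: torch.Tensor, target: int) -> float:
    """Exact Brier score of one categorical prediction (lower is better)."""
    onehot = torch.zeros_like(probs)
    onehot[target] = 1.0
    return torch.sum((probs - onehot) ** 2).item()

def brier_estimate(sample_a: int, sample_b: int, target: int) -> float:
    """Unbiased estimate of the same quantity from two independent model
    samples -- no probabilities required, only the ability to sample."""
    return (float(sample_a == sample_b)
            - float(sample_a == target)
            - float(sample_b == target) + 1.0)

probs = torch.tensor([0.7, 0.2, 0.1])
print(brier_score(probs, target=0))                 # exact value: ~0.14
draws = torch.multinomial(probs, 2, replacement=True)
print(brier_estimate(draws[0].item(), draws[1].item(), target=0))  # noisy but unbiased
```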

Maintenance & Community

  • Contact: For questions, submit an issue or contact chenzeshao@tencent.com.
  • No specific community links (Discord/Slack) or roadmap are mentioned.

Licensing & Compatibility

  • The README does not explicitly state the license type or provide compatibility notes for commercial use.

Limitations & Caveats

  • Requires a substantial 2.5TB+ disk space for the dataset.
  • Alternative generative heads (Diffusion, Flow Matching) showed slightly lower performance compared to the Energy-based head in experiments.
  • The README does not specify hardware requirements beyond what's implied by the training scripts (e.g., multi-GPU setup).

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 22 stars in the last 30 days
