Multimodal autoregressive model for long-context video/text
Large World Model (LWM) addresses the limitations of current language models in understanding complex, long-form tasks and temporal information by jointly modeling text and video sequences. It targets researchers and developers seeking to build AI with a deeper understanding of both human knowledge and the physical world, enabling capabilities like long-context retrieval and video understanding.
How It Works
LWM uses a multimodal autoregressive approach, trained on sequences of millions of tokens drawn from diverse long videos and books. It employs RingAttention to scale training to context lengths of up to 1 million tokens, overcoming the memory and computational constraints of standard attention. Key innovations include masked sequence packing for mixing sequences of different lengths, loss weighting to balance modalities, and a model-generated QA dataset for long-sequence chat.
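To make the last two ideas concrete, here is a minimal JAX sketch, not the repo's implementation, of how masked sequence packing and per-modality loss weighting can combine in a single next-token cross-entropy objective; all names, modality tags, and weight values are illustrative assumptions.

```python
# Minimal sketch (not LWM's code) of masked sequence packing plus
# per-modality loss weighting in a next-token cross-entropy loss.
import jax
import jax.numpy as jnp

VOCAB = 1000
TEXT, VISION, PAD = 0, 1, 2  # illustrative per-token modality/pad tags


def weighted_packed_loss(logits, targets, modality, text_w=1.0, vision_w=0.5):
    """Cross-entropy over a packed sequence.

    logits:   [seq, vocab] model outputs
    targets:  [seq] next-token ids
    modality: [seq] tag per target token (TEXT, VISION, or PAD)
    """
    log_probs = jax.nn.log_softmax(logits, axis=-1)
    token_nll = -jnp.take_along_axis(log_probs, targets[:, None], axis=-1)[:, 0]
    # Masked sequence packing: separator/padding positions between packed
    # examples contribute no loss.
    mask = (modality != PAD).astype(jnp.float32)
    # Loss weighting: balance text vs. vision tokens (weights are assumptions).
    weights = jnp.where(modality == TEXT, text_w, vision_w) * mask
    return jnp.sum(weights * token_nll) / jnp.maximum(jnp.sum(mask), 1.0)


# Tiny usage example with random data.
key = jax.random.PRNGKey(0)
logits = jax.random.normal(key, (8, VOCAB))
targets = jnp.arange(8) % VOCAB
modality = jnp.array([TEXT, TEXT, VISION, VISION, PAD, TEXT, TEXT, PAD])
print(weighted_packed_loss(logits, targets, modality))
```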
Quick Start & Requirements
Run `conda create -n lwm python=3.10` and `conda activate lwm`, then `pip install -r gpu_requirements.txt`.
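After installation, a quick sanity check (generic JAX usage, not a step from the repo's docs) confirms that JAX can see your accelerator:

```python
# List the accelerator devices visible to JAX; a GPU build should
# report GPU devices rather than falling back to CPU.
import jax
print(jax.devices())
```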
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
Vision-language models are JAX-only; PyTorch support is limited to text-only models. GPU performance is less optimized than on TPUs.