LWM by LargeWorldModel

Multimodal autoregressive model for long-context video/text

created 1 year ago
7,314 stars

Top 7.2% on sourcepulse

Project Summary

Large World Model (LWM) addresses the limitations of current language models in understanding complex, long-form tasks and temporal information by jointly modeling text and video sequences. It targets researchers and developers seeking to build AI with a deeper understanding of both human knowledge and the physical world, enabling capabilities like long-context retrieval and video understanding.

How It Works

LWM uses a multimodal autoregressive approach, trained on a large dataset of diverse long videos and books. It employs RingAttention to scale training to sequences of up to 1 million tokens while staying within memory and compute constraints. Key innovations include masked sequence packing for mixing sequences of different lengths, loss weighting to balance modalities, and a model-generated QA dataset for long-sequence chat.
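The masked sequence packing idea can be sketched as follows: multiple sequences are concatenated into one long context, and the attention mask is made block-diagonal (and causal) so tokens never attend across a packing boundary. This is an illustrative sketch, not the repository's actual implementation; the function name and details are assumptions.

```python
import numpy as np

def packed_attention_mask(seq_lengths, max_len):
    """Build a causal, block-diagonal attention mask for several
    sequences packed into a single context of length max_len.

    A token may attend only to earlier tokens within its own
    sequence, never across a packing boundary or to padding.
    """
    seq_ids = np.full(max_len, -1, dtype=np.int64)  # -1 marks padding
    pos = 0
    for i, n in enumerate(seq_lengths):
        seq_ids[pos:pos + n] = i
        pos += n
    causal = np.tril(np.ones((max_len, max_len), dtype=bool))
    same_seq = (seq_ids[:, None] == seq_ids[None, :]) & (seq_ids[:, None] >= 0)
    return causal & same_seq

# Two sequences of lengths 3 and 2 packed into a context of 6 tokens:
mask = packed_attention_mask([3, 2], max_len=6)
# Token 3 (start of the second sequence) cannot attend to tokens 0-2,
# and the final padding token attends to nothing.
```

The same mask shape drops directly into a standard attention implementation in place of the usual causal mask, which is what lets mixed-length training batches share one fixed context window.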

Quick Start & Requirements

  • Install with conda create -n lwm python=3.10 and conda activate lwm, then pip install -r gpu_requirements.txt.
  • TPUs are recommended for best performance; GPUs are supported but less optimized.
  • Requires Ubuntu; Windows/macOS not tested.
  • See data.md and sharding.md for detailed documentation.
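The install steps above, collected into a single shell session (commands taken from the bullets; no flags beyond those shown are assumed):

```shell
# Create and activate the conda environment
conda create -n lwm python=3.10
conda activate lwm

# Install GPU dependencies from the repo's requirements file
pip install -r gpu_requirements.txt
```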

Highlighted Details

  • Achieves state-of-the-art benchmarks in retrieval tasks and long video understanding.
  • Offers models with context sizes ranging from 32K to 1 million tokens.
  • Supports both language-only and vision-language models.
  • Open-sourced 7B-parameter models are available in PyTorch (text-only) and JAX (text and vision-language).

Maintenance & Community

  • Based on the RingAttention codebase.
  • Tested on TPUv3 and TPUv4.
  • For issues, open a GitHub issue.
  • Cites work on RingAttention and Blockwise Parallel Transformers.

Licensing & Compatibility

  • Code released under Apache 2.0 License.
  • Models released under the Llama-2 license.

Limitations & Caveats

Vision-language models are JAX-only; PyTorch support covers text-only models. GPU performance is less optimized than TPU performance.

Health Check

  • Last commit: 9 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 66 stars in the last 90 days
