Multimodal autoregressive model for long-context video/text
Large World Model (LWM) addresses the limitations of current language models in understanding complex, long-form tasks and temporal information by jointly modeling text and video sequences. It targets researchers and developers seeking to build AI with a deeper understanding of both human knowledge and the physical world, enabling capabilities like long-context retrieval and video understanding.
How It Works
LWM uses a multimodal autoregressive approach, trained on sequences of millions of tokens drawn from diverse long videos and books. It employs RingAttention to scale training to context lengths of up to 1 million tokens, overcoming the memory and computational constraints of standard attention. Key innovations include masked sequence packing for mixing sequences of different lengths, loss weighting to balance modalities, and a model-generated QA dataset for long-sequence chat.
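To make the last two ideas concrete, here is a minimal JAX sketch, not the repo's implementation, of how masked sequence packing and per-modality loss weighting can combine in a single next-token cross-entropy objective; all names, modality tags, and weight values are illustrative assumptions.

```python
# Minimal sketch (not LWM's code) of masked sequence packing plus
# per-modality loss weighting in a next-token cross-entropy loss.
import jax
import jax.numpy as jnp

VOCAB = 1000
TEXT, VISION, PAD = 0, 1, 2  # illustrative per-token modality/pad tags


def weighted_packed_loss(logits, targets, modality, text_w=1.0, vision_w=0.5):
    """Cross-entropy over a packed sequence.

    logits:   [seq, vocab] model outputs
    targets:  [seq] next-token ids
    modality: [seq] tag per target token (TEXT, VISION, or PAD)
    """
    log_probs = jax.nn.log_softmax(logits, axis=-1)
    token_nll = -jnp.take_along_axis(log_probs, targets[:, None], axis=-1)[:, 0]
    # Masked sequence packing: separator/padding positions between packed
    # examples contribute no loss.
    mask = (modality != PAD).astype(jnp.float32)
    # Loss weighting: balance text vs. vision tokens (weights are assumptions).
    weights = jnp.where(modality == TEXT, text_w, vision_w) * mask
    return jnp.sum(weights * token_nll) / jnp.maximum(jnp.sum(mask), 1.0)


# Tiny usage example with random data.
key = jax.random.PRNGKey(0)
logits = jax.random.normal(key, (8, VOCAB))
targets = jnp.arange(8) % VOCAB
modality = jnp.array([TEXT, TEXT, VISION, VISION, PAD, TEXT, TEXT, PAD])
print(weighted_packed_loss(logits, targets, modality))
```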
Quick Start & Requirements
Run `conda create -n lwm python=3.10` and `conda activate lwm`, then `pip install -r gpu_requirements.txt`.
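After installation, a quick sanity check (generic JAX usage, not a step from the repo's docs) confirms that JAX can see your accelerator:

```python
# List the accelerator devices visible to JAX; a GPU build should
# report GPU devices rather than falling back to CPU.
import jax
print(jax.devices())
```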
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
Vision-language models are JAX-only; PyTorch support is limited to text-only models. GPU performance is less optimized than on TPUs.