lingbot-va  by Robbyant

Causal video-action world model for robot control

Created 3 weeks ago

694 stars

Top 49.0% on SourcePulse

Project Summary

LingBot-VA is a causal video-action world model designed for generalist robot control. It addresses the challenge of unifying visual dynamics prediction and action inference within a single framework, enabling robots to learn and execute complex tasks more efficiently and with greater generalization. The project targets researchers and engineers in robotics and AI, offering a world-modeling approach that achieves state-of-the-art performance in simulated and real-world robotic manipulation.

How It Works

LingBot-VA employs an autoregressive video-action world modeling approach. Its core innovation is architecturally unifying visual dynamics prediction and action inference within a single interleaved sequence while preserving their distinct conceptual roles. This is achieved through a high-efficiency, dual-stream Mixture-of-Transformers (MoT) architecture that incorporates asynchronous execution and a KV cache. The design yields significant improvements in sample efficiency, long-horizon task success rates, and generalization to novel environments.
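The interleaving idea can be sketched abstractly. The snippet below is a conceptual illustration only, not LingBot-VA's implementation: the token names, stub predictors, and `rollout` helper are all hypothetical placeholders showing how video and action tokens can share one causal sequence while keeping distinct roles.

```python
# Conceptual sketch (NOT the LingBot-VA code) of an interleaved
# video-action autoregressive rollout.

def interleave(video_tokens, action_tokens):
    """Merge per-step video and action tokens into one causal sequence
    [v0, a0, v1, a1, ...] so a single model can attend to both modalities."""
    seq = []
    for v, a in zip(video_tokens, action_tokens):
        seq.append(("video", v))
        seq.append(("action", a))
    return seq

def rollout(initial_video, predict_action, predict_video, horizon):
    """Autoregressive rollout: each step infers an action from the visual
    context, then predicts the next frame conditioned on that action.
    A real model would reuse a KV cache rather than re-reading `seq`."""
    seq = [("video", initial_video)]
    for _ in range(horizon):
        a = predict_action(seq)   # action inference from the full context
        seq.append(("action", a))
        v = predict_video(seq)    # visual dynamics prediction
        seq.append(("video", v))
    return seq

# Toy rollout with stub predictors that just count prior tokens.
traj = rollout(
    initial_video="v0",
    predict_action=lambda s: f"a{sum(1 for t, _ in s if t == 'action')}",
    predict_video=lambda s: f"v{sum(1 for t, _ in s if t == 'video')}",
    horizon=2,
)
print(traj)
```

The alternating pattern is what lets one causal transformer condition action inference on all past frames and frame prediction on all past actions.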

Quick Start & Requirements

  • Primary Install:
    pip install torch==2.9.0 torchvision==0.24.0 torchaudio==2.9.0 --index-url https://download.pytorch.org/whl/cu126
    pip install websockets einops diffusers==0.36.0 transformers==4.55.2 accelerate msgpack opencv-python matplotlib ftfy easydict
    pip install flash-attn --no-build-isolation
    
  • Prerequisites: Python 3.10.16, PyTorch 2.9.0, CUDA 12.6.
  • RoboTwin Evaluation Setup: Requires cloning the RoboTwin repository (https://robotwin-platform.github.io/doc/usage/robotwin-install.html), modifying its requirements.txt (e.g., transforms3d==0.4.2, sapien==3.0.0b1, gymnasium==0.29.1, huggingface_hub==0.36.2), and running provided installation scripts.
  • Inference Server/Client: Scripts are available for deploying standalone or server-client architectures for distributed inference.

Highlighted Details

  • Achieves state-of-the-art performance on RoboTwin 2.0 simulation benchmarks, surpassing 90% on both the Easy (92.9%) and Hard (91.6%) success rates.
  • Outperforms strong baselines on LIBERO benchmarks across Spatial, Object, Goal, and Long Avg metrics, demonstrating robust generalization.
  • Shows state-of-the-art results in real-world manipulation tasks (e.g., Make Breakfast, Fold Clothes) with high Progress and Success Rates using only 50 trials per task.

Maintenance & Community

Weights and code for the shared backbone were released on January 29, 2026. For questions, discussions, or collaborations, users can open an issue on GitHub or contact Dr. Qihang Zhang (liuhuan.zqh@antgroup.com) or Dr. Lin Li (fengchang.ll@antgroup.com).

Licensing & Compatibility

This project is released under the Apache License 2.0. This permissive license allows for commercial use and integration with closed-source projects without significant restrictions.

Limitations & Caveats

The documentation does not explicitly list known limitations, alpha status, or specific bugs. The evaluation setup, particularly for the RoboTwin environment, involves modifying existing scripts and dependencies and may require careful configuration and troubleshooting.

Health Check

  • Last Commit: 2 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 24
  • Star History: 696 stars in the last 27 days
