Vision-language-action model for robotic manipulation
CogACT is a foundational vision-language-action model designed for robotic manipulation, enabling robots to understand and execute complex tasks described in natural language. It targets researchers and developers in robotics and AI, offering a unified framework that integrates visual perception, language understanding, and motor control to improve robotic autonomy.
How It Works
CogACT uses a Diffusion Transformer (DiT) for action generation, conditioned on visual inputs and language prompts. Rather than decoding actions token by token, it generates an entire action sequence in a single inference pass. This unified cognition-token strategy, in which the vision-language backbone distills the observation and instruction into a conditioning feature for the action module, is key to its efficiency and its ability to produce multi-step actions.
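In outline, the action module can be thought of as a denoiser over a whole action chunk, conditioned on a single cognition feature. The sketch below is a minimal illustration under assumed shapes and a simplified denoising loop; the layer sizes, module names, and sampling update are placeholders, not CogACT's actual DiT implementation.

```python
# Minimal sketch: shapes, module names, and the crude denoising update below
# are illustrative assumptions, not CogACT's actual architecture.
import torch
import torch.nn as nn

class ActionDenoiser(nn.Module):
    """Predicts the noise added to an action chunk, conditioned on a
    cognition feature produced by the vision-language backbone."""
    def __init__(self, action_dim=7, chunk_len=16, cond_dim=4096, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(action_dim * chunk_len + cond_dim + 1, hidden),
            nn.GELU(),
            nn.Linear(hidden, hidden),
            nn.GELU(),
            nn.Linear(hidden, action_dim * chunk_len),
        )
        self.action_dim, self.chunk_len = action_dim, chunk_len

    def forward(self, noisy_actions, t, cognition_feature):
        flat = noisy_actions.flatten(1)                       # (B, T*D)
        x = torch.cat([flat, cognition_feature, t[:, None]], dim=-1)
        return self.net(x).view(-1, self.chunk_len, self.action_dim)

@torch.no_grad()
def sample_action_chunk(denoiser, cognition_feature, steps=10):
    """Iteratively denoise a full multi-step action chunk in one pass,
    rather than decoding actions one token at a time."""
    B = cognition_feature.shape[0]
    actions = torch.randn(B, denoiser.chunk_len, denoiser.action_dim)
    for i in reversed(range(steps)):
        t = torch.full((B,), i / steps)
        pred_noise = denoiser(actions, t, cognition_feature)
        actions = actions - pred_noise / steps                # crude Euler-style update
    return actions

# Example: one cognition feature -> a 16-step, 7-DoF action chunk.
denoiser = ActionDenoiser()
cognition = torch.randn(1, 4096)   # stand-in for the VLM's cognition token
chunk = sample_action_chunk(denoiser, cognition)
print(chunk.shape)                 # torch.Size([1, 16, 7])
```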
Quick Start & Requirements
Install with pip install -e . after cloning the repository and creating a conda environment with conda create --name cogact python=3.10.
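After installation, inference roughly follows the pattern below. This is a hedged sketch: the load_vla and predict_action entry points, the checkpoint id, and the unnorm_key value are assumptions about the repository's interface and may differ from the actual API.

```python
# Hedged sketch: load_vla, predict_action, the checkpoint id, and the
# unnorm_key below are assumptions and may differ from the real interface.
from PIL import Image
from vla import load_vla

model = load_vla(
    "CogACT/CogACT-Base",        # assumed pretrained checkpoint id
    load_for_training=False,
)

image = Image.open("observation.png")   # current camera frame
prompt = "pick up the red block and place it in the bowl"

# Expected to return a chunk of future actions (e.g., 7-DoF end-effector deltas).
actions, _ = model.predict_action(
    image,
    prompt,
    unnorm_key="fractal20220817_data",  # assumed dataset-statistics key
)
print(actions.shape)
```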
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The model's accuracy may decrease as the open-loop prediction horizon grows when action chunking is used to reach higher control frequencies. Training from scratch requires significant computational resources and data.
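The tradeoff shows up in a receding-horizon control loop: executing a longer prefix of each predicted chunk reduces how often the model is queried, but lengthens open-loop execution between fresh observations. A minimal sketch, with predict_chunk, execute_action, and get_observation as hypothetical placeholders:

```python
# Minimal sketch of receding-horizon execution with action chunking.
# predict_chunk, execute_action, and get_observation are hypothetical
# placeholders, not CogACT APIs.
def run_episode(predict_chunk, execute_action, get_observation,
                max_steps=200, execute_horizon=4):
    """Execute the first `execute_horizon` actions of each predicted chunk
    before replanning. Larger horizons raise control frequency per model
    call but lengthen open-loop execution, where accuracy can degrade."""
    steps = 0
    while steps < max_steps:
        obs = get_observation()          # fresh camera frame + proprioception
        chunk = predict_chunk(obs)       # e.g., 16 future actions
        for action in chunk[:execute_horizon]:
            execute_action(action)       # open-loop within the chunk
            steps += 1
            if steps >= max_steps:
                break
```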