microsoft/CogACT: Vision-language-action model for robotic manipulation
Top 77.1% on SourcePulse
CogACT is a foundational vision-language-action model designed for robotic manipulation, enabling robots to understand and execute complex tasks described through natural language. It targets researchers and developers in robotics and AI, offering a unified framework for integrating visual perception, language understanding, and motor control for enhanced robotic autonomy.
How It Works
CogACT generates actions with a diffusion transformer (DiT) conditioned on visual inputs and language prompts. The vision-language backbone condenses perception and instruction into a single cognition token, and the DiT denoises an entire multi-step action sequence conditioned on that token in one inference pass, unlike models that generate actions token by token. This cognition-token strategy is key to the model's efficiency and its ability to produce coherent multi-step actions.
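For intuition, here is a minimal sketch of that pattern in PyTorch. The module names, dimensions, and the toy denoising loop are illustrative assumptions, not CogACT's actual implementation.

```python
# Hedged sketch (not CogACT code): a DiT-style action head that denoises a whole
# action sequence conditioned on a single "cognition" embedding from a VLM.
import torch
import torch.nn as nn

class DiTActionHead(nn.Module):
    def __init__(self, action_dim=7, horizon=16, cond_dim=1024, d_model=256, n_layers=4):
        super().__init__()
        self.in_proj = nn.Linear(action_dim, d_model)
        self.cond_proj = nn.Linear(cond_dim, d_model)
        self.time_emb = nn.Embedding(1000, d_model)            # diffusion timestep embedding
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.out_proj = nn.Linear(d_model, action_dim)

    def forward(self, noisy_actions, t, cognition_token):
        # noisy_actions: (B, horizon, action_dim); t: (B,); cognition_token: (B, cond_dim)
        x = self.in_proj(noisy_actions)
        cond = (self.cond_proj(cognition_token) + self.time_emb(t)).unsqueeze(1)
        x = self.blocks(torch.cat([cond, x], dim=1))            # prepend conditioning token
        return self.out_proj(x[:, 1:])                          # predicted noise per step

# One pass refines every future action jointly, instead of decoding actions one token at a time.
head = DiTActionHead()
cognition = torch.randn(2, 1024)                                # VLM output (assumed shape)
actions = torch.randn(2, 16, 7)                                 # start from Gaussian noise
for t in reversed(range(0, 1000, 100)):                         # coarse illustrative schedule
    t_batch = torch.full((2,), t, dtype=torch.long)
    eps = head(actions, t_batch, cognition)
    actions = actions - 0.1 * eps                               # toy update, not a real sampler
```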
Quick Start & Requirements
pip install -e . (after git clone and conda create --name cogact python=3.10)
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
Accuracy may degrade as more of each predicted action chunk is executed open-loop, i.e., when action chunking is used to reach higher control frequencies without re-predicting at every step. Training from scratch requires substantial computational resources and data.
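As a rough illustration of that trade-off, the sketch below uses hypothetical helpers (predict_chunk, step_env); it is not CogACT's control loop, only the general chunking pattern.

```python
# Hedged sketch of the action-chunking trade-off described above (not CogACT code).
# The model predicts a chunk of future actions once, and the robot executes the first
# `execute_k` of them open-loop before re-planning. A larger execute_k raises the
# effective control frequency but lets errors accumulate between predictions.
def run_episode(predict_chunk, step_env, obs, instruction, steps=100, execute_k=4):
    for _ in range(0, steps, execute_k):
        chunk = predict_chunk(obs, instruction)   # e.g. 16 future actions from one pass
        for action in chunk[:execute_k]:          # open-loop execution of part of the chunk
            obs = step_env(action)                # no re-planning inside this inner loop
    return obs
```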