CogACT by Microsoft

Vision-language-action model for robotic manipulation

Created 8 months ago · 308 stars · Top 88.2% on sourcepulse

Project Summary

CogACT is a foundational vision-language-action model for robotic manipulation, enabling robots to understand and execute complex tasks described in natural language. It targets researchers and developers in robotics and AI, offering a unified framework that integrates visual perception, language understanding, and motor control for greater robotic autonomy.

How It Works

CogACT conditions a diffusion transformer (DiT) action module on visual inputs and language prompts: the vision-language backbone distills them into a cognition token whose feature conditions the DiT, which then generates an entire multi-step action sequence in a single inference pass rather than decoding actions token by token. This separation of cognition from action generation is key to the model's efficiency and its ability to produce coherent multi-step actions.
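
To make the flow concrete, here is a toy, assumption-laden sketch (not the CogACT implementation) of a DiT-style denoiser that refines a whole action chunk at once, conditioned on a single cognition embedding; the layer sizes, noise schedule, and update rule are placeholders for illustration.

    # Toy sketch (not CogACT's code): a cognition embedding conditions a small
    # transformer denoiser that refines an entire chunk of future actions at once.
    import torch
    import torch.nn as nn

    class ToyActionDiT(nn.Module):
        def __init__(self, action_dim=7, cond_dim=512, width=256):
            super().__init__()
            self.in_proj = nn.Linear(action_dim, width)
            self.cond_proj = nn.Linear(cond_dim, width)
            self.time_embed = nn.Embedding(1000, width)  # diffusion timestep embedding
            layer = nn.TransformerEncoderLayer(width, nhead=4, batch_first=True)
            self.blocks = nn.TransformerEncoder(layer, num_layers=2)
            self.out_proj = nn.Linear(width, action_dim)

        def forward(self, noisy_actions, t, cognition):
            # noisy_actions: (B, chunk_len, action_dim); cognition: (B, cond_dim)
            h = self.in_proj(noisy_actions)
            cond = (self.cond_proj(cognition) + self.time_embed(t)).unsqueeze(1)
            h = self.blocks(torch.cat([cond, h], dim=1))[:, 1:]  # prepend condition token
            return self.out_proj(h)  # predicted noise for every step in the chunk

    model = ToyActionDiT()
    cognition = torch.randn(1, 512)          # stand-in for the VLM's cognition feature
    actions = torch.randn(1, 16, 7)          # whole 16-step chunk, starting from noise
    for t in reversed(range(0, 1000, 100)):  # coarse 10-step schedule for illustration
        eps = model(actions, torch.tensor([t]), cognition)
        actions = actions - 0.1 * eps        # schematic update; real samplers use DDPM/DDIM
    print(actions.shape)                     # torch.Size([1, 16, 7])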

Quick Start & Requirements
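
Pretrained checkpoints (Small, Base, and Large) are published on Hugging Face, and the upstream README is the authoritative reference for setup. As an illustrative sketch, assuming the repo's documented load_vla helper and predict_action method (argument names may differ across versions):

    # Sketch based on the repository's documented usage; consult the README for
    # the exact API, environment setup, and checkpoint names.
    from PIL import Image
    from vla import load_vla  # provided by the CogACT repo

    model = load_vla(
        "CogACT/CogACT-Base",          # Small / Base / Large variants are released
        load_for_training=False,
        action_model_type="DiT-B",
        future_action_window_size=15,  # predict a chunk of future actions
    )
    model.to("cuda:0").eval()

    image = Image.open("observation.png")      # current camera frame
    prompt = "move the sponge near the apple"  # free-form language instruction

    actions, _ = model.predict_action(
        image,
        prompt,
        unnorm_key="bridge_orig",  # action un-normalization statistics per dataset
        cfg_scale=1.5,             # classifier-free guidance strength
        use_ddim=True,
        num_ddim_steps=10,
    )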

Highlighted Details

  • Offers Small, Base, and Large model variants.
  • Supports full fine-tuning for custom datasets (recommended over LoRA).
  • Integrates with the SIMPLER simulation environment for evaluation.
  • Provides deployment scripts for real-world robot integration.
  • Achieves ~181 ms inference time on an A6000 GPU (~5.5 Hz) with the Adaptive Action Ensemble (see the sketch after this list).
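
The Adaptive Action Ensemble blends the newest prediction for a control step with the predictions made for that same step in earlier, overlapping chunks, weighting each by how closely it agrees with the newest one. Below is a schematic NumPy reading of that idea, under the assumption of exponential cosine-similarity weights; it is not the repository's implementation.

    # Schematic adaptive action ensemble (assumption-based, not CogACT's code):
    # older predictions for the current step are down-weighted when they
    # disagree with the newest prediction.
    import numpy as np

    def adaptive_ensemble(predictions, alpha=0.1):
        """predictions: action vectors for the same timestep, newest last."""
        current = predictions[-1]
        sims = np.array([
            np.dot(p, current) / (np.linalg.norm(p) * np.linalg.norm(current) + 1e-8)
            for p in predictions
        ])
        weights = np.exp(alpha * sims)
        weights /= weights.sum()
        return weights @ np.stack(predictions)

    # Three overlapping chunk predictions for the same control step:
    history = [np.array([0.10, 0.02, -0.05]),
               np.array([0.12, 0.01, -0.04]),
               np.array([0.11, 0.03, -0.06])]
    print(adaptive_ensemble(history))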

Maintenance & Community

  • Developed by Microsoft.
  • Welcomes contributions via pull requests with a Contributor License Agreement (CLA).
  • Follows the Microsoft Open Source Code of Conduct.

Licensing & Compatibility

  • MIT License.
  • Permissive for commercial use and integration with closed-source projects.

Limitations & Caveats

Accuracy may degrade as the open-loop prediction horizon grows when action chunking is used to reach higher control frequencies. Training from scratch requires substantial compute and data.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1 week
  • Pull requests (30d): 0
  • Issues (30d): 5
  • Star history: 66 stars in the last 90 days
