Instruct2Act by OpenGVLab

Robotics framework that maps instructions to actions using LLMs

Created 2 years ago
369 stars

Top 76.5% on SourcePulse

Project Summary

Instruct2Act is a framework designed for robotic manipulation tasks, enabling robots to understand and execute complex, multi-modal instructions. It targets researchers and developers in robotics and AI, offering a zero-shot approach that leverages large language models (LLMs) to translate natural language and visual cues into executable Python programs for robot control.

How It Works

Instruct2Act utilizes LLMs to generate Python code that orchestrates a perception-planning-action loop. The perception stage integrates foundation models like Segment Anything Model (SAM) for object localization and CLIP for classification. This modular design allows for flexibility in instruction modalities and task requirements, translating high-level commands into precise robotic actions by combining LLM reasoning with specialized foundation models.
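
A minimal sketch of this code-as-policy pattern is shown below. It is illustrative only: the primitives locate, classify, and pick_and_place, the prompt text, and the model name are assumptions made for the example, not Instruct2Act's actual API, which calls its own SAM/CLIP-backed functions inside VIMABench.

```python
# Hypothetical sketch: an LLM turns a language instruction into Python code
# that calls a small, documented set of perception/action primitives.
# All names below are illustrative, not Instruct2Act's real API.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def locate(name):
    """Stub for SAM-based object localization; returns an (x, y) position."""
    return (0.0, 0.0)


def classify(position, labels):
    """Stub for CLIP-based labeling of the object at a position."""
    return labels[0]


def pick_and_place(src, dst):
    """Stub for the low-level robot motion primitive."""
    print(f"pick at {src} -> place at {dst}")


API_DOC = """
locate(name: str) -> (x, y)              # SAM-backed object localization
classify(position, labels: list) -> str  # CLIP-backed object labeling
pick_and_place(src, dst) -> None         # robot motion primitive
"""


def generate_policy(instruction):
    """Ask the LLM to translate an instruction into code over the primitives."""
    prompt = (
        "You control a tabletop robot through these Python functions:\n"
        f"{API_DOC}\n"
        f"Write Python code that accomplishes: {instruction!r}\n"
        "Return only code, no explanations."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content


code = generate_policy("put the red block into the green bowl")
# A real system would strip any markdown wrapping and validate the generated
# code before executing it against the robot primitives.
exec(code, {"locate": locate, "classify": classify, "pick_and_place": pick_and_place})
```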

Quick Start & Requirements

  • Installation: Install the required packages from the provided environment.yaml, then install VIMABench separately.
  • Prerequisites: Download SAM and OpenCLIP model checkpoints; an OpenAI API key is needed for code generation. Configure the CUDA device option so SAM inference runs at a usable speed (see the sketch after this list).
  • Running: Execute robotic_anything_gpt_online.py.
  • Resources: Official documentation and VIMABench instructions are available.
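
The checkpoint and device setup might look roughly like the following; the checkpoint file name, model variants, and directory layout are assumptions for illustration, not values required by the repository.

```python
# Illustrative setup for the SAM and OpenCLIP prerequisites; file names and
# model variants below are assumptions, not the repository's required values.
import torch
import open_clip
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

# SAM is slow on CPU, so point it at a CUDA device when one is available.
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Segment Anything checkpoint, used for object localization.
sam = sam_model_registry["vit_h"](checkpoint="checkpoints/sam_vit_h_4b8939.pth")
sam.to(device=device)
mask_generator = SamAutomaticMaskGenerator(sam)

# OpenCLIP model, used to label the segmented regions.
clip_model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
clip_model = clip_model.to(device).eval()
tokenizer = open_clip.get_tokenizer("ViT-B-32")
```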

Highlighted Details

  • Zero-shot performance surpasses state-of-the-art learning-based policies on several tasks.
  • Supports task-specific and task-agnostic prompts, including pointing-language enhanced prompts (see the sketch after this list).
  • Offers both offline and online code generation modes for robotic manipulation.
  • Evaluated on six representative tabletop manipulation tasks from VIMABench.
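
As a rough illustration of the two prompt modalities, a pointing-language enhanced prompt pairs a templated instruction with pixel locations supplied by the user; the field names and coordinates below are invented for the example, not the framework's actual schema.

```python
# Invented example of the two instruction styles; not Instruct2Act's schema.

# Pure language instruction.
language_only = "Put the polka-dot block into the purple bowl."

# Pointing-language enhanced instruction: placeholders in the text are bound
# to pixel coordinates the user pointed at in the workspace image.
pointing_enhanced = {
    "instruction": "Put <obj1> into <obj2>.",
    "points": {
        "<obj1>": (212, 148),  # clicked location of the object to move
        "<obj2>": (405, 290),  # clicked location of the target container
    },
}
```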

Maintenance & Community

The project has seen recent updates, including the release of ManipVQA and A3VLM, with checkpoints available. Real-world demo videos are shared on YouTube and Bilibili. The project acknowledges contributions from VIMABench, OpenCLIP, and SAM.

Licensing & Compatibility

The README does not explicitly state the license type or any restrictions for commercial use or closed-source linking.

Limitations & Caveats

Code generation relies on the OpenAI API, so network instability can degrade the quality of the generated code. The original VIMABench movements are fast, so users may need to add delays to visualize them clearly. SAM inference can be slow unless the CUDA device option is configured manually.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 4 stars in the last 30 days

Explore Similar Projects

Starred by Andrew Ng (Founder of DeepLearning.AI; Cofounder of Coursera; Professor at Stanford), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 2 more.

vision-agent by landing-ai

Visual AI agent for generating runnable vision code from image/video prompts

Top 0.1% · 5k stars
Created 1 year ago · Updated 2 weeks ago