JARVIS-1 by CraftJarvis

Minecraft agent for open-world multi-tasking with multimodal input

created 1 year ago
373 stars

Top 77.1% on sourcepulse

Project Summary

JARVIS-1 is an open-world agent designed for complex, long-horizon tasks in the Minecraft environment. It targets researchers and developers in embodied AI and reinforcement learning. The agent perceives multimodal inputs, generates multi-step plans, and executes embodied control, with the aim of human-like planning and interaction.

How It Works

JARVIS-1 leverages pre-trained multimodal language models to map visual observations and textual instructions into actionable plans. These plans are then executed by goal-conditioned controllers. A key innovation is its multimodal memory, which integrates both pre-trained knowledge and in-game experiences to enhance planning capabilities. This approach allows for sophisticated task decomposition and execution in a dynamic, open-world setting.
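The plan, retrieve, and execute loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the project's code: `MultimodalMemory`, `MemoryEntry`, `plan`, and `run_episode` are hypothetical names, retrieval uses plain word overlap where a real system would compare text and image embeddings, and the language-model planner is replaced by a trivial fallback.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryEntry:
    task: str          # textual task description
    plan: list[str]    # ordered sub-goals that were attempted
    succeeded: bool    # whether the controller completed every sub-goal


@dataclass
class MultimodalMemory:
    """Toy stand-in for a multimodal memory (text only, hypothetical)."""
    entries: list[MemoryEntry] = field(default_factory=list)

    def add(self, entry: MemoryEntry) -> None:
        self.entries.append(entry)

    def retrieve(self, task: str, k: int = 1) -> list[MemoryEntry]:
        # Score successful past episodes by word overlap with the query task.
        query = set(task.lower().split())
        scored = [
            (len(query & set(e.task.lower().split())), e)
            for e in self.entries
            if e.succeeded
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [e for score, e in scored[:k] if score > 0]


def plan(task: str, observation: str, memory: MultimodalMemory) -> list[str]:
    """Hypothetical planner: reuse a retrieved plan, else a one-step plan.

    `observation` is unused in this toy version; a real planner would
    condition a multimodal language model on it.
    """
    hits = memory.retrieve(task)
    if hits:
        return list(hits[0].plan)
    return [task]


def run_episode(task, observation, memory, controller) -> bool:
    """Plan, execute each sub-goal with the controller, store the outcome."""
    sub_goals = plan(task, observation, memory)
    ok = all(controller(goal) for goal in sub_goals)
    memory.add(MemoryEntry(task, sub_goals, ok))
    return ok
```

Here `controller` is any callable that takes a sub-goal string and returns `True` on success, standing in for a goal-conditioned controller. Writing the outcome back into memory after each episode is what lets later plans draw on in-game experience.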

Quick Start & Requirements

  • Install: pip install -e . (after environment setup)
  • Prerequisites: Linux only, Python 3.10, JDK 8; Anaconda recommended. Download the STEVE-1 weights and set the environment variables TMPDIR=/tmp and OPENAI_API_KEY.
  • Setup: Run prepare_mcp.py to build MCP-Reborn.
  • Docs: Website, Paper
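Before installing, the prerequisites listed above can be checked with a short script. The following is a sketch using only the Python standard library; `preflight` is a hypothetical helper, not part of the repository. It only confirms that some `java` executable is on the PATH and does not verify that it is JDK 8.

```python
import os
import platform
import shutil
import sys


def preflight(env=None):
    """Return a list of human-readable problems with the local setup."""
    env = os.environ if env is None else env
    problems = []

    if platform.system() != "Linux":
        problems.append("Linux is required")
    if sys.version_info[:2] != (3, 10):
        problems.append("Python 3.10 is required")
    if shutil.which("java") is None:
        problems.append("no java executable found on PATH (JDK 8 is required)")
    if env.get("TMPDIR") != "/tmp":
        problems.append("TMPDIR should be set to /tmp")
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")

    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("WARN:", problem)
```

Passing a plain dictionary as `env` makes the environment-variable checks easy to exercise without touching the real environment.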

Highlighted Details

  • Capable of over 200 tasks in Minecraft, ranging from short-horizon (chopping trees) to long-horizon (obtaining a diamond pickaxe).
  • Reported to be 5x more reliable than prior state-of-the-art agents on the long-horizon "ObtainDiamondPickaxe" task.
  • Utilizes a memory-augmented multimodal language model for planning.
  • Built upon STEVE-1 (VPT model) and integrates with MineDojo and MC-TextWorld.
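To see why obtaining a diamond pickaxe counts as long-horizon, it helps to lay out the chain of prerequisites. The sketch below orders sub-goals with a topological sort from the standard library. The dependency table is a simplified version of the vanilla Minecraft tech tree written for illustration (it ignores quantities and furnace fuel, for example) and is an assumption, not output from the JARVIS-1 planner; `PREREQS` and `sub_goal_order` are hypothetical names.

```python
from graphlib import TopologicalSorter

# item -> items that must be obtained first (simplified, illustrative only)
PREREQS = {
    "planks": {"log"},
    "stick": {"planks"},
    "crafting_table": {"planks"},
    "wooden_pickaxe": {"planks", "stick", "crafting_table"},
    "cobblestone": {"wooden_pickaxe"},
    "stone_pickaxe": {"cobblestone", "stick", "crafting_table"},
    "furnace": {"cobblestone", "crafting_table"},
    "iron_ore": {"stone_pickaxe"},
    "iron_ingot": {"iron_ore", "furnace"},
    "iron_pickaxe": {"iron_ingot", "stick", "crafting_table"},
    "diamond": {"iron_pickaxe"},
    "diamond_pickaxe": {"diamond", "stick", "crafting_table"},
}


def sub_goal_order(target: str) -> list[str]:
    """Return every item needed for `target`, prerequisites first."""
    needed = {}
    stack = [target]
    while stack:
        item = stack.pop()
        if item in needed:
            continue
        # Items without an entry (e.g. "log") have no prerequisites.
        needed[item] = PREREQS.get(item, set())
        stack.extend(needed[item])
    # TopologicalSorter treats the mapped values as predecessors.
    return list(TopologicalSorter(needed).static_order())


if __name__ == "__main__":
    print(" -> ".join(sub_goal_order("diamond_pickaxe")))
```

Even in this simplified form the target sits behind a dozen intermediate items, each of which requires many low-level control steps in the game.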

Maintenance & Community

  • Project associated with Zihao Wang, Shaofei Cai, and others.
  • Updates are posted on Twitter.

Licensing & Compatibility

  • License details are not explicitly stated in the README.

Limitations & Caveats

The current release only includes offline evaluation code and a partial multimodal memory. Key components like the multimodal descriptor, multimodal retrieval, and online learning capabilities are marked as "Coming Soon" or not yet released, limiting the agent's full functionality.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 13 stars in the last 90 days
