Motion-language model for generating human motion and text descriptions
MotionGPT is a unified motion-language generation model designed for researchers and developers working with human motion data. It addresses the challenge of modeling and generating both human motion and natural language descriptions within a single framework, enabling tasks like text-to-motion generation, motion captioning, and motion prediction.
How It Works
MotionGPT treats human motion as a form of language by discretizing 3D motion into "motion tokens" using vector quantization. This "motion vocabulary" is then used in conjunction with text tokens for language modeling. The model leverages a T5 encoder-decoder architecture, pre-trained on a mixture of motion-language data and fine-tuned on prompt-based question-and-answer tasks. This approach allows it to capture semantic couplings between motion and language, benefiting from the generative capabilities of large language models.
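The discretization step above can be illustrated with a minimal nearest-neighbor vector-quantization sketch. The codebook here is random and the function names (`motion_to_tokens`) are illustrative assumptions, not MotionGPT's actual API; in the real model the codebook is learned by a VQ-VAE over motion features.

```python
import numpy as np

rng = np.random.default_rng(0)
K, D = 512, 64                      # codebook size ("motion vocabulary"), feature dim
codebook = rng.normal(size=(K, D))  # stand-in for a learned motion codebook

def motion_to_tokens(features):
    """Map per-frame motion features (T, D) to discrete motion-token ids (T,)."""
    # Euclidean distance from each frame to every codebook entry,
    # then pick the index of the nearest entry.
    dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=-1)
    return dists.argmin(axis=1)

motion = rng.normal(size=(16, D))   # 16 frames of encoded motion
tokens = motion_to_tokens(motion)   # integer ids usable alongside text tokens
```

The resulting ids can be appended to a language model's vocabulary, which is what lets a T5-style model treat motion and text uniformly.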
Quick Start & Requirements
Install dependencies with:
pip install -r requirements.txt
Then download the necessary models and data using the provided bash scripts: prepare/download_smpl_model.sh, prepare/prepare_t5.sh, and prepare/download_pretrained_models.sh. Launch the interactive demo with python app.py; batch processing is available via python demo.py.
Maintenance & Community
The paper was published at NeurIPS 2023. Links to HuggingFace demos and the arXiv paper are provided. Further community interaction channels are not explicitly mentioned in the README.
Licensing & Compatibility
The code is distributed under an MIT License. However, it depends on libraries and datasets (SMPL, SMPL-X, PyTorch3D) which have their own licenses that must also be followed. Commercial use may be restricted by these underlying licenses.
Limitations & Caveats
MotionGPT struggles with generating unseen motions (e.g., gymnastics) even if it understands the text. The model's performance is limited by the size of available motion datasets (HumanML3D, KIT), which are significantly smaller than typical language datasets. VQ-based methods are less suitable for fine-grained body part editing compared to diffusion models.