PyTorch library for multimodal multi-task model training
TorchMultimodal is a PyTorch library for training state-of-the-art multimodal, multi-task models at scale. It provides modular building blocks, pre-trained model classes, and example scripts for researchers and engineers working on content understanding and generative multimodal tasks. The library aims to make it easy to reproduce state-of-the-art models and to serve as a foundation for future research.
How It Works
The library is structured around composable components, including fusion layers, loss functions, datasets, and utilities. These are used to construct common multimodal architectures like ALBEF, BLIP-2, CLIP, CoCa, DALL-E 2, FLAVA, MAE, and MDETR. The modular design allows users to mix and match components to build custom models or adapt existing ones, leveraging the broader PyTorch ecosystem.
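To make the composition idea concrete, the following is a minimal plain-PyTorch sketch of the two-tower, CLIP-style pattern that such components implement: separate image and text encoders, projection into a shared embedding space, and a contrastive loss over matched pairs. This is not TorchMultimodal's own API; all class and function names below are illustrative stand-ins, and in the library itself these pieces would be replaced by its pre-built encoders, fusion layers, and loss modules.

# Illustrative sketch of the "compose encoders + fusion + loss" pattern in plain PyTorch.
# Names here are hypothetical and do not come from TorchMultimodal's API.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTowerModel(nn.Module):
    """CLIP-style late fusion: separate encoders projected into a shared space."""

    def __init__(self, vocab_size: int = 1000, embed_dim: int = 128, proj_dim: int = 64):
        super().__init__()
        # Image tower: a tiny CNN as a stand-in for a vision backbone.
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(16, embed_dim),
        )
        # Text tower: token embedding + mean pooling as a stand-in for a text transformer.
        self.text_embedding = nn.Embedding(vocab_size, embed_dim)
        # Projections into the shared multimodal embedding space.
        self.image_proj = nn.Linear(embed_dim, proj_dim)
        self.text_proj = nn.Linear(embed_dim, proj_dim)

    def forward(self, images: torch.Tensor, token_ids: torch.Tensor):
        img = F.normalize(self.image_proj(self.image_encoder(images)), dim=-1)
        txt = F.normalize(self.text_proj(self.text_embedding(token_ids).mean(dim=1)), dim=-1)
        return img, txt

def contrastive_loss(img: torch.Tensor, txt: torch.Tensor, temperature: float = 0.07):
    # Symmetric InfoNCE-style loss over matched (image, text) pairs in the batch.
    logits = img @ txt.t() / temperature
    targets = torch.arange(img.size(0))
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

if __name__ == "__main__":
    model = TwoTowerModel()
    images = torch.randn(4, 3, 32, 32)           # batch of dummy images
    token_ids = torch.randint(0, 1000, (4, 16))   # batch of dummy token ids
    img_emb, txt_emb = model(images, token_ids)
    print("loss:", contrastive_loss(img_emb, txt_emb).item())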
Quick Start & Requirements
pip install torchmultimodal-nightly
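After installing, a quick sanity check is to import the package alongside PyTorch. This minimal sketch assumes only the import name torchmultimodal (the name used by the PyPI distribution) and will raise an ImportError if the install failed.

# Post-install sanity check: verify that PyTorch and TorchMultimodal import cleanly.
import torch
import torchmultimodal  # noqa: F401

print(f"PyTorch {torch.__version__} is available; torchmultimodal imported successfully.")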
The PyPI package is available for Linux only; building from source is also supported.
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The library is currently a Beta Release, so APIs and components may still change between versions. PyPI installation is limited to Linux platforms.