OFA by OFA-Sys

Unified sequence-to-sequence model for cross-modality, vision, and language tasks

created 3 years ago
2,509 stars

Top 19.0% on sourcepulse

Project Summary

OFA (One For All) is a unified sequence-to-sequence pretrained model for multimodal and language tasks. It targets researchers and practitioners who want a single framework for tasks such as image captioning, visual question answering, text-to-image generation, and text classification.

How It Works

OFA uses a single sequence-to-sequence architecture and casts every task as a sequence generation problem. Inputs from different modalities (image, text) and tasks are mapped into a common sequence format, so one pretrained model can be fine-tuned or prompt-tuned for a wide range of downstream applications.
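The unification idea can be illustrated with a minimal Python sketch. It is not OFA's actual code: the `Seq2SeqExample` class, the `to_seq2seq` helper, the image file names, and the instruction templates are all hypothetical, chosen only to show how different tasks can share one (optional image, source text) to target text format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Seq2SeqExample:
    """One example in a shared (optional image, source text) -> target text form."""
    source_text: str
    target_text: str
    image: Optional[str] = None  # hypothetical reference to an image input, if any


# Hypothetical instruction templates; the wording is illustrative only.
TEMPLATES = {
    "caption": "what does the image describe?",
    "vqa": "{question}",
    "text_classification": 'is the sentiment of the text "{text}" positive or negative?',
}


def to_seq2seq(task: str, target: str, image: Optional[str] = None, **fields: str) -> Seq2SeqExample:
    """Cast a task-specific record into the shared sequence-to-sequence format."""
    if task not in TEMPLATES:
        raise ValueError(f"unknown task: {task}")
    source = TEMPLATES[task].format(**fields)
    return Seq2SeqExample(source_text=source, target_text=target, image=image)


examples = [
    to_seq2seq("caption", target="a dog runs on the beach", image="dog.jpg"),
    to_seq2seq("vqa", target="two", image="cats.jpg", question="how many cats are there?"),
    to_seq2seq("text_classification", target="positive", text="a delightful film"),
]

for ex in examples:
    print(ex)
```

Because every record ends up with the same fields, one model and one training loop can consume all three tasks; only the instruction text and the target differ.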

Quick Start & Requirements

  • Installation: run git clone https://github.com/OFA-Sys/OFA, then pip install -r requirements.txt from the repository root.
  • Prerequisites: Python 3.7.4, PyTorch 1.8.1, Torchvision 0.9.1. Java 1.8 is required for COCO evaluation.
  • Resources: Checkpoints range from 33M (Tiny) to 930M (Huge) parameters. Datasets can be substantial (e.g., VQA data is ~135GB after decompression).
  • Demos & Docs: Online demos are available on Hugging Face Spaces and ModelScope. Colab notebooks are provided for guided procedures. See checkpoints.md for model details.
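For a rough sense of what the parameter counts under Resources mean on disk or in memory, the following Python sketch multiplies parameter count by bytes per parameter. It is a back-of-the-envelope estimate under stated assumptions: weights only, 4 bytes per parameter for fp32 and 2 for fp16, decimal gigabytes. Real checkpoint files may be larger (for example if they also store optimizer state), and training needs considerably more memory than the weights alone.

```python
# Assumed storage cost per parameter for two common precisions.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}

# Parameter counts for the smallest and largest checkpoints, as listed above.
PARAM_COUNTS = {"Tiny": 33_000_000, "Huge": 930_000_000}


def weight_size_gb(num_params: int, dtype: str = "fp32") -> float:
    """Estimate the size of the raw weights in decimal gigabytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for name, num_params in PARAM_COUNTS.items():
    for dtype in BYTES_PER_PARAM:
        print(f"{name:>4} {dtype}: {weight_size_gb(num_params, dtype):.2f} GB")
```

Under these assumptions the Huge weights come to a few gigabytes in fp32, which is small next to the ~135GB VQA dataset.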

Highlighted Details

  • Reports state-of-the-art results on image captioning (1st on the MSCOCO Leaderboard) and strong performance on VQA, visual grounding, and text-to-image generation.
  • Supports both fine-tuning and prompt tuning for adapting the model to new tasks.
  • Offers multiple model sizes (Tiny, Medium, Base, Large, Huge) to balance performance and resource requirements.
  • Extensions include OFA-OCR for Chinese text recognition and MMSpeech for automatic speech recognition (ASR).

Maintenance & Community

The README lists updates through 2023. The project welcomes contributions via issues and pull requests, and contact information for the developers is provided.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking would require clarification of the licensing terms.

Limitations & Caveats

Images are stored as base64 strings, so image files must be converted when preparing data. The large datasets and checkpoints impose significant storage and compute requirements. The README notes that CIDEr optimization can be unstable and needs careful hyperparameter tuning.
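The base64 conversion mentioned above can be sketched with Python's standard library alone. This is a minimal illustration, not the repository's own preprocessing script: the helper names `encode_image_bytes` and `decode_image_string` are hypothetical, and the 8-byte PNG signature stands in for the contents of a real image file.

```python
import base64


def encode_image_bytes(raw: bytes) -> str:
    """Encode raw image file bytes as an ASCII base64 string."""
    return base64.b64encode(raw).decode("ascii")


def decode_image_string(encoded: str) -> bytes:
    """Decode a base64 string back into the original image file bytes."""
    return base64.b64decode(encoded)


# Stand-in for real file contents: the 8-byte PNG file signature.
raw = b"\x89PNG\r\n\x1a\n"

encoded = encode_image_bytes(raw)
restored = decode_image_string(encoded)

# The round trip is lossless: decoding returns exactly the original bytes.
assert restored == raw
print(encoded)
```

For a real file, `raw` would come from reading the file in binary mode, and the decoded bytes could be wrapped in `io.BytesIO` and passed to an image library such as Pillow's `Image.open`.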

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

21 stars in the last 90 days
