OFA by OFA-Sys

Unified sequence-to-sequence model for cross-modality, vision, and language tasks

Created 4 years ago

2,552 stars

Top 18.2% on SourcePulse

Project Summary

OFA (One For All) is a unified sequence-to-sequence pretrained model that handles a range of cross-modal, vision, and language tasks. It targets researchers and practitioners who want a single framework for tasks such as image captioning, visual question answering, text-to-image generation, and text classification.

How It Works

OFA uses a single sequence-to-sequence architecture and casts every task as a sequence generation problem. Inputs from different modalities (image, text) and different tasks are mapped into a common sequence format, so one pretrained model can be fine-tuned or prompt-tuned for many downstream applications instead of maintaining a separate model per task.
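
To make the idea of a common sequence format concrete, here is a minimal sketch of how several tasks could be reduced to the same (instruction, optional image, target) shape. The template wording, the `Seq2SeqExample` class, and the `build_example` helper are all hypothetical and are not taken from the OFA codebase; the actual instructions and preprocessing used by the project may differ.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical instruction templates, for illustration only.
# The wording OFA actually uses may differ.
TEMPLATES = {
    "caption": "what does the image describe?",
    "vqa": "{question}",
    "grounding": 'which region does the text "{text}" describe?',
    "text_classification": 'is the sentiment of text "{text}" positive or negative?',
}


@dataclass
class Seq2SeqExample:
    task: str
    instruction: str           # text sequence given to the encoder
    image_b64: Optional[str]   # base64 image string, or None for text-only tasks
    target: str                # sequence the decoder is trained to generate


def build_example(task: str, target: str, image_b64: Optional[str] = None, **fields: str) -> Seq2SeqExample:
    """Fill the task's template so every task ends up in the same format."""
    instruction = TEMPLATES[task].format(**fields)
    return Seq2SeqExample(task, instruction, image_b64, target)


# Three different tasks, one shared representation.
examples = [
    build_example("caption", "a dog running on the beach", image_b64="<base64 image>"),
    build_example("vqa", "two", image_b64="<base64 image>", question="how many dogs are there?"),
    build_example("text_classification", "positive", text="a wonderful film"),
]
for ex in examples:
    print(ex.task, "|", ex.instruction, "->", ex.target)
```

Because every example has the same shape, a single encoder-decoder model can be trained on all of them with one loss, which is the core of the unification described above.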

Quick Start & Requirements

Installation: run git clone https://github.com/OFA-Sys/OFA, then run pip install -r requirements.txt from inside the cloned directory.
Prerequisites: Python 3.7.4, PyTorch 1.8.1, Torchvision 0.9.1. Java 1.8 is required for COCO evaluation.
Resources: Checkpoints range from 33M (Tiny) to 930M (Huge) parameters. Datasets can be substantial (e.g., VQA data is ~135GB after decompression).
Demos & Docs: Online demos are available on Hugging Face Spaces and ModelScope. Colab notebooks are provided for guided procedures. See checkpoints.md for model details.
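
As a rough guide to what the parameter counts above imply, the sketch below estimates the size of the raw weights alone. It assumes 4 bytes per parameter for fp32 and 2 bytes for fp16; the `weight_size_gb` helper is illustrative, and actual checkpoint files may be larger if they also store optimizer state or other metadata.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}


def weight_size_gb(num_params: float, dtype: str = "fp32") -> float:
    """Approximate size of the raw model weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Parameter counts for the smallest and largest checkpoints listed above.
MODEL_PARAMS = {"Tiny": 33e6, "Huge": 930e6}

for name, params in MODEL_PARAMS.items():
    fp32 = weight_size_gb(params, "fp32")
    fp16 = weight_size_gb(params, "fp16")
    print(f"{name}: ~{fp32:.2f} GB in fp32, ~{fp16:.2f} GB in fp16")
```

These figures cover weights only. Training also needs memory for gradients, optimizer state, and activations, so the practical requirement is several times higher.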

Highlighted Details

Reports state-of-the-art results on image captioning (ranked 1st on the MSCOCO Leaderboard at the time the README was written) and strong performance on VQA, visual grounding, and text-to-image generation.
Supports both fine-tuning and prompt tuning for adapting the model to new tasks.
Offers multiple model sizes (Tiny, Medium, Base, Large, Huge) to balance performance and resource requirements.
Extended capabilities include OFA-OCR for Chinese text recognition and MMSpeech for automatic speech recognition (ASR).

Maintenance & Community

The README lists updates through 2023, although the Health Check below reports the repository as currently inactive. Contributions are welcome via issues and pull requests, and the README provides contact information for the developers.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking would require clarification of the licensing terms.

Limitations & Caveats

Images are stored as base64 strings, so data must be converted to and from that encoding during processing. The datasets and checkpoints are large and demand substantial storage and compute. The README also notes that CIDEr optimization can be unstable and requires careful hyperparameter tuning.
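
The base64 conversion mentioned above can be sketched with the Python standard library alone. The helper names are hypothetical and the sample bytes are a stand-in for a real image file; how OFA's own data scripts perform this step is not shown here.

```python
import base64


def image_bytes_to_b64(raw: bytes) -> str:
    """Encode raw image file bytes as a base64 text string."""
    return base64.b64encode(raw).decode("utf-8")


def b64_to_image_bytes(encoded: str) -> bytes:
    """Decode a base64 string back into the original image file bytes."""
    return base64.b64decode(encoded)


# Stand-in payload: the 8-byte PNG signature followed by placeholder data.
# In practice this would come from open("some_image.png", "rb").read().
fake_image = b"\x89PNG\r\n\x1a\n" + b"placeholder pixel data"

encoded = image_bytes_to_b64(fake_image)
decoded = b64_to_image_bytes(encoded)

print(len(fake_image), "raw bytes ->", len(encoded), "base64 characters")
print("round trip ok:", decoded == fake_image)
```

Base64 inflates the data by roughly one third, which contributes to the dataset sizes noted earlier. The decoded bytes can then be handed to an image library; with Pillow, for example, that is typically `Image.open(io.BytesIO(decoded))`, assuming Pillow is installed.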

Health Check

Last Commit: 1 year ago

Responsiveness: Inactive

Star History

10 stars in the last 30 days