Oscar by microsoft

Vision-language pre-training research paper

Created 5 years ago
1,049 stars

Top 35.9% on SourcePulse

Project Summary

This repository provides code and pre-trained models for Oscar and VinVL, two pre-training methods for vision-language tasks. It is aimed at researchers and practitioners in NLP and computer vision, and it reports state-of-the-art results on tasks such as image captioning and visual question answering.

How It Works

Oscar uses object tags detected in images as anchor points to ease image-text alignment during pre-training. VinVL, a follow-up to Oscar, revisits the visual representation and supplies improved object-attribute detection, which raises performance on downstream vision-language tasks. Because the tags are plain text that often also appears in the paired caption, this object-centric design simplifies cross-modal learning.
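The anchoring idea can be illustrated with a minimal sketch. It is not taken from the repository: the `OscarStyleInput` container, the `build_input` and `anchor_words` helpers, the whitespace tokenization, and the segment-id layout are all assumptions made for illustration. The sketch only shows how a caption, a list of detected object tags, and per-region features might be packed into one input, and which tags overlap with caption words.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class OscarStyleInput:
    """Hypothetical container for one (words, tags, regions) triple."""
    tokens: List[str]                   # [CLS] caption [SEP] object tags [SEP]
    segment_ids: List[int]              # assumed layout: 0 = caption, 1 = tags
    region_features: List[List[float]]  # one feature vector per detected region


def build_input(caption: str, object_tags: List[str],
                region_features: List[List[float]]) -> OscarStyleInput:
    # Whitespace splitting stands in for a real subword tokenizer.
    words = caption.lower().split()
    tags = [t.lower() for t in object_tags]
    tokens = ["[CLS]"] + words + ["[SEP]"] + tags + ["[SEP]"]
    # [CLS], the caption words, and the first [SEP] get segment 0;
    # the tags and the final [SEP] get segment 1.
    segment_ids = [0] * (len(words) + 2) + [1] * (len(tags) + 1)
    return OscarStyleInput(tokens, segment_ids, region_features)


def anchor_words(caption: str, object_tags: List[str]) -> List[str]:
    # Tags that also occur in the caption are the anchor points that
    # tie caption words to image regions.
    words = set(caption.lower().split())
    return [t.lower() for t in object_tags if t.lower() in words]


example = build_input(
    "a dog sits on a couch",
    ["dog", "couch", "pillow"],
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],  # toy 2-d vectors, not detector output
)
anchors = anchor_words("a dog sits on a couch", ["dog", "couch", "pillow"])
```

Here the tags "dog" and "couch" appear in both the caption and the tag list, so they act as anchors, while "pillow" is detected in the image but never mentioned in the text.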

Quick Start & Requirements

Installation instructions are available in INSTALL.md. Pre-trained models, datasets, and VinVL image features can be found in VinVL_DOWNLOAD.md and DOWNLOAD.md. Scripts for downstream finetuning are in MODEL_ZOO.md and VinVL_MODEL_ZOO.md.

Highlighted Details

  • Achieved state-of-the-art performance on seven vision-language tasks with VinVL.
  • Oscar was pre-trained on a public corpus of 6.5 million text-image pairs.
  • Released finetuned models and pre-trained checkpoints.
  • Code for Oscar+ pretraining and VinVL feature extraction is available.

Maintenance & Community

The project is associated with Microsoft Research. The README's update notes also mention LLaVA (visual instruction tuning with GPT-4).

Licensing & Compatibility

Oscar is released under the MIT license, permitting commercial use and integration with closed-source projects.

Limitations & Caveats

The README announces the release of Oscar+ pretraining code and VinVL features, but it gives little detail on dependencies or setup complexity beyond the general installation instructions.

Health Check

  • Last commit: 2 years ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 1 star in the last 30 days
