This repository serves as a central hub for Microsoft's foundational AI research, focusing on large-scale self-supervised pre-training across diverse tasks, languages, and modalities. It offers a comprehensive collection of models and architectures for NLP, computer vision, speech, and multimodal AI, targeting researchers and developers building advanced AI systems.
How It Works
The project's core strength lies in its "Big Convergence" philosophy, unifying pre-training methodologies across text, vision, speech, and their combinations. It leverages novel architectures like RetNet and BitNet for improved efficiency and scalability, and explores multimodal grounding with models like Kosmos-2.5. This unified approach aims for greater generality and capability in foundation models.
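To make the efficiency claim concrete, below is a minimal, paper-level sketch of the recurrent retention update that RetNet is built around: a fixed-decay state is accumulated token by token and read out with the query, giving constant-memory inference. The decay value and tensor shapes here are illustrative assumptions, not the repository's actual implementation.

```python
import torch

def retention_recurrent(q, k, v, gamma=0.9):
    """Single-head recurrent retention (RetNet-style), simplified sketch.

    q, k, v: (seq_len, d) tensors for one sequence and one head.
    gamma:   scalar decay; the real model uses fixed per-head decays.
    Returns the (seq_len, d) retention outputs.
    """
    seq_len, d = q.shape
    state = torch.zeros(d, d)                      # recurrent state S_n
    outputs = []
    for n in range(seq_len):
        # S_n = gamma * S_{n-1} + k_n^T v_n  (outer-product state update)
        state = gamma * state + torch.outer(k[n], v[n])
        # o_n = q_n S_n  (read out against the accumulated state)
        outputs.append(q[n] @ state)
    return torch.stack(outputs)

# Toy usage with random tensors.
q = k = v = torch.randn(8, 16)
print(retention_recurrent(q, k, v).shape)          # torch.Size([8, 16])
```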
Quick Start & Requirements
- Installation typically involves cloning the repository and following each model's individual instructions; most models require PyTorch.
- Many models require significant GPU resources (e.g., multiple A100s) and large datasets for training or fine-tuning.
- Specific models may have unique dependencies detailed in their respective sub-directories.
- Links to model releases and demos are provided throughout the README; many checkpoints are also published on the Hugging Face Hub (see the loading sketch after this list).
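As a quick way to try a released checkpoint without the repository's own training scripts, the snippet below loads a published BEiT model through the Hugging Face transformers library. The model ID (microsoft/beit-base-patch16-224) and the image path are illustrative; other models in the collection may prescribe a different loading path in their sub-directory READMEs.

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, BeitForImageClassification

# Example checkpoint released under the microsoft organization on the Hub.
model_id = "microsoft/beit-base-patch16-224"
processor = AutoImageProcessor.from_pretrained(model_id)
model = BeitForImageClassification.from_pretrained(model_id)

image = Image.open("example.jpg").convert("RGB")   # any local RGB image
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(-1).item()])
```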
Highlighted Details
- Features groundbreaking architectures like RetNet (Retentive Network) and BitNet (1-bit Transformers); a toy sketch of BitNet's weight binarization follows this list.
- Includes state-of-the-art multimodal models such as Kosmos-2.5 and BEiT-3.
- Offers a wide array of pre-trained models for over 100 languages and various modalities (vision, speech, document AI).
- Provides toolkits for sequence-to-sequence fine-tuning and efficient decoding.
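To illustrate the 1-bit idea behind BitNet referenced above, the sketch below follows the weight quantization step described in the BitNet paper: weights are centered, mapped to ±1 with sign(), and rescaled by their mean absolute value. This is a paper-level sketch under those assumptions; it omits activation quantization and the straight-through estimator and is not the repository's kernel.

```python
import torch

def binarize_weights(w: torch.Tensor):
    """BitNet-style 1-bit weight quantization (simplified sketch).

    Centers the weight matrix, maps entries to {-1, +1} with sign()
    (exact zeros stay 0), and keeps one per-tensor scale so the
    quantized matrix roughly preserves the original magnitude.
    """
    scale = w.abs().mean()
    w_bin = torch.sign(w - w.mean())
    return w_bin, scale

def bitlinear(x, w):
    """Forward pass of a simplified BitLinear layer (weights only)."""
    w_bin, scale = binarize_weights(w)
    return (x @ w_bin.t()) * scale

x = torch.randn(4, 32)
w = torch.randn(64, 32)
print(bitlinear(x, w).shape)   # torch.Size([4, 64])
```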
Maintenance & Community
- Actively updated with recent releases (e.g., RedStone, LongNet, TextDiffuser-2).
- The primary contact for inquiries is Furu Wei (fuwei@microsoft.com); GitHub issues are used for model-specific support.
Licensing & Compatibility
- The project's license is specified in the LICENSE file.
- Portions of the code are based on the Hugging Face transformers project; specific model licenses may vary.
Limitations & Caveats
- Many models are research prototypes and may require substantial computational resources and expertise for effective use or reproduction.
- The sheer volume of models and research areas means some may be less actively maintained than others.