DeBERTa by microsoft

BERT enhancement via disentangled attention and an enhanced mask decoder

created 5 years ago
2,126 stars

Top 21.6% on sourcepulse

1 Expert Loves This Project
Project Summary

DeBERTa is a family of Transformer-based language models that improve upon BERT and RoBERTa by introducing disentangled attention and an enhanced mask decoder. It achieves state-of-the-art results on NLP benchmarks including SuperGLUE and XNLI, making it suitable for researchers and developers seeking high-accuracy language understanding models.

How It Works

DeBERTa employs a disentangled attention mechanism that represents each word with separate content and relative-position vectors and computes attention weights from three disentangled terms: content-to-content, content-to-position, and position-to-content. This helps the model capture how the relationship between two tokens depends on their relative distance. Additionally, an enhanced mask decoder replaces the output softmax layer during pre-training, injecting absolute position information when predicting masked tokens. DeBERTa V3 further refines the recipe with ELECTRA-style replaced-token-detection pre-training and gradient-disentangled embedding sharing.
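
The score computation can be sketched in a few lines of PyTorch. This is a minimal, single-head illustration of the three terms and the 1/sqrt(3d) scaling described in the DeBERTa paper; the function disentangled_scores and all tensor names are illustrative and are not the repository's actual API.

    # Minimal sketch of DeBERTa-style disentangled attention scores
    # (single head, no masking or batching). Illustrative only.
    import math
    import torch

    def disentangled_scores(content, rel_embed, Wq_c, Wk_c, Wq_r, Wk_r, k):
        """content: [seq, d] content vectors; rel_embed: [2k, d] relative-position embeddings."""
        seq, d = content.shape
        Qc, Kc = content @ Wq_c, content @ Wk_c      # content projections
        Qr, Kr = rel_embed @ Wq_r, rel_embed @ Wk_r  # relative-position projections

        # delta(i, j): relative distance i - j, shifted by k and clipped to [0, 2k)
        idx = torch.arange(seq)
        delta = torch.clamp(idx[:, None] - idx[None, :] + k, 0, 2 * k - 1)

        c2c = Qc @ Kc.T                              # content -> content
        c2p = torch.gather(Qc @ Kr.T, 1, delta)      # content -> position, indexed by delta(i, j)
        p2c = torch.gather(Kc @ Qr.T, 1, delta).T    # position -> content, indexed by delta(j, i)

        # The paper scales by 1/sqrt(3d) because three score terms are summed.
        return (c2c + c2p + p2c) / math.sqrt(3 * d)

    # Toy usage
    d, k, seq = 8, 4, 5
    content, rel_embed = torch.randn(seq, d), torch.randn(2 * k, d)
    W = [torch.randn(d, d) for _ in range(4)]
    print(disentangled_scores(content, rel_embed, *W, k=k).shape)  # torch.Size([5, 5])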

Quick Start & Requirements

  • Install: pip install deberta
  • Prerequisites: Linux, CUDA 10.0+, PyTorch 1.3.0+, Python 3.6+. Docker is recommended for simplified setup.
  • Resources: Pre-trained models range from 22M (XSmall) to 1.5B parameters; a minimal loading sketch follows this list. Fine-tuning requires significant GPU resources (e.g., 8x V100 GPUs for reported benchmarks).
  • Docs: https://github.com/microsoft/DeBERTa
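
The loading sketch below pulls one of the released checkpoints through Hugging Face transformers rather than the repository's own DeBERTa package; it assumes transformers, sentencepiece, and torch are installed, and uses the published microsoft/deberta-v3-base checkpoint.

    # Hedged sketch: load a released DeBERTa-V3 checkpoint via Hugging Face
    # transformers (an alternative to the repo's own DeBERTa package).
    # Assumes: pip install transformers sentencepiece torch
    import torch
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
    model = AutoModel.from_pretrained("microsoft/deberta-v3-base")

    inputs = tokenizer("DeBERTa disentangles content and position.", return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    print(outputs.last_hidden_state.shape)  # [1, seq_len, 768] for the base model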

Highlighted Details

  • Achieves state-of-the-art results on SuperGLUE (89.9 with 1.5B model), surpassing T5-11B and human performance.
  • DeBERTa-V3-XSmall (22M parameters) outperforms RoBERTa-Base and XLNet-Base on MNLI and SQuAD v2.0.
  • Offers multilingual support with mDeBERTa-V3-Base, achieving strong zero-shot cross-lingual transfer on XNLI.
  • Provides a wide range of pre-trained models, including V2 and V3 variants with different sizes and fine-tuned checkpoints.

Maintenance & Community

The project is actively maintained by Microsoft researchers. Key contacts include Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen.

Licensing & Compatibility

The repository's code is released under the MIT License, which permits commercial use and integration into closed-source projects. Users should still verify the license terms associated with any downloaded model checkpoints.

Limitations & Caveats

The larger models (e.g., the 1.5B-parameter V2-XXLarge) require substantial computational resources for both fine-tuning and inference. While Docker simplifies setup, users without Docker experience may face a steeper learning curve. The README pins specific CUDA and PyTorch versions, so newer or older toolchains may run into compatibility issues.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 2
  • Star History: 49 stars in the last 90 days

Starred by Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake), Abhishek Thakur (World's First 4x Kaggle GrandMaster), and 5 more.

Explore Similar Projects

xlnet by zihangdai
6k stars · 0.0%
Language model research paper using generalized autoregressive pretraining
created 6 years ago · updated 2 years ago