DeBERTa by microsoft

BERT enhancement via disentangled attention and an enhanced mask decoder

created 5 years ago
2,126 stars

Top 21.6% on sourcepulse

1 Expert Loves This Project
Project Summary

DeBERTa is a family of Transformer-based language models that improve upon BERT and RoBERTa by introducing disentangled attention and an enhanced mask decoder. It achieves state-of-the-art results on NLP benchmarks including SuperGLUE and XNLI, making it suitable for researchers and developers seeking high-accuracy language understanding models.

How It Works

DeBERTa employs a disentangled attention mechanism that represents each word with separate content and relative-position vectors and computes attention weights from three disentangled terms: content-to-content, content-to-position, and position-to-content. This helps the model capture how the relationship between two tokens depends on their relative distance. Additionally, an enhanced mask decoder replaces the output softmax layer during pre-training, injecting absolute position information when predicting masked tokens. DeBERTa V3 further refines the recipe with ELECTRA-style replaced-token-detection pre-training and gradient-disentangled embedding sharing.
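
The score computation can be sketched in a few lines of PyTorch. This is a minimal, single-head illustration of the three terms and the 1/sqrt(3d) scaling described in the DeBERTa paper; the function disentangled_scores and all tensor names are illustrative and are not the repository's actual API.

    # Minimal sketch of DeBERTa-style disentangled attention scores
    # (single head, no masking or batching). Illustrative only.
    import math
    import torch

    def disentangled_scores(content, rel_embed, Wq_c, Wk_c, Wq_r, Wk_r, k):
        """content: [seq, d] content vectors; rel_embed: [2k, d] relative-position embeddings."""
        seq, d = content.shape
        Qc, Kc = content @ Wq_c, content @ Wk_c      # content projections
        Qr, Kr = rel_embed @ Wq_r, rel_embed @ Wk_r  # relative-position projections

        # delta(i, j): relative distance i - j, shifted by k and clipped to [0, 2k)
        idx = torch.arange(seq)
        delta = torch.clamp(idx[:, None] - idx[None, :] + k, 0, 2 * k - 1)

        c2c = Qc @ Kc.T                              # content -> content
        c2p = torch.gather(Qc @ Kr.T, 1, delta)      # content -> position, indexed by delta(i, j)
        p2c = torch.gather(Kc @ Qr.T, 1, delta).T    # position -> content, indexed by delta(j, i)

        # The paper scales by 1/sqrt(3d) because three score terms are summed.
        return (c2c + c2p + p2c) / math.sqrt(3 * d)

    # Toy usage
    d, k, seq = 8, 4, 5
    content, rel_embed = torch.randn(seq, d), torch.randn(2 * k, d)
    W = [torch.randn(d, d) for _ in range(4)]
    print(disentangled_scores(content, rel_embed, *W, k=k).shape)  # torch.Size([5, 5])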

Quick Start & Requirements

  • Install: pip install deberta
  • Prerequisites: Linux, CUDA 10.0+, PyTorch 1.3.0+, Python 3.6+. Docker is recommended for simplified setup.
  • Resources: Pre-trained models range from 22M (XSmall) to 1.5B parameters; a minimal loading sketch follows this list. Fine-tuning requires significant GPU resources (e.g., 8x V100 GPUs for reported benchmarks).
  • Docs: https://github.com/microsoft/DeBERTa
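
The loading sketch below pulls one of the released checkpoints through Hugging Face transformers rather than the repository's own DeBERTa package; it assumes transformers, sentencepiece, and torch are installed, and uses the published microsoft/deberta-v3-base checkpoint.

    # Hedged sketch: load a released DeBERTa-V3 checkpoint via Hugging Face
    # transformers (an alternative to the repo's own DeBERTa package).
    # Assumes: pip install transformers sentencepiece torch
    import torch
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
    model = AutoModel.from_pretrained("microsoft/deberta-v3-base")

    inputs = tokenizer("DeBERTa disentangles content and position.", return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    print(outputs.last_hidden_state.shape)  # [1, seq_len, 768] for the base model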

Highlighted Details

  • Achieves state-of-the-art results on SuperGLUE (89.9 with 1.5B model), surpassing T5-11B and human performance.
  • DeBERTa-V3-XSmall (22M parameters) outperforms RoBERTa-Base and XLNet-Base on MNLI and SQuAD v2.0.
  • Offers multilingual support with mDeBERTa-V3-Base, achieving strong zero-shot cross-lingual transfer on XNLI.
  • Provides a wide range of pre-trained models, including V2 and V3 variants with different sizes and fine-tuned checkpoints.

Maintenance & Community

The project is actively maintained by Microsoft researchers. Key contacts include Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen.

Licensing & Compatibility

The repository's code is released under the MIT License, which permits commercial use and integration into closed-source projects. Users should still verify the license terms associated with any downloaded model checkpoints.

Limitations & Caveats

The larger models (e.g., the 1.5B-parameter V2-XXLarge) require substantial computational resources for both fine-tuning and inference. While Docker simplifies setup, users without Docker experience may face a steeper learning curve. The README pins specific CUDA and PyTorch versions, so newer or older toolchains may run into compatibility issues.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 2
  • Star History: 49 stars in the last 90 days

Starred by Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake), Abhishek Thakur (World's First 4x Kaggle GrandMaster), and 5 more.

Explore Similar Projects

xlnet by zihangdai
6k stars · 0.0%
Language model research paper using generalized autoregressive pretraining
created 6 years ago · updated 2 years ago