ModernBERT by AnswerDotAI

Research repo for modernizing BERT via architecture and scaling

created 1 year ago
1,468 stars

Top 28.5% on sourcepulse

View on GitHub
Project Summary

ModernBERT offers a modular and scalable approach to building Transformer encoder models, focusing on architectural improvements and efficient training. It's designed for researchers and practitioners aiming to develop state-of-the-art language models with enhanced performance and longer context capabilities.

How It Works

ModernBERT introduces FlexBERT, a flexible building block system for encoder architectures, configurable via YAML files. It builds upon MosaicBERT, integrating Flash Attention 2 for improved speed and memory efficiency. This modularity allows for easier experimentation with different architectural components and scaling strategies.
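
As a rough illustration of the YAML-driven design, here is a minimal sketch of loading a config and handing it to a builder. The config keys and the build_model() helper are hypothetical stand-ins for the FlexBERT idea, not the repository's actual schema or factory functions:

```python
# Minimal sketch of a YAML-driven encoder build (FlexBERT idea).
# The config keys and build_model() helper are hypothetical, not
# the repository's actual schema or factory functions.
import yaml

CONFIG = """
model:
  name: flex_bert
  num_hidden_layers: 12
  hidden_size: 768
  attention_layer: flash_attention_2   # hypothetical key name
  max_position_embeddings: 8192
"""

def build_model(cfg: dict) -> None:
    # Placeholder: a real factory would map these keys onto concrete
    # encoder building blocks (attention, MLP, embeddings, etc.).
    print(f"Building {cfg['name']} with {cfg['num_hidden_layers']} layers "
          f"and {cfg['attention_layer']} attention")

cfg = yaml.safe_load(CONFIG)
build_model(cfg["model"])
```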

Quick Start & Requirements

  • Install via Conda: conda env create -f environment.yaml
  • Activate environment: conda activate bert24
  • Flash Attention: pip install "flash_attn==2.6.3"; Hopper GPUs instead require building from source or installing precompiled wheels (a short environment sanity check follows this list).
  • A GPU-equipped machine is required.
  • Setup time and resource requirements depend on model size and dataset.
  • For details, see the ModernBERT Collection on HuggingFace and the arXiv preprint.
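
Before launching a run, a short sanity check along these lines can confirm that the GPU and the Flash Attention install are usable (this sketch assumes PyTorch is already in the environment):

```python
# Quick sanity check that the environment has a usable GPU and that
# Flash Attention imported correctly after installation.
import torch

assert torch.cuda.is_available(), "ModernBERT training requires a GPU"
print("GPU:", torch.cuda.get_device_name(0))

try:
    import flash_attn
    print("flash_attn version:", flash_attn.__version__)
except ImportError:
    print("flash_attn not installed; see the install options above")
```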

Highlighted Details

  • Modular encoder design (FlexBERT) configurable via YAML.
  • Leverages Composer framework for training.
  • Supports both raw text and pre-tokenized data formats (MDS, CSV/TSV, JSONL); a loading sketch follows this list.
  • Includes scripts for fine-tuning and evaluating retrieval models (ColBERT, Sentence Transformers).
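
For the tokenized MDS path, a minimal loading sketch using MosaicML's streaming library (the usual companion to Composer) might look like the following; the shard paths and the field names inside each sample are placeholder assumptions:

```python
# Sketch of reading a tokenized MDS shard directory with MosaicML's
# `streaming` library. Paths and sample fields are assumptions.
from torch.utils.data import DataLoader
from streaming import StreamingDataset

dataset = StreamingDataset(
    local="/tmp/mds-cache",          # placeholder local cache dir
    remote="s3://bucket/train-mds",  # placeholder remote shard dir
    shuffle=True,
    batch_size=32,
)
loader = DataLoader(dataset, batch_size=32)

for batch in loader:
    print(batch.keys())  # whatever tokenized fields the shards store
    break
```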

Maintenance & Community

ModernBERT is a collaboration between Answer.AI, LightOn, and friends. The repository is research-focused, with a HuggingFace collection available for easier integration; further documentation and reproducibility materials are planned.
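
As a quick integration check, the HuggingFace collection can be exercised with a minimal fill-mask sketch like the one below. It assumes the answerdotai/ModernBERT-base checkpoint id and a transformers release recent enough to include the architecture:

```python
# Minimal fill-mask sketch against the HuggingFace collection.
# Assumes the answerdotai/ModernBERT-base checkpoint id and a
# sufficiently recent transformers release.
from transformers import pipeline

fill = pipeline("fill-mask", model="answerdotai/ModernBERT-base")
for pred in fill("The capital of France is [MASK]."):
    print(pred["token_str"], round(pred["score"], 3))
```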

Licensing & Compatibility

The codebase builds upon MosaicBERT, which is under the Apache 2.0 license. This license permits commercial use and modification.

Limitations & Caveats

By its authors' own admission, the README is "very barebones and is still under construction." The StreamingTextDataset may distribute memory unevenly across accelerators, and Flash Attention installation can be complex, particularly for specific GPU architectures.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 130 stars in the last 90 days

Explore Similar Projects

Starred by Jeff Hammerbacher (Cofounder of Cloudera) and Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake).

InternEvo by InternLM

Top 1.0%
402 stars
Lightweight training framework for model pre-training
created 1 year ago
updated 1 week ago
Starred by Jeremy Howard (Cofounder of fast.ai) and Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake).

SwissArmyTransformer by THUDM

Top 0.3%
1k stars
Transformer library for flexible model development
created 3 years ago
updated 7 months ago