LinkBERT by michiyasunaga

Knowledgeable language model pretrained with document links

Created 3 years ago · 443 stars · Top 68.7% on SourcePulse

Project Summary

LinkBERT enhances transformer-based language models by incorporating knowledge from document links, such as hyperlinks and citations, into the pretraining process. The goal is better performance on knowledge-intensive and cross-document NLP tasks, with pretrained checkpoints for both the general and biomedical domains.

How It Works

LinkBERT extends BERT pretraining by placing linked documents in the same input context, whereas BERT only sees segments drawn from a single document. Training combines masked language modeling with a Document Relation Prediction objective, which pushes the model to capture knowledge that spans documents and improves factual recall and multi-hop reasoning.
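
As a minimal illustration (not the authors' actual data pipeline), the standard two-segment BERT input format can pair an anchor passage with a passage from a document it links to; the passages below are hypothetical:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("michiyasunaga/LinkBERT-base")

# Hypothetical anchor passage and a passage from a hyperlinked document.
anchor = "LinkBERT is pretrained with document links such as hyperlinks."
linked = "Hyperlinks connect related articles and carry cross-document knowledge."

# text_pair places the linked passage in segment B, so the model sees
# [CLS] anchor [SEP] linked [SEP] in one context window.
inputs = tokenizer(anchor, text_pair=linked, return_tensors="pt")
print(tokenizer.decode(inputs["input_ids"][0]))
```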

Quick Start & Requirements

  • Install: Create a conda environment (conda create -n linkbert python=3.8), activate it (source activate linkbert), and install dependencies (pip install torch==1.10.1+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html, pip install transformers==4.9.1 datasets==1.11.0 fairscale==0.4.0 wandb sklearn seqeval).
  • Data: Download preprocessed datasets from a provided link or preprocess raw data using provided scripts.
  • Models: Available on HuggingFace (michiyasunaga/LinkBERT-base, michiyasunaga/LinkBERT-large, michiyasunaga/BioLinkBERT-base, michiyasunaga/BioLinkBERT-large).
  • Usage: Load models with HuggingFace Transformers (see the loading sketch after this list). Fine-tuning scripts are provided for MRQA, BLURB, MedQA, and MMLU tasks.
  • Prerequisites: Python 3.8, PyTorch 1.10.1 with CUDA 11.3, Transformers 4.9.1.
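
A minimal loading sketch using the base checkpoint with HuggingFace Transformers (the input sentence is arbitrary):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("michiyasunaga/LinkBERT-base")
model = AutoModel.from_pretrained("michiyasunaga/LinkBERT-base")

inputs = tokenizer("Salt is an electrolyte that helps regulate fluid balance.",
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, seq_len, hidden_size)
```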

Highlighted Details

  • Reports state-of-the-art results at release on biomedical benchmarks (BLURB, PubMedQA, BioASQ, MedQA-USMLE, MMLU professional medicine).
  • Demonstrates improved performance over BERT-base and BERT-large on general benchmarks like MRQA and GLUE.
  • Offers both general and biomedical domain-specific pretrained models.
  • Compatible with HuggingFace Transformers for easy integration; a quick fill-mask probe of the biomedical checkpoint is sketched below.
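
A hedged probe of BioLinkBERT (the prompt is hypothetical, and this assumes the checkpoint ships the standard BERT masked-LM head; if it does not, the pipeline will warn about randomly initialized weights):

```python
from transformers import pipeline

# Assumption: BioLinkBERT is a BERT-style masked LM, so the stock
# fill-mask pipeline applies; [MASK] is BERT's mask token.
unmasker = pipeline("fill-mask", model="michiyasunaga/BioLinkBERT-base")

for pred in unmasker("Aspirin is commonly used to treat [MASK]."):
    print(f"{pred['token_str']}: {pred['score']:.3f}")
```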

Maintenance & Community

The project accompanies an ACL 2022 paper and provides a CodaLab worksheet for reproducibility. The README mentions no community channels, and there are no indicators of active maintenance.

Licensing & Compatibility

The README does not state a license, so suitability for commercial use or closed-source linking is unspecified.

Limitations & Caveats

The project pins older versions of PyTorch (1.10.1) and Transformers (4.9.1), which may conflict with current ecosystems. The unspecified license may also impede commercial adoption.

Health Check

  • Last commit: 3 years ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 6 stars in the last 90 days
