biobert-pretrained by naver

Pre-trained weights for biomedical text mining

created 6 years ago
695 stars

Top 50.0% on sourcepulse

Project Summary

This repository provides pre-trained weights for BioBERT, a BERT-based language representation model specifically tailored for biomedical text mining tasks. It offers researchers and practitioners a powerful tool for applications like named entity recognition, relation extraction, and question answering within the biomedical domain, leveraging extensive biomedical corpora for enhanced performance.

How It Works

BioBERT is built upon Google's original BERT architecture, utilizing a WordPiece vocabulary derived from BERT-base-Cased. This approach allows for effective representation of novel biomedical terms through subword tokenization. The model has been pre-trained on large biomedical text datasets, including PubMed abstracts and PubMed Central full texts, resulting in specialized language understanding capabilities for the biomedical field.
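The subword handling described above can be illustrated with a minimal sketch of WordPiece-style greedy longest-match-first tokenization. The vocabulary below is a hypothetical toy set for illustration only; BioBERT uses the full BERT-base-Cased WordPiece vocabulary.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one word into subword pieces via greedy longest-match-first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest candidate substring first, shrinking until a match.
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # continuation pieces carry the "##" prefix
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches -> fall back to the unknown token
        pieces.append(piece)
        start = end
    return pieces

# Toy vocabulary (assumption, for illustration only).
toy_vocab = {"micro", "##glia", "##l", "hepat", "##itis"}
print(wordpiece_tokenize("microglial", toy_vocab))  # ['micro', '##glia', '##l']
print(wordpiece_tokenize("hepatitis", toy_vocab))   # ['hepat', '##itis']
```

This is why a general-purpose vocabulary can still represent novel biomedical terms: unseen words decompose into known subword units rather than mapping wholesale to `[UNK]`.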

Quick Start & Requirements

  • Pre-trained weights can be downloaded from the releases section of this repository.
  • Requires a Python environment compatible with the original, TensorFlow-based BERT implementation.
  • Fine-tuning instructions and code are available in the DMIS GitHub repository for BioBERT.
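After downloading and extracting a release archive, a quick sanity check helps before pointing the BERT code at the directory. This is a hedged sketch: the file names follow the standard BERT checkpoint layout (`bert_config.json`, `vocab.txt`, `model.ckpt*`), and the directory name is a placeholder; verify both against the archive you actually downloaded.

```python
import glob
import os

def check_checkpoint_dir(path):
    """Return a list of expected checkpoint files missing from `path`."""
    # Standard BERT checkpoint layout (assumption; confirm against the archive).
    missing = [name for name in ("bert_config.json", "vocab.txt")
               if not os.path.exists(os.path.join(path, name))]
    # TensorFlow checkpoints are split across model.ckpt* files.
    if not glob.glob(os.path.join(path, "model.ckpt*")):
        missing.append("model.ckpt* (TensorFlow checkpoint files)")
    return missing

# "./biobert_v1.1_pubmed" is a placeholder extraction directory.
missing = check_checkpoint_dir("./biobert_v1.1_pubmed")
if missing:
    print("Missing files:", missing)
else:
    print("Checkpoint directory looks complete.")
```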

Highlighted Details

  • Offers three pre-trained weight combinations: BioBERT (+ PubMed), BioBERT (+ PMC), and BioBERT (+ PubMed + PMC).
  • Available in both BioBERT-Base and BioBERT-Large v1.1 variants, with recommendations based on GPU resources.
  • Uses Google's WordPiece vocabulary for subword tokenization of biomedical terms.
  • Pre-training corpora include PubMed abstracts (approx. 4.5 billion words) and PubMed Central full texts (approx. 13.5 billion words).


Licensing & Compatibility

  • The repository does not explicitly state a license for the pre-trained weights. The underlying BERT code is released under Apache 2.0. Users should verify licensing before any commercial use.

Limitations & Caveats

  • The README does not provide specific licensing information for the pre-trained weights, which may impact commercial use.
  • Pre-processed corpora are not provided, requiring users to download and process them from external FTP links.
Health Check

  • Last commit: 5 years ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 9 stars in the last 90 days
