biobert-pretrained by naver

Pre-trained weights for biomedical text mining

created 6 years ago
695 stars

Top 50.0% on sourcepulse

Project Summary

This repository provides pre-trained weights for BioBERT, a BERT-based language representation model specifically tailored for biomedical text mining tasks. It offers researchers and practitioners a powerful tool for applications like named entity recognition, relation extraction, and question answering within the biomedical domain, leveraging extensive biomedical corpora for enhanced performance.

How It Works

BioBERT is built upon Google's original BERT architecture, utilizing a WordPiece vocabulary derived from BERT-base-Cased. This approach allows for effective representation of novel biomedical terms through subword tokenization. The model has been pre-trained on large biomedical text datasets, including PubMed abstracts and PubMed Central full texts, resulting in specialized language understanding capabilities for the biomedical field.
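The subword handling described above can be illustrated with a minimal sketch of WordPiece-style greedy longest-match-first tokenization. The vocabulary below is a hypothetical toy set for illustration only; BioBERT uses the full BERT-base-Cased WordPiece vocabulary.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one word into subword pieces via greedy longest-match-first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest candidate substring first, shrinking until a match.
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # continuation pieces carry the "##" prefix
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches -> fall back to the unknown token
        pieces.append(piece)
        start = end
    return pieces

# Toy vocabulary (assumption, for illustration only).
toy_vocab = {"micro", "##glia", "##l", "hepat", "##itis"}
print(wordpiece_tokenize("microglial", toy_vocab))  # ['micro', '##glia', '##l']
print(wordpiece_tokenize("hepatitis", toy_vocab))   # ['hepat', '##itis']
```

This is why a general-purpose vocabulary can still represent novel biomedical terms: unseen words decompose into known subword units rather than mapping wholesale to `[UNK]`.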

Quick Start & Requirements

  • Pre-trained weights can be downloaded from the releases section of this repository.
  • Requires a Python environment compatible with the original, TensorFlow-based BERT implementation.
  • Fine-tuning instructions and code are available in the DMIS GitHub repository for BioBERT.
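After downloading and extracting a release archive, a quick sanity check helps before pointing the BERT code at the directory. This is a hedged sketch: the file names follow the standard BERT checkpoint layout (`bert_config.json`, `vocab.txt`, `model.ckpt*`), and the directory name is a placeholder; verify both against the archive you actually downloaded.

```python
import glob
import os

def check_checkpoint_dir(path):
    """Return a list of expected checkpoint files missing from `path`."""
    # Standard BERT checkpoint layout (assumption; confirm against the archive).
    missing = [name for name in ("bert_config.json", "vocab.txt")
               if not os.path.exists(os.path.join(path, name))]
    # TensorFlow checkpoints are split across model.ckpt* files.
    if not glob.glob(os.path.join(path, "model.ckpt*")):
        missing.append("model.ckpt* (TensorFlow checkpoint files)")
    return missing

# "./biobert_v1.1_pubmed" is a placeholder extraction directory.
missing = check_checkpoint_dir("./biobert_v1.1_pubmed")
if missing:
    print("Missing files:", missing)
else:
    print("Checkpoint directory looks complete.")
```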

Highlighted Details

  • Offers three pre-trained weight combinations: BioBERT (+ PubMed), BioBERT (+ PMC), and BioBERT (+ PubMed + PMC).
  • Available in both BioBERT-Base and BioBERT-Large v1.1 variants, with recommendations based on GPU resources.
  • Uses Google's WordPiece vocabulary for subword tokenization of biomedical terms.
  • Pre-training corpora include PubMed abstracts (approx. 4.5 billion words) and PubMed Central full texts (approx. 13.5 billion words).


Licensing & Compatibility

  • The repository does not explicitly state a license for the pre-trained weights. The underlying BERT code is released under Apache 2.0. Users should verify licensing before any commercial use.

Limitations & Caveats

  • The README does not provide specific licensing information for the pre-trained weights, which may impact commercial use.
  • Pre-processed corpora are not provided, requiring users to download and process them from external FTP links.
Health Check

  • Last commit: 5 years ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 9 stars in the last 90 days
