awesome-bioie by caufieldjh

Curated list of resources for Biomedical Information Extraction (BioIE)

Created 6 years ago
392 stars

Top 73.4% on SourcePulse

Project Summary

This repository is a curated list of resources for Biomedical Information Extraction (BioIE), targeting researchers and engineers working with unstructured biomedical data. It provides a comprehensive overview of methods, tools, datasets, and organizations in the field, aiming to facilitate the extraction of structured knowledge from complex biological and clinical text.
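
To make "extraction of structured knowledge" concrete, the following is a toy sketch of dictionary-based entity extraction using only the Python standard library. It does not come from the repository and does not use any of the listed libraries. The lexicon, the labels, and the function name `extract_entities` are hypothetical and chosen for illustration only; real systems would typically rely on trained models and vocabularies such as UMLS or RxNorm.

```python
import re

# Hypothetical mini-lexicon mapping surface forms to entity types.
# Assumption for illustration; not taken from any listed resource.
LEXICON = {
    "aspirin": "CHEMICAL",
    "warfarin": "CHEMICAL",
    "myocardial infarction": "DISEASE",
    "bleeding": "DISEASE",
}


def extract_entities(text):
    """Return (start, end, surface, label) tuples for lexicon terms found in text."""
    # Try longer terms first so a multi-word term wins over a shorter overlapping one.
    terms = sorted(LEXICON, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    return [
        (m.start(), m.end(), m.group(0), LEXICON[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]


sentence = "Aspirin reduced the risk of myocardial infarction but increased bleeding."
entities = extract_entities(sentence)
for start, end, surface, label in entities:
    print(f"{label:9} {surface!r} [{start}:{end}]")
```

Exact string matching like this misses synonyms, abbreviations, and negation, which is part of why the field has moved toward the model-based approaches the list catalogs.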

How It Works

The list is organized into categories covering research overviews, active groups, journals, conferences, challenges, tutorials, code libraries, tools, annotation platforms, techniques, datasets, and ontologies. It favors publicly accessible, no-cost resources with permissive licenses. Its coverage also reflects the rapid evolution of BioIE, driven by BERT-based models and Large Language Models (LLMs).

Quick Start & Requirements

This is a curated list, not a software package. To engage with the resources:

  • Code Libraries: Many libraries (e.g., spaCy, Biopython, medaCy, ScispaCy) are available via pip.
  • Datasets: Access often requires registration, a data use agreement, or a UMLS Terminology Services (UTS) account.
  • Tools: Some tools offer demos (e.g., CLAMP) or require local installation.
  • Resources: Links to papers, GitHub repos, and official documentation are provided throughout.

Highlighted Details

  • Extensive coverage of LLMs and BERT variants (BioBERT, ClinicalBERT, SciBERT, PubMedBERT) applied to biomedical tasks.
  • Detailed lists of annotated datasets for entities, relations (e.g., PPI), and events, including corpora like BC5CDR, CRAFT, and n2c2.
  • Information on major BioIE research groups, conferences (e.g., ACL BioNLP, BIBM, ISMB), and challenges (e.g., BioASQ, BioCreative).
  • Includes resources for ontologies and controlled vocabularies like UMLS, Disease Ontology, and RxNorm.

Maintenance & Community

The list is community-driven, encouraging contributions via pull requests. It references active research groups from institutions like Boston Children's Hospital, Mayo Clinic, and NIH/NLM, indicating a vibrant research ecosystem.

Licensing & Compatibility

The list favors resources that are free of charge and carry few license restrictions. However, specific datasets may have usage restrictions or require registration. Suitability for commercial use depends on each resource's own license.

Limitations & Caveats

The field is rapidly evolving, particularly with LLMs, meaning some "Pre-LLM Guides" may lack the latest context. Dataset accessibility can vary, with some requiring significant administrative steps or having specific usage terms.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 5 stars in the last 30 days
