DocProduct by re-search

Medical Q&A with deep language models

Created 6 years ago

572 stars

Top 56.3% on SourcePulse

2 Experts Love This Project

Malte Pietsch

Cofounder of deepset

Bojan Tunguz

AI Scientist; Formerly at NVIDIA

Project Summary

This project provides a medical question-answering system that leverages deep learning models like BERT and GPT-2 to retrieve and generate answers from a large corpus of medical data. It is targeted at researchers and developers interested in exploring advanced NLP techniques for specialized domains, offering a novel approach to medical information retrieval.

How It Works

The system employs a dual-model architecture: retrieval followed by generation. First, a fine-tuned BioBERT model encodes medical questions and answers into vector representations. These embeddings are then passed through separate fully connected neural networks (FCNNs), one for questions and one for answers, which map them into a metric space. The networks are trained with a custom cross-entropy loss that treats the other answers in a batch as negative samples, pulling the embeddings of matching question-answer pairs closer together. Finally, a fine-tuned GPT-2 model generates an answer from the question and the top-k retrieved pieces of relevant medical information.
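The training objective and the retrieval step described above can be illustrated with a minimal sketch in plain Python. This is not the project's code: the function names are hypothetical, the embeddings are toy lists standing in for FCNN outputs, and the dot product is assumed as the similarity score, which the summary does not confirm.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors (assumed similarity score)."""
    return sum(a * b for a, b in zip(u, v))


def in_batch_softmax_loss(question_embs, answer_embs):
    """Mean cross-entropy over a batch of paired embeddings.

    Row i of question_embs is paired with row i of answer_embs; every other
    answer in the batch acts as a negative sample for question i.
    """
    losses = []
    for i, q in enumerate(question_embs):
        scores = [dot(q, a) for a in answer_embs]
        # Log-sum-exp with the max subtracted for numerical stability.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        # Negative log-probability of the matching answer.
        losses.append(log_z - scores[i])
    return sum(losses) / len(losses)


def top_k(query_emb, answer_embs, k):
    """Indices of the k answers scoring highest against the query."""
    scores = [dot(query_emb, a) for a in answer_embs]
    ranked = sorted(range(len(answer_embs)), key=lambda j: scores[j], reverse=True)
    return ranked[:k]
```

The loss is small when each question scores highest against its own answer and grows when another answer in the batch scores higher. In a full pipeline, the indices returned by `top_k` would select the passages handed to the generator.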

Quick Start & Requirements

Install: pip install tensorflow-gpu==2.0.0-alpha0, pip install mkl, pip install https://github.com/Santosh-Gupta/DocProduct/archive/master.zip. FAISS CPU/GPU installation requires manual download and compilation.
Prerequisites: TensorFlow 2.0.0-alpha0, FAISS (CPU or GPU), Python 3.6.
Resources: Requires significant disk space for over a terabyte of data (TFRECORDS, CSV, CKPT) and substantial GPU memory for training and inference with BERT and GPT-2 models.
Demos: Interactive retrieval model: https://colab.research.google.com/drive/11hAr1qo7VCSmIjWREFwyTFblU2LVeh1R. Training model: https://colab.research.google.com/drive/1Rz2rzkwWrVEXcjiQqTXhxzLCW5cXi7xA. End-to-end demo: https://colab.research.google.com/drive/1Bv7bpPxIImsMG4YWB_LWjDRgUHvi7pxx.

Highlighted Details

Top 6 finalist in the #PoweredByTF 2.0 Challenge.
Utilizes a custom loss function inspired by negative sampling for embedding similarity training.
Optimized input pipeline using tf.data and TFRecords for efficient preprocessing of large datasets.
Developed an imperative BERT implementation for better debugging and compatibility with TensorFlow 2.0 eager execution.

Maintenance & Community

The project was a finalist in a TensorFlow challenge and presented to the TensorFlow Engineering Team. Collaboration is welcomed via email at Research2Vec@gmail.com.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The project states explicitly that it is not intended to provide actionable medical advice, and it is not ready for widespread commercial use. The end-to-end demo is experimental. Installing FAISS is complex and requires manual steps.

Health Check

Last Commit

2 years ago

Responsiveness

1 day

Star History

0 stars in the last 30 days