medmcqa by medmcqa

Medical MCQA dataset for advanced reasoning

Created 3 years ago
252 stars

Top 99.6% on SourcePulse

Project Summary

MedMCQA: Large-Scale Medical MCQA Dataset

MedMCQA is a large-scale dataset for multiple-choice question answering (MCQA) in the medical domain, built from real-world medical entrance exam questions. It serves NLP researchers and developers building QA systems that require deeper reasoning across diverse medical subjects, supporting models that can understand and answer complex medical queries and thereby advancing medical AI.

How It Works

This project provides a curated dataset comprising over 194,000 high-quality MCQs sourced from AIIMS and NEET PG medical entrance examinations. It covers 2.4k healthcare topics and 21 distinct medical subjects, offering high topical diversity. Each data instance includes a question, multiple-choice options, the correct answer, and an expert's explanation, designed to test complex reasoning abilities beyond simple recall. The dataset is structured with splits based on actual exams to promote robust model generalization and evaluation.
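Each instance bundles a question, its options, the answer key, and an expert explanation. A minimal sketch of reading one such record, assuming the field names published with the dataset (`opa`–`opd` for the four options, `cop` for the correct-option index, `exp` for the explanation); the sample question text is illustrative, and the 0-based index convention for `cop` should be verified against the downloaded files:

```python
import json

# Illustrative MedMCQA-style record; field names follow the published schema,
# the question content itself is made up for this sketch.
record_json = '''{"question": "Which vitamin deficiency causes scurvy?",
 "opa": "Vitamin A", "opb": "Vitamin B12", "opc": "Vitamin C", "opd": "Vitamin D",
 "cop": 2, "exp": "Vitamin C (ascorbic acid) deficiency causes scurvy.",
 "subject_name": "Biochemistry", "topic_name": "Vitamins"}'''

record = json.loads(record_json)
options = [record["opa"], record["opb"], record["opc"], record["opd"]]
answer = options[record["cop"]]  # "cop" assumed to be a 0-based option index
print(answer)
```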

Quick Start & Requirements

  • Install dependencies: pip3 install -r requirements.txt
  • Download data: https://drive.google.com/uc?export=download&id=15VkJdq5eyWIkfb_aoD3oS8i4tScbHYky
  • Run experiments: Clone the repository, install dependencies, download and unzip the data, then execute python3 train.py --model bert-base-uncased --dataset_folder_name "/content/medmcqa_data/".
  • Links: Repository (https://github.com/medmcqa/medmcqa), Paper (https://arxiv.org/abs/2203.14371), Homepage (https://medmcqa.github.io).
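The steps above can be collected into one shell session. The repository URL, install command, and training invocation are taken verbatim from the list; the archive filename `medmcqa_data.zip` and the unzip target are assumptions to match the `--dataset_folder_name` path:

```shell
git clone https://github.com/medmcqa/medmcqa.git
cd medmcqa
pip3 install -r requirements.txt
# Download the archive from the Google Drive link above, then unzip it so the
# data files land in the folder passed to --dataset_folder_name (names assumed).
unzip medmcqa_data.zip -d /content/medmcqa_data/
python3 train.py --model bert-base-uncased --dataset_folder_name "/content/medmcqa_data/"
```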

Highlighted Details

  • Features 194k+ high-quality MCQs from real medical entrance exams (AIIMS & NEET PG).
  • Covers 2.4k healthcare topics and 21 distinct medical subjects.
  • Designed to test 10+ distinct reasoning abilities.
  • Dataset split by exams (AIIMS PG, NEET PG) to promote model generalization.

Maintenance & Community

  • Point of contact: medmcqa [at] gmail.com.
  • No community channels (e.g., Discord, Slack) or roadmap links are provided in the README.

Licensing & Compatibility

  • The license type and terms for commercial use or closed-source linking are not specified in the provided README content.

Limitations & Caveats

  • Test set evaluation requires submitting predictions via a Google Form, as ground truth is withheld to preserve integrity.
  • The absence of explicit licensing information poses a potential adoption blocker for commercial or sensitive projects.
Health Check

  • Last commit: 3 years ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 5 stars in the last 30 days
