large-qa-datasets by ad-freiburg

Collection of question answering datasets for NLP tasks

Created 6 years ago

437 stars

Top 67.5% on SourcePulse

1 Expert Loves This Project

Yiran Wu

Coauthor of AutoGen

Project Summary

This repository is a curated index of large-scale question answering (QA) datasets for researchers and developers in Natural Language Processing (NLP). It gathers resources for extractive, abstractive, and multi-hop QA tasks in one place, which helps when developing and benchmarking QA models.

How It Works

The collection comprises datasets built through diverse methodologies, ranging from crowd-sourced question formulation over text passages (SQuAD, QuAC, CoQA) to datasets grounded in knowledge bases (FreebaseQA, CFQ) and datasets that draw questions or evidence from the web (WebQuestions, TriviaQA). This variety allows QA systems to be trained and evaluated on different data distributions and levels of reasoning complexity.

Quick Start & Requirements

Datasets are typically provided in standard formats (e.g., JSON, TSV) and can be downloaded directly from linked sources.
No specific installation or code execution is required to access the datasets themselves.
Links to official dataset pages, papers, and sometimes code repositories are provided for each entry.
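
As a rough illustration of working with the formats mentioned above, the sketch below parses JSON and TSV text into a common list-of-dicts shape using only the Python standard library. The function name `load_records`, the field names `question` and `answer`, and the sample strings are hypothetical; real datasets each define their own schema, so the field access would need adjusting per dataset.

```python
import csv
import io
import json


def load_records(text, fmt):
    """Parse dataset text into a list of dicts. fmt is 'json' or 'tsv'."""
    if fmt == "json":
        data = json.loads(text)
        # Assumption: records are either a bare list or sit under a "data" key.
        return data if isinstance(data, list) else data.get("data", [])
    if fmt == "tsv":
        # Assumption: the first TSV row is a header naming the columns.
        reader = csv.DictReader(io.StringIO(text), delimiter="\t")
        return [dict(row) for row in reader]
    raise ValueError(f"unsupported format: {fmt}")


# Hypothetical sample content standing in for downloaded files.
json_text = '[{"question": "Who wrote Hamlet?", "answer": "William Shakespeare"}]'
tsv_text = "question\tanswer\nWho wrote Hamlet?\tWilliam Shakespeare\n"

json_records = load_records(json_text, "json")
tsv_records = load_records(tsv_text, "tsv")

print(json_records[0]["question"], "->", json_records[0]["answer"])
print(len(tsv_records), "TSV record(s)")
```

In practice the text would come from a file downloaded from the dataset's official page, read with `open(path, encoding="utf-8")`.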

Highlighted Details

Covers major QA datasets published between 2013 and 2020.
Includes datasets focusing on specific challenges like multi-hop reasoning (HotpotQA) and conversational QA (QuAC, CoQA).
Features datasets generated via automated methods, offering large-scale data for model training.
Provides links to original research papers and dataset repositories for detailed information.

Maintenance & Community

This repository acts as a curated index rather than an actively maintained project. The datasets themselves are maintained by their respective creators.

Licensing & Compatibility

Dataset licenses vary by source. Users must consult the individual dataset licenses for terms of use, redistribution, and commercial application.

Limitations & Caveats

The repository itself does not provide tools for data processing or model training; it is purely a collection of links and descriptions. Users are responsible for downloading, managing, and processing the data according to each dataset's specific license and format.
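
Since processing is left to the user, a typical first step is flattening a nested dataset file into one row per question-answer pair. The sketch below assumes a SQuAD-style layout (articles containing paragraphs, each with a `context` and a list of `qas`); this layout is an assumption for illustration and other datasets in the collection use different structures. The helper name `flatten_squad_style` and the sample record are hypothetical.

```python
def flatten_squad_style(dataset):
    """Flatten a SQuAD-style nested dict into one row per answer."""
    rows = []
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    start = answer["answer_start"]
                    text = answer["text"]
                    rows.append({
                        "id": qa["id"],
                        "question": qa["question"],
                        "answer": text,
                        # Check that the character offset really points at the answer.
                        "span_ok": context[start:start + len(text)] == text,
                    })
    return rows


# Hypothetical sample record, not taken from any real dataset.
context = "Freiburg is a city in Germany."
sample = {
    "data": [{
        "title": "Example",
        "paragraphs": [{
            "context": context,
            "qas": [{
                "id": "q1",
                "question": "Where is Freiburg?",
                "answers": [{"text": "Germany",
                             "answer_start": context.index("Germany")}],
            }],
        }],
    }]
}

rows = flatten_squad_style(sample)
for row in rows:
    print(row["id"], row["question"], "->", row["answer"], row["span_ok"])
```

Validating the answer span during flattening catches offset errors early, before the rows reach training or evaluation code.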

Health Check

Last Commit

2 years ago

Responsiveness

Inactive

Star History

2 stars in the last 30 days