data-selection-survey  by alon-albalak

A survey cataloging research on data selection for language models

Created 2 years ago
253 stars

Top 99.3% on SourcePulse

Project Summary

This repository catalogs research papers on data selection techniques for language models across all training stages. It is a community-driven reference for NLP and ML researchers and practitioners, giving a structured overview of methods for selecting and curating training data.

How It Works

The project is a bibliography: it lists relevant academic papers grouped by data selection sub-topic, such as pretraining filtering, data quality, deduplication, toxicity filtering, and specialized selection for multilingual or instruction-tuned models. Each category gives a structured entry point into the research literature.

Quick Start & Requirements

This repository is a curated list of research papers and does not contain executable code or require installation. It serves as a reference guide rather than a software project.

Highlighted Details

  • Covers data selection techniques for language models, grouped into sub-topics such as pretraining filtering, data quality, deduplication, toxicity filtering, and specialized selection for multilingual and instruction-tuned models.
  • Includes contributions from a team of researchers in language modeling.
  • The primary output is the survey paper itself, available at arXiv:2402.16827, serving as a central reference.
  • Encourages community contributions to maintain its currency and completeness.

Maintenance & Community

The project is maintained by a team of researchers and invites community contributions via pull requests or issues to expand its coverage. The README lists numerous authors and collaborators, indicating academic backing.

Licensing & Compatibility

No specific open-source license is mentioned in the provided README content. Users should assume standard copyright unless otherwise specified.

Limitations & Caveats

As a survey, it depends on community contributions for completeness, and because the field moves quickly, the list may lag behind the latest research. The scope is limited to data selection for language models; other aspects of model training and development are excluded.

Health Check

  • Last Commit: 10 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

1 star in the last 30 days
