data-selection-survey  by alon-albalak

A survey cataloging research on data selection for language models

Created 1 year ago
254 stars

Top 99.1% on SourcePulse

Project Summary

This repository is a curated survey of research papers on data selection techniques for language models, covering all training stages. It is a community-driven resource for NLP and ML researchers and practitioners, giving a structured overview of methods for selecting and curating training data.

How It Works

The project is an organized bibliography of relevant academic papers. Papers are grouped by data selection sub-topic, such as pretraining filtering, data quality, deduplication, toxicity filtering, and specialized selection for multilingual or instruction-tuned models, which gives readers a structured entry point into the research literature.

Quick Start & Requirements

This repository is a curated list of research papers and does not contain executable code or require installation. It serves as a reference guide rather than a software project.

Highlighted Details

  • Covers data selection techniques for language models, grouped into sub-topics such as pretraining filtering, data quality, deduplication, toxicity filtering, and specialized selection for multilingual and instruction-tuned models.
  • Compiled by a team of researchers working on language modeling.
  • The primary output is the survey paper itself, available at arXiv:2402.16827, which serves as the central reference.
  • Encourages community contributions to keep the list current and complete.

Maintenance & Community

The project is maintained by a team of researchers and encourages community contributions, through pull requests or issues, to expand its coverage. The README lists numerous authors and collaborators, which indicates strong academic backing.

Licensing & Compatibility

No specific open-source license is mentioned in the provided README content. Users should assume standard copyright unless otherwise specified.

Limitations & Caveats

As a survey, its primary limitation is that it depends on community contributions for completeness. Because the field evolves rapidly, the list may lag behind the very latest research. The scope is limited to data selection for language models and excludes other aspects of model training or development.

Health Check

  • Last commit: 8 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0

Star History

2 stars in the last 30 days
