alon-albalak
A survey cataloging research on data selection for language models
Top 99.1% on SourcePulse
This repository is a curated survey of research papers on data selection techniques for language models, covering all training stages. It is a community-driven resource for NLP and ML researchers and practitioners, giving a structured overview of methods for selecting and curating training data.
How It Works
The project functions as a bibliography: it lists relevant academic papers and groups them by data selection sub-topic, such as pretraining filtering, data quality, deduplication, toxicity filtering, and specialized selection for multilingual or instruction-tuned models. This gives readers a structured entry point into the research literature.
Quick Start & Requirements
This repository is a curated list of research papers and does not contain executable code or require installation. It serves as a reference guide rather than a software project.
Maintenance & Community
The project is maintained by a team of researchers and encourages community contributions, via pull requests or issues, to expand its coverage. The README lists numerous authors and collaborators, which suggests academic backing.
Licensing & Compatibility
No open-source license is mentioned in the provided README content. Absent a stated license, users should assume that default copyright applies.
Limitations & Caveats
As a survey, its coverage depends on community contributions, and because the field evolves rapidly, the list may lag behind the very latest research. The scope is limited to data selection for language models and excludes other aspects of model training or development.
8 months ago
Inactive