data-selection-survey by alon-albalak

A survey cataloging research on data selection for language models

Created 2 years ago

260 stars

Top 97.5% on SourcePulse

Project Summary

This repository is a curated survey of research papers on data selection techniques for language models, covering all training stages. It is a community-maintained reference for NLP and ML researchers and practitioners, giving a structured overview of methods for selecting and curating training data.

How It Works

The project is a bibliography: it lists relevant academic papers and groups them by data selection sub-topic, such as pretraining filtering, data quality, deduplication, toxicity filtering, and specialized selection for multilingual or instruction-tuned models. This grouping gives readers a structured entry point into the research literature.

Quick Start & Requirements

This repository is a curated list of research papers and does not contain executable code or require installation. It serves as a reference guide rather than a software project.

Highlighted Details

Covers data selection techniques for language models, organized into sub-topics such as pretraining filtering, data quality, deduplication, toxicity filtering, and specialized selection for multilingual and instruction-tuned models.
Authored by a team of researchers working on language modeling.
The primary output is the survey paper itself, available at arXiv:2402.16827, which serves as the central reference.
Encourages community contributions to keep the list current and complete.

Maintenance & Community

The project is maintained by a team of researchers and actively encourages community contributions via pull requests or issues to expand its coverage. The README lists numerous authors and collaborators, indicating a strong academic backing.

Licensing & Compatibility

No specific open-source license is mentioned in the provided README content. Without a stated license, users should assume default copyright (all rights reserved) unless the repository specifies otherwise.

Limitations & Caveats

As a survey, it depends on community contributions for completeness, and because the field evolves rapidly, it may lag behind the very latest research. Its scope is limited to data selection for language models and excludes other aspects of model training or development.
