data_management_LLM by ZigeW

LLM training data management resource list

Created 2 years ago

342 stars

Top 80.5% on SourcePulse

Project Summary

This repository is a curated collection of research papers and resources on data management for training Large Language Models (LLMs). It targets NLP and AI researchers and practitioners who want to understand and optimize data selection, quality, quantity, and composition for both pre-training and fine-tuning. Its primary benefit is a centralized, organized overview of recent advances and open challenges in LLM data management.

How It Works

The repository organizes papers into thematic categories, mirroring the structure of a survey paper on LLM data management. It covers pre-training aspects like domain composition, data quantity, and quality, as well as supervised fine-tuning considerations such as task composition, data quality, and instruction complexity. This structured approach allows users to navigate specific areas of interest within the vast landscape of LLM data research.
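As a rough illustration of that two-level organization, the categories named above can be written as a small mapping from training stage to topics. This is a minimal sketch: the category names come from the summary, while the `TAXONOMY` dictionary and the `find_stages` helper are hypothetical and are not part of the repository.

```python
# Hypothetical sketch: category names are taken from the summary text;
# the dictionary layout and helper function are illustrative only.
TAXONOMY = {
    "pre-training": [
        "domain composition",
        "data quantity",
        "data quality",
    ],
    "supervised fine-tuning": [
        "task composition",
        "data quality",
        "instruction complexity",
    ],
}


def find_stages(topic: str) -> list[str]:
    """Return the training stages whose topic list contains `topic`."""
    return [stage for stage, topics in TAXONOMY.items() if topic in topics]


if __name__ == "__main__":
    # "data quality" appears under both stages in the summary.
    print(find_stages("data quality"))
```

A lookup like this shows why the same topic, such as data quality, can appear under more than one stage when browsing the list.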

Quick Start & Requirements

This repository is a collection of links to research papers and code repositories, not a runnable software package. No installation or specific requirements are needed to browse the content.

Highlighted Details

Comprehensive coverage of data management topics including domain composition, quality filtering, deduplication, toxicity filtering, diversity, social biases, and hallucination sources.
Detailed sections on supervised fine-tuning data, covering task composition, instruction quality, diversity, complexity, and prompt design.
Links to numerous papers with associated code and datasets, facilitating practical exploration and reproduction of research findings.
Includes resources on scaling laws, data-centric AI, and practical guides for LLM development.

Maintenance & Community

The repository is maintained by ZigeW. It appears to be a static collection of curated links, with no explicit mention of community forums, active development, or a roadmap.

Licensing & Compatibility

No license is stated for the repository itself, which hosts links rather than code or data. The linked papers and code repositories carry their own licenses.

Limitations & Caveats

This is a curated list of research papers and not a software tool. It does not provide any implementation or direct functionality for data management. The content is limited to the papers and resources that the curator has identified and linked.

Health Check

Last Commit: 1 year ago
Responsiveness: Inactive

Star History

1 star in the last 30 days