data_management_LLM  by ZigeW

LLM training data management resource list

created 1 year ago
329 stars

Top 84.2% on sourcepulse

Project Summary

This repository is a curated collection of research papers and resources on data management for training Large Language Models (LLMs). It targets NLP and AI researchers and practitioners who want to understand and optimize data selection, quality, quantity, and composition for both pre-training and fine-tuning. Its primary benefit is a centralized, organized overview of recent advances and open challenges in LLM data management.

How It Works

The repository organizes papers into thematic categories, mirroring the structure of a survey paper on LLM data management. It covers pre-training aspects like domain composition, data quantity, and quality, as well as supervised fine-tuning considerations such as task composition, data quality, and instruction complexity. This structured approach allows users to navigate specific areas of interest within the vast landscape of LLM data research.

Quick Start & Requirements

This repository is a collection of links to research papers and code repositories, not a runnable software package. No installation or specific requirements are needed to browse the content.

Highlighted Details

  • Comprehensive coverage of data management topics including domain composition, quality filtering, deduplication, toxicity filtering, diversity, social biases, and hallucination sources.
  • Detailed sections on supervised fine-tuning data, covering task composition, instruction quality, diversity, complexity, and prompt design.
  • Links to numerous papers with associated code and datasets, facilitating practical exploration and reproduction of research findings.
  • Includes resources on scaling laws, data-centric AI, and practical guides for LLM development.

Maintenance & Community

The repository is maintained by ZigeW. It appears to be a static collection of curated links, with no explicit mention of community forums, active development, or a roadmap.

Licensing & Compatibility

The repository does not host code or data directly and therefore does not have a specific license of its own. The linked papers and code repositories carry their own licenses.

Limitations & Caveats

This is a curated list of research papers and not a software tool. It does not provide any implementation or direct functionality for data management. The content is limited to the papers and resources that the curator has identified and linked.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

10 stars in the last 90 days
