This repository is a curated collection of papers and projects focused on the intersection of Large Language Models (LLMs) and data-centric methods. It serves as a comprehensive survey for researchers and practitioners looking to understand and leverage data-related techniques for LLM development, evaluation, and application.
How It Works
The survey categorizes resources across key areas: Data Characteristics for LLM Stages (pretraining, SFT, RAG, etc.), Data Processing (acquisition, deduplication, filtering, selection, mixing, synthesis), Data Storage, Data Serving, and the application of LLMs for Data Management (manipulation, analysis, system optimization). It highlights the DATA4LLM concept, emphasizing Inclusiveness, Abundance, Articulation, and Sanitization as crucial for high-quality LLM datasets.
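To make two of the Data Processing steps named above more concrete, here is a minimal Python sketch of exact deduplication followed by a length-based filter. It is not taken from the repository or the survey. The function names (`normalize`, `deduplicate`, `filter_by_length`), the `min_words` threshold, and the sample corpus are all hypothetical, and methods covered by the survey (for example near-duplicate detection or model-based quality filtering) are considerably more involved.

```python
import hashlib


def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivial variants hash identically.
    return " ".join(text.lower().split())


def deduplicate(docs):
    # Exact deduplication on the normalized text; keeps the first occurrence.
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


def filter_by_length(docs, min_words=3):
    # Hypothetical quality heuristic: drop documents with too few words.
    return [doc for doc in docs if len(doc.split()) >= min_words]


# Illustrative corpus (made up for this sketch).
corpus = [
    "Large language models need clean data.",
    "large  language models need clean data.",
    "Too short",
    "Data mixing balances sources across domains.",
]

cleaned = filter_by_length(deduplicate(corpus))
```

Running deduplication before filtering means the filter only inspects each distinct document once. The order could be reversed without changing the result in this simple case.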
Quick Start & Requirements
This repository is a survey and does not require installation or execution. It provides links to external papers and projects for further exploration.
Maintenance & Community
The repository is associated with the paper "A Survey of LLM × DATA" by Zhou et al. (2025). Further community engagement details are not specified in the README.
Licensing & Compatibility
The repository's license is not confirmed here. Curated lists are commonly released under a permissive license (e.g., MIT, Apache 2.0), so that is likely, but check the repository itself before reuse. The linked papers and projects carry their own respective licenses.
Limitations & Caveats
As a survey, it is a snapshot of the field and may not include the very latest research. The depth of coverage for each topic varies, and users will need to consult the linked resources for detailed information.