awesome-data-llm by OpenDataBox

Survey of LLM x DATA

Created 1 year ago

802 stars

Top 43.2% on SourcePulse

Project Summary

This repository is a curated collection of papers and projects focused on the intersection of Large Language Models (LLMs) and data-centric methods. It serves as a comprehensive survey for researchers and practitioners looking to understand and leverage data-related techniques for LLM development, evaluation, and application.

How It Works

The survey groups resources into five areas: Data Characteristics for LLM Stages (pretraining, SFT, RAG, etc.), Data Processing (acquisition, deduplication, filtering, selection, mixing, synthesis), Data Storage, Data Serving, and LLMs for Data Management (manipulation, analysis, system optimization). The first four fall under DATA4LLM, meaning data techniques in service of LLMs, and the survey emphasizes Inclusiveness, Abundance, Articulation, and Sanitization as the properties that matter most for high-quality LLM datasets.
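
To make the Data Processing category more concrete, the following is a minimal Python sketch of two of the listed steps, deduplication and filtering. It is not taken from the repository or the paper: the function names, the whitespace and case normalization, and the `min_words` threshold are all illustrative assumptions, and the techniques catalogued in the survey (near-duplicate detection, model-based quality scoring) go well beyond this.

```python
import hashlib


def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivially different copies match.
    return " ".join(text.lower().split())


def deduplicate(docs):
    # Exact deduplication on the normalized text (an assumed, simple strategy).
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


def quality_filter(docs, min_words=5):
    # Heuristic filter: drop very short documents (hypothetical threshold).
    return [doc for doc in docs if len(doc.split()) >= min_words]


def process(docs, min_words=5):
    # Deduplicate first, then filter the survivors.
    return quality_filter(deduplicate(docs), min_words)


corpus = [
    "Large language models learn from web-scale text corpora.",
    "large  language models learn from web-scale text corpora.",
    "Click here",
    "Data mixing balances domains such as code, math, and prose.",
]
cleaned = process(corpus)
```

With this toy corpus, the second entry is dropped as a duplicate of the first after normalization, and "Click here" is dropped by the length filter, leaving two documents.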

Quick Start & Requirements

This repository is a survey and does not require installation or execution. It provides links to external papers and projects for further exploration.

Highlighted Details

Comprehensive coverage of data lifecycle stages for LLMs.
Detailed breakdown of data processing techniques, including acquisition, filtering, and selection.
Extensive lists of datasets and tools relevant to LLM data management.
Explores the emerging field of using LLMs to optimize data systems themselves.

Maintenance & Community

The repository is associated with the paper "A Survey of LLM × DATA" by Zhou et al. (2025). Further community engagement details are not specified in the README.

Licensing & Compatibility

The repository's license is not confirmed in this summary. A permissive license (e.g., MIT, Apache 2.0) is likely, as is common for curated lists, but this should be checked against the repository itself. The linked papers and projects carry their own licenses.

Limitations & Caveats

As a survey, it is a snapshot of the field and may not include the latest research. Depth of coverage varies by topic, so users will need to consult the linked resources for details.

Explore Similar Projects

LLM-inference-optimization-paper by chenhongyu2048

SiLLM by armbues

FrugalGPT by stanford-futuredata

distill by samuelfaj

HALOs by ContextualAI

llama.cpp-deepseek-v4-flash by antirez

magpie by magpie-align

DB-GPT by TsinghuaDatabaseGroup

local-llms-analyse-finance by thu-vu92

langfuse-python by langfuse

tiny-llm by skyzh

LLM-Engineers-Handbook by PacktPublishing