LLM4Annotation by Zhen-Tan-dmml

Survey of LLMs for data annotation and synthesis

Created 2 years ago

645 stars

Top 51.2% on SourcePulse

Project Summary

This repository is a curated survey of research papers focused on leveraging Large Language Models (LLMs) for data annotation and synthesis. It serves as a comprehensive resource for researchers and practitioners in Natural Language Processing (NLP) and Machine Learning (ML) interested in automated data generation, dataset creation, and improving model performance through synthetic data.

How It Works

The repository functions as a community-driven bibliography. It compiles and categorizes academic papers and links each one to its source (primarily arXiv preprints). The list is updated regularly to keep pace with rapid progress in LLM-based data annotation and synthesis, and covers techniques such as Chain-of-Thought, self-training, and preference optimization.

Quick Start & Requirements

This repository is a collection of research papers and does not involve direct code execution or installation. Users can access the compiled list of papers and datasets via the provided links.

Highlighted Details

Extensive collection of papers on LLM-based data annotation and synthesis, updated frequently.
Categorization of papers by specific techniques (e.g., Long-CoT Synthesis & Distillation, LLM-as-a-Judge).
Includes links to relevant datasets used in the research.
Complements an EMNLP 2024 oral survey paper on the same topic.

Maintenance & Community

The repository is maintained by Dawei Li and welcomes contributions via Pull Requests. Users who find it useful are encouraged to cite the associated survey paper.

Licensing & Compatibility

No specific license is mentioned for the repository itself, which is a curated list of external research papers. The licensing of each individual paper depends on its publication venue.

Limitations & Caveats

This repository is a bibliography and does not provide code or tools for implementing the discussed techniques. Users must refer to the individual papers for implementation details and potential limitations of specific methods.

Explore Similar Projects

AwesomeOPD by thinkwee

tamingLLMs by souzatharsis

LLM-Synthetic-Data by pengr

IndicLLMSuite by AI4Bharat

DataArc-SynData-Toolkit by DataArcTech

Awesome-Knowledge-Distillation-of-LLMs by Tebmer

awesome-production-llm by jihoo-kim

Awesome-LLM-Synthetic-Data by wasiahmad

Awesome-AIGC by wshzd

LLM-PowerHouse-A-Curated-Guide-for-Large-Language-Models-with-Custom-Training-and-Inferencing by ghimiresunil

so-large-lm by datawhalechina

LLMSurvey by RUCAIBox