Curated list of code LLM research, plus datasets
This repository is a curated list of research papers, models, and datasets on Large Language Models (LLMs) for code and software engineering. It is a broad reference for researchers and practitioners working at the intersection of Natural Language Processing (NLP) and Software Engineering (SE), and gives an organized overview of a rapidly evolving field.
How It Works
The repository categorizes research into broad areas such as LLM architectures (base models, code-adapted LLMs, encoder-decoder models), fine-tuning strategies (instruction tuning, reinforcement learning), and reasoning capabilities (code agents, interactive coding). It also details downstream tasks like code generation, translation, repair, and analysis, alongside relevant datasets and evaluation metrics. The organization aims to provide a structured understanding of the landscape, from foundational models to specific applications.
Quick Start & Requirements
This repository is a collection of links and information, not a runnable software project. No installation or specific requirements are needed to browse its contents.
Maintenance & Community
The repository is actively maintained, with recent updates noted for April 2025. Contributions are welcomed via GitHub issues. The primary contributors are associated with the AI Native team at Ant Group, who also maintain the open-source project CodeFuse.
Licensing & Compatibility
The repository itself is a collection of links and does not have a specific license. Each linked paper and dataset carries its own license, so check the original source before reusing it.
Limitations & Caveats
As a curated list, the repository's value depends on the completeness and accuracy of its entries. Although extensive, it may not cover every relevant publication in this fast-moving field.