Awesome-Code-LLM by codefuse-ai

Curated list of code LLM research, plus datasets

Created 2 years ago
2,886 stars

Top 16.5% on SourcePulse

Project Summary

This repository is a curated list of research papers, models, and datasets related to Large Language Models (LLMs) for code and software engineering activities. It serves as a comprehensive resource for researchers and practitioners interested in the intersection of Natural Language Processing (NLP) and Software Engineering (SE), providing an organized overview of the rapidly evolving field.

How It Works

The repository categorizes research into broad areas such as LLM architectures (base models, code-adapted LLMs, encoder-decoder models), fine-tuning strategies (instruction tuning, reinforcement learning), and reasoning capabilities (code agents, interactive coding). It also details downstream tasks like code generation, translation, repair, and analysis, alongside relevant datasets and evaluation metrics. The organization aims to provide a structured understanding of the landscape, from foundational models to specific applications.

Quick Start & Requirements

This repository is a collection of links and information, not a runnable software project. No installation or specific requirements are needed to browse its contents.

Highlighted Details

  • Features recent papers and models, including contributions from CodeFuse (GALLa, CodeFuse-CGM, EasyDeploy, Rodimus, CodeFuse-CGE).
  • Includes a comprehensive list of surveys on LLMs for code, covering both NLP and SE perspectives.
  • Provides extensive lists of LLMs, datasets, and benchmarks relevant to code intelligence.
  • Offers recommended readings for those new to NLP or LLMs.

Maintenance & Community

The repository is actively maintained, with its most recent updates noted in April 2025. Contributions are welcome via GitHub issues. The primary contributors are associated with the AI Native team at Ant Group, which also maintains the open-source project CodeFuse.

Licensing & Compatibility

The repository itself is a collection of links and does not have a specific license. The linked papers and datasets have their own respective licenses.

Limitations & Caveats

As a curated list, the repository's value depends on the completeness and accuracy of its entries. While extensive, it may not capture every relevant publication in this fast-moving field.

Health Check

  • Last Commit: 5 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 4
  • Issues (30d): 0
  • Star History: 73 stars in the last 30 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Travis Fischer (Founder of Agentic), and 6 more.

AlphaCodium by Codium-ai

0.1%
4k
Code generation research paper implementation
Created 1 year ago
Updated 9 months ago
Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Omar Khattab (Coauthor of DSPy, ColBERT; Professor at MIT), and 5 more.

CodeXGLUE by microsoft

0.3%
2k
Benchmark for code intelligence tasks
Created 5 years ago
Updated 1 year ago
Starred by Lewis Tunstall (Research Engineer at Hugging Face), Eric Zhu (Coauthor of AutoGen; Research Scientist at Microsoft Research), and 6 more.

awesome-machine-learning-on-source-code by src-d

0.1%
6k
Curated list of ML applied to source code (MLonCode)
Created 8 years ago
Updated 4 years ago