CodeFuse-CGM by codefuse-ai

Code Graph LLM for repository-level software engineering tasks

created 7 months ago
401 stars

Top 72.0% on SourcePulse

Project Summary

CodeFuse-CGM is a framework for repository-level software engineering tasks that augments Large Language Models (LLMs) with a graph-based representation of the codebase. It targets developers and researchers aiming to automate issue resolution: by modeling code context and structure explicitly, it improves automated code repair and analysis.

How It Works

CGM constructs a repository-level code graph to represent project context, then runs a Retrieval-Augmented Generation (RAG) pipeline with four stages: Rewriter, Retriever, Reranker, and Reader. The Rewriter analyzes the issue and generates search queries, the Retriever finds relevant code subgraphs, the Reranker prioritizes files within those subgraphs, and the Reader generates code patches from the refined context. This graph-integrated approach lets models generalize across a range of software engineering tasks.
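The four-stage flow can be sketched on a toy code graph. This is a hypothetical illustration only: the stage names come from the project, but the keyword matching, graph layout, and function signatures below are assumptions standing in for the real LLM-driven components.

```python
# Hypothetical sketch of CGM's Rewriter -> Retriever -> Reranker -> Reader
# pipeline. In the real system the Rewriter and Reader are LLMs; here they
# are toy stand-ins so the data flow is visible end to end.
import networkx as nx

def build_code_graph():
    # Nodes are files; edges model import relations in the repository.
    g = nx.DiGraph()
    g.add_edge("app.py", "auth.py", relation="imports")
    g.add_edge("auth.py", "utils.py", relation="imports")
    g.add_edge("app.py", "views.py", relation="imports")
    return g

def rewriter(issue):
    # Turn an issue report into search keywords (real CGM uses an LLM).
    return [w.strip(".,").lower() for w in issue.split() if len(w) > 3]

def retriever(graph, queries):
    # Seed on files whose names match a query term, then expand to
    # neighbors so structural context is retained in the subgraph.
    seeds = {n for n in graph if any(q in n for q in queries)}
    context = set(seeds)
    for n in seeds:
        context |= set(graph.successors(n)) | set(graph.predecessors(n))
    return graph.subgraph(context)

def reranker(subgraph, queries):
    # Rank files: direct query matches first, then by connectivity.
    return sorted(
        subgraph.nodes,
        key=lambda n: (-any(q in n for q in queries), -subgraph.degree(n)),
    )

def reader(ranked_files):
    # Stand-in for patch generation: report the file the model would edit.
    return f"patch target: {ranked_files[0]}"

graph = build_code_graph()
queries = rewriter("Login fails because auth token validation raises KeyError")
ranked = reranker(retriever(graph, queries), queries)
print(reader(ranked))  # the issue's "auth" keyword leads to auth.py
```

The same shape (query generation, subgraph retrieval, file reranking, patch generation) is what the project's per-component scripts implement at full scale.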

Quick Start & Requirements

  • Installation: Install required packages via pip: transformers==4.46.1, tokenizers==0.20.0, accelerate==1.0.1, peft==0.13.2, jinja2==2.11.3, fuzzywuzzy==0.18.0, python-Levenshtein==0.25.1, networkx==3.0.
  • Prerequisites: Python 3.8+. CGE-large requires torch==2.1.0, transformers==4.39.2, tokenizers==0.15.2, and accelerate==0.28.0; the Retriever requires RapidFuzz==1.5.0 and faiss-cpu; the Reranker needs vllm>=0.8.5.
  • Setup: Requires generating node embeddings, rewriter embeddings, and running inference scripts for each component.
  • Documentation: Detailed examples and prompt generation functions are provided within the repository.
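The base dependencies listed above can be installed in one step (versions exactly as pinned in the project's requirements):

```shell
pip install \
  "transformers==4.46.1" "tokenizers==0.20.0" "accelerate==1.0.1" \
  "peft==0.13.2" "jinja2==2.11.3" "fuzzywuzzy==0.18.0" \
  "python-Levenshtein==0.25.1" "networkx==3.0"
```

Note that the component-specific pins above (e.g. transformers==4.39.2 for CGE-large versus 4.46.1 here) conflict, so separate virtual environments per component are advisable.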

Highlighted Details

  • Achieved 44.00% resolve rate on SWE-Bench-Lite leaderboard with CGM-72B-V1.2.
  • Supports multi-framework training with Accelerate (Deepspeed, FSDP).
  • Efficient fine-tuning options include LoRA, QLoRA, and full-parameter training.
  • The framework is modular, with distinct scripts for Rewriter, Retriever, Reranker, and Reader components.

Maintenance & Community

The project is actively developed by the AI Native team at Ant Group, with recent updates in January 2025. They have a strong track record of publications and open-source contributions (CodeFuse project). Community contributions are welcomed via pull requests and issues.

Licensing & Compatibility

The repository does not explicitly state a license in the provided README. Users should verify licensing for commercial or closed-source use.

Limitations & Caveats

The setup involves multiple complex steps for generating embeddings and running inference for each module, requiring significant computational resources and technical expertise. Specific model weights (e.g., CGE-large, Qwen Model) are referenced but not directly linked for download in the README.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 1
  • Issues (30d): 0
  • Star history: 89 stars in the last 30 days
