CodeFuse-CGM by codefuse-ai

Code Graph LLM for repository-level software engineering tasks

created 7 months ago
401 stars

Top 72.0% on SourcePulse

Project Summary

CodeFuse-CGM is a framework for repository-level software engineering tasks that augments Large Language Models (LLMs) with a graph-based representation of the codebase. It targets developers and researchers aiming to automate issue resolution: by modeling code context and structure explicitly, it improves automated code repair and analysis.

How It Works

CGM constructs a repository-level code graph to represent project context, then runs a Retrieval-Augmented Generation (RAG) pipeline with four stages: Rewriter, Retriever, Reranker, and Reader. The Rewriter analyzes the issue and generates search queries, the Retriever finds relevant code subgraphs, the Reranker prioritizes files within those subgraphs, and the Reader generates code patches from the refined context. This graph-integrated approach lets models generalize across a range of software engineering tasks.
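The four-stage flow can be sketched on a toy code graph. This is a hypothetical illustration only: the stage names come from the project, but the keyword matching, graph layout, and function signatures below are assumptions standing in for the real LLM-driven components.

```python
# Hypothetical sketch of CGM's Rewriter -> Retriever -> Reranker -> Reader
# pipeline. In the real system the Rewriter and Reader are LLMs; here they
# are toy stand-ins so the data flow is visible end to end.
import networkx as nx

def build_code_graph():
    # Nodes are files; edges model import relations in the repository.
    g = nx.DiGraph()
    g.add_edge("app.py", "auth.py", relation="imports")
    g.add_edge("auth.py", "utils.py", relation="imports")
    g.add_edge("app.py", "views.py", relation="imports")
    return g

def rewriter(issue):
    # Turn an issue report into search keywords (real CGM uses an LLM).
    return [w.strip(".,").lower() for w in issue.split() if len(w) > 3]

def retriever(graph, queries):
    # Seed on files whose names match a query term, then expand to
    # neighbors so structural context is retained in the subgraph.
    seeds = {n for n in graph if any(q in n for q in queries)}
    context = set(seeds)
    for n in seeds:
        context |= set(graph.successors(n)) | set(graph.predecessors(n))
    return graph.subgraph(context)

def reranker(subgraph, queries):
    # Rank files: direct query matches first, then by connectivity.
    return sorted(
        subgraph.nodes,
        key=lambda n: (-any(q in n for q in queries), -subgraph.degree(n)),
    )

def reader(ranked_files):
    # Stand-in for patch generation: report the file the model would edit.
    return f"patch target: {ranked_files[0]}"

graph = build_code_graph()
queries = rewriter("Login fails because auth token validation raises KeyError")
ranked = reranker(retriever(graph, queries), queries)
print(reader(ranked))  # the issue's "auth" keyword leads to auth.py
```

The same shape (query generation, subgraph retrieval, file reranking, patch generation) is what the project's per-component scripts implement at full scale.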

Quick Start & Requirements

  • Installation: Install required packages via pip: transformers==4.46.1, tokenizers==0.20.0, accelerate==1.0.1, peft==0.13.2, jinja2==2.11.3, fuzzywuzzy==0.18.0, python-Levenshtein==0.25.1, networkx==3.0.
  • Prerequisites: Python 3.8+. CGE-large requires torch==2.1.0, transformers==4.39.2, tokenizers==0.15.2, and accelerate==0.28.0; the Retriever requires RapidFuzz==1.5.0 and faiss-cpu; the Reranker needs vllm>=0.8.5.
  • Setup: Requires generating node embeddings, rewriter embeddings, and running inference scripts for each component.
  • Documentation: Detailed examples and prompt generation functions are provided within the repository.
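The base dependencies listed above can be installed in one step (versions exactly as pinned in the project's requirements):

```shell
pip install \
  "transformers==4.46.1" "tokenizers==0.20.0" "accelerate==1.0.1" \
  "peft==0.13.2" "jinja2==2.11.3" "fuzzywuzzy==0.18.0" \
  "python-Levenshtein==0.25.1" "networkx==3.0"
```

Note that the component-specific pins above (e.g. transformers==4.39.2 for CGE-large versus 4.46.1 here) conflict, so separate virtual environments per component are advisable.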

Highlighted Details

  • Achieved 44.00% resolve rate on SWE-Bench-Lite leaderboard with CGM-72B-V1.2.
  • Supports multi-framework training with Accelerate (Deepspeed, FSDP).
  • Efficient fine-tuning options include LoRA, QLoRA, and full-parameter training.
  • The framework is modular, with distinct scripts for Rewriter, Retriever, Reranker, and Reader components.

Maintenance & Community

The project is actively developed by the AI Native team at Ant Group, with recent updates in January 2025. They have a strong track record of publications and open-source contributions (CodeFuse project). Community contributions are welcomed via pull requests and issues.

Licensing & Compatibility

The repository does not explicitly state a license in the provided README. Users should verify licensing for commercial or closed-source use.

Limitations & Caveats

The setup involves multiple complex steps for generating embeddings and running inference for each module, requiring significant computational resources and technical expertise. Specific model weights (e.g., CGE-large, Qwen Model) are referenced but not directly linked for download in the README.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 1
  • Issues (30d): 0
  • Star history: 89 stars in the last 30 days
