CodeFuse-CGM by codefuse-ai

Code Graph LLM for repository-level software engineering tasks

Created 1 year ago
523 stars

Top 60.2% on SourcePulse

Project Summary

CodeFuse-CGM is a framework for repository-level software engineering tasks that combines a graph-based representation of the repository with Large Language Models (LLMs). It targets developers and researchers who want to automate issue resolution by giving the model the code's context and structure. The project reports a 44.00% resolve rate on SWE-Bench-Lite with CGM-72B-V1.2.

How It Works

CGM constructs a repository-level code graph to represent project context. It then runs a Retrieval-Augmented Generation (RAG) pipeline with four stages: Rewriter, Retriever, Reranker, and Reader. The Rewriter analyzes the issue and generates queries; the Retriever finds the relevant code subgraphs; the Reranker prioritizes files within those subgraphs; and the Reader generates a code patch from the refined context. Because the graph is part of the model's input, the same approach is intended to generalize across different software engineering tasks.
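The four stages can be sketched end to end on a toy graph. This is only an illustration of the data flow described above, not the project's code: every name here (`CodeGraph`, `rewrite`, `retrieve`, `rerank`, `read`, `demo_graph`) is hypothetical, keyword matching stands in for the embedding and LLM steps, and the Reader stops at building a prompt rather than calling a model.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class CodeGraph:
    """Toy repository graph: nodes are files, edges are references between them."""
    text: Dict[str, str] = field(default_factory=dict)        # node -> source text
    edges: Dict[str, Set[str]] = field(default_factory=dict)  # node -> neighbours

    def add_node(self, name: str, source: str) -> None:
        self.text[name] = source
        self.edges.setdefault(name, set())

    def add_edge(self, a: str, b: str) -> None:
        # Stored undirected so a subgraph can be expanded in both directions.
        self.edges[a].add(b)
        self.edges[b].add(a)


def rewrite(issue: str) -> List[str]:
    """Rewriter stand-in: turn an issue into lowercase keyword queries."""
    stop = {"the", "a", "in", "is", "when", "on", "of", "to"}
    words = (w.strip(".,:;()").lower() for w in issue.split())
    return [w for w in words if w and w not in stop]


def retrieve(graph: CodeGraph, queries: List[str]) -> Set[str]:
    """Retriever stand-in: keyword-matched nodes plus their one-hop neighbours."""
    seeds = {n for n, src in graph.text.items()
             if any(q in src.lower() or q in n.lower() for q in queries)}
    subgraph = set(seeds)
    for n in seeds:
        subgraph |= graph.edges[n]
    return subgraph


def rerank(graph: CodeGraph, nodes: Set[str], queries: List[str],
           top_k: int = 2) -> List[str]:
    """Reranker stand-in: order files by query hits, ties broken by name."""
    def score(n: str) -> int:
        hay = (n + " " + graph.text[n]).lower()
        return sum(hay.count(q) for q in queries)
    return sorted(nodes, key=lambda n: (-score(n), n))[:top_k]


def read(graph: CodeGraph, ranked: List[str], issue: str) -> str:
    """Reader stand-in: assemble the prompt a patch-generating model would see."""
    context = "\n".join(f"# {n}\n{graph.text[n]}" for n in ranked)
    return f"Issue: {issue}\n\nContext:\n{context}\n\nPatch:"


def run_pipeline(graph: CodeGraph, issue: str, top_k: int = 2) -> str:
    queries = rewrite(issue)
    subgraph = retrieve(graph, queries)
    ranked = rerank(graph, subgraph, queries, top_k)
    return read(graph, ranked, issue)


def demo_graph() -> CodeGraph:
    """Four made-up files; only two mention the function named in the issue."""
    g = CodeGraph()
    g.add_node("parser.py", "def parse_date(s): return s.split('-')")
    g.add_node("cli.py", "from parser import parse_date")
    g.add_node("models.py", "class Record: pass")
    g.add_node("utils.py", "def helper(): pass")
    g.add_edge("cli.py", "parser.py")
    g.add_edge("parser.py", "models.py")
    return g
```

Calling `run_pipeline(demo_graph(), "parse_date crashes on empty input")` returns a prompt whose context holds the two highest-ranked files. In the sketch, the graph matters at the retrieval step: `models.py` matches no query but is pulled into the subgraph because it neighbours a matching file, while the unconnected `utils.py` is left out.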

Quick Start & Requirements

  • Installation: Install required packages via pip: transformers==4.46.1, tokenizers==0.20.0, accelerate==1.0.1, peft==0.13.2, jinja2==2.11.3, fuzzywuzzy==0.18.0, python-Levenshtein==0.25.1, networkx==3.0.
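  • Prerequisites: Python 3.8+. Some components have their own pins: CGE-large needs torch==2.1.0, transformers==4.39.2, tokenizers==0.15.2, and accelerate==0.28.0; the Retriever needs RapidFuzz==1.5.0 and faiss-cpu; the Reranker needs vllm>=0.8.5. The CGE-large pins differ from the versions in the installation list, so a separate environment for it is likely needed.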
  • Setup: Requires generating node embeddings, rewriter embeddings, and running inference scripts for each component.
  • Documentation: Detailed examples and prompt generation functions are provided within the repository.
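Because the pins are exact, it can help to check an environment against them before running anything. The following is a minimal sketch using only the standard library; `PINS` is copied from the installation bullet above, while `check_pins` is a hypothetical helper and not something the repository ships.

```python
from importlib import metadata
from typing import Callable, Dict, List, Tuple

# Exact versions taken from the installation bullet above.
PINS: Dict[str, str] = {
    "transformers": "4.46.1",
    "tokenizers": "0.20.0",
    "accelerate": "1.0.1",
    "peft": "0.13.2",
    "jinja2": "2.11.3",
    "fuzzywuzzy": "0.18.0",
    "python-Levenshtein": "0.25.1",
    "networkx": "3.0",
}


def check_pins(
    pins: Dict[str, str],
    get_version: Callable[[str], str] = metadata.version,
) -> List[Tuple[str, str, str]]:
    """Return (package, wanted, found) for every pin that is not satisfied."""
    problems = []
    for name, wanted in pins.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = "not installed"
        if found != wanted:
            problems.append((name, wanted, found))
    return problems


if __name__ == "__main__":
    for name, wanted, found in check_pins(PINS):
        print(f"{name}: want {wanted}, found {found}")
```

The `get_version` parameter exists only so the check can be exercised without touching the real environment. The same function could be pointed at a second dictionary holding the CGE-large pins.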

Highlighted Details

  • Achieved 44.00% resolve rate on SWE-Bench-Lite leaderboard with CGM-72B-V1.2.
  • Supports multi-framework training through Accelerate (DeepSpeed, FSDP).
  • Efficient fine-tuning options include LoRA, QLoRA, and full-parameter training.
  • The framework is modular, with distinct scripts for Rewriter, Retriever, Reranker, and Reader components.
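To give a sense of why LoRA counts as an efficient option in the list above, the arithmetic for a single weight matrix can be worked out directly. LoRA freezes a weight of shape d_out x d_in and trains two low-rank factors of shapes d_out x r and r x d_in. The dimensions and rank below are illustrative assumptions, not CGM's actual configuration.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole weight matrix is updated."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters in the two adapter factors (d_out x r and r x d_in)."""
    return rank * (d_out + d_in)


if __name__ == "__main__":
    d_out, d_in, rank = 4096, 4096, 8  # assumed sizes, for illustration only
    full = full_params(d_out, d_in)
    lora = lora_params(d_out, d_in, rank)
    print(f"full: {full}, LoRA: {lora}, ratio: {lora / full:.4%}")
```

QLoRA keeps the same adapter count but stores the frozen base weights in 4-bit precision, which reduces memory further. Full-parameter training corresponds to `full_params` for every matrix in the model.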

Maintenance & Community

The project is actively developed by the AI Native team at Ant Group, with recent updates in January 2025. The team has a track record of publications and open-source contributions through the CodeFuse project. Community contributions are welcomed via pull requests and issues.

Licensing & Compatibility

The repository does not explicitly state a license in the provided README. Users should verify licensing for commercial or closed-source use.

Limitations & Caveats

The setup involves multiple complex steps for generating embeddings and running inference for each module, requiring significant computational resources and technical expertise. Specific model weights (e.g., CGE-large, Qwen Model) are referenced but not directly linked for download in the README.

Health Check

  • Last Commit: 5 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 6 stars in the last 30 days
