CodeGeeX by zai-org

Code generation model for multilingual programming

Created 3 years ago
8,632 stars

Top 6.0% on SourcePulse

Project Summary

CodeGeeX is a 13-billion parameter, open-source, multilingual code generation model designed for tasks like code completion, translation, and summarization. It targets developers and researchers seeking to leverage large language models for programming assistance and evaluation across multiple languages.

How It Works

CodeGeeX is a transformer-based, decoder-only model trained on a corpus of over 158.7 billion tokens spanning 23 programming languages. It uses a vocabulary of 50,400 tokens, treating whitespace as separate tokens. The architecture has 40 transformer layers with a hidden size of 5,120 and an expanded feed-forward size of 20,480, supporting a maximum sequence length of 2,048 tokens.
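A back-of-the-envelope check shows these architecture numbers are consistent with the stated 13B parameter count. This is a hedged sketch based only on the figures above; it ignores biases, layer norms, and any model-specific extras (e.g., CodeGeeX's final query layer):

```python
# Rough parameter count from the stated architecture.
# Hedged: omits biases, layer norms, and model-specific extras.
n_layers, d_model, d_ffn = 40, 5120, 20480
vocab, max_seq = 50400, 2048

embed = vocab * d_model + max_seq * d_model   # token + position embeddings
attn_per_layer = 4 * d_model * d_model        # Q, K, V, and output projections
ffn_per_layer = 2 * d_model * d_ffn           # up- and down-projections
total = embed + n_layers * (attn_per_layer + ffn_per_layer)

print(f"~{total / 1e9:.1f}B parameters")      # ~12.9B, i.e. roughly 13B
```

The per-layer cost is dominated by the feed-forward block (2 × 5,120 × 20,480 ≈ 210M parameters per layer) versus attention (4 × 5,120² ≈ 105M), which is typical for a 4× FFN expansion.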

Quick Start & Requirements

  • Installation: pip install -e . or use the provided Docker image (docker pull codegeex/codegeex:latest).
  • Prerequisites: Python 3.7+, CUDA 11+, PyTorch 1.10+, DeepSpeed 0.6+.
  • Model Weights: Requires application and download (~26GB).
  • Inference: Supports single-GPU (27GB+ GPU memory), quantized (15GB+ GPU memory), and multi-GPU model-parallel inference (<6GB per GPU).
  • Resources: Official VS Code and JetBrains extensions are available.
  • Links: Homepage, DEMO, Model Weights, Paper, HumanEval-X.
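The memory figures in the bullets above follow directly from the 13B parameter count. A hedged sanity check (assumed precisions and an 8-way split are illustrative, not from the repo; real usage adds overhead for activations and the KV cache, which explains the extra headroom in the stated requirements):

```python
# Why a 13B-parameter model implies these memory tiers.
params = 13e9

fp16_gib = params * 2 / 2**30      # 2 bytes/param  -> ~24 GiB (matches ~26GB weights)
int8_gib = params * 1 / 2**30      # 1 byte/param   -> ~12 GiB after quantization
per_gpu_8way = fp16_gib / 8        # fp16 split across 8 GPUs -> ~3 GiB each

print(f"fp16 ~{fp16_gib:.0f} GiB, int8 ~{int8_gib:.0f} GiB, "
      f"8-way ~{per_gpu_8way:.1f} GiB/GPU")
```

Quantization halves the weight footprint, and model parallelism divides it across devices, which is why the per-GPU requirement can drop below 6GB.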

Highlighted Details

  • Achieves state-of-the-art average performance on the HumanEval-X benchmark for multilingual code generation.
  • Supports cross-lingual code translation between 5 languages (Python, C++, Java, JavaScript, Go).
  • Offers IDE extensions for VS Code and JetBrains for integrated coding assistance.
  • Quantized and model-parallel inference options reduce GPU memory requirements.

Maintenance & Community

  • Development has continued through successor releases such as CodeGeeX2 and CodeGeeX4.
  • Community engagement via Discord, Slack, and Telegram.
  • Supported by Tsinghua University (KEG, IIIS), Peng Cheng Laboratory, and Zhipu.AI.

Licensing & Compatibility

  • Code is licensed under Apache-2.0.
  • Model weights have a separate, unspecified license. Commercial use may require clarification.

Limitations & Caveats

The model weights license is not explicitly detailed, which may restrict commercial use. While competitive overall, translation performance varies across language pairs.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 43 stars in the last 30 days

Starred by Lewis Tunstall (Research Engineer at Hugging Face), Eric Zhu (Coauthor of AutoGen; Research Scientist at Microsoft Research), and 6 more.

Explore Similar Projects

awesome-machine-learning-on-source-code by src-d — curated list of machine learning applied to source code (MLonCode). ~6k stars, top 0.1%. Created 8 years ago; last updated 4 years ago.