naturalcc by CGCL-codes

Toolkit for natural code comprehension and intelligence tasks

created 4 years ago
302 stars

Top 89.3% on sourcepulse

View on GitHub
Project Summary

NaturalCC is an open-source toolkit for code intelligence, designed for researchers and developers to train custom machine learning models for software engineering tasks like code generation, summarization, and retrieval. It leverages advanced sequence modeling techniques to bridge the gap between programming and natural languages, offering a modular framework and support for state-of-the-art large code models.

How It Works

Built on Fairseq's registry mechanism, NaturalCC provides a modular and extensible framework for diverse code intelligence tasks. It supports state-of-the-art large code models (Code Llama, CodeT5, StarCoder) and includes tools for feature extraction using compiler frameworks like LLVM. The toolkit is optimized for efficient multi-GPU training using NCCL and torch.distributed, with support for both FP32 and FP16 computations.
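
The registry mechanism mentioned above can be illustrated with a minimal sketch. This is a generic Fairseq-style registry pattern, not NaturalCC's actual API; the names (`register_task`, `setup_task`, `SummarizationTask`) are illustrative assumptions.

```python
# Minimal sketch of a Fairseq-style registry (illustrative names,
# not NaturalCC's actual API).
TASK_REGISTRY = {}

def register_task(name):
    """Decorator that registers a task class under a string key."""
    def wrapper(cls):
        if name in TASK_REGISTRY:
            raise ValueError(f"task {name!r} is already registered")
        TASK_REGISTRY[name] = cls
        return cls
    return wrapper

@register_task("summarization")
class SummarizationTask:
    """A hypothetical task; real tasks would hold data and model config."""
    def __init__(self, args=None):
        self.args = args

def setup_task(name, args=None):
    """Instantiate a registered task by name, e.g. from a CLI flag."""
    return TASK_REGISTRY[name](args)
```

New tasks, models, or criteria plug into the framework by registering themselves under a string key, so the training loop can look them up from command-line arguments without hard-coded imports.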

Quick Start & Requirements

  • Installation: Clone the repository, install dependencies via pip install -r requirements.txt, and install the package with pip install --editable ./src.
  • Prerequisites: GCC/G++ 5.0+, an NVIDIA GPU with NCCL and the CUDA Toolkit (recommended for training), and a Hugging Face token for certain models (e.g., StarCoder). The environment-creation instructions specify Python 3.6, though recent updates imply compatibility with newer versions.
  • Resources: Requires downloading model checkpoints. Training and inference examples are provided.
  • Links: Paper, Demo

Highlighted Details

  • Supports state-of-the-art large code models including Code Llama, CodeT5, CodeGen, and StarCoder.
  • Offers access to preprocessed benchmarks like Human-Eval, CodeSearchNet, and Py150.
  • Includes scripts for feature extraction using LLVM.
  • Benchmarks multiple downstream tasks with evaluation capabilities using metrics like pass@k.
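
The pass@k metric mentioned above can be computed with the standard unbiased estimator from the HumanEval benchmark (this is the general formula, not NaturalCC-specific code): pass@k = 1 − C(n−c, k)/C(n, k), where n is the number of samples generated per problem and c the number that pass.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (HumanEval-style).

    n: total samples generated for a problem
    c: number of samples that passed the tests
    k: evaluation budget
    """
    if n - c < k:
        return 1.0  # every size-k subset contains at least one pass
    # Numerically stable form of 1 - C(n - c, k) / C(n, k)
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))
```

For example, with 2 samples of which 1 passes, pass@1 is 0.5.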

Maintenance & Community

The project released NaturalCC 2.0 in November 2023, integrating compatibility with Hugging Face Transformers. It has also incorporated code from research papers on neural code search and structural analysis of pre-trained models. The project welcomes contributions.

Licensing & Compatibility

NaturalCC is open-sourced under the MIT license, which is permissive for commercial use and closed-source linking.

Limitations & Caveats

The initial environment setup specifies Python 3.6, which may require careful management for users on newer Python versions. While the toolkit supports various large code models, specific model compatibility and performance may vary.

Health Check
Last commit

2 days ago

Responsiveness

1 day

Pull Requests (30d)
0
Issues (30d)
1
Star History
13 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Alex Cheema (cofounder of EXO Labs), and 1 more.

recurrent-pretraining by seal-rg

0.1%
806
Pretraining code for depth-recurrent language model research
created 5 months ago
updated 2 weeks ago