naturalcc  by CGCL-codes

Toolkit for natural code comprehension and intelligence tasks

Created 4 years ago
304 stars

Top 88.0% on SourcePulse

Project Summary

NaturalCC is an open-source toolkit for code intelligence. It lets researchers and developers train custom machine learning models for software engineering tasks such as code generation, summarization, and retrieval. It applies sequence modeling techniques to connect programming languages with natural language, and it offers a modular framework with support for state-of-the-art large code models.

How It Works

Built on Fairseq's registry mechanism, NaturalCC provides a modular and extensible framework for diverse code intelligence tasks. It supports state-of-the-art large code models (Code Llama, CodeT5, StarCoder) and includes tools for feature extraction using compiler frameworks like LLVM. The toolkit is optimized for efficient multi-GPU training using NCCL and torch.distributed, with support for both FP32 and FP16 computations.
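To illustrate what a registry mechanism of this kind looks like, here is a minimal sketch in plain Python. It is not NaturalCC's or Fairseq's actual code: the names TASK_REGISTRY, register_task, build_task, and SummarizationTask are hypothetical and chosen only for illustration. The idea is that components register themselves under a string name, and the framework later looks them up by that name (for example, from a config file).

```python
from typing import Callable, Dict

# Hypothetical global registry mapping a task name to its class.
TASK_REGISTRY: Dict[str, type] = {}


def register_task(name: str) -> Callable[[type], type]:
    """Class decorator that records a task class under `name`."""

    def decorator(cls: type) -> type:
        if name in TASK_REGISTRY:
            raise ValueError(f"Cannot register duplicate task: {name}")
        TASK_REGISTRY[name] = cls
        return cls

    return decorator


@register_task("summarization")
class SummarizationTask:
    """Hypothetical task used only to demonstrate registration."""

    def describe(self) -> str:
        return "code summarization"


def build_task(name: str):
    """Instantiate a registered task by name."""
    if name not in TASK_REGISTRY:
        raise ValueError(f"Unknown task: {name}")
    return TASK_REGISTRY[name]()


task = build_task("summarization")
print(task.describe())
```

With this pattern, adding a new task means writing one decorated class; no central dispatch table has to be edited, which is what makes such a framework easy to extend.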

Quick Start & Requirements

  • Installation: Clone the repository, install dependencies via pip install -r requirements.txt, and install the package with pip install --editable ./src.
  • Prerequisites: GCC/G++ 5.0+; an NVIDIA GPU with NCCL and the CUDA Toolkit (recommended for training); a Hugging Face token for certain models (e.g., StarCoder). The setup instructions create a Python 3.6 environment; recent updates suggest that newer versions also work, but this is not stated explicitly.
  • Resources: Requires downloading model checkpoints. Training and inference examples are provided.
  • Links: Paper, Demo

Highlighted Details

  • Supports state-of-the-art large code models including Code Llama, CodeT5, CodeGen, and StarCoder.
  • Offers access to preprocessed benchmarks like Human-Eval, CodeSearchNet, and Py150.
  • Includes scripts for feature extraction using LLVM.
  • Benchmarks multiple downstream tasks and evaluates them with metrics such as pass@k.
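For readers unfamiliar with pass@k, the sketch below computes the commonly used unbiased estimator popularized alongside the HumanEval benchmark: given n generated samples for a problem, of which c pass the tests, pass@k = 1 - C(n-c, k) / C(n, k). This is an assumption about the usual definition, not a copy of NaturalCC's evaluation code, which may differ; the function name pass_at_k is hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct.

    n: total samples generated for the problem
    c: number of those samples that pass the tests
    k: sample budget being evaluated (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k failing samples: every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # 1 minus the fraction of size-k subsets made up only of failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 10 samples, 3 correct, budget of 1 sample.
print(pass_at_k(n=10, c=3, k=1))
```

A benchmark-level score is then typically the mean of this value over all problems.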

Maintenance & Community

NaturalCC 2.0 was released in November 2023 and added compatibility with Hugging Face Transformers. The project has also incorporated code from research papers on neural code search and on structural analysis of pre-trained models. Contributions are welcome.

Licensing & Compatibility

NaturalCC is released under the MIT license, which permits commercial use and linking from closed-source software.

Limitations & Caveats

The initial environment setup specifies Python 3.6, so users on newer Python versions may need a dedicated environment or should verify compatibility themselves. The toolkit supports several large code models, but compatibility and performance can vary from model to model.

Health Check

  • Last commit: 2 days ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 2 stars in the last 30 days
