Toolkit for natural code comprehension and intelligence tasks
NaturalCC is an open-source toolkit for code intelligence, designed for researchers and developers to train custom machine learning models for software engineering tasks like code generation, summarization, and retrieval. It leverages advanced sequence modeling techniques to bridge the gap between programming and natural languages, offering a modular framework and support for state-of-the-art large code models.
How It Works
Built on Fairseq's registry mechanism, NaturalCC provides a modular and extensible framework for diverse code intelligence tasks. It supports state-of-the-art large code models (Code Llama, CodeT5, StarCoder) and includes tools for feature extraction using compiler frameworks like LLVM. The toolkit is optimized for efficient multi-GPU training using NCCL and torch.distributed, with support for both FP32 and FP16 computations.
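To make the registry idea concrete, here is a minimal sketch of a decorator-based registry in the general style that Fairseq popularized. It is an illustration only: `MODEL_REGISTRY`, `register_model`, `build_model`, and `ToySummarizer` are hypothetical names defined here, not NaturalCC's or Fairseq's actual API.

```python
from typing import Callable, Dict, Type

# Hypothetical registry for illustration; not NaturalCC's or Fairseq's real API.
MODEL_REGISTRY: Dict[str, Type] = {}


def register_model(name: str) -> Callable[[Type], Type]:
    """Return a class decorator that records the class under `name`."""

    def decorator(cls: Type) -> Type:
        if name in MODEL_REGISTRY:
            raise ValueError(f"Cannot register duplicate model: {name}")
        MODEL_REGISTRY[name] = cls
        return cls

    return decorator


@register_model("toy_summarizer")
class ToySummarizer:
    """Placeholder model class; a real one would wrap a neural network."""

    def __init__(self, hidden_size: int = 8):
        self.hidden_size = hidden_size


def build_model(name: str, **kwargs):
    """Look up a registered class by name and instantiate it."""
    try:
        cls = MODEL_REGISTRY[name]
    except KeyError:
        raise KeyError(f"Unknown model: {name}") from None
    return cls(**kwargs)


# A configuration file or command-line flag can now select the model by string.
model = build_model("toy_summarizer", hidden_size=16)
```

The appeal of this pattern is that adding a new model or task only requires defining a class and decorating it; the code that builds components from a configuration does not need to change.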
Quick Start & Requirements
Install the dependencies with pip install -r requirements.txt, and install the package with pip install --editable ./src.
Highlighted Details
Maintenance & Community
The project released NaturalCC 2.0 in November 2023, integrating compatibility with Hugging Face Transformers. It has also incorporated code from research papers on neural code search and structural analysis of pre-trained models. The project welcomes contributions.
Licensing & Compatibility
NaturalCC is open-sourced under the MIT license, which is permissive for commercial use and closed-source linking.
Limitations & Caveats
The initial environment setup specifies Python 3.6, which may require careful management for users on newer Python versions. While the toolkit supports various large code models, specific model compatibility and performance may vary.