aiXcoder-7B by aixcoder-plugin

Code LLM for enhanced programming tasks

Created 1 year ago
2,277 stars

Top 19.8% on SourcePulse

Project Summary

Summary

aiXcoder-7B is an open-source Code Large Language Model designed for code understanding and generation across multiple languages. It targets developers and researchers seeking state-of-the-art performance in code completion, comprehension, and generation, offering significant advantages over similarly sized models in benchmarks.

How It Works

Trained on 1.2T tokens, aiXcoder-7B combines structured Fill-In-the-Middle (FIM) tasks derived from Abstract Syntax Trees (ASTs) with standard autoregressive training (70% FIM, 30% autoregressive). Aligning FIM spans with complete AST nodes trains the model to predict syntactically complete code units rather than arbitrary character spans. The architecture uses RoPE, SwiGLU, and Grouped Query Attention with a 32,768-token sequence length. Training data passes through rigorous filtering: copyleft-licensed code is excluded, and the pipeline applies deduplication, removal of sensitive information, and static analysis. During batching, related code files are clustered locally while overall sample order remains randomized.
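The structured FIM objective can be illustrated with a short sketch. This is not the project's actual data pipeline (the real special tokens, AST tooling, and sampling strategy are not documented here); it only shows the core idea of masking a complete AST node and rearranging the file into prefix/suffix/middle segments, using Python's built-in ast module and hypothetical sentinel tokens.

```python
import ast

# Hypothetical sentinel tokens -- the model's real FIM tokens may differ.
FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"

def structured_fim_sample(source: str) -> str:
    """Mask one syntactically complete AST node and emit a
    prefix/suffix/middle training string (PSM ordering)."""
    tree = ast.parse(source)
    # Pick a complete node to mask; a real pipeline would sample nodes.
    node = next(n for n in ast.walk(tree) if isinstance(n, ast.Return))
    lines = source.splitlines(keepends=True)
    start = sum(len(l) for l in lines[: node.lineno - 1]) + node.col_offset
    end = sum(len(l) for l in lines[: node.end_lineno - 1]) + node.end_col_offset
    prefix, middle, suffix = source[:start], source[start:end], source[end:]
    # The model sees prefix and suffix, then learns to predict the middle.
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"

print(structured_fim_sample("def add(a, b):\n    return a + b\n"))
```

Because the masked span is always a whole AST node, the model is trained to emit syntactically complete units (a statement, a block, a function body) rather than fragments that cut off mid-expression.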

Quick Start & Requirements

Installation is supported via Python environments (Python 3.8+, PyTorch 2.1.0+, transformers 4.34.1+) or Docker. Flash Attention is recommended for faster inference and requires CUDA. Inference can be run from the command line or from Python scripts; a minimal sketch follows below. Fine-tuning is supported through Hugging Face's PEFT library, and model weights are available for download.
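The sketch below uses the standard transformers API. The model id is assumed to be the Hugging Face release (aiXcoder/aixcoder-7b-base); adjust it for your environment, and note that some releases may additionally require trust_remote_code=True.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "aiXcoder/aixcoder-7b-base"  # assumed Hugging Face model id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # half precision keeps the 7B model on one GPU
    device_map="auto",
)

prompt = "# write a function that checks whether a number is prime\ndef is_prime(n):"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```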

Highlighted Details

aiXcoder-7B reports state-of-the-art results on multilingual NL2Code benchmarks and excels in FIM code completion, outperforming larger models such as CodeLlama 34B and StarCoder2 15B. It also performs strongly on cross-file code completion evaluations. The model supports quantization via bitsandbytes and ships with dedicated VS Code and JetBrains plugins.
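Quantized loading follows the standard transformers + bitsandbytes integration; nothing project-specific is required. The sketch below assumes the same Hugging Face model id as above and loads the weights in 4-bit NF4 with bfloat16 compute.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "aiXcoder/aixcoder-7b-base"  # assumed Hugging Face model id

# 4-bit NF4 quantization with bfloat16 compute via bitsandbytes.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",
)
```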

Maintenance & Community

The repository welcomes contributions and feedback, but specific community channels or development activity indicators are not detailed in the provided README.

Licensing & Compatibility

Source code is licensed under Apache-2.0. Model weights are for academic research use; commercial use requires application via email to support@aiXcoder.com.

Limitations & Caveats

The base model is not instruct-tuned, so performance on tasks such as code debugging or test-case generation is not yet optimal; instruct-tuned versions are planned. Users who want to replicate the pre-training data pipeline may need to implement the structured FIM data construction themselves.

Health Check

Last Commit: 3 months ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 0
Star History: 3 stars in the last 30 days
