Code LLM for code completion, comprehension, and generation
Top 19.8% on SourcePulse
Summary
aiXcoder-7B is an open-source code large language model for code understanding and generation across multiple programming languages. It targets developers and researchers who need strong code completion, comprehension, and generation, and it reports benchmark results ahead of similarly sized models.
How It Works
Trained on 1.2T tokens, aiXcoder-7B combines structured Fill-In-the-Middle (FIM) tasks derived from Abstract Syntax Trees (ASTs) with plain autoregressive training (70% FIM, 30% autoregressive). The structured FIM objective asks the model to predict complete code nodes rather than arbitrary spans. Architecturally, the model uses RoPE positional embeddings, SwiGLU activations, and Grouped Query Attention with a 32,768-token sequence length. Pre-training data goes through rigorous filtering: copyleft-licensed code is excluded, and the pipeline includes deduplication, sensitive-information removal, and static analysis. Batches cluster related code files together locally while keeping the overall ordering random.
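As a rough illustration of structured FIM (a minimal sketch, not aiXcoder's actual pipeline), the snippet below selects a complete AST node from a Python file as the masked middle span. The sentinel token names are placeholders; the model's real special tokens are defined by its tokenizer.

```python
import ast
import random

# Placeholder sentinel tokens; the real tokens come from the model's tokenizer.
FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"

def structured_fim_example(source: str) -> str:
    """Build one FIM training string whose middle span is a complete AST node."""
    tree = ast.parse(source)
    # Collect statement-level nodes, i.e. complete code nodes spanning whole lines.
    nodes = [n for n in ast.walk(tree)
             if isinstance(n, ast.stmt) and getattr(n, "end_lineno", None)]
    node = random.choice(nodes)

    lines = source.splitlines(keepends=True)
    start, end = node.lineno - 1, node.end_lineno  # 1-based, inclusive end
    prefix = "".join(lines[:start])
    middle = "".join(lines[start:end])
    suffix = "".join(lines[end:])

    # Prefix-suffix-middle ordering: the model learns to generate `middle` last.
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"

example = structured_fim_example("def add(a, b):\n    total = a + b\n    return total\n")
print(example)
```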
Quick Start & Requirements
Installation is supported via a Python environment (Python 3.8+, PyTorch 2.1.0+, transformers 4.34.1+) or Docker. Flash Attention is recommended for faster inference and requires CUDA. Inference can be run from the command line or from Python scripts, and fine-tuning is supported through Hugging Face's PEFT tools. Model weights are available for download.
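A minimal inference sketch with Hugging Face transformers follows; it assumes the weights are published under the aiXcoder/aixcoder-7b-base model ID and that a CUDA GPU is available. Check the repository for the exact prompt format, especially for FIM-style completion.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed Hugging Face model ID; verify against the repository's download instructions.
model_id = "aiXcoder/aixcoder-7b-base"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # half precision to fit a 7B model on one GPU
    device_map="auto",
)

prompt = "# Write a function that checks whether a number is prime\ndef is_prime(n):"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```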
Highlighted Details
aiXcoder-7B reports state-of-the-art results on multilingual NL2Code benchmarks and excels at Fill-In-the-Middle code completion, outperforming larger models such as CodeLlama 34B and StarCoder2 15B. It also performs strongly on cross-file code completion evaluations. The model supports quantization via bitsandbytes, and dedicated VS Code and JetBrains plugins are available.
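For quantized loading, a sketch using the transformers bitsandbytes integration is shown below; 4-bit NF4 is chosen here purely for illustration, and the model ID is again an assumption.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "aiXcoder/aixcoder-7b-base"  # assumed model ID

# 4-bit NF4 quantization via bitsandbytes; cuts memory use at a small quality cost.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```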
Maintenance & Community
The repository welcomes contributions and feedback, but the README does not detail specific community channels or development-activity indicators.
Licensing & Compatibility
Source code is licensed under Apache-2.0. Model weights are for academic research use; commercial use requires application via email to support@aiXcoder.com.
Limitations & Caveats
The base model is not instruct-tuned, so it may underperform on tasks such as code debugging or test case generation; instruct-tuned versions are planned. Users replicating the pre-training data pipeline may need to implement the structured FIM data construction themselves.