Open code LLM family (1.5B/8B) for English and Chinese
Top 24.6% on sourcepulse
OpenCoder provides a family of open-source Large Language Models (LLMs) specifically designed for code generation and understanding, targeting AI researchers and developers. It aims to offer a transparent and reproducible foundation for advancing code AI by releasing not only model weights but also comprehensive training data, processing pipelines, and experimental results.
How It Works
OpenCoder models are pretrained from scratch on a massive 2.5 trillion token dataset, comprising 90% raw code and 10% code-related web data. They are then fine-tuned using over 4.5 million high-quality supervised fine-tuning (SFT) examples. This approach, detailed in their paper, emphasizes rigorous ablation studies on data cleaning strategies and training processes, including file-level and repository-level deduplication, to ensure robust performance and validate their methodology.
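To illustrate what file-level deduplication means in practice, here is a minimal, purely illustrative sketch based on exact content hashing; it is not the project's actual `opc_data_filtering` pipeline, and the normalization step is an assumption:

```python
import hashlib

def dedupe_files(file_contents: list[str]) -> list[str]:
    """Keep only the first occurrence of each whitespace-normalized content hash."""
    seen: set[str] = set()
    unique: list[str] = []
    for content in file_contents:
        # Normalize whitespace before hashing so trivially reformatted copies collapse.
        digest = hashlib.sha256(" ".join(content.split()).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(content)
    return unique

files = ["def f():\n    return 1", "def f():\n  return 1", "def g():\n    pass"]
print(len(dedupe_files(files)))  # 2 -- the first two files collapse to one entry
```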
Quick Start & Requirements
Models load with the Hugging Face `transformers` library and support `bfloat16` precision and `device_map="auto"`.
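A minimal loading sketch follows; the repository id below is an assumption, so check the project's Hugging Face links for the exact model names:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed Hugging Face repo id -- verify against the linked model pages.
model_name = "infly/OpenCoder-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,  # bfloat16 is supported per the README
    device_map="auto",           # place layers across available devices automatically
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Write a Python function that checks if a number is prime."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))
```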
Highlighted Details
- Pretraining corpora released, including the `fineweb-code-corpus` and a 10GB `fineweb-math-corpus`.
- SFT datasets released: `opc-sft-stage1` and 375K `opc-sft-stage2` examples.
- Data-cleaning pipeline released (`opc_data_filtering`).
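The released datasets can be inspected with the `datasets` library. A short sketch, assuming the hub path `OpenCoder-LLM/opc-sft-stage1` (the exact identifier and any required config name should be taken from the dataset links):

```python
from datasets import load_dataset

# Stream a few records rather than downloading the full dataset.
# "OpenCoder-LLM/opc-sft-stage1" is an assumed hub path.
sft = load_dataset("OpenCoder-LLM/opc-sft-stage1", split="train", streaming=True)
for i, example in enumerate(sft):
    print(example)
    if i == 2:
        break
```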
Maintenance & Community
The project has released numerous components and data, indicating active development. Links to Hugging Face model repositories are provided.
Licensing & Compatibility
The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
The project is still actively releasing components, so some resources may be in progress or uploaded incrementally. License details are not readily available, which could impact commercial adoption.