OpenCoder-llm  by OpenCoder-llm

Open code LLM family (1.5B/8B) for English and Chinese

created 9 months ago
1,784 stars

Top 24.6% on sourcepulse

GitHubView on GitHub
Project Summary

OpenCoder provides a family of open-source Large Language Models (LLMs) specifically designed for code generation and understanding, targeting AI researchers and developers. It aims to offer a transparent and reproducible foundation for advancing code AI by releasing not only model weights but also comprehensive training data, processing pipelines, and experimental results.

How It Works

OpenCoder models are pretrained from scratch on a massive 2.5 trillion token dataset, comprising 90% raw code and 10% code-related web data. They are then fine-tuned using over 4.5 million high-quality supervised fine-tuning (SFT) examples. This approach, detailed in their paper, emphasizes rigorous ablation studies on data cleaning strategies and training processes, including file-level and repository-level deduplication, to ensure robust performance and validate their methodology.

Quick Start & Requirements

  • Install/Run: Use Hugging Face transformers library.
  • Prerequisites: PyTorch, transformers. Models support bfloat16 and device_map="auto".
  • Links: 🤗 Model, 📄 Paper, 🏠 Home Page

Highlighted Details

  • Offers 1.5B and 8B parameter base and instruct models, supporting English and Chinese.
  • Pretrained on 2.5T tokens, including 148GB fineweb-code-corpus and 10GB fineweb-math-corpus.
  • Fine-tuned on 4.21M opc-sft-stage1 and 375K opc-sft-stage2 data.
  • Released intermediate checkpoints and a data cleaning pipeline (opc_data_filtering).

Maintenance & Community

The project has released numerous components and data, indicating active development. Links to Hugging Face model repositories are provided.

Licensing & Compatibility

The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The project is actively releasing components, suggesting some resources may still be in progress or progressively uploaded. License details are not readily available, which could impact commercial adoption.

Health Check
Last commit

7 months ago

Responsiveness

1 day

Pull Requests (30d)
0
Issues (30d)
0
Star History
110 stars in the last 90 days

Explore Similar Projects

Feedback? Help us improve.