OpenCoder-llm  by OpenCoder-llm

Open code LLM family (1.5B/8B) for English and Chinese

Created 1 year ago
1,983 stars

Top 22.0% on SourcePulse

GitHubView on GitHub
Project Summary

OpenCoder provides a family of open-source Large Language Models (LLMs) specifically designed for code generation and understanding, targeting AI researchers and developers. It aims to offer a transparent and reproducible foundation for advancing code AI by releasing not only model weights but also comprehensive training data, processing pipelines, and experimental results.

How It Works

OpenCoder models are pretrained from scratch on a massive 2.5 trillion token dataset, comprising 90% raw code and 10% code-related web data. They are then fine-tuned using over 4.5 million high-quality supervised fine-tuning (SFT) examples. This approach, detailed in their paper, emphasizes rigorous ablation studies on data cleaning strategies and training processes, including file-level and repository-level deduplication, to ensure robust performance and validate their methodology.

Quick Start & Requirements

  • Install/Run: Use Hugging Face transformers library.
  • Prerequisites: PyTorch, transformers. Models support bfloat16 and device_map="auto".
  • Links: 🤗 Model, 📄 Paper, 🏠 Home Page

Highlighted Details

  • Offers 1.5B and 8B parameter base and instruct models, supporting English and Chinese.
  • Pretrained on 2.5T tokens, including 148GB fineweb-code-corpus and 10GB fineweb-math-corpus.
  • Fine-tuned on 4.21M opc-sft-stage1 and 375K opc-sft-stage2 data.
  • Released intermediate checkpoints and a data cleaning pipeline (opc_data_filtering).

Maintenance & Community

The project has released numerous components and data, indicating active development. Links to Hugging Face model repositories are provided.

Licensing & Compatibility

The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The project is actively releasing components, suggesting some resources may still be in progress or progressively uploaded. License details are not readily available, which could impact commercial adoption.

Health Check
Last Commit

1 year ago

Responsiveness

Inactive

Pull Requests (30d)
0
Issues (30d)
0
Star History
38 stars in the last 30 days

Explore Similar Projects

Starred by Ross Taylor Ross Taylor(Cofounder of General Reasoning; Cocreator of Papers with Code), Yaowei Zheng Yaowei Zheng(Author of LLaMA-Factory), and
3 more.

curator by bespokelabsai

0.2%
2k
Synthetic data curation tool for post-training and structured data extraction
Created 1 year ago
Updated 6 days ago
Feedback? Help us improve.