OpenCoder-llm by OpenCoder-llm

Open code LLM family (1.5B/8B) for English and Chinese

Created 1 year ago

1,983 stars

Top 22.0% on SourcePulse

View on GitHub

2 Experts Love This Project

Casper Hansen

Author of AutoAWQ

Maxime Labonne

Head of Post-Training at Liquid AI

Project Summary

OpenCoder provides a family of open-source Large Language Models (LLMs) specifically designed for code generation and understanding, targeting AI researchers and developers. It aims to offer a transparent and reproducible foundation for advancing code AI by releasing not only model weights but also comprehensive training data, processing pipelines, and experimental results.

How It Works

OpenCoder models are pretrained from scratch on a massive 2.5 trillion token dataset, comprising 90% raw code and 10% code-related web data. They are then fine-tuned using over 4.5 million high-quality supervised fine-tuning (SFT) examples. This approach, detailed in their paper, emphasizes rigorous ablation studies on data cleaning strategies and training processes, including file-level and repository-level deduplication, to ensure robust performance and validate their methodology.

Quick Start & Requirements

Install/Run: Use Hugging Face transformers library.
Prerequisites: PyTorch, transformers. Models support bfloat16 and device_map="auto".
Links: 🤗 Model, 📄 Paper, 🏠 Home Page

Highlighted Details

Offers 1.5B and 8B parameter base and instruct models, supporting English and Chinese.
Pretrained on 2.5T tokens, including 148GB fineweb-code-corpus and 10GB fineweb-math-corpus.
Fine-tuned on 4.21M opc-sft-stage1 and 375K opc-sft-stage2 data.
Released intermediate checkpoints and a data cleaning pipeline (opc_data_filtering).

Maintenance & Community

The project has released numerous components and data, indicating active development. Links to Hugging Face model repositories are provided.

Licensing & Compatibility

The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The project is actively releasing components, suggesting some resources may still be in progress or progressively uploaded. License details are not readily available, which could impact commercial adoption.

Health Check

Last Commit

1 year ago

Responsiveness

Inactive

Pull Requests (30d)

Issues (30d)

Star History

38 stars in the last 30 days