magicoder by ise-uiuc

Code generation model family for instruction following

Created 2 years ago

2,075 stars

Top 21.3% on SourcePulse

7 Experts Love This Project

Shizhe Diao

Author of LMFlow; Research Scientist at NVIDIA

Jeff Hammerbacher

Cofounder of Cloudera

Bryan Helmig

Cofounder of Zapier

Kevin Hou

Head of Product Engineering at Windsurf

and 3 more!

Project Summary

Magicoder is a family of large language models for code generation, built to address the need for high-quality, low-bias instruction data. It targets developers and researchers who need code synthesis and reports strong results on benchmarks such as HumanEval and MBPP.

How It Works

Magicoder is trained with OSS-Instruct, a method that seeds an LLM with open-source code snippets so that the instruction data it synthesizes is more diverse and closer to real code. Grounding generation in real snippets mitigates the bias inherent in purely synthetic data, leading to more realistic and controllable output. The Magicoder-S variants are further fine-tuned on Evol-Instruct data for stronger instruction following.
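The seeding idea can be illustrated with a small sketch. This is not the project's actual pipeline: the template wording and the helper names (`extract_seed`, `build_generation_prompt`) are hypothetical, chosen only to show how a random slice of open-source code might be turned into a data-generation prompt for an LLM.

```python
import random

# Hypothetical prompt wording for illustration; the real OSS-Instruct
# prompt used by the project may differ.
TEMPLATE = (
    "Use the code snippet below as inspiration to write a self-contained "
    "programming problem and a correct solution.\n\n"
    "Code snippet:\n{seed}\n\n"
    "Format the answer as two sections: [Problem] and [Solution]."
)


def extract_seed(source: str, max_lines: int, rng: random.Random) -> str:
    """Return a random run of at most `max_lines` consecutive lines."""
    lines = source.splitlines()
    if len(lines) <= max_lines:
        return "\n".join(lines)
    start = rng.randrange(len(lines) - max_lines + 1)
    return "\n".join(lines[start:start + max_lines])


def build_generation_prompt(source: str, max_lines: int = 15, seed: int = 0) -> str:
    """Build the prompt that would be sent to the data-generating LLM."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    snippet = extract_seed(source, max_lines, rng)
    return TEMPLATE.format(seed=snippet)
```

Each generated problem and solution pair would then become one instruction-tuning example. Sampling a different slice per document is what gives the resulting dataset its variety.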

Quick Start & Requirements

Install/Run: Use Hugging Face transformers pipeline.
Prerequisites: Python, PyTorch, CUDA (for GPU acceleration).
Resource Footprint: The 7B-class models need substantial GPU memory; the suggested loading options are torch_dtype=torch.bfloat16 and device_map="auto".
Demo: Online Gradio demo and local setup script available.
Docs: Hugging Face Link
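A minimal sketch of the pipeline-based setup described above. It assumes the `transformers`, `torch`, and `accelerate` packages are installed (the last is needed for `device_map="auto"`) and that the checkpoint is published as `ise-uiuc/Magicoder-S-DS-6.7B`. The prompt template is an assumption based on the format shown on the model card, so verify it there before relying on it. `approx_weight_gb` is a hypothetical helper that estimates weight memory only; activations and the KV cache add to that figure.

```python
# Assumed prompt format; check the model card for the authoritative version.
MAGICODER_PROMPT = """You are an exceptionally intelligent coding assistant that consistently delivers accurate and reliable responses to user instructions.

@@ Instruction
{instruction}

@@ Response
"""


def build_prompt(instruction: str) -> str:
    """Wrap a user instruction in the assumed Magicoder prompt format."""
    return MAGICODER_PROMPT.format(instruction=instruction)


def approx_weight_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Rough size of the weights alone, in GB (bfloat16 uses 2 bytes each)."""
    return num_params * bytes_per_param / 1e9


def generate(instruction: str,
             model_id: str = "ise-uiuc/Magicoder-S-DS-6.7B",
             max_new_tokens: int = 512) -> str:
    """Download the model (several GB) and run greedy generation."""
    # Imported lazily so the helpers above work without a GPU stack.
    import torch
    from transformers import pipeline

    generator = pipeline(
        task="text-generation",
        model=model_id,
        torch_dtype=torch.bfloat16,  # newer transformers releases also accept `dtype`
        device_map="auto",           # requires the `accelerate` package
    )
    result = generator(
        build_prompt(instruction),
        max_new_tokens=max_new_tokens,
        do_sample=False,  # greedy decoding for deterministic output
    )
    return result[0]["generated_text"]


# Example call (downloads the checkpoint on first use):
# print(generate("Write a Python function that checks whether a string is a palindrome."))
```

By this estimate, a 6.7B-parameter model in bfloat16 needs roughly 13.4 GB for its weights, which is why a single 16 GB GPU is tight once generation overhead is included.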

Highlighted Details

Magicoder-S-DS-6.7B reaches 76.8 pass@1 on HumanEval, ahead of the scores reported for GPT-3.5-turbo-1106 and Gemini Ultra.
Models are available in 7B and 6.7B parameter sizes.
Base models are CodeLlama-Python-7B, which is built on Llama 2 (for the CL series), and DeepSeek-Coder (for the DS series).
Trained on the project's OSS-Instruct (75K) and Evol-Instruct (110K) datasets, both released publicly on Hugging Face.

Maintenance & Community

Project inspired several other open-source coding models.
Contact information provided for key contributors.
No explicit community channels (Discord/Slack) or roadmap links are present in the README.

Licensing & Compatibility

Licenses vary by base model: Llama2 for CL series, DeepSeek for DS series.
Compatible with commercial use, subject to the underlying base model's license terms.

Limitations & Caveats

Models may produce errors or misleading content, particularly for non-coding tasks.
Because the training data was generated with OpenAI models, usage is subject to OpenAI's terms of use.
The project does not aim to compete with OpenAI's commercial products.

Health Check

Last Commit

1 year ago

Responsiveness

Inactive

Star History

7 stars in the last 30 days