magicoder by ise-uiuc

Code generation model family for instruction following

created 1 year ago
2,020 stars

Top 22.4% on sourcepulse

View on GitHub
Project Summary

Magicoder is a family of large language models specifically designed for code generation, addressing the need for high-quality, low-bias instruction data. It targets developers and researchers seeking advanced code synthesis capabilities, offering state-of-the-art performance on benchmarks like HumanEval and MBPP.

How It Works

Magicoder leverages a novel approach called OSS-Instruct, which uses open-source code snippets to enrich and diversify LLM-synthesized instruction data. This method mitigates inherent biases in purely synthetic data, leading to more realistic and controllable code generation. The models are further fine-tuned on datasets like Evol-Instruct for enhanced instruction-following.
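As a rough illustration of the idea (not the project's actual generation pipeline), an OSS-Instruct-style step seeds a prompt with a real open-source snippet and asks a teacher LLM to invent a matching coding problem and solution. The prompt wording and the `build_oss_instruct_prompt` helper below are hypothetical:

```python
# Hypothetical sketch of an OSS-Instruct-style data generation step:
# a real open-source snippet seeds the prompt, and a teacher LLM invents
# a self-contained coding problem plus solution inspired by it.
def build_oss_instruct_prompt(code_snippet: str) -> str:
    return (
        "Here is a code fragment taken from an open-source project:\n\n"
        f"{code_snippet}\n\n"
        "Inspired by this fragment, write a self-contained coding problem "
        "and a correct solution. Format the output as:\n"
        "[Problem Description] ...\n[Solution] ..."
    )

seed_snippet = (
    "def rolling_mean(xs, k):\n"
    "    return [sum(xs[i:i+k]) / k for i in range(len(xs) - k + 1)]"
)
prompt = build_oss_instruct_prompt(seed_snippet)
# `prompt` would then be sent to a teacher LLM; the returned
# (problem, solution) pairs become instruction-tuning data.
```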

Quick Start & Requirements

  • Install/Run: Use the Hugging Face transformers pipeline (a usage sketch follows this list).
  • Prerequisites: Python, PyTorch, CUDA (for GPU acceleration).
  • Resource Footprint: The ~7B-parameter models need significant GPU memory; the README loads them with torch_dtype=torch.bfloat16 and device_map="auto".
  • Demo: Online Gradio demo and local setup script available.
  • Docs: Hugging Face Link
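A minimal usage sketch based on the Hugging Face pipeline settings noted above; the prompt template and generation parameters are illustrative approximations of the model card's example, not a guaranteed reproduction:

```python
import torch
from transformers import pipeline

# Instruction-style prompt; Magicoder models expect an
# "@@ Instruction / @@ Response" style template (check the model card
# for the exact wording).
PROMPT_TEMPLATE = """You are an intelligent coding assistant.

@@ Instruction
{instruction}

@@ Response
"""

generator = pipeline(
    task="text-generation",
    model="ise-uiuc/Magicoder-S-DS-6.7B",  # 6.7B model; needs a capable GPU
    torch_dtype=torch.bfloat16,            # settings from the Quick Start notes above
    device_map="auto",
)

prompt = PROMPT_TEMPLATE.format(
    instruction="Write a Python function that checks if a string is a palindrome."
)
result = generator(prompt, max_new_tokens=256, num_return_sequences=1)
print(result[0]["generated_text"])
```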

Highlighted Details

  • Magicoder-S-DS-6.7B achieves 76.8 pass@1 on HumanEval, outperforming GPT-3.5-turbo-1106 and Gemini Ultra.
  • Models are available in 7B and 6.7B parameter sizes.
  • Base models are CodeLlama (Llama 2 family) for the CL series and DeepSeek-Coder for the DS series.
  • Trained on the publicly released OSS-Instruct (75K) and Evol-Instruct (110K) datasets.

Maintenance & Community

  • Project inspired several other open-source coding models.
  • Contact information provided for key contributors.
  • No explicit community channels (Discord/Slack) or roadmap links are present in the README.

Licensing & Compatibility

  • Licenses vary by base model: Llama2 for CL series, DeepSeek for DS series.
  • Compatible with commercial use, subject to the underlying base model's license terms.

Limitations & Caveats

  • Models may produce errors or misleading content, particularly for non-coding tasks.
  • Because the instruction data was generated with OpenAI models, usage is subject to OpenAI's terms of use.
  • The project does not aim to compete with OpenAI's commercial products.
Health Check

  • Last commit: 9 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 13 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Woosuk Kwon (author of vLLM), and 11 more.

WizardLM by nlpxucan
  • Top 0.1%, 9k stars
  • LLMs built using Evol-Instruct for complex instruction following
  • Created 2 years ago, updated 1 month ago

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Travis Fischer (founder of Agentic), and 6 more.

codellama by meta-llama
  • Top 0.1%, 16k stars
  • Inference code for CodeLlama models
  • Created 1 year ago, updated 11 months ago