magicoder by ise-uiuc

Code generation model family for instruction following

Created 2 years ago
2,051 stars

Top 21.6% on SourcePulse

View on GitHub
Project Summary

Magicoder is a family of large language models specifically designed for code generation, addressing the need for high-quality, low-bias instruction data. It targets developers and researchers seeking advanced code synthesis capabilities, offering state-of-the-art performance on benchmarks like HumanEval and MBPP.

How It Works

Magicoder is built on OSS-Instruct, a novel approach that seeds LLM-synthesized instruction data with real open-source code snippets, enriching and diversifying the generated problems. Grounding generation in real code mitigates the inherent bias of purely synthetic data, yielding more realistic and controllable training examples. The S-series models are additionally fine-tuned on the Evol-Instruct dataset for stronger instruction following.
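
The core recipe is simple enough to sketch. Below is a hypothetical illustration of an OSS-Instruct-style generation step, not the project's exact pipeline: the prompt wording, teacher model choice, and the synthesize_pair helper are all assumptions, though the openai client calls are the official ones.

```python
# Hypothetical sketch of an OSS-Instruct-style generation step.
# Prompt wording and model choice are illustrative assumptions,
# not Magicoder's exact pipeline.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def synthesize_pair(seed_snippet: str) -> str:
    """Turn a real open-source code snippet into a new problem + solution."""
    prompt = (
        "Gain inspiration from the following code snippet to create a "
        "high-quality, self-contained programming problem, then write a "
        "correct solution to it.\n\n"
        f"Code snippet:\n{seed_snippet}\n"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


seed = "def fib(n):\n    return n if n < 2 else fib(n - 1) + fib(n - 2)"
print(synthesize_pair(seed))
```

Because each seed snippet comes from a real repository, the synthesized problems inherit the diversity of real-world code, which is the bias-mitigation argument above.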

Quick Start & Requirements

  • Install/Run: Use the Hugging Face transformers pipeline (see the sketch after this list).
  • Prerequisites: Python, PyTorch, CUDA (for GPU acceleration).
  • Resource Footprint: Requires significant GPU memory for the ~7B models; the README loads them with torch_dtype=torch.bfloat16 and device_map="auto".
  • Demo: Online Gradio demo and local setup script available.
  • Docs: Model cards on Hugging Face.
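
A minimal sketch of that pipeline usage, assuming the published ise-uiuc/Magicoder-S-DS-6.7B checkpoint; the @@ Instruction / @@ Response template mirrors the model card, and the example instruction is just a placeholder:

```python
# Minimal sketch: code generation with Magicoder via the transformers pipeline.
# Assumes the ise-uiuc/Magicoder-S-DS-6.7B checkpoint on the Hugging Face Hub.
import torch
from transformers import pipeline

# Instruction template following the Magicoder model card.
MAGICODER_PROMPT = """You are an exceptionally intelligent coding assistant that \
consistently delivers accurate and reliable responses to user instructions.

@@ Instruction
{instruction}

@@ Response
"""

generator = pipeline(
    task="text-generation",
    model="ise-uiuc/Magicoder-S-DS-6.7B",
    torch_dtype=torch.bfloat16,  # halves memory relative to fp32
    device_map="auto",           # place layers on available GPU(s)
)

prompt = MAGICODER_PROMPT.format(
    instruction="Write a Python function that checks whether a string is a palindrome."
)
result = generator(prompt, max_new_tokens=512, do_sample=False)  # greedy decoding
print(result[0]["generated_text"])
```

Greedy decoding (do_sample=False) keeps outputs deterministic, the usual choice for benchmark-style code generation; sampling parameters can be passed instead for more varied completions.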

Highlighted Details

  • Magicoder-S-DS-6.7B achieves 76.8 pass@1 on HumanEval, outperforming GPT-3.5-turbo-1106 and Gemini Ultra.
  • Models are available in 7B and 6.7B parameter sizes.
  • Base models are CodeLlama-Python-7B (CL series) and DeepSeek-Coder-6.7B (DS series).
  • Trained on the openly released OSS-Instruct (75K) and Evol-Instruct (110K) datasets.

Maintenance & Community

  • The project has inspired several other open-source coding models.
  • Contact information provided for key contributors.
  • No explicit community channels (Discord/Slack) or roadmap links are present in the README.

Licensing & Compatibility

  • Licensing follows the base model: the Llama 2 license for the CL series and the DeepSeek license for the DS series.
  • Commercial use is possible, subject to the underlying base model's license terms.

Limitations & Caveats

  • Models may produce errors or misleading content, particularly for non-coding tasks.
  • Because the instruction data was synthesized with OpenAI models, usage is subject to OpenAI's terms of use.
  • The project does not aim to compete with OpenAI's commercial products.
Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 5 stars in the last 30 days

Explore Similar Projects

Starred by Vincent Weisser (Cofounder of Prime Intellect), Ross Taylor (Cofounder of General Reasoning; Co-creator of Papers with Code), and 11 more.

open-instruct by allenai

  • Top 0.3% on SourcePulse · 3k stars
  • Training codebase for instruction-following language models
  • Created 2 years ago · Updated 9 hours ago
  • Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), John Yang (Co-author of SWE-bench, SWE-agent), and 28 more.

stanford_alpaca by tatsu-lab

  • Top 0.1% on SourcePulse · 30k stars
  • Instruction-following LLaMA model training and data generation
  • Created 2 years ago · Updated 1 year ago