PyCodeGPT by microsoft

GPT model for Python code completion and generation

Created 3 years ago
280 stars

Top 93.0% on SourcePulse

Project Summary

PyCodeGPT is a GPT-Neo-based model for Python code completion and generation, aiming to provide an efficient and effective alternative to models like OpenAI Codex and GitHub Copilot. It is designed for developers and researchers working with Python code generation tasks.

How It Works

PyCodeGPT leverages a GPT-Neo architecture, specifically trained on a custom dataset of 96GB of Python code scraped from 1.2 million GitHub repositories. This extensive, self-collected dataset allows for a more tailored and potentially higher-quality model compared to relying solely on smaller public datasets. The model is available in a 110M parameter version, derived from GPT-Neo 125M with an expanded vocabulary.

Quick Start & Requirements

  • Install: pip install -r requirements.txt
  • Prerequisites: Python 3.7; clone the evaluation harness with git clone https://github.com/openai/human-eval
  • Evaluation: run eval_human_eval.py with the arguments for your model checkpoint and output paths.
  • Functional correctness: evaluate_functional_correctness <samples_path>
  • Documentation: HumanEval
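The evaluate_functional_correctness step checks whether each generated completion actually passes the unit tests attached to its HumanEval problem. The sketch below illustrates that idea only; it is not the human-eval implementation, and all names here are illustrative (the real harness also sandboxes execution and enforces timeouts):

```python
# Illustrative sketch of functional-correctness checking: append the
# problem's unit tests to (prompt + completion), execute the result,
# and record pass/fail. Not the real human-eval API.

def check_correctness(prompt: str, completion: str,
                      test_code: str, entry_point: str) -> bool:
    """Return True if prompt + completion passes the problem's tests."""
    program = prompt + completion + "\n" + test_code + f"\ncheck({entry_point})\n"
    namespace: dict = {}
    try:
        # The real harness runs this in an isolated process with a timeout.
        exec(program, namespace)
        return True
    except Exception:
        return False

# Tiny example in HumanEval's shape: prompt, model completion, and tests.
prompt = "def add(a, b):\n"
completion = "    return a + b\n"
test_code = (
    "def check(candidate):\n"
    "    assert candidate(1, 2) == 3\n"
    "    assert candidate(-1, 1) == 0\n"
)
print(check_correctness(prompt, completion, test_code, "add"))  # True
```

A buggy completion (e.g. `return a - b`) would fail the asserts inside the generated program and be counted as incorrect.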

Highlighted Details

  • Trained on 96GB of Python code from 1.2M GitHub repositories.
  • PyCodeGPT-110M achieves comparable accuracy to Codex models of similar size on the HumanEval dataset.
  • Pass@1 score of 8.32% for PyCodeGPT-110M on HumanEval.

Maintenance & Community

  • The project is associated with the CERT paper from IJCAI 2022.
  • Citation required for model usage:

        @inproceedings{CERT,
          title={{CERT}: Continual Pre-training on Sketches for Library-oriented Code Generation},
          author={Zan, Daoguang and Chen, Bei and Yang, Dejian and Lin, Zeqi and Kim, Minsu and Guan, Bei and Wang, Yongji and Chen, Weizhu and Lou, Jian-Guang},
          booktitle={The 2022 International Joint Conference on Artificial Intelligence},
          year={2022}
        }

Licensing & Compatibility

  • The README does not explicitly state a license.

Limitations & Caveats

  • The project requires specific setup for evaluation, including uncommenting a line in the human-eval repository.
  • Performance metrics are provided for a specific 110M parameter model; larger or different versions may have different capabilities.
Health Check

  • Last Commit: 2 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 30 days

Explore Similar Projects

Starred by Eric Zhu (coauthor of AutoGen; Research Scientist at Microsoft Research) and Shizhe Diao (author of LMFlow; Research Scientist at NVIDIA).

  • DS-1000 by xlang-ai: benchmark for data science code generation. 256 stars; created 2 years ago, updated 10 months ago.