Open-source code completion model
GPT-Code-Clippy (GPT-CC) aims to provide an open-source alternative to GitHub Copilot by fine-tuning large language models on a curated dataset of publicly available code. It targets developers and researchers interested in code generation and AI-assisted programming, and offers a foundation for building similar tools.
How It Works
GPT-CC models are fine-tuned from GPT-2 and GPT-Neo architectures. The training dataset is derived from the SEART GitHub Search, filtered for repositories with more than 10 stars, more than 2 commits, and an open-source license, while excluding forks and limiting file size. This is augmented with data from The Pile, and the combined corpus is then deduplicated based on the sequence of alphanumeric tokens within each file.
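A minimal sketch of what such deduplication might look like, assuming exact matching on each file's alphanumeric token sequence; the tokenization and hashing choices here are illustrative, not the project's actual implementation:

```python
import hashlib
import re


def dedup_key(source: str) -> str:
    """Fingerprint a file by its sequence of alphanumeric tokens, ignoring whitespace and punctuation."""
    tokens = re.findall(r"[A-Za-z0-9]+", source)
    return hashlib.sha256(" ".join(tokens).encode("utf-8")).hexdigest()


def deduplicate(files: list[str]) -> list[str]:
    """Keep only the first file seen for each fingerprint."""
    seen: set[str] = set()
    unique: list[str] = []
    for source in files:
        key = dedup_key(source)
        if key not in seen:
            seen.add(key)
            unique.append(source)
    return unique
```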
Quick Start & Requirements
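The README's quick-start instructions are not reproduced here. As a rough sketch, a fine-tuned GPT-CC checkpoint published on the Hugging Face Hub could be loaded with the transformers library (PyTorch backend assumed); the model ID below is illustrative and may not match the project's actual checkpoints:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model ID; substitute the actual GPT-CC checkpoint name.
model_id = "flax-community/gpt-neo-125M-code-clippy"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Complete a short code prompt.
prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.2)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```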
Highlighted Details
Maintenance & Community
The project acknowledges several contributors. No community channels (e.g., Discord or Slack) are listed in the README, and the repository has been inactive for roughly three years.
Licensing & Compatibility
The README does not explicitly state a license for the project or the released models. The dataset is packaged for use with Hugging Face's datasets library; the library itself is permissively licensed, but that does not determine the licensing of the scraped code, which retains the licenses of its source repositories.
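A minimal sketch of inspecting such a dataset with the datasets library, assuming it is published on the Hugging Face Hub; the dataset ID is illustrative:

```python
from itertools import islice

from datasets import load_dataset

# Illustrative dataset ID; substitute the project's published dataset name.
dataset = load_dataset("code_clippy", split="train", streaming=True)

# Inspect a few examples without downloading the full corpus.
for example in islice(dataset, 3):
    print(example)
```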
Limitations & Caveats
The project acknowledges a bug with incorrect filenames in the dataset, which may have impacted training. Performance on standard benchmarks like HumanEval is reported as very low for most models. The README contains multiple "TODO" items regarding recommended models and training procedures, indicating ongoing development or incomplete documentation.