gpt-code-clippy  by CodedotAl

Open-source code completion model

created 4 years ago
3,294 stars

Top 15.0% on sourcepulse

Project Summary

GPT-Code-Clippy (GPT-CC) aims to provide an open-source alternative to GitHub Copilot by fine-tuning large language models on a curated dataset of publicly available code. It targets developers and researchers interested in code generation and AI-assisted programming, and offers a foundation for building similar tools.

How It Works

GPT-CC models are fine-tuned from GPT-2 and GPT-Neo. The training dataset is built from repositories found through SEART GitHub Search, keeping those with more than 10 stars, more than 2 commits, and an open-source license, while excluding forks and capping file size. This set is augmented with data from The Pile and then deduplicated by comparing the sequence of alphanumeric "variable" tokens that each file contains.
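The filtering and deduplication steps described above can be sketched roughly as follows. This is an illustrative sketch, not the project's actual pipeline: the `Repo` fields, `keep_repo`, `variable_sequence`, and `dedupe_files` are hypothetical names, and the definition of a "variable" as a run of ASCII letters and digits is an assumption. The file-size cap is left out because the summary does not state the threshold.

```python
import re
from dataclasses import dataclass
from typing import Iterable, Iterator, Tuple


@dataclass
class Repo:
    """Hypothetical record for one repository returned by a search."""
    stars: int
    commits: int
    has_license: bool
    is_fork: bool


def keep_repo(repo: Repo) -> bool:
    """Apply the repository-level criteria listed in the summary."""
    return (
        repo.stars > 10
        and repo.commits > 2
        and repo.has_license
        and not repo.is_fork
    )


# Assumption: a "variable" is any maximal run of ASCII letters and digits.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+")


def variable_sequence(source: str) -> Tuple[str, ...]:
    """Return the ordered alphanumeric tokens found in a file's text."""
    return tuple(TOKEN_RE.findall(source))


def dedupe_files(
    files: Iterable[Tuple[str, str]]
) -> Iterator[Tuple[str, str]]:
    """Yield (path, source) pairs, skipping any file whose token
    sequence matches one that was already yielded."""
    seen = set()
    for path, source in files:
        key = variable_sequence(source)
        if key in seen:
            continue
        seen.add(key)
        yield path, source


if __name__ == "__main__":
    sample = [
        ("a.py", "x = 1\ny = x + 2\n"),
        ("b.py", "x=1;  y=x+2"),   # same tokens as a.py, different layout
        ("c.py", "z = 3\n"),
    ]
    for path, _ in dedupe_files(sample):
        print(path)
```

Because only alphanumeric tokens are compared, files that differ in whitespace, punctuation, or operators alone are treated as duplicates. That makes the check cheap, but it may also merge files that are not truly identical.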

Quick Start & Requirements

Highlighted Details

  • Dataset construction and limitations are detailed in a datasheet: https://github.com/ncoop57/datasets/tree/code-clippy/datasets/code_clippy
  • HumanEval benchmark results show low performance for most models, with some APPS-specific models performing better on APPS tasks.
  • A known issue exists with incorrect/misleading filenames in the dataset, potentially affecting training data quality.

Maintenance & Community

The project acknowledges several contributors. Further community engagement details (e.g., Discord/Slack) are not explicitly provided in the README.

Licensing & Compatibility

The README does not explicitly state the license for the project or the models. The dataset is packaged for Hugging Face's datasets library; the library itself is Apache-2.0 licensed, but that does not determine the license of the code contained in the dataset.

Limitations & Caveats

The project acknowledges a bug with incorrect filenames in the dataset, which may have impacted training. Performance on standard benchmarks like HumanEval is reported as very low for most models. The README contains multiple "TODO" items regarding recommended models and training procedures, indicating ongoing development or incomplete documentation.

Health Check
Last commit: 3 years ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 0
Star History
7 stars in the last 90 days
