Open-source code completion model
GPT-Code-Clippy (GPT-CC) aims to provide an open-source alternative to GitHub Copilot by fine-tuning large language models on a curated dataset of publicly available code. It targets developers and researchers interested in code generation and AI-assisted programming, and offers a foundation for building similar tools.
How It Works
GPT-CC models are fine-tuned from GPT-2 and GPT-Neo architectures. The training dataset is derived from the SEART GitHub Search, filtered for repositories with more than 10 stars, more than 2 commits, and an open-source license, while excluding forks and limiting file size. This is augmented with data from The Pile, and the combined corpus is then deduplicated based on the sequence of alphanumeric tokens within each file.
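A minimal sketch of what such deduplication might look like, assuming exact matching on each file's alphanumeric token sequence; the tokenization and hashing choices here are illustrative, not the project's actual implementation:

```python
import hashlib
import re


def dedup_key(source: str) -> str:
    """Fingerprint a file by its sequence of alphanumeric tokens, ignoring whitespace and punctuation."""
    tokens = re.findall(r"[A-Za-z0-9]+", source)
    return hashlib.sha256(" ".join(tokens).encode("utf-8")).hexdigest()


def deduplicate(files: list[str]) -> list[str]:
    """Keep only the first file seen for each fingerprint."""
    seen: set[str] = set()
    unique: list[str] = []
    for source in files:
        key = dedup_key(source)
        if key not in seen:
            seen.add(key)
            unique.append(source)
    return unique
```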
Quick Start & Requirements
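The README's quick-start instructions are not reproduced here. As a rough sketch, a fine-tuned GPT-CC checkpoint published on the Hugging Face Hub could be loaded with the transformers library (PyTorch backend assumed); the model ID below is illustrative and may not match the project's actual checkpoints:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model ID; substitute the actual GPT-CC checkpoint name.
model_id = "flax-community/gpt-neo-125M-code-clippy"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Complete a short code prompt.
prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.2)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```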
Highlighted Details
Maintenance & Community
The project acknowledges several contributors. No community channels (e.g., Discord or Slack) are listed in the README, and the repository has been inactive for roughly three years.
Licensing & Compatibility
The README does not explicitly state a license for the project or the released models. The dataset is packaged for use with Hugging Face's datasets library; the library itself is permissively licensed, but that does not determine the licensing of the scraped code, which retains the licenses of its source repositories.
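A minimal sketch of inspecting such a dataset with the datasets library, assuming it is published on the Hugging Face Hub; the dataset ID is illustrative:

```python
from itertools import islice

from datasets import load_dataset

# Illustrative dataset ID; substitute the project's published dataset name.
dataset = load_dataset("code_clippy", split="train", streaming=True)

# Inspect a few examples without downloading the full corpus.
for example in islice(dataset, 3):
    print(example)
```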
Limitations & Caveats
The project acknowledges a bug with incorrect filenames in the dataset, which may have impacted training. Performance on standard benchmarks like HumanEval is reported as very low for most models. The README contains multiple "TODO" items regarding recommended models and training procedures, indicating ongoing development or incomplete documentation.