Pipeline for data generation and fine-tuning Qwen2.5 Coder models
This repository provides a pipeline for generating data and fine-tuning Qwen2.5 Coder models for fast, accurate code application. It targets developers and AI-powered code editors seeking efficient, high-throughput code modification capabilities, offering models that achieve hundreds of tokens per second.
How It Works
The project fine-tunes smaller Qwen2.5 Coder models (1.5B and 7B parameters) using the Unsloth library for accelerated training and reduced VRAM usage. Data generation involves cloning open-source repositories, converting them into a structured dataset with repo_to_dataset.py, and then using large language models such as Claude 3.5 Sonnet and GPT-4 to create synthetic code-update snippets. The models are optimized for a specific inference prompt structure designed to merge code updates while preserving code structure, comments, and indentation.
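The repository defines its own prompt template for the fine-tuned models; a minimal sketch of assembling such a merge prompt (the tag names and wording below are illustrative assumptions, not the project's actual template) might look like:

```python
def build_apply_prompt(original_code: str, update_snippet: str) -> str:
    """Assemble a code-application prompt: the model is asked to merge
    the update snippet into the original file while preserving code
    structure, comments, and indentation.

    The <original>/<update>/<merged> tags are illustrative; the
    repository defines the real template used during fine-tuning.
    """
    return (
        "Merge the update snippet into the original code. "
        "Preserve code structure, comments, and indentation.\n"
        f"<original>\n{original_code}\n</original>\n"
        f"<update>\n{update_snippet}\n</update>\n"
        "<merged>\n"
    )

prompt = build_apply_prompt(
    "function add(a, b) { return a + b; }",
    "function add(a: number, b: number): number { return a + b; }",
)
```

Keeping the update snippet separate from the full file is what lets a small model apply edits at hundreds of tokens per second: it only has to rewrite the file, not reason about a free-form instruction.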
Quick Start & Requirements
An example evaluation script is provided at tests_evaluate/fireworks/test_fireworks.py.
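A quick way to exercise a served model is to build an OpenAI-style chat-completions payload; the model id and message wording below are placeholders (the actual client code lives in the repository's Fireworks test):

```python
def build_apply_request(original_code: str, update_snippet: str,
                        model: str = "your-deployed-qwen2.5-coder") -> dict:
    """Build a chat-completions payload for a code-apply model.

    The model id and prompt wording are placeholder assumptions for an
    OpenAI-compatible endpoint such as the one Fireworks exposes.
    """
    return {
        "model": model,
        "temperature": 0.0,  # deterministic merges
        "messages": [
            {
                "role": "user",
                "content": (
                    "Merge the update into the original code, preserving "
                    "comments and indentation.\n"
                    f"Original:\n{original_code}\n"
                    f"Update:\n{update_snippet}\n"
                ),
            }
        ],
    }

req = build_apply_request("const x = 1;", "const x = 2;")
```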
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
Evaluation of code transformations is noted as non-trivial due to flexibility in code insertion and function ordering, suggesting that simple file comparison may not be sufficient. The project is primarily focused on TypeScript/TSX data.
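One way around brittle file comparison is to compare parsed structure instead of raw text. The project targets TypeScript/TSX, which would need a TS parser, but a minimal Python sketch using the standard-library ast module illustrates the idea:

```python
import ast

def same_definitions(src_a: str, src_b: str) -> bool:
    """Return True if two modules define the same top-level functions
    with identical bodies, regardless of definition order.

    Illustrative only: this shows why order-insensitive, structure-level
    comparison accepts valid merges that byte-level diffing would reject.
    """
    def canon(src: str) -> dict:
        tree = ast.parse(src)
        return {
            node.name: ast.dump(node)
            for node in tree.body
            if isinstance(node, ast.FunctionDef)
        }
    return canon(src_a) == canon(src_b)

a = "def f():\n    return 1\n\ndef g():\n    return 2\n"
b = "def g():\n    return 2\n\ndef f():\n    return 1\n"  # same functions, reordered
```

Here `a` and `b` differ as strings, so a plain file comparison reports a mismatch, while `same_definitions(a, b)` accepts the reordered output.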