fast-apply by kortix-ai

Pipeline for data generation and fine-tuning Qwen2.5 Coder models

Created 1 year ago

387 stars

Top 74.1% on SourcePulse

2 Experts Love This Project

Pawel Garbacki

Cofounder of Fireworks AI

Anton Osika

Cofounder of Lovable

Project Summary

This repository provides a pipeline for generating training data and fine-tuning Qwen2.5 Coder models to apply code edits quickly and accurately. It targets developers and AI-powered code editors that need high-throughput code modification; the resulting models reach hundreds of tokens per second (roughly 150 to 340 tok/s on fast providers such as Fireworks).

How It Works

The project fine-tunes small Qwen2.5 Coder models (1.5B and 7B parameters) with the Unsloth library, which speeds up training and reduces VRAM usage. Data generation has three steps: clone open-source repositories, convert them into a structured dataset with repo_to_dataset.py, and then use large language models such as Claude 3.5 Sonnet and GPT-4 to create synthetic code update snippets. The models are tuned for a specific inference prompt structure that merges a code update into the original file while preserving code structure, comments, and indentation.
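The repository-to-dataset step described above can be pictured with a minimal sketch. This is not the contents of repo_to_dataset.py; the extension filter, the size cap, the record fields (`file_path`, `content`), and the function names are all assumptions made for illustration.

```python
import json
from pathlib import Path

# Assumed extension filter; the real script's selection rules may differ.
SOURCE_EXTENSIONS = {".ts", ".tsx", ".py"}


def repo_to_records(repo_root, max_chars=20_000):
    """Collect source files from a cloned repository as dataset records.

    Hypothetical helper: field names and the size cap are illustrative only.
    """
    root = Path(repo_root)
    records = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in SOURCE_EXTENSIONS:
            continue
        relative = path.relative_to(root)
        if ".git" in relative.parts:
            continue  # skip version-control internals
        text = path.read_text(encoding="utf-8", errors="ignore")
        if not text.strip() or len(text) > max_chars:
            continue  # skip empty files and very large files
        records.append({"file_path": relative.as_posix(), "content": text})
    return records


def write_jsonl(records, out_path):
    """Write one JSON object per line, a common layout for training data."""
    with open(out_path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")
```

Each record would then be handed to a large language model to produce a synthetic update snippet and the corresponding merged file; that step depends on API access and is not shown here.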

Quick Start & Requirements

Install: Clone the repository and run the provided Python scripts for data generation and fine-tuning.
Prerequisites: Python, HuggingFace libraries, Unsloth, Anthropic/OpenAI API access for data generation, Fireworks AI for deployment (optional).
Resources: Fine-tuning requires GPU resources; specific VRAM needs depend on model size and quantization. Data generation can be resource-intensive.
Links: Models and dataset are available on HuggingFace: https://huggingface.co/Kortix. Inference script: tests_evaluate/fireworks/test_fireworks.py.
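Before running the inference script, it may help to see what a merge request of the kind described under How It Works could look like. The sketch below is a hypothetical template, not the prompt the models were trained on: the wording, the `<code>`, `<update>`, and `<updated-code>` tag names, and both function names are assumptions. Check the model card or the repository for the exact prompt, since the models are tuned for a specific structure.

```python
import re

# Hypothetical template: wording and tag names are assumptions for illustration.
PROMPT_TEMPLATE = (
    "Merge all changes from the <update> snippet into the <code> below.\n"
    "Preserve the code's structure, comments, and indentation.\n"
    "Return only the merged file inside <updated-code> tags.\n\n"
    "<code>{original}</code>\n\n"
    "<update>{update}</update>\n"
)


def build_merge_prompt(original, update):
    """Fill the template with the original file and the update snippet."""
    return PROMPT_TEMPLATE.format(original=original, update=update)


def extract_updated_code(response):
    """Return the text between the assumed output tags, or None if absent."""
    match = re.search(r"<updated-code>(.*?)</updated-code>", response, re.DOTALL)
    return match.group(1).strip("\n") if match else None
```

Keeping prompt construction and response parsing in two small functions makes it easy to swap in the real template once it is known.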

Highlighted Details

Achieves ~340 tok/s for the 1.5B model and ~150 tok/s for the 7B model on fast providers like Fireworks.
Data generation pipeline uses Claude 3.5 Sonnet (70%) and GPT-4 (30%) for high-quality synthetic data.
Fine-tuning utilizes Unsloth for efficient QLoRA with 4-bit quantization.
Dataset composition is 80% TypeScript/TSX, 15% Python, and 5% other languages.
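To put the throughput figures above in perspective, a rough back-of-the-envelope estimate divides the number of output tokens by the generation rate. The sketch below assumes a 1000-token merged file (an illustrative size, not a figure from the project) and ignores network latency and prompt processing time, so real end-to-end times will be higher.

```python
# Approximate throughput figures quoted above, in tokens per second.
THROUGHPUT_TOK_PER_S = {"1.5B": 340.0, "7B": 150.0}


def generation_seconds(output_tokens, tokens_per_second):
    """Estimate pure generation time; excludes latency and prompt processing."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


for model, rate in THROUGHPUT_TOK_PER_S.items():
    seconds = generation_seconds(1000, rate)
    print(f"{model}: about {seconds:.1f} s for a 1000-token file")
# prints:
# 1.5B: about 2.9 s for a 1000-token file
# 7B: about 6.7 s for a 1000-token file
```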

Maintenance & Community

The project is actively maintained by Kortix AI.
Contributions are welcomed for adding more languages, reporting bugs, and suggesting features.

Licensing & Compatibility

The README does not explicitly state the license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

Evaluating code transformations is non-trivial: new code can be inserted at different positions and functions can appear in a different order while the result is still correct, so a simple file comparison may not be sufficient. The training data is predominantly TypeScript/TSX (80% of the dataset).
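One way to see why plain file comparison falls short is to compare two outputs structurally instead of textually. The sketch below is not the project's evaluation method. It uses Python's standard `ast` module on Python sources for simplicity (most of the dataset is TypeScript/TSX, which would need a different parser), and it treats top-level statements as an unordered collection, which is an assumption that does not hold when statement order affects behavior.

```python
import ast


def top_level_fingerprint(source):
    """Return sorted structural dumps of a module's top-level statements.

    ast.dump omits line numbers by default, so comments, blank lines, and
    statement positions do not affect the result.
    """
    tree = ast.parse(source)
    return sorted(ast.dump(node) for node in tree.body)


def equivalent_ignoring_order(expected, actual):
    """True if both sources hold the same top-level statements in any order."""
    return top_level_fingerprint(expected) == top_level_fingerprint(actual)


expected = "def a():\n    return 1\n\ndef b():\n    return 2\n"
actual = "def b():\n    return 2\n\n# helper moved below\ndef a():\n    return 1\n"

print(expected == actual)                           # prints False
print(equivalent_ignoring_order(expected, actual))  # prints True
```

The two files differ as text but contain the same functions, which is the situation the caveat above describes.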

Health Check

Last Commit

3 months ago

Responsiveness

1+ week

Star History

10 stars in the last 30 days