fast-apply  by kortix-ai

Pipeline for data generation and fine-tuning Qwen2.5 Coder models

created 10 months ago
327 stars

Top 84.6% on sourcepulse

Project Summary

This repository provides a pipeline for generating training data and fine-tuning Qwen2.5 Coder models to apply code edits quickly and accurately. It targets developers and AI-powered code editors that need high-throughput code modification, and its models reach roughly 150 to 340 tokens per second on fast inference providers.

How It Works

The project fine-tunes small Qwen2.5 Coder models (1.5B and 7B parameters) with the Unsloth library, which speeds up training and reduces VRAM usage. Data generation has three steps: clone open-source repositories, convert them into a structured dataset with repo_to_dataset.py, and prompt large language models such as Claude 3.5 Sonnet and GPT-4 to produce synthetic code update snippets. The fine-tuned models expect a specific inference prompt structure, designed for merging code updates while preserving the original code's structure, comments, and indentation.
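The repository-to-dataset step described above can be sketched roughly as follows. This is a minimal illustration, not the contents of repo_to_dataset.py: the function name repo_to_records, the extension filter, the size limit, and the record fields file_path and content are all assumptions made for the example.

```python
from pathlib import Path

# Hypothetical filter: this summary does not say which extensions the real script keeps.
SOURCE_EXTENSIONS = {".ts", ".tsx", ".py"}


def repo_to_records(repo_dir, max_chars=20_000):
    """Walk a cloned repository and return one record per source file.

    The record layout (file_path, content) is an assumption for illustration.
    """
    root = Path(repo_dir)
    records = []
    for path in sorted(root.rglob("*")):
        relative = path.relative_to(root)
        # Skip directories, non-source files, and anything under .git.
        if not path.is_file() or path.suffix not in SOURCE_EXTENSIONS:
            continue
        if ".git" in relative.parts:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        # Skip empty files and files too large to fit in a training example.
        if not text.strip() or len(text) > max_chars:
            continue
        records.append({"file_path": relative.as_posix(), "content": text})
    return records
```

Each record could then be sent to a large language model to produce a synthetic update snippet and the corresponding merged file.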

Quick Start & Requirements

  • Install: Clone the repository and run the provided Python scripts for data generation and fine-tuning.
  • Prerequisites: Python, HuggingFace libraries, Unsloth, Anthropic/OpenAI API access for data generation, Fireworks AI for deployment (optional).
  • Resources: Fine-tuning requires GPU resources; specific VRAM needs depend on model size and quantization. Data generation can be resource-intensive.
  • Links: Models and dataset are available on HuggingFace: https://huggingface.co/Kortix. Inference script: tests_evaluate/fireworks/test_fireworks.py.

Highlighted Details

  • Achieves ~340 tok/s for the 1.5B model and ~150 tok/s for the 7B model on fast providers like Fireworks.
  • Data generation pipeline uses Claude 3.5 Sonnet (70%) and GPT-4 (30%) for high-quality synthetic data.
  • Fine-tuning uses Unsloth for efficient QLoRA with 4-bit quantization.
  • Dataset composition is 80% TypeScript/TSX, 15% Python, and 5% other languages.
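To put the throughput figures above in concrete terms, the following sketch estimates how long one merge would take. It assumes the model regenerates the whole file, so output length roughly equals file length in tokens, and it ignores network and prompt-processing time. The function name estimated_seconds is hypothetical.

```python
# Approximate throughput figures quoted above, in tokens per second.
THROUGHPUT_TOK_PER_S = {"1.5B": 340, "7B": 150}


def estimated_seconds(output_tokens: int, model: str) -> float:
    """Estimate generation time, assuming steady throughput and no overhead."""
    return output_tokens / THROUGHPUT_TOK_PER_S[model]


if __name__ == "__main__":
    for model in THROUGHPUT_TOK_PER_S:
        seconds = estimated_seconds(1000, model)
        print(f"{model}: about {seconds:.1f} s for a 1,000-token file")
```

Under these assumptions, a 1,000-token file takes about 2.9 seconds on the 1.5B model and about 6.7 seconds on the 7B model.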

Maintenance & Community

  • The project is maintained by Kortix AI. The most recent commit was 1 month ago, with no pull requests or issues in the last 30 days.
  • Contributions are welcomed for adding more languages, reporting bugs, and suggesting features.

Licensing & Compatibility

  • The README does not explicitly state the license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

Evaluating code transformations is non-trivial: an update can be inserted at more than one valid location and functions can appear in a different order, so a plain file comparison may reject a correct result. The training data is predominantly TypeScript/TSX, so other languages are less well covered.
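One way to make a comparison tolerant of function ordering is to compare parsed definitions instead of raw text. The sketch below does this for Python sources with the standard ast module. It illustrates the idea and is not the project's evaluation method; the function name same_top_level_definitions is hypothetical, and a TypeScript equivalent would need a TypeScript parser.

```python
import ast


def same_top_level_definitions(source_a: str, source_b: str) -> bool:
    """Return True if both sources contain the same top-level statements,
    ignoring their order, comments, and whitespace."""

    def fingerprint(source: str) -> list[str]:
        # ast.dump omits line numbers by default, so position does not matter.
        return sorted(ast.dump(node) for node in ast.parse(source).body)

    return fingerprint(source_a) == fingerprint(source_b)


if __name__ == "__main__":
    expected = "def f():\n    return 1\n\ndef g():\n    return 2\n"
    merged = "# reordered\ndef g():\n    return 2\n\ndef f():\n    return 1\n"
    print(expected == merged)                            # False
    print(same_top_level_definitions(expected, merged))  # True
```

Ignoring order is only safe when the top-level statements are independent definitions; code with order-dependent side effects would need a stricter check.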

Health Check

  • Last commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 81 stars in the last 90 days
