pengzhangzhi/Open-dLLM: Diffusion LLM for code generation
Top 77.7% on SourcePulse
Open-dLLM addresses the lack of transparency and reproducibility in diffusion-based large language models (dLLMs) for code generation. It provides the complete open-source stack, including pretraining, evaluation, inference, and checkpoints, enabling researchers and engineers to fully understand, reproduce, and build upon dLLMs. The project's primary benefit is fostering open development and deeper insights into diffusion model architectures for code tasks.
How It Works
Open-dLLM is built on a diffusion-based LLM architecture and releases Open-dCoder, a model specialized for code generation. Its core advantage is comprehensive openness: it provides the entire pipeline from raw data to inference, whereas most other dLLM projects release only inference scripts and weights. The framework includes a pretraining pipeline built on open datasets, efficient inference scripts, and an evaluation suite covering standard code generation and infilling benchmarks.
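To make the decoding process concrete, here is a toy sketch of how masked-diffusion text generation typically works: start from an all-mask sequence, let a denoiser predict every masked position, then re-mask the lowest-confidence predictions on a shrinking schedule until nothing is masked. The `toy_denoiser`, its vocabulary, and the schedule are all hypothetical stand-ins for illustration only; they are not Open-dCoder's actual model or sampler.

```python
import random

MASK = "<mask>"

def toy_denoiser(tokens):
    # Hypothetical stand-in for the dLLM: for each masked position,
    # return a (predicted_token, confidence) pair.
    vocab = ["def", "add", "(", "a", ",", "b", ")", ":", "return", "a+b"]
    return {
        i: (vocab[i % len(vocab)], random.random())
        for i, t in enumerate(tokens)
        if t == MASK
    }

def diffusion_decode(length=10, steps=5):
    # Start from a fully masked sequence.
    tokens = [MASK] * length
    for step in range(steps, 0, -1):
        preds = toy_denoiser(tokens)
        if not preds:
            break
        # Fill every masked position with the current prediction.
        for i, (tok, _) in preds.items():
            tokens[i] = tok
        # Re-mask the lowest-confidence fraction; the fraction shrinks
        # each step, so the final step leaves everything unmasked.
        n_remask = int(len(preds) * (step - 1) / step)
        worst = sorted(preds, key=lambda i: preds[i][1])[:n_remask]
        for i in worst:
            tokens[i] = MASK
    return tokens

print(" ".join(diffusion_decode()))
```

Unlike left-to-right autoregressive decoding, every position is refined in parallel at each step, which is also why infilling (completing a masked middle span) falls out of the same procedure naturally.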
Quick Start & Requirements
Installation is managed via micromamba, with key Python dependencies including torch==2.5.0 (with cu121), flash-attn==2.7.4.post1, transformers==4.54.1, and triton>=3.1.0. A CUDA-enabled GPU (CUDA >= 12.3.0 recommended) is essential for running the models. The project provides a quickstart Python script for sampling code generation using the fredzzp/open-dcoder-0.5B model. Further details and the model can be found on Hugging Face: fredzzp/open-dcoder-0.5B.
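The dependency list above can be assembled into an environment roughly as follows. This is a hedged sketch, not the project's official install script: the environment name and Python version are assumptions, and the repo's own instructions should take precedence (flash-attn in particular is sensitive to the installed CUDA toolkit).

```shell
# Hypothetical environment setup; names and Python version are assumptions.
micromamba create -n open-dllm python=3.10 -y
micromamba activate open-dllm

# Pinned versions from the project's stated requirements.
pip install torch==2.5.0 --index-url https://download.pytorch.org/whl/cu121
pip install flash-attn==2.7.4.post1 transformers==4.54.1 "triton>=3.1.0"
```

With the environment in place, the repo's quickstart script can sample completions from the `fredzzp/open-dcoder-0.5B` checkpoint on Hugging Face.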
Maintenance & Community
The project acknowledges contributors including Fred Zhangzhi Peng, Shuibai Zhang, and Alex Tong. While specific community channels like Discord or Slack are not detailed in the README, the project aims to contribute back to the diffusion LLM community.
Licensing & Compatibility
The README does not explicitly state a software license. Without one, it is difficult to assess suitability for commercial use or integration into closed-source projects.
Limitations & Caveats
The most significant caveat is the lack of a declared software license, which is a critical factor for adoption decisions. Additionally, while Open-dCoder (0.5B) shows competitive results, its performance on code infilling tasks with "Oracle Length" suggests potential limitations in handling variable-length code completion without specific tuning. The setup also requires specific CUDA versions and environment management tools like micromamba.