OpenPangu7B-with-Medusa by wujing215

Accelerates large language model inference on Ascend NPUs

Created 8 months ago

357 stars

Top 78.1% on SourcePulse

Project Summary

OpenPangu7B-with-Medusa provides end-to-end speculative-decoding acceleration for the OpenPangu-7B large language model, optimized specifically for Ascend NPU hardware. It targets users who need fast LLM inference, particularly in low-latency scenarios, and uses the Medusa framework to cut the number of decoding steps.

How It Works

This project implements Medusa's speculative decoding approach. Medusa introduces lightweight prediction heads that generate multiple candidate tokens in parallel. A Tree Attention mechanism then efficiently validates these candidates, allowing the model to accept a longer sequence of predicted tokens in a single decoding step. This significantly reduces the number of forward passes required compared to standard autoregressive decoding. The implementation is engineered for Ascend NPUs, utilizing static tensor structures for candidate trees and minimizing host-device communication to leverage Ascend's graph execution capabilities.
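
The draft-then-verify idea described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the repository's code: `base_next` and `draft_heads` are hypothetical stand-ins for the base model and the Medusa heads, the candidates form a single chain rather than a tree, and verification is simulated with repeated calls instead of one Tree Attention forward pass.

```python
from typing import List, Tuple

VOCAB = 50  # toy vocabulary size (assumption for this sketch)


def base_next(ctx: List[int]) -> int:
    """Hypothetical stand-in for the base model's greedy next token."""
    return (sum(ctx) * 31 + len(ctx) * 7 + 3) % VOCAB


def draft_heads(ctx: List[int], num_heads: int) -> List[int]:
    """Hypothetical stand-in for Medusa heads: cheap guesses, sometimes wrong."""
    guesses, tmp = [], list(ctx)
    for k in range(num_heads):
        tok = base_next(tmp)
        if (len(tmp) + k) % 3 == 0:  # deliberately inject a wrong guess
            tok = (tok + 1) % VOCAB
        guesses.append(tok)
        tmp.append(tok)
    return guesses


def greedy_generate(prompt: List[int], max_new: int) -> Tuple[List[int], int]:
    """Standard autoregressive decoding: one token per decoding step."""
    out, steps = list(prompt), 0
    for _ in range(max_new):
        out.append(base_next(out))
        steps += 1
    return out[len(prompt):], steps


def speculative_generate(
    prompt: List[int], max_new: int, num_heads: int = 3
) -> Tuple[List[int], int]:
    """Accept the longest candidate prefix the base model agrees with."""
    out, steps = list(prompt), 0
    while len(out) - len(prompt) < max_new:
        candidates = draft_heads(out, num_heads)
        steps += 1  # one simulated verification pass per decoding step
        accepted: List[int] = []
        for tok in candidates:
            if tok != base_next(out + accepted):
                break  # first mismatch ends the accepted prefix
            accepted.append(tok)
        # The verification pass also yields the base model's own next token.
        accepted.append(base_next(out + accepted))
        out.extend(accepted)
    return out[len(prompt):len(prompt) + max_new], steps


if __name__ == "__main__":
    ref, ref_steps = greedy_generate([1, 2, 3], 12)
    spec, spec_steps = speculative_generate([1, 2, 3], 12)
    print("same output:", ref == spec)
    print("decoding steps:", ref_steps, "vs", spec_steps)
```

Because every accepted token is checked against the base model's greedy choice, the output matches plain autoregressive decoding while using fewer decoding steps whenever at least one guess is accepted.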

Quick Start & Requirements

Clone the repository: git clone https://github.com/wujing215/OpenPangu7B-with-Medusa.git
Change into third_party and clone Medusa there: git clone https://github.com/FasterDecoding/Medusa.git
Install: pip install -e .
Prerequisites: Ascend NPU hardware, OpenPangu-7B base model weights, and Medusa Heads weights. Hugging Face model loading is also supported.
Usage: Run inference with python inference/medusa_generate.py, or run benchmarks with python inference/benchmark.py. An interactive mode is also available.

Highlighted Details

Achieves up to 1.43x speedup for short sequence generation (64 tokens) on Ascend hardware.
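Reports an Accept Rate of 1.84 for 64-token generation. The summary does not define the metric; it is presumably the average number of tokens accepted per decoding step, in which case a value above 1 means fewer forward passes than one-token-per-step decoding.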
Static graph optimization and zero-copy mechanisms are key to minimizing overhead on Ascend 910B.
Best suited to low-latency scenarios with short-to-medium generation lengths.
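
The relationship between the two reported figures can be checked with a little arithmetic. The sketch below rests on two assumptions that the summary does not confirm: that "Accept Rate" means the average number of tokens emitted per decoding step, and that speedup is roughly tokens per step divided by the relative cost of a Medusa step. The function names are hypothetical.

```python
def steps_needed(new_tokens: int, tokens_per_step: float) -> float:
    """Approximate decoding steps if each step emits tokens_per_step tokens."""
    return new_tokens / tokens_per_step


def per_step_cost_ratio(tokens_per_step: float, observed_speedup: float) -> float:
    """Implied cost of one Medusa step relative to one plain decoding step.

    Assumes speedup ~= tokens_per_step / cost_ratio (an assumption of this sketch).
    """
    return tokens_per_step / observed_speedup


if __name__ == "__main__":
    accept_rate = 1.84       # reported for 64-token generation
    observed_speedup = 1.43  # reported for 64-token generation
    print(round(steps_needed(64, accept_rate), 1))                       # 34.8
    print(round(per_step_cost_ratio(accept_rate, observed_speedup), 2))  # 1.29
```

Under these assumptions, 64 tokens would take about 35 decoding steps instead of 64, and each Medusa step would cost roughly 1.29 times a plain step, which is one way to read the gap between an accept rate of 1.84 and a measured speedup of 1.43x.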

Maintenance & Community

No specific details regarding maintainers, community channels (e.g., Discord/Slack), or roadmaps are provided in the README.

Licensing & Compatibility

The README does not explicitly state the project's license or provide compatibility notes for commercial use or linking with closed-source software. The underlying Medusa implementation may have its own licensing terms.

Limitations & Caveats

The observed speedup shrinks as the generated sequence grows: from up to 1.43x at 64 tokens to 1.13x at 256 tokens. This is attributed to the lightweight Medusa Heads predicting less accurately as contextual complexity grows. The optimizations primarily target the Ascend NPU platform.
