OpenPangu7B-with-Medusa  by wujing215

Accelerates large language model inference on Ascend NPUs

Created 6 months ago
373 stars


Project Summary

OpenPangu7B-with-Medusa provides end-to-end speculative inference acceleration for the OpenPangu-7B large language model, optimized specifically for Ascend NPU hardware. It targets users who need fast LLM inference, particularly in low-latency scenarios, and uses the Medusa framework to reduce the number of decoding steps.

How It Works

This project implements Medusa's speculative decoding approach. Medusa adds lightweight prediction heads that propose multiple candidate tokens in parallel. A Tree Attention mechanism then verifies these candidates, allowing the model to accept several predicted tokens in a single decoding step. This reduces the number of forward passes required compared to standard autoregressive decoding. The implementation is engineered for Ascend NPUs: it uses static tensor structures for candidate trees and minimizes host-device communication to take advantage of Ascend's graph execution capabilities.
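The propose-then-verify loop described above can be illustrated with a toy sketch. Everything here is hypothetical and is not the repository's code: `toy_base_next` stands in for the base model's greedy next-token choice, `toy_heads` stands in for the Medusa heads, and verification is done with sequential calls for readability, whereas the real approach checks all candidates in one forward pass using Tree Attention.

```python
from typing import Callable, List, Sequence, Tuple

Token = int

def toy_base_next(ctx: Sequence[Token]) -> Token:
    # Hypothetical stand-in for the base model: "next token" counts upward mod 10.
    return (ctx[-1] + 1) % 10

def toy_heads(ctx: Sequence[Token]) -> List[List[Token]]:
    # Hypothetical stand-in for Medusa heads: two candidate continuations.
    last = ctx[-1]
    wrong = [(last + 3) % 10, (last + 4) % 10, (last + 5) % 10]  # wrong from the first token
    partly = [(last + 1) % 10, (last + 2) % 10, (last + 5) % 10]  # third token is wrong
    return [wrong, partly]

def verify(ctx: Sequence[Token],
           candidates: List[List[Token]],
           base_next: Callable[[Sequence[Token]], Token]) -> List[Token]:
    # Keep the longest candidate prefix that matches the base model's own
    # greedy choices, then append one extra token from the base model.
    best: List[Token] = []
    for cand in candidates:
        accepted: List[Token] = []
        for tok in cand:
            if base_next(list(ctx) + accepted) != tok:
                break
            accepted.append(tok)
        if len(accepted) > len(best):
            best = accepted
    return best + [base_next(list(ctx) + best)]

def generate(prompt: List[Token], n_new: int) -> Tuple[List[Token], int]:
    out, steps = list(prompt), 0
    while len(out) - len(prompt) < n_new:
        out.extend(verify(out, toy_heads(out), toy_base_next))
        steps += 1
    return out[:len(prompt) + n_new], steps

tokens, steps = generate([0], 6)
print(tokens, steps)  # [0, 1, 2, 3, 4, 5, 6] 2
```

In this toy run, six new tokens are produced in two decoding steps instead of six, and the output is identical to what plain greedy decoding would give, because only tokens the base model agrees with are accepted.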

Quick Start & Requirements

  1. Clone the repository: git clone https://github.com/wujing215/OpenPangu7B-with-Medusa.git
  2. Navigate to third_party and clone the Medusa submodule: git clone https://github.com/FasterDecoding/Medusa.git
  3. Install: pip install -e .
  4. Prerequisites: Ascend NPU hardware, OpenPangu-7B base model weights, and Medusa Heads weights. Hugging Face model loading is also supported.
  5. Usage: Run inference via python inference/medusa_generate.py or benchmarking with python inference/benchmark.py. Interactive mode is available.

Highlighted Details

  • Achieves up to 1.43x speedup for short sequence generation (64 tokens) on Ascend hardware.
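  • Reports an Accept Rate of 1.84 for 64-token generation. The summary does not define the metric; a value above 1 suggests it is the average number of tokens accepted per decoding step rather than a fraction.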
  • Static graph optimization and zero-copy mechanisms are key to minimizing overhead on Ascend 910B.
  • Effective for scenarios requiring low latency, particularly with medium-to-short text generation.
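As a rough sanity check on the first two figures, the sketch below relates the reported Accept Rate to the reported speedup. It rests on two assumptions that the summary does not confirm: that the Accept Rate of 1.84 is the mean number of tokens produced per decoding step, and that speedup is approximately tokens per step divided by the relative cost of one speculative step. The function name is illustrative.

```python
def implied_step_cost(tokens_per_step: float, speedup: float) -> float:
    # Assumed model: speedup ~= tokens_per_step / relative_step_cost,
    # so relative_step_cost ~= tokens_per_step / speedup.
    return tokens_per_step / speedup

# Figures reported for 64-token generation.
ratio = implied_step_cost(tokens_per_step=1.84, speedup=1.43)
print(f"{ratio:.2f}")  # 1.29
```

Under those assumptions, each speculative step would cost about 1.29 times a plain decoding step, which is consistent with the emphasis on minimizing per-step overhead.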

Maintenance & Community

No specific details regarding maintainers, community channels (e.g., Discord/Slack), or roadmaps are provided in the README.

Licensing & Compatibility

The README does not explicitly state the project's license or provide compatibility notes for commercial use or linking with closed-source software. The underlying Medusa implementation may have its own licensing terms.

Limitations & Caveats

The observed speedup decreases as the generated sequence length increases (e.g., 1.13x for 256 tokens). This is attributed to a reduction in the prediction accuracy of the lightweight Medusa Heads as contextual complexity grows. The optimizations are primarily targeted at the Ascend NPU platform.
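To put the two reported speedups in terms of wall-clock time, the following sketch converts a speedup factor into a fractional latency reduction. It assumes speedup is the ratio of baseline time to accelerated time for the same output, which the summary does not state explicitly.

```python
def latency_reduction(speedup: float) -> float:
    # If accelerated_time = baseline_time / speedup, the fraction of time saved
    # is 1 - 1 / speedup.
    return 1.0 - 1.0 / speedup

for n_tokens, speedup in [(64, 1.43), (256, 1.13)]:
    print(f"{n_tokens} tokens: {latency_reduction(speedup):.1%} less time")
# 64 tokens: 30.1% less time
# 256 tokens: 11.5% less time
```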

Health Check

  • Last commit: 2 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 1

Star History

270 stars in the last 30 days
