wujing215
Accelerates large language model inference on Ascend NPUs
Top 75.8% on SourcePulse
OpenPangu-7B-with-Medusa provides end-to-end speculative decoding acceleration for the OpenPangu-7B large language model, optimized specifically for Ascend NPU hardware. It targets users who need high-throughput LLM inference, particularly in low-latency scenarios, and uses the Medusa framework to significantly reduce the number of decoding steps.
How It Works
This project implements Medusa's speculative decoding approach. Medusa introduces lightweight prediction heads that generate multiple candidate tokens in parallel. A Tree Attention mechanism then efficiently validates these candidates, allowing the model to accept a longer sequence of predicted tokens in a single decoding step. This significantly reduces the number of forward passes required compared to standard autoregressive decoding. The implementation is engineered for Ascend NPUs, utilizing static tensor structures for candidate trees and minimizing host-device communication to leverage Ascend's graph execution capabilities.
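The verify-and-accept step described above can be sketched in plain Python, with no model or NPU code. This is only an illustration under stated assumptions: verification is greedy (a candidate is accepted only if it equals the base model's top prediction at its parent node), node 0 is the last already-accepted token, and parents always precede their children. All names here (build_tree_mask, longest_accepted_path, and the parents / tokens / verified_next arrays) are hypothetical and are not taken from the repository or from the Medusa library.

```python
from typing import List, Sequence


def build_tree_mask(parents: Sequence[int]) -> List[List[bool]]:
    """Tree attention mask: node i may attend to itself and its ancestors.

    parents[i] is the parent index of node i, or -1 for the root.
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:  # walk up to the root, marking each ancestor
            mask[i][j] = True
            j = parents[j]
    return mask


def longest_accepted_path(
    parents: Sequence[int],
    tokens: Sequence[int],
    verified_next: Sequence[int],
) -> List[int]:
    """Return the tokens accepted in one decoding step.

    tokens[i] is the candidate token at node i (node 0 is the root).
    verified_next[i] is the base model's greedy prediction after node i.
    Assumes parents[i] < i for every non-root node.
    """
    n = len(parents)
    accepted = [False] * n
    depth = [0] * n
    accepted[0] = True  # the root is already part of the output
    best = 0
    for i in range(1, n):
        p = parents[i]
        depth[i] = depth[p] + 1
        # A candidate survives only if its whole path survived and it
        # matches what the base model predicts after its parent.
        accepted[i] = accepted[p] and tokens[i] == verified_next[p]
        if accepted[i] and depth[i] > depth[best]:
            best = i
    path = []
    node = best
    while node != 0:
        path.append(tokens[node])
        node = parents[node]
    path.reverse()
    # The base model's prediction after the last accepted node is free.
    path.append(verified_next[best])
    return path


# Hypothetical candidate tree: root 0 with children 1 and 2, and so on.
parents = [-1, 0, 0, 1, 1, 2]
tokens = [10, 11, 12, 13, 14, 15]
verified_next = [11, 14, 99, 50, 60, 70]
print(longest_accepted_path(parents, tokens, verified_next))  # [11, 14, 60]
```

Because the mask depends only on the tree layout and not on the token values, it can be built once and reused on every step, which is in the spirit of the static candidate-tree structures mentioned above.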
Quick Start & Requirements
Clone the repository: git clone https://github.com/wujing215/OpenPangu7B-with-Medusa.git
In third_party, clone the Medusa submodule: git clone https://github.com/FasterDecoding/Medusa.git
Install: pip install -e .
Run inference with python inference/medusa_generate.py, or benchmarking with python inference/benchmark.py. Interactive mode is available.
Highlighted Details
Maintenance & Community
No specific details regarding maintainers, community channels (e.g., Discord/Slack), or roadmaps are provided in the README.
Licensing & Compatibility
The README does not explicitly state the project's license or provide compatibility notes for commercial use or linking with closed-source software. The underlying Medusa implementation may have its own licensing terms.
Limitations & Caveats
The observed speedup decreases as the generated sequence length increases (e.g., 1.13x for 256 tokens). This is attributed to a reduction in the prediction accuracy of the lightweight Medusa Heads as contextual complexity grows. The optimizations are primarily targeted at the Ascend NPU platform.
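The link between head accuracy and speedup can be made concrete with a common back-of-the-envelope estimate: speedup is roughly the mean number of tokens accepted per decoding step divided by the relative cost of a speculative step versus a plain autoregressive step. The sketch below assumes that simple model; the function names and the numbers are hypothetical and are not measurements from this project.

```python
import math


def estimate_speedup(mean_tokens_per_step: float, step_cost_ratio: float) -> float:
    """Rough speedup estimate over plain autoregressive decoding.

    mean_tokens_per_step: average tokens accepted per forward pass.
    step_cost_ratio: cost of one speculative step relative to one
        ordinary decoding step (1.0 means no extra overhead).
    """
    return mean_tokens_per_step / step_cost_ratio


def forward_passes(num_tokens: int, mean_tokens_per_step: float) -> int:
    """Approximate forward passes needed to emit num_tokens tokens."""
    return math.ceil(num_tokens / mean_tokens_per_step)


# Hypothetical values, for illustration only.
print(estimate_speedup(2.0, 1.25))  # 1.6
print(forward_passes(256, 2.0))     # 128
```

Under this model, as the heads become less accurate the mean tokens accepted per step falls toward 1, so the estimate falls toward 1 / step_cost_ratio and the overhead of the extra heads starts to dominate.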
2 months ago
Inactive