LLM-TPU by sophgo

Generative AI model deployment on Sophgo edge TPUs

Created 1 year ago
253 stars

Top 99.4% on SourcePulse

View on GitHub
Project Summary

The sophgo/LLM-TPU project facilitates the deployment of various open-source generative AI models, with a focus on Large Language Models (LLMs), onto Sophgo's BM1684X and BM1688 (CV186X) AI accelerator chips. It targets developers and researchers aiming to leverage specialized hardware for efficient AI inference. The primary benefit is enabling high-performance generative AI workloads on custom ASICs, bridging the gap between model availability and dedicated hardware deployment.

How It Works

This project employs a two-stage process: model conversion and runtime inference. Models are first compiled into Sophgo's proprietary bmodel format by the TPU-MLIR compiler; the tpu-runtime inference engine, accessed through C++ interfaces, then executes those bmodel files in either PCIe or SoC environments. This split allows optimizations tailored to the specific architecture of the BM1684X/BM1688 chips.
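
As a rough illustration of the conversion stage, TPU-MLIR's model_transform / model_deploy tools first lower an exported model to MLIR and then compile it into a bmodel. This is a minimal sketch, not the project's exact recipe: the per-model export steps, quantization modes, and file names (such as block_0.onnx here) are placeholders, and each model directory in the repository documents the real commands.

# Lower an exported ONNX graph to TPU-MLIR's intermediate form
# (block_0.onnx and llama2_block_0 are placeholder names)
model_transform.py \
    --model_name llama2_block_0 \
    --model_def block_0.onnx \
    --mlir block_0.mlir

# Compile the MLIR into a bmodel for the BM1684X target
# (W4BF16, i.e. 4-bit weights with BF16 activations, is one of several quantize modes)
model_deploy.py \
    --mlir block_0.mlir \
    --quantize W4BF16 \
    --chip bm1684x \
    --model block_0.bmodel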

Quick Start & Requirements

To begin, clone the repository and execute the provided shell script:

git clone https://github.com/sophgo/LLM-TPU.git
cd LLM-TPU
./run.sh --model llama2-7b
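
The same script should launch other supported models by name; the identifier below is illustrative, so check run.sh for the accepted values:

# Hypothetical model name; run.sh defines the actual supported identifiers
./run.sh --model qwen2.5-7b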

Primary requirements are Sophgo BM1684X or BM1688 hardware. The README implies, but does not exhaustively list, the software dependencies for the runtime and compiler; further details are in its "Quick Start" section and the linked documentation.

Highlighted Details

  • Broad support for LLMs (e.g., Llama3.1, Qwen3, ChatGLM4, Gemma2) and for multimodal and image-generation models (e.g., Qwen-VL, InternVL2, Stable Diffusion XL).
  • Recent model updates include Qwen3, QWQ-32B, and DeepSeek-R1-Distill-Qwen series.
  • Advanced features such as multi-core parallelism, speculative sampling (LookaheadDecoding), and prefill cache reuse are implemented.
  • Support for both PCIe and SoC deployment configurations (see the sketch after this list).
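
Deployment on both targets uses the same bmodel; only the device binding differs. The following is a hypothetical launch of a model's Python demo, with the entry point, flags, and paths all placeholders; each model directory in the repository documents its actual demo:

# Hypothetical demo launch: on a PCIe host, --devid selects the accelerator card;
# on an SoC, the same process runs directly against the on-board TPU
python3 pipeline.py --model_path llama2-7b.bmodel --tokenizer_path ./token_config --devid 0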

Maintenance & Community

The project appears actively updated, with recent additions in April 2025. For hardware-specific inquiries, users are directed to contact Sophgo via their official website. No direct links to community forums like Discord or Slack were found in the provided README.

Licensing & Compatibility

The README does not specify a software license for the LLM-TPU project itself. Compatibility for commercial use or integration into closed-source projects is therefore unclear and requires direct inquiry with Sophgo.

Limitations & Caveats

The project is tied exclusively to Sophgo's BM1684X and BM1688 hardware, so it offers nothing to users without that silicon. The absence of a stated software license is a significant adoption blocker for many organizations. For precision, the project's guidance is to prefer AWQ- or GPTQ-quantized models, or to calibrate floating-point models with llmc-tpu.

Health Check

  • Last commit: 5 days ago
  • Responsiveness: 1 week
  • Pull requests (30d): 0
  • Issues (30d): 1
  • Star history: 6 stars in the last 30 days

Explore Similar Projects

Starred by Pawel Garbacki (Cofounder of Fireworks AI), Luis Capelo (Cofounder of Lightning AI), and 3 more.

zml by zml

AI inference stack for production
0.4% · 3k stars · Created 1 year ago · Updated 1 day ago
Starred by Wing Lian (Founder of Axolotl AI) and Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems").

airllm by lyogavin

Inference optimization for LLMs on low-resource hardware
0.7% · 6k stars · Created 2 years ago · Updated 2 months ago
Starred by Luis Capelo (Cofounder of Lightning AI), Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral), and 4 more.

ktransformers by kvcache-ai

Framework for LLM inference optimization experimentation
0.3% · 15k stars · Created 1 year ago · Updated 11 hours ago