LLM-TPU by sophgo

Generative AI model deployment on Sophgo edge TPUs

Created 1 year ago
253 stars

Top 99.4% on SourcePulse

View on GitHub
Project Summary

The sophgo/LLM-TPU project facilitates the deployment of various open-source generative AI models, with a focus on Large Language Models (LLMs), onto Sophgo's BM1684X and BM1688 (CV186X) AI accelerator chips. It targets developers and researchers aiming to leverage specialized hardware for efficient AI inference. The primary benefit is enabling high-performance generative AI workloads on custom ASICs, bridging the gap between model availability and dedicated hardware deployment.

How It Works

This project employs a two-stage process: model conversion and runtime inference. Models are first compiled into Sophgo's proprietary bmodel format by the TPU-MLIR compiler; the tpu-runtime inference engine, accessed through C++ interfaces, then executes those bmodel files in either PCIe or SoC environments. This split allows optimizations tailored to the specific architecture of the BM1684X/BM1688 chips.
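
As a rough illustration of the conversion stage, TPU-MLIR's model_transform / model_deploy tools first lower an exported model to MLIR and then compile it into a bmodel. This is a minimal sketch, not the project's exact recipe: the per-model export steps, quantization modes, and file names (such as block_0.onnx here) are placeholders, and each model directory in the repository documents the real commands.

# Lower an exported ONNX graph to TPU-MLIR's intermediate form
# (block_0.onnx and llama2_block_0 are placeholder names)
model_transform.py \
    --model_name llama2_block_0 \
    --model_def block_0.onnx \
    --mlir block_0.mlir

# Compile the MLIR into a bmodel for the BM1684X target
# (W4BF16, i.e. 4-bit weights with BF16 activations, is one of several quantize modes)
model_deploy.py \
    --mlir block_0.mlir \
    --quantize W4BF16 \
    --chip bm1684x \
    --model block_0.bmodel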

Quick Start & Requirements

To begin, clone the repository and execute the provided shell script:

git clone https://github.com/sophgo/LLM-TPU.git
cd LLM-TPU
./run.sh --model llama2-7b
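
The same script should launch other supported models by name; the identifier below is illustrative, so check run.sh for the accepted values:

# Hypothetical model name; run.sh defines the actual supported identifiers
./run.sh --model qwen2.5-7b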

Primary requirements are Sophgo BM1684X or BM1688 hardware. The README implies, but does not exhaustively list, the software dependencies for the runtime and compiler; further details are in its "Quick Start" section and the linked documentation.

Highlighted Details

  • Broad support for LLMs (e.g., Llama3.1, Qwen3, ChatGLM4, Gemma2) and for multimodal and image-generation models (e.g., Qwen-VL, InternVL2, Stable Diffusion XL).
  • Recent model updates include Qwen3, QWQ-32B, and DeepSeek-R1-Distill-Qwen series.
  • Advanced features such as multi-core parallelism, speculative sampling (LookaheadDecoding), and prefill cache reuse are implemented.
  • Support for both PCIe and SoC deployment configurations (see the sketch after this list).
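
Deployment on both targets uses the same bmodel; only the device binding differs. The following is a hypothetical launch of a model's Python demo, with the entry point, flags, and paths all placeholders; each model directory in the repository documents its actual demo:

# Hypothetical demo launch: on a PCIe host, --devid selects the accelerator card;
# on an SoC, the same process runs directly against the on-board TPU
python3 pipeline.py --model_path llama2-7b.bmodel --tokenizer_path ./token_config --devid 0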

Maintenance & Community

The project appears actively updated, with recent additions in April 2025. For hardware-specific inquiries, users are directed to contact Sophgo via their official website. No direct links to community forums like Discord or Slack were found in the provided README.

Licensing & Compatibility

The README does not specify a software license for the LLM-TPU project itself. Compatibility for commercial use or integration into closed-source projects is therefore unclear and requires direct inquiry with Sophgo.

Limitations & Caveats

The project is tied exclusively to Sophgo's BM1684X and BM1688 hardware, so it offers nothing to users without that silicon. The absence of a stated software license is a significant adoption blocker for many organizations. For precision, the project's guidance is to prefer AWQ- or GPTQ-quantized models, or to calibrate floating-point models with llmc-tpu.

Health Check

  • Last commit: 5 days ago
  • Responsiveness: 1 week
  • Pull requests (30d): 0
  • Issues (30d): 1
  • Star history: 6 stars in the last 30 days

Explore Similar Projects

Starred by Pawel Garbacki (Cofounder of Fireworks AI), Luis Capelo (Cofounder of Lightning AI), and 3 more.

zml by zml

AI inference stack for production
0.4% · 3k stars · Created 1 year ago · Updated 1 day ago
Starred by Wing Lian (Founder of Axolotl AI) and Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems").

airllm by lyogavin

Inference optimization for LLMs on low-resource hardware
0.7% · 6k stars · Created 2 years ago · Updated 2 months ago
Starred by Luis Capelo (Cofounder of Lightning AI), Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral), and 4 more.

ktransformers by kvcache-ai

Framework for LLM inference optimization experimentation
0.3% · 15k stars · Created 1 year ago · Updated 11 hours ago