Hunyuan-A13B by Tencent-Hunyuan

LLM with fine-grained MoE architecture

Created 2 months ago
748 stars

Top 46.4% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

Hunyuan-A13B is an open-source large language model from Tencent, built on a fine-grained Mixture-of-Experts (MoE) architecture. It offers a balance of high performance and computational efficiency, making it suitable for advanced reasoning and general-purpose applications, particularly in resource-constrained environments.

How It Works

The model has 80 billion total parameters, of which 13 billion are active per forward pass, using fine-grained MoE routing for efficiency. It supports hybrid reasoning (fast and slow thinking), a 256K context window, and is optimized for agent tasks. Grouped Query Attention (GQA) and multiple quantization formats (FP8, INT4) enable efficient inference.
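
To make the fine-grained MoE idea concrete, below is a minimal toy sketch of top-k expert routing in PyTorch. It is illustrative only, not Hunyuan-A13B's actual implementation; the expert count, top-k value, and dimensions are placeholders.

```python
import torch
import torch.nn as nn

class ToyMoELayer(nn.Module):
    """Toy top-k MoE layer: only k experts run per token, so active
    parameters stay far below total parameters (the 13B-of-80B idea)."""
    def __init__(self, d_model=512, n_experts=8, k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Linear(d_model, d_model) for _ in range(n_experts)]
        )
        self.router = nn.Linear(d_model, n_experts)
        self.k = k

    def forward(self, x):  # x: (tokens, d_model)
        scores = self.router(x)                     # (tokens, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)  # pick k experts per token
        weights = weights.softmax(dim=-1)           # normalize over chosen k
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e            # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out
```

Because only k of the n experts run for each token, per-token compute tracks the active parameter count rather than the total, which is how an 80B-parameter model can run with roughly 13B active parameters.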

Quick Start & Requirements

  • Installation: Primarily via the Hugging Face transformers library, or via pre-built Docker images for TensorRT-LLM, vLLM, and SGLang (see the loading sketch after this list).
  • Prerequisites: the transformers library and PyTorch; the vLLM Docker image additionally requires the NVIDIA Container Toolkit and CUDA 12.8. TensorRT-LLM deployment requires dedicated configuration files.
  • Resources: Model weights are large (80B parameters is roughly 160 GB in bf16); the FP8 and INT4 quantized variants substantially reduce memory requirements.
  • Links: Hugging Face, ModelScope, TensorRT-LLM Docker, vLLM Docker, SGLang Docker.
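
As a minimal loading sketch via transformers, assuming the Hugging Face model id tencent/Hunyuan-A13B-Instruct and the standard AutoModelForCausalLM API; verify the exact id and any trust_remote_code requirement against the model card:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed model id; verify against the Hugging Face page linked above.
MODEL_ID = "tencent/Hunyuan-A13B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype="auto",   # use the checkpoint's native dtype
    device_map="auto",    # shard across available GPUs
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Explain mixture-of-experts briefly."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```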

Highlighted Details

  • Achieves competitive performance on benchmarks such as MMLU and GSM8K, as well as agent-specific tasks.
  • Offers FP8 and INT4 quantized versions for a reduced memory footprint and faster inference.
  • Supports flexible reasoning modes (fast/slow thinking) via prompt engineering or API parameters (see the sketch after this list).
  • Natively handles a 256K context window.
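
A sketch of switching reasoning modes at the prompt level, continuing from the loading example above. The /no_think prefix is an assumption based on the pattern hybrid-reasoning models commonly use; confirm the exact control token and any chat-template parameter in the model card.

```python
# Continuing from the quickstart above (tokenizer and model already loaded).
# Prefixing the user message is one pattern hybrid-reasoning models use;
# the "/no_think" token here is an assumption to verify against the model card.
fast = [{"role": "user", "content": "/no_think What is 17 * 23?"}]
slow = [{"role": "user", "content": "Prove that sqrt(2) is irrational."}]

for messages in (fast, slow):
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(inputs, max_new_tokens=512)
    print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```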

Maintenance & Community

Activity metrics are summarized in the Health Check section below.

Licensing & Compatibility

  • License details are not stated in the README snippet; do not assume permissive terms. Verify the repository's license file before any commercial use.

Limitations & Caveats

The README pins specific CUDA versions for some Docker deployments (e.g., CUDA 12.8 for vLLM), so older driver and toolkit stacks may be incompatible. Benchmark results are reported, but direct comparisons may not cover all relevant peer models.

Health Check

  • Last Commit: 2 months ago
  • Responsiveness: 1+ week
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 22 stars in the last 30 days

Explore Similar Projects

LitServe by Lightning-AI
AI inference pipeline framework
0.3% · 4k stars · Created 1 year ago · Updated 1 day ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Luis Capelo (Cofounder of Lightning AI), and 3 more.

LightLLM by ModelTC
Python framework for LLM inference and serving
0.5% · 4k stars · Created 2 years ago · Updated 12 hours ago
Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 9 more.

mistral.rs by EricLBuehler
LLM inference engine for blazing fast performance
0.3% · 6k stars · Created 1 year ago · Updated 22 hours ago
Starred by Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), Omar Sanseviero (DevRel at Google DeepMind), and 11 more.