Hunyuan-A13B by Tencent-Hunyuan

LLM with fine-grained MoE architecture

created 1 month ago
717 stars

Top 48.9% on sourcepulse

View on GitHub
1 Expert Loves This Project
Project Summary

Hunyuan-A13B is an open-source large language model from Tencent, built on a fine-grained Mixture-of-Experts (MoE) architecture. It offers a balance of high performance and computational efficiency, making it suitable for advanced reasoning and general-purpose applications, particularly in resource-constrained environments.

How It Works

The model has 80 billion total parameters, of which 13 billion are active per forward pass, using the fine-grained MoE design for efficiency. It supports hybrid reasoning (fast and slow thinking modes), a 256K context window, and is optimized for agent tasks. Grouped Query Attention (GQA) and multiple quantization formats (FP8, INT4) enable efficient inference.
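For a concrete sense of how this surfaces in code, here is a minimal sketch of loading the model with Hugging Face transformers and toggling the slow-thinking mode. The repo id tencent/Hunyuan-A13B-Instruct and the enable_thinking chat-template flag are assumptions based on common conventions; verify both against the model card.

```python
# Minimal sketch: load Hunyuan-A13B and toggle slow-thinking mode.
# ASSUMPTIONS: repo id "tencent/Hunyuan-A13B-Instruct" and an
# `enable_thinking` chat-template flag -- check the model card.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tencent/Hunyuan-A13B-Instruct"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # use the dtype recorded in the checkpoint config
    device_map="auto",    # shard across available GPUs
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Explain grouped query attention briefly."}]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    enable_thinking=True,  # assumed flag selecting the slow-thinking mode
    return_tensors="pt",
).to(model.device)

output = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```

Note that only 13B of the 80B parameters are exercised per token, but the full weights must still fit in memory, which is where the quantized variants become relevant.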

Quick Start & Requirements

  • Installation: Primarily via the Hugging Face transformers library or pre-built Docker images for TensorRT-LLM, vLLM, and SGLang (see the vLLM sketch after this list).
  • Prerequisites: transformers library, PyTorch. Docker images require NVIDIA Container Toolkit and CUDA 12.8 for vLLM. TensorRT-LLM deployment requires specific configuration files.
  • Resources: Model weights are substantial; quantization significantly reduces requirements.
  • Links: Hugging Face, ModelScope, TensorRT-LLM Docker, vLLM Docker, SGLang Docker.
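
As a hedged illustration of the vLLM path mentioned above, the sketch below runs offline inference with vLLM's Python API, assuming an environment where vLLM is installed (e.g., the project's vLLM Docker image). The repo id and settings are assumptions; consult the model card for exact values.

```python
# Sketch of offline inference with vLLM (assumed repo id and settings;
# run inside the project's vLLM Docker image or any env with vLLM installed).
from vllm import LLM, SamplingParams

llm = LLM(
    model="tencent/Hunyuan-A13B-Instruct",  # assumed Hugging Face repo id
    trust_remote_code=True,                 # custom model code may be required
    tensor_parallel_size=2,                 # match your GPU count
    max_model_len=32768,                    # raise toward 256K if memory allows
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(
    ["Summarize the benefits of a fine-grained MoE design."], params
)
print(outputs[0].outputs[0].text)
```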

Highlighted Details

  • Achieves competitive results on benchmarks such as MMLU and GSM8K, as well as on agent-specific tasks.
  • Offers FP8 and INT4 quantized versions for a reduced memory footprint and faster inference (a loading sketch follows this list).
  • Supports flexible reasoning modes (fast/slow thinking) via prompt engineering or API parameters.
  • Natively handles a 256K context window.
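
The quantized variants load the same way as the full-precision model; only the repo id changes. The "-FP8" suffix below is an assumption based on common Hugging Face naming conventions, so confirm the exact quantized repo ids on the model card.

```python
# Sketch: point the same transformers loading code at a quantized checkpoint.
# ASSUMPTION: the "-FP8" repo suffix; confirm exact ids on the model card.
from transformers import AutoModelForCausalLM, AutoTokenizer

quant_id = "tencent/Hunyuan-A13B-Instruct-FP8"  # assumed quantized repo id
tokenizer = AutoTokenizer.from_pretrained(quant_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    quant_id,
    device_map="auto",
    trust_remote_code=True,  # quantization config is read from the checkpoint
)
```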

Maintenance & Community

The repository is about a month old; maintainer responsiveness is currently listed as inactive, with modest pull request and issue activity over the last 30 days (see the Health Check section below).

Licensing & Compatibility

  • License details are not stated in the README snippet summarized here; Tencent's Hunyuan models are typically released under a custom community license rather than a standard permissive one, so commercial-use terms should be verified against the repository's LICENSE file.

Limitations & Caveats

The README pins specific CUDA versions for some Docker deployments (e.g., CUDA 12.8 for vLLM), which may constrain compatibility with older driver and toolkit setups. Benchmark results are reported in detail, but direct comparisons against all relevant competing models may be limited.

Health Check

  • Last commit: 3 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 4
  • Issues (30d): 6
  • Star history: 719 stars in the last 90 days

Starred by Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), Michael Han (Cofounder of Unsloth), and 1 more.

Explore Similar Projects

ktransformers by kvcache-ai

0.4%
15k
Framework for LLM inference optimization experimentation
created 1 year ago
updated 2 days ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Nat Friedman (Former CEO of GitHub), and 32 more.

llama.cpp by ggml-org

0.4%
84k
C/C++ library for local LLM inference
created 2 years ago
updated 16 hours ago