Ling-V2 by inclusionAI

Efficient MoE LLMs for advanced reasoning and high-speed generation

Created 5 months ago

257 stars

Top 98.3% on SourcePulse

Project Summary

Ling-V2 is an open-source family of Mixture-of-Experts (MoE) Large Language Models (LLMs) from InclusionAI, designed to deliver state-of-the-art performance with high computational efficiency. Targeting researchers and developers seeking powerful yet resource-conscious LLMs, Ling-V2 offers significant advantages in complex reasoning and instruction following, achieving performance comparable to much larger dense models with a fraction of activated parameters.

How It Works

Ling-V2 employs a 1/32 activation ratio MoE architecture, meticulously optimized with choices in expert granularity, shared expert ratios, attention mechanisms, and routing strategies like sigmoid routing and aux-loss-free design. This sparse activation, combined with techniques such as MTP loss, QK-Norm, and half RoPE, allows models like Ling-mini-2.0 (16B total parameters, 1.4B activated) to deliver performance equivalent to 7–8B dense models. Furthermore, the project leverages FP8 mixed-precision training, utilizing tile/blockwise FP8 scaling, FP8 optimizers, and on-demand transpose weights for extreme memory optimization and efficient training.

Quick Start & Requirements

Integration is primarily supported via Hugging Face Transformers with a provided code snippet. For users in mainland China, ModelScope is recommended. Advanced inference can be achieved using vLLM or SGLang, both requiring the cloning of their respective repositories and applying provided patches (bailing_moe_v2.patch) to their installations. Specific hardware requirements are not detailed beyond GPU mentions for performance benchmarks and inference speed examples (e.g., H20, 80G GPUs). Users should ensure compatibility with Python environments supporting these libraries. Links to model downloads (Hugging Face, ModelScope) and external libraries (vLLM, SGLang) are available within the repository.

Highlighted Details

Efficiency: Achieves over 7x equivalent dense performance, with Ling-mini-2.0 (1.4B activated) matching 7–8B dense models.
Speed: Generates at over 300 tokens/s (Ling-mini-2.0 on H20), more than 2x faster than comparable dense models.
Context: Supports up to 128K context length using YaRN.
Training: Open-sourced FP8 efficient training solution and multiple pre-training checkpoints (up to 20T tokens) are available.
Model Variants: Includes Ling-mini-2.0 (1.4B activated) and Ling-flash-2.0 (6.1B activated).

Maintenance & Community

The project is provided by InclusionAI. Specific details regarding community channels (e.g., Discord, Slack), active contributors, sponsorships, or a public roadmap are not detailed in the provided README.

Licensing & Compatibility

The code repository is licensed under the permissive MIT License, allowing for broad use, including commercial applications and linking with closed-source software.

Limitations & Caveats

Integration with vLLM and SGLang currently requires users to manually apply patches to the respective libraries, as these changes are not yet merged into their official releases. Support for Mixture-of-Thought (MoT) is noted as available for base models in SGLang but not yet for chat models. Specific hardware requirements beyond GPU usage for performance metrics are not explicitly detailed.

Ling-V2 by inclusionAI

Explore Similar Projects

Seed-Thinking-v1.5 by ByteDance-Seed

LLaDA2.X by inclusionAI

XBai-o4 by MetaStone-AI

dots.llm1 by rednote-hilab

fiddler by efeslab

gpt-oss-recipes by huggingface

snowflake-arctic by Snowflake-Labs

Seed-Coder by ByteDance-Seed

MobileLLM by facebookresearch

optimum-neuron by huggingface

lightning-thunder by Lightning-AI

openvino by openvinotoolkit