YAYI2 by wenge-research

Chinese LLM for research, base and chat versions, 30B parameters

Created 2 years ago

3,426 stars

Top 14.1% on SourcePulse

View on GitHub

1 Expert Loves This Project

Omar Sanseviero

DevRel at Google DeepMind

Project Summary

YAYI 2 is a 30B parameter multilingual large language model developed by wenge-research, designed to advance the Chinese LLM ecosystem. It offers Base and Chat versions, trained on over 2 trillion tokens of high-quality, multilingual data, and fine-tuned with millions of instructions and RLHF for better alignment.

How It Works

YAYI 2 is a Transformer-based LLM. Its training corpus includes a significant portion of Chinese data, alongside other languages, processed through a rigorous data pipeline involving standardization, cleaning, deduplication, and toxicity filtering. The model utilizes a Byte-Pair Encoding (BPE) tokenizer trained on 500GB of multilingual data, with a vocabulary size of 81920, featuring digit splitting and manual addition of HTML identifiers for improved performance.

Quick Start & Requirements

Inference: Clone the repository, create a conda environment (conda create --name yayi_inference_env python=3.8), activate it, and install dependencies (pip install transformers==4.33.1 torch==2.0.1 sentencepiece==0.1.99 accelerate==0.25.0). Inference example provided uses transformers and requires a CUDA-enabled GPU (e.g., A100/A800).
Fine-tuning: Requires Python 3.10, deepspeed, transformers, accelerate, flash-attn, and triton. Full parameter fine-tuning is recommended on 16x A100 (80G) or higher. LoRA fine-tuning is also supported.
Resources: Inference can run on a single A100/A800. Full parameter training requires significant distributed hardware.
Documentation: README, Hugging Face Repo, Technical Report.

Highlighted Details

Achieves strong performance across various benchmarks (C-Eval, MMLU, GAOKAO-Bench, GSM8K, HumanEval) compared to similarly sized open-source models.
Supports both full parameter and LoRA fine-tuning with deepspeed.
Tokenizer handles unknown characters due to its byte-level nature and includes specific optimizations for numerical and HTML data.
Models and data are available on the Modao community platform.

Maintenance & Community

Models and data uploaded to Modao community.
Technical report published in December 2023.
Links to Hugging Face repo provided.

Licensing & Compatibility

Code licensed under Apache-2.0.
Model and Data licensed under CC BY-NC 4.0 and a custom YAYI license, respectively.
Commercial use requires explicit permission via a registration form and approval from wenge-research.

Limitations & Caveats