YAYI2 by wenge-research

Chinese LLM for research, base and chat versions, 30B parameters

Created 1 year ago
3,425 stars

Top 14.1% on SourcePulse

Project Summary

YAYI 2 is a 30B parameter multilingual large language model developed by wenge-research, designed to advance the Chinese LLM ecosystem. It offers Base and Chat versions, trained on over 2 trillion tokens of high-quality, multilingual data, and fine-tuned with millions of instructions and RLHF for better alignment.

How It Works

YAYI 2 is a Transformer-based LLM. Its training corpus includes a significant portion of Chinese data alongside other languages, processed through a rigorous pipeline of standardization, cleaning, deduplication, and toxicity filtering. The model uses a Byte-Pair Encoding (BPE) tokenizer trained on 500GB of multilingual data with a vocabulary of 81,920 tokens; digits are split into individual tokens, and common HTML identifiers were added to the vocabulary by hand to improve performance on numerical and markup-heavy text.
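A quick way to see the vocabulary size and digit splitting in practice is to load the tokenizer via transformers. This is a minimal sketch, not the project's own example; the Hugging Face repo id wenge-research/yayi2-30b is an assumption, so substitute a local checkpoint path if yours differs.

```python
# Sketch: inspect YAYI 2's BPE tokenizer.
# Assumption: the Hugging Face repo id is "wenge-research/yayi2-30b".
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("wenge-research/yayi2-30b", trust_remote_code=True)

print(tok.vocab_size)  # expected: 81920 per the summary above

# Digit splitting: a multi-digit number should come back as one token per digit.
print(tok.tokenize("Revenue grew by 12345 units"))

# Byte-level fallback: rare or unseen characters still tokenize without <unk>.
print(tok.tokenize("🦜 unusual glyphs"))
```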

Quick Start & Requirements

  • Inference: Clone the repository, create a conda environment (conda create --name yayi_inference_env python=3.8), activate it, and install dependencies (pip install transformers==4.33.1 torch==2.0.1 sentencepiece==0.1.99 accelerate==0.25.0). The provided inference example uses transformers and requires a CUDA-enabled GPU (e.g., A100/A800); see the sketch after this list.
  • Fine-tuning: Requires Python 3.10, deepspeed, transformers, accelerate, flash-attn, and triton. Full parameter fine-tuning is recommended on 16x A100 (80G) or higher. LoRA fine-tuning is also supported.
  • Resources: Inference can run on a single A100/A800. Full parameter training requires significant distributed hardware.
  • Documentation: README, Hugging Face Repo, Technical Report.
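With the environment above in place, a minimal inference sketch looks like the following. The repo id, prompt, and generation settings are assumptions rather than the project's verbatim example; defer to the official README snippet where they differ.

```python
# Minimal inference sketch (assumed repo id; not the project's verbatim example).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "wenge-research/yayi2-30b"  # assumption: official Hugging Face repo id
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,  # ~60 GB of weights in bf16, hence the A100/A800 note
    device_map="auto",           # accelerate places layers on the available GPUs
    trust_remote_code=True,
)

inputs = tokenizer("The winter in Beijing is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```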

Highlighted Details

  • Achieves strong performance across various benchmarks (C-Eval, MMLU, GAOKAO-Bench, GSM8K, HumanEval) compared to similarly sized open-source models.
  • Supports both full-parameter and LoRA fine-tuning via deepspeed; a rough LoRA illustration follows this list.
  • Tokenizer handles unknown characters due to its byte-level nature and includes specific optimizations for numerical and HTML data.
  • Models and data are available on the ModelScope (魔搭) community platform.
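The repository ships its own deepspeed-based scripts for both full-parameter and LoRA fine-tuning. The sketch below is not that script: it is a generic LoRA setup using the Hugging Face peft library over the same base model, shown only to illustrate what the adapter configuration involves. The repo id and the target_modules names are assumptions; check the model's actual layer names before use.

```python
# Generic LoRA setup with peft -- illustrative only, not the repo's deepspeed script.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "wenge-research/yayi2-30b",  # assumption: official Hugging Face repo id
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

lora_config = LoraConfig(
    r=8,                                  # adapter rank
    lora_alpha=32,                        # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumption: attention projection names
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only adapter weights train; the base stays frozen
```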

Maintenance & Community

  • Models and data uploaded to the ModelScope (魔搭) community.
  • Technical report published in December 2023.
  • Links to Hugging Face repo provided.

Licensing & Compatibility

  • Code licensed under Apache-2.0.
  • Model weights are released under a custom YAYI 2 community license; the data is licensed under CC BY-NC 4.0.
  • Commercial use requires explicit permission via a registration form and approval from wenge-research.

Limitations & Caveats

  • The Chat version has not been released; the README lists it as coming soon.
  • Commercial use is restricted and requires a separate licensing agreement and approval process.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 2 stars in the last 30 days
