YAYI2 by wenge-research

30B-parameter Chinese LLM for research, available in Base and Chat versions

created 1 year ago
3,423 stars

Top 14.4% on sourcepulse

View on GitHub
1 Expert Loves This Project
Project Summary

YAYI 2 is a 30B parameter multilingual large language model developed by wenge-research, designed to advance the Chinese LLM ecosystem. It offers Base and Chat versions, trained on over 2 trillion tokens of high-quality, multilingual data, and fine-tuned with millions of instructions and RLHF for better alignment.

How It Works

YAYI 2 is a Transformer-based LLM. Its training corpus includes a significant portion of Chinese data alongside other languages, processed through a rigorous data pipeline involving standardization, cleaning, deduplication, and toxicity filtering. The model uses a Byte-Pair Encoding (BPE) tokenizer trained on 500GB of multilingual data with a vocabulary size of 81,920; digits are split into individual tokens, and HTML identifiers were manually added to the vocabulary to improve performance on numerical and web data.
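As a quick illustration of the tokenizer behavior described above, the snippet below loads it via transformers and tokenizes a mixed Chinese/English string containing digits. The Hugging Face model id wenge-research/yayi2-30b and the trust_remote_code flag are assumptions based on common practice, not details taken from this summary.

```python
from transformers import AutoTokenizer

# Assumed Hugging Face model id; adjust if the actual repo name differs.
tokenizer = AutoTokenizer.from_pretrained(
    "wenge-research/yayi2-30b", trust_remote_code=True
)

# The technical report describes a vocabulary of 81,920 tokens.
print(tokenizer.vocab_size)

# Digit splitting means a number like "2023" is tokenized into individual
# digits rather than one opaque piece, which helps with arithmetic-style data.
print(tokenizer.tokenize("雅意大模型发布于2023年12月。"))
```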

Quick Start & Requirements

  • Inference: Clone the repository, create a conda environment (conda create --name yayi_inference_env python=3.8), activate it, and install dependencies (pip install transformers==4.33.1 torch==2.0.1 sentencepiece==0.1.99 accelerate==0.25.0). The provided inference example uses transformers and requires a CUDA-enabled GPU (e.g., A100/A800); see the inference sketch after this list.
  • Fine-tuning: Requires Python 3.10, deepspeed, transformers, accelerate, flash-attn, and triton. Full-parameter fine-tuning is recommended on 16x A100 (80G) or higher. LoRA fine-tuning is also supported (a LoRA sketch follows the list as well).
  • Resources: Inference can run on a single A100/A800. Full parameter training requires significant distributed hardware.
  • Documentation: README, Hugging Face Repo, Technical Report.
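A minimal inference sketch following the environment above. The model id wenge-research/yayi2-30b, the bfloat16 dtype, and the generation settings are assumptions and may differ from the repository's own example.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "wenge-research/yayi2-30b"  # assumed Hugging Face id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,   # assumes an A100/A800-class GPU
    device_map="auto",            # spreads the 30B weights across available GPUs
    trust_remote_code=True,
)
model.eval()

prompt = "The winter in Beijing is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=True,
        temperature=0.8,
        top_p=0.95,
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```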
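For LoRA fine-tuning, the repository ships its own deepspeed-based scripts; the sketch below instead illustrates the general idea with the peft library, which is my assumption rather than the repo's documented path. The target module names and hyperparameters are illustrative only.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Assumed model id and LoRA hyperparameters; the repo's own scripts may differ.
model = AutoModelForCausalLM.from_pretrained(
    "wenge-research/yayi2-30b", trust_remote_code=True
)

lora_config = LoraConfig(
    r=16,                       # rank of the low-rank update matrices
    lora_alpha=32,              # scaling factor applied to the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # illustrative attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```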

Highlighted Details

  • Achieves strong performance across various benchmarks (C-Eval, MMLU, GAOKAO-Bench, GSM8K, HumanEval) compared to similarly sized open-source models.
  • Supports both full parameter and LoRA fine-tuning with deepspeed.
  • Tokenizer handles unknown characters due to its byte-level nature and includes specific optimizations for numerical and HTML data.
  • Models and data are available on the Modao community platform.

Maintenance & Community

  • Models and data uploaded to Modao community.
  • Technical report published in December 2023.
  • Links to Hugging Face repo provided.

Licensing & Compatibility

  • Code licensed under Apache-2.0.
  • Model and Data licensed under CC BY-NC 4.0 and a custom YAYI license, respectively.
  • Commercial use requires explicit permission via a registration form and approval from wenge-research.

Limitations & Caveats

  • The Chat model weights are still marked as "coming soon" in the repository.
  • Commercial use is restricted and requires a separate licensing agreement and approval process.
Health Check
Last commit

1 year ago

Responsiveness

1+ week

Pull Requests (30d)
0
Issues (30d)
0
Star History
4 stars in the last 90 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), George Hotz (Author of tinygrad; founder of the tiny corp, comma.ai), and 10 more.

TinyLlama by jzhang38

0.3%
9k
Tiny pretraining project for a 1.1B Llama model
created 1 year ago
updated 1 year ago
Starred by George Hotz (Author of tinygrad; founder of the tiny corp, comma.ai), Calvin French-Owen (Co-founder of Segment), and 12 more.

StableLM by Stability-AI

0.0%
16k
Language models by Stability AI
created 2 years ago
updated 1 year ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Ying Sheng (Author of SGLang), and 9 more.

alpaca-lora by tloen

0.0%
19k
LoRA fine-tuning for LLaMA
created 2 years ago
updated 1 year ago