ChatLM-mini-Chinese by charent

Small Chinese chat model (0.2B) for dialogue generation

Created 1 year ago · 1,570 stars · Top 27.2% on sourcepulse

View on GitHub

Project Summary

ChatLM-mini-Chinese is a 0.2B-parameter Chinese conversational language model trained from scratch, with a complete open pipeline covering data cleaning, tokenizer training, pre-training, SFT (supervised fine-tuning), and DPO (direct preference optimization). It targets users who need a lightweight, efficient Chinese LLM for research or deployment on resource-constrained hardware and who want to fine-tune it for their own downstream tasks.

How It Works

The model uses the T5 encoder-decoder architecture and frames dialogue as text-to-text generation, with a custom tokenizer trained on a large Chinese corpus. The full pipeline is released as open code, covering data processing, tokenizer training, pre-training, and the SFT and DPO optimization stages. A custom trainer gives flexible control over training, inference supports streaming chat output, and PEFT is supported for efficient downstream fine-tuning.
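
The snippet below is a minimal inference sketch of the text-to-text interface described above. It assumes the published weights load through the standard transformers seq2seq API under the Hugging Face Hub repo id charent/ChatLM-mini-Chinese; the prompt, device handling, and generation settings are illustrative and should be adjusted to match the project's official examples.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

repo_id = "charent/ChatLM-mini-Chinese"  # assumed Hub repo id; see the project pages
device = "cuda" if torch.cuda.is_available() else "cpu"

# trust_remote_code allows any custom model/tokenizer code shipped with the weights to load.
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained(repo_id, trust_remote_code=True).to(device)

prompt = "请介绍一下中国的长城。"  # "Please introduce the Great Wall of China."
inputs = tokenizer(prompt, return_tensors="pt").to(device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=True, top_p=0.9)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```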

Quick Start & Requirements

  • Install dependencies via pip install -r requirements.txt or conda install --file requirements.txt.
  • Model weights can be downloaded from Hugging Face Hub or ModelScope (see the download sketch after this list).
  • Requires Python 3.10+ and PyTorch with CUDA support for GPU acceleration.
  • Full training pipeline code is available for local execution.
  • Official quick-start and API examples are provided.
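
As a hedged sketch of the weight-download step in the list above, the snippet below fetches the checkpoint from the Hugging Face Hub with huggingface_hub; the repo id charent/ChatLM-mini-Chinese is assumed from the project pages, and the ModelScope SDK can be used instead if the ModelScope mirror is preferred.

```python
from huggingface_hub import snapshot_download

# Download the full model repository (weights, tokenizer, config) to a local folder.
local_dir = snapshot_download(
    repo_id="charent/ChatLM-mini-Chinese",  # assumed repo id; check the project README
    local_dir="./ChatLM-mini-Chinese",
)
print("Weights downloaded to:", local_dir)
```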

Highlighted Details

  • 0.2B parameter model, requiring minimal VRAM (512MB for inference, 4GB for pre-training with batch_size=1, fp16).
  • Open-sourced datasets, data cleaning processes (including mini_hash deduplication; see the sketch after this list), and tokenizer training.
  • A custom trainer provides flexible training control, including checkpointing and resuming at arbitrary positions.
  • Demonstrates fine-tuning for downstream tasks like triplet information extraction, retaining conversational abilities.
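
The mini_hash deduplication mentioned in the list above is part of the project's released data-cleaning code; the snippet below is a generic MinHash-LSH deduplication sketch built on the datasketch library, shown only to illustrate the technique, not the project's own implementation.

```python
from datasketch import MinHash, MinHashLSH

def minhash_of(text: str, num_perm: int = 128) -> MinHash:
    """Build a MinHash signature from character 3-grams of the text."""
    sig = MinHash(num_perm=num_perm)
    for i in range(max(len(text) - 2, 1)):
        sig.update(text[i:i + 3].encode("utf-8"))
    return sig

def deduplicate(docs: list[str], threshold: float = 0.7) -> list[str]:
    """Keep each document only if no near-duplicate has been kept before it."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = []
    for idx, doc in enumerate(docs):
        sig = minhash_of(doc)
        if not lsh.query(sig):  # no sufficiently similar document kept so far
            lsh.insert(str(idx), sig)
            kept.append(doc)
    return kept

corpus = [
    "今天天气很好，我们一起去公园散步吧。",
    "今天天气很好，我们一起去公园散步吧！",  # near-duplicate of the first entry
    "你最近在读什么书？",
]
print(deduplicate(corpus))
```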

Maintenance & Community

  • Project actively updated, with recent additions including ModelScope download support and deduplication techniques.
  • Community links are not explicitly provided in the README.

Licensing & Compatibility

  • The model weights are available for download, but the README does not explicitly state a license for the weights or code; suitability for commercial use or closed-source linking is therefore unclear.

Limitations & Caveats

The model's small size (0.2B parameters) and limited pre-training data (roughly 9M samples) can lead to occasional irrelevant responses or hallucinations. Its C-Eval scores are around baseline, suggesting it may not excel on complex evaluation benchmarks without further task-specific fine-tuning.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 64 stars in the last 90 days
