ChatLM-mini-Chinese by charent

Small Chinese chat model (0.2B) for dialogue generation

Created 1 year ago · 1,570 stars · Top 27.2% on sourcepulse

View on GitHub

Project Summary

ChatLM-mini-Chinese is a 0.2B-parameter Chinese conversational language model trained from scratch, with a complete open pipeline covering data cleaning, tokenizer training, pre-training, SFT (supervised fine-tuning), and DPO (direct preference optimization). It targets users who need a lightweight, efficient Chinese LLM for research or deployment on resource-constrained hardware and who want to fine-tune it for their own downstream tasks.

How It Works

The model uses the T5 encoder-decoder architecture and frames dialogue as text-to-text generation, with a custom tokenizer trained on a large Chinese corpus. The full pipeline is released as open code, covering data processing, tokenizer training, pre-training, and the SFT and DPO optimization stages. A custom trainer gives flexible control over training, inference supports streaming chat output, and PEFT is supported for efficient downstream fine-tuning.
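
The snippet below is a minimal inference sketch of the text-to-text interface described above. It assumes the published weights load through the standard transformers seq2seq API under the Hugging Face Hub repo id charent/ChatLM-mini-Chinese; the prompt, device handling, and generation settings are illustrative and should be adjusted to match the project's official examples.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

repo_id = "charent/ChatLM-mini-Chinese"  # assumed Hub repo id; see the project pages
device = "cuda" if torch.cuda.is_available() else "cpu"

# trust_remote_code allows any custom model/tokenizer code shipped with the weights to load.
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained(repo_id, trust_remote_code=True).to(device)

prompt = "请介绍一下中国的长城。"  # "Please introduce the Great Wall of China."
inputs = tokenizer(prompt, return_tensors="pt").to(device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=True, top_p=0.9)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```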

Quick Start & Requirements

  • Install dependencies via pip install -r requirements.txt or conda install --file requirements.txt.
  • Model weights can be downloaded from Hugging Face Hub or ModelScope (see the download sketch after this list).
  • Requires Python 3.10+ and PyTorch with CUDA support for GPU acceleration.
  • Full training pipeline code is available for local execution.
  • Official quick-start and API examples are provided.
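
As a hedged sketch of the weight-download step in the list above, the snippet below fetches the checkpoint from the Hugging Face Hub with huggingface_hub; the repo id charent/ChatLM-mini-Chinese is assumed from the project pages, and the ModelScope SDK can be used instead if the ModelScope mirror is preferred.

```python
from huggingface_hub import snapshot_download

# Download the full model repository (weights, tokenizer, config) to a local folder.
local_dir = snapshot_download(
    repo_id="charent/ChatLM-mini-Chinese",  # assumed repo id; check the project README
    local_dir="./ChatLM-mini-Chinese",
)
print("Weights downloaded to:", local_dir)
```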

Highlighted Details

  • 0.2B parameter model, requiring minimal VRAM (512MB for inference, 4GB for pre-training with batch_size=1, fp16).
  • Open-sourced datasets, data cleaning processes (including mini_hash deduplication; see the sketch after this list), and tokenizer training.
  • A custom trainer provides flexible training control, including checkpointing and resuming at arbitrary positions.
  • Demonstrates fine-tuning for downstream tasks like triplet information extraction, retaining conversational abilities.
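
The mini_hash deduplication mentioned in the list above is part of the project's released data-cleaning code; the snippet below is a generic MinHash-LSH deduplication sketch built on the datasketch library, shown only to illustrate the technique, not the project's own implementation.

```python
from datasketch import MinHash, MinHashLSH

def minhash_of(text: str, num_perm: int = 128) -> MinHash:
    """Build a MinHash signature from character 3-grams of the text."""
    sig = MinHash(num_perm=num_perm)
    for i in range(max(len(text) - 2, 1)):
        sig.update(text[i:i + 3].encode("utf-8"))
    return sig

def deduplicate(docs: list[str], threshold: float = 0.7) -> list[str]:
    """Keep each document only if no near-duplicate has been kept before it."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = []
    for idx, doc in enumerate(docs):
        sig = minhash_of(doc)
        if not lsh.query(sig):  # no sufficiently similar document kept so far
            lsh.insert(str(idx), sig)
            kept.append(doc)
    return kept

corpus = [
    "今天天气很好，我们一起去公园散步吧。",
    "今天天气很好，我们一起去公园散步吧！",  # near-duplicate of the first entry
    "你最近在读什么书？",
]
print(deduplicate(corpus))
```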

Maintenance & Community

  • Project actively updated, with recent additions including ModelScope download support and deduplication techniques.
  • Community links are not explicitly provided in the README.

Licensing & Compatibility

  • The model weights are available for download, but the README does not explicitly state a license for the weights or code; suitability for commercial use or closed-source linking is therefore unclear.

Limitations & Caveats

The model's small size (0.2B parameters) and limited pre-training data (roughly 9M samples) can lead to occasional irrelevant responses or hallucinations. Its C-Eval scores are around baseline, suggesting it may not excel on complex evaluation benchmarks without further task-specific fine-tuning.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 64 stars in the last 90 days
