UltraChat by thunlp

Multi-round dialogue dataset and models for chat language model training

created 2 years ago
2,645 stars

Top 18.2% on sourcepulse

View on GitHub
Project Summary

UltraChat provides a large-scale, diverse, and informative multi-round dialogue dataset and associated models (UltraLM) for training conversational AI. It is designed for researchers and developers aiming to build powerful chat language models with general conversational capabilities, offering a significant resource for advancing open-source LLM development.

How It Works

The dataset is constructed with LLMs standing in for human participants: two separate Turbo APIs generate each dialogue, one prompted to play the user and pose queries, the other to respond as the assistant. Carefully designed meta-prompts steer generation across three sectors: "Questions about the World," "Writing and Creation," and "Assistance on Existent Materials," ensuring diversity in topics and complexity. The UltraLM models are trained on this data, with BMTrain used for acceleration.
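
As a rough sketch of this two-API construction loop, the snippet below has one model simulate the user while another answers as the assistant. The system prompts, model name, and turn count are illustrative stand-ins, not the project's actual meta-prompts.

```python
# Hedged sketch of UltraChat-style generation: two chat-API "players".
# Prompts and model name are illustrative, not the project's own meta-prompts.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

USER_SYS = ("You are a curious human user. Ask one natural follow-up "
            "question about the conversation so far. Reply with the question only.")
ASSISTANT_SYS = "You are a helpful, knowledgeable assistant."

def call(system: str, history: list[dict]) -> str:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "system", "content": system}] + history,
    )
    return resp.choices[0].message.content.strip()

def generate_dialogue(opening: str, turns: int = 3) -> list[str]:
    dialogue = [opening]  # seed query, e.g. drawn from a pool of topics
    for t in range(turns):
        # Even indices are user utterances, odd indices are assistant replies.
        history = [{"role": "user" if i % 2 == 0 else "assistant", "content": m}
                   for i, m in enumerate(dialogue)]
        dialogue.append(call(ASSISTANT_SYS, history))  # assistant answers
        if t < turns - 1:
            # Flip roles so the user-simulator model speaks as "assistant".
            flipped = [{"role": "assistant" if i % 2 == 0 else "user", "content": m}
                       for i, m in enumerate(dialogue)]
            dialogue.append(call(USER_SYS, flipped))   # user asks a follow-up
    return dialogue

print(generate_dialogue("Why is the sky blue?"))
```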

Quick Start & Requirements

  • Model Reconstruction: Download LLaMA-13B and the UltraLM delta weights, then run /UltraLM/recover.sh to obtain the final weights (the sketch after this list shows what recovery does under the hood).
  • Chatting: Replace the model path in /UltraLM/chat_cli.sh with the path to your recovered model.
  • Training: Use the provided scripts in ./src/ (e.g., train_bm.py for LLaMA with BMTrain, train.py for GPT-J with OpenPrompt). Requires the data to be downloaded to ./data first.
  • Dependencies: LLaMA-13B base model, BMTrain, PyTorch, Hugging Face Transformers. GPU recommended for training.
  • Data Explorer: Nomic AI Atlas Explorer
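
For intuition about the reconstruction step, here is a minimal sketch of what delta-weight recovery amounts to. /UltraLM/recover.sh is the supported route; the paths below are placeholders, and the sketch assumes the delta shares parameter names and shapes with the base model.

```python
# Illustrative delta-weight recovery: recovered = base + delta, per parameter.
# The repo's /UltraLM/recover.sh is the supported route; paths are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "/path/to/llama-13b"        # original LLaMA-13B in HF format
DELTA = "/path/to/ultralm-delta"   # released UltraLM delta weights
OUT = "/path/to/ultralm-13b"       # where the recovered model is written

base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.float16)
delta = AutoModelForCausalLM.from_pretrained(DELTA, torch_dtype=torch.float16)

base_sd = base.state_dict()
delta_sd = delta.state_dict()
for name in base_sd:
    # Assumes matching names/shapes; the in-place add mutates the base model.
    base_sd[name] += delta_sd[name]

base.save_pretrained(OUT)
AutoTokenizer.from_pretrained(DELTA).save_pretrained(OUT)
```

After recovery, point /UltraLM/chat_cli.sh at the output directory, or load it directly with Transformers for a quick smoke test.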

Highlighted Details

  • UltraLM-13B ranks #1 among open-source models on the AlpacaEval Leaderboard.
  • Paired with best-of-n reranking by the UltraRM reward model, UltraLM models achieve high win rates against text-davinci-003 on AlpacaEval.
  • Dataset construction avoids direct use of internet data or benchmark data to prevent contamination.
  • Includes 1.57 million dialogues across diverse topics and tasks.
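
To poke at those dialogues, a minimal loading sketch is below. The Hugging Face dataset id "stingning/ultrachat" and the alternating user/assistant layout of the "data" field are assumptions about the public release; adapt accordingly if you work from the raw files in ./data.

```python
# Hedged sketch: inspect UltraChat dialogues via the `datasets` library.
# The dataset id and the "data" field layout are assumptions about the
# public release; adjust if your copy is the raw download in ./data.
from datasets import load_dataset

ds = load_dataset("stingning/ultrachat", split="train")
example = ds[0]
# Each record is one multi-turn dialogue; utterances alternate
# user / assistant, starting with the user.
for i, turn in enumerate(example["data"][:4]):
    speaker = "User" if i % 2 == 0 else "Assistant"
    print(f"{speaker}: {turn[:80]}")
```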

Maintenance & Community

  • Active development with releases of new models and datasets (e.g., UltraFeedback, UltraLM-13B-v2.0, UltraRM, UltraCM).
  • Maintained by the THUNLP group at Tsinghua University.

Licensing & Compatibility

  • Dataset distributed under the MIT License.
  • Model weights are typically released separately and may have different licensing terms (e.g., based on LLaMA).

Limitations & Caveats

  • Models may exhibit hallucinations.
  • Reasoning, math, and coding abilities require explicit enhancement.
Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 90 stars in the last 90 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), George Hotz (Author of tinygrad; Founder of the tiny corp, comma.ai), and 10 more.

TinyLlama by jzhang38

Tiny pretraining project for a 1.1B Llama model

9k stars · Top 0.3% on sourcepulse · created 1 year ago · updated 1 year ago