UltraChat by thunlp

Multi-round dialogue dataset and models for chat language model training

created 2 years ago
2,645 stars

Top 18.2% on sourcepulse

View on GitHub
Project Summary

UltraChat provides a large-scale, diverse, and informative multi-round dialogue dataset and associated models (UltraLM) for training conversational AI. It is designed for researchers and developers aiming to build powerful chat language models with general conversational capabilities, offering a significant resource for advancing open-source LLM development.

How It Works

The dataset is constructed with LLMs standing in for human participants: two separate Turbo APIs generate each dialogue, one prompted to play the user and pose queries, the other to respond as the assistant. Carefully designed meta-prompts steer generation across three sectors: "Questions about the World," "Writing and Creation," and "Assistance on Existent Materials," ensuring diversity in topics and complexity. The UltraLM models are trained on this data, with BMTrain used for acceleration.
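
As a rough sketch of this two-API construction loop, the snippet below has one model simulate the user while another answers as the assistant. The system prompts, model name, and turn count are illustrative stand-ins, not the project's actual meta-prompts.

```python
# Hedged sketch of UltraChat-style generation: two chat-API "players".
# Prompts and model name are illustrative, not the project's own meta-prompts.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

USER_SYS = ("You are a curious human user. Ask one natural follow-up "
            "question about the conversation so far. Reply with the question only.")
ASSISTANT_SYS = "You are a helpful, knowledgeable assistant."

def call(system: str, history: list[dict]) -> str:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "system", "content": system}] + history,
    )
    return resp.choices[0].message.content.strip()

def generate_dialogue(opening: str, turns: int = 3) -> list[str]:
    dialogue = [opening]  # seed query, e.g. drawn from a pool of topics
    for t in range(turns):
        # Even indices are user utterances, odd indices are assistant replies.
        history = [{"role": "user" if i % 2 == 0 else "assistant", "content": m}
                   for i, m in enumerate(dialogue)]
        dialogue.append(call(ASSISTANT_SYS, history))  # assistant answers
        if t < turns - 1:
            # Flip roles so the user-simulator model speaks as "assistant".
            flipped = [{"role": "assistant" if i % 2 == 0 else "user", "content": m}
                       for i, m in enumerate(dialogue)]
            dialogue.append(call(USER_SYS, flipped))   # user asks a follow-up
    return dialogue

print(generate_dialogue("Why is the sky blue?"))
```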

Quick Start & Requirements

  • Model Reconstruction: Download LLaMA-13B and the UltraLM delta weights, then run /UltraLM/recover.sh to obtain the final weights (the sketch after this list shows what recovery does under the hood).
  • Chatting: Replace the model path in /UltraLM/chat_cli.sh with the path to your recovered model.
  • Training: Use the provided scripts in ./src/ (e.g., train_bm.py for LLaMA with BMTrain, train.py for GPT-J with OpenPrompt). Requires the data to be downloaded to ./data first.
  • Dependencies: LLaMA-13B base model, BMTrain, PyTorch, Hugging Face Transformers. GPU recommended for training.
  • Data Explorer: Nomic AI Atlas Explorer
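
For intuition about the reconstruction step, here is a minimal sketch of what delta-weight recovery amounts to. /UltraLM/recover.sh is the supported route; the paths below are placeholders, and the sketch assumes the delta shares parameter names and shapes with the base model.

```python
# Illustrative delta-weight recovery: recovered = base + delta, per parameter.
# The repo's /UltraLM/recover.sh is the supported route; paths are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "/path/to/llama-13b"        # original LLaMA-13B in HF format
DELTA = "/path/to/ultralm-delta"   # released UltraLM delta weights
OUT = "/path/to/ultralm-13b"       # where the recovered model is written

base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.float16)
delta = AutoModelForCausalLM.from_pretrained(DELTA, torch_dtype=torch.float16)

base_sd = base.state_dict()
delta_sd = delta.state_dict()
for name in base_sd:
    # Assumes matching names/shapes; the in-place add mutates the base model.
    base_sd[name] += delta_sd[name]

base.save_pretrained(OUT)
AutoTokenizer.from_pretrained(DELTA).save_pretrained(OUT)
```

After recovery, point /UltraLM/chat_cli.sh at the output directory, or load it directly with Transformers for a quick smoke test.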

Highlighted Details

  • UltraLM-13B ranks #1 among open-source models on the AlpacaEval Leaderboard.
  • Paired with best-of-n reranking by the UltraRM reward model, UltraLM models achieve high win rates against text-davinci-003 on AlpacaEval.
  • Dataset construction avoids direct use of internet data or benchmark data to prevent contamination.
  • Includes 1.57 million dialogues across diverse topics and tasks.
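
To poke at those dialogues, a minimal loading sketch is below. The Hugging Face dataset id "stingning/ultrachat" and the alternating user/assistant layout of the "data" field are assumptions about the public release; adapt accordingly if you work from the raw files in ./data.

```python
# Hedged sketch: inspect UltraChat dialogues via the `datasets` library.
# The dataset id and the "data" field layout are assumptions about the
# public release; adjust if your copy is the raw download in ./data.
from datasets import load_dataset

ds = load_dataset("stingning/ultrachat", split="train")
example = ds[0]
# Each record is one multi-turn dialogue; utterances alternate
# user / assistant, starting with the user.
for i, turn in enumerate(example["data"][:4]):
    speaker = "User" if i % 2 == 0 else "Assistant"
    print(f"{speaker}: {turn[:80]}")
```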

Maintenance & Community

  • Active development with releases of new models and datasets (e.g., UltraFeedback, UltraLM-13B-v2.0, UltraRM, UltraCM).
  • Maintained by the THUNLP group at Tsinghua University.

Licensing & Compatibility

  • Dataset distributed under the MIT License.
  • Model weights are typically released separately and may have different licensing terms (e.g., based on LLaMA).

Limitations & Caveats

  • Models may exhibit hallucinations.
  • Reasoning, math, and coding abilities require explicit enhancement.
Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 90 stars in the last 90 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), George Hotz (Author of tinygrad; Founder of the tiny corp, comma.ai), and 10 more.

TinyLlama by jzhang38

Tiny pretraining project for a 1.1B Llama model

9k stars · Top 0.3% on sourcepulse · created 1 year ago · updated 1 year ago