DialogStudio by salesforce

Unified dataset for conversational AI research

created 2 years ago
512 stars

Top 61.9% on sourcepulse

Project Summary

DialogStudio is a unified collection of conversational AI datasets for researchers and developers building dialogue systems. It standardizes and catalogs many dialogue resources to simplify dataset access and support LLM training.

How It Works

DialogStudio converts a large number of conversational datasets into a common format while preserving their original information. The datasets are grouped by category and distributed through Hugging Face, with examples provided in the repository. The project also includes models fine-tuned on selected DialogStudio datasets and general tasks, which can serve as starting points for conversational AI applications.

Quick Start & Requirements

  • Load datasets using datasets.load_dataset('Salesforce/dialogstudio', '{dataset_name}').
  • Models can be loaded using Hugging Face's transformers library (e.g., Salesforce/dialogstudio-t5-base-v1.0).
  • Requires Python and the Hugging Face datasets and transformers libraries.
  • See the project's Hugging Face datasets and models pages for details.
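
The two loading steps above can be sketched as follows. This is a minimal sketch, not the project's official example: the helper names load_dialogstudio and load_model are hypothetical, and choosing AutoModelForSeq2SeqLM for the T5 checkpoint is an assumption based on T5 being a sequence-to-sequence model. The dataset name is left as a parameter because the source only gives the placeholder {dataset_name}.

```python
"""Sketch: load one DialogStudio dataset and one fine-tuned checkpoint."""

# Repository and model identifiers as given in the Quick Start list.
REPO_ID = "Salesforce/dialogstudio"
MODEL_ID = "Salesforce/dialogstudio-t5-base-v1.0"


def load_dialogstudio(dataset_name: str):
    """Load a single DialogStudio dataset by its catalog name (hypothetical helper)."""
    # Imported lazily so this module can be imported without the library installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, dataset_name)


def load_model(model_id: str = MODEL_ID):
    """Load the tokenizer and model for a DialogStudio checkpoint (hypothetical helper)."""
    # Assumption: the T5 checkpoints load through the generic seq2seq auto class.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
    return tokenizer, model
```

Calling load_dialogstudio with a name from the dataset catalog downloads that dataset from Hugging Face, and load_model() downloads the base checkpoint, so both need network access. Depending on the installed datasets version, script-based datasets may need extra options; the source does not say, so check the dataset page if loading fails.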

Highlighted Details

  • Unified collection of diverse dialogue datasets across categories like task-oriented, open-domain, and knowledge-grounded dialogues.
  • Includes version 1.0 T5 models (base, large, 3B) fine-tuned on DialogStudio datasets and general tasks.
  • Implements an evaluation framework using GPT-3.5-turbo to assess dialogue quality across six criteria.
  • Provides dataset examples and detailed statistics for each included dataset.
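
The LLM-based evaluation mentioned above could be organized roughly as below. This is an illustrative sketch only and not the project's evaluation code: the source does not list the six criteria or the rating scale, so the criterion names, the 1 to 5 scale, the prompt wording, and the helpers build_prompt and parse_scores are all assumptions. The call to GPT-3.5-turbo itself is left out; the sketch only builds the prompt and parses a reply.

```python
import re

# Assumption: placeholder criterion names; the source says only "six criteria".
CRITERIA = [
    "understanding",
    "relevance",
    "correctness",
    "coherence",
    "completeness",
    "overall",
]


def build_prompt(dialogue_turns, criteria=CRITERIA, scale=(1, 5)):
    """Format (speaker, text) turns and a rubric into one rating prompt."""
    transcript = "\n".join(f"{speaker}: {text}" for speaker, text in dialogue_turns)
    rubric = "\n".join(f"- {name}" for name in criteria)
    return (
        f"Rate the following dialogue on each criterion from {scale[0]} to {scale[1]}.\n"
        f"Criteria:\n{rubric}\n\n"
        f"Dialogue:\n{transcript}\n\n"
        "Answer with one line per criterion in the form 'criterion: score'."
    )


def parse_scores(reply, criteria=CRITERIA):
    """Extract 'criterion: score' lines from a model reply, ignoring other lines."""
    scores = {}
    for line in reply.splitlines():
        match = re.match(r"\s*([A-Za-z_ ]+?)\s*:\s*(\d+)\s*$", line)
        if match and match.group(1).lower() in criteria:
            scores[match.group(1).lower()] = int(match.group(2))
    return scores
```

The prompt returned by build_prompt would be sent to the chat model of choice, and parse_scores would be applied to the text it returns. Keeping prompt construction and parsing separate from the API call makes both steps testable offline.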

Maintenance & Community

  • Active development with recent updates in March 2024 (xLAM, dataset viewer) and August 2023 (v1.0 models).
  • Welcomes community contributions.
  • Paper accepted by EACL 2024 Findings.

Licensing & Compatibility

  • Codebase is under Apache License 2.0.
  • Modified datasets are primarily under Apache License 2.0, but some retain original licenses or cite original papers. Users must verify individual dataset licenses.
  • No explicit restrictions mentioned for commercial use, but original dataset licenses may apply.

Limitations & Caveats

The project notes that users are responsible for understanding and adhering to the original licenses of the included datasets, as DialogStudio does not assume responsibility for licensing issues.

Health Check

  • Last commit: 6 months ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 15 stars in the last 90 days
