DialogStudio by Salesforce

Unified dataset for conversational AI research

Created 2 years ago
516 stars

Top 60.8% on SourcePulse

Project Summary

DialogStudio is a unified collection of conversational AI datasets for researchers and developers building dialogue systems. It standardizes and catalogs many dialogue resources to simplify dataset access and support LLM training.

How It Works

DialogStudio unifies and standardizes a vast array of conversational datasets, preserving original information while enabling easier access and research. Datasets are categorized and available via Hugging Face, with examples provided in the repository. The project also includes models fine-tuned on selected DialogStudio datasets and general tasks, offering pre-trained capabilities for conversational AI applications.

Quick Start & Requirements

  • Load a dataset with datasets.load_dataset('Salesforce/dialogstudio', '{dataset_name}'), replacing {dataset_name} with the name of the subset you want.
  • Load models with Hugging Face's transformers library (e.g., Salesforce/dialogstudio-t5-base-v1.0).
  • Requires Python and the Hugging Face datasets and transformers libraries.
  • See the project's Hugging Face datasets and models pages for details.
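
The two loading steps above can be sketched as follows. This is a minimal illustration, not the project's own code: the helper names (load_dialogstudio, load_dialogstudio_model, generate_reply) are hypothetical, and the subset name MULTIWOZ2_2 in the usage comment is an assumed example of a valid {dataset_name}. Depending on the installed datasets version, a repository that ships a loading script may need extra options or an older release.

```python
# Minimal sketch, assuming the `datasets` and `transformers` packages
# (plus a PyTorch backend) are installed. Helper names are hypothetical.

DATASET_REPO = "Salesforce/dialogstudio"
MODEL_ID = "Salesforce/dialogstudio-t5-base-v1.0"


def load_dialogstudio(dataset_name: str):
    """Load one DialogStudio subset from the Hugging Face Hub."""
    # Imported lazily so this module can be imported without the dependency.
    from datasets import load_dataset

    return load_dataset(DATASET_REPO, dataset_name)


def load_dialogstudio_model(model_id: str = MODEL_ID):
    """Load the tokenizer and the fine-tuned T5 (seq2seq) model."""
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
    return tokenizer, model


def generate_reply(tokenizer, model, prompt: str, max_new_tokens: int = 64) -> str:
    """Generate a single response for a plain-text prompt."""
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example usage (downloads data and weights, so it is left commented out):
#   dataset = load_dialogstudio("MULTIWOZ2_2")  # assumed subset name
#   tokenizer, model = load_dialogstudio_model()
#   print(generate_reply(tokenizer, model, "Hello, can you help me book a hotel?"))
```

The imports sit inside the functions so that nothing is downloaded until a helper is actually called. The prompt format expected by the fine-tuned models is not described here, so the plain-text prompt in the usage comment is only a placeholder.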

Highlighted Details

  • Unified collection of diverse dialogue datasets across categories like task-oriented, open-domain, and knowledge-grounded dialogues.
  • Includes version 1.0 T5 models (base, large, 3B) fine-tuned on DialogStudio datasets and general tasks.
  • Implements an evaluation framework using GPT-3.5-turbo to assess dialogue quality across six criteria.
  • Provides dataset examples and detailed statistics for each included dataset.

Maintenance & Community

  • Most recent announced updates: March 2024 (xLAM, dataset viewer) and August 2023 (v1.0 models).
  • Welcomes community contributions.
  • Paper accepted by EACL 2024 Findings.

Licensing & Compatibility

  • Codebase is under Apache License 2.0.
  • Modified datasets are primarily under Apache License 2.0, but some retain original licenses or cite original papers. Users must verify individual dataset licenses.
  • The project states no explicit restrictions on commercial use, but the original dataset licenses may impose them.

Limitations & Caveats

The project notes that users are responsible for understanding and adhering to the original licenses of the included datasets, as DialogStudio does not assume responsibility for licensing issues.

Health Check

  • Last commit: 7 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0

Star History
1 star in the last 30 days
