BurstGPT by HPMLL

Dataset for LLM serving optimization

Created 2 years ago

261 stars

Top 97.2% on SourcePulse

Project Summary

HPMLL/BurstGPT offers a real-world workload trace dataset for LLM serving systems, specifically capturing interactions with ChatGPT (GPT-3.5) and GPT-4. This resource benefits researchers and engineers by providing realistic usage patterns to optimize the performance and efficiency of LLM inference infrastructure.

How It Works

The project releases detailed CSV traces collected over 110-121 consecutive days, encompassing millions of requests. Each record captures fields such as the timestamp, session ID, elapsed response time, model type (GPT-3.5/GPT-4), token counts, and log type (conversation/API). With these fields, users can model and simulate diverse LLM serving workloads, then evaluate and improve system throughput and latency.
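To make the trace schema concrete, here is a minimal sketch of parsing such a CSV and computing request inter-arrival times, a common first step when modeling a serving workload. The header names (`Timestamp`, `Session ID`, `Elapsed time`, `Model`, `Request tokens`, `Response tokens`, `Log Type`) and the assumption that timestamps are in seconds are illustrative guesses based on the fields listed above; check them against the actual files before use. The sample rows are made up.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class TraceRow:
    timestamp: float        # request arrival time, assumed to be in seconds
    session_id: str         # may be empty for some rows (assumption)
    elapsed: float          # elapsed response time
    model: str
    request_tokens: int
    response_tokens: int
    log_type: str


def read_trace(fileobj):
    """Parse a trace CSV with the assumed header names into TraceRow objects."""
    rows = []
    for rec in csv.DictReader(fileobj):
        rows.append(TraceRow(
            timestamp=float(rec["Timestamp"]),
            session_id=rec["Session ID"],
            elapsed=float(rec["Elapsed time"]),
            model=rec["Model"],
            request_tokens=int(rec["Request tokens"]),
            response_tokens=int(rec["Response tokens"]),
            log_type=rec["Log Type"],
        ))
    return rows


def interarrival_times(rows):
    """Return gaps between consecutive request arrivals, in timestamp units."""
    ts = sorted(r.timestamp for r in rows)
    return [later - earlier for earlier, later in zip(ts, ts[1:])]


# Made-up sample rows for illustration; not real BurstGPT data.
SAMPLE = """Timestamp,Session ID,Elapsed time,Model,Request tokens,Response tokens,Log Type
5,s1,2.0,ChatGPT,100,50,Conversation log
15,s1,3.5,ChatGPT,200,0,Conversation log
45,,1.0,GPT-4,30,10,API log
"""

rows = read_trace(io.StringIO(SAMPLE))
gaps = interarrival_times(rows)
print(len(rows), gaps)
```

For a real file, pass `open(path, newline="")` instead of the in-memory sample.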

Quick Start & Requirements

The dataset is available in CSV format across multiple files, including versions with and without failed requests (zero response tokens). A simple request generator demo is provided in the example/ directory. No specific software installation is detailed, as the primary artifact is the data itself. Users will need standard tools for CSV processing.
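As a rough illustration of working with the raw files, the sketch below drops failed requests (zero response tokens) and turns the remaining rows into a replay schedule of send offsets. It is not the demo from the example/ directory; the function names `drop_failed` and `replay_schedule`, the `speedup` parameter, and the column names are hypothetical, and the sample rows are made up.

```python
import csv
import io


def drop_failed(records, response_col="Response tokens"):
    """Keep only requests that produced at least one response token."""
    return [r for r in records if int(r[response_col]) > 0]


def replay_schedule(records, speedup=1.0, time_col="Timestamp"):
    """Return (offset, record) pairs, with offsets measured from the first request.

    A speedup of 2.0 compresses the trace to half its original duration.
    """
    ordered = sorted(records, key=lambda r: float(r[time_col]))
    if not ordered:
        return []
    start = float(ordered[0][time_col])
    return [((float(r[time_col]) - start) / speedup, r) for r in ordered]


# Made-up sample rows for illustration; column names are assumptions.
SAMPLE_CSV = """Timestamp,Model,Request tokens,Response tokens
0,ChatGPT,100,50
10,ChatGPT,200,0
30,GPT-4,30,10
40,GPT-4,80,20
"""

records = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
ok = drop_failed(records)
schedule = replay_schedule(ok, speedup=2.0)
offsets = [offset for offset, _ in schedule]
print(len(ok), offsets)
```

A load generator could sleep until each offset and then issue a request sized by that row's token counts.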

Highlighted Details

Dataset spans 110-121 days, comprising ~5.34M lines and ~220MB per major release.
Includes distinct traces for ChatGPT (GPT-3.5) and GPT-4.
Schema features Session ID and Elapsed time for detailed conversational and response time analysis.
Offers both raw traces and filtered versions excluding requests with zero response tokens.
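The Session ID and Elapsed time fields support analyses such as the one sketched below: counting turns per conversation and averaging response time per model. The helper names and dictionary keys are assumptions for illustration, rows with an empty session ID are skipped when counting turns, and the sample records are made up.

```python
from collections import defaultdict


def turns_per_session(records):
    """Count requests per session, ignoring rows without a session ID."""
    counts = defaultdict(int)
    for r in records:
        sid = r["Session ID"]
        if sid:
            counts[sid] += 1
    return dict(counts)


def mean_elapsed_by_model(records):
    """Average the elapsed response time separately for each model."""
    totals = defaultdict(lambda: [0.0, 0])  # model -> [sum, count]
    for r in records:
        acc = totals[r["Model"]]
        acc[0] += float(r["Elapsed time"])
        acc[1] += 1
    return {model: total / n for model, (total, n) in totals.items()}


# Made-up records for illustration; keys mirror the assumed CSV headers.
RECORDS = [
    {"Session ID": "a", "Model": "ChatGPT", "Elapsed time": "2.0"},
    {"Session ID": "a", "Model": "ChatGPT", "Elapsed time": "4.0"},
    {"Session ID": "b", "Model": "GPT-4", "Elapsed time": "10.0"},
    {"Session ID": "", "Model": "GPT-4", "Elapsed time": "6.0"},
]

turns = turns_per_session(RECORDS)
latency = mean_elapsed_by_model(RECORDS)
print(turns, latency)
```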

Maintenance & Community

Users can report issues or ask questions via a provided mailing list. No other community channels or explicit contributor information are detailed in the README.

Licensing & Compatibility

The README does not specify a software license or data usage terms. This lack of explicit licensing information may pose compatibility concerns for commercial use or integration into proprietary systems.

Limitations & Caveats

The README does not detail specific limitations of the dataset or its intended use. It focuses on its utility for optimizing LLM serving systems.

Health Check

Last Commit

2 months ago

Responsiveness

Inactive

Star History

6 stars in the last 30 days