BurstGPT by HPMLL

Dataset for LLM serving optimization

Created 2 years ago
252 stars

Top 99.6% on SourcePulse

Project Summary

HPMLL/BurstGPT offers a real-world workload trace dataset for LLM serving systems, specifically capturing interactions with ChatGPT (GPT-3.5) and GPT-4. This resource benefits researchers and engineers by providing realistic usage patterns to optimize the performance and efficiency of LLM inference infrastructure.

How It Works

The project releases detailed CSV traces collected over 110 to 121 consecutive days, depending on the release, with each major release containing millions of requests. Each record captures a timestamp, session ID, elapsed response time, model type (GPT-3.5 or GPT-4), token counts, and log type (conversation or API). This data supports modeling and simulating diverse LLM serving workloads, so that system throughput and latency can be evaluated and improved against realistic traffic.
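As a rough illustration of how such a trace might be summarized, the sketch below parses a few rows with Python's standard csv module and computes the request count and mean inter-arrival time per model. The column names (Timestamp, Model, Request tokens, Response tokens), the model labels, and the sample rows are assumptions made for illustration; check the header of the actual CSV files before reusing it.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows; column names and values are assumed for illustration.
SAMPLE = """Timestamp,Model,Request tokens,Response tokens
0,ChatGPT,100,50
5,GPT-4,200,0
20,ChatGPT,40,60
30,ChatGPT,10,20
"""


def summarize(csv_text):
    """Return {model: (request_count, mean_inter_arrival_seconds or None)}."""
    arrivals = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        arrivals[row["Model"]].append(float(row["Timestamp"]))

    summary = {}
    for model, times in arrivals.items():
        times.sort()
        # Gaps between consecutive arrivals for this model.
        gaps = [later - earlier for earlier, later in zip(times, times[1:])]
        mean_gap = sum(gaps) / len(gaps) if gaps else None
        summary[model] = (len(times), mean_gap)
    return summary


result = summarize(SAMPLE)
for model, (count, mean_gap) in sorted(result.items()):
    print(model, count, mean_gap)
```

A model with a single request has no inter-arrival gap, so the sketch reports None rather than zero for it.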

Quick Start & Requirements

The dataset is distributed as multiple CSV files, including versions with and without failed requests (those with zero response tokens). A simple request generator demo is provided in the example/ directory. The README describes no software installation, since the primary artifact is the data itself; standard CSV-processing tools are sufficient.
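The following is a minimal sketch of what a trace-driven request schedule could look like. It is not the demo shipped in example/, whose implementation is not described here. The function name replay_schedule, the speedup parameter, the column names, and the sample rows are all hypothetical.

```python
import csv
import io

# Hypothetical sample rows; column names are assumed, not confirmed by the README summary.
SAMPLE = """Timestamp,Request tokens,Response tokens
10,100,50
12,200,80
20,40,60
"""


def replay_schedule(csv_text, speedup=1.0):
    """Yield (delay_seconds, request_tokens, response_tokens) for each request.

    delay_seconds is the wait before sending a request, measured from the
    previous request and divided by `speedup` to compress the timeline.
    """
    rows = sorted(
        csv.DictReader(io.StringIO(csv_text)),
        key=lambda row: float(row["Timestamp"]),
    )
    previous = None
    for row in rows:
        now = float(row["Timestamp"])
        delay = 0.0 if previous is None else (now - previous) / speedup
        previous = now
        yield delay, int(row["Request tokens"]), int(row["Response tokens"])


schedule = list(replay_schedule(SAMPLE, speedup=2.0))
for delay, request_tokens, response_tokens in schedule:
    # A real load generator would sleep for `delay` and then issue a request
    # sized by the token counts; here the schedule is only printed.
    print(delay, request_tokens, response_tokens)
```

Keeping the schedule as a generator of delays, rather than sleeping inside the function, makes it possible to inspect or rescale the timeline before driving a real serving system with it.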

Highlighted Details

  • Dataset spans 110-121 days, comprising ~5.34M lines and ~220MB per major release.
  • Includes distinct traces for ChatGPT (GPT-3.5) and GPT-4.
  • Schema features Session ID and Elapsed time for detailed conversational and response time analysis.
  • Offers both raw traces and filtered versions excluding requests with zero response tokens.
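The filtered releases already exclude failed requests, but the same filter can be approximated on a raw trace. The sketch below assumes a column named Response tokens and uses made-up sample rows; the function name drop_failed is hypothetical.

```python
import csv
import io

# Hypothetical sample rows; the "Response tokens" column name is an assumption.
SAMPLE = """Timestamp,Model,Request tokens,Response tokens
0,ChatGPT,100,50
5,GPT-4,200,0
20,ChatGPT,40,60
"""


def drop_failed(csv_text):
    """Return (filtered_csv_text, kept_row_count), dropping zero-response rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    kept = 0
    for row in reader:
        # A request with zero response tokens is treated as failed.
        if int(row["Response tokens"]) > 0:
            writer.writerow(row)
            kept += 1
    return out.getvalue(), kept


filtered_text, kept = drop_failed(SAMPLE)
print(filtered_text, end="")
```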

Maintenance & Community

Users can report issues or ask questions via a provided mailing list. No other community channels or explicit contributor information are detailed in the README.

Licensing & Compatibility

The README does not specify a software license or data usage terms. This lack of explicit licensing information may pose compatibility concerns for commercial use or integration into proprietary systems.

Limitations & Caveats

The README does not detail specific limitations of the dataset or its intended use. It focuses on its utility for optimizing LLM serving systems.

Health Check

  • Last commit: 3 weeks ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0

Star History

12 stars in the last 30 days
