lost_in_conversation  by microsoft

Benchmarking LLM multi-turn task completion

Created 1 year ago
274 stars

Top 94.2% on SourcePulse

GitHubView on GitHub
Project Summary

Summary

This repository provides a framework for benchmarking Large Language Models (LLMs) on their ability to complete tasks across multi-turn conversations. It addresses the challenge of LLMs "getting lost" in extended dialogues, offering researchers and domain experts a tool to systematically evaluate and reproduce findings related to LLM conversational limitations. The primary benefit is enabling reproducible research and deeper understanding of LLM behavior in complex, multi-turn interactions.

How It Works

The project utilizes a simulation engine (simulator_*.py) to generate both single-turn and multi-turn conversations. These simulations are driven by 600 sharded instructions, with task-specific logic defined for seven analytical generation domains: code, database querying, API function calling, elementary math, data-to-text, summarization, and translation. A web-based Streamlit application (app_conv_viewer.py) allows for detailed inspection and analysis of the simulated dialogues, facilitating experiment reproduction.

Quick Start & Requirements

  • Run Simulations: python run_simulations.py
  • View Conversations: streamlit run app_conv_viewer.py
  • Prerequisites: Requires OPENAI_API_KEY or AZURE_OPENAI_API_KEY + AZURE_OPENAI_ENDPOINT environment variables for LLM API access.
  • Non-default: Code task simulation is Unix-only. Database task requires downloading ~5GB of test databases to data/spider/databases/.
  • Resource: LLM API calls will incur costs.
  • Docs: Accompanying paper available at https://arxiv.org/abs/2505.06120.

Highlighted Details

  • Dedicated framework for benchmarking LLM multi-turn task completion.
  • Enables reproduction of experimental results from the "Lost in Conversation" paper.
  • Supports seven distinct analytical generation task types.
  • Includes an interactive conversational viewer for detailed dialogue analysis.

Maintenance & Community

This project is maintained by Microsoft and welcomes contributions via pull requests, requiring agreement to a Contributor License Agreement (CLA). It adheres to the Microsoft Open Source Code of Conduct. Direct feedback and inquiries can be sent to plaban@microsoft.com.

Licensing & Compatibility

The license type is not explicitly stated in the provided README. The repository is released for research purposes and is not recommended for commercial or real-world applications without significant further testing and validation. It was primarily developed and tested using the English language.

Limitations & Caveats

Users must provide their own LLM API access, incurring associated costs. The code task simulation is incompatible with Windows. The database task requires manual data setup. Generated outputs may contain factual errors or fabrications, necessitating human oversight. Simulated conversations do not represent natural human-AI interactions and are unsuitable for high-risk or regulated domains. The system has not undergone systematic security hardening against vulnerabilities like prompt injection. Performance in non-English languages requires independent expert evaluation.

Health Check
Last Commit

11 months ago

Responsiveness

Inactive

Pull Requests (30d)
0
Issues (30d)
2
Star History
23 stars in the last 30 days

Explore Similar Projects

Starred by Kaichao You Kaichao You(Core Maintainer of vLLM), Shizhe Diao Shizhe Diao(Author of LMFlow; Research Scientist at NVIDIA), and
1 more.

z-bench by zhenbench

0.2%
504
Chinese LLM prompt dataset for non-technical users
Created 3 years ago
Updated 2 years ago
Feedback? Help us improve.