Discover and explore top open-source AI tools and projects—updated daily.
microsoftBenchmarking LLM multi-turn task completion
Top 94.2% on SourcePulse
Summary
This repository provides a framework for benchmarking Large Language Models (LLMs) on their ability to complete tasks across multi-turn conversations. It addresses the challenge of LLMs "getting lost" in extended dialogues, offering researchers and domain experts a tool to systematically evaluate and reproduce findings related to LLM conversational limitations. The primary benefit is enabling reproducible research and deeper understanding of LLM behavior in complex, multi-turn interactions.
How It Works
The project utilizes a simulation engine (simulator_*.py) to generate both single-turn and multi-turn conversations. These simulations are driven by 600 sharded instructions, with task-specific logic defined for seven analytical generation domains: code, database querying, API function calling, elementary math, data-to-text, summarization, and translation. A web-based Streamlit application (app_conv_viewer.py) allows for detailed inspection and analysis of the simulated dialogues, facilitating experiment reproduction.
Quick Start & Requirements
python run_simulations.pystreamlit run app_conv_viewer.pyOPENAI_API_KEY or AZURE_OPENAI_API_KEY + AZURE_OPENAI_ENDPOINT environment variables for LLM API access.data/spider/databases/.https://arxiv.org/abs/2505.06120.Highlighted Details
Maintenance & Community
This project is maintained by Microsoft and welcomes contributions via pull requests, requiring agreement to a Contributor License Agreement (CLA). It adheres to the Microsoft Open Source Code of Conduct. Direct feedback and inquiries can be sent to plaban@microsoft.com.
Licensing & Compatibility
The license type is not explicitly stated in the provided README. The repository is released for research purposes and is not recommended for commercial or real-world applications without significant further testing and validation. It was primarily developed and tested using the English language.
Limitations & Caveats
Users must provide their own LLM API access, incurring associated costs. The code task simulation is incompatible with Windows. The database task requires manual data setup. Generated outputs may contain factual errors or fabrications, necessitating human oversight. Simulated conversations do not represent natural human-AI interactions and are unsuitable for high-risk or regulated domains. The system has not undergone systematic security hardening against vulnerabilities like prompt injection. Performance in non-English languages requires independent expert evaluation.
11 months ago
Inactive
zhenbench