machine-theory: LLM council for democratic AI benchmarking
Summary
This project introduces a framework for evaluating Large Language Models (LLMs) in which the models form a "council" and reach a democratic consensus on responses to subjective prompts. It targets researchers and practitioners who are constrained by the limitations of human-curated benchmarks and by the biases of individual LLM judges, and it offers a decentralized approach to self-assessment.
How It Works
The core mechanism involves deploying multiple LLMs to collectively judge and elect a "best" model for a given prompt, mimicking a democratic process. This approach leverages LLM-as-a-Judge capabilities in a decentralized, consensus-driven manner, aiming to overcome the subjectivity and value-laden nature of traditional LLM evaluations.
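The election step described above can be illustrated with a small sketch. This is not the library's implementation: it swaps in a simple plurality vote, assumes each judge casts exactly one ballot, and uses hypothetical names (elect_best, the model-a through model-d identifiers, and the allow_self_votes flag). The actual project may aggregate judgments differently.

```python
from collections import Counter


def elect_best(ballots, allow_self_votes=False):
    """Tally one ballot per judge and return (winner, tally).

    ballots maps a judge's name to the name of the model it voted for.
    Hypothetical helper for illustration; not part of any real library.
    """
    tally = Counter()
    for judge, choice in ballots.items():
        # Assumption: a judge voting for itself is discarded by default,
        # to limit self-preference bias.
        if not allow_self_votes and judge == choice:
            continue
        tally[choice] += 1
    if not tally:
        raise ValueError("no valid ballots were cast")
    # Highest vote count wins; ties are broken alphabetically so the
    # result is deterministic.
    winner = min(tally, key=lambda model: (-tally[model], model))
    return winner, tally


# Four council members judge the responses to a single prompt.
ballots = {
    "model-a": "model-b",
    "model-b": "model-c",
    "model-c": "model-b",
    "model-d": "model-d",  # self-vote, ignored by default
}
winner, tally = elect_best(ballots)
print(winner, dict(tally))
```

Dropping self-votes is one design choice among several; a council could instead keep them and rely on the number of judges to dilute any single model's bias.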
Quick Start & Requirements
Installation is straightforward via pip: pip install lm-council. A prerequisite is configuring an OpenRouter API key in a .env file. The library supports running councils on single or multiple prompts in parallel, with options to save and load council states. Official resources include a website (https://llm-council.com), dataset (https://huggingface.co/datasets/llm-council/emotional_application), paper (https://arxiv.org/abs/2406.08598), talk recording (https://youtu.be/hI0XCE27QqE), and slides (https://bit.ly/44XSEnh).
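As a rough illustration of the .env prerequisite, the sketch below parses a .env file with the standard library and checks that a key is present before any council is run. The variable name OPENROUTER_API_KEY and the helper names load_env_file and require_key are assumptions made for this example; consult the project README for the exact name the library expects.

```python
import os
import tempfile
from pathlib import Path


def load_env_file(path):
    """Parse simple KEY=VALUE lines from a .env file into a dict."""
    values = {}
    for raw in Path(path).read_text().splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def require_key(values, name="OPENROUTER_API_KEY"):
    """Return the key from the parsed file or the process environment.

    The default name is an assumption, not confirmed by the source.
    """
    key = values.get(name) or os.environ.get(name)
    if not key:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return key


# Demonstration with a throwaway file and a placeholder value.
with tempfile.TemporaryDirectory() as tmp:
    env_path = Path(tmp) / ".env"
    env_path.write_text('# OpenRouter credentials\nOPENROUTER_API_KEY="sk-placeholder"\n')
    demo_values = load_env_file(env_path)

demo_key = require_key(demo_values)
print("key loaded:", bool(demo_key))
```

Failing early with a clear message is usually preferable to discovering a missing key only after a batch of parallel prompts has been dispatched.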
Maintenance & Community
The project is associated with authors Justin Zhao, Flor Miriam Plaza-del-Arco, Benjamin Genchel, and Amanda Cercas Curry. No specific community channels (e.g., Discord, Slack) or explicit roadmap details are provided in the README.
Licensing & Compatibility
The specific open-source license for this repository is not explicitly stated in the provided README text.
Limitations & Caveats
The system's functionality is dependent on the OpenRouter API for model access. The project appears research-oriented, stemming from a specific paper, and may not represent a fully generalized or production-ready evaluation suite without further development or adaptation.
5 months ago
Inactive