EQ-Bench by EQ-bench

Benchmark for emotional intelligence evaluation in LLMs

created 1 year ago
333 stars

Top 83.6% on sourcepulse

Project Summary

EQ-Bench provides a benchmark for evaluating the emotional intelligence of large language models (LLMs). It offers a comprehensive suite of tasks, including creative writing assessment and a unique "judgemark" capability for evaluating an LLM's ability to judge creative output. This benchmark is valuable for researchers and developers seeking to quantify and improve the emotional understanding and nuanced response generation of LLMs.

How It Works

EQ-Bench employs an LLM-as-a-judge approach for its creative writing tasks, utilizing detailed criteria to score model outputs. The "judgemark" task further assesses an LLM's capacity to evaluate creative writing, aggregating metrics like correlation with other benchmarks and response spread. The scoring system has evolved to a full-scale approach, with differences from reference answers scaled on a curve to improve discriminative power and reduce variance from minor parameter perturbations.
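
The exact scoring code lives in the repository, but the idea of curve-scaling the difference between a model's answer and the reference answer can be illustrated roughly. In the Python sketch below, the 0-10 intensity scale, the scaling exponent, and the constants are illustrative assumptions, not EQ-Bench's actual implementation.

    # Illustrative sketch only: the intensity scale, scaling curve, and constants
    # here are assumptions, not EQ-Bench's real scoring code.
    import math

    def scaled_difference(predicted: float, reference: float) -> float:
        """Penalise the gap between predicted and reference intensity ratings,
        scaled on a curve so small disagreements cost relatively little."""
        diff = abs(predicted - reference)   # raw gap on an assumed 0-10 scale
        return diff ** 1.5 / math.sqrt(10)  # hypothetical curve, not the real one

    def question_score(predictions: dict[str, float], references: dict[str, float]) -> float:
        """Aggregate per-emotion penalties into a 0-10 score for one question."""
        penalty = sum(scaled_difference(predictions[e], references[e]) for e in references)
        return max(0.0, 10.0 - penalty)

    # A model that is close on every emotion loses little score.
    refs  = {"anger": 1, "surprise": 7, "relief": 6, "embarrassment": 2}
    preds = {"anger": 2, "surprise": 6, "relief": 6, "embarrassment": 3}
    print(round(question_score(preds, refs), 2))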

Quick Start & Requirements

  • Install: run the included install_reqs.sh to install dependencies, or use ooba_quick_install.sh for a full setup.
  • Prerequisites: Python 3.x, transformers (GitHub version recommended), torch, openai, scipy, tqdm, sentencepiece, hf_transfer, peft, bitsandbytes, trl, accelerate, tensorboardX, huggingface_hub. Additional requirements exist for QWEN models and results uploading (e.g., gspread, firebase_admin).
  • Inference Engines: Supports transformers, openai, ooba, and llama.cpp.
  • Setup: API keys and runtime settings are configured in a config.cfg file (a hypothetical sketch follows this list).
  • Docs: EQ-Bench About
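
The repository's config.cfg is the authoritative template; the Python sketch below only illustrates how API keys and runtime settings of the kind described above might be read, and the section and key names in it are assumptions rather than the project's actual options.

    # Hypothetical sketch of reading a config.cfg of the kind described above;
    # the section and key names are assumptions -- consult the repository's
    # config.cfg for the real layout.
    import configparser

    config = configparser.ConfigParser()
    config.read("config.cfg")

    # e.g. an API key for the openai inference engine or judge (assumed names)
    openai_api_key = config.get("OpenAI", "api_key", fallback="")
    # e.g. a runtime option choosing among transformers, openai, ooba, llama.cpp
    inference_engine = config.get("Options", "inference_engine", fallback="transformers")
    print(inference_engine)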

Highlighted Details

  • Includes creative-writing and judgemark benchmarks.
  • Supports multiple inference engines including llama.cpp and custom OpenAI-compatible endpoints (e.g., Ollama); a minimal endpoint sketch follows this list.
  • Version 2.4 updates the creative writing benchmark with new prompts, a Claude 3.5 Sonnet judge, and weighted scoring criteria.
  • Leaderboard models are tested for 10 iterations, with 95% confidence intervals displayed; a confidence-interval sketch also follows this list.
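
How EQ-Bench points at a custom endpoint is handled through its configuration, but the underlying OpenAI-compatible pattern is sketched below using the openai Python client against Ollama's local /v1 endpoint; the base URL, placeholder key, model name, and prompt are examples, not project defaults.

    # General OpenAI-compatible client pattern (here against Ollama's local API);
    # base URL, key, model, and prompt are examples, not EQ-Bench defaults.
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible endpoint
        api_key="ollama",                      # placeholder; Ollama ignores the key
    )

    response = client.chat.completions.create(
        model="llama3",  # any model pulled into Ollama
        messages=[{"role": "user", "content": "Rate the intensity of relief (0-10) in: ..."}],
    )
    print(response.choices[0].message.content)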
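This summary does not specify how the leaderboard's confidence intervals are computed; a standard t-based 95% confidence interval of the mean over 10 iteration scores would look like the sketch below, where the scores are made up purely for illustration.

    # Standard mean and t-based 95% CI over repeated runs; the scores are invented
    # for illustration and this is not necessarily the leaderboard's exact method.
    import statistics
    from scipy import stats

    scores = [71.2, 70.8, 72.1, 71.5, 70.9, 71.8, 71.0, 72.3, 71.4, 71.1]  # 10 iterations
    n = len(scores)
    mean = statistics.mean(scores)
    sem = statistics.stdev(scores) / n ** 0.5        # standard error of the mean
    half_width = stats.t.ppf(0.975, df=n - 1) * sem  # two-sided 95% half-width
    print(f"{mean:.2f} +/- {half_width:.2f}")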

Maintenance & Community

  • Active development with recent updates (v2.4 in June 2024).
  • Community contributions noted (e.g., German language support by CrispStrobe).
  • Integration with eleuther-eval-harness.

Licensing & Compatibility

  • The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

  • A known issue exists with the Oobabooga inference engine where the API plugin may stop responding after approximately 30 queries, requiring a pipeline reload.
  • Results from different judge models are not directly comparable.
  • The benchmark is primarily focused on English and German languages.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 48 stars in the last 90 days
