Benchmark for emotional intelligence evaluation in LLMs
Top 83.6% on sourcepulse
EQ-Bench provides a benchmark for evaluating the emotional intelligence of large language models (LLMs). It offers a comprehensive suite of tasks, including creative writing assessment and a unique "judgemark" capability for evaluating an LLM's ability to judge creative output. This benchmark is valuable for researchers and developers seeking to quantify and improve the emotional understanding and nuanced response generation of LLMs.
How It Works
EQ-Bench employs an LLM-as-a-judge approach for its creative writing tasks, utilizing detailed criteria to score model outputs. The "judgemark" task further assesses an LLM's capacity to evaluate creative writing, aggregating metrics like correlation with other benchmarks and response spread. The scoring system has evolved to a full-scale approach, with differences from reference answers scaled on a curve to improve discriminative power and reduce variance from minor parameter perturbations.
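As a rough illustration of the full-scale idea (the actual curve and constants EQ-Bench uses are not reproduced here), a per-question score might be computed from the distance between each predicted emotion rating and its reference:

```python
# Hypothetical sketch of "full-scale" scoring: each answer rates the intensity
# of an emotion on a 0-10 scale, and the distance from the reference rating is
# passed through a non-linear curve so small errors are penalised gently and
# large errors sharply. The exponent and constants are illustrative only.
def item_score(answer: float, reference: float, full_scale: float = 10.0) -> float:
    diff = abs(answer - reference)
    scaled = (diff / full_scale) ** 1.5 * full_scale  # illustrative curve
    return max(full_scale - scaled, 0.0)

# A question's score could then be the mean over its emotion ratings
# (each tuple is a hypothetical (answer, reference) pair).
ratings = {"anger": (7, 6), "joy": (1, 0), "fear": (5, 8), "surprise": (2, 2)}
question_score = sum(item_score(a, r) for a, r in ratings.values()) / len(ratings)
print(round(question_score, 2))
```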
Quick Start & Requirements
Run install_reqs.sh to install dependencies, or use ooba_quick_install.sh for a full setup. Core dependencies include transformers (the GitHub version is recommended), torch, openai, scipy, tqdm, sentencepiece, hf_transfer, peft, bitsandbytes, trl, accelerate, tensorboardX, and huggingface_hub. Additional packages are required for QWEN models and for uploading results (e.g., gspread, firebase_admin). Supported inference backends are transformers, openai, ooba, and llama.cpp. API keys and runtime settings live in the config.cfg file.
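The exact contents of config.cfg aren't reproduced here; as a rough sketch of the idea, settings of this kind can be read with Python's standard configparser (the section and key names below are hypothetical):

```python
# Illustrative only: the real config.cfg section and key names may differ.
from configparser import ConfigParser

example_cfg = """
[oai]
api_key = sk-xxxx

[run]
inference_engine = openai
"""

cfg = ConfigParser()
cfg.read_string(example_cfg)
print(cfg["run"]["inference_engine"])  # -> openai
```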
Highlighted Details
Includes both the creative-writing and judgemark benchmarks. Supports local inference via llama.cpp and custom OpenAI-compatible endpoints (e.g., Ollama), as in the sketch below.
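For instance, a custom OpenAI-compatible endpoint such as a local Ollama server can be queried with the standard openai client (the base URL, placeholder key, and model name below are assumptions to adapt to your setup):

```python
# Minimal sketch: querying a local OpenAI-compatible endpoint (e.g. Ollama).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")
resp = client.chat.completions.create(
    model="llama3",  # any model served by the local endpoint
    messages=[{"role": "user", "content": "Rate the intensity of each emotion..."}],
)
print(resp.choices[0].message.content)
```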
Maintenance & Community
EQ-Bench is also available through the eleuther-eval-harness.
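As a sketch of the harness route (assuming lm-evaluation-harness is installed and registers the task as eq_bench; the model id is only an example):

```python
# Sketch: running EQ-Bench through EleutherAI's lm-evaluation-harness.
# The task id "eq_bench" and the HuggingFace model id are assumptions;
# adjust them for your setup.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=mistralai/Mistral-7B-Instruct-v0.2",
    tasks=["eq_bench"],
)
print(results["results"]["eq_bench"])
```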
Licensing & Compatibility
Limitations & Caveats