EQ-Bench: Benchmark for emotional intelligence evaluation in LLMs
Top 76.1% on SourcePulse
EQ-Bench is a benchmark for evaluating the emotional intelligence of large language models (LLMs). Its task suite also includes a creative-writing assessment and a "judgemark" task that evaluates an LLM's ability to judge creative output. The benchmark is aimed at researchers and developers who want to quantify and improve the emotional understanding and nuanced response generation of LLMs.
How It Works
EQ-Bench employs an LLM-as-a-judge approach for its creative writing tasks, utilizing detailed criteria to score model outputs. The "judgemark" task further assesses an LLM's capacity to evaluate creative writing, aggregating metrics like correlation with other benchmarks and response spread. The scoring system has evolved to a full-scale approach, with differences from reference answers scaled on a curve to improve discriminative power and reduce variance from minor parameter perturbations.
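To make the full-scale, curve-scaled scoring concrete, here is a minimal illustrative sketch in Python. The curve shape, exponent, and example ratings are assumptions for demonstration only; the actual scoring constants and reference data live in the EQ-Bench repository.

```python
from statistics import mean

def curved_difference(pred: float, ref: float, exponent: float = 1.5,
                      full_scale: float = 10.0) -> float:
    """Scale the raw |pred - ref| gap on a curve so that large misses are
    penalised disproportionately more than small ones. The exponent and
    normalisation are illustrative assumptions, not EQ-Bench's constants."""
    return full_scale * (abs(pred - ref) / full_scale) ** exponent

def score_item(predictions: list[float], references: list[float]) -> float:
    """Score one question from predicted vs. reference emotion intensities
    (each on a 0-10 scale), mapping curved differences to a 0-10 score."""
    diffs = [curved_difference(p, r) for p, r in zip(predictions, references)]
    return max(0.0, 10.0 - mean(diffs))

# Hypothetical ratings for four emotions on one dialogue item.
print(score_item(predictions=[7, 2, 0, 5], references=[8, 1, 0, 6]))
```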
Quick Start & Requirements
- Install: run install_reqs.sh to install the requirements, or use ooba_quick_install.sh for a full setup.
- Core dependencies: transformers (GitHub version recommended), torch, openai, scipy, tqdm, sentencepiece, hf_transfer, peft, bitsandbytes, trl, accelerate, tensorboardX, huggingface_hub. Additional requirements apply for QWEN models and for results uploading (e.g., gspread, firebase_admin).
- Supported inference backends: transformers, openai, ooba, and llama.cpp.
- Configuration: a config.cfg file holds API keys and runtime settings (see the sketch below).
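As a rough illustration of how a config.cfg might be consumed, here is a minimal Python sketch. The section and key names ("api_keys", "openai", "runtime", "threads") are hypothetical placeholders, not the actual schema of the repository's config.cfg, and the file may not follow a plain INI layout.

```python
import configparser

# Minimal sketch of reading API keys and runtime settings from config.cfg.
# The section and key names below are hypothetical placeholders; consult the
# config.cfg template shipped with the repository for the real layout.
config = configparser.ConfigParser()
config.read("config.cfg")

openai_api_key = config.get("api_keys", "openai", fallback="")
n_threads = config.getint("runtime", "threads", fallback=1)

print(f"threads={n_threads}, openai key configured={bool(openai_api_key)}")
```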
Highlighted Details
- Includes the creative-writing and judgemark benchmarks.
- Supports llama.cpp and custom OpenAI-compatible endpoints (e.g., Ollama); a request sketch follows below.
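The snippet below sketches how a custom OpenAI-compatible endpoint, such as a local Ollama server, can be queried with the openai client. The base URL, model name, and prompt are illustrative assumptions and do not reproduce EQ-Bench's internal request code.

```python
from openai import OpenAI

# Point the standard openai client at a local OpenAI-compatible server.
client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible endpoint
    api_key="ollama",                      # placeholder; local servers often ignore it
)

response = client.chat.completions.create(
    model="llama3",  # whichever model the local server is serving
    messages=[{
        "role": "user",
        "content": ("Rate the intensity (0-10) of each emotion the speaker is "
                    "likely feeling: anger, joy, fear, surprise."),
    }],
    temperature=0.0,
)
print(response.choices[0].message.content)
```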
Maintenance & Community
- EQ-Bench is also available via the eleuther-eval-harness.

Licensing & Compatibility
Limitations & Caveats