EQ-Bench by EQ-bench

Benchmark for emotional intelligence evaluation in LLMs

created 1 year ago
333 stars

Top 83.6% on sourcepulse

Project Summary

EQ-Bench provides a benchmark for evaluating the emotional intelligence of large language models (LLMs). It offers a comprehensive suite of tasks, including creative writing assessment and a unique "judgemark" capability for evaluating an LLM's ability to judge creative output. This benchmark is valuable for researchers and developers seeking to quantify and improve the emotional understanding and nuanced response generation of LLMs.

How It Works

EQ-Bench employs an LLM-as-a-judge approach for its creative writing tasks, utilizing detailed criteria to score model outputs. The "judgemark" task further assesses an LLM's capacity to evaluate creative writing, aggregating metrics like correlation with other benchmarks and response spread. The scoring system has evolved to a full-scale approach, with differences from reference answers scaled on a curve to improve discriminative power and reduce variance from minor parameter perturbations.
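
The exact scoring code lives in the repository, but the idea of curve-scaling the difference between a model's answer and the reference answer can be illustrated roughly. In the Python sketch below, the 0-10 intensity scale, the scaling exponent, and the constants are illustrative assumptions, not EQ-Bench's actual implementation.

    # Illustrative sketch only: the intensity scale, scaling curve, and constants
    # here are assumptions, not EQ-Bench's real scoring code.
    import math

    def scaled_difference(predicted: float, reference: float) -> float:
        """Penalise the gap between predicted and reference intensity ratings,
        scaled on a curve so small disagreements cost relatively little."""
        diff = abs(predicted - reference)   # raw gap on an assumed 0-10 scale
        return diff ** 1.5 / math.sqrt(10)  # hypothetical curve, not the real one

    def question_score(predictions: dict[str, float], references: dict[str, float]) -> float:
        """Aggregate per-emotion penalties into a 0-10 score for one question."""
        penalty = sum(scaled_difference(predictions[e], references[e]) for e in references)
        return max(0.0, 10.0 - penalty)

    # A model that is close on every emotion loses little score.
    refs  = {"anger": 1, "surprise": 7, "relief": 6, "embarrassment": 2}
    preds = {"anger": 2, "surprise": 6, "relief": 6, "embarrassment": 3}
    print(round(question_score(preds, refs), 2))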

Quick Start & Requirements

  • Install: run the included install_reqs.sh to install dependencies, or use ooba_quick_install.sh for a full setup.
  • Prerequisites: Python 3.x, transformers (GitHub version recommended), torch, openai, scipy, tqdm, sentencepiece, hf_transfer, peft, bitsandbytes, trl, accelerate, tensorboardX, huggingface_hub. Additional requirements exist for QWEN models and results uploading (e.g., gspread, firebase_admin).
  • Inference Engines: Supports transformers, openai, ooba, and llama.cpp.
  • Setup: API keys and runtime settings are configured in a config.cfg file (a hypothetical sketch follows this list).
  • Docs: EQ-Bench About
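
The repository's config.cfg is the authoritative template; the Python sketch below only illustrates how API keys and runtime settings of the kind described above might be read, and the section and key names in it are assumptions rather than the project's actual options.

    # Hypothetical sketch of reading a config.cfg of the kind described above;
    # the section and key names are assumptions -- consult the repository's
    # config.cfg for the real layout.
    import configparser

    config = configparser.ConfigParser()
    config.read("config.cfg")

    # e.g. an API key for the openai inference engine or judge (assumed names)
    openai_api_key = config.get("OpenAI", "api_key", fallback="")
    # e.g. a runtime option choosing among transformers, openai, ooba, llama.cpp
    inference_engine = config.get("Options", "inference_engine", fallback="transformers")
    print(inference_engine)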

Highlighted Details

  • Includes creative-writing and judgemark benchmarks.
  • Supports multiple inference engines including llama.cpp and custom OpenAI-compatible endpoints (e.g., Ollama); a minimal endpoint sketch follows this list.
  • Version 2.4 updates the creative writing benchmark with new prompts, a Claude 3.5 Sonnet judge, and weighted scoring criteria.
  • Leaderboard models are tested for 10 iterations, with 95% confidence intervals displayed; a confidence-interval sketch also follows this list.
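
How EQ-Bench points at a custom endpoint is handled through its configuration, but the underlying OpenAI-compatible pattern is sketched below using the openai Python client against Ollama's local /v1 endpoint; the base URL, placeholder key, model name, and prompt are examples, not project defaults.

    # General OpenAI-compatible client pattern (here against Ollama's local API);
    # base URL, key, model, and prompt are examples, not EQ-Bench defaults.
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible endpoint
        api_key="ollama",                      # placeholder; Ollama ignores the key
    )

    response = client.chat.completions.create(
        model="llama3",  # any model pulled into Ollama
        messages=[{"role": "user", "content": "Rate the intensity of relief (0-10) in: ..."}],
    )
    print(response.choices[0].message.content)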
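This summary does not specify how the leaderboard's confidence intervals are computed; a standard t-based 95% confidence interval of the mean over 10 iteration scores would look like the sketch below, where the scores are made up purely for illustration.

    # Standard mean and t-based 95% CI over repeated runs; the scores are invented
    # for illustration and this is not necessarily the leaderboard's exact method.
    import statistics
    from scipy import stats

    scores = [71.2, 70.8, 72.1, 71.5, 70.9, 71.8, 71.0, 72.3, 71.4, 71.1]  # 10 iterations
    n = len(scores)
    mean = statistics.mean(scores)
    sem = statistics.stdev(scores) / n ** 0.5        # standard error of the mean
    half_width = stats.t.ppf(0.975, df=n - 1) * sem  # two-sided 95% half-width
    print(f"{mean:.2f} +/- {half_width:.2f}")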

Maintenance & Community

  • Active development with recent updates (v2.4 in June 2024).
  • Community contributions noted (e.g., German language support by CrispStrobe).
  • Integration with eleuther-eval-harness.

Licensing & Compatibility

  • The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

  • A known issue exists with the Oobabooga inference engine where the API plugin may stop responding after approximately 30 queries, requiring a pipeline reload.
  • Results from different judge models are not directly comparable.
  • The benchmark is primarily focused on English and German languages.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 48 stars in the last 90 days
