Search results

135 results for "evaluation framework"

25 of 135 repos

Each entry shows the repository (owner), notable stargazers where captured, a short description, and a stats line: stars with percentile rank, 7-day star change (count and %), 7-day PRs, creation date, issue response rate, and last activity.

1. ten-framework (TEN-framework)
   Real-time, distributed, cloud-edge collaborative multimodal AI agent framework. Supports C++/Go/Python/JS/TS. Integrates LLMs, vision, and h...
   Stars: 7k (Top 10%) · 7d: +130 (+1.9%) · PRs 7d: 43 · Created: 1y ago · Response rate: Inactive · Last active: 6h ago

2. evalscope (modelscope)
   Framework for model evaluation and benchmarking. Supports LLMs, multimodal, embedding, and CLIP models. Includes built-in benchmarks and met...
   Stars: 1k (Top 50%) · 7d: +34 (+2.5%) · PRs 7d: 2 · Created: 1y ago · Response rate: 1 day · Last active: 1d ago

3. helm (stanford-crfm)
   Starred by chiphuyen, lantiga, simonw, transitive-bullshit, +2
   Framework for holistic evaluation of foundation models (LLMs, multimodal). Includes benchmarks, models, metrics (efficiency, bias, toxicity)...
   Stars: 2k (Top 25%) · 7d: +22 (+0.9%) · PRs 7d: 13 · Created: 3y ago · Response rate: 1 week · Last active: 16h ago

4. evaluate (huggingface)
   Starred by chiphuyen, natolambert, hammer, apsdehal, +4
   Library to evaluate and compare ML models. Includes implementations of metrics for NLP, CV, and other tasks. Add new evaluation modules easi...
   Stars: 2k (Top 25%) · 7d: +8 (+0.3%) · PRs 7d: 0 · Created: 3y ago · Response rate: Inactive · Last active: 3w ago
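
   As a quick illustration of the load/compute workflow this library provides, here is a minimal sketch; evaluate.load and compute are the library's documented entry points, while the metric choice and toy data are illustrative:

       # Minimal sketch of huggingface/evaluate: load a built-in metric and
       # compute it over toy predictions/references (data is illustrative).
       import evaluate

       accuracy = evaluate.load("accuracy")  # fetches the metric implementation
       result = accuracy.compute(
           predictions=[0, 1, 1, 0],
           references=[0, 1, 0, 0],
       )
       print(result)  # {'accuracy': 0.75}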

5.
   Benchmark suite to evaluate Google's Gemma language models across tasks. Compares Gemma to Llama 2 and Mistral. Includes visualizations.
   Stars: 442 · 7d: +2 (+0.5%) · PRs 7d: 0 · Created: 3mo ago · Response rate: Inactive · Last active: 2mo ago

6. SuperClaude_Framework (SuperClaude-Org)
   Starred by apsdehal
   Framework for Claude Code. Extends it with specialized commands, cognitive personas, and development methodologies.
   Stars: 12k (Top 5%) · 7d: +1,059 (+9.7%) · PRs 7d: 8 · Created: 1mo ago · Response rate: Inactive · Last active: 1d ago

7. verdict (haizelabs)
   Starred by didierrlopes, thomwolf, natolambert, hammer
   Framework for building LLM-as-a-judge systems. It composes architectural primitives for judges, enabling scalable oversight and automated evaluation.
   Stars: 264 · 7d: +6 (+2.3%) · PRs 7d: 0 · Created: 7mo ago · Response rate: 1 day · Last active: 2w ago
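
   For readers new to the pattern, the sketch below shows the core LLM-as-a-judge loop in its simplest form, written directly against the OpenAI Python client rather than verdict's own API; the rubric and judge model name are placeholders:

       # Generic LLM-as-a-judge sketch (not verdict's API): one model grades
       # another model's answer against a rubric and returns a 1-5 score.
       # Assumes the openai package and an OPENAI_API_KEY in the environment.
       from openai import OpenAI

       client = OpenAI()

       def judge(question: str, answer: str) -> str:
           rubric = ("Rate the answer for factual accuracy on a 1-5 scale. "
                     "Reply with the number only.")
           resp = client.chat.completions.create(
               model="gpt-4o-mini",  # placeholder judge model
               messages=[{"role": "user",
                          "content": f"{rubric}\n\nQ: {question}\nA: {answer}"}],
           )
           return resp.choices[0].message.content.strip()

       print(judge("What is the capital of France?", "Paris"))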

8. openevals (langchain-ai)
   Starred by transitive-bullshit
   Library for evaluating LLM applications. It supports LLM-as-judge, RAG, code, and other evaluators. It integrates with LangSmith for experim...
   Stars: 651 · 7d: +9 (+1.4%) · PRs 7d: 0 · Created: 5mo ago · Response rate: 1 week · Last active: 1mo ago

9. prometheus-eval (prometheus-eval)
   A framework for evaluating LLMs on generation tasks. It includes tools for training, evaluating, and using language models as judges.
   Stars: 977 (Top 50%) · 7d: +4 (+0.4%) · PRs 7d: 0 · Created: 1y ago · Response rate: 1 week · Last active: 3mo ago

10. zeno-build (zeno-ml)
   Starred by transitive-bullshit, hammer
   Collection of examples using Zeno to evaluate generative AI models. It is architecture agnostic, working with OpenAI, LangChain, and Hugging...
   Stars: 490 · 7d: +1 (+0.2%) · PRs 7d: 0 · Created: 2y ago · Response rate: 1 day · Last active: 1y ago

11. phasellm (wgryc)
   Starred by transitive-bullshit, hammer
   Framework to manage & evaluate LLM-driven experiences. Standardizes API calls, evaluates outputs, & automates prompt engineering. Supports m...
   Stars: 457 · 7d: +1 (+0.2%) · PRs 7d: 0 · Created: 2y ago · Response rate: 1 week · Last active: 6mo ago

12. evalplus (evalplus)
   Starred by chiphuyen, infwinston
   Evaluation framework for code LLMs. Features HumanEval+ & MBPP+ for correctness, and EvalPerf for efficiency. Supports many LLM backends.
   Stars: 2k (Top 50%) · 7d: +7 (+0.5%) · PRs 7d: 0 · Created: 2y ago · Response rate: 1 day · Last active: 3w ago
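
   The correctness numbers such code benchmarks report are typically pass@k. Below is a self-contained sketch of the standard unbiased estimator from the HumanEval paper (Chen et al., 2021); the sample counts are illustrative:

       # Unbiased pass@k: given n generated samples per problem, of which c
       # pass the tests, estimate P(at least one of k random samples passes).
       import math

       def pass_at_k(n: int, c: int, k: int) -> float:
           if n - c < k:  # every size-k subset must contain a passing sample
               return 1.0
           # 1 - C(n-c, k) / C(n, k), expanded as a numerically stable product
           return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))

       print(pass_at_k(n=200, c=37, k=1))   # 0.185
       print(pass_at_k(n=200, c=37, k=10))  # ~0.88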

13.
   Framework for evaluating LLM unlearning. Supports benchmarks like TOFU/MUSE, various unlearning methods, datasets, & evaluation metrics.
   Stars: 334 · 7d: +3 (+0.9%) · PRs 7d: 1 · Created: 1y ago · Response rate: 1 day · Last active: 1w ago

14. ceval (hkust-nlp)
   Chinese evaluation suite for foundation models. It contains 13,948 multiple-choice questions across 52 disciplines. Evaluates zero-/few-shot perfo...
   Stars: 2k (Top 25%) · 7d: +2 (+0.1%) · PRs 7d: 0 · Created: 2y ago · Response rate: 1 week · Last active: 6d ago
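
   As a generic illustration (this is the common pattern, not C-Eval's actual harness), a zero-shot run over a multiple-choice suite boils down to formatting each question, asking the model for a letter, and scoring exact matches; ask_model is a hypothetical stand-in for a real model call:

       # Generic zero-shot multiple-choice scoring loop (not ceval's code).
       def ask_model(prompt: str) -> str:
           raise NotImplementedError  # plug in any chat/completion call

       def score(questions: list[dict]) -> float:
           correct = 0
           for q in questions:  # {"question", "choices": {letter: text}, "answer"}
               choices = "\n".join(f"{k}. {v}" for k, v in q["choices"].items())
               prompt = (f"{q['question']}\n{choices}\n"
                         "Answer with the letter of the correct choice.")
               reply = ask_model(prompt).strip().upper()
               correct += reply[:1] == q["answer"]  # count exact letter matches
           return correct / len(questions)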

15. promptbench (microsoft)
   Starred by transitive-bullshit
   A PyTorch-based library for evaluating LLMs. It supports prompt engineering, adversarial prompts, dynamic evaluation, and multi-prompt evalu...
   Stars: 3k (Top 25%) · 7d: +6 (+0.2%) · PRs 7d: 0 · Created: 2y ago · Response rate: Inactive · Last active: 3w ago
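
   As a toy example of what "adversarial prompts" means in practice (generic; not promptbench's API), a character-level perturbation like the one below can be applied to a prompt before evaluation to probe robustness:

       # Toy character-swap perturbation for robustness testing (illustrative).
       import random

       def perturb(prompt: str, rate: float = 0.05, seed: int = 0) -> str:
           rng = random.Random(seed)  # fixed seed for reproducible perturbations
           chars = list(prompt)
           for i in range(len(chars) - 1):
               if rng.random() < rate:
                   chars[i], chars[i + 1] = chars[i + 1], chars[i]
           return "".join(chars)

       print(perturb("Summarize the following article in one sentence."))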

16. model_spec (openai)
   Starred by didierrlopes
   Defines desired behaviors for models via a specification. Includes evaluation prompts to test model performance in challenging situations.
   Stars: 524 · 7d: +36 (+7.1%) · PRs 7d: 0 · Created: 5mo ago · Response rate: Inactive · Last active: 3mo ago

17. evals (openai)
   Starred by aangelopoulos, chiphuyen, taranjeet, gakonst, +15
   Framework to evaluate LLMs and systems built using LLMs. Includes a registry of evals and the ability to write custom evals.
   Stars: 17k (Top 5%) · 7d: +39 (+0.2%) · PRs 7d: 0 · Created: 2y ago · Response rate: Inactive · Last active: 7mo ago
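
   In miniature, the registry idea works like the sketch below: each named eval bundles samples with a grading rule, and running it means completing every sample and grading the output. This is a toy illustration of the pattern, not openai/evals' actual API; complete is a hypothetical model call:

       # Toy eval registry (illustrative; not openai/evals' API).
       def complete(prompt: str) -> str:
           raise NotImplementedError  # plug in a real model call

       REGISTRY = {
           "test-match": {
               "samples": [{"input": "2 + 2 =", "ideal": "4"}],
               "grade": lambda output, ideal: output.strip() == ideal,
           },
       }

       def run_eval(name: str) -> float:
           spec = REGISTRY[name]
           passed = [spec["grade"](complete(s["input"]), s["ideal"])
                     for s in spec["samples"]]
           return sum(passed) / len(passed)  # accuracy over samples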

18. opik (comet-ml)
   Starred by chiphuyen, sb2nov, dguido, jmorganca, +2
   Open-source framework for evaluating, testing, and monitoring LLM applications. Includes tracing, prompt playground, and LLM-as-a-judge metr...
   Stars: 12k (Top 5%) · 7d: +300 (+2.5%) · PRs 7d: 34 · Created: 2y ago · Response rate: 1 day · Last active: 16h ago

19.
   Benchmark for evaluating Chinese LLM alignment. It uses multi-dimensional, rule-calibrated LLM-as-Judge evaluation with chain-of-thought an...
   Stars: 401 · 7d: +2 (+0.5%) · PRs 7d: 0 · Created: 1y ago · Response rate: 1 day · Last active: 11mo ago

20. OLMo-Eval (allenai)
   Framework to evaluate language models on NLP tasks. It supports task sets, aggregate metrics, and integration with Google Sheets.
   Stars: 355 · 7d: 0 (0%) · PRs 7d: 0 · Created: 1y ago · Response rate: Inactive · Last active: 3w ago

21. HarmBench (centerforaisafety)
   Framework to evaluate automated red teaming methods and LLM attacks/defenses. It supports 33 LLMs and 18 red teaming methods.
   Stars: 693 · 7d: +9 (+1.3%) · PRs 7d: 0 · Created: 1y ago · Response rate: 1 day · Last active: 11mo ago

22. ChainForge (ianarawjo)
   Starred by ebursztein, eugeneyan, chiphuyen, jmorganca, +1
   Open-source visual environment for prompt engineering and LLM evaluation. Compare prompts/models, set up metrics, and visualize results.
   Stars: 3k (Top 25%) · 7d: +8 (+0.3%) · PRs 7d: 0 · Created: 2y ago · Response rate: Inactive · Last active: 3d ago

23. autogen (microsoft)
   Starred by wesm, chiphuyen, ebursztein, Ying1123, +7
   Framework for multi-agent AI apps that can act autonomously or alongside humans. It supports agent workflows and rapid prototyping.
   Stars: 48k (Top 1%) · 7d: +312 (+0.6%) · PRs 7d: 21 · Created: 1y ago · Response rate: 1 day · Last active: 16h ago

24. PandaLM (WeOpenML)
   Automated LLM evaluation benchmark. It compares responses, gives reasons, and includes a human-annotated dataset for validation and training...
   Stars: 921 (Top 50%) · 7d: +1 (+0.1%) · PRs 7d: 0 · Created: 2y ago · Response rate: 1 day · Last active: 1y ago

25. evalchemy (mlfoundations)
   Starred by jmorganca
   Unified toolkit for evaluating post-trained language models. Supports parallel evaluation, simplified usage, and standardized result managem...
   Stars: 492 · 7d: +7 (+1.4%) · PRs 7d: 0 · Created: 8mo ago · Response rate: 1 week · Last active: 1mo ago
