Showing 1 - 25 of 135 repos
# | Repository | Description | Stars | Stars 7d Δ | Stars 7d % | PRs 7d Δ | Created | Response rate | Last active
---|---|---|---|---|---|---|---|---|---
1 | TEN-framework/ten-framework | Real-time, distributed, cloud-edge collaborative multimodal AI agent framework. Supports C++/Go/Python/JS/TS. Integrates LLMs, vision, and h... | 7k (Top 10%) | 130 | 1.9% | 43 | 1y ago | Inactive | 6h ago
2 | modelscope/evalscope | Framework for model evaluation and benchmarking. Supports LLMs, multimodal, embedding, and CLIP models. Includes built-in benchmarks and met... | 1k (Top 50%) | 34 | 2.5% | 2 | 1y ago | 1 day | 1d ago
3 | | Framework for holistic evaluation of foundation models (LLMs, multimodal). Includes benchmarks, models, and metrics (efficiency, bias, toxicity)... | 2k (Top 25%) | 22 | 0.9% | 13 | 3y ago | 1 week | 16h ago
4 | | Library to evaluate and compare ML models. Includes implementations of metrics for NLP, CV, and other tasks. Add new evaluation modules easi... | 2k (Top 25%) | 8 | 0.3% | 0 | 3y ago | Inactive | 3w ago
5 | heilcheng/2025-GSoC-Proposal-Selected | Benchmark suite to evaluate Google's Gemma language models across tasks. Compares Gemma to Llama 2 and Mistral. Includes visualizations. | 442 | 2 | 0.5% | 0 | 3mo ago | Inactive | 2mo ago
6 | | Framework for Claude Code. Enhances it with specialized commands, cognitive personas, and development methodologies. | 12k (Top 5%) | 1,059 | 9.7% | 8 | 1mo ago | Inactive | 1d ago
7 | | Framework for building LLM-as-a-judge systems. Composes judge architectural primitives for scalable oversight and automated evaluation. | 264 | 6 | 2.3% | 0 | 7mo ago | 1 day | 2w ago
8 | | Library for evaluating LLM applications. Supports LLM-as-judge, RAG, code, and other evaluators. Integrates with LangSmith for experim... | 651 | 9 | 1.4% | 0 | 5mo ago | 1 week | 1mo ago
9 | prometheus-eval/prometheus-eval | Framework for evaluating LLMs on generation tasks. Includes tools for training, evaluating, and using language models as judges. | 977 (Top 50%) | 4 | 0.4% | 0 | 1y ago | 1 week | 3mo ago
10 | | Collection of examples using Zeno to evaluate generative AI models. Architecture-agnostic: works with OpenAI, LangChain, and Hugging... | 490 | 1 | 0.2% | 0 | 2y ago | 1 day | 1y ago
11 | | Framework to manage and evaluate LLM-driven experiences. Standardizes API calls, evaluates outputs, and automates prompt engineering. Supports m... | 457 | 1 | 0.2% | 0 | 2y ago | 1 week | 6mo ago
12 | | Evaluation framework for code LLMs. Features HumanEval+ and MBPP+ for correctness, and EvalPerf for efficiency. Supports many LLM backends. | 2k (Top 50%) | 7 | 0.5% | 0 | 2y ago | 1 day | 3w ago
13 | locuslab/open-unlearning | Framework for evaluating LLM unlearning. Supports benchmarks such as TOFU and MUSE, various unlearning methods, datasets, and evaluation metrics. | 334 | 3 | 0.9% | 1 | 1y ago | 1 day | 1w ago
14 | hkust-nlp/ceval | Chinese evaluation suite for foundation models. Contains 13,948 multiple-choice questions across 52 disciplines. Evaluates zero/few-shot perfo... | 2k (Top 25%) | 2 | 0.1% | 0 | 2y ago | 1 week | 6d ago
15 | | A PyTorch-based library for evaluating LLMs. Supports prompt engineering, adversarial prompts, dynamic evaluation, and multi-prompt evalu... | 3k (Top 25%) | 6 | 0.2% | 0 | 2y ago | Inactive | 3w ago
16 | | Defines desired model behaviors via a specification. Includes evaluation prompts to test model performance in challenging situations. | 524 | 36 | 7.1% | 0 | 5mo ago | Inactive | 3mo ago
17 | | Framework to evaluate LLMs and systems built with LLMs. Includes a registry of evals and the ability to write custom evals. | 17k (Top 5%) | 39 | 0.2% | 0 | 2y ago | Inactive | 7mo ago
18 | | Open-source framework for evaluating, testing, and monitoring LLM applications. Includes tracing, a prompt playground, and LLM-as-a-judge metr... | 12k (Top 5%) | 300 | 2.5% | 34 | 2y ago | 1 day | 16h ago
19 | THUDM/AlignBench | Benchmark for evaluating the alignment of Chinese LLMs. Uses multi-dimensional, rule-calibrated LLM-as-judge evaluation with chain-of-thought an... | 401 | 2 | 0.5% | 0 | 1y ago | 1 day | 11mo ago
20 | allenai/OLMo-Eval | Framework to evaluate language models on NLP tasks. Supports task sets, aggregate metrics, and integration with Google Sheets. | 355 | 0 | 0% | 0 | 1y ago | Inactive | 3w ago
21 | centerforaisafety/HarmBench | Framework to evaluate automated red-teaming methods and LLM attacks/defenses. Supports 33 LLMs and 18 red-teaming methods. | 693 | 9 | 1.3% | 0 | 1y ago | 1 day | 11mo ago
22 | | Open-source visual environment for prompt engineering and LLM evaluation. Compare prompts/models, set up metrics, and visualize results. | 3k (Top 25%) | 8 | 0.3% | 0 | 2y ago | Inactive | 3d ago
23 | | Framework for multi-agent AI apps that can act autonomously or alongside humans. Supports agent workflows and rapid prototyping. | 48k (Top 1%) | 312 | 0.6% | 21 | 1y ago | 1 day | 16h ago
24 | WeOpenML/PandaLM | Automated LLM evaluation benchmark. Compares responses, gives reasons, and includes a human-annotated dataset for validation and training... | 921 (Top 50%) | 1 | 0.1% | 0 | 2y ago | 1 day | 1y ago
25 | | Unified toolkit for evaluating post-trained language models. Supports parallel evaluation, simplified usage, and standardized result managem... | 492 | 7 | 1.4% | 0 | 8mo ago | 1 week | 1mo ago