ollama-grid-search by dezoito

Desktop app to evaluate/compare LLMs

Created 1 year ago
831 stars

Top 42.8% on SourcePulse

Project Summary

This project provides a desktop application for evaluating and comparing Large Language Models (LLMs) served via Ollama. It targets users who need to systematically test different models, prompts, and inference parameters to find optimal configurations for their use cases, offering a visual interface for results inspection and A/B testing.

How It Works

The application leverages Ollama's API to interact with various LLMs. It implements a "grid search" concept by iterating through user-defined combinations of models, prompts, and inference parameters (e.g., temperature, top_p). This systematic approach allows for comprehensive testing, with features for A/B testing different prompts or models side-by-side and managing experiment logs for reproducibility.
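
To make the idea concrete, the sketch below iterates over a small parameter grid against Ollama's HTTP generate endpoint. It is not the app's actual implementation; the endpoint URL, model names, and parameter values are illustrative assumptions.

```ts
// Minimal grid-search sketch against Ollama's /api/generate endpoint.
// The endpoint, models, and parameter grids below are illustrative only.
async function gridSearch(prompt: string): Promise<void> {
  const models = ["llama3.2", "mistral"];   // candidate models (assumed to be pulled)
  const temperatures = [0.2, 0.7, 1.0];     // sampling temperatures to test
  const topPs = [0.9, 1.0];                 // nucleus-sampling values to test

  for (const model of models) {
    for (const temperature of temperatures) {
      for (const top_p of topPs) {
        // One non-streaming generation per model/parameter combination.
        const res = await fetch("http://localhost:11434/api/generate", {
          method: "POST",
          headers: { "Content-Type": "application/json" },
          body: JSON.stringify({
            model,
            prompt,
            stream: false,                    // return a single JSON object
            options: { temperature, top_p },  // inference parameters under test
          }),
        });
        const data = await res.json();
        console.log(`${model} (temp=${temperature}, top_p=${top_p}):`, data.response);
      }
    }
  }
}

gridSearch("Summarize the benefits of grid search in one sentence.").catch(console.error);
```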

Quick Start & Requirements

  • Install: Download pre-built binaries from the releases page.
  • Prerequisites: Ollama must be installed and serving models, locally or remotely (a quick reachability check is sketched after this list).
  • Setup: Minimal if Ollama is already running.
  • Docs: Project Blog Post
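
Before launching the app, it can help to confirm that an Ollama server is reachable and already has models pulled. A minimal check, assuming Ollama's default endpoint at http://localhost:11434 (point the URL at a remote host if needed):

```ts
// Sketch: verify that an Ollama server is reachable and list the models it serves.
// http://localhost:11434 is Ollama's default address; adjust for a remote server.
const OLLAMA_URL = "http://localhost:11434";

async function listModels(): Promise<string[]> {
  const res = await fetch(`${OLLAMA_URL}/api/tags`);
  if (!res.ok) {
    throw new Error(`Ollama is not reachable (HTTP ${res.status})`);
  }
  const data = await res.json();
  // /api/tags returns { models: [{ name, size, ... }, ...] }
  return data.models.map((m: { name: string }) => m.name);
}

listModels()
  .then((names) => console.log("Available models:", names))
  .catch(console.error);
```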

Highlighted Details

  • Automated fetching of models from local or remote Ollama servers.
  • Supports A/B testing of prompts and models simultaneously.
  • Includes a prompt archive with "/" autocompletion for prompt selection.
  • Outputs inference metadata such as response time and tokens per second (illustrated in the sketch below).
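
Ollama's generate response includes this metadata, with durations reported in nanoseconds; the sketch below shows one way such figures can be derived (the model name and prompt are illustrative).

```ts
// Sketch: derive time and tokens-per-second from one non-streaming generation.
// Ollama reports total_duration and eval_duration in nanoseconds.
async function measure(model: string, prompt: string): Promise<void> {
  const res = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, prompt, stream: false }),
  });
  const data = await res.json();
  const totalSeconds = data.total_duration / 1e9;                      // end-to-end time
  const tokensPerSecond = data.eval_count / (data.eval_duration / 1e9);
  console.log(`${model}: ${totalSeconds.toFixed(2)} s, ${tokensPerSecond.toFixed(1)} tok/s`);
}

measure("llama3.2", "Write a haiku about grid search.").catch(console.error);
```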

Maintenance & Community

  • Active development with contributions from multiple individuals.
  • Development notes and workflow charts are available for contributors.

Licensing & Compatibility

  • License: MIT License.
  • Compatibility: Permissive license suitable for commercial use and integration with closed-source applications.

Limitations & Caveats

The project is focused on evaluating models served through Ollama; other LLM serving frameworks are not supported. Planned features such as result grading and sharing are not yet available.

Health Check

  • Last Commit: 5 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 20 stars in the last 30 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla and OpenAI; author of CS 231n), Edward Z. Yang (Research Engineer at Meta; maintainer of PyTorch), and 5 more.

yet-another-applied-llm-benchmark by carlini
LLM benchmark for evaluating models on previously asked programming questions
Top 0.2%, 1k stars, created 1 year ago, updated 4 months ago
Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Pawel Garbacki (Cofounder of Fireworks AI), and 3 more.

promptbench by microsoft
LLM evaluation framework
Top 0.1%, 3k stars, created 2 years ago, updated 1 month ago
Starred by Anastasios Angelopoulos (Cofounder of LMArena), Chip Huyen (Author of "AI Engineering" and "Designing Machine Learning Systems"), and 34 more.

evals by openai
Framework for evaluating LLMs and LLM systems, plus benchmark registry
Top 0.2%, 17k stars, created 2 years ago, updated 9 months ago