LLM auto-evaluation app for QA chains
Top 46.2% on sourcepulse
This project provides an automated evaluation framework for Document Question-Answering (QA) systems built with LangChain. It addresses the challenge of systematically assessing and improving QA chain performance, including issues such as hallucinations and poor answer quality, by auto-generating test sets and grading responses. The target audience includes LLM developers and researchers seeking to benchmark and optimize their QA pipelines.
How It Works
The system leverages Anthropic's and OpenAI's research on model-written evaluations. It first auto-generates a question-answer test set from the provided documents using LangChain's QAGenerationChain. It then runs a specified RetrievalQA chain against this test set, retrieving relevant document chunks and synthesizing answers with an LLM. Finally, it uses model-graded evaluations to assess both the relevance of the retrieved documents and the quality of the generated answers against ground truth.
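A minimal sketch of that generate, answer, and grade loop, assuming the classic LangChain APIs (QAGenerationChain, RetrievalQA, QAEvalChain); FAISS, OpenAI embeddings, gpt-3.5-turbo, and the sample document path are illustrative choices, not fixed project settings:

# Sketch: generate a test set, answer it with a RetrievalQA chain, grade the answers.
from langchain.chains import QAGenerationChain, RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.evaluation.qa import QAEvalChain
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
doc_text = open("docs/karpathy-lex-pod/karpathy-pod.txt").read()

# 1. Auto-generate question/answer pairs from a slice of the document.
eval_set = QAGenerationChain.from_llm(llm).run(doc_text[:3000])  # [{"question": ..., "answer": ...}]

# 2. Build the RetrievalQA chain under test (chunking, embedding, retrieval).
chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100).split_text(doc_text)
retriever = FAISS.from_texts(chunks, OpenAIEmbeddings()).as_retriever(search_kwargs={"k": 3})
qa_chain = RetrievalQA.from_chain_type(llm, retriever=retriever)

# 3. Answer each generated question, then have an LLM grade it against ground truth.
predictions = [{"question": ex["question"], "answer": ex["answer"],
                "result": qa_chain.run(ex["question"])} for ex in eval_set]
grades = QAEvalChain.from_llm(llm).evaluate(
    eval_set, predictions, question_key="question", prediction_key="result")

The same grading pattern can also be pointed at the retrieved chunks rather than the final answer to score retrieval relevance.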
Quick Start & Requirements
Backend: pip install -r requirements.txt, then start the API server with uvicorn evaluator_app:app.
Frontend: yarn install, then yarn dev.
Requires an OpenAI API key (OPENAI_API_KEY) and an Anthropic API key (ANTHROPIC_API_KEY).
Example request against the running backend:
curl -X POST -F "files=@docs/karpathy-lex-pod/karpathy-pod.txt" -F "num_eval_questions=1" -F "chunk_chars=1000" -F "overlap=100" -F "split_method=RecursiveTextSplitter" -F "retriever_type=similarity-search" -F "embeddings=OpenAI" -F "model_version=gpt-3.5-turbo" -F "grade_prompt=Fast" -F "num_neighbors=3" http://localhost:8000/evaluator-stream
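The same request can be sent from Python with the requests library (a sketch; the field values mirror the curl example, and the endpoint streams results back incrementally):

# Sketch: POST an evaluation job and read the streamed results line by line.
import requests

resp = requests.post(
    "http://localhost:8000/evaluator-stream",
    files={"files": open("docs/karpathy-lex-pod/karpathy-pod.txt", "rb")},
    data={
        "num_eval_questions": 1,
        "chunk_chars": 1000,
        "overlap": 100,
        "split_method": "RecursiveTextSplitter",
        "retriever_type": "similarity-search",
        "embeddings": "OpenAI",
        "model_version": "gpt-3.5-turbo",
        "grade_prompt": "Fast",
        "num_neighbors": 3,
    },
    stream=True,
)
for line in resp.iter_lines():
    if line:
        print(line.decode())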
Highlighted Details
Maintenance & Community
Maintained under the langchain-ai GitHub organization.
Licensing & Compatibility
Limitations & Caveats
The project is presented as an "app" and may require significant configuration of LangChain components and API keys. Specific details on supported embedding models beyond OpenAI and retriever types beyond FAISS similarity search are not exhaustively listed. The effectiveness of auto-generated test sets and model-graded evaluations depends heavily on the chosen prompts and underlying LLMs.