auto-evaluator by langchain-ai

LLM auto-evaluation app for QA chains

created 2 years ago
771 stars

Top 46.2% on sourcepulse

View on GitHub
Project Summary

This project provides an automated evaluation framework for Document Question-Answering (QA) systems built with LangChain. It addresses the challenge of systematically assessing and improving QA chain performance, such as mitigating hallucinations and poor answer quality, by auto-generating test sets and grading responses. The target audience includes LLM developers and researchers seeking to benchmark and optimize their QA pipelines.

How It Works

The system leverages Anthropic's and OpenAI's research on model-written evaluations. It first auto-generates a question-answer test set from provided documents using LangChain's QAGenerationChain. Then, it runs a specified RetrievalQA chain against this test set, retrieving relevant document chunks and synthesizing answers with an LLM. Finally, it employs model-graded evaluations to assess both the relevance of retrieved documents and the quality of the generated answers against ground truth.
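
For orientation, here is a minimal sketch of that loop, assuming LangChain's legacy chain APIs (QAGenerationChain, RetrievalQA, QAEvalChain); the repo's actual prompts, parameters, and wiring may differ.

    # Minimal sketch of the generate -> answer -> grade loop described above.
    # Assumes LangChain's legacy chain APIs; not the repo's exact implementation.
    from langchain.chains import QAGenerationChain, RetrievalQA
    from langchain.chat_models import ChatOpenAI
    from langchain.embeddings import OpenAIEmbeddings
    from langchain.evaluation.qa import QAEvalChain
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain.vectorstores import FAISS

    text = open("docs/karpathy-lex-pod/karpathy-pod.txt").read()
    llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

    # 1. Auto-generate a small QA test set from the document.
    qa_gen = QAGenerationChain.from_llm(llm)
    eval_set = qa_gen.run(text[:3000])  # list of {"question": ..., "answer": ...}

    # 2. Build the RetrievalQA chain under test (chunking + FAISS similarity search).
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    chunks = splitter.split_text(text)
    retriever = FAISS.from_texts(chunks, OpenAIEmbeddings()).as_retriever(
        search_kwargs={"k": 3})
    qa_chain = RetrievalQA.from_chain_type(llm, retriever=retriever)

    # 3. Model-graded evaluation of predicted answers against the generated ground truth.
    predictions = [{"result": qa_chain.run(pair["question"])} for pair in eval_set]
    grader = QAEvalChain.from_llm(llm)
    grades = grader.evaluate(eval_set, predictions,
                             question_key="question", answer_key="answer",
                             prediction_key="result")
    print(grades)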

Quick Start & Requirements

  • Backend: pip install -r requirements.txt and uvicorn evaluator_app:app
  • Frontend: yarn install and yarn dev
  • Prerequisites: OpenAI API key (OPENAI_API_KEY), Anthropic API key (ANTHROPIC_API_KEY).
  • Testing API (a Python equivalent is sketched after this list):
    curl -X POST \
      -F "files=@docs/karpathy-lex-pod/karpathy-pod.txt" \
      -F "num_eval_questions=1" \
      -F "chunk_chars=1000" \
      -F "overlap=100" \
      -F "split_method=RecursiveTextSplitter" \
      -F "retriever_type=similarity-search" \
      -F "embeddings=OpenAI" \
      -F "model_version=gpt-3.5-turbo" \
      -F "grade_prompt=Fast" \
      -F "num_neighbors=3" \
      http://localhost:8000/evaluator-stream
  • Demo: Pre-loaded with Lex Fridman podcast transcript and QA pairs.
  • Playground: Allows custom document uploads and optional test set input.
  • Docs: No dedicated documentation is linked; the code is built on LangChain's QA components.
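
The same request can be issued from Python. A rough equivalent of the curl call above, assuming only the requests library (the form fields mirror the curl example; nothing beyond that is verified here):

    # Hedged Python equivalent of the curl example; field names copied from it.
    import requests

    files = {"files": open("docs/karpathy-lex-pod/karpathy-pod.txt", "rb")}
    data = {
        "num_eval_questions": "1",
        "chunk_chars": "1000",
        "overlap": "100",
        "split_method": "RecursiveTextSplitter",
        "retriever_type": "similarity-search",
        "embeddings": "OpenAI",
        "model_version": "gpt-3.5-turbo",
        "grade_prompt": "Fast",
        "num_neighbors": "3",
    }

    # The endpoint streams results, so read the response incrementally.
    with requests.post("http://localhost:8000/evaluator-stream",
                       files=files, data=data, stream=True) as resp:
        for chunk in resp.iter_content(chunk_size=None, decode_unicode=True):
            print(chunk, end="")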

Highlighted Details

  • Auto-generates QA test sets from input documents.
  • Supports model-graded evaluation for retrieval relevance and answer quality.
  • Allows configuration of chunk size, overlap, embedding models, retrievers (FAISS default), and LLMs.
  • Summarizes experimental results in a table and chart, including latency (a rough aggregation sketch follows this list).
  • Offers two modes: Demo and Playground.
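
As an illustration of that results summary, and continuing the variables from the earlier sketch, per-question answers and latency could be collected into a table like this (the column names are assumptions, not the app's actual schema):

    # Illustrative aggregation of per-question results; not the app's own code.
    import time
    import pandas as pd

    rows = []
    for pair in eval_set:
        start = time.time()
        answer = qa_chain.run(pair["question"])
        rows.append({
            "question": pair["question"],
            "predicted_answer": answer,
            "latency_s": round(time.time() - start, 2),
        })

    results = pd.DataFrame(rows)
    print(results[["question", "latency_s"]])
    print("mean latency (s):", results["latency_s"].mean())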

Maintenance & Community

  • Developed by langchain-ai.
  • Backend deployed on Railway, frontend on Vercel.
  • Contributions are welcome, with specific instructions for running backend and frontend locally.

Licensing & Compatibility

  • License not specified in the README.
  • Compatibility for commercial use or closed-source linking is not detailed.

Limitations & Caveats

The project is presented as an "app" and may require significant configuration of LangChain components and API keys. Specific details on supported embedding models beyond OpenAI and retriever types beyond FAISS similarity search are not exhaustively listed. The effectiveness of auto-generated test sets and model-graded evaluations depends heavily on the chosen prompts and underlying LLMs.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 8 stars in the last 90 days
