LLM auto-evaluation app for QA chains
Top 46.2% on sourcepulse
This project provides an automated evaluation framework for Document Question-Answering (QA) systems built with LangChain. It addresses the challenge of systematically assessing and improving QA chain performance, including issues such as hallucinations and poor answer quality, by auto-generating test sets and grading responses. The target audience includes LLM developers and researchers seeking to benchmark and optimize their QA pipelines.
How It Works
The system leverages Anthropic's and OpenAI's research on model-written evaluations. It first auto-generates a question-answer test set from the provided documents using LangChain's QAGenerationChain. It then runs a specified RetrievalQA chain against this test set, retrieving relevant document chunks and synthesizing answers with an LLM. Finally, it uses model-graded evaluations to assess both the relevance of the retrieved documents and the quality of the generated answers against ground truth.
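A minimal sketch of that generate, answer, and grade loop, assuming the classic LangChain APIs (QAGenerationChain, RetrievalQA, QAEvalChain); FAISS, OpenAI embeddings, gpt-3.5-turbo, and the sample document path are illustrative choices, not fixed project settings:

# Sketch: generate a test set, answer it with a RetrievalQA chain, grade the answers.
from langchain.chains import QAGenerationChain, RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.evaluation.qa import QAEvalChain
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
doc_text = open("docs/karpathy-lex-pod/karpathy-pod.txt").read()

# 1. Auto-generate question/answer pairs from a slice of the document.
eval_set = QAGenerationChain.from_llm(llm).run(doc_text[:3000])  # [{"question": ..., "answer": ...}]

# 2. Build the RetrievalQA chain under test (chunking, embedding, retrieval).
chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100).split_text(doc_text)
retriever = FAISS.from_texts(chunks, OpenAIEmbeddings()).as_retriever(search_kwargs={"k": 3})
qa_chain = RetrievalQA.from_chain_type(llm, retriever=retriever)

# 3. Answer each generated question, then have an LLM grade it against ground truth.
predictions = [{"question": ex["question"], "answer": ex["answer"],
                "result": qa_chain.run(ex["question"])} for ex in eval_set]
grades = QAEvalChain.from_llm(llm).evaluate(
    eval_set, predictions, question_key="question", prediction_key="result")

The same grading pattern can also be pointed at the retrieved chunks rather than the final answer to score retrieval relevance.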
Quick Start & Requirements
Backend: pip install -r requirements.txt, then start the API server with uvicorn evaluator_app:app.
Frontend: yarn install, then yarn dev.
Requires an OpenAI API key (OPENAI_API_KEY) and an Anthropic API key (ANTHROPIC_API_KEY).
Example request against the running backend:
curl -X POST -F "files=@docs/karpathy-lex-pod/karpathy-pod.txt" -F "num_eval_questions=1" -F "chunk_chars=1000" -F "overlap=100" -F "split_method=RecursiveTextSplitter" -F "retriever_type=similarity-search" -F "embeddings=OpenAI" -F "model_version=gpt-3.5-turbo" -F "grade_prompt=Fast" -F "num_neighbors=3" http://localhost:8000/evaluator-stream
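The same request can be sent from Python with the requests library (a sketch; the field values mirror the curl example, and the endpoint streams results back incrementally):

# Sketch: POST an evaluation job and read the streamed results line by line.
import requests

resp = requests.post(
    "http://localhost:8000/evaluator-stream",
    files={"files": open("docs/karpathy-lex-pod/karpathy-pod.txt", "rb")},
    data={
        "num_eval_questions": 1,
        "chunk_chars": 1000,
        "overlap": 100,
        "split_method": "RecursiveTextSplitter",
        "retriever_type": "similarity-search",
        "embeddings": "OpenAI",
        "model_version": "gpt-3.5-turbo",
        "grade_prompt": "Fast",
        "num_neighbors": 3,
    },
    stream=True,
)
for line in resp.iter_lines():
    if line:
        print(line.decode())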
Highlighted Details
Maintenance & Community
Maintained under the langchain-ai GitHub organization.
Licensing & Compatibility
Limitations & Caveats
The project is presented as an "app" and may require significant configuration of LangChain components and API keys. Specific details on supported embedding models beyond OpenAI and retriever types beyond FAISS similarity search are not exhaustively listed. The effectiveness of auto-generated test sets and model-graded evaluations depends heavily on the chosen prompts and underlying LLMs.