promptbase by microsoft

Prompt engineering resources for eliciting top performance from foundation models

Created 1 year ago
5,678 stars

Top 9.0% on SourcePulse

View on GitHub
Project Summary

This repository provides a collection of resources, best practices, and example scripts for advanced prompt engineering, specifically targeting foundation models like GPT-4. It aims to help researchers and practitioners achieve state-of-the-art performance on various benchmarks, particularly in complex reasoning and domain-specific tasks, by offering structured methodologies and extensible frameworks.

How It Works

The core of the project is the "Medprompt" methodology, which combines dynamic few-shot selection, self-generated chain-of-thought (CoT), and choice-shuffle ensembling. Dynamic few-shot selection uses semantic similarity (via text-embedding-ada-002) to retrieve relevant examples for each query. Self-generated CoT automates the creation of reasoning steps, and ensembling with choice shuffling enhances robustness. Medprompt+ extends this by incorporating a portfolio approach, dynamically selecting between direct few-shot prompts and CoT-based prompts based on GPT-4's assessment of task complexity.
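For concreteness, here is a minimal sketch of the two mechanical pieces described above: dynamic few-shot selection by embedding similarity, and choice-shuffle ensembling with a majority vote. The function names, data layout, and the ask_model callback are illustrative assumptions, not the repository's actual API.

    # Illustrative sketch of dynamic few-shot selection and choice-shuffle
    # ensembling; names and shapes are hypothetical, not promptbase's API.
    from collections import Counter
    import numpy as np

    def select_few_shot(query_emb: np.ndarray,
                        train_embs: np.ndarray,
                        train_examples: list[str],
                        k: int = 5) -> list[str]:
        """Pick the k training examples whose embeddings (e.g. from
        text-embedding-ada-002) are most similar to the query embedding."""
        sims = train_embs @ query_emb / (
            np.linalg.norm(train_embs, axis=1) * np.linalg.norm(query_emb) + 1e-9)
        top = np.argsort(-sims)[:k]
        return [train_examples[i] for i in top]

    def choice_shuffle_ensemble(ask_model, question: str, choices: list[str],
                                n_votes: int = 5, seed: int = 0) -> str:
        """Reorder the answer options on each call and majority-vote over the
        model's answers (mapped back to the original order) to reduce position bias."""
        rng = np.random.default_rng(seed)
        votes = []
        for _ in range(n_votes):
            order = rng.permutation(len(choices))
            shuffled = [choices[i] for i in order]
            picked = ask_model(question, shuffled)  # index into `shuffled`
            votes.append(order[picked])             # map back to the original index
        return choices[Counter(votes).most_common(1)[0][0]]

In the full pipeline, the retrieved examples (with their self-generated CoT rationales) are formatted into the prompt that ask_model sends, and the ensemble vote is taken over several such calls.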

Quick Start & Requirements

  • Install via pip install -e . after cloning the repository and navigating to the src directory.
  • Requires Azure OpenAI API keys and endpoints (AZURE_OPENAI_API_KEY, AZURE_OPENAI_CHAT_API_KEY, AZURE_OPENAI_CHAT_ENDPOINT_URL, AZURE_OPENAI_EMBEDDINGS_URL); a minimal configuration sketch follows this list.
  • Datasets for benchmarks (MMLU, HumanEval, DROP, GSM8K, MATH, Big-Bench-Hard) must be downloaded separately and placed in src/promptbase/datasets/.
  • Example run command: python -m promptbase mmlu --subject <SUBJECT>
  • Links: Medprompt Blog, Medprompt Research Paper
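For orientation, a minimal sketch of wiring those environment variables into a client and issuing a test call. It assumes the openai>=1.x Python SDK; the api_version, the deployment name, and the exact way the repository's scripts consume these variables are placeholders, not values specified by the repo.

    import os
    from openai import AzureOpenAI  # assumes the openai>=1.x Python SDK

    client = AzureOpenAI(
        api_key=os.environ["AZURE_OPENAI_CHAT_API_KEY"],
        azure_endpoint=os.environ["AZURE_OPENAI_CHAT_ENDPOINT_URL"],
        api_version="2024-02-01",  # placeholder; match your Azure deployment
    )

    response = client.chat.completions.create(
        model="gpt-4",  # the Azure deployment name, not necessarily "gpt-4"
        messages=[{"role": "user", "content": "Say hello."}],
    )
    print(response.choices[0].message.content)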

Highlighted Details

  • Achieved 90.10% on MMLU with GPT-4 using Medprompt+.
  • Demonstrates outperforming specialized models with generalist foundation models via prompting.
  • Includes detailed methodology for dynamic few-shot selection, self-generated CoT, and ensembling.
  • Benchmarks provided for GPT-4 and Gemini Ultra across MMLU, GSM8K, MATH, HumanEval, BIG-Bench-Hard, DROP, and HellaSwag.

Maintenance & Community

  • Developed by Microsoft Research.
  • Future plans include more case studies, interviews, and specialized tooling deep dives.

Licensing & Compatibility

  • The repository itself appears to be MIT licensed, but the underlying methodologies and data usage are tied to Microsoft's AI services and research.

Limitations & Caveats

Some scripts are for reference and may not be immediately executable against public APIs. Medprompt+ relies on access to logprobs from GPT-4, which were not publicly available via the API at the time of the README's writing but were expected to be enabled.
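For context, here is a sketch of how per-token log-probabilities can be requested through the chat completions API as it exists today, assuming the openai>=1.x Python SDK. Parameter support depends on the model and API version, and this is not necessarily how the repository's scripts consume logprobs.

    import os
    from openai import AzureOpenAI

    client = AzureOpenAI(
        api_key=os.environ["AZURE_OPENAI_CHAT_API_KEY"],
        azure_endpoint=os.environ["AZURE_OPENAI_CHAT_ENDPOINT_URL"],
        api_version="2024-02-01",  # placeholder
    )

    # Ask for the top alternatives of the single answer token (e.g. A/B/C/D).
    response = client.chat.completions.create(
        model="gpt-4",  # Azure deployment name
        messages=[{"role": "user",
                   "content": "Question: ...\nAnswer with a single letter: A, B, C, or D."}],
        max_tokens=1,
        logprobs=True,
        top_logprobs=5,
    )
    for alt in response.choices[0].logprobs.content[0].top_logprobs:
        print(alt.token, alt.logprob)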

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

20 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Edward Sun (Research Scientist at Meta Superintelligence Lab).

AGIEval by ruixiangcui

Benchmark for evaluating foundation models on human-centric tasks

Top 0.1% on SourcePulse
763 stars
Created 2 years ago
Updated 1 year ago