xmc.dspy by KarelDO

In-context learning for extreme multi-label classification

created 1 year ago
434 stars

Top 69.6% on sourcepulse

Project Summary

This repository provides Infer-Retrieve-Rank (IReRa), a modular program for extreme multi-label classification (XMC) using in-context learning with large language models (LLMs) and retrievers. It targets researchers and practitioners needing to classify text into thousands or millions of categories with minimal labeled data, achieving state-of-the-art performance without fine-tuning.

How It Works

IReRa decomposes XMC into three steps: an LLM infers candidate label queries from the input text, a retriever maps those queries onto the actual label space, and a second LLM call ranks the retrieved candidates. A teacher LLM bootstraps the few-shot demonstrations used to prompt a student LLM, so the student program reaches strong performance from only a few labeled examples, with no fine-tuning. The system is built on the DSPy programming model, letting users swap components such as LLMs, retrievers, and optimization strategies to trade off cost and performance. LM calls are cached for reproducibility.
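
As a rough sketch (hypothetical class and field names, not the repo's actual code; the pinned experimental DSPy branch may expose a slightly different API), an Infer-Retrieve-Rank program in DSPy looks like:

    import dspy

    class Infer(dspy.Signature):
        """Guess label queries that likely apply to the input text."""
        text = dspy.InputField()
        queries = dspy.OutputField(desc="comma-separated candidate label queries")

    class Rank(dspy.Signature):
        """Rank retrieved candidate labels by how well they fit the text."""
        text = dspy.InputField()
        candidates = dspy.InputField()
        labels = dspy.OutputField(desc="final comma-separated labels, best first")

    class IReRaSketch(dspy.Module):
        def __init__(self, retriever):
            super().__init__()
            self.infer = dspy.ChainOfThought(Infer)  # LLM proposes label queries
            self.retriever = retriever               # maps queries onto the real label space
            self.rank = dspy.ChainOfThought(Rank)    # LLM reranks retrieved candidates

        def forward(self, text):
            queries = self.infer(text=text).queries
            candidates = self.retriever(queries)
            return self.rank(text=text, candidates=candidates)

Compiling such a program with a DSPy teleprompter is what lets a stronger teacher LM bootstrap the student's few-shot demonstrations.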

Quick Start & Requirements

  • Installation: Requires Python 3.10 and an experimental branch of DSPy. Clone DSPy, check out commit 802f2d5f26c1a64d8aad6adbd8b4394b9c4bb743, and install it, then pip install -r requirements.txt.
  • Prerequisites: An OpenAI API key for OpenAI models. For local models, set up a Text Generation Inference (TGI) server and configure lm_config.json.
  • Data: Run bash scripts/load_data.sh and bash scripts/load_cache.sh.
  • Running: Use bash scripts/run_left_to_right.sh to reproduce results with pre-compiled programs, or bash scripts/compile_left_to_right.sh to compile from scratch. Alternatively, evaluate with python run_irera.py, or compile and run with python compile_irera.py. The full sequence is collected in the sketch after this list.
  • Resources: Official paper: arxiv.org/pdf/2401.12178.pdf.
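
Collected into one sequence (the DSPy clone URL and editable-install flag are assumptions; the commit hash and scripts come from the README):

    # Install the experimental DSPy branch at the pinned commit
    git clone https://github.com/stanfordnlp/dspy.git
    cd dspy && git checkout 802f2d5f26c1a64d8aad6adbd8b4394b9c4bb743
    pip install -e . && cd ..

    # Install this repo's dependencies
    pip install -r requirements.txt

    # OpenAI models read the key from the environment
    export OPENAI_API_KEY="sk-..."  # placeholder

    # Fetch datasets and cached LM calls
    bash scripts/load_data.sh
    bash scripts/load_cache.sh

    # Reproduce results with pre-compiled programs, or compile from scratch
    bash scripts/run_left_to_right.sh
    # bash scripts/compile_left_to_right.sh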

Highlighted Details

  • Achieves state-of-the-art performance on XMC tasks with minimal labeled examples (≈50).
  • Modular design allows customization of LLMs, retrievers, and optimization strategies.
  • Caches LM calls, so reported results can be reproduced without paying for re-inference.
  • Supports new tasks by adding a data loader and custom signatures (see the sketch after this list).
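
For the new-task extension point, a hypothetical DSPy signature and loader might look like this (names are illustrative; mirror the repo's existing signatures and loaders rather than copying this verbatim):

    import dspy
    import json

    class InferConditions(dspy.Signature):
        """Given a clinical note, list the medical conditions it mentions."""
        note = dspy.InputField(desc="free-text clinical note")
        conditions = dspy.OutputField(desc="comma-separated condition names")

    def load_examples(path):
        """Load (input, gold-label) pairs; IReRa needs only ~50 of these."""
        with open(path) as f:
            records = [json.loads(line) for line in f]
        return [
            dspy.Example(note=r["note"], labels=r["labels"]).with_inputs("note")
            for r in records
        ]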

Maintenance & Community

The project is maintained by Karel D'Oosterlinck. Users can follow @KarelDoostrlnck on Twitter for updates. Contributions are welcomed via issues or pull requests.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The README notes that results from run_irera.py may differ slightly from compile_irera.py, likely due to bugs in model loading/saving. It also flags an open design issue: the Optimizer class currently depends on program implementation details, which limits how flexibly optimization strategies can be applied.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star history: 14 stars in the last 90 days

Explore Similar Projects

Starred by Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake).

HALOs by ContextualAI

Top 0.2% · 873 stars
Library for aligning LLMs using human-aware loss functions
created 1 year ago
updated 2 weeks ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Alex Cheema (Cofounder of EXO Labs), and 1 more.

recurrent-pretraining by seal-rg

Top 0.1% · 806 stars
Pretraining code for depth-recurrent language model research
created 5 months ago
updated 2 weeks ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Georgios Konstantopoulos (CTO, General Partner at Paradigm), and 2 more.

maestro by roboflow

Top 0.1% · 3k stars
CLI/SDK for fine-tuning multimodal models
created 1 year ago
updated 5 days ago
Starred by Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake) and Travis Fischer (Founder of Agentic).

lingua by facebookresearch

Top 0.1% · 5k stars
LLM research codebase for training and inference
created 9 months ago
updated 2 weeks ago