context-1-data-gen by chroma-core

Synthetic multi-hop search data generation pipeline

Created 1 month ago
390 stars

Top 73.5% on SourcePulse

Project Summary

context-1-data-gen is a synthetic data generation pipeline that creates multi-hop search tasks across several domains. It is aimed at researchers and engineers who need realistic, multi-step retrieval datasets for evaluating complex information-seeking behavior. Each task is generated by following an "explore → verify → extend" pattern.

How It Works

The pipeline builds each multi-hop search task by applying the "explore → verify → extend" pattern within a supported domain. The resulting tasks require several dependent retrieval steps, which makes them useful for evaluating search and retrieval systems on problems that resemble real-world information discovery.
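The README snippet names the pattern but does not define the three steps, so the following is only a minimal sketch of one plausible reading: explore proposes a candidate hop, verify accepts or rejects it, and extend chains accepted hops into a longer task. Every name here (Hop, Task, build_task, the toy corpus) is hypothetical and is not taken from the repository.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Hop:
    """One retrieval step: a question and the evidence that answers it."""
    question: str
    evidence: str


@dataclass
class Task:
    """A multi-hop task is an ordered chain of verified hops."""
    hops: List[Hop] = field(default_factory=list)


def build_task(
    seed: str,
    explore: Callable[[str], Optional[Hop]],
    verify: Callable[[Hop], bool],
    max_hops: int = 3,
) -> Task:
    """Hypothetical explore -> verify -> extend loop (an assumption, not the repo's code)."""
    task = Task()
    frontier = seed
    for _ in range(max_hops):
        candidate = explore(frontier)   # explore: propose the next hop
        if candidate is None or not verify(candidate):
            break                       # verify: stop on a missing or rejected hop
        task.hops.append(candidate)     # extend: grow the chain
        frontier = candidate.evidence   # the next hop starts from this evidence
    return task


# Toy stand-in for a real domain (Web, SEC, Patents, Email); purely illustrative.
TOY_CORPUS: Dict[str, Tuple[str, str]] = {
    "company A": ("Who founded company A?", "person B"),
    "person B": ("Where was person B born?", "city C"),
}


def toy_explore(frontier: str) -> Optional[Hop]:
    entry = TOY_CORPUS.get(frontier)
    if entry is None:
        return None
    question, evidence = entry
    return Hop(question, evidence)


def toy_verify(hop: Hop) -> bool:
    # Accept a hop only if it carries non-empty evidence.
    return bool(hop.evidence.strip())


if __name__ == "__main__":
    for hop in build_task("company A", toy_explore, toy_verify).hops:
        print(hop.question, "->", hop.evidence)
```

In a real pipeline the explore and verify callables would presumably wrap search and model calls; passing them as parameters keeps the loop itself testable without any API keys.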

Quick Start & Requirements

  • Installation: Use uv sync for base dependencies, or uv sync --all-extras to include optional features like reranking, patents, and indexing.
  • Configuration: Copy .env.example to .env and populate it with necessary API keys.
  • Prerequisites: Requires API keys for Anthropic, Serper, Jina, OpenAI, Chroma, and Baseten, which are used across different domains and functionalities (e.g., embeddings, indexing, reranking).
  • Documentation: Each domain (Web, SEC, Patents, Email) has detailed documentation in its own README file.
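Because the pipeline needs keys for several services, it can help to check the .env file before a run. The sketch below parses simple KEY=VALUE lines and reports which keys are missing or empty. The variable names in REQUIRED_KEYS are assumptions made for illustration; the authoritative names are whatever .env.example in the repository lists.

```python
from pathlib import Path
from typing import Dict, Iterable, List, Mapping

# Hypothetical variable names; check .env.example for the real ones.
REQUIRED_KEYS = [
    "ANTHROPIC_API_KEY",
    "SERPER_API_KEY",
    "JINA_API_KEY",
    "OPENAI_API_KEY",
    "CHROMA_API_KEY",
    "BASETEN_API_KEY",
]


def parse_dotenv(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and '#' comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(env: Mapping[str, str], required: Iterable[str]) -> List[str]:
    """Return the required keys that are absent or empty, in the given order."""
    return [key for key in required if not env.get(key, "").strip()]


if __name__ == "__main__":
    env_path = Path(".env")
    env = parse_dotenv(env_path.read_text()) if env_path.exists() else {}
    absent = missing_keys(env, REQUIRED_KEYS)
    if absent:
        print("Missing or empty keys:", ", ".join(absent))
    else:
        print("All listed keys are set.")
```

Not every domain necessarily needs every key, so the required list could be trimmed per domain once the domain READMEs have been consulted.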

Highlighted Details

  • Supports data generation for multiple distinct domains: Web, SEC filings, Patents, and Email (Epstein).
  • Generates synthetic multi-hop search tasks following a structured "explore → verify → extend" pattern.
  • Context-1 model weights are available separately.

Maintenance & Community

The README snippet does not name maintainers, list community channels (such as Discord or Slack), or describe a project roadmap.

Licensing & Compatibility

The README snippet does not state the license or whether commercial use is permitted.

Limitations & Caveats

The pipeline depends on API keys for six third-party services (Anthropic, Serper, Jina, OpenAI, Chroma, and Baseten), so running it can incur significant costs and ties it to external services. The project's description is listed as "None," which suggests it may be in an early or incomplete state.

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 2

Star History

393 stars in the last 30 days
