Gen-Searcher by tulerfeng

Agentic search framework for knowledge-grounded image generation

Created 2 weeks ago

264 stars

Top 96.6% on SourcePulse

Project Summary

Gen-Searcher introduces a multimodal deep research agent designed to enhance image generation by incorporating complex real-world knowledge. It addresses the need for more accurate and contextually relevant image synthesis by enabling agents to perform web searches, browse evidence, reason across multiple sources, and retrieve visual references before generation. This project is targeted at researchers and developers in AI and computer vision, offering a novel approach to grounding image generation in real-world information.

How It Works

Gen-Searcher trains a multimodal deep research agent for image generation tasks that require complex real-world knowledge. Its core is an agentic search loop that runs before image synthesis: web search, evidence browsing, multi-source reasoning, and visual reference retrieval. Grounding generation in the retrieved context is intended to make results more accurate and up to date. The project also introduces two dedicated training datasets (Gen-Searcher-SFT-10k, Gen-Searcher-RL-6k) and a new benchmark (KnowGen) to support this research.
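
The search-before-generation loop described above can be illustrated with a minimal Python sketch. This is not Gen-Searcher's code: the function name, the `Evidence` class, and the four tool callables are hypothetical placeholders, and the reasoning step is reduced to concatenating evidence, where a real agent would use a model.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Evidence:
    """Text snippets and reference-image URLs gathered before generation."""
    snippets: List[str] = field(default_factory=list)
    image_refs: List[str] = field(default_factory=list)


def run_search_then_generate(
    prompt: str,
    web_search: Callable[[str], List[str]],    # hypothetical: query -> page URLs
    browse: Callable[[str], str],              # hypothetical: URL -> page text
    find_images: Callable[[str], List[str]],   # hypothetical: query -> image URLs
    generate: Callable[[str, Evidence], str],  # hypothetical: returns an image path
    max_pages: int = 3,
) -> str:
    """Illustrative loop: search, browse, gather references, then generate."""
    evidence = Evidence()

    # 1. Web search: collect candidate pages for the prompt.
    urls = web_search(prompt)[:max_pages]

    # 2. Evidence browsing: read each page and keep its text.
    for url in urls:
        evidence.snippets.append(browse(url))

    # 3. Multi-source reasoning, simplified here to merging the snippets
    #    into one grounded prompt (an assumption made for brevity).
    grounded_prompt = prompt + "\n\nEvidence:\n" + "\n".join(evidence.snippets)

    # 4. Visual reference retrieval.
    evidence.image_refs = find_images(prompt)

    # 5. Image synthesis, conditioned on everything gathered above.
    return generate(grounded_prompt, evidence)
```

Passing the tools in as callables keeps the sketch independent of any particular search or image-generation backend, so stub functions can stand in for real services when experimenting.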

Quick Start & Requirements

Primary installation involves cloning the repository and setting up two distinct Conda environments for SFT and RL training, each requiring specific pip installations for libraries like LLaMA-Factory, rllm, and vllm. Key prerequisites include Python 3.11, substantial GPU resources (minimum 8x 80GB for SFT, 4x 80GB for RL), and API keys for services like Serper and Jina. Official project pages, paper, models, and datasets are available on Hug
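
As a rough illustration of the prerequisites listed above, the following Python sketch checks them before a training run. It is an assumption-laden helper, not part of the repository: the function name and the environment variable names `SERPER_API_KEY` and `JINA_API_KEY` are hypothetical, the GPU count must be supplied by the caller, and per-GPU memory (80GB) is not verified.

```python
import os
import sys
from typing import List, Mapping, Sequence

REQUIRED_PYTHON = (3, 11)          # from the summary: Python 3.11
MIN_GPUS = {"sft": 8, "rl": 4}     # from the summary: 8x for SFT, 4x for RL
# Assumed variable names; check the repository for the actual ones.
API_KEY_VARS = ("SERPER_API_KEY", "JINA_API_KEY")


def preflight(
    stage: str,
    python_version: Sequence[int],
    gpu_count: int,
    env: Mapping[str, str],
) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    if stage not in MIN_GPUS:
        raise ValueError(f"unknown stage: {stage!r}")

    problems: List[str] = []

    # Compare only major.minor against the required interpreter version.
    if tuple(python_version[:2]) != REQUIRED_PYTHON:
        problems.append("Python 3.11 is required")

    # GPU count only; memory per device is not checked in this sketch.
    if gpu_count < MIN_GPUS[stage]:
        problems.append(f"{stage} needs at least {MIN_GPUS[stage]} GPUs")

    # Each search/browse service needs a non-empty API key.
    for var in API_KEY_VARS:
        if not env.get(var):
            problems.append(f"missing environment variable {var}")

    return problems
```

A typical call would be `preflight("sft", sys.version_info, gpu_count, os.environ)`, with `gpu_count` obtained from whatever GPU library is installed in the active environment.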

Health Check

Last Commit: 5 days ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 1
Star History: 266 stars in the last 18 days
