arxiv-paper-curator by jamwithai

Build a production-grade RAG research assistant

Created 5 months ago

2,102 stars

Top 21.0% on SourcePulse

Project Summary

This project provides a comprehensive, hands-on course for building a production-grade Retrieval-Augmented Generation (RAG) system, specifically an AI research assistant that curates and answers questions about arXiv papers. It targets AI/ML engineers, software engineers, and data scientists looking to master end-to-end AI application development using industry best practices.

How It Works

The system follows a microservices architecture orchestrated with Docker Compose. Key components are FastAPI for the API, PostgreSQL for metadata storage, OpenSearch for hybrid search, Apache Airflow for workflow automation, and Ollama for local LLM serving. The data pipeline fetches papers from the arXiv API, parses the PDFs with Docling, and stores the extracted metadata and content. Later weeks of the course are slated to implement advanced RAG techniques such as hybrid search and context-aware chunking, along with production deployment.
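The fetch step of the data pipeline can be sketched roughly as follows. This is a minimal illustration, not the course's actual code: it assumes the public arXiv query endpoint and its Atom response format, and the names `Paper`, `build_query_url`, and `parse_feed` are hypothetical. It only builds the request URL and parses a feed that has already been downloaded, so it runs without network access.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Public arXiv API endpoint; responses are Atom XML feeds.
ARXIV_API = "http://export.arxiv.org/api/query"
ATOM = {"atom": "http://www.w3.org/2005/Atom"}


@dataclass
class Paper:
    """Hypothetical record for the metadata a pipeline might store."""
    arxiv_id: str
    title: str
    summary: str
    pdf_url: Optional[str]


def build_query_url(category: str, max_results: int = 10) -> str:
    """Build a query URL for the newest submissions in one arXiv category."""
    params = {
        "search_query": f"cat:{category}",
        "start": 0,
        "max_results": max_results,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
    }
    return f"{ARXIV_API}?{urlencode(params)}"


def _clean(text: str) -> str:
    """Collapse the line breaks and indentation arXiv leaves in text fields."""
    return " ".join(text.split())


def parse_feed(xml_text: str) -> list[Paper]:
    """Extract id, title, abstract, and PDF link from an Atom feed string."""
    root = ET.fromstring(xml_text)
    papers = []
    for entry in root.findall("atom:entry", ATOM):
        entry_id = entry.findtext("atom:id", default="", namespaces=ATOM)
        pdf_url = None
        for link in entry.findall("atom:link", ATOM):
            if link.get("title") == "pdf":
                pdf_url = link.get("href")
        papers.append(
            Paper(
                # The entry id is a URL ending in "/abs/<identifier>".
                arxiv_id=entry_id.rsplit("/abs/", 1)[-1],
                title=_clean(entry.findtext("atom:title", default="", namespaces=ATOM)),
                summary=_clean(entry.findtext("atom:summary", default="", namespaces=ATOM)),
                pdf_url=pdf_url,
            )
        )
    return papers
```

In a full pipeline, the URL from `build_query_url("cs.AI")` would be fetched with an HTTP client, the response body passed to `parse_feed`, and each `pdf_url` handed on to the PDF parsing stage.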

Quick Start & Requirements

Install/Run: Clone the repository, then run docker compose up --build -d to build the images and start all services in the background.
Prerequisites: Docker Desktop, Python 3.12+, UV package manager, 8GB+ RAM, 20GB free disk space.
Resources: Links to official quick-start notebooks are provided for Week 1 and Week 2.

Highlighted Details

Full RAG system architecture with FastAPI, PostgreSQL, OpenSearch, Airflow, and Ollama.
Automated data ingestion pipeline for arXiv papers using rate-limited API calls and PDF parsing.
Focus on production-grade implementation and best practices.
Modular weekly learning path with corresponding code releases.
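The rate-limited API calls mentioned above could be implemented in several ways; the sketch below shows one simple option, a minimum-interval limiter. The class name `MinIntervalLimiter` and its parameters are hypothetical and not taken from the project. The clock and sleep functions are injectable so the behavior can be checked without real waiting.

```python
import time
from typing import Callable, Optional


class MinIntervalLimiter:
    """Enforce a minimum gap between consecutive calls (illustrative only)."""

    def __init__(
        self,
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_call: Optional[float] = None

    def wait(self) -> float:
        """Block until min_interval has elapsed since the previous call.

        Returns the number of seconds slept (0.0 if no wait was needed).
        """
        delay = 0.0
        if self._last_call is not None:
            elapsed = self._clock() - self._last_call
            delay = max(0.0, self.min_interval - elapsed)
            if delay > 0:
                self._sleep(delay)
        # Record the time after any sleep so the next gap is measured from here.
        self._last_call = self._clock()
        return delay
```

A caller would create one limiter, for example `MinIntervalLimiter(3.0)`, and call `wait()` before each request. The three-second value here follows arXiv's published guidance for API clients; the interval the project actually uses is not stated in this summary.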

Maintenance & Community

The project is developed by Jam With AI. Further community or roadmap details are not explicitly provided in the README.

Licensing & Compatibility

The project is released under the MIT License, permitting commercial use and closed-source linking.

Limitations & Caveats

The project is structured as a course, and Weeks 3-6 are marked "Coming Soon," so the later stages of the RAG pipeline and the deployment material are not yet implemented.

