arxiv-paper-curator  by jamwithai

Build a production-grade RAG research assistant

Created 1 month ago
595 stars

Top 54.8% on SourcePulse

Project Summary

This project provides a comprehensive, hands-on course for building a production-grade Retrieval-Augmented Generation (RAG) system, specifically an AI research assistant that curates and answers questions about arXiv papers. It targets AI/ML engineers, software engineers, and data scientists looking to master end-to-end AI application development using industry best practices.

How It Works

The system uses a microservices architecture orchestrated with Docker Compose. Key components are FastAPI for the API, PostgreSQL for metadata storage, OpenSearch for hybrid search, Apache Airflow for workflow automation, and Ollama for local LLM serving. The data pipeline fetches papers from the arXiv API, parses the PDFs with Docling, and stores the extracted metadata and content. Later weeks of the course are slated to cover advanced RAG techniques such as hybrid search and context-aware chunking, as well as production deployment.

Quick Start & Requirements

  • Install/Run: Clone the repository and use docker compose up --build -d to start all services.
  • Prerequisites: Docker Desktop, Python 3.12+, the uv package manager, 8GB+ RAM, 20GB free disk space.
  • Resources: Links to official quick-start notebooks are provided for Week 1 and Week 2.

Highlighted Details

  • Full RAG system architecture with FastAPI, PostgreSQL, OpenSearch, Airflow, and Ollama.
  • Automated data ingestion pipeline for arXiv papers using rate-limited API calls and PDF parsing.
  • Focus on production-grade implementation and best practices.
  • Modular weekly learning path with corresponding code releases.
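The "rate-limited API calls" mentioned in the list above can be illustrated with a small minimum-interval limiter. This is a sketch under stated assumptions, not the project's implementation: the class name `MinIntervalLimiter` is hypothetical, and the three-second interval used in the usage note reflects arXiv's commonly cited guidance of roughly one request every three seconds, which should be checked against the current API terms. The clock and sleep functions are injectable so the behavior can be tested without real waiting.

```python
import time
from typing import Callable


class MinIntervalLimiter:
    """Ensure at least `min_interval` seconds pass between consecutive calls."""

    def __init__(
        self,
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: float | None = None  # time of the previous permitted call

    def wait(self) -> float:
        """Block until the next call is allowed; return the seconds slept."""
        delay = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            delay = max(0.0, self.min_interval - elapsed)
        if delay > 0:
            self._sleep(delay)
        self._last = self._clock()
        return delay
```

In use, a pipeline task would create one limiter, for example `MinIntervalLimiter(3.0)`, and call `wait()` immediately before each request to the arXiv API.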

Maintenance & Community

The project is developed by Jam With AI. Further community or roadmap details are not explicitly provided in the README.

Licensing & Compatibility

The project is released under the MIT License, permitting commercial use and closed-source linking.

Limitations & Caveats

The project is structured as a course, and weeks 3-6 are marked "Coming Soon," so the full RAG pipeline and deployment stages are not yet implemented.

Health Check

  • Last Commit: 6 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 1
  • Star History: 324 stars in the last 30 days
