talk2arxiv  by evanhu1

RAG system for ArXiv paper PDFs

created 1 year ago
527 stars

Top 60.8% on sourcepulse

Project Summary

Talk2Arxiv provides a Retrieval-Augmented Generation (RAG) system for interacting with academic papers hosted on ArXiv. It allows users to prepend "talk2" to any ArXiv PDF link to access a chat interface powered by ChatGPT, enabling them to ask questions about the paper's content. This is beneficial for researchers and students seeking to quickly understand or query complex academic documents.
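The link rewrite described above can be sketched in a few lines of Python. This is only an illustration: the helper name to_talk2arxiv is hypothetical, the paper ID in the example is arbitrary, and the sketch assumes that prepending "talk2" means changing the host arxiv.org to talk2arxiv.org (consistent with the demo address) while leaving the rest of the link unchanged.

```python
from urllib.parse import urlsplit, urlunsplit


def to_talk2arxiv(url: str) -> str:
    """Return the talk2arxiv link for an arXiv link (hypothetical helper)."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    # Drop an optional "www." so both host forms map to the same result.
    if host.startswith("www."):
        host = host[4:]
    if host != "arxiv.org":
        raise ValueError(f"not an arXiv link: {url}")
    # Assumption: "prepend talk2" means arxiv.org -> talk2arxiv.org.
    return urlunsplit(
        (parts.scheme, "talk2" + host, parts.path, parts.query, parts.fragment)
    )


if __name__ == "__main__":
    print(to_talk2arxiv("https://arxiv.org/pdf/1706.03762"))
```

Only the host changes; the path, query string, and fragment pass through untouched.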

How It Works

The system uses GROBID for PDF text extraction, followed by a custom chunking algorithm that prioritizes logical sections (abstract, intro, etc.) and employs recursive subdivision. Text is embedded using Cohere's EmbedV3 model and stored in Qdrant for efficient retrieval and caching. A reranking step filters the retrieved chunks for contextual relevance before they are passed to the language model. The frontend is built with TypeScript, React, Tailwind CSS, and Next.js, while the backend uses Flask, Gunicorn, and Nginx.
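The chunking idea (section boundaries first, then recursive subdivision of anything too long) could look roughly like the sketch below. It is not the project's actual algorithm: the (title, text) input stands in for whatever GROBID returns, and the names split_recursive and chunk_paper, the separator order, and the max_chars limit are all assumptions made for illustration.

```python
def split_recursive(text, max_chars, separators=("\n\n", "\n", ". ", " ")):
    """Halve text at the separator nearest its midpoint until pieces fit."""
    if max_chars < 1:
        raise ValueError("max_chars must be at least 1")
    text = text.strip()
    if not text:
        return []
    if len(text) <= max_chars:
        return [text]
    mid = len(text) // 2
    for sep in separators:  # coarsest separator first
        left = text.rfind(sep, 0, mid)
        right = text.find(sep, mid)
        candidates = [i for i in (left, right) if i > 0]
        if candidates:
            # Cut just after the separator occurrence closest to the midpoint.
            cut = min(candidates, key=lambda i: abs(i - mid)) + len(sep)
            return (split_recursive(text[:cut], max_chars, separators)
                    + split_recursive(text[cut:], max_chars, separators))
    # No separator found at all: fall back to a hard cut at the midpoint.
    return (split_recursive(text[:mid], max_chars, separators)
            + split_recursive(text[mid:], max_chars, separators))


def chunk_paper(sections, max_chars=1000):
    """Chunk (title, text) pairs; chunks never span a section boundary."""
    chunks = []
    for title, text in sections:
        for piece in split_recursive(text, max_chars):
            chunks.append({"section": title, "text": piece})
    return chunks


if __name__ == "__main__":
    paper = [
        ("Abstract", "Short abstract."),
        ("Introduction", "Sentence one is here. " * 40),
    ]
    for chunk in chunk_paper(paper, max_chars=200):
        print(chunk["section"], len(chunk["text"]))
```

Keeping the section title on every chunk is a design choice of this sketch; it lets a later retrieval step show which part of the paper an answer came from.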

Quick Start & Requirements

  • Install dependencies with yarn, then start the development server with yarn run dev.
  • Requires Node.js and Python environments.
  • The README mentions no specific hardware or GPU requirements.
  • Official demo: https://talk2arxiv.org/

Highlighted Details

  • GROBID for robust PDF parsing.
  • Custom chunking combining logical sections and recursive subdivision.
  • Cohere EmbedV3 for high-quality text embeddings.
  • Qdrant vector database for efficient caching and retrieval.
  • Reranking mechanism for enhanced contextual relevance.
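The retrieve-then-rerank flow listed above can be illustrated with a toy, dependency-free sketch. Nothing here calls Cohere or Qdrant: embed is a bag-of-words stand-in for an embedding model, the in-memory list stands in for the vector database, rerank_score is a simple word-overlap stand-in for a reranker, and all function names and the top_k/top_n parameters are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def rerank_score(query, chunk):
    """Toy stand-in for a reranker: share of query words found in the chunk."""
    q = set(query.lower().split())
    return len(q & set(chunk.lower().split())) / len(q) if q else 0.0


def retrieve(query, chunks, top_k=3, top_n=1):
    q_vec = embed(query)
    # Stage 1: rank every chunk by vector similarity and keep top_k candidates.
    candidates = sorted(
        chunks, key=lambda c: cosine(q_vec, embed(c)), reverse=True
    )[:top_k]
    # Stage 2: re-score only those candidates and keep the best top_n.
    return sorted(
        candidates, key=lambda c: rerank_score(query, c), reverse=True
    )[:top_n]


if __name__ == "__main__":
    chunks = [
        "the transformer uses self attention layers",
        "training used eight gpus for twelve hours",
        "attention attention attention is repeated here",
        "results on translation benchmarks improve",
    ]
    print(retrieve("how does self attention work", chunks))
```

In this example the third chunk has the highest cosine similarity because it repeats one query word, but the second stage prefers the first chunk, which covers more of the query. That reordering is what a reranking step is meant to provide.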

Maintenance & Community

  • The project is maintained by evanhu1.
  • Roadmap includes improved chunking, LaTeX source extraction, and visual LLM integration.
  • Community links (Discord/Slack, social media) are not provided in the README.

Licensing & Compatibility

  • The README does not explicitly state a license.
  • Without a stated license, suitability for commercial use or closed-source linking is undetermined.

Limitations & Caveats

The backend is not built for scale: it handles requests on a single thread and may stall when many requests arrive concurrently. The roadmap lists planned improvements to the handling of symbolic math and non-standard text elements.
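For readers who self-host the backend, one common way to ease single-threaded stalls in a Flask and Gunicorn deployment is to run several workers and threads. The file below is a hypothetical gunicorn.conf.py, not something taken from the talk2arxiv repository; the bind address and the specific values are assumptions, while bind, workers, threads, and timeout are standard Gunicorn settings.

```python
# gunicorn.conf.py (hypothetical; not part of the talk2arxiv repository)
import multiprocessing

# Assumed local address that Nginx would proxy requests to.
bind = "127.0.0.1:8000"

# Rule of thumb from the Gunicorn documentation: (2 x cores) + 1 workers.
workers = multiprocessing.cpu_count() * 2 + 1

# Threads per worker; a value above 1 makes Gunicorn use the gthread worker.
threads = 4

# Seconds a worker may stay silent before it is restarted. The value is an
# assumption chosen to leave room for slow language-model calls.
timeout = 120
```

It would be loaded with gunicorn -c gunicorn.conf.py app:app, where app:app is a placeholder for the actual Flask module and application object.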

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 week
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 3 stars in the last 90 days
