RAG system for ArXiv paper PDFs
Talk2Arxiv provides a Retrieval-Augmented Generation (RAG) system for interacting with academic papers hosted on ArXiv. It allows users to prepend "talk2" to any ArXiv PDF link to access a chat interface powered by ChatGPT, enabling them to ask questions about the paper's content. This is beneficial for researchers and students seeking to quickly understand or query complex academic documents.
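The URL rewrite described above can be sketched in a few lines. This is only an illustration: it assumes that prepending "talk2" means prefixing the hostname (arxiv.org becomes talk2arxiv.org) and leaving the path untouched. The helper name to_talk2_url is hypothetical and not part of the project.

```python
from urllib.parse import urlsplit, urlunsplit


def to_talk2_url(arxiv_url: str) -> str:
    """Rewrite an arXiv link so that it points at the chat interface.

    Assumption: "prepend talk2" means prefixing the hostname, so
    arxiv.org becomes talk2arxiv.org and the path stays the same.
    """
    parts = urlsplit(arxiv_url)
    host = parts.netloc.lower()
    # Tolerate a leading "www." so both common link forms are accepted.
    if host.startswith("www."):
        host = host[len("www."):]
    if host != "arxiv.org":
        raise ValueError(f"not an arXiv link: {arxiv_url!r}")
    return urlunsplit(parts._replace(netloc="talk2" + host))


if __name__ == "__main__":
    print(to_talk2_url("https://arxiv.org/pdf/1706.03762.pdf"))
```

In practice a user would make the same edit by hand in the browser's address bar; the function only makes the rule explicit.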
How It Works
The system uses GROBID to extract text from the PDF, then applies a custom chunking algorithm that splits along logical sections first (abstract, introduction, etc.) and recursively subdivides sections that are still too long. Chunks are embedded with Cohere's Embed v3 model and stored in Qdrant, which serves both retrieval and caching. Retrieved chunks are reranked for relevance before being passed to the language model. The frontend is built with TypeScript, React, TailwindCSS, and Next.js; the backend uses Flask, Gunicorn, and Nginx.
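The chunking step can be illustrated with a minimal sketch. It is not the project's actual algorithm: it assumes the extracted paper is already available as a mapping from section name to text, that a chunk is limited by character count, and that an oversized section is halved at the paragraph, sentence, or word boundary nearest its midpoint. The names split_recursively, chunk_paper, and max_chars are hypothetical.

```python
from typing import Dict, List


def split_recursively(text: str, max_chars: int) -> List[str]:
    """Halve text at a natural boundary until every piece fits max_chars."""
    if max_chars < 1:
        raise ValueError("max_chars must be at least 1")
    text = text.strip()
    if not text:
        return []
    if len(text) <= max_chars:
        return [text]
    mid = len(text) // 2
    cut = mid  # fallback: hard split at the midpoint
    # Prefer a paragraph break, then a sentence end, then a space,
    # each searched for in the first half of the text.
    for sep in ("\n\n", ". ", " "):
        pos = text.rfind(sep, 0, mid + 1)
        if pos > 0:
            cut = pos + len(sep)
            break
    return (split_recursively(text[:cut], max_chars)
            + split_recursively(text[cut:], max_chars))


def chunk_paper(sections: Dict[str, str], max_chars: int = 1000) -> List[dict]:
    """Chunk along section boundaries first, subdividing only when needed."""
    chunks = []
    for name, body in sections.items():
        for piece in split_recursively(body, max_chars):
            # Keep the section label so retrieval can report where text came from.
            chunks.append({"section": name, "text": piece})
    return chunks


if __name__ == "__main__":
    paper = {
        "abstract": "We study retrieval for scientific papers.",
        "introduction": "Background sentence. " * 40,
    }
    for chunk in chunk_paper(paper, max_chars=200):
        print(chunk["section"], len(chunk["text"]))
```

Splitting on section boundaries first means a short abstract stays intact as one chunk, while a long introduction is broken into pieces that each remain under the size limit.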
Quick Start & Requirements
Run yarn and yarn run dev.
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The backend is not built for scale: it handles requests on a single thread and may stall under many concurrent requests. The roadmap lists planned improvements to the handling of symbolic math and non-standard text elements.