talk2arxiv by evanhu1

RAG system for ArXiv paper PDFs

Created 2 years ago

528 stars

Top 59.8% on SourcePulse

1 Expert Loves This Project

Andre Zayarni

Cofounder of Qdrant

Project Summary

Talk2Arxiv is a Retrieval-Augmented Generation (RAG) system for querying academic papers hosted on ArXiv. Users prepend "talk2" to the domain of an ArXiv paper link (arxiv.org becomes talk2arxiv.org) to open a chat interface powered by ChatGPT and ask questions about the paper's content. It is aimed at researchers and students who want to understand or query complex academic documents quickly.
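
The link convention can be sketched as a small helper. This is an illustration only, not code from the project: the function name to_talk2_url is hypothetical, and it assumes the "talk2" prefix applies to the host name, which matches the demo address talk2arxiv.org. Which ArXiv paths the site actually accepts is not stated in the summary.

```python
from urllib.parse import urlsplit, urlunsplit


def to_talk2_url(arxiv_url: str) -> str:
    """Rewrite an arxiv.org URL so that its host becomes talk2arxiv.org.

    Hypothetical helper; the path, query, and fragment are kept unchanged.
    """
    parts = urlsplit(arxiv_url)
    host = parts.netloc.lower()
    # Accept the bare and "www." forms of the ArXiv host; reject anything else.
    if host not in ("arxiv.org", "www.arxiv.org"):
        raise ValueError(f"not an arxiv.org URL: {arxiv_url!r}")
    return urlunsplit(
        (parts.scheme, "talk2arxiv.org", parts.path, parts.query, parts.fragment)
    )
```

For example, passing https://arxiv.org/pdf/1706.03762 yields https://talk2arxiv.org/pdf/1706.03762.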

How It Works

The system uses GROBID to extract text from the PDF, then applies a custom chunking algorithm that splits along logical sections first (abstract, introduction, etc.) and recursively subdivides sections that are still too long. Chunks are embedded with Cohere's EmbedV3 model and stored in Qdrant, which serves both retrieval and caching. Retrieved chunks are reranked for relevance before being passed to the language model. The frontend is built with TypeScript, React, TailwindCSS, and Next.js; the backend uses Flask, Gunicorn, and Nginx.
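
A minimal sketch of section-first chunking with recursive subdivision follows. It is not the project's algorithm: the names split_recursive and chunk_paper, the max_chars limit, and the choice of separators are all assumptions, and the input is assumed to be a mapping from section title to section text already extracted from the PDF.

```python
from typing import Dict, List


def split_recursive(text: str, max_chars: int) -> List[str]:
    """Halve text at natural boundaries until every piece fits in max_chars."""
    if max_chars < 1:
        raise ValueError("max_chars must be at least 1")
    text = text.strip()
    if not text:
        return []
    if len(text) <= max_chars:
        return [text]
    mid = len(text) // 2
    # Prefer a paragraph break, then a sentence end, then a space,
    # taking the last occurrence that ends at or before the midpoint.
    for sep in ("\n\n", ". ", " "):
        cut = text.rfind(sep, 0, mid + 1)
        if cut > 0:
            cut += len(sep)
            break
    else:
        cut = mid  # no separator found: hard cut at the midpoint
    return split_recursive(text[:cut], max_chars) + split_recursive(text[cut:], max_chars)


def chunk_paper(sections: Dict[str, str], max_chars: int = 1000) -> List[dict]:
    """Chunk each logical section separately so no chunk spans two sections."""
    chunks = []
    for title, body in sections.items():
        for i, piece in enumerate(split_recursive(body, max_chars)):
            chunks.append({"section": title, "index": i, "text": piece})
    return chunks
```

Keeping the section title on every chunk means a short abstract stays a single chunk, while a long introduction is divided without ever mixing text from two sections.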

Quick Start & Requirements

Install dependencies with yarn, then start the development server with yarn run dev.
Requires Node.js and Python environments.
The README mentions no specific hardware or GPU requirements.
Official demo: https://talk2arxiv.org/

Highlighted Details

GROBID for robust PDF parsing.
Custom chunking combining logical sections and recursive subdivision.
Cohere EmbedV3 for high-quality text embeddings.
Qdrant vector database for efficient caching and retrieval.
Reranking mechanism for enhanced contextual relevance.
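
The retrieve-then-rerank pattern named in this list can be illustrated without any external service. In the sketch below every function is a toy stand-in and an assumption: embed uses bag-of-words counts in place of a real embedding model, retrieve scans a Python list in place of a vector database query, and rerank scores query-term coverage in place of a real reranking model. None of it reflects the Cohere or Qdrant APIs.

```python
import math
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: bag-of-words term counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: List[str], top_k: int) -> List[str]:
    """First stage: cheap similarity search over every stored chunk."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]


def rerank(query: str, candidates: List[str], top_n: int) -> List[str]:
    """Second stage: rescore only the shortlisted candidates.

    Toy score: the fraction of distinct query terms present in the chunk.
    """
    terms = set(query.lower().split())

    def coverage(chunk: str) -> float:
        return len(terms & set(chunk.lower().split())) / len(terms) if terms else 0.0

    return sorted(candidates, key=coverage, reverse=True)[:top_n]


def build_context(query: str, chunks: List[str], top_k: int = 5, top_n: int = 2) -> List[str]:
    """Retrieve a broad shortlist, then keep only the best reranked chunks."""
    return rerank(query, retrieve(query, chunks, top_k), top_n)
```

The two-stage design keeps the expensive scoring step off the full corpus: the first stage narrows many chunks to a shortlist, and only that shortlist is rescored before being placed in the prompt.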

Maintenance & Community

The project is maintained by evanhu1.
Roadmap includes improved chunking, LaTeX source extraction, and visual LLM integration.
Community links (Discord/Slack, social media) are not provided in the README.

Licensing & Compatibility

The README does not explicitly state a license.
Compatibility for commercial use or closed-source linking is undetermined.

Limitations & Caveats

The backend is not built for scale and may stall under many concurrent requests because it handles them on a single thread. The roadmap also points to planned improvements for symbolic math and non-standard text elements, which suggests the current pipeline handles them poorly.

Health Check

Last Commit

2 years ago

Responsiveness

Inactive

Star History

0 stars in the last 30 days