code_qa  by sankalp1999

Explore codebases with natural language RAG

Created 1 year ago
258 stars

Top 98.1% on SourcePulse

Project Summary

Summary

sankalp1999/code_qa is a RAG-powered system for querying codebases in natural language. It targets developers and researchers who need to understand complex code, providing contextual answers and interactive chat. It uses Tree-sitter for AST parsing and LanceDB for vector storage.

How It Works

The system parses codebases into abstract syntax trees (ASTs) using Tree-sitter, then indexes code chunks with OpenAI or Jina embeddings stored in LanceDB. A natural language query retrieves relevant code snippets via vector search, and an LLM such as GPT-4o generates a contextual answer from them. An optional ColBERT-based reranker can improve the relevance of the retrieved snippets. The result is semantic, rather than keyword-based, code exploration.
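The parse, index, and retrieve flow described above can be sketched in a few lines of Python. This is a toy illustration under loud assumptions, not the project's code: the standard-library `ast` module stands in for Tree-sitter, a bag-of-words counter stands in for OpenAI or Jina embeddings, and a plain list stands in for a LanceDB table. The names `chunk_functions`, `embed`, `cosine`, and `search` are hypothetical.

```python
import ast
import math
import re
from collections import Counter

# A tiny "codebase" to index; the functions are never executed.
SOURCE = '''
def load_config(path):
    """Read settings from a config file."""
    return open(path).read()

def connect_redis(host, port):
    """Open a connection to the redis cache server."""
    return (host, port)
'''

def chunk_functions(source):
    """One chunk per top-level function (ast is a stand-in for Tree-sitter)."""
    tree = ast.parse(source)
    return [
        {"name": node.name, "text": ast.get_source_segment(source, node)}
        for node in tree.body
        if isinstance(node, ast.FunctionDef)
    ]

def embed(text):
    """Toy embedding: lowercase word counts (a real system calls an embedding model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def search(query, index, k=1):
    """Return the k chunks whose vectors are most similar to the query vector."""
    q = embed(query)
    ranked = sorted(index, key=lambda row: cosine(q, row["vector"]), reverse=True)
    return ranked[:k]

# Index step: attach a vector to every chunk (a list stands in for the vector store).
index = [{**chunk, "vector": embed(chunk["text"])} for chunk in chunk_functions(SOURCE)]

# Query step: the retrieved chunk would then be passed to an LLM as context.
top = search("how do we connect to the redis cache", index)
print(top[0]["name"])
```

In a real pipeline the retrieved chunk text, not just its name, would be placed in the LLM prompt alongside the user's question.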

Quick Start & Requirements

  • Install: Clone repo, set up Python 3.6+ venv, pip install -r requirements.txt, run redis-server.
  • Prerequisites: Python 3.6+, Redis server on localhost:6379.
  • Configuration: Create .env with OPENAI_API_KEY (required) and optional JINA_API_KEY.
  • Usage: Index code with ./index_codebase.sh <path>, run server with python app.py <folder_path>, access UI at http://localhost:5001.
  • Docs/Demo: Blog posts detailing the build process are linked in the README.
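Before indexing, it can help to confirm that the prerequisites listed above are in place. The following is a minimal preflight sketch, assuming a simple KEY=VALUE `.env` layout; the helper names `parse_env`, `missing_required`, and `redis_reachable` are hypothetical and are not part of the repository.

```python
import socket

def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values

def missing_required(env, required=("OPENAI_API_KEY",)):
    """List required keys that are absent or empty."""
    return [key for key in required if not env.get(key)]

def redis_reachable(host="localhost", port=6379, timeout=1.0):
    """True if something accepts TCP connections on the Redis port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Placeholder contents; a real check would read the .env file from disk.
sample = "# placeholder values\nOPENAI_API_KEY=sk-placeholder\nJINA_API_KEY=\n"
env = parse_env(sample)
print("missing:", missing_required(env))
```

Calling `redis_reachable()` before starting the server only confirms that the port accepts connections; it does not verify that the listener is actually Redis.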

Highlighted Details

  • An optimized branch (feature/optimization) is reported to be 2.5x faster (10-20s worst case) through reduced HyDE token limits and enhanced context processing with SambaNova Llama 3.1 models.
  • Supports Python, Rust, JavaScript, and Java codebases.
  • Uses Tree-sitter for language-agnostic AST parsing.
  • Integrates LanceDB for vector database storage and retrieval.
  • Employs OpenAI GPT-4o-mini/GPT-4o for chat and Answerdotai's colbert-small-v1 for reranking.
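ColBERT-style rerankers generally score candidates by late interaction: each query token embedding is matched to its most similar document token embedding, and those maxima are summed (often called MaxSim). The sketch below shows that scoring rule with hand-made 2-D vectors instead of real model output. The function names and file names are hypothetical, and this is not how the project invokes its reranker.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))

def maxsim(query_vecs, doc_vecs):
    """Late-interaction score: for each query token, take its best-matching doc token, then sum."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

def rerank(query_vecs, candidates):
    """Order (name, token_vectors) candidates by descending MaxSim score."""
    return sorted(candidates, key=lambda c: maxsim(query_vecs, c[1]), reverse=True)

# Hand-made unit vectors, purely illustrative (assumption: one vector per token).
query = [(1.0, 0.0), (0.0, 1.0)]
candidates = [
    ("parser.py", [(1.0, 0.0), (1.0, 0.0)]),   # matches only the first query token
    ("indexer.py", [(1.0, 0.0), (0.0, 1.0)]),  # matches both query tokens
]

ranked = rerank(query, candidates)
print([name for name, _ in ranked])
```

Because every query token contributes its own best match, a candidate that covers all parts of the query outranks one that matches a single term strongly.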

Maintenance & Community

The README does not provide specific details on maintainers, community channels, or a project roadmap.

Licensing & Compatibility

Licensed under the MIT License, which permits broad use and modification.

Limitations & Caveats

The claimed 2.5x speedup applies to the feature/optimization branch, so performance on the primary branch may differ. Performance also depends on the specific LLM configuration and on API availability. Core functionality requires a local Redis instance and an OpenAI API key.

Health Check

  • Last Commit: 11 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 4 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Vasek Mlejnsky (cofounder of E2B).

super-rag by superagent-ai

0%
384
RAG pipeline for AI apps
Created 1 year ago
Updated 1 year ago