DataChad by gustavz

App for Q&A using LLMs and vector DBs

created 2 years ago
319 stars

Top 86.1% on sourcepulse

Project Summary

DataChad is a Python application for querying diverse data sources in natural language. It targets users who need to extract information from documents, URLs, or file paths, and provides a conversational interface powered by LLMs and LangChain. The primary benefit is that users can interact with their data through simple questions, without handling data retrieval and processing themselves.

How It Works

DataChad loads the data, splits it into text chunks, and generates embeddings with OpenAI or Hugging Face models. The embeddings are stored in Activeloop's database hub. A LangChain chain is then built from a configurable LLM (gpt-3.5-turbo by default), multiple vector stores serving as knowledge bases, and a dedicated "smart FAQ" vector store. Each user query is embedded and used for a similarity search across the vector stores; the most relevant results are passed to the LLM as context for its answer. Chat history is cached locally so that the conversation persists.
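The retrieval flow described above can be sketched in plain Python. This is a minimal illustration, not DataChad's actual code: the bag-of-words `embed` function is a toy stand-in for an OpenAI or Hugging Face embedding model, the in-memory `stores` dictionary stands in for the hosted vector stores, and all function names are hypothetical.

```python
import math
from collections import Counter


def split_text(text: str, chunk_size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into overlapping character windows."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(stores: dict[str, list[str]], query: str, k: int = 2) -> list[tuple[float, str, str]]:
    """Rank chunks from every store by similarity to the query; return the top k."""
    q = embed(query)
    scored = [
        (cosine(q, embed(chunk)), name, chunk)
        for name, chunks in stores.items()
        for chunk in chunks
    ]
    return sorted(scored, key=lambda item: item[0], reverse=True)[:k]


def build_prompt(question: str, hits: list[tuple[float, str, str]]) -> str:
    """Assemble the retrieved chunks into context for the LLM call."""
    context = "\n".join(f"[{name}] {chunk}" for _, name, chunk in hits)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

In a real deployment the prompt would be sent to the configured LLM; here the sketch stops at prompt construction so that it runs without any API key.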

Quick Start & Requirements

  • Install via pip install datachad (or clone and run).
  • Requires Python >= 3.10.
  • Configuration involves creating a .env file with credentials (OpenAI API key, Activeloop API key) or setting environment variables.
  • Official documentation and demo links are not explicitly provided in the README.
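As a rough sketch of the configuration step, the snippet below reads credentials from environment variables and falls back to a .env file. The variable names `OPENAI_API_KEY` and `ACTIVELOOP_TOKEN` are assumptions (the summary above does not name them), and the small parser is a hypothetical stand-in for whatever loader the project actually uses.

```python
import os
from pathlib import Path

# Assumed variable names; check the project's README for the real ones.
REQUIRED_KEYS = ("OPENAI_API_KEY", "ACTIVELOOP_TOKEN")


def parse_env_file(path: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: dict[str, str] = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for line in env_path.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_credentials(env_file: str = ".env", environ=None) -> dict[str, str]:
    """Prefer real environment variables, then fall back to the .env file."""
    environ = os.environ if environ is None else environ
    file_values = parse_env_file(env_file)
    creds: dict[str, str] = {}
    missing: list[str] = []
    for key in REQUIRED_KEYS:
        value = environ.get(key) or file_values.get(key)
        if value:
            creds[key] = value
        else:
            missing.append(key)
    if missing:
        raise RuntimeError(f"missing credentials: {', '.join(missing)}")
    return creds
```

Failing early with a list of the missing keys tends to be friendlier than letting the first API call fail with an authentication error.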

Highlighted Details

  • Supports multiple file types and formats within knowledge bases.
  • Offers "Smart FAQs" as curated Q&A lists.
  • Utilizes Activeloop's database hub for vector storage.
  • Caches chat history locally for conversational continuity.
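One way local chat-history caching could work is shown below. This is only a sketch under assumptions: the summary does not say how DataChad stores its history, so the JSON file format and the `ChatHistory` class are hypothetical.

```python
import json
from pathlib import Path


class ChatHistory:
    """Keep question/answer turns in memory and mirror them to a JSON file."""

    def __init__(self, path: str):
        self.path = Path(path)
        # Reload earlier turns if a cache file already exists.
        self.turns = json.loads(self.path.read_text()) if self.path.exists() else []

    def add(self, question: str, answer: str) -> None:
        """Append a turn and persist the full history."""
        self.turns.append({"question": question, "answer": answer})
        self.path.write_text(json.dumps(self.turns, indent=2))

    def recent(self, n: int = 3) -> list[dict]:
        """Return the last n turns, e.g. to include in the next prompt."""
        return self.turns[-n:] if n > 0 else []
```

Because the history is written on every turn, a new `ChatHistory` pointed at the same file picks up where the previous session stopped.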

Maintenance & Community

The README keeps a public TODO list of planned features and refactors, including support for multiple models and embeddings, a local/private mode, streaming responses, and a decoupled UI. Contributions via Issues and Pull Requests are encouraged.

Licensing & Compatibility

The README does not specify a license. Compatibility for commercial use or closed-source linking is not detailed.

Limitations & Caveats

The application requires Python 3.10+ and relies on external API keys (OpenAI, Activeloop). Several advanced features, such as asynchronous I/O, FastAPI integration, and a separate frontend, are still on the TODO list, which suggests the current version may be closer to a proof of concept or internal tool. Downloaded files are written to regular storage rather than handled through tempfile, which may have implications for resource management.
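For illustration, the tempfile approach mentioned above might look like the following. This is a hedged sketch, not the project's code: `process_upload` and its `handler` callback are hypothetical names.

```python
import tempfile
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")


def process_upload(filename: str, data: bytes, handler: Callable[[Path], T]) -> T:
    """Write bytes to a temporary directory, run handler on the file, then clean up."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        # Keep only the base name so a crafted filename cannot escape tmp_dir.
        path = Path(tmp_dir) / Path(filename).name
        path.write_bytes(data)
        return handler(path)
    # The directory and everything in it are deleted when the with-block exits.
```

The handler must finish reading the file before it returns, since the path no longer exists afterwards.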

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star history: 1 star in the last 90 days
