knowledge-gpt by geeks-of-data

Knowledge extraction tool using GPT models

Created 2 years ago

290 stars

Top 91.0% on SourcePulse

Project Summary

This project provides a framework for extracting knowledge from sources such as websites, PDF, PPTX, and DOCX files, and YouTube content, and for running Q&A sessions over that material with large language models such as GPT. It is aimed at developers and researchers building applications that combine contextual information retrieval with text generation.

How It Works

The core mechanism transforms text from each source into fixed-size vector embeddings, using either OpenAI or open-source models. When a query arrives, it is vectorized the same way and compared against the stored embeddings to find the most relevant passages. Those passages are then placed into a prompt as context, so the language model answers from the retrieved material rather than from its training data alone. The approach supports multiple data types and extraction methods, including speech-to-text for YouTube audio.
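The embed, retrieve, and prompt steps described above can be sketched in a few lines of Python. This is an illustrative sketch only, not knowledgegpt's actual API: the function names (embed, retrieve, build_prompt), the dimension of 256, and the hashed bag-of-words "embedding" are all assumptions made here so the example runs with the standard library alone. A real setup would call an OpenAI or Hugging Face embedding model in place of embed and send the resulting prompt to a language model.

```python
import math
import re
import zlib

DIM = 256  # fixed embedding size; an arbitrary choice for this sketch


def embed(text: str, dim: int = DIM) -> list[float]:
    """Toy stand-in for an embedding model (assumption, not the project's code).

    Hashes each lowercase token into one of `dim` buckets, then L2-normalizes.
    """
    vec = [0.0] * dim
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; inputs are already unit-length, so a dot product suffices."""
    return sum(x * y for x, y in zip(a, b))


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[tuple[float, str]]:
    """Return the k chunks most similar to the query, best first."""
    q = embed(query)
    scored = [(cosine(q, embed(chunk)), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def build_prompt(query: str, context: list[str]) -> str:
    """Assemble retrieved context and the question into a single prompt string."""
    joined = "\n".join(f"- {c}" for c in context)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{joined}\n\n"
        f"Question: {query}\nAnswer:"
    )


if __name__ == "__main__":
    chunks = [
        "The Eiffel Tower is in Paris.",
        "Python is a programming language.",
        "Whisper converts speech to text.",
    ]
    question = "Where is the Eiffel Tower?"
    top = retrieve(question, chunks, k=1)
    print(build_prompt(question, [chunk for _, chunk in top]))
```

In practice the chunk embeddings would be computed once and stored, so that only the query needs to be embedded at question time.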

Quick Start & Requirements

Install via pip: pip install knowledgegpt
Requires an OpenAI API key (set in example_config.py).
Download spaCy model: python3 -m spacy download en_core_web_sm
For API server: uvicorn server:app --reload
Docker: docker build -t knowledgegptimage . (builds the image), then docker run -p 8888:8888 knowledgegptimage (runs it, publishing port 8888)
Official PyPI: https://pypi.org/project/knowledgegpt/

Highlighted Details

Supports extraction from websites, PDFs, DOCX, PPTX, and YouTube (audio/transcripts).
Integrates with OpenAI's GPT models for answer generation.
Offers flexibility in choosing embedding models (OpenAI or Hugging Face).
Includes a RESTful API for server deployment.

Maintenance & Community

The project is open-source, encouraging contributions via pull requests. Further community engagement details (e.g., Discord/Slack) are not explicitly mentioned in the README.

Licensing & Compatibility

The README does not specify a license. Users should verify licensing for commercial use or integration into closed-source projects.

Limitations & Caveats

The project is under active development, and its README lists several "TODO" items, including integration with vector databases (Pinecone, Milvus, Qdrant) and a web interface. Support for audio files larger than 25 MB and more advanced web scraping are also listed as future work.
