CPU inference for document Q&A using open-source LLMs
This project provides a guide and implementation for running open-source Large Language Models (LLMs), specifically Llama 2, on CPU for document question answering (Q&A). It targets developers and researchers who want private, self-managed LLM deployments for local document analysis without the cost of GPU hardware.
How It Works
The project combines several technologies for efficient CPU-based LLM inference. It uses a GGML-quantized model, Llama-2-7B-Chat, optimized for CPU execution. LangChain serves as the orchestration framework and integrates C Transformers, which provides Python bindings to the GGML C/C++ library. For document processing, Sentence-Transformers (all-MiniLM-L6-v2) creates vector embeddings of the documents and FAISS performs efficient similarity search, enabling semantic retrieval for Q&A.
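As a rough illustration of the document-processing side of this stack, the sketch below embeds local documents with all-MiniLM-L6-v2 and stores them in a FAISS index using LangChain. The data/ and vectorstore/ paths, the PDF loader, and the chunking parameters are illustrative assumptions, not details taken from the project.

```python
# Illustrative ingestion sketch (paths, loader, and parameters are assumptions).
from langchain.document_loaders import DirectoryLoader, PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

# Load PDFs from a hypothetical data/ folder.
loader = DirectoryLoader("data/", glob="*.pdf", loader_cls=PyPDFLoader)
documents = loader.load()

# Split documents into chunks suitable for retrieval.
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(documents)

# Embed chunks with Sentence-Transformers and index them with FAISS.
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)
vectorstore = FAISS.from_documents(chunks, embeddings)
vectorstore.save_local("vectorstore/db_faiss")
```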
Quick Start & Requirements
Place the quantized Llama-2-7B-Chat GGML model in the models/ folder, then run from the project directory: poetry run python main.py "<user query>".
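For orientation, here is a minimal sketch of what such a query run might look like internally, assuming a GGML model file in models/ and a previously built FAISS index under vectorstore/db_faiss; the model filename, paths, and generation settings are assumptions, not the project's actual code.

```python
# Illustrative query sketch (model filename, paths, and config are assumptions).
import sys

from langchain.chains import RetrievalQA
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.llms import CTransformers
from langchain.vectorstores import FAISS

query = sys.argv[1]  # e.g. poetry run python main.py "<user query>"

# Load the CPU-friendly GGML model through the C Transformers bindings.
llm = CTransformers(
    model="models/llama-2-7b-chat.ggmlv3.q8_0.bin",  # assumed filename
    model_type="llama",
    config={"max_new_tokens": 256, "temperature": 0.01},
)

# Reload the FAISS index built from the documents.
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)
vectorstore = FAISS.load_local("vectorstore/db_faiss", embeddings)

# Retrieval-augmented Q&A over the indexed documents.
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 2}),
    return_source_documents=True,
)
result = qa({"query": query})
print(result["result"])
```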
Highlighted Details
Maintenance & Community
No specific contributors, sponsorships, or community links (Discord/Slack) are mentioned in the README.
Licensing & Compatibility
The README does not explicitly state the license for the project's code or the included models. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
The project focuses on a specific quantized model (Llama-2-7B-Chat-GGML) and may not be directly compatible with other LLM architectures or quantization formats without modification. Performance will be heavily dependent on the user's CPU capabilities.