TinyLLM by jasonacox

Local LLM and chatbot setup for consumer hardware

created 1 year ago
273 stars

Top 95.3% on sourcepulse

Project Summary

This project provides a framework for setting up and running local Large Language Models (LLMs) on consumer-grade hardware, offering a ChatGPT-like web interface. It targets users who want to experiment with LLMs locally without requiring high-end GPUs, enabling features like web summarization and news aggregation.

How It Works

TinyLLM acts as an orchestrator, letting users choose from three popular LLM inference servers: Ollama, vLLM, or llama-cpp-python. Each of these servers exposes an OpenAI-compatible API, which the project's FastAPI-based chatbot consumes. The chatbot supports Retrieval Augmented Generation (RAG) features, enabling it to summarize URLs, fetch current news, retrieve stock prices, and report the weather.
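
Because all three backends expose the same OpenAI-compatible API, any standard OpenAI client can talk to the local server. The snippet below is a minimal sketch, not part of TinyLLM itself: the base URL, API key, and model name are assumptions to adjust for your own setup.

```python
# Minimal sketch: query a local OpenAI-compatible inference server
# (vLLM, llama-cpp-python, or Ollama's OpenAI endpoint) with the standard
# openai client. Host/port and model name are assumptions -- adjust them
# to match your server configuration.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed local server address
    api_key="not-needed",                 # local servers typically ignore the key
)

response = client.chat.completions.create(
    model="mistral",  # hypothetical model name; use whatever your server loaded
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize what TinyLLM does in one sentence."},
    ],
    max_tokens=128,
)

print(response.choices[0].message.content)
```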

Quick Start & Requirements

  • Install: Clone the repository (git clone https://github.com/jasonacox/TinyLLM.git).
  • Prerequisites: Python 3, CUDA 12.2 (for NVIDIA), 8GB+ RAM, 128GB+ SSD. Recommended GPU: NVIDIA GTX 1060 6GB or better, or Apple M1/M2.
  • Setup: Requires setting up an inference server (Ollama, vLLM, or llama-cpp-python) and then running the chatbot interface; a quick server check is sketched after this list.
  • Links: Research
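
Before launching the chatbot, it can help to verify that the inference server's OpenAI-compatible endpoint is reachable. The check below is a generic sketch, not a TinyLLM command; the address assumes a server on localhost:8000, which is common for llama-cpp-python and vLLM.

```python
# Minimal sketch: confirm a local OpenAI-compatible inference server is up
# before starting the chatbot. The URL is an assumption -- adjust it to the
# port your backend actually uses.
import requests

BASE_URL = "http://localhost:8000/v1"  # assumed server address

try:
    resp = requests.get(f"{BASE_URL}/models", timeout=5)
    resp.raise_for_status()
    models = [m["id"] for m in resp.json().get("data", [])]
    print("Server is up. Loaded models:", models or "(none reported)")
except requests.RequestException as exc:
    print("Server not reachable:", exc)
```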

Highlighted Details

  • Supports multiple LLM backends (Ollama, vLLM, llama-cpp-python) for flexibility.
  • Chatbot includes RAG features for summarizing URLs, news, stocks, and weather (the general pattern is sketched after this list).
  • Offers an OpenAI API-compatible web service for easy integration.
  • Provides detailed instructions for running various LLM models (Mistral, Llama-2, Mixtral, Phi-3) with different quantization levels.
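
As an illustration of the URL-summarization feature's general pattern (fetch a page, reduce it to text, ask the local model for a summary), here is a rough sketch. It is not TinyLLM's actual implementation; the server address, model name, and crude HTML stripping are assumptions for demonstration only.

```python
# Rough sketch of a URL-summarization (RAG) flow against a local
# OpenAI-compatible server. Not TinyLLM's actual code; server address and
# model name are assumptions.
import re
import requests
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def summarize_url(url: str) -> str:
    html = requests.get(url, timeout=10).text
    text = re.sub(r"<[^>]+>", " ", html)     # crude tag stripping for illustration
    text = re.sub(r"\s+", " ", text)[:4000]  # keep the prompt within a small context
    response = client.chat.completions.create(
        model="mistral",  # hypothetical model name
        messages=[
            {"role": "system", "content": "Summarize the provided web page text."},
            {"role": "user", "content": text},
        ],
        max_tokens=200,
    )
    return response.choices[0].message.content

print(summarize_url("https://example.com"))
```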

Maintenance & Community

  • The project is actively maintained by jasonacox.
  • References include popular LLM projects like llama.cpp, llama-cpp-python, and vLLM.

Licensing & Compatibility

  • The project itself does not explicitly state a license in the README.
  • LLM models listed have varying licenses (Apache 2.0, MIT, Meta). Compatibility for commercial use depends on the chosen LLM's license.

Limitations & Caveats

The Ollama and llama-cpp-python backends currently support only one session/prompt at a time. vLLM generally needs more VRAM because it typically serves non-quantized models, although AWQ-quantized models are available. Some models, such as MistralLite, are noted as potentially glitchy.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star history: 32 stars in the last 90 days
