Chat app for local LLaMA model inference
This repository provides an easy-to-use interface for running Meta's LLaMA large language models on home PCs. It targets users with NVIDIA GPUs and sufficient RAM, and supports both local chat interaction and fine-tuning.
How It Works
The project leverages PyTorch and Hugging Face Transformers for LLaMA model inference and training. It supports both the raw model weights, which require manual merging, and a Hugging Face version that is downloaded and cached automatically. Generation parameters can be tuned flexibly, including temperature, top-p, and top-k sampling, with optional repetition penalty and custom stop sequences.
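As an illustration of how these parameters map onto the Transformers generate() API (a minimal sketch, not the repository's actual code; the checkpoint ID, prompt format, and stop string below are placeholder assumptions):

```python
# Sketch: sampling-parameter tuning with Hugging Face Transformers.
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          StoppingCriteria, StoppingCriteriaList)

model_id = "path/to/llama-hf"  # placeholder: local path or Hub ID of a LLaMA checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).cuda()

class StopOnStrings(StoppingCriteria):
    """Stop generation once any of the given strings appears in the newly generated text."""
    def __init__(self, stops, tokenizer, prompt_len):
        self.stops, self.tokenizer, self.prompt_len = stops, tokenizer, prompt_len
    def __call__(self, input_ids, scores, **kwargs):
        new_text = self.tokenizer.decode(input_ids[0][self.prompt_len:])
        return any(s in new_text for s in self.stops)

prompt = "User: Hello!\nAssistant:"  # placeholder prompt format
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
stops = StoppingCriteriaList(
    [StopOnStrings(["\nUser:"], tokenizer, inputs.input_ids.shape[1])])

output = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,           # sample instead of greedy decoding
    temperature=0.7,          # softmax temperature
    top_p=0.9,                # nucleus (top-p) sampling
    top_k=40,                 # top-k filtering
    repetition_penalty=1.1,   # discourage verbatim repetition
    stopping_criteria=stops,  # custom stop sequence
)
print(tokenizer.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```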
Quick Start & Requirements
Create a conda environment (conda create -n llama python=3.10), activate it (conda activate llama), install PyTorch with CUDA 11.7 (conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia), install the requirements (pip install -r requirements.txt), and install the package (pip install -e .). Raw weights must first be merged with python merge-weights.py; for the HF version, models are downloaded automatically. Start chatting with python example-chat.py ./model ./tokenizer/tokenizer.model (for raw weights) or python hf-chat-example.py (for the HF version).
Highlighted Details
Uses the accelerate library for memory optimization.
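A minimal sketch of what accelerate-backed, memory-aware loading typically looks like (the checkpoint ID and offload folder are placeholders; the repository's actual loading code may differ):

```python
# Sketch: memory-aware model loading via accelerate's device_map support.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/to/llama-hf"        # placeholder checkpoint path or Hub ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,   # roughly halve the memory footprint vs. fp32
    device_map="auto",           # let accelerate split layers across GPU and CPU
    offload_folder="offload",    # spill layers to disk if system RAM is also tight
    low_cpu_mem_usage=True,      # stream weights instead of materializing them twice
)
```

Generation then proceeds as in the earlier sketch; with device_map="auto", inputs only need to be moved to model.device before calling generate().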
Maintenance & Community
The project is based on several foundational LLaMA repositories. Community interaction and prompt sharing occur via GitHub Issues.
Licensing & Compatibility
The repository itself appears to be unlicensed, but it is heavily based on Meta's LLaMA, which has its own usage restrictions. The Hugging Face models are subject to their respective licenses. Commercial use is likely restricted by the underlying LLaMA license.
Limitations & Caveats
Running larger models (30B+) requires substantial RAM (48 GB+), and inference can be slow on lower-end hardware or with limited VRAM. The project relies on downloading LLaMA weights, which may have distribution restrictions.
Last updated 2 years ago; the project is inactive.