Chat app for local LLaMA model inference
This repository provides an easy-to-use interface for running Meta's LLaMA large language models on home PCs. It targets users with NVIDIA GPUs and sufficient RAM, and supports both local chat interaction and fine-tuning.
How It Works
The project leverages PyTorch and Hugging Face Transformers for LLaMA model inference and training. It supports both the raw model weights, which require manual merging, and a Hugging Face version that is downloaded and cached automatically. Generation parameters can be tuned flexibly, including temperature, top-p, and top-k sampling, with optional repetition penalty and custom stop sequences.
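As an illustration of how these parameters map onto the Transformers generate() API (a minimal sketch, not the repository's actual code; the checkpoint ID, prompt format, and stop string below are placeholder assumptions):

```python
# Sketch: sampling-parameter tuning with Hugging Face Transformers.
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          StoppingCriteria, StoppingCriteriaList)

model_id = "path/to/llama-hf"  # placeholder: local path or Hub ID of a LLaMA checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).cuda()

class StopOnStrings(StoppingCriteria):
    """Stop generation once any of the given strings appears in the newly generated text."""
    def __init__(self, stops, tokenizer, prompt_len):
        self.stops, self.tokenizer, self.prompt_len = stops, tokenizer, prompt_len
    def __call__(self, input_ids, scores, **kwargs):
        new_text = self.tokenizer.decode(input_ids[0][self.prompt_len:])
        return any(s in new_text for s in self.stops)

prompt = "User: Hello!\nAssistant:"  # placeholder prompt format
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
stops = StoppingCriteriaList(
    [StopOnStrings(["\nUser:"], tokenizer, inputs.input_ids.shape[1])])

output = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,           # sample instead of greedy decoding
    temperature=0.7,          # softmax temperature
    top_p=0.9,                # nucleus (top-p) sampling
    top_k=40,                 # top-k filtering
    repetition_penalty=1.1,   # discourage verbatim repetition
    stopping_criteria=stops,  # custom stop sequence
)
print(tokenizer.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```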
Quick Start & Requirements
Create a conda environment (conda create -n llama python=3.10), activate it (conda activate llama), install PyTorch with CUDA 11.7 (conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia), install the requirements (pip install -r requirements.txt), and install the package (pip install -e .). Raw weights must first be merged with python merge-weights.py; for the HF version, models are downloaded automatically. Start chatting with python example-chat.py ./model ./tokenizer/tokenizer.model (for raw weights) or python hf-chat-example.py (for the HF version).
Highlighted Details
Uses the accelerate library for memory optimization.
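A minimal sketch of what accelerate-backed, memory-aware loading typically looks like (the checkpoint ID and offload folder are placeholders; the repository's actual loading code may differ):

```python
# Sketch: memory-aware model loading via accelerate's device_map support.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/to/llama-hf"        # placeholder checkpoint path or Hub ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,   # roughly halve the memory footprint vs. fp32
    device_map="auto",           # let accelerate split layers across GPU and CPU
    offload_folder="offload",    # spill layers to disk if system RAM is also tight
    low_cpu_mem_usage=True,      # stream weights instead of materializing them twice
)
```

Generation then proceeds as in the earlier sketch; with device_map="auto", inputs only need to be moved to model.device before calling generate().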
Maintenance & Community
The project is based on several foundational LLaMA repositories. Community interaction and prompt sharing occur via GitHub Issues.
Licensing & Compatibility
The repository itself appears to be unlicensed, but it is heavily based on Meta's LLaMA, which has its own usage restrictions. The Hugging Face models are subject to their respective licenses. Commercial use is likely restricted by the underlying LLaMA license.
Limitations & Caveats
Running larger models (30B+) requires substantial RAM (48 GB+), and inference can be slow on lower-end hardware or with limited VRAM. The project relies on downloading LLaMA weights, which may have distribution restrictions.
Last updated 2 years ago; the project is inactive.