whisper_dictation  by themanyone

Voice keyboard for local AI chat, image gen, webcam, & voice control

created 2 years ago
256 stars

Top 99.0% on sourcepulse

Project Summary

This project provides a private, voice-controlled interface for interacting with a computer, integrating speech-to-text, AI chat, image generation, and system control. It targets users seeking a hands-free, AI-powered computing experience, akin to a "ship's computer," enabling tasks like dictation, web searches, and application launching via voice commands.

How It Works

The system uses whisper.cpp for efficient, local speech-to-text and translation, minimizing external dependencies. Voice commands are parsed to trigger actions, with pyautogui handling system control and application launching. Optionally, it integrates with local LLMs (served through llama.cpp) or cloud services (OpenAI, Gemini) for AI chat, with mimic3 or piper for text-to-speech, and with a local Stable Diffusion instance for image generation.
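
The parse-then-dispatch step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the command phrases, the `COMMANDS` table, and the `parse_command` helper are all hypothetical, and the real project may recognize commands quite differently.

```python
import re

# Hypothetical command table (an assumption for illustration only):
# each entry maps a spoken-phrase pattern to an action name.
COMMANDS = [
    (re.compile(r"^(?:computer[, ]+)?open (?P<arg>.+)$"), "launch"),
    (re.compile(r"^(?:computer[, ]+)?search (?:the web )?for (?P<arg>.+)$"), "web_search"),
]


def normalize(text):
    """Lowercase the transcript and strip trailing sentence punctuation."""
    return re.sub(r"[.!?]+$", "", text.strip().lower())


def parse_command(transcript):
    """Return (action, argument) for a transcript.

    Anything that matches no command pattern falls through to plain
    dictation, with the original text preserved.
    """
    cleaned = normalize(transcript)
    for pattern, action in COMMANDS:
        match = pattern.match(cleaned)
        if match:
            return action, match.group("arg").strip()
    return "dictate", transcript.strip()


if __name__ == "__main__":
    for phrase in ("Open terminal.", "Computer, search for whisper models", "Hello world"):
        print(parse_command(phrase))
```

In a full application, each action name would presumably be bound to a handler; for example, the dictation fallback could pass its text to `pyautogui.write`. Keeping parsing separate from the side effects makes the matching logic easy to test without a display.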

Quick Start & Requirements

  • Install GStreamer and ladspa-delay-so-delay-5s (via gstreamer1-plugins-bad-free-extras).
  • Install Python dependencies: pip install -r whisper_dictation/requirements.txt.
  • Build whisper.cpp with CUDA support: GGML_CUDA=1 make -j.
  • Run whisper.cpp server: ./whisper_cpp_server -l en -m models/ggml-tiny.en.bin --port 7777.
  • Requires >= 4 GiB VRAM for full functionality, especially with LLMs and image generation.
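
Once the server from the steps above is listening on port 7777, a client could submit audio to it along these lines. This is a sketch under stated assumptions: the `/inference` path and the `file` and `response_format` form fields follow the upstream whisper.cpp example server and may not match this project's `whisper_cpp_server`; `build_inference_request` is a hypothetical helper name.

```python
import uuid
import urllib.request


def build_inference_request(wav_bytes, host="127.0.0.1", port=7777, path="/inference"):
    """Build (but do not send) a multipart POST carrying a WAV payload.

    Assumption: the endpoint path and field names mirror the upstream
    whisper.cpp example server; adjust them if the project differs.
    """
    boundary = uuid.uuid4().hex
    text_field = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="response_format"\r\n\r\n'
        "json\r\n"
    ).encode()
    file_header = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="file"; filename="audio.wav"\r\n'
        "Content-Type: audio/wav\r\n\r\n"
    ).encode()
    closing = f"\r\n--{boundary}--\r\n".encode()
    body = text_field + file_header + wav_bytes + closing

    request = urllib.request.Request(
        f"http://{host}:{port}{path}", data=body, method="POST"
    )
    request.add_header("Content-Type", f"multipart/form-data; boundary={boundary}")
    return request


if __name__ == "__main__":
    req = build_inference_request(b"RIFF....WAVE")
    print(req.get_method(), req.full_url)
    # To actually send it (requires the server to be running):
    # with urllib.request.urlopen(req, timeout=30) as resp:
    #     print(resp.read().decode())
```

Building the request separately from sending it keeps the example usable without a running server; only the commented-out `urlopen` call touches the network.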

Highlighted Details

  • Reduced dependencies by eliminating torch, pycuda, cudnn, and ffmpeg.
  • Stable Diffusion can run with as little as 2 GiB VRAM using --medvram or --lowvram flags.
  • Supports local LLMs via llama.cpp and optional cloud APIs (OpenAI, Gemini).
  • Enables voice-controlled webcam, audio recording, and application launching.

Maintenance & Community

  • Developed by Henry Kroll III (themanyone).
  • Links to GitHub, YouTube, Mastodon, LinkedIn, and a "Buy Me a Coffee" page are provided.
  • The README notes that mimic3 may be abandoned in favor of piper.

Licensing & Compatibility

  • Licensed under MIT.
  • The permissive license allows modification, redistribution, and commercial use, provided the copyright and license notice are retained.

Limitations & Caveats

  • mimic3 is noted as potentially abandoned, with piper suggested as a replacement.
  • High VRAM usage can occur with large models and context windows, potentially leading to crashes.
  • Performance may vary based on hardware, especially for LLM inference and image generation.
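
Because VRAM exhaustion is named as a crash risk, it can help to check available memory before loading a large model. The sketch below parses `nvidia-smi` output and assumes an NVIDIA GPU; the helper names are hypothetical, and reusing the 4 GiB guidance figure as a free-memory threshold is an assumption for illustration, not a rule stated by the project.

```python
import subprocess

# Assumption: borrow the ">= 4 GiB" guidance figure as a free-memory threshold.
REQUIRED_MIB = 4 * 1024


def parse_free_vram_mib(csv_text):
    """Parse 'total, used' MiB pairs (one GPU per line) into free MiB per GPU."""
    free = []
    for line in csv_text.strip().splitlines():
        total, used = (int(value.strip()) for value in line.split(","))
        free.append(total - used)
    return free


def query_free_vram_mib():
    """Ask nvidia-smi for per-GPU memory figures (requires NVIDIA drivers)."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=memory.total,memory.used",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_free_vram_mib(result.stdout)


def has_headroom(free_mib, required_mib=REQUIRED_MIB):
    """True if at least one GPU has the required amount of free memory."""
    return any(free >= required_mib for free in free_mib)


if __name__ == "__main__":
    sample = "8192, 1024\n4096, 3900\n"  # illustrative values, not measurements
    free = parse_free_vram_mib(sample)
    print(free, has_headroom(free))
```

A launcher script could call `query_free_vram_mib()` and fall back to a smaller model or a shorter context window when `has_headroom` returns false.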

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0

Star History

  • 30 stars in the last 90 days
