byetype by devonmochi

Markdown-driven AI voice input and image text extraction

Created 11 months ago

287 stars

Top 91.3% on SourcePulse

Project Summary

ByeType is a Markdown-driven AI voice input tool with customizable transcription and text extraction from images. It targets users who need precise voice input for specialized industries or personal speaking habits, and users who need reliable text extraction from screenshots. In both cases the aim is to reduce manual post-processing.

How It Works

ByeType sends raw audio directly to a multimodal large language model and applies user-defined rules and optimizations within that single transcription step. This avoids the two-stage "ASR + LLM post-processing" pipeline, which the project describes as error-prone. Customization lives in editable Markdown files, where users define proprietary vocabulary, transcription logic, and text formatting strategies. The tool also offers AI-powered image text extraction that takes visual context into account: it repairs broken lines and restores clean code blocks from screenshots, which the project presents as going beyond traditional OCR.
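The single-step idea can be sketched as follows. This is a minimal illustration, not ByeType's actual code: the function name `build_single_step_request` and the dictionary layout are hypothetical and do not correspond to any provider's real request format. The point is only that the Markdown rules and the raw audio travel together in one request, so no separate post-processing call is needed.

```python
import base64


def build_single_step_request(rules_markdown: str, audio_bytes: bytes,
                              mime_type: str = "audio/wav") -> dict:
    """Bundle Markdown rules and raw audio into one (hypothetical) request body."""
    # The rules are part of the transcription instruction itself,
    # rather than being applied in a second pass over ASR output.
    instruction = (
        "Transcribe the attached audio. Apply the following user rules "
        "while transcribing, not as a separate pass:\n\n"
        + rules_markdown.strip()
    )
    return {
        "instruction": instruction,
        "audio": {
            "mime_type": mime_type,
            # Binary audio is base64-encoded so it can sit in a JSON body.
            "data_base64": base64.b64encode(audio_bytes).decode("ascii"),
        },
    }


# Placeholder rules and placeholder bytes standing in for a real recording.
rules = "# Vocabulary\n- ByeType\n- SourcePulse\n"
request = build_single_step_request(rules, b"\x00\x01fake-audio")
```

A real client would map this dictionary onto whatever request format the chosen multimodal model expects.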

Quick Start & Requirements

Installation: Desktop versions for macOS and Windows are available (~8 MB install size). iOS support is provided via Shortcuts (ByeType LongCat, ByeType Gemini).
Prerequisites: Requires users to provide their own API keys for supported AI services (e.g., Google Gemini, Alibaba Cloud). macOS requires microphone and accessibility permissions.
Configuration: Managed through in-app settings and by editing Markdown files.

Highlighted Details

Markdown-Driven Customization: Define proprietary vocabulary, transcription rules, and text optimization styles (e.g., auto-formatting, translation) via editable Markdown prompts.
Direct LLM Audio Processing: Processes raw audio with LLMs and applies all rules in one step, which the project says yields better accuracy than ASR followed by post-processing.
Advanced Image Text Extraction: Intelligently repairs line breaks and reconstructs clean, usable code snippets from screenshots, understanding visual layout beyond OCR.
Cross-Platform Support: Available for macOS, Windows, and iOS.
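To make the Markdown-driven customization concrete, here is a small sketch of how a rules file might be split into named sections. The layout is an assumption for illustration (headings such as "Vocabulary" and "Formatting" with bullet items underneath), and `parse_rule_sections` is a hypothetical helper; the source does not specify the file format ByeType actually uses.

```python
def parse_rule_sections(markdown_text: str) -> dict[str, list[str]]:
    """Group bullet items under the (lowercased) heading that precedes them."""
    sections: dict[str, list[str]] = {}
    current = None
    for raw_line in markdown_text.splitlines():
        line = raw_line.strip()
        if line.startswith("#"):
            # A heading opens a new section; strip the leading '#' markers.
            current = line.lstrip("#").strip().lower()
            sections.setdefault(current, [])
        elif line.startswith(("- ", "* ")) and current is not None:
            # A bullet item belongs to the most recent heading.
            item = line[2:].strip()
            if item:
                sections[current].append(item)
    return sections


# Hypothetical rules file content.
example = """\
# Vocabulary
- ByeType
- Kubernetes

# Formatting
- Use full sentences with punctuation.
"""
sections = parse_rule_sections(example)
```

Keeping rules in plain Markdown means they stay readable and editable in any text editor, and a parser like this is only needed if the rules must be handled section by section rather than passed to the model as-is.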

Maintenance & Community

The project encourages community contributions via Issues and Pull Requests. It acknowledges the Linux.do community for its contributions. No specific community channels (like Discord/Slack) or active maintainer details are provided in the README.

Licensing & Compatibility

License: MIT License.
Compatibility: The MIT license is permissive. It allows commercial use and integration into closed-source projects, provided the copyright and license notice are retained in copies of the software.

Limitations & Caveats

Users must supply their own API keys for AI model access. Certain models, like Gemini, may require network proxy configuration for users in specific regions. Auto-pasting of transcribed text relies on accessibility permissions, with manual pasting as a fallback. Transcription can be sped up by disabling "Thinking Mode" or selecting a lighter model.
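The caveats above amount to three settings: an API key, an optional proxy, and whether "Thinking Mode" is on. The sketch below shows one way such settings could be resolved. It is an assumption-laden illustration: ByeType is configured through its in-app settings, and the variable names `BYETYPE_API_KEY` and `BYETYPE_THINKING` are hypothetical. `HTTPS_PROXY` is the conventional environment variable many HTTP clients honor.

```python
import os


def resolve_settings(env=None) -> dict:
    """Collect API key, proxy, and thinking-mode flag from a mapping of variables."""
    env = os.environ if env is None else env

    # Hypothetical variable name; the key itself must come from the user.
    api_key = env.get("BYETYPE_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("No API key found; set BYETYPE_API_KEY")

    # A proxy is optional and only matters where the provider is not
    # directly reachable.
    proxy = env.get("HTTPS_PROXY") or env.get("https_proxy") or None

    # Thinking mode defaults to off here, favoring speed (hypothetical flag).
    thinking = env.get("BYETYPE_THINKING", "off").lower() == "on"

    return {"api_key": api_key, "proxy": proxy, "thinking": thinking}


# Usage with an explicit mapping instead of the real environment.
settings = resolve_settings({
    "BYETYPE_API_KEY": "example-key",
    "HTTPS_PROXY": "http://127.0.0.1:7890",
})
```

Failing early on a missing key gives a clearer error than a rejected request later, and leaving the proxy unset by default keeps the common case free of extra configuration.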
