byetype  by devonmochi

Markdown-driven AI voice input and image text extraction

Created 9 months ago
252 stars

Top 99.6% on SourcePulse

Project Summary

ByeType is a Markdown-driven AI speech input tool with customizable transcription and text extraction from images. It targets users who need precise voice input tuned to specialized industries or personal habits, and users who need reliable text extraction from screenshots. In both cases the goal is to significantly reduce manual post-processing.

How It Works

ByeType employs multimodal large language models to process raw audio directly, integrating user-defined rules and optimizations within a single transcription step. This approach bypasses the error-prone "ASR + LLM post-processing" pipeline. Customization is achieved through editable Markdown files, allowing users to define proprietary vocabulary, transcription logic, and text formatting strategies. The tool also features AI-powered image text extraction that understands visual context, intelligently repairing broken lines and restoring clean code blocks from screenshots, surpassing traditional OCR capabilities.
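
To make the Markdown-driven idea concrete, here is a minimal sketch of how a rules file could be turned into a single prompt. It is an illustration only: the section names (`Vocabulary`, `Formatting`), the `parse_sections` and `build_prompt` helpers, and the prompt wording are all assumptions, not ByeType's actual file schema or code.

```python
import re

# Hypothetical rules file; the section names are illustrative,
# not ByeType's actual schema.
RULES_MD = """\
# Vocabulary
- Kubernetes
- PyTorch

# Formatting
- Remove filler words such as "um" and "uh".
- Output plain text with no commentary.
"""


def parse_sections(markdown: str) -> dict[str, list[str]]:
    """Group bullet items under the most recent level-1 heading."""
    sections: dict[str, list[str]] = {}
    current = None
    for line in markdown.splitlines():
        heading = re.match(r"#\s+(.*)", line)
        bullet = re.match(r"[-*]\s+(.*)", line)
        if heading:
            current = heading.group(1).strip()
            sections[current] = []
        elif bullet and current is not None:
            sections[current].append(bullet.group(1).strip())
    return sections


def build_prompt(sections: dict[str, list[str]]) -> str:
    """Flatten the parsed sections into one instruction block (assumed wording)."""
    parts = ["Transcribe the attached audio, applying every rule below in a single pass."]
    for title, items in sections.items():
        parts.append(f"{title}:")
        parts.extend(f"- {item}" for item in items)
    return "\n".join(parts)


sections = parse_sections(RULES_MD)
prompt = build_prompt(sections)
print(prompt)
```

Because the rules live in a plain text file, adding a new term or formatting preference is a one-line edit rather than a code change, which is the property the summary attributes to the Markdown approach.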

Quick Start & Requirements

  • Installation: Desktop versions for macOS and Windows are available (~8 MB install size). iOS support is provided via Shortcuts (ByeType LongCat, ByeType Gemini).
  • Prerequisites: Requires users to provide their own API keys for supported AI services (e.g., Google Gemini, Alibaba Cloud). macOS requires microphone and accessibility permissions.
  • Configuration: Managed through in-app settings and by editing Markdown files.

Highlighted Details

  • Markdown-Driven Customization: Define proprietary vocabulary, transcription rules, and text optimization styles (e.g., auto-formatting, translation) via editable Markdown prompts.
  • Direct LLM Audio Processing: Processes raw audio with LLMs, applying all rules in one step for superior accuracy compared to ASR + post-processing.
  • Advanced Image Text Extraction: Intelligently repairs line breaks and reconstructs clean, usable code snippets from screenshots, understanding visual layout beyond OCR.
  • Cross-Platform Support: Available for macOS, Windows, and iOS.
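
The single-step processing described above can be pictured as one request that carries both the raw audio and the user's rules, instead of a transcription call followed by a separate cleanup call. The sketch below shows only that packaging step. The `build_request` helper and the JSON field names (`instructions`, `audio`, `mime_type`, `data`) are hypothetical; each real provider defines its own request schema.

```python
import base64
import json


def build_request(audio_bytes: bytes, rules_prompt: str, mime_type: str = "audio/wav") -> str:
    """Pack raw audio and the rules prompt into one JSON request body.

    The field names are hypothetical; a real provider defines its own schema.
    """
    payload = {
        "instructions": rules_prompt,
        "audio": {
            "mime_type": mime_type,
            # Binary audio is base64-encoded so it can travel inside JSON.
            "data": base64.b64encode(audio_bytes).decode("ascii"),
        },
    }
    return json.dumps(payload)


# Placeholder bytes stand in for a real recording.
body = build_request(b"\x00\x01fake-audio", "Prefer the spelling 'Kubernetes'.")
decoded = json.loads(body)
print(sorted(decoded.keys()))
```

Sending the rules alongside the audio means the model sees the vocabulary before it commits to a spelling, rather than trying to repair a wrong transcript afterwards.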

Maintenance & Community

The project encourages community contributions via Issues and Pull Requests. It acknowledges the Linux.do community for its contributions. No specific community channels (like Discord/Slack) or active maintainer details are provided in the README.

Licensing & Compatibility

  • License: MIT License.
  • Compatibility: The MIT license is permissive, generally allowing for commercial use and integration into closed-source projects without significant restrictions.

Limitations & Caveats

Users must supply their own API keys for AI model access. Certain models, like Gemini, may require network proxy configuration for users in specific regions. Auto-pasting of transcribed text relies on accessibility permissions, with manual pasting as a fallback. Transcription speed can be optimized by disabling "Thinking Mode" or selecting lighter models.

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 3
  • Star History: 47 stars in the last 30 days
