Note: This repository was archived by the owner on Mar 25, 2026. It is now read-only.
AI Transcription Notepad

Multimodal Cloud Transcription for Desktop



Download · Documentation · User Manual (PDF)


AI Transcription Notepad Main Interface


Why AI Transcription Notepad?

Most transcription apps use a two-step process: ASR transcription followed by LLM cleanup. AI Transcription Notepad sends audio directly to multimodal AI models that transcribe and format in a single pass.

| Traditional Approach | AI Transcription Notepad |
| --- | --- |
| Record → ASR → Raw text → LLM → Formatted output | Record → Multimodal AI → Formatted output |
| Two API calls, higher latency | Single API call, faster results |
| AI reads text only | AI "hears" your voice |

The AI hears tone, pauses, and emphasis. Verbal commands like "scratch that" or "new paragraph" work naturally.


Key Benefits

  • Cost-effective — 848 transcriptions for $1.17 (~1.4¢ per 1,000 words)
  • Fast — Single API call with local preprocessing
  • Smart cleanup — Removes filler words, adds punctuation, formats output
  • Global hotkeys — Record from anywhere, even when minimized
  • Flexible output — App window, clipboard, or inject directly at cursor
  • Translation — Translate to 30+ languages in the same API call

Documentation

- Online Documentation: full documentation site with guides, reference, and troubleshooting.
- User Manual v3 (PDF): complete 27-page guide covering installation, configuration, hotkey setup, and troubleshooting.

Quick Start

  1. Download from Releases (AppImage, .deb, or Windows installer)
  2. Add your OpenRouter API key (get one here)
  3. Press Record, speak naturally, press Transcribe
  4. Get clean, formatted text
```bash
# Or run from source
git clone https://github.com/danielrosehill/AI-Transcription-Notepad.git
cd AI-Transcription-Notepad && ./run.sh
```

Dual-Pipeline Architecture

AI Transcription Notepad combines local preprocessing with cloud transcription for optimal cost and quality.

```mermaid
flowchart LR
    subgraph LOCAL["Local Preprocessing"]
        direction LR
        A[Record<br/>48kHz] --> B[AGC<br/>Normalize]
        B --> C[VAD<br/>Remove Silence]
        C --> D[Compress<br/>16kHz mono]
    end

    subgraph CLOUD["Cloud Transcription"]
        direction LR
        E[Prompt<br/>Concatenation] --> F[Gemini API<br/>Audio + Prompt]
        F --> G[Formatted<br/>Text]
    end

    D --> E

    style LOCAL fill:#e8f5e9,stroke:#4caf50,stroke-width:2px
    style CLOUD fill:#e3f2fd,stroke:#2196f3,stroke-width:2px
```
| Stage | Component | Purpose |
| --- | --- | --- |
| Local | AGC | Normalizes audio levels (target -3 dBFS) |
| Local | VAD | Strips silence (typically 30-80% reduction) |
| Local | Compress | Downsamples to 16kHz mono WAV |
| Cloud | Prompt Concatenation | Builds layered instructions |
| Cloud | Gemini API | Single-pass transcription + cleanup |
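The two local audio stages can be sketched in plain Python. This is an illustration with hypothetical function names and a naive energy gate, not the app's actual code; the real pipeline uses TEN VAD and proper resampling.

```python
# Sketch of the AGC and silence-removal stages on float samples in -1..1.
# agc_normalize and strip_silence are hypothetical names for illustration.

def agc_normalize(samples, target_dbfs=-3.0):
    """Scale samples so the peak sits at the target level (-3 dBFS ~ 0.708)."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)
    gain = (10 ** (target_dbfs / 20)) / peak
    return [s * gain for s in samples]

def strip_silence(samples, frame=4, threshold=0.01):
    """Drop frames whose mean absolute level falls below the threshold."""
    kept = []
    for i in range(0, len(samples), frame):
        chunk = samples[i:i + frame]
        if sum(abs(s) for s in chunk) / len(chunk) >= threshold:
            kept.extend(chunk)
    return kept

audio = [0.0, 0.0, 0.0, 0.0, 0.1, -0.2, 0.15, -0.05]
trimmed = strip_silence(agc_normalize(audio))
print(len(trimmed))  # leading silent frame removed -> 4 samples remain
```

Normalizing before gating matters: a fixed silence threshold behaves consistently only once quiet recordings have been brought up to the same peak level.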

Prompt Concatenation System

AI Transcription Notepad uses a layered prompt architecture where instructions are concatenated at transcription time. This allows flexible, modular control over output formatting.

```mermaid
flowchart TB
    subgraph FOUNDATION["Foundation Layer (Always Applied)"]
        F1[Remove filler words]
        F2[Add punctuation]
        F3[Fix grammar & spelling]
        F4[Honor verbal commands]
        F5[Handle background audio]
    end

    subgraph FORMAT["Format Layer"]
        FMT[Email / Todo / Meeting Notes<br/>Blog / Documentation / AI Prompt]
    end

    subgraph STYLE["Style Layer"]
        S1[Formality<br/>Casual → Professional]
        S2[Verbosity<br/>None → Maximum reduction]
    end

    subgraph PERSONAL["Personalization"]
        P1[Email signatures]
        P2[User name]
    end

    FOUNDATION --> FORMAT
    FORMAT --> STYLE
    STYLE --> PERSONAL
    PERSONAL --> OUTPUT[Final Prompt]

    style FOUNDATION fill:#fff3e0,stroke:#ff9800,stroke-width:2px
    style FORMAT fill:#f3e5f5,stroke:#9c27b0,stroke-width:2px
    style STYLE fill:#e8f5e9,stroke:#4caf50,stroke-width:2px
    style PERSONAL fill:#e3f2fd,stroke:#2196f3,stroke-width:2px
```

Prompt Stacks

Prompt Stacks let you save and combine multiple prompt layers for recurring workflows:

| Stack Example | Layers Combined |
| --- | --- |
| Meeting Notes + Actions | Foundation + Meeting format + Action item extraction |
| Technical Documentation | Foundation + Doc format + Code extraction + Markdown |
| Quick Email | Foundation + Email format + Professional tone + Signature |

Create custom stacks in the Prompt Stacks tab, then apply them with a single click.
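The layered concatenation behind a stack can be sketched in a few lines. The layer text and function name below are hypothetical; only the foundation-first ordering mirrors the architecture described above.

```python
# Sketch of layered prompt concatenation: the foundation layer always applies,
# optional layers are appended only when the user enables them.

FOUNDATION = "Remove filler words. Add punctuation. Honor verbal commands."

def build_prompt(format_layer=None, style_layer=None, personalization=None):
    layers = [FOUNDATION, format_layer, style_layer, personalization]
    # Keep only the enabled layers, foundation first, separated by blank lines.
    return "\n\n".join(layer for layer in layers if layer)

prompt = build_prompt(
    format_layer="Format the transcript as a professional email.",
    style_layer="Tone: professional. Reduce verbosity moderately.",
    personalization="Sign off as Daniel.",
)
print(prompt.count("\n\n"))  # 3 separators: foundation plus three layers
```

A saved Prompt Stack then reduces to a stored set of keyword arguments applied in one call.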


Supported Provider

| Provider | Default Model | Notes |
| --- | --- | --- |
| OpenRouter | google/gemini-3-flash-preview | Gemini 3 Flash (default), Gemini 3 Pro (fallback) |

OpenRouter is the sole provider. It offers per-key cost tracking, low latency, and access to Gemini 3 models via an OpenAI-compatible API.
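A single-pass request bundles the audio and the concatenated prompt into one chat-completions call. The sketch below only builds the payload; the `input_audio` content part follows the OpenAI multimodal schema, which is an assumption here, so check OpenRouter's documentation before relying on it.

```python
import base64

# Sketch of an OpenAI-compatible payload for OpenRouter: one user message
# carrying both the prompt text and the base64-encoded WAV audio.
def build_request(wav_bytes: bytes, prompt: str) -> dict:
    return {
        "model": "google/gemini-3-flash-preview",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {
                    "type": "input_audio",  # assumed schema; verify with docs
                    "input_audio": {
                        "data": base64.b64encode(wav_bytes).decode("ascii"),
                        "format": "wav",
                    },
                },
            ],
        }],
    }

payload = build_request(b"RIFF....WAVE", "Transcribe and clean up this audio.")
print(payload["messages"][0]["content"][1]["type"])  # input_audio
```

Because prompt and audio travel together, the model can apply formatting instructions while it listens, which is what makes the single API call possible.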


Screenshots


Main Interface

Main Interface

Analytics Dashboard

Analytics

Global Hotkeys

Hotkeys

Prompt Formats

Formats


Technology Stack

| Component | Technology |
| --- | --- |
| Transcription | OpenRouter (Gemini 3 Flash / Pro) |
| Voice Activity Detection | TEN VAD |
| Text-to-Speech | Edge TTS |
| Database | Mongita |
| UI Framework | PyQt6 |

See Technology Stack for details.


Benchmark Data

Real usage from ~2,000 transcriptions shows excellent performance with OpenRouter's Gemini models:

| Provider | Model | Avg Inference | Chars/sec |
| --- | --- | --- | --- |
| OpenRouter | google/gemini-2.5-flash | 2.5s | 204 |

Anonymized usage data available in data/.


AI-Human Co-Authorship

This software was developed through AI-human collaboration. Code was generated by Claude Opus 4.5 under my direction—I designed the architecture and specified requirements while Claude wrote the implementation.


Related Projects


License

MIT
