Most transcription apps use a two-step process: ASR transcription followed by LLM cleanup. AI Transcription Notepad sends audio directly to multimodal AI models that transcribe and format in a single pass.
| Traditional Approach | AI Transcription Notepad |
|---|---|
| Record → ASR → Raw text → LLM → Formatted output | Record → Multimodal AI → Formatted output |
| Two API calls, higher latency | Single API call, faster results |
| AI reads text only | AI "hears" your voice |
The AI hears tone, pauses, and emphasis. Verbal commands like "scratch that" or "new paragraph" work naturally.
- Cost-effective — 848 transcriptions for $1.17 (~1.4¢ per 1,000 words)
- Fast — Single API call with local preprocessing
- Smart cleanup — Removes filler words, adds punctuation, formats output
- Global hotkeys — Record from anywhere, even when minimized
- Flexible output — App window, clipboard, or inject directly at cursor
- Translation — Translate to 30+ languages in the same API call
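As a quick sanity check on the cost figure above:

```python
# Figures as reported above: $1.17 total for 848 transcriptions.
total_usd = 1.17
transcriptions = 848
cost_each_cents = 100 * total_usd / transcriptions
print(f"{cost_each_cents:.2f}¢ per transcription")  # ~0.14¢ each
```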
- Online Documentation: full documentation site with guides, reference, and troubleshooting.
- User Manual v3 (PDF): complete 27-page guide covering installation, configuration, hotkey setup, and troubleshooting.
1. Download from Releases (AppImage, .deb, or Windows installer)
2. Add your OpenRouter API key (get one here)
3. Press Record, speak naturally, press Transcribe
4. Get clean, formatted text
```shell
# Or run from source
git clone https://github.com/danielrosehill/AI-Transcription-Notepad.git
cd AI-Transcription-Notepad && ./run.sh
```

AI Transcription Notepad combines local preprocessing with cloud transcription for optimal cost and quality.
```mermaid
flowchart LR
subgraph LOCAL["Local Preprocessing"]
direction LR
A[Record<br/>48kHz] --> B[AGC<br/>Normalize]
B --> C[VAD<br/>Remove Silence]
C --> D[Compress<br/>16kHz mono]
end
subgraph CLOUD["Cloud Transcription"]
direction LR
E[Prompt<br/>Concatenation] --> F[Gemini API<br/>Audio + Prompt]
F --> G[Formatted<br/>Text]
end
D --> E
style LOCAL fill:#e8f5e9,stroke:#4caf50,stroke-width:2px
style CLOUD fill:#e3f2fd,stroke:#2196f3,stroke-width:2px
```
| Stage | Component | Purpose |
|---|---|---|
| Local | AGC | Normalizes audio levels (target -3 dBFS) |
| Local | VAD | Strips silence — typically 30-80% reduction |
| Local | Compress | Downsamples to 16kHz mono WAV |
| Cloud | Prompt Concatenation | Builds layered instructions |
| Cloud | Gemini API | Single-pass transcription + cleanup |
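The local stages are straightforward signal operations. A minimal sketch of the idea in Python, with a simple energy gate standing in for TEN VAD and naive decimation standing in for a proper resampler (thresholds and frame sizes here are illustrative, not the app's actual values):

```python
import numpy as np

def agc_normalize(samples: np.ndarray, target_dbfs: float = -3.0) -> np.ndarray:
    """Scale the signal so its peak sits at target_dbfs (simple peak AGC)."""
    peak = np.max(np.abs(samples))
    if peak == 0:
        return samples
    target_peak = 10 ** (target_dbfs / 20)  # -3 dBFS ≈ 0.708 of full scale
    return samples * (target_peak / peak)

def strip_silence(samples: np.ndarray, frame_len: int = 480,
                  threshold: float = 0.02) -> np.ndarray:
    """Energy-gate stand-in for VAD: drop frames whose RMS is below threshold."""
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples), frame_len)]
    voiced = [f for f in frames if np.sqrt(np.mean(f ** 2)) >= threshold]
    return np.concatenate(voiced) if voiced else samples[:0]

def downsample_3x(samples: np.ndarray) -> np.ndarray:
    """Crude 48 kHz -> 16 kHz decimation (a real resampler low-passes first)."""
    return samples[::3]

# One second of 440 Hz tone followed by one second of silence, at 48 kHz
sr = 48000
t = np.arange(sr) / sr
signal = np.concatenate([0.1 * np.sin(2 * np.pi * 440 * t), np.zeros(sr)])
small = downsample_3x(strip_silence(agc_normalize(signal)))
```

In the app itself, silence removal is handled by TEN VAD and the result is encoded as 16 kHz mono WAV before upload.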
AI Transcription Notepad uses a layered prompt architecture where instructions are concatenated at transcription time. This allows flexible, modular control over output formatting.
```mermaid
flowchart TB
subgraph FOUNDATION["Foundation Layer (Always Applied)"]
F1[Remove filler words]
F2[Add punctuation]
F3[Fix grammar & spelling]
F4[Honor verbal commands]
F5[Handle background audio]
end
subgraph FORMAT["Format Layer"]
FMT[Email / Todo / Meeting Notes<br/>Blog / Documentation / AI Prompt]
end
subgraph STYLE["Style Layer"]
S1[Formality<br/>Casual → Professional]
S2[Verbosity<br/>None → Maximum reduction]
end
subgraph PERSONAL["Personalization"]
P1[Email signatures]
P2[User name]
end
FOUNDATION --> FORMAT
FORMAT --> STYLE
STYLE --> PERSONAL
PERSONAL --> OUTPUT[Final Prompt]
style FOUNDATION fill:#fff3e0,stroke:#ff9800,stroke-width:2px
style FORMAT fill:#f3e5f5,stroke:#9c27b0,stroke-width:2px
style STYLE fill:#e8f5e9,stroke:#4caf50,stroke-width:2px
style PERSONAL fill:#e3f2fd,stroke:#2196f3,stroke-width:2px
```
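The layering above amounts to concatenating instruction strings in a fixed order at transcription time. A minimal sketch (the actual prompt wording in the app will differ; these strings are illustrative):

```python
FOUNDATION = (
    "Transcribe the attached audio. Remove filler words, add punctuation, "
    "fix grammar and spelling, and honor verbal commands such as "
    "'scratch that' or 'new paragraph'."
)

def build_prompt(format_layer=None, style_layers=None, personalization=None):
    """Concatenate layers in order: foundation -> format -> style -> personal."""
    parts = [FOUNDATION]
    if format_layer:
        parts.append(format_layer)
    parts.extend(style_layers or [])
    parts.extend(personalization or [])
    return "\n\n".join(parts)

prompt = build_prompt(
    format_layer="Format the result as a professional email.",
    style_layers=["Use a professional tone."],
    personalization=["Sign off with the user's saved signature."],
)
```

A saved Prompt Stack is then just a named, reusable combination of these layer arguments.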
Prompt Stacks let you save and combine multiple prompt layers for recurring workflows:
| Stack Example | Layers Combined |
|---|---|
| Meeting Notes + Actions | Foundation + Meeting format + Action item extraction |
| Technical Documentation | Foundation + Doc format + Code extraction + Markdown |
| Quick Email | Foundation + Email format + Professional tone + Signature |
Create custom stacks in the Prompt Stacks tab, then apply them with a single click.
| Provider | Default Model | Notes |
|---|---|---|
| OpenRouter | google/gemini-3-flash-preview | Gemini 3 Flash (default), Gemini 3 Pro (fallback) |
OpenRouter is the sole provider. It offers per-key cost tracking, low latency, and access to Gemini 3 models via an OpenAI-compatible API.
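Because the API is OpenAI-compatible, an audio-plus-prompt request can be expressed as a single chat completion with an inline audio content part. A hedged sketch of building such a payload (the `input_audio` content-part shape follows the OpenAI chat-completions convention; whether this exact shape is what the app sends is an assumption):

```python
import base64

def build_transcription_request(wav_bytes: bytes, prompt: str,
                                model: str = "google/gemini-3-flash-preview") -> dict:
    """Build an OpenAI-style chat payload carrying prompt text plus inline audio."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "input_audio",
                 "input_audio": {
                     "data": base64.b64encode(wav_bytes).decode("ascii"),
                     "format": "wav",
                 }},
            ],
        }],
    }

# Sending the payload (sketch only; requires a real API key):
# POST https://openrouter.ai/api/v1/chat/completions
# with headers {"Authorization": "Bearer <OPENROUTER_API_KEY>",
#               "Content-Type": "application/json"}
```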
| Component | Technology |
|---|---|
| Transcription | OpenRouter (Gemini 3 Flash / Pro) |
| Voice Activity Detection | TEN VAD |
| Text-to-Speech | Edge TTS |
| Database | Mongita |
| UI Framework | PyQt6 |
See Technology Stack for details.
Real usage from ~2,000 transcriptions shows excellent performance with OpenRouter's Gemini models:
| Provider | Model | Avg Inference | Chars/sec |
|---|---|---|---|
| OpenRouter | google/gemini-2.5-flash | 2.5s | 204 |
Anonymized usage data available in data/.
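The table figures are simple aggregates over the logged runs; they could be reproduced from the anonymized records with something like the following (the record field names here are assumed, not the actual schema in data/):

```python
def summarize(records):
    """Average inference time and throughput (chars/sec) over usage records."""
    total_time = sum(r["inference_s"] for r in records)
    avg_inference = total_time / len(records)
    chars_per_sec = sum(r["chars"] for r in records) / total_time
    return round(avg_inference, 2), round(chars_per_sec)
```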
This software was developed through AI-human collaboration. Code was generated by Claude Opus 4.5 under my direction—I designed the architecture and specified requirements while Claude wrote the implementation.
- Audio-Multimodal-AI-Resources — Curated list of audio-capable multimodal models
- Audio-Understanding-Test-Prompts — Test prompts for evaluating audio understanding
MIT




