Audio Transcription MCP Server
Real-time audio transcription using OpenAI Whisper. Capture and transcribe system audio (meetings, videos, music) automatically with AI assistance through Cursor or Claude Desktop.
Features
- Real-time transcription - Captures and transcribes audio as it plays
- Zero installation - Use with `npx`, no global install needed
- AI-powered - Uses OpenAI's Whisper API for accurate transcription
- Timestamped transcripts - Every entry is timestamped in markdown format
- Session isolation - Each session gets its own unique transcript file
- Smart silence detection - Automatically pauses when no audio is detected
- Automated setup - One command sets up audio routing
- Built-in testing - Verify your setup before starting
Quick Start (5 Minutes)
Step 1: Run Automated Setup
The setup script installs everything you need and guides you through configuration:
```bash
npx audio-transcription-mcp setup
```
What this does:
- Installs Homebrew (if needed)
- Installs ffmpeg for audio processing
- Installs the BlackHole virtual audio driver
- Guides you through creating a Multi-Output Device (or does it automatically!)
- Takes 5 minutes, mostly automated
First time? The script will walk you through everything with clear instructions. Don't worry if it asks for your Mac password - that's normal for installing software!
Step 2: Test Your Setup
Verify everything works before using it:
```bash
npx audio-transcription-mcp test
```
This captures 5 seconds of audio and shows you if it's working correctly.
Step 3: Configure Your AI Assistant
Add to your Cursor or Claude Desktop config:
Cursor Configuration
Edit `~/.cursor/config.json`:
```json
{
  "mcpServers": {
    "audio-transcription": {
      "command": "npx",
      "args": ["-y", "audio-transcription-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-your-key-here",
        "INPUT_DEVICE_NAME": "BlackHole"
      }
    }
  }
}
```
Then restart Cursor and ask:
"Start transcribing audio"
Claude Desktop Configuration
Edit `~/Library/Application Support/Claude/claude_desktop_config.json`:
```json
{
  "mcpServers": {
    "audio-transcription": {
      "command": "npx",
      "args": ["-y", "audio-transcription-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-your-key-here",
        "INPUT_DEVICE_NAME": "BlackHole",
        "OUTFILE_DIR": "/Users/yourname/Documents/Transcripts"
      },
      "allowedDirectories": [
        "/Users/yourname/Documents/Transcripts"
      ]
    }
  }
}
```
Important:
- Create the directory: `mkdir -p ~/Documents/Transcripts`
- Replace `yourname` with your actual username
- Restart Claude Desktop
Then ask:
"Start transcribing audio"
Step 4: Set System Output
Go to System Settings > Sound > Output and select "Multi-Output Device"
This routes audio to both your speakers (so you can hear) and BlackHole (for transcription).
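If you want to confirm that ffmpeg can see BlackHole before transcribing, you can list the devices macOS exposes. This is a standard ffmpeg/AVFoundation command, not part of this project's tooling:
```bash
# Lists AVFoundation video and audio devices; "BlackHole 2ch" should
# appear in the audio device section once the driver is installed.
ffmpeg -f avfoundation -list_devices true -i ""
```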
Step 5: Start Transcribing!
In Cursor or Claude Desktop, just ask:
"Start transcribing audio"
Your AI assistant will start capturing and transcribing audio in real-time!
What You Need
- macOS 10.15+ (Catalina or later)
- OpenAI API key (pay-as-you-go, ~$0.36/hour - see Costs & Performance below)
- 5 minutes for setup
Use Cases
- Meeting transcription - Zoom, Google Meet, Teams calls
- Content creation - Transcribe videos, podcasts, or music
- Accessibility - Real-time captions for any audio
- Note-taking - Automatic transcripts of lectures or presentations
- Research - Transcribe interviews or focus groups
Troubleshooting
Audio Not Being Captured
Problem: Test shows silent or very low audio levels
Solution:
- Check System Settings > Sound > Output is set to "Multi-Output Device"
- Open Audio MIDI Setup and verify both outputs are checked:
  - Built-in Output
  - BlackHole 2ch
- Play some audio and run `npx audio-transcription-mcp test` again
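If the test still reports silence, you can capture a short sample directly with ffmpeg to check whether audio is reaching BlackHole at all (the device name and output path here are just examples):
```bash
# Record 5 seconds from BlackHole while audio is playing, then listen to
# the file; ":<name>" selects an audio-only AVFoundation device.
ffmpeg -f avfoundation -i ":BlackHole 2ch" -t 5 /tmp/blackhole_test.wav
```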
BlackHole Not Showing Up
Problem: BlackHole doesn't appear in device list
Solution: Restart your Mac. Audio drivers require a restart to be recognized by the system.
Setup Script Fails
Problem: Automated setup doesn't work
Solution: The script will fall back to manual mode with clear instructions. This is normal on first run if accessibility permissions aren't granted. Just follow the 4-step guide shown.
Want to Start Over?
If you need to remove everything and start fresh:
```bash
# Uninstall BlackHole and ffmpeg
brew uninstall blackhole-2ch ffmpeg

# Delete the Multi-Output Device:
#   1. Open Audio MIDI Setup
#   2. Select "Multi-Output Device" in the left sidebar
#   3. Press the Delete key

# Then run setup again
npx audio-transcription-mcp setup
```
Need More Help?
- Report an Issue
- Discussions
Advanced Usage
Standalone CLI Mode
You can use this as a standalone CLI without MCP:
```bash
# Start transcription (saves to meeting_transcript.md)
npx audio-transcription-mcp start

# Press Ctrl+C to stop
```
Configure via a `.env` file:
```
OPENAI_API_KEY=sk-your-key-here
INPUT_DEVICE_NAME=BlackHole
CHUNK_SECONDS=8
OUTFILE=meeting_transcript.md
```
MCP Server Tools
When used with Cursor or Claude Desktop, these tools are available:
- `start_transcription` - Start capturing and transcribing audio
- `pause_transcription` - Pause transcription temporarily
- `resume_transcription` - Resume after pause
- `stop_transcription` - Stop and get session stats
- `get_status` - Check if transcription is running
- `get_transcript` - Retrieve current transcript content
- `clear_transcript` - Clear and start fresh
- `cleanup_transcript` - Delete transcript file
Configuration Options
Environment variables you can customize:
| Variable | Default | Description |
|---|---|---|
| `OPENAI_API_KEY` | (required) | Your OpenAI API key |
| `INPUT_DEVICE_NAME` | `BlackHole` | Audio input device name |
| `CHUNK_SECONDS` | `8` | Seconds of audio per chunk |
| `MODEL` | `whisper-1` | OpenAI Whisper model |
| `OUTFILE_DIR` | `process.cwd()` | Output directory for transcripts |
| `SAMPLE_RATE` | `16000` | Audio sample rate (Hz) |
| `CHANNELS` | `1` | Number of audio channels |
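For example, a `.env` that uses 15-second chunks and writes transcripts to a dedicated directory might look like this (the values are illustrative; only `OPENAI_API_KEY` is required):
```
OPENAI_API_KEY=sk-your-key-here
INPUT_DEVICE_NAME=BlackHole
CHUNK_SECONDS=15
OUTFILE_DIR=/Users/yourname/Documents/Transcripts
```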
How It Works
- Audio Routing: Multi-Output Device sends system audio to both your speakers and BlackHole
- Capture: ffmpeg captures audio from BlackHole in 8-second chunks
- Processing: Audio is converted to WAV format suitable for Whisper API
- Transcription: Each chunk is sent to OpenAI Whisper for transcription
- Output: Timestamped text is appended to a markdown file in real-time
- Silence Detection: Automatically pauses after 32 seconds of silence to save API costs
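To make the pipeline concrete, here is a minimal TypeScript sketch of a single capture-and-transcribe cycle. It assumes the `openai` npm package and an `OPENAI_API_KEY` in the environment; the function and file names are illustrative, not this project's actual internals:
```typescript
import { spawn } from "node:child_process";
import { appendFileSync, createReadStream } from "node:fs";
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
const CHUNK_SECONDS = 8;

async function captureAndTranscribeOnce(device: string, outfile: string): Promise<void> {
  // 1. Capture one chunk from the virtual device via ffmpeg (macOS AVFoundation).
  const wavPath = "/tmp/chunk.wav";
  await new Promise<void>((resolve, reject) => {
    const ff = spawn("ffmpeg", [
      "-y",
      "-f", "avfoundation",
      "-i", `:${device}`,          // ":<name>" selects an audio-only device
      "-t", String(CHUNK_SECONDS), // chunk length in seconds
      "-ac", "1",                  // mono, matching the CHANNELS default
      "-ar", "16000",              // 16 kHz, matching the SAMPLE_RATE default
      wavPath,
    ]);
    ff.on("exit", (code) =>
      code === 0 ? resolve() : reject(new Error(`ffmpeg exited with ${code}`))
    );
  });

  // 2. Send the WAV chunk to the Whisper API.
  const result = await openai.audio.transcriptions.create({
    file: createReadStream(wavPath),
    model: "whisper-1",
  });

  // 3. Append a timestamped entry to the markdown transcript.
  appendFileSync(outfile, `**${new Date().toISOString()}** ${result.text}\n\n`);
}

captureAndTranscribeOnce("BlackHole 2ch", "meeting_transcript.md").catch(console.error);
```
A real implementation would run this in a loop, buffer audio continuously rather than restarting ffmpeg per chunk, and skip silent chunks before step 2.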
Costs & Performance
What You're Paying For
You ONLY pay for OpenAI Whisper API calls - everything else runs locally for free!
FREE (runs locally on your machine):
- Audio capture with ffmpeg
- Audio processing and buffer management
- Silence detection and level analysis
- File operations (writing/reading transcripts)
- All MCP server operations
PAID (OpenAI API):
- Only the transcription API calls to OpenAI Whisper
- $0.006 per minute of audio transcribed
- Silent chunks are automatically skipped to save money
Actual Costs
With default 8-second chunks:
| Duration | API Calls | Approximate Cost |
|---|---|---|
| 1 minute | ~7.5 chunks | $0.006 |
| 1 hour | ~450 chunks | $0.36 |
| 8-hour workday | ~3,600 chunks | $2.88 |
Cost per chunk: ~$0.0008 (less than a tenth of a cent!)
Built-in Cost Savings
The tool includes smart silence detection that saves you money:
- Silent audio chunks are NEVER sent to OpenAI
- Automatically tracks cost savings in the debug log
- Auto-pauses after 32 seconds of silence
- View statistics with `get_status` to see chunks skipped
Example: In a 1-hour meeting with 15 minutes of silence, you save ~$0.09 automatically!
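That figure is simple arithmetic: skipped minutes times the $0.006/minute rate documented above. A quick sketch of the estimate (the function name is illustrative):
```typescript
// Rough session cost estimate using the $0.006/minute Whisper rate above.
const RATE_PER_MINUTE = 0.006;

function estimateCost(totalMinutes: number, silentMinutes: number) {
  return {
    billed: (totalMinutes - silentMinutes) * RATE_PER_MINUTE,
    saved: silentMinutes * RATE_PER_MINUTE,
  };
}

// A 1-hour meeting with 15 minutes of silence:
console.log(estimateCost(60, 15)); // ≈ { billed: 0.27, saved: 0.09 }
```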
Performance
- Memory usage: 50-100 MB per session
- CPU usage: Minimal (ffmpeg handles audio processing)
- API latency: 1-3 seconds per chunk
- Accuracy: 90-95% for clear speech
- Network: Only during transcription API calls
Cost Optimization Tips
- Increase chunk size - Fewer API calls (set `CHUNK_SECONDS=15`)
- Use silence detection - Enabled by default, saves money automatically
- Pause when not needed - Use `pause_transcription` during breaks
- Monitor usage - Check the OpenAI dashboard for actual costs
Bottom line: Transcription is cheap (~36¢/hour), runs mostly locally, and automatically saves money by skipping silence. You're only charged when actual speech is being transcribed.
Development & Testing
For contributors and developers:
For MCP Server Use (npx)
Just add the configuration to your Cursor or Claude Desktop config and restart - that's it! See the npx configuration in the Quick Start at the top of this README.
For Standalone CLI (Local Development)
```bash
# Install dependencies
npm install
npm run build

# Configure environment
cp env.example .env  # Then add your OpenAI API key

# Run standalone CLI
npm start
```
License & Contributing
This project is licensed under the MIT License - see the LICENSE file for details.
Contributions are welcome! Please feel free to submit a Pull Request.
Made with ❤️ for transcribing meetings, content, and conversations.
Star ⭐ this repo if you find it useful!