pmerwin/audio-transcription-mcp

Audio Transcription MCP Server

Real-time audio transcription using OpenAI Whisper. Capture and transcribe system audio (meetings, videos, music) automatically with AI assistance through Cursor or Claude Desktop.

✨ Features

  • 🎤 Real-time transcription - Captures and transcribes audio as it plays
  • 🔄 Zero installation - Use with npx, no global install needed
  • 🤖 AI-powered - Uses OpenAI's Whisper API for accurate transcription
  • 📝 Timestamped transcripts - Every entry is timestamped in markdown format
  • 🔒 Session isolation - Each session gets its own unique transcript file
  • ⚡ Smart silence detection - Automatically pauses when no audio is detected
  • 🎯 Automated setup - One command sets up audio routing
  • 🧪 Built-in testing - Verify your setup before starting

🚀 Quick Start (5 Minutes)

Step 1: Run Automated Setup

The setup script installs everything you need and guides you through configuration:

npx audio-transcription-mcp setup

What this does:

  • ✅ Installs Homebrew (if needed)
  • ✅ Installs ffmpeg for audio processing
  • ✅ Installs the BlackHole virtual audio driver
  • ✅ Guides you through creating a Multi-Output Device (or creates it automatically)
  • ✅ Takes about 5 minutes, mostly automated

First time? The script will walk you through everything with clear instructions. Don't worry if it asks for your Mac password - that's normal for installing software!

Step 2: Test Your Setup

Verify everything works before using it:

npx audio-transcription-mcp test

This captures 5 seconds of audio and shows you if it's working correctly.

Step 3: Configure Your AI Assistant

Add to your Cursor or Claude Desktop config:

Cursor Configuration

Edit ~/.cursor/config.json:

{
  "mcpServers": {
    "audio-transcription": {
      "command": "npx",
      "args": ["-y", "audio-transcription-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-your-key-here",
        "INPUT_DEVICE_NAME": "BlackHole"
      }
    }
  }
}

Then restart Cursor and ask:

"Start transcribing audio"

Claude Desktop Configuration

Edit ~/Library/Application Support/Claude/claude_desktop_config.json:

{
  "mcpServers": {
    "audio-transcription": {
      "command": "npx",
      "args": ["-y", "audio-transcription-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-your-key-here",
        "INPUT_DEVICE_NAME": "BlackHole",
        "OUTFILE_DIR": "/Users/yourname/Documents/Transcripts"
      },
      "allowedDirectories": [
        "/Users/yourname/Documents/Transcripts"
      ]
    }
  }
}

Important:

  1. Create the directory: mkdir -p ~/Documents/Transcripts
  2. Replace yourname with your actual username
  3. Restart Claude Desktop

Then ask:

"Start transcribing audio"

Step 4: Set System Output

Go to System Settings > Sound > Output and select "Multi-Output Device"

This routes audio to both your speakers (so you can hear) and BlackHole (for transcription).

Step 5: Start Transcribing!

In Cursor or Claude Desktop, just ask:

"Start transcribing audio"

Your AI assistant will start capturing and transcribing audio in real-time!


📖 What You Need

  • A Mac (audio routing relies on the BlackHole driver and Audio MIDI Setup)
  • An OpenAI API key for the Whisper API
  • About 5 minutes for the automated setup

🎯 Use Cases

  • Meeting transcription - Zoom, Google Meet, Teams calls
  • Content creation - Transcribe videos, podcasts, or music
  • Accessibility - Real-time captions for any audio
  • Note-taking - Automatic transcripts of lectures or presentations
  • Research - Transcribe interviews or focus groups

🔧 Troubleshooting

Audio Not Being Captured

Problem: Test shows silent or very low audio levels

Solution:

  1. Check System Settings > Sound > Output is set to "Multi-Output Device"
  2. Open Audio MIDI Setup and verify both outputs are checked:
    • ☑ Built-in Output
    • ☑ BlackHole 2ch
  3. Play some audio and run npx audio-transcription-mcp test again

BlackHole Not Showing Up

Problem: BlackHole doesn't appear in device list

Solution: Restart your Mac. Audio drivers require a restart to be recognized by the system. After restarting, you can verify that ffmpeg sees the device with ffmpeg -f avfoundation -list_devices true -i "".

Setup Script Fails

Problem: Automated setup doesn't work

Solution: The script will fall back to manual mode with clear instructions. This is normal on first run if accessibility permissions aren't granted. Just follow the 4-step guide shown.

Want to Start Over?

If you need to remove everything and start fresh:

# Uninstall BlackHole and ffmpeg
brew uninstall blackhole-2ch ffmpeg

# Delete Multi-Output Device
# 1. Open Audio MIDI Setup
# 2. Select "Multi-Output Device" in left sidebar
# 3. Press Delete key

# Then run setup again
npx audio-transcription-mcp setup


πŸ› οΈ Advanced Usage

Standalone CLI Mode

You can use this as a standalone CLI without MCP:

# Start transcription (saves to meeting_transcript.md)
npx audio-transcription-mcp start

# Press Ctrl+C to stop

Configure via .env file:

OPENAI_API_KEY=sk-your-key-here
INPUT_DEVICE_NAME=BlackHole
CHUNK_SECONDS=8
OUTFILE=meeting_transcript.md

MCP Server Tools

When used with Cursor or Claude Desktop, these tools are available:

  • start_transcription - Start capturing and transcribing audio
  • pause_transcription - Pause transcription temporarily
  • resume_transcription - Resume after pause
  • stop_transcription - Stop and get session stats
  • get_status - Check if transcription is running
  • get_transcript - Retrieve current transcript content
  • clear_transcript - Clear and start fresh
  • cleanup_transcript - Delete transcript file

Configuration Options

Environment variables you can customize:

| Variable | Default | Description |
|---|---|---|
| OPENAI_API_KEY | (required) | Your OpenAI API key |
| INPUT_DEVICE_NAME | BlackHole | Audio input device name |
| CHUNK_SECONDS | 8 | Seconds of audio per chunk |
| MODEL | whisper-1 | OpenAI Whisper model |
| OUTFILE_DIR | process.cwd() | Output directory for transcripts |
| SAMPLE_RATE | 16000 | Audio sample rate (Hz) |
| CHANNELS | 1 | Number of audio channels |
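As a sketch of how these variables might be resolved against their defaults (the envOr helper is hypothetical, not the package's actual code):

```typescript
// Hypothetical sketch: resolve an environment variable, falling back to a default
function envOr(name: string, fallback: string): string {
  const v = process.env[name];
  return v !== undefined && v !== "" ? v : fallback;
}

const config = {
  inputDeviceName: envOr("INPUT_DEVICE_NAME", "BlackHole"),
  chunkSeconds: Number(envOr("CHUNK_SECONDS", "8")),
  model: envOr("MODEL", "whisper-1"),
  outfileDir: envOr("OUTFILE_DIR", process.cwd()),
  sampleRate: Number(envOr("SAMPLE_RATE", "16000")),
  channels: Number(envOr("CHANNELS", "1")),
};
```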

πŸ—οΈ How It Works

  1. Audio Routing: Multi-Output Device sends system audio to both your speakers and BlackHole
  2. Capture: ffmpeg captures audio from BlackHole in 8-second chunks
  3. Processing: Audio is converted to WAV format suitable for Whisper API
  4. Transcription: Each chunk is sent to OpenAI Whisper for transcription
  5. Output: Timestamped text is appended to a markdown file in real-time
  6. Silence Detection: Automatically pauses after 32 seconds of silence to save API costs
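The silence-detection step can be illustrated with a simple RMS check over 16-bit PCM samples (the function and threshold below are illustrative, not the package's actual implementation):

```typescript
// Illustrative RMS-based silence check for 16-bit mono PCM audio
function isSilent(samples: Int16Array, threshold = 500): boolean {
  let sumSquares = 0;
  for (const s of samples) sumSquares += s * s;
  const rms = Math.sqrt(sumSquares / samples.length);
  return rms < threshold; // below threshold: skip the Whisper API call
}
```

Chunks classified as silent are never sent to the API, which is where the cost savings described below come from.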

💰 Costs & Performance

What You're Paying For

You ONLY pay for OpenAI Whisper API calls - everything else runs locally for free!

✅ FREE (runs locally on your machine):

  • Audio capture with ffmpeg
  • Audio processing and buffer management
  • Silence detection and level analysis
  • File operations (writing/reading transcripts)
  • All MCP server operations

💰 PAID (OpenAI API):

  • Only the transcription API calls to OpenAI Whisper
  • $0.006 per minute of audio transcribed
  • Silent chunks are automatically skipped to save money

Actual Costs

With default 8-second chunks:

| Duration | API Calls | Approximate Cost |
|---|---|---|
| 1 minute | ~7.5 chunks | $0.006 |
| 1 hour | ~450 chunks | $0.36 |
| 8-hour workday | ~3,600 chunks | $2.88 |

Cost per chunk: ~$0.0008 (less than a tenth of a cent!)
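The figures above follow directly from the $0.006-per-minute pricing; a quick sanity check:

```typescript
// Reproduce the cost table: $0.006 per minute of transcribed audio,
// captured in 8-second chunks by default
const PRICE_PER_MINUTE = 0.006;
const CHUNK_SECONDS = 8;

function estimateCost(totalSeconds: number): number {
  return (totalSeconds / 60) * PRICE_PER_MINUTE;
}

function chunkCount(totalSeconds: number): number {
  return totalSeconds / CHUNK_SECONDS;
}
```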

Built-in Cost Savings

The tool includes smart silence detection that saves you money:

  • 🔇 Silent audio chunks are NEVER sent to OpenAI
  • 💰 Automatically tracks cost savings in the debug log
  • ⏸️ Auto-pauses after 32 seconds of silence
  • 📊 View statistics with get_status to see chunks skipped

Example: In a 1-hour meeting with 15 minutes of silence, you save ~$0.09 automatically!

Performance

  • Memory usage: 50-100 MB per session
  • CPU usage: Minimal (ffmpeg handles audio processing)
  • API latency: 1-3 seconds per chunk
  • Accuracy: 90-95% for clear speech
  • Network: Only during transcription API calls

Cost Optimization Tips

  1. Increase chunk size - Fewer API calls (set CHUNK_SECONDS=15)
  2. Use silence detection - Enabled by default, saves money automatically
  3. Pause when not needed - Use pause_transcription during breaks
  4. Monitor usage - Check OpenAI dashboard for actual costs

Bottom line: Transcription is cheap (~36Β’/hour), runs mostly locally, and automatically saves money by skipping silence. You're only charged when actual speech is being transcribed.

🧪 Development & Testing

For contributors and developers:

For MCP Server (npx)

📖 See the development documentation for complete setup instructions.

Just add the server to your Cursor or Claude Desktop config and restart - that's it! See the npx configuration at the top of this README.

For Standalone CLI (Local Development)

📖 See the development documentation for complete setup instructions.

# Install dependencies
npm install
npm run build

# Configure environment
cp env.example .env  # Then add your OpenAI API key

# Run standalone CLI
npm start

📄 License & Contributing

This project is licensed under the MIT License - see the LICENSE file for details.

Contributions are welcome! Please feel free to submit a Pull Request.


Made with ❤️ for transcribing meetings, content, and conversations.

Star ⭐ this repo if you find it useful!