
Audio Transcription MCP Server

An MCP (Model Context Protocol) server that enables Claude to transcribe and analyze audio files, specifically designed for French language learning exercises.

Features

  • Download MP3 audio files from HTTP/HTTPS URLs
  • Transcribe French audio to text using OpenAI Whisper
  • Translate French text to English using GPT-4 (both steps are sketched after this list)
  • Analyze French imperative sentences to identify the subject pronoun (tu, vous, nous)
  • Secure handling of temporary files with automatic cleanup
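
Both steps go through the OpenAI API. A minimal sketch of the transcription-plus-translation flow, assuming the official openai Node package and an already-downloaded MP3 (file paths and prompt wording are illustrative, not the server's exact implementation):

import fs from "node:fs";
import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Transcribe a downloaded MP3 with Whisper, forcing French.
async function transcribeFrench(filePath: string): Promise<string> {
  const result = await openai.audio.transcriptions.create({
    file: fs.createReadStream(filePath),
    model: "whisper-1",
    language: "fr",
  });
  return result.text;
}

// Translate the French transcription to English with GPT-4.
async function translateToEnglish(frenchText: string): Promise<string> {
  const completion = await openai.chat.completions.create({
    model: "gpt-4",
    messages: [
      { role: "system", content: "Translate the user's French text to English." },
      { role: "user", content: frenchText },
    ],
  });
  return completion.choices[0].message.content ?? "";
}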

Prerequisites

  • Bun runtime (latest stable version)
  • OpenAI API key with access to Whisper and GPT-4

Installation

  1. Clone the repository and change into the project directory:
cd /Users/joshnewton/Development/audio-transcription-mcp
  2. Install dependencies:
bun install
  3. Set up environment variables:
export OPENAI_API_KEY="your-openai-api-key"
export LOG_LEVEL="info"  # Optional: debug, info, error
export TEMP_DIR="/tmp/audio-transcription"  # Optional: custom temp directory
export MAX_FILE_SIZE="25000000"  # Optional: max file size in bytes (default 25MB)
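
A minimal sketch of how the server could read these variables with their documented defaults (the actual loading code may differ):

// Read configuration from the environment, falling back to the documented defaults.
const config = {
  openaiApiKey: process.env.OPENAI_API_KEY ?? "",                // required
  logLevel: process.env.LOG_LEVEL ?? "info",                     // debug | info | error
  tempDir: process.env.TEMP_DIR ?? "/tmp/audio-transcription",
  maxFileSize: Number(process.env.MAX_FILE_SIZE ?? 25_000_000),  // bytes (25MB default)
};

if (!config.openaiApiKey) {
  throw new Error("OPENAI_API_KEY is required");
}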

Usage

Starting the Server

bun run index.ts

Or via the package.json scripts:

bun start  # Production mode
bun dev    # Development mode with auto-reload

Integration with Claude Code

Add the server to Claude Code:

claude mcp add /Users/joshnewton/Development/audio-transcription-mcp

Available Tools

1. transcribe_audio

Downloads and transcribes an audio file from a URL.

Input:

{
  "url": "https://example.com/audio.mp3"
}

Output:

{
  "transcription": "Original French text",
  "translation": "English translation",
  "language": "French"
}

2. analyze_french_imperative

Analyzes a French audio file to determine the imperative subject pronoun.

Input:

{
  "url": "https://example.com/french-command.mp3"
}

Output:

{
  "transcription": "Γ‰coutez attentivement",
  "translation": "Listen carefully",
  "subject": "vous",
  "analysis": "The verb 'Γ©coutez' ends in -ez, which indicates the formal/plural 'vous' form"
}

Example Usage in Claude

User: "Listen to this audio and tell me who is being given the command: https://example.com/french-audio.mp3"
Claude: [Uses analyze_french_imperative tool]
Result: The command "Écoutez attentivement" is addressed to "vous" (formal or plural 'you')

Development

Running Tests

bun test

Project Structure

audio-transcription-mcp/
├── index.ts                 # Main entry point
├── src/
│   ├── server.ts           # MCP server setup
│   ├── handlers/           # Tool request handlers
│   ├── services/           # Core services
│   └── utils/              # Utility functions
└── tests/                  # Test files

Security Considerations

  • Only HTTP/HTTPS URLs are accepted (no file:// or local paths)
  • File size is limited to 25MB by default (both checks are sketched after this list)
  • Temporary files are automatically cleaned up
  • API keys are never logged
  • Input validation on all user-provided data
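
A minimal sketch of the first two checks, using the standard URL and Fetch APIs available in Bun (helper names are illustrative, not the server's actual code):

const MAX_FILE_SIZE = Number(process.env.MAX_FILE_SIZE ?? 25_000_000); // 25MB default

// Reject anything that is not plain HTTP/HTTPS (blocks file://, data:, and local paths).
function validateUrl(raw: string): URL {
  const url = new URL(raw); // throws on malformed input
  if (url.protocol !== "http:" && url.protocol !== "https:") {
    throw new Error(`Unsupported protocol: ${url.protocol}`);
  }
  return url;
}

// Enforce the size limit before writing the download to the temp directory.
function assertWithinSizeLimit(response: Response): void {
  const length = Number(response.headers.get("content-length") ?? 0);
  if (length > MAX_FILE_SIZE) {
    throw new Error(`File exceeds ${MAX_FILE_SIZE} bytes`);
  }
}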

Multi-Language Support Implementation Plan

Current State

The server is currently optimized for French language learning with hardcoded language parameters in several components.

Changes Required for Multi-Language Support

Phase 1: Core Multi-Language Infrastructure

Service Layer Updates:

  • transcriber.ts:34 - Remove hardcoded language: "fr" parameter
  • analyzer.ts:29-42 - Replace French-specific imperative analysis with configurable grammar analysis
  • translator.ts:25 - Make source/target languages configurable parameters (sketched below)
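
For example, translator.ts could accept source and target languages as parameters instead of assuming French-to-English. A sketch of the intended shape, assuming the openai chat completions API (not the file's actual contents):

import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Source and target become parameters; English stays the default target.
export async function translate(
  text: string,
  sourceLanguage: string,
  targetLanguage = "en"
): Promise<string> {
  const completion = await openai.chat.completions.create({
    model: "gpt-4",
    messages: [
      {
        role: "system",
        content: `Translate the user's text from ${sourceLanguage} to ${targetLanguage}.`,
      },
      { role: "user", content: text },
    ],
  });
  return completion.choices[0].message.content ?? "";
}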

Tool Schema Changes:

  • Replace analyze_french_imperative with a generic analyze_grammar tool (schema sketched below)
  • Add required language parameter to tool input schemas
  • Add optional targetLanguage parameter for translation control
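
A sketch of what the new tool definition could look like, following the MCP convention of a JSON Schema inputSchema (field names follow the plan above and are not final):

// Hypothetical definition for the planned analyze_grammar tool.
const analyzeGrammarTool = {
  name: "analyze_grammar",
  description: "Transcribe an audio file and run language-specific grammar analysis",
  inputSchema: {
    type: "object",
    properties: {
      url: { type: "string", description: "HTTP/HTTPS URL of the audio file" },
      language: { type: "string", description: "ISO 639-1 code, e.g. 'fr', 'es', 'de'" },
      targetLanguage: { type: "string", description: "Optional translation target (defaults to English)" },
    },
    required: ["url", "language"],
  },
};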

Handler Updates:

  • Add language parameter validation against supported languages (sketched below)
  • Pass language codes to all services
  • Update response formatting to include language metadata
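
A sketch of the validation step, assuming a supported-language set derived from configuration (names are illustrative):

const SUPPORTED_LANGUAGES = new Set(["fr", "es", "de", "it"]); // hypothetical; driven by config in practice

// Reject unsupported codes early, before any download or API call.
function assertSupportedLanguage(language: string): void {
  if (!SUPPORTED_LANGUAGES.has(language)) {
    throw new Error(
      `Unsupported language '${language}'. Supported: ${[...SUPPORTED_LANGUAGES].join(", ")}`
    );
  }
}
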
Phase 2: Language-Specific Analysis Modules

New File Structure:

src/services/
├── analysis/
│   ├── french.ts     # Imperative analysis (tu/vous/nous)
│   ├── spanish.ts    # Ser vs Estar, subjunctive detection
│   ├── german.ts     # Case analysis (Nominativ, Akkusativ, Dativ, Genitiv)
│   ├── italian.ts    # Subjunctive mood, formal/informal register
│   └── base.ts       # Common analysis interface
├── languages.ts      # Language configurations and metadata
└── language-detector.ts # Auto-detection fallback logic
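
A possible shape for the common interface in base.ts, which each language module would implement (type names are hypothetical):

// base.ts - shared contract for language-specific analysis modules.
export interface GrammarAnalysis {
  transcription: string;
  translation: string;
  findings: Record<string, string>; // e.g. { subject: "vous" } for French imperatives
}

export interface GrammarAnalyzer {
  readonly languageCode: string; // ISO 639-1, e.g. "fr"
  analyze(transcription: string): Promise<GrammarAnalysis>;
}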

Language Configuration System:

interface LanguageConfig {
  code: string;           // ISO 639-1 code ('fr', 'es', 'de', etc.)
  name: string;           // Display name
  whisperCode: string;    // Whisper API language code
  analysisRules: object;  // Language-specific grammar rules
  defaultTarget: string;  // Default translation target
}
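
For example, the French entry in languages.ts might look like this (values are illustrative):

const french: LanguageConfig = {
  code: "fr",
  name: "French",
  whisperCode: "fr",
  analysisRules: { imperativeSubjects: ["tu", "vous", "nous"] },
  defaultTarget: "en",
};
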
Phase 3: Advanced Features

  • Auto-Detection: Use Whisper without a language parameter, with confidence scoring
  • Environment Configuration: SUPPORTED_LANGUAGES=fr,es,de,it (parsing sketched after this list)
  • Fallback Handling: Default language and error recovery
  • Performance: Language-specific caching and optimization
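
A sketch of the environment-driven configuration and fallback handling, assuming the comma-separated format shown above:

// Parse SUPPORTED_LANGUAGES=fr,es,de,it into a list, falling back to French only.
const supported = (process.env.SUPPORTED_LANGUAGES ?? "fr")
  .split(",")
  .map((code) => code.trim().toLowerCase())
  .filter(Boolean);

// Fallback handling: unknown or missing codes drop back to the default language.
const DEFAULT_LANGUAGE = supported[0] ?? "fr";
function resolveLanguage(requested?: string): string {
  return requested && supported.includes(requested) ? requested : DEFAULT_LANGUAGE;
}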

Estimated Implementation Time: 3-4 days for core infrastructure, +1-2 days per additional language's specialized analysis rules.

Roadmap & Feature Ideas

Language Expansion

  • Multi-language support (Spanish, German, Italian, Portuguese)
  • Auto-detect language in audio files
  • Cross-language comparative analysis tools
  • Language-specific linguistic analysis patterns

Enhanced French Analysis

  • Verb tense identification and explanation
  • Subjunctive mood detection
  • Conditional vs. indicative mood analysis
  • Grammar error detection and suggestions
  • Vocabulary difficulty level assessment
  • Formal vs informal register analysis beyond imperatives

Advanced Audio Processing

  • Support for additional audio formats (WAV, M4A, AAC, FLAC)
  • Audio enhancement and noise reduction
  • Speaker diarization (multiple speakers)
  • Audio speed/pitch adjustment for learning
  • Batch processing of multiple audio files
  • Audio segmentation by sentences/phrases
  • Real-time streaming transcription

Language Learning Tools

  • Vocabulary extraction with frequency analysis
  • Automatic flashcard generation from transcriptions
  • Pronunciation scoring and feedback
  • IPA (International Phonetic Alphabet) notation
  • Cultural context and idiom explanations
  • Sentence complexity scoring
  • Learning progress tracking and analytics

Output & Export Features

  • Subtitle file generation (SRT, VTT, ASS)
  • PDF study guides with translations and notes
  • Anki deck export for spaced repetition
  • Study worksheet generation
  • Audio-synchronized text highlighting
  • Integration with popular language learning apps

Performance & Technical Improvements

  • Caching for repeated audio URLs
  • Streaming support for large audio files
  • Multiple AI model options (local vs cloud)
  • Rate limiting and usage analytics
  • WebSocket support for real-time features
  • Database integration for user progress
  • API versioning and backward compatibility

Specialized Analysis Tools

  • Emotion/sentiment analysis in speech
  • Speaking pace and fluency analysis
  • Accent identification and analysis
  • Conversation flow analysis
  • Politeness markers detection
  • Regional dialect identification
  • Academic vs colloquial language detection

License

MIT