Prep for Research Analysis MCP Server
A specialized Model Context Protocol (MCP) server for converting transcripts and documents to clean Markdown format for research analysis workflows.
🎯 Purpose
Transform raw transcripts, documents, and images into clean, anonymized Markdown perfect for research analysis. Designed specifically for researchers, analysts, and anyone working with interview data, meeting transcripts, or document conversion workflows.
✨ Key Features
- 🎙️ VTT Transcript Conversion: Convert subtitle files with intelligent speaker identification
- 📄 Text Document Conversion: Transform plain text files (.txt) into structured Markdown
- 🔒 Privacy Protection: Automatic speaker anonymization, e.g., "Jennifer Adams" → "(JA)" (see the sketch after this list)
- 📊 Research-Ready Output: YAML frontmatter for AI analysis tools and Agent Playground
- 🖼️ Image Processing: Extract content from screenshots and documents using OCR
- 📝 Content Sanitization: Remove emails, phone numbers, and sensitive data
- ⚡ Production-Ready: Handles large files (6000+ lines tested) efficiently
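Speaker anonymization maps each detected name to a parenthesized set of initials. Here is a minimal JavaScript sketch of that idea; the helper names `toInitials` and `anonymizeSpeakers` are hypothetical and not taken from the server's code.

```js
// Illustrative only: derive anonymized labels such as "(JA)" from speaker names.
// Assumes names may carry titles or pronouns, as in "Dr. Jennifer Adams (She/Her)".
function toInitials(name) {
  const cleaned = name
    .replace(/\(.*?\)/g, "")                 // drop "(She/Her)"-style pronouns
    .replace(/\b(Dr|Mr|Mrs|Ms)\.?\s/gi, "")  // drop common titles
    .trim();
  const initials = cleaned.split(/\s+/).map((part) => part[0].toUpperCase()).join("");
  return `(${initials})`;
}

function anonymizeSpeakers(text, speakerNames) {
  let result = text;
  for (const name of speakerNames) {
    result = result.split(name).join(toInitials(name));
  }
  return result;
}

console.log(toInitials("Dr. Jennifer Adams (She/Her)")); // -> (JA)
```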
🚀 Quick Start
Installation
git clone https://github.com/your-username/prep-for-research-analysis.git
cd prep-for-research-analysis
npm install
Configuration
- Copy the environment template: `cp .env.example .env`
- (Optional) Add Azure AI credentials for advanced image processing (see the loading sketch below):
  - `AZURE_AI_FOUNDRY_ENDPOINT=your-endpoint-here`
  - `AZURE_AI_FOUNDRY_KEY=your-key-here`
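How the server consumes these values is not shown here; below is a minimal sketch of one way to load them with the dotenv package and fall back to local OCR when they are absent. The `useAzure` flag is illustrative, not the server's actual logic.

```js
// Minimal sketch (not the actual startup code): read optional Azure credentials
// from .env and decide whether advanced image processing is available.
import dotenv from "dotenv";

dotenv.config();

const azureEndpoint = process.env.AZURE_AI_FOUNDRY_ENDPOINT;
const azureKey = process.env.AZURE_AI_FOUNDRY_KEY;

const useAzure = Boolean(azureEndpoint && azureKey); // hypothetical gating flag
console.log(useAzure ? "Azure AI image processing enabled" : "Using Tesseract.js OCR only");
```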
Start the Server
npm start
💬 GitHub Copilot Integration
Setup
Add to your VS Code settings.json:
{
  "github.copilot.chat.mcp.servers": {
    "prep-for-research-analysis": {
      "name": "Prep for Research Analysis",
      "command": "node",
      "args": ["prep-for-research-analysis-server.js"],
      "cwd": "/path/to/your/project"
    }
  }
}
Usage Examples
@prep-for-research-analysis convert_vtt_to_md filePath="interview.vtt" anonymizeSpeakers=true
@prep-for-research-analysis convert_txt_to_md filePath="notes.txt" title="Meeting Notes" tags=["meeting", "notes"] anonymize=true
@prep-for-research-analysis process_file_for_azure inputPath="meeting.vtt" title="Strategy Meeting" tags=["strategy", "planning"]
@prep-for-research-analysis convert_image_to_md filePath="diagram.png" title="System Architecture"
📊 Sample Output
Input (VTT)
WEBVTT
00:07:10.709 --> 00:07:15.070
<v Jennifer Adams>I think we need to focus on user experience first.</v>
00:07:15.070 --> 00:07:17.709
<v Mike Johnson>Absolutely, that's our top priority.</v>
Output (Markdown)
---
title: Strategy Meeting
created: '2025-06-27T18:45:00.000Z'
content_type: transcript
tags:
- strategy
- planning
azure_ai_foundry:
  ready: true
  format_version: '1.0'
---
# Transcript
## Speaker Key
- (JA): [Speaker anonymized]
- (MJ): [Speaker anonymized]
---
**(JA):** I think we need to focus on user experience first.
**(MJ):** Absolutely, that's our top priority.
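The conversion hinges on pulling the speaker and utterance out of each WebVTT voice span. A rough sketch of that parsing step, not taken from the repository (`parseCueLine` is a hypothetical helper):

```js
// Illustrative cue parsing: extract speaker and text from "<v Name>...</v>" spans.
const CUE_RE = /<v\s+([^>]+)>(.*?)<\/v>/;

function parseCueLine(line) {
  const match = CUE_RE.exec(line);
  if (!match) return null; // timestamps, headers, and blank lines are skipped
  const [, speaker, text] = match;
  return { speaker: speaker.trim(), text: text.trim() };
}

console.log(parseCueLine("<v Jennifer Adams>I think we need to focus on user experience first.</v>"));
// -> { speaker: 'Jennifer Adams', text: 'I think we need to focus on user experience first.' }
```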
🛠️ Available Tools
| Tool | Description | Use Case |
|---|---|---|
| `convert_vtt_to_md` | Convert VTT files to Markdown | Basic transcript conversion |
| `convert_txt_to_md` | Convert plain text files to Markdown | Document structure and anonymization |
| `process_file_for_azure` | Full pipeline with YAML frontmatter | Research analysis preparation |
| `convert_image_to_md` | Extract text from images | Document digitization |
| `anonymize_content` | Remove sensitive information | Privacy protection |
| `add_yaml_frontmatter` | Add metadata headers | AI tool compatibility |
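For orientation, the frontmatter shown in the sample output can be assembled with a few string operations. The `buildFrontmatter` helper below is illustrative, not code from the server:

```js
// Sketch of the metadata header that add_yaml_frontmatter conceptually produces.
function buildFrontmatter({ title, contentType = "transcript", tags = [] }) {
  return [
    "---",
    `title: ${title}`,
    `created: '${new Date().toISOString()}'`,
    `content_type: ${contentType}`,
    "tags:",
    ...tags.map((tag) => `  - ${tag}`),
    "azure_ai_foundry:",
    "  ready: true",
    "  format_version: '1.0'",
    "---",
    "",
  ].join("\n");
}

console.log(buildFrontmatter({ title: "Strategy Meeting", tags: ["strategy", "planning"] }));
```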
📁 Project Structure
prep-for-research-analysis/
├── prep-for-research-analysis-server.js # Main MCP server
├── package.json # Dependencies
├── .env.example # Environment template
├── tests/
│ └── sample-files/ # Test data
├── docs/ # Documentation
└── .vscode/ # VS Code integration
🧪 Testing
# Run all tests
npm run test:all
# Test basic functionality
npm run test:basic
# Test VTT conversion
npm test
📖 Documentation
The docs/ directory includes:
- Step-by-step conversion instructions
- Setup guide for VS Code
- Copilot configuration
- Comprehensive usage documentation
⚙️ Advanced Configuration
Large File Processing
The server efficiently handles large transcripts (6000+ lines tested). For extremely large files, the parser uses streaming techniques to maintain performance.
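One common way to stream a large transcript line by line in Node.js is shown below, purely to illustrate the approach; the server's actual parser may work differently.

```js
// Illustration of line-by-line streaming for very large VTT files (not the repo's code).
import fs from "node:fs";
import readline from "node:readline";

async function streamCueLines(filePath, onCueLine) {
  const rl = readline.createInterface({
    input: fs.createReadStream(filePath, { encoding: "utf8" }),
    crlfDelay: Infinity, // treat \r\n as a single line break
  });
  for await (const line of rl) {
    if (line.includes("<v ")) onCueLine(line); // only voice-span lines carry speech
  }
}

// await streamCueLines("interview.vtt", (line) => console.log(line));
```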
Custom Anonymization
Speaker anonymization can be customized by modifying the extractSpeakerInfo function in the main server file.
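Beyond speaker labels, the `anonymize_content` tool also strips emails and phone numbers (see Key Features). The patterns below are a deliberately simple illustration of that kind of sanitization pass, not the server's actual rules:

```js
// Illustrative sanitization: redact emails and common phone-number shapes.
const EMAIL_RE = /[\w.+-]+@[\w-]+\.[\w.]+/g;
const PHONE_RE = /\+?\d[\d\s().-]{7,}\d/g;

function sanitizeContent(text) {
  return text.replace(EMAIL_RE, "[email removed]").replace(PHONE_RE, "[phone removed]");
}

console.log(sanitizeContent("Reach me at jane.doe@example.com or +1 (555) 010-2030."));
// -> "Reach me at [email removed] or [phone removed]."
```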
Azure AI Integration
For advanced image processing, configure Azure AI Foundry credentials in your .env file.
🤝 Contributing
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
🐛 Troubleshooting
Common Issues
Server won't start:
- Check Node.js version (18+ required)
- Verify all dependencies are installed (`npm install`)
Large files timing out:
- Increase timeout in your MCP client
- Consider processing in chunks for extremely large files
Image processing fails:
- Ensure Tesseract.js dependencies are installed
- Check Azure AI credentials if using advanced mode
📄 License
This project is licensed under the MIT License; see the LICENSE file for details.
🙏 Acknowledgments
- Built on the Model Context Protocol
- Uses Tesseract.js for OCR processing
- Inspired by the need for clean, research-ready transcript conversion
Ready to convert your transcripts? Start with the Quick Start section above!
Project Structure
mcp-explore/
├── azure-ai-foundry-server-fixed.js # Main MCP server
├── mcp-explorer.js # MCP client utility
├── package.json # Node.js dependencies
├── .env # Environment variables (your credentials)
├── .env.example # Environment template
├── README.md # This file
├── USAGE_GUIDE.md # Comprehensive usage guide
├── CHANGELOG.md # Version history
├── CONTRIBUTING.md # Contribution guidelines
├── LICENSE # MIT license
├── examples/ # Sample files for testing
│ ├── sample-presentation.vtt # Sample VTT file
│ ├── research-meeting.vtt # Another VTT example
│ └── *.md # Generated outputs
├── tests/ # Test suite
│ ├── quick-validation.js # Quick server validation
│ ├── full-pipeline-test.js # Complete workflow test
│ ├── test-azure-ai-foundry.js # Azure AI specific tests
│ └── test-enhanced-vtt.js # VTT processing tests
├── docs/ # Additional documentation
└── .github/ # GitHub workflows
Installation
npm install
Usage
Start the MCP Server
npm start
# or
npm run server
Run Tests
# Run all enhanced tests
npm test
# Run specific tests
npm run test:basic # Basic functionality
npm run test:pipeline # Full pipeline test
npm run test:enhanced # Enhanced VTT processing
Available Tools
- convert_vtt_to_md - Convert VTT files to Markdown with speaker anonymization
- convert_image_to_md - Convert images to Markdown using OCR or Azure AI Foundry
- anonymize_content - Remove sensitive information from text
- add_yaml_frontmatter - Add Azure AI Foundry metadata
- process_file_for_azure - Complete pipeline (recommended)
Example Usage
Input VTT Format
WEBVTT
3e2f4a5b-9d8c-5f3e-b2c3-4d5e6f708901/10-0
00:00:01.000 --> 00:00:06.000
<v Dr. Jennifer Adams (She/Her)>Good afternoon everyone.</v>
Output Markdown
---
title: Meeting Transcript
created: '2025-06-27T00:00:00.000Z'
content_type: transcript
tags: ['meeting', 'anonymized', 'azure-ai-foundry']
azure_ai_foundry:
  ready: true
  format_version: '1.0'
---
# Transcript
## Speaker Key
- (JA): [Speaker anonymized]
---
**(JA):** Good afternoon everyone.
Directory Structure
├── azure-ai-foundry-server-fixed.js # Main MCP server
├── mcp-explorer.js # MCP client
├── package.json # Project configuration
├── README.md # This file
├── examples/ # Sample files and outputs
│ ├── research-meeting.vtt # Sample complex VTT
│ ├── sample-presentation.vtt # Sample simple VTT
│ ├── anonymized-research-meeting.md # Sample output
│ └── processed-presentation.md # Sample output
└── tests/ # Test scripts
├── test-azure-ai-server.js # Basic tests
├── test-enhanced-vtt.js # Enhanced VTT tests
└── full-pipeline-test.js # Complete pipeline tests
Supported File Types
- VTT: WebVTT subtitle files with speaker identification
- Images: PNG, JPG, JPEG, GIF, BMP, TIFF, WEBP with OCR or AI processing
- Future: Additional formats as needed
Processing Options
Image Processing
- OCR Mode (default): Uses Tesseract.js for text extraction (see the sketch below)
- Azure AI Foundry Mode: Uses multimodal AI for enhanced content understanding
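A conceptual example of the default OCR mode is sketched below using Tesseract.js's promise-based `recognize()`; `imageToMarkdown` is a hypothetical helper, not code from the repository.

```js
// Conceptual OCR-mode call: recognize() resolves with the extracted text.
import Tesseract from "tesseract.js";

async function imageToMarkdown(imagePath, title = "Extracted Content") {
  const { data } = await Tesseract.recognize(imagePath, "eng");
  return `# ${title}\n\n${data.text.trim()}\n`;
}

// imageToMarkdown("diagram.png", "System Architecture").then(console.log);
```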
Configuration
To use Azure AI Foundry mode:
- Copy `.env.example` to `.env`
- Add your Azure AI Foundry credentials:
  - `AZURE_AI_FOUNDRY_ENDPOINT=https://your-resource.openai.azure.com/`
  - `AZURE_AI_FOUNDRY_API_KEY=your_api_key_here`
  - `AZURE_AI_FOUNDRY_MODEL_NAME=gpt-4-vision`
- Use `processingMode: "azure-ai"` in your tool calls
Azure AI Foundry Integration
The server generates markdown files with proper YAML frontmatter that Azure AI Foundry agents can consume directly:
- `title`: Document title
- `content_type`: Type of content (transcript, document, etc.)
- `tags`: Categorization tags
- `azure_ai_foundry.ready`: Indicates the file is ready for AI consumption
- `azure_ai_foundry.format_version`: Format specification version
Development
To extend functionality:
- Add new tools to the `setupHandlers()` method (see the sketch below)
- Implement tool logic as async methods
- Update the tools list in the `ListToolsRequestSchema` handler
- Add tests in the `tests/` directory
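A hedged sketch of that extension pattern follows, assuming the low-level `Server` API from `@modelcontextprotocol/sdk` that the handler names above suggest; the tool definition is illustrative, not taken from the repository.

```js
// Sketch of registering and serving one tool over stdio with the MCP SDK.
import { Server } from "@modelcontextprotocol/sdk/server/index.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { ListToolsRequestSchema, CallToolRequestSchema } from "@modelcontextprotocol/sdk/types.js";

const server = new Server(
  { name: "prep-for-research-analysis", version: "1.0.0" },
  { capabilities: { tools: {} } }
);

// Advertise the tool so MCP clients can discover it.
server.setRequestHandler(ListToolsRequestSchema, async () => ({
  tools: [
    {
      name: "convert_txt_to_md",
      description: "Convert a plain text file to structured Markdown",
      inputSchema: {
        type: "object",
        properties: { filePath: { type: "string" }, title: { type: "string" } },
        required: ["filePath"],
      },
    },
  ],
}));

// Implement the tool logic as an async handler.
server.setRequestHandler(CallToolRequestSchema, async (request) => {
  const { name, arguments: args } = request.params;
  if (name !== "convert_txt_to_md") throw new Error(`Unknown tool: ${name}`);
  const body = `# ${args.title ?? "Document"}\n\n...converted content...`;
  return { content: [{ type: "text", text: body }] };
});

// Connect over stdio so clients such as GitHub Copilot can launch the server.
await server.connect(new StdioServerTransport());
```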
License
ISC