damondel/prep-for-research-analysis
A specialized Model Context Protocol (MCP) server designed to convert transcripts and documents into clean Markdown format for research analysis workflows.
convert_vtt_to_md
Convert VTT files to Markdown with speaker anonymization.
process_file_for_azure
Full pipeline with YAML frontmatter for research analysis preparation.
convert_image_to_md
Extract text from images for document digitization.
anonymize_content
Remove sensitive information for privacy protection.
add_yaml_frontmatter
Add metadata headers for AI tool compatibility.
Prep for Research Analysis MCP Server
A specialized Model Context Protocol (MCP) server for converting transcripts and documents to clean Markdown format for research analysis workflows.
Purpose
Transform raw transcripts, documents, and images into clean, anonymized Markdown perfect for research analysis. Designed specifically for researchers, analysts, and anyone working with interview data, meeting transcripts, or document conversion workflows.
Key Features
- VTT Transcript Conversion: Convert subtitle files with intelligent speaker identification
- Text Document Conversion: Transform plain text files (.txt) into structured Markdown
- Privacy Protection: Automatic speaker anonymization (e.g., "Jennifer Adams" → "(JA)")
- Research-Ready Output: YAML frontmatter for AI analysis tools and Agent Playground
- Image Processing: Extract content from screenshots and documents using OCR
- Content Sanitization: Remove emails, phone numbers, and sensitive data
- Production-Ready: Handles large files (6000+ lines tested) efficiently
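The speaker anonymization described above can be sketched as follows. This is a minimal illustration, not the server's actual code: the function name `speakerInitials`, the honorific list, and the handling of parenthesised notes are all assumptions, chosen so that both "Jennifer Adams" and "Dr. Jennifer Adams (She/Her)" map to "(JA)" as in the samples in this README.

```javascript
// Hypothetical helper (not taken from the server source): reduce a display
// name to parenthesised initials.
function speakerInitials(rawName) {
  const cleaned = rawName
    // Drop parenthesised notes such as "(She/Her)".
    .replace(/\([^)]*\)/g, " ")
    // Drop a few common honorifics; this list is an assumption.
    .replace(/\b(Dr|Mr|Mrs|Ms|Prof)\.?(?=\s)/gi, " ")
    .trim();

  const words = cleaned.split(/\s+/).filter(Boolean);
  if (words.length === 0) return "(??)"; // nothing usable left in the name

  const initials = words.map((word) => word[0].toUpperCase()).join("");
  return `(${initials})`;
}

console.log(speakerInitials("Jennifer Adams"));
console.log(speakerInitials("Dr. Jennifer Adams (She/Her)"));
```

Two speakers with the same initials would collide under this scheme, so a real implementation would likely need a tie-breaker such as a numeric suffix.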
Quick Start
Installation
git clone https://github.com/your-username/prep-for-research-analysis.git
cd prep-for-research-analysis
npm install
Configuration
- Copy the environment template:
  cp .env.example .env
- (Optional) Add Azure AI credentials for advanced image processing:
  AZURE_AI_FOUNDRY_ENDPOINT=your-endpoint-here
  AZURE_AI_FOUNDRY_KEY=your-key-here
Start the Server
npm start
GitHub Copilot Integration
Setup
Add to your VS Code settings.json:
{
  "github.copilot.chat.mcp.servers": {
    "prep-for-research-analysis": {
      "name": "Prep for Research Analysis",
      "command": "node",
      "args": ["prep-for-research-analysis-server.js"],
      "cwd": "/path/to/your/project"
    }
  }
}
Usage Examples
@prep-for-research-analysis convert_vtt_to_md filePath="interview.vtt" anonymizeSpeakers=true
@prep-for-research-analysis convert_txt_to_md filePath="notes.txt" title="Meeting Notes" tags=["meeting", "notes"] anonymize=true
@prep-for-research-analysis process_file_for_azure inputPath="meeting.vtt" title="Strategy Meeting" tags=["strategy", "planning"]
@prep-for-research-analysis convert_image_to_md filePath="diagram.png" title="System Architecture"
Sample Output
Input (VTT)
WEBVTT
00:07:10.709 --> 00:07:15.070
<v Jennifer Adams>I think we need to focus on user experience first.</v>
00:07:15.070 --> 00:07:17.709
<v Mike Johnson>Absolutely, that's our top priority.</v>
Output (Markdown)
---
title: Strategy Meeting
created: '2025-06-27T18:45:00.000Z'
content_type: transcript
tags:
- strategy
- planning
azure_ai_foundry:
  ready: true
  format_version: '1.0'
---
# Transcript
## Speaker Key
- (JA): [Speaker anonymized]
- (MJ): [Speaker anonymized]
---
**(JA):** I think we need to focus on user experience first.
**(MJ):** Absolutely, that's our top priority.
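The transformation from the VTT input to the Markdown body above can be approximated with a short function. This is a sketch under stated assumptions, not the server's implementation: `vttToMarkdown` is a hypothetical name, it only reads `<v Name>…</v>` voice spans, it ignores timestamps, and it does not emit the YAML frontmatter.

```javascript
// Hypothetical sketch: turn WebVTT voice spans into an anonymized transcript.
function vttToMarkdown(vttText) {
  const voiceSpan = /<v\s+([^>]+)>([\s\S]*?)<\/v>/g;
  const speakers = new Map(); // original name -> anonymized label
  const lines = [];

  for (const match of vttText.matchAll(voiceSpan)) {
    const name = match[1].trim();
    const text = match[2].replace(/\s+/g, " ").trim();

    if (!speakers.has(name)) {
      // Label each speaker by initials, e.g. "Jennifer Adams" -> "(JA)".
      const initials = name
        .split(/\s+/)
        .map((word) => word[0].toUpperCase())
        .join("");
      speakers.set(name, `(${initials})`);
    }
    lines.push(`**${speakers.get(name)}:** ${text}`);
  }

  const speakerKey = [...speakers.values()].map(
    (label) => `- ${label}: [Speaker anonymized]`
  );
  return ["# Transcript", "## Speaker Key", ...speakerKey, "---", ...lines].join("\n");
}

const sample = [
  "WEBVTT",
  "00:07:10.709 --> 00:07:15.070",
  "<v Jennifer Adams>I think we need to focus on user experience first.</v>",
  "00:07:15.070 --> 00:07:17.709",
  "<v Mike Johnson>Absolutely, that's our top priority.</v>",
].join("\n");

console.log(vttToMarkdown(sample));
```

Cues without a voice span are skipped entirely in this sketch; a fuller parser would need a fallback label for unattributed text.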
Available Tools
Tool | Description | Use Case |
---|---|---|
convert_vtt_to_md | Convert VTT files to Markdown | Basic transcript conversion |
convert_txt_to_md | Convert plain text files to Markdown | Document structure and anonymization |
process_file_for_azure | Full pipeline with YAML frontmatter | Research analysis preparation |
convert_image_to_md | Extract text from images | Document digitization |
anonymize_content | Remove sensitive information | Privacy protection |
add_yaml_frontmatter | Add metadata headers | AI tool compatibility |
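As an illustration of what a tool like anonymize_content might do, the sketch below masks email addresses and phone numbers with placeholders. The function name `sanitizeText`, the placeholder strings, and both regular expressions are assumptions; the phone pattern in particular only targets North American style numbers and is deliberately loose.

```javascript
// Hypothetical sketch of content sanitization; the patterns are assumptions,
// not the ones used by the server.
function sanitizeText(text) {
  const emailPattern = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
  // Optional country code, then 3-3-4 digits with optional separators.
  const phonePattern =
    /(?:\+?\d{1,2}[\s.-]?)?(?:\(\d{3}\)|\d{3})[\s.-]?\d{3}[\s.-]?\d{4}\b/g;

  return text
    .replace(emailPattern, "[EMAIL]")
    .replace(phonePattern, "[PHONE]");
}

console.log(sanitizeText("Contact jane.doe@example.com or 555-123-4567."));
```

Regex-based masking misses names, addresses, and unusual number formats, so it is best treated as a first pass rather than a guarantee of anonymity.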
Project Structure
prep-for-research-analysis/
├── prep-for-research-analysis-server.js  # Main MCP server
├── package.json                          # Dependencies
├── .env.example                          # Environment template
├── tests/
│   └── sample-files/                     # Test data
├── docs/                                 # Documentation
└── .vscode/                              # VS Code integration
Testing
# Run all tests
npm run test:all
# Test basic functionality
npm run test:basic
# Test VTT conversion
npm test
Documentation
- Step-by-step conversion instructions
- Setup guide for VS Code
- Copilot configuration
- Comprehensive usage documentation
Advanced Configuration
Large File Processing
The server efficiently handles large transcripts (6000+ lines tested). For extremely large files, the parser uses streaming techniques to maintain performance.
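A line-by-line streaming read, one common way to keep memory use flat on large transcripts, can be sketched with Node's built-in `fs` and `readline` modules. This is an assumption about what "streaming techniques" could look like, not a description of the server's parser; `countVoiceCues` is a hypothetical helper that only counts voice-tagged lines.

```javascript
const fs = require("node:fs");
const readline = require("node:readline");

// Hypothetical helper: count voice-tagged cue lines without loading the
// whole file into memory.
async function countVoiceCues(filePath) {
  const lineReader = readline.createInterface({
    input: fs.createReadStream(filePath, { encoding: "utf8" }),
    crlfDelay: Infinity, // treat \r\n as a single line break
  });

  let cues = 0;
  for await (const line of lineReader) {
    if (line.startsWith("<v ")) cues += 1;
  }
  return cues;
}
```

Calling `countVoiceCues("interview.vtt")` returns a promise for the number of lines that begin with a `<v …>` voice tag; the same loop could feed each line into a converter instead of a counter.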
Custom Anonymization
Speaker anonymization can be customized by modifying the extractSpeakerInfo function in the main server file.
Azure AI Integration
For advanced image processing, configure Azure AI Foundry credentials in your .env file.
Contributing
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
Troubleshooting
Common Issues
Server won't start:
- Check Node.js version (18+ required)
- Verify all dependencies are installed (npm install)
Large files timing out:
- Increase timeout in your MCP client
- Consider processing in chunks for extremely large files
Image processing fails:
- Ensure Tesseract.js dependencies are installed
- Check Azure AI credentials if using advanced mode
License
This project is licensed under the MIT License - see the file for details.
Acknowledgments
- Built on the Model Context Protocol
- Uses Tesseract.js for OCR processing
- Inspired by the need for clean, research-ready transcript conversion
Ready to convert your transcripts? Start with the !
Project Structure
mcp-explore/
├── azure-ai-foundry-server-fixed.js  # Main MCP server
├── mcp-explorer.js                   # MCP client utility
├── package.json                      # Node.js dependencies
├── .env                              # Environment variables (your credentials)
├── .env.example                      # Environment template
├── README.md                         # This file
├── USAGE_GUIDE.md                    # Comprehensive usage guide
├── CHANGELOG.md                      # Version history
├── CONTRIBUTING.md                   # Contribution guidelines
├── LICENSE                           # MIT license
├── examples/                         # Sample files for testing
│   ├── sample-presentation.vtt       # Sample VTT file
│   ├── research-meeting.vtt          # Another VTT example
│   └── *.md                          # Generated outputs
├── tests/                            # Test suite
│   ├── quick-validation.js           # Quick server validation
│   ├── full-pipeline-test.js         # Complete workflow test
│   ├── test-azure-ai-foundry.js      # Azure AI specific tests
│   └── test-enhanced-vtt.js          # VTT processing tests
├── docs/                             # Additional documentation
└── .github/                          # GitHub workflows
Installation
npm install
Usage
Start the MCP Server
npm start
# or
npm run server
Run Tests
# Run all enhanced tests
npm test
# Run specific tests
npm run test:basic # Basic functionality
npm run test:pipeline # Full pipeline test
npm run test:enhanced # Enhanced VTT processing
Available Tools
- convert_vtt_to_md - Convert VTT files to Markdown with speaker anonymization
- convert_image_to_md - Convert images to Markdown using OCR or Azure AI Foundry
- anonymize_content - Remove sensitive information from text
- add_yaml_frontmatter - Add Azure AI Foundry metadata
- process_file_for_azure - Complete pipeline (recommended)
Example Usage
Input VTT Format
WEBVTT
3e2f4a5b-9d8c-5f3e-b2c3-4d5e6f708901/10-0
00:00:01.000 --> 00:00:06.000
<v Dr. Jennifer Adams (She/Her)>Good afternoon everyone.</v>
Output Markdown
---
title: Meeting Transcript
created: '2025-06-27T00:00:00.000Z'
content_type: transcript
tags: ['meeting', 'anonymized', 'azure-ai-foundry']
azure_ai_foundry:
  ready: true
  format_version: '1.0'
---
# Transcript
## Speaker Key
- (JA): [Speaker anonymized]
---
**(JA):** Good afternoon everyone.
Directory Structure
├── azure-ai-foundry-server-fixed.js     # Main MCP server
├── mcp-explorer.js                      # MCP client
├── package.json                         # Project configuration
├── README.md                            # This file
├── examples/                            # Sample files and outputs
│   ├── research-meeting.vtt             # Sample complex VTT
│   ├── sample-presentation.vtt          # Sample simple VTT
│   ├── anonymized-research-meeting.md   # Sample output
│   └── processed-presentation.md        # Sample output
└── tests/                               # Test scripts
    ├── test-azure-ai-server.js          # Basic tests
    ├── test-enhanced-vtt.js             # Enhanced VTT tests
    └── full-pipeline-test.js            # Complete pipeline tests
Supported File Types
- VTT: WebVTT subtitle files with speaker identification
- Images: PNG, JPG, JPEG, GIF, BMP, TIFF, WEBP with OCR or AI processing
- Future: Additional formats as needed
Processing Options
Image Processing
- OCR Mode (default): Uses Tesseract.js for text extraction
- Azure AI Foundry Mode: Uses multimodal AI for enhanced content understanding
Configuration
To use Azure AI Foundry mode:
- Copy .env.example to .env
- Add your Azure AI Foundry credentials:
  AZURE_AI_FOUNDRY_ENDPOINT=https://your-resource.openai.azure.com/
  AZURE_AI_FOUNDRY_API_KEY=your_api_key_here
  AZURE_AI_FOUNDRY_MODEL_NAME=gpt-4-vision
- Use processingMode: "azure-ai" in your tool calls
Azure AI Foundry Integration
The server generates markdown files with proper YAML frontmatter that Azure AI Foundry agents can consume directly:
- title: Document title
- content_type: Type of content (transcript, document, etc.)
- tags: Categorization tags
- azure_ai_foundry.ready: Indicates the file is ready for AI consumption
- azure_ai_foundry.format_version: Format specification version
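A frontmatter block with these fields could be generated along the following lines. This is a sketch, not the server's add_yaml_frontmatter implementation: the function name `addYamlFrontmatter` and its option names are assumptions, and it quotes strings with `JSON.stringify` (valid YAML double-quoted scalars) rather than matching the unquoted style shown in the sample output.

```javascript
// Hypothetical sketch: prepend YAML frontmatter to a Markdown body.
function addYamlFrontmatter(
  body,
  { title, contentType = "transcript", tags = [], created = new Date() }
) {
  const tagLines =
    tags.length > 0
      ? ["tags:", ...tags.map((tag) => `  - ${JSON.stringify(tag)}`)]
      : ["tags: []"];

  const frontmatter = [
    "---",
    `title: ${JSON.stringify(title)}`,
    `created: '${created.toISOString()}'`,
    `content_type: ${contentType}`,
    ...tagLines,
    "azure_ai_foundry:",
    "  ready: true",
    "  format_version: '1.0'",
    "---",
  ];
  return `${frontmatter.join("\n")}\n${body}`;
}

console.log(
  addYamlFrontmatter("# Transcript", {
    title: "Strategy Meeting",
    tags: ["strategy", "planning"],
    created: new Date("2025-06-27T18:45:00.000Z"),
  })
);
```

Building the block from an array of lines keeps the nested `azure_ai_foundry` keys correctly indented, which matters because YAML uses indentation to express nesting.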
Development
To extend functionality:
- Add new tools to the setupHandlers() method
- Implement tool logic as async methods
- Update the tools list in the ListToolsRequestSchema handler
- Add tests in the tests/ directory
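The shape of these steps can be illustrated without the MCP SDK. The sketch below is a plain-JavaScript stand-in, not the server's code: `registerTool`, `listTools`, `callTool`, and the `word_count` tool are all hypothetical. In the real server the listing would be returned from the ListToolsRequestSchema handler and the dispatch would live in setupHandlers().

```javascript
// Hypothetical, SDK-free sketch of a tool registry with list and call paths.
const tools = new Map();

function registerTool(name, description, inputSchema, handler) {
  tools.set(name, { name, description, inputSchema, handler });
}

// What a tools-list handler would return: metadata only, no handlers.
function listTools() {
  return [...tools.values()].map(({ name, description, inputSchema }) => ({
    name,
    description,
    inputSchema,
  }));
}

// Dispatch a call to the matching async handler and wrap the result as
// MCP-style text content.
async function callTool(name, args) {
  const tool = tools.get(name);
  if (!tool) throw new Error(`Unknown tool: ${name}`);
  const text = await tool.handler(args);
  return { content: [{ type: "text", text }] };
}

// Example tool, invented for this sketch; it is not part of the server.
registerTool(
  "word_count",
  "Count the words in a text",
  { type: "object", properties: { text: { type: "string" } }, required: ["text"] },
  async ({ text }) => String(text.trim().split(/\s+/).filter(Boolean).length)
);
```

Keeping the metadata and the handler in one registry entry means a new tool cannot be callable without also appearing in the list, which avoids the two getting out of sync.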
License
ISC