Prep for Research Analysis MCP Server

A specialized Model Context Protocol (MCP) server for converting transcripts and documents to clean Markdown format for research analysis workflows.

License: MIT · Node.js · MCP

🎯 Purpose

Transform raw transcripts, documents, and images into clean, anonymized Markdown perfect for research analysis. Designed specifically for researchers, analysts, and anyone working with interview data, meeting transcripts, or document conversion workflows.

✨ Key Features

  • 🎙️ VTT Transcript Conversion: Convert subtitle files with intelligent speaker identification
  • 📄 Text Document Conversion: Transform plain text files (.txt) to structured Markdown
  • 🔒 Privacy Protection: Automatic speaker anonymization (e.g., "Jennifer Adams" → "(JA)")
  • 📊 Research-Ready Output: YAML frontmatter for AI analysis tools and Agent Playground
  • 🖼️ Image Processing: Extract content from screenshots and documents using OCR
  • 📝 Content Sanitization: Remove emails, phone numbers, and sensitive data
  • ⚡ Production-Ready: Handles large files (6000+ lines tested) efficiently
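
As a rough illustration of the first two bullets, converting a single VTT voice cue to the anonymized Markdown line format shown in the sample output might look like this (a sketch only; the helper name and regex are made up here, not the server's actual code):

```javascript
// Illustrative sketch (not the server's implementation): turn one VTT voice
// cue like "<v Jennifer Adams>Hello.</v>" into an anonymized Markdown line.
function cueToMarkdown(cueLine) {
  const match = cueLine.match(/^<v ([^>]+)>(.*)<\/v>$/);
  if (!match) return null; // not a voice cue (e.g., a timestamp line)
  const [, speaker, text] = match;
  // "Jennifer Adams" -> "JA": first letter of each name part
  const initials = speaker
    .trim()
    .split(/\s+/)
    .map((word) => word[0].toUpperCase())
    .join('');
  return `**(${initials}):** ${text}`;
}
```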

🚀 Quick Start

Installation

git clone https://github.com/your-username/prep-for-research-analysis.git
cd prep-for-research-analysis
npm install

Configuration

  1. Copy the environment template:

    cp .env.example .env
    
  2. (Optional) Add Azure AI credentials for advanced image processing:

    AZURE_AI_FOUNDRY_ENDPOINT=your-endpoint-here
    AZURE_AI_FOUNDRY_KEY=your-key-here
    

Start the Server

npm start

💬 GitHub Copilot Integration

Setup

Add to your VS Code settings.json:

{
  "github.copilot.chat.mcp.servers": {
    "prep-for-research-analysis": {
      "name": "Prep for Research Analysis",
      "command": "node",
      "args": ["prep-for-research-analysis-server.js"],
      "cwd": "/path/to/your/project"
    }
  }
}

Usage Examples

@prep-for-research-analysis convert_vtt_to_md filePath="interview.vtt" anonymizeSpeakers=true

@prep-for-research-analysis convert_txt_to_md filePath="notes.txt" title="Meeting Notes" tags=["meeting", "notes"] anonymize=true

@prep-for-research-analysis process_file_for_azure inputPath="meeting.vtt" title="Strategy Meeting" tags=["strategy", "planning"]

@prep-for-research-analysis convert_image_to_md filePath="diagram.png" title="System Architecture"

📊 Sample Output

Input (VTT)

WEBVTT

00:07:10.709 --> 00:07:15.070
<v Jennifer Adams>I think we need to focus on user experience first.</v>

00:07:15.070 --> 00:07:17.709
<v Mike Johnson>Absolutely, that's our top priority.</v>

Output (Markdown)

---
title: Strategy Meeting
created: '2025-06-27T18:45:00.000Z'
content_type: transcript
tags:
  - strategy
  - planning
azure_ai_foundry:
  ready: true
  format_version: '1.0'
---

# Transcript

## Speaker Key
- (JA): [Speaker anonymized]
- (MJ): [Speaker anonymized]

---

**(JA):** I think we need to focus on user experience first.

**(MJ):** Absolutely, that's our top priority.

🛠️ Available Tools

Tool                   | Description                           | Use Case
convert_vtt_to_md      | Convert VTT files to Markdown         | Basic transcript conversion
convert_txt_to_md      | Convert plain text files to Markdown  | Document structure and anonymization
process_file_for_azure | Full pipeline with YAML frontmatter   | Research analysis preparation
convert_image_to_md    | Extract text from images              | Document digitization
anonymize_content      | Remove sensitive information          | Privacy protection
add_yaml_frontmatter   | Add metadata headers                  | AI tool compatibility
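
For instance, the kind of redaction anonymize_content performs on emails and phone numbers can be sketched as follows (the patterns and function name are illustrative, not the server's actual rules):

```javascript
// Illustrative redaction sketch: replace email addresses and simple
// US-style phone numbers with placeholder tokens.
function redactSensitiveData(text) {
  return text
    .replace(/[\w.+-]+@[\w-]+\.[\w.]+/g, '[EMAIL]')
    .replace(/\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b/g, '[PHONE]');
}
```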

📁 Project Structure

prep-for-research-analysis/
├── prep-for-research-analysis-server.js  # Main MCP server
├── package.json                          # Dependencies
├── .env.example                          # Environment template
├── tests/
│   └── sample-files/                     # Test data
├── docs/                                 # Documentation
└── .vscode/                              # VS Code integration

🧪 Testing

# Run all tests
npm run test:all

# Test basic functionality
npm run test:basic

# Test VTT conversion
npm test

📖 Documentation

  • Step-by-step conversion instructions
  • Setup guide for VS Code
  • Copilot configuration
  • Comprehensive usage documentation

⚙️ Advanced Configuration

Large File Processing

The server efficiently handles large transcripts (6000+ lines tested). For extremely large files, the parser uses streaming techniques to maintain performance.

Custom Anonymization

Speaker anonymization can be customized by modifying the extractSpeakerInfo function in the main server file.
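
For illustration, a customized version might map full names to stable initials and emit the Speaker Key block in the format shown in the sample output above. This sketch mirrors only the output format, not the server's internals:

```javascript
// Illustrative sketch only: build a name -> alias map, then render the
// "## Speaker Key" block in the format this README's sample output uses.
function buildSpeakerAliases(names) {
  const aliases = new Map();
  for (const name of names) {
    if (aliases.has(name)) continue; // one alias per unique speaker
    const initials = name
      .trim()
      .split(/\s+/)
      .map((word) => word[0].toUpperCase())
      .join('');
    aliases.set(name, `(${initials})`);
  }
  return aliases;
}

function speakerKey(aliases) {
  const lines = ['## Speaker Key'];
  for (const alias of aliases.values()) {
    lines.push(`- ${alias}: [Speaker anonymized]`);
  }
  return lines.join('\n');
}
```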

Azure AI Integration

For advanced image processing, configure Azure AI Foundry credentials in your .env file.

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

๐Ÿ› Troubleshooting

Common Issues

Server won't start:

  • Check Node.js version (18+ required)
  • Verify all dependencies are installed (npm install)

Large files timing out:

  • Increase timeout in your MCP client
  • Consider processing in chunks for extremely large files

Image processing fails:

  • Ensure Tesseract.js dependencies are installed
  • Check Azure AI credentials if using advanced mode

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


Ready to convert your transcripts? Start with the Quick Start above!

Project Structure

mcp-explore/
├── azure-ai-foundry-server-fixed.js   # Main MCP server
├── mcp-explorer.js                    # MCP client utility
├── package.json                       # Node.js dependencies
├── .env                               # Environment variables (your credentials)
├── .env.example                       # Environment template
├── README.md                          # This file
├── USAGE_GUIDE.md                     # Comprehensive usage guide
├── CHANGELOG.md                       # Version history
├── CONTRIBUTING.md                    # Contribution guidelines
├── LICENSE                            # MIT license
├── examples/                          # Sample files for testing
│   ├── sample-presentation.vtt        # Sample VTT file
│   ├── research-meeting.vtt           # Another VTT example
│   └── *.md                           # Generated outputs
├── tests/                             # Test suite
│   ├── quick-validation.js            # Quick server validation
│   ├── full-pipeline-test.js          # Complete workflow test
│   ├── test-azure-ai-foundry.js       # Azure AI specific tests
│   └── test-enhanced-vtt.js           # VTT processing tests
├── docs/                              # Additional documentation
└── .github/                           # GitHub workflows

Installation

npm install

Usage

Start the MCP Server

npm start
# or
npm run server

Run Tests

# Run all enhanced tests
npm test

# Run specific tests
npm run test:basic     # Basic functionality
npm run test:pipeline  # Full pipeline test
npm run test:enhanced  # Enhanced VTT processing

Available Tools

  1. convert_vtt_to_md - Convert VTT files to Markdown with speaker anonymization
  2. convert_image_to_md - Convert images to Markdown using OCR or Azure AI Foundry
  3. anonymize_content - Remove sensitive information from text
  4. add_yaml_frontmatter - Add Azure AI Foundry metadata
  5. process_file_for_azure - Complete pipeline (recommended)

Example Usage

Input VTT Format

WEBVTT

3e2f4a5b-9d8c-5f3e-b2c3-4d5e6f708901/10-0
00:00:01.000 --> 00:00:06.000
<v Dr. Jennifer Adams (She/Her)>Good afternoon everyone.</v>

Output Markdown

---
title: Meeting Transcript
created: '2025-06-27T00:00:00.000Z'
content_type: transcript
tags: ['meeting', 'anonymized', 'azure-ai-foundry']
azure_ai_foundry:
  ready: true
  format_version: '1.0'
---

# Transcript

## Speaker Key
- (JA): [Speaker anonymized]

---

**(JA):** Good afternoon everyone.


Supported File Types

  • VTT: WebVTT subtitle files with speaker identification
  • Images: PNG, JPG, JPEG, GIF, BMP, TIFF, WEBP with OCR or AI processing
  • Future: Additional formats as needed

Processing Options

Image Processing

  • OCR Mode (default): Uses Tesseract.js for text extraction
  • Azure AI Foundry Mode: Uses multimodal AI for enhanced content understanding

Configuration

To use Azure AI Foundry mode:

  1. Copy .env.example to .env
  2. Add your Azure AI Foundry credentials:
    AZURE_AI_FOUNDRY_ENDPOINT=https://your-resource.openai.azure.com/
    AZURE_AI_FOUNDRY_API_KEY=your_api_key_here
    AZURE_AI_FOUNDRY_MODEL_NAME=gpt-4-vision
    
  3. Use processingMode: "azure-ai" in your tool calls

Azure AI Foundry Integration

The server generates markdown files with proper YAML frontmatter that Azure AI Foundry agents can consume directly:

  • title: Document title
  • content_type: Type of content (transcript, document, etc.)
  • tags: Categorization tags
  • azure_ai_foundry.ready: Indicates file is ready for AI consumption
  • azure_ai_foundry.format_version: Format specification version
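
As a sketch, assembling those fields into a YAML header might look like this (field names follow this README's sample output; the helper name buildFrontmatter is hypothetical):

```javascript
// Illustrative sketch: build the YAML frontmatter block from the fields
// listed above. Not the server's actual implementation.
function buildFrontmatter({ title, contentType, tags }) {
  const tagLines = tags.map((tag) => `  - ${tag}`).join('\n');
  return [
    '---',
    `title: ${title}`,
    `created: '${new Date().toISOString()}'`,
    `content_type: ${contentType}`,
    'tags:',
    tagLines,
    'azure_ai_foundry:',
    '  ready: true',
    "  format_version: '1.0'",
    '---',
  ].join('\n');
}
```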

Development

To extend functionality:

  1. Add new tools to the setupHandlers() method
  2. Implement tool logic as async methods
  3. Update the tools list in ListToolsRequestSchema handler
  4. Add tests in the tests/ directory
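
The first three steps can be sketched as follows. The tool name, schema, and handler here are hypothetical examples of the shape involved; actual registration still happens in setupHandlers() and the ListToolsRequestSchema handler:

```javascript
// Hypothetical example of a new tool definition and its handler.
const wordCountTool = {
  name: 'count_words',
  description: 'Count the words in a piece of text',
  inputSchema: {
    type: 'object',
    properties: { text: { type: 'string' } },
    required: ['text'],
  },
};

// MCP tool handlers return a content array of typed parts.
async function handleCountWords({ text }) {
  const count = text.trim().split(/\s+/).filter(Boolean).length;
  return { content: [{ type: 'text', text: `Word count: ${count}` }] };
}
```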

License

ISC