google-research-mcp

zoharbabin/google-research-mcp

The Google Researcher MCP Server empowers AI assistants with web research capabilities through Google Search, content scraping, and Gemini AI analysis.

Google Researcher MCP Server

Empower AI assistants with robust, persistent, and secure web research capabilities.

This server implements the Model Context Protocol (MCP), providing a suite of tools for Google Search, content scraping, and Gemini AI analysis. It's designed for performance and reliability, featuring a persistent caching system, comprehensive timeout handling, and enterprise-grade security.

🎉 Latest Update (v1.2.1): Fixed a critical issue where the scrape_page and research_topic tools returned placeholder test content instead of actual scraped data. All tools now return real web content as expected.

Why Use This Server?

  • Extend AI Capabilities: Grant AI assistants access to real-time web information and powerful analytical tools.
  • Maximize Performance: Drastically reduce latency for repeated queries with a sophisticated two-layer persistent cache (in-memory and disk).
  • Reduce Costs: Minimize expensive API calls to Google Search and Gemini by caching results.
  • Ensure Reliability: Prevent failures and ensure consistent performance with comprehensive timeout handling and graceful degradation.
  • Flexible & Secure Integration: Connect any MCP-compatible client via STDIO or HTTP+SSE, with enterprise-grade OAuth 2.1 for secure API access.
  • Open & Extensible: MIT licensed, fully open-source, and designed for easy modification and extension.

Features

  • Core Research Tools:
    • google_search: Find information using the Google Search API.
    • scrape_page: Extract content from websites and YouTube videos with robust transcript extraction.
    • analyze_with_gemini: Process text using Google's powerful Gemini AI models.
    • research_topic: A composite tool that combines search, scraping, and analysis into a single, efficient operation.
  • YouTube Transcript Extraction:
    • Robust YouTube transcript extraction with comprehensive error handling: 10 distinct error types with clear, actionable messages.
    • Intelligent retry logic with exponential backoff: Automatic retries for transient failures (network issues, rate limiting, timeouts).
    • User-friendly error messages and diagnostics: Clear feedback when transcript extraction fails, with specific reasons.
  • Advanced Caching System:
    • Two-Layer Cache: Combines a fast in-memory cache for immediate access with a persistent disk-based cache for durability.
    • Custom Namespaces: Organizes cached data by tool, preventing collisions and simplifying management.
    • Manual & Automated Persistence: Offers both automatic, time-based cache saving and manual persistence via a secure API endpoint.
  • Robust Performance & Reliability:
    • Comprehensive Timeouts: Protects against network issues and slow responses from external APIs.
    • Graceful Degradation: Ensures the server remains responsive even if a tool or dependency fails.
    • Dual Transport Protocols: Supports both STDIO for local process communication and HTTP+SSE for web-based clients.
  • Enterprise-Grade Security:
    • OAuth 2.1 Protection: Secures all HTTP endpoints with modern, industry-standard authorization.
    • Granular Scopes: Provides fine-grained control over access to tools and administrative functions.
  • Monitoring & Management:
    • Administrative API: Exposes endpoints for monitoring cache statistics, managing the cache, and inspecting the event store.

System Architecture

The server is built on a layered architecture designed for clarity, separation of concerns, and extensibility.

graph TD
    subgraph "Client"
        A[MCP Client]
    end

    subgraph "Transport Layer"
        B[STDIO]
        C[HTTP-SSE]
    end

    subgraph "Core Logic"
        D{MCP Request Router}
        E[Tool Executor]
    end

    subgraph "Tools"
        F[google_search]
        G[scrape_page]
        H[analyze_with_gemini]
        I[research_topic]
    end

    subgraph "Support Systems"
        J[Persistent Cache]
        K[Event Store]
        L[OAuth Middleware]
    end

    A -- Connects via --> B
    A -- Connects via --> C
    B -- Forwards to --> D
    C -- Forwards to --> D
    D -- Routes to --> E
    E -- Invokes --> F
    E -- Invokes --> G
    E -- Invokes --> H
    E -- Invokes --> I
    F & G & H & I -- Uses --> J
    D -- Uses --> K
    C -- Protected by --> L

    style J fill:#f9f,stroke:#333,stroke-width:2px
    style K fill:#ccf,stroke:#333,stroke-width:2px
    style L fill:#f99,stroke:#333,stroke-width:2px

For a more detailed explanation, see the .

YouTube Transcript Extraction

The server includes a robust YouTube transcript extraction system that provides reliable access to video transcripts with comprehensive error handling and automatic recovery mechanisms.

Key Features

  • Comprehensive Error Classification: Identifies 10 distinct error types with clear, actionable messages
  • Intelligent Retry Logic: Exponential backoff mechanism for transient failures (max 3 attempts)
  • Production Optimizations: End-to-end transcript extraction tests run 91% faster, with 80% less log output
  • User-Friendly Feedback: Clear error messages explaining why transcript extraction failed

Supported Error Types

| Error Code | Description | User Action |
|---|---|---|
| TRANSCRIPT_DISABLED | Video owner disabled transcripts | Try a different video |
| VIDEO_UNAVAILABLE | Video no longer available | Verify the URL and video status |
| VIDEO_NOT_FOUND | Invalid video ID or URL | Check the YouTube URL format |
| NETWORK_ERROR | Network connectivity issues | System will retry automatically |
| RATE_LIMITED | YouTube API rate limiting | System will retry with backoff |
| TIMEOUT | Request timed out | System will retry automatically |
| PARSING_ERROR | Transcript data parsing failed | Contact support if persistent |
| REGION_BLOCKED | Video blocked in server region | Use proxy if needed |
| PRIVATE_VIDEO | Video requires authentication | Use public videos only |
| UNKNOWN | Unexpected error occurred | Contact support with details |
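
As a rough illustration of the VIDEO_NOT_FOUND case, the sketch below shows one way a client could pre-validate a URL before calling scrape_page. The helper name `extractVideoId` and the 11-character ID pattern are assumptions made for this example; they are not taken from the server's source.

```typescript
// Hypothetical helper: pulls the video ID out of the two documented URL
// shapes (youtube.com/watch?v=... and youtu.be/...). Anything else yields
// null, which a caller could treat like a VIDEO_NOT_FOUND condition.
function extractVideoId(rawUrl: string): string | null {
  let url: URL;
  try {
    url = new URL(rawUrl);
  } catch {
    return null; // not a parseable URL at all
  }

  const host = url.hostname.replace(/^www\./, "");
  let id: string | null = null;

  if (host === "youtube.com" && url.pathname === "/watch") {
    id = url.searchParams.get("v");
  } else if (host === "youtu.be") {
    id = url.pathname.slice(1) || null;
  }

  // Assumption: IDs are 11 characters drawn from [A-Za-z0-9_-].
  return id !== null && /^[A-Za-z0-9_-]{11}$/.test(id) ? id : null;
}
```

Checking the URL locally avoids a round trip for inputs that can never resolve to a video.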

Retry Behavior

The system automatically retries failed requests for transient errors:

  • Maximum Attempts: Up to 3 attempts for NETWORK_ERROR, RATE_LIMITED, and TIMEOUT
  • Exponential Backoff: Progressive delays between retries to avoid overwhelming YouTube's API
  • Smart Recovery: Only retries errors that are likely to succeed on subsequent attempts
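
The retry policy described above can be sketched roughly as follows. This is a minimal illustration, not the server's implementation: the 500 ms base delay, the doubling factor, and the names `withRetry`, `backoffDelayMs`, and `Attempt` are all assumptions chosen for the example.

```typescript
type TranscriptErrorCode =
  | "TRANSCRIPT_DISABLED" | "VIDEO_UNAVAILABLE" | "VIDEO_NOT_FOUND"
  | "NETWORK_ERROR" | "RATE_LIMITED" | "TIMEOUT" | "PARSING_ERROR"
  | "REGION_BLOCKED" | "PRIVATE_VIDEO" | "UNKNOWN";

// Only these three codes are documented as being retried automatically.
const RETRYABLE_CODES: ReadonlySet<TranscriptErrorCode> =
  new Set<TranscriptErrorCode>(["NETWORK_ERROR", "RATE_LIMITED", "TIMEOUT"]);

function isRetryable(code: TranscriptErrorCode): boolean {
  return RETRYABLE_CODES.has(code);
}

// Delay before the next try after failed attempt number `attempt` (1-based):
// base, 2x base, 4x base, ... The 500 ms base is an assumed value.
function backoffDelayMs(attempt: number, baseMs = 500): number {
  return baseMs * 2 ** (attempt - 1);
}

type Attempt<T> =
  | { ok: true; value: T }
  | { ok: false; code: TranscriptErrorCode };

// Runs `operation` up to `maxAttempts` times, sleeping between tries and
// stopping early on success or on an error that retrying cannot fix.
async function withRetry<T>(
  operation: () => Promise<Attempt<T>>,
  maxAttempts = 3,
  sleep: (ms: number) => Promise<void> =
    (ms) => new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<Attempt<T>> {
  let last: Attempt<T> = { ok: false, code: "UNKNOWN" };
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    last = await operation();
    if (last.ok || !isRetryable(last.code)) return last;
    if (attempt < maxAttempts) await sleep(backoffDelayMs(attempt));
  }
  return last;
}
```

Passing `sleep` as a parameter keeps the delay logic testable without real waiting.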

Example Error Messages

When transcript extraction fails, users receive clear, specific error messages:

Failed to retrieve YouTube transcript for https://www.youtube.com/watch?v=xxxx.
Reason: TRANSCRIPT_DISABLED - The video owner has disabled transcripts.

Failed to retrieve YouTube transcript for https://www.youtube.com/watch?v=xxxx after 3 attempts.
Reason: NETWORK_ERROR - A network error occurred.
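
A client that wants the error code as data rather than as free text could parse messages of this shape. The sketch below assumes the layout is exactly the one shown in the examples above; `parseTranscriptError` and `ParsedTranscriptError` are hypothetical names, and the server may change its wording.

```typescript
interface ParsedTranscriptError {
  url: string;
  attempts: number; // 1 when the message does not mention attempts
  code: string;
  reason: string;
}

// Matches: "Failed to retrieve YouTube transcript for <url>[ after N attempts]."
// followed by "Reason: <CODE> - <text>".
const TRANSCRIPT_ERROR_PATTERN =
  /^Failed to retrieve YouTube transcript for (\S+?)(?: after (\d+) attempts)?\.\s*Reason: ([A-Z_]+) - (.+)$/;

function parseTranscriptError(text: string): ParsedTranscriptError | null {
  const match = TRANSCRIPT_ERROR_PATTERN.exec(text.trim());
  if (match === null) return null;
  return {
    url: match[1],
    attempts: match[2] ? Number(match[2]) : 1,
    code: match[3],
    reason: match[4],
  };
}
```

Returning null for unrecognized text lets the caller fall back to showing the raw message.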

For complete technical details, see the .

Getting Started

Prerequisites

Installation & Setup

  1. Clone the Repository:

    git clone https://github.com/zoharbabin/google-research-mcp.git
    cd google-research-mcp
    
  2. Install Dependencies:

    npm install
    
  3. Configure Environment Variables: Create a .env file by copying the example and filling in your credentials.

    cp .env.example .env
    

    Now, open .env in your editor and add your API keys and OAuth configuration. See the comments in .env.example for detailed explanations of each variable.
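
A quick way to catch a half-filled .env before starting the server is to check for missing values up front. The sketch below is only an illustration: `findMissingVars` is a hypothetical helper, and the variable names in the usage comment are placeholders. The authoritative list of variables is the one in .env.example.

```typescript
// Returns the names from `required` that are unset or blank in `env`.
function findMissingVars(
  required: readonly string[],
  env: Record<string, string | undefined>,
): string[] {
  return required.filter((name) => {
    const value = env[name];
    return value === undefined || value.trim() === "";
  });
}

// Usage sketch (placeholder names, not the server's real variables):
//   const missing = findMissingVars(
//     ["EXAMPLE_SEARCH_API_KEY", "EXAMPLE_GEMINI_API_KEY"],
//     process.env,
//   );
//   if (missing.length > 0) {
//     throw new Error(`Missing environment variables: ${missing.join(", ")}`);
//   }
```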

Running the Server

  • Development Mode: For development with automatic reloading on file changes, use:

    npm run dev
    

    This command uses tsx to watch for changes and restart the server.

  • Production Mode: First, build the TypeScript project into JavaScript, then start the server:

    npm run build
    npm start
    

Upon successful startup, you will see confirmation that the transports are ready:

✅ stdio transport ready
🌐 SSE server listening on http://127.0.0.1:3000/mcp

Usage

Available Tools

The server provides a suite of powerful tools for research and analysis. Each tool is designed with detailed descriptions and annotations to be easily understood and utilized by AI models.

google_search (Google Web Search)

Description: Searches the web using the Google Custom Search API to find relevant web pages and resources. Ideal for finding current information, discovering authoritative sources, and locating specific documents. Results are cached for 30 minutes.

Parameters:
- query (string, required): The search query. Use specific, targeted keywords for best results.
- num_results (number, optional, default: 5): The number of search results to return (1-10).

scrape_page (Web Page & YouTube Content Extractor)

Description: Extracts text content from web pages and YouTube videos with robust transcript extraction capabilities. Features comprehensive error handling with 10 distinct error types (TRANSCRIPT_DISABLED, VIDEO_UNAVAILABLE, NETWORK_ERROR, etc.), automatic retry logic with exponential backoff for transient failures, and user-friendly error messages. Supports both youtube.com/watch?v= and youtu.be/ URL formats. Results are cached for 1 hour.

Parameters:
- url (string, required): The URL of the web page or YouTube video to scrape. YouTube URLs automatically extract transcripts when available.

analyze_with_gemini (Gemini AI Text Analysis)

Description: Processes and analyzes text content using Google's Gemini AI models. It can summarize, answer questions, and generate insights from provided text. Large texts are automatically truncated. Results are cached for 15 minutes.

Parameters:
- text (string, required): The text content to analyze.
- model (string, optional, default: "gemini-2.0-flash-001"): The Gemini model to use (e.g., gemini-2.0-flash-001, gemini-pro).

research_topic (Comprehensive Topic Research Workflow)

Description: A powerful composite tool that automates the entire research process: it searches for a topic, scrapes the content from multiple sources, and synthesizes the findings with Gemini AI. It's designed for resilience and provides comprehensive analysis.

Parameters:
- query (string, required): The research topic or question.
- num_results (number, optional, default: 3): The number of sources to research (recommended: 2-5).
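
To make the per-tool cache lifetimes concrete, here is a minimal sketch of an in-memory cache that namespaces entries by tool name and expires them after the durations stated above (30 minutes, 1 hour, 15 minutes). It covers only the memory layer, and the class name `NamespacedTtlCache`, the key scheme, and the injected clock are assumptions for illustration rather than the server's actual design. research_topic is left out because no lifetime is stated for it.

```typescript
// Lifetimes taken from the tool descriptions; everything else is assumed.
const TTL_MS: Record<string, number> = {
  google_search: 30 * 60 * 1000,
  scrape_page: 60 * 60 * 1000,
  analyze_with_gemini: 15 * 60 * 1000,
};

class NamespacedTtlCache {
  private entries = new Map<string, { value: string; expiresAt: number }>();
  private now: () => number;

  // The clock is injectable so expiry can be exercised without waiting.
  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  // Prefixing with the tool name keeps identical arguments for different
  // tools from colliding. Note: JSON.stringify is sensitive to key order.
  private key(namespace: string, args: unknown): string {
    return `${namespace}:${JSON.stringify(args)}`;
  }

  get(namespace: string, args: unknown): string | undefined {
    const k = this.key(namespace, args);
    const entry = this.entries.get(k);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(k); // lazily evict expired entries
      return undefined;
    }
    return entry.value;
  }

  set(namespace: string, args: unknown, value: string): void {
    const ttl = TTL_MS[namespace];
    if (ttl === undefined) return; // no stated lifetime: do not cache
    this.entries.set(this.key(namespace, args), {
      value,
      expiresAt: this.now() + ttl,
    });
  }
}
```

A persistent layer would sit behind `get` and `set` and is omitted here.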

Client Integration

STDIO Client (Local Process)

Ideal for local tools and CLI applications.

import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

const transport = new StdioClientTransport({
  command: "node",
  args: ["dist/server.js"]
});
// The SDK expects both a name and a version in the client info.
const client = new Client({ name: "test-client", version: "1.0.0" });
await client.connect(transport);

const result = await client.callTool({
  name: "google_search",
  arguments: { query: "Model Context Protocol" }
});
console.log(result.content[0].text);

// YouTube transcript extraction example
const youtubeResult = await client.callTool({
  name: "scrape_page",
  arguments: { url: "https://www.youtube.com/watch?v=dQw4w9WgXcQ" }
});
console.log(youtubeResult.content[0].text);

HTTP+SSE Client (Web Application)

Suitable for web-based clients. Requires a valid OAuth 2.1 Bearer token.

import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StreamableHTTPClientTransport } from "@modelcontextprotocol/sdk/client/streamableHttp.js";

// The client MUST obtain a valid OAuth 2.1 Bearer token from your
// configured external Authorization Server before making requests.
const transport = new StreamableHTTPClientTransport(
  new URL("http://localhost:3000/mcp"),
  {
    // Send the token as a standard Authorization header on every request.
    requestInit: {
      headers: { Authorization: "Bearer YOUR_ACCESS_TOKEN" }
    }
  }
);
// The SDK expects both a name and a version in the client info.
const client = new Client({ name: "test-client", version: "1.0.0" });
await client.connect(transport);

const result = await client.callTool({
  name: "google_search",
  arguments: { query: "Model Context Protocol" }
});
console.log(result.content[0].text);

// YouTube transcript extraction with error handling
try {
  const youtubeResult = await client.callTool({
    name: "scrape_page",
    arguments: { url: "https://www.youtube.com/watch?v=dQw4w9WgXcQ" }
  });
  console.log("Transcript:", youtubeResult.content[0].text);
} catch (error) {
  // Not every failure carries MCP content (e.g. transport errors),
  // so read the message defensively before inspecting it.
  const message = error?.content?.[0]?.text ?? error?.message ?? String(error);
  if (message.includes("TRANSCRIPT_DISABLED")) {
    console.log("Video owner has disabled transcripts");
  } else if (message.includes("VIDEO_NOT_FOUND")) {
    console.log("Video not found - check the URL");
  } else {
    console.log("Transcript extraction failed:", message);
  }
}

Management API

The server provides several administrative endpoints for monitoring and control. Access to these endpoints is protected by OAuth scopes.

| Method | Endpoint | Description | Required Scope |
|---|---|---|---|
| GET | /mcp/cache-stats | View cache performance statistics. | mcp:admin:cache:read |
| GET | /mcp/event-store-stats | View event store usage statistics. | mcp:admin:event-store:read |
| POST | /mcp/cache-invalidate | Clear specific cache entries. | mcp:admin:cache:invalidate |
| POST | /mcp/cache-persist | Force the cache to be saved to disk. | mcp:admin:cache:persist |
| GET | /mcp/oauth-scopes | Get documentation for all OAuth scopes. | Public |
| GET | /mcp/oauth-config | View the server's OAuth configuration. | mcp:admin:config:read |
| GET | /mcp/oauth-token-info | View details of the provided token. | Requires authentication |
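
Calling one of these endpoints comes down to an HTTP request with a Bearer token. The sketch below builds such a request without sending it; `buildAdminRequest` is a hypothetical helper, the base URL is the one printed at startup, and the JSON `Accept` header is an assumption about the response format.

```typescript
interface AdminRequest {
  url: string;
  init: { method: "GET" | "POST"; headers: Record<string, string> };
}

// Assembles the URL and headers for an administrative endpoint.
// The token is optional because /mcp/oauth-scopes is listed as public.
function buildAdminRequest(
  baseUrl: string,
  endpoint: string,
  method: "GET" | "POST",
  accessToken?: string,
): AdminRequest {
  const headers: Record<string, string> = { Accept: "application/json" };
  if (accessToken) headers.Authorization = `Bearer ${accessToken}`;
  return { url: new URL(endpoint, baseUrl).toString(), init: { method, headers } };
}

// Usage sketch (performs a real request, so it is left as a comment):
//   const { url, init } = buildAdminRequest(
//     "http://127.0.0.1:3000", "/mcp/cache-stats", "GET", "YOUR_ACCESS_TOKEN");
//   const response = await fetch(url, init);
```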

Performance & Reliability

The server has been optimized for production use with significant performance improvements and reliability enhancements:

YouTube Transcript Extraction Performance

  • 91% Performance Improvement: End-to-end tests for YouTube transcript extraction are now 91% faster
  • 80% Log Reduction: Streamlined logging reduces noise while maintaining diagnostic capabilities
  • Production Controls: Environment-based configuration allows fine-tuning of retry behavior and timeouts

System Reliability

  • Intelligent Error Recovery: Automatic retry with exponential backoff for transient failures
  • Graceful Degradation: The system continues operating even when individual components encounter issues
  • Comprehensive Error Classification: 10 distinct error types provide precise feedback for troubleshooting
  • Resource Optimization: Efficient memory and CPU usage patterns for high-volume operations

Monitoring & Diagnostics

  • Enhanced Logging: Detailed but efficient logging for production debugging
  • Performance Metrics: Built-in performance tracking for all major operations
  • Error Analytics: Structured error reporting for operational insights

These optimizations ensure the server can handle production workloads efficiently while providing reliable service even under adverse conditions.

Security

OAuth 2.1 Authorization

The server implements OAuth 2.1 authorization for all HTTP-based communication, ensuring that only authenticated and authorized clients can access its capabilities.

  • Protection: All endpoints under /mcp/ (except for public documentation endpoints) are protected.
  • Token Validation: The server validates JWTs (JSON Web Tokens) against the configured JWKS (JSON Web Key Set) URI from your authorization server.
  • Scope Enforcement: Each tool and administrative action is mapped to a specific OAuth scope, providing granular control over permissions.

For a complete guide on setting up OAuth, see the .

Available Scopes

Tool Execution Scopes
  • mcp:tool:google_search:execute
  • mcp:tool:scrape_page:execute
  • mcp:tool:analyze_with_gemini:execute
  • mcp:tool:research_topic:execute
Administrative Scopes
  • mcp:admin:cache:read
  • mcp:admin:cache:invalidate
  • mcp:admin:cache:persist
  • mcp:admin:event-store:read
  • mcp:admin:config:read
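
The tool scopes above follow a regular pattern, so a scope check can be sketched in a few lines. This assumes the token exposes its scopes as a space-delimited string, which is the usual OAuth convention but is not confirmed here for this server; `toolScope` and `isAuthorized` are hypothetical names.

```typescript
// Builds the execution scope for a tool, following the pattern
// mcp:tool:<tool name>:execute seen in the list above.
function toolScope(toolName: string): string {
  return `mcp:tool:${toolName}:execute`;
}

// Assumption: scopes arrive as a single space-delimited string.
function isAuthorized(scopeClaim: string, requiredScope: string): boolean {
  const granted = new Set(scopeClaim.split(/\s+/).filter(Boolean));
  return granted.has(requiredScope);
}
```

Exact string matching means a token holding mcp:admin:cache:read grants nothing beyond that one action.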

Testing

The project maintains a high standard of quality through a combination of end-to-end and focused component tests.

| Script | Description |
|---|---|
| npm test | Runs all focused component tests (*.spec.ts) using Jest. |
| npm run test:e2e | Executes the full end-to-end test suite for both STDIO and SSE transports. |
| npm run test:coverage | Generates a detailed code coverage report. |

For more details on the testing philosophy and structure, see the .

Troubleshooting


Contributing

We welcome contributions of all kinds! This project is open-source under the MIT license and we believe in the power of community collaboration.

  • ⭐ Star this repo if you find it useful.
  • 🍴 Fork it to create your own version.
  • 💡 Report issues if you find bugs or have suggestions for improvements.
  • 🚀 Submit PRs for bug fixes, new features, or documentation enhancements.

To contribute code, please follow our .

License

This project is licensed under the MIT License. See the file for details.