morrain/zhihuMcpServer
Puppeteer MCP Server
This Model Context Protocol (MCP) server provides a tool for scraping webpages and converting them to markdown format using Puppeteer, Readability, and Turndown. It features a simple, rule-based interaction mechanism to handle common elements like cookie banners.
Now easily runnable via npx
Features
- Scrapes webpages using Puppeteer with stealth mode
- Uses a rule-based system to automatically handle common pop-ups (e.g., cookie consent banners).
- Extracts main content with Mozilla's Readability
- Converts HTML to well-formatted Markdown
- Handles authentication via a QR code login flow, automatically persisting sessions.
- Accessible via the Model Context Protocol
- Option to view browser interaction in real-time by disabling headless mode
- Easily consumable as an npx package.
Quick Start with NPX
The recommended way to use this server is via npx, which ensures you're running the latest version without needing to clone or manually install.
- Prerequisites: Ensure you have Node.js and npm installed.
- Environment Setup (Optional): You can configure the server using a .env file or shell environment variables. Example .env file or shell exports:
    # Optional (defaults shown)
    # TRANSPORT_TYPE=stdio # Options: stdio, sse, http
    # PORT=3001 # Only used in sse/http modes
    # DISABLE_HEADLESS=true # Uncomment to see the browser in action
- Run the Server: Open your terminal and run:
    npx -y zhihu-mcp-server
  - The -y flag automatically confirms any prompts from npx.
  - This command will download (if not already cached) and execute the server.
  - By default, it starts in stdio mode. Set TRANSPORT_TYPE=sse or TRANSPORT_TYPE=http for HTTP server modes.
Authentication
For tools that require you to be logged in (like publish-answer), this server uses a cookie-based authentication flow. You no longer need to provide a COOKIE environment variable.
The process is as follows:
- Login: Call the login-with-qrcode tool. This will return a QR code.
- Scan: Scan the QR code with the appropriate mobile app (e.g., Zhihu) to log in.
- Session Saved: Once you log in, the server automatically saves the session cookies to a local file (qrcodes/cookies.json).
- Automatic Authentication: All subsequent requests from tools like scrape-webpage, get-hot-question, and publish-answer will automatically use these saved cookies to authenticate your session.
This means you only need to log in once, and your session will be reused until the cookies expire.
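The "reused until the cookies expire" rule can be illustrated with a small sketch. It is not the server's actual code: the StoredCookie shape and the liveCookies and needsLogin helpers are hypothetical, and it assumes the saved cookies follow Puppeteer's convention of an expires field in Unix seconds, with -1 marking a browser-session cookie.

```typescript
// Hypothetical minimal cookie shape. `expires` is assumed to follow
// Puppeteer's convention: Unix seconds, or -1 for a session cookie.
interface StoredCookie {
  name: string;
  value: string;
  domain: string;
  expires?: number;
}

// Keep only the cookies that have not expired at `nowSeconds`.
// Cookies without an expiry (or with a negative one) are treated as live.
function liveCookies(cookies: StoredCookie[], nowSeconds: number): StoredCookie[] {
  return cookies.filter(
    (c) => c.expires === undefined || c.expires < 0 || c.expires > nowSeconds
  );
}

// Hypothetical check: a fresh QR code login is needed when no usable cookie is left.
function needsLogin(cookies: StoredCookie[], nowSeconds: number): boolean {
  return liveCookies(cookies, nowSeconds).length === 0;
}
```

A caller would pass the parsed contents of qrcodes/cookies.json together with Math.floor(Date.now() / 1000).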
Using as an MCP Tool with NPX
This server is designed to be integrated as a tool within an MCP-compatible LLM orchestrator. Here's an example configuration snippet:
{
  "mcpServers": {
    "web-scraper": {
      "command": "npx",
      "args": ["-y", "zhihu-mcp-server"],
      "env": {
        // Optional:
        // "TRANSPORT_TYPE": "stdio", // or "sse" or "http"
        // "DISABLE_HEADLESS": "true" // To see the browser during operations
      }
    }
    // ... other MCP servers
  }
}
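When configured this way, the MCP orchestrator will manage the lifecycle of the zhihu-mcp-server process. The // comments in the snippet are for illustration only; remove them if your client expects strict JSON.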
Environment Configuration Details
Regardless of how you run the server (NPX or local development), it uses the following environment variables:
- TRANSPORT_TYPE: (Optional) The transport protocol to use.
  - Options: stdio (default), sse, http
  - stdio: Direct process communication (recommended for most use cases)
  - sse: Server-Sent Events over HTTP (legacy mode)
  - http: Streamable HTTP transport with session management
- PORT: (Optional) The port for the HTTP server in SSE or HTTP mode.
  - Default: 3001.
- DISABLE_HEADLESS: (Optional) Set to true to run the browser in visible mode.
  - Default: false (browser runs in headless mode).
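As a rough illustration of how these three variables and their defaults fit together, here is a minimal sketch. The ServerConfig type and the parseConfig and isTransportType helpers are hypothetical names, and the validation rules (rejecting unknown transports and out-of-range ports) are assumptions; the server's real parsing may differ.

```typescript
type TransportType = "stdio" | "sse" | "http";

interface ServerConfig {
  transport: TransportType;
  port: number;
  headless: boolean;
}

// Type guard for the three documented transport values.
function isTransportType(value: string): value is TransportType {
  return value === "stdio" || value === "sse" || value === "http";
}

// Hypothetical parser applying the documented defaults:
// TRANSPORT_TYPE=stdio, PORT=3001, DISABLE_HEADLESS=false.
function parseConfig(env: Record<string, string | undefined>): ServerConfig {
  const transport = env["TRANSPORT_TYPE"] ?? "stdio";
  if (!isTransportType(transport)) {
    throw new Error(`Unsupported TRANSPORT_TYPE: ${transport}`);
  }
  const port = Number(env["PORT"] ?? "3001");
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`Invalid PORT: ${env["PORT"]}`);
  }
  // The browser stays headless unless DISABLE_HEADLESS is exactly "true".
  return { transport, port, headless: env["DISABLE_HEADLESS"] !== "true" };
}
```

In a Node.js process this would typically be called as parseConfig(process.env).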
Communication Modes
The server supports three communication modes:
- stdio (Default): Communicates via standard input/output.
  - Perfect for direct integration with LLM tools that manage processes.
  - Ideal for command-line usage and scripting.
  - No HTTP server is started. This is the default mode.
- SSE mode: Communicates via Server-Sent Events over HTTP.
  - Enable by setting TRANSPORT_TYPE=sse in your environment.
  - Starts an HTTP server on the specified PORT (default: 3001).
  - Use when you need to connect to the tool over a network.
  - Connect to: http://localhost:3001/sse
- HTTP mode: Communicates via Streamable HTTP transport with session management.
  - Enable by setting TRANSPORT_TYPE=http in your environment.
  - Starts an HTTP server on the specified PORT (default: 3001).
  - Supports full session management and resumable connections.
  - Connect to: http://localhost:3001/mcp
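The mapping from mode to connection URL above can be summarized in a few lines. This is only a sketch: endpointFor is a hypothetical helper, and the host parameter is an assumption for cases where the server is not on localhost.

```typescript
type TransportMode = "stdio" | "sse" | "http";

// Returns the URL a client would connect to, or null for stdio,
// where no HTTP server is started.
function endpointFor(
  mode: TransportMode,
  port: number = 3001,
  host: string = "localhost"
): string | null {
  if (mode === "stdio") {
    return null;
  }
  // SSE mode is served at /sse, Streamable HTTP mode at /mcp.
  const path = mode === "sse" ? "/sse" : "/mcp";
  return `http://${host}:${port}${path}`;
}
```

For example, endpointFor("http", 8080) yields the URL to use when the server was started with PORT=8080.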
Tool Usage (MCP Invocation)
The server provides the following tools:
scrape-webpage
Scrapes a webpage and returns its content as markdown.
Tool Parameters:
- url (string, required): The URL of the webpage to scrape.
- autoInteract (boolean, optional, default: true): Whether to automatically handle interactive elements.
get-hot-question
Gets a hot question from the specified URL.
Tool Parameters:
- type (string, optional, default: day): The type of hot question list to get. Can be hour, day, or week.
publish-answer
Publishes an answer to a question on the specified URL.
Tool Parameters:
- url (string, required): The URL of the question to answer.
- answer (string, required): The answer to publish.
login-with-qrcode
Gets a login QR code from the specified URL.
Tool Parameters:
- qrSelector (string, optional): The CSS selector for the QR code element. Defaults to .Qrcode-qrcode.
- switchQrSelector (string, optional): The CSS selector for the button to switch to QR code login.
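For orientation, this is roughly what invoking these tools looks like at the protocol level. MCP tool invocations are JSON-RPC 2.0 requests with the method tools/call; the buildToolCall helper and the ToolCallRequest type below are hypothetical, and in practice an MCP client library usually builds this envelope (and performs the initial handshake) for you.

```typescript
// Shape of an MCP tools/call request (JSON-RPC 2.0 envelope).
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

// Hypothetical helper that wraps a tool name and its arguments.
function buildToolCall(
  id: number,
  name: string,
  args: Record<string, unknown>
): ToolCallRequest {
  return { jsonrpc: "2.0", id, method: "tools/call", params: { name, arguments: args } };
}

// Requests for two of the tools listed above, using their documented parameters.
const scrapeRequest = buildToolCall(1, "scrape-webpage", {
  url: "https://example.com",
  autoInteract: true,
});
const hotQuestionRequest = buildToolCall(2, "get-hot-question", { type: "day" });
```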
Response Format
The tool returns its result in a structured format:
- content: An array containing a single text object with the raw markdown of the scraped webpage.
- metadata: Contains additional information:
  - message: Status message.
  - success: Boolean indicating success.
  - contentSize: Size of the content in characters (on success).
Example Success Response:
{
  "content": [
    {
      "type": "text",
      "text": "# Page Title\n\nThis is the content..."
    }
  ],
  "metadata": {
    "message": "Scraping successful",
    "success": true,
    "contentSize": 8734
  }
}
Example Error Response:
{
  "content": [
    {
      "type": "text",
      "text": ""
    }
  ],
  "metadata": {
    "message": "Error scraping webpage: Failed to load the URL",
    "success": false
  }
}
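A client consuming these responses might type them and unwrap the markdown along the following lines. The ScrapeResponse type mirrors the two examples above; extractMarkdown is a hypothetical helper, and turning a failed response into a thrown error is one possible design rather than anything the server prescribes.

```typescript
// Types mirroring the documented response format.
interface ScrapeResponse {
  content: { type: "text"; text: string }[];
  metadata: { message: string; success: boolean; contentSize?: number };
}

// Hypothetical helper: returns the markdown on success and throws
// with the server's status message on failure.
function extractMarkdown(response: ScrapeResponse): string {
  if (!response.metadata.success) {
    throw new Error(response.metadata.message);
  }
  return response.content.map((item) => item.text).join("\n");
}
```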
How It Works
Simple Interaction
The system uses a simple rule-based approach to handle common website interruptions. It searches for buttons containing keywords like "Accept", "Agree", or "Continue" and clicks them to dismiss pop-ups like cookie banners.
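The keyword rule can be pictured with a short sketch. It is not the project's implementation: it works on plain button labels rather than live Puppeteer elements, the isDismissButton and pickDismissButtons names are hypothetical, and matching whole words case-insensitively is an assumption (chosen here so that a label like "Disagree" is not clicked).

```typescript
// Keywords named in the description above, matched as whole words, case-insensitively.
const DISMISS_PATTERN = /\b(accept|agree|continue)\b/i;

// True when a button label looks like a consent or dismiss action.
function isDismissButton(label: string): boolean {
  return DISMISS_PATTERN.test(label.trim());
}

// Returns the indices of the labels that should be clicked.
function pickDismissButtons(labels: string[]): number[] {
  return labels
    .map((label, index) => (isDismissButton(label) ? index : -1))
    .filter((index) => index >= 0);
}
```

In a real scraper, the labels would come from the text content of the page's button elements, and the selected indices would be mapped back to those elements before clicking.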
Content Extraction
After interactions, Mozilla's Readability extracts the main content, which is then sanitized and converted to Markdown using Turndown with custom rules for code blocks and tables.
Docker
This project includes a Dockerfile to build and run the server in a containerized environment.
Building the Docker Image
From the project root directory, run:
docker build -t zhihu-mcp-server:latest .
Running the Docker Container
To run the server inside a Docker container, use the following command. You can pass environment variables using the -e flag.
To get the login QR code and persist the session, you need to mount a volume to the container. This ensures the qrcodes/cookies.json file is saved on your host machine.
# Temporary debugging: run interactively
mkdir -p ./qrcodes && sudo chown 999:999 ./qrcodes && \
docker run -it --rm \
  --user 999:999 \
  -e TRANSPORT_TYPE=http \
  -e PORT=3001 \
  -v "$(pwd)/qrcodes:/home/pptruser/qrcodes" \
  -p 3001:3001 \
  zhihu-mcp-server:latest

# Run detached in the background, with the same volume mount so the session persists
mkdir -p ./qrcodes && sudo chown 999:999 ./qrcodes && \
docker run -d \
  --user 999:999 \
  -e TRANSPORT_TYPE=http \
  -e PORT=3001 \
  -v "$(pwd)/qrcodes:/home/pptruser/qrcodes" \
  -p 3001:3001 \
  zhihu-mcp-server:latest
Docker Environment Variables
When running the server in a Docker container, you can configure it with the following environment variables:
- TRANSPORT_TYPE: (Optional) The transport protocol to use.
  - Options: stdio (default), sse, http.
  - Example: -e TRANSPORT_TYPE=http
- PORT: (Optional) The port for the HTTP server in sse or http mode. You must also map this port using the -p flag in the docker run command.
  - Default: 3001.
  - Example: -e PORT=8080 -p 8080:8080
- DISABLE_HEADLESS: (Optional) Set to true to run the browser in visible mode. Note: This is primarily for debugging and may require additional X11 forwarding configuration to work correctly with Docker.
  - Default: false (browser runs in headless mode).
  - Example: -e DISABLE_HEADLESS=true
Installation & Development (for Modifying the Code)
If you wish to contribute, modify the server, or run a local development version:
- Clone the Repository:
    git clone https://github.com/morrain/zhihuMcpServer.git
    cd zhihuMcpServer
- Install Dependencies:
    npm install
- Build the Project:
    npm run build
- Run for Development:
    npm start
  Or, for automatic rebuilding on changes:
    npm run dev
Customization (for Developers)
You can modify the behavior of the scraper by editing:
- src/ai/page-interactions.ts: Add new keywords or logic for handling different types of pop-ups.
- src/scrapers/webpage-scraper.ts (visitWebPage function): Change Puppeteer options.
- src/utils/markdown-formatters.ts: Adjust Turndown rules for Markdown conversion.
Dependencies
Key dependencies include:
- @modelcontextprotocol/sdk
- puppeteer, puppeteer-extra
- @mozilla/readability, jsdom
- turndown, sanitize-html
- express (for SSE/HTTP modes)
- zod