PNDA-MCP
Model Context Protocol (MCP) Server for PNDA - National Open Data Platform / Plataforma Nacional de Datos Abiertos (Peru)
👨‍💻 Author
Ivan Yang Rodriguez Carranza
📋 Table of Contents
- 🎯 Overview
- 🎬 Demo
- 🔧 Tools
- 💬 Prompts
- 🚀 How to Use
- 💡 Examples
- 🏛️ Architecture Diagram
- ⚙️ ETL Pipeline
- 📝 License
🎯 Overview
PNDA-MCP is a Model Context Protocol (MCP) server for Peru's National Open Data Platform (Plataforma Nacional de Datos Abiertos). Although Peru's open data platform datosabiertos.gob.pe hosts valuable datasets, it can be challenging for AI agents to find and retrieve the most relevant data for a specific data analysis question. PNDA-MCP simplifies this by providing tools and prompts that let AI agents, or any MCP client (such as VS Code or Claude Desktop), easily search for and access dataset metadata and the associated data files. The goal is to enable data scientist agents or code agents to automatically discover and analyze public datasets.
This repository includes the ETL pipeline used to extract, transform, and index dataset titles (see the `etl` folder).
🎬 Demo
Demo (Spanish):
🔧 Tools
| Name | Input | Description |
|---|---|---|
| `dataset_search` | `query`, `top_k` | Search for relevant datasets from the PNDA (Plataforma Nacional de Datos Abiertos) Peru. `query` is the search text; `top_k` limits the number of results returned (max 25). |
| `dataset_details` | `id` | Get dataset details including title, metadata, and resources. Returns complete resource information: direct download URLs, file names, sizes, creation dates, MIME types, formats, states, and descriptions. |
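For reference, here is a minimal sketch of calling these tools programmatically with the official MCP Python SDK, bridging to the hosted server through `mcp-remote` (the same command configured in the VS Code steps below). The dataset id passed to `dataset_details` is a placeholder you would take from the search results:

```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client


async def main():
    # Bridge to the hosted server via mcp-remote (requires npx)
    params = StdioServerParameters(
        command="npx",
        args=["mcp-remote", "https://pnda-mcp.onrender.com/mcp"],
    )
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # Semantic search over PNDA dataset titles
            results = await session.call_tool(
                "dataset_search",
                {"query": "accidentes mineros", "top_k": 5},
            )
            print(results.content)
            # Fetch full metadata for one dataset; "some-dataset-id"
            # is a placeholder taken from the search results
            details = await session.call_tool(
                "dataset_details", {"id": "some-dataset-id"}
            )
            print(details.content)


asyncio.run(main())
```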
💬 Prompts
| Name | Input | Description |
|---|---|---|
| `question_generation` | `topic` | Generate 5 data analysis questions for any topic using available PNDA datasets. |
| `analysis_quick` | `question` | Create a minimal Jupyter notebook with quick data analysis addressing a question. |
| `analysis_full` | `question` | Create a complete Jupyter notebook with detailed data exploration and analysis addressing a question. |
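Prompts are retrieved the same way over MCP; a short sketch reusing the `session` from the tool example above (`get_prompt` is part of the MCP Python SDK):

```python
# Ask the server to generate analysis questions for a topic
prompt = await session.get_prompt("question_generation", {"topic": "Mining"})
for message in prompt.messages:
    print(message.role, message.content)
```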
🚀 How to Use
VS Code (Remote Server)
Note: Requires `npx`, which comes bundled with npm. If you don't have npm installed, install Node.js, which includes npm.
The fastest and easiest way to try this MCP is to use the 1-click installation button:
Note: If the MCP tools and prompts do not load immediately, please try restarting VS Code.
Manual installation:
- Open the Command Palette: `View > Command Palette` (or `Cmd+Shift+P` on Mac / `Ctrl+Shift+P` on Windows/Linux)
- Type and select: `MCP: Add Server...`
- Choose "Command (stdio)" as the server type
- For "Command to run (with optional arguments)", enter: `npx mcp-remote https://pnda-mcp.onrender.com/mcp`
- Set the name for the MCP server: `pnda-mcp`
- Select where to save the configuration: User Settings saves the config globally for all projects; Workspace Settings saves it locally for just the current one.
- Save the configuration
- Restart VS Code for the MCP server to become available.
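Equivalently, the saved configuration can be edited by hand; roughly, the resulting entry looks like the sketch below (the exact `mcp.json`/settings schema may vary across VS Code versions, so treat this as illustrative):

```json
{
  "servers": {
    "pnda-mcp": {
      "type": "stdio",
      "command": "npx",
      "args": ["mcp-remote", "https://pnda-mcp.onrender.com/mcp"]
    }
  }
}
```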
VS Code (Local Server)
Important: Before running the MCP server locally, you need to:
- Have an OpenAI API key. Get your OpenAI API key from platform.openai.com.
- Have a Pinecone account. If you don't have an account, you can sign up at pinecone.io.
- Configure your OpenAI API key and Pinecone API key in the `.env` configuration file.
- Run the ETL pipeline to index the dataset metadata from PNDA to Pinecone (see the ETL Pipeline section below).
- Open the Command Palette: `View > Command Palette` (or `Cmd+Shift+P` on Mac / `Ctrl+Shift+P` on Windows/Linux)
- Type and select: `MCP: Add Server...`
- Choose "Command (stdio)" as the server type
- For "Command to run (with optional arguments)", enter: `uv --directory /path/to/pnda-mcp run main.py`
  Note: Replace `/path/to/pnda-mcp` with the actual path where you cloned the repository.
- Set the name for the MCP server: `pnda-mcp`
- Select where to save the configuration: User Settings saves the config globally for all projects; Workspace Settings saves it locally for just the current one.
- Save the configuration
- Restart VS Code for the MCP server to become available.
MCP Inspector (Alternative)
Important: Before running the MCP server locally, you need to:
- Have an OpenAI API key. Get your OpenAI API key from platform.openai.com.
- Have a Pinecone account. If you don't have an account, you can sign up at pinecone.io.
- Configure your OpenAI API key and Pinecone API key in the `.env` configuration file.
- Run the ETL pipeline to index the dataset metadata from PNDA to Pinecone (see the ETL Pipeline section below).
Note: Requires `npx`, which comes bundled with npm. If you don't have npm installed, install Node.js, which includes npm.
Note: Replace `/path/to/pnda-mcp` with the actual path where you cloned the repository.
Run:
```
npx @modelcontextprotocol/inspector \
  uv \
  --directory /path/to/pnda-mcp \
  run \
  main.py
```
Open MCP Inspector (URL displayed in the console) and configure the MCP client with the following settings:
- Transport Type: STDIO
- Command: `python`
- Arguments: `main.py`
💡 Examples
| Prompt | Input | Demo | Notebook | Language |
|---|---|---|---|---|
| `question_generation` | Mining | View Demo | - | English |
| `analysis_quick` | How has student enrollment at the National University of Engineering evolved between 2017 and 2023 by faculties and degree programs? | View Demo | | English |
| `analysis_full` | What types of fatal accidents are most frequent in the Peruvian mining industry, and in which departments do they occur most often? | View Demo | | English |
| `question_generation` | Minería | View Demo | - | Spanish |
| `analysis_quick` | ¿Cómo ha evolucionado la matrícula de estudiantes en la Universidad Nacional de Ingeniería entre 2017 y 2023 por facultades y carreras? | View Demo | | Spanish |
| `analysis_full` | ¿Qué tipos de accidentes mortales son más frecuentes en la industria minera peruana y en qué departamentos ocurren con mayor frecuencia? | View Demo | | Spanish |
🏛️ Architecture Diagram
PNDA-MCP follows the Model Context Protocol specification and provides a clean abstraction layer for PNDA.
```mermaid
graph LR
    CLIENT[MCP Client<br/>VS Code, Cursor, etc.] --> MCP_SERVER[PNDA-MCP Server]
    subgraph TOOLS ["🔧 Tools"]
        DATASET_SEARCH[dataset_search]
        DATASET_DETAILS[dataset_details]
    end
    subgraph PROMPTS ["💬 Prompts"]
        QUESTION_GEN[question_generation]
        ANALYSIS_QUICK[analysis_quick]
        ANALYSIS_FULL[analysis_full]
    end
    MCP_SERVER --> DATASET_SEARCH
    MCP_SERVER --> DATASET_DETAILS
    MCP_SERVER --> QUESTION_GEN
    MCP_SERVER --> ANALYSIS_QUICK
    MCP_SERVER --> ANALYSIS_FULL
    DATASET_SEARCH -->|semantic search| PINECONE[Pinecone Vector Database]
    DATASET_SEARCH --> OPENAI[OpenAI Text Embeddings API]
    DATASET_DETAILS --> CACHE[Cache Layer]
    CACHE --> |fallback source| PNDA_API[PNDA API]
    CACHE --> |secondary fallback| PINECONE
    style CLIENT fill:#e3f2fd
    style MCP_SERVER fill:#f3e5f5
    style PNDA_API fill:#fff3e0
    style PINECONE fill:#fff3e0
    style OPENAI fill:#fff3e0
```
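As the diagram shows, `dataset_details` consults a cache first and falls back to the PNDA API, with Pinecone as a secondary fallback. Below is a minimal sketch of that pattern, assuming a CKAN-style `package_show` endpoint; the URL and the in-memory cache are illustrative, not the server's actual implementation, and the Pinecone fallback is omitted for brevity:

```python
import requests

# Illustrative in-memory cache; the real cache layer may differ
_cache: dict[str, dict] = {}

# Assumed CKAN-style endpoint; verify against the real PNDA API
PNDA_API = "https://www.datosabiertos.gob.pe/api/3/action/package_show"


def dataset_details(dataset_id: str) -> dict:
    # 1. Serve from the cache when possible
    if dataset_id in _cache:
        return _cache[dataset_id]
    # 2. Fallback source: the live PNDA API
    response = requests.get(PNDA_API, params={"id": dataset_id}, timeout=30)
    response.raise_for_status()
    details = response.json()["result"]
    _cache[dataset_id] = details
    return details
```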
⚙️ ETL Pipeline
Important: The following ETL documentation is only needed if you want to run the MCP locally or deploy your own MCP service. You can use the remote MCP service without running the ETL.
To search datasets using natural language, the server relies on semantic search over text vector embeddings. The ETL pipeline handles the initial indexing and ongoing synchronization of the vector database containing dataset metadata from Peru's National Open Data Platform. It can be run manually or automatically via cron jobs to keep the dataset information up to date.
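Conceptually, a search works like the sketch below, using the OpenAI and Pinecone Python SDKs; the index name `pnda-datasets` is a placeholder, not necessarily what the ETL pipeline creates:

```python
from openai import OpenAI
from pinecone import Pinecone

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
pc = Pinecone()           # reads PINECONE_API_KEY from the environment


def dataset_search(query: str, top_k: int = 5):
    # Embed the query with the same model used during indexing
    embedding = openai_client.embeddings.create(
        model="text-embedding-3-small", input=query
    ).data[0].embedding
    # Nearest-neighbor search over the indexed dataset titles
    index = pc.Index("pnda-datasets")  # placeholder index name
    return index.query(vector=embedding, top_k=top_k, include_metadata=True)
```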
Requirements
- Docker & Redis: Runs a Redis server locally, which serves as the message broker and result backend that coordinate tasks between Celery workers during ETL pipeline execution (see the configuration sketch after this list).
- OpenAI API key: The OpenAI Text Embeddings API converts dataset titles into vectors using the `text-embedding-3-small` model. Get your OpenAI API key from platform.openai.com.
- Pinecone account: Dataset titles are indexed in the Pinecone cloud vector database for semantic search. If you don't have an account, you can sign up at pinecone.io.
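For context, a Redis broker/result backend like this is typically wired into a Celery app as below; this is a hedged sketch of how the repository's `etl.tasks.app` might be configured, not its actual code:

```python
from celery import Celery

# Redis (from the Docker step below) as both message broker and result backend
app = Celery(
    "etl",
    broker="redis://localhost:6379/0",
    backend="redis://localhost:6379/0",
)


@app.task
def index_batch(titles: list[str]) -> int:
    # Illustrative task body: the real pipeline embeds dataset titles
    # and upserts the vectors to Pinecone
    return len(titles)
```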
Setup and Usage
Note: Make sure you have `uv` installed. If not, follow the installation instructions in the uv documentation.
- Clone and install:
```
git clone https://github.com/rodcar/pnda-mcp.git
cd pnda-mcp
uv sync
```
- Create the `.env` file
macOS/Linux:
```
cp .env.example .env
```
Windows:
```
copy .env.example .env
```
- Set your `OPENAI_API_KEY` and `PINECONE_API_KEY` values in the `.env` file.
Note: Get your OpenAI API key from platform.openai.com and your Pinecone API key from app.pinecone.io.
- Run Redis with Docker
Note: Celery also supports other broker and backend options. See the Celery documentation for more details.
```
docker run -d -p 6379:6379 redis
```
- Start the Celery worker
macOS/Linux:
```
./etl/celery_worker.sh
```
Windows:
```
uv run celery -A etl.tasks.app worker --loglevel=info
```
Note: The Celery worker processes ETL tasks asynchronously. Keep this terminal window open; you'll see task execution logs here when the pipeline runs.
- Run the ETL pipeline
The pipeline can be executed manually (on demand) or automated with a cron job for daily execution. It is recommended to perform the initial indexing manually, then use the cron job to keep the data synchronized.
Manual execution:
```
python -m etl.pipeline
```
Note: The execution might take several minutes. You can see the logs in the `etl/logs/etl.log` file, and the output files of intermediate ETL tasks in the `etl/results` folder.
Note: You can remove all pending tasks from the Celery task queue with the following command: `celery -A etl.tasks.app purge -f`.
Scheduled with a cron job (macOS/Linux):
a. Make the script executable:
Note: Replace `/path/to/pnda-mcp/etl/cron.sh` with the actual path to the `cron.sh` file.
```
chmod +x /path/to/pnda-mcp/etl/cron.sh
```
b. Edit the crontab:
```
crontab -e
```
c. Add this line (runs daily at 2 AM):
Note: Replace `/path/to/pnda-mcp/etl/cron.sh` with the actual path to the `cron.sh` file.
Note: If you are using vim, press `i` to enter insert mode and paste the cron job; press `Esc` to return to normal mode, then use `:wq` to save and exit.
Note: To change the hour, replace the 2 (which means 2 AM) with your desired hour in 24-hour format (e.g., 14 for 2 PM).
```
0 2 * * * /path/to/pnda-mcp/etl/cron.sh
```
d. Verify the cron job was added to the crontab:
```
crontab -l
```
The pipeline will execute daily at the time specified in the crontab configuration.
Note: You can see the logs in the `etl/logs/etl.log` file, and the output files of intermediate ETL tasks in the `etl/results` folder.
ETL Diagram
The following diagram shows the three-stage ETL pipeline that processes dataset metadata from Peru's National Open Data Platform.
```mermaid
flowchart LR
    subgraph EXTRACT_WRAPPER["<b>Extract</b>"]
        EXTRACT["Fetch complete dataset list from PNDA API"] --> PNDA_API["For each dataset, fetch metadata from PNDA API"]
    end
    subgraph TRANSFORM_WRAPPER["<b>Transform</b>"]
        FILTER["Filter active datasets"] --> STRUCTURE["Format dataset metadata for indexing"]
    end
    subgraph LOAD_WRAPPER["<b>Load</b>"]
        FILTER_CHANGED["Filter datasets with changes*"] --> EMBEDDINGS["Generate embeddings using OpenAI Text Embeddings API"] --> UPSERT["Upsert embeddings to the vector database (Pinecone)"]
    end
    EXTRACT_WRAPPER e1@==> TRANSFORM_WRAPPER
    TRANSFORM_WRAPPER e2@==> LOAD_WRAPPER
    e1@{ animate: true }
    e2@{ animate: true }
    style EXTRACT fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    style PNDA_API fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    style FILTER fill:#e8f5e9,stroke:#388e3c,stroke-width:2px
    style STRUCTURE fill:#e8f5e9,stroke:#388e3c,stroke-width:2px
    style FILTER_CHANGED fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    style EMBEDDINGS fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    style UPSERT fill:#fff3e0,stroke:#f57c00,stroke-width:2px
```
*Filters datasets whose `metadata_modified` value has changed since the last locally stored version (`etl/results/processing_results.json`), which means their metadata must be updated in the vector database.
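A minimal sketch of that change check, assuming `processing_results.json` holds a list of previously processed datasets with CKAN-style `id` and `metadata_modified` fields (the actual file layout may differ):

```python
import json
from pathlib import Path

STATE_FILE = Path("etl/results/processing_results.json")


def filter_changed(datasets: list[dict]) -> list[dict]:
    # Load the last processed state; on the first run everything is indexed
    if STATE_FILE.exists():
        previous = {
            d["id"]: d["metadata_modified"]
            for d in json.loads(STATE_FILE.read_text())
        }
    else:
        previous = {}
    # Keep only datasets whose metadata_modified differs from last time
    return [d for d in datasets if previous.get(d["id"]) != d["metadata_modified"]]
```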
📝 License
This project is licensed under the terms of the LICENSE file in the repository.