MCP KnowledgeExplorer

The MCP Research File Server is a multimodal file server designed to enhance research proposal development and AI-assisted writing through the Model Context Protocol (MCP).

🎉 Status: Phase 4B M2 COMPLETE - Keyword Search Ready! Phase 4B Milestone 2 (Keyword Search Path) is 100% complete with production-ready full-text search capabilities. The complete TEXT_EXTRACT → CHUNK → FTS_INDEX pipeline, FTS5 search with trigram tokenizer, PermissionPostprocessor security layer, and search_fulltext MCP tool (8th tool) are operational with 49+ comprehensive tests. Ready for M3 Semantic Search implementation.

Quick Start

# Start the entire system with one command
# This will now start FOUR services: backend, frontend, indexer, and Qdrant
docker-compose up --build

# Access the services
# Frontend UI: http://localhost:5173
# Backend API: http://localhost:8000
# MCP HTTP: http://localhost:8000/mcp (for AI clients)
# Qdrant API: http://localhost:6333 (vector database)

# View logs for a specific service
docker-compose logs -f backend
docker-compose logs -f frontend
docker-compose logs -f indexer
docker-compose logs -f qdrant

Testing

MCP Wisdom Comprehensive Test Suite

Validate all MCP functionality with the comprehensive test routine:

# Quick test (validates current environment)
./scripts/test-mcp-wisdom.sh

# Full comprehensive test suite
./scripts/test-mcp-wisdom.sh --comprehensive

# Create isolated test environment
./scripts/test-mcp-wisdom.sh --isolated

# View previous test results
./scripts/test-mcp-wisdom.sh --report-only

Features:

  • ✅ Adaptive pre-validation (checks workspace and permissions)
  • ✅ Tests all 8 MCP tools with permission enforcement
  • ✅ Security testing (directory traversal, unauthorized access)
  • ✅ Performance validation (sub-25ms response times)
  • ✅ JSON reporting with detailed metrics

Phase 4B Test Suite

Validate the new Phase 4B foundations:

# Run Phase 4B database schema tests
cd backend && python -m pytest tests/phase4b/test_database_schema.py -v

# Run Phase 4B Qdrant integration tests
cd backend && python -m pytest tests/phase4b/test_qdrant_integration.py -v

# Run all Phase 4B tests
cd backend && python -m pytest tests/phase4b/ -v

Test Coverage:

  • ✅ 11 database schema tests (DocumentChunk, FTS5, triggers)
  • ✅ 8 Qdrant integration tests (connection, collections, vectors)
  • ✅ Temporary database isolation for reliable testing (see the sketch below)
  • ✅ Graceful skipping when dependencies are unavailable
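
As an illustration of the isolation approach, a minimal pytest fixture along these lines (a sketch with hypothetical names, not the project's actual test code):

# hypothetical sketch of temporary-database isolation for schema tests
import sqlite3
import pytest

@pytest.fixture
def temp_db(tmp_path):
    """Throwaway SQLite database so tests never touch data/database.db."""
    conn = sqlite3.connect(tmp_path / "test.db")
    conn.execute("PRAGMA journal_mode=WAL")  # mirror production WAL mode
    yield conn
    conn.close()

def test_chunks_fts_table_exists(temp_db):
    # Create the schema under test inside the isolated database, then assert on it.
    temp_db.execute(
        "CREATE VIRTUAL TABLE chunks_fts USING fts5(text, tokenize='trigram')"
    )
    row = temp_db.execute(
        "SELECT name FROM sqlite_master WHERE name='chunks_fts'"
    ).fetchone()
    assert row == ("chunks_fts",)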

Force Reindex Operations

The Force Reindex feature lets administrators rebuild the content index from scratch or refresh indexing flags. It is available through both the web UI and the CLI.

Web UI Access

Navigate to the Indexer tab in the web interface (http://localhost:5173) and use the Force Reindex dropdown button:

  • Soft Reindex (recommended): Non-destructive reindexing that preserves existing data
  • Hard Reset: Complete rebuild that purges all existing data

CLI Operations

Trigger and manage database reindexing operations programmatically:

# Trigger soft reindex (keeps existing data, recommended)
python scripts/trigger_reindex.py trigger --mode soft

# Trigger hard reset (purges and rebuilds data)
python scripts/trigger_reindex.py trigger --mode hard

# Filter by path for targeted reindexing
python scripts/trigger_reindex.py trigger --mode soft --path /projects

# Include all file types (not just text files)
python scripts/trigger_reindex.py trigger --mode soft --no-text-only

# Dry run to see what would be reindexed
python scripts/trigger_reindex.py trigger --mode soft --dry-run

# Check batch status with real-time monitoring
python scripts/trigger_reindex.py status <batch_id>
python scripts/trigger_reindex.py status <batch_id> --watch

# List all batches
python scripts/trigger_reindex.py list --all

# Control batch execution
python scripts/trigger_reindex.py pause <batch_id>
python scripts/trigger_reindex.py resume <batch_id>
python scripts/trigger_reindex.py cancel <batch_id>

# System management
python scripts/trigger_reindex.py system
python scripts/trigger_reindex.py clear-maintenance

Update – 2025-09-30: Hard reset now prunes database records for files that no longer exist on the shared filesystem, rebuilds the Phase 4B FTS tables, and then enqueues fresh TEXT_EXTRACT jobs. This prevents stale jobs from resurfacing deleted paths.

Tip: After a large hard reset, if the indexer dashboard still reports processing jobs even though the queue is empty, restart the indexer container or send POST /control/resume to refresh the status payload.

API Integration

Force Reindex operations can be integrated into automation workflows via REST API:

# Create new reindex batch
curl -X POST http://localhost:8000/admin/reindex/force \
  -H "X-Admin-Key: admin-secret-key-change-me" \
  -H "Content-Type: application/json" \
  -d '{"mode": "soft", "scope": {"path_prefix": "/projects", "text_only": true}}'

# Monitor batch progress
curl -H "X-Admin-Key: admin-secret-key-change-me" \
  http://localhost:8000/admin/reindex/batches/{batch_id}

# Control batch execution
curl -X POST http://localhost:8000/admin/reindex/batches/{batch_id}/pause \
  -H "X-Admin-Key: admin-secret-key-change-me"

Features:

  • ✅ Dual Modes: Soft reindex (non-destructive) and Hard reset (complete rebuild)
  • ✅ Chunked Processing: 5000 files per chunk for memory efficiency
  • ✅ Path Filtering: Target specific directories for reindexing
  • ✅ Maintenance Mode: Prevents race conditions during operations
  • ✅ Batch Management: Full control with pause/resume/cancel capabilities
  • ✅ Progress Tracking: Real-time progress with ETA calculations
  • ✅ Admin Security: Protected by admin API key authentication
  • ✅ Dry Run Support: Preview operations without making changes
  • ✅ Error Recovery: Transactional operations with rollback support

Performance Characteristics:

  • Soft reindex: ~2-5 seconds per 1000 files
  • Hard reset: ~10-30 seconds per 1000 files (depending on chunk count)
  • Memory usage: <100MB additional during processing
  • Sub-second response times for batch status queries

MCP Integration with AI Clients

Claude Desktop (Recommended)

Claude Desktop has mature MCP support and works directly with localhost connections:

# Add MCP server (HTTP transport recommended)
claude mcp add --transport http wisdom http://localhost:8000/mcp

# Test connection
curl -X POST http://localhost:8000/mcp -H "Content-Type: application/json" \
  -d '{"jsonrpc": "2.0", "method": "tools/list", "id": 1}'

Available as the following tools (see the example call after the list):

  • mcp__wisdom__read_file - Read file contents with permission checking
  • mcp__wisdom__list_files - List directory contents with metadata
  • mcp__wisdom__write_file - Write file contents (subject to permissions)
  • mcp__wisdom__list_all_files - List all indexed files across workspace
  • mcp__wisdom__search_files_by_metadata - Search files by metadata criteria
  • mcp__wisdom__get_file_info - Get detailed file information and metadata
  • mcp__wisdom__get_search_statistics - Retrieve indexing and search statistics
  • mcp__wisdom__search_fulltext - Full-text search with FTS5 and filtering
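
A hedged example of invoking one of these tools over the HTTP endpoint (the tools/call shape follows the MCP JSON-RPC spec; the unprefixed tool name read_file and the example path are assumptions, since clients like Claude surface it as mcp__wisdom__read_file):

# minimal sketch of an MCP tools/call against the local server
import requests

payload = {
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {
        "name": "read_file",                           # server-side tool name (assumed)
        "arguments": {"path": "/projects/notes.md"},   # example path, adjust to your workspace
    },
}
# Note: a real client performs an initialize handshake before tools/call.
resp = requests.post(
    "http://localhost:8000/mcp",
    json=payload,
    headers={"Content-Type": "application/json"},
)
print(resp.json())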

ChatGPT Desktop

ChatGPT Desktop expects Streamable HTTP transport (MCP 2025-03-26 spec) and may have localhost access restrictions. Our server uses basic HTTP JSON-RPC transport, so we need a compatibility layer.

Solution: Proxy + Tunnel Approach

Step 1: Install Dependencies

pip install aiohttp

Step 2: Start Proxy Server

# Start the compatibility proxy (bridges ChatGPT ↔ MCP server)
python chatgpt_proxy.py

This starts a proxy on port 9000 that forwards requests to your MCP server on port 8000.
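
chatgpt_proxy.py ships with the repository; purely as an illustration of what such a bridge does (this sketch is not the bundled implementation), a pass-through proxy with aiohttp might look like:

# illustrative pass-through proxy; NOT the bundled chatgpt_proxy.py
from aiohttp import ClientSession, web

UPSTREAM = "http://localhost:8000/mcp"  # the real MCP server

async def forward(request: web.Request) -> web.Response:
    body = await request.read()
    async with ClientSession() as session:
        async with session.post(
            UPSTREAM, data=body, headers={"Content-Type": "application/json"}
        ) as upstream:
            payload = await upstream.read()
            # Logging both directions here shows exactly what ChatGPT sends.
            return web.Response(
                body=payload, status=upstream.status, content_type="application/json"
            )

app = web.Application()
app.router.add_post("/mcp", forward)

if __name__ == "__main__":
    web.run_app(app, port=9000)  # ChatGPT-facing port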

Step 3: Create Public Tunnel

ChatGPT Desktop may not be able to reach localhost directly. Create a public tunnel:

# Option A: Specific subdomain (preferred, if available)
npx localtunnel --port 8000 --subdomain wisdom-direct
npx localtunnel --port 9000 --subdomain wisdom-proxy

# Option B: Random subdomain (fallback if specific fails)
npx localtunnel --port 8000  # Gets random URL like https://abc-def.loca.lt
npx localtunnel --port 9000  # Gets random URL like https://xyz-123.loca.lt

Step 4: Configure ChatGPT Desktop

Use one of these URLs in ChatGPT Desktop's Connector settings:

# Direct connection (bypasses proxy)
https://[your-subdomain].loca.lt/mcp

# Through proxy (for debugging/logging)
https://[proxy-subdomain].loca.lt/mcp

Process Management

The ChatGPT integration requires 3 running processes:

  1. Docker Compose (your main MCP server) - Port 8000
  2. Proxy Server (python chatgpt_proxy.py) - Port 9000
  3. Tunnel Process (npx localtunnel --port XXXX) - Creates public URL

Troubleshooting Indexer & FTS

  • FTS5 reports invalid file format (found 0, expected 4 or 5) - drop and recreate the chunks_fts virtual table, then rebuild the index:
    python - <<'PY'
    import sqlite3
    conn = sqlite3.connect('data/database.db')
    cur = conn.cursor()
    cur.executescript("""
      DROP TRIGGER IF EXISTS chunks_fts_insert;
      DROP TRIGGER IF EXISTS chunks_fts_update;
      DROP TRIGGER IF EXISTS chunks_fts_delete;
      DROP TABLE IF EXISTS chunks_fts;
      CREATE VIRTUAL TABLE chunks_fts USING fts5(
          text,
          content='document_chunks',
          content_rowid='id',
          tokenize='trigram'
      );
      CREATE TRIGGER chunks_fts_insert AFTER INSERT ON document_chunks BEGIN
          INSERT INTO chunks_fts(rowid, text) VALUES (new.id, new.text);
      END;
      CREATE TRIGGER chunks_fts_update AFTER UPDATE ON document_chunks BEGIN
          INSERT INTO chunks_fts(chunks_fts, rowid, text) VALUES('delete', old.id, old.text);
          INSERT INTO chunks_fts(rowid, text) VALUES (new.id, new.text);
      END;
      CREATE TRIGGER chunks_fts_delete AFTER DELETE ON document_chunks BEGIN
          INSERT INTO chunks_fts(chunks_fts, rowid, text) VALUES('delete', old.id, old.text);
      END;
      INSERT INTO chunks_fts(chunks_fts) VALUES('rebuild');
    """)
    conn.commit()
    conn.close()
    PY
    
  • Dashboard shows processing jobs even though the queue is empty - call POST /control/resume or restart the indexer container to refresh the cached queue statistics. The database view (SELECT status, COUNT(*) FROM index_jobs GROUP BY status) is the source of truth.
  • Phase 4B jobs requeue repeatedly - ensure the indexer service is on the latest build with the queue fixes (TEXT_EXTRACT now reuses existing CHUNK/FTS jobs). Clear any dead-letter entries with UPDATE index_jobs SET status='pending' ... before restarting the worker.

Troubleshooting Connection Issues

Problem: ChatGPT shows "URL is invalid" or connection fails

Solutions:

  1. Try IPv4 explicitly: Use 127.0.0.1 instead of localhost
  2. Check Developer Mode: Ensure ChatGPT Pro/Plus with Developer Mode enabled
  3. Test tunnel manually:
    curl -X POST https://your-tunnel-url.loca.lt/mcp \
      -H "Content-Type: application/json" \
      -d '{"jsonrpc": "2.0", "method": "initialize", "id": 1, "params": {"protocolVersion": "2025-03-26", "capabilities": {}, "clientInfo": {"name": "test", "version": "1.0"}}}'
    
Problem: Tunnel connections refused or firewall errors

Solutions:

  1. Try without specific subdomain:
    npx localtunnel --port 8000  # Let it assign random subdomain
    
  2. Use alternative tunnel service:
    # If you have ngrok installed
    ngrok http 8000
    
Problem: Proxy not receiving requests

Check proxy logs - if no requests appear, ChatGPT isn't reaching the tunnel URL.

Debug steps:

  1. Verify tunnel is running: Visit the tunnel URL in browser
  2. Test tunnel directly with curl (as shown above)
  3. Check ChatGPT Desktop network restrictions

1. Introduction

1.1. Purpose

This project is a production-ready Model Context Protocol (MCP) server that enables AI agents to safely interact with your local file system. It provides a web-based management interface for real-time monitoring and granular permission control, and is being extended with a powerful, local-first search and retrieval engine.

1.2. Current Status - ✅ Phase 4A 100% COMPLETE

Phase 4A advanced search infrastructure has been fully implemented with comprehensive documentation for Phase 4B development:

🎯 Phase 4A Achievement Summary

  • ✅ Search Infrastructure Complete: Indexer service with file watching, job queue, metadata extraction
  • ✅ 7 MCP Tools Operational: 3 core file system + 4 new search tools with cursor pagination
  • ✅ Database Integration: Phase 4A models (IndexedFile, IndexJob, ControlSetting) with WAL mode
  • ✅ Performance Validated: Sub-100ms response times, crash recovery, comprehensive testing
  • ✅ Production Issues Resolved: All 3 critical bugs identified by independent review fixed
  • ✅ Documentation Complete: Implementation summary and Phase 4B development guide

🚨 Critical Production Fixes (v4.0.1) - RESOLVED

  • ✅ Issue 013: Database URL misconfiguration causing empty workspace API responses
  • ✅ Issue 014: Indexer import errors preventing service startup
  • ✅ Issue 015: Frontend 404 cascade errors when workspace list empty
  • ✅ Additional Fixes: SQLAlchemy parameter format, environment variables, cross-container imports

📖 Phase 4B Ready Documentation

  • 📖 docs/Phase4A-Implementation-Summary.md - Complete Phase 4A achievement overview
  • 📖 docs/Phase4B-Development-Guide.md - Detailed semantic search implementation plan
  • 📖 indexer/README.md - Comprehensive indexer service documentation
  • 🎯 Next Phase: Semantic search with vector embeddings, document clustering, 4 new semantic MCP tools

2. Architectural Vision: The "Indexer-Query" Model

To support advanced search, the architecture follows a robust "Indexer-Query" pattern. This design separates the application into distinct, cooperating services for maximum performance and resilience:

  • Backend Service (Query Engine): The existing FastAPI application remains the central hub for all API/MCP requests. It is responsible for running fast searches against the indexes and, most importantly, applying security permissions to all results in real-time.
  • Indexer Service (Background Processor): A new, separate service that continuously monitors the filesystem. It is responsible for the heavy lifting: parsing files, extracting text (including OCR), calculating embeddings, and populating the search indexes.
  • Frontend Service (UI): The existing React application, which will be enhanced with a new "Indexer Dashboard" to give the user full visibility and control (Pause/Resume) over the indexing process.

This separation ensures that intensive background processing never impacts the responsiveness of the main application.
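
As a simplified sketch of that query-time permission step (the real PermissionPostprocessor is more involved; all names here are illustrative), results are filtered before they ever leave the backend:

# illustrative permission post-filter; names are hypothetical (Python 3.9+)
from pathlib import PurePosixPath

def filter_results(results: list[dict], allowed_prefixes: list[str]) -> list[dict]:
    """Drop search hits whose path falls outside the permitted directories."""
    def permitted(path: str) -> bool:
        p = PurePosixPath(path)
        return any(p.is_relative_to(prefix) for prefix in allowed_prefixes)
    return [r for r in results if permitted(r["path"])]

hits = [{"path": "/source/projects/a.md"}, {"path": "/source/private/key.txt"}]
print(filter_results(hits, ["/source/projects"]))  # only the permitted hit survives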


3. System Architecture

3.1. Architecture Diagram

The complete system architecture is visualized in our comprehensive Mermaid diagram:

📊 (Mermaid architecture diagram, not reproduced here)

This diagram shows the complete three-service Docker architecture with all components, data flows, and .env configuration controls. Key architectural elements include:

  • Frontend Service: React UI with Indexer Dashboard
  • Backend Service: FastAPI Query Engine with LlamaIndex integration
  • Indexer Service: Background processing with crash-resilient job queue
  • Data Stores: SQLite (WAL mode), Qdrant vector DB, HuggingFace model cache
  • Configuration: Comprehensive .env system for hardware adaptation

3.2. High-Level Flow

User/Agent → Frontend/MCP → Backend (Query Engine) → Permission Filter → Results
                                ↓
File System ← Indexer ← Job Queue ← File Watcher
    ↓           ↓
SQLite FTS ← Embedding Model → Qdrant Vector DB

4. MCP Protocol Implementation

The existing file system tools remain fully functional. Phase 4 will introduce a new suite of search tools.

4.1. Available MCP Tools (Phase 4A Complete)

| Tool | Parameters | Description | Status |
|------|------------|-------------|--------|
| Core File System Tools | | | |
| read_file | path: string | Read complete file contents | ✅ Ready |
| list_files | path: string | List directory contents with metadata | ✅ Ready |
| write_file | path: string, content: string | Write content to file | ✅ Ready |
| Phase 4A Search Tools | | | |
| list_all_files | limit, cursor, sort_by | List all indexed files with cursor pagination | ✅ Ready |
| search_files_by_metadata | filename_pattern, file_types, size_range, mtime_range | Search files by metadata criteria | ✅ Ready |
| get_file_info | doc_id: string | Get detailed file information by document ID | ✅ Ready |
| get_search_statistics | - | Retrieve indexing progress and search statistics | ✅ Ready |
| Phase 4B Search Tools | | | |
| search_fulltext | query: string, limit, cursor, highlight, file_types | Full-text search with FTS5, highlighting, and filtering | ✅ Ready |
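
Because search_fulltext is backed by SQLite FTS5 with a trigram tokenizer, mid-word fragments still match; a self-contained sketch (requires an SQLite build with FTS5 and trigram support, 3.34+):

# standalone illustration of FTS5 trigram matching
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE demo USING fts5(text, tokenize='trigram')")
conn.executemany(
    "INSERT INTO demo(text) VALUES (?)",
    [("research proposal outline",), ("annual budget spreadsheet",)],
)
# Trigram tokenization indexes every 3-character window, so the fragment
# 'posal' matches inside 'proposal'.
for (text,) in conn.execute("SELECT text FROM demo WHERE demo MATCH ?", ("posal",)):
    print(text)  # -> research proposal outline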

4.2. Planned Advanced Search Tools (Phase 4B M3+)

| Tool | Description | Status |
|------|-------------|--------|
| search_semantic | Semantic similarity search using embeddings | 📋 Planned |
| find_similar | "More like this" vector search | 📋 Planned |
| search_hybrid | Combined FTS5 and vector search with RRF | 📋 Planned |
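
For reference, Reciprocal Rank Fusion scores each document as the sum over result lists of 1/(k + rank); a minimal sketch of the planned fusion idea (k=60 is the conventional constant, not a project setting):

# Reciprocal Rank Fusion over two ranked result lists
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked doc-id lists; documents ranked well in either list rise."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

fts_hits = ["doc3", "doc1", "doc7"]     # keyword (FTS5) ranking
vector_hits = ["doc1", "doc9", "doc3"]  # semantic (vector) ranking
print(rrf([fts_hits, vector_hits]))     # doc1 and doc3 fuse to the top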

5. Configuration Management

5.1. Environment Variables (.env)

The system uses comprehensive .env configuration to support flexible deployment across different hardware setups:

Hardware & Model Selection
# GPU/CPU switching - critical for RTX 4060 users
INDEX_EMBED_DEVICE=cpu           # or 'gpu' when RTX 4060 available
INDEX_EMBED_QUANT=fp16           # Future: 4bit/8bit/fp16 for VRAM management
INDEX_EMBED_MODEL=paraphrase-multilingual-MiniLM-L12-v2

Feature Toggles
# Resource-intensive features with user control
OCR_ENABLED=true                 # Tesseract OCR processing
RERANK_ENABLED=false             # Optional cross-encoder reranker

Performance Tuning
# Search and indexing behavior
RETRIEVAL_MODE=hybrid            # hybrid/fts/vector - invaluable for debugging
INDEXER_BATCH_SIZE=50            # Files per batch - tune memory vs speed
INDEXER_MAX_WORKERS=2            # Parallel processing control

This configuration strategy enables seamless switching between laptop (CPU-only) and desktop (RTX 4060) environments while maintaining optimal performance for each setup.
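
A sketch of how such settings are typically consumed (variable names are from the .env above; the loader itself is illustrative, not the project's config/env_config.py):

# illustrative .env consumption; not the project's env_config.py
import os

EMBED_DEVICE = os.getenv("INDEX_EMBED_DEVICE", "cpu")  # 'cpu' or 'gpu'
EMBED_MODEL = os.getenv("INDEX_EMBED_MODEL", "paraphrase-multilingual-MiniLM-L12-v2")
OCR_ENABLED = os.getenv("OCR_ENABLED", "true").lower() == "true"
BATCH_SIZE = int(os.getenv("INDEXER_BATCH_SIZE", "50"))

print(f"embedding on {EMBED_DEVICE} with {EMBED_MODEL}, "
      f"OCR={'on' if OCR_ENABLED else 'off'}, batch={BATCH_SIZE}")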

Frontend API Base URL (new)
# Point the frontend to a non-default backend origin if needed
VITE_API_BASE_URL=http://localhost:8000

Shared Filesystem Mount: Host vs Container paths
  • In Docker, your host folder (e.g., C:\Users\<you>\MCP Test) is mounted into the container at /source.
  • Backend APIs and the indexer always use /source inside the container to access files.
  • The validator in config/env_config.py now treats /source as authoritative in Docker, so a Windows path string in .env will no longer trigger a false warning.

Example docker-compose mapping:

volumes:
  - ${SHARED_FS_PATH:-./shared-fs}:/source  # host:container

Keep SHARED_FS_PATH pointing at your host folder. Inside containers, the code uses /source.
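
To make the host/container distinction concrete, a hypothetical helper that translates a Windows host path under SHARED_FS_PATH into its /source equivalent:

# hypothetical helper for the mount mapping described above
from pathlib import PureWindowsPath

def to_container_path(host_path: str, shared_fs: str) -> str:
    """Translate a host path under SHARED_FS_PATH to the in-container /source path."""
    rel = PureWindowsPath(host_path).relative_to(PureWindowsPath(shared_fs))
    return "/source/" + "/".join(rel.parts)

print(to_container_path(r"C:\Users\me\MCP Test\projects\a.md", r"C:\Users\me\MCP Test"))
# -> /source/projects/a.md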


6. Technology Stack

| Category | Technology | Purpose | Status |
|----------|------------|---------|--------|
| Containerization | Docker Compose | 4-service orchestration: backend, frontend, indexer, qdrant | ✅ Phase 4B M1 |
| Backend | FastAPI | MCP/API endpoints and Query Engine with search tools | ✅ Phase 4A |
| Frontend | React, Vite | Web UI with indexer dashboard and controls | ✅ Phase 4A |
| Indexer | Python/Watchdog | Background file monitoring and job processing | ✅ Phase 4A |
| Application DB | SQLite (WAL) | Concurrent storage: workspaces, permissions, files, jobs, chunks | ✅ Phase 4B M1 |
| Full-Text Search | SQLite FTS5 | Trigram tokenizer for typo-tolerant keyword search | ✅ Phase 4B M1 |
| Job Queue | SQLite | Crash-resilient job queue with atomic claiming | ✅ Phase 4A |
| Vector Database | Qdrant v1.7.4 | High-performance storage and retrieval of vector embeddings | ✅ Phase 4B M1 |
| RAG Framework | LlamaIndex | Core toolkit for data ingestion, indexing, and querying | ✅ Phase 4B M1 |
| Embedding Model | paraphrase-multilingual-MiniLM-L12-v2 | CPU-based multilingual model for semantic search | 📋 Phase 4B M2+ |
| OCR Engine | Tesseract | Extracts text from images and scanned documents | 📋 Phase 4B M3+ |

7. Project Structure

MCPFileServer/
├── 📁 backend/                  # Python FastAPI backend (Query Engine)
├── 📁 frontend/                 # React TypeScript frontend (UI)
├── 📁 indexer/                  # Python background service for indexing
├── 📁 config/                   # Global configuration
├── 📁 data/                     # SQLite database (gitignored)
├── 📁 docs/                     # Phase documentation and guides
├── 📁 scripts/                  # Testing and utility scripts
├── 📄 docker-compose.yml        # 4-service orchestration: backend, frontend, indexer, qdrant
├── 📄 chatgpt_proxy.py          # MCP compatibility proxy for ChatGPT Desktop
├── 📄 .env                      # Environment variables
├── 📄 README.md                 # This file
└── 📄 CLAUDE.md                 # Project context for Claude

8. Monitoring and Debugging

8.1. Log Access

# View all container logs
docker-compose logs

# Follow logs for a specific service in real-time
docker-compose logs -f backend
docker-compose logs -f frontend
docker-compose logs -f indexer

8.2. Service Health Monitoring

Indexer Service Status:

# Check indexer health
curl http://localhost:8002/live    # Liveness check
curl http://localhost:8002/ready   # Readiness check with database validation

# Monitor indexing status
curl http://localhost:8002/status/system   # Overall system status
curl http://localhost:8002/status/jobs     # Job queue statistics
curl http://localhost:8002/status/files    # Indexed file statistics

Backend Service Status:

# Check backend health
curl http://localhost:8000/         # Health check
curl http://localhost:8000/docs     # API documentation

# Database diagnostic (useful for troubleshooting)
curl http://localhost:8000/api/system/db-info

Force Reindex Management:

# Check reindex system status
curl http://localhost:8000/admin/reindex/status

# List active reindex batches
curl -H "X-Admin-Key: admin-secret-key-change-me" \
  http://localhost:8000/admin/reindex/batches

# Get specific batch status
curl -H "X-Admin-Key: admin-secret-key-change-me" \
  http://localhost:8000/admin/reindex/batches/{batch_id}

# Access web UI for visual management
# Navigate to Indexer tab -> Force Reindex dropdown

8.3. MCP Tools Testing

Complete MCP Protocol Validation:

# Test all 8 MCP tools with the comprehensive suite
./scripts/test-mcp-wisdom.sh --comprehensive

# Quick validation
curl -X POST http://localhost:8000/mcp -H "Content-Type: application/json" \
  -d '{"jsonrpc": "2.0", "method": "tools/list", "id": 1}'

8.4. Indexer Dashboard Alerts

The Indexer view in the frontend now mirrors the backend status payload:

  • Surfaces banner alerts whenever the service is stopped, unreachable, or a backlog builds up.
  • Shows per-stage queue depth (TEXT_EXTRACT, CHUNK, FTS_INDEX) and highlights failed/dead-letter jobs.
  • The progress bar segments indexed vs. pending files and flags any indexed files that still lack text chunks.
  • Pulls service_error, job_backlog, and integrity_stats from /api/indexer/status, so Docker operators immediately see if Phase 4B pipelines are stalled (see the polling sketch below).

📋 For Phase 4B Development:

  • See docs/Phase4A-Implementation-Summary.md for complete achievement details
  • See docs/Phase4B-Development-Guide.md for semantic search implementation plan
  • See indexer/README.md for comprehensive indexer service documentation

Phase 4A is 100% complete. The system is production-ready and fully documented for Phase 4B semantic search development.