LangExtract MCP Server

A FastMCP server for Google's langextract library. This server enables AI assistants like Claude Code to extract structured information from unstructured text using Large Language Models through an MCP interface.

Overview

LangExtract is a Python library that uses LLMs to extract structured information from text documents while maintaining precise source grounding. This MCP server exposes langextract's capabilities through the Model Context Protocol. The server includes intelligent caching, persistent connections, and server-side credential management to provide optimal performance in long-running environments like Claude Code.

Quick Setup for Claude Code

Prerequisites

  • Claude Code installed and configured
  • Google Gemini API key
  • Python 3.10 or higher

Installation

Install directly into Claude Code using the built-in MCP management:

claude mcp add mcp-langextract -e LANGEXTRACT_API_KEY=your-gemini-api-key -- uv run --with fastmcp fastmcp run src/langextract_mcp/server.py

The server will automatically start and integrate with Claude Code. No additional configuration is required.

Verification

After installation, verify the integration by entering the following command in Claude Code:

/mcp

You should see output indicating that the server is running; selecting the server shows the tools it exposes.

Available Tools

The server provides the following tools for text extraction workflows:

Core Extraction

  • extract_from_text - Extract structured information from provided text (supports templates)
  • extract_from_url - Extract information from web content (supports templates)
  • save_extraction_results - Save results to JSONL format
  • generate_visualization - Create interactive HTML visualizations

Template Management

  • list_extraction_templates - List available extraction templates by domain
  • get_template_details - Get detailed configuration for a specific template
  • validate_template - Validate custom template structure

Template Resources

Templates are exposed as MCP resources for discovery by clients. Available domains include:

  • Finance: M&A deals (ma-deals), Earnings reports (earnings)
  • Medical: Clinical records (medical-records)
  • Academic: Research papers (research-paper)
  • Legal: Contracts and agreements (legal-contract)

For more information, you can check out the resources available to the client under src/langextract_mcp/resources.

Usage Examples

Template-Based Extraction (Recommended)

Use pre-built templates for common extraction tasks:

# Extract M&A deals from financial news
extract_from_text(
    text="Microsoft announced it will acquire GitHub for $7.5 billion in an all-stock transaction",
    template_id="ma-deals"
)

# Extract earnings data with custom temperature
extract_from_text(
    text="Apple reported Q2 revenue of $94.8 billion, up 4.9% year-over-year",
    template_id="earnings",
    temperature=0.1
)

# Extract medical information from clinical notes
extract_from_text(
    text="Patient prescribed 500mg amoxicillin twice daily for bacterial infection",
    template_id="medical-records"
)

Custom Extraction

For specialized needs, define your own extraction patterns:

extract_from_text(
    text="The startup raised $50M Series B led by Acme Ventures",
    prompt_description="Extract fundraising information including company, amount, and investors",
    examples=[
        {
            "text": "TechCorp secured $25M Series A from XYZ Partners",
            "extractions": [
                {
                    "extraction_class": "fundraising",
                    "extraction_text": "TechCorp secured $25M Series A from XYZ Partners",
                    "attributes": {
                        "company": "TechCorp",
                        "amount": "$25M",
                        "round_type": "Series A",
                        "investor": "XYZ Partners"
                    }
                }
            ]
        }
    ]
)

URL Processing with Templates

Extract information directly from web content:

# Extract research findings from academic papers
extract_from_url(
    url="https://arxiv.org/abs/example",
    template_id="research-paper"
)

# Extract earnings data from company press releases
extract_from_url(
    url="https://investor.example.com/earnings",
    template_id="earnings",
    temperature=0.1  # Override template default for more consistent results
)

Template Discovery

Find and explore available templates:

# List all templates
list_extraction_templates()

# List only finance templates
list_extraction_templates(domain="finance")

# Get detailed information about a template
get_template_details("ma-deals")

Template System

The template system provides pre-configured extraction patterns for common domains, enabling consistent and high-quality extractions without manual configuration.

Available Templates

Template ID     | Domain   | Description
ma-deals        | Finance  | Extract M&A deal information (acquirer, target, amount, structure)
earnings        | Finance  | Extract financial metrics from earnings reports (revenue, EPS, growth)
medical-records | Medical  | Extract medical info (medications, diagnoses, procedures, vitals)
research-paper  | Academic | Extract research methodology, results, and citations
legal-contract  | Legal    | Extract contract terms (parties, obligations, payment terms)

Template Benefits

  • Domain Expertise: Templates created by domain experts with optimized prompts and examples
  • Consistent Output: Standardized extraction schemas across similar documents
  • Parameter Optimization: Pre-tuned model settings (temperature, extraction passes, etc.)
  • Extensible: Support for template inheritance and customization
  • Discoverable: Templates exposed as MCP resources for client exploration

Creating Custom Templates

Templates are JSON files with this structure:

{
  "id": "custom-template",
  "version": "1.0.0",
  "name": "Custom Template",
  "description": "Description of what this template extracts",
  "domain": "custom",
  "config": {
    "model_id": "gemini-2.5-pro",
    "temperature": 0.2,
    "extraction_passes": 1
  },
  "prompt_description": "Clear instructions for extraction",
  "examples": [
    {
      "text": "Sample input text",
      "extractions": [
        {
          "extraction_class": "example",
          "extraction_text": "extracted portion",
          "attributes": {
            "key": "value"
          }
        }
      ]
    }
  ]
}
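
Before using a custom template, its structure can be checked with the validate_template tool listed above. The call below is a sketch; the argument name (template) and the shape of the validation response are assumptions rather than documented behavior.

# Check a custom template definition before use (argument name is assumed)
validation = validate_template(
    template={
        "id": "custom-template",
        "version": "1.0.0",
        "name": "Custom Template",
        "description": "Description of what this template extracts",
        "domain": "custom",
        "config": {"model_id": "gemini-2.5-pro", "temperature": 0.2, "extraction_passes": 1},
        "prompt_description": "Clear instructions for extraction",
        "examples": [
            {
                "text": "Sample input text",
                "extractions": [
                    {
                        "extraction_class": "example",
                        "extraction_text": "extracted portion",
                        "attributes": {"key": "value"}
                    }
                ]
            }
        ]
    }
)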

Supported Models

This server currently supports Google Gemini models only; these models are optimized for reliable structured extraction with advanced schema constraints:

  • gemini-2.5-flash - Recommended default - Optimal balance of speed, cost, and quality
  • gemini-2.5-pro - Best for complex reasoning and analysis tasks requiring highest accuracy

The server uses persistent connections, schema caching, and connection pooling for optimal performance with Gemini models. Support for additional providers may be added in future versions.

Configuration Reference

Environment Variables

Configure the MCP server behavior through environment variables:

Single Model Configuration (Basic Setup)

# Use same model for both extraction and sentiment analysis
LANGEXTRACT_MODEL_ID=deepseek-v3:latest        # Primary model
LANGEXTRACT_MODEL_URL=http://localhost:11434   # For local models
LANGEXTRACT_API_KEY=your-gemini-api-key        # For cloud models

Dual Model Configuration (Recommended)

# EXTRACTION MODEL (Stage 1: Fast fact extraction)
LANGEXTRACT_MODEL_ID=phi-4:latest              # Lightweight, fast model
LANGEXTRACT_MODEL_URL=http://localhost:11434   # Ollama URL

# SENTIMENT MODEL (Stage 2: Accurate sentiment analysis)
LANGEXTRACT_SENTIMENT_MODEL_ID=qwen3:32b       # Specialized sentiment model
LANGEXTRACT_SENTIMENT_TEMPERATURE=0.3          # Consistent sentiment analysis
LANGEXTRACT_SENTIMENT_MODEL_URL=http://localhost:11434  # Optional: separate instance

Performance Benefits of Dual Model Setup:

  • 90% faster extraction with a lightweight model (phi-4, gemma-2:9b)
  • 95% sentiment accuracy with a specialized model (qwen3:32b, deepseek-v3)
  • 60% less compute overall vs using a heavy model for both stages

Hardware Requirements:

Configuration            | VRAM Required | Performance
Dual Model (Recommended) | 32GB          | Phi-4: 200 tok/s + Qwen3-32B: 50 tok/s
Single Heavy Model       | 24GB          | Qwen3-32B: 50 tok/s for all operations
Single Light Model       | 8GB           | Phi-4: 200 tok/s, reduced sentiment accuracy

Advanced Configuration:

# Performance tuning
LANGEXTRACT_TIMEOUT=300                         # Request timeout (seconds)
LANGEXTRACT_TEMPERATURE=0.3                     # Extraction temperature
LANGEXTRACT_MAX_CHAR_BUFFER=1500               # Text chunk size
LANGEXTRACT_EXTRACTION_PASSES=1                # Number of passes
LANGEXTRACT_MAX_WORKERS=5                      # Parallel workers
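
To make the fallback behavior concrete, here is a minimal sketch of how these variables might be resolved in Python. It assumes the documented values act as defaults and that sentiment settings fall back to the primary model when unset; the server's actual configuration loading in src/langextract_mcp/server.py may differ.

# A minimal sketch of resolving the variables above with fallbacks (not the server's actual code)
import os

def load_settings() -> dict:
    primary_model = os.getenv("LANGEXTRACT_MODEL_ID", "gemini-2.5-flash")
    return {
        "model_id": primary_model,
        "model_url": os.getenv("LANGEXTRACT_MODEL_URL"),          # only set for local models
        "api_key": os.getenv("LANGEXTRACT_API_KEY"),              # only set for cloud models
        # Assumed: sentiment settings fall back to the primary model when unset
        "sentiment_model_id": os.getenv("LANGEXTRACT_SENTIMENT_MODEL_ID", primary_model),
        "sentiment_temperature": float(os.getenv("LANGEXTRACT_SENTIMENT_TEMPERATURE", "0.3")),
        "timeout": int(os.getenv("LANGEXTRACT_TIMEOUT", "300")),
        "temperature": float(os.getenv("LANGEXTRACT_TEMPERATURE", "0.3")),
        "max_char_buffer": int(os.getenv("LANGEXTRACT_MAX_CHAR_BUFFER", "1500")),
        "extraction_passes": int(os.getenv("LANGEXTRACT_EXTRACTION_PASSES", "1")),
        "max_workers": int(os.getenv("LANGEXTRACT_MAX_WORKERS", "5")),
    }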

Tool Parameters

Configure extraction behavior through tool parameters:

{
    "model_id": "gemini-2.5-flash",     # Language model selection
    "max_char_buffer": 1000,            # Text chunk size
    "temperature": 0.5,                 # Sampling temperature (0.0-1.0)  
    "extraction_passes": 1,             # Number of extraction attempts
    "max_workers": 10                   # Parallel processing threads
}
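
In practice these parameters are supplied per call. The sketch below assumes they are accepted as keyword arguments by extract_from_text, as temperature and template_id are in the examples above; the remaining names are taken from the list but are not confirmed as call-level overrides.

# Per-call overrides passed as keyword arguments (assumed to be accepted by extract_from_text)
extract_from_text(
    text="Microsoft announced it will acquire GitHub for $7.5 billion",
    template_id="ma-deals",
    model_id="gemini-2.5-flash",   # language model selection
    max_char_buffer=1000,          # text chunk size
    temperature=0.5,               # sampling temperature (0.0-1.0)
    extraction_passes=1,           # number of extraction attempts
    max_workers=10                 # parallel processing threads
)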

Output Format

All extractions return consistent structured data:

{
    "document_id": "doc_123",
    "total_extractions": 5,
    "extractions": [
        {
            "extraction_class": "medication", 
            "extraction_text": "amoxicillin",
            "attributes": {"type": "antibiotic"},
            "start_char": 25,
            "end_char": 35
        }
    ],
    "metadata": {
        "model_id": "gemini-2.5-flash",
        "extraction_passes": 1,
        "temperature": 0.5
    }
}
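
Because every tool returns this structure, client code can iterate over it directly. The sketch below assumes the result arrives as a plain Python dictionary with the documented keys.

# Consume the documented result structure (assumes a plain dict with these keys)
result = extract_from_text(
    text="Patient prescribed 500mg amoxicillin twice daily for bacterial infection",
    template_id="medical-records"
)

print(f"{result['total_extractions']} extractions from {result['document_id']}")
for item in result["extractions"]:
    span = (item["start_char"], item["end_char"])   # source-grounding character offsets
    print(item["extraction_class"], item["extraction_text"], item["attributes"], span)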

Use Cases

LangExtract MCP Server supports a wide range of use cases across multiple domains:

  • Healthcare and life sciences: extract medications, dosages, and treatment protocols from clinical notes; structure radiology and pathology reports; process research papers and clinical trial data.
  • Legal and compliance: extract contract terms, parties, and obligations; analyze regulatory documents, compliance reports, and case law.
  • Research and academia: extract methodologies, findings, and citations from papers; analyze survey responses and interview transcripts; process historical or archival materials.
  • Business intelligence: extract insights from customer feedback and reviews; analyze news articles and market reports; process financial documents and earnings reports.

Support and Documentation

📚 Documentation Index:

  • Central hub for all documentation and resources

Getting Started:

  • Essential commands, tool matrix, and performance tips
  • Complete tool documentation with signatures and examples
  • Comprehensive tool usage and workflows

Developer Resources:

  • Extension patterns, architecture, and development setup
  • Core library analysis and implementation details
  • Framework patterns and best practices
  • Model evaluation setup with DeepEval integration

External References: