ThreeFish-AI/data-extractor
Scrapy MCP Server is a robust and stable web scraping MCP Server built on Scrapy and FastMCP, designed for long-term use in commercial environments.
Data Extractor is a commercial-grade MCP Server built on FastMCP, offering robust capabilities to read, extract, and localize (into Markdown) content from web pages and PDFs with both text and images. It is purpose-built for long-term deployment in enterprise environments.
🛠️ MCP Server Core Tools (14)
Web Page
| Tool | Description | Main parameters |
|---|---|---|
| scrape_webpage | Single-page scraping | url, method (auto-selected), extract_config (selector configuration), wait_for_element (CSS selector) |
| scrape_multiple_webpages | Batch page scraping | urls (list), method (one method for all URLs), extract_config (global configuration) |
| scrape_with_stealth | Anti-detection scraping | url, method (selenium/playwright), scroll_page (scroll to load content), wait_for_element |
| fill_and_submit_form | Form automation | url, form_data (selector: value), submit (whether to submit), submit_button_selector |
| extract_links | Specialized link extraction | url, filter_domains (domain filter), exclude_domains (domains to exclude), internal_only (internal links only) |
| extract_structured_data | Structured data extraction | url, data_type (all/contact/social/content/products/addresses) |
| get_page_info | Page information retrieval | url (target URL); returns title, status code, and metadata |
| check_robots_txt | Crawler rule check | url (domain URL); checks robots.txt rules |
| convert_webpage_to_markdown | Page-to-Markdown conversion | url, method, extract_main_content (extract main content), embed_images (embed images), formatting_options |
| batch_convert_webpages_to_markdown | Batch Markdown conversion | urls (list), method, extract_main_content, embed_images, embed_options |
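As a rough illustration of how a client might invoke one of these tools, the sketch below builds an MCP `tools/call` request (JSON-RPC 2.0) for `scrape_webpage`. The argument names come from the table above; the helper `build_tool_call`, the value `"auto"` for `method`, the example URL, and the shape of `extract_config` (field name mapped to a CSS selector) are assumptions made for illustration, not confirmed details of this server.

```python
import json


def build_tool_call(name, arguments, request_id=1):
    """Build an MCP `tools/call` JSON-RPC 2.0 request as a plain dict.

    Hypothetical helper: MCP client libraries normally do this for you.
    """
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


# Argument names are taken from the tool table. The values, and the
# field-name -> CSS-selector shape of extract_config, are assumptions.
request = build_tool_call(
    "scrape_webpage",
    {
        "url": "https://example.com/article",
        "method": "auto",
        "extract_config": {"title": "h1", "body": "article p"},
        "wait_for_element": "article",
    },
)

print(json.dumps(request, indent=2))
```

In practice an MCP client library would send this message over the configured transport; the dict only shows where the tool name and its arguments sit in the request.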
PDF Document
| Tool | Description | Main parameters |
|---|---|---|
| convert_pdf_to_markdown | PDF-to-Markdown conversion | pdf_source (URL or path), method (auto/pymupdf/pypdf), page_range, output_format |
| batch_convert_pdfs_to_markdown | Batch PDF conversion | pdf_sources (list), method, page_range, output_format, include_metadata |
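Because `method` accepts only a fixed set of values, a caller may want to validate arguments before sending them. The sketch below assembles an argument dict for `convert_pdf_to_markdown`. The parameter names and the three method values come from the table above; the helper `pdf_arguments`, the `[start, end]` encoding of `page_range`, and the example URL are assumptions, since the listing does not specify the expected formats.

```python
# Method values listed in the tool table.
ALLOWED_PDF_METHODS = {"auto", "pymupdf", "pypdf"}


def pdf_arguments(pdf_source, method="auto", page_range=None, output_format=None):
    """Assemble arguments for convert_pdf_to_markdown (hypothetical helper).

    Optional parameters are omitted when not given, so the server can
    apply its own defaults.
    """
    if method not in ALLOWED_PDF_METHODS:
        raise ValueError(
            f"unknown method {method!r}; expected one of {sorted(ALLOWED_PDF_METHODS)}"
        )
    arguments = {"pdf_source": pdf_source, "method": method}
    if page_range is not None:
        # Assumption: page_range is passed through as given, e.g. [start, end].
        arguments["page_range"] = page_range
    if output_format is not None:
        arguments["output_format"] = output_format
    return arguments


pdf_args = pdf_arguments(
    "https://example.com/report.pdf", method="pymupdf", page_range=[0, 5]
)
print(pdf_args)
```

Rejecting an unsupported `method` on the client side gives a clearer error than waiting for the server to refuse the call.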
Service Management
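| Tool | Description | Main parameters |
|---|---|---|
| get_server_metrics | Performance metrics monitoring | No parameters; returns request statistics, performance metrics, and cache status |
| clear_cache | Cache management | No parameters; clears all cached data |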
🤝 Contribution
Issues and pull requests that improve this project are welcome.
📄 License
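MIT License. See the license file for details.
Note: Please use this tool responsibly, comply with each website's terms of use and robots.txt rules, and respect the website's intellectual property.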
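For readers who want to check robots.txt rules in their own scripts, here is a minimal sketch using Python's standard `urllib.robotparser`. It is independent of this server's `check_robots_txt` tool, whose internals are not described here; the helper `is_allowed`, the sample rules, and the user agent name `my-crawler` are made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, url):
    """Return True if `user_agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Sample rules for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

blocked = is_allowed(SAMPLE_ROBOTS, "my-crawler", "https://example.com/private/data")
allowed = is_allowed(SAMPLE_ROBOTS, "my-crawler", "https://example.com/public/page")
print(blocked, allowed)
```

Against a live site, `RobotFileParser.set_url()` followed by `read()` would fetch the real file instead of parsing a string.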