data-extractor

ThreeFish-AI/data-extractor

Scrapy MCP Server is a robust and stable web scraping MCP Server built on Scrapy and FastMCP, designed for long-term use in commercial environments.

Data Extractor is a commercial-grade MCP Server built on FastMCP, offering robust capabilities to read, extract, and localize (into Markdown) content from web pages and PDFs with both text and images. It is purpose-built for long-term deployment in enterprise environments.

🛠️ MCP Server Core Tools (14)

Web Page

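| Tool name | Description | Main parameters |
| --- | --- | --- |
| scrape_webpage | Single-page scraping | url, method (auto-selected), extract_config (selector configuration), wait_for_element (CSS selector) |
| scrape_multiple_webpages | Batch page scraping | urls (list), method (one method for all pages), extract_config (global configuration) |
| scrape_with_stealth | Anti-detection scraping | url, method (selenium/playwright), scroll_page (scroll to trigger loading), wait_for_element |
| fill_and_submit_form | Form automation | url, form_data (selector: value), submit (whether to submit), submit_button_selector |
| extract_links | Specialized link extraction | url, filter_domains (domains to include), exclude_domains (domains to exclude), internal_only (internal links only) |
| extract_structured_data | Structured data extraction | url, data_type (all/contact/social/content/products/addresses) |
| get_page_info | Page information retrieval | url (target URL); returns the title, status code, and metadata |
| check_robots_txt | Crawler rules check | url (domain URL); checks the robots.txt rules |
| convert_webpage_to_markdown | Page-to-Markdown conversion | url, method, extract_main_content (extract the main content), embed_images (embed images), formatting_options |
| batch_convert_webpages_to_markdown | Batch Markdown conversion | urls (list), method, extract_main_content, embed_images, embed_options |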
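
As a rough illustration of how a client might invoke one of the tools listed above, the sketch below builds a JSON-RPC 2.0 `tools/call` request, the message shape the MCP specification uses for tool invocation. The tool and parameter names come from the table; the helper `build_tool_call`, the example URL, and the shape of `extract_config` (field name mapped to a CSS selector) are assumptions for illustration, not the server's documented schema.

```python
import json


def build_tool_call(request_id, tool_name, arguments):
    """Build a JSON-RPC 2.0 `tools/call` request (hypothetical helper)."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# Parameter names are taken from the table; every value is a placeholder.
request = build_tool_call(
    1,
    "scrape_webpage",
    {
        "url": "https://example.com/article",
        # Assumed shape: field name -> CSS selector. The real schema may differ.
        "extract_config": {"title": "h1", "body": "article p"},
        "wait_for_element": "article",
    },
)

print(json.dumps(request, indent=2))
```

In practice an MCP client library would normally construct and send this message; the raw payload is shown only to make the parameter layout concrete.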

PDF Document

| Tool name | Description | Main parameters |
| --- | --- | --- |
| convert_pdf_to_markdown | PDF-to-Markdown conversion | pdf_source (URL/path), method (auto/pymupdf/pypdf), page_range, output_format |
| batch_convert_pdfs_to_markdown | Batch PDF conversion | pdf_sources (list), method, page_range, output_format, include_metadata |
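
The `method` parameter of the PDF tools accepts a small fixed set of values, so a caller may want to validate it before sending a request. The sketch below is one way to assemble arguments for `convert_pdf_to_markdown`. The helper `pdf_arguments`, the default of `"auto"`, and the `"markdown"` value for `output_format` are assumptions; the format of `page_range` is not specified in the table, so it is passed through unchanged when supplied.

```python
# Engine names as listed in the table for the `method` parameter.
ALLOWED_PDF_METHODS = {"auto", "pymupdf", "pypdf"}


def pdf_arguments(pdf_source, method="auto", page_range=None, output_format=None):
    """Assemble an arguments dict for convert_pdf_to_markdown (hypothetical helper)."""
    if method not in ALLOWED_PDF_METHODS:
        raise ValueError(
            f"method must be one of {sorted(ALLOWED_PDF_METHODS)}, got {method!r}"
        )
    arguments = {"pdf_source": pdf_source, "method": method}
    # Optional parameters are included only when the caller supplies them.
    if page_range is not None:
        arguments["page_range"] = page_range
    if output_format is not None:
        arguments["output_format"] = output_format
    return arguments


# Placeholder URL; "markdown" is an assumed output_format value.
args = pdf_arguments(
    "https://example.com/report.pdf", method="pymupdf", output_format="markdown"
)
print(args)
```

Rejecting an unknown `method` on the client side gives a clearer error than waiting for the server to refuse the call.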

Service Management

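| Tool name | Description | Main parameters |
| --- | --- | --- |
| get_server_metrics | Performance metrics monitoring | No parameters; returns request statistics, performance metrics, and cache status |
| clear_cache | Cache management | No parameters; clears all cached data |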

🤝 Contribution

Issues and Pull Requests that improve this project are welcome.

📄 License

MIT License - see the license file in the repository for details.


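Note: Please use this tool responsibly. Comply with each website's terms of service and robots.txt rules, and respect its intellectual property rights.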
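
For readers who want to check robots.txt rules on their own before scraping, the sketch below uses Python's standard-library `urllib.robotparser`. It is independent of the server's `check_robots_txt` tool and makes no claim about how that tool is implemented. The robots.txt content, the user agent name `my-crawler`, and the URLs are made-up examples, and the rules are parsed from a string so the example runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content (made up for illustration).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if `user_agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


public_ok = is_allowed(ROBOTS_TXT, "my-crawler", "https://example.com/blog/post")
private_ok = is_allowed(ROBOTS_TXT, "my-crawler", "https://example.com/private/data")
print(public_ok, private_ok)
```

Against a live site, `RobotFileParser.set_url()` followed by `read()` fetches the real robots.txt instead of parsing a string.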