Australian Job Scraper MCP Server

A Model Context Protocol (MCP) server that scrapes job listings from Australian job sites popular with Japanese workers and stores them in Supabase. Features automated daily scraping via GitHub Actions.

🎯 Target Job Sites

Site | URL | Type | Requirements
日豪プレス (Nichigo Press) | nichigopress.jp | Static | ✅ No JS required
JAMS.TV | jams.tv | Static | ✅ No JS required
Adecco Australia | adecco.com.au | Dynamic | 🤖 Puppeteer required
HAYS Australia | hays.com.au | Dynamic | 🤖 Puppeteer required
Recruit Australia | recruitaustralia.com | Static | ✅ No JS required
Total Personnel | workinaus.com.au | Dynamic | 🤖 Puppeteer required
APS Jobs | apsjobs.gov.au | Dynamic | 🤖 Puppeteer required
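
As a rough illustration of what the Type column implies (not the project's actual scraper code), static sites can be fetched with a single HTTP request, while Puppeteer-required sites must be rendered in headless Chrome first:

// Conceptual sketch only; the real scraping logic lives in src/job-scraper.ts.
import puppeteer from "puppeteer";

// Static site: one HTTP request returns the full job listing HTML.
async function fetchStaticHtml(url: string): Promise<string> {
  const response = await fetch(url);
  return response.text();
}

// Dynamic site: render the page in headless Chrome before reading the HTML.
async function fetchDynamicHtml(url: string): Promise<string> {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: "networkidle2" });
    return await page.content();
  } finally {
    await browser.close();
  }
}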

🚀 Quick Start

1. Setup Environment

# Clone repository
git clone https://github.com/ryo-kozin/nihonaustralia-crawler-mcp.git
cd nihonaustralia-crawler-mcp

# Install dependencies
pnpm install

# Install Chrome for Puppeteer
npx puppeteer browsers install chrome

# Copy environment template (only needed for full scraper with database)
cp .env.example .env

2. Test Without Database (Recommended First Step)

# Run test scraper - outputs to console only, no database required
pnpm run test:scrape

This will scrape a few sample jobs from each site and display them in the console, which makes it a good way to verify your setup.

3. Configure Supabase (For Full Production Use)

3.1 Create Supabase Project
  1. Create a new Supabase project at supabase.com
  2. Copy your project URL, anon key, and service role key to .env:
    SUPABASE_URL=https://your-project-ref.supabase.co
    SUPABASE_ANON_KEY=your-anon-key
    SUPABASE_SERVICE_ROLE_KEY=your-service-role-key
    
3.2 Apply Database Schema

⚠️ IMPORTANT: Database Management

This crawler project does not manage its own database migrations. All database schema changes are managed by the main web application project (nihonaustralia-web).

To set up the database:

  1. Navigate to the main web project:

    cd ../nihonaustralia-web
    
  2. Apply migrations from the web project:

    # Link to your Supabase project (if not already linked)
    supabase link --project-ref your-project-ref
    
    # Apply all migrations including scraping tables
    supabase db push
    
  3. The following scraping-related tables will be created:

    • scraping_sites (site configuration)
    • scraping_logs (scraping monitoring)
    • Additional columns in job_posts for scraping integration

Why this setup?

  • Both web and crawler projects use the same Supabase database
  • Schema is managed centrally in nihonaustralia-web to avoid conflicts
  • This ensures data consistency and prevents migration issues

3.3 Verify Schema Setup

After applying the schema, verify these tables exist:

  • job_posts (with new scraping columns: source_type, scraped_from, source_url, approval_status)
  • scraping_sites (site configuration)
  • scraping_logs (scraping monitoring)
  • Views: approved_jobs_with_trust, pending_scraped_jobs

3.4 Schema Features

  • Unified Approach: Scraped jobs are stored in the existing job_posts table alongside manual posts
  • Source Tracking: source_type ('manual' or 'scraped') identifies job origin
  • Approval Workflow: Scraped jobs start with approval_status = 'pending'
  • Duplicate Prevention: Unique constraint on (scraped_from, source_url) for scraped jobs
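
To illustrate how these features fit together, here is a hedged supabase-js sketch of saving a scraped job (table and column names are taken from this README; the project's actual insert logic may differ):

import { createClient } from "@supabase/supabase-js";

// Uses the env vars from .env; this is an illustration, not the project's code.
const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_ROLE_KEY!
);

// Upsert a scraped job: the unique constraint on (scraped_from, source_url)
// means re-scraping the same posting updates it instead of duplicating it.
async function saveScrapedJob(job: { title: string; description: string; sourceUrl: string }) {
  const { error } = await supabase.from("job_posts").upsert(
    {
      title: job.title,
      description: job.description,
      source_type: "scraped",        // identifies job origin
      scraped_from: "nichigopress",  // example site name
      source_url: job.sourceUrl,
      approval_status: "pending",    // scraped jobs await moderation
      scraped_at: new Date().toISOString(),
    },
    { onConflict: "scraped_from,source_url" }
  );
  if (error) throw error;
}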

4. Run Full Scraper (With Database)

# Run development scraper for all sites (requires Supabase)
pnpm run scrape

# Run for specific sites only
pnpm run scrape nichigopress jams

# Build and run production version
pnpm run scrape:build

5. Run Local JSON Output (No Database Required)

For generating JSON files for ChatGPT processing without database setup:

# Generate JSON files in scraped-jobs/ directory
pnpm run scrape:local

# Run for specific sites only
pnpm run scrape:local nichigopress jams

# Build and run local JSON version
pnpm run scrape:local:build

Features of Local JSON Mode:

  • ✅ No Supabase setup required
  • ✅ Generates structured JSON files with complete database schema mapping
  • ✅ Includes processing instructions for ChatGPT
  • ✅ Perfect for data analysis and manual review
  • ✅ Files saved with timestamp: {site-name}_{timestamp}.json
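
To post-process these files programmatically, a minimal sketch (the exact JSON shape is produced by the scraper, so this only assumes each file parses as JSON):

import { readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";

// List every {site-name}_{timestamp}.json file and show its top-level keys.
const dir = "scraped-jobs";
for (const file of readdirSync(dir).filter((f) => f.endsWith(".json"))) {
  const data = JSON.parse(readFileSync(join(dir, file), "utf8"));
  console.log(file, Object.keys(data));
}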

🔧 MCP Server Usage

The scraper can also run as an MCP server for Claude or other MCP clients:

# Start MCP server
pnpm run start

# Available tools:
# - puppeteer_navigate: Navigate to URLs
# - puppeteer_screenshot: Take screenshots
# - puppeteer_click: Click elements
# - puppeteer_hover: Hover over elements
# - puppeteer_fill: Fill input fields
# - puppeteer_select: Select dropdown options
# - puppeteer_evaluate: Execute JavaScript
# - scrape_jobs: Scrape all job sites and save to Supabase
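
For a programmatic client, here is a hedged sketch using the @modelcontextprotocol/sdk TypeScript client (the compiled entry point path build/index.js is an assumption; adjust it and the SDK calls to the versions you have installed):

import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

// Spawn the MCP server over stdio (entry point path is an assumption).
const transport = new StdioClientTransport({
  command: "node",
  args: ["build/index.js"],
});

const client = new Client({ name: "example-client", version: "0.1.0" });
await client.connect(transport);

// List the available tools, then trigger a scrape run.
console.log(await client.listTools());
const result = await client.callTool({ name: "scrape_jobs", arguments: {} });
console.log(result);

await client.close();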

⚙️ GitHub Actions Setup

1. Repository Secrets

Add these secrets to your GitHub repository:

  • SUPABASE_URL: Your Supabase project URL
  • SUPABASE_ANON_KEY: Your Supabase anonymous key
  • SUPABASE_SERVICE_ROLE_KEY: Your Supabase service role key (for bypassing RLS)

2. Automated Scraping

The workflow runs automatically:

  • Daily: 2 AM UTC (1 PM AEDT/12 PM AEST)
  • Manual: Via GitHub Actions "Run workflow" button

3. Monitoring

Check scraping results in:

  • GitHub Actions logs
  • Supabase scraping_logs table
  • Job data in job_posts table (filtered by source_type = 'scraped')
  • Use pending_scraped_jobs view for moderation workflow
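
The same checks can be scripted with supabase-js; a minimal sketch (the created_at column on scraping_logs is an assumption):

import { createClient } from "@supabase/supabase-js";

const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_ROLE_KEY!
);

// Most recent scraping runs (created_at column name is an assumption).
const { data: logs } = await supabase
  .from("scraping_logs")
  .select("*")
  .order("created_at", { ascending: false })
  .limit(10);

// Scraped jobs only, with their moderation status.
const { data: scraped } = await supabase
  .from("job_posts")
  .select("id, title, approval_status")
  .eq("source_type", "scraped");

console.log(logs, scraped);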

📊 Database Schema

Unified Job Posts Table

The system uses a unified approach where both manual and scraped jobs are stored in the job_posts table:

Core Fields:

  • Basic details (title, description, company)
  • Employment info (type, salary, location)
  • Requirements (English level, visa status)
  • Contact information

Scraping Integration Fields:

  • source_type: 'manual' or 'scraped'
  • scraped_from: Source site name (references scraping_sites.site_name)
  • source_url: Original job posting URL
  • approval_status: 'pending', 'approved', or 'rejected'
  • scraped_at: Timestamp of scraping
  • raw_html: Original HTML content for reference
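
As a rough TypeScript view of the scraping-related columns listed above (types are illustrative; the authoritative schema lives in the nihonaustralia-web migrations):

// Illustrative types only; the authoritative schema is defined in nihonaustralia-web.
interface ScrapedJobPostFields {
  source_type: "manual" | "scraped";
  scraped_from: string | null;   // references scraping_sites.site_name
  source_url: string | null;     // original job posting URL
  approval_status: "pending" | "approved" | "rejected";
  scraped_at: string | null;     // ISO timestamp of scraping
  raw_html: string | null;       // original HTML kept for reference
}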

Supporting Tables

Scraping Sites (scraping_sites)

  • Site configuration and status
  • JavaScript requirements
  • Site-specific scraping settings
  • Last scrape timestamps

Scraping Logs (scraping_logs)

  • Scraping performance tracking
  • Success/error rates and messages
  • Job counts (found/new/updated)
  • Execution times

Views

  • approved_jobs_with_trust: Frontend-ready view of approved jobs only
  • pending_scraped_jobs: Scraped jobs awaiting approval in the moderation workflow
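
With supabase-js, these views can be queried directly; a quick sketch:

import { createClient } from "@supabase/supabase-js";

const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_ANON_KEY!);

// Frontend: approved jobs only.
const { data: approved } = await supabase.from("approved_jobs_with_trust").select("*");

// Moderation: scraped jobs still awaiting review.
const { data: pending } = await supabase.from("pending_scraped_jobs").select("*");

console.log(approved?.length, pending?.length);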

🔍 Data Fields Extracted

Field | Description | Supporting Sites
title | Job title | All sites
description | Job description/content | All sites
company_name | Hiring company name | Most sites
employment_type | Full-time/Part-time/Casual/Contract | Most sites
salary_* | Salary information (parsed & raw) | Some sites
location_* | Location details (state/city/suburb) | All sites
english_level | Required English proficiency | Japanese sites
visa_requirements | Visa/citizenship requirements | Some sites
contact_* | Contact information | Most sites

🛠️ Development

Project Structure

src/
├── index.ts          # Main MCP server
├── job-scraper.ts    # Scraping logic for all sites
└── cli-scraper.ts    # Standalone CLI tool

.github/workflows/
└── job-scraper.yml   # GitHub Actions workflow

Building

pnpm run build       # Compile TypeScript
pnpm run clean       # Remove build artifacts

Testing Individual Sites

# Test scraper with console output only (no database required)
pnpm run test:scrape

# Test specific scrapers during development (requires Supabase)
pnpm run scrape nichigopress    # Japanese news site
pnpm run scrape adecco hays     # Recruitment agencies
pnpm run scrape apsjobs         # Government jobs

🔒 Privacy & Ethics

  • Respects robots.txt and site terms of service
  • Uses reasonable request delays to avoid overwhelming servers
  • Collects only publicly available job listing information
  • No personal data collection beyond public contact information
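
As an illustration only (a hypothetical helper, not the project's actual code), the request-delay idea looks like this:

// Hypothetical helper showing a per-request delay between fetches.
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function politeFetch(urls: string[], delayMs = 2000): Promise<string[]> {
  const pages: string[] = [];
  for (const url of urls) {
    pages.push(await (await fetch(url)).text());
    await sleep(delayMs); // wait between requests to avoid overwhelming the server
  }
  return pages;
}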

Troubleshooting

Common Issues

Migration History Conflicts

⚠️ Do not run migrations from this project!

If you encounter database issues:

# Navigate to the web project for all database operations:
cd ../nihonaustralia-web

# Handle migration issues from the web project:
supabase migration repair --status reverted [migration-ids]
supabase db push

Database Connection Issues

# Test database connection:
pnpm run scrape

# Check for error: "relation 'scraping_sites' does not exist"
# → Apply schema migrations first (see section 3.2)

Puppeteer Frame Errors

  • Frame detachment errors are common with dynamic sites
  • These don't affect static site scraping (nichigopress, jams)
  • Check specific site selectors if issues persist
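
If you need to work around transient frame-detachment failures in your own scraper code, one option is a small retry wrapper; a sketch under that assumption (not the project's built-in behaviour):

import { Page } from "puppeteer";

// Retry a navigation a few times; frame detachment on dynamic sites is often transient.
async function gotoWithRetry(page: Page, url: string, attempts = 3): Promise<void> {
  for (let i = 1; i <= attempts; i++) {
    try {
      await page.goto(url, { waitUntil: "networkidle2", timeout: 60_000 });
      return;
    } catch (err) {
      if (i === attempts) throw err;
      console.warn(`Navigation failed (attempt ${i}), retrying:`, err);
    }
  }
}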

Development Tips

Testing Without Database

# Always test selectors first:
pnpm run test:scrape

Checking Schema Status

-- In Supabase SQL Editor, verify tables exist:
SELECT table_name FROM information_schema.tables
WHERE table_schema = 'public' AND table_name IN ('scraping_sites', 'scraping_logs');

-- Check job_posts has new columns:
SELECT column_name FROM information_schema.columns
WHERE table_name = 'job_posts' AND column_name LIKE '%scrap%';

📝 License

ISC License - see LICENSE file for details.

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new scrapers
  4. Submit a pull request

📞 Support

  • Create an issue for bugs or feature requests
  • Check GitHub Actions logs for scraping issues
  • Monitor Supabase logs for database problems