Australian Job Scraper MCP Server

A Model Context Protocol (MCP) server that scrapes job listings from Australian job sites popular with Japanese workers and stores them in Supabase. Features automated daily scraping via GitHub Actions.

🎯 Target Job Sites

Site | URL | Type | Requirements
日豪プレス (Nichigo Press) | nichigopress.jp | Static | ✅ No JS required
JAMS.TV | jams.tv | Static | ✅ No JS required
Adecco Australia | adecco.com.au | Dynamic | 🤖 Puppeteer required
HAYS Australia | hays.com.au | Dynamic | 🤖 Puppeteer required
Recruit Australia | recruitaustralia.com | Static | ✅ No JS required
Total Personnel | workinaus.com.au | Dynamic | 🤖 Puppeteer required
APS Jobs | apsjobs.gov.au | Dynamic | 🤖 Puppeteer required
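
As a rough illustration of what the Type column implies (not the project's actual scraper code), static sites can be fetched with a single HTTP request, while Puppeteer-required sites must be rendered in headless Chrome first:

// Conceptual sketch only; the real scraping logic lives in src/job-scraper.ts.
import puppeteer from "puppeteer";

// Static site: one HTTP request returns the full job listing HTML.
async function fetchStaticHtml(url: string): Promise<string> {
  const response = await fetch(url);
  return response.text();
}

// Dynamic site: render the page in headless Chrome before reading the HTML.
async function fetchDynamicHtml(url: string): Promise<string> {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: "networkidle2" });
    return await page.content();
  } finally {
    await browser.close();
  }
}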

🚀 Quick Start

1. Setup Environment

# Clone repository
git clone https://github.com/ryo-kozin/nihonaustralia-crawler-mcp.git
cd nihonaustralia-crawler-mcp

# Install dependencies
pnpm install

# Install Chrome for Puppeteer
npx puppeteer browsers install chrome

# Copy environment template (only needed for full scraper with database)
cp .env.example .env

2. Test Without Database (Recommended First Step)

# Run test scraper - outputs to console only, no database required
pnpm run test:scrape

This will scrape a few sample jobs from each site and display them in the console, which makes it a good way to verify your setup.

3. Configure Supabase (For Full Production Use)

3.1 Create Supabase Project
  1. Create a new Supabase project at supabase.com
  2. Copy your project URL, anon key, and service role key to .env:
    SUPABASE_URL=https://your-project-ref.supabase.co
    SUPABASE_ANON_KEY=your-anon-key
    SUPABASE_SERVICE_ROLE_KEY=your-service-role-key
    
3.2 Apply Database Schema

⚠️ IMPORTANT: Database Management

This crawler project does not manage its own database migrations. All database schema changes are managed by the main web application project (nihonaustralia-web).

To set up the database:

  1. Navigate to the main web project:

    cd ../nihonaustralia-web
    
  2. Apply migrations from the web project:

    # Link to your Supabase project (if not already linked)
    supabase link --project-ref your-project-ref
    
    # Apply all migrations including scraping tables
    supabase db push
    
  3. The following scraping-related tables will be created:

    • scraping_sites (site configuration)
    • scraping_logs (scraping monitoring)
    • Additional columns in job_posts for scraping integration

Why this setup?

  • Both web and crawler projects use the same Supabase database
  • Schema is managed centrally in nihonaustralia-web to avoid conflicts
  • This ensures data consistency and prevents migration issues

3.3 Verify Schema Setup

After applying the schema, verify these tables exist:

  • job_posts (with new scraping columns: source_type, scraped_from, source_url, approval_status)
  • scraping_sites (site configuration)
  • scraping_logs (scraping monitoring)
  • Views: approved_jobs_with_trust, pending_scraped_jobs

3.4 Schema Features

  • Unified Approach: Scraped jobs are stored in the existing job_posts table alongside manual posts
  • Source Tracking: source_type ('manual' or 'scraped') identifies job origin
  • Approval Workflow: Scraped jobs start with approval_status = 'pending'
  • Duplicate Prevention: Unique constraint on (scraped_from, source_url) for scraped jobs
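
To illustrate how these features fit together, here is a hedged supabase-js sketch of saving a scraped job (table and column names are taken from this README; the project's actual insert logic may differ):

import { createClient } from "@supabase/supabase-js";

// Uses the env vars from .env; this is an illustration, not the project's code.
const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_ROLE_KEY!
);

// Upsert a scraped job: the unique constraint on (scraped_from, source_url)
// means re-scraping the same posting updates it instead of duplicating it.
async function saveScrapedJob(job: { title: string; description: string; sourceUrl: string }) {
  const { error } = await supabase.from("job_posts").upsert(
    {
      title: job.title,
      description: job.description,
      source_type: "scraped",        // identifies job origin
      scraped_from: "nichigopress",  // example site name
      source_url: job.sourceUrl,
      approval_status: "pending",    // scraped jobs await moderation
      scraped_at: new Date().toISOString(),
    },
    { onConflict: "scraped_from,source_url" }
  );
  if (error) throw error;
}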

4. Run Full Scraper (With Database)

# Run development scraper for all sites (requires Supabase)
pnpm run scrape

# Run for specific sites only
pnpm run scrape nichigopress jams

# Build and run production version
pnpm run scrape:build

5. Run Local JSON Output (No Database Required)

For generating JSON files for ChatGPT processing without database setup:

# Generate JSON files in scraped-jobs/ directory
pnpm run scrape:local

# Run for specific sites only
pnpm run scrape:local nichigopress jams

# Build and run local JSON version
pnpm run scrape:local:build

Features of Local JSON Mode:

  • ✅ No Supabase setup required
  • ✅ Generates structured JSON files with complete database schema mapping
  • ✅ Includes processing instructions for ChatGPT
  • ✅ Perfect for data analysis and manual review
  • ✅ Files saved with timestamp: {site-name}_{timestamp}.json
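
To post-process these files programmatically, a minimal sketch (the exact JSON shape is produced by the scraper, so this only assumes each file parses as JSON):

import { readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";

// List every {site-name}_{timestamp}.json file and show its top-level keys.
const dir = "scraped-jobs";
for (const file of readdirSync(dir).filter((f) => f.endsWith(".json"))) {
  const data = JSON.parse(readFileSync(join(dir, file), "utf8"));
  console.log(file, Object.keys(data));
}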

🔧 MCP Server Usage

The scraper can also run as an MCP server for Claude or other MCP clients:

# Start MCP server
pnpm run start

# Available tools:
# - puppeteer_navigate: Navigate to URLs
# - puppeteer_screenshot: Take screenshots
# - puppeteer_click: Click elements
# - puppeteer_hover: Hover over elements
# - puppeteer_fill: Fill input fields
# - puppeteer_select: Select dropdown options
# - puppeteer_evaluate: Execute JavaScript
# - scrape_jobs: Scrape all job sites and save to Supabase
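
For a programmatic client, here is a hedged sketch using the @modelcontextprotocol/sdk TypeScript client (the compiled entry point path build/index.js is an assumption; adjust it and the SDK calls to the versions you have installed):

import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

// Spawn the MCP server over stdio (entry point path is an assumption).
const transport = new StdioClientTransport({
  command: "node",
  args: ["build/index.js"],
});

const client = new Client({ name: "example-client", version: "0.1.0" });
await client.connect(transport);

// List the available tools, then trigger a scrape run.
console.log(await client.listTools());
const result = await client.callTool({ name: "scrape_jobs", arguments: {} });
console.log(result);

await client.close();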

⚙️ GitHub Actions Setup

1. Repository Secrets

Add these secrets to your GitHub repository:

  • SUPABASE_URL: Your Supabase project URL
  • SUPABASE_ANON_KEY: Your Supabase anonymous key
  • SUPABASE_SERVICE_ROLE_KEY: Your Supabase service role key (for bypassing RLS)

2. Automated Scraping

The workflow runs automatically:

  • Daily: 2 AM UTC (1 PM AEDT/12 PM AEST)
  • Manual: Via GitHub Actions "Run workflow" button

3. Monitoring

Check scraping results in:

  • GitHub Actions logs
  • Supabase scraping_logs table
  • Job data in job_posts table (filtered by source_type = 'scraped')
  • Use pending_scraped_jobs view for moderation workflow
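
The same checks can be scripted with supabase-js; a minimal sketch (the created_at column on scraping_logs is an assumption):

import { createClient } from "@supabase/supabase-js";

const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_ROLE_KEY!
);

// Most recent scraping runs (created_at column name is an assumption).
const { data: logs } = await supabase
  .from("scraping_logs")
  .select("*")
  .order("created_at", { ascending: false })
  .limit(10);

// Scraped jobs only, with their moderation status.
const { data: scraped } = await supabase
  .from("job_posts")
  .select("id, title, approval_status")
  .eq("source_type", "scraped");

console.log(logs, scraped);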

📊 Database Schema

Unified Job Posts Table

The system uses a unified approach where both manual and scraped jobs are stored in the job_posts table:

Core Fields:

  • Basic details (title, description, company)
  • Employment info (type, salary, location)
  • Requirements (English level, visa status)
  • Contact information

Scraping Integration Fields:

  • source_type: 'manual' or 'scraped'
  • scraped_from: Source site name (references scraping_sites.site_name)
  • source_url: Original job posting URL
  • approval_status: 'pending', 'approved', or 'rejected'
  • scraped_at: Timestamp of scraping
  • raw_html: Original HTML content for reference
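
As a rough TypeScript view of the scraping-related columns listed above (types are illustrative; the authoritative schema lives in the nihonaustralia-web migrations):

// Illustrative types only; the authoritative schema is defined in nihonaustralia-web.
interface ScrapedJobPostFields {
  source_type: "manual" | "scraped";
  scraped_from: string | null;   // references scraping_sites.site_name
  source_url: string | null;     // original job posting URL
  approval_status: "pending" | "approved" | "rejected";
  scraped_at: string | null;     // ISO timestamp of scraping
  raw_html: string | null;       // original HTML kept for reference
}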

Supporting Tables

Scraping Sites (scraping_sites)

  • Site configuration and status
  • JavaScript requirements
  • Site-specific scraping settings
  • Last scrape timestamps

Scraping Logs (scraping_logs)

  • Scraping performance tracking
  • Success/error rates and messages
  • Job counts (found/new/updated)
  • Execution times

Views

  • approved_jobs_with_trust: Frontend-ready view of approved jobs only
  • pending_scraped_jobs: Scraped jobs awaiting approval in the moderation workflow
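
With supabase-js, these views can be queried directly; a quick sketch:

import { createClient } from "@supabase/supabase-js";

const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_ANON_KEY!);

// Frontend: approved jobs only.
const { data: approved } = await supabase.from("approved_jobs_with_trust").select("*");

// Moderation: scraped jobs still awaiting review.
const { data: pending } = await supabase.from("pending_scraped_jobs").select("*");

console.log(approved?.length, pending?.length);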

🔍 Data Fields Extracted

Field | Description | Supporting Sites
title | Job title | All sites
description | Job description/content | All sites
company_name | Hiring company name | Most sites
employment_type | Full-time/Part-time/Casual/Contract | Most sites
salary_* | Salary information (parsed & raw) | Some sites
location_* | Location details (state/city/suburb) | All sites
english_level | Required English proficiency | Japanese sites
visa_requirements | Visa/citizenship requirements | Some sites
contact_* | Contact information | Most sites

🛠️ Development

Project Structure

src/
├── index.ts          # Main MCP server
├── job-scraper.ts    # Scraping logic for all sites
└── cli-scraper.ts    # Standalone CLI tool

.github/workflows/
└── job-scraper.yml   # GitHub Actions workflow

Building

pnpm run build       # Compile TypeScript
pnpm run clean       # Remove build artifacts

Testing Individual Sites

# Test scraper with console output only (no database required)
pnpm run test:scrape

# Test specific scrapers during development (requires Supabase)
pnpm run scrape nichigopress    # Japanese news site
pnpm run scrape adecco hays     # Recruitment agencies
pnpm run scrape apsjobs         # Government jobs

🔒 Privacy & Ethics

  • Respects robots.txt and site terms of service
  • Uses reasonable request delays to avoid overwhelming servers
  • Collects only publicly available job listing information
  • No personal data collection beyond public contact information
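
As an illustration only (a hypothetical helper, not the project's actual code), the request-delay idea looks like this:

// Hypothetical helper showing a per-request delay between fetches.
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function politeFetch(urls: string[], delayMs = 2000): Promise<string[]> {
  const pages: string[] = [];
  for (const url of urls) {
    pages.push(await (await fetch(url)).text());
    await sleep(delayMs); // wait between requests to avoid overwhelming the server
  }
  return pages;
}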

Troubleshooting

Common Issues

Migration History Conflicts

⚠️ Do not run migrations from this project!

If you encounter database issues:

# Navigate to the web project for all database operations:
cd ../nihonaustralia-web

# Handle migration issues from the web project:
supabase migration repair --status reverted [migration-ids]
supabase db push

Database Connection Issues

# Test database connection:
pnpm run scrape

# Check for error: "relation 'scraping_sites' does not exist"
# → Apply schema migrations first (see section 3.2)

Puppeteer Frame Errors

  • Frame detachment errors are common with dynamic sites
  • These don't affect static site scraping (nichigopress, jams)
  • Check specific site selectors if issues persist
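
If you need to work around transient frame-detachment failures in your own scraper code, one option is a small retry wrapper; a sketch under that assumption (not the project's built-in behaviour):

import { Page } from "puppeteer";

// Retry a navigation a few times; frame detachment on dynamic sites is often transient.
async function gotoWithRetry(page: Page, url: string, attempts = 3): Promise<void> {
  for (let i = 1; i <= attempts; i++) {
    try {
      await page.goto(url, { waitUntil: "networkidle2", timeout: 60_000 });
      return;
    } catch (err) {
      if (i === attempts) throw err;
      console.warn(`Navigation failed (attempt ${i}), retrying:`, err);
    }
  }
}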

Development Tips

Testing Without Database

# Always test selectors first:
pnpm run test:scrape

Checking Schema Status

-- In Supabase SQL Editor, verify tables exist:
SELECT table_name FROM information_schema.tables
WHERE table_schema = 'public' AND table_name IN ('scraping_sites', 'scraping_logs');

-- Check job_posts has new columns:
SELECT column_name FROM information_schema.columns
WHERE table_name = 'job_posts' AND column_name LIKE '%scrap%';

📝 License

ISC License - see LICENSE file for details.

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new scrapers
  4. Submit a pull request

📞 Support

  • Create an issue for bugs or feature requests
  • Check GitHub Actions logs for scraping issues
  • Monitor Supabase logs for database problems