Australian Job Scraper MCP Server
A Model Context Protocol (MCP) server that scrapes job listings from Australian job sites popular with Japanese workers and stores them in Supabase. Features automated daily scraping via GitHub Actions.
🎯 Target Job Sites
| Site | URL | Type | Requirements |
|---|---|---|---|
| 日豪プレス (Nichigo Press) | nichigopress.jp | Static | ✅ No JS required |
| JAMS.TV | jams.tv | Static | ✅ No JS required |
| Adecco Australia | adecco.com.au | Dynamic | 🤖 Puppeteer required |
| HAYS Australia | hays.com.au | Dynamic | 🤖 Puppeteer required |
| Recruit Australia | recruitaustralia.com | Static | ✅ No JS required |
| Total Personnel | workinaus.com.au | Dynamic | 🤖 Puppeteer required |
| APS Jobs | apsjobs.gov.au | Dynamic | 🤖 Puppeteer required |
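The Static/Dynamic distinction above determines the scraping strategy: static sites can be fetched and parsed as plain HTML, while dynamic sites need a headless browser to render JavaScript first. The following is a rough TypeScript sketch of the two approaches; the real selectors and parsing logic live in `src/job-scraper.ts` and will differ, and the `.job-title` selector here is a placeholder.

```typescript
import puppeteer from "puppeteer";

// Dynamic site: render the page with a headless browser, then read the DOM.
async function scrapeDynamic(url: string): Promise<string[]> {
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: "networkidle2" });
    // ".job-title" is an illustrative selector, not the one the project uses.
    return await page.$$eval(".job-title", (els) => els.map((el) => el.textContent ?? ""));
  } finally {
    await browser.close();
  }
}

// Static site: a plain HTTP fetch is enough; the HTML can be parsed directly.
async function scrapeStatic(url: string): Promise<string> {
  const res = await fetch(url);
  return res.text();
}
```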
🚀 Quick Start
1. Setup Environment
# Clone repository
git clone https://github.com/ryo-kozin/nihonaustralia-crawler-mcp.git
cd nihonaustralia-crawler-mcp
# Install dependencies
pnpm install
# Install Chrome for Puppeteer
npx puppeteer browsers install chrome
# Copy environment template (only needed for full scraper with database)
cp .env.example .env
2. Test Without Database (Recommended First Step)
# Run test scraper - outputs to console only, no database required
pnpm run test:scrape
This scrapes a few sample jobs from each site and displays them in the console, which makes it a quick way to verify your setup.
3. Configure Supabase (For Full Production Use)
3.1 Create Supabase Project
- Create a new Supabase project at supabase.com
- Copy your project URL, anon key, and service role key into .env:
SUPABASE_URL=https://your-project-ref.supabase.co
SUPABASE_ANON_KEY=your-anon-key
SUPABASE_SERVICE_ROLE_KEY=your-service-role-key
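For reference, this is a minimal sketch of how a scraper script might initialize its Supabase client from those variables, assuming the `@supabase/supabase-js` and `dotenv` packages; the project's actual initialization code may be structured differently.

```typescript
import "dotenv/config"; // loads .env into process.env
import { createClient } from "@supabase/supabase-js";

// The service role key is used so scraper inserts bypass Row Level Security.
const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_ROLE_KEY!
);
```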
3.2 Apply Database Schema
⚠️ IMPORTANT: Database Management
This crawler project does not manage its own database migrations. All database schema changes are managed by the main web application project (nihonaustralia-web).
To set up the database:
- Navigate to the main web project:
cd ../nihonaustralia-web
- Apply migrations from the web project:
# Link to your Supabase project (if not already linked)
supabase link --project-ref your-project-ref
# Apply all migrations including scraping tables
supabase db push
- The following scraping-related tables will be created:
  - scraping_sites (site configuration)
  - scraping_logs (scraping monitoring)
  - Additional columns in job_posts for scraping integration
Why this setup?
- Both web and crawler projects use the same Supabase database
- Schema is managed centrally in nihonaustralia-web to avoid conflicts
- This ensures data consistency and prevents migration issues
3.3 Verify Schema Setup
After applying the schema, verify these tables exist:
- job_posts (with new scraping columns: source_type, scraped_from, source_url, approval_status)
- scraping_sites (site configuration)
- scraping_logs (scraping monitoring)
- Views: approved_jobs_with_trust, pending_scraped_jobs
3.4 Schema Features
- Unified Approach: Scraped jobs are stored in the existing job_posts table alongside manual posts
- Source Tracking: source_type ('manual' or 'scraped') identifies job origin
- Approval Workflow: Scraped jobs start with approval_status = 'pending'
- Duplicate Prevention: Unique constraint on (scraped_from, source_url) for scraped jobs (see the sketch below)
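To illustrate how the duplicate-prevention constraint can be used, here is a minimal upsert sketch with supabase-js. The column names come from the schema above; the `ScrapedJob` interface and `saveScrapedJob` function are illustrative, not the crawler's actual code.

```typescript
import { SupabaseClient } from "@supabase/supabase-js";

// Hypothetical shape of one scraped listing; field names mirror the schema above.
interface ScrapedJob {
  title: string;
  description: string;
  sourceUrl: string;
}

// Insert-or-update a scraped job, relying on the unique
// (scraped_from, source_url) constraint to avoid duplicates.
async function saveScrapedJob(supabase: SupabaseClient, siteName: string, job: ScrapedJob) {
  const { error } = await supabase.from("job_posts").upsert(
    {
      title: job.title,
      description: job.description,
      source_type: "scraped",
      scraped_from: siteName,
      source_url: job.sourceUrl,
      approval_status: "pending",
      scraped_at: new Date().toISOString(),
    },
    { onConflict: "scraped_from,source_url" }
  );
  if (error) throw new Error(`Failed to save job: ${error.message}`);
}
```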
4. Run Full Scraper (With Database)
# Run development scraper for all sites (requires Supabase)
pnpm run scrape
# Run for specific sites only
pnpm run scrape nichigopress jams
# Build and run production version
pnpm run scrape:build
5. Run Local JSON Output (No Database Required)
For generating JSON files for ChatGPT processing without database setup:
# Generate JSON files in scraped-jobs/ directory
pnpm run scrape:local
# Run for specific sites only
pnpm run scrape:local nichigopress jams
# Build and run local JSON version
pnpm run scrape:local:build
Features of Local JSON Mode:
- ✅ No Supabase setup required
- ✅ Generates structured JSON files with complete database schema mapping
- ✅ Includes processing instructions for ChatGPT
- ✅ Perfect for data analysis and manual review
- ✅ Files saved with timestamp: {site-name}_{timestamp}.json
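As a starting point for working with that output, here is a minimal sketch that loads every generated file from scraped-jobs/. The exact JSON structure is defined by the scraper, so inspect one file before relying on specific fields.

```typescript
import { readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";

// Load every {site-name}_{timestamp}.json file produced by `pnpm run scrape:local`.
const dir = "scraped-jobs";
for (const name of readdirSync(dir).filter((f) => f.endsWith(".json"))) {
  const data = JSON.parse(readFileSync(join(dir, name), "utf8"));
  // The concrete shape depends on the scraper's output; log top-level keys to inspect it.
  console.log(name, Object.keys(data));
}
```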
🔧 MCP Server Usage
The server can be used as an MCP server with Claude or other MCP clients:
# Start MCP server
pnpm run start
# Available tools:
# - puppeteer_navigate: Navigate to URLs
# - puppeteer_screenshot: Take screenshots
# - puppeteer_click: Click elements
# - puppeteer_hover: Hover over elements
# - puppeteer_fill: Fill input fields
# - puppeteer_select: Select dropdown options
# - puppeteer_evaluate: Execute JavaScript
# - scrape_jobs: Scrape all job sites and save to Supabase
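As an example of driving these tools programmatically, the sketch below spawns the server over stdio and calls scrape_jobs using a recent version of the official TypeScript SDK (@modelcontextprotocol/sdk). The client name and version are illustrative, and the SDK is assumed to be installed.

```typescript
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

// Spawn the MCP server via `pnpm run start` and talk to it over stdio.
const transport = new StdioClientTransport({ command: "pnpm", args: ["run", "start"] });
const client = new Client({ name: "example-client", version: "0.0.1" }, { capabilities: {} });
await client.connect(transport);

// List the tools the server exposes, then trigger a scraping run.
console.log(await client.listTools());
const result = await client.callTool({ name: "scrape_jobs", arguments: {} });
console.log(result);
```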
⚙️ GitHub Actions Setup
1. Repository Secrets
Add these secrets to your GitHub repository:
- SUPABASE_URL: Your Supabase project URL
- SUPABASE_ANON_KEY: Your Supabase anonymous key
- SUPABASE_SERVICE_ROLE_KEY: Your Supabase service role key (for bypassing RLS)
2. Automated Scraping
The workflow runs automatically:
- Daily: 2 AM UTC (1 PM AEDT/12 PM AEST)
- Manual: Via GitHub Actions "Run workflow" button
3. Monitoring
Check scraping results in:
- GitHub Actions logs
- Supabase scraping_logs table
- Job data in the job_posts table (filtered by source_type = 'scraped')
- The pending_scraped_jobs view for the moderation workflow
📊 Database Schema
Unified Job Posts Table
The system uses a unified approach where both manual and scraped jobs are stored in the job_posts table:
Core Fields:
- Basic details (title, description, company)
- Employment info (type, salary, location)
- Requirements (English level, visa status)
- Contact information
Scraping Integration Fields:
- source_type: 'manual' or 'scraped'
- scraped_from: Source site name (references scraping_sites.site_name)
- source_url: Original job posting URL
- approval_status: 'pending', 'approved', or 'rejected'
- scraped_at: Timestamp of scraping
- raw_html: Original HTML content for reference
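A hedged TypeScript view of those scraping-integration fields as they might appear on a job_posts row. The names mirror the columns listed above; the nullability shown here is an assumption (manual posts presumably leave the scraping fields empty), and the authoritative types come from the Supabase schema itself.

```typescript
// Sketch of the scraping-related columns on a job_posts row.
type SourceType = "manual" | "scraped";
type ApprovalStatus = "pending" | "approved" | "rejected";

interface JobPostScrapingFields {
  source_type: SourceType;
  scraped_from: string | null; // references scraping_sites.site_name
  source_url: string | null;   // original job posting URL
  approval_status: ApprovalStatus;
  scraped_at: string | null;   // ISO timestamp of the scrape
  raw_html: string | null;     // original HTML kept for reference
}
```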
Supporting Tables
Scraping Sites (scraping_sites)
- Site configuration and status
- JavaScript requirements
- Site-specific scraping settings
- Last scrape timestamps
Scraping Logs (scraping_logs)
- Scraping performance tracking
- Success/error rates and messages
- Job counts (found/new/updated)
- Execution times
Views
approved_jobs_with_trust: Frontend-ready view of approved jobs only
pending_scraped_jobs: Scraped jobs awaiting approval for moderation workflow
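For reference, a minimal sketch of the moderation workflow against these views using supabase-js. The view, table, and column names come from the schema above; the function itself and the assumption that the view exposes an id column are illustrative.

```typescript
import { SupabaseClient } from "@supabase/supabase-js";

// Fetch scraped jobs awaiting review, then approve the first one.
async function approveFirstPendingJob(supabase: SupabaseClient) {
  const { data: pending, error } = await supabase
    .from("pending_scraped_jobs")
    .select("*")
    .limit(10);
  if (error) throw error;
  if (!pending || pending.length === 0) return;

  // Moderation decision: flip approval_status so the job shows up in
  // approved_jobs_with_trust (the frontend-facing view).
  await supabase
    .from("job_posts")
    .update({ approval_status: "approved" })
    .eq("id", pending[0].id);
}
```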
🔍 Data Fields Extracted
| Field | Description | Sites Supporting |
|---|---|---|
| title | Job title | All sites |
| description | Job description/content | All sites |
| company_name | Hiring company name | Most sites |
| employment_type | Full-time/Part-time/Casual/Contract | Most sites |
| salary_* | Salary information (parsed & raw) | Some sites |
| location_* | Location details (state/city/suburb) | All sites |
| english_level | Required English proficiency | Japanese sites |
| visa_requirements | Visa/citizenship requirements | Some sites |
| contact_* | Contact information | Most sites |
🛠️ Development
Project Structure
src/
├── index.ts # Main MCP server
├── job-scraper.ts # Scraping logic for all sites
└── cli-scraper.ts # Standalone CLI tool
.github/workflows/
└── job-scraper.yml # GitHub Actions workflow
Building
pnpm run build # Compile TypeScript
pnpm run clean # Remove build artifacts
Testing Individual Sites
# Test scraper with console output only (no database required)
pnpm run test:scrape
# Test specific scrapers during development (requires Supabase)
pnpm run scrape nichigopress # Japanese news site
pnpm run scrape adecco hays # Recruitment agencies
pnpm run scrape apsjobs # Government jobs
🔒 Privacy & Ethics
- Respects robots.txt and site terms of service
- Uses reasonable request delays to avoid overwhelming servers
- Collects only publicly available job listing information
- No personal data collection beyond public contact information
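The request delays mentioned above can be as simple as a promise-based sleep between fetches of the same site. A generic sketch follows; the project's actual pacing logic may differ.

```typescript
// Politeness delay between successive requests to the same site.
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function fetchPolitely(urls: string[], delayMs = 2000): Promise<string[]> {
  const pages: string[] = [];
  for (const url of urls) {
    pages.push(await (await fetch(url)).text());
    await sleep(delayMs); // wait before the next request
  }
  return pages;
}
```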
🐛 Troubleshooting
Common Issues
Migration History Conflicts
⚠️ Do not run migrations from this project!
If you encounter database issues:
# Navigate to the web project for all database operations:
cd ../nihonaustralia-web
# Handle migration issues from the web project:
supabase migration repair --status reverted [migration-ids]
supabase db push
Database Connection Issues
# Test database connection:
pnpm run scrape
# Check for error: "relation 'scraping_sites' does not exist"
# → Apply schema migrations first (see section 3.2)
Puppeteer Frame Errors
- Frame detachment errors are common with dynamic sites
- These don't affect static site scraping (nichigopress, jams)
- Check specific site selectors if issues persist
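One common mitigation is to retry a navigation when a frame is detached mid-scrape. The sketch below is illustrative: the retry count and error handling are not necessarily what src/job-scraper.ts does.

```typescript
import puppeteer, { Page } from "puppeteer";

// Retry page.goto a few times, since detached-frame errors are often transient.
async function gotoWithRetry(page: Page, url: string, attempts = 3): Promise<void> {
  for (let i = 1; i <= attempts; i++) {
    try {
      await page.goto(url, { waitUntil: "networkidle2", timeout: 60_000 });
      return;
    } catch (err) {
      if (i === attempts) throw err;
      console.warn(`Navigation failed (attempt ${i}), retrying:`, (err as Error).message);
    }
  }
}
```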
Development Tips
Testing Without Database
# Always test selectors first:
pnpm run test:scrape
Checking Schema Status
-- In Supabase SQL Editor, verify tables exist:
SELECT table_name FROM information_schema.tables
WHERE table_schema = 'public' AND table_name IN ('scraping_sites', 'scraping_logs');
-- Check job_posts has new columns:
SELECT column_name FROM information_schema.columns
WHERE table_name = 'job_posts' AND column_name LIKE '%scrap%';
📝 License
ISC License - see LICENSE file for details.
🤝 Contributing
- Fork the repository
- Create a feature branch
- Add tests for new scrapers
- Submit a pull request
📞 Support
- Create an issue for bugs or feature requests
- Check GitHub Actions logs for scraping issues
- Monitor Supabase logs for database problems