Custom-MCP-server-to-paint-in-Python

sushant097/Custom-MCP-server-to-paint-in-Python

This document provides a structured summary of a Model Context Protocol (MCP) server that lets an Agent (LLM) control a GUI drawing application which has no API of its own.

EAG v2 — Assignment 4 (Agent → MCP → GUI “Paint”)

Make an Agent (LLM) control a GUI drawing app that has no API by calling MCP tools. The Agent must open a paint-like app, draw a rectangle, then write text inside the rectangle — with no manual paint calls in the client code.

This repo contains a Mac-ready solution that uses Preview.app + Markup tools instead of MS Paint.


What’s in this repo

  • server.py – MCP server exposing GUI tools:

    • open_paint() → opens Preview with a blank canvas, sizes the window, and shows the Markup toolbar
    • draw_rectangle(x1, y1, x2, y2) → draws a rectangle via Preview’s Tools → Annotate → Rectangle
    • add_text_in_paint(text) → inserts a Text box and types your text
    • All coordinates are window-relative (origin = top-left of Preview’s front window)
  • talk2mcp.py – MCP client agent:

    • Starts the server via stdio and lists available tools

    • Uses Gemini 2.0 Flash (via google-genai) with a strict system prompt that forces the model to emit exactly one line per step:

      • FUNCTION_CALL: <tool_name>|arg1|arg2|...
      • or FINAL_ANSWER: DONE
    • Parses that single line and calls the MCP tool. No manual GUI calls here — the Agent drives everything.

  • requirements.txt – macOS dependencies (no Windows-only libs):

    mcp
    google-genai
    python-dotenv
    pyautogui
    pyobjc
    pillow
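The one-line protocol is simple to parse. The following is a hypothetical helper, not the repo's talk2mcp.py: it splits a FUNCTION_CALL: line on "|" and then converts the positional string arguments into the named, typed arguments an MCP tool expects. It assumes the tool's inputSchema lists parameters in call order under "properties", as FastMCP-generated schemas usually do.

```python
from __future__ import annotations

from typing import Any

# JSON Schema type name -> converter for the raw string argument (assumption:
# only these three types matter for the paint tools; others fall back to str).
_CONVERTERS = {"integer": int, "number": float, "string": str}


def parse_agent_line(line: str) -> tuple[str, list[str]] | None:
    """Return (tool_name, raw_args) for FUNCTION_CALL, None for FINAL_ANSWER."""
    line = line.strip()
    if line.startswith("FINAL_ANSWER:"):
        return None
    if not line.startswith("FUNCTION_CALL:"):
        raise ValueError(f"Unexpected Agent output: {line!r}")
    # "draw_rectangle|400|250|1100|650" -> "draw_rectangle", ["400", ...]
    name, *raw_args = (part.strip() for part in line[len("FUNCTION_CALL:"):].split("|"))
    if not name:
        raise ValueError("FUNCTION_CALL is missing a tool name")
    return name, raw_args


def coerce_args(input_schema: dict[str, Any], raw_args: list[str]) -> dict[str, Any]:
    """Map positional string args onto the schema's properties, converting types."""
    props = input_schema.get("properties", {})
    if len(raw_args) != len(props):
        raise ValueError(f"Expected {len(props)} args, got {len(raw_args)}")
    args: dict[str, Any] = {}
    for (param, spec), raw in zip(props.items(), raw_args):
        convert = _CONVERTERS.get(spec.get("type", "string"), str)
        args[param] = convert(raw)
    return args
```

Because "|" is the separator, text that itself contains a "|" would be split into extra arguments and rejected by coerce_args; forbidding it in the system prompt is one way around that.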
    

What changed (vs the instructor’s Windows/MS Paint baseline)

  1. OS swap (Windows → macOS)

    • Replaced pywinauto / pywin32 with pyautogui + AppleScript to control Preview.app (works on Apple Silicon).
    • Dropped Windows-only packages; updated requirements.txt accordingly.
  2. Window-relative coordinates

    • The server reads the Preview window bounds and converts window-relative (x, y) to absolute screen coordinates.
    • This keeps tool calls valid wherever the window sits on screen, including multi-monitor setups and different resolutions.
  3. No manual paint calls in the client

    • talk2mcp.py now only forwards the Agent’s FUNCTION_CALL: lines to the server.
    • The Agent itself decides the sequence: open_paint → draw_rectangle → add_text_in_paint → FINAL_ANSWER.
  4. Robust tool activation

    • Preview Markup toolbar is toggled programmatically.
    • Tools are invoked via menu navigation (AppleScript), which is more reliable than keyboard shortcuts across setups.

How it works (flow)

+--------------------+       MCP (stdio)        +------------------+      mac GUI
|  talk2mcp.py       |  <-------------------->  |    server.py     |   (Preview.app)
|  (Agent client)    |                          |   (MCP server)   |
+--------------------+                          +------------------+
        |  list_tools()                                  |
        |----------------------------------------------->|
        |                                                |
        |  prompt LLM (system prompt lists tools)        |
        |                                                |
        |  LLM: "FUNCTION_CALL: open_paint"              |
        |----------------------------------------------->|  open Preview + Markup toolbar
        |  LLM: "FUNCTION_CALL: draw_rectangle|x1|y1|x2|y2"
        |----------------------------------------------->|  drag to draw rectangle
        |  LLM: "FUNCTION_CALL: add_text_in_paint|Hello" |
        |----------------------------------------------->|  place text box + type
        |  LLM: "FINAL_ANSWER: DONE"                     |
        v
     complete
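The loop in the diagram can be sketched as follows. To keep it self-contained, ask_llm and call_tool are hypothetical injected callables standing in for the Gemini request and the MCP session's tool call; the way results are appended to the history is an assumption, not the repo's exact format.

```python
from typing import Callable

# Hypothetical stand-ins: ask_llm(history) returns the model's next line,
# call_tool(name, raw_args) runs an MCP tool and returns its text result.
AskLLM = Callable[[list[str]], str]
CallTool = Callable[[str, list[str]], str]


def run_agent(ask_llm: AskLLM, call_tool: CallTool, max_iterations: int = 6) -> list[str]:
    """Drive the FUNCTION_CALL / FINAL_ANSWER loop and return the transcript."""
    history: list[str] = []
    for _ in range(max_iterations):
        reply = ask_llm(history).strip()
        history.append(reply)
        if reply.startswith("FINAL_ANSWER:"):
            return history
        if not reply.startswith("FUNCTION_CALL:"):
            # Feed the format violation back so the model can correct itself.
            history.append("ERROR: reply must be FUNCTION_CALL or FINAL_ANSWER")
            continue
        name, *args = reply[len("FUNCTION_CALL:"):].strip().split("|")
        name = name.strip()
        result = call_tool(name, [a.strip() for a in args])
        history.append(f"RESULT of {name}: {result}")
    raise RuntimeError(f"No FINAL_ANSWER after {max_iterations} iterations")
```

max_iterations bounds the loop: too low and a model that needs an extra turn never reaches FINAL_ANSWER, too high and a confused model keeps clicking around the screen.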

Setup (macOS)

  1. Create venv & install

    python -m venv venv
    source venv/bin/activate
    pip install -r requirements.txt
    
  2. API key: create a .env file in the project root:

    GEMINI_API_KEY=your_google_api_key_here
    
  3. Permissions (very important)

    • System Settings → Privacy & Security

      • Accessibility → allow your Terminal/iTerm/Python
      • Screen Recording → allow the same
    • Restart your terminal if changes don’t take effect.


Run

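Open two terminals (or tabs). Since talk2mcp.py launches server.py itself over stdio, Terminal B alone runs the full flow; Terminal A is only a quick check that the server starts without errors.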

Terminal A – server

source venv/bin/activate
python server.py

Terminal B – client

source venv/bin/activate
python talk2mcp.py

You should see the Agent emit calls like:

FUNCTION_CALL: open_paint
FUNCTION_CALL: draw_rectangle|400|250|1100|650
FUNCTION_CALL: add_text_in_paint|What is the capital of Nepal?
FINAL_ANSWER: DONE

And in Preview you’ll see a rectangle and your text inside.


Tuning the Agent’s behavior

  • In talk2mcp.py:

    • Set what you want written inside the rectangle:

      user_text = "Hello from the Agent!"
      query = f'Task: Draw a rectangle and write the following text inside it: "{user_text}".'
      
    • The system prompt already enforces the one-line response format and mentions the three tools.

  • If the Agent draws off-canvas:

    • Suggest explicit rectangle coordinates in the query; without a hint the Agent usually, but not always, picks sensible values.
    • You can also add a hint into the query, e.g. “Use coordinates roughly in the center of the window.”
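A hedged sketch of how the system prompt and query might be assembled. The system prompt wording here is an assumption, not the repo's actual prompt, and build_system_prompt / build_query are illustrative names; the query string matches the one shown in talk2mcp.py above, with an optional coordinate hint appended.

```python
from __future__ import annotations


def build_system_prompt(tools: dict[str, str]) -> str:
    """tools maps a call signature such as 'draw_rectangle|x1|y1|x2|y2' to a description."""
    tool_lines = "\n".join(f"- {sig}: {desc}" for sig, desc in tools.items())
    return (
        "You control a drawing app only through these tools:\n"
        f"{tool_lines}\n\n"
        "Reply with exactly ONE line per step, either\n"
        "FUNCTION_CALL: <tool_name>|arg1|arg2|...\n"
        "or\n"
        "FINAL_ANSWER: DONE\n"
        "Do not add any other text."
    )


def build_query(user_text: str, coord_hint: str | None = None) -> str:
    """Task text for the Agent, optionally followed by a placement hint."""
    query = f'Task: Draw a rectangle and write the following text inside it: "{user_text}".'
    if coord_hint:
        query += f" {coord_hint}"
    return query
```

Listing each tool's call signature in the prompt gives the model the exact argument order that the "|"-separated format depends on.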

Coordinates & multi-monitor notes

  • All tool coords are window-relative in server.py.

    • (0,0) is the top-left of Preview’s front window.
    • If your rectangle isn’t visible, try smaller values (e.g., x1=200, y1=150, x2=1000, y2=600).
  • The server sizes the window to a known rectangle (defaults: left=100, top=80, width=1400, height=900) to make this repeatable. You can change those inside open_paint().
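The window-relative scheme amounts to adding the window's top-left corner to every point. A sketch, assuming the bounds come from osascript -e 'tell application "Preview" to get bounds of front window', which prints left, top, right, bottom; WindowBounds, parse_bounds and to_screen are hypothetical names. With the default window (left=100, top=80, width=1400, height=900), the example rectangle from 400,250 to 1100,650 lands on screen at 500,330 to 1200,730.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class WindowBounds:
    left: int
    top: int
    right: int
    bottom: int

    @property
    def width(self) -> int:
        return self.right - self.left

    @property
    def height(self) -> int:
        return self.bottom - self.top


def parse_bounds(osascript_output: str) -> WindowBounds:
    """Parse osascript output such as '100, 80, 1500, 980'."""
    left, top, right, bottom = (int(v) for v in osascript_output.split(","))
    return WindowBounds(left, top, right, bottom)


def to_screen(win: WindowBounds, x: int, y: int) -> tuple[int, int]:
    """Convert window-relative (x, y) to absolute screen coordinates."""
    if not (0 <= x <= win.width and 0 <= y <= win.height):
        # Rejecting off-window points avoids clicking on whatever is behind Preview.
        raise ValueError(f"({x}, {y}) is outside the {win.width}x{win.height} window")
    return win.left + x, win.top + y
```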


Troubleshooting

  • Preview doesn’t respond / clicks do nothing

    • Check Accessibility + Screen Recording permissions.
    • Ensure Preview is frontmost (the server calls activate_app("Preview"), but try clicking Preview once manually).
  • “Text” tool not inserted

    • Menus may be localized. In server.py, the menu click is:

      Tools → Annotate → Text
      

      If your system language differs, replace "Tools", "Annotate", and "Text" with your local menu labels.

  • Gemini errors (auth/model)

    • Confirm .env has a valid GEMINI_API_KEY.
    • Internet required for google-genai.
  • Agent keeps saying DONE without drawing

    • The system prompt forces tool calls, but if it still happens, increase the loop iterations in talk2mcp.py or add a stronger hint to the query, such as “First call open_paint, then draw_rectangle, then add_text_in_paint.”

Quick FAQ

Q: Can I still use MS Paint if I switch to Windows later?

A: Yes. Swap server.py to the Windows version that uses pywinauto and keep talk2mcp.py unchanged (tool names stay the same).

Q: Can the Agent choose coordinates itself?

A: Yes; the prompt encourages it. If it picks poorly, add a coordinate hint or provide a “safe default rectangle” in the prompt.

Q: How do I verify the Agent actually called the tools?

A: Both talk2mcp.py and server.py print logs for each tool call — you’ll see the FUNCTION_CALL: lines and the server’s responses in the consoles.