Custom-MCP-server-to-paint-in-Python by sushant097 - MCP Server

EAG v2 — Assignment 4 (Agent → MCP → GUI “Paint”)

Make an Agent (LLM) control a GUI drawing app that has no API by calling MCP tools. The Agent must open a paint-like app, draw a rectangle, then write text inside the rectangle — with no manual paint calls in the client code.

This repo contains a Mac-ready solution that uses Preview.app + Markup tools instead of MS Paint.

What’s in this repo

server.py – MCP server exposing GUI tools:
- open_paint() → opens Preview with a blank canvas, sizes the window, shows Markup toolbar
- draw_rectangle(x1, y1, x2, y2) → draws a rectangle via Preview’s Tools → Annotate → Rectangle
- add_text_in_paint(text) → inserts a Text box and types your text
- All coordinates are window-relative (origin = top-left of Preview’s front window)
talk2mcp.py – MCP client agent:
- Starts the server via stdio and lists available tools
- Uses Gemini 2.0 Flash (via google-genai) with a strict system prompt that forces the model to emit one line per step:
  - FUNCTION_CALL: <tool_name>|arg1|arg2|...
  - or FINAL_ANSWER: DONE
- Parses that single line and calls the MCP tool. No manual GUI calls here — the Agent drives everything.
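The one-line format above is simple enough to parse with plain string operations. The following is a minimal sketch, not the code from talk2mcp.py; the function name `parse_agent_line` and its return shape are assumptions made for illustration.

```python
from typing import List, Optional, Tuple


def parse_agent_line(line: str) -> Tuple[str, Optional[str], List[str]]:
    """Parse one model reply into (kind, tool_name, args).

    kind is "call" for FUNCTION_CALL lines and "final" for FINAL_ANSWER lines.
    """
    line = line.strip()
    if line.startswith("FINAL_ANSWER:"):
        return "final", None, [line[len("FINAL_ANSWER:"):].strip()]
    if line.startswith("FUNCTION_CALL:"):
        payload = line[len("FUNCTION_CALL:"):].strip()
        # First field is the tool name; the rest are positional arguments.
        name, *args = [part.strip() for part in payload.split("|")]
        if not name:
            raise ValueError("FUNCTION_CALL line has no tool name")
        return "call", name, args
    raise ValueError(f"Unrecognized agent reply: {line!r}")
```

Arguments come back as strings, so numeric ones (the rectangle corners) still need converting before the tool call. Text that itself contains a `|` would be split into several arguments under this scheme.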
requirements.txt – macOS dependencies (no Windows-only libs):
```
mcp
google-genai
python-dotenv
pyautogui
pyobjc
pillow
```

What changed (vs the instructor’s Windows/MS Paint baseline)

OS swap (Windows → macOS)
- Replaced pywinauto / pywin32 with pyautogui + AppleScript to control Preview.app (works on Apple Silicon).
- Dropped Windows-only packages; updated requirements.txt accordingly.
Window-relative coordinates
- The server reads the Preview window bounds and converts window-relative (x, y) to absolute screen coordinates.
- Because every point is resolved against the window's current position, the same coordinates work on multi-monitor setups and at different screen resolutions.
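The conversion itself is an offset by the window's top-left corner. Here is a small sketch of that idea; the helper name `window_to_screen` and the `(left, top, width, height)` tuple layout are assumptions, and server.py may organize this differently.

```python
from typing import Tuple


def window_to_screen(
    x: int, y: int, window_bounds: Tuple[int, int, int, int]
) -> Tuple[int, int]:
    """Convert a window-relative point to absolute screen coordinates.

    window_bounds is (left, top, width, height) of the window on screen.
    """
    left, top, width, height = window_bounds
    if not (0 <= x <= width and 0 <= y <= height):
        raise ValueError(f"({x}, {y}) lies outside the {width}x{height} window")
    # Shift by the window origin; left/top can be negative on a secondary display.
    return left + x, top + y
```

With a window at left=100, top=80, the window-relative point (400, 250) maps to the screen point (500, 330).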
No manual paint calls in the client
- talk2mcp.py now only forwards the Agent’s FUNCTION_CALL: lines to the server.
- The Agent itself decides the sequence: open_paint → draw_rectangle → add_text_in_paint → FINAL_ANSWER.
Robust tool activation
- Preview Markup toolbar is toggled programmatically.
- Tools are invoked via menu navigation (AppleScript), which is more reliable than keyboard shortcuts across setups.
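Menu navigation through AppleScript can be driven from Python by generating a System Events script and passing it to `osascript`. This is a hedged sketch of the approach rather than the code in server.py; `build_menu_click_script` and `click_menu` are hypothetical names.

```python
import subprocess
from typing import Sequence


def build_menu_click_script(app: str, menu_path: Sequence[str]) -> str:
    """Build an AppleScript that clicks a menu item, e.g. Tools > Annotate > Rectangle."""
    if len(menu_path) < 2:
        raise ValueError("menu_path needs a menu bar item and at least one menu item")
    top, *middle, leaf = menu_path
    # AppleScript references read innermost-first, so build from the leaf outward.
    target = f'menu item "{leaf}"'
    for submenu in reversed(middle):
        target += f' of menu 1 of menu item "{submenu}"'
    target += f' of menu 1 of menu bar item "{top}" of menu bar 1'
    return "\n".join([
        f'tell application "{app}" to activate',
        'tell application "System Events"',
        f'    tell process "{app}"',
        f'        click {target}',
        '    end tell',
        'end tell',
    ])


def click_menu(app: str, menu_path: Sequence[str]) -> None:
    """Run the generated script with osascript (macOS only, needs Accessibility access)."""
    subprocess.run(["osascript", "-e", build_menu_click_script(app, menu_path)], check=True)
```

Calling `click_menu("Preview", ["Tools", "Annotate", "Rectangle"])` would select the rectangle tool. The sketch does not escape double quotes in menu labels, so labels containing them would need extra handling.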

How it works (flow)

+--------------------+       MCP (stdio)        +------------------+      mac GUI
|  talk2mcp.py       |  <-------------------->  |    server.py     |   (Preview.app)
|  (Agent client)    |                          |   (MCP server)   |
+--------------------+                          +------------------+
        |  list_tools()                                  |
        |----------------------------------------------->|
        |                                                |
        |  prompt LLM (system prompt lists tools)        |
        |                                                |
        |  LLM: "FUNCTION_CALL: open_paint"              |
        |----------------------------------------------->|  open Preview + Markup toolbar
        |  LLM: "FUNCTION_CALL: draw_rectangle|x1|y1|x2|y2"
        |----------------------------------------------->|  drag to draw rectangle
        |  LLM: "FUNCTION_CALL: add_text_in_paint|Hello" |
        |----------------------------------------------->|  place text box + type
        |  LLM: "FINAL_ANSWER: DONE"                     |
        v
     complete
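
The flow above boils down to a loop: ask the model for one line, dispatch it, feed the result back, and stop at FINAL_ANSWER. The sketch below simulates that loop with a scripted stand-in for the model and plain functions in place of MCP tool calls; `run_agent_loop` and both stubs are illustrative assumptions, not the contents of talk2mcp.py.

```python
from typing import Callable, Dict, List


def run_agent_loop(
    next_reply: Callable[[List[str]], str],
    tools: Dict[str, Callable[..., str]],
    max_steps: int = 6,
) -> List[str]:
    """Ask the model for one line per step and dispatch it until FINAL_ANSWER."""
    history: List[str] = []
    for _ in range(max_steps):
        reply = next_reply(history).strip()
        if reply.startswith("FINAL_ANSWER:"):
            history.append(reply)
            break
        if not reply.startswith("FUNCTION_CALL:"):
            raise ValueError(f"Unexpected reply: {reply!r}")
        name, *args = reply[len("FUNCTION_CALL:"):].strip().split("|")
        result = tools[name](*args)  # the real client would call the MCP tool here
        history.append(f"{reply} -> {result}")
    return history


# Scripted stand-in for the LLM and stub tools, so the loop runs without Gemini or Preview.
scripted = iter([
    "FUNCTION_CALL: open_paint",
    "FUNCTION_CALL: draw_rectangle|400|250|1100|650",
    "FUNCTION_CALL: add_text_in_paint|Hello",
    "FINAL_ANSWER: DONE",
])
stub_tools = {
    "open_paint": lambda: "opened",
    "draw_rectangle": lambda x1, y1, x2, y2: f"rect {x1},{y1},{x2},{y2}",
    "add_text_in_paint": lambda text: f"typed {text}",
}
history = run_agent_loop(lambda _history: next(scripted), stub_tools)
```

Passing the history back to the model on each step is what lets it pick the next tool. The `max_steps` cap keeps a confused model from looping forever.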

Setup (macOS)

Create venv & install

```
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

API key

Create a .env file in the project root:
```
GEMINI_API_KEY=your_google_api_key_here
```
Permissions (very important)
- System Settings → Privacy & Security
  - Accessibility → allow your Terminal/iTerm/Python
  - Screen Recording → allow the same
- Restart your terminal if changes don’t take effect.

Run

Open two terminals (or tabs):

Terminal A – server

```
source venv/bin/activate
python server.py
```

Terminal B – client

```
source venv/bin/activate
python talk2mcp.py
```

You should see the Agent emit calls like:

```
FUNCTION_CALL: open_paint
FUNCTION_CALL: draw_rectangle|400|250|1100|650
FUNCTION_CALL: add_text_in_paint|What is the capital of Nepal?
FINAL_ANSWER: DONE
```

And in Preview you’ll see a rectangle and your text inside.

Tuning the Agent’s behavior

In talk2mcp.py:
- Set what you want written inside the rectangle:
```
user_text = "Hello from the Agent!"
query = f'Task: Draw a rectangle and write the following text inside it: "{user_text}".'
```
- The system prompt already enforces the one-line response format and mentions the three tools.
If the Agent draws off-canvas:
- If your query suggests rectangle coordinates, adjust them (the Agent generally picks sensible values after a successful first run).
- You can also add a hint into the query, e.g. “Use coordinates roughly in the center of the window.”

Coordinates & multi-monitor notes

All tool coords are window-relative in server.py.
- (0,0) is the top-left of Preview’s front window.
- If your rectangle isn’t visible, try smaller values (e.g., x1=200, y1=150, x2=1000, y2=600).
The server sizes the window to a known rectangle (defaults: left=100, top=80, width=1400, height=900) to make this repeatable. You can change those inside open_paint().
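
One way to guard against off-canvas rectangles is to clamp the requested corners to the window size before drawing. This is an optional sketch, not something the README says server.py does; `clamp_rectangle` and the 20-pixel margin are assumptions, with the 1400x900 defaults taken from the window size above.

```python
from typing import Tuple

Rect = Tuple[int, int, int, int]


def clamp_rectangle(rect: Rect, width: int = 1400, height: int = 900, margin: int = 20) -> Rect:
    """Order the corners and pull them inside a width x height window."""
    x1, y1, x2, y2 = rect
    # Accept corners in any order: smaller value becomes left/top.
    left, right = sorted((x1, x2))
    top, bottom = sorted((y1, y2))

    def clamp(value: int, low: int, high: int) -> int:
        return max(low, min(value, high))

    return (
        clamp(left, margin, width - margin),
        clamp(top, margin, height - margin),
        clamp(right, margin, width - margin),
        clamp(bottom, margin, height - margin),
    )
```

The window bounds include Preview's title bar and Markup toolbar, so the drawable canvas is somewhat smaller than the window. A larger top margin may be needed in practice.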

Troubleshooting

Preview doesn’t respond / clicks do nothing
- Check Accessibility + Screen Recording permissions.
- Ensure Preview is frontmost. The server calls activate_app("Preview"), but if that is not enough, click the Preview window once manually.
“Text” tool not inserted
- Menus may be localized. In server.py, the menu click is:
```
Tools → Annotate → Text
```
  If your system language differs, change "Text" to your local menu label.
Gemini errors (auth/model)
- Confirm .env has a valid GEMINI_API_KEY.
- google-genai needs an internet connection to reach the Gemini API.
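To rule out a missing or misnamed key quickly, a standalone check can read the .env file directly. This is a diagnostic sketch using only the standard library; the client itself presumably loads the file through python-dotenv, and `read_env_value` is a hypothetical helper that handles only simple `KEY=value` lines.

```python
from pathlib import Path
from typing import Optional


def read_env_value(env_path: str, key: str) -> Optional[str]:
    """Return the value for key in a simple KEY=value .env file, or None if absent."""
    path = Path(env_path)
    if not path.is_file():
        return None
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blanks, comments, and lines that are not assignments.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        if name.strip() == key:
            # Drop optional surrounding quotes; treat an empty value as missing.
            return value.strip().strip('"').strip("'") or None
    return None


if __name__ == "__main__":
    found = read_env_value(".env", "GEMINI_API_KEY")
    print("GEMINI_API_KEY found" if found else "GEMINI_API_KEY missing or empty")
```

This only confirms that the key is present. Whether the key is valid can only be established by an actual request to the API.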
Agent keeps saying DONE without drawing
- The system prompt forces tool calls, but if it still happens, increase the loop iterations in talk2mcp.py or include a stronger hint in query like “First call open_paint, then draw_rectangle, then add_text_in_paint.”
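Another option is to have the client reject a premature FINAL_ANSWER and tell the model which tool is still outstanding. The sketch below shows that bookkeeping in isolation; it is a possible addition rather than existing behavior, and `missing_tools` and `nudge_for_premature_done` are hypothetical names.

```python
from typing import Iterable, List, Sequence

REQUIRED_ORDER = ("open_paint", "draw_rectangle", "add_text_in_paint")


def missing_tools(called: Iterable[str], required: Sequence[str] = REQUIRED_ORDER) -> List[str]:
    """Return required tools that have not been called yet, in required order."""
    seen = set(called)
    return [name for name in required if name not in seen]


def nudge_for_premature_done(called: Iterable[str]) -> str:
    """Build a follow-up message when the model says DONE too early ("" if nothing is missing)."""
    missing = missing_tools(called)
    if not missing:
        return ""
    return f"You are not done. Next, call {missing[0]} using the FUNCTION_CALL format."
```

When the model emits FINAL_ANSWER and the returned message is non-empty, the client can append it to the conversation and continue the loop instead of exiting.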

Quick FAQ

Q: Can I still use MS Paint if I switch to Windows later?

A: Yes. Swap server.py for the Windows version that uses pywinauto and keep talk2mcp.py unchanged (tool names stay the same).

Q: Can the Agent choose coordinates itself?

A: Yes; the prompt encourages it. If it picks poorly, add a coordinate hint or provide a “safe default rectangle” in the prompt.

Q: How do I verify the Agent actually called the tools?

A: Both talk2mcp.py and server.py print logs for each tool call — you’ll see the FUNCTION_CALL: lines and the server’s responses in the consoles.
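
If you want similar logging around your own tool functions, a small decorator is enough. This is a generic sketch and not the logging code in this repo; `log_tool_call` and the logger name are assumptions. Because the server talks MCP over stdio, log output should go to stderr (the default for `logging.basicConfig`) so it does not mix with protocol messages on stdout.

```python
import functools
import logging
from typing import Any, Callable

logging.basicConfig(level=logging.INFO)  # default handler writes to stderr
logger = logging.getLogger("paint_tools")


def log_tool_call(func: Callable[..., Any]) -> Callable[..., Any]:
    """Log each call to a tool function together with its arguments and result."""

    @functools.wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        logger.info("CALL %s args=%r kwargs=%r", func.__name__, args, kwargs)
        result = func(*args, **kwargs)
        logger.info("RESULT %s -> %r", func.__name__, result)
        return result

    return wrapper


@log_tool_call
def draw_rectangle_stub(x1: int, y1: int, x2: int, y2: int) -> str:
    """Stand-in for a real tool so the decorator can be tried without Preview."""
    return f"Rectangle drawn from ({x1},{y1}) to ({x2},{y2})"
```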