Claude + Mobile via MCP: Giving the Model Hands on a Real Phone
I plugged in a Pixel two months ago, ran one command in Claude Desktop, and watched Claude open Maps and start navigation to my home address from a single-sentence prompt. It was the first time I'd ever seen a language model physically operate a phone. Latency was about two seconds per action; the part that surprised me was the third step, where Claude noticed it had landed on the wrong screen and self-corrected without being asked.
That experience is what this post is about — what "Claude on mobile via MCP" actually means today, what's required underneath to make it work, and why the bottleneck for AI agents has stopped being model size and started being whether the model has hands.
What "Claude mobile MCP" actually refers to
There isn't a product called Claude Mobile MCP. What people are searching for is a way to let Claude — running in Claude Desktop, Claude Code, Cursor, or any other MCP-aware client — control a real Android phone or iOS simulator. MCP (Model Context Protocol) is Anthropic's open standard for exposing tools and data to language models. It's the cleanest way to plug a new capability into Claude without rebuilding the client.
The missing piece has been a server on the MCP side that knows how to drive a phone. That's what Drengr does. It's a single Rust binary that exposes three tools to any MCP client:
- drengr_look: observe the current screen + UI tree
- drengr_do: execute a tap / type / swipe / draw / key event
- drengr_query: read structured data (devices, activity, crashes)
Three verbs. No XPath, no Appium daemon, no fragile selectors. Claude calls drengr_look, gets back a compact text description of what's on screen with numbered elements, decides what to do, and calls drengr_do with the action and target element. Drengr executes against the device through its native channel (ADB on Android, WDA on iOS simulators) and returns a situation report — what changed, what appeared, whether the app crashed — so the next decision starts grounded.
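Under the hood this is plain MCP tool calling. Here's a rough sketch of one observe-then-act exchange, using the TypeScript MCP SDK to talk to the server directly; the tool names are Drengr's, but the argument names ("action", "element") and the exact response shape are assumptions for illustration.

```typescript
// Sketch: drive Drengr over MCP without Claude in the loop.
// Tool names are real; argument names and response shape are assumed.
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

// Launch the same server that `claude mcp add` registers.
const transport = new StdioClientTransport({
  command: "npx",
  args: ["-y", "drengr", "mcp"],
});

const client = new Client(
  { name: "drengr-example", version: "0.1.0" },
  { capabilities: {} },
);
await client.connect(transport);

// Observe: get the current screen as a compact text scene.
const scene = await client.callTool({ name: "drengr_look", arguments: {} });
console.log(scene.content);

// Act: tap one of the numbered elements from that scene
// ("action" and "element" are illustrative argument names).
const report = await client.callTool({
  name: "drengr_do",
  arguments: { action: "tap", element: 3 },
});
console.log(report.content);

await client.close();
```

In practice you never write this yourself; Claude Desktop or Claude Code is the MCP client and makes these calls for you. The sketch is just to show how small the surface area of three verbs really is.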
The bottleneck stopped being the model
A common assumption is that mobile AI assistants are blocked on model capability. They aren't. Current Claude models reason about a screenshot better than most humans can describe one. Open Photos, find a specific picture, attach it to a WhatsApp message — Claude has the visual reasoning to do all of that. What it doesn't have, by default, is a way to actually touch the screen.
That gap is purely a transport problem. iOS sandboxing prevents one app from reaching into another. Android Accessibility Services exist, but they're heavy to set up, scary to grant permission to, and limited in the input they can synthesize. And cloud-only assistants are a dead end for anything physical, for three reasons:
- Latency. A two-second cloud round trip per tap feels broken when you're holding the phone.
- Privacy. Banking apps, health apps, messages — none of that should leave the device for a UI inference.
- Network independence. Subway, airplane, bad hotel wifi.
Once Gemini Nano on Android and Apple Intelligence on iOS are widespread, the model is local. The control plane has to be local too. Drengr is a single static binary; that's not a coincidence. (More on why local matters in The Missing Control Plane for Local AI Agents.)
How to actually try it
Two commands, assuming you have Claude Code installed:
claude mcp add drengr -- npx -y drengr mcp
drengr doctor
The first registers Drengr as an MCP server in Claude Code. The second checks that ADB and Xcode are reachable on your machine and prints any device it can see. Then plug in an Android phone (with USB debugging enabled) or boot an iOS simulator, and ask Claude to do something on it.
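If you're in Claude Desktop rather than Claude Code, there's no claude mcp add command; you register the server by hand in claude_desktop_config.json, Claude Desktop's standard MCP config file. Something along these lines, mirroring the same npx invocation:

```json
{
  "mcpServers": {
    "drengr": {
      "command": "npx",
      "args": ["-y", "drengr", "mcp"]
    }
  }
}
```

Restart Claude Desktop after editing the file so it picks up the new server.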
A few prompts that have worked for me:
- "Open YouTube and search for
m3 ultra benchmarks." - "Open Slack, find the channel called engineering, post the message
merging at 3pm." - "Open Maps and start navigation to home."
Each one becomes an observe-decide-act loop under the hood. Claude calls drengr_look, reasons over the text scene Drengr returns (~300 tokens vs ~100KB for a raw screenshot — see why text-first matters here), decides what to do, calls drengr_do, gets a situation diff back, and repeats until the task is done or it gets stuck.
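The shape of that loop is small enough to sketch. In the outline below, callTool stands in for the MCP plumbing and decideNextAction for the model; both are hypothetical helpers, and the action schema is an illustration, not Drengr's actual one.

```typescript
// Sketch of the observe-decide-act loop an MCP client runs on Claude's behalf.
// callTool and decideNextAction are hypothetical stand-ins, and the Action
// type below is illustrative, not Drengr's real schema.
type Action =
  | { action: "tap"; element: number }
  | { action: "type"; element: number; text: string }
  | { action: "swipe"; direction: "up" | "down" | "left" | "right" }
  | { action: "done" };

declare function callTool(name: string, args: unknown): Promise<string>;
declare function decideNextAction(goal: string, scene: string): Promise<Action>;

async function runTask(goal: string, maxSteps = 20): Promise<void> {
  for (let step = 0; step < maxSteps; step++) {
    // Observe: a compact text scene with numbered elements, not a screenshot.
    const scene = await callTool("drengr_look", {});

    // Decide: the model maps goal + current scene to the next action.
    const next = await decideNextAction(goal, scene);
    if (next.action === "done") return;

    // Act: Drengr executes and returns a situation report (what changed,
    // what appeared, any crash) that grounds the next iteration.
    await callTool("drengr_do", next);
  }
  throw new Error("Step budget exhausted before the task finished");
}
```

The step budget is what keeps "or it gets stuck" from quietly becoming an infinite loop.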
Beyond mobile QA
The obvious early audience for a mobile MCP server is QA — automate the test flows that break every sprint. That market is real but small. The much bigger one is everything else you can build once Claude can touch a phone:
- RPA on mobile. UiPath / Automation Anywhere shops have spent a decade automating desktop workflows. Mobile has been a gap because the existing tooling assumed a developer sat behind it. With an LLM in the loop, the tooling assumption changes.
- Accessibility tools. A low-vision user can ask Claude to operate a banking app on their behalf, with the app running on their own device.
- On-device personal assistants. "Find pictures from last weekend in Photos and attach them to a WhatsApp message to my mom" is a sentence Claude can already plan. The hands-and-eyes layer was missing.
None of those need new model capability. They need a working transport. (That argument in full: AI Can Browse the Web. Why Can't It Tap a Phone?)
Where to start
If you want Claude to control a phone, install Drengr and ask it to do something. The control plane is the part you don't want to build yourself: WDA, ADB, the screen-capture pipeline, the situation diffing, the cross-platform abstraction, and the MCP wiring are all unglamorous infrastructure that's already done.
claude mcp add drengr -- npx -y drengr mcp
drengr doctor
Then point Claude at any Android device or iOS simulator and watch what happens when the model has hands.