
AI Can Browse the Web. Why Can't It Tap a Phone?

8 min read
Engineering · MCP

The Gap Nobody Talks About

Every week there's a new "Show HN" for AI-powered browser testing. Playwright agents, Puppeteer bots, Chrome extensions that turn the DOM into JSON for LLMs. The web automation space is overflowing with AI-native tools.

Then someone asks: "How do I do this on a phone?"

Silence.

The best answer the industry has is Appium — a tool from 2013 that requires you to set up a Selenium grid, write XPath selectors, and maintain brittle element locators that break every time a designer moves a button. Or Espresso/XCTest, which require you to embed test code inside the app itself.

None of these are AI-native. They were built for humans to write scripts, not for LLMs to reason about screens.

Why Mobile Is Harder Than Web

The web has one universal API: the DOM. Every browser exposes the same tree of elements with the same attributes. Playwright reads the DOM, the AI decides what to click, done.

Mobile doesn't have that. Android exposes its accessibility node hierarchy; iOS exposes a separate element tree through XCTest. They return different structures, different attributes, different coordinate systems. Cloud device farms add another layer — now you're talking to a device over Appium's WebDriver protocol, which adds its own abstraction on top.

The result: every mobile testing tool is platform-specific, setup-heavy, and hostile to AI agents that just want to know "what's on screen?" and "tap that button."

What I Built Instead

Drengr is a single Rust binary that sits between the AI and the device. It exposes exactly three tools over the Model Context Protocol (MCP):

  • drengr_look — tells the AI what's on screen, either as an annotated image or a compact text description (~300 tokens instead of a 200KB screenshot)
  • drengr_do — executes an action (tap, type, swipe, long press, scroll, launch app, etc.) and reports back what changed
  • drengr_query — answers questions without touching the screen (is the app crashed? what HTTP calls happened? what's the current activity?)
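To make the shape of these calls concrete, here is a minimal sketch of the JSON-RPC message an MCP client would send to invoke drengr_look. The tools/call envelope is standard MCP; the "mode" argument is a hypothetical parameter for choosing text vs. image output, not confirmed Drengr API.

```python
import json

# Sketch of an MCP tools/call request for drengr_look.
# The envelope (jsonrpc, method, params.name, params.arguments)
# is defined by the MCP spec; the argument itself is assumed.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "drengr_look",
        "arguments": {"mode": "text"},  # hypothetical: text vs. annotated image
    },
}

print(json.dumps(request, indent=2))
```

The client never sees ADB or simctl in this exchange; it only names a tool and passes arguments.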

The AI client — Claude Desktop, Cursor, VS Code, whatever — is the brain. It decides strategy. Drengr is the hands. It handles the platform mess so the AI never has to think about ADB vs simctl vs Appium.

The Thing That Makes It Work: Situation Reports

After every action, Drengr doesn't just say "ok, done." It tells the AI what changed:

{
  "screen_changed": true,
  "new_elements": [12, 15],
  "disappeared_elements": [7],
  "activity_changed": true,
  "crash": false,
  "stuck": false
}

The AI reads this and immediately knows: the screen updated, two new elements appeared, one vanished, we navigated somewhere new, and the app is still alive. No need to take another screenshot and visually diff it.
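A sketch of how an agent might turn that report into a next step without re-screenshotting. The report fields come from the example above; the decision policy itself is hypothetical, not Drengr's.

```python
import json

# Map a Drengr situation report to a next action.
# Field names match the example report; the policy is illustrative.
def next_step(report: dict) -> str:
    if report.get("crash"):
        return "collect_logs"    # app died: stop and diagnose
    if report.get("stuck"):
        return "retry_or_back"   # nothing changed after the action
    if report.get("screen_changed"):
        return "continue"        # act on the new elements directly
    return "look_again"          # no signal: take a fresh look

report = json.loads("""{
  "screen_changed": true,
  "new_elements": [12, 15],
  "disappeared_elements": [7],
  "activity_changed": true,
  "crash": false,
  "stuck": false
}""")

print(next_step(report))  # continue
```

The point is that the branch is taken on a few booleans, not on a pixel diff of two screenshots.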

This is a layer browser testing tools never needed — the DOM hands you change events (mutation observers) for free. On mobile, you have to build that layer yourself. I spent months on it so you don't have to.

A Real Test Looks Like This

app: com.example.app
tasks:
  - name: login
    task: "Log in with [email protected] and password123"
    timeout: 60s
  - name: checkout
    task: "Add headphones to cart and complete purchase"
    timeout: 90s

No selectors. No XPath. No element IDs to maintain. The AI reads the screen, decides what to do, and Drengr executes it. When the UI changes, the YAML doesn't break — because there's nothing brittle in it.

Run with drengr test tests.yml and get human-readable output, JSON, or JUnit XML for your CI pipeline.
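For the CI case, here is a sketch of gating a job on the JUnit XML report. The XML below is a hand-written sample in the standard JUnit shape (testsuite, testcase, failure); Drengr's actual attribute set may differ.

```python
import xml.etree.ElementTree as ET

# Hand-written sample report in conventional JUnit XML shape.
# Test names mirror the YAML tasks above; timings are made up.
sample = """<testsuite name="tests.yml" tests="2" failures="1">
  <testcase name="login" time="41.2"/>
  <testcase name="checkout" time="90.0">
    <failure message="timeout after 90s"/>
  </testcase>
</testsuite>"""

suite = ET.fromstring(sample)

# A testcase failed if it carries a <failure> child element.
failed = [tc.get("name") for tc in suite.iter("testcase")
          if tc.find("failure") is not None]

print(failed)  # ['checkout']
```

Any CI system that understands JUnit XML (Jenkins, GitLab, GitHub Actions via an uploader) can render this without custom glue.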

Why This Gap Exists

Browser testing got AI-native tools early because the DOM is an open, text-friendly format that LLMs can reason about directly. Mobile UIs are visual, proprietary, and locked behind platform-specific APIs that nobody unified.

MCP changes this. It gives AI agents a standard way to connect to tools — and Drengr is the tool that bridges MCP to mobile devices. Android, iOS, simulators, cloud farms — one interface, one binary, one install:

npm install -g drengr

The web got its AI testing moment. Mobile's turn is now.