Field Notes: How Drengr's Architecture Aligns with (and Diverges from) Current Research
I've been an Android engineer for about ten years. I still remember the first time I discovered Espresso. I was genuinely thrilled — here was a framework from Google, deeply integrated with the Android SDK, that could simulate real user behavior and verify UI state. I dove in headfirst. Wrote hundreds of tests. Felt like I was doing engineering the right way.
Then reality set in. Tests that passed locally failed on CI because of animation timing. Tests that worked on a Pixel broke on a Samsung because of slightly different view hierarchies. A designer moved a button into a BottomSheet and forty tests turned red overnight — none of them testing anything related to that button. I spent more time maintaining the test suite than it saved me in bug prevention. And this was Google's own tool, built by one of the most capable engineering organizations on the planet. If Espresso was the best we had, the problem wasn't implementation — it was the entire approach.
I moved through Appium and UIAutomator, and tried Maestro when it came out. Each one was a refinement of the same fundamental idea: match elements by ID or XPath, perform actions, assert state. And each one broke the same way — the moment the UI evolved, the tests fossilized. I've sat in sprint retrospectives where someone says "the UI tests are red again" and everyone nods like it's weather. And the part that quietly frustrated me most: the decision-makers above me — experienced, respected leaders who'd built careers on shipping great mobile products — had accepted these tools as the ceiling. Not out of laziness, but out of familiarity. When every conference talk, every "best practices" blog post, and every Google I/O session tells you Espresso is the answer, questioning it feels like questioning gravity. So the test suites stayed brittle, the teams stayed frustrated, and the leadership stayed confident they were using the best tools available. After a while, you start to wonder whether brittle UI tests are any better than having no tests at all.
That frustration is where Drengr started. Not from a paper. Not from a hackathon. From years of watching test suites rot faster than we could maintain them, and a quiet conviction that AI could do something fundamentally better — tests that understand what they're looking at instead of matching on fragile element IDs.
I started prototyping in late 2024. A simple idea: what if an AI agent could look at a screen, understand what it sees, and interact with the app the way a human would? No hardcoded selectors. No XPath expressions that shatter on the next release. Just "navigate to the settings page and verify the toggle works." If the UI changes, the agent adapts. Self-evolving tests.
Drengr is still early. I'm still figuring things out, still iterating, still learning what works and what doesn't. But recently I took some time to look at what the academic research community has been publishing — and I was surprised to find that researchers at Google, Meta, Microsoft, Tencent, and Princeton have been circling the same problems from different angles. Some of their solutions look like mine. Some are fundamentally different. A few of their insights are already changing how I think about what I'm building.
This post is my attempt to map the territory honestly — where Drengr's early architecture converges with published research, where it diverges, and what I've learned from reading the papers after building the first version of the system.
The Observe-Act Loop: Independent Convergence
Drengr's core architecture is three MCP tools: drengr_look (observe the screen), drengr_do (execute an action), and drengr_query (read structured data). An AI agent calls these in a loop — look at the screen, decide what to do, do it, look again.
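The loop above can be sketched in a few lines. This is an illustrative stand-in, not Drengr's actual tool schema: the `Screen` and `Action` types, the stub bodies, and the toy `decide` function are all hypothetical, standing in for the real MCP payloads and for whatever LLM sits on the other end of the connection.

```rust
// Sketch of the observe-decide-act loop an MCP client runs against
// Drengr's tools. All types and stub bodies here are illustrative.

#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Action {
    Tap(u32),         // tap the element with this on-screen number
    TypeText(String), // type into the focused field
    Done,             // the task is complete
}

#[derive(Debug, Clone)]
struct Screen {
    elements: Vec<String>, // numbered interactive elements, index = ID
}

// Stand-in for the drengr_look MCP call. In reality this returns a
// screenshot plus the accessibility tree.
fn drengr_look(step: usize) -> Screen {
    Screen { elements: vec![format!("element-after-step-{step}")] }
}

// Stand-in for the drengr_do MCP call, which would drive the device.
fn drengr_do(_action: &Action) {}

// A trivial "LLM": keep tapping, stop after three actions.
fn decide(screen: &Screen, step: usize) -> Action {
    if screen.elements.is_empty() || step >= 3 {
        Action::Done
    } else {
        Action::Tap(0)
    }
}

fn run_loop() -> usize {
    let mut steps = 0;
    loop {
        let screen = drengr_look(steps);     // observe
        let action = decide(&screen, steps); // decide (delegated to the LLM)
        if action == Action::Done {
            break;
        }
        drengr_do(&action);                  // act
        steps += 1;
    }
    steps
}

fn main() {
    println!("loop finished after {} actions", run_loop());
}
```

The point of the sketch is the shape, not the stubs: perception and action are dumb and mechanical, and everything interesting happens in `decide`, which Drengr deliberately does not own.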
In late 2023, Zhang et al. at Tencent published AppAgent: Multimodal Agents as Smartphone Users (arXiv:2312.13771). Their system does the same thing — observe the screen, decide, act — but as a Python agent framework. What struck me was their screen annotation approach: they number interactive elements on the screenshot so the LLM can reference them by ID. I'd independently arrived at the same design for Drengr's element numbering system. When two teams solve the same problem the same way without talking to each other, it usually means the solution is natural to the problem space.
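The numbering idea itself is simple enough to show. This is a hypothetical sketch, not Drengr's or AppAgent's actual code: walk the accessibility nodes, keep the interactive ones, and hand out sequential IDs the LLM can refer to ("tap element 2") instead of guessing pixel coordinates. The `Node` fields are invented for illustration.

```rust
// Illustrative sketch of screen annotation: number the interactive
// elements so the LLM can reference them by ID. Field names are
// hypothetical, not a real accessibility-tree schema.

#[derive(Debug)]
struct Node {
    class: &'static str,
    clickable: bool,
}

// Assign 1-based numbers to clickable nodes only.
fn number_interactive(nodes: &[Node]) -> Vec<(u32, &'static str)> {
    nodes
        .iter()
        .filter(|n| n.clickable)
        .enumerate()
        .map(|(i, n)| (i as u32 + 1, n.class))
        .collect()
}

fn main() {
    let tree = vec![
        Node { class: "TextView", clickable: false },
        Node { class: "Button", clickable: true },
        Node { class: "Switch", clickable: true },
    ];
    // Only the Button and the Switch get numbers.
    println!("{:?}", number_interactive(&tree));
}
```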
A month later, Wang et al. published Mobile-Agent (arXiv:2401.16158), taking a purely vision-centric approach — no XML dumps, no accessibility tree, just screenshots plus detection and OCR models. Their finding that you don't need system metadata to navigate apps effectively was an important validation. Drengr deliberately uses both screenshots and the accessibility tree — the tree is faster to parse, costs almost nothing in tokens, and gives precise element bounds that vision models still struggle with. But Mobile-Agent's results are a useful signal: as vision models improve, the tree may become optional, and Drengr's architecture is designed to make that transition seamless when the time is right.
The key difference between Drengr and these systems: they're agent frameworks. Drengr is infrastructure. AppAgent and Mobile-Agent are Python applications that contain both the perception logic and the decision-making. Drengr separates these entirely — it handles perception and action, and delegates all decision-making to whatever LLM is on the other end of the MCP connection. This is a fundamentally different deployment model, and it's what lets Drengr work with Claude Desktop, Cursor, Windsurf, or any other MCP client without modification.
The OODA Loop: Military Theory Meets AI Agents
When I implemented drengr run — the autonomous agent mode — I structured it as an OODA loop: Observe (capture screen), Orient (situation engine analyzes what changed), Decide (LLM picks an action), Act (execute it). I chose OODA because it maps cleanly to the problem. The alternatives — simple while loops, state machines, behavior trees — all felt either too rigid or too unstructured.
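Wired together, the four stages look roughly like this. Everything here is a stand-in: a real Orient stage does far more than compare two frame hashes, and the stage functions and `Situation` type are invented for illustration, not taken from Drengr's source.

```rust
// Minimal sketch of one OODA cycle. All names are illustrative.

#[derive(Debug, PartialEq)]
enum Situation {
    Progressing,
    Stuck,
}

// Observe: stand-in for capturing and hashing a screenshot.
fn observe(frame: u64) -> u64 {
    frame
}

// Orient: compare the new observation against history.
fn orient(prev: u64, curr: u64) -> Situation {
    if prev == curr { Situation::Stuck } else { Situation::Progressing }
}

// Decide: pick a strategy given the situation (the LLM's job in reality).
fn decide(situation: &Situation) -> &'static str {
    match situation {
        Situation::Progressing => "continue-plan",
        Situation::Stuck => "try-recovery",
    }
}

// Act: execute the chosen action on the device.
fn act(action: &str) -> &'static str {
    if action == "try-recovery" { "pressed-back" } else { "tapped-next" }
}

fn one_cycle(prev_frame: u64, curr_frame: u64) -> &'static str {
    let obs = observe(curr_frame);
    let sit = orient(prev_frame, obs);
    act(decide(&sit))
}

fn main() {
    println!("{}", one_cycle(41, 42)); // screen changed: keep going
    println!("{}", one_cycle(42, 42)); // no change: attempt recovery
}
```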
I was genuinely surprised to find that Schneier and Raghavan published Agentic AI's OODA Loop Problem in IEEE Security & Privacy in 2025, analyzing exactly this pattern from a security perspective. Their key insight is that every stage of the OODA loop is a distinct attack surface. Prompt injection corrupts the Observe phase. Data poisoning corrupts Orient. Probabilistic decision-making without output verification corrupts Act. They specifically mention MCP and tool-calling systems as creating compounded vulnerabilities.
Reading this paper directly influenced Drengr's security model. The drengr_look observation phase cross-references the visual screenshot against the accessibility tree — if the two disagree (an element is visible but not in the tree, or vice versa), that inconsistency is surfaced in the situation report. It's not full tamper-evidence yet, but the dual-source design gives Drengr a foundation that purely vision-based systems don't have. Schneier and Raghavan's framing helped me see that as a security property, not just an implementation detail.
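The cross-reference itself is essentially a set comparison. This sketch assumes both sources have been reduced to comparable element labels, which glosses over the hard part (visual grounding); the labels and function name are hypothetical.

```rust
use std::collections::HashSet;

// Sketch of the dual-source consistency check: compare elements
// detected visually against elements reported by the accessibility
// tree, and surface every disagreement instead of silently trusting
// either source. Labels are illustrative.
fn inconsistencies(vision: &[&str], tree: &[&str]) -> Vec<String> {
    let v: HashSet<&str> = vision.iter().copied().collect();
    let t: HashSet<&str> = tree.iter().copied().collect();
    let mut report = Vec::new();
    for e in v.difference(&t) {
        report.push(format!("visible but not in tree: {e}"));
    }
    for e in t.difference(&v) {
        report.push(format!("in tree but not visible: {e}"));
    }
    report.sort(); // deterministic output for the situation report
    report
}

fn main() {
    let vision = ["login_button", "promo_banner"];
    let tree = ["login_button", "hidden_webview"];
    for line in inconsistencies(&vision, &tree) {
        println!("{line}");
    }
}
```

An agreement between the two sources proves nothing by itself, but a disagreement is always worth surfacing: it is exactly the signature you would expect from an overlay attack, a rendering bug, or a poisoned observation.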
More recently, Yasuno published RAPTOR-AI for Disaster OODA Loop (arXiv:2602.00030) in early 2026, applying the OODA pattern to disaster response with entropy-aware strategy selection. The concept of adjusting confidence thresholds based on situational entropy maps directly to what Drengr's situation engine does — detecting when the screen hasn't changed (stuck detection), when the app has crashed, or when the agent is in an unfamiliar state.
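Stuck detection in particular reduces to a small amount of state. A hedged sketch: hash each observation and count consecutive unchanged frames against a threshold. The threshold of 3 and the type names are arbitrary illustrative choices, not Drengr's actual configuration.

```rust
// Sketch of stuck detection: count how many consecutive observations
// produced an identical screen hash. All values are illustrative.
struct StuckDetector {
    last_hash: Option<u64>,
    unchanged: u32,
    threshold: u32,
}

impl StuckDetector {
    fn new(threshold: u32) -> Self {
        Self { last_hash: None, unchanged: 0, threshold }
    }

    /// Feed one screen hash; returns true once the screen has been
    /// identical for `threshold` consecutive observations.
    fn observe(&mut self, hash: u64) -> bool {
        if self.last_hash == Some(hash) {
            self.unchanged += 1;
        } else {
            self.unchanged = 0;
        }
        self.last_hash = Some(hash);
        self.unchanged >= self.threshold
    }
}

// Helper: index of the first observation that trips the detector.
fn first_stuck_index(frames: &[u64], threshold: u32) -> Option<usize> {
    let mut d = StuckDetector::new(threshold);
    frames.iter().position(|&f| d.observe(f))
}

fn main() {
    // The screen freezes on hash 2; the detector fires on the frame
    // that makes it three consecutive repeats.
    println!("{:?}", first_stuck_index(&[1, 2, 2, 2, 2], 3));
}
```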
BFS App Exploration: An Old Idea, Reimagined
Drengr's drengr explore mode does BFS traversal of an app — systematically tapping every interactive element, recording the resulting screens, and building a navigation graph. I built this because I needed a way to map unfamiliar apps before writing test suites for them.
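Stripped of the device interaction, the traversal is textbook BFS over a navigation graph. In this sketch the app is pre-modeled as a map from screen to (element, destination) edges; the real system discovers those edges by tapping elements and observing where it lands, and every name here is invented for illustration.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

type Screen = &'static str;

// Sketch of BFS exploration: visit every reachable screen once,
// breadth-first, recording the order of discovery. The `app` map
// stands in for edges discovered live on a device.
fn explore(
    start: Screen,
    app: &HashMap<Screen, Vec<(Screen, Screen)>>, // (element, destination)
) -> Vec<Screen> {
    let mut visited = HashSet::new();
    let mut order = Vec::new();
    let mut queue = VecDeque::new();
    visited.insert(start);
    queue.push_back(start);
    while let Some(screen) = queue.pop_front() {
        order.push(screen);
        for &(_element, dest) in app.get(screen).into_iter().flatten() {
            if visited.insert(dest) {
                queue.push_back(dest);
            }
        }
    }
    order
}

// A toy four-screen app, purely for demonstration.
fn demo() -> Vec<Screen> {
    let mut app = HashMap::new();
    app.insert("home", vec![("settings_btn", "settings"), ("profile_btn", "profile")]);
    app.insert("settings", vec![("back", "home")]);
    app.insert("profile", vec![("edit", "edit_profile")]);
    explore("home", &app)
}

fn main() {
    println!("{:?}", demo());
}
```

The interesting engineering is in what the sketch omits: deciding which taps are worth making, recognizing when two screens are "the same", and getting back to an earlier screen to continue the frontier.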
The academic lineage here goes back to DroidBot by Li et al. (IEEE/ACM ICSE-C 2017), which built state transition models from live UI interactions. DroidBot used hard-coded heuristics to decide what to tap next. Drengr replaces those heuristics with an LLM decision layer — the agent can reason about whether a button is likely to navigate somewhere useful or just dismiss a dialog.
Wen et al. at Microsoft Research took this further with AutoDroid (ACM MobiCom 2024), combining LLM-driven exploration with a reusable knowledge graph. Their publication at MobiCom — a top-tier systems conference — establishes this as a recognized systems contribution, not just an ML exercise. Drengr's approach is architecturally simpler — a single Rust binary versus a Python/LLM stack — but the core insight is the same: BFS exploration is dramatically more effective when guided by a language model than by heuristics.
ReAct and Tool Use: The Conceptual Foundation
Two papers form the conceptual bedrock of what Drengr enables, even though I hadn't read either when I started building.
Yao et al. at Princeton published ReAct: Synergizing Reasoning and Acting in Language Models (ICLR 2023, arXiv:2210.03629). ReAct interleaves chain-of-thought reasoning with executable actions — the model reasons about what to do, issues an action, observes the result, reasons again. Every time Claude calls drengr_look, reasons about what to tap, then calls drengr_do, it's executing a ReAct loop. Drengr is, architecturally, a ReAct-compatible tool suite for mobile devices.
Schick et al. at Meta published Toolformer (NeurIPS 2023, arXiv:2302.04761), demonstrating that LLMs can learn when and how to call external tools. Toolformer's tools were side-effect-free APIs — calculators, search engines, QA systems. Drengr's tools have physical side effects. When drengr_do taps a button, a real device changes state. That distinction matters — the consequences of a wrong action are much more significant than those of a wrong search query.
Screen Understanding: Where the Field Is Heading
Two papers from Google Research point to where Drengr's perception layer might evolve.
ScreenAI (Baechler et al., IJCAI 2024, arXiv:2402.04615) is a 4.6B-parameter vision-language model fine-tuned specifically for UI understanding. It identifies UI elements — buttons, text fields, images — at the pixel level from raw screenshots. Currently, Drengr uses the Android accessibility tree alongside screenshots for element identification. ScreenAI suggests that the screenshot alone might eventually be sufficient, which would make Drengr's perception layer identical across Android, iOS, and any other platform with a display.
Spotlight (Li and Li, arXiv:2209.14927, 2023) goes even further — a vision-only model for mobile UI understanding that outperforms methods using both screenshots and view hierarchies. This directly challenges Drengr's current design of using the accessibility tree as a primary data source. If vision-only models can outperform metadata-enhanced models, then Drengr's drengr_query tool (which reads the UI tree) might eventually become redundant — replaced by richer visual understanding from the LLM itself.
For now, the accessibility tree remains the right default — it's reliable, fast, and doesn't require a specialized vision model. But Drengr's perception layer is designed as a swappable trait, so when vision-only models reach the point where they consistently outperform metadata-enhanced approaches across device types and screen densities, the switch is an implementation change, not an architectural one.
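The "swappable trait" idea can be made concrete with a small sketch. The trait and type names here are hypothetical, not Drengr's actual API: the point is only that callers depend on the trait, so swapping a tree-backed source for a vision-backed one changes an implementation, not the architecture.

```rust
// Sketch of a perception layer behind a trait. The agent consumes
// `Perception` and does not care whether observations come from the
// accessibility tree, a vision model, or both. Names are illustrative.
trait Perception {
    fn elements(&self) -> Vec<String>;
}

// Tree-backed source: would parse the accessibility tree dump.
struct TreePerception;

impl Perception for TreePerception {
    fn elements(&self) -> Vec<String> {
        vec!["Button:login".to_string(), "Switch:dark_mode".to_string()]
    }
}

// Vision-backed source: would run a vision model on the screenshot.
struct VisionPerception;

impl Perception for VisionPerception {
    fn elements(&self) -> Vec<String> {
        vec!["Button:login".to_string()]
    }
}

// Callers are written against the trait, not a concrete source.
fn count_elements(p: &dyn Perception) -> usize {
    p.elements().len()
}

fn main() {
    println!("tree sees {} elements", count_elements(&TreePerception));
    println!("vision sees {} elements", count_elements(&VisionPerception));
}
```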
The Survey: Situating Drengr in the Field
Wang et al. published GUI Agents with Foundation Models: A Comprehensive Survey (arXiv:2411.04890, 2024) — a systematic review of 100+ papers on LLM-based GUI agents across web, desktop, and mobile. Reading this survey was like looking at a map after you've already hiked the trail. I recognized the landmarks.
Drengr's three-tool architecture fits cleanly into the survey's taxonomy of perception-grounding-action pipelines. What the survey made clear is that most systems in this space are tightly coupled — the perception, grounding, and action components are part of the same codebase, usually Python. Drengr's contribution is decoupling these: it handles perception and action, and lets any MCP-compatible LLM handle grounding and reasoning. This is a systems architecture choice, not an ML innovation — but it's one that the survey suggests is underexplored.
Where Drengr Diverges
After reading all of this, here's what I think Drengr is doing differently — or at least trying to:
- Infrastructure, not framework. Almost every paper describes an end-to-end agent. Drengr is deliberately not an agent — it's the hands and eyes that agents use. This separation came from ten years of watching tightly-coupled testing tools become unmaintainable, a pattern repeated across every mobile organization I worked in, no matter how experienced the leadership. The tooling forced coupling, but there was a deeper issue: the mental model at the top usually stopped at "we need more test coverage" without asking whether the testing paradigm itself was the bottleneck. When you've shipped successful apps for years with a certain approach, it takes a particular kind of intellectual honesty to ask whether that approach has a ceiling, so most organizations optimized within the paradigm rather than questioning it. And when your test framework is also your test runner is also your assertion library is also your device manager, everything breaks together. The industry internalized that pain as normal. Drengr's hypothesis is that it doesn't have to be: separate the perception and action layer from the intelligence layer, and each can evolve independently. The agent will change. The tools should remain.
- Rust, not Python. Every system cited above is Python. Drengr is a single static Rust binary. As an Android engineer, I know what it's like to ask a team to install a tool with twelve dependencies. I wanted `curl | bash` and done. That choice has trade-offs — I wrote about them in a separate post.
- MCP first — by design. People ask why I released Drengr as an MCP server before building a standalone CLI agent. The answer comes from watching this industry long enough to know what survives and what doesn't. AI models improve every few months. The agent that's state-of-the-art today will be obsolete by next year. But the ability to observe a screen, tap a button, and read a UI tree? That's stable. That's the invariant. By releasing the tool layer first — as an MCP server that any AI client can consume — I'm building on the part that lasts. Claude Desktop uses it today. Cursor uses it today. Whatever comes next year will use it too, because the interface is standardized. If I'd built a monolithic agent instead, I'd be rewriting it every time a better model dropped. The Model Context Protocol didn't exist when most of these papers were written. Drengr's bet is that a standard protocol between AI agents and tools is more valuable than another custom agent framework. I might be wrong. But ten years of watching tightly-coupled tools age badly makes me think this bet is right.
- Born from the field, not the lab. This isn't a research project with a team, compute budget, and publication timeline. It's one Android engineer who got tired of writing tests that broke every sprint and decided to try a different approach. The architecture reflects that — pragmatic, incremental, shaped by what I actually needed rather than what's theoretically optimal.
What I Learned
Reading these papers after building the first version taught me something I didn't expect: the problems I was solving alone — in my apartment, after work, on weekends — are the same problems that well-funded research teams at Google and Microsoft are working on. That's both humbling and encouraging.
The convergence gives me confidence that I'm not building something crazy. The divergence — particularly Drengr's choice to be infrastructure rather than an agent, and to use the accessibility tree alongside vision rather than vision alone — reflects deliberate trade-offs, not gaps. Where the academic work explores what's theoretically optimal, Drengr is built around what's practically reliable today while keeping the architecture open to what's coming.
I'm not an academic. I don't have a lab or a publication record. I'm an Android engineer with a decade of scar tissue from brittle test suites, building a tool shaped by what I actually needed in the field. The researchers cited here are formalizing the theory behind problems I've been solving through iteration and observation. We're approaching the same territory from different directions — and I think both directions produce insights the other can't.
If you're working in this area — whether you're writing papers or building tools or just frustrated with your own test suite — I'd love to hear from you. This space is wide open, and I think we're all just getting started.
References
- Zhang et al. "AppAgent: Multimodal Agents as Smartphone Users." arXiv:2312.13771, 2023.
- Wang et al. "Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception." arXiv:2401.16158, 2024.
- Wen et al. "AutoDroid: LLM-powered Task Automation in Android." ACM MobiCom, 2024.
- Schneier & Raghavan. "Agentic AI's OODA Loop Problem." IEEE Security & Privacy, 2025.
- Yasuno. "RAPTOR-AI for Disaster OODA Loop." arXiv:2602.00030, 2026.
- Li et al. "DroidBot: A Lightweight UI-Guided Test Input Generator for Android." IEEE/ACM ICSE-C, 2017.
- Wang et al. "GUI Agents with Foundation Models: A Comprehensive Survey." arXiv:2411.04890, 2024.
- Schick et al. "Toolformer: Language Models Can Teach Themselves to Use Tools." NeurIPS, 2023.
- Yao et al. "ReAct: Synergizing Reasoning and Acting in Language Models." ICLR, 2023.
- Baechler et al. "ScreenAI: A Vision-Language Model for UI and Infographics Understanding." IJCAI, 2024.
- Li & Li. "Spotlight: Mobile UI Understanding using Vision-Language Models with a Focus." arXiv:2209.14927, 2023.