Agent-desktop – Native desktop automation CLI for AI agents
CLI Tool
Free
I've been building computer-use tools for a while, and I quietly launched this about a month ago (122 stars on GitHub). I figured it was worth sharing here.
Over the last few months, a lot of computer-use agents have come out: Codex, Claude Code, CUA, and others. Most of them seem to work roughly li
Capabilities (8 decomposed)
native-desktop-ui-automation-via-cli
Medium confidence
Provides a command-line interface to programmatically control native desktop UI elements (windows, buttons, text fields, menus) across operating systems using accessibility APIs and platform-specific automation frameworks. It works by wrapping OS-level automation APIs (Windows UI Automation, macOS Accessibility, Linux AT-SPI) in a unified CLI command schema that AI agents can invoke as subprocess calls or shell commands.
Bridges AI agents directly to native desktop UIs via CLI rather than requiring browser automation or custom integrations — uses OS accessibility APIs as the automation substrate, enabling agents to control any application with accessibility support without application-specific bindings
Simpler than Selenium/Playwright for desktop apps and more universal than application-specific APIs because it targets the OS-level accessibility layer that all modern applications expose
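The subprocess pattern described above can be sketched in Python as follows. This is a minimal illustration, not the tool's documented interface: the `click` subcommand and the `--role` and `--label` flags are hypothetical, and a stand-in Python process replaces the real binary so the sketch runs without the tool installed.

```python
import json
import subprocess
import sys

def build_argv(binary, action, **options):
    """Turn an action name and keyword options into an argv list."""
    argv = [binary, action]
    for key, value in sorted(options.items()):
        argv += [f"--{key.replace('_', '-')}", str(value)]
    return argv

def run_json(argv, timeout=10.0):
    """Run a command and parse its stdout as JSON, raising on a non-zero exit."""
    proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    if proc.returncode != 0:
        raise RuntimeError(f"command failed ({proc.returncode}): {proc.stderr.strip()}")
    return json.loads(proc.stdout)

# Hypothetical command line; the real subcommands and flags may differ.
CLICK_ARGV = build_argv("agent-desktop", "click", role="button", label="Save")

# Stand-in process (assumption) so the sketch runs without the tool installed.
STAND_IN = [sys.executable, "-c", "import json; print(json.dumps({'ok': True}))"]
RESULT = run_json(STAND_IN)
```

Passing the command as a list rather than a shell string avoids quoting problems when labels contain spaces.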
window-and-element-discovery-via-accessibility-tree
Medium confidence
Scans and exposes the accessibility tree of running desktop applications, allowing agents to discover available UI elements (windows, buttons, text fields, menus) by querying element properties like role, label, state, and hierarchy. It works by traversing the OS accessibility tree and serializing it into a queryable format that agents can parse to locate interaction targets.
Exposes raw accessibility tree structure as queryable data rather than requiring agents to know exact element IDs or coordinates — enables semantic element discovery based on accessibility metadata (roles, labels, states) that applications provide for assistive technology
More reliable than image-based UI automation (no OCR errors) and more flexible than coordinate-based clicking because it uses semantic accessibility metadata that persists across UI theme changes and layout adjustments
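Semantic discovery over a serialized tree might look like the sketch below. The tree shape and the field names (`role`, `label`, `children`) are assumptions made for illustration; the tool's actual output format is not described here.

```python
from typing import Iterator, Optional

# Hypothetical serialized tree; field names are illustrative only.
TREE = {
    "role": "window", "label": "Editor", "children": [
        {"role": "toolbar", "label": "Main", "children": [
            {"role": "button", "label": "Save", "children": []},
            {"role": "button", "label": "Open", "children": []},
        ]},
        {"role": "text_field", "label": "Search", "children": []},
    ],
}

def find_elements(node: dict, role: Optional[str] = None,
                  label: Optional[str] = None, path: tuple = ()) -> Iterator[tuple]:
    """Depth-first search yielding (path, node) for nodes matching the filters."""
    here = path + (node["label"],)
    role_ok = role is None or node["role"] == role
    label_ok = label is None or node["label"] == label
    if role_ok and label_ok:
        yield here, node
    for child in node.get("children", []):
        yield from find_elements(child, role, label, here)

BUTTONS = [" > ".join(p) for p, _ in find_elements(TREE, role="button")]
print(BUTTONS)  # ['Editor > Main > Save', 'Editor > Main > Open']
```

Returning the label path alongside each node gives an agent a stable way to refer to an element without relying on screen coordinates.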
keyboard-and-mouse-input-simulation
Medium confidence
Simulates keyboard input (key presses, text entry, modifier combinations) and mouse actions (clicks, drags, scrolling, movement) at the OS level by injecting events into the system input queue. It uses platform-specific input injection APIs (Windows SendInput, macOS CGEvent, Linux XTest) so that events reach the focused application with proper timing and sequencing.
Injects input events directly into the OS input queue rather than sending events to specific application windows — ensures compatibility with any application regardless of how it handles input, but requires careful timing and state management
More universal than application-specific input APIs because it works at the OS level, but requires more careful timing and state management than higher-level automation frameworks that provide built-in synchronization
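The sequencing concern for modifier combinations can be illustrated without touching any OS API. The sketch below only computes the order of press and release events for a chord such as `Ctrl+Shift+S`; the event tuples are hypothetical, and real injection would hand them to a platform backend.

```python
from typing import List, Tuple

MODIFIERS = {"ctrl", "shift", "alt", "meta"}

def expand_chord(chord: str) -> List[Tuple[str, str]]:
    """Expand 'ctrl+shift+s' into ordered ('down'|'up', key) events.

    Modifiers go down first, each ordinary key is pressed and released,
    then modifiers are released in reverse order.
    """
    keys = [k.strip().lower() for k in chord.split("+") if k.strip()]
    mods = [k for k in keys if k in MODIFIERS]
    others = [k for k in keys if k not in MODIFIERS]
    events = [("down", m) for m in mods]
    for key in others:
        events += [("down", key), ("up", key)]
    events += [("up", m) for m in reversed(mods)]
    return events

EVENTS = expand_chord("Ctrl+Shift+S")
print(EVENTS)
```

Releasing modifiers in reverse order mirrors how a person lifts their fingers and avoids leaving a modifier stuck down if a later step fails.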
screenshot-and-screen-capture-with-element-highlighting
Medium confidence
Captures full-screen or region-specific screenshots and optionally highlights specific UI elements (bounding boxes, color overlays) to give agents visual feedback about the current desktop state. It works by using OS graphics APIs (Windows GDI+, macOS Quartz, Linux X11/Wayland) to capture framebuffer content and then overlaying element bounding boxes taken from the accessibility tree.
Combines raw screenshot capture with accessibility tree data to overlay semantic element information (bounding boxes, labels) rather than relying on OCR or image analysis — provides agents with both visual and structural context
More accurate element highlighting than vision-based approaches because it uses accessibility metadata, but requires that elements are properly exposed in the accessibility tree
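One step in overlaying accessibility data on a region capture is converting an element's screen-space bounding box into coordinates relative to the captured region. The sketch below shows that geometry only; the `Rect` type and function name are illustrative and do not come from the tool.

```python
from typing import NamedTuple, Optional

class Rect(NamedTuple):
    x: int
    y: int
    width: int
    height: int

def to_region_coords(element: Rect, region: Rect) -> Optional[Rect]:
    """Clip an element's screen-space box to the captured region and
    express it relative to the region's top-left corner."""
    left = max(element.x, region.x)
    top = max(element.y, region.y)
    right = min(element.x + element.width, region.x + region.width)
    bottom = min(element.y + element.height, region.y + region.height)
    if right <= left or bottom <= top:
        return None  # element lies entirely outside the capture
    return Rect(left - region.x, top - region.y, right - left, bottom - top)

REGION = Rect(100, 100, 400, 300)
INSIDE = to_region_coords(Rect(150, 120, 80, 30), REGION)
CLIPPED = to_region_coords(Rect(450, 380, 100, 100), REGION)
OUTSIDE = to_region_coords(Rect(0, 0, 50, 50), REGION)
```

Here `INSIDE` is `Rect(50, 20, 80, 30)`, `CLIPPED` is trimmed to `Rect(350, 280, 50, 20)`, and `OUTSIDE` is `None`, so a drawing step can skip elements that are not in the image.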
multi-window-and-application-context-management
Medium confidence
Tracks and manages context across multiple open windows and applications, allowing agents to switch focus, query window state, and stay aware of which application is currently active. It works by monitoring OS window manager events and maintaining a window registry that agents can query to discover available windows and switch between them.
Maintains persistent window registry and focus state rather than treating each window interaction independently — enables agents to reason about application context and coordinate actions across multiple windows
More sophisticated than simple window switching because it tracks window state and properties, enabling agents to make intelligent decisions about which window to target based on application context
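A window registry driven by window manager events could be sketched as below. The event method names and the `WindowInfo` fields are assumptions for illustration; a real implementation would subscribe to platform-specific notifications.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class WindowInfo:
    window_id: int
    app: str
    title: str

class WindowRegistry:
    """Tracks open windows and the focused one from a stream of events."""

    def __init__(self) -> None:
        self._windows: Dict[int, WindowInfo] = {}
        self._focused: Optional[int] = None

    def on_opened(self, info: WindowInfo) -> None:
        self._windows[info.window_id] = info

    def on_closed(self, window_id: int) -> None:
        self._windows.pop(window_id, None)
        if self._focused == window_id:
            self._focused = None  # focus is unknown until the next focus event

    def on_focused(self, window_id: int) -> None:
        if window_id in self._windows:
            self._focused = window_id

    def focused(self) -> Optional[WindowInfo]:
        if self._focused is None:
            return None
        return self._windows.get(self._focused)

    def find(self, app: str) -> List[WindowInfo]:
        return [w for w in self._windows.values() if w.app == app]

registry = WindowRegistry()
registry.on_opened(WindowInfo(1, "Editor", "notes.txt"))
registry.on_opened(WindowInfo(2, "Browser", "Docs"))
registry.on_focused(2)
```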
cli-command-composition-and-scripting
Medium confidence
Provides a command-line interface that agents can invoke via subprocess calls or shell scripts, with a structured command syntax for composing complex automation sequences. It works by parsing CLI arguments into action objects, executing them sequentially with error handling, and returning structured output that agents can parse to determine success or failure and decide on next steps.
Exposes desktop automation as a CLI tool that agents invoke via subprocess rather than requiring language-specific SDK bindings — enables agents in any language/runtime to access desktop automation without native library dependencies
More flexible than language-specific SDKs because it works with any agent implementation, but incurs subprocess overhead and requires careful output parsing compared to direct library integration
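The parse, execute, and report loop described above can be sketched as follows. The step syntax, the `Action` type, and the handler names are hypothetical; the sketch stops at the first failing step and returns one structured record per attempted step.

```python
import shlex
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Action:
    name: str
    args: List[str] = field(default_factory=list)

def parse_step(line: str) -> Action:
    """Split a shell-style line into an action name and its arguments."""
    parts = shlex.split(line)
    if not parts:
        raise ValueError("empty step")
    return Action(parts[0], parts[1:])

def run_sequence(lines: List[str],
                 handlers: Dict[str, Callable[[List[str]], str]]) -> List[dict]:
    """Run steps in order; stop at the first failure and report it."""
    results = []
    for line in lines:
        action = parse_step(line)
        handler = handlers.get(action.name)
        if handler is None:
            results.append({"step": line, "ok": False,
                            "error": f"unknown action: {action.name}"})
            break
        try:
            results.append({"step": line, "ok": True, "output": handler(action.args)})
        except Exception as exc:  # report any handler failure to the caller
            results.append({"step": line, "ok": False, "error": str(exc)})
            break
    return results

# Stand-in handlers (assumption); real ones would drive the desktop.
HANDLERS = {
    "type": lambda args: " ".join(args),
    "click": lambda args: f"clicked {args[0]}",
}
RESULTS = run_sequence(['type "hello world"', "click Save", "drag a b", "click Never"], HANDLERS)
```

In this run the third step fails with an unknown action, so the fourth step is never attempted and `RESULTS` holds three records.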
error-handling-and-action-validation
Medium confidence
Validates automation actions before execution and provides detailed error reporting when actions fail, including the accessibility tree state at the point of failure and suggestions for recovery. It works by pre-checking element existence and state, executing actions with exception handling, and capturing diagnostic information (element properties, window state, error context) for agent debugging.
Captures accessibility tree state at failure point rather than just reporting error codes — provides agents with semantic context about why an action failed and what UI state led to the failure
More informative than simple error codes because it includes UI state context, enabling agents to make intelligent recovery decisions or log detailed failure information for human debugging
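A pre-check that returns UI context with the error, rather than a bare code, might be sketched like this. The error names, suggestion strings, and element fields (`visible`, `enabled`) are hypothetical and chosen only to illustrate the idea.

```python
from typing import Optional

def validate_click(element: Optional[dict], tree_snapshot: dict) -> dict:
    """Pre-check a click target and return a structured report.

    On failure the report carries the tree snapshot so the caller can
    see what the UI looked like when the check failed.
    """
    def failure(error: str, suggestion: str) -> dict:
        return {"ok": False, "error": error, "suggestion": suggestion,
                "element": element, "tree": tree_snapshot}

    if element is None:
        return failure("element_not_found", "re-scan the tree and retry the lookup")
    if not element.get("visible", True):
        return failure("element_not_visible", "scroll the element into view or focus its window")
    if not element.get("enabled", True):
        return failure("element_disabled", "complete the step that enables this control")
    return {"ok": True, "element": element}

SNAPSHOT = {"role": "window", "label": "Settings", "children": []}
REPORT = validate_click({"role": "button", "label": "Apply", "enabled": False}, SNAPSHOT)
print(REPORT["error"], "->", REPORT["suggestion"])
```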
cross-platform-abstraction-layer
Medium confidence
Abstracts platform-specific differences (Windows UI Automation vs. macOS Accessibility vs. Linux AT-SPI) behind a unified CLI interface, allowing agents to write platform-agnostic automation code. It works by detecting the host OS at runtime and routing commands to the appropriate platform-specific backend while keeping command syntax and output format consistent.
Provides unified CLI interface across Windows, macOS, and Linux by internally routing to platform-specific accessibility APIs — enables agents to use identical command syntax regardless of OS without learning platform-specific APIs
More portable than platform-specific automation tools because agents write once and run on any OS, but requires maintaining multiple backend implementations and handling platform-specific edge cases
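Runtime OS detection and backend routing can be sketched with `sys.platform`, which is `"win32"` on Windows, `"darwin"` on macOS, and `"linux"` on Linux. The backend labels below are taken from the description above; the mapping and function name are illustrative, not the tool's internals.

```python
import sys

# Platform prefix -> backend label (names from the description above).
BACKENDS = {
    "win32": "Windows UI Automation",
    "darwin": "macOS Accessibility",
    "linux": "Linux AT-SPI",
}

def select_backend(platform: str = sys.platform) -> str:
    """Pick the accessibility backend for a given sys.platform value."""
    for prefix, backend in BACKENDS.items():
        if platform.startswith(prefix):
            return backend
    raise NotImplementedError(f"unsupported platform: {platform}")

print(select_backend("darwin"))  # macOS Accessibility
```

Failing loudly on an unknown platform keeps an agent from issuing commands against a backend that does not exist.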
Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.
Related Artifacts (sharing capabilities)
Artifacts that share capabilities with Agent-desktop – Native desktop automation CLI for AI agents, ranked by overlap. Discovered automatically through the match graph.
Peekaboo
A macOS-only MCP server that enables AI agents to capture screenshots of applications or the entire system.
UI-TARS-desktop
The Open-Source Multimodal AI Agent Stack: Connecting Cutting-Edge AI Models and Agent Infra
chrome-devtools-mcp
Chrome DevTools for coding agents
Safari MCP
Native Safari browser automation for AI agents — 80 tools via AppleScript, zero Chrome overhead, keeps logins, runs silently. macOS only.
Windows-MCP
MCP Server for Computer Use in Windows
lamda
The most powerful Android RPA agent framework, next generation mobile automation.
Best For
- ✓AI agent developers building desktop automation workflows
- ✓teams automating legacy desktop application testing
- ✓developers integrating LLMs with native desktop tools that lack APIs
- ✓AI agents that need to explore unfamiliar desktop applications dynamically
- ✓developers building adaptive automation that adjusts to UI layout changes
- ✓teams testing accessibility compliance of desktop applications
- ✓agents automating data entry and form filling in desktop applications
- ✓developers testing keyboard navigation and accessibility features
Known Limitations
- ⚠Requires OS-level permissions and accessibility API access — may need elevated privileges or accessibility settings enabled
- ⚠Performance depends on OS event loop responsiveness — high-frequency interactions may experience latency or dropped events
- ⚠Limited to UI elements exposed via accessibility APIs — some custom-drawn or obfuscated UI components may not be detectable
- ⚠No built-in OCR or image recognition — relies on accessibility tree structure rather than visual content analysis
- ⚠Accessibility tree completeness varies by application — poorly-designed apps may have sparse or missing accessibility metadata
- ⚠Tree traversal can be slow for deeply nested UIs or applications with thousands of elements
UnfragileRank
UnfragileRank is computed from adoption signals, documentation quality, ecosystem connectivity, match graph feedback, and freshness. No artifact can pay for a higher rank.
About
Show HN: Agent-desktop – Native desktop automation CLI for AI agents