via “vision-language model-driven screenshot interpretation and action reasoning”
Open-source infrastructure for Computer-Use Agents. Sandboxes, SDKs, and benchmarks to train and evaluate AI agents that can control full desktops (macOS, Linux, Windows).
Unique: Implements a unified Responses API message format abstraction layer that normalizes outputs from 100+ heterogeneous VLM providers (native computer-use models like Claude, composed models via grounding adapters, and local model adapters), eliminating provider-specific parsing logic and enabling seamless model swapping without agent code changes.
vs others: Broader model coverage and provider flexibility than Anthropic's native computer-use API alone, with explicit support for local/open-source models and a standardized message format that decouples agent logic from model implementation details.