Oppo open-sources Android AI agent X-OmniClaw that uses your camera, screen, and voice without leaving the phone

Oppo’s Multi-X team has released X-OmniClaw, an open-source AI agent for Android that runs directly on the device, using the camera, screen, and voice to carry out tasks across apps without relying on cloud processing. In a technical report, Oppo’s AI Center contrasts this approach with cloud-based platforms such as RedFinger, Alibaba’s Wuying, and Tencent Cloud Phone. Where those services run virtualized Android instances in data centers, X-OmniClaw keeps the core logic for perception, control, and app interaction on the physical device; the cloud serves only as a supplementary resource for heavier reasoning when needed.

Camera, screen, and voice feed into a single pipeline

X-OmniClaw’s architecture merges its three perception channels into a unified pipeline. A vision-language model first interprets the scene together with the user’s request. When a user asks for the price of a product while pointing the camera at it, for example, the system reformulates the query internally before executing the command.
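The report does not publish the pipeline’s interfaces, but the reformulation step can be sketched roughly as follows. All names here (`Perception`, `reformulate_query`, the caption-substitution rule) are illustrative assumptions, not X-OmniClaw APIs:

```python
from dataclasses import dataclass

@dataclass
class Perception:
    camera_caption: str   # produced by a vision-language model
    screen_text: str      # current on-screen context
    voice_query: str      # transcribed user request

def reformulate_query(p: Perception) -> str:
    """Fuse the three channels into one self-contained query.

    A deictic request like "how much is this?" is rewritten so that
    downstream app automation no longer needs the camera frame.
    """
    query = p.voice_query
    if "this" in query.lower() and p.camera_caption:
        query = query.replace("this", p.camera_caption)
        query = query.replace("This", p.camera_caption)
    return query

print(reformulate_query(Perception(
    camera_caption="a red wireless mouse",
    screen_text="home screen",
    voice_query="How much does this cost?",
)))
# How much does a red wireless mouse cost?
```

In the real system a language model, not string substitution, would resolve the reference; the point is only that the rewritten query becomes self-contained before it reaches the shopping app.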

Photo gallery becomes searchable memory

For long-term memory, X-OmniClaw converts local data into semantic entries. During idle periods it processes gallery photos into short descriptions of objects, scenes, and events, which it stores in a Markdown file. The memory module filters out sensitive information before anything is saved, avoiding the upload risk that comes with cloud vision. The report names a transition to on-device models as the crucial next step, so that raw images never leave the device.
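A minimal sketch of that idle-time pass, assuming captions have already been produced by a vision-language model. The `SENSITIVE` pattern and the `index_gallery` function are hypothetical stand-ins; the report does not specify the actual redaction rules:

```python
import re

# Illustrative filter only -- the real module's rules are not published.
SENSITIVE = re.compile(r"passport|id card|credit card|password", re.I)

def index_gallery(photos: dict[str, str]) -> str:
    """Turn photo captions into Markdown memory entries,
    dropping any caption that matches the sensitive-content filter."""
    lines = ["# Gallery memory"]
    for path, caption in photos.items():
        if SENSITIVE.search(caption):
            continue  # filtered out before anything is persisted
        lines.append(f"- `{path}`: {caption}")
    return "\n".join(lines)

memory_md = index_gallery({
    "IMG_001.jpg": "a green parrot on a balcony railing",
    "IMG_002.jpg": "photo of a passport data page",
})
print(memory_md)
```

Filtering on the caption before writing means the sensitive photo never enters the searchable memory, while the raw image itself stays untouched in the gallery.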

Cloned tap paths replace step-by-step replays

Rather than re-deriving every action from scratch, X-OmniClaw distills user behavior into reusable skills. It captures the full launch command for an app page, so later requests can jump there directly via deeplink instead of replaying the original sequence of taps. When that route fails, the system systematically probes simpler launch methods. To locate tappable elements, it combines the XML view hierarchy, a grounding model, and text recognition, which helps it cope with ad-heavy interfaces.
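The deeplink-first strategy with tap-sequence fallback can be sketched as below. `replay_skill`, `launch`, and `tap` are hypothetical stand-ins for the agent’s device-control layer, and the skill format is invented for illustration:

```python
def replay_skill(skill: dict, launch, tap) -> str:
    """Return which strategy succeeded for a recorded skill."""
    # Fast path: jump straight to the page via the recorded deeplink.
    if skill.get("deeplink") and launch(skill["deeplink"]):
        return "deeplink"
    # Fallback: replay the originally captured tap sequence.
    for xy in skill["taps"]:
        tap(xy)
    return "tap-replay"

skill = {"deeplink": "shopapp://deals/daily", "taps": [(120, 300), (540, 980)]}

# Simulate a device where the deeplink is unavailable:
result = replay_skill(skill, launch=lambda uri: False, tap=lambda xy: None)
print(result)  # tap-replay
```

The recorded taps are kept even after a deeplink is learned, so the skill degrades gracefully when an app update or missing handler breaks the shortcut.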

From price checks to homework help

The demonstrations cover a range of tasks. In one scenario, a user points the camera at a product and asks for its price. The agent navigates to the shopping app, retrieves the relevant listings, and reads out prices and sales figures via its vision-language model. In another, X-OmniClaw acts as a “ScreenAvatar,” working through a series of practice problems autonomously.

The agent can also assemble a highlight album from a collection of parrot photos on request: it gathers the relevant images, opens a video editing app’s one-click composition tool, and selects the right files. And after recording a path to a deeply nested discount page, the user can reach that exact subpage later with a single voice command, even when no public deeplink exists.

The project builds on the open-source HermesApp codebase and positions itself between OpenClaw, which is more focused on PCs, and the emergent-capability-driven Hermes Agent from Nous Research. The code and assets are available on GitHub.

Google has shown with Gemma 4 that a fully local model on a smartphone can act as an agent: in the demo app “Google AI Edge Gallery,” the model queries Wikipedia, generates QR codes, and opens mood trackers with trend charts. X-OmniClaw’s methodology builds on ByteDance’s UI-TARS, a visual GUI agent that relies solely on screenshots and coordinates, and improves accuracy by adding structural XML data and on-device execution.
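The difference from a screenshot-only agent can be sketched as a resolution step that prefers exact bounds from the XML hierarchy and falls back to a visual detection. `resolve_element` and the node/hit formats are hypothetical, chosen only to illustrate the idea:

```python
def resolve_element(label: str, xml_nodes: list[dict], visual_hits: list[dict]):
    """Prefer an XML node whose text matches the label; fall back to
    a visually grounded detection with estimated bounds."""
    for node in xml_nodes:
        if node["text"].lower() == label.lower():
            return {"source": "xml", "bounds": node["bounds"]}
    for hit in visual_hits:
        if label.lower() in hit["caption"].lower():
            return {"source": "vision", "bounds": hit["bounds"]}
    return None

xml_nodes = [{"text": "Buy now", "bounds": (80, 1400, 640, 1500)}]
visual_hits = [{"caption": "buy now button", "bounds": (78, 1395, 642, 1504)}]

print(resolve_element("Buy now", xml_nodes, visual_hits))
# {'source': 'xml', 'bounds': (80, 1400, 640, 1500)}
```

A screenshot-only agent like UI-TARS would only ever have the second, estimated box; using the view hierarchy when it is available gives pixel-exact tap targets, while the visual path still covers custom-drawn widgets that expose no XML text.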
