Oppo open-sources X-OmniClaw, an on-device Android agent that fuses camera, screen, voice and text

5/18/2026, 3:45:28 AM

On May 17, 2026, Oppo's Multi‑X team published source code and a technical report for X-OmniClaw, an open‑source AI agent that runs on physical Android phones rather than on cloud‑hosted virtual devices. The release matters because it gives builders a way to automate real apps using the phone's live sensors while keeping core processing local, which affects both privacy and the inputs an agent can access.

The package includes an architecture diagram and implementation notes showing a pipeline that fuses multiple perception channels. Camera imagery, screen contents, voice and text are time‑aligned, then passed through a vision‑language stage that interprets them and emits a structured intent for downstream modules. Oppo says core logic for perception, control and app interaction executes locally; cloud language models are invoked only occasionally as "fuel" for higher‑level reasoning.
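As a rough illustration of the time‑alignment step described above, the sketch below groups events from several channels into short time windows and packs each group into a structured record. It is not taken from Oppo's code: the `Event` type, the one‑second window and the shape of the intent record are all assumptions made for this example, and the real vision‑language stage is replaced by a trivial stand‑in.

```python
from dataclasses import dataclass


@dataclass
class Event:
    channel: str      # e.g. "camera", "screen", "voice" or "text"
    timestamp: float  # seconds since the start of the session
    payload: str      # placeholder for a frame, screenshot, transcript, ...


def align_events(events, window=1.0):
    """Group events that fall within `window` seconds of a group's first event.

    The window size is a hypothetical parameter, not a value from the report.
    """
    groups = []
    for ev in sorted(events, key=lambda e: e.timestamp):
        if groups and ev.timestamp - groups[-1][0].timestamp <= window:
            groups[-1].append(ev)
        else:
            groups.append([ev])
    return groups


def to_intent(group):
    """Stand-in for the vision-language stage: emit one structured record."""
    return {
        "start": group[0].timestamp,
        "channels": sorted({ev.channel for ev in group}),
        "inputs": {ev.channel: ev.payload for ev in group},
    }
```

Feeding it a camera frame at 0.0 s, a voice query at 0.2 s and a screen capture at 5.0 s yields two groups: the frame and the query are fused into one intent, while the later screen capture starts a new one.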

Oppo explicitly contrasts OmniClaw's on‑device design with cloud‑phone platforms such as RedFinger, Alibaba's Wuying and Tencent Cloud Phone, which run Android instances in data centers and therefore lack access to local sensors and private device data. That architectural difference frames OmniClaw as a privacy‑ and sensor‑aware alternative for developers who need live inputs from a user's phone.

To support memory and reuse, the agent condenses local data during idle time: gallery photos are processed into compact semantic entries and saved in a Markdown file labeled "image-memory.md" after a sensitive‑content filtering step. The team flags the risks of uploading raw images to cloud vision services and notes that moving more vision models fully on‑device is a planned next step so raw photos need not leave the handset.
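The idle‑time condensation can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the report's implementation: it assumes captions have already been produced by some vision model, and `build_memory_markdown`, `keyword_filter` and the blocked‑word list are hypothetical names and values chosen for illustration. The report does not describe the real filter or the layout of the Markdown file.

```python
def keyword_filter(caption, blocked=("passport", "credit card")):
    """Toy sensitive-content check; the real filter is not described."""
    text = caption.lower()
    return any(word in text for word in blocked)


def build_memory_markdown(captions, is_sensitive):
    """Turn {filename: caption} into a compact Markdown list.

    Entries flagged by `is_sensitive` are skipped before anything is written.
    """
    lines = ["# Image memory", ""]
    for filename, caption in captions.items():
        if is_sensitive(caption):
            continue
        lines.append(f"- {filename}: {caption}")
    return "\n".join(lines) + "\n"
```

Running the filter before the entry is written means sensitive captions never reach the memory file, which mirrors the ordering the article describes.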

For in‑app navigation, OmniClaw avoids simple replay of recorded taps and instead clones user tap behavior into reusable skills. It extracts an app page's full launch command and attempts deeplinks to jump directly to deep pages, falling back to progressively simpler launch methods if deeplinks fail. Tappable‑element detection merges XML structure, an on‑device grounding model and OCR/text recognition to cope with ad‑heavy or ambiguous user interfaces.
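The deeplink‑first, fallback‑later behavior is an ordered chain of launch attempts. The sketch below shows that pattern in isolation. The strategy names used in the usage note (`deeplink`, `explicit_activity`, `app_home`) are hypothetical, since the article only says the agent falls back to "progressively simpler launch methods", and each launcher here is a plain callable rather than a real Android call.

```python
def launch_with_fallback(strategies):
    """Try each (name, launcher) pair in order until one succeeds.

    A launcher is any zero-argument callable returning True on success.
    An exception is treated as a failed attempt. Returns the name of the
    strategy that worked (or None) plus the list of attempted names.
    """
    attempts = []
    for name, launcher in strategies:
        attempts.append(name)
        try:
            ok = bool(launcher())
        except Exception:
            ok = False
        if ok:
            return name, attempts
    return None, attempts
```

A caller would pass the strategies from most to least specific, for example a deeplink to the target page first, then an explicit activity launch, then simply opening the app's home screen. Returning the list of attempts makes it easy to log which methods failed for a given app.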

Demos in the report show practical uses: pointing a phone camera at a product and asking for price comparisons across stores (the system internally reformulates intents such as "price of Evian spray on Taobao"), a floating assistant that interacts with apps to help solve homework, and autonomous creation of photo albums from a user's gallery. The report does not list specific local model names, so builders should expect integration work and resource limits when deploying advanced on‑device perception.

Sources

The Decoder AI · 5/17/2026
