Hands‑on test finds Gemini ingests YouTube, MP4 and MOV in browser while Claude refuses video and ChatGPT needs Codex

5/11/2026, 12:44:59 PM

Reporter David Gewirtz tested three assistants with the prompt “watch this video” and found Gemini’s web interface directly accepted YouTube links and large local MP4/MOV files; Claude refused to process video or audio frames; and ChatGPT relied on Codex‑style code assistance for deeper analysis.

Reporter David Gewirtz ran a hands‑on comparison of three AI assistants to see whether they can truly “watch” videos rather than only read metadata. Using the prompt wording “watch this video,” Gewirtz fed each model the same three clips and found a clear split: Gemini processed the videos directly in a browser tab, Claude explicitly refused to handle video or audio frames, and ChatGPT could work with video prompts but relied on Codex‑style code assistance for deeper analysis. That difference matters for teams that want end‑to‑end video understanding without building extra preprocessing pipelines.

The test material was concrete. Gewirtz used a YouTube clip about annealing, a 625MB MP4 motion test recorded on a DJI Neo 2 drone (no audio, with gestures controlling flight), and a 1.65GB local MOV file of an original walk‑and‑talk recording that had previously been uploaded to YouTube. The local MOV was deliberately supplied without metadata or transcripts, so the test measured raw visual comprehension rather than reliance on captions or hosting‑side text.

On behavior and interfaces, Gemini’s web interface accepted a YouTube URL and both large local files directly in a browser tab without requiring a standalone app, processing MP4 and MOV data inline. Claude’s responses were categorical: it stated it “can't watch videos” and that it does not process visual or audio frames or streams. ChatGPT could accept video prompts but, for more advanced tasks, required developer tooling akin to Codex to extract frames, run code‑based analyses, or integrate external vision pipelines.

For builders, those differences have immediate consequences. Gemini’s native ingestion reduces the need to implement separate transcription, frame extraction, hosting, or vision model pipelines when the goal is an end‑to‑end browser flow. Teams using Claude must build preprocessing chains (transcription, frame sampling, and external vision models) before sending results to the assistant. ChatGPT’s Codex dependency signals a similar reality: expect to combine chat APIs with code‑assisted tooling for richer video tasks.
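As a rough illustration of the frame sampling step in such a preprocessing chain, the sketch below picks evenly spaced timestamps and builds an ffmpeg command line that would export one frame per interval. It is a minimal example under stated assumptions, not the method used in the test: the function names, the file name clip.mp4, the five‑second interval, and the output pattern are all hypothetical, and the command is only constructed, not executed.

```python
from typing import List


def sample_timestamps(duration_s: float, interval_s: int) -> List[float]:
    """Return evenly spaced timestamps (in seconds) that fall inside the clip."""
    if duration_s <= 0 or interval_s <= 0:
        raise ValueError("duration_s and interval_s must be positive")
    timestamps = []
    i = 0
    while i * interval_s < duration_s:
        timestamps.append(float(i * interval_s))
        i += 1
    return timestamps


def ffmpeg_frame_command(video_path: str, interval_s: int,
                         out_pattern: str = "frame_%04d.jpg") -> List[str]:
    """Build (but do not run) an ffmpeg command that writes one frame
    every interval_s seconds using the fps video filter."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return ["ffmpeg", "-i", video_path, "-vf", f"fps=1/{interval_s}", out_pattern]


if __name__ == "__main__":
    # Hypothetical 12-second clip sampled every 5 seconds.
    print(sample_timestamps(12, 5))
    print(" ".join(ffmpeg_frame_command("clip.mp4", 5)))
```

The exported frames, plus a transcript where the clip has audio, would then be passed to a vision model or attached to the assistant prompt. The right interval depends on how quickly the scene changes; a gesture‑heavy clip like the drone test would likely need denser sampling than a walk‑and‑talk.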

Gewirtz also exercised practical tasks that reveal real constraints: he asked each assistant to interpret the drone’s gesture controls and to propose better thumbnails. The drone clip, containing no audio and requiring motion interpretation, was a stress test that Gemini handled directly in the browser, while Claude declined to process the media and ChatGPT required auxiliary, code‑based handling to perform the same visual analyses.

Subscription tier and prompt wording may have influenced the results. The evaluation used then‑current paid tiers: ChatGPT Plus ($20/month), Gemini Pro ($20/month), and Claude Max ($100/month). Gewirtz reported that the prompt “watch this video” reliably prevented assistants from falling back to metadata searches. Builders should validate video capabilities on their chosen model and plan for pre‑ or post‑processing where direct ingestion is not available.
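One way to make that planning explicit is a small lookup that maps each assistant to the ingestion behavior reported in this test and returns the processing steps a team would need. This is only a sketch: the enum, the table, and the step names are hypothetical, and the table is a snapshot of one reporter's results that should be re‑validated before relying on it.

```python
from enum import Enum
from typing import List


class Ingestion(Enum):
    DIRECT = "direct"                # video accepted as-is in the test
    CODE_ASSISTED = "code_assisted"  # needed code-based tooling for analysis
    NONE = "none"                    # declined to process video or audio


# Hypothetical table encoding the behavior reported in this article only.
REPORTED_INGESTION = {
    "gemini": Ingestion.DIRECT,
    "chatgpt": Ingestion.CODE_ASSISTED,
    "claude": Ingestion.NONE,
}


def plan_steps(assistant: str) -> List[str]:
    """Return illustrative processing steps for the given assistant."""
    mode = REPORTED_INGESTION.get(assistant.strip().lower())
    if mode is Ingestion.DIRECT:
        return ["upload video"]
    if mode is Ingestion.CODE_ASSISTED:
        return ["extract frames with code tooling", "send frames and prompt"]
    if mode is Ingestion.NONE:
        return ["transcribe audio", "sample frames",
                "run external vision model", "send text results"]
    # Unknown assistant: nothing was reported, so test it by hand first.
    return ["validate video support manually"]


if __name__ == "__main__":
    for name in ("Gemini", "ChatGPT", "Claude", "other"):
        print(name, "->", plan_steps(name))
```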

Sources

ZDNET AI · 5/11/2026
