Glance adds automated pipeline to convert hour‑long videos into mobile clips: what developers gain

News

5/15/2026, 9:07:59 AM

Glance has deployed an automated production pipeline that turns 1–2 hour horizontal videos into multiple 30–180 second vertical clips optimized for mobile lock screens. The company is scaling daily throughput from about 3,500 toward more than 10,000 videos, a volume at which manual editing is impractical. The pipeline targets long‑form sources such as podcasts, news reports, movies and web series, letting publishers and creators generate mobile‑ready moments at scale.

The system focuses on preserving conversational context while producing short, vertical assets. It automatically selects the most engaging moments, each roughly 60 seconds long, detects active speakers and interview layouts, stacks split‑screens vertically for multi‑participant scenes, and applies programmatic masks, logos and overlays to maintain brand consistency. The first pipeline module, video clipping, extracts audio, produces precise transcripts with word‑level timestamps, and uses Speech‑to‑Text v2 plus a generative model to pick optimal start and end timestamps for clips.
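To illustrate why word‑level timestamps matter for clip boundaries, here is a minimal sketch of snapping a model‑proposed start and end time to the nearest word boundaries and keeping the clip within the 30–180 second range. The `Word` type, the function name and the snapping rules are assumptions made for illustration; the article does not describe how Glance implements this step.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Word:
    """One transcript word with its timing in seconds (hypothetical type)."""
    text: str
    start: float
    end: float


def snap_clip_to_words(
    words: List[Word],
    proposed_start: float,
    proposed_end: float,
    min_len: float = 30.0,
    max_len: float = 180.0,
) -> Tuple[float, float]:
    """Snap a proposed (start, end) to word boundaries within length limits."""
    if not words:
        raise ValueError("transcript has no words")

    # First word still being spoken at (or after) the proposed start:
    # the clip begins where that word begins, so no word is cut in half.
    first = next(
        (i for i, w in enumerate(words) if w.end > proposed_start),
        len(words) - 1,
    )
    # Last word that has started before the proposed end.
    last = max(
        (i for i, w in enumerate(words) if w.start < proposed_end),
        default=first,
    )
    last = max(last, first)

    # Too short: extend word by word (may stay short if the transcript ends).
    while words[last].end - words[first].start < min_len and last + 1 < len(words):
        last += 1
    # Too long: drop trailing words until the clip fits.
    while words[last].end - words[first].start > max_len and last > first:
        last -= 1

    return words[first].start, words[last].end
```

In this sketch the proposed times only choose which words are included; the returned boundaries always come from the transcript, so a clip never starts or ends mid‑word.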

Gemini validates the transcript text but not the word timing, and the clipping module outputs short video segments paired with time‑aligned transcripts for downstream processing.

The Intelligent Reframing Engine converts wide (16:9) frames into portrait (9:16) crops without losing context. It uses active speaker detection that differentiates static images from live people, split‑screen detection to stack interview participants, and multi‑stage scene analysis to decide crops so faces and action remain visible. Object tracking and video manipulation are implemented with Samurai, OpenCV and MoviePy. The pipeline also runs Google Vision API checks, and automated caption workflows add "karaoke‑style" word‑timed captions to boost engagement on silent‑by‑default mobile surfaces.
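The core arithmetic of a 16:9 to 9:16 reframe can be sketched as follows: given the horizontal position of the tracked speaker, compute a full‑height portrait crop window and clamp it to the frame. The function name and the idea of feeding it a single tracked center are assumptions for illustration; the engine described above uses multi‑stage scene analysis that this sketch does not attempt to reproduce.

```python
from typing import Tuple


def portrait_crop(
    frame_w: int,
    frame_h: int,
    center_x: float,
    aspect_w: int = 9,
    aspect_h: int = 16,
) -> Tuple[int, int, int, int]:
    """Return (x, y, w, h) of a full-height portrait crop around center_x."""
    # Use the full frame height and derive the width from the target aspect.
    crop_h = frame_h
    crop_w = min(frame_w, frame_h * aspect_w // aspect_h)
    # Round down to an even width, which common 4:2:0 encoders expect.
    crop_w -= crop_w % 2

    # Center the window on the subject, then clamp it inside the frame.
    x = int(center_x) - crop_w // 2
    x = max(0, min(x, frame_w - crop_w))
    return x, 0, crop_w, crop_h
```

With frames held as NumPy arrays, as OpenCV does, the window can be applied with plain slicing: `frame[y:y + h, x:x + w]`. In practice the center would likely need smoothing across frames to avoid jitter, which is left out here.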

A final stage applies automated branding elements and prepares clips for publishing in large batches while minimizing manual review.

Glance’s engineering notes underline a set of practical constraints for builders: precise word‑level timing is critical for exact clip boundaries; generative models, for example Gemini and Gemini 2.5 Flash (Nano Banana), can identify promising segments but must be paired with verified speech transcripts; and visual context requires object tracking plus split‑screen logic. Combining Speech‑to‑Text v2, vision APIs and custom video tooling lets the pipeline scale from about 3,500 toward more than 10,000 videos a day while preserving conversational continuity and brand fidelity.
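As a rough illustration of the split‑screen logic, the sketch below computes where each participant of a side‑by‑side interview would be cropped from and where the resulting tile would sit in a vertically stacked portrait frame. It assumes equal‑width panels and a centered crop, which is a simplification: the article says the real pipeline detects split‑screens and tracks speakers, and the function name and box format here are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height)


def stacked_layout(
    frame_w: int,
    frame_h: int,
    n_panels: int = 2,
    out_w: int = 1080,
    out_h: int = 1920,
) -> List[Tuple[Box, Box]]:
    """Return (source_box, destination_box) pairs for a vertical stack."""
    panel_w = frame_w // n_panels  # assumed: equal-width side-by-side panels
    tile_h = out_h // n_panels     # each participant gets an equal-height tile
    layout = []
    for i in range(n_panels):
        # Largest centered region of the panel with the tile's aspect ratio.
        src_w = min(panel_w, frame_h * out_w // tile_h)
        src_h = min(frame_h, src_w * tile_h // out_w)
        src_x = i * panel_w + (panel_w - src_w) // 2
        src_y = (frame_h - src_h) // 2
        src = (src_x, src_y, src_w, src_h)
        dst = (0, i * tile_h, out_w, tile_h)
        layout.append((src, dst))
    return layout
```

Each source box would then be cropped, resized to its destination size and pasted into the output frame. Matching the source aspect ratio to the tile avoids stretching faces when the tile is scaled.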

Sources

Google Cloud Blog — AI & Machine Learning · 5/13/2026
