Violin open‑sources an AI pipeline to translate and voice videos

5/17/2026, 5:51:19 AM

Violin is an open-source AI pipeline that automates transcription, LLM translation and synthesized speech to make video content accessible across languages while avoiding vendor lock-in.

On May 14, 2026, Violin launched as a community open‑source project that automates transcription, translation and synthesized voice output for video, pairing an end‑to‑end pipeline with a multimodal chat assistant. The release aims to help developers and content creators convert and interact with videos in other languages without depending on a single proprietary vendor, enabling broader access and easier localization.

Violin’s pipeline runs three core stages. Automatic speech recognition (ASR) uses a Whisper V3 large endpoint to produce timestamped transcripts. Translation defaults to DeepSeek V4 Pro and accepts user‑supplied translation rules that preserve domain terminology and keep the output faithful to the source. Text‑to‑speech (TTS) is powered by Cartesia’s Sonic 3, which generates native‑speaker voices from plain text guided by voice‑style prompts. The project explicitly disallows voice cloning and, by default, overlays the synthesized voice at low volume over the original audio.
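The three stages can be pictured as a simple chain in which each transcript segment keeps its timestamps through translation and synthesis. The following is a minimal sketch, not Violin's actual code: the `Segment` type, the prompt layout and the `asr`, `llm` and `tts` callables are all hypothetical stand‑ins for whichever model endpoints are used.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Segment:
    start: float  # seconds from the start of the video
    end: float
    text: str


def build_translation_prompt(text: str, target_lang: str, rules: List[str]) -> str:
    """Render user-supplied rules into a prompt (hypothetical layout)."""
    rule_block = "\n".join(f"- {rule}" for rule in rules) if rules else "- (no extra rules)"
    return (
        f"Translate the text below into {target_lang}.\n"
        f"Rules:\n{rule_block}\n"
        f"Text:\n{text}"
    )


def run_pipeline(
    video_path: str,
    target_lang: str,
    rules: List[str],
    asr: Callable[[str], List[Segment]],   # assumed: returns timestamped segments
    llm: Callable[[str], str],             # assumed: prompt in, translation out
    tts: Callable[[str], bytes],           # assumed: text in, audio bytes out
) -> List[Tuple[Segment, bytes]]:
    """Return (translated_segment, audio_bytes) pairs, one per ASR segment."""
    results = []
    for seg in asr(video_path):                                   # stage 1: transcript
        prompt = build_translation_prompt(seg.text, target_lang, rules)
        translated = Segment(seg.start, seg.end, llm(prompt))     # stage 2: keep timestamps
        results.append((translated, tts(translated.text)))        # stage 3: synthesize
    return results
```

Passing the stages in as callables keeps the sketch independent of any one vendor, which mirrors the article's point about avoiding lock‑in; real endpoints would replace the stand‑ins.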

The project also includes a multimodal video chat assistant that answers free‑form questions grounded in both audio and visual context. The assistant samples recent video frames alongside subtitle context and routes those inputs to a vision‑language model (the release cites Qwen3.5‑397B‑A17B) to produce its answers. The system is hosted in the provider's cloud and exposed via the Together API for programmatic access.
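One way to picture the grounding step is to collect a few frame timestamps and the subtitle lines from the moments just before the viewer asks a question. The sketch below is an illustration only: the 10‑second window, the four‑frame sample and the request dictionary are assumptions, not details from the release, and no real API call is made.

```python
from typing import Dict, List, Tuple

Subtitle = Tuple[float, float, str]  # (start_s, end_s, text)


def recent_frame_times(now_s: float, window_s: float = 10.0, n_frames: int = 4) -> List[float]:
    """Evenly spaced timestamps covering the last `window_s` seconds, ending at `now_s`."""
    if n_frames < 1:
        return []
    if n_frames == 1:
        return [now_s]
    start = max(0.0, now_s - window_s)
    step = (now_s - start) / (n_frames - 1)
    return [round(start + i * step, 3) for i in range(n_frames)]


def subtitle_context(subs: List[Subtitle], now_s: float, window_s: float = 10.0) -> List[str]:
    """Subtitle lines that overlap the window [now_s - window_s, now_s]."""
    lo = max(0.0, now_s - window_s)
    return [text for (start, end, text) in subs if end >= lo and start <= now_s]


def build_request(question: str, subs: List[Subtitle], now_s: float) -> Dict[str, object]:
    """Bundle the question with its visual and textual context (hypothetical shape)."""
    return {
        "question": question,
        "frame_times": recent_frame_times(now_s),
        "subtitles": subtitle_context(subs, now_s),
    }
```

A caller would extract the frames at `frame_times` and send them with the subtitle lines and the question to the vision‑language model.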

Violin ships as a web app, a command‑line interface and an agent skill, and the authors have published an open‑source repository plus a "Try Violin" demo to accelerate adoption. The release notes that more than 40 models were selected for production use, which suggests the team tuned multiple components for latency and multilingual quality rather than relying on a single off‑the‑shelf model.

The authors frame Violin against a concrete accessibility gap: research cited in the post finds that 66% of videos from the top 250 YouTube channels are in English and 15% are in Spanish, leaving much content inaccessible to many viewers. By combining timestamped ASR, rule‑guided translation and TTS, Violin aims to help high‑quality video content reach broader audiences and to give creators an efficient localization path.

For builders and integrators, Violin preserves production controls important to localization workflows: timestamped transcripts for lip‑sync and subtitle alignment, user‑defined translation rules for specialized terminology, selectable native voices (examples include Korean, Dutch, Italian and Chinese), and multiple integration points via web, CLI and agent interfaces. The authors demonstrate the tool by translating a Together Talks technical talk led by Percy Liang into Chinese, offering a concrete end‑to‑end example developers can adapt and extend.
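To illustrate why timestamped transcripts matter for subtitle alignment, here is a small sketch that turns `(start, end, text)` segments into SubRip (SRT) text. The article does not say which subtitle format Violin emits, so SRT is an assumption chosen because it is widely supported; the function names are hypothetical.

```python
from typing import Iterable, Tuple


def srt_timestamp(seconds: float) -> str:
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: Iterable[Tuple[float, float, str]]) -> str:
    """Build SRT text: numbered cues separated by blank lines."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n")
    return "\n".join(blocks)
```

Because each translated segment carries the timestamps of its source segment, the translated cues line up with the original speech without a separate alignment pass.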

Sources

Together AI Blog · 5/14/2026