
On May 15, 2026, the Ray monitoring suite added Cluster and Actor Dashboards, completing a rollout that fully persists the dashboards' telemetry. The change delivers end-to-end, durable visibility across the workload and infrastructure layers, enabling post-mortem analysis and historical troubleshooting for Ray applications. For teams that run large or long-running Ray clusters, the update preserves event data that was previously lost when a cluster shut down.
Under the hood, the new dashboards are fed by the Ray Event Export Framework, which streams cluster events off‑cluster and writes them into managed storage and query engines. That architecture persists node metadata, actor metadata, logs, task events and related metrics so those records can be queried after a cluster has terminated, rather than being available only in memory during a live session.
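The difference between in-memory state and exported events can be illustrated with a minimal sketch. This is not the Ray Event Export Framework's API: the function names `export_event` and `query_events`, the event fields, and the JSON Lines file standing in for managed storage are all assumptions made for illustration.

```python
import json
import os
import tempfile


def export_event(path, event):
    """Append one event to a durable JSON Lines file (stand-in for off-cluster storage)."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")


def query_events(path, event_type):
    """Read the exported events back and keep those of the requested type."""
    with open(path, encoding="utf-8") as f:
        return [e for e in map(json.loads, f) if e["type"] == event_type]


# Hypothetical events; real Ray events carry different fields.
events = [
    {"type": "node", "id": "node-1", "state": "DEAD"},
    {"type": "actor", "id": "actor-1", "state": "DEAD"},
    {"type": "task", "id": "task-1", "state": "FINISHED"},
]

store = os.path.join(tempfile.mkdtemp(), "events.jsonl")
live_state = {}  # what a live, in-memory dashboard would hold

for event in events:
    live_state[event["id"]] = event  # visible only while the cluster runs
    export_event(store, event)       # also streamed to the durable store

live_state.clear()  # simulate cluster shutdown: in-memory state is gone

# The exported records remain queryable after the "cluster" has terminated.
dead_actors = query_events(store, "actor")
```

After the simulated shutdown, `live_state` is empty while `query_events` still returns the actor record, which is the property the persisted dashboards depend on.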
The release directly addresses several concrete limitations of the traditional Ray dashboard: dashboard data used to be ephemeral, dead node records were retained for only 10 minutes, and only the most recent 100,000 killed actors were preserved. Those retention and scale constraints became increasingly inadequate as workloads expanded to hundreds of nodes and millions of tasks, leaving operators unable to rely on the dashboard for long‑term forensic or performance work.
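To see why a fixed cap on retained records becomes a problem at scale, consider a minimal sketch of a bounded in-memory history. The class `BoundedActorHistory` is hypothetical and does not reflect how Ray implements its limits; it only shows the general behavior of a capped buffer, using a cap of 3 in place of 100,000.

```python
from collections import deque


class BoundedActorHistory:
    """Hypothetical store that keeps only the newest `cap` dead-actor records."""

    def __init__(self, cap):
        # A deque with maxlen silently drops the oldest entry once it is full.
        self._records = deque(maxlen=cap)

    def record_death(self, actor_id):
        self._records.append(actor_id)

    def __contains__(self, actor_id):
        return actor_id in self._records

    def __len__(self):
        return len(self._records)


history = BoundedActorHistory(cap=3)
for i in range(5):
    history.record_death(f"actor-{i}")

# Only the three most recent actors are still known; actor-0 and actor-1
# were evicted without any error or warning.
oldest_retained = "actor-0" in history
newest_retained = "actor-4" in history
```

With a workload that kills actors faster than the cap allows, the earliest failures, which are often the ones that matter in a post-mortem, are the first to disappear.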
From a builder’s perspective, persisting events off‑cluster removes the need to reproduce runs or to self‑host an observability backend just to retain historical traces. Teams can run post‑mortem debugging, performance analysis and workload comparisons against a durable store. The provider positions its managed storage and query stack as the durable backend for long‑term analysis of large‑scale Ray deployments.
The Cluster and Actor Dashboards are presented as part of a unified debugging path that includes the earlier Train, Data and Task dashboards, offering a single, continuous view from application workloads down to cluster infrastructure. The design targets much larger Ray deployments, from hundreds up to thousands of nodes and millions of actors. Documentation on the dashboards and their operational features is available for teams that want implementation details.