Google introduces Flex and Priority tiers to optimize the cost and reliability of the Gemini API

Explainer

Google has introduced two new synchronous inference tiers for the Gemini API, Flex and Priority, that let developers trade off compute cost against response reliability.

Mihail Lebedev

4/25/2026, 2:29:29 PM

On April 2, 2026, Google announced on its official AI blog a major update for developers building on the Gemini API. Product Manager Lucia Locher and Engineer Hussein Hassan Harrirou introduced two new service tiers: Flex Inference and Priority Inference. The tiers give developers a single, unified interface for controlling the balance between compute cost and response reliability. The change reflects how AI systems are evolving: as applications move from simple chat formats to complex autonomous agents, engineers need a more flexible way to allocate compute resources.

Before the new service tiers, developers had to maintain two separate code paths, which split their application architecture. On one side were background tasks, such as large-scale data enrichment or internal model reasoning, which do not require an immediate response. On the other were interactive, user-facing tasks, including chatbots and specialized digital assistants, where reliability and response speed are critical. Supporting both kinds of workflow meant dividing systems between standard synchronous requests and the asynchronous Batch API.

Google positions the Flex Inference tier as a cost-optimized option for latency-tolerant workloads that avoids the overhead typically associated with batch processing. Its key economic advantage is a 50 percent cost reduction compared to the standard API. The discount comes from lowering the criticality of the request, which in practice means a less reliable immediate response and some added processing latency.
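To make the discount concrete, here is a minimal sketch of the arithmetic. Only the 50 percent figure comes from the announcement; the function name and the sample spend are hypothetical, and real rates should be taken from the official pricing documentation.

```python
from decimal import Decimal

# Discount stated in the announcement: Flex costs 50% less than the standard API.
FLEX_DISCOUNT = Decimal("0.50")


def flex_cost(standard_cost: Decimal, discount: Decimal = FLEX_DISCOUNT) -> Decimal:
    """Return the estimated Flex cost for a workload billed at `standard_cost`."""
    if not (Decimal("0") <= discount <= Decimal("1")):
        raise ValueError("discount must be between 0 and 1")
    return standard_cost * (Decimal("1") - discount)


# Hypothetical example: a background job that would cost 240.00 at standard rates.
standard = Decimal("240.00")
print(f"standard: {standard}, flex estimate: {flex_cost(standard)}")
```

`Decimal` is used instead of `float` so that currency amounts are not affected by binary rounding.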

The Flex Inference tier suits a wide range of tasks where response time is not critical for the end user. Google engineers highlight background updates for customer relationship management systems and large-scale research simulations as ideal use cases. The tier also fits complex autonomous workflows in which a generative model gathers information, reviews data, or performs deep analysis in the background before delivering the final result. The Flex tier will be available on all paid plans and is supported for requests to the GenerateContent and Interactions APIs.

For the most critical applications, Google introduced the Priority Inference tier, which provides the highest level of service assurance at a premium price. In this mode, requests are assigned the highest criticality, which helps ensure they are processed without interruption and are not preempted even during peak platform load. One of the tier's most important technical features is a graceful downgrade mechanism. If a client's traffic suddenly exceeds the limits set for the Priority tier, the excess requests are not rejected with an error; they are automatically served at the standard tier instead.
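The overflow behaviour described above can be illustrated with a toy model. This is a sketch of the idea only, not Google's implementation: the `PriorityRouter` class, the per-window limit, and the tier strings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PriorityRouter:
    """Toy model of Priority-tier overflow; not Google's implementation."""

    priority_limit: int  # hypothetical cap on priority requests per window
    used: int = 0        # priority requests already served in this window

    def route(self, requested_tier: str) -> str:
        """Return the tier that actually serves the request."""
        if requested_tier != "priority":
            return requested_tier
        if self.used < self.priority_limit:
            self.used += 1
            return "priority"
        # Over the limit: downgrade to standard instead of rejecting the request.
        return "standard"


router = PriorityRouter(priority_limit=2)
served = [router.route("priority") for _ in range(4)]
print(served)  # ['priority', 'priority', 'standard', 'standard']
```

The point of the model is that the caller never sees an error for excess traffic; only the serving tier changes.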

The Priority Inference tier targets scenarios where any delay could hurt the customer experience or disrupt business logic. The official blog lists real-time customer support bots, live content moderation pipelines, and other time-sensitive requests as ideal use cases. The API is also transparent about routing: the response always indicates which tier actually served a given request, so developers can see real performance and track billing precisely. To use either tier, it is sufficient to set the `service_tier` parameter accordingly.
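As a rough illustration of how a client might set `service_tier` and then check which tier served the request, consider the sketch below. It makes no network calls. The tier strings, the placement of `service_tier` in the request body, and the `served_tier` response field are assumptions made for illustration; the official Gemini API documentation defines the real names and values.

```python
import json

# Assumed tier names; the accepted values for `service_tier` should be
# confirmed in the official API documentation.
KNOWN_TIERS = {"flex", "standard", "priority"}


def build_request(prompt: str, service_tier: str) -> dict:
    """Build a hypothetical request body carrying the `service_tier` parameter."""
    if service_tier not in KNOWN_TIERS:
        raise ValueError(f"unknown tier: {service_tier!r}")
    return {
        "contents": [{"parts": [{"text": prompt}]}],
        "service_tier": service_tier,  # placement in the body is an assumption
    }


def was_downgraded(requested: str, response: dict) -> bool:
    """Compare the requested tier with the tier the response reports.

    `served_tier` is a hypothetical field name standing in for whatever the
    real response uses to report the serving tier.
    """
    return response.get("served_tier") != requested


request = build_request("Summarize this support ticket.", "priority")
print(json.dumps(request, indent=2))

# Stand-in for a real response that was served at the standard tier.
fake_response = {"served_tier": "standard"}
print(was_downgraded("priority", fake_response))
```

Logging the result of a check like `was_downgraded` alongside each request is one way to reconcile expected and actual billing.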

Although the Google AI Blog announcement details the architectural and functional advantages of the new synchronous tiers, it does not give exact figures for the Priority Inference premium or for base Standard API rates. For that information, the authors direct developers to the official Gemini API documentation, which provides a complete pricing breakdown for each tier. For engineers who want to try the new functionality right away, Google also provides a cookbook with ready-to-run code examples.

Sources

Google AI Blog · 4/2/2026
