Early Fusion Architecture: how Falcon Perception changes the approach to computer vision

Background

The TII research group has introduced Falcon Perception and Falcon OCR, two compact models that process text and images in a single transformer with hybrid attention.

Anna Sokolova

4/25/2026, 12:47:17 PM


On April 1, 2026, a detailed technical report on new computer vision models was published on the Hugging Face blog by the tiiuae account, which represents the Technology Innovation Institute (TII) research group. The key release is Falcon Perception, a compact early-fusion transformer with 0.6 billion parameters, designed for spatial grounding and open-vocabulary segmentation driven by natural language queries. Released alongside it is Falcon OCR, a model with 0.3 billion parameters that the report describes as having the highest throughput among open-source optical character recognition systems.

These architectures respond to fundamental problems in existing visual perception systems, which are traditionally built as multi-component pipelines. Such systems typically combine a frozen vision backbone for feature extraction, a separate decoding stage that integrates language data, and third-party algorithms for final matching. This approach has proven relatively effective in architectures like SAM 3, but it entails significant engineering compromises.

In contrast to traditional pipelines, the Falcon Perception architecture relies entirely on a single autoregressive transformer that processes visual and textual data within a shared parameter space. Instead of separating image encoding from language generation, the model treats raw image patches, text queries, and task tokens as one unified sequence from the very first layer. This shift allowed the developers to drop heavy add-on modules: the dense prediction task is solved using only masking mechanisms and a lightweight output interface.

Because the inputs differ fundamentally in structure, the developers needed a special hybrid attention mechanism to process them jointly. Image pixels are two-dimensional and require bidirectional context for accurate analysis, whereas text prediction operates strictly sequentially. The hybrid mask resolves this conflict: image tokens attend bidirectionally to all other visual tokens, forming a global context analogous to that of classic visual encoders.
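The hybrid mask described above can be illustrated with a small sketch. It is not taken from the report: the function name, the token layout (image tokens first, then text and task tokens), and the rule that text and task tokens attend causally to everything before them are assumptions made for illustration.

```python
def hybrid_attention_mask(num_image_tokens: int, num_text_tokens: int) -> list[list[bool]]:
    """Return mask[q][k] == True when query token q may attend to key token k.

    Assumed layout (hypothetical): image tokens come first, followed by
    text/task tokens.
    """
    total = num_image_tokens + num_text_tokens
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            if q < num_image_tokens:
                # Image query: bidirectional attention over all image tokens.
                mask[q][k] = k < num_image_tokens
            else:
                # Text/task query (assumption): causal attention over itself
                # and every earlier token, including the whole image prefix.
                mask[q][k] = k <= q
    return mask


# Print the mask for 3 image tokens and 2 text tokens ("x" = may attend).
for row in hybrid_attention_mask(3, 2):
    print("".join("x" if allowed else "." for allowed in row))
# xxx..
# xxx..
# xxx..
# xxxx.
# xxxxx
```

The top-left block is fully filled (image tokens see each other in both directions), while the rows for text tokens form the usual lower-triangular causal pattern.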

To address the dense perception problem, where an image can contain anywhere from zero to several hundred objects, the researchers developed a structured Chain-of-Perception interface. Because generating high-resolution masks step by step would require unacceptably large computational resources, identifying each instance is decomposed into three sequential steps. First, the network generates a coordinate token that defines the geometric center of the object and resolves ambiguity about which target is meant. Immediately afterwards, it predicts a size token that defines the spatial extent of the element. Finally, a segmentation token is emitted, whose hidden state is used to produce the object's mask.
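One way to picture the Chain-of-Perception output is as a flat list of steps per object. The sketch below is illustrative only: the token strings `<coord>`, `<size>`, `<seg>`, and `<eos>`, the `Instance` class, and the normalized coordinate convention are hypothetical, and the third step is assumed to be the segmentation token handled by the segmentation head described in the next paragraph.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    cx: float  # normalized center x in [0, 1] (assumed convention)
    cy: float  # normalized center y in [0, 1]
    w: float   # normalized width
    h: float   # normalized height


def chain_of_perception(instances: list[Instance]) -> list[tuple[str, tuple[float, ...]]]:
    """Flatten objects into (token, payload) steps: coord -> size -> seg."""
    steps: list[tuple[str, tuple[float, ...]]] = []
    for obj in instances:
        steps.append(("<coord>", (obj.cx, obj.cy)))  # step 1: where is it?
        steps.append(("<size>", (obj.w, obj.h)))     # step 2: how big is it?
        steps.append(("<seg>", ()))                  # step 3: mask is decoded from this token
    steps.append(("<eos>", ()))                      # end of the object list
    return steps


# Two objects yield six steps plus the terminator; an empty image yields only <eos>.
for token, payload in chain_of_perception([Instance(0.3, 0.4, 0.2, 0.1),
                                           Instance(0.7, 0.5, 0.1, 0.3)]):
    print(token, payload)
```

Under this layout the sequence length grows by a constant three tokens per object, regardless of mask resolution.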

Final predictions are produced by specialized output heads that add minimal computational overhead to the base transformer. The coordinate and size heads use Fourier feature encoding, projecting continuous values through a random Gaussian matrix into a high-dimensional sinusoidal space. This step helps overcome the spectral bias of neural networks and gives more precise localization than a simple discrete distribution. The decoded geometric parameters are fed back into the shared sequence for refinement, after which the segmentation head computes the dot product between the token's hidden state and upsampled visual features to create the final binary mask.
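The two operations in this paragraph can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the report's implementation: all function names, the frequency scale, the number of frequencies, and the zero threshold on the dot product are hypothetical choices.

```python
import math
import random


def make_projection(in_dim: int, num_frequencies: int,
                    scale: float = 10.0, seed: int = 0) -> list[list[float]]:
    """Sample a fixed random Gaussian matrix B of shape (num_frequencies, in_dim)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, scale) for _ in range(in_dim)]
            for _ in range(num_frequencies)]


def fourier_features(x: list[float], projection: list[list[float]]) -> list[float]:
    """Map a continuous vector x to [sin(2*pi*B*x), cos(2*pi*B*x)]."""
    angles = [2.0 * math.pi * sum(b * v for b, v in zip(row, x))
              for row in projection]
    return [math.sin(a) for a in angles] + [math.cos(a) for a in angles]


def mask_from_dot_product(seg_state: list[float],
                          pixel_features: list[list[list[float]]]) -> list[list[int]]:
    """Binary mask: 1 where dot(seg_state, pixel feature) > 0 (assumed threshold)."""
    return [[1 if sum(s * f for s, f in zip(seg_state, feat)) > 0.0 else 0
             for feat in row]
            for row in pixel_features]


# Encode a normalized (x, y) center with 4 random frequencies -> 8 features.
B = make_projection(in_dim=2, num_frequencies=4)
print(len(fourier_features([0.25, 0.75], B)))

# A 1x2 "feature map" with 2-dimensional features per pixel.
print(mask_from_dot_product([1.0, -1.0], [[[2.0, 1.0], [0.0, 3.0]]]))
```

Thresholding the dot product at zero is equivalent to thresholding a sigmoid of it at 0.5, which is a common way to turn per-pixel logits into a binary mask.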

The practical effectiveness of these solutions is supported by results on the SA-Co dataset, where Falcon Perception achieved a Macro-F1 score of 68.0, significantly surpassing the SAM 3 model's result of 62.3. The technical report openly notes the main current limitation of the new architecture: a lag in object presence calibration, where the Matthews Correlation Coefficient (MCC) was 0.64 compared to 0.82 for SAM 3. To support deeper analysis of such gaps, the developers released an open diagnostic benchmark, PBench.
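For readers unfamiliar with MCC, the standard formula over a binary confusion matrix is shown below. The counts in the example are invented purely to show how a model that over-predicts object presence is penalized; they are not data from the report.

```python
import math


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews Correlation Coefficient for a binary confusion matrix.

    Returns a value in [-1, 1]; by convention 0.0 when the denominator is zero.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical presence predictions: many false positives on absent objects.
print(round(matthews_corrcoef(tp=90, tn=60, fp=40, fn=10), 2))  # 0.52
```

Unlike plain accuracy, MCC uses all four cells of the confusion matrix, so frequent false "object present" answers lower the score even when most true objects are found.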

Sources

Hugging Face Blog · 4/1/2026
