A Dharma‑AI team report published May 22, 2026 shows that a small share of requests can dominate production OCR costs: in real‑world PDF OCR runs with the DharmaOCR model, fewer than three percent of pages consumed almost half of the total wall‑clock time across batched inference. The disproportionate runtime came from requests that never emitted an end‑of‑sequence (EOS) token and instead looped until a hard max‑tokens guard cut them off, creating a clear operational bottleneck.
The slow requests shared a consistent signature: they hit the configured token maximum and ended in an n‑gram repetition. Rather than terminating, each request repeated a token or short fragment over and over until the system’s maximum‑token limit stopped generation. That repetition made individual requests run far longer than typical pages and skewed overall throughput measurements.
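The signature described above (token limit reached, repeating n‑gram at the tail) can be checked mechanically. The following is a minimal sketch, not the report's own tooling: the function names, the `max_period` and `min_repeats` thresholds, and the representation of an output as a list of token IDs are all assumptions made for illustration.

```python
from typing import Optional, Sequence


def tail_repetition_period(
    tokens: Sequence[int],
    max_period: int = 8,    # assumed: longest repeating n-gram to look for
    min_repeats: int = 4,   # assumed: repeats required before flagging
) -> Optional[int]:
    """Return the shortest period p such that the last p * min_repeats
    tokens are one p-token block repeated back to back, else None."""
    for period in range(1, max_period + 1):
        span = period * min_repeats
        if len(tokens) < span:
            break  # not enough tokens to test this or any longer period
        tail = tokens[-span:]
        if all(tail[i] == tail[i % period] for i in range(span)):
            return period
    return None


def is_degenerate(tokens: Sequence[int], max_tokens: int, **kwargs) -> bool:
    """Flag an output that reached the token limit AND ends in a loop."""
    hit_limit = len(tokens) >= max_tokens
    return hit_limit and tail_repetition_period(tokens, **kwargs) is not None
```

Requiring both conditions is a deliberate choice in this sketch: a long but healthy page can reach the limit, and a healthy page can contain short repeated runs, so either test alone would over‑flag.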
To diagnose the failure mode, the team instrumented inference logs and plotted each request’s start and end times, visually highlighting degenerate requests. Those visualizations made the imbalance obvious and helped isolate outliers. The authors then re‑ran the experiment on two additional datasets and observed the same qualitative shape, varying in intensity but reproducible, confirming the phenomenon was not a one‑off noise artifact.
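A rough sketch of that kind of log analysis follows. The `RequestLog` record and its field names are hypothetical (the report's log schema is not given here), and summed per‑request duration is used as a simple proxy for wall‑clock cost, which is an assumption: in batched inference, requests overlap in time.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RequestLog:
    # Hypothetical log record; the field names are illustrative only.
    request_id: str
    start_s: float
    end_s: float
    hit_max_tokens: bool


def timeline_rows(logs: List[RequestLog]) -> List[Tuple[str, float, float, bool]]:
    """Rows of (id, start offset, duration, flagged), sorted by start time,
    ready to draw as horizontal bars with flagged requests highlighted."""
    t0 = min(r.start_s for r in logs)
    rows = [
        (r.request_id, r.start_s - t0, r.end_s - r.start_s, r.hit_max_tokens)
        for r in logs
    ]
    return sorted(rows, key=lambda row: row[1])


def degenerate_shares(logs: List[RequestLog]) -> Tuple[float, float]:
    """Return (share of requests flagged, share of summed duration flagged)."""
    total_time = sum(r.end_s - r.start_s for r in logs)
    flagged = [r for r in logs if r.hit_max_tokens]
    flagged_time = sum(r.end_s - r.start_s for r in flagged)
    return len(flagged) / len(logs), flagged_time / total_time
```

With four requests where three take 1 s each and one flagged request takes 3 s, `degenerate_shares` returns `(0.25, 0.5)`: a quarter of the requests account for half of the summed time, the same kind of imbalance the report describes at a much smaller request share.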
The report situates this behavior within the broader literature on text degeneration: earlier work such as Holtzman et al. (2020) described self‑reinforcing repetition in autoregressive generation, and the team cites Dong et al. (2026) for measurements showing substantial differences in output length between healthy and degenerate runs. In these loops a single token or short sequence comes to dominate the conditional distribution and produces indefinite repetition.
Dharma‑AI tested common inference‑level mitigations (raising repetition penalties, lowering temperature, switching decoders, and adding streaming abort checks that kill a request once repetition begins) and found they reduce symptom severity. The authors emphasize, however, that such tuning provides pragmatic operational safeguards rather than a full solution, since it treats symptoms at decoding time without addressing why models fall into loops.
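One way a streaming abort check could look is sketched below. This is an illustration under stated assumptions, not the team's implementation: `stream_with_abort` is a hypothetical wrapper around any iterable of token IDs, and the `max_period` and `min_repeats` thresholds are arbitrary defaults that would need tuning, since legitimate OCR output (dotted leaders, repeated table cells) can also repeat.

```python
from collections import deque
from typing import Iterable, Iterator


def stream_with_abort(
    token_stream: Iterable[int],
    max_period: int = 8,    # assumed: longest loop length to detect
    min_repeats: int = 4,   # assumed: repeats tolerated before aborting
) -> Iterator[int]:
    """Yield tokens from token_stream, stopping early once the most recent
    tokens consist of one short block repeated min_repeats times."""
    window = deque(maxlen=max_period * min_repeats)
    for token in token_stream:
        window.append(token)
        yield token
        recent = list(window)
        for period in range(1, max_period + 1):
            span = period * min_repeats
            if len(recent) < span:
                break  # too few tokens for this or any longer period
            tail = recent[-span:]
            if all(tail[i] == tail[i % period] for i in range(span)):
                return  # loop detected: stop consuming the stream
```

Because the wrapper only stops pulling from the stream, the caller still has to cancel the underlying generation request for the compute savings to materialize. As the report notes, a check like this limits the cost of a loop after it starts; it does not stop the model from entering one.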
The team argues the root cause is structural and tied to training methodology: models trained with maximum‑likelihood estimation (MLE) optimize next‑token probability token‑by‑token and never observe the full generated sequence during training. That objective yields strong continuation ability but does not penalize incomplete, repeating outputs, which has measurable consequences for throughput and inference cost in production systems.
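For reference, the token‑level objective this paragraph describes can be written in its standard form. The notation is an assumption added here, not taken from the report: x is the input page, y_1 … y_T is the reference transcription (with EOS as its final token), and θ are the model parameters.

```latex
\mathcal{L}_{\mathrm{MLE}}(\theta)
  = -\sum_{t=1}^{T} \log p_\theta\left(y_t \mid y_{<t},\, x\right)
```

Every term conditions on the reference prefix y_{<t} rather than on tokens the model sampled itself, so the loss never scores a sequence the model generated on its own. That is the gap the team points to: nothing in this sum penalizes a free‑running generation that drifts into a loop.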
The report accompanies a public model repository and demo space for DharmaOCR, offering reproducible artifacts for others to validate the findings. Authors associated with the work include Erick Lachmann, Pimenta de Freitas Cardoso, and Gabriel Pimenta. The team frames the results as a signal that fixes grounded in the training distribution, not just decoder tweaks, are needed to prevent disproportionate inference costs in OCR deployments.