Trajectory Open‑Sources C‑LoRA, Reporting 2.81× Experiment‑Throughput Gain for Continual RL


5/31/2026, 8:47:56 AM



Trajectory, in collaboration with UC Berkeley Sky Lab and Anyscale, published a field report and open‑sourced Continuous Multi‑LoRA (C‑LoRA), a concurrent multi‑LoRA training stack for continual reinforcement learning. The team reports a 2.81× end‑to‑end experiment‑throughput improvement over a single‑tenant baseline, with no observed regression in training rewards. For organizations running many parallel RL experiments, that means shorter experiment cycles and better GPU utilization.

C‑LoRA maps each experiment to its own LoRA adapter on an always‑hot inference engine. The stack keeps adapters resident in inference memory using vLLM so decoding can interleave tokens from different adapters within the same batch. Each tenant has an AdapterStore that holds LoRA parameters, FP32 master weights, optimizer moments and gradient buffers; during training the system swaps a single tenant’s state onto the GPU for a forward/backward step while the scheduler continues decoding for other tenants.
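The per-tenant state and the one-tenant-at-a-time training rule described above can be pictured with a small sketch. This is not the SkyRL implementation: the field names, the `TrainSlot` class, and the `swap_in` method are hypothetical, and plain Python lists stand in for GPU tensors.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class AdapterStore:
    """Hypothetical per-tenant container; field names are illustrative only."""
    tenant_id: str
    lora_params: List[float] = field(default_factory=list)
    master_weights_fp32: List[float] = field(default_factory=list)
    optimizer_moments: Dict[str, List[float]] = field(default_factory=dict)
    grad_buffers: List[float] = field(default_factory=list)
    on_gpu: bool = False  # whether the training state is currently on the GPU


class TrainSlot:
    """Models the rule that only one tenant's training state is on the GPU at a time."""

    def __init__(self) -> None:
        self.resident: Optional[AdapterStore] = None

    def swap_in(self, store: AdapterStore) -> AdapterStore:
        # Evict the previous tenant's training state before loading the next one.
        if self.resident is not None and self.resident is not store:
            self.resident.on_gpu = False
        store.on_gpu = True
        self.resident = store
        return store


slot = TrainSlot()
exp_a = AdapterStore(tenant_id="exp-a")
exp_b = AdapterStore(tenant_id="exp-b")
slot.swap_in(exp_a)  # exp-a takes a forward/backward step
slot.swap_in(exp_b)  # exp-b replaces it; decoding for both is assumed to continue elsewhere
```

In this picture the inference adapters stay resident the whole time; only the heavier training state (master weights, moments, gradients) moves in and out.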

Inference concurrency is enabled by an SGMV decode kernel that fuses per‑adapter matrix‑vector work into one GPU launch per decode step, letting vLLM mix tenants in decode batches and avoid per‑job engine reloads. Because training still runs single‑adapter on the GPU, immediate throughput gains come from eliminating engine cold starts and decoding stalls rather than from parallelizing optimizer steps across adapters. The team reports no reward degradation as throughput increases.
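To make the kernel's job concrete, here is a slow reference sketch of the math that a fused multi-adapter decode step has to produce: each token in the batch is multiplied by the LoRA matrices of its own adapter. The function names and the dictionary layout are assumptions for illustration; the real kernel does this work in a single GPU launch rather than a Python loop, and its actual interface is not described in the report.

```python
from typing import Dict, List

Matrix = List[List[float]]


def matvec(x: List[float], w: Matrix) -> List[float]:
    """Multiply row vector x (length n) by matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def multi_adapter_delta(
    xs: List[List[float]],
    adapter_ids: List[str],
    lora_a: Dict[str, Matrix],
    lora_b: Dict[str, Matrix],
) -> List[List[float]]:
    """Reference result: the LoRA update (x @ A) @ B for each token, using that token's adapter."""
    out = []
    for x, aid in zip(xs, adapter_ids):
        hidden = matvec(x, lora_a[aid])          # project down to rank r
        out.append(matvec(hidden, lora_b[aid]))  # project back up to model width
    return out


# Two tokens from different tenants share one decode batch.
lora_a = {"t1": [[1.0], [0.0]], "t2": [[0.0], [1.0]]}  # width 2 -> rank 1
lora_b = {"t1": [[2.0, 0.0]], "t2": [[0.0, 3.0]]}      # rank 1 -> width 2
deltas = multi_adapter_delta([[1.0, 5.0], [1.0, 5.0]], ["t1", "t2"], lora_a, lora_b)
```

The same input vector yields different updates depending on which tenant owns the token, which is what allows one batch to serve several experiments.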

Trajectory validated C‑LoRA on a single H200 node using Qwen3‑4B‑Instruct‑2507 in a synchronous RL setup on the GSM8K benchmark framed as an agentic, tool‑use problem. The task required the model to call a Calculator tool and then a Final Answer tool; a reward of 1.0 was assigned only when the Final Answer matched the correct result. Under the chosen learning algorithm, the policy's reported accuracy rose from roughly 40% at step 0 to above 90% by step 9.
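A binary reward of this kind might look like the sketch below. The tool names `calculator` and `final_answer`, the tuple format for tool calls, and the numeric normalization are all assumptions; the report does not say how answers are compared or what a non-matching episode scores, so 0.0 is assumed here.

```python
from typing import List, Optional, Tuple


def normalize(answer: str) -> Optional[float]:
    """Parse a numeric answer, ignoring commas and dollar signs (assumed normalization)."""
    text = answer.strip().replace(",", "").replace("$", "")
    try:
        return float(text)
    except ValueError:
        return None


def reward(tool_calls: List[Tuple[str, str]], expected: str) -> float:
    """Return 1.0 only if the last final_answer call matches the expected result."""
    finals = [arg for name, arg in tool_calls if name == "final_answer"]
    if not finals:
        return 0.0  # the episode never produced a final answer
    got = normalize(finals[-1])
    want = normalize(expected)
    if got is None or want is None:
        return 0.0
    return 1.0 if abs(got - want) < 1e-6 else 0.0


episode = [("calculator", "6 * 7"), ("final_answer", "42")]
score = reward(episode, "42")
```

Because the signal is all-or-nothing, intermediate Calculator calls earn nothing on their own in this sketch; only the submitted answer is scored.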

The report calls out four target inefficiencies in traditional RL stacks that C‑LoRA addresses: cold starts (checkpoint reload and runtime warm‑up that can exceed 30 minutes for large models), high memory costs for frontier models (for example, Qwen3.5‑397B can require up to eight H200 nodes), single‑tenant execution that serializes experiments, and low job utilization from trainer/inference stalls. Using LoRA reduces memory requirements by roughly an order of magnitude and enables multiplexing multiple experiment adapters on shared engines.

Trajectory has published the code in the NovaSky‑AI/SkyRL GitHub repository and positions C‑LoRA as infrastructure aimed at production continual learning. The current tradeoff is clear: substantially higher inference concurrency and faster end‑to‑end experiment throughput, while per‑adapter training on the GPU remains sequential; extending concurrent gains into the optimizer path will require further engineering.

Sources

MarkTechPost AI · 5/31/2026
