The Quest for Efficiency in Large Language Models: How Mixture-of-Recursions Delivers 2x Faster Inference
Large language models (LLMs) have become increasingly popular for their ability to process vast amounts of text data. However, as these models scale, their memory footprints and computational requirements often become a significant challenge, making them difficult to deploy for organizations that don't operate hyperscale data centers.
To address this scaling challenge, researchers at KAIST AI and Mila have introduced a new Transformer architecture called Mixture-of-Recursions (MoR), which improves model accuracy and delivers higher throughput compared with vanilla Transformers. MoR combines parameter sharing with adaptive computation to tackle the high computational demands of LLMs.
These scaling challenges are directly tied to the models' ever-increasing size, which makes both training and deployment expensive. Efforts to improve LLM efficiency have focused mainly on two methods: parameter sharing and adaptive computation. Parameter sharing techniques reduce the total number of unique parameters by reusing weights across different parts of the model, thereby reducing the overall computational complexity. Adaptive computation methods, by contrast, adjust how much processing each input receives, for example by letting simple tokens exit the network early instead of passing through every layer.
MoR works by partitioning the model into a few “recursion blocks,” each with a shared pool of parameters. This design allows for more computation without increasing the model’s size. The framework enhances this recursive approach with two key components: a lightweight router and a more efficient key-value (KV) caching strategy.
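To make the recursion-block idea concrete, here is a minimal PyTorch sketch of a single block whose weights are reused at every recursion step. The class and argument names (SharedRecursionBlock, n_recursions) are illustrative assumptions for exposition, not taken from the paper's released code.

```python
# Minimal sketch of parameter sharing via a recursion block:
# the same Transformer block is applied repeatedly, so effective
# depth (compute) grows while the parameter count stays fixed.
import torch
import torch.nn as nn

class SharedRecursionBlock(nn.Module):
    """One Transformer block whose weights are reused at every recursion step."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True
        )

    def forward(self, x: torch.Tensor, n_recursions: int) -> torch.Tensor:
        # Reapplying the same block adds computation without adding parameters.
        for _ in range(n_recursions):
            x = self.block(x)
        return x

x = torch.randn(2, 16, 512)                    # (batch, tokens, d_model)
model = SharedRecursionBlock(d_model=512, n_heads=8)
print(model(x, n_recursions=3).shape)          # torch.Size([2, 16, 512])
```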
The first component, the router, intelligently assigns a specific recursion depth to each token. This concept is similar to the routing mechanism in Mixture-of-Experts (MoE) models, where a router directs tokens to specialized expert networks. In MoR, however, the “experts” are the different recursion depths, allowing the model to choose how much computation to apply to each token dynamically.
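The sketch below illustrates the routing idea under a simplifying assumption: at each recursion step, a lightweight linear scorer keeps only the top-scoring fraction of still-active tokens for further recursion. The DepthRouter class and keep_ratio parameter are hypothetical names, and the exact selection rule used in MoR may differ.

```python
# Hedged sketch of a lightweight token router that decides which
# tokens receive another recursion step.
import torch
import torch.nn as nn

class DepthRouter(nn.Module):
    """Scores each token and keeps only the highest-scoring fraction active."""
    def __init__(self, d_model: int, keep_ratio: float = 0.5):
        super().__init__()
        self.scorer = nn.Linear(d_model, 1)   # one scalar score per token
        self.keep_ratio = keep_ratio

    def forward(self, x: torch.Tensor, active: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, d_model); active: (batch, tokens) boolean mask
        scores = self.scorer(x).squeeze(-1)                  # (batch, tokens)
        scores = scores.masked_fill(~active, float("-inf"))
        # Keep roughly keep_ratio of the currently active tokens per sequence.
        k = max(1, int(self.keep_ratio * int(active.sum(dim=-1).min())))
        topk = scores.topk(k, dim=-1).indices                # (batch, k)
        new_active = torch.zeros_like(active)
        rows = torch.arange(x.size(0)).unsqueeze(-1)         # (batch, 1)
        new_active[rows, topk] = True
        return new_active                                    # tokens that recurse again

router = DepthRouter(d_model=512)
x = torch.randn(2, 16, 512)
active = torch.ones(2, 16, dtype=torch.bool)
print(router(x, active).sum(dim=-1))   # tensor([8, 8]): half the tokens continue
```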
The second component, the KV caching strategy, selectively stores and retrieves key-value pairs only for the tokens that are still active at a given recursion step. This targeted caching reduces memory traffic and improves throughput without requiring complex post-training modifications.
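A rough sketch of this caching idea, continuing the same illustrative code: only the key-value rows of still-active tokens are written to the cache for a given recursion depth, so deeper steps read a physically smaller cache. The function name and cache layout are assumptions for exposition, not the paper's implementation.

```python
# Illustrative recursion-wise KV caching: store K/V only for tokens
# that are still active at this recursion depth.
import torch

def cache_active_kv(kv_cache: dict, depth: int,
                    keys: torch.Tensor, values: torch.Tensor,
                    active: torch.Tensor) -> None:
    """keys/values: (tokens, d_head); active: (tokens,) boolean mask."""
    # Keep only rows for active tokens, plus their original positions,
    # so attention at this depth touches a smaller cache than the full sequence.
    positions = active.nonzero(as_tuple=True)[0]
    kv_cache[depth] = (keys[positions], values[positions], positions)

kv_cache = {}
keys, values = torch.randn(16, 64), torch.randn(16, 64)
active = torch.zeros(16, dtype=torch.bool)
active[:8] = True                          # only 8 of 16 tokens recurse to this depth
cache_active_kv(kv_cache, depth=2, keys=keys, values=values, active=active)
print(kv_cache[2][0].shape)                # torch.Size([8, 64]): half the memory traffic
```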
To test their framework, researchers trained MoR models ranging from 135 million to 1.7 billion parameters and compared them against vanilla and standard recursive baseline models on validation loss and few-shot accuracy benchmarks. The results demonstrate significant gains. When given an equal training compute budget, an MoR model achieved higher average few-shot accuracy (43.1% vs. 42.3%) than a vanilla baseline despite using nearly 50% fewer parameters.
MoR’s design also proves to be scalable. While it slightly underperformed the vanilla model at the smallest 135M parameter scale, the gap closed rapidly as the model size increased. For models with over 360M parameters, MoR matched or exceeded the performance of standard Transformers, especially on lower compute budgets.
Furthermore, MoR’s design dramatically boosts inference throughput. One MoR configuration achieved a 2.06x speedup over the vanilla baseline. This could translate into significant operational cost savings for companies operating at scale.
For enterprises looking to adopt MoR without massive upfront investment, uptraining existing open-source models is a more cost-effective approach than pretraining from scratch, according to Sangmin Bae, co-author of the paper and a PhD student at KAIST. The "optimal settings will highly depend on the specific deployment setting," Bae explained.
MoR's modality-agnostic design opens the door to efficiency gains beyond text: by dynamically adjusting the processing depth for each segment of a video or audio stream, it could deliver similar savings for other complex data types. This could unlock even greater cost savings and performance improvements, bringing the power of large-scale AI to a wider range of enterprise applications.
In conclusion, Mixture-of-Recursions offers an effective path towards achieving large-model capabilities with significantly reduced computational and memory overhead, making it an attractive solution for enterprises looking to adopt cutting-edge AI technologies.