Nvidia Develops Technique that Reduces LLM Costs by 8x While Preserving Accuracy

TL;DR

Nvidia introduces a new technique called [Dynamic Memory Sparsification](https://arxiv.org/abs/2506.05345) (DMS), which decreases the memory costs of large language models by up to eight times while preserving, or even improving, their reasoning accuracy.

venturebeat.com • February 12, 2026 • 3 min read

Nvidia introduces a new technique called Dynamic Memory Sparsification (DMS), which decreases the memory costs of large language models by up to **eight times**. The technique lets models maintain, or even improve, their reasoning performance while the memory they consume during inference drops sharply.

With DMS, the key-value (KV) cache, which stores intermediate attention data while the model reasons, is compressed efficiently. Earlier research struggled to shrink this cache without compromising the model's intelligence, but Nvidia's approach discards significant parts of it without loss of accuracy.

Challenges of Reasoning in Language Models

Language models improve their performance on complex tasks by generating "chain-of-thought" tokens that detail their reasoning. However, this process increases computational demand due to the linear growth of the KV cache, which can become a significant obstacle in practical applications.
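
To make the scale of the problem concrete, a rough back-of-the-envelope calculation shows how the KV cache grows linearly with the number of generated tokens. The model dimensions below are illustrative assumptions, not figures from the article or from Nvidia's paper:

```python
# Rough KV-cache sizing for a decoder-only transformer (illustrative numbers only).
# Each generated token stores one key and one value vector per layer and KV head.

def kv_cache_bytes(seq_len: int,
                   n_layers: int = 64,
                   n_kv_heads: int = 8,
                   head_dim: int = 128,
                   dtype_bytes: int = 2) -> int:  # 2 bytes = fp16/bf16
    per_token = n_layers * n_kv_heads * head_dim * 2 * dtype_bytes  # keys + values
    return seq_len * per_token

for tokens in (1_000, 8_000, 32_000):
    gb = kv_cache_bytes(tokens) / 1e9
    print(f"{tokens:>6} chain-of-thought tokens -> {gb:5.2f} GB of KV cache per sequence")
```

At that rate the cache quickly dominates GPU memory, which is why a reduction of up to eight times changes how many reasoning threads a single accelerator can serve.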

The increased memory usage on GPUs adds **latency** and limits the number of users served simultaneously. Piotr Nawrot, an engineer at Nvidia, emphasizes: "The issue is not just about the amount of hardware, but also whether your infrastructure is processing 100 or 800 reasoning threads at the same cost."

Solving this problem is not just a technical issue but also an economic one, as rising operational costs fall directly on the companies running these models. Previous methods that used fixed rules, such as a "sliding window" that keeps only the most recent tokens, often threw away crucial information.
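
For contrast, a fixed sliding-window rule can be written in a few lines; the toy sketch below (hypothetical token IDs, not Nvidia's method) shows how it silently drops everything older than the window, including tokens that may carry essential context:

```python
from collections import deque

def sliding_window_keep(token_ids, window=4):
    """Keep only the `window` most recent tokens, regardless of importance."""
    cache = deque(maxlen=window)  # oldest entries are evicted automatically
    for tok in token_ids:
        cache.append(tok)
    return list(cache)

# Tokens 0-5 are evicted even if token 0 held the key premise of the problem.
print(sliding_window_keep(list(range(10)), window=4))  # -> [6, 7, 8, 9]
```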

How Dynamic Memory Sparsification Works

The DMS technique modifies existing models, allowing them to manage their memory intelligently. Instead of following a rigid deletion rule, DMS trains models to identify which tokens are essential and which can be discarded.

Nawrot explains: "It’s not just a guessing game about importance; the model learns a policy that explicitly preserves the final output distribution." DMS adapts pre-trained models, such as Llama 3 or Qwen 3, allowing them to become self-compressible without needing training from scratch.
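
The article does not publish Nvidia's training code, so the sketch below is only a loose illustration of what a learned eviction policy looks like in general: a small scorer assigns each cached token a keep probability instead of relying on recency. The class, dimensions, and threshold are hypothetical and are not DMS itself:

```python
import torch

# Hypothetical illustration of a *learned* per-token eviction decision.
# This is not Nvidia's DMS implementation; DMS trains the model itself to make
# these choices so that the final output distribution is preserved.

class EvictionScorer(torch.nn.Module):
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.proj = torch.nn.Linear(hidden_dim, 1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # One "keep" probability per cached token.
        return torch.sigmoid(self.proj(hidden_states)).squeeze(-1)

hidden_dim, n_cached = 256, 10
scorer = EvictionScorer(hidden_dim)
cached_states = torch.randn(n_cached, hidden_dim)

keep_prob = scorer(cached_states)   # learned importance, not recency
keep_mask = keep_prob > 0.5         # tokens below threshold become eviction candidates
print(f"kept {int(keep_mask.sum())} of {n_cached} cached tokens")
```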

An important feature of DMS is the "delayed eviction" mechanism, which allows tokens deemed unimportant to remain accessible for a time before being deleted, ensuring relevant information is integrated before elimination.
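
The delayed-eviction idea can be pictured as a two-stage buffer: a token marked for eviction first enters a short grace window where it remains readable, and only leaves the cache after a set number of decoding steps. The data structure and delay length below are hypothetical, meant only to illustrate the mechanism the article describes:

```python
class DelayedEvictionCache:
    """Toy cache: tokens marked for eviction stay readable for `delay` steps."""

    def __init__(self, delay: int = 3):
        self.delay = delay
        self.live = {}     # token_id -> value, fully retained
        self.pending = {}  # token_id -> (value, steps_remaining)

    def add(self, token_id, value):
        self.live[token_id] = value

    def mark_for_eviction(self, token_id):
        if token_id in self.live:
            self.pending[token_id] = (self.live.pop(token_id), self.delay)

    def get(self, token_id):
        # Pending tokens remain accessible until their delay runs out.
        if token_id in self.live:
            return self.live[token_id]
        if token_id in self.pending:
            return self.pending[token_id][0]
        return None

    def step(self):
        # Called once per decoding step; finally drops expired tokens.
        expired = []
        for tid, (val, left) in self.pending.items():
            if left <= 1:
                expired.append(tid)
            else:
                self.pending[tid] = (val, left - 1)
        for tid in expired:
            del self.pending[tid]

cache = DelayedEvictionCache(delay=2)
cache.add("tok_7", "key/value tensors")
cache.mark_for_eviction("tok_7")
print(cache.get("tok_7"))  # still readable during the grace window
cache.step(); cache.step()
print(cache.get("tok_7"))  # None: eviction has taken effect
```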

DMS in Action

To validate the technique, Nvidia applied DMS to reasoning models like Qwen-R1 and Llama 3.2, testing them on challenging benchmarks. The results show a notable improvement in performance, without the loss of long-context understanding that cache compression usually brings.

In tests with the AIME 24 benchmark, the Qwen-R1 32B model, equipped with DMS, achieved **12.0 points** more compared to a standard model, all without increasing memory requirements. This highlights that the model can develop deeper reasoning without the usual additional cost.

These advances in efficiency also translate to hardware savings, allowing a single server to handle up to **five times more** simultaneous queries while maintaining quality. Nvidia has also added DMS to its KVPress library, keeping the implementation straightforward to adopt.
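
For readers who want to experiment with KV-cache compression, the snippet below follows the general usage pattern of the KVPress library (https://github.com/NVIDIA/kvpress), where a "press" object is handed to a text-generation pipeline. The press class, pipeline identifier, and keyword arguments here are recalled assumptions rather than verified API, and the DMS press may be exposed under a different name; check the library's README before relying on this:

```python
# Hedged sketch of KVPress-style usage; exact class and argument names may differ
# from the current library and should be verified against the KVPress README.
from transformers import pipeline
from kvpress import ExpectedAttentionPress  # stand-in; a DMS press may be named differently

pipe = pipeline(
    "kv-press-text-generation",                      # pipeline task registered by kvpress (assumed)
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",   # any supported decoder-only model
    device="cuda:0",
)

context = "Long reasoning context or document goes here..."
question = "What is the final answer?"

press = ExpectedAttentionPress(compression_ratio=0.5)  # fraction of KV cache to drop (assumed kwarg)
answer = pipe(context, question=question, press=press)["answer"]
print(answer)
```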

Future Perspectives on Memory Management

DMS represents a shift in how memory management can be built into artificial intelligence systems, and it is compatible with emerging architectures such as Multi-Head Latent Attention (MLA). Combining the two could yield even greater efficiency gains.

As companies evolve from simple chatbots to complex reasoning systems, reducing inference costs becomes a priority. Techniques like DMS are differentiators for scaling these capabilities sustainably. "We have barely scratched the surface of what’s possible," concludes Nawrot, referring to the future of DMS in expanding the boundaries of reasoning in language models.

Content selected and edited with AI assistance. Original sources referenced above.

Sources

  • venturebeat.com (primary): https://venturebeat.com/orchestration/nvidias-new-technique-cuts-llm-reasoning-costs-by-8x-without-losing-accuracy (Feb 12, 2026)
