
Reduce Your LLM Bill by 73% with Semantic Caching
TL;DR
Our growing use of large language model (LLM) APIs was driving costs up fast: the bill was increasing roughly 30% per month, largely because users asked the same questions with different phrasings. Implementing semantic caching cut our API costs by 73%.
To understand the problem, we analyzed our query logs and found that questions such as "What is your return policy?" and "Can I get a refund?" were treated as separate queries even though they produced essentially the same answer. Exact-match caching was not enough, capturing only 18% of the redundancy.
Why Exact-Match Caching Falls Short
Traditional caching uses the query text as the cache key. That works for identical queries, but users constantly rephrase their questions. An analysis of 100,000 queries showed that only 18% were exact duplicates, while 47% were semantically similar to an earlier query and 35% were genuinely new.
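For context, an exact-match cache is essentially a dictionary keyed on the (normalized) query string. The sketch below is illustrative rather than our production code; the class and helper names are hypothetical.

```python
import hashlib


class ExactMatchCache:
    """Naive cache keyed on the literal query text (after light normalization)."""

    def __init__(self):
        self._store = {}

    def _key(self, query: str) -> str:
        # Lowercase and collapse whitespace, then hash; any other phrasing misses.
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get(self, query: str):
        return self._store.get(self._key(query))

    def set(self, query: str, response: str):
        self._store[self._key(query)] = response


cache = ExactMatchCache()
cache.set("What is your return policy?", "You can return items within 30 days.")
print(cache.get("What is your return policy?"))  # hit
print(cache.get("Can I get a refund?"))          # miss, despite similar intent
```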
Each of those semantically similar questions still triggered a full LLM call, which was pure waste. Switching to a semantic cache, which matches on the meaning of a query rather than its exact phrasing, raised our cache hit rate to 67%.
Architecture of Semantic Caching
Semantic caching replaces text keys with a similarity lookup over query embeddings in a vector space. Each incoming query is embedded, and the cache is searched for stored queries whose embeddings exceed a similarity threshold.
If the incoming query is semantically close enough to a stored one, the system returns the cached response and skips the LLM call entirely. Tuning the threshold is critical: set it too high and you miss legitimate hits; set it too low and you risk serving a wrong answer.
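Below is a minimal sketch of this lookup, assuming an `embed_fn` callable (any sentence-embedding model) and cosine similarity. The class name, the linear scan, and the default threshold are illustrative assumptions, not our exact implementation.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


class SemanticCache:
    """Cache keyed on query embeddings instead of query text."""

    def __init__(self, embed_fn, threshold: float = 0.90):
        self.embed_fn = embed_fn      # any text -> vector embedding model (assumed)
        self.threshold = threshold    # minimum cosine similarity to count as a hit
        self.entries = []             # list of (embedding, cached_response)

    def get(self, query: str):
        query_vec = self.embed_fn(query)
        best_score, best_response = 0.0, None
        for vec, response in self.entries:
            score = cosine_similarity(query_vec, vec)
            if score > best_score:
                best_score, best_response = score, response
        if best_score >= self.threshold:
            return best_response      # semantic hit: skip the LLM call
        return None                   # miss: the caller invokes the LLM and stores the result

    def set(self, query: str, response: str):
        self.entries.append((self.embed_fn(query), response))
```

In production the linear scan would typically be replaced with a vector index, but the core decision, comparing the best similarity score against a threshold, stays the same.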
Threshold Tuning and Results
Different query types need different thresholds. For high-traffic FAQ-style queries, a threshold of 0.94 kept accuracy high. After testing, we settled on adaptive thresholds that vary by query type.
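One simple way to express adaptive thresholds is a per-type lookup table, as in the sketch below. Only the 0.94 FAQ value comes from our tuning; the other entries are placeholders.

```python
# Hypothetical per-query-type thresholds; only the FAQ value (0.94) comes from
# our tuning, the rest are placeholders to illustrate the structure.
SIMILARITY_THRESHOLDS = {
    "faq": 0.94,        # high-traffic, high-precision queries
    "default": 0.90,    # illustrative fallback for everything else
}


def threshold_for(query_type: str) -> float:
    return SIMILARITY_THRESHOLDS.get(query_type, SIMILARITY_THRESHOLDS["default"])


# Example: build a cache instance tuned for FAQ traffic.
# faq_cache = SemanticCache(embed_fn, threshold=threshold_for("faq"))
```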
With the tuned thresholds in place, the hit rate reached 67% and monthly LLM spend fell from $47K to $12.7K, a 73% reduction.
Challenges and Cache Invalidation Strategies
As underlying information changes, cached responses go stale. We implemented three strategies. The first two are time-based invalidation, which assigns a time-to-live (TTL) to each type of content, and event-based invalidation, which refreshes entries whenever the underlying data changes.
The third is periodic validation: we re-check cached responses on a schedule, using similarity analysis to confirm that answers remain relevant.
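The sketch below illustrates how time-based and event-based invalidation might be combined. The TTL values, content-type labels, and class names are illustrative assumptions, not our production configuration.

```python
import time


class CacheEntry:
    def __init__(self, response: str, content_type: str, ttl_seconds: int):
        self.response = response
        self.content_type = content_type              # e.g. "policy" (illustrative label)
        self.expires_at = time.time() + ttl_seconds   # TTL chosen per content type

    def is_fresh(self) -> bool:
        return time.time() < self.expires_at


class InvalidatingCache:
    def __init__(self):
        self.entries: dict[str, CacheEntry] = {}

    def set(self, key: str, response: str, content_type: str, ttl_seconds: int):
        self.entries[key] = CacheEntry(response, content_type, ttl_seconds)

    def get(self, key: str):
        entry = self.entries.get(key)
        if entry and entry.is_fresh():
            return entry.response
        # Time-based invalidation: silently drop expired entries on access.
        self.entries.pop(key, None)
        return None

    def invalidate_content_type(self, content_type: str):
        # Event-based invalidation: when the underlying data changes (e.g. a policy
        # update), drop every cached answer derived from that content type.
        stale = [k for k, e in self.entries.items() if e.content_type == content_type]
        for k in stale:
            del self.entries[k]
```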
Final Results and Recommendations
After three months in production, 0.8% of cache hits returned an incorrect answer, which was within our acceptable error budget. Alongside the cost reduction, average latency improved by 65%.
To keep the system effective, tune thresholds for each query type, maintain an active invalidation strategy, and watch for stale answers reaching users.
Sreenivasa Reddy Hulebeedu Reddy is a lead software engineer.
Content selected and edited with AI assistance.


