How to Reduce LLM API Costs by 70% with Intelligent Caching
by the InferXgate Team
LLM API costs can quickly spiral out of control as your application scales. In this post, we’ll show you how InferXgate’s intelligent caching can reduce your costs by 60-90%—without sacrificing response quality.
The Cost Problem
Let’s do some quick math. Say you’re building a customer support chatbot:
- 10,000 queries per day
- Average 500 tokens per query (roughly 250 input + 250 output)
- Using Claude 3 Sonnet at $3 per million input tokens and $15 per million output tokens

That's roughly $45 a day, or about $1,350 a month, just for one chatbot.
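If you want to sanity-check that estimate, here is the same arithmetic as a few lines of Python (the 250/250 token split is the assumption stated above):

```python
# Back-of-the-envelope cost for the chatbot above
queries_per_day = 10_000
input_tokens = 250    # assumed split of the 500-token average
output_tokens = 250
input_price = 3 / 1_000_000    # $ per input token (Claude 3 Sonnet)
output_price = 15 / 1_000_000  # $ per output token

daily_cost = queries_per_day * (input_tokens * input_price + output_tokens * output_price)
print(f"${daily_cost:,.0f}/day, ${daily_cost * 30:,.0f}/month")  # ~$45/day, ~$1,350/month
```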
But here’s the thing: many of those queries are similar or identical. “What are your business hours?” “How do I reset my password?” “What’s your return policy?”
Enter Response Caching
InferXgate sits between your application and the LLM provider, caching responses automatically:
```
App → InferXgate → Cache hit?  → Return cached response
                        ↓
                   Cache miss  → Provider → Cache response → Return
```
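Because the examples later in this post call InferXgate through a standard OpenAI-style client, the only application-side change is pointing that client at the gateway. The base URL below is an assumption for a local deployment on port 3000 (the same port the metrics endpoint uses later); adjust it for your setup:

```python
from openai import OpenAI

# Point your existing OpenAI-compatible client at the InferXgate gateway.
# The URL is an assumption for a local deployment; adjust it for yours.
client = OpenAI(
    base_url="http://localhost:3000/v1",
    api_key="YOUR_GATEWAY_OR_PROVIDER_KEY",
)

response = client.chat.completions.create(
    model="claude-sonnet-4-20250514",
    messages=[{"role": "user", "content": "What are your business hours?"}],
)
print(response.choices[0].message.content)
```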
Setting Up Caching
Enable caching in your InferXgate configuration:
```yaml
# config.yaml
cache:
  enabled: true
  ttl: 3600s            # 1 hour
  redis:
    url: "redis://localhost:6379"
```
That’s it! InferXgate will now cache identical requests automatically.
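A quick way to confirm caching is working is to send the same request twice and compare latencies; the second call should come back from the cache almost instantly. This sketch assumes the same local gateway endpoint as above:

```python
import time
from openai import OpenAI

# Same assumption as before: InferXgate running locally on port 3000
client = OpenAI(base_url="http://localhost:3000/v1", api_key="YOUR_KEY")

def timed_request() -> float:
    start = time.perf_counter()
    client.chat.completions.create(
        model="claude-sonnet-4-20250514",
        messages=[{"role": "user", "content": "What are your business hours?"}],
    )
    return time.perf_counter() - start

print(f"first call:  {timed_request():.2f}s")   # cache miss: forwarded to the provider
print(f"second call: {timed_request():.2f}s")   # identical request: served from the cache
```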
Understanding Cache Keys
By default, InferXgate creates cache keys from:
- Model name
- Messages array
- System prompt
- Temperature (if non-zero)
This means two requests with the same prompt will hit the cache, even from different users.
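Conceptually, the cache key is just a stable hash over those fields. The sketch below is purely illustrative (it is not InferXgate's actual implementation), but it shows why identical prompts from different users land on the same entry:

```python
import hashlib
import json

def cache_key(model: str, messages: list[dict], system: str | None,
              temperature: float = 0.0) -> str:
    """Illustrative only: a stable hash over the fields listed above."""
    payload = {"model": model, "messages": messages, "system": system}
    if temperature:  # included only when non-zero, matching the default behaviour
        payload["temperature"] = temperature
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

# Two different users asking the identical question produce the same key,
# so they share a single cached response.
key = cache_key(
    "claude-sonnet-4-20250514",
    [{"role": "user", "content": "What are your business hours?"}],
    system="You are a helpful assistant.",
)
```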
Customizing Cache Behavior
You can tune what’s included in cache keys:
```yaml
cache:
  key_includes:
    - model
    - messages
    - system_prompt
  key_excludes:
    - user_id      # Share cache across users
    - temperature  # If you always use temp=0
```
Semantic Caching: The Next Level
Exact-match caching is great, but what about similar queries?
- “What time do you open?”
- “What are your hours?”
- “When do you open?”
These are all asking the same thing. With semantic caching, InferXgate can recognize similarity:
```yaml
cache:
  semantic:
    enabled: true
    threshold: 0.92   # 92% similarity required
    embedding_model: "text-embedding-3-small"
```
Now similar (not just identical) queries can share cached responses.
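Under the hood, semantic caching amounts to embedding the incoming prompt and comparing it against the embeddings of prompts that already have cached responses. The following sketch is illustrative rather than InferXgate's internals; it uses the same embedding model as the config above and assumes you have an embedding API key available:

```python
from openai import OpenAI

client = OpenAI()  # used here only to compute embeddings

def embed(text: str) -> list[float]:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return resp.data[0].embedding

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(x * x for x in b) ** 0.5
    return dot / (norm_a * norm_b)

cached_prompt = "What are your business hours?"
incoming_prompt = "What time do you open?"

# With threshold 0.92, the incoming prompt reuses the cached answer only if it is
# close enough to a prompt that has already been answered.
similarity = cosine(embed(cached_prompt), embed(incoming_prompt))
print(f"similarity: {similarity:.3f}, cache hit: {similarity >= 0.92}")
```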
Real-World Results
We ran InferXgate for a month on a production support chatbot. The results:
| Metric | Before | After | Change |
|---|---|---|---|
| Monthly API Cost | $2,450 | $735 | -70% |
| Cache Hit Rate | N/A | 72% | — |
| Avg Response Time | 1.2s | 0.08s* | -93% |
*For cached responses
Best Practices
1. Set Appropriate TTLs
Different content needs different cache durations:
```yaml
cache:
  ttl_rules:
    # Static content: cache longer
    - pattern: "business hours|return policy|shipping"
      ttl: 86400s   # 24 hours
    # Dynamic content: shorter TTL
    - pattern: "order status|account balance"
      ttl: 300s     # 5 minutes
    # Default
    - ttl: 3600s    # 1 hour
```
2. Use Consistent System Prompts
Cache keys include system prompts. Avoid adding per-request data:
```python
from datetime import datetime

# Bad: the timestamp makes every system prompt unique, so requests never hit the cache
system = f"You are a helpful assistant. Time: {datetime.now()}"

# Good: a static system prompt keeps the cache key stable
system = "You are a helpful assistant."
```
3. Monitor Your Cache
Track cache effectiveness:
```bash
curl http://localhost:3000/metrics | grep cache
# inferxgate_cache_hits_total 45230
# inferxgate_cache_misses_total 12045
# inferxgate_cache_hit_ratio 0.789
```
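You can also pull those counters programmatically and turn the hit rate into a rough savings estimate. The metric names come from the output above; the per-request cost is an assumption you should replace with your own average:

```python
import re
import urllib.request

# Fetch the Prometheus-style metrics shown above
metrics = urllib.request.urlopen("http://localhost:3000/metrics").read().decode()

def counter(name: str) -> float:
    match = re.search(rf"^{name}\s+([0-9.]+)", metrics, re.MULTILINE)
    return float(match.group(1)) if match else 0.0

hits = counter("inferxgate_cache_hits_total")
misses = counter("inferxgate_cache_misses_total")
hit_ratio = hits / (hits + misses) if hits + misses else 0.0

avg_cost_per_request = 0.0045  # assumed: the per-query cost from the example at the top
print(f"hit ratio: {hit_ratio:.1%}")
print(f"approx. spend avoided: ${hits * avg_cost_per_request:,.2f}")
```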
4. Warm Your Cache
For known frequent queries, pre-populate the cache:
```python
from openai import OpenAI

# Assumes the same OpenAI-compatible InferXgate endpoint as earlier; adjust for your deployment
client = OpenAI(base_url="http://localhost:3000/v1", api_key="YOUR_KEY")

common_queries = [
    "What are your business hours?",
    "How do I reset my password?",
    "What is your return policy?",
]

for query in common_queries:
    client.chat.completions.create(
        model="claude-sonnet-4-20250514",
        messages=[{"role": "user", "content": query}],
    )
```
When NOT to Cache
Caching isn’t always appropriate:
- Personalized responses that depend on user data
- Real-time information like stock prices or weather
- Creative tasks where variety is desired
- Sensitive data that shouldn’t be stored
Disable caching per-request when needed:
```python
response = client.chat.completions.create(
    model="claude-sonnet-4-20250514",
    messages=[...],
    extra_headers={"X-Cache-Control": "no-cache"},
)
```
Conclusion
Caching is one of the most effective ways to reduce LLM costs. With InferXgate:
- Setup takes minutes: Just enable Redis caching
- 70%+ cost reduction is realistic for many workloads
- Faster responses: Cached responses return in milliseconds
- No code changes: Works transparently with existing applications
Ready to start saving? Get started with InferXgate today.
Have questions? Join our Discord community or check out the caching documentation.