How to Reduce LLM API Costs by 70% with Intelligent Caching
by the InferXgate Team
LLM API costs can quickly spiral out of control as your application scales. In this post, we’ll show you how InferXgate’s intelligent caching can reduce your costs by 60-90%—without sacrificing response quality.
The Cost Problem
Let’s do some quick math. Say you’re building a customer support chatbot:
- 10,000 queries per day
- Average 500 tokens per query (roughly 250 input + 250 output)
- Using Claude 3 Sonnet at $3 per million input tokens and $15 per million output tokens

That's roughly $45 a day, or about $1,350 a month, just for one chatbot.
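If you want to sanity-check that estimate, here is the same arithmetic as a few lines of Python (the 250/250 token split is the assumption stated above):

```python
# Back-of-the-envelope cost for the chatbot above
queries_per_day = 10_000
input_tokens = 250    # assumed split of the 500-token average
output_tokens = 250
input_price = 3 / 1_000_000    # $ per input token (Claude 3 Sonnet)
output_price = 15 / 1_000_000  # $ per output token

daily_cost = queries_per_day * (input_tokens * input_price + output_tokens * output_price)
print(f"${daily_cost:,.0f}/day, ${daily_cost * 30:,.0f}/month")  # ~$45/day, ~$1,350/month
```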
But here’s the thing: many of those queries are similar or identical. “What are your business hours?” “How do I reset my password?” “What’s your return policy?”
Enter Response Caching
InferXgate sits between your application and the LLM provider, caching responses automatically:
```
App → InferXgate → Cache hit?  → Return cached response
                        ↓
                   Cache miss  → Provider → Cache response → Return
```
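Because the examples later in this post call InferXgate through a standard OpenAI-style client, the only application-side change is pointing that client at the gateway. The base URL below is an assumption for a local deployment on port 3000 (the same port the metrics endpoint uses later); adjust it for your setup:

```python
from openai import OpenAI

# Point your existing OpenAI-compatible client at the InferXgate gateway.
# The URL is an assumption for a local deployment; adjust it for yours.
client = OpenAI(
    base_url="http://localhost:3000/v1",
    api_key="YOUR_GATEWAY_OR_PROVIDER_KEY",
)

response = client.chat.completions.create(
    model="claude-sonnet-4-20250514",
    messages=[{"role": "user", "content": "What are your business hours?"}],
)
print(response.choices[0].message.content)
```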
Setting Up Caching
Enable caching in your InferXgate configuration:
```yaml
# config.yaml
cache:
  enabled: true
  ttl: 3600s            # 1 hour
  redis:
    url: "redis://localhost:6379"
```
That’s it! InferXgate will now cache identical requests automatically.
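A quick way to confirm caching is working is to send the same request twice and compare latencies; the second call should come back from the cache almost instantly. This sketch assumes the same local gateway endpoint as above:

```python
import time
from openai import OpenAI

# Same assumption as before: InferXgate running locally on port 3000
client = OpenAI(base_url="http://localhost:3000/v1", api_key="YOUR_KEY")

def timed_request() -> float:
    start = time.perf_counter()
    client.chat.completions.create(
        model="claude-sonnet-4-20250514",
        messages=[{"role": "user", "content": "What are your business hours?"}],
    )
    return time.perf_counter() - start

print(f"first call:  {timed_request():.2f}s")   # cache miss: forwarded to the provider
print(f"second call: {timed_request():.2f}s")   # identical request: served from the cache
```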
Understanding Cache Keys
By default, InferXgate creates cache keys from:
- Model name
- Messages array
- System prompt
- Temperature (if non-zero)
This means two requests with the same prompt will hit the cache, even from different users.
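Conceptually, the cache key is just a stable hash over those fields. The sketch below is purely illustrative (it is not InferXgate's actual implementation), but it shows why identical prompts from different users land on the same entry:

```python
import hashlib
import json

def cache_key(model: str, messages: list[dict], system: str | None,
              temperature: float = 0.0) -> str:
    """Illustrative only: a stable hash over the fields listed above."""
    payload = {"model": model, "messages": messages, "system": system}
    if temperature:  # included only when non-zero, matching the default behaviour
        payload["temperature"] = temperature
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

# Two different users asking the identical question produce the same key,
# so they share a single cached response.
key = cache_key(
    "claude-sonnet-4-20250514",
    [{"role": "user", "content": "What are your business hours?"}],
    system="You are a helpful assistant.",
)
```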
Customizing Cache Behavior
You can tune what’s included in cache keys:
```yaml
cache:
  key_includes:
    - model
    - messages
    - system_prompt
  key_excludes:
    - user_id      # Share cache across users
    - temperature  # If you always use temp=0
```
Semantic Caching: The Next Level
Exact-match caching is great, but what about similar queries?
- “What time do you open?”
- “What are your hours?”
- “When do you open?”
These are all asking the same thing. With semantic caching, InferXgate can recognize similarity:
```yaml
cache:
  semantic:
    enabled: true
    threshold: 0.92   # 92% similarity required
    embedding_model: "text-embedding-3-small"
```
Now similar (not just identical) queries can share cached responses.
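Under the hood, semantic caching amounts to embedding the incoming prompt and comparing it against the embeddings of prompts that already have cached responses. The following sketch is illustrative rather than InferXgate's internals; it uses the same embedding model as the config above and assumes you have an embedding API key available:

```python
from openai import OpenAI

client = OpenAI()  # used here only to compute embeddings

def embed(text: str) -> list[float]:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return resp.data[0].embedding

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(x * x for x in b) ** 0.5
    return dot / (norm_a * norm_b)

cached_prompt = "What are your business hours?"
incoming_prompt = "What time do you open?"

# With threshold 0.92, the incoming prompt reuses the cached answer only if it is
# close enough to a prompt that has already been answered.
similarity = cosine(embed(cached_prompt), embed(incoming_prompt))
print(f"similarity: {similarity:.3f}, cache hit: {similarity >= 0.92}")
```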
Real-World Results
We ran InferXgate for a month on a production support chatbot. The results:
| Metric | Before | After | Change |
|---|---|---|---|
| Monthly API Cost | $2,450 | $735 | -70% |
| Cache Hit Rate | N/A | 72% | — |
| Avg Response Time | 1.2s | 0.08s* | -93% |
*For cached responses
Best Practices
1. Set Appropriate TTLs
Different content needs different cache durations:
```yaml
cache:
  ttl_rules:
    # Static content: cache longer
    - pattern: "business hours|return policy|shipping"
      ttl: 86400s   # 24 hours
    # Dynamic content: shorter TTL
    - pattern: "order status|account balance"
      ttl: 300s     # 5 minutes
    # Default
    - ttl: 3600s    # 1 hour
```
2. Use Consistent System Prompts
Cache keys include system prompts. Avoid adding per-request data:
```python
from datetime import datetime

# Bad: the timestamp makes every system prompt unique, so requests never hit the cache
system = f"You are a helpful assistant. Time: {datetime.now()}"

# Good: a static system prompt keeps the cache key stable
system = "You are a helpful assistant."
```
3. Monitor Your Cache
Track cache effectiveness:
```bash
curl http://localhost:3000/metrics | grep cache
# inferxgate_cache_hits_total 45230
# inferxgate_cache_misses_total 12045
# inferxgate_cache_hit_ratio 0.789
```
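You can also pull those counters programmatically and turn the hit rate into a rough savings estimate. The metric names come from the output above; the per-request cost is an assumption you should replace with your own average:

```python
import re
import urllib.request

# Fetch the Prometheus-style metrics shown above
metrics = urllib.request.urlopen("http://localhost:3000/metrics").read().decode()

def counter(name: str) -> float:
    match = re.search(rf"^{name}\s+([0-9.]+)", metrics, re.MULTILINE)
    return float(match.group(1)) if match else 0.0

hits = counter("inferxgate_cache_hits_total")
misses = counter("inferxgate_cache_misses_total")
hit_ratio = hits / (hits + misses) if hits + misses else 0.0

avg_cost_per_request = 0.0045  # assumed: the per-query cost from the example at the top
print(f"hit ratio: {hit_ratio:.1%}")
print(f"approx. spend avoided: ${hits * avg_cost_per_request:,.2f}")
```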
4. Warm Your Cache
For known frequent queries, pre-populate the cache:
```python
from openai import OpenAI

# Assumes the same OpenAI-compatible InferXgate endpoint as earlier; adjust for your deployment
client = OpenAI(base_url="http://localhost:3000/v1", api_key="YOUR_KEY")

common_queries = [
    "What are your business hours?",
    "How do I reset my password?",
    "What is your return policy?",
]

for query in common_queries:
    client.chat.completions.create(
        model="claude-sonnet-4-20250514",
        messages=[{"role": "user", "content": query}],
    )
```
When NOT to Cache
Caching isn’t always appropriate:
- Personalized responses that depend on user data
- Real-time information like stock prices or weather
- Creative tasks where variety is desired
- Sensitive data that shouldn’t be stored
Disable caching per-request when needed:
```python
response = client.chat.completions.create(
    model="claude-sonnet-4-20250514",
    messages=[...],
    extra_headers={"X-Cache-Control": "no-cache"},
)
```
Conclusion
Caching is one of the most effective ways to reduce LLM costs. With InferXgate:
- Setup takes minutes: Just enable Redis caching
- 70%+ cost reduction is realistic for many workloads
- Faster responses: Cached responses return in milliseconds
- No code changes: Works transparently with existing applications
Ready to start saving? Get started with InferXgate today.
Have questions? Join our Discord community or check out the caching documentation.