Cost Optimization
Strategies to reduce LLM API costs with InferXgate
LLM API costs can quickly become significant at scale. InferXgate provides multiple strategies to optimize costs while maintaining quality.
Cost Overview
Representative list prices as of 2025 (check provider pricing pages for current rates):
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| Claude 3 Opus | $15.00 | $75.00 |
| Claude 3 Sonnet | $3.00 | $15.00 |
| GPT-4 Turbo | $10.00 | $30.00 |
| GPT-4o | $5.00 | $15.00 |
| Gemini 1.5 Pro | $3.50 | $10.50 |
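Per-request cost is simply input tokens times the input rate plus output tokens times the output rate. A quick sanity check in Python, using the Claude 3 Sonnet rates from the table above:

# Claude 3 Sonnet list prices: $3.00 input / $15.00 output per 1M tokens
input_tokens = 2_000
output_tokens = 500

cost = (input_tokens / 1_000_000) * 3.00 + (output_tokens / 1_000_000) * 15.00
print(f"${cost:.4f} per request")  # $0.0135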
Caching Strategy
Enable Response Caching
For workloads with repeated or overlapping queries, response caching can reduce provider spend by 60-90%:
cache:
  enabled: true
  ttl: 3600s  # 1 hour default
  redis:
    url: "redis://localhost:6379"
Cache Key Optimization
Fine-tune cache keys for maximum hit rate:
cache:
  key_strategy:
    # Include these in the cache key
    include:
      - model
      - messages
      - system_prompt
    # Exclude these (improves hit rate)
    exclude:
      - temperature  # if using temperature=0
      - user_id      # if responses are shareable
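Conceptually, the cache key is a stable hash over the included fields only; anything excluded never affects the key, which is what raises the hit rate. The sketch below is illustrative (it is not InferXgate's actual key function) and assumes the request is available as a plain dict:

import hashlib
import json

def cache_key(request: dict, include=("model", "messages", "system_prompt")) -> str:
    """Build a deterministic cache key from the included request fields only."""
    material = {field: request.get(field) for field in include}
    # sort_keys makes the serialization stable regardless of dict ordering
    payload = json.dumps(material, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

key = cache_key({
    "model": "claude-3-sonnet",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.0,  # excluded, so it does not change the key
})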
Semantic Caching
Cache similar queries (not just exact matches):
cache:
  semantic:
    enabled: true
    # Similarity threshold (0-1)
    threshold: 0.95
    # Embedding model for similarity
    embedding_model: "text-embedding-3-small"
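A semantic cache embeds the incoming prompt, compares it to the embeddings of previously cached prompts, and reuses a stored response when cosine similarity clears the threshold. A minimal illustrative sketch with toy embedding vectors (a real deployment would call the configured embedding model):

import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def semantic_lookup(query_embedding, cache, threshold=0.95):
    """Return the most similar cached response, but only above the threshold."""
    best_score, best_response = 0.0, None
    for cached_embedding, response in cache:
        score = cosine(query_embedding, cached_embedding)
        if score > best_score:
            best_score, best_response = score, response
    return best_response if best_score >= threshold else None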
Model Routing
Least-Cost Routing
Automatically route each request to the cheapest available provider:
load_balancing:
  strategy: least_cost
  providers:
    anthropic:
      models:
        claude-3-opus:
          input_cost: 15.00
          output_cost: 75.00
        claude-3-sonnet:
          input_cost: 3.00
          output_cost: 15.00
    openai:
      models:
        gpt-4-turbo:
          input_cost: 10.00
          output_cost: 30.00
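Least-cost routing boils down to estimating each candidate's expected request cost from its per-million-token rates and picking the minimum. A small sketch using the rates from the config above (the helper function and token estimates are hypothetical):

def expected_cost(rates, input_tokens, expected_output_tokens):
    """Expected request cost in USD from per-1M-token rates."""
    return (input_tokens * rates["input_cost"] +
            expected_output_tokens * rates["output_cost"]) / 1_000_000

# Per-1M-token rates from the config above
candidates = {
    "claude-3-opus":   {"input_cost": 15.00, "output_cost": 75.00},
    "claude-3-sonnet": {"input_cost": 3.00,  "output_cost": 15.00},
    "gpt-4-turbo":     {"input_cost": 10.00, "output_cost": 30.00},
}

# Cheapest candidate for a 2,000-token prompt and a ~500-token answer
cheapest = min(candidates, key=lambda m: expected_cost(candidates[m], 2_000, 500))
print(cheapest)  # claude-3-sonnet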
Quality-Aware Routing
Route based on task complexity:
routing:
  quality_aware:
    enabled: true
    rules:
      # Simple tasks → cheaper models
      - condition:
          max_tokens: 100
          no_code: true
        route_to: "claude-3-haiku"
      # Complex reasoning → premium models
      - condition:
          keywords: ["analyze", "compare", "evaluate"]
        route_to: "claude-3-opus"
      # Default
      - route_to: "claude-3-sonnet"
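The rules are evaluated top to bottom and the first match wins. A sketch of how a request might be classified under the rules above (illustrative only; the no_code check here is a crude heuristic, not the gateway's actual classifier):

def pick_model(prompt: str, max_tokens: int) -> str:
    """First-match-wins routing mirroring the rules above."""
    # Simple, short, code-free tasks -> cheapest model
    if max_tokens <= 100 and "```" not in prompt:
        return "claude-3-haiku"
    # Heavier reasoning keywords -> premium model
    if any(kw in prompt.lower() for kw in ("analyze", "compare", "evaluate")):
        return "claude-3-opus"
    # Default
    return "claude-3-sonnet"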
Token Optimization
Prompt Compression
Reduce input tokens automatically:
optimization:
  prompt_compression:
    enabled: true
    # Remove extra whitespace
    normalize_whitespace: true
    # Compress system prompts
    compress_system: true
    # Maximum compression ratio
    max_ratio: 0.7
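Whitespace normalization is the simplest and safest part of prompt compression, since token counts scale roughly with character counts. A minimal sketch of that step (illustrative, not InferXgate's compressor):

import re

def normalize_whitespace(prompt: str) -> str:
    """Collapse repeated spaces/tabs and excess blank lines."""
    collapsed = re.sub(r"[ \t]+", " ", prompt)        # runs of spaces/tabs -> one space
    collapsed = re.sub(r"\n{3,}", "\n\n", collapsed)  # at most one blank line in a row
    return collapsed.strip()

before = "Summarize   the   following  text:\n\n\n\n   Some   padded    input.   "
after = normalize_whitespace(before)
print(len(before), "->", len(after), "characters")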
Output Limiting
Prevent unnecessarily long responses:
optimization:
  output_limits:
    # Default max tokens if not specified
    default_max_tokens: 1024
    # Hard cap regardless of request
    absolute_max_tokens: 4096
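If a request omits max_tokens, the default applies, and the absolute cap is enforced either way. A one-function sketch of that clamping logic (hypothetical helper, shown only for clarity):

def effective_max_tokens(requested, default_max=1024, absolute_max=4096):
    """Apply the default when unset, and never exceed the hard cap."""
    return min(requested or default_max, absolute_max)

print(effective_max_tokens(None))   # 1024 (default applied)
print(effective_max_tokens(8000))   # 4096 (hard cap applied)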
Token Counting
Track token usage precisely:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:3000/v1")

response = client.chat.completions.create(
    model="claude-sonnet-4-20250514",
    messages=[{"role": "user", "content": "Hello!"}],
)

# Token usage in response
print(f"Input tokens: {response.usage.prompt_tokens}")
print(f"Output tokens: {response.usage.completion_tokens}")
print(f"Total tokens: {response.usage.total_tokens}")
Budget Controls
Spending Limits
Set hard spending limits:
budget:
  enabled: true
  limits:
    # Daily limit (USD)
    daily: 100.00
    # Monthly limit (USD)
    monthly: 2000.00
    # Per-request limit (USD)
    per_request: 5.00
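Conceptually, every request is checked against the remaining budget before it is forwarded upstream. The sketch below illustrates that check; the function, its arguments, and the exception are hypothetical, not part of InferXgate:

class BudgetExceeded(Exception):
    pass

def check_budget(spent_today, spent_this_month, request_estimate,
                 daily=100.00, monthly=2000.00, per_request=5.00):
    """Raise if forwarding this request would breach any configured limit."""
    if request_estimate > per_request:
        raise BudgetExceeded("per-request limit")
    if spent_today + request_estimate > daily:
        raise BudgetExceeded("daily limit")
    if spent_this_month + request_estimate > monthly:
        raise BudgetExceeded("monthly limit")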
Per-User Budgets
Control spending per API key:
budget:
  per_key:
    enabled: true
    keys:
      "key_team_a":
        daily: 50.00
        monthly: 1000.00
      "key_team_b":
        daily: 25.00
        monthly: 500.00
Budget Alerts
Get notified before hitting limits:
budget:
  alerts:
    # Alert at these thresholds
    thresholds:
      - "50%"
      - "75%"
      - "90%"
    # Notification webhook
    webhook: "https://your-slack-webhook.com/xxx"
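An alert fires once per threshold, the first time cumulative spend crosses it. A stdlib-only sketch of that idea (illustrative; the webhook URL is the placeholder from the config above, not a real endpoint):

import json
import urllib.request

def fire_alerts(spent, limit, already_fired,
                thresholds=(0.50, 0.75, 0.90),
                webhook="https://your-slack-webhook.com/xxx"):
    """POST a message the first time each spend threshold is crossed."""
    for t in thresholds:
        if spent >= limit * t and t not in already_fired:
            body = json.dumps({"text": f"Budget at {t:.0%}: ${spent:.2f} of ${limit:.2f}"})
            request = urllib.request.Request(
                webhook, data=body.encode("utf-8"),
                headers={"Content-Type": "application/json"},
            )
            urllib.request.urlopen(request)
            already_fired.add(t)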
Provider Optimization
Use Provider Batching
Some providers offer batch pricing:
providers:
  openai:
    batching:
      enabled: true
      # Batch requests arriving within this window
      window: 100ms
      max_batch_size: 20
Committed Use Discounts
Configure discounted pricing:
providers:
  anthropic:
    pricing:
      # Apply 20% committed use discount
      discount: 0.20
      # Custom rates
      input_cost_override: 2.40   # $3.00 * 0.80
      output_cost_override: 12.00 # $15.00 * 0.80
Monitoring Costs
Cost Tracking Dashboard
Enable cost tracking:
metrics:
  cost_tracking:
    enabled: true
    # Track by these dimensions
    dimensions:
      - model
      - api_key
      - user_id
Prometheus Metrics
Query cost metrics:
# Total cost (last 24h)
sum(increase(inferxgate_cost_total[24h]))
# Cost by model
sum by (model) (increase(inferxgate_cost_total[24h]))
# Cost by user
sum by (api_key) (increase(inferxgate_cost_total[24h]))
# Average cost per request
rate(inferxgate_cost_total[1h]) / rate(inferxgate_requests_total[1h])
Cost Reports
Generate cost reports via API:
# Get cost summary
curl "http://localhost:3000/admin/costs?period=daily" \
-H "Authorization: Bearer $ADMIN_KEY"
# Response
{
  "period": "2024-01-15",
  "total_cost": 145.67,
  "total_requests": 15234,
  "total_tokens": {
    "input": 2500000,
    "output": 750000
  },
  "by_model": {
    "claude-3-sonnet": {
      "cost": 95.25,
      "requests": 12000
    },
    "gpt-4-turbo": {
      "cost": 50.42,
      "requests": 3234
    }
  }
}
Cost-Saving Strategies
1. Implement Aggressive Caching
cache:
  enabled: true
  ttl: 86400s  # 24 hours for stable content
  # Cache even with slight temperature variations
  temperature_tolerance: 0.1
2. Use Smaller Models When Possible
Create a model mapping:
model_mapping:
  # Map expensive models to cheaper alternatives
  "gpt-4": "gpt-4-turbo"
  "claude-3-opus": "claude-3-sonnet"
  # Allow override with header
  allow_override: true
  override_header: "X-Force-Model"
3. Implement Request Deduplication
deduplication:
  enabled: true
  # Window to detect duplicates
  window: 5s
  # Return cached response for duplicates
  return_cached: true
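Deduplication pairs naturally with the cache-key idea from the caching section: hash the normalized request, remember it for the window, and replay the stored response for repeats that arrive inside it. An illustrative sketch (not the gateway's implementation):

import time

class Deduplicator:
    """Remember request hashes briefly and replay the first response."""
    def __init__(self, window_seconds=5.0):
        self.window = window_seconds
        self.seen = {}  # request hash -> (timestamp, response)

    def lookup(self, key):
        entry = self.seen.get(key)
        if entry and time.monotonic() - entry[0] < self.window:
            return entry[1]  # duplicate inside the window: reuse the response
        return None

    def store(self, key, response):
        self.seen[key] = (time.monotonic(), response)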
4. Optimize Prompts
# Bad: Verbose system prompt
system_bad = """
You are a helpful assistant. Your job is to help users with their questions.
Please be polite and professional. Always provide accurate information.
If you don't know something, please say so. Thank you for your help!
"""
# Good: Concise system prompt
system_good = "You are a helpful assistant. Be accurate and concise."
# Savings: ~30 tokens per request
5. Use Streaming Wisely
Streaming can help with perceived latency but may incur overhead:
streaming:
  # Only stream for long responses
  auto_stream_threshold: 500  # tokens
  # Disable streaming for simple queries
  disable_for_short: true
ROI Calculator
Estimate your savings with InferXgate:
# Example calculation
monthly_requests = 100_000
avg_tokens_per_request = 500
cache_hit_rate = 0.70
base_cost_per_1k_tokens = 0.01

# Without InferXgate: every request hits the provider
base_cost = (monthly_requests * avg_tokens_per_request / 1000) * base_cost_per_1k_tokens
print(f"Without caching: ${base_cost:.2f}/month")

# With InferXgate caching: only cache misses hit the provider
uncached_requests = monthly_requests * (1 - cache_hit_rate)
optimized_cost = (uncached_requests * avg_tokens_per_request / 1000) * base_cost_per_1k_tokens
print(f"With caching: ${optimized_cost:.2f}/month")

print(f"Savings: ${base_cost - optimized_cost:.2f}/month ({cache_hit_rate*100:.0f}%)")
Best Practices Summary
- Enable caching: Start with response caching for immediate savings
- Monitor costs: Set up dashboards before costs become a problem
- Set budgets: Implement spending limits to prevent surprises
- Optimize prompts: Reduce token usage in system prompts
- Route intelligently: Use cheaper models for simple tasks
- Review regularly: Analyze cost reports weekly
- Cache aggressively: Tune TTL based on content freshness needs