Cost Optimization

Strategies to reduce LLM API costs with InferXgate

LLM API costs can quickly become significant at scale. InferXgate provides multiple strategies to optimize costs while maintaining quality.

Cost Overview

Representative LLM list prices (check each provider's pricing page for current rates):

Model           | Input (per 1M tokens) | Output (per 1M tokens)
----------------|-----------------------|------------------------
Claude 3 Opus   | $15.00                | $75.00
Claude 3 Sonnet | $3.00                 | $15.00
GPT-4 Turbo     | $10.00                | $30.00
GPT-4o          | $5.00                 | $15.00
Gemini 1.5 Pro  | $3.50                 | $10.50
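
To make these rates concrete, here is a quick back-of-the-envelope comparison for a single request (the token counts are illustrative, not InferXgate output):

# Illustrative: 2,000 input tokens and 500 output tokens at the rates above
PRICES = {  # $ per 1M tokens: (input, output)
    "claude-3-opus": (15.00, 75.00),
    "claude-3-sonnet": (3.00, 15.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    input_rate, output_rate = PRICES[model]
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

for model in PRICES:
    print(f"{model}: ${request_cost(model, 2_000, 500):.4f}")
# claude-3-opus: $0.0675, claude-3-sonnet: $0.0135 -- a 5x gap for identical traffic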

Caching Strategy

Enable Response Caching

For workloads with repeated or similar queries, response caching can cut provider spend by 60-90%, depending on your cache hit rate:

cache:
  enabled: true
  ttl: 3600s  # 1 hour default
  
  redis:
    url: "redis://localhost:6379"

Cache Key Optimization

Fine-tune cache keys for maximum hit rate:

cache:
  key_strategy:
    # Include these in cache key
    include:
      - model
      - messages
      - system_prompt
    
    # Exclude these (improves hit rate)
    exclude:
      - temperature      # If using temp=0
      - user_id         # If responses are shareable
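
The config above determines which request fields feed the cache key. Here is a rough sketch of what that implies; the hashing scheme is an illustration, not InferXgate's internal implementation:

import hashlib
import json

# Illustrative only: derive a cache key from the fields listed under `include`,
# ignoring the fields listed under `exclude` (e.g. temperature, user_id).
def cache_key(request: dict, include=("model", "messages", "system_prompt")) -> str:
    key_material = {field: request.get(field) for field in include}
    # Canonical JSON so logically identical requests hash to the same key
    canonical = json.dumps(key_material, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

req_a = {"model": "claude-3-sonnet",
         "messages": [{"role": "user", "content": "Hi"}],
         "temperature": 0.0, "user_id": "alice"}
req_b = {**req_a, "user_id": "bob"}  # different user, same question
assert cache_key(req_a) == cache_key(req_b)  # excluded fields don't break the hit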

Semantic Caching

Cache similar queries (not just exact matches):

cache:
  semantic:
    enabled: true
    
    # Similarity threshold (0-1)
    threshold: 0.95
    
    # Embedding model for similarity
    embedding_model: "text-embedding-3-small"
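
Conceptually, a semantic cache embeds the incoming prompt and reuses a stored response when its cosine similarity to a previously seen prompt clears the threshold. A minimal sketch of that check (the vectors below are hand-made toys; in practice they would come from the configured embedding model):

import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

THRESHOLD = 0.95  # matches cache.semantic.threshold above

# Toy vectors standing in for embeddings of two similarly worded prompts
cached_prompt_vec = [0.12, 0.87, 0.33, 0.05]
incoming_prompt_vec = [0.11, 0.90, 0.30, 0.06]

if cosine_similarity(cached_prompt_vec, incoming_prompt_vec) >= THRESHOLD:
    print("semantic cache hit: reuse the stored response")
else:
    print("miss: forward to the provider and cache the result")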

Model Routing

Least-Cost Routing

Automatically route to the cheapest available provider:

load_balancing:
  strategy: least_cost
  
  providers:
    anthropic:
      models:
        claude-3-opus:
          input_cost: 15.00
          output_cost: 75.00
        claude-3-sonnet:
          input_cost: 3.00
          output_cost: 15.00
          
    openai:
      models:
        gpt-4-turbo:
          input_cost: 10.00
          output_cost: 30.00

Quality-Aware Routing

Route based on task complexity:

routing:
  quality_aware:
    enabled: true
    
    rules:
      # Simple tasks → cheaper models
      - condition:
          max_tokens: 100
          no_code: true
        route_to: "claude-3-haiku"
        
      # Complex reasoning → premium models
      - condition:
          keywords: ["analyze", "compare", "evaluate"]
        route_to: "claude-3-opus"
        
      # Default
      - route_to: "claude-3-sonnet"
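
The rules read as an ordered list: the first matching condition wins, and the final rule with no condition is the fallback. A sketch of that evaluation (the exact matching semantics here are an assumption for illustration):

RULES = [
    {"condition": {"max_tokens": 100, "no_code": True}, "route_to": "claude-3-haiku"},
    {"condition": {"keywords": ["analyze", "compare", "evaluate"]}, "route_to": "claude-3-opus"},
    {"route_to": "claude-3-sonnet"},  # default
]

def matches(condition, request):
    if "max_tokens" in condition and request.get("max_tokens", 0) > condition["max_tokens"]:
        return False
    if condition.get("no_code") and "```" in request["prompt"]:
        return False
    if "keywords" in condition and not any(k in request["prompt"].lower() for k in condition["keywords"]):
        return False
    return True

def route(request):
    for rule in RULES:
        if matches(rule.get("condition", {}), request):
            return rule["route_to"]

print(route({"prompt": "Compare these two designs and evaluate the trade-offs",
             "max_tokens": 800}))
# -> claude-3-opus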

Token Optimization

Prompt Compression

Reduce input tokens automatically:

optimization:
  prompt_compression:
    enabled: true
    
    # Remove extra whitespace
    normalize_whitespace: true
    
    # Compress system prompts
    compress_system: true
    
    # Maximum compression ratio
    max_ratio: 0.7
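
Whitespace normalization is the simplest of these steps: collapse runs of spaces, tabs, and blank lines that carry no meaning for the model. A minimal sketch of that idea (InferXgate's full compression pipeline, e.g. system-prompt compression, is not shown here):

import re

def normalize_whitespace(prompt: str) -> str:
    # Collapse runs of spaces/tabs, trim line edges, and squeeze blank lines
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in prompt.splitlines()]
    return re.sub(r"\n{3,}", "\n\n", "\n".join(lines)).strip()

verbose = "You are   a helpful assistant.\n\n\n\n   Answer    concisely.   "
print(normalize_whitespace(verbose))
# -> "You are a helpful assistant.\n\nAnswer concisely."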

Output Limiting

Prevent unnecessarily long responses:

optimization:
  output_limits:
    # Default max tokens if not specified
    default_max_tokens: 1024
    
    # Hard cap regardless of request
    absolute_max_tokens: 4096
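
In effect this clamps every request's max_tokens between the configured default and the hard cap. A small sketch of the rule the config implies:

DEFAULT_MAX_TOKENS = 1024   # used when the request omits max_tokens
ABSOLUTE_MAX_TOKENS = 4096  # hard cap, regardless of what the request asks for

def effective_max_tokens(requested: int | None) -> int:
    return min(requested or DEFAULT_MAX_TOKENS, ABSOLUTE_MAX_TOKENS)

print(effective_max_tokens(None))    # 1024
print(effective_max_tokens(16000))   # 4096
print(effective_max_tokens(256))     # 256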

Token Counting

Track token usage precisely:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:3000/v1")

response = client.chat.completions.create(
    model="claude-sonnet-4-20250514",
    messages=[{"role": "user", "content": "Hello!"}]
)

# Token usage in response
print(f"Input tokens: {response.usage.prompt_tokens}")
print(f"Output tokens: {response.usage.completion_tokens}")
print(f"Total tokens: {response.usage.total_tokens}")

Budget Controls

Spending Limits

Set hard spending limits:

budget:
  enabled: true
  
  limits:
    # Daily limit
    daily: 100.00
    
    # Monthly limit
    monthly: 2000.00
    
    # Per-request limit
    per_request: 5.00
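
Conceptually, a pre-request budget check estimates the request's cost and rejects it when it would breach a limit. A minimal sketch under those assumptions (the actual enforcement and rejection behavior is up to InferXgate's budget implementation):

DAILY_LIMIT = 100.00
PER_REQUEST_LIMIT = 5.00

spent_today = 98.90  # would normally come from the cost-tracking store

def admit(estimated_cost: float) -> bool:
    if estimated_cost > PER_REQUEST_LIMIT:
        return False
    if spent_today + estimated_cost > DAILY_LIMIT:
        return False
    return True

print(admit(1.20))  # False: would push the daily total over $100
print(admit(6.00))  # False: over the $5 per-request cap
print(admit(0.80))  # True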

Per-User Budgets

Control spending per API key:

budget:
  per_key:
    enabled: true
    
    keys:
      "key_team_a":
        daily: 50.00
        monthly: 1000.00
        
      "key_team_b":
        daily: 25.00
        monthly: 500.00

Budget Alerts

Get notified before hitting limits:

budget:
  alerts:
    # Alert at these thresholds
    thresholds:
      - 50%
      - 75%
      - 90%
    
    # Notification webhook
    webhook: "https://your-slack-webhook.com/xxx"
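
The alert logic boils down to firing the webhook once per threshold as month-to-date spend crosses 50%, 75%, and 90% of the budget. A sketch of that check (the payload fields below are a guess for illustration, not InferXgate's actual alert format):

import requests

MONTHLY_LIMIT = 2000.00
THRESHOLDS = [0.50, 0.75, 0.90]
WEBHOOK_URL = "https://your-slack-webhook.com/xxx"

already_alerted = set()

def check_alerts(month_to_date_spend: float):
    for t in THRESHOLDS:
        if month_to_date_spend >= MONTHLY_LIMIT * t and t not in already_alerted:
            already_alerted.add(t)
            # Payload shape is illustrative only
            requests.post(WEBHOOK_URL, json={
                "text": f"Budget alert: {t:.0%} of monthly limit reached "
                        f"(${month_to_date_spend:.2f} of ${MONTHLY_LIMIT:.2f})",
            })

check_alerts(1520.00)  # crosses both the 50% and 75% thresholds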

Provider Optimization

Use Provider Batching

Some providers offer batch pricing:

providers:
  openai:
    batching:
      enabled: true
      
      # Batch requests within window
      window: 100ms
      max_batch_size: 20

Committed Use Discounts

Configure discounted pricing:

providers:
  anthropic:
    pricing:
      # Apply 20% committed use discount
      discount: 0.20
      
      # Custom rates
      input_cost_override: 2.40   # $3.00 * 0.80
      output_cost_override: 12.00  # $15.00 * 0.80

Monitoring Costs

Cost Tracking Dashboard

Enable cost tracking:

metrics:
  cost_tracking:
    enabled: true
    
    # Track by these dimensions
    dimensions:
      - model
      - api_key
      - user_id

Prometheus Metrics

Query cost metrics:

# Total cost (last 24h)
sum(increase(inferxgate_cost_total[24h]))

# Cost by model
sum by (model) (increase(inferxgate_cost_total[24h]))

# Cost by API key
sum by (api_key) (increase(inferxgate_cost_total[24h]))

# Average cost per request
rate(inferxgate_cost_total[1h]) / rate(inferxgate_requests_total[1h])

Cost Reports

Generate cost reports via API:

# Get cost summary
curl "http://localhost:3000/admin/costs?period=daily" \
  -H "Authorization: Bearer $ADMIN_KEY"

# Response
{
  "period": "2024-01-15",
  "total_cost": 145.67,
  "total_requests": 15234,
  "total_tokens": {
    "input": 2500000,
    "output": 750000
  },
  "by_model": {
    "claude-3-sonnet": {
      "cost": 95.25,
      "requests": 12000
    },
    "gpt-4-turbo": {
      "cost": 50.42,
      "requests": 3234
    }
  }
}
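
You can pull the same report programmatically and, for example, rank models by spend. A small sketch against the /admin/costs endpoint shown above (assuming the response shape in the example; ADMIN_KEY is read from the environment):

import os
import requests

resp = requests.get(
    "http://localhost:3000/admin/costs",
    params={"period": "daily"},
    headers={"Authorization": f"Bearer {os.environ['ADMIN_KEY']}"},
)
report = resp.json()

print(f"Total: ${report['total_cost']:.2f} across {report['total_requests']} requests")
for model, stats in sorted(report["by_model"].items(), key=lambda kv: -kv[1]["cost"]):
    print(f"  {model}: ${stats['cost']:.2f} ({stats['requests']} requests)")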

Cost-Saving Strategies

1. Implement Aggressive Caching

cache:
  enabled: true
  ttl: 86400s  # 24 hours for stable content
  
  # Cache even with slight temperature variations
  temperature_tolerance: 0.1

2. Use Smaller Models When Possible

Create a model mapping:

model_mapping:
  # Map expensive models to cheaper alternatives
  "gpt-4": "gpt-4-turbo"
  "claude-3-opus": "claude-3-sonnet"
  
  # Allow override with header
  allow_override: true
  override_header: "X-Force-Model"
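
The mapping behaves like a simple lookup applied before routing, with the override header bypassing it. A sketch of that behavior (the header-handling details are an assumption for illustration):

MODEL_MAPPING = {
    "gpt-4": "gpt-4-turbo",
    "claude-3-opus": "claude-3-sonnet",
}
OVERRIDE_HEADER = "X-Force-Model"

def resolve_model(requested_model: str, headers: dict) -> str:
    forced = headers.get(OVERRIDE_HEADER)
    if forced:  # caller explicitly opted out of the remapping
        return forced
    return MODEL_MAPPING.get(requested_model, requested_model)

print(resolve_model("claude-3-opus", {}))                                  # claude-3-sonnet
print(resolve_model("claude-3-opus", {"X-Force-Model": "claude-3-opus"}))  # claude-3-opus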

3. Implement Request Deduplication

deduplication:
  enabled: true
  
  # Window to detect duplicates
  window: 5s
  
  # Return cached response for duplicates
  return_cached: true
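
Deduplication boils down to remembering a fingerprint of each request for a few seconds and serving the stored response when the same fingerprint reappears inside the window. A minimal in-memory sketch (InferXgate would hold this state in shared storage such as Redis; this is illustration only):

import hashlib
import json
import time

WINDOW_SECONDS = 5
_recent = {}  # fingerprint -> (timestamp, response)

def fingerprint(request: dict) -> str:
    return hashlib.sha256(json.dumps(request, sort_keys=True).encode()).hexdigest()

def handle(request: dict, call_provider):
    fp = fingerprint(request)
    now = time.monotonic()
    hit = _recent.get(fp)
    if hit and now - hit[0] <= WINDOW_SECONDS:
        return hit[1]  # duplicate within the window: return the stored response
    response = call_provider(request)
    _recent[fp] = (now, response)
    return response

# Two identical requests in quick succession -> only one provider call
calls = []
provider = lambda req: calls.append(req) or f"answer #{len(calls)}"
req = {"model": "claude-3-sonnet", "messages": [{"role": "user", "content": "Hi"}]}
print(handle(req, provider), handle(req, provider), f"provider calls: {len(calls)}")
# answer #1 answer #1 provider calls: 1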

4. Optimize Prompts

# Bad: Verbose system prompt
system_bad = """
You are a helpful assistant. Your job is to help users with their questions.
Please be polite and professional. Always provide accurate information.
If you don't know something, please say so. Thank you for your help!
"""

# Good: Concise system prompt
system_good = "You are a helpful assistant. Be accurate and concise."

# Savings: ~30 tokens per request

5. Use Streaming Wisely

Streaming can help with perceived latency but may incur overhead:

streaming:
  # Only stream for long responses
  auto_stream_threshold: 500  # tokens
  
  # Disable streaming for simple queries
  disable_for_short: true

ROI Calculator

Estimate your savings with InferXgate:

# Example calculation
monthly_requests = 100000
avg_tokens_per_request = 500
cache_hit_rate = 0.70
base_cost_per_1k_tokens = 0.01

# Without InferXgate
base_cost = (monthly_requests * avg_tokens_per_request / 1000) * base_cost_per_1k_tokens
print(f"Without caching: ${base_cost:.2f}/month")

# With InferXgate caching
cached_requests = monthly_requests * cache_hit_rate
uncached_requests = monthly_requests * (1 - cache_hit_rate)
cached_cost = (uncached_requests * avg_tokens_per_request / 1000) * base_cost_per_1k_tokens
print(f"With caching: ${cached_cost:.2f}/month")
print(f"Savings: ${base_cost - cached_cost:.2f}/month ({cache_hit_rate*100:.0f}%)")

Best Practices Summary

  1. Enable caching: Start with response caching for immediate savings
  2. Monitor costs: Set up dashboards before costs become a problem
  3. Set budgets: Implement spending limits to prevent surprises
  4. Optimize prompts: Reduce token usage in system prompts
  5. Route intelligently: Use cheaper models for simple tasks
  6. Review regularly: Analyze cost reports weekly
  7. Cache aggressively: Tune TTL based on content freshness needs