Caching

Configure Redis caching to reduce costs and improve response times.

For workloads that repeat identical requests, InferXgate’s caching can reduce LLM API costs by 60-90%; the saving roughly tracks your cache hit rate.

How It Works

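  1. A request arrives with a prompt.
  2. InferXgate generates a cache key from the request.
  3. If a cached response exists for that key, InferXgate returns it immediately (cache hit).
  4. Otherwise, InferXgate forwards the request to the provider, caches the response, and returns it (cache miss).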

Configuration

# Redis connection
REDIS_URL=redis://localhost:6379

# Enable caching (default: true)
ENABLE_CACHING=true

# Cache TTL in seconds (default: 3600 = 1 hour)
CACHE_TTL_SECONDS=3600

# Max cache size in MB (default: 1024)
CACHE_MAX_SIZE_MB=1024
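A service reading these settings would normally fall back to the documented defaults when a variable is unset. The Python sketch below shows that pattern. The `CacheConfig` class and `load_cache_config` function are hypothetical. The `REDIS_URL` fallback of `redis://localhost:6379` is an assumption taken from the example value above, because no default is documented for it.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class CacheConfig:
    redis_url: str
    enable_caching: bool
    ttl_seconds: int
    max_size_mb: int


def load_cache_config(env: Mapping[str, str] = os.environ) -> CacheConfig:
    """Read cache settings, applying the documented defaults (hypothetical helper)."""
    return CacheConfig(
        # Assumed fallback: the docs give no default for REDIS_URL
        redis_url=env.get("REDIS_URL", "redis://localhost:6379"),
        # Assumed parsing rule: only "true" (any case) enables caching
        enable_caching=env.get("ENABLE_CACHING", "true").strip().lower() == "true",
        ttl_seconds=int(env.get("CACHE_TTL_SECONDS", "3600")),
        max_size_mb=int(env.get("CACHE_MAX_SIZE_MB", "1024")),
    )
```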

Cache Key Generation

Cache keys are generated from:

  • Model name
  • Messages content
  • Temperature
  • Max tokens
  • Other parameters

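A request is served from the cache only when all of these fields match an earlier request whose entry has not yet expired. Changing any one of them, even the temperature, produces a different cache key.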
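One common way to build such a key is to serialize the relevant fields canonically and hash the result. Canonical here means sorted keys and fixed separators, so logically identical requests map to the same key regardless of field order. The sketch below illustrates that technique; it is not InferXgate's documented scheme. The `KEY_FIELDS` tuple, the SHA-256 hash, and the `inferxgate:` prefix are all assumptions, and the real key also covers the unspecified "other parameters".

```python
import hashlib
import json

# Assumed field list; the real key includes additional parameters
KEY_FIELDS = ("model", "messages", "temperature", "max_tokens")


def cache_key(request: dict) -> str:
    # Keep only the fields that affect the response; missing ones become None
    material = {field: request.get(field) for field in KEY_FIELDS}
    # Canonical JSON: sort_keys applies recursively, so nested key order is irrelevant
    canonical = json.dumps(material, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"inferxgate:{digest}"  # hypothetical key prefix
```

Two requests that differ only in the order of their JSON fields get the same key. Changing `temperature` gives a different key.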

Cache Headers

Response headers indicate cache status:

X-Cache: HIT          # Served from cache
X-Cache: MISS         # Fetched from provider
X-Cache-TTL: 3540     # Seconds until expiry
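In Python, the OpenAI SDK's raw-response wrapper exposes the HTTP headers alongside the body; for example, `client.chat.completions.with_raw_response.create(...)` returns an object with `.headers` and `.parse()`. The helper below is a hypothetical sketch that interprets those headers. It accepts any mapping and matches header names case-insensitively. It also assumes `X-Cache-TTL` may be absent, because the headers above do not say when it is sent.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class CacheStatus:
    hit: bool
    ttl_seconds: Optional[int]  # seconds until expiry, if the header was present


def read_cache_status(headers: Mapping[str, str]) -> CacheStatus:
    """Interpret X-Cache / X-Cache-TTL response headers (hypothetical helper)."""
    # HTTP header names are case-insensitive, so normalize before lookup
    lowered = {name.lower(): value for name, value in headers.items()}
    hit = lowered.get("x-cache", "").strip().upper() == "HIT"
    ttl_raw = lowered.get("x-cache-ttl")
    ttl = int(ttl_raw) if ttl_raw is not None and ttl_raw.strip().isdigit() else None
    return CacheStatus(hit=hit, ttl_seconds=ttl)
```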

Bypassing Cache

Force a fresh response by sending the X-Cache-Control: no-cache header:

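from openai import OpenAI

# Assumed base URL for an OpenAI-compatible InferXgate endpoint; adjust for your deployment
client = OpenAI(base_url="http://localhost:3000/v1", api_key="YOUR_API_KEY")

response = client.chat.completions.create(
    model="claude-3-opus-20240229",
    messages=[{"role": "user", "content": "Hello"}],
    extra_headers={"X-Cache-Control": "no-cache"},  # request a fresh, uncached response
)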

Cache Statistics

View cache metrics:

curl http://localhost:3000/stats

Example response:

{
  "cache": {
    "hits": 8500,
    "misses": 1500,
    "hit_rate": 0.85,
    "size_bytes": 104857600,
    "entries": 5000
  }
}
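In the example above, `hit_rate` equals hits / (hits + misses) = 8500 / 10000 = 0.85. The sketch below fetches `/stats` with the standard library and recomputes the hit rate and the average entry size. The function names are hypothetical, and the code assumes the endpoint returns the JSON shape shown.

```python
import json
import urllib.request


def fetch_stats(base_url: str = "http://localhost:3000") -> dict:
    """GET /stats and decode the JSON body (hypothetical helper)."""
    with urllib.request.urlopen(f"{base_url}/stats", timeout=5) as resp:
        return json.load(resp)


def summarize_cache_stats(stats: dict) -> dict:
    cache = stats["cache"]
    lookups = cache["hits"] + cache["misses"]
    return {
        # Fraction of lookups served from cache; 0.0 before any traffic
        "hit_rate": cache["hits"] / lookups if lookups else 0.0,
        # Average stored response size in bytes
        "avg_entry_bytes": cache["size_bytes"] / cache["entries"] if cache["entries"] else 0.0,
    }
```

For the example response, `summarize_cache_stats(fetch_stats())` gives a hit rate of 0.85 and an average entry size of 20971.52 bytes (about 20 KiB).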

Redis Setup

Docker Compose

services:
  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
    volumes:
      - redis_data:/data
    command: redis-server --appendonly yes

volumes:
  redis_data:

Redis Cluster

For high availability and horizontal scaling, list the cluster nodes in REDIS_URL and set REDIS_CLUSTER=true:

REDIS_URL=redis://node1:6379,node2:6379,node3:6379
REDIS_CLUSTER=true

Best Practices

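  1. Set an appropriate TTL: longer TTLs raise the hit rate but can serve stale responses; shorter TTLs keep responses fresh at the cost of more provider calls.
  2. Monitor the hit rate: aim for 60% or higher (the hit_rate field in /stats).
  3. Size the cache for your working set: CACHE_MAX_SIZE_MB should hold the responses you expect to reuse.
  4. Enable Redis persistence: appendonly yes, as in the Docker example, preserves the cache across restarts.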
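For item 3, you can make a rough working-set estimate. Multiply the number of distinct responses you expect to reuse within one TTL by the average entry size. The `/stats` example stores 104857600 bytes across 5000 entries, about 20 KiB each. At that size, 20,000 entries need about 400 MiB, or 480 MiB with 20% headroom, well within the default CACHE_MAX_SIZE_MB of 1024. The `estimate_cache_mb` helper and the 20% headroom factor are assumptions for illustration.

```python
def estimate_cache_mb(expected_entries: int, avg_entry_bytes: float, headroom: float = 1.2) -> float:
    """Rough cache size in MiB for a working set, with assumed 20% headroom."""
    return expected_entries * avg_entry_bytes * headroom / (1024 * 1024)


# Average entry size taken from the /stats example: 104857600 bytes / 5000 entries
avg_entry = 104_857_600 / 5_000
print(round(estimate_cache_mb(20_000, avg_entry)))  # 480
```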

Cost Savings Example

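| Metric               | Without cache | With cache (80% hit rate) |
|----------------------|---------------|---------------------------|
| Requests             | 10,000        | 10,000                    |
| API calls            | 10,000        | 2,000                     |
| Cost (at $0.01/call) | $100          | $20                       |
| Savings              | -             | $80 (80%)                 |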
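The table follows from simple arithmetic. Provider calls = requests × (1 - hit rate), and each call is billed at the flat per-call price. The sketch below reproduces the table and uses `Decimal` to avoid floating-point rounding in money values. The flat per-call price is the example's own simplification; most LLM providers bill per token. The `cache_savings` function is hypothetical.

```python
from decimal import Decimal


def cache_savings(requests: int, hit_rate: Decimal, cost_per_call: Decimal) -> dict:
    """Compare cost with and without a cache at a flat per-call price."""
    if requests <= 0:
        raise ValueError("requests must be positive")
    # Requests served from cache never reach the provider
    hits = int((requests * hit_rate).to_integral_value())
    api_calls = requests - hits
    cost_without = requests * cost_per_call
    cost_with = api_calls * cost_per_call
    savings = cost_without - cost_with
    return {
        "api_calls": api_calls,
        "cost_without": cost_without,
        "cost_with": cost_with,
        "savings": savings,
        "savings_pct": savings / cost_without * 100,
    }


print(cache_savings(10_000, Decimal("0.8"), Decimal("0.01")))
```

With the table's inputs, this gives 2,000 API calls and $20.00 spent with the cache, saving $80.00 (80%).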