Performance Tuning

Optimize InferXgate for maximum throughput and minimal latency

InferXgate is built in Rust for extreme performance. This guide covers advanced tuning techniques to maximize throughput and minimize latency.

Performance Benchmarks

Out of the box, InferXgate achieves:

Metric             | Value
-------------------|------------------------
Latency Overhead   | Under 5ms
Throughput         | 10,000+ req/sec
Memory Usage       | Under 50MB base
CPU Efficiency     | Near-native performance

Connection Pooling

HTTP Client Configuration

Optimize connection pooling for your workload:

# config.yaml
http:
  pool_size: 100           # Connections per provider
  pool_idle_timeout: 90s   # Keep-alive duration
  connect_timeout: 10s     # Connection establishment timeout
  request_timeout: 300s    # Request timeout (for long completions)

Provider-Specific Pools

Configure separate pools for different providers:

providers:
  anthropic:
    pool_size: 50
    max_concurrent: 100
    
  openai:
    pool_size: 75
    max_concurrent: 150

Caching Optimization

Cache Hit Ratio

Monitor and optimize your cache hit ratio:

# Check cache metrics
curl http://localhost:3000/metrics | grep cache

# Expected output
inferxgate_cache_hits_total 45230
inferxgate_cache_misses_total 12045
inferxgate_cache_hit_ratio 0.789
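The exported ratio is simply hits divided by total lookups; a quick Python sanity check against the counters above:

```python
def cache_hit_ratio(hits: int, misses: int) -> float:
    """Hit ratio = hits / (hits + misses); 0.0 before any lookups."""
    total = hits + misses
    return hits / total if total else 0.0

# The counters from the metrics output above give roughly 0.789:
ratio = cache_hit_ratio(45230, 12045)
```

If the ratio is lower than you expect, the cache key options in the next section are the first thing to revisit.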

Cache Key Strategy

Optimize cache keys for your use case:

cache:
  enabled: true
  
  # Include model in cache key (recommended)
  key_includes_model: true
  
  # Include temperature (disable for deterministic prompts)
  key_includes_temperature: false
  
  # Normalize whitespace in prompts
  normalize_prompts: true
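The effect of these options is that only the selected request fields feed into the cache key. InferXgate's actual key derivation is internal; the following is a hypothetical sketch of the behavior the config implies:

```python
import hashlib
import json

def cache_key(prompt, model, temperature=None,
              include_model=True, include_temperature=False,
              normalize=True):
    """Hypothetical key derivation mirroring the config options above."""
    if normalize:
        prompt = " ".join(prompt.split())  # collapse runs of whitespace
    parts = {"prompt": prompt}
    if include_model:
        parts["model"] = model
    if include_temperature and temperature is not None:
        parts["temperature"] = temperature
    return hashlib.sha256(json.dumps(parts, sort_keys=True).encode()).hexdigest()

# With key_includes_temperature: false, two requests that differ only in
# whitespace and temperature share a single cache entry:
assert cache_key("Hello   world", "claude", temperature=0.2) == \
       cache_key("Hello world", "claude", temperature=0.9)
```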

Redis Configuration

Tune Redis for optimal performance:

cache:
  redis:
    url: "redis://localhost:6379"
    
    # Connection pool
    pool_size: 20
    
    # Timeouts
    connect_timeout: 5s
    read_timeout: 2s
    write_timeout: 2s
    
    # Compression (for large responses)
    compression: true
    compression_threshold: 1024  # bytes
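The threshold exists because compressing tiny values costs more CPU than it saves in Redis memory and bandwidth. A minimal sketch of the threshold behavior (not InferXgate's actual codec):

```python
import zlib

COMPRESSION_THRESHOLD = 1024  # bytes, matching the config above

def maybe_compress(payload: bytes):
    """Compress only payloads at or above the threshold; return (data, was_compressed)."""
    if len(payload) >= COMPRESSION_THRESHOLD:
        return zlib.compress(payload), True
    return payload, False

# Small values pass through untouched; large ones shrink before hitting Redis:
assert maybe_compress(b"ok") == (b"ok", False)
```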

Redis Cluster Mode

For high-availability deployments:

cache:
  redis:
    mode: cluster
    nodes:
      - "redis://node1:6379"
      - "redis://node2:6379"
      - "redis://node3:6379"
    
    # Cluster-specific settings
    read_from_replicas: true

Async Processing

Request Queuing

Handle burst traffic with request queues:

server:
  # Maximum concurrent requests
  max_concurrent_requests: 1000
  
  # Queue overflow requests
  queue:
    enabled: true
    max_size: 5000
    timeout: 30s

Streaming Optimization

Optimize streaming responses:

streaming:
  # Buffer size for chunks
  buffer_size: 4096
  
  # Flush interval
  flush_interval: 50ms
  
  # Keep-alive during long generations
  keepalive_interval: 15s

Memory Management

Response Buffer Limits

Prevent memory exhaustion:

limits:
  # Maximum response size
  max_response_size: 10MB
  
  # Maximum tokens to buffer
  max_token_buffer: 100000
  
  # Request body limit
  max_request_size: 1MB

Garbage Collection

Rust doesn’t have GC, but you can tune allocator behavior:

# Tune jemalloc allocator behavior (MALLOC_CONF only takes effect
# when the binary is built with jemalloc as its allocator)
export MALLOC_CONF="background_thread:true,dirty_decay_ms:1000"

# Start InferXgate
./inferxgate serve

Load Balancing Tuning

Latency-Based Routing

Optimize for lowest latency:

load_balancing:
  strategy: least_latency
  
  # Latency measurement
  latency:
    # Weight recent measurements more heavily
    decay_factor: 0.9
    
    # Minimum samples before routing
    min_samples: 10
    
    # Health check interval
    probe_interval: 30s
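The decay_factor controls how quickly the latency estimate reacts to new measurements. One common form is an exponentially weighted moving average where the new sample gets decay_factor weight; the exact update rule InferXgate uses is internal, so treat this as an assumed illustration:

```python
def update_latency(current_ms: float, sample_ms: float, decay_factor: float = 0.9) -> float:
    """Assumed EWMA form: the newest sample gets `decay_factor` weight,
    so recent measurements dominate the running estimate."""
    return decay_factor * sample_ms + (1 - decay_factor) * current_ms

# A 200ms sample against a 100ms running average moves the estimate to ~190ms:
estimate = update_latency(100.0, 200.0)
```

With min_samples: 10, routing decisions only begin once the estimate is backed by enough measurements to be meaningful.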

Cost-Optimized Routing

Balance cost and performance:

load_balancing:
  strategy: least_cost
  
  cost:
    # Acceptable latency threshold
    max_latency_ms: 500
    
    # Provider cost weights (relative)
    weights:
      anthropic: 1.0
      openai: 1.2
      google: 0.8
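In effect, providers whose measured latency exceeds max_latency_ms are excluded, and the cheapest remaining provider (lowest weight) wins. A sketch of that selection logic, under the assumption that this is how the threshold and weights interact:

```python
def pick_provider(latency_ms: dict, weights: dict, max_latency_ms: float = 500) -> str:
    """Cheapest provider among those within the latency threshold;
    fall back to the fastest provider if none qualifies."""
    eligible = {p: w for p, w in weights.items()
                if latency_ms.get(p, float("inf")) <= max_latency_ms}
    if not eligible:  # nobody meets the latency bound: prefer speed
        return min(latency_ms, key=latency_ms.get)
    return min(eligible, key=eligible.get)

# google is cheapest (0.8) but over the 500ms bound, so anthropic (1.0) wins:
latency = {"anthropic": 320.0, "openai": 280.0, "google": 650.0}
weights = {"anthropic": 1.0, "openai": 1.2, "google": 0.8}
assert pick_provider(latency, weights) == "anthropic"
```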

Network Optimization

TCP Tuning

Optimize TCP settings:

server:
  tcp:
    nodelay: true          # Disable Nagle's algorithm
    keepalive: true
    keepalive_interval: 60s

HTTP/2 Settings

Enable HTTP/2 for multiplexing:

server:
  http2:
    enabled: true
    max_concurrent_streams: 250
    initial_window_size: 65535

Monitoring Performance

Prometheus Metrics

Key metrics to monitor:

# Request latency percentiles
histogram_quantile(0.99, rate(inferxgate_request_duration_seconds_bucket[5m]))

# Throughput
rate(inferxgate_requests_total[1m])

# Error rate
rate(inferxgate_errors_total[5m]) / rate(inferxgate_requests_total[5m])

# Cache effectiveness
inferxgate_cache_hits_total / (inferxgate_cache_hits_total + inferxgate_cache_misses_total)

Grafana Dashboard

Import the InferXgate dashboard for visualization:

# Dashboard ID for Grafana
# Import from: https://grafana.com/grafana/dashboards/xxxxx

Profiling

CPU Profiling

Profile CPU usage:

# Enable profiling endpoint
INFERXGATE_PROFILE=true ./inferxgate serve

# Capture profile
curl http://localhost:3000/debug/pprof/profile?seconds=30 > cpu.prof

# Analyze with pprof
go tool pprof -http=:8080 cpu.prof

Memory Profiling

Track memory allocations:

# Capture heap profile
curl http://localhost:3000/debug/pprof/heap > heap.prof

# Analyze
go tool pprof -http=:8080 heap.prof

Production Checklist

Before deploying to production:

  • Configure appropriate pool sizes for expected load
  • Enable and tune Redis caching
  • Set up monitoring with Prometheus/Grafana
  • Configure rate limiting to prevent abuse
  • Enable HTTP/2 for better multiplexing
  • Set appropriate timeouts for your use case
  • Test with expected peak load
  • Configure log levels (warn/error for production)

Troubleshooting

High Latency

  1. Check provider health: curl http://localhost:3000/health
  2. Review cache hit ratio in metrics
  3. Check connection pool utilization
  4. Verify network latency to providers

Memory Growth

  1. Check response buffer limits
  2. Review concurrent request limits
  3. Monitor for memory leaks in metrics
  4. Check Redis connection pool

Throughput Bottlenecks

  1. Increase pool sizes
  2. Enable request queuing
  3. Scale horizontally with load balancer
  4. Review provider rate limits