# Performance Tuning

Optimize InferXgate for maximum throughput and minimal latency.
InferXgate is built in Rust for extreme performance. This guide covers advanced tuning techniques to maximize throughput and minimize latency.
## Performance Benchmarks

Out of the box, InferXgate achieves the following (a quick way to verify these numbers in your own environment is sketched after the table):
| Metric | Value |
|---|---|
| Latency Overhead | Under 5ms |
| Throughput | 10,000+ req/sec |
| Memory Usage | Under 50MB base |
| CPU Efficiency | Near-native performance |
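To sanity-check these numbers on your own hardware, a load-testing tool such as `wrk` works well. The command below is only a sketch: it assumes the gateway listens on `localhost:3000` (as in the metrics examples later in this guide) and hits the `/health` endpoint so no provider traffic or cost is generated; point it at a real completion request to measure end-to-end latency.

```bash
# Illustrative smoke test: 4 threads, 100 connections, 30 seconds
wrk -t4 -c100 -d30s http://localhost:3000/health
```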
## Connection Pooling

### HTTP Client Configuration

Optimize connection pooling for your workload. As a rule of thumb (Little's law), the number of in-flight connections you need is roughly your target requests per second multiplied by the average provider latency in seconds, so size pools accordingly:

```yaml
# config.yaml
http:
  pool_size: 100            # Connections per provider
  pool_idle_timeout: 90s    # Keep-alive duration
  connect_timeout: 10s      # Connection establishment timeout
  request_timeout: 300s     # Request timeout (for long completions)
```
### Provider-Specific Pools

Configure separate pools for different providers:

```yaml
providers:
  anthropic:
    pool_size: 50
    max_concurrent: 100
  openai:
    pool_size: 75
    max_concurrent: 150
```
## Caching Optimization

### Cache Hit Ratio

Monitor and optimize your cache hit ratio:

```bash
# Check cache metrics
curl http://localhost:3000/metrics | grep cache
```

Expected output:

```text
inferxgate_cache_hits_total 45230
inferxgate_cache_misses_total 12045
inferxgate_cache_hit_ratio 0.789
```
### Cache Key Strategy

Optimize cache keys for your use case:

```yaml
cache:
  enabled: true
  # Include model in cache key (recommended)
  key_includes_model: true
  # Include temperature (disable for deterministic prompts)
  key_includes_temperature: false
  # Normalize whitespace in prompts
  normalize_prompts: true
```
### Redis Configuration

Tune Redis for optimal performance:

```yaml
cache:
  redis:
    url: "redis://localhost:6379"
    # Connection pool
    pool_size: 20
    # Timeouts
    connect_timeout: 5s
    read_timeout: 2s
    write_timeout: 2s
    # Compression (for large responses)
    compression: true
    compression_threshold: 1024   # bytes
```
### Redis Cluster Mode

For high-availability deployments:

```yaml
cache:
  redis:
    mode: cluster
    nodes:
      - "redis://node1:6379"
      - "redis://node2:6379"
      - "redis://node3:6379"
    # Cluster-specific settings
    read_from_replicas: true
```
## Async Processing

### Request Queuing

Handle burst traffic with request queues:

```yaml
server:
  # Maximum concurrent requests
  max_concurrent_requests: 1000
  # Queue overflow requests
  queue:
    enabled: true
    max_size: 5000
    timeout: 30s
```
### Streaming Optimization

Optimize streaming responses:

```yaml
streaming:
  # Buffer size for chunks
  buffer_size: 4096
  # Flush interval
  flush_interval: 50ms
  # Keep-alive during long generations
  keepalive_interval: 15s
```
## Memory Management

### Response Buffer Limits

Prevent memory exhaustion:

```yaml
limits:
  # Maximum response size
  max_response_size: 10MB
  # Maximum tokens to buffer
  max_token_buffer: 100000
  # Request body limit
  max_request_size: 1MB
```
### Garbage Collection

Rust has no garbage collector, but you can tune the memory allocator. If InferXgate is built with jemalloc, its behavior is controlled through the `MALLOC_CONF` environment variable:

```bash
# Tune jemalloc: enable background purging of unused pages
export MALLOC_CONF="background_thread:true,dirty_decay_ms:1000"

# Start InferXgate
./inferxgate serve
```
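If the binary is not built with jemalloc, one common alternative is to preload the system jemalloc library at startup. This is only a sketch: the path below is the Debian/Ubuntu `libjemalloc2` location and differs on other distributions.

```bash
# Illustrative: preload system jemalloc so MALLOC_CONF takes effect
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 \
MALLOC_CONF="background_thread:true,dirty_decay_ms:1000" \
./inferxgate serve
```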
## Load Balancing Tuning

### Latency-Based Routing

Optimize for lowest latency:

```yaml
load_balancing:
  strategy: least_latency
  # Latency measurement
  latency:
    # Weight recent measurements more heavily
    decay_factor: 0.9
    # Minimum samples before routing
    min_samples: 10
    # Health check interval
    probe_interval: 30s
```
### Cost-Optimized Routing

Balance cost and performance:

```yaml
load_balancing:
  strategy: least_cost
  cost:
    # Acceptable latency threshold
    max_latency_ms: 500
    # Provider cost weights (relative)
    weights:
      anthropic: 1.0
      openai: 1.2
      google: 0.8
```
## Network Optimization

### TCP Tuning

Optimize TCP settings at the gateway level (operating-system level settings are sketched after this block):

```yaml
server:
  tcp:
    nodelay: true             # Disable Nagle's algorithm
    keepalive: true
    keepalive_interval: 60s
```
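At the request rates quoted in the benchmarks, operating-system limits often bite before the gateway does. The commands below are illustrative Linux settings, not InferXgate configuration; check your distribution's defaults and your own connection counts before applying them.

```bash
# Raise the per-process file-descriptor limit (each open connection uses one)
ulimit -n 65536

# Increase the accept backlog and widen the ephemeral port range (run as root)
sysctl -w net.core.somaxconn=4096
sysctl -w net.ipv4.ip_local_port_range="1024 65535"
```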
### HTTP/2 Settings

Enable HTTP/2 for multiplexing:

```yaml
server:
  http2:
    enabled: true
    max_concurrent_streams: 250
    initial_window_size: 65535
```
## Monitoring Performance

### Prometheus Metrics

Key metrics to monitor:

```promql
# Request latency percentiles
histogram_quantile(0.99, rate(inferxgate_request_duration_seconds_bucket[5m]))

# Throughput
rate(inferxgate_requests_total[1m])

# Error rate
rate(inferxgate_errors_total[5m]) / rate(inferxgate_requests_total[5m])

# Cache effectiveness
inferxgate_cache_hits_total / (inferxgate_cache_hits_total + inferxgate_cache_misses_total)
```
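If the gateway is not already being scraped, a minimal Prometheus job for it could look like the sketch below. It assumes the `/metrics` endpoint on port 3000 used in the earlier examples and a 15-second scrape interval; adjust both for your deployment.

```yaml
# prometheus.yml (sketch)
scrape_configs:
  - job_name: inferxgate
    metrics_path: /metrics
    scrape_interval: 15s
    static_configs:
      - targets: ["localhost:3000"]
```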
### Grafana Dashboard

Import the InferXgate dashboard for visualization:

```text
# Dashboard ID for Grafana
# Import from: https://grafana.com/grafana/dashboards/xxxxx
```
## Profiling

### CPU Profiling

Profile CPU usage:

```bash
# Enable profiling endpoint
INFERXGATE_PROFILE=true ./inferxgate serve

# Capture a 30-second profile
curl "http://localhost:3000/debug/pprof/profile?seconds=30" > cpu.prof

# Analyze with pprof
go tool pprof -http=:8080 cpu.prof
```
### Memory Profiling

Track memory allocations:

```bash
# Capture heap profile
curl http://localhost:3000/debug/pprof/heap > heap.prof

# Analyze
go tool pprof -http=:8080 heap.prof
```
## Production Checklist

Before deploying to production (a consolidated example configuration follows this list):
- Configure appropriate pool sizes for expected load
- Enable and tune Redis caching
- Set up monitoring with Prometheus/Grafana
- Configure rate limiting to prevent abuse
- Enable HTTP/2 for better multiplexing
- Set appropriate timeouts for your use case
- Test with expected peak load
- Configure log levels (warn/error for production)
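The sketch below assembles the settings from this guide into a single starting-point `config.yaml`. The values are the illustrative ones used in the sections above, not universal recommendations; load-test and adjust them for your own traffic profile.

```yaml
# config.yaml — starting point assembled from the examples in this guide
server:
  max_concurrent_requests: 1000
  queue:
    enabled: true
    max_size: 5000
    timeout: 30s
  http2:
    enabled: true
    max_concurrent_streams: 250

http:
  pool_size: 100
  pool_idle_timeout: 90s
  connect_timeout: 10s
  request_timeout: 300s

cache:
  enabled: true
  redis:
    url: "redis://localhost:6379"
    pool_size: 20

load_balancing:
  strategy: least_latency

limits:
  max_response_size: 10MB
  max_request_size: 1MB
```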
## Troubleshooting

### High Latency

- Check provider health: `curl /health`
- Review the cache hit ratio in the metrics
- Check connection pool utilization
- Verify network latency to providers (see the diagnostic commands below)
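The commands below gather those data points in one pass. They assume the default port and the metric names shown earlier in this guide, and use Anthropic's public endpoint purely as an example; substitute the providers you actually route to.

```bash
# Gateway/provider health
curl -s http://localhost:3000/health

# Cache effectiveness
curl -s http://localhost:3000/metrics | grep -E "cache_hit_ratio|cache_hits_total|cache_misses_total"

# Raw network latency to a provider (connect time and time to first byte)
curl -so /dev/null -w "connect=%{time_connect}s ttfb=%{time_starttransfer}s\n" https://api.anthropic.com
```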
### Memory Growth
- Check response buffer limits
- Review concurrent request limits
- Monitor for memory leaks in metrics
- Check Redis connection pool
### Throughput Bottlenecks
- Increase pool sizes
- Enable request queuing
- Scale horizontally with load balancer
- Review provider rate limits