# Performance Tuning

Optimize InferXgate for maximum throughput and minimal latency.
InferXgate is built in Rust for extreme performance. This guide covers advanced tuning techniques to maximize throughput and minimize latency.
## Performance Benchmarks

Out of the box, InferXgate achieves the following (a quick way to verify these numbers in your own environment is sketched after the table):
| Metric | Value |
|---|---|
| Latency Overhead | Under 5ms |
| Throughput | 10,000+ req/sec |
| Memory Usage | Under 50MB base |
| CPU Efficiency | Near-native performance |
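To sanity-check these numbers on your own hardware, a load-testing tool such as `wrk` works well. The command below is only a sketch: it assumes the gateway listens on `localhost:3000` (as in the metrics examples later in this guide) and hits the `/health` endpoint so no provider traffic or cost is generated; point it at a real completion request to measure end-to-end latency.

```bash
# Illustrative smoke test: 4 threads, 100 connections, 30 seconds
wrk -t4 -c100 -d30s http://localhost:3000/health
```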
## Connection Pooling

### HTTP Client Configuration

Optimize connection pooling for your workload. As a rule of thumb (Little's law), the number of in-flight connections you need is roughly your target requests per second multiplied by the average provider latency in seconds, so size pools accordingly:

```yaml
# config.yaml
http:
  pool_size: 100            # Connections per provider
  pool_idle_timeout: 90s    # Keep-alive duration
  connect_timeout: 10s      # Connection establishment timeout
  request_timeout: 300s     # Request timeout (for long completions)
```
### Provider-Specific Pools

Configure separate pools for different providers:

```yaml
providers:
  anthropic:
    pool_size: 50
    max_concurrent: 100
  openai:
    pool_size: 75
    max_concurrent: 150
```
## Caching Optimization

### Cache Hit Ratio

Monitor and optimize your cache hit ratio:

```bash
# Check cache metrics
curl http://localhost:3000/metrics | grep cache
```

Expected output:

```text
inferxgate_cache_hits_total 45230
inferxgate_cache_misses_total 12045
inferxgate_cache_hit_ratio 0.789
```
### Cache Key Strategy

Optimize cache keys for your use case:

```yaml
cache:
  enabled: true
  # Include model in cache key (recommended)
  key_includes_model: true
  # Include temperature (disable for deterministic prompts)
  key_includes_temperature: false
  # Normalize whitespace in prompts
  normalize_prompts: true
```
### Redis Configuration

Tune Redis for optimal performance:

```yaml
cache:
  redis:
    url: "redis://localhost:6379"
    # Connection pool
    pool_size: 20
    # Timeouts
    connect_timeout: 5s
    read_timeout: 2s
    write_timeout: 2s
    # Compression (for large responses)
    compression: true
    compression_threshold: 1024   # bytes
```
### Redis Cluster Mode

For high-availability deployments:

```yaml
cache:
  redis:
    mode: cluster
    nodes:
      - "redis://node1:6379"
      - "redis://node2:6379"
      - "redis://node3:6379"
    # Cluster-specific settings
    read_from_replicas: true
```
## Async Processing

### Request Queuing

Handle burst traffic with request queues:

```yaml
server:
  # Maximum concurrent requests
  max_concurrent_requests: 1000
  # Queue overflow requests
  queue:
    enabled: true
    max_size: 5000
    timeout: 30s
```
### Streaming Optimization

Optimize streaming responses:

```yaml
streaming:
  # Buffer size for chunks
  buffer_size: 4096
  # Flush interval
  flush_interval: 50ms
  # Keep-alive during long generations
  keepalive_interval: 15s
```
## Memory Management

### Response Buffer Limits

Prevent memory exhaustion:

```yaml
limits:
  # Maximum response size
  max_response_size: 10MB
  # Maximum tokens to buffer
  max_token_buffer: 100000
  # Request body limit
  max_request_size: 1MB
```
### Garbage Collection

Rust has no garbage collector, but you can tune the memory allocator. If InferXgate is built with jemalloc, its behavior is controlled through the `MALLOC_CONF` environment variable:

```bash
# Tune jemalloc: enable background purging of unused pages
export MALLOC_CONF="background_thread:true,dirty_decay_ms:1000"

# Start InferXgate
./inferxgate serve
```
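If the binary is not built with jemalloc, one common alternative is to preload the system jemalloc library at startup. This is only a sketch: the path below is the Debian/Ubuntu `libjemalloc2` location and differs on other distributions.

```bash
# Illustrative: preload system jemalloc so MALLOC_CONF takes effect
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 \
MALLOC_CONF="background_thread:true,dirty_decay_ms:1000" \
./inferxgate serve
```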
## Load Balancing Tuning

### Latency-Based Routing

Optimize for lowest latency:

```yaml
load_balancing:
  strategy: least_latency
  # Latency measurement
  latency:
    # Weight recent measurements more heavily
    decay_factor: 0.9
    # Minimum samples before routing
    min_samples: 10
    # Health check interval
    probe_interval: 30s
```
### Cost-Optimized Routing

Balance cost and performance:

```yaml
load_balancing:
  strategy: least_cost
  cost:
    # Acceptable latency threshold
    max_latency_ms: 500
    # Provider cost weights (relative)
    weights:
      anthropic: 1.0
      openai: 1.2
      google: 0.8
```
## Network Optimization

### TCP Tuning

Optimize TCP settings at the gateway level (operating-system level settings are sketched after this block):

```yaml
server:
  tcp:
    nodelay: true             # Disable Nagle's algorithm
    keepalive: true
    keepalive_interval: 60s
```
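At the request rates quoted in the benchmarks, operating-system limits often bite before the gateway does. The commands below are illustrative Linux settings, not InferXgate configuration; check your distribution's defaults and your own connection counts before applying them.

```bash
# Raise the per-process file-descriptor limit (each open connection uses one)
ulimit -n 65536

# Increase the accept backlog and widen the ephemeral port range (run as root)
sysctl -w net.core.somaxconn=4096
sysctl -w net.ipv4.ip_local_port_range="1024 65535"
```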
### HTTP/2 Settings

Enable HTTP/2 for multiplexing:

```yaml
server:
  http2:
    enabled: true
    max_concurrent_streams: 250
    initial_window_size: 65535
```
## Monitoring Performance

### Prometheus Metrics

Key metrics to monitor:

```promql
# Request latency percentiles
histogram_quantile(0.99, rate(inferxgate_request_duration_seconds_bucket[5m]))

# Throughput
rate(inferxgate_requests_total[1m])

# Error rate
rate(inferxgate_errors_total[5m]) / rate(inferxgate_requests_total[5m])

# Cache effectiveness
inferxgate_cache_hits_total / (inferxgate_cache_hits_total + inferxgate_cache_misses_total)
```
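If the gateway is not already being scraped, a minimal Prometheus job for it could look like the sketch below. It assumes the `/metrics` endpoint on port 3000 used in the earlier examples and a 15-second scrape interval; adjust both for your deployment.

```yaml
# prometheus.yml (sketch)
scrape_configs:
  - job_name: inferxgate
    metrics_path: /metrics
    scrape_interval: 15s
    static_configs:
      - targets: ["localhost:3000"]
```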
### Grafana Dashboard

Import the InferXgate dashboard for visualization:

```text
# Dashboard ID for Grafana
# Import from: https://grafana.com/grafana/dashboards/xxxxx
```
## Profiling

### CPU Profiling

Profile CPU usage:

```bash
# Enable profiling endpoint
INFERXGATE_PROFILE=true ./inferxgate serve

# Capture a 30-second profile
curl "http://localhost:3000/debug/pprof/profile?seconds=30" > cpu.prof

# Analyze with pprof
go tool pprof -http=:8080 cpu.prof
```
### Memory Profiling

Track memory allocations:

```bash
# Capture heap profile
curl http://localhost:3000/debug/pprof/heap > heap.prof

# Analyze
go tool pprof -http=:8080 heap.prof
```
## Production Checklist

Before deploying to production (a consolidated example configuration follows this list):
- Configure appropriate pool sizes for expected load
- Enable and tune Redis caching
- Set up monitoring with Prometheus/Grafana
- Configure rate limiting to prevent abuse
- Enable HTTP/2 for better multiplexing
- Set appropriate timeouts for your use case
- Test with expected peak load
- Configure log levels (warn/error for production)
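The sketch below assembles the settings from this guide into a single starting-point `config.yaml`. The values are the illustrative ones used in the sections above, not universal recommendations; load-test and adjust them for your own traffic profile.

```yaml
# config.yaml — starting point assembled from the examples in this guide
server:
  max_concurrent_requests: 1000
  queue:
    enabled: true
    max_size: 5000
    timeout: 30s
  http2:
    enabled: true
    max_concurrent_streams: 250

http:
  pool_size: 100
  pool_idle_timeout: 90s
  connect_timeout: 10s
  request_timeout: 300s

cache:
  enabled: true
  redis:
    url: "redis://localhost:6379"
    pool_size: 20

load_balancing:
  strategy: least_latency

limits:
  max_response_size: 10MB
  max_request_size: 1MB
```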
## Troubleshooting

### High Latency

- Check provider health: `curl /health`
- Review the cache hit ratio in the metrics
- Check connection pool utilization
- Verify network latency to providers (see the diagnostic commands below)
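The commands below gather those data points in one pass. They assume the default port and the metric names shown earlier in this guide, and use Anthropic's public endpoint purely as an example; substitute the providers you actually route to.

```bash
# Gateway/provider health
curl -s http://localhost:3000/health

# Cache effectiveness
curl -s http://localhost:3000/metrics | grep -E "cache_hit_ratio|cache_hits_total|cache_misses_total"

# Raw network latency to a provider (connect time and time to first byte)
curl -so /dev/null -w "connect=%{time_connect}s ttfb=%{time_starttransfer}s\n" https://api.anthropic.com
```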
### Memory Growth
- Check response buffer limits
- Review concurrent request limits
- Monitor for memory leaks in metrics
- Check Redis connection pool
### Throughput Bottlenecks
- Increase pool sizes
- Enable request queuing
- Scale horizontally with load balancer
- Review provider rate limits