# Rate Limiting

Configure and manage rate limits to protect your LLM gateway.
Rate limiting protects your InferXgate deployment from abuse and ensures fair resource allocation across users and applications.
## Overview

InferXgate supports multiple rate limiting strategies:

- **Token-based**: Limit by tokens consumed
- **Request-based**: Limit by request count
- **Cost-based**: Limit by estimated cost
- **Concurrent**: Limit simultaneous requests
## Basic Configuration

Enable rate limiting in your configuration:

```yaml
# config.yaml
rate_limiting:
  enabled: true

  # Default limits (applied to all requests)
  default:
    requests_per_minute: 60
    tokens_per_minute: 100000
    concurrent_requests: 10
```
## Rate Limit Strategies

### Request-Based Limits

Simple request counting:

```yaml
rate_limiting:
  strategy: requests
  limits:
    # Per minute
    requests_per_minute: 60
    # Per hour
    requests_per_hour: 1000
    # Per day
    requests_per_day: 10000
```
### Token-Based Limits

Limit by token consumption:

```yaml
rate_limiting:
  strategy: tokens
  limits:
    # Input + output tokens
    tokens_per_minute: 100000
    tokens_per_hour: 1000000

    # Separate input/output limits
    input_tokens_per_minute: 50000
    output_tokens_per_minute: 50000
```
### Cost-Based Limits

Limit by estimated cost:

```yaml
rate_limiting:
  strategy: cost
  limits:
    # USD limits
    cost_per_minute: 1.00
    cost_per_hour: 20.00
    cost_per_day: 100.00
```
### Concurrent Request Limits

Cap the number of in-flight requests; excess requests queue until a slot frees up or `queue_timeout` elapses:

```yaml
rate_limiting:
  concurrent:
    enabled: true
    max_concurrent: 10
    queue_timeout: 30s
```
## Per-User Rate Limits

### API Key-Based Limits

Apply different limits per API key:

```yaml
rate_limiting:
  per_key:
    enabled: true

    # Key-specific overrides
    keys:
      "key_premium_user":
        requests_per_minute: 200
        tokens_per_minute: 500000
      "key_free_tier":
        requests_per_minute: 10
        tokens_per_minute: 10000
```
### JWT Claims-Based Limits

Rate limit based on JWT claims:

```yaml
rate_limiting:
  jwt_claims:
    enabled: true

    # Claim to use for rate limit tier
    tier_claim: "rate_limit_tier"

    tiers:
      enterprise:
        requests_per_minute: 1000
        tokens_per_minute: 1000000
      pro:
        requests_per_minute: 100
        tokens_per_minute: 100000
      free:
        requests_per_minute: 10
        tokens_per_minute: 10000
```
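Conceptually, tier resolution maps the configured claim to a set of limits, falling back to the most restrictive tier for unknown or missing claims. A minimal sketch (the helper name `limits_for` and the fallback-to-`free` behavior are illustrative assumptions, not InferXgate's internal API):

```python
# Sketch of tier resolution from a decoded JWT payload.
# Unknown or missing tiers fall back to "free" (assumed behavior).
TIER_LIMITS = {
    "enterprise": {"requests_per_minute": 1000, "tokens_per_minute": 1_000_000},
    "pro":        {"requests_per_minute": 100,  "tokens_per_minute": 100_000},
    "free":       {"requests_per_minute": 10,   "tokens_per_minute": 10_000},
}

def limits_for(claims: dict, tier_claim: str = "rate_limit_tier") -> dict:
    """Map the configured JWT claim to a tier's limits, defaulting to 'free'."""
    tier = claims.get(tier_claim, "free")
    return TIER_LIMITS.get(tier, TIER_LIMITS["free"])
```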
## Per-Model Rate Limits

Apply different limits per model:

```yaml
rate_limiting:
  per_model:
    enabled: true
    models:
      "claude-3-opus":
        requests_per_minute: 20
        tokens_per_minute: 50000
      "claude-3-sonnet":
        requests_per_minute: 60
        tokens_per_minute: 200000
      "gpt-4":
        requests_per_minute: 30
        tokens_per_minute: 100000
```
## Rate Limit Storage

### In-Memory (Default)

Fast, but counters are per-instance and reset on restart:

```yaml
rate_limiting:
  storage: memory

  # Cleanup interval
  cleanup_interval: 60s
```
### Redis Storage

For distributed deployments, where all gateway instances must share counters:

```yaml
rate_limiting:
  storage: redis
  redis:
    url: "redis://localhost:6379"
    key_prefix: "ratelimit:"

    # Use Redis Lua scripts for atomicity
    use_lua: true
```
## Rate Limit Headers

InferXgate returns standard rate limit headers:

```http
HTTP/1.1 200 OK
X-RateLimit-Limit: 60
X-RateLimit-Remaining: 45
X-RateLimit-Reset: 1699574400
X-RateLimit-Reset-After: 45
```
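Clients can use these headers to pace themselves before hitting a 429. A minimal sketch (the helper names and the `reserve` threshold are illustrative, not part of InferXgate):

```python
import time

def seconds_until_reset(headers: dict) -> float:
    """How long to wait before the window resets, preferring the relative
    Reset-After header and falling back to the absolute Reset timestamp."""
    if "X-RateLimit-Reset-After" in headers:
        return float(headers["X-RateLimit-Reset-After"])
    reset_at = float(headers.get("X-RateLimit-Reset", 0))
    return max(0.0, reset_at - time.time())

def should_throttle(headers: dict, reserve: int = 5) -> bool:
    """Proactively slow down when fewer than `reserve` requests remain."""
    return int(headers.get("X-RateLimit-Remaining", reserve)) < reserve
```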
### Header Configuration

Customize the headers:

```yaml
rate_limiting:
  headers:
    enabled: true

    # Header names
    limit_header: "X-RateLimit-Limit"
    remaining_header: "X-RateLimit-Remaining"
    reset_header: "X-RateLimit-Reset"

    # Include Retry-After on 429 responses
    retry_after: true
```
## Rate Limit Response

When a request exceeds its limit, InferXgate returns:

```json
{
  "error": {
    "message": "Rate limit exceeded. Please retry after 45 seconds.",
    "type": "rate_limit_error",
    "code": "rate_limit_exceeded",
    "retry_after": 45
  }
}
```
### Custom Error Response

Customize the error response:

```yaml
rate_limiting:
  error_response:
    status_code: 429
    message: "You've exceeded your rate limit. Please upgrade your plan."
    include_retry_after: true
```
## Sliding Window Algorithm

InferXgate uses a sliding window for accurate rate limiting:

```yaml
rate_limiting:
  algorithm: sliding_window

  # Window configuration
  window:
    size: 60s      # Window size
    precision: 1s  # Granularity
```
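The idea behind a sliding window: keep a log of recent request timestamps, evict anything older than the window, and allow a request only while the log is under the limit. A minimal in-memory sketch, illustrative only (InferXgate's implementation also buckets timestamps by `precision` and can share counters via Redis):

```python
from collections import deque

class SlidingWindowLimiter:
    """Minimal sliding-window limiter keeping a log of request timestamps."""

    def __init__(self, limit: int, window_s: float = 60.0):
        self.limit = limit
        self.window_s = window_s
        self.hits = deque()

    def allow(self, now: float) -> bool:
        """Allow and record a request at time `now` (seconds) if under the limit.
        Production code would pass time.monotonic() as `now`."""
        # Evict timestamps that have slid out of the window.
        while self.hits and now - self.hits[0] >= self.window_s:
            self.hits.popleft()
        if len(self.hits) < self.limit:
            self.hits.append(now)
            return True
        return False
```

Unlike a fixed window, this never allows more than `limit` requests in any 60-second span, regardless of where the span starts.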
### Fixed Window (Alternative)

For simpler, cheaper rate limiting (with the usual caveat that bursts straddling a window boundary can briefly exceed the limit):

```yaml
rate_limiting:
  algorithm: fixed_window
  window:
    size: 60s
```
## Burst Handling

Allow temporary bursts above the steady-state limit:

```yaml
rate_limiting:
  burst:
    enabled: true

    # Allow 2x normal rate for short bursts
    multiplier: 2.0

    # Burst window
    window: 10s
```
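One common way to implement this is a token bucket whose extra capacity absorbs the burst: tokens refill at the steady rate, and a full bucket holds enough headroom to sustain `multiplier`× that rate for the burst window. A sketch under that assumption (this is one possible interpretation of the config above, not InferXgate's actual implementation):

```python
class BurstBucket:
    """Token-bucket sketch of burst handling."""

    def __init__(self, rate_per_s: float, multiplier: float = 2.0,
                 burst_window_s: float = 10.0):
        # Headroom such that a full bucket sustains multiplier x the steady
        # rate for burst_window_s, then throttles back to rate_per_s.
        self.rate = rate_per_s
        self.capacity = (multiplier - 1.0) * rate_per_s * burst_window_s
        self.tokens = self.capacity
        self.last = 0.0

    def allow(self, now: float) -> bool:
        """Spend one token if available; tokens refill at the steady rate."""
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```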
## Rate Limit Bypass

Allow certain requests to bypass rate limits:

```yaml
rate_limiting:
  bypass:
    # Bypass for health checks
    paths:
      - "/health"
      - "/metrics"

    # Bypass for admin keys
    keys:
      - "admin_key_xxxxx"

    # Bypass for internal IPs
    ips:
      - "10.0.0.0/8"
      - "192.168.0.0/16"
```
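The IP entries are CIDR ranges, so matching means checking whether the client address falls inside any configured network. Python's standard `ipaddress` module makes this easy to verify (the function name here is just for illustration):

```python
import ipaddress

# The bypass CIDRs from the config above.
BYPASS_NETS = [ipaddress.ip_network(c) for c in ("10.0.0.0/8", "192.168.0.0/16")]

def is_bypassed_ip(client_ip: str) -> bool:
    """True if the client IP falls inside any configured bypass network."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in BYPASS_NETS)
```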
## Monitoring Rate Limits

### Prometheus Metrics

Useful PromQL queries:

```promql
# Rate limit hits
rate(inferxgate_rate_limit_hits_total[5m])

# Rate limit hits by key
inferxgate_rate_limit_hits_total{key="key_xxxxx"}

# Current usage percentage
inferxgate_rate_limit_usage_ratio
```
### Alerting

Set up alerts for rate limit abuse:

```yaml
# Prometheus alert rule
groups:
  - name: inferxgate
    rules:
      - alert: HighRateLimitHits
        expr: rate(inferxgate_rate_limit_hits_total[5m]) > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High rate of rate limit hits"
```
## Best Practices

### Setting Appropriate Limits

- **Start conservative**: Begin with lower limits and increase based on usage
- **Monitor patterns**: Use metrics to understand actual usage
- **Tiered limits**: Offer different tiers for different user types
- **Document limits**: Clearly communicate limits to API users
### Handling Rate Limits Client-Side

Implement exponential backoff with jitter so retries don't pile up at the same instant:

```python
import random
import time

from openai import OpenAI, RateLimitError

client = OpenAI(base_url="http://localhost:3000/v1")

def call_with_retry(func, max_retries=5):
    for attempt in range(max_retries):
        try:
            return func()
        except RateLimitError:
            # Exponential backoff with jitter: ~1s, ~2s, ~4s, ...
            wait_time = (2 ** attempt) + random.uniform(0, 1)
            print(f"Rate limited, waiting {wait_time:.1f}s")
            time.sleep(wait_time)
    raise RuntimeError("Max retries exceeded")

# Usage
response = call_with_retry(lambda: client.chat.completions.create(
    model="claude-sonnet-4-20250514",
    messages=[{"role": "user", "content": "Hello!"}]
))
```
### Multi-Tenant Deployments

For SaaS applications:

```yaml
rate_limiting:
  multi_tenant:
    enabled: true

    # Identify tenant from header or JWT
    tenant_source: header
    tenant_header: "X-Tenant-ID"

    # Default tenant limits
    default_limits:
      requests_per_minute: 60

    # Tenant-specific limits (from database)
    dynamic_limits:
      enabled: true
      refresh_interval: 60s
```
## Troubleshooting

### Rate Limits Not Applying

1. Check that rate limiting is enabled
2. Verify the storage backend is working
3. Check that bypass rules aren't matching unexpectedly
4. Review logs for rate limit decisions

### Inconsistent Limits (Distributed)

1. Ensure all instances use Redis storage
2. Check Redis connectivity
3. Verify clock synchronization across instances
4. Review Lua script execution