Everything you need to manage LLMs at scale
InferXgate provides a comprehensive set of features for routing, caching, monitoring, and securing your LLM infrastructure.
Unified API
One API to rule them all. Use your existing OpenAI SDK with any provider.
OpenAI SDK Compatibility
Drop-in replacement for OpenAI SDK. Works with Python, TypeScript, Go, and any OpenAI-compatible client.
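Because InferXgate speaks the OpenAI wire format, an OpenAI-compatible client only needs its base URL pointed at the gateway. As a dependency-free sketch of what such a client sends (the gateway address and API key below are placeholders, not InferXgate defaults), here is the request being built with the standard library:

```python
import json
import urllib.request

# Placeholder gateway address -- substitute your own deployment's host/port.
BASE_URL = "http://localhost:8080/v1"

def chat_request(api_key: str, model: str, prompt: str) -> urllib.request.Request:
    """Build the same POST an OpenAI SDK client would send, aimed at the gateway."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = chat_request("your-gateway-key", "gpt-4.1", "Hello!")
# Send with urllib.request.urlopen(req) against a running gateway.
```

The same request shape works whatever provider ultimately serves the model, which is what makes the gateway a drop-in replacement.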
Smart Model Routing
Automatically routes requests to the right provider based on model prefixes like claude-, gpt-, or gemini-.
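As a sketch of the idea (illustrative only, not InferXgate's internal code), prefix-based routing amounts to a lookup over a provider table keyed by model-name prefixes like those above:

```python
# Illustrative prefix table -- mirrors the examples in the text.
PROVIDER_PREFIXES = {
    "claude-": "anthropic",
    "gpt-": "openai",
    "gemini-": "google",
    "llama-": "groq",
}

def route(model: str) -> str:
    """Return the provider whose prefix matches the requested model name."""
    for prefix, provider in PROVIDER_PREFIXES.items():
        if model.startswith(prefix):
            return provider
    raise ValueError(f"no provider configured for model {model!r}")

print(route("claude-3-5-sonnet"))  # anthropic
print(route("gemini-2.5-pro"))     # google
```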
Real-time Streaming
Full Server-Sent Events (SSE) support for streaming responses from all providers.
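Streamed responses arrive as `data:` lines in the OpenAI SSE format, terminated by a `[DONE]` sentinel. A minimal parser for that format (a sketch assuming the OpenAI-style chunk schema; the sample stream below is canned):

```python
import json

def iter_sse_chunks(raw: str):
    """Yield parsed JSON payloads from an OpenAI-style SSE stream body."""
    for line in raw.splitlines():
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":  # OpenAI-style end-of-stream sentinel
            break
        yield json.loads(data)

# A canned two-chunk stream, as a provider might send it:
sample = (
    'data: {"choices": [{"delta": {"content": "Hel"}}]}\n\n'
    'data: {"choices": [{"delta": {"content": "lo"}}]}\n\n'
    'data: [DONE]\n\n'
)
text = "".join(c["choices"][0]["delta"]["content"] for c in iter_sse_chunks(sample))
print(text)  # Hello
```

In practice the OpenAI SDK's `stream=True` handles this parsing for you; the sketch just shows what crosses the wire.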
Function Calling
Unified function/tool calling interface across all supported providers.
Provider Support
Access all major LLM providers through a single, unified interface.
Anthropic Claude
Full support for Claude 4, Claude 3.5 Sonnet, Claude 3 Opus, and Haiku models.
OpenAI
Complete GPT-5, GPT-4.1, GPT-4 Turbo, and GPT-3.5 model support with all features.
Google Gemini
Gemini 2.5 Pro, Gemini Flash, and Gemini 1.5 Pro models ready to use.
Azure OpenAI
Enterprise Azure OpenAI deployments with custom endpoint support.
AWS Bedrock
Amazon Bedrock integration for Claude, Titan, and other models.
Groq
Ultra-fast inference with Llama and Mixtral models.
Performance
Built in Rust for maximum performance and minimal resource usage.
Intelligent Caching
Redis-powered semantic caching delivers 60-90% faster responses for repeated or similar queries.
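To make the mechanism concrete: a toy in-process version of this idea (exact-match only after light normalization; InferXgate's semantic cache matches *similar* queries, e.g. via embeddings, and is backed by Redis rather than a local dict):

```python
import hashlib

_cache: dict = {}  # toy stand-in for Redis

def cache_key(model: str, prompt: str) -> str:
    """Normalize case/whitespace so trivial repeats share a key."""
    normalized = " ".join(prompt.lower().split())
    return hashlib.sha256(f"{model}:{normalized}".encode()).hexdigest()

def cached_complete(model: str, prompt: str, call_provider) -> str:
    key = cache_key(model, prompt)
    if key not in _cache:
        _cache[key] = call_provider(model, prompt)  # miss: pay full latency
    return _cache[key]  # hit: skip the provider round trip

calls = []
def fake_provider(model, prompt):
    calls.append(prompt)
    return f"echo:{prompt}"

cached_complete("gpt-4.1", "What is Rust?", fake_provider)
cached_complete("gpt-4.1", "  what is RUST? ", fake_provider)  # normalizes to a hit
print(len(calls))  # 1 -- the second query never reached the provider
```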
Connection Pooling
Maintains 10 persistent connections per host for optimal throughput and reduced latency.
<5ms Latency Overhead
Minimal processing overhead means your requests reach providers faster.
10,000+ Requests/Second
Handle massive scale with Rust's async runtime and efficient memory management.
Analytics & Monitoring
Complete visibility into your LLM usage, costs, and performance.
Real-time Dashboard
Beautiful React dashboard showing requests, latency, costs, and usage patterns.
Prometheus Metrics
Export metrics to Prometheus for custom dashboards and alerting in Grafana.
Per-Request Cost Tracking
Track costs per request, user, model, and time period with detailed breakdowns.
Usage Statistics API
Programmatic access to usage data via /stats endpoint for custom integrations.
Load Balancing
Intelligent request distribution across multiple providers and API keys.
Round-Robin
Distribute requests evenly across all configured providers and keys.
Least Latency
Automatically route to the fastest responding provider based on recent metrics.
Least Cost
Optimize spending by routing to the most cost-effective provider for each request.
Automatic Failover
Seamlessly switch to backup providers when primary endpoints fail or rate limit.
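The round-robin and failover strategies above compose naturally: rotate through the configured providers, and when one errors, move on to the next in the ring. A minimal sketch of that combination (illustrative only; provider names and the error type are placeholders):

```python
from itertools import cycle

PROVIDERS = ["openai-primary", "openai-secondary", "azure-backup"]
_rotation = cycle(PROVIDERS)  # round-robin state shared across requests

def dispatch(send):
    """Round-robin across providers, failing over when one errors."""
    last_error = None
    for _ in range(len(PROVIDERS)):  # at most one full rotation per request
        provider = next(_rotation)
        try:
            return send(provider)
        except ConnectionError as err:  # e.g. the provider is down or rate-limited
            last_error = err  # fall through to the next provider in the ring
    raise last_error

# Simulate the primary being rate-limited:
def flaky_send(provider):
    if provider == "openai-primary":
        raise ConnectionError("429 Too Many Requests")
    return f"ok via {provider}"

print(dispatch(flaky_send))  # ok via openai-secondary
```

A production balancer would also track per-provider latency and cost to implement the least-latency and least-cost strategies, but the failover skeleton is the same.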
Security & Auth
Enterprise-grade security features to protect your AI infrastructure.
JWT Authentication
Secure API access with industry-standard JSON Web Tokens and configurable expiry.
Virtual API Keys
Create multiple API keys per user with individual rate limits and permissions.
Rate Limiting
Sliding window rate limiting to prevent abuse and control costs per key or user.
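A sliding window keeps the timestamps of recent requests per key and rejects a request once the window is full, so bursts at a window boundary can't double the allowed rate. A self-contained sketch of the technique (not InferXgate's implementation):

```python
import time
from collections import deque

class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window` seconds, per key."""

    def __init__(self, limit: int, window: float):
        self.limit, self.window = limit, window
        self._hits = {}  # key -> deque of request timestamps

    def allow(self, key, now=None):
        now = time.monotonic() if now is None else now
        hits = self._hits.setdefault(key, deque())
        while hits and now - hits[0] >= self.window:
            hits.popleft()  # drop timestamps that slid out of the window
        if len(hits) >= self.limit:
            return False  # over the per-key limit
        hits.append(now)
        return True

limiter = SlidingWindowLimiter(limit=2, window=60.0)
print(limiter.allow("key-a", now=0.0))   # True
print(limiter.allow("key-a", now=1.0))   # True
print(limiter.allow("key-a", now=2.0))   # False -- window full
print(limiter.allow("key-a", now=61.0))  # True -- earlier hits expired
```

The explicit `now` parameter is just for deterministic illustration; in service code `time.monotonic()` is used.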
Domain Whitelisting
Restrict API access to specific email domains for enterprise security.
Enterprise Features Coming Soon
We're building enterprise-grade features for teams that need more control, compliance, and collaboration capabilities.
- Organizations & Teams
- Role-Based Access Control (RBAC)
- SAML/OIDC Single Sign-On
- Audit Logging
- Priority Support & SLA
- Custom Integrations
Interested in enterprise features?
Contact Us
Ready to get started?
InferXgate is free, open-source, and self-hosted. Get up and running in minutes with Docker.