Everything you need to manage LLMs at scale
InferXgate provides a comprehensive set of features for routing, caching, monitoring, and securing your LLM infrastructure.
Unified API
One API to rule them all. Use your existing OpenAI SDK with any provider.
OpenAI SDK Compatibility
Drop-in replacement for OpenAI SDK. Works with Python, TypeScript, Go, and any OpenAI-compatible client.
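Because InferXgate speaks the OpenAI wire format, an OpenAI-compatible client only needs its base URL pointed at the gateway. As a dependency-free sketch of what such a client sends (the gateway address and API key below are placeholders, not InferXgate defaults), here is the request being built with the standard library:

```python
import json
import urllib.request

# Placeholder gateway address -- substitute your own deployment's host/port.
BASE_URL = "http://localhost:8080/v1"

def chat_request(api_key: str, model: str, prompt: str) -> urllib.request.Request:
    """Build the same POST an OpenAI SDK client would send, aimed at the gateway."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = chat_request("your-gateway-key", "gpt-4.1", "Hello!")
# Send with urllib.request.urlopen(req) against a running gateway.
```

The same request shape works whatever provider ultimately serves the model, which is what makes the gateway a drop-in replacement.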
Smart Model Routing
Automatically routes requests to the right provider based on model prefixes like claude-, gpt-, or gemini-.
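As a sketch of the idea (illustrative only, not InferXgate's internal code), prefix-based routing amounts to a lookup over a provider table keyed by model-name prefixes like those above:

```python
# Illustrative prefix table -- mirrors the examples in the text.
PROVIDER_PREFIXES = {
    "claude-": "anthropic",
    "gpt-": "openai",
    "gemini-": "google",
    "llama-": "groq",
}

def route(model: str) -> str:
    """Return the provider whose prefix matches the requested model name."""
    for prefix, provider in PROVIDER_PREFIXES.items():
        if model.startswith(prefix):
            return provider
    raise ValueError(f"no provider configured for model {model!r}")

print(route("claude-3-5-sonnet"))  # anthropic
print(route("gemini-2.5-pro"))     # google
```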
Real-time Streaming
Full Server-Sent Events (SSE) support for streaming responses from all providers.
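Streamed responses arrive as `data:` lines in the OpenAI SSE format, terminated by a `[DONE]` sentinel. A minimal parser for that format (a sketch assuming the OpenAI-style chunk schema; the sample stream below is canned):

```python
import json

def iter_sse_chunks(raw: str):
    """Yield parsed JSON payloads from an OpenAI-style SSE stream body."""
    for line in raw.splitlines():
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":  # OpenAI-style end-of-stream sentinel
            break
        yield json.loads(data)

# A canned two-chunk stream, as a provider might send it:
sample = (
    'data: {"choices": [{"delta": {"content": "Hel"}}]}\n\n'
    'data: {"choices": [{"delta": {"content": "lo"}}]}\n\n'
    'data: [DONE]\n\n'
)
text = "".join(c["choices"][0]["delta"]["content"] for c in iter_sse_chunks(sample))
print(text)  # Hello
```

In practice the OpenAI SDK's `stream=True` handles this parsing for you; the sketch just shows what crosses the wire.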
Function Calling
Unified function/tool calling interface across all supported providers.
Provider Support
Access all major LLM providers through a single, unified interface.
Anthropic Claude
Full support for Claude 4, Claude 3.5 Sonnet, Claude 3 Opus, and Haiku models.
OpenAI
Complete GPT-5, GPT-4.1, GPT-4 Turbo, and GPT-3.5 model support with all features.
Google Gemini
Gemini 2.5 Pro, Gemini Flash, and Gemini 1.5 Pro models ready to use.
Azure OpenAI
Enterprise Azure OpenAI deployments with custom endpoint support.
AWS Bedrock
Amazon Bedrock integration for Claude, Titan, and other models.
Groq
Ultra-fast inference with Llama and Mixtral models.
Performance
Built in Rust for maximum performance and minimal resource usage.
Intelligent Caching
Redis-powered semantic caching delivers 60-90% faster responses for repeated or similar queries.
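To make the mechanism concrete: a toy in-process version of this idea (exact-match only after light normalization; InferXgate's semantic cache matches *similar* queries, e.g. via embeddings, and is backed by Redis rather than a local dict):

```python
import hashlib

_cache: dict = {}  # toy stand-in for Redis

def cache_key(model: str, prompt: str) -> str:
    """Normalize case/whitespace so trivial repeats share a key."""
    normalized = " ".join(prompt.lower().split())
    return hashlib.sha256(f"{model}:{normalized}".encode()).hexdigest()

def cached_complete(model: str, prompt: str, call_provider) -> str:
    key = cache_key(model, prompt)
    if key not in _cache:
        _cache[key] = call_provider(model, prompt)  # miss: pay full latency
    return _cache[key]  # hit: skip the provider round trip

calls = []
def fake_provider(model, prompt):
    calls.append(prompt)
    return f"echo:{prompt}"

cached_complete("gpt-4.1", "What is Rust?", fake_provider)
cached_complete("gpt-4.1", "  what is RUST? ", fake_provider)  # normalizes to a hit
print(len(calls))  # 1 -- the second query never reached the provider
```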
Connection Pooling
Maintains 10 persistent connections per host for optimal throughput and reduced latency.
<5ms Latency Overhead
Minimal processing overhead means your requests reach providers faster.
10,000+ Requests/Second
Handle massive scale with Rust's async runtime and efficient memory management.
Analytics & Monitoring
Complete visibility into your LLM usage, costs, and performance.
Real-time Dashboard
Beautiful React dashboard showing requests, latency, costs, and usage patterns.
Prometheus Metrics
Export metrics to Prometheus for custom dashboards and alerting in Grafana.
Per-Request Cost Tracking
Track costs per request, user, model, and time period with detailed breakdowns.
Usage Statistics API
Programmatic access to usage data via /stats endpoint for custom integrations.
Load Balancing
Intelligent request distribution across multiple providers and API keys.
Round-Robin
Distribute requests evenly across all configured providers and keys.
Least Latency
Automatically route to the fastest responding provider based on recent metrics.
Least Cost
Optimize spending by routing to the most cost-effective provider for each request.
Automatic Failover
Seamlessly switch to backup providers when primary endpoints fail or rate limit.
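The round-robin and failover strategies above compose naturally: rotate through the configured providers, and when one errors, move on to the next in the ring. A minimal sketch of that combination (illustrative only; provider names and the error type are placeholders):

```python
from itertools import cycle

PROVIDERS = ["openai-primary", "openai-secondary", "azure-backup"]
_rotation = cycle(PROVIDERS)  # round-robin state shared across requests

def dispatch(send):
    """Round-robin across providers, failing over when one errors."""
    last_error = None
    for _ in range(len(PROVIDERS)):  # at most one full rotation per request
        provider = next(_rotation)
        try:
            return send(provider)
        except ConnectionError as err:  # e.g. the provider is down or rate-limited
            last_error = err  # fall through to the next provider in the ring
    raise last_error

# Simulate the primary being rate-limited:
def flaky_send(provider):
    if provider == "openai-primary":
        raise ConnectionError("429 Too Many Requests")
    return f"ok via {provider}"

print(dispatch(flaky_send))  # ok via openai-secondary
```

A production balancer would also track per-provider latency and cost to implement the least-latency and least-cost strategies, but the failover skeleton is the same.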
Security & Auth
Enterprise-grade security features to protect your AI infrastructure.
JWT Authentication
Secure API access with industry-standard JSON Web Tokens and configurable expiry.
Virtual API Keys
Create multiple API keys per user with individual rate limits and permissions.
Rate Limiting
Sliding window rate limiting to prevent abuse and control costs per key or user.
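A sliding window keeps the timestamps of recent requests per key and rejects a request once the window is full, so bursts at a window boundary can't double the allowed rate. A self-contained sketch of the technique (not InferXgate's implementation):

```python
import time
from collections import deque

class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window` seconds, per key."""

    def __init__(self, limit: int, window: float):
        self.limit, self.window = limit, window
        self._hits = {}  # key -> deque of request timestamps

    def allow(self, key, now=None):
        now = time.monotonic() if now is None else now
        hits = self._hits.setdefault(key, deque())
        while hits and now - hits[0] >= self.window:
            hits.popleft()  # drop timestamps that slid out of the window
        if len(hits) >= self.limit:
            return False  # over the per-key limit
        hits.append(now)
        return True

limiter = SlidingWindowLimiter(limit=2, window=60.0)
print(limiter.allow("key-a", now=0.0))   # True
print(limiter.allow("key-a", now=1.0))   # True
print(limiter.allow("key-a", now=2.0))   # False -- window full
print(limiter.allow("key-a", now=61.0))  # True -- earlier hits expired
```

The explicit `now` parameter is just for deterministic illustration; in service code `time.monotonic()` is used.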
Domain Whitelisting
Restrict API access to specific email domains for enterprise security.
Enterprise Features Coming Soon
We're building enterprise-grade features for teams that need more control, compliance, and collaboration capabilities.
- Organizations & Teams
- Role-Based Access Control (RBAC)
- SAML/OIDC Single Sign-On
- Audit Logging
- Priority Support & SLA
- Custom Integrations
Interested in enterprise features?
Contact Us
Ready to get started?
InferXgate is free, open-source, and self-hosted. Get up and running in minutes with Docker.