# Why We Chose Rust for Our LLM Gateway
by InferXgate Team
When we set out to build InferXgate, choosing the right programming language was one of our most important decisions. After evaluating several options, we chose Rust. Here’s why.
## The Problem with Existing Solutions
Most LLM proxy solutions are built in Python or Node.js. These stacks are great for rapid development, but they introduce challenges at scale:
- Garbage collection pauses: Unpredictable latency spikes
- Memory overhead: High per-connection memory usage
- CPU inefficiency: Interpreter overhead on every request
- Concurrency limitations: GIL in Python, single-threaded event loop in Node
For a gateway sitting in the critical path of every AI request, these trade-offs are unacceptable.
## Why Rust?
### Zero-Cost Abstractions
Rust’s core philosophy is “zero-cost abstractions”—you don’t pay for what you don’t use. This means we can write high-level, maintainable code without sacrificing performance.
```rust
// High-level async code that compiles to efficient machine code
async fn proxy_request(&self, request: ChatRequest) -> Result<ChatResponse> {
    let provider = self.router.select_provider(&request)?;
    let response = provider.complete(request).await?;
    self.metrics.record(&response);
    Ok(response)
}
```
### Predictable Performance
Unlike garbage-collected languages, Rust gives us:
- No GC pauses: Memory is freed deterministically
- Consistent latency: P99 stays close to P50
- Predictable throughput: No unexpected slowdowns
Our benchmarks show under 5ms added latency at the 99th percentile—something difficult to achieve with GC’d languages.
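To make "freed deterministically" concrete, here is a minimal, generic sketch (not InferXgate code): every allocation is released at a known point, the moment its owner goes out of scope, so there is no collector that can pause the request path.

```rust
// Generic illustration of deterministic cleanup via ownership (RAII),
// not actual InferXgate code.
fn handle_payload(payload: &[u8]) -> usize {
    let buffer = payload.to_vec(); // heap allocation owned by this scope
    buffer.len()
    // `buffer` is dropped and its memory returned right here, at the end of
    // the scope -- no garbage collector, no pause, no unpredictable timing.
}

fn main() {
    let n = handle_payload(b"example request body");
    println!("processed {n} bytes");
}
```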
### Memory Safety Without Runtime Cost
Rust’s ownership system prevents memory bugs at compile time:
- No null pointer exceptions
- No use-after-free bugs
- No data races in concurrent code
This safety comes with zero runtime overhead—the checks happen entirely at compile time.
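As a small illustration of the "no data races" guarantee (a generic sketch, not code from InferXgate): state shared across tasks only compiles once it is wrapped in a thread-safe type, such as an atomic behind an `Arc`. Handing an unsynchronized mutable reference to multiple tasks is rejected at compile time.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

// Generic sketch: a counter shared safely across Tokio tasks.
// Replacing the atomic with a plain `&mut u64` shared between tasks
// would be a compile-time error, ruling out the data race before runtime.
#[tokio::main]
async fn main() {
    let request_count = Arc::new(AtomicU64::new(0));

    let handles: Vec<_> = (0..4)
        .map(|_| {
            let counter = Arc::clone(&request_count);
            tokio::spawn(async move {
                counter.fetch_add(1, Ordering::Relaxed);
            })
        })
        .collect();

    for handle in handles {
        handle.await.unwrap();
    }

    println!("handled {} requests", request_count.load(Ordering::Relaxed));
}
```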
### Async/Await Done Right
Rust's async/await runs on a runtime of your choice (we use Tokio), which provides:
- Work-stealing scheduler for optimal CPU utilization
- Efficient async I/O without callback hell
- True parallelism across all CPU cores
```rust
use tokio::net::{TcpListener, TcpStream};

// Handle thousands of concurrent connections efficiently
#[tokio::main]
async fn main() {
    let listener = TcpListener::bind("0.0.0.0:3000").await.unwrap();
    loop {
        let (socket, _) = listener.accept().await.unwrap();
        // Each connection becomes a lightweight task on the work-stealing scheduler
        tokio::spawn(handle_connection(socket));
    }
}

async fn handle_connection(_socket: TcpStream) {
    // Proxy logic goes here (stubbed for illustration)
}
```
## Performance Results
After building InferXgate in Rust, our benchmarks showed:
| Metric | InferXgate (Rust) | Python Proxy | Node.js Proxy |
|---|---|---|---|
| Latency (P50) | 1.2ms | 8.5ms | 5.2ms |
| Latency (P99) | 4.8ms | 45ms | 28ms |
| Throughput | 12,000 req/s | 1,200 req/s | 2,500 req/s |
| Memory (idle) | 12MB | 85MB | 65MB |
| Memory (load) | 45MB | 450MB | 280MB |
## The Trade-offs
Rust isn’t without trade-offs:
- Steeper learning curve: The borrow checker takes time to learn
- Longer compile times: type checking, monomorphization, and optimization all add up
- Smaller ecosystem: Fewer libraries than Python or Node
For a production infrastructure component like InferXgate, we believe the performance benefits far outweigh these costs.
## Conclusion
Building InferXgate in Rust was the right choice for our use case. The performance characteristics—low latency, high throughput, minimal memory usage—are exactly what you need for infrastructure sitting in the critical path of AI requests.
If you’re building performance-sensitive infrastructure, we highly recommend giving Rust a serious look.
Interested in learning more? Check out our GitHub repository or join our Discord community.