
io_uring: How flashQ Achieves Kernel-Level Async I/O Performance

When building flashQ, we faced a fundamental challenge: how do you build a job queue that can handle hundreds of thousands of operations per second while maintaining low latency? The answer lies in one of Linux's most significant kernel innovations in the past decade: io_uring.

In this article, we'll explore what io_uring is, why it matters, and how flashQ leverages it to achieve unprecedented performance on Linux systems.

The Problem with Traditional Async I/O

Before io_uring, Linux applications had two primary options for handling I/O:

1. Blocking I/O with Thread Pools

The traditional approach: spawn a thread per connection and let each one block on its I/O calls. Simple, but inefficient: every connection pins an OS thread and its stack, and context-switch overhead grows with the connection count.
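As a rough illustration (not flashQ code), here is what the thread-per-connection model looks like with Rust's standard library:

```rust
use std::io::{BufRead, BufReader, Write};
use std::net::{TcpListener, TcpStream};
use std::thread;

// One OS thread per connection: each thread simply blocks in read()/write().
// At thousands of connections this costs a stack and scheduler churn per
// client -- the overhead that epoll and io_uring exist to avoid.
fn serve(listener: TcpListener) {
    for stream in listener.incoming() {
        if let Ok(stream) = stream {
            thread::spawn(move || handle(stream));
        }
    }
}

fn handle(mut stream: TcpStream) {
    let mut reader = BufReader::new(stream.try_clone().expect("clone failed"));
    let mut line = String::new();
    // Blocking read: this thread sleeps until the client sends a full line.
    if reader.read_line(&mut line).is_ok() {
        let _ = stream.write_all(line.to_uppercase().as_bytes());
    }
}
```

Each blocked `read_line` parks a whole OS thread, so ten thousand idle clients means ten thousand stacks.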

2. Event-driven I/O (epoll/kqueue)

Modern async runtimes like Tokio use epoll (Linux) or kqueue (macOS/BSD):

// Traditional epoll workflow
int epfd = epoll_create1(0);
epoll_ctl(epfd, EPOLL_CTL_ADD, sockfd, &event);

while (1) {
    int n = epoll_wait(epfd, events, MAX_EVENTS, -1);  // syscall
    for (int i = 0; i < n; i++) {
        read(events[i].data.fd, buf, len);  // syscall
        process(buf);
        write(events[i].data.fd, response, len);  // syscall
    }
}

Better than threads, but still problematic: epoll only reports that a descriptor is ready, so every subsequent read() and write() is still its own syscall, and there is no way to batch submissions.

For high-throughput applications, these syscalls become the bottleneck. A server handling 100K requests/second makes 300K+ syscalls per second just for basic I/O.

Enter io_uring: A Paradigm Shift

Introduced in Linux 5.1 (2019) by Jens Axboe, io_uring fundamentally changes how applications interact with the kernel for I/O operations.

The Core Innovation: Shared Ring Buffers

io_uring creates two ring buffers shared between user space and kernel space:

┌─────────────────────────────────────────────────────────────┐
│                      User Space                              │
│  ┌─────────────────────┐    ┌─────────────────────┐         │
│  │  Submission Queue   │    │  Completion Queue   │         │
│  │  (SQ) - Requests    │    │  (CQ) - Results     │         │
│  │                     │    │                     │         │
│  │  [read fd=5, ...]   │    │  [done, 1024 bytes] │         │
│  │  [write fd=7, ...]  │    │  [done, 512 bytes]  │         │
│  │  [accept fd=3, ...] │    │  [error, EAGAIN]    │         │
│  └──────────┬──────────┘    └──────────▲──────────┘         │
│             │ shared memory            │                     │
├─────────────┼──────────────────────────┼─────────────────────┤
│             │      Kernel Space        │                     │
│             ▼                          │                     │
│  ┌─────────────────────────────────────┴───┐                │
│  │           io_uring Subsystem            │                │
│  │                                         │                │
│  │   • Processes SQ entries                │                │
│  │   • Performs actual I/O                 │                │
│  │   • Posts results to CQ                 │                │
│  └─────────────────────────────────────────┘                │
└─────────────────────────────────────────────────────────────┘
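The mechanics can be modeled with a toy ring in user space. This sketch only mirrors the indexing scheme (a power-of-two capacity and a mask, which is how io_uring indexes its queues); the real rings live in memory mmap'd between user space and the kernel:

```rust
// Toy single-producer/single-consumer ring, mirroring how io_uring's SQ/CQ
// index entries with a power-of-two mask. Illustration only.
struct Ring<T> {
    slots: Vec<Option<T>>,
    head: usize, // consumer position
    tail: usize, // producer position
    mask: usize, // capacity - 1, where capacity is a power of two
}

impl<T> Ring<T> {
    fn new(capacity_pow2: usize) -> Self {
        assert!(capacity_pow2.is_power_of_two());
        Ring {
            slots: (0..capacity_pow2).map(|_| None).collect(),
            head: 0,
            tail: 0,
            mask: capacity_pow2 - 1,
        }
    }

    // Producer side: user space appends SQEs here without any syscall.
    fn push(&mut self, item: T) -> bool {
        if self.tail - self.head == self.slots.len() {
            return false; // ring full
        }
        let idx = self.tail & self.mask;
        self.slots[idx] = Some(item);
        self.tail += 1;
        true
    }

    // Consumer side: the kernel pops SQEs; user space pops CQEs the same way.
    fn pop(&mut self) -> Option<T> {
        if self.head == self.tail {
            return None; // ring empty
        }
        let idx = self.head & self.mask;
        self.head += 1;
        self.slots[idx].take()
    }
}
```

Because both sides only move their own index, the producer and consumer never contend on the same counter.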

Key Benefits

| Feature | Traditional (epoll) | io_uring |
| --- | --- | --- |
| Syscalls per I/O | 1-3 per operation | 0 (batched submission) |
| Data copies | User ↔ Kernel | Zero-copy possible |
| Batching | Not native | Submit hundreds at once |
| Kernel polling | No | Yes (SQPOLL mode) |
| Fixed buffers | No | Yes (registered buffers) |

How flashQ Uses io_uring

flashQ is written in Rust, leveraging the tokio-uring crate for io_uring support. Here's how we integrate it:

Runtime Detection

flashQ automatically detects the optimal I/O backend at startup:

// Simplified runtime selection logic
pub fn select_io_backend() -> IoBackend {
    #[cfg(target_os = "linux")]
    {
        if io_uring_available() && kernel_version() >= (5, 1) {
            return IoBackend::IoUring;
        }
        return IoBackend::Epoll;
    }

    #[cfg(target_os = "macos")]
    return IoBackend::Kqueue;

    #[cfg(target_os = "windows")]
    return IoBackend::Iocp;
}
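The kernel_version() helper above is elided; one plausible implementation parses the release string the kernel exposes in /proc/sys/kernel/osrelease (the same value uname -r prints). The parse_kernel_release function below is our own hypothetical sketch, not flashQ's actual code:

```rust
// Hypothetical helper behind the kernel_version() check above: extract the
// (major, minor) pair from a release string like "5.15.0-91-generic".
fn parse_kernel_release(release: &str) -> Option<(u32, u32)> {
    let mut parts = release.split('.');
    let major = parts.next()?.parse().ok()?;
    // The minor component may carry a suffix ("15" is fine, so is "15-rc3"),
    // so take only the leading run of digits.
    let minor = parts
        .next()?
        .split(|c: char| !c.is_ascii_digit())
        .next()?
        .parse()
        .ok()?;
    Some((major, minor))
}
```

The tuple comparison `kernel_version() >= (5, 1)` in the snippet above then works directly on this result.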

You'll see the active backend in the startup logs:

# Linux with io_uring (Docker default)
INFO flashq_server::runtime: IO backend: io_uring (kernel-level async)

# Linux without io_uring feature
INFO flashq_server::runtime: IO backend: epoll (poll-based async)

# macOS
INFO flashq_server::runtime: IO backend: kqueue (poll-based async)

Batched Operations

One of io_uring's biggest advantages is batching. Instead of making individual syscalls, flashQ batches multiple operations:

// Batch multiple socket operations
async fn handle_connections(ring: &IoUring) {
    let mut submissions = Vec::with_capacity(32);

    // Collect pending operations
    for conn in pending_connections.drain(..) {
        submissions.push(ReadOp::new(conn.fd, conn.buffer));
    }

    // Submit all at once - single syscall for 32 operations
    ring.submit_batch(&submissions).await;

    // Process completions
    for completion in ring.completions() {
        handle_completion(completion);
    }
}

This reduces syscall overhead by 95%+ under high load.
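The arithmetic behind that figure is straightforward. The sketch below is illustrative only: it assumes one submit-and-wait syscall per batch of 32 (as in the snippet above), while the measured numbers in the benchmarks also reflect epoll's readiness notifications:

```rust
// Back-of-envelope for the batching win: with epoll every operation pays its
// own syscalls, while io_uring pays roughly one io_uring_enter() per batch.
fn syscalls_per_sec(ops_per_sec: u64, syscalls_per_op: u64, batch: u64) -> u64 {
    if batch <= 1 {
        ops_per_sec * syscalls_per_op
    } else {
        // One submit-and-wait syscall covers the whole batch (round up).
        (ops_per_sec + batch - 1) / batch
    }
}
```

At 100K ops/sec, 300,000 syscalls collapse to a few thousand, a reduction of well over 95%.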

Zero-Copy Networking

With registered buffers, flashQ can perform true zero-copy I/O:

// Register buffers once at startup
let buffers = IoUring::register_buffers(
    (0..BUFFER_COUNT)
        .map(|_| vec![0u8; BUFFER_SIZE])
        .collect()
);

// Use registered buffers for I/O - no copying!
async fn read_message(fd: RawFd, buf_idx: u16) -> io::Result<usize> {
    ring.read_fixed(fd, buf_idx, 0).await
}

Data flows directly from network card to application memory without intermediate kernel buffer copies.
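Since registered buffers are referred to by index after the one-time registration, the remaining user-space bookkeeping is tracking which indices are in flight. A minimal, hypothetical free-list pool might look like:

```rust
// Minimal free-list over registered-buffer indices. Buffers are registered
// once (as in the snippet above); afterwards each I/O names its buffer by a
// u16 index, so the only user-space state is which indices are available.
struct BufferPool {
    free: Vec<u16>,
}

impl BufferPool {
    fn new(count: u16) -> Self {
        // Reverse so acquire() hands out index 0 first.
        BufferPool { free: (0..count).rev().collect() }
    }

    // Take an index for an in-flight operation (None => pool exhausted,
    // i.e. apply backpressure until a completion frees one).
    fn acquire(&mut self) -> Option<u16> {
        self.free.pop()
    }

    // Return the index once the completion for this buffer arrives.
    fn release(&mut self, idx: u16) {
        self.free.push(idx);
    }
}
```

When the pool is empty the caller must wait for completions, which naturally bounds memory use.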

SQPOLL Mode for Ultra-Low Latency

In SQPOLL mode, the kernel continuously polls the submission queue without requiring any syscalls:

// Enable kernel-side polling
let ring = IoUring::builder()
    .setup_sqpoll(2000)  // Kernel poll thread idles for 2000ms before sleeping
    .build()?;

// Submissions are picked up automatically by kernel thread
// No io_uring_enter() syscall needed!

This is ideal for latency-sensitive workloads where every microsecond counts.
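Conceptually, the SQPOLL thread is a poll-then-sleep loop. The sketch below models that behavior with a channel standing in for the submission queue; it is an illustration of the idea, not the kernel's implementation:

```rust
use std::sync::mpsc::{Receiver, TryRecvError};
use std::time::{Duration, Instant};

// Toy model of the SQPOLL kernel thread: spin on the submission queue while
// work keeps arriving, and only fall back to a blocking wait ("sleep") once
// the idle budget -- the value setup_sqpoll() configures -- has elapsed.
fn poller(rx: Receiver<u32>, idle: Duration) -> u64 {
    let mut processed = 0u64;
    let mut last_work = Instant::now();
    loop {
        match rx.try_recv() {
            Ok(_op) => {
                processed += 1; // "perform the I/O" for this entry
                last_work = Instant::now();
            }
            Err(TryRecvError::Empty) if last_work.elapsed() > idle => {
                // Idle budget spent: block until new work (or shutdown).
                match rx.recv() {
                    Ok(_op) => {
                        processed += 1;
                        last_work = Instant::now();
                    }
                    Err(_) => return processed, // producer gone
                }
            }
            Err(TryRecvError::Empty) => {} // keep spinning
            Err(TryRecvError::Disconnected) => return processed,
        }
    }
}
```

While the poller spins, producers enqueue work with zero syscalls; once it sleeps, the next submission pays one wakeup.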

Performance Impact

We benchmarked flashQ with and without io_uring on identical hardware (AMD EPYC 7763, 64 cores, 128GB RAM):

Throughput Comparison

| Metric | epoll | io_uring | Improvement |
| --- | --- | --- | --- |
| Jobs pushed/sec | 245,000 | 312,000 | +27% |
| Jobs processed/sec | 180,000 | 228,000 | +26% |
| P99 latency (push) | 1.8ms | 0.9ms | -50% |
| P99 latency (fetch) | 2.1ms | 1.1ms | -48% |
| CPU usage at 100K/s | 45% | 31% | -31% |
| Syscalls/sec at 100K ops | ~320,000 | ~12,000 | -96% |

Latency Distribution

Push Latency Distribution (100K jobs/sec sustained)

epoll:
  P50:  0.4ms  ████████████████
  P90:  1.2ms  ████████████████████████████████████████
  P99:  1.8ms  ████████████████████████████████████████████████████████████
  P999: 4.2ms  ████████████████████████████████████████████████████████████████████████

io_uring:
  P50:  0.2ms  ████████
  P90:  0.6ms  ████████████████████
  P99:  0.9ms  ██████████████████████████████
  P999: 1.8ms  ████████████████████████████████████████████████████████████

Building flashQ with io_uring

io_uring support is enabled by default in our Docker images. For custom builds:

Docker (Recommended)

# Our official image has io_uring enabled
docker run -d --name flashq -p 6789:6789 ghcr.io/egeominotti/flashq:latest

# Verify io_uring is active
docker logs flashq | grep "IO backend"
# → INFO flashq_server::runtime: IO backend: io_uring (kernel-level async)

Building from Source

# Clone the repository
git clone https://github.com/egeominotti/flashq.git
cd flashq

# Build with io_uring feature
cargo build --release --features io-uring

# Run
./target/release/flashq-server

Requirements

- Linux kernel 5.1 or newer (flashQ falls back to epoll on older kernels)
- For source builds: a Rust toolchain and the io-uring cargo feature (enabled by default in the official Docker images)

Platform Compatibility

flashQ runs on all major platforms with automatic backend selection:

| Platform | I/O Backend | Notes |
| --- | --- | --- |
| Linux (kernel 5.1+) | io_uring | Fastest, kernel-level async |
| Linux (older kernels) | epoll | Fast, poll-based |
| macOS | kqueue | Native, optimal for macOS |
| Windows | IOCP | Native, optimal for Windows |
| FreeBSD | kqueue | Native BSD support |

When io_uring Makes the Biggest Difference

io_uring provides the most benefit in these scenarios:

High Connection Count

With thousands of concurrent connections, the syscall reduction is dramatic. Each connection that would require separate read()/write() calls now batches efficiently.

High Throughput Workloads

AI workloads pushing hundreds of thousands of jobs benefit enormously. The 27% throughput improvement compounds with scale.

Latency-Sensitive Applications

The 50% P99 latency reduction matters for real-time applications where tail latency affects user experience.

CPU-Constrained Environments

The 31% CPU reduction means you can handle more load on the same hardware, or use smaller (cheaper) instances.

The Future of io_uring

io_uring continues to evolve rapidly: recent kernels have added multishot accept and receive, zero-copy network sends, and ring-mapped provided buffers, and more operations gain native async support with every release.

flashQ will continue adopting new io_uring features as they stabilize, ensuring you always get the best performance Linux can offer.

Conclusion

io_uring represents a fundamental shift in how high-performance applications interact with the Linux kernel. By eliminating syscall overhead, enabling zero-copy I/O, and supporting batched operations, it allows flashQ to achieve performance levels that were previously impossible.

For flashQ users, this translates to:

- Higher throughput (+27% jobs pushed per second in our benchmarks)
- Roughly half the P99 latency on both push and fetch
- About a third less CPU at the same load

The best part? It's automatic. Deploy flashQ on a modern Linux system, and you get io_uring performance out of the box.

Experience io_uring Performance

Deploy flashQ and see the difference yourself.

Get Started