When we set out to build flashQ, our goal was simple: create the fastest job queue server possible, optimized for AI workloads, without requiring Redis. In this article, we'll dive deep into the architecture decisions that enable flashQ to process 1.9 million jobs per second.
## High-Level Architecture
flashQ is written in Rust and follows a sharded, lock-free architecture designed for maximum concurrency:
```
┌──────────────────────────────────────────────────────┐
│                    flashQ Server                     │
├──────────────────────────────────────────────────────┤
│   ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐    │
│   │ TCP/IP  │ │  HTTP   │ │  gRPC   │ │  Unix   │    │
│   │ Handler │ │   API   │ │   API   │ │ Socket  │    │
│   └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘    │
│        │           │           │           │         │
│        └───────────┴─────┬─────┴───────────┘         │
│                          │                           │
│               ┌──────────▼──────────┐                │
│               │   Command Router    │                │
│               └──────────┬──────────┘                │
│                          │                           │
│          ┌───────────────┼───────────────┐           │
│          ▼               ▼               ▼           │
│     ┌─────────┐     ┌─────────┐     ┌─────────┐      │
│     │ Shard 0 │     │ Shard 1 │ ... │Shard 31 │      │
│     └─────────┘     └─────────┘     └─────────┘      │
│                          │                           │
│               ┌──────────▼──────────┐                │
│               │    DashMap Index    │                │
│               │  (Lock-Free O(1))   │                │
│               └─────────────────────┘                │
└──────────────────────────────────────────────────────┘
```
## The 32-Shard Design
At the heart of flashQ is a sharded architecture. We distribute queues across 32 shards based on a hash of the queue name. In our benchmarks, this reduces lock contention by 97% compared to a single-lock design.
```rust
// Queue name → shard mapping
fn get_shard(queue_name: &str) -> usize {
    let hash = fxhash::hash64(queue_name.as_bytes());
    (hash as usize) % 32
}
```
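Per-shard locking means two pushes to queues on different shards never touch the same lock. Here is a dependency-free sketch of the idea; std's DefaultHasher and Mutex stand in for flashQ's fxhash and parking_lot locks, and the Shard and Manager types are simplified stand-ins, not the real structures:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::sync::Mutex;

const NUM_SHARDS: usize = 32;

// Simplified shard: each shard owns its pending jobs behind its own lock.
struct Shard {
    pending: Vec<String>,
}

struct Manager {
    shards: Vec<Mutex<Shard>>,
}

fn get_shard(queue_name: &str) -> usize {
    // flashQ uses fxhash; std's DefaultHasher stands in here.
    let mut h = DefaultHasher::new();
    queue_name.hash(&mut h);
    (h.finish() as usize) % NUM_SHARDS
}

impl Manager {
    fn new() -> Self {
        Manager {
            shards: (0..NUM_SHARDS)
                .map(|_| Mutex::new(Shard { pending: Vec::new() }))
                .collect(),
        }
    }

    fn push(&self, queue: &str, job: String) {
        // Only one of the 32 shard locks is taken per push, so pushes to
        // queues on different shards proceed fully in parallel.
        let mut shard = self.shards[get_shard(queue)].lock().unwrap();
        shard.pending.push(job);
    }
}

fn main() {
    let m = Manager::new();
    m.push("embeddings", "job-1".into());
    // The mapping is deterministic: a queue always lives on the same shard.
    assert_eq!(get_shard("embeddings"), get_shard("embeddings"));
}
```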
Each shard contains:
- Queues: HashMap of queue name → IndexedPriorityQueue
- Processing: Jobs currently being processed
- Completed: Recently completed job IDs
- DLQ: Dead letter queue for failed jobs
- Results: Job results for the finished() API
## Lock-Free Job Index with DashMap
One of our biggest performance wins came from using DashMap for the global job index. DashMap is a concurrent HashMap that shards its buckets behind fine-grained locks, so concurrent reads and writes rarely contend; in practice it behaves like a lock-free map.
```rust
// Global job index - O(1) lookups
pub struct QueueManager {
    shards: [RwLock<Shard>; 32],
    job_index: DashMap<u64, JobLocation>, // Lock-free!
}

// JobLocation tells us exactly where a job is
enum JobLocation {
    Waiting { shard: u8, queue: CompactString },
    Processing { shard: u8 },
    Completed { shard: u8 },
    Failed { shard: u8 },
}
```
This gives us O(1) job lookups regardless of how many queues or jobs exist. Operations like getJob(), cancel(), and getState() are blazing fast.
## IndexedPriorityQueue for O(log n) Operations
Standard binary heaps don't support efficient removal of arbitrary elements. We built an IndexedPriorityQueue that maintains a secondary index:
```rust
pub struct IndexedPriorityQueue<T> {
    heap: Vec<T>,
    index: HashMap<u64, usize>, // job_id → heap position
}

impl<T> IndexedPriorityQueue<T> {
    // All operations are O(log n)
    fn push(&mut self, job: T) { ... }
    fn pop(&mut self) -> Option<T> { ... }
    fn remove(&mut self, job_id: u64) -> Option<T> { ... } // Key!
    fn update_priority(&mut self, job_id: u64, priority: i32) { ... }
}
```
This enables efficient cancel(), promote(), and changePriority() operations that would be O(n) with a standard heap.
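For the curious, here is a minimal, self-contained sketch of how remove() can stay O(log n): swap the target with the last heap slot, pop it, then sift the swapped-in element in both directions. This is a simplified stand-in using plain (id, priority) pairs and std's HashMap, not flashQ's actual implementation:

```rust
use std::collections::HashMap;

// Indexed max-heap: jobs are (id, priority) pairs, highest priority pops first.
struct IndexedPriorityQueue {
    heap: Vec<(u64, i32)>,      // (job_id, priority)
    index: HashMap<u64, usize>, // job_id -> heap position
}

impl IndexedPriorityQueue {
    fn new() -> Self {
        Self { heap: Vec::new(), index: HashMap::new() }
    }

    fn push(&mut self, id: u64, priority: i32) {
        self.heap.push((id, priority));
        self.index.insert(id, self.heap.len() - 1);
        self.sift_up(self.heap.len() - 1);
    }

    fn pop(&mut self) -> Option<(u64, i32)> {
        if self.heap.is_empty() { return None; }
        self.swap(0, self.heap.len() - 1);
        let top = self.heap.pop()?;
        self.index.remove(&top.0);
        if !self.heap.is_empty() { self.sift_down(0); }
        Some(top)
    }

    // The key operation: O(log n) removal of an arbitrary job.
    fn remove(&mut self, id: u64) -> Option<(u64, i32)> {
        let pos = *self.index.get(&id)?;
        let last = self.heap.len() - 1;
        self.swap(pos, last);
        let removed = self.heap.pop()?;
        self.index.remove(&id);
        if pos < self.heap.len() {
            // The element swapped into `pos` may need to move either way.
            self.sift_up(pos);
            self.sift_down(pos);
        }
        Some(removed)
    }

    // Swap two heap slots and keep the index map in sync.
    fn swap(&mut self, a: usize, b: usize) {
        self.heap.swap(a, b);
        self.index.insert(self.heap[a].0, a);
        self.index.insert(self.heap[b].0, b);
    }

    fn sift_up(&mut self, mut i: usize) {
        while i > 0 {
            let parent = (i - 1) / 2;
            if self.heap[i].1 <= self.heap[parent].1 { break; }
            self.swap(i, parent);
            i = parent;
        }
    }

    fn sift_down(&mut self, mut i: usize) {
        loop {
            let (l, r) = (2 * i + 1, 2 * i + 2);
            let mut largest = i;
            if l < self.heap.len() && self.heap[l].1 > self.heap[largest].1 { largest = l; }
            if r < self.heap.len() && self.heap[r].1 > self.heap[largest].1 { largest = r; }
            if largest == i { break; }
            self.swap(i, largest);
            i = largest;
        }
    }
}

fn main() {
    let mut q = IndexedPriorityQueue::new();
    q.push(1, 5);
    q.push(2, 10);
    q.push(3, 1);
    q.remove(2); // cancel job 2 without scanning the heap
    assert_eq!(q.pop(), Some((1, 5))); // highest remaining priority
}
```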
## CompactString for Zero-Allocation Queue Names
Queue names are accessed constantly. We use CompactString which stores strings up to 24 bytes inline (no heap allocation):
```rust
use compact_str::CompactString;

// Most queue names fit in 24 bytes
// "embeddings"                - 10 bytes, stored inline ✓
// "user-notifications"        - 18 bytes, stored inline ✓
// "very-long-queue-name-here" - 25 bytes, heap allocated
```
This reduces memory allocations by ~60% for typical workloads.
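The 24-byte threshold is not arbitrary: on 64-bit targets a plain String already occupies three words (pointer, length, capacity), and CompactString reuses that same footprint for inline storage. A quick stdlib-only check of the sizes involved:

```rust
use std::mem::size_of;

fn main() {
    // On 64-bit targets a String header is three words = 24 bytes, so a
    // CompactString can hold up to 24 bytes inline at no extra size cost.
    assert_eq!(size_of::<String>(), 24);

    // The queue names from the examples above:
    assert!("embeddings".len() <= 24);          // would be stored inline
    assert!("user-notifications".len() <= 24);  // would be stored inline
    assert!("very-long-queue-name-here".len() > 24); // would spill to the heap

    println!("String header is {} bytes on this target", size_of::<String>());
}
```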
## Sharded Processing Map
Jobs being processed are distributed across 32 shards (separate from the queue shards). In our benchmarks, this reduces contention during ack() and fail() operations by 97%:
```rust
// Processing is sharded by job_id
fn get_processing_shard(job_id: u64) -> usize {
    (job_id as usize) % 32
}

// ack() only locks one of 32 processing shards
fn ack(&self, job_id: u64) -> Result<()> {
    let shard_idx = get_processing_shard(job_id);
    let mut processing = self.processing[shard_idx].write();
    // ...
}
```
## Binary Protocol with MessagePack
flashQ supports both JSON and MessagePack protocols. MessagePack provides:
- 40% smaller payloads on the wire
- 3-5x faster serialization/deserialization
- The same data model as JSON, so switching protocols requires no schema changes
```typescript
// TypeScript SDK - enable binary protocol
const client = new FlashQ({
  host: 'localhost',
  port: 6789,
  useBinary: true // Use MessagePack
});
```
## Memory Management
We use several strategies to prevent unbounded memory growth:
### Completed Jobs Cleanup
```rust
// When completed_jobs exceeds 50K, remove the oldest 25K
if completed_jobs.len() > 50_000 {
    let to_remove: Vec<_> = completed_jobs
        .iter()
        .take(25_000)
        .cloned()
        .collect();
    for id in to_remove {
        completed_jobs.remove(&id);
    }
}
```
### Job Results TTL
```typescript
// Results are cleaned up based on keepCompletedAge
await client.push('queue', data, {
  keepCompletedAge: 86400000, // Keep result for 24h
  keepCompletedCount: 100     // Or keep only the last 100
});
```
### String Interning
```rust
// Queue names are interned to reduce allocations.
// Limited to 10K unique names to prevent memory exhaustion.
static INTERNED: DashMap<String, CompactString> = ...;
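A stdlib-only sketch of such a capped interner, with Arc&lt;str&gt; and a mutexed HashMap standing in for CompactString and DashMap; the cap policy shown here (silently skipping insertion once the table is full) is an assumption, not flashQ's actual behavior:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, OnceLock};

const MAX_INTERNED: usize = 10_000;

// Lazily initialized global intern table (Arc<str> stands in for CompactString).
fn interner() -> &'static Mutex<HashMap<String, Arc<str>>> {
    static INTERNED: OnceLock<Mutex<HashMap<String, Arc<str>>>> = OnceLock::new();
    INTERNED.get_or_init(|| Mutex::new(HashMap::new()))
}

fn intern(name: &str) -> Arc<str> {
    let mut map = interner().lock().unwrap();
    if let Some(s) = map.get(name) {
        return s.clone(); // cache hit: bump a refcount, no new allocation
    }
    let s: Arc<str> = Arc::from(name);
    // Cap the table so hostile clients can't exhaust memory with unique names.
    if map.len() < MAX_INTERNED {
        map.insert(name.to_owned(), s.clone());
    }
    s
}

fn main() {
    let a = intern("embeddings");
    let b = intern("embeddings");
    assert!(Arc::ptr_eq(&a, &b)); // both handles share one allocation
}
```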
## Background Tasks
flashQ runs several background tasks at different intervals:
| Task | Interval | Purpose |
|---|---|---|
| Wakeup | 100ms | Notify workers, check dependencies |
| Timeout | 500ms | Check and fail stalled jobs |
| Cron | 1s | Execute scheduled cron jobs |
| Metrics | 5s | Collect metrics history |
| Cleanup | 60s | Clean completed jobs, results, index |
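The pattern behind each of these tasks is a fixed-interval tick loop. A minimal sketch using plain threads; the real server presumably drives these from its async runtime, and the task body here is a stand-in:

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

// Spawn a background task that runs `tick` at a fixed interval until
// `running` is cleared (e.g. at server shutdown).
fn spawn_periodic(
    interval: Duration,
    running: Arc<AtomicBool>,
    tick: impl Fn() + Send + 'static,
) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        while running.load(Ordering::Relaxed) {
            tick(); // e.g. fail stalled jobs, fire cron triggers, trim results
            thread::sleep(interval);
        }
    })
}

fn main() {
    let running = Arc::new(AtomicBool::new(true));
    let ticks = Arc::new(AtomicU64::new(0));
    let t = ticks.clone();
    let handle = spawn_periodic(Duration::from_millis(10), running.clone(), move || {
        t.fetch_add(1, Ordering::Relaxed);
    });
    thread::sleep(Duration::from_millis(100));
    running.store(false, Ordering::Relaxed);
    handle.join().unwrap();
    assert!(ticks.load(Ordering::Relaxed) >= 2);
}
```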
## Multi-Protocol Support
flashQ supports four connection methods simultaneously:
```bash
# TCP (default) - lowest latency
PORT=6789 ./flashq-server

# HTTP REST API + WebSocket
HTTP=1 HTTP_PORT=6790 ./flashq-server

# gRPC API
GRPC=1 GRPC_PORT=6791 ./flashq-server

# Unix socket - highest throughput for local clients
UNIX_SOCKET=1 ./flashq-server
```
## Clustering Architecture
For high availability, flashQ supports clustering using PostgreSQL for coordination:
```
┌──────────┐   ┌──────────┐   ┌──────────┐
│  Node 1  │   │  Node 2  │   │  Node 3  │
│ (Leader) │   │(Follower)│   │(Follower)│
└────┬─────┘   └────┬─────┘   └────┬─────┘
     │              │              │
     └──────────────┼──────────────┘
                    │
             ┌──────▼──────┐
             │ PostgreSQL  │
             │  (Shared)   │
             └─────────────┘
```
Leader election uses PostgreSQL advisory locks (pg_try_advisory_lock). Only the leader runs background tasks; all nodes handle client requests.
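A sketch of how that election might look, with the advisory lock abstracted behind a closure; the node structure here is an illustrative assumption, not flashQ's actual code. In the real cluster each node would issue something like SELECT pg_try_advisory_lock(key) against the shared PostgreSQL, and the winning session holds the lock until it disconnects:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

struct Node {
    is_leader: AtomicBool,
}

impl Node {
    // pg_try_advisory_lock never blocks: it returns true for exactly one
    // holder, so each node can poll it and update its role accordingly.
    fn tick_election(&self, try_advisory_lock: impl Fn() -> bool) {
        let acquired = try_advisory_lock();
        self.is_leader.store(acquired, Ordering::Relaxed);
    }

    // Followers still serve clients but skip timers, cron, and cleanup.
    fn runs_background_tasks(&self) -> bool {
        self.is_leader.load(Ordering::Relaxed)
    }
}

fn main() {
    let node = Node { is_leader: AtomicBool::new(false) };
    node.tick_election(|| true); // this node won the advisory lock
    assert!(node.runs_background_tasks());
    node.tick_election(|| false); // another session already holds it
    assert!(!node.runs_background_tasks());
}
```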
## Performance Optimizations Summary
| Optimization | Benefit |
|---|---|
| DashMap job_index | Lock-free O(1) lookups, 40% faster |
| 32 Sharded processing | -97% contention on ack/fail |
| CompactString | Zero heap alloc for short names |
| IndexedPriorityQueue | O(log n) cancel/update/promote |
| MessagePack protocol | 40% smaller, 3-5x faster serialization |
| mimalloc allocator | Faster memory allocation |
| parking_lot locks | Faster than std::sync |
| Atomic u64 IDs | Lock-free ID generation |
| Coarse timestamps | Cached time, fewer syscalls |
| LTO build | Cross-crate optimization |
## Benchmarks
On a modern server (32 cores, 64GB RAM):
| Operation | Throughput |
|---|---|
| Push (batch) | 1.9M jobs/sec |
| Processing (no-op) | 280K jobs/sec |
| Processing (CPU work) | 196K jobs/sec |
| Concurrent push (10 conn) | 59K ops/sec |
## Conclusion
flashQ's architecture is built around one principle: minimize contention. By sharding data, using lock-free structures where possible, and optimizing hot paths, we've created a job queue that can handle the most demanding AI workloads.
The combination of Rust's zero-cost abstractions, careful data structure selection, and attention to memory management allows flashQ to outperform traditional Redis-based solutions while being simpler to operate.