When we set out to build flashQ, our goal was simple: create the fastest job queue server possible, optimized for AI workloads, without requiring Redis. In this article, we'll dive deep into the architecture decisions that enable flashQ to process 1.9 million jobs per second.
## High-Level Architecture
flashQ is written in Rust and follows a sharded, lock-free architecture designed for maximum concurrency:
```
┌──────────────────────────────────────────────────────┐
│                    flashQ Server                     │
├──────────────────────────────────────────────────────┤
│   ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐    │
│   │ TCP/IP  │ │  HTTP   │ │  gRPC   │ │  Unix   │    │
│   │ Handler │ │   API   │ │   API   │ │ Socket  │    │
│   └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘    │
│        │           │           │           │         │
│        └───────────┴─────┬─────┴───────────┘         │
│                          │                           │
│               ┌──────────▼──────────┐                │
│               │   Command Router    │                │
│               └──────────┬──────────┘                │
│                          │                           │
│          ┌───────────────┼───────────────┐           │
│          ▼               ▼               ▼           │
│     ┌─────────┐     ┌─────────┐     ┌─────────┐      │
│     │ Shard 0 │     │ Shard 1 │ ... │Shard 31 │      │
│     └─────────┘     └─────────┘     └─────────┘      │
│                          │                           │
│               ┌──────────▼──────────┐                │
│               │    DashMap Index    │                │
│               │  (Lock-Free O(1))   │                │
│               └─────────────────────┘                │
└──────────────────────────────────────────────────────┘
```
## The 32-Shard Design
At the heart of flashQ is a sharded architecture. We distribute queues across 32 shards based on a hash of the queue name. In our benchmarks, this reduces lock contention by 97% compared to a single-lock design.
```rust
// Queue name → shard mapping
fn get_shard(queue_name: &str) -> usize {
    let hash = fxhash::hash64(queue_name.as_bytes());
    (hash as usize) % 32
}
```
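Per-shard locking means two pushes to queues on different shards never touch the same lock. Here is a dependency-free sketch of the idea; std's DefaultHasher and Mutex stand in for flashQ's fxhash and parking_lot locks, and the Shard and Manager types are simplified stand-ins, not the real structures:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::sync::Mutex;

const NUM_SHARDS: usize = 32;

// Simplified shard: each shard owns its pending jobs behind its own lock.
struct Shard {
    pending: Vec<String>,
}

struct Manager {
    shards: Vec<Mutex<Shard>>,
}

fn get_shard(queue_name: &str) -> usize {
    // flashQ uses fxhash; std's DefaultHasher stands in here.
    let mut h = DefaultHasher::new();
    queue_name.hash(&mut h);
    (h.finish() as usize) % NUM_SHARDS
}

impl Manager {
    fn new() -> Self {
        Manager {
            shards: (0..NUM_SHARDS)
                .map(|_| Mutex::new(Shard { pending: Vec::new() }))
                .collect(),
        }
    }

    fn push(&self, queue: &str, job: String) {
        // Only one of the 32 shard locks is taken per push, so pushes to
        // queues on different shards proceed fully in parallel.
        let mut shard = self.shards[get_shard(queue)].lock().unwrap();
        shard.pending.push(job);
    }
}

fn main() {
    let m = Manager::new();
    m.push("embeddings", "job-1".into());
    // The mapping is deterministic: a queue always lives on the same shard.
    assert_eq!(get_shard("embeddings"), get_shard("embeddings"));
}
```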
Each shard contains:
- Queues: HashMap of queue name → IndexedPriorityQueue
- Processing: Jobs currently being processed
- Completed: Recently completed job IDs
- DLQ: Dead letter queue for failed jobs
- Results: Job results for the finished() API
## Lock-Free Job Index with DashMap
One of our biggest performance wins came from using DashMap for the global job index. DashMap is a concurrent HashMap that shards its buckets behind fine-grained locks, so concurrent reads and writes rarely contend; in practice it behaves like a lock-free map.
```rust
// Global job index - O(1) lookups
pub struct QueueManager {
    shards: [RwLock<Shard>; 32],
    job_index: DashMap<u64, JobLocation>, // Lock-free!
}

// JobLocation tells us exactly where a job is
enum JobLocation {
    Waiting { shard: u8, queue: CompactString },
    Processing { shard: u8 },
    Completed { shard: u8 },
    Failed { shard: u8 },
}
```
This gives us O(1) job lookups regardless of how many queues or jobs exist. Operations like getJob(), cancel(), and getState() are blazing fast.
## IndexedPriorityQueue for O(log n) Operations
Standard binary heaps don't support efficient removal of arbitrary elements. We built an IndexedPriorityQueue that maintains a secondary index:
```rust
pub struct IndexedPriorityQueue<T> {
    heap: Vec<T>,
    index: HashMap<u64, usize>, // job_id → heap position
}

impl<T> IndexedPriorityQueue<T> {
    // All operations are O(log n)
    fn push(&mut self, job: T) { ... }
    fn pop(&mut self) -> Option<T> { ... }
    fn remove(&mut self, job_id: u64) -> Option<T> { ... } // Key!
    fn update_priority(&mut self, job_id: u64, priority: i32) { ... }
}
```
This enables efficient cancel(), promote(), and changePriority() operations that would be O(n) with a standard heap.
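For the curious, here is a minimal, self-contained sketch of how remove() can stay O(log n): swap the target with the last heap slot, pop it, then sift the swapped-in element in both directions. This is a simplified stand-in using plain (id, priority) pairs and std's HashMap, not flashQ's actual implementation:

```rust
use std::collections::HashMap;

// Indexed max-heap: jobs are (id, priority) pairs, highest priority pops first.
struct IndexedPriorityQueue {
    heap: Vec<(u64, i32)>,      // (job_id, priority)
    index: HashMap<u64, usize>, // job_id -> heap position
}

impl IndexedPriorityQueue {
    fn new() -> Self {
        Self { heap: Vec::new(), index: HashMap::new() }
    }

    fn push(&mut self, id: u64, priority: i32) {
        self.heap.push((id, priority));
        self.index.insert(id, self.heap.len() - 1);
        self.sift_up(self.heap.len() - 1);
    }

    fn pop(&mut self) -> Option<(u64, i32)> {
        if self.heap.is_empty() { return None; }
        self.swap(0, self.heap.len() - 1);
        let top = self.heap.pop()?;
        self.index.remove(&top.0);
        if !self.heap.is_empty() { self.sift_down(0); }
        Some(top)
    }

    // The key operation: O(log n) removal of an arbitrary job.
    fn remove(&mut self, id: u64) -> Option<(u64, i32)> {
        let pos = *self.index.get(&id)?;
        let last = self.heap.len() - 1;
        self.swap(pos, last);
        let removed = self.heap.pop()?;
        self.index.remove(&id);
        if pos < self.heap.len() {
            // The element swapped into `pos` may need to move either way.
            self.sift_up(pos);
            self.sift_down(pos);
        }
        Some(removed)
    }

    // Swap two heap slots and keep the index map in sync.
    fn swap(&mut self, a: usize, b: usize) {
        self.heap.swap(a, b);
        self.index.insert(self.heap[a].0, a);
        self.index.insert(self.heap[b].0, b);
    }

    fn sift_up(&mut self, mut i: usize) {
        while i > 0 {
            let parent = (i - 1) / 2;
            if self.heap[i].1 <= self.heap[parent].1 { break; }
            self.swap(i, parent);
            i = parent;
        }
    }

    fn sift_down(&mut self, mut i: usize) {
        loop {
            let (l, r) = (2 * i + 1, 2 * i + 2);
            let mut largest = i;
            if l < self.heap.len() && self.heap[l].1 > self.heap[largest].1 { largest = l; }
            if r < self.heap.len() && self.heap[r].1 > self.heap[largest].1 { largest = r; }
            if largest == i { break; }
            self.swap(i, largest);
            i = largest;
        }
    }
}

fn main() {
    let mut q = IndexedPriorityQueue::new();
    q.push(1, 5);
    q.push(2, 10);
    q.push(3, 1);
    q.remove(2); // cancel job 2 without scanning the heap
    assert_eq!(q.pop(), Some((1, 5))); // highest remaining priority
}
```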
## CompactString for Zero-Allocation Queue Names
Queue names are accessed constantly. We use CompactString which stores strings up to 24 bytes inline (no heap allocation):
```rust
use compact_str::CompactString;

// Most queue names fit in 24 bytes
// "embeddings"                - 10 bytes, stored inline ✓
// "user-notifications"        - 18 bytes, stored inline ✓
// "very-long-queue-name-here" - 25 bytes, heap allocated
```
This reduces memory allocations by ~60% for typical workloads.
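The 24-byte threshold is not arbitrary: on 64-bit targets a plain String already occupies three words (pointer, length, capacity), and CompactString reuses that same footprint for inline storage. A quick stdlib-only check of the sizes involved:

```rust
use std::mem::size_of;

fn main() {
    // On 64-bit targets a String header is three words = 24 bytes, so a
    // CompactString can hold up to 24 bytes inline at no extra size cost.
    assert_eq!(size_of::<String>(), 24);

    // The queue names from the examples above:
    assert!("embeddings".len() <= 24);          // would be stored inline
    assert!("user-notifications".len() <= 24);  // would be stored inline
    assert!("very-long-queue-name-here".len() > 24); // would spill to the heap

    println!("String header is {} bytes on this target", size_of::<String>());
}
```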
## Sharded Processing Map
Jobs being processed are distributed across 32 shards (separate from the queue shards). In our benchmarks, this reduces contention during ack() and fail() operations by 97%:
```rust
// Processing is sharded by job_id
fn get_processing_shard(job_id: u64) -> usize {
    (job_id as usize) % 32
}

// ack() only locks one of 32 processing shards
fn ack(&self, job_id: u64) -> Result<()> {
    let shard_idx = get_processing_shard(job_id);
    let mut processing = self.processing[shard_idx].write();
    // ...
}
```
## Binary Protocol with MessagePack
flashQ supports both JSON and MessagePack protocols. MessagePack provides:
- 40% smaller payloads on the wire
- 3-5x faster serialization/deserialization
- The same data model as JSON, so switching protocols requires no schema changes
```typescript
// TypeScript SDK - enable binary protocol
const client = new FlashQ({
  host: 'localhost',
  port: 6789,
  useBinary: true // Use MessagePack
});
```
## Memory Management
We use several strategies to prevent unbounded memory growth:
### Completed Jobs Cleanup
```rust
// When completed_jobs exceeds 50K, remove the oldest 25K
if completed_jobs.len() > 50_000 {
    let to_remove: Vec<_> = completed_jobs
        .iter()
        .take(25_000)
        .cloned()
        .collect();
    for id in to_remove {
        completed_jobs.remove(&id);
    }
}
```
### Job Results TTL
```typescript
// Results are cleaned up based on keepCompletedAge
await client.push('queue', data, {
  keepCompletedAge: 86400000, // Keep result for 24h
  keepCompletedCount: 100     // Or keep only the last 100
});
```
### String Interning
```rust
// Queue names are interned to reduce allocations.
// Limited to 10K unique names to prevent memory exhaustion.
static INTERNED: DashMap<String, CompactString> = ...;
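A stdlib-only sketch of such a capped interner, with Arc&lt;str&gt; and a mutexed HashMap standing in for CompactString and DashMap; the cap policy shown here (silently skipping insertion once the table is full) is an assumption, not flashQ's actual behavior:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, OnceLock};

const MAX_INTERNED: usize = 10_000;

// Lazily initialized global intern table (Arc<str> stands in for CompactString).
fn interner() -> &'static Mutex<HashMap<String, Arc<str>>> {
    static INTERNED: OnceLock<Mutex<HashMap<String, Arc<str>>>> = OnceLock::new();
    INTERNED.get_or_init(|| Mutex::new(HashMap::new()))
}

fn intern(name: &str) -> Arc<str> {
    let mut map = interner().lock().unwrap();
    if let Some(s) = map.get(name) {
        return s.clone(); // cache hit: bump a refcount, no new allocation
    }
    let s: Arc<str> = Arc::from(name);
    // Cap the table so hostile clients can't exhaust memory with unique names.
    if map.len() < MAX_INTERNED {
        map.insert(name.to_owned(), s.clone());
    }
    s
}

fn main() {
    let a = intern("embeddings");
    let b = intern("embeddings");
    assert!(Arc::ptr_eq(&a, &b)); // both handles share one allocation
}
```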
## Background Tasks
flashQ runs several background tasks at different intervals:
| Task | Interval | Purpose |
|---|---|---|
| Wakeup | 100ms | Notify workers, check dependencies |
| Timeout | 500ms | Check and fail stalled jobs |
| Cron | 1s | Execute scheduled cron jobs |
| Metrics | 5s | Collect metrics history |
| Cleanup | 60s | Clean completed jobs, results, index |
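The pattern behind each of these tasks is a fixed-interval tick loop. A minimal sketch using plain threads; the real server presumably drives these from its async runtime, and the task body here is a stand-in:

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

// Spawn a background task that runs `tick` at a fixed interval until
// `running` is cleared (e.g. at server shutdown).
fn spawn_periodic(
    interval: Duration,
    running: Arc<AtomicBool>,
    tick: impl Fn() + Send + 'static,
) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        while running.load(Ordering::Relaxed) {
            tick(); // e.g. fail stalled jobs, fire cron triggers, trim results
            thread::sleep(interval);
        }
    })
}

fn main() {
    let running = Arc::new(AtomicBool::new(true));
    let ticks = Arc::new(AtomicU64::new(0));
    let t = ticks.clone();
    let handle = spawn_periodic(Duration::from_millis(10), running.clone(), move || {
        t.fetch_add(1, Ordering::Relaxed);
    });
    thread::sleep(Duration::from_millis(100));
    running.store(false, Ordering::Relaxed);
    handle.join().unwrap();
    assert!(ticks.load(Ordering::Relaxed) >= 2);
}
```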
## Multi-Protocol Support
flashQ supports four connection methods simultaneously:
```bash
# TCP (default) - lowest latency
PORT=6789 ./flashq-server

# HTTP REST API + WebSocket
HTTP=1 HTTP_PORT=6790 ./flashq-server

# gRPC API
GRPC=1 GRPC_PORT=6791 ./flashq-server

# Unix socket - highest throughput for local clients
UNIX_SOCKET=1 ./flashq-server
```
## Clustering Architecture
For high availability, flashQ supports clustering using PostgreSQL for coordination:
```
┌──────────┐   ┌──────────┐   ┌──────────┐
│  Node 1  │   │  Node 2  │   │  Node 3  │
│ (Leader) │   │(Follower)│   │(Follower)│
└────┬─────┘   └────┬─────┘   └────┬─────┘
     │              │              │
     └──────────────┼──────────────┘
                    │
             ┌──────▼──────┐
             │ PostgreSQL  │
             │  (Shared)   │
             └─────────────┘
```
Leader election uses PostgreSQL advisory locks (pg_try_advisory_lock). Only the leader runs background tasks; all nodes handle client requests.
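A sketch of how that election might look, with the advisory lock abstracted behind a closure; the node structure here is an illustrative assumption, not flashQ's actual code. In the real cluster each node would issue something like SELECT pg_try_advisory_lock(key) against the shared PostgreSQL, and the winning session holds the lock until it disconnects:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

struct Node {
    is_leader: AtomicBool,
}

impl Node {
    // pg_try_advisory_lock never blocks: it returns true for exactly one
    // holder, so each node can poll it and update its role accordingly.
    fn tick_election(&self, try_advisory_lock: impl Fn() -> bool) {
        let acquired = try_advisory_lock();
        self.is_leader.store(acquired, Ordering::Relaxed);
    }

    // Followers still serve clients but skip timers, cron, and cleanup.
    fn runs_background_tasks(&self) -> bool {
        self.is_leader.load(Ordering::Relaxed)
    }
}

fn main() {
    let node = Node { is_leader: AtomicBool::new(false) };
    node.tick_election(|| true); // this node won the advisory lock
    assert!(node.runs_background_tasks());
    node.tick_election(|| false); // another session already holds it
    assert!(!node.runs_background_tasks());
}
```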
## Performance Optimizations Summary
| Optimization | Benefit |
|---|---|
| DashMap job_index | Lock-free O(1) lookups, 40% faster |
| 32 Sharded processing | -97% contention on ack/fail |
| CompactString | Zero heap alloc for short names |
| IndexedPriorityQueue | O(log n) cancel/update/promote |
| MessagePack protocol | 40% smaller, 3-5x faster serialization |
| mimalloc allocator | Faster memory allocation |
| parking_lot locks | Faster than std::sync |
| Atomic u64 IDs | Lock-free ID generation |
| Coarse timestamps | Cached time, fewer syscalls |
| LTO build | Cross-crate optimization |
## Benchmarks
On a modern server (32 cores, 64GB RAM):
| Operation | Throughput |
|---|---|
| Push (batch) | 1.9M jobs/sec |
| Processing (no-op) | 280K jobs/sec |
| Processing (CPU work) | 196K jobs/sec |
| Concurrent push (10 conn) | 59K ops/sec |
## Conclusion
flashQ's architecture is built around one principle: minimize contention. By sharding data, using lock-free structures where possible, and optimizing hot paths, we've created a job queue that can handle the most demanding AI workloads.
The combination of Rust's zero-cost abstractions, careful data structure selection, and attention to memory management allows flashQ to outperform traditional Redis-based solutions while being simpler to operate.