Building an AI API means handling slow operations. LLM calls can take 10-30 seconds. You need both synchronous endpoints (wait for result) and asynchronous endpoints (poll for status).
This tutorial shows how to build a production AI inference API with Hono and flashQ.
## API Design
We'll build three endpoints:
| Endpoint | Description |
|---|---|
| `POST /api/generate` | Queue a job, return job ID immediately |
| `GET /api/job/:id` | Get job status and result |
| `POST /api/generate/sync` | Wait for result (with timeout) |
## Project Setup
```bash
# Create project
mkdir ai-api && cd ai-api

# Install dependencies
npm install hono flashq openai

# Start flashQ server
docker run -d -p 6789:6789 flashq/flashq
```
## Async Endpoint (Queue and Return)
The async pattern queues a job and returns immediately:
```typescript
import { Hono } from 'hono';
import { Queue } from 'flashq';

const app = new Hono();
const queue = new Queue('ai-inference');

// POST /api/generate - Queue job, return ID
app.post('/api/generate', async (c) => {
  const { prompt, model = 'gpt-4' } = await c.req.json();

  const job = await queue.add('generate', {
    prompt,
    model
  });

  return c.json({
    jobId: job.id,
    status: 'queued',
    statusUrl: `/api/job/${job.id}`
  }, 202);
});
```
The client receives the job ID immediately and can poll for status.
## Status Endpoint
Check job status and get results when ready:
```typescript
// GET /api/job/:id - Get job status
app.get('/api/job/:id', async (c) => {
  const jobId = c.req.param('id');
  const job = await queue.getJob(jobId);

  if (!job) {
    return c.json({ error: 'Job not found' }, 404);
  }

  // Return status based on job state
  switch (job.status) {
    case 'waiting':
    case 'delayed':
      return c.json({
        jobId,
        status: 'pending',
        position: job.position
      });
    case 'active':
      return c.json({
        jobId,
        status: 'processing',
        progress: job.progress
      });
    case 'completed':
      return c.json({
        jobId,
        status: 'completed',
        result: job.returnvalue
      });
    case 'failed':
      return c.json({
        jobId,
        status: 'failed',
        error: job.failedReason
      }, 500);
    default:
      // Any other state - report it as-is so the handler
      // always returns a response
      return c.json({ jobId, status: job.status });
  }
});
```
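On the client side, the four payload shapes this endpoint returns can be modeled as a discriminated union. A sketch — the field names are taken from the handler above, and the `describe` helper is illustrative:

```typescript
// Discriminated union over the four status payloads returned by
// GET /api/job/:id (field names match the handler above).
type JobStatusResponse =
  | { jobId: string; status: 'pending'; position: number }
  | { jobId: string; status: 'processing'; progress: number }
  | { jobId: string; status: 'completed'; result: unknown }
  | { jobId: string; status: 'failed'; error: string };

// Narrowing on `status` gives type-safe access to per-state fields.
function describe(r: JobStatusResponse): string {
  switch (r.status) {
    case 'pending':    return `queued at position ${r.position}`;
    case 'processing': return `${r.progress}% done`;
    case 'completed':  return 'done';
    case 'failed':     return `failed: ${r.error}`;
  }
}
```

Because `status` is the discriminant, the compiler rejects code that reads `result` before checking for the `completed` state.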
## Sync Endpoint (Wait for Result)
Sometimes you want to wait for the result. Use `queue.finished()`:
```typescript
// POST /api/generate/sync - Wait for result
app.post('/api/generate/sync', async (c) => {
  const { prompt, model = 'gpt-4', timeout = 30000 } = await c.req.json();

  const job = await queue.add('generate', {
    prompt,
    model
  });

  try {
    // Wait for job to complete (with timeout)
    const result = await queue.finished(job.id, timeout);
    return c.json({
      jobId: job.id,
      status: 'completed',
      result
    });
  } catch (error) {
    // Timeout or failure
    return c.json({
      jobId: job.id,
      status: 'timeout',
      message: 'Job did not complete in time. Poll /api/job/:id for status.'
    }, 408);
  }
});
```
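A client calling this endpoint should treat a 408 as "still running", not as a failure: the job is already queued, so retrying the POST would enqueue a duplicate. A minimal sketch — the `generateSync` helper and its return shape are illustrative, built on the response fields the handler above returns:

```typescript
// A 408 from /api/generate/sync means the job is still queued or
// running, so fall back to polling rather than re-submitting.
function shouldFallBackToPolling(httpStatus: number): boolean {
  return httpStatus === 408;
}

async function generateSync(prompt: string) {
  const res = await fetch('/api/generate/sync', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ prompt, timeout: 30000 }),
  });
  const body = await res.json();
  if (shouldFallBackToPolling(res.status)) {
    // Job still in flight; caller should poll /api/job/:id instead.
    return { pending: true as const, jobId: body.jobId as string };
  }
  return { pending: false as const, result: body.result };
}
```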
## Worker Implementation
The worker processes jobs in the background:
```typescript
import { Worker } from 'flashq';
import OpenAI from 'openai';

const openai = new OpenAI();

new Worker('ai-inference', async (job) => {
  const { prompt, model } = job.data;

  // Update progress
  await job.updateProgress(10);

  const response = await openai.chat.completions.create({
    model,
    messages: [{ role: 'user', content: prompt }]
  });

  await job.updateProgress(100);

  return {
    text: response.choices[0].message.content,
    model,
    usage: response.usage
  };
}, {
  limiter: {
    max: 60,        // 60 requests
    duration: 60000 // per minute
  }
});
```
## Client Usage
### Async Pattern (Polling)
```typescript
// 1. Submit job
const response = await fetch('/api/generate', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ prompt: 'Explain quantum computing' })
});
const { jobId, statusUrl } = await response.json();

// 2. Poll for result
async function pollForResult(jobId: string) {
  while (true) {
    const status = await fetch(`/api/job/${jobId}`).then(r => r.json());

    if (status.status === 'completed') {
      return status.result;
    }
    if (status.status === 'failed') {
      throw new Error(status.error);
    }

    // Wait 1 second before polling again
    await new Promise(r => setTimeout(r, 1000));
  }
}

const result = await pollForResult(jobId);
console.log(result.text);
```
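The fixed 1-second loop works, but it hammers the API on long jobs and never gives up. A variant with exponential backoff and a retry cap — a sketch, where the delay schedule and the 30-attempt limit are arbitrary choices:

```typescript
// Exponential backoff: 1s, 2s, 4s, ... capped at 10s per attempt.
function backoffDelay(attempt: number, baseMs = 1000, maxMs = 10000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

async function pollWithBackoff(jobId: string, maxAttempts = 30) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const status = await fetch(`/api/job/${jobId}`).then(r => r.json());

    if (status.status === 'completed') return status.result;
    if (status.status === 'failed') throw new Error(status.error);

    // Back off before the next poll
    await new Promise(r => setTimeout(r, backoffDelay(attempt)));
  }
  throw new Error(`Job ${jobId} still pending after ${maxAttempts} polls`);
}
```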
### Sync Pattern (Simple)
```typescript
const response = await fetch('/api/generate/sync', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    prompt: 'Explain quantum computing',
    timeout: 60000
  })
});

const { result } = await response.json();
console.log(result.text);
```
## Bonus: Bulk Processing
Process multiple prompts efficiently:
```typescript
// POST /api/generate/bulk - Queue multiple jobs
app.post('/api/generate/bulk', async (c) => {
  const { prompts, model = 'gpt-4' } = await c.req.json();

  const jobs = await queue.addBulk(
    prompts.map((prompt: string) => ({
      name: 'generate',
      data: { prompt, model }
    }))
  );

  return c.json({
    jobIds: jobs.map(j => j.id),
    count: jobs.length
  }, 202);
});

// POST /api/jobs/status - Batch status check
app.post('/api/jobs/status', async (c) => {
  const { jobIds } = await c.req.json();

  const statuses = await Promise.all(
    jobIds.map(async (id: string) => {
      const job = await queue.getJob(id);
      return {
        jobId: id,
        status: job?.status || 'not_found',
        result: job?.status === 'completed' ? job.returnvalue : null
      };
    })
  );

  return c.json({ jobs: statuses });
});
```
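A client for these two routes might submit the batch once, then poll the status route until every job settles. A sketch — the `allSettled` predicate and the 2-second interval are illustrative choices, built on the fields the batch endpoints above return:

```typescript
// Shape of one entry in the /api/jobs/status response.
type BulkStatus = { jobId: string; status: string; result: unknown };

// A batch is settled when no job is still waiting or running.
function allSettled(jobs: BulkStatus[]): boolean {
  return jobs.every(j =>
    j.status === 'completed' || j.status === 'failed' || j.status === 'not_found'
  );
}

async function generateBulk(prompts: string[]): Promise<BulkStatus[]> {
  const { jobIds } = await fetch('/api/generate/bulk', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ prompts }),
  }).then(r => r.json());

  while (true) {
    const { jobs } = await fetch('/api/jobs/status', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ jobIds }),
    }).then(r => r.json());

    if (allSettled(jobs)) return jobs;
    await new Promise(r => setTimeout(r, 2000));
  }
}
```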
## Complete Example
```typescript
import { Hono } from 'hono';
import { Queue, Worker } from 'flashq';
import OpenAI from 'openai';

const app = new Hono();
const queue = new Queue('ai-inference');
const openai = new OpenAI();

// Async endpoint
app.post('/api/generate', async (c) => {
  const { prompt, model = 'gpt-4' } = await c.req.json();
  const job = await queue.add('generate', { prompt, model });
  return c.json({ jobId: job.id, statusUrl: `/api/job/${job.id}` }, 202);
});

// Status endpoint
app.get('/api/job/:id', async (c) => {
  const job = await queue.getJob(c.req.param('id'));
  if (!job) return c.json({ error: 'Not found' }, 404);
  return c.json({
    jobId: job.id,
    status: job.status,
    result: job.status === 'completed' ? job.returnvalue : null,
    error: job.status === 'failed' ? job.failedReason : null
  });
});

// Sync endpoint
app.post('/api/generate/sync', async (c) => {
  const { prompt, model = 'gpt-4', timeout = 30000 } = await c.req.json();
  const job = await queue.add('generate', { prompt, model });
  try {
    const result = await queue.finished(job.id, timeout);
    return c.json({ jobId: job.id, status: 'completed', result });
  } catch {
    return c.json({ jobId: job.id, status: 'timeout' }, 408);
  }
});

// Worker
new Worker('ai-inference', async (job) => {
  const response = await openai.chat.completions.create({
    model: job.data.model,
    messages: [{ role: 'user', content: job.data.prompt }]
  });
  return {
    text: response.choices[0].message.content,
    usage: response.usage
  };
}, {
  limiter: { max: 60, duration: 60000 }
});

export default app;
```
## Key Takeaways
- Use async endpoints for long-running tasks
- Provide a status endpoint for polling
- Use `queue.finished()` for sync endpoints
- Configure rate limiting in the worker