Building an AI API means handling slow operations. LLM calls can take 10-30 seconds. You need both synchronous endpoints (wait for result) and asynchronous endpoints (poll for status).
This tutorial shows how to build a production AI inference API with Hono and flashQ.
## API Design
We'll build three endpoints:
| Endpoint | Description |
|---|---|
| `POST /api/generate` | Queue a job, return job ID immediately |
| `GET /api/job/:id` | Get job status and result |
| `POST /api/generate/sync` | Wait for result (with timeout) |
## Project Setup
```bash
# Create project
mkdir ai-api && cd ai-api

# Install dependencies
npm install hono flashq openai

# Start flashQ server
docker run -d -p 6789:6789 flashq/flashq
```
## Async Endpoint (Queue and Return)
The async pattern queues a job and returns immediately:
```typescript
import { Hono } from 'hono';
import { Queue } from 'flashq';

const app = new Hono();
const queue = new Queue('ai-inference');

// POST /api/generate - Queue job, return ID
app.post('/api/generate', async (c) => {
  const { prompt, model = 'gpt-4' } = await c.req.json();

  const job = await queue.add('generate', {
    prompt,
    model
  });

  return c.json({
    jobId: job.id,
    status: 'queued',
    statusUrl: `/api/job/${job.id}`
  }, 202);
});
```
The client receives the job ID immediately and can poll for status.
## Status Endpoint
Check job status and get results when ready:
```typescript
// GET /api/job/:id - Get job status
app.get('/api/job/:id', async (c) => {
  const jobId = c.req.param('id');
  const job = await queue.getJob(jobId);

  if (!job) {
    return c.json({ error: 'Job not found' }, 404);
  }

  // Return status based on job state
  switch (job.status) {
    case 'waiting':
    case 'delayed':
      return c.json({
        jobId,
        status: 'pending',
        position: job.position
      });
    case 'active':
      return c.json({
        jobId,
        status: 'processing',
        progress: job.progress
      });
    case 'completed':
      return c.json({
        jobId,
        status: 'completed',
        result: job.returnvalue
      });
    case 'failed':
      return c.json({
        jobId,
        status: 'failed',
        error: job.failedReason
      }, 500);
    default:
      // Any other state - report it as-is so the handler
      // always returns a response
      return c.json({ jobId, status: job.status });
  }
});
```
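On the client side, the four payload shapes this endpoint returns can be modeled as a discriminated union. A sketch — the field names are taken from the handler above, and the `describe` helper is illustrative:

```typescript
// Discriminated union over the four status payloads returned by
// GET /api/job/:id (field names match the handler above).
type JobStatusResponse =
  | { jobId: string; status: 'pending'; position: number }
  | { jobId: string; status: 'processing'; progress: number }
  | { jobId: string; status: 'completed'; result: unknown }
  | { jobId: string; status: 'failed'; error: string };

// Narrowing on `status` gives type-safe access to per-state fields.
function describe(r: JobStatusResponse): string {
  switch (r.status) {
    case 'pending':    return `queued at position ${r.position}`;
    case 'processing': return `${r.progress}% done`;
    case 'completed':  return 'done';
    case 'failed':     return `failed: ${r.error}`;
  }
}
```

Because `status` is the discriminant, the compiler rejects code that reads `result` before checking for the `completed` state.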
## Sync Endpoint (Wait for Result)
Sometimes you want to wait for the result. Use `queue.finished()`:
```typescript
// POST /api/generate/sync - Wait for result
app.post('/api/generate/sync', async (c) => {
  const { prompt, model = 'gpt-4', timeout = 30000 } = await c.req.json();

  const job = await queue.add('generate', {
    prompt,
    model
  });

  try {
    // Wait for job to complete (with timeout)
    const result = await queue.finished(job.id, timeout);
    return c.json({
      jobId: job.id,
      status: 'completed',
      result
    });
  } catch (error) {
    // Timeout or failure
    return c.json({
      jobId: job.id,
      status: 'timeout',
      message: 'Job did not complete in time. Poll /api/job/:id for status.'
    }, 408);
  }
});
```
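A client calling this endpoint should treat a 408 as "still running", not as a failure: the job is already queued, so retrying the POST would enqueue a duplicate. A minimal sketch — the `generateSync` helper and its return shape are illustrative, built on the response fields the handler above returns:

```typescript
// A 408 from /api/generate/sync means the job is still queued or
// running, so fall back to polling rather than re-submitting.
function shouldFallBackToPolling(httpStatus: number): boolean {
  return httpStatus === 408;
}

async function generateSync(prompt: string) {
  const res = await fetch('/api/generate/sync', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ prompt, timeout: 30000 }),
  });
  const body = await res.json();
  if (shouldFallBackToPolling(res.status)) {
    // Job still in flight; caller should poll /api/job/:id instead.
    return { pending: true as const, jobId: body.jobId as string };
  }
  return { pending: false as const, result: body.result };
}
```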
## Worker Implementation
The worker processes jobs in the background:
```typescript
import { Worker } from 'flashq';
import OpenAI from 'openai';

const openai = new OpenAI();

new Worker('ai-inference', async (job) => {
  const { prompt, model } = job.data;

  // Update progress
  await job.updateProgress(10);

  const response = await openai.chat.completions.create({
    model,
    messages: [{ role: 'user', content: prompt }]
  });

  await job.updateProgress(100);

  return {
    text: response.choices[0].message.content,
    model,
    usage: response.usage
  };
}, {
  limiter: {
    max: 60,        // 60 requests
    duration: 60000 // per minute
  }
});
```
## Client Usage
### Async Pattern (Polling)
```typescript
// 1. Submit job
const response = await fetch('/api/generate', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ prompt: 'Explain quantum computing' })
});
const { jobId, statusUrl } = await response.json();

// 2. Poll for result
async function pollForResult(jobId: string) {
  while (true) {
    const status = await fetch(`/api/job/${jobId}`).then(r => r.json());

    if (status.status === 'completed') {
      return status.result;
    }
    if (status.status === 'failed') {
      throw new Error(status.error);
    }

    // Wait 1 second before polling again
    await new Promise(r => setTimeout(r, 1000));
  }
}

const result = await pollForResult(jobId);
console.log(result.text);
```
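The fixed 1-second loop works, but it hammers the API on long jobs and never gives up. A variant with exponential backoff and a retry cap — a sketch, where the delay schedule and the 30-attempt limit are arbitrary choices:

```typescript
// Exponential backoff: 1s, 2s, 4s, ... capped at 10s per attempt.
function backoffDelay(attempt: number, baseMs = 1000, maxMs = 10000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

async function pollWithBackoff(jobId: string, maxAttempts = 30) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const status = await fetch(`/api/job/${jobId}`).then(r => r.json());

    if (status.status === 'completed') return status.result;
    if (status.status === 'failed') throw new Error(status.error);

    // Back off before the next poll
    await new Promise(r => setTimeout(r, backoffDelay(attempt)));
  }
  throw new Error(`Job ${jobId} still pending after ${maxAttempts} polls`);
}
```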
### Sync Pattern (Simple)
```typescript
const response = await fetch('/api/generate/sync', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    prompt: 'Explain quantum computing',
    timeout: 60000
  })
});

const { result } = await response.json();
console.log(result.text);
```
## Bonus: Bulk Processing
Process multiple prompts efficiently:
```typescript
// POST /api/generate/bulk - Queue multiple jobs
app.post('/api/generate/bulk', async (c) => {
  const { prompts, model = 'gpt-4' } = await c.req.json();

  const jobs = await queue.addBulk(
    prompts.map((prompt: string) => ({
      name: 'generate',
      data: { prompt, model }
    }))
  );

  return c.json({
    jobIds: jobs.map(j => j.id),
    count: jobs.length
  }, 202);
});

// POST /api/jobs/status - Batch status check
app.post('/api/jobs/status', async (c) => {
  const { jobIds } = await c.req.json();

  const statuses = await Promise.all(
    jobIds.map(async (id: string) => {
      const job = await queue.getJob(id);
      return {
        jobId: id,
        status: job?.status || 'not_found',
        result: job?.status === 'completed' ? job.returnvalue : null
      };
    })
  );

  return c.json({ jobs: statuses });
});
```
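A client for these two routes might submit the batch once, then poll the status route until every job settles. A sketch — the `allSettled` predicate and the 2-second interval are illustrative choices, built on the fields the batch endpoints above return:

```typescript
// Shape of one entry in the /api/jobs/status response.
type BulkStatus = { jobId: string; status: string; result: unknown };

// A batch is settled when no job is still waiting or running.
function allSettled(jobs: BulkStatus[]): boolean {
  return jobs.every(j =>
    j.status === 'completed' || j.status === 'failed' || j.status === 'not_found'
  );
}

async function generateBulk(prompts: string[]): Promise<BulkStatus[]> {
  const { jobIds } = await fetch('/api/generate/bulk', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ prompts }),
  }).then(r => r.json());

  while (true) {
    const { jobs } = await fetch('/api/jobs/status', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ jobIds }),
    }).then(r => r.json());

    if (allSettled(jobs)) return jobs;
    await new Promise(r => setTimeout(r, 2000));
  }
}
```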
## Complete Example
```typescript
import { Hono } from 'hono';
import { Queue, Worker } from 'flashq';
import OpenAI from 'openai';

const app = new Hono();
const queue = new Queue('ai-inference');
const openai = new OpenAI();

// Async endpoint
app.post('/api/generate', async (c) => {
  const { prompt, model = 'gpt-4' } = await c.req.json();
  const job = await queue.add('generate', { prompt, model });
  return c.json({ jobId: job.id, statusUrl: `/api/job/${job.id}` }, 202);
});

// Status endpoint
app.get('/api/job/:id', async (c) => {
  const job = await queue.getJob(c.req.param('id'));
  if (!job) return c.json({ error: 'Not found' }, 404);
  return c.json({
    jobId: job.id,
    status: job.status,
    result: job.status === 'completed' ? job.returnvalue : null,
    error: job.status === 'failed' ? job.failedReason : null
  });
});

// Sync endpoint
app.post('/api/generate/sync', async (c) => {
  const { prompt, model = 'gpt-4', timeout = 30000 } = await c.req.json();
  const job = await queue.add('generate', { prompt, model });
  try {
    const result = await queue.finished(job.id, timeout);
    return c.json({ jobId: job.id, status: 'completed', result });
  } catch {
    return c.json({ jobId: job.id, status: 'timeout' }, 408);
  }
});

// Worker
new Worker('ai-inference', async (job) => {
  const response = await openai.chat.completions.create({
    model: job.data.model,
    messages: [{ role: 'user', content: job.data.prompt }]
  });
  return {
    text: response.choices[0].message.content,
    usage: response.usage
  };
}, {
  limiter: { max: 60, duration: 60000 }
});

export default app;
```
## Key Takeaways
- Use async endpoints for long-running tasks
- Provide a status endpoint for polling
- Use `queue.finished()` for sync endpoints
- Configure rate limiting in the worker