
How to Rate Limit OpenAI API Calls with flashQ

If you've worked with OpenAI's API, you've probably seen this error:

Error 429: Rate limit reached for gpt-4 in organization org-xxx
on requests per min (RPM): Limit 500, Used 500, Requested 1.

Rate limits are a fact of life when working with AI APIs. OpenAI, Anthropic, Cohere, and every other provider enforces them. Without proper handling, your application will crash, users will see errors, and you'll waste money on failed requests.

In this tutorial, we'll build a robust rate-limiting system using flashQ that:

- Stays under OpenAI's requests-per-minute limits
- Retries 429 errors automatically with exponential backoff
- Routes requests to per-model queues with appropriate limits
- Tracks cost per job and pauses processing when a budget is reached

Understanding OpenAI Rate Limits

OpenAI has two types of rate limits:

Limit Type | Description         | Typical Values (Tier 1)
RPM        | Requests per minute | 500 RPM for GPT-4
TPM        | Tokens per minute   | 10,000 TPM for GPT-4

Your tier depends on how much you've spent. New accounts start at Tier 1 with lower limits.
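A good rule of thumb is to configure your queue below the provider's quota so retries and traffic spikes have headroom. A tiny sketch of that calculation (the function name and default margin are my own, not part of flashQ):

```javascript
// Derive a queue rate limit from a provider quota, keeping a safety
// margin so bursts and retries don't trip the provider's limiter.
function safeRateLimit(providerRpm, headroom = 0.8) {
  return Math.floor(providerRpm * headroom);
}

console.log(safeRateLimit(500));  // Tier 1 GPT-4 quota -> 400
console.log(safeRateLimit(3500)); // a higher-tier quota -> 2800
```

This is where the 400 RPM figure used in the examples below comes from: 80% of the Tier 1 GPT-4 quota.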

Basic Rate Limiting with flashQ

flashQ has built-in token bucket rate limiting. Here's the simplest setup:

import { Queue, Worker } from 'flashq';
import OpenAI from 'openai';

const openai = new OpenAI();
const queue = new Queue('openai-calls');

// Set rate limit: 400 requests per minute (leaving buffer)
await queue.setRateLimit(400);

// Add jobs
await queue.add('chat', {
  model: 'gpt-4',
  messages: [{ role: 'user', content: 'Hello!' }]
});

// Worker processes at controlled rate
new Worker('openai-calls', async (job) => {
  const response = await openai.chat.completions.create(job.data);
  return response.choices[0].message.content;
});

The queue will now process at most 400 jobs per minute, regardless of how many jobs you add.
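If you're curious what "token bucket" means here, the idea is simple: the bucket refills at a steady rate and each job consumes one token. A minimal illustrative sketch (not flashQ's actual internals; the clock is injectable only to make it easy to test):

```javascript
// Minimal token bucket: refills continuously at ratePerMinute,
// each job takes one token; jobs wait when the bucket is empty.
class TokenBucket {
  constructor(ratePerMinute, now = Date.now) {
    this.capacity = ratePerMinute;
    this.tokens = ratePerMinute;
    this.refillPerMs = ratePerMinute / 60000;
    this.now = now;
    this.last = now();
  }

  tryTake() {
    const t = this.now();
    // Refill based on elapsed time, capped at capacity
    this.tokens = Math.min(
      this.capacity,
      this.tokens + (t - this.last) * this.refillPerMs
    );
    this.last = t;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}
```

The practical consequence: short bursts up to the bucket's capacity go through immediately, then throughput settles at the configured rate.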

Handling 429 Errors with Retries

Even with rate limiting, you might still hit 429 errors during traffic spikes. Configure automatic retries:

// Add job with retry configuration
await queue.add('chat', {
  model: 'gpt-4',
  messages: [{ role: 'user', content: 'Explain quantum computing' }]
}, {
  attempts: 5,
  backoff: {
    type: 'exponential',
    delay: 2000  // retries wait 2s, 4s, 8s, 16s
  }
});

// Worker with error handling
new Worker('openai-calls', async (job) => {
  try {
    const response = await openai.chat.completions.create(job.data);
    return {
      content: response.choices[0].message.content,
      usage: response.usage
    };
  } catch (error) {
    if (error.status === 429) {
      // Rate limited - throw to trigger retry
      throw new Error('Rate limited by OpenAI');
    }
    if (error.status === 400) {
      // Bad request - don't retry, return error
      return { error: error.message };
    }
    throw error; // Other errors - retry
  }
});
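To make the backoff config concrete: with exponential backoff, retry n typically waits base * 2^(n-1) milliseconds (flashQ's exact formula may differ, so treat this as an illustration):

```javascript
// Delay before the nth retry under exponential backoff.
function backoffDelay(attempt, baseMs = 2000) {
  return baseMs * 2 ** (attempt - 1);
}

// attempts: 5 means one initial try plus four retries:
console.log([1, 2, 3, 4].map((n) => backoffDelay(n)));
// [2000, 4000, 8000, 16000]
```

Roughly 30 seconds of total waiting is usually enough to ride out a per-minute rate-limit window.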

Different Queues for Different Models

Each OpenAI model has different rate limits. Create separate queues:

// GPT-4: Lower limits, higher cost
const gpt4Queue = new Queue('openai-gpt4');
await gpt4Queue.setRateLimit(400); // 400 RPM

// GPT-3.5: Higher limits, lower cost
const gpt35Queue = new Queue('openai-gpt35');
await gpt35Queue.setRateLimit(3000); // 3000 RPM

// Embeddings: Very high limits
const embeddingsQueue = new Queue('openai-embeddings');
await embeddingsQueue.setRateLimit(5000); // 5000 RPM

// Route requests to appropriate queue
function getQueue(model) {
  if (model.startsWith('gpt-4')) return gpt4Queue;
  if (model.startsWith('gpt-3.5')) return gpt35Queue;
  if (model.includes('embedding')) return embeddingsQueue;
  return gpt35Queue;
}

Tracking Costs

AI API costs can spiral quickly. Track them per job:

// Pricing per 1K tokens (as of 2024)
const PRICING = {
  'gpt-4': { input: 0.03, output: 0.06 },
  'gpt-4-turbo': { input: 0.01, output: 0.03 },
  'gpt-3.5-turbo': { input: 0.0005, output: 0.0015 },
  'text-embedding-3-small': { input: 0.00002, output: 0 }
};

function calculateCost(model, usage) {
  const prices = PRICING[model] || PRICING['gpt-3.5-turbo'];
  const inputCost = (usage.prompt_tokens / 1000) * prices.input;
  const outputCost = (usage.completion_tokens / 1000) * prices.output;
  return inputCost + outputCost;
}

// Worker with cost tracking
new Worker('openai-gpt4', async (job) => {
  const response = await openai.chat.completions.create(job.data);
  const cost = calculateCost(job.data.model, response.usage);

  return {
    content: response.choices[0].message.content,
    usage: response.usage,
    cost: cost
  };
});

// Aggregate costs
let totalCost = 0;
gpt4Queue.on('completed', (job, result) => {
  totalCost += result.cost;
  console.log(`Job cost: $${result.cost.toFixed(4)}, Total: $${totalCost.toFixed(2)}`);
});
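To sanity-check the math, here's a worked example using the GPT-4 prices from the table above (prices are per 1K tokens; the helper mirrors calculateCost but is restated here so it runs standalone):

```javascript
// GPT-4 pricing per 1K tokens, matching the PRICING table above
const GPT4 = { input: 0.03, output: 0.06 };

function callCost(prices, usage) {
  return (
    (usage.prompt_tokens / 1000) * prices.input +
    (usage.completion_tokens / 1000) * prices.output
  );
}

// A call with 1,200 prompt tokens and 800 completion tokens:
// (1.2 * 0.03) + (0.8 * 0.06) = 0.036 + 0.048 ≈ $0.084
console.log(callCost(GPT4, { prompt_tokens: 1200, completion_tokens: 800 }));
```

At 400 RPM sustained, calls like this one would cost roughly $33/minute, which is why the budget controls below matter.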

Budget Controls

Stop processing when you hit a budget limit:

const DAILY_BUDGET = 100; // $100/day
let dailySpend = 0;

gpt4Queue.on('completed', async (job, result) => {
  dailySpend += result.cost;

  if (dailySpend >= DAILY_BUDGET) {
    console.log('Daily budget reached! Pausing queue.');
    await gpt4Queue.pause();

    // Alert team
    await sendSlackAlert(`OpenAI daily budget of $${DAILY_BUDGET} reached`);
  }
});

// Reset daily spend at midnight
setInterval(async () => {
  const now = new Date();
  if (now.getHours() === 0 && now.getMinutes() === 0) {
    dailySpend = 0;
    await gpt4Queue.resume();
    console.log('Daily budget reset. Queue resumed.');
  }
}, 60000); // Check every minute
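One caveat with matching hours and minutes exactly: a busy event loop or a restart can skip the midnight tick. A more robust alternative is a date comparison, sketched here as a hypothetical helper you'd call from the same interval:

```javascript
// True when `now` falls on a different calendar day than the last reset.
// Safe to call on every tick: it can't double-fire or miss a day.
function isNewDay(lastResetDate, now = new Date()) {
  return now.toDateString() !== lastResetDate.toDateString();
}
```

In the interval you would then reset whenever `isNewDay(lastReset)` is true and record `lastReset = new Date()`, rather than checking for 00:00 exactly.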

Handling Multiple Providers

Most teams use multiple AI providers. Here's a pattern for that:

import OpenAI from 'openai';
import Anthropic from '@anthropic-ai/sdk';

const openai = new OpenAI();
const anthropic = new Anthropic();

// Separate queues per provider
const openaiQueue = new Queue('llm-openai');
const anthropicQueue = new Queue('llm-anthropic');

await openaiQueue.setRateLimit(500);
await anthropicQueue.setRateLimit(1000);

// Unified interface
async function chat(provider, messages, options = {}) {
  const queue = provider === 'anthropic' ? anthropicQueue : openaiQueue;

  const job = await queue.add('chat', {
    provider,
    messages,
    ...options
  });

  return queue.finished(job.id);
}

// Workers
new Worker('llm-openai', async (job) => {
  const response = await openai.chat.completions.create({
    model: job.data.model || 'gpt-4',
    messages: job.data.messages
  });
  return response.choices[0].message.content;
});

new Worker('llm-anthropic', async (job) => {
  const response = await anthropic.messages.create({
    model: job.data.model || 'claude-3-opus-20240229',
    max_tokens: 1024,
    messages: job.data.messages
  });
  return response.content[0].text;
});
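Since each SDK returns a differently shaped response, it's worth normalizing inside the workers so callers of chat() always get plain text back. A small sketch (the function name is my own; the field paths match the worker code above):

```javascript
// Extract the assistant's text from either provider's response shape.
function extractText(provider, response) {
  if (provider === 'anthropic') {
    return response.content[0].text; // Anthropic Messages API shape
  }
  return response.choices[0].message.content; // OpenAI chat completions shape
}
```

With this in place, swapping providers behind the unified chat() interface doesn't leak response-format differences into the rest of your app.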

Production Checklist

Before going to production with rate-limited AI calls:

- Set queue rate limits below your actual provider quotas to leave headroom
- Configure retries with exponential backoff for 429 errors
- Use separate queues for models with different limits
- Track cost per job and alert before you hit your budget, not after
- Monitor queue depth and failure rates so backlogs don't go unnoticed

💡 Pro Tip

Request a rate limit increase from OpenAI once you have consistent usage. They're generally responsive and will bump your limits based on spend history.

Conclusion

Rate limiting AI API calls is essential for building reliable applications. With flashQ's built-in rate limiting, automatic retries, and event system, you can build robust AI applications that respect provider quotas and stay within budget.

Start Building

Get flashQ running and start rate limiting your AI calls in minutes.

Read the Docs →