Rate Limiting Your API: Best Practices Every Developer Should Know

Every API that's ever been exposed to the internet has been abused in some way — usually faster than the developer expected. The question isn't whether you'll need rate limiting, it's whether you'll add it before or after your first incident.

Rate limiting sits at the intersection of security, reliability, and fairness. It keeps your infrastructure stable under load, prevents individual clients from monopolizing resources, and gives you a lever to pull when something goes wrong. Done well, it's largely invisible to legitimate users and a strong wall against everyone else.

This guide walks through everything from choosing the right algorithm to the exact response headers you should return — with working code you can adapt for your stack.

The four rate limiting algorithms compared

Before writing a line of code, you need to pick an algorithm. Each handles traffic bursts differently, and the right choice depends on what you're protecting.

Fixed Window Counter (⭐ Simple)

How it works: Count requests in a fixed time bucket (e.g. 0–60s, 60–120s). Reset the counter when the window ends.

✓ Simplest to implement. Easy to reason about.

✗ Boundary spike problem — a user can send 2× the limit by hitting the end of one window and start of the next.

Best for: Internal admin APIs, low-risk endpoints
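
To make the boundary spike concrete, here is a minimal in-memory sketch of a fixed window counter. The class name `FixedWindowLimiter` and the explicit `now` argument are assumptions made for illustration; it runs in a single process and is not meant as a production implementation.

```python
from collections import defaultdict


class FixedWindowLimiter:
    """In-memory fixed window counter (single process only, illustrative)."""

    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window_seconds = window_seconds
        # (client key, window index) -> request count. Old windows are never
        # pruned in this sketch; a real implementation would expire them.
        self.counts = defaultdict(int)

    def allow(self, key: str, now: float) -> bool:
        window = int(now // self.window_seconds)
        if self.counts[(key, window)] >= self.limit:
            return False
        self.counts[(key, window)] += 1
        return True


limiter = FixedWindowLimiter(limit=5, window_seconds=60)
# Five requests at t=59s (end of window 0), five more at t=60s (start of window 1).
late = [limiter.allow("client-a", now=59.0) for _ in range(5)]
early = [limiter.allow("client-a", now=60.0) for _ in range(5)]
accepted_in_two_seconds = sum(late) + sum(early)
```

All ten requests are accepted within about one second of each other, which is twice the configured limit of five per minute.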

Sliding Window Log (⭐⭐ Moderate)

How it works: Store a timestamp for every request. Count only requests within the last N seconds from now. Drop old timestamps.

✓ Precise — no boundary spikes. Accurate per-second enforcement.

✗ Memory-heavy at scale. Storing every timestamp for millions of users adds up.

Best for: High-security endpoints, authentication, payment flows
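
A sliding window log can be sketched in a few lines with a deque per client. The class name `SlidingWindowLog` and the explicit `now` argument are assumptions for illustration, and the log lives in process memory rather than a shared store.

```python
from collections import defaultdict, deque


class SlidingWindowLog:
    """In-memory sliding window log (single process only, illustrative)."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window_seconds = window_seconds
        self.logs = defaultdict(deque)  # client key -> request timestamps

    def allow(self, key: str, now: float) -> bool:
        log = self.logs[key]
        cutoff = now - self.window_seconds
        # Drop timestamps that have fallen out of the window.
        while log and log[0] <= cutoff:
            log.popleft()
        if len(log) >= self.limit:
            return False
        log.append(now)
        return True


log_limiter = SlidingWindowLog(limit=5, window_seconds=60)
first_burst = [log_limiter.allow("client-a", now=59.0) for _ in range(5)]
second_burst = [log_limiter.allow("client-a", now=60.0) for _ in range(5)]
```

The second burst is rejected in full, because the five requests from one second earlier are still inside the 60-second window. The cost is one stored timestamp per accepted request.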

Sliding Window Counter (⭐⭐ Moderate)

How it works: Hybrid: track current and previous window counts. Weight the previous count by how much of it still falls in the sliding window.

✓ Good approximation of sliding window. Memory-efficient — just two counters per client.

✗ Slightly approximate, not perfectly precise at boundaries.

Best for: Most public APIs — best balance of accuracy and efficiency
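
The weighting step is easiest to see with numbers. The helper below is a small sketch (the function name is an assumption) of the estimate the hybrid approach computes.

```python
def sliding_window_estimate(
    previous_count: int,
    current_count: int,
    elapsed_in_window: float,
    window_seconds: float,
) -> float:
    """Estimate requests in the sliding window from two fixed-window counters."""
    # Fraction of the previous fixed window that still overlaps the sliding window.
    previous_weight = 1 - (elapsed_in_window / window_seconds)
    return current_count + previous_count * previous_weight


# 15s into the current 60s window: 75% of the previous window still counts.
estimate = sliding_window_estimate(
    previous_count=80, current_count=30, elapsed_in_window=15, window_seconds=60
)
# 30 + 80 * 0.75 = 90.0
```

With a limit of 100, this client has roughly 10 requests left, even though the current window's own counter only shows 30.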

Token Bucket (⭐⭐⭐ More involved)

How it works: Each client has a bucket that fills at a constant rate (e.g. 10 tokens/second, max 100). Each request costs tokens. Requests are rejected when the bucket is empty.

✓ Allows controlled bursts. Natural and flexible. Used by large APIs such as AWS API Gateway and Stripe.

✗ Slightly harder to explain to end users. Burst size needs careful tuning.

Best for: Production APIs where controlled bursts are acceptable
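
The refill-and-spend logic can be sketched without any storage layer. The `TokenBucket` dataclass below is an illustrative assumption that uses the example numbers above (10 tokens/second, max 100) and takes the current time as an argument.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    capacity: float
    refill_per_second: float
    tokens: float
    last_refill: float

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Refill in proportion to elapsed time, never above capacity.
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last_refill = now
        if self.tokens < cost:
            return False
        self.tokens -= cost
        return True


bucket = TokenBucket(capacity=100, refill_per_second=10, tokens=100, last_refill=0.0)
burst = sum(bucket.allow(now=0.0) for _ in range(120))
after_one_second = sum(bucket.allow(now=1.0) for _ in range(120))
```

The first 100 requests pass as a burst and the next 20 are rejected. One second later only 10 more are allowed, which is the sustained refill rate.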

Which to pick? For most APIs, start with the sliding window counter. It's accurate enough for almost all use cases, memory-efficient, and straightforward to implement with Redis. Use token bucket if your users have legitimate reasons to burst (batch processing, bulk uploads). Use sliding window log only if you need exact precision for security-critical endpoints.

What to rate limit (and what not to)

Not every endpoint needs the same treatment. Applying a blanket limit everywhere is lazy and often counterproductive — it blocks legitimate use cases while not specifically targeting high-risk ones. Think in tiers:

🔴 Strict limits — security-critical

—Login / password check endpoints
—Password reset request + OTP verification
—Account creation
—Email verification resend
—Any endpoint that triggers an email, SMS, or notification

Suggested limit: 3–10 requests / minute per IP, 5–20 / hour per user

These are the primary targets for brute force. Low limits are non-negotiable.

🟡 Moderate limits — data endpoints

—Search endpoints
—List / browse endpoints (products, users, posts)
—Export / download endpoints
—AI-powered or expensive compute endpoints

Suggested limit: 60–300 requests / minute per user or API key

Protects database and compute resources from hammering without blocking normal use.

🟢 Light limits or none — static / cheap

—Health check endpoints (/health, /ping)
—Static asset endpoints
—Publicly cached read endpoints
—Webhook endpoints (rate limit by payload not requests)

Suggested limit: High limit or none — but still consider IP-level global limits

Over-limiting here breaks monitoring, CDNs, and uptime checks.
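
One way to express these tiers in code is a small prefix table. This is only a sketch: the route prefixes, the `Policy` type, and the specific numbers (picked from the suggested ranges above) are assumptions to replace with your own routes.

```python
from typing import NamedTuple, Optional


class Policy(NamedTuple):
    limit: int
    window_seconds: int


# Hypothetical route prefixes, checked in order. None means "no per-route limit".
POLICIES = [
    ("/auth/", Policy(limit=10, window_seconds=60)),    # strict: security-critical
    ("/search", Policy(limit=120, window_seconds=60)),  # moderate: data endpoints
    ("/export", Policy(limit=60, window_seconds=60)),   # moderate: expensive work
    ("/health", None),                                  # light: monitoring
]
DEFAULT_POLICY = Policy(limit=300, window_seconds=60)


def policy_for(path: str) -> Optional[Policy]:
    """Return the policy for a path, or None if the path is exempt."""
    for prefix, policy in POLICIES:
        if path.startswith(prefix):
            return policy
    return DEFAULT_POLICY
```

Keeping the table in one place makes it easy to review which endpoints sit in which tier.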

Choosing your rate limit key

The rate limit key is what you're counting against. A key that is too narrow (one an attacker can vary at will) means the limit never bites; a key that is too broad (one shared by many clients) blocks legitimate users.

IP address

✓ Use for: Unauthenticated endpoints, login forms, registration

✗ Watch out: Corporate users behind NAT — entire office shares one IP. Blocking it locks out hundreds of legit users.

key = f"rate:ip:{client_ip}"

User ID / API key

✓ Use for: Authenticated endpoints. Most fair — limits are per paying customer.

✗ Watch out: Doesn't protect pre-auth endpoints. Attackers can create many accounts.

key = f"rate:user:{user_id}"

IP + endpoint

✓ Use for: Focused protection per route. Lets users call one endpoint freely while limiting another.

✗ Watch out: More keys to manage. Needs consistent endpoint normalization.

key = f"rate:ip:{client_ip}:POST:/auth/login"

User ID + endpoint

✓ Use for: Best for authenticated APIs with different limits per feature.

✗ Watch out: Requires auth to be resolved before rate limiting — adds middleware complexity.

key = f"rate:user:{user_id}:GET:/api/search"

API key tier

✓ Use for: SaaS products with free/pro/enterprise plans. Limits tied to subscription.

✗ Watch out: Requires knowing the plan at middleware time. Needs caching to avoid DB hits per request.

key = f"rate:apikey:{api_key}"

For most production APIs, combine strategies: use IP-based limits for unauthenticated endpoints and user ID-based limits for authenticated ones. This covers both attack vectors without penalizing corporate users.
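
The combined strategy can be sketched as a single key-building helper. The function name and parameters are assumptions; the key formats follow the examples above.

```python
from typing import Optional


def rate_limit_key(client_ip: str, user_id: Optional[str], method: str, path: str) -> str:
    """Key on the user when authenticated, otherwise on the client IP."""
    endpoint = f"{method.upper()}:{path}"
    if user_id is not None:
        return f"rate:user:{user_id}:{endpoint}"
    return f"rate:ip:{client_ip}:{endpoint}"


anonymous_key = rate_limit_key("203.0.113.7", None, "post", "/auth/login")
user_key = rate_limit_key("203.0.113.7", "42", "get", "/api/search")
```

Here `anonymous_key` is `rate:ip:203.0.113.7:POST:/auth/login` and `user_key` is `rate:user:42:GET:/api/search`, so two users behind the same office IP no longer share a bucket once they are logged in.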

Implementation: Redis + Python / Node.js

Redis is the standard backing store for rate limiting because it's fast, supports atomic operations, and has native TTL (expiry) support. Here are working implementations for the two most common stacks.

Sliding window counter in Python (FastAPI / Flask)

import redis
import time
from fastapi import Request, HTTPException

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def is_rate_limited(
    key: str,
    limit: int,
    window_seconds: int,
) -> tuple[bool, int, int]:
    """
    Sliding window counter using two Redis keys.
    Returns: (is_limited, current_count, remaining)
    """
    now = int(time.time())
    current_window = now // window_seconds
    previous_window = current_window - 1

    current_key = f"{key}:{current_window}"
    previous_key = f"{key}:{previous_window}"

    pipe = r.pipeline()
    pipe.get(current_key)
    pipe.get(previous_key)
    current_count_raw, previous_count_raw = pipe.execute()

    current_count = int(current_count_raw or 0)
    previous_count = int(previous_count_raw or 0)

    # Weight previous window by how much of it is still "in" the window
    elapsed_in_window = now % window_seconds
    previous_weight = 1 - (elapsed_in_window / window_seconds)
    estimated_count = current_count + (previous_count * previous_weight)

    if estimated_count >= limit:
        return True, int(estimated_count), 0

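    # Increment current window counter. The read above and this increment are
    # separate round trips, so concurrent requests can slip slightly past the
    # limit; move both steps into a Lua script if you need strict enforcement.
    pipe = r.pipeline()
    pipe.incr(current_key)
    pipe.expire(current_key, window_seconds * 2)
    pipe.execute()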

    remaining = max(0, limit - int(estimated_count) - 1)
    return False, int(estimated_count) + 1, remaining


# FastAPI middleware
from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()

@app.middleware("http")
async def rate_limit_middleware(request: Request, call_next):
    # request.client can be None (for example with some test clients)
    client_ip = request.client.host if request.client else "unknown"
    path = request.url.path
    window_seconds = 60

    # Strict limit on auth endpoints, looser global limit everywhere else
    if path.startswith("/auth/"):
        key, limit = f"rate:ip:{client_ip}:auth", 10
    else:
        key, limit = f"rate:ip:{client_ip}:global", 300

    # is_rate_limited uses a blocking Redis client; under heavy load consider
    # redis.asyncio so the event loop is not stalled on each request.
    limited, count, remaining = is_rate_limited(
        key, limit=limit, window_seconds=window_seconds
    )

    if limited:
        return JSONResponse(
            status_code=429,
            content={"error": "Too many requests. Please slow down."},
            headers={
                "Retry-After": str(window_seconds),
                "X-RateLimit-Limit": str(limit),
                "X-RateLimit-Remaining": "0",
            },
        )

    response = await call_next(request)
    # Send the limit headers on successful responses too, not only on 429s
    response.headers["X-RateLimit-Limit"] = str(limit)
    response.headers["X-RateLimit-Remaining"] = str(remaining)
    return response

Token bucket in Node.js (Express)

const Redis = require("ioredis");
const redis = new Redis();

// Token bucket rate limiter
async function tokenBucket(key, { capacity, refillRate, refillPeriodMs }) {
  const now = Date.now();
  const bucketKey = `bucket:${key}`;

  const result = await redis
    .multi()
    .hgetall(bucketKey)
    .exec();

  const data = result[0][1];
  let tokens = data?.tokens ? parseFloat(data.tokens) : capacity;
  let lastRefill = data?.lastRefill ? parseInt(data.lastRefill) : now;

  // Refill tokens based on elapsed time
  const elapsed = now - lastRefill;
  const tokensToAdd = (elapsed / refillPeriodMs) * refillRate;
  tokens = Math.min(capacity, tokens + tokensToAdd);
  lastRefill = now;

  if (tokens < 1) {
    // Calculate when the next token will be available
    const msUntilToken = ((1 - tokens) / refillRate) * refillPeriodMs;
    return { allowed: false, remaining: 0, retryAfterMs: Math.ceil(msUntilToken) };
  }

  tokens -= 1;

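  // This read-modify-write is not atomic: concurrent requests for the same key
  // can race. Move the logic into a Lua script if you need strict enforcement.
  // Keep the key for as long as a full refill takes. Expiring it sooner would
  // hand an idle client a full bucket before it had actually refilled.
  const fullRefillMs = Math.ceil((capacity / refillRate) * refillPeriodMs);
  await redis
    .multi()
    .hset(bucketKey, "tokens", tokens.toString(), "lastRefill", lastRefill.toString())
    .pexpire(bucketKey, fullRefillMs)
    .exec();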

  return { allowed: true, remaining: Math.floor(tokens), retryAfterMs: 0 };
}

// Express middleware
function rateLimitMiddleware(config) {
  return async (req, res, next) => {
    const key = req.user?.id
      ? `user:${req.user.id}`
      : `ip:${req.ip}`;

    const result = await tokenBucket(key, config);

    res.setHeader("X-RateLimit-Limit", config.capacity);
    res.setHeader("X-RateLimit-Remaining", result.remaining);

    if (!result.allowed) {
      res.setHeader("Retry-After", Math.ceil(result.retryAfterMs / 1000));
      return res.status(429).json({ error: "Rate limit exceeded." });
    }

    next();
  };
}

// Apply to Express app
const express = require("express");
const app = express();

// Strict limit for login
app.post(
  "/auth/login",
  rateLimitMiddleware({ capacity: 5, refillRate: 1, refillPeriodMs: 60_000 }),
  loginHandler
);

// General API limit
app.use(
  "/api/",
  rateLimitMiddleware({ capacity: 100, refillRate: 10, refillPeriodMs: 1_000 }),
);

Response headers — tell clients what's happening

Good rate limiting is transparent. Clients shouldn't have to guess how many requests they have left or when they can retry. The standard headers make this information machine-readable, which means good API clients can automatically back off without crashing or spamming retries.

X-RateLimit-Limit (Send always)

Example: 100

The maximum number of requests allowed in the current window.

X-RateLimit-Remaining (Send always)

Example: 47

How many requests the client has left before hitting the limit.

X-RateLimit-Reset (Send always)

Example: 1751234567 (Unix timestamp)

When the current window resets and the counter returns to the limit. Send as UTC epoch seconds.

Retry-After (Send on 429)

Example: 30 (seconds)

Only sent on 429 responses. How long the client should wait before retrying. RFC 6585 makes it optional (a 429 response MAY include it), but treat it as required in practice.

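X-RateLimit-Policy (Optional)

Example: 100;w=60;burst=20;comment="standard"

Describes the full policy in a structured format, based on an IETF draft (draft-ietf-httpapi-ratelimit-headers). The draft itself names the field RateLimit-Policy, without the X- prefix, and its syntax has changed between revisions, so check the current draft before depending on it. Not widely required yet but good practice.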

# Example response headers on a normal request
HTTP/1.1 200 OK
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 47
X-RateLimit-Reset: 1751234567

# Example response on a 429
HTTP/1.1 429 Too Many Requests
Content-Type: application/json
Retry-After: 30
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1751234597

{
  "error": "rate_limit_exceeded",
  "message": "You've exceeded 100 requests per minute. Retry after 30 seconds.",
  "retry_after": 30
}
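
The headers shown above can be assembled in one place so that normal and 429 responses stay consistent. This is a sketch: the function name and parameters are assumptions, and it derives Retry-After from the reset time.

```python
import math


def rate_limit_headers(
    limit: int, remaining: int, reset_epoch: int, now: float, limited: bool
) -> dict:
    """Build rate limit headers; Retry-After is added only when limited."""
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(reset_epoch),
    }
    if limited:
        # Whole seconds until the window resets, never less than one.
        headers["Retry-After"] = str(max(1, math.ceil(reset_epoch - now)))
    return headers


ok_headers = rate_limit_headers(100, 47, 1751234567, now=1751234540.0, limited=False)
blocked_headers = rate_limit_headers(100, 0, 1751234597, now=1751234567.0, limited=True)
```

`blocked_headers` carries `Retry-After: 30`, matching the 429 example, while `ok_headers` has no Retry-After at all.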

Rate limit tiers and per-user quotas

A single global limit is a blunt instrument. Production APIs almost always need differentiated limits — different rules for different user types, plan levels, or use cases. Here's a pattern for implementing tiered rate limits cleanly.

Tier	Requests / min	Requests / day	Burst	Who
Unauthenticated	10	500	15	Anonymous / public
Free tier	60	5,000	100	Signed-up, no subscription
Pro	300	50,000	500	Paying individual
Business	1,000	200,000	2,000	Team plan
Enterprise	Custom	Unlimited	Custom	Contract customers
Internal services	None	None	None	Service-to-service (trusted)

# Python: look up tier limits from cache/DB
TIER_LIMITS = {
    "anonymous":   {"per_minute": 10,    "per_day": 500},
    "free":        {"per_minute": 60,    "per_day": 5_000},
    "pro":         {"per_minute": 300,   "per_day": 50_000},
    "business":    {"per_minute": 1_000, "per_day": 200_000},
    "internal":    None,  # no limit
}

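def get_user_limits(user) -> dict | None:
    if user is None:
        return TIER_LIMITS["anonymous"]
    if user.is_internal_service:
        return None  # no limits for trusted services
    # Unknown tiers fall back to free-tier limits. Enterprise limits are custom
    # and have no entry above, so look them up per contract before this line.
    return TIER_LIMITS.get(user.subscription_tier, TIER_LIMITS["free"])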

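# In a route dependency or handler. (An HTTPException raised inside
# @app.middleware is not converted to a response; return a JSONResponse there.)
# `key` is the client identifier, e.g. f"user:{user.id}" or f"ip:{client_ip}";
# request.user assumes an authentication layer has already populated it.
limits = get_user_limits(request.user)
if limits:
    limited_minute, _, remaining = is_rate_limited(
        key=f"rate:{key}:minute",
        limit=limits["per_minute"],
        window_seconds=60,
    )
    # Only consult the daily counter when the per-minute check passed, so a
    # request that is already rejected does not also consume daily quota.
    limited_day = False
    if not limited_minute:
        limited_day, _, _ = is_rate_limited(
            key=f"rate:{key}:day",
            limit=limits["per_day"],
            window_seconds=86_400,
        )
    if limited_minute or limited_day:
        raise HTTPException(status_code=429, detail="Rate limit exceeded")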

Common bypass techniques to defend against

Naive rate limiting is easier to bypass than most developers realize. Here are the most common tricks attackers use — and how to close each gap.

⚔️ IP rotation

Attacker routes requests through a proxy pool, using a different IP for each one. IP-only rate limiting is completely defeated.

Defense: Add rate limits on user ID / API key post-authentication. Use fingerprinting signals (User-Agent, Accept-Language, TLS fingerprint) to detect rotation patterns. Services like Cloudflare Bot Management do this automatically.

⚔️ Account farming

Attacker creates thousands of free accounts to each get their own rate limit bucket. Distributed brute force through many accounts.

Defense: Require email verification before full access. Apply stricter limits to unverified accounts. Use device fingerprinting to link accounts from the same device. Require CAPTCHA after N failed attempts.

⚔️ Slow drip attacks

Instead of hammering the endpoint, attacker sends requests just under the threshold — forever. Doesn't trigger the rate limit but still causes damage over time.

Defense: Add both per-minute AND per-day limits. Monitor for sustained low-rate patterns. A real user rarely sends exactly 9 requests/minute for 72 hours straight.

⚔️ Header spoofing

If you trust X-Forwarded-For or X-Real-IP headers without validation, attackers can spoof their IP address and bypass IP-based limits.

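Defense: Only trust X-Forwarded-For when the request arrives from your own load balancer or proxy. Set TRUSTED_PROXIES explicitly. Walk the header from right to left and take the first address that is not one of your trusted proxies; anything further left may have been supplied by the client.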
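
A sketch of that extraction using the standard-library `ipaddress` module is below. The `10.0.0.0/8` proxy range and the function names are assumptions; substitute the networks your load balancers actually use.

```python
import ipaddress
from typing import Optional

# Assumption: your own proxies and load balancers live in this range.
TRUSTED_PROXIES = [ipaddress.ip_network("10.0.0.0/8")]


def _parse(ip: str):
    try:
        return ipaddress.ip_address(ip.strip())
    except ValueError:
        return None


def is_trusted(ip: str) -> bool:
    addr = _parse(ip)
    return addr is not None and any(addr in net for net in TRUSTED_PROXIES)


def client_ip(peer_ip: str, forwarded_for: Optional[str]) -> str:
    # A direct connection from an untrusted peer: ignore the header entirely.
    if not forwarded_for or not is_trusted(peer_ip):
        return peer_ip
    # Walk right to left, skipping entries appended by our own proxies.
    for hop in reversed(forwarded_for.split(",")):
        hop = hop.strip()
        if is_trusted(hop):
            continue
        # First untrusted entry: use it if it parses, else fall back to the peer.
        return hop if _parse(hop) is not None else peer_ip
    return peer_ip
```

If an attacker sends `X-Forwarded-For: 1.2.3.4` through your proxy, the proxy appends the real address to the right of it, so the right-to-left walk returns the real address and never reaches the spoofed one.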

⚔️ Endpoint variation

If you rate limit /api/search but not /api/search/, /api/Search, or /API/search, attackers try variations to find unprotected paths.

Defense: Normalize all paths before using them in rate limit keys. Apply limits at the middleware level before routing, not at the route handler.
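
A normalization step might look like the sketch below. The function name is an assumption, and lowercasing is only appropriate if your router treats paths case-insensitively; it also decodes percent-encoding once, so double-encoded input would need separate handling.

```python
import posixpath
from urllib.parse import unquote


def normalize_path(raw_path: str) -> str:
    """Collapse common path variations into one canonical form for rate limit keys."""
    path = raw_path.split("?", 1)[0]   # drop any query string
    path = unquote(path).lower()       # decode %xx once, fold case
    # Strip leading slashes first: normpath would otherwise keep a leading "//".
    return posixpath.normpath("/" + path.lstrip("/"))


variants = ["/api/search", "/api/search/", "/api/Search", "/API/search",
            "/api//search", "/api/./search", "/api/%73earch?q=1"]
normalized = {normalize_path(v) for v in variants}
```

Every variant collapses to `/api/search`, so they all count against the same bucket.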

Production readiness checklist

Before shipping your rate limiting to production, work through this checklist. Each item represents a real gap that's caused an incident somewhere.

🔴 Rate limits applied before auth resolves (for auth endpoints) [Critical]
🔴 Both per-minute and per-day limits configured for key endpoints [Critical]
🔴 All 429 responses include Retry-After header [Critical]
⚪ X-RateLimit-Remaining header sent on every response
🔴 Client IP extracted correctly from X-Forwarded-For (trusted proxy only) [Critical]
⚪ Rate limit key includes endpoint path, not just IP
🔴 Redis connection failure handled gracefully: fail open or closed? [Critical]
⚪ Rate limit errors logged with key, client ID, and endpoint for monitoring
🔴 Alert set up for sudden spike in 429 responses (potential attack) [Critical]
⚪ Internal service accounts whitelisted from rate limits
⚪ Rate limit configuration externalized (env/config file, not hardcoded)
🔴 Tested that rotating IPs doesn't bypass per-user limits [Critical]
⚪ Path normalization applied before building rate limit keys
⚪ 429 error message is informative but doesn't reveal limit details to attackers

⚠️ What to do when Redis goes down

If your rate limiter depends on Redis and Redis goes down, you have two choices: fail open (allow all requests) or fail closed (block all requests). Failing open is usually right for availability — a brief window of unprotected traffic is better than a full outage. Implement a circuit breaker pattern around Redis calls so that a Redis failure degrades gracefully rather than crashing your middleware.
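
A fail-open circuit breaker can be sketched as a thin wrapper around whatever function performs the Redis check. The class name, thresholds, and the broad `except Exception` are assumptions for illustration; in real code you would likely catch only your Redis client's connection and timeout errors.

```python
import time
from typing import Callable, Optional


class FailOpenBreaker:
    """Skip the rate limit store for a cooldown period after repeated failures."""

    def __init__(self, failure_threshold: int = 3, cooldown_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def is_limited(self, check: Callable[[], bool]) -> bool:
        """Run check(), which returns True when the request should be rejected."""
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown_seconds:
                return False  # breaker open: do not touch the store, allow the request
            # Cooldown over: allow one trial call; a single failure re-opens.
            self.opened_at = None
            self.failures = self.failure_threshold - 1
        try:
            limited = check()
        except Exception:  # assumption: narrow this to store errors in real code
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return False  # fail open
        self.failures = 0
        return limited
```

In middleware this would wrap the existing check, for example `breaker.is_limited(lambda: is_rate_limited(key, limit, window_seconds)[0])`. To fail closed instead, return True from the two fail-open branches.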

FAQ

Should I implement rate limiting myself or use a gateway?

For simple cases, implement it yourself — it gives you full control and there are no moving parts. For complex cases (multiple services, multiple teams, high traffic), use an API gateway like Kong, AWS API Gateway, or Cloudflare. They handle rate limiting, auth, logging, and routing in one layer. The two aren't mutually exclusive: a gateway for global limits + application-level limits for fine-grained control is a common production pattern.

What HTTP status code should I return for rate limiting?

429 Too Many Requests is the correct code, standardized in RFC 6585. Some older APIs used 503 (Service Unavailable) or 403 (Forbidden) — both are wrong. 429 tells clients exactly what happened and that retrying later will work. Always pair it with a Retry-After header.

How do I rate limit without Redis?

For a single-server setup, an in-memory store (a dict in Python, a Map in Node.js) works but won't survive restarts and doesn't scale across multiple processes. For multiple servers, you need shared state — Redis is the standard, but Memcached or a database with atomic operations (PostgreSQL advisory locks, for example) also work. For serverless functions, you'll need an external store since function instances don't share memory.

How do I handle legitimate bursts without punishing users?

Use the token bucket algorithm with a burst parameter. If your normal limit is 60 req/min, allow a burst of 100 tokens. This lets a user make 100 requests immediately after a long pause without hitting the limit, while still enforcing the average rate. Stripe and AWS API Gateway, among others, take this approach.

Should I rate limit by IP or by user?

Both, applied at different layers. IP limits protect pre-authentication endpoints (login, registration, password reset) where you don't have a user ID yet. User-based limits protect authenticated endpoints and are fairer — they don't penalize shared offices or universities where many people share an IP. Apply both for critical endpoints: IP limit as a first line, user limit after auth.

How do I communicate rate limits to my API users?

Document them clearly in your API reference: what the limits are, what keys they apply to, and what happens when they are exceeded. Return the X-RateLimit-* headers on every response so clients can build backoff logic. Send a Retry-After header on 429s. Consider a dashboard showing usage vs. limits for API key holders. Surprises are the worst; known limits that clients can program against are fine.
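
On the client side, backoff logic built on these headers might look like the sketch below. The function name and defaults are assumptions, and it only handles the seconds form of Retry-After (the header can also carry an HTTP date).

```python
import random
from typing import Mapping, Optional


def retry_delay(status_code: int, headers: Mapping[str, str], attempt: int,
                base: float = 1.0, cap: float = 60.0) -> Optional[float]:
    """Seconds to wait before retrying, or None if no retry is needed."""
    if status_code != 429:
        return None
    retry_after = headers.get("Retry-After")
    if retry_after is not None and retry_after.strip().isdigit():
        return float(retry_after)  # the server told us exactly how long to wait
    # No usable header: exponential backoff with full jitter.
    return random.uniform(0, min(cap, base * 2 ** attempt))
```

A client would sleep for the returned number of seconds before retrying, and stop after a bounded number of attempts.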

Quick recap

1. Pick your algorithm: sliding window counter for most APIs, token bucket if clients need burst support.
2. Not every endpoint needs the same limit: auth and notification endpoints need the strictest treatment.
3. Rate limit key = IP for pre-auth, user ID for post-auth, combine both for critical endpoints.
4. Redis is the standard backend: atomic increments plus TTL make it ideal for this.
5. Always return X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset on every response, plus Retry-After on 429s.
6. Implement tiered limits per plan: anonymous, free, pro, and enterprise should have different quotas.
7. Defend against bypass: IP rotation, account farming, slow drip, header spoofing, path variation.
8. Handle Redis failure gracefully: decide on fail-open vs. fail-closed before it happens in production.
