The Token-Bucket Algorithm: Smart Rate-Limiting for Modern API Workloads
By Xiaoyi Zhu
You’ve built an incredible new service—maybe a powerful API wrapping an LLM, a data processing endpoint, or a SaaS feature. It works beautifully. But as users start pouring in, a new kind of fear sets in. How do you stop one power user with a runaway script from racking up a huge bill on your OpenAI account? How do you prevent a sudden burst of traffic from degrading the service for all your other customers?
This is the classic rate-limiting problem. You need to let legitimate users make requests, even in short bursts, but you must enforce a fair usage policy over the long term to keep your service stable and your costs predictable.
The Headache: Your Brilliant API is Getting Abused
Without effective rate-limiting, even the best-designed APIs are vulnerable to common and frustrating problems:
- Runaway Costs: Services that rely on pay-per-use resources (like LLM providers) are at risk. A single user hitting your endpoint in a tight loop could exhaust your budget in minutes.
- Denial of Service (for Everyone Else): A burst of traffic from one client can overwhelm your servers, database, or upstream APIs, leading to slow response times and timeouts for all other users. This is often unintentional, but the effect is the same.
- Upstream Rate Caps: The APIs you depend on (like Google Maps, Twitter, or a specific AI model) have their own rate limits. If your service funnels too many requests, you'll get blocked, breaking your application for everyone.
- Unfair Resource Allocation: In a multi-tenant system, you want to ensure that one high-volume customer can't monopolize all the available resources at the expense of others on the same plan.
Simply rejecting every request over a certain number per second is too blunt. It doesn't allow for natural, "bursty" traffic patterns. What you need is a smarter, more flexible approach.
The Simple Fix: The Token-Bucket Algorithm
The good news is that this is a well-solved problem. The token-bucket algorithm is a simple and elegant way to manage usage that handles bursts gracefully while ensuring a consistent average rate.
Think of it like a gumball machine with a very specific set of rules:
- The machine (the "bucket") can hold a maximum number of gumballs (the "capacity"). It starts full.
- To make an API request, a user must take one gumball (a "token") from the machine.
- The machine is slowly refilled with new gumballs at a steady, fixed rate (the "refill rate").
- If a user wants to make a request but the machine is empty, their request is rejected. They have to wait for it to be refilled.
This model perfectly matches our needs. A user can make a quick burst of requests, using up all the available tokens. But after that, they are limited to the steady refill rate, preventing long-term abuse while the bucket tops itself back up for the next burst.
╭───────────────── Bucket (Capacity: 10) ──────────────────╮
│  Tokens: 🟢🟢🟢🟢🟢⚪⚪⚪⚪⚪  (5 remaining)              │
╰──────────────────────────────────────────────────────────╯
      ▲                                ▲
      │ Consume 1 token per request    │ Refills at a slow, steady
      ▼                                │ rate (e.g., 1 token / 5 sec)
┌──────────┐                    ┌────────────┐
│ API Call │ ── Needs Token ──> │  Is Token  │ ── YES ──> Process Request
└──────────┘                    │ Available? │
                                └─────┬──────┘
                                      │
                                      └── NO ──> Reject (HTTP 429)
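To put numbers on it: with the settings in the diagram (capacity 10, refill of 1 token every 5 seconds), a client can fire 10 requests back-to-back, but over any longer stretch it averages at most 12 requests per minute while the bucket tops back up.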
How It Works: A Python Implementation
The logic is surprisingly straightforward. You only need to store two pieces of information for each user or API key: the number of remaining tokens and the time of the last refill.
In a real-world application, you would persist this state in a shared datastore like Redis or a database to coordinate usage across all your application servers. For simplicity, this example shows a thread-safe, in-memory implementation.
token_bucket.py
(The Core Logic)
1. This class encapsulates the algorithm itself. Note that we calculate refills based on the time elapsed since the last request, so there's no need for a background timer.
import time
import threading
from typing import final


@final
class TokenBucket:
    """
    Thread-safe Token Bucket for regulating request rates.

    Attributes:
        capacity (float): Maximum tokens in the bucket (burst limit).
        _refill_rate (float): Tokens replenished per second.
        _tokens (float): Current token count.
        _last_refill_time (float): Last timestamp tokens were refilled.
        _lock (threading.Lock): Lock to ensure thread-safe operations.
    """

    def __init__(self, capacity: float, refill_rate: float):
        """
        Initialize a TokenBucket.

        Args:
            capacity (float): >0, max tokens bucket can hold.
            refill_rate (float): >0, tokens added each second.

        Raises:
            ValueError: if capacity or refill_rate are non-positive.
        """
        # Validate inputs
        if not (capacity > 0 and refill_rate > 0):
            raise ValueError("Capacity and refill rate must be positive")
        # Core parameters
        self.capacity = float(capacity)
        self._refill_rate = float(refill_rate)
        # Start full
        self._tokens = float(capacity)
        # Timestamp for next refill calculation
        self._last_refill_time = time.monotonic()
        # Lock for thread-safe token modifications
        self._lock = threading.Lock()

    def _refill(self) -> None:
        """
        Refill tokens based on elapsed time since last refill.
        Must be called while holding the lock.

        Calculates:
            elapsed = now - _last_refill_time
            tokens_to_add = elapsed * _refill_rate
        Applies cap: min(capacity, _tokens + tokens_to_add)
        Updates _last_refill_time to now.
        """
        now = time.monotonic()
        elapsed = now - self._last_refill_time
        # Only refill if time has advanced
        if elapsed > 0:
            tokens_to_add = elapsed * self._refill_rate
            # Increase tokens, but do not exceed capacity
            self._tokens = min(self.capacity, self._tokens + tokens_to_add)
            # Reset refill timestamp
            self._last_refill_time = now

    def consume(self, tokens: float = 1.0) -> bool:
        """
        Attempt to remove tokens from the bucket.

        Args:
            tokens (float): Number of tokens to consume (>0).

        Returns:
            bool: True if tokens were available and consumed, False otherwise.

        Raises:
            ValueError: if requested token count is non-positive.
        """
        if tokens <= 0:
            raise ValueError("Number of tokens to consume must be positive")
        with self._lock:
            # Refill before attempting consumption
            self._refill()
            # If enough tokens, deduct and allow
            if self._tokens >= tokens:
                self._tokens -= tokens
                return True
            # Not enough tokens, rate limit exceeded
            return False

    @property
    def tokens(self) -> float:
        """
        Get current token count, after refilling.

        Returns:
            float: current available tokens (<= capacity).
        """
        with self._lock:
            # Ensure up-to-date token count
            self._refill()
            return self._tokens
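Before we get to the tests, here's a quick usage sketch (the capacity and rate are made up for illustration):

from token_bucket import TokenBucket

# Allow bursts of 5 requests, sustained at 1 request per second
bucket = TokenBucket(capacity=5, refill_rate=1)

for i in range(1, 8):
    if bucket.consume():
        print(f"Request {i}: allowed")
    else:
        print(f"Request {i}: rejected (would return HTTP 429)")

In a tight loop like this, the first 5 requests drain the bucket and the last 2 are rejected.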
test_token_bucket.py
(A Simple Test Case)
2. A good implementation deserves a good test to verify its behavior, especially the burst and refill logic.
import unittest
import time

from token_bucket import TokenBucket


class TestTokenBucketRealistic(unittest.TestCase):
    """
    Test suite for the TokenBucket algorithm, as used in an API rate limiter.

    Scenario:
        - Burst capacity: 10 tokens (max immediate requests)
        - Refill rate: 5 tokens per second (sustained rate)

    Each test prints actual vs. expected outcomes for clarity in a blog post.
    """

    def test_burst_limit(self):
        """
        Burst Test:
        Simulates sending 11 rapid API requests. The bucket should allow
        the first 10 (up to capacity) and deny the 11th.
        """
        bucket = TokenBucket(capacity=10, refill_rate=5)
        print("\n--- Burst Test (capacity=10) ---")
        allowed_count = 0
        for i in range(1, 12):
            # Attempt to consume 1 token per request
            allowed = bucket.consume()
            expected = (i <= 10)
            status = "ALLOWED" if allowed else "DENIED"
            exp_status = "ALLOWED" if expected else "DENIED"
            print(f"  Request #{i:2d}: actual={status:7s} expected={exp_status}")
            if allowed:
                allowed_count += 1
        # Verify exactly 10 requests were permitted
        self.assertEqual(allowed_count, 10,
                         f"Allowed {allowed_count} requests; expected 10")
        # The next immediate request must be denied
        self.assertFalse(bucket.consume(), "11th request must be DENIED")

    def test_sustained_rate(self):
        """
        Sustained Throughput Test:
        After draining the bucket, wait 1 second to allow the refill rate
        to add tokens. Exactly 5 requests should succeed (5 tokens/sec).
        """
        bucket = TokenBucket(capacity=10, refill_rate=5)
        # Drain the bucket fully (simulate a sudden burst)
        for _ in range(10):
            bucket.consume()
        print("\n--- Sustained Rate Test (5 tokens/sec) ---")
        print("  Waiting 1 second to refill...")
        time.sleep(1)
        # Check how many tokens were refilled
        available = bucket.tokens
        print(f"  Tokens refilled: actual={available:.2f} expected≈5.00")
        self.assertGreaterEqual(available, 4.9,
                                f"Expected ≥4.9 tokens, got {available:.2f}")
        successes = 0
        # Attempt 6 subsequent requests: only first 5 should pass
        for i in range(1, 7):
            allowed = bucket.consume()
            expected = (i <= 5)
            status = "ALLOWED" if allowed else "DENIED"
            exp_status = "ALLOWED" if expected else "DENIED"
            print(f"  Request #{i:2d} after refill: actual={status:7s} expected={exp_status}")
            if allowed:
                successes += 1
        self.assertEqual(successes, 5,
                         f"Allowed {successes} after refill; expected 5")
        self.assertFalse(bucket.consume(),
                         "6th request after refill must be DENIED")

    def test_no_overfill(self):
        """
        Idle Capping Test:
        Consume a few tokens, then wait long enough that the theoretical
        refill would exceed capacity. The bucket should never exceed its max.
        """
        bucket = TokenBucket(capacity=10, refill_rate=5)
        # Consume 3 tokens (immediate requests)
        bucket.consume(3)
        current = bucket.tokens
        print("\n--- No Overfill Test ---")
        print(f"  After consuming 3: actual={current:.2f} expected≈7.00")
        print("  Waiting 3 seconds (would refill 15 tokens) ...")
        time.sleep(3)
        final = bucket.tokens
        print(f"  After idle: actual={final:.2f} expected=10.00 (cap)")
        # Ensure the bucket caps at its capacity
        self.assertAlmostEqual(final, 10.0, delta=1e-3,
                               msg=f"Final tokens {final:.2f}; should cap at 10.00")


if __name__ == '__main__':
    # Default verbosity=1: clean pass/fail summary
    unittest.main()
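Run the file directly (python test_token_bucket.py) to see each request's ALLOWED/DENIED line alongside the pass/fail summary; the suite sleeps for about four seconds in total while it waits for refills.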
Why This Rocks for APIs
- Handles Bursts Gracefully: Users can have short, intense periods of activity, which is natural. This approach accommodates them without punishing them.
- Guaranteed Long-Term Rate: While bursts are allowed, the math gives a hard ceiling: over any window of T seconds, a bucket can release at most capacity + refill_rate × T tokens, so the long-run average consumption rate can never exceed your configured refill rate.
- Protects Your Budget & Stability: It provides a predictable ceiling on usage, preventing cost overruns and protecting your backend services from being overwhelmed.
- Simple to Implement and Reason About: The state is minimal (just two variables) and the logic is easy to understand, making it less prone to bugs than more complex solutions.
- Highly Flexible: You can create different buckets for different users, API keys, or subscription plans. Want to give "Pro" users a higher burst capacity and a faster refill rate? Just instantiate the class with different numbers, as in the sketch below.
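A minimal sketch of that per-plan configuration; the tier names and numbers here are invented for illustration:

from token_bucket import TokenBucket

# Hypothetical tiers: Pro gets a bigger burst and a faster refill
PLAN_LIMITS = {
    "free": {"capacity": 10,  "refill_rate": 0.5},  # sustains ~30 requests/min
    "pro":  {"capacity": 100, "refill_rate": 5.0},  # sustains ~300 requests/min
}

def new_bucket_for(plan: str) -> TokenBucket:
    """Create a fresh bucket sized to the user's subscription plan."""
    limits = PLAN_LIMITS[plan]
    return TokenBucket(limits["capacity"], limits["refill_rate"])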
Putting It to Work: Rate-Limiting an LLM Service
Imagine you're offering a service with different subscription tiers for access to standard and premium AI models.
Plan       | Standard Models (e.g., GPT-4.1) | Premium Models (e.g., o3)
-----------|---------------------------------|-----------------------------------------
Free       | 50 requests / day               | 5 requests / day (shares the same pool)
Pro        | Unlimited                       | 200 requests / day
Enterprise | Unlimited                       | Unlimited
Here's how you'd manage this with token buckets:
- Create Buckets Per User: For each user, you would maintain one or two bucket states in your database (e.g., in a Redis hash or a SQL table).
- Configure Per Plan:
  - Free Plan: One bucket with capacity=50 and refill_rate=50 / 86400 (50 tokens spread over the 86,400 seconds in a day). All model calls draw from this single bucket; to make premium calls work out to roughly 5 per day, you could charge them 10 tokens each via consume(tokens=10).
  - Pro Plan: One bucket for premium models with capacity=200 and refill_rate=200 / 86400. Standard models would have no bucket and would bypass the check.
  - Enterprise Plan: No buckets needed. The check is skipped entirely.
- On Each API Request (see the sketch after this list):
  - Identify the user and their plan.
  - Determine which bucket to use based on the model requested.
  - Load the bucket's state (_tokens, _last_refill_time) from the database.
  - Call bucket.consume().
  - If it returns True, process the request and save the updated state back to the database.
  - If it returns False, reject the request with an HTTP 429 Too Many Requests error.
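To make that flow concrete, here's a minimal sketch with some loudly labeled assumptions: a plain dict stands in for Redis or SQL, the handle_request function and its tier names are invented for this example, premium calls on the Free plan are weighted at 10 tokens each (so the 50-token daily pool yields 5 premium requests), and wall-clock time (time.time()) replaces the monotonic clock, because monotonic timestamps can't be compared across processes.

import time

FAKE_DB: dict = {}  # stands in for Redis/SQL: user_id -> bucket state

POOLS = {  # daily pools from the table above
    "free": {"capacity": 50,  "refill_rate": 50 / 86400},
    "pro":  {"capacity": 200, "refill_rate": 200 / 86400},
}

def handle_request(user_id: str, plan: str, model_tier: str) -> int:
    """Return an HTTP status: 200 to process the request, 429 to reject it."""
    # Steps 1-2: identify the plan and pick a bucket, or skip the check
    if plan == "enterprise" or (plan == "pro" and model_tier == "standard"):
        return 200
    limits = POOLS[plan]

    # Step 3: load persisted state; brand-new buckets start full. A shared
    # datastore must hold wall-clock timestamps (time.time()), because
    # time.monotonic() is only meaningful within a single process.
    state = FAKE_DB.get(user_id) or {
        "tokens": limits["capacity"],
        "last_refill_time": time.time(),
    }

    # Step 4: refill, then consume (inlined from TokenBucket.consume)
    now = time.time()
    elapsed = now - state["last_refill_time"]
    state["tokens"] = min(limits["capacity"],
                          state["tokens"] + elapsed * limits["refill_rate"])
    state["last_refill_time"] = now
    cost = 10.0 if (plan == "free" and model_tier == "premium") else 1.0
    allowed = state["tokens"] >= cost
    if allowed:
        state["tokens"] -= cost

    # Step 5: save the updated state back
    FAKE_DB[user_id] = state
    return 200 if allowed else 429

One thing this sketch glosses over: in production the load-modify-save must be atomic across app servers (with Redis, a Lua script or a WATCH/MULTI transaction is the usual approach).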
This database record might look like this in a document store:
{
  "userId": "user-12345",
  "plan": "pro",
  "buckets": {
    "premium_models": {
      "capacity": 200,
      "refillRate": 0.0023148, // 200 / 86400
      "tokens": 145.7,
      "lastRefillTime": 1718992345.678
    }
  }
}
Wrapping Up
In the age of powerful, pay-per-use APIs, effective rate-limiting isn't just a "nice-to-have"—it's a critical component of a robust, scalable, and financially viable service. The token-bucket algorithm provides a simple, flexible, and powerful guarantee: it allows for good-faith bursts while ensuring that long-term usage always stays within your defined limits.
By integrating this small but mighty piece of logic into your middleware, you can shield your service from abuse, ensure a fair experience for all your users, and sleep better at night knowing your costs are under control.