Plan throughput for your LLM API rate limits
Enter your account's RPM and TPM limits plus your per-task token and latency profile. The calculator finds the safe concurrency, the binding limit, and how many tasks per minute your agent pipeline can actually push.
Inputs
From your provider dashboard / tier.
Combined input + output token budget.
Prompt + context per request.
Completion length per request.
e.g. multi-turn agent loops > 1.
Wall-clock time per request.
Prompt-cache hits count less toward TPM.
Buffer to avoid 429s on bursts.
| Constraint | Capacity (tasks/min) | % of limit used |
|---|---|---|
How the throughput is calculated
Most LLM providers enforce at least two independent ceilings: a requests-per-minute (RPM) cap and a tokens-per-minute (TPM) cap. Your real throughput is set by whichever runs out first. This tool expresses both capacities in the same unit, tasks per minute, so the bottleneck is obvious instead of hidden behind raw 429 errors.
The formula
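First it derives the token cost of one task. Input tokens are split into cached and fresh: billable_in = in × (1 − cached%), where cached% is the share of input tokens served from the prompt cache. This treats cache reads as free toward TPM, an approximation that holds on providers that discount or exempt them; check your provider's rules. Then tokens_per_task = (billable_in + out) × calls.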
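As a minimal sketch of the token-cost step, assuming the cached share is entered as a percentage from 0 to 100 and that cached tokens count as zero toward TPM (the function and parameter names are illustrative, not the calculator's own):

```python
def tokens_per_task(in_tokens: float, out_tokens: float,
                    calls: float, cached_pct: float = 0.0) -> float:
    """TPM-relevant tokens consumed by one task."""
    # Assumption: cached input tokens cost nothing toward TPM.
    billable_in = in_tokens * (1 - cached_pct / 100)
    return (billable_in + out_tokens) * calls


# 4,000 input tokens (half of them cached), 500 output tokens, 3 calls per task.
example = tokens_per_task(4000, 500, 3, cached_pct=50)
print(example)  # 7500.0
```

Here the cache halves the billable input, so each call costs 2,500 tokens instead of 4,500.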
Two independent capacities follow, each derived from an effective limit after headroom (eff = limit × (1 − headroom%)):
rpm_capacity = eff_RPM ÷ calls and tpm_capacity = eff_TPM ÷ tokens_per_task.
The achievable rate is tasks_per_min = min(rpm_capacity, tpm_capacity), and the binding limit is whichever produced that minimum. A third ceiling matters in practice: latency. With a per-call wall time of L seconds and N calls per task, a single worker finishes tasks_per_worker = 60 ÷ (N × L) tasks per minute, so the concurrency needed to actually reach the rate-limit ceiling is ceil(tasks_per_min ÷ tasks_per_worker). Launch more workers than that and the extra ones only generate 429s; launch fewer and you leave quota on the table.
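The whole calculation can be sketched in a few lines. This is an illustration under stated assumptions rather than the calculator's actual source: percentages are entered as 0 to 100, all inputs are positive, calls within a task run sequentially, and the names `Plan` and `plan_throughput` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Plan:
    rpm_capacity: float   # tasks/min allowed by the request limit
    tpm_capacity: float   # tasks/min allowed by the token limit
    tasks_per_min: float  # achievable rate (the smaller capacity)
    binding: str          # "RPM" or "TPM"
    concurrency: int      # workers needed to reach tasks_per_min


def plan_throughput(rpm_limit, tpm_limit, in_tokens, out_tokens, calls,
                    latency_s, cached_pct=0.0, headroom_pct=0.0) -> Plan:
    # Effective limits after reserving headroom for bursts.
    keep = 1 - headroom_pct / 100
    eff_rpm, eff_tpm = rpm_limit * keep, tpm_limit * keep

    # Token cost of one task; cached input is assumed free toward TPM.
    billable_in = in_tokens * (1 - cached_pct / 100)
    tokens_per_task = (billable_in + out_tokens) * calls

    rpm_capacity = eff_rpm / calls
    tpm_capacity = eff_tpm / tokens_per_task
    tasks_per_min = min(rpm_capacity, tpm_capacity)
    binding = "RPM" if rpm_capacity <= tpm_capacity else "TPM"

    # One worker runs its calls back to back: 60 / (N * L) tasks per minute.
    tasks_per_worker = 60 / (calls * latency_s)
    concurrency = math.ceil(tasks_per_min / tasks_per_worker)
    return Plan(rpm_capacity, tpm_capacity, tasks_per_min, binding, concurrency)


plan = plan_throughput(rpm_limit=500, tpm_limit=200_000, in_tokens=3000,
                       out_tokens=500, calls=4, latency_s=5, headroom_pct=20)
print(plan.binding, f"{plan.tasks_per_min:.2f}", plan.concurrency)  # TPM 11.43 4
```

In this example the request limit would allow 100 tasks per minute, but the token limit caps the pipeline at about 11.4, and four workers are enough to saturate it.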
Why it matters for agent pipelines
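Multi-agent and tool-loop workloads multiply calls per task, and each extra call usually carries a longer context as tool results and prior turns accumulate. Adding calls alone shrinks both capacities by the same factor; it is the growth in tokens per call that shifts the bottleneck, so a job that looks RPM-safe at 1 call quietly becomes TPM-bound at 6 calls with long contexts. The percent-of-limit columns show how much of each ceiling is consumed when running at the achievable rate, letting you right-size worker pools, batch windows, and retry budgets before paying for throughput you can't actually use.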
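A small sketch makes the shift visible. The limits and token counts below are made-up illustrative figures, the helper `capacities` is hypothetical, and headroom and caching are left out for brevity:

```python
def capacities(rpm_limit, tpm_limit, in_tokens, out_tokens, calls):
    """Return (rpm_capacity, tpm_capacity, binding) in tasks per minute."""
    rpm_capacity = rpm_limit / calls
    tpm_capacity = tpm_limit / ((in_tokens + out_tokens) * calls)
    binding = "RPM" if rpm_capacity <= tpm_capacity else "TPM"
    return rpm_capacity, tpm_capacity, binding


# Single-shot job: one call with a short prompt.
single = capacities(500, 2_000_000, in_tokens=1500, out_tokens=500, calls=1)

# Agent loop: six calls, averaging 12,000 input tokens as context accumulates.
agent = capacities(500, 2_000_000, in_tokens=12_000, out_tokens=500, calls=6)

print(single[2])  # RPM
print(agent[2])   # TPM
```

With these figures the single-shot job is limited by requests (500 tasks per minute against a token capacity of 1,000), while the agent loop is limited by tokens (about 27 tasks per minute against a request capacity of about 83).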