Token cost
Per-token pricing charged by LLM providers for input and output — the primary cost driver for AI applications at scale.
Updated 2026-04-22 · 3 min read
Definition
Large language model providers price requests by the token, which is roughly a sub-word unit of text. Inputs (prompt plus context) and outputs (the generated completion) are metered separately, usually at different rates, with output typically priced 3–5× higher than input.
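As a rough illustration of the metering described here, the cost of one request is the input token count times the input rate plus the output token count times the output rate. The sketch below assumes prices are quoted in USD per million tokens. The `Rates` class, the `request_cost` function, and the example prices are hypothetical and do not reflect any provider's actual price list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """Hypothetical prices in USD per one million tokens."""
    input_per_mtok: float
    output_per_mtok: float


def request_cost(input_tokens: int, output_tokens: int, rates: Rates) -> float:
    """Return the USD cost of one request, metering input and output separately."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = (input_tokens * rates.input_per_mtok
             + output_tokens * rates.output_per_mtok)
    return total / 1_000_000


# Illustrative rates only: output priced at 4x input.
example_rates = Rates(input_per_mtok=1.00, output_per_mtok=4.00)

# A 2,000-token prompt with a 500-token completion.
cost = request_cost(2_000, 500, example_rates)
print(f"${cost:.4f} per request")
```

With these made-up rates, the 500 output tokens cost as much as the 2,000 input tokens, which is why long completions deserve as much scrutiny as long prompts.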
Why it matters
For most production AI workloads, token cost dwarfs everything else on the bill. The biggest wins come from reducing tokens, not from squeezing the last few percent out of infrastructure: prompt compression, retrieval-based context pruning, caching deterministic prefixes, and picking the smallest model that meets the quality bar.
Optimization levers
- Shorten prompts — system prompts balloon quietly; audit them.
- Truncate retrieved context — reranking beats stuffing.
- Cache prefix tokens where providers support it (Anthropic, OpenAI).
- Route by difficulty — small model first, escalate only when needed.
- Self-host, optionally with a quantised model, once volume justifies the fixed cost.
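The "route by difficulty" lever can be sanity-checked with simple arithmetic. The sketch below makes a simplifying assumption: every request is first sent to the small model, and a fraction of requests (the escalation rate) is then re-sent to the large model, so escalated requests pay for both calls. The function names and per-request costs are hypothetical, and answer quality is not modelled.

```python
def cascade_cost(small_cost: float, large_cost: float, escalation_rate: float) -> float:
    """Expected cost per request when the small model runs first and a
    fraction `escalation_rate` of requests is retried on the large model."""
    if not 0.0 <= escalation_rate <= 1.0:
        raise ValueError("escalation_rate must be between 0 and 1")
    return small_cost + escalation_rate * large_cost


def break_even_escalation_rate(small_cost: float, large_cost: float) -> float:
    """Escalation rate at which the cascade costs the same as always
    calling the large model. Below this rate the cascade is cheaper."""
    if large_cost <= 0:
        raise ValueError("large_cost must be positive")
    return 1.0 - small_cost / large_cost


# Illustrative per-request costs in USD (not real prices).
SMALL = 0.001
LARGE = 0.010

expected = cascade_cost(SMALL, LARGE, escalation_rate=0.2)
threshold = break_even_escalation_rate(SMALL, LARGE)
print(f"cascade: ${expected:.4f} vs always-large: ${LARGE:.4f}")
print(f"cascade is cheaper while escalation rate < {threshold:.0%}")
```

Under these assumptions the cascade pays off as long as the small model handles enough traffic on its own. The closer the two models are in price, the lower the escalation rate has to be for routing to be worthwhile.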
Related Terms
Token-based pricing
Billing for LLM API usage by tokens processed — input and output text converted to billable units that scale with every request.
Unit economics
Cost per business unit — per order, per tenant, per active user — so efficiency becomes a trackable engineering outcome.
AI inference cost
The cost to run trained models in production — API calls, GPU compute, and hosted endpoints — distinct from one-off training spend.