Understand how GPUs process inference requests and what limits throughput
Configuration
Memory: 80 GB
Bandwidth: 2.0 TB/s
FP16 FLOPS: 312 TFLOPS
Single GPU View (per-GPU memory)
[Interactive simulation: HBM (DRAM, 80 GB per GPU, split into weights, KV cache, and free space) feeds the L2 cache / SRAM (~50 MB, fast, holds the current layer's activations) over the memory bus (up to 2000 GB/s), which feeds the tensor cores (up to 312 TFLOPS) that produce output tokens. Utilization meters track compute, bandwidth, and memory. A request timeline counts waiting, prefilling, decoding, and completed requests.]
Data Center View
[Interactive simulation: multi-GPU view showing all-reduce synchronization of activations across GPUs. Interconnect options: NVLink (900 GB/s), InfiniBand (400 Gb/s), PCIe (64 GB/s). Reported values: weights per GPU (14 GB in the default single-GPU configuration), all-reduce size, communication overhead per layer, and effective bandwidth.]
Metrics
[Metrics panel: current bottleneck, throughput (tok/s), average latency (ms), time to first token (ms), active requests, queue depth, and requests/sec.]
How to read this:
The simulation shows how a GPU processes inference requests.
Prefill phase (purple): Processing the input prompt. All tokens are computed in parallel. Compute-bound.
Decode phase (blue): Generating output tokens one at a time. All weights must be read for each token. Bandwidth-bound.
Watch the utilization meters to see which resource is the bottleneck!
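The two bounds can be estimated with back-of-the-envelope arithmetic. The sketch below uses the hardware numbers from the Configuration panel and the 14 GB weights-per-GPU figure from the simulator's default state. The FP16 weight format and the common approximation of 2 FLOPs per parameter per token are assumptions, and the function names are illustrative, not part of the simulator.

```python
# Hardware numbers from the Configuration panel.
PEAK_FLOPS = 312e12    # FP16 tensor-core peak, FLOP/s
BANDWIDTH = 2.0e12     # HBM bandwidth, bytes/s
WEIGHT_BYTES = 14e9    # weights per GPU in the default configuration

# Assumptions (not stated on the page): FP16 weights (2 bytes each)
# and roughly 2 FLOPs per parameter per token processed.
BYTES_PER_PARAM = 2
PARAMS = WEIGHT_BYTES / BYTES_PER_PARAM
FLOPS_PER_TOKEN = 2 * PARAMS


def step_bounds(tokens_in_flight: int) -> tuple[float, float]:
    """Return (memory_time, compute_time) in seconds for one forward pass.

    memory_time: time to stream every weight from HBM once.
    compute_time: time to do the matrix math for all tokens in the pass.
    KV cache reads and attention FLOPs are ignored to keep the sketch simple.
    """
    memory_time = WEIGHT_BYTES / BANDWIDTH
    compute_time = FLOPS_PER_TOKEN * tokens_in_flight / PEAK_FLOPS
    return memory_time, compute_time


def bottleneck(tokens_in_flight: int) -> str:
    """Name the slower of the two resources for this pass."""
    memory_time, compute_time = step_bounds(tokens_in_flight)
    return "compute" if compute_time > memory_time else "bandwidth"


if __name__ == "__main__":
    cases = [("decode, batch 1", 1), ("prefill, 1000-token prompt", 1000)]
    for label, tokens in cases:
        mem, comp = step_bounds(tokens)
        print(f"{label}: memory {mem * 1e3:.2f} ms, "
              f"compute {comp * 1e3:.2f} ms -> {bottleneck(tokens)}-bound")
    # Upper bound on single-request decode speed: one weight sweep per token.
    print(f"max decode rate at batch 1: {1 / step_bounds(1)[0]:.0f} tok/s")
```

Under these assumptions a single decode step is limited by the 7 ms it takes to read the weights, while a 1000-token prefill spends several times longer on arithmetic than on memory traffic.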
Understanding AI Inference
Why Prefill is Compute-Bound
During prefill, all input tokens are processed in parallel through massive matrix multiplications.
For a 1000-token prompt, each layer performs [1000 x hidden] x [hidden x hidden] matrix multiplications, so every weight loaded from memory is reused across all 1000 tokens.
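One way to make this concrete is arithmetic intensity: FLOPs performed per byte moved. The sketch below compares a single [tokens x hidden] x [hidden x hidden] multiplication against the GPU's ridge point (peak FLOPS divided by bandwidth, using the Configuration numbers). The hidden size of 4096 and FP16 storage are assumptions chosen for illustration.

```python
PEAK_FLOPS = 312e12   # from the Configuration panel
BANDWIDTH = 2.0e12    # bytes/s, from the Configuration panel

# FLOPs per byte the GPU can sustain before compute becomes the limit.
RIDGE = PEAK_FLOPS / BANDWIDTH


def matmul_intensity(tokens: int, hidden: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for [tokens x hidden] @ [hidden x hidden].

    Counts one read of the input, one read of the weight matrix,
    and one write of the output, all at bytes_per_elem (FP16 assumed).
    """
    flops = 2 * tokens * hidden * hidden
    bytes_moved = bytes_per_elem * (
        tokens * hidden      # input activations
        + hidden * hidden    # weight matrix
        + tokens * hidden    # output activations
    )
    return flops / bytes_moved


if __name__ == "__main__":
    hidden = 4096  # assumed hidden size
    for tokens in (1, 1000):
        ai = matmul_intensity(tokens, hidden)
        side = "compute-bound" if ai > RIDGE else "bandwidth-bound"
        print(f"{tokens:>4} tokens: {ai:7.1f} FLOPs/byte "
              f"(ridge {RIDGE:.0f}) -> {side}")
```

With one token the multiplication does about one FLOP per byte, far below the ridge point. With 1000 tokens the same weights are reused enough to push it well above.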
The KV cache is the other major consumer of GPU memory: each active request stores Key and Value tensors for every token in its context, both prompt and generated.
This cache grows linearly with context length and limits how many requests can run concurrently.
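A rough sizing calculation shows how quickly the cache fills memory. The model shape below (32 layers, 32 KV heads, head dimension 128, FP16) is a hypothetical example, not something the page specifies. The 80 GB capacity and 14 GB of weights come from the simulator's default configuration.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache added per token: one K and one V vector per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def max_concurrent_requests(free_bytes: float, context_tokens: int,
                            per_token_bytes: int) -> int:
    """How many full-length requests fit in the memory left after weights."""
    return int(free_bytes // (context_tokens * per_token_bytes))


if __name__ == "__main__":
    # Hypothetical model shape, assumed for illustration only.
    per_token = kv_bytes_per_token(layers=32, kv_heads=32, head_dim=128)
    free = 80e9 - 14e9  # HBM capacity minus weights (simulator defaults)
    for context in (1024, 4096, 16384):
        n = max_concurrent_requests(free, context, per_token)
        print(f"context {context:>5}: "
              f"{per_token * context / 1e9:5.2f} GB per request, "
              f"up to {n} concurrent requests")
```

Quadrupling the context length cuts the number of requests that fit by roughly four, which is why long-context workloads run at smaller batch sizes.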
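Splitting a model across GPUs adds communication: for a 70B model at batch size 16, each all-reduce during decode moves about 256 KB (16 tokens x 8192 hidden x 2 bytes, assuming a hidden size of 8192 and FP16 activations).
With 80 layers × 2 all-reduces per layer, that is 160 all-reduces per forward pass.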
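The figures above can be reproduced with a few lines of arithmetic. This sketch assumes a hidden size of 8192 and FP16 activations for the 70B model, and uses the interconnect speeds listed in the Data Center View. The transfer times are bandwidth-only lower bounds: they ignore per-message latency and the extra traffic of a real all-reduce algorithm, both of which matter for messages this small.

```python
# Link speeds from the Data Center View, converted to bytes per second.
LINK_BYTES_PER_S = {
    "NVLink": 900e9,
    "InfiniBand": 400e9 / 8,   # 400 Gb/s is 50 GB/s
    "PCIe": 64e9,
}


def all_reduce_bytes(batch_tokens: int, hidden: int,
                     bytes_per_elem: int = 2) -> int:
    """Size of the activation tensor synchronized in one all-reduce."""
    return batch_tokens * hidden * bytes_per_elem


def all_reduces_per_pass(layers: int, per_layer: int = 2) -> int:
    """Number of all-reduces in one forward pass."""
    return layers * per_layer


if __name__ == "__main__":
    size = all_reduce_bytes(batch_tokens=16, hidden=8192)  # assumed shape
    count = all_reduces_per_pass(layers=80)
    total = size * count
    print(f"{size / 1024:.0f} KB per all-reduce, {count} per pass, "
          f"{total / 2**20:.0f} MB total")
    for name, bw in LINK_BYTES_PER_S.items():
        # Bandwidth-only estimate; real latency per message is not modeled.
        print(f"{name:>10}: at least {total / bw * 1e6:.1f} us per forward pass")
```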
How Inference is Priced
Most API providers charge differently for input vs output tokens:
Token Type | Typical Price Ratio | Why?
Input tokens | 1x (baseline) | Processed in parallel, compute-efficient
Output tokens | 3-5x more expensive | Sequential decode, bandwidth-bound, less efficient
This pricing reflects the actual computational cost difference between prefill (parallel, efficient) and decode (sequential, bandwidth-starved).
Reasoning models with long chain-of-thought are particularly expensive because they generate many output tokens.
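To see the effect on a bill, the sketch below expresses request cost in relative units, where one input token costs 1 unit. The output ratio of 4 is an assumed value inside the 3-5x range quoted above, and the token counts are made-up examples, not measurements.

```python
def relative_cost(input_tokens: int, output_tokens: int,
                  output_ratio: float = 4.0) -> float:
    """Cost in units where one input token costs 1.

    output_ratio is an assumption within the typical 3-5x range.
    """
    return input_tokens + output_ratio * output_tokens


if __name__ == "__main__":
    # Hypothetical workloads sharing the same 1000-token prompt.
    short_answer = relative_cost(input_tokens=1000, output_tokens=200)
    long_reasoning = relative_cost(input_tokens=1000, output_tokens=5000)
    print(f"short answer:   {short_answer:.0f} units")
    print(f"long reasoning: {long_reasoning:.0f} units "
          f"({long_reasoning / short_answer:.1f}x)")
```

With the same prompt, the long chain-of-thought response costs more than ten times as much, almost entirely from output tokens.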