AI Inference Data Center Simulator

Understand how GPUs process inference requests and what limits throughput

Configuration
Memory 80 GB
Bandwidth 2.0 TB/s
FP16 FLOPS 312 TFLOPS
Single GPU View (per-GPU memory)
HBM (DRAM) - Per GPU
0 / 80 GB
Memory Bus
0 / 2000 GB/s
L2 Cache / SRAM
~50 MB (fast)
Current layer activations
Tensor Cores
0 / 312 TFLOPS
Output Tokens
Idle Waiting for requests...
Compute
0%
Bandwidth
0%
Memory
0%
Request Timeline
Waiting: 0
Prefilling: 0
Decoding: 0
Completed: 0
Start the simulation to see requests
Data Center View
Single GPU Configuration
Active: 0
Comms: 0/layer
NVLink (900 GB/s)
InfiniBand (400 Gb/s)
PCIe (64 GB/s)
Weights per GPU 14 GB
All-reduce size 0 MB
Comm overhead 0 ms/layer
Effective bandwidth 2.0 TB/s
Metrics
Current Bottleneck
None (idle)
Throughput
0 tok/s
Avg Latency
0 ms
Time to First Token
0 ms
Active Requests
0
Queue Depth
0
Requests/sec
0
How to read this:

The simulation shows how a GPU processes inference requests.

Prefill phase (purple): Processes the input prompt. All prompt tokens are computed in parallel, so this phase is compute-bound.

Decode phase (blue): Generates output tokens one at a time. Every new token requires reading all model weights, so this phase is bandwidth-bound.

Watch the utilization meters to see which resource is the bottleneck!

Understanding AI Inference

Why Prefill is Compute-Bound

During prefill, all input tokens are processed in parallel through massive matrix multiplications. For a 1000-token prompt, you're doing [1000 × hidden] × [hidden × hidden] matrix operations.

Arithmetic Intensity = FLOPs / bytes moved ≈ 1000+ (excellent for GPUs)

The GPU compute units stay busy because you reuse each weight many times across tokens.
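
The intensity estimate can be checked with a few lines of Python. This is a minimal sketch, assuming FP16 weights, a hidden size of 8192 (the value used in the decode walkthrough), and counting only weight bytes as memory traffic. The function name and the "ridge point" variable are illustrative, not part of the simulator.

```python
def weight_reuse_intensity(tokens: int, hidden: int, bytes_per_weight: int = 2) -> float:
    """FLOPs per weight byte for a [tokens x hidden] @ [hidden x hidden] matmul."""
    flops = 2 * tokens * hidden * hidden               # one multiply and one add per weight, per token
    weight_bytes = hidden * hidden * bytes_per_weight  # each weight is read from memory once
    return flops / weight_bytes


PEAK_FLOPS = 312e12      # FP16 FLOPS from the Configuration panel
PEAK_BANDWIDTH = 2.0e12  # bytes/s from the Configuration panel

# Ridge point: the intensity above which compute, not bandwidth, is the limit.
ridge_point = PEAK_FLOPS / PEAK_BANDWIDTH

prefill_intensity = weight_reuse_intensity(tokens=1000, hidden=8192)
print(f"prefill: {prefill_intensity:.0f} FLOPs/byte, ridge point: {ridge_point:.0f} FLOPs/byte")
print("compute-bound" if prefill_intensity > ridge_point else "bandwidth-bound")
```

Counting activation reads and writes as well would lower the figure somewhat, but a 1000-token prompt still lands well above the ridge point.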

Why Decode is Bandwidth-Bound

During decode, you generate one token at a time. You must read ALL model weights (~140GB for 70B) just to produce a single token.

Arithmetic Intensity ≈ 1 FLOP / byte at FP16: 2 FLOPs per weight, 2 bytes per weight (terrible - GPU is starved)

Token generation speed ≈ Memory Bandwidth / (2 bytes × Model Parameters)
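
As a rough check of this rule of thumb, the sketch below evaluates it for two model sizes. It assumes FP16 weights, a single request (batch size 1), perfect bandwidth utilization, and ignores KV cache reads; the function name is illustrative.

```python
def decode_tokens_per_second(params: float, bandwidth_bytes_per_s: float,
                             bytes_per_param: int = 2) -> float:
    """Upper bound on decode speed when every token must stream all weights once."""
    weight_bytes = params * bytes_per_param
    return bandwidth_bytes_per_s / weight_bytes


BANDWIDTH = 2.0e12  # 2.0 TB/s from the Configuration panel

rate_70b = decode_tokens_per_second(70e9, BANDWIDTH)
rate_7b = decode_tokens_per_second(7e9, BANDWIDTH)
print(f"70B: {rate_70b:.1f} tok/s, 7B: {rate_7b:.1f} tok/s")
```

Real systems fall below this bound because bandwidth is never fully utilized and the KV cache adds to the bytes read per token.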

KV Cache Limits Concurrency

Each active request stores Key and Value tensors for all generated tokens. This cache grows with context length and limits how many requests can run concurrently.

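KV Cache = 2 (K and V) × layers × heads × head_dim × seq_len × 2 bytes (FP16)

For a 70B model (80 layers, heads × head_dim = 8192) with 4K context: ~10.7 GB per request with full multi-head attention! Models that use grouped-query attention keep fewer K/V heads and store proportionally less.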
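
The formula is easy to evaluate directly. The sketch below is an illustration under stated assumptions: 80 layers and a hidden size of 8192 split as 64 heads × 128 dimensions, FP16 values, and 4096 tokens. The second call uses a hypothetical grouped-query configuration with 8 K/V heads to show how strongly the result depends on the head count.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int, seq_len: int,
                   bytes_per_value: int = 2) -> int:
    """Bytes of K and V stored for one request at a given context length."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value  # leading 2 = K and V


full_mha = kv_cache_bytes(layers=80, kv_heads=64, head_dim=128, seq_len=4096)
grouped = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=4096)  # hypothetical GQA setup

print(f"multi-head attention:    {full_mha / 1e9:.1f} GB per request")
print(f"grouped-query (8 heads): {grouped / 1e9:.1f} GB per request")
```

Halving any single factor (context length, head count, or bytes per value) halves the per-request footprint.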

Batching Improves Efficiency

By processing multiple decode requests together, you read weights once but compute for N tokens. This increases arithmetic intensity proportionally.

Batch of 32: intensity rises 32×, from ~1 to ~32 FLOPs/byte at FP16

But batching is limited by KV cache memory - you can't batch more requests than fit in memory.
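
Both effects can be sketched together: intensity scales with batch size, while memory caps the batch. The numbers below are assumptions for illustration: an 80 GB GPU, 14 GB of weights (the value shown in the Data Center View), and a hypothetical 2 GB of KV cache per request.

```python
def decode_intensity(batch_size: int, bytes_per_weight: int = 2) -> float:
    """FLOPs per weight byte when one weight read serves batch_size tokens."""
    return 2 * batch_size / bytes_per_weight


def max_batch_size(gpu_memory_bytes: float, weight_bytes: float,
                   kv_bytes_per_request: float) -> int:
    """Largest number of concurrent requests whose KV caches fit next to the weights."""
    free_bytes = gpu_memory_bytes - weight_bytes
    if free_bytes <= 0:
        return 0
    return int(free_bytes // kv_bytes_per_request)


print(decode_intensity(1), decode_intensity(32))         # FP16 weights
print(decode_intensity(1, 1), decode_intensity(32, 1))   # 1-byte (8-bit) weights
fits = max_batch_size(80e9, 14e9, 2e9)
print(f"requests that fit: {fits}")
```

With 2-byte weights the intensity goes from 1 to 32 FLOPs/byte; with 1-byte weights it goes from 2 to 64. Either way the gain is proportional to the batch size, and the batch size is bounded by whatever memory the weights leave free.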

Step-by-Step: What Happens During One Decode Step

1. Read embedding for new token
   Tiny: ~16 KB for one token's embedding vector (8192 values × 2 bytes)
2. For each of 80 layers, read Q/K/V weight matrices
   ~600 MB per layer × 80 = 48 GB just for attention weights
3. Compute Q, K, V for the single new token
   Matrix-vector multiply: [8192] × [8192 × 8192] - relatively few FLOPs
4. Read entire KV cache (all previous K, V)
   For 4K context: ~10.7 GB of cached keys and values with full multi-head attention (2 × 80 layers × 8192 × 4096 tokens × 2 bytes)
5. Compute attention: Q against all cached K's
   One query attending to thousands of keys
6. Append new K, V to cache
   Cache grows by ~32 KB per token per layer (~2.6 MB per token across all 80 layers)
7. Read FFN weights and compute
   ~1.1 GB per layer × 80 = 88 GB for feed-forward weights
8. Output: sample next token from logits
   After reading 140 GB+ of data, we produce ONE token

Total data read per token ≈ 140 GB (weights) + KV cache
At 2 TB/s bandwidth → ~70 ms per token → ~14 tokens/second
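
The walkthrough can be turned into a small back-of-the-envelope calculator. This sketch takes the approximate per-layer weight sizes from the steps above at face value and assumes perfect bandwidth utilization; the KV cache size is left as a parameter, and the 10 GB value in the second call is a hypothetical example.

```python
LAYERS = 80
ATTN_BYTES_PER_LAYER = 600e6   # approximate attention weights per layer (step 2)
FFN_BYTES_PER_LAYER = 1.1e9    # approximate feed-forward weights per layer (step 7)
BANDWIDTH = 2.0e12             # bytes/s


def decode_step_seconds(kv_cache_bytes: float = 0.0) -> float:
    """Time to stream all weights plus the KV cache once, for one output token."""
    weight_bytes = LAYERS * (ATTN_BYTES_PER_LAYER + FFN_BYTES_PER_LAYER)
    return (weight_bytes + kv_cache_bytes) / BANDWIDTH


weights_only = decode_step_seconds()
with_cache = decode_step_seconds(kv_cache_bytes=10e9)  # hypothetical long-context request

print(f"weights only: {weights_only * 1e3:.0f} ms/token ({1 / weights_only:.1f} tok/s)")
print(f"with cache:   {with_cache * 1e3:.0f} ms/token ({1 / with_cache:.1f} tok/s)")
```

The per-layer figures sum to 136 GB rather than a round 140 GB, so the result is slightly faster than the ~70 ms quoted above; the point is the order of magnitude.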

Demand Profile Characteristics

Chatbot

Short prompts (50-300 tokens), medium responses (100-500 tokens).

Decode-dominated → Bandwidth-bound

High request volume, latency-sensitive. Batching critical for cost efficiency. Token pricing favors this workload.

Coding Agent

Medium prompts with code (500-2000 tokens), longer responses (200-1000 tokens).

Mixed → Bandwidth + KV Cache pressure

Multi-turn conversations accumulate context. Tool use causes variable latency. Code context is token-dense.

Reasoning Model

Variable prompts, massive outputs (2000-8000+ "thinking" tokens).

Extreme decode → Severely bandwidth-bound

Chain-of-thought generates thousands of tokens. Most expensive per request. KV cache grows very large.

Summarization

Long prompts (2000-8000 tokens), short outputs (100-300 tokens).

Prefill-dominated → More compute-bound

Processes entire documents. KV cache for input is large. Relatively efficient per output token.
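
To see why the four profiles land where they do, the sketch below estimates what share of a request's time goes to prefill. It is a rough model under stated assumptions: a 70B FP16 model, batch size 1, ideal utilization of the 312 TFLOPS and 2 TB/s from the Configuration panel, and hypothetical token counts picked from within each profile's stated range.

```python
PARAMS = 70e9
BYTES_PER_PARAM = 2
PEAK_FLOPS = 312e12
BANDWIDTH = 2.0e12

# (prompt tokens, output tokens): hypothetical points inside each profile's range
PROFILES = {
    "Chatbot": (300, 500),
    "Coding Agent": (2000, 1000),
    "Reasoning Model": (1000, 8000),
    "Summarization": (8000, 300),
}


def prefill_seconds(prompt_tokens: int) -> float:
    """Compute-bound estimate: about 2 FLOPs per parameter per prompt token."""
    return 2 * PARAMS * prompt_tokens / PEAK_FLOPS


def decode_seconds(output_tokens: int) -> float:
    """Bandwidth-bound estimate: all weights are streamed once per output token."""
    return output_tokens * PARAMS * BYTES_PER_PARAM / BANDWIDTH


def prefill_share(prompt_tokens: int, output_tokens: int) -> float:
    prefill = prefill_seconds(prompt_tokens)
    return prefill / (prefill + decode_seconds(output_tokens))


shares = {name: prefill_share(*tokens) for name, tokens in PROFILES.items()}
for name, share in shares.items():
    print(f"{name:16s} prefill share of request time: {share:.1%}")
```

Even for summarization, unbatched decode takes most of the wall-clock time in this model; the profile is "more compute-bound" only relative to the others.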

Multi-GPU Parallelism Strategies

Tensor Parallelism (TP)

Split model weights across GPUs. Each GPU holds 1/TP of every layer. Required when the model doesn't fit on one GPU.

70B model = 140GB → needs at least TP=2 on 80GB GPUs (70 GB of weights per GPU, leaving ~10 GB for KV cache)

Pro: Enables larger models
Con: Requires all-reduce after each layer (communication overhead)
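
Choosing a TP degree is mostly a memory-fitting exercise. The helper below is a minimal sketch, assuming TP degrees are powers of two and that an optional fraction of each GPU is reserved for KV cache; the function and its `kv_reserve_fraction` parameter are illustrative, not part of the simulator.

```python
def min_tp_degree(weight_bytes: float, gpu_memory_bytes: float,
                  kv_reserve_fraction: float = 0.0) -> int:
    """Smallest power-of-two TP degree whose weight shard fits on one GPU."""
    if not 0.0 <= kv_reserve_fraction < 1.0:
        raise ValueError("kv_reserve_fraction must be in [0, 1)")
    usable_bytes = gpu_memory_bytes * (1.0 - kv_reserve_fraction)
    tp = 1
    while weight_bytes / tp > usable_bytes:
        tp *= 2
    return tp


print(min_tp_degree(140e9, 80e9))        # 70B FP16 on 80 GB GPUs
print(min_tp_degree(140e9, 80e9, 0.3))   # same, keeping 30% of memory for KV cache
print(min_tp_degree(14e9, 80e9))         # 7B FP16 fits on a single GPU
```

Reserving room for KV cache can push the required degree higher than the weights alone suggest.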

Data Parallelism (DP)

Each GPU holds a full copy of the model. Different requests go to different replicas. No inter-GPU communication during inference.

8 GPUs with DP = 8× throughput (independent)

Pro: Linear throughput scaling, no communication
Con: Model must fit on each GPU
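
Combining the two strategies, the number of independent replicas is the GPU count divided by the TP degree. The sketch below is a simplification: it uses a hypothetical 100 tok/s per replica and ignores both the communication overhead and any per-replica speedup that tensor parallelism introduces.

```python
def cluster_tokens_per_second(num_gpus: int, tp_degree: int,
                              tokens_per_second_per_replica: float) -> float:
    """Aggregate throughput when each group of tp_degree GPUs serves one model replica."""
    replicas = num_gpus // tp_degree
    return replicas * tokens_per_second_per_replica


pure_dp = cluster_tokens_per_second(num_gpus=8, tp_degree=1, tokens_per_second_per_replica=100.0)
tp2_dp4 = cluster_tokens_per_second(num_gpus=8, tp_degree=2, tokens_per_second_per_replica=100.0)
print(pure_dp, tp2_dp4)
```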

NVLink vs InfiniBand

NVLink: GPU-to-GPU within a server. 600-900 GB/s, ~1μs latency. Fast enough for TP within a node.

InfiniBand: server-to-server across nodes. 400 Gb/s, which is 50 GB/s.

TP within node: NVLink (900 GB/s)
TP across nodes: InfiniBand (50 GB/s) - 18× slower!

Cross-node TP is usually avoided because InfiniBand is too slow for the frequent all-reduce operations.
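
The bandwidth gap translates directly into time spent on communication. This is a bandwidth-only sketch: it assumes a 256 KB payload and 160 transfers per forward pass (the figures from the All-Reduce section), and ignores link latency and the extra traffic a real all-reduce algorithm generates.

```python
LINK_BYTES_PER_S = {
    "NVLink": 900e9,      # 900 GB/s within a node
    "InfiniBand": 50e9,   # 400 Gb/s = 50 GB/s across nodes
}


def transfer_microseconds(payload_bytes: int, link: str) -> float:
    """Time to push payload_bytes over the link at its peak bandwidth."""
    return payload_bytes / LINK_BYTES_PER_S[link] * 1e6


PAYLOAD_BYTES = 256 * 1024
TRANSFERS_PER_PASS = 160

per_pass_us = {
    link: TRANSFERS_PER_PASS * transfer_microseconds(PAYLOAD_BYTES, link)
    for link in LINK_BYTES_PER_S
}
for link, micros in per_pass_us.items():
    print(f"{link:10s} {micros:7.1f} µs of transfer time per forward pass")
```

In practice, per-message latency adds to both figures, and for messages this small it can matter as much as bandwidth.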

All-Reduce Operation

After each layer's computation, all GPUs must synchronize their partial results. This is an "all-reduce" - every GPU sends and receives.

All-reduce size ≈ hidden_dim × batch_size × 2 bytes (one FP16 activation vector per request during decode)

For 70B model with batch 16: ~256KB per all-reduce. With 80 layers × 2 per layer = 160 all-reduces per forward pass!
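
The ~256 KB figure corresponds to batch_size × hidden_dim × 2 bytes, which the sketch below reproduces along with the per-pass totals. It assumes a hidden size of 8192, FP16 activations, one new token per request, and two all-reduces per layer; the function names are illustrative.

```python
def all_reduce_payload_bytes(batch_size: int, hidden_dim: int, bytes_per_value: int = 2) -> int:
    """Size of the activation tensor each GPU contributes to one all-reduce."""
    return batch_size * hidden_dim * bytes_per_value


def all_reduces_per_forward_pass(layers: int, per_layer: int = 2) -> int:
    """One all-reduce after attention and one after the FFN in every layer."""
    return layers * per_layer


payload = all_reduce_payload_bytes(batch_size=16, hidden_dim=8192)
count = all_reduces_per_forward_pass(layers=80)
total = payload * count

print(f"{payload // 1024} KB per all-reduce, {count} per pass, {total / 2**20:.0f} MB per pass")
```

The individual messages are small, so the cost is dominated by how often they occur rather than by their size.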

How Inference is Priced

Most API providers charge differently for input vs output tokens:

Token Type     | Typical Price Ratio  | Why?
Input tokens   | 1x (baseline)        | Processed in parallel, compute-efficient
Output tokens  | 3-5x more expensive  | Sequential decode, bandwidth-bound, less efficient

This pricing reflects the actual computational cost difference between prefill (parallel, efficient) and decode (sequential, bandwidth-starved). Reasoning models with long chain-of-thought are particularly expensive because they generate many output tokens.
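
A short calculation shows how the price ratio plays out per request. The prices here are hypothetical: 1.0 currency unit per million input tokens and a 4x output multiplier (the midpoint of the 3-5x range above). The token counts are illustrative picks from the demand profiles.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_million: float, output_multiplier: float = 4.0) -> float:
    """Cost of one request when output tokens are priced at a multiple of input tokens."""
    input_cost = input_tokens * input_price_per_million / 1e6
    output_cost = output_tokens * input_price_per_million * output_multiplier / 1e6
    return input_cost + output_cost


reasoning = request_cost(input_tokens=1000, output_tokens=8000, input_price_per_million=1.0)
summarization = request_cost(input_tokens=8000, output_tokens=300, input_price_per_million=1.0)

print(f"reasoning request:     {reasoning:.4f}")
print(f"summarization request: {summarization:.4f}")
```

Under these assumptions the reasoning request costs several times more than the summarization request, even though it has far fewer input tokens.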