AI Inference Data Center Simulator

Configuration

GPU Preset

Memory 80 GB

Bandwidth 2.0 TB/s

FP16 FLOPS 312 TFLOPS

Model Size

Preset Configuration

Data Center Scale

Tensor Parallelism (TP)

Data Parallelism (Replicas)

Interconnect

Demand Profile

Batching Strategy

Max Batch Size

Request Arrival Rate

Single GPU View (per-GPU memory)

HBM (DRAM) - Per GPU

0 / 80 GB

Weights 0 GB

KV Cache 0 GB

Free

Memory Bus

0 / 2000 GB/s

↓

L2 Cache / SRAM

~50 MB (fast)

Current layer activations

Tensor Cores

0 / 312 TFLOPS

Output Tokens

Idle Waiting for requests...

Compute

Bandwidth

Memory

Request Timeline

Waiting: 0

Prefilling: 0

Decoding: 0

Completed: 0

Data Center View

Single GPU Configuration

Active: 0

Comms: 0/layer

NVLink (900 GB/s)

InfiniBand (400 Gb/s)

PCIe (64 GB/s)

Weights per GPU 14 GB

All-reduce size 0 MB

Comm overhead 0 ms/layer

Effective bandwidth 2.0 TB/s

Metrics

Current Bottleneck

None (idle)

Throughput

0 tok/s

Avg Latency

0 ms

Time to First Token

0 ms

Active Requests

Queue Depth

Requests/sec

How to read this:

The simulation shows how a GPU processes inference requests.

Prefill phase (purple): Processing the input prompt. All tokens are computed in parallel. Compute-bound.

Decode phase (blue): Generating output tokens one at a time. All weights must be read for each token. Bandwidth-bound.

Watch the utilization meters to see which resource is the bottleneck!

Understanding AI Inference

Why Prefill is Compute-Bound

During prefill, all input tokens are processed in parallel through massive matrix multiplications. For a 1000-token prompt, you're doing [1000 × hidden] × [hidden × hidden] matrix operations.

Arithmetic Intensity = FLOPs / Bytes ≈ 1000+ (excellent for GPUs)

The GPU compute units stay busy because you reuse each weight many times across tokens.
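A minimal sketch of where the ~1000 figure comes from, assuming FP16 weights (2 bytes each), hidden = 8192, and counting only weight traffic (activation reads and writes are ignored). The function name and the ridge-point comparison are illustrative, not taken from the simulator.

```python
def weight_matmul_intensity(tokens: int, hidden: int, bytes_per_weight: int = 2) -> float:
    """FLOPs per byte of weight traffic for [tokens x hidden] @ [hidden x hidden]."""
    flops = 2 * tokens * hidden * hidden               # multiply + add per weight, per token
    weight_bytes = hidden * hidden * bytes_per_weight  # each weight is read once
    return flops / weight_bytes


# Ridge point of the configured GPU: peak FLOP/s divided by peak bytes/s.
RIDGE_FLOPS_PER_BYTE = 312e12 / 2.0e12  # 156 FLOPs/byte

prefill_intensity = weight_matmul_intensity(tokens=1000, hidden=8192)
print(prefill_intensity)                         # 1000.0
print(prefill_intensity > RIDGE_FLOPS_PER_BYTE)  # True -> compute-bound
```

Any intensity above the ridge point means the tensor cores, not the memory bus, set the pace.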

Why Decode is Bandwidth-Bound

During decode, you generate one token at a time. You must read ALL model weights (~140GB for 70B at FP16) just to produce a single token.

Arithmetic Intensity ≈ 2 FLOPs per weight read, i.e. ~1 FLOP/byte at FP16 (terrible - GPU is starved)

Token generation speed ≈ Memory Bandwidth / (2 bytes × Model Parameters)
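As a rough check of the token-speed formula, here is a small sketch assuming FP16 weights, batch size 1, and that weight reads are the only memory traffic (KV cache reads are ignored). The function name is hypothetical.

```python
def decode_tokens_per_second(bandwidth_bytes_per_s: float, params: float,
                             bytes_per_param: int = 2) -> float:
    """Upper bound on batch-1 decode speed when every weight is read once per token."""
    return bandwidth_bytes_per_s / (bytes_per_param * params)


rate_70b = decode_tokens_per_second(2.0e12, 70e9)
print(f"{rate_70b:.1f} tok/s, {1000 / rate_70b:.0f} ms per token")  # 14.3 tok/s, 70 ms per token
```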

KV Cache Limits Concurrency

Each active request stores Key and Value tensors for all generated tokens. This cache grows with context length and limits how many requests can run concurrently.

KV Cache = 2 × layers × heads × head_dim × seq_len × 2 bytes

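For a 70B model (80 layers, 64 heads × 128 head_dim) with 4K context: ~10.7 GB per request (~5.4 GB each for keys and values) with full multi-head attention!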
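The formula can be evaluated directly. This sketch assumes FP16 values and 70B-class dimensions of 80 layers and 64 heads of dimension 128 (hidden size 8192). Evaluated as written, the formula gives about 10.7 GB at 4K context; about 5.4 GB is the size of the keys or the values alone. The 8-KV-head case is an added illustration of grouped-query attention, not something the simulator is known to model.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int, seq_len: int,
                   bytes_per_value: int = 2) -> int:
    """KV cache size for one request: K and V tensors for every layer and position."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value


full_mha = kv_cache_bytes(layers=80, kv_heads=64, head_dim=128, seq_len=4096)
grouped = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=4096)  # assumed GQA variant

print(f"{full_mha / 1e9:.1f} GB")  # 10.7 GB
print(f"{grouped / 1e9:.1f} GB")   # 1.3 GB
```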

Batching Improves Efficiency

By processing multiple decode requests together, you read weights once but compute for N tokens. This increases arithmetic intensity proportionally.

Batch of 32: Intensity goes from 2 to 64 FLOPs per weight read (about 1 to 32 FLOPs/byte at FP16)

But batching is limited by KV cache memory - you can't batch more requests than fit in memory.
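A minimal sketch of both effects: intensity scaling with batch size, and the memory cap on batch size. The 14 GB weight figure matches the "Weights per GPU" readout above; the 2 GB of KV cache per request is a hypothetical value chosen only for illustration.

```python
GB = 10**9


def decode_flops_per_weight_read(batch_size: int) -> int:
    """Each weight read serves one multiply-add (2 FLOPs) per request in the batch."""
    return 2 * batch_size


def max_batch_by_memory(gpu_mem_bytes: int, weight_bytes: int,
                        kv_bytes_per_request: int) -> int:
    """Largest number of requests whose KV caches fit beside the weights."""
    free = gpu_mem_bytes - weight_bytes
    if free <= 0:
        return 0
    return free // kv_bytes_per_request


print(decode_flops_per_weight_read(1), decode_flops_per_weight_read(32))  # 2 64
print(max_batch_by_memory(80 * GB, 14 * GB, 2 * GB))                      # 33
```

Whichever limit is smaller, the configured Max Batch Size or the memory-derived one, bounds how far batching can raise intensity.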

Step-by-Step: What Happens During One Decode Step

Read embedding for new token
Tiny: ~8KB for one token's embedding vector

For each of 80 layers, read Q/K/V weight matrices
~600MB per layer × 80 = 48GB just for attention weights

Compute Q, K, V for the single new token
Matrix-vector multiply: [8192] × [8192×8192] - relatively few FLOPs

Read entire KV cache (all previous K, V)
For 4K context: ~10.7GB of cached keys and values (~5.4GB each) with full multi-head attention

Compute attention: Q against all cached K's
One query attending to thousands of keys

Append new K, V to cache
Cache grows by ~32KB per token per layer (~2.6MB per token across all 80 layers)

Read FFN weights and compute
~1.1GB per layer × 80 = 88GB for feed-forward weights

Output: sample next token from logits
After reading 140GB+ of data, we produce ONE token

Total data read per token ≈ 140GB (weights) + KV cache
At 2 TB/s bandwidth → ~70ms per token → ~14 tokens/second
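The per-step accounting above can be added up in a few lines. This is a sketch using the rounded per-layer figures from the steps (600 MB attention, 1.1 GB feed-forward), which sum to 136 GB rather than exactly 140 GB; the 10 GB KV cache in the second call is an assumed value.

```python
MB = 10**6
LAYERS = 80
ATTN_WEIGHTS_PER_LAYER = 600 * MB   # rounded figure from the steps above
FFN_WEIGHTS_PER_LAYER = 1100 * MB   # rounded figure from the steps above
BANDWIDTH = 2 * 10**12              # 2 TB/s


def decode_step_bytes(kv_cache_bytes: int = 0) -> int:
    """Bytes read from HBM to produce one token: all weights plus the KV cache."""
    return LAYERS * (ATTN_WEIGHTS_PER_LAYER + FFN_WEIGHTS_PER_LAYER) + kv_cache_bytes


def decode_step_ms(kv_cache_bytes: int = 0) -> float:
    """Bandwidth-only time for one decode step, in milliseconds."""
    return decode_step_bytes(kv_cache_bytes) / BANDWIDTH * 1000


print(f"{decode_step_bytes() / 1e9:.0f} GB")   # 136 GB
print(f"{decode_step_ms():.0f} ms")            # 68 ms
print(f"{decode_step_ms(10 * 10**9):.0f} ms")  # 73 ms with an assumed 10 GB KV cache
```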

Demand Profile Characteristics

Chatbot

Short prompts (50-300 tokens), medium responses (100-500 tokens).

Decode-dominated → Bandwidth-bound

High request volume, latency-sensitive. Batching critical for cost efficiency. Token pricing favors this workload.

Coding Agent

Medium prompts with code (500-2000 tokens), longer responses (200-1000 tokens).

Mixed → Bandwidth + KV Cache pressure

Multi-turn conversations accumulate context. Tool use causes variable latency. Code context is token-dense.

Reasoning Model

Variable prompts, massive outputs (2000-8000+ "thinking" tokens).

Extreme decode → Severely bandwidth-bound

Chain-of-thought generates thousands of tokens. Most expensive per request. KV cache grows very large.

Summarization

Long prompts (2000-8000 tokens), short outputs (100-300 tokens).

Prefill-dominated → More compute-bound

Processes entire documents. KV cache for input is large. Relatively efficient per output token.
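To see why the profiles land where they do, the sketch below estimates how much of each request's GPU time goes to prefill. It assumes a 70B FP16 model on the configured GPU at peak rates, midpoint token counts from the ranges above, a hypothetical 1000-token prompt for the reasoning profile, and a decode batch of 32 sharing each weight read. Attention FLOPs and KV cache reads are ignored.

```python
PARAMS = 70e9
PEAK_FLOPS = 312e12
BANDWIDTH = 2.0e12
BYTES_PER_PARAM = 2
DECODE_BATCH = 32  # assumption: decode steps are shared by 32 requests

# (prompt tokens, output tokens): midpoints of the ranges in the text
PROFILES = {
    "chatbot": (175, 300),
    "coding agent": (1250, 600),
    "reasoning": (1000, 5000),   # prompt length is a placeholder
    "summarization": (5000, 200),
}


def prefill_seconds(prompt_tokens: int) -> float:
    """Compute-bound estimate: 2 FLOPs per parameter per prompt token."""
    return 2 * PARAMS * prompt_tokens / PEAK_FLOPS


def decode_seconds(output_tokens: int) -> float:
    """Bandwidth-bound estimate: one full weight read per step, split across the batch."""
    step = BYTES_PER_PARAM * PARAMS / BANDWIDTH
    return output_tokens * step / DECODE_BATCH


def prefill_share(name: str) -> float:
    prompt, output = PROFILES[name]
    p, d = prefill_seconds(prompt), decode_seconds(output)
    return p / (p + d)


for name in PROFILES:
    print(f"{name:14s} prefill share: {prefill_share(name):.0%}")
```

Under these assumptions summarization spends most of its time in prefill, while the chatbot and reasoning profiles are dominated by decode.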

Multi-GPU Parallelism Strategies

Tensor Parallelism (TP)

Split model weights across GPUs. Each GPU holds 1/TP of every layer. Required when model doesn't fit on one GPU.

70B model = 140GB → needs TP=2 on 80GB GPUs

Pro: Enables larger models
Con: Requires all-reduce after each layer (communication overhead)
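A small sketch of picking the smallest TP degree that fits. It assumes TP degrees are powers of two (a common convention, not a requirement stated here) and lets you reserve memory for KV cache; the helper name and the reserve parameter are illustrative.

```python
def min_tensor_parallel(weight_gb: float, gpu_mem_gb: float,
                        kv_reserve_gb: float = 0.0) -> int:
    """Smallest power-of-two TP degree so each GPU's weight shard plus reserve fits."""
    if kv_reserve_gb >= gpu_mem_gb:
        raise ValueError("reserve leaves no room for weights")
    tp = 1
    while weight_gb / tp + kv_reserve_gb > gpu_mem_gb:
        tp *= 2
    return tp


print(min_tensor_parallel(140, 80))                    # 2
print(min_tensor_parallel(14, 80))                     # 1
print(min_tensor_parallel(140, 80, kv_reserve_gb=20))  # 4
```

With TP=2 a 70B model leaves only about 10 GB per GPU for KV cache, which is why a larger TP degree can be worthwhile even when the weights technically fit.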

Data Parallelism (DP)

Each GPU has full model copy. Different requests go to different replicas. No inter-GPU communication during inference.

8 GPUs with DP = 8× throughput (independent)

Pro: Linear throughput scaling, no communication
Con: Model must fit on each GPU

NVLink vs InfiniBand

NVLink: GPU-to-GPU within server. 600-900 GB/s, ~1μs latency. Fast enough for TP within a node.

TP within node: NVLink (900 GB/s)
TP across nodes: InfiniBand (50 GB/s) - 18× slower!

Cross-node TP is usually avoided because InfiniBand is too slow for the frequent all-reduce operations.
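The 18× gap follows from the link rates once InfiniBand's 400 Gb/s is converted to bytes. A minimal sketch, counting bandwidth only (per-message latency is ignored) and using a hypothetical 1 MB message:

```python
def gbit_to_gbyte(gbit_per_s: float) -> float:
    """Convert a link rate from gigabits to gigabytes per second."""
    return gbit_per_s / 8


NVLINK_GB_PER_S = 900.0
INFINIBAND_GB_PER_S = gbit_to_gbyte(400.0)  # 50.0


def transfer_us(message_bytes: float, link_gb_per_s: float) -> float:
    """Bandwidth-only transfer time in microseconds."""
    return message_bytes / (link_gb_per_s * 1e9) * 1e6


print(NVLINK_GB_PER_S / INFINIBAND_GB_PER_S)                # 18.0
print(f"{transfer_us(1e6, NVLINK_GB_PER_S):.2f} us")        # 1.11 us
print(f"{transfer_us(1e6, INFINIBAND_GB_PER_S):.2f} us")    # 20.00 us
```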

All-Reduce Operation

After each layer's computation, all GPUs must synchronize their partial results. This is an "all-reduce" - every GPU sends and receives.

All-reduce size ≈ 2 × hidden_dim × batch_size × 2 bytes

For a 70B model (hidden_dim 8192) with batch 16: a 256KB activation payload, ~512KB by the formula above. With 80 layers × 2 per layer = 160 all-reduces per forward pass!
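A sketch that evaluates the all-reduce formula as written, assuming FP16 activations and hidden_dim = 8192. With the leading factor of 2 the result is 512 KiB for batch 16; the activation payload alone, without that factor, is 256 KiB. Function names are illustrative.

```python
def all_reduce_bytes(hidden_dim: int, batch_size: int, bytes_per_value: int = 2) -> int:
    """All-reduce size per the formula: 2 x hidden_dim x batch_size x bytes per value."""
    return 2 * hidden_dim * batch_size * bytes_per_value


def all_reduces_per_forward(layers: int, per_layer: int = 2) -> int:
    """Number of all-reduce operations in one forward pass."""
    return layers * per_layer


size = all_reduce_bytes(hidden_dim=8192, batch_size=16)
count = all_reduces_per_forward(layers=80)
total = size * count

print(size // 1024, "KiB per all-reduce")       # 512 KiB per all-reduce
print(count, "all-reduces per forward pass")    # 160 all-reduces per forward pass
print(f"{total / 1e6:.1f} MB per forward pass") # 83.9 MB per forward pass
```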

How Inference is Priced

Most API providers charge differently for input vs output tokens:

Token Type	Typical Price Ratio	Why?
Input tokens	1x (baseline)	Processed in parallel, compute-efficient
Output tokens	3-5x more expensive	Sequential decode, bandwidth-bound, less efficient

This pricing reflects the actual computational cost difference between prefill (parallel, efficient) and decode (sequential, bandwidth-starved). Reasoning models with long chain-of-thought are particularly expensive because they generate many output tokens.
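To compare workloads under this pricing, the sketch below expresses cost in units of one input token. The 4× output ratio is an assumed midpoint of the 3-5x range in the table, and the token counts are illustrative midpoints rather than measured data.

```python
def relative_cost(input_tokens: int, output_tokens: int,
                  output_price_ratio: float = 4.0) -> float:
    """Request cost in units of one input token's price."""
    return input_tokens + output_tokens * output_price_ratio


print(relative_cost(175, 300))    # chatbot-like request: 1375.0
print(relative_cost(5000, 200))   # summarization-like request: 5800.0
print(relative_cost(1000, 5000))  # reasoning-like request: 21000.0
```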