ai-newspaper.

Where AI capital meets product breakthroughs.

Verify LPU vs GPU inference speeds for real-time LLMs

Groq's marketing copy promises lightning-fast inference — hundreds of tokens per second, sub-100ms latency, the end of the spinning wheel for LLM-powered applications.

The problem isn't that anyone lacks opinions on LPU versus GPU. The problem is that most people are measuring the wrong thing. TFLOPS — the number vendors love to slap on spec sheets — tells you almost nothing about real-time inference responsiveness. If you're building a chatbot, an agent pipeline, or any product where a human is staring at a screen waiting for a response, the metrics you actually care about are time to first token and tokens per second. Full stop. And the architectural choices behind those numbers are fundamentally different between LPUs and GPUs.

Architectural Divergence: SRAM-Based LPUs vs. HBM-Based GPUs

Let's get the hardware straight, because this is where the marketing fog rolls in thick.

A GPU — think NVIDIA H100 or A100 — is built around massive parallelism. Thousands of cores, high-bandwidth memory (HBM3 delivering roughly 3.35 TB/s on the H100), and a software stack designed to squeeze throughput out of large matrix operations. For training, this architecture is unmatched. You're feeding enormous batches of data through billions of parameters, and parallel throughput is everything.

An LPU — Groq's Language Processing Unit being the most visible example — takes the opposite bet. Instead of spreading work across thousands of cores with shared memory, it uses a deterministic, single-core architecture backed by SRAM. SRAM is faster per access than HBM but has far less capacity. The trade-off is deliberate: you give up the ability to hold an entire massive model in local memory and instead optimize for predictable, low-latency token generation through an on-chip interconnect and a custom software compiler.

The memory hierarchy is worth dwelling on because it drives everything downstream. SRAM sits directly on the compute die: there's no off-chip memory bus, no HBM stack, no serialization overhead between the processor and the data. Access latencies drop from tens to hundreds of nanoseconds for HBM to low single-digit nanoseconds for SRAM. The catch is capacity. An H100 packs 80GB of HBM3, which is enough to hold a 70-billion-parameter model at 8-bit precision or below, but not in FP16: 70 billion parameters at 2 bytes each is roughly 140GB of weights, so FP16 needs at least two cards before you count the KV cache. Groq's LPU chips, by contrast, have roughly 230MB of SRAM per chip, enough to hold only a small slice of a model's weights, so anything beyond a very small model has to be partitioned aggressively across many chips. That inter-chip communication, handled by Groq's custom interconnect fabric, becomes the architectural spine of the entire system.
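
The capacity arithmetic is easy to check by hand. Here is a minimal sketch that counts weights only (no KV cache, activations, or runtime overhead) and uses decimal gigabytes. The function names are illustrative and not taken from any vendor tool, and the 70-billion parameter count is rounded.

```python
import math

GB = 10**9
MB = 10**6


def weight_bytes(n_params: int, bytes_per_param: float) -> float:
    """Raw weight footprint: parameter count times bytes per parameter."""
    return n_params * bytes_per_param


def devices_needed(n_params: int, bytes_per_param: float, device_bytes: int) -> int:
    """Minimum number of devices whose combined memory holds the weights alone."""
    return math.ceil(weight_bytes(n_params, bytes_per_param) / device_bytes)


LLAMA_70B = 70 * 10**9  # rounded parameter count (assumption)

print(weight_bytes(LLAMA_70B, 2) / GB)         # 140.0 (GB of FP16 weights)
print(devices_needed(LLAMA_70B, 2, 80 * GB))   # 2   (80GB devices, FP16)
print(devices_needed(LLAMA_70B, 1, 80 * GB))   # 1   (80GB device, 8-bit)
print(devices_needed(LLAMA_70B, 2, 230 * MB))  # 609 (230MB chips, FP16)
```

These counts are lower bounds. A real deployment needs headroom for the KV cache and activations, so treat them as a sanity check rather than sizing guidance.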

What this means in practice:

  • GPUs excel at high-batch-size inference (processing many requests simultaneously) and training workloads. They amortize memory latency across parallelism.
  • LPUs excel at single-stream or low-concurrency real-time inference, where the user in front of the screen needs one response as fast as possible.

Neither architecture is universally superior. That's the boring, honest answer nobody's spec sheet will give you.

Defining Real-Time Metrics: Why TTFT and TPS Outperform TFLOPS

Here's the thing that drives me up the wall: teams benchmark hardware by quoting raw FLOPS and then wonder why their real-time application feels sluggish. TFLOPS measures theoretical peak compute. It says nothing about how quickly a specific prompt returns its first token or how fast subsequent tokens stream out.

For latency-sensitive LLM workloads, you need two numbers:

Time to First Token (TTFT) — how many milliseconds elapse between sending your prompt and receiving the first generated token. For interactive applications, anything above 500ms starts to feel broken. Groq's LPU inference engines routinely deliver sub-100ms TTFT for models like Llama 3 70B. Standard GPU-backed endpoints typically sit in the 200–800ms range depending on model size, batch occupancy, and network overhead.

Tokens Per Second (TPS) — the streaming rate of generated tokens after the first one arrives. Here, LPUs frequently exceed 500 TPS on mid-size models. GPU-based inference varies wildly: a dedicated H100 might hit 100–150 TPS on Llama 3 70B in optimized configurations, but that number collapses the moment you're sharing the cluster with other tenants.
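
Both numbers can be measured from the client side of any streaming endpoint. The sketch below assumes a hypothetical `start_stream` callable that sends the request and returns an iterable of tokens as they arrive; it is not tied to any particular SDK. Many APIs stream chunks rather than single tokens, in which case you would count tokens per chunk instead of iterations.

```python
import time
from typing import Callable, Iterable, Optional


def measure_stream(start_stream: Callable[[], Iterable[str]],
                   clock: Callable[[], float] = time.perf_counter) -> dict:
    """Time one streamed response and return TTFT (ms) and TPS.

    start_stream is a hypothetical callable: it issues the request and
    returns an iterable that yields tokens as the server streams them.
    """
    start = clock()
    first: Optional[float] = None
    last: Optional[float] = None
    count = 0
    for _ in start_stream():
        last = clock()
        if first is None:
            first = last  # arrival time of the first token
        count += 1
    if first is None:
        raise ValueError("stream produced no tokens")
    decode_s = last - first
    # TPS covers the tokens after the first one, matching the definition above.
    tps = (count - 1) / decode_s if count > 1 and decode_s > 0 else None
    return {"ttft_ms": (first - start) * 1000.0, "tps": tps, "tokens": count}
```

The `clock` parameter exists so the function can be exercised with a fake clock in tests; in a real run the default `time.perf_counter` is the sensible choice.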

Why do these metrics diverge so sharply? It comes down to what happens inside the hardware during a forward pass. Every token generation step has to read the model's weights and the request's key-value cache, then run the attention computation, which makes decoding largely bound by memory bandwidth rather than raw compute. On a GPU, those reads compete with other requests' KV caches for HBM bandwidth, and the kernel scheduler has to decide which thread blocks get priority. On an LPU with SRAM-resident KV caches, each read takes a few nanoseconds and the execution path is pre-compiled, so there's nothing to schedule at runtime and far less to contend for.
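
A back-of-the-envelope bound shows why memory bandwidth dominates streaming speed. The sketch assumes each decode step streams the weights (plus any KV cache) from memory exactly once, and that bandwidth across several devices adds up perfectly. Both are idealizations, and compute time and inter-device communication are ignored. The function name is illustrative, and the result is a ceiling, not a prediction.

```python
def decode_tps_upper_bound(weight_bytes: float,
                           kv_cache_bytes: float,
                           bandwidth_bytes_per_s: float) -> float:
    """Ceiling on single-stream tokens/s if every step reads all bytes once."""
    return bandwidth_bytes_per_s / (weight_bytes + kv_cache_bytes)


H100_BW = 3.35e12  # bytes/s, the HBM3 figure quoted earlier in this article

# 70B parameters at 1 byte each on one device, KV cache ignored
print(round(decode_tps_upper_bound(70e9, 0, H100_BW), 1))       # 47.9

# 70B parameters at 2 bytes each, bandwidth of 8 devices summed (idealized)
print(round(decode_tps_upper_bound(140e9, 0, 8 * H100_BW), 1))  # 191.4
```

Batching changes the picture, because one pass over the weights then serves many requests at once. That is why GPUs post strong aggregate throughput even when single-stream speed is limited.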

There's also a compounding effect on TTFT that people miss. The prefill phase — where the model processes your entire input prompt before generating the first output token — is compute-bound on GPUs for long prompts. The attention computation scales quadratically with prompt length, and that compute has to be scheduled across thousands of cores. On an LPU, the deterministic compiler has already mapped the attention pattern onto the hardware; execution proceeds at a fixed, predictable pace regardless of what other workloads are in the queue.

TFLOPS is a spec-sheet vanity metric for inference. If your users are waiting, TTFT and TPS are the only numbers that matter.

The table below crystallizes the difference for a practical comparison:

| Metric | LPU (Groq-class) | GPU (H100-class) |
| --- | --- | --- |
| TTFT (Llama 3 70B) | < 100ms | 200–800ms |
| TPS (Llama 3 70B) | 500+ TPS | 100–150 TPS (dedicated) |
| Memory Type | SRAM (on-chip) | HBM3 (off-chip, high capacity) |
| Execution Model | Deterministic, single-stream | Non-deterministic, batch-parallel |
| Best For | Real-time interactive | Training, high-concurrency batch |

Notice the qualifier on the GPU TPS number: dedicated. The moment you're on a shared inference endpoint — which is how most teams actually deploy — that number is a moving target.

Deterministic Execution: Eliminating Kernel Scheduling Bottlenecks

This is the most underappreciated difference between the two architectures, and it's where most benchmark comparisons go sideways.

When you send a prompt to a GPU-backed endpoint, the software stack — CUDA kernels, tensor parallelism engines, dynamic batching algorithms — has to schedule your request. It competes with other requests on the same hardware. Kernel launches are non-deterministic; memory access patterns vary based on batch composition; the scheduler makes real-time decisions about which workloads to prioritize. The result is latency variance. One request takes 300ms to first token; the next takes 900ms. Your P95 latency looks fine. Your P99 latency is a disaster.

LPUs sidestep this entirely. The deterministic software compiler pre-computes the execution plan for a given model at compile time. There's no dynamic scheduling. No kernel launch overhead. No contention for memory bandwidth across concurrent workloads. When a prompt hits the LPU, the execution path is fixed and predictable.

To be more precise about what "deterministic" means here: the compiler analyzes the model's computation graph — every matrix multiplication, every attention head, every layer normalization — and maps it to a fixed sequence of operations on the chip's processing elements. That mapping is identical every single time. The same prompt structure produces the same execution timeline, down to the cycle. GPUs, by contrast, rely on a runtime scheduler (CUDA streams, concurrent kernels) that makes allocation decisions on the fly. Those decisions are usually good. "Usually" is doing a lot of heavy lifting in that sentence.

What this gives you in production:

1. Consistent latency under load — TTFT doesn't degrade as sharply when concurrency increases, because there's no scheduling chaos to introduce variance.

2. Predictable scaling behavior — you can model how adding tokens to a prompt or increasing model size affects response time with far more accuracy.

3. Simpler operational monitoring — fewer tail-latency spikes means fewer alert fires, fewer on-call pages, and less time debugging infrastructure that "should be fine according to the dashboards."

The trade-off is real: deterministic execution means less flexibility. You can't dynamically reconfigure the hardware for a different workload on the fly. If you need the same system to train, fine-tune, and serve inference, GPUs remain the more versatile choice. And if your workload involves variable-length inputs with wildly different compute profiles — say, processing everything from 10-token commands to 50,000-token document analyses in the same queue — the rigid execution model of an LPU can actually become a constraint rather than an advantage, because the compiler-optimized path assumes a narrower band of input characteristics.

Benchmarking Methodology for Latency-Sensitive LLM Workloads

So you've decided to run your own comparison. Good. Don't trust anyone else's numbers at face value — not Groq's, not NVIDIA's, not a third-party benchmark blog that tested with a quantized 8B model and extrapolated to 70B.

Here's what a credible benchmark needs to control for:

Model and quantization must be identical. FP16 on one side and INT4 on the other is not a comparison — it's a marketing exercise. If you're benchmarking Llama 3 70B, specify the precision level and hold it constant. Note that some LPU deployments run models in their own proprietary compiled format, which may not map cleanly to standard quantization levels — document exactly what format each side is using.

Concurrent request load must be defined. An LPU serving a single user and a GPU serving fifty concurrent users is not an apples-to-apples test. Run both at 1, 10, and 50 concurrent streams and measure TTFT and TPS at each level.
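
A concurrency sweep can be sketched with `asyncio`. Everything about the backend here is a stand-in: `fake_stream` only sleeps to imitate prefill and decode, and in a real test you would replace it with an async generator that wraps your provider's streaming client. The names and default timings are assumptions for illustration.

```python
import asyncio
import time
from typing import AsyncIterator, Callable


async def fake_stream(n_tokens: int = 20, ttft_s: float = 0.05,
                      per_token_s: float = 0.005) -> AsyncIterator[str]:
    """Stand-in backend: sleeps to imitate prefill, then each decode step."""
    await asyncio.sleep(ttft_s)
    for i in range(n_tokens):
        if i:
            await asyncio.sleep(per_token_s)
        yield "tok"


async def timed_request(make_stream: Callable[[], AsyncIterator[str]]) -> dict:
    """Run one streamed request and record TTFT (s) and TPS."""
    start = time.perf_counter()
    first = last = None
    count = 0
    async for _ in make_stream():
        last = time.perf_counter()
        if first is None:
            first = last
        count += 1
    if first is None:
        raise RuntimeError("stream produced no tokens")
    decode_s = last - first
    tps = (count - 1) / decode_s if count > 1 and decode_s > 0 else None
    return {"ttft_s": first - start, "tps": tps}


async def sweep(make_stream, levels=(1, 10, 50)) -> dict:
    """Fire n simultaneous requests at each concurrency level in turn."""
    results = {}
    for n in levels:
        results[n] = await asyncio.gather(
            *(timed_request(make_stream) for _ in range(n)))
    return results


if __name__ == "__main__":
    for n, rows in asyncio.run(sweep(fake_stream)).items():
        worst = max(r["ttft_s"] for r in rows)
        print(f"concurrency={n:>2}  requests={len(rows)}  "
              f"worst TTFT={worst * 1000:.0f}ms")
```

This fires one burst per level. A sustained test would keep each level running in a loop for the full measurement window and pool the samples.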

Prompt and output lengths must be fixed. A 50-token prompt generating 200 output tokens behaves very differently from a 2,000-token prompt generating 100 tokens. Define your workload profile and test against it. If your production traffic is mostly short conversational turns (under 200 tokens of input), benchmark that. If it's long-context RAG pipelines feeding 4,000-token prompts, benchmark that instead. The architecture that wins for short prompts may not win for long ones.

Network latency must be isolated. If one endpoint is in US-East and the other in US-West, you're measuring fiber optic cable, not silicon. Use same-region deployment or subtract network RTT from your measurements. If you're hitting a public API endpoint, run a baseline ping to confirm the network contribution before drawing conclusions.
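
One way to estimate the network contribution from inside a script is to time the TCP handshake. This is a substitute for ICMP ping, chosen because ping is often blocked or needs elevated privileges. It is only a rough proxy: it ignores TLS setup and any proxy hops, and subtracting a single round trip is an assumption that fits best when the HTTP connection is already open and reused. Function names and defaults are illustrative.

```python
import socket
import statistics
import time


def tcp_connect_rtt_ms(host: str, port: int = 443, samples: int = 5,
                       timeout_s: float = 3.0) -> float:
    """Median time to complete a TCP handshake, as a rough RTT proxy."""
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        with socket.create_connection((host, port), timeout=timeout_s):
            timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


def subtract_rtt(ttft_ms: list[float], rtt_ms: float) -> list[float]:
    """Remove one round trip from each TTFT sample, clamped at zero."""
    return [max(t - rtt_ms, 0.0) for t in ttft_ms]
```

Report both the raw and the adjusted figures. If the two endpoints differ by more than a few milliseconds of RTT, same-region deployment is a cleaner fix than arithmetic.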

Measurement window must be long enough. A ten-minute burst test tells you almost nothing. Run for at least 60 minutes to capture scheduling variance, memory pressure patterns, and any warm-up effects. GPU-backed endpoints in particular can exhibit different latency profiles during the first few minutes of a cold start versus steady-state operation.

Warm versus cold runs matter. On GPU systems, the first few inference calls often pay a JIT compilation or CUDA kernel caching penalty. On LPU systems, the model is pre-compiled, so cold-start overhead is minimal — but the hardware itself may have a warm-up period as the on-chip caches populate. Run your benchmark long enough for both platforms to reach steady state, then measure.

The metrics you report should center on TTFT at P50, P95, and P99, plus sustained TPS at each concurrency level. Anything less is anecdotal. And if you're publishing those results internally to drive a deployment decision, include the raw distribution — not just the mean. A platform with a 150ms mean TTFT but a 1.2-second P99 is a very different product experience than one with a 120ms mean and a 200ms P99.
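
The gap between the mean and the tail is easy to demonstrate. The sketch below uses the nearest-rank percentile definition (one of several in common use) and drops a configurable number of warm-up samples before summarizing. The sample data is synthetic and chosen only to illustrate the point.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% at or below it."""
    ordered = sorted(samples)
    if not ordered:
        raise ValueError("no samples")
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(ttft_ms, warmup=0):
    """Summarize TTFT samples after discarding the first `warmup` of them."""
    steady = list(ttft_ms)[warmup:]
    if not steady:
        raise ValueError("no samples left after warm-up")
    return {
        "n": len(steady),
        "mean": float(statistics.mean(steady)),
        "p50": percentile(steady, 50),
        "p95": percentile(steady, 95),
        "p99": percentile(steady, 99),
    }


# Synthetic data: 3 cold-start calls, then 98 fast and 2 slow requests.
samples = [5000] * 3 + [100] * 98 + [1200] * 2
stats = summarize(samples, warmup=3)
print(stats["mean"], stats["p50"], stats["p95"], stats["p99"])
# 122.0 100 100 1200
```

In this synthetic run the mean and P95 both look healthy while P99 is twelve times the median, which is exactly the kind of tail that a mean-only report hides.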

Strategic Deployment: Matching Hardware to Inference Requirements

Here's where I land on this, and I won't hedge.

If your product requires a human to wait for a response — a chat interface, an AI coding assistant, a real-time translation layer, an agent executing a multi-step workflow — and latency is a user experience requirement, not just a performance optimization, then LPU-class inference is the architecture to evaluate first. The deterministic execution model and SRAM-based memory access deliver a qualitatively different experience. Users feel the difference. It's not a marginal improvement; it's the difference between "instant" and "waiting."

If you're running batch processing, offline evaluation, model training, or high-concurrency workloads where individual request latency is secondary to aggregate throughput, GPUs remain the workhorse. No LPU on the market today can touch the H100's training throughput or its flexibility across mixed workloads.

There's a middle ground that's worth calling out: hybrid architectures. Some teams are deploying GPU clusters for offline and batch workloads while routing latency-sensitive traffic to LPU endpoints. The orchestration layer — a routing proxy that inspects the incoming request and sends it to the right backend based on latency requirements, prompt length, or user tier — adds operational complexity, but it lets you optimize cost and performance simultaneously. This pattern is still early, and the tooling is immature, but it's the direction the infrastructure stack is moving for teams that can't afford to pick one horse.
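
The routing decision itself can be very small. The following is a sketch of the policy only, not a proxy implementation: the request fields, the 4,000-token cutoff, the tier names, and the backend labels are all hypothetical placeholders to tune against your own traffic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InferenceRequest:
    prompt_tokens: int
    interactive: bool            # is a human waiting on the response?
    user_tier: str = "standard"  # hypothetical tier label


def choose_backend(req: InferenceRequest,
                   max_lpu_prompt_tokens: int = 4000,
                   lpu_tiers: frozenset = frozenset({"standard", "premium"})) -> str:
    """Return "lpu" for latency-sensitive traffic, "gpu" for everything else."""
    if not req.interactive:
        return "gpu"   # batch and offline work goes to the throughput tier
    if req.prompt_tokens > max_lpu_prompt_tokens:
        return "gpu"   # very long prompts fall outside the low-latency path
    if req.user_tier not in lpu_tiers:
        return "gpu"   # tiers not entitled to the low-latency backend
    return "lpu"


print(choose_backend(InferenceRequest(prompt_tokens=150, interactive=True)))     # lpu
print(choose_backend(InferenceRequest(prompt_tokens=150, interactive=False)))    # gpu
print(choose_backend(InferenceRequest(prompt_tokens=50_000, interactive=True)))  # gpu
```

Keeping the policy a pure function makes it easy to unit test and to replay against logged traffic before it sits in front of production requests.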

The deployment decision also has a financial dimension that teams often underweight. Infrastructure costs scale differently: GPU clusters offer economies of scale across diverse workloads, while LPU deployments are optimized for a narrower but critical slice of the inference stack. For teams thinking about long-term capital allocation — whether you're a startup burning runway or an enterprise optimizing cloud spend — understanding the cost-per-token at your actual concurrency level is as important as the raw speed numbers. The silicon you bet on shapes both your product and your balance sheet.
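
Cost per token falls out of two numbers you can measure: what the capacity costs per hour and the aggregate tokens per second it sustains at your real concurrency. This minimal sketch uses placeholder figures, not real prices or benchmark results, and it ignores idle time, which in practice pushes the effective cost up.

```python
def cost_per_million_tokens(hourly_cost_usd: float, aggregate_tps: float) -> float:
    """USD per one million generated tokens at a sustained aggregate rate."""
    if aggregate_tps <= 0:
        raise ValueError("aggregate_tps must be positive")
    tokens_per_hour = aggregate_tps * 3600
    return hourly_cost_usd / tokens_per_hour * 1_000_000


# Placeholder inputs: the same $4/hour capacity at two utilization levels.
print(round(cost_per_million_tokens(4.0, 1000), 2))  # 1.11
print(round(cost_per_million_tokens(4.0, 100), 2))   # 11.11
```

The tenfold swing between the two lines comes entirely from utilization, which is why the benchmark has to run at your actual concurrency level rather than at the vendor's best case.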

Choosing between LPU and GPU inference isn't a hardware preference — it's a product decision. The latency your users experience is the latency you chose to ship.

The honest answer to "which is faster?" is: faster at what, for whom, under what load, at what cost? Run the benchmark yourself. Control the variables. Measure what your users actually feel. And then ship the architecture that matches your product's promise — not a vendor's spec sheet.