What Is HBM and Why Does It Limit AI Scaling?
High Bandwidth Memory has become as critical as GPUs in determining how fast AI infrastructure can expand — and it's in shorter supply.
High Bandwidth Memory (HBM) is a specialized DRAM architecture that stacks memory chips vertically, enabling data transfer rates up to 10 times faster than conventional memory — a capability now essential for training and running large language models. While GPUs capture headlines in AI infrastructure buildout, memory bandwidth has emerged as an equally constraining factor, with supply concentrated among three manufacturers and demand growing faster than production capacity.
The current surge in AI infrastructure spending — exemplified by hyperscalers committing, per Reuters reporting, over $200 billion in combined 2026 capital expenditures — has pushed HBM from a niche product into a critical bottleneck. Memory shortages now pose constraints comparable to GPU availability in determining how quickly AI capabilities can scale.
Memory Bandwidth Fundamentals
Traditional DRAM connects to processors through a parallel bus, typically achieving transfer rates of 25-50 GB/s per channel. HBM instead uses through-silicon vias (TSVs) to stack multiple DRAM dies vertically on a base logic die, with the completed stack linked to the processor across a silicon interposer carrying thousands of connections. This architecture enables HBM3 to deliver bandwidths exceeding 800 GB/s per stack, with HBM3E extending this further.
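These figures follow directly from interface width and per-pin transfer rate. A minimal sketch, using the JEDEC HBM3 parameters (1024-bit interface, up to 6.4 Gb/s per pin) and a DDR5-4800 DIMM as a conventional baseline:

```python
def peak_bandwidth_gbs(bus_width_bits: int, transfer_rate_gtps: float) -> float:
    """Peak bandwidth in GB/s: interface width in bytes x transfers per second."""
    return bus_width_bits / 8 * transfer_rate_gtps

# HBM3 (JEDEC): 1024-bit interface per stack at up to 6.4 Gb/s per pin.
hbm3_stack = peak_bandwidth_gbs(1024, 6.4)   # 819.2 GB/s per stack
# DDR5-4800 DIMM: 64 data bits at 4.8 GT/s.
ddr5_dimm = peak_bandwidth_gbs(64, 4.8)      # 38.4 GB/s per DIMM

print(f"HBM3 stack: {hbm3_stack:.0f} GB/s vs DDR5 DIMM: {ddr5_dimm:.1f} GB/s "
      f"({hbm3_stack / ddr5_dimm:.0f}x per interface)")
```

Realized system bandwidth scales further with the number of stacks or channels, which is how a GPU carrying six HBM stacks reaches multiple terabytes per second.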
The vertical stacking approach solves a fundamental physics problem: as processors grow more powerful, the distance between compute cores and memory becomes a limiting factor. HBM places memory dies within millimeters of the GPU silicon, drastically reducing latency while increasing throughput.
Why AI Workloads Demand HBM
Large language models operate through matrix multiplication across billions of parameters. During training, these models repeatedly load weights from memory, perform calculations, then write results back — a cycle that occurs trillions of times. Memory bandwidth, not raw compute speed, often determines training velocity.
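One way to see this is arithmetic intensity — FLOPs per byte of memory traffic. When a layer's intensity falls below the hardware's compute-to-bandwidth ratio, bandwidth rather than compute sets the pace. A rough sketch (the accelerator figures are illustrative, in the range of current high-end parts):

```python
def gemm_intensity(m: int, k: int, n: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte moved for an (m,k) x (k,n) matrix multiply in fp16."""
    flops = 2 * m * k * n                                   # multiply + add
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)  # read A, B; write C
    return flops / bytes_moved

# Illustrative accelerator: ~1,000 TFLOPS fp16 compute, ~3.35 TB/s HBM bandwidth.
ridge = 1000e12 / 3.35e12                  # ~299 FLOPs/byte break-even point

print(gemm_intensity(1, 12288, 12288))     # ~1 FLOP/byte: heavily memory-bound
print(gemm_intensity(4096, 12288, 12288))  # ~2458 FLOPs/byte: compute-bound
print(ridge)
```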
A model like GPT-4, with an estimated 1.7 trillion parameters, requires moving terabytes of data between memory and compute cores during each training step. According to research from Stanford University, memory bandwidth utilization exceeds 80% during transformer training, making it the primary bottleneck for large models.
The relationship between model size and memory demand is nonlinear. A model with 175 billion parameters (GPT-3 scale) requires approximately 350 GB of HBM just to store weights in 16-bit precision — two bytes per parameter — before accounting for optimizer states, gradients, and activations during training. Scaling to trillion-parameter models pushes total memory requirements into multiple terabytes per training cluster.
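A back-of-envelope sketch, assuming the common mixed-precision Adam layout (fp16 weights and gradients plus fp32 master weights and two fp32 optimizer moments, i.e. 16 bytes per parameter, activations excluded):

```python
def training_footprint(params_billions: float) -> tuple[float, float]:
    """Return (fp16 weight storage in GB, full training state in TB)."""
    p = params_billions * 1e9
    weights_gb = p * 2 / 1e9    # 2 bytes/param for fp16 weights alone
    state_tb = p * 16 / 1e12    # 16 bytes/param with mixed-precision Adam
    return weights_gb, state_tb

print(training_footprint(175))    # (350.0, 2.8)   -- GPT-3 scale
print(training_footprint(1000))   # (2000.0, 16.0) -- trillion-parameter scale
```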
Inference workloads — serving model outputs to end users — face similar constraints. Each query requires loading relevant model weights from memory before computation can begin. For real-time applications demanding sub-100ms response times, HBM’s low latency becomes non-negotiable. Standard DRAM introduces delays that accumulate across the billions of operations required per inference pass.
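The latency constraint can be bounded from first principles: in autoregressive decoding, every weight must be streamed from memory at least once per generated token, so weight bytes divided by bandwidth gives a latency floor. A sketch with illustrative figures (a 70B-parameter model in fp16 on a single device):

```python
def min_token_latency_ms(params_b: float, bytes_per_param: int, bw_gbs: float) -> float:
    """Lower bound on per-token decode latency, in milliseconds."""
    weight_bytes = params_b * 1e9 * bytes_per_param
    return weight_bytes / (bw_gbs * 1e9) * 1e3

print(min_token_latency_ms(70, 2, 3350))  # ~42 ms/token on ~3.35 TB/s of HBM
print(min_token_latency_ms(70, 2, 38.4))  # ~3646 ms/token on one DDR5 channel
```

Real deployments shard weights across many GPUs and batch requests, but the floor scales the same way — which is why sub-100ms targets effectively require HBM-class bandwidth.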
The Three-Player Supply Constraint
HBM production is concentrated among SK Hynix, Samsung, and Micron — an oligopoly created by the technology's extreme manufacturing complexity. Fabricating HBM requires stacking eight or more DRAM dies with sub-micron alignment precision, then bonding them to an interposer using thousands of TSVs, each approximately 5 microns in diameter.
SK Hynix currently holds approximately 50% market share, having secured Nvidia as its primary customer for H100 and H200 GPUs. Samsung controls roughly 40%, while Micron entered volume production only in 2024. This concentration creates vulnerability: production delays at any single manufacturer ripple through the entire AI supply chain.
Manufacturing capacity remains the binding constraint. Converting standard DRAM fabs to HBM production requires equipment investments exceeding $3 billion per facility and 18-24 month conversion timelines, according to Semiconductor Industry Association data. Demand growth has outpaced these expansion cycles since 2023.
Pricing Dynamics and Allocation
HBM pricing has decoupled from standard DRAM markets. While commodity DRAM trades at approximately $3-4 per gigabyte, HBM3 commands $20-25 per gigabyte in contract pricing, per TrendForce market data. This 6-7x premium reflects both manufacturing complexity and supply scarcity.
Hyperscalers now negotiate HBM supply agreements 12-18 months in advance, locking in capacity at fixed prices to secure allocation. Spot market availability is essentially zero — all production flows through pre-committed contracts. This allocation model mirrors GPU supply constraints, where major cloud providers reserve entire production runs quarters ahead of delivery.
“HBM is the new oil. You can have all the GPUs you want, but without memory bandwidth, they sit idle. And right now, there’s not enough to go around.”
— Rene Haas, CEO of Arm Holdings, at 2025 Semiconductor Industry Forum
The pricing power of HBM suppliers has restructured profit distribution across the AI hardware stack. Memory manufacturers capture 30-35% of total system value in high-end AI servers, compared to 15-20% in traditional server configurations. This shift redirects infrastructure spending toward memory vendors at the expense of other component suppliers.
How Memory Shortages Limit AI Scaling
Current HBM production constraints impose a hard ceiling on deployable AI compute. Nvidia's H200 GPU requires six HBM3E stacks totaling 141 GB and roughly 4.8 TB/s of bandwidth — more than 100 conventional DDR5 DIMMs' worth of throughput in a single package. Training clusters for frontier models can incorporate 10,000+ GPUs, translating to demand for 60,000 HBM stacks per cluster.
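The cluster-level arithmetic, using the figures above (six stacks and ~4.8 TB/s per H200; 38.4 GB/s per DDR5-4800 DIMM channel):

```python
stacks_per_gpu = 6
gpus_per_cluster = 10_000
print(stacks_per_gpu * gpus_per_cluster)   # 60,000 HBM stacks per cluster

h200_bw_gbs = 4800.0     # ~4.8 TB/s per GPU
ddr5_dimm_gbs = 38.4     # one DDR5-4800 64-bit channel
print(h200_bw_gbs / ddr5_dimm_gbs)         # ~125 DIMMs to match one GPU
```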
Manufacturing output cannot yet support this demand trajectory. Industry analysts at Gartner estimate global HBM production capacity at approximately 500 million gigabytes annually for 2026 — sufficient for roughly 100,000 high-end AI GPUs, well below the projected 300,000+ units hyperscalers plan to deploy.
| Metric | Available Supply | Projected Demand |
|---|---|---|
| Annual HBM Volume (GB) | 500 million | 1.2 billion |
| Equivalent H200 GPUs | ~100,000 | ~240,000 |
| Supply Deficit | — | 58% |
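The deficit row follows directly from the volume figures:

```python
supply_gb, demand_gb = 500e6, 1.2e9   # table estimates for 2026
print(f"{(demand_gb - supply_gb) / demand_gb:.0%}")   # 58%
```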
This supply gap forces architectural compromises. Some training clusters now employ tiered memory hierarchies, using smaller amounts of HBM for active computation while offloading less frequently accessed data to standard DRAM or even SSD storage. These hybrid approaches increase complexity and reduce effective training speed by 15-25%, but enable deployment with available memory supply.
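A conceptual sketch of such a tiered placement policy — the tier names, capacities, and the greedy hottest-first heuristic here are all hypothetical, not a description of any vendor's implementation:

```python
TIERS = [("HBM", 141), ("DRAM", 512), ("SSD", 4096)]   # capacities in GB (illustrative)

def place(tensors):
    """Greedily place (name, size_gb, access_freq) tuples, hottest first,
    into the fastest tier with room; colder data spills down the hierarchy."""
    free = {tier: cap for tier, cap in TIERS}
    placement = {}
    for name, size, _freq in sorted(tensors, key=lambda t: -t[2]):
        for tier, _cap in TIERS:
            if free[tier] >= size:
                placement[name] = tier
                free[tier] -= size
                break
    return placement

weights = [("attn", 60, 0.9), ("mlp", 120, 0.8), ("embed", 50, 0.3)]
print(place(weights))   # {'attn': 'HBM', 'mlp': 'DRAM', 'embed': 'HBM'}
```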
The shortage also delays next-generation model development. OpenAI's rumored GPT-5 training, targeting 10+ trillion parameters, would require memory bandwidth only achievable through HBM3E at scale — a capability that remains supply-constrained until late 2026 at the earliest. Competitors face identical limitations, creating an industry-wide pause in model-scaling velocity not seen since 2021.
Supply Expansion Timelines
Memory manufacturers have announced aggressive capacity expansion, but delivery timelines extend into 2027-2028. SK Hynix plans to invest $15 billion in additional HBM production facilities through 2027. Samsung is allocating $12 billion to convert existing DRAM fabs to HBM-capable lines. Micron, entering the market later, committed $7 billion to new HBM capacity in its 2025 capital plan.
These investments will substantially increase supply, but not before 2027. Fab construction and equipment installation typically require 24-30 months from groundbreaking to initial production. Ramp to full volume output adds another 6-12 months. The memory shortage constraining current AI buildouts will persist through most of 2026 regardless of investment pace.
Next-generation HBM4, expected in 2027, promises bandwidth exceeding 2 TB/s per stack — nearly double HBM3E's capability. However, it also requires new manufacturing processes and materials, potentially extending the supply-demand mismatch if adoption outpaces production readiness. Early JEDEC specifications indicate HBM4 will use hybrid bonding techniques that further increase manufacturing complexity.
Key Takeaways
- HBM delivers 10x+ bandwidth vs. standard DRAM through vertical stacking and proximity to GPU silicon, making it essential for AI workloads that are memory-bottlenecked rather than compute-bottlenecked.
- Three manufacturers control global supply, creating concentration risk and enabling 6-7x pricing premiums over commodity memory.
- Current production capacity can supply ~100,000 high-end AI GPUs annually, well below hyperscaler demand for 300,000+ units, forcing architectural compromises and delaying frontier model development.
- Despite $34 billion in announced capacity investments, supply constraints will persist through 2026 due to 24-30 month fab construction timelines.
Related Coverage
For analysis of how memory constraints are reshaping AI infrastructure economics and supply chains:
- Micron’s 194% revenue surge demonstrates memory’s emergence as a revenue driver comparable to GPUs in AI buildouts.
- Samsung strike threats illustrate supply concentration risks when labor disruptions affect one of three global HBM producers.
- Micron’s $200 billion capacity expansion outlines the multi-year timeline for supply relief despite aggressive investment.
- Meta’s infrastructure prioritization shows how hyperscalers are reallocating capital to secure scarce memory and compute resources.