The Challenges of Powering Big AI Chips

As data center AI workloads grow in importance, AI chips are getting bigger and bigger, and the amount of power they need is growing rapidly, too. Today’s high-end GPUs can demand as much as 700 W per chip, with CPUs, FPGAs (field-programmable gate arrays) and ASICs (application-specific integrated circuits) not far behind. Delivering hundreds of watts into a tiny space presents several challenges, and the quality of the power delivery system can have a significant impact on the chip’s performance.

Lev Slutskiy, regional sales manager at Vicor, told EE Times that the amount of power demanded by fast processing of big data and AI training is increasing rapidly.

“At the same time, current and new technologies—7 nm and below—require lower voltage and this leads to very high currents,” he said. “Another challenge is to maintain supply voltage during fast load transients, while the area occupied by the power supply should be minimized.”

The power delivery network (PDN), which supplies power to big processors, must cope with all of these challenges. Doing so requires a PDN architecture with minimal impedance, to keep losses low, and with minimal parasitic inductance and resistance, to optimize load transient response.

The Open Compute Project, an industry consortium led by hyperscalers like Meta and Google that drives standardization in IT infrastructure, is advocating for a change from 12 V to 48 V for PDNs.

Lev Slutskiy (Source: Vicor)

Slutskiy explained that using higher voltage means the same amount of energy can be delivered at lower current, meaning I²R losses are minimized. Forty-eight volts was chosen as it is the highest possible voltage allowed for SELV (safety extra-low voltage); at 48 V, the current needed to supply the same power is a quarter of the current needed at 12 V, which means I²R losses drop by a factor of 16.
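The factor-of-16 figure follows directly from P = V·I and P_loss = I²·R. A minimal sketch of the arithmetic, using the article's 700 W figure as the delivered power and a purely hypothetical 1 mΩ distribution resistance (the ratio does not depend on that value):

```python
def bus_current(power_w, bus_voltage_v):
    """Current needed to deliver power_w at bus_voltage_v (I = P / V)."""
    return power_w / bus_voltage_v


def conduction_loss(power_w, bus_voltage_v, resistance_ohm):
    """Resistive (I^2 * R) loss in the distribution path."""
    current_a = bus_current(power_w, bus_voltage_v)
    return current_a ** 2 * resistance_ohm


POWER_W = 700.0          # per-chip power quoted in the article
RESISTANCE_OHM = 0.001   # assumed path resistance, for illustration only

loss_12v = conduction_loss(POWER_W, 12.0, RESISTANCE_OHM)
loss_48v = conduction_loss(POWER_W, 48.0, RESISTANCE_OHM)

print(f"12 V bus: {bus_current(POWER_W, 12.0):.1f} A, loss {loss_12v:.2f} W")
print(f"48 V bus: {bus_current(POWER_W, 48.0):.1f} A, loss {loss_48v:.2f} W")
print(f"loss ratio: {loss_12v / loss_48v:.0f}x")
```

The sketch treats the path resistance as identical at both voltages; in practice conductors may be resized, which trades some of that saving for smaller busbars.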

“Using 48 V instead of 12 V makes multiphase architectures de facto impossible, since duty cycle decreases and the efficiency drops, so a new architectural approach is needed,” Slutskiy said, noting that Vicor’s factorized power architecture (FPA) enables 48 V power distribution, while allowing the last conversion stage to be pushed extremely close to the load.
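Slutskiy's point about duty cycle can be illustrated with the textbook relation for an ideal buck converter, D = V_out / V_in. The sketch below is an approximation under stated assumptions, not Vicor's analysis: the 1 V core voltage and 1 MHz switching frequency are hypothetical values chosen only to show the trend.

```python
def buck_duty_cycle(v_out, v_in):
    """Duty cycle of an ideal (lossless, continuous-conduction) buck converter."""
    return v_out / v_in


def on_time_ns(v_out, v_in, f_sw_hz):
    """High-side switch on-time per switching cycle, in nanoseconds."""
    return buck_duty_cycle(v_out, v_in) / f_sw_hz * 1e9


V_CORE = 1.0     # assumed processor supply voltage
F_SW_HZ = 1e6    # assumed switching frequency

for v_bus in (12.0, 48.0):
    duty = buck_duty_cycle(V_CORE, v_bus)
    t_on = on_time_ns(V_CORE, v_bus, F_SW_HZ)
    print(f"{v_bus:.0f} V bus: duty cycle {duty:.1%}, on-time {t_on:.1f} ns")
```

Under these assumptions, moving from 12 V to 48 V cuts the duty cycle and the switch on-time by a factor of four, which is the regime in which a direct single-stage buck conversion becomes hard to run efficiently.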

FPA separates power conversion into two stages: a 48-V pre-regulation module followed by a fixed-ratio voltage transformation stage (VTM). The regulation module can be placed wherever convenient, but the VTM can be placed very close to the processor, and is optimized for high current density, efficiency and low noise.
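A fixed-ratio stage can be thought of as a DC transformer: it divides voltage by a constant and multiplies current by the same constant. The sketch below models an ideal, lossless stage; the division ratio of 48 and the 15 A input current are hypothetical illustration values, not figures from Vicor.

```python
def fixed_ratio_stage(v_in, i_in, division_ratio):
    """Ideal fixed-ratio converter: voltage divided, current multiplied.

    Returns (v_out, i_out). Losses are ignored, so input power equals
    output power.
    """
    v_out = v_in / division_ratio
    i_out = i_in * division_ratio
    return v_out, i_out


# Assumed values for illustration only.
v_out, i_out = fixed_ratio_stage(v_in=48.0, i_in=15.0, division_ratio=48)

print(f"input:  48.0 V at 15.0 A ({48.0 * 15.0:.0f} W)")
print(f"output: {v_out:.2f} V at {i_out:.0f} A ({v_out * i_out:.0f} W)")
```

Because the high current exists only on the output side of this stage, placing it close to the processor keeps the high-current path short while the rest of the board carries the much smaller input current.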

Vicor modules can be placed either on the processor substrate (for optimal load transient response), beside it or underneath it on the other side of the PCB. This helps reduce the impedance of the “last inch” of delivery, meaning more power gets to the processor and less is wasted.

Vicor modular current multipliers (MCMs) can be placed on the motherboard or the processor substrate—as close to the processor as possible to minimize losses. (Source: Vicor)

Traditional limits to powering AI processors

“Most HPC processors are reaching, if they haven’t reached already, the limit to which traditional solutions can be used to effectively power them without having to throttle their performance, defeating the purpose of their very existence,” Mukund Krishna, senior manager of product marketing at Empower Semiconductor, told EE Times.

Mukund Krishna (Source: Empower Semiconductor)

Krishna said that the biggest challenge HPC (high-performance computing) and AI processors present is that their compute capabilities jump in large steps from one generation to the next, rather than improving gradually.

“From a power consumption perspective, this results in [doubling] the TDP [thermal design power] and peak power consumption of these processors in just a single generation step, while being realized in more or less the same form factors, due to moving to the next process node,” he said. “As a related effect of increased capability and finer process nodes, step changes in current, or what is known as load step or load transients are also doubling in the same time scales—a few to tens of nanoseconds.”

Traditional server power management solutions are relatively slow, use large magnetics and often require many output filtering capacitors. As a result, they must be placed farther from the load, where space is available, which increases impedance and reduces efficiency. Krishna points out that PCB (printed circuit board) losses grow with the square of the load current, so doubling the current quadruples them.

“Slow switching speed leads to low bandwidth, and combined with the increased impedance to the load due to lateral distance results in the need for a decoupling solution of passive components near or under the processors,” he said.
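The link between distance, impedance and decoupling can be made concrete with the inductive voltage drop ΔV = L·ΔI/Δt. The sketch below is a first-order estimate only; the loop inductances and the 100 A step are hypothetical, and the 10 ns edge is taken from the "few to tens of nanoseconds" range quoted above.

```python
def inductive_droop_v(loop_inductance_h, current_step_a, rise_time_s):
    """First-order supply droop across a parasitic inductance (V = L * dI/dt)."""
    return loop_inductance_h * current_step_a / rise_time_s


CURRENT_STEP_A = 100.0   # assumed load step
RISE_TIME_S = 10e-9      # within the range quoted in the article

# Assumed loop inductances for a distant and a close regulator placement.
for label, inductance_h in (("distant", 100e-12), ("close", 10e-12)):
    droop = inductive_droop_v(inductance_h, CURRENT_STEP_A, RISE_TIME_S)
    print(f"{label} placement ({inductance_h * 1e12:.0f} pH): droop {droop:.2f} V")
```

With a supply voltage of around 1 V or less, even small parasitic inductances produce droop that the regulator cannot correct fast enough on its own, which is why capacitors near or under the processor are needed when the regulator is slow or far away.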

Empower’s integrated voltage regulator (IVR) technology is designed to replace traditional solutions for point of load regulation. An IVR integrates the entire regulator into a single package.

A demo platform for Empower’s EP70xx IVR (center left). (Source: Empower Semiconductor)

“IVRs can switch at orders of magnitude higher frequency, providing that many orders of magnitude higher bandwidth,” Krishna said. “The increased switching speeds also greatly reduce the magnetic and filtering requirements, allowing the integration or elimination of what are usually large and tall components.”

IVRs can also be placed much closer to the processor, ideally vertically below it, shortening the power delivery path as much as possible, Krishna added, explaining that simply bringing the regulator physically closer to the load can reduce PCB losses by a factor of 10. That can mean as much as 5-10% more power reaching the processor, even for mid-range processors.

Faster switching speeds can also enable dynamic voltage scaling (DVS) schemes to switch in nanoseconds rather than microseconds. Turning the supply voltage up or down that quickly, as processor demand changes, can yield big savings: 10-30%, depending on the exact use case, according to Krishna.

“HPC processors are being designed in smaller process nodes, which inherently come with high leakage and stringent limits on operating voltages,” he said. “Simply put, a fast transient response allows the voltage margining budget to be tightened, allowing power consumption savings exceeding 10%.”
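One way to see why a tighter voltage margin saves power is the common first-order model in which dynamic power scales with the square of supply voltage at fixed frequency and switched capacitance. This model is an assumption introduced here for illustration, not Empower's calculation, and it ignores leakage, which typically falls even faster as voltage drops.

```python
def dynamic_power_saving(margin_reduction):
    """Fractional dynamic-power saving from lowering supply voltage.

    margin_reduction is the fractional voltage reduction (0.05 means 5%).
    Assumes P_dynamic is proportional to V^2 at fixed frequency.
    """
    return 1.0 - (1.0 - margin_reduction) ** 2


# Hypothetical margin reductions, for illustration only.
for reduction in (0.03, 0.05, 0.06):
    saving = dynamic_power_saving(reduction)
    print(f"{reduction:.0%} lower voltage -> {saving:.1%} less dynamic power")
```

Under this model, trimming a little more than 5% of voltage guard band is enough for the saving to pass 10%.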

Thermal considerations

Brian Korn (Source: Advanced Energy)

Thermal considerations and cooling are a huge challenge as we move toward bigger processors for HPC and AI, since power consumption jumps dramatically, Brian Korn, VP of data center, telecom and network products at Advanced Energy, told EE Times. He also cited the rising cost of energy and climate change, which is causing extreme temperatures, as major challenges for data center operators.

“At the Datacenter Investment Conference and Expo, which took place in early April, data center operators shared that as the industry moves towards energy-intensive HPC and AI, they would need to provision 50 to 100 kW racks and possibly beyond, which highlights the need to change how data centers are designed,” he said.

The move to 48-V power distribution will result not only in increased efficiency but also in smaller busbars and significantly better thermal performance, Korn added, explaining that even half a percentage point gain in efficiency can save data center operators more than $1,000 per power shelf over the course of its 5-year life. Advanced Energy’s 48 V Artesyn ORv3-Compliant Power Shelf and PSU is approximately 2% more efficient than traditional 12 V designs, according to the company.

Advanced Energy’s 48 V Artesyn ORv3-Compliant Power Shelf. (Source: Advanced Energy)

Korn gave the example of a 10 MW data center with a power usage effectiveness (PUE) of 1.6, in which servers consume 50% of the energy.

“A 2% increase in server PSU efficiency increases the PUE and leads to a 1.6% decrease in electricity use—that’s 1.4 million kWh saved per year,” he said. “At $0.07 per kWh, that’s a saving of $98,000 in power per year. This energy consumption is equivalent to the energy generated by more than 2,000 barrels of oil!”
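Korn's annual figures can be checked with a few lines of arithmetic. The sketch below takes his 1.6% reduction and $0.07 per kWh as given and assumes, as a simplification not stated in the article, that the facility draws its full 10 MW continuously for all 8,760 hours of the year.

```python
FACILITY_POWER_KW = 10_000.0   # 10 MW data center, from the article
HOURS_PER_YEAR = 24 * 365      # assumes continuous operation at full load
REDUCTION_FRACTION = 0.016     # 1.6% decrease in electricity use, as quoted
PRICE_USD_PER_KWH = 0.07       # electricity price, as quoted

annual_kwh = FACILITY_POWER_KW * HOURS_PER_YEAR
saved_kwh = annual_kwh * REDUCTION_FRACTION
saved_usd = saved_kwh * PRICE_USD_PER_KWH

print(f"annual consumption: {annual_kwh / 1e6:.1f} million kWh")
print(f"energy saved:       {saved_kwh / 1e6:.2f} million kWh per year")
print(f"cost saved:         ${saved_usd:,.0f} per year")
```

The result lands at roughly 1.4 million kWh and $98,000 per year, matching the quoted figures.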

While hyperscalers like Google made the transition to 48 V more than 10 years ago, the transition to 48-V servers in the rest of the industry has been slower because of server availability, but it is gathering steam, Korn said. In the meantime, Advanced Energy is contributing to the Open Compute Project (OCP) on 48-V distribution and alternate cooling techniques to address HPC and AI’s thermal challenges going forward.
