The Compute Express Link (CXL) spec is arguably one of the fastest-maturing interfaces in the semiconductor industry. Its widespread buy-in has meant many vendors have designed products to build out the ecosystem, with the Storage Networking Industry Association (SNIA) being the latest to throw its hat in the ring to help further improve data movement.
On Nov. 28, SNIA introduced the Smart Data Accelerator Interface (SDXI) specification. Similar to CXL, the SDXI spec prioritizes efficient data movement; specifically, SDXI is a standard for a memory-to-memory data mover and acceleration interface. The genesis of the specification dates to September 2020, when a SNIA technical working group (TWG) set out to realize the concept of a Direct Memory Access (DMA) data-mover device and address common limitations.
The role of a DMA is to offload software-based copy loops to free up CPU execution cycles. Although the concept is well known, DMA adoption is often limited to specific privileged software and I/O use cases employing device-specific interfaces that aren’t forward-compatible. These limitations mean user-mode application usage is difficult in a non-virtualized environment and almost impossible in a multi-tenant virtualized environment.
SDXI works with CXL and heterogeneous computing
SNIA developed the SDXI standard to provide an architectural interface to address current DMA limitations, SNIA TWG Chair Shyam Iyer said during an online briefing of the SDXI Platform Data Mover. Aside from overcoming the existing constraints of a DMA, SDXI will support heterogeneous computing in parallel with CXL, which is now in its third iteration.
Most of the needs of today’s system architectures start at the application level, according to Iyer, with compute needs that are typically addressed by the CPU. When computation occurs, data is stored in memory, which shares a relationship of coherency with the CPU to boost performance.
“When the application needs to scale, it adds more threads to it,” he said. This translates to more cores on the CPU, while an I/O device is employed whenever the data needs to be transported out of the memory, with optimization to address latency and bandwidth, Iyer added. “This has been a system architecture that has worked pretty well, but of late, we’re seeing increased needs for applications, which means that typical compute architectures are evolving.”
Typical architectures today have CPUs and application-specific standard parts, including drives, network interface controllers and field-programmable gate arrays (FPGAs) — all of which try to boost application performance along with many memory types, Iyer explained. Mix in links and fabrics, such as CXL, and everything can be connected. “That means the memory types are truly democratized with these types of links and fabrics and the application can make use of all of them,” he said. “But they also have the same design constraints, whether it’s latency, bandwidth, coherency or control.”
It’s all about data movement
On a basic level, CXL is all about easily moving data to the best resource available, including memory or storage, in part by reducing how far the data must travel. It’s quickly gained momentum as a standard, with the recently formed CXL Consortium having released version 3.0 at the Flash Memory Summit in August. The CXL Consortium also acquired the intellectual property of the Gen-Z Consortium, whose specification has similar characteristics. The OpenCAPI assets are also being folded into the CXL Consortium to move the standard forward.
The CXL spec has experienced very active engagement by the industry with a “who’s who” participating in the consortium, according to Rita Gupta, CXL Consortium contributor and CXL system architect with AMD. “CXL is becoming the industry focal point for coherent I/O standards.”
As with DMA, there have been proprietary attempts at I/O coherency, but trends over the past few years have reflected not only an increased demand for data processing and compute but also a need for heterogeneous computing, according to Gupta. This need means having different types of memories and devices connected and performing together. “All that means you need more and more memory capacity and bandwidth.”
CXL is the first open standard that solves the I/O interconnect problem comprehensively. As a cache-coherent interconnect standard for processors, CXL leverages the PCIe infrastructure with a mix and match of three protocols: CXL.io, CXL.cache and CXL.memory (written CXL.mem in the specification).
“It’s a low-latency standard,” Gupta said. “If you look at the CXL.memory and CXL.cache accesses, they’re targeted to somewhere near CPU latencies.” CXL also provides asymmetric complexity so that device implementations are relieved of the burden of maintaining coherency, she added.
Mixing and matching the three CXL protocols allows for many different use cases. For example, you can view devices with the CXL.io and CXL.memory interface as CXL memory buffers, Gupta said, while a device using all three protocols can have its memory managed by the host.
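The mix-and-match idea can be sketched as a small lookup. This is an illustrative sketch only: the Type 1, Type 2 and Type 3 labels are the CXL specification's names for the three device types, CXL.mem is the specification's spelling of the memory protocol, and the function and table names are hypothetical.

```python
# Protocol combinations and the device type each one corresponds to.
# The table and function names are made up for this sketch.
DEVICE_TYPES = {
    frozenset({"CXL.io", "CXL.cache"}): "Type 1",             # caching device
    frozenset({"CXL.io", "CXL.cache", "CXL.mem"}): "Type 2",  # accelerator with memory
    frozenset({"CXL.io", "CXL.mem"}): "Type 3",               # memory buffer/expander
}


def classify_device(protocols):
    """Return the device type for an iterable of protocol names."""
    key = frozenset(protocols)
    if "CXL.io" not in key:
        raise ValueError("every CXL device implements CXL.io")
    if key not in DEVICE_TYPES:
        raise ValueError(f"not one of the three device types: {sorted(key)}")
    return DEVICE_TYPES[key]


memory_buffer = classify_device(["CXL.io", "CXL.mem"])
accelerator = classify_device(["CXL.io", "CXL.cache", "CXL.mem"])
```

Here the memory buffer Gupta describes comes out as a Type 3 device, and the device using all three protocols as Type 2.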
The first iteration of CXL introduced the three types of devices with the primary feature being point-to-point attachment, while version 2.0 added fanout, switching and memory pooling. With CXL 3.0, the focus shifted to scalability. “If you look at the progression of the CXL spec, it is not just looking at the problems of the compute industry that we are facing today, but it is looking at the problems that are for the future,” Gupta said.
Because CXL is a media-agnostic interface, it’s possible to add lower-cost memories in a system to reduce its overall total cost of ownership (TCO), Gupta noted. This is because memory tiering enables “hot” data to be placed in faster memory, while “cold” data can be placed in a slower tier. “This is where the data movement becomes extremely critical.”
With this usage model, capacity and bandwidth can be added to a system while lowering its TCO, Gupta explained, and one memory location could be accessed by multiple hosts with coherency managed via the CXL protocol. “These usage models are intended to reduce the memory stranding because if you look at the memory resources, which are very expensive, they are being effectively utilized across the different systems,” she said. “This resource disaggregation helps improve the data usage efficiency.”
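To make the hot/cold tiering idea concrete, here is a minimal sketch that assumes a policy based purely on access counts. The function name, the page labels and the counts are hypothetical; real tiering policies live in the operating system or hypervisor and are not described in the briefing.

```python
def place_pages(access_counts, fast_capacity):
    """Split pages into (fast, slow) tiers, hottest pages first.

    access_counts maps a page id to how often it was touched;
    fast_capacity is how many pages fit in the faster memory.
    """
    ranked = sorted(access_counts, key=access_counts.get, reverse=True)
    fast = set(ranked[:fast_capacity])
    slow = set(ranked[fast_capacity:])
    return fast, slow


# Hypothetical access counts for four pages and room for two in fast memory.
counts = {"a": 90, "b": 3, "c": 41, "d": 7}
fast_tier, slow_tier = place_pages(counts, fast_capacity=2)
```

Every time the ranking changes, pages have to migrate between tiers, which is the data movement Gupta calls critical.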
CXL’s fluidity gets accelerated with SDXI
Moving forward into heterogeneous computing, data movement becomes more and more important, Gupta said. “What CXL enables is a very fluid and flexible memory model.”
Different memory types, expanders and accelerators are all available as a resource, but it’s crucial that data movement is as efficient as possible across all of them.
Iyer said today’s data movement is usually a software-based memory copy that uses a stable instruction set architecture, a standard that applications can easily work with because it’s familiar. Application performance is degraded, however, because CPU cycles are being spent on the data copies, according to Iyer. The problem with existing DMAs, meanwhile, is that they’re all vendor-specific. “There is no standardized access for user-level software with the help of these DMA engines.”
That’s where SDXI becomes paramount, with the “X” standing for accelerator. Iyer said SNIA’s proposed standard is for a memory-to-memory data-movement interface that is extensible, forward-compatible and independent of the I/O interconnect technology. “An SDXI interface implementation can exist on different form factors.”
For example, it could be integrated into a CPU or implemented in discrete chips like GPUs or FPGAs, or even in smart I/O devices. The design also eliminates all the software context isolation layers to improve performance and enable direct user-mode access for the applications, according to Iyer.
Like CXL, SDXI targets different types of memory. By having a specification that is architectural in nature, you can build additional offloads that leverage the same interface, he said.
There are many use cases in which a standardized DMA, such as SDXI, is valuable because it allows an application to submit a work item in the form of a descriptor: the data copy can be done while the application is free to do other work, and the application is notified once the copy is completed. Another scenario that you can execute differently with SDXI, Iyer noted, is the storage and retrieval of data, which is usually accomplished by multiple memory buffer copies that can degrade performance, even with the help of a persistent memory region in the memory architecture.
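The descriptor pattern in the first use case can be illustrated with a toy model. This is a sketch under loud assumptions: it is not the SDXI descriptor format or API, a worker thread stands in for the data-mover hardware, and every name (`CopyDescriptor`, `ToyDataMover`, `submit`) is hypothetical.

```python
import queue
import threading
from dataclasses import dataclass, field


@dataclass
class CopyDescriptor:
    """A work item: copy `length` bytes from src to dst, then signal `done`."""
    src: bytes
    dst: bytearray
    length: int
    done: threading.Event = field(default_factory=threading.Event)


class ToyDataMover:
    """Stand-in for a data mover: a thread that drains a descriptor queue."""

    def __init__(self):
        self._pending = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def submit(self, desc):
        if desc.length > len(desc.src) or desc.length > len(desc.dst):
            raise ValueError("length exceeds a buffer")
        self._pending.put(desc)  # returns immediately; the copy happens later
        return desc

    def _run(self):
        while True:
            desc = self._pending.get()
            desc.dst[:desc.length] = desc.src[:desc.length]
            desc.done.set()  # completion notification


mover = ToyDataMover()
src = b"hello, data mover"
dst = bytearray(len(src))
desc = mover.submit(CopyDescriptor(src, dst, len(src)))
# The application could do unrelated work here instead of copying bytes itself.
completed = desc.done.wait(timeout=5.0)
```

The point of the pattern is that `submit` does not block: the application hands over a description of the copy and learns about completion through a separate signal.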
A third scenario where SDXI shines, Iyer said, is when two virtual machines want to move data into each other’s address spaces, which could be optimized by an accelerator that can safely and securely read the data buffer from one guest virtual machine, spin it around, and then write that data buffer into a second virtual machine. “This is the best of both worlds.”
Despite the benefits offered by SDXI, it’s a work in progress, Iyer said. At present, SNIA’s TWG is exploring how to set up a connection between multiple address spaces before making data-movement requests, as well as different ways SDXI can work better in a CXL and heterogeneous environment. “It is architecture-independent, implementation-independent and interconnect-independent,” he said.