AI accelerator chip company Untether has released a new version of its imAIgine software development kit (SDK) for the company’s first-gen runAI chip. The release allows bare-metal programming for customers working on fast-moving neural network applications or high-performance computing (HPC).
“What really limits the adoption of [startups’] AI hardware accelerators is the software stack,” said Untether’s VP of product Bob Beachler in an exclusive interview with EE Times.
A high-quality SDK is essential to combining optimum prediction accuracy in the application, sufficient flexibility for the desired use cases, and developer velocity, but building one can be a huge challenge for startups with limited resources. Untether now has more engineers on its software team than on its hardware team, Beachler said.
An AI accelerator chip company’s SDK is crucial to lowering applications onto hardware efficiently. It includes a compiler, which maps layer descriptions from the machine learning framework to kernels (the actual code running on the hardware); physical allocation, which determines where the kernels go on the chip; and a runtime. The SDK also provides a toolchain that allows analysis of this process.
Open programming model
A key new feature of Untether’s SDK is an open programming model—that is, the ability for customers to write their own kernels, analogous to writing kernels in low-level CUDA code for Nvidia GPUs, including bare-metal programming.
Custom kernels are required by applications such as AI in autonomous driving, where neural network operations evolve quickly, and by HPC and scientific computing, where applications outside of AI require specialized kernels for optimum efficiency.
While Untether previously offered to write kernels on customers’ behalf, this service required access to their code. Beachler said that allowing customers to write their own kernels opens up specific sections of the market, including government and military applications where customers are unwilling to hand over their code. It also helps conserve Untether’s resources as its customer list grows.
Why not make the open programming model available from the start?
“The bottleneck is making it interpretable for someone who hasn’t lived and breathed the architecture from the very beginning,” Beachler said. “That requires a certain level of maturity of the tool flow and the compiler… it took us two years to get to the point where we feel like [the SDK] is a solid enough, stable enough, and explainable enough, albeit with a training program, so that a non-Untether person can understand it and do it.”
Untether’s at-memory compute scheme is a spatial architecture made up of memory banks, which include small RISC processors within the banks to keep memory and compute close together. It’s possible to run a single instance of each layer (for efficiency) or more than one instance of layers or sub-graphs simultaneously (for performance). Communication between kernels, however, would be different in these two scenarios. Untether now has a framework that handles kernel-to-kernel and bank-to-bank communication styles.
With the new SDK, users can now see Untether’s kernel library and modify existing kernels, or write the kernels directly from scratch (bare-metal programming). Bare-metal programmers can also perform manual kernel mapping (say which kernel connects to which, and assign them to different banks), while Untether’s framework does the physical allocation and generates files to send to the runtime. While kernel creation requires knowledge of Untether’s proprietary RISC processors inside the banks and its custom instructions, those familiar with low-level programming shouldn’t find this a challenge, Beachler said.
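To make the idea of manual kernel mapping more concrete, here is a minimal Python sketch of the kind of bookkeeping involved: recording which kernel feeds which, and which bank each kernel is assigned to. This is not Untether's API. The `KernelMapping` class, its method names, the kernel names, and the bank indices are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class KernelMapping:
    """Hypothetical container for a manual kernel-to-bank mapping."""
    edges: list = field(default_factory=list)    # (producer, consumer) pairs
    bank_of: dict = field(default_factory=dict)  # kernel name -> bank index

    def connect(self, producer, consumer):
        # Record that the producer kernel's output feeds the consumer kernel.
        self.edges.append((producer, consumer))

    def assign(self, kernel, bank):
        # Place a kernel on a given (hypothetical) bank index.
        self.bank_of[kernel] = bank

    def unassigned(self):
        # Kernels that appear in the graph but have no bank yet.
        used = {k for edge in self.edges for k in edge}
        return sorted(used - set(self.bank_of))

    def cross_bank_edges(self):
        # Connections whose endpoints sit on different banks; these are the
        # ones that would need bank-to-bank rather than local communication.
        return [(p, c) for p, c in self.edges
                if p in self.bank_of and c in self.bank_of
                and self.bank_of[p] != self.bank_of[c]]


mapping = KernelMapping()
mapping.connect("conv1", "relu1")
mapping.connect("relu1", "conv2")
mapping.assign("conv1", 0)
mapping.assign("relu1", 0)
missing_before = mapping.unassigned()
mapping.assign("conv2", 1)
crossings = mapping.cross_bank_edges()
```

In this toy example, `conv2` is flagged as unassigned until it is given a bank, and the `relu1` to `conv2` connection is the only one that crosses banks. A real tool flow would hand a description like this to the vendor's framework for physical allocation.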
“This allows [users] to really be their own boss,” he said. “They never need to talk to us. They can go ahead and make obscure layers, make obscure kernels, and be able to integrate it into the compiler so that they can go ahead and move forward.”
Aside from custom kernels, prediction accuracy is high on the list of customer demands, Beachler added. Quantizing to runAI’s INT8 or INT16 formats while maintaining accuracy is something Untether is focusing on; the latest version of the company’s SDK can handle post-quantization retraining, if required. This can include classic retraining, or a technique called knowledge distillation (which involves a student-teacher relationship between the original and the quantized model).
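As a rough illustration of the two ideas above, the following Python sketch shows generic symmetric INT8 quantization and a temperature-softened distillation loss in which the original model acts as teacher and the quantized model as student. It is not Untether's implementation. The function names, the per-tensor scaling scheme, and the temperature value are assumptions made for the example.

```python
import math


def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float values from integer codes."""
    return [c * scale for c in codes]


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence from the softened teacher to the softened student."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


weights = [0.5, -1.27, 0.003, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Rounding error per value is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

The small weight `0.003` rounds to zero here, which is the kind of precision loss that retraining or distillation is meant to compensate for. During distillation, the loss is zero when the quantized model's outputs match the original's and grows as they diverge.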
Untether’s poster session at NeurIPS was also about quantization, specifically about quantizing transformer networks to INT8. Transformers present particular problems for quantization because their iterative nature means errors accumulate and propagate. Natural language processing inference applications are therefore extremely sensitive to accuracy. Combining Untether’s quantization methods with a new proprietary technique where activation functions are implemented via a lookup table can help ensure accuracy in these types of models, Beachler said, adding that this function also relies on good kernel design.
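Untether's lookup-table technique is proprietary and its details are not described here, but the general idea of a table-driven activation can be sketched. With INT8 inputs there are only 256 possible input codes, so an activation function can be precomputed once for every code and then applied by indexing. The Python sketch below assumes GELU as the activation and a scale of 0.05 for both inputs and outputs; the function names and these values are illustrative assumptions, not Untether's method.

```python
import math


def gelu(x):
    """Exact GELU using the Gaussian error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def build_lut(fn, in_scale, out_scale):
    """Precompute fn for all 256 INT8 input codes; entries are INT8 output codes."""
    lut = []
    for code in range(-128, 128):
        y = fn(code * in_scale)       # evaluate in floating point once, offline
        q = round(y / out_scale)      # requantize the result
        lut.append(max(-128, min(127, q)))
    return lut


def apply_lut(lut, codes):
    """Apply the activation by table lookup; index 0 corresponds to code -128."""
    return [lut[c + 128] for c in codes]


IN_SCALE = 0.05   # assumed input quantization step
OUT_SCALE = 0.05  # assumed output quantization step

gelu_lut = build_lut(gelu, IN_SCALE, OUT_SCALE)
outputs = apply_lut(gelu_lut, [0, 100])
```

Because every table entry is computed at full precision and rounded once, the error per activation stays within half an output step, rather than compounding through an approximated function evaluated in low precision.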
The ability to write custom kernels will carry over to Untether’s second-gen chip, speedAI, when it becomes available in the second half of 2023.
“The only difference between runAI and speedAI in the SDK tool flow is the low-level kernel code, which is slightly different,” Beachler said. “It is recompiled for the RISC-V ISA on speedAI and optimized for speedAI’s dual RISC-V memory banks.”
While runAI kernels will need to be recompiled to work on speedAI, designers’ knowledge of kernel development for runAI will carry over to speedAI without any problems, he said.