How Efinix is Conquering the Hurdle of Hardware Acceleration for Devices at the Edge  

In previous articles, we established the many ways FPGAs surpass other AI chipsets for running machine learning algorithms at the edge in terms of reconfigurability, power consumption, size, speed, and cost. We also showed how the microarchitecture-agnostic RISC-V instruction set architecture (ISA) pairs seamlessly with the architectural flexibility of the FPGA. However, the apparent lack of mid-range, cost-effective FPGAs and their less-than-straightforward design flow remain a major bottleneck: the skills required for a fully custom hardware description language (HDL) implementation are difficult to find and often come with a steep learning curve.

Efinix fills the gap with FPGAs built on the innovative Quantum compute fabric, which is made up of reconfigurable tiles known as exchangeable logic and routing (XLR) cells that function as either logic or routing, rethinking the traditional fixed ratio of logic elements (LEs) to routing resources. This allows for a high-density fabric in a small device package where no part of the FPGA is underutilized. The potential of this platform transcends the typical barriers facing edge-based devices today: power consumption, latency, cost, size, and ease of development.

Possibly the most striking feature of Efinix FPGAs is the ecosystem and tool flow surrounding them, which lower development barriers and allow designers to implement AI at the edge using the same silicon from prototype to production. Efinix has embraced RISC-V, allowing users to create applications and algorithms in software and to capitalize on the ease of programmability of this ISA without being bound to proprietary IP cores such as Arm. Since this is all done with flexible FPGA fabric, users can then accelerate selected functions in hardware. Efinix supports both low-level and more complex custom instruction acceleration, with techniques that include the TinyML accelerator and predefined hardware accelerator sockets. These approaches deliver hardware performance while retaining a software-defined model that can be iterated and refined without the need to learn VHDL. The result is fast inference on edge devices at low power and within a small footprint. This article discusses how the Efinix platform simplifies the design and development cycle, allowing users to take advantage of the flexible FPGA fabric for a scalable embedded processing solution.

Barriers at the edge: a dam blocking progress

From massive wireless sensor networks to streaming a high-resolution 360° immersive AR or VR experience, most of the world’s data lies at the edge. Disaggregating the compute burden from the cloud and bringing it closer to the devices opens doors for next-generation, bandwidth-hungry, ultra-low-latency applications in autonomous driving, immersive digital experiences, autonomous industrial facilities, telesurgery, and so on. The use cases are endless once the enormous roadblock of transmitting data to and from the cloud is sidestepped.

However, the defining requirements of edge compute, low latency and power-hungry processing, are the very factors that pose a significant design challenge for these small, prolific, power-limited devices. How then is it possible to design a device capable of running the relevant power-hungry ML algorithms without investing in elaborate technologies? The solution has been to implement whatever hardware is sufficient to run the required applications and algorithms (e.g., CPU, GPU, ASIC, FPGA, ASSP) while accelerating the more compute-intensive tasks to balance compute time (latency) against resources used (power consumed).

As with any innovation, the landscape of deep learning is continually shifting with updating models and optimization techniques, necessitating the use of more agile hardware platforms that can change almost as rapidly as the programs running on them with little to no risk. The parallel processing and flexibility/reconfigurability of FPGAs seem to line up seamlessly with this need. However, making these devices available for mainstream, high-volume applications requires lowering the design barriers for configuring and accelerating the FPGA fabric — a time-consuming process that normally requires a high degree of expertise. Furthermore, traditional accelerators are typically not granular enough and incorporate large pieces of a model that typically do not scale well. They also generally consume far too much power and are, more often than not, proprietary — causing engineers to relearn how to use the vendor-specific platform. 

The Sapphire RISC-V core 

Creating an application on the RISC-V Core in C/C++ 

Efinix addresses these obstacles by making FPGAs available to the AI/ML community in an intuitive way. The RISC-V Sapphire core is fully user configurable through the Efinity GUI, so users do not have to know the VHDL behind implementing the RISC-V in the FPGA and can exploit the straightforward programmability of common software languages (e.g., C/C++). This allows teams to rapidly develop applications and algorithms in software. All the required peripherals and buses can be specified, configured, and instantiated alongside the Sapphire core to deliver a fully configured SoC (Figure 1). This RISC-V capability includes multi-core support (up to four cores) and Linux capability, delivering a high-performance processor cluster to a designer’s FPGA application as well as the ability to run applications directly on the Linux operating system. The next step, hardware acceleration, is greatly simplified with hardware-software partitioning: once designers have perfected their algorithm in software, they can progressively accelerate it within the flexible Efinix FPGA fabric. Before moving on to hardware acceleration, however, it is important to understand the inherent benefits of the RISC-V architecture and how they can be exploited within the FPGA fabric.

Figure 1: The Efinity GUI enables designers to configure their Sapphire RISC-V core (left) along with all the required peripherals and buses, producing a fully configured SoC that is programmed in familiar languages. This capability extends to up to four RISC-V cores.

Custom-instruction-capable RISC-V 

The RISC-V architecture is distinctive in that it does not define all of its instruction encodings; instead, part of the opcode space is left open for the designer to define and implement. In other words, a custom arithmetic logic unit (ALU) can be created that performs an arbitrary function when invoked by the custom instruction (Figure 2). These custom instructions have the same format as the rest of the instructions (e.g., two registers in, one register out), giving a total of eight bytes of input data to work with and four bytes that can be passed back to the RISC-V.

However, since the ALU is built within the FPGA, it can both access and pull data from the FPGA. This allows users to expand beyond the eight bytes of data and make the ALU arbitrarily complex — giving access to data that was put out on the FPGA previously (e.g., access to data from sensors). The ability to have an arbitrarily complex ALU is a multiplying factor for speed when it comes to hardware acceleration. Efinix has taken this ability of the custom instruction and adapted it for the AI and ML communities with the TinyML platform. 

Figure 2: Custom ALUs can be created with the RISC-V where the standard configuration includes two source registers (rs1 and rs2) that are four bytes wide and one destination register (rd) that is four bytes wide.
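
To make the two-in, one-out format concrete, the following C sketch wraps a hypothetical custom instruction. It assumes a custom ALU that treats each 32-bit source register as four packed signed bytes and returns their dot product. That operation, the funct3/funct7 values, the USE_CUSTOM_ALU macro, and all function names are illustrative assumptions, not the Efinix encoding; only the custom-0 opcode (0x0B) and the GNU assembler `.insn r` directive come from the RISC-V specification and toolchain. On a host, or when the macro is not defined, a software model stands in for the hardware.

```c
#include <stdint.h>

/* Software model of the hypothetical ALU: four signed 8-bit lanes per
 * register, lane-wise multiply, 32-bit sum. Also serves as a reference. */
int32_t dot4_model(uint32_t rs1, uint32_t rs2)
{
    int32_t acc = 0;
    for (int lane = 0; lane < 4; ++lane) {
        int8_t a = (int8_t)(rs1 >> (8 * lane));
        int8_t b = (int8_t)(rs2 >> (8 * lane));
        acc += (int32_t)a * (int32_t)b;
    }
    return acc;
}

/* Two registers in (eight bytes), one register out (four bytes). */
int32_t dot4(uint32_t rs1, uint32_t rs2)
{
#if defined(__riscv) && defined(USE_CUSTOM_ALU) /* hypothetical build flag */
    int32_t rd;
    /* .insn r opcode, funct3, funct7, rd, rs1, rs2
     * 0x0B is the custom-0 major opcode; funct3/funct7 = 0 are assumed. */
    __asm__ volatile(".insn r 0x0B, 0x0, 0x00, %0, %1, %2"
                     : "=r"(rd)
                     : "r"(rs1), "r"(rs2));
    return rd;
#else
    return dot4_model(rs1, rs2);
#endif
}

/* Pack four signed bytes into one register image, lane 0 in the low byte. */
uint32_t pack4(int8_t b0, int8_t b1, int8_t b2, int8_t b3)
{
    return (uint32_t)(uint8_t)b0 | ((uint32_t)(uint8_t)b1 << 8) |
           ((uint32_t)(uint8_t)b2 << 16) | ((uint32_t)(uint8_t)b3 << 24);
}
```

Keeping a software model next to the instruction wrapper lets the same application code be checked on a workstation before the ALU exists in the fabric.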


The TinyML platform — a library of custom instructions 

Hardware acceleration with the TinyML platform 

The TinyML platform streamlines the process of hardware acceleration where Efinix has taken the compute primitives used in TensorFlow Lite models and created custom instructions to optimize their execution on accelerators in the FPGA fabric (Figure 3). Through this, the standard software-defined models of TensorFlow are absorbed into the RISC-V complex and are accelerated to run at hardware speed, taking advantage of the rich, open-source TensorFlow Lite community. The entire development flow has been streamlined using the popular Ashling tool flow to make setup, application creation, and debugging a simple and intuitive process.

Figure 3: TensorFlow Lite creates a quantized version of standard TensorFlow models and uses a library of functions to allow these models to run on MCUs at the edge. Efinix TinyML takes these TensorFlow Lite models, and using the custom instruction capabilities of the RISC-V core, accelerates them in the FPGA hardware.
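
As a rough illustration of what "quantized" means here, the sketch below applies the affine int8 mapping used by TensorFlow Lite, real = scale * (q - zero_point), and shows an integer-only dot product of the kind a convolution reduces to. The struct and function names are assumptions made for illustration and are not part of the TensorFlow Lite or Efinix APIs.

```c
#include <stddef.h>
#include <stdint.h>

/* Affine quantization parameters: real = scale * (q - zero_point). */
typedef struct {
    float   scale;
    int32_t zero_point;
} QuantParams;

/* Quantize a real value to int8 with round-to-nearest and saturation. */
int8_t quantize(float real, QuantParams p)
{
    float scaled = real / p.scale;
    int32_t q = (int32_t)(scaled >= 0.0f ? scaled + 0.5f : scaled - 0.5f)
                + p.zero_point;
    if (q > 127)  q = 127;
    if (q < -128) q = -128;
    return (int8_t)q;
}

/* Map an int8 value back to its approximate real value. */
float dequantize(int8_t q, QuantParams p)
{
    return p.scale * (float)((int32_t)q - p.zero_point);
}

/* Integer-only dot product: remove zero points, accumulate in int32.
 * The real-valued result is acc * (pa.scale * pb.scale). */
int32_t quant_dot(const int8_t *a, QuantParams pa,
                  const int8_t *b, QuantParams pb, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; ++i)
        acc += ((int32_t)a[i] - pa.zero_point) *
               ((int32_t)b[i] - pb.zero_point);
    return acc;
}
```

Because the inner loop uses only small integers, it is the kind of primitive that maps naturally onto a custom instruction.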

Many of the TinyML platform’s libraries of custom instructions are available to the open-source community on the Efinix GitHub, which provides free access to the Efinix Sapphire core and everything needed to design and develop highly accelerated edge AI applications.

Acceleration strategies: an overview

The combination of the RISC-V core, the Efinix FPGA fabric, and the rich, open-source TensorFlow community allows for creative acceleration strategies that can be broken down into several steps (Figure 4): 

  • Step 1: Run the TensorFlow Lite model using the Efinity RISC-V IDE, 
  • Step 2: Use the TinyML accelerator, 
  • Step 3: User-defined custom instruction accelerator, 
  • Step 4: Hardware accelerator templates. 

As stated earlier, “Step 1” is a standard process through the Efinity GUI where users can take a TensorFlow Lite model and run it in software on the RISC-V using the same familiar process one would use with a standard MCU, without having to worry about VHDL. After Step 1, designers will, more often than not, find that the performance of the algorithm they are running is not optimal and therefore requires acceleration. “Step 2” involves hardware-software partitioning, where users can take the fundamental building blocks inside the TensorFlow Lite models and literally click and drag to instantiate custom instructions, yielding a large speedup in the way the model runs on the Sapphire RISC-V core.

Figure 4: Efinix acceleration strategies.
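
One way to picture this progressive hardware-software partitioning in code is a table of operator kernels that all start in software and are swapped for accelerated versions one at a time. The C sketch below is only a conceptual model: the table, the operator names, and the "accelerated" stub (which merely counts calls and defers to software) are hypothetical and do not reflect how the Efinix tools generate or bind custom instructions. The ADD kernel is also simplified and omits the rescaling step that a real quantized ADD performs.

```c
#include <stddef.h>
#include <stdint.h>

/* Shared signature for element-wise int8 kernels. */
typedef void (*KernelFn)(const int8_t *a, const int8_t *b,
                         int8_t *out, size_t n);

typedef enum { OP_ADD, OP_MAXIMUM, OP_COUNT } OpKind;

/* Clamp a 32-bit intermediate to the int8 range. */
int8_t saturate_i8(int32_t v)
{
    if (v > 127)  return 127;
    if (v < -128) return -128;
    return (int8_t)v;
}

/* Software reference kernels. */
void add_sw(const int8_t *a, const int8_t *b, int8_t *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = saturate_i8((int32_t)a[i] + (int32_t)b[i]);
}

void maximum_sw(const int8_t *a, const int8_t *b, int8_t *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] > b[i] ? a[i] : b[i];
}

/* Stand-in for an accelerated ADD. On a target it would issue custom
 * instructions; here it only counts calls and defers to software. */
unsigned accel_calls = 0;

void add_accel_stub(const int8_t *a, const int8_t *b, int8_t *out, size_t n)
{
    ++accel_calls;
    add_sw(a, b, out, n);
}

/* Step 1: every operator runs in software. */
KernelFn kernel_table[OP_COUNT] = { add_sw, maximum_sw };

/* Later steps: move one operator at a time to an accelerated kernel. */
void install_kernel(OpKind op, KernelFn fn)
{
    if (fn != NULL && op < OP_COUNT)
        kernel_table[op] = fn;
}

void run_op(OpKind op, const int8_t *a, const int8_t *b,
            int8_t *out, size_t n)
{
    kernel_table[op](a, b, out, n);
}
```

The model code that calls `run_op` does not change as kernels migrate to hardware, which is the property that lets the algorithm stay software-defined.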

User-defined custom instruction accelerator 

“Step 3” leaves it open for designers to create their own custom instructions without leveraging the templates found in the TinyML platform, allowing users to innovate and create acceleration on top of the RISC-V core. 

Hardware accelerator templates 

Finally, once the required fundamental elements have been accelerated on the RISC-V, “Step 4” involves embedding them inside the free Efinix SoC framework with “sockets” of acceleration. The quantum accelerator socket allows users to “point at” data, retrieve it, and edit its contents to, say, perform a convolution on bigger blocks of data.

The Sapphire SoC can be used to perform overall system control and execute algorithms that are inherently sequential or require flexibility. As stated earlier, the hardware-software codesign allows users to choose whether to perform this compute in the RISC-V processor or in hardware. In this acceleration methodology, the pre-defined hardware accelerator socket is connected to a direct memory access (DMA) controller and an SoC slave interface for data transfer and CPU control, which may be used for pre-processing/post-processing before or after the AI inference. The DMA controller facilitates communication between the external memory and other building blocks in the design by (Figure 5): 

  • Storing frames of data into the external memory, 
  • Sending and receiving data to/from the hardware acceleration block, 
  • Sending data to the post-processing engine. 

In an image-signal-processing application, this can look like leaving the RISC-V processor to execute the RGB to grayscale conversion as embedded software, while the hardware accelerator performs Sobel edge detection, binary erosion, and binary dilation in the pipelined, streaming architecture of the FPGA (see “Edge Vision SoC User Guide”). This can be scaled up for multi-camera vision systems, allowing companies to turn their designs into a product and deploy them extremely rapidly. 

Figure 5: Sample edge vision SoC framework block diagram.
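
The image-processing split described above can be sketched in C. The first function is the kind of RGB to grayscale conversion that could stay in embedded software; the second is a plain software reference for 3x3 Sobel edge detection, useful for checking the output of a hardware block. Both are generic textbook formulations written for illustration, with assumed function names, integer luma weights, and border handling; they are not taken from the Edge Vision SoC sources.

```c
#include <stdint.h>
#include <stdlib.h>

/* RGB888 to 8-bit luma using integer weights (77, 150, 29) / 256,
 * which approximate 0.299 R + 0.587 G + 0.114 B. */
uint8_t rgb_to_gray(uint8_t r, uint8_t g, uint8_t b)
{
    return (uint8_t)((77u * r + 150u * g + 29u * b) >> 8);
}

/* Reference 3x3 Sobel: edge strength is |Gx| + |Gy|, clamped to 255.
 * Border pixels are written as 0 (an assumed convention). */
void sobel_ref(const uint8_t *gray, uint8_t *edges, int width, int height)
{
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            if (x == 0 || y == 0 || x == width - 1 || y == height - 1) {
                edges[y * width + x] = 0;
                continue;
            }
            const uint8_t *p = gray + y * width + x;
            int gx = -p[-width - 1] + p[-width + 1]
                     - 2 * p[-1]    + 2 * p[1]
                     - p[width - 1] + p[width + 1];
            int gy = -p[-width - 1] - 2 * p[-width] - p[-width + 1]
                     + p[width - 1] + 2 * p[width]  + p[width + 1];
            int mag = abs(gx) + abs(gy);
            edges[y * width + x] = (uint8_t)(mag > 255 ? 255 : mag);
        }
    }
}
```

The grayscale step is one multiply-accumulate per channel, so it is cheap enough for the processor, while the Sobel window touches nine pixels per output and suits a pipelined, streaming implementation.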

MediaPipe Face Mesh use case 

The simplicity of this process might be better highlighted with an example. The MediaPipe Face Mesh ML model estimates hundreds of different three-dimensional facial landmarks in real time. Efinix took this model and deployed it on the Titanium Ti60 development kit running at 300 MHz. As shown in Figure 6, convolutions on the RISC-V core contributed the most to latency. It is worth noting that the FPGA’s resource utilization of close to 60% does not reflect the size of the ML model. Instead, it arises because the entire camera subsystem has been instantiated in the FPGA in order to perform acceleration benchmarking in real time.

Figure 6: MediaPipe Face Mesh pre-trained network running on the Ti60 development kit showing both latency and resources used.

Simple custom instructions with the TinyML platform (Step 2) 

Creating and running a simple, custom two-registers-in, one-register-out convolution instruction shows a four- to five-fold improvement in latency. This improvement continues as custom instructions are used to accelerate the ADD, MAXIMUM, and MUL functions. However, latency improvements hit a plateau, since the RISC-V spends progressively less time on these operations and the remaining unaccelerated work begins to dominate (Figure 7).

Figure 7: Obvious latency improvements with simple, custom instructions created for CONV, ADD, MAXIMUM, and MUL functions.
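
The plateau is what Amdahl's law predicts: once a kernel has been accelerated, the untouched remainder of the run time limits any further gain. The C sketch below computes that limit. The function names and the example figures in the note that follows are hypothetical and are not measurements from the Ti60 demonstration.

```c
/* Amdahl's law: overall speedup when a fraction `accel_fraction` of the
 * baseline run time is sped up by `kernel_speedup` and the rest is not. */
double overall_speedup(double accel_fraction, double kernel_speedup)
{
    return 1.0 / ((1.0 - accel_fraction) + accel_fraction / kernel_speedup);
}

/* Resulting latency for a given baseline latency in milliseconds. */
double accelerated_latency_ms(double baseline_ms, double accel_fraction,
                              double kernel_speedup)
{
    return baseline_ms / overall_speedup(accel_fraction, kernel_speedup);
}
```

For instance, if convolutions hypothetically accounted for 80 percent of run time and were made four times faster, the overall speedup would be 2.5x, and no convolution speedup, however large, could push it past 5x.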

Complex instructions with DMA (Step 4) 

An arbitrarily complex ALU is also generated to replace the original CONV. This changes the slope of the original curve and dramatically improves the latency once more. However, FPGA utilization also jumps, since the complex instruction takes more resources inside the FPGA. Once again, the resource bar standing at nearly 100% is simply due to the fact that the FPGA here contains the entire camera subsystem for demonstration purposes. What is important to note is the relative decrease in latency and increase in utilization (Figure 8).

What’s more, a larger FPGA, such as the Ti180, could run all of these complex instructions for massive acceleration without using even 50 percent of the FPGA resources available. These tradeoffs are precisely what allow engineers to readily visualize the balancing act between latency, power consumption, and the cost/size of the FPGA. An edge application that has stringent latency requirements but more lenient power constraints could opt to increasingly accelerate the design for a drastic performance improvement. In power-constrained applications, this increase in performance can be traded off by reducing the clock speed, giving a more moderate improvement in performance at a dramatically lower power.

Figure 8: A larger custom convolution instruction is employed for much more acceleration; however, the resources consumed jump up. Note that the near-full utilization of the FPGA is simply due to the fact that the FPGA contains the entire camera subsystem; this would be dramatically lower if the Ti60 were running only the ML model.
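
The trade between performance and power can be put into rough numbers. The C sketch below assumes a fixed cycle count per inference and a simplified model in which dynamic power scales linearly with clock frequency at constant voltage; static power and voltage scaling are ignored. The function names and the figures in the note that follows are illustrative assumptions.

```c
/* Lowest clock frequency (Hz) at which `cycles` clock cycles complete
 * within `budget_s` seconds. */
double min_clock_hz(double cycles, double budget_s)
{
    return cycles / budget_s;
}

/* Dynamic power relative to a baseline clock, assuming constant voltage
 * and switched capacitance so that power is proportional to frequency. */
double relative_dynamic_power(double clock_hz, double base_clock_hz)
{
    return clock_hz / base_clock_hz;
}
```

If an accelerated inference hypothetically took 3 million cycles and the latency budget were 20 ms, a 150 MHz clock would suffice; against a 300 MHz baseline, this simple model gives about half the dynamic power.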

A paradigm shift in AI/ML development 

In a nutshell, Efinix has combined the familiar development environment of the RISC-V ISA with its custom instruction capability and made both work within the architecturally flexible FPGA fabric. Unlike many hardware accelerators, this approach does not require any third-party tools or compilers. The acceleration is also fine-grained, operating at the level of individual machine instructions, a level of granularity that only makes sense with an FPGA.

The fact that edge devices can be prototyped and deployed on the innovative design architecture of the Efinix FPGA means the solution is future-proofed. New models and updated network architectures can be expressed in familiar software environments and accelerated at the custom instruction level with only a small amount of VHDL (with libraries of available templates to use for guidance). This degree of hardware-software partitioning where 90 percent of the model remains in software running on the RISC-V allows for an extremely fast time to market. The combination of all of these approaches yields an elegant solution that truly lowers the barriers to entry for implementing an edge device. Designers now have access to a world-class embedded processing capability that can be accessed with a state-of-the-art tool flow and instantiated on the revolutionary Efinix Quantum fabric. 

