Transformer Hardware-Software Co-Design for NeuroSoC


Hardware and Software Optimizations for Transformer Algorithm Performance on NeuroSoC:

In this project, we implemented hardware and software optimizations tailored to transformer workloads on our custom-designed NeuroSoC. We focused on the underlying computational architecture to achieve high efficiency and fast processing: redesigning hardware components to better match the computational demands of transformer models, and refining the software algorithms to exploit these hardware enhancements effectively.

Development and Integration of Fast, Precise Floating-Point Exponential Calculations:

We developed a precise and efficient floating-point exponential routine based on an improved Schraudolph expf algorithm, and integrated it into the NeuroSoC through exponential instruction extensions to the RISC-V ISA. The integration targets a Snitch-based SoC architecture, whose lightweight, highly configurable cores make it well suited to specialized mathematical functions that demand both precision and speed.

Acceleration of Software Computations Using SIMD Parallelism and Stream Semantic Registers (SSRs):

To further boost the processing speed and throughput of transformer models on the NeuroSoC, we accelerated software computations with Single Instruction, Multiple Data (SIMD) parallelism, combined with Snitch's Stream Semantic Registers (SSRs), which stream operands to the FPU without explicit load instructions in the loop body. Together these let us process multiple data elements per instruction, significantly reducing the time spent in large-scale matrix operations and the other compute-intensive kernels of transformer models, and raising overall efficiency by executing more operations in parallel.

These enhancements collectively contribute to a more robust and faster processing platform, ideal for deploying advanced machine learning models and algorithms that require intensive computation and high throughput.

Run Wang
Master in Information Technology and Electrical Engineering

With extensive prior experience in algorithm design, now working in digital IC architecture design.
