
While Transformer workloads are dominated by floating-point (FP) matrix multiplications, the aggressive acceleration of these operations through dedicated hardware or many-core programmable systems has shifted the performance bottleneck to non-linear functions such as Softmax. Accelerating Softmax is challenging due to its non-pointwise, non-linear nature, with exponentiation as its most demanding step.
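For reference, Softmax maps a vector $\mathbf{x} \in \mathbb{R}^n$ to

\[
\mathrm{Softmax}(\mathbf{x})_i = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}},
\]

so each element requires an exponentiation, and the normalizing sum over all elements is what makes the operation non-pointwise.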
To address this, we design a custom arithmetic block for Bfloat16 exponentiation that leverages a novel approximation algorithm based on Schraudolph's method. This block is integrated into the floating-point unit (FPU) of the RISC-V cores of a compute cluster through custom instruction set architecture (ISA) extensions, with a negligible area overhead of 1%. By optimizing the software kernels to exploit this extension, we execute Softmax with 162.7× lower latency and 74.3× lower energy than the baseline cluster, achieving an 8.2× performance improvement and 4.1× higher energy efficiency for the FlashAttention-2 kernel in a GPT-2 configuration.
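As background for the approximation, Schraudolph's method evaluates $e^x = 2^{x/\ln 2}$ by writing a scaled and biased copy of $x$ directly into the exponent and mantissa fields of a floating-point encoding, reducing exponentiation to a multiply-add and a clamp. The C sketch below illustrates this baseline idea for the bfloat16 format (1 sign, 8 exponent, 7 mantissa bits); it is a minimal illustration under our own assumptions (function names, clamping behavior, and the omission of Schraudolph's error-tuning constant), not the refined algorithm implemented in the proposed hardware block.

#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Baseline Schraudolph-style approximation of e^x, producing a bfloat16
 * bit pattern. Writing round(x * 2^7 / ln 2) + 127 * 2^7 into the
 * encoding approximates 2^(x / ln 2) = e^x, since the 8-bit exponent
 * field sits just above the 7 mantissa bits. Schraudolph's error-tuning
 * constant is omitted here for clarity. */
static inline uint16_t bf16_exp_schraudolph(float x) {
    const float a = 128.0f / logf(2.0f);  /* 2^7 / ln 2: scale into the bit fields */
    const float b = 127.0f * 128.0f;      /* exponent bias, shifted past the mantissa */
    int32_t i = (int32_t)(a * x + b);     /* integer encoding of the result */
    if (i < 0)      i = 0;                /* underflow: flush to +0 */
    if (i > 0x7F7F) i = 0x7F7F;           /* overflow: clamp to largest finite bfloat16 */
    return (uint16_t)i;
}

/* Widen a bfloat16 bit pattern to float for inspection. */
static inline float bf16_to_float(uint16_t h) {
    uint32_t u = (uint32_t)h << 16;
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

int main(void) {
    for (float x = -4.0f; x <= 4.0f; x += 2.0f)
        printf("x = %5.1f  approx = %9.4f  exact = %9.4f\n",
               x, bf16_to_float(bf16_exp_schraudolph(x)), expf(x));
    return 0;
}

Because the whole computation amounts to one fused multiply-add plus range checks, it maps naturally onto a small FPU datapath, which is consistent with the reported 1% area overhead.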
Moreover, the proposed approach enables a multi-cluster system to efficiently execute end-to-end inference of pre-trained Transformer models, such as GPT-2, GPT-3, and ViT, reducing latency and energy consumption by up to 5.8× and 3.6×, respectively, without requiring re-training and with negligible accuracy loss.