LUTAccel: Look-up-Table based Vector Systolic Accelerator on FPGAs

Aashish Kumar Tiwary1, Saketh Gajawada2, Jay Shah3, Nanditha Rao4
1,2,3International Institute of Information Technology, Bangalore, 4IBM


Abstract

FPGA-based accelerators have been widely used for edge inference due to their unique advantage of reconfigurability. Convolutional neural networks (CNNs) and Transformers require compute-intensive convolutions or general matrix multiplications (GEMM) as well as expensive memory accesses. Typical ways to address these challenges are quantization, pruning, and low bit-width computation. In this work, we propose LUTAccel, a unique integration of LUT-based computation into a Vector Systolic Accelerator (VSA). This architecture aims to increase throughput through vectorization and resource reuse by using look-up tables (LUTs). These LUTs, along with block memories (BRAMs) on the FPGA, replace all convolution and multiplication computations, thereby reducing compute resource usage. We propose a LUT-based design (LBD) and a BRAM-based design (BBD) to implement the neural network (NN) layers on a Xilinx ZCU104 FPGA. Our implementation achieves a peak throughput of 1407 GOp/s at the optimal lane widths of vector-4 and vector-6, which is 2.5x that of the baseline design. We achieve an average reduction in LUT usage of 39% for LBD and 61% for BBD. Our approach consumes nearly 47% less power and is 4.6x more power efficient than the baseline design.