We present a hardware-software-firmware system for neuronal network emulation (up to 160M neurons and 40B synapses at INT32 resolution) hosted through the San Diego Supercomputer Center (SDSC). Each server in our system comprises 40 high-performance-computing-grade boards, each equipped with 8 GB of high-bandwidth memory (HBM) for synaptic storage and a Xilinx Virtex UltraScale+ FPGA for neuronal computation. Each FPGA has a PCIe interface to the host processor, which serves both as the network I/O interface and as the channel for configuring the hardware to run a given workload.
Our custom network compiler uses novel partitioning techniques to map event-driven AI workloads onto the hardware resources in a compute-balanced and memory-efficient way, enabling high-throughput execution. The compiler also handles the placement of neurons onto the physical cores and the optimal placement of synaptic weights in HBM. Handling these decisions in the compiler leads to a simplified, yet efficient and reconfigurable, hardware architecture.
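To make the idea of compute-balanced, capacity-aware placement concrete, the sketch below assigns neurons to cores with a standard greedy "heaviest first, least-loaded core" heuristic. This is not the partitioning technique of our compiler, which is not detailed here; the function name `partition_neurons`, its parameters, and the use of fan-in as the load measure are all illustrative assumptions.

```python
def partition_neurons(fan_in, num_cores, max_neurons_per_core, max_synapses_per_core):
    """Assign each neuron to a core, balancing total synapse load.

    fan_in[n] is the number of incoming synapses of neuron n; it is used
    here as a proxy (an assumption) for both compute and memory cost.
    Returns (assignment, loads): neuron -> core, and synapses per core.
    """
    loads = [0] * num_cores    # synapses assigned to each core
    counts = [0] * num_cores   # neurons assigned to each core
    assignment = {}

    # Place the heaviest neurons first so that light ones can even out the load.
    order = sorted(range(len(fan_in)), key=lambda n: fan_in[n], reverse=True)
    for n in order:
        # Only cores with a free neuron slot and enough synapse capacity qualify.
        candidates = [
            c for c in range(num_cores)
            if counts[c] < max_neurons_per_core
            and loads[c] + fan_in[n] <= max_synapses_per_core
        ]
        if not candidates:
            raise ValueError(f"neuron {n} does not fit on any core")
        # Choose the least-loaded qualifying core; ties go to the lowest index.
        core = min(candidates, key=lambda c: (loads[c], c))
        assignment[n] = core
        loads[core] += fan_in[n]
        counts[core] += 1
    return assignment, loads


if __name__ == "__main__":
    assignment, loads = partition_neurons(
        fan_in=[8, 7, 6, 5, 4, 3, 2, 1],
        num_cores=2,
        max_neurons_per_core=4,
        max_synapses_per_core=100,
    )
    print(assignment, loads)
```

A greedy heuristic of this kind is simple and deterministic, but it ignores communication cost between cores, which a production partitioner would presumably also weigh.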
The microarchitecture inside each FPGA consists of 32 cores, each supporting 128k neurons and 4M synapses. Neuron membrane potentials and spike events are stored locally in on-chip SRAM. Each core consists of several submodules (axon processor, neuron processor, synapse processor) and additional control elements that together ensure a massively parallel and deeply pipelined implementation of spike-based computing.
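As a quick sanity check on the scale figures, the arithmetic below derives the per-server neuron count from the per-core figure and gives an upper bound on how many INT32 weights fit in HBM. It assumes binary prefixes (128k = 128 x 1024, 8 GB = 8 x 2^30 bytes) and that every HBM byte holds weight data, ignoring any pointers or metadata; the constant names are illustrative.

```python
# Figures taken from the text; the prefix interpretation is an assumption.
CORES_PER_FPGA = 32
NEURONS_PER_CORE = 128 * 1024        # assumes "128k" means 128 * 1024
BOARDS_PER_SERVER = 40
HBM_BYTES_PER_BOARD = 8 * 1024**3    # assumes "8 GB" means 8 * 2**30 bytes
BYTES_PER_SYNAPSE = 4                # INT32 weight

neurons_per_fpga = CORES_PER_FPGA * NEURONS_PER_CORE
neurons_per_server = neurons_per_fpga * BOARDS_PER_SERVER

# Upper bound only: assumes HBM stores nothing but 4-byte weights.
weight_slots_per_board = HBM_BYTES_PER_BOARD // BYTES_PER_SYNAPSE
weight_slots_per_server = weight_slots_per_board * BOARDS_PER_SERVER

print(f"neurons per FPGA:        {neurons_per_fpga:,}")
print(f"neurons per server:      {neurons_per_server:,}")
print(f"weight slots per board:  {weight_slots_per_board:,}")
print(f"weight slots per server: {weight_slots_per_server:,}")
```

Under these assumptions the neuron total comes to roughly 168 million, in line with the rounded 160M figure, and the HBM bound is large enough to hold 40B INT32 weights.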
All cores inside the FPGAs exchange spike events with one another through their own address-event-routing (AER) router interfaces, which connect to network-on-chip (NoC) grids consisting of multicast high-performance buses (mAHBs). These mAHBs provide very low-latency spike transmission to post-synaptic destinations in neighboring cores. The peripheral logic in the mAHBs handles shared bus access between the cores, along with handshaking between them, to ensure the timed arrival of spikes and prevent overflow of outgoing events. We have demonstrated a spike throughput of 420 Mevents/s per 128k-neuron core using this NoC architecture.
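For readers unfamiliar with address-event representation, the sketch below shows how a spike can be carried as a single address word and fanned out to several cores. The bit layout (5 bits of core ID, 17 bits of neuron ID) and the destination bitmask are hypothetical, chosen only because 32 cores and 128k = 2^17 neurons per core fit in them; they do not describe the actual mAHB packet format.

```python
CORE_BITS = 5      # 32 cores per FPGA fit in 5 bits
NEURON_BITS = 17   # assumes 128k = 2**17 neurons per core


def encode_event(core_id, neuron_id):
    """Pack a spike's source (core, neuron) into one address word."""
    if not 0 <= core_id < (1 << CORE_BITS):
        raise ValueError("core_id out of range")
    if not 0 <= neuron_id < (1 << NEURON_BITS):
        raise ValueError("neuron_id out of range")
    return (core_id << NEURON_BITS) | neuron_id


def decode_event(word):
    """Recover (core_id, neuron_id) from an address word."""
    return word >> NEURON_BITS, word & ((1 << NEURON_BITS) - 1)


def multicast_targets(dest_mask):
    """List the cores selected by a destination bitmask (bit c set = core c)."""
    return [c for c in range(1 << CORE_BITS) if (dest_mask >> c) & 1]


if __name__ == "__main__":
    word = encode_event(core_id=3, neuron_id=100)
    print(word, decode_event(word))
    # One event delivered to cores 1, 2 and 4 in a single multicast.
    print(multicast_targets(0b10110))
```

Because only the address of the spiking neuron travels on the bus, the cost of a spike is one short word per event, which is what makes a multicast bus attractive for sparse, event-driven traffic.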
We also demonstrate our system running computer vision tasks on spiking datasets at very low latency and energy consumption.