为实现卷积神经网络数据的高度并行传输与计算,生成高效的硬件加速器设计方案,提出了一种基于数据对齐并行处理、多卷积核并行计算的硬件架构设计和探索方法.该方法首先根据输入图像尺寸对数据进行对齐预处理,实现数据层面的高度并行传输与计算,以提高加速器的数据传输和计算速度,并适应多种尺寸的输入图像;采用多卷积核并行计算方法,使不同的卷积核可同时对输入图片进行卷积,以实现卷积核层面的并行计算;基于该方法建立硬件资源与性能的数学模型,通过数值求解,获得性能与资源协同优化的高效卷积神经网络硬件架构方案.实验结果表明:所提出的方法,在Xilinx Zynq XC7Z045上实现的基于16位定点数的SSD网络(single shot multibox detector network)模型在175 MHz的时钟频率下,吞吐量可以达到44.59帧/s,整板功耗为9.72 W,能效为31.54 GOP/(s·W);与实现同一网络的中央处理器(CPU)和图形处理器(GPU)相比,功耗分别降低85.1%与93.9%;与现有的其他卷积神经网络硬件加速器设计相比,能效提升20%~60%,更适用于低功耗嵌入式应用场合.
To achieve highly parallel data transmission and computation of convolutional neural network acceleration and generate efficient hardware accelerator design, a hardware design and exploration method based on data-alignment and multi-filter parallel computing was proposed. In order to improve the data transmission and computation speed and adapt to various input image sizes, the method first aligned the data according to the input image size to achieve highly parallel transmission and computation at the data level. The method also used the multi-filter parallel computing method so that different filters can simultaneously convolve the input image to achieve parallel computing at the filters level. Based on this method, mathematical models of hardware resources and performance were formulated and numerically solved to obtain the performance and resource co-optimized neural network hardware architecture. The proposed design method was applied to the single shot multibox detector (SSD) network, and results show that the accelerator on Xilinx Zynq XC7Z045 at 175 MHz clock frequency could achieve the throughput of 44.59 FPS, power consumption of 9.72 W, and power efficiency of 31.54 GOP/(s·W). The accelerator consumed 85.1% and 93.9% less power than the central processing unit (CPU) and graphics processing unit (GPU) implementations respectively. Compared with the exiting designs, the power efficiency of the proposed design increased 20%~60%. Therefore, the design method is more suitable for embedded applications with low power requirements.