Modeling Optimal Tile Size for Enhancing Data Reuse in Convolutional Neural Networks
S. Seydi
M. Salehi (University of Tehran)
Keywords: Convolutional neural networks, energy consumption, off-chip memory, data reuse, tiling.
Abstract:
Artificial neural networks are a class of computational models inspired by the structure and functionality of biological neural networks in the human brain. Convolutional Neural Networks (CNNs), as a prominent type of these models, are widely applied in various domains such as image classification, object detection, natural language processing, and healthcare.
As CNN architectures grow in size, the number of parameters and the volume of data movement increase, leading to higher dependence on off-chip memory, which in turn significantly raises energy consumption. A primary strategy for reducing both energy usage and off-chip memory accesses is to maximize data reuse at every level of the memory hierarchy. Data reuse can be exploited at three levels: (1) data flow and processing elements, (2) loop and computation scheduling, and (3) inter-layer and network-level operations.
Tiling is one of the key techniques for improving data reuse at the scheduling level. In this work, we present a precise mathematical formulation that models the number of data reuses under tiling. We then formulate an optimization problem to determine the tiling parameters that maximize data reuse for each network configuration. Furthermore, we investigate the relationship between network structural parameters, such as kernel size and stride, and the optimal tile size. Our analysis shows that, for 70% of the network layers examined, the optimal tile size is smaller than four times the kernel size.
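To make the idea concrete, the following is a minimal, illustrative sketch (in Python) of the kind of model described above: it estimates off-chip input traffic for a convolution layer tiled over its output plane and brute-forces the square output tile size that fits an assumed on-chip buffer while minimizing that traffic (equivalently, maximizing how often each fetched input word is reused). The layer shape, the 32K-word buffer capacity, the single square tile parameter t, and all function names are illustrative assumptions, not the paper's exact formulation.

import math

def input_tile_extent(t, k, s):
    """Input rows/columns an output tile of t x t needs (includes the halo)."""
    return (t - 1) * s + k

def offchip_input_words(h_out, w_out, c_in, t, k, s):
    """Input words fetched from off-chip memory when every tile reloads its halo."""
    tiles_y = math.ceil(h_out / t)
    tiles_x = math.ceil(w_out / t)
    ext = input_tile_extent(t, k, s)
    return tiles_y * tiles_x * ext * ext * c_in

def buffer_words(t, k, s, c_in, m_out):
    """On-chip words needed to hold one input tile plus one output tile."""
    ext = input_tile_extent(t, k, s)
    return ext * ext * c_in + t * t * m_out

def input_reuse_factor(h_out, w_out, c_in, m_out, k, s, t):
    """Average number of MAC operations served per input word fetched off-chip."""
    mac_input_reads = h_out * w_out * m_out * k * k * c_in
    return mac_input_reads / offchip_input_words(h_out, w_out, c_in, t, k, s)

def best_tile(h_out, w_out, c_in, m_out, k, s, buffer_capacity):
    """Brute-force the square output tile size that minimizes off-chip input traffic."""
    best = None
    for t in range(1, max(h_out, w_out) + 1):
        if buffer_words(t, k, s, c_in, m_out) > buffer_capacity:
            continue  # tile does not fit in the assumed on-chip buffer
        traffic = offchip_input_words(h_out, w_out, c_in, t, k, s)
        if best is None or traffic < best[1]:
            best = (t, traffic)
    return best

# Example: a 3x3, stride-1 layer with 64 input and 128 output channels,
# a 56x56 output plane, and an assumed 32K-word on-chip buffer.
if __name__ == "__main__":
    t, traffic = best_tile(h_out=56, w_out=56, c_in=64, m_out=128, k=3, s=1,
                           buffer_capacity=32 * 1024)
    reuse = input_reuse_factor(56, 56, 64, 128, 3, 1, t)
    print(f"optimal tile size: {t}, off-chip input words: {traffic}, "
          f"input reuse factor: {reuse:.1f}")

Because the total number of input reads performed by the MAC units is fixed for a given layer, minimizing off-chip input traffic and maximizing the input reuse factor select the same tile size in this sketch.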