Efficient Human Action Recognition by Restricting the Search Space in Deep Learning Methods
Subject area: Electrical and Computer Engineering
Maryam Koohzadi Hikouei 1, Nasrollah Moghaddam Charkari 2 *
1 - Tarbiat Modares University
2 - Tarbiat Modares University
Keywords: human action recognition, deep learning, spatial-temporal, computational complexity, feature selection mechanism
Abstract:
The efficiency of human action recognition systems depends on extracting an appropriate representation from video data. In recent years, deep learning methods have been proposed to extract efficient spatial-temporal representations from video; however, these methods incur high computational complexity when extended along the temporal dimension. Moreover, the sparsity and scarcity of discriminative data, together with strong noise factors, aggravate the computational cost of representing actions and limit discriminative power. In this paper, spatial and temporal deep learning networks are enhanced with suitable feature selection mechanisms that counter noise factors and shrink the search space. To this end, non-online (offline) and online feature selection mechanisms are examined for recognizing human actions with lower computational complexity and higher discriminative power. The results show that the non-online mechanism yields a considerable reduction in computational complexity, while the online mechanism increases discriminative power while keeping the computational complexity under control.
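The abstract does not specify the selection mechanisms themselves; purely as an illustration of the two families it contrasts, the Python sketch below pairs a non-online (offline) variance-based filter that prunes feature dimensions before training with an online learned channel gate applied during the forward pass. The names `offline_feature_selection` and `OnlineFeatureSelection`, and the gating design, are assumptions made for illustration, not the paper's implementation.

```python
import numpy as np
import tensorflow as tf

def offline_feature_selection(features, keep_ratio=0.5):
    """Non-online selection: rank feature dimensions by their variance over the
    training set and keep only the top fraction, shrinking the search space
    before the temporal network is trained. features: (n_samples, n_dims)."""
    variances = features.var(axis=0)
    k = max(1, int(keep_ratio * features.shape[1]))
    keep = np.argsort(variances)[-k:]          # indices of the k most variable dims
    return features[:, keep], keep

class OnlineFeatureSelection(tf.keras.layers.Layer):
    """Online selection: a small learned gate scores each channel of a frame-level
    feature map and down-weights low-scoring (noisy) channels in the forward pass."""
    def __init__(self, reduction=8, **kwargs):
        super().__init__(**kwargs)
        self.reduction = reduction

    def build(self, input_shape):
        c = int(input_shape[-1])
        self.pool = tf.keras.layers.GlobalAveragePooling2D()
        self.fc1 = tf.keras.layers.Dense(max(1, c // self.reduction), activation="relu")
        self.fc2 = tf.keras.layers.Dense(c, activation="sigmoid")

    def call(self, x):
        # x: (batch, height, width, channels) feature map from the spatial CNN
        w = self.fc2(self.fc1(self.pool(x)))   # per-channel scores in [0, 1]
        return x * w[:, tf.newaxis, tf.newaxis, :]

# Usage sketch: prune precomputed descriptors offline, gate CNN features online.
descriptors = np.random.rand(100, 256).astype("float32")
pruned, kept_dims = offline_feature_selection(descriptors, keep_ratio=0.5)
frame_features = tf.random.normal((2, 7, 7, 512))
gated = OnlineFeatureSelection(reduction=8)(frame_features)  # same shape as input
```

The offline filter pays its cost once before training, which matches the abstract's claim of reduced complexity, whereas the in-network gate adds only a small per-forward-pass cost while letting the network emphasize discriminative channels.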