Robust Persian Isolated Digit Recognition Based on LSTM and Speech Spectral Features

Subject Areas: Electrical and Computer Engineering


Received: 2020-03-27 Accepted: 2021-01-27 Published: 2021-10-04

Keywords: Isolated digit recognition, similarity of digit pronunciation, hidden Markov model, long short-term memory, robustness

Abstract:

One of the challenges of isolated Persian digit recognition is the similar pronunciation of some digits, such as "zero and three", "nine and two", and "five, seven and eight". This similarity leads to high substitution error rates and reduces recognition accuracy. In this paper, a combined solution based on long short-term memory (LSTM) and the hidden Markov model (HMM) is proposed to address this challenge. The proposed approach increases the recognition rate of Persian digits by 2 percent on average and by 8 percent in the best case compared with the HMM-based approach. Because the challenge intensifies in noisy conditions, the rest of the work considers the robust recognition of Persian digits with similar pronunciation. To increase the robustness of the LSTM-based recognizer, robust features extracted from the speech spectrum are used: spectral entropy, burst degree, bisector frequency, spectral flatness, first formant, and autocorrelation-based zero-crossing rate. These features reduce the number of features needed to recognize similar Persian digits from 39 coefficients to between 1 and 4 coefficients. Averaged over 30 noisy conditions (five noise types: white, pink, babble, factory, and car; six signal-to-noise ratios: -5, 0, 5, 10, 15, and 20 dB), they improve the robustness of the isolated digit recognizer by 10%, 13%, and 15% compared with the HMM-based, LSTM-based, and deep belief network-based recognizers with Mel-cepstrum coefficients, respectively, and by 13% compared with a convolutional neural network-based recognizer with Mel spectrogram features.
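
The abstract does not give the exact feature definitions or the noise-mixing procedure, so the following is only a minimal sketch under stated assumptions. It uses common textbook definitions of two of the named features (spectral entropy and spectral flatness) and a standard way of mixing noise into speech at a target signal-to-noise ratio. The function names, the Hamming window, the 512-point FFT, and the 16 kHz, 25 ms frame are illustrative choices, not details taken from the paper.

```python
import numpy as np


def power_spectrum(frame, n_fft=512):
    """Power spectrum of one Hamming-windowed frame (n_fft is an assumed size)."""
    windowed = frame * np.hamming(len(frame))
    return np.abs(np.fft.rfft(windowed, n=n_fft)) ** 2


def spectral_entropy(frame, n_fft=512, eps=1e-12):
    """Shannon entropy of the normalized power spectrum, scaled to about [0, 1]."""
    psd = power_spectrum(frame, n_fft)
    p = psd / (psd.sum() + eps)          # treat the spectrum as a probability mass
    h = -np.sum(p * np.log2(p + eps))
    return h / np.log2(len(p))           # divide by the maximum possible entropy


def spectral_flatness(frame, n_fft=512, eps=1e-12):
    """Ratio of geometric mean to arithmetic mean of the power spectrum."""
    psd = power_spectrum(frame, n_fft) + eps
    return np.exp(np.mean(np.log(psd))) / np.mean(psd)


def add_noise_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech power / noise power equals snr_db, then mix."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fs = 16000                            # assumed sampling rate
    t = np.arange(400) / fs               # one 25 ms frame
    tone = np.sin(2 * np.pi * 1000 * t)   # stand-in for a voiced, peaky spectrum
    white = rng.standard_normal(400)      # stand-in for a flat, noise-like spectrum

    print("entropy  tone/noise:", spectral_entropy(tone), spectral_entropy(white))
    print("flatness tone/noise:", spectral_flatness(tone), spectral_flatness(white))

    # The six SNR levels listed in the abstract.
    for snr_db in (-5, 0, 5, 10, 15, 20):
        noisy = add_noise_at_snr(tone, white, snr_db)
        print(snr_db, "dB -> mixed frame power:", np.mean(noisy ** 2))
```

Both features are low for a spectrum dominated by a few strong peaks and high for a flat, noise-like spectrum, which is one reason a handful of such scalars can stand in for a much larger coefficient vector when only a few similar-sounding digits need to be told apart.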


