Using Context-Dependent Models in a Discriminative Spoken Term Detection System
Subject area: Electrical and Computer Engineering
Shima Tabibian 1*, Ahmad Akbari 2, Babak Nasersharif 3
1 - Aerospace Research Institute
2 - Iran University of Science and Technology
3 - K. N. Toosi University of Technology
Keywords: feature extraction, phone recognizer, context-independent, context-dependent, support vector machine, discriminative spoken term detection
Abstract:
Spoken Term Detection (STD) approaches can be divided into two main groups: Hidden Markov Model (HMM)-based approaches and Discriminative STD (DSTD) approaches. One important advantage of HMM-based methods is that they can use context-dependent (diphone or triphone) information to improve the performance of the whole STD system. Conversely, the inability to use triphone information is a significant drawback of DSTD methods. In this paper, we propose a solution to overcome this drawback of DSTD systems. To this end, we modify the feature extraction part of an Evolutionary DSTD (EDSTD) system, presented in our previous work, so that it takes triphone information into account. We first propose a monophone-based feature extraction stage for the EDSTD system, and then propose an approach for exploiting triphone information in it. Results on the TIMIT database indicate that the true detection rate of the triphone-based EDSTD (Tph-EDSTD) system, at more than two false alarms per keyword per hour, is about 3% higher than that of the monophone-based EDSTD (Mph-EDSTD) system. This improvement costs about a 36% degradation of the system's response speed, which is negligible.
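To make the notion of context-dependent units concrete, the sketch below (not the authors' actual feature extraction; the function name and silence padding convention are illustrative assumptions) shows how a context-independent monophone sequence is typically expanded into triphone units of the form left-center+right, as used in HMM-based recognizers:

```python
def to_triphones(phones):
    """Expand a monophone sequence into context-dependent triphone
    units (left-center+right), padding the edges with silence.
    Illustrative sketch only, not the paper's implementation."""
    padded = ["sil"] + list(phones) + ["sil"]
    return [f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
            for i in range(1, len(padded) - 1)]

# The word "cat" as /k ae t/ yields three triphone units:
print(to_triphones(["k", "ae", "t"]))
# → ['sil-k+ae', 'k-ae+t', 'ae-t+sil']
```

Each monophone thus maps to a unit that also encodes its immediate neighbors; this is the kind of contextual information the proposed CD-EDSTD feature extraction is designed to exploit.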