Robust Recognition of Direct and Telephony Speech Using Proper Extraction of Feature Vectors and Their Modification by Neural Networks Inversion
Subject Areas: electrical and computer engineering
M. Vali 1 *, S. A. Seyed Salehi 2
Keywords: robust speech recognition, neural network, inversion, feature vectors
Abstract:
Much current research aims to design speech recognition systems that are robust to speech variability. One source of variability is the difference between telephony speech and direct speech (recorded in noise-free conditions). In this paper, a set of experiments shows that LHCB parameters outperform traditional MFCCs when used in a neural-network-based speech recognition system, for both direct and telephony speech. LHCBs are then extracted from direct and telephony speech, and an MLP-based recognition model is trained on them to build a direct and telephony speech recognition system. Using neural network inversion based on gradient descent, the telephony speech feature vectors are modified toward the direct speech feature vectors; training a second network on the modified telephony and the direct speech feature vectors yields a 1.4% improvement in speech recognition. Next, using a general inversion method for neural networks, both telephony and direct speech feature vectors are modified so that they mainly retain phonetic information rather than other speech variations. Training the second neural network on this dataset raises the recognition rate by 2.98% for direct speech and 1.68% for telephony speech.
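The gradient-descent inversion step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the layer sizes, the random weights standing in for a trained MLP, the one-hot phoneme target, and the names invert_input, x_tel and x_mod are all assumptions made for illustration, and plain Python is used only to keep the example self-contained. The idea shown is the general one: the network weights stay fixed, and the output error is backpropagated all the way to the input feature vector, which is then updated.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer(W, b, v):
    # One sigmoid layer; W is a list of weight rows, b the list of biases.
    return [sigmoid(sum(w * a for w, a in zip(row, v)) + bias)
            for row, bias in zip(W, b)]


def transpose_times(W, d):
    # Compute W^T d, which carries error terms back through one layer.
    return [sum(W[j][i] * d[j] for j in range(len(W)))
            for i in range(len(W[0]))]


def output_error(x, target, W1, b1, W2, b2):
    # Squared error between the network output for x and the target.
    y = layer(W2, b2, layer(W1, b1, x))
    return 0.5 * sum((yk - tk) ** 2 for yk, tk in zip(y, target))


def invert_input(x0, target, W1, b1, W2, b2, lr=0.2, steps=300):
    # Inversion: the weights are frozen, only the input vector is updated.
    x = list(x0)
    for _ in range(steps):
        h = layer(W1, b1, x)
        y = layer(W2, b2, h)
        # Error terms at the output and hidden pre-activations.
        d_out = [(yk - tk) * yk * (1 - yk) for yk, tk in zip(y, target)]
        d_hid = [e * hj * (1 - hj)
                 for e, hj in zip(transpose_times(W2, d_out), h)]
        # Gradient of the error with respect to the input features.
        grad_x = transpose_times(W1, d_hid)
        x = [xi - lr * g for xi, g in zip(x, grad_x)]
    return x


random.seed(0)
n_in, n_hid, n_out = 18, 12, 5  # hypothetical sizes, not from the paper
# Random weights stand in for a trained recognition network (assumption).
W1 = [[random.gauss(0, 0.5) for _ in range(n_in)] for _ in range(n_hid)]
b1 = [0.0] * n_hid
W2 = [[random.gauss(0, 0.5) for _ in range(n_hid)] for _ in range(n_out)]
b2 = [0.0] * n_out

x_tel = [random.gauss(0, 1) for _ in range(n_in)]  # stand-in feature vector
target = [0.0] * n_out
target[2] = 1.0  # assumed one-hot phoneme label

x_mod = invert_input(x_tel, target, W1, b1, W2, b2)
err_before = output_error(x_tel, target, W1, b1, W2, b2)
err_after = output_error(x_mod, target, W1, b1, W2, b2)
print(err_before, err_after)
```

In a pipeline like the one the abstract describes, vectors modified in this way would then serve as training data for a second network; the exact target and stopping rule used in the paper are not stated in the abstract, so the choices above are placeholders.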