Proposing Two Data Augmentation Techniques for ASR with Limited Data: Gradual Masking and Word Frequency-Aware Masking
M. Asadolahzade Kermanshahi (1), A. Akbari Azirani (1), and B. Nasersharif (2)
(1) School of Computer Engineering, Iran University of Science and Technology, Tehran, Iran
(2) Department of Computer Engineering, K. N. Toosi University of Technology, Tehran, Iran
Keywords: Speech recognition, word masking, data augmentation, limited data.
Abstract:
Data scarcity is the main challenge for DNN-based speech recognition, and data augmentation is an effective remedy. This paper presents a comprehensive taxonomy of data augmentation methods for speech recognition and investigates, under limited-data conditions, the most important family of techniques in this domain: masking-based methods. We examine two powerful approaches, SpecAugment and word masking, which, despite their proven effectiveness in high-resource scenarios, have received little study when data are limited. After analyzing the shortcomings of word masking in limited-data settings, we propose two novel methods: (1) gradual masking, which begins training with frame-level masking and then transitions to word-level masking; and (2) word frequency-aware masking, which masks high-frequency words first and low-frequency words afterwards. In experiments on the 100-hour LibriSpeech subset, the first method achieves a WER of 6.8% on the clean test set and 18.2% on the challenging test set, improvements of 6.8% and 4.2%, respectively, over SpecAugment. The second method reaches a WER of 6.6% on the clean test set and 17.3% on the challenging test set, improvements of 9.6% and 8.9%, respectively, over SpecAugment.
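To make the two proposed schedules concrete, the following is a minimal Python sketch. It assumes log-Mel features of shape (num_frames, num_mels), word-level time alignments such as those produced by the Montreal Forced Aligner [43], and a table of corpus word frequencies; every function name, mask size, and schedule parameter (e.g., switch_epoch) is an illustrative assumption, not the authors' implementation.

import numpy as np

def frame_mask(feats, num_masks=2, max_width=40, rng=None):
    """SpecAugment-style time masking: zero out random spans of frames."""
    rng = rng or np.random.default_rng()
    feats = feats.copy()
    for _ in range(num_masks):
        width = int(rng.integers(1, max_width + 1))
        start = int(rng.integers(0, max(1, feats.shape[0] - width)))
        feats[start:start + width, :] = 0.0
    return feats

def word_mask(feats, word_spans, num_words=2, rng=None):
    """Word masking: zero out the frames of whole words.
    word_spans is a list of (start_frame, end_frame, word) triples."""
    rng = rng or np.random.default_rng()
    feats = feats.copy()
    if not word_spans:
        return feats
    picks = rng.choice(len(word_spans), size=min(num_words, len(word_spans)),
                       replace=False)
    for i in picks:
        start, end, _ = word_spans[i]
        feats[start:end, :] = 0.0
    return feats

def gradual_mask(feats, word_spans, epoch, switch_epoch=20, rng=None):
    """Proposed method 1 (sketch): frame-level masking early in training,
    word-level masking afterwards (switch_epoch is an assumed tuning knob)."""
    if epoch < switch_epoch:
        return frame_mask(feats, rng=rng)
    return word_mask(feats, word_spans, rng=rng)

def freq_aware_mask(feats, word_spans, word_freq, epoch, switch_epoch=20, rng=None):
    """Proposed method 2 (sketch): mask high-frequency words first, then
    low-frequency words (the split of the ranked list is an assumption)."""
    ranked = sorted(word_spans, key=lambda s: word_freq.get(s[2], 0), reverse=True)
    half = max(1, len(ranked) // 2)
    pool = ranked[:half] if epoch < switch_epoch else ranked[half:]
    return word_mask(feats, pool, rng=rng)

The sketch only fixes the orderings that distinguish the two methods (frame-level before word-level for gradual masking, frequent before rare for frequency-aware masking); in practice the switch point and mask budgets would be tuned on a development set.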
[1] R. Prabhavalkar, T. Hori, T. N. Sainath, R. Schlüter, and S. Watanabe, "End-to-end speech recognition: A survey," IEEE/ACM Trans. on Audio, Speech, and Language Processing, vol. 32, pp. 325-351, 2024.
[2] M. Gales and S. Young, "The application of hidden Markov models in speech recognition," Foundations and Trends in Signal Processing, vol. 1, no. 3, pp. 195-304, Jul. 2008.
[3] G. Hinton et al., "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82-97, Nov. 2012.
[4] A. Graves, N. Jaitly, and A. Mohamed, "Hybrid speech recognition with deep bidirectional LSTM," in Proc. IEEE Workshop on Automatic Speech Recognition and Understanding, pp. 273-278, Olomouc, Czech Republic, 8-12 Dec. 2013.
[5] M. Asadolahzade Kermanshahi and M. M. Homayounpour, "Improving phoneme sequence recognition using phoneme duration information in DNN-HSMM," Journal of AI and Data Mining, vol. 7, no. 1, pp. 137-147, Jan. 2019.
[6] A. Gulati et al., "Conformer: Convolution-augmented transformer for speech recognition," in Proc. 21st Annual Conf. of the Int. Speech Communication Association, pp. 5036-5040, Shanghai, China, 25-29 Oct. 2020.
[7] A. Baevski, H. Zhou, A. Mohamed, and M. Auli, "Wav2Vec 2.0: A framework for self-supervised learning of speech representations," in Proc. Advances in Neural Information Processing Systems, Vancouver, Canada, pp. 12449-12460, 2020.
[8] Y. Zhang et al., "Google USM: Scaling automatic speech recognition beyond 100 languages," arXiv preprint arXiv:2303.01037, 2023.
[9] A. Radford et al., "Robust speech recognition via large-scale weak supervision," in Proc. of the 40th Int. Conf. on Machine Learning, pp. 28492-28518, Honolulu, HI, USA, 23-29 Jul. 2023.
[10] A. Hussein, S. Watanabe, and A. Ali, "Arabic speech recognition by end-to-end, modular systems and human," Computer Speech & Language, vol. 71, Article ID: 101272, Jan. 2022.
[11] D. S. Park et al., "SpecAugment: A simple data augmentation method for automatic speech recognition," in Proc. 20th Annual Conf. of the Int. Speech Communication Association, pp. 2613-2617, Graz, Austria, 15-19 Sept. 2019.
[12] T. Ko, V. Peddinti, D. Povey, and S. Khudanpur, "Audio augmentation for speech recognition," in Proc. 16th Annual Conf. of the Int. Speech Communication Association, pp. 3586-3589, Dresden, Germany, 6-10 Sept. 2015.
[13] T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, and S. Khudanpur, "A study on data augmentation of reverberant speech for robust speech recognition," in Proc. 42nd IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, pp. 5220-5224, New Orleans, LA, USA, 5-9 Mar. 2017.
[14] M. Bartelds, N. San, B. McDonnell, D. Jurafsky, and M. Wieling, "Making more of little data: Improving low-resource automatic speech recognition using data augmentation," in Proc. of the 61st Annual Meeting of the Association for Computational Linguistics, vol. 1, pp. 715-729, Toronto, Canada, 9-12 Jul. 2023.
[15] D. S. Park et al., "Improved noisy student training for automatic speech recognition," in Proc. 21st Annual Conf. of the Int. Speech Communication Association, pp. 2817-2821, Shanghai, China, 25-29 Oct. 2020.
[16] Y. Zhang et al., "BigSSL: Exploring the frontier of large-scale semi-supervised learning for automatic speech recognition," IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1519-1532, 2022.
[17] A. Baevski, A. Babu, W.-N. Hsu, and M. Auli, "Efficient self-supervised learning with contextualized target representations for vision, speech and language," in Proc. Int. Conf. on Machine Learning, pp. 1416-1429, Honolulu, HI, USA, 23-29 Jul. 2023.
[18] M. Asadolahzade Kermanshahi, Transfer Learning for ASR to Deal with Low-Resource Data Problem, Technical Report, Tehran, Iran, 2019. https://www.researchgate.net/publication/359159354_Transfer_Learning_for_ASR_to_Deal_with_Low-Resource_Data_Problem.
[19] M. Asadolahzade Kermanshahi, A. Akbari, and B. Nasersharif, "Transfer learning for end-to-end ASR to deal with low-resource problem in Persian language," in Proc. 26th Int. Computer Conference, Computer Society of Iran, 5 pp., Tehran, Iran, 3-4 Mar. 2021.
[20] H. Kheddar, Y. Himeur, S. Al-Maadeed, A. Amira, and F. Bensaali, "Deep transfer learning for automatic speech recognition: Towards better generalization," Knowledge-Based Systems, vol. 277, Article ID: 110851, Oct. 2023.
[21] A. Babu et al., "XLS-R: Self-supervised cross-lingual speech representation learning at scale," in Proc. 23rd Annual Conf. of the Int. Speech Communication Association, pp. 2278-2282, Incheon, South Korea, 18-20 Sept. 2022.
[22] V. Pratap et al., "Scaling speech technology to 1,000+ languages," Journal of Machine Learning Research, vol. 25, Article ID: 97, 52 pp., 2024.
[23] X. Cui, V. Goel, and B. Kingsbury, "Data augmentation for deep neural network acoustic modeling," IEEE/ACM Trans. on Audio, Speech, and Language Processing, vol. 23, no. 9, pp. 1469-1477, Sept. 2015.
[24] C. Wang et al., "Semantic mask for transformer-based end-to-end speech recognition," in Proc. 21st Annual Conf. of the Int. Speech Communication Association, pp. 971-975, Shanghai, China, 25-29 Oct. 2020.
[25] Y. Bengio, J. Louradour, R. Collobert, and J. Weston, "Curriculum learning," in Proc. of the 26th Annual Int. Conf. on Machine Learning, pp. 41-48, Montréal, Canada, 14-18 Jun. 2009.
[26] L. Meng et al., "MixSpeech: Data augmentation for low-resource automatic speech recognition," in Proc. 46th IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, pp. 7008-7012, Toronto, Canada, 6-11 Jun. 2021.
[27] D. Fucci, M. Gaido, M. Negri, M. Cettolo, and L. Bentivogli, "No pitch left behind: Addressing gender unbalance in automatic speech recognition through pitch manipulation," in Proc. IEEE Automatic Speech Recognition and Understanding Workshop, 8 pp., Taipei, Taiwan, 16-20 Dec. 2023.
[28] N. Jaitly and G. E. Hinton, "Vocal tract length perturbation (VTLP) improves speech recognition," in Proc. ICML Workshop on Deep Learning for Audio, Speech and Language, p. 21, Atlanta, GA, USA, 16-21 Jun. 2013.
[29] Y. Qian, H. Hu, and T. Tan, "Data augmentation using generative adversarial networks for robust speech recognition," Speech Communication, vol. 114, pp. 1-9, Nov. 2019.
[30] E. Casanova et al., "ASR data augmentation in low-resource settings using cross-lingual multi-speaker TTS and cross-lingual voice conversion," in Proc. 24th Annual Conf. of the Int. Speech Communication Association, pp. 1244-1248, Dublin, Ireland, 20-24 Aug. 2023.
[31] J. Sun et al., "Semantic data augmentation for end-to-end Mandarin speech recognition," in Proc. 22nd Annual Conf. of the Int. Speech Communication Association, pp. 1269-1273, Brno, Czech Republic, 30 Aug.-3 Sept. 2021.
[32] T. K. Lam, M. Ohta, S. Schamoni, and S. Riezler, "On-the-fly aligned data augmentation for sequence-to-sequence ASR," in Proc. 22nd Annual Conf. of the Int. Speech Communication Association, pp. 1299-1303, Brno, Czech Republic, 30 Aug.-3 Sept. 2021.
[33] G. Wang et al., "G-Augment: Searching for the meta-structure of data augmentation policies for ASR," in Proc. IEEE Spoken Language Technology Workshop, pp. 23-30, Doha, Qatar, 9-12 Jan. 2023.
[34] Z. Jin et al., "Towards automatic data augmentation for disordered speech recognition," in Proc. 49th IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, pp. 10626-10630, Seoul, South Korea, 14-19 Apr. 2024.
[35] R. Li, G. Ma, D. Zhao, R. Zeng, X. Li, and H. Huang, "A policy-based approach to the SpecAugment method for low resource E2E ASR," in Proc. Asia Pacific Signal and Information Processing Association, pp. 630-635, Chiang Mai, Thailand, 7-10 Nov. 2022.
[36] T.-Y. Hu et al., "SapAugment: Learning a sample adaptive policy for data augmentation," in Proc. 46th IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, pp. 4040-4044, Toronto, Canada, 6-11 Jun. 2021.
[37] A. Sriram, M. Auli, and A. Baevski, "Wav2Vec-Aug: Improved self-supervised training with limited data," in Proc. 23rd Annual Conf. of the Int. Speech Communication Association, pp. 4950-4954, Incheon, South Korea, 18-20 Sept. 2022.
[38] D. Jiang, W. Li, M. Cao, W. Zou, and X. Li, "Speech SimCLR: Combining contrastive and reconstruction objective for self-supervised speech representation learning," in Proc. 22nd Annual Conf. of the Int. Speech Communication Association, pp. 1544-1548, Brno, Czech Republic, 30 Aug.-3 Sept. 2021.
[39] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in Proc. of Conf. of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4171-4186, Minneapolis, MN, USA, 2-7 Jun. 2019.
[40] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "LibriSpeech: An ASR corpus based on public domain audio books," in Proc. 40th IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, pp. 5206-5210, South Brisbane, Australia, 19-24 Apr. 2015.
[41] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks," in Proc. of the 23rd Int. Conf. on Machine Learning, pp. 369-376, Pittsburgh, PA, USA, 25-29 Jun. 2006.
[42] S. Watanabe et al., "ESPnet: End-to-end speech processing toolkit," in Proc. 19th Annual Conf. of the Int. Speech Communication Association, pp. 2207-2211, Hyderabad, India, 2-6 Sept. 2018.
[43] M. McAuliffe, M. Socolof, S. Mihuc, M. Wagner, and M. Sonderegger, "Montreal Forced Aligner: Trainable text-speech alignment using Kaldi," in Proc. 18th Annual Conf. of the Int. Speech Communication Association, pp. 498-502, Stockholm, Sweden, 20-24 Aug. 2017.
[44] K. Le, T. V. Ho, D. Tran, and D. T. Chau, "SegAug: CTC-aligned segmented augmentation for robust RNN-transducer based speech recognition," in Proc. 50th IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, 5 pp., Hyderabad, India, 6-11 Apr. 2025.