Abstract
Traditional voiceprint recognition methods involve complex processing and yield low recognition accuracy, which is a key obstacle to the development of voiceprint recognition applications. Exploiting the ability of deep learning to extract features and classify autonomously, this paper combines a convolutional neural network (CNN) with a long short-term memory (LSTM) network and proposes a combined network model that learns voiceprint features and uses them for identity authentication. The original speech is converted into fixed-length spectrograms, which are fed first into the CNN and then into the LSTM, and the combined network is trained to learn voiceprint features. Comparison with CNN, LSTM, and DNN networks verifies that the CNN-LSTM network achieves high accuracy in voiceprint recognition with fewer iterations. The experimental results show that both the spatial features and the temporal features of speech are important factors in voiceprint recognition. In the experiments, the CNN-LSTM network model reaches an accuracy of 95.42%, and its loss falls as low as 0.0973. The method is beneficial to practical applications of voiceprint recognition.
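The first stage of the pipeline, turning a variable-length recording into a fixed-length spectrogram, can be illustrated with a minimal NumPy sketch. The abstract gives no framing parameters, so the function name `fixed_length_spectrogram`, the 16 kHz sampling rate, the 25 ms frame, the 10 ms hop, the 512-point FFT, and the 128-frame target length are all assumptions chosen for illustration, not the authors' settings. The CNN and LSTM stages are not shown.

```python
import numpy as np


def fixed_length_spectrogram(signal, n_frames=128, frame_len=400, hop=160, n_fft=512):
    """Return a log-magnitude spectrogram of shape (n_frames, n_fft // 2 + 1).

    All default values are illustrative assumptions (400 samples = 25 ms and
    160 samples = 10 ms at an assumed 16 kHz sampling rate).
    """
    signal = np.asarray(signal, dtype=np.float64)

    # Number of samples needed to produce exactly n_frames frames.
    target_len = frame_len + hop * (n_frames - 1)

    # Zero-pad short recordings and truncate long ones to the fixed length.
    if len(signal) < target_len:
        signal = np.pad(signal, (0, target_len - len(signal)))
    else:
        signal = signal[:target_len]

    # Build an index matrix that slices the signal into overlapping frames.
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hanning(frame_len)

    # Magnitude of the one-sided FFT, then log compression.
    spectrum = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))
    return np.log(spectrum + 1e-8)


# Usage: a one-second 440 Hz tone sampled at the assumed 16 kHz rate.
sample_rate = 16000
tone = np.sin(2 * np.pi * 440 * np.arange(sample_rate) / sample_rate)
spec = fixed_length_spectrogram(tone)
print(spec.shape)
```

Because every utterance maps to the same array shape, the result can be treated as a single-channel image by a convolutional front end, while its first axis remains a time axis that a recurrent layer can step through.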