SVM Speaker Recognition Based on Prosodic Features
Abstract
The speech signal is an effective biometric for personal identity verification, and text-independent speaker recognition is an important research direction in speech signal processing, with significant theoretical value and broad application prospects. To track the latest progress in the field and to provide a unified benchmark, the U.S. National Institute of Standards and Technology (NIST) has organized Speaker Recognition Evaluations since 1996. The NIST evaluations represent the highest level of work in speaker recognition: NIST defines multiple evaluation tasks to explore research methods under a variety of speech conditions and specifies for each task common telephone and broadcast speech data (multiple environments, multiple channels, large speaker populations), test benchmarks, evaluation rules, and standards. One of these tasks uses long-duration speech for speaker recognition and was set up to explore high-level information in the speech signal for text-independent speaker recognition.
     Besides short-time cepstral parameters, high-level information in speech is also an effective speaker feature, but it is usually tied to the text content, so extracting high-level speech features that can be used for text-independent speaker recognition has become a focus of current research. This thesis investigates methods for extracting prosody and combining it with discriminative models for text-independent speaker recognition.
     Starting from the characteristics of text-independent speaker recognition, the thesis first describes the probabilistic statistical model: the feature information extracted from the (text-dependent) speech prosody, i.e., the trajectories of speech features over time, is compressed and clustered, and a support vector machine (SVM) is then used for discrimination.
     The thesis proposes a wavelet-analysis-based method for extracting suprasegmental prosodic information, applied separately to the vocal-tract MFCC trajectories, the fundamental-frequency (F0) trajectory, and the time-domain energy trajectory. Because the MFCC dimensions are approximately uncorrelated and the vocal tract varies slowly, the prosodic features of the MFCC trajectories are described by the approximation coefficients only; the resulting PMFCC serves as the primary parameter and is combined at the feature level with the six-dimensional prosodic feature PF0 from the F0 trajectory and the six-dimensional prosodic feature PE from the energy trajectory to form the more effective PMFCCFE feature, which is then modeled with a support vector machine (SVM) for discrimination.
     Experiments on the NIST database show that, compared with the conventional short-time-MFCC GMM-UBM system, the GMM-SVM system based on the suprasegmental prosodic feature PMFCCFE achieves a relative reduction of 57.9% in EER and 41.4% in MinDCF, significantly improving speaker recognition performance.
Speech is an effective biometric that is particularly useful for identity verification. Text-independent speaker recognition is one of the primary research fields of speech signal processing; it is not only of great theoretical significance but also has a wide variety of applications.
     The National Institute of Standards and Technology (NIST) has coordinated Speaker Recognition Evaluations since 1996 to investigate and benchmark the latest approaches. These evaluations represent the state of the art in speaker recognition. NIST sets up several tasks to examine speaker recognition performance under different conditions and provides participants with telephone and broadcast speech data spanning multiple channels and environments, together with common evaluation specifications and criteria. One task offers long-duration speech from each speaker and aims to make full use of high-level information for text-independent speaker recognition.
     In addition to short-term spectral features such as MFCCs, high-level information can also serve as an effective feature for speaker recognition, but it is usually tied to the spoken text. Extracting high-level features suitable for text-independent speaker recognition has therefore become a research focus. This thesis presents an effective and straightforward way to extract prosodic features and the models used to discriminate speakers.
     According to the nature of text-independent speaker recognition, the conventional probabilistic GMM-UBM model is used for data compression and clustering of the prosodic features, and a support vector machine (SVM) is then used to recognize speakers. The results show this approach to be effective.
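     The back-end described above can be illustrated roughly as follows. This is a minimal sketch, not the thesis' actual implementation: it assumes scikit-learn's GaussianMixture as a stand-in for the UBM, mean-only relevance-MAP adaptation with a hypothetical relevance factor of 16, synthetic feature frames, and a plain linear SVM without the kernel normalization a real system would use.

```python
# Minimal sketch of a GMM-supervector / SVM back-end (illustrative only).
# Assumptions: scikit-learn GaussianMixture as UBM, relevance factor 16,
# synthetic 19-dimensional "prosodic" frames standing in for PMFCCFE.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

def train_ubm(background_frames, n_components=32):
    """Fit a diagonal-covariance UBM on pooled background frames (T x D)."""
    ubm = GaussianMixture(n_components=n_components, covariance_type="diag",
                          max_iter=100, random_state=0)
    ubm.fit(background_frames)
    return ubm

def map_adapt_supervector(ubm, frames, relevance=16.0):
    """Mean-only MAP adaptation of the UBM; returns the stacked adapted means."""
    post = ubm.predict_proba(frames)               # (T, C) component posteriors
    n_c = post.sum(axis=0) + 1e-10                 # soft counts per component
    e_x = (post.T @ frames) / n_c[:, None]         # posterior-weighted means
    alpha = (n_c / (n_c + relevance))[:, None]     # adaptation coefficients
    adapted = alpha * e_x + (1.0 - alpha) * ubm.means_
    return adapted.ravel()                         # supervector of size C*D

# Toy usage: one target speaker against a pool of impostor utterances.
rng = np.random.default_rng(0)
ubm = train_ubm(rng.standard_normal((5000, 19)))
target_sv = [map_adapt_supervector(ubm, rng.standard_normal((300, 19)) + 0.5)
             for _ in range(5)]
imp_sv = [map_adapt_supervector(ubm, rng.standard_normal((300, 19)))
          for _ in range(20)]
svm = SVC(kernel="linear").fit(np.vstack(target_sv + imp_sv),
                               [1] * len(target_sv) + [0] * len(imp_sv))
test_sv = map_adapt_supervector(ubm, rng.standard_normal((300, 19)) + 0.5)
print("SVM verification score:", svm.decision_function(test_sv[None, :])[0])
```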
     The thesis introduces a method of extracting suprasegmental features with wavelet analysis, by which prosodic features of the MFCC, F0, and energy contours are extracted. Because the MFCC is high-dimensional, its dimensions are nearly uncorrelated, and the vocal tract changes slowly, only the wavelet approximation coefficients are used to form the vocal-tract suprasegmental feature PMFCC; the F0-contour and energy-contour prosodic features each consist of 6 dimensions. The prosodic features extracted in this way from the MFCC, F0, and energy contours are complementary and are fused at the feature level to yield a more effective feature, PMFCCFE. GMM mean supervectors of PMFCCFE are then used to train SVM models that discriminate target speakers from impostors more effectively.
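     A rough sketch of the wavelet-based extraction might look like the following. The wavelet family ('db4'), the decomposition level, and the particular six summary statistics used for the F0 and energy contours are illustrative assumptions rather than the thesis' exact parameterization, and a real system would presumably compute these features over successive segments rather than a single whole contour.

```python
# Sketch of suprasegmental feature extraction via wavelet analysis
# (wavelet family, level, and the 6-D descriptor are placeholders).
import numpy as np
import pywt

def contour_approximation(contour, wavelet="db4", level=2):
    """Keep only the approximation coefficients of a 1-D contour
    (applied per MFCC dimension to build PMFCC)."""
    coeffs = pywt.wavedec(contour, wavelet, level=level)
    return coeffs[0]                                # slowly varying part

def six_dim_prosodic(contour, wavelet="db4", level=2):
    """Hypothetical 6-D descriptor of an F0 or energy contour:
    mean/std/slope of the approximation plus detail-band energies."""
    coeffs = pywt.wavedec(contour, wavelet, level=level)
    ca, details = coeffs[0], coeffs[1:]
    slope = np.polyfit(np.arange(len(ca)), ca, 1)[0]
    det_energy = [float(np.sum(d ** 2)) for d in details[:3]]
    det_energy += [0.0] * (3 - len(det_energy))     # pad if fewer levels
    return np.array([ca.mean(), ca.std(), slope, *det_energy])

# Toy usage: fuse PMFCC, PF0, and PE at the feature level into PMFCCFE.
rng = np.random.default_rng(0)
mfcc = rng.standard_normal((13, 200))               # 13 MFCC trajectories
f0 = 120 + 10 * np.sin(np.linspace(0, 6, 200))      # synthetic pitch contour
energy = np.abs(rng.standard_normal(200))           # synthetic energy contour

pmfcc = np.concatenate([contour_approximation(c) for c in mfcc])
pf0 = six_dim_prosodic(f0)
pe = six_dim_prosodic(energy)
pmfccfe = np.concatenate([pmfcc, pf0, pe])          # feature-level fusion
print(pmfcc.shape, pf0.shape, pe.shape, pmfccfe.shape)
```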
     Experiments conducted on the 2006 NIST 8side-1side subset show that, compared with the MFCC-based GMM-UBM system, the prosodic GMM-SVM system relatively reduces the EER by 57.9% and the MinDCF by 41.4%.
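     For reference, the relative reductions quoted above follow the usual definition (baseline minus new, divided by baseline), and MinDCF is the minimum over thresholds of the NIST detection cost C_miss*P_miss*P_target + C_fa*P_fa*(1 - P_target). The sketch below computes both metrics on synthetic score distributions, with the cost parameters set to the standard NIST SRE 2006 values (C_miss = 10, C_fa = 1, P_target = 0.01); the scores themselves are made up for illustration.

```python
# Back-of-the-envelope illustration of EER, MinDCF, and relative reduction.
# The score arrays are synthetic; only the cost parameters follow NIST SRE 2006.
import numpy as np

def error_curves(target_scores, impostor_scores):
    """Miss and false-alarm rates swept over all candidate thresholds."""
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    p_miss = np.array([(target_scores < t).mean() for t in thresholds])
    p_fa = np.array([(impostor_scores >= t).mean() for t in thresholds])
    return p_miss, p_fa

def eer(p_miss, p_fa):
    """Equal error rate: point where miss and false-alarm rates coincide."""
    i = np.argmin(np.abs(p_miss - p_fa))
    return (p_miss[i] + p_fa[i]) / 2

def min_dcf(p_miss, p_fa, c_miss=10.0, c_fa=1.0, p_target=0.01):
    """Minimum of the NIST detection cost function over thresholds."""
    return np.min(c_miss * p_miss * p_target + c_fa * p_fa * (1 - p_target))

rng = np.random.default_rng(0)
pm_b, pf_b = error_curves(rng.normal(1.0, 1, 500), rng.normal(0, 1, 5000))
pm_p, pf_p = error_curves(rng.normal(2.0, 1, 500), rng.normal(0, 1, 5000))
rel = lambda base, new: 100 * (base - new) / base   # relative reduction in %
print("relative EER reduction: %.1f%%" % rel(eer(pm_b, pf_b), eer(pm_p, pf_p)))
print("relative MinDCF reduction: %.1f%%" % rel(min_dcf(pm_b, pf_b),
                                                min_dcf(pm_p, pf_p)))
```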
