A Study of Key Problems in Spoken Term Detection

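English title: A Study of Key Problems in Spoken Term Detection
Author: Li Baoxiang (李宝祥)
Degree level: Doctoral
Major: Signal and Information Processing
Chinese keywords (translated): speech recognition; keyword retrieval; confidence evaluation; activation force; confusion matrix
English keywords: speech recognition; spoken term detection; confidence measure; word activation force; confusion matrix
Degree year: 2013
Supervisor: Guo Jun (郭军)
Discipline code: 081002
Degree-granting institution: Beijing University of Posts and Telecommunications (北京邮电大学)
Submission date: 2013-04-05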

Abstract

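With the rapid development of modern communication media and the Internet, massive amounts of speech data have become an important carrier of information, and making effective use of these data has become a necessity. Spoken term detection draws on knowledge from speech recognition, information retrieval and natural language processing. By detecting whether speech data contain given keywords, it addresses the problem of extracting useful knowledge from large and complex collections of speech. This dissertation studies several key problems in spoken term detection, namely speech recognition post-processing, index structure, keyword matching and confidence measures. The main work and contributions are as follows: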
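     1. Weighted syllable confusion matrix generation algorithm
     Confusion matrices have important applications in query expansion and distance metrics. A traditional confusion matrix is obtained by aligning the 1-best speech recognition output with the reference transcription; loose alignment and environment-sensitive recognition results reduce its accuracy. This dissertation proposes generating a weighted syllable confusion matrix from confusion networks: only confusion sets that contain the correct result are counted, and the confusability between syllables is expressed as a probability based on time overlap and normalized acoustic scores. Experiments show that the algorithm still achieves good accuracy when the recognition rate is low and training data are limited.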
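     2. Confidence feature extraction algorithm based on the Word Activation Force (WAF) model
     Confidence measures are indispensable in speech recognition post-processing and in ranking keyword results. At present, most confidence features come from decoding information, so extracting confidence features that reflect higher-level semantic information has become very important. This dissertation proposes a confidence feature extraction algorithm based on the word activation force model, which gathers activation force statistics between a target word and its context to judge how well the target word fits its neighbors in semantic space. Experiments show that WAF-based confidence features complement decoding-based confidence features well, and that combining them effectively improves retrieval performance.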
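     3. Keyword matching algorithm based on acoustic distance
     In a spoken term detection system, errors in the recognition output and out-of-vocabulary queries are unavoidable, so keywords cannot simply be retrieved by exact matching. To reduce the impact of these problems, edit distance is commonly used for fuzzy matching. However, edit distance assigns fixed weights to insertions, deletions and substitutions, which is neither accurate nor flexible. This dissertation proposes using acoustic distance for fuzzy matching. When computing the acoustic distance, the insertion, deletion and substitution weights are taken from the weighted syllable confusion matrix, so each pair of different syllables has its own weight. Experiments show that acoustic distance measures the similarity between syllable strings more accurately than edit distance and improves retrieval performance.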
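     4. Fast retrieval algorithm based on hierarchical indexing
     By tolerating the substitution, insertion and deletion errors of the speech recognizer, acoustic distance matching improves retrieval accuracy but also increases retrieval time. During retrieval, the most time-consuming step is computing the acoustic distance between the target syllable sequence and every syllable sequence in the index database. This dissertation studies a fast retrieval technique based on hierarchical indexing: syllable sequences with small acoustic distances are mapped to the same superclass sequence to build a superclass index. The superclass index narrows the space in which acoustic distances must be computed, but the computation itself remains unavoidable. The acoustic distances between sequences in the index database are therefore precomputed and stored as a distance index, so that sequence similarity can be obtained quickly by table lookup. Experiments show that although hierarchical indexing increases index storage, it reduces retrieval time to a much greater extent.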
With the rapid development of modern media and the Internet, massive amounts of speech data have become an important carrier of information. Spoken term detection draws on theories from speech recognition, information retrieval and natural language processing. It aims to obtain useful knowledge from large, complex collections of speech by detecting individual occurrences of specified search terms. This dissertation focuses on several key problems in spoken term detection, namely speech recognition post-processing, index structure, keyword matching and confidence measures. Its main contributions are as follows:
     1. Weighted syllable confusion matrix generation algorithm
     Confusion matrices have important applications in query expansion and distance metrics. Generally, a confusion matrix is generated by aligning the 1-best hypotheses with the reference. However, each syllable in the 1-best hypotheses is not necessarily the best candidate, and recognition results on noisy data are often wrong, so confusion matrices generated by the traditional method are inaccurate. We instead generate a weighted syllable confusion matrix from the confusion network: time information is used to align the confusion network with the reference, and only the slices that contain the correct syllable are considered. The confusion weights between syllables are calculated from the time overlap and the normalized acoustic score. Experiments show that the algorithm performs well even when the recognition error rate is high and training data are scarce.
     2. Confidence feature extraction algorithm based on the word activation force model
     Confidence measures are very important in speech recognition post-processing and in ranking retrieved results. Currently, most confidence features are derived from decoding information, so extracting effective confidence features from higher-level information sources has become very important. A word in a sentence is closely related to its neighbors because they interact syntactically and semantically. The word activation force model captures these relations from word occurrence and co-occurrence statistics. We propose a confidence feature extraction algorithm based on the word activation force model that determines how well a word matches its context in semantic space. Experiments show that the proposed feature adds a new source of confidence information that complements existing ones, and that combining it with confidence features from decoding information effectively improves confidence estimation.
     3. Keyword matching algorithm based on acoustic distance
     In a spoken term detection system, speech recognition errors are inevitable and queries are often out-of-vocabulary (OOV) words, so exact matching is no longer applicable. Edit distance has been used to address these problems through approximate matching, but it relies on a very simple error cost model based on a small set of heuristic rules, with fixed costs for insertion, deletion and substitution. To take the acoustic confusability between syllables into account in string matching, we propose acoustic distance, which assigns smaller costs to particularly confusable syllable pairs. These costs are derived from syllable confusion probabilities taken from the weighted syllable confusion matrix. Experiments show that acoustic distance provides more robust approximate string matching than edit distance.
     4. Fast syllable sequence search algorithm based on hierarchical indexing
     Acoustic distance matching improves the accuracy of spoken term detection by tolerating syllable substitution, insertion and deletion errors; however, this comes at the cost of increased search time. The main computational cost at search time is calculating the acoustic distance between the target syllable sequence and every syllable sequence in the index database. A hierarchical indexing method is proposed to predict the subset of indexed sequences likely to have the smallest acoustic distance, so that the calculation can be skipped for all other sequences. Hierarchical indexing restricts the search space to syllable sequences likely to have been generated by the search term, but acoustic distances still have to be computed. Therefore, the acoustic distances between the syllable sequences in the index database are precomputed and stored in a distance index, from which sequence similarity can be obtained quickly by lookup. Experimental results demonstrate that hierarchical indexing and the distance index require more storage but greatly increase search speed with no loss in spoken term detection accuracy.
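The confusion matrix construction in contribution 1 can be illustrated with a minimal Python sketch. The abstract states only that time alignment is used, that slices containing the correct syllable are counted, and that weights come from time overlap and normalized acoustic scores. The exact weighting below (fraction of the reference syllable covered by an arc, multiplied by the arc's share of the slice's total score, then row-normalized) and the names `Arc` and `weighted_confusion_matrix` are illustrative assumptions, not the dissertation's formulas.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Arc:
    """One competing hypothesis in a confusion-network slice (hypothetical layout)."""
    syllable: str
    start: float   # start time in seconds
    end: float     # end time in seconds
    score: float   # non-negative acoustic score, e.g. a likelihood


def overlap(a_start, a_end, b_start, b_end):
    """Length of the time overlap between two intervals (0 if they are disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def weighted_confusion_matrix(reference, slices):
    """reference: list of (syllable, start, end); slices: list of lists of Arc.

    Returns {reference_syllable: {hypothesis_syllable: probability}}.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for ref_syl, r_start, r_end in reference:
        duration = r_end - r_start
        if duration <= 0:
            continue
        for arcs in slices:
            # Assumption: use only slices where the correct syllable appears at this time.
            if not any(a.syllable == ref_syl and overlap(r_start, r_end, a.start, a.end) > 0
                       for a in arcs):
                continue
            total = sum(a.score for a in arcs)
            if total <= 0:
                continue
            for a in arcs:
                # Fraction of the reference syllable covered by this arc ...
                time_weight = overlap(r_start, r_end, a.start, a.end) / duration
                # ... times the arc's normalized acoustic score within the slice.
                counts[ref_syl][a.syllable] += time_weight * a.score / total
    # Row-normalize so each reference syllable gets a probability distribution.
    matrix = {}
    for ref_syl, row in counts.items():
        z = sum(row.values())
        if z > 0:
            matrix[ref_syl] = {hyp: w / z for hyp, w in row.items()}
    return matrix
```

For example, a reference `ba` fully overlapped by a slice whose arcs are `ba` (score 0.6) and `pa` (score 0.4) yields the row `{"ba": 0.6, "pa": 0.4}`; a slice that never contains `ba` contributes nothing to that row.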
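Contribution 2 builds confidence features from word activation force statistics. Assuming the usual WAF formulation waf(i→j) = (f_ij / f_i) · (f_ij / f_j) / d_ij², where f_i and f_j are word frequencies, f_ij counts how often j follows i within a window and d_ij is their average distance, a minimal Python sketch follows. The abstract does not say how these statistics become a confidence feature; here the feature is assumed, for illustration only, to be the mean WAF between a target word and its neighbors in the recognition hypothesis. `train_waf`, `waf_confidence` and the window size are hypothetical.

```python
from collections import Counter, defaultdict


def train_waf(sentences, window=5):
    """Estimate waf(i -> j) = (f_ij / f_i) * (f_ij / f_j) / d_ij**2 from tokenized sentences.

    f_ij counts how often j occurs within `window` words after i, and d_ij is the
    average distance over those co-occurrences (a simplified counting scheme).
    """
    freq = Counter()
    co = Counter()
    dist_sum = defaultdict(float)
    for words in sentences:
        freq.update(words)
        for i, wi in enumerate(words):
            for d in range(1, window + 1):
                if i + d >= len(words):
                    break
                pair = (wi, words[i + d])
                co[pair] += 1
                dist_sum[pair] += d
    waf = {}
    for (wi, wj), f_ij in co.items():
        d_ij = dist_sum[(wi, wj)] / f_ij
        waf[(wi, wj)] = (f_ij / freq[wi]) * (f_ij / freq[wj]) / d_ij ** 2
    return waf


def waf_confidence(words, k, waf, window=5):
    """Hypothetical feature: mean WAF between words[k] and its neighbors in the hypothesis."""
    target = words[k]
    scores = []
    for d in range(1, window + 1):
        if k - d >= 0:
            # Left neighbor activating the target word.
            scores.append(waf.get((words[k - d], target), 0.0))
        if k + d < len(words):
            # Target word activating the right neighbor.
            scores.append(waf.get((target, words[k + d]), 0.0))
    return sum(scores) / len(scores) if scores else 0.0
```

Under this assumption, a low score flags a word that rarely co-occurs with its neighbors in training text, which may indicate a recognition error. As the abstract describes, such a semantic feature would be combined with decoding-based confidence features rather than used alone.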
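Contribution 3 replaces the fixed costs of edit distance with costs derived from the weighted syllable confusion matrix. The abstract does not give the cost function, so the sketch below assumes cost = 1 - P(hypothesis syllable | query syllable), with a hypothetical empty-syllable symbol `EPS` carrying insertion and deletion probabilities; any pair missing from the matrix falls back to the ordinary edit-distance cost of 1. The nested-dictionary matrix format `{reference: {hypothesis: probability}}` and the name `acoustic_distance` are also assumptions.

```python
EPS = ""  # hypothetical symbol for the empty syllable (insertions and deletions)


def acoustic_distance(query, hyp, conf):
    """Weighted edit distance between syllable lists `query` and `hyp`.

    conf[r][h] is the assumed probability that syllable r is recognized as h.
    """
    def p(r, h):
        return conf.get(r, {}).get(h, 0.0)

    def sub(r, h):
        # Confusable pairs are cheap; unseen pairs cost 1 as in plain edit distance.
        return 0.0 if r == h else 1.0 - p(r, h)

    def dele(r):
        return 1.0 - p(r, EPS)

    def ins(h):
        return 1.0 - p(EPS, h)

    n, m = len(query), len(hyp)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + dele(query[i - 1])
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + ins(hyp[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j] + dele(query[i - 1]),
                          D[i][j - 1] + ins(hyp[j - 1]),
                          D[i - 1][j - 1] + sub(query[i - 1], hyp[j - 1]))
    return D[n][m]
```

With `conf = {"ba": {"ba": 0.6, "pa": 0.4}}`, plain edit distance would put both `pa` and `ma` at distance 1 from `ba`, whereas this sketch gives `pa` a distance of 0.6 and `ma` a distance of 1.0, so the more confusable syllable is ranked closer.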
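Contribution 4 combines a superclass index with a precomputed distance index. The sketch below assumes each syllable maps to a superclass label (for example, a cluster of mutually confusable syllables), buckets indexed sequences by their superclass sequence, and precomputes distances only between sequences that share a bucket. `HierarchicalIndex`, the `superclass` mapping and the pluggable `distance` function are hypothetical; an acoustic distance such as the one described in contribution 3 could be passed in, but any distance function works.

```python
from collections import defaultdict
from itertools import permutations


class HierarchicalIndex:
    """Two-level index: superclass buckets plus a precomputed distance table (hypothetical)."""

    def __init__(self, sequences, superclass, distance):
        # sequences: iterable of syllable sequences
        # superclass: dict mapping syllable -> superclass label
        # distance: callable(seq_a, seq_b) -> float; need not be symmetric
        self.superclass = superclass
        self.distance = distance
        self.buckets = defaultdict(set)
        for seq in sequences:
            seq = tuple(seq)
            self.buckets[self._key(seq)].add(seq)
        # Distance index: precompute both directions, only within each superclass bucket.
        self.dist_index = {}
        for members in self.buckets.values():
            for a, b in permutations(members, 2):
                self.dist_index[(a, b)] = distance(a, b)

    def _key(self, seq):
        # Syllables without a superclass act as their own superclass.
        return tuple(self.superclass.get(s, s) for s in seq)

    def search(self, query, threshold):
        """Return (sequence, distance) pairs within `threshold`, closest first."""
        query = tuple(query)
        hits = []
        for cand in self.buckets.get(self._key(query), ()):
            if cand == query:
                d = 0.0
            else:
                d = self.dist_index.get((query, cand))
                if d is None:  # query not indexed: fall back to computing the distance
                    d = self.distance(query, cand)
            if d <= threshold:
                hits.append((cand, d))
        return sorted(hits, key=lambda hit: (hit[1], hit[0]))
```

Because a candidate must share the query's exact superclass sequence, this simplified bucketing misses candidates that differ from the query by an insertion or deletion; the abstract does not say how the dissertation handles that case. The trade-off matches the one reported in the abstract: the distance table costs extra storage but turns most distance computations into dictionary lookups.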

    [21]Allauzen C, Mohri M, Saraclar M. General indexation of weighted automata: application to spoken utterance retrieval. In Proceedings of the Workshop on Interdisciplinary Approaches to Speech Indexing and Retrieval at HLT-NAACL 2004, 2004:33-40.
    [22]Parlak S, Saraclar M. Spoken term detection for Turkish broadcast news. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing,2008:5244-5247.
    [23]Akbacak M, Vergyri D, Stolcke A. Open-vocabulary spoken term detection using graphone-based hybrid recognition systems. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing,2008:5240-5243.
    [24]Thambiratnam A. J. K. Acoustic keyword spotting in speech with applications to data mining. [Ph.D. dissertation]. Queensland University of Technology.2005.
    [25]Wallace R, Vogt R, Sridharan S. Spoken term detection using fast phonetic decoding. In proceedings of the International Conference on Acoustics, Speech and Signal Processing,2009:4881-4884.
    [26]Wallace, Roy Geoffrey. Fast and accurate phonetic spoken term detection. [Ph.D. dissertation]. Queensland University of Technology.2010.
    [27]Garofolo J S, Auzanne C G P, Voorhees E M. The TREC spoken document retrieval track:A success story. NIST SPECIAL PUBLICATION SP,2000 (246): 107-130.
    [28]Garofolo J, Lard J, Voorhees E.2000 TREC-9 Spoken Document Retrieval track. 2000. http://trec.nist.gov/pubs/trec9/sdrt9_slides/index.htm.
    [29]Tejedor J, Wang D, King S, et al. A posterior probability-based system hybridisation and combination for spoken term detection. In proceedings of INTERSPEECH 2009:2131-2134.
    [30]Logan B, Moreno P, Deshmukh O. Word and sub-word indexing approaches for reducing the effects of OOV queries on spoken audio. In Proceedings of the Second International Conference on Human Language Technology Research,2002:31-35.
    [31]孟莎,余鹏,Frank Seide等.基于后验概率词格的汉语自然对话语音索引。清华大学学报。第S1期,2008.
    [32]Chen B, Wang H, Lee L. Discriminating capabilities of syllable-based features and approaches of utilizing them for voice retrieval of speech information in Mandarin Chinese. IEEE Transactions on Speech and audio Processing,2002,10(5): 303-314.
    [33]刘凤晨刘庆文胡玥等。n-Gram/2L索引结构的存储与时间优化算法。计算机工程与应用。第5期,2008,pp.180-183.
    [34]韩纪庆,郑铁然,郑贵滨。音频信息检索理论与技术。科学出版社,2011年,pp.143-145.
    [35]Thambiratnam A J K, Sridharan S. Dynamic match lattice spotting for indexing speech content. U.S. Patent Application 11/377,327[P].2006.
    [36]Liu B, Xia Y, Yu P S. Clustering through decision tree construction. In proceedings of the ninth international conference on Information and knowledge management, ACM,2000:20-29.
    [37]Yiqing Z U. The text design for continuous speech database of standard Chinese. Chinese Journal of Acoustics,1999,18(1):56-59.
    [38]Young S, Evermann G, Gales M, et al. The HTK book. Cambridge University Engineering Department,2002.
    [39]Stolcke A. SRILM-an extensible language modeling toolkit. In Proceedings of the International Conference on spoken language processing,2002,2:901-904.
