Research on Lexical Analysis Based on Conditional Random Field Theory in Chinese Natural Language Understanding
Abstract
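With the continuing development of computer technology and the widespread adoption of the Internet, people urgently need a natural and convenient way to communicate with computers, one that lets computers "understand" human language. Speech recognition is the key technology for realizing this kind of human-computer interface, and the statistical language model, one of the cornerstones of current continuous speech recognition, depends on the support of natural language processing. For Chinese, lexical analysis is the foundation and key of Chinese information processing: it bears directly on subsequent syntactic analysis and semantic understanding, and ultimately affects practical application systems. It has therefore long been both a hot topic and a difficult problem in Chinese information processing research.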
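     This dissertation systematically introduces the conditional random fields (CRFs) model and its application to Chinese lexical analysis, and analyzes the mainstream training criteria and parameter optimization methods for CRFs. Then, with Chinese lexical analysis as the application background, it studies CRF training criteria from the perspective of discriminative principles, proposes a CRF-based method for resolving overlapping ambiguities, and discusses new word extraction and lexicon optimization algorithms for specific domains, offering new methods and ideas for research on Chinese lexical analysis. Finally, it briefly describes the application of Chinese lexical analysis research to Mandarin speech recognition.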
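     First, the dissertation studies discriminative training criteria for CRFs. At present, CRF parameters are trained mainly under the maximum likelihood or maximum a posteriori criterion, whose goal is to maximize the probability of the correct label sequence in the training corpus. A model built toward this goal is not guaranteed to find the best label sequence in a real test environment and thus achieve high labeling accuracy, so the current training criteria are mismatched with the evaluation metrics for sequence labeling. To address this problem, the dissertation proposes a new discriminative training criterion, minimum tag error (MTE). The criterion weights each candidate path by its accuracy relative to the reference path and takes maximization of the average accuracy over the training corpus as its objective function. To compute the average accuracy efficiently, the dissertation also proposes a new forward-backward algorithm and derives a method for computing the expected accuracy. Experiments show that the criterion not only raises the segmentation F-score slightly but also significantly improves recall of out-of-vocabulary (OOV) words; that is, it has a clear advantage in recognizing unknown words. The criterion also brings a considerable improvement in named entity recognition.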
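     Second, because probabilistic graphical models such as CRFs lack the good generalization ability of support vector machines (SVMs), the dissertation draws on the large-margin principle and proposes a discriminative CRF training method in the spirit of large margins, boosted conditional random fields (BCRF). The method inherits the convexity of traditional CRFs, which guarantees a globally optimal solution, and also incorporates the generalization ability of large-margin models. It can be understood as inserting a "soft margin" between the correct label sequence and each candidate sequence, where the margin is proportional to the Hamming distance between the two sequences (the number of incorrectly labeled elements in the candidate sequence). Experimental results show that the method has a clear advantage over the traditional maximum a posteriori method: it improves segmentation accuracy as well as OOV word and named entity recognition. Compared with MTE, its segmentation accuracy and recognition performance are slightly lower, but its parameters are simpler to compute, since no second forward-backward pass is needed.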
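     Third, the dissertation discusses the resolution of overlapping ambiguities in Chinese. Exploiting the strong performance of SVMs on classification problems and their suitability for high-dimensional data, it studies principles for selecting features and methods for representing them when SVMs are applied to overlapping ambiguity resolution. By analyzing the differences between the two segmentations of an overlapping ambiguity, it uses four statistics (mutual information, accessor variety, two-character word frequency and single-character word frequency) for feature representation and fusion, and compares the effect of different feature representations on classification performance. Experiments show that feature selection and representation are crucial to SVM classification performance, and that high-dimensional feature vectors built from complementary features can greatly improve the ambiguity resolution ability of an SVM classifier. An SVM must convert an ambiguous string with chain length greater than 1 into several strings of chain length 1 before processing it, which is inconvenient. To address this, the dissertation proposes a CRF-based ambiguity resolution method that turns the traditional binary classification problem into a sequence labeling problem. The method handles ambiguous strings of any chain length at once and, for truly ambiguous strings, makes full use of context to give the correct segmentation in each linguistic environment. Experimental results show that the method achieves the best performance reported to date.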
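Of the four statistics named above, mutual information is the easiest to illustrate. The following is a minimal sketch, assuming the pointwise form computed from raw corpus counts; the function name and all counts are hypothetical, and the abstract does not state which variant of mutual information the dissertation uses.

```python
import math

def pointwise_mutual_information(pair_count, left_count, right_count, total):
    """PMI (log base 2) of two adjacent characters, from raw corpus counts."""
    p_xy = pair_count / total
    p_x = left_count / total
    p_y = right_count / total
    return math.log2(p_xy / (p_x * p_y))

# Hypothetical counts for the overlapping string "结合成" (illustrative only).
total = 1_000_000
mi_left = pointwise_mutual_information(500, 2000, 4000, total)   # pair "结合"
mi_right = pointwise_mutual_information(100, 4000, 5000, total)  # pair "合成"

# A higher value suggests the pair binds more tightly, which would favor
# the segmentation "结合/成" under these made-up counts.
prefer_left = mi_left > mi_right
```

In a classifier such values would be one component of the feature vector, alongside the other three statistics.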
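     Next, the dissertation discusses new word extraction and lexicon optimization in specific domains. When domain-specific training corpora are lacking, supervised machine learning methods cannot show their strengths. Dictionary-based maximum matching is the simplest effective segmentation method, but the lack of domain-specific lexicons and the continual emergence of new words seriously degrade its segmentation accuracy in specific domains. Starting from a general-purpose lexicon as the initial lexicon and applying heuristic disambiguation rules, the dissertation proposes, on the basis of a coarse segmentation, an improved algorithm for new word extraction and lexicon optimization. The algorithm uses minimization of language model perplexity as the criterion for new word extraction, automatically selects new words from a candidate set, and adds them to the initial lexicon to obtain an extended lexicon suited to the specific domain. To compute the change in model perplexity before and after a candidate word is added to the lexicon, the dissertation proposes a simple and effective approximation. Experimental results show that the algorithm extracts many domain-specific technical terms, effectively reduces model perplexity, and improves segmentation accuracy.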
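Dictionary-based maximum matching, mentioned above as the baseline, can be sketched in a few lines. This is a generic forward maximum matching routine, not the dissertation's coarse segmenter (which also applies heuristic disambiguation rules not described in the abstract); the lexicon and the `max_word_len` parameter are assumptions for illustration.

```python
def forward_maximum_matching(text, lexicon, max_word_len=4):
    """Greedy left-to-right segmentation: always take the longest lexicon match."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; a single character always succeeds.
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                i += length
                break
    return words

# Hypothetical toy lexicon.
lexicon = {"研究", "研究生", "生命", "起源"}
segmented = forward_maximum_matching("研究生命起源", lexicon)
print(segmented)  # ['研究生', '命', '起源']
```

The output shows the weakness the paragraph describes: the greedy match takes "研究生" and misses the intended reading "研究/生命/起源", a typical overlapping ambiguity error.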
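     Finally, the dissertation briefly introduces the application of language models in speech recognition systems, and analyzes the role of Chinese lexical analysis research in statistical language modeling and its effect on the performance of speech recognition systems.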
With the continuing development of computer technology and the widespread adoption of the Internet, people urgently need a natural and convenient way to communicate with computers. Speech recognition, which lets computers "understand" human language, is the key technology for realizing such human-computer interfaces. The statistical language model, one of the cornerstones of current continuous speech recognition, relies on natural language processing. For Chinese, lexical analysis is the basic and key technology of Chinese information processing because it bears directly on subsequent syntactic analysis and semantic understanding, and ultimately affects practical application systems. Chinese lexical analysis has therefore long been both a hot topic and a difficult problem in Chinese information processing research.
     In this dissertation, we first study the conditional random fields (CRFs) model and its application to Chinese lexical analysis, and analyze the dominant training criteria and parameter optimization methods. With Chinese lexical analysis as the application background, we study new training criteria based on discriminative principles, propose a CRF-based method for overlapping ambiguity resolution, and discuss algorithms for new word extraction and lexicon optimization in specific domains, all of which offer new approaches and ideas for Chinese lexical analysis. Finally, we briefly describe the application of Chinese lexical analysis to speech recognition.
     Discriminative training criteria for CRFs are investigated first. Current CRF training methods are mainly based on maximum likelihood (ML) or maximum a posteriori (MAP) estimation, which aim to maximize the probability of the correct label sequence in the training data. The best sequence selected by such a model is not guaranteed to be highly accurate in a real test environment, so there is a mismatch between the training criteria and the evaluation metrics used for sequence labeling. The dissertation proposes a new discriminative training criterion, minimum tag error (MTE), which incorporates sentence tagging accuracy. The MTE objective function maximizes the expected tagging accuracy on the training corpus. To calculate the average accuracy efficiently, a new forward-backward algorithm is presented and the expected accuracy is derived. Experiments show that the MTE criterion not only improves the F-score but also increases OOV recall (R_oov) significantly; that is, it has a clear advantage in recognizing out-of-vocabulary (OOV) words. The MTE training method also improves performance in named entity recognition.
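The quantity that MTE maximizes, the expected tagging accuracy under the model's posterior, can be illustrated on a toy linear-chain model. The sketch below is not the dissertation's algorithm, whose details the abstract does not give. It assumes per-tag accuracy, for which linearity of expectation reduces the expected accuracy to the mean posterior marginal of the reference tag, and it checks a standard forward-backward computation against brute-force enumeration. All scores and names are hypothetical.

```python
import itertools
import math

def sequence_score(labels, node_scores, trans_scores):
    """Unnormalized log-score of one label sequence in a linear-chain model."""
    score = node_scores[0][labels[0]]
    for t in range(1, len(labels)):
        score += trans_scores[labels[t - 1]][labels[t]] + node_scores[t][labels[t]]
    return score

def expected_accuracy_bruteforce(reference, node_scores, trans_scores):
    """Sum over every path of P(path | x) times that path's tagging accuracy."""
    n_labels, length = len(trans_scores), len(node_scores)
    total = weighted = 0.0
    for labels in itertools.product(range(n_labels), repeat=length):
        weight = math.exp(sequence_score(labels, node_scores, trans_scores))
        accuracy = sum(a == b for a, b in zip(labels, reference)) / length
        total += weight
        weighted += weight * accuracy
    return weighted / total

def expected_accuracy_forward_backward(reference, node_scores, trans_scores):
    """Same quantity via posterior marginals of the reference tags."""
    n_labels, length = len(trans_scores), len(node_scores)
    # alpha[t][j]: total weight of prefixes ending with label j at position t.
    alpha = [[math.exp(node_scores[0][j]) for j in range(n_labels)]]
    for t in range(1, length):
        alpha.append([
            sum(alpha[t - 1][i] * math.exp(trans_scores[i][j]) for i in range(n_labels))
            * math.exp(node_scores[t][j])
            for j in range(n_labels)
        ])
    # beta[t][i]: total weight of suffixes that follow label i at position t.
    beta = [[1.0] * n_labels for _ in range(length)]
    for t in range(length - 2, -1, -1):
        beta[t] = [
            sum(math.exp(trans_scores[i][j] + node_scores[t + 1][j]) * beta[t + 1][j]
                for j in range(n_labels))
            for i in range(n_labels)
        ]
    z = sum(alpha[-1])
    return sum(alpha[t][reference[t]] * beta[t][reference[t]] / z
               for t in range(length)) / length

# Hypothetical scores: 3 positions, 2 labels.
node_scores = [[1.0, 0.2], [0.3, 0.9], [0.5, 0.4]]
trans_scores = [[0.6, -0.2], [0.1, 0.4]]
reference = [0, 1, 1]
brute = expected_accuracy_bruteforce(reference, node_scores, trans_scores)
fast = expected_accuracy_forward_backward(reference, node_scores, trans_scores)
```

The two values agree up to floating-point error. Training under such a criterion would additionally need the gradient of this expectation with respect to the model parameters, which is presumably where an extra forward-backward pass becomes necessary.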
     Secondly, since probabilistic graphical models such as CRFs do not generalize as well as support vector machines (SVMs), a new discriminative training method named boosted conditional random fields (BCRF), motivated by large-margin theory, is proposed. The new method inherits the convexity of CRFs, which guarantees a globally optimal solution, and also gains the generalization ability of large-margin models. BCRF can be understood as enforcing a soft margin between the reference sentence and each hypothesized one, proportional to their Hamming distance (the number of errors in the hypothesized sentence). Experiments show that the method achieves a significant improvement over traditional MAP training: it improves segmentation accuracy as well as OOV identification and named entity recognition. Compared with the MTE criterion, BCRF gives slightly lower segmentation accuracy and recognition performance, but its parameter optimization is simpler because it needs no second forward-backward pass.
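One way to picture a soft margin proportional to Hamming distance is a margin-augmented log-likelihood, in which every competing sequence has its score raised by a boost factor times its number of wrong tags before normalization. The sketch below assumes that form, in the style of boosted MMI training from speech recognition; the abstract does not give the exact BCRF objective, so the function, the `boost` parameter, and the toy scores are all assumptions.

```python
import itertools
import math

def hamming_distance(a, b):
    """Number of positions where two equal-length label sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def path_score(labels, node_scores, trans_scores):
    """Unnormalized log-score of one label sequence in a linear-chain model."""
    score = node_scores[0][labels[0]]
    for t in range(1, len(labels)):
        score += trans_scores[labels[t - 1]][labels[t]] + node_scores[t][labels[t]]
    return score

def boosted_log_likelihood(reference, node_scores, trans_scores, boost):
    """log of exp(s(ref)) / sum_y exp(s(y) + boost * Hamming(y, ref))."""
    n_labels = len(trans_scores)
    reference = tuple(reference)
    # Each competitor's score is raised by boost * (number of wrong tags).
    augmented = [
        path_score(y, node_scores, trans_scores) + boost * hamming_distance(y, reference)
        for y in itertools.product(range(n_labels), repeat=len(reference))
    ]
    peak = max(augmented)  # log-sum-exp with the usual stability shift
    log_z = peak + math.log(sum(math.exp(v - peak) for v in augmented))
    return path_score(reference, node_scores, trans_scores) - log_z

# Hypothetical scores: 3 positions, 2 labels.
node_scores = [[1.0, 0.2], [0.3, 0.9], [0.5, 0.4]]
trans_scores = [[0.6, -0.2], [0.1, 0.4]]
reference = [0, 1, 1]
plain = boosted_log_likelihood(reference, node_scores, trans_scores, boost=0.0)
boosted = boosted_log_likelihood(reference, node_scores, trans_scores, boost=0.5)
```

With `boost=0.0` this is the ordinary conditional log-likelihood. A positive boost lowers the objective unless the reference outscores its competitors by a margin that grows with their error count. Because the boost term does not depend on the model parameters, the normalizer stays a log-sum-exp of functions linear in those parameters, so the objective remains concave, consistent with the convexity claim above.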
     Thirdly, the resolution of Chinese overlapping ambiguities is discussed. Because SVMs perform remarkably well on classification tasks and can handle high-dimensional vectors, feature selection and representation for SVM-based resolution of Chinese overlapping ambiguous strings are studied. Based on the two possible segmentations of an ambiguous string, four statistics (mutual information, accessor variety, two-character word frequency and single-character word frequency) are used to build feature vectors of different dimensions, and classification performance is compared across these representations. The experiments show that feature selection and representation are vitally important to classification performance, and that high-dimensional features built from complementary statistics can greatly improve the ambiguity resolution ability of SVM classifiers. However, SVM classifiers handle ambiguous strings longer than three characters awkwardly, because such strings must first be converted into multiple three-character ambiguous strings. To solve this problem, a new method based on CRFs is proposed. Whereas traditional methods treat overlapping ambiguity as a binary classification problem, the new method regards it as a sequence labeling problem. The proposed method handles overlapping ambiguous strings of any length, whether the ambiguity is pseudo or true, while also taking into account context information and the dependencies among the predicted labels. The experimental results show that this method achieves state-of-the-art performance.
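Recasting ambiguity resolution as sequence labeling means that each candidate segmentation becomes a tag sequence over characters, so a string of any length is handled uniformly. A minimal sketch of that encoding follows, assuming the common four-tag B/M/E/S scheme (begin, middle, end, single); the abstract does not say which tag set the dissertation uses.

```python
def words_to_tags(words):
    """Encode a segmentation as one position tag per character."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags

def tags_to_words(chars, tags):
    """Decode a tag sequence back into words; a word closes at E or S."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # tolerate a sequence that ends without a closing tag
        words.append(current)
    return words

# The two readings of the overlapping ambiguous string "结合成".
reading_a = words_to_tags(["结合", "成"])   # ['B', 'E', 'S']
reading_b = words_to_tags(["结", "合成"])   # ['S', 'B', 'E']
```

A sequence labeler then scores tag sequences given the surrounding sentence, so the same string can receive `reading_a` in one context and `reading_b` in another.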
     New word extraction and lexicon optimization algorithms for specific domains are then studied. Since domain-specific training data are extremely scarce, supervised machine learning methods cannot show their advantages. Although dictionary-based maximum matching is simple and efficient, its segmentation accuracy is seriously degraded by the lack of domain-specific lexicons and the constant appearance of new words. In this dissertation, a general lexicon serves as the original lexicon, and an initial segmentation is obtained with it using heuristic rules. Based on this initial segmentation, we present an improved method for new word extraction and lexicon optimization. The proposed approach selects new words from the candidate word lists by a perplexity minimization criterion and adds them to the original lexicon. The augmented lexicon, which contains the new words, can be regarded as a lexicon for the specific domain. To efficiently calculate the language model perplexity before and after a candidate word is added to the lexicon, a simple substitution method is proposed to approximate the perplexity change. Experiments show that this method not only extracts many domain-specific new words but also reduces the model perplexity and improves the segmentation accuracy.
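The perplexity minimization criterion can be illustrated by recomputing perplexity exactly on a toy corpus before and after a candidate is merged into a single word. The sketch assumes a unigram maximum-likelihood model and per-character normalization, so that segmentations with different word counts remain comparable. The dissertation instead uses an approximation, which the abstract does not describe, to avoid this full recomputation. The corpus and function names are hypothetical.

```python
import math
from collections import Counter

def char_perplexity(sentences):
    """Per-character perplexity of a unigram MLE model fit on the same text."""
    counts = Counter(word for sentence in sentences for word in sentence)
    total_words = sum(counts.values())
    total_chars = sum(len(word) * c for word, c in counts.items())
    log_prob = sum(c * math.log(c / total_words) for c in counts.values())
    return math.exp(-log_prob / total_chars)

def merge_candidate(sentences, left, right):
    """Re-segment by joining every adjacent (left, right) pair into one word."""
    merged = []
    for sentence in sentences:
        out, i = [], 0
        while i < len(sentence):
            if i + 1 < len(sentence) and sentence[i] == left and sentence[i + 1] == right:
                out.append(left + right)
                i += 2
            else:
                out.append(sentence[i])
                i += 1
        merged.append(out)
    return merged

# Hypothetical coarse segmentation produced with a general lexicon.
corpus = [
    ["条件", "随机", "场", "模型"],
    ["条件", "随机", "场", "训练"],
    ["随机", "场", "理论"],
]
before = char_perplexity(corpus)
after = char_perplexity(merge_candidate(corpus, "随机", "场"))
accept = after < before  # keep the candidate only if perplexity drops
```

On this toy corpus merging "随机" and "场" lowers the per-character perplexity, so the candidate "随机场" would be added to the lexicon.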
     Finally, the application of language models in speech recognition systems is briefly introduced, and the role of Chinese lexical analysis research in statistical language modeling and its influence on speech recognition systems are analyzed.
