Research on Chinese Word Segmentation Based on Conditional Random Fields
Abstract
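With the rapid development of Internet technology, natural language processing has become a prominent research topic in information processing. Because of the particular characteristics of Chinese, most Chinese natural language processing tasks must build on word segmentation, so segmentation accuracy directly affects a whole series of downstream processing steps. Because of the complexity of the Chinese language itself, word segmentation has long been a bottleneck in Chinese natural language processing.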
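     Conditional random fields (CRFs) are conditional probability models for labeling and segmenting sequence data; they are also undirected graphical models that compute the conditional probability of the output nodes given the input nodes. They do not require the strict independence assumptions of "generative" models such as the hidden Markov model, and they overcome the label bias problem found in the maximum entropy Markov model and other "non-generative" models. The model makes it very easy to incorporate arbitrary features of the input sequence, and other information, such as word-formation rules, can be added as well.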
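For orientation, the linear-chain form of a CRF that is commonly used for sequence labeling can be written as below. This is the standard formulation from the CRF literature rather than the thesis's own notation, which is not shown on this page; the symbols $f_k$ (feature functions), $\lambda_k$ (their weights), $T$ (sequence length) and $Z(x)$ (normalizer) are introduced here for illustration only.

```latex
p(y \mid x) = \frac{1}{Z(x)} \exp\Bigl( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \Bigr),
\qquad
Z(x) = \sum_{y'} \exp\Bigl( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, x, t) \Bigr)
```

Because $Z(x)$ normalizes over whole label sequences instead of over each state's outgoing transitions, the model avoids the label bias problem mentioned above, and each $f_k$ may inspect any part of the input $x$.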
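     This thesis systematically describes the definition of conditional random fields, their model structure, the representation of potential functions, parameter estimation, and training methods, and applies conditional random fields to Chinese word segmentation using a character-tagging approach. Extensive experiments were carried out with conditional random fields on the international standard corpora of the SIGHAN bakeoff, under the closed test condition. The experiments analyze how the choice of model parameters and the choice of character tag set affect the results. Taking advantage of the model's ability to accommodate arbitrary features, several new features were added, and a character position probability feature was explored from the perspective of a character's word-formation capability. Experiments on the PKU corpus show that introducing the character position probability feature raises F1 by 3.5%, to 94.5%. Finally, applying a "result integration" method to the outputs of the individual segmentation systems further raises the system's F1 to 95.6%.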
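The character-tagging view of segmentation, and one plausible reading of a character position probability, can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the four-tag set B/M/E/S is assumed (the thesis compares several tag sets and this page does not say which one it settles on), the relative-frequency estimate of P(tag | character) is a guess at how such a feature might be computed, and the function names and toy corpus are hypothetical.

```python
from collections import Counter, defaultdict

def words_to_tags(words):
    """Map a segmented sentence to (character, tag) pairs.

    Assumed tag set: B = word-initial, M = word-internal,
    E = word-final, S = single-character word.
    """
    pairs = []
    for word in words:
        if len(word) == 1:
            pairs.append((word, "S"))
        else:
            pairs.append((word[0], "B"))
            pairs.extend((ch, "M") for ch in word[1:-1])
            pairs.append((word[-1], "E"))
    return pairs

def tags_to_words(pairs):
    """Rebuild words from (character, tag) pairs; a word closes at E or S."""
    words, current = [], ""
    for ch, tag in pairs:
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # tolerate a sequence that ends without E or S
        words.append(current)
    return words

def position_probabilities(corpus):
    """Estimate P(tag | character) by relative frequency (an assumed estimator)."""
    counts = defaultdict(Counter)
    for words in corpus:
        for ch, tag in words_to_tags(words):
            counts[ch][tag] += 1
    return {
        ch: {tag: n / sum(c.values()) for tag, n in c.items()}
        for ch, c in counts.items()
    }

# Hypothetical toy corpus of already-segmented sentences.
corpus = [["我们", "研究", "分词"], ["研究生", "学", "分词"]]
pairs = words_to_tags(corpus[1])
probs = position_probabilities(corpus)
```

In this toy corpus the character 究 ends one word (研究) and sits inside another (研究生), so its estimated distribution is split evenly between E and M. A sequence labeler could consume such a distribution as an extra per-character feature.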
During the last decade, Natural Language Processing (NLP) has become an active research field. Because of the special characteristics of the Chinese language, Chinese word segmentation plays a critical role in many Chinese NLP applications and has become a bottleneck in Chinese information processing.
     Conditional Random Fields (CRFs) are conditional probabilistic models for labeling and segmenting sequential data, and also undirected graphical models that compute the conditional probability of the output nodes given the input nodes. They relax the strong independence assumptions of generative models (e.g. the Hidden Markov Model) and overcome the label-bias problem exhibited by the Maximum Entropy Markov Model and other discriminative models. CRFs can easily incorporate arbitrary features of the input sequence as well as other information, such as word-formation rules.
     This thesis proposes a CRF-based Chinese word segmentation system, focusing on the effect of parameter selection and of different tagging strategies. Within the CRF framework, we also explore new features, such as the word-formation power of a character. Evaluation on the SIGHAN PKU benchmark corpus shows that the new features improve the F1 score by 3.5%, and that our system achieves 94.5% in F1. This suggests that CRFs work well and hold great potential for Chinese word segmentation. In addition, we explore the effect of integrating different models, including CRFs, HMM and MEMM. Evaluation on the SIGHAN PKU benchmark corpus shows that these models are largely complementary and that the integrated system achieves 95.6% in F1, outperforming the state-of-the-art systems.
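The F1 figures quoted above are word-level scores. A minimal sketch of how such a score is usually computed is given below, assuming the common convention that a predicted word counts as correct only when both of its boundaries match the gold segmentation. The function names and the toy sentences are hypothetical, and the official SIGHAN scoring script may differ in details.

```python
def word_spans(words):
    """Convert a list of words to a set of (start, end) character offsets."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans

def precision_recall_f1(gold_sentences, pred_sentences):
    """Word-level precision, recall and F1 over parallel segmented sentences."""
    correct = gold_total = pred_total = 0
    for gold, pred in zip(gold_sentences, pred_sentences):
        gold_spans, pred_spans = word_spans(gold), word_spans(pred)
        correct += len(gold_spans & pred_spans)  # words with both boundaries right
        gold_total += len(gold_spans)
        pred_total += len(pred_spans)
    precision = correct / pred_total if pred_total else 0.0
    recall = correct / gold_total if gold_total else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Hypothetical example: the system wrongly splits 研究生 into 研究 + 生.
gold = [["研究生", "学", "分词"]]
pred = [["研究", "生", "学", "分词"]]
precision, recall, f1 = precision_recall_f1(gold, pred)
```

Here two of the four predicted words match the three gold words, giving a precision of 2/4, a recall of 2/3, and an F1 of 4/7 (about 0.571).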
