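The Research of Chinese Word Segmentation Algorithm Based on Dictionary and Probability Statistics (基于词典和概率统计的中文分词算法研究)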


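English title: The Research of Chinese Word Segmentation Algorithm Based on Dictionary and Probability Statistics
Author: 何爱元 (He Aiyuan)
Degree level: Master's
Discipline: Computer Software and Theory
Chinese keywords: 中文分词 ; 未登录词 ; 歧义处理 ; 语言模型 ; 词典 ; 概率统计
English keywords: Chinese word segmentation ; unknown word ; ambiguity processing ; language model ; dictionary ; probability statistics
Degree year: 2011
Supervisor: 李晓光 (Li Xiaoguang)
Discipline code: 081202
Degree-granting institution: 辽宁大学 (Liaoning University)
Thesis submission date: 2011-05-01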

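Abstract

For natural language processing of Chinese, automatic word segmentation is the first step of text analysis. Current Chinese word segmentation methods fall into three groups: dictionary-based methods, statistics-based methods, and understanding-based methods. Research on understanding-based methods is not yet mature. The prevailing approach today combines the dictionary-based and statistical methods. The main difficulties in Chinese word segmentation are unknown-word recognition and ambiguity resolution.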
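To make the dictionary-based family and the ambiguity problem concrete, here is a minimal sketch of forward and backward maximum matching, a textbook dictionary technique. It is not taken from the thesis: the function names, the toy dictionary, and the `max_len` limit are all assumptions chosen for illustration.

```python
def forward_max_match(text, dictionary, max_len=3):
    """Greedy left-to-right longest match; unmatched characters become single tokens."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def backward_max_match(text, dictionary, max_len=3):
    """Greedy right-to-left longest match; unmatched characters become single tokens."""
    tokens, end = [], len(text)
    while end > 0:
        for size in range(min(max_len, end), 0, -1):
            piece = text[end - size:end]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                end -= size
                break
    return tokens[::-1]


# Hypothetical toy dictionary, for illustration only.
toy_dictionary = {"研究", "研究生", "生命", "的", "起源"}
sentence = "研究生命的起源"

print(forward_max_match(sentence, toy_dictionary))   # ['研究生', '命', '的', '起源']
print(backward_max_match(sentence, toy_dictionary))  # ['研究', '生命', '的', '起源']
```

The two scan directions disagree on the same sentence, which is the kind of segmentation ambiguity a dictionary alone cannot settle. Any string missing from the dictionary simply falls apart into single characters, which is the unknown-word problem.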
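     Many of the Chinese word segmentation systems developed in recent years handle unknown words by adding a separate unknown-word recognition module to the segmenter and building rules for it. These systems recognize proper nouns such as person names, place names, and organization names fairly well, but they can hardly recognize new Internet words that follow no special rules, which substantially lowers segmentation accuracy. As for segmentation ambiguity, accuracy has improved in recent years, but the problem still urgently needs to be solved.
     Over the past two years, character-tagging segmentation methods have achieved very good results. However, their performance is bounded by the type and size of the training corpus. Although this is the current research mainstream, such a corpus-dependent mode of segmentation runs counter to the needs of practical segmentation.
     This thesis therefore adopts a segmentation method based on a dictionary and probability statistics, in order to make the segmentation system more practical and to address the two pressing problems of current systems: unknown-word recognition and ambiguity resolution.
     This thesis makes two main improvements: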
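     First, this thesis approaches the recognition of new Internet words from a different angle than earlier new-word recognition work. The method periodically collects a large number of web pages from different domains on the Internet and applies the recognition strategy of this thesis to find new words. In doing so, it analyzes words enclosed in special punctuation marks, article keywords, hyperlink text, and similar cues. The recognized new words are added to the segmentation dictionary to enlarge its vocabulary. This is very effective against the unknown-word problem and ultimately raises the precision and recall of the segmentation system.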
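The abstract names the cues (special punctuation, article keywords, hyperlink text) but not the recognition strategy itself, so the following is only a sketch of how such cues might be harvested from collected pages. The class and function names, the 2 to 6 character length window, and the `min_count` threshold are assumptions, not details from the thesis.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Spans enclosed in Chinese quotation marks, title marks, or corner brackets.
# The 2-6 character length window is an assumed heuristic.
QUOTED = re.compile(r"[“《「]([^“”《》「」]{2,6})[”》」]")


class CueCollector(HTMLParser):
    """Counts strings seen in meta keywords, link text, and quoted spans."""

    def __init__(self, counts):
        super().__init__()
        self.counts = counts
        self.in_link = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            self.in_link = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "keywords":
            for keyword in re.split(r"[,，;；\s]+", attrs.get("content") or ""):
                if keyword:
                    self.counts[keyword] += 1

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False

    def handle_data(self, data):
        text = data.strip()
        if self.in_link and 2 <= len(text) <= 6:
            self.counts[text] += 1
        for span in QUOTED.findall(data):
            self.counts[span] += 1


def new_word_candidates(pages, dictionary, min_count=2):
    """Return cue strings seen at least min_count times that the dictionary lacks."""
    counts = Counter()
    for html in pages:
        parser = CueCollector(counts)  # fresh parser per page, shared counter
        parser.feed(html)
        parser.close()
    return {w for w, c in counts.items() if c >= min_count and w not in dictionary}


# Hypothetical sample pages, for illustration only.
pages = [
    '<head><meta name="keywords" content="给力,团购"></head>'
    '<body><p>网友说“给力”真不错。</p><a href="/a">团购</a></body>',
    '<p>今天的“给力”表现</p><a href="/b">新闻</a>',
]
print(sorted(new_word_candidates(pages, dictionary={"新闻"})))
```

On these two sample pages, 给力 and 团购 each collect at least two pieces of evidence and are returned, while 新闻 is dropped because it is already in the dictionary. A real pipeline would then add the accepted candidates to the segmentation dictionary, as the paragraph above describes.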
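     Second, building on the existing n-gram language model, this thesis proposes a reverse n-gram language model and finds in its analysis that the model performs best when n is 3. On this basis it proposes a Chinese word segmentation method built on a bidirectional trigram language model, and then adds word information to that model. The resulting algorithm, based on a bidirectional trigram model with word position information, handles ambiguity in Chinese segmentation better. Finally, experimental comparison shows that the segmentation system achieves good results in both speed and accuracy.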
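As a rough illustration of the bidirectional idea, the sketch below trains one trigram model on word sequences read left to right and another on the same sequences reversed, then ranks candidate segmentations by the sum of the two log-probabilities. Everything specific here is an assumption: the add-one smoothing, the summed score, the class names, and the toy corpus are illustrative choices, and the word position information mentioned above is not modeled because the abstract does not describe it.

```python
import math
from collections import Counter

BOS, EOS = "<s>", "</s>"


class TrigramModel:
    """Add-one smoothed trigram model over word sequences."""

    def __init__(self, sentences):
        self.trigrams, self.contexts, self.vocab = Counter(), Counter(), set()
        for words in sentences:
            padded = [BOS, BOS] + list(words) + [EOS]
            self.vocab.update(padded)
            for i in range(2, len(padded)):
                self.contexts[tuple(padded[i - 2:i])] += 1
                self.trigrams[tuple(padded[i - 2:i + 1])] += 1

    def log_prob(self, words):
        padded = [BOS, BOS] + list(words) + [EOS]
        size = len(self.vocab)
        total = 0.0
        for i in range(2, len(padded)):
            numerator = self.trigrams[tuple(padded[i - 2:i + 1])] + 1
            denominator = self.contexts[tuple(padded[i - 2:i])] + size
            total += math.log(numerator / denominator)
        return total


class BidirectionalTrigram:
    """Scores a segmentation with a forward model plus a reverse (right-to-left) model."""

    def __init__(self, sentences):
        self.forward = TrigramModel(sentences)
        self.reverse = TrigramModel([list(reversed(s)) for s in sentences])

    def score(self, words):
        # Assumed combination rule: sum of the two log-probabilities.
        return self.forward.log_prob(words) + self.reverse.log_prob(list(reversed(words)))

    def best(self, candidates):
        return max(candidates, key=self.score)


# Hypothetical segmented training corpus, for illustration only.
corpus = [
    ["研究", "生命", "的", "起源"],
    ["他", "研究", "生命", "科学"],
    ["研究生", "的", "论文"],
]
model = BidirectionalTrigram(corpus)
candidates = [
    ["研究生", "命", "的", "起源"],
    ["研究", "生命", "的", "起源"],
]
print(model.best(candidates))  # ['研究', '生命', '的', '起源']
```

The reverse model conditions each word on the two words that follow it, so context on both sides of an ambiguous boundary contributes to the choice. In practice the candidates would come from a dictionary-based segmenter rather than being listed by hand.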

English abstract

For Chinese natural language processing, word segmentation is the first step in text analysis. Current Chinese word segmentation methods fall into three kinds: dictionary-based methods, methods based on probability statistics, and understanding-based methods. Understanding-based methods are not yet mature. Today, methods that combine statistics with a dictionary are the most popular. The difficult problems in Chinese word segmentation are unknown-word recognition and ambiguity processing.
     In recent years, many segmentation systems have handled unknown words by adding a separate recognition module and establishing rules for it. Recognition of named entities such as person names, place names, and organization names has achieved good results. However, these systems cannot recognize new web words that follow no special rules, and such words reduce the accuracy of the segmentation system. For ambiguous segmentation, accuracy has increased in recent years, but the problem still urgently needs to be solved.
     Therefore, this thesis uses a combined statistics and dictionary method to address unknown-word recognition and ambiguous segmentation.
     This thesis makes two improvements:
     First, this thesis recognizes new words from a different direction. We collect a large number of pages from different areas of the Internet and apply our recognition strategy to them. We then add the recognized new words to the dictionary to expand its vocabulary. This is very effective against the unknown-word problem in Chinese segmentation and ultimately improves the precision and recall of the segmentation system.
     Second, we derive a reverse n-gram language model from the original n-gram language model and, on that basis, propose a segmentation method built on a two-way 3-gram language model. We then add word information to the model, which improves system performance and lets the model handle ambiguity in Chinese segmentation better. Experimental comparison shows that our system achieves good results in both speed and accuracy.

