Research on Thematic Word Extraction from Chinese Text Based on a Genetic Algorithm
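Abstract
This paper uses the idea of a concept semantic network to construct a thematic word dictionary and to describe the conceptual semantic relationships between words, so that the theme of a text is understood at the concept level. This overcomes the shortcoming of earlier approaches, which could perform thematic indexing only by mechanically matching and extracting words on the basis of surface form. It achieves normative thematic word extraction and improves the quality of automatic thematic indexing of Chinese text.
     First, the sample documents are segmented using both a dictionary-based and a dictionary-free segmentation method, keyword weights are computed, and a preliminary keyword vector representing each sample document is extracted.
     Then a genetic algorithm is applied to theme extraction from Chinese documents: the preliminary document vector is further refined and optimized to find the feature vector that best reflects the document's content while remaining reasonably concise.
     Using artificial intelligence techniques, natural language understanding techniques, and the related mathematical theory, this paper represents the theme of a document as a set of words. This simplifies the representation of document themes and lays a foundation for automatic classification of Chinese text, automatic summarization, case retrieval, and similar tasks. It also applies intelligent computing theory to theme extraction from Chinese text and designs a genetic algorithm suited to theme extraction on this kind of data set, further broadening the application fields of intelligent computing.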
To meet the demands of large-scale Chinese text processing, this paper studies and implements a method for automatically extracting thematic words from Chinese texts, based on a genetic algorithm and concept semantics.
     The language used by authors of Chinese texts is diverse and non-standardized, so keywords that express the same thematic word appear in many different literal forms. This vocabulary mismatch severely impairs automatic text processing that relies on natural-language matching.
     In Chinese text processing, the first step in understanding natural language is to identify word boundaries automatically and segment the character sequence into the correct string of words. Only once this step is solved can thematic words be extracted and analyzed, and the text understood.
     Automatic word segmentation is therefore the foundation and precondition of thematic word extraction from Chinese text, and the quality of segmentation directly affects the quality of the extracted thematic words. Traditional natural language processing techniques rely on mechanical keyword matching alone: they match only surface forms and lack any capacity for knowledge processing or understanding.
     Normative thematic word extraction by computer must solve several problems: segmenting keywords from the text; identifying or deriving thematic words from the keywords; and confirming whether a thematic word should serve as an indexing term.
     At present, the more mature Chinese word segmentation techniques are dictionary-based mechanical segmentation and dictionary-free segmentation based on word-frequency statistics, which is not tied to a particular domain.
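     As an illustration of dictionary-based mechanical segmentation, the sketch below uses forward maximum matching, one common method of this kind. The abstract does not say which mechanical method the paper uses, so the algorithm choice, the function name `forward_max_match`, the `max_len` parameter, and the toy dictionary are all assumptions made for this example.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy left-to-right segmentation.

    At each position, take the longest dictionary word that starts there.
    A single character is accepted as a fallback when nothing matches.
    """
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until there is a hit.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy dictionary, not taken from the paper.
toy_dictionary = {"遗传", "算法", "遗传算法", "主题词", "提取"}
segments = forward_max_match("遗传算法提取主题词", toy_dictionary)
print(segments)  # ['遗传算法', '提取', '主题词']
```

     Because the longest match wins, "遗传算法" is kept as one word even though "遗传" and "算法" are also in the dictionary. Characters that the dictionary does not cover fall out as single-character tokens, which is one reason a dictionary-free statistical method is a useful complement.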
     To achieve semantic understanding, this paper incorporates domain background knowledge and treats thematic words as basic concepts.
     It constructs a thematic word dictionary using a concept semantic network, which describes the conceptual semantic relationships among words. The concept semantic network provides a web of associations and a basis for reasoning across items of knowledge. This allows natural language to be understood semantically to a certain extent, so that the thematic words of a Chinese text, or a user's requirements, are understood at the concept level. The approach achieves normative thematic word extraction and improves the quality of thematic indexing of Chinese text. It also lays a foundation for automatic text processing tasks such as intelligent retrieval, text categorization, and automatic summarization.
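     One way to picture such a thematic word dictionary is as a small thesaurus in which each normative thematic word lists its surface variants and a broader concept. The sketch below is only an illustration under that assumption: the structure `THESAURUS`, its entries, and the helper names are hypothetical, and the paper's actual dictionary format is not described in the abstract.

```python
# Hypothetical toy thesaurus: normative thematic word -> variants and broader concept.
THESAURUS = {
    "genetic algorithm": {
        "variants": {"GA", "genetic algorithms"},
        "broader": "evolutionary computation",
    },
    "word segmentation": {
        "variants": {"word splitting", "tokenization"},
        "broader": "natural language processing",
    },
    "evolutionary computation": {
        "variants": set(),
        "broader": "intelligent computing",
    },
}


def build_variant_index(thesaurus):
    """Map every surface form (lower-cased) to its normative thematic word."""
    index = {}
    for term, entry in thesaurus.items():
        index[term.lower()] = term
        for variant in entry["variants"]:
            index[variant.lower()] = term
    return index


def to_thematic_words(keywords, index):
    """Replace keywords by thematic words, dropping unknown keywords and duplicates."""
    result = []
    for keyword in keywords:
        term = index.get(keyword.lower())
        if term is not None and term not in result:
            result.append(term)
    return result


def broader_chain(term, thesaurus):
    """Follow 'broader' links upward to reach more general concepts."""
    chain = []
    current = thesaurus.get(term, {}).get("broader")
    while current is not None and current not in chain:
        chain.append(current)
        current = thesaurus.get(current, {}).get("broader")
    return chain


index = build_variant_index(THESAURUS)
thematic = to_thematic_words(["GA", "tokenization", "genetic algorithms", "unknown"], index)
```

     Here different literal forms such as "GA" and "genetic algorithms" collapse onto the single thematic word "genetic algorithm", and `broader_chain` gives a simple form of concept-level reasoning by walking up to more general concepts.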
     First, the sample documents are segmented into words using both dictionary-based and dictionary-free segmentation methods. Keyword weights are then computed, and a preliminary keyword feature vector representing each document is extracted.
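     The abstract does not state the weighting formula. As a hedged illustration, the sketch below assumes a smoothed TF-IDF weight, a common choice for keyword weighting; the function names `keyword_weights` and `top_keywords` and the toy corpus are invented for this example.

```python
import math
from collections import Counter


def keyword_weights(doc_tokens, corpus):
    """Smoothed TF-IDF weights for one segmented document.

    doc_tokens: list of words for the document being weighted.
    corpus: list of segmented documents (each a list of words).
    This weighting scheme is an assumption, not the paper's stated formula.
    """
    counts = Counter(doc_tokens)
    n_docs = len(corpus)
    weights = {}
    for word, count in counts.items():
        doc_freq = sum(1 for doc in corpus if word in doc)
        idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1.0
        weights[word] = (count / len(doc_tokens)) * idf
    return weights


def top_keywords(weights, k):
    """Return the k highest-weighted words, breaking ties alphabetically."""
    return sorted(weights, key=lambda w: (-weights[w], w))[:k]


# Hypothetical segmented corpus.
corpus = [
    ["遗传", "算法", "算法"],
    ["分词", "算法"],
    ["分词", "词典"],
]
weights = keyword_weights(corpus[0], corpus)
initial_vector = top_keywords(weights, 2)
```

     The top-weighted words form the preliminary keyword vector that a later optimization step can refine.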
     Then a genetic algorithm is applied to thematic word extraction from Chinese texts, building on the concept semantic network: the original document vector is refined and optimized. The resulting feature vector best reflects the content of the text while remaining concise.
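     The refinement step can be sketched as a small genetic algorithm over bit masks, where each bit says whether a candidate keyword is kept. The abstract gives neither the encoding nor the fitness function, so everything here is an assumption for illustration: the fitness (kept weight minus a fixed cost per kept keyword, to trade content coverage against brevity), the operators (tournament selection, one-point crossover, bit-flip mutation, elitism), and all parameter values.

```python
import random


def fitness(mask, weights, penalty):
    """Assumed objective: total weight of kept keywords minus a cost per kept keyword."""
    kept_weight = sum(w for bit, w in zip(mask, weights) if bit)
    return kept_weight - penalty * sum(mask)


def refine_vector(weights, penalty=0.2, pop_size=30, generations=60,
                  mutation_rate=0.05, seed=0):
    """Search for a keyword mask with high fitness using a simple genetic algorithm."""
    rng = random.Random(seed)
    n = len(weights)

    def score(mask):
        return fitness(mask, weights, penalty)

    population = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    best = max(population, key=score)
    for _ in range(generations):
        next_gen = [best[:]]  # elitism: the best mask always survives
        while len(next_gen) < pop_size:
            # Tournament selection: best of three random individuals.
            parent1 = max(rng.sample(population, 3), key=score)
            parent2 = max(rng.sample(population, 3), key=score)
            # One-point crossover.
            cut = rng.randint(1, n - 1) if n > 1 else 0
            child = parent1[:cut] + parent2[cut:]
            # Bit-flip mutation.
            child = [1 - b if rng.random() < mutation_rate else b for b in child]
            next_gen.append(child)
        population = next_gen
        best = max(population, key=score)
    return best


# Hypothetical keyword weights for one document.
candidate_weights = [0.9, 0.05, 0.6, 0.1, 0.4, 0.02]
best_mask = refine_vector(candidate_weights)
```

     Under this assumed fitness, a keyword is worth keeping only if its weight exceeds the penalty, so low-weight keywords tend to be pruned and the vector becomes shorter. A real fitness function would more likely measure how well the reduced vector still represents or classifies the document.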
     Word segmentation is one of the basic tasks of automatic Chinese text processing, and also one of the hard problems of natural language processing. Because of the structural complexity of Chinese and the ambiguity of segmentation and syntactic analysis, among other factors, automatic Chinese word segmentation has not yet made a substantial breakthrough, which limits the quality of thematic word extraction to some extent.
     Deeper research on word segmentation techniques is therefore needed, along with the development of a high-quality segmentation system.
     The algorithm studied in this paper extracts thematic words from Chinese text and operates only at the level of the thematic word. Thematic words are isolated from one another, however, and cannot fully convey the theme of a text.
     Future work should therefore handle the theme of a document at several levels, from thematic words to concepts and to thematic phrases and sentences.
     Using artificial intelligence techniques, natural language understanding techniques, and the related mathematical theory, this paper represents the theme of a document as a set of words, which simplifies the form in which the theme is expressed. This lays a foundation for automatic text processing tasks such as automatic classification of Chinese text, automatic summarization, and case retrieval.
     At the same time, the paper applies intelligent computing theory to thematic word extraction from Chinese text and designs a genetic algorithm suited to extracting themes from this kind of data set, further broadening the application fields of intelligent computing.
