An Implementation of Automatic Classification of Chinese Web Pages
Abstract
Search engines are an important tool for retrieving information on the Internet, and automatic classification of Chinese web pages is an important research direction in building Chinese search engines. Automatic classification makes it possible to store web pages in separate databases by category, which improves the recall and precision of a Chinese search engine, and to build classified information resources automatically so that users can be offered a category directory. The quality of the classification also affects the subsequent relevance-ranking step.
     This paper analyzes the structural components of a web page that contribute to classification. Taking into account the characteristics of Chinese web pages and the word-segmentation quality that web page analysis requires, it simplifies and adjusts the existing longest/second-longest matching segmentation algorithm to make it better suited to automatic classification. It then applies the IDF (Inverse Document Frequency) formula, used in information retrieval to weight keywords against relevant documents, to the classification process and, combined with the results of the web page analysis, derives a weight formula with adjustable parameters. A class weight vector library that stores the results of classification training was designed and built to meet the needs of this formula. Using the results of corpus training together with the VSM (vector space model), a practical method for automatic classification of Chinese web pages is implemented.
     In closed and open tests, after training on a large corpus the method identifies relevant web pages with an accuracy above 90%, a clear improvement over the earlier probability distribution algorithm, while its efficiency is roughly on par with that algorithm, which indicates considerable practical value.
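     The combination of IDF weighting and the vector space model mentioned in the abstract can be illustrated with a minimal sketch. It is not the paper's formula: the abstract does not give the adjustable-parameter weight formula or the layout of the weight vector library, so the smoothed IDF, the per-class dictionaries, and the names `train`, `cosine`, and `classify` below are all assumptions chosen for illustration. Input pages are assumed to be already segmented into word tokens.

```python
import math
from collections import Counter


def train(docs_by_class):
    """Build one IDF-weighted term vector per class.

    docs_by_class maps a class label to a list of documents, where each
    document is a list of word tokens (segmentation is assumed done).
    Returns (class_vectors, idf). This is an illustrative stand-in for the
    paper's weight vector library, not its actual design.
    """
    all_docs = [doc for docs in docs_by_class.values() for doc in docs]
    n_docs = len(all_docs)

    # Document frequency: number of documents containing each term.
    df = Counter()
    for doc in all_docs:
        df.update(set(doc))

    # Smoothed IDF (an assumption): terms found in every document
    # still receive a small positive weight.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}

    class_vectors = {}
    for label, docs in docs_by_class.items():
        tf = Counter(t for doc in docs for t in doc)
        class_vectors[label] = {t: f * idf[t] for t, f in tf.items()}
    return class_vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(tokens, class_vectors, idf):
    """Return the label whose class vector is closest to the page vector."""
    # Terms never seen in training carry no weight and are skipped.
    tf = Counter(t for t in tokens if t in idf)
    page_vector = {t: f * idf[t] for t, f in tf.items()}
    return max(class_vectors, key=lambda label: cosine(page_vector, class_vectors[label]))


# Toy corpus with hypothetical, already-segmented tokens.
corpus = {
    "sports": [["football", "match", "team"], ["match", "champion"]],
    "finance": [["stock", "market"], ["bank", "market", "rate"]],
}
vectors, idf_table = train(corpus)
predicted = classify(["team", "champion", "match"], vectors, idf_table)
```

     In this sketch every term occurrence counts equally. A weight formula with adjustable parameters, as the abstract describes, would presumably scale term counts by where they occur in the page (for example in the title), but those details are not given here.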
