Research on Automatic Chinese Text Classification Based on Statistical Methods
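Abstract
With the development of information technology, society has moved from an era of information scarcity to a digital era of extreme information abundance. Finding the needed information quickly and effectively in this mass of data has therefore become an important research topic. Automatic text classification was proposed, and studied in applications, for this purpose. It matters because it greatly shortens the time needed to organize material, makes information retrieval more convenient, and supports the archiving and management of real-world documents.
     This paper applies statistical methods to explore automatic text classification in both theory and practice. The work covers the following aspects:
     1. It examines the definition, common models and common algorithms of statistics-based automatic text classification.
     2. It discusses the general methods and steps for building an automatic text classifier, along with the relevant technical details.
     3. Under the vector space model, it implements three classifiers built with the vector-distance weighted algorithm, the representative-vector algorithm and the center-vector algorithm. Testing and analyzing the three classifiers with characters and with words as features shows that: ① with the same classification algorithm, using words as feature terms classifies better than using characters; ② the algorithm used to build the classifier strongly affects classification quality. For example, the center-vector algorithm outperforms the other two algorithms with both character and word features: with characters as features its average recall is 80.73% and its average precision is 82.94%, and with words as features its average recall is 83.6% and its average precision is 85.97%; ③ the choice of corpus also affects classification quality. For example, in a test on web page data from Sina (www.sina.com.cn), using the center-vector classifier with words as features, the average precision is 89.31% and the average recall is 88.33%.
     4. An automatic classifier rebuilt on an improved center-vector method achieved an average recall of 90.35% and an average precision of 90.87% in open testing, and an average recall of 98.36% and an average precision of 98.74% in closed testing, which indicates that the improved algorithm is suitable for Chinese text classification.
     The experimental data obtained in this paper can guide the development of practical text classification systems. The research can be applied to network information retrieval, information filtering, automatic Chinese text classification, automatic Chinese web page classification and related application areas.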
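The center-vector classifier named in item 3 can be illustrated with a short sketch. The abstract does not give the algorithm's details, so this is not the author's implementation: it assumes the common centroid formulation (one averaged vector per class, each document assigned to the class whose centroid has the highest cosine similarity), raw term-frequency weights, and single characters as features. All function names and the toy training data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def char_features(text):
    """Count each non-space character (the 'character as feature' setting)."""
    return Counter(ch for ch in text if not ch.isspace())


def normalize(vec):
    """Scale a sparse vector to unit length; an all-zero vector becomes {}."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {k: v / norm for k, v in vec.items()} if norm else {}


def train_centroids(labeled_docs, featurize=char_features):
    """Sum the unit-length document vectors of each class, then normalize.

    Assumption: the centroid is the (normalized) mean of the class's
    document vectors; the source does not specify the weighting scheme.
    """
    sums = defaultdict(Counter)
    for text, label in labeled_docs:
        for term, weight in normalize(featurize(text)).items():
            sums[label][term] += weight
    return {label: normalize(vec) for label, vec in sums.items()}


def cosine(a, b):
    """Dot product of two unit-length sparse vectors (their cosine)."""
    return sum(weight * b.get(term, 0.0) for term, weight in a.items())


def classify(text, centroids, featurize=char_features):
    """Return the label of the centroid most similar to the document."""
    doc = normalize(featurize(text))
    return max(centroids, key=lambda label: cosine(doc, centroids[label]))


# Toy training data, invented purely for illustration.
train = [
    ("足球比赛进球", "sports"),
    ("篮球比赛得分", "sports"),
    ("股票市场上涨", "finance"),
    ("银行利率下调", "finance"),
]
centroids = train_centroids(train)
print(classify("足球比赛", centroids))  # sports
```

Passing a different `featurize` function, for example one that counts the output of a word segmenter, would switch the same classifier to the word-feature setting that the abstract compares against.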
With the development of information technology, people have moved from an age of information scarcity into a digital age in which information is extremely abundant. How to acquire useful information quickly and effectively from this sea of information has become a very important problem. For this purpose, automatic text classification has been proposed and studied in applications.
    This paper details research on the theory and practice of automatic text classification using statistical methods. The main aspects of the paper are as follows:
    1. The definition, commonly used models and commonly used algorithms of classification are discussed theoretically.
    2. The general methods and key technologies for constructing a classifier are discussed.
    3. We employ the vector-distance weighted algorithm, the representative-vector-distance algorithm and the center-vector algorithm to construct classifiers. Experiments with the three classification algorithms were then run on two feature sets (Chinese characters and Chinese words). Analysis of the experimental results shows that: (1) with the same classifier, using Chinese words as features gives better classification results than using Chinese characters; (2) the choice of classification algorithm strongly affects the classification result. For example, the center-vector algorithm obtains better results than the other two algorithms: with character features its average recall is 80.73% and its average precision is 82.94%, and with Chinese-word features its average recall is 83.6% and its average precision is 85.97%; (3) different corpora also influence the classification result. For example, using news web pages from the web site "www.sina.com.cn" as the corpus, with the center-vector classifier and Chinese words as features, the average precision is 89.31% and the average recall is 88.33%.
    4. For the improved center-vector algorithm, the average recall is 90.35% and the average precision is 90.87% in the open test, and 98.36% and 98.74% respectively in the closed test. The experimental results indicate that the improved algorithm is suitable for automatic Chinese text classification.
    This study can be applied to network information retrieval, information filtering, automatic Chinese text classification, automatic Chinese web page classification and other application fields.
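The average recall and average precision figures quoted in the abstract can be computed along the lines of the sketch below. The abstract does not say how the averages are formed; this example assumes macro-averaging, meaning recall and precision are computed per class and then averaged with equal weight. The function name and the sample labels are hypothetical.

```python
from collections import Counter


def macro_recall_precision(gold, predicted):
    """Return (average recall, average precision) over the gold classes.

    Assumption: macro-averaging. A class that is never predicted
    contributes a precision of 0.0.
    """
    gold_count = Counter(gold)
    pred_count = Counter(predicted)
    correct = Counter(g for g, p in zip(gold, predicted) if g == p)

    classes = sorted(gold_count)
    recall = sum(correct[c] / gold_count[c] for c in classes) / len(classes)
    precision = sum(
        correct[c] / pred_count[c] if pred_count[c] else 0.0 for c in classes
    ) / len(classes)
    return recall, precision


# Invented labels for illustration only.
gold = ["sports", "sports", "finance", "finance"]
predicted = ["sports", "finance", "finance", "finance"]
recall, precision = macro_recall_precision(gold, predicted)
print(f"average recall = {recall:.2%}, average precision = {precision:.2%}")
```

Here one sports document is misfiled under finance, so sports has recall 1/2 and precision 1/1, while finance has recall 2/2 and precision 2/3.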
