Research and Implementation of Algorithms for SVM-Based Chinese Text Classification
Abstract
Text classification organizes information according to the content and structure of texts, helping people pick out what they need. The support vector machine (SVM) is an active research topic in machine learning and pattern recognition, and in recent years it has been widely applied to text classification.
     Building on SVMs, this paper studies the algorithms involved in text classification in depth and uses them to design and implement a Chinese text classification system. The system consists of the following modules:
     (1) Preprocessing. The forward maximum matching (FMM) and reverse maximum matching word segmentation algorithms are implemented, and an improved segmentation algorithm is proposed and implemented. The improved algorithm replaces the traditional plain-text word list with a two-level hash dictionary indexed by first character, and its revised matching rules handle both ambiguous words and unknown (out-of-vocabulary) words effectively. Stop words are then removed by incorporating an encoding strategy into the stop-word list matching process.
     (2) Feature processing. Four feature selection algorithms are implemented: mutual information (MI), document frequency (DF), information gain (IG), and the χ² statistic (CHI). Three factors through which feature words constrain classification accuracy are expressed as formulas and then combined with mutual information to give an improved MI feature selection algorithm. The improved algorithm keeps the computational simplicity of the original MI while also favoring the selection of strongly associated words.
     (3) Classifier construction. The standard SVM is extended to a multi-class classifier (M-SVMs) to handle classification over multiple categories. An SVM-based incremental learning method is proposed for dynamically growing sample sets. An improved algorithm for constructing SVM classifiers based on the ensemble learning method AdaBoost is also proposed; it uses rule-based sampling, which helps with classification when the sample distribution is imbalanced.
     In addition, the algorithms implemented in each module of the system are evaluated and compared through experiments.
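Item (1) of the abstract mentions forward maximum matching over a dictionary indexed by first character. The abstract does not give the dictionary layout or the improved matching rules, so the following is only a minimal sketch of plain forward maximum matching under assumed details: the first level maps a word's first character to a second-level hash set of words, and the names `build_index`, `fmm_segment`, and `max_len` are hypothetical.

```python
from collections import defaultdict


def build_index(words):
    """Assumed two-level layout: first character -> hash set of words."""
    index = defaultdict(set)
    for word in words:
        if word:
            index[word[0]].add(word)
    return index


def fmm_segment(text, index, max_len=4):
    """Plain forward maximum matching (not the paper's improved variant)."""
    tokens = []
    i = 0
    while i < len(text):
        # Only words sharing the current first character can match here.
        candidates = index.get(text[i], ())
        match = text[i]  # fall back to a single character
        # Try the longest window first and shrink until a word is found.
        for length in range(min(max_len, len(text) - i), 1, -1):
            piece = text[i:i + length]
            if piece in candidates:
                match = piece
                break
        tokens.append(match)
        i += len(match)
    return tokens
```

With the toy dictionary 研究, 研究生, 生命, 起源, calling `fmm_segment("研究生命起源", index)` greedily takes 研究生 first and returns 研究生 / 命 / 起源, which is the kind of overlap ambiguity that motivates improved matching rules.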
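Item (2) of the abstract builds on mutual information for feature selection. The improved formula is not stated in the abstract, so the sketch below shows only the standard document-count estimate MI(t, c) = log(A * N / ((A + B) * (A + C))), where A is the number of documents of class c containing term t, A + B the number of documents containing t, A + C the number of documents in class c, and N the total number of documents. The function names and the choice of taking each term's maximum score over classes are assumptions made for illustration.

```python
import math
from collections import Counter


def mutual_information(docs, labels):
    """Score every (term, class) pair that co-occurs at least once."""
    n = len(docs)
    class_count = Counter(labels)   # A + C for each class
    term_count = Counter()          # A + B for each term
    joint_count = Counter()         # A for each (term, class) pair
    for tokens, label in zip(docs, labels):
        for term in set(tokens):    # count documents, not occurrences
            term_count[term] += 1
            joint_count[(term, label)] += 1
    return {
        (term, label): math.log(a * n / (term_count[term] * class_count[label]))
        for (term, label), a in joint_count.items()
    }


def select_features(docs, labels, k):
    """Keep the k terms with the highest per-class MI (assumed max rule)."""
    best = {}
    for (term, _label), score in mutual_information(docs, labels).items():
        best[term] = max(score, best.get(term, float("-inf")))
    # Sort by descending score, breaking ties by term for determinism.
    return sorted(best, key=lambda term: (-best[term], term))[:k]
```

In this plain form a term seen in a single document of one class scores as high as a term seen in every document of that class, so rare words can be over-rewarded.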
