Research on Text Classification and Related Technologies
Abstract
With the rapid development and growing popularity of the Internet, the volume of electronic text is expanding quickly. How to organize and manage this information effectively, and to find the information a user needs quickly, accurately, and comprehensively, is a major challenge facing information science and technology. As a key technique for processing and organizing large collections of text, text classification can alleviate the problem of information disorder to a large extent, helping users locate and filter the information they need. As the technical foundation of information filtering, information retrieval, search engines, text databases, digital libraries, and related fields, text classification also has broad application prospects.

This dissertation studies text classification and its related technologies. With the goal of improving the speed, accuracy, and stability of classification methods, it proposes several effective new or improved methods and techniques. It also investigates text genre classification, a new research direction within text classification, and text information filtering, an important application area of text classification. The main research content and contributions comprise the following five points.
(1) Selection of training samples

The choice of training samples is crucial to building a classifier: atypical samples not only lengthen training time but also tend to introduce noise into the training set. For KNN, a widely used text classification method, the dissertation analyzes what constitutes a typical sample and proposes a density-based sample selection algorithm. The density around a sample is estimated from the number of samples in its ε-neighborhood, and the boundaries between classes are identified from the number of samples of different classes in that neighborhood. Samples in high-density regions are pruned to reduce the number of atypical samples, while samples near class boundaries are retained as far as possible to preserve the classifier's accuracy.
(2) Chinese text classification based on the maximum entropy model

Chinese and English text classification differ in many respects: the way text features are extracted and their degree of sparsity both differ, so classification results differ as well. The difference is especially pronounced for the maximum entropy model, because the entropy of Chinese is higher than that of English. Starting from how Chinese text features are generated, the dissertation uses two feature generation methods, word segmentation and N-grams, applies absolute discounting to smooth the feature probabilities, and compares the performance of the maximum entropy model against Naive Bayes, KNN, and SVM. Experiments showed that the maximum entropy model was not sufficiently stable, so Bagging was combined with it to improve its stability.
(3) Using hierarchical classification to improve the performance of flat classification

Unlike previous hierarchical classification work, the dissertation uses a hierarchy whose underlying structure is a graph, and applies it to flat classification problems to improve both precision and recall. In an ordinary class hierarchy, the confusion relation between sibling classes under the same parent is symmetric, but in reality confusion between classes is not symmetric. Starting from the classifier's confusion matrix, the dissertation introduces the notion of confusion classes. The class hierarchy built from confusion classes considers the relations between classes from the perspective of precision and recall, and expresses the asymmetry of the confusion relation.