Research on Key Issues in the Application of the Cloud Model to Text Mining
Abstract
Text Mining (TM) is the process of discovering implicit, potentially valuable knowledge, such as structures, models, and patterns, from text. TM touches data mining, pattern recognition, information retrieval, natural language processing, and several other fields. Because text is the principal medium for storing information, the importance of TM grows increasingly obvious.
     In current TM research, traditional data mining methods still dominate. As research deepens, however, applying these methods to text faces ever more severe challenges: the high dimensionality and sparsity of text objects, the excessive complexity of the algorithms, and the need for prior knowledge have seriously hampered the development and adoption of TM.
     In the final analysis, these problems all stem from the uncertainty of natural language. The uncertainty of natural language, and of written text in particular, originates in the uncertainty of human thought. It gives people a richer space of interpretation and deeper cognitive abilities, but it also creates a series of hard problems for TM. If, starting from reducing the complexity of natural language, we innovate on the basis of existing techniques and work out an uncertainty-oriented artificial intelligence approach suited to TM, the field's development will be greatly accelerated.
     The cloud model is an important tool in research on uncertain knowledge. Exploiting its efficient conversion between qualitative concepts and quantitative data, this thesis introduces the cloud model into the key problems of TM. Our primary contributions are as follows.
     (1) Theoretical extensions of the cloud model in TM.
     Text knowledge representation, the physical-space conversion of the corresponding model, and similarity measures for text concepts are studied, laying the theoretical groundwork for introducing the cloud model. Three aspects are covered.
     1) Text information table based on the VSM.
     The information table of knowledge representation systems is introduced into text representation; on this basis, a text collection is expressed as a text information table built on the vector space model (VSM).
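Such a text information table can be sketched as a document-by-term weight matrix. The example below uses TF-IDF weighting as a stand-in; the thesis does not fix a specific weighting scheme at this point, and the function name is illustrative:

```python
import math
from collections import Counter

def text_information_table(docs):
    """Build a VSM-style text information table: one row per document,
    one column per vocabulary term, cells holding TF-IDF weights.
    (A sketch; the thesis's exact weighting scheme is not specified here.)"""
    vocab = sorted({t for d in docs for t in d})
    n = len(docs)
    # document frequency of each term
    df = Counter(t for d in docs for t in set(d))
    table = []
    for d in docs:
        tf = Counter(d)
        # term frequency normalized by document length, times smoothed IDF
        row = [tf[t] / len(d) * math.log((n + 1) / (df[t] + 1)) for t in vocab]
        table.append(row)
    return vocab, table
```

With the smoothed IDF used here, a term occurring in every document gets weight zero, so uninformative terms drop out of the representation automatically.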
     2) Text information table conversion based on the cloud model.
     When the cloud model is used to represent the uncertain relations between texts, the values of every attribute must lie in the same universe of discourse; that is, a text's values on different attributes must share the same physical meaning. The attributes of a raw text information table, however, differ in meaning and vary widely in value, so the table must be converted before cloud-based mining. Building on probability statistics, a new conversion algorithm is proposed that maps the attributes of the text information table into a unified physical space reflecting the probability distribution of their values.
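One plausible reading of this conversion, sketched below under that assumption, maps every attribute onto the common interval [0, 1] through its empirical cumulative frequency, so all attributes share one physical space and each cell reflects the value's position within its attribute's distribution. The thesis's exact probabilistic transform may differ:

```python
def unify_table(table):
    """Map every attribute (column) of a text information table onto [0, 1]
    via its empirical cumulative frequency. A sketch of the idea that all
    attributes end up in one physical space encoding their own distribution;
    the thesis's actual transform may differ."""
    n_rows, n_cols = len(table), len(table[0])
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for j in range(n_cols):
        col = [row[j] for row in table]
        for i in range(n_rows):
            # fraction of this column's values not exceeding the cell's value
            out[i][j] = sum(1 for v in col if v <= table[i][j]) / n_rows
    return out
```

After the transform, columns that originally lived on wildly different scales (say, raw counts vs. weights) become directly comparable.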
     3) Text similarity measurement based on cloud similarity.
     Cosine similarity is the method most commonly used in TM to measure the relatedness of texts, yet every existing similarity measure rests on strict matching between object attributes and gives too little consideration to the text object as a whole. Combining the overall distribution with the individual characteristics of text objects, a novel cloud similarity based on the numerical characteristics of cloud vectors is proposed; it describes each text as a whole, so the similarity between texts reduces to the similarity between their cloud vectors. The measure not only improves mining performance and quickly identifies common features among objects, but also fully accounts for the randomness and fuzziness of attribute values.
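A measure of this kind can be sketched with the standard backward cloud generator (the variant without certainty degrees), which estimates the numerical characteristics (Ex, En, He) of a sample; two texts are then compared through their characteristic vectors, here by cosine. The exact formulation in the thesis may differ:

```python
import math

def backward_cloud(xs):
    """Backward cloud generator without certainty degrees: estimate the
    numerical characteristics (Ex, En, He) of a normal cloud from data."""
    n = len(xs)
    ex = sum(xs) / n
    # En from the mean absolute deviation (standard estimator)
    en = math.sqrt(math.pi / 2) * sum(abs(x - ex) for x in xs) / n
    # He from the gap between sample variance and En^2
    s2 = sum((x - ex) ** 2 for x in xs) / (n - 1)
    he = math.sqrt(abs(s2 - en ** 2))
    return ex, en, he

def cloud_similarity(xs, ys):
    """Sketch of cloud similarity: cosine between the (Ex, En, He)
    characteristic vectors of two texts' attribute-value samples."""
    a, b = backward_cloud(xs), backward_cloud(ys)
    dot = sum(u * v for u, v in zip(a, b))
    na = math.sqrt(sum(u * u for u in a))
    nb = math.sqrt(sum(v * v for v in b))
    return dot / (na * nb) if na and nb else 0.0
```

Because each text collapses to three numbers per cloud, the comparison sidesteps the dimensionality of the raw attribute space entirely.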
     (2) Automatic text feature selection based on the cloud model (FAS).
     Feature selection is an effective way to reduce the dimensionality of the text feature space, but existing methods fix the selection threshold empirically, through observation or experiment. By combining the overall distribution of features with their local distribution across categories, a high-performance feature automatic selection algorithm (FAS) is proposed. Using cloud membership degrees, FAS corrects the feature distribution and weights and selects features without any prior knowledge, fully reflecting the probabilistic character of the features. Comparative experiments and analysis show that the selected feature set is smaller yet classifies more accurately than those produced by the main existing feature selection methods.
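A speculative sketch of the idea, assuming the "typical score" concept is modeled as a normal cloud over all feature scores and a feature is kept when its score is high yet clearly atypical; the thesis's actual weighting and selection rule are not reproduced here:

```python
import math

def fas_select(scores):
    """Speculative sketch of cloud-membership feature selection: model the
    overall score distribution as a normal cloud (Ex, En), compute each
    feature's membership to the 'typical score' concept, and keep features
    whose scores are above Ex with low membership -- no hand-set threshold."""
    n = len(scores)
    ex = sum(scores) / n
    en = math.sqrt(math.pi / 2) * sum(abs(s - ex) for s in scores) / n
    if en == 0:
        en = 1e-12  # degenerate case: all scores identical
    selected = []
    for i, s in enumerate(scores):
        # membership of this score to the cloud of typical scores
        mu = math.exp(-(s - ex) ** 2 / (2 * en ** 2))
        if s > ex and mu < 0.5:  # high score, clearly outside the typical cloud
            selected.append(i)
    return selected
```

The point of the sketch is the automation: the cut-off emerges from the score distribution itself rather than from a tuned constant.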
     (3) Text classifier based on cloud concept jumping-up (CCJU).
     The cloud model handles qualitative knowledge representation and qualitative-quantitative conversion well. On this basis, the concept extraction mechanism of the cloud model is applied to text classification. After the text collection is converted into a VSM-based text information table, the qualitative concept of each category is jumped up from the training documents of that category, and a test document is assigned to the category whose qualitative concept is most cloud-similar to it. Comparisons against different classifiers under different feature selection methods show that CCJU not only adapts well to different text features but also outperforms mainstream classifiers.
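A hedged sketch of the scheme: each category's training vectors are jumped up to a qualitative concept by per-dimension backward cloud estimation, and a test vector is assigned to the concept it fits best, scored here by mean per-dimension cloud membership (the names and the scoring rule are illustrative, not the thesis's exact algorithm):

```python
import math

def concept(vectors):
    """Jump one category's documents up to a qualitative concept:
    per-dimension (Ex, En) estimated by a backward cloud generator."""
    m, dims = len(vectors), len(vectors[0])
    ex = [sum(v[d] for v in vectors) / m for d in range(dims)]
    en = [math.sqrt(math.pi / 2) * sum(abs(v[d] - ex[d]) for v in vectors) / m
          for d in range(dims)]
    return ex, en

def classify(train, test_vec):
    """Assign test_vec to the category whose concept it fits best,
    scored by mean per-dimension cloud membership."""
    best, best_score = None, -1.0
    for label, vectors in train.items():
        ex, en = concept(vectors)
        score = 0.0
        for d in range(len(test_vec)):
            if en[d] == 0:
                score += 1.0 if test_vec[d] == ex[d] else 0.0
            else:
                score += math.exp(-(test_vec[d] - ex[d]) ** 2 / (2 * en[d] ** 2))
        score /= len(test_vec)
        if score > best_score:
            best, best_score = label, score
    return best
```

Training here is just concept extraction per category, which is why such a classifier adapts cheaply to different feature sets.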
     (4) Rapid unsupervised text clustering based on cloud similarity (CS-Means).
     To address the shortcomings of existing text clustering algorithms, a rapid unsupervised clustering algorithm based on cloud similarity is proposed. After pretreatment with the FAS algorithm, it applies a stepwise approximation strategy on top of k-Means dynamic clustering to obtain the optimal number of clusters k; the search for k is itself the automatic clustering process. During this process, the numerical characteristics of each text's cloud vector are extracted, and cloud similarity measures the similarity between texts. The algorithm avoids the difficulties brought by the high dimensionality and sparsity of text objects while retaining the efficiency of k-Means, and the stepwise approximation strategy removes the need for prior knowledge of the cluster count, so the clustering results better match the actual distribution of the texts.
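The stepwise approximation of k can be sketched as follows, using plain k-Means with Euclidean distance as a stand-in (the thesis pairs this with cloud similarity on cloud-vector characteristics; the names and the stopping rule below are assumptions):

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Plain k-Means on dense vectors. The thesis's CS-Means would swap the
    Euclidean distance used here for cloud similarity."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        # recompute centroids; keep the old center for an empty cluster
        centers = [[sum(col) / len(cl) for col in zip(*cl)] if cl else centers[j]
                   for j, cl in enumerate(clusters)]
    sse = sum(min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
              for p in points)
    return clusters, sse

def cs_means(points, max_k=10, tol=0.5):
    """Stepwise approximation of the optimal k (a hedged sketch):
    grow k until the relative drop in within-cluster dispersion is small."""
    prev = None
    for k in range(1, max_k + 1):
        _, sse = kmeans(points, k)
        if prev is not None and prev > 0 and (prev - sse) / prev < tol:
            return k - 1  # adding a cluster no longer pays off
        prev = sse
    return max_k
```

Because k is discovered during clustering itself, no prior knowledge of the cluster count is needed, which is the property the abstract emphasizes.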
