Research and Application of Semi-Supervised Clustering Algorithms
Abstract
This dissertation focuses on clustering algorithms in semi-supervised learning:
     1. A Tri-Set similarity measurement, a non-Euclidean space metric that incorporates structural information, is proposed. Based on the Similar Feature Set, the Rejective Feature Set, and the Arbitral Feature Set, a new similarity measurement and a new clustering algorithm, Weight Affinity Propagation, are developed.
     2. The Seeds Affinity Propagation algorithm is proposed, fusing the Tri-Set similarity measurement, a semi-supervised learning strategy, and Affinity Propagation. It reduces the complexity of text clustering, avoids random initialization and the trap of local minima, improves accuracy, and offers better robustness.
     3. The Incremental Affinity Propagation semi-supervised clustering algorithm, which incorporates incremental learning, is proposed. The algorithm embeds the prior information of labeled samples into the similarity matrix and spreads it with an incremental learning strategy. Experimental results show that the new algorithm achieves better performance.
     4. The effect of the number of labeled samples on semi-supervised clustering is analyzed. Experiments with five algorithms on three data sets show that increasing the number of labeled samples helps semi-supervised clustering algorithms achieve better performance; however, once the number grows beyond a check point, the improvement slows or even stagnates.
     5. The effect of the number of unlabeled samples on semi-supervised clustering is analyzed. Experiments with four new semi-supervised clustering algorithms on three data sets show that, in most cases, incremental learning with a small number of unlabeled samples helps semi-supervised clustering achieve better results; however, when the number of unlabeled samples grows beyond the check point, the improvement diminishes or even reverses.
Machine learning is an important area of Artificial Intelligence. Its purpose is to give computers the ability to learn and to make good use of the acquired knowledge in applications. In machine learning, supervised learning provides the computer with good exemplars, in which plenty of information is embedded; it also supplies the learning rules and the right directions. On the other hand, without any exemplar or prior knowledge, unsupervised learning depends solely on the algorithm itself to mine information. Due to this aimlessness, unsupervised learning is often used only as a pre-processing tool in data mining. Semi-supervised learning, however, can carry out data analysis and mining effectively with a few exemplars or a little prior information, avoiding both the aimlessness and the uncertainty. Recently, semi-supervised learning has become a new and promising direction of machine learning.
Nonetheless, most successful semi-supervised machine learning algorithms address semi-supervised classification or semi-supervised regression; there has been little research on semi-supervised clustering. Accordingly, our research focuses on semi-supervised clustering analysis and proposes several new algorithms. For comparative analysis, the new semi-supervised clustering algorithms are applied to text mining. Meanwhile, for the key question of how labeled and unlabeled sample sizes affect semi-supervised clustering, we carry out detailed experimental analysis and draw conclusions. The main contributions are as follows:
1) A new non-Euclidean space similarity measurement containing structural information, namely the Tri-Set similarity measurement, is proposed. In text mining, many frameworks employ the vector space model and similarity metrics in Euclidean space. Clustering texts this way has the advantage of being simple and easy to implement. However, as the data scale grows, the vector space becomes high-dimensional and sparse, and the computational complexity grows exponentially. To overcome this difficulty, a non-Euclidean space similarity measurement is proposed based on the definitions of the Similar Feature Set (SFS), the Rejective Feature Set (RFS), and the Arbitral Feature Set (AFS). The new similarity measurement not only breaks the Euclidean space constraint but also captures the structural information of documents. On this basis, a novel clustering algorithm named Weight Affinity Propagation (WAP) is developed by combining the new similarity measurement with AP. The benchmark dataset Reuters-21578 is used to test the proposed algorithm. Experimental results show that the proposed method is superior to classical k-means, traditional SOFM, and Affinity Propagation with a classic similarity measurement.
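To make the Tri-Set construction concrete, the following Python sketch splits the features of two term-frequency vectors into the similar and rejective sets and combines them into a score. The combination rule and the weights alpha and beta are illustrative assumptions rather than the thesis' exact definition, and the arbitral set is left empty here because it requires the global vocabulary.

```python
from collections import Counter

def tri_sets(doc_i, doc_j):
    """Split the features of two term-frequency vectors into three sets:
    SFS (features in both documents), RFS (features in exactly one),
    and AFS (in neither; needs the global vocabulary, so empty here)."""
    sfs = set(doc_i) & set(doc_j)
    rfs = set(doc_i) ^ set(doc_j)
    afs = set()
    return sfs, rfs, afs

def tri_set_similarity(doc_i, doc_j, alpha=1.0, beta=0.5):
    """Hypothetical Tri-Set similarity: reward the term mass shared by
    the two documents and penalise the rejective mass. alpha and beta
    are illustrative trade-off weights, not the thesis' coefficients."""
    sfs, rfs, _ = tri_sets(doc_i, doc_j)
    shared = sum(doc_i[f] + doc_j[f] for f in sfs)
    reject = sum(doc_i.get(f, 0) + doc_j.get(f, 0) for f in rfs)
    return alpha * shared - beta * reject

# Toy usage: two short "documents" as term-frequency counters.
d1 = Counter("the cat sat on the mat".split())
d2 = Counter("the cat ate the rat".split())
print(tri_set_similarity(d1, d2))
```

Working directly on sparse feature sets, rather than on dense vectors in Euclidean space, is what lets such a measurement sidestep the high-dimensional sparsity problem described above.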
2) A new semi-supervised clustering algorithm, Seeds Affinity Propagation (SAP), is proposed. It makes two main contributions: (1) use of the Tri-Set similarity metric; (2) a novel seed construction method that improves the semi-supervised clustering process. To study the performance of the new algorithm, we applied it to the benchmark data set Reuters-21578 and compared it with k-means, the original Affinity Propagation with the cosine coefficient, Tri-Set Affinity Propagation, and semi-supervised Affinity Propagation with the cosine coefficient. We also analyzed the individual effect of each of the two contributions. Experimental results show that the proposed similarity metric is more effective in text clustering (F-measure ca. 21% higher than in the AP algorithm) and that the proposed semi-supervised strategy achieves both better clustering results and faster convergence (using only 76% of the iterations of the original AP). The complete SAP algorithm provides enhanced robustness compared with all other methods, obtains a higher F-measure (ca. 40% improvement over k-means and AP) and lower entropy (ca. 28% decrease over k-means and AP), and significantly improves clustering execution time (about twenty times faster than k-means).
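Since WAP, SAP, and IAP all build on Affinity Propagation's message passing, a minimal NumPy sketch of the standard responsibility and availability updates from Frey and Dueck (2007) may help orient the reader. The damping factor and iteration count below are common defaults, not values taken from the thesis, and the seed strategy would enter through the similarity matrix S (e.g., via the preferences on its diagonal) rather than through this core loop.

```python
import numpy as np

def affinity_propagation(S, max_iter=200, damping=0.5):
    """Minimal Affinity Propagation (Frey and Dueck, 2007).

    S : (n, n) similarity matrix; the diagonal S[k, k] holds the
    preferences that control how many exemplars emerge.
    Returns the exemplar index assigned to each point.
    """
    n = S.shape[0]
    R = np.zeros((n, n))  # responsibilities r(i, k)
    A = np.zeros((n, n))  # availabilities a(i, k)

    for _ in range(max_iter):
        # r(i,k) = s(i,k) - max_{k' != k} [a(i,k') + s(i,k')]
        AS = A + S
        idx = np.argmax(AS, axis=1)
        first = AS[np.arange(n), idx]
        AS[np.arange(n), idx] = -np.inf
        second = AS.max(axis=1)
        R_new = S - first[:, None]
        R_new[np.arange(n), idx] = S[np.arange(n), idx] - second
        R = damping * R + (1 - damping) * R_new

        # a(i,k) = min(0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k)))
        # a(k,k) =             sum_{i' != k}             max(0, r(i',k))
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, R.diagonal())
        A_new = Rp.sum(axis=0)[None, :] - Rp
        diag = A_new.diagonal().copy()
        A_new = np.minimum(A_new, 0)
        np.fill_diagonal(A_new, diag)
        A = damping * A + (1 - damping) * A_new

    return np.argmax(A + R, axis=1)  # exemplar chosen by each point
```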
3) A new semi-supervised clustering algorithm with incremental learning, Incremental Affinity Propagation (IAP), is proposed. Since the original Affinity Propagation cannot directly exploit partially labeled data, a semi-supervised scheme called incremental affinity propagation clustering is proposed following Seeds Affinity Propagation. In this scheme, the prior knowledge is represented by adjusting the similarity matrix, and incremental learning is applied to amplify it. To examine the effectiveness of the method, we apply it to the text clustering problem and describe the specific procedure accordingly. The method is applied to the benchmark data set Reuters-21578. Numerical results show that the proposed method performs very well on this data set and has clear advantages over two other commonly used clustering methods. In experiments on four data scales, the average F-measure of IAP is 36.2% and 31.2% higher than Affinity Propagation with the cosine coefficient (AP(CC)) and k-means, respectively, and the average entropy is 14.7% and 15.9% lower than AP(CC) and k-means, respectively.
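A minimal sketch of the similarity-matrix adjustment described above, under the assumption that the prior knowledge arrives as pairs of labeled samples: pairs known to share a label are pushed toward the matrix's maximum, pairs known to differ toward its minimum. These boost and penalty values are illustrative; the thesis' exact adjustment rule may differ.

```python
import numpy as np

def embed_prior(S, labeled_pairs):
    """Inject pairwise prior knowledge into a similarity matrix S.

    labeled_pairs: iterable of (i, j, same_label) triples derived from
    the labeled samples. Same-label pairs get the matrix maximum,
    different-label pairs the minimum -- an illustrative choice only.
    """
    S = S.copy()
    hi, lo = S.max(), S.min()
    for i, j, same in labeled_pairs:
        S[i, j] = S[j, i] = hi if same else lo
    return S
```

Running affinity_propagation on the adjusted matrix and repeating the adjustment as new samples arrive in small batches gives one plausible reading of the incremental scheme.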
4) The effect of labeled sample sizes on semi-supervised clustering is discussed and analyzed. To investigate this effect, we implement five semi-supervised algorithms based on k-means and AP: Seeded k-means, Constrained k-means, Loose Seeds Affinity Propagation, Compact Seeds Affinity Propagation, and Seeds Affinity Propagation. All five algorithms are applied to three benchmark data sets in text mining: Reuters-21578, NSF Research Awards Abstracts 1990-2003, and 20 Newsgroups. We plot the F-measure and entropy moving-average curves of the five algorithms on the three data sets, and we propose a satisfactory-rate measurement to locate the local optimal values, which we call check points. Experimental results show that increasing the labeled sample size helps semi-supervised clustering algorithms achieve better results; however, once the check points are reached, the improvement grows slowly or stagnates. Moreover, the locations of the check points vary with the semi-supervised clustering algorithm and the data set.
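F-measure and entropy are standard external measures in text clustering; the sketch below uses their common formulations (per-class best-matching F1 weighted by class size, and the size-weighted average of per-cluster class entropy), which we assume correspond to the thesis' usage.

```python
from collections import Counter
from math import log2

def f_measure(labels_true, labels_pred):
    """For each true class, take the F1 of its best-matching cluster,
    then average over classes weighted by class size (higher is better)."""
    n = len(labels_true)
    pair = Counter(zip(labels_true, labels_pred))
    n_class, n_clust = Counter(labels_true), Counter(labels_pred)
    total = 0.0
    for c, nc in n_class.items():
        best = 0.0
        for k, nk in n_clust.items():
            nij = pair[(c, k)]
            if nij:
                p, r = nij / nk, nij / nc
                best = max(best, 2 * p * r / (p + r))
        total += nc / n * best
    return total

def entropy(labels_true, labels_pred):
    """Size-weighted average of each cluster's class entropy (lower is better)."""
    n = len(labels_true)
    pair = Counter(zip(labels_true, labels_pred))
    n_clust = Counter(labels_pred)
    total = 0.0
    for k, nk in n_clust.items():
        e = -sum((pair[(c, k)] / nk) * log2(pair[(c, k)] / nk)
                 for c in set(labels_true) if pair[(c, k)])
        total += nk / n * e
    return total
```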
5) The effect of unlabeled sample sizes on semi-supervised clustering is discussed and analyzed. To observe this effect, we propose four novel semi-supervised algorithms based on k-means and Affinity Propagation: Incremental Seeded k-means, Incremental Constrained k-means, Incremental Affinity Propagation, and Incremental Seeds Affinity Propagation. These new algorithms are applied to the three benchmark data sets, and the F-measure and entropy curves of unlabeled sample learning from 1 to 400 samples are plotted. Then, as in the analysis of the labeled sample tests, the satisfactory rate is applied to find the check points. The experimental results show that, in most cases, incremental learning with a smaller unlabeled sample size helps the semi-supervised clustering algorithms achieve better results; but when the unlabeled sample size increases beyond the check points, the improvement may diminish or even be counteracted. In addition, the check point locations vary with the semi-supervised clustering algorithm.
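The satisfactory-rate measurement itself is not defined in this abstract, so the check-point detector below is only a hypothetical stand-in: smooth the metric curve with a moving average and report the first sample size at which the smoothed gain drops below a tolerance.

```python
import numpy as np

def moving_average(scores, window=10):
    """Simple moving average of a metric curve indexed by sample size."""
    return np.convolve(scores, np.ones(window) / window, mode="valid")

def find_check_point(scores, window=10, tol=1e-3):
    """Hypothetical stand-in for the thesis' satisfactory rate: the index
    (offset by the smoothing window) of the first below-tolerance gain."""
    gains = np.diff(moving_average(scores, window))
    below = np.where(gains < tol)[0]
    return int(below[0] + window) if below.size else None
```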
In summary, based on an analysis of classical clustering algorithms, we first propose several novel clustering algorithms by introducing semi-supervised learning strategies; these new algorithms obtain good clustering results in the experiments. We then extend our study to the effect of labeled and unlabeled sample sizes on semi-supervised clustering. From the analysis of the experimental results, it can be concluded that increasing either the labeled or the unlabeled sample size can improve semi-supervised clustering performance; however, when the labeled or unlabeled data size increases beyond the check points, the samples' effect on semi-supervised clustering diminishes or even disappears.