Research on Affinity Propagation Clustering Algorithms
Abstract
Affinity Propagation (AP) is a clustering algorithm that has become increasingly popular in recent years for its efficiency and speed. Compared with traditional clustering algorithms, AP can cluster large-scale, multi-class data sets in less time. Because it does not require the similarity matrix to be symmetric, it also copes well with non-Euclidean spaces. AP has therefore attracted considerable attention and continuous discussion since its publication.
     However, AP has two important shortcomings. First, it is only suitable for clustering compact, hyperspherical data sets and performs poorly on loosely or complexly structured ones. Second, as an unsupervised learning method, it cannot be applied directly to clustering problems that carry constraints. This thesis presents a dedicated study of these problems.
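AP finds exemplars by exchanging two kinds of messages between points: responsibilities r(i,k) and availabilities a(i,k). A minimal NumPy sketch of the standard update rules (the textbook Frey-Dueck formulation, not code from this thesis):

```python
import numpy as np

def affinity_propagation(S, damping=0.5, max_iter=200):
    """Minimal AP: S is an n x n similarity matrix, diagonal = preferences."""
    n = S.shape[0]
    R = np.zeros((n, n))  # responsibilities r(i,k)
    A = np.zeros((n, n))  # availabilities a(i,k)
    for _ in range(max_iter):
        # r(i,k) = s(i,k) - max_{k'!=k} [a(i,k') + s(i,k')]
        AS = A + S
        idx = np.argmax(AS, axis=1)
        first = AS[np.arange(n), idx]
        AS[np.arange(n), idx] = -np.inf
        second = AS.max(axis=1)
        Rnew = S - first[:, None]
        Rnew[np.arange(n), idx] = S[np.arange(n), idx] - second
        R = damping * R + (1 - damping) * Rnew
        # a(i,k) = min(0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k)))
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, R.diagonal())
        Anew = Rp.sum(axis=0)[None, :] - Rp
        dA = Anew.diagonal().copy()     # a(k,k) = sum_{i'!=k} max(0, r(i',k))
        Anew = np.minimum(Anew, 0)
        np.fill_diagonal(Anew, dA)
        A = damping * A + (1 - damping) * Anew
    exemplars = np.where(R.diagonal() + A.diagonal() > 0)[0]
    labels = np.argmax(S[:, exemplars], axis=1)
    labels[exemplars] = np.arange(len(exemplars))
    return exemplars, labels
```

With the diagonal of S set to a shared "preference" value, the number of clusters emerges automatically; lower preferences yield fewer exemplars.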
     Several approaches are presented that improve AP by adjusting the similarity matrix. First, the Kernel Shared Affinity Propagation (KSAP) algorithm is proposed for data sets with uneven density and some linearly inseparable data. Next, two algorithms, Affinity Propagation Clustering based on Manifold Distance (APMD) and APMD based on Multi-Kernel Modeling (MKAPMD), are designed for clustering data sets of arbitrary shape and structure. Finally, for clustering under constraints, the Semi-Supervised Affinity Propagation Clustering with Pairwise Constraints (SAP-PC) algorithm is put forward.
     The innovations of this work include:
     1. The Kernel Shared Affinity Propagation (KSAP) algorithm is proposed. Traditional Euclidean-distance similarity measures suit only compactly structured data sets. Exploiting the fact that the number of nearest neighbors shared by two points is insensitive to local density, a new similarity measure, shared nearest neighbors based on a Gaussian kernel (KSNN), is designed, and KSAP is built on it. Simulation results show that KSAP can cluster data sets with uneven density and some simple linearly inseparable data.
     2. Affinity Propagation Clustering based on Manifold Distance (APMD) is proposed. Under the local- and global-consistency assumptions of clustering, a locality-preserving manifold distance is designed, and the resulting APMD algorithm can discover clusters of arbitrary shape, overcoming the original AP's restriction to hyperspherical clusters and its failure on non-convex structures. A fast APMD variant is also given that greatly reduces computation while maintaining clustering performance. Simulation results show that the proposed algorithms effectively solve the problem of clustering non-convex data sets.
     3. The APMD algorithm based on Multi-Kernel Modeling (MKAPMD) is proposed. Single-kernel APMD performs poorly on mixture-distributed data sets, so a multi-kernel model is built from the manifold structure of the data, with kernel parameters adjusted automatically from the manifold boundaries so that several kernel functions together fit the data structure better. Simulation results show that MKAPMD can cluster data sets in which several manifolds are mixed, broadening the application scope of AP.
     4. The Semi-Supervised Affinity Propagation Clustering with Pairwise Constraints (SAP-PC) algorithm is proposed. Starting from the manifold distance matrix, pairwise constraints are spread through the feature space by exploiting their propagation property and are expressed by modifying the distance matrix, removing the obstacle that constraints pose to applying AP. In addition, opening and closing operations on the pairwise constraints select closure centers, which replace the constraint pairs and thereby resolve constraint violations. Simulation results show that the algorithm effectively improves clustering accuracy.
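The shared-nearest-neighbor idea behind contribution 1 can be illustrated as follows. The exact KSNN formula is not given in this abstract, so the combination below (shared-neighbor count weighted by a Gaussian kernel, with hypothetical parameters `k` and `sigma`) is only a plausible sketch:

```python
import numpy as np

def snn_similarity(X, k=2, sigma=1.0):
    """Shared-nearest-neighbor similarity weighted by a Gaussian kernel
    (illustrative form; the thesis's KSNN definition may differ)."""
    n = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    # k nearest neighbors of each point, excluding the point itself
    nn = np.argsort(d2, axis=1)[:, 1:k + 1]
    S = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            # shared-neighbor counts are insensitive to local density;
            # the Gaussian factor keeps the similarity local
            shared = len(set(nn[i]) & set(nn[j]))
            S[i, j] = shared * np.exp(-d2[i, j] / (2 * sigma ** 2))
    return S
```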
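Contribution 2 rests on a manifold distance satisfying local and global consistency. One common construction (used in density-sensitive clustering and assumed here purely for illustration) stretches long edges exponentially and then takes graph shortest paths, so that paths hopping along a dense manifold become shorter than direct jumps across it:

```python
import numpy as np

def manifold_distance(X, rho=2.0):
    """Shortest-path distance over stretched edge lengths rho**d - 1
    (a common manifold distance; the thesis's exact metric may differ)."""
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    W = rho ** d - 1.0          # short edges shrink, long edges stretch
    D = W.copy()
    n = len(X)
    for k in range(n):          # Floyd-Warshall all-pairs shortest paths
        D = np.minimum(D, D[:, k:k + 1] + D[k:k + 1, :])
    return D
```

On four collinear points spaced one apart, the direct 0-to-3 edge costs 2**3 - 1 = 7, while the three unit hops cost 3 in total, so the manifold distance follows the chain of points.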
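The abstract does not specify how the kernels in contribution 3 are combined; a generic multi-kernel similarity, shown here only as a sketch with assumed bandwidths, takes a convex combination of Gaussian kernels:

```python
import numpy as np

def multi_kernel_similarity(X, sigmas=(0.5, 1.0, 2.0), weights=None):
    """Convex combination of Gaussian kernels at several bandwidths -- a
    generic multi-kernel model (the thesis adapts kernel parameters to the
    manifold boundary, which is not reproduced here)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    if weights is None:
        weights = np.ones(len(sigmas)) / len(sigmas)  # uniform weights
    K = sum(w * np.exp(-d2 / (2 * s ** 2)) for w, s in zip(weights, sigmas))
    return K
```

A wide kernel captures coarse structure while a narrow one preserves fine detail; weighting them lets one similarity matrix serve data whose scale varies across regions.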
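Contribution 4 folds pairwise constraints into the distance matrix. A simplified version of that idea (without the thesis's closure-center selection via opening and closing operations) might look like:

```python
import numpy as np

def apply_constraints(D, must_link, cannot_link, big=1e6):
    """Fold pairwise constraints into a distance matrix and propagate them
    (an illustrative scheme, not the thesis's exact SAP-PC construction)."""
    D = D.copy().astype(float)
    for i, j in must_link:          # must-link pairs collapse to distance 0
        D[i, j] = D[j, i] = 0.0
    n = len(D)
    for k in range(n):              # shortest paths spread the must-links:
        D = np.minimum(D, D[:, k:k + 1] + D[k:k + 1, :])
    for i, j in cannot_link:        # cannot-links applied last, so the
        D[i, j] = D[j, i] = big     # propagation step cannot undo them
    return D
```

After propagation, neighbors of a must-linked pair also move closer together, which is the "constraints spread in the feature space" effect the abstract describes.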
