聚类算法模型的研究及应用

设为首页

收藏本站

网站地图 | English | 公务邮箱

读者指南

学术客户端

NSTL服务站

科技查新

聚类算法模型的研究及应用

详细信息本馆镜像全文| 推荐本文 | | 获取CNKI官网全文

英文题名：Researches on Mathematical Models for Data Clustering and Applications
作者：陈树
论文级别：博士
学科专业名称：轻工信息技术与工程
中文关键词：生物信息 ; 基因 ; 数据挖掘 ; 聚类 ; 拓扑覆盖矩阵 ; 随机投影
英文关键词：Bioinformatics ; gene ; data mining ; cluster ; topological overlap matrix ; random projection
学位年度：2007
导师：徐保国
学科代码：081203
学位授予单位：江南大学
论文提交日期：2007-09-01
答辩委员会主席：陈昊

摘要

生物信息学是计算分子生物学与计算机科学之间的交叉学科。近年来,随着数据挖掘技术发展,生物技术正给整个人类带来前所未有的巨大变化。本文围绕聚类模型及其在生物信息中的应用开展研究,主要内容、贡献和创新包括:
     (1)微阵列中稠密区的研究
     稠密区是一个有统计意义的数据模式集合,它能用来标识基因模式和相关样本集合,也可以消除孤立点、噪声及非正常模式有关的基因模式。通过对稠密区特性的研究,可以依据稠密区的特征对其划分为几个不同的类别,进而给出对不同类别的相应算法。进一步对不同大小稠密区分布的研究,进而评测稠密区的生物意义。在具体应用中,采用两个实际的数据集来测试该算法,第一个数据集来自30个批次小试生产β-甘露聚糖酶样本的数据模式,样本浓度由高到低,该算法有效的标识出浓度不同数据模式。第二个数据集来自酵母菌数据集的基因表达,该算法同样有效的标识出表达相似的基因。同时,将该算法和另外四个常用的聚类算法进行对比,在同一合成数据集聚类效果的对比表明该算法有优越的性能。
     (2)基因网络模块的探测
     基因和蛋白质交互网络的生物研究表明这些网络由模块组成。模块的识别是理解整个网络结构的关键一步,为此,将结点不相似的测量方法与聚类算法结合起来,从而给出模块的识别方法,更进一步,在拓扑覆盖矩阵的基础上(该方法在很多生物上的应用上得到证实),采用一个结点不相似测量的通用方法,它综合标准的拓扑覆盖矩阵法,并在此基础上与双向层次聚类算法相结合。它主要用于网络模块的标识,也可用于结点连接的度量,它优于基于结点度的连接方法,在分析该算法相关特性的基础上,给出了它在基因表达网络中的适用的原因。最后,通过应用表明,标准的拓扑覆盖矩阵适用于发现较小模块,而采用推广的拓扑覆盖矩阵结合双向层次聚类算法则更适用于发现较大的模块。
     (3)基于随机投影集合的高维数据聚类研究
     针对高维数据聚类中如何产生多个低维的基聚类和如何对这些低维的聚类集合进行组合的问题,采用随机投影和双向图划分法,特别在基聚类集合中,采用一种新的基于OPTOC聚类算法。通过在八个数据集上评测集合构造器,结果表明:随机投影生成的集合性能优于其它两个集合构造器生成的集合。通过对四个不同的共识函数在用两种不同的类型的集合上的评测,结果表明:两个基于图形的划分方法性能优于另外两种方法,其中双向图划分法对两个集合基聚类的改善率比较高。
     (4)基于尺度聚类的研究
     基于尺度聚类模型的特点在于允许用户直接动态的控制聚类的尺度,即用户从不同的尺度观测数据集,就能得到相应尺度的聚类,且这种尺度是数据集所固有的。它引入聚类的同源算法和分离算法构建目标函数,特别在同源和分离识别方面,采用Renyi熵来表示类内相似度和类间分离度,用尺度参数控制聚类的尺度。在数据集中,用Pearson相关系数作为对象之间相似度的测量,该算法的时间复杂度低于典型的层次聚类和划分聚类算法。该模型在对生物信息、图像的数据集聚类过程中显示它的良好的效果。
     最后,在总全文进行了总结,提出了有待进一步研究的课题和今后研究工作的重点。
Bioinformatics is the interdiscipline of computational molecular biology and computer science.With the rapid development of data mining,biological technologies are reshaping the human society.This paper studies mathematical models for data clustering and applications for bioinformatics.The main content,contribution and innovation in the paper are described below:
     (1) The research of dense regions in microarray data.
     A dense region is data subset of statistical significance. It can identify similar subgroup of genes or samples,on the other hand,it can also get rid of outlier and abnormal data. After studying properties of dense regions,we classify dense regions according to their properties and then give the corresponding algorithms.We can also find biological meaning based on their distribution property. In the two experiments, the first dataset contains data of 30 beta-mannanase samples. Beta-mannanase's products concentration is from low to high. The algorithm can identify the subsets of samples and data patterns simultaneously.The second datast is the microarray of gene expression during yeast's cell cycles.The algorithm also can work well.After the comparsion with four other clustering algorithms for two synthetic datasets,it demonstrates its usefulness.
     (2) Detecing modules in gene networks.
     Genes and their protein products carry out celluar processes in the context of functional modules.Thus it's critical to identify these modules in order to know the gene network structure.We combine unsimilar measurement with clustering method,putting a new way to identify modules.furthermore,based on topological overlap matrix(the method has been verified in many biological applications),we come up with a new generalized method combined with bidirectional hierarchical clustering.It mainly apply in detecting modules、measurement of nodes.It performs better than other measurements between nodes.Meantime,we give a proof of its applications.At last ,normal topological matrix can find small modules,whereas this bidirectional hierarchical clustering based on generalized topogical matrix can work well in find large modules in the gene networks..
     (3) Ensemble methods for high dimensional clustering based on random projection.
     We explore how to employ ensemble methods to solve high dimensional data clustering problems.I investigate three different approaches to constructing ensembles based on randomized dimension reduction,particularly,we employ a new clustering method based on OPTOC algorithm.The results demonstrate the random projection is an effective approach for generating cluster ensembles for high dimensional data and that its efficacy is attributable to its ability to produce diverse base clusterings,then I employ a graph based approach which tansforms the problem of combining clusterings into a bipartite graph.Comparisons of the bipartite approach to three existing approaches illustrate that the bipartite approach achieves the best overall performances.
     (4) A scale dependent model for clustering
     We employ a model for clustering a set of high dimensional data into subsets of homogenous clusters which are well separated by each other. A novel feature of this model is that it allows the user to directly control the scale of the clusters.This is realized by formulating the clustering problem as an optimization problem. We study some properties of homogeneity and separation defined based on pair-wise measured by the Pearson correlation coefficient,particularly,we use Renyi's Entropy to represent the index of homogeneity and separation. In this case,for a dataset, the performance of the algorithm is better than typical hierarchical and partitional algorithms Experimental results on synthetic,biological and iamge data demonstrate the usefulness of proposed model.
     Finally, there are concluded with a summary and some problems needed to be studied in future are put forward.

引文

[1]赵国屏.生物信息学[M].北京:科学出版社,2002
    [2]马立人.生物芯片[M].北京:化学工业出版社,2000.
    [3]G..L.Liu.Introduction to Combinatorial Mathmatics[M].Macgraw Hill,1968.
    [4]J.B MacQueen.Some Methods for Classification and Analysis of Multivariate Observations [J].The 5th Berkeley Symposium on Mathematical Statistics and Probabilty,1967,9:281-297
    [5]Leonard Kaufman,and Peter Rousseeuw.Find Groups in Data:An Introduction to Cluster Analysis[M].John Wiley & Sons,Inc.,1990.
    [6]R.Ng,J.Han.Efficient and Effective Clustering Methods for Spatial Data Mining[J].International Conference on Very Large Data Bases,Santiago,Chile,1994.
    [7]Brai Everitt.Cluster Analysis[M].John Wiley & Sones,New York,1974.
    [8]T.Zhang,R.Ramakrishnan,and M.Livny.BITCH:An efficient Data Clustering Method for Very Large Database[J].International Conferrence on Management of Data,1996.
    [9]S.Guha,R.Vaithyannthan,and A.Garg.CURE:An efficient Clustering Algorithm for Large Database[J].International Conferrence on Management of Data,1998.
    [10]G.Karypis,E.H.Han,and V.kumar.CHAMELEON:A Hierarchical Clustering Algorithm Using Dynamic Modeling[J].Computer,1999,32:68-75.
    [11]Leonard Kaufman and Peter Rousseeuw.Finding Groups in Data:An Introduction to Cluster Analysis[M].John Wiley & Sons,Inc.,New York,1990.
    [12]孙即祥.现代模式识别[M].长沙:国防科技大学出版社,2002
    [13]M.Ester,H.Kriegel,J.Sander and X.Xu.A Density-based Algorithm for Discovering Clusters in Large Spatial DataBases with Noise[J].The 2nd Internatioal Conferrence on Knowledge Discovery and Data Mining,Portland,1996.
    [14]J.Sander,M.Ester,H.Kriegel,and X.Xu.Density-based Clustering in Spatial Databases:The Algorithm GDBSCAN and its Applications[J].Data Mining and Knowledge Discovery,2:169-194,1998.
    [15]Fuk90 K.Fukunaga.Introduction to Statistical Pattern Recognition[M].Boston Academic Press,1990.
    [16]W.Wang,J.Yang,and R.Muntz.STING:A Statistical Information Grid Approach to Spatial Data Mining[J].The 23rd VLIB Conferrence,Athens,Greece,1997.
    [17]G.Sheikholeslami,S.Chatterjee,and A.Zhang.WaveCluster:A Wavelet-based Clustering Approach for Spatial Data in Very Large Databases[J].The VLDB Journal,8:289-304,2000
    [18]Chris Ding,Xiaofeng He,Hongyun Zha,Ming Gu,and Horst Simon.Spectral Min-Max Cut for Graph Partitioning and Data Clustering[J].The 1st IEEE International Conferrence on Data Mining,San Jose,CA,2001.
    [19]L.Ertoz,M Steinbach,and V.Kumar.Finding Clusters of Different Sizea,Shapes,and Densities in Noisy,High dimensional Data[J].The 3th SIAM International Conferrence on Data Mining,San Francisco,CA,2003.
    [20]L.Davidson and S.Ravi.Clustering With Constraints:Feasibility Issues and the k-means Algorithm[J].The 5th SIAM International Conferrence on Data Mining,Newport Beach,CA,2005.
    [21]D.Gondek,S.Vaithyanathan,and A.Garg.Clustering with Modellevel Constraints[J].The 5th SIAM International Conferrence on Data Mining,Newport Beach,CA,2005.
    [22]Kandogan E.Visualizing multi-dimensional clusters,trends and outliers using star coordinates[C].Proceedings of the 7th ACM SIGKDD.CA,2001.
    [23]范九伦,裴继红,谢维信.基于可能性分布的聚类有效性[J].电子学报,1998,26(4):113-115.
    [24]范九伦,吴成茂.基于样本最大分类信息的聚类有效性函数[J].模糊系统与数字,2001,15(3):68-73.
    [25]行小帅,焦李成.数据挖掘的聚类方法[J].电路与系统学报,2003,8(1):59-67.
    [26]E.Ravasz,A.Somera,D.Mongru,Z.Oltvai,and A.Barabasi.Hierarchical Organization of Modularity in Metabolic Networks[J].Science,297(5586):1551-1555,2002.
    [1]马立人.生物芯片[M].北京:化学工业出版社,2000
    [2]孙即祥.现代模式识别[M].长沙:国防科技大学出版社,2002
    [3]赵国屏.生物信息学[M].北京:科学出版社,2002
    [4]J.A.Hartigan.Direct Clustering of a Data Matrix[J].Journal of the American Statistical Society,1972,67(337):123-129
    [5]Y.Cheng and G.Church.Biclustering of Expression Data[C].The 8th International Conferrence on Intel.Sys.for Mol.Biol.,2002.
    [6]T.Lu.Gene regulation and DNA Damage in the Ageing Human Brain[J].Nature,2004,429(24):883-891
    [7]G.Dennis,B.Sherman,D.Hosack,J.Yang.W.Gao,H.Lane,and R.Lempicki.DAVID:Database for Annotation,Visualization,and Integrated Discovery[J].Genome Biology,2003,4(9)
    [8]李雄飞,李军.数据挖掘与知识发现[M].北京:高等教育出版社,2003.
    [9]P.Spellman,G.Sherlock,M.Zhang,V.Iyer,and K.Anders.Comprehensive Identification of Cell Cycle-regulated Genes of the Yast Saccharomyces Cerevisiae by Microarray Hybridization[J].Mol.Biol.Cell,1998,9(12):3273-3297
    [1]Z.Bar-Joseph,G.K.Gerber,T.I.Lee,N.J.Rinaldi,J.Y.Yoo,F.Robert and D.K.Gifford.Computational Discovery of Gene Modules and Regulatory Networks[J].Nature Biotechnology,2003,21(11):1337-1342
    [2]S,Bergmann,J.Ihmels and N.Barkai.Similarities and Differences in Genome-wide Expression data of Six Organisms[J].PLOS Biology,2004,2(1):0085-0093
    [3]赵国屏.生物信息学[M].北京:科学出版社,2002
    [4]S.Prinz,I.Avlia-Campillo,C.Aldridge and T.Galitski.Control of Yeast Filamentous-form Growth by Modules in an Integrated Molecular Network[J].Genome Research,2004,14(3):380-390
    [5]T,Yoyoda and A.Konagaya.KnowledgeEditor:A New Tool for Interactive Modeling and Analyzing Biological Pathways based on Microarray Data[J].Bioinformatics,2003,19(3):433-434
    [6]M.Newman and M.Girvan.Finding and Evaluating Communtity Structure in Networks[J].Physical Review,2004,69(2):26113
    [7]P.Tamayo,D.Slonim,J.Mesirov,E.Dmitrovsky,E.S.Lander and T.R.Golub.Interpreting Pattrens of Gene Expression With Self-organizing Maps:Methods and Application to Hematopoietic[J].PNAS,1999,96(1):2907-2912
    [8]P.Spellman,G.Sherlock,M.Zhang,V.Iyer and K.Anders.Comprehensive Identification of Cell Cycle-regulated Genes of the Yeast Saceharomyces Cerevisiae by Microarray Hybridization[J],Mol.Biol.Cell,1998,9(12):3273-3297
    [9]B.Zhang and G.Karypis.Using the Scale-free Topology Criterion for Constructing Weighted Gene Co-expression Networks[J].Statistical Application in Genetics and Molecular Biology,2005,2(1):338-343
    [10]E.Ravasz,A.Somera,D.Mongru,Z.Oltva and A.barabasi.Hierarchical Organization of Modularity in Metabolic Networks[J].Science,2002,297(5586):1551-1555
    [11]S.Wasserman and K.Faust.Social Network Analysis:Methods and Applications[M].Structural Analysis in the Social Science,Cambridge University Press,New York,1994
    [12]M.Newman.The Structure and Function of Complex Newworks[J].SIAM Review,2003,45(2):167-256
    [13]T.Cox and M.Cox.Multidimensional Scaling[M].Chapman and Hall/CRC,Roca Raton,2001
    [1]K.Beyer,J.Goldstein,R.Ramakrishnan and U.Shaft.When is nearest neigh-bors meaningful[C]Proceedings of the International Conference on Database Theories,1999,PP:217-235.
    [2]C.H.Papadimitriou,P.Raghavan,H.Tamaki and S.Vempala.Latentsemantic indexing:A probabilistic analysis[C].Proceedings of the Seventeenth ACM Symposium on the Principles of Database Systems,1998,PP:159-168
    [3]毛国君.数据挖掘原理与算法[M].北京:清华大学出版社,2003.
    [4]H.Hotelling.Analysis of a complex of statistical variables into principal components[J].Journal of Educational Psychology,1933,24:417-441
    [5]张杰,阳宪惠.多变量统计过程控制[M].北京:化学工业出版社,2000
    [6]W.B.Johnson and J.Lindenstrauss.Extensions of Lipschitz mappings into a Hilbert space[C].Conference in Modern Analysis and Probability,1984,PP:189-206
    [7]Ella Bingham and Heikki Mannila.Random projection in dimensionality reduction:Applications to image and text data[C].Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,2001,PP:245-250
    [8]P.Diaconis and D.Freedman.Asymptotics of graphical projection pursuit[J].Annals of Statistics,1984,12:378-389
    [9]X.Z.Fern and C.E.Brodley.Random projection for high dimensional data clustering:A cluster ensemble approach[C]Proceedings of the Twentieth Inter-national Conference on Machine Learning,2003,PP:186-193
    [10]Junghui Chen,Kun-Chih Liu.On-line batch process monitoring using dynamic PCA and dynamic PLS models[J].Chemical Engineering Science,2002,57:63-75.
    [11]Kulkarni Savita G,Chaudhary Amit Kumar,Nandi Somnath,Tambe Sanjeevs,Kulkarni Bhaskar D.Modeling and monitoring of batch processes using principal component analysis (PCA) assisted generalized regression neural networks(GRNN)[J].Biochemical Engineering Journal 2004,18(3):193-210.
    [12]Lu N.,F.Gao,F.Wang.PCA-based modeling and on-line monitoring strategy for uneven-length batch processes[J].Ind.Eng.Chem.Res.,2004b,43:3343-3352.
    [13]J.Shi and J.Malik.Normalized cuts and image segmentation[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2000,22(8):888-905
    [14]I.S.Dhillon.Co-clustering documents and words using bipartite spectral graph partitioning[C].In Knowledge Discovery and Data Mining,2001,PP:269-274
    [15]A.Ng,M.Jordan,and Y.Weiss.On spectral clustering:Analysis and an algorithm[C].In Advances in Neural Information Processing Systems,2002,PP:849-856
    [16]A.Strehl and J.Ghosh.Cluster ensembles-a knowledge reuse framework for combining multiple partitions[J].Machine Learning Research,2002,3:583-417
    [17]S.Monti,P.Tamayo,J.Mesirov and T.Golub.Consensus clustering:A resampling-based method for class discovery and visualization of gene expression microarray data[J].Machine Learning,2003,52:91-118
    [18]Xiaohua Hu[C].Proceedings of the Fourth IEEE Symposium on Bioinformatics and Bioengineering,2004
    [19]M.Dash,K.Choi,P.Scheuermann and H.Liu.Feature selection for clustering-A filter solution[C].Proceedings of IEEE International Conference on Data Mining,2002,PP:115-122
    [20]P.Mitra and C.A.Murthy.Unsupervised featureselectionusing featuresimilarity [J].IEEE Trans.on Pattern Ananlysis and Machine Intelligence,2002,24(3):301-312
    [21]S.Vaithyanathan and B.Dom.Generalized model selection for unsupervised learning in high dimensions[C].In Advances in Neural Information Processing Systems,2000.
    [22]J.G.Dy and C.E.Brodley.Feature subset selection and order identification for unsupervised learning[C].In Proceedings of the Seventeenth International Conference on Machine Learning,2000,PP:247-254
    [23]M.H.Law,A.K.Jain,andMarioA.T.Figueiredo.Featureselectioninmixture-based clustering[C].In Advances in Neural Information Processing Systems,2003,PP:609-616,2003.
    [24]R.J.Bolton and W.J.Krzanowski.Projectionpursuitclustering for exploratory data analysis[J].Journal of Computational and Graphical Statistics,2003,12(1):121-142
    [25]C.Ding,X.He,H.Zha,and H.Simon.Adaptive dimension reduction for clustering high dimensional data[C].Proceedings of the Second IEEE International Conference on Data Mining,2002,PP:107-114.
    [26]C.Cheng,A.W.Fu and Y.Zhang.Entropy-based subspace clustering for mining numerical data[C].Proceedings of the Fifth ACM SIGKDD International Conferenceon Knowledge Discoveryand Data Mining,1999,PP:84-93
    [27]R.Agrawal,J.Gehrke,D.Gunopulos and P.Raghavan.Automatic subspace clustering of high dimensional data for data mining applications[C].Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data,1998,PP:94-105.
    [28]C.M.Procopiuc,M.Jones,P.K.Agarwal and T.M.Murali.A Monte Carlo algorithm for fast projective clustering[C].Proceedings of ACM SIGMOD Conference of Management of Data,2002,PP:418-427
    [29]C.C.Aggarwal,J.L.Wolf,P.S.Yu,C.Procopiuc and J.Park.Fast algorithms for projected clustering[C].Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data,1999,PP:61-72
    [30]C.C.Aggarwal and P.S.Yu.Finding generalized projected clusters in high dimensionalspaces[J].Proceedings of ACM SIGMOD Conference on Management of Data,2000,PP:70-81
    [31]Piotr Indyk and Rajeev Motwani.Approximate nearest neighbors:Towards removing the curse of dimensionality[C].Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing,1998,PP:604-613.
    [32]J.Fridlyand and S.Dudoit.Applications of resampling methods to estimate the number of clusters and to improve the accuracy of a clustering method[M].Technical Report,Statistics Department,UC Berkeley,2001.
    [33]X.Z.Fern and C.E.Brodley.Solving cluster ensemble problems by bipartite graph partitioning[C].Proceedings of the Twenty First International Conference on Machine Learning,2004,PP:281-288
    [34]A.Topchy,A.K.Jain and W.Punch.A mixture model for clustering ensembles[C].Proceeding of SIAM Conference on Data Mining,2004,PP:379-390
    [35]A.Fred and A.K.Jain.Data clustering using evidence accumulation[C].Proceedings of International Conference on Pattern Recognition,2002.
    [1]李雄飞,李军.数据挖掘与知识发现[M].北京:高等教育出版社,2003.
    [2]G.Karypis,E.H.Han and V.Kumar.CHAMELEON:A Hierarchical Clustering Algorithm using Dynamic Modeling[J].Computer,1999,32:68-75
    [3]T.F Chan and L.Vese.Active Contours without Edges[J].IEEE Transactions on Image Processing,2001,80(10):266-277
    [4]R.Sharan and R.Shamir.CLICK:A Clustering Algorithm for Gene Expression Analysis[C].The International Conferrence on Intelligent Systems for Molecular Bioogy,2000,PP:260-268
    [5]Y.Zhao and G.Karypis.Empirical and Theoretical Comparisons of Selected Criterion Functions for Document Clustering[J].Machine Learning,2004,55(3):311-331
    [6]I.S Dhillon and D.S Modha.Concept Decompositions for Large Sparse Text Data using Clustering[J].Machine Learning,2001,42(3):143-175
    [7]毛国君.数据挖掘原理与算法[M].北京:清华大学出版社,2003
    [8]S.Guha,R.Rastogi and K.Shim.CURE:An efficient Clustering Algorithm for Large Databases[C].The International Conferrence on Management of Data,1998,PP:73-84
    [9]M.B.Eisen,P.T.Spellman,P.O.Brown and D.Botstein.Cluster Analysis and Display of Genome-wide Expression Patterns[C].The proceeding of National Academy of Sciences,1998,85:14863-14868
    [10]V.R.Iyer.The Transcriptional Program in the Response of Human Fibroblasts to Sernum[J].Science,1999,283:83-87
    [11]J.R.Gilbert,G.L.Miller and S.Teng.Geometric Mesh Partitioning:Implementation and Experiments[J].SIAM J.Sci.Comp.,1998,19(6):2091-2110

常见问题　|　交通位置　|　联系我们　|　OA远程办公

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700