Improving the Clustering Algorithm of the Phylogenetic Profiling Method


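English title: An Improvement of Cluster on Phylogenetic Profiling Method
Author: Li Donghan (李东晗)
Degree level: Master's
Discipline: Computer Application Science
Keywords (translated from Chinese): phylogenetic profile ; weight ; hierarchical clustering ; K-means clustering ; biological distance ; K-means initial samples
Keywords (English): Phylogenetic Profiling ; Bioinformatics distance ; hierarchical cluster ; K-means
Degree year: 2011
Advisor: Ma Zhiqiang (马志强)
Discipline code: 081203
Degree-granting institution: Northeast Normal University (东北师范大学)
Thesis submission date: 2011-03-01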

Abstract

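With the advent of efficient, automated sequencing technology, the central problem of bioinformatics has shifted from sequencing genes to analyzing genes that have already been sequenced, chiefly the study and annotation of gene function. Because of the inherent limitations and limited accuracy of homology-based methods, non-homology methods have been receiving growing attention. Non-homology methods classify sequences by their attributes and then predict function from that classification. Among these methods, phylogenetic profiling is the most widely used.
     Phylogenetic profiling was proposed by Pellegrini in 1999, and many researchers have since improved it in three respects: selection of the reference genomes, construction of the profiles, and analysis of profile similarity. Building on this work, this thesis first constructs weight-based phylogenetic profiles and then uses hierarchical clustering and K-means clustering in alternation for similarity analysis. Two improvements are proposed for the similarity-analysis stage. First, a new distance is introduced for the clustering step of hierarchical clustering. Second, more information is extracted from the hierarchical clustering result to initialize K-means, so that the hierarchical result is used more fully and the K-means result is more accurate.
     Clustering algorithms today mostly use the Euclidean distance. Because most of the samples being handled lie in a Euclidean space, clustering with the Euclidean distance gives good results. The distance adopted in this thesis is a distance in a non-Euclidean space. Compared with the Euclidean distance, it strengthens the influence of known information on the distance between samples: it considers not only the distance between samples but also the distance between each sample and the reference samples. With this new distance, samples close to the known reference are processed first.
     The weakness of K-means clustering is its sensitivity to initial conditions: the choice of the initial number of clusters K and of the initial cluster targets strongly affects the final result. Current improvements to K-means concentrate on choosing this initial information. Earlier work combined hierarchical clustering with K-means in order to have the hierarchical method supply the initial number of clusters K. This thesis goes further and extracts more useful information from the hierarchical clustering result to supply the initial cluster targets for K-means as well.
     Finally, the Escherichia coli K12 genome is used as the test sample to validate these improvements experimentally. The results show that the new algorithm is more accurate than the original one.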
With the advent of automated, efficient sequencing techniques, the focus of bioinformatics has shifted to gene analysis and genome annotation. Because of the shortcomings of homology-based methods, non-homology approaches are receiving more and more attention. These approaches classify sequences and analyze their function on the basis of sequence attributes.
     Phylogenetic profiling is a non-homology annotation method that uses evolutionary information. Since it was proposed by Pellegrini in 1999, many researchers have improved it in three respects: reference genome selection, profile construction, and profile similarity analysis. Phylogenetic profiles take three forms: discrete, continuous, and weight-based. The weight-based form is developed from the continuous one. It makes genes that are well represented in the sample proteins more prominent, while genes that are seldom translated in the sample are correspondingly down-weighted. In this thesis, weight-based phylogenetic profiling is used to pre-process the protein data, and hierarchical clustering and K-means clustering are then used together. Two improvements are made on previous work. First, a distance motivated by the bioinformatics background is used in hierarchical clustering. Second, more information is extracted from the hierarchical clustering result to serve as the initial parameters of K-means clustering, which makes K-means clustering more efficient.
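To make the notion of a profile concrete, the following is a minimal sketch of the discrete form only: each gene is encoded as a 0/1 vector recording whether a homolog is found in each reference genome, and genes with similar vectors are candidates for a shared function. The genome names, gene names, and hit sets are hypothetical toy data, and the Hamming distance is used purely for illustration; the weighting scheme of the weight-based form is not specified in this abstract and is not reproduced here.

```python
# Hypothetical toy data: reference genomes in which each gene has a detectable homolog.
REFERENCE_GENOMES = ["genomeA", "genomeB", "genomeC", "genomeD"]
HOMOLOG_HITS = {
    "gene1": {"genomeA", "genomeC"},
    "gene2": {"genomeA", "genomeC"},
    "gene3": {"genomeB", "genomeD"},
}


def discrete_profile(hits, genomes):
    """Return a 0/1 list: 1 where the gene has a homolog in that genome."""
    return [1 if genome in hits else 0 for genome in genomes]


def hamming(p, q):
    """Count the positions at which two equal-length profiles differ."""
    return sum(a != b for a, b in zip(p, q))


# One profile per gene, with positions ordered as in REFERENCE_GENOMES.
profiles = {
    gene: discrete_profile(hits, REFERENCE_GENOMES)
    for gene, hits in HOMOLOG_HITS.items()
}
```

In this toy data, gene1 and gene2 share an identical profile, while gene3 has the complementary pattern, so a similarity analysis would group the first two together.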
     Most distances adopted in clustering algorithms are Euclidean. Because most of the samples being handled lie in Euclidean space, the clustering results are good. The distance adopted here for hierarchical clustering is a new one that belongs to a non-Euclidean space. Compared with the Euclidean distance, it strengthens the influence of already-known information: not only the distance between two samples is taken into account, but also the distance between the samples and the reference subject, which ensures that samples similar to the reference group are dealt with first.
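The abstract does not give the formula of this distance, so the sketch below is only a hypothetical stand-in that illustrates the general idea: a pairwise dissimilarity that also depends on how far the two samples lie from a reference profile. The function `reference_aware_distance`, its multiplicative form, and the parameter `alpha` are assumptions made for illustration and are not the thesis's definition.

```python
import math


def euclidean(x, y):
    """Ordinary Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def reference_aware_distance(x, y, reference, alpha=0.5):
    """Hypothetical dissimilarity (not the thesis's formula).

    The Euclidean gap between x and y is inflated by the pair's mean
    distance to the reference, so pairs near the reference look closer
    than equally spaced pairs far from it.
    """
    pair_gap = euclidean(x, y)
    mean_to_reference = 0.5 * (euclidean(x, reference) + euclidean(y, reference))
    return pair_gap * (1.0 + alpha * mean_to_reference)


reference = (0.0, 0.0)
near_pair = reference_aware_distance((0.0, 0.0), (1.0, 0.0), reference)
far_pair = reference_aware_distance((10.0, 0.0), (11.0, 0.0), reference)
```

Both pairs have the same Euclidean gap of 1, but the pair next to the reference receives the smaller value, so an agglomerative procedure that always merges the closest pair would handle it first.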
     The shortcoming of K-means is that the initial parameters have a strong impact on the result. Most current improvements therefore focus on the choice of initial parameters. The purpose of using hierarchical clustering and K-means clustering together is to have hierarchical clustering provide the initial cluster number K for K-means. This thesis extracts more information from the hierarchical clustering result to provide the initial points for K-means as well.
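The combination described above can be sketched as follows, under stated assumptions: a small agglomerative pass decides how many clusters there are, and the centroids of those clusters seed K-means. The centroid linkage, the fixed merge `threshold`, and the use of centroids as initial points are illustrative choices; the abstract does not say exactly which information the thesis extracts from the hierarchical result.

```python
import math


def dist(x, y):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def agglomerate(points, threshold):
    """Merge the two closest clusters (centroid linkage) until the closest
    pair of clusters is farther apart than `threshold`."""
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] > threshold:
            break
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


def kmeans(points, centers, iterations=20):
    """Lloyd's algorithm started from the supplied initial centers."""
    centers = list(centers)
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda k: dist(p, centers[k]))
            groups[nearest].append(p)
        # Keep the previous center if a group ends up empty.
        new_centers = [centroid(g) if g else c for g, c in zip(groups, centers)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, groups


# Hypothetical toy samples forming two well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
initial_clusters = agglomerate(points, threshold=3.0)
k = len(initial_clusters)                        # K taken from the hierarchy
seeds = [centroid(c) for c in initial_clusters]  # initial points from the hierarchy
centers, groups = kmeans(points, seeds)
```

Seeding from the hierarchy removes the random initialization that makes plain K-means sensitive to its starting points, at the cost of the quadratic pairwise search in the agglomerative pass.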
     Finally, the Escherichia coli K12 genome is chosen as the experimental sample to verify the improvements. The results show that, compared with the traditional algorithm, the new algorithm is more accurate and more efficient.

References

[1] Chen Ming. Bioinformatics in the post-genome era [J]. Chinese Journal of Bioinformatics, 2004, 7, 29-34. (in Chinese)
[2] Chen Zuozhou, Zhu Sheng, et al. Ortholog: concept, bioinformatic prediction methods and databases [J]. Acta Biophysica Sinica, 2004, Vol.296(2): 137-142. (in Chinese)
[3] Eisen JA, Wu M. Phylogenetic analysis and gene functional predictions: phylogenomics in action [J]. Theor. Popul. Biol., 2002, 61: 481-487.
[4] Xie Tao, Liang Weiping, Ding Dafu. Functional annotation of genomes in the post-genome era [J]. Progress in Biochemistry and Biophysics, 2000, (02): 166-170. (in Chinese)
[5] Jingchun Sun, Jinlin Xu, Zhen Liu, et al. Refined phylogenetic profiles method for predicting protein–protein interactions [J]. Bioinformatics, 2005, Vol.21(16): 3409-3415.
[6] Jingchun Sun, Yixue Li, Zhongming Zhao. Phylogenetic profiles for the prediction of protein–protein interactions: How to select reference organisms [J]. ScienceDirect, 2007, Vol.353(4): 985-991.
[7] Jingchun Sun, Zhongming Zhao. Construction of phylogenetic profiles based on the genetic distance of hundreds of genomes [J]. ScienceDirect, 2007, Vol.355(3): 849-853.
[8] Sun JG, Liu J, Zhao LY. Clustering algorithm research [J]. Journal of Software, 2008, 19(1): 48-61.
[9] http://www.jos.org.cn/1000-9825/19/48.htm
[10] M Pellegrini, E M Marcotte, M J Thompson, et al. Assigning protein functions by comparative genome analysis: protein phylogenetic profiles [J]. Proc Natl Acad Sci, 1999, Vol.96: 4285-4288.
[11] Date SV, Marcotte EM. Discovery of uncharacterized cellular systems by genome-wide analysis of functional linkages [J]. Nat Biotechnol, 2003, Vol.21: 1055-1062.
[12] Jingchun Sun, Jinlin Xu, Zhen Liu, et al. Refined phylogenetic profiles method for predicting protein–protein interactions [J]. Bioinformatics, 2005, Vol.21(16): 3409-3415.
[13] Ma Yanan, Sun Pingping, Wei Yazhuo, Chen Linying, Cui Ying, Ma Zhiqiang. Application of an improved phylogenetic profiling algorithm to protein function annotation [J]. Chinese Journal of Bioinformatics, 2009, Vol.7(1). (in Chinese)
[14] Jain AK, Dubes RC. Algorithms for Clustering Data. Prentice-Hall Advanced Reference Series, 1988. 1-334.
[15] Huang Z. Extensions to the k-means algorithm for clustering large data sets with categorical values [J]. Data Mining and Knowledge Discovery, 1998, 2(3): 283-304.
[16] Marques JP, Written; Wu YF, Trans. Pattern Recognition Concepts, Methods and Applications. 2nd ed., Beijing: Tsinghua University Press, 2002. 51-74. (in Chinese)
[17] Fred ALN, Leitao JMN. Partitional vs hierarchical clustering using a minimum grammar complexity approach. In: Proc. of the SSPR&SPR 2000, 193-202.
[18] http://www.sigmod.org/dblp/db/conf/sspr/sspr2000.html
[19] Chen Zhijie. Advanced Algebra and Analytic Geometry (Vol. 1) [M]. Beijing: Higher Education Press; Heidelberg: Springer, 2002. 317. (in Chinese)
[20] http://www.ncbi.nlm.nih.gov/ (National Center for Biotechnology Information, USA)
[21] Sun Xiao, Lu Zuhong, Xie Jianming. Fundamentals of Bioinformatics [M]. Beijing: Tsinghua University Press, 2005, 1-7, 283-286, 292-303. (in Chinese)
[22] http://www.ncbi.nlm.nih.gov/COG/old/
[23] ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__DH10B_uid58979/
