Research on Algorithms for the Transcription Factor Binding Site Identification Problem
Abstract
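Transcription is the first stage of gene expression and the main stage of gene regulation: transcription factors bind specific DNA sequences and thereby repress or enhance gene expression. Identifying these binding regions in DNA sequences, that is, transcription factor binding site identification, is important for understanding the transcriptional activity of genes and gene expression, and it is one of the most widely studied problems in bioinformatics today.
     The difficulty of transcription factor binding site identification is that, compared with large numbers of background noise sequences hundreds or thousands of bases long, motif signals of a dozen to a few dozen bases are relatively short, and motif instances of the same transcription factor may be partially mutated. Moreover, as the length and number of sequences grow, the solution space grows rapidly and the computational cost often becomes impractical. Identifying multiple transcription factor binding sites within a binding region, finding specific combinations of co-regulated transcription factor binding sites, and searching for binding sites genome-wide are further major challenges. This thesis studies the mathematical models, optimization techniques, and efficient identification methods used for this problem, as well as further developments that combine them with new biological experiments. The proposed methods are applied to simulated string data, promoter sequences from different species and tissues, and genome-wide DNA data. The main work is summarized as follows: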
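     (1) Because the set of combinatorial candidate solutions in traditional transcription factor binding site identification is too large, and classical probabilistic methods easily fall into local optima, a fixed-position projection refinement algorithm is proposed. A fixed-position projection process based on the position frequency matrix partitions and clusters the data set into subsets. Subsets with sufficient information content and complexity are selected, and each serves as an initial state for the expectation maximization algorithm, which refines it iteratively. By setting the threshold in the fixed-position projection process, the algorithm handles the three distribution models of motif instances: OOPS, ZOOPS, and TCM. A high-order Markov model is used as the background to strengthen motif specificity, so that the probabilistic model fits real biological data better. In addition, a similarity function is introduced to evaluate the outputs of the subsets, which allows the algorithm to handle multiple-motif identification. Experimental results show that the algorithm effectively identifies transcription factor binding sites in the promoter sequences of several eukaryotic species.
     (2) For the planted (l, d) motif search problem derived from transcription factor binding site identification, traditional algorithms struggle to balance efficiency and accuracy and have difficulty solving challenging instances. A heuristic clustering algorithm based on expectation maximization, CEM, is proposed. By choosing reference sequences, the algorithm partitions the data set into subsets and uses an improved expectation maximization algorithm to explore the best local optimum within each subset. CEM combines exact and probabilistic methods, overcomes the tendency of traditional expectation maximization to fall into different local solutions, accurately finds the planted sites, and performs well on highly degenerate motifs. Tests on simulated data show that CEM accurately identifies planted motif signals in general instances and also achieves high accuracy on challenging instances. Experiments on real data further show that the algorithm can be applied effectively to transcription factor binding site identification in real species.
     (3) For genome-wide transcription factor binding site identification, MMFChIP, an algorithm for identifying transcription factor binding sites in ChIP-seq data, is proposed. The algorithm combines exact and probabilistic methods. Exploiting the characteristics of ChIP-seq data, it compares the positive and negative input sets, selects similar subsequences that occur with high frequency to build position frequency matrices, builds a statistical model that incorporates intra-motif position dependency and a high-order Markov model, and uses the false discovery rate to control the predicted instances. On output, a post-processing step clusters similar motifs. Tests on ChIP-seq data show that MMFChIP is suited to motif discovery in large-scale data: it finds multiple motif components in the data and also predicts potential cofactors well.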
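As an illustration of the information-based filtering mentioned in (1), the following minimal Python sketch builds a position frequency matrix from a set of aligned candidate sites and scores it by information content (relative entropy against a background). It is not the thesis implementation: the function names, the pseudocount, the example sites, and the uniform background are assumptions made for the example, whereas the thesis uses a high-order Markov background and thresholds that are not stated here.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def position_frequency_matrix(sites, pseudocount=0.5):
    """Column-wise base frequencies for equal-length aligned sites.

    The pseudocount (an assumed value) keeps every frequency non-zero.
    """
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(ALPHABET)
    matrix = []
    for col in range(length):
        counts = Counter(site[col] for site in sites)
        matrix.append({b: (counts[b] + pseudocount) / total for b in ALPHABET})
    return matrix


def information_content(matrix, background=None):
    """Sum over columns of the relative entropy (in bits) to the background.

    A uniform background is assumed here for simplicity.
    """
    background = background or {b: 0.25 for b in ALPHABET}
    return sum(
        p * math.log2(p / background[b])
        for column in matrix
        for b, p in column.items()
    )


# Hypothetical subsets: one well conserved, one made of unrelated words.
conserved = ["TGACTCA", "TGACTCA", "TGAGTCA", "TGACTCA"]
scattered = ["TGACTCA", "ACGTTGC", "GTCAAGT", "CATGGAC"]

pfm_conserved = position_frequency_matrix(conserved)
ic_conserved = information_content(pfm_conserved)
ic_scattered = information_content(position_frequency_matrix(scattered))
print(f"conserved: {ic_conserved:.2f} bits, scattered: {ic_scattered:.2f} bits")
```

A subset of well-conserved sites scores higher than a subset of unrelated words, which is the kind of signal a filter on information content could use when choosing starting points for expectation maximization.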
Transcription, the first step of gene expression, is also the main step of gene regulation. By binding to specific DNA sequences, transcription factors suppress or enhance gene expression. Identifying these binding regions in DNA sequences, that is, identifying transcription factor binding sites, is of great significance for understanding the transcriptional activity of genes and gene expression, and it is one of the most widely studied problems in bioinformatics today.
     The difficulty of the transcription factor binding site identification problem is that a motif signal of ten to a few dozen bases is short compared with the large number of background noise sequences that are hundreds or thousands of bases long, and the motif instances of the same transcription factor may also be partially mutated. Meanwhile, as the length and number of sequences increase, the solution space grows rapidly and the computational cost becomes impractical. In addition, identifying multiple transcription factor binding sites in a binding region, finding specific combinations of co-regulated transcription factor binding sites, and discovering binding sites genome-wide are great challenges for this problem. This thesis studies the mathematical models, optimization techniques, and efficient identification methods for the problem, together with further developments that combine them with new biological experiments. The proposed methods are applied to simulated string data, promoter sequences of different species and tissues, and genome-wide DNA data to identify transcription factor binding sites.
     (1) We propose a novel fixed-position projection refinement algorithm to cope with the large candidate solution space of the traditional transcription factor binding site identification problem and to reduce the tendency of classic methods to fall into local optima. The algorithm divides the data into subsets through a projection process based on the position frequency matrix and then filters the subsets by information score and complexity score. The selected subsets are used as initial conditions for expectation maximization refinement. By setting the threshold in the fixed-position projection process, the algorithm handles the OOPS, ZOOPS, and TCM distribution models of motif instances. Using a high-order Markov model as the background brings the probability model closer to real data. In addition, the algorithm can be extended to multiple-motif discovery by using a similarity function. Experimental results show that the algorithm effectively identifies transcription factor binding sites in the promoter sequences of multiple eukaryotic species.
     (2) For the planted (l, d) motif search problem derived from transcription factor binding site identification, traditional algorithms find it hard to achieve a good trade-off between efficiency and accuracy and hard to solve the challenging instances. We present CEM, a heuristic cluster-based EM algorithm. By setting reference sequences, CEM partitions the data into cluster subsets and refines them with a modified EM method to explore the best local optimum. By combining exact methods and probabilistic methods, CEM overcomes the tendency of the traditional EM algorithm to fall into different local solutions. The algorithm accurately finds the planted sites and performs well on highly degenerate motifs. Experiments on simulated data show that CEM accurately identifies the planted motif signals in general instances and also achieves high accuracy on the challenging instances. In addition, experiments on real data show that the algorithm can be applied effectively to identify transcription factor binding sites in real species.
     (3) For genome-wide identification of transcription factor binding sites, we propose MMFChIP, an algorithm that identifies the binding motifs of transcription factors in ChIP-seq data. The algorithm is motivated by the dramatic increase in the amount and quality of ChIP-seq data. MMFChIP combines exact methods and probabilistic methods. Using a word enumeration strategy, it counts the occurrences of subsequences in the positive and negative sets, selects the relatively enriched subsequences, and searches for the (l, d) instances of each enriched subsequence to build the corresponding position weight matrices. MMFChIP then employs a statistical model that includes intra-motif dependency and a high-order Markov model to describe each motif more precisely, and uses the false discovery rate to control the quality of the output. Finally, a post-processing step clusters similar motifs. Results on ChIP-seq datasets show that the algorithm can handle motif discovery in large-scale data and can effectively detect multiple binding motifs and cofactors.
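The word enumeration and (l, d) instance search described in (3) can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not MMFChIP itself: the function names, the smoothed enrichment ratio, the `min_ratio` cut-off, and the toy sequences are all hypothetical, and the intra-motif dependency model, the high-order Markov background, and the false discovery rate control are omitted.

```python
from collections import Counter


def count_words(sequences, l):
    """For every l-mer, count the number of sequences that contain it."""
    counts = Counter()
    for seq in sequences:
        counts.update({seq[i:i + l] for i in range(len(seq) - l + 1)})
    return counts


def hamming(a, b):
    """Number of mismatching positions between two equal-length words."""
    return sum(x != y for x, y in zip(a, b))


def enriched_words(positive, negative, l, min_ratio):
    """l-mers whose add-one-smoothed rate in the positive set is at least
    min_ratio times their rate in the negative set (assumed criterion)."""
    pos, neg = count_words(positive, l), count_words(negative, l)
    result = []
    for word, n_pos in pos.items():
        ratio = ((n_pos + 1) * (len(negative) + 1)) / (
            (neg[word] + 1) * (len(positive) + 1)
        )
        if ratio >= min_ratio:
            result.append((word, ratio))
    return sorted(result, key=lambda item: (-item[1], item[0]))


def ld_instances(seed, sequences, d):
    """All windows within Hamming distance d of the seed word."""
    l = len(seed)
    return [
        seq[i:i + l]
        for seq in sequences
        for i in range(len(seq) - l + 1)
        if hamming(seed, seq[i:i + l]) <= d
    ]


def count_matrix(instances):
    """Per-position base counts, the raw material for a weight matrix."""
    return [
        {b: sum(inst[col] == b for inst in instances) for b in "ACGT"}
        for col in range(len(instances[0]))
    ]


# Toy data: the positive set carries variants of TGACTCA, the negative set does not.
positive = ["AATGACTCAGG", "CCTGAGTCATT", "GGTGACTCACC", "TTTGACTAAAG"]
negative = ["ACGTACGTACG", "GGCCTTAAGGC", "TTAACCGGTTA", "CAGTCAGTCAG"]

enriched = enriched_words(positive, negative, l=7, min_ratio=2.5)
seed = enriched[0][0]
instances = ld_instances(seed, positive, d=1)
matrix = count_matrix(instances)
print(seed, instances)
```

In this toy run the exact word occurs in only two positive sequences, but the (l, d) search with d = 1 recovers one instance from each of the four, which shows why the enumeration step alone is not enough to build the matrix.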
引文
[1] Avila, V.L., Biology: investigating life on earth1995: Jones&BartlettLearning.
    [2] Sadava, D., et al., Life: the science of biology. Vol.3.2009: Macmillan.
    [3] Lander, E.S., et al., Initial sequencing and analysis of the human genome.Nature,2001.409(6822): p.860-921.
    [4] Durbin, R., Biological sequence analysis: probabilistic models of proteinsand nucleic acids1998: Cambridge university press.
    [5] Jones, N.C. and P. Pevzner, An introduction to bioinformatics algorithms2004: MIT press.
    [6] Isaev, A., Introduction to mathematical methods in bioinformatics2004:Springer.
    [7]孙啸,陆祖宏,谢建明,生物信息学基础2005:清华大学出版社有限公司.
    [8] Watson, J.D. and F.H. Crick, Molecular structure of nucleic acids. Nature,1953.171(4356): p.737-738.
    [9] Benson, D.A., et al., GenBank. Nucleic Acids Research,2010.38(suppl1): p.D46-D51.
    [10] Kanz, C., et al., The EMBL nucleotide sequence database. Nucleic AcidsResearch,2005.33(suppl1): p. D29-D33.
    [11] Miyazaki, S., et al., DDBJ in the stream of various biological data. NucleicAcids Research,2004.32(suppl1): p. D31-D34.
    [12] Boeckmann, B., et al., The SWISS-PROT protein knowledgebase and itssupplement TrEMBL in2003. Nucleic Acids Research,2003.31(1): p.365-370.
    [13] Gribskov, M., A.D. McLachlan, and D. Eisenberg, Profile analysis: detectionof distantly related proteins. Proceedings of the National Academy ofSciences,1987.84(13): p.4355-4358.
    [14] Yates III, J.R., et al., Method to correlate tandem mass spectra of modifiedpeptides to amino acid sequences in the protein database. Analyticalchemistry,1995.67(8): p.1426-1436.
    [15] Romero, P., et al. Identifying disordered regions in proteins from amino acidsequence. in Neural Networks,1997., International Conference on.1997.IEEE.
    [16] Clark, A.G., et al., Evolution of genes and genomes on the Drosophilaphylogeny. Nature,2007.450(7167): p.203-218.
    [17] Sudmant, P.H., et al., Diversity of human copy number variation andmulticopy genes. Science,2010.330(6004): p.641-646.
    [18] Michaelson, J.J., et al., Whole-genome sequencing in autism identifies hotspots for de novo germline mutation. Cell,2012.151(7): p.1431-1442.
    [19] Qin, J., et al., A metagenome-wide association study of gut microbiota intype2diabetes. Nature,2012.490(7418): p.55-60.
    [20] Crick, F., Central dogma of molecular biology. Nature,1970.227(5258): p.561-563.
    [21] Pearson, H., Genetics: what is a gene? Nature,2006.441(7092): p.398-401.
    [22] Solomon, E.P., et al., Biology1993: Saunders College Publishing USA.
    [23] Grummt, I., Regulation of mammalian ribosomal gene transcription by RNApolymerase I. Progress in nucleic acid research and molecular biology,1998.62: p.109-154.
    [24] Lee, Y., et al., MicroRNA genes are transcribed by RNA polymerase II. TheEMBO journal,2004.23(20): p.4051-4060.
    [25] Willis, I.M., RNA polymerase III, in EJB Reviews19931994, Springer. p.29-39.
    [26] Herr, A., et al., RNA polymerase IV directs silencing of endogenous DNA.Science,2005.308(5718): p.118-120.
    [27] Wierzbicki, A.T., et al., RNA polymerase V transcription guidesARGONAUTE4to chromatin. Nature genetics,2009.41(5): p.630-634.
    [28] Latchman, D.S., Transcription factors: an overview. The internationaljournal of biochemistry&cell biology,1997.29(12): p.1305-1312.
    [29] Carey, M. and S.T. Smale, Transcriptional regulation in eukaryotes2000:Cold Spring Harbor Laboratory Press Cold Spring Harbor:.
    [30]朱玉贤,李毅,郑晓峰,现代分子生物学2007:高等教育出版社.
    [31] Watson, J.D., Recombinant DNA: genes and genomes: a short course2007:Macmillan.
    [32] Lee, T.I. and R.A. Young, Transcription of eukaryotic protein-coding genes.Annual review of genetics,2000.34(1): p.77-137.
    [33] Wray, G.A., et al., The evolution of transcriptional regulation in eukaryotes.Molecular biology and evolution,2003.20(9): p.1377-1419.
    [34] Zambelli, F., G. Pesole, and G. Pavesi, Motif discovery and transcriptionfactor binding sites before and after the next-generation sequencing era.Briefings in Bioinformatics,2013.14(2): p.225-237.
    [35] Schneider, T.D., Consensus sequence zen. Applied bioinformatics,2002.1(3): p.111.
    [36] Rose, T.M., et al., Consensus-degenerate hybrid oligonucleotide primers foramplification of distantly related sequences. Nucleic Acids Research,1998.26(7): p.1628-1635.
    [37] Cornish-Bowden, A., Nomenclature for incompletely specified bases innucleic acid sequences: recommendations1984. Nucleic Acids Research,1985.13(9): p.3021.
    [38] Stormo, G.D., DNA binding sites: representation and discovery.Bioinformatics,2000.16(1): p.16-23.
    [39] Liu, J.S., A.F. Neuwald, and C.E. Lawrence, Bayesian models for multiplelocal sequence alignment and Gibbs sampling strategies. Journal of theAmerican Statistical Association,1995.90(432): p.1156-1170.
    [40] Bailey, T.L. and C. Elkan, Fitting a mixture model by expectationmaximization to discover motifs in bipolymers,1994, Department ofComputer Science and Engineering, University of California, San Diego.
    [41] Schneider, T.D., et al., Information content of binding sites on nucleotidesequences. Journal of Molecular Biology,1986.188(3): p.415-431.
    [42] Hertz, G.Z. and G.D. Stormo, Identifying DNA and protein patterns withstatistically significant alignments of multiple sequences. Bioinformatics,1999.15(7): p.563-577.
    [43] Schneider, T.D. and R.M. Stephens, Sequence logos: a new way to displayconsensus sequences. Nucleic Acids Research,1990.18(20): p.6097-6100.
    [44] Down, T.A. and T.J. Hubbard, NestedMICA: sensitive inference ofover-represented motifs in nucleic acid sequence. Nucleic Acids Research,2005.33(5): p.1445-1453.
    [45] Thomas-Chollier, M., et al., RSAT: regulatory sequence analysis tools.Nucleic Acids Research,2008.36(suppl2): p. W119-W127.
    [46] Yipu Zhang, Hongwei Huo, and Qiang Yu, A Heuristic Cluster-based EMAlgorithm for the Planted (l, d) Problem, J Bioinform Comput Biol.,2013,11(4):1350009.
    [47] Huggins, P., et al., DECOD: fast and accurate discriminative DNA motiffinding. Bioinformatics,2011.27(17): p.2361-2367.
    [48] Georgiev, S., et al., Evidence-ranked motif identification. Genome Biol,2010.11(2): p. R19.
    [49] Galas, D.J. and A. Schmitz, DNAase footprinting a simple method for thedetection of protein-DNA binding specificity. Nucleic Acids Research,1978.5(9): p.3157-3170.
    [50] Horak, C.E. and M. Snyder, ChIP-chip: a genomic approach for identifyingtranscription factor binding sites. Methods in enzymology,2002.350: p.469-483.
    [51] Tompa, M., Assessing Computational Tools for the Discovery ofTranscription Factor Binding Sites. Nature Biotechnology,2005.23(no1): p.137-144.
    [52] Sadler, J., M. Waterman, and T. Smith, Regulatory pattern identification innucleic acid sequences. Nucleic Acids Research,1983.11(7): p.2221-2232.
    [53] Waterman, M., R. Arratia, and D. Galas, Pattern recognition in severalsequences: consensus and alignment. Bulletin of mathematical biology,1984.46(4): p.515-527.
    [54] Galas, D.J., M. Eggert, and M.S. Waterman, Rigorous pattern-recognitionmethods for DNA sequences: Analysis of promoter sequences fromEscherichia coli. Journal of Molecular Biology,1985.186(1): p.117-128.
    [55] Li, M., B. Ma, and L. Wang, Finding Similar Regions in Many Sequences.Journal of Computer and System Sciences,2002.65(1): p.73-96.
    [56] Marsan, L. and M.-F. Sagot, Algorithms for extracting structured motifsusing a suffix tree with an application to promoter and regulatory siteconsensus identification. Journal of Computational Biology,2000.7(3-4): p.345-362.
    [57] Pavesi, G., G. Mauri, and G. Pesole, An algorithm for finding signals ofunknown length in DNA sequences. Bioinformatics,2001.17(suppl1): p.S207-S214.
    [58] Caselle, M., F. Di Cunto, and P. Provero, Correlating overrepresentedupstream motifs to gene expression: a computational approach to regulatoryelement discovery in eukaryotes. BMC Bioinformatics,2002.3(1): p.7.
    [59] Corà, D., et al., Computational identification of transcription factor bindingsites by functional analysis of sets of genes sharing overrep-resentedupstream motifs. BMC Bioinformatics,2004.5(1): p.57.
    [60] Van Helden, J., B. André, and J. Collado-Vides, Extracting regulatory sitesfrom the upstream region of yeast genes by computational analysis ofoligonucleotide frequencies. Journal of Molecular Biology,1998.281(5): p.827-842.
    [61] Shinozaki, D., T. Akutsu, and O. Maruyama, Finding optimal degeneratepatterns in DNA sequences. Bioinformatics,2003.19(suppl2): p.ii206-ii214.
    [62] Sinha, S. and M. Tompa, YMF: a program for discovery of noveltranscription factor binding sites by statistical overrepresentation. NucleicAcids Research,2003.31(13): p.3586-3588.
    [63] Pavesi, G., et al., Weeder Web: discovery of transcription factor binding sitesin a set of sequences from co-regulated genes. Nucleic Acids Research,2004.32(suppl2): p. W199-W203.
    [64] Marschall, T. and S. Rahmann, Efficient exact motif discovery.Bioinformatics,2009.25(12): p. i356-i364.
    [65] Akutsu, T., H. Arimura, and S. Shimozono. On approximation algorithms forlocal multiple alignment. in Proceedings of the fourth annual internationalconference on Computational molecular biology.2000. ACM.
    [66] Hertz, G.Z., G.W. Hartzell, and G.D. Stormo, Identification of consensuspatterns in unaligned DNA sequences known to be functionally related.Computer applications in the biosciences: CABIOS,1990.6(2): p.81-92.
    [67] Lawrence, C.E. and A.A. Reilly, An expectation maximization (EM)algorithm for the identification and characterization of common sites inunaligned biopolymer sequences. Proteins: Structure, Function, andBioinformatics,1990.7(1): p.41-51.
    [68] Bailey, T. and C. Elkan, The value of prior knowledge in discovering motifswith MEME. Proceedings of the Third International Conference onIntelligent Systems for Molecular Biology, Cambridge, United Kingdom,July16-19,1995,1995.3: p.21-29.
    [69] Lawrence, C.E., et al., Detecting subtle sequence signals: a Gibbs samplingstrategy for multiple alignment. Science,1993.262(5131): p.208-214.
    [70] Neuwald, A.F., J.S. Liu, and C.E. Lawrence, Gibbs motif sampling:detection of bacterial outer membrane protein repeats. Protein science,1995.4(8): p.1618-1632.
    [71] Hughes, J.D., et al., Computational identification of Cis-regulatory elementsassociated with groups of functionally related genes in Saccharomycescerevisiae. Journal of Molecular Biology,2000.296(5): p.1205-1214.
    [72] Workman, C. and G. Stormo, ANN-Spec: a method for discoveringtranscription factor binding sites with improved specificity. Pac SympBiocomput,2000: p.467-478.
    [73] Liu, X., D.L. Brutlag, and J.S. Liu, BioProspector: discovering conservedDNA motifs in upstream regulatory regions of co-expressed genes. PacificSymposium On Biocomputing NIL,2001: p.127-138.
    [74] Thijs, G., et al., A higher-order background model improves the detection ofpromoter regulatory elements by Gibbs sampling. Bioinformatics,2001.17(12): p.1113-1122.
    [75] Marchal, K., et al., Genome-specific higher-order background models toimprove motif detection. Trends in microbiology,2003.11(2): p.61-66.
    [76] Narasimhan, C., P. LoCascio, and E. Uberbacher, Backgroundrareness-based iterative multiple sequence alignment algorithm forregulatory element detection. Bioinformatics,2003.19(15): p.1952-1963.
    [77] Bailey, T.L., et al., MEME: discovering and analyzing DNA and proteinsequence motifs. Nucleic Acids Research,2006.34(suppl2): p.W369-W373.
    [78] Thijs, G., et al., A Gibbs sampling method to detect over-represented motifsin the upstream regions of co-expressed genes, in Proceedings of the fifthannual international conference on Computational biology2001, ACM:Montreal, Quebec, Canada.
    [79] Aerts, S., et al., Toucan: deciphering the cis‐regulatory logic of coregulatedgenes. Nucleic Acids Research,2003.31(6): p.1753-1764.
    [80] Frith, M.C., et al., Finding functional sequence elements by multiple localalignment. Nucleic Acids Research,2004.32(1): p.189-200.
    [81] Thompson, W., E.C. Rouchka, and C.E. Lawrence, Gibbs Recursive Sampler:finding transcription factor binding sites. Nucleic Acids Research,2003.31(13): p.3580-3585.
    [82] Wei, Z. and S.T. Jensen, GAME: detecting cis-regulatory elements using agenetic algorithm. Bioinformatics,2006.22(13): p.1577-1584.
    [83]霍红卫等,(l, d)-模体识别问题的遗传优化算法.计算机学报,2012.35(7):p.1429-1439.
    [84] Li, L., GADEM: a genetic algorithm guided formation of spaced dyadscoupled with an EM algorithm for motif discovery. Journal ofComputational Biology,2009.16(2): p.317-329.
    [85] Fogel, G.B., et al., Discovery of sequence motifs related to coexpression ofgenes using evolutionary computation. Nucleic Acids Research,2004.32(13): p.3826-3835.
    [86] Defrance, M. and J. Van Helden, Info-gibbs: a motif discovery algorithmthat directly optimizes information content during sampling. Bioinformatics,2009.25(20): p.2715-2722.
    [87] Bailey, T.L. and C. Elkan, Unsupervised learning of multiple motifs inbiopolymers using expectation maximization. Machine learning,1995.21(1-2): p.51-80.
    [88] Pevzner, P.A. and S.-H. Sze. Combinatorial approaches to finding subtlesignals in DNA sequences. in ISMB.2000.
    [89] Zhang, S., et al., MotifClick: prediction of cis-regulatory binding sites viamerging cliques. BMC Bioinformatics,2011.12(1): p.238.
    [90] Fratkin, E., et al., MotifCut: regulatory motifs finding with maximumdensity subgraphs. Bioinformatics,2006.22(14): p. e150-e157.
    [91] Mahony, S., et al., Transcription factor binding site identification using theself-organizing map. Bioinformatics,2005.21(9): p.1807-1814.
    [92] Lee, N.K. and D. Wang, SOMEA: self-organizing map based extractionalgorithm for DNA motif identification with heterogeneous model. BMCBioinformatics,2011.12(Suppl1): p. S16.
    [93] Wasserman, W.W., et al., Human-mouse genome comparisons to locateregulatory sites. Nature genetics,2000.26(2): p.225-228.
    [94] Favorov, A.V., et al., A Gibbs sampler for identification of symmetricallystructured, spaced DNA motifs with improved estimation of the signal length.Bioinformatics,2005.21(10): p.2240-2245.
    [95] Shamir, R., et al., EXPANDER–an integrative program suite for microarraydata analysis. BMC Bioinformatics,2005.6(1): p.232.
    [96] Che, D., et al., BEST: binding-site estimation suite of tools. Bioinformatics,2005.21(12): p.2909-2911.
    [97] Gordon, D.B., et al., TAMO: a flexible, object-oriented framework foranalyzing transcriptional regulation using DNA-sequence motifs.Bioinformatics,2005.21(14): p.3164-3165.
    [98] McCue, L.A., et al., Phylogenetic footprinting of transcription factor bindingsites in proteobacterial genomes. Nucleic Acids Research,2001.29(3): p.774-782.
    [99] Sinha, S., M. Blanchette, and M. Tompa, PhyME: a probabilistic algorithmfor finding motifs in sets of orthologous sequences. BMC Bioinformatics,2004.5: p.170.
    [100] Siddharthan, R., E.D. Siggia, and E. Van Nimwegen, PhyloGibbs: a Gibbssampling motif finder that incorporates phylogeny. Plos ComputationalBiology,2005.1(7): p. e67.
    [101] Wang, T. and G.D. Stormo, Combining phylogenetic data with co-regulatedgenes to identify regulatory motifs. Bioinformatics,2003.19(18): p.2369-2380.
    [102] Moses, A.M., et al., MONKEY: identifying conserved transcription-factorbinding sites in multiple alignments using a binding site-specificevolutionary model. Genome Biol,2004.5(12): p. R98.
    [103]李婷婷等,转录因子结合位点的计算分析方法. ACTA BIOPHYSICASINICA,2008.24(5).
    [104] Reményi, A., H.R. Sch ler, and M. Wilmanns, Combinatorial control ofgene expression. Nature structural&molecular biology,2004.11(9).
    [105] Zhou, Q. and W.H. Wong, CisModule: de novo discovery of cis-regulatorymodules by hierarchical mixture modeling. Proceedings of the nationalacademy of sciences of the United States of America,2004.101(33): p.12114-12119.
    [106] Gupta, M. and J.S. Liu, De novo cis-regulatory module elicitation foreukaryotic genomes. Proceedings of the national academy of sciences of theUnited States of America,2005.102(20): p.7079-7084.
    [107] Thompson, W., et al., Decoding human regulatory circuits. GenomeResearch,2004.14(10a): p.1967-1974.
    [108] Johnson, D.S., et al., De novo discovery of a tissue-specific gene regulatorymodule in a chordate. Genome Research,2005.15(10): p.1315-1324.
    [109] Wingender, E., et al., TRANSFAC: a database on transcription factors andtheir DNA binding sites. Nucleic Acids Research,1996.24(1): p.238-241.
    [110] Sandelin, A., et al., JASPAR: an open‐access database for eukaryotictranscription factor binding profiles. Nucleic Acids Research,2004.32(suppl1): p. D91-D94.
    [111] Robertson, G., et al., cisRED: a database system for genome-scalecomputational discovery of regulatory elements. Nucleic Acids Research,2006.34(suppl1): p. D68-D73.
    [112] Karolchik, D., et al., The UCSC genome browser database. Nucleic AcidsResearch,2003.31(1): p.51-54.
    [113] Hubbard, T.J., et al., Ensembl2007. Nucleic Acids Research,2007.35(suppl1): p. D610-D617.
    [114] Zhu, J. and M.Q. Zhang, SCPD: a promoter database of the yeastSaccharomyces cerevisiae. Bioinformatics,1999.15(7): p.607-611.
    [115] Sierro, N., et al., DBTBS: a database of transcriptional regulation in Bacillussubtilis containing upstream intergenic conservation information. NucleicAcids Research,2008.36(suppl1): p. D93-D96.
    [116] Collas, P. and J.A. Dahl, Chop it, ChIP it, check it: the current status ofchromatin immunoprecipitation. Front Biosci,2008.13(17): p.929-943.
    [117] Ren, B., et al., Genome-wide location and function of DNA binding proteins.Science,2000.290(5500): p.2306-2309.
    [118] Johnson, D.S., et al., Genome-wide mapping of in vivo protein-DNAinteractions. Science,2007.316(5830): p.1497-1502.
    [119] Buhler, J. and M. Tompa, Finding motifs using random projections. Journalof Computational Biology,2002.9(2): p.225-242.
    [120] Ho, E.S., C.D. Jakubowski, and S.I. Gunderson, iTriplet, a rule-basednucleic acid sequence motif finder. Algorithms for molecular biology: AMB,2009.4: p.14.
    [121] Evans, P.A., A.D. Smith, and H.T. Wareham, On the complexity of findingcommon approximate substrings. Theor. Comput. Sci.,2003.306(1-3): p.407-430.
    [122] Li, M., B. Ma, and L. Wang, On the closest string and substring problems.Journal of the ACM (JACM),2002.49(2): p.157-171.
    [123] Davila, J., S. Balla, and S. Rajasekaran, Fast and practical algorithms forplanted (l, d) motif search. Computational Biology and Bioinformatics,IEEE/ACM Transactions on,2007.4(4): p.544-552.
    [124] Huang, C.-W., W.-S. Lee, and S.-Y. Hsieh, An improved heuristic algorithmfor finding motif signals in DNA sequences. IEEE/ACM Transactions onComputational Biology and Bioinformatics (TCBB),2011.8(4): p.959-975.
    [125] Dinh, H., S. Rajasekaran, and V.K. Kundeti, PMS5: an efficient exactalgorithm for the (l, d)-motif finding problem. BMC Bioinformatics,2011.12(1): p.410.
    [126] Yu, Q., et al., PairMotif: a new pattern-driven algorithm for planted (l, d)DNA motif search. PLoS ONE,2012.7(10): p. e48442.
    [127] Yang, X. and J.C. Rajapakse, Graphical approach to weak motif recognition.Genome Informatics Series,2004.15(2): p.52.
    [128] Sun, H.Q., et al., RecMotif: a novel fast algorithm for weak motif discovery.BMC Bioinformatics,2010.11(Suppl11): p. S8.
    [129] Dopazo, J., et al., Design of primers for PCR ampiification of highlyvariable genomes. Computer applications in the biosciences: CABIOS,1993.9(2): p.123-125.
    [130] Lanctot, J.K., et al. Distinguishing string selection problems. in Proceedingsof the tenth annual ACM-SIAM symposium on Discrete algorithms.1999.Society for Industrial and Applied Mathematics.
    [131] Deng, X., et al., Genetic design of drugs without side-effects. SIAM Journalon Computing,2003.32(4): p.1073-1090.
    [132] Ben-Dor, A., et al. Banishing bias from consensus sequences. inCombinatorial Pattern Matching.1997. Springer.
    [133] Boyer, L.A., et al., Core transcriptional regulatory circuitry in humanembryonic stem cells. Cell,2005.122(6): p.947-956.
    [134] Loh, Y.-H., et al., The Oct4and Nanog transcription network regulatespluripotency in mouse embryonic stem cells. Nature genetics,2006.38(4): p.431-440.
    [135] O’Neill, L.P. and B.M. Turner, Immunoprecipitation of native chromatin:NChIP. Methods,2003.31(1): p.76-82.
    [136] Sun, J.-M., H.Y. Chen, and J.R. Davie, Differential distribution ofunmodified and phosphorylated histone deacetylase2in chromatin. Journalof Biological Chemistry,2007.282(45): p.33227-33236.
    [137] Iyer, V.R., et al., Genomic binding sites of the yeast cell-cycle transcriptionfactors SBF and MBF. Nature,2001.409(6819): p.533-538.
    [138] Pillai, S. and S.P. Chellappan, ChIP on chip assays: genome-wide analysis oftranscription factor binding and histone modifications, in ChromatinProtocols2009, Springer. p.341-366.
    [139] Mardis, E.R., The impact of next-generation sequencing technology ongenetics. Trends in genetics,2008.24(3): p.133-141.
    [140] Mardis, E.R., ChIP-seq: welcome to the new frontier. Nat Methods,2007.4(8): p.613-4.
    [141] Barski, A., et al., High-resolution profiling of histone methylations in thehuman genome. Cell,2007.129(4): p.823-837.
    [142] Robertson, G., et al., Genome-wide profiles of STAT1DNA associationusing chromatin immunoprecipitation and massively parallel sequencing.Nat Methods,2007.4(8): p.651-657.
    [143] Mikkelsen, T.S., et al., Genome-wide maps of chromatin state in pluripotentand lineage-committed cells. Nature,2007.448(7153): p.553-560.
    [144] Jones, P.A. and P.W. Laird, Cancer-epigenetics comes of age. Naturegenetics,1999.21(2): p.163-167.
    [145] Li, H. and N. Homer, A survey of sequence alignment algorithms fornext-generation sequencing. Briefings in Bioinformatics,2010.11(5): p.473-483.
    [146] Laajala, T.D., et al., A practical comparison of methods for detectingtranscription factor binding sites in ChIP-seq experiments. BMC genomics,2009.10(1): p.618.
    [147]王曦等,新一代高通量RNA测序数据的处理与分析.生物化学与生物物理进展,2010.37(8): p.834-846.
    [148] Li, H., J. Ruan, and R. Durbin, Mapping short DNA sequencing reads andcalling variants using mapping quality scores. Genome Research,2008.18(11): p.1851-1858.
    [149] Burrows, M. and D.J. Wheeler, A block-sorting lossless data compressionalgorithm.1994.
    [150] Trapnell, C. and S.L. Salzberg, How to map billions of short reads ontogenomes. Nature Biotechnology,2009.27(5): p.455.
    [151] Langmead, B., et al., Ultrafast and memory-efficient alignment of shortDNA sequences to the human genome. Genome Biol,2009.10(3): p. R25.
    [152] Li, H. and R. Durbin, Fast and accurate short read alignment withBurrows–Wheeler transform. Bioinformatics,2009.25(14): p.1754-1760.
    [153] Li, R., et al., SOAP2: an improved ultrafast tool for short read alignment.Bioinformatics,2009.25(15): p.1966-1967.
    [154] Homer, N., B. Merriman, and S.F. Nelson, BFAST: an alignment tool forlarge scale genome resequencing. PLoS ONE,2009.4(11): p. e7767.
    [155] Rumble, S.M., et al., SHRiMP: accurate mapping of short color-space reads.Plos Computational Biology,2009.5(5): p. e1000386.
    [156]高山等,下一代测序中ChIP-seq数据的处理与分析.遗传,2012.34(6):p.773-783.
    [157] Wilbanks, E.G. and M.T. Facciotti, Evaluation of algorithm performance inChIP-seq peak detection. PLoS ONE,2010.5(7): p. e11471.
    [158] Zhang, Y., et al., Model-based analysis of ChIP-Seq (MACS). Genome Biol,2008.9(9): p. R137.
    [159] Valouev, A., et al., Genome-wide analysis of transcription factor bindingsites based on ChIP-Seq data. Nat Methods,2008.5(9): p.829-834.
    [160] Krig, S.R., et al., Identification of genes directly regulated by the oncogeneZNF217using chromatin immunoprecipitation (ChIP)-chip assays. Journalof Biological Chemistry,2007.282(13): p.9703-9712.
    [161] Zeller, K.I., et al., Global mapping of c-Myc binding sites and target genenetworks in human B cells. Proceedings of the National Academy ofSciences,2006.103(47): p.17834-17839.
    [162] Hu, M., et al., On the detection and refinement of transcription factorbinding sites using ChIP-Seq data. Nucleic Acids Research,2010.38(7): p.2154-2167.
    [163] Machanick, P. and T.L. Bailey, MEME-ChIP: motif analysis of large DNAdatasets. Bioinformatics,2011.27(12): p.1696-1697.
    [164] Reid, J. and L. Wernisch, STEME: efficient EM to find motifs in large datasets. Nucleic Acids Research,2011.39(18): p. e126-e126.
    [165] Kulakovskiy, I.V., et al., Deep and wide digging for binding motifs inChIP-Seq data. Bioinformatics,2010.26(20): p.2622-2623.
    [166] Liu, X.S., D.L. Brutlag, and J.S. Liu, An algorithm for finding protein-DNAbinding sites with applications to chromatin-immunoprecipitationmicroarray experiments. Nature Biotechnology,2002.20(8): p.835.
    [167] Ettwiller, L., et al., Trawler: de novo regulatory motif discovery pipeline forchromatin immunoprecipitation. Nat Methods,2007.4(7): p.563-565.
    [168] Linhart, C., Y. Halperin, and R. Shamir, Transcription factor and microRNAmotif discovery: the Amadeus platform and a compendium of metazoantarget sets. Genome Research,2008.18(7): p.1180-1189.
    [169] Bailey, T.L., DREME: motif discovery in transcription factor ChIP-seq data.Bioinformatics,2011.27(12): p.1653-1659.
    [170] Sharov, A.A. and M.S.H. Ko, Exhaustive Search for Over-represented DNASequence Motifs with CisFinder. DNA Research,2009.16(5): p.261-273.
    [171] Rhee, Ho S. and B.F. Pugh, Comprehensive Genome-wide Protein-DNA Interactions Detected at Single-Nucleotide Resolution. Cell, 2011. 147(6): p. 1408-1419.
    [172] Li, N. and M. Tompa, Analysis of computational approaches for motif discovery. Algorithms for Molecular Biology, 2006. 1(1): p. 8.
    [173] Benos, P.V., A.S. Lapedes, and G.D. Stormo, Probabilistic Code for DNA Recognition by Proteins of the EGR Family. Journal of Molecular Biology, 2002. 323(4): p. 701-727.
    [174] Mahony, S., et al., Self-organizing neural networks to support the discovery of DNA-binding motifs. Neural Netw., 2006. 19(6): p. 950-962.
    [175] Blekas, K., D.I. Fotiadis, and A. Likas, Greedy mixture learning for multiple motif discovery in biological sequences. Bioinformatics, 2003. 19(5): p. 607-617.
    [176] Van Heeringen, S.J. and G.J.C. Veenstra, GimmeMotifs: a de novo motif prediction pipeline for ChIP-sequencing experiments. Bioinformatics, 2011. 27(2): p. 270-271.
    [177] Stormo, G.D. and G.W. Hartzell, Identifying protein-binding sites from unaligned DNA fragments. Proceedings of the National Academy of Sciences, 1989. 86(4): p. 1183-1187.
    [178] Liu, J., The Collapsed Gibbs Sampler in Bayesian Computations with Applications to a Gene Regulation Problem. Journal of the American Statistical Association, 1994. 89(427): p. 958.
    [179] Chan, T.-M., K.-S. Leung, and K.-H. Lee, TFBS identification based on genetic algorithm with combined representations and adaptive post-processing. Bioinformatics, 2008. 24(3): p. 341-349.
    [180] Blanco, E., et al., ABS: a database of Annotated regulatory Binding Sites from orthologous promoters. Nucleic Acids Research, 2006. 34(suppl 1): p. D63-D67.
    [181] Hu, J., B. Li, and D. Kihara, Limitations and potentials of current motif discovery algorithms. Nucleic Acids Research, 2005. 33(15): p. 4899-4913.
    [182] Shaw Jr, W.M., R. Burgin, and P. Howell, Performance standards and evaluations in IR test collections: Cluster-based retrieval models. Information Processing & Management, 1997. 33(1): p. 1-14.
    [183] Keich, U. and P.A. Pevzner, Finding motifs in the twilight zone, in Proceedings of the sixth annual international conference on Computational biology. 2002, ACM: Washington, DC, USA. p. 195-204.
    [184] Zambelli, F. and G. Pavesi, A Faster Algorithm for Motif Finding in Sequences from ChIP-Seq Data, in Computational Intelligence Methods for Bioinformatics and Biostatistics. 2012, Springer. p. 201-212.
    [185] Chen, X., et al., Integration of External Signaling Pathways with the Core Transcriptional Network in Embryonic Stem Cells. Cell, 2008. 133(6): p. 1106-1117.
    [186] Sinha, S., Discriminative motifs. J Comput Biol, 2003. 10(3-4): p. 599-615.
    [187] Sinha, S., On counting position weight matrix matches in a sequence, with application to discriminative motif finding. Bioinformatics, 2006. 22(14): p. e454-e463.
    [188] Redhead, E. and T. Bailey, Discriminative motif discovery in DNA and protein sequences using the DEME algorithm. BMC Bioinformatics, 2007. 8(1): p. 385.
    [189] Fauteux, F., M. Blanchette, and M.V. Strömvik, Seeder: discriminative seeding DNA motif discovery. Bioinformatics, 2008. 24(20): p. 2303-2307.
    [190] Mason, M.J., K. Plath, and Q. Zhou, Identification of Context-Dependent Motifs by Contrasting ChIP Binding Data. Bioinformatics, 2010. 26(22): p. 2826-2832.
    [191] Harbison, C.T., et al., Transcriptional regulatory code of a eukaryotic genome. Nature, 2004. 431(7004): p. 99-104.
    [192] Bailey, T., et al., The value of position-specific priors in motif discovery using MEME. BMC Bioinformatics, 2010. 11(1): p. 179.
    [193] Cartwright, P., et al., LIF/STAT3 controls ES cell self-renewal and pluripotency by a Myc-dependent mechanism. Development, 2005. 132(5): p. 885-896.
    [194] Jiang, J., et al., A core Klf circuitry regulates self-renewal of embryonic stem cells. Nat Cell Biol, 2008. 10(3): p. 353-60.
    [195] Ivanova, N., et al., Dissecting self-renewal in stem cells with RNA interference. Nature, 2006. 442(7102): p. 533-8.
    [196] Kim, J., et al., An extended transcriptional network for pluripotency of embryonic stem cells. Cell, 2008. 132(6): p. 1049-61.
    [197] Thomas-Chollier, M., et al., RSAT 2011: regulatory sequence analysis tools. Nucleic Acids Research, 2011. 39(suppl 2): p. W86-W91.
    [198] Thomas-Chollier, M., et al., RSAT peak-motifs: motif analysis in full-size ChIP-seq datasets. Nucleic Acids Research, 2012. 40(4): p. e31-e31.
    [199] Bourque, G., et al., Evolution of the mammalian transcription factor binding repertoire via transposable elements. Genome Res, 2008. 18(11): p. 1752-62.
    [200] Wasserman, W.W. and A. Sandelin, Applied bioinformatics for the identification of regulatory elements. Nature Reviews Genetics, 2004. 5(4): p. 276-287.
    [201] Bi, C., A Monte Carlo EM algorithm for de novo motif discovery in biomolecular sequences. Computational Biology and Bioinformatics, IEEE/ACM Transactions on, 2009. 6(3): p. 370-386.
    [202] Burset, M. and R. Guigo, Evaluation of gene structure prediction programs. Genomics, 1996. 34: p. 353-367.
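    [203] Zhang, Y., et al., A positioned projection refinement algorithm for transcription factor binding site identification (in Chinese). Chinese Journal of Computers, 2013. 36(12): p. 2545-2559.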
    [204] Yu, Q., et al., PairMotif+: A Fast and Effective Algorithm for De Novo Motif Discovery in DNA sequences. International journal of biological sciences, 2013. 9(4): p. 412.
