用户名: 密码: 验证码:
基于云计算的文本挖掘技术研究
详细信息    本馆镜像全文|  推荐本文 |  |   获取CNKI官网全文
摘要
随着互联网和通信技术的飞速发展,数据正以指数级的形式增长,其中很大一部分是文本数据。如何能在有限的时间内完成对海量文本数据的挖掘分析,缩短从海量数据到辅助决策的时间消耗是一个十分具有挑战性的问题。云计算不需要高性能的服务器,只需要运行在大量廉价的PC集群上,是近几年刚刚兴起的一种技术,可以提供海量存储能力和超级计算能力。将云计算技术应用到海量文本数据的挖掘分析,可以提高文本数据的预处理、挖掘算法和可视化展现过程的时间性能。因此,将云计算与文本挖掘相结合,具有非常重要的研究意义和应用前景。
     本文主要研究海量文本数据的聚类分析,具体工作有如下几点:
     1.分析了云计算的相关核心技术以及云计算平台Hadoop的搭建过程。在云计算与文本挖掘相结合的基础上,设计基于Hadoop平台的文本挖掘原型系统,并详细介绍了各个子系统的具体功能。
     2.停用词过滤是文本挖掘预处理流程的一个重要环节,本文构建了一个综合停用词表,并利用哈希算法设计了一种较为快速的哈希表过滤方法,通过实验证明了停用词表及其过滤算法在文本粗降维和提高过滤效率上的作用。
     3.文本预处理占据整个挖掘过程中很大部分的时间性能消耗,将该过程并行化能够缩短挖掘时间。本文详细分析设计了文本预处理的MapReduce化过程,并对预处理的数据划分、Map函数和Reduce函数进行了详尽说明,最后和单节点运行进行实验对比,验证了并行化的高效性。
     4.分析了聚类算法在文本挖掘中所面临的主要困难以及Jarvis-Patrick算法用于文本聚类的优点,详细分析设计了Jarvis-Patrick算法的并行化过程,对算法的数据划分、Map函数、Reduce函数和分类簇的提取进行了详尽说明,最后和单节点运行进行实验对比,验证了算法并行化的高效性。
With the rapid development of internet and communication techniques, the amount of data, most of which is text data, increases in the form of exponential. It is a very challenging question to analyze the mass data, extract potential knowledge and shorten the process from the data collection to decision support.
     Cloud computing, which can be deployed in the commodity personal computers, is a newly produced technique that can provide the ability of mass storage and super computation. If cloud computing is applied in the text mining on the mass text data, it is possible to promote the time performance for text preprocessing, mining algorithms and results visualization. So it is of great research significance and application future to combine text mining and cloud computing.
     The main works of the paper are as follows:
     1. The key techniques of cloud computing and the process of the establishment of open source platform-Hadoop are analyzed. Based on the feasibility of the combination of cloud computing and text mining, we design a Hadoop-based text mining system, and the functions and relations between the three sub-systems are introduced in detail.
     2. The stopword filter is an indispensable step in the preprocessing of text mining. In this paper, we construct a composite stopword list and design a quick filter algorithm based on hash function. The experiment verifies the effect of stopword filter in the reduction of text vector dimension and time consuming.
     3. The text preprocessing takes a large part in the whole time consuming, so it is necessary to apply it in the MapReduce programming model. This paper analyses and designs the MapReduce steps of the preprocessing, and then the input data partition, Map function and Reduce function are introduced in detail. At last, compared with the experiment run on the single node, the parallel execution are verified effective.
     4. The difficulties in the clustering algorithms used in text mining and the merits of Jarvis-Patrick used in the clustering are analyzed. After we analyze and design the MapReduce steps of the Jarvis-Patrick, the input data partition, Map function, Reduce function and extraction of the clusters are introduced in detail. In the end, the MapReduced-based Jarvis-Patrick algorithm is verified efficient compared with the single node execution.
引文
[1] Gray J, Hey T, In Search of Petabyte Databases, WWW.research.microsoft/~gray, 2001.
    [2] Gulli A, A.Signorini. The indexable Web is more than 11.5 billion pages[J]. WWW(Special interest tracks and posters), 2005.
    [3]李耐国.军事情报研究[M].北京:军事科学出版社,2001.
    [4]梁德文.关于网络中心战系统及其关键技术的探讨[手稿][M].中国电子科技集团第10研究所,2002.
    [5] Kent S. Strategic Intelligent For American World Policy[M]. Princeton: Princeton University,1949.
    [6] Feldman R, Dagan I. KDT-Knowledge Discovery in Textual Database [C]. Proceedings of the 1st Annual Conference on Knowledge Discovery and Data Mining.1995. 112—117.
    [7] Helena A, Oskari H, K.Mika, Inkeri V A. Applying Data Mining Techniques in Text Analysis[J]. 1997,2 4-8.
    [8] IBM, IBM Virtualization, http://www.ibm.com/virtulization, 2009.
    [9] Amazon, Amazon Elastic Compute Cloud, http://aws.amazon.com/ec2, 2009.
    [10] Isard M, Budiu M, Yu Y. Dryad: Distributed data-parallel programs from sequential building blocks. [C]. in: Proc. Of the 2nd European Conf. on Computer Systems.2007. 59-72.
    [11]刘鹏.云计算[M].北京:电子工业出版社,2010.
    [12] Bryant R E. Data—intensive supercomputing: The case for DISC,CMU-CS-07-128 [R]. Pittsburgh: USA:Carnegie Mellon University. Department of Computer Science, 2007.
    [13] Nurmi D, Wolski R, Eucalyptus: A Technical Report on an Elastic Utility Computing. Architecture, http://open.eucalyptus.com/documents/nurmi_et_al-eucalyptus_tech_report-august_2008.pdf, 2008.
    [14] Duoqian M, Qiguo D, Hongyun Z, Na J. Rough set based hybrid algorithm for text classfication[J]. Expert Systems with Applications, 2009,36(5) 9168-9174.
    [15] Mothe J, Chrisment C, Dkaki T. Information mining-use of the document dimensions to analyse interactively a document set [C]. European Colloquium on Information Retrieval Research.2001. 6-20.
    [16] Ghanem M, Chortaras A, Guo Y, Rowe A, et al. A grid of infrastructure for mixed bioinformatics data and text mining[J]. Computer Systems and Applications, 2005,34(1) 116-130.
    [17] Karanikas H, Tjortjis C, Theodoulidis B. An approach to Text Mining using Information Extraction [C]. Proceeding of the Fourth European Conference on Principles and Practice of Knowledge Discovery in Database.Lyon, France: 2000. 13-16.
    [18] Hu Q, Yu D, Duan Y, Bao W. A novel weighting formula and feature selection for text classfication based on rough set theory [C]. Proceedings of Natural Language Processing and Knowledge Engineering.2003. 638-645.
    [19] Tan S, Cheng X, Wang B, Xu H, et al. Using dragpushing to refine centroid text classifiers [C]. Proc. of the ACM SIGIR-05.ACM Press,2005. 653-654.
    [20] Miao D, Qiguo D, Hongyun Z, Na J. Rough set based hybrid algorithm for text classfication[J]. Expert Systems with Applications, 2009,36(5) 9168-9174.
    [21] Congnan L, Yanjun L, Chung S M. Text document clustering based on neighbors[J]. Data & Knowledge Engineering, 2009,68(11) 1271-1288.
    [22] Karatzoglou A, Feignerer I. Kernel-based machine learning for fast text mining in R[J]. Computational Statistics & Data Analysis, 2010,54(2) 290-297.
    [23] Wei C-P, Yang C C, Lin C-M. A Latent Sementic Indexing-based approach to multilingual document clustering[J]. Decision Support Systems, 2008,45(3) 606-620.
    [24] Harte H, Lu Y, Osbom S, Dehoney D, et al. Refining the extraction of relevant documents from biomedical literature to create a corpus for pathway text mining [C]. Proceedings of the 2003 IEEE: Bioinformatics Conference.2003. 644-645.
    [25] Cheong P, Fung G, Yu X, Lam J W. Stock Prediction: Intergrating text mining approach using real-time news [C]. Proceeding of 2003 IEEE International Conference on Computational Intelligence for Financial Engineering.2003. 395-402.
    [26] Kosala R, Blockeel H. Web Mining Research: A Survey [C]. ACM SIGKDD.2000. 1-15.
    [27] Li H, K Y. Mining from Open Answers in Questionaire Data [C]. Proc of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.2001. 443-449.
    [28] Pons-Porrata A, Berlanga-Llavori R, Rui-Shulcloper J. Topic discovery based ontext mining techniques[J]. Information Processing and Management, 2007,43 752-768.
    [29] H W I, J D K, M D, V T. Text Mining in a Digital Library[J]. Journal of Digital Libraries, Special Issue on Digital Libraries as Experienced by the Editors of the Journal, 2003,5 1-4.
    [30]吴斌,傅伟鹏,史忠植等.一种基于群体智能的web文档聚类算法[J].计算机研究与发展, 2002,39 (11): 1429-1435.
    [31]林鸿飞,马雅彬.基于聚类的文本过滤模型[J].大连理工大学学报, 2003,42 (2).
    [32] Y I, Y N, M M, T A, et al. Text Mining System for Analysis of Salesperson's Daily Reports [C]. Proc. of the Pacific Association for Computational Linguistics.Kitakyushu,Japan: 2001. 606-612.
    [33]谌志群,张国煊.文本挖掘研究进展[J].模式识别与人工智能, 2005,18 (1): 65-74.
    [34]董振东,董强,知网介绍, http://www.keenage.com.
    [35]刘云峰.基于潜在语义空间维度特性的多层文档聚类[J].清华大学学报(自然科学版), 2005,45 (1): 1783-1786.
    [36]周茜,赵明生等.中文文本分类中的特征选择研究[J].中文信息学报, 2004,18 (3): 17-23.
    [37]王继成,潘金贵,张福炎. web文本挖掘技术研究[J].计算机研究与发展, 2000,37 (5): 513-520.
    [38] Liu Y, Li M, Hammoud S, Alham N K, et al. A MapReduce based Distributed LSI [C]. 2010 7th Intl. Conf. on Fuzzy Systems and Knowledge Discovery.2010. 2978-2982.
    [39]李雪峰.基于云计算环境的web数据挖掘算法研究[D].北京交通大学. 2010.
    [40]程苗.云计算技术在web日志挖掘中的应用研究[D].中国科学技术大学. 2011.
    [41]刘洋.基于MapReduce的中医药并行数据挖掘服务[D].浙江大学. 2010.
    [42]钟延辉.基于文本挖掘的垃圾短信过滤方法[D].电子科技大学. 2008.
    [43]李军华.云计算及若干数据挖掘算法的MapReduce化研究[D].电子科技大学. 2010.
    [44] Ghemawat S, Gobioff H, T L P. The Google File System [C]. Proceedings of 19th ACM Symposium on Operation Systems Principles.New York: ACM Press,2003. 20-43.
    [45] Dean J, Ghemawat S. MapReduce: Simplified Data Processing on Large Clusters [C]. Proceedings of the 6th Symposium on operating System Design and Implementation.New York: ACM Press,2004. 137-150.
    [46] Chang F, Dean J, Ghemawat S. Bigtable: A Distributed Storage System for Structured Data[J]. ACM Transactions on Computer Systems, 2008,26 (2): 1-26.
    [47] Hadoop, http://hadoop.apache.org.
    [48] http://eucalyptus.com.
    [49] Gu Y, Grossman R. Sector and Sphere: The Design and Implementation of a High Performance Data Cloud, Theme Issue of the Philosophical Transactions of the Royal Society[J]. Crossing Boundaries: Computational Science, E-Science and Global E-Infrastructure, 2009,367 (1897): 2429-2445.
    [50]周斌,刘亚萍,吴泉源.一个面向电子商务的数据挖掘系统的设计与实现[J].计算机工程, 2000,26 (6): 18-20.
    [51] Luhn H P. A Statistical Approach to Mechanized Encoding and Searching of Literary Information[J]. IBM Journal of Research and Development, 1957,1 (4): 309-317.
    [52] Luhn H P. The Automatic Creation of Literature Abstracts[J]. IBM Journal,presented at IRE National Convention, 1958.
    [53] RIJSBERGEN C J V. Information Retrieval[M]. London: Butterworths,1975.
    [54] FRANCIS W, KUCERA H. Frequency Analysis of English Usage[J]. Journal of English Linguistic, 1982,18 (1): 64-70.
    [55] Fox C. Information retrieval: Data structures and algorithms[M]. Upper Saddle River, New Jersey: Prentice Hall,1992.
    [56] Lo T-W. Automatically Building a stopwords list for an information retrieval system [C]. Proceeding of the 5th Duth-belgian Information Retrieval Workshop.Utrecht, the Neterlands: 2005. 141-148.
    [57] Helena A, Oskari H, Mika K, Inkeri V A. Applying Data Mining Techniques in Text Analysis[J]. 1997,2 4-8.
    [58] Yang Y, Pedersen J O. A Comparative Study on Feature Selection in Text Categorization [C]. Proceedings of ICML-97,14th International Conference on Machine Learning.San Francisco: Morgan Kaufmann Publishers,1997. 412-420.
    [59] Silva C, Ribeiro B. The Importance of Stop Word Removal on Recall Values in Text Categorization[J]. Neural NetWorks, 2003,3 20-24.
    [60]崔彩霞.停用词的选取对文本分类效果的影响研究[J].太原师范学院学报(自然科学版), 2008,7 (4): 3.
    [61] Fox C. A Stop List for General Text [C]. in: ACM-SIGIR Forum.1990. 19-35.
    [62]顾益军,樊孝忠,王义.中文停用词表的自动选取[J].北京理工大学学报, 2005,25 (04): 4.
    [63]周钦强.文本自动分类系统文本预处理方法的研究[J].计算机应用研究, 2005,27 (02): 2.
    [64] Fumani M R F Q, Ramachandra C S. The concept of stopwords in Persian chemistry articles: a discussion in automatic indexing[J]. available at: http//biblotecavirtualut.suagm.edu/Glossa2/journal/dex2008, 2008.
    [65] Seki K, Mostafa J. An application of text categorization methods to gene ontology annotation [C]. Proc. of the 28th annual international ACM SIGIR conference on Research and development in information retrieval.New York, USA: 2005.
    [66] Makrehchi M, Kamel M S. Automatic Extraction of domain-specific stopwords from labeled documents[J]. Advances in Information Retrieval, 2008,4956/2008 222-233.
    [67] Cover T M, Thomas J A. Elements of Information Theory[M]. John Wiley: 1991.
    [68] Kernighan B W, Ritchie D M. The C Programming Language[M]. Prentice-Hall: 1988.
    [69] Imdict-chinese-analyzer, http://code.google.com/p/imdict-chinese-analyzer/.
    [70] Paoding Analysis, http://code.google.com/p/paoding/.
    [71] mmseg4j, http://code.google.com/p/mmseg4j/.
    [72] IKAnalyzer, http://code.google.com/p/ik-analyzer/.
    [73]单松巍,冯是聪等.几种典型特征选取方法在中文网页分类上的效果比较[J].计算机工程与应用, 2003,39 (22).
    [74] Steinbach M, Karypis G, Kumara V. A Comparison of Document Clustering Techniques [C]. KDD-2000 Workshop On Text Mining.Boston MA USA: 2000.8. 109-110.
    [75]范明,孟小峰等译.数据挖掘概念与技术[M].北京:机械工业出版社,2001.
    [76] Zhang T, Ramakrishnan R, Livny M. BIRCH:An efficient data clustering method for very large databases [C]. 1996 ACM?SIGMOD Int.Con. Management of data(SIGMOD'96).103-114.
    [77] Guha S, Rastogi R, Shim K. ROCK: A Robust Clustering Algorithm for Categorical Attributes [C]. Proc. Of the 15th intl. Conf. on Data Engineering.1999. 512-521.
    [78] Karypis G, Han E H, Kumar v. CHAMELEON:A hierarchical clusteringalgorithm using dynamic modeling[J]. Computer Systems and Applications, 1999,32 (8): 68-75.
    [79] Ankerst M, Breunig M, Kriegel H P, Sander J. OPTICS:Ordering points to identify the clustering structure [C]. Proc.1999 ACM?SIGMOD Int.Conf.Management of data(SIGMOD’99).1999. 49-60.
    [80] Hinneburg A, Keim D A. An Efficient Approach to Clustering in Large Multimedia Databases with noise [C]. Proc. Of the 4th Intl. Conf. on Knowledge Discovery and Data Mining.New York City: 1998. 58-65.
    [81] Wang W, Yang J, Muntz R. STING+:an approach to active spatial data mining [C]. 15th Intemational Conference on Data Engineering.1999. 116-125.
    [82] Sheikholeslami G, Chaaeoee S, Zhang A. WaveCluster:A multi-resolution clustering approach for very large spatial database [C]. Proc. 1998 Intl. Conf. Very Large Data Bases(VLDB’98).428-439.
    [83] Agrawal R, Gehrke J, Gtmopulos D, etal. Automatic subspace clustering of high dimensional data for data mining applications [C]. Proc.1998 ACM-SIGMOD Intl. Conf. Management of Data(SIGMOD'98).94-105.
    [84] Shavlik J W, Dietterich T G. Readings in Machine Learning[M]. San Mateo,CA: Morgan Kaufmann,1990.
    [85] Ertoz L, Steinbach M, Kumar V. A New Shared Nearest Neighbor Clustering Algorithm and its Application In Workshop on Clustering High Dimensional Data and its Applications [C]. Proc. Of Text Mine'01, First SIAM Intl. Conf. on Data Mining.Chicago, USA: 2001.
    [86] Jarvis R A, Patrick E A. Clustering Using a Similarity Measure Based on Shared Nearest Neighbors[J]. IEEE Transaction on Computer, 1973,22 (11): 1025-1034.
    [87] Gowda K C, Krishna G. Agglomerative Clustering Using the Concept of Mutual Nearest Neighborhood[J]. Pattern Recognition, 1978,10 (2): 105-112.
    [88] ertoz L, Steinbach M, Kumar V. Finding Clusters of Different Sizes, Shapes, and Densities in Noisy, High Dimensional Data [C]. Proc. Of the 2003 SIAM Intl. Conf. on Data Mining.San Francisco: 2003.
    [89] Steinbach M, Tan P N, Kumar V, Klooster S, et al. Discovery of climate indices using clustering [C]. Proc. of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining.New York, NY,USA: 2003. 446-455.
    [90]范明,范宏建等译.数据挖掘导论[M].北京:人民邮电出版社,2006.

© 2004-2018 中国地质图书馆版权所有 京ICP备05064691号 京公网安备11010802017129号

地址:北京市海淀区学院路29号 邮编:100083

电话:办公室:(+86 10)66554848;文献借阅、咨询服务、科技查新:66554700