用户名: 密码: 验证码:
海量二维表数据的排序问题研究
详细信息    本馆镜像全文|  推荐本文 |  |   获取CNKI官网全文
摘要
信息时代带给人们的影响是震撼的,数据作为信息最主要的表现形式之一,无论在广度还是深度上都已经深刻渗透到我们的生活中,研究表明,全球数据量以年均80%的速度持续增长,数据时代已经悄然来到。现在,海量数据存储已经有了较好的解决方案并已实现,而且网络存储正在走向平民化,SUN,惠普,博科等公司都提供完善的策略,高性能的设备支持大容量存储。但是,面对爆炸性增长的数据,如何沙里淘金呢?这一问题,即海量数据的搜索和处理正在研究阶段。
     数据挖掘,分布式计算,云计算等技术都是基于海量数据处理这一大背景的。数据挖掘的发展将数据坟墓转换成知识金块,并广泛应用于金融、零售业、电信、科学探索等;云计算概念的提出无疑具有时代意义,它将丰厚的资源组织起来提供强大的计算能力和服务。若要发挥他们的魔力,高性能算法无疑是重要的技术基础,但是一些传统技术在庞大的数据规模下,性能并不能令人满意,如何设计适用海量数据的算法是至关重要的。
     现在,设备的价格随着硬件的发展大幅下降,空间不再是瓶颈,时间效率日益成为最关注的焦点。在海量数据处理中,面向离散数据的二维表排序是一个基础操作,若先对二维表排序,将大大提高后续数据处理效率,而且许多复杂问题的解决最终都归为二维表排序,二维表排序在数据挖掘、机器学习、数据库、粗糙集等领域均有广泛应用。本文在深入分析了现有二维表排序算法后,针对海量数据对算法效率要求高这一问题,将二维表快速排序算法进行改进,并在空间换时间的思想下提出基于Hash的二维表排序算法,最后将其扩展到云计算模型下的海量二维表数据处理。
     该算法深化了粗糙集的等价类划分思想,将有序等价类推广到二维表排序上,并利用划分块之间的独立性实现云模型下的并行计算,随着划分的层层细化,并行度增加,大大提高了效率,在海量数据集上的优势尤为明显。本文设计的基于云计算模型的二维表排序方法将数据分割、各节点计算任务的分派及排序流程中的有序等价类划分结合在一起,减少了工作开销,此模型查询数据的效率非常高,可用于云平台下以查询为主的高性能应用程序。
The impact which the information age has brought to our life is shocking. Data, as one of the most important media of the information, has deeply penetrated into our life both in breadth and depth. According to some researches, the amount of the global data has been growing at an average rate of 80% annually, which means that the information age has come. Nowadays, the storage of massive data has been well solved and achieved, and network storage has been widely used by civilians. SUN, HP, Brocade and other companies have all developed the high-performance massive data storage device with comprehensive strategic. However, how can we get what we want among the massive data, which has been explosively grown. This problem is the topic of massive data searching and processing, which is on the stage of research, nowadays.
     Data mining, knowledge discovery, cloud computing, are all based on massive data processing. The data mining, which has been widely used in many fields such as finance, retail, telecommunication, scientific exploration and so on, transfers data from data grave to knowledge nuggets. The introduction of cloud computing is undoubtedly significance, which provides powerful computing and service, based on a great deal of resource. To play their magic, high-performance algorithm is a basis technology. But, the traditional method cannot do well in dealing with large-scale data. So a proper algorithm for massive data is a key to these technologies.
     Nowadays, with the development of hardware technology, the prices of the equipment have significantly dropped, space is no longer a bottleneck, and time efficiency is increasingly becoming the focus. This paper is based on the idea of time for space, we get down to the basic question of ranking and had in-depth research in it. In the massive data processing, sort for two-dimension table based on discrete data is a basic operation. If we first sort for two-dimension table, the efficiency will be greatly enhanced in the follow-up operation, many complex problems are appealing to sort for two-dimension table, Sort for two-dimension table is widely used in data mining, machine learning, databases, rough sets and other fields. In this paper we first had in-depth analysis of quick sorting algorithm of two-dimension table. Then, for the high efficiency requirement of the algorithm in massive data processing, we improve the quick sort algorithm, put forward a Hash sort algorithm on two-dimension table, and finally extends it to massive two-dimension table under cloud computing model.
     This algorithm is deepened by equivalence classes of rough set ideology, extended ordered equivalence classes to sort for two-dimension table, and realize parallel computing in the background of cloud model with the independence between blocks, with the increasing of parallel degrees, efficiency is greatly improving, the advantage in huge data sets is obvious. The approach in the paper under cloud computing model, combined data partitioning, computing task arrangement of each node with orderly division of equivalence classes together, which reducing the cost of work, this model is efficient in query, which can be used for high performance applications under cloud computing.
引文
[1]Paul Marks. Massive science experiments pose data storage problems[J].The New Scientist,2007(196):28-29.
    [2]汤劲.海量数据处理系统框架关键技术研究[D].南京航空航天大学硕士学位论文,2006.
    [3]J. Vega, A. Murari, A. Pereira, A. Portas, G.A. Ratta, R. Castro and JET-EFDA Contributors. Overview of intelligent data retrieval methods for waveforms and images in massive fusion databases[J]. Fusion Engineering and Design,2009(84): 1916-1919.
    [4]Orly Kalfus, Boaz Ronen, Israel Spiegler. A selective data retention approach in massive databases[J]. Omega,2004(32):87-95.
    [5]M. K. Ng, Z. Huang.Data-mining massive time series astronomical data:challenges, problems and solutions[J]. Information and Software Technology,1999(41): 545-556.
    [6]王加阳.面向海量数据的粗糙集理论与方法研究[D].中南大学博士学位论文,2005,11.
    [7]Massimo Brescia, Giuseppe Longo, Fabio Pasia. Mining knowledge in astrophysical massive data sets[M]. Nuclear Instruments and Methods in Physics Research Section A:Accelerators, Spectrometers, Detectors and Associated Equipment, 2010(623):845-849.
    [8]Tsai-Hung Fan, Dennis K. J. Lin, Kuang-Fu Cheng. Regression analysis for massive datasets[M]. Data & Knowledge Engineering,2007(61):554-562.
    [9]刘清.Rough集及Rough推理[M].北京:科学出版社,2001.
    [10]Dongarra. The Top 10 Algorithms[C]. IEEE Computing in Science & Engineering, 2000,2(1):22-23.
    [11]张和生,张毅,胡东成.海量数据管理框架与方法研究[J].计算机工程与应用,2004(11):26-31.
    [12]陈全,邓倩妮.云计算及其关键技术[J].计算机应用.2009(29):2562-2567.
    [13]Donald Ervin Knuth. The art of computer programming:Sorting and searching [M]. Pearson Education, Inc,2007.
    [14]George H. Mealy. Data structure:theory and representation [M]. Center for Research in Computing Technology, Harvard University,1975.
    [15]Feng Hu, Guoyin Wang, Lin Feng. Fast knowledge reduction algorithms based on quick sort[C].Proceedings of the 3rd international conference on Rough sets and knowledge technology,2008.
    [16]Graefe, G, Linville, A., Shapiro, L. D., Sort vs. hash revisited[J]. Knowledge and Data Engineering, IEEE Transactions on 1994 (6):934-944.
    [17]Kanada, Y. A vectorization technique of hashing and its application to several sorting algorithms[C].Databases, Parallel Architectures and Their Applications,. PARBASE-90, International Conference on.1990, Page(s):147-151.
    [18]Fenghui Zhang, Anxiao Jiang, Jianer Chen. Sorting Based Data Centric Storage[C]. Network Computing and Applications,2008. NCA '08. Seventh IEEE International Symposium on.2008 (16):283-286.
    [19]吴江,张德同.二次分“档”链接排序算法分析[J].计算机研究与发展.2001(08):15-17.
    [20]Pawlak Z. Rough set[J]. International Journal of Computer and Information Sciences, 1982,11(5):341-356.
    [21]王国胤Rough集理论与知识获取[M].西安:西安交通大学出版社,2001.
    [22]胡可云,陆玉昌,石纯一.粗糙集理论及其应用进展[J].清华大学学报,2001,41(1):64-68.
    [23]Nguyen S H, Skowron A, Synakp, et al. Knowledge discovery in databases:Rough set approach[C]//Proceedings of IFSA'97, Prague, Czech Republic,1977:205-209.
    [24]覃政仁,吴渝,王国胤.一种基于Rough set的海量数据分割算法[J].模式识别与人工智能,2006,19(2):249-256.
    [25]HU F, WANG G Y, HUANG H, et al. Incremental Attribute Reduction Based on Elementary Sets [C]//Proceedings of RSFDGrC'2005, Part 1, LNA I3641. Berlin, ALLEMAGNE:Sp -ringer,2005:185-194.
    [25]胡峰,代劲,王国胤.一种决策表增量属性约简算法[J].控制与决策,2007,22(3):268-272.
    [26]ZHENG Z, WANG G Y. RR IA:A Rough Set and Rule Tree Based Incremental Knowledge Acquisition Algorithm[J]. Fundamenta Informaticae,2004,59 (323): 299-314.
    [27]王利,王国胤,吴渝,基于可变精度粗集模型的增量式规则获取算法[J].重庆邮电大学学报(自然科学版),2005,17(6):709-713.
    [28]Q IN Z R, WANG G Y, WU Y, et al. A Scalable Rough Set Knowledge Reduction Algorithm [C]//LNA I3066.Berlin. ALLEMAGNE:Springer,2004:445-455.
    [29]MURA I T, NAKATA M, SATO Y. A Note on Filtrationand Granular Reasoning [C] //TERANO T. New Fron2tiers in Artificial Intelligence, LNA I 2235. Berlin:Sp ringer,2001:385-389.
    [30]YAO J T, YAO Y Y. Induction of Classification Rules by Granular Computing [C] //Proceedings of RSCTC2002,LNA 12475. Berlin:Sp ringer,2002:10-13.
    [31]ZADEH L A. Fuzzy Logic =Computing with Words [J].IEEE Transactions on Fuzzy System,1996,4(1):104-111.
    [32]ZADEH L A. Some Reflections on Soft Computing, Granular and Their Roles in the Concep- tion, Design and Utilization of Information/Intelligence Systems [J]. Soft Computing,1998,2(1):24-25.
    [33]Tevfik Kosar, Miron Livny. A framework for reliable and efficient data placement in distributed computing systems[J]. Journal of Parallel and Distributed Computing, 2005(65):1146-1157.
    [34]Clemens H. Cap, Volker Strumpen. Efficient parallel computing in distributed workstation environments [J].Parallel Computing,1993(19):1221-1234.
    [35]K. Mani Chandy, Joseph Kiniry, Adam Rifkin, Daniel Zimmerman. A framework for structured distributed object computing[J].Parallel Computing,1998(24):1901-1922.
    [36]Ping Luo, Kevin Lu, Zhongzhi Shi, Qing He. Distributed data mining in grid computing environments [J].Future Generation Computer Systems,2007(23):84-91.
    [37]李军华.云计算及若干数据挖掘算法的MapReduce化研究[M].电子科技大学硕士学位论文,2010,5.
    [38]钟伟彬,周梁月,潘军彪,文锦军.云计算终端的现状和发展趋势[J].2010(3):22-28.
    [39]刘越.云计算综述与移动云计算的应用研究[J].研究与开发.2010(4):14-20.
    [38]Gul A. Agha, Wooyoung Kim. Actors:A unifying model for parallel and distributed computing[J]. Journal of Systems Architecture,1999(45):1263-1277.
    [39]Chen Wang, Yong Meng Teo. Supporting parallel computing on a distributed object architecture [J]. Journal of Systems and Software,2001(56):261-278.
    [40]李莉,廖剑伟,欧灵.云计算初探[J].计算机应用研究,2010(27):4419-4426.
    [41]曹邑芬.分布式存储平台的设计与实现[J].科技博览,2009(34):117-118.
    [42]余利华.分布式数据存储和处理的若干技术研究[D].浙江大学博士学位论文 2008,6.
    [43]Rajkumar Buyya, Chee Shin Yeo, Srikumar Venugopal, James Broberg, Ivona Brandic. Cloud computing and emerging IT platforms:Vision, hype, and reality for delivering computing as the 5th utility[J].Future Generation Computer Systems,2009(25):599-616.
    [44]Terrence V. Lillard, Clint P. Garrison, Craig A. Schiller, James Steele. The Future of Cloud Computing[J]. Digital Forensics for Network, Internet, and Cloud Computing,2010:319-339.
    [45]Catherine Everett. Cloud computing- A question of trust[J]. Computer Fraud & Security,2009(6):5-7.
    [46]Gregory Chockler, Eliezer Dekel, Joseph JaJa, Jimmy Lin. Special Issue of the Journal of Parallel and Distributed Computing:Cloud Computing[J] Journal of Parallel and Distributed Computing,2009(69):813.
    [47]Lluis Pamies-Juarez, Pedro Garcia-Lopez, Marc Sanchez-Artigas, Blas Herrera. Towards the design of optimal data redundancy schemes for heterogeneous cloud storage infrastructures [J].Computer Networks,2011(55):1100-1113.
    [48]S. Subashini, V. Kavitha. A survey on security issues in service delivery models of cloud computing[J]. Journal of Network and Computer Applications,2011(34): 1-11.
    [49]Idziorek, J. Discrete event simulation model for analysis of horizontal scaling in the cloud computing model [C]. Winter Simulation Conference (WSC), Proceedings of the 2010.2010:3004-3014.
    [50]Shuai Zhang, Shufen Zhang, Xuebin Chen, Xiuzhen Huo. Cloud Computing Research and Development Trend[C]. Future Networks,2010. ICFN'10. Second International Conference on.2010:93 - 97.
    [51]Junjie Peng, Xuejun Zhang, Zhou Lei, Bofeng Zhang, Wu Zhang, Qing Li. Comparison of Several Cloud Computing Platforms [C]. Information Science and Engineering (ISISE),2009 Second International Symposium on,2009.
    [52]Jiyi Wu, Lingdi Ping, Xiaoping Ge, Ya Wang, Jianqing Fu.Cloud Storage as the Infrastructure of CloudComputing[C]. Intelligent Computing and Cognitive Informatics (ICICCI),2010 International Conference on.2010, Page(s):380-383.
    [53]Xiao-Yong Li, Li-Tao Zhou, Yong Shi, Yu Guo. A trusted computing environment model in cloudarchitecture[C].Machine Learning and Cybernetics (ICMLC),2010 International Conference on.2010(6):2843-2848.
    [54]胡峰,王国胤.二维表快速排序的复杂度分析[J].计算机学报,2007,30(6):963-968.
    [55]刘少辉,盛秋戬,吴斌,史忠植,胡斐.rough集高效算法研究[J].计算机学报,2003,26(5):524-529.
    [56]刘少辉,盛秋戬,史植忠.一种新的快速计算正区域的方法[J].计算机研究与发展,2003,40(5):637-642.
    [57]赵军,王国胤,吴忠福,唐宏,李华,廖晓峰.一种高效的属性核计算方法[J].小型微型计算机系统,2003,24(11):1945-1953.
    [58]霍红卫,许进.快速排序算法研究[J].微电子学与计算机,2002,(6)6-9.
    [59]孙吉贵,刘杰,赵连宇.聚类算法研究[J].软件学报,2008,19(1):48-61.
    [60]William F. Gilreath. Hash sort:a linear time complexity multiple-dimensional sort algorithm originally entitled "making a hash of sort".// WILL@WILLIAMGILREATH. COM
    [61]Ssrtaj Sahni. Data structures, algorithms, and application in C++[M].北京:机械工业出版社,2009.
    [62]刘勇,熊蓉,褚健.Hash'决速属性约简算法[J].计算机学报.2009,32(8):1493-1499.
    [63]万至臻.基于MapReduce模型的并行计算平台的设计与实现[D].浙江大学计算机科学与技术学院,2008.
    [64]Hung-chih Yang, Ali Dasdan, Ruey-Lung Hsiao, D. Stott Parker. Map-reduce-merge: simplified relational data processing on large cluster. [C] SIGMOD'07,2007,6.

© 2004-2018 中国地质图书馆版权所有 京ICP备05064691号 京公网安备11010802017129号

地址:北京市海淀区学院路29号 邮编:100083

电话:办公室:(+86 10)66554848;文献借阅、咨询服务、科技查新:66554700