用户名: 密码: 验证码:
XML数据压缩技术的研究
详细信息    本馆镜像全文|  推荐本文 |  |   获取CNKI官网全文
摘要
作为半结构化数据的典型代表,XML已成为Web上数据表示和交换的标准。但它的自描述和半结构化特性使得XML中存在大量的数据冗余,极大地增加了数据存储、交换和处理的代价,严重阻碍了XML数据库更深入更广泛的应用。因此,对XML数据进行压缩变得十分必要。
     针对生物XML数据中可压缩子结构高频重复出现的特点,我们设计了适于生物XML数据的压缩方法SCSC(Schema based Compressible Substructure Compressing)。首先,我们根据XML Schema提供的丰富结构信息建立XML扩充结构树,并从中提取用于压缩的子结构,包括极大可压缩子结构和完全可压缩子结构。然后解析XML文档,将其分离为结构数据和内容数据,且把内容数据归到不同的内容容器中。最后根据提取的可压缩子结构对结构数据进行压缩,并对不同类型的内容容器采用相应的压缩方法。理论分析和实验结果表明,在生物数据的压缩上,SCSC比已有的方法XMill具有更好的压缩性能。此外,SCSC不仅适合生物XML数据,也可扩展到其他具有高频重复子结构的XML数据。
     针对大多数XML压缩方法不能支持Twig查询的问题,我们提出了一种能够支持Twig查询的XML压缩算法XCTwig(XML Compression supporting Twig)。该方法主要包含两大部分—XML数据压缩方法和压缩XML数据上的Twig查询处理方法。首先,我们给出了XCTwig压缩算法的框架,设计了XML结构树的构建算法以及XML文档的压缩算法。然后,我们给出了压缩XML数据上进行Twig查询的框架,并给出了压缩数据上Twig查询操作算法。XCTwig压缩算法的基本思想是把具有相同路径的内容数据划分到同一组进行存储。在压缩XML数据上进行Twig查询处理时,首先把Twig查询模式分解为多个路径查询表达式,在压缩数据中依次执行并返回匹配的结果,且将先驱路径的中间结果用作当前路径查询中的过滤条件。实验证明,虽然XCTwig的压缩比低于XMill和gzip,却明显高于XGrind,且由于采用了路径存储的独特形式,其上Twig查询操作仍然具有很好的性能。
As the reprenstive format of semi-structured data, XML has become the de facto standard for data representation and exchange over the World Wide Web. However, its self-describing and semi-structured feature introduces the verbosity of XML documents and increased the cost of storing, exchanging and processing, which hinder the wider and deeper use of the XML database. So, it is quite necessary to compress XML data.
     Based on the characteristic of biological XML data that the compressible substructure repeats highly, we designed an algorithm called SCSC (Schema based Compressible Substructure Compressing) to compress biological XML data. First, based on the sufficient information provided by XML Schema, we build an XML extended structure tree, from which the compressible substructure can be extracted, including maximum compressible substructure and complete compressible substructure. Then we parse the XML file separating the structure from the content and map the content to different containers considering with its data type. At last, we compress the structure container together based on the compressible substructure, and compress the data container with different encoding method based on its data type. The theoretical analysis and the experimental results showed that SCSC performs better than XMill when compress biological XML data. SCSC method is not just suitable for biological XML data, but also can be extended to other XML data in which some substructure highly repeated.
     Since most of the compression methods can not support Twig query efficiently, we proposed a compression algorithm XCTwig (XML Compression supporting Twig) which can support Twig query without complete uncompression。SCSC consists of two big parts—XML data compression and Twig query on compressed XML data. Fist, we give the frame of XCTwig and the XML structure tree constructing algorithm. Then we designed the XML compression algorithms and the Twig query algorithm on compressed XML data. The basic idea of XCTwig compression method is mapping the content data to different container according its path from the root of the XML document. When we execute the Twig query on compressed XML data, we first decompose it into several path query expressions, and return match results of each path. In this process, the result of previous path query can be used as the filter condition when we match the current path query. The experimental results showed that in spite of its lower compression ratio compared with XMill and gzip, XCTwig has high compression ratio than XGrind and performance of Twig query on compressed data.
引文
1 H. Liefke, D. Suciu. XMill: an efficient compressor for XML data. Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, May, Dallas, Texas, USA, 2000. New York, ACM press: 153-164
    2 H. Liefke, D. Suciu. An extensible compressor for XML data. SIGMOD Record. 2000, 29(1): 57-62
    3王腾蛟;高军;杨冬青;唐世渭;刘云峰.面向XPath执行的XML数据流压缩方法.软件学报. 2005, 16(5): 869~877
    4 P. Skibiński, Sz. Grabowski, J. Swacha. Effective asymmetric XML compression. Software: Practice and Experience. 2008, 33(10): 121-129
    5 M. Girardot, N. Sundaresan. Millau: an encoding format for efficient representation and exchange of XML documents over the WWW. Computer Networks. 2000, 33(1-6): 747-765
    6 http://www.w3.org/XML/
    7 N.Sundaresan, R. Moussa. Algorithms and programming models for efficient representation of XML for Internet applications. In Proceedings of the Tenth International World Wide Web Conference, Hong Kong, China, May 1-5, 2001. New York, ACM Press, 2001: 366-375
    8 M. Levene, P. Wood. XML structure compression. In Proceedings of the Second International Workshop on Web Dynamics, Hawii, USA, 2002
    9 A. Kinno, H. Yukitomo, T. Nakayama, A. Takeshita. Recursive application of structural templates to efficiently compress parsed XML. In Proceedings of the 5th International Conference on Web Engineering, Sydney, Australia, July, 2005. Berlin, Germany, Springer-Verlag (LNCS 3579): 296-301
    10 J. Adiego, G. Navarro, P. de la Fuente. Lempel-Ziv compression of structured text. In Proceedings of the 2004 IEEE Data Compression Conference, Snow-bird, UT, U.S.A, March 2004. New York, IEEE Computer Society: 112-121
    11 J. Cheney. Compressing XML with multiplexed hierarchical PPM models. In Proceedings of the 2001 IEEE Data Compression Conference, Snow-bird, UT, U.S.A, March 2001. New York, IEEE Computer Society: 163-172
    12 J. Cheney. An Empirical Evaluation of Simple DTD-Conscious Compression Techniques. In Proceedings of the 8th International Workshop on the Web and Databases, Baltimore, Maryland, USA, June, 2005. New York, USA, ACM Press: 43-48
    13 J. Adiego, G. Navarro, P. Fuente. Using structural contexts to compress semistructured text collections. Information processing and management. 2007,43(3): 769~790
    14 J. Adiego, P. Fuente, G. Navarro. Merging prediction by partial matching with structural contexts model. In Proceedings of the 2004 IEEE Data Compression Conference, Snow-bird, UT, USA, March 2004. New York, IEEE Computer Society: 522-525
    15 J. Adiego, P. Fuente, G. Navarro. Combining Structural and Textual Contexts for Compressing Semistructured Databases. In Proceedings of the 6th Mexican International Conference on Computer Science, Puebla, Mexico, Sep., 2005. New York, IEEE Computer Society: 68-73
    16 J. Cheney. Tradeoffs in XML Database Compression. In Proceedings of the 2006 IEEE Data Compression Conference, Snow-bird, UT, USA, March 2006. New York, IEEE Computer Society: 392-401
    17 G. Leighton, J. Diamond, T. Müldner. AXECHOP: a grammar-based compressor for XML. In Proceedings of the 2005 IEEE Data Compression Conference, Snow-bird, UT, USA, March 2005. New York, IEEE Computer Society: 467.
    18 M. Kalman, F. Havasi, T. Gyimothy. Compacting XML documents. Information and Software Technology. 2006, 48(2): 90~106
    19 C. League, K. Eng. Type-Based Compression of XML Data. In Proceedings of the 2007 IEEE Data Compression Conference, Snow-bird, UT, USA, March 2007. New York, IEEE Computer Society: 273-282
    20 C. League, K. Eng. Schema-based compression of XML data with Relax NG.. Journal of Computers, 2007, 2(10): 9~17
    21 S. Hariharan, P. Shankar. Evaluating the Role of Context in Syntax Directed Compression of XML Documents. In Proceedings of the 2006 IEEE Data Compression Conference, Snow-bird, UT, USA, March 2006. New York, IEEE Computer Society: 453
    22 S. Hariharan, P. Shankar. Compressing XML documents with finite state automata. In Proceedings of the 10th International Conference on Implementation and Application of Automata, Sophia Antipolis, France, June, 2005. Berlin, Germany, Springer-Verlag: 285-296
    23 S. Harrusi, A. Averbuch, A. Yehudai. XML Syntax Conscious Compression. In Proceedings of the 2006 IEEE Data Compression Conference, Snow-bird, UT, USA, March 2006. New York, IEEE Computer Society: 402-411
    24 V. Toman. Compression of XML data. Master's thesis, Prague, Czech Republic ,Charles University, 2003
    25 C. J. Augeri, B.E. Mullins, L. C. Baird, et al. An Analysis of XML Compression Efficiency. In Proceedings of the 2007 Workshop on Experimental Computer Science, San Diego, California, USA, Jun., 2007. New York, ACM Press: 73~84
    26包小源,唐世渭,吴泠,杨冬青,宋再生,王腾蛟. ArithRegion—一种压缩XML的索引结构.北京大学学报(自然科学版). 2006, 42(1):103~109
    27金彦钟,包小源,宋再生. ArithBi+—一种基于反向算术压缩的XML索引结构.计算机科学. 2005, 32(11):119~123
    28包小源,唐世渭,杨冬青. Interval+—一种基于区间树的压缩XML索引结构.计算机研究与发展. 2006, 43(7): 1285~1290
    29 P. Tolani, J. Haritsa. XGRIND: a query-friendly XML compressor [C]. In Proceedings of the 18th International Conference on Data Engineering, San Jose, CA, March, 2002. New York, IEEE Compter Society: 225-234
    30 J. Min, M. Park, C. Chung. A compressor for effective archiving, retrieval, and update of XML documents. ACM Transactions on Internet Technology. 2006, 6(3): 223-258
    31 J. Min, M. Park, C. Chung. XPRESS: a queriable compression for XML data [C]. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, San Diego, California, USA, JUN., 2003. New York, ACM Press: 122-133
    32 A. Arion, A. Bonifati, I. Manolescu, A. Pugliese. XQueC: A Query-Conscious Compressed XML Database. Transactions on Internet Technology, ACM press, 2007, 7(2):10~45
    33 A. Arion, A. Bonifati, G. Costa, S. D'Aguanno, I. Manolescu, A. Pugliese. Xquec: pushing queries to compressed XML data. In Proceedings of 29rd International Conference on Very Large Data Bases, Berlin, Germany, 2003. San Fransisco, CA, Morgan Kaufmann publisher Inc.: 1065~1068
    34 P. Buneman, M. Grohe, C. Koch. Path queries on compressed XML. In Proceedings of 29rd International Conference on Very Large Data Bases, Berlin, Germany, 2003. San Fransisco, CA, Morgan Kaufmann publisher Inc.:141~152
    35 H. Wang, J. Li, J. Luo, and Z. He. XCpaqs: compression of XML document with XPath query support. In Proceedings of the 2004 International Conference on Information Technology: Coding and Computing, Las Vegas, Nevada, USA, April, 2004. New York, IEEE Computer Society:354-358
    36 M. Lohrey, S. Maneth. Tree Automata and XPath on Compressed Trees. In Proceedings of the 10th International Conference on Implementation and Application of Automata, Sophia Antipolis, France, June, 2005. Berlin, Germany, Springer-Verlag : 227-240
    37 P. Ferragina, F. Luccio, G. Manzini, S. Muthukrishnan. Compressing and Searching XML Data Via Two Zips. In Proceedings of the 15th International World Wide Web Conference, Edinburgh, Scotland, UK, May, 2006. New York, USA, ACM Press:751-760
    38 W. Ng, W.-Y. Lam, P.T. Wood, M. Levene. XCQ: A Queriable XML Compression System. Knowledge and Information Systems. 2006, 10(4): 421-452
    39 W. Lam, W. Ng, P. Wood, M. Levene. XCQ: XML compression and querying system. In Proceedings of the 12th International World Wide Web Conference, Budapest, Hungary, May 2003: 117-125
    40 J. He, W. Ng, X. Wang, A. Zhou. An Efficient Co-operative Framework for Multi-query Processing over Compressed XML Data. In Proceedings of the 11th International Conference on Database Systems for Advanced Applications, Singapore, April, 2006. Berlin, Germany, Springer-Verlag: 218-232
    41 J. Cheng, W. Ng. XQzip: querying compressed XML using structural indexing [C]. In: Elisa Bertino, S. Christodoulakis, D. Plexousakis, et al. (eds.) Proceedings of the 9th International Conference on Extending Database Technology, Heraklion, Greece, March, 2004. Berlin, Germany, Springer-Verlag: 219-236
    42 P. Skibiński, J. Swacha. Combining Efficient XML Compression with Query Processing. In Proceedings of the 11th East-European Conference on Advances in Databases and Information Systems, Varna, Bulgaria, Oct., 2007. Berlin, Germany, Springer-Verlag: 330-342
    43 B. Choi. Document Decomposition for XML Compression: A Heuristic Approach. In Proceedings of the 11th International Conference on Database Systems for Advanced Applications, Singapore, April, 2006. Berlin, Germany, Springer-Verlag: 202-217
    44 G. Leighton, T. Müldner, J. Diamond. TREECHOP: a tree-based query-able compressor for XML. In Proceedings of the 9th Canadian Workshop on Information Theory, Montreal, QC, Canada, June, 2005. New York, IEEE Computer Society: 115-118
    45罗震霄,和菊珍,王晓玲,艾丽君,周傲英. XML压缩文档上的数值更新方法[J].计算机科学. 2007, 34(4):106~108
    46 W. Ng, W.-Y. Lam, J. Cheng. Comparative Analysis of XML Compression Technologies. World Wide Web. 2006, 9(1): 5-33
    47 M. Cannataro, C. Comito, A. Pugliese. SqueezeX: synthesis and compression of XML data. In Proceedings of the International Conference on Information Technology: Coding and Computing, Las Vegas, Nevada, USA, April, 2002. Berlin, Germany, Springer: 326~331
    48 T. Müldner, G. Leighton, J. Diamond. Using XML compression for WWW communication. In Proceedings of the IADIS WWW/Internet Conference, Lisbon, Portugal, Oct., 2005: 459-466
    49 U. Niedermeier, J. Heuer, A. Hutter, W. Stechele, A. Kaup. An MPEG-7 tool forcompression and streaming of XML data. In Proceedings of the 2002 IEEE International Conference on Multimedia and Expo, Lausanne, Switerland, Aug. 2002. New York, IEEE Computer Society: 521-524
    50 U. Niedermeier, J. Heuer, A. Hutter, W. Stechele. MPEG-7 binary format for XML data. In Proceedings of the 2002 IEEE Data Compression Conference, Snow-bird, UT, U.S.A, April 2002. New York, IEEE Computer Society: 467-472
    51钟世明,邵锐,张胜,朱才连.基于位置服务系统中XML数据流压缩方法.武汉理工大学学报(交通科学与工程版). 2006, No.1:13-19
    52王丹琛.物流信息系统中XML数据压缩与传输安全性研究.西南交通大学硕士学位论文. 2007:1-60
    53 H. Wang, J. Li, Z. He, and J. Luo. Web Information Integration Based on Compressed XML. In Proceedings of the 3th International Workshop Databases in Networked Information Systems, Aizu, Japan, Sep., 2003. Berlin, Germany, Springer-Verlag :122-137
    54 Y. Natchetoi, H. Wu, G. Babin. A Context-Dependent XML Compression Approach to Enable Business Applications on Mobile Devices. In Proceedings of the 13th International Conference on Parallel Processing, Rennes, France, Aug., 2007. Berlin, Germany, Springer-Verlag :911-920
    55 Y. Natchetoi, H. Wu, G. Babin, S. Dagtas. EXEM: Efficient XML data exchange management for mobile applications. Information Systems Frontiers, Springer-Verlag ,2007, 9(4): 439-448
    56 B. Demmings, T. Müldner, G. Leighton, A. Young. Using XML compression to increase efficiency of P2P messaging in JXTA-based environments. In Proceedings of Extreme Markup Languages, Montréal, Canada, Aug., 2007.
    57 C. Zhang, J. Naughton, D. Dewitt, Q. Luo, G. Lohman. On supporting containment queries in relational database management systems. In Proceedings of Proceedings of the 2001 ACM SIGMOD international conference on Management of data, 2001. New York, ACM Press: 425-436
    58 S. Al-Khalifa, H. V. Jagadish, N. Koudas, J. M. Patel, D. Srivastava, Y. Wu. Structural joins: A primitive for efficient XML query pattern matching. In Proceedings of the IEEE International Conference on Data Engineering, 2002. New York, IEEE Computer Society:141-152
    59 B Nicolas, K Nick, S Divesh. Holistic Twig Joins: Optimal XML Pattern Matching. Proceedings of the 2002 ACM SIGMOD international conference on Management of data, 2002. New York, ACM Press: 122-133
    60 H. F. Jiang, W. Wang, H. J. Lu et al. Holistic Twig Joins on Indexed XML Documents. In Proceedings of 29th International Conference on Very Large DataBases, Berlin, Germany, 2003. San Fransisco, CA, Morgan Kaufmann publisher Inc.: 273--284
    61 B. Choi, M. Mahoui et al. On the Optimality of Holistic Algorithms for Twig Queries. Lecture notes in computer science, 2003.
    62 T. Chen, J. Lu, and T. Ling. On boosting holism in XML twig pattern matching using structural indexing techniques. Proceedings of the ACM SIGMOD International Conference on Management of Data, 2005. New York, ACM press: 455-466
    63 J. Lu, T. Chen, T.wang. TJFast: effective processing of XML twig pattern matching. In proccedings of International World Wide Web Conference, 2005. New York, ACM press: 1118-1119
    64 S. Chen, H. Li, J. Tatemura1, W. Hsiung, D. Agrawal, K. Selcuk Candan. Twig2Stack: Bottom-Up Processing of Generalized Tree Pattern Queries over XML Documents. In Proceedings of 32nd International Conference on Very Large Data Bases, 2006. San Fransisco, CA, Morgan Kaufmann publisher Inc.: 283-294
    65 http://www.gzip.org/

© 2004-2018 中国地质图书馆版权所有 京ICP备05064691号 京公网安备11010802017129号

地址:北京市海淀区学院路29号 邮编:100083

电话:办公室:(+86 10)66554848;文献借阅、咨询服务、科技查新:66554700