用户名: 密码: 验证码:
A Flexible Space-Time Tradeoff on Hybrid Index with Bicriteria Optimization
详细信息    查看全文 | 推荐本文 |
  • 英文篇名:A Flexible Space-Time Tradeoff on Hybrid Index with Bicriteria Optimization
  • 作者:Xingshen ; Song ; Yuexiang ; Yang ; Yu ; Jiang
  • 英文作者:Xingshen Song;Yuexiang Yang;Yu Jiang;School of Advanced Interdisciplinary Studies, National University of Defense Technology;College of Computer, National University of Defense Technology;Northwest Institute of Nuclear Technology;
  • 英文关键词:inverted index;;bicriteria compression;;Lagrangian relaxation
  • 中文刊名:QHDY
  • 英文刊名:清华大学学报自然科学版(英文版)
  • 机构:School of Advanced Interdisciplinary Studies, National University of Defense Technology;College of Computer, National University of Defense Technology;Northwest Institute of Nuclear Technology;
  • 出版日期:2019-01-18
  • 出版单位:Tsinghua Science and Technology
  • 年:2019
  • 期:v.24
  • 基金:the Natural Science Foundation of Hunan Province(No.2016JJ2007)for their financial support
  • 语种:英文;
  • 页:QHDY201901011
  • 页数:17
  • CN:01
  • ISSN:11-3745/N
  • 分类号:108-124
摘要
Inverted indexes are widely adopted in the vast majority of information systems. Growing requirements for efficient query processing have motivated the development of various compression techniques with different spacetime characteristics. Although a single encoder yields a relatively stable point in the space-time tradeoff curve,flexibly transforming its characteristic along the curve to fit different information retrieval tasks can be a better way to prepare the index. Recent research comes out with an idea of integrating different encoders within the same index,namely, exploiting access skewness by compressing frequently accessed regions with faster encoders and rarely accessed regions with succinct encoders, thereby improving the efficiency while minimizing the compressed size.However, these methods are either inefficient or result in coarse granularity. To address these issues, we introduce the concept of bicriteria compression, which aims to formalize the problem of optimally trading the compressed size and query processing time for inverted index. We also adopt a Lagrangian relaxation algorithm to solve this problem by reducing it to a knapsack-type problem, which works in O(n log n)time and O(n)space, with a negligible additive approximation. Furthermore, this algorithm can be extended via dynamic programming pursuing improved query efficiency. We perform an extensive experiment to show that, given a bounded time/space budget, our method can optimally trade one for another with more efficient indexing and query performance.
        Inverted indexes are widely adopted in the vast majority of information systems. Growing requirements for efficient query processing have motivated the development of various compression techniques with different spacetime characteristics. Although a single encoder yields a relatively stable point in the space-time tradeoff curve,flexibly transforming its characteristic along the curve to fit different information retrieval tasks can be a better way to prepare the index. Recent research comes out with an idea of integrating different encoders within the same index,namely, exploiting access skewness by compressing frequently accessed regions with faster encoders and rarely accessed regions with succinct encoders, thereby improving the efficiency while minimizing the compressed size.However, these methods are either inefficient or result in coarse granularity. To address these issues, we introduce the concept of bicriteria compression, which aims to formalize the problem of optimally trading the compressed size and query processing time for inverted index. We also adopt a Lagrangian relaxation algorithm to solve this problem by reducing it to a knapsack-type problem, which works in O(n log n)time and O(n)space, with a negligible additive approximation. Furthermore, this algorithm can be extended via dynamic programming pursuing improved query efficiency. We perform an extensive experiment to show that, given a bounded time/space budget, our method can optimally trade one for another with more efficient indexing and query performance.
引文
[1]M.Catena,C.Macdonald,and I.Ounis,On inverted index compression for search engine efficiency,in Advances in Information Retrieval,M.de Rijke,T.Kenter,A.P.de Vries,C.X.Zhai,F.de Jong,K.Radinsky,and K.Hofmann,eds.Springer,2014,pp.359-371.
    [2]A.Trotman,Compression,SIMD,and postings lists,in Proc.2014 Australasian Document Computing Symp.,Melbourne,Australia,2014,p.50.
    [3]J.Zobel and A.Moffat,Inverted files for text search engines,ACM Comput.Surv.(CSUR),vol.38,no.2,p.6,2006.
    [4]D.Lemire and L.Boytsov,Decoding billions of integers per second through vectorization,Softw.:Pract.Exp.,vol.45,no.1,pp.1-29,2015.
    [5]D.Lemire,L.Boytsov,and N.Kurz,SIMD compression and the intersection of sorted integers,Softw.:Pract.Exp.,vol.46,no.6,pp.723-749,2015.
    [6]W.X.Zhao,X.D.Zhang,D.Lemire,D.D.Shan,J.Y.Nie,H.F.Yan,and J.R.Wen,A general SIMD-based approach to accelerating compression algorithms,ACM Trans.Inf.Syst.(TOIS),vol.33,no.3,p.15,2015.
    [7]D.Lemire,N.Kurz,and C.Rupp,Stream VByte:Faster byte-oriented integer compression,Inf.Process.Lett.,vol.130,pp.1-6,2018.
    [8]G.Ottaviano,N.Tonellotto,and R.Venturini,Optimal space-time tradeoffs for inverted indexes,in Proc.8th ACM Int.Conf.Web Search and Data Mining,Shanghai,China,2015,pp.47-56.
    [9]G.Ottaviano and R.Venturini,Partitioned Elias-Fano indexes,in Proc.37thInt.ACM SIGIR Conf.Research&Development in Information Retrieval,Gold Coast,Australia,2014,pp.273-282.
    [10]M.Petri,A.Moffat,and J.S.Culpepper,Score-safe term-dependency processing with hybrid indexes,in Proc.37thInt.ACM SIGIR Conf.Research&Development in Information Retrieval,Gold Coast,Australia,2014,pp.899-902.
    [11]F.Silvestri and R.Venturini,VSEncoding:Efficient coding and fast decoding of integer lists via dynamic programming,in Proc.19thACM Int.Conf.Information and Knowledge Management,Toronto,Canada,2010,pp.1219-1228.
    [12]A.Moffat and L.Stuiver,Binary interpolative coding for effective index compression,Inf.Retr.,vol.3,no.1,pp.25-47,2000.
    [13]J.Lin,M.Crane,A.Trotman,J.Callan,I.Chattopadhyaya,J.Foley,G.Ingersoll,C.Macdonald,and S.Vigna,Toward reproducible baselines:The open-source IR reproducibility challenge,in Proc.38thEuropean Conf.IR Research,Padua,Italy,2016,pp.408-420.
    [14]H.Yan,S.Ding,and T.Suel,Inverted index compression and query processing with optimized document ordering,in Proc.18thInt.Conf.World Wide Web,Madrid,Spain,2009,p.401.
    [15]A.Farruggia,P.Ferragina,A.Frangioni,and R.Venturini,Bicriteria data compression,in Proc.25thAnn.ACM-SIAM Symp.Discrete Algorithms,Portland,OR,USA,2014,pp.1582-1595.
    [16]K.Mehlhorn and M.Ziegelmann,Resource constrained shortest paths,in Proc.8thAnn.European Symp.Algorithms,Saarbr¨ucken,Germany,2000,pp.326-337.
    [17]G.Y.Handler and I.Zang,A dual algorithm for the constrained shortest path problem,Networks,vol.10,no.4,pp.293-309,1980.
    [18]S.Ding and T.Suel,Faster top-K document retrieval using block-max indexes,in Proc.34thInt.ACM SIGIR Conf.Research and Development in Information Retrieval,Beijing,China,2011,pp.993-1002.
    [19]R.Konow,G.Navarro,C.L.A.Clarke,and A.L′opezOrt′?z,Faster and smaller inverted indices with treaps,in Proc.36thInt.ACM SIGIR Conf.Research and Development in Information Retrieval,Dublin,Ireland,2013,pp.193-202.
    [20]G.Navarro and S.J.Puglisi,Dual-sorted inverted lists,in String Processing and Information Retrieval,E.Chavez and S.Lonardi,eds.Springer,2010,pp.309-321.
    [21]A.A.Stepanov,A.R.Gangolli,D.E.Rose,R.J.Ernst,and P.S.Oberoi,SIMD-based decoding of posting lists,in Proc.20thACM Int.Conf.Information and Knowledge Management,Glasgow,UK,2011,pp.317-326.
    [22]R.Delbru,S.Campinas,K.Samp,and G.Tummarello,Adaptive frame of reference for compressing inverted lists,Technical Report,DERI-Digital Enterprise Research Institute,Ireland,2010.
    [23]J.Goldstein,R.Ramakrishnan,and U.Shaft,Compressing relations and indexes,in Proc.14thInt.Conf.Data Engineering,Orlando,FL,USA,1998,pp.370-379.
    [24]V.N.Anh and A.Moffat,Index compression using fixed binary codewords,in Proc.15thAustralasian Database Conf.-Volume 27,Dunedin,New Zealand,2004,pp.61-67.
    [25]V.N.Anh and A.Moffat,Inverted index compression using word-aligned binary codes,Inf.Retr.,vol.8,no.1,pp.151-166,2005.
    [26]V.N.Anh and A.Moffat,Improved word-aligned binary compression for text indexing,IEEE Trans.Knowl.Data Eng.,vol.18,no.6,pp.857-861,2006.
    [27]V.N.Anh and A.Moffat,Index compression using 64-bit words,Softw.:Pract.Exp.,vol.40,no.2,pp.131-147,2010.
    [28]R.Delbru,S.Campinas,and G.Tummarello,Searching web data:An entity retrieval and high-performance indexing model,Web Semant.,vol.10,pp.33-58,2012.
    [29]M.Zukowski,S.Heman,N.Nes,and P.Boncz,Superscalar RAM-CPU cache compression,in Proc.22ndInt.Conf.Data Engineering,Atlanta,GA,USA,2006,p.59.
    [30]P.Ferragina,I.Nitto,and R.Venturini,On optimally partitioning a text to improve its compression,Algorithmica,vol.61,no.1,pp.51-74,2011.
    [31]X.S.Song,K.Jiang,Y.Jiang,and Y.X.Yang,On optimizing partitioning strategies for faster inverted index compression,in Proc.16thInt.Conf.Computational Science and Its Applications,Beijing,China,2016,pp.246-260.
    [32]E.L.Lawler,Combinatorial Optimization:Networks and Matroids.New York,NY,USA:Courier Corporation,2001.
    [33]G.Pass,A.Chowdhury,and C.Torgeson,A picture of search,in Proc.1stInt.Conf.Scalable Information Systems,Hong Kong,China,2006,p.1.
    [34]Giuseppe Ottaviano,https://github.com/ot/ds2i.git,2017.

© 2004-2018 中国地质图书馆版权所有 京ICP备05064691号 京公网安备11010802017129号

地址:北京市海淀区学院路29号 邮编:100083

电话:办公室:(+86 10)66554848;文献借阅、咨询服务、科技查新:66554700