用户名: 密码: 验证码:
Various improvements to text fingerprinting
详细信息    查看全文
文摘
Let be a text (or sequence) on a finite alphabet ¦² of size ¦Ò. A fingerprint in s is the set of distinct characters appearing in one of its substrings. The problem considered here is to compute the set of all fingerprints of all substrings of s in order to answer efficiently certain questions on this set. A substring is a maximal location for a fingerprint (denoted by ) if the alphabet of is f and , , if defined, are not in f. The set of maximal locations in s is (it is easy to see that ). Two maximal locations and such that are named copies, and the quotient set of according to the copy relation is denoted by .

We first present new exact efficient algorithms and data structures for the following three problems: (1) to compute ; (2) given f as a set of distinct characters in ¦², to answer if f represents a fingerprint in ; (3) given f, to find all maximal locations of f in s. As well as in papers concerning succinct data structures, in the paper all space complexities are counted in bits. Problem 1 is solved either in worst-case time (in this paper all logarithms are intended as base two logarithms) using bits of space, or in randomized expected time using bits of space. Problem 2 is solved either in expected time if only bits of working space for queries is allowed, or in worst-case time if a working space of bits is allowed (with ? a constant satisfying ). These algorithms use a data structure that occupies bits. Problem 3 is solved with the same time complexity as Problem 2, but with the addition of an occ term to each of the complexities, where occ is the number of maximal locations corresponding to the given fingerprint. Our solution of this last problem requires a data structure that occupies bits of memory.

In the second part of our paper we present a novel Monte Carlo approximate construction approach. Problem 1 is thus solved in expected time using bits of space but the algorithm is incorrect with an extremely small probability that can be bounded in advance.

© 2004-2018 中国地质图书馆版权所有 京ICP备05064691号 京公网安备11010802017129号

地址:北京市海淀区学院路29号 邮编:100083

电话:办公室:(+86 10)66554848;文献借阅、咨询服务、科技查新:66554700