
Notes on the Transclusion Finder

The html version of this document badly botches the special symbols. The .pdf or .ps versions are good in this regard. The text with the formulae may be of interest to some, however.

The main idea is to define a "landmark": a property of a string of bits that is suitably rare. Each file is scanned for landmarks, and a checksum of the data around each landmark is written to a file along with the address where it was found. The file is then sorted by checksum, and landmark sites with the same checksum are compared by reading the disk. The main difficulty is finding a suitable landmark definition.

I invented a stream hash function that is fairly efficient to compute. It amounts to a convolution of the bit stream of the file with a string of about 500 bits. I look for strings of about 11 ones in the hash and call such a string a landmark. Bits surrounding this string of ones are used as the checksum. This has the convenient property that if two file portions are alike except for different offsets, the coincidence will be found even if the offset difference is not a multiple of one byte. This scheme is likely to find coincidences of more than 256 characters, as such a stream is likely to contain a landmark. Shorter coincidences are less likely to be found. One piece of hair is the removal of multiple reports resulting from multiple landmarks in one coincidence.

Local Stationary Bit Stream Functions

This is the theory of a string hash function. It is a function on a long string of bits. Its value is a few hundred bits longer than its argument. It is local, stationary and fairly fast to compute.

We use "string" here to refer to arrays of bits indexed by the integers. If b is such a string then b_j is bit j of that string. The strings are infinitely long in both directions, but each string is zero outside some finite range. Both the argument and the value of the string hash are strings. b << n is the string b shifted n to the left.
(b << n)_i = b_{i-n} for any integers i and n. Note that we picture bits with a larger index on the left! e.g. b_2 b_1 b_0 b_{-1} b_{-2} ... (this is the classic order for polynomial coefficients). We also write "b(x)" to represent the polynomial Σ b_i x^i over the field GF(2).

A string function s is stationary iff s(b<<1) = s(b)<<1 for all b. In this case s(b<<n) = s(b)<<n for all b and n.

A function s is local when there are integers a and b such that a < b and s(b)_i is determined by b_j only when i+a < j < i+b. This means that such a function depends only on bits in a neighborhood. More precisely, L is local if there are integers a and b with a < b such that for any strings s and t and any i: if s_j = t_j for all j with i+a < j < i+b, then L(s)_i = L(t)_i.

We want a stationary function so that properties of a string that are preserved under shifting will remain visible in the yield of S if they were visible before. We want S to be local so that equality of substrings of modest length will lead to equality in the yield of S. We will look for "landmarks" in the yield, such as strings of ones. No particular pattern can be presumed to have a useful density in the original data (zero bytes, for instance, are absent from ASCII text), but such patterns have known densities in the "randomized" data that comes from S.

Our particular string hash function is based on a polynomial h over GF(2) of degree P ≅ 50. We identify a bit string b with the polynomial b(x) = Σ b_i x^i, summed over non-negative i. When polynomials are used to checksum transmission blocks, a "checksum" is computed and transmitted that would have caused the block to be a multiple of the polynomial had it been exclusive-ored into the right end of the block. Such a checksum is a function of all of the bits in the block and is thus not local. Our string hash function S is a convolution function: S(b)_i = Σ_j b_{i-j} p_j.
(In this context Σ denotes the one-bit exclusive-or: ⊕.)

If h(x) = x^64 + x^61 + 1 then p may be defined recursively by p_j = 0 for j < 0 or j ≥ B, p_0 = 1, and p_j = p_{j-64} ⊕ p_{j-61} for 0 < j < B. (B is some constant ≅ 500.) Alternatively p may be defined by the relations x^B + r(x) = p(x)h(x) and deg(r) < deg(h); r(x) is the remainder after dividing x^B by h(x), and p(x) = (x^B + r(x))/h(x). It may be seen that S is stationary, and local with range a = -B, b = 0.

The "polynomial" associated with p for infinite B is the reciprocal of h, in the sense that the terms of the product of p and h are the coefficients of the polynomial 1, despite the fact that there are infinitely many terms in p. This is like 1/(1-x) = 1 + x + x^2 + x^3 + .... You may have seen the old fallacious proof that 1+2+4+8+... = -1 (set x = 2 above). That proof is valid in this funny world.

The desired convolution is m(x)p(x), where the m_i are the bits of the argument to S. If there are N bits in the argument then the first bit is m_{N-1} and the last bit is m_0. We will process the bits from first to last.

m(x)p(x) = m(x)[(x^B + r(x))/h(x)] = [m(x)(x^B + r(x))]/h(x)

We first describe a conceptual method; then we show how to remove the factor of B from its cost and how to do it several bits at a time instead of one.

The conceptual method of computation proceeds as follows. The string argument is delivered one bit at a time, largest index first. The algorithm does not know the value of that index. S being stationary, we can compute the answer as soon as the data begin to arrive. By the first and last bits of the string argument we refer to some range of the infinite input outside of which all bits are zero. We buy three left-infinite accumulators m, s and q. m accumulates the argument, s will accumulate m(x)(x^B + r(x)), and q will accumulate the quotient [m(x)(x^B + r(x))]/h(x). We prepare the initial answer as if the input string were empty (all zero) by setting m, s and q to all zeros.
Whenever a bit of m arrives:
1) Shift m, s and q each to the left one bit and deposit the new bit in m_0.
2) If the new bit is 1, exclusive-or x^B + r(x) into s.
3) Set q_0 to s_B.
4) If s_B = 1, exclusive-or h(x)x^(B-P) into s. (This turns s_B off.)

When an oracle tells us that there are no new bits, we supply B-P zeros, after which s will be all zeros. Note that step 4 above keeps s_j = 0 for j > B. Supplying zeros continues to increase the number of low zero bits in s.

The loop invariant for the above is: s(x) = (x^B + r(x))m(x) - q(x)h(x)x^(B-P), with deg(s) < B. Since s is driven to zero we have q(x) = (x^B + r(x))m(x)x^(B-P)/(h(x)x^(B-P)) upon exit from the loop. The extra factor x^(B-P) in m(x)x^(B-P) above arose when we fed B-P more zero bits into the algorithm after the last bit of the argument arrived. Steps 1 and 2 constitute the multiplication (x^B + r(x))m(x) and steps 3 and 4 perform the division (x^B + r(x))m(x)/h(x). We now have

q(x) = (x^B + r(x))m(x)/h(x) = m(x)p(x)

which is the desired answer. We might reformulate the above in notation without mutation, such as shifting and exclusive-oring.

To accelerate the above computation we consider the data flow. The only parts of the computation whose cost is proportional to B are the shifts of s, m and q. m is entirely unneeded; it was computed only to make the loop invariant testable and thus make the computation easier to debug. The bits of q are produced as the answer and are not used again in this computation. Only the first B bits of s are used; indeed the rest are all zero. After s_P is produced it is not used again, aside from being shifted, until it becomes s_{B-P}. This part of s is thus a fixed-length FIFO queue and may be implemented without shifting. Alternatively we can dispense with the queue and perform steps 3 and 4 sooner if we can get an early edition of the input bits with which to perform the x^B part of step 2; that term is the only thing delaying the computation in steps 3 and 4.
In either case there remain no factors of B in the computation.
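The machinery above is concrete enough to sketch. The following Python is a minimal illustration and not the author's implementation: bit strings are modeled as Python integers, h and B follow the approximate values in the text, the run length 11 comes from the overview at the top, and all function names are invented here. stream_hash carries out steps 1 through 4 literally, clmul is the convolution view of the same computation, and landmarks scans the yield for runs of ones:

```python
# Sketch of the bit-serial stream hash (steps 1-4) and the landmark scan.
# Bit strings are Python ints: bit i of n is (n >> i) & 1; GF(2) add is XOR.
# Constants and names are illustrative, not taken from the author's code.

H = (1 << 64) | (1 << 61) | 1        # h(x) = x^64 + x^61 + 1
P = 64                               # degree of h (the text's P)
B = 500                              # "some constant" of about 500

def polydivmod(a, h):
    """Quotient and remainder of a(x)/h(x) over GF(2)."""
    dh = h.bit_length() - 1
    q = 0
    while a.bit_length() - 1 >= dh:
        shift = a.bit_length() - 1 - dh
        q |= 1 << shift
        a ^= h << shift
    return q, a

PQ, R = polydivmod(1 << B, H)        # p(x) and r(x): x^B + r(x) = p(x) h(x)
XBR = (1 << B) ^ R                   # x^B + r(x)

def clmul(a, b):
    """Carry-less product a(x) b(x) over GF(2): the convolution view of S."""
    out = 0
    while a:
        if a & 1:
            out ^= b
        a >>= 1
        b <<= 1
    return out

def stream_hash(m, nbits):
    """Compute q(x) = m(x) p(x) bit-serially, as in steps 1 through 4.

    Bits of m arrive largest index first; negative i plays the oracle,
    feeding the final B - P zeros that drive s to zero."""
    s = q = 0
    for i in reversed(range(-(B - P), nbits)):
        s <<= 1                      # step 1 (m itself is never read back)
        q <<= 1
        if i >= 0 and (m >> i) & 1:  # step 2: xor x^B + r(x) into s
            s ^= XBR
        sb = (s >> B) & 1
        q |= sb                      # step 3: q_0 = s_B
        if sb:                       # step 4: xor h(x) x^(B-P) into s
            s ^= H << (B - P)
    assert s == 0                    # s is driven to zero on exit
    return q

def landmarks(s, run=11):
    """Low end of each run of `run` ones in s: the landmark positions."""
    mask = (1 << run) - 1
    hits, i = [], 0
    while s >> i:
        if (s >> i) & mask == mask:
            hits.append(i)
        i += 1
    return hits
```

Since stream_hash computes a convolution, it is stationary: hashing the same data at a different bit offset shifts the yield, and hence the landmark positions, by exactly that offset. This is what lets the finder match coincidences whose offsets differ by other than a whole number of bytes.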