AVersion Numb ering Scheme
with a Useful Lexicographical Order
y z
Arthur M. Keller Je rey D. Ullman
Stanford University Stanford University
Computer Science Dept. Computer Science Dept.
Stanford, CA 94305-2140 Stanford, CA 94305-2140
Abstract all situations. A con guration C mayhave an ances-
1
tor con guration C , in which each of the versions par-
0
ticipating in con guration C must b e a descendant
We describe a numbering scheme for versions with
1
of the corresp onding version in con guration C . The
alternatives that has a useful lexicographical ordering.
0
CEDB computational strategy is to check incremen-
The version hierarchy is a tree. By inspection of the
tally for constraint violations by comparing a con g-
version numbers, we can easily determine whether one
uration with an ancestor con guration. This compar-
version is an ancestor of another. If so, we can de-
ison relies on the determination of changes \deltas"
termine the version sequencebetween these two ver-
that o ccurred b etween twoversions.
sions. If not, we can determine the most recent com-
mon ancestor to these two versions i.e., the least up-
Our approach is to record the incremental changes
per bound, lub. Sorting the version numbers lexico-
between a version and its immediate ancestor. As a
graphical ly results in a version being fol lowedbyall
result, if we need to nd the changes that haveoc-
descendants and preceded by al l its ancestors. We use
curred b etween versions A and B ,wemust nd the
arepresentation of nonnegative integers that is self
changes asso ciated with all versions that lie b etween
delimiting and whose lexicographical ordering matches
A and B in the version hierarchy. Imagine that all
the ordering by value.
changes are stored in a relation, along with the ver-
sion to which they p ertain. For example, if in going
1 Intro duction
from version P toachild version C ,we insert the ele-
ment a, then we might store the change record, a triple
Our motivation is a pro ject in collab orative design
insert, a, C in the relation.
for Civil Engineering b eing carried out at Stanford.
Ideally,we should b e able, given version identi ers
The approach taken by the pro ject, named CEDB
A and B , to nd all the change records asso ciated
Collab orativeEnvironment for the Design of Build-
with versions that lie b etween the version numb ers for
ings, leads to an interesting problem in version num-
A and B in the version hierarchy including B but
b ering. Our approachtoversion numb ering is the sub-
not A. Surely,we can do so if wehave access to the
ject of this pap er.
hierarchy.However, there is a lack of eciency in an
CEDB uses a version and con guration manage-
approach that issues one query for eachversion on the
ment system that supp orts incremental checking of
path b etween A and B .We prop ose instead to issue
constraints see Krishnamurthy [1993]. There is a hi-
one range query that will access all the change records
erarchical version system, with a separate version hier-
for all the versions b etween A and B . With the prop er
archy for each design discipline e.g., plumbing, elec-
index and storage structure, these change records can
trical, structural engineering. A con guration con-
b e obtained in little more than the time it takes to
sists of a version from each discipline together with a
access data of this bulk.
set of constraints that the con guration should satisfy
Our primary goal, therefore, is to nd a scheme for
although we do not insist on constraint satisfaction in
numb ering versions, so that all of the relevantchange
This work is part of the CEDB pro ject: Collab orative En-
records can b e found by a sequential scan of the data
vironment for the Design of Buildings, or Civil Engineering
between the twoversion numb ers when change records
Database. This e ort is funded in part by NSF grant IRI{
are sorted by their asso ciated version numb er. There
91{16646.
y
are twointeresting asp ects of our version numb ering
Arthur Keller's e-mail address is [email protected].
z
scheme. Je Ullman's e-mail address is [email protected].
1. A metho d for assigning numb ers to versions so the parent and the three children, a sequential traver-
that all the versions b etween A and B in the sal from at least one of the children must skip over at
hierarchyhaveversion numb ers that lie b etween least one of its siblings to reach its parent. In general,
the version numb ers for A and B however, there we shall have to skip some versions, but it is highly
may also b e other version numb ers that lie b e- desirable to nd all the ancestors of a version within
tween A and B . The version numb ering scheme some small region of the complete ordering. For ex-
must make it p ossible to add new children with- ample:
out renumb ering old versions.
If we wish to nd the changes from some version
V to a descendantversion W ,we shall need to
2. A metho d for enco ding nonnegativeintegers as
retrieve less information from secondary storage
character strings so that if i if the ancestors of W are in a small region of the for integer i is lexicographically less than the co de ordering. for j . If we store all change records in a single relation, Note that the conventional enco ding of integers and we sort or index those records byversion do es not satisfy 2. For example, in decimal the char- numb er, we can use a range query to help nd acter strings for the integers 5 and 43 are in the wrong the changes b etween a version V and a descen- order. That is, as integers, 5 < 43, but as character dantversion W , and this query will b e eciently strings, 43 lexically precedes 5. implemented. 2 Representing a Hierarchy Now supp ose we wish to add to the hierarchy.We assume that new versions can only b e added as chil- In this section, we describ e how to represent the dren of existing versions, as is the case in the CEDB hierarchical relationship of version numb ers using a system. When we add new versions, we are faced with linear enco ding that sorts in a useful manner. Let the problem of nding new names for the new versions, us consider the simple version hierarchy in Figure 1, esp ecially if we wish to retain the prop erty that names where the names of the versions o er no help in nding are in lexicographic order along paths of the hierarchy. the versions that lie b etween two given versions in the Figure 2 shows two p ossible approaches. In each hierarchy. case, wehave added some children to Figure 1. In Figure 2a wehave renamed versions so children fol- A low their parent in lexicographic order. Figure 2b shows another approach, where names of versions are unchanged and new versions are given names that fol- B lowany previously used names in lexicographic order. The primary advantage of ordering new versions near the parent as in Figure 2a is lo cality of refer- C D E ence. The primary advantage of naming versions in order of creation as in Figure 2b is that wedonot have to rename versions when adding other versions. Figure 1: Simple version hierarchy. However, in the latter scheme there tend to b e many \false drops," that is, version names that lie in lexi- Version A has one child, version B , which in turn cographic order b etween the names of some version V has three children, versions C , D , and E . In this case, and a descendant W ,yet are not lo cated b etween V the lexicographical ordering of the version names is and W in the hierarchy. useful. To trace the ancestry of version D ,we can By using variable-length version names, we can progress sequentially though the ordering. We tra- avoid the problem of renaming. Consider Figure 3, verse, A, B , skip C , and then stop when we reach where wehave enco ded the ancestry in the version D . name. That is, the name of eachversion is the name of its parent, followed byanumb er that indicates which Observe that it is not p ossible to order the ver- child it is in the ordering of children of its parent. sions so that a sequential traversal of the ordering will There are several further improvements we can encounter all the ancestors of eachversion in an unin- make to this scheme. First, we can optimize for the terrupted sequence. For example, consider a parent common case where a version has only one child. That has three children. Given some sequential ordering of A A B B C F I BA BB BC D E G H BAA BAB BBA BBB a Sorted with parents. Figure 3: Enco ding ancestry. A A B B C D E C BA BB F G H I D CA BAA BAB b Sorted by level. Figure 4: Another way to enco de ancestry. Figure 2: Sorting versions. 3 Our Prop osed Version-Numb ering Scheme is, wemay assign to the rst child of certain versions anumb er that is one greater than the numb er of its Note the problem in naming the descendants of ver- parent, than treating all alternatives children of the sion BA of Figure 4. We need to distinguish b etween same parentversion the same, as shown in Figure 4. the rst child of BA and the alternativetoBA, b oth This approach reduces the average length of names in of which could plausibly have b een named BB. To the common case when most versions have one child. handle this problem, we shall alternate b etween en- In comparison, the scheme of Figure 3 forces names to co ding the generation numb er and enco ding the al- b e as long as the paths in the hierarchy. ternativenumb er. To facilitate the exp osition, we In summary, there are three imp ortant criteria we will use numb ers to represent the generation sequence need from a version-naming scheme. and letters to represent alternatives. However, in ver- sion names, think of A;B;::: as representing integers 1. The lexicographic property : all the versions b e- 0; 1;::: . This approach is illustrated in Figure 5. tween some version V and a descendantversion Formally,a version numb er for a version V is a se- W lie lexicographically b etween V and W . quence g a g :::a g . Here, g indicates the number 0 1 1 n n n of generations from the matriarch of V , whose num- 2. Given any hierarchyofversions, we can nd a ber is g a g :::a 0. The matriarchof V is the closest 0 1 1 n name for any additional child of an existing ver- not necessarily prop er ancestor of V that is not a rst sion. This name must b e distinct from the names child of its parent or the ro ot if there is no such an- of other versions in the hierarchy. cestor of V . For example, in Figure 5, the matriarch of 1B 1A2is1B 1A0, and the matriarchof3A2A0is itself. 3. The length of a name do es not grow signi cantly The number a is one less than the numb er of sib- along paths that follow a sequence of leftmost n lings that the matriarchofV has to its left. For exam- children. 0 1 2 1A0 1B 0 3 1A1 1A0A0 1B 1 2A0 4 1A2 1B 2 3A0 1B 1A0 5 1B 3 3A1 1B 1A1 6 1B 4 3A2 1B 1A2 1B 5 3A3 3A2A0 1B 1A3 1B 1A2A0 7 Figure 5: Improved version numb ering. ple, in Figure 5 consider the children of version 1. The 4 Some Basic Algorithms Using the Version- rst child gets numb er 2. The second child instead has Numb ering Scheme app ended to the name \1" the letter A which repre- We b egin with an algorithm for determining the sents \0". The third child has B app ended to the ancestors of a version. name \1". Both the second and third children havea 0 app ended to their names after the A or B , b ecause Algorithm 1 Traverse ancestors they are their own matriarchs; i.e., they are 0 genera- tions from their matriarch. That is, the following rules Input: Version g a g :::a g . 0 1 1 n n summarize how the children of a version are named. Output: All ancestors in order from parent to oldest. The rst child of version g a g :::a g is 0 1 1 n n for i := n downto 0 do g a g :::a g + 1. 0 1 1 n n for j := g downto 0 do i Printg a g :::g a j 0 1 1 i 1 i The k th child of version g a g :::a g , for k>1, 0 1 1 n n 3 is g a g :::a g k 20. 0 1 1 n n Note that when i is 0, then merely j is output. Here is an algorithm for determining the closest This scheme o ers an imp ortant economy in the common ancestor i.e., least upp er b ound to twover- situation that is most common: very few versions have sions. more than one child. For example, in the extreme case where there are n versions in a chain, that is, no Algorithm 2 Closest common ancestor version has more than one child, log n bits suce to 00 00 00 00 00 0 0 0 0 0 representversion names. In comparison, anyscheme . g :::a g a and g g :::a g a Input: Versions g m m 1 1 0 n n 1 1 0 that allows us to extend the name of a version to get the name of its child would require linear space to Output: Closest common ancestor g a g :::a g . 0 1 1 l l represent these n version names. 0 00 if g 6= g W lie b etween V and W in lexicographic order. In a 0 0 0 00 then b egin l := 0; g := ming ;g ; exit end sense, that is all we need. However, the scheme of 0 0 0 0 else g := g ; Section 3 su ers from the problem that the elements 0 0 for i := 1 to minn; m do b egin of the lists that form names are arbitrary integers. 0 00 if a 6= a Since wemust represent suchintegers in some system i i then exit; that uses a nite numb er of symb ols, wemay not b e 0 a := a ; able to preserve the lexicographic order. i i 0 00 if g 6= g For example, consider the names 10A0 and 2B 3. i i 0 00 then b egin l := i; g := ming ;g ; exit end; The rst of these represents a version that is found by i i i 0 g := g ; following 10 generations of rst children from the ro ot, i i end; then moving to the second child. The second is found l := minn; m; by following two generations of rst children from the exit ro ot, moving to the third child, and then going down 3 three more generations of rst children. The latter precedes the former, since 2 < 10. However, if we We call the traversal that results from lexicograph- representintegers in decimal, and treat A as 0, B as ical ordering from our enco ding the inverse left-corner 1, and so on, then the two names b ecome 1000 and traversal. This traversal is pro duced by the following 213. The rst precedes the second in lexicographic algorithm. order, which is wrong. Wethus face the problem of representing integers Algorithm 3 Inverse left-corner traversal by co des that are in the same order, lexicographically, Input: Aversion history represented by tree T . as the values of the integers. Such an enco ding of inte- gers will b e said to have the lexicographic property.We Output: Traversal according to lexicographical ordering. just observed that the ordinary representations such as decimal do not have the lexicographic prop erty.We walkT = saw that 2 < 10 as integers, but 10 precedes 2 lexico- for i:=2 to last child of ro ot do graphically. walkith subtree of ro ot of T ; The simplest solution is to use xed-length numb ers printrootofT; of sucient length, with numb ers padded on the left walk rst subtree of ro ot of T by leading zero es. For example, 5 digits are enough to end; handle 10,000 generations of versions and up to 10,000 3 children of one version. The xed-length solution suf- fers from three problems: wasted space for lowvalues, To traverse from a version v to some descendant a limited maximum value, and the need to sp ecify the v , scan the ordering starting at the version v and d a length somehow. ending at the desired descendant v , skipping all ver- d Alternatively,we can use a variable-length repre- sions that are not ancestors of the desired descendant. sentation that is self-delimiting. Consider the follow- The only versions that will have to b e skipp ed are ing simple scheme for representing binary numb ers. other descendants of ancestor v . a Represent a binary number of b bits by pre xing it with b 1 one bits and a zero bit. This scheme is called 5 Representing Version Names Elias co des [Elias 1975]. For example, the number 5, or 101 , is represented by the binary string 110101. Wemay see the version naming scheme of Section 3 2 Deco ding is easy. The pre x consists of all the bits as one where version names are lists of integers, the up to and including the rst zero; the numb er of bits alternating g 's generation numb ers and a's p osition in the pre x tells how many bits follow. The number among siblings. These lists have a natural lexico- follows in binary. graphic order. That is, to compare two lists, nd the common pre x and delete it. After this mo di cation, A similar scheme can b e used for variable-length an empty list precedes any list, and otherwise, the list enco dings using decimal numb ers. Here, h digits with with the lower rst element precedes the other. For the value 9 are followed by h + 1 digits, where the example, in Figure 5, 1B 1A2 precedes b oth 1B 4 and rst of those latter digits is not 9. The numb ers 0 1B 1A2A0. through 8 are represented as themselves. The numb ers It is easy to check that with this numb ering scheme, 9 through 98 are represented by 900 through 989. The all the versions b etween a version V and its descendant numb ers 99 through 998 are represented by 99000 to We shall intro duce the optimal scheme rst by con- 99899, and so on. In general, to get the value from sidering the case of the decimal representation. such a representation add the pre x the consecutive 9's to the rest of the numb er using ordinary decimal Enco de the rst 5 numb ers 0 through 4 by 0 through arithmetic. 4. To obtain a representation we use the following al- Enco de the next 25 numb ers 5 through 29 by 50 gorithm. Let d b e the numb er of digits ordinarily used through 74. to represent the desired value v . If all d digits of v are Enco de the next 125 numb ers 30 through 154 by 750 the digit 9, then the representation is d 9's followed through 874. 0 by d + 1 0's. Otherwise, let v = v x, where x Enco de the next 625 numb ers 155 through 779 by is the decimal numb er formed by d 1 9's; that is, 8750 through 9374. d 1 x =10 1. Then the representation of v is d 1 Enco de the next 3125 numb ers 780 through 3904 by 0 9's followed by v padded out with leading zero es to d 93750 through 96874, and so on. digits, if necessary. This approach o ers smaller expansion of co des at We might also adapt the same idea to any radix the exp ense of slightly more complicated enco ding and other than binary or decimal. In any radix it is easy deco ding. For example, the scheme of Section 5 repre- to prove that for anyintegers i and j ,ifi sents 999 numb ers in 5 digits, while this scheme rep- the enco ding of i precedes the enco ding of j , lexico- resents 3905 numb ers in 5 digits. Here are algorithms graphically.We shall fo cus on the binary system, for for enco ding and deco ding numb ers using this scheme. simplicity. Supp ose i Algorithm 4 Enco ding then the enco ding of i will b egin with fewer 1's than j 's Input: n to b e enco ded. enco ding. Thus, i precedes j lexicographically, when b oth are enco ded. The only other p ossibility when Output: Decimal representation of the enco ded i< j is that i and j have the same numb er of bits numb er. in their binary representations, say b bits. Then b oth b 1 their enco dings start with 1 0, followed by the b-bit d Let d b e the smallest integer such that n<5 5=4. representations of i and j . Since i Let :h h :::h b e the decimal fraction 1 2 d 2 enco ding of i will precede that of j , lexicographically. d 2 representation of 1 1=2 . Then the enco ding of n is the decimal representation 6 Optimal Lexicographic Enco dings of Inte- d 1 of h h :::h 0, plus n, minus 5 5=4. 1 2 d 2 gers 3 For example, let n = 2000. Then d = 6, b ecause Let us de ne the expansion of an enco ding to b e 6 5 5 5=4 = 3905 is greater than 2000, but 5 the limit, as i gets large, of the ratio of the length of 5=4 = 780 is not. Thus, h h :::h = 9375, since the enco ding of i to the length of the ordinary radix 1 2 d 2 1 1 = :9375. Wethus compute the enco ding of 2000 representation of i in the base that has as many digits 16 5 by adding 93,750 to 2000 and subtracting 5 5=4, as the enco ding uses. For example, whether in binary, or 780. The resulting co de is 94970. decimal, or another base, the expansion of the enco d- ing schemes of Section 5 is 2. The reason is that the Algorithm 5 Deco ding pre xes intro duced have, within 1, as many digits as the ordinary representation of the numb er. Input: A k -digit enco ding, whose decimal value is d, We might ask whether 2 is the minimum expansion to b e deco ded. p ossible. The answer is rather surprising. It is p ossi- ble to approach expansion 1 as closely as we like; that Output: Numerical value of the deco ded numb er. is, asymptotically the requirement of a lexicographic k k k enco ding costs nothing. However, to approach1it The enco ded integer is d 10 +2 5 +5 5=4. is necessary to inde nitely increase the base of the 3 representation. That has the e ect of putting a pro- gressively higher o or on the minimum length of an For example, supp ose we are given the co de 94970. 5 enco ded integer when all representations are reduced Then d = 94,970 and k =5.Wemust subtract 10 5 to binary, as they surely must b e for representation = 100,000 from d, then add 2 5 = 6250 and add 5 within a computer. 5 5=4 = 780. The result is 2000. i 7 Optimality of Our Integer Enco ding symb ols is b=2 . Note that the scheme of Section 6, which uses b = 10, actually assigns exactly this many It turns out that the enco ding of Section 6, or its integers. n generalization to an arbitrary base, is essentially the For a general base b,we can represent b numb ers b est we can do. We shall show that any enco ding in by co des of length up to m, provided m satis es base b that satis es two constraints app ears to use only m m 1 n b=2 +b=2 + +1 b half of the b available digits. The two constraints, b oth of which are consequences of our assumptions ab out Some algebra shows that the smallest m that satis es howversion names are to b e enco ded, are: the ab ove is approximately 1. The enco ding must b e a pre x co de; that is, no m = n log b=log b 1 2 2 integer mayhave an enco ding that is a pre x of Put another way, the expansion of such a co de is ap- another integer's enco ding. To understand the proximately 1 + 1= log b. For b = 10, wehavean 2 need for this requirement, wemust rememb er that expansion of ab out 1.3, compared with an expansion whatever alphab et is used to represent enco dings of 2 for the enco ding of Section 5. is the entire alphab et. There is no extra \blank" Of course, we need to show the existence of an en- or \endmarker" symbol available. Thus, if one co ding for an arbitrary numberofsymb ols b that pre- integer's enco ding were a pre x of another's, it serves lexicographic order the way the enco ding for would not b e clear whichinteger was meant when b =10does.We claim that such a co de always exists. the latter's app eared. The existence is easy to exhibit and also comparatively easy to understand for the case that b isapower of 2. The enco ding must not b e biased in favor of large k 2; say b =2 . It also helps our understanding if we or small integers. The formal requirement that represent the digits in base b by their binary represen- we b elieve re ects this condition is that the num- tation. For example, if b = 16, we represent eachof b er of enco dings of a given length i grows in a the hexadecimal digits by a string of four 0's and 1's. uniformly exp onential way. That is, the number i Then the co de consists of the following: of integers that have enco dings of length i is c for some constant c. While it is p ossible to develop k 1 2 bit strings of length k b eginning with 0, co des that favor either small or large numb ers, 2k 2 2 bit strings of length 2k b eginning with 10, wefavor the common case where version numb ers 3k 3 2 bit strings of length 3k b eginning with 110, may b e arbitrarily large. 4k 4 2 bit strings of length 4k b eginning with 1110, 5k 5 2 bit strings of length 5k b eginning with 11110, Condition 1 has a well-known arithmetic interpre- and so on. tation, called the Kraft inequality McMillan [1956], Cover and Thomas [1991]. Let n b e the number of For example, again assuming b = 16, so k = 4, our i integers with co des of length i, and let b b e the num- co de uses 8 bit strings of length 4, which are 0000 b er of symb ols in the enco ding alphab et. Then if the through 0111, corresp onding to the hexadecimal dig- enco ding satis es condition 1, i.e., it is a pre x co de, its 0 through 7. Wehave 64 bit strings of length 8, then each of which b egins with 10. These are the two-digit 1 X hexadecimal numb ers whose rst digit is b etween 8