Journal of Universal Computer Science, vol. 5, no. 4 (1999), 243-286
submitted: 12/4/99, accepted: 23/4/99, appeared: 28/4/99 Springer Pub. Co.
Automatic Data Restructuring
Seymour Ginsburg
घComputer Science Department, University of Southern California,
Los Angeles, California, 90089
e-mail: ginsburg@p ollux.usc.eduङ
Nan C. Shu
घIBM Los Angeles Scienti c Center,
Los Angeles,Californiaङ
Dan A. Simovici
घUniversity of Massachusetts at Boston,
Department of Mathematics and Computer Science,
Boston, Massachusetts 02125
e-mail: [email protected]ङ
Abstract: Data restructuring is often an integral but non-trivial part of informa-
tion pro cessing, esp ecially when the data structures are fairly complicated. This pap er
describ es the underpinnings of a program, called the Restructurer, that relieves the
user of the \thinking and co ding" pro cess normally asso ciated with writing pro cedural
programs for data restructuring. The pro cess is accomplished by the Restructurer in
two stages. In the rst, the di erences in the input and output data structures are
recognized and the applicabilityofvarious transformation rules analyzed. The result
is a plan for mapping the sp eci ed input to the desired output. In the second stage,
the plan is executed using emb edded knowledge ab out b oth the target language and
run-time eciency considerations.
The emphasis of this pap er is on the planning stage. The restructuring op erations
and the mapping strategies are informally describ ed and explained with mathematical
formalism. The notion of solution of a set of instantiated forms with resp ect to an
output form is then intro duced. Finally,itisshown that such a solution exists if and
only if the Restructurer pro duces one.
Key Words: non- rst normal form databases, data restructuring, instantiated forms,
hierarchical structures, solution of a set of instantiated forms.
Category: H.2, E.1
1 Intro duction
The rapid decline of computing costs and the desire for pro ductivity improve-
ment has created increased demands for computerized information pro cessing by
end users [22]. The challenge for computer professionals is to bring computing
capabilities - usefully and simply - to p eople without sp ecial computer training.
The emergence of visual programming [36, 8, 24, 38, 6] and the resurgence of
automatic programming [5] represent two signi cant approaches to meet this
challenge. FORMAL [34], a visual-directed and forms-oriented application de-
velopment system, is a prototyp e that emb o dies the spirits of b oth. The visual
programming asp ects of FORMAL have b een describ ed elsewhere [34,36, 37].
The purp ose of this pap er is to present the automatic asp ects of data restruc-
244 Ginsburg S., Shu N.C., Simovici D.A.: Automatic Data Restructuring
1
turing in FORMAL and a theoretical foundation for it. However, the abstract
considerations presented here are not dep endent on the particular implementa-
tion solutions adopted in FORMAL.
Data restructuring is often done in an ad ho c manner. What sets our metho d
apart from other data restructuring e orts is the automatic mechanism's ability
to takeover the \thinking and co ding" pro cess normally asso ciated with writ-
ing algorithms for data restructuring. For convenience this mechanism is called
\the Automatic Restructurer" घor simply the Restructurerङ. The most imp ortant
pieces of information given to the Automatic Restructurer are an unambiguous
description of each input and an unambiguous structural description of the de-
sired output. If there are two or more inputs, then the match elds used to tie the
inputs together must also b e given. Based on these descriptions, an executable
program घusing the restructuring op erations originally prop osed in CONVERT
[31]ङ is generated byatwo-stage pro cess. In the rst stage, the di erences b e-
tween the input and output data structures are recognized and the applicability
of various transformational rules analyzed. The result is a plan for mapping the
sp eci ed inputघsङ to the desired output. In the second stage, runtime ecien-
cies are taken into consideration, and an executable program is implemented to
carry out the plan. The result is a reasonably ecient program for the restruc-
turing task at hand. A ma jor contribution of this pap er is a detailed explanation
घexpressed b oth formally and informallyङ of the underpinnings of the automatic
restructuring mechanism. Since the eciency considerations and implementation
details are not the primary interests of this pap er, the concerns of the second
घi.e., construction or co dingङ stage are included only to complete the exp osition.
The emphasis is on the mechanism underlying the rst घi.e., planningङ stage,
namely, on the derivation of a sequence of restructuring op erations aimed at
pro ducing a user-sp eci ed output from the given input.
The pap er itself is organized into eight sections, the rst b eing this intro duc-
tion. Section 2 explains the notion of \data restructuring." Section 3 presents the
underlying data mo del. Section 4 discusses the restructuring problem. Section 5
de nes four typ es of basic op erations designed sp eci cally for data restructuring.
Section 6 considers the strategies for mapping sp eci ed inputs to desired out-
puts, using sequences of these basic op erations. Section 7 discusses the eciency
considerations. Section 8 presents a summary of the results obtained.
2 Data Restructuring
As mentioned in the Intro duction, the purp ose of this pap er is to present a
metho d for automatically p erforming data restructuring tasks and provide a
theoretical basis for it. This section explains and elab orates on the meaning of
\data restructuring," why it is an imp ortant part of information pro cessing, and
what is the signi cance of an automatic programming approach.
Historically, database reorganization is de ned as \changing some asp ect of
the way in which a database is arranged logically and/or physically" [39] - a
1
The facilities provided byFORMAL are a sup erset of those describ ed in this pap er.
Here, wecho ose to discuss only those directly relevant to the restructuring of the
data structures. Capabilities suchaschanging comp onent names, sp ecifying criteria
for selections, case-by-case assignments, arithmetic and string op erations, handling
of exceptions, etc., are not germane to structural transformations, and are therefore not included in the discussion.
Ginsburg S., Shu N.C., Simovici D.A.: Automatic Data Restructuring 245
generic term that covers b oth data restructuring घchanging logical structuresङ
and reformating घchanging physical structuresङ of database systems. With that
p ersp ective, data restructuring fell primarily in the domain of data-pro cessing
op erations p ersonnel and database researchers. Our view of data restructur-
ing has a much broader scop e. Here, the term data restructuring is to mean
a sequence of op erations aimed at pro ducing an output with structure di erent
from that of the corresp onding input. Data restructuring may b e p erformed for a
variety of reasons, for example, as a desirable database function घto improve p er-
formance or storage utilizationङ, or as a useful application घto supp ort decision
making or data pro cessingङ.
घPERSONङ
ENO DNO NAME PHONE JC घKIDSङ घSCHOOLङ SEX LOC
KNAME AGE SNAME घENROLLङ
YEARIN YEAROUT
05 D1 SMITH 5555 05 JOHN 02 PRINCETON 1966 1970 F SF
MARY 04 1972 1976
07 D1 JONES 5555 05 DICK 07 SJS 1960 1965 F SF
JANE 04
BERKEPRY 1965 1969
11 D1 ENGEL 2568 05 RITA 04 UCLA 1970 1974 F LA
12 D1 DURAN 7610 05 MARY 08 M SF
BOB 10
19 D1 HOPE 3150 07 MARYLOU 10 M SJ
MARYANN 07
02 D2 GREEN 1111 01 DAVID 04 M SF
20 D2 CHU 3348 10 CHARLIE 06 HONGKONG 1962 1966 F LA
CHRIS 09
BONNIE 04 STANFORD 1967 1969
1972 1975
21 D2 DWAN 3535 12 USC 1970 1974 F SJ
43 D2 JACOB 4643 09 PAULA 07 BERKEPRY 1962 1966 M SJ
Figure 1: घPERSONङ le
Data restructuring plays an imp ortant role in information pro cessing applica-
tions. When data is extracted from sources or new elds are created and placed
in the output, the structure of the resulting output seldom resembles that of
the input. To illustrate, consider the p ersonnel information of a given company
shown in the घPERSONङ le describ ed in Figure 1. Supp ose a Christmas party
is b eing planned. All employees' children will receive gifts at the party. Di erent
gifts are planned for each age group. The organizer of the party wishes to list,
for each lo cation घLOCङ, by age of the children घAGEङ, the name of each kid
घKNAMEङ and his/her parent name घNAMEङ. For this seemingly simple exam-
ple, all information required to pro duce the desired घXMASLISTङ le घFigure
2ङ is contained in the घPERSONङ le. Yet, the actual pro cess of pro ducing the
desired घXMASLISTङ involves data restructuring of a non-trivial nature घsee
Figure 3ङ.
Besides showing that data restructuring is an integral part of many informa-
tion pro cessing applications, the ab ove example also shows that query facilities
घdesigned for simple retrievalङ are not adequate to handle many of the real world
situations: programming is required. Traditionally, programming is considered to
246 Ginsburg S., Shu N.C., Simovici D.A.: Automatic Data Restructuring
घXMASLISTङ
LOC घRECIPIENTङ
AGE घKIDङ
KNAME NAME
LA 04 BONNIE CHU
RITA ENGEL
06 CHARLIE CHU
09 CHRIS CHU
SF 02 JOHN SMITH
04 MARY SMITH
JANE JONES
DAVID GREEN
07 DICK JONES
08 MARY DURAN
10 BOB DURAN
SJ 07 PAULA JACOB
MARYANN HOPE
10 MARYLOU HOPE
Figure 2: Desired घXMASLISTङ
b e in the professional programmers' province. But the reliance on a professional
programmer often means placing a formal request to the data pro cessing p erson-
nel and waiting on a priority list for programmers to take action. The alternative
is for the end users to learn to program.
Learning to program is generally a time-consuming and often frustrating
endeavor. Moreover, even after the skill is learned, writing and testing a pro-
gram is still tedious and lab or intensive. Writing programs to p erform data
transformation is no exception. To ease the task, sp ecial languages with high
level op erations for data restructuring have b een rep orted in the literature. घSee
[1, 11,17,20,21,25,30,31,32,40,42] for example.ङ These languages are high
level, but they are still \pro cedural" and programming training is necessary.In
other words, users of these languages must write out the algorithms of restruc-
turing in a step by step manner. घSee App endix A for an example.ङ
In the rapidly growing end-user computing environment, many non-DP pro-
fessionals simply do not have the time, interest, or skill to program. Indeed, as
noted in a survey of end-user computing घ [27]ङ, two thirds of the end-user ap-
plications were develop ed by supp ort p ersonnel, programmers, and consultants.
It is our contention that unless an essential part of tedious, low level program-
ming can be made automatic, a large community of p otential end users will
not develop their applications. In the remainder of this pap er, we present an
automatic restructuring mechanism which is aimed at decreasing the tedium
normally asso ciated with writing pro cedural programs. For example, b ecause of
the Automatic Restructurer implemented in FORMAL, घXMASLISTङ can be
created from घPERSONङ with the sp eci cation shown in Figure 4. In essence,
the user describ es the structure of the desired output घXMASLISTङ and tells
the system to nd the required information in the source form घPERSONङ. The system do es the rest.
Ginsburg S., Shu N.C., Simovici D.A.: Automatic Data Restructuring 247
घPERSONङ
ENO DNO NAME JC घKIDSङ घSCHOOLङ SEX LOC
KNAME AGE SNAME घENROLLङ
YEARIN YEAROUT
Restructuring
घXMASLISTङ ?
LOC घRECIPIENTङ
AGE घKIDङ
KNAME NAME
Figure 3: Restructuring of घPERSONङ to घXMASLISTङ
3 The Underlying Data Mo del
As mentioned in the Intro duction, the most imp ortant information given to
the Restructurer is an unambiguous description of the input and output data
structures. In this section, we discuss the data structures supp orted by the Re-
structurer, and indicate several ways to describ e them.
Brie y, the Automatic Restructurer handles all data structures as hierar-
chies. Relational tables are treated as degenerate hierarchies घi.e., hierarchies
of one levelङ, and networks are assumed to have b een decomp osed into fami-
lies of hierarchies b efore the Automatic Restructurer is invoked. No conceptual
limitations are placed on the depth or width of the hierarchies.
There are many di erentways to describ e hierarchical data structures. In the
database community, hierarchy graphs are commonly used when data structures
are discussed. As an example, Figure 3 shows the hierarchy graphs of घPERSONङ
and घXMASLISTङ. But hierarchy graphs b ecome cumb ersome when the display
248 Ginsburg S., Shu N.C., Simovici D.A.: Automatic Data Restructuring
घXMASLISTङ
घRECIPIENTङ
LOC
AGE घKIDङ
KNAME NAME
SOURCE घPERSONङ
Figure 4: A sp eci cation for creating घXMASLISTङ from घPERSONङ
of instances is desired. The diculty is that one \instance graph" is required to
represent each o ccurrence, and consequently many \instance graphs" are needed
to showmultiple o ccurrences.
Data Pro cessing p ersonnel mayormay not care ab out instance graphs, but
for users who do not have very much data-pro cessing training, visualization
of o ccurrences often enhances the understanding of a problem. A \form" data
mo del was therefore suggested in [31] as a two-dimensional representation of
hierarchical data where multiple o ccurrences can be displayed conveniently. It
was intended at that time as a visual aid to assist the users in understanding the
high level op erations p erformed on their data. The \form op eration by example"
[23], the nested table घNTङ mo del [20], and more recently, the \nested normal
forms", and \non rst normal form relational databases" घsee [1, 9, 11,12,18,
19, 28, 29, 30, 41], for exampleङ use similar representations as visual aids. In
these representations, hierarchical structures are implied by exhibiting sample
instances.
Visual aids enhance human understanding. But, unless the representations
are precise, they cannot b e pro cessed by a computer program घdata restructurer
includedङ. In order to provide unambiguous descriptions घwithout dep ending
on the display of sample instancesङ to the Restructurer, the hierarchical struc-
tures must b e made explicit. A straightforward approach is to re ne the \form
headings" used in the form mo del so that the one-many relationships are dis-
tinguished from the one-one relationships [33,10]. Before presenting the re ned
form heading as an unambiguous description of a hierarchical structure, a few
de nitions are in order.
A eld is the smallest unit of data that can b e referenced in a user's appli-
cation. A group is a sequence of one or more elds and a numb er of sub ordinate
groups. For example, ENO घEmployee numb erङ, DNO घDepartment numb erङ,
NAME घEmployee Nameङ, PHONE, JC घJob Co deङ, ..., are elds of the घPER-
SONङ le घFigure 1ङ, while घKIDSङ, घSCHOOLङ, and घENROLLङ are groups.
Groups are either rep eating or non-rep eating. A non-repeating group is a
convenientway to refer to a sequence of elds as a unit घe.g., DATE is a non-
Ginsburg S., Shu N.C., Simovici D.A.: Automatic Data Restructuring 249
rep eating group over MONTH, DAY and YEARङ. A repeating group, on the
other hand, represents the one-many relationship and thus exhibits the hierar-
chical relationship. Take घPERSONङ घFigure 1ङ for example. An employee may
have manychildren, and mayhave attended a number of scho ols. Thus, for an
employee, घKIDSङ and घSCHOOLङ are rep eating groups. Rep eating groups can
b e nested घrepresenting several levels of a hierarchy, e.g., घENROLLङ is a sub or-
dinate rep eating group, nested within घSCHOOLङङ, or in parallel घrepresenting
several branches of a hierarchy, e.g., घKIDSङ and घSCHOOLङ representtwo di er-
ent branchesङ. To simplify the discussion, we do not deal with the non-rep eating
groups in the rest of this pap er.
In data pro cessing applications, a le is a named collection of homogeneously
structured data. The structure of a le is generally de ned in terms of its com-
p onents घwhere a component is either a eld or a groupङ. Comp onent names are
assumed to b e unique for each le, i.e., there are no duplicate names among the
comp onents.
In FORMAL, a form heading assumes a role that is commonly known as the
data structure de nition in conventional programming languages. Its purp ose is
to unambiguously de ne the name of the asso ciated le, its comp onents, and
structural relationships among the comp onents. Parentheses are used to denote
the existence of multiple घi.e. zero or moreङ o ccurrences in the asso ciated le or
group. Thus, names of the le and its repeating groups घalso cal led subformsङ
are enclosed in parentheses. More precisely, in a form heading, the le name is
placed on the top line. The names of the ro ot or top level comp onents are shown
in columns under the le name. The names of comp onents contained in a group,
in turn, are placed in columns under the name of the containing group. A double
line signals the end of a form heading. घSee the lines ab ove the double line in
Figure 1ङ.
In a form heading, the nesting and/or parallel parenthesized names serveto
de ne explicitly a hierarchical structure. The level of each comp onent is indicated
by its row-p osition under the le name. To illustrate, the level numb ers are
explicitly shown in Figure 5 on the outside of the घPERSONङ form heading. For
example, ENO, घKIDSङ, and घSCHOOLङ are at level 1, SNAME is at level 2 and
YEARIN level 3.
The Restructurer classi es data structures घor \forms"ङ into three distinct
shapes : \ at", \branch", and \tree." The algorithm for determining the shap e
is quite simple. It is implemented by traversing a given two-dimensional form
heading घor its equivalent \comp onent description table", discussed laterङ. Dur-
ing the traversal, each comp onent in the form heading is assigned a level number
according to its row p osition. At the end of the traversal, the shap e of the data
structure is determined. If the numb er of parenthesized names equals one, then
none of the comp onents is a rep eating group घor subformङ. The structure is at
shap ed. If the numb er of parenthesized names घsay pङ is greater than one, then
that number घpङ is compared with the largest level numb er घsay nङ. Obviously,
if p is larger than n, then at least one level must have more than one subform,
and the shap e of the form is a tree. घSee Figure 5 for an example.ङ Otherwise,
the structure is a branch. घSee घXMASLISTङ in Figure 4 for an example.ङ
The shap es of at, branch, and tree structures are ranked 1, 2 and 3, re-
sp ectively. The relative shap e rankings of the input and output structures have
a b earing on how the Restructurer maps the strategy for transformation घdis- cussed in Section 6ङ. In the rest of the pap er, when the shap e of the data structure
250 Ginsburg S., Shu N.C., Simovici D.A.: Automatic Data Restructuring
घPERSONङ level
ENO DNO NAME PHONE JC घKIDSङ घSCHOOLङ SEX LOC 1
KNAME AGE SNAME घENROLLङ 2
YEARIN YEAROUT 3
Index IDENTIFIER NEXTI CHILDI PARENTI PRVEL GROUP
0 PERSON 0 1 0 0 1
1 ENO 2 0 0 1 0
2 DNO 3 0 0 1 0
3 NAME 4 0 0 1 0
4 PHONE 5 0 0 1 0
5 JC 6 0 0 1 0
6 KIDS 7 10 0 1 1
7 SCHOOL 8 12 0 1 1
8 SEX 9 0 0 1 0
9 LOC 0 0 0 1 0
10 KNAME 11 0 6 2 0
11 AGE 0 0 6 2 0
12 SNAME 13 0 7 2 0
13 ENROLL 0 14 7 2 1
14 YEARIN 15 0 13 3 0
15 YEAROUT 0 0 13 3 0
Note: NEXTI, CHILDI and PARENTI are indexes to entries in the comp onent description
table for next comp onent, child comp onent,and parent comp onent, resp ectively.
Figure 5: Comp onent description table of घPERSONङ form
is not the center of attention, the generic term \tree" is used interchangeably
with the term \hierarchy" घdiscussed in the intro duction to the present sectionङ.
To view the data, o ccurrences are displayed under the form heading. The
compactness of the form heading enables the visualization of many o ccurrences
at one time घsee Figure 1ङ.
Since the form heading represents a concrete means to unambiguously de-
ne a hierarchical structure of arbitrary depth and width, it is used as an in-
put/output description byFORMAL. An input is de ned once घregardless of the
numb er of di erent applications utilizing itङ by a user or a systems p erson, and
the description is immediately available in the system catalog. There is no need
for another user to de ne the same le again. Description of the desired output,
on the other hand, is provided by the user as part of his application sp eci -
cation. For example, in Figure 4, the description of घXMASLISTङ is provided
in the form heading of the FORMAL program that pro duces घXMASLISTङ. As
discussed later in Section 6, intermediate les may b e needed during the restruc-
turing pro cess. Structures of all intermediate les are de ned by the Restructurer
and are of no concern to the end users.
As mentioned earlier, it is necessary that the unambiguous descriptions of
input and output data structures are available to the automatic restructuring
mechanism, but it is immaterial how the required information is captured. In the
FORMAL implementation, a parser translates the two-dimensional form head-
ings into component description tables which are used as input to the Automatic
Restructurer. As an example, Figure 5 shows the comp onent description table
corresp onding to the घPERSONङ form heading. The comp onent description ta-
bles are easily translatable from the form headings, and vice versa. They are
really two representations of the same information, one b eing more convenient
for the user, the other more suitable for machine manipulation.
Ginsburg S., Shu N.C., Simovici D.A.: Automatic Data Restructuring 251
In the formal treatment which follows, we use the term form to denote, in
an abstracted form, the unambiguous description of a data structure. In the
examples, we use form heading as the concrete representation of the abstracted
notion.
In order to discuss the data restructuring pro cess in a more abstract manner,
a formal description of the hierarchical data structures is now presented.
Let BA b e an in nite set of abstract elements called basic attributes. घBasic
1
attributes corresp ond to the elds mentioned earlier.ङ For each basic attribute
A, let
{ DomघAङ घcalled the domain of Aङ b e a nonempty set of at least two abstract
elements, exactly one of which is denoted ? घthe null symb olङ;
A
{ the basic attributes of A, denoted BAघAङ, b e the set fAg; and
{ the attributes of A, denoted Attrib घAङ, b e the set fAg.
Continuing recursively, let B ;:::;B be r ऋ 1 attributes such that BA घB ङ \
1 r i
BAघB ङ = ; for all i 6= j , and at least one of them a basic attribute. Let
j
B = hB ;:::;B i, where \h" and \i" are two sp ecial symb ols. Then
1 r
{ DomघB ङ is the family of nite subsets of the घorderedङ Cartesian pro duct
DomघB ङ ࣾ ::: ࣾ DomघB ङ;
1 r
{ B is called a form घattributeङ and a parent of each B , and each B is an
i i
o spring or an immediate descendent of B ;
S
{ BAघB ङ= BAघB ङ, and is called the set of basic attributes of B ; and
i
i
S
{ Attrib घB ङ=fB g[ AttribघB ङ, and is called the set of attributes of B .
i
i
Each attribute in Attrib घB ङ fB g is called a proper attribute of B . Each B
i
which is a form is called a subform of B . Recursively, each subform of a subform
B is also called a subform of B . If each B is a basic attribute, then B is said
i i
to b e a at form.IfB has at least one subform and B and each subform of B is
0 0 0 0
is a form, then B is said i, where at most one B ;:::;B of the shap e B = hB
s 1
i
to b e a branch.
To illustrate, in Figure 1, ENO, DNO, NAME, PHONE, JC, KNAME, AGE,
SNAME, YEARIN, YEAROUT, SEX and LOC are basic attributes.
घENROLLङ=hYEARIN ; YEAROUTi;
घSCHOOLङ = hSNAME ; ENROLLi;
घKIDS ङ = hKNAME; AGEi; and
घPERSONङ = hENO;:::;JC; घKIDSङ; घSCHOOLङ; SEX; LOCi
are forms.
घPERSONङ is the parent of ENO, DNO, ..., and LOC. Each attribute except
घPERSONङ is a prop er attribute of घPERSONङ.
Note that each prop er attribute C of B has exactly one parent, denoted
Par घC ङ, or Par घC ङ when B is understo o d. Also, if D is a prop er attribute of
B
C and C is an attribute of B , then Par घD ङ=Par घD ङ.
C B
0 0
C is an ancestor of C if either C = Par घC ङ, or C = Par घD ङ and D is
0 0
an ancestor of C घalternatively, C = Par घC ङ, or C is an ancestor of D and
0 0 0
D = Par घC ङङ. C is said to b e a descendent of C if C is an ancestor of C . C
0 0
and C are said to b e siblings if Par घC ङ=Par घC ङ.
If C is an ancestor of B , then C is a form.
252 Ginsburg S., Shu N.C., Simovici D.A.: Automatic Data Restructuring
Let C be an attribute in Attrib घF ङ. If D ;:::;D are the immediate de-
1 n
scendents of C in F , then Free घC ङ = fD j1 ऊ i ऊ n and D in BAघF ङg. If
F i i
C in BAघF ङ, then Free घC ङ=fC g. The subscript F is omitted when F is clear
F
from the context. Note that Free घC ङ 6= ; for each C in AttribघF ङ. Thus
F
FreeघघPERSONङङ = fENO; DNO; NAME; PHONE; JC; SEX; LOCg;
FreeघघSCHOOLङङ = fSNAMEg and
FreeघघENROLLङङ = fYEARIN; YEAROUTg:
0 0 0
Let B; B be two attributes. B equals B घdenoted by B = B ङif
0
{ B and B are the same basic attribute, or
0 0 0 0
{ B = hB ;:::;B i, B = hB ;:::;B i and B = B for 1 ऊ i ऊ r .
1 r i
1 r
i
This means that two forms that have the same direct descendents are considered
identical.
Referring to the level number discussed earlier, let F = hF ;:::;F i be a
1 r
form. The level of each attribute F with resp ect to F is de ned to b e 1. Con-
i
0
tinuing by induction, supp ose F is an attribute in F of level l with resp ect to
0 0 0 0
;:::;F i the level of each F with resp ect to F is de ned to b e F .For F = hF
s
1
j
l + 1. Each attribute of level 1 is said to b e at root level. The level of an attribute
A in Attrib घF ङ is denoted by level घAङ, or by levelघAङ when F is understo o d.
F
Wenow turn to the instance of a form. Using recursion, an instance of a form
F = hF ;:::;F i is a nite set I of functions
1 r
f : FreeघF ङ [fF jF a form g !
j j
S
fDomघAङjA in FreeघF ङg[fInsघF ङjF a formg;
j j
such that f घAङ is in DomघAङ for each A in FreeघF ङ and f घF ङisinInsघF ङ for
j j
each form F . Let InsघF ङ b e the set of all instances of F .
j
Figure 1 illustrates an instance I of the घPERSONङ form. The set I consists
of nine functions f ;:::;f , representing the information in the compartments 1
1 9
through 9 under the double line. f is the function over
1
fENO, DNO, NAME, PHONE, JC, घKIDSङ, घSCHOOLङ, SEX, LOCg
de ned by
f घENOङ = 05; f घDNOङ = D 1; f घNAME ङ = SM I T H;
1 1 1
f घPHONEङ = 5555; f घJCङ = 05; f घSEX ङ = F;
1 1 1
f घLOCङ = SF; f घघKIDSङ ङ = I f घघSCHOOLङ ङ = I :
1 1 11 1 12
Here, I is an instance of घKIDSङ that consists of the functions f ;f de ned
11 111 112
by
f घKNAMEङ = JOHN; f घAGEङ = 02;
111 111
f घKNAMEङ = MARY; f घAGEङ = 04:
112 112
I is the instance of घSCHOOLङ consisting of the function f , where
12 121
f घSNAME ङ = PRINCETON;f घघENROLLङङ = I :
121 121 1211
Ginsburg S., Shu N.C., Simovici D.A.: Automatic Data Restructuring 253
I is the instance of the form घENROLLङ consisting of the functions f
1211 12111
and f over fYEARIN,YEAROUTg, where
12112
f घYEARIN ङ = 1966;f घYEAROUTङ = 1970;
12111 12111
f घYEARIN ङ = 1972;f घYEAROUTङ = 1976:
12112 12112
The remaining functions f ;:::;f are de ned similarly.
2 9
S
For each form F , let Ins घF ङ= fI jI in InsघF ङg, that is, Ins घF ङ is the
1 1
set of all functions which are in at least one instance of F .Aninstantiated form
is a pair घI ;Fङ, where F is a form and I is an instance of F .
A set of instantiated forms B = fघI ;FङjF in Fg will be referred to as an
instantation of F .
4 The Restructuring Problem
2
Wenow turn to a precise statement of the basic problem . The actual formal-
ization is rather complicated and requires a numb er of concepts. A central role
will b e played by the notions of \direct solution" and of \solution" intro duced
in Section 7.
We start by recalling some notions from relational database theory. Let X be
a nite nonempty set of basic attributes. A tuple घover X ङ is a function f from
S
X into DomघAङ such that f घAङ is in DomघAङ for all A in X .A relation
A in X
is a pair घI; Xङ, or I when X is understo o d, where X is a nonempty set of basic
attributes and I is a nite set of tuples over X . If घI ; hB ;:::;B iङ is a at
1 n
instantiated form, then B ;:::;B are basic attributes and घI ; fB ;:::;B gङis
1 n 1 n
a relation.
For each nonempty subset Y of X , the projection over Y is the function ँ
Y
from the set of tuples over X to the set of tuples over Y de ned by ँ घf ङ=g ,
Y
where f is a tuple over X and g is the tuple over Y for which g घAङ=f घAङ for
all A in Y .
Our interest in tuples over a set of basic attributes stems from the following
concept. By recursion, for each form F = hF :::;F i and each f 2 Ins घF ङ, let
1 r 1
K घf ङ, called the kernel of f , b e the set of all tuples g over BAघF ङ such that
घiङ g घAङ=f घAङ for each A in FreeघF ङ and
घiiङ g over BAघF ङ, F a form, is a tuple in K घf घF ङङ.
j j j
S
For each I in InsघF ङ, the set fK घf ङjf in Ig denoted by K घI ;Fङ, is the kernel
of घI ;Fङ.
Obviously, each kernel is a nite set of tuples. If F is a at form, then
K घI ;Fङ=I . When F is clear from the context we shall write K घI ङ instead of
K घI ;Fङ.
Given a form F and a nite set R of tuples over BA घF ङ there may b e more
than one instantiated form घI ;Fङ such that K घI ;Fङ=R .For example, consider
the form F = hA; F ;F i, where F = hB i and F = hC i, and the set R of tuples
1 2 1 2
over BAघF ङis
2
The authors are indebted to Dr. Marc Gyssens for helping to clarify the original
incorrect formulation of the basic problem.
254 Ginsburg S., Shu N.C., Simovici D.A.: Automatic Data Restructuring
A B C
a b c
1
a b c
2
a b c
1 1
a b c
2
Both instantiated forms घI ;Fङ and घI ;Fङ given b elowhave their kernels equal
1 2
to R .
F F
A F F A F F
1 2 1 2
B C B C
घI ;Fङ= घI ;Fङ=
a b c a b c
2 1
1 1
b c
2 1
a b c a b c
1 1 2
a b c a b c
2 2
The set of paths of F , denoted by Paths घF ङ, is de ned inductively on the
number k of subforms of F . Sp eci cally, if k = 0, then F is a at form and
Paths घF ङ=fF g. Continuing by induction, let F b e the form F = hF ;:::;F i.
1 r
If F ;:::;F are the subforms of F at the ro ot level, then each of them has
i i
1 m
fewer than k subforms. This allows us to de ne Paths घF ङas
Paths घF ङ=ffH g jH = F and
j 0ऊj ऊ` 0
a form in F for 1 ऊ p ऊ mg: ङ;F fH g in Paths घF
i j 1ऊj ऊ` i
p p
Each path ए in Paths घF ङ generates a घuniqueङ branch F obtained by removing
ए
from F all form attributes, together with their descendants, that do not o ccur
in ए .We shall use the notation BAघए ङ for BAघF ङ.
ए
5 Basic Typ es of Restructuring Op erators
As mentioned in Section 3, the data structures of interest to us are basically hi-
erarchical structures of arbitrary depth or width. Relational tables are treated as
hierarchies of one level, and networks are assumed to have b een decomp osed into
families of trees. In this section, restructuring op erations sp eci cally designed to
work on hierarchical structures are discussed.
In transforming a hierarchical structure to a di erent one, the Automatic Re-
structurer makes use of four basic typ es of tree op erations: trimming, attening,
stretching, and grafting. When applied to a particular situation, eachof these
four typ es of op erations causes a restructuring to take place - an op eration that
pro duces one output le from either one घin the case of trimming, attening and
stretchingङ or two घin the case of graftingङ input les.
These four typ es of op erations were called SELECT, SLICE, CONSOL-
IDATE and GRAFT in the CONVERT system [32]. घThe names have b een
changed here for mnemonic purp oses.ङ Exp erience has shown that the data ex-
traction, manipulation and restructuring needs of the real world can b e handled
byacombination of these four typ es of op erations. Similar op erations were later
prop osed in [20,42,13]. Query languages designed for \nested tables" or \non
Ginsburg S., Shu N.C., Simovici D.A.: Automatic Data Restructuring 255
rst normal form relational databases" generally have capabilities to p erform
functions similar to the basic op erations discussed here. घSee [2, 3,10,14,26,28]
for example.ङ
The following examines the functions of the four basic typ es of op erations.
5.1 Trimming
A trimming op eration removes unwanted comp onents from a form or a subform.
Data is extracted from the input and placed in the output without anychange in
the hierarchical relationships. An example of trimming is presented in Figures 6
and 7. Figure 6 shows the structural transformation, and 7 the instances involved.
Wenow consider \trimming" in a formal manner.
A form G = hG :::G i is a trimming of the form F = hF ;:::;F i if G 6= F
1 s 1 r
and there is a one-to-one partial function h, called a trimming operation, from
Attrib घF ङonto AttribघGङ with the following prop erties: