GEOMETRIC AND TOPOLOGICAL METHODS IN PROTEIN STRUCTURE ANALYSIS

by

Yusu Wang

Department of Computer Science Duke University

Date: Approved:

Prof. Pankaj K. Agarwal, Supervisor

Prof. Herbert Edelsbrunner, Co-advisor

Prof. John Harer

Prof. Johannes Rudolph

Dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Department of Computer Science in the Graduate School of Duke University

2004

ABSTRACT

GEOMETRIC AND TOPOLOGICAL METHODS IN PROTEIN STRUCTURE ANALYSIS

by

Yusu Wang

Department of Computer Science Duke University

Date: Approved:

Prof. Pankaj K. Agarwal, Supervisor

Prof. Herbert Edelsbrunner, Co-advisor

Prof. John Harer

Prof. Johannes Rudolph

An abstract of a dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Department of Computer Science in the Graduate School of Duke University

2004

Abstract

Biology provides some of the most important and complex scientific challenges of our time. With the recent success of the Human Genome Project, one of the main challenges in molecular biology is the determination and exploitation of the three-dimensional structure of proteins and their function. The ability of proteins to perform their numerous functions is made possible by the diversity of their three-dimensional structures, which are capable of highly specific molecular recognition. Hence, to attack the key problems involved, such as protein folding and docking, geometry and topology become important tools. Despite their essential roles, geometric and topological methods are relatively uncommon in computational biology, partly due to a number of modeling and algorithmic challenges. This thesis describes efficient computational methods for characterizing and comparing molecular structures by combining both geometric and topological approaches. Although most of the work described here focuses on biological applications, the techniques developed can be applied to other fields, including computer graphics, vision, databases, and robotics.

Geometrically, the shape of a molecule can be modeled as (i) a set of weighted points, representing the centers of atoms and their van der Waals radii; (ii) a polygonal curve, corresponding to a protein backbone or a DNA strand; or (iii) a polygonal mesh corresponding to a molecular surface. Each such representation emphasizes different aspects of molecular structures at various scales, and the choice among them depends on the underlying application. Characterizing molecular shapes represented in these various ways is an important step toward better understanding or manipulating molecular structures. In the first part of the thesis, we study three geometric descriptions: the writhing number of DNA strands, the level-of-detail representation of protein backbones via simplification, and the elevation of molecular surfaces.

The writhing number of a curve measures how many times a curve coils around itself

in space. It describes the so-called supercoiling phenomenon of double-stranded DNA, which influences DNA replication, recombination, and transcription. It is also used to characterize protein backbones. This thesis proposes the first subquadratic algorithm for computing the writhing number of a polygonal curve. It also presents an algorithm that is easy to implement and runs in near-linear time on inputs that are typical in practice, including DNA strands, which is significantly faster than the quadratic time needed by the methods used in current DNA simulation software.
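To make the underlying notion concrete: a directional writhing number is obtained by projecting the closed polygonal curve along a direction (here the z-axis) and summing the signs of the crossings in the projection. The sketch below is only a minimal quadratic baseline in Python, not the subquadratic algorithm of the thesis; the sign convention and the handling of degeneracies are simplifying assumptions.

```python
import numpy as np

def _crossing_params(p0, p1, q0, q1):
    """Parameters (s, t) where open 2D segments p0p1 and q0q1 cross, or None."""
    d1, d2 = p1 - p0, q1 - q0
    denom = d1[0] * d2[1] - d1[1] * d2[0]
    if abs(denom) < 1e-12:
        return None  # parallel in the projection
    r = q0 - p0
    s = (r[0] * d2[1] - r[1] * d2[0]) / denom  # parameter on p0p1
    t = (r[0] * d1[1] - r[1] * d1[0]) / denom  # parameter on q0q1
    return (s, t) if 0 < s < 1 and 0 < t < 1 else None

def directional_writhe(points):
    """Signed crossing count of a closed polygonal curve projected along +z."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    wr = 0
    for i in range(n):
        a0, a1 = pts[i], pts[(i + 1) % n]
        for j in range(i + 1, n):
            # skip edges sharing a vertex with edge i
            if (i + 1) % n == j or (j + 1) % n == i:
                continue
            b0, b1 = pts[j], pts[(j + 1) % n]
            hit = _crossing_params(a0[:2], a1[:2], b0[:2], b1[:2])
            if hit is None:
                continue
            s, t = hit
            za = a0[2] + s * (a1[2] - a0[2])  # height of strand a at the crossing
            zb = b0[2] + t * (b1[2] - b0[2])
            over, under = ((a1 - a0), (b1 - b0)) if za > zb else ((b1 - b0), (a1 - a0))
            # sign of the crossing: orientation of (over, under) in the projection plane
            wr += 1 if over[0] * under[1] - over[1] * under[0] > 0 else -1
    return wr
```

A planar simple polygon has no crossings and hence directional writhe zero, while a curve whose projection crosses itself once contributes a single signed crossing.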

The level-of-detail (LOD) representation of a protein backbone helps to extract its main features. We compute LOD representations via curve simplification under the so-called Fréchet error measure. This measure is more desirable than the widely used Hausdorff error measure in many situations, especially if one wants to preserve global features of a curve (e.g., the secondary structure elements of a protein backbone) during simplification. In this thesis, we present a simple approximation algorithm to simplify curves under the Fréchet error measure, which is the first simplification algorithm with guaranteed quality that runs in near-linear time in dimensions higher than two.
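The flavor of such simplification can be sketched with a short greedy routine. The sketch below is not the thesis algorithm: instead of the exact Fréchet distance between a subchain and its shortcut segment, it uses an easy upper bound obtained from one particular monotone matching (each vertex is matched to the point at the same arc-length fraction on the segment), so it is a conservative illustration only.

```python
import numpy as np

def shortcut_error(P, i, j):
    """Upper bound on the Fréchet distance between subchain P[i..j] and the
    segment P[i]P[j]: match each vertex to the point at the same arc-length
    fraction on the segment (one valid monotone matching)."""
    sub = P[i:j + 1]
    steps = np.linalg.norm(np.diff(sub, axis=0), axis=1)
    total = steps.sum()
    if total == 0:
        return 0.0
    t = np.concatenate([[0.0], np.cumsum(steps)]) / total  # fractions in [0, 1]
    matched = P[i] + t[:, None] * (P[j] - P[i])            # matched points on the segment
    return float(np.linalg.norm(sub - matched, axis=1).max())

def greedy_simplify(P, eps):
    """Keep a vertex, then extend the shortcut as far as the error allows."""
    P = np.asarray(P, dtype=float)
    keep = [0]
    i = 0
    while i < len(P) - 1:
        j = i + 1
        while j + 1 < len(P) and shortcut_error(P, i, j + 1) <= eps:
            j += 1
        keep.append(j)
        i = j
    return P[keep]
```

On a nearly straight chain with small wiggles, the routine collapses the curve to its two endpoints once the tolerance exceeds the wiggle amplitude.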

We propose a continuous elevation function on the surface of a molecule to capture its geometric features such as protrusions and cavities. To define the function, we follow the example of elevation as defined on Earth, but we go beyond this simpler concept to accommodate general 2-manifolds. Our function is invariant under rigid motions. It scales with the surface and provides, beyond the location, also the direction and size of shape features. We present an algorithm for computing the points with locally maximum elevation. These points correspond to the locally most significant features. This succinct representation of features can be applied to aligning shapes, and we present one such application in the second part of the thesis.

The second part of the thesis focuses on molecular shape matching algorithms. The importance of shape matching, both similarity matching and complementarity matching,

arises from the general belief that the structure of a protein decides its function. Efficient algorithms to measure the similarity between shapes help identify new types of protein architecture, discover evolutionary relations, and provide biologists with computational tools to organize the fast-growing set of known protein structures. By modeling a molecule as a union of balls, we study the similarity between two such unions via (variants of) the widely used Hausdorff distance, and propose algorithms to find (approximately) the best translation under the Hausdorff distance measure.
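As a point of reference, the symmetric Hausdorff distance between two finite point sets, together with a naive search for the best translation over a list of candidates, can be written in a few lines. This is only a brute-force baseline for intuition; the thesis develops far more efficient exact and approximation algorithms, and handles unions of balls rather than bare centers.

```python
import numpy as np

def hausdorff(A, B):
    """Symmetric Hausdorff distance between two finite point sets A and B."""
    A, B = np.asarray(A, dtype=float), np.asarray(B, dtype=float)
    D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)  # all pairwise distances
    return max(D.min(axis=1).max(),   # farthest point of A from its nearest neighbor in B
               D.min(axis=0).max())   # farthest point of B from its nearest neighbor in A

def best_translation(A, B, candidates):
    """Translation of B (from a finite candidate list) minimizing the
    Hausdorff distance to A; brute force over the candidates."""
    B = np.asarray(B, dtype=float)
    t = min(candidates, key=lambda c: hausdorff(A, B + np.asarray(c, dtype=float)))
    t = np.asarray(t, dtype=float)
    return t, hausdorff(A, B + t)
```

For two congruent point sets that differ only by a shift, the search recovers the shift exactly when it appears among the candidates.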

Complementarity matching is crucial to understand or simulate protein docking, the process in which two or more protein molecules bind to form a compound structure. From a geometric perspective, protein docking can be considered as the problem of searching for configurations with maximum complementarity between two molecular surfaces. Using the feature information generated by the elevation function, we describe an efficient algorithm to find promising initial relative placements of the proteins. The outputs can later be refined independently to locate docking positions, using a heuristic that improves the fit locally, based on geometric but possibly also chemical and biological information.

And indeed there will be time
To wonder, "Do I dare?" and "Do I dare?"
Time to turn back and descend the stair,
With a bald spot in the middle of my hair —

Do I dare
Disturb the universe?

— T. S. Eliot, The Love Song of J. Alfred Prufrock

Acknowledgements

I came to take of your wisdom:
And behold I have found that which is greater than wisdom.
— Kahlil Gibran, The Prophet

It is not without regret that I am writing this acknowledgment — while being grateful for all those who made my life in the past few years a joyful and fruitful one, I know sadly that our lives will part soon. The path towards obtaining a PhD was a struggle for me in many ways. I can’t imagine how it would have been without their support.

It has been a great opportunity to have worked under the supervision of Profs. Pankaj K. Agarwal and Herbert Edelsbrunner. The experience helped shape my attitude and approaches towards research. Pankaj led me into the world of computational geometry with his broad knowledge. Besides support and guidance, he gave me great freedom in doing research, and is always patient and understanding. It is hard to overestimate how much I have benefited from the numerous discussions with him. Herbert showed me the "friendly" side of computational topology, with his deep insights accompanied by illustrative explanations. His philosophy and vision in research have greatly influenced me. I am deeply indebted to both of them for their guidance and inspiration throughout the course of this dissertation. I would also like to thank Profs. John Harer and Johannes Rudolph, not only for serving on my thesis committee, but also for various discussions and collaborations. Support for this work was provided by NSF under the grant NSF-CCR-00-86013 (the BioGeometry project).

The Duke CS department is a wonderful place. In particular, I wish to thank Drs. Lars Arge and Ron Parr, who are always open and ready to help with my career concerns. Dr.

Sariel Har-Peled, now a post-postdoc (well, an assistant professor) at UIUC, has been a tremendous mentor and friend to me, especially during a period when I was swinging

vii among various career choices, and at a time when I was learning to walk on the ropes of research. I learned from him to have an approximate perspective towards problems both in computer science and in life.

I would like to thank all the graduate students and postdocs in the theory group who provided a vibrant research environment that I have enjoyed and benefited from so much, especially Nabil Mustafa, Hai Yu, Peng Yin, and Vijay Natarajan. I had a lot of fun both in research and in life with friends such as Vicky Choi, David Cohen-Steiner, Ho-lun Cheng, Ashish Gehani, Sathish Govindarajan, Jingquan Jia, Tingting Jiang, Dmitriy Morozov, Nabil Mustafa, Vijay Natarajan, Jeff Phillips, Nan Tian, Eric Zhang, Hai Yu, and Haifeng Yu. Those inspiring discussions with Nabil, Sathish, and David will always mark my memories of the PhD life. Special thanks to my best friend Peng Yin. His humor, energy, understanding, and advice accompany me to this day. I also want to thank his wife Xia Wu for kindly feeding me uncountable times.

I wish to thank all the staff in the department for being so friendly and helpful, especially Ms. Celeste Hodges and Ms. Diane Riggs.

Last, but not least, I would like to thank my family for their love, support, and confidence in me. My parents and grandparents encouraged me to pursue my own dream from childhood, and have never tried to pressure me into any life that others may consider successful. My sister and brother-in-law have always been there for me, full of understanding and support. What I have achieved was possible only because they were all by my side. This thesis is dedicated to them.

Contents

Abstract iii

Acknowledgements vii

List of Tables xii

List of Figures xiii

1 Introduction 1

1.1 Protein Structure and Geometric Models ...... 2

1.2 Related Research Areas ...... 5

1.3 Shape Analysis in Molecular Biology ...... 9

1.3.1 Describing Shapes ...... 10

1.3.2 Matching Shapes ...... 12

1.4 Main Contributions ...... 14

2 Writhing Number 18

2.1 Introduction ...... 18

2.2 Prior and New Work ...... 20

2.3 Writhing and Winding ...... 23

2.3.1 Closed knots ...... 24

2.3.2 Open knots ...... 29

2.4 Computing Directional Writhing ...... 31

2.5 Experiments ...... 33

2.5.1 Algorithms ...... 34

2.5.2 Comparison ...... 35

2.6 Notes and Discussion ...... 38

3 Backbone Simplification 39

3.1 Introduction ...... 39

3.2 Prior and New Work ...... 41

3.3 Fréchet Simplification ...... 44

3.3.1 Algorithm ...... 45

3.3.2 Comparisons ...... 47

3.4 Experiments ...... 51

3.5 Notes and Discussions ...... 56

4 Elevation Function 58

4.1 Introduction ...... 58

4.2 Defining Elevation ...... 60

4.2.1 Pairing ...... 60

4.2.2 Height and Elevation ...... 65

4.3 Pedal Surface ...... 69

4.4 Capturing Elevation Maxima ...... 72

4.4.1 Continuity ...... 72

4.4.2 Elevation Maxima ...... 76

4.5 Algorithm ...... 83

4.6 Experiments ...... 87

4.7 Notes and Discussion ...... 90

5 Matching via Hausdorff Distance 94

5.1 Introduction ...... 94

5.2 Collision-Free Hausdorff Distance between Sets of Balls ...... 100

5.2.1 Computing the collision-free Hausdorff distance in 2D and 3D ...... 101

5.2.2 Partial matching ...... 104

5.3 Hausdorff Distance between Unions of Balls ...... 105

5.3.1 The exact 2D algorithm ...... 105

5.3.2 Approximation algorithms ...... 109

5.4 RMS and Summed Hausdorff Distance between Points ...... 113

5.4.1 Simultaneous approximation of Voronoi diagrams ...... 113

5.4.2 Approximating the RMS distance ...... 115

5.4.3 Approximating the summed Hausdorff distance ...... 117

5.4.4 Maintaining the 1-median function ...... 119

5.4.5 Randomized algorithm ...... 123

5.5 Notes and Discussion ...... 124

6 Coarse Docking via Features 126

6.1 Introduction ...... 126

6.2 Algorithm ...... 132

6.2.1 Scoring function ...... 132

6.2.2 Computing features...... 133

6.2.3 Coarse alignment algorithm ...... 134

6.3 Experiments ...... 137

6.4 Notes and discussion ...... 144

Bibliography 147

Biography 161

List of Tables

2.1 Comparisons on protein data ...... 38

4.1 Table of singularities ...... 71

4.2 Number of Maxima for 1brs ...... 87

4.3 Number of Max with different resolution ...... 90

4.4 Covering density ...... 90

6.1 Index-k Features ...... 138

6.2 Complex 1brs ...... 139

6.3 25 Test Cases ...... 140

6.4 Two-step for 25 Test Cases ...... 141

6.5 Unbound Benchmark ...... 143

List of Figures

1.1 Protein structure ...... 2

1.2 Protein models ...... 4

1.3 Protein folding ...... 6

1.4 Protein docking ...... 7

2.1 DNA supercoiling ...... 19

2.2 Sign of crossings ...... 20

2.3 Worst case of writhe ...... 23

2.4 Critical directions ...... 24

2.5 Winding number ...... 27

2.6 Spherical triangle ...... 28

2.7 Open knot ...... 30

2.8 Oriented edges ...... 31

2.9 Convergence rate ...... 36

2.10 Running time ...... 36

2.11 Protein backbones ...... 37

3.1 Fréchet matching ...... 45

3.2 Fréchet simplification ...... 46

3.3 Comparison between Fréchet and Hausdorff simplifications ...... 47

3.4 Relation between the Fréchet and Hausdorff error measures ...... 49

3.5 Results of Fréchet simplification ...... 53

3.6 Running time ...... 54

3.7 Comparisons between DP and GreedyFrechetSimp algorithms . . . 55

3.8 Simplification of a protein ...... 57

4.1 Four types of maxima ...... 60

4.2 Extended persistence ...... 62

4.3 Elevation on 1-manifold ...... 67

4.4 Pedal curve ...... 70

4.5 Co-dimensional 2 singularities ...... 71

4.6 Discontinuity in elevation ...... 73

4.7 Stratification ...... 75

4.8 Mercedes star property ...... 77

4.9 Another neighborhood pattern for a triple point ...... 78

4.10 Other neighborhood patterns ...... 78

4.11 Parameterization of Gaussian neighborhood ...... 80

4.12 Height difference for 2-legged maximum ...... 82

4.13 Height difference for 3-legged maximum ...... 82

4.14 Decaying of maxima ...... 88

4.15 Top 100 maxima for 1brs ...... 88

4.16 Elevation on 1brs ...... 92

5.1 Valid and forbidden regions ...... 102

5.2 Voronoi for union of balls ...... 107

5.3 Exponential grid ...... 119

6.1 Predict docking configurations ...... 126

6.2 Max types again...... 133

6.3 Coarse alignment ...... 135

6.4 Align features ...... 135

6.5 Align Pairs ...... 136

Chapter 1

Introduction

If a living cell is viewed as a biochemical factory, then its main workers are protein molecules, acting as catalysts, transporting small molecules, forming cellular structures, and carrying signals, among other roles. As Jacques Monod states in his book Chance and Necessity: "... it is in proteins that lies the secret of life." Their functional diversity is made possible by the diversity of their three-dimensional structures. Understanding or simulating the molecular processes involved in the formation of protein structures and their biological functions is a major challenge of molecular biology. For most of the key problems involved in this challenge, such as protein folding, docking, structure classification, and structure prediction, geometry and topology naturally play important roles. However, geometric methods are currently not fully utilized and investigated when attacking these key problems, partly due to a number of representational and algorithmic challenges.

To close this gap, in this thesis, we study shape analysis problems arising in molecular biology by combining both geometric and topological approaches. In particular, we focus on algorithms for describing and matching protein structures. Note that, in general, shape characterization and matching are central to various application areas other than structural biology, including computer vision, pattern recognition, and robotics [17, 131, 165]. Most of the techniques that we have developed are applicable to these other fields as well.

In the remainder of this chapter, we first give a brief biological background on protein structures and introduce some related research areas. More details can be found in standard textbooks [34, 68, 125]. We then describe shape analysis problems, arising in molecular biology, from the computational side. We state our main contributions at the end of this chapter.

1.1 Protein Structure and Geometric Models

A protein is a polymer consisting of a long chain of small building blocks, called amino acids or residues. All amino acids have a 3-atom backbone, N–Cα–C, to which a side chain (denoted by R1 and R2 in Figure 1.1 (a)) is attached. Besides the side chain, a hydrogen atom is bonded to the backbone nitrogen atom, and an oxygen is doubly bonded to the carboxy carbon. There are 20 standard amino acid residues, distinguishable by their side chains. The amino end (N) of an amino acid connects to the carboxy end (C) of the preceding amino acid, forming a peptide bond. Thus the chemical structure of a protein molecule can be viewed as a linear sequence of amino acids interconnected by peptide bonds.

Figure 1.1: (a) Protein structure, with the backbone structure in the dotted boxes. (b) The folded state of a protein; each atom is modeled as a ball.

Though a linear sequence, a protein molecule folds into a compact and typically unique three-dimensional structure under certain physiological conditions (see Figure 1.1 (b)). This is the result of various atomic interactions, such as van der Waals and electrostatic forces. The resulting structure, referred to as the native structure or the folded state, is how a protein molecule exists in nature, and is the conformation in which a protein is able to perform its physiological functions. In fact, given a protein molecule, its three-dimensional structure decides its functionality to a large extent [68, 125]. (For example, disruption of the native structures of proteins is the primary cause of several neurodegenerative diseases, such as Alzheimer's disease and Parkinson's disease.) Therefore, knowledge of protein structures is essential for understanding the principles that govern their functions in nature. This three-dimensional structure is the focus of our study. In general, protein structures are examined at different levels, referred to as the protein structure architecture [34]. We present a brief description below.

primary structure: the amino acid composition of the protein, i.e., the linear sequence of amino acids;

secondary structure: common patterns in the conformation of protein backbones observed in nature. There are four major types of secondary structure elements (SSEs): α-helix, β-sheet, β-turn, and coils;

supersecondary structure: the higher-scale structures organized by secondary structure elements, e.g., how SSEs are connected;

tertiary structure: the global folded three-dimensional structure of the protein;

quaternary structure: the structure of a complex of two or more proteins bound together.

How to model proteins appropriately is a crucial first step before we can visualize or manipulate them. Several models have been proposed, depending on the objectives of the underlying applications and/or what information one would like to emphasize. In the literature, the term modeling refers to a broad collection of methods that describe not only the geometric structures, but also the energetic aspects of a molecule [84, 125]. (For example, by using quantum mechanics, one can describe in detail the energy of any arrangement of atoms and molecules in a particular system.) In this thesis, we focus on geometric models of the three-dimensional structure of proteins [64].

Geometric shapes typically refer to a finite set of points, a space curve, or a surface. In the context of molecular biology, a set of points corresponds to the set of centers of atoms, possibly weighted by the van der Waals radii of the atoms. Sometimes, points are connected by "sticks" that represent covalent bonds between atoms (Figure 1.2 (a)). Such representations specify not only the positions of each atom in a molecule, but also the chemical information.


Figure 1.2: A protein molecule represented as (a) the set of atom centers connected by sticks to represent covalent bonds; (b) a space curve (the main chain representation); and (c) the (van der Waals) surface of the union of atoms, each represented by a ball.

Sometimes, the details presented in the above representation are not necessary, or even undesirable. The main chain representation is often exploited in such situations, where a

protein molecule is modeled as a curve in ℝ³ following the trace of the backbone atoms of the amino acids (see Figure 1.2 (b)). Such a representation emphasizes the linear nature of protein molecules, and shows clearly how this linear sequence of amino acids folds in space. It provides a much simplified representation of a protein, while still maintaining its main structural features. Consequently, this representation is popular in many applications, especially those of high computational complexity, such as protein folding and structure classification.

A surface representation of proteins is useful when the object of study is the space occu- pied by a protein molecule, or when the global structure of the molecule is more important than its local geometry. There are many ways to represent the surface of a molecule. If

we model each atom by a ball in ℝ³ with its van der Waals radius, then the surface of the union of these balls is referred to as the van der Waals surface (Figure 1.2 (c)). The solvent

accessible surface, originally proposed by Lee and Richards [126], is the surface traced out by the center of a probe sphere (which typically represents a water molecule) rolling on top of the VDW surface. The surface traced out by the inward-facing surface of this probe sphere is called the molecular surface. The skin surface developed by Cheng et al. [55] is more complicated, but has many elegant (mathematical) properties.
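The solvent accessible surface also admits a simple numerical approximation in the spirit of the classical Shrake–Rupley procedure (not one of the exact surface constructions above): inflate each atom's van der Waals radius by the probe radius, sample points on the inflated sphere, and keep only the samples not buried inside a neighboring inflated sphere. The sketch below is an illustration only; the sample count, the default probe radius (1.4 Å, a common value for water), and the brute-force neighbor loop are all simplifying assumptions.

```python
import numpy as np

def sas_area(centers, radii, probe=1.4, n_samples=512, seed=0):
    """Approximate per-atom solvent accessible surface area (Shrake-Rupley style)."""
    centers = np.asarray(centers, dtype=float)
    radii = np.asarray(radii, dtype=float) + probe   # inflate by the probe radius
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(n_samples, 3))           # uniform directions on the sphere
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    areas = []
    for i, (c, r) in enumerate(zip(centers, radii)):
        pts = c + r * dirs                           # sample points on atom i's sphere
        exposed = np.ones(n_samples, dtype=bool)
        for j, (cj, rj) in enumerate(zip(centers, radii)):
            if j != i:                               # mark samples buried in neighbor j
                exposed &= np.linalg.norm(pts - cj, axis=1) >= rj
        areas.append(4.0 * np.pi * r * r * exposed.mean())
    return np.array(areas)
```

An isolated atom gets the full area of its inflated sphere, while overlapping atoms lose the buried portions of their spheres.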

1.2 Related Research Areas

Despite the important role that protein structures play in understanding life, early research in computational biology, or bioinformatics, focused on sequence analysis rather than on protein structures. One of the key reasons is that it is significantly easier to both acquire and manage sequence data. (For example, the Human Genome Project has made massive amounts of protein sequence data available, while the output of experimentally determined protein structures, typically obtained by time-consuming and relatively expensive X-ray crystallography and NMR spectroscopy, lags far behind.) Nevertheless, with the tremendous success in sequence analysis, the study of protein structures has become increasingly critical. For example, in the post-genomic era, a major obstacle to the exploitation of the large volume of genome sequence data is the functional characterization of the gene products (protein structures). Since the three-dimensional structures of proteins are more conserved than the corresponding sequences, many large-scale protein structure determination projects have been initiated recently [169] to help analyze the functionally unannotated protein sequence space. These initiatives are widely referred to as structural genomics or structural proteomics. In this subsection, we briefly mention several (not necessarily disjoint) research areas related to protein structures; it is impossible to survey them comprehensively in this thesis, as each would require a whole book.

Protein folding and protein structure prediction. Predicting a protein's structure from its amino acid sequence is one of the most significant tasks tackled in computational biology (Figure 1.3). Solving this problem will have enormous impact on rational drug design, cell modeling, and genetic engineering. It is therefore not surprising that it is considered the "holy grail" in the structural biology community; see, e.g., [123].

Figure 1.3: From left to right, snapshots of staphylococcal protein A, B domain (in mainchain representation), at different stages during the folding process (http://parasol.tamu.edu/dsmft/research/folding).

Two issues are involved here: (i) to understand the mechanism behind the protein folding process, i.e., how a protein folds in nature; and (ii) to predict the final folded conformation given a sequence of amino acids. These two aspects are obviously related, but not necessarily equivalent: several successful approaches to predicting protein structures do not mimic the folding process, but rely on knowledge of known protein structures.

There has been a long history of tackling the folding problem. Current approaches can be classified into two categories, which we mention without going into any detail here; for more information, refer to [26, 123, 156]. The first class, including comparative modeling and threading, starts from a known template structure or known folds. The second class, de novo or ab initio methods, predicts structure from sequence directly, using principles of atomic interactions and protein architecture. Despite the success of these methods in some cases, especially in predicting structures of small protein molecules, the protein folding problem remains largely unsolved. The main reasons include that the structure is defined by a large number of degrees of freedom (as highlighted by the Levinthal paradox, which results from the observation that proteins fold into their specific three-dimensional conformation in a timespan of milliseconds, significantly shorter than would be expected if the molecule actually searched the entire conformation space for the lowest energy state), and that the physical basis of protein structural stability is not fully understood. The vigor of this field can be seen from the high participation in, and great performance output of, the Critical Assessment of Structure Prediction (CASP) experiments (http://predictioncenter.llnl.gov).

Protein interactions. Two or more molecules interact with each other by forming an intermolecular complex (either stably or temporarily); this process is called docking, or receptor-ligand recognition and binding. Such interactions are critical to various biological processes, such as cell-cell recognition and enzyme catalysis and inhibition. The target macromolecules (receptors) are usually large, mostly proteins, while the ligands can be either large, such as proteins, or small, such as drugs or cofactors (see Figure 1.4). The sites where binding happens are called active sites or binding sites. They are usually places on the surfaces of proteins where chemical reactions or conformational changes happen. Hence knowledge of the interaction between molecules is crucial in understanding, and even manipulating, their functions. As an example, many drug molecules work by acting as inhibitors: they bind to the receptor proteins to block the active sites, thus stopping undesired chemical reactions or molecular processes from happening. As a result, efficient algorithms for docking drug molecules to target receptor proteins are one of the major ingredients in a rational drug design scheme [85, 102, 164].

Figure 1.4: Examples of (a) protein-small molecule docking: mainchain representation of HIV-1 protease bound to an inhibitor (in VDW representation); and (b) protein-protein docking: human growth hormone.

The docking problem has attracted great attention from computer scientists as well

as biochemists, due to its strong geometric and algorithmic flavor [102, 81, 114]. Much success has been achieved for docking a protein with a small molecule, or docking two rigid proteins (i.e., each protein can only undergo rigid transformations). However, the field remains rather open, especially in the case of protein-protein docking without the

“rigid” assumption [97]. In this case, as protein structures are complicated, modeling their conformational changes introduces many degrees of freedom. Nevertheless, progress is being made, and interested readers should refer to the results of the CAPRI experiments, i.e., the Critical Assessment of PRediction of Interaction, for the newest advances in this

field (http://capri.ebi.ac.uk).

Protein structure comparison and classification. As a protein molecule with some functional role evolves in the context of a living cell, its overall three-dimensional structure tends to remain unaltered, even when all sequence memory may have been lost [80]. This evolutionary resilience of protein three-dimensional structures is the fundamental reason for comparing protein structures in molecular biology. Numerous comparison methods have been proposed and developed in the past 20 years [122, 160]. The problem, however, is difficult and remains unsolved. There is typically no clear definition of structural similarity. Structural similarity is of interest at many levels: from the fine detail of backbone and side-chain conformation at the residue level to the coarse similarity at the tertiary structure level. Besides, many situations require one to capture local similarity, which is hard to describe.

Moreover, more and more protein structure data are becoming available: at the time of this writing, there are tens of thousands of protein structures in the Protein Data Bank [31], and the number almost doubles every 18 months. It is therefore crucial to bring a certain order into protein structures by classifying them into families. Other than organizing the large structure database, such classification can aid our understanding of the relationships between structures and functions. For example, it has been shown that almost all enzymes have the so-called α/β folds [132], i.e., they have both α-helices and β-sheets in their structures. Furthermore, while each sequence typically generates a unique three-dimensional structure, multiple sequences may produce similar folded structures, or folds. A natural question is then: how many different folds are there in nature? Classification helps to answer this question, and its solution is useful in annotating the sequence space by structures, and thus functions, which is a central aspect of the structural genomics we mentioned earlier [26]. Classification also enables us to experimentally determine far fewer protein structures: we can now afford to determine only those that possibly produce novel folds [160].

Currently, the three most popular classifications are SCOP [138], CATH [139, 141], and

FSSP [104], all of which are accessible via the world wide web. Similar to structure comparison, one main difficulty of the classification problem arises from the fact that there is no consensus on defining the organization of different categories. Thus how to classify protein structures in a fully automatic way is still a daunting problem.

1.3 Shape Analysis in Molecular Biology

Above we have sketched some key research areas closely related to protein structures. Two issues appear repeatedly there — how to describe and characterize structures, and how to develop efficient computational methods (algorithms). These two issues are obviously not new to many research fields in computer science. In this section, we address these two issues by identifying shape-analysis problems in molecular biology and describing them from a computational perspective. Shape analysis problems have been studied extensively in many fields, including computer graphics, vision, geometric computing, and robotics. On the one hand, many of the techniques there can be adapted to attack problems in molecular biology directly. On the other hand, protein structure analysis has many unique

properties, and new techniques are greatly needed.6

At a high level, we classify shape-analysis problems into two broad categories, each including many subtopics. Though our classification below is tailored towards molecular-biological applications, one should note that the techniques developed are not necessarily constrained to biological applications. Once again, we will only sample a few techniques exploited in attacking these problems, as a full enumeration would go beyond the scope of this thesis. For surveys on a subset of topics in shape analysis in general, refer to [17, 131, 165].

1.3.1 Describing Shapes

Modeling flexibility. We have introduced some basic geometric representations for three-dimensional structures of proteins in Section 1.1. In some applications, more sophisticated representations are required: protein molecules are in constant motion (vibration) in solution, and they may undergo significant conformational changes at times (such as during protein-protein docking). Therefore, it is important to incorporate flexibility when modeling protein structures. The question, of course, is how to model flexibility, which is typically complex, as protein structures have many degrees of freedom. On the one hand, special data structures are desirable to efficiently support changes in conformation: for example, in [6], a chain hierarchy has been proposed that can efficiently detect collisions for deforming protein backbones. On the other hand, it is important to characterize motions: what types of motions do molecules undergo, and where do they happen? Techniques from robotics, motion planning, and graph theory have been exploited successfully in several cases, either for identifying possible motions [112] or for reducing the degrees of freedom of motions [85, 161].

Simplified representations. In some applications, simplified structures are needed to

We remark here that shape-analysis problems are extremely hard for protein structures because the connection between protein structures and their functions is not yet well understood. In many situations, it is not clear what aspects of a structure give rise to a particular functionality.

help to manage complex problems. Hence many approaches use simplified protein structures, such as representing the backbone of a protein molecule as a set of fragments, each corresponding to a secondary structure element [160, 155]. As another example, one model proposed by Dill [49, 74] simplifies the protein backbone as beads chained together on a unit lattice. The beads can be either hydrophobic or hydrophilic, with contacts between hydrophobic beads being favored. Although fairly simplistic, this model yields results surprisingly similar to those derived from experimental data when applied to the protein folding problem [73, 162].
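The contact energy in this lattice model is simple enough to sketch concretely. The following Python fragment is our own illustration of the 2D square-lattice variant; the function name and conventions are ours, not from [49, 74]:

```python
# Energy of a 2D HP-lattice conformation: the negative count of H-H
# contacts between residues that are lattice neighbors but not adjacent
# along the chain. Illustrative sketch; names and conventions are ours.

def hp_energy(sequence, path):
    """sequence: string over {'H','P'}; path: self-avoiding walk as (x, y) points."""
    assert len(sequence) == len(path)
    assert len(set(path)) == len(path), "walk must be self-avoiding"
    index = {p: i for i, p in enumerate(path)}
    contacts = 0
    for i, (x, y) in enumerate(path):
        for neighbor in ((x + 1, y), (x, y + 1)):   # count each pair once
            j = index.get(neighbor)
            if j is not None and abs(i - j) > 1:    # not bonded along the chain
                if sequence[i] == 'H' and sequence[j] == 'H':
                    contacts += 1
    return -contacts  # hydrophobic contacts are favorable

# A 4-residue chain folded into a unit square has one H-H contact.
print(hp_energy("HPPH", [(0, 0), (1, 0), (1, 1), (0, 1)]))  # -1
```

Folding in this model then amounts to searching over self-avoiding walks for a minimum-energy conformation.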

Shape descriptors/signatures. One aspect of shape characterization is to extract key features or information from a given shape. For example, this is central to many approaches that compare protein structures: the key information is typically stored in a shape descriptor or signature, and similarity between two shapes can then be measured by some distance between the corresponding descriptors. As another example, in order to understand protein-protein interactions, there has been much research on characterizing the interface where two proteins interact with each other [27, 130]. Features used to describe such interfaces include the buried surface area, the tightness of the binding, hydrophobicity, and so on.

Extracting features and generating shape descriptors are widely used in graphics, vision, and robotics. Many techniques borrowed from these fields can be applied to molecular biology: statistical methods (such as histograms and harmonic maps) [118, 22], geometry-based methods (such as turning angles) [59], and topology-based methods (such as the Connolly function) [66, 136] have all been exploited to generate shape descriptors in structural-biology applications.
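As a concrete illustration of a statistical descriptor, the following sketch computes a distance-histogram signature in the spirit of shape distributions; all parameter choices and names are our own, not those of [118, 22]:

```python
# Distance-histogram shape signature: sample random point pairs (here,
# thought of as atom centers), histogram their distances, and compare
# shapes via an L1 distance between the normalized histograms.
# All parameters below are illustrative choices.
import math
import random

def d2_histogram(points, n_pairs=1000, n_bins=16, d_max=5.0, seed=0):
    rng = random.Random(seed)
    hist = [0] * n_bins
    for _ in range(n_pairs):
        p, q = rng.sample(points, 2)
        b = min(int(n_bins * math.dist(p, q) / d_max), n_bins - 1)
        hist[b] += 1
    return [h / n_pairs for h in hist]  # normalize to a distribution

def l1_distance(h1, h2):
    return sum(abs(a - b) for a, b in zip(h1, h2))

cube = [(x, y, z) for x in (0, 1) for y in (0, 1) for z in (0, 1)]
slab = [(4 * x, y, z) for x, y, z in cube]  # a stretched copy
print(l1_distance(d2_histogram(cube), d2_histogram(cube)))       # 0.0
print(l1_distance(d2_histogram(cube), d2_histogram(slab)) > 0)   # True
```

The descriptor is invariant under rigid motions, since it depends only on pairwise distances, which is precisely what makes such signatures convenient for comparison without alignment.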

1.3.2 Matching Shapes

Similarity matching. Measuring similarity between protein structures is essential to protein structure classification, and is needed when applying comparative-modeling methods for predicting protein structures. There are in general two types of approaches to this problem. The first type of method is alignment-based: matching is treated as an optimization problem of finding the best alignment (i.e., relative placement) of the two input structures under some scoring function (where the score evaluates how similar the structures are). Examples of this approach include DALI [103], STRUCTAL [129], and CE [149]. How to define the scoring function, or the distance between two structures, is an intriguing problem in itself, and has received much attention [80, 158].

The second type of approach computes the similarity/distance between input structures directly, without producing any alignment to superimpose them. Many of the methods in this category exploit shape descriptors [22, 145]. Another example is the contact-map overlap approach, which converts each protein structure into a special type of graph and measures similarity as the size of the largest common subgraph between two such graphs [94].

In general, alignment-based methods involve searching a large configuration space, and thus have higher time complexity than the second type of approach. They are also less efficient when querying a structure database. However, they are more reliable and discriminative in measuring similarity. Refer to [150, 151] for a comparison of currently popular matching approaches in structural biology.

Complementarity matching. The main motivation to study complementarity matching in molecular biology is to understand protein-ligand interactions. A simple geometric formulation of the protein-protein docking problem is the following: given proteins A and B, find the transformation of B such that the two best complement each other. In other words, this is partial surface matching under the constraint that the two surfaces do not intersect. The two main issues involved are:7 (i) evaluating the alignments generated, i.e., finding a good scoring function that produces few false positives [97, 153]; and (ii) reducing the complexity of the search procedure, e.g., by exploiting more efficient computation (such as FFTs or spherical harmonics) [53, 143], by better search strategies (such as genetic algorithms) [39, 91], or by reducing the number of transformations inspected (such as by geometric hashing) [86].

Of course, in nature both molecules can change their conformation during the docking process. For large protein-protein interactions, it is complex to model flexibility in the matching procedure, and this is one of the main foci of current research [97, 153].

Classification and structure databases. We mentioned protein structure classification earlier for the purpose of organizing the rapidly expanding collection of available protein structures. There is also a need to organize structures into a database that supports efficient queries (i.e., given a query structure, return one or more structures from the database that contain it, in the case of a motif query, or that are similar to it). The pairwise structure comparison discussed above (under similarity matching) is obviously a fundamental component of classification and query problems, and a straightforward way to classify protein structures is by all-against-all comparison. This is the method adopted by most current classifications of protein structures, such as CATH and FSSP [139, 104]. It is, however, rather inefficient, especially when combined with alignment-based pairwise matching procedures. Part of the reason this straightforward approach is used despite its inefficiency is that most past work has focused on how to classify protein structures in a reliable and automatic way (a problem still not satisfactorily solved even now) [122]. With better understanding of protein structures, and with the number

of known structures increasing rapidly, efficient clustering techniques become essential.

References are for molecular biological applications.

Several recently developed protein structure comparison techniques aim at developing a similarity measure that satisfies the triangle inequality [60, 47, 145], so that many known clustering algorithms can be applied. In particular, in [60], Choi et al. exploit techniques from information retrieval in building their classification system.

1.4 Main Contributions

Our research touches both the shape-description and shape-matching categories. We focus on developing efficient computational methods for describing or matching structures. Our approaches rely on both geometric and topological techniques.8 Software produced from this thesis work is available at the BioGeometry website (http://biogeometry.duke.edu/).

Part I. Shape description. (1) Writhing number: The writhing number of a curve measures how many times the curve coils around itself in space. It characterizes the so-called supercoiling phenomenon of double-stranded DNA, which influences DNA replication, recombination, and transcription. It is also used to characterize protein backbones. We establish a relationship between the writhing number of a space curve and the winding number, a topological concept. This enables us to develop the first subquadratic algorithm for computing the writhing number of a polygonal curve. We have also implemented a simpler algorithm that runs in near-linear time on inputs that are typical in practice [5], including protein backbones and DNA strands, in contrast to the quadratic-time algorithms used by current software.

(2) Simplification: The level-of-detail (LOD) representation of a protein backbone helps to single out its main features. One way to obtain LOD representations is via curve simplification. We study the simplification problem under the so-called Fréchet error measure.

We remark here that in the literature of molecular biology, the word topology is typically used with a different meaning from our usage: it mainly refers to the topology of the molecule itself, such as how the elements of a molecule are interconnected, while we exploit knowledge and tools from classical topology, such as Morse theory, in our approaches.

This measure is more desirable than the widely used Hausdorff error measure in many situations, especially if one wants to preserve global features of a curve (e.g., the secondary structure elements of a protein backbone) during simplification. We propose and implement a simple algorithm to simplify curves under the Fréchet error measure [7], which is the first simplification algorithm that runs in near-linear time in dimensions higher than two with guaranteed quality.

(3) Elevation function: Given a molecular surface, to capture geometric features such as protrusions and cavities, we design a continuous elevation function on the surface and compute the points with locally maximal elevation [4]. The intuition for the function comes from elevation on Earth, by which we identify mountain peaks and valleys, but the concept is technically more involved to extend to general 2-manifolds. This function is scale-independent and provides, beyond the location, also the direction and size of shape features. Using the elevation function, we can describe these geometric features in a reliable and succinct manner, and the results can aid in attacking the protein docking problem.

Part II. Shape matching. (4) Matching via Hausdorff distance: Modeling a molecule as the union of a set of balls (each representing an atom), we measure the similarity between two molecules by variants of the Hausdorff distance. In particular, we present algorithms that compute, exactly or approximately, the minimum Hausdorff distance between two such unions under all possible translations [8]. We also investigate the version in which we are constrained to translations under which the two sets remain collision-free (i.e., no ball from one set intersects the other set).
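For intuition, the Hausdorff distance itself, at one fixed placement and for finite point sets such as atom centers rather than unions of balls, can be sketched as follows; minimizing it over translations for unions of balls is the harder problem addressed in [8]:

```python
# Hausdorff distance between two finite point sets (e.g., atom centers):
# H(A, B) = max(h(A, B), h(B, A)), where the directed distance h(A, B)
# is the largest distance from a point of A to its nearest point of B.
# A simplified stand-in for the union-of-balls version studied in the text.
import math

def directed_hausdorff(A, B):
    return max(min(math.dist(a, b) for b in B) for a in A)

def hausdorff(A, B):
    return max(directed_hausdorff(A, B), directed_hausdorff(B, A))

A = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
B = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 3.0, 0.0)]
print(directed_hausdorff(A, B))  # 0.0: every point of A lies in B
print(hausdorff(A, B))           # 3.0: (0, 3, 0) is far from all of A
```

Note the asymmetry of the directed distance, which is why both directions are taken; this brute-force evaluation is quadratic in the number of points.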

(5) Docking via features: As mentioned earlier, from a geometric perspective, protein docking can be considered as the problem of searching for configurations with maximum complementarity between two molecular surfaces. Our goal is to efficiently compute a small set of potentially good docking configurations based on the geometry of the two

structures. Given such a set, more sophisticated procedures can then be performed on each of its members independently to locate the "real" docking configuration. To find such a potential set, we would like to align cavities of one protein with protrusions of the other; these "meaningful" features are captured by the elevation function we have designed. Our approach can compute important matching positions while inspecting many fewer configurations than exhaustive search or earlier geometric-hashing approaches.

Part I: Shape Description

Chapter 2

Writhing Number

2.1 Introduction

The writhing number is an attempt to capture the physical phenomenon that a cord tends to form loops and coils when it is twisted. We model the cord by a knot, which we define to

be an oriented closed curve in three-dimensional space. We consider its two-dimensional

family of parallel projections. In each projection, we count $+1$ or $-1$ for each crossing, depending on whether the overpass requires a counterclockwise or a clockwise rotation (an angle between $0$ and $\pi$) to align with the underpass. The writhing number is then the signed number of crossings averaged over all parallel projections. It is a conformal invariant of the knot and useful as a measure of its global geometry.

The writhing number attracted much attention after the relationship between the linking number of a closed ribbon and the writhing number of its axis, expressed by the White formula, was formally discovered independently by Călugăreanu [40], Fuller [88], Pohl [142], and White [168]:

$$ Lk \;=\; Tw + Wr. \tag{2.1} $$

Here the linking number, $Lk$, is half the signed number of crossings between the two boundary curves of the ribbon, and the twisting number, $Tw$, is half the average signed number of local crossings between the two curves. The non-local crossings between the two curves correspond to crossings of the ribbon axis, which are counted by the writhing number, $Wr$. The linking number is a topological invariant, while the twisting number and the writhing number are not. A small subset of the mathematical literature on the subject can be found in [20, 79].

Besides the mathematical interest, the White formula and the writhing number have received attention both in physics and in biochemistry [70, 117, 128, 157]. For example, they are relevant in understanding the various geometric conformations of circular DNA in solution, as illustrated in Figure 2.1, taken from [37].

Figure 2.1: Circular DNA takes on different supercoiling conformations in solution.

By representing DNA as a ribbon, the writhing number of its axis measures the amount of supercoiling, which characterizes some of the DNA's chemical and biological properties [30].

As another example, the writhing number and some of its variants have also been applied to protein backbones, modeled as open curves, as shape descriptors to classify protein structures [128, 145]. The intuition for such approaches follows from the fact that the writhing number of a space curve measures the relative position of any two points on the curve and the relative orientation of the tangents at those points. (This view will become clearer after we introduce Equation (2.3) in the next section.) When extended to a polygonal curve, this means that the writhing number measures the relative position and orientation of any two edges of the curve. Hence two protein backbones with similar arrangements of secondary structure elements produce similar writhing numbers.

This chapter studies algorithms for computing the writhing number of a polygonal knot. Section 2.2 introduces background work and states our results. Section 2.3 relates the writhing number of a knot with the winding number of its Gauss map. Section 2.4 shows how to compute the writhing number in time less than quadratic in the number of edges of the knot. Section 2.5 discusses a simpler sweep-line algorithm and presents initial experimental results.

2.2 Prior and New Work

In this section, we formally define the writhing number of a knot and review prior algorithms used to compute or approximate that number. We conclude by presenting our results.

Definitions. A knot is a continuous injection $K\colon \mathbb{S}^1 \to \mathbb{R}^3$ or, equivalently, an oriented closed curve embedded in $\mathbb{R}^3$. We use the two-dimensional sphere of directions, $\mathbb{S}^2$, to represent the family of parallel projections in $\mathbb{R}^3$. Given a knot $K$ and a direction $u \in \mathbb{S}^2$, the projection of $K$ is an oriented, possibly self-intersecting, closed curve in a plane normal to $u$. We assume $u$ to be generic, that is, each crossing of $K$ in the direction $u$ is simple and identifies two oriented intervals along $K$, of which the one closer to the viewer is the overpass and the other is the underpass. We count the crossing as $+1$ if we can align the two orientations by rotating the overpass in counterclockwise order by an angle between $0$ and $\pi$. Similarly, we count the crossing as $-1$ if the necessary rotation is in clockwise order. Both cases are illustrated in Figure 2.2.

Figure 2.2: The two types of crossings when two oriented intervals intersect.

The Tait or directional writhing number of $K$ in the direction $u$, denoted $DWr(K,u)$, is the sum of crossings counted as $+1$ or $-1$ as explained. The writhing number is the averaged directional writhing number, taken over all directions $u \in \mathbb{S}^2$:

$$ Wr(K) \;=\; \frac{1}{4\pi} \int_{u \in \mathbb{S}^2} DWr(K,u)\, \mathrm{d}u. \tag{2.2} $$

We note that a crossing in the projection along $u$ also exists in the opposite direction, along $-u$, and that it has the same sign. Hence $DWr(K,u) = DWr(K,-u)$, which implies that the writhing number can be obtained by averaging the directional writhing number over all points of the projective plane or, equivalently, over all antipodal point pairs $\{u, -u\}$ of the sphere.

Computing the writhing number. Several approaches to computing the writhing number of a smooth knot exactly or approximately have been developed. Consider an arc-length parameterization $\gamma\colon [0, L] \to \mathbb{R}^3$, and use $\gamma(s)$ and $\dot\gamma(s)$ to denote the position and the unit tangent vector at $s \in [0, L]$. The following double integral formula for the writhing number can be found in [142, 159]:

$$ Wr(K) \;=\; \frac{1}{4\pi} \int_0^L \!\! \int_0^L \frac{\bigl(\dot\gamma(s) \times \dot\gamma(t)\bigr) \cdot \bigl(\gamma(t) - \gamma(s)\bigr)}{\|\gamma(t) - \gamma(s)\|^3}\, \mathrm{d}s\, \mathrm{d}t. \tag{2.3} $$

If the smooth knot is approximated by a polygonal knot, we can turn the right-hand side of (2.3) into a double sum and approximate the writhing number of the smooth knot [33, 128]. This can also be done in a way such that the double sum gives the exact writhing number of the polygonal knot [28, 121, 166].
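The straightforward double-sum approximation mentioned above can be sketched as follows (one midpoint/tangent sample per edge; this is the simple discretization, not the exact polygonal formula of [28, 121, 166]):

```python
# Quadratic-time approximation of the writhing number of a closed
# polygonal curve: discretize the Gauss double integral (2.3) with one
# sample (edge midpoint and edge vector) per edge.
import math

def sub(a, b):   return (a[0] - b[0], a[1] - b[1], a[2] - b[2])
def dot(a, b):   return a[0]*b[0] + a[1]*b[1] + a[2]*b[2]
def cross(a, b): return (a[1]*b[2] - a[2]*b[1],
                         a[2]*b[0] - a[0]*b[2],
                         a[0]*b[1] - a[1]*b[0])

def writhe_approx(vertices):
    n = len(vertices)
    edges = []
    for i in range(n):
        p, q = vertices[i], vertices[(i + 1) % n]
        mid = tuple((pc + qc) / 2 for pc, qc in zip(p, q))
        edges.append((mid, sub(q, p)))  # midpoint and (unnormalized) tangent
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            (mi, ei), (mj, ej) = edges[i], edges[j]
            r = sub(mj, mi)
            total += dot(cross(ei, ej), r) / dot(r, r) ** 1.5
    return total / (4 * math.pi)

# A planar polygon has writhe 0: every integrand term vanishes,
# since all edge vectors and difference vectors are coplanar.
square = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]
print(abs(writhe_approx(square)) < 1e-12)  # True
```

Both orders (i, j) and (j, i) are summed, matching the double integral over the full parameter square; the running time is quadratic in the number of edges, which is exactly what the algorithms of this chapter improve upon.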

Alternatively, we may base the computation of the writhing number on the directional version of the White formula, $Lk(K) = Tw(K,u) + DWr(K,u)$, for $u \in \mathbb{S}^2$. Recall that both the linking number and the twisting number are defined over the two boundary curves of a closed ribbon. Similar to the definition of $DWr(K,u)$, the directional twisting number, $Tw(K,u)$, is defined as half the sum of crossings between the two curves, each counted as $+1$ or $-1$ as described in Figure 2.2. We get (2.1) by integrating over $\mathbb{S}^2$ and noting that the linking number does not depend on the direction. This implies

$$ Wr(K) \;=\; Lk(K) - Tw(K) \;=\; \frac{1}{4\pi} \int_{u \in \mathbb{S}^2} \bigl( Lk(K) - Tw(K,u) \bigr)\, \mathrm{d}u. \tag{2.4} $$

To compute the directional and the (average directional) twisting numbers, we expand $K$ to a ribbon, which amounts to constructing a second knot that runs alongside but is disjoint from $K$. Expressions for these numbers that depend on how we construct this second knot can be found in [121]. Le Bret [35] suggests fixing a direction $u$ and defining the second knot such that in the projection it always runs to the left of $K$. In this case we have $Lk(K) = DWr(K,u)$, and the writhing number is the directional writhing number for $u$ minus the twisting number.

A third approach to computing the writhing number is based on a result by Cimasoni [62], which states that the writhing number is the directional writhing number for a fixed direction $z$, plus the average deviation of the other directional writhing numbers from $DWr(K,z)$. By observing that $DWr(K,u)$ is the same for all directions $u$ in a cell $Q$ of the decomposition of $\mathbb{S}^2$ formed by the Gauss maps $T$ and $-T$ (also referred to as the tangent indicatrix or tantrix in the literature [56, 154]), we get

$$ Wr(K) \;=\; DWr(K,z) + \frac{1}{4\pi} \sum_{Q} \bigl( DWr_Q - DWr(K,z) \bigr)\, \mathrm{Area}(Q), \tag{2.5} $$

where $DWr_Q$ is $DWr(K,u)$ for any one point $u$ in the interior of $Q$, and $\mathrm{Area}(Q)$ is the area of $Q$.

If applied to a polygonal knot, all three algorithms take time at least proportional to the square of the number of edges in the worst case.

Our results. We present two new results. The first can be viewed as a variation of (2.4) and a stronger version of (2.5). For a direction $u \in \mathbb{S}^2$ not on $T$ and not on $-T$, let $w(u)$ be its winding number with respect to $T$ and $-T$. As explained in Section 2.3, this means that $T$ and $-T$ wind $w(u)$ times around $u$.

THEOREM A. For a knot $K$ and a direction $z$, we have

$$ Wr(K) \;=\; DWr(K,z) - w(z) + \frac{1}{4\pi} \int_{u \in \mathbb{S}^2} w(u)\, \mathrm{d}u. $$

Observe the similarity of this formula with (2.4), which suggests that the winding number can be interpreted as the directional twisting number for a ribbon one of whose two boundary curves is $K$. We will prove Theorem A in Section 2.3. We will also extend the relation in Theorem A to open knots and give an algorithm that computes the average winding number in time proportional to the number of edges. Our second result is an algorithm that computes the directional writhing number of a polygonal knot in time sub-quadratic in the number of edges.

THEOREM B. Given a polygonal knot $K$ with $n$ edges and a direction $z \in \mathbb{S}^2$, $DWr(K,z)$ can be computed in time $O(n^{1+\varepsilon})$, where $\varepsilon > 0$ is an arbitrarily small constant.

Figure 2.3: A knot whose directional writhing number is quadratic in the number of edges.

Theorems A and B imply that the writhing number of a polygonal knot can be computed in time $O(n^{1+\varepsilon})$. As shown in Figure 2.3, the number of crossings in a projection can be as large as quadratic in $n$. The sub-quadratic running time is achieved because the algorithm avoids checking each crossing explicitly. We also present a simpler sweep-line algorithm that checks each crossing individually and therefore does not achieve the worst-case running time of the algorithm in Theorem B. It is, however, fast when there are few crossings.

2.3 Writhing and Winding

In this section, we develop our geometric understanding of the relationship between the writhing number of a knot and the winding number of its Gauss map. We define the Gauss map as the curve of critical directions, prove Theorem A, and give a fast algorithm for computing the average winding number.

2.3.1 Closed knots

Critical directions. We specify a polygonal knot $K$ by the cyclic sequence of its vertices, $w_0, w_1, \ldots, w_{n-1}$, in $\mathbb{R}^3$. We use indices modulo $n$ and write $t_i = (w_{i+1} - w_i)/\|w_{i+1} - w_i\|$ for the unit vector along the edge $e_i = w_i w_{i+1}$. Note that $t_i$ is also a direction in $\mathbb{R}^3$ and a point in $\mathbb{S}^2$. Any two consecutive points $t_i$ and $t_{i+1}$ determine a unique arc, which, by definition, is the shorter piece of the great circle that connects them. The cyclic sequence of arcs $t_0 t_1, t_1 t_2, \ldots, t_{n-1} t_0$ thus defines an oriented closed curve $T$ in $\mathbb{S}^2$. We also need the antipodal curve, $-T$, which is the central reflection of $T$ through the origin.

Figure 2.4: In all three cases, the viewing direction slides from left to right over the oriented great circle of directions defined by the hollow vertex and the solid edge. The directional writhing

number changes only in the third case, where we lose a positive crossing.
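The construction of $T$ and $-T$ from the vertex sequence is direct; a sketch (names are ours):

```python
# Construct the tangent indicatrix (tantrix) T of a closed polygonal
# knot: the cyclic sequence of unit edge vectors t_i, viewed as points
# on the sphere S^2; -T is obtained by central reflection.
import math

def tantrix(vertices):
    points = []
    n = len(vertices)
    for i in range(n):
        p, q = vertices[i], vertices[(i + 1) % n]   # edge w_i w_{i+1}
        v = (q[0] - p[0], q[1] - p[1], q[2] - p[2])
        norm = math.sqrt(v[0]**2 + v[1]**2 + v[2]**2)
        points.append((v[0] / norm, v[1] / norm, v[2] / norm))  # t_i
    return points

def antipodal(points):
    return [(-x, -y, -z) for x, y, z in points]

square = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]
T = tantrix(square)
print(T[0])  # (1.0, 0.0, 0.0), the unit tangent of the first edge
```

Consecutive points of this sequence are then connected by the shorter great-circle arcs to obtain the oriented closed curve $T$ described above.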

The directions on $T$ and $-T$ are critical, in the sense that the directional writhing number changes when we pass through them along a generic path in $\mathbb{S}^2$, and these are the only critical directions [62]. We sketch the proof of this claim for the polygonal case. It is clear that a direction $u \in \mathbb{S}^2$ is critical only if it is parallel to a line that passes through a vertex $w_i$ and a point on an edge $e_j = w_j w_{j+1}$ of the knot that is not adjacent to $w_i$. There are $O(n^2)$ such vertex-edge pairs, each defining a great circle in $\mathbb{S}^2$. First, we note that only $2n$ of these great circles actually carry critical points, namely the great circles that correspond to $j = i+1$ and $j = i-2$. The reason for this is shown in Figure 2.4, where we see that the writhing number does not change unless $w_i$ is separated from $e_j$ by only one edge along the knot. Second, assuming $j = i+1$, we observe that the subset of directions along which $w_i$ projects onto $e_{i+1}$ is the arc $\alpha$ from $t_i$ to the direction $v = (w_{i+2} - w_i)/\|w_{i+2} - w_i\|$ in $\mathbb{S}^2$, and symmetrically the arc $-\alpha$ from $-t_i$ to $-v$. The subset of directions along which $w_{i+2}$ projects onto $e_i$ are the arcs $\beta$ from $v$ to $t_{i+1}$ and $-\beta$ from $-v$ to $-t_{i+1}$. The points $t_i$, $t_{i+1}$, and $v$ lie on a common great circle, and $v$ lies on the arc $t_i t_{i+1}$. This implies that the concatenation of $\alpha$ and $\beta$ is the arc $t_i t_{i+1}$, and that of $-\alpha$ and $-\beta$ is the arc $(-t_i)(-t_{i+1})$. It follows that $T$ and $-T$ indeed comprise all critical directions.

© ©

Decomposition. The curves and £ are both oriented, which is essential. We say a

§



§ ¢

direction lies to the left of an oriented arc ¥ if it lies in the open hemisphere to 

the left of the oriented great circle that contains ¥ . Equivalently, sees that great circle

©  oriented in counterclockwise order. If passes from the left of an arc ¥ of to its right, then we either lose a positive crossing (as in the third row of Figure 2.4), or we pick up

a negative crossing. Either way the directional writhing number decreases by one. This

©

 ¡

£ ¥ £

motion corresponds to £ passing from the right of the arc of to its left. Since

  the directional writhing numbers at and £ are the same, we decrease the directional

writhing number by one in the opposite view as well. In other words, if  moves from the ©

left of an arc of £ to its right, then the effect on the directional writhing number is the

opposite from what it is for an arc of © . These simple rules allow us to keep track of the

© ©

§ £

directional writhing number while moving around in ¢ . The curves and decompose §

¢ into cells within which the directional writhing number is invariant. We can thus rewrite

(2.2) as

¢ ¢

¡

¡

¢

¨



£ ¥

¤

25

¢

where the sum ranges over all cells ¡ of the decomposition, and is the directional ¡

writhing number of any one point in the interior¢ of . Equation (2.5) of Cimasoni can now

¡

¦ be obtained by subtracting from inside the sum and adding it outside the sum.

This reformulation provides an algorithm for computing the writhing number.

Step 1. Compute $DWr(K,z)$ for an arbitrary but fixed direction $z$.

Step 2. Construct the decomposition of $\mathbb{S}^2$ into cells, label each cell $Q$ with $DWr_Q - DWr(K,z)$, and form the sum as in (2.5).

The running time for Step 2 is $\Omega(n^2)$ in the worst case, as there can be quadratically many cells. We improve the running time to O($n$) and, at the same time, simplify the algorithm. First we prove Theorem A.
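Step 1 can be carried out by brute force. The following sketch (our own, for the fixed direction $z = (0,0,1)$) sums the crossing signs of Figure 2.2 over all non-adjacent edge pairs of the projected polygon in quadratic time; Theorem B is what replaces this with a sub-quadratic procedure:

```python
# Brute-force directional writhing number DWr(K, z) for z = (0, 0, 1):
# project the closed polygon onto the xy-plane and sum the +1/-1
# crossing signs over all non-adjacent edge pairs. Our own sketch;
# the sign convention follows Figure 2.2.

def crossing_sign(e1, e2):
    (p, q), (r, s) = e1, e2
    d1 = (q[0] - p[0], q[1] - p[1])
    d2 = (s[0] - r[0], s[1] - r[1])
    denom = d1[0] * d2[1] - d1[1] * d2[0]
    if denom == 0:
        return 0  # parallel in projection: no simple crossing
    # Solve p + t*d1 = r + u*d2 in the plane.
    t = ((r[0] - p[0]) * d2[1] - (r[1] - p[1]) * d2[0]) / denom
    u = ((r[0] - p[0]) * d1[1] - (r[1] - p[1]) * d1[0]) / denom
    if not (0 < t < 1 and 0 < u < 1):
        return 0  # the segments do not cross
    z1 = p[2] + t * (q[2] - p[2])    # heights of the two strands
    z2 = r[2] + u * (s[2] - r[2])
    ccw = 1 if denom > 0 else -1     # does e2 turn counterclockwise from e1?
    return ccw if z1 > z2 else -ccw  # +1 iff the overpass turns ccw onto the underpass

def dwr_z(vertices):
    n = len(vertices)
    edges = [(vertices[i], vertices[(i + 1) % n]) for i in range(n)]
    total = 0
    for i in range(n):
        for j in range(i + 2, n):
            if i == 0 and j == n - 1:
                continue  # adjacent around the cycle
            total += crossing_sign(edges[i], edges[j])
    return total

# A planar polygon has no crossings; a twisted 4-gon has one.
print(dwr_z([(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]))  # 0
print(dwr_z([(0, 0, 0), (1, 1, 0), (1, 0, 1), (0, 1, 1)]))  # -1
```

For a generic direction other than $(0,0,1)$, one would first rotate the coordinate frame so that the chosen direction becomes the viewing axis.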

Winding numbers. We now introduce a function $w$ over $\mathbb{S}^2$ that may be different from $DWr$ but changes in the same way. In other words, $DWr(K,x) - w(x) = DWr(K,z) - w(z)$ for all $x, z \in \mathbb{S}^2$. This function is the winding number of a point $x$ with respect to the two curves $T$ and $-T$, which do not contain $x$. Observe that the space obtained by removing two points from the two-dimensional sphere is topologically an annulus. We fix non-critical, antipodal directions $z$ and $-z$ and define $w(x)$ equal to the number of times $T$ winds around the annulus obtained by removing $x$ and $z$, plus the number of times $-T$ winds around the annulus obtained by removing $x$ and $-z$. This is illustrated in Figure 2.5. Here we count the winding of $T$ in counterclockwise order, as seen from $x$, as positive, and winding in clockwise order as negative. Symmetrically, we count the winding of $-T$ in clockwise order, as seen from $x$, as positive, and winding in counterclockwise order as negative. Imagine moving a point $p$ along $T$ and connecting $p$ to $z$ with a circular arc. Specifically, we use the circle that passes through $p$, $z$, and $-z$ and the arc with endpoints $p$ and $z$ that avoids $-z$. Symmetrically, we move a point $q$ along $-T$ and connect $q$ to $-z$ with the appropriate arc of the circle passing through $q$, $z$, and $-z$.

Figure 2.5: The winding number counts the number of times $T$ separates $x$ from $z$ and $-T$ separates $x$ from $-z$.

Locally at $x$ we observe continuous movements of the two arcs. Clockwise and counterclockwise movements cancel, and $w(x)$ is the number of times the first arc rotates in counterclockwise order around $x$ plus the number of times the second arc rotates in clockwise order around $x$. The winding number of $x$ is always an integer but can be negative.

Observe that $w$ indeed changes in the same way as $DWr$ does. Specifically, $w(x)$ drops by 1 if $x$ crosses $T$ from left to right, and it increases by 1 if $x$ crosses $-T$ from left to right. Starting from the definition (2.2) of the writhing number, we thus get

$$ \begin{aligned} Wr(K) &= \frac{1}{4\pi} \int_{u \in \mathbb{S}^2} DWr(K,u)\, \mathrm{d}u \\ &= \frac{1}{4\pi} \int_{u \in \mathbb{S}^2} \bigl( DWr(K,z) - w(z) + w(u) \bigr)\, \mathrm{d}u \\ &= DWr(K,z) - w(z) + \frac{1}{4\pi} \int_{u \in \mathbb{S}^2} w(u)\, \mathrm{d}u, \end{aligned} $$

which completes the proof of Theorem A.

Signed area modulo 2. Observe that the writhing number changes continuously under deformations of the knot, as long as $K$ does not pass through itself. When $K$ performs a small motion during which it passes through itself, there is a $\pm 2$ jump in $Wr(K)$, while the average winding number changes only slightly. We use these observations to give a new proof of Fuller's relation [13, 89],

$$ Wr(K) \;\equiv\; \frac{A(T)}{2\pi} - 1 \pmod 2, \tag{2.6} $$

where $A(T)$ is the signed area of the curve $T$ in $\mathbb{S}^2$. Note first that

$$ Wr(K) \;\equiv\; \frac{1}{4\pi} \int_{u \in \mathbb{S}^2} w(u)\, \mathrm{d}u \pmod 1, $$

because both $DWr(K,z)$ and $w(z)$ are integers. We start with $K$ being a circle in $\mathbb{R}^3$, in which case (2.6) holds because $Wr(K) = 0$ and $A(T) = 2\pi$. Other than continuous changes, we observe jumps of $\pm 2$ in $Wr(K)$ when $K$ passes through itself. Theorem A, together with the fact that the fractional parts of $Wr(K)$ and $A(T)/(2\pi)$ are the same, implies that (2.6) is maintained during the deformation. Fuller's relation follows because every knot can be obtained from the circle by continuous deformation.

Computing the average winding number. Three generic points a, b, c ∈ S² define three arcs, which bound the spherical triangle abc. Recall that the area of abc is the sum of its angles minus π. We define the signed area Φ(abc) of abc as this area if c lies to the left of the oriented arc ab, and as the negative of this area if it lies to the right. Let z ∈ S² be a non-critical direction. As shown in Figure 2.6, every arc t_i t_{i+1} of τ forms a unique spherical triangle with z. Let Φ_i = Φ(t_i t_{i+1} z) be its signed area. The corresponding arc of −τ forms the antipodal spherical triangle (−t_i)(−t_{i+1})(−z) with signed area Φ'_i.

Figure 2.6: The two spherical triangles defined by an arc of τ and its antipodal arc of −τ.

The winding number of a direction x ∈ S² can be obtained by counting the number of spherical triangles that contain it. To be more specific, we call a spherical triangle positive if its signed area is positive and negative if its signed area is negative. Let P(x) and N(x) be the numbers of positive and negative spherical triangles t_i t_{i+1} z that contain x, and similarly let P'(x) and N'(x) be the numbers of positive and negative spherical triangles (−t_i)(−t_{i+1})(−z) that contain x. Then

    w(x) = w(z) + P(x) − N(x) + P'(x) − N'(x).

To see this, note that the equation is correct for a point x near z and remains correct as x moves around and crosses arcs of τ and of −τ. The average winding number is thus

    w̄ = w(z) + (1/4π) Σ_i (Φ_i + Φ'_i).

Computing the sum in this equation is straightforward and takes only O(n) time.
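The signed areas summed above can be sketched in a few lines of code. The following is an illustrative Python sketch (not code from the thesis; `signed_spherical_area` is a name chosen here): it evaluates the Girard area of a generic, non-degenerate spherical triangle and signs it by the orientation of its vertices, which is the quantity Φ_i appearing in the formula for w̄.

```python
import math

def _arc(u, v):
    # arc length (angle) between unit vectors u and v
    return math.acos(max(-1.0, min(1.0, sum(a * b for a, b in zip(u, v)))))

def signed_spherical_area(a, b, c):
    """Signed area of the spherical triangle abc on the unit sphere, by
    Girard's theorem (sum of interior angles minus pi).  The sign is taken
    from the orientation of (a, b, c): positive when the triple product
    det[a, b, c] is positive, i.e. c lies to the left of the arc ab.
    Assumes a generic (non-degenerate) triangle."""
    ab, bc, ca = _arc(a, b), _arc(b, c), _arc(c, a)

    def corner(opp, s1, s2):
        # interior angle opposite arc `opp`, via the spherical law of cosines
        return math.acos(max(-1.0, min(1.0,
            (math.cos(opp) - math.cos(s1) * math.cos(s2)) /
            (math.sin(s1) * math.sin(s2)))))

    area = corner(bc, ab, ca) + corner(ca, ab, bc) + corner(ab, bc, ca) - math.pi
    det = (a[0] * (b[1] * c[2] - b[2] * c[1])
         - a[1] * (b[0] * c[2] - b[2] * c[0])
         + a[2] * (b[0] * c[1] - b[1] * c[0]))
    return area if det > 0 else -area
```

Summing `signed_spherical_area(t_i, t_next, z)` over the arcs of τ, plus the antipodal triangles, gives the correction term of w̄ in linear time.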

2.3.2 Open knots

We define an open knot as a continuous injection K : [0, 1] → ℝ³. Equivalently, it is an oriented curve, embedded in ℝ³, with endpoints. The directional writhing number of K is well-defined, and the writhing number is the directional writhing number averaged over all parallel projections, as before. Assume K is a polygon specified by the sequence of its vertices, x₀, x₁, …, x_{n−1}, and let K' be the closed knot obtained by adding the edge x_{n−1} x₀. The critical directions of K differ in two ways from those of K':

(i) there are critical directions of K' that are not critical for K, namely the ones whose definition includes a point of the edge x_{n−1} x₀;

(ii) there are new critical directions, namely those defined by an endpoint (x₀ or x_{n−1}) and another point of the polygon not on the two adjacent edges.

To see that the directions in (ii) are indeed critical for K, examine the first two rows of Figure 2.4. The hollow vertex is now an endpoint of K, so we remove one of the two dashed edges. Because of this change, the directional writhing number changes at the moment the hollow vertex passes over the solid edge. Changing the critical curve Γ' of K' to the critical curve Γ of K can thus be achieved by removing the arcs of Case (i) and adding the arcs of Case (ii). We illustrate this process in Figure 2.7. To describe the process, we

Figure 2.7: The critical curves of the knot K' are marked by hollow vertices, and the additions required for the critical curves of the open knot K are marked by solid black vertices.

define, for each endpoint, the directions in which that endpoint lines up with the other vertices of the polygon; these directions bound the new arcs of Case (ii), while the directions defined by points of the removed edge x_{n−1} x₀ bound the arcs of Case (i). We get the critical curve Γ from Γ' by

1. removing the partial arcs and full arcs of Case (i), which are defined by points of the edge x_{n−1} x₀, and

2. adding the new paths of Case (ii), which connect the remaining arcs through the directions defined by the endpoints x₀ and x_{n−1}.

Note that Step 2 adds a piece of −τ to the new critical curve Γ. Symmetrically, we get −Γ from −Γ'. Everything we said earlier about the winding number of the critical curve of K' applies equally well to the critical curve of K. Similarly, all algorithms described in the subsequent sections apply to closed knots as well as to open knots.

2.4 Computing Directional Writhing

In this section, we present an algorithm that computes the directional writhing number of a polygonal knot with n edges in time roughly proportional to n^{8/5}. The algorithm uses complicated subroutines that may not lend themselves to an easy implementation.

Reduction to five dimensions. Assume without loss of generality that we view the knot from above, that is, in the direction of z = (0, 0, 1). Each edge e of K is oriented. Another edge that crosses e in the projection either passes above or below e, and it either passes from left to right or from right to left. The four cases are illustrated in Figure 2.8 and classified as positive and negative crossings according to Figure 2.2. Letting P(e) and N(e) be the numbers of edges that form positive and negative crossings with e, the directional writhing number is

    DWr(K, z) = (1/2) Σ_e (P(e) − N(e)).

Figure 2.8: The four ways an oriented edge can cross another.

To compute the sums of the P(e) and N(e) efficiently, we map edges in ℝ³ to points and half-spaces in ℝ⁵. Specifically, let ℓ_e be the oriented line that contains the oriented edge e, and use Plücker coordinates as explained in [52] to map ℓ_e to a point p_e or, alternatively, to a half-space h_e in ℝ⁵. The mapping has the property that e and e' form a positive crossing if and only if p_e lies in the interior of h_{e'}. We use this correspondence to compute Σ_e P(e) in two stages: first we collect the ordered pairs of oriented lines that form positive crossings, and second we count among them the pairs of edges that cross.
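The Plücker mapping can be made concrete in a few lines. The sketch below is illustrative (not the ℝ⁵ lifting of [52] verbatim): it represents an oriented line by its direction and moment vectors, and the reciprocal product `side` is exactly the inner-product test that the point-in-half-space condition above encodes. Which sign counts as a "positive" crossing is a convention, fixed here arbitrarily.

```python
def plucker(p, q):
    """Pluecker coordinates of the oriented line through p then q:
    the direction vector d = q - p and the moment vector m = p x q."""
    d = tuple(q[i] - p[i] for i in range(3))
    m = (p[1] * q[2] - p[2] * q[1],
         p[2] * q[0] - p[0] * q[2],
         p[0] * q[1] - p[1] * q[0])
    return d, m

def side(l1, l2):
    """Reciprocal (permuted inner) product d1.m2 + d2.m1 of two oriented
    lines.  Its sign tells on which side one line passes the other, i.e.
    the handedness of the pair; zero means the lines meet or are parallel."""
    (d1, m1), (d2, m2) = l1, l2
    return (sum(d1[i] * m2[i] for i in range(3))
          + sum(d2[i] * m1[i] for i in range(3)))
```

Reversing the orientation of either line flips the sign, matching the left-to-right versus right-to-left distinction in Figure 2.8.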

Recursive algorithm. It is convenient to explain the algorithm in a slightly more general setting, where R and B are sets of m and n oriented edges in ℝ³. Let X(R, B) denote the number of pairs in R × B that form positive crossings, and note that Σ_e P(e) = X(E, E) if E is the set of edges of the knot. We map R to a set of m points and B to a set of n half-spaces in ℝ⁵. Let r be a sufficiently large constant. A (1/r)-cutting of the points and half-spaces is a collection of pairwise disjoint simplices covering ℝ⁵ such that each simplex intersects at most n/r hyperplanes bounding the half-spaces. We use the algorithm in [10] to compute a (1/r)-cutting consisting of s simplices, where s is at most a constant times r⁴, the constant being independent of r and n. For each simplex Δ_i in the cutting, define

    R_i = { e ∈ R : p_e ∈ Δ_i },
    B_i = { e ∈ B : the hyperplane bounding h_e intersects Δ_i },
    B'_i = { e ∈ B : Δ_i lies in the interior of h_e }.

Letting m_i = |R_i| and n_i = |B_i|, we have Σ_i m_i = m and n_i ≤ n/r. By construction, every pair in R_i × B'_i defines a pair of lines that form a positive crossing. For each simplex Δ_i, we count the edge pairs in R_i × B'_i that form positive crossings, and let X_i be the number of such pairs. Then

    X(R, B) = Σ_i X(R_i, B_i) + Σ_i X_i.

Note that X_i is the number of crossings between projections of the line segments in R_i and in B'_i. We can therefore use the algorithm in [51] to compute all numbers X_i, for 1 ≤ i ≤ s, in time Σ_i O((m_i + n) log(m_i + n)) = O((m + sn) log(m + n)). We recurse to compute the X(R_i, B_i) and stop the recursion when the number of half-spaces in a subproblem drops below a constant. The running time of this algorithm is at most

    T(m, n) ≤ Σ_{i=1}^{s} T(m_i, n/r) + O((m + sn) log(m + n)) = O((m + n⁴) n^δ),

for any δ > 0, provided r is sufficiently large.

Improving the running time. We improve the running time of the algorithm by taking advantage of the symmetry of the mapping to ℝ⁵. Specifically, the point p_e lies in the interior of the half-space h_{e'} if and only if the point p_{e'} lies in the interior of the half-space h_e. We proceed as above, but switch the roles of points and half-spaces when m becomes less than n^{1/4}. That is, if m < n^{1/4} then we map the edges in R to half-spaces and the edges in B to points. By our above analysis, the running time is then less than O((n + m⁴) m^δ) = O(n^{1+δ}). The overall running time is thus less than

    T(m, n) = O(m^{4/5} n^{4/5+δ} + m^{1+δ} + n^{1+δ}),

where the constant of proportionality is positive and δ is any real larger than 0. It follows that Σ_e P(e) can be computed in time O(n^{8/5+δ}), for any constant δ > 0. Similarly, Σ_e N(e), and therefore the directional writhing number, DWr(K, z), can be computed within the same time bound, thereby proving Theorem B.

We remark that the technique described in this section can also be used to compute the linking number between two polygonal knots with m and n edges in time O((m + n)^{8/5+δ}).

2.5 Experiments

In this section, we sketch a sweep-line algorithm that computes the writhing number of a polygonal knot using Theorem A. We implemented the algorithm in C++ using the LEDA

software library and compared it with two versions of the algorithm based on the double integral in (2.3). We did not implement any version of Le Bret’s algorithm mentioned in Section 2.2 since it is based on a formula similar to Theorem A and can be expected to perform about the same as our sweep-line algorithm, and since it only works for closed knots.

2.5.1 Algorithms

Sweep-line algorithm. Theorem A expresses the writhing number of a knot as the sum of three terms. Accordingly, we compute the writhing number in three steps.

Step 1. Compute the directional writhing number DWr(K, z) for an arbitrary but fixed, non-critical direction z ∈ S².

Step 2. Compute the winding number w(z) of z relative to the Gauss maps τ and −τ.

Step 3. Compute the average winding number w̄ by summing the signed areas of the spherical triangles t_i t_{i+1} z, for 0 ≤ i ≤ n − 1.

Return DWr(K, z) + w̄ − w(z).

Instead of using the algorithm described in Section 2.4, we implemented Step 1 using a sweep-line algorithm [71], which reports the k crossing pairs formed by the n edges in time O((n + k) log n). Steps 2 and 3 are both computed in a single traversal of the spherical polygons τ and −τ, keeping track of the accumulated angle and the signed area as we go. The running time of the traversal is only O(n).
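Step 1 can also be checked against a direct quadratic scan over edge pairs. The sketch below is hypothetical illustrative code (one fixed sign convention stands in for the classification of Figure 2.2): it computes the signed crossing count of an open polygonal curve viewed along +z.

```python
def _seg_cross_2d(p, q, r, s):
    """Parameters (t, u) at which segments p+t(q-p) and r+u(s-r) properly
    cross in the xy-plane, or None if they do not."""
    d1 = (q[0] - p[0], q[1] - p[1])
    d2 = (s[0] - r[0], s[1] - r[1])
    den = d1[0] * d2[1] - d1[1] * d2[0]
    if den == 0:
        return None
    t = ((r[0] - p[0]) * d2[1] - (r[1] - p[1]) * d2[0]) / den
    u = ((r[0] - p[0]) * d1[1] - (r[1] - p[1]) * d1[0]) / den
    return (t, u) if 0 < t < 1 and 0 < u < 1 else None

def directional_writhe(pts):
    """Signed crossing count of an open polygonal curve viewed along +z.
    Each non-adjacent edge pair that crosses in projection contributes +1
    or -1, depending on the handedness of the over/under strands."""
    n = len(pts) - 1  # number of edges
    total = 0
    for i in range(n):
        for j in range(i + 2, n):  # skip adjacent edges
            hit = _seg_cross_2d(pts[i], pts[i + 1], pts[j], pts[j + 1])
            if hit is None:
                continue
            t, u = hit
            zi = pts[i][2] + t * (pts[i + 1][2] - pts[i][2])
            zj = pts[j][2] + u * (pts[j + 1][2] - pts[j][2])
            over, under = (j, i) if zj > zi else (i, j)
            do = (pts[over + 1][0] - pts[over][0], pts[over + 1][1] - pts[over][1])
            du = (pts[under + 1][0] - pts[under][0], pts[under + 1][1] - pts[under][1])
            # one fixed sign convention; the thesis' Figure 2.2 fixes its own
            total += 1 if do[0] * du[1] - do[1] * du[0] > 0 else -1
    return total
```

A single overpass produces a count of ±1, and reversing the traversal direction of the curve leaves the count unchanged, as it must.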

Double-sum algorithm. We compare the implementation of the sweep-line algorithm with two implementations of (2.3). Write e_i = x_{i+1} − x_i for the unnormalized tangent vector of the i-th edge. Following [33, 128], we discretize (2.3) to

    W₁ = (1/4π) Σ_i Σ_{j≠i} ((e_i × e_j) · (x_j − x_i)) / ‖x_j − x_i‖³.        (2.7)

We note that W₁ is not the writhing number of the polygonal knot, but it converges to the writhing number of a smooth knot if the polygonal approximation is progressively refined to approach that knot [43].
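A minimal implementation of a discretization in the spirit of (2.7) follows, assuming edge midpoints as evaluation points (the exact placement of evaluation points in [33, 128] may differ):

```python
import math

def gauss_writhe_midpoint(pts):
    """Midpoint-rule discretization of the Gauss double integral for the
    writhing number: tangents are edge vectors, evaluated at edge midpoints.
    This is an approximation -- it converges to the writhe of a smooth curve
    as the polygon is refined, but it is not the exact polygonal writhe."""
    n = len(pts) - 1
    edges = [tuple(pts[i + 1][k] - pts[i][k] for k in range(3)) for i in range(n)]
    mids = [tuple((pts[i + 1][k] + pts[i][k]) / 2 for k in range(3)) for i in range(n)]
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            r = tuple(mids[j][k] - mids[i][k] for k in range(3))
            d = math.sqrt(sum(c * c for c in r))
            if d == 0:
                continue
            ei, ej = edges[i], edges[j]
            cx = (ei[1] * ej[2] - ei[2] * ej[1],
                  ei[2] * ej[0] - ei[0] * ej[2],
                  ei[0] * ej[1] - ei[1] * ej[0])
            # factor 2 accounts for the symmetric (j, i) term of the double sum
            total += 2 * sum(cx[k] * r[k] for k in range(3)) / d ** 3
    return total / (4 * math.pi)
```

For a planar polygon the integrand vanishes term by term, so the estimate is exactly zero, one easy sanity check.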

Alternatively, we may discretize the double integral in such a way that the result is the writhing number of the approximating polygonal knot. Given two edges e_i and e_j, we measure the area of the two antipodal quadrangles in S² along whose directions we see the edges cross. The area of one of the quadrangles is the sum of its four angles minus one full angle, 2π. The absolute value of the signed area Ω_{ij} is this same area, and its sign depends on whether we see a positive or a negative crossing. We thus have

    W₂ = Wr(K) = (1/2π) Σ_{i<j} Ω_{ij}.        (2.8)

Straightforward vector geometry and trigonometry can be used to derive analytical formulas for the Ω_{ij} [28, 121].

2.5.2 Comparison

We compare the three implementations using a sequence of polygonal approximations of an artificially created smooth knot. It has the form of the infinity symbol, ∞, and is fairly flat in ℝ³, with only a small gap in the middle. Because the knots are fairly flat, most of their parallel projections have one crossing, and the writhing number is just a little smaller than 1. Figure 2.9 shows that the algorithms that compute the exact writhing numbers for polygonal approximations converge faster to the writhing number of the smooth knot than the algorithm implementing (2.7). Figure 2.10 shows how much faster the sweep-line algorithm is than both implementations of the double-sum algorithm. Let n be the number of edges. The graphs suggest that the running time of the sweep-line algorithm is O(n) and the running times of the two implementations of the double-sum algorithm are quadratic in n. We observe the linear bound whenever we approximate a smooth knot by a polygon, since for generic projections the number of crossings as well as the number of edges simultaneously intersected by the sweep-line are independent of the total number of edges.

Figure 2.9: Comparing convergence rates between W₁ (upper curve) and W₂ (lower curve). For each tested approximation of the ∞-knot, we draw the number of vertices along the horizontal axis and the writhing number along the vertical axis.

Figure 2.10: Comparing the running times of the sweep-line algorithm (lower curve) and the two implementations of the double-sum algorithm: approximate (middle curve) and exact (upper curve). The x-axis and the y-axis represent the number of vertices in the curve and the running time of the algorithm, respectively.

Protein backbones. We present some preliminary experimental results obtained with the three implementations. All experiments are carried out on a SUN workstation, with a 333 MHz UltraSPARC-IIi CPU and 256 MB memory. Short of conformation data of long DNA strands, we decided to run our algorithms on a modest collection of open knots representing protein backbones, downloaded from the protein data bank [2]. We modified the algorithms to account for the missing edge in the data, as explained in Section 2.3.

Figure 2.11 displays the four backbones chosen for our experimental study. Table 2.1 presents some of our findings.

Figure 2.11: The open knots modeling the backbone of the protein conformations stored in the PDB files 1AUS.pdb (upper left), 1CDK.pdb (upper right), 1CJA.pdb (lower left), and 1EQZ.pdb (lower right).

Thick knots. Even though the writhing number of a polygonal knot can be as large as quadratic in the number of edges, all four protein backbones in Figure 2.11 have writhing numbers that are significantly smaller than the numbers of edges. If a knot is made out of rope with non-zero thickness, then the quadratic bound can be achieved only if the ratio of length over cross-section radius is sufficiently high. Specifically, the writhing number of a knot of length L with an embedded tubular neighborhood of radius ρ is less than a constant times (L/ρ)^{4/3} [44]. Such “thick” knots can be used to capture the fact that the edges of a protein backbone are about as long as they are thick. A backbone with n edges thus has writhing number at most some constant times n^{4/3}. Examples which show that the upper bound is asymptotically tight can be found in [38, 45, 72].

    Data      n      k    t_swp   t_apx   t_ext     W₁      Wr
    1AUS    439    122     0.09    3.93    9.28   22.70   17.87
    1CDK    343    111     0.06    2.39    5.62    7.96    6.01
    1CJA    327    150     0.06    2.19    5.10   12.14   10.43
    1EQZ    125     18     0.02    0.31    0.73    4.78    3.37

Table 2.1: Four protein backbones modeled by open polygonal knots. The size of the problem is measured by the number of edges, n, and the number of crossings in the chosen projection, k. The time the sweep-line (t_swp), the approximate double-sum (t_apx), and the exact double-sum (t_ext) algorithms take is measured in seconds. W₁ is an approximation of the writhing number for polygonal data.

2.6 Notes and Discussion

In this chapter, we have developed an efficient algorithm to compute the writhing number of a space curve in ℝ³. A fast method is important because the writhing number of DNA strands is computed at each step in some molecular simulations. Beyond this computational aspect, it would be interesting to further investigate the concept and see whether there is a correlation between writhing numbers and the common classification of protein folds. As mentioned in Section 2.1, there has been some initial work in this direction [82, 145].

It seems that although the writhing number of a protein backbone describes the spatial arrangement of its secondary structure elements, it alone is not discriminative enough to classify protein structures. One major reason is that the writhing number is mainly effective in describing the global geometry of a given space curve. To address this problem, it might be necessary to consider backbones at a range of scales and compute the writhing number as a function of scale. Another possible approach is to combine the writhing number with other topological or geometric measures that describe different aspects, especially the local geometry, of protein structures.

Chapter 3

Backbone Simplification

3.1 Introduction

Protein structures are examined at different levels of detail in various applications. Simpler structures are exploited either when time complexity is too high, or when excessive detail might obscure crucial features or principles that one would like to observe. It is therefore desirable to build a level-of-detail (LOD) representation for protein structures. One way to achieve such a representation for protein backbones is via curve simplification.

Given a polygonal curve, the curve-simplification problem asks for computing another polygonal curve that approximates the original curve under a predefined error criterion and whose size is as small as possible. Beyond its potential use in simplifying protein structures, curve simplification is widely applied in numerous areas, such as geographic information systems (GIS), computer vision, computer graphics, and data compression. Simplification helps to remove unnecessary cluttering due to excessive detail, to save the memory needed to store a curve, and to expedite the processing of a curve. For example, one of the main problems in computational cartography is to visualize geo-spatial information as a simple and easily readable map. To this end, curve simplification is used to represent rivers, road-lines, coast-lines, and other linear features at an appropriate level of detail when a map of a large area is being produced.

In this chapter, we study the curve-simplification problem under the so-called Fréchet error measure, and propose the first near-linear-time algorithm with guaranteed quality for simplifying curves in ℝ^d. Below we first introduce the curve-simplification problem formally.


Problem definition. Let P = ⟨p₁, p₂, …, pₙ⟩ denote a polygonal curve in ℝ^d with p₁, …, pₙ as its sequence of vertices. A polygonal curve P' = ⟨p_{i₁}, p_{i₂}, …, p_{i_k}⟩ simplifies P if 1 = i₁ < i₂ < ⋯ < i_k = n. Given an error measure M and a pair of indices 1 ≤ i < j ≤ n, let Δ_M(p_i p_j) denote the error of the segment p_i p_j with respect to P under the error measure M. Intuitively, Δ_M(p_i p_j) measures how well p_i p_j approximates the portion of P between p_i and p_j, denoted π(p_i, p_j). The error of a simplification P' = ⟨p_{i₁}, …, p_{i_k}⟩ of P is defined as

    Δ_M(P') = max_{1 ≤ l < k} Δ_M(p_{i_l} p_{i_{l+1}}).

We call P' an ε-simplification of P, under the error measure M, if Δ_M(P') ≤ ε. Let κ_M(ε, P) denote the minimum number of vertices of an ε-simplification of P under the error measure M. Given a polygonal curve P, an error measure M, and a parameter ε ≥ 0, the curve-simplification problem asks for computing an ε-simplification of P of size κ_M(ε, P).

We now define the error measure we study in this chapter. Let d : ℝ^d × ℝ^d → ℝ be a distance function, e.g., the Euclidean distance, or the L₁ or L∞ norm. Given two curves f : [a, a'] → ℝ^d and g : [b, b'] → ℝ^d, the Fréchet distance between them under the metric d, F(f, g), is defined as

    F(f, g) = inf_{α, β} max_{t ∈ [0, 1]} d(f(α(t)), g(β(t))),        (3.1)

where α : [0, 1] → [a, a'] and β : [0, 1] → [b, b'] range over continuous and monotonically non-decreasing functions with α(0) = a, α(1) = a', β(0) = b, and β(1) = b'. If α and β are the maps that realize the Fréchet distance, then we refer to the map g ∘ β ∘ α^{−1} as the Fréchet map from f to g.

For a pair of indices 1 ≤ i < j ≤ n, the Fréchet error of a segment p_i p_j is defined to be Δ_F(p_i p_j) = F(π(p_i, p_j), p_i p_j), where π(p_i, p_j) denotes the portion of P from p_i to p_j.
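The continuous distance in (3.1) is what this chapter works with; for intuition, its standard discrete counterpart, which couples vertex sequences monotonically instead of reparameterizing full curves, can be computed by a simple dynamic program. This is a sketch for illustration, not an algorithm used in the thesis:

```python
def discrete_frechet(P, Q, d):
    """Discrete Frechet distance between vertex sequences P and Q under a
    point metric d (Eiter-Mannila style dynamic program).  ca[i][j] holds
    the cost of the best monotone coupling of P[:i+1] and Q[:j+1]."""
    n, m = len(P), len(Q)
    INF = float("inf")
    ca = [[INF] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            cost = d(P[i], Q[j])
            if i == 0 and j == 0:
                ca[i][j] = cost
            elif i == 0:
                ca[i][j] = max(ca[0][j - 1], cost)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][0], cost)
            else:
                ca[i][j] = max(min(ca[i - 1][j],
                                   ca[i - 1][j - 1],
                                   ca[i][j - 1]), cost)
    return ca[n - 1][m - 1]
```

For densely sampled curves the discrete value approaches the continuous Fréchet distance from above.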

Most previous work has focused on the so-called Hausdorff error measure. We define it here as well, as we will compare simplification under the Fréchet and Hausdorff error measures later in this chapter. If we define the distance between a point p and a line segment e as d(p, e) = min_{q ∈ e} d(p, q), then the Hausdorff error of a segment p_i p_j under the metric d, also referred to as the d-Hausdorff error measure, is defined as

    Δ_H(p_i p_j) = max_{p ∈ π(p_i, p_j)} d(p, p_i p_j).

An ε-simplification under the Hausdorff error measure and κ_H(ε, P) are defined similarly to the case of the Fréchet error measure.

If we remove the constraint that the vertices of P' are a subset of the vertices of P, then P' is called a weak ε-simplification of P. Let κ̃_F(ε, P) denote the minimum number of vertices in a weak Fréchet ε-simplification of P.

3.2 Prior and New Work

Previous work. The problem of approximating a polygonal curve P has been studied extensively during the last two decades; see [100, 167] for surveys. Imai and Iri [109] formulated the curve-simplification problem as computing a shortest path between two nodes in a directed acyclic graph G_P: each vertex of P corresponds to a node in G_P, and there is an edge between two nodes p_i and p_j, i < j, if Δ(p_i p_j) ≤ ε. A shortest path from p₁ to pₙ in G_P corresponds to an optimal ε-simplification of P under the error measure Δ. In ℝ², under the Hausdorff measure with the so-called uniform metric¹, their algorithm takes O(n² log n) time. Chin and Chan [50], and Melkman and O’Rourke [134] improve the running time of their algorithm to quadratic. Agarwal and Varadarajan [3] improve the running time to O(n^{4/3+δ}) for the L₁- and uniform-Hausdorff error measures, for an arbitrarily small constant δ > 0, by implicitly representing the graph G_P. In dimensions higher than two, Barequet et al. compute the optimal ε-simplification under the L₁- or L∞-Hausdorff error measures in quadratic time [29]. For the L₂-Hausdorff error measure, an optimal simplification can be computed in near-quadratic time in ℝ³, and with somewhat larger polynomial, polylogarithmic-factor overhead in higher dimensions.

¹ The uniform metric in ℝ² is defined as follows: given two points p = (p_x, p_y) and q = (q_x, q_y), d(p, q) = |p_y − q_y| if p_x = q_x, and d(p, q) = +∞ otherwise.
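The Imai–Iri formulation can be sketched directly: build the graph implicitly and run a breadth-first search, which finds a minimum-link path because every edge has unit weight. This is hypothetical illustrative code; the oracle `err` stands in for whichever error measure Δ is chosen, and no implicit-representation speedup (as in [3]) is attempted.

```python
from collections import deque

def optimal_simplification(P, err, eps):
    """Imai-Iri style optimal simplification: BFS over the DAG whose edge
    (i, j), i < j, exists when err(P, i, j) <= eps.  Returns the vertex
    indices of a minimum-size eps-simplification; O(n^2) oracle calls."""
    n = len(P)
    prev = [None] * n
    seen = [False] * n
    seen[0] = True
    q = deque([0])
    while q:
        i = q.popleft()
        if i == n - 1:
            break
        for j in range(i + 1, n):
            if not seen[j] and err(P, i, j) <= eps:
                seen[j] = True
                prev[j] = i
                q.append(j)
    path, k = [], n - 1
    while k is not None:       # recover the path back to the first vertex
        path.append(k)
        k = prev[k]
    return path[::-1]
```

Plugging in a Hausdorff-style segment-error oracle reproduces the Imai–Iri behavior on small inputs; the same skeleton works for the Fréchet error of Section 3.1.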

Curve simplification using the Fréchet error measure was first proposed by Godau [92], who showed that κ_F(ε, P) ≤ κ̃_F(ε/7, P). Alt and Godau [16] also proposed an O(mn)-time algorithm to determine whether F(P, Q) ≤ ε for given polygonal curves P and Q of size m and n, respectively, and for a given parameter ε. Following the approach of Imai and Iri [109], an ε-simplification of P, under the Fréchet error measure, of size κ_F(ε, P) can be computed in O(n³) time.

The problem of developing an optimal near-linear ε-simplification algorithm remains elusive. Among the several heuristics that have been proposed over the years, the most widely used is the Douglas-Peucker method [75] (together with its variants). Originally proposed for simplifying curves under the Hausdorff error measure, its worst-case running time is O(n²) in ℝ^d. For ℝ² the running time is improved by Snoeyink et al. [101] to O(n log* n). However, the Douglas-Peucker heuristic does not offer any guarantee on the size of the simplified curve: it can return an ε-simplification of size n even if κ_H(ε, P) = O(1).

Much work has been done on computing a weak ε-simplification of a polygonal curve P. Imai and Iri [108] give an optimal O(n)-time algorithm for finding an optimal weak ε-simplification (under the Hausdorff error measure) of an x-monotone curve in ℝ². As for weak ε-simplification of planar curves under the Fréchet distance, Guibas et al. [96] proposed an O(n log n)-time factor-2 approximation algorithm and an O(n²)-time exact algorithm. They also proposed linear-time algorithms for approximating some other variants of weak simplifications.

The problem of simplifying curves becomes much harder when additional constraints such as topology preservation and non-intersection requirements are introduced. Given a set of non-intersecting curves, the problem of simplifying the curves optimally so that the simplified curves are also non-intersecting is NP-hard; in fact, it is hard to approximate within a factor of n^{1/5−δ}, for any δ > 0, where n is the total number of vertices of the curves [83]. Guibas et al. [96] show that the problem of computing an optimal non-intersecting simplification of a simple polygon is NP-hard, and that computing the optimal weak simplification of a set of non-intersecting curves is also NP-hard.

Our results. Let P be a polygonal curve in ℝ^d, and let ε > 0 be a parameter. In Section 3.3, we present a simple, near-linear algorithm for computing an ε-simplification of P of size at most κ_F(ε/2, P) under the Fréchet error measure.

Theorem 3.2.1 Let P be a polygonal curve in ℝ^d with n vertices, and let ε ≥ 0 be a parameter. We can compute in O(n log n) time a simplification P' of P of size at most κ_F(ε/2, P) so that Δ_F(P') ≤ ε, assuming that the distance between points is measured in any L_p-metric.

To our knowledge, this is the first simple, near-linear approximation algorithm for curve simplification with guaranteed quality that extends to ℝ^d for arbitrary curves. We illustrate its simplicity and efficiency by comparing its performance with the Douglas-Peucker and exact algorithms in Section 3.4. Our experimental results on various data sets show that our algorithm is efficient and produces ε-simplifications of near-optimal size.

Also in Section 3.3, we compare curve simplification under the Hausdorff and Fréchet error measures, and we show that κ_F(ε, P) ≤ κ̃_F(ε/4, P), thereby improving the result by Godau [92].

3.3 Fréchet Simplification

Let P = ⟨p₁, …, pₙ⟩ be a polygonal curve in ℝ^d, and let ε ≥ 0 be a parameter. In this section, we first prove a few properties of the Fréchet error measure. We then present an approximation algorithm for simplification under the Fréchet error measure. At the end of this section, we compare Fréchet simplification with some other versions of simplification. Let F(·, ·) be as defined in (3.1), and let d(·, ·) denote the Euclidean distance between two points in ℝ^d.

¡ ¥ ¢ ¡ ¥ ¥ ¡ ¥

" ¥ ¥ ¡

¨ %

©

¢ ¡ ¥ ¥ ¡ ¥ ¡ ¥

¥ ¡ ! ¥ ¥

PROOF. Let . First , since (resp. ) has to

 £¡

¥ ¥ ¡

be matched to (resp. ). Assume the natural parameterization for segment ,

¤ ¤

¨

¡ ¡

¤ ¤

¥ ¦ £¡ ¡ § ¡ ¥ ¦ 

¥ ¥ ¥ ¥ £ ¥ ¥ ¡

, such that . Similarly, define for

¨

 § ¡  ¡  £¡ § ¡

¥ ¥ £ ¥ ¥

segment , such that . For any two matched points and , let

¨ ¨

¡

¡ £¡ § ¡ ¡ ¡  ¡

¡ ¥ ¥ £ ¥ £ ¥ ¥ £ ¥ £

¤

 

£ ¡ £

¡ ¡ ¥ ¦ ¡ ¥ 

¥ ¡ ¥ ¥ § ¥

Since ¡ is a convex function, for any . Therefore .
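Lemma 3.3.1 translates directly into a two-line routine; the sampling loop in the usage check below merely double-checks the convexity argument numerically. (A sketch; `seg_frechet` is a name chosen here, not from the thesis.)

```python
def seg_frechet(p, q, r, s):
    """Frechet distance between directed segments pq and rs (Lemma 3.3.1):
    the maximum of the two endpoint distances.  Under the common linear
    parameterization, the pointwise distance is convex in t, so it is
    maximized at an endpoint."""
    d = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return max(d(p, r), d(q, s))
```

Sampling intermediate matched points confirms that no interior pair is farther apart than the endpoint bound.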

Lemma 3.3.2 Given a polygonal curve C in ℝ^d and two directed segments pq and rs,

    F(C, rs) ≤ F(C, pq) + F(pq, rs).

PROOF. Assume that A(t) = (1 − t) p + t q, t ∈ [0, 1], is the natural parameterization of the segment pq. Let c(t), t ∈ [0, 1], be a parameterization of the polygonal curve C such that A(t) and c(t) realize the Fréchet distance between pq and C. As in the proof of Lemma 3.3.1, let B(t) = (1 − t) r + t s, so that F(pq, rs) = max_t d(A(t), B(t)). By the triangle inequality, d(c(t), B(t)) ≤ d(c(t), A(t)) + d(A(t), B(t)) for any t ∈ [0, 1], yielding the lemma.

Figure 3.1: The dashed curve is the subchain π(p_i, p_j); the vertices p_{i'} and p_{j'} are mapped to the points p̂_{i'} and p̂_{j'} on the segment p_i p_j, respectively.

Lemma 3.3.3 Let P = ⟨p₁, …, pₙ⟩ be a polygonal curve in ℝ^d. For 1 ≤ i ≤ i' ≤ j' ≤ j ≤ n,

    Δ_F(p_{i'} p_{j'}) ≤ 2 Δ_F(p_i p_j).

PROOF. Let δ = Δ_F(p_i p_j) = F(π(p_i, p_j), p_i p_j). Let f be the Fréchet map from the segment p_i p_j to the subchain π(p_i, p_j) (see Section 3.1 for the definition). For any vertex p_l with i ≤ l ≤ j, set p̂_l ∈ p_i p_j to be the point with f(p̂_l) = p_l; see Figure 3.1 for an illustration. By definition, d(p_l, p̂_l) ≤ δ. In particular, d(p_{i'}, p̂_{i'}) ≤ δ and d(p_{j'}, p̂_{j'}) ≤ δ. By Lemma 3.3.1, F(p̂_{i'} p̂_{j'}, p_{i'} p_{j'}) ≤ δ. Moreover, restricting the Fréchet map to the subchain shows that F(π(p_{i'}, p_{j'}), p̂_{i'} p̂_{j'}) ≤ δ. It then follows from Lemma 3.3.2 that

    Δ_F(p_{i'} p_{j'}) = F(π(p_{i'}, p_{j'}), p_{i'} p_{j'}) ≤ F(π(p_{i'}, p_{j'}), p̂_{i'} p̂_{j'}) + F(p̂_{i'} p̂_{j'}, p_{i'} p_{j'}) ≤ 2δ.

3.3.1 Algorithm

Our simplification algorithm is a greedy approach (Figure 3.2). Suppose we have already added p_{i₁}, …, p_{i_k} to P'. We then find an index j > i_k such that (i) Δ_F(p_{i_k} p_j) ≤ ε and (ii) Δ_F(p_{i_k} p_{j+1}) > ε. We set i_{k+1} = j and add p_j to P'. We repeat this process until we encounter pₙ. We then add pₙ to P'.

ALGORITHM GreedyFrechetSimp (P, ε)
    Input: a polygonal curve P = ⟨p₁, …, pₙ⟩ and a parameter ε ≥ 0;
    Output: a simplification P' of P such that Δ_F(P') ≤ ε.
    begin
        P' := ⟨p₁⟩; i := 1;
        while (i < n) do
            t := 0;
            (* exponential search *)
            while (i + 2^{t+1} ≤ n and Δ_F(p_i p_{i+2^{t+1}}) ≤ ε) do
                t := t + 1;
            end while
            low := i + 2^t; high := min(i + 2^{t+1}, n);
            (* binary search for j with Δ_F(p_i p_j) ≤ ε < Δ_F(p_i p_{j+1}) *)
            while (low < high) do
                mid := ⌈(low + high)/2⌉;
                if (Δ_F(p_i p_mid) ≤ ε) then low := mid;
                else high := mid − 1;
            end while
            append p_low to P'; i := low;
        end while
    end

Figure 3.2: Computing an ε-simplification under the Fréchet error measure.
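The skeleton of Figure 3.2 can be made runnable in Python. In this sketch the Alt-Godau decision procedure is replaced by a discrete-Fréchet stand-in `seg_error` that samples the segment, so the errors are approximate; the doubling-plus-binary-search structure is the point of the example, and all names here are chosen for illustration.

```python
def _dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def _dfrechet(P, Q):
    # discrete Frechet distance between point sequences (standard DP)
    INF = float("inf")
    ca = [[INF] * len(Q) for _ in P]
    for i in range(len(P)):
        for j in range(len(Q)):
            c = _dist(P[i], Q[j])
            best = c if i == j == 0 else min(
                ca[i - 1][j] if i else INF,
                ca[i][j - 1] if j else INF,
                ca[i - 1][j - 1] if i and j else INF)
            ca[i][j] = max(best, c)
    return ca[-1][-1]

def seg_error(P, i, j):
    """Stand-in for the Alt-Godau test: discrete Frechet distance between
    the subchain P[i..j] and the segment p_i p_j sampled at j-i+1 points."""
    k = j - i
    seg = ([tuple(P[i][m] + t / k * (P[j][m] - P[i][m]) for m in range(len(P[0])))
            for t in range(k + 1)] if k else [P[i]])
    return _dfrechet(P[i:j + 1], seg)

def greedy_simplify(P, eps):
    """Greedy Frechet simplification: from the last kept vertex, find by
    doubling plus binary search a far index j with seg_error(P, i, j) <= eps."""
    keep, i, n = [0], 0, len(P)
    while i < n - 1:
        step = 1
        while i + 2 * step < n and seg_error(P, i, i + 2 * step) <= eps:
            step *= 2
        lo, hi = i + step, min(i + 2 * step, n - 1)  # lo is known to be good
        while lo < hi:
            mid = (lo + hi + 1) // 2
            if seg_error(P, i, mid) <= eps:
                lo = mid
            else:
                hi = mid - 1
        keep.append(lo)
        i = lo
    return keep
```

On a chain that is straight except for a final spike, the greedy search jumps over the collinear vertices with only a logarithmic number of oracle calls.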

Alt and Godau [16] have developed an algorithm that, given a pair (i, j), can determine in O(j − i) time whether Δ_F(p_i p_j) ≤ ε. Therefore, a first approach would be to add vertices greedily one by one, starting with the first vertex p₁, and testing each edge p_i p_j, for j = i + 1, i + 2, …, by invoking the Alt-Godau algorithm, until we find the index j. However, the overall algorithm could take Ω(n²) time. To limit the number of times that the Alt-Godau algorithm is invoked when computing the index j, we proceed as follows.

First, by an exponential search, we determine an integer t ≥ 0 so that Δ_F(p_i p_{i+2^t}) ≤ ε and Δ_F(p_i p_{i+2^{t+1}}) > ε. Next, by performing a binary search in the interval [i + 2^t, i + 2^{t+1}], we determine an integer j such that Δ_F(p_i p_j) ≤ ε and Δ_F(p_i p_{j+1}) > ε. Note that in the worst case, the asymptotic costs of the exponential and binary searches are the same. Set i_{k+1} = j. See Figure 3.2 for pseudo-code of this algorithm. Since computing the value of i_{k+1} requires invoking the Alt-Godau algorithm O(log n) times, each with a pair (i, l) such that l − i = O(i_{k+1} − i_k), the total time spent in computing the value of i_{k+1} is O((i_{k+1} − i_k) log n). Hence, the overall running time of the algorithm is O(n log n).

Theorem 3.3.4 Given a polygonal curve P = ⟨p₁, …, pₙ⟩ in ℝ^d and a parameter ε ≥ 0, we can compute in O(n log n) time an ε-simplification P' of P under the Fréchet error measure so that |P'| ≤ κ_F(ε/2, P).

PROOF. Compute P' = ⟨p_{i₁}, …, p_{i_k}⟩ by the algorithm described above. By construction, Δ_F(P') ≤ ε, so it suffices to prove that k ≤ κ_F(ε/2, P). Let κ = κ_F(ε/2, P), and let Q = ⟨p_{j₁}, …, p_{j_κ}⟩ be an optimal (ε/2)-simplification of P of size κ.

We claim that i_l ≥ j_l for all 1 ≤ l ≤ min(k, κ). This would imply that k ≤ κ.

We prove the above claim by induction on l. For l = 1, the claim is obviously true because i₁ = j₁ = 1. Suppose i_{l−1} ≥ j_{l−1}. If i_l ≥ j_l, we are done. So assume that i_l < j_l. Since Q is an (ε/2)-simplification, Δ_F(p_{j_{l−1}} p_{j_l}) ≤ ε/2. Lemma 3.3.3 implies that Δ_F(p_u p_v) ≤ ε for all j_{l−1} ≤ u ≤ v ≤ j_l. But by construction, Δ_F(p_{i_{l−1}} p_{i_l + 1}) > ε, while j_{l−1} ≤ i_{l−1} and i_l + 1 ≤ j_l would force Δ_F(p_{i_{l−1}} p_{i_l + 1}) ≤ ε; therefore i_l + 1 > j_l, and thus i_l ≥ j_l.

Remark. Our algorithm works within the same running time even if we measure the distance between two points in any L_p-metric.

3.3.2 Comparisons

Figure 3.3: (a) Polygonal chain (a piece of protein backbone) composed of three alpha-helices, (b) its Fréchet ε-simplification, and (c) its Hausdorff ε-simplification.

Hausdorff vs. Fréchet. One natural question is to compare the quality of simplifications produced under the Hausdorff and the Fréchet error measures. Given a curve P = ⟨p₁, …, pₙ⟩, it is not too hard to show that Δ_H(p_i p_j) ≤ Δ_F(p_i p_j) under any L_p-metric, which implies that κ_H(ε, P) ≤ κ_F(ε, P). The inverse, however, does not hold, and there are polygonal curves P and values of ε for which κ_H(ε, P) = O(1) while κ_F(ε, P) = n.

The Fréchet error measure takes the order along the curve into account, and hence is more useful in some cases, especially when the order of the curve is important (such as curves derived from protein backbones). Figure 3.3 illustrates a substructure of a protein backbone, where the ε-simplification under the Fréchet error measure preserves the overall structure, while the ε-simplification under the Hausdorff error measure is unable to preserve it.

We remark that the Douglas-Peucker algorithm is also based on the Hausdorff error measure. Therefore the above discussion holds for it as well.

Weak Fréchet vs. Fréchet. In Section 3.3.1 we described a fast approximation algorithm for computing an ε-simplification of P under the Fréchet error measure, where we used the Fréchet measure in a local manner: we restrict the subcurve π(p_i, p_j) to match to the line segment p_i p_j. We can remove this restriction to make the measure more global by considering weak ε-simplifications. More precisely, given P = ⟨p_1, …, p_n⟩ and Q = ⟨q_1, …, q_k⟩, where q_i does not necessarily lie on P, Q is a weak ε-simplification of P under the Fréchet error measure if δ_F(P, Q) ≤ ε. The following theorem shows that, for the Fréchet error measure, the size of the optimal ε-simplification (κ_F(ε, P)) can be bounded in terms of the size of the optimal weak ε-simplification (κ_W(ε, P)):
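The Fréchet distance δ_F between two vertex sequences can be made concrete through its standard discrete variant, the Eiter-Mannila dynamic program. This sketch is included only as an illustration of the measure, not as part of the algorithm analyzed in this section.

```python
import math
from functools import lru_cache

def discrete_frechet(P, Q):
    """Discrete Frechet distance between chains P and Q (lists of points):
    the smallest leash length over monotone couplings of vertex pairs."""
    @lru_cache(maxsize=None)
    def c(i, j):
        d = math.dist(P[i], Q[j])
        if i == 0 and j == 0:
            return d
        if i == 0:
            return max(c(0, j - 1), d)
        if j == 0:
            return max(c(i - 1, 0), d)
        return max(min(c(i - 1, j), c(i - 1, j - 1), c(i, j - 1)), d)
    return c(len(P) - 1, len(Q) - 1)
```

Unlike the Hausdorff distance, this measure respects the order along the curves: reversing one of two parallel chains changes the value from the offset between them to roughly their length.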


Figure 3.4: Relationship between the weak simplification Q and the constructed simplification P′ in the proof of Theorem 3.3.5: (a) the points σ(q_i) on P and their nearby edge endpoints p_{r_i}; (b) the map φ_i; (c) the points s and t.

Theorem 3.3.5 Given a polygonal curve P,

    κ_F(ε, P) ≤ κ_W(ε/4, P) ≤ κ_F(ε/4, P).

PROOF. The second inequality is obvious, so we prove the first one. Let Q = ⟨q_1, …, q_k⟩ be an optimal weak (ε/4)-simplification of P, and let σ: Q → P be a Fréchet map so that ‖x − σ(x)‖ ≤ ε/4 for all x ∈ Q (see Section 3.1 for the definition). For 1 ≤ i ≤ k, let e_i be the edge of P that contains σ(q_i), and let p_{r_i} be the endpoint of e_i that is closer to σ(q_i). We set P′ = ⟨p_{r_1}, …, p_{r_k}⟩; we remove a vertex from this sequence if it is the same as its predecessor. Clearly P′ is a subsequence of the vertices of P of size at most k, so P′ is a simplification of P. Next we show that δ_F(p_{r_i} p_{r_{i+1}}, π(p_{r_i}, p_{r_{i+1}})) ≤ ε, for all 1 ≤ i < k, which then implies the theorem.

Let x_i = σ(q_i) and x_{i+1} = σ(q_{i+1}). See Figure 3.4 (a) for an illustration.

Claim 3.3.6 δ_F(x_i x_{i+1}, π(x_i, x_{i+1})) ≤ ε/2.


PROOF. By construction, ‖q_i − x_i‖ ≤ ε/4 and ‖q_{i+1} − x_{i+1}‖ ≤ ε/4. Therefore, δ_F(q_i q_{i+1}, x_i x_{i+1}) ≤ ε/4 (by Lemma 3.3.1). On the other hand, since Q is a weak (ε/4)-simplification of P, δ_F(q_i q_{i+1}, π(x_i, x_{i+1})) ≤ ε/4. The claim then follows from Lemma 3.3.2.

Let τ: π(x_i, x_{i+1}) → x_i x_{i+1} be a Fréchet map such that ‖y − τ(y)‖ ≤ ε/2 for all y ∈ π(x_i, x_{i+1}).

Let ℓ_i be the line containing the edge e_i. We define a map φ_i that maps a point y ∈ x_i x_{i+1} to the intersection point of ℓ_i with the line through y parallel to the segment x_i p_{r_i}. See Figure 3.4 (b) for an illustration. For the configuration depicted in Figure 3.4, one can infer that

    ‖y − φ_i(y)‖ ≤ ε/8 for all y ∈ x_i x_{i+1}.    (3.2)

Similarly, we define a map φ_{i+1} that maps a point y ∈ x_i x_{i+1} to the intersection point of ℓ_{i+1} with the line through y parallel to the segment x_{i+1} p_{r_{i+1}}. As above,

    ‖y − φ_{i+1}(y)‖ ≤ ε/8 for all y ∈ x_i x_{i+1}.    (3.3)

Let s (resp. t) denote the point φ_i(x_i) (resp. φ_{i+1}(x_{i+1})); see Figure 3.4 (c).

Claim 3.3.7 δ_F(s t, p_{r_i} p_{r_{i+1}}) ≤ ε/4.

PROOF. Since ‖s − p_{r_i}‖, ‖t − p_{r_{i+1}}‖ ≤ ε/4, the claim follows from Lemma 3.3.1.

Claim 3.3.8 δ_F(s t, π(p_{r_i}, p_{r_{i+1}})) ≤ 3ε/4.

PROOF. By the definition of the Fréchet distance, we can bound δ_F(s t, π(p_{r_i}, p_{r_{i+1}})) by the sum of three terms: the displacement of the map φ_i, the distance realized by the map τ, and the displacement of the map φ_{i+1}. By (3.2) and (3.3), the first and the third terms are at most ε/8 each. To bound the second term, observe that τ matches every point of π(x_i, x_{i+1}) to a point of x_i x_{i+1}; it then follows from the definition of the map τ and Claim 3.3.6 that the second term is at most ε/2. This proves the claim.

Claims 3.3.7 and 3.3.8, along with Lemma 3.3.2, imply that δ_F(p_{r_i} p_{r_{i+1}}, π(p_{r_i}, p_{r_{i+1}})) ≤ ε/4 + 3ε/4 = ε, and therefore κ_F(ε, P) ≤ |P′| ≤ k = κ_W(ε/4, P). This completes the proof of the theorem.

3.4 Experiments

We have implemented our simplification algorithm, GreedyFrechetSimp, and the cubic-time optimal Fréchet simplification algorithm (referred to as Exact), which computes an ε-simplification of P of size κ_F(ε, P) by constructing a shortest path in a certain graph, as outlined in Section 3.2. In this section, we measure the performance of our algorithm in terms of the output size and the running time.

Data sets. We test our algorithms on two different types of data sets, each of which is a family of polygonal curves in ℝ³.

Protein backbones. The first set of curves is derived from protein backbones by adding small random “noise” vertices along the edges of the backbone. We have chosen two backbones from the protein data bank [2]. Protein A is the original backbone of the first protein, with 327 vertices. Protein B is the backbone of the second protein with random vertices added, for a total of 9,777 vertices: the new vertices are uniformly distributed along the curve edges, and then perturbed slightly in a random direction.

Stock-index curves. The second family of curves is generated from the daily NASDAQ index values over the period January 1971 to January 2003 (the data is obtained from [1]). We take a pair of index curves and combine them with time to generate a curve in ℝ³. In particular, we take the telecommunication index and the biotechnology index as the x- and y-coordinates and time as the z-coordinate to construct the curve Tel-Bio in ℝ³. In the second case, we take the transportation index, the telecommunication index, and time as the x-, y-, and z-coordinates, respectively, to construct the curve Trans-Tel in ℝ³.

These two families of curves have different structures. The stock-index curves ex- hibit an easily distinguishable global trend; however, locally there is a lot of noise. The protein curves, though coiled and irregular-looking, exhibit local patterns that represent the structural elements of the protein backbones (commonly referred to as the secondary struc- tures). In each of these cases, simplification helps identify certain patterns (e.g., secondary structure elements) and trends in the data.

Output size. We compare the quality (size) of the simplifications produced by our algorithm (GreedyFrechetSimp) and the optimal algorithm (Exact) in Figure 3.5, for curves from the above two families respectively. The simplifications produced by our algorithm are almost always close in size to the optimal ones.

To provide a visual picture of the simplifications produced by various commonly used algorithms for curves in ℝ³, Figure 3.8 shows the simplifications of Protein A computed by GreedyFrechetSimp, the exact (i.e., optimal) Fréchet simplification algorithm, and the Douglas-Peucker heuristic (using the Hausdorff error measure).

Running time. As the running time of the optimal algorithm is orders of magnitude larger than that of our algorithm, we instead compare the efficiency of GreedyFrechetSimp with the widely used Douglas-Peucker simplification algorithm under the Hausdorff measure.

Output size

                  ProteinA (327)          ProteinB (9,777)
  ε       GreedyFrechetSimp  Exact   GreedyFrechetSimp  Exact
  0.05          327           327          6786          6431
  0.12          327           327          1537           651
  1.20          254           249           178           168
  1.60          220           214           140           132
  2.00          134           124           115            88
  5.00           37            36            41            39
  10.0           22            22            24            20
  20.0           10             8             8             6
  50.0            2             2             2             2
                                (a)

Output size

                  Trans-Tel (7,057)       Tel-Bio (1,559)
  ε       GreedyFrechetSimp  Exact   GreedyFrechetSimp  Exact
  0.05         6882          6880          1558          1558
  0.50         4601          4469          1473          1471
  1.20         2811          2637          1292          1279
  3.00         1396          1228           974           942
  5.00          890           732           772           720
  10.0          414           329           490           402
  20.0          168           124           243           200
  50.0           47            35            94            73
                                (b)

Figure 3.5: The sizes of Fréchet simplifications on (a) protein data and (b) stock-index data.

Running time (ms.)

                  ProteinA (327)       ProteinB (9,777)
  ε       GreedyFrechetSimp  DP    GreedyFrechetSimp  DP
  0.05           3           16          146          772
  0.50           3           16          171          524
  1.20           4           16          176          488
  1.60           5           12          202          394
  2.00           5           11          210          354
  5.00           5           11          209          356
  10.0           5           10          222          329
  20.0           5            8          233          263
  50.0           2            1           87           50
                              (a)

Running time (ms.)

                  Trans-Tel (7,057)    Tel-Bio (1,559)
  ε       GreedyFrechetSimp  DP    GreedyFrechetSimp  DP
  0.05          82          599          16          113
  0.50         103          580          17          114
  1.20         113          559          19          113
  3.00         119          510          22          109
  5.00         121          472          24          109
  10.0         127          411          25           96
  20.0         146          360          27           85
  50.0         162          271          27           71
                              (b)

Figure 3.6: The running times of GreedyFrechetSimp and the Douglas-Peucker algorithm on (a) protein data and (b) stock-index data.


Figure 3.7: Comparison of the running times of the GreedyFrechetSimp and DP algorithms for varying ε on Protein B.

We can extend the Douglas-Peucker algorithm to simplify curves under the Fréchet error measure; however, such an extension is inefficient in the worst case.

Figure 3.6 illustrates the running times of the two algorithms. Note that as ε increases, resulting in smaller simplified curves, the running time of Douglas-Peucker decreases. This

phenomenon is further illustrated in Figure 3.7, which compares the running time of our algorithm with that of Douglas-Peucker on Protein B (with artificial noise added), which has 9,777 vertices. The phenomenon is due to the fact that, at each step, the DP algorithm determines whether a line segment p_i p_j simplifies the subchain π(p_i, p_j). The algorithm recursively solves two subproblems only if the error of this shortcut exceeds ε. Thus, as ε increases, it needs to make fewer recursive calls. Our algorithm, however, proceeds in a linear fashion from the first vertex to the last vertex using exponential and binary search. Suppose the algorithm returns P′ = ⟨p_{i_1}, …, p_{i_k}⟩ for an input polygonal curve P. For the j-th output vertex, the exponential search takes time proportional to n_j, while the binary search takes O(n_j log n_j) time, where n_j is the number of vertices of π(p_{i_j}, p_{i_{j+1}}). Thus, as ε increases, n_j increases, and therefore the time spent in binary search increases, as Figure 3.7 illustrates. Note however that if ε is so large that k = 2, i.e., the simplification is just one line segment connecting the first to the last vertex, the algorithm does not perform any binary search and is much faster, as the case ε = 50 illustrates in Figure 3.7.

3.5 Notes and Discussions

In this chapter, we have proposed and developed a simple near-linear time curve simplification algorithm. Besides being the first near-linear simplification algorithm for curves in ℝ³, our algorithm tends to preserve long but relatively skinny features (Figure 3.3), because we use the Fréchet error measure. This property is desirable when simplifying protein backbones, as it helps to maintain a trace of secondary structural elements, such as alpha-helices and beta-strands, in the simplified structures.

It would be interesting to see whether our algorithm can help produce an automatic method to identify protein secondary structure elements. It is also possible to generate a level-of-detail representation of a protein backbone via simplification and compute its writhing number (or other shape descriptors) at different scales, in order to characterize protein structures. Such a level-of-detail representation can also be used when comparing protein backbones, as current structural alignment methods usually have high computational complexity.

We end this chapter by mentioning a few open problems related to curve simplification.

(i) Does there exist a near-linear algorithm for computing an ε-simplification of size at most c · κ_F(ε, P) for a polygonal curve P, where c is a constant?

(ii) Is it possible to compute the optimal ε-simplification under the Hausdorff error measure in near-linear time, or under the Fréchet error measure in sub-cubic time?

(iii) Is there any provably efficient exact or approximation algorithm for curve simplification that returns a simple curve if the input curve is simple?

Figure 3.8: Simplifications of a protein (Protein A) backbone, computed by GreedyFrechetSimp, Exact, and DP at several values of ε.

Chapter 4

Elevation Function

4.1 Introduction

The starting point of the work described in this chapter is the desire to identify features

that are useful in finding a fit between solid shapes in ℝ³. We are looking for cavities and protrusions and a way to measure their size. The problem is made difficult by the interaction of these features, which typically exist at various scale levels. We therefore take an indirect approach, defining a real-valued function on the surface that is sensitive to the features of the shape. We call this the elevation function because it has similarities to the elevation measured on the surface of the Earth, but the problem for general surfaces is more involved and the analogy is not perfect.

Related work in protein docking. The primary motivation for designing the elevation function to characterize protein surfaces is protein docking, which is the computational approach to predicting protein interactions, a biophysical phenomenon at the very core of life. The phenomenon is clearly important and the interest in protein docking is correspondingly widespread. The related work on attacking the docking problem will be surveyed in Chapter 6; here we only mention some survey articles on docking algorithms [81, 97, 114]. The idea of docking by matching cavities with protrusions goes back to Crick [69] and Connolly [67]. Connolly also introduced the idea of using the critical points of a real-valued function defined on the protein surface to identify cavities and protrusions. The particular function he used is the fraction of a fixed-size sphere that is buried inside the protein volume as the sphere center moves on the protein surface. In the limit, when the size of the sphere goes to zero, this function has the same critical points as the mean

curvature function [48]. A similar but different function suggested for the same purpose is the atomic density [136]. Here we take the buried fraction of the ball bounded by the sphere, but we also vary its radius from zero to about ten angstroms. At every point of the protein surface, the function value is the fraction of buried volume averaged over the balls centered at that point.

Our results. The main contribution of this chapter is the description and computation

of a new type of feature point that marks extreme cavities and protrusions on a surface embedded in ℝ³. More specifically,

we extend the concept of topological persistence [77] to form a pairing between all

critical points of a function on a 2-manifold embedded in ℝ³;

we use the pairings obtained for a 2-parameter family of height functions to define the elevation function on the 2-manifold;

we classify the generic local maxima of the elevation function into four types;

we develop and implement an algorithm that computes all local maxima of the ele- vation function.

The elevation differs from Connolly's function and the atomic density function in two major ways: it is independent of scale and it provides, beyond location, estimates for the direction and size of shape features. Both additional pieces of information are useful in shape characterization and matching. The four generic types of local maxima are illustrated in Figure 4.1. In each but the first case, the maximum is attained at an ambiguity in the pairing of critical points. In all cases, the endpoints of the legs share the same normal line, and the legs have the same length if measured along that line. The case analysis is delicate and aided by a transformation of the original 2-manifold to its pedal surface, which maps tangent planes to points and thus expresses points with common tangent planes as self-intersections of the pedal surface.

Figure 4.1: From left to right: a one-, two-, three-, and four-legged local maximum of the elevation function. In the examples shown, the outer normals at the endpoints of the legs are all parallel (the same). Each of the four types also exists with anti-parallel outer normals in various combinations.

The algorithm we describe for enumerating all local maxima is inspired by our analysis of the smooth case but works on piecewise linear data.

Outline. Section 4.2 defines the pairing of the critical points, based on which we then introduce the height and elevation as functions on a 2-manifold. Section 4.3 describes a dual view of these concepts based on the pedal surface of the 2-manifold. Section 4.4 uses surgery to make elevation continuous and to define a stratified Morse function on the new 2-manifold. We then characterize the four types of generic local maxima of the continuous elevation function. Section 4.5 sketches an algorithm for enumerating all local maxima.

Section 4.6 presents experimental results for protein data.

4.2 Defining Elevation

4.2.1 Pairing

The elevation function is based on a canonical pairing of the critical points, which we

describe in this section.

Traditional persistence. Let M be a connected and orientable 2-manifold and f: M → ℝ a smooth function.¹ A point x ∈ M is critical if the derivative of f at x is identically zero, and it is non-degenerate if the Hessian at the point is invertible. It is convenient to assume that f is generic:

¹ We remark that the algorithms we describe below work for 2-manifolds with multiple components as well. We assume there is only one component for simplicity.

I. all critical points are non-degenerate;

II. the critical points have different function values.

A function that satisfies Conditions I and II is usually referred to as a Morse function [135]. It has three types of critical points: minima, saddles and maxima distinguished by

the number of negative eigenvalues of the Hessian. Imagine we sweep M in the direction of increasing function value, proceeding along the level sets, each a collection of closed curves. We write M_t = f⁻¹((−∞, t]) for the swept portion of the 2-manifold. This portion changes its topology whenever the level set passes through a critical point. A component of M_t starts at a minimum and ends when it merges with another, older component at a saddle. A hole in the 2-manifold starts at a saddle and ends when it is closed off at a maximum. After observing that each saddle either merges two components or starts a new hole, but not both, it is natural to pair the critical point that starts a component or a hole with the critical point that ends it. This is the main idea of topological persistence, introduced in [77]. It is clear that a small perturbation of the function that preserves the sequence of

critical events does not affect the pairing, other than by perturbing each pair locally. The method pairs all critical points except for the first minimum, the last maximum, and the 2g saddles starting the 2g cycles that remain when the sweep is complete. Here g is the genus of M. These 2g + 2 unpaired critical points are the reason we need an extension to the method, which we describe next.
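The sweep just described — a component created at each minimum and merged into an older component at a saddle, pairing the younger minimum with the merge point — can be sketched for the simplest setting, a piecewise-linear function on a path, using union-find (an illustrative reduction with assumed names, not the 2-manifold algorithm itself):

```python
def persistence_pairs(vals):
    """Sublevel-set persistence of a piecewise-linear function on a path,
    given by its vertex values: process vertices by increasing value and,
    when two components merge, pair the younger component's minimum (the
    one with the higher minimum value) with the merging vertex."""
    n = len(vals)
    parent = list(range(n))
    comp_min = list(range(n))          # representative minimum of each root
    active = [False] * n

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    pairs = []
    for v in sorted(range(n), key=lambda i: vals[i]):
        active[v] = True
        for u in (v - 1, v + 1):
            if 0 <= u < n and active[u]:
                ru, rv = find(u), find(v)
                if ru == rv:
                    continue
                elder, younger = ((ru, rv)
                                  if vals[comp_min[ru]] <= vals[comp_min[rv]]
                                  else (rv, ru))
                if comp_min[younger] != v:   # skip v's own trivial component
                    pairs.append((vals[comp_min[younger]], vals[v]))
                parent[younger] = elder
    return sorted(pairs)                 # the global minimum stays unpaired
```

For the values [5, 1, 4, 0, 3, 2, 6], the minima 1 and 2 are paired with the merge values 4 and 3, while the global minimum 0 remains unpaired — the kind of leftover that extended persistence is designed to absorb.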

Extended persistence. It is natural to pair the remaining minimum with the remaining maximum. The remaining 2g saddles comprise up-forking and down-forking saddles. We wish to pair up-forking saddles with down-forking ones, and this can be achieved in a way that reflects how they introduce cycles during the sweep. This pairing is best described using the Reeb graph, obtained by mapping each component of each level set to a point, as illustrated in Figure 4.2. As proved in [63], the Reeb graph has a basis of g

Figure 4.2: Left: a 2-manifold whose points are mapped to the distance above a horizontal plane. Middle: the Reeb graph, in which the critical points of the function appear as degree-1 and degree-3 nodes. The labels indicate the pairing. Right: the forest representing the Reeb graph, swept from slightly above downwards.

cycles such that each cycle is the sum (modulo 2) of a subset of basis cycles. Each cycle has a unique lowest and a unique highest point, referred to as its lo-point and hi-point. We say the lo- and hi-point span this cycle, but note that they may span more than one cycle. There

is a one-to-one correspondence between lo- (hi-) points and up- (down-) forking saddles, thereby giving exactly g lo-points and g hi-points. We pair each lo-point a with the lowest hi-point b that spans a cycle with a. Note that a is then also the highest lo-point that spans a cycle with b. Indeed, if this were not the case, then we could add the cycle spanned by a and b to the cycle spanned by b and the lo-point higher than a, to get a cycle spanned by a and a hi-point lower than b, a contradiction. This implies that each lo-point and each hi-point belongs to exactly one pair, giving a total of g pairs between up- and down-forking saddles, as desired.

The Reeb graph of a piecewise-linear function on a triangulation with n edges can be constructed in O(n log n) time using the algorithm in [63]. We now describe an algorithm that computes both the traditional persistence pairing and the extended persistence pairing as introduced above, given the Reeb graph R of f. It simulates the sweep of f, maintaining a forest F during the course.² In particular, it takes the following steps at reaching a critical point (i.e., a node in R), merging two arcs across a degree-2 node whenever one is created.

² We remark that the algorithm can in fact construct the Reeb graph and the pairing simultaneously in one sweep. We assume the Reeb graph is given for simplicity.

Case 1: v is a minimum. We add a new tree, consisting of a single node, to the forest.

Case 2: v is an up-forking saddle. We turn the corresponding leaf into an internal node, adding two new leaves as its children.

Case 3: v is a down-forking saddle, connecting leaves a and b. We glue the two downward paths from a and b to their roots, and end the gluing at a node w. In one case, w is the root of one tree (the higher one); w is a minimum, which we now pair with v (this corresponds to a traditional persistence pairing). In the other case, w is the lowest common ancestor of a and b; w is an up-forking saddle, and we pair it with v (this corresponds to an extended persistence pairing).

Case 4: v is a maximum. We pair it with its parent w and remove the joining edge together with the two nodes; w can be either an up-forking saddle, producing a traditional persistence pairing, or a minimum, producing an extended persistence pairing.

In order to perform these operations efficiently, we use the linking and cutting trees proposed by Sleator and Tarjan [152]. This structure decomposes the forest into a family of vertex-disjoint paths, and each path is represented using a biased binary search tree. By maintaining a linking and cutting tree for F, Cases 1, 2 and 4 can be handled in O(n log n) overall time. So we focus only on Case 3. Given an instance of Case 3, assume that the common ancestor w of a and b exists (the case where it does not exist can be processed similarly). We can find w in O(log n) time using the operations supported by the linking and cutting tree data structure. The only extra operation we need is to glue the path in F from a to w with that from b to w. Let l and l′ be the lengths of these two paths, with l ≤ l′. We can perform the gluing operation by inserting each node of the shorter path into the longer one, which takes O(l log n) time as each path is represented using a weighted binary search tree.³ Assume that there is a sequence of m gluing operations during the entire sweep, and that the i-th operation glues a path of length l_i with one of length l′_i ≥ l_i, for 1 ≤ i ≤ m. The overall time complexity is then O((Σ_{i=1}^m l_i) log n).

¡ § for . The overall time complexity is § .

If we regard the parent of each node as its successor, then © induces a partial order on

¦ 1 §

the nodes of . Let be the number of total orders that are extensions of the partial

© © ¨

order induced by after the ’th gluing operation. Since initially and a single path

¨ ¨

¡

¡

1 1 ¡

after all operations, and . The ’th gluing operation merges two paths of

¤ ¤

¦ ¨

¢ ¤  ©£¢

¦

¥

1¢§ § § 1¢§ ¥

¤ ¤

¢  ¢

length and , . Therefore,

¦

¡

$ $ ¨ $ ¦ $

§ § §

© ¡§

¥

¡

1 § ¥ £ 1 § §

¦ ¦

§ §¤ §¥

Hence,

¥

¥

¡ ¡

$ $ $ ¦

£ £

¥

¡ ¥

1¢§ ¥ £ 1¢§ §

¡

¡ ¡

§ § ¥

§

¡ ¤

implying that the overall time for computing the persistence pairing is .

¤

¢ ¢

¡ Symmetry. The negative function, £ , has the same critical points as . We

claim that it also generates the same pairing.

 ¢ ¢ £

SYMMETRY LEMMA. Critical points and are paired for iff they are paired for . 

PROOF. The claim is true for the first minimum, , and the last maximum, . Every other 

pair of ¢ contains at least one saddle. We assume without loss of generality that is a

¤

¢¡ ¢¡

saddle and that . Consider again the sweep of the 2-manifold in the direction

¨

¢ ¢¡ 

of increasing values of . When we pass we split a cycle in the level set into

two. The two cycles belong to the boundary of ¦ , the set of points with function value

¦

or higher. If the two cycles belong to the same component of , such as for the point

£

©¨¡  ¨  In fact, for our algorithm, to achieve an § bound, a balanced binary tree will suf®ce.

64 

labeled 2 in Figure 4.2, then is a lo-point and is the lowest hi-point that spans a cycle

 

with . The claim follows because is also the highest lo-point that spans a cycle with .

If, on the other hand, the two cycles belong to two different components of ¦ , such as for §

the point labeled in Figure 4.2, then is the lower of the two maxima that complete the

¢

two components. In the backward sweep (the forward sweep for £ ), starts a component

  ¢ £ that merges into the other, older component at . Again and are also paired for , which implies the claimed symmetry.

4.2.2 Height and Elevation

In this section, we define the elevation as a real-valued function on a 2-manifold in ℝ³.

Measuring height and elevation on Earth. Even on Earth, defining the elevation of a point x on the surface is a non-trivial task. Traditionally, it is defined relative to the mean sea level (MSL) in the direction of the measured point. In other words, the MSL elevation of a point x is the difference between the distance of x from the center of mass and the distance of the MSL from the center of mass in the direction of x. The difficulty of measuring height in the middle of a continent was overcome by introducing the geoid, which is a level surface of the Earth's gravitational potential and roughly approximates the MSL while extending it across land. The orthometric height above (or below) the geoid is thus more general and about the same as the MSL elevation. It is perhaps surprising that the geoid differs significantly from its best ellipsoidal approximation due to the non-uniform density of the Earth's crust [87]. Standard global positioning systems (GPS) indeed return the ellipsoidal height, which is elevation relative to a standard ellipsoidal representation of the Earth's surface. They also include knowledge of the geoid height relative to the ellipsoid and compute the orthometric height of x as its ellipsoidal height minus the ellipsoidal height of the geoid in the direction of x.

A simplifying factor in the discussion of height and elevation on Earth is the existence of a canonical core point, the center of mass. For general surfaces, distance measurements from a fixed center make much less sense. We are interested in this general case, which includes surfaces with non-zero genus, for which there is no simple notion of a core. As on Earth, we define the elevation of a point x as the difference between two distances, except we no longer use a reference surface, such as the mean sea level or the geoid, but instead measure relative to a canonically associated other point on the surface. To explain how this works, we give different meanings to the 'height' of a point x, which we define for every direction, and the 'elevation' of the point, which is the difference between two heights. While height depends on an arbitrarily chosen origin, we will see that elevation is independent of that choice. Indeed, the technical concept of elevation, as introduced shortly, is similar in spirit to the idea of orthometric height, with the exception that it substitutes a canonically associated point for a globally defined reference surface.

Height, persistence and elevation. Let M be a smoothly embedded 2-manifold in ℝ³. We assume that M is generic, but it is too early to say what exactly that should mean. We define the height in a given direction as the signed distance from the plane normal to that direction and passing through the origin. Formally, for every unit vector u ∈ S², we call h_u(x) = ⟨x, u⟩ the height of x ∈ M in the direction u. This defines a 2-parameter family of height functions,

    h : S² × M → ℝ,

with h(u, x) = h_u(x). The height h_u is a Morse function on M for almost all directions u. We pair the critical points of h_u as described in Section 4.2.1. Following [76], we define the persistence of a critical point x as the absolute difference in height to the paired point y: pers(x) = |h_u(x) − h_u(y)|.



Each point x ∈ M is critical for exactly two height functions, namely for the ones in the directions of its outer and inner normals, ±n(x). We proved in Section 4.2.1 that the pairs we get for the two opposite directions are the same. Hence, each point x ∈ M has a unique persistence, which we use to introduce the elevation function,

    E : M → ℝ,

defined by E(x) = pers(x). We note that the elevation is invariant under translation and rotation of M in ℝ³.
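As a small concrete illustration of the family of height functions (an illustrative sketch with assumed names), the following snippet lists the critical vertices of h_u(x) = ⟨x, u⟩ on a closed polygon sampling a 1-manifold, in the spirit of the two-dimensional example below:

```python
import math

def height_critical_points(poly, u):
    """Return the (index, type) pairs of vertices of the closed polygon
    'poly' that are local minima or maxima of the height h_u(x) = <x, u>."""
    n = len(poly)
    h = [p[0] * u[0] + p[1] * u[1] for p in poly]
    crit = []
    for i in range(n):
        a, b, c = h[i - 1], h[i], h[(i + 1) % n]
        if b < a and b < c:
            crit.append((i, 'min'))
        elif b > a and b > c:
            crit.append((i, 'max'))
    return crit
```

On a convex curve every direction yields exactly one minimum and one maximum; non-convex curves yield additional pairs, whose height differences are the persistences entering the elevation.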

Two-dimensional example. We illustrate the definitions of the height and elevation functions for a smoothly embedded 1-manifold M in ℝ². The critical points of h_u are the points x ∈ M with normal vectors ±u. Figure 4.3 illustrates a sweep in the vertical upward direction. Each critical point either starts a component, ends a component by merging it into an older component, or closes the curve. The critical points that start components get paired with the other critical points.

Figure 4.3: A 1-manifold with marked critical points of the vertical height function. The shaded strips along the curve connect paired critical points. The black and grey dots mark two- and one-legged elevation maxima.

The elevation is zero at inflexion points and increases as we move away in either direction. The function may experience a discontinuity at points that share tangent lines with others, such as endpoints of segments that belong to the boundary of the convex hull. On the way towards a discontinuity, the elevation may go up and down, possibly several times. The elevation may reach a local maximum at points that either maximize the distance to a shared tangent line or the distance to another critical point in the normal direction. Examples of the first case are the black dots in Figure 4.3, where the elevation peaks in a non-differentiable manner. An example of the second case is the grey point, where the elevation forms a smooth maximum.

Singular tangencies. The elevation is continuous on M, except possibly at points with singular tangencies. These points correspond to transitional violations of the two genericity conditions of Morse functions. Such violations are unavoidable, as h is a 2-parameter family within which we can transition from one Morse function to another:

two critical points may converge and meet at a birth-death point where they cancel each other;

two critical points may interchange their positions in the ordering by height, passing a direction at which they share the same height.

The first transition corresponds to an inflexion point of a geodesic on M. Such points are referred to as flat or parabolic, indicating that their Gaussian curvature is zero. The second transition corresponds to two points x ≠ y that share the same tangent plane.

Both types of singularities are forced by varying one degree of freedom and are turned into curves by varying the second degree of freedom. These curves pass through co-dimension two singularities formed by two simultaneous violations of the two genericity conditions. There can be two concurrent birth-death points, a birth-death point concurrent with an interchange, or two concurrent interchanges. In each case, the singularity is defined by two pairs of critical points and we get two types each because these pairs may be disjoint or share one of the points. See Table 4.1 for the features on that correspond to the six types of co-dimension two singularities. We can now be more precise about what we mean

by a generic 2-manifold.

GENERICITY ASSUMPTION A. The 2-parameter family of height functions on the 2-manifold has no violations of Conditions I and II for Morse functions other than the ones mentioned above (and enumerated in Table 4.1 below).

Some of these violations will be discussed in more detail later as they can be locations of maximum elevation. A second genericity assumption referring specifically to the elevation function will be stated in Section 4.4.1.

4.3 Pedal Surface

In this section, we take a dual view of the height and elevation functions based on a transformation of the 2-manifold M to another surface in R^3. We take this view to help our understanding of the singularities of the elevation function, but it is of course also possible to study them directly using standard results in the field [23, 99].

Pedal function. Recall that the tangent plane at a point x of the 2-manifold M passes through x and is normal to the surface normal at x. The pedal of x is the orthogonal projection of the origin onto this tangent plane. We write ped(x) for this projection and obtain a function

ped : M -> R^3,

whose image Ped = ped(M) is the pedal surface of M [36]. If the line through the origin and x is normal to M then ped(x) = x. More generally, we can construct ped(x) by drawing the diameter sphere with center x/2 passing through the origin and x. This sphere intersects the tangent plane in a circle with center (x + ped(x))/2 that passes through x and ped(x). In fact, Ped is the envelope of the diameter spheres defined by the origin and the points x of M, as illustrated in Figure 4.4. The following three properties are useful in understanding the correspondence between M and its pedal surface:

points on M have parallel and anti-parallel normal vectors iff their images under the pedal function lie on a common line passing through the origin;

the height of a point x of M in the direction of its normal vector is equal to plus or minus the distance of ped(x) from the origin;

from a point p of Ped and the angle between the vector p and the normal of Ped at p, we can compute the radius r of the corresponding diameter sphere and the preimage x, which lies at distance sqrt(4r^2 - |p|^2) from p in a direction perpendicular to p.

Figure 4.4: A smoothly embedded closed curve (boldface solid) and the image of the pedal function (solid) constructed as the envelope of the diameter circles (dotted) between the curve and the origin.

The third property implies that the pedal surface determines the 2-manifold.
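The pedal map itself is a one-line computation. The sketch below is an illustration under our own naming conventions, not code from the thesis: it projects the origin onto the tangent plane at x with (not necessarily unit) normal n, and one can verify numerically that the result always lies on the diameter sphere with center x/2.

```python
import numpy as np

def pedal(x, n):
    """Orthogonal projection of the origin onto the tangent plane at x.

    The tangent plane is {p : (p - x) . n = 0}; the projection of the
    origin onto it is ((x . n) / (n . n)) n, a point parallel to n.
    """
    x, n = np.asarray(x, float), np.asarray(n, float)
    return (x @ n / (n @ n)) * n
```

For example, with x = (1, 2, 2) and a vertical normal, the pedal is (0, 0, 2), which lies at distance |x|/2 = 1.5 from the sphere center x/2.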

Tangents, heights, and pedals. We are interested in singularities of the pedal function, as they correspond to directions along which the height function is not generic. For example, a birth-death point of a height function corresponds to a cusp point of Ped. To see this, recall that the birth-death point corresponds to a flat point x of M. A generic geodesic curve through this point has an inflexion at x, causing the tangent plane to reverse the direction of its rotating motion as we pass through x. Similarly, it causes a sudden reversal of the motion of the image point ped(x), thus forming a cusp at ped(x). In contrast, an interchange, which corresponds to a plane tangent to M in two points, maps to a point of self-intersection (a xing) of Ped. These two cases exhaust the co-dimension one singularities of the family of height functions, which are listed in the upper block of Table 4.1.

Co-dimension two singularities. There are six types of co-dimension two singularities, listed in the lower block of Table 4.1. Perhaps the most interesting is formed by two concurrent birth-death points that share a critical point. As illustrated in Figure 4.5, left, the corresponding dovetail point in the pedal surface is the endpoint of two cusps but also of a self-intersection curve. The second most interesting type is formed by two concurrent interchanges that share a critical point and therefore force a third concurrent interchange of the other two critical points. It corresponds to three self-intersection curves formed by three sheets of Ped that intersect in a triple point, as shown in Figure 4.5, middle. Third, we may have a concurrent birth-death point and interchange that share a critical point. As illustrated in Figure 4.5, right, this corresponds to a cusp curve that passes through another sheet of the pedal surface. There are three parallel types in which the concurrency happens in the same direction but not in space. They correspond to two curves on the pedal surface that cross each other as seen from the origin but do not meet in R^3. As before, a birth-death point corresponds to a cusp curve and an interchange to a curve of self-intersections.

Dictionary of Singularities

manifold          | height functions        | pedal surface
flat point        | birth-death (bd) point  | cusp
double tangency   | interchange             | xing
Jacobi point      | 2 bd-points             | dovetail point
triple tangency   | 3 interchanges          | triple point
                  | bd-pt. and interchange  | cusp xing
                  | 2 bd-points             | cusp-cusp overpass
                  | 2 interchanges          | xing-xing overpass
                  | bd-pt. and interchange  | cusp-xing overpass

Table 4.1: Correspondence between singularities of tangents of the manifold, the 2-parameter family of height functions, and the pedal surface. There are two singularities of co-dimension one: curves of cusps and curves of self-intersections (xings). There are six singularities of co-dimension two.

Figure 4.5: Left: a portion of the pedal surface in which a self-intersection and two cusps end at a dovetail point. Middle: three sheets of the pedal surface intersecting in a triple point. Right: a cusp intersecting another sheet of the pedal surface.

4.4 Capturing Elevation Maxima

4.4.1 Continuity

We are interested in the local maxima of the elevation function, which are the counterparts of mountain peaks and deepest points in the sea. But they are not well defined because the elevation can be discontinuous. We remedy this shortcoming through surgery, and establish a stratified Morse function on the new 2-manifold.

Discontinuities at interchanges. As mentioned in Section 4.2.1, the pairs vary continuously as long as the height function varies without passing through interchanges and birth-death points (Conditions I and II). It follows that the elevation is continuous in regions where this is guaranteed. Around a birth-death point, the elevation is necessarily small and goes to zero as we approach the birth-death point. The only remaining possibility for discontinuous elevation is thus at interchanges, which happen when two points share the same tangent plane. As mentioned in Table 4.1, this corresponds to a point at which the pedal surface intersects itself. Figure 4.6 shows that discontinuities in the elevation can indeed arise at co-tangent points. We see four points w, x, y, z with common vertical normal direction, of which x and y are co-tangent. Consider a small neighborhood of the vertical direction and observe that the critical points vary in neighborhoods of their locations. The critical point near w changes its partner from the right side of x to the left side of y as the direction varies from left to right in the neighborhood of the vertical. Similarly, the critical point near z changes its partner from the right side of y to the left side of x. Since the height difference is the same at the time of the interchange, the elevation at w and z is still continuous. However, it is not continuous at x and at y, which both change their partners, either from w to z or the other way round. Not all interchanges cause discontinuities, only those that affect the pairing. These are the interchanges that affect a common topological feature arising during the sweep of M in the height direction.

Figure 4.6: The four white points share the same normal direction, as do the four light shaded and the four dark shaded points. The strips indicate the pairing, which switches when the height function passes through the vertical direction. The insert on the right illustrates the effect of surgery at x and y on the pedal curve.

Continuity through surgery. We apply surgery to M to obtain another 2-manifold N on which the elevation function is continuous. Specifically, we cut M along the curves at which the elevation is discontinuous, resulting in a 2-manifold with boundary, M'. Then we glue M' along its boundary, making sure that glued points have the same elevation. Formally, we cut by applying the inverse of a surjective map from M' to M, and glue by applying a surjective map from M' to N:

M <- M' -> N.

As argued above, the boundary curves of M' occur in pairs, and each pair is defined by an interchange, thus corresponding to a self-intersection curve (a xing) of the pedal surface. The latter view is perhaps the most direct one, in which surgery means cutting along xings and gluing the resulting four sheets in a pairing that resolves the self-intersection. This is illustrated in Figure 4.6, where on the right we see a self-intersection being resolved by cutting the two curves and gluing the upper and lower two ends. In the original boldface curve on the left, this operation corresponds to cutting at x and y and gluing the four ends to form two closed curves: one from x through w to y and the other from x through z to y. As mentioned earlier, not all xings correspond to discontinuities, and we perform surgery

only on the subset that do. In general, a discontinuity follows a xing until it runs into a dovetail or a triple point. In the former case, the xing and the discontinuity both end. In the latter case, the xing continues through the triple point, and the discontinuity may follow, turn, or even branch to other xings passing through the same triple point. Two possible configurations created by surgery in the neighborhood of a triple point are illustrated in Figures 4.8 and 4.9. Their particular significance in the recognition of local maxima will be discussed shortly. Whatever the situation, the subset of xings along which the elevation is discontinuous, together with the gluing pattern across these xings, provides a complete picture of how to use surgery to change Ped into a new surface, namely the pedal surface of the new 2-manifold N. That N is indeed a manifold can be shown by (tediously) enumerating and examining all cases of cut-and-glue patterns that may occur. After surgery, we have a continuous elevation function on N.

Furthermore, we have continuously varying pairs of critical points. To formalize this idea, we introduce a new map

A : N -> N

that maps a point to its paired point. The function A is a homeomorphism and its own inverse. We note in passing that we could construct yet another 2-manifold by identifying antipodal points. Each local maximum of the elevation function on this new manifold corresponds to a pair of equally high maxima in N. This construction is the reason we will blur the difference between maxima and antipodal pairs of maxima in the next few sections.

Smoothness of the elevation. The elevation function on N is smooth almost everywhere. To describe the violations of smoothness, let B denote the boundary of the intermediate manifold M', and let G be the image of B in N together with its antipodal image; G is the set of points at which the elevation function is not smooth. By Genericity Assumption A, G is a graph, consisting of nodes and arcs. We have degree-1 and degree-3 nodes that correspond to dovetail points and triple points in the pedal surface, respectively, as well as degree-4 nodes that correspond to overpasses between xings. Each degree-4 node is the crossing of an arc in the image of B and an arc in its antipodal image. We think of this construction as a stratification of N. Its strata are

the three kinds of nodes;

the open and closed arcs;

the open connected regions in the complement of G.

Figure 4.7 illustrates the construction by showing what such a stratification may look like. When restricted to a single stratum, the elevation function is smooth, but still not a Morse function. For example, all points on lines of inflexion have elevation identically zero, forming lines of local minima of the elevation. We now complete our description of what we mean by a generic 2-manifold.

Figure 4.7: Stratification of the 2-sphere obtained by overlaying a spherical tetrahedron with its antipodal image. The (shaded) degree-4 nodes are crossings between the graph and its antipodal image.

GENERICITY ASSUMPTION B. The local maxima of the elevation function on N are isolated.

The implication of this assumption becomes clearer after we enumerate the generic types of local maxima of the elevation function in the next section. In particular, it means that surfaces such as spheres and cylinders are not generic under this assumption.

4.4.2 Elevation Maxima

In this section, we enumerate the generic types of local maxima of the elevation function. They come in pairs in N which, by inverse surgery, form multi-legged creatures in M.

Classification of local maxima. Depending on its location, a point x of N can have one, two, or three preimages under surgery. We call this number its multiplicity, μ(x). Specifically, x has multiplicity three if it is a node of the graph G, it has multiplicity two if it lies on an arc of G, and it has multiplicity one otherwise. Degree-4 nodes in the stratification correspond to antipodal pairs of points with multiplicity two each. Let now x be a local maximum of the elevation function. We know that x is not a flat point, else its elevation would be zero. This simple observation eliminates five of the eight singularities in Table 4.1. Furthermore, the assumption of a generic 2-manifold implies that the sum of the multiplicities of x and of its partner y = A(x) is at most four (provided the xings intersect each other transversely; otherwise, we can deform the manifold slightly to enforce this). This leaves the following four possible types of local maxima x:

one-legged: μ(x) = μ(y) = 1;

two-legged: μ(x) = 1 and μ(y) = 2;

three-legged: μ(x) = 1 and μ(y) = 3;

four-legged: μ(x) = μ(y) = 2;

see Figure 4.1. We sometimes call the preimages of x the heads and those of y the feet of the maximum. The most exotic of the four types is perhaps the four-legged maximum, which corresponds to an overpass of two xings in the pedal surface or, equivalently, a degree-4 node in the stratification. The image of x under the pedal function lies on one xing and the image of y lies on the other. Both maxima have two preimages under surgery, which makes for a complete bipartite graph with two heads, two feet, and four legs.

Neighborhood patterns. Given a point x of M, take an open neighborhood of x on M. Denote by D(x) the image of this neighborhood under the Gauss map, and refer to it as the neighborhood of the normal n_x. If x is not a flat point (i.e., the Gaussian curvature at x is not zero), then D(x) is homeomorphic to an open disk, and there is a one-to-one map from the neighborhood of x to that of n_x under the Gauss map. In the following discussion, we study only non-flat points, since flat points cannot possibly be maxima of the elevation function.
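For a parametric surface, the Gauss map used here can be approximated numerically. The sketch below is illustrative and not part of the thesis software: it estimates the unit normal as the normalized cross product of the two partial derivatives, computed by central differences.

```python
import numpy as np

def gauss_map(f, u, v, eps=1e-6):
    """Numerical Gauss map of a parametric surface f(u, v) -> R^3.

    The unit normal is the normalized cross product of the two partial
    derivatives, each estimated by a central difference.
    """
    fu = (np.asarray(f(u + eps, v)) - np.asarray(f(u - eps, v))) / (2 * eps)
    fv = (np.asarray(f(u, v + eps)) - np.asarray(f(u, v - eps))) / (2 * eps)
    n = np.cross(fu, fv)
    return n / np.linalg.norm(n)
```

On the unit sphere, for instance, the Gauss map sends every point to itself, which provides an easy sanity check.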

It is instructive to look at the local neighborhood of a maximum x in N. Most interesting is the three-legged type, with feet y1, y2, y3. A small perturbation of the normal direction can change the ambiguous pairing of x with all three feet into an unambiguous pairing of a point in the neighborhood of x with a point in the neighborhood of one of the feet. We indicate this by labeling the points in the neighborhood of n_x (i.e., the disk D(x)) with the indices of the feet, as shown in Figure 4.8. The three curves passing through n_x correspond to the

Figure 4.8: The three sheets of N after cutting and gluing the neighborhood of a triple point in Ped, at the top, and the corresponding pairing patterns in the neighborhood of n_x, at the bottom. The (shaded) Mercedes star is necessary for a three-legged maximum.

three xings passing through the triple point in the pedal surface. Note that in generic cases these curves pass through each other transversally at n_x, as long as x is not a flat point. Hence, they decompose the neighborhood into six slices corresponding to the six permutations of the three feet. The labeling indicates the pairing and reflects the surgery at these feet and, equivalently, at the corresponding triple point in the pedal surface. Only the right-

(The Gauss map takes each point of M to its normal vector on S^2.)

most pattern in Figure 4.8 corresponds to a maximum; the reason will become clear after we introduce and prove the necessary projection conditions for elevation maxima. We call this pattern the Mercedes star property of three-legged maxima.

There are in fact two ways to apply surgery at a three-legged maximum, one of which is already shown in Figure 4.8. We illustrate the neighborhood patterns of the other in Figure 4.9. The neighborhood pictures for the remaining three types of maxima are simpler.

For a one-legged maximum we have an undivided disk, which requires no surgery. For a two- (resp. four-) legged maximum we have a disk divided into two halves (resp. quarters) and there is only one way to do the surgery (see Figure 4.10).

Figure 4.9: The second type of surgery pattern at a triple point: the three sheets of N after cutting and gluing the neighborhood of a triple point in Ped, at the top, and the corresponding pairing patterns in the neighborhood of n_x, at the bottom.

Figure 4.10: Neighborhood patterns for (a) a two-legged maximum and (b) a four-legged maximum, where we mark a region by (i, j) if the i-th head is paired with the j-th foot in it.

Necessary projection conditions. Given a maximum x of the elevation function with partner y = A(x), let u and -u be the two directions at which the heads (the x_i's) and feet (the y_j's) of this maximum in M are critical. Recall that all x_i's (resp. all y_j's) are co-tangent. Furthermore, they must satisfy the following necessary properties, stated as projection conditions:

PROJECTION CONDITIONS. The point x is a maximum of the elevation function only if

1 leg: the vector from the foot y to the head x is parallel or anti-parallel to u;

2 legs: u, x - y1, and x - y2 are linearly dependent, and the orthogonal projection of x onto the line of the two feet lies between y1 and y2;

3 legs: the orthogonal projection of x onto the plane of the three feet lies inside the triangle spanned by y1, y2, and y3;

4 legs: the orthogonal projections of the segments x1 x2 and y1 y2 onto a plane parallel to both have a non-empty intersection.
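The two- and three-legged conditions can be tested with a few lines of linear algebra. The sketch below is a hypothetical helper, not code from the thesis: it computes the orthogonal projection of the head onto the affine hull of the feet via least squares and checks that the barycentric coordinates of the projection are non-negative, i.e., that the projection lies between the two feet or inside the triangle of three feet.

```python
import numpy as np

def projection_condition(x, feet, tol=1e-9):
    """Two- and three-legged Projection Condition (a sketch).

    The orthogonal projection of the head x onto the affine hull of the
    feet must lie in their convex hull.  Returns the verdict together
    with the barycentric coordinates of the projection.
    """
    x = np.asarray(x, float)
    F = np.asarray(feet, float)          # k x 3 array of feet, k = 2 or 3
    A = (F[1:] - F[0]).T                 # basis vectors of the affine hull
    # least-squares coordinates give the orthogonal projection of x - F[0]
    c, *_ = np.linalg.lstsq(A, x - F[0], rcond=None)
    bary = np.concatenate([[1.0 - c.sum()], c])
    return bool(np.all(bary >= -tol)), bary
```

For two feet the barycentric coordinates reduce to (1 - t, t) along the segment; the condition holds exactly when t lies in [0, 1].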

In summary, x is a local maximum only if u is either a positive or a negative linear combination of the vectors from the feet to the heads. Below we first prove the necessary condition for one-legged maxima. We then sketch the proofs for all the remaining types of maxima.

For a one-legged maximum x, assume without loss of generality that x lies at the origin, that the common normal direction is the vertical u = (0, 0, 1), and that the foot is y = (y1, y2, y3). By definition, the elevation E(x) = h_u(x) - h_u(y) = -y3 is the height difference between x and y in the vertical direction. We parameterize the directions in the neighborhood of u by u(α, φ) = (sin φ cos α, sin φ sin α, cos φ), as illustrated in Figure 4.11, where α ∈ [0, 2π) and φ ∈ [0, ε] for a sufficiently small ε > 0. Next, let x(α, φ) (resp. y(α, φ)) be the point in the neighborhood of x (resp. y) with normal u(α, φ). Denote by g(α, φ) the height difference between x(α, φ) and y(α, φ) in direction u(α, φ), i.e.,

g(α, φ) = h_{u(α,φ)}(x(α, φ)) - h_{u(α,φ)}(y(α, φ)).

Obviously, x is a maximum if and only if

g(α, φ) <= g(α, 0) = E(x) for all α and all sufficiently small φ.    (4.1)

Figure 4.11: Illustration of the normal u and the parameterization of its neighborhood, which is a spherical patch of S^2.

Moreover,

h_{u(α,φ)}(x(α, φ)) = r_x(α) (1 - cos φ),

where r_x(α) denotes the radius of curvature at x in the normal direction parameterized by α. Similarly, we can compute h_{u(α,φ)}(y(α, φ)). It then follows that

g(α, φ) = h_{u(α,φ)}(x(α, φ)) - h_{u(α,φ)}(y(α, φ)) = -y · u(α, φ) + (r_x(α) + r_y(α)) (1 - cos φ).    (4.2)

Hence,

g(α, φ) - E(x) = (1 - cos φ) [ r_x(α) + r_y(α) - E(x) - ρ (sin φ / (1 - cos φ)) cos(α - ω) ],

where ρ = sqrt(y1^2 + y2^2) and ω is the angle from the first coordinate axis to the vector (y1, y2) in the horizontal plane. The third term in the bracket dominates for sufficiently small φ, as sin φ / (1 - cos φ) tends to infinity in that case. Hence, (4.1) implies that x is a

maximum if and only if (i) ρ = 0, i.e., y1 = y2 = 0 and the foot lies vertically below x; and (ii) for any α, r_x(α) + r_y(α) <= E(x). Note that the projection condition for one-legged maxima is the same as (i) and is thus indeed necessary. Furthermore, if we add (ii), then the conditions are also sufficient. In fact, the necessary Projection Conditions for all other types of maxima can be made sufficient by adding appropriate constraints on the curvature of M at the heads and the feet and, if x is three-legged, the Mercedes star property. We sketch the proofs for the remaining types of maxima below. As these curvature conditions are not used in our algorithm for piecewise-linear manifolds, we omit the terms related to curvature in the following sketches for simplicity. Hence, the above discussion suggests that it suffices to consider only

G(α) = ρ cos(α - ω),

which has the opposite sign of g(α, φ) - E(x) as φ tends to zero.

For a k-legged maximum x, with k >= 2, the complication is that a point in the neighborhood of x might be paired with points from the neighborhoods of different feet, as illustrated by the neighborhood patterns in Figures 4.8 to 4.10. The disk in the neighborhood pattern corresponds to the projection of D(x) in direction u. Therefore, we can parameterize it using α as well, and each region R in the pattern corresponds to an interval in [0, 2π), referred to as the range of this region.

First consider k = 2 or 3, and assume again that x is at the origin and u = (0, 0, 1). Define ρ_j, ω_j, and G_j as before, by substituting the foot y_j for y. Obviously, x is a maximum if and only if, for each region R in the neighborhood pattern,

G_j(α) >= 0 for all α in the range of R, where y_j is the foot paired in R.    (4.3)

For a two-legged maximum, the length of the range of each region is π. Hence, (4.3) imposes that the graphs of G_1 and G_2 are as shown in Figure 4.12 (a). In other words, we have ω_2 = ω_1 + π, implying that the u-projection of the segment y1 y2 contains the origin, i.e., the projection of x.

Figure 4.12: The solid (dotted) curve corresponds to the graph of G_1 (G_2). (a) is necessary for x to be a maximum. In (b), there exists α (hollow dot) such that both G_1(α) < 0 and G_2(α) < 0. Thus x cannot be a maximum.

Figure 4.13: The dark, light, and dotted curves are the graphs of G_1, G_2, and G_3, respectively. (a) is necessary for x to be a maximum. In (b), where we move one of the graphs slightly, there exists α (hollow dot) such that G_j(α) < 0 for every j. Thus x cannot be a maximum.

For a three-legged maximum, first note that the Mercedes star property is necessarily true. This is because for any other neighborhood pattern in Figure 4.8 or 4.9, there exist pairs of antipodal normals in D(x) that are marked by the same index; the middle pattern in Figure 4.8, for example, contains such a pair. As G_j(α) and G_j(α + π) have opposite signs (unless both are zero), x cannot be a maximum in this case. Furthermore, (4.3) imposes that the three graphs overlap as in Figure 4.13 (a). This implies that the u-projection of the triangle y1 y2 y3 covers the origin, i.e., the projection of x.

The case of a four-legged maximum is slightly more involved, as there are two heads x1 and x2, as well as two feet y1 and y2. Assume that x1 lies at the origin and u = (0, 0, 1). By construction, the segment y1 y2 is parallel to the horizontal plane and the segment x1 x2 lies within it. We define the ρ_j's, ω_j's, and G_j's for the two feet as before. Denote by g'(α, φ) the height of x2(α, φ) minus that of x1(α, φ) in direction u(α, φ). By (4.2), g' has the same sign as cos(α - ω'), where ω' is the angle from the first coordinate axis to the projection of the vector x2 - x1. If x1 is a maximum, then for any α, either G_1(α) >= 0 or G_2(α) >= 0 holds. This is because x2(α, φ) is higher than x1(α, φ) for the directions with cos(α - ω') > 0, and there the height difference between x2(α, φ) and the paired foot can only be larger than that between x1(α, φ) and the same foot. It then follows that the u-projection of the segment y1 y2 intersects, in its interior, the line passing through the projections of x1 and x2. By switching the roles of the heads and the feet in the above argument, we can show that the segment x1 x2 also intersects, in its interior, the line passing through the projections of y1 and y2. This proves the necessary condition for four-legged maxima.

4.5 Algorithm

In this section, we describe an algorithm for constructing all points with locally maximum elevation. The input is a piecewise-linear 2-manifold embedded in R^3. The running time of the algorithm is polynomial in the size of the 2-manifold.

Smooth vs. piecewise linear. We consider the case in which the input is a two-dimensional simplicial complex in R^3. This data violates some of the assumptions we used in our mathematical considerations above, causing difficulties which, with some effort, can be overcome. For example, it makes sense to require that the input be a 2-manifold but not that it be smoothly embedded. The 2-parameter family of height functions is well-defined and continuous but not smooth. The definition of the elevation function is more delicate, as it makes reference to point pairs in all possible directions. For any given direction, we get a well-defined collection of pairs, but how can we be sure that the pairs for different directions are consistent? The difficulty is rooted in the fact that a vertex can be critical for more than one direction and may be paired with different vertices in different directions. To rationalize this phenomenon, we follow [76] and think of the complex as the limit of an infinite series of smoothly embedded 2-manifolds. A vertex gets resolved into a small patch with a two-dimensional variety of normal directions. Even as the patch shrinks toward the vertex, the variety of normal directions may remain fixed or at least not contract. For different directions in this variety, the corresponding points on the patch may be paired with points from different other patches. It thus seems natural that in the limit a vertex would be paired with more than one other point.

To make this idea concrete, we introduce a combinatorial notion of the variety of normal directions. Let σ be a simplex in the complex (a vertex, edge, or triangle), let a be a point in the interior of σ, and let u be a direction. We say a is critical for the height function in the direction u if

(i) all points of σ have the same height as a in the direction u;

(ii) the lower link of σ is not contractible to a point.

For example, the empty lower link of a minimum and the complete circle of a maximum are both not contractible. Let N(a) be the set of directions along which a is critical. Generically, the set N(a) for a point a inside a triangle is an antipodal pair of points of S^2, that for a point on an edge is an antipodal pair of open great-circle arcs, and that for a vertex is an antipodal pair of open spherical polygons. Here, the word 'generic' applies to a simplicial complex in R^3, where it simply means that the vertices are in general position. Computationally, this assumption can be simulated by a symbolic perturbation [78]. We write N(a, b) for the common intersection of the sets N(a) and N(b), and so on.
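Condition (ii) is easy to test at a vertex once its link is given in cyclic order: the lower link is contractible exactly when the sequence of "lower than the vertex" flags changes value exactly twice around the link. The sketch below uses illustrative names and conventions and is not code from the thesis.

```python
import numpy as np

def is_critical(v, link, u):
    """Is the vertex v critical for the height function in direction u?

    `link` lists the neighbours of v in cyclic order around v.  The lower
    link collects the neighbours strictly lower than v in direction u;
    v is regular exactly when the lower link is a single non-empty arc,
    i.e. when the Boolean sequence below changes value exactly twice.
    Zero changes mean a minimum or maximum, four or more a saddle.
    """
    v = np.asarray(v, float)
    lower = [np.dot(np.asarray(w, float) - v, u) < 0 for w in link]
    changes = sum(lower[i] != lower[(i + 1) % len(lower)]
                  for i in range(len(lower)))
    return changes != 2
```

A vertex whose neighbours are all lower is a maximum (zero sign changes); one whose lower link is a single arc is regular.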

Finite candidate sets. Given a candidate for a maximum, we can use the extended persistence algorithm to decide whether or not it really is a maximum. More specifically, we need a point and a direction along which the sweep defining the pairing proceeds. The details of this decision algorithm will be discussed shortly. We use the Projection Conditions, which are necessary for local maxima, to get four kinds of candidates:

1 leg: pairs of points a and b with the direction of the segment ab contained in N(a, b);

2 legs: triplets of points a, b1, b2 such that the orthogonal projection of a onto the line of b1 and b2 lies between the two points, and the direction from this projection to a is contained in N(a, b1, b2);

3 legs: quadruplets of points a, b1, b2, b3 such that the orthogonal projection of a onto the plane of b1, b2, b3 lies inside the triangle they span, and the direction from this projection to a is contained in N(a, b1, b2, b3);

4 legs: quadruplets of points a1, a2, b1, b2 such that the shortest line segment connecting the lines of a1 a2 and b1 b2 also connects the two line segments, and the direction of this segment is contained in N(a1, a2, b1, b2).

With the assumption of a generic simplicial complex, we get a finite set of candidates of each kind. Since this might not be entirely obvious, we discuss the one-legged case in some detail. Let σ and τ be two simplices and a and b points in their interiors. For a generic complex, the intersection of normal directions, N(a, b), is non-empty only if one of the two simplices is a vertex or both are edges. If σ is a vertex then b is necessarily the orthogonal projection of σ onto τ, which may or may not exist. If σ and τ are both edges then ab is necessarily the line segment connecting σ and τ and forming a right angle with both, which again may or may not exist. In the end, we get a set of O(n^2) candidate pairs a and b, where n is the number of edges in the complex. For the two-legged case, we get O(n^3) candidates, each a triplet of vertices or a pair of vertices together with a point on an edge. For the three- and four-legged cases, we get O(n^4) candidates, each a quadruplet of vertices, giving a total of O(n^4) candidates.
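For the one-legged candidates between two edges, the segment forming a right angle with both can be found by solving a 2-by-2 linear system for the closest points of the two supporting lines. The sketch below is a hypothetical helper, not code from the thesis: it returns the candidate pair, or None when the edges are parallel or the feet fall outside the segments.

```python
import numpy as np

def common_perpendicular(p0, p1, q0, q1, eps=1e-12):
    """Segment meeting both edges p0p1 and q0q1 at a right angle, if any.

    Solves for the line parameters (s, t) of the closest points of the
    two supporting lines and keeps the pair only if both points lie
    inside their segments.  Returns (a, b) or None.
    """
    p0, p1, q0, q1 = (np.asarray(p, float) for p in (p0, p1, q0, q1))
    d1, d2, r = p1 - p0, q1 - q0, p0 - q0
    a, b, c = d1 @ d1, d1 @ d2, d2 @ d2
    denom = a * c - b * b
    if abs(denom) < eps:                # parallel edges: no unique candidate
        return None
    s = (b * (d2 @ r) - c * (d1 @ r)) / denom
    t = (a * (d2 @ r) - b * (d1 @ r)) / denom
    if not (0.0 <= s <= 1.0 and 0.0 <= t <= 1.0):
        return None
    return p0 + s * d1, q0 + t * d2
```

Enumerating this over all pairs of edges (and the analogous vertex-simplex projections) produces the O(n^2) one-legged candidates mentioned above.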

Verifying candidates. Let x and y be a pair of points whose heads and feet all have parallel or anti-parallel normal directions. In the smooth case, the necessary and sufficient conditions for x and y to define an elevation maximum consist of three parts:

(a) the Projection Conditions of Section 4.4.2;

(b) the requirement that x and y be paired with each other by extended persistence;

(c) the curvature constraint alluded to in Section 4.4.2.

We subsume the Mercedes star property in (b) since it depends on the antipodality map or, equivalently, on the pairing by extended persistence. In the piecewise-linear case, we only have (a) and (b) because the concentration of the curvature at the edges and vertices renders (c) redundant. We have seen above how to translate (a) to the piecewise-linear case. It remains to test (b), which reduces to answering a constant number of antipodality queries: given a direction u and a critical point x, find the paired critical point. This is part of what the algorithm described in Section 4.2.1 computes if applied to a sweep of the complex in the direction u. More precisely, the algorithm computes one of the possible pairs if applied in non-generic directions, in which two or more vertices share the same height. Most of our candidates generate non-generic directions, and we cope with this situation by running the algorithm several times, namely once for each combination of permutations of the heads and of the feet. Each combination corresponds to a generic direction that is infinitesimally close to the non-generic direction. The largest number of combinations is six, which we get for three-legged maxima. This is also how we decide the Mercedes

star property: each of the three feet is the answer to exactly two of the six antipodality

§

¤

queries. Letting be the number of edges, the algorithm takes time O( ) to answer

the antipodality query. Since we have O( ) candidates to test, this amounts to a total

§

¤

¤ running time of O( ).

86 4.6 Experiments

We implemented the algorithm described in Section 4.5 and used it on surface representations of a few protein structures. We describe the findings to illustrate how the concepts introduced in the earlier sections might be applied.

Elevation on a surface. We discuss the experimental findings for one chain of the protein complex with pdbid 1brs, which we downloaded from the Protein Data Bank. It contains 864 atoms, not counting the hydrogens, which are too small to be resolved in the x-ray experiment and are not part of the structure. The particular surface representation we use is the molecular skin [55], which is similar to the better known molecular surface [65]. The reason for our choice is the availability of triangulating software and the guaranteed smooth embedding. The computed triangulation, displayed in Figure 4.16, has slightly more than 50 thousand vertices after some simplification. Given the triangulation of the skin surface, we compute the elevation for each of its vertices and visualize it in Figure 4.16. Recall that each vertex has a range of directions associated with it that make it critical, from which we choose an arbitrary one to compute its elevation value.

Number of maxima. Table 4.2 gives the number of maxima of each type computed for the skin triangulation for protein 1brs. We show in a separate row the number of additional maxima paired by extended persistence as introduced in Section 4.2.1. Since the genus of this particular surface is zero, all these maxima lie on the convex hull of the surface.

legs          one    two     three   four
max (trad.)   5      3,617   728     1,103
max (addl.)   15     0       6       0

Table 4.2: The number of maxima for the molecular skin of the 1brs protein structure obtained via traditional persistence (second row) and the additional maxima obtained by its extension (third row).

We notice that there are significantly more two-legged maxima than maxima of the other types. The reason is perhaps the particular shape of molecules, in which covalently bonded atoms form small dumbbells that invite two-legged maxima with one foot on each atom. These dumbbells are rotationally symmetric and form surface patches with non-generic elevation function, which further contributes to the abundance of two-legged maxima. The configurations required to form one-legged and three-legged maxima are considerably more demanding, but when they occur the maxima tend to have higher elevation. This observation is quantified in Figure 4.14, which sorts the maxima in the order of decreasing elevation. We see that for each threshold, the fraction of three-legged maxima higher than that elevation is significantly larger than the fractions of two- and four-legged maxima. The difference is even more pronounced for one-legged maxima, of which four of the five have elevation exceeding 5 Angstrom. The statistics for other proteins are similar.

Figure 4.14: The percentage of maxima with elevation exceeding the threshold marked on the vertical axis. From top to bottom: the curves for the three-legged, four-legged, and two-legged maxima.

Figure 4.15: The one hundred pairs of maxima with highest elevation. The heads are marked by light and the feet by dark dots.
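The curves of Figure 4.14 are survival fractions: for each threshold, the fraction of maxima of a given leg type whose elevation exceeds it. A minimal sketch of that computation (the function name is mine, not the thesis's):

```python
def fraction_above(elevations, threshold):
    """Fraction of maxima whose elevation exceeds the threshold,
    as plotted per leg type in Figure 4.14."""
    if not elevations:
        return 0.0
    return sum(1 for e in elevations if e > threshold) / len(elevations)
```

Evaluating this function over a range of thresholds, separately for the elevation values of each leg type, reproduces one curve of the figure.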

High elevation maxima. We are indeed mostly interested in high elevation maxima as the others are likely consequences of insignificant surface fluctuations or artifacts of the piecewise linear nature of the data. Figure 4.15 shows the top one hundred maxima on the skin surface of 1brs. Each antipodal pair of maxima is represented by its one or two heads and one, two, or three feet.

One might expect that the binding site of a protein would perhaps have more or higher maxima. We did not observe any such trend in the few cases we studied. It seems that maxima are more or less uniformly distributed over the surface. This should be contrasted to the finding that in many cases the pocket with the largest volume identifies the location of the binding site [130]. The elevation is indeed a less specific measurement with respect to a single surface, and we expect its primary use to be in the study of interactions between two or more shapes.

Meshes of different resolution. To see how the resolution of surface meshes influences the behavior of the elevation function, we generate two meshes for the molecular surface of chain E of the protein complex (pdbid 1cho), using the msms program (also available as part of the VMD package [105]). Call these two meshes M₁ and M₂, where M₂ is the coarser of the two. Let A₁ (resp., A₂) be the set of maxima of the elevation function from surface M₁ (resp., M₂), and let A₁(ε) ⊆ A₁ (resp., A₂(ε) ⊆ A₂) be the subset of maxima with an elevation greater than some threshold ε. We show in Table 4.3 the sizes of A₁(ε) and A₂(ε) for various ε's. We note that the size of A₂(ε) (from the coarser mesh) differs less from that of A₁(ε) as ε increases. Furthermore, Table 4.4 shows that points from A₁(ε) roughly cover points from A₂(ε), and vice versa. In particular, a point x ∈ A₂(ε) is δ-covered by A₁(ε) if there is some point y ∈ A₁(ε) such that ‖x − y‖ ≤ δ, where δ is the covering radius.

Threshold   0.0    0.1    0.2    0.3    0.5    0.8    1.0    2.0    4.0
|A₁(ε)|     2292   1445   1010   734    500    345    272    115    26
|A₂(ε)|     3714   2233   1290   931    489    357    287    104    34

Table 4.3: The number of maxima whose elevation is greater than the threshold, from surfaces M₁ and M₂.

Table 4.4 shows the percentage of points from A₂(ε) that are δ-covered by A₁(ε), called the covering density of A₁(ε) over A₂(ε), for various ε's and a fixed covering radius δ; similarly for the covering density of A₂(ε) over A₁(ε). The covering density increases in general as ε increases, indicating that larger features are preserved better than smaller ones as the surface mesh becomes coarser.

Table 4.4: For each threshold ε (0.1, 0.2, 0.3, 0.5, 0.8, 1.0, 2.0, 4.0; upper row), the covering density of A₁(ε) over A₂(ε) (middle row) and that of A₂(ε) over A₁(ε) (lower row).
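The covering density of one maxima set over another can be computed directly from the definition of δ-coverage. A brute-force sketch (function name mine; maxima are taken as points in the plane or in space):

```python
import math

def covering_density(X, Y, delta):
    """Fraction of points of X that are delta-covered by Y, i.e. that
    have some point of Y within distance delta (the covering radius)."""
    if not X:
        return 0.0
    covered = sum(1 for x in X
                  if any(math.dist(x, y) <= delta for y in Y))
    return covered / len(X)
```

Applying this to A₂(ε) against A₁(ε), and vice versa, for each threshold ε yields the two rows of Table 4.4.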

4.7 Notes and Discussion

The main contribution of this chapter is the definition of elevation as a real-valued function on a 2-manifold embedded in R³ and the computation of all its local maxima. The definition of this function can be extended to a d-manifold embedded in R^{d+1}.

The logical next step in this research is the exploitation of the maxima in protein docking and other shape matching problems. We will describe in Chapter 6 one such approach.

It would also be worth exploring extensions of our results to manifolds with boundary. A crucial first step will have to be the generalization of the concept of extended persistence to these more general topological spaces. Another interesting direction of research is identifying "features" (such as those computed by the elevation function) directly from a point cloud representing some underlying surface Σ: surface reconstruction in general is hard for undersampled and/or noisy point sets. However, it is easy to construct a simplicial complex

(which may not be a manifold) that roughly describes Σ and may have a different topology from Σ. By computing the points with maximal elevation on this complex, and keeping only those with high elevation, it is still possible to identify important features on Σ.

Finally, the algorithm presented in Section 4.4.1 enumerates all local maxima of the elevation function, without computing the elevation function itself, other than at a collection of candidate points. This approach is suggested by the ambiguities that arise in the definition of the elevation function for piecewise linear data. Unfortunately, it implies the fairly high running time of O(n⁵ log n) in the worst case. Can the maxima be enumerated more efficiently than that? Is there an algorithm that enumerates all maxima above some elevation threshold without computing the maxima below the threshold?

Figure 4.16: Visualization of elevation on the skin surface for protein 1brs. Roughly, the higher the elevation, the darker the color.

Part II: Shape Matching

Chapter 5

Matching via Hausdorff Distance

5.1 Introduction

The problem of shape matching in two and three dimensions arises in a variety of applications, including computer graphics, computer vision, pattern recognition, computer aided design, and molecular biology [17, 97, 148]. For example, proteins with similar shapes are likely to have similar functions, so classifying proteins (or their fragments) based on their shapes is an important problem in computational biology. Similarly, the proclivity of two proteins to bind with each other also depends on their shapes, so shape matching is central to the protein docking problem in molecular biology [97].

Informally, the shape-matching problem can be described as follows: given a distance measure between two sets of objects in R² or R³, determine a transformation, from an allowed set, that minimizes the distance between the sets. In many applications the allowed transformations are all possible rigid motions. However, in certain applications there are constraints on the allowed transformations. For example, when matching the pieces of a jigsaw puzzle, it is important that no two pieces overlap each other in their matched positions. Another example is the aforementioned docking problem, where two molecules bind together to form a compound; clearly, at this docking position the molecules should occupy disjoint portions of space [97]. Moreover, because of efficiency considerations, one sometimes restricts the set of allowed transformations further, most typically to translations only.

Several distance measures between objects have been proposed, varying with the kind of input objects and the application. One common distance measure is the Hausdorff distance [17], originally proposed for point sets. In this chapter we adopt this measure, extend it to non-point objects (mainly, disks and balls), and apply it to several variants of the shape-matching problem, with and without constraints on the allowed transformations. We are primarily interested in the case of balls because of molecular-biology applications, where a molecule is typically modeled as a set of balls, with each atom represented by a ball [19].

Problem statement. Let A and B be two (possibly infinite) sets of geometric objects (e.g., points, balls, simplices) in R^d, and let δ(a, b) be a distance function between the objects a in A and b in B. For a ∈ A, we define δ(a, B) = min_{b ∈ B} δ(a, b). Similarly, we define δ(b, A) = min_{a ∈ A} δ(a, b), for b ∈ B. The directional Hausdorff distance between A and B is defined as

    h(A, B) = max_{a ∈ A} δ(a, B),

and the Hausdorff distance between A and B is defined as

    H(A, B) = max { h(A, B), h(B, A) }.

(It is important to note that in this definition each object in A or in B is considered as a single entity, and not as the set of its points.) In order to measure similarity between A and B, we compute the minimum value of the Hausdorff distance over all translates of A within a given set F of allowed translation vectors. Namely, we define

    σ_F(A, B) = min_{t ∈ F} H(A + t, B),

where A + t = { a + t | a ∈ A }. In our applications, F will either be the entire R^d or the set of collision-free translates of A at which none of its objects intersects any object of B. The collision-free matching between objects is useful for applications (like the docking problem) in which the goal is to determine a translation t so that the shape of A + t best complements that of B. We use σ(A, B) to denote σ_{R^d}(A, B).
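For finite sets, the definitions above can be evaluated directly by brute force. The sketch below (function names and the default Euclidean point distance are my own choices, not the thesis's notation) computes the directional and the symmetric Hausdorff distance:

```python
import math

def directional_hausdorff(A, B, dist=None):
    """h(A, B) = max over a in A of min over b in B of dist(a, b)."""
    if dist is None:
        dist = math.dist          # Euclidean distance between points
    return max(min(dist(a, b) for b in B) for a in A)

def hausdorff(A, B, dist=None):
    """H(A, B) = max of the two directional Hausdorff distances."""
    return max(directional_hausdorff(A, B, dist),
               directional_hausdorff(B, A, dist))
```

Passing a custom `dist` illustrates the point made in the text: each object is treated as a single entity under whatever distance function δ is supplied, not as a set of points.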

As already mentioned, our definition of (directional) Hausdorff distance is slightly different from the one typically used in the literature [17], in which one considers the two unions U_A = ∪_{a ∈ A} a and U_B = ∪_{b ∈ B} b as two (possibly infinite) point sets, and computes the standard Hausdorff distance

    H(U_A, U_B) = max { h(U_A, U_B), h(U_B, U_A) },

where

    h(U_A, U_B) = max_{p ∈ U_A} min_{q ∈ U_B} ‖p − q‖.

We will denote H(U_A, U_B) (resp., h(U_A, U_B)) as H_U(A, B) (resp., h_U(A, B)), and use σ_U^F(A, B) to denote min_{t ∈ F} H_U(A + t, B). Analogous meanings hold for the notations σ_U(A, B) and h_U(A + t, B).

A drawback of the directional Hausdorff distance (and thus of the Hausdorff distance) is its sensitivity to outliers in the given data. One possible approach to circumvent this problem is to use "partial matching" [57], but then one has to determine how many (and which) of the objects in A should be matched to B. Another possible approach is to use the root-mean-square (rms, for brevity) Hausdorff distance between A and B, defined by

    h_rms(A, B) = [ (1/μ(A)) ∫_{a ∈ A} δ²(a, B) ]^{1/2},
    H_rms(A, B) = max { h_rms(A, B), h_rms(B, A) },

with an appropriate definition of the integration and of the normalizing measure μ (usually, summation over a finite set or Lebesgue integration over infinite point sets). Define

    σ_rms^F(A, B) = min_{t ∈ F} H_rms(A + t, B).

Finally, we define the summed Hausdorff distance to be

    h_sum(A, B) = (1/μ(A)) ∫_{a ∈ A} δ(a, B),

and similarly define H_sum(A, B) and σ_sum^F(A, B). Informally, h_rms and h_sum can be regarded as L₂- and L₁-distances over the sets of objects A and B; the two new definitions replace the max in h(A, B) by the L₂- and L₁-norms, respectively.
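For finite point sets, where the integral becomes a sum and μ(A) = |A|, the rms and summed variants can be sketched as follows (a minimal illustration; the normalization by |A| follows the definitions above, and the function names are mine):

```python
import math

def dist_to_set(p, B):
    """delta(p, B) for a point p and a finite point set B."""
    return min(math.dist(p, b) for b in B)

def h_rms(A, B):
    """Root-mean-square directional Hausdorff distance."""
    return math.sqrt(sum(dist_to_set(a, B) ** 2 for a in A) / len(A))

def h_sum(A, B):
    """Summed (averaged) directional Hausdorff distance."""
    return sum(dist_to_set(a, B) for a in A) / len(A)

def H_rms(A, B):
    return max(h_rms(A, B), h_rms(B, A))
```

A single far-away outlier changes h_rms and h_sum only by an O(1/|A|) fraction of its distance, while it dominates the max-based h entirely, which is exactly the robustness motivation given in the text.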

Prior work. It is beyond the scope of this section to discuss all the results on shape matching. We refer the reader to [17, 97, 148] and references therein for a sample of known results. Here we summarize known results on shape matching using the Hausdorff distance measure.

Most of the early work on computing the Hausdorff distance focused on finite point sets. Let A and B be two families of m and n points, respectively. In the plane, H(A, B) can be computed in O((m + n) log(m + n)) time using Voronoi diagrams [14]. In R³, it can be computed efficiently using the range-searching data structures of Agarwal and Matoušek [9]. Huttenlocher et al. [107] showed that the minimum Hausdorff distance under translation, σ(A, B), can be computed in O(mn(m + n) α(mn) log(mn)) time in R², where α is the inverse Ackermann function, and with a larger polynomial bound in R³. Chew et al. [57] presented an efficient algorithm to compute σ(A, B) in R^d for any fixed d. The minimum Hausdorff distance between A and B under rigid motion in the plane can also be computed in polynomial time [106].

Faster approximation algorithms to compute σ(A, B) were first proposed by Goodrich et al. [95]. Aichholzer et al. proposed a framework of approximation algorithms using reference points [12]. In the plane, their algorithm approximates the optimal Hausdorff distance under translations within a constant factor in O((m + n) log(m + n)) time, and similar constant-factor bounds hold over rigid motions. The reference-point approach can be extended to higher dimensions. However, it neither approximates the directional Hausdorff distance over a set of transformations, nor can it cope with the partial-matching problem.

 ¨

Indyk et al. [110] study the partial matching problem, i.e., given a query , com-

£ 

£ ¡ ¥¨§ pute the maximum number of points from such that . They present algorithms for ¤ -approximating the maximum-size partial matching over the set of rigid

97

££¢ ££¢

 

§ §

¤ ¤)¤ ¡ ¡ ¡ ¤ ¤¡ ¡ ¡ ¡

¡ ¡ + + ¡

¦ motions in ¦ time in , and in time in

1

¡  , where is the maximum of the spreads of the two point sets . Their algorithm can be extended to approximate the minimum summed Hausdorff distance over rigid motions.

Similar results were independently achieved in [46] via different technique.

Algorithms are also known for computing h_U(A, B) and/or H_U(A, B) where A and B are sets of segments in the plane, or sets of simplices in higher dimensions [11, 14, 15]. Atallah computed H_U(A, B) for two convex polygons [25]. Agarwal et al. [11] provide an algorithm for computing the minimum Hausdorff distance under translation for sets of segments in the plane. For the case where any rigid motion is allowed, the minimum Hausdorff distance between sets of segments can also be computed in polynomial time (Chew et al. [58]). Aichholzer et al. approximate the minimum Hausdorff distance under different families of transformations for sets of points or segments in R², and sets of triangles in R³, using reference points [12]. Other than that, little is known about computing h_U(A, B) or σ_U(A, B) where A and B are simplices or other geometric shapes in higher dimensions.

Our results. In this chapter, we develop efficient algorithms for computing σ_F(A, B) and σ_U^F(A, B) for balls, and for approximating σ_rms(A, B) and σ_sum(A, B) for sets of points in R^d. Consequently, the chapter consists of three parts, where the first two deal with the two variants of Hausdorff distance for balls, and the third part studies the rms and summed Hausdorff-distance problems for point sets.

Let B(c, r) denote the ball in R^d of radius r centered at c. Let A = { A₁, …, A_m } and B = { B₁, …, B_n } be two families of balls in R^d, where A_i = B(a_i, ρ_i) and B_j = B(b_j, r_j), for each i and j. Let F be the set of all translation vectors t so that no ball of A + t intersects any ball of B.

Section 5.2 considers the problem of computing the Hausdorff distance between two sets A and B of balls under the collision-free constraint, where the distance between two disjoint balls A_i and B_j is defined as δ(A_i, B_j) = ‖a_i − b_j‖ − ρ_i − r_j. We can regard this distance as an additively weighted Euclidean distance between the centers of A_i and B_j, and it is a common way of measuring the distance between atoms in molecular biology [97]. In Section 5.2 we describe algorithms for computing σ_F(A, B) in two and three dimensions. The running time is O(mn(m + n) log²(mn)) in R², and O(m²n²(m + n) log²(mn)) in R³. The approach can be extended to solve the (collision-free) partial-matching problem under this variant of the Hausdorff distance in the same asymptotic time complexity.

Section 5.3 considers the problem of computing σ_U(A, B) and σ_U^F(A, B), i.e., it computes the Hausdorff distance between the points lying in the union of A and the union of B, minimized over all translates of A, in R² or in R³. We first describe an exact algorithm for computing h_U(A, B) and σ_U(A, B) in R², which relies on several geometric properties of the union of disks. A straightforward extension of our algorithm to R³ is harder to analyze, and does not yield efficient bounds on its running time, so we consider approximation algorithms. In particular, given a parameter ε > 0, we compute a translation t, in time polynomial in m + n and 1/ε in both R² and R³, such that H_U(A + t, B) ≤ (1 + ε) σ_U(A, B).

We also present a "pseudo-approximation" algorithm for computing σ_U^F(A, B): given an ε > 0, the algorithm computes a region F′, an ε-approximation of F (in a sense defined formally in Section 5.3), and returns a placement t ∈ F′ such that

    H_U(A + t, B) ≤ (1 + ε) σ_U^F(A, B).

This variant of approximation makes sense in applications where the data is noisy and shallow penetrations between objects are allowed, as is the case in the docking problem [97].

Finally, let A and B be two sets of points in R^d. Section 5.4 describes an algorithm that computes an ε-approximation of σ_rms(A, B) in near-linear time for any fixed ε > 0. It also provides a data structure that can return, in polylogarithmic time, an ε-approximation of H_rms(A + t, B) for a query translation t. In fact, we solve a more general problem, which is interesting in its own right. Given a family S₁, …, S_k of point sets in R^d, with a total of n points, we construct a decomposition of R^d into a near-linear number of cells, which is an ε-approximation of each of the Voronoi diagrams of S₁, …, S_k, in the sense defined in [24, 98]. Moreover, given a semigroup operation, we can preprocess this decomposition in near-linear time, so that an ε-approximation of the aggregate distance from a query point q to the sets S₁, …, S_k can be computed in polylogarithmic time. We also extend the approach to ε-approximating σ_sum(A, B). This result relies on a dynamic data structure, which we propose, for maintaining an ε-approximation of the 1-median of a point set in R^d, under insertion and deletion of points.
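The bounds above come from the approximate-Voronoi-diagram and dynamic 1-median machinery of Section 5.4; as a conceptual stand-in only, one can approximate the minimum rms Hausdorff distance under translation by scanning a uniform grid of candidate translations. The grid step, search range, and all function names below are my own assumptions, not the thesis's algorithm:

```python
import itertools
import math

def dist_to_set(p, B):
    return min(math.dist(p, b) for b in B)

def H_rms(A, B):
    def h(X, Y):
        return math.sqrt(sum(dist_to_set(x, Y) ** 2 for x in X) / len(X))
    return max(h(A, B), h(B, A))

def sigma_rms_grid(A, B, step=0.25, span=2.0):
    """Approximate min over translations t of H_rms(A + t, B) by
    evaluating H_rms on a uniform grid of candidate translations."""
    ticks = [i * step for i in range(int(-span / step), int(span / step) + 1)]
    best = float("inf")
    for tx, ty in itertools.product(ticks, ticks):
        At = [(x + tx, y + ty) for x, y in A]
        best = min(best, H_rms(At, B))
    return best
```

The grid resolution controls the approximation quality here, which is precisely the inefficiency that the decomposition-based approach of Section 5.4 avoids.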

5.2 Collision-Free Hausdorff Distance between Sets of Balls

Let A = { A₁, …, A_m } and B = { B₁, …, B_n } be two sets of balls in R^d, d = 2, 3, with A_i = B(a_i, ρ_i) and B_j = B(b_j, r_j). For two disjoint balls A_i and B_j, we define

    δ(A_i, B_j) = ‖a_i − b_j‖ − ρ_i − r_j,

namely, the (minimum) distance between A_i and B_j as point sets. Let F be the set of placements t of A such that no ball in A + t intersects any ball of B. In this section, we describe an exact algorithm for computing σ_F(A, B), and show that it can be extended to partial matching.
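For concreteness, the additively weighted distance and the collision-free test can be sketched directly from these definitions (a minimal illustration for finite families of balls; the names are mine):

```python
import math

def ball_dist(a, rho, b, r):
    """Distance between the balls B(a, rho) and B(b, r) as point sets,
    i.e. the additively weighted distance of the centers; it is
    negative exactly when the open balls overlap."""
    return math.dist(a, b) - rho - r

def collision_free(A, B):
    """A, B: lists of (center, radius) pairs.  True iff no ball of A
    intersects the interior of any ball of B."""
    return all(ball_dist(a, ra, b, rb) >= 0 for a, ra in A for b, rb in B)
```

Checking `collision_free` for A + t over candidate translations t is the feasibility condition defining the set F.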

Indyk et al. [110] outline an approximation algorithm for computing σ_F without providing any details. We believe that if the details of their algorithm are worked out, the running time of our algorithm compares favorably; moreover, our algorithm is more direct.

5.2.1 Computing σ_F(A, B) in 2D and 3D

As is common in geometric optimization, we first present an algorithm for the decision problem: given a parameter δ ≥ 0, we wish to determine whether σ_F(A, B) ≤ δ. We then use the parametric-searching technique [11, 133] to compute σ_F(A, B). Given δ, for 1 ≤ j ≤ n, let K_j be the set of vectors t such that

(V1) B_j does not intersect the interior of any ball A_i + t;

(V2) min_{1 ≤ i ≤ m} δ(A_i + t, B_j) ≤ δ.

Let U_j⁻ = ∪_{i=1}^m B(b_j − a_i, ρ_i + r_j) and U_j⁺ = ∪_{i=1}^m B(b_j − a_i, ρ_i + r_j + δ). Then U_j⁺ is the set of vectors that satisfy (V2), and the interior of U_j⁻ is the set of vectors that violate (V1). Clearly, K_j = U_j⁺ \ int U_j⁻. Let

    W(A, B) = ∩_{j=1}^n K_j.

See Figure 5.1 for an illustration. By definition, W(A, B) is the set of collision-free vectors t such that h(B, A + t) ≤ δ. Similarly, exchanging the roles of A and B (but still translating A), we define W(B, A) as the set of collision-free vectors t such that h(A + t, B) ≤ δ. Thus σ_F(A, B) ≤ δ if and only if W(A, B) ∩ W(B, A) ≠ ∅.
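This characterization gives a direct membership test: t lies in K_j exactly when it is inside the union of the expanded disks but outside the interior of the union of the unexpanded ones. A small 2D sketch (naming and the explicit loop over A are my own):

```python
import math

def in_K_j(t, bj, rj, A, delta):
    """Membership test for K_j: the translation t is collision-free
    with respect to the ball B(bj, rj) (condition V1) and brings some
    ball of A within distance delta of it (condition V2).
    A is a list of (center, rho) pairs."""
    centers = [(bj[0] - a[0], bj[1] - a[1]) for a, _ in A]
    radii = [rho for _, rho in A]
    # interior of U_j^-  (violates V1)
    in_minus = any(math.dist(t, c) < rho + rj
                   for c, rho in zip(centers, radii))
    # U_j^+  (satisfies V2)
    in_plus = any(math.dist(t, c) <= rho + rj + delta
                  for c, rho in zip(centers, radii))
    return in_plus and not in_minus
```

Intersecting these tests over all j = 1, …, n corresponds to membership in W(A, B).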

Lemma 5.2.1 The combinatorial complexity of W(A, B) in R² is O(mn²).

PROOF. If an edge of ∂W(A, B) is not adjacent to any vertex, then it is the entire circle bounding a disk of some U_j⁺ or U_j⁻. There are O(mn) such disks, so it suffices to bound the number of vertices in W(A, B).

Let v be a vertex of W(A, B); v is either a vertex of K_j, for some j, or an intersection point of an edge in ∂K_j and an edge in ∂K_l, for some j ≠ l. In the latter case,

    v ∈ (∂U_j⁺ ∪ ∂U_j⁻) ∩ (∂U_l⁺ ∪ ∂U_l⁻).

Figure 5.1: (a) The inner, middle, and outer disks have radii ρ_i, ρ_i + r_j, and ρ_i + r_j + δ, respectively; (b) an example of K_j (dark region), which is the difference between U_j⁺ (the whole union) and the interior of U_j⁻ (inner light region).

In other words, a vertex of W(A, B) is a vertex of U_j⁺ ∪ U_l⁺, U_j⁺ ∪ U_l⁻, U_j⁻ ∪ U_l⁺, or U_j⁻ ∪ U_l⁻, for some 1 ≤ j < l ≤ n. Observe that a vertex of U_j⁺ (resp., of U_j⁻) that lies on both ∂U_j⁺ and ∂U_l^± (resp., ∂U_j⁻ and ∂U_l^±) is also a vertex of the corresponding union. Therefore, every vertex in W(A, B) is a vertex of one of these pairwise unions, for some 1 ≤ j < l ≤ n. Since each U_j^± is the union of a set of m disks, each pairwise union is the union of a set of 2m disks and thus has O(m) vertices [120]. Hence, W(A, B) has O(mn²) vertices.

Lemma 5.2.2 The combinatorial complexity of W(A, B) in R³ is O(m²n³).

PROOF. The number of faces or edges of W(A, B) that do not contain any vertex is O(mn), since they are defined by at most two balls in a family of O(mn) balls. We therefore focus on the number of vertices of W(A, B). As in the proof of Lemma 5.2.1, any vertex v of W(A, B) satisfies

    v ∈ (∂U_j⁺ ∪ ∂U_j⁻) ∩ (∂U_l⁺ ∪ ∂U_l⁻) ∩ (∂U_k⁺ ∪ ∂U_k⁻),

for some 1 ≤ j ≤ l ≤ k ≤ n. Again, such a vertex is also a vertex of U_j^* ∪ U_l^* ∪ U_k^*, where each * is + or −. Since the union of O(m) balls in R³ has O(m²) vertices, each such union has O(m²) vertices, thereby implying that W(A, B) has O(m²n³) vertices.


Similarly, we can prove that the complexity of W(B, A) is O(nm²) in R² and O(n²m³) in R³. Extending the preceding arguments a little, we obtain the following.

Lemma 5.2.3 W(A, B) ∩ W(B, A) has a combinatorial complexity of O(mn(m + n)) in R², and O(m²n²(m + n)) in R³.

Remark. The above argument in fact bounds the complexity of the arrangement of { ∂K_j | 1 ≤ j ≤ n }. For example, in R², any intersection point of ∂K_j and ∂K_l lies on the boundary of some U_j^* ∪ U_l^*, and we have argued that each such union has O(m) vertices. Hence, the entire arrangement has O(mn²) vertices in R².

We exploit a divide-and-conquer approach, combined with a plane sweep, to compute W(A, B), W(B, A), and their intersection in R². For example, to compute W(A, B), we compute W₁ = ∩_{j ≤ n/2} K_j and W₂ = ∩_{j > n/2} K_j recursively, and merge W(A, B) = W₁ ∩ W₂ by a plane-sweep method. The overall running time is O(mn(m + n) log(mn)).

To decide whether W(A, B) ∩ W(B, A) ≠ ∅ in R³, it suffices to check whether (W(A, B) ∩ W(B, A)) ∩ ∂D is empty for each ball D in the families of balls defining W(A, B) and W(B, A). Using the fact that the various balls meet any fixed sphere ∂D in a collection of spherical caps, we can compute this intersection on a single sphere by the same divide-and-conquer approach used for computing W(A, B) in R². Therefore we can determine in O(m²n²(m + n) log(mn)) time whether σ_F(A, B) ≤ δ in R³.

Finally, the optimization problem can be solved by the parametric-search technique [11]. In order to apply parametric search, we need a parallel version of the above procedure. However, the above divide-and-conquer paradigm uses a plane sweep during the merge stage, which is not easy to parallelize. We therefore instead adopt the same algorithm as in [11] to compute the union or intersection of two planar or spherical regions. It yields an overall parallel algorithm for determining whether W(A, B) ∩ W(B, A) is empty in O(log(mn)) parallel time, using O(mn(m + n)) processors in R² and O(m²n²(m + n)) processors in R³. This implies the following.

Theorem 5.2.4 Given two sets A and B of m and n disks (or balls), we can compute σ_F(A, B) in O(mn(m + n) log²(mn)) time in R², and in O(m²n²(m + n) log²(mn)) time in R³.
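The thesis obtains the optimization bound from the decision procedure via parametric search; in practice, a simple bisection over δ is a common approximate stand-in whenever decide(δ) is monotone. The sketch below is my own illustration of that decision-to-optimization step, not the parametric-search algorithm of [11, 133]:

```python
def minimize_by_bisection(decide, lo, hi, eps=1e-6):
    """Given a monotone decision procedure decide(delta), returning
    True iff the optimum is <= delta, locate the optimum within eps
    by binary search.  Assumes decide(lo) is False and decide(hi)
    is True."""
    while hi - lo > eps:
        mid = (lo + hi) / 2.0
        if decide(mid):
            hi = mid          # optimum is at most mid
        else:
            lo = mid          # optimum is larger than mid
    return hi
```

Each iteration costs one call to the decision procedure, so the total cost is the decision cost times O(log((hi − lo)/eps)); parametric search removes the dependence on a numerical precision parameter.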

5.2.2 Partial matching

Extending the definition of partial matching in [110], we define the partial collision-free Hausdorff distance problem as follows. Given an integer k ≥ 1, let h^k(B, A) denote the k-th largest value in the set { δ(B_j, A) | 1 ≤ j ≤ n }; note that h¹(B, A) = h(B, A). We define h^k(A, B) in a fully symmetric manner, and then define H^k(A, B) and σ_F^k(A, B), as above. The preceding algorithm can be extended to compute σ_F^k(A, B) in the same asymptotic time complexity. We briefly illustrate the two-dimensional case. Let K_j, for 1 ≤ j ≤ n, be as defined above, and let 𝒜 be the arrangement of { ∂K_j | 1 ≤ j ≤ n }. For each cell Δ of 𝒜, let χ(Δ) be the number of K_j's that contain Δ. Note that for any point t in a cell Δ with χ(Δ) ≥ n − k + 1, we have h^k(B, A + t) ≤ δ, and vice versa. Hence, we compute 𝒜 and χ(Δ) for each cell Δ, and then discard all the cells Δ for which χ(Δ) < n − k + 1. The remaining cells form the set W^k(A, B). By the Remark following Lemma 5.2.3, 𝒜 has O(mn²) vertices, and it can be computed in O(mn² log(mn)) time. Therefore, W^k(A, B) can be computed in O(mn² log(mn)) time. Similarly, we can compute W^k(B, A) in O(nm² log(mn)) time, and we can determine in O(mn(m + n) log(mn)) time whether W^k(A, B) ∩ W^k(B, A) ≠ ∅. Similar arguments solve the partial matching problem in R³. Putting everything together, we obtain the following.

Theorem 5.2.5 Let A and B be two families of m and n balls, respectively, and let k ≥ 1 be an integer. We can compute σ_F^k(A, B) in O(mn(m + n) log²(mn)) time in R², and in O(m²n²(m + n) log²(mn)) time in R³.
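For finite sets, the k-th-largest statistic underlying the partial variant is easy to compute by brute force. The sketch below works on point sets for clarity; the function name and the Euclidean distance are my own choices:

```python
import math

def h_k(A, B, k):
    """k-th largest of the values dist(a, B) over a in A; k = 1 gives
    the standard directional Hausdorff distance, and larger k ignores
    the k - 1 worst-matched objects of A."""
    vals = sorted((min(math.dist(a, b) for b in B) for a in A),
                  reverse=True)
    return vals[k - 1]
```

Requiring h_k(A, B) ≤ δ says that all but k − 1 objects of A are matched within distance δ, which is exactly the depth condition χ(Δ) ≥ n − k + 1 used in the arrangement argument above.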

5.3 Hausdorff Distance between Unions of Balls

In Section 5.3.1 we describe the algorithm for computing σ_U(A, B) in R². The same approach can be extended to compute σ_U^F(A, B) within the same asymptotic time complexity. In Section 5.3.2, we present approximation algorithms for the same problem in R² and R³.

5.3.1 The exact 2D algorithm

Let A = { A₁, …, A_m } and B = { B₁, …, B_n } be two sets of disks in the plane. Write, as above, A_i = B(a_i, ρ_i), for 1 ≤ i ≤ m, and B_j = B(b_j, r_j), for 1 ≤ j ≤ n. Let U_A (resp., U_B) be the union of the disks in A (resp., B). As in Section 5.2, we focus on the decision problem for a given parameter δ ≥ 0.

For any point p, we have

    d(p, U_B) = min_{q ∈ U_B} ‖p − q‖ = min_{1 ≤ j ≤ n} max { 0, ‖p − b_j‖ − r_j }.

This value is greater than δ if and only if

    p ∉ ∪_{j=1}^n B(b_j, r_j + δ).

In other words, h_U(A + t, B) > δ if and only if there exists a point p ∈ U_A + t such that p ∉ ∪_j B_j^δ, where B_j^δ = B(b_j, r_j + δ) is the disk B_j expanded by δ.
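The equivalence just stated — a point is farther than δ from the union exactly when it avoids every disk expanded by δ — can be checked directly. A small sketch with my own function names:

```python
import math

def dist_to_union(p, B):
    """Distance from the point p to the union of the disks
    B = [(center, radius)]: min over j of max(0, |p - b_j| - r_j)."""
    return min(max(0.0, math.dist(p, b) - r) for b, r in B)

def outside_expanded(p, B, delta):
    """True iff p avoids every disk of B expanded by delta, which is
    equivalent to dist_to_union(p, B) > delta."""
    return all(math.dist(p, b) > r + delta for b, r in B)
```

The two predicates agree on every input, which is what lets the decision procedure reason about expanded disks instead of distances.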

Let

    R(A, B) = { t | h_U(A + t, B) ≤ δ };

that is, R(A, B) is the set of all translations t such that U_A + t ⊆ ∪_j B_j^δ. Our decision procedure computes the set R(A, B) and the analogously defined set

    R(B, A) = { t | h_U(B, A + t) ≤ δ },

and then tests whether R(A, B) ∩ R(B, A) ≠ ∅. To understand the structure of R(A, B), we first study the case in which A consists of just one disk A₁, with center a and radius ρ. For simplicity of notation, we denote ∪_j B_j^δ temporarily by E. Let V be the set of vertices of ∂E and Σ the set of (relatively open) edges of ∂E; |V| + |Σ| = O(n) [120].

Consider the Voronoi diagram of the boundary features of ∂E, clipped to within E. This is a decomposition of E into cells, so that, for each feature φ ∈ V ∪ Σ, the cell V(φ) of φ is the set of points p ∈ E such that d(p, φ) ≤ d(p, φ′), for all φ′ ∈ V ∪ Σ. The diagram is closely related to the medial axis of ∂E. See Figure 5.2(a). For each e ∈ Σ, let sec(e) denote the circular sector spanned by e within the expanded disk B_j^δ whose boundary contains e, and let S = ∪_{e ∈ Σ} sec(e). The diagram has the following structure. A variant of the following lemma was observed in [19].

Lemma 5.3.1 (a) For each e ∈ Σ, we have V(e) = sec(e).

(b) For each v ∈ V, we have V(v) = Vor(v) ∩ E, where Vor(v) is the Voronoi cell of v in the Voronoi diagram of the vertex set V. Moreover, Vor(v) is a convex polygon.

Lemma 5.3.1 implies that this diagram yields a convex decomposition of E of linear size. Returning to the study of the structure of R(A₁, B): by definition, t ∈ R(A₁, B) if and only if a + t ∈ E and d(a + t, φ) ≥ ρ, where φ is the feature of ∂E whose cell contains a + t. This implies that the set of all translations t for which A₁ + t ⊆ E is given by

    R(A₁, B) = ∪_{φ ∈ V ∪ Σ} (K_φ − a),

where

    K_φ = { p ∈ V(φ) | d(p, φ) ≥ ρ }.

Figure 5.2: (a) Medial axis (dotted segments) of the union of four disks centered at solid points: the Voronoi diagram decomposes the union into cells; (b) the regions K_φ obtained by shrinking the cell of each boundary element φ of the union; (c) a light-colored disk bounds a convex arc, and a dark disk bounds a concave arc.

For φ = e ∈ E, K_e is the sector obtained from S(e) by shrinking it by the appropriate distance towards its center. For φ = v ∈ V, K_v is the correspondingly shrunk copy of the cell V(v). See Figure 5.2(b) for an illustration.

Now return to the original case in which A consists of m disks; we obtain

K_AB = ∩_{i=1}^{m} K_i,

where K_i denotes the region defined above for the single disk A_i.

Note that each K_i is bounded by circular arcs, some of which are convex (those bounding shrunk sectors), and some are concave (those bounding shrunk Voronoi cells of vertices). Convex arcs are bounded by disks concentric with the disks of B, while concave arcs are bounded by disks centered at the vertices of ∂U. Furthermore, since K_i is obtained by removing all points t such that the nearest distance from c_i + t to ∂U violates the corresponding threshold, we have that: (i) K_i is contained in each disk bounding one of its convex arcs; and (ii) the interior of each disk bounding a concave arc of ∂K_i is disjoint from K_i. See Figure 5.2(c) for an illustration.

Lemma 5.3.2 For any pair of disks A_i, A_j ∈ A, the complexity of K_i ∩ K_j is O(n).

PROOF. Clearly, K_i ∩ K_j is bounded by circular arcs, whose endpoints are either vertices of ∂K_i or ∂K_j, or intersection points between an arc of ∂K_i and an arc of ∂K_j. It suffices to estimate the number of vertices of the latter kind.

Consider the set D of the O(n) disks that bound the convex and concave arcs of ∂K_i and ∂K_j. We claim that any intersection point between two arcs, from ∂K_i and ∂K_j respectively, lies on ∂U(D). Indeed, assume that p is such an intersection point that does not lie on ∂U(D). Then there is a disk D ∈ D that contains p in its interior. There are two possibilities for the choice of disk D.

(i) D bounds some convex arc on ∂K_i (resp. ∂K_j), and K_i ⊆ D (resp. K_j ⊆ D). As such, p cannot appear on the boundary of K_i (resp. K_j), contrary to assumption.

(ii) D bounds some concave arc on ∂K_i (resp. ∂K_j), i.e., D is the shrunk disk associated with a vertex of ∂U. Recall that V is the set of vertices on the boundary of U. Therefore, by definition, D contains p in its interior while its interior is disjoint from K_i (resp. K_j), implying that p cannot lie on ∂K_i (resp. ∂K_j). Contradiction.

Thus we have proved the claim by contradiction. It then follows, using the bound of [120] on the complexity of the union of disks, that the number of intersections under consideration is at most O(n).

Each vertex of ∂K_AB is also a vertex of some K_i ∩ K_j. Applying the preceding lemma to all O(m²) pairs A_i, A_j, we obtain the following.

Lemma 5.3.3 The complexity of K_AB is O(m²n), and it can be computed in O(m²n log(mn)) time.

Similarly, the set K_BA has complexity O(n²m) and can be computed in time O(n²m log(mn)). Finally, we can determine whether K_AB ∩ K_BA ≠ ∅, by plane sweep, in O(mn(m + n) log(mn)) time. Using parametric search as in [11], the minimum Hausdorff distance itself can be computed at the cost of an additional logarithmic factor.

To compute the translation minimizing the two-sided Hausdorff distance, we follow the same approach as for the one-sided distance in the preceding section. Combined with an argument similar to the one above, we can show the following.

Theorem 5.3.4 Given two families A and B of m and n disks in R², we can compute both min_t h(A + t, B) and min_t h(B + t, A) in O(mn(m + n) log²(mn)) time.

5.3.2 Approximation algorithms

No good bounds are known for the complexity of the Voronoi diagram of the boundary of the union of n balls in R³, or, more precisely, for the complexity of the portion of the diagram inside the union [19]. Hence, a naïve extension of the preceding exact algorithm to R³ yields an algorithm whose running time is hard to calibrate, and only rather weak upper bounds can be derived. We therefore resort to approximation algorithms.

Approximating min_t h(A + t, B) in R² and R³. Given a parameter ε > 0, we wish to compute a translation t* of A such that h(A + t*, B) ≤ (1 + ε) min_t h(A + t, B), i.e., t* is an ε-approximation of the optimum. Our approximation algorithm follows the same approach as the one used in [12, 14]. That is, let a (resp., b) denote the bottom-left point, called the reference point, of the axis-parallel bounding box of U(A) (resp., U(B)). Set t0 = b − a. It is shown in [14] that, in R^d, h(A + t0, B) is within a constant factor of min_t h(A + t, B).
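The reference-point step is simple enough to state directly in code. The sketch below (our names; disks are (cx, cy, r) triples) computes t0 as the difference of the bottom-left bounding-box corners of the two unions:

```python
def reference_translation(A, B):
    """Translation t0 = b - a between the bottom-left corners of the
    axis-parallel bounding boxes of the unions U(A) and U(B)."""
    def ref(disks):
        # bottom-left corner of the bounding box of a union of disks
        return (min(cx - r for cx, cy, r in disks),
                min(cy - r for cx, cy, r in disks))
    ax, ay = ref(A)
    bx, by = ref(B)
    return (bx - ax, by - ay)
```

Translating A by t0 aligns the two reference points, which is what the constant-factor guarantee of [14] is about.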

Computing t0 takes O(m + n) time. We compute h(A + t0, B) using the parametric search technique [11], which is based on the following simple implementation of the decision procedure: for a given parameter r ≥ 0, let B^r denote the set obtained from B by increasing the radius of every ball by r. We observe that h(A + t0, B) ≤ r if and only if U(A + t0) ⊆ U(B^r), and symmetrically for h(B, A + t0).


To test whether U(A + t0) ⊆ U(B^r), we compute the union of the balls in (A + t0) ∪ B^r and check whether any ball of A + t0 contributes to its boundary. If not, then U(A + t0) ⊆ U(B^r). Similarly, we test the condition with the roles of A and B exchanged. The total time spent is O((m + n) log(m + n)) in R², and O((m + n)²) in R³.

In order to compute an ε-approximation of min_t h(A + t, B) from a constant-factor approximation, we use the standard trick of placing a grid in the neighborhood of t0, and returning the smallest value of h(A + t, B), where t ranges over the grid points. We conclude the following.
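The grid-refinement trick is generic: given any black-box cost over translations and a constant-factor estimate of the scale, evaluate the cost on a grid of spacing proportional to ε around the reference translation. A minimal 2-d sketch (function name and interface are ours):

```python
def refine_by_grid(t0, radius, eps, cost):
    """Evaluate cost on a grid of spacing eps*radius inside a box of
    half-side `radius` around t0, and return the best grid translation."""
    best_t, best = t0, cost(t0)
    step = eps * radius
    k = int(radius / step)
    for i in range(-k, k + 1):
        for j in range(-k, k + 1):
            t = (t0[0] + i * step, t0[1] + j * step)
            c = cost(t)
            if c < best:
                best, best_t = c, t
    return best_t, best
```

With radius chosen as a constant multiple of the constant-factor estimate, one of the O(1/ε²) grid points is within ε of optimal, which is exactly how the theorems below convert a constant-factor bound into a (1 + ε)-approximation.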

Theorem 5.3.5 Given two sets of balls A and B of size m and n, respectively, and ε > 0, an ε-approximation of min_t h(A + t, B) can be computed in O((1/ε²)(m + n) log(m + n)) time in R², and in O((1/ε³)(m + n)²) time in R³.

Pseudo-approximation for collision-free placements. Currently we do not have an efficient algorithm to ε-approximate, in R³, the minimum Hausdorff distance over placements at which the interiors of the two unions are disjoint. Instead, we present a “pseudo-approximation” algorithm in the following sense.

Let F = U(A) ⊕ (−U(B)), where ⊕ denotes the Minkowski sum, be the set of all placements of B at which U(B + t) intersects U(A); that is, F = { t : U(B + t) ∩ U(A) ≠ ∅ }. Clearly F = ∪_{i,j} (A_i ⊕ (−B_j)), a union of mn balls. For a parameter δ > 0, let

F^δ = { t ∈ F : d(t, ∂F) ≥ δ },

and call a region F' δ-free if F^δ ⊆ F' ⊆ F.

This notion of approximation is motivated by some applications in which the data is noisy, and/or shallow penetration is allowed. For example, each atom in a protein is in fact a “fuzzy” ball instead of a hard ball [97]. We can model this fuzziness by allowing any atom to be intersected by other atoms, but only within a shell of width δ near its boundary. In this way, the atoms of two docking molecules may penetrate a little in the desired placement. Although F can have large complexity in R³, we present a technique for constructing a δ-free region of considerably smaller complexity. We thus compute a δ-free region F' and a placement t* avoiding the interior of F', such that h(B + t*, A) approximates the best collision-free value up to a constant factor and an additive term proportional to δ. We refer to such an approximation as a pseudo-approximation.

Lemma 5.3.6 A δ-free region of size O(mn/δ²) can be computed in O((mn/δ²) log(1/δ)) time.

PROOF. Let D = { A_i ⊕ (−B_j) : A_i ∈ A, B_j ∈ B }, the set of mn balls whose union is F. We insert each ball

of D into an oct-tree T. Let C_v be the cube associated with a node v of T. In order to insert a ball D, we visit T in a top-down manner. Suppose we are at a node v. If C_v ⊆ D, we mark v black and stop. If C_v ∩ D ≠ ∅ and the size of C_v is at least δ, then we recursively visit the children of v. Otherwise, we stop. After we insert all balls from D, if all eight children of a node v are marked black, we mark v black. Let M be the set of highest marked nodes, i.e., each v ∈ M is marked black but none of its ancestors is black. It is easy to verify that each ball marks at most O(1/δ²) nodes black, as the nodes it marks are disjoint and of size at least δ; thus |M| = O(mn/δ²). The whole construction takes O((mn/δ²) log(1/δ)) time, and obviously F^δ ⊆ ∪_{v ∈ M} C_v ⊆ F.

Set F' = ∪_{v ∈ M} C_v; it is a δ-free region, as claimed.
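The recursive marking step of the proof can be sketched as follows (an illustrative implementation under simplifying assumptions: balls are (x, y, z, r) tuples, the root cube is supplied by the caller, and `mark_free` is our name for the routine):

```python
import math

def mark_free(cell, size, balls, delta, out):
    """Collect oct-tree cubes fully covered by some ball (a sketch of the
    delta-free region construction; cubes smaller than delta are not
    refined further).  `cell` is the lower corner of a cube of side `size`."""
    def corners(c, s):
        return [(c[0] + i * s, c[1] + j * s, c[2] + k * s)
                for i in (0, 1) for j in (0, 1) for k in (0, 1)]
    def intersects(c, s, b):
        # clamp the ball center into the cube; cube and ball meet iff the
        # clamped point lies within distance r of the center
        q = tuple(min(max(b[t], c[t]), c[t] + s) for t in range(3))
        return math.dist(q, b[:3]) <= b[3]
    for b in balls:
        if all(math.dist(p, b[:3]) <= b[3] for p in corners(cell, size)):
            out.append((cell, size))   # cube lies inside this ball: mark black
            return
    if size < delta or not any(intersects(cell, size, b) for b in balls):
        return
    h = size / 2.0
    for i in (0, 1):
        for j in (0, 1):
            for k in (0, 1):
                mark_free((cell[0] + i * h, cell[1] + j * h, cell[2] + k * h),
                          h, balls, delta, out)
```

The collected cubes are pairwise interior-disjoint and each lies inside the union, so their union is a δ-free region in the sense of the lemma (the sketch processes all balls per node rather than inserting one ball at a time, but marks the same cells).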

Furthermore, let t0, a, and b be as defined earlier in this section. We prove the following result.

Lemma 5.3.7 Let t1 be the point closest to t0 that lies outside the interior of F'. Then

h(B + t1, A) = O( min_{t ∉ int F} h(B + t, A) + δ ).


PROOF. Let t' be the optimal collision-free placement, so that h(B + t', A) = min_{t ∉ int F} h(B + t, A). Since F' ⊆ F, the placement t' also lies outside the interior of F', and hence

‖t1 − t0‖ ≤ ‖t' − t0‖.

A result in [12] implies that ‖t' − t0‖ = O(h(B + t', A)). On the other hand,

h(B + t1, A) ≤ h(B + t', A) + ‖t1 − t'‖ ≤ h(B + t', A) + ‖t1 − t0‖ + ‖t0 − t'‖ ≤ h(B + t', A) + 2‖t' − t0‖,

which, together with the slack of at most δ incurred by replacing F with the δ-free region F', yields the claimed bound.

The closest point t1 of t0 outside F' can be computed as follows. Recall that, in Lemma 5.3.6, F' = ∪_{v ∈ M} C_v consists of a set of cubes that are disjoint (other than on the boundary). We first check whether t0 ∈ F' by a point-location operation. If the answer is no, then t1 = t0, and we return it. Otherwise, t1 is the point from ∂F' that is closest to t0. In the latter case, t1 is either a vertex of a cube in M, or lies in the interior of an edge or a face of a cube in M. Given a node v ∈ M, for each boundary feature φ of C_v, that is, a face, an edge, or a vertex of C_v, compute the closest point of φ to t0. Let Q be the resulting set of closest points. Next, for each q ∈ Q, check whether q ∈ ∂F' by visiting the neighboring marked nodes that also contain q. This can be achieved by performing O(1) point-location operations. Finally, we traverse all nodes in M. Among all those points from the sets Q that lie on ∂F', return the one that is closest to t0. There are O(mn/δ²) cubes, and each has a constant number of boundary features. Furthermore, at most a constant number of nodes from M contain a given point, and each point location takes O(log(1/δ)) time. Hence, t1 can be computed in O((mn/δ²) log(1/δ)) time.

We can compute t1 in O((mn/δ²) log(1/δ)) time, as described above, so we can pseudo-approximate the collision-free optimum within a constant factor in the same time bound. Furthermore, we can again draw a grid around t1 and compute an ε-approximation. We obtain the following.


Theorem 5.3.8 Given two sets A and B of m and n balls in R³ and parameters δ, ε > 0, we can compute, in O((mn/(δ²ε³)) log(1/δ)) time, a δ-free region F' and a placement t* of B with t* ∉ int(F'), such that h(B + t*, A) ≤ (1 + ε) min_{t ∉ int F} h(B + t, A) + O(δ).

5.4 RMS and Summed Hausdorff Distance between Points

We first establish a result on the simultaneous approximation of the Voronoi diagrams of several point sets, which we believe to be of independent interest, and then we apply this result to approximate min_t h_RMS(A + t, B) and min_t h_S(A + t, B) for point sets A and B in any dimension.

5.4.1 Simultaneous approximation of Voronoi diagrams



Given a family F = {P_1, …, P_u} of point sets in R^d, with a total of n points, and a parameter ε > 0, we wish to construct a subdivision of R^d, so that, for any query point q, we can quickly compute points p_i ∈ P_i, for all 1 ≤ i ≤ u, so that d(q, p_i) ≤ (1 + ε) d(q, P_i), where d(q, P) is, as defined earlier, min_{p ∈ P} d(q, p). Our data structure is based on a recent result by Arya and Malamatos [24]: Given a set P of n points and a parameter ε > 0, they construct a partition AVD(P) of R^d into O(n/ε^d) cells; each cell is the region lying between two nested hypercubes (the inner hypercube may be empty), and is associated with a point p_c ∈ P, so that for any point q in the cell, d(q, p_c) ≤ (1 + ε) d(q, P). AVD(P) is the partition induced by the leaves of a compressed quad tree [146], built on an initial hypercube H that contains P. AVD(P) and the quad tree can be constructed in O((n/ε^d) log n) time, and the cell of AVD(P) containing a query point can be located in O(log(n/ε)) time.



Let H be a hypercube containing all the sets P_i. We construct the above compressed quad tree T_i for each point set P_i, and let Ξ_i be the resulting subdivision. We then merge T_1, …, T_u into a single compressed quad tree T [146] and thus effectively overlay Ξ_1, …, Ξ_u. In particular, we start with T_1 and insert the cells of the trees T_i one by one, for i = 2, …, u. We refine T after each insertion so that we still maintain a compressed quad tree structure [24]. Since all the T_i's are built using the same initial hypercube H, the four hypercubes involved in any two cells during the process are either disjoint or one contains another. Hence each insertion creates at most O(1) new leaves. Let Ξ be the resulting overlay of Ξ_1, …, Ξ_u; Ξ is a refinement of each Ξ_i, and |Ξ| = O(Σ_i |Ξ_i|) = O(n/ε^d). Since the merged tree T is also a compressed quad tree, the cell of Ξ containing any query point can be computed in O(log(n/ε)) time.

For any cell Δ of Ξ, let p_i(Δ) denote the point associated with the cell of Ξ_i that contains Δ. Recall that, for any point q ∈ Δ, p_i(Δ) is an ε-nearest neighbor of q with respect to P_i, i.e.,

d(q, p_i(Δ)) ≤ (1 + ε) d(q, P_i).

If we store all the points p_i(Δ) for each cell Δ (i.e., in the leaf nodes of T), we need Ω(u · n/ε^d) space, which we cannot afford. So we instead store the p_i's at appropriate internal nodes of T. More specifically, for a fixed i, and for any cell Δ_i of Ξ_i, let v be the node in the merged tree associated with Δ_i, and let T_v be the subtree of T rooted at v. For any cell Δ associated with a node of T_v, p_i(Δ) = p_i(Δ_i). We therefore store p_i(Δ_i) at v, instead of storing it at all leaf nodes of T_v. Since Σ_i |Ξ_i| = O(n/ε^d), the total storage needed to store the p_i's is O(n/ε^d). To answer a query with a point q lying in a cell Δ of Ξ, we collect the points p_i(Δ), for 1 ≤ i ≤ u, while traversing the path from the root to the leaf of T associated with Δ. As each p_i is stored once along any path from the root to a leaf of T, we conclude the following.



Theorem 5.4.1 Given a family F = {P_1, …, P_u} of point sets in R^d, with a total of n points, and a parameter ε > 0, we can compute in O((n/ε^d) log n) time a subdivision Ξ of R^d of size O(n/ε^d) so that, for any point q ∈ R^d, one can ε-approximate d(q, P_i), for all 1 ≤ i ≤ u, in O(log(n/ε) + u) time.

5.4.2 Approximating the RMS Hausdorff distance

For each a ∈ A, let P_a = { b − a : b ∈ B }, so that d(a + t, B) = d(t, P_a). We construct the preceding decomposition, denoted Ξ_A, and the associated compressed quad tree T_A, for the family { P_a : a ∈ A }, with the given parameter ε; the family has mn points in total. Define

F_A(t) = Σ_{a ∈ A} d(t, P_a)²,

the (squared, unnormalized) contribution of A to h_RMS(A + t, B)², and let min_t F_A(t) denote its minimum.

For each cell Δ of Ξ_A, define

F_Δ(t) = Σ_{a ∈ A} ‖t − p_a(Δ)‖².

By construction, for any t ∈ Δ,

F_A(t) ≤ F_Δ(t) ≤ (1 + ε)² F_A(t),

implying that

min_Δ min_{t ∈ Δ} F_Δ(t) approximates min_t F_A(t) to within a factor of (1 + ε)². Hence, it suffices to store F_Δ at each cell Δ. Since F_Δ is a quadratic function in t, it can be stored using O(1) space (where the constant depends on d) and updated in O(1) time for each change in one of the points p_a(Δ).
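The constant-time update can be made concrete: expanding Σ ‖t − p‖² = |S| ‖t‖² − 2⟨t, Σp⟩ + Σ ‖p‖², it suffices to maintain the point count, the coordinate sums, and the sum of squared norms. A 2-d sketch (class and method names are ours):

```python
class SumSqDist:
    """Maintain F(t) = sum_i ||t - p_i||^2 over 2-d points as a quadratic,
    with O(1) update when a single representative point changes."""
    def __init__(self, points):
        self.n = len(points)
        self.sx = sum(p[0] for p in points)      # sum of x-coordinates
        self.sy = sum(p[1] for p in points)      # sum of y-coordinates
        self.ss = sum(p[0] ** 2 + p[1] ** 2 for p in points)  # sum of |p|^2
    def replace(self, old, new):
        # O(1) update: swap one point for another
        self.sx += new[0] - old[0]
        self.sy += new[1] - old[1]
        self.ss += new[0] ** 2 + new[1] ** 2 - old[0] ** 2 - old[1] ** 2
    def eval(self, t):
        # F(t) = n*|t|^2 - 2*<t, sum p> + sum |p|^2
        return (self.n * (t[0] ** 2 + t[1] ** 2)
                - 2.0 * (t[0] * self.sx + t[1] * self.sy) + self.ss)
```

This is exactly the bookkeeping that lets the traversal below move between adjacent cells in time proportional to the number of changed representatives.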

If we compute F_Δ for each cell independently, the total time is O(m · |Ξ_A|). We therefore proceed as follows. We perform an in-order traversal of the compressed


quad tree T_A. For the cell associated with the first leaf of T_A visited by the procedure, we compute F_Δ in O(m) time. For the subsequent leaves we compute F_Δ from the value previously computed. Suppose we are currently visiting a cell Δ of Ξ_A; let Δ' be the previous cell visited by the procedure, let w (resp. w') be the leaf associated with Δ (resp. Δ'), and let

I_Δ = { a ∈ A : p_a(Δ) ≠ p_a(Δ') }.

The values p_a(Δ), for all a ∈ I_Δ, are stored along the path from the nearest common ancestor of w and w' to the leaf w. Since

F_Δ(t) = F_Δ'(t) + Σ_{a ∈ I_Δ} ( ‖t − p_a(Δ)‖² − ‖t − p_a(Δ')‖² ),

we can compute F_Δ from F_Δ' in O(|I_Δ|) time. As Σ_Δ |I_Δ| = O(mn/ε^d), the total time required to compute all the functions F_Δ is O(mn/ε^d).

Next, we compute, in O((mn/ε^d) log(mn)) time, a subdivision Ξ_B on the family { Q_b : b ∈ B }, where Q_b = { b − a : a ∈ A }, and a quadratic function G_Δ' for each cell Δ' of Ξ_B that approximates the contribution of B in the same sense. We overlay Ξ_A and Ξ_B. The same argument as the one used to bound the complexity of Ξ shows that the resulting overlay has O(mn/ε^d) cells and that it can be computed in O((mn/ε^d) log(mn)) time. Finally, for each cell τ in the overlay, we compute

μ_τ = min_{t ∈ τ} ( F_Δ(t) + G_Δ'(t) ),

where Δ (resp. Δ') is the cell of Ξ_A (resp. Ξ_B) containing τ, and return

min_τ μ_τ.

Hence, we obtain the following.


Theorem 5.4.2 Given two sets A and B of m and n points in R^d and a parameter ε > 0, we can:

i. compute a vector t* ∈ R^d in O((mn/ε^d) log(mn)) time, so that h_RMS(A + t*, B) ≤ (1 + ε) min_t h_RMS(A + t, B);

ii. construct a data structure of size O(mn/ε^d), in time O((mn/ε^d) log(mn)), so that for any query vector t, we can ε-approximate h_RMS(A + t, B) in O(log(mn/ε)) time.

5.4.3 Approximating the summed Hausdorff distance

Modifying the above scheme, we approximate min_t h_S(A + t, B) as follows. Let Ξ_A, Ξ_B, P_a, and Q_b be as defined above. We define

F_A(t) = Σ_{a ∈ A} d(t, P_a)  and  F_B(t) = Σ_{b ∈ B} d(t, Q_b).

For each cell Δ of Ξ_A, let

F_Δ(t) = Σ_{a ∈ A} ‖t − p_a(Δ)‖,

and for each cell Δ' of Ξ_B, let

G_Δ'(t) = Σ_{b ∈ B} ‖t − q_b(Δ')‖.

As above, we overlay Ξ_A and Ξ_B. For each cell τ in the overlay, we wish to compute

μ_τ = min_{t ∈ τ} ( F_Δ(t) + G_Δ'(t) ).


However, since F_Δ and G_Δ' are not simple algebraic functions, we do not know how to compute, store, and update μ_τ efficiently. Nevertheless, we can compute an ε-approximation of μ_τ that is easier to handle. More precisely, for a given set S of points in R^d, define the 1-median function

F_S(x) = Σ_{p ∈ S} ‖x − p‖.

For any cell Δ, F_Δ(t) = F_{S_Δ}(t), where S_Δ = { p_a(Δ) : a ∈ A }. The same is true for G_Δ', with S_Δ' = { q_b(Δ') : b ∈ B }. In Section 5.4.4, we describe a dynamic data structure that, given a point set S of size n, maintains an ε-approximation of the function F_S as a function consisting of O((1/ε^d) log(n/ε)) pieces; the domain of each piece is a d-dimensional (or the complement of a d-dimensional) hypercube. A point can be inserted into or deleted from S in O((1/ε^d) log(n/ε) log n) time. Furthermore, given two point sets S and S' in R^d, this data structure can maintain an ε-approximation of F_S + F_{S'} within the same time bound.

Using this data structure, we can traverse all cells of the overlay of Ξ_A and Ξ_B, as in Section 5.4.2, and compute an ε-approximation of F_Δ and G_Δ' (thus roughly an ε-approximation of μ_τ) for each cell of the overlay. However, given two adjacent leaves during the traversal, associated with cells τ and τ' respectively, we now spend time proportional to the number of changed representatives, multiplied by the update cost of the dynamic structure, to compute the ε-approximation of F_Δ from that of the previous cell. Putting everything together, we conclude the following.

Theorem 5.4.3 Given two sets A and B of m and n points in R^d and a parameter ε > 0, we can compute:

i. a vector t* ∈ R^d, so that h_S(A + t*, B) ≤ (1 + ε) min_t h_S(A + t, B);

ii. a data structure so that, for any query vector t, we can ε-approximate h_S(A + t, B);

both within time and space bounds that exceed those of Theorem 5.4.2 by a factor of O((1/ε^d) polylog(mn/ε)).

5.4.4 Maintaining the 1-median function

Let S be a set of n points in R^d and let ε > 0 be a parameter. For x ∈ R^d, define the 1-median function F_S(x) = Σ_{p ∈ S} ‖x − p‖, as above. We describe a dynamic data structure that maintains a function G as the points are inserted into or deleted from S, so that

F_S(x) ≤ G(x) ≤ (1 + ε) F_S(x)  for all x ∈ R^d.
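For reference, the 1-median function, and the Lipschitz bound |F_S(x) − F_S(y)| ≤ |S| ‖x − y‖ from the triangle inequality that drives the construction below, can be written directly (a plain sketch; names are ours):

```python
import math

def median_fn(S, x):
    # 1-median function: F_S(x) = sum of distances from x to the points of S
    return sum(math.dist(x, p) for p in S)

def within_lipschitz_bound(S, x, y):
    # triangle inequality: |F_S(x) - F_S(y)| <= |S| * d(x, y)
    return abs(median_fn(S, x) - median_fn(S, y)) <= len(S) * math.dist(x, y) + 1e-9
```

The Lipschitz bound is what makes a piecewise-constant approximation on small cells accurate: within a cell of diameter proportional to ε times the function value divided by |S|, the function varies by only an ε fraction.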

We maintain a height-balanced binary tree T with n leaves, each storing a point of S. For a node v, let S_v ⊆ S be the set of points stored at the leaves of the subtree rooted at v; set n_v = |S_v|. For each node v of height i (leaves have height 0), set ε_i = iε/(2h), where h = O(log n) is the height of the tree T. We associate with a node v, at height i, a function G_v that is a (1 + ε_i)-approximation of F_{S_v}, i.e.,

F_{S_v}(x) ≤ G_v(x) ≤ (1 + ε_i) F_{S_v}(x).

The description complexity of G_v is O((1/ε^d) log(n/ε)). Finally, we maintain a function G that is an ε-approximation of F_S with description complexity O((1/ε^d) log(n/ε)).

(a) (b)

Figure 5.3: (a) An exponential grid with 3 layers. (b) The larger (resp. smaller) cube is H_{j+1} (resp. H_j), and the set of hollow dots are the representative points picked in the cells of the shell between them.


More specifically, if a leaf v stores the point p, then set G_v(x) = ‖x − p‖. For all internal nodes v, we compute G_v in a bottom-up manner as follows. Let w and z be the children of v. By induction, suppose we have already computed the functions G_w and G_z, each of descriptive complexity O((1/ε^d) log(n/ε)). Set G'_v = G_w + G_z. Since F_{S_v} = F_{S_w} + F_{S_z}, by the induction hypothesis, G'_v is a (1 + ε_{i−1})-approximation of F_{S_v}. However, the description complexity of G'_v is more than what we

desire. We therefore approximate G'_v by a simpler function as follows. For a point y ∈ R^d and r > 0, let H(y, r) be the hypercube of side length r centered at y. For simplicity, let y* denote a point minimizing G'_v, and set ρ = G'_v(y*)/n_v. Let H_j = H(y*, 2^j ρ) for j = 0, 1, …, M, where M = O(log(n_v/ε)). Partition each cubic shell H_{j+1} \ H_j into hypercubes by a d-dimensional grid in which each cell has side length proportional to ε 2^j ρ. The union of the shell grids is an exponential grid with O((1/ε^d) log(n_v/ε)) cells that covers the hypercube H_M. See Figure 5.3(a) for an illustration. In each cell τ, pick an arbitrary point y_τ and set

G_v(x) = G'_v(y_τ)   for x ∈ τ.   (5.1)

For points x outside H_M, we set

G_v(x) = G'_v(y*) + n_v ‖x − y*‖.   (5.2)

Hence, the function G_v is piecewise constant inside H_M and a fixed function of the distance to y* outside H_M. The description complexity of G_v is O((1/ε^d) log(n_v/ε)). Since G_w and G_z have the same structure, the point y* can be computed by evaluating the function G'_v at the vertices of the exponential grids drawn for G_w and G_z. At each such point, we can evaluate G'_v in O(log(n/ε)) time by simply locating the point in the two exponential grids. Hence, we can compute the point y* in O((1/ε^d) log²(n/ε)) time. We spend another O((1/ε^d) log(n/ε)) time to compute G_v. That G_v is indeed a (1 + ε_i)-approximation of F_{S_v} is proved in Lemmas 5.4.4 and 5.4.5. This finishes the induction step.


Using the same procedure, we compute an ε-approximation G of F_S from G_root, of descriptive complexity O((1/ε^d) log(n/ε)). By construction, for all x ∈ R^d,

F_S(x) ≤ G(x) ≤ (1 + ε) F_S(x).

Obviously, the size of the above data structure is O((n/ε^d) log(n/ε)). To insert or delete a point p, we follow the path from the leaf storing p to the root of T and recompute G_v at all nodes along this path, and then compute G from G_root. Hence, the update time is O((1/ε^d) log(n/ε) log n), and the only missing component now is to show that G_v, as constructed above at each node v, is indeed a (1 + ε_i)-approximation of F_{S_v}.
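Locating a query point in the exponential grid reduces to one logarithm and two floor operations. A 2-d sketch (the layer and cell-size constants are illustrative, not the exact choices of the construction above):

```python
import math

def exp_grid_cell(x, center, r0, eps):
    """Locate point x in an exponential grid around `center`: layer j
    covers L_inf distances in [2**(j-1)*r0, 2**j*r0) and is cut into
    cells of side eps * 2**j * r0."""
    d = max(abs(x[0] - center[0]), abs(x[1] - center[1]))  # L_inf distance
    if d < r0:
        j = 0
    else:
        j = int(math.floor(math.log2(d / r0))) + 1
    side = eps * (2 ** j) * r0
    return (j, math.floor((x[0] - center[0]) / side),
               math.floor((x[1] - center[1]) / side))
```

Nearby points fall into the same coarse cell, while far points land in coarser layers, which is why the total number of cells stays at O((1/ε^d) log(n/ε)).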

Lemma 5.4.4 Let v be a node of T at height i. For any cell τ of the exponential grid of G_v, and for any x ∈ τ,

| G_v(x) − G'_v(x) | ≤ (ε/(2h)) G'_v(x).

PROOF. The triangle inequality implies that, for any x, y ∈ R^d,

| G'_v(x) − G'_v(y) | ≤ (1 + ε_{i−1}) n_v ‖x − y‖.   (5.3)

Therefore, by construction of the exponential grid, for x ∈ τ with τ in the j-th shell,

‖x − y_τ‖ = O(ε 2^j ρ).   (5.4)

Equations (5.1) and (5.3) imply that

| G_v(x) − G'_v(x) | = | G'_v(y_τ) − G'_v(x) | ≤ (1 + ε_{i−1}) n_v ‖x − y_τ‖.

Substituting y* for y in (5.3), and using G'_v(y*) = n_v ρ, we obtain

G'_v(x) ≥ n_v ρ.   (5.5)

Hence, for any x ∈ τ, using (5.4), | G_v(x) − G'_v(x) | = O(ε 2^j n_v ρ), while, for τ in the j-th shell, G'_v(x) = Ω(2^j n_v ρ) by (5.3) and (5.5). Choosing the grid constant appropriately, | G_v(x) − G'_v(x) | ≤ (ε/(2h)) G'_v(x), as claimed.

Lemma 5.4.5 Let v be a node of T at height i. Then, for any x outside H_M,

F_{S_v}(x) ≤ G_v(x) ≤ (1 + ε_i) F_{S_v}(x).

PROOF. By (5.3),

| G'_v(x) − n_v ‖x − y*‖ | ≤ G'_v(y*) + ε_{i−1} n_v ‖x − y*‖.   (5.6)

The first inequality of the lemma is now immediate because, by (5.2),

G_v(x) = G'_v(y*) + n_v ‖x − y*‖ ≥ G'_v(x) ≥ F_{S_v}(x).

As for the second inequality, we first upper-bound G'_v(y*) in terms of n_v ‖x − y*‖. Since x lies outside H_M, we have ‖x − y*‖ ≥ 2^{M−1} ρ (see Figure 5.3(b)), while G'_v(y*) = n_v ρ. Hence, as long as M = Ω(log(n_v/ε)) with a sufficiently large constant,

G'_v(y*) ≤ (ε/4) n_v ‖x − y*‖,

thereby implying, with (5.6), that n_v ‖x − y*‖ ≤ (1 + ε/2) G'_v(x). Using (5.2) and (5.6), and choosing the constants so that the accumulated factors compose,

G_v(x) = G'_v(y*) + n_v ‖x − y*‖ ≤ (1 + ε/(2h)) G'_v(x) ≤ (1 + ε_i) F_{S_v}(x).

5.4.5 Randomized algorithm

We briefly describe below a simple randomized algorithm to approximate min_t h_S(A + t, B). The algorithm for approximating min_t h_RMS(A + t, B) is similar. Let t* be the optimal translation, i.e., h_S(A + t*, B) = min_t h_S(A + t, B).

Lemma 5.4.6 For a random point a ∈ A, d(a + t*, B) ≤ 2 h_S(A + t*, B)/m, with probability greater than 1/2. The same claim holds for h_RMS.

PROOF. Let a be a random point from A, where each point of A is chosen with equal probability. Let X be the random variable d(a + t*, B). Then

E[X] = (1/m) Σ_{a ∈ A} d(a + t*, B) ≤ h_S(A + t*, B)/m.

The lemma now follows immediately from Markov's inequality.


Choose a random point a ∈ A. Let t_b = b − a and r_b = h_S(A + t_b, B), for each b ∈ B. It then follows from Lemma 5.4.6, and the same argument as in Lemma 5.3.7, that min_{b ∈ B} r_b is a constant-factor approximation of min_t h_S(A + t, B), with probability greater than 1/2. Computing each r_b exactly is expensive in higher dimensions; therefore we compute an approximate value of r_b, for all b ∈ B, by performing approximate nearest-neighbor queries [24]. We can improve this constant-factor approximation algorithm to compute a (1 + ε)-approximation of min_t h_S(A + t, B) using the same grid technique as in Section 5.3. We thus obtain the following result.

Theorem 5.4.7 Given two sets A and B of m and n points, respectively, in R^d and a parameter ε > 0, we can compute, in O((mn/ε^d) log(mn)) time, two translation vectors t1 and t2, such that with probability greater than 1/2,

h_RMS(A + t1, B) ≤ (1 + ε) min_t h_RMS(A + t, B)  and  h_S(A + t2, B) ≤ (1 + ε) min_t h_S(A + t, B).
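The sampling step behind this result can be sketched as follows (an illustrative 2-d implementation; names, the trial count, and the brute-force evaluation of the summed distance are ours):

```python
import math
import random

def summed_dist(A, B, t):
    # summed one-sided distance: sum over a in A of d(a + t, B)
    return sum(min(math.dist((a[0] + t[0], a[1] + t[1]), b) for b in B)
               for a in A)

def randomized_translation(A, B, trials=3, rng=random):
    """Sample a in A, try every candidate translation t = b - a (b in B),
    and keep the best; by Lemma 5.4.6 one such candidate is near-optimal
    with constant probability."""
    best_t, best = (0.0, 0.0), summed_dist(A, B, (0.0, 0.0))
    for _ in range(trials):
        a = rng.choice(A)
        for b in B:
            t = (b[0] - a[0], b[1] - a[1])
            c = summed_dist(A, B, t)
            if c < best:
                best, best_t = c, t
    return best_t, best
```

The full algorithm replaces the inner brute-force evaluation with approximate nearest-neighbor queries and refines the winner with the grid trick.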

5.5 Notes and Discussion

We provide in this chapter an initial study of various problems related to minimizing the Hausdorff distance between sets of points, disks, and balls. One natural question following our study is to compute, exactly or approximately, the smallest Hausdorff distance over all possible rigid motions in R² and R³. Given two sets of points A and B of size m and n, respectively, let Δ be the maximum of the diameters of the point sets A and B. We believe that there is a randomized algorithm, with expected running time depending on Δ, that approximates the optimal summed-Hausdorff distance (or RMS-Hausdorff distance) under rigid motions in the plane. The algorithm combines our randomized approach from Section 5.4.5, a framework to convert the original problem to a pattern-matching problem [110], and a result by Amir et al. on string matching [21]. However, this approach does not extend to families of balls. We leave the problem of computing the smallest Hausdorff distance between sets of points or balls under rigid motions as an open question for further research. Another question is to approximate efficiently the best Hausdorff distance under certain transformations when partial matching is allowed. The traditional approaches using reference points break down with partial matching.

Chapter 6

Coarse Docking via Features

6.1 Introduction

Proteins perform many of their functions by interacting with other molecules, and these interactions are made possible by molecules binding to form either transient or static com- plexes. We focus in this chapter on the problem of predicting the binding (or docking) configurations for two large protein molecules, which we refer to as the protein-protein docking problem (see Figure 6.1). This problem is important because the docked complex

(a) (b) (c)

Figure 6.1: Given the protein structures in (a) and (b), the docking problem predicts the docking configuration in (c).

has functional consequences (e.g., signal transduction), and it is usually hard to crystallize complexes. Many of the more than 25,000 proteins in the Protein Data Bank (PDB) [31] are able to form protein-protein complexes; however, there are only a few hundred non-obligate crystallized complexes¹; see the discussion in Chapter 1 for more motivation.

¹ Obligate complexes are permanent multimers whose components do not exist independently.

When two proteins bind, their shapes (molecular surfaces) roughly complement each

other at the interface [113, 116]. This is the main justification for a geometric approach to predicting docking configurations, which is also the basis for our approach. The docking problem that we study is as follows: Given two proteins, P and Q, assume that the native docked configurations are P* and Q*. The goal is to develop an efficient algorithm to compute configurations for Q that are close to Q*, or in other words, to search for configurations of Q that best complement P geometrically. To measure the goodness of the complementary fit between P and some configuration Q' of Q, a scoring function score(P, Q') is needed. Note that in the above formulation, we assume that both proteins are rigid bodies during the docking process, which is usually referred to as the bound docking problem. The unbound docking problem, in which each protein may change its conformation, is not considered in this chapter, but will be the focus of future work. Of course, in nature, more than two proteins might bind and form one complex. The work in this chapter serves as a starting point for attacking these more general problems.

Prior work. There are mainly two categories of interactions that have been extensively studied: protein-ligand interactions, which happen between a protein and a small molecule; and protein-protein interactions. Much of the earlier work in this area has focused on protein-ligand interactions, mainly motivated by drug design. With the number of available protein structures increasing rapidly, there is considerable research on understanding interactions between large proteins. Despite some similarities, the known approaches for predicting these two types of interactions differ in several aspects. For protein-ligand docking, both chemical information and the flexibility of the ligand (and sometimes that of the receptor as well) are considered at a quite detailed level, while for protein-protein docking, it seems that geometry plays a more important role, and because of their large size, these proteins may not be as flexible as small molecules during the docking process. Even if we ignore the flexibility, predicting protein-protein docking configurations is computationally more demanding than protein-ligand docking because of

the high complexity of protein molecules. We do not review the known results for protein-ligand docking in this section. Interested readers are referred to [85, 102, 164].

Current research on protein-protein docking focuses on either bound docking or unbound docking with fairly small conformational changes. Most approaches for unbound docking use bound docking as a subroutine. They usually consist of two stages [153]:

(i) a bound-docking stage, which produces a set of potential docking configurations by considering only rigid transformations; and

(ii) a refinement stage, which allows a certain amount of flexibility.

Two essential components are involved in both stages:

(i) a scoring function that can discriminate near-native docking configurations from incorrect ones; and

(ii) a search algorithm that finds (approximately) the best configuration under the scoring function used.

In general, approaches to the bound docking stage rely mainly on geometric complementarity. Each input protein can be represented as a surface model [86, 127], as a union of balls (with each ball representing an atom) [32, 61], or as a set of voxels [53, 143].

In the last case, space is divided into a set of regular grid cells (voxels), each marked as inside, outside, or on the surface of the protein. To search for the best geometric fit, the most straightforward approach, called exhaustive search, discretizes the transformational space into a six-dimensional grid and computes the score for each grid point [32].

Although this approach produces good near-native docking configurations with few false positives (a false positive is a docking configuration that scores high but is far from the native configuration), it is too expensive. Several approaches have been proposed to reduce the time complexity of this search procedure.

One popular technique is the fast Fourier transform (FFT) method, first used in molecular docking in [119]. The FFT method represents molecules as voxels. By designing the scoring function appropriately and discretizing the translational space into an N x N x N grid, the scores of all relative translations for a fixed rotation can be evaluated simultaneously in O(N^3 log N) time. It is still necessary to search the three-dimensional space of rotations, which is typically done exhaustively. Several properties of FFT-type approaches make them rather attractive: besides surface complementarity, chemical properties such as hydrophobicity can be correlated as well, and low-resolution FFT searches are fast to perform, which is useful for producing coarse fits. Available docking programs exploiting the FFT method include FTDock [90], 3D-Dock [137], GRAMM [163], and ZDock [53]. Nevertheless, for reasonably high-resolution docking, the time complexity of FFT-type approaches is still fairly high. The approach proposed by Ritchie and Kemp [143, 144] addresses this problem by using spherical harmonics to represent both the molecular surface and the electric field. Complementarity between surfaces in different orientations is calculated by Fourier correlations between the expansion coefficients of basis functions. A table of overlap integrals independent of protein identities is pre-calculated to expedite the docking algorithm.
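The translational part of such a search can be sketched in a few lines. This is an illustrative reconstruction, not code from any of the cited programs: the grid contents are left abstract, and `numpy` stands in for whatever FFT library a real docking code would use.

```python
import numpy as np

def fft_translation_scores(receptor, ligand):
    """Score every cyclic translation of `ligand` against `receptor` (two
    equally sized 3-D voxel grids) at once via the correlation theorem:
    O(N^3 log N) instead of O(N^6) for an exhaustive translation scan."""
    corr = np.fft.ifftn(np.conj(np.fft.fftn(receptor)) * np.fft.fftn(ligand))
    return np.real(corr)  # inputs are real, so the correlation is too

# toy grids: a single occupied voxel each, offset by (1, 0, 0)
N = 4
receptor = np.zeros((N, N, N)); receptor[0, 0, 0] = 1.0
ligand = np.zeros((N, N, N)); ligand[1, 0, 0] = 1.0
scores = fft_translation_scores(receptor, ligand)
best = np.unravel_index(np.argmax(scores), scores.shape)  # -> (1, 0, 0)
```

The score peaks at the translation that moves the ligand's occupied voxel onto the receptor's, which is exactly the relative shift between the two deltas.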

Another widely used type of approach reduces the number of transformations inspected by aligning so-called features on molecular surfaces. The idea goes back to Connolly [67], who proposed identifying point features as the minima and maxima of the so-called Connolly function. One representative approach is that of Fischer et al. [86], who use a geometric hashing technique to align points computed by a variant of the Connolly function. Variants and improvements of their approach include [93, 127]. Such algorithms are usually faster, but require post-processing to remove steric clashes and to refine the geometric fit.

Other approaches to bound docking include genetic algorithms to conduct the search [39, 91] and fast bit-manipulation routines to expedite the evaluation of scores [140].

Although geometric fit is good at recombining separate components of a known complex, it is not sufficient to dock unbound proteins. It is commonly believed that the native complex lies at the global minimum of ΔG, the difference between the free energy of the complex and that of its separate components. Hence the refinement stage is usually modeled as an energy-minimization problem. The scoring functions here focus more on the thermodynamic aspects of the interaction. Flexibility of side chains, or even of backbones, is usually considered during the search process. One common way to incorporate side-chain flexibility is to consider only known populated rotamers of the side chains. Unfortunately, this still produces an exponential number of combinations. Techniques such as iterative mean-field minimization [111], dead-end elimination [18], and genetic algorithms [115] have been used to reduce the time complexity. Recently, Vajda et al. proposed a hierarchical, progressive refinement protocol [41, 42]. It follows the intuition that a simplified energy landscape suffices when the two proteins are far apart, and that a more detailed energy landscape should be used as they get closer. Their algorithm reliably converges to a near-native configuration (with small rmsd) starting from configurations with considerably larger initial rmsd.

Little success has been achieved in including backbone conformational changes. A significant portion of this larger motion appears to be the class of hinge-link inter-domain movements [124]. Some initial investigations have been made into docking proteins in the presence of such motions [147].

New work. Since each step in the refinement stage is quite costly, it is crucial to generate a small and reliable set of potential configurations during the bound-docking stage. On the other hand, even though only rigid motions are considered in this stage, the time complexity of current approaches is still not satisfactory, preventing us from experimenting with larger data sets. In this chapter, we present an efficient algorithm for the bound-docking stage. We use geometric complementarity to guide the search for a small set of rigid motions that fit the two proteins loosely into each other. Such a set of potential configurations can then be refined independently, using both chemical information and flexibility [42, 61], to obtain more accurate docking predictions. We remark that for unbound docking, it is especially important to consider coarse (not tight) fits between the input components in order to provide enough tolerance for flexibility in the later refinement procedure.

We describe our algorithm in Section 6.2. It relies on the ability to describe meaningful "features" on molecular surfaces, such as protrusions and cavities, using a succinct set of pairs of points computed from the elevation function defined in Chapter 4. We then align such pairs and evaluate the resulting alignments with a simple and rapid scoring function. Compared to similar approaches [86, 127] that align so-called feature points, our algorithm inspects orders of magnitude fewer transformations by aligning only meaningful features, and thus produces a reliable set of potential docking positions. In Section 6.3, we demonstrate the performance of our approach by testing it on a set of 25 bound protein complexes obtained from the Protein Data Bank [31]. We also show that by combining our algorithm with the local improvement procedure described in [61], we can efficiently find accurate near-native docking positions for all but one case, without any false positives. Additionally, we have tested our algorithm on the unbound protein docking benchmark [54], demonstrating that it can generate useful protein poses that can serve as input for refinement methods that take protein flexibility into account.

6.2 Algorithm

Assume that we are given two proteins A = {a_1, ..., a_m} and B = {b_1, ..., b_n}, where each a_i (resp. b_j) represents an atom centered at a point y_i (resp. z_j) with van der Waals radius r_i (resp. s_j). Let score(A, B) be a scoring function that we will describe shortly. Ideally, we would like to find a transformation T for B that maximizes score(A, T(B)). In this section, we describe an algorithm that finds a set of potentially good transformations for B. Below, we first explain the scoring function we use. We then present an algorithm that produces a set of transformations for B by aligning pairs of points computed from the elevation function.
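As a concrete (hypothetical, our own) data layout for the discussion above, a protein can be held as an array of atom centers plus an array of radii, and a candidate rigid transformation T as a rotation matrix and a translation vector:

```python
import numpy as np

def apply_rigid(R, t, centers):
    """Apply the rigid motion x -> R x + t to every atom center (n x 3)."""
    return centers @ R.T + t

# a 90-degree rotation about the z-axis followed by a unit shift in x
R = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
t = np.array([1.0, 0.0, 0.0])
moved = apply_rigid(R, t, np.array([[1.0, 0.0, 0.0]]))  # -> [[1, 1, 0]]
```

Radii are unchanged by a rigid motion, so only the centers need transforming when evaluating score(A, T(B)).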

At a high level, we first construct a set F_A (resp. F_B) of features based on the elevation function defined in Chapter 4, each feature consisting of two points along with a normal direction, to characterize the molecular surface of protein A (resp. B). These two sets of features, F_A and F_B, are the inputs to our coarse alignment algorithm, which outputs a list of possible configurations sorted by their scores. Below, we first describe the scoring function that we use. Then, after explaining how to construct features from the elevation function, we present the alignment algorithm itself.

6.2.1 Scoring function

A good scoring function should produce a higher score for (near-)native configurations than for non-native ones. We design our scoring function to describe the geometric complementarity between A and B. In particular, let

    g(a_i, b_j) =  1    if 0 <= d_ij <= epsilon,
                  -P    if d_ij < 0,
                   0    otherwise,

where d_ij = |y_i - z_j| - r_i - s_j is the distance between the surfaces of atoms a_i and b_j, P > 0 is a penalty for a clash, and epsilon is a prefixed constant that we refer to as the contact-threshold. The scoring function between A and B involves two components: the score

    score(A, B) = sum over all i, j of g(a_i, b_j),

and the collision number

    coll(A, B) = |{ (i, j) : d_ij < 0 }|.

A configuration is valid if coll(A, B) <= K_c, where the collision-threshold K_c defines the maximum number of clashes that can be tolerated.

This definition is rather similar to the ones used in [32] and [53]. The main difference is that in addition to counting collisions, we use them as a reason to lower the score. The reason is that, because our algorithm generates coarse alignments between the input proteins, we need a large tolerance for collisions (a considerably larger collision-threshold K_c in our experiments than in [32]). However, a high collision number also increases the number of pairs of atoms in contact, resulting in high scores. By penalizing the score when a clash happens, we counterbalance this consequential increase.
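The two quantities can be evaluated directly in O(mn) time. The sketch below follows the definitions above; the contact-threshold and penalty values are illustrative choices of ours, not the thesis' experimental settings.

```python
import numpy as np

def score_and_coll(ca, ra, cb, rb, eps=1.5, penalty=2.0):
    """Direct O(mn) evaluation of score(A, B) and coll(A, B).
    `eps` (contact-threshold) and `penalty` are illustrative values."""
    gap = (np.linalg.norm(ca[:, None, :] - cb[None, :, :], axis=2)
           - ra[:, None] - rb[None, :])        # surface-to-surface distances
    contacts = int(np.sum((gap >= 0) & (gap <= eps)))
    coll = int(np.sum(gap < 0))                # interpenetrating pairs
    return contacts - penalty * coll, coll

# one receptor atom vs. two ligand atoms: one contact, one clash
ca, ra = np.array([[0.0, 0.0, 0.0]]), np.array([1.0])
cb, rb = np.array([[2.5, 0.0, 0.0], [1.0, 0.0, 0.0]]), np.array([1.0, 1.0])
s, c = score_and_coll(ca, ra, cb, rb)  # -> (-1.0, 1)
```

The colliding pair is counted once in `coll` and simultaneously subtracts `penalty` from the score, matching the discussion above.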

6.2.2 Computing features.

Given protein A, we compute its surface S_A and the elevation function on S_A (as defined in Chapter 4). For docking purposes, we are interested in points with locally maximal elevation, as they represent the locally most significant features. Recall that there are four types of maxima, illustrated in Figure 6.2 (a), each describing a different type of feature on molecular surfaces.


Figure 6.2: (a) From left to right: a one-, two-, three-, and four-legged local maximum of the elevation function. (b) A three-legged maximum generates six features as indicated by shaded bars.

We collect F_A, the set of features for protein A, as follows: for each maximum p of the elevation function, add all possible pairs of points between the head(s) and feet of p into F_A. See Figure 6.2 (b) for an example. With each feature generated from the maximum p, we associate the normal direction v of p (i.e., all head(s) and feet of p are critical points of the height function in direction v), and we assume that v always points towards the exterior of the surface at p. Thus a feature is a pair of points along with the unit vector v. Given a feature, we refer to the distance between its two points as its length, and to the elevation of p as its elevation. The length and elevation of a feature indicate its importance, thereby providing a way to distinguish less from more meaningful features.

We remark that previous alignment-based approaches describe features by points residing at a protrusion or a cavity. This is one of the main differences between our approach and those methods. Points provide much less information for identifying specific protrusions or cavities than our representation of features does. Therefore our alignment algorithm (described below) is able to inspect orders of magnitude fewer configurations than those algorithms. We elaborate on this difference in Section 6.4 at the end of this chapter.
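In code, the feature set of one maximum can be enumerated as all point pairs among its head(s) and feet, which matches the count in Figure 6.2 (b): six features for a three-legged maximum, whose head and feet form four points. The tuple layout is our own, for illustration.

```python
from itertools import combinations

def features_from_maximum(points, normal, elevation):
    """All unordered point pairs among the head(s) and feet of one maximum,
    each tagged with the maximum's outward normal and elevation value."""
    return [((p, q), normal, elevation) for p, q in combinations(points, 2)]

# four critical points (head + three feet) -> 6 features
pts = [(0, 0, 1), (1, 0, 0), (-1, 0, 0), (0, 1, 0)]
feats = features_from_maximum(pts, (0, 0, 1), 0.8)  # len(feats) == 6
```

Filtering by length and elevation, as described above, then amounts to discarding tuples whose distance or elevation value falls below the chosen thresholds.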

6.2.3 Coarse alignment algorithm

Given two proteins A and B, along with their feature sets F_A and F_B, respectively, our algorithm, sketched in Figure 6.3, computes a set of potential coarse alignments. The rationale behind it is that good fits between the input proteins should have some features aligned, such as a protrusion from one surface fitting inside a cavity of the other (see Figure 6.4 (a)). Hence, if we align all features from B with those from A, we will cover potentially good fits. Here features are defined as in Figure 6.2 (b). As we wish to align only important features, we preprocess F_A and F_B by removing features with small elevation value or short length. We now explain steps (1), (2) and (3) from the above

ALGORITHM CoarseAlign(F_A, F_B)
    Preprocess F_A and F_B;
    for each f in F_A and g in F_B do
(1)     if Feasible(f, g) then
(2)         T = Align(f, g);
(3)         compute score(A, T(B)) and coll(A, T(B));
            if coll(A, T(B)) <= K_c then
                add T to C;        // T is valid
    Sort C by score;
end

Figure 6.3: The coarse alignment algorithm.
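The control flow of Figure 6.3 can be paraphrased executably as follows; the geometric subroutines are passed in as callables, and the demo uses toy stand-ins (numbers instead of features and motions) purely to exercise the loop.

```python
def coarse_align(F_A, F_B, feasible, align, score_and_coll, K_c):
    valid = []
    for f in F_A:
        for g in F_B:
            if not feasible(f, g):        # step (1): prune dissimilar features
                continue
            for T in align(f, g):         # step (2): Align yields two motions
                s, c = score_and_coll(T)  # step (3): evaluate the configuration
                if c <= K_c:
                    valid.append((s, T))  # T is valid
    valid.sort(key=lambda sc: -sc[0])     # best score first
    return valid

# demo with trivial stand-ins: features are numbers, "motions" are numbers
out = coarse_align([1, 2], [1, 3],
                   feasible=lambda f, g: abs(f - g) < 1,
                   align=lambda f, g: [f + g, f * g],
                   score_and_coll=lambda T: (T, 0),
                   K_c=5)                 # -> [(2, 2), (1, 1)]
```

Only the pair (1, 1) survives the feasibility filter in the demo; its two candidate "motions" are kept and sorted by score.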

Figure 6.4: (a) If two surfaces complement each other well, then features at the interface should align well too. (b) Points p and q do not match each other: they should have opposite criticality (a maximum paired with a minimum w.r.t. their own normals).

algorithm.

Function Align: The function Align(f, g) computes a rigid motion that aligns the pair of points f = (p_1, p_2) from F_A with the pair g = (q_1, q_2) from F_B. In particular, assume that u and v are the normals associated with f and g, respectively. To obtain the transformation, we first translate the midpoint of segment q_1 q_2 to the midpoint of segment p_1 p_2. Next, we rotate segment q_1 q_2 so that (i) segment q_1 q_2 lies on the line passing through p_1 and p_2; and (ii) v coincides with u. See Figure 6.5 for an illustration.

Figure 6.5: First move the two midpoints (two empty dots) together. Then rotate so that the two segments coincide. Last, rotate about the common segment so that the two normals coincide.

Note that there is an ambiguity in (i), as vector q_1 q_2 can point either in the same direction as vector p_1 p_2 or in the opposite direction. As such, the function Align in fact returns two distinct transformations, although for simplicity we pretend it returns only one.
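One of the two rigid motions can be computed as below. This is our own reconstruction of the two-rotation procedure (segment onto line, then a twist about the line to reconcile the normals), using only numpy; the reversed-segment solution is omitted for brevity.

```python
import numpy as np

def _rot_a_to_b(a, b):
    """Rotation matrix taking unit vector a onto unit vector b (Rodrigues)."""
    v, c = np.cross(a, b), float(a @ b)
    if np.linalg.norm(v) < 1e-12:
        if c > 0:
            return np.eye(3)
        axis = np.cross(a, [1.0, 0.0, 0.0])   # a, b antiparallel: rotate pi
        if np.linalg.norm(axis) < 1e-6:
            axis = np.cross(a, [0.0, 1.0, 0.0])
        axis /= np.linalg.norm(axis)
        return 2.0 * np.outer(axis, axis) - np.eye(3)
    vx = np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]])
    return np.eye(3) + vx + vx @ vx * ((1 - c) / (v @ v))

def align(f, g):
    """Rigid motion x -> R x + t aligning feature g = (q1, q2, v) with
    feature f = (p1, p2, u)."""
    (p1, p2, u), (q1, q2, v) = f, g
    axis = (p2 - p1) / np.linalg.norm(p2 - p1)
    R1 = _rot_a_to_b((q2 - q1) / np.linalg.norm(q2 - q1), axis)
    # twist about the common segment so the normals agree as far as possible
    vp = R1 @ v - ((R1 @ v) @ axis) * axis    # both normals projected onto
    up = u - (u @ axis) * axis                # the plane perpendicular to axis
    R2 = np.eye(3)
    if np.linalg.norm(vp) > 1e-9 and np.linalg.norm(up) > 1e-9:
        R2 = _rot_a_to_b(vp / np.linalg.norm(vp), up / np.linalg.norm(up))
    R = R2 @ R1
    t = 0.5 * (p1 + p2) - R @ (0.5 * (q1 + q2))   # midpoint onto midpoint
    return R, t
```

Because the twist R2 rotates about the (already matched) segment direction, it leaves condition (i) intact while enforcing (ii) as closely as the fixed angles between normals and segments allow.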

Function Feasible: Obviously, if f and g are fairly "dissimilar", then they will not align well with each other in any good configuration. By "dissimilar", we mean that

(F1) we are trying to align a protrusion (resp. cavity) from one surface with a protrusion (resp. cavity) from the other (see Figure 6.4 (b)), or

(F2) the length of f is too different from that of g.

Given features f = (p_1, p_2) and g = (q_1, q_2), associated with normals u and v respectively, the function Feasible(f, g) returns false if either (F1) or (F2) happens. In particular, (F1) happens if p_1 (or p_2) is a local minimum (or local maximum) with respect to u and q_1 (or q_2) has the same criticality with respect to v. Let L be the length of the shorter feature and D the difference between the lengths of f and g. If D/L exceeds a threshold on the ratio of the two lengths, we consider that (F2) happens.
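A sketch of this test, with an encoding of our own: each feature carries a flag `crit` in {+1, -1} recording whether its paired endpoint is protrusion-like or cavity-like with respect to its normal, and `tau` is the (F2) ratio threshold (the value here is illustrative).

```python
import numpy as np

def feasible(f, g, tau=0.5):
    """False if (F1) the two features have the same criticality, or
    (F2) their lengths differ by more than a tau fraction of the shorter."""
    (p1, p2, crit_f), (q1, q2, crit_g) = f, g
    if crit_f == crit_g:                       # (F1): e.g. bump vs. bump
        return False
    lf, lg = np.linalg.norm(p2 - p1), np.linalg.norm(q2 - q1)
    return abs(lf - lg) / min(lf, lg) <= tau   # (F2): comparable lengths

o = np.zeros(3)
bump = (o, np.array([2.0, 0.0, 0.0]), +1)
pocket = (o, np.array([0.0, 2.2, 0.0]), -1)
deep_pocket = (o, np.array([0.0, 5.0, 0.0]), -1)
# feasible(bump, pocket) -> True; same criticality or mismatched lengths fail
```

Only the bump/pocket pair passes: a bump against a bump violates (F1), and a bump against a much longer pocket violates (F2).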

Computing score and coll. By definition, we can compute both score(A, T(B)) and coll(A, T(B)) in O(mn) time, where m and n are the numbers of atoms in A and B, respectively. Here we present a simple algorithm that computes them in O(n log m) time by building a hierarchical data structure for A as follows.

Let Y be the set of centers of atoms of A; the diameter of Y is at most O(m^{1/3} r), where r is the smallest radius of an atom from A (between one and two Angstroms). We build a standard oct-tree T_A on Y. Points of Y are stored at the leaves of T_A. At each internal node mu, let Y_mu denote the set of centers in the subtree of mu and A_mu the set of balls from A with centers in Y_mu. We associate with node mu the smallest enclosing ball of A_mu, denoted B_mu. The depth of T_A is O(log m); therefore the tree can be constructed in O(m log m) time.

We now describe the computation of coll(A, T(B)); that of score(A, T(B)) is similar. In particular, for each atom b_j of T(B), 1 <= j <= n, we compute coll(A, b_j) by a top-down traversal of T_A. At an internal node mu, if b_j and B_mu intersect, we recurse down the subtree rooted at mu; otherwise, we return. At a leaf node, if b_j intersects the atom whose center is stored at the leaf, we increment a counter that records the number of collisions seen so far. The resulting count after the traversal is coll(A, b_j).

It is easy to verify that the above traversal for a specific b_j takes O(log m) time. This is because at any level of T_A, b_j intersects at most a constant number of nodes, and the depth of the tree is O(log m). Hence it takes O(n log m) time to compute coll(A, T(B)) (and similarly score(A, T(B))) for a particular configuration T(B).

In practice, we build a similar tree for B and traverse both trees simultaneously to compute score(A, T(B)) (as well as coll(A, T(B))) directly, instead of computing each coll(A, b_j) one by one. Although this variant proves more efficient in practice, it has the same asymptotic time complexity as the one described above. The overall time complexity of the coarse alignment algorithm is O(|F_A| |F_B| n log m).

6.3 Experiments

In this section, we first provide a detailed experimental study of one protein complex. Next, we test our docking algorithm on a diverse data set of 25 bound protein complexes chosen from the Protein Data Bank [31], which contains both easy docking problems, where large protrusions fill deep pockets, and more difficult problems, where the interface is relatively flat. Finally, we provide results for the unbound protein benchmark [54].

A case study. We take the protein complex barnase/barstar (pdb-id 1brs, chains A and D) as an example. The two chains have 864 and 693 atoms, respectively. We use the msms program (also available as part of the VMD software [105]) to generate a triangulation of the molecular surface for each input chain; the resulting triangulations have several thousand vertices each. The left half of Table 6.1 shows the number of features generated from the four different types of maxima of the elevation function; a k-legged feature is derived from a k-legged maximum.

           All features                  Large features
      2-legged  3-legged  4-legged   2-legged  3-legged  4-legged
  A     1044      696       156        112       205        50
  D      828      510       154         68       160        49

Table 6.1: k-legged features for chains A and D of barnase/barstar: all features on the left, only the large ones on the right.

The right half of the table shows the number of large features, whose length is at least a fixed threshold and whose elevation is at least 0.2; these features are the inputs to our coarse alignment algorithm. We note that a significantly higher percentage of 3-legged features are large, compared to 4-legged and especially 2-legged features.

Given the two sets of large features, our coarse alignment algorithm generates a family C of valid configurations sorted by score. The running time is around 3 minutes on a single-processor PIII 1 GHz machine. Each configuration in C corresponds to a transformation T for chain D. For each T computed from C, we compute the rmsd (root mean squared deviation) between the centers of all atoms of chain D in its native position and those of T(D), and use it to measure how good our predictions are: the smaller the rmsd of a configuration, the closer it is to the native docking position. The rmsd of the top-ranked configuration in C is small, and 6 out of the top 10 configurations have low rmsd. Next, consider only the subset C' of the top 100 ranked configurations from C. We refine each configuration in C' by applying the local improvement heuristic [61] and re-rank the results based on the new scores.

        After LocalImprove      Before LocalImprove
  rank   score    rmsd      rank    coll    rmsd
    1     359     0.54        12     24     3.23
    2     338     0.80         5     48     2.42
    3     328     0.72         1     23     1.59
    4     314     0.80         4     49     3.57
    5     311     0.91         2     39     1.70
    6     310     0.78        59     12     2.84
    7     307     1.50         3     29     2.32
    8     281     1.47        11     18     3.07
    9     251     2.09        14     16     3.00
   10     213    39.96        76     29    39.39

Table 6.2: Performance of the algorithm (including refinement) on the protein complex barnase/barstar. Only configurations with a bounded number of collisions are kept after the local improvement heuristic. The right side shows the corresponding rank and number of collisions before applying the local improvement heuristic.

The results, shown in Table 6.2, demonstrate that our algorithm generates multiple useful poses for chains A and D that can be refined to yield a near-native final configuration.
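The ranking measure used throughout the experiments is plain rmsd over matched atom centers; a minimal helper makes the definition explicit.

```python
import numpy as np

def rmsd(X, Y):
    """Root mean squared deviation between matched point sets X, Y (n x 3)."""
    X, Y = np.asarray(X, float), np.asarray(Y, float)
    return float(np.sqrt(np.mean(np.sum((X - Y) ** 2, axis=1))))

native = np.zeros((2, 3))
predicted = native + np.array([3.0, 4.0, 0.0])  # every atom off by 5 units
# rmsd(native, predicted) -> 5.0
```

Because every atom is displaced by the same length-5 vector in the example, the per-atom squared deviations are all 25 and the rmsd is exactly 5.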

More bound protein complexes. We extend our experiments to a diverse set of 25 bound protein complexes obtained from the PDB. After computing the elevation function for all protein surfaces, we compute the set of features for each chain and remove the features that are not large. Usually, the number of remaining features for each chain is roughly the same as the number of atoms of that chain. Next, we apply our coarse alignment algorithm to these feature sets. In Table 6.3, we show a low-rmsd configuration for each protein complex, as well as its rank (by score) in the set of configurations returned by the algorithm. Note that in all but one case, we have at least one configuration with low rmsd among the top 100 configurations. The last column shows the running time of the algorithm; it does not include the time to compute the molecular surfaces and the elevation function.

A configuration is of type k-l if it is produced by aligning a k-legged feature from one chain with an l-legged feature from the other. Our experimental results indicate that 4-legged features seldom give rise to good configurations (i.e., those with low rmsd). In fact, no 4-legged feature is involved in producing the best configuration in 24 of the 25 cases (the only exception is 2PTC), and the percentage of good configurations involving 4-legged features is usually small. On the other hand, the computation of 4-legged features is the most expensive (refer to Chapter 4).

  pdbid        rank  coll  rmsd  time
  1A22 (A, B)    2    23   2.75   20
  1BI8 (A, B)   12    43   2.48   26
  1BRS (A, D)    1    11   1.52    3
  1BUH (A, B)    5    14   1.85    2
  1BXI (B, A)    3    34   2.54    8
  1CHO (E, I)    1    14   2.71    3
  1CSE (E, I)    2    22   2.21    9
  1DFJ (I, E)   78    11   3.09   27
  1F47 (B, A)   15     1   1.49    1
  1FC2 (D, C)    5    49   4.13    6
  1FIN (A, B)   11    44   3.70   41
  1FS1 (B, A)    1    29   1.62    5
  1JAT (A, B)  522    20   1.20    9
  1JLT (A, B)    8    23   3.64   10
  1MCT (A, I)    1    27   3.49    3
  1MEE (A, I)    1    23   1.33    9
  1STF (E, I)    1    43   1.18    8
  1TEC (E, I)    9    54   3.07    7
  1TGS (Z, I)    1    46   2.61    6
  1TX4 (A, B)    2     4   3.35   14
  2PTC (E, I)    1    18   4.55    6
  3HLA (A, B)    1    19   1.87   16
  3SGB (E, I)    1    38   3.21    5
  3YGS (C, P)    6     7   1.07    6
  4SGB (E, I)   10    33   2.33    4

Table 6.3: Performance of the coarse alignment algorithm on 25 protein complexes. Column 2 shows the rank of the first configuration with low rmsd. The last column shows the running time in minutes.

For each protein complex, we apply the local improvement heuristic [61] to the top configurations (using a larger set for 1JAT, whose best coarse configuration ranks only 522nd), and then re-rank them based on the new scores. The results are shown in Table 6.4, where we consider only those configurations with a bounded number of collisions. In all but one case, the top-ranked (or, for 1FC2, the second-ranked) configuration after local improvement has a small rmsd. In the only remaining case, we can obtain a small rmsd for the top-ranked configuration if we relax the allowed number of collisions. In other words, in 23 out of these 25 test complexes, our coarse alignment algorithm in conjunction with the local-search heuristic [61] predicts an accurate near-native docking configuration without false positives.

           After Improve             Before Improve
  pdbid  rank  score   rmsd     rank  score  coll   rmsd
  1A22     1    475    1.08        2    270    23   2.75
  1BI8     1    234   29.88       62    211    10   30.0
  1BRS     1    349    1.07        2    389    37   3.01
  1BUH     1    256    0.61        5    209    14   1.85
  1BXI     1    289    0.63       16    217    21   5.59
  1CHO     1    305    0.99        1    243    14   2.71
  1CSE     1    317    0.82       23    273    36   2.57
  1DFJ     1    220    1.28       78    178    11   3.09
  1F47     1    221    0.56       15    129     1   1.49
  1FC2     2    200    1.33        5    356    49   4.13
  1FIN     1    413    0.61       34    382    54   9.94
  1FS1     1    326    0.89        2    309    27   1.59
  1JAT     1    288    0.87      522    168    21   1.20
  1JLT     1    310    1.77        3    232    14   6.17
  1MCT     1    322    0.32       84    233    34   3.57
  1MEE     1    372    0.57        1    373    23   1.33
  1STF     1    314    0.79        1    408    43   1.18
  1TEC     1    304    1.28       10    362    51   4.51
  1TGS     1    348    0.44        2    227    13   2.71
  1TX4     1    355    0.36       80    243    25   4.34
  2PTC     1    314    0.66        1    265    18   4.55
  3HLA     1    416    0.70        1    246    19   1.97
  3SGB     1    257    2.24        1    320    38   3.21
  3YGS     1    209    0.85        6    216     7   1.03
  4SGB     1    266    2.50       10    260    33   2.33

Table 6.4: Performance of the algorithm on the 25 protein complexes, after and before the local improvement. Columns 2 to 4 show the first configuration with low rmsd after locally improving the top coarse alignments and re-ranking them. Columns 5 to 8 show the corresponding configuration before the local improvement and re-ranking.

Unbound protein benchmark. We further tested our algorithm on the protein-protein docking benchmark provided in [54]. We omitted the seven complexes classified as difficult in [54], because they have significantly different conformations in the unbound vs. bound structures. We also omitted complexes 1IAI, 1WQ1, and 2PCC because of difficulties in generating molecular surfaces of the required quality. Of the 49 remaining complexes, 25 are so-called bound-unbound cases, in which one of the components is rigid. For each complex, we fix one chain as A: the rigid chain for the bound-unbound cases, or the receptor for the unbound-unbound cases. We generate a set C of potential positions for the other component, B. For each transformation T computed from C, we measure the rmsd between the interface atoms of T(B) and those of B, and refer to it as rmsd*. (For unbound protein complexes, rmsd* serves as a better measure than the rmsd used before, due to the flexibility, and thus the unreliability, of the positions of non-backbone atoms.) As the unbound structures (i.e., the input to our algorithm) provided in the benchmark are superimposed onto their crystallized counterparts, this value is close to the rmsd* measured between T(B) and the crystallized structure of B.

  C-id     Top 2000          All outputs
         rmsd*  hits     rank    rmsd*    rmsd     size
  1ACB   3.70    20      3,951    1.75    1.81   14,426
  1AVW   5.51     8      4,698    5.42    6.38   23,565
  1BRC   4.66    35      1,629    4.66    5.72   12,770
  1BRS   1.60     7        426    1.60    1.54   11,607
  1CGI   3.04     5        695    3.04    3.32   10,135
  1CHO   2.35    27         92    2.35    2.69   11,815
  1CSE   3.15     7     15,271    2.74    2.53   21,068
  1DFJ   6.44     2      1,433    6.44    6.09   35,231
  1FSS   7.65     2     10,721    5.15    5.45   25,609
  1MAH   2.78     4      1,561    2.78    3.45   25,402
  1TGS   5.27    18        543    5.27    6.07   11,383
  1UGH   7.95     3      8,268    7.16    7.40   14,656
  2KAI   6.55    26      2,560    3.41    4.71   13,478
  2PTC   4.55    32      4,983    4.16    7.85   13,929
  2SIC   4.04    27         76    4.04    7.71   20,065
  2SNI   6.34    10      4,894    4.58    4.72   15,830
  1PPE   4.13    10         37    4.13    5.07    7,660
  1STF   1.41     8          1    1.41    1.92   15,082
  1TAB   3.78     3         48    3.78    5.48    8,296
  1UDI   4.50     3      1,124    4.50    7.38   21,133
  2TEC   1.42     5          6    1.42    1.53   21,134
  4HTC   5.94     2        396    5.94    7.39   14,032
  1AHW   9.38     1      2,781    4.37   10.44   32,919
  1BVK   1.95     5      1,189    1.95    3.69   24,611
  1DQJ   4.59     7        710    4.59    6.21   28,694
  1MLC   3.71     7      6,949    3.32    7.13   29,747
  1WEJ   6.27     3      4,659    5.86    6.01   18,194
  1BQL   6.98    11     10,388    4.39    4.64   23,308
  1EO8   2.31     1         11    2.31    3.11   45,512
  1FBI   6.49     8     11,783    2.30    2.08   26,036
  1JHL   3.47    18     14,185    2.61    3.27   32,091
  1KXQ   5.99     2      1,495    5.99   15.86   37,218
  1KXT   4.52    12        153    4.52   10.90   39,240
  1KXV   2.48     7        321    2.48    3.54   46,368
  1MEL   2.21     8         73    2.21    2.55   17,741
  1NCA   1.75     7        621    1.75    1.92   49,600
  1NMB   7.18     7     14,202    2.72    5.11   42,066
  1QFU   1.97     4         12    1.97    3.07   47,693
  2JEL   3.46    19        115    3.46    4.39   34,072
  2VIR   1.08    11          1    1.08    1.86   40,813
  1AVZ   4.06     8      4,243    3.52    4.08    7,895
  1L0Y   2.75     2      1,136    2.75    3.83   34,044
  2MTA   2.91    40     19,167    2.07    2.16   36,903
  1A0O   5.95     3      3,950    4.35    5.20    9,113
  1ATN   1.52     8          1    1.52    2.33   50,729
  1GLA     -      -     25,307    2.82    2.83   33,879
  1IGC   2.48     3      3,260    2.06    2.59   25,303
  1SPB   2.83     3        617    2.83    3.03   13,728
  2BTF   5.02     2     10,132    3.28    3.63   33,480

Table 6.5: Protein-protein benchmark. C-id means complex-id.

Now take C_2000, the set of the top 2000 ranked configurations from C. The results are shown in Table 6.5, where we show in column 2 the smallest rmsd* generated from C_2000, and in column 3 the number of configurations from C_2000 whose rmsd* falls below a fixed cut-off. Columns 4 to 6 provide information (rank, rmsd*, and the corresponding rmsd) on the configuration in C with the smallest rmsd*, and the size of C is shown in column 7.

Our results demonstrate a number of favorable characteristics of our algorithm. First, within the relatively small set of 2000 top-scoring configurations (C_2000), 38 out of the 49 complexes yield a configuration below 6 Å rmsd*. All but one complex yield a configuration below the 10 Å cut-off needed as input for the hierarchical, progressive refinement protocol in [41, 42]. The fact that most complexes generate multiple hits (column 3) increases the probability that a local refinement will not be trapped in a local minimum and will instead find a correct solution. Second, among all the configurations generated (the set C, of size at most about 50,000), 47 out of the 49 complexes yield a configuration below 6 Å rmsd*, typically within the top 10,000 scores. All 49 complexes generate at least one configuration below 8 Å. How these coarse alignments re-rank to yield high-scoring solutions with low rmsd remains to be investigated. We also remark that it is possible to further reduce the size of C by clustering similar configurations [86].

6.4 Notes and discussion

We have presented in this chapter an efficient alignment-based algorithm that computes a set of coarse configurations for two rigid input proteins. We have shown in Section 6.3 that, when combined with the local improvement heuristic from [61], our algorithm predicts an accurate near-native docking configuration for 23 out of 25 test bound protein complexes, without producing any false positives. When tested on the unbound protein docking benchmark [54], our algorithm rapidly produces a relatively short list of potential configurations that can serve as input to local improvement methods that allow protein flexibility, and thus it has the potential to help solve the unbound protein docking problem.

Comparisons. Current approaches for the bound docking stage differ in how they sample the search space and how they evaluate the docking score. FFT-type methods discretize the search space in a rather uniform way. They produce more accurate predictions for re- combining known complexes, but at a much higher computational cost. For the case of unbound docking, it is possible to run those algorithms at a low resolution to provide some tolerance for flexibility. However, if the resolution is too low, there is a danger of miss- ing good alignments completely; while if the resolution is high, too many configurations will be generated and the selection of a small set of good candidates for the refinement stage is not a trivial problem. Alignment-based methods tend to sample the search space in a more selective manner guided by shape complementarity. Methods in this category

designed prior to our algorithm align feature points residing at protrusions and cavities. In particular, given two sets of such points, one for each input protein, they align all possible pairs of points from one set with all possible pairs from the other to generate potential rigid motions. Since the number of feature pairs computed by our algorithm is comparable to the number of feature points computed by those algorithms, they inspect significantly more transformations than ours (roughly n^4 versus n^2, where n is the number of feature pairs or points), many of which are meaningless or duplicates. In contrast, by aligning features computed from the elevation function, we sample the transformational space in a much sparser manner than

previous approaches, focusing only on potentially good docking locations. The size of the output of our algorithm is much smaller (even without clustering, for the 1BRS complex, for instance), and as the configurations are generated by fitting features, we expect to capture reasonable configurations unless the proteins undergo dramatic conformational changes. We also note that it is possible to further improve the speed of our algorithm by applying the geometric hashing technique as in [86].
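For contrast with this alignment-based sampling, the core of an FFT-type method scores every translational placement at once by correlating two occupancy grids. A minimal NumPy sketch in the spirit of the correlation approach of [119] (an assumed simplification: grid construction, scoring weights, and the rotational search are all omitted):

```python
import numpy as np

def fft_translation_scores(receptor, ligand):
    """Score all translational placements of `ligand` against `receptor`
    by circular cross-correlation of their occupancy grids, computed in
    O(N log N) via the correlation theorem:
    corr = IFFT(conj(FFT(receptor)) * FFT(ligand))."""
    f = np.fft.fftn(receptor)
    g = np.fft.fftn(ligand)
    return np.real(np.fft.ifftn(np.conj(f) * g))
```

In a full docking code this correlation is repeated over a sampled set of rotations, and the grid values encode separate surface and interior weights so that deep overlap is penalized rather than rewarded.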

Bibliography

[1] http://www.marketdata.nasdaq.com/mr4b.html.

[2] Protein Data Bank. http://www.rcsb.org/pdb/.

[3] P. Agarwal and K. R. Varadarajan. Efficient algorithms for approximating polygonal chains. Discrete Comput. Geom., 23:273–291, 2000.

[4] P. K. Agarwal, H. Edelsbrunner, J. Harer, and Y. Wang. Extreme elevation on a 2-manifold. In Proc. 20th Annu. ACM Sympos. Comput. Geom., pages 357–365, 2004.

[5] P. K. Agarwal, H. Edelsbrunner, and Y. Wang. Computing the writhing number of a knot. Discrete Comput. Geom., 32:37–53, 2004.

[6] P. K. Agarwal, L. J. Guibas, A. Nguyen, D. Russel, and L. Zhang. Collision detection for deforming necklaces. To appear.

[7] P. K. Agarwal, S. Har-Peled, N. H. Mustafa, and Y. Wang. Near-linear time approximation algorithms for curve simplification in two and three dimensions. Algorithmica, to appear.

[8] P. K. Agarwal, S. Har-Peled, M. Sharir, and Y. Wang. Hausdorff distance under translation for points and balls. In Proc. 19th Annu. ACM Sympos. Comput. Geom., pages 282–291, 2003.

[9] P. K. Agarwal and J. Matoušek. Ray shooting and parametric search. SIAM J. Comput., 22:540–570, 1993.

[10] P. K. Agarwal and J. Matoušek. On range searching with semialgebraic sets. Discrete Comput. Geom., 11:393–418, 1994.

[11] P. K. Agarwal, M. Sharir, and S. Toledo. Applications of parametric searching in geometric optimization. J. Algorithms, 17:292 – 318, 1994.

[12] O. Aichholzer, H. Alt, and G. Rote. Matching shapes with a reference point. Intl. J. Comput. Geom. and Appl., 7:349–363, 1997.

[13] J. Aldinger, I. Klapper, and M. Tabor. Formulae for the calculation and estimation of writhe. J. Knot Theory and Its Ramifications, 4:343–372, 1995.

[14] H. Alt, B. Behrends, and J. Blömer. Approximate matching of polygonal shapes. Ann. Math. Artif. Intell., 13:251–266, 1995.

[15] H. Alt, P. Brass, M. Godau, C. Knauer, and C. Wenk. Computing the Hausdorff distance of geometric patterns and shapes. Discrete Comput. Geom. – The Goodman–Pollack Festschrift, pages 65–76, 2003.

[16] H. Alt and M. Godau. Computing the Fréchet distance between two polygonal curves. Internat. J. Comput. Geom. Appl., pages 75–91, 1995.

[17] H. Alt and L. Guibas. Discrete geometric shapes: Matching, Interpolation, and Approximation. Handbook of Computational Geometry (J.-R. Sack and J. Urrutia eds), 1999.

[18] E. Althaus, O. Kohlbacher, H. P. Lenhof, and P. Müller. A combinatorial approach to protein docking with flexible side-chains. In Proceedings of the Fourth International Conference on Computational Molecular Biology (RECOMB), pages 8–11, 2000.

[19] N. Amenta and R. Kolluri. The medial axis of a union of balls. Comput. Geom: Theory Appl., 20:25–37, 2001.

[20] A. M. Amilibia and J. J. N. Ballesteros. The self-linking number of a closed curve. Journal of Knot Theory and Its Ramifications, 9(4):491–503, 2000.

[21] A. Amir, E. Porat, and M. Lewenstein. Approximate subset matching with Don’t Cares. In Proc. 12th ACM-SIAM Symp. Discrete Algorithms, pages 305–306, 2001.

[22] M. Ankerst, G. Kastenmüller, H. Kriegel, and T. Seidl. 3D shape histograms for similarity search and classification in spatial databases. In Proc. of the 6th Int. Sympos. on Spatial Databases, volume 1651, pages 207–226, 1999.

[23] V. I. Arnold. Catastrophe Theory. Springer-Verlag, Berlin, Germany, 1984.

[24] S. Arya and T. Malamatos. Linear-size approximate Voronoi diagrams. In Proc. 13th ACM-SIAM Symp. on Discrete Algorithms, pages 147–155, 2002.

[25] M. J. Atallah. A linear time algorithm for the Hausdorff distance between convex polygons. Inform. Process. Lett., 17:207 – 209, 1983.

[26] D. Baker and A. Sali. Protein structure prediction and structural genomics. Science, 294(5):93–96, 2001.

[27] Y. A. Ban, H. Edelsbrunner, and J. Rudolph. Interface surfaces for protein-protein complexes. In RECOMB, pages 205–212, 2004.

[28] T. Banchoff. Self-linking numbers of space polygons. Indiana Univ. Math. J., 25:1171–1188, 1976.

[29] G. Barequet, D. Z. Chen, O. Daescu, M. T. Goodrich, and J. Snoeyink. Efficiently approximating polygonal paths in three and higher dimensions. Algorithmica, 33(2):150–167, 2002.

[30] W. R. Bauer, F. H. C. Crick, and J. H. White. Supercoiled DNA. Scientific American, 243:118–133, 1980.

[31] H. M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig, I. N. Shindyalov, and P. E. Bourne. The Protein Data Bank. Nucleic Acids Res., 28:235–242, 2000.

[32] S. Bespamyatnikh, V. Choi, H. Edelsbrunner, and J. Rudolph. Accurate protein docking by shape complementarity alone, 2004.

[33] K. Brakke. Surface evolver software documentation. http://www.geom.umn.edu/software/evolver/.

[34] C. Branden and J. Tooze. Introduction to Protein Structure. Garland Publishing Inc., 2 edition, 1999.

[35] M. L. Bret. Catastrophic variation of twist and writhing of circular DNAs with constraint? Biopolymers, 18:1709 – 1725, 1979.

[36] J. W. Bruce and P. J. Giblin. Curves and Singularities. Cambridge Univ. Press, England, second edition, 1992.

[37] D. Brutlag. DNA topology and topoisomerases, 2000. http://cmgm.stanford.edu/biochem201/Handouts/DNAtopo.html.

[38] G. Buck. Four-thirds power law for knots and links. Nature, 392:238–239, 1998.

[39] R. M. Burnett and J. S. Taylor. DARWIN: A program for docking flexible molecules. Proteins: Structure, Function, and Genetics, 41:173 – 191, 2000.

[40] G. Călugăreanu. Sur les classes d'isotopie des noeuds tridimensionnels et leurs invariants. Czechoslovak Math. J., 11:588–625, 1961.

[41] C. J. Camacho, D. W. Gatchell, S. R. Kimura, and S. Vajda. Scoring docked conformations generated by rigid-body protein-protein docking. Proteins, 40:525–537, 2000.

[42] C. J. Camacho and S. Vajda. Protein docking along smooth association pathways. Proc. Natl. Acad. Sci., 98:10636–10641, 2001.

[43] J. Cantarella. On comparing the writhe of a smooth curve to the writhe of an inscribed polygon, 2002.

[44] J. Cantarella, D. DeTurck, and H. Gluck. Upper bounds for the writhing of knots and the helicity of vector fields. In Proceedings of the Conference in Honor of the 70th Birthday of Joan Birman, J. Gilman, X. Lin W. Menasco (eds), 2000.

[45] J. Cantarella, R. Kusner, and J. Sullivan. Tight knot values deviate from linear relation. Nature, 392:237–238, 1998.

[46] D. Cardoze and L. Schulman. Pattern matching for spatial point sets. In Proc. 39th Annu. IEEE Sympos. Found. Comput. Sci., pages 156 – 165, 1998.

[47] O. Carugo and S. Pongor. Protein fold similarity estimated by a probabilistic approach based on Cα–Cα distance comparison. Journal of Molecular Biology, 315:887–898, 2002.

[48] F. Cazals, F. Chazal, and T. Lewiner. Molecular shape analysis based upon the Morse-Smale complex and the Connolly function. In Proc. 19th Annu. ACM Sympos. Comput. Geom., 2003.

[49] H. S. Chan and K. A. Dill. The effects of internal constraints on the configurations of chain molecules. Journal of Chemical Physics, 92(5):3118 – 3135, 1990.

[50] W. S. Chan and F. Chin. Approximation of polygonal curves with minimum number of line segments. In Proc. 3rd Annu. Internat. Sympos. Algorithms Comput., volume 650 of Lecture Notes Comput. Sci., pages 378–387. Springer-Verlag, 1992.

[51] B. Chazelle. Cutting hyperplanes for divide-and-conquer. Discrete Comput. Geom., 9:145–158, 1993.

[52] B. Chazelle, H. Edelsbrunner, L. Guibas, M. Sharir, and J. Stolfi. Lines in space: combinatorics and algorithms. Algorithmica, 15:428–447, 1996.

[53] R. Chen, L. Li, and Z. Weng. ZDOCK: An initial-stage protein docking algorithm. Proteins, 52(1):80–87, 2003.

[54] R. Chen, J. Mintseris, J. Janin, and Z. Weng. A protein-protein docking benchmark. Proteins, 52:88–91, 2003.

[55] H.-L. Cheng, T. K. Dey, H. Edelsbrunner, and J. Sullivan. Dynamic skin triangulation. Discrete Comput. Geom., 25:525–568, 2001.

[56] S. S. Chern. Curves and surfaces in Euclidean space. Studies in Global Geometry and Analysis, pages 16–56, 1967.

[57] L. P. Chew, D. Dor, A. Efrat, and K. Kedem. Geometric pattern matching in d-dimensional space. Discrete Comput. Geom., 21:257–274, 1999.

[58] L. P. Chew, M. T. Goodrich, D. P. Huttenlocher, K. Kedem, J. M. Kleinberg, and D. Kravets. Geometric pattern matching under Euclidean motion. Comput. Geom. Theory Appl., 7:113–124, 1997.

[59] L. P. Chew, D. Huttenlocher, K. Kedem, and J. Kleinberg. Fast detection of common geometric substructure in proteins. In Proc. 3rd. Int. Conf. Comput. Mol. Biol., 1993.

[60] I. Choi, J. Kwon, and S. Kim. Local feature frequency profile: A method to measure structural similarity in proteins. PNAS, 101(11):3797 – 3802, 2004.

[61] V. Choi, P. K. Agarwal, H. Edelsbrunner, and J. Rudolph. Local search heuristic for rigid protein docking. In 4th Workshop on Algorithms in Bioinformatics, 2004.

[62] D. Cimasoni. Computing the writhe of a knot. Journal of Knot Theory and Its Ramifications, 10(3):387–395, 2001.

[63] K. Cole-McLaughlin, H. Edelsbrunner, J. Harer, V. Natarajan, and V. Pascucci. Loops in Reeb graphs of 2-manifolds. Discrete Comput. Geom. To appear.

[64] M. L. Connolly. Molecular surface review.

[65] M. L. Connolly. Analytical molecular surface calculation. Journal of Applied Crystallography, 16:548–558, 1983.

[66] M. L. Connolly. Measurement of protein surface shape by solid angles. J. Mol. Graphics, 4:3 – 6, 1986.

[67] M. L. Connolly. Shape complementarity at the hemoglobin α1β1 subunit interface. Biopolymers, 25:1229–1247, 1986.

[68] T. E. Creighton. Proteins: structures and molecular properties. W. H. Freeman and Company, New York, second edition, 1993.

[69] F. H. C. Crick. The packing of alpha-helices: simple coiled coils. Acta Crystallog- raphy, 6:689–697, 1953.

[70] F. H. C. Crick. Linking numbers and nucleosomes. Proc. Natl. Acad. Sci. USA, 73(8):2639–2643, 1976.

[71] M. de Berg, M. van Kreveld, M. Overmars, and O. Schwarzkopf. Computational geometry: algorithms and applications. Springer, 1997.

[72] Y. Diao and C. Ernst. The complexity of lattice knots. Topology Appl., pages 1–9, 1998.

[73] K. A. Dill. Theory for the folding and stability of globular proteins. Biochemistry, 24(6):1501 – 1509, 1985.

[74] K. A. Dill. Dominant forces in protein folding. Biochemistry, 29(31):7132 – 7135, 1990.

[75] D. H. Douglas and T. K. Peucker. Algorithms for the reduction of the number of points required to represent a digitized line or its caricature. Canadian Cartographer, 10(2):112–122, Dec. 1973.

[76] H. Edelsbrunner, J. Harer, and A. Zomorodian. Hierarchical Morse complexes for piecewise linear 2-manifolds. Discrete Comput. Geom., 30:87–108, 2003.

[77] H. Edelsbrunner, D. Letscher, and A. Zomorodian. Topological persistence and simplification. Discrete Comput. Geom., 28:511–533, 2002.

[78] H. Edelsbrunner and E. P. Mücke. Simulation of simplicity: a technique to cope with degenerate cases in geometric algorithms. ACM Trans. Graphics, 9:66–104, 1990.

[79] M. H. Eggar. On White’s Formula. Journal of Knot Theory and Its Ramifications, 9(5):611–615, 2000.

[80] I. Eidhammer, I. Jonassen, and W. R. Taylor. Structure comparison and structure patterns. J. Comput. Biol., 7:685–716, 2000.

[81] A. H. Elcock, D. Sept, and J. A. McCammon. Computer simulation of protein- protein interactions. J. Phys. Chem., 105:1504–1518, 2001.

[82] M. A. Erdmann. Protein similarity from knot theory and geometric convolution. In RECOMB, pages 195–204, 2004.

[83] R. Estkowski and J. S. B. Mitchell. Simplifying a polygonal subdivision while keeping it simple. In Proc. 17th Annu. ACM Sympos. Comput. Geom., pages 40–49, 2001.

[84] A. Fersht. Structure and mechanism in protein science. W. H. Freeman and Company, New York, third edition, 2000.

[85] P. Finn and L. Kavraki. Computational approaches to drug design. Algorithmica, 25:347–371, 1999.

[86] D. Fischer, S. L. Lin, H. Wolfson, and R. Nussinov. A geometry-based suite of molecular docking processes. J. Mol. Biol., 248:459–477, 1995.

[87] W. Fraczek. Mean sea level, GPS, and the geoid. ArcUser Online, 2003. ESRI Web site: www.esri.com/news/arcuser/0703/summer-2003.html.

[88] F. B. Fuller. The writhing number of a space curve. In Proc. Natl. Acad. Sci. USA, volume 68, pages 815–819, 1971.

[89] F. B. Fuller. Decomposition of the linking number of a closed ribbon: a problem from molecular biology. In Proc. Natl. Acad. Sci. USA, volume 75, pages 3557– 3561, 1978.

[90] H. A. Gabb, R. M. Jackson, and M. J. Sternberg. Modeling protein docking using shape complementarity, electrostatics and biochemical information. J. Mol. Biol., 272(1):106–120, 1997.

[91] E. J. Gardiner, P. Willett, and P. J. Artymiuk. Protein docking using a genetic algorithm. Proteins: Structure, Function, and Genetics, 44:44–56, 2001.

[92] M. Godau. A natural metric for curves: Computing the distance for polygonal chains and approximation algorithms. In Proc. of the 8th Annual Symposium on Theoretical Aspects of Computer Science, pages 127–136, 1991.

[93] B. B. Goldman and W. T. Wipke. QSD: quadratic shape descriptors. 2. Molecular docking using quadratic shape descriptors (QSDock). Proteins, 38:79–94, 2000.

[94] D. Goldman, S. Istrail, and C. H. Papadimitriou. Algorithmic aspects of protein structure similarity. In IEEE Symposium on Foundations of Computer Science, pages 512–522, 1999.

[95] M. T. Goodrich, J. S. Mitchell, and M. W. Orletsky. Practical methods for approximate geometric pattern matching under rigid motion. In Proc. 10th Annu. ACM Sympos. Comput. Geom., pages 103–112, 1994.

[96] L. J. Guibas, J. E. Hershberger, J. S. Mitchell, and J. Snoeyink. Approximating polygons and subdivisions with minimum link paths. Internat. J. Comput. Geom. Appl., 3(4):383–415, Dec. 1993.

[97] I. Halperin, B. Ma, H. Wolfson, and R. Nussinov. Principles of docking: An overview of search algorithms and a guide to scoring functions. Proteins: Structure, Function, and Genetics, 47:409–443, 2002.

[98] S. Har-Peled. A replacement for Voronoi diagrams of near linear size. In Proc. 42nd Annu. IEEE Sympos. Found. Comput. Sci., pages 94–103, 2001.

[99] A. Hatcher and J. Wagoner. Pseudo-Isotopies of compact manifolds, 1973. Société Mathématique de France.

[100] P. Heckbert and M. Garland. Survey of polygonal surface simplification algorithms. In SIGGRAPH 97 Course Notes: Multiresolution Surface Modeling, 1997.

[101] J. Hershberger and J. Snoeyink. An O(n log n) implementation of the Douglas–Peucker algorithm for line simplification. In Proc. 10th Annu. ACM Sympos. Comput. Geom., pages 383–384, 1994.

[102] A. Hillisch and R. Hilgenfeld, editors. Modern Methods of Drug Discovery. Springer Verlag, 2003.

[103] L. Holm and C. Sander. Mapping the protein universe. Science, 273:595–602, 1996.

[104] L. Holm and C. Sander. Dali/FSSP classification of three-dimensional protein folds. Nucleic Acids Research, 25(1):231–234, 1997.

[105] W. Humphrey, A. Dalke, and K. Schulten. VMD – Visual Molecular Dynamics. J. Molec. Graphics, 15:33–38, 1996.

[106] D. P. Huttenlocher, K. Kedem, and J. M. Kleinberg. On dynamic Voronoi diagrams and the minimum Hausdorff distance for point sets under Euclidean motion in the plane. In Proc. 8th Annu. ACM Sympos. Comput. Geom., pages 110–120, 1992.

[107] D. P. Huttenlocher, K. Kedem, and M. Sharir. The upper envelope of Voronoi surfaces and its applications. Discrete Comput. Geom., 9:267–291, 1993.

[108] H. Imai and M. Iri. An optimal algorithm for approximating a piecewise linear function. Journal of Information Processing, 9(3):159–162, 1986.

[109] H. Imai and M. Iri. Polygonal approximations of a curve-formulations and algo- rithms. In G. T. Toussaint, editor, Computational Morphology, pages 71–86. North- Holland, Amsterdam, Netherlands, 1988.

[110] P. Indyk, R. Motwani, and S. Venkatasubramanian. Geometric matching under noise: Combinatorial bounds and algorithms. In Proc. 10th Annu. ACM-SIAM Sympos. Discrete Alg., pages 457–465, 1999.

[111] R. M. Jackson, H. A. Gabb, and M. J. E. Sternberg. Rapid refinement of protein interfaces incorporating solvation: application to the docking problem. J. Mol. Biol., 276:265–285, 1998.

[112] D. J. Jacobs, A. J. Rader, L. A. Kuhn, and M. F. Thorpe. Protein flexibility predictions using graph theory. Proteins: Structure, Function, and Genetics, 44:150–165, 2001.

[113] J. Janin and C. Chothia. The structure of protein-protein recognition sites. J. Biol. Chem., 265:16027–16030, 1990.

[114] J. Janin and S. J. Wodak. The structural basis of macromolecular recognition. Adv. Protein Chem., 61:9–73, 2002.

[115] G. Jones, P. Willett, R. C. Glen, A. R. Leach, and R. Taylor. Development and validation of a genetic algorithm for flexible docking. J. Mol. Biol., 267:727–748, 1997.

[116] S. Jones and J. M. Thornton. Principles of protein-protein interactions. Proc. Natl. Acad. Sci., 93(1):13–20, 1996.

[117] R. D. Kamien. Local writhing dynamics. Eur. Phys. J. B, 1:1–4, 1998.

[118] G. Kastenmüller, H. Kriegel, and T. Seidl. Similarity search in 3D protein databases, 1998.

[119] E. Katchalski-Katzir, I. Shariv, M. Eisenstein, A. A. Friesem, C. Aflalo, and I. A. Vakser. Molecular surface recognition: determination of geometric fit between proteins and their ligands by correlation techniques. Proc. Natl. Acad. Sci., 89:2195–2199, 1992.

[120] K. Kedem, R. Livne, J. Pach, and M. Sharir. On the union of Jordan regions and collision-free translational motion amidst polygonal obstacles. Discrete Comput. Geom., 1:59–71, 1986.

[121] K. Klenin and J. Langowski. Computation of writhe in modeling of supercoiled DNA. Biopolymers, 54:307 – 317, 2000.

[122] P. Koehl. Protein structure similarities. Curr. Opin. Struct. Biol., 11:348–353, 2001.

[123] P. Koehl and M. Levitt. A brighter future for protein structure prediction. Nature Structural Biology, 6(2):108 – 111, 1999.

[124] W. G. Krebs and M. Gerstein. The morph server: a standardized system for analyzing and visualizing macromolecular motions in a database framework. Nucleic Acids Res., 28:1665–1675, 2000.

[125] A. R. Leach. Molecular modelling: Principles and applications. Pearson Education Limited, 1996.

[126] B. Lee and F. M. Richards. The interpretation of protein structures: Estimation of static accessibility. J. Mol. Biol., 55:379 – 400, 1971.

[127] H. Lenhof. An algorithm for the protein docking problem. Bioinformatics: From Nucleic Acids and Proteins to Cell Metabolism, pages 125–139, 1995.

[128] M. Levitt. Protein folding by restrained energy minimization and molecular dynam- ics. J. Mol. Biol., 170:723–764, 1983.

[129] M. Levitt and M. Gerstein. A unified statistical framework for sequence comparison and structure comparison. Proc. Nat. Acad. Sci., 95:5913–5920, 1998.

[130] J. Liang, H. Edelsbrunner, and C. Woodward. Anatomy of protein pockets and cavities: measurement of binding site geometry and implications for ligand design. Protein Sci., 7:1884–1897, 1998.

[131] S. Loncaric. A survey of shape analysis techniques. Pattern Recognition, 31(8):983–1001, 1998.

[132] A. C. Martin, C. A. Orengo, E. G. Hutchinson, S. Jones, M. Karmirantzou, R. A. Laskowski, J. B. Mitchell, C. Taroni, and J. M. Thornton. Protein folds and functions. Struct. Fold. Des., 6:875–884, 1998.

[133] N. Megiddo. Applying parallel computation algorithms in the design of serial algo- rithms. J. ACM, 30:852 – 865, 1983.

[134] A. Melkman and J. O’Rourke. On polygonal chain approximation. In G. T. Toussaint, editor, Computational Morphology, pages 87–95. North-Holland, Amsterdam, Netherlands, 1988.

[135] J. Milnor. Morse Theory. Princeton Univ. Press, New Jersey, 1963.

[136] J. C. Mitchell, R. Kerr, and L. F. T. Eyck. Rapid atomic density measures for molecular shape characterization. J. Mol. Graph. Model., 19:324–329, 2001.

[137] G. Moont and M. J. E. Sternberg. Modelling protein-protein and protein-DNA docking. Bioinformatics – From Genomes to Drugs, 1:361–404, 2001.

[138] A. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia. SCOP: A structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol., 247:536–540, 1995.

[139] C. A. Orengo, A. D. Michie, S. Jones, D. T. Jones, M. B. Swindells, and J. M. Thornton. CATH – A hierarchic classification of protein domain structures. Structure, 5(8):1093–1108, 1997.

[140] P. N. Palma, L. Krippahl, J. E. Wampler, and J. J. G. Moura. BIGGER: A new soft docking algorithm for predicting protein interactions. Proteins: Structure, Function, and Genetics, 39:178–194, 2000.

[141] F. M. G. Pearl, D. Lee, J. E. Bray, I. Sillitoe, A. E. Todd, A. P. Harrison, J. M. Thornton, and C. A. Orengo. Assigning genomic sequences to CATH. Nucleic Acids Research, 28(1):277–282, 2000.

[142] W. F. Pohl. The self-linking number of a closed space curve. J. Math. Mech., 17:975–985, 1968.

[143] D. W. Ritchie. Evaluation of protein docking predictions using Hex 3.1 in CAPRI rounds 1 and 2. Proteins, 52(1):98–106, 2003.

[144] D. W. Ritchie and G. J. L. Kemp. Protein docking using spherical polar Fourier correlations. Proteins, 39:178–194, 2000.

[145] P. Røgen and B. Fain. Automatic classification of protein structure by using Gauss integrals. PNAS, 100(1):119–124, 2003.

[146] H. Samet. Spatial Data Structures: Quadtrees, Octrees, and Other Hierarchical Methods. Addison-Wesley, Reading, MA, 1989.

[147] B. Sandak, R. Nussinov, and H. J. Wolfson. A method for biomolecular structural recognition and docking allowing conformational flexibility. J. Comput. Biol., 5:631–654, 1998.

[148] S. Seeger and X. Laboureux. Feature extraction and registration: An overview. Principles of 3D Image Analysis and Synthesis, pages 153 – 166, 2002.

[149] I. N. Shindyalov and P. E. Bourne. Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng., 11(9):739–747, 1998.

[150] M. L. Sierk and W. R. Pearson. Sensitivity and selectivity in protein structure comparison. Protein Science, 13:773–785, 2004.

[151] A. P. Singh and D. L. Brutlag. Protein structure alignment: a comparison of meth- ods. 2001.

[152] D. D. Sleator and R. E. Tarjan. A data structure for dynamic trees. J. Comput. Sys. Sci., 26(3):362–391, 1983.

[153] G. R. Smith and M. J. E. Sternberg. Prediction of protein-protein interactions by docking methods. Current Opinion in Structural Biology, 12:29–35, 2002.

[154] B. Solomon. Tantrices of spherical curves. Amer. Math. Monthly, 103:30–39, 1996.

[155] R. Srinivasan and G. D. Rose. LINUS: A hierarchic procedure to predict the fold of a protein. Proteins: Structure, Function, and Genetics, 22:1143–1159, 1995.

[156] M. J. E. Sternberg. Protein structure prediction: a practical approach. Oxford University Press, 1996.

[157] D. Swigon, B. D. Coleman, and I. Tobias. The elastic rod model for DNA and its application to the tertiary structure of DNA minicircles in mononucleosomes. Biophysical Journal, 74:2515–2530, 1998.

[158] M. B. Swindells, C. A. Orengo, D. T. Jones, E. G. Hutchinson, and J. M. Thornton. Contemporary approaches to protein structure classification. BioEssays, 20:849– 891, 1998.

[159] M. Tabor and I. Klapper. The dynamics of knots and curves (Part I). Nonlinear Science Today, 4(1):7–13, 1994.

[160] W. R. Taylor, A. C. W. May, N. P. Brown, and A. Aszodi. Protein structure: geometry, topology and classification. Reports on Progress in Physics, 64:517–590, 2001.

[161] M. L. Teodoro, G. N. Phillips, and L. E. Kavraki. Understanding protein flexibility through dimensional reduction. Journal of Computational Biology, 10:617–634, 2003.

[162] A. Šali, E. Shakhnovich, and M. Karplus. How does a protein fold? Nature, 369:248–251, 1994.

[163] I. A. Vakser. Protein docking for low-resolution structures. Protein Engineering, 8:371–377, 1995.

[164] P. Veerapandian. Structure-based drug design. Marcel Dekker, 1997.

[165] R. C. Veltkamp and M. Hagedoorn. State of the art in shape matching. Principles of Visual Information Retrieval, edited by M. S. Lew, Series in Advances in Pattern Recognition, 2001.

[166] A. V. Vologodskii, V. V. Anshelevich, A. V. Lukashin, and M. D. Frank-Kamenetskii. Statistical mechanics of supercoils and the torsional stiffness of the DNA double helix. Nature, 280:294–298, 1979.

[167] R. Weibel. Generalization of spatial data: principles and selected algorithms. In M. van Kreveld, J. Nievergelt, T. Roos, and P. Widmayer, editors, Algorithmic Foundations of Geographic Information Systems. Springer-Verlag Berlin Heidelberg New York, 1997.

[168] J. White. Self-linking and the Gauss integral in higher dimensions. Amer. J. Math., XCI:693–728, 1969.

[169] D. L. Wild and M. A. S. Saqi. Structural proteomics: Inferring function from protein structure. Current Proteomics, 1:59–65, 2004.

Biography

Yusu Wang was born on June 28, 1976 in Shanxi Province, China. After receiving her BS from Tsinghua University in 1998, she joined the Department of Computer Science at Duke University, where she received her MS in 2000 and is now pursuing a PhD in the area of geometric computing. Her research focuses on designing efficient computational methods for shape analysis problems, especially for protein structure analysis, by combining both geometry and topology. She is currently involved in BioGeometry, an interdisciplinary collaborative project of Duke University, Stanford University, UNC-Chapel Hill, and NC A&T State University, to address fundamental computational problems in representing, searching, simulating, analyzing, and visualizing biological structures.

Related Publications.

1. P. K. AGARWAL, H. EDELSBRUNNER AND Y. WANG Computing the writhing number of a knot. Discrete Comput. Geom. 32 (2004), 37–53.

2. S. HAR-PELED AND Y. WANG Shape fitting with outliers. SIAM J. Comput., to appear.

3. P. K. AGARWAL, S. HAR-PELED, N. MUSTAFA AND Y. WANG Near-linear time approximation algorithm for curve simplification. Algorithmica, to appear.

4. P. K. AGARWAL, H. EDELSBRUNNER, J. HARER AND Y. WANG Extreme elevation on a 2-Manifold. In “Proc. 20th Annu. Sympos. Comput. Geom., 2004”, 357–365.

5. P. K. AGARWAL, Y. WANG AND H. YU A 2D kinetic triangulation with near-quadratic topological changes. In “Proc. 20th Annu. Sympos. Comput. Geom., 2004”, 180–189.

6. P. K. AGARWAL, S. HAR-PELED, M. SHARIR AND Y. WANG Hausdorff distance under translation for points and balls. In “Proc. 19th Annu. Sympos. Comput. Geom., 2003”, 282–291.
