GEOMETRIC AND TOPOLOGICAL METHODS IN PROTEIN STRUCTURE ANALYSIS
by
Yusu Wang
Department of Computer Science, Duke University
Date: Approved:
Prof. Pankaj K. Agarwal, Supervisor
Prof. Herbert Edelsbrunner, Co-advisor
Prof. John Harer
Prof. Johannes Rudolph
Dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Department of Computer Science in the Graduate School of Duke University
2004

Abstract
Biology provides some of the most important and complex scientific challenges of our time. With the recent success of the Human Genome Project, one of the main challenges in molecular biology is the determination and exploitation of the three-dimensional structure of proteins and their functions. The ability of proteins to perform their numerous functions is made possible by the diversity of their three-dimensional structures, which are capable of highly specific molecular recognition. Hence, to attack the key problems involved, such as protein folding and docking, geometry and topology become important tools. Despite their essential roles, geometric and topological methods are relatively uncommon in computational biology, partly due to a number of modeling and algorithmic challenges. This thesis describes efficient computational methods for characterizing and comparing molecular structures by combining geometric and topological approaches. Although most of the work described here focuses on biological applications, the techniques developed can be applied to other fields, including computer graphics, vision, databases, and robotics.
Geometrically, the shape of a molecule can be modeled as (i) a set of weighted points, representing the centers of atoms and their van der Waals radii; (ii) a polygonal curve, corresponding to a protein backbone or a DNA strand; or (iii) a polygonal mesh, corresponding to a molecular surface. Each such representation emphasizes different aspects of molecular structure at various scales, and the choice among them depends on the underlying application. Characterizing molecular shapes represented in these various ways is an important step toward better understanding or manipulating molecular structures. In the first part of the thesis, we study three geometric descriptions: the writhing number of DNA strands, the level-of-detail representation of protein backbones via simplification, and the elevation function of molecular surfaces.
The writhing number of a curve measures how many times the curve coils around itself in space. It describes the so-called supercoiling phenomenon of double-stranded DNA, which influences DNA replication, recombination, and transcription. It is also used to characterize protein backbones. This thesis proposes the first subquadratic algorithm for computing the writhing number of a polygonal curve. It also presents an algorithm that is easy to implement and runs in near-linear time on inputs that are typical in practice, including DNA strands; this is significantly faster than the quadratic time needed by the algorithms used in current DNA simulation software.
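For concreteness, the quadratic-time baseline that the subquadratic result improves upon evaluates the Gauss integral pairwise over segments. The sketch below uses the closed-form solid-angle expression of Klenin and Langowski; it is an illustrative reimplementation, not code from the thesis, and the function names are our own.

```python
import math

def _cross(a, b):
    return (a[1]*b[2] - a[2]*b[1], a[2]*b[0] - a[0]*b[2], a[0]*b[1] - a[1]*b[0])

def _dot(a, b):
    return a[0]*b[0] + a[1]*b[1] + a[2]*b[2]

def _sub(a, b):
    return (a[0]-b[0], a[1]-b[1], a[2]-b[2])

def _unit(v, eps=1e-12):
    n = math.sqrt(_dot(v, v))
    return None if n < eps else (v[0]/n, v[1]/n, v[2]/n)

def _asin_clamped(x):
    return math.asin(max(-1.0, min(1.0, x)))

def segment_pair_writhe(p1, p2, p3, p4):
    """Signed Gauss-integral contribution of segments p1p2 and p3p4."""
    r12, r34 = _sub(p2, p1), _sub(p4, p3)
    r13, r14 = _sub(p3, p1), _sub(p4, p1)
    r23, r24 = _sub(p3, p2), _sub(p4, p2)
    normals = [_unit(_cross(r13, r14)), _unit(_cross(r14, r24)),
               _unit(_cross(r24, r23)), _unit(_cross(r23, r13))]
    if any(n is None for n in normals):   # degenerate (collinear) configuration
        return 0.0
    omega = sum(_asin_clamped(_dot(normals[i], normals[(i + 1) % 4]))
                for i in range(4))
    sign_det = _dot(_cross(r34, r12), r13)
    sign = (sign_det > 0) - (sign_det < 0)
    return sign * omega / (4.0 * math.pi)

def writhe(vertices):
    """Writhing number of the closed polygonal curve through `vertices`,
    summed over all non-adjacent segment pairs (quadratic time)."""
    n = len(vertices)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            if j == i + 1 or (i == 0 and j == n - 1):
                continue                   # adjacent segments contribute zero
            total += segment_pair_writhe(vertices[i], vertices[(i + 1) % n],
                                         vertices[j], vertices[(j + 1) % n])
    return 2.0 * total
```

Two sanity checks follow directly from the definition: a planar curve has writhe zero, and reflecting a curve negates its writhe.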
The level-of-detail (LOD) representation of a protein backbone helps to extract its main features. We compute LOD representations via curve simplification under the so-called Fréchet error measure. This measure is more desirable than the widely used Hausdorff error measure in many situations, especially if one wants to preserve global features of a curve (e.g., the secondary structure elements of a protein backbone) during simplification. In this thesis, we present a simple approximation algorithm to simplify curves under the Fréchet error measure, which is the first simplification algorithm with guaranteed quality that runs in near-linear time in dimensions higher than two.
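The difference between the two error measures can be made concrete in the discrete setting. The sketch below is illustrative only (the thesis works with the continuous Fréchet distance): it implements the standard Eiter-Mania dynamic program for the discrete Fréchet distance and shows that a chain and its reversal are at Hausdorff distance zero, yet far apart under the Fréchet measure, since the latter respects the ordering along the curve.

```python
import math
from functools import lru_cache

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def hausdorff(P, Q):
    """Symmetric Hausdorff distance between finite point sets P and Q."""
    d_pq = max(min(dist(p, q) for q in Q) for p in P)
    d_qp = max(min(dist(q, p) for p in P) for q in Q)
    return max(d_pq, d_qp)

def discrete_frechet(P, Q):
    """Discrete Frechet distance (Eiter-Mania dynamic program)."""
    @lru_cache(maxsize=None)
    def df(i, j):
        d = dist(P[i], Q[j])
        if i == 0 and j == 0:
            return d
        if i == 0:
            return max(d, df(0, j - 1))
        if j == 0:
            return max(d, df(i - 1, 0))
        return max(d, min(df(i - 1, j), df(i - 1, j - 1), df(i, j - 1)))
    return df(len(P) - 1, len(Q) - 1)

# A polygonal chain and its reversal: identical as point sets,
# but traversed in opposite order.
P = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
Q = list(reversed(P))
print(hausdorff(P, Q))        # 0.0: Hausdorff ignores the ordering
print(discrete_frechet(P, Q))  # 2.0: Frechet does not
```

This is exactly the failure mode that makes the Hausdorff measure unsuitable for preserving global curve features such as secondary structure elements.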
We propose a continuous elevation function on the surface of a molecule to capture its geometric features, such as protrusions and cavities. To define the function, we follow the example of elevation as defined on Earth, but we go beyond this simpler concept to accommodate general 2-manifolds. Our function is invariant under rigid motions and scales with the surface; beyond the location of shape features, it also provides their direction and size. We present an algorithm for computing the points with locally maximal elevation. These points correspond to the locally most significant features. This succinct representation of features can be applied to aligning shapes, and we present one such application in the second part of the thesis.
The second part of the thesis focuses on molecular shape matching algorithms. The importance of shape matching, both similarity matching and complementarity matching, arises from the general belief that the structure of a protein decides its function. Efficient algorithms to measure the similarity between shapes help identify new types of protein architecture, discover evolutionary relations, and provide biologists with computational tools to organize the fast-growing set of known protein structures. Modeling a molecule as a union of balls, we study the similarity between two such unions under (variants of) the widely used Hausdorff distance, and propose algorithms to find the (approximately) best translation under this distance measure.
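As a baseline for translational matching, one can restrict attention to the translations that map some point of one set onto some point of the other; a triangle inequality argument shows that some such translation is within a factor of two of the optimal Hausdorff distance under translation. The cubic-time sketch below is purely illustrative of this classical idea and is far slower than the algorithms developed in Chapter 5; all names are our own.

```python
import math

def hausdorff3(A, B):
    """Symmetric Hausdorff distance between two finite sets of atom centers."""
    def directed(P, Q):
        return max(min(math.dist(p, q) for q in Q) for p in P)
    return max(directed(A, B), directed(B, A))

def best_translation(A, B):
    """Brute force: translate A so that some a in A lands on some b in B,
    and keep the translation with the smallest Hausdorff distance.
    Some such translation is a 2-approximation of the optimum."""
    best = (float("inf"), (0.0, 0.0, 0.0))
    for a in A:
        for b in B:
            t = tuple(bi - ai for ai, bi in zip(a, b))
            shifted = [tuple(pi + ti for pi, ti in zip(p, t)) for p in A]
            d = hausdorff3(shifted, B)
            if d < best[0]:
                best = (d, t)
    return best
```

For example, if B is an exact translate of A, the search recovers that translation with distance zero.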
Complementarity matching is crucial for understanding or simulating protein docking, the process in which two or more protein molecules bind to form a compound structure. From a geometric perspective, protein docking can be considered as the problem of searching for configurations with maximum complementarity between two molecular surfaces. Using the feature information generated by the elevation function, we describe an efficient algorithm to find promising initial relative placements of the proteins. These placements can later be refined independently to locate docking positions, using a heuristic that improves the fit locally based on geometric, but possibly also chemical and biological, information.
And indeed there will be time
To wonder, "Do I dare?" and, "Do I dare?"
Time to turn back and descend the stair,
With a bald spot in the middle of my hair —

Do I dare
Disturb the universe?

— T. S. Eliot, The Love Song of J. Alfred Prufrock
Acknowledgements
I came to take of your wisdom: And behold I have found that which is greater than wisdom.

— Kahlil Gibran, The Prophet
It is not without regret that I write this acknowledgement: while I am grateful to all those who made my life in the past few years a joyful and fruitful one, I know sadly that our lives will soon part. The path towards obtaining a PhD was a struggle for me in many ways, and I cannot imagine how it would have been without their support.
It has been a great opportunity to have worked under the supervision of Profs. Pankaj K. Agarwal and Herbert Edelsbrunner. The experience helped shape my attitude and approach towards research, both in computational geometry and in general. Pankaj led me into the world of computational geometry with his broad knowledge. Besides support and guidance, he gave me great freedom in doing research, and is always patient and understanding. It is hard to overestimate how much I have benefited from the numerous discussions with him. Herbert showed me the "friendly" side of computational topology, with his deep insights accompanied by illustrative explanations. His philosophy and vision in research have greatly influenced me. I am deeply indebted to both of them for their guidance and inspiration throughout the course of this dissertation. I would also like to thank Profs. John Harer and Johannes Rudolph, not only for serving on my thesis committee, but also for various discussions and collaborations. Support for this work was provided by the NSF under grant NSF-CCR-00-86013 (the BioGeometry project).
The Duke CS department is a wonderful place. In particular, I wish to thank Drs. Lars Arge and Ron Parr, who were always open and ready to help with my career concerns. Dr. Sariel Har-Peled, now a post-postdoc (well, an assistant professor) at UIUC, has been a tremendous mentor and friend for me, especially during a period when I was swinging among various career choices, and at a time when I was learning to walk the ropes of research. I learned from him to take an approximate perspective towards problems, both in computer science and in life.
I would like to thank all the graduate students and postdocs in the theory group, who provided a vibrant research environment that I have enjoyed and benefited from so much, especially Nabil Mustafa, Hai Yu, Peng Yin, and Vijay Natarajan. I had a lot of fun, both in research and in life, with friends such as Vicky Choi, David Cohen-Steiner, Ho-lun Cheng, Ashish Gehani, Sathish Govindarajan, Jingquan Jia, Tingting Jiang, Dmitriy Morozov, Nabil Mustafa, Vijay Natarajan, Jeff Phillips, Nan Tian, Eric Zhang, Hai Yu, and Haifeng Yu. The inspiring discussions with Nabil, Sathish, and David will always mark my memories of my PhD life. Special thanks to my best friend Peng Yin; his humor, energy, understanding, and advice accompany me to this day. I also want to thank his wife Xia Wu for kindly feeding me uncountable times.
I wish to thank all the staff in the department for being so friendly and helpful, especially Ms. Celeste Hodges and Ms. Diane Riggs.
Last, but not least, I would like to thank my family for their love, support, and confidence in me. My parents and grandparents encouraged me to pursue my own dreams from childhood, and have never tried to pressure me into any life that others might consider successful. My sister and brother-in-law have always been there for me, full of understanding and support. What I have achieved was possible only because they were all by my side. This thesis is dedicated to them.
Contents
Abstract iii
Acknowledgements vii
List of Tables xii
List of Figures xiii
1 Introduction 1
1.1 Protein Structure and Geometric Models ...... 2
1.2 Related Research Areas ...... 5
1.3 Shape Analysis in Molecular Biology ...... 9
1.3.1 Describing Shapes ...... 10
1.3.2 Matching Shapes ...... 12
1.4 Main Contributions ...... 14
2 Writhing Number 18
2.1 Introduction ...... 18
2.2 Prior and New Work ...... 20
2.3 Writhing and Winding ...... 23
2.3.1 Closed knots ...... 24
2.3.2 Open knots ...... 29
2.4 Computing Directional Writhing ...... 31
2.5 Experiments ...... 33
2.5.1 Algorithms ...... 34
2.5.2 Comparison ...... 35
2.6 Notes and Discussion ...... 38
3 Backbone Simplification 39
3.1 Introduction ...... 39
3.2 Prior and New Work ...... 41
3.3 Fréchet Simplification ...... 44
3.3.1 Algorithm ...... 45
3.3.2 Comparisons ...... 47
3.4 Experiments ...... 51
3.5 Notes and Discussions ...... 56
4 Elevation Function 58
4.1 Introduction ...... 58
4.2 Defining Elevation ...... 60
4.2.1 Pairing ...... 60
4.2.2 Height and Elevation ...... 65
4.3 Pedal Surface ...... 69
4.4 Capturing Elevation Maxima ...... 72
4.4.1 Continuity ...... 72
4.4.2 Elevation Maxima ...... 76
4.5 Algorithm ...... 83
4.6 Experiments ...... 87
4.7 Notes and Discussion ...... 90
5 Matching via Hausdorff Distance 94
5.1 Introduction ...... 94
5.2 Collision-Free Hausdorff Distance between Sets of Balls ...... 100
5.2.1 Computing the collision-free Hausdorff distance in 2D and 3D ...... 101
5.2.2 Partial matching ...... 104
5.3 Hausdorff Distance between Unions of Balls ...... 105
5.3.1 The exact 2D algorithm ...... 105
5.3.2 Approximation algorithms ...... 109
5.4 RMS and Summed Hausdorff Distance between Points ...... 113
5.4.1 Simultaneous approximation of Voronoi diagrams ...... 113
5.4.2 Approximating the RMS distance ...... 115
5.4.3 Approximating the summed Hausdorff distance ...... 117
5.4.4 Maintaining the 1-median function ...... 119
5.4.5 Randomized algorithm ...... 123
5.5 Notes and Discussion ...... 124
6 Coarse Docking via Features 126
6.1 Introduction ...... 126
6.2 Algorithm ...... 132
6.2.1 Scoring function ...... 132
6.2.2 Computing features...... 133
6.2.3 Coarse alignment algorithm ...... 134
6.3 Experiments ...... 137
6.4 Notes and discussion ...... 144
Bibliography 147
Biography 161
List of Tables
2.1 Comparisons on protein data ...... 38
4.1 Table of singularities ...... 71
4.2 Number of Maxima for 1brs ...... 87
4.3 Number of Max with different resolution ...... 90
4.4 Covering density ...... 90
6.1 Index-k Features ...... 138
6.2 Complex 1brs ...... 139
6.3 25 Test Cases ...... 140
6.4 Two-step for 25 Test Cases ...... 141
6.5 Unbound Benchmark ...... 143
List of Figures
1.1 Protein structure ...... 2
1.2 Protein models ...... 4
1.3 Protein folding ...... 6
1.4 Protein docking ...... 7
2.1 DNA supercoiling ...... 19
2.2 Sign of crossings ...... 20
2.3 Worst case of writhe ...... 23
2.4 Critical directions ...... 24
2.5 Winding number ...... 27
2.6 Spherical triangle ...... 28
2.7 Open knot ...... 30
2.8 Oriented edges ...... 31
2.9 Convergence rate ...... 36
2.10 Running time ...... 36
2.11 Protein backbones ...... 37
3.1 Fréchet matching ...... 45
3.2 Fréchet simplification ...... 46
3.3 Comparison between Fréchet and Hausdorff simplifications ...... 47
3.4 Relation between the Fréchet and Hausdorff error measures ...... 49
3.5 Results of Fréchet simplification ...... 53
3.6 Running time ...... 54
3.7 Comparisons between DP and GreedyFrechetSimp algorithms ...... 55
3.8 Simplification of a protein ...... 57
4.1 Four types of maxima ...... 60
4.2 Extended persistence ...... 62
4.3 Elevation on 1-manifold ...... 67
4.4 Pedal curve ...... 70
4.5 Co-dimensional 2 singularities ...... 71
4.6 Discontinuity in elevation ...... 73
4.7 Stratification ...... 75
4.8 Mercedes star property ...... 77
4.9 Another neighborhood pattern for a triple point ...... 78
4.10 Other neighborhood patterns ...... 78
4.11 Parameterization of Gaussian neighborhood ...... 80
4.12 Height difference for 2-legged maximum ...... 82
4.13 Height difference for 3-legged maximum ...... 82
4.14 Decaying of maxima ...... 88
4.15 Top 100 maxima for 1brs ...... 88
4.16 Elevation on 1brs ...... 92
5.1 Valid and forbidden regions ...... 102
5.2 Voronoi for union of balls ...... 107
5.3 Exponential grid ...... 119
6.1 Predict docking configurations ...... 126
6.2 Max types again...... 133
6.3 Coarse alignment ...... 135
6.4 Align features ...... 135
6.5 Align Pairs ...... 136
Chapter 1
Introduction
If a living cell is viewed as a biochemical factory, then its main workers are protein molecules, acting as catalysts, transporting small molecules, forming cellular structures, and carrying signals, among other roles. As Jacques Monod states in his book Chance and Necessity: ". . . it is in proteins that lies the secret of life." Their functional diversity is made possible by the diversity of their three-dimensional structures. Understanding or simulating the molecular processes involved in the formation of protein structures and their biological functions is a major challenge of molecular biology. For most of the key problems involved in this challenge, such as protein folding, docking, structure classification, and structure prediction, geometry and topology naturally play important roles. Currently, however, geometric methods are not fully utilized and investigated in attacking these key problems, partly due to a number of representational and algorithmic challenges.
To close this gap, in this thesis, we study shape analysis problems arising in molecular biology by combining both geometric and topological approaches. In particular, we focus on algorithms for describing and matching protein structures. Note that, in general, shape characterization and matching are central to various application areas other than structural biology, including computer vision, pattern recognition, and robotics [17, 131, 165]. Most of the techniques that we have developed are applicable to these other fields as well.
In the remainder of this chapter, we first give a brief biological background on protein structures and introduce some related research areas. More details can be found in standard textbooks [34, 68, 125]. We then describe shape analysis problems, arising in molecular biology, from the computational side. We state our main contributions at the end of this chapter.
1.1 Protein Structure and Geometric Models
A protein is a polymer consisting of a long chain of small building blocks, called amino acids or residues. All amino acids have a 3-atom backbone, N-Cα-C, to which a side chain (denoted by R in Figure 1.1 (a)) is attached. Besides the side chain, a hydrogen atom is bonded to the backbone nitrogen atom, and an oxygen is doubly bonded to the carboxy carbon. There are 20 standard amino acid residues, distinguishable by their side chains. The amino end (N) of an amino acid connects to the carboxy end (C) of the preceding amino acid, forming a peptide bond. Thus the chemical structure of a protein molecule can be viewed as a linear sequence of amino acids interconnected by peptide bonds.
Figure 1.1: (a) Protein structure, with the backbone structure in the dotted boxes. (b) The folded state of a protein; each atom is modeled as a ball.
Though a linear sequence, a protein molecule folds into a compact and typically unique three-dimensional structure under certain physiological conditions (see Figure 1.1 (b)). This is the result of various atomic interactions, such as van der Waals and electrostatic forces. The resulting structure, referred to as the native structure or the folded state, is how a protein molecule exists in nature, and is the conformation in which a protein is able to perform its physiological functions. In fact, given a protein molecule, its three-dimensional structure decides its functionality to a large extent [68, 125].¹ Therefore, knowledge of protein structures is essential for understanding the principles that govern their functions in nature. This three-dimensional structure is the focus of our study. In general, protein structures are examined at different levels, referred to as the protein structure architecture [34]. We present a brief description below.

¹ For example, disruption of the native structures of proteins is the primary cause of several neurodegenerative diseases, such as Alzheimer's disease and Parkinson's disease.
primary structure: the amino acid composition of the protein, i.e., the linear sequence of amino acids;

secondary structure: common patterns in the conformation of protein backbones observed in nature. There are four major types of secondary structure elements (SSEs): α-helix, β-sheet, β-turn, and coils;

supersecondary structure: the higher-scale structures organized by secondary structure elements, e.g., how SSEs are connected;

tertiary structure: the global folded three-dimensional structure of the protein;

quaternary structure: the structure of a complex of two or more proteins bound together.
How to model proteins appropriately is a crucial first step before we can visualize or manipulate them. Several models have been proposed, depending on the objectives of the underlying applications and/or what information one would like to emphasize. In the literature, the term modeling refers to a broad collection of methods to describe not only the geometric structures, but also the energetic aspects of a molecule² [84, 125]. In this thesis, we focus on geometric models of the three-dimensional structure of proteins [64].
Geometric shapes typically refer to a finite set of points, a space curve, or a surface. In the context of molecular biology, a set of points corresponds to the set of centers of atoms, possibly weighted by the van der Waals radii of the atoms. Sometimes, points are connected by "sticks" that represent covalent bonds between atoms (Figure 1.2 (a)). Such representations specify not only the position of each atom in a molecule, but also the chemical bonding information.

² For example, by using quantum mechanics, one can describe in detail the energy of any arrangement of atoms and molecules in a particular system.
Figure 1.2: A protein molecule represented as (a) the set of atom centers connected by sticks representing covalent bonds; (b) a space curve (the main chain representation); and (c) the (van der Waals) surface of the union of atoms, each represented by a ball.
Sometimes, the details present in the above representation are not necessary, or are even undesirable. The main chain representation is often exploited in such situations: a protein molecule is modeled as a curve in ℝ³ following the trace of the backbone atoms of the amino acids (see Figure 1.2 (b)). Such a representation emphasizes the linear nature of protein molecules, and shows clearly how this linear sequence of amino acids folds in space. It provides a much simplified representation of a protein, while still maintaining its main structural features. Consequently, this representation is popular in many applications, especially those of high computational complexity, such as protein folding and structure classification.
A surface representation of proteins is useful when the object of study is the space occupied by a protein molecule, or when the global structure of the molecule is more important than its local geometry. There are many ways to represent the surface of a molecule. If we model each atom by a ball in ℝ³ with its van der Waals radius, then the surface of the union of these balls is referred to as the van der Waals (VDW) surface (Figure 1.2 (c)). The solvent accessible surface, originally proposed by Lee and Richards [126], is the surface traced out by the center of a probe sphere (which typically represents a water molecule) rolling on top of the VDW surface. The surface traced out by the inward-facing side of this probe sphere is called the molecular surface. The skin surface developed by Cheng et al. [55] is more complicated, but has many elegant mathematical properties.
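To make the solvent accessible surface concrete, the following sketch estimates per-atom solvent accessible surface area numerically, in the spirit of Shrake and Rupley's sphere-sampling method. This is background illustration only, not an algorithm from the thesis; the probe radius of 1.4 Å is the conventional water probe, and all names are our own.

```python
import math

def sphere_points(n):
    """Roughly uniform points on the unit sphere (golden-spiral layout)."""
    pts = []
    golden = math.pi * (3.0 - math.sqrt(5.0))
    for k in range(n):
        z = 1.0 - 2.0 * (k + 0.5) / n
        r = math.sqrt(max(0.0, 1.0 - z * z))
        theta = golden * k
        pts.append((r * math.cos(theta), r * math.sin(theta), z))
    return pts

def sasa(centers, radii, probe=1.4, n_points=500):
    """Per-atom solvent accessible surface area, Shrake-Rupley style:
    sample each atom's probe-expanded sphere and keep the points that
    are not buried inside any neighbor's expanded sphere."""
    unit = sphere_points(n_points)
    areas = []
    for i, (c, rad) in enumerate(zip(centers, radii)):
        R = rad + probe                      # probe-expanded radius
        exposed = 0
        for u in unit:
            p = (c[0] + R * u[0], c[1] + R * u[1], c[2] + R * u[2])
            buried = False
            for j, (c2, rad2) in enumerate(zip(centers, radii)):
                if j == i:
                    continue
                if math.dist(p, c2) < rad2 + probe:
                    buried = True
                    break
            if not buried:
                exposed += 1
        areas.append(4.0 * math.pi * R * R * exposed / n_points)
    return areas
```

For an isolated atom the estimate is exact, since every sample point is exposed; overlapping atoms lose area on their shared side.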
1.2 Related Research Areas
Despite the important role that protein structures play in understanding life, early research in computational biology, or bioinformatics, focused on sequence analysis rather than on protein structures. One of the key reasons is that it is significantly easier to both acquire and manage sequence data.³ Nevertheless, with the tremendous success in sequence analysis, the study of protein structures has become increasingly critical. For example, in the post-genomic era, a major obstacle to the exploitation of the large volume of genome sequence data is the functional characterization of the gene products (protein structures). Since the three-dimensional structures of proteins are more conserved than the corresponding sequences, many large-scale protein structure determination projects have been initiated recently [169] to help analyze the functionally unannotated protein sequence space. These initiatives are widely referred to as structural genomics or structural proteomics. In this section, we briefly mention several (not necessarily disjoint) research areas related to protein structures.⁴
Protein folding and protein structure prediction. Predicting a protein's structure from its amino acid sequence is one of the most significant tasks tackled in computational biology (Figure 1.3). Solving this problem will have enormous impacts on rational drug design, cell modeling, and genetic engineering. It is therefore not surprising that it is considered the "holy grail" of the structural biology community; see e.g. [123].

³ For example, the Human Genome Project has made massive amounts of protein sequence data available, while the output of experimentally determined protein structures, typically obtained by time-consuming and relatively expensive X-ray crystallography and NMR spectroscopy, is lagging far behind.

Figure 1.3: From left to right, snapshots of a periscope and staphylococcal protein A, B domain (in main chain representation) at different stages during the folding process (http://parasol.tamu.edu/dsmft/research/folding).
Two issues are involved here: (i) understanding the mechanism behind the protein folding process, i.e., how a protein folds in nature; and (ii) predicting the final folded conformation given a sequence of amino acids. These two aspects are obviously related, but not necessarily equivalent: several successful approaches to predicting protein structures do not mimic the folding process, but instead rely on knowledge of known protein structures.
There has been a long history of tackling the folding problem. Current approaches can be classified into two categories, which we mention without going into detail here; for more information, refer to [26, 123, 156]. The first class, including comparative modeling and threading, starts from a known template structure or known folds. The second class, de novo or ab initio methods, predicts structure from sequence directly, using principles of atomic interactions and protein architecture. Despite the success of these methods in some cases, especially in predicting structures of small protein molecules, the protein folding problem remains largely unsolved. The main reasons include that the structure is defined by a large number of degrees of freedom⁵, and that the physical basis of protein structural stability is not fully understood. The vigor of this field can be seen from the high participation in, and great performance output of, the Critical Assessment of Structure Prediction (CASP) experiments (http://predictioncenter.llnl.gov).

⁴ It is impossible to survey these areas in a comprehensive manner in this thesis; each would require a whole book to do so!

⁵ As highlighted by the Levinthal paradox, which results from the observation that proteins fold into their specific three-dimensional conformation in a timespan significantly shorter (milliseconds) than would be expected if the molecule actually searched the entire conformation space for the lowest energy state.
Protein interactions. Two or more molecules interact with each other by forming an intermolecular complex (either stably or temporarily), a process called docking or receptor-ligand recognition and binding. Such interactions are critical to various biological processes, such as cell-cell recognition and enzyme catalysis and inhibition. The target macromolecules (receptors) are usually large, mostly proteins, while the ligands can be either large, such as proteins, or small, such as drugs or cofactors (see Figure 1.4). The sites where binding happens are called active sites or binding sites. They are usually places on the surfaces of proteins where chemical reactions or conformational changes happen. Hence knowledge of the interaction between molecules is crucial in understanding, and even manipulating, their functions. As an example, many drug molecules work by acting as inhibitors: they bind to the receptor proteins to block the active sites, thus stopping undesired chemical reactions or molecular processes from happening. As a result, efficient algorithms for docking drug molecules to target receptor proteins are one of the major ingredients in a rational drug design scheme [85, 102, 164].

Figure 1.4: Examples of (a) protein-small molecule docking: main chain representation of HIV-1 protease bound to an inhibitor (in VDW representation); and (b) protein-protein docking: human growth hormone.
The docking problem has attracted great attention from computer scientists as well as biochemists, due to its strong geometric and algorithmic flavor [81, 102, 114]. Much success has been achieved in docking a protein with a small molecule, or in docking two rigid proteins (i.e., where each protein can only undergo rigid transformations). However, the field remains rather open, especially in the case of protein-protein docking without the "rigid" assumption [97]. In this case, as protein structures are complicated, modeling their conformational changes introduces many degrees of freedom. Nevertheless, progress is being made, and interested readers should refer to the results of the CAPRI (Critical Assessment of PRediction of Interactions) experiments for the newest advances in this field (http://capri.ebi.ac.uk).
Protein structure comparison and classification. As a protein molecule with some functional role evolves in the context of a living cell, its overall three-dimensional structure tends to remain unaltered, even when all sequence memory may have been lost [80]. This evolutionary resilience of protein three-dimensional structures is the fundamental reason for comparing protein structures in molecular biology. Numerous comparison methods have been proposed and developed in the past 20 years [122, 160]. The problem, however, is difficult and remains unsolved. There is typically no clear definition of structural similarity, which is of interest at many levels: from the fine detail of backbone and side-chain conformation at the residue level to coarse similarity at the tertiary structure level. Moreover, many situations require one to capture local similarity, which is hard to describe.
Moreover, more and more protein structure data are becoming available: at the time of this writing, there are tens of thousands of protein structures in the Protein Data Bank [31], and the number almost doubles every 18 months. It is therefore crucial to bring order into protein structures by classifying them into families. Other than organizing the large structure database, such classification can aid our understanding of the relationships between structures and functions. For example, it has been shown that almost all enzymes have so-called α/β folds [132], i.e., they have both α-helices and β-sheets in their structures. Furthermore, while each sequence typically generates a unique three-dimensional structure, multiple sequences may produce similar folded structures, or folds. A natural question is then: how many different folds are there in nature? Classification helps to answer this question, and its solution is useful in annotating the sequence space by structures, and thus functions, which is a central aspect of the structural genomics mentioned earlier [26]. Classification also enables us to experimentally determine far fewer protein structures, since we can afford to determine only those that possibly produce novel folds [160].
Currently the three most popular classifications are SCOP [138], CATH [139, 141], and FSSP [104], all of which are accessible via the World Wide Web. As with structure comparison, one main difficulty of the classification problem arises from the fact that there is no consensus on defining the organization of the different categories. Thus, how to classify protein structures in a fully automatic way is still a daunting problem.
1.3 Shape Analysis in Molecular Biology
Above we have sketched some key research areas closely related to protein structures. Two issues appear repeatedly there — how to describe and characterize structures and how to develop efficient computational methods (algorithms). These two issues are obviously not new to many research fields in computer science. In this section, we address these two issues by identifying shape-analysis problems in molecular biology and describing them from a computational perspective. Shape analysis problems have been studied extensively in many fields including computer graphics, vision, geometric computing, robotics, and so on. On the one hand, many of the techniques there can be adopted to attacking problems in molecular biology directly. On the other hand, protein structure analysis has many unique
properties and new techniques are greatly needed.6
At a high level, we classify shape analysis problems into two broad categories, each including many subtopics. Though our classification below is tailored towards molecular biology applications, one should note that the techniques developed are not necessarily constrained to biological applications. Once again, we will only sample a few techniques exploited in attacking these problems, as a full enumeration is beyond the scope of this thesis. For surveys on a subset of topics in shape analysis in general, refer to [17, 131, 165].
1.3.1 Describing Shapes
Modeling flexibility. We have introduced some basic geometric representations for three-dimensional structures of proteins in Section 1.1. In some applications, more sophisticated representations are required: protein molecules are in constant motion (vibration) in solution, and they may undergo significant conformational changes at times (such as in a protein-protein docking process). Therefore, it is important to incorporate flexibility in modelling protein structures. The question then is how to model flexibility, which is typically complex, as protein structures have many degrees of freedom. On the one hand, special data structures are desirable to efficiently support changes in conformations: for example, in [6], a chain hierarchy has been proposed that can efficiently detect collisions for deforming protein backbones. On the other hand, it is important to characterize motions: what types of motions do molecules undergo, and where do they happen? Techniques from robotics, motion planning, and graph theory have been exploited successfully in several cases, either for identifying possible motions [112] or for reducing the degrees of freedom of the motions [85, 161].
Simplified representations. In some applications, simplified structures are needed to
6 We remark here that shape analysis problems are extremely hard for protein structures because the connection between protein structures and their functions is not yet well understood. In many situations, it is not clear what aspects of a structure give rise to a particular functionality.
help to manage complex problems. Hence many approaches use simplified protein structures, such as representing the backbone of a protein molecule as a set of fragments, each corresponding to a secondary structure element [160, 155]. As another example, one model proposed by Dill [49, 74] simplifies the protein backbone to beads chained together on a unit lattice. The beads can be either hydrophobic or hydrophilic, with contacts between hydrophobic beads being favored. Although fairly simplistic, this model yields results surprisingly similar to those derived from experimental data when applied to the protein folding problem [73, 162].
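The lattice model can be made concrete with a small sketch. The hypothetical helper below scores a conformation in a two-dimensional variant of the HP model: beads sit on integer grid points, and every hydrophobic-hydrophobic contact between beads that are lattice neighbors but not chain neighbors contributes -1. The function name and the 2D restriction are our illustrative assumptions, not specifics from the thesis.

```python
def hp_energy(path, seq):
    """Energy of a conformation in a toy 2D HP lattice model: `path` is a
    self-avoiding chain of integer lattice points, `seq` a string over
    'H' (hydrophobic) and 'P' (polar).  Every contact between two H beads
    that are lattice neighbors but not chain neighbors contributes -1."""
    pos = {p: i for i, p in enumerate(path)}
    energy = 0
    for i, (x, y) in enumerate(path):
        if seq[i] != 'H':
            continue
        for nb in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            j = pos.get(nb)
            # j > i + 1 skips chain neighbors and counts each contact once
            if j is not None and j > i + 1 and seq[j] == 'H':
                energy -= 1
    return energy
```

In the protein folding problem one would then search over self-avoiding paths for a minimum-energy conformation of a given sequence.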
Shape descriptors/signatures. One aspect of shape characterization is to extract key features or information from a given shape. For example, this is central to many approaches that compare protein structures: the key information is typically stored in a shape descriptor or signature, and similarity between two shapes can then be measured by some distance between the corresponding descriptors. As another example, in order to understand protein-protein interaction, there has been much research on characterizing the interface where two proteins interact with each other [27, 130]. Features used to describe such interfaces include the buried surface area, the tightness of the binding, hydrophobicity, and so on.
Extracting features and generating shape descriptors are widely used in graphics, vision, and robotics, and many techniques can be borrowed from these fields for applications in molecular biology: statistical methods (such as histograms and harmonic maps) [118, 22], geometry-based methods (such as turning angles) [59], and topology-based methods (such as the Connolly function) [66, 136] have all been exploited to generate shape descriptors in structural biology applications.
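As a minimal sketch of the histogram-style descriptors mentioned above, the following computes a D2-type shape distribution: the normalized histogram of distances between randomly sampled point pairs. All names and parameter values (number of pairs, bins, distance range) are illustrative assumptions, not taken from the thesis.

```python
import math
import random

def d2_descriptor(points, n_pairs=2000, n_bins=8, d_max=4.0, seed=0):
    """Histogram of distances between random point pairs, normalized to
    sum to 1.  `points` is a list of coordinate tuples; `d_max` fixes a
    common scale so that descriptors of different shapes are comparable."""
    rng = random.Random(seed)
    hist = [0] * n_bins
    for _ in range(n_pairs):
        p, q = rng.choice(points), rng.choice(points)
        k = min(int(n_bins * math.dist(p, q) / d_max), n_bins - 1)
        hist[k] += 1
    return [h / n_pairs for h in hist]

def descriptor_distance(h1, h2):
    """L1 distance between two descriptors; small means similar shapes."""
    return sum(abs(a - b) for a, b in zip(h1, h2))
```

Because the descriptor depends only on pairwise distances, it is invariant under rigid motions of the point set, which is exactly the property one wants when comparing structures without first aligning them.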
1.3.2 Matching Shapes
Similarity matching. Measuring similarity between protein structures is essential to protein structure classification, and is needed when applying comparative modeling methods for predicting protein structures. There are in general two types of approaches to this problem. The first type of methods is alignment-based. These methods cast matching as an optimization problem: find the best alignment (i.e., relative placement) of two input structures under some scoring function, where the score evaluates how similar the structures are. Examples of this approach include DALI [103], STRUCTAL [129], and CE [149]. How to define the scoring function, or the distance between two structures, is an intriguing problem in itself, and has received much attention [80, 158].
The second type of approach computes the similarity (or distance) between input structures directly, without producing any alignment to superimpose them. Many of the methods in this category exploit shape descriptors [22, 145]. Another example is the contact-map overlap approach, which converts each protein structure into a special type of graph and measures similarity as the size of the largest common subgraph between two such graphs [94].
In general, alignment-based methods involve searching a large configuration space, and thus have higher time complexity than approaches of the second type. They are also less efficient when querying a structure database. However, they are more reliable and discriminative in measuring similarity. Refer to [150, 151] for comparisons of currently popular matching approaches in structural biology.
Complementarity matching. The main motivation for studying complementarity matching in molecular biology is to understand protein-ligand interactions. A simple geometric formulation of the protein-protein docking problem is the following: given two proteins A and B, find the transformation of B under which the two best complement each other. In other words, this is partial surface matching under the constraint that the two surfaces do not intersect. Two main issues are involved:7 (i) evaluating the alignments generated, i.e., finding a good score function that produces few false positives [97, 153]; and (ii) reducing the complexity of the search procedure, e.g., by exploiting more efficient computation (such as FFT or spherical harmonics) [53, 143], by better search strategies (such as genetic algorithms) [39, 91], or by reducing the number of transformations inspected (such as by geometric hashing) [86].
Of course, in nature both molecules can change their conformations during the docking process. For large protein-protein interactions, it is complex to model flexibility in the matching procedure, and this is one of the main foci of current research [97, 153].
Classification and structure database. We mentioned protein structure classification earlier for the purpose of organizing the rapidly expanding collection of available protein structures. There is also a need to organize structures into a database that supports efficient queries (i.e., given a query structure, return one or more structures from the database that contain it, in the case of motif queries, or that are similar to it). The pairwise structure comparison discussed above (in similarity matching) is obviously a fundamental component of both the classification and the query problems, and a straightforward way to classify protein structures is by all-against-all comparisons. This is the method adopted by most current classifications of protein structures, such as CATH and FSSP [139, 104]. It is, however, rather inefficient, especially when combined with alignment-based pairwise matching procedures. Part of the reason this straightforward approach is used despite its inefficiency is that most past work has focused on how to classify protein structures in a reliable and automatic way, a problem that is still not satisfactorily solved even now [122]. With a better understanding of protein structures, and with the number of known structures increasing rapidly, efficient clustering techniques become essential.
7 References are for molecular biological applications.
Several recently developed protein structure comparison techniques aim at developing a similarity measure that satisfies the triangle inequality [60, 47, 145], so that many known clustering algorithms can be applied. In particular, in [60], Choi et al. exploit techniques from information retrieval in building their classification system.
1.4 Main Contributions
Our research touches both the shape description and the shape matching categories. We focus on developing efficient computational methods for describing or matching structures. Our approaches rely on both geometric and topological techniques.8 Software produced from this thesis work is available at the BioGeometry website (http://biogeometry.duke.edu/).
Part I. Shape description. (1) Writhing number: The writhing number of a curve measures how many times the curve coils around itself in space. It characterizes the so-called supercoiling phenomenon of double-stranded DNA, which influences DNA replication, recombination, and transcription. It is also used to characterize protein backbones. We establish a relationship between the writhing number of a space curve and the winding number, a topological concept. This enables us to develop the first subquadratic algorithm for computing the writhing number of a polygonal curve. We have also implemented a simpler algorithm that runs in near-linear time on inputs that are typical in practice [5], including protein backbones and DNA strands, in contrast to the quadratic-time algorithms used by current software.
(2) Simplification: The level-of-detail (LOD) representation of a protein backbone helps to single out its main features. One way to obtain LOD representations is via curve simplification. We study the simplification problem under the so-called Fréchet error measure.
8 We remark here that in the molecular biology literature, the word topology is typically used with a meaning different from ours: it mainly refers to the topology of the molecule itself, such as how the elements of a molecule are interconnected, while we exploit knowledge and tools from classical topology, such as Morse theory, in our approaches.
This measure is more desirable than the widely used Hausdorff error measure in many situations, especially if one wants to preserve global features of a curve (e.g., the secondary structure elements of a protein backbone) during simplification. We propose and implement a simple algorithm to simplify curves under the Fréchet error measure [7], which is the first simplification algorithm that runs in near-linear time in dimensions higher than two with guaranteed quality.

(3) Elevation function: Given a molecular surface, to capture geometric features such as protrusions and cavities, we design a continuous elevation function on the surface and compute the points with locally maximal elevation [4]. The intuition for the function comes from elevation on Earth, by which we identify mountain peaks and valleys, but extending the concept to general 2-manifolds is technically more involved. The function is scale-independent and provides, beyond the location, also the direction and size of shape features. Using the elevation function, we can describe these geometric features in a reliable and succinct manner, which can aid in attacking the protein docking problem.
Part II. Shape matching. (4) Matching via Hausdorff distance: Modelling a molecule as the union of a set of balls (each representing an atom), we measure the similarity between two molecules by variants of the Hausdorff distance. In particular, we present algorithms that compute, exactly or approximately, the minimum Hausdorff distance between two such unions under all possible translations [8]. We also investigate the version in which we are constrained to translations under which the two sets remain collision-free (i.e., no ball from one set intersects the other set).
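In the simplified setting where each molecule is reduced to its atom centers and the ball radii are ignored, the symmetric Hausdorff distance for a fixed placement can be sketched in a few lines. The brute-force helper below is our illustration only; it implements neither the translational minimization nor the ball-union version studied in the thesis.

```python
def hausdorff(A, B):
    """Symmetric Hausdorff distance between two finite point sets:
    max of the two directed distances, where the directed distance
    h(P, Q) is the farthest any point of P lies from its nearest
    point of Q."""
    def d(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5
    def directed(P, Q):
        return max(min(d(p, q) for q in Q) for p in P)
    return max(directed(A, B), directed(B, A))
```

Minimizing this quantity over translations of B, for example by a grid search over candidate shifts, would give a crude stand-in for the exact and approximate translational algorithms of [8].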
(5) Docking via features: As mentioned earlier, from a geometric perspective, protein docking can be considered as the problem of searching for configurations with maximum complementarity between two molecular surfaces. Our goal is to efficiently compute a small set of potentially good docking configurations based on the geometry of the two structures. Given such a set, more sophisticated procedures can then be performed on each of its members independently to locate the "real" docking configuration. To find such a set, we align cavities from one protein with protrusions from the other; these "meaningful" features are captured by the elevation function we have designed. Our approach can compute important matching positions while inspecting many fewer configurations than exhaustive search or earlier geometric-hashing approaches.
Part I: Shape Description
Chapter 2
Writhing Number
2.1 Introduction
The writhing number is an attempt to capture the physical phenomenon that a cord tends to form loops and coils when it is twisted. We model the cord by a knot, which we define to
be an oriented closed curve in three-dimensional space. We consider its two-dimensional family of parallel projections. In each projection, we count +1 or -1 for each crossing, depending on whether the overpass requires a counterclockwise or a clockwise rotation (an angle between 0 and π) to align with the underpass. The writhing number is then the signed number of crossings averaged over all parallel projections. It is a conformal invariant of the knot and useful as a measure of its global geometry.
The writhing number attracted much attention after the relationship between the linking number of a closed ribbon and the writhing number of its axis, expressed by the White formula, was formally discovered independently by Călugăreanu [40], Fuller [88], Pohl [142], and White [168]:

    Lk = Tw + Wr.        (2.1)

Here the linking number, Lk, is half the signed number of crossings between the two boundary curves of the ribbon, and the twisting number, Tw, is half the average signed number of local crossings between the two curves. The non-local crossings between the two curves correspond to crossings of the ribbon axis, which are counted by the writhing number, Wr. The linking number is a topological invariant, while the twisting number and the writhing number are not. A small subset of the mathematical literature on the subject can be found in [20, 79].
Besides the mathematical interest, the White formula and the writhing number have received attention both in physics and in biochemistry [70, 117, 128, 157]. For example, they are relevant in understanding the various geometric conformations we find for circular DNA in solution, as illustrated in Figure 2.1, taken from [37].

Figure 2.1: Circular DNA takes on different supercoiling conformations in solution.

By representing DNA as a ribbon, the writhing number of its axis measures the amount of supercoiling, which characterizes some of the DNA's chemical and biological properties [30].
As another example, the writhing number and some of its variants have also been applied to protein backbones, modeled as open curves, as shape descriptors to classify protein structures [128, 145]. The intuition behind such approaches follows from the fact that the writhing number of a space curve measures the relative position of any two points on the curve and the relative orientation of the tangents at those points. (This view will become clearer after we introduce Equation 2.3 in the next section.) When extended to a polygonal curve, this means that the writhing number measures the relative position and orientation of any two edges of the curve. Hence two protein backbones with similar arrangements of secondary structure elements produce similar writhing numbers.
This chapter studies algorithms for computing the writhing number of a polygonal knot. Section 2.2 introduces background work and states our results. Section 2.3 relates the writhing number of a knot with the winding number of its Gauss map. Section 2.4 shows how to compute the writhing number in time less than quadratic in the number of edges of the knot. Section 2.5 discusses a simpler sweep-line algorithm and presents initial experimental results.
2.2 Prior and New Work
In this section, we formally define the writhing number of a knot and review prior al- gorithms used to compute or approximate that number. We conclude by presenting our
results.
Definitions. A knot K is a continuous injection of the circle into ℝ³ or, equivalently, an oriented closed curve embedded in ℝ³. We use the two-dimensional sphere of directions, S², to represent the family of parallel projections in ℝ³. Given a knot K and a direction z ∈ S², the projection of K is an oriented, possibly self-intersecting, closed curve in a plane normal to z. We assume z to be generic, that is, each crossing of K in the direction z is simple and identifies two oriented intervals along K, of which the one closer to the viewer is the overpass and the other is the underpass. We count the crossing as +1 if we can align the two orientations by rotating the overpass in counterclockwise order by an angle between 0 and π. Similarly, we count the crossing as -1 if the necessary rotation is in clockwise order. Both cases are illustrated in Figure 2.2.

Figure 2.2: The two types of crossings when two oriented intervals intersect.

The Tait or directional writhing number of K in the direction z, denoted DWr(z), is the sum of crossings counted as +1 or -1 as explained. The writhing number is the averaged directional writhing number, taken over all directions z ∈ S²:

    Wr = (1/4π) ∫_{z ∈ S²} DWr(z) dz.        (2.2)

We note that a crossing in the projection along z also exists in the opposite direction, along -z, and that it has the same sign. Hence DWr(z) = DWr(-z), which implies that the writhing number can be obtained by averaging the directional writhing number over all points of the projective plane or, equivalently, over all antipodal point pairs {z, -z} of the sphere.
Computing the writhing number. Several approaches to computing the writhing number
of a smooth knot exactly or approximately have been developed. Consider an arc-length
parameterization γ: [0, 1] → ℝ³ of the knot, and use γ(s) and γ'(s) to denote the position and the unit tangent vectors for s ∈ [0, 1]. The following double integral formula for the writhing number can be found in [142, 159]:

    Wr = (1/4π) ∫₀¹ ∫₀¹ ((γ'(s) × γ'(t)) · (γ(s) − γ(t))) / ‖γ(s) − γ(t)‖³ ds dt.        (2.3)
If the smooth knot is approximated by a polygonal knot, we can turn the right hand side of (2.3) into a double sum and approximate the writhing number of the smooth knot [33, 128]. This can also be done in a way so that the double sum gives the exact writhing number of the polygonal knot [28, 121, 166].
Alternatively, we may base the computation of the writhing number on the directional
version of the White formula, Lk = DWr(z) + DTw(z), for z ∈ S². Recall that both the linking number and the twisting number are defined over the two boundary curves of a closed ribbon. Similar to the definition of DWr(z), the directional twisting number, DTw(z), is defined as half the sum of crossings between the two curves, each counted as +1 or -1 as described in Figure 2.2. We get (2.1) by integrating over S² and noting that the linking number does not depend on the direction. This implies

    Wr = Lk − Tw = DWr(z) + DTw(z) − (1/4π) ∫_{x ∈ S²} DTw(x) dx.        (2.4)

To compute the directional and the (average directional) twisting numbers, we expand the knot to a ribbon, which amounts to constructing a second knot that runs alongside but is disjoint from the first. Expressions for these numbers that depend on how we construct this second knot can be found in [121]. Le Bret [35] suggests to fix a direction z and define the second knot such that, in the projection along z, it always runs to the left of the first knot. In this case we have DTw(z) = 0, and the writhing number is the directional writhing number for z minus the average twisting number.
A third approach to computing the writhing number is based on a result by Cimasoni [62], which states that the writhing number is the directional writhing number for a fixed direction z, plus the average deviation of the other directional writhing numbers from DWr(z). By observing that DWr(x) is the same for all directions x in a cell of the decomposition of S² formed by the Gauss maps T and -T (also referred to as the tangent indicatrix or tantrix in the literature [56, 154]), we get

    Wr = DWr(z) + (1/4π) Σ_R (DWr(R) − DWr(z)) · Area(R),        (2.5)

where DWr(R) is DWr(x) for any one point x in the interior of the cell R, and Area(R) is the area of R. If applied to a polygonal knot, all three algorithms take time that is at least proportional to the square of the number of edges in the worst case.
Our results. We present two new results. The first result can be viewed as a variation of
(2.4) and a stronger version of (2.5). For a direction x ∈ S² not on T and not on -T, let w(x) be its winding number with respect to T and -T. As explained in Section 2.3, this means that T and -T wind w(x) times around x.

THEOREM A. For a knot K and a direction z, we have

    Wr = DWr(z) − w(z) + (1/4π) ∫_{x ∈ S²} w(x) dx.

Observe the similarity of this formula with (2.4), which suggests that the winding number can be interpreted as the directional twisting number for a ribbon one of whose two boundary curves is K. We will prove Theorem A in Section 2.3. We will also extend the relation in Theorem A to open knots and give an algorithm that computes the average winding number in time proportional to the number of edges. Our second result is an algorithm that computes the directional writhing number of a polygonal knot in time sub-quadratic in the number of edges.

THEOREM B. Given a polygonal knot K with n edges and a direction z ∈ S², DWr(z) can be computed in time O(n^{4/3+ε}), where ε is an arbitrarily small positive constant.

Figure 2.3: A knot whose directional writhing number is quadratic in the number of edges.

Theorems A and B imply that the writhing number of a polygonal knot can be computed in time O(n^{4/3+ε}). As shown in Figure 2.3, the number of crossings in a projection can be as large as quadratic in n. The sub-quadratic running time is achieved because the algorithm avoids checking each crossing explicitly. We also present a simpler sweep-line algorithm that checks each crossing individually and therefore does not achieve the worst-case running time of the algorithm in Theorem B. It is, however, fast when there are few crossings.
2.3 Writhing and Winding
In this section, we develop our geometric understanding of the relationship between the writhing number of a knot and the winding number of its Gauss map. We define the Gauss
map as the curve of critical directions, prove Theorem A, and give a fast algorithm for computing the average winding number.
2.3.1 Closed knots
Critical directions. We specify a polygonal knot by the cyclic sequence of its vertices,
p₀, p₁, …, p_{n−1}, in ℝ³. We use indices modulo n and write tᵢ = (p_{i+1} − pᵢ)/‖p_{i+1} − pᵢ‖ for the unit vector along the edge pᵢp_{i+1}. Note that tᵢ is also a direction in ℝ³ and a point in S². Any two consecutive points tᵢ and t_{i+1} determine a unique arc, which, by definition, is the shorter piece of the great circle that connects them. The cyclic sequence t₀, t₁, …, t_{n−1} thus defines an oriented closed curve T in S². We also need the antipodal curve, -T, which is the central reflection of T through the origin.
Figure 2.4: In all three cases, the viewing direction slides from left to right over the oriented great circle of directions defined by the hollow vertex and the solid edge. The directional writhing
number changes only in the third case, where we lose a positive crossing.
The directions on T and -T are critical, in the sense that the directional writhing number changes when we pass through them along a generic path in S², and these are the only critical directions [62]. We sketch the proof of this claim for the polygonal case. It is clear that a direction x ∈ S² is critical only if it is parallel to a line that passes through a vertex pⱼ and a point on an edge pᵢp_{i+1} of the knot not adjacent to pⱼ. There are n(n − 2) such vertex-edge pairs, each defining a great circle in S². First, we note that only 2n of these great circles actually carry critical points, namely the great circles that correspond to j = i − 1 and to j = i + 2. The reason for this is shown in Figure 2.4, where we see that the directional writhing number does not change unless pⱼ is separated from the edge pᵢp_{i+1} by only one edge along the knot. Second, for these remaining vertex-edge pairs, one verifies that the subsets of directions along which pⱼ projects onto the edge pᵢp_{i+1}, together with their antipodal copies, concatenate precisely to the arc tᵢt_{i+1} of T and its antipodal arc of -T. It follows that T and -T indeed comprise all critical directions.
Decomposition. The curves T and -T are both oriented, which is essential. We say a direction x ∈ S² lies to the left of an oriented arc a if it lies in the open hemisphere to the left of the oriented great circle that contains a. Equivalently, x sees that great circle oriented in counterclockwise order. If x passes from the left of an arc a of T to its right, then we either lose a positive crossing (as in the third row of Figure 2.4), or we pick up a negative crossing. Either way, the directional writhing number decreases by one. This motion corresponds to -x passing from the right of the arc -a of -T to its left. Since the directional writhing numbers at x and -x are the same, we decrease the directional writhing number by one in the opposite view as well. In other words, if x moves from the left of an arc of -T to its right, then the effect on the directional writhing number is the opposite of what it is for an arc of T. These simple rules allow us to keep track of the directional writhing number while moving around in S². The curves T and -T decompose S² into cells within which the directional writhing number is invariant. We can thus rewrite (2.2) as

    Wr = (1/4π) Σ_R DWr(R) · Area(R),

where the sum ranges over all cells R of the decomposition, and DWr(R) is the directional writhing number of any one point in the interior of R. Equation (2.5) of Cimasoni can now be obtained by subtracting DWr(z) inside the sum and adding it outside the sum.
This reformulation provides an algorithm for computing the writhing number.

Step 1. Compute DWr(z) for an arbitrary but fixed direction z.

Step 2. Construct the decomposition of S² into cells, label each cell R with DWr(R) − DWr(z), and form the sum as in (2.5).

The running time for Step 2 is quadratic in n in the worst case, as there can be quadratically many cells. We improve the running time to O(n) and, at the same time, simplify the algorithm.
First we prove Theorem A.
Winding numbers. We now introduce a function w over S² that may be different from DWr but changes in the same way. In other words, DWr(x) − w(x) = DWr(y) − w(y) for all x, y ∈ S² away from the critical curves. This function is the winding number of a point x ∈ S² with respect to the two curves T and -T, assuming they do not contain x. Observe that the space obtained by removing two points from the two-dimensional sphere is topologically an annulus. We fix non-critical, antipodal directions z and -z, and define w(x) equal to the number of times T winds around the annulus obtained by removing x and z, plus the number of times -T winds around the annulus obtained by removing x and -z. This is illustrated in Figure 2.5. Here we count the winding of T in counterclockwise order, as seen from z, as positive, and winding in clockwise order as negative. Symmetrically, we count the winding of -T in clockwise order, as seen from z, as positive, and winding in counterclockwise order as negative.

Figure 2.5: The winding number counts the number of times T separates x from z and -T separates x from -z.

Imagine moving a point t along T and connecting t to z with a circular arc. Specifically, we use the circle that passes through z, -z, and t, and the arc with endpoints t and z that avoids -z. Symmetrically, we move a point t' along -T and connect t' to -z with the appropriate arc of the circle passing through z, -z, and t'. Locally at x we observe continuous movements of the two arcs. Clockwise and counterclockwise movements cancel, and w(x) is the number of times the first arc rotates in counterclockwise order around x, plus the number of times the second arc rotates in clockwise order around x. The winding number of x is always an integer but can be negative.

Observe that w indeed changes in the same way as DWr does. Specifically, w drops by 1 if x crosses T from left to right, and it increases by 1 if x crosses -T from left to right.
Starting from the definition (2.2) of the writhing number, we thus get
    Wr = (1/4π) ∫_{x ∈ S²} DWr(x) dx
       = (1/4π) ∫_{x ∈ S²} (DWr(z) − w(z) + w(x)) dx
       = DWr(z) − w(z) + (1/4π) ∫_{x ∈ S²} w(x) dx,

which completes the proof of Theorem A.
Signed area modulo 2. Observe that the writhing number changes continuously under deformations of the knot, as long as the knot does not pass through itself. When the knot performs a small motion during which it passes through itself, there is a jump of ±2 in Wr, while the average winding number changes only slightly. We use these observations to give a new proof of Fuller's relation [13, 89],

    Wr ≡ Area(T)/2π − 1   (mod 2),        (2.6)

where Area(T) is the signed area enclosed by the curve T in S². Note first that Wr and the average winding number have the same fractional part, because both DWr(z) and w(z) in Theorem A are integers. We start with a circle in the plane, in which case (2.6) holds because Wr = 0 and Area(T) = 2π. Other than continuous changes, we observe jumps of ±2 in Wr when the knot passes through itself. Theorem A, together with the fact that the fractional parts of Wr and Area(T)/2π are the same, implies that (2.6) is maintained during the deformation. Fuller's relation follows because every knot can be obtained from the circle by continuous deformation.
Computing the average winding number. Three generic points a, b, c ∈ S² define three arcs, which bound the spherical triangle abc. Recall that the area of abc is the sum of its three angles minus π. We define the signed area of abc as φ = ∠a + ∠b + ∠c − π if c lies to the left of the oriented arc ab, and as φ = −(∠a + ∠b + ∠c − π) if it lies to the right. Let z ∈ S² be a non-critical direction. As shown in Figure 2.6, every arc tᵢt_{i+1} of T forms a unique spherical triangle Δᵢ = z tᵢ t_{i+1}. Let φᵢ be its signed area. The corresponding arc of -T forms the antipodal spherical triangle -Δᵢ, with signed area −φᵢ.

Figure 2.6: The two spherical triangles defined by an arc of T and its antipodal arc of -T.

The winding number of a direction x ∈ S² can be obtained by counting the spherical triangles that contain it. To be more specific, we call a spherical triangle positive if its signed area is positive and negative if its signed area is negative. Let P(x) and N(x) be the numbers of positive and negative spherical triangles Δᵢ that contain x, and similarly let P'(x) and N'(x) be the numbers of positive and negative spherical triangles -Δᵢ that contain x. Then

    w(x) = (P(x) − N(x)) − (P'(x) − N'(x)).

To see this, note that the equation is correct for a point x near z and remains correct as x moves around and crosses arcs of T and of -T. The average winding number is thus

    (1/4π) ∫_{x ∈ S²} w(x) dx = (1/4π) (Σᵢ φᵢ − Σᵢ (−φᵢ)) = (1/2π) Σ_{i=0}^{n−1} φᵢ.

Computing the sum in this equation is straightforward and takes only O(n) time.
2.3.2 Open knots

We define an open knot as a continuous injection K: [0, 1] → ℝ³. Equivalently, it is an oriented curve, embedded in ℝ³, with two endpoints. The directional writhing number of K is well-defined, and the writhing number is the directional writhing number averaged over all parallel projections, as before. Assume K is a polygon specified by the sequence of its vertices, x₀, x₁, ..., x_{n−1}, and let K′ be the closed knot obtained by adding the edge x_{n−1}x₀. The critical directions of K differ in two ways from those of K′:

(i) there are critical directions of K′ that are not critical for K, namely the ones whose definition includes a point of the edge x_{n−1}x₀;

(ii) there are new critical directions, namely those defined by an endpoint (x₀ or x_{n−1}) and another point of the polygon not on the two adjacent edges.

To see that the directions in (ii) are indeed critical for K, examine the first two rows of Figure 2.4. The hollow vertex is now an endpoint of K, so we remove one of the two dashed edges. Because of this change, the directional writhing number changes at the moment the hollow vertex passes over the solid edge. Changing the critical curve of K′ to the critical curve of K can thus be achieved by removing the arcs of Case (i) and adding the arcs of Case (ii). We illustrate this process in Figure 2.7.

Figure 2.7: The critical curves of the closed knot K′ are marked by hollow vertices, and the additions required for the critical curves of the open knot K are marked by solid black vertices.
To describe the process, we label the vertices of the critical curves as in Figure 2.7: the u_i and v_i are the vertices contributed by the endpoints x₀ and x_{n−1}, and the w_i are the vertices contributed by the edge x_{n−1}x₀. We get the critical curve of K from that of K′ by

1. removing the partial arcs and complete arcs of Case (i), namely those defined by interior points of the edge x_{n−1}x₀, and

2. adding the new paths of Case (ii), namely those defined by the endpoint x₀ or x_{n−1} together with the other vertices of the polygon.

Note that Step 2 adds a piece of the Gauss map to the new critical curve. Symmetrically, we obtain the antipodal critical curve. Everything we said earlier about the winding number of the critical curve of a closed knot applies equally well to the critical curve of an open knot. Similarly, all algorithms described in the subsequent sections apply to closed knots as well as to open knots.
2.4 Computing Directional Writhing

In this section, we present an algorithm that computes the directional writhing number of a polygonal knot with n edges in time roughly proportional to n^{4/3}. The algorithm uses complicated subroutines that may not lend themselves to an easy implementation.

Reduction to five dimensions. Assume without loss of generality that we view the knot from above, that is, along the direction z of the third coordinate axis. Each edge e_i of K is oriented. Another edge e_j that crosses e_i in the projection either passes above or below e_i, and it either passes from left to right or from right to left. The four cases are illustrated in Figure 2.8 and classified as positive and negative crossings according to Figure 2.2.

Figure 2.8: The four ways an oriented edge can cross another; the signs of the crossings are +1, −1, −1, +1.

Letting p_i and n_i be the numbers of edges that form positive and negative crossings with e_i, the directional writhing number is

    DWr(K, z) = (1/2) Σ_i (p_i − n_i),

where the factor 1/2 compensates for counting every crossing once from each of its two edges.
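A direct O(n²) computation of DWr(K, z) — checking every pair of non-adjacent edges of the projection — makes the sign classification of Figure 2.8 concrete. The sketch below is illustrative only (the algorithm developed in this section avoids enumerating all pairs), and the particular convention for the sign of a crossing is our assumption for the example:

```python
def dwr_z(poly):
    """Directional writhing number of the closed 3D polygon `poly`, viewed
    along the z-axis: the sum of crossing signs over all pairs of
    non-adjacent edges in the xy-projection (brute-force O(n^2) check)."""
    def cross2(o, a, b):
        return (a[0]-o[0])*(b[1]-o[1]) - (a[1]-o[1])*(b[0]-o[0])
    n, total = len(poly), 0
    for i in range(n):
        for j in range(i + 1, n):
            if (i + 1) % n == j or (j + 1) % n == i:
                continue                          # skip adjacent edges
            a, b = poly[i], poly[(i + 1) % n]
            c, d = poly[j], poly[(j + 1) % n]
            d1, d2 = cross2(c, d, a), cross2(c, d, b)   # sides of a, b w.r.t. cd
            d3, d4 = cross2(a, b, c), cross2(a, b, d)   # sides of c, d w.r.t. ab
            if (d1 > 0) == (d2 > 0) or (d3 > 0) == (d4 > 0):
                continue                          # projections do not cross
            s, u = d1 / (d1 - d2), d3 / (d3 - d4)  # crossing parameters
            z1 = a[2] + s * (b[2] - a[2])          # height of edge ab there
            z2 = c[2] + u * (d[2] - c[2])          # height of edge cd there
            over, under = ((a, b), (c, d)) if z1 > z2 else ((c, d), (a, b))
            do = (over[1][0] - over[0][0], over[1][1] - over[0][1])
            du = (under[1][0] - under[0][0], under[1][1] - under[0][1])
            total += 1 if do[0]*du[1] - do[1]*du[0] > 0 else -1
    return total
```

On a four-vertex "bowtie" polygon whose projection has exactly one crossing, mirroring the polygon in the xy-plane exchanges the over- and under-strand and therefore flips the sign of the result, as it must.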
To compute the sums of the p_i and n_i efficiently, we map edges in ℝ³ to points and half-spaces in ℝ⁵. Specifically, let ℓ_i be the oriented line that contains the oriented edge e_i, and use Plücker coordinates, as explained in [52], to map ℓ_i to a point π_i or, alternatively, to a half-space h_i in ℝ⁵. The mapping has the property that ℓ_i and ℓ_j have positive relative orientation if and only if π_i lies in the interior of h_j. We use this correspondence to compute the p_i in two stages: first we collect the pairs of oriented lines with positive relative orientation, and second we count among them the pairs of edges whose projections cross.
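The primitive behind this mapping is constant-time: the sign of the reciprocal product of two lines' Plücker coordinates is their relative orientation, and it vanishes exactly when the lines are coplanar. A minimal sketch (the function names and conventions are ours, not the thesis's):

```python
def plucker(p, q):
    # Plücker coordinates of the oriented line through p and then q:
    # the direction d = q - p and the moment m = p x q.
    d = tuple(qi - pi for pi, qi in zip(p, q))
    m = (p[1]*q[2] - p[2]*q[1], p[2]*q[0] - p[0]*q[2], p[0]*q[1] - p[1]*q[0])
    return d, m

def reciprocal_product(l1, l2):
    # Sign > 0 or < 0 distinguishes the two relative orientations of skew
    # oriented lines; the product is zero iff the lines are coplanar.
    (d1, m1), (d2, m2) = l1, l2
    return sum(d1[i]*m2[i] + d2[i]*m1[i] for i in range(3))
```

Reading the six numbers (d, m) as the point π_i, and the locus where the reciprocal product with a fixed line is positive as the half-space h_j, turns this sign test into the point-in-half-space test used above.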
Recursive algorithm. It is convenient to explain the algorithm in a slightly more general setting, where E and F are sets of m and n oriented edges in ℝ³. Let X(E, F) denote the number of pairs in E × F that form positive crossings, and note that X(E, E) = Σ_i p_i if E is the set of edges of the knot. We map E to a set of m points and F to a set of n half-spaces in ℝ⁵. Let r be a sufficiently large constant. A (1/r)-cutting is a collection of pairwise disjoint simplices covering ℝ⁵ such that each simplex intersects at most n/r of the hyperplanes bounding the half-spaces. We use the algorithm in [10] to compute a (1/r)-cutting consisting of t simplices in time O(m + n), where t is at most r⁵ times a constant independent of r. For each simplex Δ in the cutting, define

    E_Δ = { e ∈ E : the point of e lies in Δ },
    F_Δ = { f ∈ F : Δ lies in the interior of the half-space of f },
    F′_Δ = { f ∈ F : the bounding hyperplane of f intersects Δ }.

Letting m_Δ = |E_Δ| and n_Δ = |F′_Δ|, we have Σ_Δ m_Δ = m and n_Δ ≤ n/r. By construction, every pair in E_Δ × F_Δ defines a pair of lines with positive relative orientation. For each simplex Δ, we count the pairs in E_Δ × F_Δ whose edges form positive crossings, and let X_Δ be the number of such pairs. Then

    X(E, F) = Σ_Δ X_Δ + Σ_Δ X(E_Δ, F′_Δ).

Note that X_Δ is the number of crossings between the projections of the line segments in E_Δ and in F_Δ. We can therefore use the algorithm in [51] to compute all numbers X_Δ, for all simplices Δ in the cutting, in time roughly the 4/3 power of the total subproblem size. We recurse to compute the X(E_Δ, F′_Δ) and stop the recursion when the subproblems reach constant size. The running time of this algorithm is at most

    T(m, n) ≤ Σ_Δ T(m_Δ, n/r) + O((m + n)^{4/3+ε}) = O((m + n)^{4/3+ε})

for any ε > 0, provided r is sufficiently large.
Improving the running time. We improve the running time of the algorithm by taking advantage of the symmetry of the mapping to ℝ⁵. Specifically, the point π_i lies in the interior of the half-space h_j if and only if the point π_j lies in the interior of the half-space h_i. We proceed as above, but switch the roles of points and half-spaces when m becomes smaller than n. That is, if m < n, then we map the edges in E to half-spaces and the edges in F to points, so that the cutting is always constructed for the larger of the two sets. By our above analysis, this balances the recurrence, and the overall running time is less than

    T(m, n) = O(m^{2/3+ε} n^{2/3+ε} + m^{1+ε} + n^{1+ε}),

where the constant of proportionality is positive and ε is any real larger than zero. It follows that Σ_i p_i can be computed in time O(n^{4/3+ε}), for any constant ε > 0. Similarly, Σ_i n_i, and therefore the directional writhing number, can be computed within the same time bound, thereby proving Theorem B.

We remark that the technique described in this section can also be used to compute the linking number between two polygonal knots with m and n edges in time O(m^{2/3+ε} n^{2/3+ε} + m^{1+ε} + n^{1+ε}).
2.5 Experiments
In this section, we sketch a sweep-line algorithm that computes the writhing number of a polygonal knot using Theorem A. We implemented the algorithm in C++ using the LEDA
software library and compared it with two versions of the algorithm based on the double integral in (2.3). We did not implement any version of Le Bret's algorithm mentioned in Section 2.2 since it is based on a formula similar to Theorem A and can be expected to perform about the same as our sweep-line algorithm, and since it only works for closed knots.
2.5.1 Algorithms
Sweep-line algorithm. Theorem A expresses the writhing number of a knot as the sum of three terms. Accordingly, we compute the writhing number in three steps.
Step 1. Compute the directional writhing number DWr(K, z) for an arbitrary but fixed, non-critical direction z.

Step 2. Compute the winding number w(z) of z relative to the Gauss maps γ and −γ.

Step 3. Compute the average winding number w̄ by summing the signed areas of the spherical triangles z t_i t_{i+1} and their antipodal counterparts.

Return DWr(K, z) + w̄ − w(z).

Instead of using the algorithm described in Section 2.4, we implemented Step 1 using a sweep-line algorithm [71], which reports the k crossing pairs formed by the n edges in time O((n + k) log n). Steps 2 and 3 are both computed in a single traversal of the spherical polygons γ and −γ, keeping track of the accumulated angle and the signed area as we go. The running time of the traversal is only O(n).
Double-sum algorithm. We compare the implementation of the sweep-line algorithm with two implementations of (2.3). Write e_i = x_{i+1} − x_i for the unnormalized tangent vector of the i-th edge, and write c_i for the midpoint of that edge. Following [33, 128], we discretize (2.3) to

    W₁ = (1/4π) Σ_i Σ_{j ≠ i} ((e_i × e_j) · (c_i − c_j)) / |c_i − c_j|³.   (2.7)

We note that W₁ is not the writhing number of the polygonal knot, but it converges to the writhing number of a smooth knot if the polygonal approximation is progressively refined to approach that knot [43].
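A direct implementation of (2.7) takes only a few lines; the quadratic pair loop is exactly what makes the double-sum algorithms slow. The sketch below is our illustration of the discretization (with edge midpoints as evaluation points, and the symmetry of the summand used to loop over i < j and double):

```python
import math

def writhe_double_sum(pts):
    """Approximate writhing number of the closed polygon `pts` via the
    discretized Gauss double integral (2.7), evaluated at edge midpoints."""
    n = len(pts)
    edges, mids = [], []
    for i in range(n):
        a, b = pts[i], pts[(i + 1) % n]
        edges.append((b[0]-a[0], b[1]-a[1], b[2]-a[2]))
        mids.append(((a[0]+b[0])/2, (a[1]+b[1])/2, (a[2]+b[2])/2))
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            ei, ej = edges[i], edges[j]
            r = (mids[i][0]-mids[j][0], mids[i][1]-mids[j][1], mids[i][2]-mids[j][2])
            c = (ei[1]*ej[2]-ei[2]*ej[1], ei[2]*ej[0]-ei[0]*ej[2], ei[0]*ej[1]-ei[1]*ej[0])
            d = math.sqrt(r[0]**2 + r[1]**2 + r[2]**2) ** 3
            total += 2 * (c[0]*r[0] + c[1]*r[1] + c[2]*r[2]) / d  # each pair twice
    return total / (4 * math.pi)
```

Two cheap sanity checks follow from the geometry: a planar polygon has writhing number zero term by term, and mirroring a curve in a plane negates its writhing number.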
Alternatively, we may discretize the double integral in such a way that the result is the writhing number of the approximating polygonal knot. Given two edges e_i and e_j, we measure the area of the two antipodal quadrangles in the sphere of directions along which we see the edges cross. The area of one of the quadrangles is the sum of its angles minus one full angle, 2π. The absolute values of the signed areas Ω_{ij} of the two quadrangles are the same, and the common sign depends on whether we see a positive or a negative crossing. We thus have

    W₂ = (1/4π) Σ_i Σ_{j ≠ i} Ω_{ij}.   (2.8)

Straightforward vector geometry and trigonometry can be used to derive analytical formulas for the Ω_{ij} [28, 121].
2.5.2 Comparison
We compare the three implementations using a sequence of polygonal approximations of an artificially created smooth knot. It has the form of the infinity symbol, ∞, and is fairly flat in ℝ³, with only a small gap in the middle. Because the knots are fairly flat, most of their parallel projections have one crossing, and the writhing number is just a little smaller than 1. Figure 2.9 shows that the algorithms that compute the exact writhing numbers for polygonal approximations converge faster to the writhing number of the smooth knot than the algorithm implementing (2.7). Figure 2.10 shows how much faster the sweep-line algorithm is than both implementations of the double-sum algorithm. Let n be the number of edges. The graphs suggest that the running time of the sweep-line algorithm is O(n) and that the running times of the two implementations of the double-sum algorithm are quadratic in n.
Figure 2.9: Comparing convergence rates of W₁ (upper curve) and W₂ (lower curve). For each tested approximation of the ∞-knot, we draw the number of vertices along the horizontal axis and the writhing number along the vertical axis.

Figure 2.10: Comparing the running times of the sweep-line algorithm (lower curve) and the two implementations of the double-sum algorithm: approximate (middle curve) and exact (upper curve). The horizontal and vertical axes represent the number of vertices in the curve and the running time of the algorithm, respectively.

We observe the linear bound whenever we approximate a smooth knot by a polygon, since for generic projections the number of crossings, as well as the number of edges simultaneously intersected by the sweep-line, is independent of the total number of edges.
Protein backbones. We present some preliminary experimental results obtained with the three implementations. All experiments were carried out on a SUN workstation with a 333 MHz UltraSPARC-IIi CPU and 256 MB of memory. Short of conformation data for long DNA strands, we decided to run our algorithms on a modest collection of open knots representing protein backbones, downloaded from the protein data bank [2]. We modified the algorithms to account for the missing edge in the data, as explained in Section 2.3. Figure 2.11 displays the four backbones chosen for our experimental study. Table 2.1 presents some of our findings.

Figure 2.11: The open knots modeling the backbone of the protein conformations stored in the PDB files 1AUS.pdb (upper left), 1CDK.pdb (upper right), 1CJA.pdb (lower left), and 1EQZ.pdb (lower right).
Thick knots. Even though the writhing number of a polygonal knot can be as large as quadratic in the number of edges, all four protein backbones in Figure 2.11 have writhing numbers that are significantly smaller than the numbers of edges. If a knot is made out of rope with non-zero thickness, then the quadratic bound can be achieved only if the ratio of length over cross-section radius is sufficiently high. Specifically, the writhing number of a knot of length L with an embedded tubular neighborhood of radius r is less than a constant times (L/r)^{4/3} [44]. Such "thick" knots can be used to capture the fact that the edges of a protein backbone are about as long as they are thick. A backbone with n edges thus has writhing number at most some constant times n^{4/3}. Examples which show that the upper bound is asymptotically tight can be found in [38, 45, 72].

Data     n     k    t_swp   t_apx   t_ext     W1      Wr
1AUS    439   122   0.09    3.93    9.28    22.70   17.87
1CDK    343   111   0.06    2.39    5.62     7.96    6.01
1CJA    327   150   0.06    2.19    5.10    12.14   10.43
1EQZ    125    18   0.02    0.31    0.73     4.78    3.37

Table 2.1: Four protein backbones modeled by open polygonal knots. The size of the problem is measured by the number of edges, n, and by the number of crossings in the chosen projection, k. The times taken by the sweep-line (t_swp), the approximate double-sum (t_apx), and the exact double-sum (t_ext) algorithms are measured in seconds. W1 is the approximation (2.7) of the writhing number Wr of the polygonal data.
2.6 Notes and Discussion

In this chapter, we have developed an efficient algorithm to compute the writhing number of a space curve in ℝ³. A fast method is important because the writhing number of DNA strands is computed at each step in some molecular simulations. Beyond this computational aspect, it would be interesting to further investigate the concept and see whether there is a correlation between writhing numbers and the common classification of protein folds. As mentioned in Section 2.1, there has been some initial work in this direction [82, 145].

It seems that although the writhing number of a protein backbone describes the spatial arrangement of its secondary structure elements, it alone is not discriminative enough to classify protein structures. One major reason is that the writhing number is mainly effective in describing the global geometry of a given space curve. To address this problem, it might be necessary to consider backbones at a range of scale levels and compute the writhing number as a function of scale. Another possible approach is to combine the writhing number with other topological or geometric measures that describe different aspects, especially the local geometry, of protein structures.
Chapter 3

Backbone Simplification
3.1 Introduction
Protein structures are examined at different levels of detail in various applications. Simpler structures are exploited either when time complexity is too high or when excessive details might obscure crucial features or principles that one would like to observe. It is therefore desirable to build a level-of-detail (LOD) representation for protein structures. One way to achieve such a representation for protein backbones is via curve simplification.
Given a polygonal curve, the curve-simplification problem asks for another polygonal curve that approximates the original curve under a predefined error criterion and whose size is as small as possible. Beyond its potential use in simplifying protein structures, curve simplification is widely applied in numerous areas, such as geographic information systems (GIS), computer vision, computer graphics, and data compression. Simplification helps to remove unnecessary clutter due to excessive detail, to save the memory needed to store a curve, and to expedite the processing of a curve. For example, one of the main problems in computational cartography is to visualize geo-spatial information as a simple and easily readable map. To this end, curve simplification is used to represent rivers, road-lines, coast-lines, and other linear features at an appropriate level of detail when a map of a large area is being produced.
In this chapter, we study the curve-simplification problem under the so-called Fréchet error measure, and propose the first near-linear time algorithm to simplify curves in ℝᵈ with guaranteed quality. Below we first introduce the curve-simplification problem formally.
Problem definition. Let P = ⟨p₁, p₂, ..., p_n⟩ denote a polygonal curve in ℝᵈ with p₁, ..., p_n as its sequence of vertices. A polygonal curve P′ = ⟨p_{i_1}, p_{i_2}, ..., p_{i_k}⟩ simplifies P if 1 = i_1 < i_2 < ... < i_k = n. Given an error measure F and a pair of indices 1 ≤ i ≤ j ≤ n, let Δ_F(p_i p_j) denote the error of the segment p_i p_j with respect to P under the error measure F. Intuitively, Δ_F(p_i p_j) measures how well p_i p_j approximates the portion of P between p_i and p_j. The error of a simplification P′ = ⟨p_{i_1}, ..., p_{i_k}⟩ of P is defined as

    Δ_F(P′) = max_{1 ≤ l < k} Δ_F(p_{i_l} p_{i_{l+1}}).

We call P′ an ε-simplification of P, under the error measure F, if Δ_F(P′) ≤ ε. Let κ_F(ε, P) denote the minimum number of vertices in an ε-simplification of P under the error measure F. Given a polygonal curve P, an error measure F, and a parameter ε ≥ 0, the curve-simplification problem asks for computing an ε-simplification of P of size κ_F(ε, P).
We now define the error measure we study in this chapter. Let d(·, ·) be a distance function on ℝᵈ, e.g., the Euclidean distance, or the distance induced by the L₁ or L∞ norm. Given two curves f: [0, 1] → ℝᵈ and g: [0, 1] → ℝᵈ, the Fréchet distance between f and g under the metric d, denoted δ_F(f, g), is defined as

    δ_F(f, g) = inf_{α, β} max_{t ∈ [0,1]} d(f(α(t)), g(β(t))),   (3.1)

where α and β range over continuous and monotonically non-decreasing functions from [0, 1] onto [0, 1] with α(0) = β(0) = 0 and α(1) = β(1) = 1. If α and β realize the Fréchet distance, then the correspondence that maps each point f(α(t)) to the point g(β(t)) is called the Fréchet map from f to g.

For a pair of indices 1 ≤ i < j ≤ n, the Fréchet error of a segment p_i p_j is defined to be

    Δ_F(p_i p_j) = δ_F(π(p_i, p_j), p_i p_j),

where π(p_i, p_j) denotes the portion of P from p_i to p_j.
Most previous work has focused on the so-called Hausdorff error measure. We define it here as well, as we will compare simplification under the Fréchet and Hausdorff error measures later in this chapter. If we define the distance between a point p and a line segment e as d(p, e) = min_{q ∈ e} d(p, q), then the Hausdorff error under the metric d, also referred to as the d-Hausdorff error measure, is defined as

    Δ_H(p_i p_j) = max_{p ∈ π(p_i, p_j)} d(p, p_i p_j).

An ε-simplification under the Hausdorff error measure and the quantity κ_H(ε, P) are defined in analogy to the case of the Fréchet error measure.

If we remove the constraint that the vertices of P′ are a subset of the vertices of P, then P′ is called a weak ε-simplification of P. Let κ̃_F(ε, P) denote the minimum number of vertices in a weak ε-simplification of P under the Fréchet error measure.
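The continuous Fréchet distance in (3.1) is what this chapter uses. For intuition, its discrete analogue — the coupling distance of Eiter and Mannila, which only matches vertex sequences — can be computed by a simple dynamic program. The sketch below illustrates the distance notion only; it is not the algorithm of this chapter:

```python
import math

def discrete_frechet(P, Q):
    """Discrete Fréchet (coupling) distance between the vertex sequences P
    and Q: the smallest max-distance achievable by a monotone matching."""
    n, m = len(P), len(Q)
    ca = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            dij = math.dist(P[i], Q[j])
            if i == 0 and j == 0:
                ca[i][j] = dij
            elif i == 0:
                ca[i][j] = max(ca[i][j - 1], dij)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][j], dij)
            else:  # advance on P, on Q, or on both
                ca[i][j] = max(min(ca[i - 1][j], ca[i - 1][j - 1], ca[i][j - 1]), dij)
    return ca[n - 1][m - 1]
```

On two parallel three-vertex chains at vertical offset 1, the distance is exactly 1, because the optimal monotone matching pairs corresponding vertices.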
3.2 Prior and New Work

Previous work. The problem of approximating a polygonal curve P has been studied extensively during the last two decades; see [100, 167] for surveys. Imai and Iri [109] formulated the curve-simplification problem as computing a shortest path between two nodes in a directed acyclic graph G: each vertex p_i of P corresponds to a node in G, and there is an edge between two nodes p_i and p_j if Δ(p_i p_j) ≤ ε. A shortest path from p₁ to p_n in G corresponds to an optimal ε-simplification of P under the error measure Δ. In the plane, under the Hausdorff measure with the so-called uniform metric¹, their algorithm takes O(n² log n) time. Chin and Chan [50], and Melkman and O'Rourke [134], improve the running time of their algorithm to quadratic. Agarwal and Varadarajan [3] improve the

¹The uniform metric in ℝ² is defined as follows: given two points p = (p_x, p_y) and q = (q_x, q_y), d(p, q) = |p_y − q_y| if p_x = q_x, and d(p, q) = +∞ otherwise.
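The Imai–Iri formulation is easy to state in code: build the shortcut graph implicitly and run a breadth-first search, which finds a minimum-vertex path because all edges have unit weight. The sketch below is our own illustration, using the Hausdorff error of a shortcut under the Euclidean metric as a concrete stand-in for Δ; with an O(n)-time error test it runs in O(n³) time:

```python
import math
from collections import deque

def hausdorff_err(P, i, j):
    # Hausdorff error of the shortcut P[i]P[j]: the largest Euclidean
    # distance from an intermediate vertex to the segment P[i]P[j].
    (ax, ay), (bx, by) = P[i], P[j]
    dx, dy = bx - ax, by - ay
    L2 = dx * dx + dy * dy
    def dist(p):
        t = 0.0 if L2 == 0 else max(0.0, min(1.0, ((p[0]-ax)*dx + (p[1]-ay)*dy) / L2))
        return math.hypot(p[0] - (ax + t*dx), p[1] - (ay + t*dy))
    return max((dist(P[k]) for k in range(i + 1, j)), default=0.0)

def imai_iri(P, eps, err=hausdorff_err):
    # BFS in the DAG of shortcuts with error at most eps (eps >= 0, so the
    # last node is always reachable via the unit shortcuts p_i p_{i+1}).
    n = len(P)
    prev, seen = [None] * n, [False] * n
    seen[0] = True
    q = deque([0])
    while q and not seen[n - 1]:
        i = q.popleft()
        for j in range(i + 1, n):
            if not seen[j] and err(P, i, j) <= eps:
                seen[j], prev[j] = True, i
                q.append(j)
    path, j = [], n - 1
    while j is not None:
        path.append(j)
        j = prev[j]
    return [P[k] for k in reversed(path)]
```

The quadratic number of candidate shortcuts, each needing an error test, is exactly the bottleneck that the later improvements and the near-linear algorithm of this chapter attack.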
running time to O(n^{4/3+δ}) for the L₁- and uniform-Hausdorff error measures, for an arbitrarily small constant δ > 0, by implicitly representing the graph G. In dimensions higher than two, Barequet et al. [29] compute the optimal ε-simplification under the L₁- or L∞-Hausdorff error measure in quadratic time. For the L₂-Hausdorff error measure, an optimal simplification can be computed in near-quadratic time in ℝ³ and in subquadratic time, up to polylogarithmic factors, in higher dimensions.
Curve simplification using the Fréchet error measure was first proposed by Godau [92], who showed that κ_F(ε, P) ≤ κ̃_F(ε/7, P). Alt and Godau [16] also proposed an O(mn)-time algorithm to determine whether δ_F(f, g) ≤ ε for given polygonal curves f and g of size m and n, respectively, and for a given error value ε. Following the approach of Imai and Iri [109], an ε-simplification of P, under the Fréchet error measure, of size κ_F(ε, P) can be computed in O(n³) time.
The problem of developing a near-linear ε-simplification algorithm with guaranteed quality remains elusive. Among the several heuristics that have been proposed over the years, the most widely used is the Douglas-Peucker method [75] (together with its variants). Originally proposed for simplifying curves under the Hausdorff error measure, its worst-case running time is O(n²) in ℝᵈ. For ℝ², the running time was improved to O(n log n) by Snoeyink et al. [101]. However, the Douglas-Peucker heuristic does not offer any guarantee on the size of the simplified curve—it can return an ε-simplification of size Ω(n) even if κ(ε, P) = O(1).
Much work has been done on computing a weak ε-simplification of a polygonal curve P. Imai and Iri [108] give an optimal O(n)-time algorithm for finding an optimal weak ε-simplification (under the Hausdorff error measure) of an x-monotone curve in ℝ². As for weak ε-simplification of planar curves under the Fréchet distance, Guibas et al. [96] proposed an O(n log n)-time factor-2 approximation algorithm and an O(n²)-time exact algorithm. They also proposed linear-time algorithms for approximating some other variants of weak simplification.
The problem of simplifying curves becomes much harder when additional constraints such as topology preservation and non-intersection requirements are introduced. Given a set of non-intersecting curves, the problem of simplifying the curves optimally so that the simplified curves are also non-intersecting is NP-hard—in fact, it is hard to approximate within a factor of n^{1/5−δ}, for any δ > 0, where n is the total number of vertices of the curves [83]. Guibas et al. [96] show that the problem of computing an optimal non-intersecting simplification of a simple polygon is NP-hard, and that computing the optimal weak simplification of a set of non-intersecting curves is also NP-hard.
Our results. Let P be a polygonal curve in ℝᵈ, and let ε ≥ 0 be a parameter. In Section 3.3, we present a simple, near-linear algorithm for computing an ε-simplification of P of size at most κ_F(ε/2, P) under the Fréchet error measure.

Theorem 3.2.1 Let P be a polygonal curve in ℝᵈ with n vertices, and let ε ≥ 0 be a parameter. We can compute in O(n log n) time a simplification P′ of P of size at most κ_F(ε/2, P) so that Δ_F(P′) ≤ ε, assuming that the distance between points is measured in any L_p-metric.

To our knowledge, this is the first simple, near-linear approximation algorithm for curve simplification with guaranteed quality that extends to ℝᵈ for arbitrary curves. We illustrate its simplicity and efficiency by comparing its performance with the Douglas-Peucker and exact algorithms in Section 3.4. Our experimental results on various data sets show that our algorithm is efficient and produces ε-simplifications of near-optimal size.
Also in Section 3.3, we compare curve simplification under the Hausdorff and Fréchet error measures, and we show that κ_F(ε, P) ≤ κ̃_F(ε/4, P), thereby improving the result by Godau [92].
3.3 Fréchet Simplification

Let P = ⟨p₁, p₂, ..., p_n⟩ be a polygonal curve in ℝᵈ, and let ε ≥ 0 be a parameter. In this section, we first prove a few properties of the Fréchet error measure. We then present an approximation algorithm for simplification under the Fréchet error measure. At the end of this section, we compare Fréchet simplification with some other versions of simplification.

Let δ_F(·, ·) be as defined in (3.1), and let d(·, ·) denote the Euclidean distance between two points in ℝᵈ.
Lemma 3.3.1 Given two directed segments pq and p′q′ in ℝᵈ,

    δ_F(pq, p′q′) = max{ d(p, p′), d(q, q′) }.

PROOF. Let μ = max{d(p, p′), d(q, q′)}. First, δ_F(pq, p′q′) ≥ μ, since p (resp. q) has to be matched to p′ (resp. q′). Assume the natural parameterization f: [0, 1] → pq of the segment pq, such that f(t) = (1 − t)p + tq. Similarly, define g: [0, 1] → p′q′ for the segment p′q′, such that g(t) = (1 − t)p′ + tq′. For any two matched points f(t) and g(t),

    d(f(t), g(t)) = d((1 − t)p + tq, (1 − t)p′ + tq′) ≤ (1 − t) d(p, p′) + t d(q, q′),

since d is a convex function. Therefore d(f(t), g(t)) ≤ μ for any t ∈ [0, 1], and hence δ_F(pq, p′q′) ≤ μ.
Lemma 3.3.2 Given a polygonal curve π in ℝᵈ and two directed segments pq and p′q′,

    δ_F(π, pq) ≤ δ_F(π, p′q′) + δ_F(p′q′, pq).

PROOF. Assume that f: [0, 1] → pq is the natural parameterization of pq, and let g: [0, 1] → p′q′ be the natural parameterization of p′q′, as in the proof of Lemma 3.3.1. Let h: [0, 1] → π be a parameterization of the polygonal curve π such that h and g realize the Fréchet distance between π and p′q′. By the triangle inequality, d(h(t), f(t)) ≤ d(h(t), g(t)) + d(g(t), f(t)) for any t ∈ [0, 1], yielding the lemma.
Figure 3.1: The dashed curve is π(p_i, p_j); the vertices p_a and p_b are mapped to r_a and r_b on the segment p_i p_j, respectively.
Lemma 3.3.3 Let P = ⟨p₁, ..., p_n⟩ be a polygonal curve in ℝᵈ. For 1 ≤ i ≤ a ≤ b ≤ j ≤ n,

    Δ_F(p_a p_b) ≤ 2 Δ_F(p_i p_j).

PROOF. Let δ = Δ_F(p_i p_j) = δ_F(π(p_i, p_j), p_i p_j), and let φ be the Fréchet map from π(p_i, p_j) to the segment p_i p_j (see Section 3.1 for the definition). For the vertices p_a and p_b, set r_a = φ(p_a) and r_b = φ(p_b); see Figure 3.1 for an illustration. By definition, δ_F(π(p_a, p_b), r_a r_b) ≤ δ. In particular, d(p_a, r_a) ≤ δ and d(p_b, r_b) ≤ δ. By Lemma 3.3.1, δ_F(r_a r_b, p_a p_b) = max{d(r_a, p_a), d(r_b, p_b)} ≤ δ. It then follows from Lemma 3.3.2 that

    Δ_F(p_a p_b) ≤ δ_F(π(p_a, p_b), r_a r_b) + δ_F(r_a r_b, p_a p_b) ≤ 2δ.

3.3.1 Algorithm
Our simplification algorithm is a greedy approach (Figure 3.2). Suppose we have already added p_{i_1} = p₁, ..., p_{i_{l−1}} to P′, and write i = i_{l−1}. We then find an index ĵ ≥ i such that (i) Δ_F(p_i p_ĵ) ≤ ε and (ii) Δ_F(p_i p_{ĵ+1}) > ε. We set i_l = ĵ and add p_ĵ to P′. We repeat this process until we encounter p_n. We then add p_n to P′.
ALGORITHM GreedyFréchetSimp(P, ε)
Input: a polygonal curve P = ⟨p₁, ..., p_n⟩ and a parameter ε ≥ 0;
Output: a simplification P′ of P such that Δ_F(P′) ≤ ε.
begin
    i := 1; P′ := ⟨p₁⟩;
    while i < n do
        if Δ_F(p_i p_n) ≤ ε then
            i := n;
        else
            r := 1;                             (* exponential search *)
            while i + 2r ≤ n and Δ_F(p_i p_{i+2r}) ≤ ε do
                r := 2r;
            end while
            low := i + r; high := min(i + 2r, n);
            while high − low > 1 do             (* binary search *)
                mid := ⌊(low + high)/2⌋;
                if Δ_F(p_i p_mid) ≤ ε then low := mid else high := mid;
            end while
            i := low;
        end if
        append p_i to P′;
    end while
end

Figure 3.2: Computing an ε-simplification under the Fréchet error measure.
Alt and Godau [16] have developed an algorithm that, given a pair of indices (i, j), can determine in O(j − i) time whether Δ_F(p_i p_j) ≤ ε. Therefore, a first approach would be to add vertices greedily one by one, starting with the first vertex p₁, and testing each segment p_i p_j, for j = i + 1, i + 2, ..., by invoking the Alt-Godau algorithm, until we find the index ĵ. However, the overall algorithm could then take Ω(n²) time. To limit the number of times the Alt-Godau algorithm is invoked when computing the index ĵ, we proceed as follows.

First, by an exponential search, we determine an integer r ≥ 1 so that Δ_F(p_i p_{i+r}) ≤ ε and Δ_F(p_i p_{i+2r}) > ε. Next, by performing a binary search in the interval [i + r, i + 2r], we determine an integer ĵ such that Δ_F(p_i p_ĵ) ≤ ε and Δ_F(p_i p_{ĵ+1}) > ε. Note that in the worst case, the asymptotic costs of the exponential and the binary search are the same. See Figure 3.2 for the pseudo-code of this algorithm. Since computing the value of ĵ requires invoking the Alt-Godau algorithm O(log n) times, each with a pair (i, j) such that j − i = O(ĵ − i), the total time spent in computing ĵ is O((ĵ − i) log n). Hence, the overall running time of the algorithm is O(n log n).
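The search skeleton can be sketched in a few lines. In place of the Alt-Godau decision procedure we plug in a simple Hausdorff-style error oracle (the maximum distance of the skipped vertices to the shortcut segment); both the names and this stand-in oracle are our own assumptions, so the sketch illustrates the exponential-plus-binary search, not the guarantee of Theorem 3.2.1:

```python
import math

def seg_err(P, i, j):
    # Stand-in error oracle: max Euclidean distance from the vertices
    # strictly between i and j to the segment P[i]P[j].
    (ax, ay), (bx, by) = P[i], P[j]
    dx, dy = bx - ax, by - ay
    L2 = dx * dx + dy * dy
    def dist(p):
        t = 0.0 if L2 == 0 else max(0.0, min(1.0, ((p[0]-ax)*dx + (p[1]-ay)*dy) / L2))
        return math.hypot(p[0] - (ax + t*dx), p[1] - (ay + t*dy))
    return max((dist(P[k]) for k in range(i + 1, j)), default=0.0)

def greedy_simplify(P, eps, err=seg_err):
    # From the current vertex i, find the next kept vertex by exponential
    # search followed by binary search, exactly as in Figure 3.2.
    n = len(P)
    keep, i = [0], 0
    while i < n - 1:
        if err(P, i, n - 1) <= eps:
            i = n - 1                       # the final shortcut is good enough
        else:
            r = 1                           # exponential search
            while i + 2 * r < n and err(P, i, i + 2 * r) <= eps:
                r *= 2
            low, high = i + r, min(i + 2 * r, n - 1)
            while high - low > 1:           # binary search
                mid = (low + high) // 2
                if err(P, i, mid) <= eps:
                    low = mid
                else:
                    high = mid
            i = low
        keep.append(i)
    return [P[k] for k in keep]
```

Exactly as in the analysis above, each step spends O(log n) oracle calls on a jump of length ĵ − i, so with an O(j − i)-time oracle the skeleton runs in O(n log n) overall.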
Theorem 3.3.4 Given a polygonal curve P = ⟨p₁, ..., p_n⟩ in ℝᵈ and a parameter ε ≥ 0, we can compute in O(n log n) time an ε-simplification P′ of P under the Fréchet error measure so that |P′| ≤ κ_F(ε/2, P).
PROOF. Compute P′ = ⟨p_{i_1}, ..., p_{i_k}⟩ by the greedy algorithm described above. By construction, Δ_F(P′) ≤ ε, so it suffices to prove that k ≤ κ_F(ε/2, P). Let m = κ_F(ε/2, P), and let Q = ⟨p_{j_1}, ..., p_{j_m}⟩ be an optimal (ε/2)-simplification of P of size m.

We claim that i_l ≥ j_l for all l. This would imply that k ≤ m.

We prove the above claim by induction on l. For l = 1, the claim is obviously true because i_1 = j_1 = 1. Suppose i_{l−1} ≥ j_{l−1}. If i_l ≥ j_l, we are done. So assume that i_l < j_l; in particular, i_l < n, so the greedy algorithm determined i_l via the search and condition (ii) holds. Since Q is an (ε/2)-simplification, Δ_F(p_{j_{l−1}} p_{j_l}) ≤ ε/2. Lemma 3.3.3 implies that for all j_{l−1} ≤ a ≤ b ≤ j_l, Δ_F(p_a p_b) ≤ ε. In particular, since j_{l−1} ≤ i_{l−1} < i_l + 1 ≤ j_l, we have Δ_F(p_{i_{l−1}} p_{i_l+1}) ≤ ε. But by construction, Δ_F(p_{i_{l−1}} p_{i_l+1}) > ε, a contradiction. Therefore i_l ≥ j_l.
Remark. Our algorithm works within the same running time even if we measure the distance between two points in any L_p-metric.
3.3.2 Comparisons