Geometric and Topological Methods in Protein Structure Analysis

GEOMETRIC AND TOPOLOGICAL METHODS IN PROTEIN STRUCTURE ANALYSIS by Yusu Wang Department of Computer Science Duke University Date: Approved: Prof. Pankaj K. Agarwal, Supervisor Prof. Herbert Edelsbrunner, Co-advisor Prof. John Harer Prof. Johannes Rudolph Dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Department of Computer Science in the Graduate School of Duke University 2004 ABSTRACT GEOMETRIC AND TOPOLOGICAL METHODS IN PROTEIN STRUCTURE ANALYSIS by Yusu Wang Department of Computer Science Duke University Date: Approved: Prof. Pankaj K. Agarwal, Supervisor Prof. Herbert Edelsbrunner, Co-advisor Prof. John Harer Prof. Johannes Rudolph An abstract of a dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Department of Computer Science in the Graduate School of Duke University 2004 Abstract Biology provides some of the most important and complex scientific challenges of our time. With the recent success of the Human Genome Project, one of the main challenges in molecular biology is the determination and exploitation of the three-dimensional structure of proteins and their function. The ability for proteins to perform their numerous functions is made possible by the diversity of their three-dimensional structures, which are capable of highly specific molecular recognition. Hence, to attack the key problems involved, such as protein folding and docking, geometry and topology become important tools. Despite their essential roles, geometric and topological methods are relatively uncommon in computational biology, partly due to a number of modeling and algorithmic challenges. This thesis describes efficient computational methods for characterizing and comparing molecular structures by combining both geometric and topological approaches. Although most of the work described here focuses on biological applications, the techniques developed can be applied to other fields, including computer graphics, vision, databases, and robotics. Geometrically, the shape of a molecule can be modeled as (i) a set of weighted points, representing the centers of atoms and their van der Waals radii; (ii) as a polygonal curve, corresponding to a protein backbone or a DNA strand; or (iii) as a polygonal mesh corresponding to a molecular surface. Each such representation emphasizes different aspects of molecular structures at various scales, the choice of which depends on the underlying applications. Characterizing molecular shapes represented in various ways is an important step toward better understanding or manipulating molecular structures. In the first part of the thesis, we study three geometric descriptions: the writhing number of DNA strands, the level-of-details representation of protein backbones via simplification, and the elevation of molecular surfaces. The writhing number of a curve measures how many times a curve coils around itself iii in space. It describes the so-called supercoiling phenomenon of double stranded DNA, which influences DNA replication, recombination, and transcription. It is also used to characterize protein backbones. This thesis proposes the first subquadratic algorithm for computing the writhing number of a polygonal curve. It also presents an algorithm that is easy to implement and runs in near-linear time on inputs that are typical in practice, including DNA strands, which is significantly faster than the quadratic time needed by algorithms used in current DNA simulation softwares. The level-of-detail (LOD) representation of protein backbone helps to extract its main features. We compute LOD representations via curve simplification under the so-called Frechet´ error measure. This measure is more desirable than the widely used Hausdorff error measure in many situations, especially if one wants to preserve global features of a curve (e.g, the secondary structure elements of a protein backbone) during simplification. In this thesis, we present a simple approximation algorithm to simplify curves under Frechet´ error measure, which is the first simplification algorithm with guaranteed quality that runs in near-linear time in dimensions higher than two. We propose a continuous elevation function on the surface of a molecule to capture its geometric features such as protrusions and cavities. To define the function, we follow the example of elevation as defined on Earth, but we go beyond this simpler concept to accommodate general 2-manifolds. Our function is invariant under rigid motions. It scales with the surface and provides beyond the location, also direction and size of shape features. We present an algorithm for computing the points with locally maximum elevation. These points corresponds to locally most significant features. This succinct representation of features can be applied to aligning shapes and we will present one such application in the second part of the thesis. The second part of the thesis focuses on molecular shape matching algorithms. The importance of shape matching, both similarity matching and complementarity matching, iv arises from the general belief that the structure of a protein decides its function. Efficient algorithms to measure the similarity between shapes help identify new types of protein architecture, discover evolutionary relations, and provide biologists with computational tools to organize the fast growing set of known protein structures. By modeling a molecule as the union of balls, we study the similarity between two such unions by (variants of) the widely used Hausdorff distance, and propose algorithms to find (approximately) the best translation under Hausdorff distance measure. Complementarity matching is crucial to understand or simulate protein docking, which is the process where two or more protein molecules bind to form a compound structure. From a geometric perspective, protein docking can be considered as the problem of search- ing for configurations with maximum complementarity between two molecular surfaces. Using the feature information generated by the elevation function, we describe an efficient algorithm to find promising initial relative placements of the proteins. The outputs can later be refined to locate docking positions independently using a heuristic that improves the fit locally, using geometric but possibly also chemical and biological information. v And indeed there will be time To wonder, “ Do I dare? ” and “Do I dare? ” Time to turn back and descend the stair, With a bald spot in the middle of my hair ¡ ¡ Do I dare Disturb the universe? — T. S. Eliot, The love-song of J. Alfred Prufrock vi Acknowledgements I came to take of your wisdom: And behold I have found that which is greater than wisdom. — Kahlil Gibran, The prophet It is not without regret that I am writing this acknowledgment — while being grateful for all those who made my life in the past few years a joyful and fruitful one, I know sadly that our lives will part soon. The path towards obtaining a PhD was a struggle for me in many ways. I can’t imagine how it would have been without their support. It has been a great opportunity to have worked under the supervision of Profs. Pankaj K. Agarwal and Herbert Edelsbrunner. The experience helped shape my attitude and approaches towards research both in computational geometry and in general. Pankaj led me into the world of computational geometry with his broad knowledge. Besides support and guidance, he gave me great freedom in doing research, and is always patient and understanding. It is hard to overestimate how much I have benefited from the numerous discussions with him. Herbert showed me the “friendly” side of computational topology, with his deep insights accompanied by illustrative explanation. His philosophy and vision in research have greatly influenced me. I am deeply indebted to both of them for their guidance and inspiration throughout the course of this dissertation. I would also like to thank Profs. John Harer and Johannes Rudolph not only for being on my committee of this thesis, but also for various discussions and collaborations. Support for this work was provided by NSF under the grant NSF-CCR-00-86013 (the BioGeometry project). The Duke CS department is a wonderful place. In particular, I wish to thank Drs Lars Arge and Ron Parr who are always open and ready to help on my career concerns. Dr. Sariel Har-Peled, now a post-postdoc (well, an assistant professor) in UIUC, has been a tremendous mentor and friend for me, especially during a period when I was swinging vii among various career choices, and at a time when I was learning to walk on the ropes of research. I learned from him to have an approximate perspective towards problems both in computer science and in life. I would like to thank all the graduate students and postdocs in the theory group who provided a vibrant research environment that I have enjoyed and benefited from so much, especially Nabil Mustafa, Hai Yu, Peng Yin, and Vijay Natarajan. I had a lot of fun both in research and in life with friends such as Vicky Choi, David Cohen-Steiner, Ho-lun Cheng, Ashish Gehani, Sathish Govindarajan, Jingquan Jia, Tingting Jiang, Dmitriy Morozov, Nabil Mustafa, Vijay Natarajan, Jeff Phillips, Nan Tian, Eric Zhang, Hai Yu, and Haifeng Yu. Those inspiring discussions with Nabil, Sathish and David will always mark my mem- ories of the PhD life. Special thanks to my best friend Peng Yin. His humor, energy, understanding and advices accompany me to this day. I also want to thank his wife Xia Wu for kindly feeding me uncountable times. I wish to thank all staffs in the department for being so friendly and helpful, especially Ms. Celeste Hodges and Ms. Diane Riggs. Last, but not least, I would like to thank my family for their love, support and confi- dence in me. My parents and grandparents encouraged me to pursue my own dream from the childhood, and have never tried to pressure me into any life that others may consider as successful. My sister and brother-in-law have always been there for me, full of understanding and support.

Geometric and Topological Methods in Protein Structure Analysis

Sweeping the Sphere

Uwaterloo Latex Thesis Template

Visibility Graphs and Cell Decompositions

Sweep Line Algorithm

Parallel Range, Segment and Rectangle Queries with Augmented Maps

Variations of Enclosing Problem Using Axis Parallel Square(S): a General Approach

Geometric Search·Intersection

CSC-758: Computational Geometry Course Description

18.415 Advanced Algorithms Contents

A Divide and Conquer Algorithm for Rectilinear Region Coverage

CMSC 754 Computational Geometry1

CMSC 754 Computational Geometry1