High Performance Computing on Biological Sequence Aligment

Total Page:16

File Type:pdf, Size:1020Kb

High Performance Computing on Biological Sequence Aligment Nom/Logotip de la Universitat on s’ha llegit la tesi High performance computing on biological sequence aligment Miquel Orobitg Cortada Dipòsit Legal: L.290-2013 http://hdl.handle.net/10803/110930 Títol de la tesi està subjecte a una llicència de Reconeixement- NoComercial-CompartirIgual 3.0 No adaptada de Creative Commons (c) any, nom i cognom/s de l'autor/a de la tesi Escola Politècnica Superior Departament d’Informàtica i Enginyeria Industrial High performance computing on biological sequence alignment Memòria presentada per obtenir el grau de Doctor per la Universitat de Lleida per Miquel Orobitg Cortada Dirigida per Dr. Fernando Cores Prado Dr. Fernando Guirado Fernández Universitat de Lleida Programa de doctorat en Enginyeria Lleida, febrer de 2013 Abstract Multiple Sequence Alignment (MSA) is an extremely powerful tool for such important biological applications, as phylogenetic analysis, identification of conserved motifs and domains and structure prediction. The main idea behind MSA is to place protein residues into columns according to selected criteria. These criteria can be the structural, evolutionary, functional or sequence sim- ilarity. The first three criteria are based on biological meaning, but the fourth is not. Then the best match alignment may not exhibit the best biological meaning. The other issue is that MSAs are computationally difficult to calculate, and most formulations of the problem lead to NP-hard optimization problems. MSA computation therefore depends on approximate heuristics or algorithms. Accurate and fast construction of MSAs has been extensively researched in recent years, and a variety of methods have been developed. Nowadays, the new challenges in the genomics era are focused on align- ing thousands, or even hundreds of thousands, of sequences. For this reason, alignment methods must be adapted to avoid being relegated from everyday use. Many MSA methods have introduced High Performance Computing ca- pabilities to take advantage of the new technologies and infrastructures. In particular, there are many approaches that improve alignment performance by parallelizing the whole MSA application. However, these distributed methods exhibit scalability problems when the number of sequences increases, as they are constrained by data dependencies that guide the alignment process. In this thesis, three different approaches to solving or reducing some limita- tions of the MSA methods are proposed. The first of these is aimed at reducing data dependencies to increase the degree of parallelism in the applications and iii improve their scalability. The second one is devoted to reducing the memory requirements of a consistency-based method. And the last one is designed to increase the alignment accuracy by improving the guide tree. From the parallelism point of view, we propose a new guide tree construc- tion algorithm to exploit the degree of parallelism in the final alignment pro- cess in order to resolve the bottleneck of the progressive alignment stage in MSA. This new contribution increases the number of parallel alignments to take advantage of increasing computer resources by balancing the guide tree, but taking biological accuracy properties into account. The results achieved demonstrated that the proposed heuristic is capable of increasing the degree of parallelism in the parallel MSA method, thus reducing the execution time while maintaining the biological quality of the alignment. Regarding the memory requirements, we propose a new approach to reduce the computational requirements of the consistency-based method, T-Coffee. The main goal of the proposal is to reduce the number of consistency data stored. This allows a reduction in the execution time and therefore an improve- ment in the scalability of TC to enable the method to align more sequences. Furthermore, the proposed methodology is focused on achieve a minimal the impact over accuracy of the alignments and it is the user who decides the de- gree of optimization. The results obtained proved that this approach is able to reduce the memory consumption and increase both performance and the number and the length of the sequences that the method can align. In connection with biological accuracy, we propose Multiple Trees Align- ment (MTA). MTA is a new MSA method for aligning multiple guide-tree variations for the same input sequences, evaluating the alignments obtained and selecting the best one as a result. MTA is implemented in parallel to pro- vide a good compromise between time and accuracy. Besides, this approach could be applied to any progressive alignment method that accepts a guide tree as an input parameter, and the resulting alignments could be evaluated with any scoring function. Furthermore, we also propose two meta-score metrics, obtained through the combination of single ones. These metrics, which are de- vised using evolutionary algorithms, are able to improve the single ones. The study of different alignment scoring metrics for the selection of the best align- ment proved that in general the best ones are the two proposed meta-scores, followed by the ones that use structural information. Finally, the results ob- tained demonstrated that MTA with Clustal-W and T-Coffee considerably improves the quality of the alignments. Acknowledgements I would like to acknowledge everybody who supported me during the course of this thesis. First and foremost, I want express my gratitude to my supervisors, Dr. Fernando Cores Prado and Dr. Fernando Guirado Fernández, for their continued encouragement and invaluable suggestions throughout this work. They have patiently supervised every little issue, always guiding me in the right direction. Without their help, I could not have finished my thesis successfully. Special thanks are due to all the seniors from the Group of Distributed Computing (GCD) at the Universitat de Lleida (UdL). I must say that has been a great experience working closely with such good people: Concepció Roig, Francesc Giné, Francesc Solsona, Josep Ll. Lèrida, Josep M. Solà, Valentí Pardo and Xavier Faus. During my research, I have shared great moments with all my colleagues: Josep Rius, Ignasi Barri, Ivan Teixidó, Héctor Blanco, Anabel Usié, Damià Castellà, Alberto Montañola, Jordi Vilaplana, Eloi Gabaldon, Jordi Lladós and Ismael Arroyo. Many thanks to all of them for the wonderful times we shared. Furthermore, I want to thank Porfidio Hernández and Toni Espinosa from Universitat Autònoma de Barcelona (UAB) for all their support. I also want to thank all the members of the Comparative Bioiformatics group from the Centre for Genomic Regulation (CRG), especially Cedric Notredame, Carsten Kemena and Paolo Di Tommaso, for their good advice and help, and for wel- coming me as if I were one more during one month. Finally, I want to thank all the members of the Edinburgh Data-Intensive Research group at the University of Edinburgh. During three month, I had the chance to stay in this group and work with great people who made me feel at home. It was a great experience vii viii that marked me in many positive aspects. I would especially like to acknowl- edge Malkolm Atkinson, Paolo Besana, David Rodríguez, Rosa Filgueira and Gary McGilvary for all their support and for welcoming me as if I were another member of a big family. Last but not least, I would like to dedicate this thesis to all my friends and, foremost, to my closest family. Especially to my father Miquel and my sister Marta for the patience and understanding they have shown during these long years. To my girlfriend Laura for her emotional support. And especially, I want to dedicate this thesis to my mother Elena who unfortunately is no longer with us. Contents Contents ix List of Figures xiii List of Tables xvii 1 Introduction 1 1.1 Bioinformatics . .3 1.1.1 Molecular Biology . .4 1.1.1.1 Biological preliminaries . .4 1.1.1.2 Central Dogma of molecular Biology . .8 1.1.2 Bioinformatic Research Areas . .8 1.2 Sequence Alignment . 12 1.2.1 Pairwise Sequence Alignment . 14 1.2.2 Multiple Sequence Alignment . 14 1.2.2.1 Applications . 15 1.2.2.2 Constraints and Requirements . 15 1.2.2.3 Evaluation . 17 1.2.2.4 Current Status . 17 1.3 High Performance Computing . 18 1.3.1 Parallel Computing . 19 1.3.1.1 Hardware . 21 1.3.1.2 Software . 28 1.4 Motivation . 29 1.5 Thesis Objectives . 31 ix x CONTENTS 1.6 Overview . 32 2 Related Work 35 2.1 Introduction . 35 2.2 Pairwise Sequence Alignment Methods . 36 2.2.1 Dot-matrix Methods . 36 2.2.2 Dynamic Programming . 36 2.2.3 Word Methods . 40 2.3 Multiple Sequence Alignment Methods . 40 2.3.1 Exact Algorithms . 41 2.3.2 Progressive Alignment . 42 2.3.3 Iterative Alignment . 44 2.3.3.1 Stochastic Iterative Algorithms . 44 2.3.3.2 Non-stochastic Iterative Algorithms . 48 2.3.4 Consistency-based Methods . 51 2.3.5 Meta-aligners . 54 2.3.6 Template-based Methods . 55 2.4 Benchmarking . 57 2.5 High Performance Computing in MSA . 60 2.5.1 Parallel Algorithms for Pairwise Alignment . 61 2.5.1.1 Wavefront Approach . 61 2.5.1.2 Divide and Conquer Approach . 61 2.5.2 Parallel Multiple Sequence Alignments . 63 2.5.2.1 Multi-Threading and Multi-Process Parallel Aligners . 63 2.5.2.2 Distributed Memory Parallel Aligners . 65 2.5.3 GPUs based Parallel Aligners . 69 2.5.4 Conclusions . 71 3 Balanced Guide Tree 73 3.1 Introduction . 73 3.2 Problem analysis . 74 3.2.1 Progressive Alignment analysis . 76 3.3 Balanced Guide Tree . 78 CONTENTS xi 3.3.1 Neighbor-Joining Algorithm . 79 3.3.2 Balancing Guide Tree Algorithm . 80 3.4 Experimentation . 84 3.4.1 Balanced Guide Tree Performance . 84 3.4.1.1 BalancedParallel-TCoffee . 85 3.4.1.2 Balanced-TCoffee .
Recommended publications
  • Supplementary Data
    Figure 2S 4 7 A - C 080125 CSCs 080418 CSCs - + IFN-a 48 h + IFN-a 48 h + IFN-a 72 h 6 + IFN-a 72 h 3 5 MRFI 4 2 3 2 1 1 0 0 MHC I MHC II MICA MICB ULBP-1 ULBP-2 ULBP-3 ULBP-4 MHC I MHC II MICA MICB ULBP-1 ULBP-2 ULBP-3 ULBP-4 7 B 13 080125 FBS - D 080418 FBS - + IFN-a 48 h 12 + IFN-a 48 h + IFN-a 72 h + IFN-a 72 h 6 080125 FBS 11 10 5 9 8 4 7 6 3 MRFI 5 4 2 3 2 1 1 0 0 MHC I MHC II MICA MICB ULBP-1 ULBP-2 ULBP-3 ULBP-4 MHC I MHC II MICA MICB ULBP-1 ULBP-2 ULBP-3 ULBP-4 Molecule Molecule FIGURE 4S FIGURE 5S Panel A Panel B FIGURE 6S A B C D Supplemental Results Table 1S. Modulation by IFN-α of APM in GBM CSC and FBS tumor cell lines. Molecule * Cell line IFN-α‡ HLA β2-m# HLA LMP TAP1 TAP2 class II A A HC§ 2 7 10 080125 CSCs - 1∞ (1) 3 (65) 2 (91) 1 (2) 6 (47) 2 (61) 1 (3) 1 (2) 1 (3) + 2 (81) 11 (80) 13 (99) 1 (3) 8 (88) 4 (91) 1 (2) 1 (3) 2 (68) 080125 FBS - 2 (81) 4 (63) 4 (83) 1 (3) 6 (80) 3 (67) 2 (86) 1 (3) 2 (75) + 2 (99) 14 (90) 7 (97) 5 (75) 7 (100) 6 (98) 2 (90) 1 (4) 3 (87) 080418 CSCs - 2 (51) 1 (1) 1 (3) 2 (47) 2 (83) 2 (54) 1 (4) 1 (2) 1 (3) + 2 (81) 3 (76) 5 (75) 2 (50) 2 (83) 3 (71) 1 (3) 2 (87) 1 (2) 080418 FBS - 1 (3) 3 (70) 2 (88) 1 (4) 3 (87) 2 (76) 1 (3) 1 (3) 1 (2) + 2 (78) 7 (98) 5 (99) 2 (94) 5 (100) 3 (100) 1 (4) 2 (100) 1 (2) 070104 CSCs - 1 (2) 1 (3) 1 (3) 2 (78) 1 (3) 1 (2) 1 (3) 1 (3) 1 (2) + 2 (98) 8 (100) 10 (88) 4 (89) 3 (98) 3 (94) 1 (4) 2 (86) 2 (79) * expression of APM molecules was evaluated by intracellular staining and cytofluorimetric analysis; ‡ cells were treatead or not (+/-) for 72 h with 1000 IU/ml of IFN-α; # β-2 microglobulin; § β-2 microglobulin-free HLA-A heavy chain; ∞ values are indicated as ratio between the mean of fluorescence intensity of cells stained with the selected mAb and that of the negative control; bold values indicate significant MRFI (≥ 2).
    [Show full text]
  • Bioinformatics 1: Lecture 3
    Bioinformatics 1: Lecture 3 •Pairwise alignment •Substitution •Dynamic Programming algorithm Scoring matrix To prepare an alignment, we first consider the score for aligning (associating) any one character of the first sequence with any one character of the second sequence. A A G A C G T T T A G A C T 0 0 1 0 0 1 0 0 0 0 1 1 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 Exact match 0 0 1 0 0 1 0 0 0 0 1/0 0 0 0 0 0 0 1 1 1 0 1 1 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 The cost of mutation is not a constant DNA: A change in the 3rd base in a codon, and sometimes the first base, sometimes conserves the amino acid. No selective pressure. Protein: A change in amino acids that are in the same chemical class conserve their chemical environment. For example: Lys to Arg is conservative because both a positively charged. Conservative amino acid changes N Lys <--> Arg C + N` N N` C N C + C C C N` C C C C O C O C C N N Ile <--> Leu C C C C C C C C O C O C C Ser <--> Thr Asp <--> Glu Asn <--> Gln If the “chemistry” of the sidechain is conserved, then the mutation is less likely to change structure/function.
    [Show full text]
  • BASS: Approximate Search on Large String Databases
    BASS: Approximate Search on Large String Databases Jiong Yang Wei Wang Philip Yu UIUC UNC Chapel Hill IBM [email protected] [email protected] [email protected] Abstract Similarity search on a string database can be classified into two categories: exact match and approximate match. The In this paper, we study the problem on how to build an index struc- search of exact match looks for substrings in the database, ture for large string databases to efficiently support various types of which is exactly identical to the query pattern while the search string matching without the necessity of mapping the substrings to of approximate match allows some types of imperfection such a numerical space (e.g., string B-tree and MRS-index) nor the re- as substitutions between certain symbols, some degree of mis- striction of in-memory practice (e.g., suffix tree and suffix array). alignment, and the presence of “wild-card” in the query pat- Towards this goal, we propose a new indexing scheme, BASS-tree, tern. We shall mention that supporting approximate match is to efficiently support general approximate substring match (in terms very important to many applications. For instance, biologists of certain symbol substitutions and misalignments) in sublinear time have observed that mutations between certain pair of amino on a large string database. The key idea behind the design is that all acids may occur at a noticeable probability in some proteins positions in each string are grouped recursively into a fully balanced and such a mutation usually does not alter the biological func- tree according to the similarities of the subsequent segments starting tion of the proteins.
    [Show full text]
  • Mouse Lsm10 Knockout Project (CRISPR/Cas9)
    https://www.alphaknockout.com Mouse Lsm10 Knockout Project (CRISPR/Cas9) Objective: To create a Lsm10 knockout Mouse model (C57BL/6J) by CRISPR/Cas-mediated genome engineering. Strategy summary: The Lsm10 gene (NCBI Reference Sequence: NM_138721 ; Ensembl: ENSMUSG00000050188 ) is located on Mouse chromosome 4. 2 exons are identified, with the ATG start codon in exon 2 and the TGA stop codon in exon 2 (Transcript: ENSMUST00000055575). Exon 2 will be selected as target site. Cas9 and gRNA will be co-injected into fertilized eggs for KO Mouse production. The pups will be genotyped by PCR followed by sequencing analysis. Note: Exon 2 starts from about 0.27% of the coding region. Exon 2 covers 100.0% of the coding region. The size of effective KO region: ~366 bp. The KO region does not have any other known gene. Page 1 of 8 https://www.alphaknockout.com Overview of the Targeting Strategy Wildtype allele 5' gRNA region gRNA region 3' 1 2 Legends Exon of mouse Lsm10 Knockout region Page 2 of 8 https://www.alphaknockout.com Overview of the Dot Plot (up) Window size: 15 bp Forward Reverse Complement Sequence 12 Note: The 2000 bp section upstream of start codon is aligned with itself to determine if there are tandem repeats. Tandem repeats are found in the dot plot matrix. The gRNA site is selected outside of these tandem repeats. Overview of the Dot Plot (down) Window size: 15 bp Forward Reverse Complement Sequence 12 Note: The 2000 bp section downstream of stop codon is aligned with itself to determine if there are tandem repeats.
    [Show full text]
  • Computational Biology Lecture 8: Substitution Matrices Saad Mneimneh
    Computational Biology Lecture 8: Substitution matrices Saad Mneimneh As we have introduced last time, simple scoring schemes like +1 for a match, -1 for a mismatch and -2 for a gap are not justifiable biologically, especially for amino acid sequences (proteins). Instead, more elaborated scoring functions are used. These scores are usually obtained as a result of analyzing chemical properties and statistical data for amino acids and DNA sequences. For example, it is known that same size amino acids are more likely to be substituted by one another. Similarly, amino acids with same affinity to water are likely to serve the same purpose in some cases. On the other hand, some mutations are not acceptable (may lead to demise of the organism). PAM and BLOSUM matrices are amongst results of such analysis. We will see the techniques through which PAM and BLOSUM matrices are obtained. Substritution matrices Chemical properties of amino acids govern how the amino acids substitue one another. In principle, a substritution matrix s, where sij is used to score aligning character i with character j, should reflect the probability of two characters substituing one another. The question is how to build such a probability matrix that closely maps reality? Different strategies result in different matrices but the central idea is the same. If we go back to the concept of a high scoring segment pair, theory tells us that the alignment (ungapped) given by such a segment is governed by a limiting distribution such that ¸sij qij = pipje where: ² s is the subsitution matrix used ² qij is the probability of observing character i aligned with character j ² pi is the probability of occurrence of character i Therefore, 1 qij sij = ln ¸ pipj This formula for sij suggests a way to constrcut the matrix s.
    [Show full text]
  • Noelia Díaz Blanco
    Effects of environmental factors on the gonadal transcriptome of European sea bass (Dicentrarchus labrax), juvenile growth and sex ratios Noelia Díaz Blanco Ph.D. thesis 2014 Submitted in partial fulfillment of the requirements for the Ph.D. degree from the Universitat Pompeu Fabra (UPF). This work has been carried out at the Group of Biology of Reproduction (GBR), at the Department of Renewable Marine Resources of the Institute of Marine Sciences (ICM-CSIC). Thesis supervisor: Dr. Francesc Piferrer Professor d’Investigació Institut de Ciències del Mar (ICM-CSIC) i ii A mis padres A Xavi iii iv Acknowledgements This thesis has been made possible by the support of many people who in one way or another, many times unknowingly, gave me the strength to overcome this "long and winding road". First of all, I would like to thank my supervisor, Dr. Francesc Piferrer, for his patience, guidance and wise advice throughout all this Ph.D. experience. But above all, for the trust he placed on me almost seven years ago when he offered me the opportunity to be part of his team. Thanks also for teaching me how to question always everything, for sharing with me your enthusiasm for science and for giving me the opportunity of learning from you by participating in many projects, collaborations and scientific meetings. I am also thankful to my colleagues (former and present Group of Biology of Reproduction members) for your support and encouragement throughout this journey. To the “exGBRs”, thanks for helping me with my first steps into this world. Working as an undergrad with you Dr.
    [Show full text]
  • WO 2019/079361 Al 25 April 2019 (25.04.2019) W 1P O PCT
    (12) INTERNATIONAL APPLICATION PUBLISHED UNDER THE PATENT COOPERATION TREATY (PCT) (19) World Intellectual Property Organization I International Bureau (10) International Publication Number (43) International Publication Date WO 2019/079361 Al 25 April 2019 (25.04.2019) W 1P O PCT (51) International Patent Classification: CA, CH, CL, CN, CO, CR, CU, CZ, DE, DJ, DK, DM, DO, C12Q 1/68 (2018.01) A61P 31/18 (2006.01) DZ, EC, EE, EG, ES, FI, GB, GD, GE, GH, GM, GT, HN, C12Q 1/70 (2006.01) HR, HU, ID, IL, IN, IR, IS, JO, JP, KE, KG, KH, KN, KP, KR, KW, KZ, LA, LC, LK, LR, LS, LU, LY, MA, MD, ME, (21) International Application Number: MG, MK, MN, MW, MX, MY, MZ, NA, NG, NI, NO, NZ, PCT/US2018/056167 OM, PA, PE, PG, PH, PL, PT, QA, RO, RS, RU, RW, SA, (22) International Filing Date: SC, SD, SE, SG, SK, SL, SM, ST, SV, SY, TH, TJ, TM, TN, 16 October 2018 (16. 10.2018) TR, TT, TZ, UA, UG, US, UZ, VC, VN, ZA, ZM, ZW. (25) Filing Language: English (84) Designated States (unless otherwise indicated, for every kind of regional protection available): ARIPO (BW, GH, (26) Publication Language: English GM, KE, LR, LS, MW, MZ, NA, RW, SD, SL, ST, SZ, TZ, (30) Priority Data: UG, ZM, ZW), Eurasian (AM, AZ, BY, KG, KZ, RU, TJ, 62/573,025 16 October 2017 (16. 10.2017) US TM), European (AL, AT, BE, BG, CH, CY, CZ, DE, DK, EE, ES, FI, FR, GB, GR, HR, HU, ΓΕ , IS, IT, LT, LU, LV, (71) Applicant: MASSACHUSETTS INSTITUTE OF MC, MK, MT, NL, NO, PL, PT, RO, RS, SE, SI, SK, SM, TECHNOLOGY [US/US]; 77 Massachusetts Avenue, TR), OAPI (BF, BJ, CF, CG, CI, CM, GA, GN, GQ, GW, Cambridge, Massachusetts 02139 (US).
    [Show full text]
  • B.Sc. (Hons.) Biotech BIOT 3013 Unit-5 Satarudra P Singh
    B.Sc. (Hons.) Biotechnology Core Course 13: Basics of Bioinformatics and Biostatistics (BIOT 3013 ) Unit 5: Sequence Alignment and database searching Dr. Satarudra Prakash Singh Department of Biotechnology Mahatma Gandhi Central University, Motihari Challenges in bioinformatics 1. Obtain the genome of an organism. 2. Identify and annotate genes. 3. Find the sequences, three dimensional structures, and functions of proteins. 4. Find sequences of proteins that have desired three dimensional structures. 5. Compare DNA sequences and proteins sequences for similarity. 6. Study the evolution of sequences and species. Sequence alignments lie at the heart of all bioinformatics Definition of Sequence Alignment • Sequence alignment is the procedure of comparing two or more sequences by searching for a series of individual characters or character patterns that are in the same order in the sequences. LGPSSKQTGKGS - SRIWDN Global Alignment LN – ITKSAGKGAIMRLFDA --------TGKG --------- Local Alignment --------AGKG --------- • In global alignment, an attempt is made to align the entire sequences, as many characters as possible. • In local alignment, stretches of sequence with the highest density of matches are given the highest priority, thus generating one or more islands of matches in the aligned sequences. • Eg: problem of locating the famous TATAAT - box (a bacterial promoter) in a piece of DNA. Method for pairwise sequence Alignment: Dynamic Programming • Global Alignment: Needleman- Wunsch Algorithm • Local Alignment: Smith-Waterman Algorithm Needleman & Wunsch algorithm : Global alignment • There are three major phases: 1. initialization 2. Fill 3. Trace back. • Initialization assign values for the first row and column. • The score of each cell is set to the gap score multiplied by the distance from the origin.
    [Show full text]
  • Parameter Advising for Multiple Sequence Alignment
    PARAMETER ADVISING FOR MULTIPLE SEQUENCE ALIGNMENT by Daniel Frank DeBlasio Copyright c Daniel Frank DeBlasio 2016 A Dissertation Submitted to the Faculty of the DEPARTMENT OF COMPUTER SCIENCE In Partial Fulfillment of the Requirements For the Degree of DOCTOR OF PHILOSOPHY In the Graduate College THE UNIVERSITY OF ARIZONA 2016 2 THE UNIVERSITY OF ARIZONA GRADUATE COLLEGE As members of the Dissertation Committee, we certify that we have read the disser- tation prepared by Daniel Frank DeBlasio, entitled Parameter Advising for Multiple Sequence Alignment and recommend that it be accepted as fulfilling the dissertation requirement for the Degree of Doctor of Philosophy. Date: 15 April 2016 John Kececioglu Date: 15 April 2016 Alon Efrat Date: 15 April 2016 Stephen Kobourov Date: 15 April 2016 Mike Sanderson Final approval and acceptance of this dissertation is contingent upon the candidate's submission of the final copies of the dissertation to the Graduate College. I hereby certify that I have read this dissertation prepared under my direction and recommend that it be accepted as fulfilling the dissertation requirement. Date: 15 April 2016 Dissertation Director: John Kececioglu 3 STATEMENT BY AUTHOR This dissertation has been submitted in partial fulfillment of requirements for an advanced degree at the University of Arizona and is deposited in the University Library to be made available to borrowers under rules of the Library. Brief quotations from this dissertation are allowable without special permission, pro- vided that accurate acknowledgment of the source is made. Requests for permission for extended quotation from or reproduction of this manuscript in whole or in part may be granted by the copyright holder.
    [Show full text]
  • Pairwise Statistical Significance of Local Sequence Alignment Using Sequence-Specific and Position-Specific Substitution Matrices
    194 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 8, NO. 1, JANUARY/FEBRUARY 2011 Pairwise Statistical Significance of Local Sequence Alignment Using Sequence-Specific and Position-Specific Substitution Matrices Ankit Agrawal and Xiaoqiu Huang Abstract—Pairwise sequence alignment is a central problem in bioinformatics, which forms the basis of various other applications. Two related sequences are expected to have a high alignment score, but relatedness is usually judged by statistical significance rather than by alignment score. Recently, it was shown that pairwise statistical significance gives promising results as an alternative to database statistical significance for getting individual significance estimates of pairwise alignment scores. The improvement was mainly attributed to making the statistical significance estimation process more sequence-specific and database-independent. In this paper, we use sequence-specific and position-specific substitution matrices to derive the estimates of pairwise statistical significance, which is expected to use more sequence-specific information in estimating pairwise statistical significance. Experiments on a benchmark database with sequence-specific substitution matrices at different levels of sequence-specific contribution were conducted, and results confirm that using sequence-specific substitution matrices for estimating pairwise statistical significance is significantly better than using a standard matrix like BLOSUM62, and than database statistical significance estimates
    [Show full text]
  • WO 2012/174282 A2 20 December 2012 (20.12.2012) P O P C T
    (12) INTERNATIONAL APPLICATION PUBLISHED UNDER THE PATENT COOPERATION TREATY (PCT) (19) World Intellectual Property Organization International Bureau (10) International Publication Number (43) International Publication Date WO 2012/174282 A2 20 December 2012 (20.12.2012) P O P C T (51) International Patent Classification: David [US/US]; 13539 N . 95th Way, Scottsdale, AZ C12Q 1/68 (2006.01) 85260 (US). (21) International Application Number: (74) Agent: AKHAVAN, Ramin; Caris Science, Inc., 6655 N . PCT/US20 12/0425 19 Macarthur Blvd., Irving, TX 75039 (US). (22) International Filing Date: (81) Designated States (unless otherwise indicated, for every 14 June 2012 (14.06.2012) kind of national protection available): AE, AG, AL, AM, AO, AT, AU, AZ, BA, BB, BG, BH, BR, BW, BY, BZ, English (25) Filing Language: CA, CH, CL, CN, CO, CR, CU, CZ, DE, DK, DM, DO, Publication Language: English DZ, EC, EE, EG, ES, FI, GB, GD, GE, GH, GM, GT, HN, HR, HU, ID, IL, IN, IS, JP, KE, KG, KM, KN, KP, KR, (30) Priority Data: KZ, LA, LC, LK, LR, LS, LT, LU, LY, MA, MD, ME, 61/497,895 16 June 201 1 (16.06.201 1) US MG, MK, MN, MW, MX, MY, MZ, NA, NG, NI, NO, NZ, 61/499,138 20 June 201 1 (20.06.201 1) US OM, PE, PG, PH, PL, PT, QA, RO, RS, RU, RW, SC, SD, 61/501,680 27 June 201 1 (27.06.201 1) u s SE, SG, SK, SL, SM, ST, SV, SY, TH, TJ, TM, TN, TR, 61/506,019 8 July 201 1(08.07.201 1) u s TT, TZ, UA, UG, US, UZ, VC, VN, ZA, ZM, ZW.
    [Show full text]
  • Characterization and Mapping of Human Genes Encoding Zinc Finger Proteins (Transcription/Chromosome/Sequence-Tagged Site) P
    Proc. Nati. Acad. Sci. USA Vol. 88, pp. 9563-9567, November 1991 Biochemistry Characterization and mapping of human genes encoding zinc finger proteins (transcription/chromosome/sequence-tagged site) P. BRAY*, P. LICHTERtt, H.-J. THIESEN§, D. C. WARDt, AND I. B. DAWID* *Laboratory of Molecular Genetics, National Institute of Child Health and Human Development, National Institutes of Health, Bethesda, MD 20892; tDepartment of Genetics, Yale School of Medicine, New Haven, CT 06520; *Deutsches Krebsforschungszentrum, Im Neuenheimer Feld 280, D-6900 Heidelberg, Federal Republic of Germany; §Basel Institute for Immunology, Grenzacherstrasse 487, CH-4005 Basel, Switzerland Contributed by I. B. Dawid, August 2, 1991 ABSTRACT The zinc finger motif, exemplified by a segment finger genes cloned by binding of known regulatory nucleo- of the Drosophila gap gene Krfippel, is a nudeic add-binding tide sequences, zinc finger clones isolated by sequence domain present in many transcription factors. To investigate the homology often contain many fingers and highly conserved gene family encoding this motifin the human genome, a placental H/C link regions. Among multifinger cDNAs at least two genomic library was screened at moderate stringency with a associated motifs of unknown function have been identified degenerate oligodeoxynucleotide probe designed to hybridize to and named FAX (finger-associated box; ref. 14) and KRAB the His/Cys (H/C) link region between adjoining zinc fingers. (Kirppel-associated box; ref. 15). Over 200 phage clones were obtained and are being sorted into With the aim to survey the number and chromosomal groups by partial sequencing, cross-hybridization with oligode- distribution of zinc finger genes in the human genome, we oxynucleotide probes, and PCR amplifion.
    [Show full text]