Computational and Statistical Methods for Mass Spectrometry Data Analysis
University of Warsaw
Faculty of Mathematics, Informatics and Mechanics

Mateusz Krzysztof Łącki
Student no.

Computational and Statistical Methods for Mass Spectrometry Data Analysis

PhD dissertation in COMPUTER SCIENCE

Supervisors:
Prof. Anna Gambin, Institute of Informatics, University of Warsaw
Dr Błażej Miasojedow, Institute of Applied Mathematics, University of Warsaw

September

Supervisor’s statement

Hereby I confirm that the presented thesis was prepared under my supervision and that it fulfils the requirements for the degree of PhD of Computer Science.

Date    Supervisor’s signature

Author’s statement

Hereby I declare that the presented thesis was prepared by me and none of its contents was obtained by means that are against the law. The thesis has never before been a subject of any procedure of obtaining an academic degree. Moreover, I declare that the present version of the thesis is identical to the attached electronic version.

Date    Author’s signature

Abstract

Computational and Statistical Methods for Mass Spectrometry Data Analysis

This dissertation covers a series of related topics in the mathematical modelling of mass spectrometry data. The dissertation opens with a presentation of an optimal algorithm for the generation of the fine isotopic structure. We then apply that algorithm, in two different ways, to the problem of deconvoluting mixed isotopic signals. We also approach the problem of estimating the deep parameters of mass detectors, that is, the parameters of a function that relates the instrument-generated intensities to the numbers of ions. These solutions are applied to the problem of understanding electron-driven reactions, whose principal aim is to induce ion fragmentation and, in that way, enhance the instrument’s identification capabilities. Finally, we show how to apply the mathematical theory of reaction kinetics to estimate the reaction rates of the electron transfer reactions.
Metody obliczeniowe i statystyczne analizy danych ze spektrometrów masowych

This doctoral dissertation concerns a range of topics in the mathematical modelling of mass spectra. I present an algorithm for calculations related to the isotopic distributions of molecules. I use this algorithm, in two different ways, in the problem of deconvoluting mixtures of signals from known molecular sources. I also present a way of determining the relationship between the recorded signal and the number of ions for different ion detectors. These solutions are also used to gain a more precise understanding of how ion fragmentation by electron transfer works, a technique that significantly broadens the possibilities of identifying substances. I also show a way of estimating the parameters of these reactions, using a mathematical model of reaction kinetics.

Keywords

Mass Spectrometry, Isotopic Fine Structure, Estimation of Chemical Rates, Electron Transfer Dissociation, Deconvolution of Fine Isotopic Structures

Thesis domain (Socrates-Erasmus subject area codes)

. Informatics (Informatyka)

Subject classification

I.. Simulation and Modelling
J.. Physical Sciences and Engineering

Title of the thesis in Polish

Metody obliczeniowe i statystyczne analizy danych ze spektrometrów masowych

Contents

1 Introduction
2 Isotopic Distribution Calculations
    The Complexity of Pruning
    The IsoSpec Algorithm
    Experimental Results
    Further uses of the software
3 Quantifying Electron Transfer Reactions
    Materials and methods
    Results and Discussion
    Conclusions
4 Estimating Reaction Kinetics of Electron Transfer Reactions
    Formal model of the ETD reaction
    Validation & Results
    Discussion & Conclusions
5 Deconvolution of Mass Spectra & Ion Statistics
    Data Preprocessing
    The Data Generation Model
    Bayesian Calculations
6 Conclusions and Future Research

List of Figures

Idealized convolution.
Division of isotopic envelope into optimal p-sets, p ∈ {80%, 90%, 95%, 100%}, for a toy molecule.
Problems resulting from fixing relative peak height threshold at a given value.
The threshold function obtained for Bovine Insulin.
The quality of the Gaussian approximation to the optimal p-set for a toy example one element compound with three isotopes.
Idea behind the proof of proportionality of the ellipsoid volume to the number of subisotopologues on the simplex.
Approximate size of the optimal P-set in terms of the size of the optimal %-set (y axis, logarithmic scale) for different joint thresholds P (x axis).
The principle behind the IsoSpec algorithm.
Merging subisotopologues into isotopologues on a toy example of a two element molecule.
Adaptive linear approximation to the threshold function.
Comparison of enviPat and IsoSpec Threshold and of IsoSpec Threshold with IsoSpec calculating the optimal % and % sets.
Comparison of enviPat with IsoSpec Threshold and IsoSpec Threshold with IsoSpec aiming at joint probability equal to % and % on fragment identification problem ( compounds).
A connected component C of the deconvolution graph G.
Simple branching model.
Two interpretations of observing c and z matching fragments: lavish and parsimonious.
Summary of the proposed pairing algorithms.
A pairing graph (a) and its representation as a max flow optimization problem.
Error rates of the deconvolution procedure on in silico data.
The distribution of distance between the estimates and true values of the reaction probabilities.
MassTodonPy runtime distribution.
Selected results of the MassTodon as run on Substance P spectra.
Estimates of the probabilities of ETnoD and PTR conditional on one of these events happening obtained for ubiquitin.
The deconvolution of the observed isotopic envelopes performed by MassTodon.
The process of mass spectrum interpretation with MassTodon and ETDetective.
A model of the ETD reaction.
Relative errors of the fitting procedure on in silico Substance P data.
Explanation Percentage (EP) for experimental Substance P spectra.
The distribution of the runtime of ETDetective.
Application of ETDetective to experimental data preprocessed by MassTodon.
Computation of expected numbers of molecules.
LTQ Orbitrap Velos data.
Orbitrap Preprocessing Strategy.
Bayesian net representation of the data augmented deconvolution problem attacked by MassOn.
The rejection algorithm for drawing from density proportional to B(n).
Results of Bayesian deconvolution.

List of Tables

Masses and Frequencies of isotopes of elements that make up proteins.
Chemical reactions considered by the MassTodon.
Chemical reactions considered by the ETDetective.

1 Introduction

“Now this is not the end. It is not even the beginning of the end.
But it is, perhaps, the end of the beginning.”
Winston Churchill

Mass Spectrometry is a subfield of Analytical Chemistry that studies and develops instruments for analysing the molecular content of samples. These instruments, called mass spectrometers, were developed by Joseph J. Thomson just before the First World War and were used to study the presence of the isotopes of natural elements (Thomson, ). The output of a mass spectrometer, a mass spectrum, is a histogram: each bar has its own position in the mass-to-charge domain and a height equal to the observed intensity. The intensities are usually assumed to be proportional to the number of ions. Below, we show what a mass spectrum might look like.

Table .: Masses and Frequencies of isotopes of elements that build up the proteins (Brand et al., ).

Isotope    Mass    Frequency
H          .
H (D)      .
C          .
C          .
N          .
N          .
O          .
O          .
O          .
S          .
S          .
S          .
S          .

Let us study the case of human insulin, CHNOS, to see how complex a signal can be, even when it comes from a single source. To start with, all atoms in the above formula can assume one of multiple isotopic variants, from a set that is different for each element. Finding a particular isotope in nature is largely a random event. This does not mean that it is not predictable: in large numbers, isotopes follow many well-studied patterns. The International Union of Pure and Applied Chemistry (IUPAC) performs continuous measurements of the frequencies of natural isotopes. Table . summarizes a small fraction of their findings up till year . From the viewpoint of statistical