University of Geneva Faculties of Medicine, Science & Computer Science MSc in Proteomics and Bioinformatics
LC-MSMS identification of small molecules: making sense of the unknown
Author: Celine Moret
Supervisor: Alexandre Masselot, Ph.D.
Academic Year: 2009/2010

Declaration of Authorship
The author hereby declares that she compiled this thesis independently, using only the listed resources and literature.
The author grants to the University of Geneva permission to reproduce and to distribute copies of this thesis document in whole or in part.
Geneva, September 13, 2011
Signature

Acknowledgments
This thesis would not have been possible without the help of many people. First, I would like to thank my professors of the Master, who allowed me to discover the fascinating field of Bioinformatics. I owe my deepest gratitude to Dr. Alexandre Masselot, who gave me the opportunity of realising this thesis under his supervision. He never considered blindness as an obstacle, and even valued my different perspective. At each step, he was there to help me, motivate me, and teach me to welcome scientific challenges. I will never forget the friendly and stimulating working atmosphere at Genebio, and I would like to thank all previous and current collaborators. A special thanks goes to the SmileMS team: to Dr. Nicolas Budin and Anastasia Chasapi for their collaboration and support at every level; to Dr. Roman Mylonas for his endless patience and his everyday help to overcome accessibility issues by looking for alternative strategies. I am grateful to Pauline Carrara for her encouragement and her help in adapting inaccessible material. Finally, I would like to thank Christina Fasser, my employer at Retina Suisse, as well as my family and friends for their support during my studies. I have a special thought for my godmother Beatrice, to whom I dedicate this thesis.

Abstract
Based on recent robust scoring models, liquid chromatography tandem mass spectrometry (LC-MSMS) is now a method of choice for small molecule identification. The most common method is to score one experimental spectrum against the spectra of a reference library and to associate the experimental data with a molecule if the score is high enough. Whereas conventional approaches score individual spectra in a run, the aim of this thesis is to investigate the content of a run more globally and even to compare runs with each other. For this purpose, we developed a clustering algorithm that groups similar spectra. Our approach was applied to two challenges associated with the intrinsic variability of LC-MSMS acquisitions:
• automatic library creation: after clustering spectra across a set of experiments, we built a library of "representative" spectra. The goals were to see how such an automatic method performs compared to the usual human expert process and to evaluate whether we could annotate common contaminants;
• molecule correlations within a run: in a typical LC-MSMS run, the vast majority of spectra are not elucidated and are usually neglected in an automatic process. The challenge was to increase this elucidation rate by looking at how different compounds are correlated with each other. We analysed a list of acquisitions made at Geneva University Hospital (HUG) and looked for such correlations (different drugs, artefacts etc.).
The conclusions of this thesis offer new perspectives for the interpretation of small molecules identifications by LC-MSMS.
Keywords LC-MSMS, small molecules
Author’s e-mail [email protected]
Supervisor’s e-mail [email protected]

Contents
List of Tables
List of Figures
Acronyms

1 Introduction
1.1 Small molecules
1.2 Small molecules identification methods
1.2.1 Liquid chromatography tandem mass spectrometry (LC-MSMS) to identify small molecules
1.3 Challenges associated with LC-MSMS
1.3.1 Poor spectra reproducibility
1.3.2 Reference libraries scarcity
1.3.3 Elucidation rate
1.4 Research goals

2 Spectra clustering
2.1 Problematic
2.2 Methods
2.2.1 Data and library
2.2.2 Clustering workflow
2.2.3 Dealing with large data
2.2.4 Clustering algorithm validation
2.3 Results
2.3.1 Cluster compacity
2.3.2 Cluster discriminance
2.4 Conclusion

3 An automatic reference library building process
3.1 Problematic
3.2 Methods
3.2.1 Automatic library building workflow
3.2.2 Spectrum selection process
3.2.3 New library creation
3.2.4 Automatic library validation
3.3 Results
3.4 Conclusion

4 Seventy runs analysis: molecules recurrence and co-occurrence
4.1 Problematic
4.2 Methods
4.2.1 Library and data files
4.2.2 Spectra recurrence analysis
4.2.3 Correlations
4.3 Results
4.3.1 Spectra recurrence analysis
4.3.2 Correlations
4.4 Conclusion
4.5 Achievements
4.6 Perspectives
4.6.1 Reference library building
4.6.2 Elucidation rate
4.6.3 Scoring

Bibliography

A Freiburg library content

List of Tables

1.1 Main methodologies used for small molecule identification
1.2 Main chromatographic methods used for HPLC
2.1 Proportion of identified spectra in 70 runs
2.2 Scores distribution in 30 clusters
2.3 Summary of a single acquisition run clustering
3.1 Comparative submissions against the Freiburg library and an automatic library
3.2 Scores distribution against an automatic library
3.3 Identifications against an automatic library
3.4 Comparison between identification scores against the Freiburg library and the automatic library
4.1 Occurrence of four molecules across 15 runs
4.2 Correlation between pairs of molecules across 70 runs

List of Figures

1.1 Spectra variability in LC-MSMS
1.2 Spectra reproducibility in GC-MSMS
2.1 Clustering workflow
2.2 Scores distribution in cluster 173
2.3 Scores distribution in cluster 184
3.1 Automatic library building workflow
4.1 Spectra occurrence across 70 runs
4.2 Identified spectra across 70 runs
4.3 Spectra correlation

Chapter 1
Introduction
1.1 Small molecules
Small molecules can be defined as low molecular weight organic compounds, which are, by definition, not polymers (Mylonas (2010)). Some typical examples include drugs, pesticides, dietary supplements and metabolites. Most of the analytes of biomedical, food and environmental interest have low molecular weight, usually less than about 1000 Da (Shankaran et al. (2007)). Their identification is very important in many fields such as (Mylonas (2010)):
• Metabolomics, for identification and quantification of metabolites.
• Pharmaceutical industry, for drug discovery, combinatorial chemistry, pharmacokinetics, drug metabolism, quality control
• Clinic, for drug testing
• Environment, for water quality, food contamination
• Geology, for the assessment of oil composition
• Military, for the detection of toxic environments
• Forensics, for the study of drivers under drug influence
• Homeland security, for explosive detection
• Anti-doping, for the detection of doping products
• Space industry, for the analysis of collected material
1.2 Small molecules identification methods
As summarised in table 1.1, many methods exist to identify small molecules. Each one presents its own advantages and disadvantages, and the choice of a specific method depends on the laboratory needs.
Technology | References | Advantages | Drawbacks
Immunoassays | Ferrara et al. (1994); Moeller et al. (2008) | a) Rapid b) Easy to use | a) Limited to specific classes of molecules b) Antigen-antibody reaction not always specific
NMR | García-Pérez et al. (2008) | a) High reproducibility b) Non-destructive c) Simultaneous quantification | a) Low sensitivity b) Not well adapted for screening
LC-UV | Maurer (2004) | a) Easy to use b) Wide panel of compounds included in shipped library c) Plug and play system | a) Support from Bio-Rad discontinued b) Not very sensitive for specific classes of molecules
CE-MS | García-Pérez et al. (2008) | a) Rapid b) Wide panel of compounds detectable c) Good resolution | a) Poor concentration sensitivity b) Needs special treatment for ionizable compounds
GC-MS | Maurer (2004); Vogeser & Seger (2008); Moeller et al. (2008) | a) Very large and machine independent libraries b) Reference methods exist c) High sensitivity and specificity | a) Tedious sample preparation b) No possible automation c) Limited to specific classes of molecules
LC-MSMS | Vogeser & Seger (2008) | a) High sensitivity and specificity b) Possible automation c) Rapidity | a) Requires substantial expertise and know-how b) Instrument dependent libraries c) Matrix effect
Table 1.1: Main methodologies used for small molecule identification (Mylonas (2010)).
In this thesis, we focus on liquid chromatography tandem mass spectrometry, which is presented in more detail in the next section.

1.2.1 Liquid chromatography tandem mass spectrometry (LC-MSMS) to identify small molecules
General workflow
Due to its capability to identify molecules of both low and high molecular weight, liquid chromatography tandem mass spectrometry (LC-MSMS) is widely used in Proteomics (Chen & Pramanik (2009)), and has been used for small molecule identification for some years. It presents the major advantage of being able to identify nonvolatile and thermally labile compounds, in contrast with gas chromatography tandem mass spectrometry. LC-MSMS involves the following steps:
• Sample extraction
• Liquid chromatography
• MSMS analysis
• Computer analysis
The sample extraction will not be explained in this thesis. The other steps are presented more extensively in the following paragraphs.
Liquid chromatography
Liquid chromatography is a separation technique which, in the LC-MSMS context, prevents the molecules from reaching the mass spectrometer at the same time. Practically, a sample mixture is passed through a column packed with solid particles (stationary phase). With the proper solvents and packing conditions, some components of the sample travel through the column more slowly than others, resulting in the desired separation. While there are several LC techniques, as shown in table 1.2, small molecules are generally well separated thanks to their differing polarity (Mannhold (2008)), making reversed-phase chromatography the most appropriate choice. Reversed-phase columns consist of a non-polar stationary phase. C18-bonded silica is the most popular type of reversed-phase HPLC packing (Mannhold (2008)). The mobile phase is usually a blend of water with a miscible, polar organic solvent, such as acetonitrile or methanol (Mylonas (2010)). Polar molecules such as acids elute faster and are thus injected early into the mass spectrometer.

Chromatographic method | Separation technique
Normal phase (NP-HPLC) | polar differences
Reverse phase (RPC) | polar differences
Hydrophilic Interaction (HILIC) | polar differences
Size exclusion (SEC) | molecule size
Ion exchange | molecular charge
Bioaffinity | complex building
Table 1.2: Main chromatographic methods used for HPLC.
Mass spectrometry
Mass spectrometry is a sensitive analytical technique which is able to quantify known analytes and to identify unknown molecules at the picomole or femtomole level (Staack & Hopfgartner (2007)). A mass spectrometer is an instrument which precisely measures the abundance of molecules that have been converted to ions. During mass spectrometry analysis, molecules are first ionised and then separated in the mass spectrometer according to their mass-to-charge ratio (m/z). At this point, fragmentation can occur and a second analyser determines the m/z ratio of the resulting fragments. Ions are finally detected by a detector, and the obtained signal is analysed by a computer.
Computer analysis with SmileMS
SmileMS is a platform for small molecule identification developed by GeneBio. This software performs the last step of the small molecule identification workflow, the library search. In this step, all experimental spectra of a run are searched against a library and a result containing all the matches is generated (automated library search) (Mauron (2010)). Whereas spectra corresponding to the same molecule show great variations in peak intensities, fragments are generally conserved. In order to give less weight to the intensities while still taking them into account, SmileMS uses the rank of the intensity, instead of its absolute or relative value, as a robust metric (Mylonas et al. (2009)). X-Rank, the SmileMS algorithm, is based on probabilistic calculations, which means that it can be trained for specific conditions. When compared to the widely used dot product, X-Rank shows better performance. This is especially the case when the spectra to identify are acquired under different conditions than those of the spectra library.
All the small molecules identifications analyses performed in this thesis used SmileMS.
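To make the idea of ranking intensities more concrete, the following minimal sketch replaces each peak intensity by its rank within the spectrum. It is written in Python for illustration only and is not the X-Rank algorithm, whose probabilistic model is not reproduced here; the function name `intensity_ranks` and the peak values are hypothetical.

```python
def intensity_ranks(peaks):
    """Map each fragment m/z to the rank of its intensity (1 = most intense).

    `peaks` is a list of (mz, intensity) pairs; this representation is an
    assumption made for the example, not the SmileMS data model.
    """
    ordered = sorted(peaks, key=lambda peak: peak[1], reverse=True)
    return {mz: rank for rank, (mz, _) in enumerate(ordered, start=1)}


# Two hypothetical acquisitions of the same molecule: the fragments are
# conserved, the absolute intensities differ, the intensity order does not.
spectrum_a = [(91.1, 1200.0), (119.0, 800.0), (163.1, 150.0)]
spectrum_b = [(91.1, 5400.0), (119.0, 900.0), (163.1, 400.0)]

ranks_a = intensity_ranks(spectrum_a)
ranks_b = intensity_ranks(spectrum_b)
print(ranks_a == ranks_b)
```

In this example both spectra yield the same ranks although their intensities differ by a large factor, which is the kind of robustness a rank-based metric aims for.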
1.3 Challenges associated with LC-MSMS
1.3.1 Poor spectra reproducibility
Spectra obtained by tandem mass spectrometry tend to be poorly reproducible (Oberacher et al. (2009)). The same molecule can generate different spectra, varying in the peaks present and in their intensity. As mentioned in (Mylonas (2010)), the variability of the ionisation source and of the fragmentation source can partly explain spectra disparity. Moreover, matrix effects, which are the alteration of ionisation efficiency by the presence of co-eluting substances, also appear partially responsible for such variability. Figures 1.1 and 1.2 show spectra disparity in LC-MSMS compared to the reproducibility for the same spectra in GC-MSMS. Such variations, together with the presence of adducts or derived sub-compounds, tend to generate spectra that are generally neglected or misinterpreted.
1.3.2 Reference libraries scarcity
Whereas large GC-MS libraries are available, few spectra libraries adapted to LC-MSMS exist (Mylonas (2010)). The scarcity of LC-MSMS spectral libraries can be partly explained by the poor availability of certain substances. The cost of building such libraries is also an important factor. In fact, while data acquisition can be rapidly performed, selecting the reference spectra to fill the library is a time-consuming process relying on a human expert.
1.3.3 Elucidation rate
In a typical LC-MSMS run, the vast majority of spectra are not elucidated. This issue will be further illustrated in chapter 4, where we show that, in the best case, only 7% of spectra are identified. The remaining spectra may consist of poor quality spectra, but also of metabolites or contaminants not stored in reference libraries.
Figure 1.1: Spectra variability in LC-MSMS

Figure 1.2: Spectra reproducibility in GC-MSMS
1.4 Research goals
The aim of this thesis is to make use of usually neglected information in LC-MSMS analysis by looking at a run more globally and even comparing runs with each other. For this purpose, and to address the intrinsic variability associated with LC-MSMS acquisitions, we developed and validated a clustering algorithm that groups similar spectra. This tool makes it possible to investigate two challenges encountered in the LC-MSMS field:
• We propose a workflow to automatically build a reference library, and we compare the performance of this method to the usual human expert process. Moreover, we study whether such a workflow makes it possible to annotate common contaminants.
• In a second step, we try to increase the elucidation rate of an LC-MSMS analysis, looking at how different compounds are correlated with each other.
We believe that the results presented in this thesis confirm the need for a wider view on LC-MSMS runs and the value of usually neglected information.

Chapter 2
Spectra clustering
2.1 Problematic
After LC-MSMS analysis, small molecule identification consists of comparing experimental spectra to a reference library containing spectra of known molecules. Such libraries only include a limited number of spectra, and it is common to deal with spectra in a sample that do not match any of the molecules in the library. To illustrate this problem, let us mention the example of 70 acquisition runs taken from the Toxicology Department of the HUG. They were submitted against a reference library, and the ratio of the number of spectra matching against the library to the total number of spectra present in the run was calculated. Table 2.1 summarises our findings: the maximum percentage of identified spectra was only 7.2%, and in half of the runs, less than about 3.7% of spectra were identified.
measure | value
minimum | 0.01638
q1 | 0.02710
mean | 0.03708
median | 0.03716
q3 | 0.04353
maximum | 0.07193
Table 2.1: Proportion of identified spectra in 70 runs: the ratio of the number of identified spectra to the total number of spectra in the run was calculated for 70 runs acquired on a Bruker ion trap machine at the Toxicology Department of the HUG. A score threshold of 0.3 was chosen. The mean, median, quartiles (q1 and q3) and the extreme values are reported.

To the best of our knowledge, the part of a sample consisting of unidentified spectra has not been extensively studied, and we wanted to explore it in more depth. In this chapter we focus on how spectra produced by a single acquisition run can be clustered together, suggesting that they could correspond to the same molecule. Moreover, we wanted to search for spectra which did not match anything in the reference library, but which could be close to known molecules. These questions could be answered thanks to a clustering algorithm that we developed. We present a workflow to build clusters, as well as validation tests.
2.2 Methods
2.2.1 Data and library
In this research, we worked with a selection of samples from the Toxicology Department of the HUG. They were acquired on a Bruker ion trap machine (amaZon), and the data files were in XML format. In order to identify their content, they were submitted against the Freiburg library, built by Wolfgang Weinmann at Freiburg University, Germany (Dresen et al. (2009)). It is an electrospray ionisation tandem mass spectrometry (ESI-MS/MS) library which contains over 5,600 spectra of 1,253 compounds relevant in clinical and forensic toxicology. It has been developed using a hybrid tandem mass spectrometer with a linear ion trap, analysing pure compound solutions or, in some cases, solutions made of tablets. The Freiburg library content can be consulted in appendix A.
2.2.2 Clustering workflow
State of the art
Clustering has been studied in peptide LC-MSMS identification, where experiments often generate millions of spectra that can be used to identify thousands of proteins in complex samples (Frank et al. (2007)). Analysing such large datasets poses a computational challenge when searching millions of spectra against large protein databases, particularly if mutations and unexpected post-translational modifications (PTMs) are considered. Instead of repeating the identification process for each spectrum, it can be beneficial to perform this process once and apply the results to all similar spectra. Tabb et al. (2003) demonstrated how clustering can accelerate the analysis of runs, but at the cost of losing some spectra identifications. The MS2GROUPER algorithm improved the results, reducing by 20% the number of spectra that have to be searched while losing 1% of peptides (Tabb et al. (2005)). Beer et al. (2004) developed the Pep-Miner algorithm and applied it to 5000000 spectra. Although they demonstrated its usefulness in reducing running time and improving identification, this algorithm is not publicly available and little is known about its clustering performance. Moreover, the clustering algorithm is based on retention time prediction, which can be difficult to calibrate, especially when multiple runs are considered (Frank et al. (2007)). Clustering algorithms initially applied to Internet and database clustering have recently been adapted to MSMS analyses. Ramakrishnan et al. (2006) and Dutta & Chen (2007) proposed to use metric space embedding for MSMS database search and clustering. While these promising approaches offer a potential solution to the problem of clustering very large datasets, the applications of these new ideas were illustrated only with the related task of filtering candidates for database searches or with the clustering of relatively small spectral datasets (Frank et al. (2007)). Frank et al. (2007) developed the MS-Cluster algorithm, an optimisation of the Pep-Miner algorithm that can analyse millions of spectra.

Determining the similarity between spectra is a crucial step in spectra clustering. MS-Cluster uses a normalised dot product as similarity measure. This method has some drawbacks, such as the artificially high scores granted to spectra containing only a few peaks (Mylonas (2010)).
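The drawback of the normalised dot product for spectra with few peaks can be illustrated with a small sketch. The code below is a generic cosine-type similarity on binned peaks, written in Python for illustration; it is not the MS-Cluster implementation, and the bin width of 1 Da as well as the peak values are assumptions made for the example.

```python
import math


def normalised_dot_product(peaks_a, peaks_b, bin_width=1.0):
    """Cosine-type similarity between two peak lists of (mz, intensity) pairs.

    Peaks are summed into m/z bins of `bin_width` (an assumed value) before
    the dot product is computed and normalised by both vector norms.
    """
    def binned(peaks):
        vector = {}
        for mz, intensity in peaks:
            key = int(mz // bin_width)
            vector[key] = vector.get(key, 0.0) + intensity
        return vector

    a, b = binned(peaks_a), binned(peaks_b)
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


# Two hypothetical one-peak spectra sharing a single fragment: whatever
# their intensities, the normalised dot product is maximal.
single_peak_a = [(105.0, 300.0)]
single_peak_b = [(105.0, 12.0)]
score = normalised_dot_product(single_peak_a, single_peak_b)
print(score)
```

With a single shared fragment the score reaches its maximum, although one common peak is weak evidence that both spectra come from the same molecule.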
A new method
At best, usual clustering methods combine a retention time (RT) filter and a precursor mass filter with a dot product, which does not appear optimal to us as we have access to a much more robust scoring mechanism. In fact, X-Rank, the algorithm implemented in SmileMS (Mylonas (2010)), demonstrated better sensitivity, specificity and robustness than the dot product or derived scoring models. We therefore based our clustering algorithm on scoring, and we proceeded in the following way.

As SmileMS offers a powerful scoring model to score spectra against a library of reference spectra, the idea was simply to use an acquisition run as a reference library and score the run against itself. In a second step, based on the submission results we obtained, we grouped similar spectra together using our clustering algorithm, as described in figure 2.1. The different steps of this workflow are detailed in the following sections.

[Workflow diagram: experimental spectra -> SmileMS (1. insertion as ref library, 2. scoring) -> list of matches between experimental spectra -> clustering -> clusters of spectra]

Figure 2.1: Clustering workflow: experimental spectra from an acquisition run are inserted as reference library into SmileMS. The run is scored against itself. Spectra are then clustered based on obtained scores.
Submissions
The first step in our workflow consisted in identifying the content of a data file taken randomly from the HUG Toxicology Department. This process first involved the addition of this file to SmileMS as a new library of its own. The same file was then submitted against this added library. Both steps were performed using web services, which make it possible to use the power of the SmileMS platform from an external software. For both submissions, the main parameters were:
• a score threshold of 0.7: only matches between spectra with a higher score are considered. This value is rather conservative, in order to prevent unlikely matches;
• a filter on the precursor mass difference of ±0.2 Da: this limit was chosen to take into account the instrument precursor mass accuracy.
At the end of this submission process, we had a list of all experimental spectra present in our data file, and the scores of the matches between pairs of spectra in our sample.
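The effect of the two submission parameters can be sketched as a simple filter over candidate matches. This is an illustrative Python sketch under our own assumptions: the `Match` record and its field names are hypothetical and do not reflect the SmileMS web service interface, and the example values are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    """Hypothetical record for one scored pair of spectra."""
    query_id: str
    ref_id: str
    score: float
    query_precursor_mz: float
    ref_precursor_mz: float


def keep_match(match, score_threshold=0.7, precursor_tolerance_da=0.2):
    """Keep a match only if its score is above the threshold and the
    precursor masses differ by at most the tolerance."""
    mass_difference = abs(match.query_precursor_mz - match.ref_precursor_mz)
    return match.score > score_threshold and mass_difference <= precursor_tolerance_da


# Invented example values.
matches = [
    Match("sp1", "sp2", 0.91, 310.2, 310.3),  # kept
    Match("sp1", "sp3", 0.65, 310.2, 310.2),  # score too low
    Match("sp1", "sp4", 0.88, 310.2, 311.0),  # precursor masses too far apart
]
kept = [m for m in matches if keep_match(m)]
print([m.ref_id for m in kept])
```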
Clustering algorithm
Starting from a list of experimental spectra (annotated with the name of the corresponding molecule if known for later verification), and knowing the scores between each pair of spectra, the goal of the clustering process is to group the spectra that are similar. At the beginning of this clustering process, no cluster is created. Each of the experimental spectra is considered sequentially. For each one, the algorithm checks if this spectrum is already present in a cluster. Three scenarios are then possible:
• the answer is no: a new cluster is then created and contains the considered spectrum, as well as its matching spectra.
• the answer is yes and only one cluster contains the considered spectrum: the matching spectra of the considered experimental spectrum are added to this cluster.
• the answer is yes and several clusters contain the studied spectrum: these clusters are merged. Moreover, the matching spectra of the considered experimental spectrum are added to the resulting cluster.
At the end of the clustering process, a test is systematically performed to ensure that no spectrum belongs to several clusters. If matches are symmetrical, our clustering algorithm prevents such redundancy. It is important to understand that clustering is simply a partition of the set of all the spectra: every spectrum is in one and only one cluster at the end.

We can note that, in addition to scores, retention time could also be used to calculate the similarity between spectra, and it would be interesting to observe its effect in a further study.
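The three scenarios and the final test can be sketched as follows. This is a minimal Python illustration of the procedure described above, not the code used in this work; the function name `cluster_spectra`, the dictionary of matches and the spectrum identifiers are hypothetical, and matches are assumed to be symmetrical.

```python
def cluster_spectra(spectrum_ids, matches):
    """Group spectra following the three scenarios described above.

    `matches` maps a spectrum identifier to the identifiers of its matching
    spectra (assumed symmetrical). Returns a list of sets.
    """
    clusters = []
    for spectrum_id in spectrum_ids:
        group = {spectrum_id} | set(matches.get(spectrum_id, ()))
        containing, others = [], []
        for cluster in clusters:
            (containing if spectrum_id in cluster else others).append(cluster)
        if not containing:
            # Scenario 1: the spectrum is in no cluster yet, create a new one.
            clusters.append(group)
        else:
            # Scenarios 2 and 3: extend the single containing cluster, or
            # merge all containing clusters, then add the matching spectra.
            clusters = others + [set().union(*containing, group)]

    # Final test: no spectrum may belong to several clusters.
    seen = set()
    for cluster in clusters:
        if not seen.isdisjoint(cluster):
            raise ValueError("a spectrum belongs to several clusters")
        seen |= cluster
    return clusters


# Hypothetical symmetrical matches: a-b, b-c and e-f; d matches nothing.
example_matches = {"a": ["b"], "b": ["a", "c"], "c": ["b"], "e": ["f"], "f": ["e"]}
example_clusters = cluster_spectra(["a", "c", "b", "d", "e", "f"], example_matches)
print(sorted(sorted(cluster) for cluster in example_clusters))
```

In this example, "a" and "c" first create two clusters that both contain "b"; when "b" is considered, these clusters are merged (third scenario), while "f" simply falls into the cluster created by "e" (second scenario).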
2.2.3 Dealing with large data
When applied to data files containing single acquisition runs, the clustering workflow is fast, and no memory limitations are encountered. However, the study of several acquisition runs merged in the same data file required practical adjustments of two steps of our process because of memory constraints.

First, the format of the data file which was inserted into SmileMS as a new library and then submitted against this library had to be changed. In fact, parsing an XML file in SmileMS (a mandatory step in the library insertion and submission processes) involved the creation of a list of spectra objects, and then the storage of these objects in a database. In the case of a large data file, the number of generated objects was too large, leading to an "out of memory" exception. Using SDF and mgf formats (for library and submission respectively) solved this issue because the creation of an individual spectrum object was immediately followed by its storage in the database, preventing the generation of too many objects. Practically, we wrote Groovy methods that read the XML files, convert them to SDF and mgf format and merge the individual files.

The second step which had to be slightly modified was the reading of the submission result file. In fact, this file was too large to be handled by the Groovy XmlSlurper. Therefore, we had to parse the file manually, isolating the match results for each individual experimental spectrum, and use an XmlParser to parse the generated string.
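The principle behind the first adjustment, handling one spectrum at a time instead of building a complete list in memory, can be sketched as below. The methods of this work were written in Groovy; this Python version is only an illustration, and the dictionary keys and example values are hypothetical. The output follows the usual BEGIN IONS / END IONS layout of the mgf format.

```python
import io


def iter_spectra(records):
    """Yield one spectrum at a time (stand-in for an incremental file reader)."""
    for record in records:
        yield record


def write_mgf(spectra, handle):
    """Write each spectrum as soon as it is produced and return the count.

    Only one spectrum is held at a time, so memory use does not grow with
    the number of spectra in the input.
    """
    count = 0
    for spectrum in spectra:
        handle.write("BEGIN IONS\n")
        handle.write(f"TITLE={spectrum['title']}\n")
        handle.write(f"PEPMASS={spectrum['precursor_mz']}\n")
        for mz, intensity in spectrum["peaks"]:
            handle.write(f"{mz} {intensity}\n")
        handle.write("END IONS\n\n")
        count += 1
    return count


# Invented example records; real input would come from the acquisition files.
records = [
    {"title": "run1_scan12", "precursor_mz": 310.2, "peaks": [(265.1, 900.0), (105.0, 120.0)]},
    {"title": "run1_scan13", "precursor_mz": 286.1, "peaks": [(201.1, 450.0)]},
]
output = io.StringIO()
written = write_mgf(iter_spectra(records), output)
print(written)
```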
2.2.4 Clustering algorithm validation
Two aspects of clusters, which we will call compacity and discriminance, were investigated in order to validate the meaningfulness of our algorithm.
Cluster compacity
Cluster compacity refers to the distance (scores in our case) between spectra belonging to the same cluster.

In a given cluster, each spectrum was scored against the other cluster members, and the score distribution was studied using R, an environment for statistical computing and graphics. The hist function was used to obtain histograms of scores, and the summary function to get a numerical description of the data distribution, providing the following information:
• minimum value
• first quartile (q1): cuts off lowest 25% of data
• median: cuts data set in half
• mean
• third quartile (q3): cuts off highest 25% of data, or lowest 75%
• maximum value
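The six values listed above can also be computed outside R. The following Python sketch is given for illustration only (the analysis of this work was done with R); the pair scores are invented, and the "inclusive" quantile method is chosen on the assumption that it corresponds to the default interpolation of R's quantile function. Scores between a spectrum and itself are left out.

```python
import statistics


def between_member_scores(pair_scores):
    """Keep only scores between two different spectra of a cluster."""
    return [score for (a, b), score in pair_scores.items() if a != b]


def score_summary(scores):
    """Return minimum, quartiles, median, mean and maximum of the scores."""
    q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return {
        "minimum": min(scores),
        "q1": q1,
        "median": median,
        "mean": statistics.mean(scores),
        "q3": q3,
        "maximum": max(scores),
    }


# Invented scores for a hypothetical three-spectrum cluster.
pair_scores = {
    ("s1", "s1"): 1.0,
    ("s2", "s2"): 1.0,
    ("s1", "s2"): 0.7,
    ("s1", "s3"): 0.8,
    ("s2", "s3"): 0.9,
}
summary = score_summary(between_member_scores(pair_scores))
print(summary)
```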
Cluster discriminance
In this thesis, discriminance refers to the property of a cluster to contain only one molecule, or several very close molecules. In other words, a cluster is considered meaningful if it contains at most one known molecule. In order to know which molecules were present in the clusters, we submitted the spectra of these clusters against the Freiburg library (Dresen et al. (2009)).
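Once each spectrum of a cluster is annotated with the name of the identified molecule, or with nothing when it is unidentified, the criterion can be checked in a few lines. This Python sketch only covers the strict case of at most one known molecule per cluster and does not try to decide whether two molecules are "very close"; the function name and the example annotations are hypothetical.

```python
def is_discriminant(annotations):
    """Return True if the cluster contains at most one known molecule.

    `annotations` holds one entry per spectrum of the cluster: a molecule
    name, or None for an unidentified spectrum.
    """
    known_molecules = {name for name in annotations if name is not None}
    return len(known_molecules) <= 1


# Invented examples: one identified spectrum among unidentified ones,
# and a cluster mixing two different known molecules.
cluster_single = ["Morphine", None, None]
cluster_mixed = ["Morphine", "Methadone", None]
print(is_discriminant(cluster_single), is_discriminant(cluster_mixed))
```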
2.3 Results
2.3.1 Cluster compacity
To evaluate scores between spectra in a cluster, we studied fifteen acquisition runs merged together and submitted to the clustering process. Thirty of the obtained clusters were scored against themselves and we analysed the score distribution in each cluster. In order to prevent a bias in the results, we did not take into account scores between a spectrum and itself, which would be very high and artificially increase the mean score value for a cluster. Figures 2.2 and 2.3 show examples of the score distribution in two clusters. Score values are very high, meaning that our clustering algorithm indeed groups very similar spectra.

These positive results were further confirmed by the distribution of all obtained scores within the thirty studied clusters. As shown in Table 2.2, 75% of the scores are greater than 0.78, and half of the scores are above 0.86.
[Histogram of scores in cluster 173; x-axis: score value, y-axis: number of scores]
Figure 2.2: Scores distribution in cluster 173: As we can notice, most of the scores between pairs of spectra in this cluster are higher than 0.7, confirming that our clustering algorithm is able to group very close spectra.
Routinely, much lower score thresholds are chosen to decide that a match between spectra is good. Therefore, the score values we obtained indicate that our clustering algorithm indeed groups very similar spectra.

measure | value
minimum | 0.18
q1 | 0.78
mean | 0.82
median | 0.86
q3 | 0.91
maximum | 0.91
Table 2.2: Scores distribution in 30 clusters: We can notice that 75% of scores are above 0.78, and 50% above 0.86, confirming the validity of our clustering approach.
2.3.2 Cluster discriminance
[Histogram of scores in cluster 184; x-axis: score value, y-axis: number of scores]

Figure 2.3: Scores distribution in cluster 184: As we can notice, most of the scores between pairs of spectra in this cluster are higher than 0.7, confirming that our clustering algorithm is able to group very close spectra.

In order to assess whether the obtained clusters only contain one molecule or very closely related molecules, we analysed a single acquisition run. A description of the clusters containing known molecules is presented in table 2.3. First, and most importantly, we can notice that no cluster with more than one molecule was generated. Three molecules were identified in the studied run, each one appearing in a different cluster. Our clustering algorithm could successfully discriminate these molecules.

The study of spectra which were not associated with known molecules appears extremely interesting. While the spectrum corresponding to methadone was alone in a cluster, unidentified spectra were present in the clusters containing morphine and clomipramine. The precursor mass range in these clusters suggests that identified and unidentified spectra are probably very closely related (a slightly chemically modified molecule, for instance).

Much more surprisingly, our single acquisition run revealed the presence of many unidentified spectra close enough to be clustered together. In total, 514 clusters were built. Three of them contained identified molecules; the remaining clusters were composed of unidentified spectra only. Most of them only contained one spectrum, but 18 clusters had at least two unidentified spectra.
nbSp | mol | nbMatches | MozRange | RtRange
10 | Morphine | 1 | 175.9 − 177 | 0.02 − 5.6
1 | Methadone | 1 | 272.9 | 0.94
19 | Clomipramine | 1 | 265.9 − 267 | 0.07 − 8.76
Table 2.3: Summary of a single run acquisition clustering: nbSp = number of spectra in the cluster; mol = molecule associated with the spectra present in the cluster; nbMatches = number of spectra in the cluster which matched against the Freiburg library; MozRange = range of precursor masses of the spectra in the cluster; RtRange = range of retention times of the spectra in the cluster.
2.4 Conclusion
In this chapter we presented the development of a clustering algorithm for LC-MSMS spectra, based on SmileMS scoring. Validation tests showed the meaningfulness of this algorithm in two different ways.

Cluster compacity revealed clusters that are homogeneous from a scoring point of view. The scores distribution in thirty clusters showed that 75% of scores were above 0.78, and 50% above 0.86. In routine analysis, much lower score thresholds are chosen to determine the significance of a score, so our results indicate a very high similarity between spectra of the same cluster.

Cluster discriminance demonstrated that our clustering algorithm was able to separate the molecules present in the studied sample, as desired. This analysis also revealed 18 very interesting clusters containing at least two unidentified spectra. Such spectra could be, for instance, contaminants or metabolites of drugs present in the sample.

These findings lead us to wonder whether these unidentified spectra are also present in other runs, which opens the perspective of increasing the elucidation rate in future analyses. In order to answer this question, it was necessary to develop a solution to store the unidentified spectra in a reference library, against which further runs would be searched. This approach is presented in the next chapter.

Chapter 3
An automatic reference library building process
3.1 Problematic
A reference library is a set of spectra associated with some characteristics (such as experimental protocol, acquisition parameters, etc.), generally linked to a molecule identifier. After LC-MSMS analysis, experimental spectra are searched against such libraries, meaning that each experimental spectrum is compared to all reference spectra of the library, and a score is calculated.

Building a reference library involves choosing a spectrum corresponding to a molecule, which is not a trivial task. Typically, for a 200-compound library, data acquisition is highly automated and takes approximately 1.5 days. Then, a human expert must select the best spectrum for each compound and store its association with a molecule. This second step can typically last a couple of weeks. Moreover, this selection process is often subjective: two experts do not always have the same opinion when electing the "best spectrum". Finally, we must keep in mind that only spectra corresponding to known molecules are inserted into such libraries.

In the previous chapter, we presented a clustering algorithm for experimental spectra. We were then interested in keeping track of a "representative spectrum" for each of our clusters and using these spectra to populate an automatic library. As many of our clusters do not contain known molecules, we wanted to investigate whether these spectra (which could be contaminants, metabolites, etc.) were present across different samples.

Therefore, we developed an approach to automatically select a spectrum from a cluster and insert it into a library. In order to validate our approach, we compared the submission of data files against a reference library and against our automatic library.
3.2 Methods
3.2.1 Automatic library building workflow
Automatic library building is a direct application of the clustering procedure presented in the previous chapter. After the clustering step, an empty library is created in SmileMS, and the "most representative" spectrum of each cluster is chosen and inserted into this newly created library, as shown in figure 3.1.
3.2.2 Spectrum selection process
In each cluster, a representative spectrum must be selected. Frank et al. (2007) simply elected the best spectrum as the one with the highest signal-to-noise ratio. We do not want to select the "best looking" spectrum, but the one which performs best with regard to the scoring mechanism.

In a given cluster, for each spectrum, the sum of the scores between the considered spectrum and the other cluster members is calculated. The procedure is repeated for each spectrum of the cluster, and the spectrum having the highest sum of scores is considered the most representative spectrum of the cluster (being the closest to the other spectra in the cluster). If the cluster $C$ is the set of $n$ spectra $s_1, \ldots, s_n$, let $\hat{s}(C)$ be the best matching spectrum, i.e.:
\[
\hat{s}(C) = s_k \quad \text{such that} \quad \sum_{i=1}^{n} s_k \cdot s_i = \max_{j=1 \ldots n} \left( \sum_{i=1}^{n} s_j \cdot s_i \right) \tag{3.1}
\]
Our simple metric to choose a representative spectrum is based on scoring. This criterion would not necessarily correspond to a human expert's choice, but the goal is to select the most efficient spectrum from the scoring point of view, not from the chemist's. We will present in section 3.3 how this choice performs.

We could have built a consensus spectrum from the cluster (thus changing the function $\hat{s}(C)$ from equation 3.1) with ideas such as those of Mueller et al. (2007), who kept only peaks present in a given number of spectra from the cluster, but our simple choice proved to be robust enough.
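The selection rule of equation 3.1 can be sketched in a few lines of Python. This is only an illustration, not the code used in this work (which relied on SmileMS, R and Groovy): the function name `select_representative` and the score matrix are hypothetical, and the pairwise scores are assumed to be already available as a square matrix.

```python
from typing import Sequence


def select_representative(scores: Sequence[Sequence[float]]) -> int:
    """Return the index k of the spectrum with the highest summed score.

    scores[j][i] is assumed to hold the pairwise score between spectra
    j and i of one cluster. As in equation 3.1, the sum runs over all i,
    including i == j; if every self-score is the same, this does not
    change which spectrum is selected.
    """
    n = len(scores)
    if n == 0:
        raise ValueError("a cluster must contain at least one spectrum")
    row_sums = [sum(scores[j][i] for i in range(n)) for j in range(n)]
    # In case of a tie, max() keeps the first spectrum reaching the maximum.
    return max(range(n), key=row_sums.__getitem__)


# Hypothetical cluster of three spectra (values invented for illustration).
cluster_scores = [
    [1.00, 0.90, 0.80],
    [0.90, 1.00, 0.85],
    [0.80, 0.85, 1.00],
]
representative = select_representative(cluster_scores)
```

In this invented example the second spectrum (index 1) has the largest row sum and would be the one inserted into the library.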
3.2.3 New library creation
Once a spectrum $\hat{s}(C)$ is elected from cluster $C$, it must be inserted into the library. Based on 20 acquisition runs from the HUG, a new reference library was created in SmileMS using web services, which make it convenient to interact with the platform from external code. Such services also allow spectra to be inserted into, updated in and deleted from the new library.

For this experiment, each run, stored as a file, was treated successively. We inserted the runs one after the other because we encountered memory problems at that time, even with the tricks to deal with large data presented in chapter 2. This means that the same compound can be registered several times, from different runs. However, we accepted redundancy for these specific experiments because we wanted to encounter the widest variety of known molecules. Run acquisitions being patient samples, containing only a few molecules, the number of considered runs was important.

From a practical point of view, an automatic library contains the following information on inserted spectra:
• job ID
• spectrum ID
• MS level
• precursor mz
• retention time
• compound name (associated molecule, if any is available)
• title (job ID - spectrum ID)
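The fields listed above can be pictured as a simple record. The following Python sketch is purely illustrative: the class name `LibraryEntry`, the field names and types, and the example values are assumptions, and it does not reflect the actual SmileMS data model or web service interface.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LibraryEntry:
    """Hypothetical record for one spectrum of an automatic library."""
    job_id: str
    spectrum_id: str
    ms_level: int
    precursor_mz: float
    retention_time: float
    # Associated molecule, if any is available.
    compound_name: Optional[str] = None

    @property
    def title(self) -> str:
        # The title is built as "job ID - spectrum ID".
        return f"{self.job_id} - {self.spectrum_id}"


# Invented values, for illustration only: an unidentified MS2 spectrum.
entry = LibraryEntry(
    job_id="job42",
    spectrum_id="sp7",
    ms_level=2,
    precursor_mz=300.1,
    retention_time=4.2,
)
```

Leaving `compound_name` empty for unidentified spectra keeps them searchable through their title alone.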
3.2.4 Automatic library validation
In order to test the performance of our automatic library, we submitted 10 new files from the Toxicology Department of the Geneva University Hospital against it and against the Freiburg’ library, i.e. the subset of the Freiburg library restricted to the molecules contained in the samples used to build our library.
[Workflow diagram: experimental spectra → SmileMS (1. insertion as reference library; 2. scoring) → list of matches between experimental spectra → clustering → clusters of spectra → spectra selection → best spectra → insertion → automatic library]
Figure 3.1: Automatic library building workflow: using our clustering algorithm described in figure 2.1, similar spectra of a run are grouped together. A representative spectrum of each cluster is automatically selected and inserted into a newly created library.
The aim of this experiment is to compare the molecules identified against both libraries. We expect that molecules identified against the Freiburg’ library will also be identified against the automatic library.

This comparison requires that the spectra inserted into the automatic library be annotated. Therefore, before filling the automatic library, spectra were submitted against the Freiburg library. A score threshold of 0.3 was chosen for all submissions. Our objective was, at that point, not to identify the most appropriate threshold, but to make sure that the identification against our automatic library was equivalent to the identification against the Freiburg’ library.
Let $S_{test}$ be the set of spectra from the 10 test runs identified against both the Freiburg’ library and the automatic library with a score ≥ 0.3. Let $x_{auto}(s)$ be the identification score of a spectrum $s$ searched against the automatic library, and $x_{fr}(s)$ its identification score against the Freiburg’ library. Using R and Groovy, we quantified the performance of our automatic library by calculating the following metrics:
• nbcommons = $|S_{test}|$, the number of common identifications between both libraries, and nbmissed, the number of molecules identified in the Freiburg’ library but not found in our automatic library;
• the distribution of $x_{auto}(s)$, $s \in S_{test}$;
• the distribution of the ratio $r_{sim}(s) = x_{auto}(s) / x_{fr}(s)$ for all spectra $s \in S_{test}$.
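These metrics can be illustrated with a short Python sketch. It is not the R and Groovy code used for the analysis: the function `compare_libraries`, the input layout (a mapping from spectrum identifier to molecule name and score) and the example values are assumptions, and a molecule is counted as missed here when it is reported against the Freiburg’ library but by no identification against the automatic library.

```python
from statistics import median


def compare_libraries(auto, fr, threshold=0.3):
    """Compare identifications of the same spectra against two libraries.

    auto, fr: dict mapping spectrum id -> (molecule name, score), for the
    automatic library and the Freiburg' library respectively (hypothetical
    layout). Only identifications with score >= threshold are kept.
    """
    auto_ok = {s: v for s, v in auto.items() if v[1] >= threshold}
    fr_ok = {s: v for s, v in fr.items() if v[1] >= threshold}
    # S_test: spectra identified against both libraries.
    s_test = sorted(auto_ok.keys() & fr_ok.keys())
    # Molecules found against Freiburg' but never against the automatic library.
    missed = {m for m, _ in fr_ok.values()} - {m for m, _ in auto_ok.values()}
    x_auto = [auto_ok[s][1] for s in s_test]
    # fr scores are >= threshold > 0, so the division is safe.
    r_sim = [auto_ok[s][1] / fr_ok[s][1] for s in s_test]
    return {
        "nb_commons": len(s_test),
        "nb_missed": len(missed),
        "x_auto": x_auto,
        "r_sim": r_sim,
    }


# Invented scores, for illustration only.
auto = {"sp1": ("Citalopram", 0.5), "sp2": ("Clomipramine", 0.75),
        "sp3": ("Midazolam", 0.25)}
fr = {"sp1": ("Citalopram", 0.5), "sp2": ("Clomipramine", 0.5),
      "sp3": ("Midazolam", 0.75)}
result = compare_libraries(auto, fr)
median_ratio = median(result["r_sim"])
```

A ratio close to 1 means that both libraries score the spectrum similarly; a ratio above 1 means that the automatic library scores it higher.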
3.3 Results
Applying the strategy described in section 3.2, we built an automatic library based on twenty files. In order to test our approach, we submitted 10 runs against the Freiburg’ library and against our automatic library, and asked the following questions: How many common identifications do we have against both libraries? For these common identifications, what do the scores look like? Is $x_{auto}(s)$ similar to $x_{fr}(s)$ for these common identifications?

As indicated in table 3.3, between 2 and 12 spectra per run matched against both our automatic library and the Freiburg’ library. More importantly, in half of the submitted runs, no missed identification was reported. In each of the remaining runs, one molecule was present in the samples used to build our library but was not inserted into it (and was therefore not identified).
The analysis of the scores ($x_{auto}$ and $x_{fr}$) for these common identifications is very informative. Table 3.1 shows the details of three runs submitted against the two libraries. The molecules identified against both libraries are presented with their scores. These scores are generally very high, which supports a robust identification of these molecules. The distribution of $\{x_{auto}(s) \mid s \in S_{test}\}$ is presented in table 3.2. With a median of 0.9025, half of the scores are above this value, which supports our spectrum selection choice.
The scores $x_{auto}(s)$ and $x_{fr}(s)$, for a given spectrum $s$, are not only high against both libraries, but also very similar. To confirm this impression, we calculated, for each molecule identified against both libraries, the ratio $r_{sim}(s) = x_{auto}(s) / x_{fr}(s)$. Table 3.4 summarises the distribution of the obtained ratios. 75% of the ratios lie between 0.9506 and 1.069, and are therefore extremely close to 1, confirming the high similarity between the two compared scores. The remaining ratios, 25% of cases, are above 1.07, with a maximum value of 2.052, showing a tendency for the scores against our automatic library to be higher than the scores against the Freiburg’ library.
runid  mol(s)          xauto(s)  xfr(s)
1      Citalopram      0.704     0.701
1      D4-Haloperidol  0.913     0.904
1      D4-Haloperidol  0.913     0.908
1      Clomipramine    0.913     0.864
7      D4-Haloperidol  0.913     0.902
7      Primidone       0.681     0.556
7      Primidone       0.750     0.721
8      Midazolam       0.913     0.899
8      Midazolam       0.913     0.912
8      D4-Haloperidol  0.913     0.912

Table 3.1: Comparative submission against the Freiburg’ library and an automatic library: 10 runs were submitted against both libraries. The example of three of these runs is presented, with runid ∈ {1, 7, 8}. Molecules identified against both libraries are reported (mol(s) is the compound associated with spectrum s in the Freiburg library), with their score against the automatic library (xauto(s)) and against the Freiburg’ library (xfr(s)).
measure  value
minimum  0.3870
q1       0.8572
median   0.9025
mean     0.8558
q3       0.9130
maximum  0.913

Table 3.2: Scores distribution against an automatic library {xauto(s) | s ∈ Stest}: for all common molecules identified against the Freiburg’ library and our automatic library, the distribution of scores versus the automatic library is presented. The median of 0.9025 confirms the validity of our spectrum selection choice.
runid  nbcommons  nbmissed
1      2          1
2      4          0
3      2          0
4      5          0
5      5          1
6      6          1
7      3          1
8      3          0
9      6          0
10     12         1

Table 3.3: Identifications against an automatic library: 10 runs were submitted against our automatic library and against the Freiburg’ library. The number of common matches between both libraries (nbcommons) and the number of molecules present in the Freiburg’ library but not inserted into our library, and therefore not identified (nbmissed), are reported.
measure  value
minimum  0.9506
q1       1.006
median   1.013
mean     1.112
q3       1.069
maximum  2.052

Table 3.4: Comparison of identification scores between the Freiburg’ library and an automatic library: 10 runs were submitted against the Freiburg’ library and an automatic library. For each common identification s ∈ Stest, the ratio rsim(s) = xauto(s) / xfr(s) was calculated. The distribution of rsim(s) is presented. 75% of the ratios lie between 0.9506 and 1.069, confirming that xauto(s) and xfr(s) are very close.
3.4 Conclusion
In this chapter, we presented an approach to select a representative spectrum from a cluster of MSMS spectra. Our choice was to take advantage of the robust SmileMS scoring to elect the best performing spectrum. We then built an automatic library with the selected spectra using twenty acquisition runs.

Our approach was tested by submitting 10 files against our home-made library and against the Freiburg’ library. We reported the number of commonly identified molecules, as well as the number of missed identifications, taking into account the molecules that were encountered in the 20 files used to build the library. Few missed identifications were observed (a single one in half of the runs, none in the others); they are explained by the fact that the encountered molecule was not inserted into our library because another spectrum from its cluster was selected. In spite of the small size of our automatic library, between 2 and 12 common molecules per run could be identified. The scores of these common molecules are very high against the automatic library (in 50% of cases, they are above 0.90), and very close to the scores against the Freiburg’ library. These results support our spectrum selection choice as well as the validity of our approach.

Our workflow allowed us to create a personalised library and to automate an otherwise time-consuming process. We of course used a limited number of acquisition runs, and could therefore identify only a small number of molecules. We worked with samples taken randomly from the HUG in "real-life" conditions, and containing few molecules. By spiking molecules separately (the protocol usually followed to build a library, which lasts one day and a half), we could encounter all molecules present in the Freiburg library and therefore build a useful library. Moreover, we encountered memory limitations, which prevented us from using large data files (merged runs) and led to a probable redundancy in our library. Optimising the code could probably solve this issue and thus prevent redundancy.

In the next chapter, a smaller but non-redundant library will be built in order to study unidentified spectra. In chapter 2 we showed their large number in an acquisition run, and looking for the recurrence of such spectra across runs will be of particular interest.

Chapter 4
Seventy runs analysis: molecules recurrence and co-occurrence
4.1 Problematic
In chapter 2, we showed that LC-MSMS analysis of runs obtained from the Geneva University Hospital leads to the identification of some known molecules. However, it is very common to face experimental spectra which do not correspond to any of the reference spectra stored in the Freiburg library. In chapter 3, we presented a workflow to build a reference library with such unidentified spectra, and we are now interested in investigating the recurrence of unidentified spectra across runs. Does a particular unidentified spectrum appear in many different runs, or is it isolated in one run?

In a second step, we will address the question of correlations between the presence of certain spectra. In other words, is a particular spectrum present in a run more often than by chance if another spectrum is also present? Such correlations could bring very useful information, such as confirming the presence of a particular molecule using the presence of another molecule. Moreover, recognising unidentified spectra thanks to their correlation with other spectra could increase the elucidation rate.

In order to answer these questions, we generated an automatic library based on 15 files. Seventy runs were then submitted against this library, and the recurrence and correlation of unidentified spectra were studied.
4.2 Methods
4.2.1 Library and data files
A reference library was automatically created as described in chapter 3. This library was filled with spectra selected, after clustering, from an SDF file containing 15 runs from the HUG. In order to build a robust library, a score threshold of 0.7 was chosen. Seventy runs from the HUG were submitted against this automatic library in order to study the recurrence of spectra across these runs. To visualise the results, an Excel sheet was generated, listing exhaustively the spectra identified across the seventy runs and the number of times they were encountered in each run.
4.2.2 Spectra recurrence analysis
The recurrence of spectra across the seventy runs was represented graphically using R. The resulting curve was then fitted in order to find the distribution of the number of occurrences of a molecule across the 70 runs.
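The counting step behind such a recurrence curve can be sketched as follows. This Python fragment is only an illustration under assumptions: the analysis itself was carried out in R, the function names and the input layout (one mapping per run, from library spectrum title to the number of times it was matched) are hypothetical, and the fitting step is not shown.

```python
from collections import Counter


def recurrence(runs):
    """Count, for each library spectrum, the number of runs where it was found.

    runs: dict mapping run id -> dict mapping spectrum title -> number of
    times the spectrum was matched in that run (hypothetical layout).
    """
    counts = Counter()
    for hits in runs.values():
        for title, n in hits.items():
            if n > 0:
                counts[title] += 1
    return counts


def recurrence_histogram(counts):
    """Number of library spectra observed in exactly r runs, for each r."""
    return Counter(counts.values())


# Invented example with three runs.
runs = {
    "run1": {"a": 2, "b": 1},
    "run2": {"a": 1},
    "run3": {"a": 1, "c": 3},
}
per_spectrum = recurrence(runs)
histogram = recurrence_histogram(per_spectrum)
```

The histogram (number of spectra seen in exactly r runs) is the kind of quantity that can then be plotted against r.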
4.2.3 Correlations
Hypergeometric distribution
After submitting 70 runs against our automatic library and studying molecule recurrence, the central question was to test if correlations between molecules exist. In other words, are there pairs of molecules which are present together in several runs more often than by chance? In order to answer this question, we calculated the probability of our observations.
Considering $N$ runs, with molecule A ($mol_A$) observed in $n_A$ runs and molecule B ($mol_B$) observed in $n_B$ runs, we can count the number of times $n_{A \cap B}$ that both molecules are seen together in the same run. This observable $n_{A \cap B}$ follows a hypergeometric distribution.

A hypergeometric experiment typically consists in drawing, without replacement, $n$ marbles from an urn containing $N$ marbles in total, $N_A$ of which are of interest. If $X$ is a random variable following a hypergeometric distribution, equation 4.1 describes the probability that $X = k$ successes (Lecoutre (2002)):