Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology 346

Managing and Exploring Large Data Sets Generated by Liquid Separation - Mass Spectrometry

DANIEL BÄCKSTRÖM

ACTA UNIVERSITATIS UPSALIENSIS
UPPSALA 2007

ISSN 1651-6214
ISBN 978-91-554-6976-4
urn:nbn:se:uu:diva-8223




     ! "        " #   "  $ %&&"    " '()%*+,  "   

© Daniel Bäckström 2007

Papers included in this thesis

This thesis is based on the following Papers, which are referred to in the text by their Roman numerals as indicated below:

I. Nina Johannesson, Louise Olsson, Daniel Bäckström, Magnus Wetterhall, Rolf Danielsson, Jonas Bergquist. Screening for biomarkers in plasma from patients with gangrenous and phlegmonous appendicitis using CE and CEC in combination with MS. Electrophoresis, 2007, 28, 1435-1443.

II. Rolf Danielsson, Daniel Bäckström, Sara Ullsten. Rapid multivariate analysis of LC/GC/CE data (single or multiple channel detection) without prior peak alignment. Chemometrics and Intelligent Laboratory Systems, 2006, 84, 33-39.

III. Sara Ullsten, Rolf Danielsson, Daniel Bäckström, Per Sjöberg, Jonas Bergquist. Urine profiling using capillary electrophoresis-mass spectrometry and multivariate data analysis. Journal of Chromatography A, 2006, 1117, 87-93.

IV. Daniel Bäckström, My Moberg, Per Sjöberg, Jonas Bergquist, Rolf Danielsson. Multivariate comparison between peptide mass fingerprints obtained by Liquid Chromatography-Electrospray Ionization Mass Spectrometry with different trypsin digestion procedures. Journal of Chromatography A, submitted.

V. Daniel Bäckström, Erik Allard, Rolf Danielsson, Per Sjöberg, Jonas Bergquist. Comparing CE-MS fingerprints of urine samples, obtained after intake of coffee, tea or water. Manuscript.

Reprints were made with kind permission from the publishers.

Author Contribution

In Papers I-V, I was responsible for the data management, including system maintenance and code implementation. In Paper I, I also assisted in the data evaluation and the writing of the Paper. In Papers II and III, I took part in the development of algorithms and their implementation, as well as the evaluation of the results, and assisted in the writing. In Paper IV, I took part in the planning and was responsible for the data management. I also took part in the evaluation and the writing of the Paper, as well as the algorithm implementation. In Paper V, I took part in the data evaluation as well as in the writing of the Paper and the algorithm implementation.

Contents

1. Introduction
2. Liquid Separations
   2.1 Capillary Electrophoresis
   2.2 Liquid Chromatography
3. Mass Spectrometry
   3.1 Electrospray Ionization
   3.2 Mass Analyzers and Detectors
4. Hyphenated LS-MS and complex samples
   4.1 Applications
       4.1.1 ‘The biomarker study’
       4.1.2 ‘The paracetamol study’
       4.1.3 ‘The digest study’
       4.1.4 ‘The beverage study’
       4.1.5 The use of CE vs. CEC with mass spectrometry
       4.1.6 The use of CE vs. LC with mass spectrometry
       4.1.7 The use of TOF/Quadrupole/FTICR mass spectrometry
5. Managing LS-MS data
   5.1 Software
   5.2 Generating data
   5.3 Preprocessing of spectra
   5.4 Exporting data
   5.5 Applications
       5.5.1 ‘The biomarker study’
       5.5.2 ‘The paracetamol study’
       5.5.3 ‘The digest study’
       5.5.4 ‘The beverage study’
6. Exploring LS-MS Data
   6.1 Exploring single runs
   6.2 Comparing two runs
       6.2.1 Intensity scaling and normalization
       6.2.2 Peak alignment
       6.2.3 Data compression

   6.3 Exploring the whole data set
       6.3.1 The ‘fuzzy’ correlation matrix
       6.3.2 Cluster analysis and dendrograms
       6.3.3 PCA and scores
       6.3.4 PCA and loadings
       6.3.5 Selection of scores by ANOVA
       6.3.6 Variable assessment with target correlation (contrasts)
7. Conclusions and future aspects
8. Acknowledgements
9. Summary in Swedish
   9.1 Introduktion till LS/MS
   9.2 Hantering och utvärdering av LS/MS-data
   9.3 Projekt
10. References

Abbreviations

LS      Liquid separation
BPC     Base peak chromatogram
CE      Capillary electrophoresis
CEC     Capillary electrochromatography
COW     Correlation optimized warping
EOF     Electroosmotic flow
ESI     Electrospray ionization
FTICR   Fourier transform ion cyclotron resonance
i.d.    Inner diameter
LC      Liquid chromatography
MAD     Median absolute deviation
MALDI   Matrix assisted laser desorption ionization
MS      Mass spectrometry
PCA     Principal component analysis
SVD     Singular value decomposition
TIC     Total ion chromatogram
TOF     Time-of-flight mass spectrometer
XIC     Extracted ion chromatogram

1. Introduction

Analytical chemists need to analyze a broad range of complex samples, often with very low concentrations of the analytes in relation to the surrounding material (the matrix). The complexity of the samples, as well as the desired precision and accuracy, influences the choice of the method(s) used. Sample pretreatment and separation are also crucial for the generation of reliable and usable data.

In view of this, an analytical chemist must be able to handle a range of detection methods, and realize how to best combine them with prior steps in the analytical chain. More recently, the handling of data has become more of an issue. With the techniques available today, it is easy to generate data corresponding to a full DVD (4.7 GB) per analyst and hour. To handle data of that magnitude, we need computational power and lots of electronic memory (RAM) as well as storage space. A number of methods to extract the relevant information have been developed, and ever more elaborate ones are still being published at a high pace. This thesis partly focuses on the management of data. It is possible today to generate far more data than the computers can handle without help and planning. The help and planning includes cunning containers for the data, i.e. file formats, and equally cunning structures within the containers.

When the data is well structured inside a well-chosen file format, we need techniques for extracting the relevant information; this is sometimes referred to as the ‘needle in the haystack’ problem. In this case, the ‘haystack’ is the size of a small town, and the ‘needles’ are many and look similar to hay. To top it off, in many cases the most interesting information is not the ‘needles’ themselves, but their mutual relative positions.

An additional problem is the increasing use of proprietary software, often delivered together with the analysis instruments. The problem does not lie in poor quality of this software; the issues at stake are rather the following: (i) it is problematic to analyze and compare data generated by instruments of different brands, causing a problem if the machine park is inhomogeneous; (ii) to deploy a function or algorithm on the data, access to the source code is needed (and the permission to use it), a quite unlikely scenario today.


This thesis presents a set of algorithms suitable for the management and exploration of large two-way data sets generated by liquid separation–mass spectrometry. All algorithms were implemented in Matlab and were applied to real data sets obtained from complex biological samples.

2. Liquid Separations

Liquid separation (LS) encompasses all types of separation techniques where the dissolved sample is surrounded by a liquid environment (mobile phase/electrolyte) driven through a separation medium. These techniques can roughly be divided into two groups, depending on the method used to drive the separation. In capillary electrophoresis (CE), an electric field is applied over the capillary and charged analytes start to migrate. Analytes experience more or less friction as they migrate under the influence of the electric field. Furthermore, depending on the capillary surface characteristics, such as charge polarity and density, the liquid will also be transported through the capillary. In liquid chromatography (LC), a mechanical pump pushes the bulk solution and the sample through the column in the desired direction. During this, the analytes partition into a stationary phase and elute at different times.

2.1 Capillary Electrophoresis

The use of electrophoresis as a separation technique can be dated back to the 1930's, when Arne Tiselius developed the technique [1]; it was further developed by Stellan Hjertén during the 1950s and 60s [2]. Capillary electrophoresis, which was used in Papers I, III and V, is today a well established tool in analytical chemistry. This is due to the rapid separation of species (less than 5 min/run is common), the ability to handle very low sample volumes and the very high separation efficiency, as well as its relatively low cost (both to buy and to run). To set up a crude but fully functional CE system, little more is needed than two vials with buffer solution, a fused silica capillary, a high voltage power supply and a detector (Figure 1).

The CE separation is most commonly performed in fused silica capillaries. At high pH, the silanol groups (Si-OH) on the inside of the column are deprotonated and therefore charged. Very close to the capillary inner surface, there is a static double layer of cations. On top of this layer, where the electrostatic pull from the capillary wall is weaker, another layer, called the “slipping plane”, is formed. It is the cation movement in the slipping plane towards the negative electrode that induces the electroosmotic flow (EOF).


Figure 1. Schematic picture of a capillary electrophoresis system.

Figure 2. A schematic picture presenting the two effects that, summed, describe the mobility of each analyte.

All analytes will be influenced by the EOF, but if charged (which is normally preferred in CE) they will also be affected by the electric field applied over the capillary. The analyte migration velocity (and in some cases also direction) will deviate from the EOF velocity. The observed linear velocity of the analyte depends on both the linear EOF velocity and the pull of the applied electric field (proportional to the analyte charge).

The deviation from the linear EOF velocity depends on analyte geometry, size and charge. A large analyte will be more affected by the surroundings and thus have an observed migration velocity closer to the EOF velocity, compared to a smaller analyte with equivalent charge. Consequently, CE separates analytes with respect to both size and charge (Figure 2). As efficient a separation as possible is normally desired, so a high field strength is preferred. The main factors resulting in zone broadening are temperature gradients, sample introduction and capillary wall interactions.

Application of a high electric field over the capillary leads to a phenomenon called Joule heating, caused by ion friction as current passes through the capillary. In a CE capillary there will always be some heat generated, but as the capillary is very thin (normally 25-100 μm i.d.), the heat dissipation to the surrounding environment is high enough for a stable temperature equilibrium, and the radial temperature gradient in the capillary will be relatively flat. If the applied field is too high, the temperature gradient becomes steeper, and with viscosity and ion mobility changing by about 2% per degree Celsius, the result is rather severe zone broadening [3, 4]. Using an electric field of lower strength, the primary contribution to the peak width is sample introduction.

The inner wall of the capillary is normally Si-OH, resulting in a pH-dependent EOF velocity. Other possible problems when analyzing positively charged analytes are due to electrostatic (and hydrophobic) interactions between the positively charged analytes and the negatively charged capillary wall, which may result in band broadening, peak tailing and possible memory effects. The capillary wall can be coated with dynamic (Paper I) [5-8] or permanent [7-12] surfaces. By using a coating, it is possible to change the direction of the EOF and to reduce the pH dependency of the EOF velocity. It is also possible to introduce new wall-analyte interactions for enhanced separation between analytes with similar mobilities in normal CE columns. Coating the capillary wall can also introduce a chromatographic feature called capillary electrochromatography (CEC), which was used in Paper I. Both open tubular capillaries and capillaries filled with packing material exist, the latter for an enhanced ratio between stationary and mobile phase.

Among the strengths of CE is a routinely achieved efficiency of several hundred thousand theoretical plates per meter. An efficiency that high leads to very narrow peaks, from a few seconds down to below a second in width. This puts high demands on the sampling rate of the detector. A general rule is to acquire at least 8-10 data points per peak; thus the chosen detector must be capable of a sampling rate of 10-40 Hz. The amount of data generated will consequently be high even for very short runs, a challenge in itself that will be addressed in later sections.

2.2 Liquid Chromatography

Liquid chromatography (LC), which was used in Paper IV, is a commonly used separation method, well adapted to non-volatile and thermally unstable analytes. In LC, the mobile phase is driven through a tube containing small particles, i.e. a packed column, using a mechanical pump. A wide variety of mobile phases as well as packing materials are used. The terminology employed in analytical-scale LC is presented in Table 1.

Table 1. Nomenclature for packed column LC [13]

Column i.d. | Typical flow in reversed phase [min⁻¹] | Classification
3.2-4.6 mm  | 0.5-2.0 mL  | Conventional
1.5-3.2 mm  | 100-500 μL  | Microbore
0.5-1.5 mm  | 10-100 μL   | Micro
150-500 μm  | 1-10 μL     | Capillary
10-150 μm   | 10-1000 nL  | nano/nanoscale

In the column, separations occur due to the analytes' different partition coefficients between the stationary phase (packing material) and the mobile phase (eluent). Important factors for LC efficiency are flow rate, size and homogeneity of the packing material, diffusion coefficient, column diameter and column length.

The optimal flow rate is often visualized by plotting plate height versus linear flow rate, with the optimum at the local minimum. The reduced flow rate (ν) is defined as

$$\nu = \frac{u\,d_p}{D_m} \qquad (1)$$

where u is the linear flow rate, d_p the particle diameter of the packing material and D_m the diffusion coefficient. The reduced velocity is related to the reduced plate height (h) according to the Knox equation

$$h = A\nu^{0.33} + \frac{B}{\nu} + C\nu \qquad (2)$$

where A, B and C represent flow dispersion, longitudinal diffusion and resistance to mass transfer, respectively [14]. As a normalized measure of the column efficiency, the reduced plate height facilitates comparison between different columns.
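As a numerical illustration of Eq. 2, the optimum reduced velocity can be found by minimizing h. The sketch below is my own, with arbitrary Knox coefficients rather than values from the thesis.

```matlab
% Minimal sketch: locate the optimal reduced velocity in the Knox
% equation h = A*v^0.33 + B/v + C*v (coefficients are illustrative only).
A = 1; B = 2; C = 0.05;            % assumed coefficients
h = @(v) A*v.^0.33 + B./v + C*v;   % reduced plate height
vOpt = fminbnd(h, 0.1, 100);       % reduced velocity at the minimum
fprintf('optimum at v = %.2f, h = %.2f\n', vOpt, h(vOpt));
```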

For the capillary columns used in this thesis, with inner diameters below 300 μm, the reduced plate height decreases and the reduced flow rate increases with decreasing column diameter [15, 16]. This has been suggested to be due to a decrease in inhomogeneities of both the mobile phase flow and the packing of the stationary phase, as well as the shorter time needed for mobile phase molecules to diffuse between the middle of the column and the column wall (diffusional relaxation) [16].

Without doubt there are advantages in using micro-size or even smaller columns, such as the need for less packing material, a reduced use of mobile phase, and higher ionization and transfer efficiency when combined with electrospray ionization (ESI) mass spectrometry (MS). However, there are a few challenges attached to their use too. Only recently has it become possible to buy high-quality commercially made capillary columns. Furthermore, making good columns in the lab in a reproducible way requires some skill. Finally, the demands on the pumping systems are high, with the need to produce and maintain a stable gradient flow below 50 μL/min.

3. Mass Spectrometry

A mass spectrometer is a device capable of measuring the mass-to-charge ratio (m/z) of ions present in the gas phase. Combined with a separation system, it is today one of the most effective ways we have of analyzing complex samples with many low-abundance analytes, as is common in biological samples. A restriction all of today's mass spectrometers have in common is the need for the ions to be in the gas phase before being introduced into the mass analyzer; hence, there is a need for a good ion source. A number of different ion sources exist, each with its own merits [17-25]. The only one used in this thesis, and one of the most commonly used ion sources today for samples in the liquid state, is electrospray ionization (ESI).

A mass spectrometer, shown schematically in Figure 3, consists of an ion source, ion optics to focus the ion beam, and a mass analyzer and detector to determine the mass-to-charge ratio and intensity of the ions present in the beam. The information must then be stored, either locally or on a network-attached storage device (NAS).

Figure 3. A schematic picture of a mass spectrometer. The example is of a time-of-flight mass spectrometer with an electrospray ionization interface.

3.1 Electrospray Ionization

Electrospray ionization (ESI) is, simply put, a nebulizer that transfers ions in the liquid phase to ions in the gas phase using a high electric field. With a high electric field (a few kV/cm) generated between a hollow emitter electrode (the column outlet) in contact with the electrolyte and a counter electrode (the entrance of the mass spectrometer), a cone emitting small droplets with a net charge can be produced (Figure 4). Parameters influencing this so-called Taylor cone have been studied by, among others, Wilm and Mann [26], and include mobile phase flow and composition, emitter design and field strength. In a fraction of a second, the solvated analytes are transferred to the gas phase as ions via a mechanism that is not yet fully understood.

Dole described the charged residue mechanism (CRM) in 1968 [27]. When the solvent evaporates, the charge remains, leading to a rapidly increasing electrostatic repulsion that is counteracted by the surface tension. When the repulsion force becomes too high, the droplet explodes, generating a large number of smaller droplets. These Coulomb fissions, as they are called, continue until the droplets are so small that they only contain one ion each. The solvent then continues to evaporate, leaving only unsolvated gas-phase ions. Iribarne and Thomson [28] proposed the ion evaporation model (IEM) in 1976.

Figure 4. Schematics of the electrospray ionization interface (ESI).

This model suggests that when the droplets are very small (reduced to that size by the same mechanism as proposed above), the field strength at the droplet surface is high enough to make it energetically favorable to transfer ions directly into the surrounding gas phase.

Whether the CRM or the IEM is the dominating mechanism, or a blend of the two as Cole [29] suggested, the ions have to be present at the droplet surface before they can be desolvated and enter the mass spectrometer; hence, surface activity, as pointed out by Tang and Kebarle [30], is an important parameter. Enke refined this model by defining a partition coefficient (K) between the neutral interior and the charged surface of the droplets [31]. With only one type of analyte (A), electrolyte (E) and counter ion (X) present, the competition (i.e. K) between the different species at the surface (s) and in the interior (i) of the droplets can be described (Eq. 3).

$$K_A = \frac{[A^+]_s\,[E^+X^-]_i}{[A^+X^-]_i\,[E^+]_s} \qquad (3)$$

This introduces a challenge, as analytes eluting with a time overlap will compete in the spray, resulting in a signal response different from what the analytes would have given alone. One way to reduce this limitation is to use a separation step prior to ESI, reducing the number of analytes in the Taylor cone at any given moment.

3.2 Mass Analyzers and Detectors

Mass analyzers separate analytes with respect to the mass-to-charge ratio (m/z). Of the more common designs, the time-of-flight (TOF), quadrupole (Q) and Fourier transform ion cyclotron resonance (FTICR) analyzers have been used in this thesis. Some of the differences between them are resolution, resolving power, linear and dynamic range, mass accuracy, speed and cost.

An important characteristic of a mass spectrometer is its resolving power. If Δm is the smallest mass difference for which two peaks with masses m and m + Δm are resolved, the resolving power R is defined as the ratio between m and Δm:

$$R = \frac{m}{\Delta m} \qquad (4)$$


Two peaks are considered resolved if the valley between them is equal to 10% of the weaker peak's intensity for magnetic or ion cyclotron resonance instruments, and 50% for quadrupole instruments [32].

The most used of the three mass analyzers, and still a good choice for many applications, is the quadrupole mass filter. The quadrupole consists of four poles (rods) aligned in parallel. Ions to be mass analyzed are focused down the center of the quadrupole, with a combination of direct current and radio frequency voltages applied to the rods. Only a small band of m/z ratios will have stable trajectories through the quadrupole, determined by the amplitudes of the voltages. Ions outside this m/z “band” are neutralized by striking the quadrupole rods. The quadrupole is today often used as a first stage in tandem mass spectrometers. Examples of this are the triple quadrupole (QqQ), the quadrupole-TOF and quadrupole ion traps.

Time-of-flight mass spectrometers accelerate the ions and measure the corresponding flight time. The very nature of TOF makes it optimal for use with pulsed ion sources such as matrix-assisted laser desorption ionization (MALDI). If, however, a continuous ion source is used, an orthogonal TOF is normally preferred [33, 34]. In the orthogonal TOF, the ion beam is chopped into small packages by a repeller (also called a pulser), sending the packages to the detector at 90° relative to the original ion beam. The downside of this is the loss of the part of the original ion beam that is not diverted, normally 50% or more.

Fourier transform ion cyclotron resonance (FTICR) is the mass spectrometer of choice if mass resolution and mass accuracy are the primary goals [35, 36]. With a resolving power of up to 10⁶ it is unsurpassed, but the data collection frequency is low (one spectrum every 1-10 s). FTICR mass spectrometers trap the ions in an analyzer cell inside a massive magnetic field, several tesla strong. The ions revolve with a frequency related to their m/z ratio, and by Fourier transformation of the detected oscillating current a mass spectrum can be obtained.

4. Hyphenated LS-MS and complex samples

The mass spectrometer of today is capable of generating accurate and highly detailed spectra. One can argue that the resolution is often enough to distinguish, and thus quantify, separate analytes in a mixture without prior separation. With very complex mixtures such as body fluids, however, the combination of high mass resolution and a wide mass band (the range between the lowest and the highest m/z) presents a challenge too hard for a single mass spectrometer to handle. This is especially true when the pattern, and not only a few specific analytes, is of interest. As mentioned in section 3.1, competition in the ESI can result in either a signal decrease or an increase, and it is very hard to predict how and to what extent [37]. Additionally, combining a separation step with a high-resolving mass spectrometer generates a large amount of data, but more importantly it can also reveal a lot more information that otherwise may have remained hidden.

By adding a separation step prior to the ESI, one not only decreases the number of analytes competing in the spray at any given moment; the additional information of when an analyte elutes can also be utilized. By using physicochemical properties such as hydrophobicity, it is possible to estimate retention times for reversed-phase chromatography [38, 39]. Time-related information is also available for other types of separations, and migration time predictions for CE have been made [40-42]. It has been reported that using estimated retention times together with m/z can assist in identification by peptide mass fingerprinting [38, 39, 43]. This reduces the extremely high mass resolution needed for accurate mass fingerprinting, where the whole identification lies in spectral accuracy. There is always a finite number of possible peptides for a given resolution within the mass measurement error. However, to come close to only one or a few possible candidates using mass spectrometry, even with the additional information provided by elution time, the required mass measurement error falls well below 1 part per million. This is hardly achievable even by a well-tuned FTICR mass spectrometer [44], especially given the complexity and dynamic range of the samples, and additional information about the peptide fragments is needed to resolve the puzzle fully. One way to do this is by tandem mass spectrometry [44, 45], a field beyond the scope of this thesis.

4.1 Applications

Three different liquid separation systems have been employed in this thesis: CE (Papers I, III and V), CEC (Paper I) and LC (Paper IV). To facilitate the discussion in the following sections, the terms “elution time” and “chromatogram” will be used regardless of whether CE, CEC or LC is the employed separation technique.

The three mass spectrometry analyzers coupled to the above separation systems are TOF (Papers I, IV and V), triple quadrupole (Paper III) and FTICR (Paper I).

This thesis involves data from four separate projects:

1. Biomarker screening of plasma from patients with appendicitis (Paper I)
2. Profiling of human urine before and after intake of paracetamol (Papers II and III)
3. Peptide mapping of trypsin digests (Paper IV)
4. Metabolomics/metabonomics study on human urine regarding intake of coffee, tea and water (Paper V)

4.1.1 ‘The biomarker study’

In ‘the biomarker study’ (Paper I), both CE-TOF and CEC-FTICR were used, providing slightly different results and different challenges regarding the data management. A difference between the two appendicitis types studied could be found in both the CE-TOF and the CEC-FTICR data. A list of possible biomarkers was constructed, and cross-checking proved a subset to be present in both methods, strengthening the probability of the biomarkers being ‘true’.

4.1.2 ‘The paracetamol study’

The ‘paracetamol study’ was divided into two Papers (Papers II and III), one focusing on the ‘fuzzy’ concept needed to process the data within a manageable time-frame (Paper II). The other part of the study (Paper III) focused on the profiling of the data with respect to intake of paracetamol 4 h before sampling or not. In this study, the amount of data generated by each run was ~500 000 data points. It was clear that CE-MS is capable of providing a separation in time good enough to produce clearly distinct groups even with a low mass resolution. The low mass resolution was due to the use of a quadrupole combined with a large mass range of interest, and the need to provide a high data collection frequency.

4.1.3 ‘The digest study’

In the trypsin digest study, the peptide maps of BSA (bovine serum albumin), casein and fetuin were investigated with respect to three different digestion/clean-up procedures. The vast amount of data generated (~500 000 000 data points per run, or initially about 0.3 GB) can be handled using the tools presented in this thesis, without resorting to more expensive and less common 64-bit computers. It was also shown that systematic differences occurred between the protocols, and the pair-wise contrast analysis presented the most informative time and m/z windows with respect to these differences.

4.1.4 ‘The beverage study’

An even larger study than the previously described ‘digest study’, comprising 115 runs in positive and 115 in negative mode, each run in the proximity of ~500 000 000 data points (initially ~0.2 GB), ‘the beverage study’ provided an even larger challenge. On top of this, three separate groups (coffee, tea and water) were evaluated, generating the need for more flexible tools, which were written. It was also during this study that the HDF5 file format [46] became truly necessary, as the combined size of all the runs in one set was too large to be contained in either the previously used Matlab file format or NetCDF [47].

4.1.5 The use of CE vs. CEC with mass spectrometry

Previous studies have shown the combination of CE and MS to be very successful when working with proteomics [48-54]. One common challenge is analyte adsorption to the capillary wall, and an effective way of reducing this is to alter the capillary wall surface, as was done in two out of the three projects involving CE (the ‘biomarker’ and ‘paracetamol’ projects). Different coatings can be used, such as the MAPTAC polymer [48, 55] used in the ‘biomarker’ study, providing a capillary wall surface less prone to adsorb protein digests, combined with a stable reversed EOF compared to the uncoated wall. Also used in the ‘biomarker’ study was a combination of the monomers ODAC and TAC, which generates an inner surface with chromatographic properties and has previously been shown to give fast and highly efficient separations of protein digests [56]. A third coating, used in the ‘paracetamol’ project, is PolyE-323 [11, 12], providing a reversed, more pH-independent EOF. The chromatographic properties introduced via ODAC+TAC enhance the separation of neutral species, and were used in conjunction with a TOF to get a data collection frequency high enough to fully resolve the mass chromatogram.

4.1.6 The use of CE vs. LC with mass spectrometry

In the ‘digest’ study, liquid chromatography (LC) is used to separate trypsin digests. The combination LC-ESI-MS for peptide mapping is widespread, and is used in areas such as proteomics [57] and pharmaceutical drug discovery [58]. Compared to CE, LC does not provide as high separation efficiency, but it offers more degrees of freedom. This freedom presents itself as the opportunity to use a wide variety of mobile phases as well as stationary phases; in this thesis, though, only standard C18 was used, a common stationary phase in the proteomics field. The use of a mobile phase gradient also separates LC from CE, enabling high efficiency in combination with good separation. Two disadvantages with this are the dependence of the ESI efficiency on the mobile phase composition and the rather long equilibration period required after each run. In Paper IV, we used a post-column addition of 2-propanol to level out the differences in mobile phase composition over the gradient, yielding a more stable ESI over the whole gradient, with the added benefit of a slightly higher signal compared to without the addition.

4.1.7 The use of TOF/Quadrupole/FTICR mass spectrometry

In the ‘paracetamol’ as well as the ‘beverage’ studies, both positive and negative ionization modes were used, providing similar results using cluster analysis on the correlation matrix (section 6.3.2) but with slightly different mass chromatograms, i.e. the information was to some extent complementary. For the ‘paracetamol’ study, a more prominent separation of the groups was found for the combinations of positive ionization mode with low pH, and negative ionization mode with high pH.

Two large differences between TOF and FTICR mass spectrometry are the comparatively slow data collection frequency and the outstanding mass resolution of the FTICR. The low data collection frequency is not necessarily a major concern, as it is generally more than made up for by the enhanced mass resolution, but as the instrument type is very expensive even compared to other mass spectrometers, it is hardly an instrument to be considered for routine analysis.

5. Managing LS-MS data

Managing LS-MS data is rapidly becoming a field involving not only analytical chemistry but also several fields in computer science, such as programming with a special focus on memory usage and clusters. Databases and the use of efficient, open and common storage formats are also areas of current interest. This chapter presents ideas, and to some degree implementations, of how to address these challenges without having to extend the computing power beyond that of a high-end PC available today.

5.1 Software

The proprietary software used to collect the data is, together with a high-end PC, normally capable of handling and presenting two-dimensional LS-MS data. The standard presentation forms used when manually exploring the data include the total ion chromatogram (TIC), where all intensities at a given elution time are summed to a single value and plotted versus elution time. Also common is the extracted ion chromatogram (XIC), where the intensities for a single m/z value are extracted and plotted versus elution time. A third common form is the base peak chromatogram (BPC), where only the most intense m/z peak at each time interval is plotted versus elution time, which gives a clearer view when new intense components elute.
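As a sketch of how these three views map onto the matrix representation used later in this chapter (rows = m/z bins, columns = scans), the following Matlab fragment is my own illustration, not code from the thesis; the matrix X, the scale vectors masses and scans, and the target m/z are assumed for the example.

```matlab
% TIC, XIC and BPC from a (possibly sparse) data matrix X.
TIC = full(sum(X, 1));                   % total ion chromatogram
BPC = full(max(X, [], 1));               % base peak chromatogram
mzTarget = 152.1;                        % assumed m/z of interest
[~, row] = min(abs(masses - mzTarget));  % nearest m/z bin
XIC = full(X(row, :));                   % extracted ion chromatogram
plot(scans, TIC); xlabel('elution time'); ylabel('intensity');
```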

One of the main goals of this thesis was to bring together LS-MS data, regardless of origin, into a common platform. The platform should be able to assist in the exploration of the data using standard high-end PCs, and also facilitate fast and easy implementation of algorithms found in the literature. In addition to proprietary software packages, software packages for handling and presenting LS-MS data are under active development [59-63] and made available in the public domain as free or open-source software (FOSS). A downside of using proprietary software is that implementing new, or changing existing, algorithms simply cannot be done, because access to the source code is prohibited. This can be a serious drawback in any environment, and even more so in a research environment. Using free/open-source software packages, the source code is accessible and adding or enhancing functionality is actually appreciated. Being proficient enough in the programming language is not always an easy task, though, whether it be Java [60, 61] or some other general language, and a more specific approach is desirable. The statistically oriented language R [64], on which XCMS [62] is built, has a well developed and rich syntax, is under constant development, and could have made an excellent foundation for the software development needed for this thesis. It is also closely related to S, a well known software in the statistical community. The choice, however, went to Matlab [65], a non-free software package which has become more or less a standard for the development and publication of algorithms in the current area. One reason for this is that Matlab, short for “matrix laboratory”, is well adapted to matrix calculations and graphic presentation. We also had some prior knowledge of Matlab, rendering a shorter initial learning period. Although the implementation in Matlab calls for the licensed software, the Matlab code for the algorithms is often available and useful as documentation. The possible danger in having to rely on commercial software, and the possibility of replacing Matlab by the open-source alternative Octave [66], was discussed by Alsberg and Hagen [67]. Other more open alternatives to Matlab are Scilab [68] and FreeMat [69].

A task at which Matlab excels is working with matrices. A matrix is a table with a value for each cell (the intersection of a specific row and column), for example a grid representation of the two-way LS-MS data. Given a list of intensities with corresponding values of m/z and elution time, a matrix can be populated as shown in Figure 5. The m/z values correspond to row indices according to a scale vector, and the time values to column indices in the same way. The matrix itself can also be represented as three lists: one with the cell values and two with the column and row indices, respectively.

When a matrix is populated primarily by zeros, as is often the case in LS-MS, it is called a sparse matrix (in contrast to a dense matrix). If a matrix is represented as vectors (to the right in Figure 5), the zero values can be omitted and less space is needed to represent and store the matrix. Matlab can use sparse matrices efficiently, and almost all operations that can be performed on a full (maximally dense) matrix can also be performed on sparse matrices. Some operations are most efficiently performed on only one of the vectors representing the matrix, such as normalization and compression of the data. An example is normalizing the matrix to unit sum: the vector containing the cell values is simply divided by its initial sum.
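A minimal sketch of this in Matlab (my illustration, assuming the index vectors i and j and the value vector s defined in the next paragraph):

```matlab
% Build the sparse data matrix from the three lists; sparse() adds
% values that share the same (i,j) cell, which matches the binning.
X = sparse(i, j, s, numel(masses), numel(scans));
% Normalization to unit sum can be done on the value vector alone:
s = s / sum(s);                    % same effect as X = X / sum(X(:))
```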

In this thesis, the vectors referring to mass values, time values and intensity values are denoted i, j and s, respectively. The values of i and j are indices and must therefore be integers, pointing to the cells in the matrix that have the corresponding values in s. Two scale vectors, ‘masses’ and ‘scans’, are used to translate the index values to mass and time values (and vice versa).


Figure 5. A schematic picture of the three types of data representation used in this thesis. To the left, the format used in both the exported ASCII and NetCDF files; in the middle (a sparse matrix and two scale vectors) and to the right (three index vectors and two scale vectors), the two forms used when the data is imported into Matlab.

The lengths of the scale vectors are the numbers of rows and columns, respectively, usually much shorter than i, j and s, since these refer to every non-zero cell in the matrix.

An important point is that each matrix cell corresponds to a certain range of m/z and time values, i.e. all values that fall within the grid points. Hence each cell represents a “bin” (or “bucket”), which can hold several data points with slightly different m/z and/or time values, whose intensities are then added. Especially when the bins are made relatively large in comparison to the m/z or time resolution, this procedure is called “binning” or “bucketing”. It should be remembered, however, that representing the data points as a matrix always involves binning.
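A sketch of the binning step (my own illustration with assumed bin widths; mz, t and intensity stand for the raw data-point lists):

```matlab
% Bin raw (m/z, time, intensity) triplets onto a regular grid; points
% that fall in the same cell have their intensities added.
mzBin  = 0.01; tBin = 1;                    % assumed bin widths
masses = (300:mzBin:2500)';                 % m/z scale vector
scans  = (0:tBin:300)';                     % time scale vector
i = round((mz - masses(1)) / mzBin) + 1;    % row index per data point
j = round((t  - scans(1))  / tBin)  + 1;    % column index per data point
X = accumarray([i j], intensity, ...
    [numel(masses) numel(scans)], @sum, 0, true);   % sparse output
```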

5.2 Generating data

Combining a separation step with mass spectrometry is appealing, but it does come at a cost. The drawbacks are the increased analysis time (lower sample throughput) and the added data file size. When moving from one-way to two-way data, an increase in data size by a factor of two to three thousand is normal. With a 5 min CE run, a data collection frequency of 1 Hz and a mass band of m/z 300-2500 with a step size of 0.01 u, a single run will consist of 66 million data points! Switching from CE to LC does not change the outcome significantly, as the reduction in data collection frequency (due to broader peaks) will be balanced by an increase in run time. The resulting data is stored in a proprietary file format, specific to the brand of the mass spectrometer used. The files are usually compressed, with the possibility to include more than one run.

5.3 Preprocessing of spectra

The possibility to enhance the signal in comparison with the noise is useful when working with the data directly. Savitzky-Golay filtering [70, 71] along the time axis, i.e. using a moving polynomial to smooth out transient dips and spikes due to random noise while keeping the more slowly changing “true” signal, has been a standard tool for many years. Converting the distributed peaks into centroid representation is sometimes used to make the data easier to interpret, as the information is condensed. The centroid is the center of gravity of a peak, usually presented as a single spike or “needle” with a height proportional to the peak area. Finding isotope clusters and determining the charge of an analyte may lead to transformation of the LS-MS peaks into so-called features: objects characterized by retention time and monoisotopic mass as well as charge, isotopic pattern and some quantity measure (e.g. total intensity). These features can then be the basis for more extensive evaluation, often combined with internal or external database searches.
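As a sketch (the thesis does not state which implementation was used), the Signal Processing Toolbox function sgolayfilt performs exactly this kind of moving-polynomial smoothing; the polynomial order and window length below are assumptions:

```matlab
% Savitzky-Golay smoothing of every m/z channel along the time axis;
% sgolayfilt works column-wise, hence the two transposes.
Xs = sgolayfilt(full(X)', 2, 9)';   % 2nd-order polynomial, 9-point window
```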

5.4 Exporting data

For further exploration we need to transfer the data from the proprietary environment to Matlab. Exporting the data is normally limited to a few file formats. ASCII is usually a standard option, and today some systems can also produce NetCDF files [47], both of which can be imported by Matlab. Irrespective of file format, due to the lack of compression the size of the exported file is always significantly larger than that of the original proprietary file. The difference can be as large as ten times, depending on the underlying data, while 3-5 times the original file size is more common.

In the transformation of the data from a proprietary format to something possible to work with in Matlab, exporting the data is just the first step. The data obtained during a single run is normally exported as three lists, representing the data points as values of m/z, time (or scan number) and intensity (counts or cps). Additional information may be included, for example the mass spectrometer parameters used when collecting the data.

The second step is importing the data, adapting the data structure and data format to Matlab. The native way to represent the two-way data in Matlab is as a matrix, for simplicity like a regular grid with equally spaced cells. Here, however, there are challenges to overcome. The m/z spectra (scans) are not always acquired at constant time intervals, which results in distortion of the chromatographic profiles when they are treated as equidistant data points. More serious, however, is the sometimes non-linear m/z grid inherently related to the mass spectrometer. In the time-of-flight mass spectrometer, the m/z for a specific ion is computed from the flight time between the deflector (pulser) at the beginning and the detector at the end of the analyzer (Figure 3). The deflector sends packages of ions with a frequency of typically 10 kHz, and these are detected at the other end by an analog-to-digital converter (ADC) working at a frequency of typically 1 GHz. This generates 100k time bins, evenly distributed over one deflector cycle (100 μs). The deflector adds a discrete amount of kinetic energy to the ions according to

$$E_k = \frac{m\,v^2}{2z} \qquad (5)$$

where z is the ion charge, m the ion mass, and v the speed along the flight path. As the added kinetic energy is the same for all ions, the speed is inversely proportional to (m/z)^{1/2}, and traveling the same distance results in a flight time that is proportional to the square root of m/z. The consequence of this is that when the linear index of flight times (each time-of-flight bin = 1 ns) is converted to m/z ratios, the result will not be a linear grid. The m/z index of an exported mass spectrometer run has smaller bins for lower m/z ratios than for higher ones. To overcome this, a regular grid with high resolution is used, accepting that certain matrix rows, representing high m/z, will stay empty. We also have to accept that other matrix rows, representing low m/z, may contain intensities from different ions unless they are well separated in retention time (i.e. matrix columns).
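A sketch of this conversion (the calibration constant and grid step are assumed values, not instrument parameters from the thesis):

```matlab
% Flight time is proportional to sqrt(m/z), so a linear grid of 1 ns
% time bins becomes a quadratic m/z grid when converted.
k   = 2.5e-7;                 % assumed calibration constant
tof = (1:100000)' * 1e-9;     % 100k flight-time bins of 1 ns each
mz  = (tof / k).^2;           % m/z is quadratic in flight time
% Snap onto a regular m/z grid: at low m/z several time bins share one
% row, while at high m/z many rows stay empty, as described above.
mzStep = 0.01;
row = round((mz - mz(1)) / mzStep) + 1;
```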



Figure 6. A picture of the basic file structure used when files exported by the mass spectrometer software are imported and converted into three index vectors (i, j and s) together with the scale vectors scans and masses. Additional information about the exported file is also stored, to facilitate traceability from the original file and onwards.

Before storing the data in Matlab, the data matrix is converted to the list representation with the vectors i, j and s as previously described (Figure 5). A structure raw with two main branches, info and data, is created for the single run (Figure 6). Under data, the three branches i, j and s are populated with the row indices, column indices and intensity values, respectively. The two scale vectors are also stored under data, as masses and scans, relating the index values in i and j to their original m/z and time values. The info branch is at this stage only populated with information about the original file name. The importing procedure is completed by storing the structure raw as a single .mat file, making use of the efficient compression of data files in Matlab.
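A minimal sketch of this structure (the field names follow Figure 6; the file name is a placeholder):

```matlab
% Assemble the 'raw' structure for a single run and store it as a
% compressed .mat file.
raw.info.filename = 'run001.cdf';          % placeholder original file
raw.data.i = i;  raw.data.j = j;  raw.data.s = s;
raw.data.masses = masses;  raw.data.scans = scans;
save('run001.mat', 'raw');                 % .mat files are compressed
```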

The data structures obtained for the single runs, and the corresponding data files, are too large to be handled together as a data set compiled within a study or project. The information, however, corresponds to a large extent to instrumental and chemical noise; only a few percent or less of the data points correspond to actual peaks, defined as intensities more than three standard deviations above the baseline. Using an in-house developed algorithm for background subtraction and noise removal to discard non-essential data, 95% or more of the original data points were removed. For each mass channel (each chromatogram), the baseline is modeled as a second-order piecewise polynomial function with equally spaced knots (nodes) along the time axis (the default is 10). This smoothing spline function is fitted to the intensity data by least-squares minimization in an iterative way. After each iteration, the data values are adjusted towards the current baseline. The positive and negative deviations from the current baseline are restricted to limits related to the median absolute deviation (MAD), in order to ‘cut down’ the peaks.

The default values are 0.1 MAD above and 1 MAD below the current baseline. Due to this asymmetric clipping, the baseline adjusts iteratively to the lower bound of the data. By default the number of iterations is set to 4, a number found to be sufficient in most cases. The final value of the MAD is a robust measure of the random deviations from the baseline, and is therefore used as a measure of the noise level of the signal.
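A simplified sketch of the procedure, with a single second-order polynomial per channel standing in for the piecewise spline actually used; the defaults (four iterations, clipping at 0.1 MAD above and 1 MAD below) follow the text:

```matlab
function [base, noise] = baselineFit(y)
% Iterative baseline estimation for one chromatogram y (row vector).
t = 1:numel(y);
w = y;                                     % working copy, clipped each pass
for iter = 1:4                             % default number of iterations
    p    = polyfit(t, w, 2);               % least-squares 2nd-order fit
    base = polyval(p, t);
    dev  = w - base;
    m    = median(abs(dev - median(dev))); % MAD of the residuals
    w    = min(w, base + 0.1*m);           % clip peaks 0.1 MAD above...
    w    = max(w, base - 1.0*m);           % ...and 1 MAD below the baseline
end
noise = m;                                 % final MAD = noise level
end
```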


Figure 7. A raw CE-MS TIC (above), presenting no obvious analyte signals prior to baseline and noise removal. After the background signals and noise are removed (below), a multitude of chromatographic peaks emerge.

The effect can be quite dramatic, as shown using data from the ‘paracetamol’ study (Figure 7). In the TIC there is little or no visible trace of analyte signal prior to the removal of signals 5 MAD or less above the baseline, calculated separately for each XIC (extracted ion chromatogram). After the removal, several peaks with elution profiles emerge, signals that were completely hidden in the untreated data.

After subtraction of the baseline from the signal, the next step is to remove everything but the “true” data points. Regarding these as outliers, we may replace the conventional outlier limit of 3σ with its robust counterpart, 5 MAD [72], and discard the data points below this limit. Additionally, a minimum number of consecutive counts above the threshold is required, with a default setting of 2, efficiently removing signal spikes. It should be noted that although the baseline functions and noise levels are obtained individually for each m/z channel (row in the data matrix), the calculations are performed simultaneously for the whole data matrix by matrix manipulations.
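Continuing the sketch above for one channel (y, base and noise as produced by the hypothetical baselineFit), the thresholding and spike removal could look like this:

```matlab
% Keep only points more than 5*MAD above the baseline, and require at
% least two consecutive points above the threshold to remove spikes.
r    = y - base;                           % baseline-subtracted signal
keep = r > 5*noise;                        % robust outlier limit (5 MAD)
alone = keep & ~([keep(2:end) false] | [false keep(1:end-1)]);
keep(alone) = false;                       % no neighbour above threshold
r(~keep) = 0;                              % discard noise-level points
```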

The amount of data obtained from the single runs is at this point reduced to less than 5%, and the runs comprising a data set may now be stored together in one data file. The file has an internal structure similar to the .mat files used earlier to store single runs (Figure 8). As there is now more than one run represented in the file, the structure branches data and info are indexed with a run number. The parameters used in the data reduction step are added to info. However, the amount of data for the whole data set could easily exceed the limits of even a high-end computer, so there is a need for a file type that allows storing and accessing data for individual runs. This prevents the use of ordinary .mat files, which can only be created and loaded as a whole. The file type selected for this task is HDF5 [46], due to a number of beneficial properties in comparison to the possible alternative NetCDF. A short overview of HDF5 compared to NetCDF as well as the original .mat format is presented in Table 2.


Figure 8. A picture of the full data structure used in the HDF5 file, storing a full project or similar under one roof. The separate runs, earlier saved in separate files, are now clustered together for ease of use. Information about the project should be stored in the structure Project in the root of the file, while data and information about the single runs are stored in indexed structures (data.n and info.n), with n corresponding to the run number.

Table 2. An overview of different properties of HDF5, NetCDF and Matlab's .mat file format.

Feature | HDF5 | NetCDF | .mat
Parallel file access | Yes, on any system with MPI I/O | No (in development) | No
Number of objects | Unlimited | Limited | Limited
Maximum size of objects/files | Limited only by computer or file system capacity | 2^31-1 bytes | 2^31-1 bytes
Ability to store metadata separately from raw data | Flexible | No | No
Ability to use compression of data | Flexible | No (in development) | Yes
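As a sketch of what the per-run storage can look like with the high-level HDF5 functions in recent Matlab versions (the thesis-era implementation predates these; the group layout follows Figure 8 and the file name is a placeholder):

```matlab
% Write the vectors of run n to an HDF5 file; individual runs can later
% be read back without loading the whole file into memory.
fn = 'beverage_pos.h5';  n = 1;            % placeholder name, run number
grp = sprintf('/data.%d/', n);
h5create(fn, [grp 'i'], [numel(i) 1]);  h5write(fn, [grp 'i'], i(:));
h5create(fn, [grp 'j'], [numel(j) 1]);  h5write(fn, [grp 'j'], j(:));
h5create(fn, [grp 's'], [numel(s) 1]);  h5write(fn, [grp 's'], s(:));
s1 = h5read(fn, '/data.1/s');              % selective access to one run
```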

5.5 Applications

5.5.1 ‘The biomarker study’

The CEC-TOF data (48 runs) were imported as NetCDF files into Matlab, where they were stored in one indexed file. A coarse one-point alignment was accomplished by choosing an unknown peak as a landmark. The peak, eluting after the EOF, had a strong signal present in all runs.

The FTICR-CE data (62 runs) were exported from the Bruker XMASS software as deconvoluted ion lists in ASCII format, then rearranged in Matlab to the standard format (vectors i, j and s together with the scale vectors scans and masses). No alignment was performed; instead, a coarse time grid (35 bins) and a rather high ‘fuzzy’ setting of 3 were used.

5.5.2 ‘The paracetamol study’

From a CE-QqQ system, 115 ASCII files were imported into Matlab, forming a file of about 60 MB in size. To capture the sharp peaks CE produces, the step size of the mass spectrometer was set to 1 u, a high value producing only 300 mass channels. The data collection frequency was high, but the data matrix for each run was only about 450 000 cells in size, about half of which were populated prior to data reduction. Alignment was performed as a one-point linear transformation using uric acid as landmark. Data reduction was done with the baseline and noise removal routine described in section 5.4, which reduced the amount of data by a factor of 40-200. The data sets, comprising four groups of 27-30 runs (for negative and positive ESI mode and for pH 3 and 9, respectively), were evaluated separately.

5.5.3 ‘The digest study’

Using LC-TOF, the raw files (27 in total) generated by the proprietary software (Mariner) were between 250 and 400 MB in size. The conversion to ASCII resulted in roughly doubled sizes. This is a smaller difference than expected, and is most likely due to the low compression ratio of the original data files. When transferring the data points onto a regular grid, i.e. expressing the original data as three vectors paired with two scale vectors as shown in Figure 6, the resulting file size went down to below 100 MB per file, with the majority in the region of 60-80 MB. The .mat files are compressed, apparently more so than the original files. After baseline removal, approximately 0.1% of the original data points remained, and the full set was no more than 60 MB in size.

5.5.4 ‘The beverage study’

This is by far the largest study presented in this thesis, comprising 2 × 115 CE-TOF runs with high resolution in both time and m/z. The original data files contained about 20 runs/file with a total size of 1.1-1.5 GB, of which the average run needs about 50-75 MB. In comparison with the original files in the ‘digest’ study, the compression ratio is high, which may be due to the fact that in the ‘beverage’ study a much more recent software (AnalystQS, v1.4) was used. This software is intended for computers with superior computational strength, and thus a more sophisticated compression algorithm can be utilized without notably slowing down the system. With this software, the export of data involved a plug-in module earlier acquired from Agilent, producing NetCDF files of about 0.5-1 GB in this study. Transferring them into .mat files with the same structure as in the ‘digest’ study (cf. Figure 6) resulted in files between 60 and 200 MB in size, which is significantly larger than the original data files. After data reduction, removing 95-99% of the data points, the data obtained in this study was stored as two separate HDF5 files, one for positive and one for negative ESI mode. The sizes of the two files were 6.2 GB and 3.5 GB, respectively.

6. Exploring LS-MS Data

The exploration of LS-MS data is a vast field, spanning into almost every branch of analytical chemistry. In this thesis, we limit the field to two-way data sets, in which we examine the relations between samples (similarities and dissimilarities, groups etc.). There are two main paths that can be used to do this: comparing the two-way data as “fingerprints”, or evaluating and comparing lists of peaks. The latter involves isolation and quantification of individual peaks (using methods like “peak picking” and “curve resolution”) as well as determination of the origin of the peaks (e.g. by combining isotopic clusters) using mass and time. In this thesis, exploration of data is done only by comparing two-way data as “fingerprints”.

Using Matlab, all calculations and graphical presentations are facilitated by having the data in matrix form. For a selected run, the data, stored as the lists i, j and s together with the scale vectors masses and scans, is transformed into a “working matrix” with a time and m/z window and a resolution (bin size) according to the needs.
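A sketch of the working-matrix extraction (my illustration; the window limits are arbitrary examples):

```matlab
% Rebuild the sparse matrix from the stored lists, then cut out a dense
% working matrix for a chosen m/z and time window.
X     = sparse(i, j, s, numel(masses), numel(scans));
mzWin = masses >= 100 & masses <= 300;     % assumed m/z window
tWin  = scans  >= 140 & scans  <= 180;     % assumed time window
W     = full(X(mzWin, tWin));              % working matrix
```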

6.1 Exploring single runs

As a part of the work for this thesis, a tool has been developed to graphically present single runs, extracted from the HDF5 file comprising an experimental study. With this tool it is possible to obtain a graphical overview of the “fingerprint”, interactively zoom in on a chosen sub-area (m/z and time window) and produce separation profiles (e.g. chromatograms) as well as mass spectra based on the zoomed area. The chromatogram can thereby be restricted to single mass channels (XIC, extracted ion chromatogram) or represent the sum of a larger mass interval. On the time axis, the spectrum presented can refer to a single time bin as well as the sum obtained over a larger time interval. The total (integrated) intensity and the m/z value representing the center of gravity (centroid) are presented continuously, something of special interest when zooming in on a small mass and time area (e.g. the area around a peak). When dealing with peptides, it is possible to select a mass interval for a peak including all the isotopes. The centroid m/z value can then be compared with the “average mass” that is often presented in peptide fragment databases.

Figure 9. An overview of the graphical presentation modules written for exploration of single runs. A ‘full size’ window (bottom left), a ‘selection’ (top left) from the box in the ‘full size’ window, an ‘elution profile’ (top right) for the selected area and ‘m/z spect’ (bottom right) for the same area.

The overview picture obtained for one of the runs in the beverage study (here a urine sample taken after intake of coffee) is shown in Figure 9. The somewhat coarse visual impression is due to the binning related to the matrix representation of the data, here selected to facilitate further calculations. However, the graphical presentation module covers a large portion of the presentation options present in standard proprietary software.

6.2 Comparing two runs
The next logical step in exploring the data is pair-wise comparison of runs within a data set. There is always the possibility of visually comparing two runs, for example using two instances of the tool described above. However, to get a quantitative measure of the similarity (and thereby of the difference) between two runs, a more mathematical approach is needed. With the matrix representation of the data, using the same regular grid, there is an established similarity measure called the “Frobenius inner product”. It is the component-wise inner product for the two matrices A and B, calculated as


F = \sum_{i,j} A_{ij} B_{ij} = \mathrm{trace}(AB^{T})    (6)

This means using the trace (the sum of the diagonal elements) of the matrix AB^T. In short, each element (cell) in A is multiplied by the corresponding element in B, adding up the pair-wise products. The same result is found by unfolding the matrices to vectors, followed by calculating the inner product of the two vectors.

F = \mathrm{vec}(A) \cdot \mathrm{vec}(B)    (7)

Applying ‘vec’ (vectorization) to a matrix transforms it into a single column vector by stacking the columns of the matrix on top of each other. There are, however, a few circumstances that affect the similarity measurement.
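In MATLAB the three formulations of Eqs. 6-7 coincide, as the following sketch for two equally binned matrices A and B illustrates:

    F1 = sum(sum(A .* B));     % component-wise product sum (Eq. 6)
    F2 = trace(A * B');        % trace formulation, costly for large matrices
    F3 = A(:)' * B(:);         % inner product of the vectorized matrices (Eq. 7)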

6.2.1 Intensity scaling and normalization
The value of F is affected by scaling, e.g. the relation between the occurrence of the analytes and the registered intensities. The scale factor is affected by a range of parameters, some of which are hard to control, such as the ESI efficiency. One way to reduce the influence of such factors is to normalize the F value (Eq. 8).

c = \frac{\mathrm{vec}(A) \cdot \mathrm{vec}(B)}{\lvert \mathrm{vec}(A) \rvert \, \lvert \mathrm{vec}(B) \rvert}    (8)

The similarity measurement (here denoted c) is thereby limited to between -1 and 1. Since all the components in the matrices (and thus in the vectors as well) are non-negative, the value of c will fall between 0 and 1. A fully congruous pair of runs (i.e. with peaks in exactly the same positions in both m/z and time) and with the same relative intensities will yield c=1, while the opposite, non-congruous pairs (without any overlapping peaks) yield c=0. It is thus permissible to call c a correlation value, spanning from 1 for two runs in total agreement down to 0 for two runs without any similarity. The ‘usual’ correlation coefficient for the vectorized matrices would yield a negative value in the latter case.

Normalization of the similarity measurement eliminates differences in scaling, but in some cases it seems prudent to also be able to perform a direct normalization of the intensity values. One such case is when comparing two ion chromatograms (e.g. TIC) in the same diagram. If one peak represents a known analyte with a concentration that can be assumed to be the same for both samples, all intensities can be scaled in relation to that value. If no such peak exists, the sum of the intensities for each sample can be normalized to a common value, usually 1 (‘unity’). Finally, a more geometrical approach is to scale the intensity norm of each run to 1. This involves |vec(A)| = |vec(B)| = 1, the same as the normalization of F to c.
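The two equivalent routes, normalizing the measure (Eq. 8) or first scaling each run to unit intensity norm, can be sketched as:

    c  = (A(:)' * B(:)) / (norm(A(:)) * norm(B(:)));   % Eq. 8
    An = A / norm(A(:));  Bn = B / norm(B(:));          % |vec(A)| = |vec(B)| = 1
    cAlt = An(:)' * Bn(:);                              % same value as c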

6.2.2 Peak alignment

Displacement of the peaks may occur in both time and m/z, in both cases affecting the correlation. With displacement the correlation normally decreases, but false contributions from peaks that do not have the same origin may also increase the correlation. ‘Peak alignment’ is a procedure often applied when comparing LS-MS data, especially in combination with the use of multivariate methods (PCA [73], PARAFAC [74] etc.), as these are based on correlation structures within the data set. Displacement in time almost always dominates over displacement in m/z. The size of the time ‘bins’ is closely related to the influence of time displacement; if the bins are large enough to swallow the displacements, the undesired loss of correlation will disappear. If, however, a peak is divided between two bins, a relatively small displacement can have a considerable effect on the correlation. With peaks spanning a large number of bins, the effect will only be noticeable if the displacement is large in comparison with the peak width. In general, suitable binning can reduce the need for peak alignment.

Another way to minimize the effect of time displacement is to transform the timescale for one of the runs (‘time warping’) in such a way that the displacements are compensated. The simplest way of doing this is by linear transformation, capable of compensating for differences in flow rate and synchronization between sample injection and data collection. For this, one or two parameters have to be established, either by forcing one or two known peaks in both runs to the same values or by varying the parameters until a maximal correlation between the runs is achieved [75]. The scale vector for one run is modified by the ‘rubber band’ principle, generating either fewer or more bins when stretching or shrinking the ‘rubber band’. This may cause distortion of the time profiles, as some index values may point to empty bins (created by stretching) or to bins with contributions from two previous bins (due to shrinking). The distortion effect can be avoided by interpolation of the intensity values (‘resampling’) using more or less elaborate methods (linear interpolation, interpolation using polynomials, Fourier transform etc.) [76].
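A minimal sketch of the linear case follows, assuming a run stored as an m/z-by-time matrix X with time scale vector t, and hypothetical warping parameters a and b (e.g. obtained from two matched peaks):

    a = 1.02;  b = -3.5;        % hypothetical scale and offset parameters
    tWarp = a*t + b;            % linearly transformed time scale

    % 'Resampling' by linear interpolation back onto the master grid t,
    % avoiding the bin distortion of the plain 'rubber band' approach.
    % Each column of X' is the time profile of one m/z channel.
    Xw = interp1(tWarp, X', t, 'linear', 0)';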

In practice, a linear transformation is not sufficient; even small non-linear distortions may have a large impact on multivariate methods (PCA etc.). The method has been expanded to include ‘piece-wise linear transformations’, where a number of time points (nodes) define the segments subjected to individual linear transformations. The nodes can be selected from peaks that are supposed to have the same origin, chosen either manually [77] or automatically [76]. Nielsen et al. proposed an alignment algorithm for chromatographic profiles, correlation optimized warping (COW) [78], further developed by Bylund et al. [79]. The idea is that equally spaced time nodes are adjusted so as to maximize the total correlation between the two time profiles. After determination of the node positions (i.e. the time shift for the nodes), the ‘rubber band’ principle is applied to the time segments between the nodes. To get a more plausible fit to the actual time displacement between the nodes, interpolation with splines or polynomials of varying degree has been used instead of linear interpolation [80, 81]. Various forms of peak alignment with time warping have been described in the literature [76, 77, 82-84]. A method proposed by Torgrip et al. in 2004, with a focus on computational speed and memory requirements, also deserves to be mentioned [85].

‘Peak alignment’ is usually performed as a pre-processing step for the whole data set, by selecting one run as a ‘master’ to which the others are aligned. The outcome of the alignment procedure depends on a number of parameters (the choice of master, number of nodes, allowed warping etc.) which in turn may have to be optimized.

The solution we have chosen is quite different. To handle the challenge of mis-aligned runs, we first process the data using a linear transformation with two nodes. Two different methods for this pre-alignment were used in this thesis. The method used in the 'biomarker' and 'paracetamol' studies was to select a known peak in the chromatograms (e.g. uric acid in the 'paracetamol' study), assuming no alignment error due to bad synchronization between sample injection and data collection (i.e. coinciding start points). In the 'digest' and 'beverage' studies we used a slightly more elaborate method involving a manual step. A correlation matrix is formed according to X^T Xm, X referring to the run to be aligned and Xm to the master. Chromatographic peaks from the same species will give rise to high correlation (very similar mass spectra), and Figure 10 shows how such peaks will occur along an approximately linear trace. The trace, however, deviates from the true diagonal that defines perfect alignment. By mouse control two points are selected, resulting in a straight line approximating the actual trace of correlated peaks. The line equation is applied to the time scale for the run to be aligned, and as a result the correlation trace will move closer to the diagonal.
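The manual step could be sketched as follows, assuming both runs are binned m/z-by-time matrices (X for the run to be aligned, Xm for the master) with time scale vectors tX and tM; the plotting orientation and the use of ginput for the two mouse clicks are illustrative assumptions, not necessarily the exact implementation used.

    Cmap = X' * Xm;                    % time-by-time correlation map (cf. Figure 10)
    imagesc(tM, tX, Cmap); axis xy;    % correlated peaks form an approx. linear trace
    [tm, tx] = ginput(2);              % pick two points on the trace by mouse
    a = (tx(2) - tx(1)) / (tm(2) - tm(1));   % slope of the fitted line
    b = tx(1) - a*tm(1);                     % intercept
    tAligned = (tX - b) / a;           % transform moves the trace towards the diagonal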

Figure 10. Correlation between two runs (vertical and horizontal, respectively) with a line fitted manually to the trace formed by points with high correlation. A linear time transformation for the adjusted run makes the fitted line represent the diagonal. After the coarse linear alignment, the deviations from the line represent the remaining misalignment.

The linear transformation for each run (except for the master) is defined by the scaling and translation parameters, which are stored in the file together with the data. Henceforth, when fetching the data for a run (or a selected m/z and time window), the original scale vectors, scans and masses, are automatically transformed to the aligned version using the stored parameters. In this way, the distortion caused by the stretching or shrinking may be leveled out by using larger time ‘bins’ in the following steps.

After alignment, some deviations in time (and, to some degree, also in m/z) remain. The residual deviations may stretch over several ‘bins’, especially in the chromatographic dimension. Due to the edge effect when a peak crosses a 'bin' border, we always have to allow for variations between two adjacent 'bins', for m/z as well as for time. The time binning will usually result in peaks that extend over a number of 'bins', and we may have to expect variations over more than one bin. The actual number depends on the success of the linear coarse adjustment, but normally the remaining time displacements correspond to just a few time 'bins'.


Figure 11. The construction of a ‘fuzzy’ matrix B* (right) from the matrix B (left). When computing the values for the two cells in matrix B* above, the intensities in the ‘blurred’ area around the cells in B are summed. Here, the ‘blurring’ window is ±1 in both directions.

To neutralize the deviations still present when calculating the correlation between runs, we now apply a ‘fuzzy’ correlation measure. When computing the product sum of A_ij B_ij (Eq. 6), we also add the terms A_ij B_i'j', with i' and j' being other positions in time and m/z within the allowed displacement. The term ‘fuzzy’ is here used in its colloquial sense ('unfocused' or 'blurred'), not specifically tied to the 'fuzzy set' or 'fuzzy logic' concepts. Computationally, the 'fuzzy' correlation is obtained by first summing the intensities in the ‘blurred’ area around B_ij into the value B*_ij = B_ij + Σ B_i'j'. The procedure to obtain the 'blurred' version B* from B is illustrated in Figure 11.

The 'fuzzy' correlation measure is now computed as F=vec(A)•vec(B*) (cf. Eq. 7). After the normalization (Eq. 8), the measure of similarity (s) is less sensitive to alignment errors but with an increased risk of ‘false’ correlation contributions due to overlapping peaks.
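A compact way to obtain B* and the 'fuzzy' measure in MATLAB is two-dimensional convolution with a rectangular window. This is a sketch; the ±1 window size follows Figure 11, while normalizing against the norm of B* is an assumption consistent with the description above.

    K = ones(3);                                % rectangular window, +/-1 bin
    Bstar = conv2(full(B), K, 'same');          % B* = summed 'blurred' area
    F = A(:)' * Bstar(:);                       % 'fuzzy' product sum (cf. Eq. 7)
    s = F / (norm(A(:)) * norm(Bstar(:)));      % normalized similarity (cf. Eq. 8)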

6.2.3 Data compression
The measure of similarity s is affected by the intensity of the correlating peaks. High intensities give rise to large contributions to the product sum. To avoid s being dominated by large peaks, several transformation methods are possible. One common method is to simply log-transform the intensities [86], which is prone to cause problems when using sparse matrices, as many of the matrix values are zero. If only the non-zero intensities are log-transformed, the contrast between absence of intensity (zero) and measured intensity (log s) will be strongly dependent on the scaling. An alternative to log-transformation is ‘square-rooting’, a method to reduce the difference between high and low intensities, which has also been used on two-way mass spectrometry data [87]. The transformation we use is more general, with a compression parameter k according to s^(1/k), where the value of k determines the rate of compression. With a value of 1 the intensities are left unchanged, 2 equals square-rooting, and higher values correspond to more compression. With a sufficiently high value of k, the compression is close to ‘binary’, where the similarity measure does not depend on the intensity but only on occurrence.

Figure 12. The effects of normalization and compression for two simulated chromatograms. (A) Originally the first peak is missing in the lower curve, while the others have the same height (indicated by the dotted line for the last peak). (B) After normalization the two peaks in the lower curve are apparently enhanced by about 40% (cf. the dotted line). (C) Compression with k=3 results in leveling the peak heights as well as reduced enhancement caused by the normalization.

Compression is also favorably combined with the intensity normalization. The normalization of the intensities causes ‘closure’, i.e. the (now) relative intensities become dependent on each other. If the relative intensity of one peak increases, the relative intensity of the other peaks goes down, even if the absolute values themselves are unchanged. This ‘artifact’ is to some extent decreased by compression (Figure 12).
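A sketch of the compression applied to the non-zero elements of a sparse run matrix X, followed by normalization to unit norm:

    k = 3;                          % compression parameter (k = 2: square-rooting)
    [i, j, s] = find(X);            % non-zero intensities only
    sc = s.^(1/k);                  % compressed intensities
    sc = sc / norm(sc);             % unit intensity norm (zeros contribute nothing)
    Xc = sparse(i, j, sc, size(X,1), size(X,2));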

6.3 Exploring the whole data set
The goal is to characterize the whole data set with respect to similarities and dissimilarities between runs, finding clusters and groups, and to relate the results to 'a priori' knowledge about the samples. In principle, we have to compare each pair of runs according to the description above, i.e. compute the ‘fuzzy’ or blurred correlation measure. With matrix operations this can be performed for all pairs simultaneously, and the first step is to collect the whole data set into one single data matrix X1. For this, the results from individual runs must be vectorized into single rows in X1, and the number of rows will be the number of runs comprising the data set. The number of columns equals the product of the number of time 'bins' and the number of m/z 'bins' (which must be equal for all the runs).

6.3.1 The ‘fuzzy’ correlation matrix
A square matrix holding the pair-wise inner products F can be computed as the matrix product (X1)(X1)^T. To obtain a ‘fuzzy’ correlation matrix we need a 'fuzzy' version of the combined data matrix X1. This version, denoted X2, is computed in the same way as X1, except that the 'blurred' data matrix for each run is utilized (cf. Figure 11). The 'blurring' procedure, applied to each individual data matrix before vectorization, may be rather time consuming and can be replaced by an efficient matrix manipulation of the whole compiled data matrix X1. The ‘fuzzy’ version of the correlation c (Eq. 8) can now be computed for the whole data set by matrix multiplication as (X1)(X2)^T. A final step is to make the matrix fully symmetrical, which results in the 'fuzzy' correlation matrix C.
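A sketch of the whole-set computation follows, assuming the vectorized runs as rows of X1 and their 'blurred' counterparts as rows of X2. The exact normalization (here against the plain run norms) is an assumption.

    F = X1 * X2';                   % pair-wise 'fuzzy' product sums
    d = sqrt(sum(X1.^2, 2));        % run norms (normalization step is an assumption)
    C = F ./ (d * d');              % normalized 'fuzzy' correlations
    C = (C + C') / 2;               % final symmetrization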

6.3.2 Cluster analysis and dendrograms
Using the matrix C as a starting point, it is possible to perform a cluster analysis in order to identify groups of similar runs and to find the relation between these groups (hierarchical clustering). The results are usually presented as dendrograms, a way of visualizing clustering of MS "fingerprints" often found in the literature [88, 89]. By comparing the results with ‘a priori’ knowledge, it is possible to verify that the data set does contain relevant systematic information. The cluster analysis of C has been performed with Minitab [90] after converting the correlation matrix C to a distance matrix D according to D_ij = 1 - C_ij. Figure 13 shows the clustering obtained in the 'digest' study (Paper IV). For each protein digested, the different digestion approaches give "fingerprints" that cluster together in all cases but one.
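The clustering itself was performed in Minitab; an equivalent sketch in MATLAB (Statistics Toolbox) under the same D = 1 - C conversion could read:

    D = 1 - C;                               % correlation -> distance (C_ii = 1,
                                             % so the diagonal of D is zero)
    Z = linkage(squareform(D), 'average');   % hierarchical clustering
    dendrogram(Z);                           % visualize (cf. Figure 13)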

Figure 13. Three dendrograms from the ‘digest’ study. The dendrograms were based on the similarity matrix for BSA (left panel), casein (middle panel), and fetuin (right panel). The different digestion approaches are labeled as Z=ZipTip, N=NAP5, and O=ACN.

6.3.3 PCA and scores
A standard method for exploring and visualizing data sets is principal component analysis (PCA). PCA can be viewed as a projection method where objects are described using only a few variables, denoted scores. With two-way analysis such as LS-MS, the data table of the whole data set is in fact a three-way table. After vectorization, however, the two-way data table X1 is constructed as described above, with a large number of columns. These columns represent the variables in the PCA context, now mutually related as an analyte can give a response in more than one m/z and time 'bin’. There are sophisticated methods to explore three-way data (e.g. PARAFAC), but it is doubtful whether they can be applied to the full set of "fingerprints" representing a large number of analytes. As the interest here is the relation between runs (objects) and not the relation between variables, it is also doubtful whether such methods would provide additional relevant information.

The score values in PCA are optimal linear combinations of the original variables, with a description of the dissimilarities (i.e. variations between the objects) used as the optimization criterion. The scores can be computed using various algorithms (NIPALS, SVD etc.), some of which are directly connected to the correlation matrix (X1)(X1)^T. For each principal component a score vector is produced, with one score value for each row (run). In essence, the score vectors are eigenvectors of (X1)(X1)^T. The score vector for the first principal component is computed as the square root of the largest eigenvalue of the correlation matrix (a scalar), multiplied by its corresponding eigenvector. This procedure can be described as ‘re-creation’ of variables from the correlation matrix. The relatively few score 'variables' reflect the same relations (correlations or similarities) between the objects as the variables in the original matrix X1. Finding scores from a correlation table is similar to setting out places on a map in concordance with a mutual distance table.
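A sketch of this 're-creation' of score variables directly from a correlation matrix C, using MATLAB's eigenvalue decomposition (SVD would give an equivalent result for a symmetric C):

    [V, E] = eig((C + C') / 2);              % ensure symmetry before decomposition
    [e, order] = sort(diag(E), 'descend');   % largest eigenvalue first
    T = V(:, order) * diag(sqrt(max(e, 0))); % score vectors, one column per PC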

For PCA to generate proper results, the variables must be in correspondence between the runs, i.e. the LS-MS peaks must be adequately aligned (section 6.2.2). Our idea is to use the ‘blurred’ correlation matrix C to generate a ‘PCA-like’ projection of scores by eigenvalue analysis (Paper II). A matrix T with score values is computed in Matlab using its native SVD (singular value decomposition) function. The relationships previously illustrated by the dendrograms (cf. Figure 13) can then be visualized as score plots. The concept of ‘blurring’ was previously reported by Marengo et al. [91, 92]. When they compared 2D gels, every spot was localized and focused into a single dot ('de-fuzzyfied' in their terminology). Prior to PCA analysis, the spots were de-localized or ‘fuzzyfied’ according to a two-dimensional normal distribution. This corresponds to computing the correlation matrix C using only the ‘blurred’ matrix X2. One of the basic principles of our ‘blurred’ similarity measure is to give equal consideration to all the pair-wise contributions to the ‘blurred’ intensity (i.e. a rectangular weight distribution). With the approach suggested by Marengo et al., pairs of data points originating from the same original position will add more to the correlation than pairs with some offset (although still within the 'blurring' window). This holds true even if the 'blurring' distribution is rectangular, since the correlation between two 'blurred' matrices involves the convolution of the two distributions. With rectangular 'blurring', the weight distribution for the pairs of data points will be triangular with respect to their relative displacement.

Figure 14 shows the score plots obtained from the urine CE-MS 'fingerprints' in the 'paracetamol' study. It is seen that already in the first principal component (PC1), the score values clearly differentiate samples with and without prior intake of paracetamol. This is due to the fact that the drug intake is the single fixed-effect factor in this study, obscured only by random effects due to different occasions (seen as grouping in PC3) and repeated analyses.

Figure 14. The ‘fuzzy’ scores obtained from the ‘paracetamol’ study. The score plot shows clear group separation between ‘with’ and ‘without’ in PC1.

6.3.4 PCA and loadings
With conventional PCA, loadings (the ‘weights’ for the linear combinations) as well as scores are computed. The loadings describe how the scores for the principal components are connected to the original variables. When the score values for a principal component show relevant differences between the objects, the loadings can be used to elucidate the variables mainly responsible for these differences. Such variables are revealed by high loading values, regardless of the sign. In our approach we start with the correlation matrix C and obtain scores without having direct access to the corresponding loadings. One way of acquiring loadings with conventional PCA is to compute the eigenvectors of the matrix (X1)^T(X1). Formally we could do the same for (X1)^T(X2), although this matrix would probably be too large to handle easily. Instead, we compute the loadings by using the score vectors derived from the matrix C and correlating them with the ‘blurred’ version X2 of our original data matrix. Each variable is thus compared with the score vector; the better the correlation, the more relevant the variable is for the relations described by the scores.

Computationally, these correlation loadings are computed as the result of (X2)^T Tc, where Tc is a matrix with the centered score vectors as columns. By using the ‘blurred’ version of the data matrix (X2), the residual displacements in time and m/z are largely eliminated. Centering of the score vectors in Tc generates loadings with both positive and negative signs. Indeed, relevant variables can be either positively or negatively influenced by the factor associated with the principal component and the score values. Usually, correlation loadings are normalized to be in the range -1 to 1. If not normalized, the correlation measure depends on the intensity level of the different variables, and high-intensity variables may result in higher correlation in spite of a moderate correlation coefficient. However, we use the compression parameter k (section 6.2.3) to balance the intensity and the degree of correlation when assessing the relevance of the variables.
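A sketch of the loading computation, assuming the score matrix T from above, the compiled 'blurred' data matrix X2, and the bin counts nMz and nTime (assumed names):

    Tc = T - mean(T, 1);                  % centre each score vector
    P  = X2' * Tc;                        % correlation loadings, one column per PC
    P1 = reshape(P(:,1), nMz, nTime);     % re-fold PC1 loadings to two-way format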

The loading matrix (X2)^T Tc contains the loading vectors as columns, one for each principal component. Each column holds the values for all the unfolded original variables (mass 'bins' and time 'bins'), and it can be re-folded into the same format as the original two-way data (mass × time ‘bins’) and be visualized in much the same way.

Figure 15 shows the loading plots obtained from the urine CE-MS 'fingerprints' in the 'paracetamol' study. In the two-way loadings plot we have identified some known metabolites of paracetamol, but several unknown compounds strongly affected by the intake of the drug are also revealed.


Figure 15. The ‘fuzzy’ loadings obtained from the ‘paracetamol’ study. The darker areas in the loadings plot indicate higher values. Zoomed areas show positive loadings in accordance with known metabolites: PS (paracetamol sulfate), PC (paracetamol cysteine), PM (paracetamol glucuronide).

6.3.5 Selection of scores by ANOVA
PCA, as well as hierarchical clustering, is an example of ‘unsupervised pattern recognition’. Prior knowledge about the objects is not utilized until the visualization, where the addition of 'labels' may demonstrate relevant systematic information in the data set. Sometimes several experimental factors, each with defined levels or states (treatments, illnesses, sampling occasions etc.), influence the 'fingerprints', and it may be hard to find the principal components presenting the ‘best’ picture for the current issue. By analysis of variance (ANOVA) applied to the score values, we can find the principal components most influenced by the experimental factors. The selection is made according to the F ratio and p value provided by ANOVA for each factor and each principal component. The most informative score plot with respect to a factor's influence is obtained with the two principal components with the highest F ratios or (equivalently) the lowest p values. It should be stressed that the probability value (p) is not used as a way to assess statistical significance, but only as a help in selecting PCs. However, very low p values do give support to the assumption of systematic variation in the data set with respect to the investigated factor.
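A sketch of the selection step, assuming the score matrix T and a grouping variable (here a hypothetical cell array beverage with one label per run); anova1 belongs to the Statistics Toolbox:

    nPC  = size(T, 2);
    Fval = zeros(nPC, 1);
    for kPC = 1:nPC
        [~, tbl] = anova1(T(:,kPC), beverage, 'off');  % one-way ANOVA per PC
        Fval(kPC) = tbl{2,5};                          % F ratio for the factor
    end
    [~, best] = sort(Fval, 'descend');                 % PCs most influenced
    plot(T(:,best(1)), T(:,best(2)), 'o');             % most informative score plot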

In the ‘beverage’ study, ANOVA was used for the selection of the relevant principal components. The full data set comprised urine samples acquired from 13 persons (males and females) after intake of coffee, tea or water, with sampling repeated on three different occasions. In total, there were 115 CE-MS runs made in positive ESI mode and as many in negative mode, generating a huge data set of great complexity. The relevant systematic information (i.e. related to 'beverage') sought in this study is hidden under equally large or even larger variations of less interest. In fact, the variations due to different individuals and different sampling occasions dominate, and the relevant information about the beverage effect is not revealed until higher PCs (Figure 16, bottom left). Using ANOVA modeling (as a general linear model), the effects of the other 'nuisance' factors can be subtracted, and a much clearer picture of the differences caused by the beverage intake emerges (Figure 16, bottom, left to right).

Figure 16. ’Fuzzy’ PCA plots of higher-order components present structural information within groups (‘coffee’ vs. ‘tea and water’) using a selected time window (upper, ‘grayed’ box). When the effects of the 'nuisance' factors (variations due to individuals and different sampling occasions) are removed, a much clearer picture emerges (bottom, left to right).

6.3.6 Variable assessment with target correlation (contrasts)
An example of ‘supervised pattern recognition’ is the division of the data set into groups based on prior knowledge, i.e. levels for one or more experimental factors. The goal may be to classify samples of unknown class belonging, such as diagnosing a patient. However, a more prominent challenge is to determine the variables of most importance with respect to the classification, e.g. when searching for biomarkers or elucidating underlying mechanisms. We will confine ourselves to binary classification with only two classes, and for each variable we calculate the so-called contrast value between the classes. For this we correlate the variables with a 'target' vector t, which takes the value ‘1’ for objects of one class and ‘0’ for objects of the other class. After centering, transforming t to tc, the values will be ±0.5, provided that both classes are of the same size. If we have more than two levels for a factor, the contrast will be calculated between two levels (or groups of levels) at a time. One example is the ‘beverage’ study, where there are three separate groups with respect to the beverage factor, namely ‘coffee’, ‘tea’ and ‘water’. Even for complicated cases with several factors and levels, especially with unbalanced designs, the target vectors can easily be obtained from the general linear model applied in ANOVA.

The correlations between the target vector tc and the variables are calculated in a similar fashion to the correlation loadings, replacing the score matrix Tc with the target vector tc. The product (X2)^T tc is a contrast vector holding the contrasts obtained for each variable. This vector can be re-folded and visualized as a mass chromatogram in the same way as the correlation loadings. This gives the opportunity to identify areas of interest, i.e. mass and time windows, something that proved to be very helpful in the ‘digest’ study. Figure 17 illustrates the contrasts between the NAP5 and the acetonitrile approaches for the digestion of fetuin. The contrasts are shown as separate mass chromatograms for the positive contrasts (higher intensities with NAP5) and the negative contrasts (higher intensities with acetonitrile). Some of the contrast values were also identified to correspond to tryptic peptide fragments of fetuin.
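A sketch of the contrast computation for a binary factor (X2 and the bin counts nMz and nTime as before; t is the 0/1 target vector):

    tc = t - mean(t);                      % centred target (+/-0.5 for equal groups)
    contrast = X2' * tc;                   % one contrast value per variable
    Cmap = reshape(contrast, nMz, nTime);  % re-fold for visualization (cf. Figure 17)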

Figure 17. Contrast plots, resolved in both time and m/z; (left) positive and (right) negative differences between NAP5 and ACN for fetuin. The encircled and numbered areas (1-6) are examples of spots that have been identified as tryptic peptide fragments of the protein: (1) HTLNQIDSVK, (2) YFK, (3) AHYDLR, (4) ALGGEDVR, (5) TPIVGQPSIPGGPVR and (6) QDGQFSVLFTK.

When used as a tool for ‘biomarker screening’, it is important to be aware of the influence of the intensity level on the contrast value for a variable. The use of compression of the intensities affects this influence in the same way as for the ‘fuzzy’ loadings. A more direct way to balance the target correlation and the intensity level for the variable is to combine the correlation coefficient r (with a value between -1 and 1) with the non-compressed intensity value into a single measure. In the ‘metabonomics’ study we used intensity × r^3 as such a measure.

When contrasts are calculated for all variables, it is likely that a certain number of rather high but ‘false’ contrast values are obtained. Such 'false' contrasts are not due to the influence of the studied factor but rather appear by chance, as a consequence of the very large number of variables. To estimate the contrast thresholds (one positive level and one negative level) beyond which the probability of finding ‘false’ correlations is low enough (e.g. 5% or 1%), we estimate the random distribution of contrasts for our data set. This is done by rearranging the positions of the ‘0’s and ‘1’s in the target vector. With a low number of objects (as we have in the ‘digest’ study), all combinations can be considered. Otherwise, a set of random rearrangements is generated, large enough to give a fair estimation of the limits and small enough to be treated with reasonable computational effort.
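A sketch of the permutation estimate (the number of rearrangements and the 5% level are example values; prctile belongs to the Statistics Toolbox):

    nPerm = 1000;                             % example number of rearrangements
    hiNull = zeros(nPerm, 1);  loNull = zeros(nPerm, 1);
    for r = 1:nPerm
        tp = t(randperm(numel(t)));           % rearranged '0's and '1's
        cp = X2' * (tp - mean(tp));           % contrasts under the null
        hiNull(r) = max(cp);  loNull(r) = min(cp);
    end
    posLimit = prctile(hiNull, 95);           % positive threshold (5% level)
    negLimit = prctile(loNull, 5);            % negative threshold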

Variables exhibiting high contrast values may be useful as a basis for further investigation. By using the positions (i.e. m/z and time values in the re-folded matrices) as starting points, targeted complementary studies can be performed with methods like MS/MS. The contrast map can also be used as an indicator of interesting m/z and time areas in the present data set. For these limited m/z and time windows, refined data exploration can be undertaken with higher m/z and time resolution (i.e. smaller ‘bins’) without memory requirements beyond the boundaries of a 32-bit computer.

7. Conclusions and future aspects

As described in the introduction, there is no reason to expect that the complexity of analysis procedures and the amount of data generated (and needed) will decrease in the future. With increased complexity, tools for managing and exploring data are not just a convenience but a necessity. I have no doubt that further development within industry and academia will help analytical chemists to generate even more data, and to analyze and solve even more intricate challenges. I also have no doubt that the file formats used will, at least to an extent far beyond today, be open and accessible to the public. There are approaches in that direction today, one being mzXML [93], an open file format that seems to have gained some support. There is also the discussion by Alsberg et al. [67] regarding the use of Octave as a replacement for Matlab; an inferior scripting language today, but perhaps with a brighter future.

The procedures presented in this thesis need a common graphical user interface (GUI); especially the routines for import and data reduction are lacking in this area. We also need to write routines to acquire and manipulate the meta-information stored in the files, i.e. a form of miniature search engine.

An area where two- or three-way modeling would be well suited is when small ‘areas of interest’ have been established in the data sets. In this way, the isotope patterns and chromatographic profiles of isolated analytes could be extracted.

To meet all of the above challenges within a reasonable timeframe, disciplines such as computer programming, mathematics and chemistry need to build a common platform together. The future does look interesting.

8. Acknowledgements

My present supervisors, Per Sjöberg and Rolf Danielsson, for all the help, time and encouragement you have invested in me. My former supervisors Dan Bylund and Karin Markides for the same. Karin for accepting me a long time ago, giving me the opportunity to perform my Ph.D. studies at the department.

Dan and Rolf (and My) for teaching in such a captivating way that I started to contemplate the idea of post-graduate studies for the first time.

Jonas Bergquist, for good ideas and general help to keep the spirit high.

My collaborators at BMC, Erik Allard, Nina Johannesson, Sara Ullsten, and My Moberg.

Camilla Zettersten, my room mate, it would have been a gloomier place without you.

Nina Johannesson, Emilia Hardenborg, Björn Arvidsson and Erik Allard for being there when I needed to let off some steam or just needed new input.

The rest of the department, former and present, for being there. And for populating the spray bar…

All the friends that are not into chemistry for reminding me of the life outside of BMC.

My family, and most of all Tanja. You are my anchor, my strength, and without you I’d be long lost.

9. Summary in Swedish

The aim of this thesis is to present methods for managing and evaluating large data sets generated by LS/MS. The thesis describes how the internal structure of the data files can be designed to best facilitate the creation of routines for analyzing such data sets. Methods for exploratory analysis are also presented, together with methods for baseline reduction and alignment. Throughout, Matlab has been used to implement the methods.

The following abbreviations are used below:
LS    Liquid-based separation
MS    Mass spectrometry
CE    Capillary electrophoresis
CEC   Capillary electrochromatography
LC    Liquid chromatography
ESI   Electrospray ionization
TOF   Time-of-flight mass spectrometer
FTICR Fourier transform ion cyclotron resonance mass spectrometer
PCA   Principal component analysis
ANOVA Analysis of variance

9.1 Introduction to LS/MS
Analytical questions in medicine, environmental science etc. are becoming ever more complex, leading to constantly increasing demands for more sensitive analytical methods for often very complex samples (for example different biological fluids such as whole blood, plasma/serum, cerebrospinal fluid and so on). Such samples contain a multitude of analytes of no interest to the study, usually found at high concentrations. To extract the desired information, different analytical methods generally have to be combined. Before the components of a complex sample can be identified and/or quantified, a separation step is usually needed. For dissolved components (e.g. in one of the body fluids mentioned above), liquid-based methods such as CE, CEC or LC are often used. The separation step must then be coupled to some form of detector, which in this thesis is consistently some form of mass spectrometer (MS). To transfer the analytes to the gas phase, i.e. free ions that the mass spectrometer can detect and quantify, so-called electrospray ionization (ESI) is used here. An LS/MS system will generate data that can be illustrated as a map with mass-to-charge ratio (m/z) and time on the axes instead of north-south and east-west.

The more complicated a question is, the more information needs to be generated to produce a plausible answer. Today an LS/MS analysis can easily produce large amounts of data. In less than an hour, one can obtain a data volume too large to fit in the electronic memory (RAM) of a computer built with standard components (i.e. up to 4 GB RAM). The handling of the data has thus itself emerged as a major challenge.

9.2 Managing and evaluating LS/MS data
If the analysis of a single sample generates such large amounts of data, where are the biggest problems when working with larger studies comprising hundreds of samples? Storing the data volume is not in itself a major problem, as data is usually compressed efficiently when stored electronically, and storage space is cheap today. Nor are large (and thereby costly) amounts of computing power necessarily required. One goal for the routines developed during the work on this thesis was that they should be usable on the standard systems usually already available in most workplaces. What has proved to be the biggest problem when working with many, and above all large, data matrices is the availability of electronic working memory.

A major effort has been put into developing routines that require as little memory as possible during analysis and evaluation. Working with matrices (a form of mathematical table) is very useful. The data, the time-m/z map generated by the mass spectrometer, contains only to a very small extent information originating from analytes; the rest is various forms of noise and background. As can be seen in the figure below (Figure 1), it is difficult, not to say impossible, to see anything but noise before superfluous, non-informative data points have been removed. This means that the matrix becomes so-called sparse, i.e. the major part of the matrix consists of cells with the value zero. The matrix can therefore be made much smaller, as all matrix cells with the value zero are ignored and only non-zero values are saved. A fast and memory-efficient method for finding and removing background values and noise is presented in this thesis. The internal structure of the files we work with (containing LS/MS runs) is also presented, designed to make it possible to write routines that work with all the data both quickly and with a small memory footprint. Methods are also presented for comparing data sets from several runs, which can be used to investigate whether there are groupings in the data set, as well as methods for finding the causes of such groupings. All routines have been used in a number of projects, briefly presented below.

Figure 1. An LS-MS image (TIC) showing few or no clear analyte signals before the background has been removed (top). Once this is done (bottom), a small forest of previously hidden signals emerges.

9.4 Projects
• Biomarkers (Paper I)
A study of plasma taken from patients with two different forms of appendicitis, both during surgery and some time after removal of the appendix (i.e. the patients serve as their own control group). The samples were examined with both CEC/TOF and CE/FTICR. Systematic information between the groups could be found, and for each analytical method a list was compiled of the peaks that contributed most to the difference. The two lists were compared, and a number of potential biomarkers, which need to be examined more closely in the future, were identified.

• ‘Paracetamol’ (Papers II and III)
Urine samples from one test subject were analyzed with CE/MS. The samples were taken before and after intake of Alvedon on three different occasions. To handle random shifts in separation time, a more 'robust' variant of PCA, called ‘fuzzy’ PCA, was developed, and the method was published in a separate paper. The ‘fuzzy’ PCA method proved to work well, and the urine samples showed clear systematic differences between before and after intake of Alvedon.

• ‘Digestion’ (Paper IV)
Three different protocols for desalting and digestion were tested on three different proteins (BSA, fetuin and casein), and the samples were analyzed with LC/TOF. When evaluated with ‘fuzzy’ PCA, fetuin and casein showed very clear differences between the protocols, while BSA showed smaller differences. The data matrices generated were very large (~500,000,000 data points per sample), and the routines presented in chapters 5 and 6 proved to work excellently.

• ‘Beverages’ (Manuscript V)
A more advanced study of human urine than the ‘paracetamol’ study, with urine samples from 13 persons who had drunk coffee, tea or water. In total, 2×115 samples were analyzed, each of roughly the same size as in the ‘digestion’ project, but this time with CE/TOF. The largest systematic differences between the samples turned out to depend on sampling occasion and person-to-person variation, but by using ANOVA the interesting effects could be isolated, and good separation was found primarily between intake of coffee compared with tea or water. The large amounts of data (the largest of all studies in this thesis) could be handled by the methods described in chapters 5 and 6.

10. References

1. Tiselius, A., A new apparatus for electrophoretic analysis of colloidal mixtures, Transactions of the Faraday Society 1937, 33, 524-531.
2. Hjertén, S., Free zone electrophoresis, Chromatogr. Rev. 1967, 9, 122-219.
3. Baker, D. R., Capillary electrophoresis; Wiley-Interscience, 1995.
4. Li, S. F. Y., Capillary electrophoresis: principles, practice and applications, 2nd ed.; Elsevier Science, 1994.
5. Giordano, B. C.; Muza, M.; Trout, A.; Landers, J. P., Dynamically-coated capillaries allow for capillary electrophoretic resolution of transferrin sialoforms via direct analysis of human serum, J. Chromatogr., B: Biomed. Appl. 2000, 742, 79-89.
6. Lindner, H.; Helliger, W.; Sarg, B.; Meraner, C., Effect of buffer composition on the migration order and separation of histone H1 subtypes, Electrophoresis 1995, 16, 604-610.
7. Yao, Y. J.; Khoo, K. S.; Chung, M. C. M.; Li, S. F. Y., Determination of isoelectric points of acidic and basic proteins by capillary electrophoresis, J. Chromatogr., A 1994, 680, 431-435.
8. Gilges, M.; Kleemiss, M. H.; Schomburg, G., Capillary zone electrophoresis separations of basic and acidic proteins using poly(vinyl alcohol) coatings in fused silica capillaries, Anal. Chem. 1994, 66, 2038-2046.
9. Cordova, E.; Gao, J.; Whitesides, G. M., Noncovalent polycationic coatings for capillaries in capillary electrophoresis of proteins, Anal. Chem. 1997, 69, 1370-1379.
10. Ullsten, S.; Söderberg, L.; Folestad, S.; Markides, K. E., Quaternary ammonium substituted agarose as surface coating for capillary electrophoresis, Analyst 2004, 129, 410-415.
11. Ullsten, S.; Zuberovic, A.; Wetterhall, M.; Hardenborg, E.; Markides, K. E.; Bergquist, J., A polyamine coating for enhanced capillary electrophoresis-electrospray ionization-mass spectrometry of proteins and peptides, Electrophoresis 2004, 25, 2090-2099.
12. Hardenborg, E.; Zuberovic, A.; Ullsten, S.; Söderberg, L.; Heldin, E.; Markides, K. E., Novel polyamine coating providing non-covalent deactivation and reversed electroosmotic flow of fused-silica capillaries for capillary electrophoresis, J. Chromatogr., A 2003, 1003, 217-221.
13. Chervet, J. P.; Ursem, M.; Salzmann, J. P., Instrumental requirements for nanoscale liquid chromatography, Anal. Chem. 1996, 68, 1507-1512.
14. Kirkup, L.; Foot, M.; Mulholland, M., Comparison of equations describing band broadening in high-performance liquid chromatography, J. Chromatogr., A 2004, 1030, 25-31.
15. Karlsson, K. E.; Novotny, M., Separation efficiency of slurry-packed liquid chromatography microcolumns with very small inner diameters, Anal. Chem. 1988, 60, 1662-1665.
16. Kennedy, R. T.; Jorgenson, J. W., Preparation and evaluation of packed capillary liquid chromatography columns with inner diameters from 20 to 50 micrometers, Anal. Chem. 1989, 61, 1128-1135.
17. Murray, K. K., Coupling matrix-assisted laser desorption/ionization to liquid separations, Mass Spectrom. Rev. 1998, 16, 283-299.
18. Niessen, W. M. A., Advances in instrumentation in liquid chromatography-mass spectrometry and related liquid-introduction techniques, J. Chromatogr., A 1998, 794, 407-435.
19. Niessen, W. M. A., Progress in liquid chromatography-mass spectrometry instrumentation and its impact on high-throughput screening, J. Chromatogr., A 2003, 1000, 413-436.
20. Niessen, W. M. A., State-of-the-art in liquid chromatography-mass spectrometry, J. Chromatogr., A 1999, 856, 179-197.
21. Niessen, W. M. A.; Tinke, A. P., Liquid chromatography-mass spectrometry. General principles and instrumentation, J. Chromatogr., A 1995, 703, 37-57.
22. Gelpi, E., Interfaces for coupled liquid-phase separation/mass spectrometry techniques. An update on recent developments, J. Mass Spectrom. 2002, 37, 241-253.
23. Gelpi, E., Biomedical and biochemical applications of liquid chromatography-mass spectrometry, J. Chromatogr., A 1995, 703, 59-80.
24. Gusev, A. I., Interfacing matrix-assisted laser desorption/ionization mass spectrometry with column and planar separations, Fresenius J. Anal. Chem. 2000, 366, 691-700.
25. Hanold, K. A.; Fischer, S. M.; Cormia, P. H.; Miller, C. E.; Syage, J. A., Atmospheric pressure photoionization. 1. General properties for LC/MS, Anal. Chem. 2004, 76, 2842-2851.
26. Wilm, M. S.; Mann, M., Electrospray and Taylor-cone theory, Dole's beam of macromolecules at last?, Int. J. Mass Spectrom. Ion Processes 1994, 136, 167-180.
27. Dole, M.; Mack, L. L.; Hines, R. L.; Mobley, R. C.; Ferguson, L. D.; Alice, M. B., Molecular beams of macroions, J. Chem. Phys. 1968, 49, 2240-2249.
28. Iribarne, J. V.; Thomson, B. A., On the evaporation of small ions from charged droplets, J. Chem. Phys. 1976, 64, 2287-2294.
29. Cole, R. B., Some tenets pertaining to electrospray ionization mass spectrometry, J. Mass Spectrom. 2000, 35, 763-772.
30. Kebarle, P.; Tang, L., From ions in solution to ions in the gas phase - the mechanism of electrospray mass spectrometry, Anal. Chem. 1993, 65, 972A-986A.
31. Enke, C., A predictive model for matrix and analyte effects in electrospray ionization of singly-charged ionic analytes, Anal. Chem. 1997, 69, 4885-4893.
32. Hoffman, E. d.; Stroobant, V., Mass spectrometry: principles and applications, 2nd ed.; Wiley, 2002.
33. Coles, J.; Guilhaus, M., Orthogonal acceleration - a new direction for time-of-flight mass spectrometry: Fast, sensitive mass analysis for continuous ion sources, TrAC, Trends Anal. Chem. 1993, 12, 203-213.
34. Guilhaus, M.; Selby, D.; Mlynski, V., Orthogonal acceleration time-of-flight mass spectrometry, Mass Spectrom. Rev. 2000, 19, 65-107.
35. Marshall, A. G., Milestones in Fourier transform ion cyclotron resonance mass spectrometry technique development, Int. J. Mass Spectrom. 2000, 200, 331-356.
36. Marshall, A. G.; Hendrickson, C. L.; Jackson, G. S., Fourier transform ion cyclotron resonance mass spectrometry: A primer, Mass Spectrom. Rev. 1998, 17, 1-35.
37. Sjöberg, P. J. R.; Bökman, C. F.; Bylund, D.; Markides, K. E., Factors influencing the determination of analyte ion surface partitioning coefficients in electrosprayed droplets, J. Am. Soc. Mass Spectrom. 2001, 12, 1002-1010.
38. Palmblad, M.; Ramström, M.; Markides, K. E.; Håkansson, P.; Bergquist, J., Prediction of chromatographic retention and protein identification in liquid chromatography/mass spectrometry, Anal. Chem. 2002, 74, 5826-5830.
39. Palmblad, M.; Ramström, M.; Bailey, C. G.; McCutchen-Maloney, S. L.; Bergquist, J.; Zeller, L. C., Protein identification by liquid chromatography-mass spectrometry using retention time prediction, J. Chromatogr., B: Biomed. Appl. 2004, 803, 131-135.
40. Sanz-Nebot, V.; Toro, I.; Benavente, F.; Barbosa, J., pKa values of peptides in aqueous and aqueous-organic media. Prediction of chromatographic and electrophoretic behaviour, J. Chromatogr., A 2002, 942, 145-156.
41. Grossman, P. D.; Colburn, J. C.; Lauer, H. H., A semiempirical model for the electrophoretic mobilities of peptides in free-solution capillary electrophoresis, Anal. Biochem. 1989, 179, 28-33.
42. Cifuentes, A.; Poppe, H., Behavior of peptides in capillary electrophoresis. Effect of peptide charge, mass, and structure, Electrophoresis 1997, 18, 2362-2376.
43. Petritis, K.; Kangas, L. J.; Ferguson, P. L.; Anderson, G. A.; Pasa-Tolic, L.; Lipton, M. S.; Auberry, K. J.; Strittmatter, E. F.; Shen, Y.; Zhao, R.; Smith, R. D., Use of artificial neural networks for the accurate prediction of peptide liquid chromatography elution times in proteome analyses, Anal. Chem. 2003, 75, 1039-1048.
44. Bruce, J. E.; Anderson, G. A.; Wen, J.; Harkewicz, R.; Smith, R. D., High-mass-measurement accuracy and 100% sequence coverage of enzymatically digested bovine serum albumin from an ESI-FTICR mass spectrum, Anal. Chem. 1999, 71, 2595-2599.
45. Smith, R. D.; Anderson, G. A.; Lipton, M. S.; Masselon, C.; Pasa-Tolic, L.; Shen, Y.; Udseth, H. R., The use of accurate mass tags for high-throughput microbial proteomics, Omics 2002, 6, 61-90.
46. HDF5, http://www.hdfgroup.org/products/hdf5/index.html, Sept. 2007.
47. NetCDF, http://www.unidata.ucar.edu/software/netcdf/, Sept. 2007.
48. Wetterhall, M.; Palmblad, M.; Håkansson, P.; Markides, K. E.; Bergquist, J., Rapid analysis of tryptically digested cerebrospinal fluid using capillary electrophoresis-electrospray ionization-Fourier transform ion cyclotron resonance-mass spectrometry, J. Proteome Res. 2002, 1, 361-366.
49. Kraly, J.; Fazal, M. A.; Schoenherr, R. M.; Bonn, R.; Harwood, M. M.; Turner, E.; Jones, M.; Dovichi, N. J., Bioanalytical applications of capillary electrophoresis, Anal. Chem. 2006, 78, 4097-4110.
50. Neuhoff, N. V.; Kaiser, T.; Wittke, S.; Krebs, R.; Pitt, A.; Burchard, A.; Sundmacher, A.; Schlegelberger, B.; Kolch, W.; Mischak, H., Mass spectrometry for the detection of differentially expressed proteins: A comparison of surface-enhanced laser desorption/ionization and capillary electrophoresis/mass spectrometry, Rapid Commun. Mass Spectrom. 2004, 18, 149-156.
51. Hu, A.; Chen, C.-T.; Tsai, P.-J.; Ho, Y.-P., Using capillary electrophoresis-selective tandem mass spectrometry to identify pathogens in clinical samples, Anal. Chem. 2006, 78, 5124-5133.
52. Wahl, J. H.; Gale, D. C.; Smith, R. D., Sheathless capillary electrophoresis-electrospray ionization mass spectrometry using 10 µm i.d. capillaries: Analyses of tryptic digests of cytochrome c, J. Chromatogr. 1994, 659, 217-222.
53. Von Brocke, A.; Nicholson, G.; Bayer, E., Recent advances in capillary electrophoresis/electrospray-mass spectrometry, Electrophoresis 2001, 22, 1251-1266.
54. Schmitt-Kopplin, P.; Frommberger, M., Capillary electrophoresis - mass spectrometry: 15 years of developments and applications, Electrophoresis 2003, 24, 3837-3867.
55. Kelly, J. F.; Ramaley, L.; Thibault, P., Capillary zone electrophoresis-electrospray mass spectrometry at submicroliter flow rates: Practical considerations and analytical performance, Anal. Chem. 1997, 69, 51-60.
56. Johannesson, N.; Wetterhall, M.; Markides, K. E.; Bergquist, J., Monomer surface modifications for rapid peptide analysis by capillary electrophoresis and capillary electrochromatography coupled to electrospray ionization-mass spectrometry, Electrophoresis 2004, 25, 809-816.
57. Hunter, T. C.; Andon, N. L.; Koller, A.; Yates, J. R., III; Haynes, P. A., The functional protein toolbox: Methods and applications, J. Chromatogr., B: Biomed. Appl. 2002, 782, 165-181.
58. Geoghegan, K. F.; Kelly, M. A., Biochemical applications of mass spectrometry in pharmaceutical drug discovery, Mass Spectrom. Rev. 2005, 24, 347-366.
59. Palagi, P. M.; Walther, D.; Quadroni, M.; Catherinet, S.; Burgess, J.; Zimmermann-Ivol, C. G.; Sanchez, J.-C.; Binz, P.-A.; Hochstrasser, D. F.; Appel, R. D., MSight: An image analysis software for liquid chromatography-mass spectrometry, Proteomics 2005, 5, 2381-2384.
60. Katajamaa, M.; Miettinen, J.; Oresic, M., MZmine: Toolbox for processing and visualization of mass spectrometry based molecular profile data, Bioinformatics 2006, 22, 634-636.
61. Katajamaa, M.; Oresic, M., Processing methods for differential analysis of LC/MS profile data, BMC Bioinformatics 2005, 6, 179.
62. Smith, C. A.; Want, E. J.; O'Maille, G.; Abagyan, R.; Siuzdak, G., XCMS: Processing mass spectrometry data for metabolite profiling using nonlinear peak alignment, matching, and identification, Anal. Chem. 2006, 78, 779-787.
63. Li, X.-J.; Pedrioli, P. G. A.; Eng, J.; Martin, D.; Yi, E. C.; Lee, H.; Aebersold, R., A tool to visualize and evaluate data obtained by liquid chromatography-electrospray ionization-mass spectrometry, Anal. Chem. 2004, 76, 3856-3860.
64. R, http://www.r-project.org/, Sept. 2007.
65. Matlab, http://www.mathworks.com/, Sept. 2007.
66. Octave, http://www.gnu.org/software/octave/, Sept. 2007.
67. Alsberg, B. K.; O. J. H., How Octave can replace Matlab in chemometrics, Chemom. Intell. Lab. Syst. 2006, 84, 195-200.
68. Scilab, http://www.scilab.org/, Sept. 2007.
69. Freemat, http://freemat.sourceforge.net/wiki/index.php/Main_Page, Sept. 2007.
70. Vicsek, M.; Neal, S. L.; Warner, I. M., Time-domain filtering of two-dimensional fluorescence data, Appl. Spectrosc. 1986, 40, 542-548.
71. Vivo-Truyols, G.; Torres-Lapasio, J. R.; van Nederkassel, A. M.; Vander Heyden, Y.; Massart, D. L., Automatic program for peak detection and deconvolution of multi-overlapped chromatographic signals. Part I: Peak detection, J. Chromatogr., A 2005, 1096, 133-145.
72. Miller, J. N.; Miller, J. C., Statistics and chemometrics for analytical chemistry, 5th ed.; Pearson Education Limited, 2005.
73. Wold, S.; Esbensen, K.; Geladi, P., Principal component analysis, Chemom. Intell. Lab. Syst. 1987, 2, 37-52.
74. Bro, R., PARAFAC. Tutorial and applications, Chemom. Intell. Lab. Syst. 1997, 38, 149-171.
75. Andersson, R.; Hämäläinen, M. D., Simplex focusing of retention times and latent variable projections of chromatographic profiles, Chemom. Intell. Lab. Syst. 1994, 22, 49-61.
76. Krebs, M. D.; Tingley, R. D.; Zeskind, J. E.; Holmboe, M. E.; Kang, J.-M.; Davis, C. E., Alignment of gas chromatography-mass spectrometry data by landmark selection from complex chemical mixtures, Chemom. Intell. Lab. Syst. 2006, 81, 74-81.
77. Malmquist, G.; Danielsson, R., Alignment of chromatographic profiles for principal component analysis: A prerequisite for fingerprinting methods, J. Chromatogr., A 1994, 687, 71-88.
78. Nielsen, N.-P. V.; Carstensen, J. M.; Smedsgaard, J., Aligning of single and multiple wavelength chromatographic profiles for chemometric data analysis using correlation optimized warping, J. Chromatogr., A 1998, 805, 17-35.
79. Bylund, D.; Danielsson, R.; Malmquist, G.; Markides, K. E., Chromatographic alignment by warping and dynamic programming as a pre-processing tool for PARAFAC modelling of liquid chromatography-mass spectrometry data, J. Chromatogr., A 2002, 961, 237-244.
80. Gan, F.; Ruan, G.; Mo, J., Baseline correction by improved iterative polynomial fitting with automatic threshold, Chemom. Intell. Lab. Syst. 2006, 82, 59-65.
81. Eilers, P. H. C., Parametric time warping, Anal. Chem. 2004, 76, 404-411.
82. Fischer, B.; Grossmann, J.; Roth, V.; Gruissem, W.; Baginsky, S.; Buhmann, J. M., Semi-supervised LC/MS alignment for differential proteomics, Bioinformatics 2006, 22, e132-e140.
83. Pierce, K. M.; Wright, B. W.; Synovec, R. E., Unsupervised parameter optimization for automated retention time alignment of severely shifted gas chromatographic data using the piecewise alignment algorithm, J. Chromatogr., A 2007, 1141, 106-116.
84. Torgrip, R. J. O.; Åberg, M.; Karlberg, B.; Jacobsson, S. P., Peak alignment using reduced set mapping, J. Chemom. 2004, 17, 573-582.
85. Gong, F.; Liang, Y.-Z.; Fung, Y.-S.; Chau, F.-T., Correction of retention time shifts for chromatographic fingerprints of herbal medicines, J. Chromatogr., A 2004, 1029, 173-183.
86. Nielsen, N.-P. V.; Smedsgaard, J.; Frisvad, J. C., Full second-order chromatographic/spectrometric data matrixes for automated sample identification and component analysis by non-data-reducing image analysis, Anal. Chem. 1999, 71, 727-735.
87. Dixon, S. J.; Xu, Y.; Brereton, R. G.; Soini, H. A.; Novotny, M. V.; Oberzaucher, E.; Grammer, K.; Penn, D. J., Pattern recognition of gas chromatography mass spectrometry of human volatiles in sweat to distinguish the sex of subjects and determine potential discriminatory marker peaks, Chemom. Intell. Lab. Syst. 2007, 87, 161-172.
88. Alm, R.; Johansson, P.; Hjernø, K.; Emanuelsson, C.; Ringner, M.; Häkkinen, J., Detection and identification of protein isoforms using cluster analysis of MALDI-MS mass spectra, J. Proteome Res. 2006, 5, 785-792.
89. Pohjanen, E.; Thysell, E.; Lindberg, J.; Schuppe-Koistinen, I.; Moritz, T.; Jonsson, P.; Antti, H., Statistical multivariate metabolite profiling for aiding biomarker pattern detection and mechanistic interpretations in GC/MS based metabolomics, Metabolomics 2006, 2, 257-268.
90. Minitab, http://www.minitab.com/, Sept. 2007.
91. Marengo, E.; Robotti, E.; Righetti, P. G.; Antonucci, F., New approach based on fuzzy logic and principal component analysis for the classification of two-dimensional maps in health and disease, J. Chromatogr., A 2003, 1004, 13-28.
92. Marengo, E.; Robotti, E.; Gianotti, V.; Righetti, P. G.; Cecconi, D.; Domenici, E., A new integrated statistical approach to the diagnostic use of two-dimensional maps, Electrophoresis 2003, 24, 225-236.
93. Lin, S. M.; Zhu, L.; Winter, A. Q.; Sasinowski, M.; Kibbe, W. A., What is mzXML good for?, Expert Rev. Proteomics 2005, 2, 839-845.

Acta Universitatis Upsaliensis Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology 346

Editor: The Dean of the Faculty of Science and Technology

A doctoral dissertation from the Faculty of Science and Technology, Uppsala University, is usually a summary of a number of papers. A few copies of the complete dissertation are kept at major Swedish research libraries, while the summary alone is distributed internationally through the series Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology. (Prior to January, 2005, the series was published under the title “Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology”.)

ACTA UNIVERSITATIS UPSALIENSIS Distribution: publications.uu.se UPPSALA urn:nbn:se:uu:diva-8223 2007