RIBOSOME PROFILING AND THE IMPROVEMENT OF PROTEIN IDENTIFICATION IN CANCER PROTEOGENOMICS STUDIES
Number of words: 20 226


Sofie Gielis
Student number: 01507424

Promotor: Prof. dr. ir. Wim Van Criekinge
Promotor: dr. ir. Gerben Menschaert

Master's dissertation submitted in partial fulfillment of the requirements for the degree of Master in Bioscience Engineering: Cell and Gene Biotechnology

Academic year: 2016 - 2017


This page is not available because it contains personal information. Ghent University, Library, 2021.

Acknowledgements

After five years of hard work and dedication, college is coming to an end. It was definitely the most challenging and interesting experience ever. I am very grateful for all the support of my friends and family throughout these years. Thank you all!

I would also like to thank my promotor Wim Van Criekinge for guiding me through the basics of bioinformatics in the first master year. Your enthusiastic way of teaching and love for the subject sparked my interest in bioinformatics.

Finally, a special thanks goes to my promotor Gerben Menschaert and tutor Steven Verbruggen for all the help and support over the last year.

Sofie Gielis
Ghent, 4 June 2017


Contents

1 Introduction and outline

2 Overview of relevant literature
  2.1 NGS-based technologies
  2.2 MS-based proteomics
    2.2.1 Outline of MS-based proteomics experiments
    2.2.2 Database searching and MS/MS-based protein identification
  2.3 Proteogenomics
    2.3.1 Goals of proteogenomics
  2.4 Ribosome profiling
    2.4.1 An emerging technique
    2.4.2 Outline of a ribosome profiling experiment
    2.4.3 Applications of ribosome profiling
  2.5 The PROTEOFORMER pipeline
    2.5.1 Overview of the pipeline
    2.5.2 Achievements

3 Material and methods
  3.1 Hardware
  3.2 Software
    3.2.1 SSH clients
    3.2.2 Python
    3.2.3 SQLite
    3.2.4 Python DB-API
    3.2.5 STAR
    3.2.6 Variant caller
    3.2.7 SearchGUI and PeptideShaker
    3.2.8 Galaxy
  3.3 Public databases
    3.3.1 Ensembl annotation bundle
    3.3.2 dbSNP
    3.3.3 Sequence read archives
    3.3.4 PRIDE
  3.4 Custom database creation
    3.4.1 Mapping
    3.4.2 Transcript calling
    3.4.3 TIS calling
    3.4.4 Variant calling
    3.4.5 Translation assembly
  3.5 Validation of the renewed PROTEOFORMER pipeline with proteomics data

4 Results
  4.1 Custom database creation
    4.1.1 Mapping statistics
    4.1.2 Mapping quality control
    4.1.3 Discovered variants
    4.1.4 Translation assembly
  4.2 Validation of the renewed PROTEOFORMER pipeline with proteomics data
    4.2.1 Exploration of the effect of INDELs on the identification process
    4.2.2 Exploration of the effect of SNPs on the identification process

5 Discussion

6 Conclusion

7 Further research

Bibliography

List of abbreviations

APC  adenomatous polyposis coli
API  application programming interface
A-site  aminoacyl site
aTIS  alternative translation initiation site (general)/annotated start site (PROTEOFORMER)
ATP  adenosine triphosphate
BAM  Binary Alignment/Map
BCF  Binary Variant Call Format
BRAF  B-Raf proto-oncogene serine/threonine kinase
cDNA  complementary DNA
CDS  coding sequence
CHX  cycloheximide
COFRADIC  combined fractional diagonal chromatography
COSMIC  Catalogue Of Somatic Mutations In Cancer
cRAP  common Repository of Adventitious Proteins
CWI  Center for Mathematics and Computer Science
DB-API  database application programming interface
dbSNP  Single Nucleotide Polymorphism Database
dbTIS  database annotated TIS
DDBJ  DNA Databank of Japan
DNA  deoxyribonucleic acid
dNTP  deoxynucleotide triphosphate
dTIS  downstream TIS
EMBL-EBI  European Bioinformatics Institute
ENA  European Nucleotide Archive
E-site  exit site
EST  expressed sequence tag
GATK  Genome Analysis Toolkit
GRC  Genome Reference Consortium
HARR  harringtonine
HCT116  human colorectal tumour 116 cell line
INDEL  INsertion and DELetion
INSDC  International Nucleotide Sequence Database Collaboration
K  lysine
KRAS  Kirsten rat sarcoma viral oncogene homolog
LC  liquid chromatography
LC-MS/MS  liquid chromatography followed by tandem mass spectrometry
LTM  lactimidomycin
mESC  mouse embryonic stem cells
MMS19  methyl-methanesulfonate sensitivity 19
mRNA  messenger RNA
MS  mass spectrometry
MS/MS  tandem mass spectrometry
MTHFD1  methylenetetrahydrofolate dehydrogenase 1
NCBI  National Center for Biotechnology Information
NGS  next-generation sequencing
NHGRI  National Human Genome Research Institute
OMSSA  Open Mass Spectrometry Search Algorithm
ORF  open reading frame
PacBio  Pacific Biosciences
PCR  polymerase chain reaction
PRIDE  Proteomics Identifications database
PSF  Python Software Foundation
P-site  peptidyl site
PSM  peptide to spectrum match
QC  quality control
R  arginine
RIBO-seq  ribosome profiling
RNA  ribonucleic acid
RNA-seq  RNA sequencing
RPF  ribosome protected fragment
rRNA  ribosomal RNA
SA  suffix array
SAM  Sequence Alignment/Map
SMAD4  mothers against decapentaplegic homolog 4
SAAV  single amino acid variant
SMRT  single molecule, real-time sequencing technology
sn(o)RNA  small nuclear and nucleolar RNA
SNP  Single Nucleotide Polymorphism
SOLiD  Sequencing by Oligo Ligation Detection
sORF  small open reading frame
SQL  Structured Query Language
SRA  Sequence Read Archive
SSH  Secure Shell
STAR  Spliced Transcripts Alignment to a Reference
TCGA  The Cancer Genome Atlas
TELO2  telomere maintenance 2
TIS  translation initiation site
TP53  tumour protein 53
TrEMBL  translated EMBL
tRNA  transfer RNA
UniProt  Universal Protein Resource
uORF  upstream open reading frame
3' UTR  3' untranslated region
5' UTR  5' untranslated region
VCF  Variant Call Format


Abstract

Many recent breakthroughs in genomics, proteomics and transcriptomics studies are due to the development of so-called next-generation sequencing (NGS) techniques. An important novel NGS technique is ribosome profiling, which is based on the deep sequencing of ribosome-protected fragments. By studying the mRNA fragments captured by translating ribosomes, the actual translation status in vivo can be assessed. In combination with MS/MS, ribosome profiling has become an important aid in the protein identification process because it allows a more comprehensive protein sequence database to be built. Currently, this database can be generated by using a new automated pipeline developed by researchers at the BioBix lab: the PROTEOFORMER pipeline.

This master thesis covers the improvement of the protein identification process in human cancer cells in a proteogenomics setting combining ribosome profiling and mass spectrometry data. Preceding studies have shown that the proteins in these cells are often rich in SNPs (Single Nucleotide Polymorphisms) and INDELs (INsertions and DELetions). The addition of these genetic variants to the protein database is therefore expected to have a positive effect on the protein identification rate. The inclusion of SNPs in the PROTEOFORMER pipeline has already proven successful. Similarly, an improvement of the overall protein identification process is expected when INDEL information is included. This hypothesis was evaluated by comparing the protein identification rate obtained with custom search databases with and without INDELs in the MS/MS identification process. To this end, INDELs derived from ribosome profiling data were included. This led to the validation of one protein that could not be identified with custom databases lacking this information. Although this shows that the inclusion of INDELs had only a small positive effect on the protein identification rate, a further increase is expected in the near future from adding extra INDEL information from variant databases and from using better INDEL detection methods.

Finally, the effect of including INDEL information on the protein identification rate was compared with the effect of SNPs. Although only one SNP was validated with MS/MS data, more than 100 SNP-containing proteins were identified, either on their own or within a protein group. These results show the importance of well-generated custom databases tailored to the study in question.


Short summary

Several breakthroughs in genomics, proteomics and transcriptomics studies can be attributed to the development of new sequencing techniques. An important example is ribosome profiling. This emerging technique is based on the sequencing of ribosome-protected mRNA fragments. By studying the mRNA fragments present in active ribosomes, the actual translation status in vivo can be assessed. In combination with tandem mass spectrometry, ribosome profiling can play an important role in the protein identification process within a proteogenomics study, in particular in the development of a more comprehensive protein sequence database. Currently, this database can be generated with a new, automated pipeline developed by the researchers of the BioBix lab: the PROTEOFORMER pipeline.

This master thesis covers the improvement of the protein identification process in human cancer cells. Previous studies have shown that proteins are often rich in SNPs (Single Nucleotide Polymorphisms) and INDELs (INsertions and DELetions). Adding these genetic variants is therefore expected to have a positive effect on the number of identified proteins. The success of including SNPs in the PROTEOFORMER pipeline has already been demonstrated. We expect an additional improvement of the protein identification procedure when INDELs are included as well. This hypothesis was examined by evaluating the protein identifications obtained with custom-generated databases with and without INDELs. For this purpose, INDELs derived from ribosome profiling data were used. This led to the validation of one protein that could not be identified with the generated databases lacking this information. Although this demonstrates only a small positive effect of INDELs on protein identification, a larger increase is expected in the near future from supplying extra INDEL information from databases and from using better INDEL detection methods.

Finally, the effect of INDELs on protein identification was compared with the effect of SNPs. Although only one SNP was validated with MS/MS data, more than 100 SNP-containing proteins were identified, either on their own or as part of a protein group. These results show the importance of databases tailored to the study in question.


1 Introduction and outline

Protein identification plays an important role in many life science studies. It is mostly based on mass spectrometry (MS) and relies on the availability of annotated protein sequences in public databases. The identification process is therefore limited to the detection of known proteins. Currently, researchers are trying to address this problem by creating custom protein sequence databases based on genomic and transcriptomic information. To this end, novel technologies that capture nucleotide information at the translational level are being developed. This fusion of proteomics with genomics and transcriptomics has led to a new research area, called proteogenomics.

Ribosome profiling is a relatively new technique enabling the sequencing of ribosome-protected mRNA fragments. In this way, a snapshot of the translational situation in vivo can be made. This information is already being used as the basis for the construction of custom protein sequence databases. In order to facilitate this process, an automated pipeline was designed: the PROTEOFORMER pipeline. It has already been shown that this pipeline increases the overall protein identification rate in MS-based experiments. However, the pipeline is still under continuous improvement.

In this master thesis, an attempt is made to improve the MS-based protein identification strategy for human cancer samples. Since proteins in these samples can be strongly affected by the presence of SNPs (Single Nucleotide Polymorphisms) and INDELs, the influence of introducing this information into the protein identification pipeline is studied. Previously, SNP information was added to the pipeline, resulting in an overall increase in the protein identification rate. Here, the focus is mainly on the addition of INDELs to the pipeline. In general, the inclusion of INDEL information is expected to have a positive effect. In addition, the effect of INDELs is compared with the effect of SNPs by analyzing the identification rates of custom databases with and without these variants, based on matching ribosome profiling data from human colorectal cancer cells.

In the following chapter, MS-based proteomics, proteogenomics, ribosome profiling and the PROTEOFORMER pipeline are explained in more depth. Moreover, an introduction into NGS-based technologies is given as these technologies are frequently used in proteogenomics studies. Chapter 3 focuses on the practical aspects of this master thesis by describing the used materials and methods. Afterwards, the results are given in chapter 4 and discussed in chapter 5. Chapter 6 gives a general conclusion. Finally, this master thesis ends with a glimpse into the future of the PROTEOFORMER pipeline in chapter 7.

2 Overview of relevant literature

2.1 NGS-based technologies

Since the discovery of deoxyribonucleic acid (DNA) in 1869 by Johann Friedrich Miescher, DNA has been the center of attention in many scientific studies [1]. However, it took almost 100 years before the first sequencing experiments were carried out. Sequencing covers every experiment in which the order of nucleotides in a DNA or ribonucleic acid (RNA) sequence is determined. Initially, sequencing methods, referred to as first-generation sequencing methods, were slow and confined to short sequences only [2]. Later on, new insights and improvements gave rise to high-throughput sequencing technologies: next-generation sequencing-based technologies (NGS-based technologies). Nowadays, multiple NGS-based technologies are commercially available. Illumina sequencing by synthesis, SOLiD (Sequencing by Oligo Ligation Detection) and 454 sequencing are probably the most popular techniques [3]. In general, these methods all start with the fragmentation of the sample nucleic acids into smaller pieces and the ligation of adaptor sequences needed for the amplification and sequencing steps. The amplification methods are often shared between sequencing technologies, but the actual sequencing procedures are quite different [3,4].

The oldest commercially available NGS technology is 454 sequencing. Its rationale is based on the production of light during the conversion of luciferin to oxyluciferin by the enzyme luciferase. This reaction can only occur when a complementary base is added to the growing strand: incorporation releases pyrophosphate, which is converted into adenosine triphosphate (ATP), which in turn drives the luciferase reaction. By adding one type of deoxynucleotide triphosphate (dNTP) at a time, the nucleotide sequence can be deduced. This process is also called pyrosequencing [3,5,6].
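The read-out principle of pyrosequencing can be illustrated with a minimal sketch. It assumes that the four dNTPs are flowed in a fixed, repeating order and that the light intensity of a flow is roughly proportional to the number of identical bases incorporated. The function name, the flow order and the signal values are hypothetical and chosen only for illustration.

```python
def decode_flowgram(flow_order, signals):
    """Convert per-flow light intensities into a base sequence.

    flow_order: order in which the dNTPs are added, repeated cyclically (assumed).
    signals: one light intensity per flow (made-up example values).
    """
    bases = []
    for i, signal in enumerate(signals):
        nucleotide = flow_order[i % len(flow_order)]
        # The rounded intensity approximates how many identical bases
        # were incorporated during this flow (0 means no incorporation).
        bases.append(nucleotide * int(round(signal)))
    return "".join(bases)


read = decode_flowgram("TACG", [1.02, 0.05, 2.10, 0.96, 0.03, 1.04, 0.00, 0.98])
print(read)  # TCCGAG
```

The third flow illustrates why homopolymer stretches are difficult for this chemistry: the length of the stretch has to be inferred from the signal intensity alone.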

SOLiD and Illumina sequencing are both based on the detection of fluorescent nucleotides. They differ in the number of nucleotides attached during each detection cycle. The SOLiD system relies on the incorporation of a probe containing eight bases with a fluorescent group attached to the last base. By using four different fluorescent dyes, colour detection can reveal the last base. The use of a ladder primer set makes it possible to determine the sequence after five sequencing cycles [3,5]. In contrast to SOLiD, Illumina sequencing is based on the incorporation of fluorescently labelled mononucleotides. By adding a blocker to the end of these nucleotides, the synthesis process can be carefully controlled, allowing the detection of one single nucleotide at a time [3,5].

Further improvements in sequencing technologies have led to a new generation of techniques: the third generation. Sequencing technologies belonging to this generation are characterized by the ability to sequence longer fragments than the NGS-based technologies explained above. Moreover, no amplification step is required because single molecules can be sequenced. Two popular third-generation sequencing techniques are single molecule, real-time sequencing technology (SMRT) provided by Pacific Biosciences (PacBio) [7,8] and Oxford Nanopore sequencing [9]. Despite their promising features, they are not yet routinely used in industry.

2.2 MS-based proteomics

2.2.1 Outline of MS-based proteomics experiments

Proteomics refers to the study of the total protein content of a species, also called its proteome. It involves the identification and quantification of the entire protein load present in the species of interest [10,11]. A fundamental part of this research is the use of shotgun proteomics. This methodology aims at the identification of proteins in a complex protein mixture by studying the corresponding peptides after enzymatic cleavage [10,11]. Figure 1 gives an overview of the general aspects. A common shotgun proteomics experiment starts with an enzymatic cleavage of the proteins into peptides (trypsin is used most often). Afterwards, these peptides are separated using liquid chromatography (LC). The eluted peptides are then ionized before they enter the mass spectrometer [12,13,14]. In mass spectrometry (MS), the isolation of peptides is based on their mass-to-charge ratio [11,15]. The desired proteomic data are obtained by repeating the MS step on the fragments of each selected peptide ion [13], resulting in a final tandem mass spectrometry (MS/MS) spectrum. The described combination of liquid chromatography and tandem mass spectrometry is abbreviated as LC-MS/MS [16].

Figure 1: Practical aspects of LC-MS/MS. In general, every shotgun proteomics experiment starts with the enzymatic digestion of the proteins of interest into peptides. These peptides are then separated in a liquid chromatography step and afterwards subjected to tandem mass spectrometry. Adapted from [12].

2.2.2 Database searching and MS/MS-based protein identification

Database searching of the resulting MS/MS spectra reveals the identity of the examined peptides and thus also of their corresponding proteins. This process is illustrated in figure 2. It consists of two consecutive steps, namely the peptide and the protein identification process [17,18].

Peptide identification

The identification of the peptides obtained after protein digestion is based on the correlation between experimental and theoretical MS/MS fragmentation spectra. The latter are obtained by subjecting a chosen protein sequence database to in silico MS/MS pattern prediction algorithms. In this prediction process, proteins are virtually digested into peptides using enzymatic cleavage rules. For example, when using the trypsin cleavage rule, cleavage is assumed to occur C-terminally of every lysine or arginine [19]. Next, the associated mass-to-charge ratios are calculated for all the cleavage products. By comparing these theoretical ratios with the experimental mass-to-charge ratios, a candidate peptide list is generated. This list contains all the theoretical peptides whose mass-to-charge ratios match an experimental value. Subsequently, these peptides are fragmented in silico into smaller ions. Finally, the resulting theoretical fragmentation spectra are matched with the experimental MS/MS spectra, resulting in a statistical similarity score. Only peptides associated with a high score are retained and added to a final matching peptide list [14,18].
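The first two steps of this procedure, in silico digestion and the selection of candidate peptides by mass-to-charge ratio, can be sketched as follows. This is a simplified illustration and not the algorithm of any particular search engine: it applies the plain trypsin rule stated above (no missed cleavages, no proline exception, no modifications), uses standard monoisotopic residue masses, and the function names and the 10 ppm tolerance are assumptions made for the example.

```python
# Standard monoisotopic residue masses in dalton.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # mass of H2O added to the summed residue masses
PROTON = 1.00728   # mass of the charge carrier


def tryptic_digest(protein):
    """Cleave C-terminally of every lysine (K) or arginine (R)."""
    peptides, current = [], ""
    for residue in protein:
        current += residue
        if residue in "KR":
            peptides.append(current)
            current = ""
    if current:  # C-terminal peptide that does not end in K or R
        peptides.append(current)
    return peptides


def mz(peptide, charge):
    """Theoretical mass-to-charge ratio of a peptide ion."""
    mass = sum(RESIDUE_MASS[r] for r in peptide) + WATER
    return (mass + charge * PROTON) / charge


def candidate_peptides(protein, observed_mz, charge, tol_ppm=10.0):
    """Return the tryptic peptides whose m/z lies within tol_ppm of the observed value."""
    hits = []
    for peptide in tryptic_digest(protein):
        theoretical = mz(peptide, charge)
        if abs(theoretical - observed_mz) / theoretical * 1e6 <= tol_ppm:
            hits.append(peptide)
    return hits


peptides = tryptic_digest("MKTAYIAKQR")
print(peptides)  # ['MK', 'TAYIAK', 'QR']
```

In a real search, each candidate would subsequently be fragmented in silico and its theoretical spectrum scored against the experimental MS/MS spectrum; that scoring step is not covered by this sketch.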

Figure 2: Overview of the protein identification process using a database searching approach. Proteins can be identified by comparing the experimental MS/MS spectra with theoretical spectra. The latter are obtained by subjecting a protein sequence database to an in silico MS/MS pattern prediction algorithm [14].

Protein identification

The information obtained from the peptide identification process is combined in a second step, resulting in a list of proteins possibly present in the sample of interest. This list is subjected to a statistical analysis in order to select the true protein identifications [18]. The overall procedure can be hampered by the protein inference problem, which refers to the difficulty of extending information from the peptide to the protein level. This problem can arise from large-scale sequence homology between the proteins present, because digestion of homologous proteins can give rise to identical peptides. In the absence of unique peptide sequences, discrimination between these proteins is not possible. The problem can be handled by collecting the indiscernible proteins in a so-called protein group. In this way, statistical analysis can reveal the presence of a certain group of proteins without establishing which specific proteins within that group are present [20].
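The idea of a protein group can be made concrete with a small sketch. It only covers the simplest case, in which proteins are indiscernible because exactly the same set of peptides was identified for them; actual inference tools apply more elaborate rules. The function name, protein identifiers and peptide sequences are hypothetical.

```python
from collections import defaultdict


def group_proteins(peptides_by_protein):
    """Collect proteins supported by an identical peptide set into one group."""
    groups = defaultdict(list)
    for protein, peptides in peptides_by_protein.items():
        # frozenset makes the peptide set usable as a dictionary key.
        groups[frozenset(peptides)].append(protein)
    return [sorted(members) for members in groups.values()]


identified = {
    "P1": {"AAK", "CDR"},
    "P2": {"AAK", "CDR"},   # same evidence as P1: indiscernible
    "P3": {"AAK", "EFK"},   # EFK is unique to P3
}
protein_groups = group_proteins(identified)
print(protein_groups)  # [['P1', 'P2'], ['P3']]
```

Here P1 and P2 end up in one group because no identified peptide distinguishes them, whereas P3 can be reported on its own thanks to its unique peptide.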

2.3 Proteogenomics

Proteogenomics is a relatively new term describing the fusion of proteomics and genomics [12,21,22]. The approach made its debut in the scientific research community in 2004 [12] and has been evolving ever since. Initially, proteogenomics comprised the use of proteomic information to support genome annotation [12,22]. Genome annotation follows genome sequencing and aims to delineate the genes and determine their biological function [16]. Conventionally, this endeavour was pursued by the genomic research community, whereas proteomic researchers focused on the actual protein level. Nowadays, proteomics and genomics are no longer stand-alone research fields, but make a joint effort to improve the current genome annotation [12].

2.3.1 Goals of proteogenomics

Since the beginning of proteogenomics, many studies have been dedicated to improvements in genome annotation and in the protein identification process. Therefore, these two applications will be explained in more detail.

Aid genome annotation

As proteogenomics studies were developed with a focus on genome annotation, this is still one of the most important applications. Traditionally, protein-coding genes were located using sequence similarity or ab initio gene finding predictions. These annotation processes, however, are rather theory-based and do not provide enough evidence to call genes flawlessly [16]. Validation of the predicted genes is therefore extremely important. Mass spectrometry-based experiments provide a way to confirm the existence of protein-coding genes and hypothetical proteins with translational evidence. Mass spectrometry-based proteomics data can also be used to find new protein-coding genes by using the identified peptide sequences to search genomic databases. In this way, proteogenomics studies can reduce the error rate of current and previous annotation studies, which emphasizes their importance in genome annotation and re-annotation [16,23,24].

Discover new proteoforms

Along with the progress in genome annotation, proteogenomics has also become a key player in the process of protein identification and finding new proteoforms. The term 'proteoforms' refers to all protein products derived from the same gene [25]. For example, proteins might have proteoforms containing a longer N- or C-terminus, resulting in the so-called N-terminally and C-terminally extended proteoforms. Similarly, proteoforms with shorter N- or C-termini are called N-terminally or C-terminally truncated proteoforms [26,27].

Traditionally, proteins were identified with the aid of reference protein sequence databases like Swiss-Prot [28]. As explained earlier, the proteins in these databases are subjected to in silico MS/MS so that experimentally obtained MS/MS spectra can be matched with their corresponding peptides. Unfortunately, this limits the identification procedure to peptides already present in the consulted reference database. Researchers are trying to address this problem by generating customized protein sequence databases using genomic and transcriptomic information, which enlarges the search space in the field of interest. This can be of particular importance when studying disease-associated proteins, which may contain one or more genetic variations. Likewise, completely new peptides associated with novel coding regions are more likely to be found with a customized database [12,22,29]. So far, many new proteins have been discovered with the aid of proteogenomics studies. These include alternative splice variants [30] and peptides derived from small open reading frames (sORFs) [31].

2.4 Ribosome profiling

2.4.1 An emerging technique

Next to genomics and proteomics, translatomics has been a very important research area over the last few years. Initially, information on the translational level was derived from measurements at the expression level by evaluating mRNA (messenger RNA) abundance using microarrays. Later on, mRNA sequencing became more and more popular. However, due to extensive regulation at the mRNA level and the occurrence of events like initiation at non-AUG start codons, it is impossible to determine the exact protein levels and their identity from the detected mRNA levels. With the advent of translation inhibitors, new approaches to monitor protein information at the translational level were developed [32,33].

Initially, polysome profiling [34] was used to monitor protein synthesis. A polysome is an mRNA transcript occupied by multiple ribosomes. In polysome profiling these mRNAs are studied after isolation of the polysome fraction. Unfortunately, this technique has limited resolution and cannot be used to reveal the exact ribosome positions [32,33]. These shortcomings have led to the introduction of ribosome profiling (also called RIBO-seq) by Ingolia et al. This emerging technique is based on the deep sequencing of ribosome protected fragments (RPFs). By only studying the mRNA fragments captured in the ribosomes, the actual translation status in vivo can be assessed. Moreover, the ability to halt active ribosomes enables the determination of the ribosomal positions with single-nucleotide precision. These advantages have made the technique very popular in the process of genome annotation, where it improves the identification of novel protein-coding genes [32,33,35].

2.4.2 Outline of a ribosome profiling experiment

The ribosome profiling strategy consists of a few main steps. These steps are presented in figure 3 and are explained below.

Figure 3: Overview of the major steps in ribosome profiling. Ribosome profiling begins with the release of RNA from cells treated with translation inhibitors. After a subsequent RNA digest, which degrades unbound RNA, the resulting RNA-ribosome-protected fragment (RPF) complexes are isolated with a centrifugation technique. The RPFs are then separated from the ribosomes and purified using gel electrophoresis. In a following step, the isolated RPFs are linked with adaptors, amplified and sequenced. An intermediate rRNA depletion step makes sure that the remaining rRNA molecules are degraded. Finally, the sequenced RPFs are mapped to a reference sequence [36].

Every RIBO-seq experiment starts with lysing the cells of interest. In order to take a representative snapshot of the translational situation, halting the active ribosomes before the actual lysis is necessary. This pausing step is introduced to avoid alterations of the ribosomal positions during and after lysis as a result of the rapid translational progress. Different approaches based on translation inhibitors including cycloheximide (CHX), lactimidomycin (LTM) and harringtonine (HARR), are available [35,36,37,38]. Their purposes will be explained in section 2.4.3.

After cell lysis, nuclease digestion ensures the degradation of RNA that is not protected by ribosomes. The resulting RNA-ribosome complexes contain RPFs of circa 28 nucleotides, also termed footprints. A subsequent separation step based on sucrose density fractionation isolates these footprint-bound ribosomes from the remaining cell debris. In the next step, the footprints are recovered from the ribosomal complexes and purified by gel electrophoresis. Deep sequencing of these footprints is possible after the construction of a derived DNA library. This includes the attachment of linkers that provide the priming site needed for reverse transcription, followed by circularization of the products [37,38]. Next, the contamination with ribosomal RNA (rRNA) has to be taken into consideration. Therefore, rRNA is optionally depleted by hybridization with rRNA probes, which are subsequently removed owing to their affinity for streptavidin [38]. Following a polymerase chain reaction (PCR), the resulting fragments are ready for deep sequencing. Finally, the sequenced fragments can be analyzed by aligning the obtained reads to a reference sequence [35,37,38].

2.4.3 Applications of ribosome profiling

Although ribosome profiling is a relatively young technique, it has already been used in a number of different studies, demonstrating the large range of applications of this technology. Generally, two different approaches can be taken. The first one relies on the use of elongating ribosomes, whereas the second one makes use of initiating ribosomes [35,38].

Applications of elongating ribosomes

Elongating ribosomes can be halted by flash-freezing [38,39] or by using antibiotics that act on translocation. A well-known and widely used example is cycloheximide (CHX). This compound blocks translation elongation in eukaryotic cells through the obstruction of the exit site of the 60S ribosomal subunit [36,40,41]. The mRNA sequences held in these halted ribosomes provide a massive amount of information, leading to a better (re-)annotation of the genome [38].

Identification of small open reading frames

Ribosome profiling can reveal the existence of small open reading frames (sORFs). These are defined as open reading frames (ORFs) of at most 300 bases. Due to their small size, sORFs are often excluded by automated gene annotation algorithms. Likewise, MS-based proteomics and RNA-seq (RNA sequencing) based transcriptomics experiments are not effective in revealing new sORFs. The discovery of new sORFs therefore only became feasible with the introduction of ribosome profiling. The small peptides derived from sORFs are called micropeptides and can have important functions [42]. Transporter proteins, transcription factors and hormones are just a few examples of the enormously diverse functionality of these micropeptides [43].
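The size criterion for sORFs can be illustrated with a naive sequence scan. The sketch below is only meant to show the 300-base threshold in action and is not how ribosome profiling detects sORFs, which relies on footprint coverage rather than on sequence alone. It assumes that only AUG start codons are considered, that the stop codon is counted in the ORF length, and that only the given strand is scanned; the function name is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_sorfs(transcript, max_length=300):
    """Return (start, end, sequence) for every AUG-initiated ORF of at most max_length bases.

    Coordinates are 0-based and end-exclusive; the stop codon is included.
    """
    seq = transcript.upper().replace("U", "T")
    sorfs = []
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "ATG":
            continue
        # Walk codon by codon until the first in-frame stop codon.
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                end = pos + 3
                if end - start <= max_length:
                    sorfs.append((start, end, seq[start:end]))
                break
    return sorfs


hits = find_sorfs("GGAUGAAAUAGCC")
print(hits)  # [(2, 11, 'ATGAAATAG')]
```

A purely sequence-based scan like this one reports every short ORF, translated or not, which illustrates why experimental evidence of translation such as RPF coverage is needed to separate genuine sORFs from chance occurrences.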

Studying protein translation on a quantitative level

Ribosome profiling also provides the possibility to study the expression of different genes on a quantitative scale. The DNA information of each gene is transcribed into mRNA, which is then translated by ribosomes. Every mRNA transcript bound to a ribosomal complex is considered as being translated. In other words, every RPF that is picked up during the experiment represents a translating ribosome. Therefore, the number of RPFs reflects the number of translating ribosomes for the corresponding transcript and is proportional to the level of protein that is produced. By taking into account both the number of RPFs and the time required to finish a protein, the production rate of the proteins can be determined. Hence, RPF density can be used to quantify gene expression and measure the speed of protein synthesis [35,38]. In addition, by comparing the measured translation levels with the mRNA abundance determined at the transcriptional level with RNA-seq, new insights into translational regulation can be gained [33,38].
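A common way to express this comparison numerically is to normalize read counts for transcript length and sequencing depth and to divide the RPF density by the mRNA density. The sketch below assumes an RPKM-style normalization and uses the ratio often referred to as translation efficiency; both choices, the function names and the read counts are assumptions for illustration and are not taken from the studies cited above.

```python
def rpkm(read_count, length_nt, total_mapped_reads):
    """Reads per kilobase of sequence per million mapped reads."""
    return read_count * 1e9 / (length_nt * total_mapped_reads)


def translation_efficiency(rpf_count, rna_count, cds_length, total_rpf_reads, total_rna_reads):
    """Ratio of normalized RPF density to normalized mRNA density for one coding sequence."""
    rpf_density = rpkm(rpf_count, cds_length, total_rpf_reads)
    rna_density = rpkm(rna_count, cds_length, total_rna_reads)
    return rpf_density / rna_density


# Made-up example: 500 RPF reads and 250 RNA-seq reads on a 1000-nt coding
# sequence, with 10 million mapped reads in each library.
te = translation_efficiency(500, 250, 1000, 10_000_000, 10_000_000)
print(te)  # 2.0
```

A ratio above one would suggest that the transcript carries more ribosomes than its abundance alone predicts, which is the kind of signal that points to translational regulation.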

Discovering genetic variation Ribosome profiling can be of particular importance in disease-focused studies due to the inclusion of genetic variation information within ORFs. Single Nucleotide Polymorphisms (SNPs) are the most common variants in the human population. SNPs represent single base positions in the genome that differ between individuals. Some people will, for example, carry a cytosine at a specific genomic location while others have an adenine at this position [44].

Next to SNPs, insertions (addition of one or more nucleotides) and deletions (removal of one or more nucleotides) can occur. Many diseases have already been linked to specific SNPs and INDELs (INsertions and DELetions). Well-known examples are sickle-cell anaemia (caused by a SNP [45]), cystic fibrosis (caused by a deletion [46]) and Tay-Sachs disease (caused by an insertion [47]).

Genetic variation also plays a key role in cancer research. Due to the presence of extensive protection mechanisms against cancer, multiple genes have to undergo mutagenesis to initiate the disease [48,49]. Moreover, the alterations that lead to oncogenesis have to occur in specific classes of genes: oncogenes, tumour-suppressor genes and stability genes. Oncogenes and tumour-suppressor genes are related to cell-cycle control. By stimulating cell division (due to activated oncogenes) and impairing cell-cycle control (due to inactivated tumour-suppressor genes), mutations in these genes can cause a rapid increase in the number of cells. Stability genes, on the other hand, play an important role in controlling the presence of mutations by trying to repair the genetic alterations. Inactivation of these genes suppresses this control mechanism, resulting in a higher mutation rate [49]. Both SNPs and INDELs are often present in cancer-related genes [48,50]. These variations can be detected with NGS-based technologies, including ribosome profiling, making these very useful techniques in cancer studies [51,52].

Applications of initiating ribosomes Ribosomes can also be stalled solely during the initiation stage. Halting initiating ribosomes can be achieved by using harringtonine (HARR) or lactimidomycin (LTM). LTM inhibits translation initiation by binding to the exit site of the 80S complex whereas HARR halts the ribosomes by binding to the acceptor site of the free 60S ribosome subunit [38,40,53,54]. Both compounds are useful in the identification of translation start sites. However, LTM can be preferred over HARR because treatment with the latter results in a substantial number of mRNA fragments mapping after the translation start site, which lowers the accuracy of the experiment [40].

Detection of alternative translation initiation sites and their resulting proteoforms The translational process in eukaryotes is a well-known phenomenon. Both the initiation and the elongation process have been thoroughly studied over the years. Generally speaking, translation initiation is triggered when the scanning ribosomal complex encounters an AUG codon. In the case of efficient recognition, the first AUG will serve as a translation initiation site [40]. However, inefficient recognition, due to factors like the absence of necessary secondary structures in the mRNA near the initiation site, may lead to leaky scanning. In this way, more efficient translation initiation sites (TISs) are searched for in the region downstream of the initial TIS [40,55].

More recently, it was discovered that translation initiation also occurs at non-AUG codons differing by one nucleotide from the cognate start codon. These near-cognate start sites are denoted as alternative translation initiation sites or aTIS. Because in silico sequence analysis fails to find these aTIS, it was not possible to predict new aTIS accurately. However, with the introduction of ribosome profiling as a way to define ribosomal positions, this problem was solved. Using specific translation initiation inhibitors, translation initiation sites can be determined with single nucleotide resolution. In this way both TIS and aTIS are accurately predicted [40].
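The definition of a near-cognate start codon given above (one nucleotide difference with AUG) can be made concrete with a short enumeration. The Python sketch below is only an illustration of that definition; the function name is hypothetical.

```python
def near_cognate_codons(cognate="AUG", alphabet="ACGU"):
    """Return all codons differing from the cognate codon at exactly one position."""
    codons = []
    for position, original in enumerate(cognate):
        for base in alphabet:
            if base != original:
                codons.append(cognate[:position] + base + cognate[position + 1:])
    return codons


# Three positions with three alternative bases each give nine codons.
NEAR_COGNATE = near_cognate_codons()
```

With the default arguments this yields CUG, GUG, UUG, AAG, ACG, AGG, AUA, AUC and AUU.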

A significant fraction of these newly annotated TIS represents near-cognate start sites lying in the regions upstream of the annotated coding sequence. These seem particularly important in the generation of upstream open reading frames (uORFs): short translated sequences in the 5' untranslated region (5' UTR) [32,39,56]. Moreover, some of these ORFs might have an influence on the regulation of downstream ORFs by producing small peptides called peptoswitches [57].

Alternative TIS are also responsible for the production of N-terminally extended and N-terminally truncated proteoforms. The latter are produced by using downstream TIS. Both extensions and truncations result in variants having different functions or carrying other localisation signals [35,40].

2.5 The PROTEOFORMER pipeline Recently, a new tool for creating customized databases has been developed by researchers of BioBix in Ghent. This tool was introduced under the name of PROTEOFORMER in 2014 and is publicly available as a stand-alone version and as a Galaxy implementation. The PROTEOFORMER tool consists of an automated pipeline of seven steps starting from ribosome profiling data and ending in a protein sequence search database necessary for the identification of proteins by MS/MS matching. By using RIBO-seq data rather than mRNA-seq data, the PROTEOFORMER pipeline overcomes the problems of searching a big database and provides evidence at the translational level. Currently, RIBO-seq data derived from Homo sapiens, Mus musculus, Drosophila melanogaster and Arabidopsis thaliana can be analysed. The analysis of other species will be made available in the future [52].

2.5.1 Overview of the pipeline The PROTEOFORMER pipeline includes seven major steps: quality control, mapping, transcript calling, TIS calling, variation calling, translation assembly and translation database construction. These steps are depicted in figure 4. Besides these major steps, figure 4 also contains the prior step of next-generation sequencing, needed to gain the input reads, and the final peptide and protein identification step. A detailed description of these steps is given below [52,58].

Figure 4: Overview of the PROTEOFORMER pipeline. The PROTEOFORMER pipeline starts from RIBO-seq reads generated with NGS-based technologies. These reads are analyzed in seven different steps: quality control, mapping of the reads to the reference genome, transcript calling, TIS calling, variation calling, translation assembly and translation database construction. In a final step, the custom-created database can be used in an MS-based peptide and protein identification process [52].

Next generation sequencing The PROTEOFORMER pipeline uses RIBO-seq data as input. This information is gathered by sequencing the ribosome protected footprints using the Illumina approach. As previously explained, Illumina sequencing is based on fluorophores and terminating groups attached to the four different nucleotides. By matching each nucleotide with a specific fluorescent emission signal, the incorporated nucleotide can be determined. The resulting reads derived from the measured fluorescence and their estimated quality are collected in a FASTQ file. The inclusion of a TIS calling algorithm in the pipeline requires ribosome profiling information on initiating ribosomes. Hence, in addition to the RIBO-seq data of CHX-treated samples, information concerning HARR- or LTM-treated samples is necessary. Consequently, two FASTQ files are taken as input: one containing reads derived from elongating ribosomes and one representing information about initiating ribosomes [52].

Quality control The subsequent quality control of the collected FASTQ files forms a preliminary step to the PROTEOFORMER pipeline. A suitable tool for monitoring FASTQ files and determining the sequence quality is the FastQC quality control application developed by the Babraham Institute. This application creates a quality control (QC) report that is very valuable for the detection of irregularities or unexpected deviations from the predicted outcome of the raw sequencing data [59]. Moreover, FastQC can also be applied to the aligned reads after mapping. In this way, the effects of the preprocessing steps on the raw data can be verified. These include adaptor clipping, sequence trimming and filtering of unwanted RNA.

Aside from the QC report, two other quality determinants are taken into account. The first one is known as the metagenic functional classification wherein the species-specific annotation from Ensembl is used. RPFs in accordance with protein-coding transcripts are classified in 5' UTR (5' untranslated region), exonic, intronic and 3' UTR (3' untranslated region) annotation classes whereas RPFs corresponding with non-protein-coding transcripts are classified as 'other biotypes'. The rest of the RPFs are considered intergenic [52]. The second quality determinant is the gene distribution which visualizes the overall dynamic range of the translation counts [52].

Mapping As a first step in the pipeline, the raw RPFs should be aligned against a reference genome. Two different mapping approaches can be used by the PROTEOFORMER pipeline [52]: TopHat [60] and STAR (Spliced Transcripts Alignment to a Reference) [61]. Both these splice-aware aligners start with constructing a genome index. The mapping against the indexed genome of the species of interest is possible after filtering out the reads aligned to the PhiX bacteriophage genome (resulting from the spiked-in sequences) and the reads aligned to the species-specific rRNA (resulting from rRNA contamination specific to the ribosome profiling protocol), tRNA (transfer RNA) and sn(o)RNA (both small nuclear and nucleolar RNA) [52]. For all these filtering steps, the RPFs are mapped non-uniquely, meaning they are allowed to map to more than one location. In contrast, the mapping against the reference genome of the species of interest can be carried out either uniquely, wherein the RPFs map to at most one location, or non-uniquely.

After clipping the 3' end adaptor sequence, the reads are mapped to the different created genome indices. The number of reads mapped to each of these indices will be calculated and redirected to an SQLite table. The alignment information is also given in a BedGraph file which enables the visualization of the results in a genome browser [52].

Transcript calling In addition to the mapping procedure, translational evidence on the transcripts is collected by calculating the number of footprints corresponding to every genomic location of the coding sequence. Thereby the footprint coverage of every exon is determined. Transcripts having a uniform exon coverage exceeding a certain threshold are marked as truly translated [52]. In this way, Ensembl transcripts matching the experimental RIBO-seq data are identified.
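The coverage criterion described above can be illustrated with a minimal Python sketch. It assumes that coverage is expressed as footprints per nucleotide and that "uniform" means that every exon of the transcript has to exceed the threshold; both are assumptions made for illustration, and the function names, coordinates and threshold values are hypothetical rather than those used by the pipeline.

```python
def exon_coverage(footprint_positions, exon_start, exon_end):
    """Footprints per nucleotide for one exon (inclusive coordinates)."""
    hits = sum(1 for pos in footprint_positions if exon_start <= pos <= exon_end)
    return hits / float(exon_end - exon_start + 1)


def is_translated(footprint_positions, exons, min_coverage):
    """True when every exon exceeds the coverage threshold (assumed criterion)."""
    if not exons:
        return False
    return all(exon_coverage(footprint_positions, start, end) > min_coverage
               for start, end in exons)


# Hypothetical footprint positions and a two-exon transcript.
positions = [100, 101, 105, 210, 250]
exons = [(100, 109), (200, 259)]
called_lenient = is_translated(positions, exons, min_coverage=0.03)
called_strict = is_translated(positions, exons, min_coverage=0.1)
```

In this example the first exon has a coverage of 0.3 and the second of about 0.033, so the transcript passes the lenient threshold but not the strict one.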

SNP calling Using the aligned reads of the previous step, SNPs can be uncovered with the SAMtools suite. SAMtools is a software package developed for the analysis of alignments in Sequence Alignment/Map (SAM) and Binary Alignment/Map (BAM) formats [62]. Finding SNPs with the mpileup module of SAMtools starts from the alignment of the reads on a reference genome. Since mpileup requires a BAM file as input, alignments stored in SAM format first need to be converted to BAM [64]. The mpileup command uses the alignment to generate a Binary Variant Call Format (BCF) file containing genomic position information together with the likelihoods of the potential genotypes [63,64]. The calling itself is done with BCFtools, which applies a Bayesian model to the information in the BCF file. BCFtools can also convert BCF files to Variant Call Format (VCF) files [63,65].

TIS calling As mentioned before, the inclusion of antibiotics in the ribosome profiling protocol is a key element in determining translation initiation sites. Both LTM and HARR can be used to stall initiating ribosomes whereas CHX acts on elongating ribosomes. The obtained LTM/HARR-treated and CHX-treated reads are analysed in parallel using a peak calling approach [40]. To this end, LTM- or HARR-associated RPFs and CHX-associated RPFs are mapped to transcript sequences contained within the Ensembl annotation bundle [52]. During this procedure, reads of initiating ribosomes accumulate at the translation initiation sites, resulting in the so-called TIS peaks [40]. Only a subset of these peak positions corresponds to actual TIS. These TIS can be detected using user-defined parameters concerning the proximity of predicted TIS, their minimal associated number of reads and a value representing the relative abundance of LTM-treated reads over CHX-treated reads at a single nucleotide position [52].
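The three user-defined parameters mentioned above can be illustrated with a simplified peak-calling sketch in Python. This is not the PROTEOFORMER implementation: the way the relative abundance is computed here (difference between the fraction of LTM reads and the fraction of CHX reads at a position), the parameter names and the default values are all assumptions chosen for illustration.

```python
def call_tis(ltm_counts, chx_counts, window=3, min_count=5, min_enrichment=0.05):
    """Call candidate TIS positions on one transcript.

    ltm_counts, chx_counts: dicts mapping transcript position -> read count.
    window: a peak must be the local maximum within +/- window nucleotides.
    min_count: minimal number of LTM reads at the peak position.
    min_enrichment: minimal difference between the LTM and CHX read
        fractions at the peak position (assumed normalization).
    """
    ltm_total = float(sum(ltm_counts.values()))
    chx_total = float(sum(chx_counts.values()))
    if ltm_total == 0:
        return []
    called = []
    for pos in sorted(ltm_counts):
        count = ltm_counts[pos]
        if count < min_count:
            continue
        # Reject positions that are overshadowed by a higher nearby peak.
        neighbours = [ltm_counts.get(p, 0)
                      for p in range(pos - window, pos + window + 1) if p != pos]
        if any(n > count for n in neighbours):
            continue
        ltm_fraction = count / ltm_total
        chx_fraction = chx_counts.get(pos, 0) / chx_total if chx_total else 0.0
        if ltm_fraction - chx_fraction >= min_enrichment:
            called.append(pos)
    return called


# Hypothetical read counts per transcript position.
ltm = {10: 40, 11: 8, 50: 6, 90: 30}
chx = {10: 5, 30: 20, 50: 30, 90: 45}
tis_positions = call_tis(ltm, chx)
```

In this toy example only position 10 is called: position 11 is overshadowed by its neighbour, while positions 50 and 90 carry relatively more CHX than LTM reads.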

Translation assembly The objective of the previous steps is the collection of the information derived from ribosome profiling data. With the translation of RIBO-seq data into amino acid sequences in mind, both TIS and SNP information as well as knowledge concerning transcript isoforms is gathered. In addition to the information on proteoforms described above, the chromosome reference sequence and the Ensembl annotation bundle are essential. Scanning these chromosome files using a binary reading approach allows the exon sequences to be acquired quickly and easily. These exon sequences are used to derive the desired protein sequences [52].
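How exon sequences, a TIS position and SNP information can be combined into a protein sequence is sketched below in Python. The sketch assumes sense-strand exon sequences in transcript order, 0-based transcript coordinates and the standard genetic code; the function name and the example sequences are hypothetical, and special cases such as near-cognate start codons or INDELs are not handled.

```python
from itertools import product

# Standard genetic code, codons ordered by base (T, C, A, G) at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate_from_tis(exon_sequences, tis_index, snps=None):
    """Translate a spliced transcript from a TIS up to the first stop codon.

    exon_sequences: exon sequences in 5'->3' order on the sense strand.
    tis_index: 0-based index of the first start codon base in the transcript.
    snps: optional dict {0-based transcript index: alternative base}.
    """
    transcript = list("".join(exon_sequences).upper())
    for index, alt in (snps or {}).items():
        transcript[index] = alt.upper()
    protein = []
    for i in range(tis_index, len(transcript) - 2, 3):
        amino_acid = CODON_TABLE.get("".join(transcript[i:i + 3]), "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical two-exon transcript; the SNP changes codon GAA into AAA.
reference_protein = translate_from_tis(["GGATGGC", "CGAATAA"], tis_index=2)
variant_protein = translate_from_tis(["GGATGGC", "CGAATAA"], tis_index=2,
                                     snps={8: "A"})
```

Here the reference transcript translates to MAE, whereas the variant carrying the SNP translates to MAK.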

Translation database In order to use the assembled protein sequences in MS-based protein identification experiments, these sequences need to be stored in a database. The PROTEOFORMER pipeline generates a custom database in FASTA format. An important feature is the non-redundancy of this database [52].
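The non-redundancy of such a FASTA database can be illustrated with a small Python sketch in which identical protein sequences are stored only once. Merging the identifiers of duplicate sequences with a "|" separator is an assumption made for illustration and does not necessarily match the header format produced by the pipeline; the function name and the example entries are hypothetical.

```python
from collections import OrderedDict


def build_nonredundant_fasta(entries, line_width=60):
    """Return FASTA text in which every distinct sequence occurs once.

    entries: iterable of (identifier, protein_sequence) tuples.
    Identifiers of identical sequences are merged into one header.
    """
    merged = OrderedDict()
    for identifier, sequence in entries:
        merged.setdefault(sequence, []).append(identifier)
    lines = []
    for sequence, identifiers in merged.items():
        lines.append(">" + "|".join(identifiers))
        for i in range(0, len(sequence), line_width):
            lines.append(sequence[i:i + line_width])
    return "".join(line + "\n" for line in lines)


# Hypothetical entries: tr1 and tr3 encode the same protein.
fasta_text = build_nonredundant_fasta([("tr1", "MAE"), ("tr2", "MKK"), ("tr3", "MAE")])
```

Keeping each sequence only once limits the size of the search space, while the merged header keeps track of all transcripts that give rise to the same protein.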

MS-based proteomics/peptidomics The final protein sequence database created with the PROTEOFORMER pipeline can be used in the MS-based peptide and protein identification process. To this end, database search engines are used to match the in silico generated MS/MS spectra, derived from peptides within the database, with the experimentally obtained MS/MS spectra. This process is explained in section 3.2.7 in more depth [52].

2.5.2 Achievements The PROTEOFORMER pipeline was developed with a view to creating an optimal custom protein database with the help of ribosome profiling information. The transition from publicly available protein databases like Swiss-Prot to custom databases leads to a rise in the number of protein identifications. Moreover, the inclusion of RIBO-seq data overcomes the problem of an increasing database size by including only the translation products of one single reading frame (as opposed to conversion of mRNA sequencing that generally performs a six-frame translation) and focuses on the actual translation in vivo [52].

The benefits of using the resulting custom database in an MS/MS proteogenomics experiment were demonstrated by comparing the generated custom databases with a peptide database derived from Swiss-Prot [28]. This strategy was applied to RIBO-seq data derived from both mouse embryonic stem cells (mESC) and the human colorectal tumour 116 cell line (HCT116). The results of this validation experiment are shown in figure 5. The first two pie charts in this figure visualize the increase in protein identifications due to the implementation of RIBO-seq data. Both cell lines contain proteins that were not part of the Swiss-Prot database and could therefore not be detected before. Next to the identification of new proteins, the protein score of a significant part of the proteins was increased. The last two pie charts represent the classification of the N-terminal proteoforms (detected with the N-terminal COmbined FRactional DIagonal Chromatography or COFRADIC method [66]). Proteoforms containing TIS annotated in the Swiss-Prot database (database annotated TIS or dbTIS) are the most abundant group, followed by proteoforms starting downstream of the annotated TIS (downstream TIS or dTIS). Additionally, uORFs and N-terminally extended proteoforms are detected but in smaller numbers. Moreover, both near-cognate and cognate start sites were found within the newly detected proteoforms [52].

Figure 5: Pie charts representing the rise in identification rate for a) mouse and b) human and the identification of new N-terminal proteoforms for c) mouse and d) human by using the PROTEOFORMER pipeline [52].

3 Material and methods

Currently, the PROTEOFORMER pipeline has already proven its usefulness by increasing the overall protein identification rate and enabling the detection of alternative proteoforms like N-terminally extended proteoforms [52]. However, in order to be useful in cancer studies, the inclusion of structural variant information is very important. SNPs are already taken into account, but next to SNPs, insertions and deletions play a very important role in the development of cancer. Therefore, it is necessary to include these variants in the pipeline as well.

By taking into account both SNPs and INDELs, a lot of the genetic variation in cancer cells is covered [48]. In all probability, this will improve the identification rate of proteins present in cancer cells in comparison with databases lacking this information. This hypothesis is tested by studying the protein identifications with and without the inclusion of INDELs using matching ribosome profiling and mass spectrometry data from a human colorectal cancer dataset. Additionally, the effect of INDELs is compared with the effect of SNPs.

3.1 Hardware Both the existing scripts, needed to run the PROTEOFORMER pipeline, and the new scripts written during this master thesis were executed on the Linux servers of BioBix. Two different servers were used: one having 32 processors and 157 GB RAM and one having 64 processors and 315 GB RAM. The analysis of the peptide identification results was done on a laptop with 8 GB RAM, four 2.20 GHz Intel® Core™ i5-5200U processors and the Windows 8.1 operating system.

3.2 Software

3.2.1 SSH clients The connection with the servers was made using the Secure Shell (SSH) protocol, ensuring secure network services [67]. Within the context of the client-server model, two different SSH clients were used: PuTTY (version 0.67) and WinSCP (version 5.9). WinSCP was needed for secure file transfer between the local and the remote computer, while PuTTY was used to operate the servers remotely.

3.2.2 Python Python is a high-level, object-oriented language created by Guido van Rossum. Van Rossum designed this language while working at the Center for Mathematics and Computer Science (CWI) in Amsterdam and made it publicly available in 1991. Today, Python is owned by a non-profit organization, called the Python Software Foundation (PSF), and is freely available at the official Python webpage: www.python.org. The programming language is distributed with a large library of modules. Moreover, Python is portable and applicable to a large domain of problems. Thanks to its simple syntax and its interactive interpreter, this programming language is also easy to learn and probably the most convenient language to introduce non-programmers to computer programming [68].

Initially, the majority of the scripts of the PROTEOFORMER pipeline were written in Perl (version 5.22.2). However, due to Python's rising popularity, this will change over time. In this master thesis, Python version 2.7.11 was used.

3.2.3 SQLite During the analysis of the ribosome profiling data by the PROTEOFORMER pipeline, the results were stored in a relational database. In general, a relational database consists of tables containing data and connections between the records of these tables. The data was stored and queried using SQLite (version 2.8.17). SQLite is a freely available, very reliable Structured Query Language (SQL) database engine [69]. It can be used in combination with any available operating system [70] and enables manual execution of SQL queries through the sqlite3 command-line shell [71].

3.2.4 Python DB-API The connection with the desired SQLite database through Python was made using the sqlite3 module (version 3.11.0). This module provides an SQL interface compliant with the consistent database application programming interface (API) used in Python: DB-API 2.0 [72,73].

Initially, each new Python database module came with its own interface, different from that of other existing modules. Due to these incompatibilities, users were compelled to rewrite working code when changing the database product. This problem was the starting point for the creation of a consistent Python database interface. Retrieving and inserting data with the sqlite3 module begins with the creation of an object representing the connection with the database. Next, a cursor object needs to be created, on which an SQL command can be executed. Finally, the results can be collected by using one of the fetch methods [72,74].
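The connection, cursor, execute and fetch sequence described above looks as follows with the sqlite3 module. The sketch uses an in-memory database; the table name, column names and values are hypothetical and do not correspond to the tables created by the PROTEOFORMER pipeline.

```python
import sqlite3

# 1. Create an object representing the connection with the database.
connection = sqlite3.connect(":memory:")

# 2. Create a cursor and use it to execute SQL commands.
cursor = connection.cursor()
cursor.execute(
    "CREATE TABLE mapped_reads "
    "(chr TEXT, strand TEXT, position INTEGER, read_count INTEGER)"
)
rows = [("1", "+", 1000, 12), ("1", "-", 2040, 3), ("X", "+", 500, 7)]
cursor.executemany("INSERT INTO mapped_reads VALUES (?, ?, ?, ?)", rows)
connection.commit()

# 3. Collect the results with one of the fetch methods.
cursor.execute(
    "SELECT chr, SUM(read_count) FROM mapped_reads GROUP BY chr ORDER BY chr"
)
totals = cursor.fetchall()

connection.close()
```

The question marks are DB-API parameter placeholders; passing values this way, instead of formatting them into the SQL string, leaves quoting to the database module.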

3.2.5 STAR Spliced Transcripts Alignment to a Reference (STAR) is a free alignment tool written in C++. It provides fast mapping of RNA-seq data to the reference genome in three consecutive steps: genome indexing, seed search and seed clustering. Genome indexing is done by means of uncompressed suffix arrays (SAs). A binary string search in these SAs allows the detection of seeds. These seeds are clustered together with the aid of anchor sequences. Simultaneously, a local alignment scoring scheme is used to find the alignments having the best scores [61].

STAR (version 2.5.2a) was used to map all the RIBO-seq derived reads to the reference PhiX bacteriophage genome and the human reference rRNA, sn(o)RNA, tRNA and genome sequence.

3.2.6 Variant caller The PROTEOFORMER pipeline enables SNP detection using the SAMtools suite. Next to SNPs, INDELs needed to be identified. Today, multiple INDEL detection tools are available. They are often based on different principles while the general workflow is the same. Hence, different categories of INDEL calling tools can be made [65].

A first important class of INDEL calling tools encloses the haplotype-based tools. Their algorithms begin with selecting those sites in the genome that are more likely to show variations. These variant sites are used to create a list of possible haplotypes. After realigning these haplotypes to the original reads and applying Bayesian statistics, the variants can be called [65,75,76]. Popular haplotype-based tools are GATK (Genome Analysis Toolkit) HaplotypeCaller [77] and Platypus [76].

A second important class of variant calling tools groups the alignment-based techniques. These methods start with the alignment of the reads against a reference genome. This alignment is used to collect all the possible variants that are subsequently subjected to a filtering step in which the correct SNPs and INDELs are retained [65]. Both SAMtools [62] and GATK UnifiedGenotyper [78] are well known variant callers belonging to this class.

Because the PROTEOFORMER pipeline uses SAMtools mpileup (version 1.3) to find SNPs in experimental data and BCFtools (version 1.3) to call the significant variants, these tools were also used to find INDELs.
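Once variants have been called, SNPs and INDELs can be told apart by comparing the lengths of the reference and alternative alleles in the resulting VCF records. The Python sketch below illustrates this on tab-separated VCF data lines. It is a simplified illustration rather than the script used in this work: the function names and example records are hypothetical, and symbolic or missing alleles are not handled.

```python
def classify_variant(ref, alt):
    """Classify a variant from the lengths of its REF and ALT alleles."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(alt) > len(ref):
        return "insertion"
    if len(alt) < len(ref):
        return "deletion"
    return "other"


def parse_vcf_lines(lines):
    """Return (chrom, pos, ref, alt, type) tuples for all VCF data lines.

    Header lines (starting with '#') are skipped and records with
    several comma-separated ALT alleles yield one tuple per allele.
    """
    variants = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref = fields[0], int(fields[1]), fields[3]
        for alt in fields[4].split(","):
            variants.append((chrom, pos, ref, alt, classify_variant(ref, alt)))
    return variants


# Hypothetical VCF content.
example_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t1000\t.\tC\tA\t50\t.\t.",
    "1\t2000\t.\tAT\tA\t60\t.\tINDEL",
    "2\t3000\t.\tG\tGCC,T\t70\t.\tINDEL",
]
variants = parse_vcf_lines(example_lines)
```

The third record shows why ALT alleles are classified one by one: the same position carries both an insertion and a SNP.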

3.2.7 SearchGUI and PeptideShaker A classic proteomics study ends with the identification of the proteins picked up throughout the experiment. Figure 6 gives an overview of the main steps carried out during this process.

Figure 6: Protein identification using SearchGUI and PeptideShaker. Experimental MS/MS spectra are identified using a database searching approach. Hereby, SearchGUI provides different search engines which can analyze the MS/MS spectra simultaneously. The resulting peptide to spectrum matches (PSMs) are evaluated with PeptideShaker. Finally, the obtained results can be submitted to PRIDE or subjected to further analysis [79].

The data-analysis starts from the obtained experimental spectra and a protein sequence database present in the standard FASTA format. Both data available from PRIDE (more information on PRIDE is given in section 3.3.4) as well as in-house spectral data can be taken as a starting point. Identification of the experimental spectra can be done using a database searching approach by applying freely available identification software algorithms, also called search engines. Nowadays, various search engines are at hand, each having their own advantages and disadvantages. In order to achieve the best outcome possible, multiple search engines are often combined. However, due to conflicts between input parameters and visual interfaces, this can be very troublesome [79,80].

SearchGUI (version 3.2.14) is a user-friendly interface providing easy configuration of eight different search engines (including X!Tandem [81] and the Open Mass Spectrometry Search Algorithm or OMSSA [82]) and de novo sequencing algorithms, avoiding these conflicts [80,83]. The resulting peptide to spectrum matches (PSMs) can be analyzed by another software application: PeptideShaker (version 1.16.5). PeptideShaker provides simple evaluation of the collective output gathered by the selected search engines and facilitates the re-analysis of public data. The final results can be easily submitted to PRIDE or subjected to further analysis in case of unidentified spectra [79,84].

3.2.8 Galaxy Galaxy is an open source, web-enabled platform designed for the intensive data analysis present in everyday bioscience research. It allows researchers to create or reuse complex computational workflows from a user-friendly interface without the need to program. The platform ensures reproducibility of the experiments by storing the required metadata. Furthermore, the possibility to add annotations to every step of a workflow and the simplicity of communicating experimental results are making this platform accessible to every researcher interested in genomic data analysis [85].

3.3 Public databases

3.3.1 Ensembl annotation bundle The Ensembl annotation bundle is part of the Ensembl project founded in 1999 [86]. It provides high quality genome annotation for a wide variety of species. The Ensembl annotation process starts from a genome assembly present in the International Nucleotide Sequence Database Collaboration (INSDC). It involves ab initio gene prediction tools and sequence homology based on protein, cDNA (complementary DNA), EST (expressed sequence tag) and RNA-seq data [87,88].

The human genome assembly is currently managed by a worldwide association responsible for improvements of the human, mouse, chicken and zebrafish reference genome: the Genome Reference Consortium (GRC) [89,90]. Hence, the human genome assembly is abbreviated as GRCh followed by a release number.

Here, the research is based on assembly GRCh38 and Ensembl annotation bundle version 82. Only the tables needed for the construction of the Ensembl transcripts were used. These include the exon, the exon_transcript, the gene, the seq_region, the coord_system, the transcript and the translation table. Their contents and underlying relationships are shown in figure 7.

Figure 7: Overview of the most important Ensembl tables needed in the PROTEOFORMER pipeline. Transcript information is stored in the transcript table. It provides knowledge about the start and end positions of the transcripts and the strand (sense or antisense). It is linked with the translation table which contains information on the translation start and stop sites of the transcript. The transcripts are linked with their corresponding genes by the gene table and with their exons through the exon_transcript linking table. More information about the exons themselves can be found in the exon table. The tables seq_region and coord_system store information about all the available coordinate systems. The PROTEOFORMER pipeline uses the chromosome coordinate system. Adapted from [91].

3.3.2 dbSNP Parallel to the variant calling, performed by SAMtools, the PROTEOFORMER pipeline allows the inclusion of SNPs present in the Single Nucleotide Polymorphism Database (dbSNP) [52]. This database was developed in 1998 by the National Center for Biotechnology Information (NCBI), with the help of the National Human Genome Research Institute (NHGRI), as an answer to the increasing interest and research in genomic variations and their association with clinical phenotypes. The name 'dbSNP' refers to the most abundant variation type present in the database: SNPs. Along with these SNPs, dbSNP contains other genomic variants including short INDELs [92,93].

Researchers can submit new genomic variants to the database when providing sufficient information with regard to the experiment and the flanking sequences. In case of multiple entries concerning the same genomic variant, dbSNP will combine the information given by all original records and store it into a new reference record. This record can be linked with other NCBI databases [92,93].

In addition to the inclusion of SNPs from dbSNP, the PROTEOFORMER pipeline was adapted in order to enable the inclusion of INDELs as well.

3.3.3 Sequence read archives In today's genomic studies, NGS-based technologies generate massive amounts of nucleotide sequence data. In order to preserve this information in the international, public domain, the DNA Databank of Japan (DDBJ), the European Bioinformatics Institute (EMBL-EBI) and the NCBI started a collaboration to unify the obtained sequence information in all of their databases, namely the INSDC. Thanks to this collaboration, data submitted to one of these organizations can easily be shared with the other partners. One of the data types contained in these databases is raw sequence data produced during NGS. This data type is stored in a separate database called the Sequence Read Archive (SRA) [94]. The DDBJ, the NCBI and the EMBL-EBI each have their own SRA [95]. The SRA of the EMBL-EBI is part of the European Nucleotide Archive (ENA) [96].

3.3.4 PRIDE The PRoteomics IDEntifications database (PRIDE) is a public database holding the experimental results of proteomic studies associated with published, scientific articles. It was created to make proteomic data more accessible and easier to query, but it can also be used as a private platform for communication and collaboration during the process of data-analysis and peer-review of the final article. Peptide and protein identification lists can be easily submitted and accessed via a web interface along with their associated MS/MS spectra [97,98]. Together with the RIBO-seq data available in the SRAs, this data is used during the validation of the renewed PROTEOFORMER pipeline.

3.4 Custom database creation An assessment was made of the effect of including INDEL or SNP information in the creation of custom sequence databases on the peptide/protein identification rate in MS-based proteomics. To this end, RIBO-seq data derived from the human colorectal cancer cell line HCT116 was used. This data was generated by the BioBix lab in 2014 [99] and can be downloaded from the SRAs.

The custom databases themselves were created using the mapping, transcript calling, TIS calling and variant calling steps provided by the PROTEOFORMER pipeline. The latter was only available for SNP detection. Therefore, the initial variant calling script was adapted. Moreover, a completely new assembly script was written for the inclusion of both INDELs and SNPs in the protein sequence database. Both the new variant calling script and the assembly script are available on GitHub at https://github.com/sgielis/Master-thesis.

3.4.1 Mapping For every RIBO-seq experiment, two FASTQ files were present: one containing the RIBO-seq derived reads collected after CHX-treatment and another containing the RIBO-seq derived reads collected after LTM-treatment. Both files were preprocessed using the FASTX-toolkit (version 0.0.14) [100,101]. This included the removal of the adaptor sequence (AGATCGGAAGAGCACAC) using the fastx_clipper tool and trimming the sequences with the fastq_quality_trimmer tool. During the latter step, a quality threshold of 28 was used, resulting in the removal of nucleotides lying at the end of the sequence and having a quality score below 28. Simultaneously, sequences shorter than 20 nucleotides were discarded.
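The effect of the quality trimming step can be illustrated with a small Python sketch that mimics the described behaviour for a single read. It is a simplified illustration and not the FASTX-toolkit code; it assumes Phred+33 encoded quality strings, and the function name and example reads are hypothetical.

```python
def trim_read(sequence, quality, threshold=28, min_length=20, phred_offset=33):
    """Trim low-quality nucleotides from the 3' end of a read.

    Nucleotides at the end of the read with a quality score below
    the threshold are removed. Returns (sequence, quality) or None
    when the trimmed read is shorter than min_length.
    """
    end = len(sequence)
    while end > 0 and ord(quality[end - 1]) - phred_offset < threshold:
        end -= 1
    if end < min_length:
        return None
    return sequence[:end], quality[:end]


# 'I' encodes a score of 40, '=' a score of 28 and '#' a score of 2.
kept = trim_read("A" * 25, "I" * 22 + "###")
discarded = trim_read("A" * 25, "I" * 18 + "#" * 7)
borderline = trim_read("ACGT" * 5, "I" * 19 + "=")
```

The first read is shortened to 22 nucleotides, the second is discarded because only 18 nucleotides remain, and the third is kept in full because a score of exactly 28 is not below the threshold.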

In a following step, STAR (version 020201) was used to map the remaining RIBO-seq reads consecutively to the reference PhiX bacteriophage genome and the human reference rRNA, sn(o)RNA and tRNA (obtained from NCBI). These filtering steps were needed in order to retain the translating RPFs only. In all cases, a maximum of two mismatches per read was allowed. Moreover, a value of 0.5 was chosen for the option seedSearchStartLmaxOverLread and no introns were allowed during mapping against the PhiX genome.

Next, the filtered (unmapped) reads were aligned to the human reference genome. Again, a maximum of two mismatches per read was allowed. Alignments of reads mapping to at most 15 loci were collected in SAM and BAM files. The same information was also stored in BedGraph files, which were used to visualize the alignments in a genome browser.

In a later step, the RPFs were assigned to single nucleotide positions, namely the ribosomal P-sites. This was done using the Plastid tool [102]. Plastid determines these P-sites by calculating P-site offsets. Both the term P-site and the term P-site offset are explained in more detail in figure 8.

In general, Plastid starts with grouping all the RPFs by their lengths. Subsequently, these reads are mapped to the canonical genes in the region surrounding the canonical start site. Finally the distance between the canonical start site and the highest peak at the 5' end of the start codon is measured and returned as a P-site offset [103,104].

Because the P-site offsets were calculated for all read lengths varying from 22 to 34 nucleotides, the exact P-site could be calculated for each of these lengths. Next, the range of RPF lengths having a clear P-site offset was determined. All the RPFs lying in this range were retained for further analysis. This information was also stored in two SQLite tables: one for the CHX-treated reads and another for the LTM-treated reads. Both tables contain the retained P-sites and an extra value representing the number of reads that are assigned to this site.
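The assignment of RPFs to P-sites and the per-site read counts can be sketched as follows. This is not the Plastid or PROTEOFORMER implementation: the function names are hypothetical, only the offset of 12 for a 28-nucleotide RPF is taken from figure 8, and ungapped alignments with a leftmost start coordinate are assumed.

```python
from collections import Counter

# Hypothetical offset table; only the entry 28 -> 12 comes from figure 8.
PSITE_OFFSETS = {28: 12}


def psite_position(start, length, strand, offsets=PSITE_OFFSETS):
    """Return the P-site coordinate of one RPF, or None if its length has no offset."""
    offset = offsets.get(length)
    if offset is None:
        return None
    if strand == "+":
        return start + offset
    # On the reverse strand the 5' end of the read is its rightmost coordinate.
    return start + length - 1 - offset


def count_psites(reads, offsets=PSITE_OFFSETS):
    """Count the reads assigned to every (chromosome, strand, P-site) combination."""
    counts = Counter()
    for chrom, start, length, strand in reads:
        pos = psite_position(start, length, strand, offsets)
        if pos is not None:
            counts[(chrom, strand, pos)] += 1
    return counts


reads = [("1", 100, 28, "+"), ("1", 100, 28, "+"), ("1", 100, 28, "-"), ("1", 100, 30, "+")]
print(count_psites(reads))
```

Reads whose length has no well-defined offset (here the 30-nucleotide read) are dropped, which mirrors the length filtering described above.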

Figure 8: Clarification of the terms P-site and P-site offset. The P-site (peptidyl site) is the second site in the ribosomal complex. Here, the growing polypeptide is present. It is preceded by the aminoacyl site (A-site), which is responsible for the presentation of the aminoacyl-tRNA to the mRNA, and followed by the exit site (E-site), where empty tRNA molecules leave the ribosomal complex [105]. The latter is not shown in this figure. The P-site offset is defined as the distance of the first nucleotide of the RPF from the P-site. Here, an RPF of length 28 nucleotides is shown with an offset of 12 [104].

3.4.2 Transcript calling Subsequent to the mapping procedure, the ribosome profiles of the CHX-treated ribosomes were used to find the translated transcripts. The transcript calling procedure includes the collection of all possible transcripts using the Ensembl annotation bundle and the subsequent association of these Ensembl transcripts with the evidence on translational level.

3.4.3 TIS calling After the mapping and transcript calling, the translation start sites were determined. The TIS-calling procedure started from the read counts and transcripts gathered by the previous methods. In a first step, these reads were matched with the transcripts. Meanwhile, the LTM-reads accumulated at TIS were combined into peaks. These TIS-peaks represented both true and false TIS. Hence, additional criteria were needed to retain the peak positions corresponding to true translation start sites. As a first criterion, the identified TIS positions should lie in a ±1 nucleotide window relative to an AUG or near-cognate start codon. Ribosome profiles registered on the first position or the ±1 position of an AUG or near-cognate start codon were summed into one peak. Second, these combined profiles should have at least a user-definable minimal count of reads on a single position. Third, the position must also be the local maximum of mapped reads within a user-definable window. Here, a window of one codon upstream and one codon downstream was chosen. For the final criterion, an RLTM-RCHX value was calculated using the following formula:

\[ R_{LTM} - R_{CHX}, \qquad R_k = \frac{X_k}{N_k} \times 10 \quad (k = LTM, CHX) \]

Here, X represents the number of reads on position X and N the sum of all the reads on the transcript for the RPFs obtained after treatment with LTM or CHX [52]. Peaks with an RLTM-RCHX value equal to or higher than a selected threshold were retained.

Both the minimum count of reads and the RLTM-RCHX threshold could have a different value depending on the annotation. Five different annotations were possible: aTIS (annotated TIS: the TIS coincides with a canonical start site), 5' UTR (TIS lying upstream of the canonical start site), 3' UTR (TIS lying downstream of the stop codon), coding sequence or CDS (TIS lying within the protein-coding transcript) and no translation (TIS lying in non-coding transcripts). All the TIS calling procedures were performed using a minimum count of 5 for aTIS, 10 for a 5' UTR site, a 3' UTR site and a non-coding site and 15 for a CDS site.

The RLTM-RCHX thresholds were set to 0.01 for aTIS, 0.05 for a 5' UTR, a 3' UTR and a non-coding site and 0.15 for a CDS site.
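The count and RLTM-RCHX criteria with the annotation-specific thresholds listed above can be sketched as below. This is a minimal illustration, not the PROTEOFORMER code: the function names and annotation keys are hypothetical, and the scaling factor of 10 in the normalisation is an assumption.

```python
# Thresholds per annotation, as listed in the text ("ntr" = non-coding transcript).
MIN_COUNT = {"aTIS": 5, "5UTR": 10, "3UTR": 10, "ntr": 10, "CDS": 15}
MIN_R_DIFF = {"aTIS": 0.01, "5UTR": 0.05, "3UTR": 0.05, "ntr": 0.05, "CDS": 0.15}


def r_value(x, n, scale=10):
    """Normalised read density: reads on the position over reads on the transcript.

    The scale factor of 10 is an assumption of this sketch.
    """
    return scale * x / n if n else 0.0


def passes_tis_filter(annotation, x_ltm, n_ltm, x_chx, n_chx):
    """Apply the minimum-count and RLTM-RCHX criteria to one candidate TIS peak."""
    if x_ltm < MIN_COUNT[annotation]:
        return False
    diff = r_value(x_ltm, n_ltm) - r_value(x_chx, n_chx)
    return diff >= MIN_R_DIFF[annotation]


# 50 LTM reads and 10 CHX reads on the position, 1000 reads on the transcript in both data sets.
print(passes_tis_filter("aTIS", 50, 1000, 10, 1000))
```

The start-codon and local-maximum criteria are left out of this sketch; they would be applied before these two numerical filters.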

3.4.4 Variant calling SNPs and INDELs were detected using SAMtools. This variant caller starts from a SAM file containing the alignment of the CHX-derived RIBO-seq data with the reference sequence. At first, this SAM file was converted into a BAM file, which was sorted on genomic location. Next, the genotype likelihoods were determined and collected in a BCF file with the mpileup command. BCFtools used this BCF file to call all the significant SNPs and INDELs and wrote this information to a VCF file. Furthermore, the vcfutils.pl varFilter option was used to discard all variants having a read depth below 3 or exceeding 100. From the final VCF file, two SQLite tables were created: one for the SNPs and another one for the INDELs. Both tables contain genomic position information, the reference and alternative variants and their corresponding allele frequencies. The latter is an important value describing how prevalent the variant is in the sample compared to the reference. An allele frequency of 1.0, for example, means that the reference allele is not present in the studied RIBO-seq data, while an allele frequency of 0.5 indicates that both alleles at a certain locus are equally present.

Next to the experimentally observed variants, SNPs and INDELs from dbSNP were also taken into consideration. In order to reduce the size of the custom generated database, only dbSNP variants present in the RIBO-seq derived reads were included. While adding these variants to the two existing SQLite tables, a distinction was made between new variants discovered by SAMtools, existing variants present in dbSNP and detected by SAMtools, and variants that could not be called by SAMtools but were nevertheless present in dbSNP and in the SAM alignment file.
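The distinction between the three variant categories amounts to simple set operations. The sketch below assumes, for illustration only, that every variant is represented as a (chromosome, position, reference, alternative) tuple; the function name and category labels are hypothetical.

```python
def classify_variants(called, dbsnp_in_reads):
    """Split variants into the three categories described in the text.

    called         -- variants called by SAMtools
    dbsnp_in_reads -- dbSNP variants that are supported by the aligned reads
    """
    return {
        "new": called - dbsnp_in_reads,
        "dbsnp_and_called": called & dbsnp_in_reads,
        "dbsnp_only": dbsnp_in_reads - called,
    }


called = {("1", 100, "A", "G"), ("1", 250, "C", "T")}
dbsnp_in_reads = {("1", 250, "C", "T"), ("2", 40, "AT", "A")}
for category, variants in classify_variants(called, dbsnp_in_reads).items():
    print(category, sorted(variants))
```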

3.4.5 Translation assembly In a final phase of the customized database construction, all the information obtained during previous steps was collected and assembled into protein sequences. This process started from the transcript sequences called with sufficient translational evidence. Their sequences were built up using the human reference genome sequence acquired from the iGenomes repository and the exon start and end positions given by the Ensembl annotation bundle. In addition, the experimentally identified TIS were taken into account by constructing ORFs starting from that TIS position.

During ORF construction, variant information was added. Because SAMtools discovered a large number of SNPs, the SNPs from dbSNP were not included. In contrast, the number of INDELs discovered by SAMtools was low. Hence, both the INDELs discovered by SAMtools and the INDELs from dbSNP were included. To this end, all the INDELs and SNPs were collected per transcript and sorted in ascending order of appearance. Afterwards, the variants were introduced one by one. During this procedure, all the alternative INDELs and SNPs were inserted. For the SNPs, only non-synonymous SNPs were included. The latter are SNPs leading to a single amino acid change in a protein, resulting in the so-called single amino acid variants (SAAVs) [106]. These SNPs are also known as missense mutations. The reference variants were also included alongside their alternative sequence when an allele frequency around 0.5 was observed.
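Introducing sorted variants one by one requires keeping track of the length change caused by earlier INDELs. The following sketch shows one way to do this; it is not the assembly script of this work, and it assumes non-overlapping variants given as hypothetical (position, reference, alternative) tuples with 1-based transcript positions.

```python
def apply_variants(sequence, variants):
    """Apply (pos, ref, alt) variants to a transcript sequence in ascending order."""
    result = sequence
    shift = 0  # net length change introduced by the INDELs applied so far
    for pos, ref, alt in sorted(variants):
        start = pos - 1 + shift
        if result[start:start + len(ref)] != ref:
            raise ValueError(f"reference allele mismatch at position {pos}")
        result = result[:start] + alt + result[start + len(ref):]
        shift += len(alt) - len(ref)
    return result


# A three-nucleotide deletion (AAGG -> G) followed by a SNP (C -> T).
print(apply_variants("ATGAAGGCCTAA", [(9, "C", "T"), (4, "AAGG", "G")]))
```

Checking the reference allele before every substitution guards against coordinate errors after an upstream INDEL.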

Finally, all the assembled nucleotide sequences were translated in silico into protein sequences. An extra filtering step ensured that only complete proteins were retained and stored in an SQLite table and further processed into a non-redundant database in FASTA format. From this database a target-decoy database [107] was created including additional information from the cRAP (common Repository of Adventitious Proteins) database. The latter is a FASTA database containing proteins which are often present in MS-based proteomics experiments due to contamination [108].
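The in silico translation and the completeness filter can be sketched as follows. The standard genetic code is used; the function names are hypothetical, and generating decoys by reversing the target sequence is an assumption of this sketch, since the decoy strategy is not specified here.

```python
from itertools import product

# Standard genetic code in the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate_orf(orf):
    """Translate an ORF up to its first stop codon; return None if no stop codon is found."""
    protein = []
    for i in range(0, len(orf) - 2, 3):
        aa = CODON_TABLE[orf[i:i + 3]]
        if aa == "*":
            return "".join(protein)
        protein.append(aa)
    return None  # incomplete protein: no in-frame stop codon


def make_decoy(protein):
    """Build a decoy entry by reversing the target sequence (assumed strategy)."""
    return protein[::-1]


print(translate_orf("ATGAAGGCCTAA"), make_decoy("MKA"))
```

Sequences containing characters other than A, C, G and T would need extra handling, which is omitted here.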

3.5 Validation of the renewed PROTEOFORMER pipeline with proteomics data The generated custom databases were used to identify proteins and peptides isolated from the HCT116 cell line using experimental MS/MS spectra available from PRIDE. This was done with the search engines X!Tandem Vengeance (2015.12.15.2) and OMSSA (2.1.9) through the SearchGUI user interface. The analysis was carried out using two fixed modifications (heavy labelled arginine (13C6) and lysine (13C6)) and three variable modifications (pyroglutamate formation of N-terminal glutamine, oxidation of methionine to methionine sulfoxide, and acetylation of the N-terminus of both peptides and proteins). Trypsin was the selected digestion enzyme and a maximum of one missed cleavage was allowed. In addition, cleavage before proline was allowed. Furthermore, a mass tolerance of 10 ppm on precursor ions and 0.5 Da on fragment ions was chosen. For the peptide charge, a range from 2+ to 4+ was selected.
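Two of these search settings, the digestion rule and the precursor tolerance, can be made concrete with a short sketch. It does not reproduce the behaviour of X!Tandem, OMSSA or SearchGUI; the function names are hypothetical and the example values are invented.

```python
def tryptic_peptides(protein, missed_cleavages=1):
    """Cleave after every K or R (also before proline) and allow missed cleavages."""
    pieces, start = [], 0
    for i, aa in enumerate(protein):
        if aa in "KR":
            pieces.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        pieces.append(protein[start:])
    peptides = set()
    for i in range(len(pieces)):
        # Join each piece with up to `missed_cleavages` following pieces.
        for j in range(i, min(i + missed_cleavages, len(pieces) - 1) + 1):
            peptides.add("".join(pieces[i:j + 1]))
    return peptides


def within_ppm(observed_mz, theoretical_mz, tolerance_ppm=10.0):
    """Check whether a precursor m/z lies within the given ppm tolerance."""
    return abs(observed_mz - theoretical_mz) / theoretical_mz * 1e6 <= tolerance_ppm


print(sorted(tryptic_peptides("AKPRGG")))
print(within_ppm(500.004, 500.0), within_ppm(500.006, 500.0))
```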

The resulting PSMs produced by X!Tandem and OMSSA were analysed with PeptideShaker using a false discovery rate (FDR) of 1%.

4 Results

4.1 Custom database creation To study the influence of incorporating INDELs in the protein sequences on the identification rate in a MS-based identification process, two protein sequence databases are created: one with INDEL information and one without. In addition, a protein sequence database containing SNP information is generated to compare the impact of INDELs versus SNPs on the protein identification level.

4.1.1 Mapping statistics Initially, the CHX-treated and the LTM-treated RIBO-seq derived reads from the human colorectal cancer cell line were mapped successively to the PhiX genome, the human rRNA, sn(o)RNA (both small nuclear and nucleolar), tRNA and the human reference genome. The mapping statistics of the CHX-treated RIBO-seq derived reads are given in table 1, whereas table 2 lists the mapping results of the LTM-treated RIBO-seq derived reads. Both tables specify the number of input reads available for mapping against each of the reference sequence types (PhiX, rRNA, sn(o)RNA, tRNA and the human genome) and the associated number of mapped and unmapped reads. In addition, the frequency of the mapped reads relatively to the total reads associated with the sequence type is given.

Table 1: Mapping statistics for the CHX-treated RIBO-seq derived reads.

Type        Total reads    Mapped          Unmapped       Frequency mapped (%)
PhiX        215 956 640    582 289         215 374 351    0.27
rRNA        215 374 351    159 881 772     55 492 579     74.23
sn(o)RNA    55 492 579     1 854 896       53 637 683     3.34
tRNA        53 637 683     6 686 666       46 951 017     12.47
genomic     46 951 017     39 774 983*     7 176 034      84.72**

*From a total of 39 774 983 reads mapped to the human reference genome, 26 615 285 reads mapped to a unique genomic location and 13 159 698 were non-unique.
**The total frequency of 84.72% is the sum of the frequency of the uniquely mapped reads (56.69%) and the non-uniquely mapped reads (28.03%) to the human reference genome.
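The columns of table 1 are linked: the unmapped reads of each step are the input of the next one, and the frequency is the mapped fraction of that input. A small sketch that recomputes these values from the total and mapped counts of the CHX sample (the function name is hypothetical):

```python
# (type, total reads, mapped reads) for the CHX sample, taken from table 1.
CHX_STEPS = [
    ("PhiX", 215_956_640, 582_289),
    ("rRNA", 215_374_351, 159_881_772),
    ("sn(o)RNA", 55_492_579, 1_854_896),
    ("tRNA", 53_637_683, 6_686_666),
    ("genomic", 46_951_017, 39_774_983),
]


def mapping_summary(steps):
    """Return (type, unmapped reads, percentage mapped) for every filtering step."""
    rows = []
    for name, total, mapped in steps:
        rows.append((name, total - mapped, round(100 * mapped / total, 2)))
    return rows


for row in mapping_summary(CHX_STEPS):
    print(row)
```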

Table 2: Mapping statistics for the LTM-treated RIBO-seq derived reads.

Type        Total reads    Mapped          Unmapped       Frequency mapped (%)
PhiX        228 664 982    590 458         228 074 524    0.26
rRNA        228 074 524    150 632 286     77 442 238     66.05
sn(o)RNA    77 442 238     3 198 726       74 243 512     4.13
tRNA        74 243 512     6 112 089       68 131 423     8.23
genomic     68 131 423     55 879 608*     12 251 815     82.02**

*From a total of 55 879 608 reads mapped to the human reference genome, 38 689 756 reads mapped to a unique genomic location and 17 189 852 were non-unique.
**The total frequency of 82.02% is the sum of the frequency of the uniquely mapped reads (56.79%) and the non-uniquely mapped reads (25.23%) to the human reference genome.

In total, 39 774 983 CHX-treated and 55 879 608 LTM-treated RIBO-seq derived reads were mapped to the human reference genome. In order to determine the exact nucleotide position needed for the TIS and transcript calling procedures, P-site offsets were calculated. With the help of the Plastid tool, two plots were created visualizing these offsets as a function of their corresponding read length. In figure 9, the results for the LTM-derived reads are given. According to this figure, reads with a length varying from 27 to 34 nucleotides have a clearly determined P-site offset and can therefore be associated with one single nucleotide position on the reference genome. However, this range cannot be applied to the CHX-derived reads, as shown by figure 10. Here, reads having fewer than 27 or more than 31 nucleotides cannot be associated with a clearly determined P-site offset. Because it was not possible to choose a different range for the CHX-derived and the LTM-derived reads, the range from 27 to 31 nucleotides, which has a well-defined offset in both data sets, was chosen. All the CHX- and LTM-derived reads lying within this range were retained and pinpointed to one nucleotide position.

Figure 9: Overview of the calculated P-site offsets for the LTM-treated RIBO-seq derived reads with lengths varying from 22 until 34 nucleotides.

Figure 10: Overview of the calculated P-site offsets for CHX-treated RIBO-seq derived reads with lengths varying from 22 until 34 nucleotides.

4.1.2 Mapping quality control After the mapping procedure, the quality of the RIBO-seq data was determined using two important quality parameters, namely the translational phasing and the length distribution of the reads (i.e. RPFs). The phase distribution represents the number of RIBO-seq reads mapped to nucleotides lying in phase, lying one nucleotide out of phase or lying two nucleotides out of phase within the translated open reading frame. The length distribution gives the number of reads having a certain length. Both distributions are shown in figure 11 for the CHX-derived and LTM-derived reads. It is clear that the majority of the reads map to nucleotides in phase. Furthermore, the length distribution shows that most of the reads have a length within the selected range of 27-31 nucleotides. As a consequence, only a small number of the initial reads mapped to the human reference genome were excluded, resulting in a limited loss of information.
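The phase of a read follows from the distance between its P-site and the start of the open reading frame, taken modulo three. A minimal sketch, assuming transcript coordinates on the forward strand and a hypothetical mapping of P-site positions to read counts:

```python
from collections import Counter


def phase_distribution(psite_counts, cds_start):
    """Count reads per phase (0 = in phase) relative to the start of the ORF."""
    phases = Counter()
    for position, n_reads in psite_counts.items():
        phases[(position - cds_start) % 3] += n_reads
    return phases


# Toy example: most reads fall on the first nucleotide of a codon.
print(phase_distribution({100: 8, 101: 1, 102: 1, 103: 5}, cds_start=100))
```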

Figure 11: Phase and RPF length distribution of the CHX-treated RIBO-seq derived reads (a) and LTM-treated RIBO-seq derived reads (b) mapped to the human reference genome.

4.1.3 Discovered variants Variant information was gathered based on the SAM file containing the alignments of the CHX-derived reads. Both new variants, discovered by SAMtools, and existing variants, present in dbSNP, were detected. Figure 12 shows the number of variants discovered by each of these sources. In total, 63 300 SNPs and 251 INDELs were found. Most of these variants are already present in dbSNP. However, 43 novel INDELs and 19 914 new SNPs were discovered with SAMtools.

Figure 12: Pie charts showing (1) the number of new variants discovered by SAMtools, (2) the number of dbSNP variants missed by SAMtools and (3) the number of existing variants present in dbSNP and detected by SAMtools for a) SNPs and b) INDELs.

The detected variants were incorporated into the Ensembl transcript sequences during the assembly. Herein, the incorporation was limited to variants having an allele frequency lying between 0.4 and 0.6 or between 0.9 and 1.0. In this way, the allele frequencies lying close to the theoretically predicted allele frequencies of 0.5 and 1.0 are included. In order to investigate the number of variants having an allele frequency located within these ranges, the distribution of the allele frequencies was calculated for both variant classes. The results are shown in figures 13 and 14 for the SNPs and INDELs, respectively. From figure 13, it is clear that only a small subset of the SNPs has an allele frequency lying outside the predefined ranges. These are therefore not included in the Ensembl transcripts. In contrast with the SNPs, all the INDELs have allele frequencies lying within the applied ranges. Hence, all the INDELs are included in the Ensembl transcripts. For both variant classes, the majority of the variants have an allele frequency around 1.0. This shows that for most variants, the reference allele is not present in the studied HCT116 RIBO-seq dataset.
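The allele frequency rule can be summarised in a few lines. The sketch below combines the two ranges given above with the rule that the reference allele is kept alongside the alternative one for frequencies around 0.5; treating "around 0.5" as the range 0.4 to 0.6 is an assumption, as is the function name.

```python
def alleles_to_include(allele_frequency):
    """Return which alleles of a variant enter the database for a given allele frequency."""
    if 0.4 <= allele_frequency <= 0.6:
        # Both alleles are expected to be present in the sample.
        return ("reference", "alternative")
    if 0.9 <= allele_frequency <= 1.0:
        # The reference allele is (nearly) absent from the sample.
        return ("alternative",)
    return ()  # variant is not incorporated


for af in (0.5, 1.0, 0.75):
    print(af, alleles_to_include(af))
```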

Figure 13: Distribution of the allele frequencies of the SNPs detected by SAMtools.

Figure 14: Distribution of the allele frequencies of the INDELs detected by SAMtools.

4.1.4 Translation assembly Before assembling the protein sequences into a database in FASTA format, an SQLite table containing the protein sequences including transcript, TIS and variant information was created. In total three SQLite tables are made: one with INDEL information, one with SNP information and one without any variant information. These contain respectively 140 617, 143 710 and 140 293 unique protein sequences. In figure 15, a Venn diagram is given, illustrating the amount of overlap between these three databases. From this diagram, it's clear that most of the protein sequences are shared between the three tables. Moreover, the diagram shows that addition of SNP information is responsible for 7 348 extra protein sequences, whereas the addition of INDEL information results in 345 extra protein sequences.

Figure 15: Venn diagram showing the amount of similar and distinct protein sequences related with Ensembl transcripts lacking variant information and Ensembl transcripts carrying one or more INDELs or SNPs (derived from the RIBO-seq data).

While studying the number of identical transcripts in the SQLite tables with INDELs and without variants, 7 transcripts were found containing an INDEL that does not alter the protein sequence. Because INDELs are typically associated with dramatic changes at the protein level, these transcripts were studied in more detail.

It was found that, in each of these transcripts, the INDEL overlaps the stop codon. More specifically, the last nucleotide of the stop codon is part of a reported deletion of one nucleotide. Normally, this type of variant is responsible for a frameshift resulting in a deviant protein sequence. Here, however, the protein sequence is not affected, because the deletion leaves the stop codon itself intact and only changes the mRNA sequence downstream of it. For example, the stop codon TAG is not altered when the guanine of the stop codon is part of a reported deletion from GGGGGGA to GGGGGA and the first two nucleotides of the stop codon (in this case TA) precede the INDEL.
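The TAG example can be checked with a short sketch. The ORF used here (ATG GCC followed by the stop codon) is invented for illustration; only the deletion from GGGGGGA to GGGGGA preceded by TA is taken from the text.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def coding_codons(sequence):
    """Return the codons up to and including the first in-frame stop codon."""
    codons = []
    for i in range(0, len(sequence) - 2, 3):
        codon = sequence[i:i + 3]
        codons.append(codon)
        if codon in STOP_CODONS:
            break
    return codons


# Hypothetical ORF: ATG GCC TAG, where the G of TAG starts the G-stretch.
reference = "ATGGCC" + "TA" + "GGGGGGA"
alternative = "ATGGCC" + "TA" + "GGGGGA"  # one G deleted from GGGGGGA

print(coding_codons(reference))
print(coding_codons(alternative))
```

Both sequences yield the same codons up to the stop codon, so the deletion only shortens the sequence downstream of the ORF.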

4.2 Validation of the renewed PROTEOFORMER pipeline with proteomics data After creating the custom protein sequence databases, these were used in a database search approach to obtain peptide to spectrum matches (PSMs) from the experimental fragmentation spectra (MS/MS). By comparing the proteins identified from using these different databases, the influence of the incorporation of INDEL and SNP information on the peptide and protein identification process is assessed.

4.2.1 Exploration of the effect of INDELs on the identification process In order to study the gain of adding INDEL information to the peptide and protein identification process, both the custom protein sequence database without variant information and with INDEL information were used to identify peptide and protein sequences. During all analyses a threshold of 90% confidence was set, meaning only the proteins having a score of at least 90% are considered. In total 19 protein groups are detected that contain at least one proteoform with an INDEL. More importantly a single protein containing an INDEL is identified. Both the protein groups as well as the single protein are discussed in more detail in the following paragraphs.

Identification of a single protein containing an INDEL A protein derived from the Ensembl transcript ENST00000368232 (start site at position 156 599 046 on chromosome 1) is identified with a confidence score of 100%. This transcript contains an insertion of a guanine and thymine at position 1069 in the transcript sequence. This INDEL is present in dbSNP, but is also found using SAMtools with an allele frequency of 1.0. In order to make the subsequent analysis clearer, this protein is called the alternative protein. In addition, the protein derived from the Ensembl transcript lacking the INDEL is called the reference protein.

The identification of the alternative protein is based on the identification of two peptides: pyro-QKEEEEATASER-COOH with 100% confidence and NH2-GEETVLGGGTR-COOH with 99% confidence. The PSMs of these peptides are shown in figures 16 and 17, respectively. These figures also visualize the fragment ions found for each of the peptide bonds. For the first peptide, evidence is found for the cleavage of nine out of eleven theoretically possible peptide bonds. Eight of these are validated with both b- and y-ions, one is only validated with a b-ion. For the second peptide, only five out of ten peptide bonds are validated with b- and/or y-ions. However, no other peptide is found containing the subsequence ETVLGG. This supports the presence of NH2-GEETVLGGGTR-COOH.

Figure 16: Fragment ions and spectrum of the peptide pyro-QKEEEEATASER-COOH. a) The sequence fragmentation plot shows the b-ions (blue peaks) and y-ions (red peaks) present due to the cleavage of the corresponding peptide bonds. b) The PSM visualizes the annotated spectrum. Here, the red peaks are annotated by a fragment ion, whereas the grey peaks are not.

Figure 17: Fragment ions and spectrum of the peptide NH2-GEETVLGGGTR-COOH. a) The sequence fragmentation plot shows the b-ions (blue peaks) and y-ions (red peaks) present due to the cleavage of the corresponding peptide bonds. b) The PSM visualizes the annotated spectrum. Here, the red peaks are annotated by a fragment ion, whereas the grey peaks are not.

The effect of the insertion on protein level is studied by comparing the alternative protein sequence with the reference sequence. Below, the alignment of the two protein sequences is given. The bold, underlined amino acids represent the identified peptides. The amino acids in green highlight the differences between the two protein sequences.

reference   MNVTPEVKSRGMKFAEEQLLKHGWTQGKGLGRKENGITQALRVTLKQDTHGVGHDPAKEF
alternative MNVTPEVKSRGMKFAEEQLLKHGWTQGKGLGRKENGITQALRVTLKQDTHGVGHDPAKEF
            ************************************************************

reference   TNHWWNELFNKTAANLVVETGQDGVQIRSLSKETTRYNHPKPNLLYQKFVKMATLTSGGE
alternative TNHWWNELFNKTAANLVVETGQDGVQIRSLSKETTRYNHPKPNLLYQKFVKMATLTSGGE
            ************************************************************

reference   KPNKDLESCSDDDNQGSKSPKILTDEMLLQACEGRTAHKAARLGITMKAKLARLEAQEQA
alternative KPNKDLESCSDDDNQGSKSPKILTDEMLLQACEGRTAHKAARLGITMKAKLARLEAQEQA
            ************************************************************

reference   FLARLKGQDPGAPQLQSESKPPKKKKKKRRQKEEEEATASERNDADEKHPEHAEQNIRKS
alternative FLARLKGQDPGAPQLQSESKPPKKKKKKRRQKEEEEATASERNDADEKHPEHAEQNIRKS
            ************************************************************

reference   KKKKRRHQEGKVSDEREGTTKGNEKEDAAGTSGLGELNSREQTNQSLRKGKKKKRWHHEE
alternative KKKKRRHQEGKVSDEREGTTKGNEKEDAAGTSGLGELNSREQTNQSLRKGKKKKRWHHEE
            ************************************************************

reference   EKMGVLEEGGKGKEAAGSVRTEEVESRAYADPCSRRKKRQQQEEEDLNLEDRGEETFRWW
alternative EKMGVLEEGGKGKEAAGSVRTEEVESRAYADPCSRRKKRQQQEEEDLNLEDRGEETVLGG
            ********************************************************.

reference   NQGSREQSMQ--------------------------------------------------
alternative --GTREAESRACSDGRSRKSKKKRQQHQEEEDILDVRDEKDGGAREAESRAHTGSSSRGK
              *:** . :

reference   ----------------------------
alternative RKRQQHPKKERAGVSTVQKAKKKQKKRD

The reference sequence consists of 370 amino acids. The alternative protein sequence contains 446 amino acids, which is 76 more than the reference sequence. The alternative protein is therefore a C-terminally extended proteoform of the reference protein. Both proteins are present in UniProt as G patch domain-containing protein 4 proteoforms. However, the reference protein is not experimentally validated and is therefore present in the TrEMBL section (UniProt entry A0A0A0MRK1), whereas the alternative protein is reviewed and present in the Swiss-Prot section (UniProt entry Q5T3I0).

It is important to note the position of the second identified peptide, namely NH2-GEETVLGGGTR-COOH. This peptide overlaps the amino acids that are changed by the INDEL and therefore provides evidence for the presence of the alternative protein in the studied sample.
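That the second peptide spans the point where the two proteoforms start to differ can be verified on the C-terminal stretches of the alignment. The sketch below uses only short excerpts copied from the alignment, not the full protein sequences; the helper name is hypothetical.

```python
def common_prefix_length(a, b):
    """Number of leading positions at which two sequences are identical."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Excerpts around the divergence point, copied from the alignment.
reference_tail = "EDRGEETFRWWNQGSREQSMQ"
alternative_tail = "EDRGEETVLGGGTREAESR"
peptide = "GEETVLGGGTR"

shared = common_prefix_length(reference_tail, alternative_tail)
start = alternative_tail.find(peptide)
spans_divergence = start < shared < start + len(peptide)

print(shared, start, spans_divergence, peptide in reference_tail)
```

The peptide begins in the shared region and continues into the altered region, so it cannot originate from the reference excerpt.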

The C-terminally extended proteoform is found as a single protein (figure 18). The term 'single protein' refers to a protein that, on its own, explains a set of identified peptides that no other protein in the database can explain. In contrast, a protein identification is indistinguishable from other protein identifications when the identified peptide sequences can jointly map to more than one protein sequence in the database. This identification problem was also explained in section 2.2.2 as the protein inference problem. In this case, all the indistinguishable proteins are collected in a group and are identified as one ‘protein group’ instead of single proteins.
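The difference between a single protein and a protein group can be illustrated by grouping proteins on their sets of identified peptides. This is a strong simplification and not the inference algorithm used by PeptideShaker; the protein and peptide names are placeholders.

```python
from collections import defaultdict


def group_proteins(protein_to_peptides):
    """Group proteins that are supported by exactly the same set of identified peptides."""
    groups = defaultdict(list)
    for protein, peptides in protein_to_peptides.items():
        groups[frozenset(peptides)].append(protein)
    return sorted(sorted(members) for members in groups.values())


evidence = {
    "alternative": {"peptide_1", "peptide_2"},
    "isoform_A": {"peptide_3"},
    "isoform_B": {"peptide_3"},
    "isoform_C": {"peptide_3"},
}
print(group_proteins(evidence))
```

A group with one member corresponds to a single protein, whereas proteins sharing all their identified peptides end up in one protein group.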

Figure 18: Protein inference graph of the C-terminally extended proteoform. The C-terminally extended proteoform is represented by the black circle at the bottom of the graph. It is linked with the two peptides (green circles) described in the main text. Because the presence of these two peptides can only be explained by the C-terminally extended variant, this proteoform is found as a single protein. (The reference protein is not present in the database due to the INDEL having an allele frequency of 1.0.) In addition, a third peptide (red circle) is present in the top of the figure. This was not considered in the discussion of the peptides due to its confidence score of 0%. The remaining gray circles are representing proteins containing this peptide.

The protein identification process making use of the database without variant information gives rise to the identification of a protein group containing three proteins: the reference protein (370 amino acids) associated with transcript ENST00000368232 and two other proteins associated with transcript ENST00000438976 (one starting at position 156 599 013 of chromosome 1 resulting in a sequence of 359 amino acids, the other starting at position 156 601 439 resulting in a sequence of 375 amino acids). These proteins differ only slightly from each other (the last 359 amino acids of every protein are identical) as they are all products from the same gene but translated from different transcript isoforms. Absence of identified peptides specific to one of these proteins limits the identification to a group instead of a single protein (figure 19).

More importantly, none of the proteins identified with the database without INDEL information is actually present in the studied sample. This conclusion can be drawn from the allele frequency of the INDEL, which is 1.0.

In addition, since the three proteins are similar in the majority of their sequence, we would expect to have found INDELs in the two other proteins as well. However, only the protein associated with transcript ENST00000368232 is identified as a protein containing an INDEL. In order to explain this result, the two other transcripts and their resulting protein sequences were studied in more detail. It appears that the addition of an INDEL in these sequences gives rise to incomplete proteins. This means that the inclusion of an INDEL in these transcripts results in a new transcript sequence containing no stop codon. This makes it impossible to produce a functional protein.

All the above results illustrate the importance of the addition of INDEL information to a custom created protein sequence database.

Figure 19: Protein inference graph of the reference protein (blue circle) found using the database without INDELs. Only one peptide (green circle) associated with the reference protein is found. This peptide is linked with two other proteins (gray circles with yellow border), making it impossible to distinguish one protein from another.

Protein groups containing proteins derived from Ensembl transcripts with INDELs Next to the previously described single protein, 19 protein groups are identified containing at least one proteoform derived from a transcript holding an INDEL. These INDEL-containing proteoforms are not validated with the matching MS-based proteomics data, since they were within protein inference groups also holding non-INDEL containing proteoforms. However, they were still studied in more detail in order to evaluate the possible impact of INDELs on the translational level.

In general, the INDELs give rise to three different types of proteoforms: C-terminally truncated proteoforms, N-terminally extended proteoforms and proteoforms missing only one amino acid with respect to the original protein lacking the INDEL. For every type, an example is given in figure 20 and described in more detail in the following paragraphs.

Figure 20 panels: (a) ENST00000612661, (b) ENST00000612898, (c) ENST00000527673. Legend: reference protein; alternative protein; protein in protein group; protein not in protein group; validated peptide; non-validated peptide.

Figure 20: Protein inference graphs from (a) a protein group containing a C-terminally truncated proteoform (black circle) in comparison with the reference sequence (blue circle), (b) a protein group containing an N-terminally extended proteoform (black circle) and (c) a protein group containing a proteoform missing one amino acid (black circle) in comparison with the reference sequence (blue circle).

Example of a C-terminally truncated proteoform A first protein group consists of ten proteins. Two are derived from the Ensembl transcript ENST00000612661 (start site at position 113 857 746 on chromosome 6): one containing a deletion of an adenine at position 454 in the transcript sequence and one lacking this deletion. In this case, the reading frame is shifted by one nucleotide from the deletion onwards. Because of this, a 'premature' stop codon is encountered, lying upstream of the normal stop codon. This results in a C-terminally truncated proteoform.

More specifically, the reference proteoform contains 332 amino acids whereas the alternative proteoform contains 165 amino acids. They only share their first 154 amino acids. The 178 remaining amino acids from the reference protein and the 11 remaining amino acids from the alternative protein are completely different.

Due to the absence of peptides overlapping the distinct C-terminal part of the alternative protein, this proteoform is not validated with the MS/MS data.

Example of an N-terminally extended proteoform A second group contains 22 proteins. An example of this category is derived from the Ensembl transcript ENST00000612898 (start site lying in the 5' UTR region at position 27 838 622 on chromosome 6) containing a deletion of a thymine at the fourth position in the transcript sequence.

The start site in the 5' UTR lies out of frame with the canonical start site (which is located at position 27 838 662 on chromosome 6). Hence, the protein sequence derived from this transcript without the INDEL is completely different from the protein sequence derived from the canonical start site.

The deletion in the transcript sequence shifts the reading frame by one nucleotide. As a result, the 5' UTR start site comes to lie in frame with the canonical start site, leading to an alternative sequence of 139 amino acids, of which the last 126 are identical to the reference sequence starting from the canonical start site. Hence, the alternative protein is an N-terminally extended proteoform of this reference protein. Because no identified peptides overlap the N-terminally extended region, this proteoform was not validated by the MS/MS data.
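The frame relationship in this example can be checked with simple arithmetic. The sketch below assumes that no intron lies between the two start sites, so that the genomic distance equals the distance in the transcript, and that the deleted thymine lies between them. Both are assumptions made for illustration, and the names are hypothetical.

```python
def in_frame(distance_nt):
    """Two start sites share a reading frame when their distance is a multiple of 3."""
    return distance_nt % 3 == 0


upstream_start = 27_838_622   # start site in the 5' UTR (chromosome 6)
canonical_start = 27_838_662  # canonical start site (chromosome 6)

distance = canonical_start - upstream_start       # nucleotides between both sites
distance_after_deletion = distance - 1            # one thymine deleted in between
extension_codons = distance_after_deletion // 3   # extra codons in front of the canonical start

print(in_frame(distance), in_frame(distance_after_deletion), extension_codons)
```

Under these assumptions the two sites are out of frame without the deletion and in frame with it, and the resulting extension agrees with the difference between the 139 amino acids of the alternative sequence and the 126 amino acids it shares with the reference.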

Example of a proteoform missing one amino acid in comparison with its reference protein

A next protein group consists of two proteins derived from the Ensembl transcript ENST00000527673 (start site at position 119 018 284 on chromosome 11). Here, one of the proteins contains a deletion of three nucleotides: the nucleotides AAGG starting at position 28 in the reference transcript sequence are replaced with a single guanine. Since three nucleotides are deleted, the reading frame is not changed. Therefore, the deletion has only a small effect on the protein, namely the absence of the lysine encoded by the three deleted nucleotides. This new proteoform is also not validated, due to the lack of peptides overlapping the missing amino acid.
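The distinction between the three examples comes down to whether the net length change of the INDEL is a multiple of three. A minimal Python sketch of that rule follows. The function names are hypothetical, and the allele notation (reference allele replaced by alternative allele, with the retained base at the end) is an assumption based on the description of the AAGG to G replacement above.

```python
LYSINE_CODONS = {"AAA", "AAG"}  # both codons encode lysine in the standard genetic code


def classify_indel(ref_allele, alt_allele):
    """Classify a variant by its net length change in nucleotides."""
    net_change = len(alt_allele) - len(ref_allele)
    if net_change == 0:
        return "substitution"
    return "in-frame" if net_change % 3 == 0 else "frameshift"


def removed_bases(ref_allele, alt_allele):
    """Bases lost in a deletion, assuming the alternative allele is the retained suffix."""
    if not ref_allele.endswith(alt_allele):
        raise ValueError("alternative allele is not a suffix of the reference allele")
    return ref_allele[:len(ref_allele) - len(alt_allele)]


# AAGG replaced by a single G: three nucleotides are lost, the frame is kept.
print(classify_indel("AAGG", "G"), removed_bases("AAGG", "G"))
# Deletion of a single adenine or thymine: the frame is shifted.
print(classify_indel("A", ""))
```

The removed triplet AAG is a lysine codon, which is consistent with the single missing lysine described above.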

4.2.2 Exploration of the effect of SNPs on the identification process

In order to compare the effect of INDELs with the effect of SNPs on the peptide and protein identification level, the protein identification process was repeated using a custom database including SNP information. Here, the analysis was restricted to proteins having a confidence score of at least 90%.

SNP-containing proteins lacking extra MS/MS validation

In total, 30 SNP-containing proteins are identified as single proteins (i.e. not in an inference group). In addition, 74 protein groups are identified that consist solely of SNP-containing proteins. The majority of these proteins are not validated with peptides overlapping the SNP-affected amino acids. However, these proteins can still be distinguished from the reference proteins (proteins without SNP) by means of their allele frequency: all of the SNPs associated with single proteins or with protein groups lacking the reference proteins have an allele frequency of 1.0. This indicates that the database does not contain the corresponding reference proteins. Hence, the alternative proteins can be identified without the need for peptides overlapping the changed amino acids.

An interesting remark can be made on the function of the identified proteins. At least two of these proteins are related to DNA repair activity (MMS19 (methyl-methanesulfonate sensitivity 19) nucleotide excision repair protein homolog [109] and TELO2 (telomere maintenance 2)-interacting protein 1 homolog [110]). In addition, one possible oncogene related to colorectal cancer (Pinin [111]) was found.

SNP-containing proteins with extra MS/MS validation

Only one protein group contains an MS/MS-validated SNP. This group (figure 21) consists of two proteins: one derived from the Ensembl transcript ENST00000216605 (start site at position 64 388 428 on chromosome 14), which has a length of 935 amino acids and is submitted to Swiss-Prot (UniProt entry P11586), and one derived from the Ensembl transcript ENST00000545908 (start site at position 64 388 260 on chromosome 14), which has a length of 1020 amino acids and is submitted to TrEMBL (UniProt entry F5H2F4). Each contains two SNPs (one at chromosomal position 64 415 662 replacing an adenine by a guanine and one at chromosomal position 64 442 127 replacing a guanine by an adenine) with an allele frequency of 1.0. Because both proteins are products of the same gene (MTHFD1), their sequences are highly similar. The alignment of these sequences is shown below. In addition, the reference sequence of ENST00000545908 (UniProt entry F5H2F4) is added to illustrate the amino acid changes. The bold, underlined amino acids represent the identified peptides. Due to the presence of a peptide starting after the first methionine (i.e. the initiator methionine has been cleaved by the methionine aminopeptidase), the protein derived from Ensembl transcript ENST00000216605 is more likely to be present in the sample. The green amino acids represent the amino acids that are changed by the SNPs.

ENST00000216605 --------------------------------------------------------MAPA
ENST00000545908 MRHRVCRGRSGQGRRRSSVIPWPVPKHVGWVVLLGCGGSGTSILVVSIVGSGLIKAMAPA
F5H2F4          MRHRVCRGRSGQGRRRSSVIPWPVPKHVGWVVLLGCGGSGTSILVVSIVGSGLIKAMAPA
                                                                        ****

ENST00000216605 EILNGKEISAQIRARLKNQVTQLKEQVPGFTPRLAILQVGNRDDSNLYINVKLKAAEEIG
ENST00000545908 EILNGKEISAQIRARLKNQVTQLKEQVPGFTPRLAILQVGNRDDSNLYINVKLKAAEEIG
F5H2F4          EILNGKEISAQIRARLKNQVTQLKEQVPGFTPRLAILQVGNRDDSNLYINVKLKAAEEIG
                ************************************************************

ENST00000216605 IKATHIKLPRTTTESEVMKYITSLNEDSTVHGFLVQLPLDSENSINTEEVINAIAPEKDV
ENST00000545908 IKATHIKLPRTTTESEVMKYITSLNEDSTVHGFLVQLPLDSENSINTEEVINAIAPEKDV
F5H2F4          IKATHIKLPRTTTESEVMKYITSLNEDSTVHGFLVQLPLDSENSINTEEVINAIAPEKDV
                ************************************************************

ENST00000216605 DGLTSINAGRLARGDLNDCFIPCTPKGCLELIKETGVPIAGRHAVVVGRSKIVGAPMHDL
ENST00000545908 DGLTSINAGRLARGDLNDCFIPCTPKGCLELIKETGVPIAGRHAVVVGRSKIVGAPMHDL
F5H2F4          DGLTSINAGKLARGDLNDCFIPCTPKGCLELIKETGVPIAGRHAVVVGRSKIVGAPMHDL
                *********:**************************************************

ENST00000216605 LLWNNATVTTCHSKTAHLDEEVNKGDILVVATGQPEMVKGEWIKPGAIVIDCGINYVPDD
ENST00000545908 LLWNNATVTTCHSKTAHLDEEVNKGDILVVATGQPEMVKGEWIKPGAIVIDCGINYVPDD
F5H2F4          LLWNNATVTTCHSKTAHLDEEVNKGDILVVATGQPEMVKGEWIKPGAIVIDCGINYVPDD
                ************************************************************

ENST00000216605 KKPNGRKVVGDVAYDEAKERASFITPVPGGVGPMTVAMLMQSTVESAKRFLEKFKPGKWM
ENST00000545908 KKPNGRKVVGDVAYDEAKERASFITPVPGGVGPMTVAMLMQSTVESAKRFLEKFKPGKWM
F5H2F4          KKPNGRKVVGDVAYDEAKERASFITPVPGGVGPMTVAMLMQSTVESAKRFLEKFKPGKWM
                ************************************************************

ENST00000216605 IQYNNLNLKTPVPSDIDISRSCKPKPIGKLAREIGLLSEEVELYGETKAKVLLSALERLK
ENST00000545908 IQYNNLNLKTPVPSDIDISRSCKPKPIGKLAREIGLLSEEVELYGETKAKVLLSALERLK
F5H2F4          IQYNNLNLKTPVPSDIDISRSCKPKPIGKLAREIGLLSEEVELYGETKAKVLLSALERLK
                ************************************************************

ENST00000216605 HRPDGKYVVVTGITPTPLGEGKSTTTIGLVQALGAHLYQNVFACVRQPSQGPTFGIKGGA
ENST00000545908 HRPDGKYVVVTGITPTPLGEGKSTTTIGLVQALGAHLYQNVFACVRQPSQGPTFGIKGGA
F5H2F4          HRPDGKYVVVTGITPTPLGEGKSTTTIGLVQALGAHLYQNVFACVRQPSQGPTFGIKGGA
                ************************************************************

ENST00000216605 AGGGYSQVIPMEEFNLHLTGDIHAITAANNLVAAAIDARIFHELTQTDKALFNRLVPSVN
ENST00000545908 AGGGYSQVIPMEEFNLHLTGDIHAITAANNLVAAAIDARIFHELTQTDKALFNRLVPSVN
F5H2F4          AGGGYSQVIPMEEFNLHLTGDIHAITAANNLVAAAIDARIFHELTQTDKALFNRLVPSVN
                ************************************************************

ENST00000216605 GVRRFSDIQIRRLKRLGIEKTDPTTLTDEEINRFARLDIDPETITWQRVLDTNDRFLRKI
ENST00000545908 GVRRFSDIQIRRLKRLGIEKTDPTTLTDEEINRFARLDIDPETITWQRVLDTNDRFLRKI
F5H2F4          GVRRFSDIQIRRLKRLGIEKTDPTTLTDEEINRFARLDIDPETITWQRVLDTNDRFLRKI
                ************************************************************

ENST00000216605 TIGQAPTEKGHTRTAQFDISVASEIMAVLALTTSLEDMRERLGKMVVASSKKGEPVSAED
ENST00000545908 TIGQAPTEKGHTRTAQFDISVASEIMAVLALTTSLEDMRERLGKMVVASSKKGEPVSAED
F5H2F4          TIGQAPTEKGHTRTAQFDISVASEIMAVLALTTSLEDMRERLGKMVVASSKKGEPVSAED
                ************************************************************

ENST00000216605 LGVSGALTVLMKDAIKPNLMQTLEGTPVFVHAGPFANIAHGNSSIIADQIALKLVGPEGF
ENST00000545908 LGVSGALTVLMKDAIKPNLMQTLEGTPVFVHAGPFANIAHGNSSIIADQIALKLVGPEGF
F5H2F4          LGVSGALTVLMKDAIKPNLMQTLEGTPVFVHAGPFANIAHGNSSIIADRIALKLVGPEGF
                ************************************************:***********

ENST00000216605 VVTEAGFGADIGMEKFFNIKCRYSGLCPHVVVLVATVRALKMHGGGPTVTAGLPLPKAYI
ENST00000545908 VVTEAGFGADIGMEKFFNIKCRYSGLCPHVVVLVATVRALKMHGGGPTVTAGLPLPKAYI
F5H2F4          VVTEAGFGADIGMEKFFNIKCRYSGLCPHVVVLVATVRALKMHGGGPTVTAGLPLPKAYI
                ************************************************************

ENST00000216605 QENLELVEKGFSNLKKQIENARMFGIPVVVAVNAFKTDTESELDLISRLSREHGAFDAVK
ENST00000545908 QENLELVEKGFSNLKKQIENARMFGIPVVVAVNAFKTDTESELDLISRLSREHGAFDAVK
F5H2F4          QENLELVEKGFSNLKKQIENARMFGIPVVVAVNAFKTDTESELDLISRLSREHGAFDAVK
                ************************************************************

ENST00000216605 CTHWAEGGKGALALAQAVQRAAQAPSSFQLLYDLKLPVEDKIRIIAQKIYGADDIELLPE
ENST00000545908 CTHWAEGGKGALALAQAVQRAAQAPSSFQLLYDLKLPVEDKIRIIAQKIYGADDIELLPE
F5H2F4          CTHWAEGGKGALALAQAVQRAAQAPSSFQLLYDLKLPVEDKIRIIAQKIYGADDIELLPE
                ************************************************************

ENST00000216605 AQHKAEVYTKQGFGNLPICMAKTHLSLSHNPEQKGVPTGFILPIRDIRASVGAGFLYPLV
ENST00000545908 AQHKAEVYTKQGFGNLPICMAKTHLSLSHNPEQKGVPTGFILPIRDIRASVGAGFLYPLV
F5H2F4          AQHKAEVYTKQGFGNLPICMAKTHLSLSHNPEQKGVPTGFILPIRDIRASVGAGFLYPLV
                ************************************************************

ENST00000216605 GTMSTMPGLPTRPCF------YDID------LDPETEQVNGL------F------
ENST00000545908 GTITIHLQEATLKVWPVSIQAHWELGSISKPREVSPCPEDLKLIVGVSPEVIFSLNSHHV
F5H2F4          GTITIHLQEATLKVWPVSIQAHWELGSISKPREVSPCPEDLKLIVGVSPEVIFSLNSHHV
                **:: * : :::. :.* *::: : *

The first SNP is validated by an overlapping peptide. It replaces a lysine (K) by an arginine (R) at sequence position 134 in the protein derived from transcript ENST00000216605 and at sequence position 190 in the protein derived from transcript ENST00000545908. The second SNP replaces an arginine (R) by a glutamine (Q) at sequence position 653 in the protein derived from transcript ENST00000216605 and at sequence position 709 in the protein derived from transcript ENST00000545908. This SNP is not validated by the matching MS/MS data. Still, it is considered present because of its allele frequency of 1.0.
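The two sets of sequence positions differ by a constant offset, namely the length of the N-terminal extension of the longer protein. As a minimal sketch, this offset can be derived from the first alignment block. The constant and function names below are hypothetical, and the sequences are copied from the alignment above.

```python
# First 60 residues of the longer protein (ENST00000545908), taken from the alignment.
LONG_N_TERMINUS = "MRHRVCRGRSGQGRRRSSVIPWPVPKHVGWVVLLGCGGSGTSILVVSIVGSGLIKAMAPA"
# First residues of the shorter protein (ENST00000216605).
SHORT_N_TERMINUS = "MAPA"

# 0-based index at which the shorter protein starts inside the longer one;
# this equals the number of extra N-terminal residues.
offset = LONG_N_TERMINUS.find(SHORT_N_TERMINUS)


def to_long_isoform(position_in_short):
    """Map a 1-based position in the shorter protein to the longer protein."""
    return position_in_short + offset


for position in (134, 653):
    print(position, "->", to_long_isoform(position))
```

The mapped positions agree with the positions 190 and 709 reported for the protein derived from ENST00000545908.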

Figure 21: Protein inference graph of the protein group containing a validated SNP. The blue dots represent the two proteins that are grouped together. The smaller, green dots represent the identified peptides. The peptide overlapping the SNP affected amino acid is shown in light green. The small, red dot represents a non-validated peptide.

5 Discussion

The interest in genetic variation is associated with its influence on the phenotypic level, especially when it comes to disease susceptibility. In this work, the focus lies on the human cancer proteome. Previous studies indicate that SNPs and INDELs are very abundant in human cancer cells [48]. Therefore, their effect on the protein identification rate was studied.

The inclusion of INDELs resulted in the identification of one extra protein containing an insertion of two nucleotides at the end of its transcript sequence. This insertion shifted the reading frame and generated a new stop codon downstream of the original one. Hence, a C-terminally extended proteoform was present. This proteoform was validated with one peptide derived from the altered C-terminus and one peptide shared with the reference protein. The former made it possible to discern the alternative from the reference proteoform. This, however, did not indicate the absence of the reference proteoform. In fact, the absence of proteins can never be confirmed by MS [112]. In this case, however, the allele frequency of 1.0 indicated the absence of the reference protein.

Next to the discovered single protein, 19 protein groups were detected containing one or more alternative proteoforms. Unfortunately, no peptides were found overlapping the altered amino acids. As a consequence, the presence of these proteoforms was not validated. Still, these contained valuable information about the putative impact of INDELs on the protein translation products. According to these examples, the effect of an INDEL is associated with its length. Here, a deletion of three nucleotides resulted in the loss of only one amino acid while maintaining the reading frame. Generally, INDELs whose length is a multiple of three nucleotides (in-frame INDELs) do not change the reading frame, resulting in a minor effect on the protein sequence [113]. In contrast, INDELs with a length not divisible by three do alter the reading frame. Here, the position of the INDEL in the transcript sequence plays an important role. For example, the INDEL in the C-terminal part induced a new stop codon upstream of the original stop codon, resulting in a C-terminally truncated proteoform.

Apart from the C-terminally extended or truncated proteoforms, the human proteome contains N-terminal proteoforms. These are associated with alternative start sites [114]. Alternative start sites located in the 5' UTR give rise to N-terminally extended proteoforms, whereas alternative start sites located downstream of the canonical start site result in N-terminally truncated proteins. However, this is only the case when the alternative and canonical start sites lie in frame. Here, an N-terminally extended proteoform was found that included an INDEL lying upstream of the canonical start codon. This INDEL was responsible for a frameshift of one nucleotide, causing the near-cognate start site to lie in frame with the canonical start site.

The inclusion of SNP information in the PROTEOFORMER pipeline has led to the identification of 30 single SNP-containing proteins and 74 protein groups consisting solely of SNP-containing proteins. Again, only one SNP was validated with MS/MS. This SNP was present in two proteins and was responsible for the replacement of a lysine by an arginine. The discrepancy between the number of identified proteins containing an SNP and the number of SNPs validated with MS/MS illustrates the importance of adding extra experimental data to the custom database. In fact, the identification is based on the allele frequency of the SNPs. All 30 single proteins and 74 protein groups contain SNPs with an allele frequency of 1.0. This means that the reference protein was not found in the sample and therefore was not added to the database, which makes the identification of the alternative proteins possible. In the case of SNPs having an allele frequency of 0.5, both reference and alternative proteins are added to the custom database. In the absence of peptides overlapping the altered amino acids, the alternative and reference protein cannot then be distinguished. In summary, the SNP calling based on the RIBO-seq data points to a full inclusion of the mutation in both alleles (i.e. allele frequency of 1.0) and thus gives evidence of the alternative sequence (although matching MS/MS validation is missing).
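The role of the allele frequency in this reasoning can be summarised in a short Python sketch. It is a simplification of the rule described above, not the actual PROTEOFORMER implementation: the function names are hypothetical, and the two short sequences are the reference and variant fragments around the validated SNP taken from the alignment.

```python
def database_entries(reference, alternative, allele_frequency):
    """Sequences added to the custom database for one SNP-affected protein.

    Assumed rule (from the text): an allele frequency of 1.0 means only the
    alternative protein is added; lower frequencies add both forms.
    """
    if allele_frequency >= 1.0:
        return [alternative]
    return [reference, alternative]


def identifiable_without_variant_peptide(entries, alternative):
    """The alternative form is unambiguous only when it is the sole database entry."""
    return entries == [alternative]


reference_fragment = "DGLTSINAGK"    # reference (F5H2F4) fragment with lysine
alternative_fragment = "DGLTSINAGR"  # variant fragment with arginine

homozygous = database_entries(reference_fragment, alternative_fragment, 1.0)
heterozygous = database_entries(reference_fragment, alternative_fragment, 0.5)

print(identifiable_without_variant_peptide(homozygous, alternative_fragment))
print(identifiable_without_variant_peptide(heterozygous, alternative_fragment))
```

With an allele frequency of 0.5 both forms are present in the database, so a peptide overlapping the altered amino acid would be needed to tell them apart.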

The results illustrate the possible impact of INDELs and SNPs on the diversity of the human proteome. Although the study here was restricted to the effect of these variants at the molecular level, other studies have focused on their effect on cancer development. Cancer results from somatic mutations in three different groups of genes: tumour-suppressor genes, oncogenes and stability genes. These are all active in cell-cycle regulation through the production of proteins initiating cell division (oncogenes), controlling cell division (tumour-suppressor genes) or providing DNA repair activity (stability genes) [48,115]. The latter are important in maintaining genomic stability. Due to the inactivation of these DNA repair genes, mutations in oncogenes and tumour-suppressor genes can accumulate [116]. These two groups are the key drivers in the development of cancer.

Oncogenes are associated with a gain of function. This is a result of precise mutations at specific locations resulting in a small modification of the protein [113]. Both mutations leading to the over-expression of a protein and mutations resulting in continuously activated proteins may lead to cancer development [115]. The majority of these mutations are missense mutations. However, in-frame INDELs also occur frequently in oncogenes [113]. In contrast with oncogenes, tumour-suppressor genes contribute to cancer through loss-of-function mutations. Here, frameshifting INDELs play an important role [113]. Normally, the introduction of a frameshifting INDEL would lead to the production of an aberrant protein. Such proteins are generally not present due to the action of nonsense-mediated mRNA decay [117]. However, in cancer development, frameshifting INDELs in tumour-suppressor genes are often retained by means of positive selection. Indeed, non-functional tumour-suppressor genes are not able to stop the cell cycle or induce apoptosis in case of excessive cell growth. This results in the accumulation of tumour cells [115]. Next to these frameshifting INDELs, tumour-suppressor genes also contain a large number of missense and nonsense substitutions (substitutions leading to a 'premature' stop codon). In addition, a small proportion of the mutations are caused by in-frame INDELs [113].

In general, the development of cancer occurs in different phases due to the accumulation of multiple mutations. The function of both the oncogenes and the tumour-suppressor genes needs to be affected. Hence, different genes are often involved. Various studies have focused on the association between genetic variation and the risk of developing colorectal cancer. It has been found that many colorectal tumours are initiated by mutations in three tumour-suppressor genes (APC (adenomatous polyposis coli) [118], TP53 (tumour protein 53) [119] and SMAD4 (mothers against decapentaplegic homolog 4) [120]) and two oncogenes (KRAS (Kirsten rat sarcoma viral oncogene homolog) [121] and BRAF (B-Raf proto-oncogene serine/threonine kinase) [122]). In addition, numerous other genes are associated with an increased risk of colorectal cancer. In a recent meta-analysis [123], 950 colorectal cancer studies were investigated. From this analysis, variations in 50 genes were found to be associated with an elevated risk of colorectal cancer. One of these genes was also identified with the database containing SNP information, namely the MTHFD1 (methylenetetrahydrofolate dehydrogenase 1) gene. This gene encodes a protein with enzymatic activity in the folate-mediated one-carbon metabolism pathway. This pathway plays an important role in the synthesis and repair of DNA through de novo production of purines and thymidylate. It is also responsible for the production of methionine, which is required for DNA methylation. As a consequence, polymorphisms in the MTHFD1 gene may have an impact on genetic stability and gene expression, resulting in cancer susceptibility [124]. In addition, at least three other proteins related to cancer development were detected in this work. These include two proteins with DNA repair activity (MMS19 nucleotide excision repair protein homolog and TELO2-interacting protein 1 homolog) and a possible oncogene (Pinin).

The alternative G patch domain-containing protein 4 also appears in some cancer studies [125,126]. In one of these studies, a proteogenomic approach was used to assess the impact of mutations in DNA mismatch repair genes on the proteome. This resulted in the validation of variant peptides including the sequence GEETVLGGGTR, which was also validated in this work. However, its exact relation with cancer was not elucidated [126]. All these results illustrate the potential benefits of proteogenomic cancer studies in clinical approaches, namely the discovery and use of new biomarkers. By combining proteomic and genomic approaches, variant proteins associated with tumour development can be detected [127].

An important remark can be made on the impact of the detected variants on the translational level. In figure 15 the number of alternative proteins having one or more INDEL(s) or SNP(s) is shown. In total, 7348 SNP-containing proteins are present in the database. Of these, 30 single proteins and 74 protein groups are detected with a matching MS-based proteomics experiment. However, only one SNP was validated with mass spectrometry evidence. In addition, only one alternative protein with an INDEL out of 345 possible proteins was validated. According to these results, only a limited number of the alternative proteins is picked up with mass spectrometry. In other words, these proteins are associated with a low identification rate. This limitation is also present in other proteogenomics studies [127,128]. For example, in a recent study [129], the identification rate of single amino acid variants in breast cancer was studied. Only 10% of the SNPs detected using whole genome sequencing and whole transcriptome sequencing were validated with MS/MS (figure 22). Multiple explanations are possible for this phenomenon. For one thing, genetic variation has been linked with a decreased efficiency in translation [130,131] and an elevated level of protein degradation [132,133]. In addition to these biological consequences, the validation of genetic variants is limited by the MS/MS process itself [127,128,129]. Efficient peptide identification, for example, is restricted to peptides with a length of 6 to 30 amino acids. Therefore, all variants incorporated in peptides falling outside this range will most likely be missed. Because this limitation in sequence coverage of MS/MS is associated with trypsin digestion, alternative approaches using multiple enzymes and fragmentation methods need to be studied [129].
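The length restriction mentioned above can be made concrete with an in silico tryptic digest. The sketch below is a simplification under stated assumptions: trypsin is taken to cleave C-terminal to every lysine and arginine, with no missed cleavages and no proline exception, and the function name is hypothetical. The excerpt is the 60-residue alignment block of the ENST00000216605-derived protein that contains the validated SNP (residues 125 to 184), so its last fragment is cut by the excerpt boundary rather than by trypsin.

```python
import re

MIN_LENGTH, MAX_LENGTH = 6, 30  # peptide length range for efficient identification


def tryptic_peptides(protein):
    """Return (start, peptide) pairs after cleaving C-terminal to every K or R."""
    pattern = r"[^KR]*[KR]|[^KR]+"
    return [(m.start(), m.group()) for m in re.finditer(pattern, protein)]


# Alignment block containing the K-to-R variant (protein positions 125-184).
excerpt = "DGLTSINAGRLARGDLNDCFIPCTPKGCLELIKETGVPIAGRHAVVVGRSKIVGAPMHDL"
variant_index = 134 - 125  # 0-based index of the variant residue within the excerpt

peptides = tryptic_peptides(excerpt)
detectable = [(s, p) for s, p in peptides if MIN_LENGTH <= len(p) <= MAX_LENGTH]
too_short = [p for _, p in peptides if len(p) < MIN_LENGTH]
covering = [p for s, p in detectable if s <= variant_index < s + len(p)]

print("all peptides:", [p for _, p in peptides])
print("outside the length range:", too_short)
print("detectable peptides covering the variant:", covering)
```

In this excerpt the variant happens to fall in a peptide of suitable length, whereas any variant located in one of the short fragments would be missed under the same assumptions.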

Figure 22: Venn diagram showing the number of SNPs detected by whole genome DNA-seq (blue circle), whole transcriptome RNA-seq (pink circle) and MS/MS-based proteomics (orange circle) in luminal breast tumours [129].

A comparative analysis of the protein identification process using the databases with SNPs and INDELs shows that the addition of SNPs has a stronger effect. Indeed, more than 100 SNP-containing proteins and only one INDEL-containing protein were identified. This difference can be explained by several factors. First of all, SNPs are more abundant in the human genome [134,135]. Secondly, the success of INDEL detection depends on the chosen INDEL calling tool. Here, an alignment-based method was used to find variants in short-read data. This method seems suitable for the detection of SNPs, but is more error-prone when detecting INDELs. In the case of deletions, the detection is complicated by the presence of a gap between the reference sequences. Insertions, on the other hand, complicate the process by introducing new sequences which cannot be mapped to the reference genome. In addition, reads with long insertions are flanked by shorter reference sequences, which makes the mapping procedure more difficult [136].

6 Conclusion

Today, many studies rely on the identification of proteins using MS-based technologies. This implies the need for well-annotated, comprehensive protein sequence databases. Nowadays, custom databases are generated using novel information from both the genomic and the transcriptional level. This has led to the discovery of new proteoforms that could not be detected using publicly available databases lacking this information.

The progress in genomics and transcriptomics goes hand in hand with the introduction of new sequencing techniques. Currently, it is possible to assess the translational status of mRNA fragments using ribosome profiling. With the aid of the PROTEOFORMER pipeline, this information can easily be transformed into a custom protein sequence database.

In this master thesis, the impact of INDELs on the identification rate of human cancer proteins was studied and compared with the impact of SNPs. For this, ribosome profiling data from the human colorectal cancer cell line HCT116 was used to create a custom protein sequence database. Of the variants in this database, one INDEL and one SNP were validated with MS/MS. These low numbers may be a result of the low sequence coverage of MS/MS and the possible negative effect of genetic variation on the translation process and on protein stability. In addition, a large contrast was found between the number of alternative proteins detected using the database with INDELs and the number of alternative proteins detected using the database with SNPs. Both the relative frequency of these variants and the limitations of the current INDEL calling strategy can explain this difference. Hence, further improvements of the PROTEOFORMER pipeline with regard to genetic variation are expected to give an additional increase in the protein identification rate.

7 Further research

In this master thesis, the impact of SNPs and INDELs on the protein identification rate of human colorectal cancer cells was studied. It has been found that the addition of these variants to the PROTEOFORMER pipeline has improved the protein identification process through the detection of alternative proteins. However, this study has also shown that the pipeline is still incomplete. Here, the current limitations and the possible improvements are explained in more detail.

A first improvement can be made by using other variant callers. As previously stated, alignment-based methods, like SAMtools, are less effective in detecting INDELs. For this reason, new INDEL calling tools are being developed. These are often based on paired-end RNA sequencing. In contrast with single-read sequencing, paired-end methods sequence both ends of each fragment [137]. Pindel, for example, uses a split-read mapping approach based on short-read paired-end sequencing data [138]. In addition, haplotype-based methods are also suited to calling INDELs. Here, alignment information is used to find the regions of interest, which are later assembled into possible haplotypes. By realigning the reads with these haplotypes, INDELs are detected. Although these methods seem to overcome the limitation of alignment-based methods, they are still not optimal. As a result, the best outcome is obtained when combining multiple techniques [65]. In line with the general problem of detecting longer INDELs in short reads, the read length can also be considered an important factor influencing the success of INDEL detection. In this context, the use of ribosome profiling as the method of choice might be questioned, since it limits the read length to approximately 28 nucleotides. This problem can be addressed by adding support for paired-end RNA sequencing of longer reads [139]. In this way the sequencing step is split into two parts: ribosome profiling gives information about the translational status including TIS information, whereas paired-end RNA-seq is used to find genetic variations.

Furthermore, the overall detection rate of both SNPs and INDELs from the ribosome profiling data is limited. Figure 12 illustrates the number of variants detected from the ribosome profiling data. Only a small number of INDELs is picked up with SAMtools. The majority were identified using dbSNP. Also, most of the identified SNPs were present in dbSNP. This shows that the number of new genetic variants discovered from the experimental data is relatively low. This conclusion can also be drawn from other proteogenomics studies. In recent work [127], human colon and rectal tumours were characterized using a proteogenomics approach based on RNA-seq. For this, tumour samples from The Cancer Genome Atlas (TCGA) [140] were subjected to LC-MS/MS shotgun proteomics. Single nucleotide variants were detected using matching RNA-seq data. Figure 23 shows the number of detected single amino acid variants in the analyzed samples. A few of these SNPs were previously reported by TCGA. Most of the SNPs were present in dbSNP. In addition, a considerable number of COSMIC-supported variants were detected. COSMIC (Catalogue Of Somatic Mutations In Cancer) [141] is a database collecting somatic mutations in human cancer genes. Only a small number of the SNPs represented new variants. This study illustrates that most of the variants discovered in proteogenomics studies are already annotated in public databases. As a result, a further increase in the protein identification rate is expected when these previously annotated variants are added to the pipeline. In order to limit the size of the database, only cancer-related variants need to be included. Therefore, the addition of COSMIC variants related to colorectal cancer seems the best approach.

Figure 23: Overview of the number of single amino acid variants (SAAVs) per TCGA sample. The number of somatic variants, reported by TCGA, is shown in the upper graph. The second and third graph represent the remaining number of somatic variants present in COSMIC and dbSNP respectively. The last graph visualizes the number of new variants. Adapted from [127].

In summary, the implementation of new INDEL detection methods and the incorporation of genetic variants derived from public databases into the PROTEOFORMER pipeline are expected to cause an additional increase in the protein identification rate.

Bibliography

[1] R. Dahm, “Friedrich Miescher and the discovery of DNA,” Dev. Biol., vol. 278, no. 2, pp. 274–288, 2005.
[2] J. M. Heather and B. Chain, “The sequence of sequencers: The history of sequencing DNA,” Genomics, vol. 107, no. 1, pp. 1–8, 2016.
[3] L. Liu et al., “Comparison of next-generation sequencing systems,” J. Biomed. Biotechnol., vol. 2012, p. 251364, 2012.
[4] J. Shendure and H. Ji, “Next-generation DNA sequencing,” Nat. Biotechnol., vol. 26, no. 10, pp. 1135–1145, 2008.
[5] E. R. Mardis, “The impact of next-generation sequencing technology on genetics,” Trends Genet., vol. 24, no. 3, pp. 133–141, 2008.
[6] J. M. Rothberg and J. H. Leamon, “The development and impact of 454 sequencing,” Nat. Biotechnol., vol. 26, no. 10, pp. 1117–1124, 2008.
[7] R. J. Roberts, M. O. Carneiro, and M. C. Schatz, “The advantages of SMRT sequencing,” Genome Biol., vol. 14, no. 6, p. 405, 2013.
[8] Pacific Biosciences, “Advance genomics with Single Molecule, Real-time (SMRT) sequencing.” [Online]. Available: http://www.pacb.com/smrt-science/smrt-sequencing/. [Accessed: 06-May-2017].
[9] S. Goodwin, J. Gurtowski, S. Ethe-Sayers, P. Deshpande, M. Schatz, and W. R. McCombie, “Oxford Nanopore sequencing, hybrid error correction, and de novo assembly of a eukaryotic genome,” Genome Res., vol. 25, no. 11, pp. 1750–1756, 2015.
[10] Y. Zhang, B. R. Fonslow, B. Shan, M. Baek, and J. R. Yates, “Protein Analysis by Shotgun / Bottom-up Proteomics,” Chem. Rev., vol. 113, no. 4, pp. 2343–2394, 2013.
[11] X. Han, A. Aslanian, and J. Yates, “Mass Spectrometry for Proteomics,” Curr. Opin. Chem. Biol., vol. 12, no. 5, pp. 483–490, 2008.
[12] A. I. Nesvizhskii, “Proteogenomics: concepts, applications and computational strategies,” Nat. Methods, vol. 11, no. 11, pp. 1114–1125, 2014.
[13] R. Aebersold and M. Mann, “Mass spectrometry-based proteomics,” Nature, vol. 422, no. 6928, pp. 198–207, 2003.
[14] M. W. Duncan, R. Aebersold, and R. M. Caprioli, “The pros and cons of peptide-centric proteomics,” Nat. Biotechnol., vol. 28, no. 7, pp. 659–664, 2010.
[15] C. C. Wu and M. J. MacCoss, “Shotgun proteomics: tools for the analysis of complex biological systems,” Curr. Opin. Mol. Ther., vol. 4, no. 3, pp. 242–50, 2002.
[16] C. Ansong, S. O. Purvine, J. N. Adkins, M. S. Lipton, and R. D. Smith, “Proteogenomics: Needs and roles to be filled by proteomics in genome annotation,” Briefings Funct. Genomics Proteomics, vol. 7, no. 1, pp. 50–62, 2008.
[17] J. S. Cottrell, “Protein identification using MS/MS data,” J. Proteomics, vol. 74, no. 10, pp. 1842–1851, 2011.
[18] A. I. Nesvizhskii, “A survey of computational methods and error rate estimation procedures for peptide and protein identification in shotgun proteomics,” J. Proteomics, vol. 73, no. 11, pp. 2092–2123, 2010.
[19] J. V Olsen, S.-E. Ong, and M. Mann, “Trypsin cleaves exclusively C-terminal to arginine and lysine residues,” Mol. Cell. Proteomics, vol. 3, no. 6, pp. 608–14, 2004.
[20] A. I. Nesvizhskii and R. Aebersold, “Interpretation of Shotgun Proteomic Data,” Mol. Cell. Proteomics, vol. 4, no. 10, pp. 1419–1440, 2005.

[21] N. Castellana and V. Bafna, “Proteogenomics to discover the full coding content of genomes: A computational perspective,” J. Proteomics, vol. 73, no. 11, pp. 2124–2135, 2010. [22] G. Menschaert and D. Fenyö, “Proteogenomics from a bioinformatics angle: a growing field,” Mass Spectrom. Rev., vol. 2015, no. 9999, pp. 1–16, 2015. [23] S. Tanner et al., “Improving gene annotation using peptide mass spectrometry,” Proteome, vol. 17, no. 2, pp. 231–239, 2007. [24] M. Mann and A. Pandey, “Use of mass spectrometry-derived data to annotate nucleotide and protein sequence databases,” Trends Biochem. Sci., vol. 26, no. 1, pp. 54–61, 2001. [25] L. M. Smith et al., “Proteoform: a single term describing protein complexity,” Nat. Methods, vol. 10, no. 3, pp. 186–187, 2013. [26] N. Fortelny, P. Pavlidis, and C. M. Overall, “The path of no return--Truncated protein N-termini and current ignorance of their genesis,” Proteomics, vol. 15, no. 14, pp. 2547–2552, 2015. [27] G. Menschaert, V. Olexiouk, and S. Verbruggen, “Proteoformer 2.0 (in preparation),” 2017. [28] B. Boeckmann et al., “The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003,” Nucleic Acids Res., vol. 31, no. 1, pp. 365–370, 2003. [29] D. A. Bitton, D. L. Smith, Y. Connolly, P. J. Scutt, and C. J. Miller, “An integrated mass-spectrometry pipeline identifies novel protein coding-regions in the human genome,” PLoS One, vol. 5, no. 1, pp. 1–10, 2010. [30] G. M. Sheynkman, M. R. Shortreed, B. L. Frey, and L. M. Smith, “Discovery and mass spectrometric analysis of novel splice-junction peptides using RNA-Seq.,” Mol. Cell. Proteomics, vol. 12, pp. 2341–53, 2013. [31] D. Yagoub et al., “Proteogenomic Discovery of a Small, Novel Protein in Yeast Reveals a Strategy for the Detection of Unannotated Short Open Reading Frames,” J. Proteome Res., vol. 14, no. 12, pp. 5038–5047, 2015. [32] N. T. Ignolia, S. Ghaemmaghami, J. R. S. Newman, and J. S. 
Weissman, “Genome- Wide Analysis in Vivo of Translation with Nucleotide Resolution Using Ribosome Profiling,” Science., vol. 324, no. 5924, pp. 218–223, 2009. [33] N. T. Ingolia, “Ribosome profiling: new views of translation, from single codons to genome scale.,” Nat. Rev. Genet., vol. 15, no. 3, pp. 205–13, 2014. [34] H. Chassé, S. Boulben, V. Costache, P. Cormier, and J. Morales, “Analysis of translation using polysome profiling,” Nucleic Acids Res., vol. 45, no. 3, p. e15, 2017. [35] N. T. Ingolia, “Ribosome Footprint Profiling of Translation throughout the Genome,” Cell, vol. 165, no. 1, pp. 22–33, 2016. [36] A. M. Michel and P. V. Baranov, “Ribosome profiling: A Hi-Def monitor for protein synthesis at the genome-wide scale,” Wiley Interdiscip. Rev. RNA, vol. 4, no. 5, pp. 473–490, 2013. [37] N. T. Ingolia, “Genome-wide translational profiling by Ribosome footprinting,” Methods Enzymol., vol. 470, pp. 119–142, 2010. [38] N. T. Ingolia, G. A. Brar, S. Rouskin, A. M. Mcgeachy, and J. S. Weissman, “The ribosome profiling strategy for monitoring translation in vivo by deep sequencing of ribosome-protected mRNA fragments,” Nat. Prot., vol. 7, no. 8, pp. 1534–1550, 2013. [39] N. T. Ingolia, L. Lareau, and J. Weissman, “Ribosome Profiling of Mouse Embryonic Stem Cells Reveales Complexity of Mammalian Proteomes,” Cell, vol. 147, no. 4, pp. 789–802, 2012.

[40] S. Lee, B. Liu, S. Lee, S.X. Huang, B. Shen, and S.B. Qian, “Global mapping of translation initiation sites in mammalian cells at single-nucleotide resolution. TL - 109,” Proc. Natl. Acad. Sci. U. S. A., vol. 109, no. 37, p. 32, 2012. [41] W. McKeehan and B. Hardesty, “The mechanism of cycloheximide inhibition of protein synthesis in rabbit reticulocytes,” Biochem. Biophys. Res. Commun., vol. 36, no. 4, pp. 625–630, 1969. [42] V. Olexiouk, J. Crappé, S. Verbruggen, K. Verhegen, L. Martens, and G. Menschaert, “SORFs.org: A repository of small ORFs identified by ribosome profiling,” Nucleic Acids Res., vol. 44, no. D1, pp. D324–D329, 2016. [43] M. a Basrai, P. Hieter, and J. D. Boeke, “Small Open Reading Frames : Beautiful Needles in the Haystack Small Open Reading Frames : Beautiful Needles in the Haystack,” Genome Res., vol. 7, pp. 768–771, 1997. [44] R. Gibbs et al.,“The International HapMap Project.,” Nature, vol. 426, no. 6968, pp. 789–796, 2003. [45] D. C. Rees, T. N. Williams, and M. T. Gladwin, “Sickle-cell disease,” Lancet, vol. 376, no. 9757, pp. 2018–2031, 2010. [46] J. L. Bobadilla, M. Macek, J. P. Fine, and P. M. Farrell, “Cystic fibrosis: A worldwide analysis of CFTR mutations - Correlation with incidence data and application to screening,” Hum. Mutat., vol. 19, no. 6, pp. 575–606, 2002. [47] E. C. Landels, I. H. Ellis, A. H. Fensom, P. M. Green, and M. Bobrow, “Frequency of the Tay-Sachs disease splice and insertion mutations in the UK Ashkenazi Jewish population,” J Med Genet, vol. 28, pp. 177–180, 1991. [48] H. Yang, Y. Zhong, C. Peng, J.-Q. Chen, and D. Tian, “Important role of indels in somatic mutations of human cancer genes.,” BMC Med. Genet., vol. 11, no. 128, 2010. [49] B. Vogelstein and K. Kinzler, “Cancer genes and the pathways they control,” Nat. Med., vol. 10, no. 8, pp. 789–799, 2004. [50] C. Greenman et al., “Patterns of somatic mutation in human cancer genomes,” Nature, vol. 446, no. 7132, pp. 153–158, 2007. [51] Z. Sun, A. 
Bhagwate, N. Prodduturi, P. Yang, and J.-P. A. Kocher, “Indel detection from RNA-seq data: tool evaluation and strategies for accurate detection of actionable mutations,” Brief. Bioinform., pp. 1–11, 2016. [52] J. Crappé et al., “PROTEOFORMER: deep proteome coverage through ribosome profiling and MS integration.,” Nucleic Acids Res., vol. 43, no. 5, p. e29, Mar. 2015. [53] M. Fresno, A. Jiménez, and D. Vázquez, “Inhibition of translation in eukaryotic systems by harringtonine.,” Eur. J. Biochem., vol. 72 VN-r, no. 2, pp. 323–330, 1977. [54] T. Schneider-poetsch et al., “Inhibition of Eukaryotic Translation Elongation by Cycloheximide and Lactimidomycin,” Nat. Chem. Biol., vol. 6, no. 3, pp. 209–217, 2010. [55] M. Kozak, “Downstream secondary structure facilitates recognition of initiator codons by eukaryotic ribosomes.,” Proc. Natl. Acad. Sci. U. S. A., vol. 87, no. 21, pp. 8301– 8305, 1990. [56] G. Menschaert et al., “Deep proteome coverage based on ribosome profiling aids mass spectrometry-based protein and peptide discovery and provides evidence of alternative translation products and near-cognate translation initiation events.,” Mol. Cell. Proteomics, vol. 12, no. 7, pp. 1780–90, 2013. [57] S. J. Andrews and J. A. Rothnagel, “Emerging evidence for functional peptides encoded by short open reading frames,” Nat. Rev. Genet., vol. 15, no. 3, pp. 193–204, 2014.

[58] G. Menschaert, J. Crappé, E. Ndah, and S. Koch , ert, “What is PROTEOFORMER?,” 2014. [Online]. Available: http://www.biobix.be/proteoformer/. [Accessed: 29-Oct- 2016]. [59] “FastQC - A Quality Control application for FastQ files.” [Online]. Available: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/README.txt. [Accessed: 29-Oct-2016]. [60] D. Kim, G. Pertea, C. Trapnell, H. Pimentel, R. Kelley, and S. L. Salzberg, “TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions,” Genome Biol., vol. 14, no. 4, p. R36, 2013. [61] A. Dobin et al., “STAR: Ultrafast universal RNA-seq aligner,” Bioinformatics, vol. 29, no. 1, pp. 15–21, 2013. [62] H. Li et al., “The Sequence Alignment/Map format and SAMtools,” Bioinformatics, vol. 25, no. 16, pp. 2078–2079, 2009. [63] SAMtools, “Multisample SNP Calling.” [Online]. Available: http://samtools.sourceforge.net/mpileup.shtml. [Accessed: 14-Sep-2016]. [64] Genome Research Limited, “Samtools Manual Page,” 2016. [Online]. Available: http://www.htslib.org/doc/samtools.html. [Accessed: 14-Sep-2016]. [65] M. S. Hasan, X. Wu, and L. Zhang, “Performance evaluation of indel calling tools using real short-read data,” Hum. Genomics, vol. 9, no. 1, p. 20, 2015. [66] K. Gevaert et al., “Exploring proteomes and analyzing protein processing by mass spectrometric identification of sorted N-terminal peptides,” Nat. Biotechnol., vol. 21, no. 5, pp. 566–569, 2003. [67] T. Ylonen and C. Lonvick, “The Secure Shell (SSH) Protocol Architecture,” 2006. [Online]. Available: https://www.rfc-editor.org/rfc/rfc4251.txt. [Accessed: 04-Apr- 2017]. [68] Python Software Foundation, “General Python FAQ,” 2017. [Online]. Available: https://docs.python.org/3/faq/general.html. [Accessed: 05-Apr-2017]. [69] SQLite Consortium, “About SQLite.” [Online]. Available: https://www.sqlite.org/about.html. [Accessed: 03-Apr-2017]. [70] SQLite Consortium, “SQLite is a Self Contained System.” [Online]. 
Available: https://www.sqlite.org/selfcontained.html. [Accessed: 03-Apr-2017]. [71] SQLite Consortium, “Command Line Shell For SQLite.” [Online]. Available: http://www.sqlite.org/cli.html. [Accessed: 03-Apr-2017]. [72] Python Software Foundation, “DB-API 2.0 interface for SQLite databases,” 2017. [Online]. Available: https://docs.python.org/2/library/sqlite3.html. [Accessed: 05-Apr- 2017]. [73] Python Software Foundation, “PEP 0249 -- Python Database API Specification v2.0,” 2017. [Online]. Available: https://www.python.org/dev/peps/pep-0249/. [Accessed: 05- Apr-2017]. [74] A. M. Kuchling, “The Python DB-API,” Linux Journal, no. 49, 1998. [75] Broad Institute, “HC overview: How the HaplotypeCaller works,” 2016. [Online]. Available: https://software.broadinstitute.org/gatk/documentation/article.php?id=4148. [Accessed: 15-Sep-2016]. [76] A. Rimmer, H. Phan, I. Mathieson, Z. Iqbal, and S. R. F. Twigg, “Integrating mapping- , assembly- and haplotype-based approaches for calling variants in clinical sequencing applications,” Nat Genet., vol. 46, no. 8, pp. 912–918, 2016.

[77] Broad Institute, “HaplotypeCaller.” [Online]. Available: https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstit ute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php. [Accessed: 29-Apr- 2017]. [78] M. A. DePristo et al., “A framework for variation discovery and genotyping using next-generation DNA sequencing data,” Nat. Genet., vol. 43, no. 5, pp. 491–498, 2011. [79] M. Vaudel et al., “PeptideShaker enables reanalysis of MS-derived proteomics data sets,” Nat. Biotechnol., vol. 33, no. 1, pp. 22–24, 2015. [80] M. Vaudel, H. Barsnes, F. S. Berven, A. Sickmann, and L. Martens, “SearchGUI: An open-source graphical user interface for simultaneous OMSSA and X!Tandem searches,” Proteomics, vol. 11, no. 5, pp. 996–999, 2011. [81] R. Craig and R. C. Beavis, “TANDEM: Matching proteins with tandem mass spectra,” Bioinformatics, vol. 20, no. 9, pp. 1466–1467, 2004. [82] L. Y. Geer et al., “Open mass spectrometry search algorithm,” J. Proteome Res., vol. 3, no. 5, pp. 958–964, 2004. [83] Compomics, “SearchGUI,” 2017. [Online]. Available: http://compomics.github.io/projects/searchgui.html. [Accessed: 11-Apr-2017]. [84] Compomics, “PeptideShaker,” 2017. [Online]. Available: http://compomics.github.io/projects/peptide-shaker.html. [Accessed: 11-Apr-2017]. [85] J. Goecks, A. Nekrutenko, J. Taylor, and The Galaxy Team, “Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences.,” Genome Biol., vol. 11, no. 8, p. R86, 2010. [86] EMBL-EBI, “About the Ensembl Project,” 2014. [Online]. Available: http://www.ensembl.org/info/about/index.html. [Accessed: 16-Apr-2017]. [87] T. Hubbard et al., “The Ensembl genome database project,” Nucleic Acids Res., vol. 30, no. 1, pp. 38–41, 2002. [88] B. L. Aken et al., “The Ensembl Gene Annotation System,” Database (Oxford)., vol. 2016, p. baw093, 2016. [89] D. M. 
Church et al., “Modernizing reference genome assemblies,” PLoS Biol., vol. 9, no. 7, p. e1001091, 2011. [90] “The Genome Reference Consortium.” [Online]. Available: https://www.ncbi.nlm.nih.gov/grc. [Accessed: 12-Apr-2017]. [91] EMBL-EBI, “Ensembl Core - Schema documentation,” 2017. [Online]. Available: http://www.ensembl.org/info/docs/api/core/core_schema.html. [Accessed: 12-Apr- 2017]. [92] S. T. Sherry et al., “dbSNP: the NCBI database of genetic variation.,” Nucleic Acids Res., vol. 29, no. 1, pp. 308–311, 2001. [93] S. T. Sherry, M. Ward, and K. Sirotkin, “dbSNP—Database for Single Nucleotide Polymorphisms and Other Classes of Minor Genetic Variation,” Genome Res., vol. 9, no. 8, pp. 677–679, 1999. [94] G. Cochrane, I. Karsch-Mizrachi, and Y. Nakamura, “The International Nucleotide Sequence Database Collaboration,” Nucleic Acids Res., vol. 39, no. (Database issue): D15–D18, 2011. [95] EMBL-EBI, “International Nucleotide Sequence Database Collaboration.” [Online]. Available: http://www.insdc.org/. [Accessed: 09-Apr-2017]. [96] R. Leinonen et al., “The European nucleotide archive,” Nucleic Acids Res., vol. 39, no. (Database issue):D28-31, 2011.

[97] P. Jones et al., “PRIDE: a public repository of protein and peptide identifications for the proteomics community,” Nucleic Acids Res., vol. 34, no. (Database issue): D659- 63, 2006. [98] L. Martens et al., “PRIDE: The proteomics identifications database,” Proteomics, vol. 5, no. 13, pp. 3537–3545, 2005. [99] A. Koch et al., “A proteogenomics approach integrating proteomics and ribosome profiling increases the efficiency of protein identification and enables the discovery of alternative translation start sites,” Proteomics, vol. 14, no. 0, pp. 2688–2698, 2014. [100] H. Laboratory, “FASTX-Toolkit.” [Online]. Available: http://hannonlab.cshl.edu/fastx_toolkit/. [Accessed: 22-May-2017]. [101] D. Blankenberg et al., “Manipulation of FASTQ data with galaxy,” Bioinformatics, vol. 26, no. 14, pp. 1783–1785, 2010. [102] J. G. Dunn and J. S. Weissman, “Plastid: nucleotide-resolution analysis of next- generation sequencing and genomics data,” BMC Genomics, vol. 17, no. 1, p. 958, 2016. [103] J. G. Dunn, “Performing metagene analyses,” 2014. [Online]. Available: http://plastid.readthedocs.io/en/latest/examples/metagene.html. [Accessed: 24-May- 2017]. [104] J. G. Dunn, “Determine P-site offsets for ribosome profiling data,” 2014. [Online]. Available: http://plastid.readthedocs.io/en/latest/examples/p_site.html. [Accessed: 24- May-2017]. [105] D. L. Lafontaine and D. Tollervey, “The function and synthesis of ribosomes.,” Nat. Rev. Mol. Cell Biol., vol. 2, no. 7, pp. 514–520, 2001. [106] C. M. Yates, I. Filippis, L. A. Kelley, and M. J. E. Sternberg, “SuSPect: Enhanced prediction of single amino acid variant (SAV) phenotype using network features,” J. Mol. Biol., vol. 426, no. 14, pp. 2692–2701, 2014. [107] J. E. Elias and S. P. Gygi, “Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry,” Nat. Methods, vol. 4, no. 3, pp. 207–214, 2007. [108] The Global Proteome Machine Organization, “cRAP protein sequences,” 2011. 
[Online]. Available: http://www.thegpm.org/crap/. [Accessed: 23-May-2017]. [109] O. Stehling et al., “MMS19 assembles iron-sulfur proteins required for DNA metabolism and genomic integrity,” Science (80-. )., vol. 337, no. 6091, pp. 195–199, 2012. [110] K. E. Hurov, C. Cotta-Ramusino, and S. J. Elledge, “A genetic screen identifies the Triple T complex required for DNA damage signaling and ATM and ATR stability,” Genes Dev., vol. 24, no. 17, pp. 1939–1950, 2010. [111] Z. Wei, W. Ma, X. Qi, X. Zhu, Y. Wang, and Z. Xu, “Pinin facilitated proliferation and metastasis of colorectal cancer through activating EGFR / ERK signaling pathway,” Oncotarget, vol. 7, no. 20, pp. 29429–29439, 2016. [112] M. A. Baldwin, “Protein Identification by Mass Spectrometry,” Mol. Cell. Proteomics, vol. 3, no. 1, pp. 1–9, 2003. [113] P. Iengar, “An analysis of substitution, deletion and insertion mutations in cancer genes,” Nucleic Acids Res., vol. 40, no. 14, pp. 6401–6413, 2012. [114] D. Gawron, E. Ndah, K. Gevaert, and P. Van Damme, “Positional proteomics reveals differences in N-terminal proteoform stability,” Mol. Syst. Biol., vol. 12, no. 2, p. 858, 2016. [115] Harvey Lodish et al., “The genetic basis of cancer,” in Molecular Cell Biology, 7th ed., New York: W. H. Freeman and Company, 2000, pp. 1124–1131. 56

[116] G. Kurzawski, J. Suchy, T. Debniak, J. Kładny, and J. Lubiński, “Importance of microsatellite instability (MSI) in colorectal cancer: MSI as a daignostic tool,” Ann. Oncol., vol. 15, no. SUPPL. 4, pp. 283–284, 2004. [117] Y.-F. Chang, J. S. Imam, and M. F. Wilkinson, “The Nonsense-Mediated Decay RNA Surveillance Pathway,” Annu. Rev. Biochem., vol. 76, pp. 51–74, 2007. [118] R. Fodde, “The APC gene in colorectal cancer.,” Eur. J. Cancer, vol. 38, no. 7, pp. 867–71, 2002. [119] X.L. Li, J. Zhou, Z.R. Chen, and W.J. Chng, “P53 Mutations in Colorectal Cancer- Molecular Pathogenesis and Pharmacological Reactivation,” World J. Gastroenterol., vol. 21, no. 1, pp. 84–93, 2015. [120] M. Miyaki and T. Kuroki, “Role of Smad4 (DPC4) inactivation in human cancer,” Biochem. Biophys. Res. Commun., vol. 306, no. 4, pp. 799–804, 2003. [121] M. Hajdúch, S. Jančík, J. Drábek, and D. Radzioch, “Clinical relevance of KRAS in human cancers,” J. Biomed. Biotechnol., vol. 2010, no. 150960, p. 13, 2010. [122] D. Barras, “BRAF Mutation in Colorectal Cancer : An Update,” Biomark. Cancer, vol. 7, no. Suppl 1, pp. 9–12, 2015. [123] X. Ma, B. Zhang, and W. Zheng, “Genetic variants associated with colorectal-cancer risk: comprehensive research synopsis, meta-analysis, and epidemiological evidence,” Gut, vol. 63, no. 2, pp. 326–336, 2008. [124] A. J. MacFarlane, C. A. Perry, M. F. McEntee, D. M. Lin, and P. J. Stover, “Mthfd1 is a modifier of chemically induced intestinal carcinogenesis,” Carcinogenesis, vol. 32, no. 3, pp. 427–433, 2011. [125] K. Yamaguchi et al., “MRG-binding protein contributes to colorectal cancer development,” Cancer Sci., vol. 102, no. 8, pp. 1486–1492, 2011. [126] P. J. Halvey et al., “Proteogenomic analysis reveals unanticipated adaptations of colorectal tumor cells to deficiencies in DNA mismatch repair,” Cancer Res., vol. 74, no. 1, pp. 387–397, 2014. [127] B. Zhang, J. Wang, X. Wang, J. Zhu, Q. Liu, and Z. 
Shi, “Proteogenomic characterization of human colon and rectal cancer,” Nature, vol. 513, no. 7518, pp. 382–387, 2014. [128] P. Mertins et al., “Proteogenomics connects somatic mutations to signaling in breast cancer,” Nature, vol. 534, no. 7605, pp. 55–62, 2016. [129] K. V. Ruggles et al., “An Analysis of the Sensitivity of Proteogenomic Mapping of Somatic Mutations and Novel Splicing Events in Cancer,” Mol. Cell. Proteomics, vol. 15, no. 3, pp. 1060–1071, 2016. [130] Q. Li et al., “Genome-wide search for exonic variants affecting translational efficiency,” vol. 4, no. 2260, 2013. [131] C. Polychronakos, “Gene expression as a quantitative trait: what about translation?,” J. Med. Genet., vol. 49, no. 9, pp. 554–557, 2012. [132] J. Hu and P. C. Ng, “Predicting the effects of frameshifting indels,” Genome Biol., vol. 13, no. 2, p. R9, 2012. [133] E. Nagy and L. E. Maquat, “A rule for termination-codon position within intron- containing genes: When nonsense affects RNA abundance,” Trends Biochem. Sci., vol. 23, no. 6, pp. 198–199, 1998. [134] R. A. Cartwright, “Problems and solutions for estimating indel rates and length distributions,” Mol. Biol. Evol., vol. 26, no. 2, pp. 473–480, 2009. [135] J. Q. Chen, Y. Wu, H. Yang, J. Bergelson, M. Kreitman, and D. Tian, “Variation in the ratio of nucleotide substitution and indel rates across genomes in mammals and bacteria,” Mol. Biol. Evol., vol. 26, no. 7, pp. 1523–1531, 2009. 57

[136] G. Narzisi and M. C. Schatz, “The Challenge of Small-Scale Repeats for Indel Discovery,” Front. Bioeng. Biotechnol., vol. 3, p. 8, 2015. [137] Illumina, “Paired-End Sequencing,” 2017. [Online]. Available: https://www.illumina.com/technology/next-generation-sequencing/paired-end- sequencing_assay.html. [Accessed: 29-May-2017]. [138] K. Ye, M. H. Schulz, Q. Long, R. Apweiler, and Z. Ning, “Pindel: A pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads,” Bioinformatics, vol. 25, no. 21, pp. 2865–2871, 2009. [139] P. J. Campbell et al., “Identification of somatically acquired rearrangements in cancer using genome-wide massively parallel paired-end sequencing,” Nat. Genet., vol. 40, no. 6, pp. 722–729, 2008. [140] K. Tomczak, P. Czerwińska, and M. Wiznerowicz, “The Cancer Genome Atlas (TCGA): An immeasurable source of knowledge,” Contemp. Oncol., vol. 19, no. 1A, pp. A68–A77, 2015. [141] S. A. Forbes et al., “COSMIC: Exploring the world’s knowledge of somatic mutations in human cancer,” Nucleic Acids Res., vol. 43, pp. D805–D811, 2015.