Department of Pharmacy Analytical Biochemistry

The role of proteogenomics in understanding molecular mechanisms of COPD

Yanick Hagemeijer, Rainer Bischoff, Victor Guryev, Peter Horvatovich


X-Omics Festival, April 12, 2021

Population and intra-individual genomics variability

[Figure: omics layers linking genome to disease: genome, epigenome, transcriptome, translatome, proteome (phosphoproteome, glycoproteome, acetylome, kinome), metabolome, lipidome, glycome, microbiome and phenome. Population genetic variability is measured as DNA variants (genome/exome sequencing), transcripts (RNA-seq), peptides (LC-MS/MS) and metabolites (metabolome, lipidome), followed by data integration. Adapted from … PMID 23664091.]

Tumor heterogeneity

Era of personal genomics

[Figure: timeline of genomes read, 2007 to 2018: Craig Venter, James Watson, 1000 Genomes Project, Genome of the Netherlands, up to ~500 000 genomes.]

Difficulty in identifying variants de novo from MS/MS spectra

NH2-E A N F D I N Q L Y D C N W V V V N C S T P G N F F H V L R-COOH
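The b and y labels used to annotate MS/MS spectra are theoretical fragment ions of the peptide. As a minimal sketch, the code below computes singly charged, monoisotopic b- and y-ion m/z values for the peptide shown above. It assumes unmodified residues (for example, no carbamidomethylation of cysteine) and a single charge; neither assumption is stated on the slide.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276   # mass of a proton (Da)
WATER = 18.010565   # monoisotopic mass of H2O (Da)


def fragment_ions(peptide):
    """Return singly charged b- and y-ion m/z values as {label: m/z}.

    Assumption: unmodified residues, charge state 1+.
    """
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    ions = {}
    for i in range(1, len(peptide)):
        # b_i: the first i residues plus one proton
        ions[f"b{i}"] = sum(masses[:i]) + PROTON
        # y_i: the last i residues plus water plus one proton
        ions[f"y{i}"] = sum(masses[-i:]) + WATER + PROTON
    return ions


# Peptide from the slide, written without spaces.
PEPTIDE = "EANFDINQLYDCNWVVVNCSTPGNFFHVLR"
ions = fragment_ions(PEPTIDE)
for label in ("b1", "b2", "y1", "y4"):
    print(label, round(ions[label], 4))
```

A single amino acid substitution shifts every b ion from the variant position onward and every y ion that contains it, which is why variant peptides need their own entry in the search space.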

[Figure: two annotated MS/MS spectra, intensity (Int) versus m/z, with matched b- and y-ion fragments and the doubly charged precursor (Prec++) labelled.]
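Matching spectra against a sequence list can only identify a variant peptide if that peptide is in the list. The sketch below illustrates this with an in-silico tryptic digest of a reference sequence and of the same sequence carrying one single amino acid variant. The protein sequence, the variant Q9E and the minimum peptide length are hypothetical values chosen for illustration; the cleavage rule (after K or R, not before P, no missed cleavages) is a common simplification and not necessarily the setting used in the study.

```python
def tryptic_peptides(protein, min_len=6):
    """In-silico trypsin digest: cleave after K/R unless followed by P.

    No missed cleavages; peptides shorter than min_len are dropped.
    """
    peptides, start = set(), 0
    for i, aa in enumerate(protein):
        last = i == len(protein) - 1
        if last or (aa in "KR" and protein[i + 1] != "P"):
            peptide = protein[start:i + 1]
            if len(peptide) >= min_len:
                peptides.add(peptide)
            start = i + 1
    return peptides


def apply_saav(protein, ref, pos, alt):
    """Apply a single amino acid variant at a 1-based position."""
    if protein[pos - 1] != ref:
        raise ValueError(f"expected {ref} at position {pos}")
    return protein[:pos - 1] + alt + protein[pos:]


# Hypothetical toy protein and variant, for illustration only.
reference = "MAGTKLSEQVDAPTNAEEWHLKGVYSDR"
variant = apply_saav(reference, "Q", 9, "E")

reference_peptides = tryptic_peptides(reference)
variant_peptides = tryptic_peptides(variant)

# Peptides that a search against the reference alone cannot match.
novel = variant_peptides - reference_peptides
print(sorted(novel))
```

Only the peptide spanning the substituted residue is new; all other peptides are shared with the reference, so a sample-specific database mainly adds a small number of informative sequences.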

• Best identification method: database search (DBS)
• DBS requires a list of the sequences expected to be present in the sample
• Canonical sequences from Swiss-Prot and UniProt (Ensembl) are used in common proteomics workflows
• Public databases contain only a low number of variants (20, 80 and 30 k proteins). Lam MCP 2011

Chronic Obstructive Pulmonary Disease

• 20% of smokers develop COPD; more than 200 million people have COPD
• Progressive loss of lung function with a large impact on quality of life
• Insufficient insight into the molecular mechanisms of COPD
• Limited therapeutic options

COPD phenotypes and treatment

[Figure: COPD phenotypes chronic bronchitis (airways) and emphysema (alveoli), with impaired lung function and symptoms.]

Today's treatment: “One size fits all”

Study design

• 20/18 lung tissues
• COPD: 10 COPD stage IV (8 female/2 male)
• Control: 10/8 ex-smokers (6/4 female/4 male)
• Proteins: Orbitrap Q Exactive+, 1D-LC-MS/MS and 2D-LC-MS/MS
• mRNA sequences: Illumina, 20 million sequencing depth, polyadenylated mRNA fraction

Brandsma CA, Guryev V, Timens W, …, van den Berge M, Horvatovich P. Integrated proteogenomic approach identifying a protein signature of COPD and a new splice variant of SORBS1. Thorax, 2020, 75(2):180-183. PMID: 31937552

Proteogenomics data integration workflow

Input files: mRNA-Seq (mRNA raw sequence, Illumina) and LC-MS/MS (LC2-MS/MS data, Orbitrap QE+)

mRNA-Seq pre-processing
• FastQC: quality control
• Trimmomatic: sequence trimming
• STAR / TopHat2: alignment and annotation
• Cufflinks / StringTie: mRNA assembly and quantification
• TransDecoder: translated protein sequence
• Output: mRNA quantitative table (FASTA format / FPKM)

Sequence types: annotated proteins, SAAVs, splice isoforms, indels, new models, new transcripts

Proteomics MS1 quantification
• Grid: smoothing, resampling
• Centroid: peak detection and quantification
• Warp2D: time alignment
• MetaMatch: peak matching
• Output: peptide/protein quantitative table

Proteomics peptide/protein identification
• SearchGUI/PEAKS: peptide/protein identification
• PeptideShaker/PEAKS: validation of peptide/protein identification
• Output: quantitative table (PSM, IC)
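The mRNA quantitative table in the workflow is expressed in FPKM (fragments per kilobase of transcript per million mapped fragments). A minimal sketch of the standard FPKM definition follows; the fragment count and transcript length are hypothetical, and using the 20 million sequencing depth from the study design as the number of mapped fragments is an assumption made only for the example.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and library size must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (fragments per million)
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


# Hypothetical transcript: 100 fragments on a 2000 bp transcript,
# in a library of 20 million mapped fragments (assumed, see lead-in).
example = fpkm(100, 2000, 20_000_000)
print(example)
```

Because FPKM normalizes for both transcript length and library size, values are comparable between transcripts within a sample, which is what the comparison of mRNA and protein quantities relies on.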

Suits F, Hoekman B, Rosenling T, Bischoff R, Horvatovich P. Threshold Avoiding Proteomics Pipeline. Anal Chem., 2011, 83(20):7786-94.

One reason why transcript is present, and protein is absent

annotated/Normal:   AACA gtgagt...atacag AAGA   (exon, intron, exon)
in 5 COPD samples:  AACA gtgagt...at ACAGAAGA   (exon, intron, exon): frameshift

[Figure: translated protein sequence of the oncostatin M receptor gene without mutation and with mutation.]

STRING protein-protein interaction network

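[Figure: STRING network, proteomics + transcriptomics, FDR 0.01, fixed layout. STRING edges: experiments and database; edge legend: experiments, database, both.]

Peptide identification (combined dataset)

[Figure: Venn diagram of identified peptides across four classes: non-synonymous variant, new transcript isoform, public (UniProt, Ensembl) and confirmed gene models. Counts shown in the diagram: 500, 164, 231, 8767, 0, 17586, 9707, 0, 6, 361, 17314, 0, 722, 321, 642.]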

901 peptides are not in curated public databases

PSM count of novel peptides present only in COPD or Control

[Figure: bar chart of PSM count (axis 0 to 25) for novel peptides detected in one group only, split into "higher in Control" and "higher in COPD"; the bars carry PSM counts between 4 and 9.]
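Selecting peptides that are present in only one group amounts to summing PSM counts per group and keeping peptides with a non-zero total in exactly one of them. The sketch below shows this under stated assumptions: the peptide names, sample names and counts are hypothetical, and the minimum of 4 PSMs is an assumed threshold suggested by the smallest bar label in the chart, not a value confirmed by the slides.

```python
from collections import defaultdict


def group_only_peptides(psm_counts, sample_group, min_psms=4):
    """Return {group: {peptide: total PSMs}} for group-exclusive peptides.

    psm_counts:   {peptide: {sample: PSM count}}
    sample_group: {sample: group label}
    min_psms:     assumed minimum total PSM count to report a peptide
    """
    result = defaultdict(dict)
    for peptide, per_sample in psm_counts.items():
        totals = defaultdict(int)
        for sample, count in per_sample.items():
            totals[sample_group[sample]] += count
        present = [group for group, total in totals.items() if total > 0]
        if len(present) == 1 and totals[present[0]] >= min_psms:
            result[present[0]][peptide] = totals[present[0]]
    return dict(result)


# Hypothetical samples and counts, for illustration only.
sample_group = {"copd_1": "COPD", "copd_2": "COPD",
                "ctrl_1": "Control", "ctrl_2": "Control"}
psm_counts = {
    "PEPTIDEA": {"copd_1": 3, "copd_2": 2},               # COPD only
    "PEPTIDEB": {"ctrl_1": 4, "ctrl_2": 0, "copd_1": 0},  # Control only
    "PEPTIDEC": {"copd_1": 2, "ctrl_1": 6},               # both groups
    "PEPTIDED": {"copd_2": 1},                            # too few PSMs
}
exclusive = group_only_peptides(psm_counts, sample_group)
print(exclusive)
```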

Group only “Novel peptides” identification: PEAKS score and SpectrumAI test

Peptides only present in COPD

Peptide sequence | Quality score | Gene | Effect | Ion support | Flanking ion support
ADSQDAGQETEKEGEDPQASAQDETPITSAK | 134.83 | AKAP12 | E1600D amino acid substitution | |
DSRPSQAAGDNQGDEVK | 132.86 | THRAP3 | A201V amino acid substitution | |
TGQEALSQTTISWAPFQDTSEYIISCHPVGTDEEPLQFR | 131.44 | FN1 | Native | NA | NA
AEHMETNAVGPSQSSDTR | 129.43 | MPRIP | P327Q amino acid substitution | |
EGEDPQASAQDETPITSAK | 129.34 | AKAP12 | E1600D amino acid substitution | |
DPAEGTPLEAAGTR | 118.25 | SORBS1 | Splice variant, exon extension | NA | NA
ELTVSNNDINEAGVHVLCQGLK | 115.80 | RNH1 | R188H amino acid substitution | |
TLEGLQVEEEPVYK | 106.37 | HCLS1 | E361K amino acid substitution | |
EASQGSSASSAPQSVK | 105.60 | ZNF830 | H99Q amino acid substitution | |
SDSELNNEVAAR | 103.81 | CYBRD1 | S266N amino acid substitution | |
EQALQEAMEQLEELELER | 100.83 | SWAP70 | Q505E amino acid substitution | |
EQQNDTSSELQNR | 91.55 | ZFYVE16 | I192T amino acid substitution | |
YGFQFLR | 81.46 | PCYOX1 | S149F amino acid substitution | |
NKYEDEINR | 70.58 | KRT7 | H186R amino acid substitution | |
EPVSGAVEGK | 62.17 | CAVIN2 | Q174E amino acid substitution | |
NVSSNPCHEAVGIK | 56.53 | SORBS1 | Splice variant, exon extension | NA | NA
YEDEINR | 49.80 | KRT7 | H186R amino acid substitution | |

Peptides only present in Control

Peptide sequence | Quality score | Gene | Effect | Ion support | Flanking ion support
ISYGPDWKDFYVVEPLAFEGTPEQK | 120.51 | HEXA | I447V amino acid substitution | |
ISNSAAYSGSVAPANSALGQTQPSDQDTLVQR | 110.84 | PDLIM5 | T410A amino acid substitution | |
LHSLTQAKEESEK | 110.58 | RRBP1 | L1043H amino acid substitution | | -
GGGAGFISGLTYLELDNPAGNK | 108.34 | C7 | S389T amino acid substitution | |
ASQSVSSNYLAWYQQKPGQAPR | 105.88 | IGKV3-20 | S52N amino acid substitution | |
QTLEKENTDLAGELR | 77.52 | MYH11 | A1241T amino acid substitution | |
NSLFLQMNSLR | 76.68 | IGHV3-43 | Y99F amino acid substitution | |
LLIYWASAR | 68.41 | IGKV4-1 | T79A amino acid substitution | |
LLEDLR | 33.62 | OTOA/PDE4DIP | Native | NA | NA

Zhu Y, Orre LM, Johansson HJ, Huss M, Boekel J, Vesterlund M, Fernandez-Woodbridge A, Branca RMM, Lehtiö J. Discovery of coding regions in the human genome by integrated proteogenomics analysis workflow. Nature Commun. 2018 Mar 2;9(1):903.

Validation of non-reference peptides with synthetic peptides MS/MS spectra

[Figure: MS/MS spectra panels for SORBS1, AKAP12, SORBS1 and SWAP70; the SWAP70 panel shows EQALQEAMEQLEELELER with the Q505E amino acid substitution.]

Two peptides uniquely mapping to a novel exon of SORBS1 gene

• SORBS1 is encoded on the minus strand of chr10 (band 10q24.1), between 95.31 and 95.56 Mbp (gene length: 249.64 kb)
• The new exon lies between exon 7 and exon 8 and adds 239 AAs between residues 270 and 271
• 2 peptides were found by MS/MS analysis

Sequence regions labelled on the slide: exons 1-7, new exon, exons 8 and further.

>NEWP24530
MSSECDGGSKAVMNGLAPGSNGQDKATADPLRARSISAVKIIPVKTVKNASGLVLPTDMD
LTKICTGKGAVTLRASSSYRETPSSSPASPQETRQHESKPGLEPEPSSADEWRLSSSADA
NGNAQPSSLAAKGYRSVHPNLPSDKSQDATSSSAAQPEVIVVPLYLVNTDRGQEGTARPP
TPLGPLGCVPTIPATASAASPLTFPTLDDFIPPHLQRWPHHSQPARASGSFAPISQTPPS
FSPPPPLVPPAPEDLRRVSEPDLTGAVSSTVLSPPRPPLPQKDRFAWQSPTIHNTYKDSL
YLSSPKPYVPLGTPRQQNPSQPQPISVLLAAGSAPKGVVCPGSLLPDSTFPSASSQPQQR
YAATRTVYHKNVSSNPCHEAVGIKKVSSLYVPCLSNNICLAASENSSRVARDPAEGTPLE
AAGTRAPAPGLVSRTAGTGKPPPAPPPDPPKLFFDIRKDAVNRGESPSLGTQASFPDVRP
PVLGPRVTSDPENRKSKESYLLQPSYPAKDSSPLLNEVSSSLIGTDSQAFPSVSKPSSAY
PSTTIVNPTIVLLQHNREQQKRLSSLSDPVSERRVGEQDSAPTQEKPTSPGKAIEKRAK…

Protein domains of SORBS1 splice variants

Native SORBS1 (1266 AA): Sorb, SH3, SH3, SH3
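The SORBS1 numbers can be checked for internal consistency. The sketch below locates the two SORBS1 splice-variant peptides from the COPD peptide table in the NEWP24530 sequence and tests whether they fall inside the inserted region. Two things are derived here rather than stated on the slides: the fragment start at residue 361 (from the 60-residue line layout of the printed sequence) and the new-exon range 271 to 509 (from "adds 239 AAs between residues 270 and 271").

```python
# Lines 7 and 8 of the printed NEWP24530 sequence, i.e. residues 361-480
# under the assumption of 60 residues per printed line.
FRAGMENT_START = 361
FRAGMENT = (
    "YAATRTVYHKNVSSNPCHEAVGIKKVSSLYVPCLSNNICLAASENSSRVARDPAEGTPLE"
    "AAGTRAPAPGLVSRTAGTGKPPPAPPPDPPKLFFDIRKDAVNRGESPSLGTQASFPDVRP"
)

NATIVE_LENGTH = 1266        # native SORBS1 length from the domain scheme
INSERT_LENGTH = 239         # residues added by the new exon
NEW_EXON = (271, 270 + INSERT_LENGTH)  # derived range: residues 271-509

PEPTIDES = ["NVSSNPCHEAVGIK", "DPAEGTPLEAAGTR"]


def locate(peptide):
    """Return the 1-based (first, last) residue positions, or None."""
    index = FRAGMENT.find(peptide)
    if index < 0:
        return None
    first = FRAGMENT_START + index
    return first, first + len(peptide) - 1


def is_tryptic(peptide):
    """True if the peptide follows K/R and ends in K/R within the fragment."""
    index = FRAGMENT.find(peptide)
    return index > 0 and FRAGMENT[index - 1] in "KR" and peptide[-1] in "KR"


positions = {p: locate(p) for p in PEPTIDES}
for peptide, (first, last) in positions.items():
    inside = NEW_EXON[0] <= first and last <= NEW_EXON[1]
    print(peptide, first, last, "in new exon:", inside,
          "tryptic:", is_tryptic(peptide))

# Length of the novel variant implied by the insertion.
print(NATIVE_LENGTH + INSERT_LENGTH)
```

Both peptides map inside the derived new-exon range, and the implied variant length of 1505 residues agrees with the length given for the novel SORBS1 splice variant in the domain scheme.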

Novel SORBS1 splice variant (1505 AA; new exon 8, +239 AA): Atrophin-1 superfamily, Sorb, SH3, SH3, SH3

Cancer moonshot for melanoma

• Clinically well characterized patients
• Biobanks
• Tumor sections: melanoma lymph node metastasis
• BRAF V600E, Vemurafenib
• Multi-omics profiles & digital pathology: proteomics, genomics, digital pathology

PMIDs: 31599373, 30914758, 30900145

491.085 Da

Fundings and Collaborations

Yanick Hagemeijer, Ekaterina Ovchinnikova, Victor Guryev, Corry-Anke Brandsma, Maria Yakovleva, Maarten van den Berge, Melinda Rezeli, Wim Timens, Jonatan Eriksson, Karin Wolters, Thomas Fehniger, Alejandro Sanchez Brotons, Dirkje Postma, György Markó-Varga, Karel Gerbrands, Gyorgy Halmos, Ana Ciconelle, Renee Vehoeven, Hjalmar Permentier, Rainer Bischoff, Frank Suits