Computational Genomics

Francisco García García

BIER

[email protected]
Máster en Biotecnología Biomédica. UPV

Why are we interested in Computational Genomics?

The overall goal: apply computational methods to biomedical and biotechnological problems.

Research interests:
• Development and application of novel methods aimed at discovering new drugs
• Identification of genes or proteins that may be considered therapeutic targets
• Personalized medicine: tools for discovery and diagnosis

Omics sciences

• Genomics
• Transcriptomics
• Metabolomics
• Lipidomics
• Proteomics
• Epigenomics

High-throughput technologies: microarrays

How do these technologies work?

High-throughput technologies: Next Generation Sequencing

How do these technologies work?

Reference genome

Clinical and biological databases

Biological knowledge:
• KEGG pathways
• Gene Ontology
• Regulatory elements (miRNA, CisRed)
• InterPro motifs
• Transcription Factor Binding Sites
• Biocarta pathways
• Gene expression in tissues
• Bioentities from literature

Clinical knowledge:
• ClinVar
• HUMSAVAR
• HGMD
• COSMIC

Personalized Medicine

Description of the sessions

3 sessions (7 hours) on the use of web tools for the analysis and interpretation of sequencing data.

All the documentation (presentations + exercises) that we will need during these days will be available at http://bioinfo.cipf.es/mbb/ and also on Poliformat.

Instructors: Marta Hidalgo and Paco García.

The sessions will take a practical approach, and we will introduce only the concepts needed for the exercises.

Programme

Session 1
• Introduction to NGS technologies.
• Genomic variation detection studies. Genomic data analysis pipeline.
• How do we detect mutations of interest in whole-exome studies? Exercises with the BiERapp web tool.

Session 2
• Genomic variation studies: targeted genomic sequencing.
• How do we design a gene panel? How do we analyze and interpret gene panel data? Exercises with TEAM.
• Spanish genetic variability. The CSVS database.
• Transcriptomic studies with NGS data. Expression data analysis pipeline. How do we analyze RNA-Seq data with the Babelomics suite?

Session 3
• Analysis of transcriptomic data in the context of signaling pathways.
• Exercises with the hipathia and PathAct web tools.

Web tools to analyze omic data

NGS Data Analysis Pipeline

Common steps: Fastq → Sequence preprocessing → Fastq → Alignment → BAM → Visualization

Resequencing data analysis: BAM → Variant calling → VCF → Variant annotation → Prioritization

RNA-Seq data analysis: BAM → RNA-Seq processing → Count matrix → RNA-Seq data analysis → Functional analysis

Fastq format

We could say “it is a fasta with qualities”. Each record has four lines:
1. Header (like the fasta header, but starting with “@”)
2. Sequence (string of nucleotides)
3. “+”, optionally followed by the sequence ID
4. Encoded quality of the sequence

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
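The four-line record layout can be read with a few lines of Python. This is a minimal sketch: the slide does not say which quality encoding is used, so the Phred+33 offset below is an assumption, and `parse_fastq` and `FastqRecord` are illustrative names, not part of any tool mentioned in these sessions.

```python
from typing import Iterator, NamedTuple


class FastqRecord(NamedTuple):
    seq_id: str
    sequence: str
    quality: list  # one integer Phred score per base


def parse_fastq(lines, offset: int = 33) -> Iterator[FastqRecord]:
    """Yield one record per group of four lines: header, sequence, '+', quality."""
    for i in range(0, len(lines) - 3, 4):
        header, sequence, plus, quality = (line.strip() for line in lines[i:i + 4])
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality strings differ in length")
        # Assumption: Phred+33, i.e. score = ASCII code of the character - 33.
        scores = [ord(ch) - offset for ch in quality]
        yield FastqRecord(header[1:], sequence, scores)


example = """@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
""".splitlines()

record = next(parse_fastq(example))
print(record.seq_id, record.quality[:5])
```

Under the Phred+33 assumption, the first quality character “!” decodes to a score of 0, the lowest possible value.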

BAM/SAM format

@PG ID:HPG-Aligner VN:1.0
@SQ SN:20 LN:63025520

HWI-ST700660_138:2:2105:7292:79900#2@0/1 16 20 76703 254 76= * 0 0 GTTTAGATACTGAAAGGTACATACTTCTTTGTAGGAACAAGCTATCATGCTGCATTTCTATAATATCACATGAATA GIJGJLGGFLILGGIEIFEKEDELIGLJIHJFIKKFELFIKLFFGLGHKKGJLFIIGKFFEFFEFGKCKFHHCCCF AS:i:254 NH:i:1 NM:i:0

HWI-ST700660_138:2:2208:6911:12246#2@0/1 16 20 76703 254 76= * 0 0 GTTTAGATACTGAAAGGTACATACTTCTTTGTAGGAACAAGCTATCATGCTGCATTTCTATAATATCACATGAATA HHJFHLGFFLILEGIKIEEMGEDLIGLHIHJFIKKFELFIKLEFGKGHEKHJLFHIGKFFDFFEFGKDKFHHCCCF AS:i:254 NH:i:1 NM:i:0

HWI-ST700660_138:2:1201:2973:62218#2@0/1 0 20 76655 254 76M * 0 0 AACCCCAAAAATGTTGGAAGAATAATGTAGGACATTGCAGAAGACGATGTTTAGATACTGAAAGGGACATACTTCT FEFFGHHHGGHFKCCJKFHIGIFFIFLDEJKGJGGFKIHLFIJGIEGFLDEDFLFGEIIMHHIKL$BBGFFJIEHE AS:i:254 NH:i:1 NM:i:1

HWI-ST700660_138:2:1203:21395:164917#2@0/1 256 20 68253 254 4M1D72M * 0 0 NCACCCATGATAGACCAGTAAAGGTGACCACTTAAATTCCTTGCTGTGCAGTGTTCTGTATTCCTCAGGACACAGA #4@ADEHFJFFEJDHJGKEFIHGHBGFHHFIICEIIFFKKIFHEGJEHHGLELEGKJMFGGGLEIKHLFGKIKHDG AS:i:254 NH:i:3 NM:i:1

HWI-ST700660_138:2:1105:16101:50526#6@0/1 16 20 126103 246 53M4D23M * 0 0 AAGAAGTGCAAACCTGAAGAGATGCATGTAAAGAATGGTTGGGCAATGTGCGGCAAAGGGACTGCTGTGTTCCAGC FEHIGGHIGIGJI6FCFHJIFFLJJCJGJHGFKKKKGIJKHFFKIFFFKHFLKHGKJLJGKILLEFFLIHJIEIIB AS:i:368 NH:i:1 NM:i:4

SAM Specification: http://samtools.sourceforge.net/SAM1.pdf
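Each alignment line above follows the eleven mandatory SAM columns (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL) followed by optional tags. The sketch below splits one of the records shown and decodes its FLAG and CIGAR. The function names are illustrative, only three FLAG bits are listed, and splitting on any whitespace is a simplification, since real SAM files are tab-separated.

```python
import re

SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

# A few of the FLAG bits defined in the SAM specification.
FLAG_BITS = {0x4: "unmapped", 0x10: "reverse strand", 0x100: "secondary alignment"}


def parse_sam_line(line):
    """Split one alignment line into the 11 mandatory fields plus optional tags."""
    parts = line.split()  # simplification: real SAM is tab-separated
    record = dict(zip(SAM_FIELDS, parts[:11]))
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(record[key])
    record["TAGS"] = parts[11:]
    return record


def decode_flag(flag):
    """Return the names of the known bits that are set in FLAG."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def reference_span(cigar):
    """Reference bases covered: M, D, N, = and X consume the reference."""
    ops = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    return sum(int(n) for n, op in ops if op in "MDN=X")


line = ("HWI-ST700660_138:2:1203:21395:164917#2@0/1 256 20 68253 254 4M1D72M * 0 0 "
        "NCACCCATGATAGACCAGTAAAGGTGACCACTTAAATTCCTTGCTGTGCAGTGTTCTGTATTCCTCAGGACACAGA "
        "#4@ADEHFJFFEJDHJGKEFIHGHBGFHHFIICEIIFFKKIFHEGJEHHGLELEGKJMFGGGLEIKHLFGKIKHDG "
        "AS:i:254 NH:i:3 NM:i:1")

rec = parse_sam_line(line)
print(rec["RNAME"], rec["POS"], decode_flag(rec["FLAG"]), reference_span(rec["CIGAR"]))
```

Header lines (those starting with “@”, such as @PG and @SQ) are not alignment records and should be skipped before calling `parse_sam_line`.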

VCF format

http://www.1000genomes.org/

Counts

[Figure: count matrix indexed by gene and sample]

Transcriptomic Studies

RNA-Seq Data Analysis Pipeline

Primary:
1. Sequence preprocessing
2. Mapping
3. Quantification

Secondary:
4. Normalization
5. Differential expression
6. Functional Profiling
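As an illustration of step 4, the sketch below applies RPKM (reads per kilobase of transcript per million mapped reads), one of the normalization options shown later for Babelomics. It assumes that the library size of a sample is the sum of its counts in the matrix. The counts and gene lengths are made-up toy values, and `rpkm` is an illustrative name rather than a Babelomics function.

```python
def rpkm(counts, gene_lengths):
    """counts[sample][gene] -> raw read count; gene_lengths[gene] -> length in bp."""
    normalized = {}
    for sample, gene_counts in counts.items():
        # Assumption: library size = total reads counted for this sample.
        total_reads = sum(gene_counts.values())
        if total_reads == 0:
            raise ValueError(f"sample {sample} has no reads")
        normalized[sample] = {
            gene: 1e9 * n / (total_reads * gene_lengths[gene])
            for gene, n in gene_counts.items()
        }
    return normalized


# Toy data (hypothetical): geneB is three times longer and gets three times the reads.
toy_counts = {"S1": {"geneA": 500, "geneB": 1500}}
toy_lengths = {"geneA": 1000, "geneB": 3000}

result = rpkm(toy_counts, toy_lengths)
print(result["S1"])
```

In the toy data both genes end up with the same RPKM, which is the point of the length correction: a longer gene collects more reads at the same expression level.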

Babelomics 5

http://babelomics.bioinfo.cipf.es/

Babelomics 5: analyzing omics data + functional profiling

Differential Expression

UPLOAD DATA → EDIT DATA → NORMALIZATION + DIFFERENTIAL EXPRESSION → FUNCTIONAL PROFILING

Supervised and Unsupervised Classification

UPLOAD DATA → NORMALIZE DATA (RPKM, TMM) → EDIT DATA → CLUSTERING / PREDICTORS

Signaling Pathways Analysis

http://hipathia.babelomics.org/

Genomic Variation Studies

Genomics Data Analysis Pipeline

Primary analysis:
1. Sequence preprocessing
2. Mapping
3. Variant calling

Secondary:
4. Variant prioritization

How do we prioritize variants in whole-exome studies?

http://courses.babelomics.org/bierapp/

BiERapp: Discovering variants

Introduction

• Whole-exome sequencing has become a fundamental tool for discovering disease-related genes in familial diseases, but it is difficult to find the causal mutation among the enormous background.
• Different scenarios call for different, immediate prioritization strategies.
• A vast amount of biological knowledge is available in many databases.
• We need a tool that integrates this information and filters it immediately to select candidate variants related to the disease.

How does BiERapp work?

[Diagram: a multisample VCF file is loaded into BiERapp, which relies on VARIANT and CellBase and applies filterings]
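One kind of filtering on a multisample VCF can be sketched as follows: keep the variants carried by every affected sample and by no unaffected sample. This is only an illustration of the idea, not BiERapp's implementation. The sample names and variant lines are made-up toy data, and the code assumes GT is the first key of the FORMAT column, as the VCF specification requires when GT is present.

```python
def carries_alt(genotype_field):
    """True if the GT subfield contains at least one non-reference allele."""
    gt = genotype_field.split(":")[0]
    alleles = gt.replace("|", "/").split("/")
    return any(allele not in ("0", ".") for allele in alleles)


def filter_variants(vcf_lines, affected, unaffected):
    """Keep variants present in all affected samples and absent from all unaffected ones."""
    samples, kept = [], []
    for line in vcf_lines:
        if line.startswith("##"):
            continue  # meta-information lines
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the 9 fixed columns
            continue
        genotypes = dict(zip(samples, fields[9:]))
        in_all_affected = all(carries_alt(genotypes[s]) for s in affected)
        in_any_unaffected = any(carries_alt(genotypes[s]) for s in unaffected)
        if in_all_affected and not in_any_unaffected:
            kept.append((fields[0], int(fields[1]), fields[3], fields[4]))
    return kept


# Toy multisample VCF (hypothetical family trio and variants).
toy_vcf = [
    "##fileformat=VCFv4.2",
    "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO",
               "FORMAT", "mother", "father", "child"]),
    "\t".join(["20", "76703", ".", "G", "A", "50", "PASS", ".", "GT", "0/0", "0/0", "0/1"]),
    "\t".join(["20", "126103", ".", "A", "T", "50", "PASS", ".", "GT", "0/1", "0/0", "0/1"]),
]

candidates = filter_variants(toy_vcf, affected=["child"], unaffected=["mother", "father"])
print(candidates)
```

Other scenarios (recessive or dominant inheritance, population frequency thresholds, predicted effect) would swap in a different condition at the same place in the loop.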

Input: VCF file

Primary analysis:
1. Sequence preprocessing
2. Mapping
3. Variant calling (output: VCF files)

Secondary:
4. Variant prioritization (BiERapp)

Can I interpret sequencing data for diagnosis?

http://courses.babelomics.org/team/

TEAM: Targeted Enrichment Analysis and Management

Gene panel

[Diagram: sequencing data and biological knowledge (ClinVar, HUMSAVAR, HGMD, COSMIC) are combined in TEAM to support a diagnostic]

Gene panel

Inputs to TEAM:
1. VCF files
2. Gene panel

Databases: ClinVar, HGMD, HUMSAVAR, COSMIC
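Combining the two inputs amounts to keeping only the variants that fall inside the genes of the panel. A minimal sketch, assuming the panel is given as genomic intervals with 1-based inclusive coordinates. The gene names, coordinates and function name are hypothetical, and this is not how TEAM itself is implemented.

```python
def variants_in_panel(variants, panel):
    """variants: (chrom, pos) tuples; panel: (chrom, start, end, gene) intervals."""
    hits = []
    for chrom, pos in variants:
        for p_chrom, start, end, gene in panel:
            if chrom == p_chrom and start <= pos <= end:
                hits.append((chrom, pos, gene))
    return hits


# Hypothetical panel and variant positions, for illustration only.
toy_panel = [("20", 76000, 77000, "GENE_A"), ("20", 126000, 127000, "GENE_B")]
toy_variants = [("20", 76703), ("20", 68253), ("20", 126103)]

panel_hits = variants_in_panel(toy_variants, toy_panel)
print(panel_hits)
```

The nested loop is fine for a panel of a few dozen genes. For larger region sets an interval index would be the usual choice.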

CSVS: CIBERER Spanish Variant Server

Repository of variant frequencies in the Spanish population

http://csvs.babelomics.org/

CSVS: local genetic variability

[Figure: CSVS tool interface]

Genome Maps

Genome viewer that interacts with functional databases

http://genomemaps.org/

Genome Maps: a next-generation web-based genome browser

[Figure: Genome Maps tool interface]

Cell Maps

Tool for modelling and visualizing biological networks

http://cellmaps.babelomics.org/

Cell Maps: visualizing and integrating biological networks

1) It is a tool for integrating, visualizing and analyzing biological networks.
2) The input is a file describing the relationships between the nodes of our network. Optionally, we can include a file with the attributes of each node.
3) The graphical output is a network showing the relationships between its nodes.

Tutorial: https://github.com/opencb/cell-maps/wiki
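To make the two inputs concrete, the sketch below reads a list of node relationships and an optional node-attribute list into plain Python structures. The tab-separated layout (“source, target” and “node, attribute”) is an assumption made for illustration. The formats that Cell Maps actually accepts are described in its tutorial.

```python
from collections import defaultdict


def read_network(edge_lines, attribute_lines=()):
    """Build an undirected adjacency map and a node -> attribute map."""
    neighbours = defaultdict(set)
    for line in edge_lines:
        line = line.strip()
        if not line:
            continue
        source, target = line.split("\t")[:2]
        neighbours[source].add(target)
        neighbours[target].add(source)

    attributes = {}
    for line in attribute_lines:
        line = line.strip()
        if not line:
            continue
        node, value = line.split("\t")[:2]
        attributes[node] = value
    return dict(neighbours), attributes


# Hypothetical toy network: node names and attributes are placeholders.
toy_edges = ["A\tB", "A\tC", "B\tD"]
toy_attributes = ["A\tgroup1", "D\tgroup2"]

network, node_attributes = read_network(toy_edges, toy_attributes)
degree = {node: len(links) for node, links in network.items()}
print(degree)
```

The degree of each node (its number of connections) is the kind of simple network property that can then be mapped onto the visualization.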

[Figure: Cell Maps tool interface]

[Figure: Cell Maps inputs]

[Figure: Cell Maps outputs]

Omics Data Integration from a Systems Biology perspective

Omics Data Integration

Patient → Technologies → Data Analysis → Integration and interpretation → Molecular and clinical model

Multidimensional Gene Set Analysis

MicroRNA-Seq & mRNA-Seq

[Diagram: case/control microRNA-Seq data give a ranking index per miRNA (miRNA1 0.5, miRNA2 1.2, miRNA3 1.3, miRNA4 1.7, ...) and case/control mRNA-Seq data give a ranking index per gene (Gene1 0.01, Gene2 0.04, Gene3 0.09, Gene4 0.2, ...). The ranking indexes and the functional annotation (GOs, KEGGs, InterPRO) enter a logistic regression that identifies patterns.]

Functional Meta-Analysis

N mRNA-Seq studies

For each study: mRNA-Seq (case vs. control) → Differential Expression → Functional Profiling (GOs, KEGGs, InterPRO)

The functional profiling results of the N studies are then combined in a Meta-Analysis.

PATHiVAR: mutations and expression

• PATHiVAR estimates the functional impact that mutations have on the human signalling network.
• PATHiVAR:
  • Analyses VCF files
  • Extracts the deleterious mutations
  • Locates them on the signalling pathways in the selected tissue (with the appropriate expression pattern)
  • Provides a comprehensive, graphical and interactive view of the predicted signal transduction probabilities across the different signalling pathways

http://pathivar.babelomics.org/

How does PATHiVAR work?

[Diagram: PATHiVAR takes a VCF file together with the SIFT and PolyPhen settings, the inheritance pattern, and the selected pathways and tissues]

PATHiVAR

[Figure: CALCIUM SIGNALING PATHWAY]

Other resources for Genomic Data Analysis

Primary analysis:
1. Sequence preprocessing: HPG Pore
2. Mapping: HPG Aligner
3. Variant calling: HPG Variant

Secondary:
4. Variant prioritization: BiERapp

CellBase

http://www.opencb.org/

Any questions?

[email protected] Francisco García García