Computational Genomics
Francisco García García
BIER
[email protected] Máster en Biotecnología Biomédica. UPV
Why are we interested in Computational Genomics?
The overall goal: Apply computational methods to biomedical and biotechnological problems
Research interests:
• The development and application of novel bioinformatics methods aimed at discovering new drugs
• Identification of genes or proteins that may be considered therapeutic targets
• Personalized medicine: tools for discovery and diagnosis
Introduction Why Computational Genomics? Computational Genomics
Genomics Transcriptomics
Metabolomics Lipidomics
Proteomics Epigenomics
Introduction Omics sciences Computational Genomics
How do these technologies work?
Introduction High throughput technologies: microarrays Computational Genomics
How do these technologies work?
Reference genome
Introduction High throughput technologies: Next Generation Sequencing Computational Genomics
Biological knowledge: KEGG pathways, Biocarta pathways, Gene Ontology, InterPro motifs, regulatory elements (miRNA, CisRed, Transcription Factor Binding Sites), gene expression in tissues, bioentities from literature
Clinical knowledge: ClinVar, HUMSAVAR, HGMD, COSMIC
Introduction Clinical and biological databases Computational Genomics
Introduction Personalized Medicine Computational Genomics
Description of the sessions
3 sessions (7 hours) on the use of web tools for the analysis and interpretation of sequencing data.
All the documentation (presentations + exercises) that we will need during these days will be available at this link: http://bioinfo.cipf.es/mbb/. It is also available on Poliformat.
Instructors: Marta Hidalgo and Paco García.
The sessions will take a practical approach, and we will introduce only the concepts needed for the exercises.
Introduction Máster en Biotecnología Biomédica. UPV. Program
Session 1
• Introduction to NGS technologies.
• Genomic variation detection studies. Genomic data analysis pipeline.
• How do we detect mutations of interest in whole-exome studies? Exercises with the BiERapp web tool.
Session 2
• Genomic variation studies: targeted genomic sequencing. How do we design a gene panel? How do we analyze and interpret gene panel data? Exercises with TEAM.
• Spanish genetic variability. The CSVS database.
• Transcriptomic studies with NGS data. Expression data analysis pipeline. How do we analyze RNA-Seq data with the Babelomics suite?
Session 3
• Analysis of transcriptomic data in the context of signaling pathways. Exercises with the hipathia and PathAct web tools.
Introduction Máster en Biotecnología Biomédica. UPV. Web tools to analyze omic data
NGS Data Analysis Pipeline
Common steps: Fastq → Sequence preprocessing → Fastq → Alignment → BAM → Visualization
Resequencing Data Analysis: BAM → Variant calling → VCF → Variant annotation → Prioritization
RNA-Seq Data Analysis: BAM → RNA-Seq processing → Count matrix → RNA-Seq data analysis → Functional analysis
Introduction NGS data analysis: pipelines Fastq format
We could say “it is a fasta with qualities”. Each record has four lines:
1. Header (like the fasta header, but starting with “@”)
2. Sequence (string of nucleotides)
3. “+”, optionally followed by the sequence ID
4. Encoded quality of the sequence
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
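The four-line record structure can be read with a few lines of Python. This is a minimal sketch, assuming the qualities use the Phred+33 (Sanger) ASCII offset, which the slide does not state; `parse_fastq` and `phred_scores` are illustrative names, not functions from any library.

```python
from typing import Iterator, List, Tuple


def parse_fastq(lines: List[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality) from four-line FASTQ records.

    A trailing incomplete record (fewer than four lines) is ignored.
    """
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = (line.rstrip("\n") for line in lines[i:i + 4])
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        yield header[1:], seq, qual


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode an ASCII quality string into integer scores (assumes Phred+33)."""
    return [ord(ch) - offset for ch in quality]


# The example record from the slide.
record = [
    "@SEQ_ID",
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT",
    "+",
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65",
]
read_id, seq, qual = next(parse_fastq(record))
scores = phred_scores(qual)
print(read_id, scores[:4])  # SEQ_ID [0, 6, 6, 9]
```

Under the Phred+33 assumption, "!" is the lowest possible quality (0), so the first base of this read is the least reliable one.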
Introduction NGS data analysis: files format BAM/SAM format
@PG ID:HPG-Aligner VN:1.0
@SQ SN:20 LN:63025520
HWI-ST700660_138:2:2105:7292:79900#2@0/1 16 20 76703 254 76= * 0 0 GTTTAGATACTGAAAGGTACATACTTCTTTGTAGGAACAAGCTATCATGCTGCATTTCTATAATATCACATGAATA GIJGJLGGFLILGGIEIFEKEDELIGLJIHJFIKKFELFIKLFFGLGHKKGJLFIIGKFFEFFEFGKCKFHHCCCF AS:i:254 NH:i:1 NM:i:0
HWI-ST700660_138:2:2208:6911:12246#2@0/1 16 20 76703 254 76= * 0 0 GTTTAGATACTGAAAGGTACATACTTCTTTGTAGGAACAAGCTATCATGCTGCATTTCTATAATATCACATGAATA HHJFHLGFFLILEGIKIEEMGEDLIGLHIHJFIKKFELFIKLEFGKGHEKHJLFHIGKFFDFFEFGKDKFHHCCCF AS:i:254 NH:i:1 NM:i:0
HWI-ST700660_138:2:1201:2973:62218#2@0/1 0 20 76655 254 76M * 0 0 AACCCCAAAAATGTTGGAAGAATAATGTAGGACATTGCAGAAGACGATGTTTAGATACTGAAAGGGACATACTTCT FEFFGHHHGGHFKCCJKFHIGIFFIFLDEJKGJGGFKIHLFIJGIEGFLDEDFLFGEIIMHHIKL$BBGFFJIEHE AS:i:254 NH:i:1 NM:i:1
HWI-ST700660_138:2:1203:21395:164917#2@0/1 256 20 68253 254 4M1D72M * 0 0 NCACCCATGATAGACCAGTAAAGGTGACCACTTAAATTCCTTGCTGTGCAGTGTTCTGTATTCCTCAGGACACAGA #4@ADEHFJFFEJDHJGKEFIHGHBGFHHFIICEIIFFKKIFHEGJEHHGLELEGKJMFGGGLEIKHLFGKIKHDG AS:i:254 NH:i:3 NM:i:1
HWI-ST700660_138:2:1105:16101:50526#6@0/1 16 20 126103 246 53M4D23M * 0 0 AAGAAGTGCAAACCTGAAGAGATGCATGTAAAGAATGGTTGGGCAATGTGCGGCAAAGGGACTGCTGTGTTCCAGC FEHIGGHIGIGJI6FCFHJIFFLJJCJGJHGFKKKKGIJKHFFKIFFFKHFLKHGKJLJGKILLEFFLIHJIEIIB AS:i:368 NH:i:1 NM:i:4
SAM Specification: http://samtools.sourceforge.net/SAM1.pdf
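The alignment lines above can be decoded field by field. The sketch below is illustrative rather than a complete SAM parser: it reads the mandatory columns, tests two FLAG bits (0x10 for the reverse strand, 0x100 for a secondary alignment) and expands the CIGAR string. The function names and the shortened example record are hypothetical; the record only mimics the layout of the reads shown on the slide.

```python
import re
from typing import Dict, List, Tuple

# Bits of the SAM FLAG field used below (values from the SAM specification).
FLAG_REVERSE = 0x10     # read aligned to the reverse strand
FLAG_SECONDARY = 0x100  # secondary alignment

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # operations that consume reference bases


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string such as '4M1D72M' into (length, operation) pairs."""
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]


def parse_sam_line(line: str) -> Dict[str, object]:
    """Parse the mandatory SAM fields plus optional TAG:TYPE:VALUE fields."""
    # Real SAM files are tab-separated; the slide text uses spaces,
    # so this sketch splits on any whitespace.
    fields = line.split()
    flag = int(fields[1])
    cigar = parse_cigar(fields[5])
    tags = {}
    for tag in fields[11:]:
        name, _type, value = tag.split(":", 2)
        tags[name] = value
    return {
        "qname": fields[0],
        "chrom": fields[2],
        "pos": int(fields[3]),  # 1-based leftmost mapping position
        "mapq": int(fields[4]),
        "reverse": bool(flag & FLAG_REVERSE),
        "secondary": bool(flag & FLAG_SECONDARY),
        "cigar": cigar,
        "ref_span": sum(n for n, op in cigar if op in REF_CONSUMING),
        "tags": tags,
    }


# Hypothetical shortened record in the same layout as the slide's reads.
example = "read1 16 20 76703 254 4M1D4M * 0 0 ACGTACGT IIIIIIII AS:i:254 NH:i:1 NM:i:1"
aln = parse_sam_line(example)
print(aln["chrom"], aln["pos"], aln["reverse"], aln["ref_span"])  # 20 76703 True 9
```

The deletion (`1D`) consumes one reference base without consuming a read base, which is why an 8-base read spans 9 reference positions here.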
Introduction NGS data analysis: files format VCF format
http://www.1000genomes.org/
Introduction NGS data analysis: files format Counts
[Figure: count matrix indexed by sample and gene]
Introduction NGS data analysis: files format Transcriptomic Studies
RNA-Seq Data Analysis Pipeline
Primary
1. Sequence preprocessing
2. Mapping
3. Quantification
Secondary
4. Normalization
5. Differential expression
6. Functional Profiling
Babelomics 5 RNA-Seq Data Analysis Babelomics 5
http://babelomics.bioinfo.cipf.es/
Babelomics 5 Analyzing omics data + functional profiling Differential Expression
UPLOAD DATA → EDIT DATA → NORMALIZATION + DIFFERENTIAL EXPRESSION → FUNCTIONAL PROFILING
Babelomics 5 Analyzing omics data + functional profiling Supervised and Unsupervised Classification
UPLOAD DATA → NORMALIZE DATA (RPKM, TMM) → EDIT DATA → CLUSTERING / PREDICTORS
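RPKM, one of the normalization options named above, rescales raw counts by gene length and by library size. The sketch below uses the usual definition (reads × 10^9 / (gene length in bp × total mapped reads)). It is not Babelomics code, the toy counts and gene lengths are invented for illustration, and the library size is approximated by the sum of the counts in the matrix. TMM is a different method and is not shown.

```python
from typing import Dict, List


def rpkm(counts: Dict[str, List[int]], lengths: Dict[str, int]) -> Dict[str, List[float]]:
    """Reads Per Kilobase of transcript per Million mapped reads.

    counts:  gene -> raw read counts, one value per sample.
    lengths: gene -> gene (or transcript) length in base pairs.
    """
    n_samples = len(next(iter(counts.values())))
    # Library size of each sample: total reads assigned to any gene.
    totals = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    return {
        gene: [row[j] * 1e9 / (lengths[gene] * totals[j]) for j in range(n_samples)]
        for gene, row in counts.items()
    }


# Hypothetical toy count matrix: 3 genes x 2 samples.
counts = {"geneA": [100, 200], "geneB": [300, 600], "geneC": [600, 1200]}
lengths = {"geneA": 1000, "geneB": 2000, "geneC": 4000}
normalized = rpkm(counts, lengths)
for gene, values in normalized.items():
    print(gene, values)
```

In this toy matrix the second sample was sequenced twice as deeply as the first, so the raw counts differ by a factor of two while the RPKM values of each gene are the same in both samples.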
Babelomics 5 Analyzing omics data + functional profiling Signaling Pathways Analysis
http://hipathia.babelomics.org/
hipathia Signaling Pathways Analysis Genomic Variation Studies
Genomics Data Analysis Pipeline
Primary Analysis
1. Sequence preprocessing
2. Mapping
3. Variant calling
Secondary
4. Variant prioritization
Pipeline Resequencing Data Analysis How do we prioritize variants in whole exome studies?
http://courses.babelomics.org/bierapp/
BiERapp Discovering variants Introduction
Whole-exome sequencing has become a fundamental tool for discovering the genes behind familial diseases, but finding the causal mutation among the enormous background of variants remains difficult.
There are different scenarios, so we need different prioritization strategies that can be applied immediately.
A vast amount of biological knowledge is available in many databases.
We need a tool that integrates this information and filters it immediately to select candidate variants related to the disease.
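One simple prioritization strategy of this kind keeps only the variants carried by every affected family member and by no unaffected one. The sketch below illustrates that single filter on multisample VCF data lines. It is not BiERapp's implementation, the sample names and variants are hypothetical, and it assumes a tab-separated VCF data line whose FORMAT column contains a `GT` key.

```python
from typing import Dict, List, Set


def parse_genotypes(vcf_line: str, samples: List[str]) -> Dict[str, Set[str]]:
    """Map each sample to the set of alleles in its GT field ('0' = reference)."""
    fields = vcf_line.rstrip("\n").split("\t")
    gt_index = fields[8].split(":").index("GT")  # FORMAT column
    genotypes = {}
    for sample, column in zip(samples, fields[9:]):
        gt = column.split(":")[gt_index]
        genotypes[sample] = set(gt.replace("|", "/").split("/"))
    return genotypes


def segregates(genotypes: Dict[str, Set[str]],
               affected: List[str], unaffected: List[str]) -> bool:
    """True if all affected samples carry a non-reference allele and no unaffected one does."""
    def carries_alt(sample: str) -> bool:
        return any(allele not in ("0", ".") for allele in genotypes[sample])

    return (all(carries_alt(s) for s in affected)
            and not any(carries_alt(s) for s in unaffected))


# Hypothetical multisample VCF data lines (CHROM POS ID REF ALT QUAL FILTER INFO FORMAT samples...).
samples = ["patient1", "patient2", "control1"]
lines = [
    "\t".join(["20", "76703", ".", "G", "A", "50", "PASS", ".", "GT", "0/1", "0/1", "0/0"]),
    "\t".join(["20", "126103", ".", "C", "T", "50", "PASS", ".", "GT", "0/1", "0/0", "0/1"]),
]
candidates = [
    line.split("\t")[1]
    for line in lines
    if segregates(parse_genotypes(line, samples), ["patient1", "patient2"], ["control1"])
]
print(candidates)  # ['76703']
```

Other scenarios (for example recessive inheritance or de novo variants) would need different genotype conditions, which is why a tool has to offer several filters.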
BiERapp Discovering variants How does BiERapp work?
[Diagram: multisample VCF file → BiERapp, with filterings, VARIANT and CellBase]
BiERapp Discovering variants Input: VCF file
Primary Analysis
1. Sequence preprocessing
2. Mapping
3. Variant calling → VCF files
Secondary
4. Variant prioritization: BiERapp
BiERapp Discovering variants Can I interpret sequencing data for diagnosis?
http://courses.babelomics.org/team/
TEAM Targeted Enrichment Analysis and Management Gene panel
[Diagram: sequencing data + biological knowledge (ClinVar, HUMSAVAR, HGMD, COSMIC) → TEAM → diagnosis]
TEAM Targeted Enrichment Analysis and Management Gene panel
[Diagram: inputs (1. VCF files, 2. Gene panel) → TEAM, using ClinVar, HGMD, HUMSAVAR and COSMIC]
TEAM Targeted Enrichment Analysis and Management CSVS: CIBERER Spanish Variant Server
Repository of variant frequencies in the Spanish population
http://csvs.babelomics.org/
CSVS CIBERER Spanish Variant Server CIBERER Spanish Variant Server
CSVS Local genetic variability Tool interface
http://csvs.babelomics.org/
CSVS CIBERER Spanish Variant Server Genome Maps Genome browser that interacts with functional databases
http://genomemaps.org/
Genome Maps A next-generation web-based genome browser Tool interface
Genome Maps A next-generation web-based genome browser Cell Maps Tool for modeling and visualizing biological networks
http://cellmaps.babelomics.org/
Cell Maps Visualizing and integrating biological networks Cell Maps
1) It is a tool for integrating, visualizing and analyzing biological networks.
2) The input is a file that specifies the relationships between the nodes of our network. Optionally, we can include a file with the attributes of each node.
3) The graphical output is a network showing the relationships between its nodes.
Tutorial: https://github.com/opencb/cell-maps/wiki
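To make the two inputs concrete, the sketch below builds a small network from an edge file and an optional node-attribute file. The tab-separated layouts (`source<TAB>target` and `node<TAB>attribute`) are assumptions made for illustration; the formats that Cell Maps actually accepts are described in the tutorial linked above. All node names and attribute values are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Set


def read_edges(text: str) -> Dict[str, Set[str]]:
    """Build an undirected adjacency map from 'source<TAB>target' lines."""
    adjacency: Dict[str, Set[str]] = defaultdict(set)
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        source, target = line.split("\t")
        adjacency[source].add(target)
        adjacency[target].add(source)
    return dict(adjacency)


def read_attributes(text: str) -> Dict[str, str]:
    """Read one 'node<TAB>attribute' pair per line."""
    attributes = {}
    for line in text.splitlines():
        if line.strip():
            node, value = line.split("\t")
            attributes[node] = value
    return attributes


# Hypothetical toy inputs.
edges_text = "geneA\tgeneB\ngeneA\tgeneC\ngeneC\tgeneD\n"
attributes_text = "geneA\tup\ngeneB\tdown\ngeneC\tup\ngeneD\tdown\n"

network = read_edges(edges_text)
attributes = read_attributes(attributes_text)
degree = {node: len(neighbours) for node, neighbours in network.items()}
for node in sorted(network):
    print(node, attributes.get(node), degree[node], sorted(network[node]))
```

The node degree computed here is the kind of simple network property that a graphical view makes visible at a glance.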
Cell Maps Visualizing and integrating biological networks Tool interface
Cell Maps Visualizing and integrating biological networks Cell Maps: inputs
Cell Maps Visualizing and integrating biological networks Cell Maps: outputs
Cell Maps Visualizing and integrating biological networks Omics Data Integration from a Systems Biology perspective
Omics Data Integration
Patient → Technologies → Data Analysis → Integration and interpretation → Molecular and clinical model
Introduction Omics Data Integration Multidimensional Gene Set Analysis
MicroRNA-Seq & mRNA-Seq
[Diagram: microRNA-Seq (case vs. control) yields miRNA patterns (miRNA1 0.5, miRNA2 1.2, miRNA3 1.3, miRNA4 1.7, ...); mRNA-Seq (case vs. control) yields a gene ranking index (Gene1 0.01, Gene2 0.04, Gene3 0.09, Gene4 0.2, ...); both feed a logistic regression together with functional annotation (GOs, KEGGs, InterPRO)]
Strategies Omics Data Integration Functional Meta-Analysis N mRNA-Seq studies
[Diagram: each of the N mRNA-Seq studies (case vs. control) → Differential Expression → Functional Profiling (GOs, KEGGs, InterPRO); the functional profiles of all studies are combined in a Meta-Analysis]
Strategies Omics Data Integration PATHiVAR: mutations and expression
PATHiVAR estimates the functional impact that mutations have on the human signalling network.
PATHiVAR:
• Analyses VCF files
• Extracts the deleterious mutations
• Locates them on the signalling pathways in the selected tissue (with the appropriate expression pattern)
• Provides a comprehensive, graphic and interactive view of the predicted signal transduction probabilities across the different signalling pathways.
http://pathivar.babelomics.org/
Strategies PATHiVAR How does PATHiVAR work?
[Diagram: VCF file → PATHiVAR, with SIFT, PolyPhen, inheritance pattern, pathways and tissues]
Strategies PATHiVAR PATHiVAR
CALCIUM SIGNALING PATHWAY
Strategies PATHiVAR Other resources for Genomic Data Analysis
Primary Analysis
1. Sequence preprocessing: HPG Pore
2. Mapping: HPG Aligner
3. Variant calling: HPG Variant
Secondary
4. Variant prioritization: BiERapp
Also shown: CellBase
http://www.opencb.org/
More resources NGS data analysis Any questions?
[email protected] Francisco García García