Genomes 7

Haplotype

The term haplotype denotes the combination of allelic variants along a chromosome or chromosomal segment that contains loci in linkage disequilibrium, that is, loci that are tightly associated with one another and are generally inherited together.

Haplotype: A series of polymorphisms that are close together in the genome. The distribution of alleles at each polymorphic site is nonrandom: the base at one position predicts with some accuracy the base at the adjacent position. Persons sharing a haplotype are related, often very distantly. Haplotypes in Europeans are generally of the order of tens of kilobases long; older populations, such as those of West Africa, tend to have shorter haplotypes, since a longer period of evolutionary time means more meiotic events and a greater chance of population admixture, both of which result in shorter haplotypes.

Over the course of many generations, segments of the ancestral chromosomes in an interbreeding population are shuffled through repeated recombination events. Some of the segments of the ancestral chromosomes occur as regions of DNA sequences that are shared by multiple individuals. These segments are regions of chromosomes that have not been broken up by recombination, and they are separated by places where recombination has occurred. These segments are the haplotypes that enable geneticists to search for genes involved in diseases and other medically important traits.

The fossil record and genetic evidence indicate that all humans today are descended from anatomically modern ancestors who lived in Africa about 150,000 years ago. Because we are a relatively young species, most of the variation in any current human population comes from the variation present in the ancestral human population. Also, as humans migrated out of Africa, they carried with them part but not all of the genetic variation that existed in the ancestral population. As a result, the haplotypes seen outside Africa tend to be subsets of the haplotypes inside Africa. In addition, haplotypes in non-African populations tend to be longer than in African populations, because populations in Africa have been larger through much of our history and recombination has had more time there to break up haplotypes.

As modern humans spread throughout the world, the frequency of haplotypes came to vary from region to region through random chance, natural selection, and other genetic mechanisms. As a result, a given haplotype can occur at different frequencies in different populations, especially when those populations are widely separated and unlikely to exchange much DNA through mating. Also, new changes in DNA sequences, known as mutations, have created new haplotypes, and most of the recently arising haplotypes have not had enough time to spread widely beyond the population and geographic region in which they originated.

Linkage and linkage disequilibrium

• Linkage: the physical association of alleles on chromosomes

• Linkage disequilibrium: the non-random association of alleles at different loci in gametes

LINKAGE EQUILIBRIUM: denotes a random combination of alleles at linked loci. Consider, for example, two linked loci 1 and 2, each with two possible alleles (A and a for locus 1, B and b for locus 2). Each possible haplotype in a given population (AB, Ab, aB, ab) will occur at a frequency equal to the product of the frequencies of the individual alleles that make up that haplotype.
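The product rule can be checked numerically. The following is a minimal sketch, assuming haplotype frequencies are already known; the function name and the example frequencies are illustrative, not taken from the slides. It returns the frequencies expected under equilibrium together with two commonly used summaries of the deviation from them, D = p(AB) − p(A)·p(B) and r².

```python
def haplotype_ld(p_AB, p_Ab, p_aB, p_ab):
    """Expected equilibrium frequencies, D and r^2 for two biallelic loci.

    Inputs are the observed frequencies of the four haplotypes
    (illustrative helper, not part of any library).
    """
    total = p_AB + p_Ab + p_aB + p_ab
    if abs(total - 1.0) > 1e-9:
        raise ValueError("haplotype frequencies must sum to 1")

    # Allele frequencies are the marginal sums of the haplotype frequencies.
    p_A = p_AB + p_Ab
    p_B = p_AB + p_aB

    # Under linkage equilibrium each haplotype frequency is a simple product.
    expected = {
        "AB": p_A * p_B,
        "Ab": p_A * (1 - p_B),
        "aB": (1 - p_A) * p_B,
        "ab": (1 - p_A) * (1 - p_B),
    }

    # D measures the deviation of the observed AB frequency from that product.
    D = p_AB - p_A * p_B
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    r2 = D * D / denom if denom > 0 else float("nan")
    return expected, D, r2


# Made-up example 1: alleles combine at random (equilibrium).
eq_expected, eq_D, eq_r2 = haplotype_ld(0.25, 0.25, 0.25, 0.25)
# Made-up example 2: only AB and ab haplotypes exist (complete disequilibrium).
ld_expected, ld_D, ld_r2 = haplotype_ld(0.5, 0.0, 0.0, 0.5)
print(eq_D, eq_r2, ld_D, ld_r2)
```

In the first example the observed and expected frequencies coincide, so D is zero; in the second, knowing the allele at one locus fully determines the allele at the other.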

LINKAGE DISEQUILIBRIUM: denotes a non-random combination of alleles at linked loci. Linkage disequilibrium is often the consequence of a founder effect, that is, of a mutation arising in a single individual. For the founder effect to be detectable in a population, the two loci must be close together, so that recombination events between them are rare, and not enough time must have passed since the founder appeared, because over time recombination can restore equilibrium.

Linkage disequilibrium (LD) denotes a statistical association between specific alleles at two or more loci. These alleles usually make up a particular ancestral haplotype, which is widespread in the population where it is detected because it has been transmitted down the generations from a common ancestor.

For this reason, linkage disequilibrium is greater in homogeneous populations, that is, populations that originated from a small group of founder individuals, such as the Sardinian or Finnish populations. Recent migrations, by contrast, generate substructure and genetic heterogeneity, and different haplotypes associated with the same trait tend to lower the significance values.

Linkage disequilibrium is an important tool for identifying narrow chromosomal regions that harbour the genes for a given disease (high-resolution mapping). It relies on the molecular analysis of allelic variants (mostly SNPs) that make up haplotypes in apparently unrelated patients. Patients who have inherited the same chromosomal segment, defined by the same haplotype, can be expected to have also inherited the same mutation contained in it.

Origins of linkage disequilibrium (LD)

When it first arises, a new mutation is in LD (grey) with all the loci on the same chromosome. Over the generations, recombination progressively reduces the region in LD. The main factors are:

1. Recombination rate
2. Number of generations

Over the generations, the region in LD shrinks as well.
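The dependence on these two factors can be sketched with the standard population-genetics approximation D_t = D_0 · (1 − r)^t, where r is the recombination fraction between the two loci per generation and t the number of generations. This formula is not stated on the slide; it assumes random mating, a large population and no selection, and the function names below are illustrative.

```python
import math


def ld_after(d0, recomb_rate, generations):
    """LD coefficient D after a number of generations.

    Uses D_t = D_0 * (1 - r)^t, assuming random mating and no selection.
    """
    if not 0.0 <= recomb_rate <= 0.5:
        raise ValueError("recombination fraction must lie in [0, 0.5]")
    return d0 * (1.0 - recomb_rate) ** generations


def generations_to_halve(recomb_rate):
    """Number of generations needed for D to fall to half its initial value."""
    if not 0.0 < recomb_rate <= 0.5:
        raise ValueError("recombination fraction must lie in (0, 0.5]")
    return math.log(0.5) / math.log(1.0 - recomb_rate)


# Tightly linked loci (r = 0.001) keep their LD far longer than
# loosely linked loci (r = 0.1); the values are made up for illustration.
for r in (0.001, 0.01, 0.1):
    print(r, ld_after(0.25, r, 100), generations_to_halve(r))
```

Closer loci (smaller r) and younger mutations (smaller t) both leave more LD, which matches the two factors listed above.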

Nature Reviews Genetics 4: 701-709 (2003)

Linkage and Linkage Disequilibrium

Within a family, linkage occurs when two genetic markers (points on a chromosome) remain linked on a chromosome rather than being broken apart by recombination events during meiosis, shown as red lines. In a population, contiguous stretches of founder chromosomes from the initial generation are sequentially reduced in size by recombination events. Over time, a pair of markers or points on a chromosome in the population move from linkage disequilibrium to linkage equilibrium, as recombination events eventually occur between every possible point on the chromosome.

Association Studies and Linkage Disequilibrium

• If all polymorphisms were independent at the population level, association studies would have to examine every one of them…

• Linkage disequilibrium makes tightly linked variants strongly correlated, producing cost savings for association studies

This diagram shows two ancestral chromosomes being scrambled through recombination over many generations to yield different descendant chromosomes. If a genetic variant marked by the A on the ancestral chromosome increases the risk of a particular disease, the two individuals in the current generation who inherit that part of the ancestral chromosome will be at increased risk. Adjacent to the variant marked by the A are many SNPs that can be used to identify the location of the variant.

The human genome is divided into haplotype blocks, within which recombination is rare

Nine distinct haplotype blocks in this region

Genome-wide association study (GWAS)

Genome-wide association study (GWAS) is an approach used in genetics research to associate specific genetic variations with particular diseases. The method involves scanning the genomes from many different people and looking for genetic markers that can be used to predict the presence of a disease. Once such genetic markers are identified, they can be used to understand how genes contribute to the disease and develop better prevention and treatment strategies.

A variety of resources have been developed to catalog common single nucleotide polymorphisms (SNPs). More recently, data from the 1000 Genomes Project have begun to be used to catalog variants in the 1% frequency range.

In order to test whether these common SNPs are associated with risk of disease, commercial ‘SNP chips’ or arrays have been developed that capture most, although not all, common variation in the genome. These genotyping arrays can genotype hundreds of thousands of SNPs in a single experiment, at a cost of several hundred US dollars per sample.

Contemporary GWASs use these arrays to measure the frequency of alleles in cases compared with controls. If the difference in allele frequency reaches a stringent level of statistical significance that corrects for the fact that there are about 1,000,000 independent common SNPs in the human genome (this significance level is about P < 5 × 10⁻⁸), then the allele is said to be ‘associated’ with disease.
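The quoted significance level is consistent with a Bonferroni correction. The sketch below assumes a conventional family-wise error rate of 0.05, which the text does not state explicitly; only the figure of about 1,000,000 independent tests comes from the paragraph above.

```python
def genome_wide_threshold(alpha=0.05, n_independent_tests=1_000_000):
    """Bonferroni-corrected per-SNP significance threshold.

    alpha is an assumed family-wise error rate; the number of tests
    is the approximate count of independent common SNPs quoted above.
    """
    if n_independent_tests <= 0:
        raise ValueError("number of tests must be positive")
    return alpha / n_independent_tests


threshold = genome_wide_threshold()
print(f"per-SNP threshold: {threshold:.1e}")
```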

Christensen and Murray 356 (11): 1094, Figure 1, March 15, 2007
Joel N. Hirschhorn & Mark J. Daly, Nature Reviews Genetics 6, 95-108 (February 2005)

Example calculation illustrating the methodology of a case-control GWA study. The allele count of each measured SNP is evaluated, in this case with a chi-squared test, in order to identify variants associated with the trait in question. The numbers in this example are taken from a 2007 study of coronary artery disease (CAD), which showed that individuals with the G allele of SNP1 (rs1333049) were overrepresented among CAD patients.
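The allele-count test described in this caption can be sketched as follows. The counts used here are made up for illustration and are NOT the numbers from the 2007 CAD study, which do not survive in these notes. The p-value uses the fact that, for one degree of freedom, the chi-squared tail probability equals erfc(sqrt(x/2)).

```python
import math


def allelic_chi_square(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Pearson chi-squared test (1 df) on a 2x2 table of allele counts.

    Returns (chi2, p_value, odds_ratio). Illustrative helper only.
    """
    table = [[case_alt, case_ref], [ctrl_alt, ctrl_ref]]
    row_totals = [sum(row) for row in table]
    col_totals = [case_alt + ctrl_alt, case_ref + ctrl_ref]
    n = sum(row_totals)
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("every row and column needs at least one allele")

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected

    # Survival function of the chi-squared distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))

    # Odds of carrying the tested allele in cases relative to controls.
    denominator = case_ref * ctrl_alt
    odds_ratio = (case_alt * ctrl_ref) / denominator if denominator else float("inf")
    return chi2, p_value, odds_ratio


# Hypothetical counts: 1,000 case alleles and 1,000 control alleles.
chi2, p_value, odds_ratio = allelic_chi_square(600, 400, 500, 500)
print(chi2, p_value, odds_ratio)
```

With these hypothetical counts the association would be nominally significant but would still fall short of the genome-wide level of 5 × 10⁻⁸, which illustrates why GWASs need large samples.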

Published Genome-Wide Associations through 12/2012

Published GWA at p ≤ 5 × 10⁻⁸ for 17 trait categories

NHGRI GWA Catalog www.genome.gov/GWAStudies www.ebi.ac.uk/fgpt/gwas/

The Genome Reference Consortium
http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/

The Genome Reference Consortium (GRC) is an international collective of academic and research institutes with expertise in genome mapping, sequencing, and informatics, formed to improve the representation of reference genomes. At the time the human reference was initially described, it was clear that some regions were recalcitrant to closure with existing technology. The main reason for improving the reference assemblies is that they are the cornerstones upon which all whole-genome studies are based (e.g. the 1000 Genomes Project).

The GRC is a collaborative effort that interacts with various groups in the scientific community; the primary member institutes are:

• The Wellcome Trust Sanger Institute
• The Genome Institute at Washington University
• The European Bioinformatics Institute
• The National Center for Biotechnology Information

Initially the focus was on the human and mouse reference genomes, but in mid-to-late 2010 full maintenance and improvement of the zebrafish genome sequence was also added to the GRC. Major assembly releases do not follow a fixed cycle; however, there are "minor" assembly updates in the form of genome patches, which either correct errors in the assembly or add alternate loci. These assemblies are represented in various genome browsers and databases, including Ensembl, the NCBI resources and the UCSC Genome Browser.

• GRCh36, hg18
• GRCh37, hg19
• GRCh38, hg38

The number of genes is not proportional to the complexity of the organism

Human genome annotation

Genome annotation remains a major challenge for scientists investigating the human genome, now that the genome sequences of more than a thousand human individuals and several model organisms are largely complete. Identifying the locations of genes and other genetic control elements is often described as defining the biological "parts list" for the assembly and normal operation of an organism.

Here is an alphabetical listing of databases relevant to genome annotation:

• ENCyclopedia Of DNA Elements (ENCODE) http://encodeproject.org/ENCODE/
• Ensembl http://www.ensembl.org
• GENCODE http://www.gencodegenes.org/
• Gene Ontology Consortium http://www.geneontology.org/
• RefSeq http://www.ncbi.nlm.nih.gov/refseq/
• Vertebrate and Genome Annotation Project (Vega)

The human genome contains about 20,000-30,000 genes

CCDS Release 17 (updated August 2014): CCDS IDs 30,499; Gene IDs 18,801; Sequence IDs by organization: NCBI RefSeq 34,752; EBI, WTSI records 38,822

HGNC (updated 30 Dec 2014) assigned unique gene symbols to: Loci 39,397; Protein coding genes 19,008; 12,709; Non-coding RNA 5,883

GENCODE 21 (updated June 2014): Loci 60,155; Protein coding genes 19,881; Protein coding transcripts 79,377

CCDS: the Consensus CDS project; HGNC: HUGO Gene Nomenclature Committee; GENCODE: gene catalog from the ENCODE project

The Consensus Coding Sequence (CCDS) project

The Consensus Coding Sequence (CCDS) project (http://www.ncbi.nlm.nih.gov/CCDS/) is a collaborative effort to maintain a dataset of protein-coding regions that are identically annotated on the human and mouse reference genome assemblies by the National Center for Biotechnology Information (NCBI) and Ensembl genome annotation pipelines.

The main curation groups are the Havana team at the WTSI and the RefSeq annotation group at NCBI. In addition, the manually curated information on chr14 (Genoscope) and Chr7 (Wustl) has been brought in via the Vega resource.

Identical annotations that pass quality assurance tests are tracked with a stable identifier (CCDS ID). Members of the collaboration, who are from NCBI, the Wellcome Trust Sanger Institute and the University of California Santa Cruz, provide coordinated and continuous review of the dataset to ensure high-quality CCDS representations.

CCDS release statistics for human and mouse.

Farrell C M et al. Nucl. Acids Res. 2014;42:D865-D872

RefSeq

NCBI’s Reference Sequence (RefSeq) database (http://www.ncbi.nlm.nih.gov/refseq/) is a collection of taxonomically diverse, non-redundant and richly annotated sequences representing naturally occurring molecules of DNA, RNA, and protein. Included are sequences from plasmids, organelles, viruses, archaea, bacteria, and eukaryotes.

Each RefSeq is constructed wholly from sequence data submitted to the International Nucleotide Sequence Database Collaboration (INSDC). Similar to a review article, a RefSeq is a synthesis of information integrated across multiple sources at a given time. RefSeqs provide a foundation for uniting sequence data with genetic and functional information. They are generated to provide reference standards for multiple purposes ranging from genome annotation to reporting locations of sequence variation in medical records.

The RefSeq collection is available without restriction and can be retrieved in several different ways: by searching or following links in NCBI resources (including PubMed, Nucleotide, Protein, Gene, and Map Viewer), by searching with a sequence via BLAST, and by downloading from the RefSeq FTP site.

HUGO Gene Nomenclature Committee

The HUGO Gene Nomenclature Committee (HGNC) is a non-profit body jointly funded by the Wellcome Trust (UK) and the US National Human Genome Research Institute (NHGRI). It operates under the auspices of the Human Genome Organisation (HUGO), with key policy advice from an International Advisory Committee (IAC). The HGNC also uses a team of specialist advisors who provide support on specific gene family nomenclature issues, and it works in close collaboration with staff at MGNC and RGNC.

For each known human gene the HUGO Gene Nomenclature Committee (HGNC) approves a unique gene name and symbol (or short-form abbreviation). All approved gene symbols are stored in the HGNC Database (www.genenames.org).

The HGNC ensures that each gene is given only one unique approved symbol, so that we and others can talk unambiguously about genes; these symbols also facilitate electronic data retrieval from publications and databases. In preference each symbol maintains parallel construction in different members of a gene family and can also be used in other species, especially for orthologous genes across vertebrate species. The HGNC have already approved over 33,000 gene symbols; the majority of these (~19,000) are for protein-coding genes, but we also assign names to pseudogenes, non-coding RNAs, phenotypes and genomic features. New gene symbols are requested by individual researchers, communities working on a specific group or family of genes, authors submitting to journals that insist on the use of approved gene nomenclature (e.g. Nature Genetics, Genome Research, Genomics) and other databases that use or reference HGNC gene symbols (e.g. Ensembl, Entrez Gene, UniProt, MGI, RGD and OMIM). In all cases considerable efforts are made to use a symbol acceptable to workers in the field.

GENCODE

The GENCODE gene set has strengths over other publicly available human reference annotations, and it has been adopted by the ENCODE Consortium (The ENCODE Project Consortium 2011), The 1000 Genomes Project Consortium (2010), and The International Cancer Genome Consortium (2010) as their reference gene annotation.

The GENCODE reference gene set is a combination of manual gene annotation from the Human and Vertebrate Analysis and Annotation (HAVANA) group (http://www.sanger.ac.uk/research/projects/vertebrategenome/havana/) and automatic gene annotation from Ensembl (http://www.ensembl.org/index.html). It is updated with every Ensembl release (approximately every 3 mo). Since manual annotation of the whole human genome is not finished, the GENCODE releases are a combination of manual annotation from HAVANA and automatic annotation from Ensembl to ensure whole-genome coverage.

The group's approach to manual gene annotation is to annotate transcripts aligned to the genome and take the genomic sequences as the reference rather than the cDNAs. Currently only three vertebrate genomes—human, mouse, and zebrafish—are being fully finished and sequenced to a quality that merits manual annotation. The finished genomic sequence is analyzed using a modified Ensembl pipeline, and BLAST results of cDNAs/ESTs and proteins, along with various ab initio predictions, can be analyzed manually in the annotation browser tool Otterlace (http://www.sanger.ac.uk/resources/software/otterlace/). The advantage of genomic annotation compared with cDNA annotation is that more alternative spliced variants can be predicted, as partial EST evidence and protein evidence can be used, whereas cDNA annotation is limited to availability of full-length transcripts. Moreover, genomic annotation produces a more comprehensive analysis of pseudogenes. One disadvantage, however, is that if a polymorphism occurs in the reference sequence a coding transcript cannot be annotated, whereas cDNA annotation, for example, performed by RefSeq, can select the major haplotypic form as it is not limited by a reference sequence.

http://genome.cshlp.org/content/22/9/1760.long
http://www.gencodegenes.org/

Human Genome Interpretation: The challenge

N Engl J Med 369;16, nejm.org, October 17, 2013

Genome interpretation: Variant Analysis Workflow

Primary Analysis
• Analysis of hardware generated data
• Production of sequence reads and quality score

Secondary Analysis
• QA and clipping/filtering reads
• Alignment/Assembly of reads
• De-duplication, variant calling on aligned reads

Tertiary Analysis
• QA and filtering of variant calls
• Annotation of variants
• Multi-sample integration

«Sense making»
• Visualization of variants in genomic context
• Experiment-specific inheritance/population analysis

Read sequence alignments are stored in a BAM file

SAMtools is a set of utilities for interacting with and post-processing short DNA sequence read alignments in the SAM/BAM format. These files are generated as output by short read aligners like BWA. SAM files can be very large (tens of gigabytes is common), so compression is used to save space. SAM files are human-readable text files, while BAM files are simply the binary equivalent.

From http://www.broadinstitute.org/igv/BAM

The Variant Call Format file (VCF)

The end result of a SNP calling analysis is a collection of SNPs. A standard file format has been created to hold these SNPs, the Variant Call Format file (VCF). In this file every line represents a SNP, and the following information is found:

• The position in the reference genome
• The allele in the reference genome
• The other alleles found
• The filters not passed by the SNP
• The genotypes found, with their abundances

From http://bioinf.comav.upv.es/courses/sequence_analysis/snp_calling.html
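As an illustration of how these fields sit on a VCF data line, here is a minimal parsing sketch. It relies on the standard tab-separated VCF column order (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, then FORMAT and one column per sample); the function name, the sample name and the example record are made up, and a real analysis would normally use a dedicated VCF library instead.

```python
VCF_FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf_line(line, sample_names):
    """Parse one VCF data line (not a '#' header line) into a dict."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_FIXED_COLUMNS, fields[:8]))

    record["POS"] = int(record["POS"])        # position in the reference genome
    record["ALT"] = record["ALT"].split(",")  # the other alleles found

    # FILTER lists the filters NOT passed; PASS or '.' means none failed.
    if record["FILTER"] in ("PASS", "."):
        record["FILTER"] = []
    else:
        record["FILTER"] = record["FILTER"].split(";")

    # Per-sample genotype columns are keyed by the FORMAT column.
    format_keys = fields[8].split(":") if len(fields) > 8 else []
    record["samples"] = {
        name: dict(zip(format_keys, value.split(":")))
        for name, value in zip(sample_names, fields[9:])
    }
    return record


# Entirely hypothetical record, for illustration only.
example_line = "1\t12345\trs0000001\tA\tG\t50\tPASS\tDP=30\tGT:DP\t0/1:30"
record = parse_vcf_line(example_line, ["sample1"])
print(record["CHROM"], record["POS"], record["REF"], record["ALT"],
      record["samples"]["sample1"]["GT"])
```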

SNP         GENE     CHROM. REGION  ALLELES  MY   RISK    WGS STATUS
rs4977574   CDKN2B   9p21           A/G      AA   NULL    CONFIRMED
rs646776    CELSR2   1p13           C/T      TT   HIGH    CONFIRMED
rs9982601   SLC5A3   21q22          C/T      CC   NULL
rs17465637  MIA3     1q41           A/C      AA   NULL
rs1746048   CXCL12   10q11          T/C      CC   HIGH
rs12526453  PHACTR1  6p24           G/C      GC   MEDIUM  CONFIRMED
rs1122608   LDLR     19p13          T/G      TG   MEDIUM  CONFIRMED
rs6725887   WDR12    2q33           T/C      TT   NULL
rs11206510  PCSK9    1p32           C/T      CT   MEDIUM  CONFIRMED
rs11065987  BRAP     12q24          A/G      GG   HIGH    CONFIRMED

SNP         GENE     CHROM. REGION  ALLELES  MY   RISK    WGS STATUS
rs4977574   CDKN2B   9p21           A/G      AA   NULL    CONFIRMED
rs646776    CELSR2   1p13           C/T      TT   HIGH    CONFIRMED
rs9982601   SLC5A3   21q22          C/T      CC   NULL    CONFIRMED
rs17465637  MIA3     1q41           A/C      AA   NULL    CONFIRMED
rs1746048   CXCL12   10q11          T/C      CC   HIGH    CONFIRMED
rs12526453  PHACTR1  6p24           G/C      GC   MEDIUM  CONFIRMED
rs1122608   LDLR     19p13          T/G      TG   MEDIUM  CONFIRMED
rs6725887   WDR12    2q33           T/C      TT   NULL    CONFIRMED
rs11206510  PCSK9    1p32           C/T      CT   MEDIUM  CONFIRMED
rs11065987  BRAP     12q24          A/G      GG   HIGH    CONFIRMED

Genome interpretation: The Challenge

VCF file goes in (> 3,000,000 variants)

VARIANT ASSESSMENT AND RESULT REPORTING

The questions are quite simple, but…

1. Does the variant disrupt gene (protein) function?

2. Can disrupted function lead to disease?

3. Is this disrupted gene function causative for your patient’s presentation?

Variant Annotation

Variant type             #         %
Missense                 56,669    44.5%
Small indels             29,743    23.4%
Nonsense                 14,191    11.2%
Splicing                 11,802    9.3%
CNVs                     10,734    8.4%
Regulatory               2,450     1.9%
Complex rearrangements   1,245     1.0%
Repeat variations        394       0.3%
Total                    127,231

Derived from The Human Gene Mutation Database http://www.hgmd.cf.ac.uk

Variant Classification

Classification tiers: Benign, Likely Benign, Unknown Significance, Likely Pathogenic, Pathogenic

CORNERSTONES

• Allele frequency in cases/controls
• Segregation with disease
• Functional (in vivo) studies
• To some extent: computational predictions (clinically validated)

Silent variants

• Can be pathogenic (by affecting splicing)

• LMNA Gly382Gly … has been reported in 1 individual with limb-girdle muscular dystrophy … RNA studies showed that this variant creates a new splice site in the 3' end of exon 6, resulting in the deletion of the remaining 13 bases in the exon (Benedetti 2007) …

• Overall this is rare

• Usually classified as “likely benign or benign” unless in exonic part of splice consensus

Figure: splice consensus positions around the exon (-2, -1, +1, +2). Nature Reviews Genetics 3:285-298, 2002.

Nonsense • Frameshift • +/-1,2 Splice

• Commonly assumed to be pathogenic because of high likelihood of severe impact to the protein

• Nonsense: premature stop
• Frameshift: premature stop or stop/gain
• +/-1,2 splice: exon skipping → frameshift → premature stop

• Careful! Most will be deleterious to a protein, but not all of them will result in disease!

• Example: COL1A1 – osteogenesis imperfecta

– Missense variants have a severe (dominant negative) effect on the collagen triple helix
– Truncating variants cause mild disease (fewer but fully functional chains)

Assessing the functional impact of mutations

For the subset of mutations that affect either protein-coding sequences or known regulatory sites, it is possible to make computational predictions about their potential effects.

Missense variants are difficult to interpret

• How significant is the biochemical change?
– BLOSUM scores, Grantham differences, etc.

• Does the variant affect a critical domain of the protein?
– DNA binding domain of transcription factors
– Transmembrane domain
– Catalytic domain for enzymes

• Computational predictions of pathogenicity (rely heavily on conservation)
– PolyPhen-2, SIFT, Align GVGD, Xvar and many more

• How conserved is the amino acid?
– Conservation in distant species supports pathogenicity
– Lack of conservation in closely related species argues against

• Are benign missense variants common in your gene or never seen?
– TTN: common
– PTPN11: rare

Rules for assigning the effects of missense mutations on Molecular Function

1. Protein stability (83%). Affected by one or more of the following: 1) Loss of one or more hydrogen bonds. 2) Reduced hydrophobic interaction 3) Loss of a salt bridge. 4) Buried charged residue. 5) Over-packing. 6) Internal cavity. 7) Electrostatic repulsion. 8) Buried polar residue. 9) Disruption of metal binding. 10) Breakage of a disulfide bond. 11) Destabilization of a protein multimer.

2. Ligand binding. Any of the above protein stability rules can apply, where ligand atoms interact with the mutated side chain.

3. Catalysis (II+III=5%). The mutated residue is directly involved in a catalytic process, or is critical in positioning such a residue.

4. Allosteric regulation. The mutated residue is involved in an allosteric mechanism.

5. Post-translational modification. An N-X-S/T sequence pattern for N-glycosylation (X = any residue except Pro) is disrupted.
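Rule 5 lends itself to a simple sequence check. The sketch below is only an illustration of the N-X-S/T pattern (X being any residue except Pro); the function names and the toy peptide are made up, positions are 0-based, and finding a motif in a sequence does not by itself show that the site is glycosylated in vivo.

```python
import re

# Lookahead so that overlapping motifs are all reported.
SEQUON = re.compile(r"(?=(N[^P][ST]))")


def sequon_positions(protein):
    """0-based start positions of every N-X-S/T motif (X != Pro)."""
    return [match.start() for match in SEQUON.finditer(protein)]


def disrupted_sequons(protein, position, new_residue):
    """Motif positions present in the wild type but lost in the mutant."""
    mutant = protein[:position] + new_residue + protein[position + 1:]
    lost = set(sequon_positions(protein)) - set(sequon_positions(mutant))
    return sorted(lost)


# Toy peptide: "NAT" at index 2 is a motif; "NPS" at index 7 is not (X = Pro).
peptide = "MKNATLLNPSG"
motifs = sequon_positions(peptide)
lost_by_t5a = disrupted_sequons(peptide, 4, "A")   # T -> A inside the motif
lost_by_l6v = disrupted_sequons(peptide, 5, "V")   # substitution outside it
print(motifs, lost_by_t5a, lost_by_l6v)
```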

Human Mutation 17:263-270 (2001)

Using Primary Structure as Proxy for Tertiary

• 83% of disease-causing mutations affect stability of proteins (Wang and Moult, 2001)
• 90% of disease-causing mutations can be detected using structure and stability
• Many human proteins have numerous homologs:
– Paralogs: separated by a gene duplication event
– Orthologs: separated by speciation
• We don't know the exact structure of most proteins, but we can compare amino acid sequences to identify domains and motifs conserved by evolution
• Disease-causing mutations are overrepresented at conserved sites in the primary structure (Miller and Kumar, 2001)

Mutations that affect protein function tend to occur at evolutionarily conserved sites and tend to be buried in protein structure

Genome Biol. 4:R72

Sickle-cell anemia

β-globin E6V substitution

• Incorrectly predicted to be benign by a structure-based amino acid substitution (AAS) prediction method (PolyPhen)
• Correctly predicted by a sequence homology-based AAS prediction method (SIFT)

Pathogenicity Score based on Protein Sequence and/or Structure methods

Gnad et al. BMC Genomics 2013 14:S7

Problem: different software tools predict different sets of deleterious variants!

Predicted functional effect

Condel stands for CONsensus DELeteriousness score of non-synonymous single nucleotide variants (SNVs).

Condel combines five predictive tools:

• Log R Pfam E-value (Logre)
• MAPP
• Mutation Assessor (Massessor)
• PolyPhen-2 (PPH2)
• SIFT

It then calculates a weighted average of the normalized scores (WAS). Variants with a score of at least 0.469 are predicted to significantly affect (most likely, harm) protein function; variants with lower scores are predicted to be functionally tolerated. Not all variants are scored.
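The consensus step can be sketched as a weighted average followed by the 0.469 cutoff quoted above. This is a toy illustration only: the equal weights and the example scores are placeholders, the real Condel weights are derived from each tool's score distributions and are not reproduced here, and the handling of missing scores is an assumption.

```python
CONDEL_CUTOFF = 0.469  # cutoff quoted in the text above


def weighted_average_score(scores, weights):
    """Weighted average of normalized tool scores (each in [0, 1]).

    Tools that did not score the variant (value None) are skipped;
    returns None when no tool scored it. Weights are placeholders.
    """
    used = [tool for tool, value in scores.items() if value is not None]
    if not used:
        return None
    total_weight = sum(weights[tool] for tool in used)
    return sum(scores[tool] * weights[tool] for tool in used) / total_weight


def classify(was):
    """Label a weighted average score using the quoted cutoff."""
    if was is None:
        return "not scored"
    return "deleterious" if was >= CONDEL_CUTOFF else "tolerated"


# Hypothetical normalized scores for one variant; equal weights for simplicity.
tools = ["Logre", "MAPP", "Massessor", "PPH2", "SIFT"]
example_weights = {tool: 1.0 for tool in tools}
example_scores = {"Logre": None, "MAPP": None, "Massessor": None,
                  "PPH2": 0.8, "SIFT": 0.9}
was = weighted_average_score(example_scores, example_weights)
print(was, classify(was))
```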

The American Journal of Human Genetics 88, 440–449, April 8, 2011

Missense Prediction Tools based on Conservation score

The conservation score summarizes how well the sequence at a given site is conserved among 33 mammal taxa. Substitutions at evolutionarily conserved (EC) positions are more deleterious than those at evolutionarily unconserved (EU) positions.

• Position: GERP, PhyloP
– Genomic Evolutionary Rate Profiling (GERP) measures base conservation
– PhyloP assigns conservation P-values

• Small regions: PhastCons
– PhastCons fits a hidden Markov model

Missense Prediction Tools based on Variant Frequency in the population

1000 Genomes (http://www.1000genomes.org/)

The 1000 Genomes Project, launched in January 2008, is an international research effort to establish by far the most detailed catalogue of human genetic variation. Scientists planned to sequence the genomes of at least one thousand anonymous participants from a number of different ethnic groups within the following three years, using newly developed technologies which were faster and less expensive. In 2010, the project finished its pilot phase, which was described in detail in a publication in Nature. In October 2012, the sequencing of 1092 genomes was announced in another Nature publication. The goal of the 1000 Genomes Project is to find most genetic variants that have frequencies of at least 1% in the populations studied.

http://www.nature.com/nature/journal/v491/n7422/full/nature11632.html

Exome Sequencing Project (ESP6500) (http://evs.gs.washington.edu/EVS/)

By capturing and sequencing all protein-coding exons (the exome, which comprises ~1 to 2% of the human genome), exome sequencing is a powerful approach for discovering rare variation and has facilitated the genetic dissection of unsolved Mendelian disorders and the study of human evolutionary history. Rare and low-frequency (MAF between 0.5 and 1%) variants have been hypothesized to explain a substantial fraction of the heritability of common, complex diseases. Because common variants explain only a modest fraction of the heritability of most traits, the National Heart, Lung, and Blood Institute (NHLBI) recently sponsored the multicenter Exome Sequencing Project (ESP) to identify previously unknown genes and molecular mechanisms underlying complex heart, lung, and blood disorders by sequencing the exomes of a large number of individuals measured for phenotypic traits of substantial public health importance (e.g., early-onset myocardial infarction, stroke, and body mass index).

The ESP6500 dataset comprises 2203 African-American and 4300 European-American unrelated individuals, totaling 6503 samples (13,006 chromosomes). It represents all of the ESP exome variant data.

A subset of this data (ESP2500) that has more stringent filtering criteria is available.

When is a variant “common enough” to be classified as benign?

• Combination of
– Disease prevalence (how frequent is the disease)
– Penetrance (likelihood of developing disease when carrying a pathogenic variant)
– Allelic heterogeneity (how many different pathogenic variants exist)

• Example:
– Autosomal dominant disease
– Prevalence = 1/500 individuals = 1/1000 chromosomes
– 0.1% of chromosomes in the population may carry a pathogenic variant
– But reduced penetrance and variable onset should be considered
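The arithmetic in this example can be written out as a small calculation. It is a rough sketch under simplifying assumptions: an autosomal dominant disease, and a single variant that would explain every case (so the result is an upper bound; allelic heterogeneity only lowers it). The function names are illustrative. The observed frequency used for comparison, 47/3258 chromosomes, is the MYBPC3 Gly278Glu count cited later in these notes.

```python
def max_pathogenic_allele_frequency(prevalence, penetrance=1.0):
    """Upper bound on the allele frequency of a dominant pathogenic variant.

    Each affected person carries one pathogenic allele on two chromosomes,
    so prevalence is halved; incomplete penetrance means there are more
    carriers than patients, so the bound is divided by penetrance.
    """
    if not 0.0 < penetrance <= 1.0:
        raise ValueError("penetrance must lie in (0, 1]")
    return prevalence / (2.0 * penetrance)


def too_common_to_be_pathogenic(observed_frequency, prevalence, penetrance=1.0):
    """True if the observed allele frequency exceeds the upper bound."""
    return observed_frequency > max_pathogenic_allele_frequency(prevalence, penetrance)


full_penetrance_bound = max_pathogenic_allele_frequency(1 / 500)
reduced_penetrance_bound = max_pathogenic_allele_frequency(1 / 500, penetrance=0.8)
observed = 47 / 3258  # about 1.4% of chromosomes
print(full_penetrance_bound, reduced_penetrance_bound,
      too_common_to_be_pathogenic(observed, 1 / 500, penetrance=0.8))
```

A variant observed at a frequency roughly ten times above the bound is very unlikely to be a highly penetrant cause of the disease, which is the reasoning applied to the MYBPC3 example.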

Variant interpretation used to be so simple…

Amino acid conserved + variant absent from 200 control chromosomes = pathogenic

CAUTION!!!!

• Absence from controls does NOT prove pathogenicity

• Though increases likelihood

• Presence in controls does NOT rule out pathogenicity

• Pathogenic variants can be present for
– Recessive diseases
– Late-onset phenotypes
– Low-penetrant diseases

Birgit H. Funke, PhD

Director of Clinical Research and Development, Laboratory for Molecular Medicine, Massachusetts General Hospital

Assistant Professor in Pathology, Harvard Medical School

HISTORICAL OVER-INTERPRETATION OF VARIANT-DISEASE ASSOCIATIONS

Gly278Glu (MYBPC3)

• Published 2003* in 1 proband with hypertrophic cardiomyopathy (HCM)
• No clinical data provided
• Absent from 200 control chromosomes
• Thought to be disease causing

*Richard et al.

Interpretation 2013: BENIGN

Gly278Glu in Exon 08 of MYBPC3: This variant is not expected to have clinical significance because it has been identified in 1.4% (47/3258) of African American chromosomes from a broad population by the NHLBI Exome Sequencing Project (http://evs.gs.washington.edu/EVS; dbSNP rs147315081).

Birgit: “semi-accurate calculation: dominant, prevalence = 1/500 people = 1/1000 chromosomes = 0.1%. Assume reduced penetrance (80%?) so if a single variant were pathogenic and explained all HCM you'd expect it to be 0.12% frequent in the general population. Of course HCM is caused by many many mutations, so it is safe to assume that a variant that has a 1.4% allele frequency is benign”

dbGaP (http://www.ncbi.nlm.nih.gov/gap)

The database of Genotypes and Phenotypes (dbGaP) archives and distributes the results of studies that have investigated the interaction of genotype and phenotype. Such studies (536 as of 01/2015) include genome-wide association studies, medical sequencing, molecular diagnostic assays, as well as association between genotype and non-clinical traits.

The types of data distributed through dbGaP include phenotype data, association (GWAS) data, summary-level analysis data, SRA (Sequence Read Archive, formerly Short Read Archive) data, reference alignment (BAM) data, VCF (Variant Call Format) data, expression data, imputed genotype data, image data, etc.

Online Mendelian Inheritance in Man (OMIM) http://www.omim.org/

OMIM is a comprehensive, authoritative compendium of human genes and genetic phenotypes that is freely available and updated daily. The full-text, referenced overviews in OMIM contain information on all known mendelian disorders and over 12,000 genes. OMIM focuses on the relationship between phenotype and genotype, and its entries contain copious links to other genetics resources.

This database was initiated in the early 1960s by Dr. Victor A. McKusick as a catalog of mendelian traits and disorders, entitled Mendelian Inheritance in Man (MIM). Twelve book editions of MIM were published between 1966 and 1998. The online version, OMIM, was created in 1985 by a collaboration between the National Library of Medicine and the William H. Welch Medical Library at Johns Hopkins. It was made generally available on the internet starting in 1987. In 1995, OMIM was developed for the World Wide Web by NCBI, the National Center for Biotechnology Information.

http://www.omim.org/entry/113705?search=brca1&highlight=brca1

Combing the Cancer Genome

A guided tour through the main online resources for analyzing cancer genomics data

Much of the available genomic data comes from a handful of large international collaborations. The Cancer Genome Atlas (TCGA), a project of the National Cancer Institute and the National Human Genome Research Institute, oversees the generation of genomic data from quality-controlled samples, most of which have been analyzed using multiple platforms. The UK Cancer Genome Project houses a data collection called COSMIC, the single most comprehensive catalog of somatic mutations in the world. Casting an even wider net, the International Cancer Genome Consortium (ICGC) is a one-stop-shopping portal through which you can access data from its 12 member countries, as well as from the TCGA and COSMIC databases.

A good starting point is to check for known mutations and other aberrations in your gene of interest. The ICGC Data Portal offers several search routes. Enter a gene name, NCBI accession number, or Ensembl gene ID in the Gene Search field, click through to the Gene Report, and under Mutation Summary you’ll find the mutations and copy number changes detected and their frequency in the tumors analyzed to date. The COSMIC section just below lists somatic mutations, including point mutations, small deletions, and insertions, from the COSMIC database.

Another tack is to look for all the affected genes in a tumor type. In the ICGC Data Portal, you can accomplish this by clicking Genes under Database Search, and from there, selecting the type of cancer you’re interested in as well as optional sub-parameters such as the pathway you’d like to explore. Alternatively, in the TCGA’s Data Portal, you can select Bulk Download from the Download Data menu and retrieve information on somatic mutations, or other data types such as copy number, DNA methylation, and gene expression, for a number of tumor types.

Both ICGC and TCGA data are publicly available, but note that they have already been processed: sequences have been confirmed by various techniques, and patient-identifying information, such as the presence of germline SNPs, has been removed. Researchers can request access to raw data through ICGC’s Data Access Compliance Office if they would like to know the germline SNPs, which could be relevant in studies of cancer risk and cancer drug response.

Also, consider supplementing your ICGC search with an exploration of data sets from other centers, including TCGA’s Data Portal. ICGC only updates the database every few months, whereas TCGA updates as new data become available. What’s more, ICGC does not store TCGA’s raw data; to access it, submit a Data Access Request form to TCGA.

The Cancer Genome Atlas (https://tcga-data.nci.nih.gov/tcga/)

The Cancer Genome Atlas (TCGA) Data Portal provides a platform for researchers to search, download, and analyze data sets generated by TCGA. It contains clinical information, genomic characterization data, and high-throughput sequencing analysis of the tumor genomes.

TCGA collects and analyzes high-quality tumor samples and makes the following data available on the Data Portal:
• clinical information about patients participating in the program
• metadata about their samples (for example, the weight of a sample portion)
• histopathology slide images from sample portions
• molecular information derived from the samples

In addition to collecting and analyzing high-quality tumor samples, the TCGA is also attempting to include high-quality non-tumor samples in some assays. The goal is to analyze germline DNA for every participant to establish which abnormalities detected in a tumor sample are peculiar to the oncogenic process.

For most tumor types, TCGA will be able to collect and analyze normal blood samples for the majority of participants with that disease. Sometimes a matching normal blood sample is not available; in this case, a normal tissue sample from the same participant may be used as the germline control in DNA assays.

In the case of RNA assays, normal blood is not a suitable control since the RNA profile of a blood sample would be expected to differ from the RNA profile of tissue from an organ such as brain, breast, lung, or ovary regardless of whether that tissue were normal or cancerous. For this reason, TCGA attempts to collect some number of normal tissue samples matched to the anatomic site of the tumor but usually not matched to the participant. RNA measurements from these normal samples can be pooled and used to analyze how RNA expression in a tumor differs from RNA expression in normal tissue of the same anatomic origin.

COSMIC (http://cancer.sanger.ac.uk/cancergenome/projects/cosmic/)

The Catalogue of Somatic Mutations in Cancer (COSMIC) is designed to store and display somatic mutation information and related details and contains information relating to human cancers.

Some key features of COSMIC are:
• It contains information on publications, samples and mutations. It includes samples that were found to be negative for mutations during screening, which enables frequency data to be calculated for mutations in different genes in different cancer types.
• Samples entered include benign neoplasms and other benign proliferations, in situ and invasive tumours, recurrences, metastases and cancer cell lines.

COSMIC combines cancer mutation data manually curated from the scientific literature with the output from the Cancer Genome Project (CGP) at the Sanger Institute, UK. Genes are selected for full literature curation with a focus on those mutated by small point mutations in the coding domains, and including those mutated by gene fusion. In the age of whole cancer genome sequencing, it is now possible to describe the genome-wide somatic mutation content of a tumour sample, including structural rearrangements and non-coding variants. COSMIC is now integrating this information into the database, providing full coding and genomic variant annotations for samples, both from CGP laboratories and recent publications.

The International Cancer Genome Consortium http://www.icgc.org

The International Cancer Genome Consortium (ICGC) has been organized to launch and coordinate a large number of research projects that have the common aim of elucidating comprehensively the genomic changes present in many forms of cancers that contribute to the burden of disease in people throughout the world.

The primary goals of the ICGC are to generate comprehensive catalogues of genomic abnormalities (somatic mutations, abnormal expression of genes, epigenetic modifications) in tumors from 50 different cancer types and/or subtypes which are of clinical and societal importance across the globe and make the data available to the entire research community as rapidly as possible, and with minimal restrictions, to accelerate research into the causes and control of cancer. The ICGC facilitates communication among the members and provides a forum for coordination with the objective of maximizing efficiency among the scientists working to understand, treat, and prevent these diseases.

Currently, the ICGC has received commitments from funding organizations in Asia, Australia, Europe, North America and South America for 53 project teams in 15 jurisdictions to study over 25,000 tumor genomes. Projects that are currently funded are examining tumors affecting the bladder, blood, bone, brain, breast, cervix, colon, head and neck, kidney, liver, lung, oral cavity, ovary, pancreas, prostate, rectum, skin, soft tissues, stomach, thyroid and uterus. Over time, additional nations and organizations are anticipated to join the ICGC. The genomic analyses of tumors conducted by ICGC members in Australia (pancreatic cancer), Canada (pancreatic cancer, pediatric brain cancer and prostate cancer), China (gastric cancer), France (liver cancer), Germany (blood cancer, brain cancer), Japan (liver cancer), Spain (blood cancer), the UK (blood, breast, lung, prostate and skin cancer) and the USA (blood, bladder, brain, breast, cervical, colon, head and neck, kidney, liver, lung, ovarian, prostate, rectal, skin, stomach, thyroid and uterine cancer) are available through the Data Coordination Center housed on the ICGC website (http://dcc.icgc.org/web/).

The Human Gene Mutation Database (http://www.hgmd.org)

The Human Gene Mutation Database (HGMD) represents an attempt to collate known (published) gene lesions responsible for human inherited disease. It constitutes a comprehensive core collection of data on germ-line mutations in nuclear genes underlying or associated with human inherited disease.

Data cataloged include single-basepair substitutions in coding, regulatory, and splicing-relevant regions, micro-deletions and micro-insertions, indels, and triplet repeat expansions, as well as gross gene deletions, insertions, duplications, and complex rearrangements. Each mutation is entered into HGMD only once, in order to avoid confusion between recurrent and identical-by-descent lesions. HGMD also includes cDNA reference sequences for more than 98% of the listed genes.

This database, whilst originally established for the study of mutational mechanisms in human genes, has now acquired a much broader utility in that it embodies an up-to-date and comprehensive reference source to the spectrum of inherited human gene lesions. Thus, HGMD provides information of practical diagnostic importance to (i) researchers and diagnosticians in human molecular genetics, (ii) physicians interested in a particular inherited condition in a given patient or family, and (iii) genetic counsellors.

Human Gene Mutation Database

HGMD represents an attempt to collate known (published) gene lesions responsible for human inherited disease. It is maintained in Cardiff, UK.

HGMD Public
• Free for academia/non-profit
• 2½ years out of date

HGMD Professional
• Up-to-date data
• Advanced search
• Mutation map
• Genomic coordinates

Categorization of mutations & polymorphisms
• DM: disease-causing (pathological) mutation
• DM?: likely disease-causing (likely pathological) mutation
• DP: disease-associated polymorphism
• DFP: disease-associated polymorphism with additional supporting functional evidence
• FP: polymorphism affecting the structure, function or expression of a gene but with no disease association reported yet
• FTV: frameshift or truncating variant with no disease association reported yet

ClinVar: A Central Repository for Interpretations of Clinically Relevant Variants

[Figure: ClinVar and related resources. Labels: Variation, Phenotype, Interpretation, Evidence, dbSNP, dbVar, MedGen (HPO, OMIM), Gene, RefSeq, ACMG, PubMed, Sequence Ontology, GTR]

ClinVar web display

Interpretation
• Significance
• Review status
• Accession.version

Allele summary
• Gene
• Variant type
• Genomic location
• HGVS expressions
• Molecular consequence
• Links
• Frequency

Phenotype summary
• Names
• Links
• Prevalence

ClinVar Review Status
• classified by single submitter
• classified by multiple submitters
• conflicting data from submitters
• reviewed by expert panel
• reviewed by professional society

Expert panels: both medical and research experts with published criteria and process for evaluating variant pathogenicity
• CFTR2, InSiGHT

Professional society: groups that provide practice guidelines
• American College of Medical Genetics (ACMG)

Genome interpretation: Implementing Whole Genome Sequencing into Clinical Care

My whole genome sequencing

Blood sample → DNA extraction → Fragmentation → Library preparation → Sequencing (Illumina HiSeq 1000)

TGTTGGAATACCTGAAGATTGCTCAGGACCTGGAGA ……… CAACATTGGATCAAATGGATCTGATAGACCTTTACA GAAAAAATGTAGTCAGCCTTTAACTTGGCCTGATAA ……… CACAGCTGGGGCTGTAGCAACCCTTTCCAACCCCTT TAGTCGGTTGTTGATGAGATATTTGGAGGTGGGGAT ……… GTCAAAGGCAAAGGAAAAAATGTTCAATATAGTTAA

• 640 million reads (2x100)
• 128 gigabases
• Mapping on reference genome
• Genome coverage: 37.4x
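The sequencing yield on this slide can be checked with a few lines of arithmetic. This is a rough sketch under stated assumptions: "640 million reads (2x100)" is read as 640 million read pairs of 100 bases each, which is the reading that reproduces the 128 gigabases on the slide, and the genome size of 3.1 billion bases is an assumed round figure, not a value from the slides.

```python
READ_PAIRS = 640_000_000   # "640 million reads (2x100)", interpreted as read pairs
READS_PER_PAIR = 2         # paired-end sequencing
READ_LENGTH = 100          # bases per read
GENOME_SIZE = 3.1e9        # assumed haploid human genome size, in bases

# Total sequenced bases = pairs x reads per pair x bases per read.
total_bases = READ_PAIRS * READS_PER_PAIR * READ_LENGTH

# Raw coverage = sequenced bases divided by genome size.
raw_coverage = total_bases / GENOME_SIZE

print(f"total bases:  {total_bases / 1e9:.0f} Gb")
print(f"raw coverage: {raw_coverage:.1f}x")
```

The raw figure comes out somewhat above the 37.4x reported on the slide. A plausible explanation, though not one stated in the source, is that the reported coverage counts only bases that were successfully mapped to the reference genome.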

My whole genome analysis

SNVs

• Total Number: 3,344,185
• Number in Genes: 1,277,850
• Number in Coding Regions: 19,410
• Number in UTRs: 25,978
• Splice Site Region: 2,311
• Stop Gained: 80
• Stop Lost: 34
• Non-synonymous: 10,165
• Synonymous: 9,130

Most variants may be deleterious to a protein, but not all of them will result in disease

[Figure: diseases placed on a scale from 100% to 0%, ranging from monogenic diseases (majority of disease risk by single gene) to multifactorial diseases (>1 gene + environment)]
• Tay Sachs (100%)
• Cystic Fibrosis (100%)
• Cardiomyopathy (55%)
• Breast Cancer (40-80%)
• Type 2 Diabetes (25% to 33%)
• Alzheimer’s Disease (1% to 6%)

"Currents recorded from CHO cells transiently transfected with 50% WT and 50% G1036D mutant cDNA were 68% of WT"

Until 1981 Since 1981