Introduction to human genomics and informatics

Session 1

Prince of Wales Clinical School

Dr Jason Wong, ARC Future Fellow

Head, & Integrative Genomics Adult Cancer Program, Lowy Cancer Research Centre

Prince of Wales Clinical School, Faculty of Medicine, UNIVERSITY OF NEW SOUTH WALES, SYDNEY NSW 2052

What we will cover

• Structure of the human genome

• Layers of genomic information
  – DNA (sequence variation)
  – RNA (genes & gene expression)
  – Epigenetics (DNA methylation)
  – Epigenetics (histone code/transcription factors)

• Genomic data acquisition technologies
  – Microarray
  – Next-generation sequencing

Structure of human genome

• Consists of 23 pairs of chromosomes.

• Chromosomes come in pairs, meaning that the genome is diploid.

• Each individual chromosome is made up of double-stranded DNA.

• Approximately 3 billion bases in total (per haploid set).

Reference human genome

• Human genomes vary between individuals (by ~0.1% of the sequence).

• Computationally, a reference genome is used.

• Important things to note about the reference genome:
  – Is haploid (i.e. only 1 sequence)
  – Is a composite sequence (i.e. does not correspond to anyone’s genome)

Representation of genomic data

• Genomic data is most commonly represented in two ways:

1. Sequence data – fasta format (.fa or .fasta)

>chr1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
ACAGTACTGGCGGATTATAGGGAAACACCCGGAGCATATGCTGTTTGGTC
TCAgtagactcctaaatatgggattcctgggtttaaaagtaaaaaataaa
tatgtttaatttgtgaactgattaccatcagaattgtactgttctgtatc
ccaccagcaatgtctaggaatgcctgtttctccacaaagtgtttactttt
....

2. Location data – bed format (.bed)

chr1  934343  935552  HES4   0  -
chr1  948846  949919  ISG15  0  +
...

Columns: chromosome, start, end, name, score, strand
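As a rough illustration of how the two formats fit together, the sketch below parses a small FASTA string and a BED string with plain Python and pulls out the sequence under a BED interval. The toy sequence, the interval name `example_region`, and the helper names `parse_fasta` and `parse_bed` are made up for illustration. It assumes BED coordinates are 0-based with an exclusive end, as described on the UCSC format page.

```python
def parse_fasta(text):
    """Return {sequence_name: sequence} from FASTA-formatted text."""
    chunks = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: the name is the first word after '>'.
            name = line[1:].split()[0]
            chunks[name] = []
        elif name is not None:
            chunks[name].append(line)
    return {n: "".join(parts) for n, parts in chunks.items()}


def parse_bed(text):
    """Return one dict per 6-column BED line."""
    records = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or incomplete lines
        chrom, start, end, name, score, strand = fields[:6]
        records.append({
            "chrom": chrom,
            "start": int(start),  # 0-based, inclusive
            "end": int(end),      # exclusive
            "name": name,
            "score": int(score),
            "strand": strand,
        })
    return records


# Toy data (hypothetical, not real genome coordinates).
fasta_text = ">chr1\nNNNNACAGTA\nCTGGCGGATT\n"
bed_text = "chr1\t4\t10\texample_region\t0\t+\n"

genome = parse_fasta(fasta_text)
region = parse_bed(bed_text)[0]
region_seq = genome[region["chrom"]][region["start"]:region["end"]]
print(region["name"], region_seq)
```

Because BED intervals are half-open, a Python slice `[start:end]` maps onto them directly with no off-by-one adjustment.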

All about genomic formats here - http://genome.ucsc.edu/FAQ/FAQformat.html

What do chromosomes contain?

Genes: ~1.2% coding, ~2% non-coding

Regulatory regions: ~2%

Repetitive elements comprise another ~50% of the human genome

Layers of genetic information

• DNA sequence variation

• Gene expression
  – Coding
  – Non-coding

• Epigenetic regulation
  – DNA methylation
  – Histone/transcription factor binding

Sequence variation

Variations in DNA sequence

• Cytological level:
  – Chromosome numbers
  – Segmental duplications, rearrangements, and deletions

• Sub-chromosomal level:
  – Transposable elements
  – Short deletions/insertions, tandem repeats

• Sequence level:
  – Single Nucleotide Polymorphisms (SNPs)
  – Small Nucleotide Insertions and Deletions (Indels)

Sequence variation

• Single nucleotide polymorphisms (SNPs)
  – DNA sequence variations that exist between members of a species.
  – They are inherited and therefore present in all cells.

• Somatic mutations
  – Are acquired rather than inherited, i.e. only present in some cells
  – Often observed in cancer cells

Types of SNPs/Mutations

• Most SNPs and mutations fall in intergenic regions.

• Within genes, they can either fall in the non-coding or coding regions.

• Within coding regions, they can either leave the encoded amino acid unchanged (synonymous) or change it (non-synonymous).
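The distinction between synonymous and non-synonymous changes can be sketched in a few lines of Python using the standard genetic code. The function name `classify_coding_variant` is hypothetical, and the sketch assumes the variant is given as a reference codon and an altered codon already in reading frame. It further splits non-synonymous changes into those that swap one amino acid for another (missense) and those that create a stop codon (nonsense).

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_variant(ref_codon, alt_codon):
    """Classify a single-codon change by its effect on the amino acid."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"  # a stop codon is introduced
    # Simplification: every other amino acid change (including loss of
    # a stop codon) is labelled missense here.
    return "missense"


print(classify_coding_variant("GAA", "GAG"))  # Glu -> Glu
print(classify_coding_variant("GAG", "GTG"))  # Glu -> Val
print(classify_coding_variant("TGG", "TGA"))  # Trp -> stop
```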

[Figure labels: Intergenic region, TSS, Coding, Non-coding, Synonymous, Non-Synonymous]

Effects of sequence variation

• Non-synonymous variants:
  – Missense (changes an amino acid, which can alter protein structure)
  – Nonsense (truncates protein)

• Synonymous or non-coding variants:
  – Alter transcriptional/translational efficiency
  – Alter mRNA stability
  – Alter gene regulation (i.e. alter TF binding)
  – Alter RNA regulation (i.e. affect miRNA binding)

The majority of sequence variants are neutral.

Genes and gene expression

Types of genes

• A gene is a functional unit of DNA that is transcribed into RNA.

• Total genes in the human genome – 57,445

Source: GENCODE (version 18)

Coding genes

• Traditionally considered to be the most important functional unit of genomes.

• ~ 20,000 in the human genome.

• Due to splicing, one gene can make many different transcripts and proteins.

Source: http://www.news-medical.net

Non-coding genes

microRNA

• Plays a role in post-transcriptional regulation.

• Only discovered in 1993.

• Acts by either causing RNA degradation or inhibition of translation.

• Implicated in many aspects of health and disease including:
  – Development
  – Cancer
  – Heart disease

Long non-coding RNA (lncRNA)

• Arbitrarily defined as non-coding transcripts > 200 nt in length.

• Implicated in many functions including:
  – Altering protein/DNA interactions
  – Binding mRNA
  – Acting as a sink for miRNAs
  – Etc.

• Unlike coding genes and miRNAs, lncRNAs are less conserved and the function of many is unknown.

Prensner and Chinnaiyan (2011) Cancer Discov. 1:391

Gene expression

• Measuring the level of RNA (typically mRNA) in the sample.

• Generally microarray- or sequencing-based.

• Commonly used for measuring differential expression
  – between samples, or
  – between genes

• Computational analysis and normalisation of expression data can be complicated.
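As one simple illustration of what normalisation and differential expression involve, the sketch below scales raw read counts to counts per million (CPM) and then computes a log2 fold change between two samples. The gene names and counts are invented, and the pseudocount of 1 is an arbitrary choice. This is only a bare calculation under those assumptions; real analyses typically rely on dedicated statistical packages.

```python
import math


def counts_per_million(counts):
    """Scale raw counts so that each sample sums to one million."""
    total = sum(counts.values())
    return {gene: c * 1_000_000 / total for gene, c in counts.items()}


def log2_fold_change(value_a, value_b, pseudocount=1.0):
    """log2(b / a), with a pseudocount to avoid dividing by zero."""
    return math.log2((value_b + pseudocount) / (value_a + pseudocount))


# Hypothetical raw counts; sample_b was sequenced twice as deeply.
sample_a = {"geneA": 100, "geneB": 300, "geneC": 600}
sample_b = {"geneA": 800, "geneB": 600, "geneC": 600}

cpm_a = counts_per_million(sample_a)
cpm_b = counts_per_million(sample_b)

fold_changes = {
    gene: log2_fold_change(cpm_a[gene], cpm_b[gene]) for gene in sample_a
}
for gene, lfc in fold_changes.items():
    print(gene, round(lfc, 2))
```

In this toy data geneC has the same raw count in both samples, yet comes out as lower in sample_b once sequencing depth is accounted for, which is why normalisation has to precede any comparison.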

Source: OPENbeta

Gene-set/Pathway analysis

• Differential expression of individual genes is not necessarily informative.

• Genes are often grouped into gene-sets based on ontology or biological pathways.

• Gene-set and pathway analyses are therefore a common downstream step after differential gene expression analysis.
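One common way to ask whether a gene-set is over-represented among differentially expressed genes is a hypergeometric (one-sided Fisher) test. The slides do not specify a method, so this is offered only as a minimal sketch of that one approach. The function name `enrichment_p_value` and all the numbers in the example are hypothetical.

```python
from math import comb


def enrichment_p_value(universe_size, gene_set_size, de_size, overlap):
    """Probability of seeing at least `overlap` gene-set members among
    `de_size` genes drawn at random from `universe_size` genes."""
    max_overlap = min(gene_set_size, de_size)
    total = comb(universe_size, de_size)
    tail = sum(
        comb(gene_set_size, k)
        * comb(universe_size - gene_set_size, de_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical example: 20,000 genes measured, 500 differentially
# expressed, and 10 of those belong to a 100-gene pathway
# (about 2.5 would be expected by chance).
p_value = enrichment_p_value(20000, 100, 500, 10)
print(p_value)
```

In practice many gene-sets are tested at once, so the resulting p-values would normally be corrected for multiple testing.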

Gene regulation

Gene regulation/epigenetics

• Epigenetics is the study of mechanisms that alter cellular function independent of any changes in DNA sequence.

• Mechanisms include:
  – DNA methylation
  – Nucleosome positioning/Histone modification
  – Transcription factors
  – Non-coding RNA

DNA methylation

• DNA is methylated on cytosines in CpG dinucleotides
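Since methylation occurs at CpG dinucleotides, a first step in many analyses is simply locating them. The short sketch below lists the 0-based position of every "CG" on one strand of a sequence. The example sequence and the function name `cpg_positions` are made up for illustration.

```python
def cpg_positions(sequence):
    """Return 0-based positions of the C in every CpG dinucleotide."""
    seq = sequence.upper()  # reference sequences may be soft-masked
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == "CG"]


example = "ACGTCGCG"
positions = cpg_positions(example)
print(positions)
```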

Nucleosomes & Histones

• Histones are proteins that package DNA into nucleosomes.

Histone modifications

• Acetylation
• Methylation
• Phosphorylation
• Ubiquitination

• Can enhance or repress gene expression

Transcription factors

• Proteins that bind DNA to regulate gene expression.

• Typically bind at gene promoters or enhancers.

Studying gene regulation

• Has traditionally been more difficult than studying gene expression because:
  – The locations of many regulatory regions are poorly defined.
  – Regulatory regions differ greatly between cell types.
  – There are many modes of gene regulation.

• Next-generation sequencing technologies have enabled great progress to be made.

Genomic technologies

• Microarray-based data
  – SNP profiling
  – Copy number profiling
  – DNA methylation profiling
  – Gene expression profiling

• Next-generation sequencing
  – “Swiss-army knife” of genomics

Data acquisition

• Relies on fluorescence-based detection of DNA hybridising to complementary probes on the array.

• Can be used to study DNA or any molecule that can be converted to cDNA.
  – SNP array (probes for two alleles)
  – Methylation array (probes for bisulfite-converted DNA)
  – Expression array (probes for exonic DNA regions)

• Limited by probes present on the array.
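To make the idea behind methylation arrays more concrete: bisulfite treatment converts unmethylated cytosines to uracil, which is read as thymine after amplification, while methylated cytosines are protected. The sketch below simulates this on a single strand. The sequence, the methylated position, and the function name `bisulfite_convert` are hypothetical, and the model ignores the opposite strand and incomplete conversion.

```python
def bisulfite_convert(sequence, methylated_positions):
    """Simulate bisulfite conversion of one DNA strand.

    Unmethylated C is read as T; C at a methylated position stays C.
    Positions are 0-based.
    """
    methylated = set(methylated_positions)
    converted = []
    for i, base in enumerate(sequence.upper()):
        if base == "C" and i not in methylated:
            converted.append("T")
        else:
            converted.append(base)
    return "".join(converted)


original = "ACGTCGAC"
converted = bisulfite_convert(original, methylated_positions=[1])
print(original, "->", converted)
```

A probe matching the converted sequence and one matching the unconverted sequence can then distinguish the two methylation states at a given site.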

Microarray gene expression analysis

Microarray chips

Images scanned by laser

Gene             Value
D26528_at        193
D26561_cds1_at   -70
D26561_cds2_at   144
D26561_cds3_at   33
D26579_at        318
D26598_at        1764
D26599_at        1537
D26600_at        1204
D28114_at        707

Datasets

Class  Sno  D26528  D63874  D63880  …
ALL    2    193     4157    556
ALL    3    129     11557   476
ALL    4    44      12125   498
ALL    5    218     8484    1211
AML    51   109     3537    131
AML    52   106     4578    94
AML    53   211     2431    209
…

Data Mining
• Gene signatures
• Sample classification and analysis

Next-generation sequencing

What is NGS?

A number of different technologies.

We use Illumina sequencing technology as an example.

Figures provided by Illumina Inc.

Sequences are inferred from fluorescence during synthesis

Short sequencing reads

Figures provided by Illumina Inc.

[Figure labels: Gene, Alignment, Aligned reads]

NGS file formats

• Fastq – Stores sequencing reads from NGS. Contains read sequence and quality scores.

• BAM/SAM – A BAM file (.bam) is a binary file containing the coordinates where each read has mapped in a genome. SAM is the same information in text format.

• BedGraph/Wig – for storing continuous profile information for visualisation.

• VCF – for storing information about variants.
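To give a feel for one of these formats, the sketch below reads FASTQ text, where each read occupies four lines (header, sequence, a "+" separator, and a quality string), and decodes the quality characters into Phred scores. It assumes the common Phred+33 encoding; the read name and sequence are invented, and `parse_fastq` is a hypothetical helper rather than a library function.

```python
def parse_fastq(text):
    """Return a list of (name, sequence, phred_scores) from FASTQ text.

    Assumes well-formed, unwrapped records of exactly four lines each.
    """
    lines = [line.strip() for line in text.strip().splitlines()]
    records = []
    for i in range(0, len(lines), 4):
        header, sequence, _separator, quality = lines[i:i + 4]
        # Phred+33: quality score = ASCII code of the character - 33.
        scores = [ord(ch) - 33 for ch in quality]
        records.append((header[1:], sequence, scores))
    return records


fastq_text = "@read1\nACGT\n+\nII5!\n"
reads = parse_fastq(fastq_text)
name, sequence, scores = reads[0]
print(name, sequence, scores)
```

A Phred score Q corresponds to an error probability of 10^(-Q/10), so a score of 40 means roughly a 1 in 10,000 chance that the base call is wrong.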

https://powcs.med.unsw.edu.au/sites/default/files/powcs/page/example_file_formats.zip

Pros/cons of each technology

• NGS
  – Greater dynamic range (only limited by depth of sequencing)
  – Coverage of genome does not need to be limited.
  – Many more applications from sequencing data.
  – Data analysis and management can be challenging.

• Microarrays
  – Microarrays are still significantly cheaper.
  – Largest public datasets are likely to be microarray based.
  – Data analysis pipelines are well standardised.

Example of using public resources to tell us more about our data

http://www.powcs.med.edu.au/OncoCis

OncoCis uses public data from various sources to assign potential function to non-coding mutations.

Given a non-coding mutation, what do we want to know?

1. Does the mutation fall within a cis-regulatory region (ENCODE/Human Epigenome Atlas)?

2. Is the mutation site highly conserved (UCSC)?

3. What gene might the mutation affect (FANTOM5)?

4. What transcription factor binding site might be altered (JASPAR)?

5. Does the mutation affect a gene which is druggable (DGIdb)?

[Figure callouts: gene mapping from FANTOM5 or GREAT; conservation data from UCSC; FANTOM5 regulatory data; link out to UCSC genome browser; link out to Drug-Gene interaction database (DGIdb); motif data from JASPAR; epigenetic data from ENCODE/Epigenome project]
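The first question in the list above, whether a mutation falls within a cis-regulatory region, comes down to an interval lookup. The sketch below is not how OncoCis is implemented; it is only a minimal illustration of the idea, using invented regions in BED-style coordinates (0-based start, exclusive end) and a hypothetical helper `overlapping_regions`.

```python
# Hypothetical regulatory regions: (chromosome, start, end, name).
regulatory_regions = [
    ("chr1", 1000, 1500, "hypothetical_enhancer"),
    ("chr1", 5000, 5200, "hypothetical_promoter"),
]


def overlapping_regions(chrom, position, regions):
    """Return the names of regions containing a 0-based position."""
    return [
        name
        for region_chrom, start, end, name in regions
        if region_chrom == chrom and start <= position < end
    ]


hits = overlapping_regions("chr1", 1200, regulatory_regions)
print(hits)
```

A linear scan like this is fine for a handful of regions; genome-wide annotation sets would normally call for an indexed structure or an established interval tool.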