A Zero-Knowledge Based Introduction to Biology

Jim Notwell 09 January 2013 Q: What is your genome?

A: Q: What is your genome?

A:The sum of your hereditary information. From DNA to Organism

You are composed of ~ 10 trillion cells From DNA to Organism Cell From DNA to Organism Cell Protein

Proteins do most of the work in biology Central Dogma of Biology DNA: “Blueprints” for a cell

•Genetic information encoded in long strings

•Deoxyribonucleic acid comes in four flavors: , thymine, , and Phosphate-deoxyribose Backbone

to previous nucleotide

O 5’ H to base O P O C O H O- C C

H H H H C C

3’ H

to next nucleotide Complementary Pairing

purines

Adenine (A) Guanine (G)

Thymine (T) Cytosine (C)

pyrimidines DNA Double Helix DNA Packaging Q: What is your genome?

A:The sum of your hereditary information. Q: What is your genome?

A:The sum of your hereditary information. Humans bundle two copies of the genome into 46 chromosomes in every cell Central Dogma of Biology DNA vs RNA RNA

purines

Adenine (A) Guanine (G)

Uracil (U) Cytosine (C)

pyrimidines Gene Transcription

G A T T A C A . . . 5’ 3’

3’ C T A A T G T . . . 5’ Gene Transcription

G A T T A C A . . . 5’ 3’

3’ C T A A T G T . . . 5’ Gene Transcription

G A T T A C A . . . 5’ 3’

3’ C T A A T G T . . . 5’

Strands are separated (DNA helicase) Gene Transcription

G A T T A C A . . .

5’ G A U U A C A 3’

3’ C T A A T G T . . . 5’

An RNA copy of the 5’→3’ sequence is created from the 3’→5’ template Gene Transcription

G A T T A C A . . . 5’ 3’

3’ C T A A T G T . . . 5’

G A U U A C A . . . pre-mRNA 5’ 3’ RNA Processing

5’ cap poly(A) tail exon intron

mRNA

5’ UTR 3’ UTR Gene Structure

introns

5’ 3’

promoter 5’ UTR exons 3’ UTR

coding non-coding Central Dogma of Biology From RNA to Protein •Proteins are long strings of amino acids joined by peptide bonds

•Translation from RNA sequence to amino acid sequence performed by ribosomes

•20 amino acids → 3 RNA letters required to specify a single amino acid Amino Acid

Alanine Arginine Asparagine Aspartate H O Cysteine Glutamate Glutamine H N C C OH Glycine Histidine Isoleucine H Leucine Lysine R Methionine Phenylalanine Proline Serine Threonine Tryptophan Tyrosine There are 20 standard amino acids Valine Proteins

H O to previous aa N C C to next aa

H R

H OH

N-terminus C-terminus (start) (end) from 5’ 3’ mRNA Translation

The ribosome (a complex of protein and RNA) synthesizes a protein by reading the mRNA in triplets (codons). Each codon is translated to an amino acid. Translation

U C A G

UUU"Phenylalanine"(Phe) UCU"Serine"(Ser) UAU"Tyrosine"(Tyr) UGU"Cysteine"(Cys) U#

UUC"Phe UCC"Ser UAC"Tyr UGC"Cys C# U# UUA"Leucine"(Leu) UCA"Ser" UAA"STOP UGA"STOP A#

UUG"Leu UCG"Ser" UAG"STOP UGG"Tryptophan"(Trp) G

CUU"Leucine"(Leu) CCU"Proline"(Pro) CAU"His4dine"(His) CGU"Arginine"(Arg) U#

CUC"Leu CCC"Pro CAC"His CGC"Arg" C# C# CUA"Leu CCA"Pro CAA"Glutamine"(Gln) CGA"Arg" A#

CUG"Leu CCG"Pro CAG"Gln CGG"Arg" G

AUU"Isoleucine"(Ile) ACU"Threonine"(Thr) AAU"Asparagine"(Asn) AGU"Serine"(Ser) U#

AUC"Ile ACC"Thr AAC"Asn AGC"Ser" C# A# AUA"Ile ACA"Thr" AAA"Lysine"(Lys) AGA"Arginine"(Arg) A#

AUG"Methionine"(Met)"or"START ACG"Thr AAG"Lys AGG"Arg" G

GUU"Valine"(Val) GCU"Alanine"(Ala) GAU"Aspar4c#acid"(Asp) GGU"Glycine"(Gly) U#

GUC"Val GCC"Ala GAC"Asp GGC"Gly C# G# GUA"Val GCA"Ala GAA"Glutamic#acid"(Glu) GGA"Gly A#

GUG"Val GCG"Ala GAG"Glu GGG"Gly G Translation

5’ . . . A U U A U G G C C U G G A C U U G A . . . 3’ Translation

5’ . . . A U U A U G G C C U G G A C U U G A . . . 3’

UTR Met Ala Trp Thr Start Stop Codon Codon Translation Central Dogma of Biology Protein coding 1%

Other Stuff 99% Protein Non-coding coding exons 1% 2%

??? Introns/ 32% promoters/ polyA sites 37% Intergenic transcribed RNA Regulatory 19% elements 9% The ENCODE Project Consortium (2012) Nature 489:57-74 Non-coding RNAs •RNAs transcribed from DNA but not translated into protein

•Structural ncRNAs: Conserved secondary structure

•Involved in gene regulation microRNA Protein Non-coding coding exons 1% 2%

??? Introns/ 32% promoters/ polyA sites 37% Intergenic transcribed RNA Regulatory 19% elements 9% The ENCODE Project Consortium (2012) Nature 489:57-74 Different Cell Types

Subsets of the DNA sequence determine the identity and function of different cells Gene Expression Regulation •When should each gene be expressed?

•Why? Every cell has same DNA but each cell expresses different proteins.

•Signal transduction: One signal converted to another: cascade has “master regulators” turning on many proteins, which in turn each turn on many proteins Central Dogma of Biology Transcription Regulation •Transcription factors link to binding sites

•Complex of transcription factors forms

•Complex assists or inhibits formation of the RNA polymerase machinery Gene Transcription

G A T T A C A . . . 5’ 3’

3’ C T A A T G T . . . 5’ Transcription Factor Binding Sites •Short, degenerate DNA sequences recognized by particular transcription factors

•For complex organisms, cooperative binding of multiple transcription factors required to initiate transcription

Binding Sequence Logo Transcription Regulation

TF A Gene B Binding Site

Transcription Factor A ANRV285-GG07-02 ARI 8 August 2006 1:29 Gene Regulatory Region

understand how different permutations of the Distal regulatory elements same regulatory elements alter gene expres- sion. An understanding of how the combina- Locus control region Insulator torial organization of a promoter encodes reg- Silencer Enhancer ulatory information first requires an overview of the proteins that constitute the transcrip- tional machinery.

Proximal Core THE EUKARYOTIC promoter promoter TRANSCRIPTIONAL elements MACHINERY Promoter ( 1 kb) Factors involved in the accurate transcrip- tion of eukaryotic protein-coding genes by Figure 1 RNA polymerase II can be classified into three Schematic of a typical gene regulatory region. The promoter, which is groups: general (or basic) transcription fac- composed of a core promoter and proximal promoter elements, typically spans less than 1 kb pairs. Distal (upstream) regulatory elements, which can tors (GTFs), promoter-specific activator pro- include enhancers, silencers, insulators, and locus control regions, can be teins (activators), and coactivators (Figure 2). located up to 1 Mb pairs from the promoter. These distal elements may GTFs are necessary and can be sufficient for contact the core promoter or proximal promoter through a mechanism that accurate transcription initiation in vitro (re- involves looping out the intervening DNA. viewed in 141). Such factors include RNA polymerase II itself and a variety of auxil- (73); subsequent reinitiation of transcription iary components, including TFIIA, TFIIB, then only requires rerecruitment of RNA TFIID, TFIIE, TFIIF, and TFIIH. In addi- polymerase II-TFIIF and TFIIB. General tion to these “classic” GTFs, it is apparent that The assembly of a PIC on the core pro- transcription factor in vivo transcription also requires Mediator, moter is sufficient to direct only low levels of (GTF): a factor that a highly conserved, large multisubunit com- accurately initiated transcription from DNA assembles on the plex that was originally identified in yeast (re- templates in vitro, a process generally referred core promoter to viewed in 38, 119). to as basal transcription. Transcriptional ac- form a preinitiation complex and is GTFs assemble on the core promoter in tivity is greatly stimulated by a second class required for an ordered fashion to form a transcription of factors, termed activators. In general, ac- transcription of all preinitiation complex (PIC), which directs tivators are sequence-specific DNA-binding (or almost all) genes RNA polymerase II to the transcription start proteins whose recognition sites are usually Coactivators: site (TSS). The first step in PIC assembly present in sequences upstream of the core adaptor proteins that is binding of TFIID, a multisubunit com- promoter (reviewed in 149). Many classes of typically lack by Stanford University Robert Crown Law Lib. on 04/03/07. For personal use only. plex consisting of TATA-box-binding pro- activators, discriminated by different DNA- intrinsic sequence-specific tein (TBP) and a set of tightly bound TBP- binding domains, have been described, each DNA binding but Annu. Rev. Genom. Human Genet. 2006.7:29-59. Downloaded from arjournals.annualreviews.org associated factors (TAFs). Transcription then associating with their own class of specific provide a link proceeds through a series of steps, including DNA sequences. Examples of activator fam- between activators promoter melting, clearance, and escape, be- ilies include those containing a cysteine- and the general fore a fully functional RNA polymerase II rich zinc finger, homeobox, helix-loop-helix transcriptional machinery elongation complex is formed. The current (HLH), basic leucine zipper (bZIP), fork- model of transcription regulation views this head, ETS, or Pit-Oct-Unc (POU) DNA- PIC: preinitiation complex as a cycle, in which complete PIC assembly is binding domain (reviewed in 142). In addition stimulated only once. After RNA polymerase to a sequence-specific DNA-binding domain, TSS: transcription start site II escapes from the promoter, a scaffold struc- a typical activator also contains a separable ture, composed of TFIID, TFIIE, TFIIH, activation domain that is required for the ac- and Mediator, remains on the core promoter tivator to stimulate transcription (149). An

www.annualreviews.org Transcriptional Regulatory Elements 31 • ANRV285-GG07-02 ARI 8 August 2006 1:29

Gene Regulatory Region specific activator are typically degenerate, PIC and are therefore described by a consen- sus sequence in which certain positions are TFIIH TFIIE relatively constrained and others are more Mediator variable. Many activators form heterodimers RNA polymerase II and/or homodimers, and thus their binding sites are generally composed of two half-sites. TFIIF Notably, the precise subunit composition of an activator can also dictate its binding speci- ficity and regulatory action (37). ? Co- TFIIB Although an activator can bind to a wide activator variety of sequence variants that conform to the consensus, in certain instances the precise ? sequence of a TFBS can impact the regulatory ? output. For example, TFBS sequence vari- TFIIA TFIID ations can affect activator binding strength (reviewed in 30), which may be biologically Activator AD important in situations such as in early devel- DBD opment, in which activators are distributed in a concentration gradient (84, 144). TFBS se- quence variations may also direct a preference TATA TSS for certain dimerization partners over others Core (37, 124, 142). Finally, the particular sequence promoter of a TFBS can affect the structure of a bound Figure 2 activator in a way that alters its activity (69, The eukaryotic transcriptional machinery. Factors involved in eukaryotic 104, 108, 154, 163). The best-studied exam- transcription by RNA polymerase II can be classified into three groups: ples are nuclear hormone receptors, a large general transcription factors (GTFs), activators, and coactivators. GTFs, class of ligand-dependent activators. Various which include RNA polymerase II itself and TFIIA, TFIIB, TFIID, studies have shown that the relative orienta- TFIIE, TFIIF, and TFIIH, assemble on the core promoter in an ordered tion of the half-sites, as well as the spacing be- fashion to form a preinitiation complex (PIC), which directs RNA polymerase II to the transcription start site (TSS). Transcriptional activity tween them, play a major role in directing the is greatly stimulated by activators, which bind to upstream regulatory regulatory action of the bound nuclear hor- elements and work, at least in part, by stimulating PIC formation through mone receptor dimer (37). a mechanism thought to involve direct interactions with one or more Activators work, at least in part, by in- components of the transcriptional machinery. Activators consist of a creasing PIC formation through a mechanism DNA-binding domain (DBD) and a separable activation domain (AD) by Stanford University Robert Crown Law Lib. on 04/03/07. For personal use only. that is required for the activator to stimulate transcription. The direct thought to involve direct interactions with targets of activators are largely unknown. one or more components of the transcrip- Annu. Rev. Genom. Human Genet. 2006.7:29-59. Downloaded from arjournals.annualreviews.org tional machinery, termed the “target” (141, extensive discussion of the properties of acti- 149). Activators may also act by promoting a vators is beyond the scope of this review; read- step in the transcription process subsequent to ers are referred to several excellent reviews on PIC assembly, such as initiation, elongation, TBP: TATA-box-binding the subject (87 and references therein). or reinitiation (103). Finally, activators have protein The DNA-binding sites for activators also been proposed to function by recruit- TAF: [also called transcription factor-binding sites ing activities that modify chromatin structure TBP-associated (TFBSs)] are generally small, in the range (47, 106). Chromatin often poses a barrier factor of 6–12 bp, although binding specificity is to transcription because it prevents the tran- TFBS: transcription usually dictated by no more than 4–6 po- scriptional machinery from interacting di- factor-binding site sitions within the site. The TFBSs for a rectly with promoter DNA, and thus can be

32 Maston Evans Green · · Protein Non-coding coding exons 1% 2%

??? Introns/ 32% promoters/ polyA sites 37% Intergenic transcribed RNA Regulatory 19% elements 9% The ENCODE Project Consortium (2012) Nature 489:57-74 Q: What if the transcription/ translation machinery makes mistakes?

Q:What is the effect in coding regions? Evolution = Mutation + Selection B.Structural Structural Abnormalities Abnormalities Normal

Duplication

Deletion

Inversion

Insertion

Reciprocal Translocation II.Single Single NucleotideNucleotide ChangesChanges

Normal Missense Mutation DNA DNA AAAATACGTGCA AAAATACCTGCA UUUUAUGCACGU UUUUAUGGACGU mRNA mRNA

Protein Phe Tyr Ala Arg Protein Phe Tyr Gly Arg

Silent Mutation Nonsense Mutation DNA DNA AAGATACGTGCA AAAATTCGTGCA UUCUAUGCACGU UUUUAAGCACGU mRNA mRNA

Protein Phe Tyr Ala Arg Protein Phe STOP II.Single Single Nucleotide Nucleotide ChangesChanges

Normal DNA AAAATACGTGCA UUUUAUGCACGU Frameshift (Deletion) mRNA

Protein Phe Tyr Ala Arg T

DNA AAAAACCTGCA Frameshift (Insertion) UUUUUGGACGU DNA mRNA AAATATACGTGC Protein UUUAUAUGCACG Phe Leu His Val mRNA

Protein Phe Ile Cys Thr Evolution = Mutation + Selection Selection time

Harmful mutation Beneficial mutation Evolution = Mutation + Selection Summary

Evolution = Mutation + Selection Summary •All hereditary information encoded in double-stranded DNA

•Each cell in an organism has same DNA

•DNA → RNA → protein

•Proteins have many diverse roles in cell

•Gene regulation diversifies protein products within different cells Further Reading •See website: cs173.stanford.edu