Journal of Theoretical Biology 288 (2011) 92–104

Contents lists available at ScienceDirect

Journal of Theoretical Biology

journal homepage: www.elsevier.com/locate/yjtbi

On parameters of the human genome

Wentian Li

The Robert S. Boas Center for Genomics and Human , The Feinstein Institute for Medical Research, North Shore LIJ Health System, Manhasset, 350 Community Drive, NY 11030, USA article info abstract

Article history: There are mathematical constants that describe universal relationship between variables, and physical/ Received 13 April 2011 chemical constants that are invariant measurements of physical quantities. In a similar spirit, we have Received in revised form collected a set of parameters that characterize the human genome. Some parameters have a constant 28 June 2011 value for everybody’s genome, others vary within a limited range. The following nine human genome Accepted 21 July 2011 parameters are discussed here, number of bases (genome size), number of (karyotype), Available online 3 August 2011 number of protein-coding gene loci, number of transcription factors, guanine–cytosine (GC) content, Keywords: number of GC-rich gene-rich isochores, density of polymorphic sites, number of newly generated Genome size deleterious mutations in one generation, and number of meiotic crossovers. Comparative genomics and Karyotype theoretical predictions of some parameters are discussed and reviewed. This collection only represents Human genes a beginning of compiling a more comprehensive list of human genome parameters, and knowing these Transcription factors Single nucleotide polymorphisms parameter values is an important part in understanding human evolution. & 2011 Elsevier Ltd. All rights reserved.

1. Introduction never change. Although such assumption has been challenged concerning fundamental physical constants, in particular in the Mathematics, physics, and chemistry all have their standard cosmology context (Dirac, 1937; Gamow, 1967; Varshalovich and sets of basic constants, as expected for any quantitative science. Potekhin, 1995; Uzan, 2003), the suggested magnitude of varia- Mathematical constants are non-dimensional quantities that tion is extremely small and the proposed time for such a change is numerically characterize relationship between variables. Exam- very long. ples include golden ratio f, circle circumference-to-diameter We are used to the notion that at the biological level, there ratio p, and natural logarithmic base e. A total of 136 mathema- could be structures or processes that are universal for all cur- tical constants are listed and discussed in Finch (2003). Funda- rently living species (e.g., genetic code), or some branches of mental constants in physics are dimensional invariant values with species (e.g., division of sexes for most animals and plants). On various physical contents (Fritzsch, 2004). Examples include the the other hand, any attempt to quantify biological organisms speed of light c, Planck constant h, Newtonian constant of numerically is to take a historical snapshot at a specific time and gravitation G in physics (Mohr et al., 2008). A total 326 physi- space; and due to evolution, these quantities may change over cal/chemical constants are listed in National Institute of Stan- time. Due to this reason, the term parameter is better than the dards and Technology (NIST)’s database. term constant in describing numerical features in biology, with Compared to these mathematical and physics sciences, biol- the admission that these quantitative measures could eventually ogy, in particular molecular biology, is less quantitative, and be different. generally speaking, constants are almost non-existent. We of Why do we need to know these parameter values, if they are course recognize some branches of biology with strong quantita- supposed to change in the future? Why could it be possibly tive tradition, such as Mendel’s genetics, Fisher–Haldane– interesting? First, for many genetic and genomic analyses, famil- Wright’s population genetics, and the modern day bioinformatics. iarity with human genome parameter makes it possible to do Even as biology becomes more quantitative in recent years, the quick back-of-the-envelope calculation. In other words, these numerical value of some baseline measurement is still typically numbers are useful in practice. Second, comparing parameter less focused on than relative changes of some quantitative values in human genome with that in other genomes reveal rich measurements. The term constant implies that the value will information about evolution. It is worth repeating Dobzhansky’s line that ‘‘nothing in biology makes sense except in the light of evolution’’ (Dobzhansky, 1964), and these parameter values E-mail address: [email protected] included. Third, in the scope of human species, these parameter

0022-5193/$ - see front matter & 2011 Elsevier Ltd. All rights reserved. doi:10.1016/j.jtbi.2011.07.021 W. Li / Journal of Theoretical Biology 288 (2011) 92–104 93

Table 1 Abbreviations used: TF: transcription factor; GC%: guanine–cytosine content; F: female; M: male; kb: 103 bases; Mb: 106 bases; Gb: 109 bases; NS: non-synonymous substitution; S: synonymous substitution; NC: substitution in non-coding region. Notes. n1: person-to-person variation. n2: based on the 70280 copy number variations (CNV) with average size of 250 kb (Perry et al., 2008). n3: see Doolittle and Sapienza (1980), Orgel and Crick (1980), Charlesworth et al. (2002)). n4: see Cavalier-Smith (1982a), Cavalier-Smith (1982b). n5: 22 pairs are autosomal. n6: people who carry extra of chr21, chr18, chr13, chrX, etc. may still survive, though with abnormal symptoms (Down syndrome, Edwards syndrome, Patau syndrome, triple-X syndrome). n7: see Spuhler (1948a). n8: see King and Jukes (1969) and Ohno (1972). n9: see Bernardi (2007) n10: see Lynch (2010). n11: see Watterson (1975). n12: per diploid per generation.

Parameter Value Variabilityn1 Theory, model, prediction

n3 n4 # bases (genome size) 3Gb 4 720 Mbn2 Selfish-DNA , cell size # chromosomes (karyotype) 23 pairsn5 0n6 Fusion/fission # genes 20,000 ? Comparative genomicsn7, mutation loadn8 # TF gene 2000 ? ? GC% 40% ? Neoselectionistn9, mutation–selection balancen10 # GC-/gene-rich isochores 120 ? ? Density of polymorphic sites 1=kb 1210=kb ? Infinite-site modeln11 # deleterious mutations 1n12 0:522 ? Ratio of #NS/(#Sþ#NC) # crossovers 80 (F), 50 (M) 502100 (F), 30270 (M) Muller’s¨ ratchet ?

values are relatively stable. As soon as some of these parameters conversion between the two units is: 1 pg¼978 Mb or 1 Mb¼ start to change, we are arguably in a process of evolving towards a 1.022 103 pg (Gregory et al., 2007). The 3 Gb human genome post-human species. size is roughly equal to 3 pg C-value. Constancy is at the other end of spectrum from variability and Previously, it was thought that the larger a genome, the more polymorphism. Each human being’s genome is slightly different complex the organism. Refusing the statement that plants are from another, with the exception of identical twins. Even that more complex than human because they have much higher claim has been challenged (Bruder et al., 2008; Kaminsky et al., C-values led to the realization that plants have higher proportion 2009), though any genomic difference between identical twins of repetitive sequences than human (Flavell et al., 1974), and should be small, if not zero (Baranzini et al., 2010; Ono et al., genome size by itself is not a perfect measure of complexity. This 2010). The variability has its limit though: if the change in a is the ‘‘junk DNA explanation’’ of the ‘‘C-value paradox’’ (Pagel and person’s genome is too far beyond the norm, e.g., one extra copy Johnstone, 1992). of chromosome 1, that person will not survive. If polymorphism is If genome size cannot really measure the organism complex- essential in identifying genetic basis of phenotypes and medical ity, it can be however used to infer evolutionary processes and conditions (Buchanan and Higley, 1921), then constancy is essen- rates. It is estimated that due to the difference of retrotransposi- tial in identifying us human as human. tion rate between human and other primates, there is a 15–20% Table 1 lists the human genome parameters or quantities to be expansion of human genome size in the past 50 million years (Liu discussed in this paper. The selection of these quantities is in some et al., 2003). Another study has shown that genome size was sense arbitrary. Some parameter values do not change from reduced in mammals around the Cretaceous-Tertiary (KT) bound- person to person, other vary among individuals within a range ary 65 millions years ago (Rho et al., 2009)—a mass extinction (e.g., number of crossovers). Some are measured precisely, others period that killed the dinosaurs. are still work-in-progress (e.g., number of transcription factor There are two schools of thoughts in understanding genome genes). Most parameters are well defined, whereas for others the size: the selfish-DNA hypothesis and the bulk-DNA hypothesis definition itself is still under debate (e.g., gene-rich isochores). (Lynch, 2007). The selfish-DNA school considers the mobile With the human genome sequence in public database, such as elements in non-coding region as the driving force of genome NCBI (http://www.ncbi.nlm.nih.gov/genome/guide/human/), UCSC size changes (Doolittle and Sapienza, 1980; Orgel and Crick, 1980; Genome Browser (http://hgdownload.cse.ucsc.edu/), and Ensembl Charlesworth et al., 2002). A more complete description would (http://useast.ensembl.org/Homo_sapiens/), many statistics can be consider all forces that affect genome size (besides mobile/ obtained easily. Differing from other efforts of providing a sum- transposable elements, there are also insertion, deletion, duplica- mary of the biological information from the human genome tion, and chromosome translocation), characterize whether the (Scherer, 2008 and http://www.humangenomeguide.org/), this mutation experienced neutral random drift or selective advan- review focuses more on numerical and theoretical aspects of the tage/disadvantage, then run a computer simulation (Petrov, human genome. 2001). In particular, deletion or DNA loss has been suggested as an important factor in determining the genome size (Petrov et al., 2000). 2. Human haploid genome size: 3 109 basepairs The bulk-DNA hypothesis is based on the observation that genome size is highly correlated with the cell volume, nuclear The February 2009 version of the human genome (GRCh37/ volume (Cavalier-Smith, 1982, 1985), and cell division rate hg19) contains 3.098 Gb (b¼basepair, kb¼103b, Mb¼106b, (Bennett, 1972). It has been hypothesized that the cell size increase Gb¼109b) including autosomal, X, Y chromosomes and mitocon- from prokaryotes and eukaryotes was crucially due to the added drial DNA, sequenced and unsequenced (thus labeled by ‘‘N’’), mitochondria DNA/genes (Lane and Martin, 2010). aligned and unaligned (thus in the ‘‘random’’ segment files). Since The fact that genome size is correlated with the cell the number of unsequenced bases in heterochromatic regions is volume has been used to infer the genome size of long-extinct an estimation, its value (234 Mb) can be less reliable. Although we species such as dinosaurs (Organ et al., 2007). In the bulk-DNA now use b to measure genome size, before the genome age, hypothesis, a larger genome is necessary to maintain a large cell genome size was measured by weight of DNA molecules in the volume with more complex cellular structures (Cavalier-Smith, unit of picogram (¼1012 g), called C-value (Swift, 1950). The 2005). 94 W. Li / Journal of Theoretical Biology 288 (2011) 92–104

Theoretically, in order to estimate the human genome size Wienberg, 2004; Froenicke,¨ 2005; Froenicke¨ et al., 2006). The from a computer simulation, one has to include all necessary sequence-based inference of the ancestral mammalian synteny mutation mechanisms and model their fitness (Petrov, 2001). This blocks and karyotypes are more likely to be inaccurate due to the endeavor seems to be ambitious, if not impossible. Similarly, to smaller number of species available (Froenicke¨ et al., 2006). estimate the human genome size from the expected cell volume Moving further up in the phylogenetic tree, it could be size, where the cell size is determined by some first principle increasingly difficult to infer an ancestral karyotype, because of physical–chemical calculation, also seems to be difficult. It is still tremendous amount of chromosomal re-arrangements. Susumu unclear whether a practical plan exists to predict the human Ohno proposed a two-round (2R) whole-genome duplication genome size. (polyploidization) before the onset of ancestral fish (Ohno, On the other hand, primates that are closer to human all have a 1970). If this 2R hypothesis is true, we would expect the mini- larger C-value than human (3.5): chimpanzee (3.63), gorillas (3.57), mum number of chromosome pairs to be 4. orangutans (3.66) (Morand and Ricklefs, 2005 and http://www. Another mental exercise on the range of number of chromo- genomesize.com/). On sequence level, a rough estimation of base- some pairs is to start from the human genome with 17 meta- pair difference between the human and chimpanzee genome is still centric and five acrocentric chromosome pairs. If we allow fission lacking. There are sequence gaps in the chimpanzee genome, and in all metacentric chromosomes, we expect 17 2þ5¼39 chro- even the human genome is only 92.4% sequenced/aligned (using the mosome pairs. On the other hand, if we allow fusion in acro- GRCh37/hg19 release, 234 Mb bases are marked as ‘‘N’’ out of centric chromosomes, e.g., two combine into one, and the other 3095.7 Mb). Furthermore, there are up to 70–80 copy number three combine into one, we expect 17þ2¼19 chromosome pairs. variations (CNVs) with median size of 250 kb within either The importance of centric fusion was recognized by Robert human or chimpanzee genome (Perry et al., 2008). Matthey, and he called the number of chromosome arms ‘‘nombre fondamental’’, as the more fundamental unit in karyotypes than chromosome numbers (Matthey, 1945). 3. Number of human autosomal chromosome pairs: 22

It had been long thought that the number of human chromo- 4. Number of human gene loci: approximately 20,000 somes was 48 (24 pairs) (Painter, 1923). The correct number of human chromosome, 46 (23 pairs, with 22 pairs of autosomal At 2000s Cold Spring Harbor Laboratory’s annual genome chromosomes), was announced in 1955 after new techniques to meeting, a bet was made in the bar to guess correctly the number better arrest cell cycle at the specific point and to better spread of genes in human genome. I put my 1 USD bet, and forgot what spacings between chromosomes (Tjio and Levan, 1956; Kottler, number I chosen, though apparently I did not win (Lee Rowen at 1974; Gartler, 2006). The numerical index attached to each Institute for System Biology was a winner, Pennisi, 2003). The chromosome is ordered by the chromosome size: chromosome histogram of all bettors’ numbers shows that most people over- 1 for the largest, and supposedly chromosome 22 for the smallest. estimated the gene number instead of underestimating it (Nature However, human chromosome 21 turns out to be even smaller Genetics Editorial, 2000). than human chromosome 22. However, the question, if not asked precisely, can have One of the easiest ways to change chromosome numbers is different answers, also considering the fact that the concept of fusion between two acrocentric (one-armed) chromosomes, and gene itself has evolved in time (Beurton et al., 2000). The original fission of a metacentric (two-armed) chromosome. Chimpanzee, question means the number of chromosome locations in the human’s closest relative, has 24 pairs of chromosomes (23 pairs of human genome that participate transcription—or gene loci. autosomal chromosomes) (Young et al., 1960). There is strong Usually, non-protein-coding gene loci are excluded from the evidence that human chromosome 2 resulted from a fusion count. The number of gene products or number of proteins can between two ancestral chromosomes (Dutrillaux et al., 1975; be several time multiples if each gene has many alternative Yunis and Prakash, 1982; Ijdo et al., 1991), whereas such fusion splicings (Ast, 2005; Nilsen and Graveley, 2010). Indeed, in event did not occur in chimpanzees. ensembl (Flicek et al., 2011) release GRCh37, there are 21,550 For other closely related species, both gorillas and orangutans unique gene names that are protein-coding, whereas the number have 48 chromosomes or 23 autosomal chromosome pairs, indicat- of transcript names is 126,019. The latter number is consistent ing that the ancestral great apes karyotype had 23 pairs of autosomal with the 100,000 alternative splicing events reported in Pan chromosomes. For lesser apes, the karyotype is more complicated: et al. (2008). To determine the exact number of alternative 2n¼38, 44, 50, 52 are all observed (O’Brien et al., 2006). Due to splicing products is not easy, as different detection methods have difficulties caused by faster chromosome evolution, lesser apes are different levels of reliability (Modrek and Lee, 2002). often not included in chromosome phylogeny analysis of primates Unlike genome size and chromosome numbers, there are no (O’Brien et al., 2006). Moving higher to primates, it is suggested that simple experimental methods to determine the gene loci number the ancestral primate karyotype contained 24 pairs of autosomal for unsequenced species. If sequence is available on a region of chromosomes (2n¼50) (Muller¨ et al., 1999; Ferguson-Smith and the genome or a chromosome, it is possible to extrapolate the Trifonov, 2007; Stanyon et al., 2008). gene loci number from this region to the whole genome. An Using cytogenetic chromosome painting to find homology extrapolated number of 30,000 genes is given in Ewing and Green between closely related species, followedbyaparsimonyinference (2000), based on the proportion of human expressed sequence procedure (Rocchi et al., 2006), many ancestral karyotypes have tags (EST) that map to chromosome 22, and computer predicted been proposed (Ferguson-Smith and Trifonov, 2007). For example, number of genes on the already sequenced chromosome 22. 2n¼42 (20 autosomal pairs) for carnivore (cat, dog, seal, raccoon, Due to lack of simple experimental methods in finding genes etc.), 2n¼52 (25 autosomal pairs) for cetartiodactyls (dolphin, in unsequenced genomes, it is hard to speculate how gene hippopotamus, giraffe, cow, deer, pig, etc.), 2n¼46 or 48 (22 or 23 number changes in related species. Even for genomes whose autosomal pairs) for rodents (squirrel, rat, mouse, hamster, etc.) sequence is available, missing data prevents us from reaching (Ferguson-Smith and Trifonov, 2007). Interestingly, the proposed conclusions. For example, there are 19,829 unique gene names ancestral placental mammalian (eutherians) karyotype contains 22 listed in ensembl’s chimpanzee genome (release CHIMP1A from autosomal pairs, the same as human (Chowdhary et al., 1998; April 2006). This number is 8% lower than that of human, but it is W. Li / Journal of Theoretical Biology 288 (2011) 92–104 95 unclear whether it is due to incomplete sequencing or due to truly number of genes in the human genome appeared more than 45 lower number of genes. years ago y in 1964 (Vogel, 1964)’’ missed the date by at least Since a main mechanism for generating new genes is seg- 16 years. mental duplication, comparison of duplication events in human and chimpanzee might provide insight concerning the difference of number of genes in these species (Cheng et al., 2005). Lineage- 5. Number of transcription factor gene loci: specific segmental duplications have been found in both chim- approximately 2000 panzee and human, with a comparable rate (Cheng et al., 2005). This points to the possibility that the gene sets in chimpanzee and If neither genome size nor number of genes (Hahn and Wray, human may be different even if the number of genes in the two 2002) could reliably characterize the complexity of a species, the genomes are the same, due to species-unique gain and loss of next candidates might be the number of transcription factors (TF) genes (Demuth et al., 2006). or the number of protein–protein or protein–DNA interactions Can the number of human genes be predicted from a more (Szathma´ry et al., 2001). Clearly, each transcription factor inter- fundamental principle? In 1948, James Spuhler estimated the acts with another gene through its TF product. human gene number by two methods. The first estimation of Proteins that are involved in transcription regulation include 42,000 genes is by comparing the human and Drosophila chromo- general transcription factors (GTF) and transcription activators some lengths and the known number of Drosophila genes (TA). GTFs are conserved among many different species while TAs (Spuhler, 1948a,b). The second estimation of 19,890–30,420 genes are species-specific. The number of GTFs in human is considered is based on higher male-to-female miscarriage ratio, which he to be around 200–300 (Farnham, 2009). The number of human used to infer the number of lethal mutations on X chromosome, genes that code for TA has been suggested to be 2000 (Tupler then extended the number to the whole genome. In 1962, et al., 2001), with the largest group of C2H2 zinc finger proteins Oswaldo Frota-Pessoa made some corrections on Spuhler’s second with more than 900 members (Tupler et al., 2001; O’Geen et al., method, and came up with a prediction of 11,700 human genes 2010). Compared to Drosophila, C. elegans, and yeast, the number (Frota-Pessoa, 1961). of TAs is greatly expanded in human genome. Jack King and Tom Jukes predicted the number of human More recent studies have more or less confirmed the 2000 genes to be 40,000 (King and Jukes, 1969) in their neutral number for TFs. In Messina et al. (2004) the number of human TFs evolution paper. Their estimation is based on the spontaneous is estimated to be 1962 (the largest family is the zinc finger group recessive lethal mutations per locus (i.e., per gene) per generation with 762 members). In Vaquerizas et al. (2009), 1391 sequence- (the mutation rate for deleterious but non-lethal mutations per specific DNA-binding TFs are manually curated. In Ravasi et al. base per generation will be discussed in a later section) of (2010), 1988 DNA-binding TFs are compiled. 1062105. If more genes mean more chances to lethal muta- Simply dividing the number of genes with the number of tions and non-survival, the number of genes will be limited. The transcription factors means that 20,000/2000¼10 target per TF is number of 40,000 genes is to ensure the non-survival rate per expected. We may make the estimation slightly more compli- person per generation to be 4240%. Ohno (1972) used a very cated by assuming that 500 housekeeping genes (a number of similar argument in his ‘‘junk’’ DNA paper: the upper limit is 575 housekeeping genes is used in Eisenberg and Levanon, 2003) apparently 100,000 if the lethal mutations per locus is 105, and leave 19,500 tissue-specific genes, then 19,500=2000 9:7 target if dominant/recessive model is considered, the estimation per TF. Note that TF genes are genes themselves, and their becomes 30,000. transcription is activated by other TFs. Many are unaware of these earlier efforts in predicting the If the distribution of the number of targets per TF is not number of human genes. For example, the statement in Pertea normally distributed, then the average value will not characterize and Salzberg (2010) that the ‘‘the first attempt to estimate the the whole distribution. To have an idea of the distribution, we use

1974 TFs, 67311 targets

322 116 100.0 163 all TF’s

50.0 75 76

20.0

10.0

5.0 ZNF family (163 ZNT*, 2994 targets) 2.0

number of TFs with certain target 1.0

0.5

0.5 1.0 2.0 5.0 10.0 20.0 50.0 100.0 200.0 500.0 num targets

Fig. 1. Distribution of number of predicted TF targets for the data from the ITFP database (Zheng et al., 2008a,b). x-axis is the number of TF targets, y-axis is the number of TF’s with that number of targets (e.g., there are 322 TFs with only one predicted target, 163 TFs with two predicted targets). Both x and y are in logarithmic scale. The histogram for the subset of zinc finger TFs is plotted separately. 96 W. Li / Journal of Theoretical Biology 288 (2011) 92–104

a database (Integrated Transcription Factor Platform (ITFP), nucleotides, 1pS ¼ AT% ¼ W% that of weak (i.e., A and T) http://itfp.biosino.org/itfp/) that lists the predicted targets for nucleotides, pW-S ¼ PðW-S9WÞ and pS-W ¼ PðS-W9SÞ are the each TF (Zheng et al., 2008a,b). Although it is unclear how good substitution rates per generation, GC% may increase due to the the target predictions are, or whether a TF-target binding has any W-S substitution, or decrease due to the S-W substitution. In biological function (Wingender, 2008), we are hoping individual the equilibrium (marked by n), the two tendencies are balanced: prediction errors do not affect the distribution. n n ð1p Þp - ¼ p p - Fig. 1 shows the histogram of the predicted number of targets S W S S S W per TF from the ITFP database. Most TFs have very few number of or targets, and small number of TFs have large number of targets. Since n pW-S the histogram is plotted in double-logarithmic scale, the distribution pS ¼ ð1Þ p - þp - is closer to a power-law (Clausetetal.,2009)oramodifiedpower- S W W S law called beta function (Martı´nez-Mekler et al., 2009)thantoa The substitution rate data can be found in Gojobori et al. (1982), normal distribution, and mean value will not be a good measure to Podlutsky et al. (1998), Petrov and Hartl (1999), and Lynch (2007), characterize the whole distribution. Interestingly, the median num- pW-S ¼ pA:T-G:C þpA:T-C:G 0:2520:26 (the first term is the prob- ber of targets in Fig. 1 is 11 and median absolute deviation (a robust ability for transition and the second term is for transversion), n alternative of standard deviation) is 10. For the largest subset of TF, pS-W ¼ pG:C-A:T þpG:C-T:A 0:470:5, which leads to pS 33%.A the zinc finger TFs, the median number of targets predicted is 7, and more careful modelling is carried out in Arndt et al. (2003) and median absolute deviation is 6. Duret and Arndt (2008) by using the distinct transition substitu-

The number of targets of a TF is similar to the ‘‘K’’ parameter in tion rate pG:C-A:T at the CpG sites. the NK model (Kauffman, 1993), where N is the number of parts The observation that the equilibrium GC% driven by neutral of a system and K is the number of cross-links among parts. Early substitution alone is lower than the observed GC% in human theoretical studies of NK models often assumed a fixed K value, genome implies that other driving forces might contribute to the either low (K¼2) (Kauffman, 1969) or large (K¼N1) (Kauffman base composition (Lynch, 2010; Bernardi, 2004; Eyre-Walker and and Levin, 1987). Fig. 1 clearly indicates that K can be broadly Hurst, 2001) (or, maybe the equilibrium state has not been distributed. This scale-free degree distribution in regulatory net- reached). These other driving forces can be one of the selections: work, as well as metabolic network and most other networks in (1) AT-rich genomes have a higher chance to contain stop codons, biology, has already been noticed (Baraba´si and Oltvai, 2004). leading to shorter gene regions with possibly deleterious effects; The co-existence of widespread TFs with hundreds of targets (2) AT-rich genomes are less stable thermally than GC-rich and specific TFs with only 1 or few targets is parallel to the genomes (Bernardi and Bernardi, 1986; Belle et al., 2002); observation of co-existence between broad TFs that are expressed (3) Being GC-rich is positively correlated with the DNA helix’ in all tissues and specific TFs that are expressed in only 1–2 bendability and the ability to B–Z transition (Vinogradov, 2003), tissues (Farnham, 2009). Any theoretical prediction on the num- which eases the opening of chromatin during transcription; ber of TFs in human genome has to take that observation into (4) Having recombination is evolutionarily advantageous, and account. recombination may somehow increase the local GC%. One hypothesis for a recombination-related biased AT-GC rate is the gene conversion (Galtier et al., 2001; Meunier and 6. Guanine–cytosine (GC) content: 40% Duret, 2004). The significance of the biased gene conversion in forming GC% in human is still unclear, as no GC-favoring gene Human genomes, like other genomes, follow Chargaff’s second conversion has been directly observed yet in human (even in parity rule (or ‘‘strand symmetry’’) of A% (adenine content) T% yeast, GC-favoring gene conversions only occur 50.62% of the (thymine content) and G% (guanine content) C% (cytosine time, a very weak signal, Duret and Galtier, 2009), the region content) within a single strand (Rudner et al., 1968). With affected by gene conversion is very narrow, and gene conversion GC% ¼ G%þC% 40%, Chargaff’s second parity rule predicts that is not exclusively linked to crossovers (it can also occur when G% C% 20%, and A% T% 30% in the human genome. The Holliday junction is resolved in a non-crossover outcome). human reference sequence release GRCh37/hg19 (February 2009) Recombination (with the crossover outcome) rate is apparently indeed shows G%¼20.50%, C%¼20.49, A%¼29.49%, T%¼29.52% statistically correlated with GC-content (Fullerton et al., 2001), for 22 autosomal chromosomes (and 6.82% untyped rate), and however, statistical correlation does not necessarily imply caus- Chargaff’s strand symmetry is strikingly accurate: 9G%C%9= ality, and the causality arrow is not known. For this, causal ð0:5ðG%þC%ÞÞ ¼ 4:9 104 and 9A%T%9=ð0:5ðA%þT%ÞÞ ¼ 1:0 inference using a third variable might be needed (Freudenberg 103. Similarly for X chromosome, G%¼19.77%, C%¼19.73%, et al., 2009). A%¼30.21%, T%¼30.29% (and 2.69% untyped rate), 9G% C% If the mutation-driven drift of GC% is indeed balanced by 9=ð0:5ðG %þ C%ÞÞ ¼ 2:0 103 and 9A%T%9=ð0:5ðA%þT%ÞÞ ¼ selection (Bulmer, 1991), which is more likely to occur locally in 2:6 103. gene-rich regions, the relative strength of selection (s) over drift 0 0 0 0 With G% C% 20%, one might expect that 5 -CC-3 ,5-CG-3 , (measured by 1=Ne where Ne is the effective population size) 50-GC-30,50-GG-30 dinucleotides appear with 0.2 0.2¼4% prob- might be estimated from the data. This ratio is estimated to be ability. However, the 50-CG-30 dinucleotide density is only 1% in around 1 in Kondrashov et al. (2006) and 0.99 in Lynch (2010). human genome (it ranges from 0.79% to 1.9% in 23 autosomal- Because the second codon position almost always determines the plus-X chromosomes). It is because the C base in 50-CG-30 (also amino acids in the coding product, and the third codon position does called CpG) is a target of methyltransferases in mammalian not, the relationship of GC%[3] and GC%[2] can be used to study the genome, and methylated C base has a very high mutation rate relative constraints on the protein and DNA level (Bernardi, 2004). A to base T, resulting in a removal of the 50-CG-30 dinucleotide from linear relationship is observed between GC%[3] and GC%[2], with the genome (Sinsheimer, 1955; Doerfler, 1983). regression coefficients of: GC%[3]¼2.41þ7.14 GC%[2] for eukar- If the GC content is completely driven by neutral substitution, yotes (D’Onofrio et al., 1999), GC%[3]¼1.88þ5.84 GC%[2] for its equilibrium value can be easily estimated (Sueoka, 1962; Hartl human (Pruitt and Maglott, 2001; Cruveiller et al., 2003), which is and Clark, 2006, p. 156; Lynch, 2007, box 6.1 at p. 126). Assume further simplified to GC%[3]¼2þ6GC%[2](Bernardi, 2004, pS¼GC%¼S% to be the composition of strong (i.e., G and C) p. 281). Solving GC%[3]¼GC%[2] leads to solution GC%¼39.3%, W. Li / Journal of Theoretical Biology 288 (2011) 92–104 97

38.8% and 40.0%. It is tempting to speculate that the observed (Bernaola-Galva´n et al., 1996) followed by a statistical test on GC% GC%¼41% which is an average of GC/gene-rich and GC/gene-poor difference near a segmentation point (Oliver et al., 2004). regions, is also a balance between the two constraints. Fig. 2 shows that although the distribution of GC% among segmented domains (Fig. 2(C)), and the GC% difference between neighboring domains (Fig. 2(D)) are somewhat similar in the 7. Number of guanine–cytosine-rich and gene-rich isochores: three studies, the number of called isochores (Fig. 2(A)) and the approximately 120 size distribution of them (Fig. 2(B)) are very different. This makes it difficult to unambiguously quote a number of isochores in GC% along a chromosome is not stationary: it can be as high as human genome. Although one choice is to use the 3000 number, 60% and as low as 30%. Finding regions with homogeneous GC- this number depends on the choice of 100–300 kb lower limit for composition is tricky, because it crucially depends on at what domain size, which is more an experimental limitation in early length scales the GC% is examined. Experimentally, GC% of DNA work on isochore than an intrinsic sequence feature. For example, segments of length at least 300 kb had been studied long before we can have smaller than 3000 isochores (e.g., 300 or 100, see DNA sequence was available, and the concept of ‘‘isochore’’ was http://bioinfo2.ugr.es/SSinfoSup/ if the segmentation criterion is proposed for experimentally observed long DNA segments with made more stringent (Carpena et al., 2011). distinct GC% (Macaya et al., 1976). Instead, here we focus on the task of counting the number of Just as the number of chromosome bands depends on resolu- GC-rich and gene-rich regions. This number is perhaps more tion of the detection technique (Yunis, 1976), the number of robustly measured than the amount or proportion of DNA in GC- isochore or compositionally homogeneous chromosome regions rich isochores, as repetitive sequences could increase the latter but also depends on criterion for claiming or rejecting homogeneity. not the former. Consequently, this number could be more infor- This ‘‘domains with domains’’ phenomenon and the relative mative in comparing different species. As discovered by Giorgio nature of homogeneity has been discussed in Li (2002). Short- Bernardi and his collaborators, GC-rich regions tend to have higher length-scale inhomogeneity was used to reject the existence of gene densities, and gene-poor ‘‘deserts’’ tend to be GC-poor isochore in Lander et al. (2001), however, by ignoring base (Bernardi et al., 1985; Bernardi, 2004). Fig. 3 illustrates this result composition fluctuations at small scales, statistical tests can well in two different ways. Fig. 3(A) is similar to previous plots (Lander establish the existence of isochores (Li et al., 2003) et al., 2001; Bernardi, 2004) which show the upper bound of gene We take three sets of isochore partitioning of the human density moves up in GC-rich isochores. Fig. 3(B) adds isochore size genome: (1) supplement material from Costantini et al. (2006), information and shows that large GC-poor isochores all have lower which lists 3000 isochores using non-overlapping 100 kb win- gene density, whereas high gene-density isochores are mostly GC- dow and neighboring GC% difference ðo122%Þ as the criterion for rich. GC% is easier to calculate, but since its correlation with gene- joining windows to larger isochores. (2) Data obtained from http:// density is still not 100%, we still need two pieces of information to tubic.tju.edu.cn/isomap/ (Zhang and Zhang, 2003)with 3000 measure this parameter. isochores based on visual inspection of cumulative plots of non- To obtain a more robust result in counting regions, we use the overlapping window’s GC%. (3) Data obtained from http://bioinfo2. cumulative plot which tend to highlight the larger domains (cumu- ugr.es/isochores/ (Oliver et al., 2002), using recursive segmentation lative plot for another application can be found in Li et al., 2009).

2000 10000

500 1000 X Y 100

100 histogram 10 50 Constantini Zhang number of isochores 20 Oliver 1 5 10 15 20 0.02 0.05 0.20 0.50 2.00 5.00 20.00 chromosome size (Mb)

0.15 L H 0.15

0.10 0.10

0.05 0.05 proportion of bases 0.00 normalized histogram 0.00 20 30 40 50 60 70 −40 −20 0 20 GC% delta GC%

Fig. 2. Statistics of three collections of isochores: (Costantini et al., 2006) (solid triangle), (Zhang and Zhang, 2003) (cross), and (Oliver et al., 2002) (open inverted triangle). (A) number of isochores or compositionally homogeneous domains per chromosome; (B) isochore/domain size distribution (in double logarithmic scale); (C) proportion of bases in each isochore/domain GC% bin value (the weighted.hist R function from the plotrix package is used for this plot); (D) neighboring isochore/domain GC% jump distribution (normalized). 98 W. Li / Journal of Theoretical Biology 288 (2011) 92–104

150

100

50

0 gene density (num/Mb) 35 40 45 50 55

50.0

x10 10.0 x5 x2

num genes 2.0 ave 0.5 0.1 0.2 0.5 1.0 2.0 5.0 10.0 isochore size (Mb)

Fig. 3. (A) Relationship between gene density (number of gene per Mb) and GC%. Each point represents an isochore obtained from Costantini et al. (2006). The RefGene for hg17 release is used with redundant gene names removed: this leaves 22,397 genes. (B) Number of genes in an isochore plotted against the isochore length (in Mb) (in double logarithmic scale). Each isochore is marked by its GC% with the color code used in Fig. 3(A). The solid line represents the genome-wide average of gene density, and dashed lines for higher gene density that are 2, 5, 10 times the average. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

chr19 100

0

−100 genes −200

−300 detrended cumulative num 0.55 10 20 30 40 50 60

0.50 GC = 47.98%

GC% 0.45 GC = 46%

0.40

10 20 30 40 50 60 position (Mb)

Fig.P 4. (A) Cumulative number of genes detrended by the average gene density. Each point i represents an isochore obtained from Costantini et al. (2006), i yi ¼ j ¼ 1ðng,jab xj) where ng,j is the number of genes in isochore j, xi is the jth isochore position, a and b are the coefficient of linear line linking the first and the last point in the cumulative plot (not shown). (B) GC% of each isochore. The 46% line is the threshold in Lander et al. (2001) and Bernardi (2004) to separate GC-rich from GC-poor isochores, and the chromosome-averaged 47.98% is actually used to indicate GC-rich.

Fig. 4 illustrates the method for chromosome 19. Instead of using a from Fig. 3(B) is less robust, as one may impose thresholds on genome-wide criterion for gene-rich or GC-rich, we use a chromo- several variables such as minimum isochore size, minimum some-wide criterion. Fig. 4(A) shows the cumulative number of number of genes, and multiple over the averaged gene density, genes (each point is an isochore) detrended by the chromosome with any threshold values, as the selection criterion. average of gene density. Any up (down) trend in Fig. 4(A) indicates For chromosome 21, we identify only one gene-rich GC-rich an gene-rich (poor) region relative to the rest of the chromosome. domain by the cumulative plot (this is the region where the Similarly, the GC% of an isochore in Fig. 4(B) is compared to the 500 kb oscillation of GC% is located (Li and Holste, 2004; Li and chromosome-wide average, instead of the 46% threshold used in Miramontes, 2006), whereas using the GC% 446%, size 4300 kb Lander et al. (2001) and Bernardi (2004). threshold in Fig. 3(B) would identify four domains. The two In Fig. 4, two gene-rich GC-rich regions (and one gene desert) results can be reconciled as one of the four domains does not are claimed. Going through all autosomal and X chromosomes, have higher-than chromosome-wide average of gene density, and the total number of gene-rich GC-rich regions is estimated to be the rest three of the four domains are next to each other thus can around 120. To estimate a number of gene-rich GC-rich isochores be combined into one. W. Li / Journal of Theoretical Biology 288 (2011) 92–104 99

8. Density of polymorphic sites: approximately þþ1=ðn1Þ0:577þlogðn1Þ (Nei, 1987). Second, in the 1–10 per 103 bases infinite-site mutation model (Kimura, 1968), the average nucleo- tide diversity is Starting from this section, we discuss parameters that are m E½p¼4N , ð4Þ defined on a population of genomes, and have wider ranges of e L possible values from one individual genome to another. Single where N is the ancestral effective population size, and m=L is the nucleotide polymorphisms (SNPs) are single base positions that e mutation rate per nucleotide per generation. take on multiple possible alleles in the population. This definition Plugging the values of N , m=L, n into Eqs. (4) and (3) (N ¼3100 points out its dependence on samples used: for inbred popula- e e for the European and the East Asian population, or N ¼7500 for tions, the level of polymorphism is very low; for a regional e Yoruba African population, (Tenesa et al., 2007), m=L ¼ 11 109, population with limited genetic diversity, the number of SNPs is (Roach et al., 2010), n ¼ 2 6:9 109 for the current world much smaller than that of the human population as a whole. population), we have the estimation of d (3:3 103 for A reason for SNP density not being a constant is that its value SNP European/EastAsian and 7:9 103 for Yoruba African). These depends on human evolutionary history and diverging time two estimated SNP densities can be translated to a total number between human and its closest relative species, the chimpanzee of 10 and 24 million SNPs. The relatively smaller estimation of the (which is estimated to be between 4 and 8 million years ago). If effective population size in Tenesa et al. (2007) is because in a all non-protein-coding bases are evolutionarily neutral and if we temporarily fluctuating population, the N is dominanted by the wait long enough, 490% of the bases from the human genome e smallest population size in a time period (Hartl and Clark, 2006, are potential target of mutation, thus polymorphic bases. pp. 121–123), and the European and the East Asian population Polymorphic bases are definable for a population of N persons may have had gone through such a bottleneck period. This (and n¼2N sequences). When N¼1(n¼2), a polymorphic base predicted number of SNPs can be checked when genomes of means a heterozygous genotype in that person, and the percentage everybody on the earth are sequenced and compared. of heterozygote sites h ¼ dhetero ¼ H=L (H is the number of hetero- zygotes in a person’s genome, 3 109 bases is the genome size) is comparable to the SNP density from N b1 samples: 9. Number of new deleterious mutations generated per S generation per diploid genome: approximately 0.5–2 d ¼ ð2Þ SNP L The SNP density and mutation rate discussed in the last (S is the number of bases in the human genome that show section does not consider function. To estimate the number of polymorphism among n¼2N sequences). In population genetics, a amino-acid changing deleterious mutations produced in each polymorphic site is also called a segregating site (Hartl and Clark, generation, two extra considerations are needed. 2006, p. 34). First, both maternal and paternal germline can accumulate The HapMap consortium released the map of 1 million mutations during cell replication. Actually, more mutations may (HapMap Consortium, 2005) and 3.1 million SNPs in the human be accumulated in an older father, as the number of germ cell genome (HapMap Consortium, 2007), based on array genotyping, divisions in spermatogenesis increases with the paternal age, with the latter representing d 1 103, or one SNP per kb. SNP whereas that during does not increase with the Comparison of Craig Venter’s diploid genome with the NCBI maternal age (Crow, 2000). The number of new mutations per reference genome leads to 3.1 million SNPs after a quality control diploid genome is the per base mutation rate multiplied by twice filtering (Levy et al., 2007). A similar comparison between James 9 the haploid genome size, i.e., 2 3 10 bases. Watson’s diploid genome with the reference genome leads to Mutation rate can be expressed in various forms: per bases vs. 3.32 million SNPs (Wheeler et al., 2008). per gene (per locus) vs. per haploid genome vs. per diploid genome, Comparative number of SNPs have been reported in other and per year vs. per generation vs. per (e.g.) 5 million years (which is publications using the next-generation sequencing: 7.9 million the divergence time between human and chimpanzee). If we use the SNPs (European samples, low coverage), 10.9 millions (Yoruba 8 m 1:1 10 per base per generation (Roach et al., 2010), which is African samples, low coverage), 6.3 millions (East Asian samples, also close to the lower range estimated in Nachman and Crowell low coverage), 3.6 millions (European trio samples), 4.5 million (2000), the number of new mutations per generation per diploid (Yoruba African trio samples) from The 1000 Genomes Project 8 9 8 genome is M ¼ 1:1 10 6 10 ¼ 66. If the m 2:5 10 Consortium (2010); 3 million and 3.45 million SNPs for an Chinese value (Nachman and Crowell, 2000)isused,thenumberofnew (Wang et al., 2008) and a Korean individual (Kim et al., 2009). 8 9 mutations is M ¼ 2:5 10 6 10 ¼ 150. If we take the total number of SNPs from three different sub- Second, one needs to distinguish mutations with or without populations in The 1000 Genome Project, 14.9 millions SNPs effects, and estimate the proportion of mildly deleterious muta- corresponds to d 5 103. Using next-generation sequencing SNP tions from the genome. If the mutation was estimated by data to estimate population genetics values still face many sequence comparison between two surviving individuals, lethal technical challenges, such as sequencing error, assembly error, mutations (strongly deleterious mutations) are not observed: and missing data due to low coverage (Pool et al., 2010). These reported SNP density will surely be updated in the future. Munobserved ¼ Mlethal The number of SNPs in human genome can be calculated in The number of observed mutations can be partitioned into theoretical models. First of all, SNP density d is related to SNP M ¼ M þM þM : ð5Þ nucleotide diversity p, which is the probability (per base) that observed deleterious neutral advantageous two randomly chosen alleles from two persons are different (Nei Despite recent reports on positive selection in human genome and Li, 1979), by the formula (Watterson, 1975) (Kelly and Swanson, 2008), these studies have been questioned 1 1 1 (Hughes, 2008; Nozawa et al., 2009; Nei et al., 2010), and the dSNP ¼ p 1þ þ þþ , ð3Þ number of advantageous mutations is assumed to be low, or can 2 3 n1 be ignored in a first approximation. The neutral mutations are where n is the number of sequences used to detect SNP. In the those occurring in the non-coding regions, plus those occurring in large n limit, the last term is approximately equal to 1þ1=2þ1=3 coding-region but do not change the amino acids (synonymous). 100 W. Li / Journal of Theoretical Biology 288 (2011) 92–104

The deleterious mutations are estimated by the non-synonymous a recombination)—it is because only two chromatids are involved in substitutions plus some in the regulatory regions. crossovers, while the other two chromatids remain intact. In Since there are 1–2% of human genome that are coding or in genetics, these two points are said to have a distance of 50 regulatory region, and a crudely estimated proportion of 50% of centiMorgan (cM). It has been known that 1 cM roughly corresponds coding mutations are non-synonymous (a value of 86% was used to 106 bases (Morton, 1988, 1991), so the genome-wide number of in Kimura, 1983; Crow, 1993; Kondrashov and Crow, 1993), the crossovers can be estimated to be 3 109=ð50 106Þ60. proportion of non-synonymous/deleterious mutations ranges Beyond the 1 cM¼1 Mb rule (Ulgen and Li, 2005), there is a between 0:01 0:5 ¼ 0:005 and 0:02 0:86 ¼ 0:017, which trans- tremendous amount of regional variation of the recombination lates to around 0.33–2.5 new deleterious mutations per diploid rate (Jeffreys et al., 2001). There is also a huge difference between genome per generation. The number of 1 is used in Kondrashov human female and male (Hunt and Hassold, 2002) and Crow (1993). including the recombination rate. A promising hypothesis Interestingly, a much higher estimation of the number of explains this difference by the lesser degree of compactness in deleterious mutation in human was proposed: 4.2, based on a female chromatids (Tease and Hulte´n, 2004; Paigen and Petkov, human–chimpanzee sequence comparison of 46 genes (Eyre- 2010). The distribution of crossovers in the human genome Walker and Keightley, 1999). According to Haldane (1937) and among different chromosomes is mainly determined by the Muller¨ (1950), a larger-than-one number of new deleterious chromosome length, with the requirement that a chromosome mutations per diploid per generation signals trouble for the needs at least one crossover (i.e., minimum genetic length is fitness of the species. However, the number of human genes used 50 cM) (Mather, 1938; Li and Freudenberg, 2009; Li et al., 2011) in Eyre-Walker and Keightley (1999) was 60,000 which is three Cytological methods can be used to observe crossovers directly, times the actual number. If we correct this and keep everything for example, using immunocytogenetic assays of the mismatch else unchanged, the number of new deleterious mutations will repair protein MLH1. The number of MLH1 foci in female was found become 4.2/3¼1.4. to be 70.3 foci (ranging from 48 to 102) using 95 samples (Tease et al., 2002), 69.3714.3 using 1035 cells from 31 women (ranging from 52.6710.5 in one woman to 88.3710.3 in another) (Cheng 10. Number of crossovers during meiosis: approximately et al., 2009). In male, the following average of MHL1 foci counts have 50–100 in female, 30–70 in male been reported: 48 (range from 21 to 65) (Sun et al., 2005), 49.7 (range from 21 to 67) using 1000 samples (10 men each contributing Human germ cells go through gametogenesis to produce 100 samples) (Sun et al., 2006), 4273.8 for one male and 52.275.6 gametes (sperms and eggs). Part of gametogenesis is meiosis for another, each contributing 100 cells (Codina-Pascual et al., 2006). when diploid cells replicate genome contents followed by two Genetic methods can estimate the number of recombinations rounds of cell division to produce four haploid cells. After (which is half the number of crossovers in meiosis) by using replication and before cell division, four homologous chromo- pedigree data. A construction of recombination map of human somes align and crossover between two of the four chromatids at genome by the Iceland company deCODE used 146 pedigrees to many chromosome locations. The ‘‘X’’ shaped exchange product is estimate female genetic length to be 4459.94 cM, equivalent to 89 called chiasma (plural: chiasmata), and the DNA–protein struc- crossovers, and male genetic length of 2590.49 cM, equivalent ture at the position is called synaptonemal complex. We simply to 52 crossovers (Kong et al., 2002). Another study using call them crossovers. both deCODE pedigrees and CEPH three-generation pedigrees For the next-generation offsprings who receive a gamete pro- (from Caucasian families in Utah, collected by Centre d’E´tude du duced by the parents’ meiosis, the two chromosomal locations that Polymorphisme Humain which is currently called Foundation Jean bracket a crossover have a 50% chance to be recombinant (or contain Dausset—CEPH) estimated genome-wide 92 and 57 crossovers,

deCode data 350 o female 300 x male

250 slope = 1

200 x x x 150 xx x x slope = 0.5 no. crossover = 2*(cM)/100 x x x genetic length (cM) x x xx 100 x x x x 50 xx y.intercept = 50 0 0 50 100 150 200 250 300 chromosome length (Mb)

Fig. 5. Illustration of simple determination of the number of crossovers from human genome size. The x-axis is the chromosome length in Mb, the y-axis is the genetic length of each human chromosome determined by deCODE pedigree analysis (Kong et al., 2002), in cM. One cM is a 1% probability for recombination, so the number of crossovers covered by a DNA segment with G cM is: 2 G/100. The female and male data are drawn separately and the slopes are 1 and 0.5 respective, with t-intercept¼50 cM. W. Li / Journal of Theoretical Biology 288 (2011) 92–104 101 for female and male respectively (Matise et al., 2007). The number bases and more. We cannot claim a ‘‘typical’’ exon size as mode, of male crossovers estimated by genetic and immunocytological mean, and median of the exon size distribution are all different. methods are more or less consistent ð50Þ, however, it is still a Another possible parameter is the percentage of repetitive puzzle that the cytological estimation of female crossover fre- sequences in the human genome. The percentage of repetitive quencies are lower than those estimated by genetic methods sequences identified by the RepMasker program (http://www. ( 70 vs. 90). repeatmasker.org/) for the sequenced portion of the human Based on genetically determined recombination information, genome is 50.59% (1447.5677 Mb out of 2861.3437 Mb, of human number of female and male crossovers can be estimated by the genome release GRCh37/hg19). In the unsequenced portion genome size (Li and Freudenberg, 2009). Fig. 5 shows that (labeled as a sequence of ‘‘N’’s), the percentage of repetitive regression line with extremely simple coefficients fit the genetic sequence could be even higher, making the above value an chromosome length (G in cM) versus physical chromosome length underestimation. The difficulty in having a stable estimation of (P in Mb) very well, which translates to these rules of estimating this parameter is that repetitive sequences are the most dynamic the total number of crossovers in autosomal chromosomes: region of the human genome. Intersperse repetitive elements provide sequence homology for promoting non-allelic recombi- N ¼ 22þ2P=100 co,female nations, which result in structural variations (large-scaled dele- tions, duplications, etc. also called copy number variations N ¼ 22þP=100 ð6Þ co,male (CNVs)) (Alkan et al., 2011). These CNVs may either increase or which leads to 82 crossovers in female and 52 crossovers in male decrease the percentage of repetitive sequences. (total P¼3021 Mb, Kong et al., 2002). On population-based parameters, the number of recombination Is it possible to theoretically predict the number of crossovers hotspots (Kauppi et al., 2004; Paigen and Petkov, 2010)isalsoan in human? Muller¨ (1964) suggested the importance of recombi- attractive candidate. An estimation of the number of recombination nation in removing deleterious mutations, later called ‘‘Muller’s¨ hotspots (more than 25000) was published by means of coalescent ratchet’’ (Felsenstein, 1974). The long-term advantage of recom- analysis (Myers et al., 2005). Since this estimated number is in the bination has been experimentally tested as well as confirmed range of the number of genes, it is suggested that the a-type of (Rice, 2002), and has been addressed in models with realistic recombination hotspots for yeast which requires transcription- conditions, such as finite population sizes (Otto and Lenormand, factor binding is a possible model (Petes, 2001). However, it is 2002; Keightley and Otto, 2006). Intuitively, more crossovers proposed that some recombination hotspots might be a ‘‘fluid would be more effective in removing deleterious mutations. It feature’’ of the human genome, ‘‘highly transient’’, and regulated might be possible to apply these ideas to construct a model that (Jeffreys et al., 2005; Jeffreys and Neumann, 2009; Buard and de balance the number of crossovers with the level of mutations. Massy, 2007; Paigen and Petkov, 2010). The significantly different results on recombination hotspots from coalescent analysis and sperm typing puts a question mark on the number 25,000. 11. Discussion The nine parameters examined here can clearly distinguish human from rodent genomes. But can these distinguish human from Compiling a comprehensive list of human genome parameters chimpanzee genome? Researchers have been trying to find human- that are either constant in individual genomes or vary within specific genes, human-specific genomic features, human-lineage- limited range in the population of genomes is a considerable task. specific mutation rate, and human-specific environment (Dorus Here we attempted to start this compilation by discussing nine of et al., 1994; Pollard et al., 2006; Katzman et al., 2010; Hawks et al., such parameters. Some are well known textbook materials such as 2007; Varki et al., 2008). Clearly not all genome parameters are genome size and number of chromosomes, and the emphasis is on equivalent: some are better suited for showing universality among comparing these numbers with those in other species. Some have different species, other are better used to distinguish finer branches in interesting stories such as the ‘‘surprise’’ concerning the lower a phylogenetic tree. If the number of olfactory genes is used as a number of human genes when the human genome was first genomic parameter, for example, it should easily separate human sequenced. Others are closely tied to the current efforts to uncover from chimpanzee, as it has been shown that the fraction of olfactory the genetic basis of human complex diseases, such as the number receptorpseudogenesinhumansistwiceaslargeasthatinother of SNPs and the number of newly generated deleterious mutations. primates (Gilad et al., 2003). Many physicists are used to the so-called ‘‘back-of-the-envelope’’ From the theoretically modelling point of view, picking a calculation (Swartz, 2003), characterized by quantitative but genome parameter narrows the focus of the modelling. For approximate ‘‘order of magnitude’’ calculation. Parameter values example, a model on genome size does not need to model the discussed here, such as ‘‘1 cm¼1 Mb’’ rule from the number of number of chromosomes, and vice versa, a model on karyotype crossovers, are essential for a ‘‘back-of-the-envelope’’ genomics does not need to study the evolution of genome size. However, as calculation. Morton (1991) had a similarly titled paper, but his more genome parameters are collected, some of them are corre- collection of parameter is only limited to genetic map lengths. lated, and a model has to deal with multiple parameters simulta- Some parameters may be obvious choices to be included. For neously. For example, GC% and recombination rate are highly example, the size distribution of human exons (Naora and correlated in local chromosome regions. To model the evolution of Deacon, 1982; Hawkin, 1988; Zhang, 1998) peaks around 110– genome-wide GC% and number of crossovers, it may not impos- 120 bases. This length is comparable to the 147b, the length of sible to model one but not the other. Our ultimate goal is to be nucleosome DNA that wraps around the histone octamers (Baldi able to understand where these human genome parameter values et al., 1996). Similar observation led to the hypothesis that came from, and be able to predict them from theoretical models. nucleosome positioning and the exon–intron boundaries are matched (Beckmann and Trifonov, 1991; Denisov et al., 1997; Kogan and Trifonov, 2005; Tilgner et al., 2009; Schwartz et al., Acknowledgments 2009). Exon size distribution has also been used to examine two competing theories on the origin of introns (Gudlaugsdottir et al., I would like to thank the support from the Robert S. Boas 2007). However, the exon size distribution is extremely asym- Center for Genomics and Human Genetics, Dirk Holste, Jan metric around the peak, with long tails that extend to thousand of Freudenberg, Pedro Miramontes for discussion and suggestions 102 W. Li / Journal of Theoretical Biology 288 (2011) 92–104 on an early draft, Jose Oliver for providing isochore data, Liqing Codina-Pascual, M., Campillo, M., Kraus, J., Speicher, M.R., Egozcue, J., Navarro, J., Zhang for collaboration on a transcription-factor-related project, Benet, J., 2006. Crossover frequency and synaptonemal complex length: their variability and effects on human male meiosis. Mol. Hum. Repro. 12, 123–133. and four anonymous reviewers for helpful comments. Costantini, M., Clay, O., Auletta, F., Bernardi, G., 2006. An isochore map of human chromosomes. Genome Res. 16, 536–541. Crow, J.F., 1993. How much do we know about spontaneous human mutation rates? Env. Mol. Mut. 21, 122–129. References Crow, J.F., 2000. The origins, patterns and implications of human spontaneous mutation. Nature Rev. Genet. 1, 40–47. 1000 Genomes Project Consortium, 2010. A map of human genome variation from Cruveiller, S., Jabbari, K., Clay, O., Bernardi, G., 2003. Compositional features of population-scale sequencing. Nature, 467, 1061–1073. eukaryotic genomes for checking predicted genes. Brief. Bioinf. 4, 43–52. Alkan, C., Coe, B.P., Eichler, E.E., 2011. Genome structural variation discovery and Demuth, J.P., De Bie, T., Stajich, J.E., Cristianini, N., Hahn, M.W., 2006. The evolution genotyping. Nature Rev. Genet. 12, 363–376. of mammalian gene families. PLoS ONE 1, e85. Arndt, P.F., Burge, C.B., Hwa, T., 2003. DNA sequence evolution with neighbor- Denisov, D.A., Shpigelman, E.S., Trifonov, E.N., 1997. Protective nucleosome dependent mutation. J. Compd. Biol. 10, 313–322. centering at splice sites as suggested by sequence-directed mapping of the Ast, G., 2005. The alternative genome. Sci. Am. 292, 58–65. nucleosomes. Gene 205, 145–149. Baldi, P., Brunak, S., Chauvin, Y., Krogh, A., 1996. Naturally occurring nucleosome Dirac, P.A.M., 1937. The cosmological constants. Nature 139, 323. positioning signals in human exons and introns. J. Mol. Biol. 263, 503–510. Dobzhansky, T., 1964. Biology, molecular and organismic. Am. Zoo. 4, 443–452. Baraba´si, A.L., Oltvai, Z.N., 2004. Network biology: understanding the cell’s Doerfler, W., 1983. DNA methylation and gene activity. Ann. Rev. Biochem. 52, functional organization. Nature Rev. Genet. 5, 101–113. 93–124. Baranzini, S.E., Mudge, J., vanVelkinburgh, J.C., Khankhanian, P., Khrebtukova, I., D’Onofrio, G., Jabbari, K., Musto, H., Bernardi, G., 1999. The correlation of protein Miller, N.A., Zhang, L., Farmer, A.D., Bell, C.J., Kim, R.W., May, G.D., Woodward, hydropathy with the base composition of coding sequences. Gene 238, 3–14. J.E., Caillier, S.J., McElroy, J.P., Gomez, R., Pando, M.J., Clendenen, L.E., Ganusova, Doolittle, W.F., Sapienza, C., 1980. Selfish genes, the phenotype paradigm and E.E., Schilkey, F.D., Ramaraj, T., Khan, O.A., Huntley, J.J., Luo, S., Kwok, P., Wu, genome evolution. Nature 284, 601–603. T.D., Schroth, G.P., Oksenberg, J.R., Hauser, S.L., Kingsmore, S.F., et al., 2010. Dorus, S., Vallender, E.J., Evans, P.D., Anderson, J.R., Gilbert, S.L., Mahowald, M., Genome epigenome and RNA sequences of monozygotic twins discordant for Wyckoff, G.J., Malcom, C.M., Lahn, B.T., 1994. Accelerated evolution of nervous multiple sclerosis. Nature 464, 1351–1356. system genes in the origin of Homo sapiens. Cell 119, 1027–1040. Beckmann, J.S., Trifonov, E.N., 1991. Splice junctions follow a 205-base ladder. Duret, L., Arndt, P.F., 2008. The impact of recombination on nucleotide substitu- Proc. Natl. Acad. Sci. 88, 2380–2383. tions in the human genome. PLoS Genet. 4, e1000071. Belle, E.M.S., Smith, N., Eyre-Walker, A., 2002. Analysis of the phylogenetic Duret, L., Galtier, N., 2009. Biased gene conversion and the evolution of mamma- distribution of isochores in vertebrates and a test of the thermal stability lian genome landscapes. Ann. Rev. Genomics Hum. Genet. 10, 285–311. hypothesis. J. Mol. Evol. 55, 356–363. Dutrillaux, B., Rethore´, M.O., Lejeune, J., 1975. Comparison du caryotype de Bennett, M.D., 1972. Nuclear DNA content and minimum generation time in l’orang-outang (Pongo pygmaeus) a´ celui de l’homme, du chimpanze´ et du herbaceous plants. Proc. R. Soc. London B 181, 109–135. gorille. Ann. Ge´ne´t. 18, 153–161. Bernaola-Galva´n, P., Roma´n-Rolda´n, R., Oliver, J.L., 1996. Compositional segmenta- Eisenberg, E., Levanon, E.Y., 2003. Human housekeeping genes are compact. Trends tion and long-range fractal correlations in DNA sequences. Phys. Rev. E 53, Genet. 19, 362–365. 5181–5189. Ewing, B., Green, P., 2000. Analysis of expressed sequence tags indicates 35,000 Bernardi, G., 2004. Structural and Evolutionary Genomics—Natural Selection in human genes. Nature Genet. 25, 232–234. Genome Evolution. Elsevier. Eyre-Walker, A., Hurst, L.D., 2001. The evolution of isochore. Nature Rev. Genet. 2, Bernardi, G., 2007. The neoselectionist theory of genome evolution. Proc. Natl. 549–555. Acad. Sci. 104, 8385–8390. Eyre-Walker, A., Keightley, P.D., 1999. High genomic deleterious mutation rates in Bernardi, G., Bernardi, G., 1986. Compositional constraints and genome evolution. hominids. Nature 397, 344–347. J. Mol. Evol. 24, 1–11. Farnham, P.J., 2009. Insights from genomic profiling of transcription factors. Bernardi, G., Olofsson, B., Filipski, J., Zerial, M., Salinas, J., Cuny, G., Meunier-Rotival, Nature Rev. Genet. 10, 605–616. M., Rodier, F., 1985. The mosaic genome of warm-blooded vertebrates. Science Felsenstein, J., 1974. The evolutionary advantage of recombination. Genetics 78, 228, 953–958. 737–756. Beurton, P.J., Falk, R., Rheinberger, H.J. (Eds.), 2000. The Concept of the Gene in Ferguson-Smith, M.A., Trifonov, V., 2007. Mammalian karyotype evolution. Nature Development and Evolution: Historical and Epistemological Perspectives. Rev. Genet. 8, 950–962. Cambridge University Press. Finch, S.R., 2003. Mathematical Constants. Cambridge University Press. Bruder, C.E.G., Piotrowski, A., Gijsbers, A.A.C.J., Andersson, R., Erickson, S., Diaz de Flavell, R.B., Bennett, M.D., Smith, J.B., Smith, D.B., 1974. Genome size and the Støahl, T., Menzel, U., Sandgren, J., von Tell, D., Poplawski, A., Crowley, M., proportion of repeated nucleotide sequence DNA in plants. Biochem. Genet. Crasto, C., Partridge, E.C., Tiwari, H., Allison, D.B., Komorowski, J., van Ommen, 12, 257–269. G.J.B., Boomsma, D.I., Pedersen, N.L., denDunnen, J.T., Wirdefeldt, K., Dumanski, Flicek, P., et al., 2011. Ensembl 2011. Nucl. Acids Res. 39 (suppl 1), D800–D806. J.P., 2008. Phenotypically concordant and discordant monozygotic twins dis- Freudenberg, J., Wang, M., Yang, Y., Li, W., 2009. Partial correlation analysis play different DNA copy-number-variation profiles. Am. J. Hum. Genet. 82, indicates causal relationships between GC-content, exon density and recom- 763–771. bination rate variation in the human genome. BMC Bioinf. 10 (suppl 1), S66. Buard, J., de Massy, B., 2007. Playing hide and seek with mammalian meiotic Fritzsch, H., 2004. The Fundamental Constants: A Mystery of Physics. World crossover hotspots. Trends in Genet. 23, 301–309. Scientific. Buchanan, J.A., Higley, E.T., 1921. The relationship of blood-groups to disease. Brit. Froenicke,¨ L., 2005. Origins of primate chromosomes—as delineated by Zoo-FISH J. Exp. Path. 2, 247–255. and alignments of human and mouse draft genome sequences. Cytogenet. Bulmer, M., 1991. The selection–mutation-drift theory of synonymous codon Genome Res. 108, 122–138. usage. Genetics 129, 897–907. Froenicke,¨ L., Calde´s, M.G., Graphodatsky, A., Muller,¨ S., Lyons, L.A., Robinson, T.J., Carpena, P., Oliver, J.L., Hackenberg, M., Coronado, A.V., Barturen, G., Bernaola- Volleth, M., Yang, F., Wienberg, J., 2006. Are molecular cytogenetics and Galva´n, P., 2011. High-level organization of isochores into gigantic super- bioinformatics suggesting diverging models of ancestral mammalian gen- structures in the human genome. Phys. Rev. E 83, 031908. omes? Genome Res. 16, 306–310. Cavalier-Smith, T., 1982. Skeletal DNA and the evolution of genome size. Ann. Rev. Frota-Pessoa, O., 1961. On the number of gene loci and the total mutation rate in Biophs. Bioeng. 11, 273–302. man. Am. Nat. 95, 217–222. Cavalier-Smith, T., 1985. The Evolution of Genome Size. Wiley. Fullerton, S., Carvalho, A.B., Clark, A., 2001. Local rates of recombination are Cavalier-Smith, T., 2005. Economy, speed and size matter: evolutionary forces positively correlated with GC content in the human genome. Mol. Biol. Evol. driving nuclear genome miniaturization and expansion. Ann. Botany 95, 18, 1139–1142. 147–175. Galtier, N., Piganeau, G., Mouchiroud, D., Duret, L., 2001. GC-content evolution in Charlesworth, B., Sniegowski, P., Stephan, W., 2002. The evolutionary dynamics of mammalian genomes: the biased gene conversion hypothesis. Genet. 159, repetitive DNA in eukaryotes. Nature 371, 215–220. 907–911. Cheng, E.Y., Hunt, P.A., Naluai-Cecchini, T.A., Fligner, C.L., Fujimoto, V.Y., Paster- Gamow, G., 1967. Variability of elementary charge and quasistellar objects. Phys. nack, T.L., Schwartz, J.M., Steinauer, J.E., Woodruff, T.J., Cherry, S.M., Hansen, Rev. Lett. 19, 913–914. T.A., Vallente, R.U., Broman, K.W., Hassold, T.J., 2009. Meiotic recombination in Gartler, S.M., 2006. The chromosome number in humans: a brief history. Nature human oocytes. PLoS Genet. 5, e1000661. Rev. Genet. 7, 655–660. Cheng, Z., Ventura, M., She, X., Khaitovich, P., Graves, T., Osoegawa, K., Church, D., Gilad, Y., man, O., Pa¨abo,¨ S., Lancet, D., 2003. Human specific loss of olfactory DeJong, P., Wilson, R.K., Pa¨abo,¨ S., Rocchi, M., Eichler, E.E., 2005. A genome- receptor genes. Proc. Natl. Acad. Sci. 100, 3324–3327. wide comparison of recent chimpanzee and human segmental duplications. Gojobori, T., Li, W.H., Graur, D., 1982. Patterns of nucleotide substitution in Nature 437, 88–93. pseudogenes and functional genes. J. Mol. Evol. 18, 360–369. Chowdhary, B.P., Raudsepp, T., Fronicke,¨ L., Scherthan, H., 1998. Emerging patterns Gregory, T.R., Nicol, J.A., Tamm, H., Kullman, B., Kullman, K., Leitch, I.J., Murray, of comparative genome organization in some mammalian species as revealed B.G., Kapraun, D.F., Greilhuber, J., Bennett, M.D., 2007. Eukaryotic genome size by Zoo-FISH. Genome Res. 8, 577–589. databases. Nucl. Acids Res. 35 (suppl 1), D332–D338. Clauset, A., Shalizi, C.R., Newman, M.E., 2009. Power-law distributions in empirical Gudlaugsdottir, S., Boswell, D.R., Wood, G.R., Ma, J., 2007. Exon size distribution data. SIAM Rev. 51, 661–703. and the origin of introns. Genetica 131, 299–306. W. Li / Journal of Theoretical Biology 288 (2011) 92–104 103

Hahn, M.W., Wray, G.A., 2002. The g-value paradox. Evol. Dev. 4, 73–75. Li, W., Lee, A., Gregersen, P.K., 2009. Copy-number-variation region detection by Haldane, J.B.S., 1937. The effect of variation on fitness. Am. Nat. 71, 337–349. cumulative plots. BMC Bioinf. 10 (suppl 1), S67. Hartl, D.L., Clark, A.F., 2006. Principles of Population Genetics, fourth ed. Sinauer Li, W., Miramontes, P., 2006. Large-scale oscillation of structure-related DNA Associates, Sunderland, MA. sequence features in human chromosome 21. Phys. Rev. E 74, 021912. Hawkin, J.D., 1988. A survey on intron and exon lengths. Nucl. Acids Res. 16, Liu, G., NISC Comparative Sequencing Program, Zhao, S., Bailey, J.A., Sahinalp, S.C., 9893–9908. Alkan, C., Tuzun, E., Green, E.D., Eichler, E.E., 2003. Analysis of primate genomic Hawks, J., Wang, E.T., Cochran, G.M., Harpending, H.C., Moyzis, R.K., 2007. Recent variation reveals a repeat-driven expansion of the human genome. Genome acceleration of human adaptive evolution. Proc. Natl. Acad. Sci. 104, Res. 13, 358–368. 20753–20758. Lynch, M., 2007. The Origins of Genome Architecture. Sinauer Associations, Inc., Hughes, A.L., 2008. Near neutrality: leading edge of the neutral theory of Sunderland, MA. molecular evolution. Ann. N.Y. Acad. Sci. 1133, 162–179. Lynch, M., 2010. Rate, molecular spectrum, and consequences of human mutation. Hunt, P.A., Hassold, T.J., 2002. Sex matters in meiosis. Science 296, 2181–2183. Proc. Natl. Acad. Sci. 107, 961–968. Ijdo, J.W., Baldini, A., Ward, D.C., Reeders, S.T., Wells, R.A., 1991. Origin of human Macaya, G., Thiery, J.P., Bernardi, G., 1976. An approach to the organization of chromosome 2: an ancestral telomere–telomere fusion. Proc. Natl. Acad. Sci. eukaryotic genomes at a macromolecular level. J. Mol. Biol. 108, 237–254. 88, 9051–9055. Martı´nez-Mekler, G., Martı´nez, R.A., Beltra´n del Rı´o, M., Mansilla, R., Miramontes, International HapMap Consortium, 2005. A haplotype map of the human genome. P., Cocho, G., 2009. Universality of rank-ordering distributions in the arts and Nature 437, 1299–1320. sciences. PLoS ONE 4, e4791. International HapMap Consortium, 2007. A second generation human haplotype Mather, K., 1938. Crossing-over. Biol. Rev. 13, 258–292. map of over 3.1 million SNPs. Nature 449, 851–861. Matise, T.C., Chen, F., Chen, W., DeLaVega, F.M., Hansen, M., He, C., Hyland, F.C.L., Jeffreys, A.J., Kauppi, L., Neumann, R., 2001. Intensely punctate meiotic recombina- Kennedy, G.C., Kong, X., Murray, S.S., Ziegle, J.S., Stewart, W.C.L., Buyske, S., tion in the class II region of the major histocompatibility complex. Nature 2007. A second-generation combined linkage—physical map of the human Genet. 29, 217–222. genome. Genome Res. 17, 1783–1786. Jeffreys, A.J., Neumann, R., 2009. The rise and fall of a human recombination hot Matthey, R., 1945. L’e´volution de la formule chromosomiale chez les Verte´bre´s. spot. Nature Genet. 41, 625–629. Cell. Mol. Life Sci. 1, 78–86. Jeffreys, A.J., Neumann, R., Panayi, M., Myers, S., Donnelly, P., 2005. Human Messina, D.N., Glasscock, J., Gish, W., Lovett, M., 2004. An ORFeome-based analysis recombination hot spots hidden in regions of strong marker association. of human transcription factor genes and the construction of a microarray to Nature Genet. 37, 601–606. interrogate their expression. Genome Res. 14, 2041–2047. Kaminsky, Z.A., Tang, T., Wang, S.C., Ptak, C., Oh, G.H.T., Wong, A.H.C., Feldcamp, Meunier, J., Duret, L., 2004. Recombination drives the evolution of GC-content in L.A., Virtanen, C., Halfvarson, J., Tysk, C., McRae, A.F., Visscher, P.M., Mon- the human genome. Mol. Biol. Evol. 21, 984–990. tgomery, G.W., Gottesman, I.I., Martin, N.G., Petronis, A., 2009. DNA methyla- Modrek, B., Lee, C., 2002. A genomic view of alternative splicing. Nature Genet. 30, tion profiles in monozygotic and dizygotic twins. Nature Genet. 41, 240–245. 13–19. Katzman, S., Kern, A.D., Pollard, K.S., Salama, S.R., Haussler, D., 2010. GC-biased Mohr, P.J., Taylor, B.N., Newell, D.B., 2008. CODATA recommended values of the evolution near human accelerated regions. PLoS Genet. 6, e1000960. fundamental physical constants: 2006. Rev. Mod. Phys. 80, 633–730. Kauffman, S.A., 1969. Metabolic stability and epigenesis in randomly connected Morand, S., Ricklefs, R.E., 2005. Genome size is not related to life-history traits in nets. J. Theor. Biol. 22, 437–467. primates. Genome 48, 273–278. Kauffman, S.A., 1993. The Origins of Order: Self-Organization and Selection in Morton, N.E., 1988. Multipoint mapping and the emperor’s clothes. Ann. Hum. Evolution. Oxford University Press. Genet. 52, 309–318. Kauffman, S.A., Levin, S., 1987. Towards a general theory of adaptive walks on Morton, N.E., 1991. Parameters of the human genome. Proc. Natl. Acad. Sci. 88, rugged landscapes. J. Theor. Biol. 128, 11–15. 7474–7476. Kauppi, L., Jeffreys, A.J., Keeney, S., 2004. Where the crossovers are: recombination Muller,¨ H.J., 1950. Our load of mutations. Am. J. Hum. Genet. 2, 111–176. distributions in mammals. Nature Rev. Genet. 5, 413–424. Muller,¨ H.J., 1964. The relation of recombination to mutational advance. Mut. Res. Keightley, P.D., Otto, S.P., 2006. Interference among deleterious mutations favours 1, 2–9. sex and recombination in finite populations. Nature 443, 89–92. Muller,¨ S., Stanyon, R., O’Brien, P.C., Ferguson-Smith, M.A., Plesker, R., Wienberg, J., Kelly, J.L., Swanson, W.J., 2008. Positive selection in the human genome: from 1999. Defining the ancestral karyotype of all primates by multidirectional genome scans to biological significance. Ann. Rev. Genomics Hum. Genet. 9, chromosome painting between tree shrews, lemurs and humans. Chromo- 143–160. soma 108, 393–400. Kim, J.I., et al., 2009. A highly annotated whole-genome sequence of a Korean Myers, C., Bottolo, L., Freeman, C., McVean, G., Donnelly, P., 2005. A fine-scale map individual. Nature 460, 1011–1015. of recombination rates and hotspots across the human genome. Science 310, Kimura, M., 1968. Evolutionary rate at the molecular level. Nature 217, 624–626. 321–324. Kimura, M., 1983. Rare variant alleles in the light of the neutral theory. Mol. Biol. Nachman, M.W., Crowell, S.L., 2000. Estimate of the mutation rate per nucleotide Evol. 1, 84–93. in human. Genetics 156, 297–304. King, J.L., Jukes, T.H., 1969. Non-Darwinian evolution. Science 164, 788–798. Naora, H., Deacon, N.J., 1982. Relationship between the total size of exons and Kogan, S., Trifonov, E.N., 2005. Gene splice sites correlate with nucleosome introns in protein-coding genes of higher eukaryotes. Proc. Natl. Acad. Sci. 79, positions. Gene 352, 57–62. 6196–6200. Kondrashov, A.S., Crow, J.F., 1993. A molecular approach to estimating the human Nature Genetics Editorial, 2000. The nature of the number. Nature Genet. 25, deleterious mutation rate. Hum. Mut. 2, 229–234. 127–128. Kondrashov, F.A., Ogurtsov, A.Y., Kondrashov, A.S., 2006. Selection in favor of Nei, M., 1987. Molecular Evolutionary Genetics. Columbia University Press, New nucleotides G and C diversifies evolution rates and levels of polymorphism at York, NY. mammalian synonymous sites. J. Theor. Biol. 240, 616–626. Nei, M., Li, W.H., 1979. Mathematical model for studying genetic variation in terms Kong, A., Gudbjartsson, D.F., Sainz, J., Jonsdottir, G.M., Gudjonsson, S.A., Richards- of restriction endonucleases. Proc. Natl. Acad. Sci. 76, 5269–5273. son, B., Sigurdardottir, S., Barnard, J., Hallbeck, B., Masson, G., Shlien, A., Nei, M., Suzuki, Y., Nozawa, M., 2010. The neutral theory of molecular evolution in Palsson, S.T., Frigge, M.L., Thorgeirsson, T.E., Gulcher, J.R., Stefansson, K., 2002. the genomic era. Ann. Rev. Genomics Hum. Genet. 11, 265–289. A high-resolution recombination map of the human genome. Nature Genet. Nilsen, T.W., Graveley, B.R., 2010. Expansion of the eukaryotic proteome by 31, 241–247. alternative splicing. Nature 463, 457–463. Kottler, M.J., 1974. From 48 to 46: cytological technique, preconception and the Nozawa, M., Suzuki, Y., Nei, M., 2009. Reliability of identifying positive selection by counting of the human chromosomes. Bull. Hist. Med. 48, 465–502. branch-site and the site-prediction methods. Proc. Natl. Acad. Sci. 106, Lander, E.S., et al., 2001. Initial sequencing and analysis of the human genome. 6700–6705. Nature 409, 860–921. O’Brien, S.J., Menninger, J.C., Nash, W.G., 2006. Atlas of Mammalian Chromosomes. Lane, N., Martin, W., 2010. The energetics of genome complexity. Nature 467, Wileu-Liss. 929–934. O’Geen, H., Frietze, S., Farnham, P.J., 2010. Using ChIP-seq technology to identify Levy, S., Sutton, G., Ng, P.C., Feuk, L., Halpern, A.L., Walenz, B.P., Axelrod, N., Huang, targets of zinc finger transcription factors. Meth. Mol. Biol. 649, 437–455. J., Kirkness, E.F., Denisov, G., Lin, Y., MacDonald, J.R., Pang, A.W.C., Shago, M., Ohno, S., 1970. Evolution by Gene Duplication. Springer-Verlag. Stockwell, T.B., Tsiamouri, A., Bafna, V., Bansal, V., Kravitz, S.A., Busam, D.A., Ohno, S., 1972. So much ‘junk’ DNA in our genome. Brookhaven Symp. Biol. 23, Beeson, K.Y., McIntosh, T.C., Remington, K.A., Abril, J.F., Gill, J., Borman, J., 366–370. Rogers, Y.H., Frazier, M.E., Scherer, S.W., Strausberg, R.L., Venter, J.C., 2007. The Oliver, J.L., Carpena, P., Hackenberg, M., Bernaola-Galva´n, P., 2004. IsoFinder: diploid genome sequence of an individual human. PLoS Biol. 5, e254. computational prediction of isochores in genome sequences. Nucl. Acids Res. Li, W., 2002. Are isochore sequences homogeneous? Gene 300, 129–139. 32 (suppl 2), W287–W292. Li, W., Bernaola-Galvan, P., Carpena, P., Oliver, J.L., 2003. Isochores merit the prefix Oliver, J.L., Carpena, P., Roma´n-Rolda´n, R., Mata-Balaguer, T., Mejı´as-Romero, A., ‘iso’. Comput. Biol. Chem. 27, 5–10. Hackenberg, M., Bernaola-Galva´n, P., 2002. Isochore chromosome maps of the Li, W., Freudenberg, J., 2009. Two-parameter characterization of chromosome- human genome. Gene 300, 117–127. scale recombination rate. Genome Res. 19, 2300–2307. Ono, S., Imamura, A., Tasaki, S., Kurotaki, N., Ozawa, H., Yoshiura, Y., Okazaki, Y., Li, W., He, C., Freudenberg, J., 2011. A mathematical framework for examining 2010. Failure to confirm CNVs as of aetiological significance in twin pairs whether a minimum number of chiasmata is required per metacentric discordant for Schizophrenia. Twin Res. Hum. Genet. 13, 455–460. chromosome or chromosome arm in human. Genomics 97, 186–192. Organ, C.L., Shedlock, A.M., Meade, A., Pagel, M., Edwards, S.V., 2007. Origin Li, W., Holste, D., 2004. An unusual 50,0000 bases long oscillation of guanine and of avian genome size and structure in non-avian dinosaurs. Nature 446, cytosine content in human chromosome 21. Comput. Biol. Chem. 28, 393–399. 180–184. 104 W. Li / Journal of Theoretical Biology 288 (2011) 92–104

Orgel, L.E., Crick, F.H.C., 1980. Selfish DNA: the ultimate parasite. Nature 284, Sun, F., Oliver-Bonet, M., Liehr, T., Starke, H., Turek, P., Ko, E., Rademaker, A., 604–607. Martin, R.H., 2006. Variation in MLH1 distribution in recombination maps for Otto, S.P., Lenormand, T., 2002. Resolving the paradox of sex and recombination. individual chromosomes from human males. Hum. Mol. Genet. 15, Nature Rev. Genet. 3, 252–261. 2376–2391. Pagel, M., Johnstone, R.A., 1992. Variation across species in the size of the nuclear Sun, F., Trpkov, K., Rademaker, A., Ko, E., Martin, R.H., 2005. Variation in meiotic genome supports the junk-DNA explanation for the C-value paradox. Proc. recombination frequencies among human males. Hum. Genet. 116, 172–178. Biol. Sci. 249, 119–124. Swartz, C.E., 2003. Back-of-the-Envelope Physics. Johns Hopkins University Press. Paigen, K., Petkov, P., 2010. Mammalian recombination hot spots: properties, Swift, H., 1950. The constancy of deoxyribose nucleic acid in plant nuclei. Proc. control and evolution. Nature Rev. Genet. 11, 221–233. Natl. Acad. Sci. 36, 643–654. Painter, T.S., 1923. Studies in mammalian spermatogenesis. II. The spermatogen- Szathma´ry, E., Jorda´n, F., Pa´l, C., 2001. Can genes explain biological complexity? esis of man. J. Exp. Zool. 376, 291–336. Science 292, 1315–1316. Pan, Q., Shai, O., Lee, L.J., Frey, B., Blencowe, B.J., 2008. Deep surveying of Tease, C., Hartshorne, G.M., Hulte´n, M.A., 2002. Patterns of meiotic recombination alternative splicing complexity in the human transcriptome by high-through- in human fetal oocytes. Am. J. Hum. Genet. 70, 1469–1479. put sequencing. Nature Genet. 40, 1413–1415. Tease, C., Hulte´n, M.A., 2004. Inter-sex variation in synaptonemal complex lengths Pennisi, E., 2003. A low number wins the GeneSweep pool (news of the week). largely determine the different recombination rates in male and female germ Science 300, 1484. cells. Cyto. Genome Res. 107, 208–215. Perry, G.H., Yang, F., Marques-Bonet, T., Murphy, C., Fitzgerald, T., Lee, A.S., Hyland, Tenesa, A., Navarro, P., Hayes, B.J., Duffy, D.L., Clarke, G.M., Goddard, M.E., Visscher, C., Stone, A.C., Hurles, M.E., Tyler-Smith, C., Eichler, E.E., Carter, N.P., Lee, C., P.M., 2007. Recent human effective population size estimated from linkage Redon, R., 2008. Copy number variation and evolution in humans and disequilibrium. Genome Res. 17, 520–526. chimpanzees. Genome Res. 18, 1698–1710. Tilgner, H., Nikolaou, C., Althammer, S., Sammeth, M., Beato, M., Valca´rcel, J., Pertea, M., Salzberg, S.L., 2010. Between a chicken and a grape: estimating the Guigo´ , R., 2009. Nucleosome positioning as a determinant of exon recognition. number of human genes. Genome Biol. 11, 206. Nature Struct. Mol. Biol. 16, 996–1002. Petes, T.D., 2001. Meiosis recombination hot spots and cold spots. Nature Rev. Tjio, J.H., Levan, A., 1956. The chromosome number of man. Hereditas 42, 1–6. Genet. 2, 360–369. Tupler, R., Perini, G., Green, M.R., 2001. Expressing the human genome. Nature, Petrov, D.A., 2001. Evolution of genome size: new approaches to an old problem. 409, 832–833. Trends Genet. 17, 23–28. Ulgen, A., Li, W., 2005. Comparing single-nucleotide polymorphism marker-based Petrov, D.A., Hartl, D.L., 1999. Patterns of nucleotide substitution in Drosophila and and microsatellite marker-based linkage analyses. BMC Genet. 6 (suppl 1), mammalian genomes. Proc. Natl. Acad. Sci. 96, 1475–1479. S13. Petrov, D.A., Sangster, T.A., Johnston, J.S., Hartl, D.L., Shaw, K.L., 2000. Evidence for Uzan, J.P., 2003. The fundamental constants and their variation: observational and DNA loss as a determinant of genome size. Science 287, 1060–1062. theoretical status. Rev. Mod. Phys. 75, 403–455. Podlutsky, A., Osterholm, A.M., Hou, S.M., Hofmaier, A., Lambert, B., 1998. Vaquerizas, J.M., Kummerfeld, S.K., Teichmann, S.A., Luscombe, N.M., 2009. A Spectrum of point mutations in the coding region of the hypoxanthine- census of human transcription factors: function, expression and evolution. guanine phosphoribosyltransferase (hprt) gene in human T-lymphocytes Nature Rev. Genet. 10, 252–263. in vivo. Carcinogenesis 19, 557–566. Varki, A., Geschwind, D.H., Eichler, E.E., 2008. Human uniqueness: genome Pollard, K.S., Salama, S.R., King, B., Kern, A.D., Dreszer, T., Katzman, S., Siepel, A., interactions with environment, behaviour and culture. Nature Rev. Genet. 9, Pedersen, J.S., Bejerano, G., Baertsch, R., Rosenbloom, K.R., Kent, J., Haussler, D., 749–763. 2006. Forces shaping the fastest evolving regions in the human genome. PLoS Varshalovich, D.A., Potekhin, A.Y., 1995. Cosmological variability of fundamental Genet. 2, e168. physical constants. Space Sci. Rev. 74, 259–268. Pool, J.E., Hellmann, I., Jensen, J.D., Nielsen, R., 2010. Population genetic inference Vinogradov, A.E., 2003. DNA helix: the importance of being GC-rich. Nucl. Acids from genomic sequence variation. Genome Res. 20, 291–300. Res. 31, 1838–1844. Pruitt, K.D., Maglott, D.R., 2001. RefSeq and LocusLink: NCBI gene-centered Vogel, F., 1964. A preliminary estimate of the number of human genes. Nature 201, resources. Nucl. Acids Res. 29, 137–140. 847. Ravasi, T., et al., 2010. An atlas of combinatorial transcriptional regulation in Wang, J., et al., 2008. The diploid genome sequence of an Asian individual. Nature mouse and man. Cell 140, 744–752. 456, 60–65. Rho, M., Zhou, M., Gao, X., Kim, S., Tang, H., Lynch, M., 2009. Parallel mammalian Watterson, G.A., 1975. On the number of segregating sites in genetical models genome contractions following the KT boundary. Genome Biol. Evol. 1, 2–12. without recombination. Theor. Popul. Biol. 7, 256–276. Rice, W.R., 2002. Experimental tests of the adaptive significance of sexual Wheeler, D.A., Srinivasan, M., Egholm, M., Shen, Y., Chen, L., McGuire, A., He, W., recombination. Nature Rev. Genet. 3, 241–251. Chen, Y.J., Makhijani, V., Roth, G.T., Gomes, X., Tartaro, K., Niazi, F., Turcotte, Roach, J.C., Gustavo, G., Smit, A.F., Huff, C.D., Hubley, R., Shannon, P.T., Rowen, L., C.L., Irzyk, G.P., Lupski, J.R., Chinault, C., Song, X.Z., Liu, Y., Yuan, Y., Nazareth, L., Pant, K.P., Goodman, N., Bamshad, M., Shendure, J., Drmanac, R., Jorde, L.B., Qin, X., Muzny, D.M., Margulies, M., Weinstock, G.M., Gibbs, R.A., Rothberg, Hood, L., Galas, D.J., 2010. Analysis of genetic inheritance in a family quartet by J.M., 2008. The complete genome of an individual by massively parallel DNA whole-genome sequencing. Science 328, 1095–9203. sequencing. Nature 452, 872–876. Rocchi, M., Archidiacono, N., Stanyon, R., 2006. Ancestral genomes reconstruction: Wienberg, J., 2004. The evolution of eutherian chromosomes, Curr. Opin. Genet. an integrated multi-disciplinary approach is needed. Genome Res. 16, Dev. 14, 657–666. 1441–1444. Wingender, E., 2008. The TRANSFAC project as an example of framework Rudner, R., Karkas, J.D., Chargaff, E., 1968. Separation of B. subtilis DNA into technology that supports the analysis of genomic regulation. Brief. Bioinf. 9, complementary strands. III. Direct analysis. Proc. Natl. Acad. Sci. 60, 921–922. 326–332. Scherer, S., 2008. A Short Guide to the Human Genome. Cold Spring Harbor Young, W.J., Merz, T., Ferguson-Smith, M.A., Johnston, A.W., 1960. Chromosome Laboratory Press. number of the Chimpanzee, Pan troglodytes. Science 131, 1672–1673. Schwartz, S., Meshorer, E., Ast, G., 2009. Chromatin organization marks exon– Yunis, J.J., 1976. High resolution of human chromosome. Science 191, 1268–1270. intron structure. Nature Struc. Mol. Biol. 16, 990–995. Yunis, J.J., Prakash, O., 1982. The origin of man: a chromosomal pictorial legacy. Sinsheimer, R.L., 1955. The action of pancreatic deoxyribonuclease. II. isomeric Science 215, 1525–1530. dinucleotides. J. Biol. Chem. 215, 579–583. Zhang, C.T., Zhang, R., 2003. An isochore map of the human genome based on the Z Spuhler, J.N., 1948a. On the number of genes in man. Science 108, 279–280. curve method. Gene 317, 127–135. Spuhler, J.N., 1948b. An estimate of the number of genes in man (abstract). Am. J. Zhang, M.Q., 1998. Statistical features of human exons and their flanking regions. Phys. Anthropol. 6, 248. Hum. Mol. Genet. 7, 919–932. Stanyon, R., Rocchi, M., Capozzi, O., Roberto, R., Misceo, D., Ventura, M., Cardone, Zheng, G., Qian, Z., Yang, Q., Wei, C., Xie, L., Zhu, Y., Li, Y., 2008a. The combination M.F., Bigoni, F., Archidiacono, N., 2008. Primate chromosome evolution: approach of SVM and ECOC for powerful identification and classification of ancestral karyotypes, marker order and neocentromeres. Chromosome Res. transcription factor. BMC Bioinf. 9, 282. 16, 17–39. Zheng, G., Tu, K., Yang, Q., Xiong, Y., Wei, C., Xie, L., Zhu, Y., Li, Y., 2008b. ITFP: Sueoka, N., 1962. On the genetic basis of variation and heterogeneity of DNA base an integrated platform of mammalian transcription factors. Bioinf. 24, composition. Proc. Natl. Acad. Sci. 48, 582–592. 2416–2417.