Genetic Sequence Analysis by Microarray Technology

EMILIE HULTIN

Royal Institute of Technology School of Biotechnology

Stockholm 2007

© Emilie Hultin

Royal Institute of Technology AlbaNova University Center School of Biotechnology SE-106 91 Stockholm Sweden

Printed at Universitetsservice US-AB Box 700 14 SE-100 44 Stockholm Sweden

ISBN 978-91-7178-609-8 Emilie Hultin (2007): Genetic Sequence Analysis by Microarray Technology School of Biotechnology, Royal Institute of Technology (KTH), Stockholm, Sweden ISBN 978-91-7178-609-8

Abstract Developments within the field of genetic analysis have during the last decade become enormous. Advances in DNA sequencing technology have increased throughput from a thousand bases to over a billion bases in a day and decreased the cost thousandfold per base. Nevertheless, to sequence complex genomes like the human is still very expensive and efforts to attain even higher throughputs for less money are undertaken by researchers and companies. Genotyping systems for single nucleotide polymorphism (SNP) analysis with whole genome coverage have also been developed, with low cost per SNP. There is, however, a need for genotyping assays that are more cost efficient per sample with considerably higher accuracy. This thesis is focusing on a technology, based on competitive allele- specific extension and microarray detection, for genetic analysis. To increase specificity in allele-specific extension (ASE), a nucleotide degrading enzyme, apyrase, was introduced to compete with the polymerase, only allowing the fast, perfect matched primer extension to occur. The aim was to develop a method for analysis of around twenty loci in hundreds of samples in a high-throughput microarray format. A genotyping method for human papillomavirus has been developed, based on a combination of multiplex competitive hybridization (MUCH) and apyrase-mediated allele- specific extension (AMASE). Human papillomavirus (HPV), which is the causative agent in cervical cancer, exists in over a hundred different types. These types need to be determined in clinical samples. The developed assay can detect the twenty-three most common high risk types, as well as semi-quantifying multiple infections, which was demonstrated by analysis of ninety-two HPV-positive clinical samples. More stringent conditions can be obtained by increased reaction temperature. To further improve the genotyping assay, a thermostable enzyme, protease, was introduced into the allele-specific extension reaction, denoted PrASE. Increased sensitivity was achieved with an automated magnetic system that facilitates washing. The PrASE genotyping of thirteen SNPs yielded higher conversion rates, as well as more robust genotype scoring, compared to ASE. Furthermore, a comparison with pyrosequencing, where 99.8 % of the 4,420 analyzed genotypes were in concordance, indicates high accuracy and robustness of the PrASE technology. Single cells have also been analyzed by the PrASE assay to investigate loss of alleles during skin differentiation. Single cell analysis is very demanding due to the limited amounts of DNA. The multiplex PCR and the PrASE assay were optimized for single cell analysis. Twenty-four SNPs were genotyped and an increased loss of genetic material was seen in cells from the more differentiated suprabasal layers compared to the basal layer.

Keywords: genotyping, single nucleotide polymorphism (SNP), protease-mediated allele- specific extension (PrASE), microarray, tag-array, competitive hybridization, human papillomavirus (HPV), single cell, loss of alleles, differentiation, epidermis.

© Emilie Hultin

Beundran börjar där begripandet slutar

Charles Baudelaire

List of publications

This thesis is based on the following papers, attached in appendices I-IV and referred to in the text by their Roman numerals.

I. Max Käller, Emilie Hultin, Biying Zheng, Baback Gharizadeh, Keng-Ling Wallin, Joakim Lundeberg and Afshin Ahmadian. Tag-array based HPV genotyping by competitive hybridization and extension. J Virol Methods 129(2), 102-112 (2005).

II. Emilie Hultin, Max Käller, Afshin Ahmadian and Joakim Lundeberg. Competitive enzymatic reaction to control allele-specific extensions. Nucleic Acids Res 33(5), e48 (2005).

III. Max Käller, Emilie Hultin, Kristina Holmberg, Marie-Louise Persson, Jacob Odeberg, Joakim Lundeberg and Afshin Ahmadian. Comparison of PrASE and Pyrosequencing for SNP Genotyping. BMC Genomics 7, 291 (2006).

IV. Emilie Hultin, Anna Asplund, Lovisa Berggren, Karolina Edlund, Afshin Ahmadian, Rolf Sundberg, Fredrik Pontén and Joakim Lundeberg. Random loss of genetic segments during skin differentiation indicated by analysis of single cells. Submitted. (2007).

Reprints were made with permission from the publishers.

Table of contents

INTRODUCTION TO DNA ANALYSIS...... 1

1 THE FASCINATING DNA...... 1

2 THE HISTORY OF DNA SEQUENCING ...... 5 2.1 HUMAN GENOME PROJECT ...... 5 2.2 SANGER DNA SEQUENCING...... 6 2.3 SEQUENCING BY HYBRIDIZATION ...... 7 2.4 PYROSEQUENCING...... 8

3 NOVEL SEQUENCING TECHNOLOGIES ...... 10 3.1 MASSIVE GENOME SEQUENCING...... 10 3.1.1 454 SEQUENCINGTM TECHNOLOGY ...... 10 3.1.2 SOLEXA CLONAL SINGLE MOLECULE ARRAY ...... 13 3.1.3 POLONY SEQUENCING BY LIGATION ...... 15 3.2 SINGLE MOLECULE SEQUENCING ...... 17 3.2.1 VISIGEN ...... 17 3.2.2 HELICOS...... 18 3.2.3 LI-COR ...... 19 3.2.4 SEQUENCING THROUGH NANOPORES...... 19

4 ANALYSIS OF DNA VARIATIONS ...... 21 4.1 FINDING UNKNOWN VARIATIONS ...... 22 4.2 GENOTYPING OF KNOWN VARIATIONS ...... 25 4.2.1 ALLELE-SPECIFIC OLIGONUCLEOTIDE HYBRIDIZATION ...... 25 4.2.2 REAL-TIME DETECTION METHODS...... 27 4.2.3 OLIGONUCLEOTIDE LIGATION ASSAY ...... 27 4.2.4 SINGLE BASE EXTENSION ...... 28 4.2.5 ALLELE-SPECIFIC EXTENSION ...... 29 4.3 LARGE SCALE GENOTYPING SYSTEMS ...... 29 4.3.1 MOLECULAR INVERSION PROBES ...... 29 4.3.2 ARRAYED ALLELE-SPECIFIC HYBRIDIZATION ...... 31 4.3.3 GOLDENGATE AND INFINIUM ASSAYS ...... 31 4.4 DETECTION PLATFORMS ...... 33

5 SUMMARY...... 34

PRESENT INVESTIGATIONS...... 37

6 HUMAN PAPILLOMAVIRUS GENOTYPING (PAPER I) ...... 38 6.1 GENOTYPING MICROBES AND VIRUSES ...... 38 6.2 HUMAN PAPILLOMAVIRUS...... 38 6.3 METHODS FOR MOLECULAR DIAGNOSIS OF HPV...... 39 6.4 AMASE ...... 40 6.5 MUCH-AMASE ...... 41 6.6 RESULTS OF HPV GENOTYPING BY MUCH-AMASE...... 43 7 COMPETITIVE ALLELE-SPECIFIC EXTENSION (PAPER II & III) ...45 7.1 PRASE ...... 45 7.2 RESULTS OF THE SNP GENOTYPING BY PRASE (PAPER II) ...... 45 7.3 RESULTS OF PRASE VERSUS PYROSEQUENCING (PAPER III)...... 47

8 SINGLE CELL ANALYSIS INDICATE GENETIC LOSS (PAPER IV).....49 8.1 EPIDERMIS...... 49 8.2 SINGLE CELL ANALYSIS OF SKIN...... 49 8.3 RESULTS OF ALLELIC LOSS ...... 50 8.3.1 RANDOM SAMPLING ...... 50 8.3.2 TISSUE SECTION STATISTICS...... 51

FUTURE PERSPECTIVES...... 53

ABBREVIATIONS ...... 55

ACKNOWLEDGEMENTS...... 59

REFERENCES...... 61

PUBLICATIONS (APPENDICES I-IV)...... 75

EMILIE HULTIN

INTRODUCTION TO DNA ANALYSIS

1 THE FASCINATING DNA

Life is brilliant. It may seem far-fetched to compare a bacterium or even an insect with humans, but still we consist of the same building blocks. The code for the inherited information of all living organisms on earth, from human papillomavirus, birch trees, bumblebees and dogs to humans, is hidden in the DNA sequence; a polymer of only four different nucleotides made of the sugar molecule deoxyribose, a phosphate group and one of the four nitrogen bases, adenine (A), thymine (T), cytosine (C) and guanine (G). The DNA molecule exists in a variety of forms, but most commonly, two DNA strands are twined together forming a double helix (figure 1), which was first described in 1953 based on X-ray diffraction patterns of DNA fibers (Franklin and Gosling, 1953; Watson and Crick, 1953). The nucleotides in a DNA strand are covalently linked to each other with alternating sugar and phosphates forming the backbone for the bases. Each base in one of the strands forms hydrogen bonds to its complementary base in the other strand, A with T and C with G. These four bases and their particular order build up the genetic code of life, different for different species, but also slightly different between all individuals within a certain species. This is what shapes common features of life, but also makes us unique.

1 GENETIC SEQUENCE ANALYSIS BY MICROARRAY TECHNOLOGY

Chromosome

Cell

Organism

DNA ATTGACCGATGTACCGGTTATC TAACTGGCTACATGGCCAATAG

FIGURE 1. The organization of the inherited code of life, DNA, in an organism. The bond-pairing between the bases build up the double helix of the DNA. All the 3 billion bases that constitute the human genome are condensed by a very specific packing into 46 chromosomes in the cell nucleus during mitosis. The sequence of the bases can be read and variations between individuals analyzed. (Parts of the picture are taken from http://www.koshland-science-museum.org.)

The genetic code is transcribed from DNA into RNA, which in turn can be translated into proteins, functioning as enzymes that catalyze biochemical reactions, having structural or mechanical functions such as the proteins in the cytoskeleton, or playing important roles in cell signaling, immune responses, cell adhesion, and the cell cycle. This relationship between DNA, RNA and proteins was stated as the central dogma by Crick in 1957. Despite the fact that the concept of the central dogma has a crucial role in cell biology, research in the last couple of years has revealed a much more complicated picture in gene regulation. There are for instance genes that are transcribed into functioning RNA molecules, which are not translated into proteins. One group of functional RNAs called microRNAs, 21-23 nucleotides in length, act as guide molecules in post-transcriptional gene silencing by mRNA cleavage or translational repression (Lee and Ambros, 2001; Lim et al., 2005).

2 EMILIE HULTIN Depending on the organism, the DNA molecule can be either circular or linear and differ in size from tens of kilobase pairs to thousands of megabase pairs. In general, prokaryotes have small circular chromosomes and eukaryotes larger linear chromosomes. The human genome, for instance, consists of about 3 billion base pairs (bp), packed into 46 chromosomes, 23 inherited from each parent. In eukaryotes, the chromosomes are condensed during mitosis and meiosis by different packing proteins like histones into chromatin, making the chromosomes visible as the classic four-arm structure of the two sister chromatids attached to each other at the centromere (figure 1). Only around 3 % of the human genome is coding DNA, giving rise to either a protein or a functional RNA. A fraction of the non-coding DNA has regulatory functions, but most of the human sequence has no function or a function that is yet unknown. In addition to the nuclear genome, there are hundreds of copies of the small maternally inherited mitochondrial genome of about 16,000 base pairs in each cell.

Variations in the DNA sequence, which make up the specific genotype of an individual, contribute to the differences in phenotype between individuals within a certain species together with environmental factors and random variations. The difference between two individual human genomes has been estimated to be less then 0.1 %, which corresponds to about 3 million variants or one variant per 1,200 bases (Li and Sadler, 1991; Sachidanandam et al., 2001) (http://www.hapmap.org). Each distinct variant is called an allele and a person’s collection of alleles makes up the genotype. All variations in the genome have occurred due to mutations in the DNA sequence. The mutations can be caused by copying errors during cell division or exposure to different toxic compounds, radiation or viruses. Mutations can occur in the germline, which will then be passed on to the offspring, or as somatic mutations, only affecting the individual organism.

There are different kinds of variations in the genome. Large chromosomal variations, such as loss or gain of whole chromosomes or breakage and rejoining of chromatids, are rare as constitutional mutations, but often pathogenic. These kinds of DNA rearrangements are more common as somatic alterations and often found in tumor cells. For example, loss of heterozygosity (LOH), which is a common phenomenon in tumor cells, can occur in an

3 GENETIC SEQUENCE ANALYSIS BY MICROARRAY TECHNOLOGY individual with heterogeneous alleles where one allele is lost due to deletion or recombination. Consequently, the above mentioned gross alterations often lead to either gene dosage increase or gene loss.

Short tandem repeats (STRs) or microsatellites, which consist of repeating units of 2-5 bases in length, are polymorphic sequences that are often used as molecular markers (Jeffreys et al., 1985; Litt and Luty, 1989). STRs are typically situated in non-coding regions and over 10,000 STR sequences in the human genome have been published. Some of these have been used in linkage studies while others are utilized to create unique genetic profiles of individuals, so called DNA fingerprints, used in forensic investigations (Gill et al., 1985; Moretti et al., 2001).

Obviously, small-scale DNA variations are much more common than the large-scale chromosomal abnormalities. Single nucleotide polymorphisms (SNPs), where a single base is substituted, inserted or deleted, account for about 90 % of all polymorphisms. The definition of a polymorphism is when more than one allele at a locus occurs with a frequency greater than 1 %. In the human genome there are about 10-15 million SNPs. Single base substitutions can occur in both coding sequences, where most of the pathogenic alterations have been identified, as in sickle cell anemia (Pauling et al., 1949; Ingram, 1957), intragenic non-coding sequences altering gene expression and in regulatory sequences outside exons (Knight, 2003; Pastinen and Hudson, 2004; Pastinen et al., 2004). Base substitutions are the most common DNA changes where transitions, exchanging a purine (A or G) for a purine or a pyrimidine (C or T) for a pyrimidine, are more frequent than transversions, exchanging a purine for a pyrimidine and vice versa, however, this is not universal for all taxa (Keller et al., 2007). SNPs can, like the STRs, also be used as molecular markers in both disease and forensic investigations. For degraded DNA samples, SNP analysis utilizing very short amplicons is an advantage over longer STR fragments. In addition, analysis of SNPs in the mitochondrial DNA (mtDNA) can provide an even higher sensitivity, due to the higher copy number of mtDNA compare to nuclear DNA (Divne and Allen, 2005).

4 EMILIE HULTIN

2 THE HISTORY OF DNA SEQUENCING

2.1 HUMAN GENOME PROJECT

In 1990 the international Human Genome Project (HGP) was launched in the United States, a 15 year program with the aim of determining the entire human genome sequence. The funding levels from the National Institute of Health, the Department of Energy, the Wellcome Trust, and other governments and foundations all around the world indicate the importance of the project for biology and potential medical breakthroughs. After years of technical innovations and improvements of sequencing approaches and the development of bacterial artificial chromosomes (BACs) (Shizuya et al., 1992), all the 20 HGP sequencing sequenced 1000 bases per second in the year 2000. In contrast, in the mid-1980’s the best equipped and most experienced laboratories sequenced 1000 bases in 24 hours (Collins et al., 2003). The goal to sequence the human genome was also set up by the private venture company Celera Genomics, which used a newer riskier but cheaper technique called whole genome shotgun sequencing, in 1998 (Venter et al., 1998). In 2001 the initial draft of the human genome was finished by both HGP (Lander et al., 2001) and Celera (Venter et al., 2001). Improved drafts were published 2003 and 2005, filling in some gaps in the sequence. The cost of the final sequence together with all the technology ended up being 3 billion US-dollars (USD), 1 dollar per base in the HGP (Service, 2006). When the program started in 1990, the cost per base was 10 dollars, and when it ended 2003, it was reduced to around 5 cents.

Now, over one thousand viruses and hundreds of bacteria have been sequenced and several eukaryotic genomes are in progress, among them around fifty mammals of which twenty are complete or in assembly. One of the completed genomes is from the Common Chimpanzee (Pan troglodytes), which revealed that humans only differ from the chimpanzee in 35 million single nucleotide changes and five million insertion/deletion events. By that, the two genomes are over 98.5 % identical (The Chimpanzee Sequencing and Analysis Consortium, 2005). In February 2007 there were 514 complete and published genomes and 2434 ongoing bacterial, archaeal, eukaryotic and metagenome projects reported

5 GENETIC SEQUENCE ANALYSIS BY MICROARRAY TECHNOLOGY (http://www.genomesonline.org). The cost for whole genome sequencing is, however, still far too high; millions of dollars per sample are required to sequence complex genomes such as the human. A year ago the genome of the rhesus macaque, the second non-human primate to be sequenced, was completed for 22 million USD (Service, 2006). In 2003, a sequencing race was initiated as the J. Craig Venter Science Foundation offered a 500,000 USD award for the first approach able to sequence a genome, of similar complexity as the human, for less than 1000 dollars. This represents the current world-wide goal within whole-genome sequencing. The strive was further encouraged by the National Institutes of Health in 2004 with the introduction of a 70 million USD grant program to support researchers working to sequence a mammalian genome initially for 100,000 USD and finally for 1000 USD.

However, to reach the goals of the human genome project and in order to realize the visions of the 1000 USD genome, several frontiers have been working with different technical solutions. It all started in 1977 with Fredrick Sanger’s elegant technique for DNA sequencing called chain termination. In 1998, an alternative sequencing approach called pyrosequencing was presented.

2.2 SANGER DNA SEQUENCING

The decoding of the DNA sequence was revolutionized in 1977 when Sanger introduced the concept of sequencing by chain termination (Sanger et al., 1977). The technique became the golden standard for many years and is still the most common sequencing approach at today’s laboratories, though there are several emerging sequencing technologies. The Sanger DNA sequencing method is based on the modified and chain terminating dideoxyribonucleotide triphosphates (ddNTPs), which in the event of incorporation terminate further elongation, generating different fragment lengths. In the early years of this technique, the sequencing primer used to be radiolabeled and polymerase mediated extension was performed in the presence of all four deoxyribonucleotide triphosphates (dNTPs) together with only one of the four terminating nucleotides. Thus, four different sequencing reactions, one for each ddNTP,

6 EMILIE HULTIN generated fragments that were separated by gel electrophoresis in four lanes and the sequence could be read according to the fragment migration.

The method has been modified several times since 1977. Today, Sanger DNA sequencing is mostly performed with the four terminating ddNTPs each labeled with a specific fluorescent dye in one single tube (Smith et al., 1986; Prober et al., 1987) and the fragments are separated by capillary electrophoresis. The use of thermostable polymerases has enabled cycled sequencing reactions, resembling the polymerase chain reaction (PCR) but with linear amplification, called cycle sequencing (Innis et al., 1988; Carothers et al., 1989). In one single run, 96 or 384 samples can automatically be sequenced with read lengths up to 1000 bases, as with for instance the capillary sequencer instrument ABI 3730 from Applied Biosystems (CA, USA), generating over two million bases in 24 hours.

2.3 SEQUENCING BY HYBRIDIZATION

The nature of the DNA double helix formation can be utilized for sequencing by enabling a single stranded labeled DNA template to hybridize to a set of short oligonucleotide probes, creating duplex molecules. The probes, or sometimes the templates, are attached to the surface of a glass slide or other solid support in a microarray format. The template strands will only anneal to perfectly matched probes, i.e. probes that are fully complementary to the template DNA. For de novo sequencing, all possible combinations of oligonucleotides of a given length are used as probes, for instance 65 536 (48) octamers (Drmanac and Crkvenjakov, 1987; Drmanac et al., 1989; Strezoska et al., 1991). Longer sequences can then be obtained by combining all short hybridized oligomers by their unique overlaps, as for these three octamers: ATTGCGTA TTGCGTAC TGCGTACA which uniquely define the decamer ATTGCGTACA. All overlapping short sequences that give a positive signal are assembled by a computer program that reconstructs the sequence.

7 GENETIC SEQUENCE ANALYSIS BY MICROARRAY TECHNOLOGY The drawbacks of sequencing by hybridization (SBH) are that small differences in duplex stability between similar probes can create false positives and repetitive sequences may be unreadable (Tibanyenda et al., 1984; Marziali and Akeson, 2001). However, in addition to de novo sequencing, SBH can be used for resequencing, DNA variation discovery and genotyping of SNPs, as well as for expression profiling (Drmanac et al., 2002). In diagnostics, it has been shown that a combinatorial SBH (cSBH) can be used to define disease causing mutations in a gene for a particular disorder (Cowie et al., 2004; Schirinzi et al., 2006). In cSBH, a universal probe set is attached to a support, creating a microarray, and a second dye-labeled probe is present in the solution. Ligation between the two probes occurs when a DNA strand from the target PCR product consecutively anneals to both array-bound and solution-phase probe. This creates ligated long labeled probes attached to the surface that are complementary to the target.

2.4 PYROSEQUENCING

In 1998 an alternative DNA sequencing approach denoted pyrosequencing was developed, later pioneered by the company Biotage in Uppsala (Sweden) (Ronaghi et al., 1998). The technique is based on the concept of sequencing by synthesis that was described in 1985 by Melamede (Melamede, 1985). In sequencing by synthesis, a sequencing primer is annealed to the target DNA template and nucleotides are added in a sequential manner. Upon incorporation of a nucleotide by the polymerase, the polymerase activity can be measured by different means. The four dNTPs are dispensed either in a cyclic fashion or in a determined order as many times as needed to achieve a desired read length.

In pyrosequencing, the incorporation of a nucleotide by the polymerase is detected by an enzymatic cascade starting with the release of pyrophosphate (PPi) (figure 2). The PPi is used as substrate together with adenosine phosphosulphate (APS) by ATP sulfurylase, generating ATP. Luciferase then converts the energy in ATP in the presence of D- luciferin to detectable light. The light signal intensity is directly proportional to the number of nucleotides incorporated in a single nucleotide flow. Between each nucleotide dispensation, apyrase eliminates the surplus of unincorporated dNTPs by cleaving off the two phosphate groups, making it unusable for the polymerase in the following nucleotide

8 EMILIE HULTIN additions and also degrades the remaining ATPs. The use of apyrase is crucial to avoid accumulating light signals after each nucleotide incorporation as well as avoiding tedious and expensive washing procedures (Ronaghi et al., 1996).

The major drawbacks with the pyrosequencing technique are the short read lengths and the problem of reading long homopolymeric sequences. Efforts have been made to increase read length, in some cases up to 100 bases (Gharizadeh et al., 2002). The read- length limitations of pyrosequencing can be attributed to high background levels due to incomplete nucleotide incorporations and incomplete nucleotide degradation, phenomena that cause negative and positive frame-shifts, respectively.

FIGURE 2. The pyrosequencing approach. Incorporation of a nucleotide by the polymerase initiates a cascade of enzymatic reactions. The PPi that is cleaved off from the dNTP by the polymerase acts as substrate together with APS for ATP sulfurylase, generating ATP, which in turn is a substrate for luciferase that converts the energy in ATP to light that can be detected by a CCD camera. Finally the enzyme apyrase degrades unused dNTPs and ATPs, and a second round with another dNTP can be started. (The picture is published with permission from Biotage.)

9 GENETIC SEQUENCE ANALYSIS BY MICROARRAY TECHNOLOGY

3 NOVEL SEQUENCING TECHNOLOGIES

The race towards the 1000 US-dollar human genome has led to several emerging sequencing technologies. These techniques may be divided into two categories. The first category relies on massive parallel sequencing of populations of clonally amplified single molecules, while the second category includes single molecule sequencing. Generally, these techniques are based on the activity of different enzymes such as polymerases (sequencing by synthesis), ligases (sequencing by ligation) and restriction enzymes or physical techniques such as nanopores where the DNA is unaltered (Bennett et al., 2005). Next generation sequencing technologies look promising for both scanning and screening of all kinds of variations in the genome, and some are now commercially available.

3.1 MASSIVE GENOME SEQUENCING

Two different approaches, utilizing the sequencing by synthesis strategy of clonally amplified molecules for whole genome sequencing, have received great attention the last year. The first system called the GS-20 is provided by the company 454 Life Sciences and is based on emulsion PCR and 454 SequencingTM. The second system called 1G Genetic Analyzer was developed by Solexa (recently acquired by Illumina Inc. in San Diego (CA, USA)) and is based on stepwise sequencing of clonal amplified single molecules on a glass slide. Another massive parallelized sequencing technology, based on polony sequencing by ligation, has been developed at Harvard Medical School, now commercialized by Agencourt (MA, USA) and Applied Bioystems.

TM 3.1.1 454 SEQUENCING TECHNOLOGY 454 Life Sciences in Branford (CT, USA) has developed a technology for whole genome sequencing in a picotiter plate, based on the pyrosequencing and emulsion based PCR carried out by their Genome Sequencer 20 (GS-20) instrument (Margulies et al., 2005). The process is divided into three steps (figure 3); library preparation, emulsion PCR and pyrosequencing. First, the library is prepared by the following procedure. The genomic DNA is fractionated by nebulization into 300-500 bp fragments and blunted. Two short adapters, A and B, containing priming sequences for amplification and

10 EMILIE HULTIN sequencing are ligated onto the ends. The library is then immobilized onto streptavidin coated beads by the biotin tag on one of the adaptors and the non-biotinylated strand is released and used as template after titration. Second, the templates are amplified by an emulsion PCR (Dressman et al., 2003). The single stranded template DNA is annealed to an excess of beads and only beads containing a template molecule are emulsified together with amplification reagents in a water-in-oil mixture. The water droplets form microreactors for PCR amplification of the templates immobilized onto the beads, one bead in each droplet, leading to clonally amplified DNA fragments. Third, the beads are loaded into a picotiter plate (Leamon et al., 2003), one bead in each well, and covered with a smaller type of beads with sequencing reagents attached to them. The clonally amplified fragments on each bead are sequenced according to the pyrosequencing chemistry (see above), by a sequential flow of nucleotides in a fixed order across the wells. The photons that are emitted by the incorporation of nucleotides are detected by a CCD camera (charge-coupled device camera) from all wells simultaneously. The reads are filtered for quality and then assembled by a computer algorithm into contigs.

In a run that takes about five hours, at least 20 Mb are generated from around 200,000- 350,000 reads, each about 100 bases long. The 6x6 cm picotiter plate contains 1.6 million wells with a diameter of 44 µm that are loaded with beads under optimized conditions to maximize number of wells containing exactly one bead. One future goal is to achieve longer read lengths of up to maybe 500-1000 bases, compared to today’s read lengths of occasionally up to 200 bases. A drawback with the 454 sequencingTM is the problem of reading through homopolymers. January this year 454 Life Sciences launched a new version of their sequencing instrument, Genome Sequencer FLX System, with an improved read length of 200 bp with a total capacity to sequence 1000Mb per experiment.

11 GENETIC SEQUENCE ANALYSIS BY MICROARRAY TECHNOLOGY

FIGURE 3. The process of the Genome Sequencer 20 system. The genomic DNA is fragmented and blunted. Two adapters A and B containing priming sequences for amplification and sequencing are ligated onto the ends. The library is then immobilized onto streptavidin coated beads by the biotin tag on one of the adapters and the non-biotinylated strand is released and used as template. The single stranded templates are then annealed to a second type of beads and emulsified in a water-in-oil mixture containing PCR reagents. The water droplets form microreactors for PCR amplification with one bead carrying a single template in one droplet. The clonally amplified beads are then loaded into the picotiter plate and covered by smaller beads where sequencing reagents are attached. 454 sequencingTM is then performed by letting nucleotides sequentially flow over the plate in a fixed order. The emitted light from the incorporated bases is detected by a CCD camera. (The picture is published with permission from 454 Life Sciences.)

12 EMILIE HULTIN The 454 instrument can be used for several applications in both de novo sequencing and resequencing. In 2005 researchers sequenced the entire genomes of Mycoplasma genitalium (0.6 Mb) and Streptococcus pneumoniae (2.1 Mb) in single sequencing reactions (Margulies et al., 2005). Currently, The Max Planck Institute for Evolutionary Anthropology in Leipzig, Germany, in collaboration with 454 Life Sciences are sequencing the Neanderthal genome, a two year project. Over one million bases have now been sequenced from the 38,000 year old Neanderthal fossil and comparison with the human and chimpanzee genomes reveals that the Neanderthal diverged from the modern human approximately 500,000 years ago (Green et al., 2006). With the present technique from 454 Life Sciences, it will be necessary to run the GS-20 system 6000 times to complete the Neanderthal genome, but it is believed that technical improvements will reduce the number of runs to 600. Another application is mutation detection in heterogeneous cancer specimens to detect rare cancer- associated variations to facilitate accurate molecular diagnosis and personalized treatment (Thomas et al., 2006).

3.1.2 SOLEXA CLONAL SINGLE MOLECULE ARRAY The Solexa genome analysis system is based on clonal single molecule array technology developed by the company Solexa, a subsidiary of Illumina Inc. since January 2007 (Bennett, 2004; Bennett et al., 2005). The genomic DNA is fragmented randomly and two adapters are ligated onto the ends. The single stranded fragments are then bound to an optically transparent surface, together with primers that are complementary to the adapters. The bound templates form bridges between the primers on the surface, which will be extended by addition of polymerase and nucleotides (Adessi et al., 2000). After the first cycle of solid phase bridge amplification the double stranded molecules are denatured and another cycle of amplification is started. On completion, several million clusters of clonally amplified template DNA molecules have been generated in each channel of the flow cell (figure 4). The first round of sequencing starts by addition of the four differently labeled reversible terminating nucleotides, DNA polymerase and primers. After washing and laser excitation, the image of emitted fluorescence from the clusters is captured and the first base for each cluster is identified. Before initiating the next sequencing cycle, the blocked 3’-terminus and the fluorophore from each incorporated base are removed.

13 GENETIC SEQUENCE ANALYSIS BY MICROARRAY TECHNOLOGY

Bridge amplification

1 234 5 DNA fragment with adapters

Primers

Clonal single molecule sequencing

Extension primers

Labeled reversible terminator nucleotides

Laser excitation reveal first base

A C T G

Repeated cycles of sequencing A C T G A T A C C A A T T G C C G T G A FIGURE 4. Solexa solid phase bridge amplification and clonal single molecule sequencing. Single stranded fragments of genomic DNA, blunted with adapters, are bound to a surface together with primers that are complementary to the adapters (1). The bound templates form bridges between the primers on the surface (2). The primers are then extended by a polymerase (3). After the first cycle of amplification the dsDNA molecules are denatured (4). Additional cycles generate cluster of clonally amplified template DNA molecules (5). The sequencing is performed with labeled reversible terminator nucleotides. The image of emitted fluorescence from each cluster is recorded and the terminators and the labels are removed before the next cycle of sequencing is started.

The flow cell contains over 10 million clusters per square cm, where each cluster is made up of around 1000 copies of the template. In one run that takes approximately 48-72 hours, 25-35 bases are read from each cluster, generating up to 1 Gb in total that can be aligned to a reference genome. In March 2005, Solexa reported the completion of its first genome sequence, that of the virus Phi-X 174 (http://www.solexa.com).

14 EMILIE HULTIN

3.1.3 POLONY SEQUENCING BY LIGATION Multiplex polony sequencing by ligation is a four step process developed at Harvard Medical School by a team headed by Church (Shendure et al., 2005). First, a library is constructed from genomic DNA by a cell-free strategy, where a restriction enzyme and a linker bring sequences that are separated by 1 kb in the genome closer to each other by circularization of the fragments. The library molecules end up with a length of 135 bp, where the two genomic sequences of 17-18 bp each are separated and flanked by universal sequences. These universal sequences are complementary to amplification and sequencing primers. In the second step, the library molecules are attached and amplified on beads in an emulsion PCR (Dressman et al., 2003) resulting in polymerase colonies, called polonies (Mitra and Church, 1999). In the third step, after enrichment of clonal beads, the beads are immobilized in an acrylamide gel developed for in situ polonies (Mitra et al., 2003), forming a monolayered array that is mounted in a flow cell for sequencing, which is the last step in the process.

The sequencing strategy utilizes four colors and degenerate ligation (figure 5). An anchor primer is hybridized immediately 5’ or 3’ to one of the two genomic sequence tags. A population of degenerate nonamers, labeled with fluorescent dyes, is then added, where a ligation reaction of the anchor primer to the nonamers is performed. The population of nonamers is designed so that the base identity at a specific site is correlated with the identity of the fluorophore attached to that nonamer, generating four colors of nonamers depending on the specific base position. As the ligase discriminates for complementarity at the queried position, the fluorescent signal identifies the base. After ligation and imaging the anchor primer:nonamer complexes are stripped away and a new cycle can begin with a new population of nonamers with changed position for the known base. Accurate sequencing can be obtained when the query base is up to 6 or 7 bases from the ligation junction while ligating in the 5’-3’ or 3’-5’ direction respectively. This allows for sequencing of 13 bp (6+7) per genomic tag and of 26 bp (2x(6+7)) per amplicon. This assay has been used for resequencing the microbial genome of E.coli and yielded 30.1 Mb raw data in two 60 hour runs, providing 71 % high quality coverage (X4) (Shendure et al.,

15 GENETIC SEQUENCE ANALYSIS BY MICROARRAY TECHNOLOGY 2005). The cost of 0.11 USD per kb was roughly one-ninth of the electrophoretic Sanger DNA sequencing.

(a) Genomic sequence tags

Bead

Anchor primers

(b) Labeled degenerate nonamers -NNNNANNNN -NNNNGNNNN -NNNNCNNNN -NNNNTNNNN Bead-immobilized Anchor primer template strand

AGTCTAGCTGACTAG------GAGT?????????????????TCAGATCGACTGATC---

Hybridization and ligation of nonamers -NNNNCNNNN AGTCTAGCTGACTAG------GAGT????????????G????TCAGATCGACTGATC---

The query position: C FIGURE 5. Sequencing by ligation of polonies. (a) Two genomic sequence tags that were separated by 1 kb in the genome sequence are flanked and separated by universal sequences to which one of the four anchor primers are hybridized. (b) At any given sequencing cycle, the population of nonamers that is used is designed to have unknown sequence in all but one position. The identity of that position correlates with the identity of the attached fluorophore. The nonamers are hybridized and ligated to the anchor primers. After four-color imaging the ligated complexes are stripped off and a new cycle of sequencing begins. (The picture has been modified after (Shendure et al., 2005).)

16 EMILIE HULTIN

3.2 SINGLE MOLECULE SEQUENCING

Drawbacks with sequencing by synthesis techniques are the relatively short read lengths, which make the assembly more complicated, and the need of template amplification. Longer read lengths can be difficult to achieve; since the sequencing reactions are not hundred percent efficient, the population of DNA strands that are sequenced will gradually get out of phase, eventually precluding accurate sequence determination. Those problems can be overcome with single molecule sequencing. Some of the advantages by sequencing single molecules are the reduced reagent consumption and the small amounts of template DNA needed. Nevertheless, detection of single molecules can be very challenging. Different chemical and technical solutions have been adopted for sequencing of a single molecule. Some rely on the sequencing by synthesis approach with different detection strategies. Another approach conceived by Richard Keller at Los Alamos National (NM, USA) uses a exonuclease to cleave off bases, one at a time, from a DNA template strand, labeled at each and every base (Werner et al., 2003). An improvement to this method would be the ability to detect un-labeled bases. A group of single molecule sequencing techniques utilizes nanopores as base sensors as a DNA strand is threaded through the pore. Below are some promising approaches with the potential of sequencing an entire human-sized genome for 1000 US-dollars.

3.2.1 VISIGEN Visigen Biotechnologies in Houston (TX, USA) is one of the new companies striving towards the 1000 US-dollar human genome. Visigen’s sequencing by synthesis approach is based both on an engineered polymerase and on nucleotide triphosphates that act as molecular sensors of DNA base identity (http://www.visigenbio.com). The company is developing nano-sequencing machines of single DNA polymerase molecules that are attached to the surface of a glass slide. The incorporation of bases by the polymerase is detected in real-time by a fluorescence resonance energy transfer (FRET) chemistry, where the polymerase is modified with a fluorescent donor molecule and the nucleotides color coded with an acceptor fluorescent moiety. During the extension reaction of the template DNA, energy is transferred from the polymerase to the nucleotides, leading to emission of a base-specific incorporation signature that is directly detected (figure 6).

17 GENETIC SEQUENCE ANALYSIS BY MICROARRAY TECHNOLOGY

FIGURE 6. Visigen’s single molecule sequencing by synthesis approach. An immobilized modified polymerase is used for sequencing of a single molecule. The polymerase contains a fluorescent donor molecule and the nucleotides are color coded with an acceptor fluorescent moiety. As the polymerase extends the primer, the base incorporation results in FRET from the polymerase to the nucleotide, leading to emission of a base-specific signature that is detected in real-time.

By creating massively parallel arrays of nano-machines, it would be possible to achieve a sequencing rate of 1 Mb/sec/microarray.

3.2.2 HELICOS Helicos Biosciences in Cambridge (MA, USA) was one of the grant recipient companies for advanced sequencing technology development projects that were awarded in 2004, initiated by the National Human Genome Research Institute (NHGRI) (http://www.genome.gov/12513162). Helicos is developing high speed sequencing by synthesis assay for both DNA and RNA without amplification, by analyzing single molecules. The instrument system, called the HeliScope™, is based on sequencing by synthesis of single molecules and the use of fluorescence . The technique was first developed by Stephen Quake and his co-workers at the California Institute of Technology (CA, USA). The approach utilizes universal primers that are immobilized on a glass surface inside a flow cell. The single stranded template is fragmented and universal primers fluorescently labeled with Cy3, are attached at the ends. The template strands are then hybridized to the primers in the flow cell by the complementary universal primer sequences. The positions of the template-primer duplexes are revealed by illuminating the surface, letting the Cy3 molecules fluoresce. The sequencing reaction starts by addition of

18 EMILIE HULTIN polymerase and one of the four nucleotides fluorescently labeled with Cy5. The bases are incorporated by the polymerase at the complementary positions whereupon unincorporated nucleotides are washed away. The duplexes are visualized once again but now with the other fluorophore, Cy5, to detect the positions of the primers with incorporated bases. The Cy5 molecules are removed and the process is repeated with all four nucleotides, one at a time until the desired read length is reached. The work done by Quake’s group in 2003 demonstrated a read length of 5 bp (Braslavsky et al., 2003). The detection of the single molecules is accomplished by a sensitive digital camera connected to a . This method allows a high degree of parallelization for de novo sequencing. The first generation HeliScope™ is thought to produce 1 Gb per day with a 1000-fold price advantage over current Sanger DNA sequencing methods (http://helicosbio.com).

3.2.3 LI-COR Li-Cor Biosciences in Lincoln (NE, USA) was also one of the recipients of the advanced sequencing technology projects award (http://www.genome.gov/12513162). Li-Cor’s project, headed by John Williams, aims to develop a system for de novo sequencing of single DNA molecules with very long reads. The principle is based on sequencing by synthesis of a single DNA molecule, where the incorporation of nucleotides is detected. They are using charge switch dNTPs, bases modified with a charge moiety. The nucleotides are also dye- labeled at the γ-phosphate, so that the released pyrophosphate is labeled and partitioned by charge. A problem with single dye molecule is that the signal-to-noise ratio is low. One way to get around this problem could be to use brighter labels, giving a stronger signal. The major advantage with Li-Cor’s methodology is the possible long, unlimited read length of maybe 20,000 bases or even whole chromosomes. Long read lengths make the assembly of the sequence much easier and it also reveals haplotype information (http://www.genengnews. com/sequencing/supp_03.aspx).

3.2.4 SEQUENCING THROUGH NANOPORES All sequencing technologies mentioned above are based on molecular cell biology principles of DNA chemistry, either by utilizing primer extension by a polymerase,

19 GENETIC SEQUENCE ANALYSIS BY MICROARRAY TECHNOLOGY hybridization properties of the DNA or restriction enzymes. In this section, a method for DNA sequencing that combines nanophysics with biotechnology, called nanopore sequencing, proposed over a decade ago, is described (Kasianowicz et al., 1996).

The aim is to sequence a DNA strand as it is pulled electrophoretically through a synthetic or natural pore, only 1.5 nm wide. The detection is based on a shift in the ionic current that would be driven through the pore by the applied potential as the bases pass through, making the voltage to spike up and down (Deamer and Akeson, 2000). Single-base recognition in a static DNA strand by the α-hemolysin pore, a transmembrane protein pore from Staphylococcus aureus, was achieved in 2005 (Ashkenasy et al., 2005). However, in sequencing the DNA strand has to move through the pore under an electric potential, which makes the strand move through the pore too rapidly for identification of single bases (Meller et al., 2001). A combined detection with a nanopore and exonuclease was proposed recently by a group at the University of Oxford (UK) (Astier et al., 2006). This group show that individual unmodified deoxyribonucleoside monophosphates (dNMP) can be identified by the use of an engineered α-hemolysin pore that contains a cyclodextrin adapter as detector. When a dNMP interacts with the binding site within the pore, a change in the ionic current that passes through the pore under an applied potential can be measured. By this technique, all four bases can be distinguished, demonstrating a label-free detection method for exonuclease sequencing.

There are, however, many problems that need to be solved before nanopores could be considered for DNA sequencing (Astier et al., 2006). The exonuclease has to be coupled to the nanopore so that every dNMP that is released enters the pore. The dNMPs are only detected in the presence of the adapter, so the adapter has to stay inside the pore. This can be achieved by covalently linking the adapter to the pore. Nanopore sequencing is still far from realized, but it holds some promising features for the future; unlimited sequence read length with no need for DNA amplification and relatively high multiplex capacity for ultra-high throughput sequencing.

20 EMILIE HULTIN

4 ANALYSIS OF DNA VARIATIONS

It is of great importance to reveal the relationship between a phenotype and a genotype, in order to link a certain disease or disease susceptibility to the genotype status. Genotyping has also implications in forensic sciences, population genetics and pharmacogenetics. Most genetic variation analyses rely on either linkage studies of family pedigrees or association studies of large populations. This can be done by screening for polymorphisms in the entire genome or in candidate genes. In the recent years the focus has been on SNPs to perform linkage and association investigations. However, as mentioned above, there are other types of variations such as chromosomal rearrangements, chromosomal duplications and/or deletions that are important to link specific genes or loci to certain diseases such as cancer. Linkage studies are best suited for high penetrant rare alleles with monogenic patterns, which have shown to be easier to find than alleles contributing to complex diseases (Hirschhorn and Daly, 2005). Unfortunately, most common diseases are believed to be caused by multiple genetic and environmental factors, which are much more complicated to map (Wang et al., 2005). For those complex diseases, whole genome association studies are preferable (Risch and Merikangas, 1996). The whole genome association approach is however very expensive, rendering many scientists to use the more convenient and economical candidate gene approach.

The recently completed phase I and II International HapMap Project with the aim of constructing a human haplotype map (HapMap) (The International HapMap Consortium, 2003) may simplify the search for interesting locations in the genome. It relies on the nature of evolution, where blocks of polymorphisms are inherited together, forming a haplotype. Consequently, it is only necessary to genotype a set of polymorphisms, so called tagSNPs, in each block, reducing the genotyping complexity. Genotyping of about 300,000 to 600,000 tagSNPs is estimated to be enough for genome wide SNP analysis, which would imply a ten to fiftyfold reduction compared to genotyping of all the 10-15 million SNPs (Hinds et al., 2005; Steemers and Gunderson, 2007) (http://www.hapmap.org). About 550,000 tagSNPs are estimated by Steemers et al. to capture about 90 % of the variation in the Caucasians and Asians (The International

21 GENETIC SEQUENCE ANALYSIS BY MICROARRAY TECHNOLOGY HapMap Consortium, 2003), while 650,000 tagSNPs would be necessary for 75 % coverage in the Yoruban populations. Genotyping of 300,000 tagSNPs would be roughly equivalent in power to 1 million randomly chosen SNPs according to Hinds and co- workers. However, it remains to be revealed if this reduction is of great value for finding disease causing SNPs. One problem could be that the map is based on common SNPs and the often causative rare SNPs could be missed (Lai et al., 2002; Terwilliger et al., 2002).

Genetic variation analysis may be performed by several different techniques and on many different platforms. Variation studies can be divided into two main categories according to the purpose of the analysis. The first category comprises methods used for scanning the genome for unknown variations and the second category consists of different genotyping systems used to screen for known variations. Sequencing is usually used for scanning, but may also be used for genotyping. It is, though, very expensive to genotype several loci with Sanger DNA sequencing and pyrosequencing. Novel whole genome sequencing technologies (described above) have the potential to eventually be used for this purpose. There are, however, a number of different assays that are in use for SNP discovery and genotyping. The common denominator for these methods is the relative low cost of instrumentation and performance. Some of these methods are described below.

4.1 FINDING UNKNOWN VARIATIONS

There are two groups of scanning methods to find unknown variations. The assays in the first group detect conformational changes of a mutant DNA molecule as a shift in gel mobility. This group of methods includes heteroduplex analysis (HA) (White et al., 1992), denaturing gradient gel electrophoresis (DGGE) (Fischer and Lerman, 1979; Fischer and Lerman, 1983; Myers et al., 1985; Myers et al., 1987), denaturing high-performance liquid chromatography (DHPLC) and single-stranded conformation polymorphism (SSCP) (Orita et al., 1989). In the HA, a PCR product is denatured and reannealed. Homozygous samples produce homoduplexes and heterozygous samples both homoduplexes and heteroduplexes. These two products will generate one respectively two bands in a gel by the shift in mobility of the heteroduplexes. In DGGE, mutations are detected as a difference in partial melting behavior of double stranded DNA (dsDNA) and hence, a

22 EMILIE HULTIN difference in mobility as regions of single stranded DNA (ssDNA) in a dsDNA fragment greatly retard the migration through acrylamide gels. In DHPLC differential retention of homoduplexes and heteroduplexes under partially denaturing conditions is achieved by a combination of size separation of DNA fragments with thermal denaturation (Huber et al., 1993a; Huber et al., 1993b). Wave system, commercialized by the company Transgenomic (NE, USA), is a mutation detection technology based on DHPLC. The SSCP relies on the principle that secondary structures affect the mobility of ssDNA in a non-denaturing gel. A fragment with a single base mutation adapts a different conformation then a wild-type fragment.

The assays in the second group rely on mismatch recognition in formed heteroduplexes by chemical (Cotton et al., 1988) or enzyme cleavage (Solaro et al., 1993; Goldrick et al., 1996; Goldrick, 2001) and by mismatch binding proteins (Lishanski et al., 1994), all leading to altered mobility in gel electrophoresis. These methods are able to scan larger fragments than conformation-based assays. The size of the cleaved product also indicates the localization of the mutation. The procedures are based on heteroduplex formation of a PCR product. Mismatched bases are recognized and cleaved, either by enzymes such as endonucleases or chemicals and detected as two bands or peaks in gel or capillary electrophoresis. The company Transgenomic utilizes this technique for mutation detection by their Surveyor assay. Instead of cleaving the heteroduplex, mismatch binding proteins such as MutS, can bind to the mismatch position in the heteroduplex, leading to a retarded migration in the gel compared to homoduplexes.

The limiting factors for these scanning methods are that all are highly condition- dependent, requiring extensive optimization and that the throughput is rather low. Still, for instance, DGGE is quite frequently used in clinical laboratories. Once the method has been optimized for a certain gene of interest it is rather cheap and convenient and the sensitivity and specificity is rather high; often 100 % of the mutations are detected (Tchernitchko et al., 1999).

23 GENETIC SEQUENCE ANALYSIS BY MICROARRAY TECHNOLOGY The above described techniques are suited to detect unknown, but subtle, genetic variations. However, as previously mentioned, gross alterations frequently occur in the genome. Gross rearrangements can be detected at the cytogenetic level by FISH-based methods when it comes to deletions or duplications on the Mb level (Van Prooijen-Knegt et al., 1982; Lawrence et al., 1988; Fan et al., 1990; Korenberg et al., 1992). Microarray-based comparative genomic hybridization (CGH) can be used for high throughput detection of gross duplications or deletions in the order of 200-500 kb, and is especially suitable for tumor samples (Kallioniemi et al., 1992; Pinkel et al., 1998; Ishkanian et al., 2004; Pinkel and Albertson, 2005). Southern blotting can detect alterations in the order of a kb (Southern, 1975). Southern blotting is, however, quite tedious and requires large amounts of DNA.

Alterations affecting gene expression can be detected at the RNA level. For absolute and relative quantification of DNA or RNA in allele imbalances and detection of gene deletions and duplications quantitative real-time PCR (qPCR) is an option and is often used in tumor diagnosis to quantify amplifications of oncogenes or deletions of tumor suppressor genes (Ding and Cantor, 2004). Several instruments are available for qPCR, among them the LightCycler system from Roche (Switzerland). The drawbacks of qPCR are the limited number of samples that can be analyzed at the same time and the expensive reagents and equipment.

An alternative to qPCR is semi-quantitative analysis of PCR products by using specifically bound probes that are amplified with universal primers, which allow multiplexing. One such method is the multiplex amplifiable probe hybridization (MAPH), which relies on multiple short DNA probes (100-300 bp) that hybridize to genomic DNA immobilized on a membrane (Hollox et al., 2002). The probes are flanked with universal primer binding sites and designed with variable length, dependent on target site, to allow separation of the multiplex PCR products by electrophoresis. After hybridization of the probes to the target DNA, unbound probes are washed away and the bound probes, the amount of which is proportional to the target copy number, are recovered and amplified quantitatively. The products can then be separated by size and relatively quantified. In an alternative approach called multiplex ligation-dependent probe amplification (MLPA), the target genomic DNA

24 EMILIE HULTIN is hybridized in solution to two adjacent probes that are subsequently joined by ligase (Schouten et al., 2002). The ligated probes are then amplified in a multiplex PCR with fluorescently labeled universal primers. The peak area or band intensity of each amplified product reflects the relative copy number. Up to 40-50 target sequences can be analyzed simultaneously (Hogervorst et al., 2003).

4.2 GENOTYPING OF KNOWN VARIATIONS

There is a multitude of genotyping assays that can be used to screen known variations, from small scale approaches, where one to a handful of SNPs can be analyzed, to genome- wide genotyping systems. Small scale genotyping methods are preferably used when a few SNPs are of interest, but where a large number of samples need to be analyzed. These methods are often based upon PCR. Medium scale genotyping methods usually use microarray formats. Array-based detection systems constitute an appropriate choice when ten to a couple of hundreds of SNPs need to be analyzed in approximately a hundred to a thousand samples. In this section different small and medium scale genotyping methods will be described.

4.2.1 ALLELE-SPECIFIC OLIGONUCLEOTIDE HYBRIDIZATION There are several approaches that rely on hybridization between complementary DNA strands, which can be utilized for allele discrimination. One is allele-specific oligonucleotide hybridization (ASOH) where allele-specific oligonucleotides, differing in only a single base, hybridize to the target sequence only when they match perfectly (figure 7) (Wallace et al., 1979). The oligonucleotides are usually designed to have the allelic variation in the central part. Allele specificity is achieved by suitable stringent conditions. Instead of only detecting hybridized versus non-hybridized probes, the template:probe duplex stability can be monitored in real time as the temperature is increased, eventually leading to denaturation of the duplex. This method is called dynamic allele-specific hybridization (DASH) (Howell et al., 1999). Detection of the difference in melting behavior can be done by fluorescently labeled oligonucleotides or an intercalating dye that fluoresces only when bound to duplex DNA. The sample throughput of DASH is rather

25 GENETIC SEQUENCE ANALYSIS BY MICROARRAY TECHNOLOGY high as the technique has been used in 384-plate format. Nevertheless, the multiplexing capacity of DASH has been limited to two SNPs per reaction (Jobs et al., 2003).

Allele 1 Allele 2 ASOH T T

A G TaqMan A D AD T T

A G

A A D D T

A G

Molecular A Beacons A D T D T

A G

T OLA T

A Ligase G

T T

A G

Padlock T probes T

A G Ligase

T T

A G

SBE A G ddTTP*

T* A G

T T ASE A G dNTP*

T T * * A G FIGURE 7. Genotyping assays. Hybridization, enzymatic and combined approaches for allele discrimination. For FRET, the donor molecule is a circle denoted D and the acceptor molecule a circle denoted A. A fluorescent label attached to a dNTP is denoted with an asterisk (*).

26 EMILIE HULTIN

4.2.2 REAL-TIME DETECTION METHODS Real-time monitoring of DNA amplification can be performed by hybridization and cleavage of double-labeled fluorogenic allele-specific oligonucleotides directly to the PCR product during amplification. TaqMan (or the 5’ nuclease PCR assay) utilizes the 5’ to 3’ nucleolytic activity of Taq DNA polymerase (Holland et al., 1991; Lee et al., 1993; Livak et al., 1995). The allele-specific oligonucleotide contains a fluorescent donor label and an acceptor label at either end. When the labels are in close proximity, the fluorescence from the donor is quenched by the acceptor. After denaturation in the PCR, the oligonucleotides hybridize to the target DNA. During the following extension step the Taq DNA polymerase will reach the oligonucleotide and degrade it, where upon the labels are separated and the fluorescent signal from the donor can be detected (figure 7). Molecular beacons are another type of oligonucleotide probes that also can be used for monitoring DNA amplification (Tyagi and Kramer, 1996; Tyagi et al., 1998). Here, the probe forms a hairpin structure in solution, bringing the two labels at the ends close enough for quenching of the signal. When the probe hybridizes to a PCR product, the hairpin shape breaks and the probe ends are separated, giving rise to a signal (figure 7). In both assays, the amount of DNA present from start can be calculated by extrapolating from the monitored amplification values.

4.2.3 OLIGONUCLEOTIDE LIGATION ASSAY Allele-specific ligation between two oligonucleotides can discriminate alleles at a SNP locus. This approach was described in 1988 and was denoted oligonucleotide ligation assay (OLA) (Alves and Carr, 1988; Landegren et al., 1988). In principle, three oligonucleotides are used in the assay. Two allele-specific oligonucleotides, one for each allele variant, are designed to have their 3’-ends complementary to the SNP. A third common oligonucleotide is complementary to the sequence upstream, just adjacent to the SNP. After PCR, the probes are hybridized to the amplicon and, if there is a perfect match 3’- end at the SNP, these are joined by the ligase (figure 7). A biotin label on either the allele- specific or common oligonucleotides allows capturing of the probes on streptavidin. Fluorescent labels on the probes enable detection of the ligated products. By varying combinations of color dyes and probe lengths, multiple variations can be detected, either

27 GENETIC SEQUENCE ANALYSIS BY MICROARRAY TECHNOLOGY by electrophoresis or in a solid phase format (Grossman et al., 1994). The SNPlex™ genotyping system from Applied Biosystems in Foster City (CA, USA) uses OLA and PCR for allelic discrimination and ligation product amplification of 48 SNPs. The detection is based on capillary electrophoresis, where a universal set of dye-labeled, mobility modified probes is used for separation and detection.

To increase the specificity of the assay, padlock probes were introduced (Nilsson et al., 1994). One padlock probe replaces the two probes in OLA, by containing both the sequences on either ends, joined together with a linker (figure 7). Accordingly, cross- hybridization of the oligonucleotide probes is reduced. After ligation of the two ends a circle probe is generated, which can be amplified by rolling circle amplification (Baner et al., 1998), thus circumventing the need for PCR amplification. An alternative way is to use universal PCR primer sequences in the linker together with a capture sequence for microarray detection (Baner et al., 2003). The padlock probes and rolling circle amplification have been used for in situ single molecule detection (Larsson et al., 2004).

4.2.4 SINGLE BASE EXTENSION To distinguish between different SNP alleles, a single base extension (SBE) can be performed at the SNP position in a so called minisequencing approach (Syvanen et al., 1990; Lindroos et al., 2002). The assay extends a primer located just beside the SNP locus with the four ddNTPs fluorescently labeled with different dyes. Depending on the allele sequence, a ddNTP with a certain fluorescent dye will be incorporated in a cycled minisequencing reaction for increased sensitivity (figure 7). Microarrays can be used for detection where the extension primers are designed to have a 5’-end signature tag complementary to the printed oligonucleotides in the array (Fan et al., 2000). An array-of- arrays format has been developed for analysis of 80 samples and up to 200 SNPs in parallel (Pastinen et al., 2000). The Snapshot assay is a commercialized multiplex system from Applied Biosystems (CA, USA) based on the SBE method that enables multiplexing of up to 10 SNPs.

28 EMILIE HULTIN

4.2.5 ALLELE-SPECIFIC EXTENSION Allele-specific amplification (ASA) has been assayed through several approaches; the most widely used is the amplification refractory mutation system (Newton et al., 1989), but others include PCR amplification of specific alleles (Sommer et al., 1989), allele-specific PCR (Wu et al., 1989) and allele-specific amplification (Okayama et al., 1989). The strategy is based on oligonucleotides complementary to either wild-type or mutant allele at the 3’- end and the ability of the DNA polymerase to only extend perfectly matched 3’-ends and hence, prevents amplification of mismatch oligonucleotides. This can be utilized in SNP genotyping, by designing the oligonucleotides with their 3’-ends at the SNP locus, leading to match or mismatch 3’-ends (figure 7). One problem, though, is that the polymerase discrimination of mismatch ends is not satisfactory (Newton et al., 1989; Kwok et al., 1990; Ayyadevara et al., 2000), which sometimes leads to misincorporations and false genotyping.

In addition to all above mentioned methods, pyrosequencing can also be used for small scale SNP genotyping. Due to inability to multiplex this approach it is only suitable for a few SNPs, but hundreds of samples can easily be analyzed by utilizing the 96-plate format.

4.3 LARGE SCALE GENOTYPING SYSTEMS

In addition to the novel DNA sequencing technologies that have emerged after the completion of the HGP, there are several genotyping systems for whole genome analysis that have been developed during the HapMap project. These large scale genotyping systems can be used for genotyping thousands of SNPs or maybe a million SNPs. One challenge for whole genome genotyping is in the amplification step as PCR has a limited multiplexing capacity. Below are some systems for SNP analysis with throughputs of up to 500,000 SNPs.

4.3.1 MOLECULAR INVERSION PROBES The padlock probe assay has been further developed by ParAllele Biosciences, now part of Affymetrix (CA, USA), for increased specificity in high throughput SNP analysis by combining single base extension and ligation of molecular inversion probes (MIP) (figure 8) (Hardenbol et al., 2003). In the MIP assay, the polymorphic nucleotide is left out in the

29 GENETIC SEQUENCE ANALYSIS BY MICROARRAY TECHNOLOGY padlock probe locus-specific sequence. When hybridized to the target, a gap is created at the query base. In the next step, the gap is filled by a single base extension with only one base present in each reaction tube followed by ligation to form circularized probes. After exonucleolytic digestion of linear unreacted or cross-reacted probes, the circularized probes are released from the genomic DNA by cleavage. Universal primers are used for PCR amplification before tag array detection. A 12,000-plex SNP genotyping reaction has been performed and documented by the MIP assay (Hardenbol et al., 2005) and, according to Affymetrix, it would be possible to scale it up to 120,000 SNPs.

Allele 1 Allele 2

Probe Hybridiazation Genomic

A G

SBE with T T

A G

Ligation T

A G

Exonuclease digestion of T non-circularized probes A G

Probe release by T specific cleavage A

Amplification with universal primers

Specific cleavage

Hybridization to * tag-array and labeling

FIGURE 8. Outline of the molecular inversion probes assay. Allelic discrimination is obtained by a combination of hybridization, SBE and OLA. After amplification of the probes with universal primers, one part of the product is cleaved off and hybridized to a tag-array followed by labeling.

30 EMILIE HULTIN

4.3.2 ARRAYED ALLELE-SPECIFIC HYBRIDIZATION High-density oligonucleotide arrays can be used to accurately genotype millions of SNPs by very high throughput allele-specific hybridization (Pease et al., 1994; Chee et al., 1996). DNA samples are hybridized to the probes in the array after amplification. There are different amplification procedures. Perlegen Sciences in Mountain View (CA, USA) amplifies regions containing SNPs across the whole genome in a long-range PCR with 300,000 primer pairs, covering around 92 % of the available human genome. The generated amplicons of, on average, 9 kb are labeled and pooled before hybridization to their in situ synthesized array of 50,000 SNPs (Hinds et al., 2005). Affymetrix introduced a general PCR on fragmented DNA with a ligated universal PCR primer. As the fragments are generated by restriction enzyme cleavage and amplicons of a certain length are favored in the PCR, a fixed set of 500,000 SNPs can be analyzed by the GeneChip system (Kennedy et al., 2003; Matsuzaki et al., 2004).

4.3.3 GOLDENGATE AND INFINIUM ASSAYS Enzymatic discrimination, in addition to hybridization, has proven to increase specificity in genotyping. Illumina has developed a method for high throughput SNP genotyping, called GoldenGate (Fan et al., 2003; Shen et al., 2005). The assay is based on a combination of allele-specific extension (ASE) and ligation, directly on genomic DNA. The DNA is activated for binding to paramagnetic beads, whereupon two allele-specific oligonucleotides (ASOs) together with a locus-specific oligonucleotide (LSO) are hybridized. The oligonucleotides are designed so that they can be joined by a ligase after an allele-specific extension. All three oligonucleotides contain universal PCR primer sites and the LSOs also carries a unique address sequence that targets a particular bead type for the detection step. The ligated products are PCR amplified with three universal PCR primers, identical for all SNP loci. The two PCR primers binding to the two ASO domains are fluorescently labeled with Cy3 respectively Cy5, generating differently labeled PCR products depending on allele type. Finally, all products are hybridized to decoded random bead arrays. Each locus is linked to a particular bead type and the genotypes are scored by measuring ratio of fluorescent signals. There are 1,536 bead types and hence, 1,536 SNPs can be interrogated simultaneously (http://www.illumina.com).

31 GENETIC SEQUENCE ANALYSIS BY MICROARRAY TECHNOLOGY

Immobilized T genomic DNA

T A Primer G hybridization

T A ASE and ligation

Universal PCR A with labeled primer *

A Hybridization to bead-array *

FIGURE 9. The GoldenGate assay from Illumina. Genomic DNA is immobilized onto paramagnetic beads, whereupon allele-specific primers and locus-specific primers are hybridized to the SNP positions. ASE and ligation are performed, creating allele-specific primer complexes that are amplified with universal allele-specific labeled PCR primers. The products are then hybridized to a tag bead-array and the fluorescence from the allele-specific labels is detected.

The Infinium genotyping platform provided by Illumina is a PCR free, whole genome genotyping assay with unlimited multiplexing, unlike many alternative PCR dependent assays (Gunderson et al., 2005). Infinium employs multiple displacement amplification (Dean et al., 2002) of the genomic DNA samples, which is an isothermal whole genome amplification (WGA) with φ29 DNA polymerase and random primers. The WGA samples are subsequently fragmented randomly by a controlled enzymatic process to a size of

32 EMILIE HULTIN around 300 bp and hybridized to a bead array. The beads are covered with immobilized capture probes, containing an allele-specific sequence of 50 bases and a decoding part of 25 bases. Each bead-type corresponds to one of two alleles at a certain locus and there are around 20 beads per bead-type. After hybridization, allelic specificity is obtained in an enzymatic step by a discriminating ASE, where biotin-labeled dNTPs are incorporated. Detection is performed by a sandwich staining of the biotin-labeled nucleotides with streptavidin-conjugated fluorophores. The SNP scoring step has also been performed by SBE, where the probe sequences are locus-specific, hybridizing just adjacent to the SNP query (Steemers et al., 2006). The Infinium whole genome genotyping has multiplex systems for 100, 300 and 500K SNP loci. On their HumanHap650 bead chip over 650,000 SNPs can be genotyped for a cost of 0.001 USD per SNP (Steemers and Gunderson, 2007).

4.4 DETECTION PLATFORMS

Gel electrophoresis is a low throughput detection system that can only be multiplexed by size. An automated alternative is the capillary electrophoresis with higher throughput and possibilities to multiplex both with respect to size and fluorescence. MALDI-TOF MS (matrix-assisted laser desorption ionization-time of flight mass spectroscopy) can also be used for high throughput detection, multiplexing by mass (Karas and Hillenkamp, 1988), but it has limited read length, less than 100 bases (Wu et al., 1994; Monforte and Becker, 1997; Taranenko et al., 1998). It has been used for detection of multiplex genotyping by SBE of 30 SNPs (Kim et al., 2004) and has been commercialized by Sequenome in their MASSarray system. There are also different kinds of platforms based on fluorescence detection. The multiplexing is limited to only color, but the system is rather high throughput. It has the possibility of closed tube assays like DASH.

Solid phase arrays and bead arrays are also based on fluorescence and are very high throughput platforms that enable massively parallel analyses. The production of solid phase arrays is performed in two ways. Oligonucleotides can either be printed or synthesized directly on the surface. Many companies have activated glass slides for printing. Affymetrix uses a photolithographic manufacturing process for oligonucleotide

33 GENETIC SEQUENCE ANALYSIS BY MICROARRAY TECHNOLOGY synthesis on their chip. Illumina’s beadarray technology is based on 3 µm silica beads, each covered with hundreds of thousands of copies of a specific oligonucleotide that act as capture sequences. The beads self assemble in microwells on either fiber optic bundles (Michael et al., 1998) or planar silica slides, decoded by sequential hybridizations of dye- labeled oligonucleotides (Gunderson et al., 2004). Luminex in Austin (TX, USA) has developed a detection system called xMAP, based on beads in solution (http://www. luminexcorp.com). The beads have a color coded core and can be covered with different capture probes. In the detection step, the beads are lined up by flow cytometry and decoded by a laser. A second laser detects the fluorescent signal on the beads. The Luminex system is suitable for medium scale genotyping of up to 100-plex in a 96-plate format.

5 SUMMARY

The DNA sequencing race towards a human genome for 1000 USD has spurred many researchers and companies to strive for low cost, high accuracy, long read lengths and high throughputs. Nevertheless, the conventional Sanger DNA sequencing is still a method of choice for many laboratories as it is cheap per sample and generates long read lengths with high accuracy. Two other technologies are pyrosequencing and sequencing by hybridization, which also generate high accuracy and low cost per sample. These techniques are, however, not well suited for whole genome sequencing as they are very time consuming and expensive for high throughput sequencing. An alternative technology on the market today is the 454 sequencingTM with the GS-20 from 454 Life Sciences, offering very high throughput sequencing for a low cost per genome, generating at least 20 Mb per run. Also the clonal single molecule array from Illumina and multiplex polony sequencing by ligation can sequence an entire genome of the size of a bacterium at an affordable price. For sequencing of more complex genomes such as the human, improvements in read length are desired. Novel, single molecule sequencing techniques look promising for the future, where the nanopore sequencing approaches may eventually achieve the 1000 dollar human genome.

34 EMILIE HULTIN The sequencing technologies have improved greatly during the last couple of years, still, the cost per sample is very high, creating a need for alternative approaches for genotyping. The genotyping approaches presented herein fill different purposes and are suitable for different sample sizes and SNP numbers, from small scale genotyping of a few SNPs to whole genome genotyping. The method of choice depends on the analysis. Small scale genotyping need to be cost effective per sample. Real-time detection assays such as TaqMan and molecular beacons, as well as DASH, are well suited for analysis of hundred or thousand samples for one or a few SNPs, where the cost per sample is quite low. Large scale genotyping from a thousand SNPs to whole genome genotyping, on the other hand, has to be cost effective per SNP. For whole genome association studies, which require genotyping of tens to hundreds of thousands of SNPs in each sample, the Infinium bead- array system from Illumina or the arrayed hybridization based systems from Affymetrix or Perlegen Sciences would be the methods of choice. One drawback with those platforms is, however, that they have a fixed set of SNPs. For a candidate gene genotyping and fine mapping applications these platforms would not be useful. Instead, the more flexible systems, as the GoldenGate assay from Illumina or molecular inversion probes from ParAllele Biosciences are better suited, by which around 1,500 SNPs can be genotyped. In the medium range, where ten to hundred of SNPs need to be analyzed in a hundred or thousand samples, different in-house methods, relying on hybridization, enzymatic discrimination or a combination of both, like OLA, SBE and ASE approaches, have been developed with much higher sample throughput. A microarray based detection system can achieve high level of multiplexing of several hundreds of SNPs, but there are other detection platforms as well, such as mass spectroscopy (up to 30 SNPs) like in the MASSarray assay from Sequenome and capillary electrophoresis (up to 48 SNPs) like in the SNPlex assay from Applied Biosystems.

35 GENETIC SEQUENCE ANALYSIS BY MICROARRAY TECHNOLOGY

36 EMILIE HULTIN

PRESENT INVESTIGATIONS

The novel massively parallel sequencing and genotyping technologies provided by commercial actors such as 454 Life Sciences, Illumina and Affymetrix generate immense amounts of data in a relatively short time but at a high cost for each sample. However, if only a gene of interest or a moderate set of polymorphisms is to be screened in hundreds to thousands of samples, use of cost effective and flexible medium range technologies is advised. Therefore, the work presented in this thesis focus on development of multiplex genotyping technologies based on allele-specific extension and microarray detection. Paper I describes a method for viral and microbial genotyping by the multiplex competitive hybridization and apyrase-mediated allele-specific extension (MUCH-AMASE) assay. The method was demonstrated by genotyping human papillomavirus (HPV) in 92 samples. In paper II and III, a novel approach for increased specificity in ASE genotyping is described. The method introduces a second enzyme for a competitive allele-specific extension by protease-mediated allele-specific extension (PrASE). The papers present an increased throughput and validation of the technology. In paper IV, the PrASE genotyping has been further optimized together with a multiplex PCR for genotyping of single cells. The cells were laser microdissected from different layers of epidermis of the skin and allelic loss was investigated as an occurrence of cell differentiation.

37 GENETIC SEQUENCE ANALYSIS BY MICROARRAY TECHNOLOGY

6 HUMAN PAPILLOMAVIRUS GENOTYPING

(PAPER I)

6.1 GENOTYPING MICROBES AND VIRUSES

For correct treatment of microbial and viral infections, accurate disease diagnosis is of great importance. That includes specific typing and characterization of particular types or species. Microorganisms are usually identified morphologically, which includes cultivation, a tedious procedure, and certain organisms, especially viruses, usually lack adequate characteristics for easy and reliable identification. Development of vaccines and specific therapies requires exact identification, and a reliable molecular method is necessary for monitoring of infections and follow-up during treatment (van Doorn et al., 2001). Serological tests on antibodies may be performed for diagnostic purposes and are commonly used in clinical laboratories, but it is not available for all pathogens. An alternative approach is to analyze the nucleic acid sequences. For this purpose, robust microarray-based techniques with increased sample and genotyping throughput are preferred for detection of microbial pathogens (Reyes-Lopez et al., 2003).

6.2 HUMAN PAPILLOMAVIRUS

Human papillomaviruses are DNA based viruses that infect the skin and mucous membranes of humans and animals. It is the necessary causal agent for transformation of normal epithelial cells into cervical cancer cells (Schiffman and Castle, 2003). There are over 100 HPV types that have been characterized and classified according to similarity in the nucleic acid sequence of the L1 gene (De Villiers et al., 2004). 10 % similarity between types, 2-10 % similarity between subtypes and less than 2 % similarity between variants are the prescribed definitions. A group of around 30-40 HPV types infect anogenital sites and those are divided into high- medium- and low-risk types, based on their oncogenic potential. Interestingly, different high-risk types are associated with different lesions. For example, type 16 dominates in squamous cell carcinomas and type 18 in adenocarcinomas of the cervix (McKaig et al., 1998). In 2006, the prophylactic HPV vaccine Gardasil,

38 EMILIE HULTIN marketed by Merck (NJ, USA), was approved. The vaccine protects against initial infection with HPV types 16 and 18, which together cause 70 percent of cervical cancers. However, much is still unknown about the etiology and persistence of infection by separate types, manly due to limitations in the available methods. Cultivation of HPV in vitro is not possible and cytological and histological examinations merely diagnose the consequences of the infection on the morphology of the cells and not the actual HPV types (van Doorn et al., 2001). Therefore, genotyping of HPV would provide a more accurate diagnosis and there are several molecular methods for HPV tests that are used in the clinics, but none of these is optimal. Furthermore, the role of infections by multiple HPV types in the development of cancer is largely unknown (Hubbard, 2003).

6.3 METHODS FOR MOLECULAR DIAGNOSIS OF HPV

There are several methods that are used clinically for molecular diagnosis of HPV infection. Hybrid Capture II system (HCII) is based on hybridization of RNA probes to the HPV DNA. The DNA-RNA heteroduplexes are recognized by an antibody, which can be detected after signal amplification. A disadvantage of the system is that it is less sensitive than PCR based methods and that it only differentiates between low- and high- risk types (Smits et al., 1995; Cope et al., 1997). Another approach is the in situ hybridization (ISH), both fluorescent and non-fluorescent, that detects infections directly in histological sections. ISH are used to test for integration of the viral genome into host cells, which determines persistency of infection (Jeon and Lambert, 1995; Evans and Cooper, 2004). Drawbacks of the ISH approach are the large amount of target DNA needed for type determination and the low throughput (Hubbard, 2003). To increase sensitivity and throughput, PCR amplification is currently the best alternative. There are two approaches for PCR amplification; type-specific amplification or consensus amplification. In the former, type-specific primes are used to amplify a single HPV type, which requires a large number of PCR reactions to cover important types. The consensus strategy on the other hand, uses general or consensus PCR primers. There are different sets of consensus primers available and several methods like restriction fragment length polymorphism (RFLP), hybridization assays (Bernard et al., 1994), sequencing technologies or combinations of hybridization and enzymatic assays (Delrio-Lafreniere et al., 2004) can

39 GENETIC SEQUENCE ANALYSIS BY MICROARRAY TECHNOLOGY be used to determine the genotypes of the amplified HPV fragments. RFLP assays are very low-throughput, time-consuming and labor intensive. In recent years, DNA sequencing for routine diagnostics has improved, but it is not suitable for resolving multiple infections and minor types representing 10-25 % of the total infection may remain undetected (Kleter et al., 1999; Gharizadeh et al., 2001). In these cases, cloning of the PCR products and sequencing a large number of clones is necessary, a very cumbersome approach. Massively parallel sequencing technologies based on sequencing of clonal single molecules would easily detect all infections, but these are far too expensive for clinical use. Hybridization techniques are being developed that allow increased high throughputs, such as the Line-blot assays on a membrane strip surface (van Doorn et al., 2001) and printed microarrays (Park et al., 2004). To further increase specificity a combination of hybridization and an enzymatic step has been utilized both on HPV and other microorganisms (Lovmar et al., 2003; Delrio-Lafreniere et al., 2004).

6.4 AMASE

To increase the specificity of allele-specific extension, in 2001 Ahmadian et al. described a new improved ASE reaction (Ahmadian et al., 2001). The authors investigated the reaction kinetics of allele-specific primer extension. It was demonstrated that the reaction kinetics were slower for mismatched configurations compared to matched ones, by real-time monitoring of the light emitted from the luciferase-based enzyme cascade. Therefore, to circumvent spurious mismatch extensions of allele-specific primers by the polymerase, a second enzyme, apyrase, was introduced. The nucleotide degrading enzyme apyrase competes with the polymerase during the extension reaction so that incorporation of nucleotides only occurs in the fast reaction with perfectly matched primer:template configurations (figure 10). Slower, 3’-end mismatched extensions will be prevented by the degradation of nucleotides by the apyrase. The apyrase-mediated allele-specific extension (AMASE) assay has also been adopted for in situ SNP genotyping with fluorescent nucelotides on microarrays (O'Meara et al., 2002). Both studies showed an increased specificity in AMASE compared to conventional ASE.

40 EMILIE HULTIN

- apyrase

+ apyrase

FIGURE 10. The apyrase effect on allele-specific primer extension. Extension occurs for both match and mismatch allele-specific primers with no inclusion of apyrase (upper panel). When apyrase is included in the ASE reaction, the polymerase only extends the perfect matched allele-specific primer (lower panel).

6.5 MUCH-AMASE

To tackle the limitations of the existing methods for HPV genotyping, a combination of hybridization and ASE reaction was developed by Gharizadeh and co-workers (Gharizadeh et al., 2003). This novel approach is based on a multiplex competitive hybridization of target-specific oligonucleotides, followed by AMASE for increased specificity. The detection step is then performed in a microarray format. The problem with type-specific hybridization assays is non-specific hybridizations, giving false results. This cross-reactivity among probes is especially a problem in high throughput microarray analysis, where small amounts of discriminating probes immobilized on a glass surface are hybridized to an excess of mobile target DNA. In theory, stringent temperature conditions will only allow hybridization of perfectly matched probes, but in practice, the differences in duplex stability are extremely small between a perfect match and a mismatch at one base, leading to non-specific hybridization. In MUCH, the hybridization step is designed to be as specific as possible by immobilizing the amplified biotinylated target DNA to streptavidin-coated magnetic beads and hybridizing an excess of probes. First, the non- biotinylated strands were removed by alkali elution. Second, type-specific oligonucleotides designed for all HPV types included in the assay were added in large excess. The reaction

41 GENETIC SEQUENCE ANALYSIS BY MICROARRAY TECHNOLOGY mixture was heated to 70° C and then slowly cooled to room temperature. The best matched probes will hybridize first and occupy the targets before other probes can anneal in this competitive hybridization.

However, due to sequence similarity, unwanted hybridization may still occur, but with the next enzyme-discriminating step, AMASE, correct genotyping is ensured. In this step, hybridized oligonucleotides with matched 3’-ends were extended by incorporation of fluorescently labeled dNTPs, while extension of mismatch 3’-ends and 3’ minus 1-ends were prevented. The extended products were then alkali-eluted, neutralized and hybridized to a tag-array via 5’-signature tag fused to each type-specific probe (Hirschhorn et al., 2000), whereupon detection of fluorescence is performed. Unhybridized type-specific probes were washed away from the bead-immobilized samples, preventing them from influencing hybridization to the tag-array. All steps in MUCH-AMASE were automated by the use of a pipetting robot capable of handling magnetic beads. In paper I, increased number of detectable HPV types, improved throughput and detailed examination of samples with multiple infections has been obtained.

The type-specific oligonucleotides were designed with a custom Perl script. The L1 region sequences of 38 HPV types that are possible to amplify with the consensus primer sets (150 bp) were used as target for the oligonucleotides. Three parts with high heterogeneity between types within this region were selected for oligonucleotide design and checked for cross-reactivity and mismatch extension. All oligonucleotides were compared for cross- reactivity sites and scored according to number of mismatches, where over four mis- matches were considered discriminatory for the MUCH step. However, since it has been shown that two to four mismatches in the primer:template is not always discriminatory in an amplification step (Kwok et al., 1990), the 3’-ends of the oligonucleotides were also checked for being discriminatory in the subsequent AMASE step. Some oligonucleotides however, could lack discriminatory power for both hybridization and extension. To eliminate occurrence of false results due to this scenario, two specific oligonucleotides for each HPV type were used. After a BLAST search of all oligonucleotides, that showed specific matches with corresponding types, 46 type-specific oligonucleotides, 19 or 20 bp

42 EMILIE HULTIN long, were used as probes for 23 HPV types. Each probe was designed with a unique tag- sequence at the 5’-end, complementary to the arrayed oligonucleotides.

6.6 RESULTS OF HPV GENOTYPING BY MUCH-AMASE

92 samples with HPV infection that were amplified with consensus PCR primers for L1 region were genotyped by MUCH-AMASE for 23 HPV types. The results were confirmed with Sanger DNA sequencing and, in some cases cloning. The MUCH-AMASE assay was able to type both single (71 samples) and multiple infections (20 samples where 14 had double, four with triple and one each with quadruple and quintuple infections). Ten out of the 23 HPV types that were detectable by the assay were found. Only one of the 92 samples could not be typed by the developed method, but were positive when analyzed with gel electrophoresis. Sanger DNA sequencing of this sample revealed, after homology search, that it contained HPV type 62, a very rare type that was not included in our set of 23 detectable types. All samples except two were sequenced, and 63 of the 90 samples were typed as single infection in concordance with MUCH-AMASE. However, the easily detected multiple infections by MUCH-AMASE were considerably more difficult to determine by sequencing, where high background signals and ambiguous sequences generated poor results.

The developed Excel script for scoring the number and identity of infections had certain criteria. The input values were the fluorescent intensities from both oligonucleotides for each HPV type, calculated by Genepix software. The first criterion was that the signals had to be stronger than two times the sum of the median background and spot intensities. Second, both oligonucleotide signals for a type had to match the first criterion. Third, the least abundant type in a multiple infection needed to have a signal of at least 5 % of the total signal from all the abundant types. Several samples had infections with relative abundances lower than 5 % and to verify those and investigate the sensitivity of the assay, four multiple samples were cloned and sequenced. With the cloning procedure, 14 of the total 15 types were found with a relative abundance of as low as 2 % in the cloned multiple infections. Only one type, which was scored to be present in only 2 % in a sample

43 GENETIC SEQUENCE ANALYSIS BY MICROARRAY TECHNOLOGY with five infections, could not be verified by cloning. But, as only 30 clones were obtained for sequencing, the chance of finding this type was low.

The quantifying capacity of the assay was investigated further by type titration. PCR products from individual clones of two types were quantified and the products were mixed in different known proportions. Measurements by MUCH-AMASE of the relative amounts revealed exact quantification when comparing the oligonucleotides from one region. However, the two other oligonucleotides from the second region did not generate an exact quantification and hence, the quantification by MUCH-AMASE should only be considered approximate.

In conclusion, the MUCH-AMASE assay can easily distinguish multiple infections as verified by Sanger DNA sequencing and cloning. In fact, multiple infections are very difficult to analyze with conventional sequencing, generating high background signals or indistinguishable double or triple sequences. If only Sanger DNA sequencing was used, primary low-risk types could conceal secondary high-risk types and the samples could be diagnosed incorrectly. By MUCH-AMASE it is possible to detect and semi-quantify multiple infections with minor types of less than 5 % relative abundance in a single test. The clinical importance of a multiple infection diagnoses and the role in carcinogenesis is yet to be explored.

44 EMILIE HULTIN

7 COMPETITIVE ALLELE-SPECIFIC EXTENSION

(PAPER II & III)

7.1 PRASE

These two investigations describe further improvements of the ASE genotyping platform to enable increased multiplexing levels. In highly multiplexed SNP genotyping by ASE, a large number of allele-specific primers have to simultaneously hybridize to the target DNA. This can lead to non-specific hybridization and false results. More stringent conditions can be achieved by an increase in reaction temperature, thus preventing these cross-hybridizations and further increasing the specificity. Therefore, a thermostable enzyme was introduced to compete with the polymerase during the extension reaction. The former competitor, the nucleotide degrading enzyme apyrase was not suitable because of its thermo-labile property. Instead, a protein degrading enzyme was used to degrade the polymerase after the fast allele-specific extension of perfectly matched primers and hence, prevent slower, unspecific extensions of 3’-mismatched primers. The thermostable Proteinase K has the ability to degrade proteins at temperatures up to 72 °C and showed to be an excellent choice in the protease-mediated allele-specific extension (PrASE) reaction.

7.2 RESULTS OF THE SNP GENOTYPING BY PRASE

(PAPER II)

In paper II, the PrASE method was employed to analyze 13 SNPs in 36 individuals in a multiplex, microarray based assay. The PCR amplification was performed in a multiplex nested fashion, with outer and inner primers, yielding products of 71 bp and 46-52 bp respectively. The SNPs were initially chosen for forensic purposes where limited amounts and poor quality of DNA material are common. As old DNA and DNA from hair usually are degraded, short, amplifiable regions are preferable. The PCR was also optimized to be highly sensitive for multiplex amplification of low amounts of DNA. One of the inner primers was biotinylated at the 5’-end to facilitate immobilization of the product to

45 GENETIC SEQUENCE ANALYSIS BY MICROARRAY TECHNOLOGY streptavidin-coated magnetic beads, generating ssDNA by alkali elution of the non- biotinylated strand. Allele-specific extension primers are hybridized to the target strands, followed by extension with fluorescently labeled nucleotides in the presence of protease. The extended products are then released by alkali elution, neutralized and hybridized to a generic tag array. The assay was automated with a pipetting robot capable of handling magnetic beads from Magnetic Biosolutions in Stockholm (Sweden) with an upgraded program for preparation of 48 samples.

To investigate the efficiency of the protease, four synthetic oligonucleotide templates were designed to only contain one G-nucleotide positioned 5, 10, 15 and 20 nt respectively, downstream of the annealed allele-specific primer. Extensions were performed with Cy5- labeled dCTPs together with native dATP, dTTP and dGTP at varied amounts of Proteinase K (0-80 µg). As expected, the signal intensities declined with increased amounts of Proteinase K. For optimal genotyping 20 µg of Proteinase K is recommended, which allows incorporation of at least 10 nt. However, in a normal PrASE reaction, two dNTPs are Cy5-labeled to ensure incorporation of at least one labeled nucleotide.

The automated magnetic system facilitates immobilization of target to a solid support and consequently allows washing for increased sensitivity. The hybridization of allele-specific primers is performed with an excess of primers compared to template. Since the excess and potentially non-extended primers also contain tags, they will occupy sites on the microarray, if not removed. With hybridization followed by a washing step, the excess of primers is removed before hybridization to the tag array. Six individuals were genotyped for the 13 SNPs both with and without removal of primer excess and the average signal intensities for all SNPs were calculated. The removal of excess of primers generated at least a 15-fold increase of signal intensities, and demonstrated the advantage of this system, enabling washing for an increased sensitivity.

The effect of multiplexing was investigated in 12 different PrASE reactions, from a simplex reaction containing only one allele-specific primer pair to a multiplex of 12 primer pairs where 12 SNPs are genotyped in multiplex. The sum of fluorescent signal intensities

46 EMILIE HULTIN of both primers for each SNP was calculated at the different multiplex levels and the results showed no decrease in intensity by increased multiplex level. Consequently, the limitations of the multiplexing capacity rather lie in the PCR than in the PrASE reaction, which is very promising for further analysis with increased SNP numbers.

To evaluate the accuracy of the technique, 12 of the 13 SNPs were analyzed with pyrosequencing in 8 of the 36 individuals. All genotypes from PrASE were in concordance with the pyrosequencing results; however, one genotype generated an ambiguous sequence with pyrosequencing. The effect of protease on genotyping and accuracy was also investigated by genotyping all 36 individuals twice, both with and without inclusion of protease. The fluorescent signal intensities from both primers of each SNP were used to calculate the relative allelic fraction (AF), which is the signal from wild-type primer divided by the sum of the signal from wild-type and mutant. All AFs were plotted in SNP diagrams and well defined clusters for each genotype and SNP were generated from the PrASE results. However, when protease was omitted in the ASE reaction, the clusters were not as clearly distinguished as in PrASE and for SNP 10, the genotyping with ASE failed completely due to no separation of the clusters at all. In conclusion, by genotyping with PrASE, higher conversion rates, as well as more robust genotype scoring, can be achieved.

Due to low minor allele frequency of SNP 1, we did not find the mutant allele in any of the 36 samples for that SNP. To ensure that the system could detect both alleles, synthesized templates corresponding to wild-type and mutant were used to score the different genotypes. PrASE reactions of wild-type template and mutant template generated homozygous genotypes and the mixture of the two templates was scored as heterozygous.

7.3 RESULTS OF PRASE VERSUS PYROSEQUENCING

(PAPER III)

To further investigate the accuracy and robustness of the PrASE method, 10 SNPs suggested as prothrombotic genetic variations (Lane and Grant, 2000; Endler and

47 GENETIC SEQUENCE ANALYSIS BY MICROARRAY TECHNOLOGY Mannhalter, 2003; Humphries and Morgan, 2004) were analyzed by both PrASE and pyrosequencing. 442 DNA samples from a cohort of patients presenting symptoms of acute chest pain were amplified in a nested multiplex PCR, where the same outer PCR was used as template for both multiplex inner PCR for the PrASE genotyping and for simplex inner PCRs for pyrosequencing. The results from the PrASE genotyping were compared with the pyrosequencing results and 99.8 % were in concordance. The eight discordant genotypes of the total 4420 genotypes that were scored by each method were distributed among all SNPs, PCR-plates and assay dates. Sanger DNA sequencing was used to settle five of these ambiguities. PrASE was correct in four and pyrosequencing in one. The last three samples could not be confirmed, due to lack of DNA sample.

The robustness of the generic tag array was investigated by designing each allele-specific primer with two alternative tag sequences. All primer pair combinations were compared and gave similar clusters. However, for one SNP the clusters were shifted to the left, but still analyzable. This can be due to differences in hybridization efficiency or due to failure in the synthesis. The protease effect was also investigated, by genotyping eight samples with and without inclusion of protease. Correct genotyping was obtained for eight of the ten SNPs without protease. With protease, all genotypes were correct and the clusters were more separated for all SNPs, indicating the higher robustness of PrASE, consistent with the results in paper II. In addition, the reproducibility was investigated by genotyping 24 PCR products by PrASE on different occasions and on different dates and microarray slides. Minimal intra and inter chip variability was seen, proving the reproducibility of the assay. In fact, all 4,420 genotypes of the 10 SNPs could be combined into a single diagram and still generate well defined clusters. In conclusion, the results from this investigation indicate that PrASE gives higher conversion rates than ASE and that it has equal robustness and accuracy as pyrosequencing.

48 EMILIE HULTIN

8 SINGLE CELL ANALYSIS INDICATE GENETIC LOSS

(PAPER IV)

8.1 EPIDERMIS

Epidermis is the outermost part of the skin, built up of keratinocytes. It consists of different layers, beginning with the basal layer (stratum basale) on the border to dermis, to suprabasal (stratum spinosum), superficial (stratum granulosum) and cornified layer (stratum corneum) consisting of dead cells at the surface of the skin. The skin stem cells lie on the border between dermis and epidermis, in the basal layer, where the maturation and migration of the keratinocytes are initiated. As basal cells proliferate, keratinocytes are gradually moving towards the outermost epidermal entity, the cornified layer and finally shed after 21-48 days. The balance between renewal and differentiation is suggested to be achieved through asymmetric division by progenitor stem cells in the basal layer, rendering one stem cell and one daughter transient amplifying (TA) cell. TA cells are committed to a limited number of cell divisions followed by withdrawal from the cell cycle, and subsequent terminal differentiation.

8.2 SINGLE CELL ANALYSIS OF SKIN

Laser microdissection of single cells from a skin section makes it possible to investigate mutations and polymorphisms in cells from different layers of epidermis. Single cell analysis can facilitate a molecular map of genetic composition and provide insights to the variation of the genome sequence in both normal and tumor skin. In previous studies, single cell microdissection and DNA sequencing technology were used for analysis of the TP53 tumor suppressor gene, exons 4 to 9. Those studies demonstrated UV induced mutations in all types of samples from normal epidermis, epidermal clones of TP53- immunoreactive keratinocytes to basal cell cancer (Persson et al., 2000; Ling et al., 2001). In addition, a trend towards a random loss of TP53 alleles in terminally differentiating normal skin was observed. This may either represent a random event in differentiating skin, a selective event given the role of the TP53 gene or a technical difficulty of analyzing single

49 GENETIC SEQUENCE ANALYSIS BY MICROARRAY TECHNOLOGY cells. One technical obstacle of such single cell studies is the limitation in multiplex PCR, where normally only ten to fifteen genomic segments can be amplified in parallel. To investigate those potential losses of genetic segments during skin differentiation in normal epidermis, we designed an improved multiplex PCR covering 24 SNPs distributed over different chromosomes, optimized for single cell analysis. The genotyping was performed by PrASE and loss of genomic segments was analyzed by statistical means.

8.3 RESULTS OF ALLELIC LOSS

The 24 polymorphic positions were selected to be highly informative according to reported frequencies (NCBI MapViewer, http://www.ncbi.nlm.nih.gov/) and in the proximity to reported cancer genes. Technical artifacts are always an issue in single cell amplification (Piyamongkol et al., 2003). To avoid some of the issues we established a minimal length multiplex PCR and before analyzing single cells we tried to address the robustness of the analysis. Genomic DNA from 24 donors were used to validate the amplification as well to verify the reported SNP frequencies in the Swedish population. All 24 SNPs amplified successfully and could be genotyped by our PrASE technology. To investigate the efficiency in amplification on single DNA copies, we also analyzed a dilution series of genomic DNA and compared the experimental results with statistical calculations of random sampling. Laser microdissected single cells from basal layer and suprabasal layer of a tissue section of epidermis were then analyzed by the PrASE technology and loss of alleles was analyzed statistically.

8.3.1 RANDOM SAMPLING Two genomic DNA samples were diluted in the following steps: 1 ng, 18 pg , 6 pg and 3 pg genomic DNA. Theoretically, 6 pg of genomic DNA contains DNA from one cell, i.e. two copies of DNA. When a small portion is sampled from homogenized cell material, the number of copies of an allele will be more or less random, due to chance fluctuations in the sampling. The probable number of allele copies to be found in a sample of size m pg should be a random number, Poisson distributed with mean value m/6 in the heterozygote case and m/3 in the homozygote case, where zero corresponds to allele loss. The actually observed frequency data of alleles being present or absent, were compared

50 EMILIE HULTIN with the corresponding theoretical frequencies. For one of the two individuals a statistically significant loss was observed, although very small, whereas for the other individual excellent agreement was found. In conclusion, a minor loss not due to sampling can occur, but that technical loss is very small. However, we cannot exclude other artifacts that are introduced during the amplification step of single cells obtained from laser microdissection, both the laser catapulting and the sectioning could potentially contribute to the observed phenomena but we do not have any indications of this.

8.3.2 TISSUE SECTION STATISTICS Since the amplification procedure only had a limited affect in loss of alleles we continued with the analysis of single cells in tissue sections. Single cells as well as positive controls (10, 50 and 100 cells, in duplicates) were laser microdissected and pressure catapulted. 15 single cells from the basal layer and 20 single cells from the suprabasal layer were collected from normal epidermis. Eleven of the 24 SNPs demonstrated to be heterozygous by PrASE analysis of the positive controls. One single cell from the basal layer and two single cells from the suprabasal layer as well as one of the 10 cell samples did not generate any signals, an indication of failure in the catapulting step (3 of 41 samples). A total of 912 genotypes were obtained in the analysis facilitating analysis of both total loss, where both alleles or allele copies are lost, as well as single allele loss in the different layers in epidermis. Statistical analysis of both total loss and single loss revealed that the previous observations of random allele loss in TP53 can be confirmed at a global scale. No particular chromosomal region (SNP) could be linked to the loss, indicating a random process in differentiating skin. The loss of genetic information in skin could be triggered by DNA damages and differentiation signals and future studies need to be expanded to validate these observations, in both UV exposed and non-exposed skin for an increased number of single cells.

51 GENETIC SEQUENCE ANALYSIS BY MICROARRAY TECHNOLOGY

52 EMILIE HULTIN

FUTURE PERSPECTIVES

In this thesis, techniques for genetic sequence analysis have been described. The recent advances in sequencing and genotyping by massively parallelized reactions have increased throughputs thousandfolds only during the last decade. The limitation of these methods is however, the cost per sample, which is still too high for the majority of genetic analyses. Hence, moderate throughput genotyping assays where the cost per sample is significantly lower will be the choice in the nearest future for most applications, especially in clinical tests and academic research. Another aspect is the amount of DNA required for the analysis. In forensic investigations and single cell analysis limited amounts of DNA are available and therefore, sensitive assays are of great importance.

In the present investigation, methods for viral and SNP genotyping have been developed. In the MUCH-AMASE assay, different human papillomavirus types can be detected in clinical samples, containing both single and multiple infections. This technique can be very useful when investigating the transformation potentials and persistency of infection of different HPV types. As vaccines are being developed for some of the HPV types, the consequences of multiple infections need to be elucidated for further improved vaccines. The MUCH-AMASE technique can also be utilized for other kinds of microbes and viruses.

53 GENETIC SEQUENCE ANALYSIS BY MICROARRAY TECHNOLOGY In SNP genotyping, the PrASE technology has been developed and used in several applications. Despite the fact that novel sequencing and whole genome genotyping assays have emerged, there is still a need for genotyping technologies that are flexible and cost effective with respect to sample numbers. With the PrASE assay, single cells have been genotyped from different layers of the epidermis of the skin, which concluded that loss of alleles occur during cell differentiation. This single cell genotyping approach makes it possible to analyze tumor heterogeneity. SNPs related to different disease conditions have also been genotyped and it has been showed that the PrASE genotyping is very robust, accurate and sensitive. To further improve throughput, the multiplex capacity of the amplification procedure needs to be increased. This can be done by amplifying with the tri-nucleotide threading (TnT) approach, which has been used to accurately genotype 70 SNPs by PrASE (Pettersson et al., 2006). These results indicate that the PrASE assay can be used for even higher levels of multiplexing with a whole genome amplification strategy.

In the future, when cost has been further reduced for whole genome sequencing, maybe by sequencing through nanopores, it might be affordable to sequence whole genomes for SNP genotyping in diagnostics. At that time, different ethical issues have to be considered. As individual genetic codes are being sequenced and susceptibility for different diseases is revealed together with other personal characteristics, there is a major risk for individual persons to be discriminated by employees, insurance companies and by the society. These issues have to be dealt with in time, so that legal justice can be adapted for the new circumstances.

54 EMILIE HULTIN

ABBREVIATIONS

A adenine AMASE apyrase mediated allele-specific extension APS adenosine phosphosulphate ASA allele-specific amplification ASE allele-specific extension ASO allele-specific oligonucleotides ASOH allele-specific oligonucleotide hybridization ATP adenosine triphosphate BAC bacterial artificial chromosome bp base pairs C cytosine CCD charge-coupled device CGH comparative genomic hybridization cSBH combinatorial sequencing by hybridization Cy3 cyanine 3 Cy5 cyanine 5 DASH dynamic allele-specific hybridization dATP Deoxyadenosine triphosphate dCTP Deoxycytidine triphosphate

55 GENETIC SEQUENCE ANALYSIS BY MICROARRAY TECHNOLOGY ddNTP dideoxyribonucleotide triphosphate DGGE denaturing gradient gel electrophoresis dGTP Deoxyguanosine triphosphate DHPLC denaturing high-performance liquid chromatography DNA deoxyribonucleic acid dNMP deoxyribonucleoside monophosphates dNTP deoxyribonucleotide triphosphates dsDNA double stranded DNA dTTP Deoxythymidine triphosphate FISH fluorescent in situ hybridization FRET fluorescence resonance energy transfer G guanine Gb gigabase pairs GS-20 Genome Sequencer 20 HA heteroduplex analysis HapMap haplotype map HCII Hybrid Capture II system HGP Human Genome Project HPV human papillomavirus ISH in situ hybridization kb kilobase pairs LOH loss of heterozygosity LSO locus-specific oligonucleotide MAPH multiplex amplifiable probe hybridization Mb megabase pairs MIP molecular inversion probes MLPA multiplex ligation-dependent probe amplification mRNA messenger RNA mtDNA mitochondrial DNA MUCH multiplex competitive hybridization NHGRI National Human Genome Research Institute

56 EMILIE HULTIN nt nucleotides OLA oligonucleotide ligation assay PCR polymerase chain reaction PPi pyrophosphate PrASE protease mediated allele-specific extension qPCR quantitative real-time PCR RFLP restriction fragment length polymorphism RNA ribonucleic acid SBE single base extension SBH sequencing by hybridization SNP single nucleotide polymorphism SSCP single-strand conformation polymorphism ssDNA single stranded DNA STR short tandem repeat T thymine TA transient amplifying Taq Thermus aquaticus TnT tri-nucleotide threading USD US-dollar WGA whole genome amplification

57 GENETIC SEQUENCE ANALYSIS BY MICROARRAY TECHNOLOGY

58 EMILIE HULTIN

ACKNOWLEDGEMENTS

Jag har många fantastiska personer att tacka för att den här avhandlingen blev till, utan ert stöd och era idéer hade det varit betydligt mycket svårare för att inte säga omöjligt.

Först och främst vill jag tacka min huvudhandledare Joakim för att jag fick förtroendet att börja forska i din grupp. Du är alltid engagerad och kommer med nya spännande projekt. Vid våra fredagsmöten med ojämna mellanrum överförde du alltid lite av din entusiasm, forskning är fantastiskt!

Självklart vill jag även rikta ett stort tack till min handledare Afshin, du är grym på att komma med nya idéer. Vilken naturlig känsla du har för DNA! Du är alltid omtänksam och nyfiken. Jag kommer att sakna dig och dina kluriga, smarta lösningar vad jag än jobbar med i framtiden.

Tack också till:

Mathias för dina storslagna forskarprojekt som kartläggningen av alla våra proteiner, du visar att det går. Din kreativitet smittar av sig och skapar forskardrömmar på vår avdelning.

Professorerna Per-Åke för härliga slumpdiskussioner, Stefan för ditt strålande humör och Sophia för att du är en bra förebild.

Alla våra PIs för att ni bidrar till den här fina stämningen vi har på avdelningen, utan er skulle tiden som doktorand inte varit lika stimulerande och rik på lärande.

Mina vapendragare Max för att du introducerade mig på labbet under mitt exjobb och lärde mig alla dina små knep, du var en kanoninstruktör, samt Erik P och Pawel för

59 GENETIC SEQUENCE ANALYSIS BY MICROARRAY TECHNOLOGY trevligt PrASE:ande. Skönt att kunna riva sitt hår tillsammans när spottarna är spårlöst försvunna. Men oftast firar vi ju våra triumfer. Fantastiskt!

Annelie, utan dig hade det inte blivit några fina spottar, du är en klippa!

Kicki för all hjälp med MBS:erna och för att du tar hand om dem och håller ordning på labbet.

Uppsalagänget med Figge, Anna och Karolina, för all support, spännande huddiskussioner och alla singelceller som ni katapultat åt mig.

Rolf och Lovisa på SU för all hjälp med statistik på allelförluster och singelceller. Det var trevliga och lärorika möten om Poissonfördelningar och celler, ni tog er verkligen tid. Jag uppskattar ert engagemang!

Mattias, Christian och Savo för trevligt samarbete, ett riktigt hundgöra!

Gentechgruppen för intressanta workshops, tips, hjälp och idéer, samt för en rolig och trevlig HUGO i Helsingfors. Valtteri, vilken toppenguide!

Innebandygänget från veteraner som Peru, Olof, Erik, Björn, Henrik och Max till dagens förmågor och rookies som Tove, Marcus, Johan, Rockberg, Hanna, Per, Sara, Jochen, Holger, Pawel och Patrik, listan kan göras lång, vi är en vältränad avdelning. Ni är svaret på varför mitten av veckan är allra bäst!

Bästa rummet med Malin, Erik, Björn, Torun, John, Jojje, Per-Åke och Calle för grekfester, härliga diskussioner om lejonöra, bananflugor, meningen med livet och annat viktigt ovetande.

Alla tjejorna för skratt, teater, middagar, sol och bad, limekola, filmvisning, grill, spa med isvaksdopp, vårrus och goda drinkar, ni är coola!

Hanna, vad bra att du också började på kemi på KTH -97, du var en pärla att fråga om hjälp när föreläsningen inte var helt solklar och senare under doktorandtiden om labproblem och praktiska råd. Skönt med en långlunch på PhiPhi emellanåt!

Kolagänget, vad vore livet utan er? ;)

Mamma Else-Marie och Pappa Håkan för att ni är världsbäst på att pusha och stötta mig. Åååh vad ni är bra att ha till mycket! Kram på er!

David, du är min definition av kärlek, jag älskar dig. ♥

Jag beundrar er alla.

60 EMILIE HULTIN

REFERENCES

Adessi C., Matton G., Ayala G., Turcatti G., Mermod J.J., Mayer P. and Kawashima E. Solid phase DNA amplification: characterisation of primer attachment and amplification mechanisms. Nucleic Acids Res 28, E87 (2000)

Ahmadian A., Gharizadeh B., O'Meara D., Odeberg J. and Lundeberg J. Genotyping by apyrase-mediated allele-specific extension. Nucleic Acids Res 29, E121 (2001)

Alves A.M. and Carr F.J. Dot blot detection of point mutations with adjacently hybridising synthetic oligonucleotide probes. Nucleic Acids Res 16, 8723 (1988)

Ashkenasy N., Sanchez-Quesada J., Bayley H. and Ghadiri M.R. Recognizing a single base in an individual DNA strand: a step toward DNA sequencing in nanopores. Angew Chem Int Ed Engl 44, 1401-4 (2005)

Astier Y., Braha O. and Bayley H. Toward single molecule DNA sequencing: direct identification of ribonucleoside and deoxyribonucleoside 5'-monophosphates by using an engineered protein nanopore equipped with a molecular adapter. J Am Chem Soc 128, 1705-10 (2006)

Ayyadevara S., Thaden J.J. and Shmookler Reis R.J. Discrimination of primer 3'-nucleotide mismatch by taq DNA polymerase during polymerase chain reaction. Anal Biochem 284, 11-8 (2000)

Baner J., Isaksson A., Waldenstrom E., Jarvius J., Landegren U. and Nilsson M. Parallel gene analysis with allele-specific padlock probes and tag microarrays. Nucleic Acids Res 31, e103 (2003)

Baner J., Nilsson M., Mendel-Hartvig M. and Landegren U. Signal amplification of padlock probes by rolling circle replication. Nucleic Acids Res 26, 5073-8 (1998)

61 GENETIC SEQUENCE ANALYSIS BY MICROARRAY TECHNOLOGY Bennett S. Solexa Ltd. Pharmacogenomics 5, 433-8 (2004)

Bennett S.T., Barnes C., Cox A., Davies L. and Brown C. Toward the $1000 human genome. Pharmacogenomics 6, 373-82 (2005)

Bernard H.U., Chan S.Y., Manos M.M., Ong C.K., Villa L.L., Delius H., Peyton C.L., Bauer H.M. and Wheeler C.M. Identification and assessment of known and novel human papillomaviruses by polymerase chain reaction amplification, restriction fragment length polymorphisms, nucleotide sequence, and phylogenetic algorithms. J Infect Dis 170, 1077-85 (1994)

Braslavsky I., Hebert B., Kartalov E. and Quake S.R. Sequence information can be obtained from single DNA molecules. Proc Natl Acad Sci U S A 100, 3960-4 (2003)

Carothers A.M., Urlaub G., Mucha J., Grunberger D. and Chasin L.A. Point mutation analysis in a mammalian gene: rapid preparation of total RNA, PCR amplification of cDNA, and Taq sequencing by a novel method. Biotechniques 7, 494-6, 498-9 (1989)

Chee M., Yang R., Hubbell E., Berno A., Huang X.C., Stern D., Winkler J., Lockhart D.J., Morris M.S. and Fodor S.P.A. Accessing Genetic Information with High-Density DNA Arrays. Science 274, 610-614 (1996)

Collins F.S., Morgan M. and Patrinos A. The Human Genome Project: lessons from large- scale biology. Science 300, 286-90 (2003)

Cope J.U., Hildesheim A., Schiffman M.H., Manos M.M., Lorincz A.T., Burk R.D., Glass A.G., Greer C., Buckland J., Helgesen K.et al. Comparison of the hybrid capture tube test and PCR for detection of human papillomavirus DNA in cervical specimens. J Clin Microbiol 35, 2262-5 (1997)

Cotton R.G., Rodrigues N.R. and Campbell R.D. Reactivity of cytosine and thymine in single-base-pair mismatches with hydroxylamine and osmium tetroxide and its application to the study of mutations. Proc Natl Acad Sci U S A 85, 4397-401 (1988)

Cowie S., Drmanac S., Swanson D., Delgrosso K., Huang S., du Sart D., Drmanac R., Surrey S. and Fortina P. Identification of APC gene mutations in colorectal cancer using universal microarray-based combinatorial sequencing-by-hybridization. Hum Mutat 24, 261-71 (2004)

De Villiers E.M., Fauquet C., Broker T.R., Bernard H.U. and Zur Hausen H. Classification of papillomaviruses. Virology 324, 17-27 (2004)

Deamer D.W. and Akeson M. Nanopores and nucleic acids: prospects for ultrarapid sequencing. Trends Biotechnol 18, 147-51 (2000)

62 EMILIE HULTIN Dean F.B., Hosono S., Fang L., Wu X., Faruqi A.F., Bray-Ward P., Sun Z., Zong Q., Du Y., Du J.et al. Comprehensive human genome amplification using multiple displacement amplification. Proc Natl Acad Sci U S A 99, 5261-6 (2002)

Delrio-Lafreniere S.A., Browning M.K. and McGlennen R.C. Low-density addressable array for the detection and typing of the human papillomavirus. Diagn Microbiol Infect Dis 48, 23-31 (2004)

Ding C. and Cantor C.R. Quantitative analysis of nucleic acids--the last few years of progress. J Biochem Mol Biol 37, 1-10 (2004)

Divne A.M. and Allen M. A DNA microarray system for forensic SNP analysis. Forensic Sci Int 154, 111-21 (2005)

Dressman D., Yan H., Traverso G., Kinzler K.W. and Vogelstein B. Transforming single DNA molecules into fluorescent magnetic particles for detection and enumeration of genetic variations. Proc Natl Acad Sci U S A 100, 8817-22 (2003)

Drmanac R. and Crkvenjakov R. Method of sequencing of genomes by hybridization with oligonucleotide probes. Yugoslav patent application 570/87 (1987)

Drmanac R., Drmanac S., Chui G., Diaz R., Hou A., Jin H., Jin P., Kwon S., Lacy S., Moeur B.et al. Sequencing by hybridization (SBH): advantages, achievements, and opportunities. Adv Biochem Eng Biotechnol 77, 75-101 (2002)

Drmanac R., Labat I., Brukner I. and Crkvenjakov R. Sequencing of megabase plus DNA by hybridization: theory of the method. Genomics 4, 114-28 (1989)

Endler G. and Mannhalter C. Polymorphisms in coagulation factor genes and their impact on arterial and venous thrombosis. Clin Chim Acta 330, 31-55 (2003)

Evans M.F. and Cooper K. Human papillomavirus integration: detection by in situ hybridization and potential clinical application. J Pathol 202, 1-4 (2004)

Fan J.B., Chen X., Halushka M.K., Berno A., Huang X., Ryder T., Lipshutz R.J., Lockhart D.J. and Chakravarti A. Parallel genotyping of human SNPs using generic high- density oligonucleotide tag arrays. Genome Res 10, 853-60 (2000)

Fan J.B., Oliphant A., Shen R., Kermani B.G., Garcia F., Gunderson K.L., Hansen M., Steemers F., Butler S.L., Deloukas P.et al. Highly parallel SNP genotyping. Cold Spring Harb Symp Quant Biol 68, 69-78 (2003)

Fan Y.S., Davis L.M. and Shows T.B. Mapping small DNA sequences by fluorescence in situ hybridization directly on banded metaphase chromosomes. Proc Natl Acad Sci U S A 87, 6223-7 (1990)

Fischer S.G. and Lerman L.S. Length-independent separation of DNA restriction fragments in two-dimensional gel electrophoresis. Cell 16, 191-200 (1979)

63 GENETIC SEQUENCE ANALYSIS BY MICROARRAY TECHNOLOGY Fischer S.G. and Lerman L.S. DNA fragments differing by single base-pair substitutions are separated in denaturing gradient gels: correspondence with melting theory. Proc Natl Acad Sci U S A 80, 1579-83 (1983)

Franklin S.E. and Gosling R.G. Molecular configuration in sodium thymonucleate. Nature 171, 740-1 (1953)

Gharizadeh B., Kalantari M., Garcia C.A., Johansson B. and Nyren P. Typing of human papillomavirus by pyrosequencing. Lab Invest 81, 673-9 (2001)

Gharizadeh B., Kaller M., Nyren P., Andersson A., Uhlen M., Lundeberg J. and Ahmadian A. Viral and microbial genotyping by a combination of multiplex competitive hybridization and specific extension followed by hybridization to generic tag arrays. Nucleic Acids Res 31, e146 (2003)

Gharizadeh B., Nordstrom T., Ahmadian A., Ronaghi M. and Nyren P. Long-read pyrosequencing using pure 2'-deoxyadenosine-5'-O'-(1-thiotriphosphate) Sp- isomer. Anal Biochem 301, 82-90 (2002)

Gill P., Jeffreys A.J. and Werrett D.J. Forensic application of DNA 'fingerprints'. Nature 318, 577-9 (1985)

Goldrick M.M. RNase cleavage-based methods for mutation/SNP detection, past and present. Hum Mutat 18, 190-204 (2001)

Goldrick M.M., Kimball G.R., Liu Q., Martin L.A., Sommer S.S. and Tseng J.Y. NIRCA: a rapid robust method for screening for unknown point mutations. Biotechniques 21, 106-12 (1996)

Green R.E., Krause J., Ptak S.E., Briggs A.W., Ronan M.T., Simons J.F., Du L., Egholm M., Rothberg J.M., Paunovic M.et al. Analysis of one million base pairs of Neanderthal DNA. Nature 444, 330-6 (2006)

Grossman P.D., Bloch W., Brinson E., Chang C.C., Eggerding F.A., Fung S., Iovannisci D.M., Woo S. and Winn-Deen E.S. High-density multiplex detection of nucleic acid sequences: oligonucleotide ligation assay and sequence-coded separation. Nucleic Acids Res 22, 4527-34 (1994)

Gunderson K.L., Kruglyak S., Graige M.S., Garcia F., Kermani B.G., Zhao C., Che D., Dickinson T., Wickham E., Bierle J.et al. Decoding randomly ordered DNA arrays. Genome Res 14, 870-7 (2004)

Gunderson K.L., Steemers F.J., Lee G., Mendoza L.G. and Chee M.S. A genome-wide scalable SNP genotyping assay using microarray technology. Nat Genet 37, 549-54 (2005)

Hardenbol P., Baner J., Jain M., Nilsson M., Namsaraev E.A., Karlin-Neumann G.A., Fakhrai-Rad H., Ronaghi M., Willis T.D., Landegren U.et al. Multiplexed

64 EMILIE HULTIN genotyping with sequence-tagged molecular inversion probes. Nat Biotechnol 21, 673-8 (2003)

Hardenbol P., Yu F., Belmont J., Mackenzie J., Bruckner C., Brundage T., Boudreau A., Chow S., Eberle J., Erbilgin A.et al. Highly multiplexed molecular inversion probe genotyping: over 10,000 targeted SNPs genotyped in a single tube assay. Genome Res 15, 269-75 (2005)

Hinds D.A., Stuve L.L., Nilsen G.B., Halperin E., Eskin E., Ballinger D.G., Frazer K.A. and Cox D.R. Whole-genome patterns of common DNA variation in three human populations. Science 307, 1072-9 (2005)

Hirschhorn J.N. and Daly M.J. Genome-wide association studies for common diseases and complex traits. Nat Rev Genet 6, 95-108 (2005)

Hirschhorn J.N., Sklar P., Lindblad-Toh K., Lim Y.M., Ruiz-Gutierrez M., Bolk S., Langhorst B., Schaffner S., Winchester E. and Lander E.S. SBE-TAGS: an array- based method for efficient single-nucleotide polymorphism genotyping. Proc Natl Acad Sci U S A 97, 12164-9 (2000)

Hogervorst F.B., Nederlof P.M., Gille J.J., McElgunn C.J., Grippeling M., Pruntel R., Regnerus R., van Welsem T., van Spaendonk R., Menko F.H.et al. Large genomic deletions and duplications in the BRCA1 gene identified by a novel quantitative method. Cancer Res 63, 1449-53 (2003)

Holland P.M., Abramson R.D., Watson R. and Gelfand D.H. Detection of specific polymerase chain reaction product by utilizing the 5'----3' exonuclease activity of Thermus aquaticus DNA polymerase. Proc Natl Acad Sci U S A 88, 7276-80 (1991)

Hollox E.J., Akrami S.M. and Armour J.A. DNA copy number analysis by MAPH: molecular diagnostic applications. Expert Rev Mol Diagn 2, 370-8 (2002)

Howell W.M., Jobs M., Gyllensten U. and Brookes A.J. Dynamic allele-specific hybridization. A new method for scoring single nucleotide polymorphisms. Nat Biotechnol 17, 87-8 (1999)

Hubbard R.A. Human papillomavirus testing methods. Arch Pathol Lab Med 127, 940-5 (2003)

Huber C.G., Oefner P.J. and Bonn G.K. High-resolution liquid chromatography of oligonucleotides on nonporous alkylated styrene-divinylbenzene copolymers. Anal Biochem 212, 351-8 (1993a)

Huber C.G., Oefner P.J., Preuss E. and Bonn G.K. High-resolution liquid chromatography of DNA fragments on non-porous poly(styrene-divinylbenzene) particles. Nucleic Acids Res 21, 1061-6 (1993b)

65 GENETIC SEQUENCE ANALYSIS BY MICROARRAY TECHNOLOGY Humphries S.E. and Morgan L. Genetic risk factors for stroke and carotid atherosclerosis: insights into pathophysiology from candidate gene approaches. Lancet Neurol 3, 227-35 (2004)

Ingram V.M. Gene mutations in human haemoglobin: the chemical difference between normal and sickle cell haemoglobin. Nature 180, 326-8 (1957)

Innis M.A., Myambo K.B., Gelfand D.H. and Brow M.A. DNA sequencing with Thermus aquaticus DNA polymerase and direct sequencing of polymerase chain reaction- amplified DNA. Proc Natl Acad Sci U S A 85, 9436-40 (1988)

Ishkanian A.S., Malloff C.A., Watson S.K., DeLeeuw R.J., Chi B., Coe B.P., Snijders A., Albertson D.G., Pinkel D., Marra M.A.et al. A tiling resolution DNA microarray with complete coverage of the human genome. Nat Genet 36, 299-303 (2004)

Jeffreys A.J., Wilson V. and Thein S.L. Hypervariable 'minisatellite' regions in human DNA. Nature 314, 67-73 (1985)

Jeon S. and Lambert P.F. Integration of human papillomavirus type 16 DNA into the human genome leads to increased stability of E6 and E7 mRNAs: implications for cervical carcinogenesis. Proc Natl Acad Sci U S A 92, 1654-8 (1995)

Jobs M., Howell W.M., Stromqvist L., Mayr T. and Brookes A.J. DASH-2: flexible, low- cost, and high-throughput SNP genotyping by dynamic allele-specific hybridization on membrane arrays. Genome Res 13, 916-24 (2003)

Kallioniemi A., Kallioniemi O.P., Sudar D., Rutovitz D., Gray J.W., Waldman F. and Pinkel D. Comparative genomic hybridization for molecular cytogenetic analysis of solid tumors. Science 258, 818-21 (1992)

Karas M. and Hillenkamp F. Laser desorption ionization of proteins with molecular masses exceeding 10,000 daltons. Anal Chem 60, 2299-301 (1988)

Kasianowicz J.J., Brandin E., Branton D. and Deamer D.W. Characterization of individual polynucleotide molecules using a membrane channel. Proc Natl Acad Sci U S A 93, 13770-3 (1996)

Keller I., Bensasson D. and Nichols R.A. Transition-Transversion Bias Is Not Universal: A Counter Example from Grasshopper Pseudogenes. PLoS Genet 3, e22 (2007)

Kennedy G.C., Matsuzaki H., Dong S., Liu W.M., Huang J., Liu G., Su X., Cao M., Chen W., Zhang J.et al. Large-scale genotyping of complex DNA. Nat Biotechnol 21, 1233- 7 (2003)

Kim S., Ulz M.E., Nguyen T., Li C.M., Sato T., Tycko B. and Ju J. Thirtyfold multiplex genotyping of the p53 gene using solid phase capturable dideoxynucleotides and . Genomics 83, 924-31 (2004)

66 EMILIE HULTIN Kleter B., van Doorn L.J., Schrauwen L., Molijn A., Sastrowijoto S., ter Schegget J., Lindeman J., ter Harmsel B., Burger M. and Quint W. Development and clinical evaluation of a highly sensitive PCR-reverse hybridization line probe assay for detection and identification of anogenital human papillomavirus. J Clin Microbiol 37, 2508-17 (1999)

Knight J.C. Functional implications of genetic variation in non-coding DNA for disease susceptibility and gene regulation. Clin Sci (Lond) 104, 493-501 (2003)

Korenberg J.R., Yang-Feng T., Schreck R. and Chen X.N. Using fluorescence in situ hybridization (FISH) in genome mapping. Trends Biotechnol 10, 27-32 (1992)

Kwok S., Kellogg D.E., McKinney N., Spasic D., Goda L., Levenson C. and Sninsky J.J. Effects of primer-template mismatches on the polymerase chain reaction: human immunodeficiency virus type 1 model studies. Nucleic Acids Res 18, 999-1005 (1990)

Lai E., Bowman C., Bansal A., Hughes A., Mosteller M. and Roses A.D. Medical applications of haplotype-based SNP maps: learning to walk before we run. Nat Genet 32, 353 (2002)

Landegren U., Kaiser R., Sanders J. and Hood L. A ligase-mediated gene detection technique. Science 241, 1077-80 (1988)

Lander E.S., Linton L.M., Birren B., Nusbaum C., Zody M.C., Baldwin J., Devon K., Dewar K., Doyle M., FitzHugh W.et al. Initial sequencing and analysis of the human genome. Nature 409, 860-921 (2001)

Lane D.A. and Grant P.J. Role of hemostatic gene polymorphisms in venous and arterial thrombotic disease. Blood 95, 1517-32 (2000)

Larsson C., Koch J., Nygren A., Janssen G., Raap A.K., Landegren U. and Nilsson M. In situ genotyping individual DNA molecules by target-primed rolling-circle amplification of padlock probes. Nat Methods 1, 227-32 (2004)

Lawrence J.B., Villnave C.A. and Singer R.H. Sensitive, high-resolution chromatin and chromosome mapping in situ: presence and orientation of two closely integrated copies of EBV in a lymphoma line. Cell 52, 51-61 (1988)

Leamon J.H., Lee W.L., Tartaro K.R., Lanza J.R., Sarkis G.J., deWinter A.D., Berka J., Weiner M., Rothberg J.M. and Lohman K.L. A massively parallel PicoTiterPlate based platform for discrete picoliter-scale polymerase chain reactions. Electrophoresis 24, 3769-77 (2003)

Lee L.G., Connell C.R. and Bloch W. Allelic discrimination by nick-translation PCR with fluorogenic probes. Nucleic Acids Res 21, 3761-6 (1993)

Lee R.C. and Ambros V. An extensive class of small RNAs in Caenorhabditis elegans. Science 294, 862-4 (2001)

67 GENETIC SEQUENCE ANALYSIS BY MICROARRAY TECHNOLOGY Li W.H. and Sadler L.A. Low nucleotide diversity in man. Genetics 129, 513-23 (1991)

Lim L.P., Lau N.C., Garrett-Engele P., Grimson A., Schelter J.M., Castle J., Bartel D.P., Linsley P.S. and Johnson J.M. Microarray analysis shows that some microRNAs downregulate large numbers of target mRNAs. Nature 433, 769-73 (2005)

Lindroos K., Sigurdsson S., Johansson K., Ronnblom L. and Syvanen A.C. Multiplex SNP genotyping in pooled DNA samples by a four-colour microarray system. Nucleic Acids Res 30, e70 (2002)

Ling G., Persson A., Berne B., Uhlen M., Lundeberg J. and Ponten F. Persistent p53 mutations in single cells from normal human skin. Am J Pathol 159, 1247-53 (2001)

Lishanski A., Ostrander E.A. and Rine J. Mutation detection by mismatch binding protein, MutS, in amplified DNA: application to the cystic fibrosis gene. Proc Natl Acad Sci U S A 91, 2674-8 (1994)

Litt M. and Luty J.A. A hypervariable microsatellite revealed by in vitro amplification of a dinucleotide repeat within the cardiac muscle actin gene. Am J Hum Genet 44, 397- 401 (1989)

Livak K.J., Flood S.J., Marmaro J., Giusti W. and Deetz K. Oligonucleotides with fluorescent dyes at opposite ends provide a quenched probe system useful for detecting PCR product and nucleic acid hybridization. PCR Methods Appl 4, 357-62 (1995)

Lovmar L., Fock C., Espinoza F., Bucardo F., Syvanen A.C. and Bondeson K. Microarrays for genotyping human group a rotavirus by multiplex capture and type-specific primer extension. J Clin Microbiol 41, 5153-8 (2003)

Margulies M., Egholm M., Altman W.E., Attiya S., Bader J.S., Bemben L.A., Berka J., Braverman M.S., Chen Y.J., Chen Z.et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature 437, 376-80 (2005)

Marziali A. and Akeson M. New DNA sequencing methods. Annu Rev Biomed Eng 3, 195- 223 (2001)

Matsuzaki H., Dong S., Loi H., Di X., Liu G., Hubbell E., Law J., Berntsen T., Chadha M., Hui H.et al. Genotyping over 100,000 SNPs on a pair of oligonucleotide arrays. Nat Methods 1, 109-11 (2004)

McKaig R.G., Baric R.S. and Olshan A.F. Human papillomavirus and head and neck cancer: epidemiology and molecular biology. Head Neck 20, 250-65 (1998)

Melamede R.J. Automatable process for sequencing nucleotide. U.S. patent 4863849 (1985)

Meller A., Nivon L. and Branton D. Voltage-driven DNA translocations through a nanopore. Phys Rev Lett 86, 3435-8 (2001)

68 EMILIE HULTIN Michael K.L., Taylor L.C., Schultz S.L. and Walt D.R. Randomly ordered addressable high-density optical sensor arrays. Anal Chem 70, 1242-8 (1998)

Mitra R.D. and Church G.M. In situ localized amplification and contact replication of many individual DNA molecules. Nucleic Acids Res 27, e34 (1999)

Mitra R.D., Shendure J., Olejnik J., Edyta Krzymanska O. and Church G.M. Fluorescent in situ sequencing on polymerase colonies. Anal Biochem 320, 55-65 (2003)

Monforte J.A. and Becker C.H. High-throughput DNA analysis by time-of-flight mass spectrometry. Nat Med 3, 360-2 (1997)

Moretti T.R., Baumstark A.L., Defenbaugh D.A., Keys K.M., Smerick J.B. and Budowle B. Validation of short tandem repeats (STRs) for forensic usage: performance testing of fluorescent multiplex STR systems and analysis of authentic and simulated forensic samples. J Forensic Sci 46, 647-60 (2001)

Myers R.M., Fischer S.G., Lerman L.S. and Maniatis T. Nearly all single base substitutions in DNA fragments joined to a GC-clamp can be detected by denaturing gradient gel electrophoresis. Nucleic Acids Res 13, 3131-45 (1985)

Myers R.M., Maniatis T. and Lerman L.S. Detection and localization of single base changes by denaturing gradient gel electrophoresis. Methods Enzymol 155, 501-27 (1987)

Newton C.R., Graham A., Heptinstall L.E., Powell S.J., Summers C., Kalsheker N., Smith J.C. and Markham A.F. Analysis of any point mutation in DNA. The amplification refractory mutation system (ARMS). Nucleic Acids Res 17, 2503-16 (1989)

Nilsson M., Malmgren H., Samiotaki M., Kwiatkowski M., Chowdhary B.P. and Landegren U. Padlock probes: circularizing oligonucleotides for localized DNA detection. Science 265, 2085-8 (1994)

O'Meara D., Ahmadian A., Odeberg J. and Lundeberg J. SNP typing by apyrase-mediated allele-specific primer extension on DNA microarrays. Nucleic Acids Res 30, e75 (2002)

Okayama H., Curiel D.T., Brantly M.L., Holmes M.D. and Crystal R.G. Rapid, nonradioactive detection of mutations in the human genome by allele-specific amplification. J Lab Clin Med 114, 105-13 (1989)

Orita M., Iwahana H., Kanazawa H., Hayashi K. and Sekiya T. Detection of polymorphisms of human DNA by gel electrophoresis as single-strand conformation polymorphisms. Proc Natl Acad Sci U S A 86, 2766-2770 (1989)

Park T.C., Kim C.J., Koh Y.M., Lee K.H., Yoon J.H., Kim J.H., Namkoong S.E. and Park J.S. Human papillomavirus genotyping by the DNA chip in the cervical neoplasia. DNA Cell Biol 23, 119-25 (2004)

69 GENETIC SEQUENCE ANALYSIS BY MICROARRAY TECHNOLOGY Pastinen T. and Hudson T.J. Cis-acting regulatory variation in the human genome. Science 306, 647-50 (2004)

Pastinen T., Raitio M., Lindroos K., Tainola P., Peltonen L. and Syvanen A.C. A system for specific, high-throughput genotyping by allele-specific primer extension on microarrays. Genome Res 10, 1031-42 (2000)

Pastinen T., Sladek R., Gurd S., Sammak A., Ge B., Lepage P., Lavergne K., Villeneuve A., Gaudin T., Brandstrom H.et al. A survey of genetic and epigenetic variation affecting human gene expression. Physiol Genomics 16, 184-93 (2004)

Pauling L., Itano H.A. and et al. Sickle cell anemia a molecular disease. Science 110, 543-8 (1949)

Pease A.C., Solas D., Sullivan E.J., Cronin M.T., Holmes C.P. and Fodor S.P. Light- generated oligonucleotide arrays for rapid DNA sequence analysis. Proc Natl Acad Sci U S A 91, 5022-6 (1994)

Persson A.E., Ling G., Williams C., Backvall H., Ponten J., Ponten F. and Lundeberg J. Analysis of p53 mutations in single cells obtained from histological tissue sections. Anal Biochem 287, 25-31 (2000)

Pettersson E., Lindskog M., Lundeberg J. and Ahmadian A. Tri-nucleotide threading for parallel amplification of minute amounts of genomic DNA. Nucleic Acids Res 34, e49 (2006)

Pinkel D. and Albertson D.G. Array comparative genomic hybridization and its applications in cancer. Nat Genet 37 Suppl, S11-7 (2005)

Pinkel D., Segraves R., Sudar D., Clark S., Poole I., Kowbel D., Collins C., Kuo W.L., Chen C., Zhai Y.et al. High resolution analysis of DNA copy number variation using comparative genomic hybridization to microarrays. Nat Genet 20, 207-11 (1998)

Piyamongkol W., Bermudez M.G., Harper J.C. and Wells D. Detailed investigation of factors influencing amplification efficiency and allele drop-out in single cell PCR: implications for preimplantation genetic diagnosis. Mol Hum Reprod 9, 411-20 (2003)

Prober J.M., Trainor G.L., Dam R.J., Hobbs F.W., Robertson C.W., Zagursky R.J., Cocuzza A.J., Jensen M.A. and Baumeister K. A system for rapid DNA sequencing with fluorescent chain-terminating dideoxynucleotides. Science 238, 336-41 (1987)

Reyes-Lopez M.A., Mendez-Tenorio A., Maldonado-Rodriguez R., Doktycz M.J., Fleming J.T. and Beattie K.L. Fingerprinting of prokaryotic 16S rRNA genes using oligodeoxyribonucleotide microarrays and virtual hybridization. Nucleic Acids Res 31, 779-89 (2003)

70 EMILIE HULTIN Risch N. and Merikangas K. The future of genetic studies of complex human diseases. Science 273, 1516-7 (1996)

Ronaghi M., Karamohamed S., Pettersson B., Uhlen M. and Nyren P. Real-time DNA sequencing using detection of pyrophosphate release. Anal Biochem 242, 84-9 (1996)

Ronaghi M., Uhlen M. and Nyren P. A sequencing method based on real-time pyrophosphate. Science 281, 363, 365 (1998)

Sachidanandam R., Weissman D., Schmidt S.C., Kakol J.M., Stein L.D., Marth G., Sherry S., Mullikin J.C., Mortimore B.J., Willey D.L.et al. A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature 409, 928-33 (2001)

Sanger F., Nicklen S. and Coulson A.R. DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci U S A 74, 5463-7 (1977)

Schiffman M.H. and Castle P. Epidemiologic studies of a necessary causal risk factor: human papillomavirus infection and cervical neoplasia. J Natl Cancer Inst 95, E2 (2003)

Schirinzi A., Drmanac S., Dallapiccola B., Huang S., Scott K., De Luca A., Swanson D., Drmanac R., Surrey S. and Fortina P. Combinatorial sequencing-by-hybridization: analysis of the NF1 gene. Genet Test 10, 8-17 (2006)

Schouten J.P., McElgunn C.J., Waaijer R., Zwijnenburg D., Diepvens F. and Pals G. Relative quantification of 40 nucleic acid sequences by multiplex ligation- dependent probe amplification. Nucleic Acids Res 30, e57 (2002)

Service R.F. Gene sequencing. The race for the $1000 genome. Science 311, 1544-6 (2006)

Shen R., Fan J.B., Campbell D., Chang W., Chen J., Doucet D., Yeakley J., Bibikova M., Wickham Garcia E., McBride C.et al. High-throughput SNP genotyping on universal bead arrays. Mutat Res 573, 70-82 (2005)

Shendure J., Porreca G.J., Reppas N.B., Lin X., McCutcheon J.P., Rosenbaum A.M., Wang M.D., Zhang K., Mitra R.D. and Church G.M. Accurate multiplex polony sequencing of an evolved bacterial genome. Science 309, 1728-32 (2005)

Shizuya H., Birren B., Kim U.J., Mancino V., Slepak T., Tachiiri Y. and Simon M. Cloning and stable maintenance of 300-kilobase-pair fragments of human DNA in Escherichia coli using an F-factor-based vector. Proc Natl Acad Sci U S A 89, 8794- 7 (1992)

Smith L.M., Sanders J.Z., Kaiser R.J., Hughes P., Dodd C., Connell C.R., Heiner C., Kent S.B. and Hood L.E. Fluorescence detection in automated DNA sequence analysis. Nature 321, 674-9 (1986)

71 GENETIC SEQUENCE ANALYSIS BY MICROARRAY TECHNOLOGY Smits H.L., Bollen L.J., Tjong A.H.S.P., Vonk J., Van Der Velden J., Ten Kate F.J., Kaan J.A., Mol B.W. and Ter Schegget J. Intermethod variation in detection of human papillomavirus DNA in cervical smears. J Clin Microbiol 33, 2631-6 (1995)

Solaro P.C., Birkenkamp K., Pfeiffer P. and Kemper B. Endonuclease VII of phage T4 triggers mismatch correction in vitro. J Mol Biol 230, 868-77 (1993)

Sommer S.S., Cassady J.D., Sobell J.L. and Bottema C.D. A novel method for detecting point mutations or polymorphisms and its application to population screening for carriers of phenylketonuria. Mayo Clin Proc 64, 1361-72 (1989)

Southern E.M. Detection of specific sequences among DNA fragments separated by gel electrophoresis. J Mol Biol 98, 503-17 (1975)

Steemers F.J., Chang W., Lee G., Barker D.L., Shen R. and Gunderson K.L. Whole- genome genotyping with the single-base extension assay. Nat Methods 3, 31-3 (2006)

Steemers F.J. and Gunderson K.L. Whole genome genotyping technologies on the BeadArray platform. Biotechnol J 2, 41-9 (2007)

Strezoska Z., Paunesku T., Radosavljevic D., Labat I., Drmanac R. and Crkvenjakov R. DNA sequencing by hybridization: 100 bases read by a non-gel-based method. Proc Natl Acad Sci U S A 88, 10089-93 (1991)

Syvanen A.C., Aalto-Setala K., Harju L., Kontula K. and Soderlund H. A primer-guided nucleotide incorporation assay in the genotyping of apolipoprotein E. Genomics 8, 684-92 (1990)

Taranenko N.I., Allman S.L., Golovlev V.V., Taranenko N.V., Isola N.R. and Chen C.H. Sequencing DNA using mass spectrometry for ladder detection. Nucleic Acids Res 26, 2488-90 (1998)

Tchernitchko D., Lamoril J., Puy H., Robreau A.M., Bogard C., Rosipal R., Gouya L., Deybach J.C. and Nordmann Y. Evaluation of mutation screening by heteroduplex analysis in acute intermittent porphyria: comparison with denaturing gradient gel electrophoresis. Clin Chim Acta 279, 133-43 (1999)

Terwilliger J.D., Haghighi F., Hiekkalinna T.S. and Goring H.H. A bias-ed assessment of the use of SNPs in human complex traits. Curr Opin Genet Dev 12, 726-34 (2002)

The Chimpanzee Sequencing and Analysis Consortium. Initial sequence of the chimpanzee genome and comparison with the human genome. Nature 437, 69-87 (2005)

The International HapMap Consortium. The International HapMap Project. Nature 426, 789-96 (2003)

Thomas R.K., Nickerson E., Simons J.F., Janne P.A., Tengs T., Yuza Y., Garraway L.A., LaFramboise T., Lee J.C., Shah K.et al. Sensitive mutation detection in

72 EMILIE HULTIN heterogeneous cancer specimens by massively parallel picoliter reactor sequencing. Nat Med 12, 852-5 (2006)

Tibanyenda N., De Bruin S.H., Haasnoot C.A., van der Marel G.A., van Boom J.H. and Hilbers C.W. The effect of single base-pair mismatches on the duplex stability of d(T-A-T-T-A-A-T-A-T-C-A-A-G-T-T-G) . d(C-A-A-C-T-T-G-A-T-A-T-T-A-A-T- A). Eur J Biochem 139, 19-27 (1984)

Tyagi S., Bratu D.P. and Kramer F.R. Multicolor molecular beacons for allele discrimination. Nat Biotechnol 16, 49-53 (1998)

Tyagi S. and Kramer F.R. Molecular beacons: probes that fluoresce upon hybridization. Nat Biotechnol 14, 303-8 (1996)

Wallace R.B., Shaffer J., Murphy R.F., Bonner J., Hirose T. and Itakura K. Hybridization of synthetic oligodeoxyribonucleotides to phi chi 174 DNA: the effect of single base pair mismatch. Nucleic Acids Res 6, 3543-57 (1979) van Doorn L.J., Kleter B. and Quint W.G. Molecular detection and genotyping of human papillomavirus. Expert Rev Mol Diagn 1, 394-402 (2001)

Van Prooijen-Knegt A.C., Van Hoek J.F., Bauman J.G., Van Duijn P., Wool I.G. and Van der Ploeg M. In situ hybridization of DNA sequences in human metaphase chromosomes visualized by an indirect fluorescent immunocytochemical procedure. Exp Cell Res 141, 397-407 (1982)

Wang W.Y., Barratt B.J., Clayton D.G. and Todd J.A. Genome-wide association studies: theoretical and practical concerns. Nat Rev Genet 6, 109-18 (2005)

Watson J.D. and Crick F.H. Molecular structure of nucleic acids; a structure for deoxyribose nucleic acid. Nature 171, 737-8 (1953)

Venter J.C., Adams M.D., Myers E.W., Li P.W., Mural R.J., Sutton G.G., Smith H.O., Yandell M., Evans C.A., Holt R.A.et al. The sequence of the human genome. Science 291, 1304-51 (2001)

Venter J.C., Adams M.D., Sutton G.G., Kerlavage A.R., Smith H.O. and Hunkapiller M. Shotgun sequencing of the human genome. Science 280, 1540-2 (1998)

Werner J.H., Cai H., Jett J.H., Reha-Krantz L., Keller R.A. and Goodwin P.M. Progress towards single-molecule DNA sequencing: a one color demonstration. J Biotechnol 102, 1-14 (2003)

White M.B., Carvalho M., Derse D., O'Brien S.J. and Dean M. Detecting single base substitutions as heteroduplex polymorphisms. Genomics 12, 301-6 (1992)

Wu D.Y., Ugozzoli L., Pal B.K. and Wallace R.B. Allele-specific enzymatic amplification of beta-globin genomic DNA for diagnosis of sickle cell anemia. Proc Natl Acad Sci U S A 86, 2757-60 (1989)

73 GENETIC SEQUENCE ANALYSIS BY MICROARRAY TECHNOLOGY Wu K.J., Shaler T.A. and Becker C.H. Time-of-flight mass spectrometry of underivatized single-stranded DNA oligomers by matrix-assisted laser desorption. Anal Chem 66, 1637-45 (1994)

74 EMILIE HULTIN

PUBLICATIONS (APPENDICES I-IV)

75