UNIVERSITY OF HAWAI'I LIBRARY

INSIGHTS INTO GENOME ORGANIZATION BASED ON BAC END SEQUENCE ANALYSIS

A THESIS SUBMITTED TO THE GRADUATE DIVISION OF THE UNIVERSITY OF HAWAI'I IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

MASTER OF SCIENCE

IN

MOLECULAR BIOSCIENCES AND BIOENGINEERING

AUGUST 2006

BY CHUN WAN JEFFREY LAI

Thesis Committee:

Gernot Presting, Chairperson
Paul Moore
Qingyi Yu
Richard Manshardt

We certify that we have read this thesis and that, in our opinion, it is satisfactory in scope and quality as a thesis for the degree of Master of Science in Molecular Biosciences and Bioengineering.

ACKNOWLEDGEMENTS

I am obliged to Dr. Maqsudul Alam and Dr. Shaobin Hou from the Center for Genomics, Proteomics and Bioinformatics Research Initiative (CGPBRI) at the University of Hawaii, who generously provided 11,013 papaya BAC end sequence chromatogram files for this analysis. Dr. Ray Ming, who is a co-principal investigator of this project, and Dr. Qingyi Yu from the Hawaii Agriculture Research Center (HARC) directed the sequencing of most BAC ends analyzed here. Peizhu Guan from HARC and Kanako L. T. Lewis from CGPBRI contributed significantly by generating papaya BAC end sequences. I would like to specially thank my thesis committee members, who contributed substantially: Dr. Gernot Presting (Chairperson), Dr. Paul Moore, Dr. Qingyi Yu and Dr. Richard Manshardt. Dr. Gernot Presting has supervised the entire project, recruited me into his lab during my second semester, and trained me in bioinformatics research and critical thinking. Dr. Paul Moore provided invaluable advice and comments for this analysis. I am honored that Dr. Richard Manshardt, who has many years of experience in papaya genetic research, is on the thesis committee. I am indebted to Dr. Qingyi Yu, who has directed the generation of most BAC end sequences used in this project. I am indeed thankful that Dr. Dulal Borthakur offered me a chance to pursue a Master of Science degree in the Department of Molecular Biosciences and Bioengineering. I wish to thank all faculty and staff who have enlightened and assisted me during my time at the University of Hawai'i. Anupma, Aren, Beth, Kevin, Moriah, and Thomas offered me friendship and precious insights in the lab. Finally, my mother Hay Hing, my sisters Kit Chi and Kit Hing, my wife Chui Ying and my son Jit Ching have supported my education in every possible way.

ABSTRACT

Papaya is a major tree-fruit grown mainly in tropical and subtropical regions. A BAC library was end sequenced and the resulting 50,661 BAC-end sequences (BESs) were analyzed bioinformatically. A total of 7,456 SSRs were identified among 5,452 BESs. Sixteen percent of BESs contain plant repeat homologies. BESs lacking plant repeats revealed 6,769 (19.1%) Arabidopsis cDNA homologies. BESs without plant repeat and Arabidopsis cDNA homology contained 1,124 (3.2%) RefSeq and 644 (1.8%) non-redundant protein sequence homologies.

Low-copy papaya BES pairs (9,038) were compared to Arabidopsis, poplar, and rice genome sequences. A total of 53 BES pairs were mapped to Arabidopsis, 167 to poplar and 11 to rice. The low rate of co-mapping of papaya BES pairs to Arabidopsis confirms the recent genome rearrangement in Arabidopsis. Poplar exhibited the highest level of co-linearity with papaya and can be a reference genome for papaya genomic studies.

TABLE OF CONTENTS

ACKNOWLEDGEMENTS

ABSTRACT

LIST OF TABLES

LIST OF FIGURES

LIST OF ABBREVIATIONS

INTRODUCTION

METHODS

RESULTS

DISCUSSIONS

CONCLUSIONS

APPENDIX A: Papaya BAC end sequence library website (PBESL)

APPENDIX B: Alignments of three peach BAC clones to poplar genome

APPENDIX C: Detailed coordinates of co-mapped papaya BES pairs in heterologous genome comparisons

APPENDIX D: Lists of compiled plant repeat databases for repeat content analysis of papaya BESs

LITERATURE CITED

LIST OF TABLES

1 Summary of simple sequence repeats percentage of 35,472 high quality BESs

2 BESs homologous to protein sequences from other but not found in Arabidopsis

3 Summary of the annotation of top ten most abundant repeats and the number of matches in uncharacterized BESs

4 Statistics of number of BES pairs in different stages of genome mapping

5 Structure of table pBES

6 Structure of table SSR

7 Structure of table cDNA, REPx, REFSEQ, NR

8 Structure of table SEQLEN

9 Structure of table GMAP

10 Detailed coordinates of co-mapped papaya BES pairs in heterologous genome comparisons

11 Lists of compiled plant repeat databases for repeat content analysis of papaya BESs

LIST OF FIGURES

1 Papaya BAC-end sequencing procedure

2 Comparative genome mapping procedure

3 Taxonomy tree of Caricaceae and other species that were used in comparative mapping of BES pairs

4 Selection of sequence length cutoff in high quality papaya BESs

5 Characterization of all simple sequence repeats (SSRs) detected in 35,472 high quality BESs

6 Summary of plant repeat element content in 35,472 high quality BESs

7 Comparative genome mapping of BES pairs in heterologous plant genomes

8 Project website index page

9 Microsatellite search engine

10 BAC-end sequence search engine

11 Comparisons of three peach BAC clones to the poplar genome

LIST OF ABBREVIATIONS

BAC - Bacterial artificial chromosome

BES - BAC end sequence

Mbp - Megabase pairs

SSR - Simple sequence repeat

bp - Base pairs

k - Thousand

kb - Kilobase

nt - Nucleotide

INTRODUCTION

Production

Carica papaya L., commonly known as papaya, is grown primarily in tropical and subtropical regions of the world. The major papaya-producing regions are Brazil, Mexico, Nigeria, India, Indonesia, Thailand, Belize, Guatemala, the Dominican Republic, Jamaica, Puerto Rico, the Philippines, and Hawaii, the major papaya-producing state in the United States (Perez and Pollack, 2006). In 2005, the total harvested area in Hawaii was 1,480 acres, and the total utilized production of papaya in Hawaii was estimated at 32.9 million pounds, valued at $10.97 million (Agricultural Statistics Board, 2006; Perez and Pollack, 2006). Papaya fruit production is a major source of income for Hawaii's agriculture industry. Most papayas are grown in the Puna district of the Island of Hawaii, and 91 percent of the harvested acres in the State of Hawaii are located on the Island of Hawaii (Perez and Pollack, 2006). The genetically modified strain 'Rainbow' accounts for over half of the total acreage of current Hawaii papaya production.

Varieties & Strains

In Hawaii, the major papaya cultivar is 'Solo' and its derivatives (Morton J, 1987). 'Solo' has a high sucrose content and is usually eaten without cooking to retain its original sweetness. The 'Solo' papaya fruit varieties are usually pear-shaped and relatively small in size (~500 g) when compared to other commercial papayas. Representatives of 'Solo' derivatives include 'Kapoho', 'Sunrise', 'Sunset', 'Waimanalo' and 'SunUp'. 'Kapoho' was the major commercial variety of 'Solo' before the papaya ringspot virus became prevalent, occupying approximately one-third of the total acreage. Due to the susceptibility of the 'Kapoho' variety to the papaya ringspot virus, the transgenic 'Rainbow' papaya has become the preferred variety for the papaya industry. 'Sunrise' and 'Sunset' papaya fruits have red-orange flesh and are rich in sucrose. The 'Waimanalo' or 'Solo' line 77 produces fruits that are excellent in firmness and quality. The 'SunUp' variety is a transgenic papaya derived from 'Sunset' that is resistant to the ringspot virus. In addition to its resistance to several pests and the ringspot virus, this variety possesses several agronomically important traits, such as richness in sugar content, desirable fruit fragrance, ideal fruit size for exporting, fruit bearing height and short generation time (Ming et al., 2001). 'SunUp' and 'Kapoho' are the parent varieties of the hybrid 'Rainbow' papayas that are widely available in the fresh fruit markets of Hawaii. Mexican papayas are usually large in size and are generally not as sweet as Hawaiian papayas.

Pest & Virus Diseases

Papayas are subject to attack by a variety of pests such as fruit fly, whitefly, mites, rats, scale insects and nematodes (Morton J, 1987). Infection by fungal pathogens is a major cause of fruit, stem and root rot in papaya plants. Genetic transformation has been developed to increase the resistance of papaya plants against fungal diseases caused by the pathogen Phytophthora palmivora (Zhu et al., 2004). Besides attack by pests and fungi, the papaya mosaic and ringspot viruses are the major threats to plant development and fruit production. The papaya ringspot virus causes major crop losses in papaya. The symptoms of virus-infected plants include ring spots and stunting of the fruit and leaves. Fruit production and the growth of plants are severely reduced. A method was developed to reduce the susceptibility of papaya to the ringspot virus by over-expressing the coat protein of the virus in the plants (Fitch et al., 1992). This discovery significantly increased the annual yield of papaya in Hawaii. Genomics could further enhance the current genetic research to increase the quality and quantity of papaya production by leading to the discovery of genes for papaya horticultural improvement.

Usage

The papaya fruit is mainly used for fresh fruit consumption. The fruits can be made into fruit-jams and juices. The young leaves, unripe fruits and stems can be cooked and eaten as vegetables. Moreover, the fruits, stems, roots, and leaves of papayas are valuable in folk medicinal uses (Ockerman et al., 1993; Osato et al., 1993; Ptnina and Sandhya, 1988). Seeds have a slightly spicy taste resembling that of pepper and can be added to salad dressing. Fragrant oils can be extracted from dry unripe papaya seeds. The latex collected from unripe fruits and stems is currently a main source of papain, a protease that is used industrially in meat tenderizers and digestion aids (Dunne and Horgan, 1992). The consumption of papaya fruit may help digestion due to the presence of papain. Papaya fruits are rich in vitamin C and low in calories. The fruits are also a good source of antioxidants such as carotenoids, flavonoids, vitamin E and folic acid. Papaya fruits are also rich in potassium and dietary fiber.

Botanical Description & Taxonomy

Papaya is a dicotyledonous plant with nine pairs of chromosomes and has three sex forms (Arumuganathan and Earle, 1991; Purseglove JW, 1968; Storey WH, 1976; Ming et al., 2001). Papaya is a large herbaceous perennial plant that is single stemmed and hollow. Fruit production is continuous throughout the year. The plant can grow to maturity and produce fruits in roughly 9 months from seed germination. Papaya plants can be cultivated only in warm climates, and the growth of papaya requires optimal temperature and irrigation. The fruits are juicy, smooth-skinned and generally pear-shaped, green when unripe, and yellow when ripe. Fruit size varies among strains.

Papayas can bear male, female or hermaphrodite flowers, which are identified by their distinct flower morphology. Some plants occasionally bear more than one kind of flower at a time. Hermaphrodite flowers are generally self-pollinated and produce more marketable fruits. Female flowers are pollinated with pollen from hermaphrodite or male plants. Although male plants produce many fertile staminate flowers, they are usually female-sterile (Storey WH, 1985). Recently, sex-linked markers have been discovered in papaya and have allowed the screening of plant sex before the seeds are planted (Deputy et al., 2002). Papaya is thought to have originated in the region between Mexico and Costa Rica in Central America (Candolle A de, 1908; Purseglove JW, 1968; Storey WH, 1976). It belongs to the order Brassicales and family Caricaceae, which consists of six genera (Van Droogenbroeck et al., 2004). The genera in the family other than Carica include Vasconcellea, Jacaratia, Jarilla, Cylicomorpha, and Horovitzia (Badillo VM, 2000; Van Droogenbroeck et al., 2004). Vasconcellea is the largest genus and consists of 21 species (Badillo VM, 2000; Van Droogenbroeck et al., 2004). Brassicaceae, Caricaceae and their related families within the order Brassicales are characterized by the production of glucosinolates, which are secondary metabolites, and glycosides, which are involved in plant defense mechanisms for combating insect herbivores (Kjaer and Schuster, 1972; Kjaer A, 1976; Doughty et al., 1991; Rodman et al., 1996, 1998; Raybould and Moyes, 2001; Fahey et al., 2001; Tokuhisa et al., 2004; Kroymann et al., 2001; Kroymann and Mitchell-Olds, 2004). Caricaceae is taxonomically grouped within the order Brassicales and is a sister family of Brassicaceae, which contains the model organism Arabidopsis thaliana (Bremer et al., 1998; Rodman et al., 1996). The genome of Carica papaya was reported to be the most divergent from other species within the genus Carica (Kim et al., 2002). Some wild Carica species were reported to be more closely related to the genus Jacaratia than to C. papaya (Aradhya et al., 1999). This has resulted in separating papaya from other Carica species, which are now assigned to the genus Vasconcellea (Badillo VM, 2000). Carica is monotypic, containing only the single species Carica papaya (Van Droogenbroeck et al., 2004).

Potential model organism for tree-fruit plants

The discovery of sex chromosomes in plants led to speculation that sex chromosomes are derived from autosomes by mechanisms of recombination suppression (Filatov et al., 2000; Liu et al., 2004). In previous studies, papaya was found to have a primitive Y-chromosome that contains a non-recombinant sex determination region of 4-5 Mbp or 10% of the chromosome (Liu et al., 2004; Ma et al., 2004). This Y-chromosome has significant agronomic importance in the papaya industry since hermaphrodite (Mhm) fruits are most desirable. Prunus persica (peach) is a model organism for Rosaceae, with a genome size almost twice as large as Arabidopsis and a generation time of two to four years to bear fruits. When compared to peach, papaya has more variability among progeny due to the high number of seeds produced in each fruit (800-1200). Moreover, the continuous flowering and fruiting throughout the year, considerable number of commercial varieties, and polygamous reproduction behavior greatly enhance polymorphism in each generation (Ming et al., 2001; Verde et al., 2005). Since papaya has a small haploid genome size of 372 Mbp and a short generation time of around 9 months to produce ripe fruits, it has been suggested as a model organism of tree-fruit plants for genetic and genomic studies of cloning and mapping fruit-controlling genes (Ming et al., 2001).

Genomic resources & BAC library

Genomic technologies can greatly accelerate the current genetic research of papayas for agronomic improvements. There are several established genetic and genomic resources for papaya research. Molecular genetic markers were used to map the sex determination regions (Ma et al., 2004; Deputy et al., 2002). Genetic maps of the papaya Y chromosome were generated (Ma et al., 2004), and AFLP genetic analysis detected the diversity among different papaya strains (Kim et al., 2002). A highly effective transformation system established to produce virus resistance (Fitch et al., 1992) can assist in functional assays. Ming et al. (2001) constructed a papaya BAC library from two different ligation reactions of the papaya cultivar 'SunUp', using the vector pBeloBAC11 (Shizuya et al., 1992) and the HindIII restriction enzyme. The BAC library contains 39,168 clones of papaya genomic sequences with an average insert size of 132 kb, approximating 13.7 genome equivalents. The first portion of the BAC library, representing 18,700 clones, has an average insert size of 86 kb, whereas the 20,468 clones from the second ligation have an average insert size of 174 kb (Ming et al., 2001).

Arabidopsis, a model organism for plants, and rice, a model organism for cereal grasses, were sequenced largely using genomic BAC clone libraries (The Arabidopsis Genome Initiative 2000; International Rice Genome Sequencing Project 2005; Choi et al., 1995; Wang et al., 1995; Zhang et al., 1996). BAC clones can hold up to 300 kb of genomic DNA and are highly stable, which allows the cloned DNA to be manipulated easily and efficiently. Therefore, BAC libraries are widely used in the preliminary process of many genomic sequencing projects with large genome sizes. Moreover, BAC clones can be used to build physical maps by fingerprinting each BAC clone based on restriction analysis (Chen et al., 2002), to map genes having agronomic value (Lange and Presting, 2004), to develop genetic markers by BAC-end sequencing (Tomkins et al., 2004), to analyze comparative genomics between different species (O'Neill and Bancroft 2000; Ilic et al., 2003), to screen the genome structure and localize target genes on chromosomes by in situ hybridization (Cheng et al., 2001; Jiang et al., 1995), and to support whole-genome sequencing (Goff et al., 2002).

BAC end sequencing and analysis

The analysis of BAC end sequences can greatly accelerate genetic and genomic studies of an organism. BAC end sequences can greatly enhance the efficiency of whole genome sequencing. Since BAC clones with an average insert size of 150 kb are large enough to span most repetitive sequences, such as tandem repeats and transposable elements, the end sequences can assist in generating physical maps by chromosome walking to fill gaps in physical mapping (Rice Chromosome 10 Sequencing Consortium, 2003). BAC clone fingerprinting and BESs together can serve as a platform for whole genome sequencing by either a clone-by-clone approach or a shotgun sequencing approach (Rice Chromosome 10 Sequencing Consortium, 2003). The analysis of end sequences can present the first view of a genome's composition (Mao et al., 2000; Zhao et al., 2001; Hong et al., 2004). BAC-end sequences also provide genetic markers that are distributed across the genome for polymorphism detection and genetic mapping of the whole genome (Tomkins et al., 2004). Homologous sequences of other plants that are found in papaya BESs can facilitate cloning of organism-specific genes for genetic studies (Yu et al., 2005). Mapped BAC-end sequence pairs can be used for comparative genomics to elucidate genome organization and co-linearity between different plants (O'Neill and Bancroft, 2000). The sequencing of the papaya sex-determination region could enhance our understanding of the sex-determination process in papaya, allowing growers to eliminate undesired sexes before the seeds are planted (Ming et al., 2001).

Theory & Hypothesis

Three main hypotheses will be tested in this thesis: 1) Simple sequence repeats (SSRs or microsatellites) are distributed equally between genic and non-genic regions of the papaya genome. Microsatellite analysis can provide useful genetic markers for detection of polymorphism and for generating genetic maps for further genomic studies. Understanding the distribution of SSRs in a genome can facilitate marker selection of types and classes of SSRs for genetic studies. 2) The gene and repetitive DNA content of papaya is similar to that of Arabidopsis thaliana. Genes and repetitive elements are major components of a genome. Comparisons of the papaya genome content with other plants such as Arabidopsis allow us to discover similarities between papaya and other plants. This study will also accelerate gene discoveries in papaya. 3) The genome structures of papaya and Arabidopsis are similar. Comparative genome mapping allows us to directly map papaya BAC clones to other completely sequenced plants. The mapped BES pairs can elucidate positions of genomic sequences in the orthologous regions of heterologous genomes. The mapped colinear regions in a reference genome can be used to locate genes of interest and assist in whole genome sequencing of an organism.

METHODS

Processing of papaya BAC end sequences

All papaya BAC ends were sequenced prior to the sequence analyses described in this thesis. The sequencing procedure is illustrated in Figure 1. PHRED software was used to perform base calling of papaya BAC-end chromatograms and trimming of the BAC-end sequences (Ewing et al., 1998; Ewing and Green, 1998). The PHRED trimming cutoff value was set to the default value of 0.05 with option "trim_alt" to carry out trimming based on the PHRED quality value of each base. PHD2FASTA software (http://www.genome.washington.edu) was used to convert PHRED output files into FASTA formatted sequence files and corresponding quality files representing the quality of each nucleotide in the sequence files. CROSS_MATCH software (http://www.genome.washington.edu) was used to mask the sequences of the pBeloBAC11 vector (Shizuya et al., 1992) that was used in cloning the papaya genomic DNA library. Terminal vector sequences were trimmed. Trimmed BESs of less than 50 bp in length were eliminated. The remaining high-quality papaya BESs were used for further analysis of microsatellites, repetitive elements, protein-coding regions and comparative genome mapping.
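The relationship between PHRED quality values and base-calling error probabilities, and the 50 bp length filter, can be illustrated with a minimal Python sketch. This is not the pipeline used in the thesis (which relied on PHRED, PHD2FASTA and CROSS_MATCH); the function names and the example records are hypothetical, and only the standard definition Q = -10 log10(p) and the 50 bp cutoff stated above are taken as given.

```python
import math

def error_probability(q):
    """Error probability implied by a PHRED quality value (Q = -10 * log10(p))."""
    return 10 ** (-q / 10)

def quality_for_error(p):
    """PHRED quality value corresponding to an error probability p."""
    return -10 * math.log10(p)

def drop_short(records, min_len=50):
    """Keep (name, sequence) records whose trimmed length is at least min_len bases."""
    return [(name, seq) for name, seq in records if len(seq) >= min_len]

# Hypothetical trimmed reads: one of 80 bases, one of 20 bases.
records = [("bes_a", "ACGT" * 20), ("bes_b", "ACGT" * 5)]
kept = drop_short(records)
```

Under this definition, the default trimming cutoff of 0.05 corresponds to a quality value of about 13, while a quality value of 20 corresponds to an error probability of 0.01.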

Figure 1. Papaya BAC-end sequencing procedure

All papaya BAC end sequences generated from the BAC library constructed by Ming et al. (2001) were sequenced at the Hawaii Agriculture Research Center (HARC) and the Center for Genomics, Proteomics and Bioinformatics Research Initiative (CGPBRI) prior to all the sequence analyses that are described in this thesis.

Computational identification of papaya microsatellites

For computational identification of papaya SSRs, bases with a PHRED quality value of less than 20 were converted to 'N's. Trimmed high quality papaya BESs were scanned for all mono-, di-, tri-, and tetranucleotide repeats of at least 12 nucleotides in length using a custom-made Perl script (microsat.pl). Overlapping patterns of di- and tetranucleotide repeats were recorded only once. Offset patterns (e.g. ATATA/TATAT) were counted once based on length and categorized based on the identity of the first nucleotide. Primers for developing genetic markers were picked automatically (primerDesign.pl) using PRIMER3 software (Rozen and Skaletsky, 2000) (http://fokker.wi.mit.edu/primer3). The global criteria for the PRIMER3 software were set to an optimal primer length of 20 nt and a GC content of 40-60%. Primer product size was set to 100-500 nt. For BESs containing multiple SSRs, the selection criteria for the most suitable PCR primers were, in order of importance, 1) the availability of computationally generated flanking primer sequences, 2) the longest SSR, and 3) the shortest amplicon.
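The scanning step described above can be sketched in Python as follows. This is an illustrative reimplementation under stated assumptions, not the microsat.pl script itself: the function names are hypothetical, only whole copies of a repeat unit are counted, repeat units that are themselves repeats of a shorter unit (for example ATAT) are skipped so that a dinucleotide repeat is not reported again as a tetranucleotide repeat, and hits are reported left to right without overlap.

```python
def mask_low_quality(seq, quals, min_q=20):
    """Replace bases whose PHRED quality is below min_q with 'N'."""
    return "".join(b if q >= min_q else "N" for b, q in zip(seq, quals))

def is_primitive(unit):
    """True if the unit is not itself a repetition of a shorter unit."""
    k = len(unit)
    return not any(k % d == 0 and unit == unit[:d] * (k // d) for d in range(1, k))

def find_ssrs(seq, min_len=12, max_unit=4):
    """Return non-overlapping (start, end, unit) hits for mono- to tetranucleotide repeats."""
    hits = []
    i, n = 0, len(seq)
    while i < n:
        hit = None
        for k in range(1, max_unit + 1):
            unit = seq[i:i + k]
            if len(unit) < k or "N" in unit or not is_primitive(unit):
                continue
            j = i + k
            while seq[j:j + k] == unit:  # extend by whole copies of the unit
                j += k
            if j - i >= min_len:
                hit = (i, j, unit)
                break
        if hit:
            hits.append(hit)
            i = hit[1]  # continue after the repeat so it is recorded only once
        else:
            i += 1
    return hits

# Hypothetical sequence: a 12-nt (AT)6 repeat and a 12-nt (AAG)4 repeat.
example = "GG" + "AT" * 6 + "C" + "AAG" * 4 + "T"
ssrs = find_ssrs(example)
```

The class of each hit (mono-, di-, tri- or tetranucleotide) follows from the length of the reported unit, and the unit starts with the first nucleotide of the repeat as it occurs in the sequence.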

Gene and repeat content of the papaya genome

Repetitive elements

After masking SSRs by converting SSR-containing loci (mono-, di-, tri-, and tetranucleotide repeats) to 'N's (microMasked.pl), high quality BESs were compared with 30,481 non-redundant plant repeat sequences in The Institute for Genomic Research (TIGR) plant repeat database (ftp://ftp.tigr.org/pub/data/TIGR_Plant_Repeats) using TBLASTX at an E value cutoff of 10^-4 (Altschul et al., 1990). A detailed listing of the compiled databases is given in Appendix D. BESs identified as having plant repeat homologies were annotated based on the sequence description of the best-matched sequences from the plant repeat database. The BESs containing repeat homologies were categorized based on the TIGR Code for Plant Repetitive Sequence table (http://www.tigr.org/tdb/e2k1/plant.repeats/repeat.code.shtml). The complete Arabidopsis plastid genome sequence (NC_000932.1) was used to screen all high quality papaya BESs to identify chloroplast-containing BESs using BLASTN with an E value cutoff of 10^-3, at least 90% identity, and at least 100 nt of alignment length.
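The chloroplast screen combines three thresholds, which can be applied to BLASTN hits as in the sketch below. It assumes the hits are available in the standard 12-column tabular BLAST format (query, subject, percent identity, alignment length, mismatches, gap openings, query start, query end, subject start, subject end, E value, bit score); whether the original analysis used this output format is not stated, and the function name and example lines are hypothetical.

```python
def chloroplast_like(tab_lines, max_e=1e-3, min_ident=90.0, min_len=100):
    """Return the query IDs with at least one hit passing all three thresholds."""
    passing = set()
    for line in tab_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        query = fields[0]
        pident, length, evalue = float(fields[2]), int(fields[3]), float(fields[10])
        if evalue <= max_e and pident >= min_ident and length >= min_len:
            passing.add(query)
    return passing

# Hypothetical hits against NC_000932.1.
example_hits = [
    "bes1\tNC_000932.1\t95.2\t250\t12\t0\t1\t250\t1000\t1249\t1e-100\t400",
    "bes2\tNC_000932.1\t85.0\t300\t45\t0\t1\t300\t5000\t5299\t1e-50\t200",   # identity too low
    "bes3\tNC_000932.1\t98.0\t60\t1\t0\t1\t60\t7000\t7059\t1e-20\t100",      # alignment too short
]
flagged = chloroplast_like(example_hits)
```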

Gene content & papaya-genes not in Arabidopsis

To identify homologous coding sequences, the Arabidopsis cDNA sequence database (ftp://ftp.tigr.org/pub/data/a_thaliana/ath1/SEQUENCES) (August 26th 2004) was downloaded from TIGR and compared with the SSR-masked high quality BESs lacking homologies in the plant repeat database. These BESs were used as queries to search the Arabidopsis cDNA database using TBLASTX with an E value cutoff of 10^-6. These BESs were annotated based on the best-matched sequences in the Arabidopsis cDNA database. BESs lacking homology in the Arabidopsis cDNA database were further compared with the plant Reference Sequence (RefSeq) protein database (ftp://ncbi.nih.gov/refseq/release/plant/) using TBLASTX with an E value cutoff of 10^-6. BESs with no homologies in the plant repeat, cDNA, or RefSeq databases were compared to the non-redundant (nr) protein database (ftp://ftp.ncbi.nih.gov/blast/db/FASTA/) using BLASTX with a cutoff value of 10^-6 and an alignment length of at least 34 amino acids.

To identify papaya genes that are absent in the Arabidopsis genome, the following procedure was used. 1) BESs with homology in the RefSeq or nr database but not in the Arabidopsis cDNA database were compared with the Arabidopsis chloroplast (NC_000932.1), mitochondria (NC_001284.2) and Arabidopsis whole genome sequences using TBLASTX at a cutoff value of [illegible]. 2) Their homologous protein sequences from the RefSeq and nr databases were also compared with the Arabidopsis whole genome sequences using TBLASTN at a cutoff value of [illegible]. If neither the BAC-end sequence nor its homologous sequence was matched in these comparisons, the gene was assumed to be absent in the Arabidopsis genome. Protein sequences that were homologous to these papaya BESs with an E value of 10^-10 or lower and an alignment length of at least 34 amino acids were identified.
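The annotation proceeds as a cascade: each BES is assigned to the first database in which it has a qualifying hit, and only unassigned BESs pass to the next comparison. A minimal Python sketch of that bookkeeping is given below; the tier labels, the function name and the example identifiers are hypothetical, and the BLAST searches themselves are assumed to have been run already.

```python
# Order of the comparisons as described in the text.
TIERS = ["plant_repeat", "arabidopsis_cdna", "refseq", "nr"]

def annotate(bes_ids, hits_by_tier):
    """Label each BES with the first tier in which it has a hit, else 'uncharacterized'."""
    labels = {}
    for bes in bes_ids:
        labels[bes] = next(
            (tier for tier in TIERS if bes in hits_by_tier.get(tier, set())),
            "uncharacterized",
        )
    return labels

# Hypothetical hit sets; bes2 matches two tiers and is assigned to the earlier one.
hits = {
    "plant_repeat": {"bes1"},
    "arabidopsis_cdna": {"bes2"},
    "refseq": {"bes2", "bes3"},
    "nr": set(),
}
labels = annotate(["bes1", "bes2", "bes3", "bes4"], hits)
```

BESs labeled "uncharacterized" in such a scheme correspond to the set that is examined for papaya-specific repeats in the next step.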

Papaya-specific repeats

The most abundant BESs lacking homology to the plant repeat, Arabidopsis cDNA, RefSeq, or nr protein databases were identified by comparing these BESs against each other using BLASTN at an E value cutoff of 10^-10 with up to 1,000 matches of BLAST output collected for each query. The BLAST results were imported into a database using importBlastResult.pl. The query with the largest number of matched BESs with at least 100 nt alignment length was identified; this query and all of its matches were removed from the database to identify the next most frequently matched BAC-end sequence. This procedure was iterated to identify the top ten queries that were most frequently matched by other BESs. These BESs were compared to the non-redundant nucleotide (nt) database by using BLASTN through a web-based interface (http://www.ncbi.nlm.nih.gov/blast/) to identify potential homologies. Each unknown query lacking homologies in the nt database and all of its matches (BESs) of at least 100 nt alignment were aligned using ClustalX to identify potential consensus sequences.
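The iterative selection of the most frequently matched queries is a greedy procedure, sketched below in Python. The sketch assumes the all-against-all BLASTN results have already been reduced to a mapping from each query to the set of BESs it matches over at least 100 nt. It also assumes that "removed from the database" means the selected query and its matches are dropped both as queries and as matches of other queries, and that ties are broken alphabetically; neither detail is specified in the text, and the function name is hypothetical.

```python
def top_matched_queries(matches, n=10):
    """Greedily pick up to n queries with the most matches, removing each pick and its matches."""
    remaining = {q: set(s) - {q} for q, s in matches.items()}  # ignore self-matches
    ranked = []
    while remaining and len(ranked) < n:
        # Largest match count first; alphabetical order breaks ties (assumption).
        best = min(remaining, key=lambda q: (-len(remaining[q]), q))
        members = remaining[best]
        if not members:
            break  # no query has any match left
        ranked.append((best, len(members)))
        removed = members | {best}
        remaining = {q: s - removed for q, s in remaining.items() if q not in removed}
    return ranked

# Hypothetical match sets.
example_matches = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "e": {"f", "g"},
    "f": {"e"},
    "h": set(),
}
ranking = top_matched_queries(example_matches)
```

Removing the matches of each selected query keeps members of the same repeat family from occupying several places in the ranking.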

Assessing co-linearity of plant genomes by mapping of BAC-end sequence pairs

Download and preparation of sequenced genomes for comparative genome mapping

Arabidopsis thaliana (ftp://ftp.tigr.org/pub/data/a_thaliana/ath1/SEQUENCES/), Populus trichocarpa (poplar, http://genome.jgi-psf.org/Poptr1/Poptr1.download.ftp.html) and Oryza sativa (rice, ftp://ftp.tigr.org/pub/data/Eukaryotic_Projects/o_sativa/annotation_dbs/pseudomolecules/version_3.0/) whole genome sequences were downloaded from TIGR and the Joint Genome Institute. The downloaded plant genome molecule sequences were fragmented into pieces of 300 kb each with 1,020 nt overlap (fragmFsta.pl).
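The fragmentation step can be sketched as a sliding window in Python. This is an illustration rather than the fragmFsta.pl script: the function name is hypothetical, coordinates are 0-based and half-open, and the final fragment is simply allowed to be shorter than 300 kb, which is an assumption.

```python
def fragment_coordinates(seq_len, size=300_000, overlap=1_020):
    """Return (start, end) windows of at most `size` bases, consecutive windows sharing `overlap` bases."""
    if seq_len <= 0:
        return []
    step = size - overlap
    windows = []
    start = 0
    while True:
        end = min(start + size, seq_len)
        windows.append((start, end))
        if end == seq_len:
            break
        start += step
    return windows

# Hypothetical 700 kb molecule.
windows = fragment_coordinates(700_000)
```

The overlap ensures that a homologous region lying across a fragment boundary is still contained in full in at least one fragment, provided the region is shorter than the overlap.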

Comparative genome mapping of papaya BES pairs

SSR-masked high quality BESs lacking homologies in the plant repeat database were selected to form forward and reverse sequence pairs. These pairs were compared separately to each of the fragmented Arabidopsis, poplar and rice genome sequences by using TBLASTX with a cutoff value of 10^-6. When the best-matched homologous sequences of both the forward and reverse ends were identified in the same linkage group, within a region of 10 to 300 kb, and arranged in the correct orientation, the pair was considered to map a potential syntenic region between the corresponding BAC sequence and the orthologous genomic sequences of the target genome. The comparative genome mapping procedure is diagrammed in Figure 2.
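The three co-mapping criteria can be expressed as a single predicate, sketched here in Python. The `Hit` record and the function name are hypothetical. The sketch also assumes that "correct orientation" means the two end sequences point toward each other (the upstream hit on the plus strand and the downstream hit on the minus strand), and that the 10 to 300 kb region is measured between the outer coordinates of the two hits; the text does not spell out either detail.

```python
from typing import NamedTuple

class Hit(NamedTuple):
    chrom: str   # linkage group or chromosome of the best match
    start: int   # 1-based start on the target genome (start <= end)
    end: int
    strand: str  # '+' or '-'

def is_comapped(fwd, rev, min_span=10_000, max_span=300_000):
    """True if both best hits lie on one linkage group, face each other, and span 10-300 kb."""
    if fwd.chrom != rev.chrom:
        return False
    left, right = sorted((fwd, rev), key=lambda h: h.start)
    if (left.strand, right.strand) != ("+", "-"):
        return False  # ends must point inward (assumption about 'correct orientation')
    span = right.end - left.start + 1
    return min_span <= span <= max_span

# Hypothetical best hits for the two ends of one BAC clone.
forward_hit = Hit("chr1", 1_000, 1_500, "+")
reverse_hit = Hit("chr1", 120_000, 120_600, "-")
comapped = is_comapped(forward_hit, reverse_hit)
```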

Figure 2. Comparative genome mapping procedure

Comparative genome mapping was done computationally by using TBLASTX to compare the forward and reverse end sequence pairs to the other completely sequenced plant genomes. BES pairs mapped in the correct orientation and within 10 to 300 kb were regarded as potentially colinear between the BAC sequences and the heterologous genomic sequences.

Comparative genome mapping of control BES pairs from other plant species

The Arabidopsis, poplar, Brassica, and tomato BAC-end sequence pairs were mapped to the Arabidopsis and poplar genomes as controls (Figure 3). A total of 120,000 TAMU (Choi et al., 1995) and IGF (Mozo et al., 1998) Arabidopsis BESs were downloaded from the TIGR website (ftp://ftp.tigr.org/pub/data/a_thaliana/bac_end_sequences/atends). In addition, the poplar genome was fragmented into artificial BACs of 120 kb in length with 10 kb gaps. Exactly 500 bp of forward and reverse end sequences were collected to generate 7,135 artificial poplar BES pairs. Another set of 38,245 Brassica rapa (Brassica rapa [ORGANISM] BAC "end sequence") and 50,000 Lycopersicon esculentum (tomato, lycopersicon [ORGANISM] BAC pBeloBAC11) BESs were downloaded from the NCBI website (http://www.ncbi.nlm.nih.gov/). All of these BESs were processed using the same procedures, by masking SSRs (microMasked.pl) and eliminating repeat-containing BESs (rmRepeat.pl), to yield 3,648 Arabidopsis, 5,685 poplar, 8,525 Brassica, and 10,394 tomato low-copy BES pairs for comparisons with the Arabidopsis and poplar genomes as controls. Another subset of 185 Arabidopsis BES pairs was processed under the same conditions, by masking SSRs and eliminating repeat-containing BESs, to yield 102 low-copy BES pairs for a comparison with the Arabidopsis genome.
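The construction of artificial poplar BES pairs can be sketched in Python as follows. The function names are hypothetical, and two details are assumptions: incomplete windows at the end of a molecule are discarded, and the reverse end is reported as the reverse complement of the last 500 bp so that it reads inward like a real BAC end.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def artificial_bes_pairs(seq, bac_len=120_000, gap=10_000, end_len=500):
    """Cut `seq` into artificial BACs separated by gaps and return (start, forward_end, reverse_end)."""
    pairs = []
    start = 0
    while start + bac_len <= len(seq):
        bac = seq[start:start + bac_len]
        forward_end = bac[:end_len]
        reverse_end = reverse_complement(bac[-end_len:])  # reads inward (assumption)
        pairs.append((start, forward_end, reverse_end))
        start += bac_len + gap
    return pairs

# Scaled-down toy example: 10-base "BACs", 5-base gaps, 3-base ends.
toy_pairs = artificial_bes_pairs("ACGT" * 10, bac_len=10, gap=5, end_len=3)
```

Because the true positions and orientations of such artificial pairs are known, they provide a positive control for how many pairs the mapping criteria recover.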

Figure 3. Taxonomy of Caricaceae and other species that were used in comparative mapping of BES pairs.

All genera within the family Caricaceae are listed here. The BAC-end sequences of Arabidopsis thaliana, Brassica rapa, Populus trichocarpa, Lycopersicon esculentum, and Oryza sativa were compared to heterologous genomes to determine the level of co-linearity between them.

18 RESULTS

Trimming and proeessing of papaya HAC end sequences

A total of 50,661 BAC end sequence chromatograms were base-called and trimmed using PHRED with default settings to produce 39,590 trimmed BESs (Ewing et al., 1998; Ewing and Green, 1998). Among these, 1,548 sequences consisted entirely of vector sequence. Another 2,570 BESs shorter than 50 bp were eliminated (Figure 4) because of potential contamination of the end sequences, leaving 35,472 high quality BESs from 20,842 BAC clones. These BESs represent 17,483,563 high quality bases, or 4.7% of the papaya genome. The trimmed sequences averaged 493 bases in length and ranged from 50 to 899 bases. Eighty-seven percent of nucleotides had PHRED quality values ≥ 20, and the GC content of these BESs was 35%. All high quality BESs of at least 50 bp in length were deposited in the GenBank GSS database (accession numbers DX458351-DX502755).
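The bookkeeping behind these counts can be checked with a few lines of Python. The sketch below only reproduces the arithmetic from the numbers reported in the text; the function and key names are hypothetical and are not part of the processing pipeline.

```python
def summarize_filtering(chromatograms, trimmed, vector_only, too_short, total_bases):
    """Recompute the BES filtering summary from the reported counts."""
    high_quality = trimmed - vector_only - too_short
    return {
        "high_quality": high_quality,
        # Share of all chromatograms that ended up as high quality BESs.
        "pct_yield": 100 * high_quality / chromatograms,
        # Shares of the trimmed BESs removed at each filtering step.
        "pct_vector_only": 100 * vector_only / trimmed,
        "pct_too_short": 100 * too_short / trimmed,
        "mean_length": total_bases / high_quality,
    }


summary = summarize_filtering(50_661, 39_590, 1_548, 2_570, 17_483_563)
for key, value in summary.items():
    print(f"{key}: {value:.1f}")
```

The vector-only and short-read percentages are taken relative to the 39,590 trimmed BESs, which is the denominator that matches the 3.9% and 6.5% quoted in the Discussion.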

Figure 4. Selection of sequence length cutoff in high quality papaya BESs.

Contaminated sequences due to incomplete purification after labeling often fall under 50 bp in length. BESs that were shorter than 50 bp were removed from the subsequent analysis.

Plot title: Distribution of sequence length in 35,472 high quality C. papaya BESs.

Computational identification of papaya microsatellites

A total of 7,456 simple sequence repeats (SSRs) of at least 12 nucleotides in length were identified in 5,452 (15.4%) BESs. Mono-, di-, tri-, and tetranucleotide repeats accounted for 22.9%, 37.8%, 17.7%, and 21.5% of these SSRs, respectively (Table 1).
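A search for SSRs of this kind can be sketched with a regular expression, as shown below. This is an illustration only and not the program used in this study: it assumes perfect (uninterrupted) repeats of 1-4 nt units spanning at least 12 nt, and the function name and return format are hypothetical.

```python
import re


def find_ssrs(seq, min_len=12, max_unit=4):
    """Return (start, unit, length) for perfect tandem repeats of 1..max_unit nt."""
    seq = seq.upper()
    hits = []
    for unit_len in range(1, max_unit + 1):
        # Smallest number of whole unit copies that spans at least min_len bases.
        min_copies = -(-min_len // unit_len)
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_copies - 1))
        for match in pattern.finditer(seq):
            unit = match.group(1)
            # Skip units that are repeats of a shorter motif (e.g. "AA" or "ATAT"),
            # so that each locus is reported once under its shortest unit.
            if any(unit == unit[:k] * (unit_len // k)
                   for k in range(1, unit_len) if unit_len % k == 0):
                continue
            hits.append((match.start(), unit, len(match.group(0))))
    return hits


example = "CCG" + "A" * 13 + "CG" + "AT" * 6 + "GC"
print(find_ssrs(example))
```

Because the unit is reported as it first appears in the sequence, poly (AT) and poly (TA) are counted as separate motifs, as they are in the text.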

Table 1. Summary of simple sequence repeats in 35,472 high quality BESs. Of all analyzed BESs, 19.1% contained Arabidopsis cDNA homologies and 16.2% contained plant repeat homologies. The percentages of SSRs within BESs that contain coding regions, repeat homologies, or no homologies (non-genic) are summarized here. Note: Multiple SSRs were identified in some of these BESs, and several SSRs can flank a cDNA or plant repeat homology region in a particular BES.

Of these SSRs, 1,174 (15.7%) span at least 20 bp and represent hypervariable markers. Dinucleotide repeats were the most abundant class of microsatellite, followed by mononucleotides. Poly (T) and poly (A) were the most abundant homopolymers, representing 882 (51.6%) and 736 (43.1%) of all homopolymers, respectively. Poly (AT) and poly (TA) were the most abundant dinucleotides, representing 1,137 (40.3%) and 844 (29.9%) of all dinucleotides, respectively. Poly (AG) and poly (GA) were the longest tandem repeats. Only a small number of poly (CG)/(GC) repeats were identified. Poly (AAT) and poly (ATT) were the most abundant trinucleotide repeats, representing 12.2% (161) and 9.0% (119) of that class of repeats. The longest trinucleotide repeat, a poly (TTA) tandem repeat, was 60 bp in length. Poly (AATT) and poly (AAAT) were the most abundant tetranucleotide repeats, representing 9.7% (156) and 8.1% (130) of that class of repeats (Figure 5). Of the 5,452 SSR-containing BESs, 16.9% (922) and 5.8% (318) exhibit homology to Arabidopsis cDNAs and plant repeats, respectively. Across the comparisons shown in Table 1, most SSRs were underrepresented in the BESs associated with cDNAs, whereas GC-rich trinucleotides were overrepresented in the proximity of genic regions. All classes of microsatellites were underrepresented in the BESs associated with plant repeat elements. Most SSRs were overrepresented in non-genic regions. The number of SSRs tended to decrease as the length of the microsatellite increased (Figure 5). AT-rich SSRs were consistently more abundant than GC-rich SSRs.

Figure 5. Characterization of all simple sequence repeats (SSRs) detected in 35,472 high quality BESs. Numbers of copies are plotted against the length for each microsatellite.

Gene and repeat content of the papaya genome

Repetitive elements

A comparison of 35,472 high quality papaya BESs to the compiled TIGR plant repeat database revealed 5,733 BESs (16.2%) with homology to plant repeat elements (Figure 6). Retrotransposons were the most abundant repeats, found in 4,773 (83.3%) of these BESs. BESs homologous to retrotransposons were further classified as long terminal repeat elements, of which gypsy-like and copia-like elements accounted for 24.9% and 24.2%, respectively; 1,539 (26.8%) of the retrotransposons could not be assigned to any specific group within the class. BLAST searches showed that these mostly belonged to other retroelement-derived sequences. The next most abundant repeats were class II transposons (426; 7.4%). Only 9 miniature inverted-repeat transposable elements (MITEs) were found. A total of 242 (4.2%) centromere-related and 19 (0.3%) telomere-related repetitive elements were identified. Moreover, rDNA homology accounted for 2.9% of the repeat-containing BESs. Unclassified transposable elements accounted for 97, or 1.7%, of all transposable elements found. Screening of all high quality BESs against the complete Arabidopsis chloroplast genome revealed 290 BESs from 230 different BAC clones with chloroplast sequence homology.

Figure 6. Summary of plant repeat element content in 35,472 high quality BESs

High quality BESs were compared with plant repeat database as described in Methods.

Classification categories were based on Code for Plant Repetitive Sequence table from

TIGR website. a) Distribution of major classes of repeats in papaya. b) Detailed categorization of BESs with homology to each class and repeat element is listed.

Gene content and papaya genes not found in Arabidopsis

A total of 6,769 papaya BESs without matches in the plant repeat database contained Arabidopsis thaliana cDNA homology (representing 19.1% of all 35,472 high quality papaya BESs). Of these, 2,659 (39.3%) BESs were classified as "unknown", "putative", "hypothetical", or "expressed protein". Homologies to RefSeq were identified in 1,124 (3.2%) BESs lacking homology in the Arabidopsis cDNA database. In addition, 644 (1.8%) BESs lacking homology in both the Arabidopsis cDNA and RefSeq databases contained non-redundant (nr) protein sequence homology. Of these 1,768 (5%) BESs with homologies in the RefSeq and nr protein databases, 170 were identified as retrotransposon-encoded proteins and were eliminated from further analysis. Another 919 BESs either had homology to the Arabidopsis mitochondrial, chloroplast, or whole genome sequence, or their homologous protein sequences matched sequences in the Arabidopsis genome. Among the remaining 679 BESs, 141 had alignments of at least 34 amino acids to the identified homologous protein sequences, and their source organisms were identified. Two BESs annotated as having Arabidopsis protein sequence homologies were clearly microsatellite sequences and were eliminated from further consideration. A total of 41 BESs were homologous to other plant protein sequences, and 98 BESs were homologous to protein sequences from non-plant organisms. Detailed information on the BESs homologous to protein sequences from other plants but not in Arabidopsis is listed in Table 2.

Table 2. BESs homologous to protein sequences from other plants but not found in Arabidopsis

This table lists papaya BESs that are homologous to protein sequences from other plants

but not in Arabidopsis. The identification of BAC-end sequences, GenBank accession

number of homologous protein sequences, sources of protein sequences, and the

sequence annotations are listed here.

Papaya-specific repeats

BESs without homologies in the TIGR plant repeat, Arabidopsis cDNA, RefSeq genomic, and non-redundant protein databases were grouped to identify the 10 most frequently matched queries (BAC-end sequences), which represent the most abundant unknown elements (Table 3). Six of these queries contained homologies to the Carica papaya sex-linked sequences cpsm54, cpsm49, and cpsm94 (cpsm: sex-linked AFLP markers) and cpbe55 (cpbe: BAC-end sequence), which are annotated as Y chromosome male-specific sequences. Four queries identified as Carica papaya sequence cpsm54 (gi|37992834) (474 bp) were matched by 305, 178, 177, and 107 BESs, for a total of 767 matched BESs, forming the largest cluster of associated end sequences. Two queries identified as Carica papaya sequences cpsm49 (gi|37992830) (587 bp) and cpbe55 (gi|37992870) (409 bp) accounted for 161 and 120 matched BESs, respectively. Four unknown queries that were identified as papaya-specific repeats matched 140, 116, 106, and 103 BESs among the uncharacterized portion of the BESs.

Table 3. Summary of the annotation of top ten most abundant repeats and the number of matches in uncharacterized BESs

Query        No. of matched BESs   Homology
30B-C12.r    305                   gi|37992834 C. papaya isolate cpsm54 Y male specific seq.
35C-A09.r    178                   gi|37992834 C. papaya isolate cpsm54 Y male specific seq.
47A-C11.r    177                   gi|37992834 C. papaya isolate cpsm54 Y male specific seq.
4B-F10.r     161                   gi|37992830 C. papaya isolate cpsm49 Y male specific seq.
53A-E11.f    140                   Novel
44C-H12.r    120                   gi|37992870 C. papaya isolate cpbe55 Y male specific seq.
94B-G03.r    116                   Novel
46C-C11.r    107                   gi|37992834 C. papaya isolate cpsm54 Y male specific seq.
28D-B03.r    106                   Novel
25C-H06.f    103                   Novel

The number of matches (unique subject IDs) of each repeat among the uncharacterized BESs is given. The sequence accession number and description are provided so that the sequence information can be retrieved from the NCBI GenBank database.

Co-linearity of plant genomes revealed by mapping of BAC-end sequence pairs

Comparative genome mapping of papaya BES pairs

High quality papaya BESs with plant-repeat homologies were eliminated, resulting in the selection of 9,038 low-copy papaya BES pairs. These BES pairs were subsequently mapped to the Arabidopsis, poplar, and rice genomes, yielding 53, 167, and 11 mapped papaya BES pairs, respectively. The largest number of mapped papaya BES pairs was identified in the poplar genome sequence. The average length of these mapped regions, marked by the locations of the homologous forward and reverse sequences in the target genome, was 74 kb, approximating the actual insert size of the papaya BAC clone library. Only 53 papaya BES pairs were mapped to the Arabidopsis genome, the most closely related organism among all currently available completely sequenced plants. The average length of mapped regions in Arabidopsis was 46.4 kb. Only 11 papaya BES pairs co-mapped with correct orientation and within 10-300 kb in the rice genome, with an average length of 47 kb. All mapping results for each step are listed in detail (Table 4). The exact coordinates of mapped papaya BES pairs in the heterologous genomes are provided in Appendix C. The numbers of BES pairs co-mapping to different plant genomes were plotted against the distance separating the BAC ends in the target genome (Figure 7). In a separate study, three completely sequenced peach BAC clones were compared to the poplar genome to identify potential syntenic regions for further understanding of the genome structure of poplar, which was used in this study. Details of the results are shown in Appendix B.
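The successive mapping criteria (both ends matched, same chromosome, correct orientation, and a separation of 10-300 kb) can be expressed as a small classifier. The Python sketch below is a hedged illustration rather than the pipeline used in this study: the `Hit` record, the function name, and in particular the definition of "correct orientation" as two hits on opposite strands that point toward each other are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Hit:
    """Best match of one BAC end in the target genome (hypothetical record)."""
    chrom: str
    start: int   # leftmost coordinate of the match
    end: int     # rightmost coordinate of the match
    strand: str  # "+" or "-"


def classify_pair(fwd: Optional[Hit], rev: Optional[Hit],
                  min_dist: int = 10_000, max_dist: int = 300_000) -> str:
    """Report the first criterion a BES pair fails, or "co_mapped" if it passes all."""
    if fwd is None or rev is None:
        return "no_match"
    if fwd.chrom != rev.chrom:
        return "different_chromosome"
    # Assumption: the two ends of one clone lie on opposite strands, facing inward.
    if fwd.strand == rev.strand:
        return "wrong_orientation"
    plus, minus = (fwd, rev) if fwd.strand == "+" else (rev, fwd)
    if plus.start > minus.start:
        return "wrong_orientation"
    span = minus.end - plus.start
    if not (min_dist <= span <= max_dist):
        return "out_of_range"
    return "co_mapped"


pair = (Hit("chr1", 1_000, 1_500, "+"), Hit("chr1", 75_000, 75_500, "-"))
print(classify_pair(*pair))
```

Counting the pairs that reach each label would give the kind of stage-by-stage totals summarized in Table 4.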

Table 4. Statistics of number of BES pairs in different stages of genome mapping

The results of mapping C. papaya, A. thaliana, P. trichocarpa, L. esculentum, and B. rapa BES pairs to whole genome sequences are listed here for each successive step, with increasingly stringent criteria. Arabidopsis BESs were mapped to the Arabidopsis genome as a control. The percentage in parentheses at each step represents the proportion of BES pairs that remained from the previous stage.

Figure 7. Comparative genome mapping of BES pairs in heterologous plant genomes

Distribution of distance between BES pairs that co-mapped in heterologous genomes with correct orientation and within 10-300 kb. a) Low-copy papaya, Arabidopsis, poplar, tomato, and Brassica BES pairs were mapped to heterologous genomes. b) Brassica BES pairs that co-mapped to the Arabidopsis genome are displayed separately for higher resolution. A subset of low-copy Arabidopsis BES pairs was mapped to the Arabidopsis genome as a control.

Comparative genome mapping of Arabidopsis BES pairs

Arabidopsis BES pairs co-mapped to the poplar genome in only 5 cases (0.8% overall), the lowest number of mapped BES pairs among all comparisons. In addition, the mapped BES pairs spanned the largest regions on average (230 kb) among all comparisons. Conversely, the 102 Arabidopsis control BES pairs co-mapped to the Arabidopsis genome at rates of roughly ninety percent or more at every mapping step. A total of 94 Arabidopsis BES pairs, accounting for 94.9% of the dataset, co-mapped to their own genome. The average length of mapped regions was 97 kb, which approximates the actual average insert size (100 kb) of the TAMU (Choi et al., 1995) and IGF (Mozo et al., 1998) Arabidopsis BAC libraries.

Comparative genome mapping of poplar BES pairs

Poplar BES pairs co-mapped to the Arabidopsis genome in 19 cases, an overall success rate of 7.2%. The average length of mapped regions was 34.7 kb, which is much lower than the actual size of the artificial poplar BAC clones (120 kb). The overall success rate (7.2%) and the ratio of co-mapped BES pairs in the Arabidopsis genome with correct orientation (35.2%) also resemble those of the papaya-Arabidopsis comparison (6.8% and 35.3%, respectively).

Comparative genome mapping of Brassica BES pairs

Brassica BES pairs co-mapped to the Arabidopsis genome at 52.7%, the highest success rate. Among all heterologous comparisons, it is the only case in which over fifty percent of the BES pairs were mapped. In addition, 45.6% (3,889) of the total Brassica BES pairs were homologous to the Arabidopsis genome. The average length of mapped regions in the Brassica-Arabidopsis comparison was 135.1 kb. Both the ratio of pairs mapping in the correct orientation and the ratio spanning a region within 10 to 300 kb were approximately 86%. Brassica BES pairs co-mapped poorly to the poplar genome, yielding only 20 mapped BES pairs (1% overall success rate) in the correct orientation and within 10-300 kb.

Comparative genome mapping of tomato BES pairs

Tomato BES pairs co-mapped to the Arabidopsis and poplar genomes with an overall success rate of approximately 8%. The ratio of co-mapped BES pairs within 10-300 kb was higher in the tomato-poplar (68.5%) than in the tomato-Arabidopsis (37.3%) comparison. Moreover, the average size of mapped regions in the tomato-poplar comparison (117.3 kb) was more than twice that in the tomato-Arabidopsis comparison (45.3 kb). Comparisons of tomato with Arabidopsis and poplar yielded 408 (3.9%) and 470 (4.5%) matched tomato BES pairs, respectively, the lowest percentages of BLAST matches among all comparisons.

DISCUSSION

Processing of papaya BAC end sequences

Approximately seventy percent of the BAC end sequencing reactions yielded high quality data. BESs shorter than 50 bp were removed, since contaminated BESs resulting from incomplete purification after labeling are often less than 50 bp long. Accordingly, 6.5% of the BESs were eliminated to minimize false-positive results in subsequent analyses. Only 3.9% of the BESs consisted entirely of vector sequence, which may represent the actual percentage of empty BAC clones in the library. Based on the restriction digest analysis reported by Ming et al. (2001), the estimated proportions of empty clones for the two halves of the BAC library are approximately 3.5 and 4.6%, which is consistent with the present BES data. Highly stringent screening with BLAST revealed a total of 1.1% chloroplast-containing BESs. Ming et al. (2001) also reported that a total of 504, or 1.4%, of BAC clones contained chloroplast DNA, identified by hybridization with the sorghum chloroplast probes ropB and trunk.

Computational identification of papaya microsatellites

Microsatellites are tandem repeats of short sequences that show high levels of polymorphism, since slipped-strand mispairing happens frequently at these sites during DNA replication, repair, or recombination (Levinson and Gutman, 1987). Since SSR-containing genetic loci display high levels of polymorphism, genetic markers can be developed from the variable lengths of tandem repeats to determine genetic diversity among papaya cultivars. Genetic markers based on SSR loci that are linked to certain phenotypes provide easy assays for specific traits of interest. Since microsatellite markers can be detected easily by PCR using flanking primers, SSR markers can be used to construct genetic and physical maps of the papaya genome. Primers for developing genetic markers in papaya were designed automatically for the analysis reported in this project.

A greater number of class I, hypervariable markers (at least 20 bp), and class II, potential variable markers (less than 20 bp), were identified in non-genic regions than in protein encoding regions of the papaya genome. Class II SSRs are thought to be less variable than class I SSRs owing to the fact that long repeats have a higher chance of

slipped-strand mispairing. Therefore, hypervariable SSR loci can be a useful resource for developing markers to construct genetic maps because these loci exhibit high levels of polymorphism.

Among the whole dataset of high quality papaya BAC-end sequences, poly A and poly T were dominant among mononucleotides. Dinucleotide tandem repeats were the most abundant tandem repeats among all mined SSRs. The abundance of the dinucleotides poly (AT) and poly (TA) is consistent with the findings in yeast and Arabidopsis (Katti et al., 2001). Temnykh et al. (2001) reported that poly (AT) and poly (TA) sites yielded poor amplification in PCR reactions when genetic markers were developed to flank these loci, possibly because of an association with non-coding regions and repetitive elements in rice. In this study, AT-rich dinucleotide repeats are overrepresented in non-genic regions, and SSRs are underrepresented in repeat-containing BESs. Trinucleotide tandem repeats were the least abundant. Only the expansion or shrinkage of trinucleotide repeats can be tolerated within protein coding regions, because these changes do not cause a frame shift during translation. However, because trinucleotide repeats may be located within protein coding regions, the resulting alteration of protein structure could lead to the suppression of trinucleotide tandem repeats. AT-rich trinucleotide tandem repeats were the most abundant trinucleotide microsatellites in papaya, which is consistent with the findings in both Arabidopsis and yeast (Katti et al., 2001). Most tetranucleotide tandem repeats were AT-rich. GC-rich regions are generally associated with regions of high gene density in the genome. Since these BESs are presumably randomly distributed across the papaya genome, most BESs are expected to span non-genic regions and contain mostly AT-rich microsatellites.

The frequency of all mono-, di-, tri-, and tetranucleotide tandem repeats decreased exponentially as the length of the tandem repeats increased. This phenomenon can be explained by the instability of long tandem repeats, which leads to an increase in the mutation rate. Mutations that shorten the tandem repeats prevent their unlimited growth (Wierdl, Dominska, and Petes, 1997; Kruglyak et al., 1998). Moreover, our mined microsatellites were underrepresented in BESs with plant repeat homologies. This is consistent with the analysis of Temnykh et al. (2001), who used a HindIII BAC library of rice for their analysis. Papaya BESs had an SSR distribution similar to that of the peach, almond, and Rosa assembled EST sequences (Jung et al., 2005). Although the abundance of different patterns of trinucleotide repeats varies among plant species, more trinucleotide repeats than dinucleotide repeats were found in the coding regions of peach, almond, and Arabidopsis ESTs. In contrast, trinucleotide tandem repeats are the least abundant in papaya. Dinucleotide repeats were the most abundant simple repeats in the untranslated regions (UTRs) of peach, almond, and Rosa ESTs. Consistent with the findings of Temnykh et al. (2001), who showed that AT-rich dinucleotide repeats and trinucleotide repeats of rice BESs were overrepresented in non-genic regions, AT-rich dinucleotide repeats and trinucleotide repeats were underrepresented in papaya BESs having cDNA homologies. To make these SSR markers and their corresponding PCR primers publicly available, a listing of all SSRs and their primer pairs is accessible at the project website (http://www.genomics.hawaii.edu/papaya/BES/ssrNP.cgi).

Gene and repeat content of the papaya genome

Repetitive elements

Repeats that consist mostly of transposons are found in all eukaryotes. These repetitive sequences can cause genome rearrangements, sequence mutations, and the duplication, insertion, and translocation of genes. Repetitive sequences are a major driving force of evolution in nature. Repeats that are specific to an organism can be used as markers for in situ hybridization. A high repeat content in any genome can increase the difficulty of sequencing the whole genome. Examining the repeat composition of papaya allows evaluation of its suitability as a candidate model organism for tree-fruit plants.

Sixteen percent of high quality papaya BESs contained homology to known plant repeat elements. Most of them were transposons (5,208; 90.8%), representing 14.7% of the high quality BESs. In Arabidopsis, at least 10% of the genome consists of transposons. Most known plant repeat elements found in Arabidopsis were also found in the papaya BESs. In Arabidopsis, most class I retrotransposons are located in centromeric regions, whereas the class II transposons are found predominantly in the arms of the chromosomes. In papaya, the physical location of the retrotransposons cannot be determined until physical maps of the BAC clones are available. Papaya BESs that contain retrotransposons, which accounted for 13.5% of all papaya BESs, were the most abundant. Like Arabidopsis, papaya has more class I retrotransposons than class II transposons. The ratio of gypsy-like to copia-like retrotransposons is almost 1:1 in papaya BESs, whereas it is approximately 2:1 in the completely sequenced rice genome (International Rice Genome Sequencing Project, 2005). This difference in the relative abundance of gypsy-like retrotransposons indicates a significant genetic difference between papaya and rice. DNA-based transposons were the second most abundant repeats in papaya. Plant centromeres often contain transposons, and 242 centromere-specific sequences were found. The BESs with centromeric repeats may identify BAC clones that originated from the centromere.

Gene content and papaya genes not found in Arabidopsis

A comparison of 35,472 high quality papaya BESs with the Arabidopsis thaliana cDNA database revealed 6,769 cDNA homologies. These homologies account for approximately 19.1% of all papaya BESs, suggesting that Arabidopsis and papaya share many genes. We can extrapolate that the protein-encoding regions of papaya total 71.1 Mbp, representing 35,526 genes, presuming that the average gene length of papaya is similar to that of Arabidopsis (~2 kb) (The Arabidopsis Genome Initiative, 2000). Papaya thus appears to have a gene content similar to that of Arabidopsis, whose genome contains 31,407 genes (NCBI, Nov 17th 2005). The protein coding regions that were identified in this study can be used to clone potential candidate genes or to locate genes within BAC clones that are in close proximity to the genes. The identified genes can also be used as genetic markers to assist in whole genome sequencing. The high rate of hybridization of Arabidopsis cDNA probes (91%) to the papaya BAC clone library is consistent with the finding of this study that the Caricaceae and Brassicaceae families are closely related (Bowers et al., 2003; Soltis et al., 1999, 2000).
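The extrapolation of gene number can be reproduced with a short calculation. The Python sketch below is illustrative only: the genome size of 372 Mbp is an assumption, back-calculated from the statement in the Results that 17,483,563 bases represent 4.7% of the papaya genome, and the function name is hypothetical.

```python
def extrapolate_gene_number(genome_bp, coding_fraction, mean_gene_bp):
    """Scale the fraction of BESs with cDNA homology up to the whole genome."""
    coding_bp = genome_bp * coding_fraction
    return coding_bp, coding_bp / mean_gene_bp


# Assumed inputs: 372 Mbp genome, 19.1% of BESs with cDNA homology, ~2 kb per gene.
coding_bp, genes = extrapolate_gene_number(372_000_000, 0.191, 2_000)
print(f"coding sequence: {coding_bp / 1e6:.1f} Mbp, estimated genes: {genes:.0f}")
```

The estimate depends directly on the assumed gene length; it also treats every BES with cDNA homology as entirely genic, which is a simplification inherent in the extrapolation.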

The BESs that are not homologous to repeats and Arabidopsis cDNAs were further compared to the plant RefSeq and the non-redundant protein databases. Over 90% of these identified genes were not retrotransposon-derived. Upon further analysis, over half of these identified genes were found in the whole genome or organelle genome sequences of Arabidopsis and were eliminated from further consideration. More stringent criteria revealed 141 BESs without homology to Arabidopsis cDNAs but with alignments of at least 34 amino acids to either nr or RefSeq protein sequences. Forty-one BESs homologous to protein sequences from other plants can be analyzed to identify genes that belong to biochemical pathways present in papaya but not in Arabidopsis. Two BESs were homologous to Amaranth agglutinin genes. Four BESs, homologous to pathogen-resistance genes, can assist in finding potential solutions to improve pathogen resistance in papaya. Seed protein AmA1 sequence homology was also identified and can be used to analyze nutritional improvement in papaya. The remaining 'unknown', 'hypothetical', 'unnamed product', 'putative', 'predicted', and other homologous protein sequences that were found in papaya BESs can be used to clone potential organism-specific genes for gene functional studies. Ninety-eight BESs that were mostly homologous to proteins from E. coli or other microorganisms were not analyzed further.

Papaya-specific repeats

Four previously unknown papaya-specific repeats were identified. These papaya-specific repeats did not show any homology to sequences in the web-based nt database. Moreover, these repeats were found neither in Arabidopsis nor in the plant repeat database. These novel sequences will aid the characterization of repetitive elements in the papaya genome and should be considered for addition to existing plant repeat databases. These papaya-specific repeats could be localized by in situ hybridization to identify papaya chromosomes.

Assuming co-linearity of plant genomes by mapping of BAC-end sequence pairs

Co-linearity indicates the degree to which gene order is conserved between genomes and serves as a measure of synteny, or conservation of genome structure. Mapping single- or low-copy forward and reverse end sequences of a BAC clone, which are separated by the length of the BAC insert, to a completely sequenced heterologous genome can reveal potential syntenic regions and the level of co-linearity between heterologous species. Highly conserved syntenic regions can be used to locate genes within the BAC clones that are not found in BESs (Yan et al., 2003) and to find potential reference genomes for further genomic studies (Ku et al., 2000). Low-copy papaya BES pairs were compared to the Arabidopsis thaliana, Populus trichocarpa (poplar), and Oryza sativa (rice) genomes in all possible reading frames to test the hypothesis that the forward and reverse BESs of papaya BAC clones will map to the completely sequenced genomes in the correct orientation and within the maximum allowable BAC insert size, revealing corresponding collinear regions.

To validate the computational pipeline, attempts were made to map 102 Arabidopsis low-copy BES pairs to the Arabidopsis genome sequence. Four low quality Arabidopsis BESs resulted in the loss of three BES pairs (clones F10C11, F10B5, and F10J4) during mapping to the Arabidopsis genome because of a lack of detectable BLAST matches. One BES pair (clone F10J13) was mismapped to a second locus with an equal or better E value. Two BES pairs (clones F10F16 and F10F17) were not mapped in the correct orientation. In another two cases, BES pairs (clones F10I19 and F10G14) were mapped less than 10 kb apart. Most errors seemed to be due to tracking errors in the BES data set (ftp://ftp.tigr.org/pub/data/a_thaliana/bac_end_sequences/README), but the control BES pairs co-mapped in the correct orientation and within 10-300 kb at a 95% success rate in all subsequent mapping steps. Because the overall success rates in heterologous comparisons were low, two very closely related species, Brassica rapa and Arabidopsis thaliana, were compared to further validate the methodology of mapping BES pairs to heterologous genomes. Forty-six percent of Brassica BES pairs have detectable BLAST matches in the Arabidopsis genome. The Brassica-Arabidopsis comparison yielded the highest success rates at all subsequent steps among the heterologous comparisons. In addition, over fifty percent of the Brassica BES pairs that co-mapped to the Arabidopsis genome were located in the correct orientation and within 10-300 kb. The high level of co-linearity between them indicates that many macrosyntenic regions are conserved in Brassica and Arabidopsis. Papaya, poplar, and Arabidopsis are members of the rosids, and one set of BES pairs from a dicot that is not a rosid was used as a comparison. Lycopersicon (tomato) has a genome size estimated at 950 Mbp and is a member of the asterids, within the order Solanales. Tomato BES pairs yielded similar success rates in mapping to the Arabidopsis and poplar genomes. Tomato BES pairs ranked lowest in having detectable BLAST matches in heterologous genomes. This may be mainly due to the large genome size of tomato, which results in the incorporation of a large portion of non-genic regions in the BAC ends that are mostly not conserved between rosids and asterids, presuming that the BAC ends are sampled randomly across the tomato genome.

Co-mapping BES pairs to heterologous genomes in all subsequent processes displayed higher success rates than random probability. The probability of any two BAC ends co-mapping to the same chromosome in Arabidopsis is approximately 20.7% ([29.1M/115.4M]² + [19.6M/115.4M]² + [23.2M/115.4M]² + [17.5M/115.4M]² + [26.0M/115.4M]²) (The Arabidopsis Genome Initiative, 2000), but in all comparisons with the Arabidopsis genome, the success rate of mapping BES pairs to the same chromosome exceeds random chance. Even the poplar-Arabidopsis comparison, which yielded the minimum success rate of 31.9% in mapping both BAC ends to the same chromosome, is still 11.3% higher than the random probability. In addition, the random probability of mapping any BES pair to the same linkage group of the poplar genome is 1.9% (adjusted by the size of linkage groups). The minimum success rate was in the Brassica-poplar comparison, in which only 12.3% of BES pairs mapped to the same linkage group. However, this is still six times higher than the random chance of 1.9%. If BES pairs are randomly oriented in the mapped region, a probability of 25% is expected for mapping both ends in the same orientation as the homologous sequence in the target genome. In all comparisons, the success rate of BES pairs that were oriented correctly exceeded the random probability of 25%. The highest success rate was in the Brassica-Arabidopsis comparison, in which 86% of the BES pairs were mapped to the same chromosome with correct orientation. Papaya BES pairs co-mapped to the poplar genome with correct orientation at a rate approximately 10% higher than in both the Brassica-poplar and Arabidopsis-poplar comparisons. For BES pairs co-mapped on the same chromosome and oriented in the correct directions, the success rates in mapping within 10-300 kb reflect the level of co-linearity between the genomes of heterologous species. The rate at which papaya BES pairs co-mapped within 10-300 kb in the poplar genome is only 2.5% lower than that of the Brassica-Arabidopsis comparison, even though papaya is expected to be more distantly related to poplar than Brassica is to Arabidopsis. Moreover, the average distance between co-mapped BES pairs in the heterologous genome generally reflects the genome size of the target genome, except for the Brassica-Arabidopsis comparison, in which mapping of large genome regions was observed.
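The random expectation for Arabidopsis quoted in this section can be reproduced with a short calculation. The sketch below assumes that two BAC ends land independently and uniformly along the genome, so the chance of sharing a chromosome is the sum of squared chromosome fractions; the function and variable names are illustrative only and are not taken from the analysis scripts of this study.

```python
# Arabidopsis chromosome sizes in Mb, as quoted in the text (total 115.4 Mb).
ARABIDOPSIS_CHROMOSOMES_MB = [29.1, 19.6, 23.2, 17.5, 26.0]


def same_chromosome_probability(sizes):
    """Chance that two independent, uniformly placed hits share a chromosome.

    Assumption: each hit falls on a chromosome with probability
    proportional to that chromosome's size.
    """
    total = sum(sizes)
    return sum((size / total) ** 2 for size in sizes)


p_random = same_chromosome_probability(ARABIDOPSIS_CHROMOSOMES_MB)
print(f"random same-chromosome probability: {p_random:.1%}")
```

The same function could be applied to the poplar linkage group sizes to obtain the size-adjusted expectation for that genome.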

Based on the current phylogenetic taxonomy derived from alignments of the plastid genes rbcL and atpB and the nuclear 18S rDNA (Soltis et al., 1999, 2000), papaya belongs to the family Caricaceae, which is a sister family of Brassicaceae within the order Brassicales. Papaya BES pairs co-mapped poorly to the rice genome, which may be partly due to the divergent genome organization between dicots and monocots. However, even the most closely related species, Arabidopsis, yielded a low number of mapped papaya BES pairs. Surprisingly, co-mapping low-copy papaya BES pairs to the more distantly related poplar genome demonstrated a higher level of co-linearity (15.9%) than the papaya-Arabidopsis comparison. The co-mapping of papaya BES pairs to the poplar genome is not due to the papaya-specific repeats discovered in this study, since over 90% of the co-mapped BES pairs have homologs in the Arabidopsis cDNA and nr databases. Moreover, the mapped genome segment lengths in poplar approximate the actual insert size of the papaya genomic BAC clones, whereas the mapped Arabidopsis genome regions are relatively smaller. Furthermore, the elevated level of co-mapping of BES pairs within the actual insert size range of BAC clones in the papaya-poplar comparison resembles that of the highly collinear Brassica-Arabidopsis comparison (Figure 6b) and the control (Figure 6a). The high co-linearity between papaya and poplar is unexpected because, according to the current taxonomy, poplar is within the order Malpighiales and is more distantly related to papaya than Arabidopsis is, whereas both papaya and Arabidopsis are within the order Brassicales. The higher level of co-linearity between papaya and poplar suggests that genome structure is more conserved between papaya and poplar than between Arabidopsis and papaya. The low level of co-linearity between Arabidopsis and papaya may be mostly caused by the recent genome duplication followed by large-scale gene loss in the Arabidopsis genome, which poses a great challenge for using Arabidopsis as a model organism for comparative genomics. Based on this study, I suggest that Populus trichocarpa can be chosen as a template genome to assist in the construction of physical maps for sequencing the whole genome of papaya, although poplar belongs to the order Malpighiales and is more distantly related to papaya.

The genome rearrangement of Arabidopsis due to polyploidization or duplication followed by significant gene loss was reported recently (Bowers et al., 2003). Most angiosperm genomes have experienced duplication of gene pairs followed by unequal crossing over, translocation of genome segments, and chromosomal recombination (Zhang LQ et al., 2001; Durbin et al., 2000; Ohta T 2000). Recent studies have also confirmed ancient genome duplications in rice (Kishimoto et al., 1994; Nagamura et al., 1995), Arabidopsis (Blanc G and Wolfe KH 2004; McGrath et al., 1993; Kowalski et al., 1994), and sorghum (Chittenden et al., 1994). The α genome duplication event was the most recent major change in the Arabidopsis genome, and it occurred after the divergence of (Bowers et al., 2003). The previous duplication was followed by the divergence of Arabidopsis and Brassica from the common ancestor with Malvaceae (Adams and Wendel 2005). After the α duplication, only 30% of Arabidopsis genes were preserved in syntenic copies, and the survival of gene copies generally relied on their specialized functions after the duplication (Sampedro et al., 2006). Therefore, assuming that the two genes of a pair are retained independently, we expect 0.3² = 9% of the gene pairs homologous to the papaya BES pairs in Arabidopsis to be preserved, and the 6.8% of papaya BES pairs co-mapped to the Arabidopsis genome in correct orientation and within 10-300 kb is consistent with this expectation. Comparative genomics is complicated by polyploidization of ancient angiosperm genomes because the reshuffling of genome segments and large-scale gene loss after genome duplication can cause incongruities in comparative genome mapping (Paterson et al., 2004). Therefore, although both Arabidopsis and poplar belong to the rosid subclass, only a small amount of co-linearity was found between their genomes (Stirling et al., 2003). Both Arabidopsis and Brassica BES pairs co-mapped significantly in the Arabidopsis genome, yet generally only 1% of these BES pairs co-mapped successfully in the poplar genome with correct orientation and within 10-300 kb; however, papaya BES pairs co-mapped in the poplar genome at 15.9% overall. In addition, both the Brassica-poplar and Arabidopsis-poplar comparisons generally yielded low success rates. Poplar BES pairs mapped to the Arabidopsis genome at a level of co-linearity resembling the papaya-Arabidopsis comparison, indicating that the genome structures of poplar and papaya are fairly divergent from that of the Arabidopsis and Brassica lineage.

Other factors may also have contributed to the divergence between the genome organization of papaya and Arabidopsis. Different generation times among Arabidopsis, poplar, and papaya could produce different rates of mutation over evolutionary time. The generation times of papaya (~9 months) and poplar (several years) are significantly longer than that of Arabidopsis (weeks). Therefore, the chance of genome rearrangement in papaya and poplar due to recombination and polyploidization might be significantly reduced. Furthermore, recombination was found to be severely repressed in the male-specific region of a primitive Y chromosome of papaya (Liu et al., 2004; Ma et al., 2004).

The mapping of papaya BES pairs to heterologous genomes was generally low. Even though poplar displayed the highest level of co-linearity, only 1.8% of the total papaya BES pairs co-mapped to the poplar genome with correct orientation and within 300 kb. The low success rate for mapping papaya BES pairs to heterologous genomes may be partly due to the presence of novel repeats in the end sequences, which caused ambiguous mapping of the BES pairs to the heterologous genome and were eliminated later through more stringent criteria; however, the number of papaya BESs without homology to the heterologous genome is another main reason for the low success rate of comparisons with papaya BES pairs.

To further validate the hypothesis that BES pairs co-mapping with correct orientation and within 10-300 kb map collinear regions in heterologous genomes, 19 mapped regions from poplar-Arabidopsis comparisons and 5 mapped regions from Arabidopsis-poplar comparisons were examined to find potential syntenic regions. Only a low amount of co-linearity was displayed between these genome regions, which is consistent with recent reports of peach-Arabidopsis and poplar-Arabidopsis comparisons (Georgi et al., 2003; Stirling et al., 2003). Both peach, from the family Rosaceae, and poplar, from the family Salicaceae, are members of eurosids I. Although peach and poplar belong to different families and orders, three Prunus persica (peach) BAC clones (AC154900.1, AC154901.1, AF467900.1) displayed a high level of co-linearity with the poplar genome (Appendix B). Although co-mapping of BES pairs to heterologous genomes does not prove co-linearity of the corresponding regions, the high proportion of BES pairs co-mapping to heterologous genomes in correct orientation and within 10-300 kb supports the hypothesis that the mapped BES pairs may map corresponding syntenic regions. The availability of the papaya whole genome sequence will allow further co-linearity analysis between papaya and other species. With the availability of the whole genome sequences of poplar and Arabidopsis, and the evidence that the papaya-poplar, peach-poplar, and Brassica-Arabidopsis comparisons display a high level of co-linearity, gene discovery and whole genome sequencing of papaya, peach, and Brassica can be accelerated.
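The three co-mapping criteria used throughout this discussion (same chromosome, correct orientation, and a separation of 10-300 kb) can be expressed as a small classifier. This is only a sketch under stated assumptions: the `Hit` record and its field names are hypothetical, "correct orientation" is taken to mean that the two end sequences face each other on the target genome, and the separation is measured between the outer edges of the two hits. The actual scripts used in this study may have applied these rules differently.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Best BLAST hit of one BAC end (hypothetical record layout)."""
    chromosome: str
    s_start: int  # subject coordinate where the alignment begins
    s_end: int    # subject coordinate where the alignment ends

    @property
    def strand(self) -> str:
        # In BLAST tabular output a minus-strand hit has s_start > s_end.
        return "+" if self.s_start <= self.s_end else "-"

    @property
    def left(self) -> int:
        return min(self.s_start, self.s_end)

    @property
    def right(self) -> int:
        return max(self.s_start, self.s_end)


def classify_pair(fwd: Hit, rev: Hit,
                  min_span: int = 10_000, max_span: int = 300_000) -> str:
    """Classify a forward/reverse BES pair against the co-mapping criteria."""
    if fwd.chromosome != rev.chromosome:
        return "different_chromosome"
    lower, upper = sorted((fwd, rev), key=lambda hit: hit.left)
    # Assumption: the ends of one clone point toward each other, so the
    # hit with the lower coordinate is on "+" and the other is on "-".
    if not (lower.strand == "+" and upper.strand == "-"):
        return "wrong_orientation"
    span = upper.right - lower.left
    if min_span <= span <= max_span:
        return "co_mapped"
    return "outside_size_range"
```

With two independent strands per end, only one of the four strand combinations satisfies the orientation test, which matches the 25% random expectation stated above.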

CONCLUSIONS

A total of 50,661 papaya BAC end sequence chromatograms were processed and trimmed to yield 35,472 high quality BESs representing 17,483,563 high quality bases, or 4.7% of the papaya genome. All processed high quality BESs were deposited in GenBank.

Most mined SSRs belong to non-genic regions, yet the SSRs associated with cDNAs are more useful for mapping genes that might have important agronomic value. Trinucleotide repeats were enriched in regions associated with cDNAs and form more useful markers for mapping genes of interest. The SSR markers mined in this analysis can provide an economical and efficient way to find polymorphisms between different papaya strains. A website was designed to provide over 2,500 useful SSR markers and their flanking primer pairs. All cDNA, plant repeat, RefSeq genomic, and non-redundant protein homologies associated with these SSR markers are displayed to enhance the efficiency of data mining.

Papaya shares many genes with Arabidopsis. Around one fifth of the papaya genome consists of gene-rich regions, based on the observations of this study. The papaya genome has a higher repeat element content than Arabidopsis, and sixteen percent of the papaya genome consists of plant repeat elements. Papaya genes that are absent in Arabidopsis can be further analyzed to identify gene functions. Around four percent of the genome consists of identified papaya-specific repeats. Over 6,000 gene tags were discovered for papaya researchers to further analyze their genes of interest.

The rapidly evolving genome structure of Arabidopsis, due to polyploidization followed by large-scale gene loss, poses a great challenge for comparative genomics with this model plant. The poplar genome had the most mapped papaya BES pairs, and the mapped regions most closely approximated the insert size of the papaya BAC clone library. Although no existing completely sequenced plant genome can be mapped significantly by the papaya BES pairs, the poplar genome currently appears to be the most suitable reference genome for comparative genomics to elucidate the genome organization of papaya. The orthologous regions between poplar and papaya that were discovered in this study can be further analyzed for potential cloning and mapping of target genes when completely sequenced papaya BAC clones become available.

These papaya BESs will help elucidate the composition of the whole papaya genome sequence. The papaya BAC library and BESs will be valuable resources for physical mapping and sequencing of the whole papaya genome. This analysis can greatly enhance genetic studies of papaya, with the potential for increasing the yield, fruit size, and taste of papaya.

APPENDIX A

Papaya BAC end sequences library website (http://www.genomics.hawaii.edu/papaya/BES)

Introduction

The forward and reverse end sequences of a papaya BAC library containing 26,017 clones were sequenced through the collaboration of HARC (Hawaii Agricultural Research Center), CTAHR (College of Tropical Agriculture at the University of Hawaii) and CGPBRI (Center for Genomics, Proteomics, and Bioinformatics Research Initiative). Analysis of the 35,472 high quality BESs presents the first view of the papaya genome. Our analysis focused on mining of SSR markers, repetitive element composition, protein-coding regions, and comparative genome mapping.

This website is designed to provide a public service for querying the currently available papaya BESs. A total of 42,898 high quality BESs can be downloaded from this website. Moreover, the microsatellite search engine provides over 2,500 primer pairs for detecting papaya SSR length polymorphisms. The Arabidopsis cDNA homologies flanking these SSR loci are shown for convenience in selecting primers to analyze targeted genes. The BAC-end sequence search engine provides detailed information on cDNA, plant repeat, RefSeq (genomic), and non-redundant protein homologies in papaya BESs. Papaya BESs containing microsatellites, plant repeat homology, or gene/cDNA homology, as well as mapped BES pairs, are all available for download from the website. A graphical representation of all microsatellites detected in papaya BESs, plotted by length and type, is accessible at this website. This website also provides contact information for principal investigators, co-principal investigators, collaborating labs and associated institutes. Links to other major genomics centers are provided. The reference for the official publication of the papaya BAC-end sequence analysis project is listed along with other related publications.

Databases

The MySQL database is currently running on a UNIX (Gentoo) platform. Several major tables for storing sequence analysis data are shown in Tables 5-9. The interface between users and the databases is shown in Figure 8.

Table 5. Structure of table pBES

Field          Type              Null  Key  Default  Extra
ID             int(10) unsigned        PRI  NULL     auto_increment
PBES_NAME      varchar(255)      YES        NULL
RAW_SEQ        longtext          YES        NULL
RAW_QUAL       longtext          YES        NULL
TRIM_SEQ       longtext          YES        NULL
TRIM_QUAL      longtext          YES        NULL
CONVTN_SEQ     longtext          YES        NULL
TRIMX_SEQ      longtext          YES        NULL
LAB            varchar(255)      YES        NULL
HOMOLOGY_TYPE  varchar(255)      YES        NULL
HOMOLOGY       longtext          YES        NULL
HOMOLOGY_ID    varchar(255)      YES        NULL
UPDATETIME     datetime          YES        NULL

This table stores all papaya BAC-end sequence information, including the BES name, raw sequence and quality values, trimmed sequence and quality values, the sequence with bases of PHRED value less than 20 converted to 'N', and the sequence and quality values with terminal vector sequences trimmed. Moreover, the associated sequencing lab is indicated. The identifying number and description of the mined homology are provided. The update time is also recorded.
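The conversion of low quality bases described for this table can be illustrated with a short sketch. It assumes one PHRED score per base and replaces every base scoring below 20 with 'N'; the function name and argument layout are hypothetical and do not describe the pipeline actually used to populate the table.

```python
def mask_low_quality(seq: str, quals: list[int], threshold: int = 20) -> str:
    """Replace bases whose PHRED score is below the threshold with 'N'.

    Assumption: `quals` holds one integer PHRED value per base of `seq`.
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lists differ in length")
    return "".join(
        base if qual >= threshold else "N"
        for base, qual in zip(seq, quals)
    )


# A base with a score of exactly 20 is kept; 12 and 19 are masked.
masked = mask_low_quality("ACGTAC", [30, 12, 25, 19, 20, 40])
print(masked)
```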

Table 6. Structure of table SSR

Field               Type              Null  Key  Default  Extra
SSR_ID              int(10) unsigned        PRI  NULL     auto_increment
PBES_NAME           varchar(255)      YES        NULL
FORWARD_PRIMER      varchar(255)      YES        NULL
REVERSE_PRIMER      varchar(255)      YES        NULL
SSR                 varchar(255)      YES        NULL
PATTERN             tinyint(4)        YES        NULL
MAX_SSR             text              YES        NULL
MAX_SSR_SIZE        int(11)           YES        NULL
SSR_START           int(11)           YES        NULL
SSR_END             int(11)           YES        NULL
PRODUCT_SIZE        int(11)           YES        NULL
PRIMER_EXPLANATION  longtext          YES        NULL
AMPLIFIED_SEQ       longtext          YES        NULL
PBES                longtext          YES        NULL
LAB                 varchar(255)      YES        NULL
UPDATETIME          datetime          YES        NULL

This table stores information about the mined microsatellites, their associated primers,

SSR pattern, maximum SSR size, the position of the SSR markers, PCR-amplified product size, explanation of the primer selection, PCR-amplified sequences, original

BAC-end sequences, associated lab, and updated time.

Table 7. Structure of table CDNA, REPx, REFSEQ, NR

Field          Type              Null  Key  Default  Extra
ID             int(10) unsigned        PRI  NULL     auto_increment
QUERY_ID       varchar(255)      YES        NULL
SUBJECT_ID     varchar(255)      YES        NULL
IDENTITY       float             YES        NULL
ALIGNMENT_LEN  int(11)           YES        NULL
MISMATCHES     int(11)           YES        NULL
GAP_OPENING    int(11)           YES        NULL
Q_START        int(11)           YES        NULL
Q_END          int(11)           YES        NULL
S_START        int(11)           YES        NULL
S_END          int(11)           YES        NULL
EVALUE         float             YES        NULL
BIT_SCORE      float             YES        NULL
SUBJECT_DESC   longtext          YES        NULL
LAB            varchar(255)      YES        NULL

These tables share the same structure. Each table contains the BLAST result output, including query sequence ID, subject sequence ID, identity or percentage of the matches, alignment length, number of mismatched nucleotides or amino acids, gap opening, aligned query position, aligned subject position, expected value, bit score, matched subject description, and associated sequencing lab of the BES.
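The first twelve fields listed for these tables follow the order of the standard twelve-column BLAST tabular output. As an illustration only, the sketch below parses one such tab-separated line into a typed record; it is assumed here, not stated in the text, that the tables were loaded from tabular output, and the example identifiers are invented.

```python
from typing import NamedTuple


class BlastHit(NamedTuple):
    """One row of 12-column BLAST tabular output."""
    query_id: str
    subject_id: str
    identity: float
    alignment_len: int
    mismatches: int
    gap_opening: int
    q_start: int
    q_end: int
    s_start: int
    s_end: int
    evalue: float
    bit_score: float


def parse_tabular_line(line: str) -> BlastHit:
    """Convert one tab-separated BLAST line into a BlastHit."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 12:
        raise ValueError(f"expected 12 columns, found {len(fields)}")
    return BlastHit(
        fields[0],
        fields[1],
        float(fields[2]),
        *map(int, fields[3:10]),  # alignment length through subject end
        float(fields[10]),
        float(fields[11]),
    )


# Invented example line; the identifiers do not refer to real records.
example = "bes_0001.f\tsubject_0001\t87.50\t240\t28\t2\t15\t254\t1021\t782\t3e-62\t238"
hit = parse_tabular_line(example)
```

The subject description and sequencing lab columns of the tables are not part of the tabular output and would have to be attached in a separate step.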

Table 8. Structure of table SEQLEN

Field       Type              Null  Key  Default  Extra
ID          int(10) unsigned        PRI  NULL     auto_increment
PBES_NAME   varchar(255)      YES        NULL
TRIMX_LEN   int(11)           YES        NULL
HQBASE_NUM  int(11)           YES        NULL
GC          int(11)           YES        NULL

This table contains BES identifying name, the length of each trimmed high quality sequence with terminal vector sequences removed, number of high quality bases in the

BES, and percentage of GC content.
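The three per-sequence values held in this table can be derived from a trimmed sequence as sketched below. The sketch assumes that low quality bases have already been converted to 'N', that only A, C, G and T count as high quality bases, and that GC content is stored as a whole-number percentage of the high quality bases; these conventions and the dictionary keys are assumptions for illustration.

```python
def sequence_stats(trimmed_seq: str) -> dict:
    """Return trimmed length, high quality base count and GC percentage."""
    seq = trimmed_seq.upper()
    hq_bases = sum(1 for base in seq if base in "ACGT")
    gc_bases = sum(1 for base in seq if base in "GC")
    # Assumption: GC percentage is relative to high quality bases only.
    gc_percent = round(100 * gc_bases / hq_bases) if hq_bases else 0
    return {
        "trimmed_length": len(seq),
        "hq_bases": hq_bases,
        "gc_percent": gc_percent,
    }


stats = sequence_stats("ACGTNNGGCC")
print(stats)
```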

Table 9. Structure of table GMAP

Field             Type              Null  Key  Default  Extra
ID                int(10) unsigned        PRI  NULL     auto_increment
QUERY_ID          varchar(255)      YES        NULL
SUBJECT_ID        varchar(255)      YES        NULL
IDENTITY          float             YES        NULL
ALIGNMENT_LEN     int(11)           YES        NULL
MISMATCHES        int(11)           YES        NULL
GAP_OPENING       int(11)           YES        NULL
Q_START           int(11)           YES        NULL
Q_END             int(11)           YES        NULL
S_START           int(11)           YES        NULL
S_END             int(11)           YES        NULL
EVALUE            float             YES        NULL
BIT_SCORE         float             YES        NULL
BASE_APART        int(11)           YES        NULL
QUERY_ORGANISM    varchar(255)      YES        NULL
SUBJECT_ORGANISM  varchar(255)      YES        NULL
LAB               varchar(255)      YES        NULL

This table contains the BLAST result output including query sequence ID, subject sequence ID, identity or percentage of the matches, alignment length, number of mismatched nucleotides or amino acids, gap opening, aligned query position, aligned subject position, expected value, bit score, number of bases between the forward and reverse BES pair, query organism (BES pairs), subject organism (genome), and associated sequencing lab of the BES.

Figure 8. Project website index page

[Screenshot of the project home page, listing the participating groups (CGPBRI, HARC, Presting Lab), acknowledgements, download links, and contact information.]

All project web pages are currently stored on the genomics.hawaii.edu Apache web server maintained in the Presting Lab. The interactive CGI pages were programmed in Perl in combination with HTML. The home page of this project website can be accessed at (http://www.genomics.hawaii.edu/papaya/BES).

SSR search engine

The purpose of this feature is to allow users to select SSRs that match certain user-defined criteria (Figure 9). For example, users can choose to display any combination of mono-, di-, tri- and tetra-nucleotide repeats when selecting SSR markers. The minimum length of all SSR markers is 12 bp. Therefore, if a number less than 12 is entered for the SSR minimum length, it is reset to the default of 12. Users can choose to show the cDNA homology description by clicking on the check box. Users can also choose to view only the SSRs with primers and/or SSRs with cDNA homologies. Users can then click on Search to view the search results, and click on Display to view the details of the BAC-end sequence and the microsatellite information, including the amplified product sequence and the original BAC-end sequence. Users can key in the cDNA description, sequences, BES name or SSR pattern to narrow down the matched SSR markers.
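The filtering behaviour described above, including the reset of any minimum length below 12 bp, can be sketched as follows. The website itself was written in Perl against the MySQL tables; this Python version is only an illustration, and the record keys (`unit_size`, `length`, `forward_primer`, `cdna_hit`) are hypothetical.

```python
MIN_SSR_LENGTH = 12  # shortest SSR stored, as stated in the text


def search_ssrs(records, unit_sizes=(1, 2, 3, 4), min_length=MIN_SSR_LENGTH,
                primers_only=False, cdna_only=False):
    """Return the SSR records that satisfy the user-selected criteria."""
    # Requests below the stored minimum fall back to 12 bp.
    min_length = max(min_length, MIN_SSR_LENGTH)
    results = []
    for rec in records:
        if rec["unit_size"] not in unit_sizes:
            continue  # repeat type (mono-, di-, tri-, tetra-) not selected
        if rec["length"] < min_length:
            continue
        if primers_only and not rec.get("forward_primer"):
            continue
        if cdna_only and not rec.get("cdna_hit"):
            continue
        results.append(rec)
    return results
```

For example, calling `search_ssrs(records, unit_sizes=(3,), cdna_only=True)` would keep only trinucleotide repeats that have a cDNA homology.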

Figure 9. Microsatellite search engine

[Screenshot of the microsatellite search form.]

This search engine can be accessed at (http://www.genomics.hawaii.edu/papaya/BES/ssrNP.cgi).

BES search engine

The purpose of this feature is to allow users to display BESs that match certain user-defined criteria (Figure 10). Users can limit the number of rows returned to reduce the running time of the request. Users can click on the radio buttons to show either all BESs, BESs with cDNA homologies, BESs with plant repeats, BESs with RefSeq genomic homologies, or BESs with no homologies. Users can then click on Search to view the search results. Users can also key in the homology identifying number (accession number), target sequences, BES name, or homology description keywords to narrow the displayed results.

Figure 10. BAC-end sequence search engine

[Screenshot of the BAC-end sequence search form and example results.]

This search engine can be accessed at (http://www.genomics.hawaii.edu/papaya/BES/besNP.cgi).

APPENDIX B

Figure 11. Comparisons of three peach BAC clones to the poplar genome

[Plot comparing the peach BAC clones AF467900, AC154900 and AC154901 with the corresponding poplar linkage group segments.]

Three peach BAC clones were downloaded from the NCBI GenBank website (AC154900.1, AC154901.1, AF467900.1). These peach BAC clones were compared against artificial poplar BAC clones using BLASTN with an E value less than or equal to

10" . The best matches of poplar BAC clones to the corresponding peach BAC clones were identified. The start positions of the poplar BAC clones were identified, and the exact coordinates of the mapped poplar BAC clones on the poplar genome were obtained.

Based on the linkage group number and the first aligned nucleotide of the poplar artificial BAC clone, a 120 kb poplar genome segment was retrieved. The corresponding peach and poplar BAC clone sequences were compared using the BL2SEQ-BLASTN program at an E value cutoff of 10⁻¹⁰. All aligned regions of the peach and poplar BAC clones were recorded. The portions of the peach BAC clones with no match to the retrieved poplar genome segment were further compared to the entire non-fragmented poplar genome database. The coordinates of the mapped poplar chromosomes were recorded (Figure 11).

For the three peach BAC clones that were compared to the poplar genomic sequences, nucleotides 1,882-34,648 of peach BAC clone AC154901.1 were mapped to poplar Linkage Group (LG) VI at nucleotide position 776,372-938,175, and region 34,480-73,277 was mapped to poplar LG XVI at nucleotide position 248,433-305,321. For BAC clone AC154900.1, region 4,579-66,115 was mapped to poplar LG VI 663,584-776,618 with a gap spanning ~80 kb at position 681,139-762,409. For BAC clone AF467900.1, region 13,096-16,694 was mapped to poplar LG XII at position 898,040-901,595.
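The sizes implied by the coordinates reported for the second peach BAC clone can be checked with a few lines of arithmetic. The sketch assumes that the reported coordinates are inclusive at both ends; the variable names are illustrative.

```python
def span_length(start: int, end: int) -> int:
    """Inclusive length of a coordinate interval given in either order."""
    return abs(end - start) + 1


# Coordinates quoted in the text for the clone mapped to LG VI with a gap.
peach_region = (4_579, 66_115)
poplar_region = (663_584, 776_618)
poplar_gap = (681_139, 762_409)

peach_len = span_length(*peach_region)
poplar_len = span_length(*poplar_region)
gap_len = span_length(*poplar_gap)
poplar_len_without_gap = poplar_len - gap_len

print(f"peach region: {peach_len} bp")
print(f"poplar region: {poplar_len} bp, gap: {gap_len} bp")
print(f"poplar region excluding gap: {poplar_len_without_gap} bp")
```

Under this assumption the gap is about 81 kb, in line with the approximately 80 kb stated above.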

APPENDIX C

Table 10. Detailed coordinates of co-mapped papaya BES pairs in heterologous genome comparisons

The exact coordinates of mapped genome regions from co-mapped papaya BES pairs in heterologous genomes are listed here. These coordinates may facilitate the alignment of papaya BAC clones in constructing a physical map for whole genome sequencing of papaya, and may help locate potential genes of interest in papaya BAC clones by reference to the mapped collinear regions in target genomes.

APPENDIX D

Table 11. A list of compiled plant repeat databases for repeat content analysis of papaya BESs. Plant repeat databases were downloaded from The Institute for Genomic Research (TIGR) (ftp://tigr.org/pub/data/TIGR_Plant_Repeats) and compiled into a database of 30,481 unique repeat sequences.

a. TIGR_Brassica_Repeats.v2
b. TIGR_Brassicaceae_Repeats.v2
c. TIGR_Fabaceae_Repeats.v2
d. TIGR_Glycine_Repeats.v2
e. TIGR_Gramineae_Repeats.v3.0
f. TIGR_Hordeum_Repeats.v3.0
g. TIGR_Lotus_Repeats.v2
h. TIGR_Lycopersicon_Repeats.v2
i. TIGR_Medicago_Repeats.v2
j. TIGR_Oryza_Repeats.v3.0
k. TIGR_Solanaceae_Repeats.v2
l. TIGR_Solanum_Repeats.v2
m. TIGR_Sorghum_Repeats.v3.0
n. TIGR_Triticum_Repeats.v3.0
o. TIGR_Zea_Repeats.v3.0

LITERATURE CITED

Adams KL, Wendel JF (2005) Polyploidy and genome evolution in plants. Current Opinion in Plant Biology 8:135-141

Agricultural Statistics Board, USDA. "Crop Production." NASS, May 2006

Agricultural Statistics Board, USDA. "Crop Values 2005 Summary." NASS, February 2006

Agricultural Statistics Board, USDA. "Noncitrus Fruits and Nuts 2005 Preliminary Summary." NASS, January 2006

Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. J Mol Biol 215:403-410

Aradhya MK, Manshardt RM, Zee F, Morden CW (1999) A phylogenetic analysis of the genus Carica L. (Caricaceae) based on restriction fragment length variation in a cpDNA intergenic spacer region. Genet Resour Crop Evol 46:579-586

Arumuganathan K, Earle ED (1991) Nuclear DNA content of some important plant species. Plant Mol Biol Rep 9(3):211-215

Badillo VM (2000) Carica L. vs. Vasconcella St. Hil. (Caricaceae): con la rehabilitación de este último. Ernstia 10:74-79

Blanc G, Wolfe KH (2004) Functional divergence of duplicated genes formed by polyploidy during Arabidopsis evolution. Plant Cell 16:1679-1691

Bowers JE, Chapman BA, Rong J, Paterson AH (2003) Unravelling angiosperm genome evolution by phylogenetic analysis of chromosomal duplication events. Nature 422:433-438

Bremer K, Chase MW, Stevens PF (1998) An ordinal classification for the families of flowering plants. Ann Mo Bot Gard 85:531-553

Candolle A de (1908) Origin of cultivated plants. D.A. Appleton & Co., New York, NY

Chen M, Presting G, Barbazuk WB, Goicoechea JL, Blackmon B, Fang G, Kim H, Frisch D, Yu Y, Sun S, Higingbottom S, Phimphilai J, Phimphilai D, Thurmond S, Gaudette B, Li P, Liu J, Hatfield J, Main D, Farrar K, Henderson C, Barnett L, Costa R, Williams B, Walser S, Atkins M, Hall C, Budiman MA, Tomkins JP, Luo M, Bancroft I, Salse J, Regad F, Mohapatra T, Singh NK, Tyagi AK, Soderlund C, Dean RA, Wing RA (2002) An integrated physical and genetic map of the rice genome. Plant Cell 14:1-10

Cheng Z, Presting G, Buell CR, Wing RA, Jiang J (2001) High-resolution pachytene chromosome mapping of bacterial artificial chromosomes anchored by genetic markers reveals the centromere location and the distribution of genetic recombination along chromosome 10 of rice. Genetics 157:1749-1757

Chittenden LM, Schertz KF, Lin YR, Wing RA, Paterson AH (1994) A detailed RFLP map of Sorghum bicolor x S. propinquum, suitable for high-density mapping, suggests ancestral duplication of Sorghum chromosomes or chromosomal segments. Theor Appl Genet 87:925-933

Choi S, Creelman RA, Mullet JE, Wing RA (1995) Construction and characterization of a bacterial artificial chromosome library of Arabidopsis thaliana. Weeds World 2:17-20

Deputy JC, Ming R, Ma H, Liu Z, Fitch MM, Wang M, Manshardt R, Stiles JI (2002) Molecular markers for sex determination in papaya (Carica papaya L.). Theor Appl Genet 106:107-111

Doughty KJ, Porter AJR, Morton AM, Kiddle G, Bock CH, Wallsgrove R (1991) Variation in the glucosinolate content of oilseed rape (Brassica napus L.) leaves. II. Response to infection by Alternaria brassicae (Berk.) Sacc. Ann Appl Biol 118:469-478

Dune J, Horgan L (1992) Meat tenderizers. In: Hui YH (ed) Encyclopedia of Food Science and Technology, vol 3. Wiley, New York, p 1745-1751

Durbin ML, McCaig B, Clegg MT (2000) Molecular evolution of the chalcone synthase multigene family in the morning glory genome. Plant Mol Biol 42:79-92

Ewing B, Hillier L, Wendl MC, Green P (1998) Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res 8:175-185

Ewing B, Green P (1998) Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res 8:186-194

Fahey JW, Zalcmann AT, Talalay P (2001) The chemical diversity and distribution of glucosinolates and isothiocyanates among plants. Phytochemistry 56:5-51

Fitch MMM, Manshardt RM, Gonsalves D, Slightom JL (1992) Virus resistant papaya derived from tissue bombarded with the coat protein gene of papaya ringspot virus. Bio/Technol 10:1466-1472

Filatov DA, Moneger F, Negrutiu I, Charlesworth D (2000) Low variability in a Y-linked plant gene and its implications for Y-chromosome evolution. Nature 404:388-390

Georgi LL, Wang Y, Reighard GL, Mao L, Wing RA, Abbott AG (2003) Comparison of peach and Arabidopsis genomic sequences: fragmentary conservation of gene neighborhoods. Genome 46:268-276

Goff SA, Ricke D, Lan T, Presting G, Wang R, Dunn M, Glazebrook J, Sessions A, Oeller P, Varma H, Hadley D, Hutchison D, Martin C, Katagiri F, Lange B, Moughamer T, Xia Y, Budworth P, Zhong J, Miguel T, Paszkowski U, Zhang S, Colbert M, Sun W, Chen L, Cooper B, Park S, Wood T, Mao L, Quail P, Wing R, Dean R, Yu Y, Zharkikh A, Shen R, Sahasrabudhe S, Thomas A, Cannings R, Gutin A, Pruss D, Reid J, Tavtigian S, Mitchell J, Eldredge G, Scholl T, Miller R, Bhatnagar S, Adey N, Rubano T, Tusneem N, Robinson R, Feldhaus J, Macalma T, Oliphant A, Briggs S (2002) A Draft Sequence of the Rice Genome (Oryza sativa L. ssp. japonica). Science 296:92-100

Hong CP, Lee SJ, Park JY, Plaha P, Park YS, Lee YI<, Choi JE, Kim KY, Lee JH, Lee J,

Jin H, Choi SR, Urn YP (2004) Construction of a BAC Iibmry of Korean ginseng

and initial analysis of BAC-end sequences. Mol Oen Oenomics 271:709-716

Ilic I<, SanMiguel PJ, Bennetzen JL (2003) A complex history of rearmngements in an

orthologous region of the maize, sorghum, and rice genomes. Proc Natl Acad Sci

USA 100:12265-12270

International Rice Genome Sequencing Project (2005) The map-based sequence of the

rice genome. Nature 436:793-800

Jiang J, Gill BS, Wang GL, Ronald PC, Ward DC (1995) Metaphase and interphase

fluorescence in situ hybridization mapping of the rice genome with bacterial

artificial chromosomes. Proc Natl Acad Sci USA 92:4487-4491

Jung S, Abbott A, Jesudurai C, Tomkins J, Main D (2005) Frequency, type, distribution

and annotation of simple sequence repeats in Rosaceae ESTs. Funct Integr

Genomics 5:136-143

Katti M, Ranjekar PK, Gupta VS (2001) Differential distribution of simple sequence

repeats in eukaryotic genome sequences. Mol Biol Evol 18:1161-1167

Kjaer A (1976) Glucosinolates in the Cruciferae. In: Vaughan, J.G., MacLeod, A.J., Jones,

B.M.G. (Eds.), The Biology and Chemistry of the Cruciferae. Academic, London,

UK, pp. 207-219

Kjaer A, Schuster A (1972) Glucosinolates in seeds of Neslia paniculata.

Phytochemistry 11:3045-3048

Kim MS, Moore PH, Zee F, Fitch MMM, Steiger DL, Manshardt RM, Paull RE, Drew RA,

Sekioka T, Ming R (2002) Genetic diversity of Carica papaya as revealed by

AFLP markers. Genome 45:503-512

Kishimoto N, Higo H, Abe K, Arai S, Saito A, Higo K (1994) Identification of the

duplicated segments in rice chromosomes 1 and 5 by linkage analysis of cDNA

markers of known functions. Theor Appl Genet 88:722-726

Kowalski SP, Lan TH, Feldmann, Paterson (1994) Comparative mapping of Arabidopsis

thaliana and Brassica oleracea chromosomes reveals islands of conserved

organization. Genetics 138:499-510

Kroymann J, Mitchell-Olds T (2004) Function and evolution of an Arabidopsis insect

resistance QTL. In: Tikhonovich, I., Lugtenberg, B., Provorov, N. (Eds.), Biology

of Molecular Plant-Microbe Interactions. APS Press, Saint Paul, MN, pp. 259-262

Kroymann J, Textor S, Tokuhisa JG, Falk KL, Bartram S, Gershenzon J, Mitchell-Olds T

(2001) A gene controlling variation in Arabidopsis glucosinolate composition is

part of the methionine chain elongation pathway. Plant Physiol 127:1077-1088

Kruglyak S, Durrett RT, Schug MD, Aquadro CF (1998) Equilibrium distributions of

microsatellite repeat length resulting from a balance between slippage events and

point mutations. Proc Natl Acad Sci USA 95:10774-10778

Ku HM, Vision T, Liu J, Tanksley SD (2000) Comparing sequenced segments of the

tomato and Arabidopsis genomes: Large-scale duplication followed by selective

gene loss creates a network of synteny. Proc Natl Acad Sci USA 97:9121-9126

Lange BM, Presting G (2004) Genomic survey of metabolic pathways in rice. In: Romeo

JT (ed) Recent advances in phytochemistry. Elsevier, Amsterdam, pp 111-137

Levinson G, Gutman GA (1987) Slipped-strand mispairing: a major mechanism for DNA

sequence evolution. Mol Biol Evol 4:203-221

Liu Z, Moore PH, Ma H, Ackerman CM, Ragiba M, Yu Q, Pearl HM, Kim MS, Charlton

JW, Stiles JI, Zee FT, Paterson AH, Ming R (2004) A primitive Y chromosome in

papaya marks incipient sex chromosome evolution. Nature 427:348-352

Ma H, Moore PH, Liu Z, Kim MS, Yu Q, Fitch MM, Sekioka T, Paterson AH, Ming R

(2004) High-density linkage mapping revealed suppression of recombination at

the sex determination locus in papaya. Genetics 166:419-436

Mao L, Wood T, Yu Y, Budiman MA, Tomkins J, Woo S, Sasinowski M, Presting G,

Frisch D, Goff S, Dean RA, Wing RA (2000) Rice transposable elements: a

survey of 73,000 sequence-tagged-connectors. Genome Res 10:982-990

McGrath JM et al. (1993) Duplicate sequences with a similarity to expressed genes in

the genome of Arabidopsis thaliana. Theor Appl Genet 86:880-888

Ming R, Moore PH, Zee F, Abbey CA, Ma H, Paterson AH (2001) Construction and

characterization of a papaya BAC library as a foundation for molecular dissection

of a tree-fruit genome. Theor Appl Genet 102:892-899

Morton JF (1987) Papaya. In: Fruits of warm climates. Creative Resource Systems, Inc.,

Miami, pp 336-346

Mozo T, Fischer S, Shizuya H, Altmann T (1998) Construction and characterization of

the IGF Arabidopsis BAC library. Mol Gen Genet 258:562-570

Nagamura Y et al. (1995) Conservation of duplicated segments between rice

chromosomes 11 and 12. Breed Sci 45:373-376

O'Neill CM, Bancroft I (2000) Comparative physical mapping of segments of the

genome of Brassica oleracea var. alboglabra that are homeologous to sequenced

regions of chromosomes 4 and 5 of Arabidopsis thaliana. Plant J 23:233-243

Ockerman HW, Harnsawas S, Yetim H (1993) Inhibition of papain in meat by potato

protein or ascorbic acid. J Food Sci 58:1265-1268

Ohta T (2000) Evolution of gene families. Gene 259:45-52

Osato JA, Santiago L, Remo G, Cuadra M, Mori A (1993) Antimicrobial and antioxidant

activities of unripe papaya. Life Sci 53:1383-1389

Paterson AH, Bowers JE, Chapman BA (2004) Ancient polyploidization predating

divergence of the cereals, and its consequences for comparative genomics. Proc

Natl Acad Sci USA 101:9903-9908

Perez A, Pollack S. USDA. "Fruit and Tree Nuts Outlook." Economic Research Service

FTS-322 (May 25, 2006)

Purina A, Sandhya B (1988) Genotypic difference of in vitro lateral bud establishment

and shoot proliferation in papaya. Curr Sci 7:440-442

Purseglove JW (1968) Tropical crops. Longman, London, pp 45-51

Raybould AF, Moyes CL (2001) The ecological genetics of aliphatic glucosinolates.

Heredity 87:383-391

Rice Chromosome 10 Sequencing Consortium (2003) In-depth view of structure, activity

and evolution of rice chromosome 10. Science 300:1566-1569

Rodman JE, Karol KG, Price RA, Sytsma KJ (1996) Molecules, morphology, and

Dahlgren's expanded order Capparales. Syst Bot 21:289-307

Rodman JE, Soltis PS, Soltis DE, Sytsma KJ, Karol KG (1998) Parallel evolution of

glucosinolate biosynthesis inferred from congruent nuclear and plastid gene

phylogenies. Am J Bot 85:997-1006

Rozen S, Skaletsky H (2000) Primer3 on the WWW for general users and for biologist

programmers. In: Krawetz S, Misener S (eds) Bioinformatics methods and

protocols (methods in molecular biology). Humana Press, Totowa, pp 365-386

Sampedro J, Carey RE, Cosgrove DJ (2006) Genome histories clarify evolution of the

expansin superfamily: new insights from the poplar genome and pine ESTs. J

Plant Res 119:11-21

Shizuya H, Birren B, Kim UJ, Mancino V, Slepak T, Tachiiri Y, Simon M (1992) Cloning

and stable maintenance of 300-kilobase-pair fragments of human DNA in

Escherichia coli using an F-factor-based vector. Proc Natl Acad Sci USA

89:8794-8797

Soltis DE, Soltis PS, Chase MW, Mort ME, Albach DC, Zanis M, Savolainen V, Hahn

WH, Hoot SB, Fay MF, Axtell M, Swensen SM, Prince LM, Kress WJ, Nixon

KC, Farris JS (2000) Angiosperm phylogeny inferred from 18S rDNA, rbcL, and

atpB sequences. Bot J Linnean Soc 133:381-461

Soltis PS, Soltis DE, Chase MW (1999) Angiosperm phylogeny inferred from multiple

genes: a research tool for comparative biology. Nature 402:402-404

Stirling B, Yang ZK, Gunter LE, Tuskan GA, Bradshaw HD (2003) Comparative

sequence analysis between orthologous regions of the Arabidopsis and Populus

genomes reveals substantial synteny and microcollinearity. Can J For Res

33:2245-2251

Storey WB (1976) Papaya. In: Simmonds NW (ed) Evolution of crop plants. Longman,

London, pp 21-24

Storey WB (1985) Papaya. In: CRC Handbook of Flowering, CRC Press, Boca Raton,

FL

Temnykh S, DeClerck G, Lukashova A, Lipovich L, Cartinhour, McCouch S (2001)

Computational and experimental analysis of microsatellites in rice (Oryza sativa

L.): frequency, length variation, transposon associations, and genetic marker

potential. Genome Res 11:1441-1452

The Arabidopsis Genome Initiative (2000) Analysis of the genome sequence of the flowering plant

Arabidopsis thaliana. Nature 408:796-815

Tokuhisa J, de Kraker J-W, Textor S, Gershenzon J (2004) The biochemical and molecular

origins of aliphatic glucosinolate diversity in Arabidopsis thaliana. In: Romeo,

J.T. (Ed.), Secondary Metabolism in Model Systems. Pergamon, New York, NY,

pp. 19-38

Tomkins J, Fregene M, Main D, Kim H, Wing R, Tohme J (2004) Bacterial artificial

chromosome (BAC) library resource for positional cloning of pest and disease

resistance genes in cassava (Manihot esculenta Crantz). Plant Mol Biol 56:555-561

Van Droogenbroeck B, Kyndt T, Maertens I, Romeijn-Peeters E, Scheldeman X, Romero

JP, Van Damme P, Goetghebeur P, Gheysen G (2004) Phylogenetic analysis of

the highland papayas (Vasconcellea) and allied genera (Caricaceae) using PCR-RFLP.

Theor Appl Genet 108:1473-1486

Verde I, Lauria M, Dettori MT, Vendramin E, Balconi C, Micali S, Wang Y, Marrazzo

MT, Cipriani G, Hartings H, Testolin R, Abbott AG, Motto M, Quarta R (2005)

Microsatellite and AFLP markers in the Prunus persica [L. (Batsch)] × P.

ferganensis BC1 linkage map: saturation and coverage improvement. Theor Appl

Genet 111:1013-1021

Wang GL, Holsten TE, Song WY, Wang HP, Ronald PC (1995) Construction of a rice

bacterial artificial chromosome library and identification of clones linked to the

Xa-21 disease resistance locus. Plant J 7:525-533

Wierdl M, Dominska M, Petes TD (1997) Microsatellite instability in yeast: dependence

on the length of the microsatellite. Genetics 146:769-779

Yan L, Loukoianov A, Tranquilli G, Helguera M, Fahima T, Dubcovsky J (2003)

Positional cloning of the wheat vernalization gene VRN1. Proc Natl Acad Sci

USA 100:6263-6268

Yu Q, Moore PH, Albert HH, Roader AH, Ming R (2005) Cloning and Characterization

of a FLORICAULA/LEAFY ortholog, PFL, in polygamous papaya. Cell Res

15(8):576-584

Zhang HB, Choi S, Woo SS, Li Z, Wing RA (1996) Construction and characterization of

two rice bacterial artificial chromosome libraries from the parents of a permanent

recombinant inbred mapping population. Mol Breed 2:11-24

Zhang LQ, Pond SK, Gaut BS (2001) A survey of the molecular evolutionary dynamics

of twenty-five multigene families from four grass taxa. J Mol Evol 52:144-156

Zhao S, Shatsman S, Ayodeji B, Geer K, Tsegaye G, Krol M, Gebregeorgis E,

Shvartsbeyn A, Russell D, Overton L, Jiang L, Dimitrov G, Tran K, Shetty J,

Malek JA, Feldblyum T, Nierman WC, Fraser CM (2001) Mouse BAC ends

quality assessment and sequence analyses. Genome Res 11:1736-1745

Zhu YJ, Agbayani R, Jackson MC, Tang CS, Moore PH (2004) Expression of the

grapevine stilbene synthase gene VST1 in papaya provides increased resistance

against diseases caused by Phytophthora palmivora. Planta 220:241-250
