UNIVERSITY OF HAWAI'I LIBRARY
INSIGHTS INTO PAPAYA GENOME ORGANIZATION BASED ON BAC END SEQUENCE ANALYSIS
A THESIS SUBMITTED TO THE GRADUATE DIVISION OF THE UNIVERSITY OF HAWAI'I IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF
MASTER OF SCIENCE
IN
MOLECULAR BIOSCIENCES AND BIOENGINEERING
AUGUST 2006
BY CHUN WAN JEFFREY LAI
We certify that we have read this thesis and that, in our opinion, it is satisfactory in scope and quality as a thesis for the degree of Master of Science in Molecular Biosciences and Bioengineering.
Thesis Committee:
Gernot Presting, Chairperson
Paul Moore
Qingyi Yu
Richard Manshardt
ACKNOWLEDGEMENTS
I am obliged that Dr. Maqsudul Alam and Dr. Shaobin Hou from the Center for Genomics, Proteomics and Bioinformatics Research Initiative (CGPBRI) at the University of Hawaii generously provided 11,013 papaya BAC end sequence chromatogram files for this analysis. Dr. Ray Ming, who is a co-principal investigator of this project, and Dr. Qingyi Yu from the Hawaii Agriculture Research Center (HARC) directed the sequencing of most BAC ends analyzed here. Peizhu Guan from HARC and Kanako L. T. Lewis from CGPBRI contributed significantly by generating papaya BAC end sequences. I would like to specially thank my thesis committee members, who contributed substantially: Dr. Gernot Presting (Chairperson), Dr. Paul Moore, Dr. Qingyi Yu and Dr. Richard Manshardt. Dr. Gernot Presting has supervised the entire project, recruited me into his lab during my second semester, and trained me in bioinformatics research and critical thinking. Dr. Paul Moore provided invaluable advice and comments for this analysis. I am honored that Dr. Richard Manshardt, who has many years of experience in papaya genetic research, is on the thesis committee. I am indebted to Dr. Qingyi Yu, who has directed the generation of most BAC end sequences used in this project. I am indeed thankful that Dr. Dulal Borthakur offered me a chance to pursue a Master of Science degree in the Department of Molecular Biosciences and Bioengineering. I wish to thank all faculty and staff who have enlightened and assisted me during my time at the University of Hawai'i. Anupma, Aren, Beth, Kevin, Moriah, and Thomas offered me friendship and precious insights in the lab. Finally, my mother Hay Hing, my sisters Kit Chi and Kit Hing, my wife Chui Ying and my son Jit Ching have supported my education in every possible way.
ABSTRACT
Papaya is a major tree-fruit plant grown mainly in tropical and subtropical regions. A BAC library was end sequenced and the resulting 50,661 BAC-end sequences were analyzed bioinformatically. A total of 7,456 SSRs were identified among 5,452 BESs. Sixteen percent of BESs contain plant repeat homologies. BESs lacking plant repeats revealed 6,769 (19.1%) Arabidopsis cDNA homologies. BESs without plant repeat and Arabidopsis cDNA homology contained 1,124 (3.2%) RefSeq and 644 (1.8%) non-redundant protein sequence homologies.
Low-copy papaya BES pairs (9,038) were compared to Arabidopsis, poplar, and rice genome sequences. A total of 53 BES pairs were mapped to Arabidopsis, 167 to poplar and 11 to rice. The low rate of co-mapping of papaya BES pairs to Arabidopsis confirms the recent genome rearrangement in Arabidopsis. Poplar exhibited the highest level of co-linearity with papaya and can serve as a reference genome for papaya genomic studies.
TABLE OF CONTENTS
ACKNOWLEDGEMENTS
ABSTRACT
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
INTRODUCTION
METHODS
RESULTS
DISCUSSIONS
CONCLUSIONS
APPENDIX A: Papaya BAC end sequence library website (PBESL)
APPENDIX B: Alignments of three peach BAC clones to poplar genome
APPENDIX C: Detailed coordinates of co-mapped papaya BES pairs in heterologous genome comparisons
APPENDIX D: Lists of compiled plant repeat databases for repeat content analysis of papaya BESs
LITERATURE CITED
LIST OF TABLES
1 Summary of simple sequence repeats percentage of 35,472 high quality BESs
2 BESs homologous to protein sequences from other plants but not found in Arabidopsis
3 Summary of the annotation of top ten most abundant repeats and the number of matches in uncharacterized BESs
4 Statistics of number of BES pairs in different stages of genome mapping
5 Structure of table pBES
6 Structure of table SSR
7 Structure of table cDNA, REPx, REFSEQ, NR
8 Structure of table SEQLEN
9 Structure of table GMAP
10 Detailed coordinates of co-mapped papaya BES pairs in heterologous genome comparisons
11 Lists of compiled plant repeat databases for repeat content analysis of papaya BESs
LIST OF FIGURES
Figures
1 Papaya BAC-end sequencing procedure
2 Comparative genome mapping procedure
3 Taxonomy tree of Caricaceae and other species that were used in comparative mapping of BES pairs
4 Selection of sequence length cutoff in high quality papaya BES
5 Characterization of all simple sequence repeats (SSRs) detected in 35,472 high quality BESs
6 Summary of plant repeat element content in 35,472 high quality BESs
7 Comparative genome mapping of BES pairs in heterologous plant genomes
8 Project website index page
9 Microsatellite search engine
10 BAC-end sequence search engine
11 Comparisons of three peach BAC clones to the poplar genome
LIST OF ABBREVIATIONS
BAC - Bacterial artificial chromosome
BES - BAC end sequence
Mbp - Megabase pairs
SSR - Simple sequence repeat
bp - Base pairs
k - Thousand
kb - Kilobase
nt - Nucleotide
INTRODUCTION
Production
Carica papaya L., commonly known as papaya, is grown primarily in tropical and subtropical regions of the world. The major papaya-producing regions are Brazil, Mexico, Nigeria, India, Indonesia, Thailand, Belize, Guatemala, the Dominican Republic, Jamaica, Puerto Rico, the Philippines, and Hawaii, the major papaya-producing state in the United States (Perez and Pollack, 2006). In 2005, the total harvested area of Hawaii papayas was 1,480 acres, and the total utilized production of papaya in Hawaii was estimated at 32.9 million pounds valued at $10.97 million (Agricultural Statistics Board, 2006; Perez and Pollack, 2006). Papaya fruit production is a major source of income for Hawaii's agriculture industry. Most papayas are grown in the Puna district of the Island of Hawaii, and 91 percent of the harvested acres in the State of Hawaii are located on the Island of Hawaii (Perez and Pollack, 2006). The genetically modified strain 'Rainbow' accounts for over half of the total acreage of current Hawaii papaya production.
Varieties & Strains
In Hawaii, the major papaya cultivar is 'Solo' and its derivatives (Morton J, 1987). 'Solo' has a high sucrose content and is usually eaten without cooking to retain its original sweetness. The 'Solo' papaya fruit varieties are usually pear-shaped and relatively small in size (~500 g) when compared to other commercial papayas. Representatives of 'Solo' derivatives include 'Kapoho', 'Sunrise', 'Sunset', 'Waimanalo' and 'SunUp'. 'Kapoho' was the major commercial variety of 'Solo' before the papaya ringspot virus became prevalent, occupying approximately one-third of the total acreage. Due to the susceptibility of the 'Kapoho' variety to the papaya ringspot virus, the transgenic 'Rainbow' papaya has become the preferred variety for the papaya industry. 'Sunrise' and 'Sunset' papaya fruits have red-orange flesh and are rich in sucrose content. The 'Waimanalo' or 'Solo' line 77 produces fruits that are excellent in firmness and quality. The 'SunUp' variety is a transgenic papaya derived from 'Sunset' that is resistant to the ringspot virus. In addition to resistance to several pests and the ringspot virus, this variety possesses several agronomically important traits, such as richness in sugar content, desirable fruit fragrance, ideal fruit size for exporting, fruit bearing height and short generation time (Ming et al., 2001). 'SunUp' and 'Kapoho' are the parent varieties of the hybrid 'Rainbow' papayas that are widely available in the fresh fruit markets of Hawaii. Mexican papayas are usually large in size and are generally not as sweet as Hawaiian papayas.
Pest & Virus Diseases
Papayas are subject to attack by a variety of pests such as fruit fly, whitefly, mites, rats, scale, insects and nematodes (Morton J, 1987). Infection by fungal pathogens is a major cause of fruit, stem and root rot in papaya plants. Genetic transformation has been developed to increase the resistance of papaya plants to fungal diseases caused by the pathogen Phytophthora palmivora (Zhu et al., 2004). Besides attack by pests and fungi, the papaya mosaic and ringspot viruses are the major threats to plant development and fruit production. The papaya ringspot virus causes major crop losses in papaya. The symptoms of virus-infected plants include ring spots and stunting of the fruit and leaves. Fruit production and the growth of plants are severely reduced. A method was developed to reduce the susceptibility of papaya to the ringspot virus by over-expressing the coat protein of the virus in the plants (Fitch et al., 1992). This discovery significantly increased the annual yield of papaya in Hawaii. Genomics could further enhance current genetic research to increase the quality and quantity of papaya production by leading to the discovery of genes for papaya horticultural improvement.
Usage
The papaya fruit is mainly used for fresh fruit consumption. The fruits can be made into fruit-jams and juices. The young leaves, unripe fruits and stems can be cooked and eaten as vegetables. Moreover, the fruits, stems, roots, and leaves of papayas are valuable in folk medicinal uses (Ockerman et al., 1993; Osato et al., 1993; Ptnina and
Sandhya, 1988). Seeds have a slightly spicy taste resembling that of pepper and can be
added to salad dressing. Fragrant oils can be extracted from dry unripe papaya seeds. The
latex collected from unripe fruits and stems is currently a main source of papain, a protease that is used industrially in meat tenderizers and digestion aids (Dunne and
Horgan, 1992). The consumption of papaya fruit may help digestion due to the presence of papain. Papaya fruits are rich in vitamin C content and low in calories. The fruits are also a good source of antioxidants such as carotenoids, flavonoids, vitamin E and folic acid. Papaya fruits also have rich content of potassium and dietary fiber.
Botanical Description & Taxonomy
Papaya is a dicotyledonous plant with nine pairs of chromosomes and three sex forms (Arumuganathan and Earle, 1991; Purseglove JW, 1968; Storey WH, 1976; Ming et al., 2001). It is a large herbaceous perennial with a single, hollow stem. Fruit production is continuous throughout the year, and the plant can grow to maturity and produce fruits roughly 9 months after seed germination. Papaya plants can be cultivated only in warm climates, and their growth requires optimal temperature and irrigation. The fruits are juicy, smooth-skinned and generally pear-shaped, green when unripe and yellow when ripe. Fruit size varies among strains.
Papaya plants bear male, female or hermaphrodite flowers, which are identified by their distinct flower morphology. Some plants occasionally bear more than one kind of flower at a time. Hermaphrodite flowers are generally self-pollinated and produce more marketable fruits. Female flowers are pollinated with pollen from hermaphrodite or male plants. Although male plants produce many fertile staminate flowers, they are usually female-sterile (Storey WH, 1985). Recently, sex-linked markers have been discovered in papaya and have allowed the screening of plant sex before the seeds are planted (Deputy et al., 2002).
Papaya is thought to have originated in the region between Mexico and Costa Rica in Central America (Candolle A de, 1908; Purseglove JW, 1968; Storey WH, 1976). It belongs to the genus Carica in the family Caricaceae, which consists of six genera (Van Droogenbroeck et al., 2004). The genera in the family other than Carica are Jacaratia, Jarilla, Cylicomorpha, Horovitzia and Vasconcellea (Badillo VM, 2000; Van Droogenbroeck et al., 2004). Vasconcellea is the largest genus and consists of 21 species (Badillo VM, 2000; Van Droogenbroeck et al., 2004). Brassicaceae, Caricaceae and their related families within the order Brassicales are characterized by the production of glucosinolates, which are secondary metabolites, and glycosides, which are involved in plant defense mechanisms against insect herbivores (Kjaer and Schuster, 1972; Kjaer A, 1976; Doughty et al., 1991; Rodman et al., 1996, 1998; Raybould and Moyes, 2001; Fahey et al., 2001; Tokuhisa et al., 2004; Kroymann et al., 2001; Kroymann and Mitchell-Olds, 2004). Caricaceae is taxonomically grouped within the order Brassicales and is a sister family of Brassicaceae, which contains the model organism Arabidopsis thaliana (Bremer et al., 1998; Rodman et al., 1996). The genome of Carica papaya was reported to be the most divergent from other species within the genus Carica (Kim et al., 2002). Some wild Carica species were reported to be more closely related to the genus Jacaratia than to C. papaya (Aradhya et al., 1999). This has resulted in the separation of papaya from the other Carica species, which are now assigned to the genus Vasconcellea (Badillo VM, 2000). Carica is therefore monotypic, containing only the single species Carica papaya (Van Droogenbroeck et al., 2004).
Potential model organism for tree-fruit plants
The discovery of sex chromosomes in plants led to speculation that sex chromosomes are derived from autosomes by mechanisms of recombination suppression (Filatov et al., 2000; Liu et al., 2004). In previous studies, papaya was found to have a primitive Y chromosome that contains a non-recombining sex determination region of 4-5 Mbp, or 10% of the chromosome (Liu et al., 2004; Ma et al., 2004). This Y chromosome has significant agronomic importance in the papaya industry since hermaphrodite (Mhm) fruits are most desirable. Prunus persica (peach) is a model organism for Rosaceae, with a genome size almost twice as large as that of Arabidopsis and a generation time of two to four years to bear fruits. When compared to peach, papaya has more variability among progeny due to the high number of seeds produced in each fruit (800-1200). Moreover, the continuous flowering and fruiting throughout the year, the considerable number of commercial varieties, and the polygamous reproductive behavior greatly enhance polymorphism in each generation (Ming et al., 2001; Verde et al., 2005). Since papaya has a small haploid genome size of 372 Mbp and a short generation time of around 9 months to produce ripe fruits, it has been suggested as a model organism of tree-fruit plants for genetic and genomic studies aimed at cloning and mapping fruit-controlling genes (Ming et al., 2001).
Genomic resources & BAC library
Genomic technologies can greatly accelerate current genetic research on papaya for agronomic improvement. Several genetic and genomic resources have been established for papaya research. Molecular genetic markers were used to map the sex determination region (Ma et al., 2004; Deputy et al., 2002). Genetic maps of the papaya Y chromosome were generated (Ma et al., 2004), and AFLP genetic analysis detected diversity among different papaya strains (Kim et al., 2002). A highly effective transformation system, established to produce virus resistance (Fitch et al., 1992), can assist in functional assays. Ming et al. (2001) constructed a papaya BAC library from two different ligation reactions of the papaya cultivar 'SunUp', using the vector pBeloBAC11 (Shizuya et al., 1992) and the HindIII restriction enzyme. The BAC library contains 39,168 clones of papaya genomic sequences with an average insert size of 132 kb, approximating 13.7 genome equivalents. The first portion of the BAC library, representing 18,700 clones, has an average insert size of 86 kb, whereas the 20,468 clones from the second ligation have an average insert size of 174 kb (Ming et al., 2001).
Arabidopsis, a model organism for plants, and rice, a model organism for cereal grasses, were sequenced largely using genomic BAC clone libraries (The Arabidopsis Genome Initiative 2000; International Rice Genome Sequencing Project 2005; Choi et al., 1995; Wang et al., 1995; Zhang et al., 1996). BAC clones can hold up to 300 kb of genomic DNA and are highly stable, so the cloned DNA can be manipulated easily and efficiently. Therefore, BAC libraries are widely used in the preliminary stages of sequencing projects for large genomes. Moreover, BAC clones can be used to build physical maps by fingerprinting each BAC clone based on restriction analysis (Chen et al., 2002), to map genes of agronomic value (Lange and Presting, 2004), to develop genetic markers by BAC-end sequencing (Tomkins et al., 2004), to carry out comparative genomics between different species (O'Neill and Bancroft 2000; Ilic et al., 2003), to screen genome structure and localize target genes on chromosomes by in situ hybridization (Cheng et al., 2001; Jiang et al., 1995), and for whole-genome sequencing (Goff et al., 2002).
BAC end sequencing and analysis
The analysis of BAC end sequences can greatly accelerate genetic and genomic studies of an organism and enhance the efficiency of whole genome sequencing. Since BAC clones with an average insert size of 150 kb are large enough to span most repetitive sequences, such as tandem repeats and transposable elements, the end sequences can assist in generating physical maps by chromosome walking to fill gaps in physical mapping (Rice Chromosome 10 Sequencing Consortium, 2003). BAC clone fingerprinting and BESs together can serve as a platform for whole genome sequencing by either a clone-by-clone approach or a shotgun sequencing approach (Rice Chromosome 10 Sequencing Consortium, 2003). The analysis of end sequences can present the first view of a genome's composition (Mao et al., 2000; Zhao et al., 2001; Hong et al., 2004). BAC-end sequences also provide genetic markers that are distributed across the genome for polymorphism detection and genetic mapping of the whole genome (Tomkins et al., 2004). Homologous sequences of other plants that are found in papaya BESs can facilitate cloning of organism-specific genes for genetic studies (Yu et al., 2005). Mapped BAC-end sequence pairs can be used for comparative genomics to elucidate genome organization and co-linearity between different plants (O'Neill and Bancroft, 2000). The sequencing of the papaya sex-determination region could enhance our understanding of the sex-determination process in papaya, allowing growers to eliminate undesired sexes before the seeds are planted (Ming et al., 2001).
Theory & Hypothesis
Three main hypotheses will be tested in this thesis:
1) Simple sequence repeats (SSRs or microsatellites) are distributed equally between genic and non-genic regions of the papaya genome. Microsatellite analysis can provide useful genetic markers for detection of polymorphism and for generating genetic maps for further genomic studies. Understanding the distribution of SSRs in a genome can facilitate the selection of types and classes of SSRs as markers for genetic studies.
2) The gene and repetitive DNA content of papaya is similar to that of Arabidopsis thaliana. Genes and repetitive elements are major components of a genome. Comparisons of the papaya genome content with other plants such as Arabidopsis allow us to discover similarities between papaya and other plants. This study will also accelerate gene discovery in papaya.
3) The genome structures of papaya and Arabidopsis are similar. Comparative genome mapping allows us to directly map papaya BAC clones to other completely sequenced plant genomes. The mapped BES pairs can elucidate positions of genomic sequences in the orthologous regions of heterologous genomes. The mapped colinear regions in a reference genome can be used to locate genes of interest and assist in whole genome sequencing of an organism.
METHODS
Processing of papaya BAC end sequences
All papaya BAC ends were sequenced prior to the sequence analyses described in this thesis. The sequencing procedure is illustrated in Figure 1. PHRED software was used to perform base calling of papaya BAC-end chromatograms and trimming of the BAC-end sequences (Ewing et al., 1998; Ewing and Green, 1998). The PHRED trimming cutoff value was set to the default value of 0.05 with the option "trim_alt" to carry out trimming based on the PHRED quality value of each base. PHD2FASTA software (http://www.genome.washington.edu) was used to convert PHRED output files into FASTA formatted sequence files and corresponding quality files representing the quality of each nucleotide in the sequence files. CROSS_MATCH software (http://www.genome.washington.edu) was used to mask the sequences of the pBeloBAC11 vector (Shizuya et al., 1992) that was used in cloning the papaya genomic DNA library. Terminal vector sequences were trimmed. Trimmed BESs of less than 50 bp in length were eliminated. The remaining high-quality papaya BESs were used for further analysis of microsatellites, repetitive elements, protein-coding regions and comparative genome mapping.
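The last two filtering steps (removal of terminal vector sequence and the 50 bp length cutoff) can be illustrated with a minimal Python sketch. It is not the pipeline used in this thesis. It assumes that vector bases have already been masked with the character 'X' in a FASTA file, and the function names parse_fasta, trim_masked_ends and high_quality are hypothetical.

```python
MIN_LENGTH = 50  # bp; the length cutoff stated in the text


def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {sequence id: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def trim_masked_ends(seq, mask_char="X"):
    """Strip vector-masked bases (assumed to be 'X') from both ends."""
    return seq.strip(mask_char)


def high_quality(records, min_length=MIN_LENGTH):
    """Keep trimmed sequences that are at least min_length bases long."""
    kept = {}
    for name, seq in records.items():
        trimmed = trim_masked_ends(seq)
        if len(trimmed) >= min_length:
            kept[name] = trimmed
    return kept
```

A sequence consisting entirely of masked bases trims to an empty string and is therefore dropped by the same length test.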
Figure 1. Papaya BAC-end sequencing procedure
All papaya BAC end sequences generated from the BAC library constructed by Ming et al. (2001) were sequenced at the Hawaii Agricultural Research Center (HARC) and the Center for Genomics, Proteomics and Bioinformatics Research Initiative (CGPBRI) prior to all the sequence analyses that are described in this thesis.
Computational identification of papaya microsatellites
For computational identification of papaya SSRs, bases with a PHRED quality value of less than 20 were converted to 'N's. Trimmed high quality papaya BESs were scanned for all mono-, di-, tri-, and tetranucleotide repeats of at least 12 nucleotides in length using a custom-made Perl script (microsat.pl). Overlapping patterns of di- and tetranucleotide repeats were recorded only once. Offset patterns (e.g. ATATA/TATAT) were counted once based on length and categorized based on the identity of the first nucleotide. Primers for developing genetic markers were picked automatically (primerDesign.pl) using PRIMER3 software (Rozen and Skaletsky, 2000) (http://fokker.wi.mit.edu/primer3). The global criteria for the PRIMER3 software were set to an optimal primer length of 20 nt and a GC content of 40-60%. Primer product size was set to 100-500 nt. For BESs containing multiple SSRs, the most suitable PCR primers were selected based on, in order of importance, 1) the availability of computationally generated flanking primer sequences, 2) the longest SSR, and 3) the shortest amplicon.
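The SSR scan described above can be sketched in Python as follows. This is an illustration under stated assumptions, not a reproduction of the Perl script used in this thesis: the function names is_primitive and find_ssrs are hypothetical, only whole copies of the repeat unit are counted, and a tract is recorded once under the motif that starts at its first nucleotide.

```python
import re


def is_primitive(unit):
    """True if the unit is not itself a repetition of a shorter unit."""
    n = len(unit)
    return not any(
        n % d == 0 and unit == unit[:d] * (n // d) for d in range(1, n)
    )


def find_ssrs(seq, min_length=12, max_unit=4):
    """Return sorted (start, end, unit) tuples for perfect repeats of
    1..max_unit nt units spanning at least min_length nt (end exclusive)."""
    seq = seq.upper()
    hits = []
    for k in range(1, max_unit + 1):
        min_copies = -(-min_length // k)  # ceiling division
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_copies - 1))
        pos = 0
        while pos < len(seq):
            m = pattern.search(seq, pos)
            if m is None:
                break
            unit = m.group(1)
            if is_primitive(unit):
                # Record the tract once and resume after it, so offset
                # patterns of the same tract are not counted again.
                hits.append((m.start(), m.end(), unit))
                pos = m.end()
            else:
                # e.g. 'AA' or 'ATAT': already covered by a shorter unit.
                pos = m.start() + 1
    return sorted(hits)
```

Because the pattern accepts only A, C, G and T, low-quality bases that were converted to 'N' interrupt a repeat. Primer selection with PRIMER3 is not covered by this sketch.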
Gene and repeat content of the papaya genome
Repetitive elements
After masking SSRs by converting SSR-containing loci (mono-, di-, tri-, and tetranucleotide repeats) to 'N's (microMasked.pl), high quality BESs were compared with 30,481 non-redundant plant repeat sequences in The Institute for Genomic Research (TIGR) plant repeat database (ftp://ftp.tigr.org/pub/data/TIGR_Plant_Repeats) using TBLASTX at an E value cutoff of 10^-4 (Altschul et al., 1990). A detailed listing of the compiled databases is given in Appendix D. BESs identified as having plant repeat homologies were annotated based on the sequence description of the best-matched sequences from the plant repeat database. The BESs containing repeat homologies were categorized based on the TIGR Code for Plant Repetitive Sequence table (http://www.tigr.org/tdb/e2k1/plant.repeats/repeat.code.shtml). The complete Arabidopsis plastid genome sequence (NC_000932.1) was used to screen all high quality papaya BESs to identify chloroplast-containing BESs using BLASTN with an E value cutoff of 10^-3, at least 90% identity, and an alignment length of at least 100 nt.
Gene content & papaya-genes not in Arabidopsis
To identify homologous coding sequences, the Arabidopsis cDNA sequence database (ftp://ftp.tigr.org/pub/data/a_thaliana/ath1/SEQUENCES) (August 26th 2004) was downloaded from TIGR and compared with the SSR-masked high quality BESs lacking homologies in the plant repeat database. These BESs were used as queries to search the Arabidopsis cDNA database using TBLASTX with an E value cutoff of 10^-6, and were annotated based on the best-matched sequences in the Arabidopsis cDNA database. BESs lacking homology in the Arabidopsis cDNA database were further compared with the plant Reference Sequence (RefSeq) protein database (ftp://ncbi.nih.gov/refseq/release/plant/) using TBLASTX with an E value cutoff of 10^-6. BESs with no homologies in the plant repeat, cDNA, or RefSeq databases were compared to the non-redundant (nr) protein database (ftp://ftp.ncbi.nih.gov/blast/db/FASTA/) using BLASTX with a cutoff value of 10^-6 and an alignment length of at least 34 amino acids. To identify papaya genes that are absent in the Arabidopsis genome, the following procedure was used. 1) BESs with homology in the RefSeq or nr database but not in the Arabidopsis cDNA database were compared with the Arabidopsis chloroplast (NC_000932.1), mitochondrial (NC_001284.2) and whole genome sequences using TBLASTX at a cutoff value of 10^-6. 2) In a second step, their homologous protein sequences from the RefSeq and nr databases were also compared with the Arabidopsis whole genome sequences using TBLASTN at a cutoff value of 10^-6. If both the BAC-end sequence and its homologous sequence were not matched in these comparisons, the gene was assumed to be absent in the Arabidopsis genome. Protein sequences that were homologous to these papaya BESs with an E value of 10^-10 or lower and an alignment length of at least 34 amino acids were identified.
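The stepwise annotation and the absence rule can be summarized in a short Python sketch. It is a simplified illustration only: the Hit record, the database labels and the function names are hypothetical, BLAST itself is not run here, and the cutoff for the absence test is passed in by the caller rather than fixed in the code.

```python
from typing import Iterable, NamedTuple, Optional


class Hit(NamedTuple):
    database: str   # hypothetical label for the database that was searched
    evalue: float
    align_len: int  # alignment length (amino acids for translated searches)


# (maximum E value, minimum alignment length), in the order in which the
# databases are consulted in the text.
CUTOFFS = {
    "plant_repeat": (1e-4, 0),
    "arabidopsis_cdna": (1e-6, 0),
    "refseq": (1e-6, 0),
    "nr": (1e-6, 34),
}


def passes(hit: Hit, max_evalue: float, min_align_len: int = 0) -> bool:
    return hit.evalue <= max_evalue and hit.align_len >= min_align_len


def annotate(hits: Iterable[Hit]) -> Optional[str]:
    """Return the first database in the cascade with an accepted hit."""
    accepted = {h.database for h in hits
                if h.database in CUTOFFS and passes(h, *CUTOFFS[h.database])}
    for database in CUTOFFS:
        if database in accepted:
            return database
    return None


def absent_in_arabidopsis(bes_hits, protein_hits, max_evalue: float) -> bool:
    """True when neither the BES nor its homologous protein sequence has
    an accepted match to the Arabidopsis sequences."""
    bes_found = any(passes(h, max_evalue) for h in bes_hits)
    protein_found = any(passes(h, max_evalue) for h in protein_hits)
    return not bes_found and not protein_found
```

A BES is thus assigned to at most one category, which mirrors the way each later database is searched only with the BESs left unannotated by the earlier ones.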
Papaya-specific repeats
The most abundant BESs lacking homology to the plant repeat, Arabidopsis cDNA, RefSeq, or nr protein databases were identified by comparing these BESs against each other using BLASTN at an E value cutoff of 10^-10, with up to 1,000 matches of BLAST output collected for each query. The BLAST results were imported into a database using importBlastResult.pl. The query with the largest number of matched BESs with at least 100 nt alignment length was identified; this query and all of its matches were then removed from the database to identify the next most frequently matched BAC-end sequence. This procedure was iterated to identify the top ten queries that were most frequently matched by other BESs. These BESs were compared to the non-redundant nucleotide (nt) database using BLASTN through a web-based interface (http://www.ncbi.nlm.nih.gov/blast/) to identify potential homologies. Each unknown query lacking homologies in the nt database and all of its matches (BESs) of at least 100 nt alignment were aligned using ClustalX to identify potential consensus sequences.
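The iterative selection of the most frequently matched queries is a greedy procedure, sketched below in Python. The input format (a dictionary from each query BES to the set of BESs it matched with at least 100 nt of alignment) and the function name top_repeat_queries are assumptions made for illustration; the thesis performed this step on BLAST results stored in a database.

```python
def top_repeat_queries(matches, n_top=10):
    """Greedily pick the queries with the most matched BESs.

    matches: dict mapping a query BES id to the set of BES ids it matched.
    Returns a list of (query id, number of matches), most abundant first.
    """
    # Ignore self-matches so that a query does not count itself.
    remaining = {q: set(m) - {q} for q, m in matches.items()}
    selected = []
    while remaining and len(selected) < n_top:
        # Ties are broken alphabetically to keep the result deterministic.
        best = max(sorted(remaining), key=lambda q: len(remaining[q]))
        if not remaining[best]:
            break  # no query has any match left
        family = remaining[best] | {best}
        selected.append((best, len(remaining[best])))
        # Remove the query and all of its matches before the next round.
        remaining = {q: m - family
                     for q, m in remaining.items() if q not in family}
    return selected
```

Removing each selected query together with its matches keeps one abundant repeat family from occupying several of the top ten positions.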
Assessing co-linearity of plant genomes by mapping of BAC-end sequence pairs
Download and preparation of sequenced genomes for comparative genome mapping
Arabidopsis thaliana (ftp://ftp.tigr.org/pub/data/a_thaliana/ath1/SEQUENCES/), Populus trichocarpa (poplar, http://genome.jgi-psf.org/Poptr1/Poptr1.download.ftp.html) and Oryza sativa (rice, ftp://ftp.tigr.org/pub/data/Eukaryotic_Projects/o_sativa/annotation_dbs/pseudomolecules/version_3.0/) whole genome sequences were downloaded from TIGR and the Joint Genome Institute. The downloaded plant genome molecule sequences were fragmented into pieces of 300 kb each with 1020 nt overlap (fragmFsta.pl).
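Fragmenting a genome molecule into overlapping pieces can be sketched as follows. The function name fragment and its return format are hypothetical, and the default values simply restate the fragment size and overlap given above; the Perl script used in this thesis may differ in detail.

```python
def fragment(seq, size=300_000, overlap=1020):
    """Split seq into pieces of at most `size` bases, where consecutive
    pieces share `overlap` bases. Returns (offset, piece) tuples."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than the fragment size")
    if not seq:
        return []
    step = size - overlap
    pieces = []
    start = 0
    while True:
        end = min(start + size, len(seq))
        pieces.append((start, seq[start:end]))
        if end == len(seq):
            break
        start += step
    return pieces
```

Keeping the offset of each piece allows a match position within a fragment to be converted back to a coordinate on the original molecule, and the overlap reduces the chance that a match spanning a fragment boundary is lost.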
Comparative genome mapping of papaya BES pairs
SSR-masked high quality BESs lacking homologies in the plant repeat database were selected to form forward and reverse sequence pairs. These pairs were compared separately to each of the fragmented Arabidopsis, poplar and rice genome sequences using TBLASTX with a cutoff value of 10^-6. When the best-matched homologous sequences of both the forward and the reverse end were identified in the same linkage group, within a region of 10 to 300 kb, and arranged in the correct orientation, the pair was considered to map a potential syntenic region between the corresponding BAC sequence and the orthologous genomic sequence of the target genome. The comparative genome mapping procedure is diagrammed in Figure 2.
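The co-mapping criteria can be written as a small Python predicate. This is a sketch under explicit assumptions: the EndHit record and the function is_comapped are hypothetical, and "correct orientation" is interpreted here as the two best hits lying on opposite strands and pointing toward each other, which the text does not spell out.

```python
from typing import NamedTuple


class EndHit(NamedTuple):
    chrom: str   # linkage group / chromosome of the best hit
    start: int   # genome coordinates of the hit, with start <= end
    end: int
    strand: str  # '+' or '-'


def is_comapped(fwd: EndHit, rev: EndHit,
                min_span: int = 10_000, max_span: int = 300_000) -> bool:
    """Apply the three criteria: same linkage group, convergent
    orientation (an assumption), and a span of 10 to 300 kb."""
    if fwd.chrom != rev.chrom:
        return False
    if fwd.strand == rev.strand:
        return False
    # The '+' strand hit must lie to the left of the '-' strand hit.
    left, right = (fwd, rev) if fwd.strand == "+" else (rev, fwd)
    if left.start > right.start:
        return False
    span = right.end - left.start
    return min_span <= span <= max_span
```

The lower bound of 10 kb excludes pairs whose ends match the same small region, while the upper bound of 300 kb reflects the maximum insert size a BAC clone can hold.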
Figure 2. Comparative genome mapping procedure
Comparative genome mapping was done computationally by using TBLASTX to compare the forward and reverse end sequence pairs to the other completely sequenced plant genomes. The BES pairs mapped in the correct orientation and within 10 to 300 kb were considered potentially colinear between the BAC sequences and the heterologous genomic sequences.
Comparative genome mapping of control BES pairs from other plant species
The Arabidopsis, poplar, Brassica, and tomato BAC-end sequence pairs were mapped to the Arabidopsis and poplar genomes as controls (Figure 3). A total of 120,000 TAMU (Choi et al., 1995) and IGF (Mozo et al., 1998) Arabidopsis BESs were downloaded from the TIGR website (ftp://ftp.tigr.org/pub/data/a_thaliana/bac_end_sequences/atends). In addition, the poplar genome was fragmented into artificial BACs of 120 kb in length with 10 kb gaps. Exactly 500 bp of forward and reverse end sequences were collected to generate 7,135 artificial poplar BES pairs. Another set of 38,245 Brassica rapa (Brassica rapa [ORGANISM] BAC "end sequence") and 50,000 Lycopersicon esculentum (tomato, lycopersicon [ORGANISM] BAC pBeloBAC11) BESs were downloaded from the NCBI website (http://www.ncbi.nlm.nih.gov/). All of these BESs were processed using the same procedures, by masking SSRs (microMasked.pl) and eliminating repeat-containing BESs (rmRepeat.pl), to yield 3,648 Arabidopsis, 5,685 poplar, 8,525 Brassica, and 10,394 tomato low-copy BES pairs for comparisons with the Arabidopsis and poplar genomes as controls. Another subset of 185 Arabidopsis BES pairs was processed under the same conditions, by masking SSRs and eliminating repeat-containing BESs, to yield 102 low-copy BES pairs for a comparison with the Arabidopsis genome.
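The construction of artificial BES pairs from a sequenced genome can be sketched in Python as shown below. The function name artificial_bes_pairs is hypothetical. The sketch assumes that only windows fitting entirely within the molecule are used, and it returns both ends on the same strand because the text does not state whether the reverse end was reverse-complemented.

```python
def artificial_bes_pairs(seq, bac_len=120_000, gap=10_000, end_len=500):
    """Tile seq with artificial BACs of bac_len bases separated by gaps of
    `gap` bases, and collect end_len bases from each end of every BAC.
    Returns (start offset, forward end, reverse end) tuples."""
    pairs = []
    start = 0
    while start + bac_len <= len(seq):
        bac = seq[start:start + bac_len]
        pairs.append((start, bac[:end_len], bac[-end_len:]))
        start += bac_len + gap
    return pairs
```

Such artificial pairs come from a known location in the reference genome, so they indicate how many pairs the mapping procedure recovers when perfect co-linearity is expected.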
17 Figure 3. Taxonomy of Carieaceae and other species that were used in comparative mapping of DES pairs.
0"",""""" llTaJUh.'f1 Tapa
C/Uicupoptn"o
[Figure: taxonomic tree of the taxa compared in this study.] All genera within the family Caricaceae are listed here. The BAC-end sequences of Arabidopsis thaliana, Brassica rapa, Populus trichocarpa, Lycopersicon esculentum, and Oryza sativa were compared to heterologous genomes to determine the level of co-linearity between them.

RESULTS

Trimming and processing of papaya BAC end sequences

A total of 50,661 BAC end sequence chromatograms were base-called and trimmed using PHRED with default settings to produce 39,590 trimmed BESs (Ewing et al., 1998; Ewing and Green, 1998). Among these, 1,548 sequences consisted entirely of vector sequence and were removed. Another 2,570 BESs shorter than 50 bp were eliminated (Figure 4) because of potential contamination of the end sequences, leaving 35,472 high quality BESs from 20,842 BAC clones. These BESs represent 17,483,563 high quality bases, or 4.7% of the papaya genome. The average trimmed sequence was 493 bases in length, with a range of 50 to 899 bases. Eighty-seven percent of nucleotides had PHRED quality values of at least 20, and the GC content of these BESs was 35%. All high quality BESs of at least 50 bp in length were deposited in the GenBank GSS database (Accession numbers: DX458351-DX502755).

Figure 4. Selection of sequence length cutoff in high quality papaya BESs. Contaminated sequences due to incomplete purification after labeling often fall under 50 bp in length. BESs that were shorter than 50 bp were removed from the subsequent analysis. [Plot: Distribution of sequence length in 35,472 high quality C. papaya BESs.]

Computational identification of papaya microsatellites

A total of 7,456 simple sequence repeats (SSRs) of at least 12 nucleotides in length were identified in 5,452 (15.4%) BESs. Of these SSRs, 22.9%, 37.8%, 17.7%, and 21.5% were mono-, di-, tri-, and tetranucleotide repeats, respectively (Table 1).

Table 1. Summary of simple sequence repeats in 35,472 high quality BESs. Of all analyzed BESs, 19.1% contained Arabidopsis cDNA homologies and 16.2% contained plant repeat homologies. The percentages of SSRs within BESs that contain coding regions, repeat homologies, or no homologies (non-genic) are summarized here. Note: Multiple SSRs were identified in some of these BESs, and several SSRs can flank a cDNA or plant repeat homology region in a particular BES.

Of these SSRs, 1,174 (15.7%) span at least 20 bp and represent hypervariable markers. Dinucleotide repeats were the most abundant class of microsatellite, followed by mononucleotides. Poly (T) and poly (A) were the most abundant homopolymers, representing 882 (51.6%) and 736 (43.1%) of all homopolymers, respectively. Poly (AT) and poly (TA) were the most abundant dinucleotides, representing 1,137 (40.3%) and 844 (29.9%) of all dinucleotides, respectively. Poly (AG) and poly (GA) were the longest tandem repeats. Only a small number of poly (CG)/(GC) repeats were identified. Poly (AAT) and poly (ATT) were the most abundant trinucleotide repeats, representing 12.2% (161) and 9.0% (119) of that class of repeats. The longest trinucleotide repeat, a poly (TTA) tandem repeat, was 60 bp in length. Poly (AATT) and poly (AAAT) were the most abundant tetranucleotide repeats, representing 9.7% (156) and 8.1% (130) of that class of repeats (Figure 5).
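As an illustration only, the SSR criteria described above (perfect mono- to tetranucleotide tandem repeats of at least 12 nucleotides, with repeats of at least 20 bp treated as hypervariable) can be sketched in Python. This is not the software used in this study; the function names find_ssrs and is_hypervariable are hypothetical, and the sketch assumes perfect (uninterrupted) repeats and a greedy left-to-right scan.

```python
MIN_LEN = 12        # minimum SSR length in nucleotides, as stated in the text
HYPERVARIABLE = 20  # SSRs of at least 20 bp are treated as hypervariable


def find_ssrs(seq, min_len=MIN_LEN, max_unit=4):
    """Return (start, unit, length) for perfect tandem repeats.

    Assumption: only uninterrupted repeats with unit sizes 1..max_unit
    are reported, scanning left to right without overlaps.
    """
    seq = seq.upper()
    hits = []
    pos = 0
    while pos < len(seq):
        best = None
        for k in range(1, max_unit + 1):
            unit = seq[pos:pos + k]
            if len(unit) < k or "N" in unit:
                continue
            # Skip units that are repeats of a shorter unit (e.g. "AA", "ATAT").
            if any(unit == unit[:d] * (k // d) for d in range(1, k) if k % d == 0):
                continue
            n = 1
            while seq[pos + n * k: pos + (n + 1) * k] == unit:
                n += 1
            length = n * k
            if n >= 2 and length >= min_len and (best is None or length > best[2]):
                best = (pos, unit, length)
        if best:
            hits.append(best)
            pos = best[0] + best[2]  # continue after the reported repeat
        else:
            pos += 1
    return hits


def is_hypervariable(length):
    """True for repeats spanning at least 20 bp."""
    return length >= HYPERVARIABLE


if __name__ == "__main__":
    demo = "GGC" + "AT" * 8 + "CCG" + "A" * 12 + "TTG" + "AAT" * 3
    print(find_ssrs(demo))  # [(3, 'AT', 16), (22, 'A', 12)]
```

The trailing (AAT)3 in the demo sequence is only 9 bp long and is therefore not reported, which mirrors the 12-nucleotide cutoff.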
Of these 5,452 SSR-containing BESs, 16.9% (922) and 5.8% (318) exhibit homology to Arabidopsis cDNAs and plant repeats, respectively, compared with 19.1% and 16.2% of all BESs. Therefore, across the comparisons shown in Table 1, most SSRs were underrepresented in the BESs that were associated with cDNAs, whereas GC-rich trinucleotides were overrepresented in the proximity of genic regions. All classes of microsatellites were underrepresented in the BESs that were associated with plant repeat elements. Most SSRs were overrepresented in the non-genic regions. The number of SSRs tended to decrease as the length of the microsatellite increased (Figure 5). AT-rich SSRs were consistently more abundant than GC-rich SSRs.

Figure 5. Characterization of all simple sequence repeats (SSRs) detected in 35,472 high quality BESs. Numbers of copies are plotted against the length for each microsatellite.

Gene and repeat content of the papaya genome

Repetitive elements

A comparison of the 35,472 high quality papaya BESs to the compiled TIGR plant repeat database revealed 5,733 BESs (16.2%) with homology to plant repeat elements (Figure 6). A total of 4,733 (83.3%) contain retrotransposons, which were the most abundant repeats. BESs homologous to retrotransposons were further classified as long terminal repeat retrotransposons, comprising gypsy-like (24.9%) and copia-like (24.2%) elements; 1,539 (26.8%) of the retrotransposons were not assigned to any specific group within the class.
BLAST searches showed that these mostly belonged to other retroelement-derived sequences. The next most abundant repeats were class II transposons (426, or 7.4%). Only 9 miniature inverted-repeat transposable elements (MITEs) were found. A total of 242 (4.2%) centromere-related and 19 (0.3%) telomere-related repetitive elements were identified. Moreover, 2.9% of these repeat-containing BESs showed rDNA homology. Unclassified transposable elements accounted for 97, or 1.7%, of all transposable elements found. Screening of all high quality BESs against the complete Arabidopsis chloroplast genome revealed 290 BESs from 230 different BAC clones with chloroplast sequence homology.

Figure 6. Summary of plant repeat element content in 35,472 high quality BESs. High quality BESs were compared with the plant repeat database as described in Methods. Classification categories were based on the Code for Plant Repetitive Sequence table from the TIGR website. a) Distribution of major classes of repeats in papaya. b) Detailed categorization of BESs with homology to each class and repeat element.

Gene content and papaya genes not found in Arabidopsis

A total of 6,769 papaya BESs without matches in the plant repeat database contained Arabidopsis thaliana cDNA homology (representing 19.1% of all 35,472 high quality papaya BESs). Of these, 2,659 (39.3%) BESs were classified as "unknown", "putative", "hypothetical", or "expressed protein". Homologies to RefSeq were identified in 1,124 (3.2%) BESs lacking homology in the Arabidopsis cDNA database. In addition, 644 (1.8%) BESs lacking homology in both the Arabidopsis cDNA and RefSeq databases contained non-redundant protein sequence homology.
Of these 1,768 (5%) BESs with homologies in the RefSeq and nr protein databases, 170 BESs were identified as retrotransposon-encoded proteins and were eliminated from further analysis. Another 919 BESs either had homology to the Arabidopsis mitochondrial, chloroplast, or whole genome sequence, or their homologous protein sequences matched sequences of the Arabidopsis genome. Among the remaining 679 BESs, 141 had an alignment length of at least 34 amino acids to the identified homologous protein sequences, and their source organisms were identified. Two BESs annotated as homologous to Arabidopsis protein sequences were clearly microsatellite sequences and were eliminated from further consideration. A total of 41 BESs were homologous to other plant protein sequences, and 98 BESs were homologous to protein sequences from non-plant organisms. Detailed information on the BESs homologous to protein sequences from other plants but not Arabidopsis is listed in Table 2.

Table 2. BESs homologous to protein sequences from other plants but not found in Arabidopsis. This table lists papaya BESs that are homologous to protein sequences from other plants but not in Arabidopsis. The identification of BAC-end sequences, GenBank accession numbers of homologous protein sequences, sources of protein sequences, and the sequence annotations are listed here.

Papaya-specific repeats

The BESs without homologies in the TIGR plant repeat, Arabidopsis cDNA, RefSeq genomic, and non-redundant protein databases were grouped to identify the 10 most frequently matched queries (BAC-end sequences) among these unknown elements (Table 3). Six of these queries showed homology to the Carica papaya sequences cpsm54, cpsm49, cpsm94 (cpsm: sex-linked AFLP markers), and cpbe55 (cpbe: BAC-end sequence), which are annotated as Y chromosome male-specific sequences.
Four queries identified as Carica papaya sequence cpsm54 (gi|37992834) (474 bp) were matched by 305, 178, 177, and 107 BESs, for a total of 767 matched BESs that form the largest cluster of associated end sequences. Two queries identified as Carica papaya sequences cpsm49 (gi|37992830) (587 bp) and cpbe55 (gi|37992870) (409 bp) account for 161 and 120 matched BESs, respectively. Four unknown queries that were identified as papaya-specific repeats matched 140, 116, 106, and 103 BESs among the uncharacterized portion of the BESs.

Table 3. Summary of the annotation of the ten most abundant repeats and the number of matches in uncharacterized BESs

Query       No. of matched BESs   Homology
30B-C12.r   305                   gi|37992834 C. papaya isolate cpsm54 Y male specific seq.
35C-A09.r   178                   gi|37992834 C. papaya isolate cpsm54 Y male specific seq.
47A-C11.r   177                   gi|37992834 C. papaya isolate cpsm54 Y male specific seq.
4B-F10.r    161                   gi|37992830 C. papaya isolate cpsm49 Y male specific seq.
53A-E11.f   140                   Novel
44C-H12.r   120                   gi|37992870 C. papaya isolate cpbe55 Y male specific seq.
94B-G03.r   116                   Novel
46C-C11.r   107                   gi|37992834 C. papaya isolate cpsm54 Y male specific seq.
28D-B03.r   106                   Novel
25C-H06.f   103                   Novel

The number of occurrences of each repeat in the uncharacterized BESs is the number of its matches (unique Subject IDs). The sequence accession number and description are provided so that the sequence information can be retrieved from the NCBI GenBank database.

Co-linearity of plant genomes revealed by mapping of BAC-end sequence pairs

Comparative genome mapping of papaya BES pairs

High quality papaya BESs with plant-repeat homologies were eliminated, resulting in the selection of 9,038 low-copy papaya BES pairs. These BES pairs were subsequently mapped to the Arabidopsis, poplar, and rice genomes, yielding 53, 167, and 11 mapped papaya BES pairs, respectively. The largest number of mapped papaya BES pairs was identified in the poplar genome sequence.
The average length of these mapped regions, defined by the locations of the homologous forward and reverse sequences in the target genome, was 74 kb, which approximates the actual insert size of the papaya BAC library. Only 53 papaya BES pairs were mapped to the Arabidopsis genome, the most closely related organism among all currently available completely sequenced plants. The average length of mapped regions in Arabidopsis was 46.4 kb. Only 11 papaya BES pairs co-mapped with correct orientation and within 10-300 kb in the rice genome, with an average length of 47 kb. All mapping results for each step are listed in detail (Table 4). The exact coordinates of mapped papaya BES pairs in the heterologous genomes are provided in Appendix C. The numbers of BES pairs co-mapping to different plant genomes were plotted against the distance separating the BAC ends in the target genome (Figure 7). In a separate study, three completely sequenced peach BAC clones were compared to the poplar genome to identify potential syntenic regions and to better understand the genome structure of poplar, which was used in this study. Details of the results are shown in Appendix B.

Table 4. Statistics of number of BES pairs in different stages of genome mapping. The results of mapping C. papaya, A. thaliana, P. trichocarpa, L. esculentum, and B. rapa BES pairs to whole genome sequences are listed here for each successive step with increasingly stringent criteria. Arabidopsis BESs were mapped to the Arabidopsis genome as a control. The percentage in parentheses at each step represents the ratio of BES pairs that remained from the previous stage.

Figure 7. Comparative genome mapping of BES pairs in heterologous plant genomes. Distribution of distance between BES pairs that co-mapped in heterologous genomes with correct orientation and within 10-300 kb. a) Low-copy papaya, Arabidopsis, poplar, tomato, and Brassica BES pairs were mapped to heterologous genomes. b) Brassica BES pairs that co-mapped to the Arabidopsis genome are displayed separately for higher resolution. A subset of low-copy Arabidopsis BES pairs was mapped to the Arabidopsis genome as a control.

Comparative genome mapping of Arabidopsis BES pairs

Arabidopsis BES pairs co-mapped to the poplar genome in only 5 cases (0.8% overall), the lowest number of mapped BES pairs among all comparisons. In addition, the mapped BES pairs spanned the largest regions (230 kb on average) among all comparisons. Conversely, of the 102 Arabidopsis control BES pairs mapped to the Arabidopsis genome, nearly ninety percent or more were retained at each successive mapping step. A total of 94 Arabidopsis BES pairs, accounting for 94.9% of the dataset, co-mapped to their own genome. The average length of mapped regions was 97 kb, which approximates the actual average insert size (100 kb) of the TAMU (Choi et al., 1995) and IGF (Mozo et al., 1998) Arabidopsis BAC libraries.

Comparative genome mapping of poplar BES pairs

Poplar BES pairs co-mapped to the Arabidopsis genome in 19 cases, an overall success rate of 7.2%. The average length of mapped regions was 34.7 kb, which is considerably smaller than the actual insert size of the poplar BAC clones (120 kb). The overall success rate (7.2%) and the ratio of BES pairs co-mapped in the Arabidopsis genome with correct orientation (35.2%) also resemble those of the papaya-Arabidopsis comparison (6.8% and 35.3%, respectively).
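The successive filters applied to each BES pair in the comparisons above (both ends matched, same chromosome, correct orientation, and a separation of 10-300 kb) can be summarized in a short Python sketch. It is a hedged illustration rather than the pipeline used in this study: the Hit record, the classify_pair function, and the convention that a correctly oriented pair has its left end on the "+" strand and its right end on the "-" strand (ends read inward from the vector) are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Optional

MIN_SPAN = 10_000    # lower distance bound (10 kb), from the text
MAX_SPAN = 300_000   # upper distance bound (300 kb), from the text


@dataclass
class Hit:
    """Best match of one BAC end on the target genome (hypothetical record)."""
    chrom: str
    start: int    # leftmost coordinate of the match
    end: int      # rightmost coordinate of the match
    strand: str   # "+" or "-"


def classify_pair(fwd: Optional[Hit], rev: Optional[Hit]) -> str:
    """Assign a BES pair to the first mapping stage it fails, or 'co_mapped'."""
    if fwd is None or rev is None:
        return "no_match"
    if fwd.chrom != rev.chrom:
        return "different_chromosome"
    left, right = sorted((fwd, rev), key=lambda h: h.start)
    # Assumed convention: the two ends of one clone face each other.
    if not (left.strand == "+" and right.strand == "-"):
        return "wrong_orientation"
    span = right.end - left.start
    if MIN_SPAN <= span <= MAX_SPAN:
        return "co_mapped"
    return "out_of_range"


if __name__ == "__main__":
    f = Hit("chr1", 1_000, 1_500, "+")
    r = Hit("chr1", 75_000, 75_500, "-")
    print(classify_pair(f, r))  # co_mapped
```

Counting the labels returned for all pairs of a library would give the per-stage totals of the kind reported in Table 4, under the stated assumptions.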
Comparative genome mapping of Brassica BES pairs

Brassica BES pairs co-mapped to the Arabidopsis genome at a rate of 52.7%, the highest success rate. Among all heterologous comparisons, it is the only case in which over fifty percent of the BES pairs were mapped. In addition, 45.6% (3,889) of the total Brassica BES pairs were homologous to the Arabidopsis genome. The average length of mapped regions in the Brassica-Arabidopsis comparison was 135.1 kb. Both the ratio of pairs mapping in the correct orientation and the ratio spanning a region within 10 to 300 kb were approximately 86%. Brassica BES pairs co-mapped poorly to the poplar genome, yielding only 20 mapped BES pairs (1% overall success rate) in correct orientation and within 10-300 kb.

Comparative genome mapping of tomato BES pairs

Tomato BES pairs co-mapped to the Arabidopsis and poplar genomes with an overall success rate of approximately 8%. The ratio of co-mapped BES pairs within 10-300 kb was higher in the tomato-poplar (68.5%) than in the tomato-Arabidopsis (37.3%) comparison. Moreover, the average size of mapped regions in the tomato-poplar comparison (117.3 kb) was more than twice that in the tomato-Arabidopsis comparison (45.3 kb). A comparison of tomato with Arabidopsis and poplar yielded 408 (3.9%) and 470 (4.5%) matched tomato BES pairs, the lowest percentages of BLAST matches among all comparisons.

DISCUSSION

Processing of papaya BAC end sequences

Approximately seventy percent of the BAC end sequencing reads yielded high quality data. BESs shorter than 50 bp in length were removed, since contaminated BESs resulting from incomplete purification after labeling are often less than 50 bp. Accordingly, 6.5% of the BESs were eliminated to minimize false-positive results in subsequent analysis. Only 3.9% of the BESs consisted entirely of vector sequences, which may represent the actual percentage of empty BAC clones in the library. Based on the restriction digest analysis reported by Ming et al.
(2001), the estimated proportions of empty clones for the two halves of the BAC library are approximately 3.5 and 4.6%, which is consistent with the present BES data. Screening with BLAST under highly stringent conditions revealed a total of 1.1% chloroplast-containing BESs. Ming et al. (2001) also reported that a total of 504, or 1.4%, of the BAC clones contained chloroplast DNA, as identified by hybridization with the sorghum chloroplast probes rpoB and trnK.

Computational identification of papaya microsatellites

Microsatellites are tandem repeats of short sequences that show high levels of polymorphism, since slipped-strand mispairing happens frequently at these sites during DNA replication, repair, or recombination (Levinson and Gutman, 1987). Since SSR-containing genetic loci display high levels of polymorphism, genetic markers can be developed for variable lengths of tandem repeats to determine genetic diversity among papaya cultivars. Genetic markers developed from SSR loci that are linked to certain phenotypes provide easy assays for specific traits of interest. Since microsatellite markers can be detected easily by PCR using flanking primers, SSR markers can be used to construct genetic and physical maps of the papaya genome. Primers for developing genetic markers in papaya were designed automatically for the analysis reported in this project.

A greater number of class I, hypervariable markers (at least 20 bp), and class II, potentially variable markers (less than 20 bp), were identified in non-genic regions than in protein-encoding regions of the papaya genome. Class II SSRs are thought to be less variable than class I SSRs because long repeats have a higher chance of slipped-strand mispairing. Therefore, hypervariable SSR loci can be a useful resource for developing markers to construct genetic maps, because these loci exhibit high levels of polymorphism.
Across the whole dataset of high quality papaya BAC-end sequences, poly (A) and poly (T) were dominant among mononucleotide repeats. Dinucleotide tandem repeats were the most abundant tandem repeats among all mined SSRs. The abundance of the dinucleotides poly (AT) and poly (TA) is consistent with the findings in yeast and Arabidopsis (Katti et al., 2001). Temnykh et al. (2001) reported that poly (AT) and poly (TA) sites yielded poor amplification in PCR reactions when genetic markers were developed to flank these loci, possibly because of an association with non-coding regions and repetitive elements in rice. In this study, AT-rich dinucleotide repeats are overrepresented in non-genic regions, and SSRs are underrepresented in repeat-containing BESs.

Trinucleotide tandem repeats were found to be the least abundant. Only the expansion or shrinkage of trinucleotide repeats can be tolerated within protein coding regions, because changes in these repeats do not cause a frame shift during translation. However, the resulting alteration of protein structure could still lead to the suppression of trinucleotide tandem repeats located within protein coding regions. AT-rich trinucleotide tandem repeats were the most abundant trinucleotide microsatellites in papaya, which is consistent with the findings in both Arabidopsis and yeast (Katti et al., 2001). Most tetranucleotide tandem repeats were found to be AT-rich. GC-rich regions are generally associated with regions of high gene density in the genome. Since these BESs are presumably randomly distributed across the genome of papaya, most BESs are expected to span non-genic regions and contain mostly AT-rich microsatellites. The frequency of all mono-, di-, tri-, and tetranucleotide tandem repeats decreased exponentially as the length of the tandem repeats increased. This phenomenon can be explained by the instability of long tandem repeats, which leads to an increase in the mutation rate.
Mutations that shorten the tandem repeats prevent their unlimited growth (Wierdl, Dominska, and Petes, 1997; Kruglyak et al., 1998). Moreover, the mined microsatellites were found to be underrepresented in BESs with plant repeat homologies. This is consistent with the analysis of Temnykh et al. (2001), who used a HindIII BAC library of rice for their analysis. Papaya BESs had an SSR distribution similar to that of the peach, almond, and Rosa EST assembled sequences (Jung et al., 2005). Although the abundance of different patterns of trinucleotide repeats varies among plant species, more trinucleotide repeats than dinucleotide repeats were found in the coding regions of peach, almond, and Arabidopsis ESTs. In contrast, trinucleotide tandem repeats are the least abundant in papaya. Dinucleotide repeats were the most abundant simple repeats in the untranslated regions (UTRs) of peach, almond, and Rosa ESTs. In agreement with the findings of Temnykh et al. (2001), who showed that AT-rich dinucleotide repeats and trinucleotide repeats of rice BESs were overrepresented in non-genic regions, AT-rich dinucleotide repeats and trinucleotide repeats were underrepresented in papaya BESs with cDNA homologies. To make these SSR markers and their corresponding PCR primers available to the public, a listing of all SSRs and their primer pairs is accessible at the project website (http://www.genomics.hawaii.edu/papaya/BES/ssrNP.cgi).

Gene and repeat content of the papaya genome

Repetitive elements

Repeats, which consist mostly of transposons, are found in all eukaryotes. These repetitive sequences can cause genome rearrangements, sequence mutations, and the duplication, insertion, and translocation of genes. Repetitive sequences are a major driving force of evolution in nature. Repeats that are specific to an organism can be used as markers for in situ hybridization.
A high repeat content in any genome can increase the difficulty of sequencing the whole genome. Examining the repeat composition of papaya allows evaluation of its suitability as a candidate model organism for tree-fruit plants. Sixteen percent of high quality papaya BESs contained homology to known plant repeat elements. Most of them were transposons (5,208, or 90.8%), representing 14.7% of the high quality BESs. In Arabidopsis, at least 10% of the genome consists of transposons. Most known plant repeat elements found in Arabidopsis were also found in the papaya BESs. In Arabidopsis, most class I retrotransposons are located in centromeric regions, whereas the class II transposons are found predominantly in the arms of the chromosomes. In papaya, the physical location of the retrotransposons cannot be determined until physical maps of the BAC clones are available. Papaya BESs that contain retrotransposons, which accounted for 13.5% of all papaya BESs, were the most abundant. Like Arabidopsis, papaya has more class I retrotransposons than class II transposons. The ratio of gypsy-like to copia-like retrotransposons is almost 1:1 in papaya BESs, whereas the ratio of gypsy-like to copia-like retrotransposable elements is approximately 2:1 in the completely sequenced rice genome (International Rice Genome Sequencing Project, 2005). This difference in the relative abundance of gypsy-like retrotransposons indicates a significant genetic difference between papaya and rice. DNA-based transposons were the second most abundant repeats in papaya. Plant centromeres often contain transposons, and 242 centromere-specific sequences were found. The BESs with centromeric repeats may identify BAC clones that originate from the centromere.

Gene content and papaya genes not found in Arabidopsis

A comparison of the 35,472 high quality papaya BESs with the Arabidopsis thaliana cDNA database revealed 6,769 cDNA homologies.
These homologies account for approximately 19.1% of all papaya BESs, suggesting that Arabidopsis and papaya share many genes. We can extrapolate that the protein-encoding regions of papaya will total 71.1 Mbp, representing 35,526 genes, presuming that the average gene length of papaya is similar to that of Arabidopsis (~2 kb) (The Arabidopsis Genome Initiative, 2000). Papaya thus appears to have a gene content similar to that of Arabidopsis, whose genome contains 31,407 genes (NCBI, Nov 17th 2005). The protein coding regions that were identified in this study can be used to clone potential candidate genes or to locate genes within BAC clones that are in close proximity to them. The identified genes can also be used as genetic markers to assist in whole genome sequencing. The high rate of hybridization of Arabidopsis cDNA probes (91%) to the papaya BAC clone library is consistent with this study and with the close relationship of the Caricaceae and Brassicaceae families (Bowers et al., 2003; Soltis et al., 1999, 2000).

The BESs that are not homologous to repeats or Arabidopsis cDNAs were further compared to the plant RefSeq and the non-redundant protein databases. Over 90% of the genes identified in this way were not retrotransposon-derived. Upon further analysis, over half of these identified genes were found in whole genome or organelle genome sequences of Arabidopsis and were eliminated from further consideration. More stringent criteria revealed 141 BESs without homology to Arabidopsis cDNAs but with alignments of at least 34 amino acids to either nr or RefSeq protein sequences. The forty-one BESs homologous to protein sequences from other plants can be analyzed to identify genes that belong to biochemical pathways present in papaya but not in Arabidopsis. Two BESs were homologous to Amaranth agglutinin genes. Four BESs, homologous to pathogen resistance genes, can assist in finding potential solutions to improve pathogen resistance in papaya.
Seed protein AmA1 sequence homology was also identified and can be used to analyze nutritional improvement in papaya. The remaining 'unknown', 'hypothetical', 'unnamed product', 'putative', 'predicted', and other homologous protein sequences that were found in papaya BESs can be used to clone potential organism-specific genes for gene functional studies. Ninety-eight BESs that were mostly homologous to proteins from E. coli or other microorganisms were not further analyzed.

Papaya-specific repeats

Four previously unknown papaya-specific repeats were identified. These papaya-specific repeats did not show any homology to sequences in the web-based nt database. Moreover, these repeats were found neither in Arabidopsis nor in the plant repeat database. These novel sequences will aid the characterization of repetitive elements in the papaya genome and should be considered for addition to existing plant repeat databases. These papaya-specific repeats could be localized by in situ hybridization to identify papaya chromosomes.

Assuming co-linearity of plant genomes by mapping of BAC-end sequence pairs

Co-linearity describes how similar the gene order is between genomes and serves as a measure of synteny, or conservation of genome structure. Mapping single- or low-copy forward and reverse end sequences of a BAC clone, which are separated by the length of the clone, to a completely sequenced heterologous genome can reveal potential syntenic regions and the level of co-linearity between heterologous species. Highly conserved syntenic regions can be used to locate genes within the BAC clones that are not found in BESs (Yan et al., 2003) and to find potential reference genomes for further genomic studies (Ku et al., 2000).
Low-copy papaya BES pairs were compared to the Arabidopsis thaliana, Populus trichocarpa (poplar), and Oryza sativa (rice) genomes in all possible reading frames to test the hypothesis that the forward and reverse BES pairs of papaya BAC clones will map to the completely sequenced genomes in the correct orientation and within the maximum allowable BAC insert size, revealing corresponding collinear regions.

To validate the computational pipeline, attempts were made to map 102 Arabidopsis low-copy BES pairs to the Arabidopsis genome sequence. Four low quality Arabidopsis BESs resulted in the loss of three BES pairs (clones F10C11, F10B5, and F10J4) during mapping to the Arabidopsis genome because of a lack of detectable BLAST matches. One BES pair (clone F10J13) was mismapped to a second locus with an equal or better E value. Two BES pairs (clones F10F16 and F10F17) were not mapped in the correct orientation. In another two cases, BES pairs (clones F10I19 and F10G14) were mapped less than 10 kb apart. Most errors seemed to be due to tracking errors in the BES data set (ftp://ftp.tigr.org/pub/data/a_thaliana/bac_end_sequences/README), but the control BES pairs nevertheless co-mapped in the correct orientation and within 10-300 kb at a 95% success rate.

Two very closely related species were compared to further validate the methodology for mapping BES pairs to heterologous genomes, since the overall success rates in heterologous comparisons were low. Brassica rapa and Arabidopsis thaliana are very closely related species. Forty-six percent of Brassica BES pairs have detectable BLAST matches in the Arabidopsis genome. The Brassica-Arabidopsis comparison yields the highest success rates among all heterologous comparisons. In addition, over fifty percent of the Brassica BES pairs that co-mapped to the Arabidopsis genome were located in the correct orientation and within 10-300 kb.
The high level of co-linearity between them indicates that many macrosyntenic regions are conserved in Brassica and Arabidopsis.

Papaya, poplar, and Arabidopsis are members of the rosids, and one set of BES pairs from a dicot that is not a rosid was used for comparison. Lycopersicon (tomato) has a genome size estimated at 950 Mbp and is a member of the asterids within the order Solanales. Tomato BES pairs yielded similar success rates in mapping to the Arabidopsis and poplar genomes. Tomato BES pairs ranked lowest in having detectable BLAST matches in heterologous genomes. This may be mainly due to the large genome size of tomato, which results in a large proportion of non-genic regions in the BAC ends that are mostly not conserved between rosids and asterids, assuming that the BAC ends are sampled randomly across the tomato genome.

Co-mapping of BES pairs to heterologous genomes displayed higher success rates than expected by chance at every step. The probability that two randomly placed BAC ends map to the same chromosome in Arabidopsis is approximately 20.7% ([29.1M/115.4M]^2 + [19.6M/115.4M]^2 + [23.2M/115.4M]^2 + [17.5M/115.4M]^2 + [26.0M/115.4M]^2) (The Arabidopsis Genome Initiative, 2000), but in all comparisons with the Arabidopsis genome, the success rate of mapping BES pairs to the same chromosome exceeds random chance. Even the poplar-Arabidopsis comparison, which had the lowest success rate (31.9%) in mapping both BAC ends to the same chromosome, is still 11.3% higher than the random probability. In addition, the random probability of mapping any BES pair to the same linkage group of the poplar genome is 1.9% (adjusted by the size of the linkage groups). The lowest success rate was in the Brassica-poplar comparison, in which only 12.3% of the pairs mapped to the same chromosome. However, this is still six times higher than the random chance of 1.9%.
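The random expectation used above can be checked with a few lines of Python. The sketch assumes that each BAC end lands on a chromosome with probability proportional to chromosome length and that the two ends are independent, which is the model implied by the sum of squared size fractions; the chromosome sizes (in Mb) are those read from the formula in the text, and the function name is hypothetical.

```python
def same_chromosome_probability(chrom_sizes):
    """Chance that two independent, uniformly placed ends share a chromosome.

    Each chromosome contributes (size / total) ** 2 to the probability.
    """
    total = sum(chrom_sizes)
    return sum((size / total) ** 2 for size in chrom_sizes)


# Arabidopsis chromosome sizes in Mb, as given in the text (total 115.4 Mb).
arabidopsis_mb = [29.1, 19.6, 23.2, 17.5, 26.0]

p = same_chromosome_probability(arabidopsis_mb)
print(f"{p:.1%}")  # 20.7%
```

The same function could be applied to poplar linkage group sizes to reproduce the 1.9% figure, but those sizes are not listed in this section, so no values are assumed here.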
If BES pairs were randomly oriented in the mapped region, both ends would be expected to map in the correct orientation relative to the homologous sequences in the target genome with a probability of 25%. In all comparisons, the success rate of BES pairs that were oriented correctly exceeded the random probability of 25%. The highest success rate was in the Brassica-Arabidopsis comparison, in which 86% of the BES pairs mapped to the same chromosome had the correct orientation. Papaya BES pairs co-mapped to the poplar genome with correct orientation at a rate approximately 10% higher than in both the Brassica-poplar and Arabidopsis-poplar comparisons.

For BES pairs co-mapped to the same chromosome and oriented in the correct directions, the success rates in mapping within 10-300 kb reflect the level of co-linearity between the genomes of heterologous species. The rate at which papaya BES pairs co-mapped within 10-300 kb in the poplar genome is only 2.5% lower than that of the Brassica-Arabidopsis comparison, even though papaya is expected to be more distantly related to poplar than Brassica is to Arabidopsis. Moreover, the average distance between co-mapped BES pairs in the heterologous genome generally reflects the genome size of the target genome, except for the Brassica-Arabidopsis comparison, in which mapping of large genome regions was observed.

Based on the current phylogenetic taxonomy using alignments of the plastid genes rbcL and atpB and of 18S rDNA (Soltis et al., 1999, 2000), papaya belongs to the family Caricaceae, which is a sister family of Brassicaceae within the order Brassicales. Papaya BES pairs co-mapped poorly to the rice genome, which may be partly due to the divergent genome organization of dicots and monocots. However, even the most closely related species, Arabidopsis, yielded a low number of mapped papaya BES pairs.
Surprisingly, co-mapping of low-copy papaya BES pairs to the more distantly related poplar genome revealed a higher level of co-linearity (15.9%) than the papaya-Arabidopsis comparison. The co-mapping of papaya BES pairs to the poplar genome is not an artifact of the papaya-specific repeats discovered in this study, since over 90% of the co-mapped BES pairs have homologs in the Arabidopsis cDNA and nr databases. Moreover, the lengths of the mapped poplar genome segments approximate the actual insert size of the papaya BAC clones, whereas the mapped Arabidopsis regions are relatively smaller. Furthermore, the papaya-poplar comparison shows an elevated level of BES pairs co-mapping within the actual insert size range of the BAC clones, resembling the comparison of the highly collinear species Brassica and Arabidopsis (Figure 6b) and the control (Figure 6a).

The high co-linearity between papaya and poplar is unexpected because, under the current taxonomy, poplar (order Malpighiales) is more distantly related to papaya than Arabidopsis is, whereas papaya and Arabidopsis both belong to the order Brassicales. This result suggests that genome structure is more conserved between papaya and poplar than between papaya and Arabidopsis. The low level of co-linearity between Arabidopsis and papaya may be caused mostly by the recent genome duplication in Arabidopsis and the large-scale gene loss that followed, which poses a great challenge to using Arabidopsis as a model organism for comparative genomics. Based on this study, I suggest that Populus trichocarpa be chosen as a template genome to assist in constructing physical maps for sequencing the whole papaya genome, even though poplar belongs to the order Malpighiales and is more distantly related to papaya.
The rearrangement of the Arabidopsis genome through polyploidization or duplication followed by significant gene loss was reported recently (Bowers et al., 2003). Most angiosperm genomes have experienced duplication of gene pairs followed by unequal crossing over, translocation of genome segments, and chromosomal recombination (Zhang et al., 2001; Durbin et al., 2000; Ohta, 2000). Recent studies have also confirmed ancient genome duplications in rice (Kishimoto et al., 1994; Nagamura et al., 1995), Arabidopsis (Blanc and Wolfe, 2004; McGrath et al., 1993; Kowalski et al., 1994), and sorghum (Chittenden et al., 1994). The α genome duplication event was the most recent major change in the Arabidopsis genome, and it occurred after the divergence of eudicots (Bowers et al., 2003). The previous duplication was followed by the divergence of Arabidopsis and Brassica from their common ancestor with Malvaceae (Adams and Wendel, 2005). After the α duplication, only 30% of Arabidopsis genes were preserved in syntenic copies, and the survival of gene copies generally depended on their acquiring specialized functions after the duplication (Sampedro et al., 2006). Therefore, we expect $0.3^2 = 0.09$ (9%) of the Arabidopsis gene pairs homologous to papaya BES pairs to be preserved, and the 6.8% of papaya BES pairs that co-mapped to the Arabidopsis genome in the correct orientation and within 10-300 kb is consistent with this expectation.

Comparative genomics is complicated by ancient polyploidization in angiosperms, because the reshuffling of genome segments and large-scale gene loss after genome duplication can cause incongruities in comparative genome mapping (Paterson et al., 2004). Therefore, although both Arabidopsis and poplar are rosids, only a small amount of co-linearity was found between their genomes (Stirling et al., 2003).
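The 9% expectation for Arabidopsis discussed above can be written out explicitly. The short derivation below is a sketch that assumes the two genes detected by the two ends of a BES pair are each retained in a syntenic position with probability 0.3, and that their retention is independent; the symbol $p$ is introduced here for illustration only.

```latex
p = 0.3 \quad \text{(fraction of genes retained in syntenic copies after the $\alpha$ duplication)}
\\
P(\text{both genes of a pair retained}) = p \times p = 0.3^2 = 0.09 = 9\%
```

Under these assumptions, the observed 6.8% is of the same order as the 9% expectation.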
Both Arabidopsis and Brassica BES pairs co-mapped well to the Arabidopsis genome, yet generally only about 1% of these BES pairs co-mapped to the poplar genome in the correct orientation and within 10-300 kb; in contrast, papaya BES pairs co-mapped to the poplar genome at 15.9% overall. In addition, both the Brassica-poplar and Arabidopsis-poplar comparisons generally yielded low success rates. Poplar BES pairs mapped to the Arabidopsis genome with a level of co-linearity resembling that of the papaya-Arabidopsis comparison, indicating that the genome structures of poplar and papaya are fairly divergent from that of the Arabidopsis and Brassica lineage.

Other factors may also have contributed to the divergence in genome organization between papaya and Arabidopsis. Differences in generation time among Arabidopsis, poplar, and papaya could produce different rates of mutation over evolutionary time. The generation times of papaya (~9 months) and poplar (several years) are considerably longer than that of Arabidopsis (weeks). Therefore, the chance of genome rearrangement through recombination and polyploidization might be significantly lower in papaya and poplar. Furthermore, recombination was found to be severely suppressed in the male-specific region of the primitive Y chromosome of papaya (Liu et al., 2004; Ma et al., 2004).

Overall, the rate of mapping papaya BES pairs to heterologous genomes was low. Even though poplar displayed the highest level of co-linearity, only 1.8% of all papaya BES pairs co-mapped to the poplar genome in the correct orientation and within 300 kb.
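The co-mapping criteria used throughout these comparisons (same chromosome, correct orientation, 10-300 kb apart) can be expressed as a small classifier. This is a minimal sketch rather than the thesis pipeline: the `Hit` record and the category names are hypothetical, and "correct orientation" is assumed to mean that the two end reads face each other, i.e. the left-hand hit lies on the plus strand and the right-hand hit on the minus strand.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Best match of one BAC end in the target genome (hypothetical record)."""
    chromosome: str
    start: int   # leftmost target coordinate of the alignment
    end: int     # rightmost target coordinate of the alignment
    strand: str  # "+" or "-" relative to the target sequence


def classify_pair(fwd, rev, min_span=10_000, max_span=300_000):
    """Classify a BES pair by the three criteria described in the text."""
    if fwd.chromosome != rev.chromosome:
        return "different_chromosome"
    left, right = sorted((fwd, rev), key=lambda hit: hit.start)
    # Assumption: a correctly oriented pair converges, so the left hit is on
    # the plus strand and the right hit is on the minus strand.
    if not (left.strand == "+" and right.strand == "-"):
        return "wrong_orientation"
    span = right.end - left.start
    if min_span <= span <= max_span:
        return "collinear_candidate"
    return "outside_size_range"


pair = (Hit("LG_VI", 100_000, 100_500, "+"), Hit("LG_VI", 220_000, 220_450, "-"))
print(classify_pair(*pair))
```

With this definition, only one of the four possible strand combinations counts as correct, which matches the 25% random expectation stated earlier.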
The low success rate in mapping papaya BES pairs to heterologous genomes may be partly due to novel repeats in the end sequences, which caused ambiguous mapping of BES pairs to the heterologous genome and were later eliminated by more stringent criteria; however, the large number of papaya BESs without homology to heterologous genomes is another main reason for the low success rate.

To further test the hypothesis that BES pairs co-mapping in the correct orientation and within 10-300 kb identify collinear regions in heterologous genomes, 19 mapped regions from the poplar-Arabidopsis comparison and 5 mapped regions from the Arabidopsis-poplar comparison were examined for potential syntenic regions. Only a low amount of co-linearity was found between these genome regions, which is consistent with recent reports of peach-Arabidopsis and poplar-Arabidopsis comparisons (Georgi et al., 2003; Stirling et al., 2003). Both peach, in the family Rosaceae, and poplar, in the family Salicaceae, are members of eurosids I. Although peach and poplar belong to different families and orders, three Prunus persica (peach) BAC clones (AC154900.1, AC154901.1, AF467900.1) displayed a high level of co-linearity with the poplar genome (Appendix B). Although co-mapping of BES pairs to heterologous genomes does not prove co-linearity of the corresponding regions, the high proportion of BES pairs that co-map in the correct orientation and within 10-300 kb supports the hypothesis that mapped BES pairs may identify corresponding syntenic regions. The availability of the whole papaya genome sequence will allow further co-linearity analysis between papaya and other species. With the availability of whole genome sequences of poplar and Arabidopsis and the evidence that the papaya-poplar, peach-poplar, and Brassica-Arabidopsis comparisons display high levels of co-linearity,
accelerated gene discovery and whole genome sequencing of papaya, peach, and Brassica can be facilitated.

CONCLUSIONS

A total of 50,661 papaya BAC end sequence chromatograms were processed and trimmed to yield 35,472 high quality BESs representing 17,483,563 high quality bases, or 4.7% of the papaya genome. All processed high quality BESs were deposited in GenBank. Most mined SSRs lie in non-genic regions, yet the SSRs associated with cDNAs are more useful for mapping genes that might have important agronomic value. Trinucleotide repeats were enriched in regions associated with cDNAs and therefore form more useful markers for mapping genes of interest. The SSR markers mined in this analysis provide an economical and efficient way to find polymorphisms between different papaya strains. A website was designed to provide over 2,500 useful SSR markers and their flanking primer pairs. All cDNA, plant repeat, RefSeq genomic, and non-redundant protein homologies associated with these SSR markers are displayed to make data mining more efficient.

Papaya shares many genes with Arabidopsis. Based on the observations of this study, around one fifth of the papaya genome consists of gene-rich regions. The papaya genome has a higher repeat element content than Arabidopsis: sixteen percent of the papaya genome contains plant repeat elements, and around four percent contains identified papaya-specific repeats. Papaya genes that are absent from Arabidopsis can be analyzed further to identify their functions. Over 6,000 gene tags were discovered that papaya researchers can use to analyze their genes of interest.

The rapidly evolving genome structure of Arabidopsis, a result of polyploidization followed by large-scale gene loss, poses a great challenge for comparative genomics with this model plant. The poplar genome had the most mapped papaya BES pairs, and the mapped regions most closely approximated the insert size of the papaya BAC clone library.
Although no completely sequenced plant genome is mapped extensively by the papaya BES pairs, the poplar genome currently appears to be the most suitable reference genome for comparative genomics to elucidate the genome organization of papaya. The orthologous regions between poplar and papaya discovered in this study can be analyzed further for potential cloning and mapping of target genes when completely sequenced papaya BAC clones become available. These papaya BESs will help elucidate the composition of the whole papaya genome sequence. The papaya BAC library and BESs will be valuable resources for physical mapping and sequencing of the whole papaya genome. This analysis can greatly enhance genetic studies of papaya, with the potential to improve the yield, fruit size, and taste of papaya.

APPENDIX A

Papaya BAC end sequences library website (http://www.genomics.hawaii.edu/papaya/BES)

Introduction

The forward and reverse ends of a papaya BAC library containing 26,017 clones were sequenced through a collaboration of HARC (Hawaii Agriculture Research Center), CTAHR (College of Tropical Agriculture at the University of Hawaii) and the CGPBRI (Center for Genomics, Proteomics, and Bioinformatics Research Initiative). Analysis of the 35,472 high quality BESs presents the first view of the papaya genome. Our analysis focused on mining SSR markers, repetitive element composition, protein-coding regions, and comparative genome mapping. This website is designed as a public service for querying the currently available papaya BESs. A total of 42,898 high quality BESs can be downloaded from this website. Moreover, the microsatellite search engine provides over 2,500 primer pairs for detecting papaya SSR length polymorphisms. The Arabidopsis cDNA homologies flanking these SSR loci are shown to help users select primers for analyzing targeted genes.
The BAC-end sequence search engine provides detailed information on cDNA, plant repeat, RefSeq (genomic), and non-redundant protein homologies in papaya BESs. Papaya BESs containing microsatellites, plant repeat homology, or gene/cDNA homology, as well as mapped BES pairs, are all available for download from the website. A graphical representation of all microsatellites detected in papaya BESs, plotted by length and type, is accessible at this website. The website also provides contact information for the principal investigators, co-principal investigators, collaborating labs and associated institutes. Links to other major genomics centers are provided. The reference for the official publication of the papaya BAC-end sequence analysis project is listed along with other related publications.

Databases

The MySQL database currently runs on a UNIX (Gentoo) platform. The major tables for storing sequence analysis data are shown in Tables 5-9. The interface between users and the databases is shown in Figure 8.

Table 5. Structure of table pBES

Field            Type               Null   Key   Default   Extra
ID               int(10) unsigned          PRI   NULL      auto_increment
PBES_NAME        varchar(255)       YES          NULL
RAW_SEQ          longtext           YES          NULL
RAW_QUAL         longtext           YES          NULL
TRIM_SEQ         longtext           YES          NULL
TRIM_QUAL        longtext           YES          NULL
CONVTN_SEQ       longtext           YES          NULL
TRIMX_SEQ        longtext           YES          NULL
LAB              varchar(255)       YES          NULL
HOMOLOGY_TYPE    varchar(255)       YES          NULL
HOMOLOGY         longtext           YES          NULL
HOMOLOGY_ID      varchar(255)       YES          NULL
UPDATETIME       datetime           YES          NULL

This table stores all papaya BAC-end sequence information, including the BES name, the raw sequence and quality values, the trimmed sequence and quality values, the sequence in which bases with a PHRED value below 20 are converted to 'N', and the sequence with terminal vector sequence trimmed. The associated sequencing lab is also indicated, the identifier and description of the mined homology are provided, and the update time is recorded.

Table 6.
Structure of table SSR

Field                Type               Null   Key   Default   Extra
SSR_ID               int(10) unsigned          PRI   NULL      auto_increment
PBES_NAME            varchar(255)       YES          NULL
FORWARD_PRIMER       varchar(255)       YES          NULL
REVERSE_PRIMER       varchar(255)       YES          NULL
SSR                  varchar(255)       YES          NULL
PATTERN              tinyint(4)         YES          NULL
MAX_SSR              text               YES          NULL
MAX_SSR_SIZE         int(11)            YES          NULL
SSR_START            int(11)            YES          NULL
SSR_END              int(11)            YES          NULL
PRODUCT_SIZE         int(11)            YES          NULL
PRIMER_EXPLANATION   longtext           YES          NULL
AMPLIFIED_SEQ        longtext           YES          NULL
PBES                 longtext           YES          NULL
LAB                  varchar(255)       YES          NULL
UPDATETIME           datetime           YES          NULL

This table stores information about the mined microsatellites: their associated primers, SSR pattern, maximum SSR size, the position of the SSR markers, PCR-amplified product size, explanation of the primer selection, PCR-amplified sequences, original BAC-end sequences, associated lab, and update time.

Table 7. Structure of tables CDNA, REPx, REFSEQ, NR

Field            Type               Null   Key   Default   Extra
ID               int(10) unsigned          PRI   NULL      auto_increment
QUERY_ID         varchar(255)       YES          NULL
SUBJECT_ID       varchar(255)       YES          NULL
IDENTITY         float              YES          NULL
ALIGNMENT_LEN    int(11)            YES          NULL
MISMATCHES       int(11)            YES          NULL
GAP_OPENING      int(11)            YES          NULL
Q_START          int(11)            YES          NULL
Q_END            int(11)            YES          NULL
S_START          int(11)            YES          NULL
S_END            int(11)            YES          NULL
EVALUE           float              YES          NULL
BIT_SCORE        float              YES          NULL
SUBJECT_DESC     longtext           YES          NULL
LAB              varchar(255)       YES          NULL

These tables share the same structure. Each table contains the BLAST result output, including query sequence ID, subject sequence ID, identity (percentage of matches), alignment length, number of mismatched nucleotides or amino acids, gap openings, aligned query position, aligned subject position, expected value, bit score, matched subject description, and the associated sequencing lab of the BES.

Table 8. Structure of table SEQLEN

Field         Type               Null   Key   Default   Extra
ID            int(10) unsigned          PRI   NULL      auto_increment
PBES_NAME     varchar(255)       YES          NULL
TRIMX_LEN     int(11)            YES          NULL
HQBASE_NUM    int(11)            YES          NULL
GC            int(11)            YES          NULL

This table contains the BES name, the length of each trimmed high quality sequence with terminal vector sequences removed, the number of high quality bases in the BES, and the percentage of GC content.

Table 9. Structure of table GMAP

Field               Type               Null   Key   Default   Extra
ID                  int(10) unsigned          PRI   NULL      auto_increment
QUERY_ID            varchar(255)       YES          NULL
SUBJECT_ID          varchar(255)       YES          NULL
IDENTITY            float              YES          NULL
ALIGNMENT_LEN       int(11)            YES          NULL
MISMATCHES          int(11)            YES          NULL
GAP_OPENING         int(11)            YES          NULL
Q_START             int(11)            YES          NULL
Q_END               int(11)            YES          NULL
S_START             int(11)            YES          NULL
S_END               int(11)            YES          NULL
EVALUE              float              YES          NULL
BIT_SCORE           float              YES          NULL
BASE_APART          int(11)            YES          NULL
QUERY_ORGANISM      varchar(255)       YES          NULL
SUBJECT_ORGANISM    varchar(255)       YES          NULL
LAB                 varchar(255)       YES          NULL

This table contains the BLAST result output, including query sequence ID, subject sequence ID, identity (percentage of matches), alignment length, number of mismatched nucleotides or amino acids, gap openings, aligned query position, aligned subject position, expected value, bit score, number of bases between the forward and reverse ends of the BES pair, query organism (BES pairs), subject organism (genome), and the associated sequencing lab of the BES.

Figure 8. Project website index page (screenshot not reproduced)
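The BLAST-result tables described above (Tables 7 and 9) begin with the same twelve fields, in the same order, as the standard 12-column BLAST tabular output. As a hedged illustration of how one such line could be turned into a record before loading it into a table, the sketch below parses a single tab-separated line. The function name and the example line are invented for illustration and are not results or code from the thesis.

```python
# The twelve standard BLAST tabular fields, named as in Tables 7 and 9,
# each paired with the Python type used to convert it.
BLAST_COLUMNS = [
    ("QUERY_ID", str), ("SUBJECT_ID", str), ("IDENTITY", float),
    ("ALIGNMENT_LEN", int), ("MISMATCHES", int), ("GAP_OPENING", int),
    ("Q_START", int), ("Q_END", int), ("S_START", int), ("S_END", int),
    ("EVALUE", float), ("BIT_SCORE", float),
]


def parse_blast_row(line):
    """Convert one tab-separated BLAST tabular line into a dict of typed fields."""
    values = line.rstrip("\n").split("\t")
    if len(values) < len(BLAST_COLUMNS):
        raise ValueError(
            f"expected at least {len(BLAST_COLUMNS)} fields, got {len(values)}"
        )
    return {
        name: convert(value)
        for (name, convert), value in zip(BLAST_COLUMNS, values)
    }


# Made-up line for illustration only; not a result from the thesis.
example_line = "\t".join([
    "query_example", "subject_example", "91.25", "160", "14", "0",
    "5", "164", "2001", "2160", "1e-50", "198",
])
row = parse_blast_row(example_line)
print(row["QUERY_ID"], row["ALIGNMENT_LEN"], row["EVALUE"])
```

Extra columns such as SUBJECT_DESC, LAB or BASE_APART are not part of the BLAST output itself and would have to be filled in separately.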
All project web pages are currently stored on the genomics.hawaii.edu Apache web server maintained in the Presting Lab. The interactive CGI pages were programmed in Perl in combination with HTML. The home page of this project website can be accessed at http://www.genomics.hawaii.edu/papaya/BES.

SSR search engine

The purpose of this feature is to let users select SSRs that match user-defined criteria (Figure 9). For example, users can display any combination of mono-, di-, tri- and tetra-nucleotide repeats when selecting SSR markers. The minimum length of all SSR markers is 12 bp; therefore, a value below 12 entered for the minimum SSR length is reset to the default of 12. Users can show the cDNA homology description by checking the corresponding box, and can restrict the view to SSRs with primers and/or SSRs with cDNA homologies. Clicking Search displays the results; clicking Display then shows the details of the BAC-end sequence and the microsatellite information, including the amplified product sequence and the original BAC-end sequence. Users can enter a cDNA description, sequence, BES name or SSR pattern to narrow down the matched SSR markers.

Figure 9. Microsatellite search engine (screenshot not reproduced)

This search engine can be accessed at http://www.genomics.hawaii.edu/papaya/BES/ssrNP.cgi.
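The search criteria described above (mono- to tetra-nucleotide repeats, minimum length 12 bp, smaller requested minimums reset to 12) can be illustrated with a simple scanner. This is a sketch under stated assumptions, not the mining program used for the website: it counts only perfect, whole-unit repeats, and the function names are hypothetical.

```python
MIN_SSR_LENGTH = 12  # minimum SSR length in bp, as stated in the text


def is_primitive(unit):
    """True if the unit is not itself a repeat of a shorter motif (e.g. not 'ATAT')."""
    size = len(unit)
    return not any(
        size % d == 0 and unit == unit[:d] * (size // d) for d in range(1, size)
    )


def find_ssrs(sequence, unit_sizes=(1, 2, 3, 4), min_length=MIN_SSR_LENGTH):
    """Return (start, end, unit, repeat_count) for perfect SSRs, end exclusive."""
    min_length = max(min_length, MIN_SSR_LENGTH)  # values below 12 revert to 12
    seq = sequence.upper()
    hits = []
    for size in unit_sizes:
        i = 0
        while i + size <= len(seq):
            unit = seq[i:i + size]
            if set(unit) - set("ACGT") or not is_primitive(unit):
                i += 1
                continue
            j = i + size
            while seq[j:j + size] == unit:  # extend by whole repeat units
                j += size
            if j - i >= min_length:
                hits.append((i, j, unit, (j - i) // size))
                i = j
            else:
                i += 1
    return sorted(hits)


demo = "GGC" + "A" * 12 + "C" + "AG" * 6 + "A" + "CTT" * 4 + "G" + "GATA" * 3 + "C"
for start, end, unit, count in find_ssrs(demo):
    print(start, end, unit, count)
```

Restricting `unit_sizes` (for example to `(3,)`) corresponds to selecting only one repeat class in the search form.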
BES search engine

The purpose of this feature is to let users display BESs that match user-defined criteria (Figure 10). Users can limit the number of rows returned to reduce the running time of the request. Radio buttons select whether to show all BESs, BESs with cDNA homologies, BESs with plant repeats, BESs with RefSeq genomic homologies, or BESs with no homologies. Clicking Search displays the results. Users can also enter a homology identifier (accession number), target sequence, BES name, or homology description keywords to narrow the displayed results.

Figure 10. BAC-end sequence search engine (screenshot not reproduced)

This search engine can be accessed at http://www.genomics.hawaii.edu/papaya/BES/besNP.cgi.

APPENDIX B

Figure 11. Comparisons of three peach BAC clones to the poplar genome (plot not reproduced; panels compare AF467900, AC154900 and AC154901 with poplar linkage groups)

Three peach BAC clones were downloaded from the NCBI GenBank website (AC154900.1, AC154901.1, AF467900.1). These peach BAC clones were compared against artificial poplar BAC clones using BLASTN with an E value less than or equal to 10 [exponent illegible]. The best matches of poplar BAC clones to the corresponding peach BAC clones were identified. The start positions of the poplar BAC clones were identified, and the exact coordinates of the mapped poplar BAC clones on the poplar genome were obtained. Based on the linkage group number and the first aligned nucleotide of the poplar artificial BAC clone, 120 kb poplar genome segments were retrieved. The corresponding peach and poplar BAC clone sequences were compared using the BL2SEQ BLASTN program at an E value cutoff of $10^{-10}$. All aligned regions of the peach and poplar BAC clones were recorded. The portions of the peach BAC clones with no match to the retrieved poplar genome segment were further compared to the entire non-fragmented poplar genome database. The coordinates of the mapped poplar chromosomes were recorded (Figure 11).

Of the three peach BAC clones compared to the poplar genomic sequences, nucleotides 1,882-34,648 of clone AC154901.1 mapped to poplar linkage group (LG) VI at nucleotide positions 776,372-938,175, and region 34,480-73,277 mapped to poplar LG XVI at positions 248,433-305,321. For BAC clone AC154900.1, region 4,579-66,115 mapped to poplar LG VI at positions 663,584-776,618, with a gap of ~80 kb at positions 681,139-762,409. Region 13,096-16,694 of BAC clone AF467900.1 mapped to poplar LG XU at positions 898,040-901,595.

APPENDIX C

Table 10.
Detailed coordinates of co-mapped papaya BES pairs in heterologous genome comparisons

The exact coordinates of the genome regions mapped by co-mapped papaya BES pairs in heterologous genomes are listed here. These coordinates may facilitate the alignment of papaya BAC clones when constructing a physical map for whole genome sequencing of papaya, and may help locate potential genes of interest in papaya BAC clones by reference to the mapped collinear regions in the target genomes.

APPENDIX D

Table 11. A list of compiled plant repeat databases for repeat content analysis of papaya BESs. Plant repeat databases were downloaded from The Institute for Genomic Research (TIGR) (ftp://tigr.org/pub/data/TIGR_Plant_Repeats) and compiled into a database of 30,481 unique repeat sequences.

a. TIGR_Brassica_Repeats.v2
b. TIGR_Brassicaceae_Repeats.v2
c. TIGR_Fabaceae_Repeats.v2
d. TIGR_Glycine_Repeats.v2
e. TIGR_Gramineae_Repeats.v3.0
f. TIGR_Hordeum_Repeats.v3.0
g. TIGR_Lotus_Repeats.v2
h. TIGR_Lycopersicon_Repeats.v2
i. TIGR_Medicago_Repeats.v2
j. TIGR_Oryza_Repeats.v3.0
k. TIGR_Solanaceae_Repeats.v2
l. TIGR_Solanum_Repeats.v2
m. TIGR_Sorghum_Repeats.v3.0
n. TIGR_Triticum_Repeats.v3.0
o. TIGR_Zea_Repeats.v3.0

LITERATURE CITED

Adams KL, Wendel JF (2005) Polyploidy and genome evolution in plants. Current Opinion in Plant Biology 8:135-141
Agricultural Statistics Board, USDA. "Crop Production." NASS, May 2006
Agricultural Statistics Board, USDA. "Crop Values 2005 Summary." NASS, February 2006
Agricultural Statistics Board, USDA. "Noncitrus Fruits and Nuts 2005 Preliminary Summary." NASS, January 2006
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. J Mol Biol 215:403-410
Aradhya MK, Manshardt RM, Zee F, Morden CW (1999) A phylogenetic analysis of the genus Carica L. (Caricaceae) based on restriction fragment length variation in a cpDNA intergenic spacer region. Genet Resour Crop Evol 46:579-586
Arumuganathan K, Earle ED (1991) Nuclear DNA content of some important plant species. Plant Mol Biol Rep 9(3):211-215
Badillo VM (2000) Carica L. vs. Vasconcella St. Hil. (Caricaceae): con la rehabilitación de este último. Ernstia 10:74-79
Blanc G, Wolfe KH (2004) Functional divergence of duplicated genes formed by polyploidy during Arabidopsis evolution. Plant Cell 16:1679-1691
Bowers JE, Chapman BA, Rong J, Paterson AH (2003) Unravelling angiosperm genome evolution by phylogenetic analysis of chromosomal duplication events. Nature 422:433-438
Bremer K, Chase MW, Stevens PF (1998) An ordinal classification for the families of flowering plants. Ann Mo Bot Gard 85:531-553
Candolle A de (1908) Origin of cultivated plants. D.A. Appleton & Co., New York, NY
Chen M, Presting G, Barbazuk WB, Goicoechea JL, Blackmon B, Fang G, Kim H, Frisch D, Yu Y, Sun S, Higingbottom S, Phimphilai J, Phimphilai D, Thurmond S, Gaudette B, Li P, Liu J, Hatfield J, Main D, Farrar K, Henderson C, Barnett L, Costa R, Williams B, Walser S, Atkins M, Hall C, Budiman MA, Tomkins JP, Luo M, Bancroft I, Salse J, Regad F, Mohapatra T, Singh NK, Tyagi AK, Soderlund C, Dean RA, Wing RA (2002) An integrated physical and genetic map of the rice genome. Plant Cell 14:1-10
Cheng Z, Presting G, Buell CR, Wing RA, Jiang J (2001) High-resolution pachytene chromosome mapping of bacterial artificial chromosomes anchored by genetic markers reveals the centromere location and the distribution of genetic recombination along chromosome 10 of rice. Genetics 157:1749-1757
Chittenden LM, Schertz KF, Lin YR, Wing RA, Paterson AH (1994) A detailed RFLP map of Sorghum bicolor x S. propinquum, suitable for high-density mapping, suggests ancestral duplication of Sorghum chromosomes or chromosomal segments. Theor Appl Genet 87:925-933
Choi S, Creelman RA, Mullet JE, Wing RA (1995) Construction and characterization of a bacterial artificial chromosome library of Arabidopsis thaliana. Weeds World 2:17-20
Deputy JC, Ming R, Ma H, Liu Z, Fitch MM, Wang M, Manshardt R, Stiles JI (2002) Molecular markers for sex determination in papaya (Carica papaya L.). Theor Appl Genet 106:107-111
Doughty KJ, Porter AJR, Morton AM, Kiddle G, Bock CH, Wallsgrove R (1991) Variation in the glucosinolate content of oilseed rape Brassica-napus L leaves. II. Response to infection by Alternaria-brassicae Berk. Sacc. Ann Appl Biol 118:469-478
Dune J, Horgan L (1992) Meat tenderizers. In: Hui YH (ed) Encyclopedia of Food Science and Technology, vol 3. Wiley, New York, p 1745-1751
Durbin ML, McCaig B, Clegg MT (2000) Molecular evolution of the chalcone synthase multigene family in the morning glory genome. Plant Mol Biol 42:79-92
Ewing B, Hillier L, Wendl MC, Green P (1998) Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res 8:175-185
Ewing B, Green P (1998) Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res 8:186-194
Fahey JW, Zalcmann AT, Talalay P (2001) The chemical diversity and distribution of glucosinolates and isothiocyanates among plants. Phytochemistry 56:5-51
Fitch MMM, Manshardt RM, Gonsalves D, Slightom (1992) Virus resistance papaya derived from tissue bombarded with the coat protein gene of papaya ringspot virus. Bio/Technology 10:1466-1472
Filatov DA, Moneger F, Negrutiu I, Charlesworth D (2000) Low variability in a Y-linked plant gene and its implications for Y-chromosome evolution. Nature 404:388-390
Georgi LL, Wang Y, Reighard GL, Mao L, Wing RA, Abbott AG (2003) Comparison of peach and Arabidopsis genomic sequences: fragmentary conservation of gene neighborhoods. Genome 46:268-276
Goff SA, Ricke D, Lan T, Presting G, Wang R, Dunn M, Glazebrook J, Sessions A, Oeller P, Varma H, Hadley D, Hutchison D, Martin C, Katagiri F, Lange B, Moughamer T, Xia Y, Budworth P, Zhong J, Miguel T, Paszkowski U, Zhang S, Colbert M, Sun W, Chen L, Cooper B, Park S, Wood T, Mao L, Quail P, Wing R, Dean R, Yu Y, Zharkikh A, Shen R, Sahasrabudhe S, Thomas A, Cannings R, Gutin A, Pruss D, Reid J, Tavtigian S, Mitchell J, Eldredge G, Scholl T, Miller R, Bhatnagar S, Adey N, Rubano T, Tusneem N, Robinson R, Feldhaus J, Macalma T, Oliphant A, Briggs S (2002) A draft sequence of the rice genome (Oryza sativa L. ssp. japonica). Science 296:92-100
Hong CP, Lee SJ, Park JY, Plaha P, Park YS, Lee YK, Choi JE, Kim KY, Lee JH, Lee J, Jin H, Choi SR, Lim YP (2004) Construction of a BAC library of Korean ginseng and initial analysis of BAC-end sequences. Mol Gen Genomics 271:709-716
Ilic K, SanMiguel PJ, Bennetzen JL (2003) A complex history of rearrangements in an orthologous region of the maize, sorghum, and rice genomes. Proc Natl Acad Sci USA 100:12265-12270
International Rice Genome Sequencing Project (2005) The map-based sequence of the rice genome. Nature 436:793-800
Jiang J, Gill BS, Wang GL, Ronald PC, Ward DC (1995) Metaphase and interphase fluorescence in situ hybridization mapping of the rice genome with bacterial artificial chromosomes. Proc Natl Acad Sci USA 92:4487-4491
Jung S, Abbott A, Jesudurai C, Tomkins J, Main D (2005) Frequency, type, distribution and annotation of simple sequence repeats in Rosaceae ESTs. Funct Integr Genomics 5:136-143
Katti M, Ranjekar PK, Gupta VS (2001) Differential distribution of simple sequence repeats in eukaryotic genome sequences. Mol Biol Evol 18:1161-1167
Kjaer A (1976) Glucosinolates in the Cruciferae. In: Vaughn JG, MacLeod AJ, Jones BMG (Eds.), The Biology and Chemistry of the Cruciferae. Academic, London, UK, pp 207-219
Kjaer A, Schuster A (1972) Glucosinolates in seeds of Neslia-paniculata. Phytochemistry 11:3045-3048
Kim MS, Moore PH, Zee F, Fitch MMM, Steiger DL, Manshardt, Paull RE, Drew RA, Sekioka T, Ming R (2002) Genetic diversity of Carica papaya as revealed by AFLP markers. Genome 45:503-512
Kishimoto N, Higo H, Abe K, Arai S, Saito A, Higo K (1994) Identification of the duplicated segments in rice chromosomes 1 and 5 by linkage analysis of cDNA markers of known functions. Theor Appl Genet 88:722-726
Kowalski SP, Lan TH, Feldmann, Paterson (1994) Comparative mapping of Arabidopsis thaliana and Brassica oleracea chromosomes reveals islands of conserved organization. Genetics 138:499-510
Kroymann J, Mitchell-Olds T (2004) Function and evolution of an Arabidopsis insect resistance QTL. In: Tikhonovich I, Lugtenberg B, Provorov N (Eds.), Biology of Molecular Plant-Microbe Interactions. APS Press, Saint Paul, MN, pp 259-262
Kroymann J, Textor S, Tokuhisa JG, Falk KL, Bartram S, Gershenzon J, Mitchell-Olds T (2001) A gene controlling variation in Arabidopsis glucosinolate composition is part of the methionine chain elongation pathway. Plant Physiol 127:1077-1088
Kruglyak S, Durrett RT, Schug MD, Aquadro CF (1998) Equilibrium distributions of microsatellite repeat length resulting from a balance between slippage events and point mutations. Proc Natl Acad Sci USA 95:10774-10778
Ku HM, Vision T, Liu J, Tanksley SD (2000) Comparing sequenced segments of the tomato and Arabidopsis genomes: Large-scale duplication followed by selective gene loss creates a network of synteny. Proc Natl Acad Sci USA 97:9121-9126
Lange BM, Presting G (2004) Genomic survey of metabolic pathways in rice. In: Romeo JT (ed) Recent advances in phytochemistry. Elsevier, Amsterdam, p 111-137
Levinson G, Gutman GA (1987) Slipped-strand mispairing: a major mechanism for DNA sequence evolution. Mol Biol Evol 4:203-221
Liu Z, Moore PH, Ma H, Ackerman CM, Ragiba M, Yu Q, Pearl HM, Kim MS, Charlton JW, Stiles JI, Zee FT, Paterson AH, Ming R (2004) A primitive Y chromosome in papaya marks incipient sex chromosome evolution. Nature 427:348-352
Ma H, Moore PH, Liu Z, Kim MS, Yu Q, Fitch MM, Sekioka T, Paterson AH, Ming R (2004) High-density linkage mapping revealed suppression of recombination at the sex determination locus in papaya. Genetics 166:419-436
Mao L, Wood T, Yu Y, Budiman MA, Tomkins J, Woo S, Sasinowski M, Presting G, Frisch D, Goff S, Dean RA, Wing RA (2000) Rice transposable elements: a survey of 73,000 sequence-tagged-connectors. Genome Res 10:982-990
McGrath JM et al. (1993) Duplicate sequences with a similarity to expressed genes in the genome of Arabidopsis thaliana. Theor Appl Genet 86:880-888
Ming R, Moore PH, Zee F, Abbey CA, Ma H, Paterson AH (2001) Construction and characterization of a papaya BAC library as a foundation for molecular dissection of a tree-fruit genome. Theor Appl Genet 102:892-899
Morton JF (1987) Papaya. In: Fruits of warm climates. Creative Resource Systems, Inc., Miami, pp 336-346
Mozo T, Fischer S, Shizuya H, Altmann T (1998) Construction and characterization of the IGF Arabidopsis BAC library. Mol Gen Genet 258:562-570
Nagamura Y et al. (1995) Conservation of duplicated segments between rice chromosomes 11 and 12. Breed Sci 45:373-376
O'Neill CM, Bancroft I (2000) Comparative physical mapping of segments of the genome of Brassica oleracea var. alboglabra that are homeologous to sequenced regions of chromosomes 4 and 5 of Arabidopsis thaliana. Plant J 23:233-43
Ockerman HW, Harnsawas S, Yetim H (1993) Inhibition of papain in meat by potato protein or ascorbic acid. J Food Sci 58:1265-1268
Ohta T (2000) Evolution of gene families. Gene 259:45-52
Osato JA, Santiago L, Remo G, Cuadra M, Mori A (1993) Antimicrobial and antioxidant activities of unripe papaya. Life Sci 53:1383-1389
Paterson AH, Bowers JE, Chapman BA (2004) Ancient polyploidization predating divergence of the cereals, and its consequences for comparative genomics. Proc Natl Acad Sci USA 101:9903-9908
Perez A, Pollack S, USDA. "Fruit and Tree Nuts Outlook." Economic Research Service FTS-322 (May 25, 2006)
Purina A, Sandhya B (1988) Genotypic difference of in vitro lateral bud establishment and shoot proliferation in papaya. Curr Sci 7:440-442
Purseglove JW (1968) Tropical crops. Longman, London, pp 45-51
Raybould AF, Moyes CL (2001) The ecological genetics of aliphatic glucosinolates. Heredity 87:383-391
Rice Chromosome 10 Sequencing Consortium (2003) In-depth view of structure, activity and evolution of rice chromosome 10. Science 300:1566-1569
Rodman JE, Karol KG, Price RA, Sytsma KJ (1996) Molecules, morphology, and Dahlgren's expanded order Capparales. Syst Bot 21:289-307
Rodman JE, Soltis PS, Soltis DE, Sytsma KJ, Karol KG (1998) Parallel evolution of glucosinolate biosynthesis inferred from congruent nuclear and plastid gene phylogenies. Am J Bot 85:997-1006
Rozen S, Skaletsky H (2000) Primer3 on the WWW for general users and for biologist programmers. In: Krawetz S, Misener S (eds) Bioinformatics methods and protocols (methods in molecular biology). Humana Press, Totowa, pp 365-386
Sampedro J, Carey RE, Cosgrove DJ (2006) Genome histories clarify evolution of the expansin superfamily: new insights from the poplar genome and pine ESTs. J Plant Res 119:11-21
Shizuya H, Birren B, Kim UJ, Mancino V, Slepak T, Tachiiri Y, Simon M (1992) Cloning and stable maintenance of 200-kilobase-pair fragments of human DNA in Escherichia coli using an F-factor-based vector. Proc Natl Acad Sci USA 89:8794-8797
Soltis DE, Soltis PS, Chase MW, Mort ME, Albach DC, Zanis M, Savolainen V, Hahn WH, Hoot SB, Fay MF, Axtell M, Swensen SM, Prince LM, Kress WJ, Nixon KC, Farris JS (2000) Angiosperm phylogeny inferred from 18S rDNA, rbcL, and atpB sequences. Bot J Linnean Soc 133:381-461
Soltis PS,
Soltis DE, Chase MW (1999) Angiosperm phylogeny inferred from multiple 76 genes: a research tool for comparative biology. Nature 402:402-404 Stirling B, Yang ZK, Gunter La Tnskan GA, Bradshaw HD (2003) Comparative sequence analysis between ortbologous regions of the Arabidopsis and Populus genomes reveals substantial synteny and microcollinearity. Can J For Res 33:2245-2251 Storey WB (1976) Papaya. In: Simmonds NW (ed) Evolution of crop plants. Longman, London, pp 21-24 Storey WB (1985) Papaya. In: CRC Handbook ofFlowering, CRC Press, Boca Raton, FL Temnykb S, DeClerck G, Lukashova A, Lipoviich L, Cartinhour, McCouch S (2001) Computational and experimental analysis of microsatellites in rice (Oryza sative L.): frequency, length variation, transposon associations, and genetic marker potential. Genome Res 11:1441-1452 The Arabidopsis Genome Initiative (2000) Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408:796-815 Tokubisa J, de Kraker I-W, Textor S, Gershezon I (2004) The biochemical and molecular origins of aliphatic g1ucosinolate diversity in Arabidopsis thaliana. In: Romeo, I.T. (Ed.), Secondary Metabolism in Model Systems. Pergamon, New York, NY, pp.19-38 Tomkins I, Fregene M, Main D, Kim H, Wing R, Tobme I (2004) Bacterial artificial chromosome (BAC) library resource for positional cloning of pest and disease resistance genes in cassave (Manilwt esculenta Crantz). Plant Mol Bioi 56:555- 561 77 Van Droogenbroeck B, Kyndt T, Maertens I, Romeijn-Peeters E, Scheldeman X, Romero JP, Van Damme P, Goetghebeur P, Gheysen G (2004) Phylogenetic analysis of the highland papayas (Vasconcellea) and allied genera (Caricaceae) using PCR RFLP. Theor Appl Genet 108:1473-1486 Verde I, Lauria M, Dettori MT, Vendramin E, Balconi C, MicaIi S, Wang Y, Marrazzo MT, Cipriani G, Hartings H, Testolin R, Abbott AG, Motto M, Quarta R (2005) MicrosateIIite and AFLP markers in the Prunus persica [L. Batsch]]xP. 
Ferganensis BCI linkage map: saturation and coverage improvement Theor Appl Genet 111:1013-1021 Wang GL, Holsten TE, Song WY, Wang HP, Ronald PC (1995) Construction of a rice bacterial artificial chromosome library and identification of clones linked to the Xa-21 disease resistance locus. Plant J 7:525-533 Wierdl M, Dominska M, Petes TO (1997) MicrosateIIite instability in yeast: dependence on the length of the microsateIIite. Genetics 146:769-779 Yan L, Loukoianov A, Tranquilli G, Helguera M, Fahima T, Dubcovsky J (2003) Positional cloning of the wheat vernalization gene VRNll. Proc Natl Acad Sci USA 100:6263-6268 Yu Q, Moore PH, Albert HH, Roader AH, Ming R (2005) Cloning and Characterization of a FLORlCAULAILEAFY ortholog, PFL, in polygamous papaya. Cell Res 15(8):576-584 Zhang HB, Choi S, Woo SS, Li Z, Wing RA (1996) Construction and characterization of two rice bacterial artificial chromosome libraries form the parents of a permanent recombinant inbred mapping population. Mol Breed 2: 11-24 78 Zhang LQ, Pond SK, Gaut BS (2001) A survey of the molecular evolutionary dynamics of twenty-five multi gene families from four grass taxa. J Mol EvoI52:144-156 Zhao S, Shatsman S, Ayodeji B, Geer K, Tsegaye G, Krol M, Gebregeorgis E, Shvartsbeyn A, Russell D, Overton L, Jiang L, Dimitrov G, Tran K, Shetty J, Malek JA, Feldblyum T, Nierman WC, Fraser CM (2001) Mouse BAC ends quality assessment and sequence analyses. Genome Res 11:1736-1745 Zhu YJ, Agbayani R, Jackson Me, Tang CS, Moore PH (2004) Expression of the grapevine stilbene synthase gene VSTl in papaya provides increased resistance against diseases caused by Phytophthora palmivora. P1anta 220:241-250 79