1/25/2019

PAG XXII Workshop: Sequencing Complex Genome, 13th Jan 2019

1. Economic Significance of Mango (Mangifera indica L.)
Annual production value of > 20 billion USD

A Reference Assembly of the Highly Heterozygous Genome of Mango (Mangifera indica L. cv. Amrapali)

Nagendra K. Singh ICAR- National Research Centre on Plant Biotechnology Pusa Campus, New Delhi-110012

Acknowledgements
Funding Support: ICAR-NPTC; ICAR-Extra Mural
National Partners (PIs):
• Anju Bajpai, ICAR-CISH Lucknow
• S.K. Singh, ICAR-IARI, New Delhi
• K.V. Ravishankar, ICAR-IIHR Bengaluru
• Anil Rai, ICAR-IASRI, New Delhi
ICAR-NRCPB: Vandna Rai, Kishor Gaikwad, Amitha Sevanthi
RA/SRF: Ajay Mahato, Pawan Jaysawal, Sangeeta Singh, Nisha Singh
Service Provider: Nucleome Informatics India

1. Origin and Distribution of Mangifera species
• Origin of the genus Mangifera has been traced to the Damalgiri hills of Meghalaya by a 65-million-year-old fossil of a mango leaf (R. C. Mehrotra, Birbal Sahni Institute of Palaeobotany, Lucknow)
• Common mango (Mangifera indica L.) originated in India (Woodrow, 1904; Scott, 1992; Cole & Hawson, 1963; Mukherjee, 1971; Malo, 1985).
• Mughals (1556-1605) had a 500-1000 ha orchard with 1,00,000 mango trees
• 1,000 varieties of mango contribute about 39% of the total fruit production in India
[Figure label: Mangifera casturi, Kalimantan, Indonesia]

1. Domestication and Dispersal of Mango Cultivars
(Singh et al. 2016)

Outline
1. Introduction
2. Improved ‘Amrapali’ assembly with BioNano optical fingerprinting
3. Annotation of protein-coding genes
4. Annotation of repeat elements
5. Centromere, telomere and non-coding RNA genes
6. Segmental duplications and phylogeny
7. Re-sequencing, SNP Discovery, association studies
8. Prospects


2. ‘Amrapali’ Genome Assembly
• Amrapali: 1st mango hybrid (Dashehari/Neelam), widely grown in India, dwarf, regular bearer
• Amphidiploid (2n = 40)
• Genome size = 439 Mb
• 1st mango draft assembly, PAG 2014 (based on 454 and Illumina data): assembly size 492.6 Mb
• Assembly size larger than the estimated genome size?

2. Additional PacBio Data for Reference Assembly

S. No. | PacBio Chemistry | No. of Runs | No. of reads | Total Bases | Genome coverage (x)
1 | P4C2 | 5 | 968,567 | 2,990,090,529 | 6.6
2 | P5C3 | 25 | 5,357,722 | 13,407,765,248 | 29.79
3 | P5C3 | 25 | 6,025,450 | 15,162,322,569 | 33.69
4 | P6C4 | 20 | 8,847,654 | 13,686,456,431 | 31.90
Total | | 75 | 21,199,393 | 45,246,634,777 | 101.98
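The coverage column in the PacBio table relates total sequenced bases to genome size. The sketch below shows that arithmetic. The slide does not state which genome size was used for this column, so the 450 Mb value is an assumption chosen for illustration; it roughly reproduces the two P5C3 rows but not every row.

```python
def fold_coverage(total_bases: int, genome_size_bp: float) -> float:
    """Fold coverage = sequenced bases / genome size (both in bp)."""
    return total_bases / genome_size_bp

# Total bases per PacBio data set, copied from the table.
pacbio_bases = {
    "P4C2": 2_990_090_529,
    "P5C3 (set 1)": 13_407_765_248,
    "P5C3 (set 2)": 15_162_322_569,
    "P6C4": 13_686_456_431,
}

# Assumption: the slide does not say which genome size the column uses.
ASSUMED_GENOME_SIZE_BP = 450e6

per_set = {
    name: fold_coverage(bases, ASSUMED_GENOME_SIZE_BP)
    for name, bases in pacbio_bases.items()
}
total_bases = sum(pacbio_bases.values())

for name, cov in per_set.items():
    print(f"{name}: {cov:.2f}x")
print(f"Total bases: {total_bases:,}")
```

Changing ASSUMED_GENOME_SIZE_BP to the flow cytometry or k-mer estimate shows how sensitive the reported coverage is to the genome size chosen.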

2. Genome size of ‘Amrapali’ by Flow Cytometry
[Figure: DNA histogram of mango and pea nuclei. X axis: Cell count; Y axis: Relative fluorescence intensity (DNA content)]
• The nuclear DNA content of the ‘Amrapali’ genome: 402 ±10 Mb; internal standard Pisum sativum (4300 Mb)
• Based on two independent experiments with 4 replicates each using FACS cell sorter (BD-LSR II), BD Biosciences
• Method used: Dolezel J, Greilhuber J, Suda J. (2007) Estimation of nuclear DNA content in plants using flow cytometry. Nature Protocols 2(9):2233-2245.

2. ‘Amrapali’ PacBio Reference Assembly
Customized ‘Falcon’ assembler
• Self error corrected PacBio reads
• 500 bp overlap, 25 % mismatch
• Polishing using ‘quiver’
• Scaffolding using Illumina mate pair
• Heterozygosity problem resolved by PacBio long reads

Amrapali Assembly Statistics:

Parameters | Value
No. and size of contigs | 9,703 (400.9 Mb)
Longest contig | 995.9 Kb
No. and size of scaffolds | 4,312 (403.2 Mb)
N50 of scaffolds | 282.5 Kb
Longest scaffold | 2.0 Mb
Scaffolds > 10 Kbp | 3,161 (73.3%)
Unknown bases (Ns) | 2.1 Mb (0.5%)
GC Content | 32 %

Amrapali WGS assembly (Pseudomolecules + scaffolds) deposited at DDBJ/EMBL/GenBank (LMWC02000000), December 2017

Anchoring of scaffolds in Pseudomolecules
• No. of markers (CASTA) mapped in contigs = 6,311/6,594 (95.7%)
• No. of markers included in pseudomolecules = 5,492/6,311 (87.0%)
• No. and size of anchored scaffolds = 1,222 (283.77 Mbp, 70.4 %)
• Longest pseudomolecule: 23.3 Mb
• Shortest pseudomolecule: 9.5 Mb
• No. and size of floating scaffolds: 3,089 (119.53 Mbp, 29.6 %)
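The assembly statistics report an N50 of scaffolds (282.5 Kb). As a reminder of what that figure means, here is a minimal sketch of the usual N50 definition. The toy lengths are hypothetical and are not taken from the Amrapali assembly.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together contain at least half of the total assembled bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("lengths must be non-empty")

# Hypothetical scaffold lengths, for illustration only.
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_lengths))
```

In practice the input would be the list of scaffold lengths read from the assembly FASTA index.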

2. Amrapali Genome: NGS Data Summary

S. No | Sequencing technology | No of reads | Base pairs (bp) | Genome coverage (X)
1 | 454-FLX | 6,179,100 | 2,436,840,346 | 7.64
2 | Illumina (MiSeq) 2X250 | 16,723,094 | 4,180,773,500 | 9.29
3 | Illumina (MiSeq) 2X250 | 33,091,426 | 8,272,856,500 | 18.39
4 | Illumina (MiSeq) 2X300 | 37,016,200 | 11,104,860,000 | 24.68
5 | Illumina (HiSeq) 2 Kb Mate Pair | 78,067,384 | 7,806,738,400 | 17.35
6 | Illumina (HiSeq) 8 Kb Mate Pair | 71,029,872 | 7,102,987,200 | 15.78
7 | Illumina (HiSeq) Paired end (insert 250-350 bp) | 455,013,076 | 45,501,307,600 | 101.11
9 | PacBio, 55 runs | 12,351,739 | 31,560,178,346 | 70

2nd Draft of Amrapali genome (PAG 2016)
PacBio draft assembly: 323 Mb (73.08%)
NCBI Ac No. LMWC01000000

2. Amrapali Improved Assembly: BioNano Data Summary

S. No. | Parameter | Value
1 | Number of molecules | 890,863
2 | Total length of those molecules (Mb) | 64,376 (160x)
3 | Length N50 (Kb) | 93.7
4 | Minimum length (Kb) | 20.1
5 | Maximum length (Kb) | 1,353.6


2. Mango Genome 20 Pseudomolecules
(Using high density map by Luo et al. FPS 2016*)

2. Improved Amrapali Assembly with BioNano Optical Fingerprinting

Parameter | Version 2.0 (PacBio+Illumina) | Version 3.0 (Version 2 + BioNano)
Total number of scaffolds | 9030 | 2314
Total genome coverage (Mbp) | 403.18 | 407.95
No. of scaffolds in merged pseudomolecules | 1,222 | 97
Total size of pseudomolecules (Mbp) | 283.77 | 293.29
Total number of unordered scaffolds | 3090 | 2217
Total size of unordered scaffolds (Mbp) | 119.42 | 114.68
Total number of mapped markers | 6145 (95.05%) | 6297 (97.5%)
No. of markers used in pseudomolecules | 5492 (84.94%) | 5845 (90.045%)
No. of scaffolds after pseudomolecule construction | 4312 | 2217
Longest pseudomolecule | 23.31 Mb | 25.5 Mb
Smallest pseudomolecule | 9.5 Mb | 9.8 Mb

*Luo Chun, Shu Bo, Yao Quangsheng, Wu Hongxia, Xu Wentian, Wang Songbiao (2016). Construction of a High-Density Genetic Map Based on Large-Scale Marker Development in Mango Using Specific-Locus Amplified Fragment Sequencing (SLAF-seq). Front. Plant Sci., 30 August 2016.

The BLASTn program was used with advanced parameters for mapping of markers.

2. Genetic and Physical Maps Correspondence
[Figure not recoverable from the extracted text]

2. Genome Coverage: BUSCO Analysis

S.No. | Details | Mango genome V 2.0 (PacBio+Illumina) | Mango genome V 3.0 (PacBio+Illumina+BioNano)
1 | Complete BUSCOs (C) | 1293 (89.8%) | 1296 (90.0%)
2 | Complete and single-copy BUSCOs (S) | 1118 (77.6%) | 1114 (77.4%)
3 | Complete and duplicated BUSCOs (D) | 175 (12.2%) | 182 (12.6%)
4 | Fragmented BUSCOs (F) | 50 (3.5%) | 48 (3.3%)
5 | Missing BUSCOs (M) | 97 (6.7%) | 96 (6.7%)
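The BUSCO percentages can be checked from the counts alone. The sketch below assumes the usual BUSCO bookkeeping, in which complete = single-copy + duplicated and the searched set is complete + fragmented + missing. The total of 1,440 is not printed on the slide; it is derived here from the counts.

```python
def busco_summary(single, duplicated, fragmented, missing):
    """Return BUSCO category percentages (one decimal) from raw counts."""
    complete = single + duplicated
    total = complete + fragmented + missing

    def pct(n):
        return round(100 * n / total, 1)

    return {
        "total": total,
        "C": pct(complete),
        "S": pct(single),
        "D": pct(duplicated),
        "F": pct(fragmented),
        "M": pct(missing),
    }

# Counts copied from the BUSCO table.
v2 = busco_summary(single=1118, duplicated=175, fragmented=50, missing=97)
v3 = busco_summary(single=1114, duplicated=182, fragmented=48, missing=96)
print(v2)
print(v3)
```

Both versions sum to the same derived total, so the two columns are directly comparable.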

2. Completeness of the Amrapali Assembly: Mapping 19 Different NCBI-SRA Transcriptome Reads
(Assembly completeness analysis, cont.)

S.No. | Type of RNA-Seq Data (SRA-NCBI) | Sequencing Technology | Submitter | Mapping %
1 | mango F1 population | Illumina HiSeq 2500 | SSCRI, China | 89.71825678
2 | RNA seq Chausa | NextSeq 500 | ICAR-CISH, India | 97.90796949
3 | RNA seq Amrapali | NextSeq 500 | ICAR-CISH, India | 97.8609031
  | Mangifera indica '' transcript reads from mixed organs | Illumina HiSeq 2000 | Indiana University | 98.42958719
4 | Mangifera indica 'TOMMY ATKINS' transcript reads from seed | Illumina HiSeq 2000 | Indiana University | 98.05686491
5 | Mangifera indica 'TOMMY ATKINS' transcript reads from mesocarp | Illumina HiSeq 2000 | Indiana University | 98.63497871
6 | Mangifera indica 'TOMMY ATKINS' transcript reads from leaf | Illumina HiSeq 2000 | Indiana University | 98.19120256
7 | Mangifera indica 'TOMMY ATKINS' transcript reads from flower | Illumina HiSeq 2000 | Indiana University | 98.04236459
8 | Mangifera indica 'TOMMY ATKINS' transcript reads from exocarp | Illumina HiSeq 2000 | Indiana University | 98.03759958
9 | Mangifera indica 'TURPENTINE' transcript reads | Illumina HiSeq 2000 | Indiana University | 98.28825616
10 | Mangifera indica 'THAI EVERBEARING' transcript reads | Illumina HiSeq 2000 | Indiana University | 98.32381005
11 | Mangifera indica 'NEELUM' transcript reads | Illumina HiSeq 2000 | Indiana University | 98.6131978
12 | Mangifera indica 'M. CASTURI "PURPLE"' transcript reads | Illumina HiSeq 2000 | Indiana University | 98.39724743
13 | Mangifera indica 'BURMA' transcript reads | Illumina HiSeq 2000 | Indiana University | 96.02066555
14 | Mangifera indica 'TOMMY ATKINS' transcript reads from seed coat | Illumina HiSeq 2000 | Indiana University | 97.75470678
15 | Mangifera indica 'AMIN ABRAHIMPUR' transcript reads | Illumina HiSeq 2000 | Indiana University | 98.53282503
16 | Amrapali leaf transcriptome sequence | Illumina MiSeq | ICAR-NRCPB, India | 97.91252088
17 | Plant sample from Mangifera indica | Illumina HiSeq 2000 | Universidad Nat Auto de Mexico | 97.84069045
18 | Complete mango transcriptome | Illumina HiSeq 2000 | University of Karachi, Pakistan | 98.74955947
19 | MANGO TRANSCRIPTOME | Illumina HiSeq 2000 | SSCRI, China | 97.99064843

• Mapped RNA-seq reads on to Amrapali Assembly: >96 %
• Confirms high coverage of the Assembly at transcript level

3. Annotation of Protein-coding Genes
(De novo, Protein Homology and Transcript based)

S. No. | Particular | Value
1 | Number of gene models | 46,283
2 | Number of CDS | 46,395
3 | Number of exons | 227,996
4 | Number of introns | 181,601
5 | Total CDS length | 41.5 Mb
6 | Total gene length | 107.5 Mb
7 | Shortest gene/CDS | 150 bp
8 | Longest gene | 89,243 bp
9 | Longest CDS | 12,582 bp
10 | Mean gene length | 2,324 bp
11 | Mean CDS length | 896 bp
12 | % of genome covered by genes | 26.7
13 | % of genome covered by CDS | 10.3
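A few of the annotation figures can be cross-checked against each other. The sketch below does that arithmetic. The 403.2 Mb denominator is an assumption taken from the scaffold total reported in the assembly statistics; the slide does not say which assembly size was used for the percentage rows.

```python
# Values copied from the annotation table.
n_cds = 46_395
n_exons = 227_996
n_introns = 181_601
total_gene_len_mb = 107.5
total_cds_len_mb = 41.5

# Assumption: assembly size used as the denominator (scaffold total, in Mb).
ASSUMED_ASSEMBLY_MB = 403.2

# Each transcript with n exons has n - 1 introns, so exons minus introns
# should equal the number of transcripts (CDS).
transcripts_implied = n_exons - n_introns

pct_genes = round(100 * total_gene_len_mb / ASSUMED_ASSEMBLY_MB, 1)
pct_cds = round(100 * total_cds_len_mb / ASSUMED_ASSEMBLY_MB, 1)

print(transcripts_implied, pct_genes, pct_cds)
```

Under that assumption the implied transcript count matches the number of CDS, and both percentages match the table.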


3. Protein coding genes: Manual Categorization
[Figure: bar chart, "Functional categorization of annotated genes in mango genome *"; y-axis: No. of genes; category labels not recoverable from the extracted text]
* Using Swissprot, GO, NCBI and AHRD

5. Telomere, Centromere and Non-coding RNA
Total 1686 non-coding RNA genes belonging to 11 major families

4. Annotation of Repeat Elements
• 181.75 Mb (45.02%) of the Mango genome is made of 642,130 copies of repeat elements
• 61.67 Mb (20.05%) is similar to Citrus sinensis.
• 52.6 Mb (16.1%) is similar to Theobroma cacao.

6. Segmental Duplications in the Mango Genome
Fig (a)
• Synteny analysis between Mango vs. Mango genome
• 18 fragments of more than 1 Mb in length distributed among the 11 chromosomes.
• The largest fragment was between chromosome 11 and 16 (6.09 Mb fragment size).
• Chromosome no. 2, 3, 5 and 7 were unique and no self duplication has been observed.
Fig (b) & (c)
• Segmental duplication and synonymous substitution analysis of Chromosome 14-15.
[Figure labels: 95-111 Mya; 142-158 Mya; 190-206 Mya]

6. Age of 18 Duplicated Segments in 11 Mango Chromosomes
[Figure not recoverable from the extracted text]

5. Telomeres and Centromeres in Pseudomolecules
[Figure legend: Telomere; Centromere]

Centromere Position in Mango Genome

Chr No. | Start | End
Chr1 | 13141987 | 13142205
Chr2 | 1660086 | 1660278
Chr3 | 14767404 | 14767615
Chr4 | 14877846 | 14878053
Chr5 | 5934811 | 5935008
Chr6 | 7276298 | 7276518
Chr7 | 5310159 | 5310256
Chr8 | 818115 | 818212
Chr9 | 3397567 | 3397760
Chr10 | 3758110 | 3758305
Chr11 | 11301582 | 11301790
Chr12 | 8133444 | 8133655
Chr13 | 10792307 | 10792401
Chr14 | 501552 | 501770
Chr15 | 4327997 | 4328016
Chr16 | 4916215 | 4916321
Chr17 | 3106152 | 3106341
Chr18 | 10639434 | 10639586
Chr19 | 14925917 | 14926109
Chr20 | 8586731 | 8586936


6. Mango Genes Conserved in 13 Plant Species
Two-way Venn diagram of mango (Mangifera indica L.) genes conserved in 13 other plant species (shown in red). Total number of genes in respective species (shown in black). Of the total 46,395 genes, 18,141 were uniquely expressed in mango.

7. Re-sequencing, SNPs Discovery and Heterozygosity in 25 Varieties
[Figure: Genome wide SNP heterozygosity in 'Amrapali' genome]

Single Feature Polymorphism in 25 Mango varieties in 19,625 genes with Reference to 'Amrapali' Genome

Linkage Group | Length (bp) | No. of SNPs | No. of InDels | SFP Density (bp)
LG1 | 15,232,102 | 296,893 | 82,937 | 40
LG2 | 13,796,269 | 285,649 | 75,374 | 38
LG3 | 16,490,533 | 365,965 | 90,476 | 36
LG4 | 21,168,693 | 415,337 | 119,965 | 40
LG5 | 11,781,399 | 255,264 | 72,487 | 36
LG6 | 11,560,298 | 237,208 | 67,574 | 38
LG7 | 11,574,172 | 254,354 | 72,195 | 35
LG8 | 11,330,313 | 222,043 | 65,291 | 39
LG9 | 18,393,972 | 381,308 | 101,482 | 38
LG10 | 10,058,126 | 218,774 | 54,533 | 37
LG11 | 14,275,844 | 265,443 | 70,460 | 42
LG12 | 12,202,370 | 256,656 | 69,183 | 37
LG13 | 15,640,443 | 325,776 | 84,840 | 38
LG14 | 9,516,842 | 203,871 | 53,086 | 37
LG15 | 10,244,494 | 186,084 | 53,646 | 43
LG16 | 18,252,290 | 315,426 | 85,678 | 46
LG17 | 10,284,815 | 226,240 | 54,276 | 37
LG18 | 15,009,555 | 322,189 | 80,992 | 37
LG19 | 23,318,555 | 486,824 | 136,261 | 37
LG20 | 13,640,176 | 301,403 | 72,853 | 36
Floating | 119,534,726 | 2,540,013 | 634,133 | 38
Total | 403,305,987 | 8,362,720 | 2,197,722 | 38
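The SFP density column appears to be the average spacing in bp between polymorphisms, that is, length divided by the combined count of SNPs and InDels. This reading is inferred from the numbers and is not a definition given on the slide. The sketch below checks it on three rows.

```python
def sfp_density(length_bp, snps, indels):
    """Average spacing (bp) between polymorphisms, counting SNPs and InDels."""
    return length_bp / (snps + indels)

# (length in bp, no. of SNPs, no. of InDels), copied from the table.
rows = {
    "LG1": (15_232_102, 296_893, 82_937),
    "LG2": (13_796_269, 285_649, 75_374),
    "Total": (403_305_987, 8_362_720, 2_197_722),
}

densities = {name: round(sfp_density(*values)) for name, values in rows.items()}
print(densities)
```

For these rows the rounded values agree with the table (40, 38 and 38 bp).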

6. Collinearity of Mango Chromosomes with Citrus and Cacao Genomes
[Figure labels: Mangifera indica; Citrus sinensis; Theobroma cacao]
Collinearity analysis showed the duplication of blocks in chromosome no. 2, 3, 5 and 7 in Mango vs. Citrus and Mango vs. cacao, which was hidden in the mango segmental duplication analysis.

7. Design and Fabrication of Axiom 80K Genic-SNP Genotyping Chip
1. No. of varieties re-sequenced by Illumina HiSeq: 24
2. Cloned Horticulturally Important Mango Genes (HIM): 1,492
3. Disease Resistance Defense Response-like Mango genes (DRDRM): 1,440
4. Single-Copy Mango genes (SCM): 8,045
5. Conserved Single-Copy genes in Citrus and Mango (CSCCM): 8,648
6. Total number of genes selected for the chip: 19,625
7. Total number of SNPs identified in these genes: 5,257,381
8. No. of SNPs sent for chip fabrication: 116,683
9. No. of SNP probes included in the chip: 91,014
10. No. of SNPs represented by the probes: 80,816

Phylogenetic Analysis
[Figure labels: Monocot; Dicot; Gymnosperm]
Rooted Bayesian tree based on 17 gene sequences conserved in 18 different plant species. Tree was rooted with Gymnosperm species Pinus taeda.

7. Mango 80K SNP Chip Validation
• Genomic DNA from 96 mango varieties was used for validation.
[Figure: Snapshot of 100% successful Mango 80K SNP run on Gene Titan platform]
[Figure: Summary of validated 96 varieties using 80K SNP chip]
[Photo: Team members of the Mango 80K SNP Chip]


7. Mango 80K SNP Chip Design and Validation
[Figure: Snapshot of SNP Summary and Cluster Plot of validated SNP locus AX-169874412]

7. Diversity Analysis of Mango Cultivars
[Figure: tree with clusters 1a, 1b, 2a, 2b, 2c, 2d; marked varieties: 1 - Tommy Atkins, 2 - Amrapali, 3 - Kensington, 4 - Neelam, 5 - Dashehri]
Haplotype based UPGMA phylogenetic tree of 384 mango varieties based on 60 unlinked genome wide SNPs with 100% genotypic call rates and MAF of > 0.3
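The SNPs used for the tree were filtered on genotypic call rate and minor allele frequency (MAF). A minimal sketch of such a filter follows, assuming biallelic SNPs with genotypes written as two-letter strings and missing calls as None. The function names and the toy genotypes are hypothetical and do not come from the chip software.

```python
from collections import Counter

def call_rate(genotypes):
    """Fraction of samples with a non-missing genotype call."""
    return sum(g is not None for g in genotypes) / len(genotypes)

def minor_allele_frequency(genotypes):
    """Frequency of the rarer allele among all called alleles (biallelic SNP)."""
    counts = Counter(a for g in genotypes if g is not None for a in g)
    total = sum(counts.values())
    if total == 0 or len(counts) < 2:
        return 0.0  # monomorphic or no calls
    return min(counts.values()) / total

def passes_filter(genotypes, min_maf=0.3, min_call_rate=1.0):
    """Keep SNPs with call rate >= min_call_rate and MAF > min_maf."""
    return (call_rate(genotypes) >= min_call_rate
            and minor_allele_frequency(genotypes) > min_maf)

# Hypothetical genotype calls for two SNPs across a few samples.
snp_a = ["AA", "AG", "GG", "AG"]        # fully called, MAF 0.5
snp_b = ["AA", "AA", "AG", "AA", None]  # one missing call, rare allele
print(passes_filter(snp_a), passes_filter(snp_b))
```

The default thresholds mirror the caption (100% call rate, MAF > 0.3) and can be changed through the keyword arguments.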

7. Diversity and Population Structure of Mango

7. Mango Varietal Phenotyping

7. Genome-Wide Association Mapping (Acidity)

7. Population Structure of 384 Mango Genotypes

• A subset of 60 SNP markers covering the whole genome and having a minor allele frequency of 0.5 was selected among 96.

• The maximum delta K (ad hoc quantity) was reached at K = 2, suggesting the presence of two subpopulations in the collection.

• In the SNP-based structure analysis, population I consisted of 39.5% of the genotypes (38). Population II comprised a total of 58 genotypes.

• Of the total genotypes, 62% were pure and 38% were admixed.

Figure: The true value of K was obtained following the delta K method of Evanno et al.
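The delta K statistic can be sketched as follows. This is a simplified version that uses the mean log-likelihood L(K) across replicate runs and divides the absolute second difference by the standard deviation of L(K). The log-likelihood values are hypothetical and are not the results of the analysis above.

```python
from statistics import mean, stdev

def delta_k(lnp_by_k):
    """Map K -> delta K, where lnp_by_k maps K to a list of ln P(D)
    values from replicate runs at that K.

    delta K = |L(K+1) - 2 L(K) + L(K-1)| / sd(L(K)), with L taken as the
    mean over replicates. Defined only for K with both neighbours present.
    """
    means = {k: mean(v) for k, v in lnp_by_k.items()}
    result = {}
    for k in sorted(lnp_by_k):
        if k - 1 in means and k + 1 in means:
            second_diff = abs(means[k + 1] - 2 * means[k] + means[k - 1])
            result[k] = second_diff / stdev(lnp_by_k[k])
    return result

# Hypothetical replicate log-likelihoods for K = 1..4.
toy_runs = {
    1: [-5000, -5002, -4998],
    2: [-4500, -4501, -4499],
    3: [-4450, -4460, -4440],
    4: [-4430, -4445, -4415],
}
dk = delta_k(toy_runs)
best_k = max(dk, key=dk.get)
print(dk, best_k)
```

With these toy values the peak falls at K = 2, which is the pattern described in the bullets: a sharp rise in likelihood up to the best K followed by a plateau.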


7. Genome-Wide Association Mapping (Fruit Length)

Dr. Nagendra Kumar Singh, National Professor BP Pal Chair

7. Genome-Wide Association Mapping (Pulp Percentage)

8. Prospects

1. Iso-Seq based validation of gene models
2. Hi-C for chromosome level pseudomolecules
3. Functional Genomics: Expression profiling of different tissues at RNA, protein and metabolite levels
4. Mango cytology and wild species
5. Marker-trait association studies
6. Utilization in mango breeding
