Genomic characterization of African cultivated and wild species Peterson Weru Wambugu Msc

A thesis submitted for the degree of Doctor of Philosophy at The University of Queensland in 2017 Queensland Alliance for Agriculture and Food Innovation

Abstract

Rice is the most important crop in the world, acting as the staple food for over half of the world’s population. It belongs to the Oryza genus and has a wide genetic diversity of both wild and cultivated species, with a substantial proportion of this diversity being found in Africa. In this thesis, African cultivated and wild Oryza species have been characterized using genomic tools with the goal of promoting their conservation and utilization. A desk based study aimed at documenting the current state of conservation and utilization of African Oryza genetic resources revealed that they remain under collected and hence underrepresented in germplasm collections. Moreover, they are under-characterized and therefore grossly underutilized. In order to obtain maximum benefits from these resources, it is imperative that they are collected, efficiently conserved and optimally utilized. Efficient conservation and utilization of the Oryza gene pool is however hindered by lack of conclusive, consistent and well resolved phylogenetic and evolutionary relationships between the various cultivated and wild species. This study therefore sought to define the genetic and evolutionary relationships between the various wild and cultivated African species that form part of the AA genome group. This was done by reconstructing the phylogeny of Oryza AA genome species by massive parallel sequencing of whole chloroplast genomes. The analysis resulted in a well resolved and strongly supported phylogeny of the AA genome species showing strong genetic differentiation.

The clearly defined relationships between the various taxa provide a robust platform for supporting the conservation and utility of these species. In an effort at promoting utility of these species, their starch physicochemical properties and the functional diversity relating to these starch related traits were analyzed. African was found to have some unique starch structure and properties, among these being high amylose content which is higher than in Asian rice. Amylose content is a key starch trait in rice. The health benefits of high amylose foods are increasingly getting recognized. African rice therefore seems to be a valuable genetic resource for breeding premium-value genotypes with high amylose content. The deployment of African rice genetic resources in rice improvement is however constrained by the inadequate understanding of the genetic mechanism underlying their unique starch traits, specifically amylose content. First, this study analysed key starch biosynthesis genes in the genomes of various cultivated and wild Oryza species. The analysis revealed that the pullulanase gene has undergone gene duplication in African rice. This apparent gene duplication was characterized with the aim of understanding whether it

ii could be the molecular mechanism underlying high amylose in African rice. It was hypothesized that this gene duplication may have led to increased pullulanase activity which may have subsequently led to unique starch structure and properties due to protein dosage effects. However, pullulanase activity between O. glaberrima and O. sativa was not significantly different as would have been expected. The results therefore did not support the study hypothesis and it therefore appears that the high amylose might be due to other genetic mechanisms and not due to pullulanase gene duplication.

Further analysis was therefore required in order to unravel these genetic mechanisms underlying amylose content. This was conducted by genetic mapping of an interspecific cross between O. sativa and O. glaberrima segregating for amylose content. The aim was to identify genetic variants and candidate genes associated with amylose biosynthesis. Phenotyping for amylose content resulted in a normal distribution of amylose content showing transgressive segregation. Sequencing based bulk segregant analysis enabled the identification of markers putatively linked with amylose content. In particular, a G/A SNP located on chromosome 12 of granule bound starch synthase (GBSS1) that causes an amino acid change from Asp to Asn was identified. This variation may influence the stability of GBSS1 and hence its activity as it is adjacent to a disulphide linkage conserved in all grass GBSS proteins. A total of 106 candidate SNPs clustered with this polymorphism in a genomic region of about 2.1Mbp, most likely due to linkage drag. Finally, an amylose candidate gene analysis was conducted using an integrative genetics approach. This involved a combination of several strategies namely genomic based bulk segregant analysis, gene co-expression analysis and analysis of cis regulatory motifs. Two candidate transcription factors belonging to NAC and CCAAT-HAP5 families were found to be co- expressed with key starch genes and also to share conserved regulatory motifs with known amylose biosynthesis genes. Results of this study suggest that these transcription factors are involved in the transcriptional regulation of the GBSS1 structural gene. These candidate genes provide novel targets for manipulating amylose content.

iii

Declaration by author

This thesis is composed of my original work, and contains no material previously published or written by another person except where due reference has been made in the text. I have clearly stated the contribution by others to jointly-authored works that I have included in my thesis.

I have clearly stated the contribution of others to my thesis as a whole, including statistical assistance, survey design, data analysis, significant technical procedures, professional editorial advice, and any other original research work used or reported in my thesis. The content of my thesis is the result of work I have carried out since the commencement of my research higher degree candidature and does not include a substantial part of work that has been submitted to qualify for the award of any other degree or diploma in any university or other tertiary institution. I have clearly stated which parts of my thesis, if any, have been submitted to qualify for another award.

I acknowledge that an electronic copy of my thesis must be lodged with the University Library and, subject to the policy and procedures of The University of Queensland, the thesis be made available for research and study in accordance with the Copyright Act 1968 unless a period of embargo has been approved by the Dean of the Graduate School.

I acknowledge that copyright of all material contained in my thesis resides with the copyright holder(s) of that material. Where appropriate I have obtained copyright permission from the copyright holder to reproduce material in this thesis.

iv

Publications during candidature

Peer reviewed papers Wambugu, P., Furtado, A., Waters, D., Nyamongo, D. & Henry, R. 2013. Conservation and utilization of African Oryza genetic resources. Rice, 6, 29-29.

Wambugu, P.W., Brozynska, M., Furtado, A.., Waters, D.L. and Henry, R.J. 2015. Relationships of wild and domesticated (Oryza AA genome species) based upon whole chloroplast genome sequences. Scientific Reports 5, 13957 Wang, K., Wambugu, P.W., Zhang, B., Wu, A.C., Henry, R.J., and Gilbert, R.J. 2015. The biosynthesis, structure and gelatinization properties of starches from wild and cultivated African rice species (Oryza barthii and Oryza glaberrima). Carbohydrate Polymers 129, 92- 100

Conference abstracts Wambugu, P.W. A. Furtado and R. Henry. 2015. Molecular and genetic basis of unique starch properties in African rice (Oryza glaberrima). Paper presented at the 65th Australasian grain science conference. 16th-18th September 2015.

Publications included in this thesis

Wambugu, P.W, Furtado, A., Waters, D., Nyamongo, D. & Henry, R. 2013. Conservation and utilization of African Oryza genetic resources. Rice, 6, 29-29 – incorporated as chapter 2

Contributor Statement of contribution

Peterson Wambugu (Candidate) Wrote paper (80%) Designed study (40%)

Agnelo Furtado Edited paper (10%)

Daniel Waters Edited paper (10%)

Desterio Nyamongo Edited paper (10%)

Robert Henry Designed study (60%) Wrote paper (20%) Edited paper (70%)

v

Wambugu, P.W., Brozynska, M., Furtado, A.., Waters, D.L. and Henry, R.J. 2015. Relationships of wild and domesticated rices (Oryza AA genome species) based upon whole chloroplast genome sequences. Scientific Reports 5, 13957 - incorporated as chapter 3

Contributor Statement of contribution

Peterson Wambugu (Candidate) Designed experiments (30%) Performed the experiments (100%) Data analysis (90%) Wrote paper (80%)

Agnelo Furtado Designed experiments (20%) Edited paper (10%)

Daniel Waters Edited paper (10%)

Marta Brozynska Statistical analysis of data in Table 14 and Figure 15 (10%)

Robert Henry Designed experiments (50%) Edited paper (80%) Wrote paper (20%)

vi

Contributions by others to the thesis

Principal advisor, Prof. Henry Robert contributed in the conception and design of the project. He also edited and critically revised sections of this thesis. Dr. Marie-Noelle Ndjiondjop of Africa Rice Center provided the mapping population that was analysed in this thesis.

Statement of parts of the thesis submitted to qualify for the award of another degree

None

vii

Acknowledgements

This thesis is the culmination of 4years full time research study conducted at the University of Queensland. The production of this thesis would not have been possible without the financial, technical, moral and logistical support of many individuals and organizations, which I hereby gratefully acknowledge.

First, I give thanks to God for granting me good health and strength to pursue this PhD project. His grace has been sufficient throughout the 4 years that I undertook this project. To Him be the glory and honour.

It is with profound sense of indebtedness that I gratefully acknowledge the financial assistance from Australian Awards through the award of an ADS scholarship. Without this generous financial support, pursuit of this dream would not have been possible.

My greatest respect and appreciation go to my principal advisor, Prof. Robert Henry for his unreserved support, guidance and encouragement. Robert showed keen interest in my PhD work and despite his extremely busy schedule was always available to offer professional guidance. I wish to thank my associate advisor Dr. Agnelo Furtado for his support. I am deeply grateful for the support that I received from the Queensland Alliance for Agriculture and Food Innovation (QAAFI).

I wish to express my sincere gratitude to Kenya Agricultural and Livestock Research Organization (KALRO) for granting me an opportunity to pursue my studies. Special thanks go to Kenya Health Inspectorate Service (KEPHIS) for allowing me to conduct part of my work in their facilities.

I am also grateful to International Rice Research Centre (IRRI) and African Rice Centre for the provision of seed samples that were analysed in this thesis study. I am particularly grateful to Dr. Marie-Noelle Ndjiondjop who provided the mapping population whose analysis formed the bulk of the work reported in this thesis.

Special thanks go to my PhD colleagues from the Robert Henry’s Research Group whose encouragement and moral support gave me the strength to push on and made the compilation of this thesis less stressful. The moments we spent together and the experiences we shared are greatly treasured and will forever remain etched in my mind.

viii

Finally and most important, I wish to thank my wife, Kesiah and children, Purity and William for their unwavering support, patience and encouragement. With honour I dedicate this thesis to them.

ix

Keywords genetic resources, phylogenetic analysis, chloroplast genome, gene duplication, genome sequencing, amylose content, bulk segregant analysis, starch biosynthesis genes, genetic marker, candidate gene

Australian and New Zealand Standard Research Classifications (ANZSRC)

ANZSRC code: 060408, Genomics, 50% ANZSRC code: 060309, Phylogeny and comparative analysis, 30% ANZSRC code: 050202, Conservation and biodiversity, 20%

Fields of Research (FoR) Classification

FoR code: 0604, Genetics, 50% FoR code: 0603, Evolutionary biology, 30% FoR code: 0502, Environmental science and management, 20%

x

Table of Contents Abstract ...... ii

Declaration by author ...... iv

Publications during candidature ...... v

Publications included in this thesis ...... v

Contributions by others to the thesis ...... vii

Statement of parts of the thesis submitted to qualify for the award of another degree ...... vii

Acknowledgements ...... viii

Keywords ...... x

Australian and New Zealand Standard Research Classifications (ANZSRC) ...... x

Fields of Research (FoR) Classification ...... x

Table of Contents ...... xi

List of Figures ...... xvii

List of Tables ...... xix

List of Abbreviations used in the thesis ...... xxi

Chapter 1: General Introduction ...... 1

1.1. Background information ...... 1

1.2. Overview of African Oryza species ...... 3

1.3. Phylogenetic relationships of African Oryza species ...... 4

1.4. Potential of whole chloroplast genome sequences in resolving phylogenetic relationships ……………………………………………………………………………………………………..7

1.5. Application of genomic tools in genetic mapping ...... 8

1.6. Rice grain quality and its genetic control ...... 11 1.6.1. Overview of rice quality ...... 11 1.6.2. Genetic potential of African Oryza species in breeding for rice quality ...... 13 1.6.3. Potential health benefits of African rice ...... 14 1.6.4. Genetic control of amylose content in Asian and African rice ...... 15 1.6.5. Application of genetic markers in breeding for amylose ...... 20 1.6.6. Manipulating amylose content in ...... 21

1.7. Research problem ...... 22

1.8. Research Goal and Objectives ...... 25 xi

1.8.1. Research Goal ...... 25 1.8.2. General objective ...... 25 1.8.3. Specific objectives ...... 25

1.9. Thesis structure ...... 25

1.10. References ...... 27

Chapter 2: Conservation and utilization of African Oryza genetic resources ...... 43

Abstract ...... 43

2.1. Introduction ...... 44

2.2. Germplasm Conservation of African Oryza ...... 44 2.2.1. Ex situ conservation ...... 45 2.2.2. In situ conservation ...... 53 2.2.3. On farm conservation...... 56

2.3. Utility of African Oryza: Achievements and challenges ...... 57 2.3.1. Genetic potential of African Oryza ...... 58 2.3.2. Negative attributes of African rice ...... 60 2.3.3. Challenges in genetic mapping in African rice ...... 61

2.4. Characterization of African Oryza ...... 62

2.5. References ...... 64

Chapter 3: Relationships of wild and domesticated rices (Oryza AA genome species) based upon whole chloroplast genome sequences ...... 73

Abstract ...... 73

3.1. Introduction ...... 74

3.2. Results ...... 75 3.2.1. Genome assembly and features ...... 75 3.2.2. Phylogenetic analysis ...... 77 3.2.3. Molecular divergence dating ...... 83

3.3. Discussion ...... 85 3.3.1. Origin and evolution of AA genome species ...... 85 3.3.2. Relationship between O. rufipogon, O. sativa and O. nivara ...... 88 3.3.3. Relationship between O. barthii and O. glaberrima and the origin of African rice ...... 89 3.3.4. Relationship between Australian wild taxa ...... 90

xii

3.4. Methods ...... 93 3.4.1. Plant materials ...... 93 3.4.2. DNA extraction and sequence assembly ...... 93 3.4.3. Phylogenetic analysis ...... 94 3.4.4. Molecular divergence dating ...... 95 3.4.5. Genome annotation ...... 95

3.5. References ...... 96

Chapter 4: Characterization of apparent gene duplication in a gene encoding a pullulanase-type debranching enzyme in African rice (Oryza glaberrima) ...... 103

Abstract ...... 103

4.1. Introduction ...... 104

4.2. Results ...... 106 4.2.1. Sequence analysis ...... 106 4.2.2. Putative functional cis-regulatory elements ...... 108 4.2.3. Conserved protein domain ...... 110 4.2.4. Pullulanase activity ...... 110

4.3. Discussion ...... 110 4.3.1. Gene structure ...... 110 4.3.2. Putative functional cis-regulatory elements ...... 111 4.3.3. Conserved protein domains ...... 112 4.3.4. Functional effect of pullulanase gene duplication ...... 113 4.3.5. Effect of pullulanase gene duplication on Amylose content ...... 114 4.3.6. Genome assembly error ...... 115

4.4. Conclusion ...... 115

4.5. Materials and methods ...... 116 4.5.1. Gene sequences ...... 116 4.5.2. Search for cis regulatory elements and conserved protein domains ...... 116 4.5.3. Assay of pullulanase activity ...... 116

4.6. References ...... 117

Chapter 5: Sequencing of bulks of segregants allows dissection of genetic control of amylose content in rice ...... 124

xiii

Abstract ...... 124

5.1. Introduction ...... 124

5.2. Results ...... 126 5.2.1. Analysis of amylose content and identification of pool segregants ...... 126 5.2.2. DNA sequencing and mapping ...... 127 5.2.3. Association analysis identifies putative amylose-linked genetic markers ...... 129 5.2.4. Allele mining in natural variation of African rice GBSS1 reveals novel alleles ...... 131 5.2.5. Analysis of an indel marker in O. glaberrima GBSS1 ...... 132 5.2.6. Analysis of O. sativa amylose-linked functional SNPs ...... 133

5.3. Discussion ...... 133 5.3.1. Genetics of amylose content ...... 133 5.3.2. Bulk segregant analysis identifies major and minor loci associated with amylose content …………………………………………………………………………………………………134 5.3.3. Analysis of allele frequency detects linkage drag around waxy locus ...... 136 5.3.4. Potential of bulk segregant analysis in genetic mapping ...... 137 5.3.5. Analysis of natural variation identifies novel SNPs in African rice GBSS1 ...... 139 5.3.6. Indel marker can help distinguish O. glaberrima and O. sativa GBSS1 ...... 140 5.3.7. Comparison of allelic patterns in functional SNP positions identified in O. sativa ...... 141 5.3.8. Conclusion and implications of this study in breeding for amylose ...... 142

5.4. Methods ...... 143 5.4.1. Mapping population ...... 143 5.4.2. Amylose content determination ...... 144 5.4.3. Bulk segregant analysis...... 144 5.4.4. DNA extraction and quantification ...... 144 5.4.5. Sequencing and data processing ...... 145 5.4.6. Statistical analysis ...... 146 5.4.7. Analysis of GBSS1 from unrelated accessions of African rice and O. barthii ...... 148

5.5. References ...... 151

Chapter 6: Analysis of candidate genes for amylose synthesis in rice using sequencing based bulk segregant analysis ...... 159

Abstract ...... 159

6.1. Introduction ...... 159

6.2. Results ...... 162 6.2.1. Allele frequency analysis identifies putative candidate genes and TFs ...... 162

xiv

6.2.2. Gene expression profiles and gene co-expression analysis ...... 164 6.2.3. Analysis of conserved regulatory motifs ...... 169

6.3. Discussion ...... 170 6.3.1. Genetic control of amylose ...... 170 6.3.2. Association analysis identifies known and novel candidate genes/TFs ...... 171 6.3.3. Gene co-expression helps identify functionally related genes ...... 174 6.3.4. Integrative genetic analysis provides insights on candidate genes and transcriptional regulatory pathways in amylose biosynthesis ...... 175

6.4. Conclusion ...... 176

6.5. Methods ...... 177 6.4.1. DNA extraction and sequencing ...... 177 6.4.2. Variant calling ...... 177 6.4.3. Identification of starch related candidate genes and association analysis ...... 177 6.4.4. Gene expression and co-expression analysis ...... 177

6.5. References ...... 181

Chapter 7: General discussion, conclusions and future directions ...... 188 7.1. Overview ...... 188 7.2. Application of genomics in conservation and utilization of African Oryza genetic resources ...... 190 7.3. Application of chloroplast phylogenomics in phylogenetic analysis ...... 191 7.4. Analysis of functional diversity in African rice and its potential deployment in breeding for amylose ...... 193 7.5. Dissecting the genetic architecture of amylose synthesis using integrative genetics approach ...... 195 7.6. Future directions ...... 196 7.7. References ...... 198

Appendices ...... 203

Appendix 1: The biosynthesis, structure and gelatinization properties of starches from wild and cultivated African rice species (Oryza barthii and Oryza glaberrima) ...... 203

Abstract ...... 203

1. Introduction ...... 203

2. Materials and methods ...... 205 2.1. Materials ...... 205 2.2. Grinding of rice grains ...... 206 2.3. Size-exclusion chromatography ...... 207

xv

2.4. Fluorophore-assisted carbohydrate electrophoresis ...... 208 2.5. Mathematical fitting of the number chain-length distribution ...... 209 2.6. Gelatinization properties ...... 209 2.7. Statistical analysis ...... 210

3. Results and discussion ...... 210 3.1. Molecular size distributions of whole starch molecules ...... 210 3.2. Chain-length distributions of debranched starch ...... 212 3.3. Parameterizing enzyme activities through mathematical modelling ...... 216 3.4. Gelatinization properties ...... 219

4. Conclusions ...... 222

5. References ...... 222

Appendix 2: Non-synonymous candidate SNPs putatively associated with amylose content in African rice ...... 228

xvi

List of Figures

Figure 1: Undehusked seeds of African Oryza species...... 4 Figure 2: Consensus tree of the AA genome rice species based on trnL-trnF sequences. . 6 Figure 3: Schematic representation of the application of genomic technologies in supporting germplasm utilization ...... 9 Figure 4: Facets of grain quality ...... 12 Figure 5: α-(1-4) and α-(1-6) linkages ...... 13 Figure 6: Polymorphisms that have been found in African rice...... 20 Figure 7: Distribution of Oryza species in Africa...... 54 Figure 8: Rice field in coastal Kenya planted with landrace with patches of Oryza punctata...... 56 Figure 9: Phenotypic diversity of both cultivated and wild Oryza genetic resources ...... 57 Figure 10: Gene map of O. glaberrima chloroplast genome...... 77 Figure 11: Bootstrapping consensus and maximum parsimony tree showing phylogenetic relationships among Oryza AA genome species conducted using whole chloroplast sequences...... 79 Figure 12: Phylogenetic tree of the Oryza AA genome group species showing the origin of the various species...... 80 Figure 13: Phylogenetic tree of the Oryza AA genome group species showing the origin of the various species...... 82 Figure 14: Maximum parsimony (MP) tree of the Oryza AA genome group species obtained after excluding indels in the alignment……………………………...... 83 Figure 15: Maximum parsimony (MP) tree of the Oryza AA genome group species obtained after excluding indels in the alignment...... 84 Figure 16: Wild population of one of the newly identified perennial species in Lakefield National Park, North Queensland, Australia...... 92 Figure 17: Schematic representation of duplicated pullulanase gene…………………….107 Figure 18: Protein Sequence alignment of the original and duplicate pullulanase gene . 108 Figure 19: Frequency distribution of amylose content in 100 individuals and parental genotypes of a BC2F8 population of an interspecific cross between O. glaberrima and O. sativa (CG14 X WAB 56-104) ...... 127 Figure 20: Plot of chromosomal positions and allele frequency of putative candidate non- synonymous SNPs located on chromosome 6...... 130 Figure 21: Schematic diagram of BSA experimental set up and data analysis pipeline .. 147

xvii

Figure 22: Spatial-temporal expression of granule bound starch synthase 1 (GBSS1) (LOC_Os06g04200) ...... 165 Figure 23: Spatial-temporal expression of HAD superfamily phosphatase (LOC_Os06g04790) ...... 165 Figure 24: Spatial-temporal expression of histone-fold domain containing protein (LOC_Os01g01290) ...... 166 Figure 25: Spatial-temporal expression of NAC Transcription factor (LOC_Os11g31330) ...... 166 Figure 26: Spatial-temporal expression of OsbHLH071 (LOC_Os01g01600)...... 167 Figure 27: GBSS 1 co-expressed gene network showing some of the co-expressed TFs...... 168 Figure 28: Schematic diagram showing how the outputs of this study including candidate genetic markers, candidate genes and well defined phylogenetic relationships contribute in supporting the project’s goals of promoting the conservation and utilization of PGR ...... 189 Figure 29: SEC weight distributions, wbr (logVh), of whole branched starch molecules extracted from O. barthii (O.B.), O. glaberrima (O.G.) and O. sativa (O.S.) .... 211 Figure 30: SEC weight CLDs, wde (logX), of (A) debranched starches and (B) an enlargement of the amylose region as a function of DP X ...... 213 Figure 31: Number CLDs, Nde(X), of starches extracted from O. barthii (O.B.), O. glaberrima (O.G.) and O. sativa (O.S.) characterized using FACE, covering

...... 215 Figure 32: Statistical comparisons of mathematical fitting parameters among O. barthii, O. glaberrima and O. sativa ...... 217 Figure 33: Gelatinization properties of starch samples, compared among O. barthii (blue bars), O. glaberrima (orange bars) and O. sativa (green bars)...... 221

xviii

List of Tables

Table 1: African Oryza species and their chromosome numbers and genome types ...... 3 Table 2: Selected studies on QTL analysis for amylose content in rice ...... 17 Table 3: Key polymorphisms associated with amylose content in O. sativa ...... 18 Table 4: Regional, International and genebanks outside Africa holding African Oryza germplasm ...... 46 Table 5: Germplasm collections held in various national genebanks ...... 48 Table 6: Germplasm collected from selected countries and currently held in international genebanks ...... 49 Table 7: Number of herbarium specimens of African Oryza species held in various herbaria globally ...... 50 Table 8: Number of herbarium specimens of African Oryza collected from different African countries ...... 51 Table 9: Useful traits found in African Oryza species ...... 59 Table 10: Sequencing status of African Oryza genomes ...... 63 Table 11: Sequencing status of African Oryza genomes ...... 76 Table 12: GenBank sequences used in the phylogenetic analysis ...... 78 Table 13: Alignment statistics for whole chloroplast genome and with one inverted repeat (IR) excluded ...... 78 Table 14: Divergence date estimates of the AA Oryza genome species based on chloroplast sequences ...... 84 Table 15: Details of seed samples used for DNA extraction ...... 93 Table 16: Sequence statistics of the 2 pullulanase paralogs ...... 106 Table 17: Nucleotide polymorphisms between the two pullulanase gene paralogs ...... 107 Table 18: Putative cis regulatory element found in original pullulanase gene and its duplicate...... 109 Table 19: Pullulanase activity in wild and cultivated Oryza species ...... 110 Table 20: Sequencing Statistics ...... 128 Table 21: Statistics of reads uniquely mapped against O. sativa and O. glaberrima references ...... 128 Table 22: Number of SNPs called for the high and low amylose bulk and 2 parental genomes when mapped to O. sativa and O. glaberrima references ...... 129 Table 23: Upstream and non-synonymous SNPs putatively linked to amylose content identified in candidate genes ...... 131

xix

Table 24: Putative SNPs found in GBSS1 from African rice ...... 132 Table 25: Comparison of allelic patterns in functional SNP positions identified in O. sativa ...... 133 Table 26: O. glaberrima and O. barthii accessions used in the analysis of GBSS1 gene 149 Table 27: Markers and candidate genes putatively associated with amylose content ..... 162 Table 28: Amylose biosynthesis genes showing no functional involvement in amylose synthesis in the system presently under study ...... 163 Table 29: Amylose candidate genes/transcription factors showing co-expression with known starch biosynthesis genes ...... 169 Table 30: Cis regulatory motifs present in the candidate genes ...... 170 Table 31: Starch biosynthesis genes used as guide genes in gene co-expression analysis ...... 179 Table 32: Candidate protein coding genes ...... 180 Table 33: Candidate transcription factors ...... 181 Table 34: Information on O. barthii and O. glaberrima samples ...... 207 Table 35: Structural parameters of amylose from all the samples ...... 212

xx

List of Abbreviations used in the thesis

AC - Amylose content AFLP - Amplified Fragment Length Polymorphism AIC - Akaike Information Criterion ANOVA - Analysis of variance; AUC - Area under the curve; BAC - Bacterial Artificial Chromosome BC - Back cross bhlh - Basic helix-loop-helix BP - BAC Pool BSA - Bulk Segregant Analysis bZIP - Basic leucine zipper CBC - Clone by Clone CBF - CCAAT box binding factor CDD - Conserved Domains Database CIAT - International Center for Tropical Agriculture CLD - Chain length distribution; cM - Centimorgan CNRRI - China National Rice Research Institute CP - Chloroplast CTAB - Cetyltrimethylammonium bromide CWR - Crop wild relatives DAF - Days after flowering DBE - Starch debranching enzymes; DH - Doubled haploid DMSO - Dimethyl sulfoxide; DNA - Deoxyribonucleic acid DP - degree of polymerization; DSC - Differential scanning calorimeter; EREBP - Ethylene-responsive element binding protein FACE - Fluorophore-assisted carbohydrate electrophoresis; FAO - Food and Agriculture Organization GBSS - Granule bound starch synthase GC - Gel consistency

xxi

GI - Glycemic Index GRiSP - Global Rice Science Partnership GT - Gelatinization temperature GTR - General time-reversible GWAS - Genome wide association study HAB - High amylose bulk HAD - Haloacid dehalogenase HAP - Heme Activator Protein HKY - Hasegawa, Kishino and Yano HPD - Highest Posterior Density HS - High Sensitivity IBPGR - International Board for Plant Genetic Resources IL - Introgression lines Indel - Insertion and deletion - International Network for the Genetic Evaluation of Rice INGER-Africa in Africa IOMAP - International Oryza Map Alignment Project IR - Inverted repeat IRA - Inverted repeat A IRB - Inverted repeat B IRD - Institut de recherche pour le développement) IRGSP - International rice genome sequencing project IRRI - International rice research institute LAB - Low amylose bulk LD - Linkage disequilibrium LSC - Large single copy MCMC - Monte Carlo Markov Chain MITE - Miniature Inverted-Repeat Transposable Element ML - Maximum likelihood MP - Maximum parsimony Mya - Millions years ago NAC - NAM, ATAF and CUC NCBI - National Centre for Biotechnology Information NERICA - New Rice for Africa

xxii

NF - Nuclear factor NGS - Next generation sequencing NJ - Neighbour joining PA - Pullulanase activity PCR - Polymerase chain reaction PLACE - Plant Cis-acting Regulatory DNA Elements PlantPAN - Plant Promoter Analysis Navigator PM - Physical Map Integration PUL - Pullulanase QAAFI - Queensland Alliance for Agriculture and Food Innovation QTL - Quantitative trait loci RAPD - Random Amplified Polymorphic DNA RID - Refractive index detector; RIL - Recombinant Inbred Line; RPS-BLAST - Reverse Position-Specific BLAST rRNA - Ribosomal RNA SBE - Starch branching enzyme SBE - Starch branching enzymes; SD - Standard deviation SEC - Size-exclusion chromatography; SINEs - Short Interspersed Elements SNP - Single Nucleotide Polymorphism SNP - Single nucleotide polymorphism SPGRC - SADC Plant Genetic Resources Centre SPR - Subtree-Pruning-Regrafting SRA - Short reads archive SRS - Sampling representative score SS - Starch synthase SSC - Small single copy SSR - Simple Sequence Repeats TF - Transcription factor tRNA - Transfer RNA WARDA - West Africa rice development authority WGS - Whole Genome Shotgun

xxiii

Chapter 1

General Introduction

1.1. Background information

The world is currently faced with various global challenges top among them being population increases, an increasingly degraded environment and climate change. The net effect of these challenges is food and nutritional insecurity as well as widespread poverty which continue to ravage a large proportion of the world population. World population is expected to reach 9 billion people in 2050 thereby putting enormous challenge on the world of feeding this rising population (Godfray et al. 2010). Among the crops that have potential for meeting food and nutritional security as well as economic development is rice. Rice is the most important food crop in the world providing daily calories to more than half of the world’s population. Majority of rice consumers are the vulnerable populations mainly from developing countries, who are economically disadvantaged and face a myriad of social challenges. In Africa for example, preference for rice has continued to increase steadily over the last couple of years. However, there exists a significant yield gap which means rice production in Africa is inadequate to meet the rapidly growing demand (FAO 2015; Van Oort et al. 2015). This deficit in production makes the region a net importer of rice, a situation that aggravates the nutritional and food security situation in the region. In order to provide sufficient daily caloric intake, food production has to double in the next 25 years (McCouch 2013). This increase has to be achieved against a backdrop of increased cases and severity of both biotic and abiotic stresses, which have continued to be exacerbated by climate change.

While increasing the area under crop production is a viable strategy of increasing food production, this approach faces challenges since land and water resources continue to dwindle rapidly, both due to population increase and climate change. The ever growing problem of climate variability threatens future food supply and hence it is critically important that breeders produce varieties that are climate resilient. In some sort of fate, climate change, which continues to degrade some of the

1 most agriculturally productive environments, is predicted to most severely affect the most vulnerable populations such as those in Africa and Asia (FAO/PAR 2011). Sustainable food production therefore calls for increased yield increases per unit area which among other factors requires the use of improved varieties that have the capacity to tolerate biotic and abiotic stresses.

In addition to the need to increase crop productivity and ensuring the crops are climate resilient, there is need to consider nutrition of crops and varieties that are being bred. This is particularly important at this time when, due to improved living standards and increased affluence across the world, large segments of the human population are increasingly getting conscious of the quality of food that they consume. Moreover, the increased incidences of lifestyle diseases, has led to widespread changes in dietary preferences with greater preference being given to foods with health benefits. These shifts in dietary preferences have necessitated changes in breeding objectives and priorities with increased focus being put on breeding for improved grain quality (Chen et al. 2012; Fitzgerald et al. 2009b). Currently, among the breeding objectives, grain quality ranks second to yield. The challenge that is in the hands of rice breeders is how to breed high yielding genotypes which possess superior grain quality. In addition, other challenges such as newly emerging strains of pests and diseases have also to be addressed by breeders.

Combining the various traits as highlighted above, is likely to be quite challenging with available reports indicating that historically there has been very little success in efforts aimed at breeding quality yielding genotypes with superior grain quality (Peng et al. 2008). Indeed in many rice growing countries, a large acreage is planted with some of the major varieties which are popular due to their superior grain quality being unfortunately low yielding and quite susceptible to various stresses (Fitzgerald et al. 2009b). This challenge in breeding programmes can partly be attributed to lack of proper understanding of the genetic control of various grain quality traits especially under different genetic backgrounds. Plant genetic resources (PGR) are a key component of agricultural biodiversity and play a critical role in enhancing food and nutritional security, ecosystem resilience, economic development, sustainable livelihoods and climate resilience. The need to widen genetic diversity by

2 using diverse forms of PGR including both cultivated and wild species is critical to the success of a breeding programme and cannot be overemphasized. In this thesis, the potential contribution of African Oryza species directly as a source of healthy food and also as a source of valuable genetic diversity for breeding high yielding genotypes with superior quality was studied.

1.2. Overview of African Oryza species

The Oryza genus has 2 cultivated species, Oryza sativa and Oryza glaberrima, and over 22 wild species representing 10 rice genome types (Ge et al. 2001; Lu and Jackson 2009; USDA-ARS 2013). Oryza sativa is commonly referred to as Asian rice while Oryza glaberrima is usually referred to as African rice. In this thesis, the names O. glaberrima and African rice will be used interchangeably to refer to the same species. Similarly, the names Asian rice and O. sativa will be used interchangeably throughout this thesis. With a total of 8 species of both cultivated and wild rice species, representing 6 out of the 10 known genome types (Table 1 and Fig. 1), the African region arguably has the greatest diversity of the rice gene pool, in terms of number of species and genome types. It is the only region where the two cultivated species co- exist by growing sympatrically. The term African Oryza species is used in this thesis to refer to the provenance of Oryza species that are found in Africa. The only exception to this is O. sativa which had its origin in Asia but now has pan tropical distribution. The other species described in this thesis are endemic in Africa.

Table 1: African Oryza species and their chromosome numbers and genome types

Species 2n Genome type Oryza sativa L 24 AA Oryza longistaminata A. Chev. et 24 AA Roehr. Oryza glaberrima 24 AA Oryza barthi 24 AA Oryza punctata Kotschy ex Steud 24 BB Oryza schweinfurthiana Prod. 48 BBCC Oryza eichingeri A. Peter 24 CC Oryza brachyantha A. Chev. et Roehr. 24 FF 1Oryza schweinfurthiana Prod is also considered the tetraploid form of Oryza punctata Kotschy ex Steud. This thesis makes no further reference to Oryza schweinfurthiana Prod. as there is negligible amount of data available on this species.

3

c

a b c d e f g

Figure 1: Undehusked seeds of African Oryza species. (a) Oryza longistaminata (b) Oryza glaberrima (c) Oryza glaberrima (d) Oryza brachyantha (e) Oryza eichingeri (f) ) Oryza punctata (g) Oryza barthii

The use of molecular based techniques in the conservation and utilization of these species and the breadth of their genetic diversity was studied in this thesis. These molecular based approaches have been extensively applied in the conservation and use of PGR in the last couple of decades. Over the last decade there have been significant advances in DNA sequencing technologies which are driving many areas of plant science and revolutionizing the depth and breadth of genome analysis. These advances provide powerful tools that help to cost effectively study genetic diversity, identify desirable genes and alleles as well as facilitating their transfer during crop improvement, thus speedily delivering new varieties. In addition, these genomic tools have aided in defining the genetic relationships between various wild and cultivated species.

1.3. Phylogenetic relationships of African Oryza species

Efficient conservation and utilization of the Oryza gene pool will require a clear understanding of the phylogeny and evolutionary relationships of the various species in addition to a reliable taxonomic and biosystematics framework. However, development of such a framework is usually constrained by the lack of taxonomic expertise (Chase and Fay 2009) which in turn has confounded phylogenetic studies

4 mainly due to species misidentification. In the last two decades, efforts have been made in studying the phylogenetic relationships in the Oryza genus using a variety of approaches. These have mainly entailed the use of molecular markers such as Random Amplified Polymorphic DNA (RAPD; Bautista et al. 2001; Ishii et al. 1996; Ren et al. 2003), Simple Sequence Repeats (SSR; Ishii et al. 1996; Ren et al. 2003), Restriction Fragment Length Polymorphisms (RFLP; Bautista et al. 2001; Lu et al. 2002; Wang et al. 1992), Amplified Fragment Length Polymorphism (AFLP; Park et al. 2003a), Inter Sequence Simple Sequence Repeats (ISSR; Joshi et al. 2000), Short Interspersed Elements (SINEs) and Miniature Inverted-Repeat Transposable Element (MITE) insertions (Cheng et al. 2002; Iwamoto et al. 1999; Mochizuki et al. 1993). In addition, morphological and cytological studies have also been conducted (Lu et al. 2000). These studies have significantly increased our understanding of the evolutionary relationships in the Oryza but many questions remain unanswered (Tang et al. 2010; Zou et al. 2008). Resolving apparent phylogenetic incongruences in the Oryza genus must therefore remain a major focus of research for some African species.

Among the African species, Oryza brachyantha has been the most affected by conflicts occasioned by inconsistent phylogenetic placements. Results of phylogenetic relationships have varied depending on the phylogenetic approach used. For example, a phylogenetic study based on Adh1, Adh2 and matK genes conducted by Ge et al. (1999) resulted in 3 different phylogenies all with inconsistent positioning of O. brachyantha. The Adh2 phylogeny suggested a strong relationship between O. brachyantha and the diploid HH genome which was not supported by the Adh1 phylogeny. The 3 phylogenies constructed by these authors depicted O. meyeriana and O. granulata (GG) as the most basal taxa, a finding that was inconsistent with the phylogenetic analysis by Mullins and Hilu (2002) where O. brachyantha appeared as the most basal lineage in the whole genus. Due to this discordance, Ge et al. (1999) recommended the use of additional genes in future phylogenetic studies. The most tenuous placement of O. brachyantha was by Wang et al. (1992) and Nishikawa et al. (2005), who, despite O. brachyantha being one of the most divergent species, placed it closest to the Oryza sativa complex. According to Goicoechea (2009), this seemingly

5 controversial placement of Oryza brachyantha could be attributed to sequence conservation between AA and FF genomes in some of the loci under investigation.

Just like O. brachyantha (FF), the AA genome species have also suffered from phylogenetic incongruence. For example, based on MITE-AFLP and RFLP data, Wang et al. (1992) and Park et al. (2003b) found O. meridionalis to be the most distinct species. This was contradicted by Ren et al. (2003), who, by conducting an SSR and RAPD analysis, found Oryza longistaminata to have undergone significant differentiation from the other AA genome species, a finding that was consistent with other previous studies (Cheng et al. 2002; Iwamoto et al. 1999). Figure 2 shows a consensus tree of the AA genome based on trnL-trnF sequences. Understanding the causes of these different inconsistencies may help to resolve them.

O. longistaminata

O. glaberrima O. barthii O. meridionalis

O. sativa - indica

O. nivara

O. rufipogon

O. sativa -japonica

O. glumaepatula

O. officinalis

Figure 2: Consensus tree of the AA genome rice species based on trnL-trnF sequences. (Adapted from Duan et al. 2007)

The differential results in the various phylogenetic analysis can be attributed to misidentification of accessions, gene choice, use of insufficient data, introgression and hybridization, rapid speciation as well as lineage sorting (Wang et al. 1992; Zhu and Ge 2005; Zou et al. 2008). Results from many studies provide evidence which shows that some of the previously unresolved phylogenetic relationships could be resolved by the use of more data, better selection of genes as well as better phylogenetic

6 methods (e.g. Cranston et al. 2009; Zou et al. 2008). The importance of proper choice and use of a sufficient number of unlinked genes in resolving phylogenetic studies was demonstrated by a recent study that used a total of 142 single copy genes to resolve the phylogeny of all the diploid genomes of Oryza (Zou et al. 2008). Moreover, the use of chloroplast DNA as opposed to nuclear data has been suggested as one way to reduce incongruence in phylogenetic relationships (Takahashi et al. 2008; Waters et al. 2012).

1.4. Potential of whole chloroplast genome sequences in resolving phylogenetic relationships

Chloroplast DNA is haploid, non-recombinant and is generally maternally inherited thus making it a useful tool for molecular systematics (Small et al. 2004; Waters et al. 2012). A recent study of the Oryza phylogeny generated a better resolved phylogeny based on 20 chloroplast fragments (Tang et al. 2010). However, the potential of chloroplast DNA in resolving phylogenetic relationships as highlighted by these studies is contradicted by other studies where the use of chloroplast fragments has resulted in inconsistent phylogenies or lack of sufficient resolution (Guo and Ge 2005). It would appear that this apparent phylogenetic inconsistency based on chloroplast data may be due to the number of genes or loci studied. As with nuclear data, increasing the number of chloroplast loci may increase resolution. Indeed, Waters et al. (2012) reported that the use of whole chloroplast genome sequences has more power in phylogenetic resolution. As reviewed by Li et al. (2015), the on-going advances in sequencing coupled with decreased sequencing costs as well as the high multiplexing capacity for cp-genomes, will continue to make the whole plastid sequences a popular tool that may eventually replace Sanger based DNA barcoding. Though more and more cp genomes are currently being sequenced, it will obviously require massive international efforts to build an accurate, representative and comprehensive reference cp barcode database similar to the barcode of life database (Ratnasingham and Hebert 2007) and GenBank (Benson et al. 2013).

Due to the challenges of plastid enrichment (McPherson et al. 2013), sequencing of total DNA and then isolating chloroplast sequences is now the method of choice for most researchers. Chloroplast sequences can be assembled by

7 reference guided assembly where reads are mapped to a reference (Nock et al. 2011) or by de novo assembly followed by selection of choloroplast contigs through homology searches (McPherson et al. 2013). A combination of reference guided and de novo assembly is likely to be a robust approach that will greatly improve the quality of the assembled genomes and therefore help in defining genetic and evolutionary relationships. The use of whole chloroplast sequences eliminate the need to have priori information on a locus of choice, a difficulty that acts as a major hindrance to single or multi-locus studies. With a much longer sequence than most commonly used DNA barcodes, it has more variation that can help discriminate closely related species. Due to the increased availability in public databases of short reads of whole genome sequences, assembly of chloroplast genome sequences has never been much easier. Previously unresolved and intractable phylogenies are therefore likely to benefit from the use of chloroplast sequences thus promoting the utilization of such genetic resources. With the continued advances in bioinformatics and increased computation capacity, it is likely that the use of whole nuclear genome sequences in resolving phylogenetic relationships will soon become feasible, thus increasing resolution. Better information on the phylogenetic relationships in the Oryza genus will facilitate more efficient conservation and utilization in rice crop improvement programs.

1.5. Application of genomic tools in genetic mapping

As already highlighted, the current advances in DNA sequencing are revolutionizing many aspects related to the conservation and utilization of PGR (Fig. 3). One of the areas that genomics has found wide application is in mapping the functional diversity associated with various traits. As will be highlighted in chapter 2, one of the drawbacks in promoting the utilization of African Oryza genetic resources for plant improvement is the inadequate understanding of the diversity present in those resources. This genomic revolution which has led to dramatic changes in read length, sequencing chemistry, instrumentation, throughput and cost is offering opportunities to study the genetic diversity and potential value at finer scale resolution than ever witnessed before. Reference genome sequences for a large number of model and non-model species have been published and others continue to be released at a highly unprecedented rate (Chen et al. 2013; International Brachypodium Initiative 2010;

8

IRGSP 2005; Paterson et al. 2009; Wang et al. 2014b). These reference genome sequences provide perhaps the most important genomic resource for promoting use of these species and have acted as an extremely useful resource for the studies detailed in this thesis. These reference genomes have for example made it easy to link genomic regions with phenotypic traits thus assisting in understanding the genetic value of genetic resources. Progress in the development and release of reference genomes among the African Oryza species is highlighted in chapter 2.The studies described in this thesis have leveraged the various genomics capabilities that have been provided by this next generation sequencing spurred genomic revolution.

Figure 3: Schematic representation of the application of genomic technologies in supporting germplasm utilization

Most genetic mapping efforts have previously been undertaken using pedigree based QTL mapping (Table 2). However QTL mapping is gradually being replaced by association mapping as it offers the advantage of higher resolution than QTL mapping. Association mapping helps to sample as much natural variation at a particular loci as possible as well as taking advantage of numerous historical recombination events in the study population. Genome wide association studies (GWAS) continue to gain popularity and have been used to construct haplotypes which are useful in allele

9 mining and elucidating molecular basis of important traits (Jia et al. 2013; Lu et al. 2010). One limitation in the use of GWAS is that they require a lot of genomic information about SNPs and may therefore not be suitable for species whose genomes are poorly studied and/or not well annotated (Wu Bi et al. 2013). The African cultivated and wild Oryza species are some of the under studied taxa and therefore would benefit from approaches that do not require a lot of genomic resources.

Despite the decrease in sequencing costs, the scientific community has been and continues to search for more cost effective approaches that will enable studying of larger number of samples. The search for cost effective approaches has led to the increased popularity of pool sequencing (pool-seq) in genetic mapping and population genetic studies (Schlötterer et al. 2014). Pool-seq can be done on the whole genome or on reduced representation of the genome. One of the approaches that involve pooling of individuals rather than sequencing/genotyping them individually is bulk segregant analysis (BSA). Bulk segregant analysis (Michelmore et al. 1991), also called DNA pooling (Giovannoni et al. 1991) is a genetic mapping approach which uses trait based sampling where DNA of individuals on the extreme ends of the phenotypic distribution is pooled. In addition to the bulks selected from the tail ends, other bulks can be constituted by selecting individuals randomly from the population as this helps increase resolution and reduce false positives (Yang et al. 2015; Zou et al. 2016). Sequencing of the DNA bulks followed by analysis of polymorphisms between them helps to identify markers and candidates associated with the target trait. Next generation sequencing (NGS) aided BSA has largely been used on bi-parental populations where it has successfully been used to map various traits in several crops, among them tomato (Illa-Berenguer et al. 2015), sorghum (Han et al. 2015), chick pea (Das et al. 2015), cucumber (Lu et al. 2014), rice (Yang et al. 2013b) and pigeon pea (Singh et al. 2015). It has also been demonstrated that BSA can be used on natural populations as opposed to bi-parental populations (Thomas et al. 2010; Yang et al. 2015). Yang et al. (2015) suggested that applying NGS aided BSA on natural populations is particularly valuable in species with inadequate genomic resources.

Trait mapping studies have led to the identification of numerous marker trait relationships which have found application in marker assisted selection (MAS). In

10 addition, these trait associated markers have the capacity to potentially ease the task of selecting the right germplasm from the large collections held in genebanks, a task that has hitherto been quite difficult. Without such information, choosing appropriate germplasm from genebanks for use in plant improvement is like the proverbial “searching for a needle in a haystack”. Selecting the right germplasm from genebanks is confounded by the lack of good quality phenotypic data. Genomics has extensively and successfully been used is in mapping the functional diversity associated with rice quality traits.

1.6. Rice grain quality and its genetic control

1.6.1. Overview of rice quality

Rice quality is an important trait that largely determines its demand both at a household level as well as in international markets. Breeding for rice grain quality is therefore increasingly becoming a priority in breeding programs especially in developed countries. Figure 4 shows the various facets of grain quality, among them eating and cooking quality, milling quality, nutritional quality and appearance quality. The eating, cooking and milling quality of rice is largely determined by its starch component which is the most important and abundant food component in rice, constituting about 90% of rice grain. Starch is a form of dietary carbohydrate and is composed of two distinct glucose polymers namely amylose and amylopectin which are polymerized through α- (1-4) and α-(1-6) linkages. Amylose is a long, essentially linear chain composed of 102 - 104 D-glucosyl units largely joined by α-(1-4) linkages with infrequent α-(1-6) linkages (Fig. 5). It usually comprises about 20-30% of total starch. On the other hand, amylopectin which comprises about 70-80% of total starch comprises 104 - 105 short linear chains of α-(1-4) linked glucose residues connected through more frequent α- (1-6) linkages (Regina et al. 2006). The ratio of these two polymers which is genetically determined shows wide variation between cultivars and is of great interest to plant breeders and food chemists as it affects rice physico-chemical properties.

11

Eating & Nutritional cooking quality quality Grain quality

Milling Appearance quality quality

Figure 4: Facets of grain quality. (Adapted from Bao (2014)

Rice quality determines not only the acceptability of a particular variety by consumers due to taste preferences but also has huge health implications. Though rice has one of the most advanced and well established breeding programmes, breeding for quality has usually been neglected especially in the developing countries where the highest priority is given to yield. The major quality traits that have received a lot of research attention include amylose content (AC), gel consistency (GC) and gelatinization temperature (GT). Among these and other quality traits, amylose content is the most important determinant of rice end-use quality specifically the eating, cooking and processing quality (Juliano 1985). Based on their amylose content rice varieties are usually classified into different classes namely high (>25%), intermediate (20-25%), low (10-19%), very low (3-9%) and waxy (0-2%) (Kumar and Khush 1987). Rice varieties with high amylose content cook dry and fluffy and are the most preferred type of rice in South Asia as well as in occidental countries while those with low amylose content are tender, glossy and sticky after cooking. Preference for low amylose rice is high in countries such as Japan, China and Korea (Biselli et al. 2014; Khush 2001).

12

Figure 5: α-(1-4) and α-(1-6) linkages. Reproduced with permission from Wang et al. (2014a)

1.6.2. Genetic potential of African Oryza species in breeding for rice quality

While the potential of African rice in breeding for resistance to biotic and abiotic stresses is relatively well known as highlighted in chapter 2, little is known of the potential of African species in breeding for rice quality. Both wild and cultivated African species seem to have potential to contribute genes that will improve rice grain quality as well as acting as alternative food sources. African wild species among them O. longistaminata, O. punctata and O. barthii are consumed as whole grains in times of food scarcity thereby acting as a safety net for poor communities (Brink 2006). While it would therefore appear reasonable to suggest that these wild species have capacity for acting as new food crops, their nutritional properties are yet to be studied.

African rice seems to be a new source of valuable genes for improvement of specific rice qualities such as eating, cooking and milling properties of rice grain, qualities that are valued in some rice export markets (Aluko et al. 2004). However, as noted by Henry et al. (2009), rice wild relatives have not yet contributed genes to improve the quality of rice. QTL analysis of progenies derived from interspecific crosses between O. sativa and O. glaberrima has revealed some loci in O. glaberrima that could enhance crude protein content and improve grain shape and appearance (Aluko et al. 2004; Li et al. 2004b). Increased protein content is particularly desirable for poor regions such as Asia and Africa where malnutrition is widespread. African rice has been reported to have higher amylose content than Asian rice (Watanabe et al. 2002).

13

1.6.3. Potential health benefits of African rice

The world is currently faced with a myriad of health challenges top among them being diabetes and obesity, whose occurrence continues to rise at unacceptably high rates especially in developing countries. The rapid and alarming increase in cases of diabetes in rice consuming countries has been attributed to the consumption of white rice (Nanri et al. 2010). There is a growing body of evidence which shows that consuming foods with slowly digestible carbohydrates results in beneficial health outcomes such as the prevention and dietary management of diabetes, cardio- vascular diseases as well as other chronic lifestyle-related diseases (Barclay et al. 2008; Greenwood et al. 2013; Halton et al. 2008). High amylose content has been associated with an increase in slowly digestible and resistant starch thereby leading to low digestibility (Cai et al. 2015; Chung et al. 2011). The high amylose content therefore explains why African rice has slow digestibility and provides increased post meal satiety (Nuijten et al. 2013). This is consistent with reports from consumers in West Africa who report that Asian rice is light and sweet and therefore tend to be consumed at higher rates than African rice. In order to reduce the consumption rates of Asian rice, consumers mix the two types of rice and then cook them together (Teeken et al. 2011). This practice helps to slow the digestion rate of Asian rice and has the overall effect of reducing the amount of carbohydrates consumed per serving. This in turn has a two-fold effect; on one hand, reducing carbohydrate intake leads to reduced risks of type 2 diabetes (Sheard et al. 2004) and on the other hand has positive food security implications as households are able to save some food for the future. This provides an interesting perspective to food security as improving the intrinsic properties and value of food especially among the poor communities as a way of addressing food insecurity is rarely considered.

The glucostatic theory advances the concept that the main regulator of satiety is the amount of glucose released after carbohydrate intake (Mayer 1996). According to this theory, high blood glucose levels signals satiety while low blood glucose levels indicates the onset of hunger and triggers a desire to feed. Consequently, foods with slowly digestible starch are associated with increased satiety as they provide a source of sustained glucose which in turn helps stabilize blood glucose levels (Lehmann and Robin 2007). With its unique starch traits, African rice seems to be a good candidate

14 for addressing some of the health challenges currently confronting the world, as these traits confer some potential health benefits. Amylose has been identified as the main starch component that affects glycaemic index (GI) with high amylose content being associated with low glycaemic index (Fitzgerald et al. 2011; Larsen et al. 2000). Glycaemic index is a measure of postprandial glycaemic response after consumption of a particular type of food compared to that of glucose or white bread (Jenkins et al. 1981). Though the GI of African rice is expected to be lower than that of Asian rice on account of its generally high amylose content, a recent study by Gayin (2015) reported GI of between 71 and 77; values that were not different from control samples of Asian rice. However, this study had very low sampling with only 7 accessions analysed and it is noteworthy that the values were still much lower than some varieties of Asian rice which have been reported to be more than 100 (Frei et al. 2003). These inconsistent findings on the correlation between amylose content and GI bring to the fore the challenges of using amylose content as the sole predictor for glycaemic response and starch digestibility which, as reported by Panlasigui et al. (1991) could be inaccurate. Rice genotypes with the similar amylose content may differ in GI due to differences in physicochemical properties such as gelatinization temperature, cooking time, amylograph consistency and volume expansion (Panlasigui et al. 1991).

1.6.4. Genetic control of amylose content in Asian and African rice

Though AC is affected by environmental conditions (Chen et al. 2008), it is mainly controlled by genetic factors. The current understanding on the genetic architecture of amylose has largely been derived through QTL mapping (Table 2) and targeted mutagenesis, both of which have been conducted extensively over the past several years. Numerous studies (Table 2) have reported that the main QTL controlling amylose biosynthesis is the waxy locus which is found on chromosome 6 (Aluko et al. 2004; Fan et al. 2005; Tan et al. 1999). This locus encodes the Granule Bound Starch Synthase (GBSS). In rice, there are two isoforms of GBSS namely GBSS 1 and GBSS 2 which are synthesized in different tissues and possibly play different roles in starch biosynthesis (Denyer et al. 2001; Wang et al. 1995a). GBSS 1, which is mostly confined to storage tissues seems to play the major role in amylose biosynthesis and is extensively studied in rice (Hirano et al. 1998; Mikami et al. 2008; Sano 1984; Sano et al. 1986; Wang et al. 1995a). GBSS 2 is involved in synthesis of transient starch in

15 leaves and other non-storage organs (Fujita and Taira 1998; Nakamura et al. 1998; Vrinten and Nakamura 2000). Any further reference to GBSS in this thesis will specifically be on GBSS1.

16

Table 2: Selected studies on QTL analysis for amylose content in rice Parents Population QTL name Marker interval Chromosome R2 Source of allele with Reference type positive effect O. sativa X O. nivara BC2F2 ac2.1 RM262–RM3515 2 11% O. nivara Swamy et al. (2012) ac3.1 RM22–RM7 3 5% O. sativa ac3.2 RM85–RM293 3 15% O. sativa ac6.1 RM314–RM3 6 23% O. sativa ac1.1 RM243–RM582 1 5% O. sativa ac5.1 RM413–RM13 5 9% O. nivara ac5.2 RM153–RM413 5 9% O. sativa O. sativa X O. glaberrima DH amy3 RM7–RM251 3 21.5% O. sativa Aluko et al. (2004) amy6 RM190–RM253 6 73.7% O. sativa amy8 RM230–RM264 8 10.9% O. sativa O. sativa X O. glaberrima BC3 ac6.1 CDO78–RG653 6 0.08% O. glaberrima Li et al. (2004a) ac12.1 RZ816–RG574 12 0.05% O. glaberrima O. sativa X O. rufipogon ILs ac3 RM218–RM554 3 29.4% Yuan et al. (2010) (2005) ac6 RM454–RM3567 6 30.8% (2005) ac7 RM298 (2005) 7 30.9% ac8 RM230–RM264 8 26.7% (2006) Indica X Japonica DH qAC-5 RG573–C624 5 11.8% ZYQ8 (indica) He et al. (1999) Wx Wx 6 91.1% ZYQ8 (indica) Indica X Japonica F8 RILs qAC-2 RM525–RM221 2 2.55% Japonica Lou et al. (2009) (Nanyangzhan) qAC-6a RM584–RM585 6 2.84% Indica (Chuan7) qAC-6b RM588–RM540 6 7.83% Indica (Chuan7) Lowland Jap X Upland DH QAc6 C1004–R1962 6 19.77% Guo et al. (2007) Jap O. sativa X O. rufipogon BC2F2 RM170–RZ398 6 21.9% O. rufipogon Septiningsih et al. (2003) Indica X Indica DH Ac6a RM190–RM587 6 54.87% Indica (Zhenshan) Fan et al. (2005) Indica X Indica F2 qAC-5 RM480–RM3345 5 11.33% Indica (Tarommahalli) Sabouri (2009) qAC-8a RM4955–RM152 8 20.21% Indica (Tarommahalli) qAC-8b RM152–RM8264 8 14.45% Indica (Tarommahalli) qAC-12 RM276–RM7626 12 13.33% Indica (Khazar) DH – Doubled haploid; BC – Back cross; RIL – Recombinant Inbred Line; IL – Introgression lines, Jap – Japonica

17

The amylose content in a given variety is determined by the GBSS allele that is present, which in turn determines the amount of GBSS available in the endosperm. To date at least five Wx alleles which control amylose content have been identified namely Wxa, Wxin, Wxb, Wxop and wx (Mikami et al. 2008). Out of these, two alleles are primarily but not entirely responsible for the differences in amylose content in the two subspecies of Asian rice, indica and japonica. The Wxa allele which is predominantly found in indica confers higher amylose content (≈25%) as compared to the Wxb allele mainly found in japonica which confers low amylose content (≈15%) (Larkin and Park 1999; Sano 1984). The reduction in amylose content in japonica as compared to indica is due to reduced expression level of waxy gene which is caused by inefficient splicing of intron 1 due to a G to T 5’ splice site mutation (AGTT instead of AGGT) (Bligh et al. 1998; Hirano et al. 1998; Isshiki et al. 1998). This mutation which is localized 1,164 bp upstream of the start codon (Biselli et al. 2014) makes the level of Wxa allele 10 fold higher than that of Wxb allele, at the level of the protein and mRNA. Due to this splice site mutation, the level of the spliced mRNA is not detectable in glutinous varieties (Isshiki et al. 1998; Wang et al. 1995a). In addition to the G/T SNP, several other polymorphisms associated with different GBSS alleles which show strong correlation with varying levels of amylose content have been identified and confirmed. Among these polymorphisms include CT microsatellite in 5’untranslated region, A/C SNP in exon 6 and C/T SNP in exon 10 of GBSS (Table 3) (Ayres et al. 1997; Bligh et al. 1995; Dobo et al. 2010; Larkin and Park 2003). They therefore provide tools for selecting for genotypes with different amylose content.

Table 3: Key polymorphisms associated with amylose content in O. sativa

Genomic region Polymorphism Type of Reference polymorphism 5’ UTR (CT)n Microsatellite Bergman et al. (2001); Bligh et al. (1995) Intron 1 G/T SNP Hirano et al. (1998); Wang et al. (1995b) Exon 2 23bp indel Indel Wanchana et al. (2003) Exon 6 A/C SNP Larkin and Park (2003) Exon 10 C/T SNP Larkin and Park (2003)

In contrast to the case of O. sativa where huge efforts have been made in studying the genetic basis of rice quality traits, very little has been done for O. glaberrima. In order to unravel the genetic and molecular basis of amylose content in

18

African rice, QTL analysis has previously been conducted on an interspecific cross between Asian and African rice (Aluko et al. 2004; Li et al. 2004a). The number of QTLs contributed by African rice, their effect as well as their location varies between the studies. Using a doubled haploid population of an interspecific cross between indica and O. glaberrima, Aluko et al. (2004) detected 3 QTLs responsible for amylose content. All the positive alleles in these QTLs came from indica. One of the QTLs detected in this study mapped in the Waxy region which has consistently been reported to control AC. On the other hand, Li et al. (2004a) studied a BC3F1 population obtained from an interspecific cross between indica and O. glaberrima in which 2 QTLs were detected, with the positive alleles being contributed by O. glaberrima. Unlike in the former study by Aluko et al. (2004) , though one QTL mapped in chromosome 6, no QTL was mapped in the waxy region in this latter study as would have been expected. The other QTL mapped on chromosome 12, a region which has not previously been known to control AC. The authors suggested that failure to map any QTL in the waxy region was probably because the waxy gene was not segregating in that particular population.

While some limited progress has been made in identifying QTLs associated with amylose content in African rice as already highlighted, very little has been done in studying allelic variation in GBSS and exploring their relationship with the different levels of amylose content. Perhaps the first attempt to study the waxy gene in African rice was done by Sano et al. (1986) who reported that African rice possesses the Wxa allele. Similar findings were later reported by Hirano et al. (1998) who further showed that the Wxa allele in African rice was selected from O. barthii; its putative progenitor. Unlike in O. sativa, no marker-trait relationships have been established for amylose content in African rice thus hindering the use of marker assisted breeding. This limits the utility of this species in rice improvement. The only SNP that has been reported in African rice is a G/A SNP which is located in exon 12 (Fig. 6) (Dobo 2006; Umeda et al. 1991).

19

Intron 10 Exon 12

G/A 139bp del

Figure 6: Polymorphisms that have been found in African rice. No polymorphism has so far been found to be associated with amylose content. Red boxes indicate exons while the connecting lines indicate introns. Schematic representation of the gene was obtained from EnsemblPlants.

1.6.5. Application of genetic markers in breeding for amylose

Genetic markers showing association with AC have the potential for use in marker assisted selection and thus replace the need for the use of amylose content assays especially in breeding programmes. Though the various methods used in AC determination are reproducible within laboratories, results vary widely between laboratories thus presenting challenges in accurately classifying rice breeding lines (Bergman et al. 2001; Fitzgerald et al. 2009a; Zhu et al. 2008). Additionally, some of these methods are also not cost effective and present difficulties when analyzing large number of samples such as segregating populations or mutants (Agasimani et al. 2013; Hu et al. 2010). The use of genetic markers can help overcome these shortcomings as they are cheap and amenable for high throughput screening (Biselli et al. 2014). In addition, genetic markers are not prone to ambiguities associated with amylose content determination when classifying rice cultivars which are caused by environmental factors, modifier genes and cytoplasmic factors (Asaoka et al. 1985; Bergman et al. 2001; Kumar and Khush 1987; McKenzie and Rutger 1983). Genetic markers also help identify homozygotes and heterozygotes thus providing more genetic information that would not have been possible through phenotyping for amylose (Bergman et al. 2001). Application of these markers in MAS at the early stages of plant growth long before seed development helps in the speedy and cost effective delivery of improved varieties. The development of markers for amylose in rice has been and remains a key area of research focus.

20

The various amylose linked markers as highlighted earlier have been used in MAS to varying degrees of success. Though some of these polymorphisms are just closely linked markers and not the causal mutations, their strong correlation qualifies them for use as genetic markers. The individual markers are unable to distinguish the various amylose classes thus limiting their use in MAS. For example, germplasm falling in different classes of AC have been found to have the same number of CT alleles (Table 3) (Dobo et al. 2010). Similary, while the A/C SNP in exon 6 (Table 3) can distinguish germplasm with intermediate AC from other AC categories, it cannot distinguish low AC and high AC varieties (Larkin and Park 2003). The G/T polymorphism cannot also differentiate high and intermediate AC types (Ayres et al. 1997). Combining the 3 SNPs improves their predictive capacity and are able to clearly differentiate the various amylose types (Dobo et al. 2010). They therefore act as an important resource for use in MAS thereby helping to manipulate AC in a more cost effective manner.

1.6.6. Manipulating amylose content in plants

The last few decades have seen significant progress in starch modification, resulting in a variety of novel starches. The amylose-amylopectin ratio is a key target of starch modification, with huge efforts being put in altering the amylose content of economically important crops. Among the crops which have particularly been of key interest in starch modification include rice (Zhu et al. 2012a), wheat (Regina et al. 2006) potato (Jacobsen 1994; Safford et al. 1998; Westcott et al. 2000) and corn (Campbell et al. 2007). This has involved either increasing or lowering the AC compared to the wild type levels depending on the desired end use. In products meant for human consumption, increasing the amylose content has usually been the objective, owing to the nutritional and health benefits of high amylose content as highlighted above. Elevated levels of AC can be achieved by altering the activity levels of various enzymes especially those involved in starch elongation and branching (Regina et al. 2014). Among the mechanisms through which AC can be increased include increasing the activity of GBSS1, reducing amylopectin synthesis by suppressing the activity of starch synthases and suppressing the activity of starch branching enzymes (SBE) (Regina et al. 2006; Zhu et al. 2012b). Suppressing the activity of GBSS1 leads to decreased AC and can be achieved through gene silencing

21

(Shimada et al. 1993; Terada et al. 2000), suppressing transcription factors (Zhu et al. 2003) and reducing the expression of splicing factor genes (Isshiki et al. 2000; Zeng et al. 2007). Gene silencing can be done through RNA interference techniques (RNAi).

Starch branching enzymes are known to play a role in the synthesis of amylopectin by taking part in the synthesis of α-(1-6) linkages. This is done by catalysing the cleavage and transfer of α-(1-4) linked glucan chains to 6-hydroxyl groups forming branches. Reducing the activity of SBE therefore has the resultant effect of reducing branching in amylopectin thereby increasing AC (Morell et al. 2004). Three isoforms of SBE have been identified in rice (SBEI, SBEIIa and SBEIIb). Suppressing the various isoforms either individually or in combination results in varied results in different crops. For example, inhibiting the expression of SBEI and SBEIIb in rice was found to increase apparent AC in flour from 27.2% to 64.8% (Zhu et al. 2012a) while suppression of SBE1 and SBIIa did not have any impact on AC in corn (Blauth et al. 2002; Blauth et al. 2001). This suggests that the different isoforms play different roles in the synthesis of amylose and amylopectin in different species. While it is clear that the different isoforms of SBE have a role in altering amylose content either individually or in combination, it would be interesting to investigate the effect that enhancing GBSS together with suppressing SBE would have on amylose content. However as noted by Liu et al. (2014) and (Nandkishor 2015), it is difficult to control the outcome of some of these starch modification approaches thus limiting their use in crop improvement. Some of these approaches have specifically been found to lead to hard texture of cooked rice which compromises the eating quality in addition to producing shriveled seeds and poor yields. The use of natural allelic variation in starch modification is therefore preferred in rice and has therefore been explored in this thesis.

1.7. Research problem

The African region is a rich source of valuable Oryza genetic resources. These resources underpin world agriculture and continue to play a vital role in global food security. The efficient conservation and sustainable use of these resources is thus critical in meeting the food demands for the growing population. Despite their great

22 genetic potential, little is known about the status of their conservation and utilization. In order to enhance their value, there is need to document their current status of conservation and use. Lack of information on the status of conservation hinders efforts aimed at designing strategies for supporting their use and conservation.

One of the prerequisites for effective germplasm use and conservation is proper species identification and knowledge on the phylogenetic relationships between cultivated species and their crop wild relatives. Such information is vital in making germplasm use and conservation management decisions. Additionally, it helps in aiding gene discovery as well as defining strategies for gene transfer during crop improvement. Phylogenetic relationships in the AA genome – the Oryza primary gene pool - have been the subject of numerous studies which have employed various approaches, mainly molecular markers (Ishii et al. 2001; Lu et al. 2000; Ren et al. 2003). However, despite these efforts, the AA phylogeny remains unresolved as the various studies have used single or multi-loci studies which have resulted in incongruent and inconsistent phylogenies. African Oryza species form the largest proportion of species in the AA genome group and so due to this inconsistency and incongruence, the evolutionary and genetic relationships between these African species and the rest of the Oryza species is not well known. This inconsistent and poorly resolved phylogeny significantly hinders the effective exploitation and conservation of the huge genomic and genetic resources of African Oryza species. Next generation sequencing is finding application in plant identification and phylogenetic studies as it is becoming cheaper and requires less effort to generate data than in single or multi-locus studies. Recently, the potential of whole chloroplast genome sequences as a universal barcode in plant identification as well as in resolving phylogenetic relationships has been demonstrated (Erickson et al. 2008; Nock et al. 2011; Parks et al. 2009; Waters et al. 2012; Yang et al. 2013a). Despite the potential of whole chloroplast genome in phylogenetics, this molecular tool has not been fully exploited in reconstructing phylogeny of the Oryza.

Though the cultivated African rice has been shown to possess higher amylose content than Asian rice as highlighted earlier, its starch structure and properties are not known. Furthermore, there is inadequate information on the relationship between

23 the starch structure of African rice and that of other rice species, among them O. sativa and O. barthii; the putative progenitor of African rice. Starch is the main carbohydrate in human nutrition and its structure and properties is a major determinant of the physicochemical properties of food. The amylopectin: amylose ratio is one of the quality attributes that determine starch digestibility and its physiological response as well as the potential benefits to be derived (Behall et al. 1988). High amylose content has been associated with nutritional and health benefits by having a positive effect on various indices of gastro intestinal health. As highlighted earlier, though the application of mutagenesis and genetic engineering has successfully been used to produce starch with higher amylose than is naturally found in rice, these modifications are associated with undesirable traits (Nandkishor 2015) and hence the use of naturally occurring variation in breeding for amylose remains the most preferred option.

The potential health benefits of African rice make it an important genetic material for rice improvement. The high amylose content is likely to result in decreased digestion rate thereby making African rice a potential natural source of slowly digestible starch. This is consistent with anecdotal reports from consumers and farmers of African rice, that it is less digestible and provides more post meal satiety than Asian rice (Teeken et al. 2012). However, using African rice in breeding for amylose content is constrained by the inadequate understanding of the genetic mechanism underlying this starch trait. This gap in knowledge remains one of the greatest obstacles to the effective utilization of African rice and its genetic resources. Advances in genomics are helping provide greater understanding of the molecular basis of important functional and nutritional traits in crop species (Henry et al. 2016). These advances in genomics are likely to provide vital insights on the genetic control of amylose content thereby helping in promoting the utility of African rice gene pool in rice improvement programmes.

In this study, key starch biosynthesis genes in whole genome sequences of various wild and cultivated species were analyzed. This involved analysis of chromosomal location, intron-exon boundaries, copy number variation and other structural variations. The analysis provided evidence suggesting that pullulanase gene – one of the key starch biosynthesis genes – had undergone gene duplication in the

24 genome of African rice. In the studied cultivated and wild Oryza species, the pullulanase gene is located on chromosome 4 while in African rice, an extra copy is located on chromosome 6. An exception to this is in O. nivara where pullulanase gene seems to have undergone gene translocation and is located on chromosome 8. The functional effect of this duplication on amylose content and other physicochemical properties was unknown and was studied in this thesis.

1.8. Research Goal and Objectives

1.8.1. Research Goal

 Promoting the conservation and utilization of African Oryza genetic resources using molecular approaches for the benefit of present and future generations

1.8.2. General objective

 Characterize African cultivated and wild Oryza species using genomic tools as a basis for their effective conservation and sustainable utilization

1.8.3. Specific objectives

 Document the current state of conservation and utilization of African Oryza genetic resources by conducting a desk based literature review.  Reconstruct the phylogeny of Oryza AA genome species by massive parallel sequencing of whole chloroplast genomes  Determine the structural and gelatinization properties of starch from O. glaberrima and O. barthii and compare with those of Asian rice  Study the functional impact of pullulanase gene duplication on starch physicochemical properties in Oryza glaberrima  Dissect the genetic architecture of amylose content in African cultivated rice by identifying underlying genetic markers and candidate genes

1.9. Thesis structure

This thesis is divided into 7 chapters with the first chapter (Chapter 1) laying out a brief introduction of the African Oryza species which are the primary subject of this thesis.

25

A brief review of the genetic control of physicochemical properties of rice with a special focus on amylose content is also presented. The gaps in knowledge in the conservation and utility of these species are highlighted and form the basis of the studies that have been described in chapters 3, 4, 5 and 6. Chapter 2 provides a general overview of the status of conservation and utilization of African Oryza species and the breadth of genetic diversity that they contain. This chapter pinpoints existing conservation gaps and therefore provides a foundation for more systematic collection and conservation of these vital resources.

One of the prerequisites for ensuring proper conservation and utility of the genetic resources discussed in chapter 2 is to have a clear understanding of their taxonomic, evolutionary and phylogenetic relationships. In chapter 1, the inconsistent, inconclusive and poorly resolved genetic relationships between African Oryza species was highlighted as one of the challenges hindering proper conservation and utilization of these genetic resources. Chapter 3 therefore presents a reconstructed phylogeny of the various African Oryza species that form part of the AA genome group; the Oryza primary gene pool. In this chapter, the relationships between the various wild and cultivated species were clearly defined. The well-defined phylogenetic relationships provide a platform for analyzing, the potential value and functional diversity of these genetic resources as well as supporting its deployment in rice improvement. Next, an analysis on the functional diversity of African rice as it relates to amylose content was conducted using a multi-pronged approach. The molecular and genetic control of amylose in African rice which remains largely unknown is reported in chapters 4, 5 and 6. Chapter 4 presents an analysis of the molecular basis of amylose by characterizing the functional effect that pullulanase gene duplication has on starch physicochemical properties in African rice. This analysis revealed that pullulanase gene duplication may not be the molecular basis of the high amylose found in African rice. Further analysis on this was therefore necessary and is reported in chapter 5 where a bi-parental mapping population was analyzed with the aim of defining marker trait relationships. Markers showing some association with amylose content were identified. Analysis of candidate genes involved in amylose biosynthesis is reported in chapter 6. In this chapter starch synthesis related genes showing putative association with amylose are identified based on an integrative genetics analysis.

26

The general discussion which is presented in chapter 7 takes a broader look at the findings of the preceding chapters. The application of genomic tools in supporting the conservation and utilization of African Oryza genetic resources is discussed. The implications of the findings of this research to practical breeding and particularly to marker assisted breeding are highlighted. The utility and potential of some of the current advances in genomics that have been employed as methodological approaches in chapter 3, 5 and 6 is discussed. An overview of future work that may be needed to address some of the research gaps identified in this thesis has been provided.

1.10. References

Agasimani S, Selvakumar G, Joel AJ, Ganesh Ram S (2013) A Simple and Rapid Single Kernel Screening Method to Estimate Amylose Content in Rice Grains. Phytochemical Analysis 24:569-573

Aluko G, Martinez C, Tohme J, Castano C, Bergman C, Oard JH (2004) QTL mapping of grain quality traits from the interspecific cross Oryza sativa x O. glaberrima. TAG Theoretical and applied genetics Theoretische und angewandte Genetik 109:630-639

Asaoka M, Okuno K, Fuwa H (1985) Effect of Environmental Temperature at the Milky Stage on Amylose Content and Fine Structure of Amylopectin of Waxy and Nonwaxy Endosperm Starches of Rice (Oryza sativa L.). Agric Biol Chem 49:373-379

Ayres N, McClung A, Larkin PD, Bligh HFJ, Jones CA, Park WD (1997) Microsatellites and a single nucleotide polymorphism differentiate apparent amylose classes in an extended pedigree of US rice germplasm. TAG Theoretical and applied genetics Theoretische und angewandte Genetik 94:773-781

Bao J (2014) Genes and QTLs for Rice Grain Quality Improvement

Barclay AW, Petocz P, McMillan-Price J, Flood VM, Prvan T, Mitchell P, Brand-Miller JC (2008) Glycemic index, glycemic load, and chronic disease risk--a meta- analysis of observational studies. The American journal of clinical nutrition 87:627

27

Bautista NS, Solis R, Kamijima O, Ishii T (2001) RAPD, RFLP and SSLP analyses of phylogenetic relationships between cultivated and wild species of rice. Genes & genetic systems 76:71-79

Behall KM, Scholfield DJ, Canary J (1988) Effect of starch structure on glucose and insulin responses in adults. American Journal of Clinical Nutrition 47:428-432

Benson DA, Cavanaugh M, Clark K, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW (2013) GenBank. Nucleic acids research 41:D36-D42

Bergman CJ, Delgado JT, McClung AM, Fjellstrom RG (2001) An improved method for using a microsatellite in the rice waxy gene to determine amylose class. Cereal Chemistry 78:257-260

Biselli C, Cavalluzzo D, Perrini R, Gianinetti A, Bagnaresi P, Urso S, Orasen G, Desiderio F, Lupotto E, Cattivelli L, Valè G (2014) Improvement of marker- based predictability of Apparent Amylose Content in japonica rice through GBSSI allele mining. Rice (New York, NY) 7:1-1

Blauth SL, Kim K-N, Klucinec J, Shannon JC, Thompson D, Guiltinan M (2002) Identification of Mutator insertional mutants of starch-branching enzyme 1 (sbe1) in Zea mays L. Plant molecular biology 48:287-297

Blauth SL, Yao Y, Klucinec JD, Shannon JC, Thompson DB, Guilitinan MJ (2001) Identification of Mutator Insertional Mutants of Starch-Branching Enzyme 2a in Corn. Plant physiology 125:1396-1405

Bligh HF, Till RI, Jones CA (1995) A microsatellite sequence closely linked to the Waxy gene of Oryza sativa. Euphytica 86:83-85

Bligh HFJ, Larkin PD, Roach PS, Jones CA, Fu H, Park WD (1998) Use of alternate splice sites in granule-bound starch synthase mRNA from low-amylose rice varieties. Plant molecular biology 38:407-415

Brink M (2006) PROTA 1: Cereals and pulses/Céréales et légumes secs. In: Brink M, Belay G (eds). PROTA, Wageningen, Netherlands.

Cai JW, Man JM, Huang J, Liu QQ, Wei WX, Wei CX (2015) Relationship between structure and functional properties of normal rice starches with different amylose contents. CARBOHYDRATE POLYMERS 125:35-44

Campbell MR, Jane J-l, Pollak L, Blanco M, O'Brien A (2007) Registration of Maize Germplasm Line GEMS-0067. Journal of Plant Registrations 1:60–61 (2007). doi: 10.3198/jpr2006.10.0640crg

28

Chase MW, Fay MF (2009) Barcoding of Plants and Fungi. Science 325:682-683

Chen J, Liang C, Chen C, Zhang W, Sun S, Liao Y, Zhang X, Yang L, Song C, Wang M, Shi J, Huang Q, Liu G, Liu J, Zhou H, Zhou W, Yu Q, An N, Chen Y, Cai Q, Wang B, Liu B, Gao D, Min J, Huang Y, Wu H, Li Z, Zhang Y, Yin Y, Song W, Jiang J, Jackson SA, Wing RA, Wang J, Wang J, Chen M, Lang Y, Liu T, Li B, Bai Z, Luis Goicoechea J (2013) Whole-genome sequencing of Oryza brachyantha reveals mechanisms underlying Oryza genome evolution. Nature communications 4:1595

Chen M-H, Bergman C, Pinson S, Fjellstrom R (2008) Waxy gene haplotypes: Associations with apparent amylose content and the effect by the environment in an international rice germplasm collection. Journal of Cereal Science 47:536- 545

Chen Y, Wang M, Ouwerkerk PBF (2012) Molecular and environmental factors determining grain quality in rice. Food and Energy Security 1:111-n/a

Cheng C, Tsuchimoto S, Ohtsubo H, Ohtsubo E (2002) Evolutionary relationships among rice species with AA genome based on SINE insertion analysis. Genes & genetic systems 77:323-334

Chung H-J, Liu Q, Lee L, Wei D (2011) Relationship between the structure, physicochemical properties and in vitro digestibility of rice starches with different amylose contents. Food Hydrocolloids 25:968-975

Cranston KA, Hurwitz B, Ware D, Stein L, Wing RA (2009) Species trees from highly incongruent gene trees in rice. Systematic biology 58:489-500

Das S, Upadhyaya HD, Bajaj D, Kujur A, Badoni S, Laxmi, Kumar V, Tripathi S, Gowda CLL, Sharma S, Singh S, Tyagi AK, Parida SK (2015) Deploying QTL-seq for rapid delineation of a potential candidate gene underlying major trait-associated QTL in chickpea. DNA research : an international journal for rapid publication of reports on genes and genomes 22:193-203

Denyer Kay, Johnson P, Zeeman S, Smith AM (2001) The control of amylose synthesis. Journal of Plant Physiology 158:479-487

Dobo M (2006) Role of GBSS allelic diversity in rice grain quality. ProQuest, UMI Dissertations Publishing

29

Dobo M, Ayres N, Walker G, Park WD (2010) Polymorphism in the GBSS gene affects amylose content in US and European rice germplasm. Journal of Cereal Science 52:450-456

Duan S, Lu B, Li Z, Tong J, Kong J, Yao W, Li S, Zhu Y (2007) Phylogenetic Analysis of AA-genome Oryza Species () Based on Chloroplast, Mitochondrial, and Nuclear DNA Sequences. Biochemical Genetics 45:113-129

Erickson DL, Spouge J, Resch A, Weigt LA, Kress JW (2008) DNA barcoding in land plants: developing standards to quantify and maximize success. Taxon 57:1304-1316

Fan CC, Yu XQ, Xing YZ, Xu CG, Luo LJ, Zhang Q (2005) The main effects, epistatic effects and environmental interactions of QTLs on the cooking and eating quality of rice in a doubled-haploid line population. Theoretical and Applied Genetics 110:1445-1452

FAO (2015) Rice Market Monitor: October 2015 XVIII

FAO/PAR (2011) Biodiversity for food and agriculture: Contributing to food security and sustainability in a changing world. Food and Agriculture Organization of the United Nations (FAO)/Platform for Agrobiodiversity Research, Rome, Italy, p 66

Fitzgerald MA, Bergman CJ, Resurreccion AP, Möller J, Jimenez R, Reinke RF, Martin M, Blanco P, Molina F, Chen M-H, Kuri V, Romero MV, Habibi F, Umemoto T, Jongdee S, Graterol E, Reddy KR, Bassinello PZ, Sivakami R, Rani NS, Das S, Wang YJ, Indrasari SD, Ramli A, Ahmad R, Dipti SS, Xie L, Lang NT, Singh P, Toro DC, Tavasoli F, Mestres C (2009a) Addressing the dilemmas of measuring amylose in rice. Cereal Chemistry 86:492-498

Fitzgerald MA, McCouch SR, Hall RD (2009b) Not just a grain of rice: the quest for quality. Trends in plant science 14:133-139

Fitzgerald MA, Rahman S, Resurreccion AP, Concepcion J, Daygon VD, Dipti SS, Kabir KA, Klingner B, Morell MK, Bird AR (2011) Identification of a Major Genetic Determinant of Glycaemic Index in Rice. Rice 4:66-74

Frei M, Siddhuraju P, Becker K (2003) Studies on the in vitro starch digestibility and the glycemic index of six different indigenous rice cultivars from the Philippines. Food Chemistry 83:395-402

Fujita N, Taira T (1998) A 56-kDa protein is a novel granule-bound starch synthase existing in the pericarps, aleurone layers, and embryos of immature seed in diploid wheat (Triticum monococcum L.). Planta 207:125-132

30

Gayin J (2015) Structural and Functional Characteristics of African Rice (Oryza glaberrima) Flour and Starch. The University of Guelph, Guelph, Ontario, Canada, p 258

Ge S, Sang T, Lu B-R, Hong D-Y (1999) Phylogeny of rice genomes with emphasis on origins of allotetraploid species. Proceedings of the National Academy of Sciences of the United States of America 96:14400-14405

Ge S, Sang T, Lu Br, Hong Dy (2001) Rapid and reliable identification of rice genomes by RFLP analysis of PCR-amplified Adh genes. Genome / National Research Council Canada = Genome / Conseil national de recherches Canada 44:1136- 1136

Giovannoni J, Wing R, Ganal M, Tanksley S (1991) Isolation of molecular markers from specific chromosomal intervals using DNA pools from existing mapping populations. Nucleic acids research 19:6553-6558

Godfray HCJ, Beddington JR, Crute IR, Haddad L, Lawrence D, Muir JF, Pretty J, Robinson S, Thomas SM, Toulmin C (2010) Food security: The challenge of feeding 9 billion people. Science 327:812-818

Goicoechea JL (2009) Structural comparative genomics of four African species of Oryza. School of Plant Sciences. University of Arizona, Arizona, p 214

Greenwood DC, Threapleton DE, Evans CEL, Cleghorn CL, Nykjaer C, Woodhead C, Burley VJ (2013) Glycemic index, glycemic load, carbohydrates, and type 2 diabetes: systematic review and dose-response meta-analysis of prospective studies. Diabetes care 36:4166-4171

Guo Y-L, Ge S (2005) Molecular phylogeny of Oryzeae (Poaceae) based on DNA sequences from chloroplast, mitochondrial, and nuclear genomes. American Journal of Botany 92:1548-1558

Guo Y, Mu P, Liu J, Lu Y, Li Z (2007) QTL Mapping and Q×E Interactions of Grain Cooking and Nutrient Qualities in Rice Under Upland and Lowland Environments. Journal of Genetics and Genomics 34:420-428

Halton TL, Liu S, Manson JE, Hu FB (2008) Low-carbohydrate-diet score and risk of type 2 diabetes in women. The American journal of clinical nutrition 87:339

Han Y, Lv P, Hou S, Li S, Ji G, Ma X, Du R, Liu G (2015) Combining Next Generation Sequencing with Bulked Segregant Analysis to Fine Map a Stem Moisture

31

Locus in Sorghum (Sorghum bicolor L. Moench PLoS ONE 10(5): e0127065. doi:10.1371/journal.pone.0127065

He P, Li SG, Qian Q, Ma YQ, Li JZ, Wang WM, Chen Y, Zhu LH (1999) Genetic analysis of rice grain quality. TAG Theoretical and Applied Genetics 98:502- 508

Henry RJ, Rangan P, Furtado A (2016) Functional cereals for production in new and variable climates. Current Opinion in Plant Biology 30:11-18

Henry RJ, Rice N, Waters DLE, Kasem S, Ishikawa R, Hao Y, Dillon S, Crayn D, Wing R, Vaughan D (2009) Australian Oryza: Utility and Conservation. Rice 3:235- 241

Hirano H-Y, Eiguchi M, Sano Y (1998) A single base change altered the regulation of the waxy gene at the posttranscriptional level during the domestication of rice. Molecular biology and evolution 15:978-987

Hu G, Burton C, Yang C (2010) Efficient measurement of amylose content in cereal grains. Journal of Cereal Science 51:35-40

Illa-Berenguer E, Van Houten J, Huang Z, van der Knaap E (2015) Rapid and reliable identification of tomato fruit weight and locule number loci by QTL-seq. Theoretical and Applied Genetics

International Brachypodium Initiative (2010) Genome sequencing and analysis of the model grass Brachypodium distachyon. Nature 463:763-768

IRGSP (2005) The map-based sequence of the rice genome. Nature 436:793-800

Ishii T, Nakano T, Maeda H, Kamijima O (1996) Phylogenetic relationships in A- genome species of rice as revealed by RAPD analysis. Genes & genetic systems 71:195-210

Ishii T, Xu Y, McCouch SR (2001) Nuclear- and chloroplast-microsatellite variation in A-genome species of rice. Genome / National Research Council Canada = Genome / Conseil national de recherches Canada 44:658-658

Isshiki M, Morino K, Nakajima M, Okagaki RJ, Wessler SR, Izawa T, Shimamoto K (1998) A naturally occurring functional allele of the rice waxy locus has a GT to TT mutation at the 5' splice site of the first intron. Plant Journal 15:133-138

32

Isshiki M, Nakajima M, Satoh H, Shimamoto K (2000) dull: Rice mutants with tissue- specific effects on the splicing of the waxy pre-mRNA. Plant Journal 23:451- 460

Iwamoto M, Nagashima H, Nagamine T, Higo H, Higo K (1999) p-SINE1-like intron of the CatA catalase homologs and phylogenetic relationships among AA-genome Oryza and related species. TAG Theoretical and Applied Genetics 98:853-861

Jacobsen E (1994) Formation and Deposition of Amylose in the Potato Tuber Starch Granule Are Affected by the Reduction of Granule-Bound Starch Synthase Gene Expression. The Plant cell 6:43-52

Jenkins DJ, Wolever TM, Taylor RH, Barker H, Fielden H, Baldwin JM, Bowling AC, Newman HC, Jenkins AL, Goff DV (1981) Glycemic index of foods: a physiological basis for carbohydrate exchange. The American journal of clinical nutrition 34:362

Jia G, Huang X, Zhi H, Zhao Y, Zhao Q, Li W, Chai Y, Yang L, Liu K, Lu H, Zhu C, Lu Y, Zhou C, Fan D, Weng Q, Guo Y, Huang T, Zhang L, Lu T, Feng Q, Hao H, Liu H, Lu P, Zhang N, Li Y, Guo E, Wang S, Wang S, Liu J, Zhang W, Chen G, Zhang B, Li W, Wang Y, Li H, Zhao B, Li J, Diao X, Han B (2013) A haplotype map of genomic variations and genome-wide association studies of agronomic traits in foxtail millet (Setaria italica). Nature Genetics 45:957-961

Joshi SP, Gupta VS, Aggarwal RK, Ranjekar PK, Brar DS (2000) Genetic diversity and phylogenetic relationship as revealed by inter simple sequence repeat (ISSR) polymorphism in the genus Oryza. TAG Theoretical and Applied Genetics 100:1311-1320

Juliano BO (ed) (1985) Criteria and Tests for Rice Grain Qualities, Second edition edn. American Association of Cereal Chemists, St. Paul MN

Khush GS (2001) Green revolution: the way forward. Nature Reviews Genetics 2:815- 822

Kumar I, Khush GS (1987) Genetic analysis of different amylose levels in rice. Crop Science 27:1167-1172

Larkin PD, Park WD (1999) Transcript accumulation and utilization of alternate and non-consensus splice sites in rice granule-bound starch synthase are temperature-sensitive and controlled by a single-nucleotide polymorphism. Plant molecular biology 40:719-727

33

Larkin PD, Park WD (2003) Association of waxy gene single nucleotide polymorphisms with starch characteristics in rice ( Oryza sativa L.). Molecular Breeding 12:335-339

Larsen HN, Rasmussen OW, Rasmussen PH, Alstrup KK, Biswas SK, Tetens I, Thilsted SH, Hermansen K (2000) Glycaemic index of parboiled rice depends on the severity of processing: Study in type 2 diabetic subjects. European Journal of Clinical Nutrition 54:380-385

Lehmann U, Robin F (2007) Slowly digestible starch – its structure and health implications: a review. Trends in Food Science & Technology 18:346-355

Li J, Xiao J, Grandillo S, Jiang L, Wan Y, Deng Q, Yuan L, McCouch SR (2004a) QTL detection for rice grain quality traits using an interspecific backcross population derived from cultivated Asian (O. sativa L.) and African (O. glaberrima S.) rice. Genome / National Research Council Canada = Genome / Conseil national de recherches Canada 47:697-697

Li J, Xiao J, Grandillo S, Jiang L, Wan Y, Deng Q, Yuan L, McCouch SR (2004b) QTL detection for rice grain quality traits using an interspecific backcross population derived from cultivated Asian (O. sativa L.) and African (O. glaberrima S.) rice. Genome / National Research Council Canada = Genome / Conseil national de recherches Canada 47:697-704

Li X, Yang Y, Henry RJ, Rossetto M, Wang Y, Chen S (2015) Plant DNA barcoding: from gene to genome. Biological Reviews 90:157-166

Liu D, Wang W, Cai X (2014) Modulation of amylose content by structure‐based modification of OsGBSS1 activity in rice (Oryza sativa L.). Plant biotechnology journal 12:1297-1307

Lou J, Lou Q, Chen L, Yue G, Mei H, Xiong L, Luo L (2009) QTL mapping of grain quality traits in rice. Journal of Cereal Science 50:145-151

Lu B-R, Jackson MT (2009) Wild rice . http://www.knowledgebank.irri.org/extension/wild-rice-taxonomy.html (Accessed 18th April 2013). IRRI

Lu B-R, Naredo MEB, Juliano AB, Jackson MT (2000) Preliminary studies on taxonomy and biosystematics of the AA genome Oryza species (Poaceae). . In: Jacobs SWL, Everett J (eds) Grasses, systematics and evolution. CSIRO, Melbourne, Australia, pp 51-58

34

Lu BR, Zheng KL, Qian HR, Zhuang JY (2002) Genetic differentiation of wild relatives of rice as assessed by RFLP analysis. TAG Theoretical and applied genetics Theoretische und angewandte Genetik 106:101-106

Lu H, Lin T, Klein J, Wang S, Qi J, Zhou Q, Sun J, Zhang Z, Weng Y, Huang S (2014) QTL-seq identifies an early flowering QTL located near Flowering Locus T in cucumber. Theoretical and Applied Genetics 127:1491-1499

Lu Y, Li J, Buckler ES, Lu T, Deng L, Jing Y, Guo Y, Huang T, Lin Z, Zhang Z, Zhang Q-F, Zhao Q, Zhu C, Li W, Wei X, Li W, Huang X, Zhou T, Feng Q, Han B, Li C, Wang L, Fan D, Liu K, Zhao Y, Li M, Weng Q, Wang A, Sang T, Qian Q (2010) Genome-wide association studies of 14 agronomic traits in rice landraces. Nature Genetics 42:961-967

Mayer J (1996) Glucostatic mechanism of regulation of food intake. Obesity Research 4:493-496

McCouch S (2013) Agriculture: Feeding the future. Nature 499:23-24

McKenzie KM, Rutger JN (1983) Genetic Analysis of Amylose Content, Alkali Spreading Score, and Grain Dimensions in Rice1. Crop Science 23

McPherson H, van der Merwe M, Delaney SK, Edwards MA, Henry RJ, McIntosh E, Rymer PD, Milner ML, Siow J, Rossetto M (2013) Capturing chloroplast variation for molecular ecology studies: A simple next generation sequencing approach applied to a rainforest tree. BMC Ecology 13:8-8

Michelmore RW, Paran I, Kesseli RV (1991) Identification of Markers Linked to Disease-Resistance Genes by Bulked Segregant Analysis: A Rapid Method to Detect Markers in Specific Genomic Regions by Using Segregating Populations. Proceedings of the National Academy of Sciences of the United States of America 88:9828-9832

Mikami I, Uwatoko N, Ikeda Y, Yamaguchi J, Hirano HY, Suzuki Y, Sano Y (2008) Allelic diversification at the wx locus in landraces of Asian rice. TAG Theoretical and applied genetics Theoretische und angewandte Genetik 116:979-989

Mochizuki K, Ohtsubo H, Hirano H-Y, Sano Y, Ohtsubo E (1993) Classification and relationships of rice strains with AA genome by identification of transposable elements at nine loci. Jpn J Genet 68:205-217

Morell MK, Konik-Rose C, Ahmed R, Li Z, Rahman S (2004) Synthesis of resistant starches in plants. Journal of AOAC International 87:740

35

Mullins IM, Hilu KW (2002) Sequence variation in the gene encoding the 10-kDa prolamin in Oryza (Poaceae). 1. Phylogenetic Implications. Theoretical and Applied Genetics 105:841-846

Nakamura T, Vrinten P, Hayakawa K, Ikeda J (1998) Characterization of a Granule- Bound Starch Synthase Isoform Found in the Pericarp of Wheat. Plant physiology 118:451-459

Nandkishor BS (2015) Mutations in starch biosynthesis genes leading to high fiber and resistant starch expression in rice endosperm. Google Patents

Nanri A, Mizoue T, Noda M, Takahashi Y, Kato M, Inoue M, Tsugane S, Japan Public Health Center-based Prospective Study G, for the Japan Public Health Center- based Prospective Study G (2010) Rice intake and type 2 diabetes in Japanese men and women: the Japan Public Health Center-based Prospective Study. The American journal of clinical nutrition 92:1468-1477

Nishikawa T, Vaughan DA, Kadowaki K-i (2005) Phylogenetic analysis of Oryza species, based on simple sequence repeats and their flanking nucleotide sequences from the mitochondrial and chloroplast genomes. Theoretical and Applied Genetics 110:696-705

Nock CJ, Waters DLE, Edwards MA, Bowen SG, Rice N, Cordeiro GM, Henry RJ (2011) Chloroplast genome sequences from total DNA for plant identification. Plant biotechnology journal 9:328-333

Nuijten E, Temudo M, Richards P, Okry F, Teeken B, Mokuwa A, Struik PC (2013) Towards a new approach for understanding interactions of technology with environment and society in small-scale rice farming. CABI, Wallingford, UK, pp 355-366 Panlasigui LN, Thompson LU, Juliano BO, Perez CM, Yiu SH, Greenberg GR (1991) Rice varieties with similar amylose content differ in starch digestibility and glycemic response in humans. American Journal of Clinical Nutrition 54:871- 877

Park K-C, Lee JK, Kim N-H, Shin Y-B, Lee J-H, Kim N-S (2003a) Genetic variation in Oryza species detected by MITE-AFLP. Genes & genetic systems 78:235-243

Park KC, Kim NH, Cho YS, Kang KH, Lee JK, Kim NS (2003b) Genetic variations of AA genome Oryza species measured by MITE-AFLP. TAG Theoretical and applied genetics 107:203-209

36

Parks M, Cronn R, Liston A (2009) Increasing phylogenetic resolution at low taxonomic levels using massively parallel sequencing of chloroplast genomes. BMC biology 7:84-84

Paterson AH, ., Bowers JE, Bruggmann R, Dubchak I, Grimwood J, Gundlach H, Wang Y, Narechania A, Keller B, Wicker T, Schmutz J, Messing J, Otillar RP, Ware D, Zhang L, Lyons E, Carpita NC, Martis M, Haberer G, Kresovich S, McCann MC, Tang H, Gowik U, Mitros T, Freeling M, Hash CT, Wang X, Hellsten U, Poliakov A, Peterson DG, Grigoriev IV, Maher CA, Chapman J, Westhoff P, Spannagl M, Salamov AA, Bharti AK, Ming R, Mayer KFX, Mehboob ur R, Klein P, Feltus FA, Gingle AR, Penning BW, Rokhsar DS (2009) The Sorghum bicolor genome and the diversification of grasses. Nature 457:551-556

Peng S, Khush GS, Virk P, Tang Q, Zou Y (2008) Progress in ideotype breeding to increase rice yield potential. Field Crops Research 108:32-38

Ratnasingham S, Hebert PDN (2007) bold: The Barcode of Life Data System (). Molecular Ecology Notes 7:355-364

Regina A, Bird A, Topping D, Bowden S, Freeman J, Barsby T, Kosar-Hashemi B, Li Z, Rahman S, Morell M (2006) High-Amylose Wheat Generated by RNA Interference Improves Indices of Large-Bowel Health in Rats. Proceedings of the National Academy of Sciences of the United States of America 103:3546- 3551

Regina A, Li Z, Morell MK, Jobling SA (2014) Genetically Modified Starch: State of Art and Perspectives. In: Avérous PJH (ed) Starch Polymers. Elsevier, Amsterdam, pp 13-29

Ren F, Lu B-R, Li S, Huang J, Zhu Y (2003) A comparative study of genetic relationships among the AA-genome Oryza species using RAPD and SSR markers. TAG Theoretical and applied genetics Theoretische und angewandte Genetik 108:113-120

Sabouri H (2009) QTL detection of rice grain quality traits by microsatellite markers using an indica rice (Oryza sativa L.) combination. Journal of Genetics 88:81- 85

Safford R, Jobling SA, Sidebottom CM, Westcott RJ, Cooke D, Tober KJ, Strongitharm BH, Russell AL, Gidley MJ (1998) Consequences of antisense RNA inhibition of starch branching enzyme activity on properties of potato starch. Carbohydrate Polymers 35:155-168

37

Sano Y (1984) Differential regulation of waxy gene expression in rice endosperm. TAGTheoretical and applied geneticsTheoretische und angewandte Genetik 68:467

Sano Y, Katsumata M, Okuno K (1986) Genetic studies of speciation in cultivated rice. 5. Inter- and intraspecific differentiation in the waxy gene expression of rice. Euphytica 35:1-9

Schlötterer C, Tobler R, Kofler R, Nolte V (2014) Sequencing pools of individuals - mining genome-wide polymorphism data without big funding. Nature reviewsGenetics 15:749-763

Septiningsih EM, Trijatmiko KR, Moeljopawiro S, McCouch SR (2003) Identification of quantitative trait loci for grain quality in an advanced backcross population derived from the Oryza sativa variety IR64 and the wild relative O. rufipogon. Theoretical and Applied Genetics 107:1433-1441

Sheard NF, Clark NG, Brand-Miller JC, Franz MJ, Pi-Sunyer FX, Mayer-Davis E, Kulkarni K, Geil P (2004) Dietary carbohydrate (amount and type) in the prevention and management of diabetes: a statement by the american diabetes association. Diabetes care 27:2266-2271

Shimada H, Tada Y, Kawasaki T, Fujimura T (1993) Antisense regulation of the rice waxy gene expression using a PCR-amplified fragment of the rice genome reduces the amylose content in grain starch. Theoretical and Applied Genetics 86:665-672

Singh VK, Khan AW, Saxena RK, Kumar V, Kale SM, Sinha P, Chitikineni A, Pazhamala LT, Garg V, Sharma M, Sameer Kumar CV, Parupalli S, Vechalapu S, Patil S, Muniswamy S, Ghanta A, Yamini KN, Dharmaraj PS, Varshney RK (2015) Next-generation sequencing for identification of candidate genes for Fusarium wilt and sterility mosaic disease in pigeonpea (Cajanus cajan). Plant biotechnology journal

Small RL, Cronn RC, Wendel JF (2004) Use of nuclear genes for phylogeny reconstruction in plants. Australian Systematic Botany 17:145-170

Swamy BPM, Kaladhar K, Shobha Rani N, Prasad GSV, Viraktamath BC, Reddy GA, Sarla N (2012) QTL analysis for grain quality traits in 2 BC2F2 populations derived from crosses between Oryza sativa cv Swarna and 2 accessions of O. nivara. The Journal of heredity 103:442

38

Takahashi H, Sato Y-i, Nakamura I (2008) Evolutionary analysis of two plastid DNA sequences in cultivated and wild species of Oryza. Breeding Science 58:225- 233

Tan YF, Li JX, Yu SB, Xing YZ, Xu CG, Zhang Q (1999) The three important traits for cooking and eating quality of rice grains are controlled by a single locus in an elite rice hybrid, Shanyou 63. Theoretical and Applied Genetics 99:642-648

Tang L, Zou X-h, Achoundong G, Potgieter C, Second G, Zhang D-y, Ge S (2010) Phylogeny and biogeography of the rice tribe (Oryzeae): Evidence from combined analysis of 20 chloroplast fragments. Molecular Phylogenetics and Evolution 54:266-277

Teeken B, Nuijten E, Temudo MP, Okry F, Mokuwa A, Struik PC, Richards P (2012) Maintaining or Abandoning African Rice: Lessons for Understanding Processes of Seed Innovation. Human Ecology 40:879-892

Teeken B, Okry F, Nuijten E (2011) Improving rice-based technology development and dissemination through a better understanding of local innovation systems. Proceedings of the Second Africa Rice Congress : Innovation and Partnerships to Realize Africa‘s Rice Potential, 22–26 March 2010. Africa Rice, Bamako, Mali

Terada R, Nakajima M, Isshiki M, Okagaki RJ, Wessler SR, Shimamoto K (2000) Antisense Waxy genes with highly active promoters effectively suppress Waxy gene expression in transgenic rice. Plant and Cell Physiology 41:881-888

Thomas LT, Elizabeth CB, Eric JVW, Tina TH, Sergey VN (2010) Population resequencing reveals local adaptation of Arabidopsis lyrata to serpentine soils. Nature Genetics 42:260

Umeda M, Ohtsubo H, Ohtsubo E (1991) Diversification of the rice Waxy gene by insertion of mobile DNA elements into introns. 遺伝学雑誌 66:569-586

USDA-ARS (2013) National Genetic Resources Program. Germplasm Resources Information Network - (GRIN) [Online Database]. National Germplasm Resources Laboratory, Beltsville, Maryland. http://www.ars-grin.gov/cgi- bin/npgs/html/tax_search.pl (Accessed 18 April 2013)

Van Oort PAJ, Saito K, Tanaka A, Amovin-Assagba E, Van Bussel LGJ, Van Wart J, De Groot H, Van Ittersum MK, Cassman KG, Wopereis MCS (2015) Assessment of rice self-sufficiency in 2025 in eight African countries. Global Food Security 5:39-49

39

Vrinten PL, Nakamura T (2000) Wheat Granule-Bound Starch Synthase I and II Are Encoded by Separate Genes That Are Expressed in Different Tissues. Plant physiology 122:255-263

Wanchana S, Toojinda T, Tragoonrung S, Vanavichit A (2003) Duplicated coding sequence in the waxy allele of tropical glutinous rice (Oryza sativa L.). Plant Sci 165

Wang K, Henry R, Gilbert RG (2014a) Causal Relations Among Starch Biosynthesis, Structure, and Properties. Springer Science Reviews 2:15-33

Wang MH, Yu Y, Haberer G, Marri PR, Fan CZ, Goicoechea JL, Zuccolo A, Song X, Kudrna D, Ammiraju JSS, Cossu RM, Maldonado C, Chen J, Lee S, Sisneros N, de Baynast K, Golser W, Wissotski M, Kim W, Sanchez P, Ndjiondjop MN, Sanni K, Long MY, Carney J, Panaud O, Wicker T, Machado CA, Chen MS, Mayer KFX, Rounsley S, Wing RA (2014b) The genome sequence of African rice (Oryza glaberrima) and evidence for independent domestication. Nature genetics 46:982-988

Wang Z-Y, Fei-Qin Z, Ge-Zhi S, Ji-Ping G, Snustad DP, Min-Gang L, Jing-Liu Z, Meng- Min H (1995a) The amylose content in rice endosperm is related to the post- transcriptional regulation of the waxy gene. Plant Journal 7:613-622

Wang ZY, Second G, Tanksley SD (1992) Polymorphism and phylogenetic relationships among species in the genus Oryza as determined by analysis of nuclear RFLPS Theoretical and Applied Genetics 83:565-581

Wang ZY, Zheng FQ, Shen GZ, Gao JP, Snustad DP, Li MG, Zhang JL, Hong MM (1995b) The amylose content in rice endosperm is related to the post- transcriptional regulation of the waxy gene. The Plant journal : for cell and molecular biology 7

Watanabe H, Futakuchi K, Jones MP, Teslim I, Sobambo BA (2002) Brabender viscogram characteristics of interspecific progenies of Oryza glaberrima Steud and O. sativa L. Nippon Shokuhin Kagaku Kogaku Kaishi 49:155-165

Waters DLE, Nock CJ, Ishikawa R, Rice N, Henry RJ (2012) Chloroplast genome sequence confirms distinctness of Australian and Asian wild rice. Ecology and evolution 2:211-217

Westcott RJ, Safford R, Jobling SA, Gidley MJ, Jeffcoat R, Schwall GP, Shi Y-C, Tayal A (2000) Production of very-high-amylose potato starch by inhibition of SBE A and B. Nature biotechnology 18:551-554

40

Wu Bi, Han Z, Xing Y (2013) Genome Mapping, Markers and QTLs. In: Zhang Q, Wing R (eds) Genetics and Genomics of Rice. Springer-Verlag New York, NewYork

Yang J, Jiang H, Yeh CT, Yu J, Jeddeloh JA, Nettleton D, Schnable PS (2015) Extreme‐phenotype genome‐wide association study (XP‐GWAS): a method for identifying trait‐associated variants by sequencing pools of individuals selected from a diversity panel. The Plant Journal 84:587-596

Yang JB, Tang M, Li HT, Zhang ZR, Li DZ (2013a) Complete chloroplast genome of the genus Cymbidium: lights into the species identification, phylogenetic implications and population genetic analyses. BMC evolutionary biology 13:84- 84

Yang Z, Huang D, Tang W, Zheng Y, Liang K, Cutler AJ, Wu W (2013b) Mapping of quantitative trait loci underlying cold tolerance in rice seedlings via high- throughput sequencing of pooled extremes. PLoS ONE 8(7): e68433. doi:10.1371/journal.pone.0068433

Yuan P, Kim H, Chen Q, Ju H, Ji S, Ahn SN (2010) Mapping QTLs for Grain Quality Using an Introgression Line Population from a Cross between Oryza sativa and O. rufipogon. J Crop Sci Biotech 13:205-212

Zeng D, Yan M, Wang Y, Liu X, Qian Q, Li J (2007) Du1, encoding a novel Prp1 protein, regulates starch biosynthesis through affecting the splicing of Wx b pre- mRNAs in rice (Oryza sativa L.). Plant molecular biology 65:501-509

Zhu L, Gu M, Meng X, Cheung SCK, Yu H, Huang J, Sun Y, Shi Y, Liu Q (2012a) High‐amylose rice improves indices of animal health in normal and diabetic rats. Plant biotechnology journal 10:353-362 Zhu L, Gu M, Meng X, Cheung SCK, Yu H, Huang J, Sun Y, Shi Y, Liu Q (2012b) High amylose rice improves indices of animal health in normal and diabetic rats. Plant biotechnology journal 10:353-362

Zhu Q, Ge S (2005) Phylogenetic relationships among A-genome species of the genus Oryza revealed by intron sequences of four nuclear genes. The New phytologist 167:249-265

Zhu T, Jackson DS, Wehling RL, Geera B (2008) Comparison of amylose determination methods and the development of a dual wavelength iodine binding technique. Cereal Chemistry 85:51-58

41

Zhu Y, Cai X-L, Wang Z-Y, Hong M-M (2003) An Interaction between a MYC Protein and an EREBP Protein Is Involved in Transcriptional Regulation of the Rice Wx Gene. Journal of Biological Chemistry 278:47803-47811

Zou C, Wang P, Xu Y (2016) Bulked sample analysis in genetics, genomics and crop improvement. Plant biotechnology journal, pp 1941-1955

Zou XH, Zhang FM, Zhang JG, Zang LL, Tang L, Wang J, Sang T, Ge S (2008) Analysis of 142 genes resolves the rapid diversification of the rice genus. Genome biology 9:R49-R49

42

Chapter 2

Conservation and utilization of African Oryza genetic resources1

Abstract

Africa contains a huge diversity of both cultivated and wild rice species. The region has eight species representing six of the ten known genome types. Though a high proportion of these species and the breadth of their genetic diversity are conserved in various ex situ conservation facilities, there exist significant collection gaps both at the taxa level and also within taxa. The lack of in situ conservation programs further exposes them to possible genetic erosion or extinction. The potential value of these resources is largely unknown since they are under characterized and therefore grossly underutilized. These resources possess a variety of useful traits which are important in rice improvement particularly in breeding for yield, biotic and abiotic stresses as well as grain quality. Among the African Oryza, African rice has the highest genetic potential especially on tolerance/resistance to biotic and abiotic stresses. In order to obtain maximum benefits from these resources, it is imperative that they are collected, efficiently conserved and optimally utilized. The ongoing advances in DNA sequencing could be employed to more precisely study their genetic diversity and value thereby enhancing their use in rice improvement. African Oryza species have benefited greatly from efforts aimed at developing reference genome sequences with several species already having their references released. These reference sequences have opened numerous opportunities for functional characterization of these resources.

Key words: Conservation, germplasm, genetic diversity, germplasm utilization, wild rice, genomic sequences

1 Published as Wambugu P.W., Furtado A, Waters D, Nyamongo D, Henry R (2013). Conservation and utilization of African Oryza genetic resources. Rice 6:29. doi:10.1186/1939-8433-6-29.

43

2.1. Introduction

Africa holds a rich diversity of both cultivated and wild species. These species usually occur in areas with no populations of domesticated rice nearby. They are therefore genetically isolated from the effects of gene flow particularly from Asian rice and therefore are likely to possess important traits (Wambugu et al. 2015). Though these genetic resources have made an important contribution in rice improvement (Brar and Khush 2002; Jena 2010; Khush et al. 1990), they represent a largely untapped source of useful genetic and allelic diversity.

Despite the long history of rice research covering diverse areas and the demonstrated value of the African Oryza species, there is still little information on these species compared to others in the genus (Goicoechea 2009). World agriculture is facing unprecedented challenges largely attributed to climate change and the need to ensure proper conservation and sustainable utilization of genetic resources of crop wild relatives has never been more important (Brar 2004; FAO 2010). In order to ensure that maximum benefits accrue to humankind from these resources, it is imperative that they are collected, efficiently conserved, well characterized and optimally utilized. The continued dramatic decrease in the costs of DNA sequencing and genotyping is likely to revolutionize genetic resources conservation and utilization. The on-going advances in DNA sequencing and other high throughput genomics approaches are increasingly finding application in genebank activities (McCouch et al. 2012). While information on the utilization and conservation of other wild species have been reviewed (e.g. Henry et al. 2009; Song et al. 2005), no such efforts have been made for the African Oryza species. This chapter provides an overview of the status of conservation and utilization of African Oryza species and identifies gaps in knowledge and opportunities for further research. c 2.2. Germplasm Conservation of African Oryza

Plant genetic resources underpin world agriculture and their conservation is therefore imperative if food security and sustainable development is to be assured. To define the current status of African Oryza germplasm conservation, we review approaches

44 that are currently being used in the conservation of wild rice germplasm ex situ and in situ. Details of available germplasm collections and a brief analysis of collection gaps in ex situ conservation, is presented.

2.2.1. Ex situ conservation

2.2.1.1. Collection and conservation Ex situ conservation is the most common germplasm conservation method and can take the form of seed banks, field genebanks or botanic gardens. DNA storage in DNA banks represents another option for the conservation of genetic materials (Henry et al. 2009). Collection of wild rice relatives started in the 1950s reaching a peak in the 1980s and 1990s when the International Board for Plant Genetic Resources (IBPGR, formerly IPGRI and now Bioversity International) was formed. IBPGR, in collaboration with both international and national partners funded and spearheaded most of the African wild rice germplasm collection efforts. This was due to the realization that these valuable resources faced eminent threat of genetic erosion due to both environmental and socio-economic factors. The collected materials were deposited in national genebanks with duplicates being conserved in international genebanks especially the International Rice Research Institute (IRRI;IBPGR 1984). Currently, IRRI holds the largest collection of African Oryza totaling to about 3601 accessions (Table 4). Other collections of African Oryza can be found in Australian Tropical Grains Germplasm Centre; International Network for the Genetic Evaluation of Rice in Africa (INGER- Africa); Bangladesh Rice Research Institute, Dhaka; SADC Plant Genetic Resources Centre (SPGRC), Zambia and China National Rice Research Institute (CNRRI) (Agnoun et al. 2012; Hay et al. 2013). Duplicate samples of African Oryza collections are shared between IRRI and the Africa Rice Center (previously known as WARDA) in Benin, Africa (Brar and Singh 2011). As in other species, the level of unintended duplication is high in rice thus escalating the costs of conservation. Many strategies and methods have been used to identify duplicates but most of these have been unsuccessful. However, high throughput genotyping and sequencing have the capacity to more precisely identify duplicate samples in the future as well as enable a more accurate analysis of available collection gaps.

45

Table 4: Regional, International and genebanks outside Africa holding African Oryza germplasm

Species IRRI, Africa Australia Millenniu USD ILRI, National SPGR Philippin Rice n Plant m Seed A3 Ethiopia Institute of C es1 Centre DNA Bank 1 Genetics4 1 bank2 Project1 O. glaberrima 2828 3796 7 1 476 1 O. barthii 331 114 5 5 15 1 386 2 O. longistaminata 285 2 3 9 5 149 54 O. eichingeri 37 6 0 0 0 11 O. punctata 89 0 1 14 2 18 O. brachyantha 31 0 1 6 1 16 2 1http://www.genesys-pgr.org/ 2 www.dnabank.com.au/ 3USDA-ARS (2013) 4 Nonomura et al. (2010) 2.2.1.2. Collection gap analysis The size of collections conserved in the genebanks (Table 4) might suggest that maximum genetic diversity has been captured. However, effective management of these genebanks and conservation of their resources would benefit from detailed gap analysis which would help guide future conservation priorities. Such an analysis would help to determine if all the genetic diversity found in a taxa is represented in in situ as well as ex situ conservation facilities (Maxted et al. 2008). While the combined use of molecular data, eco-geographic surveys and more accurate geographic referencing is vital in identifying gaps and redundancies in existing collections (FAO 2010), very little of this has been done for African wild Oryza held in many national collections. With the rapidly changing nature of African agriculture and in the face of increasing documented threats to biodiversity in the region (Khumalo et al. 2012; Wambugu and Muthamia 2009), there is need to use these tools to more precisely and systematically understand and document this genetic diversity as well as the status of its conservation. A comparative study of the diversity conserved ex situ with that found in situ would be particularly useful as it would help identify some possible collection gaps. There however seems to be no published work on such comparative studies for African Oryza. The success of a collection gap analysis is partially dependent on the quality and integrity of the available data (Ramirez-Villegas et al. 2010) and currently, paucity in the necessary data remains a major constraint in undertaking a comprehensive collection gap analysis. Available data indicates that globally, there are no collection deficiencies at the taxa level as all the taxa are represented in ex situ facilities. However, an analysis of germplasm collection data (Table 5 and 6) and

46 herbarium specimens (Table 7 and 8) for individual in-country collections reveals some significant gaps in taxa coverage. For example, in Kenya, where the authors had access to comprehensive and reliable data, there is a clear taxonomic gap in the collection of Oryza eichingeri. A herbarium specimen was collected and deposited at Missouri Botanic Garden (http://www.tropicos.org/Specimen/3179202 but no germplasm samples of the same have ever been collected.

47

Table 5: Germplasm collections held in various national genebanks

Country Ex situ conservation facility Species O. O. O. O. O. O. O. Oryza Leersia barthii longistamin punctata eichingeri brachyant glaberrim sativa sp. sp. ata ha a Kenya National Genebank of Kenya 36 21 1004 6 3 Uganda Botany Department 17 Tanzania National Plant Genetics 2 3 218 35 Resources Centre Zambia Zambia Agricultural Research 202 Institute Zimbabwe Genetic Resources and 1 1 1 55 5 Biotechnology Institute Benin Centre de Recherches 300 Agricoles Sud Ethiopia Genebank of the institute of 15 118 biodiversity conservation Malawi Forest Research Institute of 14 11 6 20 Malawi Nigeria National Centre for Genetic 32 Resources and Biotechnology Mozambiq National Center of Fotogenetic 26 1 344 4 5 ue Resources Source: National information sharing mechanism and country reports on the state of plant genetic resources for food and agriculture

48

Table 6: Germplasm collected from selected countries and currently held in international genebanks

Country O. O. O. O. O. O. O. sativa Oryza sp. barthii longistaminata punctata eichingeri brachyantha glaberrima Kenya - Uganda 4 18 Tanzania 9 20 23 1 2 6 322 4 Zambia 6 33 2 1 82 2 Zimbabwe 1 3 146 Benin Ethiopia 11 10 1 Malawi 5 124 Nigeria Chad 56 8 5 1 44 88 4 Mozambique 10 1 115 Ghana 2 5 2 102 326 1 cote d'ivoire 11 254 8880 1 Niger 10 6 3 2 20 14 8 Nigeria 12 20 10 1094 1768 8 Cameroon 29 18 9 5 94 161 3 Sudan 1 1 19 2 Botswana 1 6 15 Source: http://www.genesys-pgr.org/

49

Table 7: Number of herbarium specimens of African Oryza species held in various herbaria globally

Institution O. barthii O. longistaminata O. O. O. O. punctata eichingeri brachyantha glaberrima Royal Botanic Gardens, Kew 7 28 4 4 2 2 Missouri Botanical Garden 3 11 6 7 5 South African National Biodiversity Institute 9 100 35 2 (SANBI) MNHN - Museum national d'Histoire naturelle 92 108 31 12 24 20 Herbarium Senckenbergianum 28 22 5 3 University of Ghana – Ghana Herbarium 6 36 8 1 Netherlands Centre for Biodiversity Naturalis, 27 24 5 2 10 4 Ecole de Faune de Garoua 9 17 Herbarium of the University of Aarhus 24 21 4 6 Herbier du Bénin 4 6 1 3 Bioversity International- EURISCO 5 1 4 National Museum of Nature and Science, Japan 2 2 1 1 Australian National Herbarium 6 Botany (Baucom et al.) 4 12 Herbarium togoense 1 1 2 Real Jardin Botanico (Madrid), 2 Herbarium (MA) Source: Accessed through GBIF data portal, Royal Botanic Gardens, Kew; South African National Biodiversity Institute (SANBI); MNHN - Museum national d'Histoire naturelle; Herbarium Senckenbergianum; University of Ghana – Ghana Herbarium; Netherlands Centre for Biodiversity Naturalis, section National Herbarium of the Netherlands; Ecole de Faune de Garoua; Herbarium of the University of Aarhus; Herbier du Bénin; Bioversity International- EURISCO; National Museum of Nature and Science, Japan; Australian National Herbarium Botany (Baucom et al.); Herbarium togoense; Real Jardin Botanico (Madrid), Vascular Plant Herbarium (MA), http://data.gbif.org/datasets/resource.

50

Table 8: Number of herbarium specimens of African Oryza collected from different African countries

Country O. barthii O. longistaminata O. punctata O. eichingeri O. brachyantha O. glaberrima Kenya 0 1 2 1 Uganda 0 1 2 1 Tanzania 0 6 15 5 Zambia 1 8 2 Zimbabwe 1 6 3 Benin 21 21 1 3 Ethiopia 0 1 Malawi 0 2 1 Nigeria 7 5 2 Chad 2 1 Mozambique 0 3 1 Ghana 5 32 9 1 cote d'ivoire 4 3 1 1 6 1 Niger 0 0 Cameroon 13 24 6 6 Sudan 5 5 1 Botswana 4 25 South Africa 0 31 8 Guinea 1 Mali 7 9 1 7 Sierra Leone 2 1 Senegal 20 13 4 1 Madagascar 5 1 Namibia 12 1 Somalia 2 Swaziland 6 Togo 1 1 2 Burkina Faso 24 27 10 Angola 2 1 Liberia 4 Central African 4 republic Country not specified 96 110 31 14 26 29 Source: http://www.gbif.org/

51

In contrast to other species such as Phaseolus sp. (Ramirez-Villegas et al. 2010), the number of herbarium specimens in Oryza is higher than conserved germplasm accessions, a possible indicator of underrepresentation in ex situ conservation facilities. This observation is reinforced by a further numerical assessment of individual in-country germplasm collections (Table 5 and 6) which indicates that some of the wild species especially Oryza eichingeri and Oryza brachyantha have very few or no accessions in both national and international genebanks. A similar observation of under representation of Oryza wild species in ex situ conservation facilities was made by Maxted and Kell (2009). These two species had the lowest sampling representative score (SRS), of 5.9 and 5.6 respectively, with O. glaberrima having the highest followed by Oryza longistaminata. Sampling representative score (SRS) is an indicator of the adequacy of germplasm holdings for a particular taxon, based on available herbarium specimens and germplasm collections (Ramirez-Villegas et al. 2010). While acknowledging that the number of accessions may not always give an accurate reflection of the available diversity (FAO 2010), this data indicates some in-country collections with clearly evident collection gaps that may need to be filled. This finding aligns well with expert opinions on the existence of gaps in ex situ collection. For example, Hay et al., (2013) noted that there were indications of collection gaps of wild rice species in areas outside Asia such as Africa. Similarly, Ngwediagi et al. (2009) reported wild rice collection gaps in Tanzania.

Using molecular approaches, efforts should be made to ensure that any targeted collection efforts arising from these collection gap analysis and recommendations, should only be undertaken if it has been established that it will result in new genetic or allelic diversity. The planned release of reference genome sequences of African Oryza is expected to provide a basis on which to assess available genetic diversity using high throughput genomic approaches. As observed by McCouch et al. (2012), due to the continued dramatic decrease in the costs of sequencing and concomitant increase in efficiency, it is currently cheaper to undertake low coverage sequencing of an accession than it is to add an accession of cultivated rice to a germplasm collection. Consequently, it is expected that in future, genotyping using molecular markers or by sequencing will become a routine activity before an

52 accession is banked to ensure that only those with novel alleles or allele combinations are added to the collection. In addition to these ex situ collections, the importance of putting in place extra safeguards to genetic resources, by establishing and maintaining in situ collections is now universally accepted.

2.2.2. In situ conservation In situ conservation of wild species has for decades now been undertaken to complement ex situ conservation and is known to have particular advantages such as allowing the natural process of evolution to continue. However, despite the demonstrated importance of in situ conservation and several warnings on the alarming rates at which genetic diversity of rice wild relatives is being lost (Song et al. 2005; Vaughan and Chang 1992), there is no documented evidence of any targeted in situ conservation programs for African wild rice species (Brink 2006; Maxted and Kell 2009). This is in contrast to other rice species where several studies have indicated the presence of such programs (Song et al. 2005; Xie et al. 2010). The need to put in place a robust, complementary in situ conservation program for the African rice gene pool cannot be overemphasized and there is great agreement in the literature on the importance of such a program. Vaughan and Chang (1992) noted that ex situ conservation of wild species that exhibit great heterogeneity in their genetic structure is not only expensive but also time consuming, lending support to the need for in situ conservation. Maxted and Kell (2009) noted that in order to establish in situ genetic reserves, detailed genetic studies of wild rice species are vital as they help in identifying priority locations for in situ conservation. Such knowledge, supplemented with information on herbarium specimens (Table 7 and 8) and species distribution (Fig. 7) will be important in supporting conservation and sustainable utilization management decisions. Detailed genetic diversity studies on the naturally occurring variation of African Oryza are however limited, thus constituting a major information gap that greatly hampers establishment of these reserves. Currently, while populations of African wild rice may be found occurring in nationally designated protected areas, these are just an indirect effect of the establishment of these areas and such populations benefit from no active management. In Tanzania, for example, Vaughan and Chang (1992) reported the occurrence of O. barthii and O. punctata in Ruaha National Park which is a protected habitat. Outside the protected areas, wild rice

53 species may also be found in cultivated farmers’ fields, field edges, pasturelands, orchards, recreation parks and roads.

Figure 7: Distribution of Oryza species in Africa. Species distribution has been mapped based on records of herbarium specimens which have been preserved in various herbaria globally.

Whether in protected or in non-protected areas, Oryza populations are at great risk of extinction due to threats of climate change, overgrazing, flooding, habitat change, invasive alien species and pollution. Climate change probably constitutes the

54 greatest threat, with reports indicating that Africa, especially sub-Saharan Africa which holds considerable diversity of these species, will be adversely affected by climate variability. The challenge of climate change is expected to be serious particularly for wild species and urgent efforts are therefore needed to secure their genetic resources as the likelihood for extinction of narrowly adapted and endemic species is high (Jarvis et al. 2009). However, even in the face of threats posed by climate change, wild rice species are still expected to provide the basis for adapting agriculture to climate change in future. Those naturally occurring in situ populations that have the capacity to withstand the harsh effects of climate variability will have the potential to contribute valuable new traits for rice improvement (FAO 2010; Pettersson et al. 2009). In addition to these threats, an often neglected challenge facing in situ conservation is that of gene flow between cultivated and wild rice species and its impact on genetic diversity and integrity (Fig. 8). While it is generally acknowledged that the transfer of genes from wild to cultivated taxa may have important and beneficial consequences, gene flow in the reverse direction (i.e. from the crop to the wild) may lead to deleterious changes in genetic diversity or even result in extinction of small populations (Ellstrand et al. 1999). It would appear therefore that establishment of genetically isolated reserves is the most viable and effective strategy to conserve and at the same time reduce genetic admixture and its associated consequences. However, such a recommendation needs to be based on comprehensive gene flow studies that allow assessment of the genetic integrity of the populations involved. Such studies have lately been undertaken using high throughput sequencing technologies.

55

Figure 8: Rice field in coastal Kenya planted with Oryza sativa landrace with patches of Oryza punctata. The sympatric occurrence of cultivated and wild species may lead to gene flow thus affecting the genetic integrity of these species. It could also lead to extinction of less adapted genotypes.

2.2.3. On farm conservation

During the course of farming, farmers have been known to maintain important genetic diversity of Oryza sativa and Oryza glaberrima within the farming system (Fig. 9). This system of conservation, often referred to as on-farm conservation, is usually characterized by local/farmer seed production with little or no acquisition of certified seeds from the formal seed sector. Additionally, the system relies heavily on the use of diverse traditional varieties which usually have high levels of genetic diversity as compared to improved varieties. A study conducted in West Africa shows that about 70% of the farmers grow their traditional rice varieties (Mohapatra 2007). According to a National Research Council report, most of the farmers in this region have reported deliberate mixing of both the cultivated Asian and African rice in their farms so as to foster genetic diversity which occurs as a result of introgression. This has resulted in new types of landraces which, judging by ligule form, grain shape, and panicle type, are intermediate between the cultivated Asian and African rice (National Research Council 1996). These resultant landraces may be low yielding but are able to cope up

56 with many biotic and abiotic stresses due to their heterogeneity and are therefore likely to lead to yield stability. Studies have shown that farmers prefer yield stability to maximum obtainable yields in line with their strategy of risk avoidance (Almekinders and Louwaars 1999; Wambugu et al. 2012). This observation is confirmed by reports that farmers in some parts of West Africa such as the Banfora area in Burkina Faso are abandoning improved rice varieties in favour of the low yielding but highly adaptable Oryza glaberrima varieties (Futakuchi et al. 2012). Barry et al. (2007) observed that the replacement of rice landraces by improved varieties in Africa is less advanced than in Asia. It is therefore important that these landraces are conserved on-farm as well as ex situ and genetic studies undertaken on them as they might harbour useful traits that may find utility in rice improvement.

a b c d e f

Figure 9: Phenotypic diversity of both cultivated and wild Oryza genetic resources: Upper row, Oryza sativa landraces; Lower row, (a) Oryza glaberrima, (b, c, d, e, f) African wild Oryza species

2.3. Utility of African Oryza: Achievements and challenges

As already highlighted, a lot of efforts have been made in the collection and conservation of huge collections of African Oryza (Table 4), for the benefit of present and future generations. In order to derive maximum benefits from these resources, it

57 is imperative that efforts are made to ensure their optimal utilization in research, crop improvement as well as direct use. Promoting the sustainable use of these plant resources requires an understanding of their potential genetic value. This section therefore reviews some of the potentially desirable traits they possess as well as some of the progress achieved in incorporating them into commercial varieties.

2.3.1. Genetic potential of African Oryza

African cultivated and wild rice species are known to possess enormous genetic diversity (Fig. 9) which is of immense genetic value especially on resistance to biotic and abiotic stresses and therefore remain a vital raw material in rice improvement programs (Table 9). Efforts to use these resources have led to significant success in the transfer of useful traits into cultivated rice (Brar and Khush 2002). Hajjar and Hodgkin (2007) indicated that to date, a total of 12 traits in cultivated rice have been improved through the use of wild rice gene pool. One of the most significant and successful uses of wild genes in rice improvement is the transfer of the Xa-21 gene conferring resistance against bacterial blight resistance which was successfully introgressed from O. longistaminata into Oryza sativa (Khush et al. 1990). Several improved varieties carrying the Xa-21 gene have subsequently been released in different countries.

58

Table 9: Useful traits found in African Oryza species

Species Trait Reference O. longistaminata Resistance to bacterial blight, nematodes, (Brar and Khush 2002; Hu et al. drought avoidance; rhizomatousness 2011; Jena 2010; Khush et al. 1990; Yang et al. 2010) O. brachyantha Resistance to bacterial blight, yellow stem (Brar and Khush 2002; Ram et borer, leaf-folder, al. 2010; Yamakawa et al. 2008) whorl maggot, tolerance to laterite soil O. glaberrima Resistance to drought , iron toxicity, (Brar 2004; Brar and Khush nematodes; weed competitiveness; high 1997, 2002; Dingkuhn et al. adaptability to acidic soils showing low 1998; Futakuchi et al. 2001; levels of phosphorus availability; cultigen; Jones et al. 1997; Li et al. 2004; tolerance to waterlogging; crude protein Ndjiondjop et al. 1999; Nwilene content; cultigen; African gall et al. 2002; Plowright et al. 1999; Midge; stem borers; Rice yellow mottle Sauphanor 1985) virus; resistant to nematodes O. barthii Resistance to green leaf hopper, bacterial (Brar and Khush 2002) blight, drought avoidance O. punctata Resistance to brown plant hopper, zigzag (Brar and Khush 2002; Jena leafhopper 2010) O. eichingeri Resistance to brown plant hopper, white- (Brar and Khush 2002; Yan et al. backed plant hopper, green leaf hopper 1997)

Breeding for higher yields and yield stability remains the major objective of most rice breeding programs worldwide. Though wild rice species are phenotypically inferior and have predominantly been used as a source of genes for resistance to pests and diseases (Hodgkin and Hajjar 2007), they also possess the genetic value necessary to improve the yield potential of cultivated rice (Xiao et al. 1998). However, as noted by Hajjar and Hodgkin (2007), the contribution of wild species in improving yields has been limited or almost non-existent, not only in rice but also in all other economically important crops. In a review of the use of wild relatives in crop improvement in 16 major economically important crops, these authors reported only one case of a released variety bred by incorporating yield enhancing genes from a wild rice species. The rice cultivar, NSICRc112, developed from a cross between O. longistaminata and O. sativa was released in the Philippines in 2002 and is known to be high yielding (Brar 2004). Similar to the case of yield improving genes, wild species have not contributed any genes to enhance drought tolerance in rice. This apparent lack of contribution of genes to improve quantitatively inherited traits such as yield can be attributed to the fact that it is difficult to phenotypically identify these superior traits in wild species (Bimpong et al. 2011). There is therefore need for more focus to be put on the use of

59 molecular approaches in the identification of yield and drought related Quantitative Trait Loci (QTL) among other quantitative traits.

Despite being a cultivated species, O. glaberrima is arguably the most genetically promising of all African Oryza. It possesses a rich repertoire of favourable genes and alleles that have the potential of improving a diverse range of agronomically important traits in rice (Table 9). Recent studies of a QTL analysis of an O. sativa and O. glaberrima cross indicated that O. glaberrima contains useful QTL alleles that are likely to significantly enhance traits including yield and other yield components particularly under conditions of drought stress. Some of these alleles are potentially novel and have been shown to be stable across genetic backgrounds (Bimpong et al. 2011). Interspecific crosses between O. sativa and O. glaberrima resulted in interspecific hybrids trademarked as New Rice for Africa (NERICA) (Futakuchi and Sié 2009; Jones et al. 1997), whose release is considered a success story in rice improvement in Africa. NERICA varieties combine some of the stress tolerance traits of O. glaberrima with the high yielding potential of O. sativa (Jones et al. 1997). Though NERICA cannot match the performance of O. glaberrima in respect to tolerance to some biotic stresses such as weed competitiveness (Futakuchi and Sié 2009), farmers have shown strong preference and interest in NERICA varieties especially the early maturing ones as they are able to escape drought and are amenable to double cropping (Mohapatra 2009). Beyond its utility as a source of dietary calories and in plant breeding, Oryza glaberrima has been found to have ceremonial and cultural value. Communities in West Africa offer O. glaberrima rice as sacrifices to the ancestors to appease them (Mohapatra 2010; Teeken et al. 2012).

2.3.2. Negative attributes of African rice

Despite its great potential for utility in rice improvement programs, O. glaberrima has some negative traits namely poor yield, seeds that split and shatter easily, a notorious difficulty of milling and plants that lodge easily (Linares 2002; National Research Council 1996). Moreover, the brown or red pericarp is also not appealing to most consumers and in most cases the rice has to be polished to remove it (Teeken et al. 2012). It is perhaps due to these reasons that it is rapidly being displaced in West Africa in favour of Asian rice (Jones et al. 1997; Linares 2002) and is currently almost

60 becoming extinct (Mohapatra 2010). While this calls for urgent breeding efforts, there is also need to give more attention to the conservation of this germplasm before it gets lost. Moreover, the conserved germplasm needs to be properly characterized in order to unravel its genetic architecture.

2.3.3. Challenges in genetic mapping in African rice

One of the most important resource that have been used in understanding the genetic and molecular basis of important traits are inter specific crosses between Asian and African rice. Despite the huge success that has been achieved in generating interspecific crosses between O. sativa and O. glaberrima, the process is still fraught with technical challenges in addition to being tedious and time consuming due to sterility barriers. Research has identified a host of loci which are associated with reproductive barriers in cultivated rice (Lorieux et al. 2013). As reported by Lorieux et al. (2013), working under the framework of Global Rice Science Partnership (GRiSP), a multi institutional collaborative effort involving CIAT, IRD and AfricaRice has made efforts to address the challenge of sterility barriers by developing interspecific bridge lines. These are interspecific crosses between O. sativa and O. glaberrima and are developed through marker assisted selection of progenies that are homozygous for

s the S1 allele. The S1 locus is a sporo-gametophytic sterility factor which controls

s sterility, with the S1 allele being responsible for fertility restoration. These crosses contain large introgressions of O. glaberrima genome and by significantly increasing fertility in subsequent crosses with diverse O. sativa lines, they ensure effective exploitation of useful O. glaberrima genes in conventional breeding programmes. Related to this is the issue of segregation distortion which is another challenge faced in QTL mapping using interspecific crosses between O. glaberrima and O. sativa. Segregation distortion has been reported in various genomic regions associated with a sterility locus such as short arm of chromosome 6 where the S1 locus, is located (Gutiérrez et al. 2010). Segregation distortion may affect the accuracy of QTL mapping as it may cause the effect of some QTLs to be overestimated. QTLs mapping in regions segregating in non-mendelian fashion should therefore be interpreted with caution.

61

2.4. Characterization of African Oryza

Successful utilization of genetic resources especially of wild species in breeding programs as well as in other research primarily requires some understanding of their phenotypic and genotypic characteristics as this knowledge increases the potential value of these resources. Moreover, this knowledge is important in making important germplasm conservation management decisions. Using AFLPs, Kiambi et al. (2005) studied the genetic diversity and population structure of 48 O. longistaminata populations obtained from 8 Eastern and Southern Africa countries. The study revealed higher levels of genetic diversity as compared to some other species in the Oryza genus such as O. glumaepatula. This diversity was found to be more within than between populations and also to be more in populations within countries than among countries. Similar results were found in a recent study involving 320 accessions of O. longistaminata obtained from 8 populations in Ethiopia (Melaku et al. 2013). This study found more within than between population diversity but the level of diversity was higher than the one detected in the study by Kiambi et al. (2005). In another study on genetic diversity and domestication history, Li et al. (2011) found 70% less diversity in O. glaberrima as compared to O. barthii, its wild progenitor, an indication of severe domestication bottleneck. These and other similar studies have been valuable not only in ensuring effective germplasm utilization but also in setting conservation priorities and defining sampling strategies. It is on the basis of outputs from such genetic diversity and evaluation studies that most of the Oryza genomic resources have been developed.

Currently, the Oryza research community has access to numerous genomic resources among them a reference sequence, advanced mapping populations, transcriptome data as well as physical and genetic maps. Oryza sativa was the first crop plant to have its reference genome sequence released (International Rice Genome Sequencing Project 2005) marking a major milestone that opened numerous opportunities for functional characterization of the entire rice genome. Studies have however demonstrated that one reference genome sequence is not enough to fully explore the genetic variation in the Oryza genus (Goicoechea et al. 2010). Consequently, efforts to develop reference genome sequences for some other

62 selected species in the Oryza genus have been on-going under the auspices of the International Oryza Map Alignment Project (IOMAP). Through this initiative, physical maps have been developed for some selected species among them some African rices namely O. glaberrima, O. punctata and O. brachyantha (Kim et al. 2008). Genome sequences of O. glaberrima (Wang et al. 2014), O. longistaminata (Zhang et al. 2015) and O. brachyantha (Chen et al. 2013) have been published. Genome sequences of O. barthii and O. punctata are available in GenBank though they are yet to be published (Table 10) (Jacquemin et al. 2013; Wing 2012). With the expected release of reference genome sequences for the other wild species and with the continued decline in next generation sequences, it is anticipated that it will become increasingly feasible to study intraspecific genetic diversity by comparing individual genome sequences with reference sequence. Moreover, the release of these reference genome sequences will usher in a new platform that will give impetus to re- sequencing efforts which will in turn yield valuable information on SNPs, insertions, deletions, and other mutations as well as other structural and functional variations.

Table 10: Sequencing status of African Oryza genomes

Species Genome Sequencing Sequencing Status/Reference size method technology O. glaberrima ≈354Mb BP Illumina HiSeq 2000 (Wang et al. 2014) O. barthii ≈411Mb WGS/PM Illumina GAIIx and Unpublished* Roche 454 O. brachyantha ≈260Mb WGS/PM Illumina GAII (Chen et al. 2013) O. longistaminata ≈352Mb WGS Illumina GA2, Illumina (Zhang et al. 2015) HiSeq2000 and Roche GS FLX+ O. punctata ≈423Mb BP/WGS/PM Illumina HiSeq2000, Unpublished* Roche 454 O. eichingeri ≈650Mb WGS Sequencing in progress Key: WGS, Whole Genome Shotgun; CBC, Clone by Clone; BP, BAC Pool; PM, Physical Map Integration *Deposited in GenBank

Numerous studies have been undertaken aimed at genome wide-discovery of DNA polymorphisms such as SNPs and InDels in the Oryza genus. These studies have led to discovery of huge numbers of DNA polymorphisms (Feltus et al. 2004; Shen et al. 2004; Subbaiyan et al. 2012; Xu et al. 2012). However, despite discovery of these DNA polymorphisms, a significant gap exists in linking them to phenotypic traits so that they can be of more value in crop improvement as well as in genetic diversity analysis. Such information, according to Tung et al. (2010), will lead to more

63 insights on the value of naturally occurring variation and hence lead to better management and utilization of the biodiversity that is conserved in various global germplasm repositories. The integration of this information with genebank accession level information and data is expected to increase the value of genebank collections and thereby boost their utilization in rice improvement as well as other areas of plant science. Among the greatest challenges that genebanks currently face in this endeavor is the inadequacy in bioinformatics skills and computing capacity to handle the huge amount of genomic data that is being generated through sequencing and genotyping. Attempts towards proper integration of genomic and phenotypic data, which allow meaningful downstream analysis first calls for concerted efforts in compiling and publicly sharing genebank’s accession level information on characterization and evaluation. The lack of this information arguably presents the greatest obstacle to the effective use of conserved genetic resources (FAO 2010; Khoury et al. 2010; Wambugu et al. 2012). Without evaluation data, for example, it is not possible to link sequence polymorphisms with phenotypic performance. African Oryza thus remains greatly under characterized and hence grossly underutilized in rice breeding programs and its imperative that more focus is put in this important area. With the rapidly decreasing costs of next generation sequencing, there is need to start positioning genebanks for the ultra-high throughput genomic era which promises to revolutionize germplasm conservation and utilization as well as all associated activities.

2.5. References

Agnoun Y, Biaou SSH, Sié M, Vodouhè RS, Ahanchédé A (2012) The African Rice Oryza glaberrima Steud: Knowledge Distribution and Prospects. International Journal of Biology 4:158-180

Almekinders CJM, Louwaars NP (1999) Farmers' seed production: new approaches and practices. Intermediate Technology, London

Barry MB, Pham JL, Noyer JL, Courtois B, Billot C, Ahmadi N (2007) Implications for in situ genetic resource conservation from the ecogeographical distribution of rice genetic diversity in Maritime Guinea. Plant Genetic Resources 5:45-54

64

Baucom RS, Estill JC, Chaparro C, Upshaw N, Jogi A, Deragon J-M, Westerman RP, SanMiguel PJ, Bennetzen JL (2009) Exceptional diversity, non-random distribution, and rapid evolution of retroelements in the B73 maize genome. PLoS Genetics 5:e1000732

Bimpong IK, Serraj R, Chin JH, Ramos J, Mendoza EMT, Hernandez JE, Mendioro MS, Brar DS (2011) Identification of QTLs for Drought-Related Traits in Alien Introgression Lines Derived from Crosses of Rice ( Oryza sativa cv. IR64) × O. glaberrima under Lowland Moisture Stress. Journal of Plant Biology 54:237- 250

Brar DS (2004) Broadening the gene pool of rice through introgression from wild species. In: Toriyama K, Heong KL, Hardy B (eds) Rice is life: scientific perspectives for the 21st century, Proceedings of the world rice research conference. International Rice Research Institute (IRRI), Tsukuba, Japan, pp 157 -159

Brar DS, Khush GS (1997) Alien introgression in rice. Plant molecular biology 35:35- 47

Brar DS, Khush GS (2002) Transferring genes from wild species into rice. CABI Publishing, Wallingford, UK, pp 197-217

Brar DS, Singh K (2011) Oryza. In: Kole C (ed) Wild Crop Relatives: Genomic and Breeding Resources, Cereals,. Springer-Verlag, Berlin Heidelberg, pp 321 – 365

Brink M (2006) PROTA 1: Cereals and pulses/Céréales et légumes secs. In: Brink M, Belay G (eds). PROTA, Wageningen, Netherlands.

Chen J, Liang C, Chen C, Zhang W, Sun S, Liao Y, Zhang X, Yang L, Song C, Wang M, Shi J, Huang Q, Liu G, Liu J, Zhou H, Zhou W, Yu Q, An N, Chen Y, Cai Q, Wang B, Liu B, Gao D, Min J, Huang Y, Wu H, Li Z, Zhang Y, Yin Y, Song W, Jiang J, Jackson SA, Wing RA, Wang J, Wang J, Chen M, Lang Y, Liu T, Li B, Bai Z, Luis Goicoechea J (2013) Whole-genome sequencing of Oryza brachyantha reveals mechanisms underlying Oryza genome evolution. Nature communications 4:1595

Dingkuhn M, Johnson DE, Sow A, Audebert AY (1998) Relationships between upland rice canopy characteristics and weed competitiveness. Field Crops Research 61:79-95

65

Ellstrand NC, Prentice HC, Hancock JF (1999) Gene Flow and Introgression from Domesticated Plants into Their Wild Relatives. Annual Review of Ecology and Systematics 30:539-563

FAO (2010) Second Report on the World's Plant Genetic Resources for Food and Agriculture. FAO, Rome, Italy, p 299

Feltus FA, Wan J, Schulze SR, Estill JC, Jiang N, Paterson AH (2004) An SNP resource for rice genetics and breeding based on subspecies indica and japonica genome alignments. Genome research 14:1812-1819

Futakuchi K, Jones MP, Ishii R (2001) Physiological and morphological mechanisms of submergence resistance in African rice (Oryza glaberrima Steud.). Japanese Journal of Tropical Agriculture 45:8-14

Futakuchi K, Sié M (2009) Better exploitation of African rice in rice variety development. Agricultural Journal 4:96-102

Futakuchi K, Sie M, Saito K (2012) Yield Potential and Physiological and Morphological Characteristics Related to Yield Performance in Oryza glaberrima Steud. Plant Production Science 15:151-163

Goicoechea JL (2009) Structural comparative genomics of four African species of Oryza. School of Plant Sciences. University of Arizona, Arizona, p 214

Goicoechea JL, Ammiraju JSS, Marri PR, Chen M, Jackson S, Yu Y, Rounsley S, Wing RA (2010) The Future of Rice Genomics: Sequencing the Collective Oryza Genome. Rice 3:89-97

Gutiérrez A, Carabalí S, Giraldo O, Martínez C, Correa F, Prado G, Tohme J, Lorieux M (2010) Identification of a Rice stripe necrosis virus resistance locus and yield component QTLs using Oryza sativa × O. glaberrima introgression lines. BMC plant biology 10:6-6

Hajjar R, Hodgkin T (2007) The use of wild relatives in crop improvement: a survey of developments over the last 20 years. Euphytica 156:1-13

Hay FR, Hamilton NRS, Furman BJ, Upadhyaya HD, Reddy KN, Singh SK (2013) Cereals. In: Normah MN, Chin HF, Reed BM (eds) Conservation of Tropical Plant Species. Springer Science, 2013, pp 293-315

Henry RJ, Rice N, Waters DLE, Kasem S, Ishikawa R, Hao Y, Dillon S, Crayn D, Wing R, Vaughan D (2009) Australian Oryza: Utility and Conservation. Rice 3:235- 241

66

Hodgkin T, Hajjar R (2007) Using crop wild relatives for crop improvement: trends and perspectives. CABI, Wallingford, UK, pp 535-548

Hu F, Wang D, Zhao X, Zhang T, Sun H, Zhu L, Zhang F, Li L, Li Q, Tao D, Fu B, Li Z (2011) Identification of rhizome-specific genes by genome-wide differential expression analysis in Oryza longistaminata. BMC plant biology 11:18

IBPGR (1984) Collection of crop germplasm: The First Ten Years; 1974-1984. International Board for Plant Genetic Resources, Rome, Italy, p 119

International Rice Genome Sequencing Project (2005) The map based sequence of the rice genome. Nature 436

Jacquemin J, Bhatia D, Singh K, Wing RA (2013) The International Oryza Map Alignment Project: development of a genus-wide comparative genomics platform to help solve the 9 billion-people question. Current opinion in plant biology 16:147-156

Jarvis A, Upadhyaya HD, Gowda CLL, Aggarwal PK, Fujisaka S, Anderson B (2009) Climate Change and its Effect on Conservation and Use of Plant Genetic Resources for Food and Agriculture and Associated Biodiversity for Food Security.

Jena KK (2010) The species of the genus Oryza and transfer of useful genes from wild species into cultivated rice, O. sativa. Breeding Science 60:518-523

Jones MP, Dingkuhn M, Aluko/snm GK, Semon M (1997) Interspecific Oryza Sativa L. X O. Glaberrima Steud. progenies in upland rice improvement. Euphytica 94:237-246

Khoury C, Laliberté B, Guarino L (2010) Trends in ex situ conservation of plant genetic resources: a review of global crop and regional conservation strategies. Genetic Resources and Crop Evolution 57:625-639

Khumalo S, Chirwa PW, Moyo BH, Syampungani S (2012) The status of agrobiodiversity management and conservation in major agroecosystems of Southern Africa. Agriculture Ecosystems & Environment 157:17-23

Khush GS, Bacalangco E, Ogawa T (1990) A New Gene for Resistance to Bacterial Blight from O. longistaminata. Rice Genetic Newsletter. IRRI, Phillipines, pp 121-122

67

Kiambi DK, Newbury HJ, Ford-Lloyd BV, Dawson I (2005) Contrasting genetic diversity among Oryza longistaminata (A. Chev et Roehr) populations from different geographic origins using AFLP. African Journal of Biotechnology 4:308-317

Kim H, Hurwitz B, Yu Y, Collura K, Gill N, SanMiguel P, Mullikin JC, Maher C, Nelson W, Wissotski M, Braidotti M, Kudrna D, Goicoechea JL, Stein L, Ware D, Jackson SA, Soderlund C, Wing RA (2008) Construction, alignment and analysis of twelve framework physical maps that represent the ten genome types of the genus Oryza. Genome biology 9:R45

Li J, Xiao J, Grandillo S, Jiang L, Wan Y, Deng Q, Yuan L, McCouch SR (2004) QTL detection for rice grain quality traits using an interspecific backcross population derived from cultivated Asian (O. sativa L.) and African (O. glaberrima S.) rice. Genome / National Research Council Canada = Genome / Conseil national de recherches Canada 47:697-704

Li Z-M, Zheng X-M, Ge S (2011) Genetic diversity and domestication history of African rice (Oryza glaberrima) as inferred from multiple gene sequences. TAG Theoretical and applied genetics Theoretische und angewandte Genetik 123:21-31

Linares OF (2002) African rice (Oryza glaberrima): history and future potential. Proceedings of the National Academy of Sciences of the United States of America 99:16360-16365

Lorieux M, Garavito A, Bouniol J, Gutiérrez A, Ndjiondjop MN, Guyot R, Pompilio Martinez C, Tohme J, Ghesquière A (2013) Unlocking the Oryza glaberrima treasure for rice breeding in Africa. CABI, Wallingford, UK, pp 130-143

Maxted N, Dulloo E, V Ford-Lloyd B, Iriondo JM, Jarvis A (2008) Gap analysis: a tool for complementary genetic conservation assessment. Diversity and Distributions 14:1018-1030

Maxted N, Kell S (2009) Establishment of a Global Network for the In Situ Conservation of Crop Wild Relatives: Status and Needs. Commission on Genetic Resources for Food and Agriculture (CGRFA):266

McCouch SR, McNally KL, Wang W, Sackville Hamilton R (2012) Genomics of gene banks: A case study in rice. American journal of botany 99:407-423

Melaku G, Haileselassie T, Feyissa T, Kiboi S (2013) Genetic diversity of the African wild rice (Oryza longistaminata Chev. et Roehr) from Ethiopia as revealed by SSR markers. Genetic Resources and Crop Evolution 60:1047-1056

68

Mohapatra S (2007) In search of new seeds. Rice Today. IRRI, Phillipines

Mohapatra S (2009) Drought-proof rice for African farmers. Rice Today. IRRI, Phillipines

Mohapatra S (2010) Pockets-of-gold. Rice Today. IRRI, Phillipines

National Research Council (1996) Lost Crops of Africa. Volume 1: Grains. Washington, p 380

Ndjiondjop MN, Albar L, Fargette D, Fauquet C, Ghesquiere A (1999) The genetic basis of high resistance to rice yellow mottle virus (RYMV) in cultivars of two cultivated rice species. Plant Disease 83:931-935

Ngwediagi P, Maeda E, Kimomwe H, Kamara R, Massawe S, Akonaay HB, Mapunda LND (2009) Tanzania: Second Country Report on the State of Plant Genetic Resources for Plant and Agriculture. Commission on Genetic Resources for Food and Agriculture (CGRFA):88

Nonomura K-I, Morishima H, Miyabayashi T, Yamaki S, Eiguchi M, Kubo T, Kurata N (2010) The wild Oryza collection in National BioResource Project (NBRP) of Japan: History, biodiversity and utility. Breeding Science 60:502-508

Nwilene FE, Adam A, Williams CT, Ukwungwu MN, Dakouo D, Nacro S, Hamadoun A, Kamara SI, Okhidievbie O, Abamu FJ (2002) Reactions of differential rice genotypes to African rice gall midge in West Africa. International Journal of pest management 48:195-201

Pettersson E, Lundeberg J, Ahmadian A, Kth, Genteknologi, Skolan för b (2009) Generations of sequencing technologies. Genomics 93:105-111

Plowright RA, Coyne DL, Nash P, Jones MP (1999) Resistance to the rice nematodes Heterodera sacchari, Meloidogyne graminicola and M. incognita in Oryza glaberrima and O. glaberrima x O. sativa interspecific hybrids. Nematology 1:745-745

Ram T, Laha GS, Gautam SK, Deen R, Madhav MS, Brar DS, Viraktamath BC (2010) Identification of a new gene introgressed from Oryza brachyantha with broad- spectrum resistance to bacterial blight of rice in India. Rice Genetics Newsletter 25

69

Ramirez-Villegas J, Khoury C, Jarvis A, Debouck DG, Guarino L (2010) A Gap Analysis Methodology for Collecting Crop Genepools: A Case Study with Phaseolus Beans. PloS one 5:e13497

Sauphanor B (1985) Some factors of upland rice tolerance to stem-borers in West Africa. International Journal of Tropical Insect Science 6:429-434

Shen Y-J, Li X, Yu Q-B, Liu H-J, Chen D-H, Gao J-H, Huang H, Shi T-L, Yang Z-N, Jiang H, Jin J-P, Zhang Z-B, Xi B, He Y-Y, Wang G, Wang C, Qian L (2004) Development of genome-wide DNA polymorphism database for map-based cloning of rice genes. Plant physiology 135:1198-1205

Song Z, Li BO, Chen J, Lu BR (2005) Genetic diversity and conservation of common wild rice (Oryza rufipogon) in China. Plant Species Biology 20:83-92

Subbaiyan GK, Waters DL, Katiyar SK, Sadananda AR, Vaddadi S, Henry RJ (2012) Genome-wide DNA polymorphisms in elite indica rice inbreds discovered by whole-genome sequencing. Plant biotechnology journal 10:623-634

Teeken B, Nuijten E, Temudo MP, Okry F, Mokuwa A, Struik PC, Richards P (2012) Maintaining or Abandoning African Rice: Lessons for Understanding Processes of Seed Innovation. Human Ecology 40:879-892

Tung C-W, Zhao K, Wright MH, Ali ML, Jung J, Kimball J, Tyagi W, Thomson MJ, McNally K, Leung H, Kim H, Ahn S-N, Reynolds A, Scheffler B, Eizenga G, McClung A, Bustamante C, McCouch SR (2010) Development of a Research Platform for Dissecting Phenotype–Genotype Associations in Rice (Oryza spp.). Rice 3:205-217

USDA-ARS (2013) National Genetic Resources Program. Germplasm Resources Information Network - (GRIN) [Online Database]. National Germplasm Resources Laboratory, Beltsville, Maryland. http://www.ars-grin.gov/cgi- bin/npgs/html/tax_search.pl (Accessed 18 April 2013)

Vaughan D, Chang TT (1992) In Situ conservation of rice genetic resource. Economic Botany 46:368-383

Wambugu PW, Brozynska M, Furtado A, Waters DL, Henry RJ (2015) Relationships of wild and domesticated rices (Oryza AA genome species) based upon whole chloroplast genome sequences. Scientific reports 5:13957

Wambugu PW, Mathenge PW, Auma EO, vanRheenen HA (2012) Constraints to On- Farm Maize (Zea mays L.) Seed Production in Western Kenya: Plant Growth and Yield. ISRN Agronomy 2012:1-7

70

Wambugu PW, Muthamia ZK (2009) Second Report on the State of Plant Genetic Resources for Food and Agriculture in Kenya. FAO, Rome, Italy

Wang MH, Yu Y, Haberer G, Marri PR, Fan CZ, Goicoechea JL, Zuccolo A, Song X, Kudrna D, Ammiraju JSS, Cossu RM, Maldonado C, Chen J, Lee S, Sisneros N, de Baynast K, Golser W, Wissotski M, Kim W, Sanchez P, Ndjiondjop MN, Sanni K, Long MY, Carney J, Panaud O, Wicker T, Machado CA, Chen MS, Mayer KFX, Rounsley S, Wing RA (2014) The genome sequence of African rice (Oryza glaberrima) and evidence for independent domestication. Nature genetics 46:982-988

Wing RA (2012) The International - Oryza Map Alignment Project: Development and analysis of a genus-wide comparative genomics platform to help solve the 9- billion people question. Third International Symposium on Genomics and Crop Genetic Improvement, Wuhan, China

Xiao J, Li J, Grandillo S, Ahn S-N, Yuan L, Tanksley SD, McCouch SR (1998) Identification of Trait-Improving Quantitative Trait Loci Alleles From a Wild Rice Relative, Oryza rufipogon. Genetics 150:899-909

Xie J, Agrama HA, Kong D, Zhuang J, Hu B, Wan Y, Yan W (2010) Genetic diversity associated with conservation of endangered Dongxiang wild rice ( Oryza rufipogon ). Genetic Resources and Crop Evolution 57:597-609

Xu X, Liu X, Ge S, Jensen JD, Hu F, Li X, Dong Y, Gutenkunst RN, Fang L, Huang L, Li J, He W, Zhang G, Zheng X, Zhang F, Li Y, Yu C, Kristiansen K, Zhang X, Wang J, Wright M, McCouch S, Nielsen R, Wang J, Wang W (2012) Resequencing 50 accessions of cultivated and wild rice yields markers for identifying agronomically important genes. Nature biotechnology 30:105-111

Yamakawa H, Ebitani T, Terao T (2008) Comparison between locations of QTLs for grain chalkiness and genes responsive to high temperature during grain filling on the rice chromosome map. Breeding Science 58:337-343

Yan H, Xiong Z, Min S, Hu H, Zhang Z, Tian S, Tang S (1997) The Transfer of Brown Planthopper Resistace from Oryza eichingeri to O sativa. Journal of Genetics and Genomics 24:424-431

Yang H, Hu L, Hurek T, Reinhold-Hurek B (2010) Global characterization of the root transcriptome of a wild species of rice, Oryza longistaminata, by deep sequencing. BMC genomics 11:705

71

Zhang Y, Zhang S, Liu H, Fu B, Li L, Xie M, Song Y, Li X, Cai J, Wan W, Kui L, Huang H, Lyu J, Dong Y, Wang W, Huang L, Zhang J, Yang Q, Shan Q, Li Q, Huang W, Tao D, Wang M, Chen M, Yu Y, Wing RA, Wang W, Hu F (2015) Genome and Comparative Transcriptomics of African Wild Rice Oryza longistaminata Provide Insights into Molecular Mechanism of Rhizomatousness and Self- Incompatibility. Molecular plant 8:1683

72

Chapter 3 Relationships of wild and domesticated rices (Oryza AA genome species) based upon whole chloroplast genome sequences2

Abstract

The evolutionary relationships of cultivated rice and its wild relatives have remained contentious and inconclusive. This study reports on the use of whole chloroplast sequences to elucidate the evolutionary and phylogenetic relationships in the AA genome Oryza species, representing the primary gene pool of rice. A combination of read mapping and de novo approaches were used to assemble the whole chloroplast genomes. This is the first study that has produced a well resolved and strongly supported phylogeny of the AA genome species. The pan tropical distribution of these rice relatives was found to be explained by long distance dispersal within the last million years. The analysis resulted in a clustering pattern that showed strong geographical differentiation. The species were defined in two primary clades with a South American/African clade with two species, O glumaepatula and O longistaminata, distinguished from all other species. The largest clade was comprised of an Australian clade including newly identified taxa and the African and Asian clades. There was genetic differentiation between the two subspecies of O. sativa with japonica and indica forming distinct clades associated with O. rufipogon and O. nivara respectively. This clustering pattern seems to support polyphyletic evolution, and suggests independent domestication of Asian rice as opposed to the single origin theory. This refined knowledge of the relationships between cultivated rice and the related wild species provides a strong foundation for more targeted use of wild genetic resources in rice improvement and efforts to ensure their conservation. In order to elucidate the evolutionary relationships of the newly discovered Australian taxa, more studies are recommended.

2Published as Wambugu, P.W., Brozynska, M., Furtado, A., Waters, D.L. and Henry, R.J. 2015. Relationships of wild and domesticated rices (Oryza AA genome species) based upon whole chloroplast genome sequences. Scientific Reports 5, 13957; doi: 10.1038/srep13957

73

Key words: Phylogenetic relationships, chloroplast genome, evolution, divergence dating, AA genome

3.1. Introduction

The Oryza genus is one of the most economically important members of the plant kingdom as it includes one the most important crops, rice, which is the staple food for over half of the world’s population. The genus has two cultivated species and about 21 wild relatives. Based on chromosome paring, these Oryza species are divided into 10 genome types namely AA, BB, CC, BBCC, EE, FF, GG, CCDD, HHJJ and HHKK (Ge et al. 1999). The AA genome has eight diploid species among them one of the cultivated species, O. sativa L. which has two subspecies, O. sativa L. ssp. japonica and O. sativa L. ssp. indica (hereafter referred as japonica and indica respectively) which have a global distribution. The other cultivated species is O. glaberrima Steud., commonly referred to as African rice, which is localized in West Africa. The wild species in this genome group include O. barthii A. Chev., O. longistaminata A. Chev. et Roehr., O. meridionalis Ng, O. glumaepatula Steud., O. rufipogon Griff. and O. nivara Sharma & Shastry. Though O. nivara has previously been described as an ecotype of O. rufipogon, this study did not pass judgment and was included in this study. Four of these species are found in Africa thereby making it the region with the highest number of species representing the Oryza primary genepool. These cultivated and wild species hold valuable genetic diversity that has continued to contribute immensely to rice crop improvement.

Proper knowledge of evolutionary and phylogenetic relationships of genetic resources is important in ensuring their effective conservation and utilization. The extensive pan-tropical distribution of the AA genome species has long remained an unresolved issue. Theories of Gondwanaland origin and its subsequent breakup as well as long distance dispersal have been used to explain the distribution of Oryza species (Chang 1976). As reviewed by Wambugu et al. (2013) and the references therein (chapter 1), the persistent incongruence and inconsistency that has characterized efforts to resolve these issues might be attributed to a relatively rapid diversification of the AA genome, differential choice of genes, use of insufficient data,

74 incomplete lineage sorting, misidentification of accessions and introgression. One of the greatest challenges that have faced phylogeneticists and systematicists is in the selection of the right loci to target, one that is phylogenetically informative and provides sufficient resolution so as to circumvent some of these identified causes of incongruence.

Previously, phylogenetic analysis (Ge et al. 1999; Zhu and Ge 2005; Zhu et al. 2014) has followed the laborious process of amplifying selected loci, some of which unfortunately have not provided sufficient phylogenetic resolution. Although the amount of data required to give robust and stable phylogenetic inferences is not clear (Wahlberg and Wheat 2008), it is generally recognised that increasing the number of loci studied increases resolution. Recently, the potential of whole chloroplast genomes as opposed to individual loci in resolving phylogenetic relationships has been demonstrated (Nock et al. 2011; Parks et al. 2009; Yang et al. 2013). The use of whole chloroplast genomes helps to overcome the previously laborious process of data generation. Recent advances in next generation sequencing have led to the sequencing of many chloroplast genomes which have continued to find utility in various areas of plant science. While nuclear data may lead to inconsistent phylogenies due to recombination, plastids have particular advantages in phylogenetic reconstruction as they are structurally stable, generally uniparental, haploid and non- recombinant (Small et al. 2004). Therefore, the aim of this study was to use massively parallel sequencing to reconstruct the evolutionary and phylogenetic relationships of the Oryza AA genome species from whole chloroplast genome sequences. Using this phylogenomics approach, this study was able to produce a well resolved and strongly supported phylogeny of the AA genome species with a clustering pattern that showed strong geographical and genetic differentiation.

3.2. Results

3.2.1. Genome assembly and features

A total of 9 Oryza chloroplast genomes representing five species were sequenced using Illumina Genome Analyzer producing 34,562,744 - 38,275,022 paired reads. With O. sativa as the reference, a combination of de novo assembly and read mapping

75 strategies was employed in the assembly of these paired reads, resulting in different mapping rates and chloroplast genomes of diverse sizes. Reference guided assembly resulted in mapping rates that ranged from 5.1% to 12.6% which is equivalent to chloroplast genome fold coverage of between 1319X and 3443X (Table 11). Although not an A genome species but an out-group in this study, O. officinalis had the largest genome (134,911bb) with O. longistaminata having the smallest (134,563bp).

Table 11: Sequencing status of African Oryza genomes

Species Total Average No. of Proportion of Fold Cp Accession paired-end length (after Reads reads aligning coverage genome number reads trimming) aligning to to reference size reference (%) O. longistaminata 38275022 96.8 4786403 12.6 3443 134567 KM881641 1 O. longistaminata 38613960 96.7 4398194 11.5 3160 134563 KM881642 2 O. barthii 1 34562744 96.5 3387828 9.92 2427 134674 KM881634 O. barthii 2 35661888 96.5 3574521 10.12 2562 134603 KM881635 O. barthii 3 35336280 96.5 2304561 6.58 1652 134596 KM881636 O. barthii 4 35095334 96.4 2205372 6.35 1579 134640 KM881637 O. glaberrima 1 35220340 96.8 3695118 10.68 2657 134606 KM881638 O. glumaepatula 36220444 96.6 1838613 5.13 1319 134583 KM881640 O. officinalis 38710520 96.7 1935941 5.1 1387 134911 KM881643

The assembled genomes had a quadripartite structure comprising of a long single copy (80,512-80,684bb) and a short single copy (12,336-12,386bb) separated by a pair of inverted repeat regions (20,794-20,809bb), an organization structure which is typical of angiosperms (Fig. 10). Each genome had a total of 124 unique genes comprising of 83 protein coding genes, 33 transfer RNA (tRNA) and 8 ribosomal RNA (rRNA) genes. The genomes assembled using the strict pipeline employed in this study, were generally larger in size than those which were obtained from GenBank, most of which had been assembled through read mapping alone (Table 12).

76

Figure 10: Gene map of O. glaberrima chloroplast genome. The different types of genes are colour coded. Genes shown inside are transcribed clockwise while those on the outside are transcribed anticlockwise. The inner circle represents the two inverted repeats (IRA and IRB) which are separated by the short single copy (SSC) and the long single copy (LSC).

3.2.2. Phylogenetic analysis

The chloroplast sequences of O. barthii, O. longistaminata and O. glaberrima which were assembled in this study as detailed above were analysed together with those of O. meridionalis, O. nivara, O. rufipogon, japonica and indica which were downloaded from GenBank (Table 12). Aligning of full chloroplast sequences resulted in an alignment that was 136879bp in length with 134353 characters being constant. A total

77 of 484 characters were variable out of which 221 were parsimony informative (Table 13).

Table 12: GenBank sequences used in the phylogenetic analysis

Species Length (bp) Accession number O. sativa sp. Japonica 1 134,551 GU592207 O. sativa sp. Japonica 2 134,525 NC_001320 O. sativa sp. Indica 1 134,448 JN861109 O. sativa sp. Indica 2 134,496 NC_008155 O. rufipogon 1 134,537 KF562709 O. rufipogon 2 134,557 NC_022668 O. rufipogon 3 134,,557 JN005833 O. meridionalis 134,558 NC_016927 O. nivara 134,494 NC_005973 O. glaberrima 2 132,629 KJ513090

Table 13: Alignment statistics for whole chloroplast genome and with one inverted repeat (IR) excluded

Statistics/parameter Whole cp genome Cp genome with one IR excludeda Alignment length 136,879bp 114,151bp Conserved sites 134,353 113,591 Variable sites 484 439 Parsimony informative sites 221 214 Singleton sites 262 225 GC content 39.0 38.0 a Oryza glaberrima 2 was excluded in this analysis as it was not possible to accurately identify the inverted repeats (IR) due to suspected assembly errors.

We obtained a well resolved and strongly supported phylogeny with a strong hierarchy of clades (Fig. 11). The resultant phylogenetic tree had two main distinct clades where the primary division was into a clade including O. glumaepatula and O. longistaminata which formed the basal clade and one containing all the other species. The large clade in turn contained two well resolved clades, the Australian clade and another one with Asian and African species. The various regions/continents which are used to refer to the different taxa in this study represent their provenance, and in all cases were also the source of collection (Fig. 12).

78

Figure 11: Bootstrapping consensus and maximum parsimony tree showing phylogenetic relationships among Oryza AA genome species conducted using whole chloroplast sequences. O. officinalis acted as the out-group. The same topology was obtained from Neighbour joining (NJ) and maximum likelihood (ML) criteria. Indicated accession numbers are GenBank unique identifiers. The same topology was obtained when one of the inverted repeat (IR) regions was excluded. Deleting the indels resulted in a poorly resolved and inconsistent phylogeny.

79

Figure 12: Phylogenetic tree of the Oryza AA genome group species showing the origin of the various species. The tree was produced using whole chloroplast genome sequences.

The different phylogenetic criteria namely maximum likelihood (ML), maximum parsimony (MP) and neighbour joining (NJ) produced trees that were congruent, showing the robustness of this approach. There are varied and sometimes contradictory reports on the effect of indels on phylogenetic analysis, with different recommendations being advanced on how they should be treated (Nagy et al. 2012; Ogden and Rosenberg 2007). The phylogenetic analysis was therefore conducted with and without the indels and the resultant phylogenetic trees were then compared. The results indicate that deletion of indels, as is sometimes recommended, resulted in

80 generally poorly resolved and inconsistent phylogeny. Neighbour joining analysis yielded a poorly resolved tree while maximum parsimony and maximum likelihood analysis resulted in a tree that is almost similar to the one obtained without deleting the indels, except that O. glaberrima 2 moved next to the basal clade (Fig. 13 and 14). Overall, this study shows that deleting the indels prior to phylogenetic analysis may not be an effective approach.

The results of this study are consistent with those of Dessimoz and Gil (Dessimoz and Gil 2010) who concluded that indels carry substantial phylogenetic signal and deleting them can be detrimental in phylogenetic analysis. The alignment had long indels and perhaps it is the size of the indels that plays a role in determining their phylogenetic effect.

81

Figure 13: Phylogenetic tree of the Oryza AA genome group species showing the origin of the various species. The tree was produced using whole chloroplast genome sequences.

82

Figure 14: Maximum parsimony (MP) tree of the Oryza AA genome group species obtained after excluding indels in the alignment. The tree was produced using whole chloroplast genome sequences.

3.2.3. Molecular divergence dating

Based on whole chloroplast genome sequences, this study estimates that the AA genome may have diverged about 0.69-1.11million years ago (Mya) (Table 14 and Fig. 15), a period that seems to correspond to the Pleistocene.

83

Table 14: Divergence date estimates of the AA Oryza genome species based on chloroplast sequences

Node label a Divergence date estimates b Strict clock c Relaxed clock 1 48.93 (39.7-58.9) 49.41(49.21-50.99) 2 4.37 (3.36-5.22) 41.76 (34.94-51.48) 3 0.95 (0.69-1.11) 17.23 (8.49-35.55) 4 0.2 (0.1-0.23) 5.32 (1.27-10.87) 5 0.86 (0.61-0.97) 11.99 (7.41-19.82) 6 0.77 (0.54-0.86) 7.62 (2.68-12.73) 7 0.68 (0.5-0.78) 6.66 (2.95-11.62) a Nodes are labelled in the phylogenetic tree shown in Fig. 15 b Divergence estimates are in millions of years c 95% HPD (Highest Posterior Density) is indicated in parenthesis

Figure 15: Maximum parsimony (MP) tree of the Oryza AA genome group species obtained after excluding indels in the alignment. The divergence dates for the various labelled nodes were calculated using strict and relaxed clocks and are shown in Table 14. Divergence between Triticum aestivum and Zea mays was used as the calibration point while O. officinalis acted as the out-group. The tree is not drawn to scale.

84

3.3. Discussion

3.3.1. Origin and evolution of AA genome species Phylogenetic relationships in the AA genome have been the most widely studied of all the Oryza genome types primarily due to their close association with O. sativa, the cultivated species. In the light of the persistently inconsistent and inconclusive results obtained from these previous studies, the present study analysed the chloroplast genomes of the AA genome species with the aim of investigating their utility in elucidating their phylogenetic and evolutionary relationships. Among the inconsistencies observed in previous studies, the issue of the radiation of AA genome, basal taxa and divergence time have perhaps been the most outstanding. Some studies show O. longistaminata (Ren et al. 2003) to be the most diverged species while other phylogenies support O. meridionalis as the most distinct species (Duan et al. 2007; Kumagai et al. 2010; Park et al. 2003). Similarly, the relationship between O. glumaepatula, O. meridionalis and O. longistaminata has also remained either persistently inconsistent or unresolved (Duan et al. 2007; Zhu and Ge 2005). The wide topological incongruence between studies might be attributed to a relatively rapid diversification of the AA genome. The use of different out groups among the various studies could also be responsible for the incongruence. Resolving these inconsistencies therefore remains a priority in efforts aimed at promoting conservation and utility of these resources.

This study shows that the basal clade consists of O. longistaminata and O. glumaepatula which are found in Africa and South America respectively. This divergence could be as a result of long distance dispersal through movement of animals especially birds, followed by ecological differentiation. Elephants, buffaloes and other mega fauna are some of the vectors that have been associated with Oryza distribution in different regions (Vaughan et al. 2008). These species are too close to have a divergence due to a pan-Gondwanaland distribution and the continental drift between Africa and South America which is estimated to have occurred approximately 100 – 120Mya (Duncan and Hargraves 1984; Wood 1993). These two species show a genetic relationship that is common in many other species of African and South American origin (Givnish et al. 2004; Groeneveld et al. 2007). The presence of the

85 most primitive extant grasses in South America (Bremer 2000) may also seem to suggest the Oryza genus may have had some ancestry here.

Although it would have been expected that O. longistaminata would cluster with O. barthii and O. glaberrima based on their sympatric occurrence in some habitats in Africa, it showed little affinity with this clade. Results of this study therefore do not support the hypothesis that the immediate ancestor of O. barthii is O. longistaminata (Khush 1997). The close relationship between these two basal clade species has previously been reported by Duan et al. (2007) and Kumagai et al. (2010) in phylogenetic studies conducted using mitochondria and chloroplast DNA sequences respectively, where the two species formed monophyletic clades. In addition to phylogenetic analysis, the use of molecular clock is commonly used to gain insights on the origin and evolution of species.

The analysis used both relaxed and strict clock approaches. The results obtained using the strict clock approach were generally closer to the estimates reported in previous studies, as compared to those obtained using the relaxed clock approach which were generally much older. Strict clock is known to be superior to relaxed clock in case of shallow phylogenies (Brown and Yang 2011). With the Oryza AA genome species having diverged only recently, the strict clock approach seems to provide better divergence date estimates. Consequently, only estimates obtained using the strict clock approach will be highlighted in this chapter. The divergence estimate of about 0.69-1.11million years ago (Mya) (Table 14 and Fig. 15), is less than the estimates of 2Mya given in an earlier study based on intron sequences of four nuclear genes (Zhu and Ge 2005). The results also suggest that the AA genome group diverged from other Oryza species around 3.36-5.22Mya. Relatively recent long distance dispersal remains the most reliable explanation for the current distribution of AA genome species. Results of divergence dating vary widely between studies depending on the approach used thus making comparison between studies complex. Comparison of nuclear and chloroplast based divergence estimates is particularly difficult as they usually vary widely since each of these two approaches have their own strengths and limitations.

86

Though chloroplast genomes may have particular advantages in resolving phylogenetic relationships, Middleton et al. (2014) noted that their use in divergence dating may be problematic especially in species with short evolutionary times. This is due to the fact that presence of multiple chloroplast haplotypes may not correspond with species divergence since they may have diverged long before actual species divergence. This may therefore lead to inaccurate divergence estimates and may explain the difference between nuclear based and chloroplast based divergence estimates. Increasing the amount of data analysed in divergence dating has been found to result in more accurate estimates (Kumar and Hedges 1998). It is therefore reasonable to argue that due to the use of a large data set, the estimates obtained in this study are robust and reliable than those of many previous studies, majority of which have analysed single loci.

The origin of the AA genome has also remained quite contentious with different arguments and opinions being advanced. Previous studies suggest the AA genome group may have originated from Africa since the majority of the studies have reported O. longistaminata as the most ancestral species (Cheng et al. 2002; Iwamoto et al. 1999). Furthermore, O. longistaminata has unique morphological features that are not present in other AA genome species such as self-incompatibility, rhizomatous and unique characteristics of the ligules, all consistent with it being significantly differentiated. The conventional wisdom in evolutionary biology is that annual species are derived from perennial ancestors. This is consistent with the findings of this study which indicate that O. longistaminata and O. glumaepatula, both of which are perennial, are the most ancestral species.

The basal split in this study was followed by the divergence of the Australian species from the Asian and African taxa and is estimated to have occurred about 0.54- 0.86Mya. Contrary to some studies which show that the spread of Oryza AA genome species into Australia may have occurred earlier (Zhu et al. 2014), this study shows that spread to all the regions seems to have occurred just around the same time, an observation earlier made by Vaughan et al. (2005). The sharing of some taxa such as some types of Sorghum, Vigna radiata var. sublobata and Gossypium species between Australia and Africa suggests that there is a phytogeographic link between

87 the two regions (Vaughan et al. 2008). This study shows that the divergence between the African and Asian species occurred twice, with the first one being the divergence between ancestors of O. longistaminata and those of the Asian species while the second one is the radiation of the ancestors of the two cultivated taxa. The divergence between the two cultivated taxa is estimated to have occurred about 0.51-0.65Mya (Table 14 and Fig. 15). The timing of this divergence corresponds to estimates given in previous studies of 0.64 - 0.7 Mya (Ma and Bennetzen 2004; Zhu and Ge 2005). However, this estimated time of divergence is much earlier than the time of domestication of Asian and African rice, which is reported to have taken place about 10,000 years and 3,500 years ago respectively. In addition to giving insights on evolutionary relationships, this study enabled us to better define the genetic relationships between the various species particularly between Asian rice and its putative progenitors which have always remained controversial.

3.3.2. Relationship between O. rufipogon, O. sativa and O. nivara

Many conflicting theories have been advanced on the origin and domestication history of Asian cultivated rice and despite great research efforts, the debate rages on. Traditionally, these theories were broadly classified as supporting either monophyletic (Molina et al. 2011) (Joshi et al. 2000) or polyphyletic origin (Bautista et al. 2001; Kumagai et al. 2010). Recently however, some domestication pathways that deviate from these previously suggested models have been proposed (Huang et al. 2012). In one of these models, Huang et al. (2012) suggested that japonica was domesticated from O. rufipogon in Southern China while indica was the product of a cross between japonica and local wild rice in South East Asia and South Asia.

Analysis of genetic polymorphisms between the O. sativa subspecies shows some genetic differentiation with indica 1 and indica 2 having 296 and 247 nucleotide differences respectively between them and the japonica reference (GU592207). This genetic differentiation is confirmed by their phylogenetic placement where both japonica and indica formed distinct clades, associated with O. rufipogon and O. nivara respectively. This genetic differentiation between the two subspecies of O. sativa and the associated clustering pattern seems to support polyphyletic evolution, and suggests independent domestication of Asian rice as opposed to the single origin

88 theory. Proponents of the independent domestication theory, posit that the diverged genomic backgrounds of japonica and indica were derived from pre-differentiated ancestral gene pools during separate and geographically isolated domestication events. This theory has gained support from several phylogenetic, genetic distance and genomic palaeontology analyses. The findings of this study suggest the possibility that the maternal genome of indica may have been derived from O. nivara while the maternal genome of japonica may have originated from O. rufipogon. The nuclear genomes may have a more complex origin.

3.3.3. Relationship between O. barthii and O. glaberrima and the origin of African rice

Similar to Asian rice, the origin of African rice has also been controversial and remains grossly understudied. Although the widely held proposition is that African rice was domesticated from O. barthii about 3,500yr ago (Li et al. 2011; Wang et al. 2014), it has also been hypothesized that it evolved from Asian rice through sympatric speciation before it was later domesticated in West Africa (Nayar 2014). According to this theory, upon introduction of Asian rice to Africa, drastic changes in climate and especially the high temperatures in the Sahel may have led to mutations which may have subsequently been selected for by farmers. This latter theory has however not gained much support and the arguments presented in its support seem less convincing. An Asian origin of African rice has also been proposed but dismissed (Li et al. 2011). The shattering nature of O. glaberrima and presence of red pericarp in some varieties seems to suggest that the process of domestication is incomplete.

In the African clade, O. glaberrima clustered with all the four accessions of O. barthii, thus supporting the popular hypothesis of O. barthii being the progenitor of African rice. A recent study resequenced 20 accessions of O. glaberrima and 94 of O. barthii (Wang et al. 2014) and their results supported the domestication theory originally proposed by Porteres (1962) which was later supported by Li et al. (2011). Porteres (1962) suggested that African rice was first domesticated in the Inland Delta of the Upper Niger River before spreading to two secondary centres, one located along the Senegambian coast and the other in the Guinea highlands (Porteres 1962). The wide inconsistencies in evolutionary and phylogenetic relationships may be attributed to the use of germplasm that has been collected from areas that have been disturbed

89 through human activity as well as those that are misidentified. Such human activities may have eroded important genetic signatures thus leading to contradictory theories and conclusions.

3.3.4. Relationship between Australian wild taxa

The strong geographic differentiation among the “O. rufipogon” accessions from either Asia or Australia was evidently clear. Information on such genetic patterns is valuable in guiding both germplasm use and conservation efforts particularly when devising sampling strategies. As will be discussed later in this section, it is believed that some of the germplasm initially identified as O. rufipogon may have been misclassified. Though the observed genetic differentiation in “O. rufipogon” accessions from different regions is consistent with previous analysis conducted using molecular (Waters et al. 2012) and hybridization data (Banaticla-Hilario et al. 2013) which showed them as distinctly different taxa, it should be interpreted with caution. The clustering of “O. rufipogon” in different clades may be due to incomplete lineage sorting or introgressive hybridization which may have led to chloroplast capture (Rieseberg and Soltis 1991; Zou et al. 2008). O. rufipogon is an outcrossing species (Banaticla-Hilario et al. 2013) and so chloroplast capture is possible especially with species with which it has sympatric distribution (Cristina Acosta and Premoli 2010). Introgression is more frequent in chloroplast than in nuclear genomes and this phenomenon of reticulation and chloroplast capture usually leads to incongruence between nuclear and chloroplast based phylogenies (Rieseberg and Soltis 1991). It has generally been demonstrated that chloroplast based phylogenies usually reflect geographical distribution as compared to nuclear phylogenies which normally correspond with taxonomic relationships (Cristina Acosta and Premoli 2010; Rautenberg et al. 2010; Wysocki et al. 2015). As recommended by Rieseberg and Soltis (1991), conducting phylogenetic analysis using both nuclear and chloroplast genomes is an important strategy to avoid erroneous phylogenies. The use of unlinked nuclear genes would be useful in testing for incomplete lineage sorting and this study therefore serves as a useful starting point, whose findings can be compared with the analysis of selected nuclear loci. Recently, a comparative analysis of chloroplast and nuclear genome based phylogenies was conducted, with the resultant phylogenies showing different genetic relationships between the various species (Brozynska et al., 2016).

90

Recent studies using both molecular and morphological data have discovered two distinct perennial taxa in Northern Queensland, Australia which are believed to be new Oryza gene pools (Fig. 16) (Brozynska et al. 2014; Sotowa et al. 2013). Although it might be argued that these reported new taxa are hybrid swarms originating from crosses between O. sativa and Australian wild rice, these authors have discounted this, based on molecular data. O. rufipogon was initially thought to occur in both Asia and Australia (Henry et al. 2009; Vaughan 1994), with ecological and geographical isolation causing the observed genetic differentiation. However, some scientists believe that this may have been inaccurate and suggest the germplasm originally christened as O. rufipogon from Australia in fact represents the two newly identified taxa (Brozynska et al. 2014). While this remains inconclusive and might be considered highly speculative, it appears to be partially confirmed by molecular analysis which indicate that O. rufipogon accessions collected a long time ago from Australia and are currently in ex situ collections are similar to one of the new species (Sotowa et al. 2013). The similarity between these accessions is suggestive of a misidentification. Perhaps taxonomists confused these species to be O. rufipogon due to their perenniality and other overlapping variation which presents challenges in clearly circumscribing taxa. With the chloroplast genome of one of the two perennial taxa closely resembling that of the annual O. meridionalis (Waters et al. 2012), there is need for further work to assess the possibility of introgressive hybridization resulting in chloroplast capture as already discussed above.

91

Figure 16: Wild population of one of the newly identified perennial species in Lakefield National Park, North Queensland, Australia. Emerging evidence and opinions indicate that germplasm initially identified as Australian O. rufipogon may be a misidentification and may actually represent the newly identified species.

The evolutionary pattern of these newly discovered taxa currently remains unclear but it is believed that at least one of them may have diverged from O. meridionalis followed by continuous introgression with the wild taxa (Sotowa et al. 2013). This divergence pattern however deviates from the widely held opinion among evolutionary biologists that annual species are derived from perennial ancestral lineages (Tank and Olmstead 2008). More studies are however required in order to clearly elucidate their evolutionary paths.

The discovery of these gene pools has a great impact on global food security as it provides the breeders with critical genetic resources. As noted by Henry et al. (2009), genetic resources of these taxa are under collected and so genetic diversity studies and collection gap analysis are necessary so as to guide future conservation efforts. The Oryza gene pools in South America/Africa and Australia are likely to possess important traits as they are genetically isolated from domesticated rice and hence remain genetically uncontaminated by gene flow from the large Asian domesticated rice populations.

92

3.4. Methods

3.4.1. Plant materials

The seed samples of African Oryza wild species used in this study were obtained from the International rice research institute (IRRI), Philippines while those of O. glumaepatula and O. officinalis were obtained from the Queensland Alliance for Agriculture and Food Innovation (QAAFI), Australia (Table 15).

Table 15: Details of seed samples used for DNA extraction

Species Accession number Country of origin Oryza glaberrima 104040 Chad Oryza longistaminata 103905 Tanzania Oryza longistaminata 81953 Zambia Oryza barthii 103590 Cameroon Oryza barthii 101381 Niger Oryza barthii 104124 Chad Oryza barthii 86481 Zambia Oryza glumaepatula AusTRCF 309281 Suriname Oryza officinalis AusTRCF 309302 Myanmar

3.4.2. DNA extraction and sequence assembly

DNA was extracted from approximately 3gm of plant tissue using a modified cetyltrimethylammonium bromide (CTAB) method that was originally published by Carroll et al. (1995). The DNA quality and quantity was assessed by agarose gel electrophoresis (0.7%, 103V for 45 minutes) as well as by an Agilent BioAnalyzer 2100 (Agilent Technologies). Approximately 3µg of total DNA from each accession was indexed and pooled together in one lane of an Illumina Genome Analyzer and sequenced. In total, 9 Oryza chloroplast genomes representing five species were sequenced. The raw reads were assembled into whole chloroplast genomes in a multi-step approach employing a modified pipeline that involved a combination of both reference guided and de novo assembly approaches (Cronn et al. 2008). First, trimming was done with a quality score limit of 0.05 on CLC Genomics workbench 6.5.1 (CLC Bio, a QIAGEN Company, Aarhus, Denmark) before undertaking sequence assembly. In order to discard non-CP reads, the trimmed reads were assembled by mapping to the published O. sativa chloroplast reference sequence (GU592207) using CLC Genomic Workbench 6.5.1 under the following parameters: length fraction 0.5,

93 similarity fraction 0.8, mismatch cost 2, deletion and insertion costs 3. In order to allow assembly of the inverted repeats as well as other repetitive elements, reads showing non-specific matches were mapped randomly. The vote majority conflict resolution mode was used in order to ensure inclusion of only chloroplast specific reads thus avoiding contribution of nuclear and mitochondria reads to the consensus sequences. With the aim of correcting any assembly errors especially relating to possible misalignment due to insertions and deletions, the read mapping consensus sequences obtained were subjected to two passes of forced realignment using the respective InDel variant tracks.

After reference guided assembly, de novo assembly was undertaken with a minimum contig length of 1000bp. The resultant de novo contigs were aligned to the reference sequence using BLAST (http://blast.ncbi.nlm.nih.gov/) as implemented in CLC Genomic Workbench 6.5.1 and the matching cp contigs ordered by alignment to the reference sequence using clone manager 9.0. The trimmed reads were later mapped to the de novo assembled sequence as reference and the resultant consensus sequence subjected to one pass of forced realignment using InDel guidance variant tracks. The de novo and reference guided consensus were then compared and any conflicts resolved by reference to the reads.

3.4.3. Phylogenetic analysis

The final chloroplast sequences from O. barthii, O. longistaminata, O. glaberrima, O. glumaepatula and O. officinalis were exported to Geneious 7.0.5 (www.geneious.com) where, using MAFFT (Katoh et al. 2002), they were aligned with chloroplast sequences for O. meridionalis, O. nivara, O. rufipogon, japonica and indica which were downloaded from GenBank (Table 12). The alignment was physically inspected and confirmed to be correct. The number of variable characters and parsimony informative sites were analysed in MEGA6 (Tamura et al. 2013) with the out-group excluded. The aligned sequences were analysed by maximum parsimony (MP), neighbour joining (NJ) maximum likelihood (ML) and Bayesian Inference criteria using PAUP 4.0b10 (Swofford 2003) implemented through Geneious v 7.0.5. The Neighbour joining criterion used the general time-reversible (GTR) distance measure with rates assumed to follow gamma distribution. The maximum parsimony analysis was performed using

94 heuristic searches with the 'MulTrees' option followed by tree bisection–reconnection branch swapping. Topological robustness was assessed using bootstrap method with 1000 random addition replicates. All characters were unordered and were accorded equal weight with gaps being treated as missing data. A separate phylogenetic assessment was also conducted with the gaps being deleted manually but resulted in a poorly resolved and inconsistent tree. Appropriate nucleotide substitution models were determined using Modeltest 3.7 (Posada and Crandall 1998). The models were chosen using Akaike Information Criterion (AIC) with the models being used for subsequent maximum likelihood and Bayesian Inference analysis. Maximum likelihood analysis was undertaken using the GTR+G+I model. Bayesian phylogenetic inference was conducted using MrBayes 3.1 (Ronquist and Huelsenbeck 2003) with Monte Carlo Markov Chains (MCMC) estimation of posterior probability distributions. The analysis used the GTR model with the rate variation assumed to follow gamma distribution. Four independent runs of 1,100,000 MCMC were performed with trees sampled after every 200 runs followed by a burn in length of 100,000 MCMC. O. officinalis acted as the out group.

3.4.4. Molecular divergence dating

Molecular clock analysis was conducted using Bayesian method implemented in BEAST program (Bouckaert et al. 2014). The analysis was conducted using both strict and relaxed clock approaches with the data being partitioned into coding, non-coding and intergenic sequences. Calibration was done based on the divergence time between Zea mays and Triticum aestivum which is estimated to have occurred about 50Mya (Wolfe et al. 1989). The Hasegawa, Kishino and Yano (HKY) and four gamma categories substitution model was used with the Calibrated Yule model priors being used to model speciation. TreeAnnotator (pre-release version 2.1.2) was used to calculate maximum probability clade tree and the final tree was visualized in FigTree v1.4.2.

3.4.5. Genome annotation

Annotation of the sequenced genomes was done using CpGAVAS (Liu et al. 2012) (http://www.herbalgenomics.org/0506/cpgavas) followed by manual adjustments of start and stop codons as well as intron/exon boundaries which was done using Apollo

95

(Lewis et al. 2002). GenBank files were created using sequin and were subsequently used to draw chloroplast gene maps using OGDraw v1.2 (Lohse et al. 2013).

3.5. References

Banaticla-Hilario MCN, McNally KL, Berg RG, Hamilton NRS (2013) Crossability patterns within and among Oryza series Sativae species from Asia and Australia. Genetic Resources and Crop Evolution 60:1899-1914

Bautista NS, Solis R, Kamijima O, Ishii T (2001) RAPD, RFLP and SSLP analyses of phylogenetic relationships between cultivated and wild species of rice. Genes & genetic systems 76:71-79

Bouckaert R, Heled J, Kühnert D, Vaughan T, Wu C-H, Xie D, Suchard MA, Rambaut A, Drummond AJ (2014) BEAST 2: a software platform for Bayesian evolutionary analysis. PLoS computational biology 10:e1003537

Bremer K (2000) Early Cretaceous lineages of monocot flowering plants. Proceedings of the National Academy of Sciences of the United States of America 97:4707- 4711

Brown RP, Yang Z (2011) Rate variation and estimation of divergence times using strict and relaxed clocks. BMC evolutionary biology 11:271-271

Brozynska M, Copetti D, Furtado A, Wing RA, Crayn D, Fox G, Ishikawa R, Henry RJ (2016) Sequencing of Australian wild rice genomes reveals ancestral relationships with domesticated rice. Plant biotechnology journal

Brozynska M, Omar ES, Furtado A, Crayn D, Simon B, Ishikawa R, Henry RJ (2014) Chloroplast Genome of Novel Rice Germplasm Identified in Northern Australia. Tropical Plant Biology 7:111-120

Carroll BJ, Klimyuk VI, Thomas CM, Bishop GJ, Harrison K, Scofield SR, Jones JD (1995) Germinal transpositions of the maize element Dissociation from T-DNA loci in tomato. Genetics 139:407-420

Chang T-T (1976) The origin, evolution, cultivation, dissemination, and diversification of Asian and African rices. Euphytica 25:425-441

Cheng C, Tsuchimoto S, Ohtsubo H, Ohtsubo E (2002) Evolutionary relationships among rice species with AA genome based on SINE insertion analysis. Genes & genetic systems 77:323-334

96

Cristina Acosta M, Premoli AC (2010) Evidence of chloroplast capture in South American Nothofagus (subgenus Nothofagus, Nothofagaceae). Molecular phylogenetics and evolution 54:235-242

Cronn R, Liston A, Parks M, Gernandt DS, Shen R, Mockler T (2008) Multiplex sequencing of plant chloroplast genomes using Solexa sequencing-by- synthesis technology. Nucleic acids research 36:e122-e122

Dessimoz C, Gil M (2010) Phylogenetic assessment of alignments reveals neglected tree signal in gaps. Genome biology 11:R37-R37

Duan S, Lu B, Li Z, Tong J, Kong J, Yao W, Li S, Zhu Y (2007) Phylogenetic Analysis of AA-genome Oryza Species (Poaceae) Based on Chloroplast, Mitochondrial, and Nuclear DNA Sequences. Biochemical Genetics 45:113-129

Duncan RA, Hargraves RB (1984) Plate tectonic evolution of the caribbean region in the mantle reference frame. In: Bonini WE, Hargraves RS (eds) The Caribbean- South American Plate Boundary and Regional Tectonics. Geological Society of America, pp 81-95

Ge S, Sang T, Lu B-R, Hong D-Y (1999) Phylogeny of rice genomes with emphasis on origins of allotetraploid species. Proceedings of the National Academy of Sciences of the United States of America 96:14400-14405

Givnish TJ, Millam KC, Evans TM, Hall JC, Pires JC, Berry PE, Sytsma KJ (2004) Ancient Vicariance or Recent Long-Distance Dispersal? Inferences about Phylogeny and South American-African Disjunctions in Rapateaceae and Bromeliaceae Based on ndhF Sequence Data. International Journal of Plant Sciences 165:S35-S54

Groeneveld LF, Clausnitzer V, Hadrys H (2007) Convergent evolution of gigantism in damselflies of Africa and South America? Evidence from nuclear and mitochondrial sequence data. Molecular phylogenetics and evolution 42:339- 346

Henry RJ, Rice N, Waters DLE, Kasem S, Ishikawa R, Hao Y, Dillon S, Crayn D, Wing R, Vaughan D (2009) Australian Oryza: Utility and Conservation. Rice 3:235- 241

Huang X, Li W, Guo Y, Lu Y, Zhou C, Fan D, Weng Q, Zhu C, Huang T, Zhang L, Wang Y, Kurata N, Feng L, Furuumi H, Kubo T, Miyabayashi T, Yuan X, Xu Q, Dong G, Zhan Q, Li C, Fujiyama A, Wei X, Toyoda A, Lu T, Feng Q, Qian Q, Li

97

J, Han B, Wang Z-X, Wang A, Zhao Q, Zhao Y, Liu K, Lu H (2012) A map of rice genome variation reveals the origin of cultivated rice. Nature 490:497-497

Iwamoto M, Nagashima H, Nagamine T, Higo H, Higo K (1999) p-SINE1-like intron of the CatA catalase homologs and phylogenetic relationships among AA-genome Oryza and related species. TAG Theoretical and Applied Genetics 98:853-861

Joshi SP, Gupta VS, Aggarwal RK, Ranjekar PK, Brar DS (2000) Genetic diversity and phylogenetic relationship as revealed by inter simple sequence repeat (ISSR) polymorphism in the genus Oryza. TAG Theoretical and Applied Genetics 100:1311-1320

Katoh K, Misawa K, Kuma K-i, Miyata T (2002) MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic acids research 30:3059-3066

Khush GS (1997) Origin, dispersal, cultivation and variation of rice. Plant molecular biology 35:25-34

Kumagai M, Wang L, Ueda S (2010) Genetic diversity and evolutionary relationships in genus Oryza revealed by using highly variable regions of chloroplast DNA. Gene 462:44-51

Kumar S, Hedges SB (1998) A molecular timescale for vertebrate evolution. Nature 392:917-920

Lewis SE, Crosby MA, Kaminker JS, Matthews BB, Prochnik SE, Smith CD, Tupy JL, Rubin GM, Misra S, Mungall CJ, Clamp ME, Searle SMJ, Harris N, Gibson M, Iyer V, Richter J, Wiel C, Bayraktaroglu L, Birney E (2002) Apollo: a sequence annotation editor. Genome biology 3:0082

Li ZM, Zheng XM, Ge S (2011) Genetic diversity and domestication history of African rice (Oryza glaberrima) as inferred from multiple gene sequences. TAG Theoretical and applied genetics Theoretische und angewandte Genetik 123:21-31

Liu C, Shi L, Zhu Y, Chen H, Zhang J, Lin X, Guan X (2012) CpGAVAS, an integrated web server for the annotation, visualization, analysis, and GenBank submission of completely sequenced chloroplast genome sequences. BMC genomics 13:715-715

Lohse M, Drechsel O, Kahlau S, Bock R (2013) OrganellarGenomeDRAW--a suite of tools for generating physical maps of plastid and mitochondrial genomes and visualizing expression data sets. Nucleic acids research 41:W575

98

Ma J, Bennetzen JL (2004) Rapid recent growth and divergence of rice nuclear genomes. Proceedings of the National Academy of Sciences of the United States of America 101:12404-12410

Middleton CP, Senerchia N, Stein N, Akhunov ED, Keller B, Wicker T, Kilian B (2014) Sequencing of chloroplast genomes from wheat, barley, rye and their relatives provides a detailed insight into the evolution of the Triticeae tribe. PloS one 9:e85761

Molina J, Sikora M, Garud N, Flowers JM, Rubinstein S, Reynolds A, Huang P, Jackson S, Schaal BA, Bustamante CD, Boyko AR, Purugganan MD (2011) Molecular evidence for a single evolutionary origin of domesticated rice. Proceedings of the National Academy of Sciences of the United States of America 108:8351-8356

Nagy LG, Kocsubé S, Csanádi Z, Kovács GM, Petkovits T, Vágvölgyi C, Papp T (2012) Re-Mind the Gap! Insertion - Deletion Data Reveal Neglected Phylogenetic Potential of the Nuclear Ribosomal Internal Transcribed Spacer (ITS) of Fungi. PloS one 7:e49794

Nayar NM (2014) The Origin of African Rice. In: Nayar NM (ed) Origin and Phylogeny of Rices. Academic Press, San Diego, pp 117-168

Nock CJ, Waters DLE, Edwards MA, Bowen SG, Rice N, Cordeiro GM, Henry RJ (2011) Chloroplast genome sequences from total DNA for plant identification. Plant biotechnology journal 9:328-333

Ogden TH, Rosenberg MS (2007) How should gaps be treated in parsimony? A comparison of approaches using simulation. Molecular phylogenetics and evolution 42:817-826

Park K-C, Lee JK, Kim N-H, Shin Y-B, Lee J-H, Kim N-S (2003) Genetic variation in Oryza species detected by MITE-AFLP. Genes & genetic systems 78:235-243

Parks M, Cronn R, Liston A (2009) Increasing phylogenetic resolution at low taxonomic levels using massively parallel sequencing of chloroplast genomes. BMC biology 7:84-84

Porteres R (1962) Berceaux Agricoles Primaires Sur le Continent Africain. The Journal of African History 3:195-210

Posada D, Crandall KA (1998) MODELTEST: testing the model of DNA substitution. Bioinformatics (Oxford, England) 14:817-818

99

Rautenberg A, Hathaway L, Oxelman B, Prentice HC, Institutionen för o, Uppsala u, Biologiska s, Systematisk b, Teknisk-naturvetenskapliga v (2010) Geographic and phylogenetic patterns in Silene section Melandrium (Caryophyllaceae) as inferred from chloroplast and nuclear DNA sequences. Molecular phylogenetics and evolution 57:978-991

Ren F, Lu B-R, Li S, Huang J, Zhu Y (2003) A comparative study of genetic relationships among the AA-genome Oryza species using RAPD and SSR markers. TAG Theoretical and applied genetics Theoretische und angewandte Genetik 108:113-120

Rieseberg LH, Soltis DE (1991) Phylogenetic consequences of cytoplasmic gene flow in plants. Evolutionary Trends in Plants 5:65-84

Ronquist F, Huelsenbeck JP (2003) MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics (Oxford, England) 19:1572-1574

Small RL, Cronn RC, Wendel JF (2004) Use of nuclear genes for phylogeny reconstruction in plants. Australian Systematic Botany 17:145-170

Sotowa M, Sato YI, Sato T, Crayn D, Simon B, Waters DLE, Henry RJ, Ishikawa R, Ootsuka K, Kobayashi Y, Hao Y, Tanaka K, Ichitani K, Flowers JM, Purugganan MD, Nakamura I (2013) Molecular relationships between Australian annual wild rice, Oryza meridionalis, and two related perennial forms. Rice 6

Swofford DL (2003) {PAUP*. Phylogenetic Analysis Using Parsimony (*and Other Methods). Version 4.}. Sinauer Associates

Tamura K, Stecher G, Peterson D, Filipski A, Kumar S (2013) MEGA6: Molecular Evolutionary Genetics Analysis version 6.0. Molecular biology and evolution 30:2725-2729

Tank DC, Olmstead RG (2008) From annuals to perennials: phylogeny of subtribe Castillejinae (Orobanchaceae). American Journal of Botany [HW Wilson - GS] 95:608

Vaughan D (1994) The wild relatives of rice: A genetic resources handbook. IRRI, Phillipines

Vaughan DA, Ge S, Kaga A, Tomooka N (2008) Phylogeny and Biogeography of the Genus Oryza. In: Hirano H-Y, Hirai A, Sano Y, Sasaki T (eds) Rice Biology in the Genomics Era. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 219-234

100

Vaughan DA, Kadowaki K-i, Kaga A, Tomooka N (2005) On the Phylogeny and Biogeography of the Genus Oryza. Breeding Science 55:113-122

Wahlberg N, Wheat CW (2008) Genomic outposts serve the phylogenomic pioneers: designing novel nuclear markers for genomic DNA extractions of lepidoptera. Systematic Biology 57:231-242

Wambugu P, Furtado A, Waters D, Nyamongo D, Henry R (2013) Conservation and utilization of African Oryza genetic resources. Rice 6:29-29

Wang MH, Yu Y, Haberer G, Marri PR, Fan CZ, Goicoechea JL, Zuccolo A, Song X, Kudrna D, Ammiraju JSS, Cossu RM, Maldonado C, Chen J, Lee S, Sisneros N, de Baynast K, Golser W, Wissotski M, Kim W, Sanchez P, Ndjiondjop MN, Sanni K, Long MY, Carney J, Panaud O, Wicker T, Machado CA, Chen MS, Mayer KFX, Rounsley S, Wing RA (2014) The genome sequence of African rice (Oryza glaberrima) and evidence for independent domestication. Nature genetics 46:982-988

Waters DLE, Nock CJ, Ishikawa R, Rice N, Henry RJ (2012) Chloroplast genome sequence confirms distinctness of Australian and Asian wild rice. Ecology and evolution 2:211-217

Wolfe KH, Gouy M, Yang YW, Sharp PM, Li WH (1989) Date of the monocot-dicot divergence estimated from chloroplast DNA sequence data. Proceedings of the National Academy of Sciences 86:6201-6205

Wood AE (1993) The History of the Problem. In: George W, Lavocat R (eds) The Africa-South America Connection. Oxford university press, New York, pp 1-7

Wysocki WP, Clark LG, Attigala L, Ruiz-Sanchez E, Duvall MR (2015) Evolution of the bamboos (Bambusoideae; Poaceae): a full plastome phylogenomic analysis. BMC evolutionary biology 15:50

Yang JB, Tang M, Li HT, Zhang ZR, Li DZ (2013) Complete chloroplast genome of the genus Cymbidium: lights into the species identification, phylogenetic implications and population genetic analyses. BMC evolutionary biology 13:84- 84

Zhu Q, Ge S (2005) Phylogenetic Relationships among A-Genome Species of the Genus Oryza Revealed by Intron Sequences of Four Nuclear Genes. New Phytologist 167:249-265

Zhu T, Xu P-Z, Liu J-P, Peng S, Mo X-C, Gao L-Z (2014) Phylogenetic relationships and genome divergence among the AA- genome species of the genus Oryza

101

as revealed by 53 nuclear genes and 16 intergenic regions. Molecular phylogenetics and evolution 70:348-361

Zou XH, Zhang FM, Zhang JG, Zang LL, Tang L, Wang J, Sang T, Ge S (2008) Analysis of 142 genes resolves the rapid diversification of the rice genus. Genome biology 9:R49-R49

102

Chapter 4

Characterization of apparent gene duplication in a gene encoding a pullulanase-type debranching enzyme in African rice (Oryza glaberrima)

Abstract

Analysis of key starch biosynthesis genes in the genomes of various cultivated and wild Oryza species indicates that pullulanase has undergone gene duplication in African rice. A hypothesis based study aimed at characterizing this duplication was formulated. This study specifically aimed at functionally linking it to the unique starch physico-chemical properties in African rice among them high amylose. It was hypothesized that this gene duplication may have led to increased pullulanase activity which may have subsequently led to unique starch structure and properties due to protein dosage effects. Results show that the 2 genes have several functionally conserved regulatory motifs indicating that they were both functional and had putative roles in both starch synthesis and hydrolysis. The two paralogs show high homology at both protein and nucleotide level, with only few nucleotide differences in the intronic sequences, suggesting that this duplication may have been a relatively recent event. They share the alpha amylase catalytic domain in addition to which the duplicate has an extra domain, namely the Early set domain. There were significant differences in pullulanase activity between the two cultivated species (O. glaberrima and O. sativa) and O. barthii. However, pullulanase activity between O. glaberrima and O. sativa was not significantly different. The lack of elevated pullulanase activity as had been hypothesized may be due to reduction in gene expression in one or both copies. It is also possible that there was some unknown epistastic interaction between these two genes and other starch controlling loci such as GBSS.

Keywords: Pullulanase gene, starch, paralogs, gene duplication, Oryza glaberrima, amylose

103

4.1. Introduction

Despite African rice having potential for contributing genes to improve rice quality as reviewed by Wambugu et al. (2013), research on starch structure and properties as well as the genetic mechanisms controlling them have for a long time been neglected in this cultivated species. Recently however, there has been increased efforts in studying the biosynthesis as well as the physicochemical and functional properties of starch in African rice (Gayin et al. 2016a; Gayin et al. 2016b; Gayin et al. 2015; Wang et al. 2015). This renewed interest is perhaps driven by the need to gain increased understanding of the uniqueness of starch traits in African rice as compared to Asian rice.

Starch physicochemical properties are controlled by a variety of starch biosynthesis genes, among which are starch debranching enzymes (SDE) of which two types exist; pullulanase and isoamylase. Understanding the genetic effects of these starch biosynthesis genes and their functional effects on starch properties is important for rice improvement and genetics. However, despite the remarkable interest in starch traits in African rice, a significant gap exists in understanding the genetic mechanisms controlling these traits. Mutations or other changes in starch biosynthesis genes are likely to result in variations in starch content, structure and physicochemical properties and the effect of any such changes merits some study. Pullulanase encodes for a pullulanase-type debranching enzyme and has a dual function, in both starch anabolism and catabolism (Dinges et al. 2003; Li et al. 2009). The gene exists as a single copy in many cereals such as O. sativa, maize, sorghum, barley (Beatty et al. 1999; Francisco et al. 1998; Gilding et al. 2013). However, analysis of key starch biosynthesis genes in terms of chromosomal locations, copy number variation and other structural differences in the genomes of various cultivated and wild Oryza species indicates that pullulanase has undergone gene duplication in African rice. The analysis of genome sequences of several Oryza species shows that it is the only key starch biosynthesis gene to have undergone duplication in the Oryza genus thus making it particularly interesting. The evolution of some starch genes can be traced to whole genome duplications subsequently leading to sub-functionalization (Campbell et al. 2016; Paterson et al. 2004; Tetlow et al. 2004). It is therefore important to characterize pullulanase gene duplication as this might help gain insights on what

104 functionalization (Neo-, sub- or non-functionalization) model it has might have adopted.

Pullulanase gene has the capacity to hydrolyse the α-1, 6-glycosidic linkages of branched polysaccharides to form maltotriose (Hii et al. 2012). Though pullulanase has been found to have an effect on digestibility and has been used in preparation of high amylose starches, its functional role on starch physico-chemical properties is still not well understood (Campbell et al. 2016; Gilding et al. 2013; Hii et al. 2012). This study hypothesized that its duplication in African rice may have led to elevated pullulanase activity leading to protein dosage effects that may have subsequently had an impact on digestibility, amylose content as well as other starch structure and properties. This hypothesis is consistent with anecdotal reports from consumers of African rice which suggest that O. glaberrima has low digestibility and gives more satiety as compared to Asian rice, an attribute that makes African rice highly preferred in West Africa (Linares 2002; Teeken et al. 2012). If the hypothesis is validated and found to be true, then it suggests that pullulanase gene duplication confers potential health and economic benefits. Consumption of slowly digestible starch has been associated with potential health benefits and has been used in the control and dietary management of diabetes and obesity. In this study, the effects of pullulanase gene duplication on pullulanase activity as well as its effect on starch functional properties were investigated. To validate the hypothesis, the starch physicochemical properties in O. glaberrima, O. sativa and O. barthii; O. glaberrima’s putative progenitor were studied. The study identified conserved functional motifs as well as conserved protein domains in the promoter sequences of the two paralogs. These motifs have potential roles in starch anabolism and catabolism indicating the two genes might be functionally responsible for the two activities. However, the results obtained did not support the study hypothesis and suggest that pullulanase gene duplication may not be the molecular and genetic basis of amylose content in African rice.

105

4.2. Results

4.2.1. Sequence analysis

Sequence analysis show that the 2 genes show significant sequence divergence. The original gene has a total of 26 exons and 25 introns while the duplicate copy which is truncated has 18 exons and 17 exons (Table 16 and Figure 17). In both genes, the exon/intron boundaries follow the GT/AG rule of donor and acceptor sites of RNA splicing. Generally, the original gene has longer exons and introns with the size of exons ranging from 51bp to 197bp while the introns range from 71bp to 2628bp. On the other hand, the size of exons in the duplicate ranges from 51bp to 141bp with that of the introns ranging from 101 to 1429 The two genes encode for a protein with very high homology (Figure 18), though of different lengths (Table 16).

Table 16: Sequence statistics of the 2 pullulanase paralogs

Gene Gene length Transcript Number of Number of Protein length Strand (bp) length (bp) exons introns (aa) PUL Chr4 10298 1821 19 18 606 Forward PUL Chr6 13026 2676 26 25 891 Reverse PUL Chr4- Pullulanase gene on chromosome 4; PUL Chr6 – pullulanase gene on chromosome 6.

At both protein (Fig. 18) and the nucleotide level, the two paralogs show high homology, with only a few nucleotide differences in the intronic sequences (Table 17). This high sequence conservation indicates that this duplication may have been a relatively recent event.

106

Figure 17: Schematic representation of duplicated pullulanase gene. Upper figure: Pullulanase gene on chromosome 4; Lower figure: Pullulanase gene on chromosome 6. The arrows show the corresponding exons between the two paralogs. Schematic representation of the genes was obtained from EnsemblPlants.

Table 17: Nucleotide polymorphisms between the two pullulanase gene paralogs

Position Nucleotide Original Duplicate Original Duplicate 2492 T - 4132 2602 T A 4133 2603 A T 2609 - T 6255 A - 4747 - A 8002 T - 11054 - A All nucleotide polymorphisms were located in the non-coding regions

107

Figure 18: Protein Sequence alignment of the original and duplicate pullulanase gene

4.2.2. Putative functional cis-regulatory elements

Though the two paralogs have conserved putative cis functional motifs that indicate possible roles in both starch synthesis and hydrolysis, the duplicate gene on chromosome 6 seems to have more motifs associated with starch anabolism than the original copy on chromosome 4 (Table 18). Specifically, the duplicate copy has the AACA motif which together with GCN motif have been found to be involved in endosperm specific gene expression (Wu et al. 2000; Yasuyuki et al. 2001). The AACA motif is also found in GBSS1, a key starch synthesis gene, thus suggesting that the duplicate copy of the pullulanase gene may also be involved in starch metabolism. Though both genes had several TATA boxes, no corresponding TATA motif was found in the PLACE database.

108

Table 18: Putative cis regulatory element found in original pullulanase gene and its duplicate

Original gene Duplicate gene (Chr6) (Chr4) General putative Motif name Sequenc Number Number motif function e of Position of motifs Position motifs "amylase box" TAACAR 2 578, 153 - - Starch catabolism A AACA AACAAA - - 1 292 Starch anabolism C (CA)n element CNAACA 1 10 1 439 Starch anabolism C AAAG AAAG 4 272,219, 7 264, 416, 7, Starch anabolism 268,425 224, 465, 553, 560 Poly A signal AATAAA 1 144 1 80 Starch catabolism Pyrimidine box 1 TTTTTTC 1 543 Starch catabolism C Pyrimidine box 2 TAACAA 2 578,153 Starch catabolism A “WB Box” TTTGAC - - 1 153 Starch catabolism Y TGACT TGACT - - 1 155 Starch catabolism E2F consensus WTTSSC - - 1 91 Starch catabolism/ sequence SS anabolism

SP8b TACTAT 2 51,200 - - Starch catabolism T

AACA Motif – involved in endosperm specific gene expression. It has been found to be closely related to the GCN4 Motif in endosperm specific gene expression in rice (Wu et al. 2000)

CNAACAC- conserved in many storage protein genes (Stålberg et al. 1996) and is required for embryo and endosperm specific expression during seed development (Ellerström et al. 1996)

AAAG - This motif is the recognition core of Dof proteins which are transcription factors unique to plants. Dof play regulatory role in multiple gene expression in the carbon metabolism pathway (Yanagisawa 2000; Yanagisawa and Schmidt 1999)

AATAAA – found in the α-amylase gene expressed in rice germinating seeds (O'Neill et al. 1990)

TTTTTTCC – It is a pyrimidine box found in EPB, a cysteine proteinase responsible for the degradation of seed endosperm storage proteins in barley (Hordeum vulgare) (Cercós et al. 1999)

TTTGACY – found in alpha-Amy2 genes in wheat, barley and wild oat (Rushton et al. 1995) as well as in genes coding for sporamin and beta-amylase in sweetpotato (Ishiguro and Nakamura 1994)

TGACT – found in SUSIBA2, a novel WRKY Transcription Factor expressed in barley endosperm where it is involved in carbohydrate catabolism (Vandepoele et al. 2005)

WTTSSCSS - involved in theE2F pathway that is important for cell regulation and DNA replication (Vandepoele et al. 2005)

TAACARA - Conserved sequence found in 5'-upstream region of alpha-amylase gene of rice, wheat, barley (Huang et al. 1990)

TAACAAA – found in rice alpha-amylase gene, RAmy1 A, where it has been found to be involved sugar repression (Morita et al. 1998)

TACTATT - genes coding for sporamin and beta-amylase of tuberous roots (Ishiguro and Nakamura 1992, 1994)

109

4.2.3. Conserved protein domain The proteins from the 2 paralogous genes were searched against the conserved protein domain database. The original pullulanase gene has 1 main conserved domain namely the Alpha amylase catalytic domain while the duplicate gene has 2 main conserved domains namely the Alpha amylase catalytic domain and the Early set domain. Other conserved multi domains that are present in both paralogs include alpha-1, 6 glycosidase, glycogen debranching enzyme GlgX, 4-alpha- glucanotransferase/glycogen debranching enzyme and 1, 4-alpha-glucan branching enzyme.

4.2.4. Pullulanase activity

There were significant differences in pullulanase activity between the two cultivated species (O. glaberrima and O. sativa) and O. barthii. However, pullulanase activity between O. glaberrima and O. sativa was not significantly different (Table 19).

Table 19: Pullulanase activity in wild and cultivated Oryza species

Species Pullulanase activity SD O. glaberrima 22.592a 3.47 O. sativa 23.464a 2.44 O. barthii 17.36b 4.39 Means followed by same letter are not statistically significant (P≤0.05)

4.3. Discussion

4.3.1. Gene structure

Gene duplication is an important mechanism through which organisms acquire new genes and functional novelty (Ohno 1970). Though gene duplication has been studied extensively, the full extent of duplication in the genomes of various species is only starting to be realized, thanks to the increased availability of full genomes sequences. This study has characterized duplication of a pullulanase-type starch debranching enzyme in O. glaberrima. The progenitor gene which is located on chromosome 4 has 19 exons while the duplicate which is found on chromosome 6 has 26 exons (Table 16). Just like in the two paralogs, analysis of pullulanase gene structure in other closely related cereal species has revealed little conservation in the number of exons and

110 introns. In Oryza sativa, for example, the pullulanase gene has 26 exons and 25 introns (Francisco et al. 1998), barley has 27 exons and 26 introns while maize has 17 exons (Kristensen et al. 1999). This structural gene divergence in the 2 paralogs caused by apparent exon and intron loss after duplication is not unique to African rice as it has previously been reported in other species such as Arabidopsis thaliana and Oryza sativa (Xu et al. 2012). Like gene duplication, the loss/gain in introns and exons is important in evolution. Protein and gene expression divergence are known to increase as duplication ages (Hanada et al. 2009) and due to the sequence similarity between the two structurally diverged paralogous genes (Fig. 18), it seems that pullulanase gene duplication was a recent event. The sequence similarity seems to suggest that the two paralogs are still functional. Analysis of protein structure and conserved regulatory elements between the 2 genes may help elucidate the functional effect of the gene structure divergence.

4.3.2. Putative functional cis-regulatory elements

Conserved functional regulatory motifs in gene promoter regions have previously been used to establish whether duplicated genes are still functional, study their expression patterns and to deduce their putative roles (Li et al. 2009; Zhang et al. 2004). This study searched the promoter sequences of the two paralogs against the Plant Cis- acting Regulatory DNA Elements (Sileshi et al.) database which includes regulatory motifs whose functions have been empirically verified (Higo et al. 1999). Results show that the 2 genes have several functionally conserved regulatory motifs indicating that they were both functional and had putative roles in both starch synthesis and hydrolysis (Table 18). The presence of more conserved motifs associated with starch metabolism may seem to suggest that the duplicate copy has more starch metabolism functions that the original gene. In total, out of the 11 regulatory motifs that are putatively associated with starch metabolism, only 3 are shared between the two genes thus showing a bit of regulatory motif divergence. The motifs exhibit wide variance in the number of copies ranging from 1 to 7. Though both genes had several TATA boxes, no functionally conserved TATA motif related to starch anabolism or catabolism were found in the PLACE database. This perhaps suggests that the TATA boxes could be non-functional.

111

As indicated, the two paralogs have conserved putative cis functional motifs that indicate possible roles in both starch biosynthesis and starch hydrolysis. However, the duplicate gene on chromosome 6 seems to have more motifs associated with starch biosynthesis than the original copy on chromosome 4. Specifically, the duplicate copy has the AACA motif which together with GCN motif have been found to be involved in endosperm specific gene expression (Wu et al. 2000; Yasuyuki et al. 2001). The AACA motif is also found in GBSS1, a key starch synthesis gene, thus further suggesting that the duplicate copy of the pullulanase gene may also be involved in starch metabolism.

4.3.3. Conserved protein domains

A search on the protein conserved domain indicated that the two paralogs have a multi-domain architecture. Both genes share the main Alpha amylase catalytic domain while the duplicate gene has an extra main conserved domain namely the Early set domain. While the presence of an extra domain may seem to suggest that the exon- intron loss may have led to functionally distinct proteins and hence acquisition of new biochemical functions (Xu et al. 2012), a further analysis of the putative roles of the conserved protein domains as well as the conserved regulatory sequence motifs in the promoter regions does not seem to support such an acquisition. In addition to these domains, both paralogs have another domain mainly found in bacteria and eukaryotes but which has not been characterized functionally. The Alpha-amylase family comprises the largest family of glycoside hydrolases (GH), with the majority of enzymes acting on starch, glycogen, and related oligo- and polysaccharides. The E or "early" set domains are associated with the catalytic domain of pullulanase at either the N-terminal or C-terminal end, and sometimes at both ends. The presence of the Alpha amylase catalytic domain and the Early set domain points to the possible role of these two genes in starch breakdown, a role widely associated with pullulanase (Hii et al. 2012; Li et al. 2009). The 1, 4-alpha-glucan branching enzyme which is among the conserved domains present in both paralogs is involved in carbohydrate transport and metabolism. Just like in the conserved regulatory motifs, the conserved protein domains indicate that the two genes have putative roles in starch synthesis and hydrolysis.

112

4.3.4. Functional effect of pullulanase gene duplication

While many duplicated genes usually become pseudo genes and thus functionless (Zhang 2003), the presence of functionally conserved motifs and protein domains in the 2 pullulanase paralogs is an indicator that the two genes may still be functionally active. Since the conserved regulatory motifs and domains have potentially closely related roles, the observed motif and protein domain divergence may suggest that the two genes are operating under the sub-functionalization model (Force et al. 1999; Kondrashov et al. 2002). Under this model, each of the two genes takes a subset of the functions of the original gene but the level of expression between the genes may differ. Based on the additional motifs and domains, the duplicate gene in this case seems to have enhanced activity and may hence be performing better than the original gene which could eventually lead to functional specialization (Hughes 1999). Cis regulatory motif divergence has previously been reported in other species though previous studies have reported a weak correlation between motif conservation and expression profiles (Zhang et al. 2004). Consequently, highly divergent paralogs in the regulatory motifs may have similar expression patterns. Further research is therefore required on the gene expression patterns between these two paralogs in order to relate it to motif and domain divergence as well as on starch properties. Such a study is however likely to face challenges of selectively amplifying transcripts of these two paralogs as they seem to code for the same protein. Though there is no data to clearly indicate the specific subset of functions that each gene has taken, analysis of sequence motifs and domains may give a general picture of their potential roles.

Analysis of the cis regulatory motifs found in these 2 genes indicates that they are also conserved in the amylase family of genes in several species namely rice, barley, wheat and sweet potato thus suggesting they have some functional significance. With the main function of the amylase genes being starch breakdown (Sogaard et al. 1993), it is an indicator that the two paralogs may have a role in starch catabolism. This observation is further supported by the presence of conserved Alpha amylase catalytic and the Early set protein domains as well as other multi-domains in the two paralogs, all of which are involved in starch breakdown (Marchler-Bauer et al. 2011; Marchler-Bauer et al. 2009). On the other hand, while most of the conserved domains are involved in starch degradation, the conserved 1, 4-alpha-glucan

113 branching enzyme present in both genes, is involved in carbohydrate transport and metabolism, thus indicating a possible role in starch anabolism. It therefore seems that these two genes are functionally responsible for both starch synthesis and starch degradation, an observation that is consistent with previous reports of the pullulanase gene having dual function (Li et al. 2009).

4.3.5. Effect of pullulanase gene duplication on Amylose content

Mutations that occur in starch debranching enzymes have an impact on the structure and properties of starch by altering the number and spatial distribution of branches in amylopectin (Beatty et al. 1999). Amylose has been reported to be formed through cleavage of pre-existing branched amylopectin molecules (Marion van de et al. 1998). This study hypothesized that elevated pullulanase activity due to gene duplication may have led to increased hydrolysis of branched molecules through α-1, 6-bond cleavage thereby resulting in more linear molecules. The glucan chains released by pullulanase activity on amylopectin can be elongated by GBSS1 thus resulting in amylose molecules (Ahuja et al. 2013). Though it was hypothesized that this could explain the presence of significantly higher amylose content in O. glaberrima than O. sativa (Wang et al. 2015), results obtained do not support the hypothesis. Despite the two cultivated species having significantly different amylose content, pullulanase activity between them was not significantly different (Table 19). This study speculates that the lack of elevated pullulanase activity as would have been expected, may be attributed to a reduction in gene expression levels in the daughter genes; a type of sub- functionalization mechanism that is important for protein dosage rebalance and has been reported in other species (Qian et al. 2010). With motif divergence in cis regulatory factors, reduction in gene expression to the level of the progenitor gene, seems a plausible explanation.

It is also possible there was epistasis between the two pullulanase loci and GBSS or some other starch biosynthesis gene which could have altered their activity. Epistatic effects between various loci controlling amylose have previously been identified (Fan et al. 2005). As indicated earlier, it is noteworthy that the duplicate copy shares an AACA motif with GBSS1 and a possible interaction between these two loci may have led to increased amylose content. The AACA motif confers endosperm

114 specific gene expression (Takaiwa et al. 1996; Washida et al. 1999; Zheng et al. 1993). Further research is required to gain insights on any possible epistatic effects between this gene duplication and other loci involved in starch biosynthesis. High amylose content is associated with slow digestibility (Cai et al. 2015; Chung et al. 2011). In addition to the potential health benefits associated with slowly digestible starch, consumption of such foods translates to economic benefits especially for the poor communities in Africa as they are likely to eat less due to the associated satiety effects and thus have some food to spare for another day. This has a positive impact on the state of food security in the region and as noted by Gilding et al. (2013), understanding the role and potential of the inherent food value and calorific value of staple crops in addressing food insecurity has largely been ignored.

4.3.6. Genome assembly error

In addition to the possibility of some unknown epistatic and other genetic mechanisms as indicated above, there were also unconfirmed reports that the pullulanase gene duplication reported here may not have been real but as a result of genome assembly error. Preliminary but inconclusive analysis of the BAC-end sequences indicated that the scaffold containing pullulanase gene may have been placed incorrectly. Several genome assembly errors in this recently released reference have already been identified (Pariasca-Tanaka et al. 2014; Wang et al. 2014). This calls for concerted efforts by the international rice scientific community to address these errors so as to improve the quality and integrity of this genome sequence. It was not possible to unequivocally confirm the presence or absence of pullulanase gene duplication within the framework and timeframe of this PhD study.

4.4. Conclusion

Results of this study suggest that pullulanase gene duplication may not be underlying genetic mechanism for high amylose content in African rice. High amylose content might therefore be attributed to some yet to be identified genetic factor. Further analysis to unravel these underlying genetic mechanisms is important and is presented in chapter 5 and 6 of this thesis.

115

4.5. Materials and methods

4.5.1. Gene sequences

The gene sequences for the various Oryza species included in this study were extracted from whole genome sequences obtained from the International Oryza Mapping and Alignment Project (IOMAP) through the CyVerse web portal (http://www.cyverse.org; formerly iPlant Collaborative). The O. glaberrima genome was sequenced to 30x coverage following a hybrid approach consisting of BAC pooling and whole genome shot gun approach using the Roche GSFLX/454 Titanium sequencing technology. The genome sequences are also available in EnsemblPlants (http://plants.ensembl.org/index.html) and the National Centre for Biotechnology Information (NCBI, www.ncbi.nlm.nih.gov). In order to extract the gene coordinates from the genome, the whole genome was aligned against the O. sativa reference gene sequences (IRGSP 2005). The protein sequences for the two paralogs were obtained from EnsemblPlants.

4.5.2. Search for cis regulatory elements and conserved protein domains

This study assumed that gene co-expression arises from transcriptional co-regulation and was based on the expression-motif-conservation hypothesis that co-expressed genes have a high probability of sharing regulatory motifs (Roth et al. 1998; Thijs et al. 2002). Therefore, in order to search for known functionally conserved sequence motifs, promoter sequences of the two paralogs were searched against the Plant Cis- acting Regulatory DNA Elements database (www.dna.affrc.go.jp/PLACE/) (Higo et al. 1999). Only the sequence motifs whose function was related to starch metabolism were selected for further analysis. In the case of protein domains, Reverse Position- Specific BLAST (RPS-BLAST) was used to search the protein sequences of the two paralogs against the Conserved Domains Database (CDD, http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi) in an attempt at identifying conserved protein domains (Marchler-Bauer et al. 2011; Marchler-Bauer et al. 2009).

4.5.3. Assay of pullulanase activity

Pullulanase enzyme activity was determined for O. glaberrima, O. barthii and O. sativa by measuring the enzymatic release of reducing sugars from pullulan. The assay was

116 done using Red Pullulan (Megazyme Ltd, Ireland) as substrate following an optimized protocol that was originally published by McCleary (1992). The extraction/activation buffer consisted of 0.2M sodium acetate buffer mixed with 2% (w/v) sodium azuride and 0.35% (w/v) L-cysteine. Extraction of the sample was done using 5ml of extraction buffer per 0.25g of rice flour which had been milled using a mixer miller (Qiagen, USA) set at 25 oscillations/min for 1minute. The assay was conducted using 0.5ml of 2% (w/v) red pullulan, 1.0ml of sample extract and the reaction stopped by the addition of 1.5mL of 95% ethanol (v/v). In order to correct for non-enzymatic release of reducing sugars, a sample blank was prepared by adding ethanol to the substrate prior to the addition of the enzyme preparation thus ensuring the enzyme is inactivated. The absorbance of both the sample and blank were measured using a spectrophotometer (Shimadzu UV-1601) at a wavelength of 510nm. Pullulanase activity was determined by reference to a standard curve which was used to convert absorbance to milli units per assay. Calculations for pullulanase activity were done as follows:-

Units per ml of rice flour solution = milli-Units per assay x 5 x 1/1000 x dilution

Where:

Milli-Units per assay is determined by reference to the standard curve

5 = initial dilution volume i.e. 1mL per 5mL

1/1000 = conversion from milli-Units to units

4.6. References

Ahuja G, Jaiswal S, Chibbar RN (2013) Starch Biosynthesis in Relation to Resistant Starch. Resistant Starch. John Wiley and Sons Ltd, pp 1-22

Beatty MK, Rahman A, Cao H, Woodman W, Lee M, Myers AM, James MG (1999) Purification and Molecular Genetic Characterization of ZPU1, a Pullulanase- Type Starch-Debranching Enzyme from Maize1. Plant physiology 119:255-266

Cai JW, Man JM, Huang J, Liu QQ, Wei WX, Wei CX (2015) Relationship between structure and functional properties of normal rice starches with different amylose contents. CARBOHYDRATE POLYMERS 125:35-44

Campbell BC, Gilding EK, Mace ES, Tai S, Tao Y, Prentis PJ, Thomelin P, Jordan DR, Godwin ID (2016) Domestication and the storage starch biosynthesis pathway:

117

signatures of selection from a whole sorghum genome sequencing strategy. Plant biotechnology journal:n/a-n/a

Cercós M, Gómez‐Cadenas A, Ho THD (1999) Hormonal regulation of a cysteine proteinase gene, EPB‐1, in barley aleurone layers: cis‐ and trans‐acting elements involved in the co‐ordinated gene expression regulated by gibberellins and abscisic acid. The Plant Journal 19:107-118

Chung H-J, Liu Q, Lee L, Wei D (2011) Relationship between the structure, physicochemical properties and in vitro digestibility of rice starches with different amylose contents. Food Hydrocolloids 25:968-975

Dinges JR, Colleoni C, James MG, Myers AM (2003) Mutational analysis of the pullulanase-type debranching enzyme of maize indicates multiple functions in starch metabolism. The Plant cell 15:666-680

Ellerström M, Stålberg K, Ezcurra I, Rask L (1996) Functional dissection of a napin gene promoter: identification of promoter elements required for embryo and endosperm-specific transcription. Plant molecular biology 32:1019-1027

Fan CC, Yu XQ, Xing YZ, Xu CG, Luo LJ, Zhang Q (2005) The main effects, epistatic effects and environmental interactions of QTLs on the cooking and eating quality of rice in a doubled-haploid line population. Theoretical and Applied Genetics 110:1445-1452

Force A, Lynch M, Postlethwait J (1999) Preservation of duplicate genes by subfunctionalization. AMERICAN ZOOLOGIST 39:78A-78A

Francisco PB, Zhang Y, Park S-Y, Ogata N, Yamanouchi H, Nakamura Y (1998) Genomic DNA sequence of a rice gene coding for a pullulanase-type of starch debranching enzyme. Biochimica et Biophysica Acta (BBA)/Protein Structure and Molecular Enzymology 1387:469-477

Gayin J, Abdel-Aal E-SM, Manful J, Bertoft E (2016a) Unit and internal chain profile of African rice (Oryza glaberrima) amylopectin. Carbohydrate polymers 137:466- 472 Gayin J, Bertoft E, Manful J, Yada RY, Abdel‐Aal ESM (2016b) Molecular and thermal characterization of starches isolated from African rice (Oryza glaberrima). Starch ‐ Stärke 68:9-19

Gayin J, Chandi GK, Manful J, Seetharaman K (2015) Classification of Rice Based on Statistical Analysis of Pasting Properties and Apparent Amylose Content: The Case of Oryza glaberrima Accessions from Africa. Cereal Chemistry 92:22-28

118

Gilding EK, Frère CH, Cruickshank A, Rada AK, Prentis PJ, Mudge AM, Mace ES, Jordan DR, Godwin ID (2013) Allelic variation at a single gene increases food value in a drought-tolerant staple cereal. Nature communications 4:1483

Hanada K, Kuromori T, Myouga F, Toyoda T, Shinozaki K (2009) Increased expression and protein divergence in duplicate genes is associated with morphological diversification. PLoS genetics 5:e1000781

Higo K, Ugawa Y, Iwamoto M, Korenaga T (1999) Plant cis-acting regulatory DNA elements (PLACE) database: 1999. Nucleic acids research 27:297-300

Hii SL, Tan JS, Ling TC, Ariff AB (2012) Pullulanase: role in starch hydrolysis and potential industrial applications. Enzyme research 2012:921362

Huang N, Sutliff TD, Litts JC, Rodriguez RL (1990) Classification and characterization of the rice alpha-amylase multigene family. Plant molecular biology 14:655-668

Hughes AL (1999) Adaptive evolution of genes and genomes. Oxford University Press, New York

IRGSP (2005) The map-based sequence of the rice genome. Nature 436:793-800

Ishiguro S, Nakamura K (1992) The nuclear factor SP8BF binds to the 5'-upstream regions of three different genes coding for major proteins of sweet potato tuberous roots. Plant molecular biology 18:97-108

Ishiguro S, Nakamura K (1994) . Characterization of a cdna-encoding a novel dna- binding protein, spf1, that recognizes sp8 sequences in the 5' upstream regions of genes-coding for sporamin and beta-amylase from sweet-potato. Molecular & General Genetics 244:563-571

Kondrashov FA, Rogozin IB, Wolf YI, Koonin EV (2002) Selection in the evolution of gene duplications. Genome biology 3:0008-research0008

Kristensen M, Lok F, Planchot V, Svendsen I, Leah R, Svensson B (1999) Isolation and characterization of the gene encoding the starch debranching enzyme limit dextrinase from germinating barley. Biochimica et Biophysica Acta (BBA)/Protein Structure and Molecular Enzymology 1431:538-546

Li Q-F, Zhang G-Y, Dong Z-W, Yu H-X, Gu M-H, Sun SSM, Liu Q-Q (2009) Characterization of expression of the OsPUL gene encoding a pullulanase-type debranching enzyme during seed development and germination in rice. Plant physiology and biochemistry : PPB / Société française de physiologie végétale 47:351-358

119

Linares OF (2002) African rice (Oryza glaberrima): history and future potential. Proceedings of the National Academy of Sciences of the United States of America 99:16360-16365

Marchler-Bauer A, Gonzales NR, Gwadz M, Hurwitz DI, Jackson JD, Ke Z, Lanczycki CJ, Lu F, Marchler GH, Mullokandov M, Omelchenko MV, Lu S, Robertson CL, Song JS, Thanki N, Yamashita RA, Zhang D, Zhang N, Zheng C, Bryant SH, Anderson JB, Chitsaz F, Derbyshire MK, DeWeese-Scott C, Fong JH, Geer LY, Geer RC (2011) CDD: a Conserved Domain Database for the functional annotation of proteins. Nucleic acids research 39:D225-D229

Marchler-Bauer A, Gwadz M, He S, Hurwitz DI, Jackson JD, Ke Z, Lanczycki CJ, Liebert CA, Liu C, Lu F, Lu S, Anderson JB, Marchler GH, Mullokandov M, Song JS, Tasneem A, Thanki N, Yamashita RA, Zhang D, Zhang N, Bryant SH, Chitsaz F, Derbyshire MK, DeWeese-Scott C, Fong JH, Geer LY, Geer RC, Gonzales NR (2009) CDD: specific functional annotation with the Conserved Domain Database. Nucleic acids research 37:D205-D210

Marion van de W, Hulst C, Vincken J-P, Buleon A, Visser R, Ball S (1998) Amylose Is Synthesized in Vitro by Extension of and Cleavage from Amylopectin. Journal of Biological Chemistry 273:22232-22240

McCleary BV (1992) Measurement of the content of limit-dextrinase in cereal flours. Carbohydrate Research 227:257-268

Morita A, Umemura T, Kuroyanagi M, Futsuhara Y, Perata P, Yamaguchi J (1998) Functional dissection of a sugar-repressed alpha-amylase gene (RAmy1 A) promoter in rice embryos. FEBS letters 423:81-85

Nei M, Kumar S (2000) Molecular evolution and phylogenetics. Oxford University Press, New York

O'Neill SD, Kumagai MH, Majumdar A, Huang N, Sutliff TD, Rodriguez RL (1990) The -amylase genes in Oryza sativa: Characterization of cDNA clones and mRNA expression during seed germination. Mol Gen Genet 221:235-244 Ohno S (1970) Evolution by gene duplication. Springer-Verlag, Berlin

Pariasca-Tanaka J, Chin JH, Dramé KN, Dalid C, Heuer S, Wissuwa M (2014) A novel allele of the P-starvation tolerance gene OsPSTOL1 from African rice (Oryza glaberrima Steud) and its distribution in the genus Oryza. Theoretical and Applied Genetics 127:1387-1398

120

Paterson AH, Bowers JE, Chapman BA, Phillips RL (2004) Ancient Polyploidization Predating Divergence of the Cereals, and Its Consequences for Comparative Genomics. Proceedings of the National Academy of Sciences of the United States of America 101:9903-9908

Qian W, Liao B-Y, Chang AY-F, Zhang J (2010) Maintenance of duplicate genes and their functional redundancy by reduced expression. Trends in Genetics 26:425- 430

Roth FP, Hughes JD, Estep PW, Church GM (1998) Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation. Nature biotechnology 16:939-945

Rushton PJ, Macdonald H, Huttly AK, Lazarus CM, Hooley R (1995) Members of a new family of DNA-binding proteins bind to a conserved cis-element in the promoters of alpha-Amy2 genes. Plant molecular biology 29:691-702

Sileshi G, Akinnifesi FK, Ajayi OC, Place F (2008) Meta-analysis of maize yield response to woody and herbaceous legumes in sub-Saharan Africa. Plant and Soil 307:1-19

Sogaard M, Abe J, Martineauclaire MF, Svensson B (1993) Alpha amylases- Structure and function. Carbohydrate Polymers 21:137-146

Stålberg K, Ellerstöm M, Ezcurra I, Ablov S, Rask L (1996) Disruption of an overlapping E-box/ABRE motif abolished high transcription of the napA storage-protein promoter in transgenic Brassica napus seeds. Planta 199:515- 519

Takaiwa F, Yamanouchi U, Yoshihara T, Washida H, Tanabe F, Kato A, Yamada K (1996) Characterization of common cis-regulatory elements responsible for the endosperm-specific expression of members of the rice glutelin multigene family. Plant molecular biology 30:1207-1221

Tamura K, Peterson D, Peterson N, Stecher G, Nei M, Kumar S (2011) MEGA5: Molecular Evolutionary Genetics Analysis Using Maximum Likelihood, Evolutionary Distance, and Maximum Parsimony Methods. Molecular Biology and Evolution 28:2731-2739

Teeken B, Nuijten E, Temudo MP, Okry F, Mokuwa A, Struik PC, Richards P (2012) Maintaining or Abandoning African Rice: Lessons for Understanding Processes of Seed Innovation. Human Ecology 40:879-892

121

Tetlow IJ, Morell MK, Emes MJ (2004) Recent developments in understanding the regulation of starch metabolism in higher plants. Journal of experimental botany 55:2131-2145

Thijs G, Marchal K, Lescot M, Rombauts S, De Moor B, Rouzé P, Moreau Y (2002) A Gibbs sampling method to detect overrepresented motifs in the upstream regions of coexpressed genes. Journal of computational biology : a journal of computational molecular cell biology 9:447-464

Vandepoele K, Vlieghe K, Gruissem W, Van de Peer Y, Inze D, De Veylder L (2005) Genome-Wide Identification of Potential Plant E2F Target Genes. Plant physiology 139:316-328

Wambugu P, Furtado A, Waters D, Nyamongo D, Henry R (2013) Conservation and utilization of African Oryza genetic resources. Rice 6:29-29

Wang K, Wambugu PW, Zhang B, Wu AC, Henry RJ, Gilbert RG (2015) The biosynthesis, structure and gelatinization properties of starches from wild and cultivated African rice species (Oryza barthii and Oryza glaberrima). Carbohydrate Polymers 129:92-100

Wang MH, Yu Y, Haberer G, Marri PR, Fan CZ, Goicoechea JL, Zuccolo A, Song X, Kudrna D, Ammiraju JSS, Cossu RM, Maldonado C, Chen J, Lee S, Sisneros N, de Baynast K, Golser W, Wissotski M, Kim W, Sanchez P, Ndjiondjop MN, Sanni K, Long MY, Carney J, Panaud O, Wicker T, Machado CA, Chen MS, Mayer KFX, Rounsley S, Wing RA (2014) The genome sequence of African rice (Oryza glaberrima) and evidence for independent domestication. Nature genetics 46:982-988

Washida H, Wu C-Y, Suzuki A, Yamanouchi U, Akihama T, Harada K, Takaiwa F (1999) Identification of cis-regulatory elements required for endosperm expression of the rice storage protein glutelin gene GluB-1. Plant molecular biology 40:1-12

Wu CY, Washida H, Onodera Y, Harada K, Takaiwa F (2000) Quantitative nature of the Prolamin‐box, ACGT and AACA motifs in a rice glutelin gene promoter: minimal cis‐element requirements for endosperm‐specific gene expression. The Plant Journal 23:415-421

Xu G, Guo C, Shan H, Kong H (2012) Divergence of duplicate genes in exon-intron structure. Proceedings of the National Academy of Sciences of the United States of America 109:1187-1192

122

Yanagisawa S (2000) Dof1 and Dof2 transcription factors are associated with expression of multiple genes involved in carbon metabolism in maize. The Plant Journal 21:281-288

Yanagisawa S, Schmidt RJ (1999) Diversity and similarity among recognition sequences of Dof transcription factors. The Plant Journal 17:209-214

Yasuyuki O, Akihiro S, Chuan-Yin W, Haruhiko W, Fumio T (2001) A Rice Functional Transcriptional Activator, RISBZ1, Responsible for Endosperm-specific Expression of Storage Protein Genes through GCN4 Motif. Journal of Biological Chemistry 276:14139

Zhang J (2003) Evolution by gene duplication: an update. Trends in Ecology & Evolution 18:292-298

Zhang Z, Gu J, Gu X (2004) How much expression divergence after yeast gene duplication could be explained by regulatory motif evolution? Trends in genetics : TIG 20:403-407

Zheng Z, Kawagoe Y, Xiao S, Li Z, Okita T, Hau TL, Lin A, Murai N (1993) 5′ distal and proximal cis-acting regulator elements are required for developmental control of a rice seed storage protein glutelin gene. Plant Journal 4:357-366

123

Chapter 5 Sequencing of bulks of segregants allows dissection of genetic control of amylose content in rice

Abstract

Amylose content is a key quality trait in rice. A cross between Oryza glaberrima (African rice) and Oryza sativa (Asian rice) segregating for amylose content was analyzed by sequencing bulks of individuals with high and low amylose content. SNP associated with the granule bound starch synthase (GBSS1) locus on chromosome 6 were polymorphic between the bulks. In particular a G/A SNP that would result in an Asp to Asn mutation was identified. This amino acid substitution may be responsible for differences in GBSS activity as it is adjacent to a disulphide linkage conserved in all grass GBSS proteins. Other polymorphisms in genomic regions closely surrounding this variation may be the result of linkage drag. In addition to the variant in the starch biosynthesis gene, SNP on chromosomes 1 and 11 linked to amylose content were also identified. SNP were found in the genes encoding the NAC and CCAAT-HAP5 transcription factors that have previously been linked to starch biosynthesis. Analysis of natural variation in GBSS1 from a diverse set of O. glaberrima accessions revealed wider variation, with 5 novel non synonymous SNPs which could be mined from GBSS1 being identified. This study has demonstrated that the approach of sequencing bulks was able to identify genes on different chromosomes associated with this complex trait.

Keywords: Amylose content, bulk segregant analysis, genetic markers, genetic linkage, allele frequency, whole genome sequencing

5.1. Introduction

Oryza glaberrima, commonly referred to as African rice, is one of the two independently domesticated rice species, the other one being Oryza sativa, which is also commonly referred to as Asian rice. African rice is a primary food source and source of livelihoods for many households in West Africa, in addition to being a valuable source of genetic diversity for rice improvement globally. Its genetic potential

124 in terms of resistance/tolerance to biotic and abiotic stresses has been well documented and deployed in rice improvement (Wambugu et al. 2013). Recently, it has been reported to have some unique starch traits which could confer potential health benefits. Specifically, this cultivated species possesses higher AC and gelatinization temperature than Asian rice (Wang et al. 2015). AC is a major determinant of the cooking, eating and milling properties of rice. High amylose content has been associated with nutritional and health benefits by having a positive effect on various indices of gastro intestinal health. The high amylose content and gelatinization temperature are likely to result in decreased digestion rate thereby making African rice a potential natural source of slowly digestible starch (Wang et al. 2015; Appendix 1). With the health benefits of high amylose foods increasingly getting recognized, African rice seems to be a valuable genetic resource for breeding premium-value genotypes with high AC, preferably in high yielding and elite genetic backgrounds. Amylose content plays an important role in predicting quality in a breeding programme and it is therefore imperative that AC is accurately determined in order to ensure success of breeding programmes that target grain quality.

Many methods used in determination of AC are not only costly in terms of consumables and equipment but are also time consuming, slow and suffer from varying degrees of inaccuracies (Delwiche et al. 1995; Hu et al. 2015). The use of marker assisted selection (MAS) can help overcome these challenges. The potential of application of MAS in early phases of a breeding programme can help in speedy and cost effective delivery of improved varieties possessing superior grain quality traits. However, the lack of adequate information on the genetic basis of some starch properties has been a hindrance in selecting for them (Bao et al. 2006). While several functional nucleotide polymorphisms have been found to associate with AC in Asian rice, (Ayres et al. 1997; Bligh et al. 1995; Dobo et al. 2010; Larkin and Park 2003), no marker-trait associations have been identified in African rice. This lack of this information acts as a major hindrance to the application of marker assisted selection.

Information on the genetic control of AC is vital as it helps in devising strategies for manipulating it, either through traditional plant breeding or through other starch modification approaches. Previous studies on the genetic architecture of AC in African

125 rice have analyzed the various individual loci in isolation rather than conducting global genome analyses. This locus-specific approach may have led to the poor understanding of the genetic mechanism underlying this trait. This study therefore set out to gain increased knowledge on the genetic architecture of AC in African rice by conducting a genome wide analysis of any underlying genetic variants. Leveraging the current advances in genomics, this study used bulk segregant analysis (BSA) (Michelmore et al. 1991) coupled with whole genome re-sequencing to conduct genetic mapping of the backcross progenies of an interspecific cross between O. sativa and O. glaberrima. This next generation sequencing (NGS) aided BSA, commonly referred as mapping by sequencing (Hartwig et al. 2012), provides a robust genetic mapping approach that has successfully been used in identifying causal mutations and candidate genes in various crops among them tomato (Illa-Berenguer et al. 2015), sorghum (Han et al. 2015), chick pea (Das et al. 2015), cucumber (Lu et al. 2014), rice (Takagi et al. 2013; Yang et al. 2013), sugar beet (Ries et al. 2016) and pigeon pea (Singh et al. 2015). Analysis of polymorphisms between the bulks identified SNPs associated with Granule Bound Starch Synthase 1 (GBSS1), with a G/A located in exon 12 potentially being linked with AC in African rice. Markers were also identified in transcription factors that have previously been linked with starch biosynthesis.

5.2. Results

5.2.1. Analysis of amylose content and identification of pool segregants

Phenotyping for AC was done on a total of 100 progenies of an interspecific cross between African rice and O. sativa. This phenotypic analysis resulted in a normal distribution with the AC ranging from 19.2% to 26.2% in the interspecific progenies. CG14, the African rice parent had AC of 25.7% while that of WAB 56-104, the O. sativa parent, was 23.2% thereby showing that there was transgressive segregation on both sides of the phenotypic distribution. This transgressive segregation was largely skewed towards the direction of the low amylose parent (Fig. 19), perhaps due to the numerous backcrosses conducted towards this parent. Based on the phenotypic analysis data, individual progenies to be used in constituting the two phenotypically distinct amylose bulks to be used in BSA were selected. The 2 bulks were selected by pooling individuals on the extreme ends of the phenotypic distribution, with each of the

126 two bulks having 10 individuals. The low amylose bulk had segregants with amylose ranging from 19.2% to 21.6% while those in the high amylose bulk ranged from 24.1% to 26.2%.

14 WAB 56-104

12

10

8

6 Frequency CG14 4

2

0

Amylose content

Figure 19: Frequency distribution of amylose content in 100 individuals and parental genotypes of a BC2F8 population of an interspecific cross between O. glaberrima and O. sativa (CG14 X WAB 56-104) 5.2.2. DNA sequencing and mapping

The pooled DNA together with that of the parents was subjected to whole genome sequencing. Sequencing the two phenotype based DNA pools generated a total of 131,008,138 and 145,429,179 paired end reads for the high and low amylose bulks respectively (Table 20). The mean sequencing depths were 53X and 59X for the high amylose bulk (HAB) and low amylose bulk (LAB) respectively. The trimmed reads were separately mapped to reference genome sequences of both African and Asian rice. A total of 112,021,886 and 119,224,875 reads were uniquely mapped to the O. sativa reference, yielding average nuclear genome coverages of about 85% and 81% for HAB and LAB respectively (Table 21).

127

Table 20: Sequencing Statistics

Sample Number of raw Sequencing Trimmed reads a reads depth Number Average size HAB 131,008,138 53X 130,975,209 114.9 LAB 145,429,179 59X 145,393,305 115.6 WAB 56-104 35,290,472 14X 35,281,191 114.3 CG14 31,816,416 13X 31,808,913 115.8 a Average read depth across the genome

Table 21: Statistics of reads uniquely mapped against O. sativa and O. glaberrima references

Sample O. sativa reference O. glaberrima reference Number % Average Number % Average coverage coverage HAB 112021886 85.53 35.77X 104532384 79.81 38.03X

LAB 119224875 81.98 38.39X 110919141 71.88 41.01X WAB 56-104 30624443 86.78 9.53X 28252141 80.08 10.30X CG14 29729099 93.44 9.82X 28756017 90.4 10.85X *Percentage of the genome covered by the short reads. The genome size of the two bulks was estimated to be equivalent to that of O. sativa (370Mb) since it is assumed that the successive backcrosses during mapping population development had recovered a substantial portion of the O. sativa genome

Variant calling resulted in a total of 280,875 and 333,299 SNPs for the high and low bulks respectively when mapped to the O. sativa reference. As expected, mapping to the O. glaberrima reference resulted in a significantly lower number of SNPs for the high amylose bulk, totalling 88,748 while the low bulk had a significantly higher number of SNPs, totalling 391,032 (Table 22). This significant difference in the number of variants between the 2 bulks when mapped to the 2 references is due to the fact that the high bulk included many individuals with a higher proportion of the O. glaberrima genome than the low bulk. On the other hand, the low bulk consisted of many individuals with a higher proportion of the O. sativa genome than the high bulk. Since the O. sativa reference sequence (IRGSP 2005) has higher quality annotations than African rice reference (Wang et al. 2014), further analysis was conducted on the variants called using the O. sativa Nipponbare reference.

128

Table 22: Number of SNPs called for the high and low amylose bulk and 2 parental genomes when mapped to O. sativa and O. glaberrima references

Sample O. sativa reference O. glaberrima reference HAB 280,875 88,748 LAB 333,299 391,032 WAB 56-104 594,358 1,802,859 CG14 2,329,714 4,145

5.2.3. Association analysis identifies putative amylose-linked genetic markers Genome wide analysis of polymorphisms between the 2 phenotypically distinct bulks led to the identification of a total of 114 genetic marker sites putatively linked with AC. In these marker sites, the alleles from the high amylose parent were significantly overrepresented in the high amylose bulk. Under the BSA approach, it would be expected that approximately 50% of the reads in loci that have no genetic effect on the trait of interest would be derived from each of the 2 parental genomes. On the other hand, allele frequency would differ noticeably in genomic regions controlling the trait, with such regions showing significant over-representation of reads from one of the parental genomes. In the identified marker positions, the alleles from the low amylose parental genome were significantly (P<0.05) over represented in the low amylose bulk as assessed by a chi square test. Statistically, these candidate SNPs therefore showed some association with AC.

These putative amylose-linked markers are located in 59 protein coding genes/transcription factors. Majority of the putative amylose linked markers are located in genes that lie near the terminal region of the short arm of chromosome 6 (Fig. 20 and Appendix 2) while the rest are located on chromosome 1 and 11 (Table 23). These candidate SNPs located on chromosome 6 cluster in a genomic region of approximately 2.1Mbp, covering chromosomal position 1653659bp to 3794599bp (Fig. 20). The waxy locus which encodes Granule Bound Starch Synthase 1 (GBSS1), the gene that is well known for regulating AC, is located in this genomic region.

Several SNPs that are associated with GBSS1 were among the markers that were polymorphic between the bulks. These include 2 SNPs that are located in the 5’ UTR and 2 non-synonymous SNPs located in the exonic regions (Table 23). A non- synonymous G/A SNP associated with GBSS1 that would result in an Asp to Asn

129 mutation was identified. This conservative amino acid change is located on chromosome 12 of African rice. As will be reported later in this chapter, the A/C SNP located on exon 6 which is a well-known amylose linked SNP is fixed in O. glaberrima.

Figure 20: Plot of chromosomal positions and allele frequency of putative candidate non-synonymous SNPs located on chromosome 6. The indicated allele frequencies are from the high amylose bulk. Almost all the putative candidate genes where these SNPs are located, cluster with GBSS1 suggesting an apparent close linkage with this locus. The putative candidate genes cluster within a region of about 2.1Mbp which is equivalent to about 10 cM. The SNPs were plotted using CandiSNP (Etherington et al. 2014). Majority of the other amylose linked SNPs were found in genes that have previously been associated with starch biosynthesis. These candidate genes include glutelin subunit mRNA (LOC_Os01g55690), basic helix-loop-helix (OsbHLH071), Starch synthase IVa (LOC_Os01g52250) and Histone-fold domain containing protein (LOC_Os01g01290) which are located on chromosome 1 while those on chromosome 11 include a NAC (for NAM, ATAF, and CUC) transcription factor (Table 23).

130

Table 23: Upstream and non-synonymous SNPs putatively linked to amylose content identified in candidate genes

Gene name Loci SNP Region Location Amino acid change GBSS1 LOC_Os06g04200 G/A Exon 1769686 Asp to Asn GBSS1 LOC_Os06g04200 A/C Exon 1768006 Tyr to Ser GBSS1 LOC_Os06g04200 G/A 5’UTR 1765976 N/A GBSS1 LOC_Os06g04200 C/T 5’UTR 1766005 N/A NAC LOC_Os11g31330 G/A Exon 18288616 Val to Ile Glutelin subunit mRNA LOC_Os01g55690 C/A Exon 32077903 Leu to Met Glutelin subunit mRNA LOC_Os01g55690 G/A Exon 32078006 Ser to Asn Glutelin LOC_Os01g55630 C/T Exon 32053231 Ile to Thr Basic helix-loop-helix LOC_Os01g01600 T/C Exon 304355 Leu to Pro (OsbHLH071) Starch synthase IVa LOC_Os01g52250 T/A Exon 30038502 Ser to Thr Histone-fold domain LOC_Os01g01290 Exon 141117 containing protein

Studies have indicated that very low coverage in genetic marker sites may lead to sampling error which may cause spurious deviations in allele frequency from the expected 50% segregation pattern in regions not linked to the phenotype (Duitama et al. 2014). In order to weed out false-positive associations, read coverage was therefore assessed across the genome and also specifically in the identified candidate genetic marker sites. Analysis of coverage in individual SNP positions indicated that it ranged from 21 to 41.

5.2.4. Allele mining in natural variation of African rice GBSS1 reveals novel alleles

Bi-parental populations segregate for only a fraction of the total variation available in a species. Since functional variation in GBSS1 appeared to be limited in the interspecific cross, this study sought to analyse the diversity available in GBSS1 in a diverse set of unrelated O. glaberrima accessions. Whole genome short reads of 10 African rice accessions were mapped to the GBSS1 gene from O. sativa with variant calling resulting in 5 novel putative SNPs spread across several exons (Table 24). The majority of the identified SNPs seem to be of low frequency, with three of them appearing in only one of the 10 accessions studied.

131

Table 24: Putative SNPs found in GBSS1 from African rice

Position* Exon SNP Amino acid change 139 2 T/G Ser to Ala 250 2 A/C Asn to His 1247 10 A/C Asp to Ala 1272 10 G/C Glu to Asp 1582** 12 G/A Asp to Asn * Positions are numbered from the translation start codon

**This G/A SNP has previously been identified but its functional role has not been elucidated

5.2.5. Analysis of an indel marker in O. glaberrima GBSS1

Based on the linkage observed in this study around the waxy locus (Fig. 20), it would appear reasonable to conclude that selecting for the GBSS1 gene from African rice would be selecting for high amylose. This is consistent with findings of a study by Kishine et al. (2008) who analysed grain quality properties in NERICA (New rice for Africa) - progenies of interspecific crosses between O. sativa and O. glaberrima - and found that the NERICA lines with high AC have their whole GBSS1 gene derived from O. glaberrima while those with low AC have theirs derived from O. sativa. Due to the limitation in the number of available SNP markers, an alternative marker that has been used to distinguish the O. sativa GBSS1 gene from the O. glaberrima one is a 139bp deletion located on intron 10 in O. glaberrima. PCR amplification of this indel region has been used to reveal the genetic derivation of GBSS1 in O. sativa and O. glaberrima interspecific progenies (Kishine et al. 2008). However, there is some contradiction on the structural variation of GBSS1 in O. glaberrima particularly as it relates to this non-functional deletion, with some reports reporting the absence of this deletion in some accessions of this species (Dobo 2006). This deletion was analysed in a diverse set of 10 O. glaberrima accessions. Further analysis of this deletion was undertaken by studying the structure of GBSS1 gene from 54 accessions of O. barthii; the putative progenitor of African rice, in order to understand the inheritance pattern of this indel. Out of the 10 African rice accessions studied, this analysis did not reveal any that lacked this deletion. Similarly, all the O. barthii accessions had this 139bp deletion.

132

5.2.6. Analysis of O. sativa amylose-linked functional SNPs

Analysis of marker trait relationships for AC in O. sativa has led to the identification of various amylose linked nucleotide polymorphisms found in GBSS1. These amylose associated SNPs include a G/T SNP in intron 1 splice site junction, A/C SNP in exon 6 and C/T SNP in exon 10 (Bligh et al. 1998; Larkin and Park 2003). The presence of these markers in African rice and their potential functional effect in this species were therefore analysed. For comparison purposes and to study their possible inheritance pattern, these markers were also studied in O. barthii, the putative progenitor of African rice. Analysis of these O. sativa functional markers in a diverse set of 10 O. glaberrima and 54 O. barthii accessions shows that these SNPs are fixed in these two species, with both of these species having a GAC allelic pattern in these 3 SNP positions (Table 25).

Table 25: Comparison of allelic patterns in functional SNP positions identified in O. sativa

Gene Exon/Intron Position SNP in Allele in Allele in O. O. sativa O. barthii glaberrima GBSS1 Intron 1/Exon 1 1164 upstream of start G/T G G boundary codon GBSS1 Exon 6 2294 A/C A A GBSS1 Exon 10 3286 C/T C C

5.3. Discussion

5.3.1. Genetics of amylose content

Despite African rice having potential for contributing genes to improve rice quality as reviewed by Wambugu et al. (2013), research on starch structure and properties as well as the genetic mechanisms controlling them have for a long time been neglected in this cultivated species. Interest in studying starch related traits in African rice is however growing (Gayin et al. 2016a; Gayin et al. 2016b; Gayin et al. 2015; Wang et al. 2015). This study investigated the genetic basis of AC in African rice by genetic mapping of the progenies of an interspecific cross between O. glaberrima and O. sativa. The continuous variation and normal distribution of AC observed in this study has been reported in previous studies (Aluko et al. 2004) and indicates that AC in

133

African rice is quantitatively inherited. The normal distribution suggests that amylose is under polygenic control, though only GBSS1 is known to be primarily involved in regulation of amylose synthesis. Other studies have reported amylose to be under the control of other loci in addition to the waxy locus though these loci have not been identified (Aluko et al. 2004; He et al. 1999). Identification of these additional candidate genes remains an important area of research focus as these genes are likely to provide new targets for AC modification.

The transgressive segregants observed in this study indicate that there exists potential of genetic improvement of AC in these two cultivated species. Results of this study on AC are comparable with those obtained by Kishine et al. (2008) for NERICA varieties, though the values in the present study are slightly lower. Similarly, the AC for the two parents obtained in this study were comparable to those obtained by Traore et al. (2006) who reported that CG14 had 26.1% while WAB-56-104 had 21.1%. In the present study, AC of 25.7% was obtained for CG14 while that of WAB 56-104 was 23.2%. Differences in AC are expected between studies due to the different methods used.

5.3.2. Bulk segregant analysis identifies major and minor loci associated with amylose content

This study sought to conduct genome wide analysis aimed at identifying markers which associate with AC either through causativeness or through linkage. SNPs were found in genes known to influence amylose biosynthesis as well as some which have previously been linked with starch biosynthesis but whose role in amylose synthesis had not been established. The identification of GBSS1 and several others as being linked to amylose demonstrates the potential of bulk segregant analysis in identifying major and minor loci associated with complex traits. GBSS1 is a major effect loci explaining as high as 90% of the AC variation in some genetic backgrounds (He et al. 1999). Among the markers identified in this study were 4 SNPs found in the exonic and promoter regions of GBSS1 structural gene. These include a non-synonymous G/A SNP located in exon 12 and which remains the only SNP that has been reported by previous studies in African rice to-date. This SNP leads to a change from Asparagine to Aspartic acid which is predicted to be a conservative amino acid change

134

(Dobo 2006; Umeda et al. 1991). This SNP affects the GBSS1 protein at position 528. Conservative amino acid changes can be functionally important if located near an enzyme active site (Henry 2008) and therefore, though it is conservative, further analysis on the functional importance of this SNP is necessary.

Analysis of the three dimensional structure of GBSS1 protein (Momma and Fujimoto 2012) shows that residue 528 is in the glycosyltransferase region of the protein and adjacent to Cys529. Cys529 is involved in a disulphide link with Cys337. Cys337 is conserved in the Poaceae but not in other species suggesting that all Poaceae GBSS1 proteins have this disulphide bond as part of their three dimensional structure. Other species have valine at position 337. It is hereby speculated that a change from aspartic acid in O. sativa to asparagine in O. glaberrima at position 528 could influence the stability of the S-S link. A negatively charged aspartic acid at position 528 could probably interact with the positively charged arginine at position 335 which is only separated from Cys337 by a glycine. The neutral asparagine at 528 in O. glaberrima could not provide this stabilization of structure. A similar argument has been advanced for an A/C SNP located on chromosome 6 which causes a serine/tyrosine amino acid substitution in O. sativa. It is postulated that this SNP influences AC by destabilizing GBSS1 through a hydrogen bonding, hence affecting its activity (Dobo et al. 2010; Pace et al. 2001). Concerning the G/A SNP, another possibility is some interaction between negatively charged aspartic acid at position 528 and the positively charged lysine at position 530 in O. sativa that could not happen in O. glaberrima. These speculations point to the functional importance of this SNP but functional characterization through targeted mutagenesis would be useful as it would help clearly elucidate on this. The use of protein modelling approaches, active site prediction and simulation packages may also help elucidate the functional importance of this and other SNPs. Relating to the 2 SNPs found in the promoter of GBSS1, it is possible that they may have some regulatory role in amylose synthesis. Transcriptional regulation of GBSS1 has previously been reported, with a 31bp sequence present in the GBSS1 gene promoter regulating the amount of amylose synthesized (Zhu et al. 2003).

In addition to GBSS1, this study identified other amylose candidate genes and markers, among which include TFs in the NAC (LOC_Os11g31330) and CCAAT-

135

HAP5 (LOC_Os01g01290) families. Through a gene co-expression analysis, both of these TFs have previously been linked to starch biosynthesis genes (Fang-Fang and Hong-Wei 2010). These TFs were analysed further through a candidate gene analysis and the results are presented in chapter 6. Though Basic helix-loop-helix (OsbHLH071) does not appear to have been functionally characterized, some members of this family have been linked to starch biosynthesis genes through gene co-expression analysis (Fang-Fang and Hong-Wei 2010). In Arabidopsis, loss of function mutation for a member of the basic helix-loop-helix family was found to result in small and shriveled seeds thus suggesting these some TFs in this family have roles in grain reserve accumulation (Kondou et al. 2008), perhaps through starch synthesis. Three SNPs found in two endosperm specific genes which are involved in glutelin synthesis were identified. Identification of SNPs located in genes involved in glutelin synthesis suggests a close interaction between the starch biosynthesis and glutelin synthesis pathways. This is consistent with results of a recent study which reported that there exists a link between glutelin and starch physicochemical properties in rice (Baxter et al. 2014). These authors reported a positive correlation between glutelin content and hardness as well as adhesive properties of starch. It is important to note that the relationship between AC and hardness is well known, with high amylose being associated with hard texture. It is therefore possible that glutelin has a direct correlation with AC and in turn affecting other physicochemical properties such as hardness. Moreover, analysis of conserved regulatory motifs indicate that both GBSS1 and glutelin mRNA subunit possess the GCN4 and the AACAAA motif. As reported by Washida et al. (1999) and Wu et al. (2000), both of these motifs interact with each other to confer endosperm specific expression further pointing to the functional relatedness between the two genes and their respective pathways.

5.3.3. Analysis of allele frequency detects linkage drag around waxy locus

One of the challenges that faces genetic mapping efforts and specifically those aimed at identifying trait causative mutations is distinguishing truly linked and spuriously linked genomic regions. The clustering of majority of the candidate markers in a genomic region of about 2.1Mbp could be as a result of linkage dragBased on a mean conversion rate of 200kbp/cM as the average physical distance per centimorgan (Orjuela et al. 2010), the 2.1Mbp region corresponds to about 10cM. Recombination

136 in rice is expected to occur at a frequency of between 0.39-0.45cM per 100kbp (Si et al. 2015; Wu et al. 2003) and so no recombination would be expected in the linkage interval. Linkage causes markers surrounding the causal loci to show deviations from the 50% inheritance pattern expected of parental alleles, though they may not be linked to the trait (Duitama et al. 2014). Linkage reduces the resolution of BSA by limiting it to the identification of the general region carrying the causal mutation rather than identifying the mutation itself. With the causal mutation possibly embedded in a large genomic region, the challenge in the hands of molecular geneticists is how to narrow down this region so as to identify the causal mutation.

Tight linkage between the waxy locus and a nearby sterility locus has previously been suggested by Heuer and Miézan (2003) and Sano (1990). Further suggestion of linkage in this region was made by Yonemaru et al. (2010) who, in their non-redundant rice QTL database, identified a “QTL cluster” on chromosome 6 in a region encompassing the 2.1Mbp region identified in the present study. These authors who defined a “QTL cluster” as any region having more than 15 QTLs within a 2Mbp region, suggested that the QTL clustering could be due to pleiotropy caused by one gene or linkage of different genes which control different traits. These reports suggest that most of the genes in the 2.1Mbp region may not be functionally involved in amylose biosynthesis but their clustering might be as a result of linkage drag. The implications of this linkage in breeding for AC are discussed later in this chapter.

5.3.4. Potential of bulk segregant analysis in genetic mapping

With only a few studies having employed whole genome sequencing based BSA, the potential of this state of the art genomic approach is only starting to be revealed. The power of BSA in genetic mapping depends on initial population size, number of pooled segregants, sequencing coverage and the amount of phenotypic variation explained by the QTL in question (Duitama et al. 2014; Magwene et al. 2011). Though BSA was initially designed to target major effect genes, its continued advances has increased its resolution to detect many underlying genetic factors including minor causal alleles (Ian et al. 2010; Venuprasad et al. 2009; Vikram et al. 2012; Zou et al. 2016). While it has been reported that using initial populations of 100-300 individuals only allows the identification of major loci (Sun et al. 2010), results of this study seem to contradict

137 this. With an initial population of 100 progenies, the present study has identified both major and minor loci. A population size of 100 progenies is possibly one of the lowest that have ever been used for genetic mapping using BSA, yet resulted in successful candidate gene discovery. The high resolution observed in the present study even with low initial population size might be explained by marker density as sequencing the whole genome provided the highest density of markers possible. Moreover, the use of

RILs (BC2F8) may also have contributed to the increased resolution. In a simulation study, Takagi et al. (2013) found that using F7 RIL gives more power in detecting QTLs than F2 population.

Bulk size also plays a major role in determining the success and resolution of bulked sample analysis. It is generally agreed that the higher the number of pooled segregants, the higher the mapping resolution as this increases the number of recombination events (Ries et al. 2016; Yang et al. 2015). Magwene et al. (2011) suggests that bulks of between 10-20% of the phenotyped population helps to increase power to detect causal mutations. In this study, a total of 100 progenies were phenotyped and 10 segregants were pooled per bulk. Bulks comprising of only 10% of the phenotyped population are therefore sufficient to enable dissection of the genetic basis of complex traits.

There seems to be contradictory recommendations on the most optimal coverage that will maximize mapping resolution. Magwene et al. (2011) recommends that the read coverage should be at least equal to the number of segregants in a pool to ensure all the chromosomes in the pool are sequenced. In contradiction to this recommendation, Ries et al. (2016) reports that it is possible to achieve optimal resolution with a read coverage much lower than the total number of segregants and that indeed such a high coverage may not be necessary. In the present study, deep sequencing was conducted, achieving average coverages of 35X and 38X of reads uniquely mapped against the Nipponbare reference for both high amylose and low amylose bulks respectively (Table 21). This deep coverage enabled sequencing of all the chromosomes in the pooled segregants in addition to making it possible to distinguish rare variants from sequencing errors. Analysis of coverage in individual candidate marker positions shows that all of them had high coverage which ranged

138 from 21 to 41. This coverage ruled out the possibility of spurious deviations in allele frequency from the expected 50% segregation pattern due to low coverage, which could have led these markers/region being falsely linked to amylose (Duitama et al. 2014). In addition, this coverage was more than the minimum recommended (20X) to compensate for sequencing errors (Dohm et al. 2008).

Analysis of polymorphisms between the bulks did not reveal any marker that was fully homozygous in both bulks. While this may signify markers that do not show strong association with AC, it could also be due to errors in phenotyping and bulking. Even with these possible errors, this study was still successful, suggesting that it is insensitive to some degree of phenotyping mistakes. This was supported by findings of Yang et al. (2015) who reported that BSA can tolerate a certain amount of error in phenotyping. Whole genome based BSA offers the advantage of cost effectiveness as it circumvents the costly and laborious process of DNA marker development and genotyping that characterizes traditional BSA approaches (Takagi et al. 2013). Due to its great potential the popularity of NGS aided BSA in genetic mapping is expected to grow tremendously. The recent demonstration that it can be used in analysing natural variation (Yang et al. 2015; Zou et al. 2016) is likely to widen its application. It might for example make it more appealing to use in analysing genebank collections, a large proportion of which constitute natural populations.

5.3.5. Analysis of natural variation identifies novel SNPs in African rice GBSS1

Though the bi-parental mapping resulted in the identification of only two non- synonymous SNPs in GBSS1 (Table 23), analysis of O. glaberrima natural germplasm identified seemingly rare allelic variation in this gene which was not segregating in the recombinant inbred line population. The G/A SNP which is one of the two non- synonymous SNPs that were segregating in the mapping population was also identified in the diverse germplasm and appears in position 1582 (Table 24). The A/C SNP was however not identified in this population and appears to be an O. sativa specific SNP with the “A” allele being fixed in O. glaberrima. To my knowledge this is the first study that has identified such a large number of putative markers in GBSS1 in African rice. The identification of such high variation in terms of non-synonymous variants in the GBSS1 loci is consistent with a recent study in African rice which found

139 a higher diversity of AC than earlier reported (Gayin et al. 2015). Previously, African rice has been reported to have a narrow range of AC of about 25% and above (Wang et al. 2015; Watanabe et al. 2002) but Gayin et al. (2015) reported AC ranging from 15.1% to 29.6%. There seems to be no information on the genetic mechanisms underlying this recently discovered variability in AC.

The identification of a large number of novel markers coupled with these recent reports of wide variability in amylose suggests that there could be more novel alleles that could be mined in GBSS1 loci by analysis of a larger population of African rice. The novel SNP markers identified in this study seem to be of low frequency as majority of them were found in only one of the 10 accessions analysed. This low frequency of variants in African rice is consistent with extremely low nucleotide diversity that has been reported in this species (Li et al. 2011; Nabholz et al. 2014; Orjuela et al. 2014; Wang et al. 2014) which is suggestive of a severe domestication bottleneck. It is possible that the reason why it has so far not been possible to explain the heritable genetic variation for AC in African rice is because the alleles contributing to this variation are rare as a result of purifying selection (Morrell et al. 2012; Teri et al. 2009). These alleles may be associated with undesirable starch physicochemical properties and may therefore have been selected against during domestication. As will be discussed in chapter 7, analysis of the functional effect of the alleles identified in this wider African rice population might help to shed light on this.

5.3.6. Indel marker can help distinguish O. glaberrima and O. sativa GBSS1

There have been contradictory reports on the structural variation of the GBSS1 gene from O. glaberrima, specifically as relates to the presence or absence of a 139bp indel on intron 10. While the 139bp deletion in African rice has been labelled as the key polymorphism differentiating the GBSS1 gene from O. sativa and O. glaberrima (Kishine et al. 2008; Umeda et al. 1991), a study by Dobo (2006) seems to contradict this. Dobo (2006) was the first study to report on the absence of this deletion in some accessions of African rice. Confusion about this deletion limits its application as a marker to distinguish GBSS1 gene from the two cultivated species and hence determine its genetic derivation in interspecific progenies. The structural variation in GBSS1 genes from a selection of “pure” O. glaberrima accessions specifically as it

140 relates to this 139bp indel on intron 10 was analysed. Out of the 10 accessions analysed in this study, none of them possessed the 139bp insertion as earlier reported by Dobo (2006). Moreover, none of the 54 O. barthii accessions analysed in this study had this insertion. This deletion seems to have been inherited from O. barthii and therefore its absence in a large number of O. barthii accessions as was analysed in this study, is a further indication that it could be an O. glaberrima specific variant.

These contradictory results on the genetic variations in GBSS1 gene can be attributed to possible accession misidentification or introgression. The close morphological and genetic resemblance between the two cultivated species as well as the outcrossing that naturally occurs between them in the field, amplifies the possibility of accession misidentification. This has occasionally been noted especially in genebank collections (Orjuela et al. 2014). Indeed, Dobo (2006) clearly pointed to the possibility of species misidentification or accidental inter-crossing by reporting that the African rice accessions possessing the unusual 139bp insertion exhibited features characteristic of O. sativa namely ligule and panicle architecture. Cases of species misidentification and introgression in the Oryza genus as well as other genera have continued to confound many genetic studies resulting in inconsistent and contradictory results.

5.3.7. Comparison of allelic patterns in functional SNP positions identified in O. sativa

In Asian rice, several amylose functional markers have been identified and have been deployed in marker assisted breeding. Some of these markers include CT microsatellite in 5’untranslated region, G/T SNP in leader intron 5’ splice site, A/C SNP in exon 6 and C/T SNP in exon 10 of GBSS (Ayres et al. 1997; Bligh et al. 1995; Dobo et al. 2010; Larkin and Park 2003). Though some of these polymorphisms are just closely linked markers and not the causal mutations (Dobo et al. 2010), their strong correlation qualifies them for use as genetic markers. The exon 6 SNP leads to a change from tyrosine to serine while the one in exon 10 causes a proline to serine conversion. In this study, alleles in these SNP positions were analysed. Table 25 shows that both O. barthii and O. glaberrima have a GAC allelic pattern for SNP positions in intron 1, exon 6 and exon 10 respectively. Previous studies in O. sativa

141 have reported that the 3 SNPs could distinguish the three amylose classes – low, intermediate and high – with high amylose varieties having either a GAC or GAT allele in these SNP positions. The two study species have been found to have high AC and though these SNPs are fixed in these two species, alleles in these two positions appear to associate with AC in the same way as in O. sativa. With the recent reports of O. glaberrima accessions with low AC as highlighted above, it would be interesting to gain insight in their allelic pattern for these SNP positions.

5.3.8. Conclusion and implications of this study in breeding for amylose

This study has demonstrated that the approach of sequencing bulks was able to identify genes on different chromosomes associated with this complex trait. With an initial population and pool size of 100 and 10 respectively, this analysis was able to identify the major and minor genes associated with AC. Results of this study have increased our understanding on the amylose biosynthesis pathway particularly on the transcriptional regulation. The genes encoding NAC and CCAAT-HAP5 TFs may have roles in regulating genes associated with amylose biosynthesis. The candidate genes and markers identified in this study present novel targets for manipulating AC. With GBSS1 having been identified as the major loci affecting AC, breeding for high amylose rice would entail selecting for the GBSS1 gene from O. glaberrima. The 139bp indel marker might play a useful role indistinguishing between O. sativa and O. glaberrima GBSS1. The pleiotropic effects of the waxy locus have been documented where in addition to amylose, it has also been found to be responsible for controlling gelatinization temperature (GT) as well as gel consistency (Tian et al. 2009; Zhou et al. 2003). African rice has high GT and therefore selecting for O. glaberrima GBSS1 will simultaneously select for high GT. Genotypes with higher GT are considered to be of higher quality and are preferred by some consumers (Zhou et al. 2003), further indicating the vast genetic potential of African rice in breeding for superior cooking and eating quality.

This study has identified genetic linkage around the GBSS loci. Markers showing linkage with the causal loci have potential for universal application in marker assisted breeding and are especially useful in cases where there is no reliable marker for genotyping the causal loci (Kim et al. 2015). This might be useful in the case of

142

African rice where polymorphisms in GBSS1 are limited as observed in the interspecific cross. The linked markers located in the 2.1Mbp region might be vital resources that can potentially be used to indirectly select for amylose in a practical breeding programme. However, these indirect amylose-associated markers have potential limitations as they are likely to undergone recombination between them and the causal loci, in this case GBSS1. It would be important to confirm this linkage by checking for the presence of recombinants by analyzing a large number of genotypes from segregating populations. This will give insights on the value and efficiency of these markers in breeding as tightly linked markers are highly preferred and are more MAS friendly.

The linkage drag observed in this study on the other hand has potential for presenting challenges to breeders due to the co-introduction of desirable alleles together with those associated with poor traits. An example relevant to this study is the tight linkage between a gene controlling rice eating quality and one for blast resistance. For several decades, breeders in Japan have faced challenges in developing elite varieties with resistance against blast and also possessing good quality traits due to the co-introduction of desirable alleles for blast resistance and the undesirable ones controlling poor grain quality (Fukuoka et al. 2009; Saka 2006). While such associations could be due to pleiotropy, they have in most cases been found to be due to tightly linked genes (Fukuoka et al. 2009). Breaking this linkage is usually costly as well as time consuming and remains one of the major challenges that molecular breeders and geneticists have to confront.

5.4. Methods

5.4.1. Mapping population

The analysis presented in this study was conducted on BC2F8 recombinant inbred line (RIL) population developed at Africa Rice Center from an Asian and African rice interspecific cross. CG14, the O. glaberrima parent acted as the donor parent while WAB 56-104, the O. sativa parent as the recurrent parent (Jones et al. 1997). CG14 is an O. glaberrima accession from Senegal that has several desirable attributes among them drought resistance and weed competitiveness. It however has low yield

143 potential mainly due to lodging and shattering (Dingkuhn et al. 1998; Jones et al. 1997). WAB 56-104 is an improved Japonica variety that was developed in Africa Rice Center and is characterized by high yields and early maturity (Semagn et al. 2007).

5.4.2. Amylose content determination

Phenotyping for AC was done on a total of 100 interspecific progenies together with their parental lines. About 60 seeds were first dehusked and then milled to fine flour using the QIAGEN tissuelyser. Amylose content was measured using the Megazyme amylose/amylopectin assay kit (Megazyme International Ltd., Bray, Ireland) according to manufacturer’s procedure.

5.4.3. Bulk segregant analysis

Bulk segregant analysis (Michelmore et al. 1991) is a rapid and cost effective genetic mapping approach that helps identify QTLs, genes and molecular markers controlling target traits. With the advent of cheap next generation sequencing technologies, bulk segregant analysis coupled with whole genome sequencing has increasingly become a popular strategy as it is easy and cheap to develop dense SNP markers across the genome. These dense SNP markers would enable mapping QTLs at finer resolution (Magwene et al. 2011). Under this approach, it would be expected that phenotype- neutral regions would have equal contribution from the two parental genomes while phenotype-linked regions would show significantly diverged allele frequencies between the 2 DNA pools.

The two bulks used in this study were constituted by selecting 10 individuals each from both extreme ends of the amylose normal frequency distribution curve (Fig. 19). The bulk with low AC hereby referred to as “low amylose bulk (LAB)” had individuals with AC of 21.6% and below while the one associated with high AC hereby referred to as “high amylose bulk (HAB)” comprised of individuals with AC of 24.1% and above.

5.4.4. DNA extraction and quantification

Genomic DNA extraction was conducted from about 15 one-month old seedlings using a modified cetyltrimethylammonium bromide (CTAB) protocol that was originally

144 published by Carroll et al. (1995). DNA quality was assessed using 0.7% (w/v) gel electrophoresis and NanoDrop 8000 UV-Vis Spectrophotometers (Thermo Fisher Scientific, Wilmington, DE, USA). DNA quantification was done using the Qubit® 3.0 Fluorometer (Life Technologies, Carlsbad, CA, USA) with the dsDNA HS (High Sensitivity) Assay Kit. In order to constitute the two DNA bulks, equal amounts of DNA were bulked and the final concentration adjusted to about 70ng/µl for both bulks.

5.4.5. Sequencing and data processing

Genomic DNA libraries were prepared for the pair of DNA pools together with the parental DNA samples. They were then pooled and sequenced in one lane using an Illumina GAIIx sequencer. The sequencing reads were trimmed using CLC Genomic Workbench 9.0 (CLC Bio, a QIAGEN Company, Aarhus, Denmark) using a quality limit of 0.01 (Only 1% of low quality bases are allowed). The trimmed reads were then separately mapped against the O. sativa Nipponbare reference (IRGSP 2005) and the O. glaberrima CG14 reference (Wang et al. 2014) under the following parameters: length fraction 0.5, similarity fraction 0.8, mismatch cost 2, deletion and insertion costs 3. Most read mappers have weaknesses when mapping reads in genomic regions containing indels (Ries et al. 2016) and so the read mappings were subjected to two-pass local realignment using the CLC local realignment tool. This helped remove realignment artefacts and thereby avoid false SNP calls especially near indel positions. Reads exhibiting non-specific matches were mapped randomly thereby ensuring multiple read alignments were avoided. The analysis pipeline is shown in Figure 21.

SNPs were called using the basic variant detection tool with a minimum coverage of 20, read count of 20 and allele frequency of 90% for the two bulks and a minimum coverage of 5, read count of 5 and allele frequency of 100% for the parents. After SNP calling, a two way comparison of SNPs identified in the two bulks was conducted, with SNPs present in both bulks being filtered out. SNPs with a phred score of <30 were filtered out. Additionally, positions with multi allelic nucleotide variations as well as those showing non-parental alleles were excluded as these were likely to be false positives which could have been caused by sequencing errors or errors in variant detection.

145

Analysis for marker trait association was then conducted by quantitative evaluation of allele frequencies of the identified non-synonymous SNPs. The contribution of the parental alleles in the bulked DNA was assessed with the focus being genetic marker sites where the allele from the high amylose parent (African rice) showed over-representation in the high amylose bulk. A cut-off of 90% allele frequency in the high bulk was used while over-representation of the alternate allele in the low amylose bulk was assessed by a statistical test. The analysis for candidate markers was conducted in 2 steps. In the first step, the analysis focused on candidate marker positions where the parental genomes were fully homozygous. In the second step, the focus was on genetic marker sites where the parents were heterozygous but the two bulks were fully homozygous for different SNP alleles. Candidate marker positions obtained from this analysis were subject to further statistical analysis.

5.4.6. Statistical analysis

A Chi square test was done on the allele frequency of the identified SNPs in both bulks separately in order to assess the co-segregation pattern of the parental alleles in the two bulks. Specifically, the chi square test was used to determine the causativeness of the identified SNPs by assessing whether they follow a binomial distribution of 1:1 as would be expected of non-causative mutations. Only SNPs that did not follow a binomial distribution of 1:1 in both bulks at P<0.05 were retained for further analysis.

146

Figure 21: Schematic diagram of BSA experimental set up and data analysis pipeline

147

5.4.7. Analysis of GBSS1 from unrelated accessions of African rice and O. barthii

A total of 64 accessions comprising 10 of African rice and 54 of O. barthii were analysed in this study (Table 26). Whole genome short read data for these samples was downloaded from the short reads archive (SRA) database. These genomes had been sequenced as part of the study that published the genome sequence of African rice. Details of sample preparation and sequencing for these samples can be obtained from Wang et al. (2014).

The reads were trimmed using CLC Genomic Workbench 9.0 with a quality score limit of 0.01. Read mapping was done to the O. sativa GBSS1 gene sequences using CLC Genomic Workbench 9.0 under the following parameters: length fraction 0.5, similarity fraction 0.8, mismatch cost 2, deletion and insertion costs 3. Variant detection was conducted using the basic variant detection tool of CLC genomics Workbench under the following parameters: minimum coverage 20, read count 8 and allele frequency 30%.

148

Table 26: O. glaberrima and O. barthii accessions used in the analysis of GBSS1 gene

Species Genebank accession GenBank accession Origin number1 number2 O. glaberrima IRGC103469 SRX502298 Burkina Faso O. glaberrima TOG5923 SRX502301 Liberia O. glaberrima TOG5949 SRX502302 Liberia O. glaberrima IRGC103472 SRX502306 Burkina Faso O. glaberrima IRGC103632 SRX502308 Mali O. glaberrima IRGC104574 SRX502311 Mali O. glaberrima IRGC104955 SRX502312 Sierra Leone O. glaberrima CG14 O. glaberrima IRGC104040 Chad O. glaberrima IRGC 101328 Sierra Leone O. barthii IRGC 103590 Cameroon O. barthii IRGC 101381 Niger O. barthii IRGC 104124 Chad O. barthii IRGC 86481 Zambia O. barthii IRGC100122 SRX502162 Gambia O. barthii IRGC100934 SRX502164 Mali O. barthii WAB0009240 SRX502191 Cameroon O. barthii WAB0012712 SRX502192 Mali O. barthii WAB0024904 SRX502193 Nigeria O. barthii WAB0026768 SRX502194 Unknown O. barthii WAB0026769 SRX502195 Nigeria O. barthii WAB0026770 SRX502196 Nigeria O. barthii WAB0028874 SRX502197 Gambia O. barthii WAB0028876 SRX502199 Guinea O. barthii -III WAB0009239 SRX502190 Nigeria O. barthii WAB0028952 SRX502173 Zambia O. barthii WAB0028927 SRX502223 Chad O. barthii WAB0028992 SRX502179 Chad O. barthii WAB0028937 SRX502228 Nigeria O. barthii WAB0028938 SRX502172 Nigeria O. barthii WAB0028940 SRX502229 Nigeria O. barthii WAB0028875 SRX502198 Mali O. barthii WAB0028980 SRX502177 Mali O. barthii WAB0028975 SRX502241 Mali O. barthii WAB0028946 SRX502232 Cameroon O. barthii WAB0028981 SRX502243 Mali O. barthii WAB0028897 SRX502209 Mali O. barthii WAB0028916 SRX502218 Mali O. barthii WAB0028931 SRX502226 Chad O. barthii IRGC103534 SRX502189 Mali O. barthii IRGC101240 SRX502185 Mali O. barthii IRGC100931 SRX502163 Mali O. barthii IRGC104119 SRX502167 Chad O. barthii IRGC105608 SRX502168 Cameroon O. barthii WAB0028877 SRX502200 Niger O. barthii IRGC103912 SRX502170 Tanzania O. barthii WAB0028903 SRX502171 Zambia O. barthii WAB0028934 SRX502227 Chad O. barthii WAB0028942 SRX502230 Cameroon O. barthii WAB0029000 SRX502253 Botswana O. barthii WAB0028917 SRX502219 Chad O. barthii WAB0028926 SRX502222 Chad O. barthii WAB0028929 SRX502224 Chad O. barthii IRGC106234 SRX502169 Sierra Leone O. barthii WAB0028987 SRX502178 Nigeria O. barthii WAB0028910 SRX502213 Mali O. barthii WAB0028911 SRX502214 Mali O. barthii WAB0028912 SRX502215 Mali O. barthii IRGC100922 SRX502182 Unknown O. barthii WAB0028893 SRX502206 Mali O. barthii WAB0028900 SRX502210 Mali

149

O. barthii WAB0028913 SRX502216 Mali O. barthii IRGC100927 SRX502183 Sierra Leone O. barthii WAB0030186 SRX502255 Mali O. barthii WAB0028915 SRX502217 Mali 1Refers to the unique identifier used by the genebank where it was sourced from. Samples with Accession numbers starting with TOG were obtained from Africa Rice Center while those starting with IRGC were obtained from IRRI

2 Refers to unique sequence record identifier for the GenBank’s SRA database

Modified from (Wang et al. 2014)

150

5.5. References

Aluko G, Martinez C, Tohme J, Castano C, Bergman C, Oard JH (2004) QTL mapping of grain quality traits from the interspecific cross Oryza sativa x O. glaberrima. TAG Theoretical and applied genetics Theoretische und angewandte Genetik 109:630-639

Ayres N, McClung A, Larkin PD, Bligh HFJ, Jones CA, Park WD (1997) Microsatellites and a single nucleotide polymorphism differentiate apparent amylose classes in an extended pedigree of US rice germplasm. TAG Theoretical and applied genetics Theoretische und angewandte Genetik 94:773-781

Bao JS, Corke H, Sun M (2006) Nucleotide diversity in starch synthase IIa and validation of single nucleotide polymorphisms in relation to starch gelatinization temperature and other physicochemical properties in rice ( Oryza sativa L.). Theoretical and Applied Genetics 113:1171-1183

Baxter G, Blanchard C, Zhao J (2014) Effects of glutelin and globulin on the physicochemical properties of rice starch and flour. Journal of Cereal Science 60:414-420

Bligh HF, Till RI, Jones CA (1995) A microsatellite sequence closely linked to the Waxy gene of Oryza sativa. Euphytica 86:83-85

Bligh HFJ, Larkin PD, Roach PS, Jones CA, Fu H, Park WD (1998) Use of alternate splice sites in granule-bound starch synthase mRNA from low-amylose rice varieties. Plant molecular biology 38:407-415

Carroll BJ, Klimyuk VI, Thomas CM, Bishop GJ, Harrison K, Scofield SR, Jones JD (1995) Germinal transpositions of the maize element Dissociation from T-DNA loci in tomato. Genetics 139:407-420

Das S, Upadhyaya HD, Bajaj D, Kujur A, Badoni S, Laxmi, Kumar V, Tripathi S, Gowda CLL, Sharma S, Singh S, Tyagi AK, Parida SK (2015) Deploying QTL-seq for rapid delineation of a potential candidate gene underlying major trait-associated QTL in chickpea. DNA research : an international journal for rapid publication of reports on genes and genomes 22:193-203

Delwiche SR, Bean MM, Miller RE, Webb BD, Williams PC (1995) Apparent amylose content of milled rice by near-infrared reflectance spectrophotometry. Cereal Chemistry 72:182-187

151

Dingkuhn M, Johnson DE, Sow A, Audebert AY (1998) Relationships between upland rice canopy characteristics and weed competitiveness. Field Crops Research 61:79-95

Dobo M (2006) Role of GBSS allelic diversity in rice grain quality. ProQuest, UMI Dissertations Publishing

Dobo M, Ayres N, Walker G, Park WD (2010) Polymorphism in the GBSS gene affects amylose content in US and European rice germplasm. Journal of Cereal Science 52:450-456

Dohm JC, Lottaz C, Borodina T, Himmelbauer H (2008) Substantial biases in ultra- short read data sets from high-throughput DNA sequencing. Nucleic acids research 36:e105-e105

Duitama J, Sánchez-Rodríguez A, Goovaerts A, Pulido-Tamayo S, Hubmann G, Foulquié-Moreno MR, Thevelein JM, Verstrepen KJ, Marchal K (2014) Improved linkage analysis of Quantitative Trait Loci using bulk segregants unveils a novel determinant of high ethanol tolerance in yeast. BMC genomics 15:207-207

Etherington GJ, Monaghan J, Zipfel C, MacLean D (2014) Mapping mutations in plant genomes with the user-friendly web application CandiSNP. PLANT METHODS 10:41-41

Fang-Fang F, Hong-Wei X (2010) Coexpression Analysis Identifies Rice Starch Regulator1, a Rice AP2/EREBP Family Transcription Factor, as a Novel Rice Starch Biosynthesis Regulator1[W][OA]. Plant physiology 154:927-938

Fukuoka S, Saka N, Koga H, Ono K, Shimizu T, Ebana K, Hayashi N, Takahashi A, Hirochika H, Okuno K, Yano M (2009) Loss of Function of a Proline-Containing Protein Confers Durable Disease Resistance in Rice. Science 325:998-1001

Gayin J, Abdel-Aal E-SM, Manful J, Bertoft E (2016a) Unit and internal chain profile of African rice (Oryza glaberrima) amylopectin. Carbohydrate polymers 137:466- 472

Gayin J, Bertoft E, Manful J, Yada RY, Abdel‐Aal ESM (2016b) Molecular and thermal characterization of starches isolated from African rice (Oryza glaberrima). Starch ‐ Stärke 68:9-19

Gayin J, Chandi GK, Manful J, Seetharaman K (2015) Classification of Rice Based on Statistical Analysis of Pasting Properties and Apparent Amylose Content: The Case of Oryza glaberrima Accessions from Africa. Cereal Chemistry 92:22-28

152

Han Y, Lv P, Hou S, Li S, Ji G, Ma X, Du R, Liu G (2015) Combining Next Generation Sequencing with Bulked Segregant Analysis to Fine Map a Stem Moisture Locus in Sorghum (Sorghum bicolor L. Moench). PLoS One PLoS ONE 10(5): e0127065. doi:10.1371/journal.pone.0127065

Hartwig B, James GV, Konrad K, Schneeberger K, Turck F (2012) Fast isogenic mapping-by-sequencing of ethyl methanesulfonate-induced mutant bulks. Plant physiology 160:591-600

He P, Li SG, Qian Q, Ma YQ, Li JZ, Wang WM, Chen Y, Zhu LH (1999) Genetic analysis of rice grain quality. THEORETICAL AND APPLIED GENETICS 98:502-508

Henry RJ (2008) An Overview of Advances in Plant Genomics in the New Millennium. In: Kole C, Abbott A (eds) Principles and Practices of Plant Genomics: Advanced Genomics. Science Publishers, Enfield, NH

Heuer S, Miézan KM (2003) Assessing hybrid sterility in Oryza glaberrima × O. sativa hybrid progenies by PCR marker analysis and crossing with wide compatibility varieties. Theoretical and Applied Genetics 107:902-909

Hu XQ, Lu L, Fang CY, Duan BW, Zhu ZW (2015) Determination of Apparent Amylose Content in Rice by Using Paper-Based Microfluidic Chips. JOURNAL OF AGRICULTURAL AND FOOD CHEMISTRY 63:9863-9868

Ian ME, Noorossadat T, Yue J, Jonathan K, Stephen M, Joshua AS, David G, Amy AC, Leonid K (2010) Dissection of genetically complex traits with extremely large pools of yeast segregants. Nature 464:1039 Illa-Berenguer E, Van Houten J, Huang Z, van der Knaap E (2015) Rapid and reliable identification of tomato fruit weight and locule number loci by QTL-seq. Theoretical and Applied Genetics

IRGSP (2005) The map-based sequence of the rice genome. Nature 436:793-800

Jones MP, Dingkuhn M, Aluko/snm GK, Semon M (1997) Interspecific Oryza Sativa L. X O. Glaberrima Steud. progenies in upland rice improvement. Euphytica 94:237-246

Kim S, Kim C-W, Park M, Choi D (2015) Identification of candidate genes associated with fertility restoration of cytoplasmic male-sterility in onion (Allium cepa L.) using a combination of bulked segregant analysis and RNA-seq. Theoretical and Applied Genetics 128:2289-2299

153

Kishine M, Suzuki K, Nakamura S, Ohtsubo Ki (2008) Grain qualities and their genetic derivation of 7 new rice for Africa (NERICA) varieties. Journal of agricultural and food chemistry 56:4605-4610

Kondou Y, Nakazawa M, Kawashima M, Ichikawa T, Yoshizumi T, Suzuki K, Ishikawa A, Koshi T, Matsui R, Muto S, Matsui M (2008) RETARDED GROWTH OF EMBRYO1, a New Basic Helix-Loop-Helix Protein, Expresses in Endosperm to Control Embryo Growth1[W]. Plant physiology 147:1924-1935

Larkin PD, Park WD (2003) Association of waxy gene single nucleotide polymorphisms with starch characteristics in rice ( Oryza sativa L.). Molecular Breeding 12:335-339

Li ZM, Zheng XM, Ge S (2011) Genetic diversity and domestication history of African rice (Oryza glaberrima) as inferred from multiple gene sequences. TAG Theoretical and applied genetics Theoretische und angewandte Genetik 123:21-31

Lu H, Lin T, Klein J, Wang S, Qi J, Zhou Q, Sun J, Zhang Z, Weng Y, Huang S (2014) QTL-seq identifies an early flowering QTL located near Flowering Locus T in cucumber. Theoretical and Applied Genetics 127:1491-1499

Magwene PM, Willis JH, Kelly JK (2011) The statistics of bulk segregant analysis using next generation sequencing. PLoS computational biology 7:e1002255

Michelmore RW, Paran I, Kesseli RV (1991) Identification of Markers Linked to Disease-Resistance Genes by Bulked Segregant Analysis: A Rapid Method to Detect Markers in Specific Genomic Regions by Using Segregating Populations. Proceedings of the National Academy of Sciences of the United States of America 88:9828-9832

Momma M, Fujimoto Z (2012) Interdomain Disulfide Bridge in the Rice Granule Bound Starch Synthase I Catalytic Domain as Elucidated by X-Ray Structure Analysis. Bioscience, Biotechnology, and Biochemistry 76:1591-1595

Morrell PL, Buckler ES, Ross-Ibarra J (2012) Crop genomics: Advances and applications. Nature Reviews Genetics 13:85-96

Nabholz B, Sarah G, Sabot F, Ruiz M, Adam H, Nidelet S, Ghesquière A, Santoni S, David J, Glémin S (2014) Transcriptome population genomics reveals severe bottleneck and domestication cost in the African rice (Oryza glaberrima). Molecular ecology 23:2210-2227

154

Orjuela J, Garavito A, Bouniol M, Arbelaez JD, Moreno L, Kimball J, Wilson G, Rami J-F, Tohme J, McCouch SR, Lorieux M (2010) A universal core genetic map for rice. Theoretical and Applied Genetics 120:563-572

Orjuela J, Sabot F, Chéron S, Vigouroux Y, Adam H, Chrestin H, Sanni K, Lorieux M, Ghesquière A (2014) An extensive analysis of the African rice genetic diversity through a global genotyping. Theoretical and Applied Genetics 127:2211-2223

Pace CN, Horn G, Hebert EJ, Bechert J, Shaw K, Urbanikova L, Scholtz JM, Sevcik J (2001) Tyrosine hydrogen bonds make a large contribution to protein stability 1 1 Edited by C. R. Matthews. Journal of Molecular Biology 312:393-404

Ries D, Holtgräwe D, Viehöver P, Weisshaar B (2016) Rapid gene identification in sugar beet using deep sequencing of DNA from phenotypic pools selected from breeding panels. BMC genomics 17:236

Saka N (2006) A Rice (Oryza sativa L.) Breeding for Field Resistance to Blast Disease (Pyricularia oryzae) in Mountainous Region Agricultural Research Institute, Aichi Agricultural Research Center of Japan. Plant Production Science 9:3-9

Sano Y (1990) The Genic Nature of Gamete Eliminator in Rice. Genetics 125:183-191

Semagn K, Ndjiondjop MN, Lorieux M, Cissoko M, Jones M, McCouch S (2007) Molecular profiling of an interspecific rice population derived from a cross between WAB 56-104 (Oryza sativa) and CG 14 (Oryza glaberrima). African Journal of Biotechnology 6:2014-2022

Si W, Yuan Y, Huang J, Zhang X, Zhang Y, Zhang Y, Tian D, Wang C, Yang Y, Yang S (2015) Widely distributed hot and cold spots in meiotic recombination as shown by the sequencing of rice F2 plants.(Report) 206:1491

Singh VK, Khan AW, Saxena RK, Kumar V, Kale SM, Sinha P, Chitikineni A, Pazhamala LT, Garg V, Sharma M, Sameer Kumar CV, Parupalli S, Vechalapu S, Patil S, Muniswamy S, Ghanta A, Yamini KN, Dharmaraj PS, Varshney RK (2015) Next-generation sequencing for identification of candidate genes for Fusarium wilt and sterility mosaic disease in pigeonpea (Cajanus cajan). Plant biotechnology journal

Sun Y, Wang J, Crouch J, Xu Y (2010) Efficiency of selective genotyping for genetic analysis of complex traits and potential applications in crop improvement. New Strategies in Plant Improvement 26:493-511

Takagi H, Abe A, Yoshida K, Kosugi S, Natsume S, Mitsuoka C, Uemura A, Utsushi H, Tamiru M, Takuno S, Innan H, Cano LM, Kamoun S, Terauchi R (2013) QTL‐

155

seq: rapid mapping of quantitative trait loci in rice by whole genome resequencing of DNA from two bulked populations. The Plant Journal 74:174- 183

Teri AM, Francis SC, Nancy JC, David BG, Lucia AH, David JH, Mark IM, Erin MR, Lon RC, Aravinda C, Judy HC, Alan EG, Augustine K, Leonid K, Elaine M, Charles NR, Montgomery S, David V, Alice SW, Michael B, Andrew GC, Evan EE, Greg G, Jonathan LH, Trudy FCM, Steven AM, Peter MV (2009) Finding the missing heritability of complex diseases. Nature 461:747

Tian Z, Zeng D, Wang Y, Yu J, Gu M, Li J, Qian Q, Liu Q, Yan M, Liu X, Yan C, Liu G, Gao Z, Tang S (2009) Allelic diversities in rice starch biosynthesis lead to a diverse array of rice eating and cooking qualities. Proceedings of the National Academy of Sciences of the United States of America 106:21760-21765

Traore K, McClung A, Fjellstrom R (2006) Characterization of Novel Rice Germplasm from West Africa and Genetic Marker Associations with Rice Cooking Quality. Presentation at the Africa Rice Congress. Africa Rice Center, Dar es Salaam, Tanzania (Available at http://www.warda.cgiar.org/africa-rice- congress/presentations.html)

Umeda M, Ohtsubo H, Ohtsubo E (1991) Diversification of the rice Waxy gene by insertion of mobile DNA elements into introns. 遺伝学雑誌 66:569-586

Venuprasad R, Dalid CO, Del Valle M, Zhao D, Espiritu M, Sta Cruz MT, Amante M, Kumar A, Atlin GN (2009) Identification and characterization of large-effect quantitative trait loci for grain yield under lowland drought stress in rice using bulk-segregant analysis. Theoretical and Applied Genetics 120:177-190

Vikram P, Swamy BPM, Dixit S, Ahmed H, Cruz MTS, Singh AK, Ye G, Kumar A (2012) Bulk segregant analysis: “An effective approach for mapping consistent- effect drought grain yield QTLs in rice”. Field Crops Research 134:185-192

Wambugu P, Furtado A, Waters D, Nyamongo D, Henry R (2013) Conservation and utilization of African Oryza genetic resources. Rice 6:29-29

Wang K, Wambugu PW, Zhang B, Wu AC, Henry RJ, Gilbert RG (2015) The biosynthesis, structure and gelatinization properties of starches from wild and cultivated African rice species (Oryza barthii and Oryza glaberrima). Carbohydrate Polymers 129:92-100

Wang MH, Yu Y, Haberer G, Marri PR, Fan CZ, Goicoechea JL, Zuccolo A, Song X, Kudrna D, Ammiraju JSS, Cossu RM, Maldonado C, Chen J, Lee S, Sisneros

156

N, de Baynast K, Golser W, Wissotski M, Kim W, Sanchez P, Ndjiondjop MN, Sanni K, Long MY, Carney J, Panaud O, Wicker T, Machado CA, Chen MS, Mayer KFX, Rounsley S, Wing RA (2014) The genome sequence of African rice (Oryza glaberrima) and evidence for independent domestication. Nature genetics 46:982-988

Washida H, Wu C-Y, Suzuki A, Yamanouchi U, Akihama T, Harada K, Takaiwa F (1999) Identification of cis-regulatory elements required for endosperm expression of the rice storage protein glutelin gene GluB-1. Plant molecular biology 40:1-12

Watanabe H, Futakuchi K, Jones MP, Teslim I, Sobambo BA (2002) Brabender viscogram characteristics of interspecific progenies of Oryza glaberrima Steud and O. sativa L. Nippon Shokuhin Kagaku Kogaku Kaishi 49:155-165

Wu CY, Washida H, Onodera Y, Harada K, Takaiwa F (2000) Quantitative nature of the Prolamin‐box, ACGT and AACA motifs in a rice glutelin gene promoter: minimal cis‐element requirements for endosperm‐specific gene expression. The Plant Journal 23:415-421

Wu J, Mizuno H, Hayashi?tsugane M, Ito Y, Chiden Y, Fujisawa M, Katagiri S, Saji S, Yoshiki S, Karasawa W, Yoshihara R, Hayashi A, Kobayashi H, Ito K, Hamada M, Okamoto M, Ikeno M, Ichikawa Y, Katayose Y, Yano M, Matsumoto T, Sasaki T (2003) Physical maps and recombination frequency of six rice chromosomes. Plant Journal 36:720-730

Yang J, Jiang H, Yeh CT, Yu J, Jeddeloh JA, Nettleton D, Schnable PS (2015) Extreme‐phenotype genome‐wide association study (XP‐GWAS): a method for identifying trait‐associated variants by sequencing pools of individuals selected from a diversity panel. The Plant Journal 84:587-596

Yang Z, Huang D, Tang W, Zheng Y, Liang K, Cutler AJ, Wu W (2013) Mapping of quantitative trait loci underlying cold tolerance in rice seedlings via high- throughput sequencing of pooled extremes. PLoS ONE 8(7): e68433. doi:10.1371/journal.pone.0068433

Yonemaru J-i, Yamamoto T, Fukuoka S, Uga Y, Hori K, Yano M (2010) Q-TARO: QTL Annotation Rice Online Database 3:194

Zhou P, Tan Y, He Y, Xu C, Zhang Q (2003) Simultaneous improvement for four quality traits of Zhenshan 97, an elite parent of hybrid rice, by molecular marker- assisted selection. Theoretical and Applied Genetics 106:326-331

157

Zhu Y, Cai X-L, Wang Z-Y, Hong M-M (2003) An Interaction between a MYC Protein and an EREBP Protein Is Involved in Transcriptional Regulation of the Rice Wx Gene. Journal of Biological Chemistry 278:47803-47811

Zou C, Wang P, Xu Y (2016) Bulked sample analysis in genetics, genomics and crop improvement. Plant biotechnology journal, pp 1941-1955

158

Chapter 6

Analysis of candidate genes for amylose synthesis in rice using sequencing based bulk segregant analysis

Abstract Although amylose synthesis has been extensively studied, the genetic architecture of this complex trait is poorly understood. This study sought to study the genetic determinants of amylose biosynthesis by exploring candidate genes involved in its regulation. An integrative genetics approach involving a combination of sequencing based whole genome bulk segregant analysis, gene co-expression analysis and analysis of conserved regulatory motif was used. An interspecific cross between Asian and African rice was used in this genetic mapping study. Various regulatory regions that are putatively involved in amylose synthesis were identified among them being GBSS1 and starch synthase IVa. Two candidate transcription factors namely a NAC and CCAAT-HAP5 TF were found to be co-expressed with starch synthesis related genes which have roles in amylose synthesis. These TFs also share key cis regulatory motifs with some of these amylose associated starch genes. These transcription factors are highly expressed in the endosperm and therefore exhibit an expression pattern similar to that of GBBS1. Based on these findings it can be speculated that these TFs may be involved in amylose synthesis through transcriptional regulation of the GBSS1 structural gene. These candidate genes are useful to breeders and biotechnologist as they provide novel targets for manipulating amylose content.

Keywords: Amylose, candidate gene, gene co-expression, starch biosynthesis, conserved motif

6.1. Introduction

Rice is the most important source of calories to the human population, predominantly in the form of starch. Starch is composed of two insoluble polymers: amylose and amylopectin. Amylose is essentially a linear glucan that is composed largely of α-1, 4- linkages with very few α-1, 6-branches and makes up about 20-30% of the starch in rice. Amylopectin on the other hand is the major component of starch in rice and is a branched molecule with many more α -1, 6-branches. The starch biosynthesis

159 pathway is one of the most well characterized pathways providing a good understanding of the various enzymes involved in the regulation of starch synthesis. These enzymes include; ADP-glucose pyrophoshorylases, starch synthases, starch branching enzymes and starch de-branching enzymes, with most of them having several isoforms (James et al. 2003; Tetlow et al. 2004). The first step in starch synthesis involves the activation of glucose by adenosine-5’-diphosphate glucose pyrophosphorylase (AGPase) to produce adenosine-5’-diphosphate glucose which is subsequently used as a substrate for starch synthases (SSs). The starch synthases use the glycosyl part of adenosine-5’-diphosphate glucose to extend a pre-existing α- (1, 4)-linked glucan primer to synthesize amylose and amylopectin. Granule bound starch synthase 1 (GBSS1) is primarily involved in synthesis of amylose content while several isoforms of starch synthase are involved in extending the α-1, 4-links of amylopectin. Starch branching enzymes add α-1, 6-branches to amylopectin while starch debranching enzymes cleave the growing chains at α-1, 6-linkages (Ball and Morell 2003; Tetlow et al. 2004).

Despite the intense research and tremendous progress made in understanding mechanisms regulating starch metabolism, there exists significant gaps in knowledge (Bahaji et al. 2014; Geigenberger 2011; Wang et al. 2013). Amylose biosynthesis is known to be primarily regulated by the waxy locus which encodes GBSS1. Specifically, amylose content is influenced by the allelic variations in GBSS1, its transcriptional regulation through cis regulatory elements and efficiency in GBSS1 mRNA alternate splicing as this determines the amount of transcripts (Ayres et al. 1997; Bligh et al. 1995; Hirano et al. 1998; Larkin and Park 2003; Mikami et al. 2008). Regulation of GBSS1 expression occurs both at the transcriptional and post transcriptional levels. The MYC transcription factor, OsBP-5 has been reported to interact with the ethylene- responsive element binding protein (EREBP) to synergistically regulate transcription of the waxy gene thus directly influencing amylose content (Zhu et al. 2003). Functional characterization of OsbZIP58, a basic leucine zipper (bZIP) transcription factor revealed that it binds to the promoter of key starch genes, among them GBSS1, thereby acting as a key transcription regulator of amylose synthesis (Wang et al. 2013). Despite this increased understanding of the genetic mechanisms underlying amylose synthesis particularly at the transcriptional level, this pathway to a great

160 extent remains poorly understood (Wang et al. 2015). Although some minor genes and modifiers have also been reported to play some role in the control of amylose synthesis in addition to the waxy gene (Dobo et al. 2010; Kumar and Khush 1988; Tian et al. 2009), there seems to be very little or no information available about these minor genes and modifiers. Due to the limited variation that is usually analysed in individual studies, elucidating the role of minor genes in regulating complex pathways is challenging (Tian et al. 2009). The complexity and inadequate understanding of amylose biosynthesis is revealed by reports which show key starch biosynthesis genes that were previously thought not to be involved in amylose biosynthesis, also play minor roles in this pathway (Sun et al. 2011; Tian et al. 2009). This poor knowledge of the genetic architecture of amylose synthesis acts as a major obstacle to breeders in manipulating amylose content through genetic engineering or traditional breeding approaches.

Identification of candidate genes underlying a phenotypic trait is important as it helps biologists gain a much deeper understanding of the pathway in question. The current genomic revolution characterized by cheap next generation sequencing technologies and availability of well annotated reference genome sequences is driving many areas of functional genomics. In a single run, it is possible to obtain information about all predicted genes in a genome thereby providing a robust approach for candidate gene identification and analysis. In order to elucidate the molecular underpinnings of amylose synthesis, a candidate genes study was conducted. This study used an integrative genomics analysis to dissect the genetic architecture of amylose biosynthesis. Next generation sequencing aided bulk segregant analysis was used to generate and analyse genomic data for 2 distinct phenotype-based amylose pools obtained from progenies of an interspecific cross between African and Asian rice. In addition, gene co-expression and gene regulatory analysis of the candidate genes was undertaken. Through the combined approach, known and novel candidate genes/TFs, among them GBSS1 were found to be associated with AC synthesis. The transcription factors investigated belong to various families namely basic helix-loop- helix (bHLH), NAC and CCAAT_HAP5.

161

6.2. Results

6.2.1. Allele frequency analysis identifies putative candidate genes and TFs

This study has analysed candidate genes which have previously been reported to have some influence in starch and amylose biosynthesis. In total, 89 starch synthesis related genes were selected for candidate gene analysis (Table 32 and 33). Analysis of polymorphisms between the two amylose bulks led to the identification of 5 genes putatively having some roles in amylose synthesis. These putative candidate genes included 2 protein coding genes and 3 transcription factors. The protein coding genes included GBSS1 and Starch synthase IVa. The identified TFs belonged to basic helix- loop-helix (bHLH), NAC and CCAAT_HAP5 families (Table 27). The CCAAT_HAP5 family TF identified in this study (LOC_Os01g01290) encodes a histone-fold domain containing protein.

Table 27: Markers and candidate genes putatively associated with amylose content

Locus ID Description/Putati GO Transcription SN Position Amino ve function factor P acid name/cis change regulatory motif LOC_Os06g04200 Granule bound carbohydrate G/A 1769686 Asp to Asn starch synthase metabolic (GBSS1) process LOC_Os01g01290 Histone-like sequence- CCAAT_HAP5 C/T 140693 Pro to Ser transcription factor specific DNA and archaeal binding histone LOC_Os01g01600 Ethylene- OsbHLH071 T/C 304355 Leu to Pro responsive protein related LOC_Os11g31330 No apical meristem Regulation NAC G/A 1828861 Val to Ile protein of 6 transcription and DNA binding LOC_Os01g52250 Starch synthase IVa Starch T/A 3003850 Ser to Thr metabolic 2 process

Though several genes which have previously been reported to be involved in amylose synthesis had polymorphisms between the 2 bulks, they exhibited an allelic segregation pattern where one of the parental alleles was significantly enriched in both

162 amylose bulks. This suggests that they may not be functionally involved in amylose synthesis in the system currently under study (Table 28). These include MADS29 (LOC_Os02g07430) TF (Yin and Xue 2012) belonging to the MADS-box family and Basic leucine zipper (bZIP) TF (OsbZIP58) (Wang et al. 2013). Other genes that have been reported to be involved in amylose biosynthesis include SSIIa, ADP-glucose pyrophosphorylase large subunit 1, SSIIIa, SSI and pullulanase (Sun et al. 2011; Tian et al. 2009). Similarly, analysis of polymorphisms in the two bulks suggests that none of these may be involved in amylose synthesis in the population studied here. Among the 10 starch synthase isoforms that were analysed in this study, SSI had the largest number of polymorphisms between the 2 bulks, with a total of 5 non-synonymous SNPs. This gene is found in the 2.1 Mb genomic interval including GBSS1 that was reported in chapter 5 (Fig. 20). It is flanked by genes showing significant divergence in allele frequencies in both bulks. However, analysis of the 5 polymorphic sites indicates that though the allele frequency of the two parental alleles significantly differed in the high bulk, this was not the case in the low bulk. A similar pattern was exhibited by SSIIIa in 3 of the 4 polymorphic sites, with the fourth marker showing significant over representation of one parental allele in both bulks. It therefore appears that the presence of SSI in the 2.1Mbp interval was due to linkage drag and not due to its functional association with amylose content. No polymorphic markers were found in starch branching enzyme IIa, Isoamylase I, starch synthase IIa, starch synthase IIb, ADP-glucose pyrophosphorylase small subunit 2 and granule bound starch synthase II.

Table 28: Amylose biosynthesis genes showing no functional involvement in amylose synthesis in the system presently under study

Gene name Loci Ref. OsbZIP58 LOC_Os07g08420 Wang et al. (2013) MADS29 LOC_Os02g07430 Yin and Xue (2012) Soluble starch synthase 1 (SSI) LOC_Os06g06560 Sun et al. (2011); Tian et al. (2009) ADP-glucose pyrophosphorylase large LOC_Os05g50380 Tian et al. (2009) subunit 1 Pullulanase LOC_Os04g08270 Tian et al. (2009) Starch branching enzyme 1 LOC_Os06g51084 Sun et al. (2011) OsBP-5 LOC_Os02g56140 Zhu et al. (2003) Starch synthase IIa (SSIIa) LOC_Os06g12450 Sun et al. (2011) Starch synthase IIIa (SSIIIa) LOC_Os08g09230 Tian et al. (2009)

163

6.2.2. Gene expression profiles and gene co-expression analysis Gene expression profiles (Fig. 22-26) for the various candidate genes were obtained from RiceXPro. Figure 22 shows the expression profile of GBSS1, with gene expression rising steadily from 7days after flowering (DAF) to reach a maximum at 42DAF in the endosperm. Most of the candidate genes had high endospermic gene expression thus exhibiting patterns similar to that of GBSS1. Relating to endospermic expression, LOC_Os06g04790, a member of the HAD superfamily exhibited a pattern closely similar to that of GBSS1 which rises steadily from 7DAF to reach a peak at 42 DAF.

Gene co-expression analysis is based on the assumption that genes with similar expression profiles are likely to have some functional linkage. It has proven to be a powerful approach in the identification of candidate genes and transcription factors regulating various biological processes (Aoki et al. 2007; Fang-Fang and Hong- Wei 2010). Gene co-expression analysis of the identified amylose candidate genes was conducted using known starch biosynthesis genes as guide genes. Analysing the candidate genes using this approach showed that both NAC (LOC_Os11g31330) and CCAAT_HAP5 (LOC_Os01g01290) were co-expressed with various starch genes. NAC (LOC_Os11g31330) was co-expressed with 6 starch biosynthesis genes among them being GBSS1 and MADS29, both of which are involved in regulating amylose biosynthesis. Members of the NAC and CCAAT-box families are among the genes found in the GBSS1 co-expression network (Fig. 27). These co-expression profiles suggest the functional involvement of these TFs in starch biosynthesis (Table 29). Similarly, CCAAT_HAP5 was also co-expressed with GBSS1. These two candidate TFs did not show any co-expressed patterns with starch synthase IVa.

164

Figure 22: Spatial-temporal expression of granule bound starch synthase 1 (GBSS1) (LOC_Os06g04200)

Figure 23: Spatial-temporal expression of HAD superfamily phosphatase (LOC_Os06g04790)

165

Figure 24: Spatial-temporal expression of histone-fold domain containing protein (LOC_Os01g01290)

Figure 25: Spatial-temporal expression of NAC Transcription factor (LOC_Os11g31330)

166

Figure 26: Spatial-temporal expression of OsbHLH071 (LOC_Os01g01600)

167

Figure 27: GBSS 1 co-expressed gene network showing some of the co- expressed TFs. Some of the co-expressed TFs include NAC, MYB, CCAAT and Dof. GBSS1 gene is shown by the blue circle. The network was produced in cytoscape web (Lopes et al. 2010).(Lopes et al. 2010).

168

Table 29: Amylose candidate genes/transcription factors showing co- expression with known starch biosynthesis genes

Gene/transcription factor Locus ID Name Co-expressed starch genes Weighed PCC LOC_Os01g01290 Histone-fold domain Granule-bound starch synthase 1 0.709491 containing protein Starch branching enzyme 1 0.690723 ADP-glucose pyrophosphorylase 0.725921 subunit SH2.

LOC_Os11g31330 NAC ADP-glucose pyrophosphorylase 0.748331 large subunit 2 Starch branching enzyme 3 0.732204 Pullulanase 0.636763 Granule-bound starch synthase 1 0.785246 MADS29 0.83 Starch branching enzyme 1 0.744933

6.2.3. Analysis of conserved regulatory motifs

Conserved cis regulatory motifs in the candidate genes were analysed using the RiceFREND platform (Sato et al. 2013). Table 30 shows a list of regulatory motifs that were present in the candidate genes. Notably, conserved motifs that confer seed and endosperm specific gene expression were among them. Among these include the ACGT cis regulatory elements which are core motifs that are found in a wide range of plant genes and are required to activate transcription of target genes. Previous studies have shown that the ACGT elements are important in the transcriptional regulation of starch genes (Wang et al. 2013). Analysis of the candidate genes show that both NAC and CCAAT-HAP5 TFs shared these ACGT core elements with GBSS1. Analysis of key starch biosynthesis genes among them GBSS1, starch branching enzyme and pullulanase, all of which showed similar co-expression profiles with the candidate genes, shows that they also have the ACGT core motif. Both GBSS1 and CCAAT- HAP TF possess the CCAAT-box. Also present were motifs related to regulation of seed protein storage genes particularly those involved in glutelin synthesis.

169

Table 30: Cis regulatory motifs present in the candidate genes

Sl.N Motif name cis element Function Reference o 1 AACACOREOSGLUB1 AACAAAC Core of AACA motifs (Washida et found in rice (O.s.) al. 1999; Wu glutelin genes. et al. 2000) Required in combination with GCN4 and ACGT to confer endosperm and seed specific expression 2 ABREMOTIFIOSRAB16 AGTACGTGGC Motif required for ABA (Ono et al. B responsiveness 1996) 3 ACGTOSGLUB1 GTACGTG ACGT motif found in (Washida et GluB-1 gene in rice. al. 1999; Wu Interacts with GCN4, et al. 2000) AACA and ACGT to confer endosperm expression 4 GLUTEBP1OS AAGCAACACACA Binding site in the AC promoter region of glutelinGt3 gene 5 PROLAMINBOXOSGLU TGCAAAG Involved in quantitative (Wu et al. B1 regulation of the GluB-1 2000) gene 6 CCAAT-Box CCAAT Involved in multiple (Laloum et pathways including al. 2012) regulation of seed specific expression 7 GLUTAACAOS AACAAACTCTAT glutelin common motif; (Takaiwa "AACA motif and Oono 1990) 8 GT1CONSENSUS GRWAAW binding site in many light-regulated genes,

6.3. Discussion

6.3.1. Genetic control of amylose

The genetics of amylose especially in different genetic backgrounds is not well understood (Fitzgerald et al. 2009). Despite being extensively studied, there seems to be no consensus on the genetic determinants of amylose. While a majority of studies have reported amylose to be solely governed by the waxy locus (Mikami et al. 2008; Septiningsih et al. 2003; Tan et al. 1999; Zhou et al. 2003), others have reported it to be controlled by several gene loci in addition to the Wx gene (Aluko et al. 2004; He et al. 1999). Additionally, other studies have reported amylose synthesis to be under the

170 control of some unidentified non-Wx genes (Li et al. 2004). These discrepancies in results can be attributed to weakness of the QTL mapping approaches; the approach that has been most used in genetic mapping for rice grain quality.

Current advances in rice genomics research offers opportunities for addressing some of these limitations. Using association genetics and other advances in molecular analysis, many more genes have been reported to have roles in amylose biosynthesis thus providing increased understanding on this pathway (Sun et al. 2011; Tian et al. 2009). While it has previously been reported that isoforms of soluble starch synthase are solely involved in amylopectin biosynthesis (Smith et al. 1997), a study by Tian et al. (2009) shows that some isoforms of soluble starch synthase also play an essential role in amylose synthesis. In this study, the power of NGS was used to analyse candidate genes involved in the regulation of amylose. To achieve this, whole genome based bulk segregant analysis of progenies of an interspecific cross between African and Asian rice was conducted. This genomic-enabled approach has revolutionized the ability to link phenotypic variations to genotypic data.

6.3.2. Association analysis identifies known and novel candidate genes/TFs

Association analysis led to the identification of GBSS1, the major loci involved in amylose synthesis as well as other minor loci (Table 27) which have previously been reported to be involved in starch synthesis. This analysis shows that majority of the candidate genes that have previously been reported to be involved in amylose synthesis do not seem to be associated with amylose synthesis in the system under study. This finding is consistent with results of many studies which have reported that the effect of various starch related genes and their interaction depends on genetic background as well as environmental factors (Hsu et al. 2014). The only locus that has consistently been stable across genetic backgrounds is the waxy locus which encodes GBSS1, indicating that it is a major effect locus involved in amylose biosynthesis (Mikami et al. 2008; Septiningsih et al. 2003; Tan et al. 1999; Zhou et al. 2003).

Among the putative candidate genes are several families of transcription factors including a member of the CCAAT_HAP5 family. The CCAAT_HAP5 TF belong to the Nuclear Factor Y (NF-Y), also referred to as CCAAT box binding factor (CBF) or Heme

171

Activator Protein (HAP), and are heterotrimeric complex TFs that preferentially bind to the CCAAT sites of target genes thereby regulating their expression. These TFS have been reported to be involved in the regulation of important plant processes such as flowering, photosynthesis, root elongation, drought resistance (Cao et al. 2011; Laloum et al. 2012). The CCAAT_HAP5 TF that was identified in this study (LOC_Os01g01290) is an endosperm specific gene (Yang et al. 2016) and encodes a histone domain containing protein. It belongs to the NF-YC family which is one of the 3 sub units of the nuclear factor family. The other sub units of this family being NF-YB and NF-YA. The NF-YB and NF-YC initially form a heterodimer in the cytoplasm before being exported into the nucleus where they interact with the NF-YA to form heterotrimeric complex which then binds to CCAAT sites (Steidl et al. 2004). Though this TF has previously been reported to be co-expressed with starch synthesis genes showing preferential expression in both the endosperm and vegetative tissues (Fang- Fang and Hong-Wei 2010), their specific role in starch biosynthesis is however unknown.

NAC TFs form one of the largest group of plant-specific TFs, with over 150 of these having been identified in rice (Nuruzzaman et al. 2010). The identification of NAC TFs (LOC_Os11g31330) as putatively having a role in amylose synthesis is consistent with previous reports indicating that it is co-expressed with several starch biosynthesis genes (Fang-Fang and Hong-Wei 2010). NAC TFs have been widely reported to be involved in the control of leaf senescence and are involved in the remobilization of nutrients into the developing seed (Sperotto et al. 2009). They have been found to be upregulated during grain filling and are associated with seed size. LOC_Os11g31330 is among NAC TFs that exhibit seed specific expression (Mathew et al. 2016). In wheat, a NAC homolog (NAM-B1) was found to cause increased protein content, zinc and iron by accelerating senescence which subsequently increased nutrient remobilization (Uauy et al. 2006). Though there appears to be no documented role of NAC in starch synthesis, nutrient remobilization may potentially lead to an increase in starch content.

NAC TFs have also been reported to play roles in several plant developmental processes and in regulating plant defence mechanisms as well as other biotic and

172 abiotic stresses (Nuruzzaman et al. 2013; Sperotto et al. 2009). These biotic and abiotic stresses have an effect on photosynthesis which ultimately impacts on seed reserve accumulation and it is thereby possible that in addition to their potential direct role in grain filling, these TFs may also have an indirect role in starch synthesis. Starch accumulation in seeds is not evident until after 5DAF (Hirose and Terao 2004). The gene expression profile of NAC (LOC_Os11g31330) indicates that its endospermic expression is at its peak about 7 days after flowering (DAF) before it drops steadily and then rises marginally at 42DAF (Fig. 25). This gene expression profile suggests a role in starch accumulation. Further functional characterization of these TFs perhaps loss of function and over expressing experiments is required in order to elucidate their role in starch biosynthesis. Analysis of the gene expression patterns, co-expression profiles and transcriptional regulatory analysis of the TFs will be presented later in this chapter.

Rice has 10 isoforms of starch synthase and among these, starch synthase IV is perhaps the least studied. Gene expression patterns for the various isoforms were studied, with SSIVa showing relatively constant expression during grain filling (Hirose and Terao 2004). SSIV has recently been reported to be involved in controlling starch granule formation but does not seem to affect starch structure (Crumpton-Taylor et al. 2013). Though results of this study suggest that starch synthase IVa is involved in amylose biosynthesis, its role in this pathway remains obscure.

A HAD superfamily phosphatase (LOC_Os06g04790) exhibits an expression profile that is closest to that of GBSS1 where it rises steadily from 7DAF to reach a peak at 42DAF (Fig. 23). HAD phosphatases function in multiple metabolic contexts with the secondary metabolism being of interest to this study, where they are involved in carbohydrate and lipid biosynthesis. The role and functional importance of this enzyme in carbohydrate biosynthesis is further indicated by its great similarity to Sucrose-6 (F)-phosphate phosphohydrolase which as reported by Lunn et al. (2000) is involved in photosynthetic carbon assimilation and catalyzes the final steps in sucrose biosynthesis pathway. In plants, the N-terminal region of Sucrose-6 (F)- phosphate phosphohydrolase shows significant similarity to the proteins of the HAD (haloacid dehalogenase) superfamily. Homology searches show that the maize

173

Sucrose-6 (F)-phosphate phosphohydrolase is indeed a member of the HAD hydrolase/phosphatase superfamily and shares 3 motifs which are typical of members of this family (Aravind et al. 1998; Lunn et al. 2000).

6.3.3. Gene co-expression helps identify functionally related genes

Gene co-expression analysis has successfully been used in identifying regulatory factors involved in various metabolic pathways. The two candidate TFs that were found to be putatively associated with amylose were co-expressed with various starch genes (Table 29). The major starch genes that were co-expressed with these candidate TFs include GBSS1, ADP-glucose pyrophosphorylase, starch branching enzymes and pullulanase.This perhaps suggests that though these starch genes were not functionally associated with amylose in the mapping population studied here, they might have some role in regulating amylose synthesis. This is consistent with the findings of Tian et al. (2009) and Sun et al. (2011) who reported that ADP-glucose pyrophosphorylase (AGPase), pullulanase, SSIIIa, starch branching enzyme 1 (SBE1) and SSI play minor roles in amylose synthesis by interacting with GBSS1. These authors suggest that amylose synthesis starts at the initial stages of starch biosynthesis where glucose-1-phosphate is changed to adenosine-5’-diphosphate glucose, with AGPase playing the major role in this step. Reducing the activity of AGPase has been found to lead to reduced levels of amylose perhaps due to changes in the concentration of ADP-Glucose (Clarke et al. 1999; Lloyd et al. 1999). The potential role of pullulanase in amylose synthesis had earlier been suggested by Marion van de et al. (1998) who reported that amylose is synthesized by cleavage from amylopectin, perhaps through the actions of a starch debranching enzyme such as pullulanase. Sun et al. (2011) found two SNPs in SBE1 that were highly associated with apparent amylose content thereby showing the potential role of this gene in AC synthesis.

NAC TF is co-expressed with 6 starch biosynthesis genes (Table 29). Co- expression with such a high number of starch genes might be an indicator of the functional importance of this TF in the starch biosynthesis pathway. Among the genes with which it shows co-expression patterns include GBSS1 and MADS29 (LOC_Os02g07430), both of which have roles in amylose biosynthesis. This co-

174 expression pattern was only detected using the Plant Promoter Analysis Navigator (PlantPAN; http://PlantPAN2.itps.ncku.edu.tw), (Chow et al. 2016) and not in the RiceFrend platform. MADS29 belongs to the MADS-box family which are homeotic genes that are primarily associated with flower development (Kater et al. 2006). Functional characterization of MADS29 through transformation, phylogenetic and transcriptome analysis conducted by Nayar et al. (2013) revealed that it is a cereal specific transcription factor which, in rice, is involved in seed development. Reduction in gene expression of MADS29 not only affected seed set, but also resulted in deformed seeds characterized by loosely packed starch granules with low apparent amylose content (Nayar et al. 2013; Yin and Xue 2012). Its involvement in amylose synthesis is through maintenance of hormone homeostasis where it is specifically thought to regulate cytokinin and auxin homeostasis. High levels of cytokinin have been reported to lead to an increase in the transcripts of key starch biosynthesis genes among them GBSS1, AgpS and starch branching enzyme (Miyazawa et al. 2002; Nayar et al. 2013; Sakai et al. 1999). Co-expression of NAC TF and MAD29 points to their functional relatedness and perhaps suggests that they have the same regulatory mechanisms in starch synthesis.

As highlighted earlier, CCAAT-HAP5 which encodes the histone-fold domain containing protein had previously been reported to be co-expressed with various key starch biosynthesis genes (Fang-Fang and Hong-Wei 2010; Yang et al. 2016) suggesting they have a potential regulatory role in the starch biosynthesis pathway. This study shows that it is co-expressed with GBSS1, SBEI and ADP-glucose pyrophosphorylase subunit SH2.

6.3.4. Integrative genetic analysis provides insights on candidate genes and transcriptional regulatory pathways in amylose biosynthesis

This study conducted an integrative genetic analysis using a combination of approaches namely genomic based bulk segregant analysis, gene co-expression analysis and analysis of regulatory motifs. Genes showing similar expression patterns are likely to be regulated by the same TFs and regulatory motifs (Ma et al. 2013). An Integrated approach involving analysis of regulatory motifs in gene co-expression networks is a powerful approach to conduct functional annotation of genes and gain insights on transcriptional regulatory relationships (Sarkar and Maitra 2008;

175

Vandepoele et al. 2009; Yang et al. 2015). This study used candidate gene approach where an association analysis was first undertaken using bulk segregant analysis. This enabled identification of candidate genes putatively involved in amylose synthesis. The identified amylose candidate genes were then subjected to gene co-expression and finally regulatory motif analysis. Using this approach, this study was able to identify known and novel genes putatively involved in amylose synthesis. NAC and CCAAT-HAP5 TFs show similar co-expression profiles with key starch genes and also share key transcription factor binding sites. Gene expression profiles of the 2 TFs show that they are highly expressed in the endosperm, an expression profile similar to that of GBSS1 (Fig. 24 and 25). These lines of evidence suggest that these TFs individually form a co-regulated network with key starch genes. It would be reasonable to speculate that these TFs may be involved in amylose synthesis through transcriptional regulation of the GBSS1 structural gene. Although some members of the basic helix-loop-helix transcription factor family have been found to have roles in seed development especially during grain filling (Grimault et al. 2015), OsbHLH071 (LOC_Os01g01600) does not seem to have any obvious role in starch synthesis. Its gene expression profile shows that its expression in the endosperm is almost non- existent though it is highly expressed in the stem and floral organs (Fig. 26).

6.4. Conclusion

This study acts as a confirmation on the role of various genes that have previously been reported to be putatively involved in AC synthesis. The NAC and CCAAT-HAP5 TFs identified in this study may be involved in the transcriptional regulation of GBSS1 structural gene. Knowledge of these candidate genes and genetic markers as well as the association that the specific alleles have with amylose content, provides breeders and biotechnologists with a robust platform that helps them to manipulate amylose content in a more predictable manner. This information on novel targets and markers can particularly be useful to breeders as it can assist in efficiently deploying marker assisted selection thus leading to speedy and cost effective release of superior varieties.

176

6.5. Methods

6.4.1. DNA extraction and sequencing

Details of the mapping population, DNA extraction, quantification, sequencing and data processing have been presented in Chapter 5.

6.4.2. Variant calling

Variations were called using CLC Genomics Workbench 9.0 (CLC Bio, a QIAGEN Company, Aarhus, Denmark) based on the following parameters: - minimum read count: 15, minimum coverage: 20 and allele frequency of 60%.

6.4.3. Identification of starch related candidate genes and association analysis

A total of 89 starch synthesis related genes which have previously been reported to have some roles, either confirmed or putative, in starch/amylose biosynthesis were selected (Table 32 and 33). In order to identify amylose candidate genes, an association analysis based on genomic based bulk segregant analysis data was first conducted. This involved analysing for polymorphisms between the 2 amylose bulks through allele frequency analysis. The number of reads supporting each of the two alternate alleles at any given genetic marker site were calculated in order to assess alleles showing significant divergence in the bulks. The analysis focused on those positions where there was significant deviations from the expected allele frequency in both bulks based on a chi square test. Specifically, the chi square test enabled the identification of marker sites where the allele segregation pattern did not follow the 1:1 ratio expected of phenotype neutral genomic regions. The selected candidate genes were then subjected to gene co-expression analysis.

6.4.4. Gene expression and co-expression analysis

Gene expression data for the various candidate genes as well as other known starch biosynthesis genes was obtained from the Rice Genome Annotation Project (Kawahara et al. 2013) (http://rice.plantbiology.msu.edu/index.shtml). Gene expression profiles for the various co-expressed genes were obtained from the Rice Expression Profile database (RiceXPro version 3.0). The gene expression profile

177 found in RiceXPro is obtained from microarray data of various tissues at different growth stages (Sato et al. 2011).

In addition, using the principle of gene co-expression, the current study sought to identify candidate genes with similar expression patterns with known starch biosynthesis genes. The principle of gene co-expression is based on the assumption that genes with similar expression patterns are likely to be functionally related. A total of 23 starch synthesis related genes (Ohdan et al. 2005) were used as guide genes (Table 31) to conduct gene co-expression analysis using the Rice Functionally Related gene Expression Network Database (RiceFREND) platform (Sato et al. 2013) and PlantPAN (Chow et al. 2016). RiceFREND is a platform for retrieving co-expressed gene networks. A Pearson Correlation Coefficient (PCC) cut off of 0.6 or greater was used to identify co-expressed genes (Aoki et al. 2007; Fang-Fang and Hong-Wei 2010).

6.4.5. Conserved regulatory motif analysis

The promoter regions of the candidate genes were analysed for conserved regulatory motifs using PlantPAN (The Plant Promoter Analysis Navigator http://PlantPAN2.itps.ncku.edu.tw) (Chow et al. 2016).

178

Table 31: Starch biosynthesis genes used as guide genes in gene co- expression analysis

Category Gene Gene product Locus No. name AGPase ADP-glucose pyrophosphorylase large AGPL1 LOC_Os05g50380 subunit 1 ADP-glucose pyrophosphorylase large AGPL2 LOC_Os01g44220 subunit 2 ADP-glucose pyrophosphorylase large AGPL3 LOC_Os03g52460 subunit 3 ADP-glucose pyrophosphorylase large AGPL4 LOC_Os07g13980 subunit 4 ADP-glucose pyrophosphorylase small AGPS1 LOC_Os09g12660 subunit 1 ADP-glucose pyrophosphorylase small AGPS2 LOC_Os08g25734 subunit 2 Starch synthases SSI Starch synthase I LOC_Os06g06560 SSIIa Starch synthase IIa LOC_Os06g12450 SSIIb Starch synthase IIb LOC_Os02g51070 SSIIc Starch synthase IIc LOC_Os10g30156 SSIIIa Starch synthase IIIa LOC_Os08g09230 OsSSIII Starch synthase IIIb LOC_Os04g53310 b SSIVa Starch synthase IVa LOC_Os01g52260 SSIVb Starch synthase IVb LOC_Os05g45720 GBSSI Granule-bound starch synthase I LOC_Os06g04200 GBSSII Granule-bound starch synthase II LOC_Os07g22930 Branching BEI Starch branching enzyme I LOC_Os06g51084 enzymes BEIIa Starch branching enzyme IIa LOC_Os04g33460 BEIIb Starch branching enzyme IIb LOC_Os02g32660 Starch ISA1 Isoamylase I LOC_Os08g40930 debranching ISA2 Isoamylase II LOC_Os05g32710 enzymes ISA3 Isoamylase III LOC_Os09g29404 PUL Pullulanase LOC_Os04g08270

179

Table 32: Candidate protein coding genes

Locus No. Gene name Gene product LOC_Os05g50380 AGPL1 ADP-glucose pyrophosphorylase large subunit 1 LOC_Os01g44220 AGPL2 ADP-glucose pyrophosphorylase large subunit 2 LOC_Os03g52460 AGPL3 ADP-glucose pyrophosphorylase large subunit 3 LOC_Os07g13980 AGPL4 ADP-glucose pyrophosphorylase large subunit 4 LOC_Os09g12660 AGPS1 ADP-glucose pyrophosphorylase small subunit 1 LOC_Os08g25734 AGPS2 ADP-glucose pyrophosphorylase small subunit 2 LOC_Os06g06560 SSI Starch synthase I LOC_Os06g12450 SSIIa Starch synthase IIa LOC_Os02g51070 SSIIb Starch synthase IIb LOC_Os10g30156 SSIIc Starch synthase IIc LOC_Os08g09230 SSIIIa Starch synthase IIIa LOC_Os04g53310 OsSSIIIb Starch synthase IIIb LOC_Os01g52260 SSIVa Starch synthase IVa LOC_Os05g45720 SSIVb Starch synthase IVb LOC_Os06g04200 GBSSI Granule-bound starch synthase I LOC_Os07g22930 GBSSII Granule-bound starch synthase II LOC_Os06g51084 BEI Starch branching enzyme I LOC_Os04g33460 BEIIa Starch branching enzyme IIa LOC_Os02g32660 BEIIb Starch branching enzyme IIb LOC_Os08g40930 ISA1 Starch debranching enzyme: Isoamylase I LOC_Os05g32710 ISA2 Starch debranching enzyme: Isoamylase II LOC_Os09g29404 ISA3 Starch debranching enzyme: Isoamylase III LOC_Os04g08270 PUL Starch debranching enzyme: Pullulanase LOC_Os01g63270 PHOH Starch phosphorylase H LOC_Os03g55090 PHOL Starch phosphorylase L LOC_Os07g43390 DPE1 Disproportionating enzyme I LOC_Os07g46790 DPE2 Disproportionating enzyme II LOC_Os04g0645100 FLO2 gene

180

Table 33: Candidate transcription factors

TF family Locus TF family Locus OsbZIP58 LOC_Os07g08420 NAC LOC_Os01g01470 MADS29 LOC_Os07g08420 NAC LOC_Os01g29840 AP2/EREBP LOC_Os05g03040 NAC LOC_Os02g12310 AP2/EREBP LOC_Os08g41030 NAC LOC_Os05g34310 AP2/EREBP LOC_Os08g45110 NAC LOC_Os11g31330 ABI3VP1 LOC_Os01g68370 NAC LOC_Os11g31340 AS2 LOC_Os01g39070 NAC LOC_Os11g31360 bHLH LOC_Os02g34320 NAC LOC_Os11g31380 bHLH LOC_Os04g35010 Orphans LOC_Os04g28120 bZIP LOC_Os01g64000 Orphans LOC_Os07g15540 bZIP LOC_Os03g59460 Orphans LOC_Os10g14040 bZIP LOC_Os04g10260 PHD LOC_Os02g09910 bZIP LOC_Os07g08420 PLATZ LOC_Os01g33350 bZIP LOC_Os09g34880 ARID LOC_Os02g27060 C2C2-Dof LOC_Os02g15350 bHLH LOC_Os03g43810 C2H2 LOC_Os05g38600 bHLH LOC_Os03g56950 C3H LOC_Os01g07930 bZIP LOC_Os01g58760 CCAAT_HAP2 LOC_Os10g25850 CCAAT_HAP3 LOC_Os03g29970 CCAAT_HAP3 LOC_Os02g49410 CCAAT_HAP5 LOC_Os03g14669 CCAAT_HAP3 LOC_Os06g17480 HB LOC_Os06g39906 CCAAT_HAP5 LOC_Os01g01290 HB LOC_Os06g01934 CCAAT_HAP5 LOC_Os01g24460 MADS LOC_Os02g49840 CCAAT_HAP5 LOC_Os01g39850 NAC LOC_Os09g24670 CCAAT_HAP5 LOC_Os05g23910 Orphans LOC_Os02g05470 EIL LOC_Os08g39830 SET LOC_Os05g50980 GRAS LOC_Os05g49930 SET LOC_Os09g24530 HB LOC_Os01g62310 SET LOC_Os02g49326 MYB LOC_Os01g50110 Sigma70-like LOC_Os03g16430 MYB LOC_Os01g63680 Sigma70-like LOC_Os11g26160 MYB LOC_Os01g74590 MYB LOC_Os03g29614 MYB LOC_Os07g31470

6.5. References

Aluko G, Martinez C, Tohme J, Castano C, Bergman C, Oard JH (2004) QTL mapping of grain quality traits from the interspecific cross Oryza sativa x O. glaberrima. TAG Theoretical and applied genetics Theoretische und angewandte Genetik 109:630-639

Aoki K, Ogata Y, Shibata D (2007) Approaches for Extracting Practical Information from Gene Co-expression Networks in Plant Biology. Plant and Cell Physiology 48:381-390

181

Aravind L, Galperin MY, Koonin EV (1998) The catalytic domain of the P-type ATPase has the haloacid dehalogenase fold. Trends in Biochemical Sciences 23:127- 129

Ayres N, McClung A, Larkin PD, Bligh HFJ, Jones CA, Park WD (1997) Microsatellites and a single nucleotide polymorphism differentiate apparent amylose classes in an extended pedigree of US rice germplasm. TAG Theoretical and applied genetics Theoretische und angewandte Genetik 94:773-781

Bahaji A, Li J, Sanchez-Lopez AM, Baroja-Fernandez E, Munoz FJ, Ovecka M, Almagro G, Montero M, Ezquer I, Etxeberria E, Pozueta-Romero J (2014) Starch biosynthesis, its regulation and biotechnological approaches to improve crop yields. Biotechnology advances 32:87-106

Ball SG, Morell MK (2003) From Bacterial Glycogen to Starch: Understanding the Biogenesis of the Plant Starch Granule. Annual review of plant biology 54:207- 233

Bligh HF, Till RI, Jones CA (1995) A microsatellite sequence closely linked to the Waxy gene of Oryza sativa. Euphytica 86:83-85

Cao S, Kumimoto RW, Siriwardana CL, Risinger JR, Holt BF, Zhang B (2011) Identification and Characterization of NF-Y Transcription Factor Families in the Monocot Model Plant Brachypodium distachyon. PloS one 6

Chow C-N, Zheng H-Q, Wu N-Y, Chien C-H, Huang H-D, Lee T-Y, Chiang-Hsieh Y-F, Hou P-F, Yang T-Y, Chang W-C (2016) PlantPAN 2.0: an update of plant promoter analysis navigator for reconstructing transcriptional regulatory networks in plants. Nucleic acids research 44:D1154

Clarke BR, Denyer K, Jenner CF, Smith AM (1999) The relationship between the rate of starch synthesis, the adenosine 5?-diphosphoglucose concentration and the amylose content of starch in developing pea embryos. Planta 209:324-329

Crumpton-Taylor M, Pike M, Lu Kj, Hylton CM, Feil R, Eicke S, Lunn JE, Zeeman SC, Smith AM (2013) Starch synthase 4 is essential for coordination of starch granule formation with chloroplast division during Arabidopsis leaf expansion. New Phytologist 200:1064-1075

Dobo M, Ayres N, Walker G, Park WD (2010) Polymorphism in the GBSS gene affects amylose content in US and European rice germplasm. Journal of Cereal Science 52:450-456

182

Fang-Fang F, Hong-Wei X (2010) Coexpression Analysis Identifies Rice Starch Regulator1, a Rice AP2/EREBP Family Transcription Factor, as a Novel Rice Starch Biosynthesis Regulator1[W][OA]. Plant physiology 154:927-938

Fitzgerald MA, McCouch SR, Hall RD (2009) Not just a grain of rice: the quest for quality. Trends in plant science 14:133-139

Geigenberger P (2011) Regulation of Starch Biosynthesis in Response to a Fluctuating Environment. Plant physiology 155:1566-1577

Grimault A, Gendrot G, Chamot S, Widiez T, Rabillé H, Gérentes MF, Creff A, Thévenin J, Dubreucq B, Ingram GC, Rogowsky PM, Depège‐Fargeix N (2015) ZmZHOUPI, an endosperm‐specific basic helix–loop–helix transcription factor involved in maize seed development. Plant Journal 84:574-586

He P, Li SG, Qian Q, Ma YQ, Li JZ, Wang WM, Chen Y, Zhu LH (1999) Genetic analysis of rice grain quality. THEORETICAL AND APPLIED GENETICS 98:502-508

Hirano H-Y, Eiguchi M, Sano Y (1998) A single base change altered the regulation of the waxy gene at the posttranscriptional level during the domestication of rice. Molecular biology and evolution 15:978-987

Hirose T, Terao T (2004) A comprehensive expression analysis of the starch synthase gene family in rice ( Oryza sativa L.). An International Journal of Plant Biology 220:9-16

Hsu Y-C, Tseng M-C, Wu Y-P, Lin M-Y, Wei F-J, Hwu K-K, Hsing Y-I, Lin Y-R (2014) Genetic factors responsible for eating and cooking qualities of rice grains in a recombinant inbred population of an inter-subspecific cross. Molecular Breeding 34:655-673

James MG, Denyer K, Myers AM (2003) Starch synthesis in the cereal endosperm. Current Opinion in Plant Biology 6:215-222

Kater MM, Dreni L, Colombo L (2006) Functional conservation of MADS-box factors controlling floral organ identity in rice and Arabidopsis. Journal of experimental botany 57:3433-3444

Kawahara Y, Zhou S, Childs KL, Davidson RM, Lin H, Quesada-Ocampo L, Vaillancourt B, Sakai H, Lee SS, Kim J, Numa H, de la Bastide M, Itoh T, Buell CR, Matsumoto T, Hamilton JP, Kanamori H, McCombie WR, Ouyang S, Schwartz DC, Tanaka T, Wu J (2013) Improvement of the Oryza sativa

183

Nipponbare reference genome using next generation sequence and optical map data. Rice 6:1-10

Kumar I, Khush GS (1988) Inheritance of amylose content in rice (Oryza sativa L.). Euphytica 38:261-269

Laloum T, De Mita Sp, Gamas P, Baudin Ml, Niebel A (2012) CCAAT-box binding transcription factors in plants: Y so many? Trends in plant science

Larkin PD, Park WD (2003) Association of waxy gene single nucleotide polymorphisms with starch characteristics in rice ( Oryza sativa L.). Molecular Breeding 12:335-339

Li J, Xiao J, Grandillo S, Jiang L, Wan Y, Deng Q, Yuan L, McCouch SR (2004) QTL detection for rice grain quality traits using an interspecific backcross population derived from cultivated Asian (O. sativa L.) and African (O. glaberrima S.) rice. Genome / National Research Council Canada = Genome / Conseil national de recherches Canada 47:697-704

Lloyd JR, Springer F, Bul?on A, M?ller-R?ber B, Willmitzer L, Kossmann J (1999) The influence of alterations in ADP-glucose pyrophosphorylase activities on starch structure and composition in potato tubers. Planta 209:230-238

Lopes CT, Franz M, Kazi F, Donaldson SL, Morris Q, Bader GD (2010) Cytoscape Web: an interactive web-based network browser. Bioinformatics 26:2347-2348

Lunn JE, Ashton AR, Hatch MD, Heldt HW (2000) Purification, Molecular Cloning, and Sequence Analysis of sucrose-6F-phosphate Phosphohydrolase from Plants. Proceedings of the National Academy of Sciences of the United States of America 97:12914-12919

Ma S, Shah S, Bohnert HJ, Snyder M, Dinesh-Kumar SP (2013) Incorporating Motif Analysis into Gene Co-expression Networks Reveals Novel Modular Expression Pattern and New Signaling Pathways (Motif Analysis over Gene Co-expression Networks) 9:e1003840

Marion van de W, Hulst C, Vincken J-P, Buleon A, Visser R, Ball S (1998) Amylose Is Synthesized in Vitro by Extension of and Cleavage from Amylopectin. Journal of Biological Chemistry 273:22232-22240

Mathew IE, Das S, Mahto A, Agarwal P (2016) Three Rice NAC Transcription Factors Heteromerize and Are Associated with Seed Size. Frontiers in Plant Science

184

Mikami I, Uwatoko N, Ikeda Y, Yamaguchi J, Hirano HY, Suzuki Y, Sano Y (2008) Allelic diversification at the wx locus in landraces of Asian rice. TAG Theoretical and applied genetics Theoretische und angewandte Genetik 116:979-989

Miyazawa Y, Kato H, Muranaka T, Yoshida S (2002) Amyloplast Formation in Cultured Tobacco BY-2 Cells Requires a High Cytokinin Content. Plant & Cell Physiology 43:1534

Nayar S, Sharma R, Tyagi AK, Kapoor S (2013) Functional delineation of rice MADS29 reveals its role in embryo and endosperm development by affecting hormone homeostasis. Journal of experimental botany 64:4239

Nuruzzaman M, Manimekalai R, Sharoni AM, Satoh K, Kondoh H, Ooka H, Kikuchi S (2010) Genome-wide analysis of NAC transcription factor family in rice. Gene 465:30-44

Nuruzzaman M, Sharoni MA, Shoshi kikuchi (2013) Roles of NAC transcription factors in the regulation of biotic and abiotic stress responses in plants. Frontiers in Microbiology 4

Ohdan T, Francisco JPB, Sawada T, Hirose T, Terao T, Satoh H, Nakamura Y (2005) Expression profiling of genes involved in starch synthesis in sink and source organs of rice. Journal of experimental botany 56:3229-3244

Ono A, Izawa T, Shimamoto K, Chua N-H (1996) The rab16B promoter of rice contains two distinct abscisic acid-responsive elements. Plant physiology 112:483-491

Sakai A, Miyagishima S-Y, Kawano S, Kuroiwa T (1999) Auxin and Cytokinin Have Opposite Effects on Amyloplast Development and the Expression of Starch Synthesis Genes in Cultured Bright Yellow-2 Tobacco Cells. Plant physiology 121:461-469

Sarkar C, Maitra A (2008) Deciphering the cis-regulatory elements of co-expressed genes in PCOS by in silico analysis. Gene 408:72-84

Sato Y, Antonio BA, Namiki N, Takehisa H, Minami H, Kamatsuki K, Sugimoto K, Shimizu Y, Hirochika H, Nagamura Y (2011) RiceXPro: a platform for monitoring gene expression in japonica rice grown under natural field conditions. Nucleic acids research 39:D1141

Sato Y, Namiki N, Takehisa H, Kamatsuki K, Minami H, Ikawa H, Ohyanagi H, Sugimoto K, Itoh J-I, Antonio BA, Nagamura Y (2013) RiceFREND: A platform for retrieving coexpressed gene networks in rice. Nucleic acids research 41:D1214-D1221

185

Septiningsih EM, Trijatmiko KR, Moeljopawiro S, McCouch SR (2003) Identification of quantitative trait loci for grain quality in an advanced backcross population derived from the Oryza sativa variety IR64 and the wild relative O. rufipogon. Theoretical and Applied Genetics 107:1433-1441

Smith A, Denyer K, Martin C (1997) The synthesis of the starch granule. Annu Rev Plant Physiol Plant Molec Biol, pp 65-87

Sperotto R, Ricachenevsky F, Duarte G, Boff T, Lopes K, Sperb E, Grusak M, Fett J (2009) Identification of up-regulated genes in flag leaves during rice grain filling and characterization of Os NAC5, a new ABA-dependent transcription factor. An International Journal of Plant Biology 230:985-1002

Steidl S, T?ncher A, Goda H, Guder C, Papadopoulou N, Kobayashi T, Tsukagoshi N, Kato M, Brakhage AA (2004) A Single Subunit of a Heterotrimeric CCAAT- binding Complex Carries a Nuclear Localization Signal: Piggy Back Transport of the Pre-assembled Complex to the Nucleus. Journal of Molecular Biology 342:515-524

Sun M-M, Abdula SE, Lee H-J, Cho Y-C, Han L-Z, Koh H-J, Cho Y-G (2011) Molecular Aspect of Good Eating Quality Formation in Japonica Rice (Molecular Aspect of Good Eating Quality in Rice). PloS one 6:e18385

Takaiwa F, Oono K (1990) Interaction of an immature seed-specific trans-acting factor with the 5? upstream region of a rice glutelin gene. MGG Molecular & General Genetics 224:289-293

Tan YF, Li JX, Yu SB, Xing YZ, Xu CG, Zhang Q (1999) The three important traits for cooking and eating quality of rice grains are controlled by a single locus in an elite rice hybrid, Shanyou 63. Theoretical and Applied Genetics 99:642-648

Tetlow IJ, Morell MK, Emes MJ (2004) Recent developments in understanding the regulation of starch metabolism in higher plants. Journal of experimental botany 55:2131-2145

Tian Z, Zeng D, Wang Y, Yu J, Gu M, Li J, Qian Q, Liu Q, Yan M, Liu X, Yan C, Liu G, Gao Z, Tang S (2009) Allelic diversities in rice starch biosynthesis lead to a diverse array of rice eating and cooking qualities. Proceedings of the National Academy of Sciences of the United States of America 106:21760-21765

Uauy C, Distelfeld A, Fahima T, Blechl A, Dubcovsky J (2006) A NAC Gene Regulating Senescence Improves Grain Protein, Zinc, and Iron Content in Wheat. Science (Washington) 314:1298-1301

186

Vandepoele K, Van Peer YD, Quimbaya M, Casneuf T, De Veylder L (2009) Unraveling transcriptional control in arabidopsis using cis-regulatory elements and coexpression networks 1[C][W]. Plant physiology 150:535-546

Wang J-C, Xu H, Zhu Y, Liu Q-Q, Cai X-L (2013) OsbZIP58, a basic leucine zipper transcription factor, regulates starch biosynthesis in rice endosperm. Journal of experimental botany 64:3453-3466

Wang K, Wambugu PW, Zhang B, Wu AC, Henry RJ, Gilbert RG (2015) The biosynthesis, structure and gelatinization properties of starches from wild and cultivated African rice species (Oryza barthii and Oryza glaberrima). Carbohydrate Polymers 129:92-100

Washida H, Wu C-Y, Suzuki A, Yamanouchi U, Akihama T, Harada K, Takaiwa F (1999) Identification of cis-regulatory elements required for endosperm expression of the rice storage protein glutelin gene GluB-1. Plant molecular biology 40:1-12

Wu CY, Washida H, Onodera Y, Harada K, Takaiwa F (2000) Quantitative nature of the Prolamin‐box, ACGT and AACA motifs in a rice glutelin gene promoter: minimal cis‐element requirements for endosperm‐specific gene expression. The Plant Journal 23:415-421

Yang W, Lu Z, Xiong Y, Yao J (2016) Genome-wide identification and co-expression network analysis of the OsNF-Y gene family in rice. The Crop Journal

Yang X, Koltes JE, Park CA, Chen D, Reecy JM (2015) Gene Co-Expression Network Analysis Provides Novel Insights into Myostatin Regulation at Three Different Mouse Developmental Timepoints.(Report). PloS one 10

Yin L-L, Xue H-W (2012) The MADS29 Transcription Factor Regulates the Degradation of the Nucellus and the Nucellar Projection during Rice Seed Development(W). The Plant cell 24:1049-1065

Zhou P, Tan Y, He Y, Xu C, Zhang Q (2003) Simultaneous improvement for four quality traits of Zhenshan 97, an elite parent of hybrid rice, by molecular marker- assisted selection. Theoretical and Applied Genetics 106:326-331

Zhu Y, Cai X-L, Wang Z-Y, Hong M-M (2003) An Interaction between a MYC Protein and an EREBP Protein Is Involved in Transcriptional Regulation of the Rice Wx Gene. Journal of Biological Chemistry 278:47803-47811

187

Chapter 7

General discussion, conclusions and future directions

7.1. Overview

Currently, agriculture relies on a few crops overlooking many important crops and species which may be useful directly as foods or as sources of genes for crop improvement. In the Oryza genus, the African wild and cultivated species have huge diversity but remain largely neglected. They have genetic potential that has not been properly characterized and exploited (chapter 2). These species and the breadth of their diversity have been the focus of this thesis. The central theme presented in this thesis is that plant genetic resources, which are an integral component of agricultural biodiversity, need to be efficiently conserved and sustainably utilized in order to ensure they play their role in enhancing global food and nutritional security as well as economic development. In order to achieve this objective, there is a need for proper characterization of plant genetic resources as this is a key to understanding and enhancing their potential genetic value. Specifically, there is a need to clearly understand the genetic relationships between species and also to have the functional diversity of important traits and their genetic architecture well defined.

These studies were therefore undertaken with the aim of conducting genomic characterization of both cultivated and wild African Oryza species with the overall goal of enhancing their genetic value and thereby promoting their conservation and utility. Considerable focus has been put on the potential of African rice in addressing food and nutritional security not only in Africa but globally. This general discussion focuses on the key highlights of the research, with the implications of the outputs of this study for breeding for amylose content in rice being outlined. The application of genomic tools in supporting the conservation and utilization of African Oryza genetic resources has also been discussed. Focus has particularly been put on the potential of two genomic approaches, namely whole chloroplast genome sequencing and whole genome based bulk segregant analysis, both of which were deployed in this study. Finally, recommendations for future research are presented. Figure 28 illustrates the

188 link between the various research questions that have been addressed in this thesis and how the outcome of these studies relate in enhancing the project goals of conservation and utilization of plant genetic resources.

Genetic resources and diversity as foundation for rice improvement (Wild relatives, breeding lines, specialized genetic stocks and conserved germplasm)

Plant genetic resource characterization

(Chloroplast genomesequencing) Properphylogenetic fram

Genotype Phenotype-genotype Phenotype association (BSA)

Candidate gene and Genetic markers

function

ework

Proper phylogenetic framework framework phylogenetic Proper

(Chloroplast genome sequencing) sequencing) genome (Chloroplast

Conservation Plant genetic resources use

Climate Grain Abiotic and Yield quality biotic stress resilience

Figure 28: Schematic diagram showing how the outputs of this study including candidate genetic markers, candidate genes and well defined phylogenetic relationships contribute in supporting the project’s goals of promoting the conservation and utilization of PGR

189

7.2. Application of genomics in conservation and utilization of African Oryza genetic resources

In chapter 2, the status of conservation and utilization of African Oryza genetic resources was documented, where in addition to reporting on the progress made, existing gaps were also highlighted. A significant proportion of African Oryza genetic resources consist of wild relatives. The last couple of years have witnessed increased awareness and recognition of the importance of crop wild relatives as sources of valuable diversity for crop improvement. There have been renewed global efforts aimed at promoting the conservation and sustainable use of these CWR genetic resources (Dempewolf et al. 2014; Maxted et al. 2012). Despite these efforts, genetic resources of African Oryza CWRs remain under-collected and hence poorly represented in germplasm repositories exposing them to threats of genetic loss and possible extinction (Wambugu et al. 2013). Crop wild relatives have been the focus of intense efforts aimed at enhancing their deployment in crop improvement as a source of valuable diversity for tolerance/resistance to various biotic and abiotic stresses (FAO 2010; Hajjar and Hodgkin 2007; Maxted and Kell 2009; Xiao et al. 1998). Generally, crop wild relatives are less well studied and difficult to conserve compared to domesticated crop species (Dempewolf et al. 2014; FAO 2010). The genetic potential of African wild relatives remains largely untapped but this is likely to change with the continued sequencing of crop wild relatives.

The increased sequencing of the genomes of various crop wild relatives lays a strong foundation for enhancing their utility (Brozynska et al. 2015). The current advances in genomics offers tools and opportunities for supporting conservation and utilization of plant genetic resources. Sequencing based molecular analysis for example has potential for addressing conservation gaps by enabling the assessment of the range of alleles present in a collection. However, insufficient knowledge of diversity available in in situ populations or in ex situ germplasm collections complicates the process of accurately determining collection gaps and selection of new materials to be conserved (Treuren et al. 2008). Conducting molecular analysis of accessions before banking will help breeders and genebanks to gradually accumulate information on available genetic diversity thus making it easier to more precisely undertake collection gap analysis which will help plan future germplasm collection missions.

190

In situ populations of African Oryza genetic resources exist and these are important in complementing ex situ conservation (chapter 2). Analysis of genomes of these in situ populations can help reveal novel wild populations which can be prioritized for conservation (Brozynska et al. 2015). African Oryza genetic resources have been poorly characterized and hence underutilized since their genetic value is unknown. The application of genomics as a tool for achieving greatly accelerated genetic improvement is usually hindered by the inadequate understanding of the genetic potential resident in genetic material. Without this knowledge, selection of germplasm to include in a breeding programme becomes a “hit or miss” affair which dramatically slows down the process. Genetic diversity in ex situ collections has to a large extent been previously studied anonymously with a large majority of studies just indicating diversity richness by reporting on the number of alleles detected in a collection (Gao et al. 2006; Ishii et al. 2011; Jain et al. 2004). This has done little to spur utility of conserved germplasm, however with current improvements in genomic approaches, there is increased impetus towards studying the functional genetic variation of genebank collections. Linking sequence variations to phenotypic variations is perhaps the best way to illuminate the genetic potential and breeding value of genetic resources thereby promoting their utility in rice improvement. The increased availability of high quality reference sequences has opened almost unlimited possibilities in deciphering the functional value of germplasm. Resequencing of several genotypes either individually or pooled, through whole genome shotgun sequencing, followed by mapping to the reference is currently the most popular approach for obtaining insights on informative marker trait associations (Huang et al. 2012; McCouch et al. 2012; Takagi et al. 2013). In this thesis, we have conducted a nuclear genome-based genetic mapping using bulked samples and a chloroplast genome-based phylogenetic analysis with the goal of promoting the conservation and utility of African genetic resources.

7.3. Application of chloroplast phylogenomics in phylogenetic analysis

Phylogenomics is one of the rapidly expanding fields of genomics. It is finding wide application in reconstructing the phylogenetic and evolutionary history of species by analysing a large number of loci across the genome (Eisen and Fraser 2003; Frédéric et al. 2005). Analysis of multiple loci is based on the assumption that the predominant

191 phylogenetic information inherent in the dataset will weed out any inconsistencies and incongruence thus resulting in a robust phylogeny (Molina et al. 2011; Rakshit et al. 2007). In chapter 3, a chloroplast phylogenomics approach was used to resolve the previous inconsistences in the phylogeny and evolutionary relationships of various African species that are part of the Oryza primary gene pool. This analysis resulted in a well resolved phylogeny of the AA genome with all the clades being strongly supported, thus demonstrating the value of whole plastid genomes in phylogenetic analysis. The results of the phylogenomic analysis provided valuable insights on domestication and evolution of rice. These well-defined phylogenetic relationships are invaluable aids in devising rice breeding strategies as well as in supporting conservation (Figure 28).

The increased availability of low-cost genome-scale sequencing technologies has led to a dramatic decrease in sequencing costs. The use of whole chloroplast genomes has been found to be a cost effective approach in plant barcoding. Low coverage sequencing through multiplexing is currently the most preferred approach for sequencing chloroplast genomes. In chapter 3, multiplex sequencing of 10 samples in one run was conducted resulting in fold coverage of the chloroplast genome between 1319X and 3443X (Table 11).This enabled the assembly of complete de novo assembled chloroplast genome sequences with no gaps. As demonstrated by Cronn et al. (2008), this approach leads to a perfect balance between cost, throughput and coverage. It is these factors coupled with the power to resolve previously intractable phylogenetic relationships that have continued to make whole plastome sequencing a popular tool for phylogenetic analysis.

However, despite the great potential of chloroplast genomes as a universal plant barcode, it also faces certain limitations. For example, due to their slow evolving nature (Small et al. 2004), plastome genomes have little sequence diversity and may therefore not be able to effectively resolve some clades such as in the case of rapidly radiated genomes. The low sequence diversity is responsible for the limited genome divergence that was observed in chapter 3. This makes plastome sequences unsuitable for resolving problems associated with rapid speciation which has been the cause of many unresolved phylogenies (Zou et al. 2008). In such cases, the combined

192 use of both chloroplast and nuclear data is recommended. Furthermore, using whole chloroplast genome sequences in undertaking genome scale analysis may be computationally challenging especially in cases of inadequate computing capacity. The sequences can however be made slightly more computationally manageable by excluding one of the inverted regions as doing so did not have any effect on the resultant phylogeny in this study.

The phylogenomics study advanced understanding on the relationships between various species in the AA genome group and specifically among the African species. The chloroplast sequences assembled in this study are a valuable resource that will be useful in studying phylogenetic and evolutionary relationships in the wider Oryza genus. The well-defined genetic relationships obtained in this phylogenomics study, coupled with the recently developed reference sequences of these African species, enables the exploration of functionally important variants in their genomes and their deployment in rice improvement

7.4. Analysis of functional diversity in African rice and its potential deployment in breeding for amylose

With the increased understanding of the utility of African species in rice improvement, this study proceeded to characterize the functionally important diversity associated with amylose content in African rice. While priority has been placed on breeding for high yielding crops especially in developing countries, there is also a need to ensure that these varieties deliver nutritional security which contributes to human health. We have reported on the potential of African rice to breed for elite varieties that have potential for conferring some health and dietary benefits due to their unique starch traits (Wang et al. 2015; appendix ). Despite their potential health benefits, African rice species remain neglected and has been reported to be less appealing than Asian rices. In order to improve consumer acceptance there is a need for rice breeders to integrate grain quality with sensory preferences (Anacleto et al. 2015). Moreover, a holistic understanding of the genetic, molecular and biochemical basis of some of the starch traits responsible for these health benefits is important in helping tap this genetic potential but is currently lacking. The growth in omics technologies is providing opportunities for addressing some of these research gaps. This study has leveraged

193 the advances in omics technologies where advances in high throughput sequencing technologies aided by equally revolutionary advances in bioinformatics have been instrumental in advancing our understanding of the molecular and genetic basis of natural variation.

In chapter 5, whole genome based bulk segregant analysis which is one of the fast and cost effective cutting edge genomics approaches, was used to explore amylose-associated functional diversity in African rice. Despite the dramatic decrease in sequencing costs, whole genome analysis of a large number of samples is still not economically feasible for a large majority of genebanks and researchers (Kilian and Graner 2012). The costs of library preparation still forms a substantial cost of next generation sequencing (Harakalova et al. 2011; Rohland and Reich 2012). Bulk segregant analysis (BSA) coupled with whole genome analysis is therefore likely to become a popular approach as it reduces the number of samples to sequence as well as cost of library preparation. This approach has continued to gain popularity and is steadily replacing traditional mapping approaches. Traditional approaches were costly, laborious and time consuming as they involved genotyping a large number of individuals. An additional advantage of BSA over traditional QTL mapping approaches is that it is not too stringent on phenotyping accuracy. Studies have shown that it can tolerate some degree of inaccuracy in phenotyping (Yang et al. 2015). In addition, for a crop like African rice which has extremely low nucleotide diversity (Li et al. 2011; Nabholz et al. 2014; Orjuela et al. 2014; Wang et al. 2014), identification and development of enough polymorphic markers for genotyping may be challenging. Unlike with traditional approaches, in whole genome sequencing of extreme bulked DNAs, marker development is no longer constrained by the number of nucleotide polymorphisms. While BSA has proven very useful and successful in mapping QTLs, its potential in aiding the discovery of the causal sites or mutations may be limited. The causal site is expected to have alleles from one of the parental genomes over represented in one of the bulks but due to linkage disequilibrium, the markers adjacent to the causal site similarly tend to have the same over representation (Duitama et al. 2014) making it difficult to pinpoint the exact causative mutation (Abe et al. 2012).

194

By conducting genome wide analysis, this study presents the most comprehensive analysis of the genetic architecture of amylose content in African rice to-date. With a global view of the molecular architecture of amylose synthesis in African rice, this study has advanced our knowledge of this pathway by providing information on new candidate genes and novel alleles. These genomic resources provide novel targets for manipulating amylose content. They can be used to incorporate the potential starch-related health benefits of African rice into high yielding rice varieties which is likely to result in healthier rice. The identification of GBSS which is known as the key regulator of amylose synthesis, in addition to other candidate genes as being associated with amylose content, clearly demonstrates the potential of whole genome-based bulk segregant analysis in genetic mapping. Schneeberger and Weigel (2011) predicted that in future, mapping by sequencing will be instrumental in efforts aimed at identifying genes that show heritable variation in target traits that could be deployed in crop improvement. The future that was predicted by these authors is already with us as evidenced by results of this study. NGS aided BSA has therefore dramatically accelerated the process of identifying causal genes and mutations for various traits.

7.5. Dissecting the genetic architecture of amylose synthesis using integrative genetics approach

In chapter 6 we used a candidate gene approach to analyse genes that have previously been reported to be involved in starch/amylose biosynthesis. We used an integrative genetic approach that employed a combination of genetic mapping strategies in order to enhance the discovery and use of useful genetic variation. Gene expression is mediated by transcription factors which bind selectively to DNA sequences called transcription factor binding sites (TFBS) found in cis regulatory regions of genes. Co-expressed genes are therefore more likely to share binding sites when compared to those that don’t have similar expression patterns (Hannenhalli 2008). Gene co-expression network analysis uses transcriptomic data to group genes based on their expression profiles (Usadel et al. 2009). Gene co-expression is becoming a powerful approach for predicting gene function as it assumes that gene co-expression suggests functional relatedness (Hansen et al. 2014). Gene co- expression analysis has been used to identify genes involved in various biological

195 processes such as starch biosynthesis (Fang-Fang and Hong-Wei 2010), fatty acid biosynthesis (Han et al. 2012), cellulose synthesis (Persson et al. 2005) and biomarkers in glycerol kinase (Maclennan et al. 2009).

However, gene co-expression can be noisy and may not necessarily indicate functionally relevant relationships (Hansen et al. 2014; Yeung et al. 2004). In order to correctly predict gene function additional analysis may therefore be needed. In this study an integrative genetics approach was used to elucidate the genetic mechanisms of amylose biosynthesis. Our integrative systems genetics approach comprised of genomics based bulk segregant analysis, gene co-expression network analysis as well as analysis of conserved cis regulatory motifs. Gene expression profiles were also studied. Using this approach known and novel genes putatively have some roles in amylose biosynthesis were identified. Several lines of evidence that we explored in our integrative genetics approach suggest that some of our candidate genes are novel genomic regions involved in transcriptional regulation of amylose synthesis. Through identification of additional genes, other than GBSS, which have some regulatory control on amylose content, we have increased understanding on the regulatory networks of amylose content.

7.6. Future directions

Analysis of the mapping population, revealed only two non-synonymous SNP in GBSS that are likely to be functionally associated with amylose. Analysis of natural variation in a set of unrelated African rice accessions led to the identification of a total of 5 non- synonymous SNPs in this key amylose-controlling gene. As indicated in Chapter 5, though this SNP has been reported to be conservative (Dobo 2006), we speculate that it could affect GBSS activity as it is adjacent to a disulphide linkage that is conserved in all grass GBSS proteins. The absence of any other markers associated with amylose in GBSS, makes further analysis of the functional importance of this SNP necessary. This analysis can be conducted using targeted mutagenesis. PCR based genotyping for this SNP in a wide set of African rice is also likely to give insights on its association with amylose content.

196

This study brings to the fore the weaknesses of bi-parental mapping which is the genetic mapping approach that has mostly been used in the analysis of molecular control of amylose content in African rice. Only a limited amount of diversity is found segregating in bi-parental populations as compared to use of more diverse germplasm which is likely to have a higher repertoire of alleles. Due to the low frequency of variants in GBSS as highlighted in chapter 5, efforts to identify markers controlling amylose content in this cultivated species may face challenges as such pedigree- based populations are likely to miss important alleles. To-date only a limited number of African rice accessions have been used in bi-parental mapping, limiting the variation that may have been deployed in genetic mapping. In order to increase the range of variation that can be surveyed from GBSS, the use of association studies is recommend as these provide more power and resolution than structured segregating mapping population. It would be important to include as wide a spectrum of diversity of O. glaberrima as possible including the recently reported low amylose content accessions (Gayin et al. 2015). Genome wide association study of amylose content has been reported in Asian rice (Huang et al. 2012) but none has been undertaken in African rice.

The lack of adequate genomic information of African rice may have hindered the use of genome wide association analysis but with the continued increase in available genomic resources, we are likely to witness an upsurge in GWAS. Association studies take advantage of several generations of ancestral recombination which helps to increase resolution. The use of ancestral recombination events to delineate linkage blocks that are associated with the phenotypic trait helps to shorten the identified genomic interval (Ingvarsson and Street 2011 and the references therein). Association analysis has also enabled better understanding of the genetic regulation of various physicochemical properties as compared to bi-parental mapping (Tian et al. 2009). However, as reported by Famoso et al. (2011), it would be important to note that in some instances, QTL mapping has been found to have more power and resolution by discovering QTLs which GWAS has been unable to detect. For example in the case of rare alleles or where the alleles are sub-population specific. These approaches may therefore be complementary.

197

7.7. References

Abe A, Kosugi S, Yoshida K, Natsume S, Takagi H, Kanzaki H, Matsumura H, Yoshida K, Mitsuoka C, Tamiru M, Innan H, Cano L, Kamoun S, Terauchi R (2012) Genome sequencing reveals agronomically important loci in rice using MutMap. Nature biotechnology 30:174-178

Anacleto R, Cuevas RP, Jimenez R, Llorente C, Nissila E, Henry R, Sreenivasulu N (2015) Prospects of breeding high-quality rice using post-genomic tools. Theoretical and Applied Genetics 128:1449-1466

Brozynska M, Furtado A, Henry RJ (2015) Genomics of crop wild relatives: Expanding the gene pool for crop improvement. Plant biotechnology journal

Cronn R, Liston A, Parks M, Gernandt DS, Shen R, Mockler T (2008) Multiplex sequencing of plant chloroplast genomes using Solexa sequencing-by- synthesis technology. Nucleic acids research 36:e122-e122

Dempewolf H, Eastwood RJ, Guarino L, Khoury CK, M?ller JV, Toll J (2014) Adapting Agriculture to Climate Change: A Global Initiative to Collect, Conserve, and Use Crop Wild Relatives. Agroecology and Sustainable Food Systems. Taylor & Francis, pp 369-377

Dobo M (2006) Role of GBSS allelic diversity in rice grain quality. ProQuest, UMI Dissertations Publishing Duitama J, Sánchez-Rodríguez A, Goovaerts A, Pulido-Tamayo S, Hubmann G, Foulquié-Moreno MR, Thevelein JM, Verstrepen KJ, Marchal K (2014) Improved linkage analysis of Quantitative Trait Loci using bulk segregants unveils a novel determinant of high ethanol tolerance in yeast. BMC genomics 15:207-207

Eisen J, Fraser C (2003) Phylogenomics: Intersection of Evolution and Genomics. In: Eisen JA (ed) Science pp 1706-1707

Famoso AN, Zhao K, Clark RT, Tung C-W, Wright MH, Bustamante C, Kochian LV, McCouch SR (2011) Genetic architecture of aluminum tolerance in rice (oryza sativa) determined through genome-wide association analysis and qtl mapping. PLoS Genetics 7:e1002221

Fang-Fang F, Hong-Wei X (2010) Coexpression Analysis Identifies Rice Starch Regulator1, a Rice AP2/EREBP Family Transcription Factor, as a Novel Rice Starch Biosynthesis Regulator1[W][OA]. Plant physiology 154:927-938

198

FAO (2010) Second Report on the World's Plant Genetic Resources for Food and Agriculture. FAO, Rome, Italy, p 299

Frédéric D, Henner B, Hervé P (2005) Phylogenomics and the reconstruction of the tree of life. Nature Reviews Genetics 6:361

Gao L-Z, Zhang C-H, Li D-Y, Pan D-J, Jia J-Z, Dong Y-S (2006) Genetic diversity within Oryza rufipogon germplasms preserved in Chinese field gene banks of wild rice as revealed by microsatellite markers. Biodiversity and Conservation 15:4059-4077

Gayin J, Chandi GK, Manful J, Seetharaman K (2015) Classification of Rice Based on Statistical Analysis of Pasting Properties and Apparent Amylose Content: The Case of Oryza glaberrima Accessions from Africa. Cereal Chemistry 92:22-28

Hajjar R, Hodgkin T (2007) The use of wild relatives in crop improvement: a survey of developments over the last 20 years. Euphytica 156:1-13

Han X, Yin L, Xue H (2012) Co-expression Analysis Identifies CRC and AP1 the Regulator of Arabidopsis Fatty Acid Biosynthesis. Journal of Integrative Plant Biology 54:486-499

Hannenhalli S (2008) Eukaryotic transcription factor binding sites—modeling and integrative search methods. Bioinformatics 24:1325-1331

Hansen B, Vaid N, Magdalena M-L, Janowski M, Mutwil M (2014) Elucidating gene function and function evolution through comparison of co-expression networks in plants. Frontiers in Plant Science 5:NA-NA

Harakalova M, Mokry M, Hrdlickova B, Renkens I, Duran K, Van Roekel H, Lansu N, Van Roosmalen M, De Bruijn E, Nijman IJ, Kloosterman WP, Cuppen E (2011) Multiplexed array-based and in-solution genomic enrichment for flexible and cost-effective targeted next-generation sequencing. Nature Protocols 6:1870- 1886

Huang X, Zhu C, Fan D, Lu Y, Weng Q, Liu K, Zhou T, Jing Y, Si L, Dong G, Huang T, Zhao Y, Lu T, Feng Q, Qian Q, Li J, Han B, Wei X, Li C, Wang A, Zhao Q, Li W, Guo Y, Deng L (2012) Genome-wide association study of flowering time and grain yield traits in a worldwide collection of rice germplasm. Nature genetics 44:32-U53

Ingvarsson PK, Street N (2011) Association genetics of complex traits in plants. New Phytol, pp 909-922

199

Ishii T, Hiraoka T, Kanzaki T, Akimoto M, Shishido R, Ishikawa R (2011) Evaluation of Genetic Variation Among Wild Populations and Local Varieties of Rice. Rice 4:170-177

Jain S, Jain RK, McCouch SR (2004) Genetic analysis of Indian aromatic and quality rice ( Oryza sativa L.) germplasm using panels of fluorescently-labeled microsatellite markers. Theoretical and Applied Genetics 109:965-977

Jones ME, Shepherd M, Henry RJ, Delves A (2006) Chloroplast DNA variation and population structure in the widespread forest tree, Eucalyptus grandis. Conservation Genetics 7:691-703

Kilian B, Graner A (2012) NGS technologies for analyzing germplasm diversity in genebanks. Briefings in functional genomics 11:38-50

Li ZM, Zheng XM, Ge S (2011) Genetic diversity and domestication history of African rice (Oryza glaberrima) as inferred from multiple gene sequences. TAG Theoretical and applied genetics Theoretische und angewandte Genetik 123:21-31

Maclennan NK, Dong J, Aten JE, Horvath S, Rahib L, Ornelas L, Dipple KM, McCabe ERB (2009) Weighted gene co-expression network analysis identifies biomarkers in glycerol kinase deficient mice. Molecular Genetics and Metabolism 98:203-214

Maxted N, Kell S (2009) Establishment of a Global Network for the In Situ Conservation of Crop Wild Relatives: Status and Needs. Commission on Genetic Resources for Food and Agriculture (CGRFA):266

Maxted N, Kell S, Ford-Lloyd B, Dulloo E, Toledo l (2012) Toward the Systematic Conservation of Global Crop Wild Relative Diversity. Crop Science 52:774-785

McCouch SR, McNally KL, Wang W, Sackville Hamilton R (2012) Genomics of gene banks: A case study in rice. American journal of botany 99:407-423

Molina J, Sikora M, Garud N, Flowers JM, Rubinstein S, Reynolds A, Huang P, Jackson S, Schaal BA, Bustamante CD, Boyko AR, Purugganan MD (2011) Molecular evidence for a single evolutionary origin of domesticated rice. Proceedings of the National Academy of Sciences of the United States of America 108:8351-8356

Nabholz B, Sarah G, Sabot F, Ruiz M, Adam H, Nidelet S, Ghesquière A, Santoni S, David J, Glémin S (2014) Transcriptome population genomics reveals severe

200

bottleneck and domestication cost in the African rice (Oryza glaberrima). Molecular ecology 23:2210-2227

Orjuela J, Sabot F, Chéron S, Vigouroux Y, Adam H, Chrestin H, Sanni K, Lorieux M, Ghesquière A (2014) An extensive analysis of the African rice genetic diversity through a global genotyping. Theoretical and Applied Genetics 127:2211-2223

Persson S, Hairong W, Jennifer M, Grier PP, Christopher RS (2005) Identification of genes required for cellulose synthesis by regression analysis of public microarray data sets. Proceedings of the National Academy of Sciences of the United States of America 102:8633

Rakshit S, Rakshit A, Matsumura H, Takahashi Y, Hasegawa Y, Ito A, Ishii T, Miyashita NT, Terauchi R (2007) Large-scale DNA polymorphism study of Oryza sativa and O. rufipogon reveals the origin and divergence of Asian rice. TAG Theoretical and applied genetics Theoretische und angewandte Genetik 114:731-743

Rohland N, Reich D (2012) Cost-effective, high-throughput DNA sequencing libraries for multiplexed target capture. Genome research 22:939-946

Schneeberger K, Weigel D (2011) Fast-forward genetics enabled by new sequencing technologies. Trends in plant science 16:282-288

Small RL, Cronn RC, Wendel JF (2004) Use of nuclear genes for phylogeny reconstruction in plants. Australian Systematic Botany 17:145-170

Takagi H, Abe A, Yoshida K, Kosugi S, Natsume S, Mitsuoka C, Uemura A, Utsushi H, Tamiru M, Takuno S, Innan H, Cano LM, Kamoun S, Terauchi R (2013) QTL‐ seq: rapid mapping of quantitative trait loci in rice by whole genome resequencing of DNA from two bulked populations. The Plant Journal 74:174- 183

Tian Z, Zeng D, Wang Y, Yu J, Gu M, Li J, Qian Q, Liu Q, Yan M, Liu X, Yan C, Liu G, Gao Z, Tang S (2009) Allelic diversities in rice starch biosynthesis lead to a diverse array of rice eating and cooking qualities. Proceedings of the National Academy of Sciences of the United States of America 106:21760-21765

Treuren vR, van Hintum TJL, Wiel vdCCM (2008) Marker-assisted optimization of an expert-based strategy for the acquisition of modern lettuce varieties to improve a genebank collection. Genetic Resources and Crop Evolution 55:319-330

Usadel Br, Obayashi T, Mutwil M, Giorgi FM, Bassel GW, Tanimoto M, Chow A, Steinhauser D, Persson S, Provart NJ (2009) Co-expression tools for plant

201

biology: opportunities for hypothesis generation and caveats. Plant, Cell & Environment 32:1633-1651

Wambugu P, Furtado A, Waters D, Nyamongo D, Henry R (2013) Conservation and utilization of African Oryza genetic resources. Rice 6:29-29

Wang K, Wambugu PW, Zhang B, Wu AC, Henry RJ, Gilbert RG (2015) The biosynthesis, structure and gelatinization properties of starches from wild and cultivated African rice species (Oryza barthii and Oryza glaberrima). Carbohydrate Polymers 129:92-100

Wang MH, Yu Y, Haberer G, Marri PR, Fan CZ, Goicoechea JL, Zuccolo A, Song X, Kudrna D, Ammiraju JSS, Cossu RM, Maldonado C, Chen J, Lee S, Sisneros N, de Baynast K, Golser W, Wissotski M, Kim W, Sanchez P, Ndjiondjop MN, Sanni K, Long MY, Carney J, Panaud O, Wicker T, Machado CA, Chen MS, Mayer KFX, Rounsley S, Wing RA (2014) The genome sequence of African rice (Oryza glaberrima) and evidence for independent domestication. Nature genetics 46:982-988

Xiao J, Li J, Grandillo S, Ahn S-N, Yuan L, Tanksley SD, McCouch SR (1998) Identification of Trait-Improving Quantitative Trait Loci Alleles From a Wild Rice Relative, Oryza rufipogon. Genetics 150:899-909

Yang J, Jiang H, Yeh CT, Yu J, Jeddeloh JA, Nettleton D, Schnable PS (2015) Extreme‐phenotype genome‐wide association study (XP‐GWAS): a method for identifying trait‐associated variants by sequencing pools of individuals selected from a diversity panel. The Plant Journal 84:587-596

Yeung KY, Medvedovic M, Bumgarner RE (2004) From co-expression to co- regulation: how many microarray experiments do we need? Genome biology 5:R48-R48

Zou XH, Zhang FM, Zhang JG, Zang LL, Tang L, Wang J, Sang T, Ge S (2008) Analysis of 142 genes resolves the rapid diversification of the rice genus. Genome biology 9:R49-R49

202

Appendices

Appendix 1: The biosynthesis, structure and gelatinization properties of starches from wild and cultivated African rice species (Oryza barthii and Oryza glaberrima)3

Abstract The molecular structure and gelatinization properties of starches from domesticated African rice (Oryza glaberrima) and its wild progenitor (Oryza barthii) are determined and comparison made with Asian domesticated rice (Oryza sativa), the commonest commercial rice. This suggests possible enzymatic processes contributing to the unique traits of the African varieties. These have similar starch structures, including smaller amylose molecules, but larger amounts of amylose chains across the whole amylose chain-length distribution, and higher amylose contents, than O. sativa. They also show a higher proportion of two- and three-lamellae spanning amylopectin branch chains (degree of polymerization 34-100) than O. sativa, which contributes to their higher gelatinization temperatures. Fitting amylopectin chain-length distribution with a biosynthesis-based mathematical model suggests that the reason for this difference might be because O. glaberrima and O. barthii have more active SSIIIa and/or less active SBEIIb enzymes.

Keywords: African rice, starch, biosynthesis enzymes, molecular structure, gelatinization

1. Introduction

Rice is one of the most important cereal crops in the world, especially in the developing world. It is a major staple food for nearly half of the world population. There are two cultivated species of rice: Oryza sativa and Oryza glaberrima (Wambugu et al. 2013).

3 Published as Wang, K., Wambugu, P.W., Zhang, B., Wu, A.C., Henry, R.J.and Gilbert, R.G. 2015. The biosynthesis, structure and gelatinization properties of starches from wild and cultivated African rice species (Oryza barthii and Oryza glaberrima). Carbohydrate Polymers 129, 92-100

203

O. sativa was domesticated from the wild progenitor Oryza rufipogon around 10,000 years ago, and is now widely distributed throughout the world (Huang et al. 2012). O. glaberrima, on the other hand, was independently domesticated from the wild species O. barthii approximately 3000 years ago, and is traditionally found only in West Africa (Sweeney and Mccouch 2007). Currently O. glaberrima is increasingly being replaced by O. sativa for higher yield and productivity. However, O. glaberrima has many desired traits that O. sativa lacks, such as greater tolerance to diseases (for example, African rice gall midge and rice yellow mottle virus) and stresses (including drought, soil acidity, iron and aluminum toxicity, and weed competitiveness) (Sarla and Swamy 2005). Previous studies on starch structure and properties have been mainly focused on O. sativa, with only limited information being available on O. glaberrima and O. barthii species.

As the major component of rice grains, starch dominates the textural and nutritional qualities of rice. Starch is normally almost entirely composed of two types of polymers: mostly linear amylose and highly branched amylopectin. Both are made up of glucose units linearly extended by α-(1→4) bonds and connected by α-(1→6) bonds at branching points. Amylose has a small number of long-chain branches and relatively low molecular weights, whereas amylopectin accounts for ~75 % of regular starch dry mass, and has an enormous number of short-chain branches and much larger molecular weights. The amylose content and the structure of amylose and amylopectin determine the factors controlling many physiochemical properties of starch, including eating, cooking and nutritional qualities of rice.

Starch is synthesized in the rice endosperm through a complex pathway that is catalyzed by several enzymes, including granule bound starch synthases (GBSS), soluble starch synthases (SS), starch branching enzymes (SBE) and starch debranching enzymes (DBE) (Ball and Morell 2003). Each class of enzyme has multiple isoforms, and each isoform may play a distinct role in starch synthesis. Amylose is believed to be mainly elongated by GBSSI, with little branching or debranching activity, whereas amylopectin biosynthesis involves multiple enzymes, including several SS, SBE and DBE isoforms (Smith et al. 1997). Alterations in starch

204 structure, and hence the nutritional and processing properties of rice, could be achieved by modifying the genes encoding starch biosynthetic enzymes.

Gelatinization properties are frequently used in rice breeding programs as a rapid method to gauge rice quality (Cuevas et al. 2010a). Starch gelatinization is an order-disorder transition procedure that involves the disruption of the ordered semicrystalline structure of starch granules by heating in excess water, resulting in an amorphous matrix (Cooke and Gidley 1992; Jane et al. 1999). Gelatinization properties, commonly analyzed by differential scanning calorimetry (DSC), could predict the cooking and eating qualities of rice (Juliano and Perez 1983). Easily cooked rice is desirable for energy saving; however, rice grains that are relatively difficult to cook tend to have slower digestion rates, which can be beneficial to human health.

This study aims to determine the structural and gelatinization properties of starch from the African rice species O. glaberrima and O. barthii, and compare with those of the widely consumed O. sativa. This will help to find any advantages and potential applications of these less distributed African rice species. The molecular structures, including the molecular size distributions of whole starch molecules and chain-length distributions (CLDs) of debranched starch, of all three species were analyzed using size-exclusion chromatography (SEC) and fluorophore-assisted carbohydrate electrophoresis (FACE), respectively. Amylopectin CLDs were fitted with a biosynthesis-based mathematical model to identify possible enzymatic processes that are responsible for any unique structural and property characteristics of O. glaberrima and O. barthii. The resulting information could be beneficial in providing guidelines for gene engineering and rice breeding to produce rice with the desired traits from O. glaberrima, but similar or better eating and cooking qualities than O. sativa.

2. Materials and methods

2.1. Materials

African rice grains from ten O. glaberrima accessions and five O. barthii accessions supplied by International Rice Research Institute (IRRI, Los Baños, Laguna, The

205

Philippines) are listed in Table 34. Rice grains from four O. sativa breeding lines (YRR08-03-02, YRJ08-03-11, YRA08-04-10 and YRD08-01-22) were provided by Yanco Agricultural Institute, Department of Primary Industries, Yanco, NSW, Australia, and another O. sativa variety (Nipponbare) was provided by Dr. Matthew K. Morell, IRRI.

Protease from Streptomyces griseus (type XIV) was purchased from Sigma- Aldrich Pty. Ltd. (Castle Hill, NSW, Australia). Isoamylase (from Pseudomonas sp.) and a total starch (AA/AMG) assay kit were purchased from Megazyme International Ltd. (Bray, Co. Wicklow, Ireland). Pullulan standards with known peak molecular weights were purchased from Polymer Standards Service (PSS) GmbH (Mainz, Germany). Dimethyl sulfoxide (DMSO, GR grade for analysis) was purchased from Merck Co. Inc. (Kilsyth, VIC, Australia). All other chemicals were reagent grade and used as received.

2.2. Grinding of rice grains

Rice grains were dehulled and then ground into fine flour in a liquid nitrogen bath using a cryo-grinder (Freezer/Mill 6850 SPEX, Metuchen, NJ, USA) for 5 min at 10 s–1 (Tran et al. 2011).

206

Table 34: Information on O. barthii and O. glaberrima samples Accession Species Country of origin Provider number O. barthii 99565 Tanzania IRRI O. barthii 103895 Senegal IRRI O. barthii 104060 Nigeria IRRI O. barthii 104076 Nigeria IRRI O. barthii 86481 Zambia IRRI O. glaberrima 86825 Mali IRRI O. glaberrima 103450 Gambia IRRI O. glaberrima 104586 Cameroon IRRI O. glaberrima 103596 Cameroon IRRI O. glaberrima 96746 Nigeria IRRI O. glaberrima 103459 Senegal IRRI O. glaberrima 105023 Guinea IRRI O. glaberrima 104019 Tanzania IRRI O. glaberrima 104594 Burkina Faso IRRI O. glaberrima 104023 Guinea-Bissau IRRI

2.3. Size-exclusion chromatography

To prepare the starch sample for structural analysis, protein was removed by treating the flour with protease and sodium bisulfite solution, each followed by centrifugation using a method described elsewhere (Syahariza et al. 2010; Tran et al. 2011). The treated rice flour was then dissolved in a DMSO–0.5% (w/w) LiBr solution (denoted by DMSO/LiBr), and remaining non-starch polysaccharide components then removed by precipitating starch using a sufficient amount of ethanol (6 volumes of DMSO/LiBr) followed by centrifugation at 4000 g for 10 min. The precipitated starch was collected, and then dissolved in DMSO/LiBr at 80 °C overnight. The concentration of starch in DMSO/LiBr was analyzed using a Megazyme total starch assay kit, and then adjusted to 2 mg/mL for SEC analysis.

The SEC weight distribution of whole branched starch was characterized using an Agilent 1100 SEC system with a refractive index detector (RID; ShimadzuRID-10A, Shimadzu Corp., Kyoto, Japan) following a previously described method (Cave et al. 2009; Vilaplana

207 and Gilbert 2010b), and data were treated following the method given by Castro et al. (2005). Size separation of branched starch molecules was carried out using GRAM pre-column, GRAM 100 and GRAM 3000 columns (PSS) in a column oven at 80 °C, and starch molecules were eluted using DMSO/LiBr solution at 0.3 mL/min. A series of pullulan standards with molecular weights ranging from 342 to 2.35×106 were used for calibration, allowing elution volume to be converted to hydrodynamic volume (Vh, or the corresponding radius, Rh) using the Mark-Houwink equation (see Cave et al. (2009)). The Mark–Houwink parameters K and α of pullulan in DMSO/LiBr solution at 80 °C are 2.424 × 10–4 dL/g and 0.68, respectively.

The SEC weight distribution, wbr (logVh), of whole branched starch molecules was obtained from the RID signal and plotted as a function of Rh (Vilaplana and Gilbert 2010a).

To characterize the CLDs of amylose and amylopectin, starch samples were extracted as described above for whole branched starch molecules and debranched using isoamylase following a method described elsewhere (Hasjim et al. 2010; Wang et al. 2014). The debranched starch was freeze-dried, and then dissolved in DMSO/LiBr solution for SEC analysis. The same Agilent 1100 Series SEC system as described above was used, except using GRAM precolumn, GRAM 100 and GRAM 1000 columns (PSS) at a flow rate of 0.6 mL/min (Cave et al. 2009; Vilaplana and Gilbert 2010b). For debranched starch (a linear molecule), Vh was converted to the degree of polymerization DP (X) using the Mark-Houwink equation (where M = 162.2X + 18.0, 162.2 is the molecular weight of the anhydroglucose monomeric unit and 18.0 that of the additional water in the reducing end group), and the SEC weight distribution of individual chains, termed the weight CLD and denoted by wde(logX), was converted to the number distribution, the number CLD, denoted Nde(X), where Nde(X) = 2 X– wde(log X). The Mark-Houwink parameters used for linear starch in DMSO/LiBr solution at 80 °C were K = 1.5 × 10–4 dL/g and α = 0.743 (Wang et al. 2014).

2.4. Fluorophore-assisted carbohydrate electrophoresis

Starch was extracted from the flour, debranched, and freeze-dried in the same way as the preparation of debranched starch for SEC analysis. The freeze-dried starch was labeled using 8-aminopyrene-1,3,6,trisulfonic acid (APTS) following a procedure described by Wu et al. (2014a), and then separated with a carbohydrate separation buffer (Beckman-Coulter) in an N-CHO-coated capillary at 25° C using a voltage of 30 kV. The number CLD, Nde(X), of debranched amylopectin was characterized using a PA-800 Plus FACE System (Beckman-

208

Coulter, Brea, CA, USA), coupled with a solid-state laser-induced fluorescence (LIF) detector and an argon-ion laser as the excitation source (Wu et al. 2014a).

2.5. Mathematical fitting of the number chain-length distribution

The Nde(X) of debranched amylopectin from FACE was fitted with a mathematical model to parameterize the relative activities of three core classes of starch biosynthetic enzymes: starch synthases (SSs, including granule-bound starch synthases, GBSS, and soluble starch synthases, SS); starch branching enzymes (SBE) and starch debranching enzymes (DBE). The computer code for this model is publicly available online (Wu and Gilbert 2013), and the theory is described in detail elsewhere (Wu and Gilbert 2010; Wu et al. 2013). In summary, the amylopectin CLD is attributed to a number of enzyme sets, with each set comprising of various isoforms of SSs, SBE and DBE, and predominantly contributing to a particular range of the CLD. The resulting parameters are the activity ratio of SBE to SS (denoted β) and the relative contribution (denoted h) of each enzyme set to the entire amylopectin CLD.

2.6. Gelatinization properties

Starch granules were isolated from the cryo-ground rice flour using a wet-milling method described by Wang et al. (2014) which retains the crystalline structure of starch. This method used sodium bisulfite solution and toluene/sodium chloride (9/1, v/v) solution to remove proteins and lipids, followed by a repeated ethanol precipitation step at the end to wash off residual toluene and to facilitate the drying at low temperature. The collected starch was then air-dried and used for the analysis of gelatinization properties by DSC.

A differential scanning calorimeter (DSC, Mettler Toledo, Schwerzenbach, Switzerland) was used to analyze the gelatinization properties of starch following a method described by Zhang et al. (2014) with some modifications. Isolated starch granules (~4 mg, dry basis) and water (at a ratio of 1:3, w:w) were weighed and sealed in a 40 μL aluminium crucible. The crucible was allowed to equilibrate at room temperature for 1 h before loading into the DSC furnace. An empty sealed crucible was used as the reference, and indium was used for calibration. The crucible was held at 10 °C for 1 min in the DSC, and was then heated from 10 to 95 °C at a rate of 10 °C/min. The onset temperature (To), peak temperature (Tp), conclusion temperature (Tc) and enthalpy change (ΔH) were determined from each endotherm using the STARe software (Mettler Toledo).

209

2.7. Statistical analysis

All analyses were carried out in duplicate. The mean and standard deviation were analyzed by Minitab 16 (Minitab Inc., State College, PA, USA) using analysis of variance (ANOVA) with the general linear model and Tukey’s pairwise comparisons with the confidence level at 95.0%. Significant differences of the mean values were determined at p < 0.05.

3. Results and discussion

3.1. Molecular size distributions of whole starch molecules

The SEC weight distributions of whole starch molecules, wbr(logVh), are shown in Fig. 29, with three peaks observed for all samples. The peaks at Rh ~1000 nm and Rh ~30 nm correspond to amylopectin and amylose molecules, respectively. Another peak at Rh ~3 – 4 nm consists of residual proteins (Syahariza et al. 2010). Although most of the proteins were removed by protease during starch extraction, the presence of the residual protein peak indicates the amount of protease used might be insufficient, probably due to the relatively high protein content and strong protein network of the samples in the present study, being unpolished rice grains. Differences in the peak height indicate variations in the protein content or structure among the samples. O. barthii and O. glaberrima samples exhibit significantly higher peak-heights than O. sativa samples, probably because of the higher protein content and/or stronger protein structure that is not easily hydrolyzed by protease. The high protein content in African rice as compared to O. sativa has been reported in a previous study indicating that African rice has the potential to be used in breeding programs to improve the protein content of Asian rice (Li et al. 2004).

Because SEC analysis of large molecules like amylopectin suffered from inaccuracies resulting from both limitations in calibration and shear scission (Cuevas et al. 2010b), only the distribution of smaller molecules, such as amylose and debranched starch, was reliable from the SEC weight distribution. All distributions were normalized to the height of the amylose peak, and the distributions of whole branched amylopectin molecules are ignored.

The positions of the amylose peak maxima (denoted by Rh,AM) are given in Table 35. The

Rh,AM values of O. barthii and O. glaberrima are significantly lower than those of O. sativa, indicating significantly smaller amylose molecules in the former two species. Syahariza et al. (2013) and Syahariza et al. (2014) reported that the size of whole amylose molecules might

210 have an impact on the digestion rate of freshly cooked rice grains; however, whether it influences other starch/rice properties is still unknown and requires further study.

Figure 29: SEC weight distributions, wbr (logVh), of whole branched starch molecules extracted from O. barthii (O.B.), O. glaberrima (O.G.) and O. sativa (O.S.). All distributions were normalized to the same peak height of the peak at Rh ~30 nm.

211

Table 35: Structural parameters of amylose from all the samples

Amylose content Sample Rh,AM (nm) AUCAM1 AUCAM2 AUCAM3 (%)

O. B. 99565 24.7 ± 0.3 d-f 27.1 ± 0.1 a 0.051 ± 0.001 a-c 0.115 ± 0.001 a 0.062 ± 0.000 bc O. B. 103895 25.5 ± 0.2 c-e 26.5 ± 0.7 ab 0.052 ± 0.003 ab 0.099 ± 0.005 a 0.073 ± 0.001 bc O. B. 104060 24.6 ± 1.2 d-f 26.1 ± 1.3 abc 0.051 ± 0.000 a-c 0.099 ± 0.005 a 0.071 ± 0.004 c O. B. 104076 23.0 ± 1.7 d-f 26.8 ± 0.4 ab 0.052 ± 0.004 ab 0.098 ± 0.002 a 0.083 ± 0.000 ab O. B. 086481 24.4 ± 1.0 d-f 27.7 ± 0.8 a 0.056 ± 0.000 a 0.103 ± 0.004 a 0.085 ± 0.004 a O. G. 86825 23.3 ± 1.0 d-f 26.5 ± 0.3 ab 0.043 ± 0.002 de 0.091 ± 0.001 ab 0.083 ± 0.003 ab O. G. 103450 20.6 ± 0.7 f 26.8 ± 0.3 ab 0.046 ± 0.003 b-d 0.097 ± 0.002 a 0.083 ± 0.001 ab O. G. 104586 23.3 ± 0.3 d-f 26.3 ± 0.0 ab 0.043 ± 0.000 cd 0.091 ± 0.001 ab 0.081 ± 0.002 a-c O. G. 103596 22.2 ± 0.9 d-f 26.5 ± 0.0 ab 0.048 ± 0.000 a-d 0.096 ± 0.002 a 0.081 ± 0.002 a-c O. G. 96746 22.7 ± 0.2 d-f 26.4 ± 0.2 ab 0.044 ± 0.001 cd 0.096 ± 0.008 a 0.082 ± 0.001 ab O. G. 103459 23.5 ± 1.3 d-f 27.1 ± 0.0 a 0.048 ± 0.001 a-d 0.096 ± 0.001 a 0.080 ± 0.001 a-c O. G. 105023 22.3 ± 0.5 d-f 26.8 ± 0.4 ab 0.045 ± 0.001 b-d 0.098 ± 0.001 a 0.081 ± 0.002 a-c O. G. 104019 24.2 ± 0.0 d-f 26.3 ± 0.3 ab 0.044 ± 0.003 cd 0.094 ± 0.005 a 0.082 ± 0.002 a-c O. G. 104594 21.9 ± 3.2 ef 26.1 ± 0.6 abc 0.045 ± 0.002 b-d 0.096 ± 0.002 a 0.078 ± 0.002 a-c O. G. 104023 24.4 ± 1.0 d-f 26.0 ± 0.1 abc 0.046 ± 0.002 b-d 0.097 ± 0.001 a 0.073 ± 0.001 bc O. S. Nipponbare 29.6 ± 0.9 a-c 16.9 ± 1.1 d 0.024 ± 0.002 g 0.057 ± 0.003 d 0.050 ± 0.006 d O. S. YRB08-04-10 26.6 ± 0.2 b-d 24.6 ± 0.6 bc 0.035 ± 0.002 f 0.080 ± 0.004 bc 0.079 ± 0.005 a-c O. S. YRA08-03-11 32.1 ± 1.7 a 17.5 ± 0.0 d 0.022 ± 0.000 g 0.055 ± 0.000 d 0.057 ± 0.000 d O. S. YRR08-03- 30.3 ± 1.3 ab 18.6 ± 0.7 d 0.025 ± 0.002 g 0.060 ± 0.002 d 0.060 ± 0.003 d 02 O. S. YUD08-01- 26.1 ± 0.2 b-e 24.0 ± 0.0 c 0.036 ± 0.001 ef 0.077 ± 0.002 c 0.076 ± 0.002 a-c 22 a Mean ± standard deviation is calculated from duplicate measurements. Values with different letters in the same column are significantly different at p < 0.05.

3.2. Chain-length distributions of debranched starch

The SEC weight CLDs, wde (logX), of individual chains obtained from debranched starch as a function of DP X are shown in Fig. 30. Although SEC can be used to characterize chains of DP up to X 40 000, it is not reliable for short chains of amylopectin, because of the inaccuracies resulting from the calibration and from the Mark-Houwink empirical equation. FACE, in comparison, characterizes short chains very accurately, but it is not suitable for amylose chains due to the high noise to signal ratio for longer chains. SEC is the best technique for characterizing amylose chains. In the present study SEC was used for the CLD of individual amylose chains, while FACE was used to obtain an accurate CLD of amylopectin chains.

212

Figure 30: SEC weight CLDs, wde (logX), of (A) debranched starches and (B) an enlargement of the amylose region as a function of DP X;. All CLDs of starches from O. barthii (O.B.), O. glaberrima (O.G.) and O. sativa (O.S.) were normalized to yield the same area under the curve in order to avoid the effect of different sample concentrations.

Typical SEC weight CLDs were observed for all samples, with bimodal distributions observed for amylopectin (DP < 100) and multiple small bumps for amylose (DP > 100) (Fig. 30A). As the intensity of the RID signal is proportional to the mass concentration of a sample, all weight CLDs were normalized to yield the same global maximum to enable relative comparison in the amylose range. Apparent differences were observed in the weight CLDs of amylose chains among the different species (enlarged in Fig. 30B). The amylose weight CLDs appear to have three overlapping bumps. The amylose weight CLD was divided into the three groups accordingly. The first group (denoted by AM1) consists of shorter amylose chains, ranging from the DP at a local minimum (DP ~ 100) to DP 300. The other two groups are intermediate and long amylose chains, covering DP 300 – 1600 and 1600 – 40,000, denoted by AM2 and AM3, respectively. The areas under the curve (AUC) of AM1, AM2 and AM3 were used as parameters to indicate the amount of chains in each group. The amylose content was obtained from the weight CLD by calculating the ratio of the AUC of the whole amylose range (DP ~ 6 – 100) to the AUC of the whole starch distribution (DP ~ 6 – 40000).

All parameters (AUCAM1, AUCAM2, AUCAM3, and amylose content) for each sample are

213 displayed in Table 35, and are quantitatively compared between O. barthii, O. glaberrima and O. sativa.

The AUCAM1 value of O. barthii is slightly higher than O. glaberrima, while the AUCAM2 and AUCAM3 values are similar between O. barthii and O. glaberrima. The values of O. sativa are substantially lower in the other two species for AUCAM1, AUCAM2 and AUCAM3. These results suggest that the amylose structure of O. glaberrima is similar to its wild progenitor O. barthii, except for the slightly higher amount of short amylose chains of O. barthii (DP 100 – 300), indicating that amylose structure is largely unaffected by domestication in African rice. However, O. glaberrima and O. barthii have more amylose chains across all lengths compared to O. sativa. The amylose contents of O. barthii and O. glaberrima are similar (Table 35), ranging from 26.0 to 27.7%, and significantly higher than those of O. sativa, ranging from 16.9 to 24.6% (Table 35). This agrees with the results of Watanabe et al. (2002), where African rice starches were reported to have higher values and a narrower range of amylose content than Asian rice. However, a wider range of amylose content in African rice starch was reported by a recent study (Gayin et al. 2015). These differences in the amylose content and structure are most likely due to variations in the GBSSI enzyme, which is believed to be the major enzyme responsible for amylose synthesis (Smith et al. 1997). GBSSI is encoded by the waxy gene in rice, with various alleles reported and associated with the amylose content of starch (Kharabian-Masouleh et al. 2012). In addition to GBSSI, SBE almost certainly plays a role in amylose structure by affecting the lengths of amylose chains through branching (Wang et al. 2014). The amylose biosynthesis mechanism is still largely unexplored.

214

Figure 31: Number CLDs, Nde(X), of starches extracted from O. barthii (O.B.), O. glaberrima (O.G.) and O. sativa (O.S.) characterized using FACE, covering amylopectin All distributions were normalized to yield the same global maximum. Although the number CLD data comprise points for individual DPs, these are presented as continuous lines for visual ease.

The CLDs of amylopectin chains were analyzed using FACE, with the resulting number distributions being normalized to yield the same global maximum for qualitative comparisons (Fig. 31). All samples show typical number CLDs and have the same features as previously reported for rice starch (Wu et al. 2013). Three peaks/bumps can be observed for all samples. The first one is the global maximum at DP 12, with the peak covering DP 6 – 33; these chains span a single lamella. The second peak is from DP 34 to 67, with a local maximum at DP ~ 43. This peak represents intermediate amylopectin chains spanning two crystalline lamellae. The third bump is from long amylopectin chains ranging DP 68 – 100 that are likely to span through at least three crystalline lamellae, and are termed three- lamellae spanning chains (Wu et al. 2013). Chains greater than DP 100 are normally assigned as amylose chains, although there are almost certainly a small proportion of extra- long amylopectin chains with lengths similar to shorter amylose chains in this region (Hanashiro et al. 2008; Horibata et al. 2004).

215

The number CLDs of O. barthii and O. glaberrima samples are effectively superimposable, indicating that they have similar amylopectin branching structure. They are also similar to the number CLDs of O. sativa samples in the range of DP < 33. However, for DP ≥ 33, O. barthii and O. glaberrima show higher number CLDs than O. sativa, suggesting that the former two species have higher proportions of three-lamellae spanning amylopectin chains and more short amylose chains. This agrees well with the amylose weight CLDs (Fig. 30) and the amylopectin number CLDs converted from SEC weight CLDs as above. The same trends are also observed in the weight CLDs, wde (logX), converted from the number CLDs from FACE.

3.3. Parameterizing enzyme activities through mathematical modelling

As reported in our previous studies (Wang et al. 2014; Wu and Gilbert 2010; Wu et al. 2013;

Wu et al. 2014b), fitting the number CLDs, Nde(X), of amylopectin with a mathematical model gives the relative activities of core starch biosynthetic enzymes (GBSS, SS, SBE and DBE), and thus may be helpful in identifying the enzymes responsible for starch structural differences among various rice species. This study used an improved version of the model, compared to previous studies, that completely covers all the amylopectin chains up to DP 100, giving more information on the synthesis of long amylopectin chains. Preliminary fitting results suggest that the whole amylopectin CLD of rice starch can be attributed to three sets of enzymes, each with two subsets. Although the amylopectin number CLD is the result of the coordinated action of all of the six enzyme subsets, features in the CLD of single-lamella spanning chains (DP 6 – 33) are believed to be predominantly (but not exclusively) from enzyme-sets i and ii, those of two-lamellae spanning chains (DP 34 – 67) predominantly from enzyme-sets iii and iv, while the rest of the features in the range of three-lamellae spanning chains (DP 68 – 100) are mainly from enzyme-sets v and vi. Six β values (βi – βvi) were obtained, each representing the relative activity of SBE to SS within each sub-enzyme set. Another parameter h, reflecting the relative contribution of each enzyme-set to the whole CLD, was obtained only for the three enzyme sets, as rice CLDs are best fitted by what is termed the substrate-competing model (Wu et al. 2013) when the two enzyme subsets act together. The resulting h values for each of the three enzyme sets are denoted by hi, hiii and hv. Together with the β values, they are compared statistically between O. barthii, O. glaberrima and O. sativa, with results in Fig. 32.

216

Figure 32: Statistical comparisons of mathematical fitting parameters among O. barthii, O. glaberrima and O. sativa: (A) hi, (B) hiii, (C) hv, (D) βi, (E) βii, (F) βiii, (G) βiv, (H) βv and (I) βvi. Species with different letters in the same panel are significantly different at p < 0.05. “*”stands for the outlier.

217

No significant differences were observed in the hi, hiii and hv values among the three species, meaning the contribution of each enzyme set to the amylopectin CLD is similar among the three species. This also shows that the contribution of each enzyme set is not the reason causing the starch structural differences. However, significant differences are observed in the β values. The βi and βii values do not vary significantly between O. barthii, O. glaberrima and O. sativa species, suggesting the activity ratios of enzymes in enzyme-sets i and ii are similar among the three species, and that these enzymes are not responsible for the variations in the starch structure. The βiii and βv values, on the other hand, are significantly different among the three species, with O. barthii and O. glaberrima displaying higher values than O. sativa. The opposite trends are observed in βiv and βvi, although the differences are less or not significant. βiii and

βv represent the activity ratio of SBE to SS in enzyme-sets iii and v, which are involved in the synthesis of two- and three-lamellae-spanning amylopectin chains (DP 34 –

100). The higher βiii and βv values of O. barthii and O. glaberrima, in comparison with O. sativa, indicate that these particular enzymes sets have a lower SBE activity and/or a higher SS activity.

SBE is the only enzyme that can introduce α-(1→6) bonds into starch molecules and hence plays a core role in amylopectin biosynthesis. There are two types of SBE in higher plants, SBEI and SBEII, differing in the preferred length of the glucan chains transferred and their substrate specificities. SBEI tends to branch amylose and preferentially transfers longer branches (DP > 10), whereas SBEII has higher activity for branching amylopectin and transfers short branches (DP 3  9) (Guan et al. 1997; Takeda et al. 1993). Note that “preferred” does not mean activities for other chain lengths are zero, but rather are significantly smaller. SBEII can be further divided into two isoforms in cereals, namely SBEIIa and SBEIIb, with SBEIIa being primarily presented in rice leaves while SBEIIb is preferably expressed in the grains (Ohdan et al. 2005). A deficiency in SBEIIb results in the amylose-extender mutant phenotype, where the amylopectin contains higher proportions of longer chains and a smaller number of branches (Takeda and Preiss 1993). The lower activity ratio of SBE to SS for O. glaberrima and O. barthii in this study, as compared with O. sativa, is probably because they have lower SBEIIb activity, leading to a reduced amount of cleaved short chains and a higher amount of longer amylopectin chains.

218

This phenomenon could also arise if there were a higher SS activity of O. glaberrima and O. barthii compared to O. sativa, as SS is responsible for the elongation of amylopectin chains. Three types of SS have been identified: SSI, SSII and SSIII, each preferably synthesizing amylopectin chains of different lengths (Smith et al. 1997). SSI has no reported isoforms, and is able to elongate the shortest amylopectin chains (DP 6 – 7) to produce chains of DP ~9 – 12 (Fujita et al. 2006). SSII and SSIII, on the other hand, both have two isoforms, with SSIIa and SSIIIa being specifically expressed in rice endosperms and SSIIb and SSIIIb in leave tissues (Hirose and Terao 2004; Tetlow 2011). SSIIa generates chains of DP 13 – 24 (Umemoto et al. 2002), which can then be elongated by SSIIIa (Zhu et al. 2013). Thus O. glaberrima and O. barthii might have more activated SSIIIa than O. sativa, generating longer amylopectin chains, which span two or more crystalline lamellae (DP > 34). Further analysis on the genetics of these potential enzymes could confirm these inferences, which may provide information for genetically modifying genes to produce rice with good traits from both O. glaberrima and O. sativa.

3.4. Gelatinization properties

The onset (To), peak (Tp), conclusion temperatures (Tc) and enthalpy (ΔH), analyzed using DSC, of O. glaberrima, O. barthii and O. sativa are compared in Fig. 33. The To,

Tp and Tc of O. glaberrima are similar to those of O. barthii, ranging from 65.6 – 68.8 °C, 70.9 – 73.8 °C, and 76.8 – 80.9 °C, respectively; these are significantly higher than those of O. sativa, with the To, Tp and Tc values ranging from 55.8 – 65.1 °C, 63.6 – 70.5 °C and 72.0 – 76.2 °C, respectively. In addition, O. sativa samples have a variety of gelatinization temperatures, due to polymorphisms in their genetic background. O. glaberrima and O. barthii samples, on the other hand, lack genetic diversity and display relatively similar gelatinization temperatures. The ΔH results show a different trend, with O. glaberrima having a significantly lower ΔH than O. barthii and O. sativa. This difference in trends is because the enthalpy and gelatinization temperatures reflect different characteristics of starch. The gelatinization temperatures (To, Tp and

Tc) represent the heat stability of the crystallites, and are related to the length of double helices, whereas ΔH reflects the energy needed to disrupt the molecular order, and thus is associated with the amount of molecular order and the crystallinity (Cooke and Gidley 1992; Tester and Morrison 1990). The lower ΔH of O. barthii compared to O.

219 glaberrima and O. sativa shows there is a smaller amount of molecular order that requires less energy to be disrupted. The significantly higher To, Tp and Tc values of O. glaberrima and O. barthii, in comparison with O. sativa, suggests that the former two species have longer and more stable double helices arising from their higher proportion of long amylopectin chains that span through at least two crystalline lamellae (DP 34 – 100, as shown in the number CLDs in Fig. 31) (Witt et al. 2012). This phenomenon might also be associated with the higher amylose contents of O. glaberrima and O. barthii, compared to O. sativa. However, there is still disagreement on whether the amylose content influences the gelatinization temperatures of starch (Jane and Chen 1992; Noda et al. 1998), which needs to be clarified by further studies. The higher gelatinization temperatures of starch in O. glaberrima and O. barthii might lead to a slower digestion rate, as starch in these grains might not be completely gelatinized after being cooked by regular cooking methods (Syahariza et al. 2013). These rice species have the potential to provide natural sources of slowly digestible starch that are believed to have nutritional benefits (Higgins and Brown 2013; Johnston et al. 2010). It is perhaps due to this slower digestion rate that consumers of African rice have reported that it provides better satiety than Asian rice (Teeken et al. 2012). In order to slow down the rate of digestion of Asian rice and thereby increase satiety, these farmers usually cook Asian rice mixed with African rice. This also helps to reduce the consumption rate of the usually tasty Asian rice varieties (Teeken et al. 2011), a practice that may have positive impact on food security in the region.

220

Figure 33: Gelatinization properties of starch samples, compared among O. barthii (blue bars), O. glaberrima (orange bars) and O. sativa (green bars). To, Tp, Tc, and ΔH are the onset temperature, peak temperature, conclusion temperature, and enthalpy change, respectively. Each bar stands for the mean value of one variety, calculated from duplicate measurements. Mean ± standard deviation above the bars is calculated from different varieties within the same species. Values with different letters are significantly different at p < 0.05.

221

4. Conclusions

The molecular fine structure of starches extracted from African domesticated rice O. glaberrima and its wild progenitor species O. barthii were characterized using SEC and FACE, and were compared with the common Asian domesticated rice O. sativa. The amylose structure, including the molecular size distributions of whole amylose molecules and the CLDs of debranched amylose, of O. glaberrima are similar to those of O. barthii. The amylose structure of O. sativa starch, however, is significantly different, with smaller molecular sizes and lower amounts of amylose chains of all lengths. This might result from a lower activity of GBSSI enzyme in O. sativa. In addition to the amylose structure, the amylopectin structures also vary among the species, with O. glaberrima and O. barthii showing significantly different number CLDs than O. sativa. Higher proportions of intermediate and long amylopectin chains that span through at least two crystalline lamellae (DP > 34) are observed in O. glaberrima and O. barthii starches when compared with O. sativa starches. Modelling the amylopectin biosynthesis suggest that this is due to the lower activity ratio of SBE to SS in O. glaberrima and O. barthii, in comparison with O. sativa, probably because their SBEIIb has a lower activity and/or their SSIIIa has higher. The higher proportion of intermediate and long amylopectin chains of O. glaberrima and O. barthii leads to their higher gelatinization temperatures, meaning they require higher temperatures to be thoroughly cooked and may result in a slower digestion rate of rice grains. The information resulting from this study can then be used to verify genes responsible for the starch structural features of O. glaberrima and O. barthii, helping to shed light on gene modification in the rice breeding industry to improve rice quality.

5. References

Ball SG, Morell MK (2003) From bacterial glycogen to starch: understanding the biogenesis of the plant starch granule. Annu Rev Plant Biol 54:207-233

Castro JV, Dumas C, Chiou H, Fitzgerald MA, Gilbert RG (2005) Mechanistic information from analysis of molecular weight distributions of starch. Biomacromolecules 6:2248-2259

222

Cave RA, Seabrook SA, Gidley MJ, Gilbert RG (2009) Characterization of starch by size-exclusion chromatography: the limitations imposed by shear scission. Biomacromolecules 10:2245-2253

Cooke D, Gidley MJ (1992) Loss of crystalline and molecular order during starch gelatinisation: origin of the enthalpic transition. Carbohydr Res 227:103-112

Cuevas RP, Daygon VD, Corpuz HM, Nora L, Reinke RF, Waters DLE, Fitzgerald MA (2010a) Melting the secrets of gelatinisation temperature in rice. Funct Plant Biol 37:439-447

Cuevas RP, Daygon VD, Morell MK, Gilbert RG, Fitzgerald MA (2010b) Using chain- length distributions to diagnose genetic diversity in starch biosynthesis. Carbohydr Polym 81:120-127

Fujita N, Yoshida M, Asakura N, Ohdan T, Miyao A, Hirochika H, Nakamura Y (2006) Function and characterization of starch synthase I using mutants in rice. Plant Physiol 140:1070-1084

Gayin J, Chandi GK, Manful J, Seetharaman K (2015) Classification of rice based on statistical snalysis of pasting properties and apparent amylose content: The case of Oryza glaberrima accessions from Africa. Cereal Chemistry Journal 92:22-28

Guan HP, Li P, ImparlRadosevich J, Preiss J, Keeling P (1997) Comparing the properties of Escherichia coli branching enzyme and maize branching enzyme. Arch Biochem Biophys 342:92-98

Hanashiro I, Itoh K, Kuratomi Y, Yamazaki M, Igarashi T, Matsugasako JI, Takeda Y (2008) Granule-bound starch synthase I is responsible for biosynthesis of extra- long unit chains of amylopectin in rice. Plant Cell Physiol 49:925-933

Hasjim J, Cesbron Lavau G, Gidley MJ, Gilbert RG (2010) In vivo and in vitro starch digestion: Are current in vitro techniques adequate? Biomacromolecules 11:3600-3608

Higgins JA, Brown IL (2013) Resistant starch: a promising dietary agent for the prevention/treatment of inflammatory bowel disease and bowel cancer. Curr Opin Gastroenterol 29:190-194

Hirose T, Terao T (2004) A comprehensive expression analysis of the starch synthase gene family in rice (Oryza sativa L.). Planta 220:9-16

223

Horibata T, Nakamoto M, Fuwa H, Inouchi N (2004) Structural and physicochemical characteristics of endosperm starches of rice cultivars recently bred in Japan. Journal of Applied Glycoscience 51:303-313

Huang P, Molina J, Flowers JM, Rubinstein S, Jackson SA, Purugganan MD, Schaal BA (2012) Phylogeography of Asian wild rice, Oryza rufipogon: a genome-wide view. Mol Ecol 21:4593-4604

Jane JL, Chen JF (1992) Effect of amylose molecular size and amylopectin branch chain length on paste properties of starch. Cereal Chem 69:60-65

Jane JL, Chen YY, Lee LF, McPherson AE, Wong KS, Radosavljevic M, Kasemsuwan T (1999) Effects of amylopectin branch chain length and amylose content on the gelatinization and pasting properties of starch. Cereal Chem 76:629-637

Johnston KL, Thomas EL, Bell JD, Frost GS, Robertson MD (2010) Resistant starch improves insulin sensitivity in metabolic syndrome. Diabet Med 27:391-397

Juliano BO, Perez CM (1983) Major factors affecting cooked milled rice hardness and cooking time. J Texture Stud 14:235-243

Kharabian-Masouleh A, Waters DLE, Reinke RF, Ward R, Henry RJ (2012) SNP in starch biosynthesis genes associated with nutritional and functional properties of rice. Scientific Reports 2:1-9

Li JM, Xiao JH, Grandillo S, Jiang LY, Wan YZ, Deng QY, Yuan LP, McCouch SR (2004) QTL detection for rice grain quality traits using an interspecific backcross population derived from cultivated Asian (O-sativa L.) and African (O- glaberrima S.) rice. Genome 47:697-704

Noda T, Takahata Y, Sato T, Suda I, Morishita T, Ishiguro K, Yamakawa O (1998) Relationships between chain length distribution of amylopectin and gelatinization properties within the same botanical origin for sweet potato and buckwheat. Carbohydr Polym 37:153-158

Ohdan T, Francisco PB, Sawada T, Hirose T, Terao T, Satoh H, Nakamura Y (2005) Expression profiling of genes involved in starch synthesis in sink and source organs of rice. J Exp Bot 56:3229-3244

Sarla N, Swamy BPM (2005) Oryza glaberrima: A source for the improvement of Oryza sativa. CURRENT SCIENCE 89:955-963

Smith AM, Denyer K, Martin C (1997) The synthesis of the starch granule. Annu Rev Plant Physiol Plant Mol Biol 48:65-87

224

Sweeney M, Mccouch S (2007) The complex history of the domestication of rice. Ann Bot-London 100:951-957

Syahariza ZA, Li E, Hasjim J (2010) Extraction and dissolution of starch from rice and sorghum grains for accurate structural analysis. Carbohydr Polym 82:14-20

Syahariza ZA, Sar S, Hasjim J, Tizzotti MJ, Gilbert RG (2013) The importance of amylose and amylopectin fine structures for starch digestibility in cooked rice grains. Food Chem 136:742-749

Syahariza ZA, Sar S, Warren FJ, Zou W, Hasjim J, Tizzotti MJ, Gilbert RG (2014) Corrigendum to "The importance of amylose and amylopectin fine structures for starch digestibility in cooked rice grains", [Food Chem. 136 (2013) 742-749]. Food Chem 145:617-618

Takeda Y, Guan HP, Preiss J (1993) Branching of amylose by the branching isoenzymes of maize endosperm. Carbohydr Res 240:253-263

Takeda Y, Preiss J (1993) Structures of B90 (Sugary) and W64a (Normal) Maize Starches. Carbohydr Res 240:265-275

Teeken B, Nuijten E, Temudo MP, Okry F, Mokuwa A, Struik PC, Richards P (2012) Maintaining or abandoning African rice: lessons for understanding processes of seed innovation. Hum Ecol 40:879-892

Teeken B, Okry F, Nuijten E (2011) Improving rice-based technology development and dissemination through a better understanding of local innovation systems. Innovation and Partnerships to Realize Africa’s Rice Potential:4.2

Tester RF, Morrison WR (1990) Swelling and gelatinization of cereal starches .1. effects of amylopectin, amylose, and lipids. Cereal Chem 67:551-557

Tetlow IJ (2011) Starch biosynthesis in developing seeds. Seed Sci Res 21:5-32

Tran TTB, Shelat KJ, Tang D, Li EP, Gilbert RG, Hasjim J (2011) Milling of rice grains. The degradation on three structural levels of starch in rice flour can be independently controlled during grinding. J Agric Food Chem 59:3964-3973

Umemoto T, Yano M, Satoh H, Shomura A, Nakamura Y (2002) Mapping of a gene responsible for the difference in amylopectin structure between japonica-type and indica-type rice varieties. Theor Appl Genet 104:1-8

225

Vilaplana F, Gilbert RG (2010a) Characterization of branched polysaccharides using multiple-detection size separation techniques. J Sep Sci 33:3537-3554

Vilaplana F, Gilbert RG (2010b) Two-dimensional size/branch length distributions of a branched polymer. Macromolecules 43:7321-7329

Wambugu PW, Furtado A, Waters DLE, Nyamongo DO, Henry RJ (2013) Conservation and utilization of African Oryza genetic resources. Rice 6:29

Wang K, Hasjim J, Wu AC, Henry RJ, Gilbert RG (2014) Variation in Amylose Fine Structure of Starches from Different Botanical Sources. J Agric Food Chem 62:4443-4453

Watanabe H, Futakuchi K, Jones MP, Teslim I, Sobambo BA (2002) Brabender viscogram characteristics of interspecific progenies of Oryza glaberrima steud and o. sativa L. J Jpn Soc Food Sci 49:155-165

Witt T, Doutch J, Gilbert EP, Gilbert RG (2012) Relations between molecular, crystalline, andlamellar structures of amylopectin. Biomacromolecules 13:4273-4282

Wu AC, Gilbert RG (2010) Molecular weight distributions of starch branches reveal genetic constraints on biosynthesis. Biomacromolecules 11:3539-3547

Wu AC, Gilbert RG (2013) Program “APCLDFIT”. Brisbane, pp Least-squares fit of experimental amylopectin chain-length distributions to parameterized model for biosynthesis

Wu AC, Li E, Gilbert RG (2014a) Exploring extraction/dissolution procedures for analysis of starch chain-length distributions. Carbohydr Polym 114:36-42

Wu AC, Morell MK, Gilbert RG (2013) A parameterized model of amylopectin synthesis provides key insights into the synthesis of granular starch. PLoS One 8:e65768

Wu AC, Ral J-P, Morell MK, Gilbert RG (2014b) New perspectives on the role of alpha- and beta-amylases in transient starch synthesis. PLoS One 9:e100498

Zhang B, Wang K, Hasjim J, Li E, Flanagan BM, Gidley MJ, Dhital S (2014) Freeze- Drying Changes the Structure and Digestibility of B-Polymorphic Starches. J Agric Food Chem 62:1482-1491

226

Zhu F, Bertoft E, Kallman A, Myers AM, Seetharaman K (2013) Molecular Structure of Starches from Maize Mutants Deficient in Starch Synthase III. J Agric Food Chem 61:9899-9907

227

Appendix 2: Non-synonymous candidate SNPs putatively associated with amylose content in African rice Locus ID Gene name SNP Position Amino acid change

LOC_Os06g04040 WD domain, G-beta repeat domain containing protein T/C 1653659 Ser to Pro

LOC_Os06g04060 expressed protein T/G 1693774 Ser to Ala LOC_Os06g04169 hydrolase, alpha/beta fold family domain containing G/C 1736462 Thr to Ser protein

LOC_Os06g04200 Granule-bound starch synthase 1 G/A 1769686 Asp to Asn LOC_Os06g04280 3-phosphoshikimate 1-carboxyvinyltransferase, C/T 1816306 Ala to Val chloroplast precursor LOC_Os06g04440 expressed protein G/A 1900141 Cys to Tyr LOC_Os06g04520 WASH complex subunit 7-like isoform X1 T/C 1945370 Phe to Ser Loc_Os06g04550 expressed protein T/C 1971594 Ser to Pro Loc_Os06g04550 expressed protein T/C 1971601 Leu to Pro LOC_Os06g04660 oxidoreductase, 2OG-Fe oxygenase family protein A/T 2030977 Lys to Met T/C 2030920 Leu to Pro A/T 2030977 Lys to Met C/G 2031158 Ile to Met A/C 2032835 Lys to Thr LOC_Os06g04790 HAD superfamily phosphatase G/T 2088936 Glu to Asp LOC_Os06g04850 homeobox associated leucine zipper G/A 2123781 Val to Met A/G 2123811 Thr to Ala

LOC_Os06g04970 serine protease-like C/T 2193215 Thr to Ile T/C 2189916 Trp to Arg A/T 2191680 Lys to Met T/C 2193139 Ser to Pro C/T 2193215 Thr to Ile LOC_Os06g05050 OsWAK61 - OsWAK receptor-like protein kinase A/G 2231063 Lys to Glu A/C 22311069 Ile to Leu G/A 2232026 Asp to Asn G/C 2232134 Asp to His G/A 2232429 Ser to Asn A/G 2232713 Thr to Ala A/G 2232827 Thr to Ala T/G 2232834 Met to Arg C/T 2232866 Leu to Phe G/A 2232890 Val to Met G/C 2232980 Glu to Gln LOC_Os06g05130 myristoyl-acyl carrier protein thioesterase, chloroplast G/A 2278677 Ala to Thr precursor

LOC_Os06g05220 expressed protein T/G 2342262 Ser to Ala Loc_Os06g05230 retrotransposon protein T/G 2345081 Asp to Glu A/G 2345508 Ile to Val G/A 2345643 Gly to Arg G/A 2346661 Gly to Glu

228

A/G 2347278 Ser to Gly A/G 2347621 His to Arg C/G 2347638 Gln to 2Glu G/A 2347773 Glu to Lys T/C 2348010 Cys to Arg A/G 2348301 Ile to Val T/C 2348304 Cys to Arg LOC_Os06g05260 pectate lyase precursor, putative T/G 2373497 Leu to Arg T/A 2373890 Met to Lys G/A 2374173 Arg to His Loc_Os06g05284 transferase family protein A/G 2381841 Asn to Asp Loc_Os06g05310 transferase family protein A/G 2389627 Lys to Glu LOC_Os06g05350 whirly transcription factor domain containing protein G/T 2405572 Ala to Ser

LOC_Os06g05359 NBS-LRR disease resistance protein C/A 2414037 Ser to Arg A/G 2415254 His to Arg A/C 2415341 His to Pro T/G 2415448 Trp to Gly A/G 2416555 Asn to Asp LOC_Os06g05368 expressed protein A/G 2419603 Lys to Arg LOC_Os06g05380 WRKY73 C/T 2429308 Leu to Phe T/C 2433127 Leu to Pro LOC_Os06g05450 expressed protein T/C 2461309 Leu toPro LOC_Os06g05540 expressed protein A/G 2510507 His to Arg C/T 2510549 Ala to Val Loc_Os06g05580 OsFBDUF30 - F-box and DUF domain containing G/C 2530751 Met to Ile protein

LOC_Os06g05590 OsFBX186 - F-box domain containing protein T/C 2533567 Leu to Pro

LOC_Os06g05600 OsFBDUF31 - F-box and DUF domain containing C/G 2536081 Arg to Gly protein T/C 2564466 Met to Thr Loc_Os06g05610 OsFBDUF32 - F-box and DUF domain containing G/A 2538835 Gly to Asp protein C/T 2538859 Ala to Val C/G 2539218 Leu to Val LOC_Os06g05620 OsFBDUF33 - F-box and DUF domain containing C/T 2543338 Arg to Cys protein LOC_Os06g05630 GDSL-like lipase/acylhydrolase G/A 2548150 Arg to Gln T/C 2548180 Ile to Thr T/C 2548862 Leu to Ser LOC_Os06g05690 cysteine synthase, chloroplast/chromoplast precursor G/A 2574901 Val to Ile

LOC_Os06g05710 pollen-specific protein C/A 2582951 Ala to Glu LOC_Os06g05760 ubiquitin family protein A/G 2608688 Arg to Gly LOC_Os06g05820 OsLonP2 - Putative Lon protease homologue A/G 2654150 Ile to Val

LOC_Os06g05870 dual specificity protein phosphatase C/G 2688014 Pro to Ala LOC_Os06g06080 serine esterase family protein G/A 2804399 Gly to Glu Loc_Os06g06080 serine esterase family protein G/C 2803491 Ser to Thr

229

LOC_Os06g06250 GDSL-like lipase/acylhydrolase C/T 2895517 Pro to Ser LOC_Os06g06320 osFTL2 FT-Like2 homologous to Flowering Locus T G/C 2940248 Lys to Asn gene; contains Pfam profile PF01161: Phosphatidylethanolamine-binding protein LOC_Os06g06330 expressed protein A/T 2945721 His to Leu LOC_Os06g06350 AMP-binding enzyme A/G 2954475 Ile to Met LOC_Os06g06430 expressed protein T/A 3000187 Val to Glu LOC_Os06g06450 heat shock protein STI A/C 3036142 Glu to Asp T/A 3038682 Val to Asp C/G 3039165 His to Gln T/C 3039205 Phe to Leu LOC_Os06g06470 U-box domain containing heat shock protein C/T 3042434 Thr to Ile T/A 3043759 Ser to Thr G/A 3045525 Glu to Lys LOC_Os06g06540 AP2 domain containing protein T/G 3068018 Ser to Ala C/A 3068049 Ala to Glu C/T 3068052 Ala to Val LOC_Os06g06740 MYB family transcription factor T/A 3162015 Val to Asp LOC_Os06g06770 OsPOP11 - Putative Prolyl Oligopeptidase G/A 3186805 Arg to His homologue LOC_Os06g06790 OsPDIL1-5 protein disulfide isomerase PDIL1-5 G/T 3200117 Ser to Ile

LOC_Os06g07690 expressed protein G/A 3721726 Ser to Asn LOC_Os06g07700 expressed protein C/T 3723446 Arg to Trp LOC_Os06g07770 expressed protein C/T 3764827 Ser to Phe C/T 3770955 His to Tyr LOC_Os06g07790 retrotransposon protein C/T 3784973 Leu to Phe LOC_Os06g07800 expressed protein G/T 3788267 Cys to Phe LOC_Os06g07810 transposon protein, putative, unclassified G/A 3794599 Gly to Asp

230