Evolution and Bioinformatics Analysis of the Drug: H+ Antiporter Family 1 (DHA1) in the hemiascomycetes yeasts

André Miguel Moreira Machado

Thesis to obtain the Master of Science Degree in

Biotechnology

Supervisors: Prof. Dr. Isabel Maria de Sá Correia Leite de Almeida

Dr. Paulo Jorge Moura Pinto da Costa Dias

Examination Committee

Chairperson: Prof. Dr. Arsénio do Carmo Sales Mendes Fialho

Supervisor: Dr. Paulo Jorge Moura Pinto da Costa Dias

Members of the Committee: Dr. Margarida Isabel Rosa Bento Palma

July 2016

Acknowledgements

First of all, I would like to thank my supervisor, Professor Isabel Sá-Correia, for giving me the opportunity to join the Biological Sciences Research Group (BSRG), for speeding up as much as possible the administrative procedures involved in submitting this dissertation, and for her patience with students like me. I would also like to give special thanks to my supervisor Dr. Paulo Dias, who guided me throughout this work over the last year and a half and tirelessly helped me at every stage, from the development of methodologies to the writing and preparation of the document. I must also stress that without his high standards and perseverance I would most probably not have acquired the enormous amount of knowledge I gained in computing, biology and evolution. My bioinformatics work was carried out within the BSRG, a unit of the Institute for Bioengineering and Biosciences (IBB), which belongs to Instituto Superior Técnico (IST) and is part of the University of Lisbon. I would like to thank this group for all the technical support they gave me and for the good atmosphere they provided.

Moving on to more personal acknowledgements, it is important to say first that I wish I could still have in my life all the people who helped me during this stage of my academic life. Unfortunately, that is not possible. This made me reflect on the priorities we should have during our existence, and on the value we should give to those who support us most.

To my paternal grandparents, who both passed away during this Master's degree and who always gave me unconditional and irreproachable support, I would like to dedicate all the hard work and overwhelming willpower that this degree and this thesis demanded. Out of respect for them, and because Portuguese is my mother tongue, I wrote these acknowledgements in Portuguese.

I would also like to thank my parents and my brother for the words of encouragement and affection that helped me overcome every obstacle, and for the favourable conditions they provided, which made it possible to win one more battle. Part of this work is, without doubt, an achievement of this special family of mine.

I would further like to thank my faithful friends Filipe Tente, João Nuno, Pedro Parente and João Pereira who, although sometimes far away, were often the closest. To my great friend Gonçalo Curveira Santos, thank you for the opportunity to live with you in Lisbon and for all the support, both personal and professional. Gonçalo, you will be one of the brilliant minds of this country; keep fighting. To my friends at IST, a special thank you to Carlos Branco, Pedro Santos, Miguel Gomes, Simion Petru and Rafael Crisóstomo, among others (you know who you are), for all the computing help, friendship and leisure moments that kept me sane (or not).

Finally, and because the best always comes last, I want to thank my life partner Ana Ferreira for all the scientific and personal support, extraordinary encouragement and constant joy, which have ensured and will always ensure that I give my best despite the tiredness and the hardest days. This thesis is as much yours as it is mine.

Abstract

The Saccharomyces cerevisiae 12-spanner drug: H+ antiporters (DHA1) and 14-spanner drug: H+ antiporters (DHA2) of the Major Facilitator Superfamily (MFS) are involved in the Multidrug/Multixenobiotic resistance (MDR/MXR) phenomenon. The aim of the present work is to reconstruct and characterize the evolution of the DHA1 genes encoded in 33 hemiascomycetous strains classified in the Saccharomycetaceae taxonomic family (corresponding to a total of 29 yeast species). The DHA1 and DAG (DHA2, ARN, GEX) proteins encoded in the genomes of 61 additional strains, spanning more than 15 hemiascomycetous taxonomic families, were also identified and briefly analysed. Constraining and traversing a network representing the blastp pairwise similarity relationships between more than one million hemiascomycetous translated ORFs allowed the identification of 1382 bona fide full-size DHA1 transporters (after correction of problematic translated ORFs). The evolutionary history of the DHA1 genes encoded in the genome sequences of the 33 Saccharomycetaceae strains was reconstructed by combining phylogenetic and gene neighbourhood approaches. Twenty-six DHA1 phylogenetic clusters were identified, and nine DHA1 gene lineages reported in previously published studies were revised and extended. State-of-the-art methodologies in phylogeny, comparative genomics and protein evolution were used to advance the existing knowledge on the still poorly biochemically characterized DHA1 transporters, providing new insights into how the functional diversification of these proteins is related to ancestral genomic events, such as the Whole Genome Duplication (WGD), local gene duplications and losses, Horizontal Gene Transfers (HGT) between yeast species, chromosomal rearrangements, and other genome reshaping phenomena.

Keywords: DHA1, DAG, Saccharomyces cerevisiae, Multidrug resistance, Saccharomycetaceae, Whole Genome Duplication.

Resumo

The Saccharomyces cerevisiae genes of family 1 (DHA1) and family 2 (DHA2) of drug: H+ antiporters, which belong to the Major Facilitator Superfamily, are involved in multidrug/multixenobiotic resistance. The aim of this work was to reconstruct and characterize the evolution of the DHA1 genes encoded in the genomes of 33 hemiascomycetous strains classified in the Saccharomycetaceae taxonomic family (corresponding to a total of 29 species). In parallel, the DHA1 and DAG proteins encoded in the genomes of 61 additional strains, classified in more than 15 hemiascomycetous families, were identified and briefly analysed. Constraining and traversing a network of blastp pairwise similarities between more than one million hemiascomycetous ORFs allowed the identification of 1382 non-fragmented DHA1 transporters (after correction of the problematic ORFs). The evolutionary history of the DHA1 genes encoded in the genomes of the 33 Saccharomycetaceae strains was then reconstructed by combining phylogenetic and gene neighbourhood techniques. This allowed the identification of 26 phylogenetic groups of DHA1 genes and the revision and extension of 9 DHA1 gene lineages published in previous studies. State-of-the-art methodologies in phylogeny, comparative genomics and protein evolution were used to expand the knowledge on these DHA1 transporters, which are still poorly characterized biochemically, yielding new conclusions on how the functional diversification of these proteins relates to ancestral genomic events such as the whole genome duplication, local gene duplications and losses, horizontal gene transfers between yeast species, chromosomal rearrangements, and other genome remodelling phenomena.

Keywords: DHA1, DAG, Saccharomyces cerevisiae, Multidrug resistance, Saccharomycetaceae, Whole genome duplication.

Contents

Acknowledgements

Abstract

Resumo

List of figures

List of tables

List of abbreviations

1. Goals and brief introduction to the thesis study

2. General Introduction

2.1. Yeasts: Biotechnology, and diversity

2.1.1. Biotechnology and applications

2.1.2. Evolution, taxonomy and diversity of yeasts

2.2. Yeasts: Comparative genomic, and evolution

2.2.1. Comparative genomics

2.2.2. Comparative genomics and evolution of yeasts

2.2.3. Comparative genomics and synteny of yeasts

2.3. Yeasts and Multidrug Resistance

2.3.1. MDR-MFS Transporters

2.3.2. DAG family

2.3.3. DHA1 family

2.4. Bioinformatics, phylogenetic and protein evolution analyses

2.4.1. Sequence alignment

2.4.2. Phylogenetic analysis

2.4.3. Positive, negative selection and conservation analyses

3. Materials and Methods

3.1. Selection of the Hemiascomycetes yeasts used in this study

3.2. Selection of DHA1 and DAG gene families in 94 Hemiascomycetous Strains

3.2.1. Extraction of DHA1 and DAG gene sequences from a Local Database

3.2.2. Sequence clustering of all DHA1 and DAG Proteins

3.3. Topology prediction, multiple alignments, and construction of DHA1 and DAG phylogenetic trees

3.3.1. Pre-analysis of DHA1 and DHA2 genes

3.3.2. Final analysis of DHA1 and DAG genes

3.4. Construction and analyses of syntenic block diagrams representing genome regions of 33 Hemiascomycetous strains

3.5. Identification of positive and purifying selection exerted over the amino acid sequences of the FLR1p homologous proteins during the evolution of Saccharomycetaceae yeasts

4. Results

4.1. Identification of the DHA1 and DAG transporters encoded in the genomes of hemiascomycetes yeasts

4.2. A brief analysis of the DAG transporters encoded in the genomes of 78 hemiascomycetes strains

4.3. Analysis of the DHA1 transporters in the Hemiascomycetes

4.3.1. Correction of sequencing and annotation errors in the initial DHA1 protein dataset

4.3.2. Analysis of the 2D-topology of the DHA1 transporters

4.4. Global phylogenetic analysis of the DHA1 transporters encoded in the genomes of hemiascomycetes yeast

4.4.1. Phylogenetic analysis of the DHA1 transporters encoded in species of the Saccharomyces complex

4.4.2. Phylogenetic analysis of the DHA1 transporters encoded in species of the CTG complex

4.4.3. Phylogenetic analysis of the DHA1 transporters encoded in species classified in other late-divergent taxonomic families in the Hemiascomycetes

4.4.4. Phylogenetic analysis of the DHA1 transporters encoded in species classified in early-divergent taxonomic families in the Hemiascomycetes

4.5. Phylogenetic, gene neighbourhood and protein evolution analyses of the DHA1 transporters encoded in the genomes of Saccharomycetaceae yeast species

4.6. Phylogenetic, gene neighbourhood and protein evolution analyses of the FLR1 homologs of the S. cerevisiae genes encoded in the genomes of the Saccharomycetaceae species

4.6.1. Positive / Negative Selection and Conservation analysis

4.7. Phylogenetic and gene neighbourhood analyses of the remaining homologs of the S. cerevisiae genes encoded in the genomes of the Saccharomycetaceae species

4.7.1. Phylogenetic and gene neighbourhood analyses of the homologs of the S. cerevisiae DTR1 and TPO4 genes encoded in the genomes of the Saccharomycetaceae species

4.7.2. Phylogenetic and gene neighbourhood analyses of the homologs of the S. cerevisiae QDR1/QDR2/AQR1 genes encoded in the genomes of the Saccharomycetaceae species

4.7.3. Phylogenetic and gene neighbourhood analyses of the homologs of the S. cerevisiae QDR3 and HOL1 genes encoded in the genomes of the Saccharomycetaceae species

4.7.4. Phylogenetic and gene neighbourhood analyses of the homologs of the S. cerevisiae TPO2/TPO3 and YHK8 genes encoded in the genomes of the Saccharomycetaceae species

4.7.5. Phylogenetic and gene neighbourhood analyses of the homologs of the S. cerevisiae TPO1 gene encoded in the genomes of the Saccharomycetaceae species

5. Discussion and conclusions

6. Bibliography

7. Annexes

7.1. Annex I – Phylogenetic analysis of Saccharomycetaceae Taxonomic family

7.2. Annex II – Phylogenetic trees to the cluster T of the S. cerevisiae FLR1 homologs from 94 Hemiascomycetous strains

7.3. Annex III – Gene neighbourhood of the DTR1 lineage of genes in the genome of 32 yeast strains belonging to the Saccharomycetaceae

7.4. Annex IV – Phylogenetic tree to the cluster F of the S. cerevisiae QDR1/QDR2/AQR1 homologs from 32 Saccharomycetaceae strains

7.5. Annex V – Phylogenetic tree to the cluster N1 of the S. cerevisiae TPO2/TPO3 homologs from 32 Saccharomycetaceae strains

List of figures

Figure. 1 Representative phylogenetic tree of the Dikarya fungi, showing the names of the taxonomic families and of the strains used in the study. The evolutionary events that occurred in the course of evolution are also represented. (Adapted from Dujon 2010)

Figure. 2 Phylogenetic tree of the ‘Saccharomyces complex’ yeast species. This tree was proposed by Kurtzman and resolved 75 species into 14 clades representing phylogenetically circumscribed genera. The maximum-parsimony method was used to generate the tree from the analysis of nucleotide sequences of the 18S, 5.8S/alignable ITS and 26S (three regions) rDNAs, EF-1α, mitochondrial small-subunit rDNA and COX II (Kurtzman 2003)

Figure. 3 Schematic cladogram depicting phylogenetic relationships among Saccharomyces species and well-known or frequently isolated hybrids. Dashed lines represent introgressions from a third or fourth species into a hybrid. Most introgressions are not present in all hybrid strains. Synonyms are given in parentheses below species names. (Cladogram topography from Almeida et al. 2014; image retrieved from Boynton and Greig 2014)

Figure. 4 The 10 main lineages of DHA1 transporters analysed in this study. (Extracted from Dias et al. 2010a)

Figure. 5 Identification of the MFS-MDR DHA1 and DAG genes encoded in 94 yeast strains. Plot representing the number of sequences retrieved after constraining and traversing the pairwise similarity network at different e-values using 43 reference genes as starting nodes.

Figure. 6 Phylogenetic analysis of DHA2, ARN and GEX transporters gathered from 78 yeast strains and subdivided into twenty clusters. The circular cladogram shows the tree topology. The PhyML 3.0 software was used in the analysis. The different colours correspond to the clusters of the previous synteny study (Dias and Sá-Correia 2013).

Figure. 7 Example of a biological pseudogene genomic error. Protein alignment of the S. cerevisiae HOL1 representative gene against two gene fragments (ORFs sace_14_7_g00290 and sace_14_7_g00300) of the S. cerevisiae RM11-1a strain, with N's in the middle of the sequence. The green circle marks the gap of seven amino acids in the sequence. The MultAlin software was used in the analysis (Corpet 1988). Blue indicates a high consensus value (>50%); black indicates an absent or low consensus value.

Figure. 8 Example of the correction of an incomplete/truncated sequence genomic error. Protein alignment of the S. cerevisiae YHK8 representative gene against one fragment (ORF sace_18_12_930) of the S. cerevisiae DBVPG1106 strain. The MultAlin software was used in the analysis (Corpet 1988). Blue indicates a high consensus value (>50%); black indicates an absent or low consensus value.

Figure. 9 Example of the correction of a frameshift genomic error. Protein alignment of the S. cerevisiae TPO3 representative gene against one gene fragment (ORF sace_56_4_d04160) of the S. cerevisiae EC1118 strain, with N's in the middle of the sequence. The red circle marks the stretch of N's translated as X's. The MultAlin software was used in the analysis (Corpet 1988). Blue indicates a high consensus value (>50%); black indicates an absent or low consensus value.

Figure. 10 Example of the correction of a double gene genomic error. Protein alignment of the S. cerevisiae QDR3 representative gene against two fragments (ORFs sace_2_7_g01500 and sace_2_7_g01510) of the S. cerevisiae AWRI796 strain. The MultAlin software was used in the analysis (Corpet 1988). Blue indicates a high consensus value (>50%); black indicates an absent or low consensus value.

Figure. 11 An example of the topology of the S. cerevisiae DTR1 gene product. Twelve TMS and the start/end localization of the protein are shown. The prediction was made with the transmembrane topology predictor Phobius (Käll, Krogh and Sonnhammer 2007)

Figure. 12 Hydrophobicity profile obtained with the default parameters of the TOPPRED II software for the S. cerevisiae HOL1 gene. This profile serves to identify ‘certain’ and ‘putative’ transmembrane segments according to the ‘Upper Cutoff’ (green line) and ‘Lower Cutoff’ (red line) parameters.

Figure. 13 Example of the topology of the S. cerevisiae HOL1 gene product. Eleven TMS and the start/end localization of the protein are shown. The prediction was made with the transmembrane topology predictor Phobius (Käll, Krogh and Sonnhammer 2007)

Figure. 14 Phylogenetic analysis of DHA1 transporters gathered from 94 strains of 63 hemiascomycetous species using the PhyML 3.0 software. The circular cladogram shows the 26 phylogenetic clusters, identified by capital letters, and the names of their S. cerevisiae members.

Figure. 15 Phylogenetic analysis of DHA1 transporters gathered from 33 strains of 28 yeast species of the Saccharomycetaceae group using the PhyML 3.0 software. A) Radial phylogram showing the amino acid sequence similarity distances between these 419 full-size DHA1 transporters. B) Circular cladogram showing the tree topology of these genes, the names of the S. cerevisiae members of each cluster, and the colour of each cluster according to the synteny analyses shown in Figure. 4, performed by Dias et al. (2010).

Figure. 16 Phylogenetic analysis of the FLR1 homologs in 33 Saccharomycetaceae species using the PhyML 3.0 software. The circular cladogram shows the tree topology of 71 full-size DHA1 transporters. S1 to S5 represent the five subgroups defined by the synteny approach. Gene taxa marked in blue represent different species of the same genus placed in different subgroups.

Figure. 17 Lineage 10 (homologs of the S. cerevisiae FLR1 gene). Each box represents a gene. Lines connect genes sharing common neighbours. E, C and T indicate that the corresponding gene was classified as a fragment, corrected or sub-telomeric gene, respectively. The dashed line encompasses groups of proteins more similar in amino acid sequence (inferred from the analysis of the phylogenetic tree). HGT represents the plausible occurrence of horizontal gene transfer events between species.

Figure. 18 Correlation plot between negative selection, conservation and predicted topology of the aligned Cluster T proteins in twenty-nine Saccharomycetaceae strains. A) For the topology, the TOPCONS webserver displayed the prediction of twelve TMS from six independent predictors. B) The LogOddsLogo software predicted the sequence conservation, based on entropy and on the total frequency of each amino acid set per alignment position. The green and yellow bands represent each set of two TMS. C) The FUBAR methodology was used for the negative selection prediction; the values shown are the probability of [alpha > beta].

Figure. 19 Phylogenetic and positive selection analysis of FLR1 homologous genes in 29 Saccharomycetaceae strains using the PhyML 3.0 software. The circular cladogram shows the tree topology. The red and orange lines correspond to branches where one or two, respectively, of the fifteen pre-selected codon positions under positive selection have an EBF > 20.

Figure. 20 Evaluation of the total number of nodes in fifteen pre-determined codon positions across twenty-nine Saccharomycetaceae strains.

Figure. 21 Lineages 1 and 9 (homologs of the S. cerevisiae DTR1 and TPO4 genes). Each box represents a gene. Lines connect genes sharing common neighbours, purple in lineage 1 and orange in lineage 9. E, C and T indicate that the corresponding gene was classified as a fragment, corrected or subtelomeric gene, respectively. The grey box in the zyba_1 gene represents the zone of WGD occurrence. The red arrows and HGT represent the plausible occurrence of horizontal gene transfer events between species.

Figure. 22 Lineage 2 (homologs of the S. cerevisiae QDR1/QDR2 and AQR1 genes). Conventions as in Figure. 21.

Figure. 23 Gene neighbourhood comparison of the QDR1/QDR2 and AQR1 sublineages of genes in the genomes of 32 yeast strains belonging to the Saccharomycetaceae. The QDR1/QDR2 and AQR1 homologous genes are represented in central boxes flanked by the corresponding neighbour genes, highlighted in different colours according to the protein sequence cluster. White boxes represent genes without homologous neighbours in the represented chromosome region. To simplify the gene neighbourhood comparison, the 15 neighbour genes retrieved for each adjacent region of the query genes were reduced to the 5 neighbour genes shown here, and the resulting scheme was divided according to the sublineages identified. The order of the neighbours in each chromosome was scrutinized by exploring the synteny network in the Cytoscape software.

Figure. 24 Lineages 3 and 4 (homologs of the S. cerevisiae QDR3 and HOL1 genes). Conventions as in Figure. 21.

Figure. 25 Lineages 8 and 5 (homologs of the S. cerevisiae YHK8 and TPO2/TPO3 genes). Conventions as in Figure. 21.

Figure. 26 Gene neighbourhood comparison of the TPO2 and TPO3 sublineages of genes in the genomes of 32 yeast strains belonging to the Saccharomycetaceae. The TPO2 and TPO3 homologous genes are represented in central boxes flanked by the corresponding neighbour genes, highlighted in different colours according to the protein sequence cluster. White boxes represent genes without homologous neighbours in the represented chromosome region. To simplify the gene neighbourhood comparison, the 15 neighbour genes retrieved for each adjacent region of the query genes were reduced to the 5 neighbour genes shown here, and the resulting scheme was divided according to the sublineages identified. The order of the neighbours in each chromosome was scrutinized by exploring the synteny network in the Cytoscape software.

Figure. 27 Gene neighbourhood analysis of the YHK8 homologous genes in the genomes of 29 yeast species belonging to the Saccharomycetaceae. The YHK8 homologous genes are represented in central boxes flanked by the corresponding neighbour genes, highlighted in different colours according to the protein sequence cluster. White boxes represent genes without homologous neighbours in the represented chromosome region. To simplify the gene neighbourhood analysis, the 15 neighbour genes retrieved for each adjacent region of the query genes were reduced to 5 neighbour genes. The order of the neighbours in each chromosome was scrutinized by exploring the synteny network in the Cytoscape software.

Figure. 28 Lineage 7 (homologs of the S. cerevisiae TPO1 gene). Conventions as in Figure. 21.

Figure. 29 Phylogenetic analysis of the FLR1 homologs in 94 Hemiascomycetous strains using the PhyML 3.0 software, with the results explored in the Dendroscope software. The circular cladogram shows the tree topology of 163 full-size DHA1 transporters in cluster T.

Figure. 30 Phylogenetic analysis of the FLR1 homologs in 94 Hemiascomycetous strains using the PhyML 3.0 software. The radial phylogram shows the amino acid divergence among the FLR1 homologous proteins considering the total number of strains used in this master thesis.

Figure. 31 Gene neighbourhood of the DTR1 lineage of genes in the genomes of 32 yeast strains belonging to the Saccharomycetaceae. The DTR1 homologous genes are represented in central boxes flanked by the corresponding neighbour genes, highlighted in different colours according to the protein sequence cluster. White boxes represent genes without homologous neighbours in the represented chromosome region. To simplify the gene neighbourhood comparison, the 15 neighbour genes retrieved for each adjacent region of the query genes were reduced to the 5 neighbour genes shown here, and the resulting scheme was divided according to the sublineages identified. The order of the neighbours in each chromosome was scrutinized by exploring the synteny network in the Cytoscape software.

Figure. 32 Phylogenetic analysis of the QDR1/QDR2/AQR1 homologs in 32 Saccharomycetaceae species using the PhyML 3.0 software, with the results explored in the Dendroscope software. The circular cladogram shows the tree topology of 70 full-size DHA1 transporters of cluster F.

Figure. 33 Phylogenetic analysis of the TPO2/TPO3 homologs in 32 Saccharomycetaceae species using the PhyML 3.0 software, with the results explored in the Dendroscope software. The circular cladogram shows the tree topology of 41 full-size DHA1 transporters of cluster N1.

List of tables

Table. 1 Hemiascomycetous strains examined during this work. The table shows the species names and associated strains, as well as the acronym used for each strain in this work. The scaffold number represents the number of potential chromosomes existing in each strain. Genome size and coverage were collected from the website of the source database and helped in the choice of the strains for analysis. It should be noted that Ashbya aceri has no position in the phylogenetic tree of species, nor is its strain indicated. Its position in the table was chosen according to the similarity of this genome to those of the members of the Eremothecium genus (Dietrich et al. 2013).

Table. 2 Starting node genes used for the identification of DHA1 and DAG family genes in 94 Hemiascomycetes strains. A) The first table shows the DHA1 genes selected for traversing the blastp network in this work, together with the S. cerevisiae members and their respective cluster distribution, previously determined by Dias et al. (2010). B) The second table shows the DAG genes selected for traversing the blastp network in this work, together with the S. cerevisiae members and their respective cluster distribution, previously determined by Dias and Sá-Correia (2013).

Table. 3 Number of full-size DAG proteins in each cluster for a specific yeast strain. Additional information on the total number of full-size transporters and their average percentage of sequence identity and similarity is shown in the bottom lines.

Table. 4 Number and type of genomic errors found in 1467 DHA1 genes.

Table. 5 Number of full-size DHA1 proteins in each cluster for a specific yeast strain. Additional information on the total number of full-size transporters and their average percentage of sequence identity and similarity is shown in the bottom lines.

Table. 6 Codon positions detected under diversifying selection by two methods. The FEL rate of non-synonymous and synonymous substitutions is expressed as dN-dS or ω (i.e. dN/dS). The statistical significance of positive selection is expressed in terms of p-value (FEL and MEME methods). Two different thresholds of the Empirical Bayes Factor and the number of associated nodes are described.

Table. 7 Number of full-size DHA1 proteins in each cluster of the Saccharomycetaceae family. Additional information on the total number of full-size transporters and their average percentage of sequence identity and similarity is shown in the bottom lines.

List of abbreviations

ABC ATP-binding cassette

ATP Adenosine triphosphate

BILD Bayesian Integral Log-odds

BLASTN Basic local alignment search tool Nucleotide-Nucleotide

BLASTP Basic local alignment search tool Protein-protein

BM Bayesian Methods

BSRG Biological Sciences Research Group

CGD Candida Genome Database

DAG DHA2, ARN, GEX

DB Databases

DDBJ DNA Data Bank of Japan

DHA1 12 transmembrane spanner drug H+ antiporter-1 family of genes

DHA2 14 transmembrane spanner drug H+ antiporter-2 family of genes

EBF Empirical Bayes Factor

EMBL European Molecular Biology Laboratory

EPMF Electrochemical proton motive force

FEL Fixed effects likelihood

FUBAR Fast Unconstrained Bayesian AppRoximation

GSA Global sequence alignments

GEX Glutathione exchangers

HGT Horizontal gene transfer

IMD Integral membrane domain

LDB Local genome database

LSA Local sequence alignments

MDR Multidrug resistance

MEME Mixed effects model of evolution

MFS Major Facilitator Superfamily

ML Maximum likelihood

MSA Multiple sequence alignments

MSD Membrane-spanning domain

MXR Multixenobiotic resistance

NBD Nucleotide-binding domain

NCBI National Center for Biotechnology Information

OECD Organisation for Economic Co-operation and Development

ORF Open reading frame

PAM Progressive alignments methods

PT Phylogenetic trees

REL Random effects likelihood

SGD Saccharomyces Genome Database

TCDB Transporter Classification Database

TM Transmembrane

TMD Transmembrane domain

TMS Transmembrane segments

UMF Unknown Major Facilitator

WGD Whole genome duplication

YGOB Yeast Gene Order Browser

1. Goals and brief introduction to the thesis study

Yeasts are among the most important microorganisms in biotechnology. They are used in a wide diversity of applications, from industrial fermentations and laboratory models to medical, environmental and agricultural research, and they are also relevant as food spoilage agents. In addition, yeasts have been used in phylogenetic studies with the goal of understanding gene and genome evolution (Dujon 2010; dos Santos and Sá-Correia 2015). The wide application of these microorganisms has been a driving force for understanding their characteristics and survival mechanisms. A detailed understanding of these mechanisms has also provided knowledge about the resistance of these microorganisms to the surrounding environment, which is related to their capacity to process or expel a wide range of structurally and functionally unrelated cytotoxic chemicals, a phenomenon known as Multidrug/Multixenobiotic resistance (MDR/MXR) (dos Santos et al. 2014; dos Santos and Sá-Correia 2015). In the Hemiascomycetes, two superfamilies of transporters are known to be involved in the multidrug resistance phenomenon: the ATP-Binding Cassette (ABC-PDR) and the Major Facilitator Superfamily (MFS-MDR) efflux pumps (Sá-Correia and Tenreiro 2002). The MFS-MDR transporters are energized by the electrochemical proton-motive force established between the cell cytoplasm and the external medium. In the initial analysis, after the identification of 28 MFS-MDR genes encoded in the genome of Saccharomyces cerevisiae, it was proposed to divide the corresponding proteins into two families: the 12-spanner Drug: H+ Antiporter family 1 (DHA1), with 12 members, and the 14-spanner Drug: H+ Antiporter family 2 (DHA2), with 16 members (Paulsen et al. 1998; Sá-Correia and Tenreiro 2002; Sá-Correia et al. 2009). The members of the DHA1 family encoded in the S. cerevisiae genome are the AQR1, DTR1, QDR1, QDR2, QDR3, TPO1, TPO2, TPO3, TPO4, HOL1, FLR1 and YHK8 genes. The members of the DHA1 family have been described as being involved in a range of physiological functions, such as polyamine transport, spore wall biosynthesis and stress resistance (Sá-Correia et al. 2009). More recently, it has been shown that the DHA2 family should be renamed DAG (DHA2, ARN, GEX) and divided into three subfamilies (Dias and Sá-Correia 2013): the DHA2 subfamily, with 10 members, corresponding to the ATR1, AZR1, SGE1, VBA1, VBA2, VBA3, VBA4, VBA5 and AMF1 genes and the still uncharacterized translated ORF YMR279c; the ARN subfamily, with 4 members, corresponding to the ARN1, ARN2, SIT1 and ENB1 genes; and the GEX subfamily, with 2 members, corresponding to the GEX1 and GEX2 genes. The majority of the members of the DHA2 subfamily have no assigned function, whereas the members of the GEX and ARN subfamilies encode glutathione: H+ antiporters and siderophore-iron chelate influx pumps, respectively (Lesuisse, Simon-Casteras and Labbe 1998; Sá-Correia et al. 2009; Dhaoui et al. 2011). The 12 or 14 transmembrane (TM) α-helices of the MFS-MDR proteins are organized into two six- or seven-helix bundle domains, respectively, connected by a cytoplasmic loop (Sá-Correia et al. 2009).
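As a quick cross-check of the membership counts quoted above, the classification can be written down as a small lookup table. This is only an illustrative sketch: the gene names are taken from the text, while the dictionary layout and the names `MFS_MDR_FAMILIES` and `count_members` are hypothetical and not part of the thesis pipeline.

```python
# S. cerevisiae MFS-MDR genes grouped as described in the text.
# The dictionary layout and all identifiers are illustrative assumptions.
MFS_MDR_FAMILIES = {
    "DHA1": {
        "DHA1": ["AQR1", "DTR1", "QDR1", "QDR2", "QDR3", "TPO1",
                 "TPO2", "TPO3", "TPO4", "HOL1", "FLR1", "YHK8"],
    },
    "DAG": {
        "DHA2": ["ATR1", "AZR1", "SGE1", "VBA1", "VBA2", "VBA3",
                 "VBA4", "VBA5", "AMF1", "YMR279c"],
        "ARN": ["ARN1", "ARN2", "SIT1", "ENB1"],
        "GEX": ["GEX1", "GEX2"],
    },
}


def count_members(family):
    """Return the number of genes in a family, summed over its subfamilies."""
    return sum(len(genes) for genes in MFS_MDR_FAMILIES[family].values())


if __name__ == "__main__":
    for family in MFS_MDR_FAMILIES:
        print(family, count_members(family))
    total = sum(count_members(family) for family in MFS_MDR_FAMILIES)
    print("total MFS-MDR genes:", total)
```

The tallies reproduce the figures given in the text: 12 DHA1 genes plus 16 DAG genes (10 DHA2, 4 ARN and 2 GEX) account for the 28 MFS-MDR genes of S. cerevisiae.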

The identification and in silico characterization of the DHA1 and DAG genes encoded in the Hemiascomycetes is an important goal, since the yeast MFS-MDR genes have shown promising applications in a variety of fields in Biotechnology. The present work focuses on the reconstruction of the evolution of the DHA1 family members in the Saccharomycotina sub-phylum, aiming to extend the results reported in previously published studies to 94 additional yeast strains whose genome sequences were recently determined. The DAG genes encoded in the genomes of these yeast strains will also be identified in this work, and a brief characterization in a subset of 74 hemiascomycetes strains will be attempted. The evolution of the DHA1 genes encoded in the genomes of yeast species belonging to the Saccharomycetaceae taxonomic family will be scrutinized in detail, based on the bioinformatics and comparative genomics approaches developed by the BSRG group over the last 5 years. The identification of the DHA1 and DAG family members will be based on the traversal of a blastp network, using as starting nodes 43 reference genes, twenty encoded in the S. cerevisiae genome and twenty-three encoded in the genomes of other yeast species (Dias et al. 2010a; Dias and Sá-Correia 2013, 2014). This approach will allow the identification of the DHA1 and DAG transporters in the yeast strains considered in this study. The nucleotide sequences of DHA1 genes identified as problematic will be subjected to detailed analysis and, when possible, corrected. A wide range of 2D-topology prediction software will be applied to all DHA1 and DAG proteins, which will also aid the identification of problematic DHA1 genes. Distance-based and maximum likelihood-based methodologies for constructing phylogenetic trees will be used to classify the DHA1 and DAG proteins into phylogenetic clusters. In addition to classical phylogenetic methodologies, gene neighbourhood analysis will be used to obtain complementary and independent evidence for the reconstruction of the evolution of the DHA1 genes encoded in the genomes of 33 Saccharomycetaceae strains.
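The idea of traversing a blastp network from a set of reference genes can be illustrated with a minimal sketch. It assumes that the pairwise blastp hits have already been computed and are available as (query, subject, e-value) tuples; the e-value threshold, the toy identifiers and the function name `traverse_blast_network` are hypothetical and are not taken from the pipeline used in this work.

```python
from collections import deque

def traverse_blast_network(hits, seeds, max_evalue=1e-20):
    """Breadth-first traversal of a blastp similarity network.

    hits: iterable of (query_id, subject_id, evalue) tuples (hypothetical input).
    seeds: reference gene identifiers used as starting nodes.
    Returns every sequence reachable from the seeds through hits whose
    e-value is at or below max_evalue (an assumed threshold).
    """
    # Build an undirected adjacency map from the hits that pass the threshold.
    neighbours = {}
    for query, subject, evalue in hits:
        if evalue <= max_evalue and query != subject:
            neighbours.setdefault(query, set()).add(subject)
            neighbours.setdefault(subject, set()).add(query)

    visited = set(seeds)
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        for nxt in neighbours.get(node, ()):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(nxt)
    return visited

# Toy example with made-up identifiers and e-values.
toy_hits = [
    ("FLR1", "geneA", 1e-80),
    ("geneA", "geneB", 1e-45),
    ("geneB", "geneC", 1e-5),   # too weak: this edge is discarded
    ("geneD", "geneE", 1e-90),  # not connected to any starting node
]
members = traverse_blast_network(toy_hits, seeds=["FLR1"])
print(sorted(members))
```

In this toy network only geneA and geneB are gathered together with the starting node, because geneC is linked by a hit above the threshold and geneD/geneE are not connected to any reference gene. The choice of threshold controls how far the traversal spreads into distantly related families.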
After the sequence validation phase, the nucleotide and amino acid sequences of these DHA1 genes will be analysed for evidence of positive and negative selection using functions made available by the HYPHY software suite. A case of strong horizontal expansion of FLR1 homolog genes, reported in previously published work for the genomes of Zygosaccharomyces strains, will be scrutinized using the Mixed Effects Model of Evolution (MEME) function, in order to determine whether episodic functional diversification underlies the increase in the number of these genes. The conservation of the amino acid sequences of these DHA1 proteins will be analysed using a Simpson entropy-based method made available by the LogOddsLogo portal, allowing the molecular evolution results for the DHA1 proteins to be integrated with the structural results obtained with this method and with the 2D-topology prediction software.
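As an illustration of what a Simpson-type conservation measure captures, the sketch below scores each column of a toy protein alignment with one common form of the Simpson index, the sum of squared residue frequencies. This is only an assumed formulation for illustration: the exact formula, gap handling and normalization used by the LogOddsLogo portal may differ, and the function names and the toy alignment are hypothetical.

```python
from collections import Counter

def simpson_conservation(column, gap="-"):
    """Simpson index of one alignment column (sum of squared frequencies).

    A value of 1.0 means a single residue occupies the column; lower values
    mean the column is more variable. Gap characters are ignored (assumption).
    """
    residues = [r for r in column if r != gap]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    return sum((n / total) ** 2 for n in counts.values())

def alignment_conservation(alignment):
    """Score every column of an alignment given as equal-length strings."""
    return [simpson_conservation(col) for col in zip(*alignment)]

# Toy alignment of four made-up sequences.
toy_alignment = ["MKLV", "MKIV", "MRLV", "M-LV"]
scores = alignment_conservation(toy_alignment)
print([round(s, 3) for s in scores])
```

Fully conserved columns (the first and last in the toy alignment) score 1.0, while the two variable columns score lower. Per-column scores of this kind can then be compared with predicted transmembrane segments.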


2. General Introduction

2.1. Yeasts: Biotechnology, taxonomy and diversity

2.1.1. Biotechnology and applications

The concept of "biotechnology" encompasses a wide range of procedures for modifying living organisms according to human purposes, going back to the domestication of animals, the cultivation of plants, and other "improvements" through breeding programs employing artificial selection, hybridization and other human-driven approaches. Several definitions can be applied, depending on the field of study. For instance, the American Chemical Society defines biotechnology as the application of biological organisms, systems, or processes by various industries to learn about the science of life and the improvement of the value of materials and organisms such as pharmaceuticals, crops, and livestock (Gavanji et al. 2013). The European Federation of Biotechnology, on the other hand, considers biotechnology to be the integration of natural science and organisms, cells, parts thereof, and molecular analogues for products and services (Fulekar 2010). For the Organization for Economic Co-operation and Development (OECD), biotechnology is defined as the application of scientific and engineering principles to the processing of materials by biological agents (Europabio 2015). Biotechnology also draws on the pure biological sciences (animal cell culture, biochemistry, cell biology, embryology, genetics, microbiology, and molecular biology) and, in many instances, depends on knowledge and methods from outside the sphere of biology (e.g. bioinformatics, a new branch of computer science, bioprocess engineering, biorobotics and chemical engineering).

Biotechnology relies on several macro- and microorganisms used in a whole range of applications, with yeasts being among the most important and widely applied microorganisms. Yeasts are well studied and are often used as a reference in a great number of studies (Rolland and Dujon 2011). Some of their most relevant characteristics are: 1) easy handling and maintenance in a laboratory environment; 2) high replication rates; 3) ease of isolating mutants, when needed; 4) preferred proliferation in anaerobic conditions, although growth in aerobic conditions is also possible; 5) sexual or asexual reproduction according to environmental conditions; 6) ability to perform some post-translational modifications (a characteristic shared with other eukaryotes) (Cereghino and Cregg 2000; Teresa Fernández-Espinar, Barrio and Querol 2003; Herrero et al. 2008; Hinnebusch and Johnston 2011). Regarding their origin, most of these yeasts are natural isolates; in some cases, however, their genetic and physiological properties have been manipulated by humans (Kurtzman, Fell and Boekhout 2011).

All of these features contribute to the increasing number of studies and applications focused on these microorganisms. According to the literature, yeasts cover a large number of applications, from the most traditional food production processes, often associated with baking (Dequin 2001), to the most advanced clinical therapies (Budtz-Jörgensen 1990). In the fermentation industry, yeasts are used not only in the food industry to make bread, wine, and beer, but also in non-food industries such as the biofuel industry, to produce ethanol (Remize 2016; Wang, Mas and Esteve-Zarzoso 2016; Wilkinson, Greetham and Tucker 2016). Other applications include: food composition (flavour building blocks and carriers) (Seo and Jin 2016); medicine (biomedical research, model organisms for genetic studies, synthetic pathways for antibiotics or other molecules of interest) (Kschischo, Ramos and Sychrová 2016); and environmental solutions (pollution clean-up, waste recycling, crop protection) (Xu et al. 2016). Regarding nutrition and health care, yeasts have been applied in the production of dietary supplements and probiotics as well as pharmaceuticals, including anti-parasitic and anticancer compounds and biopharmaceuticals such as vaccines and insulin (Goffeau et al. 1996; Liti et al. 2005; Flagfeldt et al. 2009; Brown et al. 2013). Also in biopharmaceutical production, these microorganisms can be used as protein factories (Cereghino and Cregg 1999), for the production of certain functional proteins involved in the dynamic control of recombinant proteins (Cereghino and Cregg 1999), or in the biosynthesis of glycolipids (Bednarski et al. 2004).

Although several yeasts are used in biotechnology for different purposes, it is still important to differentiate those used for industrial purposes from those used for non-industrial purposes. Industrial yeast strains generally share the ability to grow and function under the concerted influence of a multitude of environmental stressors. In comparison, non-industrial isolates, such as laboratory strains, have been selected for rapid and consistent growth in nutrient-rich laboratory media, and therefore show markedly different phenotypes when compared to their industrial relatives (Mortimer and Johnston 1986). The outcomes of these very different selection pressures are most evident when industrial and non-industrial yeasts are compared. As an example, laboratory strains of S. cerevisiae, such as S288c, are unable to grow in the low pH and high osmolarity of most grape juices and therefore cannot be used to make wine, which is a clear difference between industrial and non-industrial strains of S. cerevisiae. However, there are also numerous subtle differences between strains, not only between industrial-type strains but also between strains used within the same industry (Lambrechts and Pretorius 2000; Howell et al. 2005; Borneman et al. 2011). Different strains of S. cerevisiae were classified in Liti et al. (2009) as laboratory, clinical, fermentative, wild or unknown according to their origin or application, reinforcing the multiple applications and the genetic diversity found in this species. At the laboratory/clinical level, important new opportunistic yeasts have emerged as a result of developments in medical and pharmaceutical knowledge. Most emerging yeast pathogens are Candida species other than Candida albicans. Knowledge about these species has grown considerably: while five Candida species were considered pathogenic in the 1960s, at least 33 Candida species have recently been listed as pathogenic (Papon et al. 2013). This research effort arose because these opportunistic yeasts vary greatly in their susceptibility to antifungal agents (Odds 1993), presenting particular patterns of primary resistance or reduced susceptibility that compromise the application of appropriate antifungal therapies (Papon et al. 2013).

In industry, many processes rely heavily on the model yeast S. cerevisiae. Besides its traditional use in the food industry for the production of alcoholic beverages and bread fermentation, S. cerevisiae has more recently also been used in the bioethanol industry and for the production of heterologous compounds, such as human insulin, hepatitis vaccines, and human papillomavirus vaccines (Hou et al. 2012). Although S. cerevisiae remains, by far, the most widely used yeast species to date in industrial processes, other, so-called nonconventional yeasts, such as Eremothecium gossypii, Kazachstania spp., Scheffersomyces stipitis, Yarrowia lipolytica, Kluyveromyces lactis, and Dekkera bruxellensis, have also claimed their stake as valuable contributors to industrial processes (Blomqvist et al. 2010; Scordia et al. 2012; Steensels et al. 2014; James et al. 2015).

From a practical point of view, it should be considered that all of these yeasts were scientifically scrutinized and optimized before being applied. In fact, many studies in this research field focus on improving methodologies that increase the yield and efficiency of these microorganisms. Among them are studies focusing on the phylogenetic evolution of yeasts, population genetics, genomics, proteomics, metabolomics, and transcriptomics, among others, that contribute to the discovery of further applications (Ma 2008).

2.1.2. Evolution, taxonomy and diversity of yeasts

Yeasts are single-celled microorganisms of the Kingdom Fungi that belong to two phyla, Ascomycota or Basidiomycota (sub-kingdom Dikarya). The Ascomycota phylum is subdivided into three subphyla (Saccharomycotina, Pezizomycotina, and Taphrinomycotina), which are in turn divided into fifteen classes. One of the most important of these classes is the Hemiascomycetes (also known as Saccharomycetes). This class includes the order Saccharomycetales, which contains the taxonomic families Saccharomycetaceae, Dipodascaceae and Debaryomycetaceae, as well as other yeast species that will be analysed in this study and are classified in Figure. 1. Dipodascaceae and other families of yeasts have limited numbers of chromosomes and dispersed 5S RNA genes. Debaryomycetaceae, in turn, are characterized by a common deviation from the standard genetic code (the CUG codon encodes serine instead of leucine), by large centromeres and by a unique mating-type locus (with two allelic forms). Finally, Saccharomycetaceae yeasts are characterized by point centromeres (which are highly conserved) and triplicate mating-type cassettes that ensure the simultaneous presence of both mating-type alleles in haploid cells, although some exceptions may occur (Figure. 1) (Dujon 2010). The particular interest in Hemiascomycetous species is associated with the fact that they include the biochemically characterized eukaryotic models Saccharomyces cerevisiae, Candida glabrata, Candida albicans, Debaryomyces hansenii and Yarrowia lipolytica, among others that have biotechnological, industrial or medical relevance (Arnaud et al. 2009; Sá-Correia et al. 2009; Kurtzman, Fell and Boekhout 2011; Costa et al. 2016). Since the Saccharomycetaceae yeasts are the focus of the present work, a brief review at the taxonomic and evolutionary levels is given in this subsection.
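The practical consequence of the CUG codon reassignment mentioned above can be sketched in a few lines. The example builds the standard codon table and a variant in which CTG (CUG in RNA) is read as serine, corresponding to what NCBI lists as the alternative yeast nuclear code (translation table 12). The sequence and the function name `translate` are made up for illustration.

```python
from itertools import product

BASES = "TCAG"
# Amino acids of the standard genetic code, in TCAG codon order.
STANDARD_AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), STANDARD_AAS)
}
# Variant in which CTG (CUG in RNA) encodes serine instead of leucine.
CUG_SER = dict(STANDARD, CTG="S")

def translate(dna, table):
    """Translate a coding sequence codon by codon, ignoring a trailing partial codon."""
    dna = dna.upper().replace("U", "T")
    usable = len(dna) - len(dna) % 3
    return "".join(table[dna[i:i + 3]] for i in range(0, usable, 3))

toy_cds = "ATGCTGAAATAA"  # made-up sequence containing one CTG codon
print(translate(toy_cds, STANDARD))  # CTG read as leucine
print(translate(toy_cds, CUG_SER))   # CTG read as serine
```

The same nucleotide sequence yields a different protein depending on the table used, which is why gene predictions for CUG-clade species have to be translated with the appropriate code before any protein-level comparison.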

In 2003, based on a multigene phylogenetic analysis of unlinked genes, Kurtzman and Robnett proposed the first subdivision of the Hemiascomycetes 'Saccharomyces complex' (subdivided into the families Saccharomycetaceae and Saccharomycodaceae), which efficiently resolved 75 species into 14 clades (Figure. 2) (Kurtzman 2003). Of these 14 clades, however, only 12 genera belonged to the Saccharomycetaceae family: Saccharomyces, Kazachstania, Naumovozyma, Nakaseomyces, Tetrapisispora and Vanderwaltozyma (the post-WGD genera); and Zygosaccharomyces, Zygotorulaspora, Torulaspora, Lachancea, Kluyveromyces and Eremothecium (the pre-WGD genera) (Figure. 2).


Figure. 1 Representative phylogenetic tree of the Dikarya fungi, showing the names of the taxonomic families and genera of the strains used in the study, together with the evolutionary events that occurred along their evolution. Family labels in the figure: Saccharomycetaceae, Debaryomycetaceae, Dipodascaceae and other families. (Adapted from Dujon 2010)


Figure. 2 Phylogenetic tree of the 'Saccharomyces complex' yeast species. This tree was proposed by Kurtzman and resolved 75 species into 14 clades representing phylogenetically circumscribed genera. The maximum-parsimony method was used to generate the tree from the analysis of nucleotide sequences of the 18S, 5.8S/alignable ITS, and 26S (three regions) rDNAs, EF-1α, mitochondrial small-subunit rDNA and COX II (Kurtzman 2003). Labels in the figure: pre-WGD species and post-WGD species (Saccharomycetaceae family), Saccharomycodaceae family.


The Saccharomyces genus, comprising the group of yeast species known as Saccharomyces sensu stricto, is a complex that contains several strains of high importance in fermentation. According to the most current classification, this group is composed of seven species: Saccharomyces cerevisiae S288c, Saccharomyces paradoxus, Saccharomyces mikatae, Saccharomyces kudriavzevii, Saccharomyces arboricola, Saccharomyces eubayanus, and Saccharomyces uvarum (Dujon 2010; Scannell et al. 2011; Hittinger 2013) (Figure. 4).

S. cerevisiae is considered a model of excellence for molecular and genetic research and for the evaluation of natural and ecological systems, and plays an important role in human applications (Goffeau et al. 1996; Replansky et al. 2008). The main applications of this yeast are related to fermentation, food production, ethanol production, biocontrol and health care (Goffeau et al. 1996; El-Sayed Shalaby and El-Nady 2008; Flagfeldt et al. 2009; Brown et al. 2013). Other species of this group are also of relevance, including S. paradoxus, which is a widely used model in ecology, biogeography, and evolutionary biology (Liti et al. 2009; Kowallik, Miller and Greig 2015). S. paradoxus is phenotypically very similar to S. cerevisiae and largely coexists with it. In addition, S. cerevisiae and S. paradoxus have the same number of chromosomes and show no gross chromosomal rearrangements (Fischer et al. 2000; Cliften et al. 2003; Kellis et al. 2003). In terms of distribution, S. cerevisiae and S. paradoxus are widely distributed when compared to other species of Saccharomyces sensu stricto. The species S. mikatae is restricted to a specific geographical location (Japan), whereas S. kudriavzevii was isolated in China (Naumov, Lee and Naumova 2013).

S. uvarum and S. eubayanus are quite common and abundant in South and North America, in Europe and East Asia (Boynton and Greig 2014). For a long time, there was severe controversy about the Saccharomyces bayanus, S. eubayanus, and Saccharomyces uvarum classification.

Although these species were initially classified as S. bayanus variety bayanus (S. bayanus var. bayanus) and S. bayanus variety uvarum (S. bayanus var. uvarum), Pulvirenti and Nguyen independently provided genetic and phylogenetic evidence that both varieties should be treated as two separate species, S. bayanus and S. uvarum (Pulvirenti 2000; Nguyen and Gaillardin 2005). In 2011, Libkind and co-workers isolated S. eubayanus, proposed it as a new, parental species, and characterized S. bayanus var. bayanus as a complex hybrid of S. eubayanus, S. uvarum, and S. cerevisiae (Libkind et al. 2011). The Saccharomyces genus has great potential for hybridization, as shown in Figure. 3, especially in low-temperature industrial wine fermentations (Almeida et al. 2014). Saccharomyces pastorianus is an example: it resulted from a hybridization between S. cerevisiae and S. eubayanus and was domesticated through brewing at low temperatures (Libkind et al. 2011; Dunn et al. 2012). Other hybrids harbouring genomic contributions from S. cerevisiae and a distinct cryotolerant species, S. kudriavzevii, are commonly found among strains used to produce Belgian-style beers and wines fermented at low temperatures (González et al. 2006; Almeida et al. 2014).


Figure. 3. Schematic cladogram depicting phylogenetic relationships among Saccharomyces species and well-known or frequently isolated hybrids. Dashed lines represent introgressions from a third or fourth species into a hybrid. Most introgressions are not present in all hybrid strains. Synonyms are given in parentheses below species names. (Cladogram topography from (Almeida et al. 2014) and the image retrieved from (Boynton and Greig 2014)).

The Kazachstania genus was first proposed in 1971 by Zubkova, to classify a strain isolated from wine fermentations. Later on, Kurtzman (2003) re-evaluated the genus and included about 19 species in it on the basis of a multigene sequence analysis (Kurtzman 2003). At present, the genus includes more than 32 species, subdivided into 5 subgroups, with environmental origins ranging from animals, fermented foods, fruits, mushrooms, leaves, silage and soil to wastewater (Kurtzman, Fell and Boekhout 2011; James et al. 2015).

The Naumovozyma genus was described as having only two species, N. castellii and N. dairenensis, both poorly described and rarely used in the literature (Kurtzman 2003).

Originally, the Nakaseomyces genus included the species N. delphensis (Kluyveromyces), Candida castellii and N. bacillisporus (Kluyveromyces), isolated from the environment (Kurtzman 2003). Recently, two new pathogenic yeasts have been added to the genus, C. nivariensis and C. bracarensis (Alcoba-Flórez et al. 2005; Correia et al. 2006). This genus also includes the important human pathogen C. glabrata, which, despite the Candida name, is phylogenetically closer to S. cerevisiae than to C. albicans (Dujon et al. 2004). C. glabrata is both a pathogen and a commensal of the human body, described as the second causative agent of human candidiasis (Anderson 1917; Richard et al. 2005). In contrast, the ecological niches of C. bracarensis and C. nivariensis remain unknown. Of note, C. nivariensis was isolated from flowering plants in Australia (Lachance et al. 2001), suggesting that this species may colonize humans from an environmental source. From a phylogenetic point of view, and consistent with Kurtzman (2003), Gabaldón et al. (2013) suggested that this genus can be subdivided into two major subgroups: the first containing C. castellii and N. bacillisporus, and the second (referred to as the glabrata group) comprising the three pathogenic species (C. glabrata, C. bracarensis and C. nivariensis) together with N. delphensis (Gabaldón et al. 2013).

The Tetrapisispora genus was proposed by Ueda-Nishimura and Mikata (1999) to include the species T. iriomotensis, T. nanseiensis and T. arboreal, isolated from soil, flowers and leaves in the Nansei Islands, Japan. In 2003, Kurtzman and co-workers updated this genus and included two further species, T. phaffii and T. blattae. More recently, the genus underwent a new revision with the addition of two new species isolated from Taiwanese forest soil samples, T. taiwanensis and T. pingtungensis (Chen et al. 2013).

When first proposed, the Vanderwaltozyma genus comprised only two species, V. polyspora and V. yarrowii, previously known as Kluyveromyces polyspora and Kluyveromyces yarrowii, respectively (Kurtzman 2003). Since then, a third and a fourth species belonging to this genus, V. verrucispora and V. tropicalis, have been isolated in Japan and Taiwan and added to this taxonomic group (Lee et al. 2009; Nakase et al. 2010). The species classified in this genus have an important role in the fermentation of glucose/galactose and in the production of spheroids (Lee et al. 2009).

The Zygosaccharomyces genus was proposed in 1901 by Barker. In Kurtzman (2003) this genus was divided into four different genera: Zygosaccharomyces, Zygotorulaspora, Torulaspora and Lachancea. The Zygosaccharomyces genus kept only 6 species: Z. bailii, Z. bisporus, Z. kombuchaensis, Z. lentus, Z. rouxii and Z. mellis (Kurtzman, Fell and Boekhout 2011). More recently, six new species were proposed for inclusion in this genus by Hulin and Wheals (2014). Generally, the species of this genus are considered the most problematic spoilage yeasts found in the food and beverage industry and are characterized by an intrinsic capacity to resist the weak acids widely used as fungistatic preservatives, such as acetic, propionic, benzoic and sorbic acids (Stratford et al. 2013; Palma et al. 2015).

In the case of the Zygotorulaspora genus, only two species are documented, Z. florentina and Z. mrakii, and, similarly to the Naumovozyma genus, it is one of the least studied genera of the Saccharomycetaceae family. Therefore, more studies are needed to clarify issues such as the applications, evolutionary processes and diagnostic characteristics of these species (Kurtzman 2003).

Lindner established the Torulaspora genus in 1904 and, since then, several classifications and studies have been carried out in order to establish its species. After the reorganization of this genus by Kurtzman (2003), only five species were included in the genus: T. delbrueckii, T. pretoriensis, T. franciscae, T. globosa, and T. microellipsoides. Currently, the genus also includes T. maleeae (isolated from moss, leaves and soil, and the only known Torulaspora species to be isolated from moss (Kurtzman, Fell and Boekhout 2011)) and T. quercuum (isolated from the oral flora of humans and, in parallel, from Quercus sp. leaves in Northeast China (Wang et al. 2009; Saluja et al. 2012)). The species of this genus are used in important human applications such as wine fermentation and the production of food products (Marsit et al. 2015).

The yeast species classified in the Lachancea genus are known for their wide geographic range, occupying a large number of ecological niches that go from soil, plants and insects to processed foods and drinks (Friedrich et al. 2012). In 2003, Kurtzman proposed that this genus encompasses 5 species: L. cidri, L. fermentati, L. kluyveri, L. thermotolerans and L. waltii. More recently, Friedrich et al. (2012) sequenced and proposed adding five more possible species to this genus, L. dasiensis, L. nothofagi, L. mirantina, L. fantastica and L. meyersii, from diverse habitats (seawater, soil, plants and distilleries).

The Kluyveromyces genus was first proposed by Van Der Walt (1956) to accommodate K. polysporus (Van Der Walt 1956). In 1999, Nagahama and co-workers proposed adding a new species to this genus, K. nonfermentans. With the re-evaluation of the genus in Kurtzman (2003), five species were assigned to it: K. aestuarii, K. dobzhanskii, K. lactis, K. marxianus and K. wickerhamii. The species of this genus are known to be suitable for many biotechnological applications, such as the use of milk whey and lactose for forage, enzyme production, the production of ethanol and low-alcohol drinks, and the production of natural flavour and fragrance molecules, which makes this group one of the best studied in the Saccharomycetaceae family (Lachance 2007; Flagfeldt et al. 2009; Naumova et al. 2012).

The genus Eremothecium is a Saccharomycetaceae genus comprising the earliest diverging yeast species and has been described as having 5 species: E. cymbalariae, E. ashbyi, E. gossypii, E. coryli and E. sinecaudum (Kurtzman 2003). E. cymbalariae, E. ashbyi, and E. gossypii are filamentous fungi that share a rare growth mode that includes dichotomous branching. The E. ashbyi and E. gossypii species are characterized as flavinogenic due to their ability to overproduce and secrete riboflavin, although this phenotypic trait is not observed in E. cymbalariae (Förster et al. 1999; Sengupta and Chandra 2011). Recently, a new species, Ashbya aceri, was isolated from an eastern box elder bug; although it shows high similarity to E. gossypii and to the Eremothecium genus in general, it remains phylogenetically uncharacterized (Dietrich et al. 2013).

2.2. Yeasts: Comparative genomics and evolution

2.2.1. Comparative genomics

Advances in DNA sequencing technologies, particularly the Next-Generation Sequencing (NGS) methodologies introduced in the late 2000s, allowed the efficient production of large amounts of genomic information. This led to the development of innovative ways to deal with this deluge of data, such as comparative genomics, bioinformatics, and statistical and mathematical methodologies (Hu et al. 2011).

Initially, comparative genomics was used to obtain valuable insights into developmental functions, reproductive enhancements, inborn errors, and disease defence mechanisms that have protected our ancestors (and ourselves) from extinction (O’Brien et al. 1999). Presently, the comparative genomics field is defined as the comparison of the same, or different, genomic features in distinct organisms (Hardison 2003). This field of research has many applications in different areas, such as: the identification of putative targets for novel antifungals (Madero-lópez and Espinel-ingroff 2009); the identification of regulatory motifs for the discovery of important genes and regulatory elements (Kellis et al. 2003); the selection of model organisms (Preuss 2006); the development of methodologies used for the clustering of regulatory sites (van Nimwegen et al. 2002); the identification of evolutionary mechanisms (Kurtzman, Fell and Boekhout 2011); the identification of genes of interest; and the establishment of the homolog/orthologue/paralogue status of genes (Koonin 2005). The basic features used in the comparative genomics field include nucleotide sequence, strand asymmetry, genes, gene order, regulatory sequences, and genomic structural landmarks (Xia 2013).

In addition, taking the features mentioned above into account, Xia (2013) suggested three fundamental applications of comparative genomics: 1) characterization of similarities and differences in genomic features, as well as tracing their origin, changes and loss along different evolutionary lineages; 2) understanding of the evolutionary changes that can occur, such as mutations, recombination, lateral gene transfer and selection in general (always taking into account biotic and abiotic factors); 3) understanding of how genomic evolution can help combat and prevent several diseases by developing personalized medicine, improving environmental health, and restoring sustainable development, amongst others (Xia 2013).

2.2.2. Comparative genomics and evolution of yeasts

Comparative genomics methodologies require the consideration and implementation of two important areas, Evolution and Bioinformatics. Once the evolution of organisms is brought into the equation, other concepts, such as conserved features and evolutionary patterns between different genera, species or strains, are taken into consideration.

Evolutionary and phylogenetic patterns should be considered at the gene, protein, genome, strain, and species levels. Generally, these patterns contain different evolutionary events, such as local and global duplications/triplications, inversions, deletions, mutations, translocations, silencing, and horizontal transfer of genes within a genome or between different genomes. These events confer high rates of adaptability to environmental pressures (Knop 2006) and create gene counterparts (paralogs, orthologs, and ohnologs) that are widely sought after and used for genome comparison. The determination of homologous, orthologous or paralogous genes is nowadays fundamental in several studies. Orthologs are genes originating from a single ancestral gene in the last common ancestor of the compared genomes and, therefore, they provide evidence of functional equivalence, which makes them of great interest. For paralogs, the situation is not so straightforward: since they are generated via duplication, they are described as having biologically distinct functions despite their structural similarity (Rigden and de Mello 2002; Koonin 2005; Lee, Redfern and Orengo 2007). In addition to the types of genes mentioned above, there are other kinds considered important in evolutionary terms, such as pseudoorthologs, xenologs, co-orthologs, inparalogs, outparalogs, pseudoparalogs, and analogs, which will not be extensively discussed in this study.

Yeasts belonging to the Saccharomycetaceae family have emerged as a preeminent phylogenetic clade for comparative genomics due to their small, streamlined genomes, a wealth of functional data, and genetic diversity spanning 500–1000 million years of evolution (Dujon et al. 2004; Dujon 2010). In evolutionary terms, this family also presents a wide variety of events that contribute to a better understanding of the Saccharomycetaceae species. Some of the most important events are: i) tandem gene repeats; ii) segmental duplications; iii) whole-genome duplications; iv) single-gene duplications; and v) horizontal gene transfers (Dujon 2006).


The whole-genome duplication (WGD) event is an example of a global duplication that can occur in the evolution of organisms and has been pointed out as an advantageous and common way to generate novel gene functions in some species. A whole-genome duplication creates a high rate of mutation and genomic instability; however, there are mechanisms, such as gene specialization, mutation, gene loss and gene rearrangement, that can compensate and maintain the stability of genomes (Kellis, Birren and Lander 2004; Koonin 2005). In their review of WGD events, Van de Peer et al. (2009) report several cases in ciliates, plants, and animals, showing that this event occurs across all groups of organisms. In that article, the competitive advantages of polyploidy were discussed, such as the prevention of extinction through an increase in parameters like vigour, species diversity, gene innovation and complexity, among others that improve fitness. Despite the advantages of WGD events, there is evidence that multiple aspects of the organism are affected in different ways, which can lead to problems concerning structure, the ability to re-duplicate, the storage mechanisms of duplicate genes, and the functions of duplicates, among other topics (Scannell et al. 2007).

In hemiascomycetous fungi, specifically in the Saccharomycetaceae family, this event played an important role: it caused many significant changes and the main division within the Saccharomyces complex (Wolfe and Shields 1997; Dujon et al. 2004; Kellis, Birren and Lander 2004). In S. cerevisiae, for instance, it is believed that the genome increased from about 5000 to about 10 000 genes and then lost one member of most pairs, yielding the present-day set. Currently, S. cerevisiae has about 5500 protein-coding genes, of which 1102 form 551 pairs of duplicated genes that originated in the WGD (ohnologs) (Byrne and Wolfe 2005).
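The figures quoted above can be put in proportion with a short calculation. The sketch below is only illustrative: it uses the approximate gene counts given in the text, and the variable names are introduced here for the example.

```python
# Approximate figures quoted in the text (Byrne and Wolfe 2005).
ANCESTRAL_GENES = 5000   # approximate gene count before the WGD
PRESENT_GENES = 5500     # approximate protein-coding genes in S. cerevisiae today
OHNOLOG_GENES = 1102     # present-day genes that belong to WGD-derived pairs

# Each ohnolog pair contributes two genes.
ohnolog_pairs = OHNOLOG_GENES // 2

# Share of ancestral loci for which both WGD copies were retained.
pairs_retained_fraction = ohnolog_pairs / ANCESTRAL_GENES

# Share of the present-day gene set that sits in an ohnolog pair.
genes_in_pairs_fraction = OHNOLOG_GENES / PRESENT_GENES

print(f"ohnolog pairs: {ohnolog_pairs}")
print(f"ancestral loci retained in duplicate: {pairs_retained_fraction:.1%}")
print(f"present-day genes in ohnolog pairs: {genes_in_pairs_fraction:.1%}")
```

Under these approximate counts, roughly one ancestral locus in nine kept both copies, while about one fifth of the present-day genes belong to an ohnolog pair.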

This event has been well studied in Saccharomycetaceae species; however, the exact nature and origin of the whole-genome duplication in the ancestry of S. cerevisiae and related yeasts remain uncertain. Two hypotheses exist: autotetraploidization (duplication of the entire chromosome set) and allotetraploidization (hybridization between two distinct but related species) (Wolfe and Shields 1997; Dujon 2006).

Gene novelty and functional divergence in duplicated copies can arise in different ways, mainly through gene dosage changes in response to gene loss biases and through neo- or subfunctionalization (Scannell, Butler and Wolfe 2007). Gene loss can be seen as the major cause of divergence among different yeast lineages because, despite its independent nature, it seems to act by fostering a common pattern that favours the elimination of specific ‘obsolete’ ancestral functions. Indeed, following genome duplication, the new homologous regions can undergo progressive gene loss by random local deletion, which will provoke the loss of one or two paralogous copies of each gene (Kellis, Birren and Lander 2004; Van de Peer, Maere and Meyer 2009). On the other hand, the neo- and subfunctionalization processes are responsible for gene preservation. In neofunctionalization, one duplicate evolves a useful new function while the other performs the ancestral function. In subfunctionalization, the ancestral functions are shared between the duplicates, so that both are required for full fitness (Lynch and Force 2000; Scannell, Butler and Wolfe 2007; Conant and Wolfe 2008). Both neo- and subfunctionalization are described as being composed of three distinct phases: creation, fixation-preservation, and subsequent optimization. Initially, creation can be caused by a duplication that begins as a mutation in a single individual. After that, the last two stages, fixation-preservation and optimization, require that all the members of a species retain the new extra copy, which may be maintained in a population through genetic drift or, if the novel features favour the population, through positive selection (Conant and Wolfe 2008).

HGT occurs when organisms acquire genes from other organisms of the same or a different species through a route other than vertical transmission (when an organism receives genetic material from its ancestor, i.e. the transmission of DNA from parent to offspring). In general, HGT occurs in response to a change in the environment, and the gained characteristics will be passed on to the descendants (Keeling and Palmer 2008), which can directly increase genetic diversity and speed up the evolution and innovation of the genome under different environmental conditions. Nevertheless, there are some important conditions to consider, such as pH and temperature, that can inhibit gene exchange between organisms (Jain et al. 2003).

Horizontal gene transfer events also contribute to the high rate of genomic diversity in the Hemiascomycetes group. Although HGT is reported to be rare in yeast and its underlying mechanisms are not yet elucidated, several authors have defined this event as a mechanism for genomic innovation and plasticity (Hall, Brachat and Dietrich 2005). Horizontal transfer events can be classified into three distinct categories: acquisition of new genes; acquisition of paralogs of existing genes; and gene displacement, where a gene is displaced by a horizontally transferred ortholog from another lineage (Hall, Brachat and Dietrich 2005). Curiously, all three categories appear to be present in the genome of S. cerevisiae. An example is the widely used S. cerevisiae wine strain EC1118, where a genetic analysis showed strong evidence for the HGT of three DNA regions, encompassing 34 genes involved in key wine fermentation functions (Novo et al. 2009). Interestingly, two of these regions were acquired from non-Saccharomyces species, and the donor of one region was identified as Zygosaccharomyces bailii, a typical contaminant of wine fermentations, indicating that HGT in yeast can cross genus boundaries (Steensels et al. 2014).

2.2.3. Comparative genomics and synteny of yeasts

Returning to comparative genomics and considering the evolution of organisms, synteny analysis is one of the most powerful tools to find and scrutinize new evolutionary and genomic insights between different species or strains. Ohno, in 1973, first defined synteny as linked genes residing in the same chromosomal environment in different genomes (Ohno 1973). More recently, different definitions have been proposed and different characteristics were included. Genes are usually grouped into blocks and segments, giving rise to syntenic blocks and segments. These blocks generally have identical or very similar gene collections, which serve as markers and landmarks of the genome under analysis. Syntenic segments have a well-preserved order, which is important because these sets of genes are transmitted to offspring over thousands of years, thus keeping the synteny between species (Ghiurcuta and Moret 2009). From a general point of view, this approach provides certain levels of abstraction for the study of whole genomes, which becomes an advantage in the identification of multiple copies and of analogous, homologous, orthologous and paralogous genes in different parts of the genome (Goodstadt and Ponting 2006).


This type of evolutionary study is of great importance, but it is not free of difficulties. For instance, the methodologies of sequencing and annotating genomes can lead to assembly problems, generating an incorrect order of syntenic blocks and large genes, or a wrong gene order. In addition, genome evolutionary mechanisms such as inversions or translocations can hinder syntenic analyses (McCouch 2001; Ghiurcuta and Moret 2009).

In order to perform these studies, there are several helpful programs for the detection and representation of synteny in the form of blocks or networks. For synteny detection, there is online software available such as IONS (Seret and Baret 2011), SyntenyTracker, COGE, Satsuma, Content, and OrthoCluster. For the mining of syntenic blocks, there are programs like SynMap, Region Miner, and SyntenyMiner, which compare several genomic sequences. To visualize and define synteny blocks there are programs such as Autograph, Synchro, SyntenyView, and SynBrowse, among others. All of this software is specific to synteny blocks. There are also synteny viewers dedicated to certain organisms, such as YGOB for synteny display in yeasts.

Several evolutionary studies have been carried out in hemiascomycetous species, based on phylogenetic and syntenic analyses that correlate the evolutionary history of different species with their genes and genome features. In yeasts, synteny has been based on the assumption that two genes of distinct yeast species, whose protein sequences belong to the same sequence cluster (homologs by similarity), will be members of the same gene lineage if they share at least one pair of neighbour genes (Seret et al. 2009; Dias et al. 2010b). The synteny between two species can be reinforced if other parameters, such as the similarity and identity values (Blast and needle) among the selected genes, are added to the analysis (Dias and Sá-Correia 2014).
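The neighbour-sharing criterion described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used by the cited authors: the function names, the toy gene identifiers, the cluster labels and the `window` parameter (how many flanking genes are inspected on each side) are all hypothetical.

```python
from typing import Dict, List, Set


def neighbour_clusters(order: List[str], gene: str,
                       cluster_of: Dict[str, str], window: int) -> Set[str]:
    """Sequence clusters of the genes flanking `gene` within `window` positions."""
    i = order.index(gene)
    flanking = order[max(0, i - window):i] + order[i + 1:i + 1 + window]
    return {cluster_of[g] for g in flanking if g in cluster_of}


def shared_neighbours(order_a: List[str], gene_a: str,
                      order_b: List[str], gene_b: str,
                      cluster_of: Dict[str, str], window: int = 5) -> int:
    """Number of homolog clusters present in the neighbourhood of both genes."""
    a = neighbour_clusters(order_a, gene_a, cluster_of, window)
    b = neighbour_clusters(order_b, gene_b, cluster_of, window)
    return len(a & b)


def same_lineage(order_a: List[str], gene_a: str,
                 order_b: List[str], gene_b: str,
                 cluster_of: Dict[str, str], window: int = 5) -> bool:
    """True if the genes are homologs by clustering and share >= 1 neighbour pair."""
    cluster_a = cluster_of.get(gene_a)
    if cluster_a is None or cluster_a != cluster_of.get(gene_b):
        return False
    return shared_neighbours(order_a, gene_a, order_b, gene_b,
                             cluster_of, window) >= 1


# Toy data (hypothetical): gene order along one chromosome in two species.
species1 = ["s1g1", "s1g2", "s1g3", "s1g4", "s1g5"]
species2 = ["s2g1", "s2g2", "s2g3", "s2g4"]
cluster_of = {
    "s1g3": "DHA1_x", "s2g2": "DHA1_x",   # the two candidate homologs
    "s1g2": "c1", "s2g3": "c1",           # one homologous pair of neighbours
    "s1g4": "c2", "s2g1": "c3",
    "s1g1": "c4", "s1g5": "c5", "s2g4": "c6",
}

print(shared_neighbours(species1, "s1g3", species2, "s2g2", cluster_of, window=1))
print(same_lineage(species1, "s1g3", species2, "s2g2", cluster_of, window=1))
```

The count returned by `shared_neighbours` can also be read as a crude measure of how strong the synteny evidence is: the more neighbour pairs two homologs share, the stronger the support for a common gene lineage.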

The strength of the synteny evidence is understood as the number of neighbours that a specific gene has in common with other genes in different species. Some authors are trying to understand how blocks of genes work in the genomic remodelling process (Drillon and Fischer 2011); blocks of genes are defined as chromosome segments between two species showing a succession of several orthologous genes in the same order, irrespective of gene orientation (Dujon 2010). In fact, it is interesting to understand how these blocks of genes are distributed and how they influence chromosomal rearrangements after the occurrence of evolutionary events like WGD and HGT (Seret and Baret 2011). Synteny in the hemiascomycetous yeasts varies considerably, reflecting small rearrangements that occurred in their history (Friedrich et al. 2012). Although hemiascomycetous yeasts present a small rate of rearrangements that seems to be nonlinear, in the Saccharomycetaceae family the number of genome reshuffling events increased after the WGD event (Souciet et al. 2009; Rolland and Dujon 2011).


2.3. Yeasts and Multidrug Resistance

Organisms require a controlled and homeostatic internal cellular environment to function, grow and reproduce. However, cells frequently encounter environmental conditions that lead to stress, such as oxidative stress, heat shock, hyper- and hypo-osmotic stress, and toxin challenge (either from the cellular metabolism itself or from external sources). To maintain the optimal internal environment, cells develop various mechanisms to protect against, cope with, or adapt to stress. These protective mechanisms involve signalling proteins and transcription factors, which can be activated in response to stress and then go on to rapidly activate many more target genes or proteins (Gasch et al. 2000). Many of these pathways involve the sensing of the stress and the relay of this information to the nucleus (Hanlon et al. 2011). For example, the osmosensor for sensing osmotic stress activates a range of general stress-response transcription factors. Other pathways, such as the oxidative stress response pathway, make use of enzymes that convert toxic metabolites to non-toxic products (Jamieson 1998; Ikner and Shiozaki 2005), and some pathways use certain membrane protein transporters for cellular detoxification by effluxing toxic compounds/metabolites. An example of one of these pathways is the multidrug resistance stress response (Kuchler and Schüller 2007; Sá-Correia et al. 2009).

Multidrug resistance (MDR) is defined as the ability of living cells to show resistance to a wide variety of structurally and functionally unrelated compounds. This phenomenon is conserved from bacteria to humans and allows organisms to survive or defend themselves against a variety of cytotoxic compounds in the environment (Sá-Correia et al. 2009; dos Santos et al. 2014).

MDR is also one of the main mechanisms by which pathogenic fungi and cancer cells develop drug resistance. Apart from that, multidrug resistance in fungi is also responsible for many problems in several areas of human activity, such as the clinical, industrial, agricultural and food sectors (Sá-Correia et al. 2009). In agriculture, for instance, the extensive use of fungicides and agricultural herbicides has provided perfect conditions for the increase of resistance in some yeast species. The same pattern can be seen in the pharmaceutical industry, which has repeatedly developed new drugs that affect different parts of microorganisms, requiring a more accelerated adaptation and evolution of the latter. In order to combat resistant strains, it is essential to understand the regulatory mechanisms and the tolerance to chemical stress, as well as to other kinds of stress conditions, that allow these fungi to gain resistance (Sá-Correia et al. 2009; Dias et al. 2010b; Dias and Sá-Correia 2014; dos Santos et al. 2014).

The MDR response can be mediated in many ways, such as mutation of drug targets, inactivation of drugs, or reduced cellular uptake of drugs through reduced membrane permeability (Kuchler and Schüller 2007). However, the most commonly observed MDR response is mediated through overexpression of membrane-spanning efflux pumps. The two major superfamilies of transporters that play important roles in MDR are the ATP-binding cassette (ABC) superfamily, whose members are ATP-dependent, and the Major Facilitator Superfamily (MFS) (Higgins 2007; Sá-Correia et al. 2009).


2.3.1. MDR-MFS Transporters

About 25% of all known membrane transport proteins in prokaryotes belong to the major facilitator superfamily (MFS) (Saier Jr et al. 1999), the largest and most diverse superfamily of secondary active transporters, which is divided into 76 distinct families of MFS proteins (http://www.tcdb.org/). More than 10,000 sequenced MFS proteins have been identified to date (Reddy et al. 2012), a number that is expected to increase as more genomes are sequenced (Saier, Tran and Barabote 2006). This superfamily is ubiquitous in all kingdoms of life and includes members of strong medical and pharmaceutical interest. Mechanistically, the MFS transporters display three distinct kinetic mechanisms: there are (i) uniporters, which transport only one type of substrate and are energized solely by the substrate gradient; (ii) symporters, which translocate two or more substrates in the same direction simultaneously, making use of the electrochemical gradient of one of them as the driving force; and (iii) antiporters, which transport two or more substrates, but in opposite directions across the membrane (Law, Maloney and Wang 2009). Interestingly, symporters and antiporters are characterized by a deep central hydrophilic cavity surrounded by mostly irregular transmembrane helices, which allows them to catalyse the transport of a wide range of compounds, including simple sugars, inositol, drugs, amino acids, nucleotides, oligosaccharides, esters and a wide variety of organic and inorganic ions (Yan 2013; Wisedchaisri et al. 2014).

Despite the large number of families comprised in the MFS superfamily, in hemiascomycetes only two families of antiporters are associated with the multidrug resistance mechanism: the 12-transmembrane-spanner drug:H+ antiporter-1 (DHA1) and the 14-transmembrane-spanner drug:H+ antiporter-2 (DHA2) families (Sá-Correia et al. 2009; Dias et al. 2010a). Structurally, MFS-MDR proteins contain two halves, the N- and C-terminal domains, each formed by a bundle of six or seven transmembrane (TM) helices, and the substrate translocation pathway is located at the interface between the N- and C-domains. MFS members were proposed to use an alternating access mechanism for substrate translocation, also known as the rocker-switch model (Abramson et al. 2003; Yan 2013; Deng et al. 2014). The transporters can adopt two main open conformations for substrate loading/release, namely the outward- and inward-facing open conformations, and alternate between these two conformations through occluded states to translocate substrates across the membrane (Yan 2013). Recently, Quistgaard and coworkers proposed an update to the classical rocker-switch model, called the clamp-and-switch model. According to this model, the pore-lining helices bend to form the occluded state (clamping step), after which the domains rotate from being outward-facing to inward-facing or vice versa, thereby exposing the binding site to the other side of the membrane (switching step) (Quistgaard et al. 2016). The MFS-MDR transporters possibly deviate from this mechanism, because substrates can also approach them laterally from the lipid bilayer (Fluman and Bibi 2009).

Although several model organisms such as Escherichia coli and C. albicans have been used to understand MFS-MDR transport, in this study we consider only the S. cerevisiae model as a reference (Lemieux, Huang and Wang 2004; Sá-Correia et al. 2009; Dias and Sá-Correia 2014). Historically, the yeast MFS-MDR proteins were classified in three families: the 12-transmembrane-spanner drug:H+ antiporter-1 (DHA1) family (12 members); the 14-transmembrane-spanner drug:H+ antiporter-2 (DHA2) family (10 members); and the Major Facilitator Unknown (UMF) family (6 members) (Paulsen et al. 1998; Dias et al. 2010a; Chang et al. 2012). More recently, it was understood that the UMF family comprises two subgroups, with four genes encoding siderophore carrier proteins (ARN) and two encoding glutathione exchangers (GEX) (Lesuisse, Simon-casteras and Labbe 1998; Dhaoui et al. 2011). Dias and Sá-Correia (2013) proposed the new DAG (DHA2, ARN, GEX) family, comprising the DHA2 subfamily and the two UMF subfamilies.

2.3.2. DAG family

The DAG family in S. cerevisiae includes 16 proteins: those encoded by the ATR1, AZR1, SGE1, VBA1, VBA2, VBA3, VBA4, VBA5, YMR279c, and YOR378w/AMF1 genes and translated ORFs (the DHA2 subfamily, in general less well characterized than the DHA1 family); by the ARN1, ARN2, SIT1, and ENB1 genes (ARN subfamily); and by GEX1 and GEX2 (GEX subfamily) (Sá-Correia et al. 2009; Dias and Sá-Correia 2013; dos Santos et al. 2014). Only a few members of the DAG family have been associated with the MDR mechanism (Dias and Sá-Correia 2013).

The ten S. cerevisiae DHA2 proteins have been associated with a broad range of functions that allow the survival of yeast cells. The ATR1 gene, initially identified and biochemically characterized as being involved in resistance to aminotriazole (Kanazawa, Driscoll and Struhl 1988; Sá-Correia et al. 2009), has since been reported to encode a boron efflux pump (Kaya et al. 2009) and to have an important role in the yeast response to DNA replication stress (Tkach et al. 2012). The KNQ1 gene of Kluyveromyces lactis, a homolog of S. cerevisiae ATR1, has also been associated with resistance to high concentrations of boron, since its deletion increased boron sensitivity in K. lactis (Svrbicka, Toth Hervay and Gbelska 2016). The two paralogs of ATR1, YMR279c (a paralog that arose from the whole-genome duplication) and YOR378w/AMF1 (a putative paralog), seem to have different physiological roles. While YMR279c is involved in resistance to boron and to heat (Sakaki et al. 2003), the putative paralog AMF1, which is phylogenetically more distant (Gbelska, Krijger and Breunig 2006), is not required for boron tolerance (Kaya et al. 2009; Bozdag et al. 2011). The recently biochemically characterized ammonium facilitator gene (AMF1) has been implicated in the TOR pathway, since its overexpression was found to significantly increase the sensitivity of yeast cells to rapamycin (Butcher et al. 2006). This gene has also been implicated in ammonium transport: its overexpression resulted in toxic concentrations of methylammonium in the cell, while its suppression eliminated methylammonium transport and toxicity (Chiasson et al. 2014).

The plasma membrane transporter AZR1 has been associated with resistance to azole drugs, such as ketoconazole and fluconazole, and to acetic acid (Sá-Correia and Tenreiro 2002; Bauer et al. 2003; Sá-Correia et al. 2009). In their initial characterization, the VBA1-5 genes were described by Shimazu and coworkers as encoding vacuolar basic amino acid permeases (Shimazu et al. 2005). Nonetheless, the vacuolar membrane gene VBA4 is the only one with no described physiological role. The remaining genes encode proteins able to: import histidine and lysine into the vacuole (VBA1 and VBA3); catalyse the vacuolar uptake of arginine (VBA2); and catalyse the uptake of lysine and arginine at the plasma membrane, being involved in the MDR phenomenon in yeasts (VBA5) (Shimazu et al. 2005, 2012; dos Santos et al. 2014). Moreover, it is interesting to observe that VBA5 and VBA3, although paralogs generated by segmental duplication, do not share the same subcellular localization: while VBA3 localizes to the vacuolar membrane, VBA5 localizes to the plasma membrane. The SGE1 gene encodes a plasma membrane transporter that acts as an extrusion permease. In S. cerevisiae genomes, this gene is described as occurring in multiple copies that confer resistance to methyl methanesulfonate (Ehrenhofer‐Murray, Keller Seitz and Sengstag 1998; Babrzadeh et al. 2012; Dias and Sá-Correia 2013; dos Santos et al. 2014). Interestingly, this transporter is also involved in chemoprotection and in the MDR phenomenon (Sá-Correia et al. 2009).

The S. cerevisiae GEX1 and GEX2 proteins were identified in yeast as glutathione exchangers (GSH/H+), found mainly in the plasma membrane of iron-depleted early-log phase cells (Dhaoui et al. 2011). Several transporters independently contribute to GSH export in yeast (Jacobson 2016). Being encoded by paralogous genes, the GEX1 and GEX2 transporters apparently share the same function. Although these genes share a high sequence similarity with the ARN transporters and their expression is dependent on the iron-responsive transcription factor Aft2, they have been shown not to be involved in siderophore transport (Dhaoui et al. 2011; Thorsen et al. 2012; Dias and Sá-Correia 2013). The ARN subfamily encodes four proteins involved in the uptake of siderophore-iron chelates: ARN1, ARN2, SIT1 and ENB1 (Dias and Sá-Correia 2013). The ARN2 gene is described as being involved in the uptake of iron bound to the siderophore triacetylfusarinine C (Yun et al. 2000). ARN1 is responsible for the uptake of iron bound to ferrirubin, ferrirhodin, and related siderophores (Heymann, Ernst and Winkelmann 2000). The expression of the ARN1 gene increases upon DNA replication stress (Tkach et al. 2012). The ENB1 (ARN4) gene is classified as an endosomal ferric enterobactin transporter that is expressed under conditions of iron deprivation (Heymann, Ernst and Winkelmann 2000; Yun et al. 2000). The SIT1 (ARN3) gene is characterized as a ferrioxamine B transporter, and its transcription is induced only during iron deprivation and the diauxic shift (Lesuisse, Simon-casteras and Labbe 1998; Philpott and Protchenko 2008).

2.3.3. DHA1 family

The S. cerevisiae genome comprises twelve DHA1 members: the QDR1, QDR2, AQR1, DTR1, QDR3, HOL1, TPO1, TPO2, TPO3, TPO4, YHK8 and FLR1 genes. The functional analysis of most of these transporters has already been performed, and they have been associated with different functions, such as polyamine transport, spore wall synthesis and resistance to stress, including drugs and chemicals such as quinidine, ketoconazole, and fluconazole (Sá-Correia et al. 2009; Dias et al. 2010a; Dias and Sá-Correia 2014; dos Santos et al. 2014). The DHA1 proteins are energized by the electrochemical proton motive force (EPMF) and a chemical proton gradient, with the influx of protons into the cell driving the extrusion of drugs against their concentration gradient (Cannon et al. 2009; Santos, Simões and Sá-Correia 2009; Pais et al. 2016). Structurally, the DHA1 proteins are composed of two transmembrane domains, each consisting of six transmembrane segments interconnected by extra- and intracellular loops (Sá-Correia et al. 2009; Redhu, Shah and Prasad 2016). A brief review of the DHA1 genes encoded in the genomes of S. cerevisiae and other hemiascomycetous species is given in the next paragraphs.


The evolutionary history of the DHA1 members in the Saccharomycetaceae family has been studied in recent years. Dias and co-workers (2010) performed a phylogenetic analysis of 189 DHA1 proteins encoded in the genomes of 13 hemiascomycetous species and identified 20 phylogenetic clusters. This allowed the reconstruction of the evolutionary history of most DHA1 members within 10 main gene lineages, corresponding to the phylogenetic clusters E, F, G, J, N1, N2, P, R, S and T, which span the whole hemiascomycetes clade (Figure 4).

Figure 4. The 10 main lineages of DHA1 transporters analysed in this study (extracted from Dias et al. 2010a).


DTR1 was one of the first DHA1 proteins to be assigned a physiological role other than chemoprotection (Sá-Correia et al. 2009; dos Santos et al. 2014). Localized in the prospore membrane, this protein plays an important role in spore wall synthesis by facilitating the translocation of bisformyl dityrosine, the major building block of the spore surface. The expression of the DTR1 gene in vegetative cells renders them slightly more resistant to antimalarial drugs and food-grade organic acid preservatives (Felder et al. 2002; Sá-Correia et al. 2009). The importance of this gene's function is reinforced by Dias and Sá-Correia (2014), whose phylogenetic and genomic analyses of the corresponding homologous genes showed that just one DTR1 gene copy is encoded in the 25 hemiascomycetous genomes analysed.

The AQR1 gene appears in this group as conferring resistance to weak acids, but also to chemical stress inducers such as the antimalarial/antiarrhythmic drug quinidine, the cationic dye crystal violet, or the antifungal drug ketoconazole (Tenreiro et al. 2002). This gene has also been reported to be involved in the excretion of amino acids, particularly homoserine, threonine, alanine, aspartate, and glutamate (Velasco et al. 2004). Despite conferring resistance to many drugs, the AQR1 gene is described as not inducible by several drug stresses, being instead activated by nitrogen or amino acid limitation (Sá-Correia et al. 2009). Another function associated with AQR1, in this case with its C. glabrata homolog, is conferring resistance to short-chain monocarboxylic acids such as acetic and propionic acids, suggesting an important role of this gene in weak acid resistance (Costa et al. 2013a). In a genome-wide study focusing on DNA damage pathways, the AQR1p protein was found to be involved in drug-induced DNA replication stress (Tkach et al. 2012). The study of the AQR1 ortholog encoded in the genome of Candida glabrata (CgAQR1) showed a dual role in acetic acid and antifungal drug resistance, conferring resistance in this species to the antifungal drugs flucytosine, imidazoles, miconazole, tioconazole, and clotrimazole (Costa et al. 2013a).

The DHA1 family also includes the plasma membrane transporters QDR1, QDR2, and QDR3. The QDR1 gene is known to confer resistance to the antimalarial/antiarrhythmic drug quinidine and to ketoconazole (Nunes, Tenreiro and Sá-Correia 2001). It was also implicated in the resistance of yeast cells to the anticancer drugs cisplatin and bleomycin (Tenreiro et al. 2005), and to toxic concentrations of spermine, spermidine, and putrescine (Teixeira et al. 2011). The QDR2 gene was initially implicated in potassium homeostasis, possibly by functioning as an alternative K+ importer (Vargas et al. 2007). This physiological function was proposed to play an indirect role in the ability of this transporter to confer resistance to quinidine, a drug that leads to decreased K+ uptake and a drop in the intracellular accumulation of K+ in yeast cells, providing a physiological advantage to cells during the onset of K+-limited growth and in the presence of quinidine (Vargas et al. 2004, 2007). Ríos et al. (2013) reported that the QDR2 gene is involved in tolerance to lithium and sodium, in the uptake of cadmium and cobalt, and in the export of copper. The range of compounds to which QDR3 confers resistance is very similar to the one assigned to the QDR2p protein. The QDR3 gene confers resistance to a range of inhibitory compounds that are structurally and functionally unrelated, including the antimalarial and antiarrhythmic drug quinidine, barban, and the anticancer drugs cisplatin and bleomycin (Tenreiro et al. 2005; Teixeira et al. 2011). Teixeira et al. (2011) proposed a new role for the S. cerevisiae QDR3 gene in polyamine homeostasis, specifically of spermine and spermidine but not of putrescine, through the maintenance of the plasma membrane potential. This contrasted with what was observed for QDR2, which conferred resistance to all three polyamines, a role associated with K+ homeostasis. The inability of QDR3 expression to rescue the QDR2 mutant strains from polyamine susceptibility suggests that these two QDR transporters confer polyamine stress tolerance through different mechanisms (Teixeira et al. 2011). The transcription level of the QDR3 gene does not change in response to drug exposure, indicating that the corresponding protein might have other specific physiological substrates and that the drugs might be transported opportunistically (Tenreiro et al. 2005). Another role of QDR3 seems to be in yeast resistance to acetic acid. Recently, in a synthetic genetic array analysis, the QDR3 gene was identified as a component of the complex gene network that controls the assembly of the spore wall, together with QDR1 and DTR1 (Lin et al. 2013). Additionally, the QDR members seem to keep the same functions in other related species like C. albicans or C. glabrata. In C. glabrata, the QDR2 homolog, CgQDR2, was described as involved in resistance to imidazoles, clotrimazole, miconazole, tioconazole, and ketoconazole (Costa et al. 2013b), while in C. albicans the deletion of the QDR1, QDR2, and QDR3 transporters was found to lead to defects in biofilm architecture and thickness and to attenuate virulence in a mouse model (Costa et al. 2014a; Shah et al. 2014).

The plasma membrane proteins encoded by the TPO1, TPO2, TPO3, and TPO4 genes were postulated to be polyamine transporters because they all confer resistance to toxic concentrations of polyamines (Tomitori et al. 2001; Sá-Correia et al. 2009). Although all TPO genes are involved in polyamine transport, they show different specificities. For instance, the TPO2 transporter shows specificity for spermine, TPO3 for spermine and spermidine, and TPO1 and TPO4 can recognize spermine, putrescine, and spermidine (Sá-Correia and Tenreiro 2002; Albertsen et al. 2003; Teixeira et al. 2011; dos Santos et al. 2014). In addition to the transport of polyamine substrates, the TPO4 transporter was reported to have an important role in quinidine resistance (Delling, Raymond and Schurr 1998). The TPO1 gene is the best-characterized member of this group, being able to extrude at least eight different substrates, including the polyamines mentioned above, the antimalarial drugs quinidine and artesunate, cycloheximide, and caspofungin (Do Valle Matta et al. 2001; Sá-Correia and Tenreiro 2002; Markovich et al. 2004; Alenquer, Tenreiro and Sá-Correia 2006). Moreover, the ΔTPO1 mutant exhibits increased sensitivity to nystatin and a higher ergosterol content in the plasma membrane (Kennedy and Bard 2001). This gene is activated by the zinc-cluster transcription factors Pdr1 and Pdr3 (Teixeira and Sá-Correia 2002). The TPO1 homolog in Arabidopsis thaliana was found to confer resistance to the herbicides 2-methyl-4-chlorophenoxyacetic acid (MCPA) and 2,4-dichlorophenoxyacetic acid (2,4-D) (Cabrito et al. 2009). Candida glabrata has two TPO1 orthologous genes, CgTPO1_1 and CgTPO1_2 (Figure 4). These genes were characterized as relevant for azole drug resistance, being required for clotrimazole resistance by effectively extruding the drug to the external medium (Dias et al. 2010a; Pais et al. 2016).
The TPO2 and TPO3 paralogous genes, generated in the whole-genome duplication, were described as playing an important role in resistance to weak acids, such as acetic and propionic acids (Byrne and Wolfe 2005; Fernandes et al. 2005; Dias et al. 2010b). These genes are transcriptionally activated in response to acetic acid, under the dependence of the transcriptional activator Haa1, which is also a determinant of yeast resistance to acetic acid (Fernandes et al. 2005; Mira et al. 2010, 2011). The physiological role and transcriptional regulation of the TPO3 orthologous gene in C. glabrata, CgTPO3, were evaluated, and it was shown to confer resistance to imidazole and triazole antifungal drugs (Costa et al. 2014b).

Regarding the HOL1 and YHK8 genes, little information is available in the literature. The HOL1 gene has been reported to be involved in Na+ uptake and to enhance the ability of yeast cells to import histidinol (a precursor of histidine) and mono- and divalent cations (Wright, Howell and Gaber 1996; Sá-Correia and Tenreiro 2002). The YHK8 gene is involved in resistance to azoles such as itraconazole and fluconazole (Barker, Pearson and Rogers 2003; Costa et al. 2014b; Redhu, Shah and Prasad 2016).

FLR1 is one of the best-studied genes in S. cerevisiae. This DHA1 member has been classified as a determinant of resistance to fluconazole, cycloheximide, 4-nitroquinoline 1-oxide (4-NQO), benomyl, methotrexate, diazaborine, cerulenin, diamide, diethylmaleate, menadione, and paracetamol (Alarco et al. 1997; Broco et al. 1999; Jungwirth et al. 2000; Tenreiro, Fernandes and Sá-Correia 2001; Sá-Correia et al. 2009; dos Santos et al. 2014). In addition, FLR1 was identified as a determinant of resistance to the agricultural fungicide mancozeb in yeast (Teixeira et al. 2008; Dias et al. 2010b). In this case, the transcriptional activation of FLR1 upon yeast exposure to mancozeb was found to be fully dependent on the b-ZIP transcription factor Yap1 and to be reduced (by 50%) in the absence of the Rpn4p, Yrr1p or Pdr3p transcription factors (Teixeira et al. 2008). Moreover, the set of transcription factors involved in FLR1 regulation does not appear to be complete, since the possible participation of a fifth transcription factor in this regulatory network has been highlighted (Teixeira et al. 2010). Another interesting point is the localization of this DHA1 member: initially reported as a plasma membrane transporter, the Flr1p protein was more recently shown to relocalize from the nucleus to the plasma membrane in response to DNA replication stress (Tkach et al. 2012; dos Santos et al. 2014).

The homologs of S. cerevisiae FLR1 encoded in pathogenic Candida species such as Candida glabrata, Candida albicans, and Candida dubliniensis have also been associated with MDR phenotypes. In fact, the first MFS transporter identified was the C. albicans drug:H+ antiporter MDR1, a homolog of S. cerevisiae FLR1 responsible for azole, cycloheximide, methotrexate, and benomyl resistance (Goldway et al. 1995; Chen et al. 2007; Redhu, Shah and Prasad 2016). In C. glabrata, the FLR1 homolog, CgFLR1, is described as conferring resistance to benomyl, diamide, and menadione (Goldway et al. 1995; Chen et al. 2007). CgFLR1 regulation also involves the C. glabrata Yap1 homologue (CgAP-1), suggesting a similar regulatory pathway for the FLR1 homologs in C. glabrata and S. cerevisiae (Chen et al. 2007). The MDR1 gene of C. dubliniensis is responsible for resistance to azole compounds and to benomyl, brefeldin A, cerulenin, cycloheximide, fluphenazine and 4-nitroquinoline-N-oxide (Redhu, Shah and Prasad 2016).

2.4. Bioinformatics, phylogenetic and protein evolution analyses

Bioinformatics is intrinsic to comparative genomics and evolution studies. As a necessary and widely used branch of comparative genomics, it is an indispensable tool for the analysis of large-scale datasets. Although it is a relatively new and holistic approach, its computational methodologies have evolved at a very high speed. In 1990, with the onset of genome sequencing, the need to process large amounts of information emerged. To solve this problem, computational tools and strategies were developed to analyse these genomic sequences.

These datasets are stored in biological research databases (DB) designed to cover different kinds of biological information, such as 1) genomics, 2) proteomics, 3) metabolomics, 4) transcriptomics, 5) phylogenies, 6) cell functions, 7) protein localization, 8) clinical effects, 9) sequence similarities and 10) biological structures, among others (Altman 2004). Databases can be specific or general, local or online. Because biological knowledge is in most cases distributed over different databases, inconsistencies in the data may occur. To address this problem, integrative bioinformatics commonly applies methods such as cross-referencing between databases and other tools that reduce disparities (Shah et al. 2005). Some of the reference databases are DDBJ, the DNA Data Bank of Japan (National Institute of Genetics), EMBL (European Bioinformatics Institute) and GenBank (National Center for Biotechnology Information). These three primary databases are synchronized with each other, so that new data received by any of them are constantly propagated and quickly made available (Benson et al. 2015). Examples of specific databases relevant for this study are SGD (Saccharomyces Genome Database), YGOB (Yeast Gene Order Browser), TCDB (Transporter Classification Database), CGD (Candida Genome Database) and Génolevures, which work not only as databases but also as important tools for the analysis of yeast genomes. The main database used in this work, a local database (LDB) developed by the BSRG group, comprises the available information regarding 171 yeast strains, from gene annotation to the known characterization of the genes.

2.4.1. Sequence alignment

Three main types of sequence alignment are generally considered: global (GSA), local (LSA) and multiple sequence alignment (MSA). The DNA sequences to be aligned often contain open reading frames (ORF) that code for proteins. A coding sequence can be considered at either the nucleotide or the amino acid level. Because of the redundancy of the genetic code, different codons can encode the same amino acid; hence the nucleotide sequence is less conserved but more informative than its amino acid translation (Ranwez et al. 2011).
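
The effect of codon redundancy on conservation can be illustrated with a minimal Python sketch. The two short coding sequences and the partial codon table below are hypothetical examples written for this illustration (only the codons that occur in them are listed); they are not taken from the datasets of this work.

```python
# Partial standard genetic code: only the codons used in this example.
CODON_TABLE = {
    "ATG": "M", "CTT": "L", "CTG": "L",
    "GGT": "G", "GGC": "G", "TCT": "S", "AGC": "S",
}

def translate(cds):
    """Translate an in-frame coding sequence codon by codon."""
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, len(cds) - 2, 3))

def identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Hypothetical sequences encoding the same peptide with synonymous codons.
seq_a = "ATGCTTGGTTCT"
seq_b = "ATGCTGGGCAGC"

nt_identity = identity(seq_a, seq_b)
aa_identity = identity(translate(seq_a), translate(seq_b))
```

In this example the two nucleotide sequences differ at several positions while their translations are identical, which is the sense in which the nucleotide level is less conserved but carries more information.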

The difference between the GSA and LSA algorithms is that the former considers the entire length of the sequences, while the latter compares only segments of the sequences (Needleman and Wunsch 1970; Smith and Waterman 1981). The MSA algorithms, in turn, consider three or more biological sequences of protein, DNA, or RNA. Various tools, such as FASTA (Lipman and Pearson 1985), PHYLIP (Felsenstein 1981) and the Feng and Doolittle alignment (Feng and Doolittle 1987), can be used in this type of approach, with different goals. This kind of alignment can be based on GSA or LSA, depending on the aim of the study. In general, heuristic methods such as progressive alignment methods (PAM) are used instead of global optimization methods, because they require less computational time. The progressive approach builds a multiple alignment from pairwise alignments, starting with the most similar pair of sequences and progressively adding the less similar ones. To do so, a guide phylogenetic tree (PT) is built with an efficient clustering method such as UPGMA or neighbour-joining, and the sequences are aligned following this tree. This approach is most useful and efficient for phylogenetically close species (Mount and Mount 2001). The PAM most used today is Clustal, especially the variants CLUSTALW and CLUSTALW2, available online from GenomeNet, EBI, and EMBNet. Other progressive methods are available, such as T-Coffee, MEGA, PAUP, MUSCLE or MAFFT (Maddison, Swofford and Maddison 1997; Notredame, Higgins and Heringa 2000; Katoh et al. 2002; Edgar 2004; Tamura et al. 2011).
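
The distinction between global and local alignment can be sketched with the two classical dynamic programming recurrences. The Python code below is a minimal illustration that returns only the optimal score; the scoring values (match +1, mismatch −1, linear gap −1) and the two example sequences are assumptions chosen for readability, not the parameters used in this work.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch score: both sequences are aligned end to end."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman score: best-scoring pair of sub-segments."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr.append(max(0, diag, prev[j] + gap, curr[j - 1] + gap))
            best = max(best, curr[j])
        prev = curr
    return best

# Hypothetical sequences sharing only the internal segment "ACGT".
seq_1 = "TTTACGTGGG"
seq_2 = "CCACGTAA"
```

With these sequences the local score rewards the shared segment alone, whereas the global score is lowered by the unrelated flanking regions, which is why the choice between GSA and LSA depends on the aim of the study.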

In MSA of nucleotide sequences, some interrupted ORFs are usually found (sequencing errors, called genomic errors in this work). These interruptions can result from: (i) the insertion, or deletion, of a number of consecutive nucleotides that is not a multiple of 3, both of which induce frameshifts that lead to transient or irreversible aberrant translation of the downstream amino acid sequence; (ii) the substitution of an in-frame nucleotide, resulting in unexpected premature stop codons that shorten the amino acid sequence (Ranwez et al. 2011). These events may be either artefactual (e.g. resulting from elevated error rates in homopolymers when using 454 GS-FLX and in short read ends with the Illumina Genome Analyser (Margulies et al. 2005; Kircher, Stenzel and Kelso 2009)) or normal features of genome sequences, such as pseudogenes.
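
The two kinds of interruption described above can be screened for with a very simple check. The following Python sketch is only an illustration under stated assumptions: it uses the standard stop codons, expects an uppercase coding sequence that ends with its own stop codon, and the function name orf_problems is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def orf_problems(cds):
    """Return a list of reasons why `cds` is not a clean open reading frame."""
    problems = []
    if len(cds) % 3 != 0:
        problems.append("length not a multiple of 3 (possible frameshift)")
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    # The last complete codon is skipped because it is expected to be the stop.
    for index, codon in enumerate(codons[:-1]):
        if codon in STOP_CODONS:
            problems.append(f"premature stop codon {codon} at codon {index + 1}")
    return problems

clean = orf_problems("ATGGCTTAA")
truncated = orf_problems("ATGTAAGCTTAA")
shifted = orf_problems("ATGGCTAA")
```

Such a check cannot tell a sequencing artefact from a genuine pseudogene; it only flags the sequences that deserve closer inspection.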

In whole-genome alignment, evolutionary proximity should be taken into account to obtain a good result, because many aligners do not consider the insertions, deletions, inversions, rearrangements, exchanges and duplications that occur at a high rate. It is also important to consider that the analysis can become computationally unfeasible with more than one pair of genomes at a time, even with efficient algorithms and programs specially developed for this purpose (Kent and Zahler 2000; Ma, Tromp and Li 2002; Kurtz et al. 2004). For this reason, comparative analysis of genomes is usually done at a modular level, taking into account only part of the sequences and sets of genes with specific functions.

2.4.2. Phylogenetic analysis

Different clustering methods are available for the construction of phylogenetic trees, and the method should be chosen according to the biological object of study in order to provide the most reliable result (Singh 2015). Although many methods can be used to build phylogenetic trees, only the maximum likelihood (ML) and Bayesian methods (BM) and the distance-matrix method of neighbour-joining were considered in this work.

The ML methods aim at ranking trees according to the likelihood of observing the data given the topology of the tree. To compute the phylogenetic tree, these methods estimate P(Xn | T, t), where the data Xn is the set of n sequences (taxa), T is the tree and t denotes the branch (edge) lengths of the tree. To quantitatively define this probability, an underlying model of evolution is assumed (Singh 2015). The Bayesian methods involve model-based computations similar to those of ML procedures, but they estimate a posterior probability of the model given the observed sequence data. Efficient computational algorithms (e.g. Markov Chain Monte Carlo) and the increase in computational power have enabled BM to compete with ML methods (Huelsenbeck et al. 2001). The neighbour-joining method, on the other hand, is based on a distance matrix and works in a stepwise fashion by minimizing the sum of branch lengths at each step of sequence clustering. This method is very efficient and frequently used in large-scale data analysis (Kendall 2012; Yang and Rannala 2012).
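
A single clustering step of neighbour-joining can be sketched as follows. The Python code uses the usual Q criterion, Q(i,j) = (n − 2)·d(i,j) − Σk d(i,k) − Σk d(j,k), and joins the pair with the smallest value; the five taxa and their distances are a hypothetical textbook-style example, not data from this work, and the function name nj_pick_pair is an assumption.

```python
def nj_pick_pair(names, dist):
    """Return the pair minimising the neighbour-joining Q criterion."""
    n = len(names)
    totals = [sum(row) for row in dist]
    best_pair, best_q = None, None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * dist[i][j] - totals[i] - totals[j]
            if best_q is None or q < best_q:
                best_pair, best_q = (i, j), q
    i, j = best_pair
    # Branch lengths from the new internal node to the two joined taxa.
    len_i = dist[i][j] / 2 + (totals[i] - totals[j]) / (2 * (n - 2))
    len_j = dist[i][j] - len_i
    return names[i], names[j], best_q, len_i, len_j

# Hypothetical symmetric distance matrix for five taxa.
names = ["a", "b", "c", "d", "e"]
dist = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
first_join = nj_pick_pair(names, dist)
```

In a complete implementation the joined pair would be replaced by the new internal node, the distance matrix updated, and the step repeated until the tree is fully resolved.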

Different programs are available for displaying a phylogenetic tree, locally or online (e.g. Phyloviewer, EvolView, TreeVector, T-REX) (Stöver and Müller 2010; Vos et al. 2011; Rosindell and Harmon 2012; Zhang et al. 2012). Besides these, other widely used applications for viewing and manipulating phylogenetic trees include Dendroscope, Bio::Phylo, MultiDendrograms, FigTree and TreeView (Huson et al. 2007; Fernández and Gómez 2008; Rambaut and Drummond 2010; Vos et al. 2011).

2.4.3. Positive, negative selection and conservation analyses

Another, no less important, evolutionary approach is the identification of sites evolving under selection in protein-coding genes. Determining the selection pressures that shape genetic variation constitutes a major part of many studies of molecular evolution. A common approach involves estimating the rates of nonsynonymous (dN or β) and synonymous (dS or α) substitutions. Convincing evidence for non-neutral evolution is obtained when estimates of dN are significantly different from dS (Murrell et al. 2013).

Initially, studies of selection pressure were based on the average β/α ratio (ω) for the region of interest, using distance-based or maximum likelihood methods. However, such approaches lack statistical power to detect positive selection, as only a few sites may be under selection. Subsequently, methods were proposed to study selection on a site-by-site basis. These approaches are grouped into four classes: 1) counting methods, which count the number of nonsynonymous and synonymous substitutions along the phylogeny; 2) random effects models, which assume a distribution of rates (ω) across sites and infer the rate at which individual sites evolve given this distribution; 3) fixed effects models, which estimate the ratio of nonsynonymous to synonymous substitutions on a site-by-site basis; and 4) mixed effects models of evolution, which allow the distribution of ω to vary from site to site (the fixed effect) and also from branch to branch at a site (Pond and Frost 2005; Yang, Wong and Nielsen 2005; Murrell et al. 2012).

In this study, three specific methods were used, covering three of the four classes above: fixed effects, random effects, and mixed effects. For each one, a software implementation from the HYPHY package available on the Datamonkey web server was chosen (Delport et al. 2010). First, in fixed effects likelihood (FEL) models (Pond and Frost 2005), the parameters are inferred independently for each site. This approach avoids assumptions about the distribution of selection parameters over sites, yielding greater flexibility to describe the distributions. A specific site is considered to be under positive selection when α<β, under negative or purifying selection when α>β, and without selective pressure when α=β (Pond and Frost 2005). Second, among the random effects likelihood methods, FUBAR (Fast Unconstrained Bayesian AppRoximation) was chosen, a codon-based maximum likelihood method that allows dN/dS (ω) to vary over each codon across a gene according to a number of predefined site classes given a priori. This allows testing codons for positive (dN/dS > 1) or purifying selection (dN/dS < 1) (Murrell et al. 2013).

For the mixed effects class, the MEME (mixed effects model of evolution) method was chosen, which combines the fixed effects (site-to-site variation of the ω distribution, applied in models like FEL) with a branch-to-branch analysis at each site. The greatest advantage of this approach is its capacity to reliably capture the molecular footprints of both episodic and pervasive positive selection, while other methods, like FEL, detect only pervasive positive selection (Murrell et al. 2012). In MEME, two important parameters are used to scrutinize the results: the likelihood ratio test (LRT) and the empirical Bayes factor (EBF). The method recommends first examining the p-value (low values) and the LRT (high values), and then identifying the specific branches with high EBF values at those sites. Such sites are possible sites of episodic positive selection. For a better understanding, various reviews of these methods are available (Pond and Frost 2005; Murrell et al. 2012, 2013).
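
The site-level decision rule described for FEL (comparison of α and β at each site) can be written as a small Python sketch. The significance threshold of 0.05, the function name classify_site and the three example sites are assumptions made for this illustration; they are not output of HYPHY or Datamonkey.

```python
def classify_site(alpha, beta, p_value, threshold=0.05):
    """Classify one codon site from its synonymous (alpha) and nonsynonymous (beta) rates."""
    # Without statistical support the difference between rates is not interpreted.
    if p_value > threshold or alpha == beta:
        return "neutral"
    return "positive" if beta > alpha else "negative"

# Hypothetical per-site estimates: (alpha, beta, p-value).
sites = [(0.5, 1.5, 0.01), (1.2, 0.1, 0.001), (1.0, 2.0, 0.40)]
labels = [classify_site(a, b, p) for a, b, p in sites]
```

The third site shows why the p-value matters: β is larger than α, but the difference is not supported, so the site is not reported as positively selected.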

The conservation of amino acid (AA) and nucleotide (NT) sequences is directly related to negative and positive selection. Based on an MSA, most available software relies on scores to determine whether AA or NT conservation occurred at specific positions of the alignment. It is not trivial to generalize pairwise scoring systems to multiple alignments, and four principal approaches have been used to solve the problem: tree scores, star scores, sum-of-the-pairs (SP) scores and entropy scores. In this work, only one type of entropy scoring system was used: the "Bayesian Integral Log-odds" or BILD scores. The BILD approach considers that multiple alignment column scores can be constructed in a similar way, based upon explicit target frequency predictions for columns from accurate alignments of related sequences (Altschul et al. 2010).
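
To give an intuition for entropy-type column scores, the Python sketch below computes the plain Shannon entropy of each alignment column. This is a deliberately simpler measure than the BILD score used in this work and is not an implementation of it; the three aligned sequences are hypothetical, and gap characters, if present, would simply be counted as an additional symbol.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (in bits) of the symbols in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def alignment_entropies(aligned):
    """Entropy of every column of a list of equal-length aligned sequences."""
    return [column_entropy(col) for col in zip(*aligned)]

# Hypothetical aligned peptide fragments.
aligned = ["MKTL", "MKSL", "MRAL"]
entropies = alignment_entropies(aligned)
```

A fully conserved column has entropy zero and the entropy grows as the column becomes more variable, so low values point to conserved positions.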

3. Materials and Methods

3.1. Selection of the Hemiascomycetes yeasts used in this study.

The hemiascomycetous yeast strains were selected based on published literature (Dias et al. 2010a; Dias and Sá-Correia 2014) and using a local genome database (LDB) created in the MySQL relational environment. The LDB contains 171 strains whose whole genome sequences were obtained from different online databases, including Génolevures, the Yeast Gene Order Browser (YGOB), the Saccharomyces Genome Database (SGD), the Broad Institute and the Candida Genome Database (CGD). The genome database comprises different types of information about each genome, such as the number of scaffolds/chromosomes per species, the order and orientation (direct or reverse) of the genes in each chromosome, and the length of each gene sequence, among other important information. Table. 1 shows some features of each hemiascomycetous strain analysed in this master thesis, as well as complementary information extracted from the SGD, CGD and YGOB online databases: genome size, sequencing state (complete or incomplete) and read coverage of the genomes.

The choice of the different datasets was made according to the goals of the work. Initially, all hemiascomycetous species present in the LDB were considered from a general point of view, and from a total of 171 yeast genomes, ninety-four were selected for evaluation. It is important to note that these yeast strains were selected because of their biotechnological/industrial applications and genomic/evolutionary characteristics.

The ninety-four selected hemiascomycetous strains correspond to sixty-three different yeast species, divided into 53 strains belonging to the Saccharomyces complex, 20 strains of the CTG complex, 6 strains classified as early divergent hemiascomycetous strains and 15 other strains belonging to other taxonomic families. S. cerevisiae is the most represented species in this study, with eighteen strains, followed by K. pastoris, Z. bailii and S. paradoxus with 3 strains each; S. bayanus, N. castellii, C. glabrata, C. albicans, C. lusitaniae, D. hansenii, D. bruxellensis and C. jadinii with 2 strains each; and the remaining species with only one strain (Table. 1). Interestingly, H. valbyensis, classified as belonging to the Saccharomyces complex, is the only member of the Saccharomycodaceae family, the sister family of the Saccharomycetaceae in the Saccharomyces complex (Figure. 2). This species was placed together with the other families to facilitate the comparison between the Debaryomycetaceae and Saccharomycetaceae families.

For consistency, the four-letter code shown in Table. 1 as species abbreviation is used to designate both yeast genes and species. The characters displayed after the first four letters are used to abbreviate the strain name when the genome of more than one strain of a given species is available or when the genome of the same strain was sequenced by different research centres.

Table. 1 Hemiascomycetous strains examined in this work. The table shows the species names and associated strains, as well as the acronym used for each strain in this work. The number of scaffolds represents the number of potential chromosomes in each strain. Genome size and coverage were collected from the website of the database of origin and helped in the choice of the strains for analysis. Note that Ashbya aceri has no position in the phylogenetic tree of species, nor is a strain indicated; its position in the table was chosen according to the similarity of its genome to those of the members of the Eremothecium genus (Dietrich et al. 2013).

Complex | Phylogeny | Species | Species abbreviation | Strain | Acronym | Coverage | Nº of Scaffolds | Genome size (Mb) | Database Origin
Saccharomyces complex | Post-WGD | Saccharomyces cerevisiae | S. cerevisiae | S288c | sace_1 | Complete | 16 | 12.2 | http://downloads.yeastgenome.org/sequence/S288C_reference
Saccharomyces complex | Post-WGD | Saccharomyces cerevisiae | S. cerevisiae | AWRI796 | sace_2 | 20x | 16 | 11.7 | http://www.ncbi.nlm.nih.gov/assembly/GCA_000190195.1/
Saccharomyces complex | Post-WGD | Saccharomyces cerevisiae | S. cerevisiae | BC187 | sace_7 | - | - | - | -
Saccharomyces complex | Post-WGD | Saccharomyces cerevisiae | S. cerevisiae | RM11-1a | sace_14 | 10x | 17 | 11.7 | http://www.ncbi.nlm.nih.gov/assembly/GCA_000149365.1
Saccharomyces complex | Post-WGD | Saccharomyces cerevisiae | S. cerevisiae | DBVPG1106 | sace_18 | - | 18 | - | -
Saccharomyces complex | Post-WGD | Saccharomyces cerevisiae | S. cerevisiae | DBVPG1373 | sace_19 | - | 18 | - | -
Saccharomyces complex | Post-WGD | Saccharomyces cerevisiae | S. cerevisiae | DBVPG1853 | sace_21 | - | 18 | - | -
Saccharomyces complex | Post-WGD | Saccharomyces cerevisiae | S. cerevisiae | DBVPG6040 | sace_22 | - | 17 | - | -
Saccharomyces complex | Post-WGD | Saccharomyces cerevisiae | S. cerevisiae | DBVPG6044 | sace_23 | - | 18 | - | -
Saccharomyces complex | Post-WGD | Saccharomyces cerevisiae | S. cerevisiae | K11 | sace_25 | 189.0x | 18 | 11.5 | http://www.ncbi.nlm.nih.gov/assembly/GCA_000767965.1
Saccharomyces complex | Post-WGD | Saccharomyces cerevisiae | S. cerevisiae | L 1374 | sace_26 | - | 18 | - | -
Saccharomyces complex | Post-WGD | Saccharomyces cerevisiae | S. cerevisiae | L 1528 | sace_27 | - | 18 | - | -
Saccharomyces complex | Post-WGD | Saccharomyces cerevisiae | S. cerevisiae | UWOPS05 217 3 | sace_32 | 57.0x | 18 | 11.4 | http://www.ncbi.nlm.nih.gov/assembly/GCA_000768095.1/
Saccharomyces complex | Post-WGD | Saccharomyces cerevisiae | S. cerevisiae | Y12 | sace_37 | - | 18 | - | -
Saccharomyces complex | Post-WGD | Saccharomyces cerevisiae | S. cerevisiae | YIIc17-E5 | sace_39 | - | 18 | - | -
Saccharomyces complex | Post-WGD | Saccharomyces cerevisiae | S. cerevisiae | vin13 | sace_49 | 20x | 16 | 11.7 | http://www.ncbi.nlm.nih.gov/assembly/GCA_000190215.1/
Saccharomyces complex | Post-WGD | Saccharomyces cerevisiae | S. cerevisiae | VL3 | sace_50 | 20x | 16 | 11.7 | www.ncbi.nlm.nih.gov/assembly/GCA_000190235.1/
Saccharomyces complex | Post-WGD | Saccharomyces cerevisiae | S. cerevisiae | EC1118 | sace_56 | 24x | 31 | 11.7 | http://www.ncbi.nlm.nih.gov/genome
Saccharomyces complex | Post-WGD | Saccharomyces paradoxus | S. paradoxus | Sanger reference (consensus) | sapa_1 | - | 16 | - | -
Saccharomyces complex | Post-WGD | Saccharomyces paradoxus | S. paradoxus | CBS432 | sapa_4 | - | - | - | -
Saccharomyces complex | Post-WGD | Saccharomyces paradoxus | S. paradoxus | NRRL Y-17217 | sapa_26 | 7.7x | 471 | 11.9 | http://downloads.yeastgenome.org/sequence/fungi/S_paradoxus
Saccharomyces complex | Post-WGD | Saccharomyces mikatae | S. mikatae | IFO 1815 | sami_1 | 5.9x | 16 | 11.5 | http://downloads.yeastgenome.org/sequence/fungi/S_mikatae
Saccharomyces complex | Post-WGD | Saccharomyces kudriavzevii | S. kudriavzevii | IFO 1802 | saku_1 | 3.4x | 16 | 12 | http://downloads.yeastgenome.org/sequence/fungi/S_kudriavzevii
Saccharomyces complex | Post-WGD | Saccharomyces arboricola | S. arboricola | H-6 | saar_1 | 50.0x | 16 | 11.6 | http://www.ncbi.nlm.nih.gov/assembly/GCF_000292725.1/
Saccharomyces complex | Post-WGD | Saccharomyces bayanus | S. bayanus | 623-6C | saba_1 | 2.9x | 16 | 11.9 | http://wolfe.ucd.ie/
Saccharomyces complex | Post-WGD | Saccharomyces bayanus | S. bayanus | MCYC 623 | saba_2 | 6.4x | - | 11.5 | http://downloads.yeastgenome.org/sequence/fungi/S_bayanus
Saccharomyces complex | Post-WGD | Saccharomyces uvarum | S. uvarum | CBS 7001 | sauv_1 | - | - | - | -
Saccharomyces complex | Post-WGD | Kazachstania africana | K. africana | CBS 2517 | kaaf_1 | 20x | 12 | 11.1 | http://www.ncbi.nlm.nih.gov/bioproject/178246
Saccharomyces complex | Post-WGD | Kazachstania naganishii | K. naganishii | CBS 8797 | kana_1 | 20x | 13 | 10.8 | http://www.ncbi.nlm.nih.gov/assembly/GCA_000348985.1/
Saccharomyces complex | Post-WGD | Naumovozyma castellii | N. castellii | CBS 4309 | naca_1 | 20x | 10 | 11.2 | http://www.ncbi.nlm.nih.gov/assembly/GCF_000237345.1/
Saccharomyces complex | Post-WGD | Naumovozyma castellii | N. castellii | NRRL Y-12630 | naca_2 | 3.9x | 10 | 11.2 | http://www.ncbi.nlm.nih.gov/genome/?term=txid27288[orgn]
Saccharomyces complex | Post-WGD | Naumovozyma dairenensis | N. dairenensis | CBS 421 | nada_1 | 20x | 12 | 13.5 | http://www.ncbi.nlm.nih.gov/assembly/GCF_000227115.2
Saccharomyces complex | Post-WGD | Candida glabrata | C. glabrata | CBS138 | cagl_1 | Complete | 13 | 12.3 | http://www.genolevures.org/download.html#.
Saccharomyces complex | Post-WGD | Candida glabrata | C. glabrata | CCTCC M202019 | cagl_2 | 100.0x | 13 | 12.1 | http://www.ncbi.nlm.nih.gov/assembly/GCA_000497105.1/
Saccharomyces complex | Post-WGD | Tetrapisispora phaffii | T. phaffii | CBS 4417 | teph_1 | 20x | 16 | 12.1 | http://www.ncbi.nlm.nih.gov/assembly/GCF_000236905.1/
Saccharomyces complex | Post-WGD | Tetrapisispora blattae | T. blattae | CBS 6284 | tebl_1 | 20x | 10 | 14 | http://www.ncbi.nlm.nih.gov/assembly/GCF_000315915.1/
Saccharomyces complex | Post-WGD | Vanderwaltozyma polyspora | V. polyspora | DSM 70294 | vapo_1 | 7.8x | - | 14.8 | http://www.ncbi.nlm.nih.gov/assembly/GCF_000150035.1/
Saccharomyces complex | Pre-WGD | Zygosaccharomyces bailii | Z. bailii | ISA1307 | zyba_1 | 600x | 154 | 21.1 | http://www.ncbi.nlm.nih.gov/genome
Saccharomyces complex | Pre-WGD | Zygosaccharomyces bailii | Z. bailii | IST302 | zyba_2 | - | 106 | - | -
Saccharomyces complex | Pre-WGD | Zygosaccharomyces bailii | Z. bailii | CLIB 213 | zyba_3 | - | 27 | 10.2 | http://www.ncbi.nlm.nih.gov/genome
Saccharomyces complex | Pre-WGD | Zygosaccharomyces rouxii | Z. rouxii | CBS 732 | zyro_1 | Complete | 7 | 9.8 | http://www.genolevures.org/download.html#.
Saccharomyces complex | Pre-WGD | Torulaspora delbrueckii | T. delbrueckii | CBS 1146 | tode_1 | 20x | 8 | 9.22 | http://www.ncbi.nlm.nih.gov/genome/12254
Saccharomyces complex | Pre-WGD | Lachancea thermotolerans | L. thermotolerans | CBS 6340 | lath_1 | Complete | 8 | 10.6 | http://www.genolevures.org/download.html#.
Saccharomyces complex | Pre-WGD | Lachancea waltii | L. waltii | NCYC 2644 | lawa_1 | 8x | 8 | 10.9 | http://www.ncbi.nlm.nih.gov/genome/?term=txid4914[orgn]
Saccharomyces complex | Pre-WGD | Lachancea kluyveri | L. kluyveri | CBS 3082 | lakl_1 | Complete | 8 | 11.5 | http://www.genolevures.org/download.html#.
Saccharomyces complex | Pre-WGD | Kluyveromyces marxianus | K. marxianus | KCTC 17555 | klma_1 | 280x | 95 | 10.9 | http://www.ncbi.nlm.nih.gov/assembly/GCA_000299195.2#/st
Saccharomyces complex | Pre-WGD | Kluyveromyces lactis | K. lactis | CLIB210 | klla_1 | Complete | 7 | 10.7 | http://www.genolevures.org/download.html#.
Saccharomyces complex | Pre-WGD | Kluyveromyces wickerhamii | K. wickerhamii | UCD 54-210 | klwi_1 | 12x | - | 9.8 | http://www.ncbi.nlm.nih.gov/assembly/GCA_000179415.1/
Saccharomyces complex | Pre-WGD | Kluyveromyces aestuarii | K. aestuarii | ATCC 18862 | klae_1 | 14x | - | 9.9 | http://www.ncbi.nlm.nih.gov/genome/?term=txid33165[orgn]
Saccharomyces complex | Pre-WGD | Eremothecium gossypii | E. gossypii | ATCC 10895 | ergo_1 | Complete | 7 | 9.2 | http://www.genolevures.org/download.html#.
Saccharomyces complex | Pre-WGD | Eremothecium cymbalariae | E. cymbalariae | DBVPG 7215 | ercy_1 | 40x | 8 | 9.6 | http://www.ncbi.nlm.nih.gov/assembly/GCF_000235365.1/
Saccharomyces complex | Pre-WGD | Ashbya aceri | A. aceri | - | asac_1 | - | - | - | -
Saccharomyces complex | Pre-WGD | Hanseniaspora valbyensis | H. valbyensis | NRRL Y-1626 | hava_1 | - | 441 | - | -
CTG complex | Pre-WGD | Candida albicans | C. albicans | SC5314 | caal_1 | 10.4x | 9 | 27.6 | http://www.ncbi.nlm.nih.gov/assembly/GCF_000182965.2
CTG complex | Pre-WGD | Candida albicans | C. albicans | WO-1 | caal_2 | 10x | 15 | 21.7 | http://www.ncbi.nlm.nih.gov/assembly/GCA_000149445.2
CTG complex | Pre-WGD | Candida dubliniensis | C. dubliniensis | CD36 | cadu_1 | 11x | 8 | 14.6 | http://www.ncbi.nlm.nih.gov/assembly/GCF_000026945.1
CTG complex | Pre-WGD | Candida tropicalis | C. tropicalis | MYA-3404 | catr_1 | 10x | 21 | 14.6 | http://www.ebi.ac.uk/
CTG complex | Pre-WGD | Candida parapsilosis | C. parapsilosis | CDC317 | capa_1 | 9.2x | 9 | 13.1 | http://www.ebi.ac.uk/
CTG complex | Pre-WGD | Lodderomyces elongisporus | L. elongisporus | NRRL YB-4239 | loel_1 | 8.7x | 26 | 15.5 | http://www.ebi.ac.uk/
CTG complex | Pre-WGD | Candida orthopsilosis | C. orthopsilosis | Co 90-125 | caor_1 | - | 8 | - | -
CTG complex | Pre-WGD | Candida maltosa | C. maltosa | Xu316 | cama_1 | - | 1308 | - | -
CTG complex | Pre-WGD | Candida tenuis | C. tenuis | ATCC 10573 | cate_1 | - | 19 | - | -
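CTG complex | Pre-WGD | Meyerozyma guilliermondii | M. guilliermondii | ATCC 6260 | megu_1 | - | 9 | - | -
CTG complex | Pre-WGD | Scheffersomyces stipitis | S. stipitis | CBS 6054 | scst_1 | - | 8 | 15.4 | http://www.ebi.ac.uk/
CTG complex | Pre-WGD | Debaryomyces hansenii | D. hansenii | CBS76 | deha_1 | Complete | 7 | 12.2 | http://www.genolevures.org/download.html#.
CTG complex | Pre-WGD | Debaryomyces hansenii | D. hansenii | MTCC 234 | deha_2 | - | 300 | - | -
CTG complex | Pre-WGD | Pichia sorbitophila | P. sorbitophila | CBS 7064 | piso_1 | - | 14 | - | -
CTG complex | Pre-WGD | Spathaspora passalidarum | S. passalidarum | NRRL Y-27907 | sppa_1 | - | 8 | - | -
CTG complex | Pre-WGD | Spathaspora arborariae | S. arborariae | UFMG-19.1A | spar_1 | - | 29 | - | -
CTG complex | Pre-WGD | Clavispora lusitaniae | C. lusitaniae | ATCC 42720 | cllu_1 | 9x | 8 | 12.1 | http://www.ebi.ac.uk/
CTG complex | Pre-WGD | Clavispora lusitaniae | C. lusitaniae | MTCC_1001 | cllu_2 | - | 140 | - | -
CTG complex | Pre-WGD | Metschnikowia bicuspidata | M. bicuspidata | NRRL YB-4993 | mebi_1 | - | 16 | - | -
CTG complex | Pre-WGD | Hyphopichia burtonii | H. burtonii | NRRL Y-1933 | hybu_1 | - | 62 | - | -
Other late divergent families | Pre-WGD | Pachysolen tannophilus | P. tannophilus | NRRL Y-2460 | pata_1 | - | 53 | - | -
Other late divergent families | Pre-WGD | Dekkera bruxellensis | D. bruxellensis | AWRI1499 | debr_1 | - | 295 | - | -
Other late divergent families | Pre-WGD | Dekkera bruxellensis | D. bruxellensis | CBS 2499 | debr_2 | - | 28 | - | -
Other late divergent families | Pre-WGD | Pichia kudriavzevii | P. kudriavzevii | M12 | piku_1 | - | 430 | - | -
Other late divergent families | Pre-WGD | Pichia membranifaciens | P. membranifaciens | NRRL Y-2026 | pime_1 | - | 10 | - | -
Other late divergent families | Pre-WGD | Ascoidea rubescens | A. rubescens | NRRL Y17699 | asru_1 | - | 33 | - | -
Other late divergent families | Pre-WGD | Ogataea parapolymorpha | O. parapolymorpha | DL-1 | ogpa_1 | - | 7 | - | -
Other late divergent families | Pre-WGD | Candida arabinofermentans | C. arabinofermentans | NRRL YB-2248 | caar_1 | - | 41 | - | -
Other late divergent families | Pre-WGD | Candida tanzawaensis | C. tanzawaensis | NRRL Y-17324 | cata_1 | - | 11 | - | -
Other late divergent families | Pre-WGD | Cyberlindnera jadinii | C. jadinii | NBRC 0988 | cyja_1 | - | 12 | - | -
Other late divergent families | Pre-WGD | Cyberlindnera jadinii | C. jadinii | NRRL Y-1542 | cyja_2 | - | 41 | - | -
Other late divergent families | Pre-WGD | Wickerhamomyces anomalus | W. anomalus | NRRL Y-366 | wian_1 | - | 21 | - | -
Other late divergent families | Pre-WGD | Hansenula polymorpha | H. polymorpha | NCYC 495 | hapo_1 | - | 7 | - | -
Other late divergent families | Pre-WGD | Nadsonia fulvescens var. elongata | N. fulvescens | DSM 6958 | nafu_1 | - | 11 | - | -
Other late divergent families | Pre-WGD | Babjeviella inositovora | B. inositovora | NRRL Y-12698 | bain_1 | - | 30 | - | -
Early divergent | Pre-WGD | Komagataella pastoris | K. pastoris | CBS 7435 | kopa_1 | - | 4 | - | -
Early divergent | Pre-WGD | Komagataella pastoris | K. pastoris | DSMZ 70382 | kopa_2 | - | 112 | - | -
Early divergent | Pre-WGD | Komagataella pastoris | K. pastoris | GS115 | kopa_3 | 20x | 4 | 9.2 | -
Early divergent | Pre-WGD | Yarrowia lipolytica | Y. lipolytica | CLIB122 | yali_1 | Complete | 6 | 20.6 | -
Early divergent | Pre-WGD | Lipomyces starkeyi | L. starkeyi | NRRL Y-11557 | list_1 | - | 34 | - | -
Early divergent | Pre-WGD | Candida caseinolytica | C. caseinolytica | Y-17796 | caca_1 | - | 4 | - | -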

3.2. Selection of DHA1 and DAG gene families in 94 Hemiascomycetous Strains.

3.2.1. Extraction of DHA1 and DAG gene sequences from a Local Database

The identification of the members of the DHA1 and DAG families was based on studies published by the BSRG group (Sá-Correia and Tenreiro 2002; Sá-Correia et al. 2009; Dias et al. 2010a; Dias and Sá-Correia 2013, 2014). In the study of Dias and Sá-Correia (2014), twenty-four DHA1 gene clusters encoded in the genomes of twenty-five hemiascomycetous species were identified. The 14-spanner MFS-MDR family members, in turn, were based on the study by Dias and Sá-Correia (2013), in which twenty clusters comprising 355 genes were subdivided into three subfamilies: DHA2, ARN and GEX. In the present study, twenty-three genes from the twenty-four DHA1 clusters were used as starting nodes in a Blastp network traversal approach for the identification of DHA1 family members (Table. 2A). Similarly, twenty genes were used as starting nodes in the same traversal approach for the DAG family (Table. 2B).

Table. 2 Starting node genes used for the identification of DHA1 and DAG family genes in 94 Hemiascomycetes strains. A) DHA1 genes selected for traversing the Blastp network in this work, together with the S. cerevisiae members of each cluster and the respective cluster distribution, as previously determined in Dias et al. (2010). B) DAG genes selected for traversing the Blastp network in this work, together with the S. cerevisiae members of each cluster and the respective cluster distribution, as previously determined in Dias and Sá-Correia (2013).

A) DHA1
Clusters | S. cerevisiae Genes | Starting node Genes
A+B | - | zyro_1_e10230g
C | - | caal_1_19.341
D | - | scst_1_2_b00770
E | DTR1 | sace_1_ybr180w
F | QDR1 | sace_1_yil120w
G | QDR3 | sace_1_ybr043c
H | - | scst_1_3_c07020
I | - | lawa_1_26.7984
J | HOL1 | sace_1_ynr055c
K | - | zyro_1_e02288g
K2 | - | caal_1_19.1582
L | - | yali_1_a00528g
M | - | deha_1_c00506g
N | TPO2/TPO3 | sace_1_ygr138c
N2 | - | vapo_1_2002.86
O | - | zyro_1_e10054g
P | TPO1 | sace_1_yll028w
Q | - | zyro_1_d00286g
R | YHK8 | sace_1_yhr048w
S | TPO4 | sace_1_yor273c
T | FLR1 | sace_1_ybr008c
U | - | saba_1_166_fj00120
V | - | saba_2_253_is00120

B) DAG
Clusters | S. cerevisiae Genes | Starting node Genes
A | - | scst_1_5_e06890
B | SGE1/AZR1/VBA3/VBA5 | sace_1_ypr198w
C | VBA1/VBA2 | sace_1_ybr293w
D | VBA4 | sace_1_ydr119w
E | ATR1/YMR279C | sace_1_yml116w
F | YOR378W | sace_1_yor378w
G | - | megu_1_00085
H | - | naca_1_a05720
I | - | yali_1_c22660g
J | KNQ1 | sace_6_19_s00210
L | - | yali_1_d22913g
K | - | caal_1_19.7336
M | - | yali_1_a10593g
N | - | sace_6_19_s00190
O | ARN4 | sace_1_yol158c
P | ARN3 | sace_1_yel065w
Q | - | megu_1_05184
R | CaARN1 | caal_1_19.2179
S | ARN1/ARN2 | sace_1_ykr106w
T | GEX1/GEX2 | sace_1_yhl047c

In clusters A + B of the DHA1 family, a single gene (zyro_1_e10230g) was used as starting node for traversing the network. The choice of DHA1 references took into account the genes encoded in the model organism S. cerevisiae, with nine of the twelve genes (DTR1, QDR1, QDR3, HOL1, TPO2, TPO1, YHK8, TPO4, FLR1) being assigned to clusters E, F, G, J, N1, P, R, S and T, respectively (Table. 2A). Similarly, the DAG transporters of the same species were used to determine the members of clusters B, C, D, E, F, J, O, P, R, S and T (Table. 2B). For the remaining clusters, genes from other species were selected as starting nodes in the network traversal approach.

3.2.2. Sequence clustering of all DHA1 and DAG Proteins.

Using the local database and a previously constructed R language script, a blast all-against-all pairwise comparison between the 43 reference genes and the LDB was performed. For this purpose, the Blastp algorithm made available in the blast2 package was used (Tatusova and Madden 1999). Blastp was run with gapped alignment and the following parameters: expectation value (30−49), open gap (−1), extend gap (−1), threshold for extending hits (11) and word size (3). This approach generates a large amount of information that requires processing. On a small scale, a manual selection is usually performed; however, in phylogenetic studies involving the analysis of a large number of genomes, it is more adequate to identify homologous proteins by treating the Blastp pairwise hits as a network in which the nodes are the proteins obtained and the edges indicate the existence of pairwise sequence similarity between amino acid sequences. Moreover, traversing the Blastp network at different e-value thresholds allows determining the exact threshold at which all proteins of a given family can be collected.
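
The network view of the Blastp hits can be sketched in a few lines of Python. This is not the R script used in this work, only a minimal illustration of the idea: hits at or below an e-value threshold become edges, and a breadth-first traversal from the starting nodes collects every connected protein. All identifiers beginning with hypo_ and all e-values are hypothetical; only sace_1_ybr008c is a starting node listed in Table. 2.

```python
from collections import defaultdict, deque

def build_network(hits, max_evalue):
    """Undirected graph keeping only hits at or below the e-value threshold."""
    graph = defaultdict(set)
    for query, subject, evalue in hits:
        if query != subject and evalue <= max_evalue:
            graph[query].add(subject)
            graph[subject].add(query)
    return graph

def traverse(graph, start_nodes):
    """Breadth-first traversal collecting every protein reachable from the starting nodes."""
    seen = set(start_nodes)
    queue = deque(start_nodes)
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen

# Hypothetical pairwise hits: (query, subject, e-value).
hits = [
    ("sace_1_ybr008c", "hypo_1_a", 1e-120),
    ("hypo_1_a", "hypo_2_b", 1e-60),
    ("hypo_2_b", "hypo_3_c", 1e-20),
    ("hypo_4_d", "hypo_5_e", 1e-80),
]
strict = traverse(build_network(hits, 1e-50), ["sace_1_ybr008c"])
relaxed = traverse(build_network(hits, 1e-10), ["sace_1_ybr008c"])
```

Relaxing the threshold enlarges the set of proteins collected, while proteins that are never connected to a starting node are not gathered at any threshold; comparing the sets obtained at successive thresholds is what allows the level at which a family is complete to be determined.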

3.3. Topology prediction, multiple alignments, and construction of DHA1 and DAG phylogenetic trees.

3.3.1. Pre-analysis of DHA1 and DHA2 genes.

After the identification of the potential DHA1 and DHA2 genes encoded in the analysed genomes, several approaches were used to confirm the obtained results. To rule out the presence of false positives in the gathered DHA1 and DAG gene sets, the length of the sequences, the topology, the phylogeny and the protein alignment were verified. In the early validation stage, HMMTOP, TMHMM 2.0 and TOPPRED 2 were used for topology prediction, Muscle for multiple alignment, JALVIEW 2.0 for alignment visualization, the PROTDIST/NEIGHBOUR functions of the PHYLIP package for phylogenetic data processing, and the DENDROSCOPE tool for the visualization of phylogenetic trees (Plotree and Plotgram 1989; Von Heijne 1992; Tusnady and Simon 2001; Edgar 2004; Huson et al. 2007; Bernsel et al. 2009; Waterhouse et al. 2009). The genes comprising the DHA1 and DAG families were scrutinized at different depth levels.

In the case of the DAG family, a brief analysis was performed. The number of transmembrane segments and the length of the amino acid sequence were used as initial criteria to decide whether each potential member of this protein family was a bona fide DAG transporter and whether it comprised a full-size protein or a fragment. Importantly, for proteins to be considered in the study, it was established that all amino acid sequences belonging to the DAG family should contain at least 14 transmembrane segments in their topology and 420 residues in length, as followed in Dias et al. (2013). Where a protein did not meet these criteria, the sequence was re-evaluated by visual analysis of the topology probability and protein hydrophobicity plots generated by TMHMM 2.0 and TOPPRED 2, respectively (Krogh et al. 2001; Néron et al. 2009). If, after the re-evaluation, the amino acid sequences still did not meet the prerequisites, they were considered fragments or frameshifts and removed from the dataset. The remaining proteins, which met these requirements, were used for the construction of the phylogenetic tree representing the DAG proteins. The phylogenetic tree visualization tool was then used to identify the cluster membership of the genes in the DAG family.
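The initial screening step for DAG candidates can be illustrated with a short sketch. It only encodes the two thresholds stated above (at least 14 transmembrane segments and 420 residues); the function name, the labels and the example proteins are hypothetical, and the visual re-evaluation of the TMHMM 2.0 and TOPPRED 2 plots is not modelled.

```python
def classify_dag_candidate(n_tms, length, min_tms=14, min_length=420):
    """Return 'full-size' when both thresholds are met, otherwise 're-evaluate'."""
    if n_tms >= min_tms and length >= min_length:
        return "full-size"
    # Candidates failing a threshold go to manual inspection of the plots
    return "re-evaluate"

# Hypothetical candidates: name -> (predicted TMS, sequence length)
candidates = {
    "prot_a": (14, 540),
    "prot_b": (9, 310),
    "prot_c": (14, 400),
}

labels = {
    name: classify_dag_candidate(n_tms, length)
    for name, (n_tms, length) in candidates.items()
}
```

The same function could be applied to the DHA1 candidates by passing the corresponding minimum values as arguments.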

In contrast with the previous analysis, the gathered proteins belonging to the DHA1 family were scrutinized in more detail. As with the DAG family members, a brief evaluation of the amino acid sequence length (minimum of 431 residues) and of the number of TMS (minimum of ten) was carried out for the DHA1 family members, following the same strategy. The possibility of errors arising from genome annotation (frameshifts, fragments, stop codons), together with the impact that an apparent loss of DHA1 genes would have on the evolutionary reconstruction, led us to correct some fragments and frameshifts. To detect and manually correct these errors, the amino acid length, the topology and the clustering of the genes in the phylogenetic tree were used. Several comparisons were also made between the starting node genes in Table.2A and each fragment/frameshift detected.

Initially, a Blastp search was performed with the frameshift/fragment against the SGD or CGD online databases (depending on the evaluated species) to identify the homologs of the fragmented gene in S. cerevisiae or C. albicans. After that, tools such as Clustal X 2.0 and Jalview 2.0, used for fast alignment and visualization, made it possible to detect the exact defect in the amino acid sequence (stop codons, a missing part of the sequence at the C- or N-terminus, or an X generated by the absence of one of the three nucleotides required for translation into an amino acid). After the error identification, one thousand nucleotides upstream and downstream of these erroneous sequences were extracted from the LDB for the corresponding genome strain. It should be taken into account that all errors were identified at the amino acid level, whereas the correction was made at the nucleotide level. Finally, to confirm the correction of the frameshifts/fragments, the nucleotide sequence was translated with the EMBOSS package and the resulting amino acid sequence was subsequently evaluated in terms of topology with the TMHMM, TOPPRED and HMMTOP software (Claros, M.G.; Von Heijne 1994; Rice, Longden and Bleasby 2000; Tusnady and Simon 2001; Chen et al. 2003; Waterhouse et al. 2009).

3.3.2. Final analysis of DHA1 and DAG genes.

Once the pre-analysis was completed, a final validation of the methodologies was necessary. For topology prediction, other transmembrane segment prediction software, namely POLYPHOBIUS, OCTOPUS, PHILIUS, SCAMPI and SPOCTOPUS, as well as the web server TOPCONS, were used with default parameters to complement the initial analyses (Käll, Krogh and Sonnhammer 2007; Bernsel et al. 2008, 2009; Viklund and Elofsson 2008; Viklund et al. 2008). The web server TOPCONS 2.0 predicts TMSs based on five individual programs: OCTOPUS, PHILIUS, POLYPHOBIUS, SPOCTOPUS and SCAMPI. A Hidden Markov Model algorithm combines an arbitrary number of topology predictions into one consensus prediction and quantifies the reliability of the prediction based on the level of agreement between the underlying methods, both at the protein level and at the level of individual TMS regions. In addition, the biological hydrophobicity scale (ΔG-scale) is used to predict the free energy of membrane insertion for a window of 21 amino acids centred on each position in the sequence (Bernsel et al. 2009; Tsirigos et al. 2015).

The HMMTOP and TMHMM software predict the topology of integral membrane proteins using hidden Markov models and evaluate the input sequences one by one, providing an independent prediction for each of them using only single-sequence information (Krogh et al. 2001; Tusnady and Simon 2001). On the other hand, the POLYPHOBIUS and PHILIUS software are based on PHOBIUS, a combined transmembrane protein topology and signal peptide predictor. This predictor is based on a hidden Markov model (HMM) that models the different sequence regions of a signal peptide and the different regions of a transmembrane protein in a series of interconnected states. PHILIUS extends PHOBIUS by exploiting the power of dynamic Bayesian networks, while POLYPHOBIUS additionally uses homology information from the sequences for the final topology prediction (Käll, Krogh and Sonnhammer 2004, 2005, 2007; Reynolds et al. 2008).

OCTOPUS is a method based on neural networks and Hidden Markov Models. To make a prediction, a multiple sequence alignment is first generated by BLAST. Next, the algorithm calculates a position-specific scoring matrix (PSSM) profile and a raw sequence profile, and both profiles are used as input for the neural networks. These neural networks (one for the PSSM profile and one for the raw sequence profile) predict the preference of each residue to be located in a transmembrane helix or not. The outputs of these networks are used as input for a Hidden Markov Model, which generates the final prediction. OCTOPUS only predicts transmembrane helices, whereas SPOCTOPUS can also predict signal peptides. OCTOPUS and SPOCTOPUS share the same underlying algorithm, described above (Viklund and Elofsson 2008; Viklund et al. 2008).

Two other translocon-based prediction tools have been developed to predict the localization of TMS: the ΔG predictor and SCAMPI. These tools are based on the same calculation of the free energy cost of insertion, but they use different algorithms. Moreover, whereas SCAMPI calculates the energy of peptides with a fixed length of 21 amino acids, the ΔG predictor allows length corrections. The TopPred II software predicts membrane protein structure beginning with the construction of a hydrophobicity profile, which serves to identify ‘certain’ and ‘putative’ transmembrane segments (Claros, M.G.; Von Heijne 1994; Hessa et al. 2007; Bernsel et al. 2008).
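To illustrate what a sliding-window hydrophobicity profile is, a minimal sketch is given below. It is not the TopPred II or ΔG predictor implementation: as an assumption made only for illustration, it uses the Kyte-Doolittle hydropathy scale and a plain average over a 21-residue window, and the function name and the test sequence are hypothetical.

```python
# Kyte-Doolittle hydropathy values (swapped in for illustration; the tools
# cited above rely on their own scales and scoring schemes)
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(seq, window=21):
    """Average hydropathy of each full window; one value per window centre.

    Raises KeyError for residues outside the 20 standard amino acids.
    """
    half = window // 2
    scores = [KD[aa] for aa in seq]
    profile = []
    for centre in range(half, len(seq) - half):
        segment = scores[centre - half:centre + half + 1]
        profile.append(sum(segment) / window)
    return profile

# Hypothetical sequence: a 21-residue leucine stretch flanked by charged residues
profile = hydropathy_profile("DDDDD" + "L" * 21 + "KKKKK")
peak = profile.index(max(profile))  # window fully covering the hydrophobic stretch
```

A peak in such a profile marks a candidate transmembrane segment; sequences shorter than the window yield an empty profile.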

The Muscle and Jalview software were used to perform the multiple alignments and to confirm the occurrence of highly hydrophobic portions in the protein sequences (Edgar 2004; Waterhouse et al. 2009). In addition, the hydrophobicity associated with the topology of each sequence was verified with the TOPPRED II software (Claros, M.G.; Von Heijne 1994).

In addition to the PROTDIST/NEIGHBOUR functions of the PHYLIP package, two other methodologies were used for phylogenetic tree construction (Plotree and Plotgram 1989). First, the PHYML 3.0 software (Guindon et al. 2010), based on the maximum-likelihood principle, was used together with the approximate likelihood-ratio test (aLRT), as well as a more classical non-parametric bootstrap method, for tree validation (Felsenstein 1985; Anisimova and Gascuel 2006). Second, MrBayes 3.2 was used, which bases the Bayesian inference of phylogeny on the posterior probabilities of phylogenetic trees. In terms of validation, the MrBayes method relies on the probability of the bipartition at each node (Ronquist et al. 2012). It is important to note that the Bayesian method was used only to confirm and validate the individual phylogenetic tree of each cluster and the general tree comprising all clusters of both the DAG and DHA1 families. For this reason, the phylogenetic trees obtained with this method were not edited or included in the results. To produce the final images of the phylogenetic trees generated by the PHYML 3.0 software, Dendroscope and FigTree v1.4.2 were chosen; both are graphical viewers of phylogenetic trees used for producing publication-ready figures (Rambaut and Drummond 2010).

3.4. Construction and analyses of syntenic block diagrams representing genome regions of 33 Hemiascomycetous strains.

In order to clarify and reinforce the evolutionary history of the DHA1 family, another approach was implemented, based on the analysis of synteny blocks. This tool provides new information that is complementary to the similarity analyses. The gene neighbourhood analysis assumes that certain genes in different species have a number of neighbour genes in common, so that when homology or orthology exists between the studied genes in different species, their neighbours will also be homologous or orthologous. In summary, synteny is assessed taking into account sets of neighbour genes between each species or strain.

Given the enormous quantity of information obtained with this approach for each cluster, the number of species studied needed to be constrained. Out of a total of ninety-four strains initially used in this work, only thirty-three were retained. These thirty-three strains all belong to the Saccharomycetaceae family, which made possible a more compact construction of each lineage.

In order to achieve this goal, a previously built R programming language script calling the “sqldf” library was used to retrieve fifteen neighbour genes on each side of the query genes, as well as the corresponding sequence clustering classification, from the LDB (Rice, Longden and Bleasby 2000). The collected data was provided in tabular format in two files, one comprising the protein family of each gene and the other comprising the acronym of each gene. The sequence clustering classification of the fifteen genes neighbouring each query gene was done using a restrictive blastp e-value of E-50, to limit the number of false positive sequences incorporated together with true cluster members. When further evidence was needed to corroborate dubious synteny connections between genes, sequence clustering was executed at a less conservative e-value threshold (E-40). After collecting the blast information from the LDB in tabular format, two display formats could be obtained. In the first, a previously built script was used to colour the tabular version and thus obtain an intuitive synteny table with colours associated with each protein family (it is important to note that all neighbours classified as belonging to the same protein family, in different or in the same species, had the same colour). The second option aimed to transform the tabular data into a network. For this, the data had to be formatted for import into the Cytoscape 3.0 environment, a viewing platform for interaction networks (Lotia et al. 2013).

After that, additional information, such as the blastp e-values between the sequences, the length of the gene sequences, the number of common neighbours shared between two genes, the associated protein family number and the number of elements in each family, was imported into the Cytoscape 3.0 environment. These parameters were extremely important for correctly assessing the strength of the connection between each pair of neighbours and for ruling out dubious situations such as false synteny. The position of each common neighbour was also important to confirm and reinforce the results.
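The neighbourhood comparison can be sketched as follows. This is an illustrative Python example and not the R/sqldf script used in this work; it assumes that each chromosome is available as an ordered list of protein family labels, and the function names, the labels and the small flank used in the example are hypothetical.

```python
def neighbours(chromosome, index, flank=15):
    """Family labels of up to `flank` genes on each side of the query gene."""
    left = chromosome[max(0, index - flank):index]
    right = chromosome[index + 1:index + 1 + flank]
    return left + right

def shared_families(chrom_a, idx_a, chrom_b, idx_b, flank=15):
    """Protein families found in the neighbourhood of both query genes."""
    return set(neighbours(chrom_a, idx_a, flank)) & set(neighbours(chrom_b, idx_b, flank))

# Hypothetical chromosomes, written as ordered family labels around a query gene
chrom_a = ["f1", "f2", "query", "f3", "f4"]
chrom_b = ["f9", "f2", "query", "f4", "f8"]

common = shared_families(chrom_a, 2, chrom_b, 2, flank=2)

# A gene at the chromosome end has fewer neighbours than 2 * flank
edge = neighbours(chrom_a, 0, flank=2)
```

The number of shared families gives a simple measure of the strength of a synteny connection, and a neighbourhood shorter than twice the flank size indicates a gene lying close to a chromosome or contig end.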

Another important piece of information extracted from the synteny block approach was the chromosomal localization, particularly where there are errors (frameshifts, fragments, stop codons) associated with genes. With this strategy, it was possible to determine whether or not the genes were subtelomeric or truncated. After the correction of all truncated members, each gene with fewer than 15 neighbours in the block diagram was considered subtelomeric.

After the determination of the evolutionary lineage in each cluster of genes, block diagrams were constructed in Microsoft VISIO 2013 and GIMP, programs used to build diagrams and to edit images (Kimball and Mattis 2015).

3.5. Identification of positive and purifying selection over the amino acid sequences of the FLR1p homolog proteins during the evolution of Saccharomycetaceae yeasts.

Whole Genome Duplication and Horizontal Gene Transfer events regularly arouse interest in the driving forces behind them and in the specific evolutionary points at which they occur. A protein alignment sequence conservation score and methods for detecting positive and purifying selection were used to improve our understanding of the evolutionary forces acting on the FLR1 homolog genes encoded in the genomes of yeast species belonging to the Saccharomycetaceae family. Three distinct selection methodologies implemented in the HyPhy 2.2 package were used, namely “Fast, Unconstrained Bayesian AppRoximation” (FUBAR) (Murrell et al. 2013), the “mixed effects model of evolution” (MEME) (Murrell et al. 2012) and “fixed effects likelihood” (FEL) (Kosakovsky Pond et al. 2005), together with the “Bayesian Integral Log-odds” (BILD) score for conservation (Altschul et al. 2010).

In order to perform the analysis, the dataset had to be formatted. After the collection of the DNA sequence of each gene, the TranslatorX tool was used to construct the multiple alignment of the DNA sequences, using the corresponding protein sequences to guide the alignment software. This tool was crucial for the detection and correction of genomic sequence errors, since stop codons are not allowed by the MEME and FUBAR software. The phylogenetic tree required by FEL and MEME was provided by the PHYML 3.0 software.
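The idea of a protein-guided nucleotide alignment can be sketched as follows. This is a simplified illustration in Python and not the TranslatorX implementation; it assumes that the protein alignment is already available and that each coding sequence is in frame, and the function names and example sequences are hypothetical. The stop codon check uses the three standard stop codons.

```python
def thread_codons(aligned_protein, cds, gap="-"):
    """Map an aligned protein back onto its coding sequence, codon by codon."""
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    n_residues = sum(1 for residue in aligned_protein if residue != gap)
    if n_residues > len(codons):
        raise ValueError("coding sequence is shorter than the protein")
    out = []
    k = 0
    for residue in aligned_protein:
        if residue == gap:
            out.append(gap * 3)  # one alignment gap becomes three nucleotide gaps
        else:
            out.append(codons[k])
            k += 1
    return "".join(out)

STOP_CODONS = {"TAA", "TAG", "TGA"}

def internal_stops(cds):
    """Codon indices of in-frame stop codons, ignoring the terminal codon."""
    codons = [cds[i:i + 3].upper() for i in range(0, len(cds) - 2, 3)]
    return [i for i, codon in enumerate(codons[:-1]) if codon in STOP_CODONS]

# Hypothetical example: protein alignment "M-K" for the coding sequence ATGAAA
codon_alignment = thread_codons("M-K", "ATGAAA")
flagged = internal_stops("ATGTAAAAATGA")
```

Sequences flagged by such a check would need correction before being passed to methods that do not accept internal stop codons.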


FUBAR (Fast, Unconstrained Bayesian AppRoximation) is a codon-based Bayesian method that allows dN/dS (ω) to vary from codon to codon across a gene according to a number of predefined site classes given a priori. This allows codons to be tested for positive (dN/dS > 1) or purifying selection (dN/dS < 1) (Murrell et al. 2013). While FUBAR posits that, e.g., positive selection remains constant throughout time (affecting most lineages in a phylogenetic tree), the MEME (Mixed Effects Model of Evolution) test allows the distribution of ω to vary over sites and also from branch to branch, which makes it possible to detect episodic selection (Murrell et al. 2012). In the FEL (Fixed Effects Likelihood) models, in turn, the parameters are inferred independently for each site.

For purifying and neutral selection, the FUBAR method was used as reference. For positive selection, the MEME and FUBAR methods were adopted. When further evidence was needed to corroborate dubious cases, the FEL analysis method was used. Significance was assessed by posterior probability > 0.9 (FUBAR) and P-value < 0.1 (MEME and FEL). The results of the different methods were compared: in MEME, different Empirical Bayes Factor thresholds (0, 5 and >20) were tested in order to define the level of positive selection at each threshold; in FUBAR, purifying, positive and neutral selection were considered when p(α>β) > 0.5, p(α<β) > 0.5 and α=β, respectively.
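The significance thresholds stated above can be expressed as a small decision rule. The sketch below is only an illustration: the way the FUBAR and MEME evidence is combined (either method being sufficient for a positive call) is an assumption made for the example, and the function name and input values are hypothetical.

```python
def classify_site(p_purifying, p_positive, meme_pvalue,
                  min_posterior=0.9, max_pvalue=0.1):
    """Label one codon site from per-site statistics.

    p_purifying: FUBAR posterior probability that alpha > beta
    p_positive:  FUBAR posterior probability that alpha < beta
    meme_pvalue: MEME p-value for episodic positive selection
    """
    if p_purifying > min_posterior:
        return "purifying"
    # Assumption: support from either method is enough for a positive call
    if p_positive > min_posterior or meme_pvalue < max_pvalue:
        return "positive"
    return "not significant"

# Hypothetical per-site values: (p_purifying, p_positive, meme_pvalue)
sites = [(0.95, 0.01, 0.80), (0.02, 0.93, 0.50), (0.30, 0.40, 0.05), (0.50, 0.40, 0.60)]
calls = [classify_site(*site) for site in sites]
```

Sites called by only one method are the dubious cases for which an additional FEL analysis would be consulted.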

Along with the amino acid alignment, the BILD conservation score was determined for each position. This Bayesian method applies the sum of scores derived from the probabilities of individual observations, and it is usually used to extend to multiple alignments the residue substitution scores defined for pairwise alignments (Altschul et al. 2010; Khenoussi et al. 2014). The program was run on a local computer, and a bash script was built for the extraction of results.


4. Results

4.1. Identification of the DHA1 and DAG transporters encoded in the genomes of hemiascomycetes yeasts.

The traversal of the local blastp network compiled by the BSRG group allowed the identification of the DHA1 and DAG genes encoded in the genomes of Saccharomycetaceae yeast strains.

The published studies on the reconstruction of the evolutionary history of the DHA1 and DAG genes allowed proposing 24 and 20 phylogenetic clusters, respectively (Dias and Sá-Correia 2013, 2014). Twenty-three DHA1 and twenty DAG proteins identified in these published studies were used to represent each of these phylogenetic clusters. These reference proteins were used as starting nodes for multiple traversals of blastp network. The results obtained with this approach allowed the construction of a bar plot shown in Figure. 5, representing the number of gathered DHA1 and DAG proteins in each different e-value level analysed in this work.

The constraining of the blastp network at different e-value levels, ranging from E-49 to E-40 allowed gathering the DHA1 and DAG genes encoded in the hemiascomycetes genomes analysed in this study. The blastp network constrained at an e-value of E-40 retrieves more than 30000 proteins, indicating the existence of many false positives gathered together with the bona fide DHA1 transporters. Interestingly, at this e-value level, the DHA1 and the DAG proteins share an amino acid similarity and, therefore, are jointly retrieved. As we further constrain the blastp network using an e-value of E-42, the HOL1 homolog proteins are not gathered together with the remaining DHA1 transporters (Figure. 5).

The blastp networks constrained at e-values ranging from E-48 to E-42 allowed the identification of 2050 DHA1 transporters, corresponding to those comprised in the majority of the phylogenetic clusters, 377 corresponding to the HOL1 homolog proteins (cluster J) and 26 corresponding to another reference gene (scst_1_3_c07020) from Scheffersomyces stipitis (cluster H) (Figure. 5).

[Figure. 5: bar plot. Y-axis: number of amino acid sequences (0 to 35000). X-axis: e-value (E-49, E-48, E-47, E-46, E-44, E-42, E-40). Legend: Cluster HOL1; DHA1 (11 Clusters); TOTAL Nº DHA1; TOTAL Nº DAG.]

Figure. 5 Identification of the MFS-MDR DHA1 and DAG genes encoded in 94 yeast strains. Plot representing the number of sequences retrieved after constraining and traversing the pairwise similarity network at different e-values using 43 reference genes as starting nodes.

This in silico identification strategy led to the identification of 2453 (2050+377+26) and 5086 (3885+414+68+719) potential bona fide DHA1 and DAG transporters, respectively, encoded in the 171 hemiascomycetes genomes analysed in this study. By traversing the LDB network, the DHA1 and DAG subfamilies encoded in 94 and 78 Saccharomycetaceae strains, respectively, were identified. For gene cluster identification, a total of 43 reference genes were used, 23 for the DHA1 and 20 for the DAG subfamilies.

After the removal of proteins encoded in the genomes of the model organisms Homo sapiens, Mus musculus, Arabidopsis thaliana, Drosophila melanogaster and Caenorhabditis elegans, comprised in the Genome Database but not necessary to this study, we retained only 1466 DHA1 and 1706 DAG proteins encoded in 94 hemiascomycetes yeast strains of interest for further analysis.

4.2. A brief analysis of the DAG transporters encoded in the genomes of 78 hemiascomycetes strains.

Although the DAG proteins are not the main focus of this study, the gathered amino acid sequences were used to construct a global phylogenetic tree representing the DAG transporters. However, in the analysis of these proteins, only 78 yeast genomes were considered, corresponding to 63 different hemiascomycetes species (36 correspond to yeast species classified in the Saccharomycetaceae family, 18 correspond to yeast species classified in Debaryomycetaceae family, 18 correspond to yeast species classified in other Hemiascomycetes taxonomic families and 6 correspond to early-divergent hemiascomycetes yeasts) (Table.3).

The analysis of the topology prediction results, the aligned amino acid sequences and protein length allowed excluding 715 false positives from the initially identified 1706 potential DAG transporters. Of the remaining 991 bona fide DAG proteins, 818 were full-size 14-spanner transporters and 173 were fragments.

Based on a published study (Dias et al., 2013), the full-size DAG transporters were classified into 20 phylogenetic clusters, labelled from A to T (Figure. 6). Two different statistical methodologies for building phylogenetic trees, distances/neighbour-joining and maximum likelihood (ML), produced similar cluster composition and topology regarding the individual DAG transporters and, therefore, only the ML phylogenetic tree is shown (Figure. 6).

The DAG gene family was proposed to sub-divide into the DHA2, ARN, and GEX subfamilies. The DHA2 subfamily comprises the S. cerevisiae SGE1, AZR1, VBA3, VBA5, VBA1, VBA2, VBA4, ATR1 and AMF1 genes and a still uncharacterized ORF YMR279C. The ARN subfamily comprises the S. cerevisiae ARN1, ARN2, ARN3, ARN4 genes and the biochemically characterized Candida albicans ARN1 gene. The GEX subfamily comprises the S. cerevisiae GEX1 and GEX2 genes.

The total number of DAG proteins encoded in each hemiascomycetous genome was determined and used to calculate the average number of these transporters encoded in the genomes of the distinct taxonomic families/groups considered in this study.


Figure. 6 Phylogenetic analysis of DHA2, ARN and GEX transporters gathered from 78 yeast strains and subdivided into twenty clusters. The circular cladogram shows the tree topology. The PhyML 3.0 software was used in the analysis. The different colours correspond to each cluster of the previous synteny study (Dias and Sá-Correia 2013).

The Saccharomyces complex yeast species encode, on average, 10.54 DAG proteins in their genomes. The yeast species of the CTG complex, of the other late-divergent families and of the early-divergent families encode on average 10.9, 10.8 and 8 DAG proteins in their genome sequences, respectively. The last group, the early-divergent families, shows a discrepancy in the number of members, since Y. lipolytica has 19 members while the remaining species of this group have only 4 to 8 members. This group also showed the lowest average number of DAG proteins per cluster (0.4). Regarding the number of clusters per phylogenetic group, a consistent pattern is observed, with the first three groups showing DAG members in 14, 13 and 14 clusters, respectively, while the early-divergent group only comprises members in eight clusters (Table.3).
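The averages quoted above follow directly from the subtotals in Table.3, as the short sketch below shows. The function names are hypothetical, and the strain counts (37 for the Saccharomyces complex and 20 for the CTG complex) are obtained here by counting the strain rows listed under each subtotal in Table.3, which is an assumption about how the averages were computed.

```python
N_CLUSTERS = 20  # DAG clusters A to T

def mean_per_genome(total_proteins, n_strains):
    """Average number of full-size DAG proteins per genome."""
    return total_proteins / n_strains

def mean_per_cluster(total_proteins, n_strains, n_clusters=N_CLUSTERS):
    """Average number of full-size DAG proteins per genome and per cluster."""
    return total_proteins / (n_strains * n_clusters)

# Subtotals from Table.3: 390 proteins (Saccharomyces complex), 218 (CTG complex)
sc_mean = mean_per_genome(390, 37)        # about 10.54
ctg_mean = mean_per_genome(218, 20)       # 10.9
sc_per_cluster = mean_per_cluster(390, 37)  # about 0.53

# Single strain: sace_1 has 16 full-size proteins over 20 clusters
sace_1_per_cluster = mean_per_cluster(16, 1)  # 0.8
```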

The DHA2 subfamily comprises the major part of the DAG proteins and clusters, with 564 of the 818 full-size DAG transporters. A more detailed analysis of the clusters shows that cluster B (SGE1/AZR1/VBA3/VBA5) is the most populated, with 145 members, while clusters G and A have the smallest number of proteins, with 3 and 1 members, respectively. Most of the cluster B members are spread across the Saccharomyces and CTG complexes, which indicates the importance of these genes for the species of these groups. Interestingly, a few species, such as S. stipitis, L. thermotolerans and L. kluyveri, encode a high number of cluster B members in their genome sequences, with 8, 6 and 5 genes, respectively. Apart from that, cluster D, comprising the VBA4 orthologs, presents a dispersion of genes similar to that of cluster B. In contrast to clusters B and D, cluster C (VBA1/VBA2) comprises a high number of genes (66) in the Saccharomyces complex and a small number (8) in the CTG complex. This fact, together with the sequential gap of genes before and after the CTG species in which they were found, suggests a possible horizontal gene transfer from another hemiascomycetous group to the CTG group, followed by a sequential loss of genes in some species. Interestingly, in the early-divergent group, only Komagataella species have members of cluster C (Table.3). Regarding cluster H, only five genes are identified in Table.3. The fact that these genes follow a sequential taxonomic order and reside only in the Saccharomycetaceae family suggests a potential HGT to cluster H. Interestingly, the absence of genes in the early species of the Saccharomyces complex and in the Saccharomycodaceae species H. valbyensis, in clusters E and F, also allowed the identification of a new potential HGT to clusters E and F.

The GEX transporters comprise a total of thirteen members in all 78 yeast strains, being predominantly encoded in the genomes of the Saccharomyces sensu stricto group. In the remaining species, members of this cluster S only appear in L. thermotolerans (two members) and in L. waltii, K. marxianus and K. lactis (one member each). This fact reinforces the idea that the glutathione exchangers in Saccharomyces sensu stricto originated from a horizontal gene transfer, as proposed in Dias et al. (2013). In the remaining groups of species, no genes of cluster S were found (Table.3).

The ARN subfamily comprises a total of 241 members. The seven clusters of this subfamily showed a total absence of genes in the genus Eremothecium. On the other hand, the homologs of the S. cerevisiae ARN1 and ARN2 genes are predominantly found in Lachancea and Kluyveromyces species. In addition, this cluster T (ARN1/ARN2) showed isolated cases of genes in the other late-divergent group of species and did not present any gene in the CTG complex or in the early-divergent groups. Interestingly, cluster O (ARN4) only contains homolog genes in the Saccharomyces sensu stricto group and in the Torulaspora species. The absence of genes in at least ten sequential species suggests a potential HGT to cluster O. Within the ARN subfamily, ARN3 is the gene with the widest dispersion across all groups, with 40, 19, 10 and 4 genes in the Saccharomyces complex, the CTG complex, the other late families and the early-divergent families, respectively. The ARN subfamily comprises four other clusters of genes, M, N, Q and R, which showed a low number and a poor distribution of genes across the four groups of species analysed here (Table.3). Interestingly, cluster R, comprising the CaARN1 homologs, does not show any gene in the Saccharomyces complex or in the early-divergent groups. In contrast, cluster Q only comprises genes in the CTG complex.

From another point of view, Dias et al. (2013) proposed, from similarity and neighbourhood data, horizontal gene transfers to cluster B (SGE1 and VBA3/VBA5 homologs), cluster C (VBA2 homologs), cluster F (YOR378W homologs), cluster J (KNQ1 homologs), cluster N, cluster S (GEX1/GEX2 homologs) and cluster T (ARN2 homologs). With our results, it was only possible to corroborate directly the cases of horizontal gene transfer inside the Saccharomycetaceae family, in clusters F, J, N and S. In the case of cluster J (KNQ1), it is also possible to suggest a second potential lateral gene transfer, from other late-divergent families to the Saccharomycetaceae family, due to the total absence of members belonging to this cluster in the Debaryomycetaceae family. Regarding cluster B, the T. phaffii and T. blattae species, located in the WGD zone, were added, and both species showed an absence of members in this cluster, which indirectly reinforces the previously proposed HGT. For cluster T, it was not possible to revalidate the HGT proposed in 2010; however, it is possible to propose a new potential HGT from the other late-divergent species to the Saccharomyces complex species, since the early-divergent and CTG complex species do not contain any gene of cluster T (Table.3).


Table. 3 Number of full-size DAG proteins in each cluster for a specific yeast strain. Additional information regarding a total number of full-size transporters and their average percentage of sequence identity and similarity is shown in the bottom lines.

Subfamilies: DHA2, ARN, GEX.
S. cerevisiae genes per cluster: B (SGE1/AZR1/VBA3/VBA5), C (VBA1/VBA2), D (VBA4), E (ATR1/YMR279C), F (YOR378W), J (Knq1), O (Arn4), P (Arn3), R (CaARN1), T (ARN1/ARN2), S (GEX1/GEX2).
Columns: strain acronym; number of full-size proteins in each cluster; total of full-size proteins; average of genes per cluster.

Strain         A  B  C  D  E  F  G  H  I  J  K  L  M  N  O  P  Q  R  T  S  Total   Avg

Saccharomyces complex, Post-WGD
sace_1         0  4  2  1  2  1  0  0  0  0  0  0  0  0  1  1  0  0  2  2     16  0.80
sace_6         0  3  2  1  2  1  0  0  0  1  0  0  0  1  0  1  0  0  2  1     15  0.75
sapa_1         0  3  2  1  2  1  0  0  0  1  0  0  0  1  1  1  0  0  1  1     15  0.75
sapa_4         0  3  2  1  2  1  0  0  0  1  0  0  0  1  1  1  0  0  1  1     15  0.75
sapa_26        0  4  2  0  2  1  0  0  0  1  0  0  0  1  1  1  0  0  1  0     14  0.70
sami_1         0  3  2  1  2  1  0  0  0  1  0  0  0  1  1  1  0  0  1  1     15  0.75
saku_1         0  3  2  1  2  1  0  0  0  1  0  0  0  1  0  1  0  0  1  1     14  0.70
saar_1         0  2  2  1  2  1  0  0  0  1  0  0  0  1  1  1  0  0  2  1     15  0.75
saba_1         0  0  2  0  2  1  0  0  0  0  0  0  0  1  2  1  0  0  0  0      9  0.45
saba_2         0  0  1  0  2  1  0  0  0  1  0  0  0  1  0  1  0  0  1  0      8  0.40
sauv_1         0  0  2  1  2  1  0  0  0  1  0  0  0  1  1  1  0  0  1  0     11  0.55
kaaf_1         0  0  0  0  3  0  0  1  0  0  0  0  0  0  0  2  0  0  1  0      7  0.35
kana_1         0  1  1  0  2  0  0  1  0  0  0  0  0  0  0  0  0  0  1  0      6  0.30
naca_1         0  1  1  0  3  0  0  1  0  0  0  0  0  0  0  2  0  0  1  0      9  0.45
naca_2         0  1  1  0  3  0  0  1  0  0  0  0  0  0  0  2  0  0  1  0      9  0.45
nada_1         0  0  1  0  2  0  0  1  0  0  0  0  0  0  0  2  0  0  1  0      7  0.35
cagl_1         0  1  1  0  2  0  1  0  0  0  0  0  0  0  0  0  0  0  1  0      6  0.30
cagl_2         0  1  1  0  2  0  1  0  0  0  0  0  0  0  0  0  0  0  1  0      6  0.30
teph_1         0  0  1  1  1  0  0  0  0  0  0  0  0  0  0  3  0  0  1  0      7  0.35
tebl_1         0  0  0  1  3  0  0  0  0  0  0  0  0  0  0  0  0  0  2  0      6  0.30
vapo_1         0  0  0  1  1  0  0  0  0  0  0  0  0  0  0  4  0  0  0  0      6  0.30

Saccharomyces complex, Pre-WGD
zyba_1         0  3  7  2  0  2  0  0  0  0  0  0  0  1  0  2  0  0  4  0     21  1.05
zyba_2         0  2  4  1  0  1  0  0  0  0  0  0  0  1  0  1  0  0  2  0     12  0.60
zyba_3         0  2  4  1  0  1  0  0  0  0  0  0  0  1  0  1  0  0  2  0     12  0.60
zyro_1         0  0  4  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0      6  0.30
tode_1         0  1  2  1  2  0  0  0  0  1  0  0  0  1  1  2  0  0  1  0     12  0.60
lath_1         0  6  3  1  1  1  0  0  0  1  0  0  0  0  0  2  0  0  5  2     22  1.10
lawa_1         0  2  3  2  1  1  0  0  0  1  0  0  0  0  0  2  0  0  3  1     16  0.80
lakl_1         0  5  2  3  1  0  0  0  0  1  0  0  0  0  0  2  0  0  4  0     18  0.90
klma_1         0  0  2  2  0  1  0  0  0  2  0  0  1  1  0  1  0  0  4  1     15  0.75
klla_1         0  2  1  2  0  1  0  0  0  1  0  0  0  0  0  0  0  0  4  1     12  0.60
klwi_1         0  1  1  1  0  0  0  0  0  1  0  0  0  1  0  1  0  0  6  0     12  0.60
klae_1         0  2  2  2  0  0  0  0  0  2  0  0  0  0  0  0  0  0  1  0      9  0.45
ergo_1         0  0  1  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0      2  0.10
ercy_1         0  0  1  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0      2  0.10
asac_1         0  0  1  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0      2  0.10
hava_1         0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0      1  0.05
Subtotal S.C   0 57 66 33 49 19  2  5  0 19  0  0  1 16 10 40  0  0 60 13    390  0.53

CTG complex, Pre-WGD
caal_1         0  3  0  1  1  1  0  0  0  0  2  0  0  0  0  0  0  1  0  0      9  0.45
caal_2         0  3  0  1  1  0  0  0  0  0  2  0  0  0  0  0  0  1  0  0      8  0.40
cadu_1         0  3  0  1  1  1  0  0  0  0  2  0  0  0  0  0  0  1  0  0      9  0.45
catr_1         0  4  0  0  1  1  0  0  0  0  2  0  0  0  0  1  0  1  0  0     10  0.50
capa_1         0  3  0  1  0  1  0  0  0  0  1  0  0  0  0  3  0  2  0  0     11  0.55
loel_1         0  2  0  1  0  1  0  0  0  0  1  0  0  0  0  0  0  1  0  0      6  0.30
caor_1         0  2  0  1  0  1  0  0  0  0  1  0  0  0  0  1  0  1  0  0      7  0.35
cama_1         0  3  0  1  1  1  0  0  0  0  2  0  0  0  0  1  0  1  0  0     10  0.50
cate_1         0  3  2  2  1  1  0  0  0  0  4  0  1  1  0  1  0  0  0  0     16  0.80
megu_1         0  3  1  1  1  2  1  0  0  0  2  0  0  3  0  2  2  0  0  0     18  0.90
scst_1         1  8  1  1  1  2  0  0  0  0  3  0  0  0  0  1  0  1  0  0     19  0.95
deha_1         0  2  1  1  1  1  0  0  0  0  2  0  1  3  0  1  0  0  0  0     13  0.65
deha_2         0  2  1  1  1  1  0  0  0  0  2  0  3  4  0  1  0  0  0  0     16  0.80
piso_1         0  2  0  2  2  2  0  0  0  0  2  0  0  1  0  0  2  0  0  0     13  0.65
sppa_1         0  3  0  1  0  1  0  0  0  0  1  0  0  0  0  0  0  1  0  0      7  0.35
spar_1         0  2  0  1  0  1  0  0  0  0  1  0  0  0  0  3  0  1  0  0      9  0.45
cllu_1         0  4  0  1  1  0  0  0  0  0  1  0  0  0  0  1  0  0  0  0      8  0.40
cllu_2         0  4  0  1  1  0  0  0  0  0  1  0  0  0  0  1  0  0  0  0      8  0.40
mebi_1         0  1  0  1  1  0  0  0  0  0  1  0  0  0  0  1  0  0  0  0      5  0.25
hybu_1         0  3  2  2  1  1  0  0  0  0  4  0  1  1  0  1  0  0  0  0     16  0.80
Subtotal CTG   1 60  8 22 16 19  1  0  0  0 37  0  6 13  0 19  4 12  0  0    218  0.55

Other late-divergent families
pata_1         0  2  2  1  0  0  0  0  1  1  1  0  0  0  0  0  0  0  4  0     12  0.60
debr_1         0  4  2  1  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0      8  0.40
debr_2         0  5  2  2  0  0  0  0  1  2  0  0  0  0  0  0  0  0  0  0     12  0.60
piku_1         0  1  1  2  0  0  0  0  0  3  0  0  0  0  0  0  0  0  0  0      7  0.35
pime_1         0  0  1  2  0  0  0  0  1  4  2  0  0  0  0  0  0  0  0  0     10  0.50
asru_1         0  0  0  0  0  0  0  0  1  0  0  0  2  0  0  1  0  0  1  0      5  0.25
ogpa_1         0  1  2  2  0  1  0  0  2  4  1  0  1  2  0  2  0  0  0  0     18  0.90
caar_1         0  2  1  1  0  0  0  0  2  0  2  0  0  0  0  1  0  0  0  0      9  0.45
cata_1         0  3  0  1  1  1  0  0  0  0  2  0  0  0  0  1  0  1  0  0     10  0.50
cyja_1         0  1  2  2  1  0  0  0  0  0  0  0  1  2  0  0  0  0  2  0     11  0.55
cyja_2

WGD 0 1 2 2 1 0 0 0 0 1 0 0 1 2 0 0 0 0 1 0 11 0,55 -

wian_1 0 2 3 2 1 1 0 0 0 4 4 0 1 2 0 1 0 0 1 0 22 1,10 Pre

Other late Other hapo_1 0 1 2 2 0 1 0 0 2 2 1 0 1 2 0 2 0 0 0 0 16 0,80 nafu_1 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 3 0,15 bain_1 0 1 1 1 2 0 0 0 2 0 0 0 0 0 0 1 0 0 0 0 8 0,40 Subtotal 0 25 21 21 6 4 0 0 12 22 13 1 7 10 0 10 0 1 9 0 162 0,54 Early-div. kopa_1 0 1 1 2 0 0 0 0 1 0 0 0 2 0 0 1 0 0 0 0 8 0,40

kopa_2 0 0 1 2 0 0 0 0 1 0 0 0 2 0 0 1 0 0 0 0 7 0,35 kopa_3 0 1 1 2 0 0 0 0 1 0 0 0 2 0 0 1 0 0 0 0 8 0,40

yali_1 0 0 0 1 0 0 0 0 2 0 1 2 13 0 0 0 0 0 0 0 19 0,95 divergent - list_1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 2 0,10

caca_1 0 0 0 0 0 0 0 0 1 0 0 2 0 0 0 1 0 0 0 0 4 0,20 Early Subtotal 0 3 3 7 0 0 0 0 7 0 1 4 19 0 0 4 0 0 0 0 48 0,40 Early-div. Total 1 145 98 83 71 42 3 5 19 41 51 5 33 39 10 73 4 13 69 13 818 0,52 45

4.3. Analysis of the DHA1 transporters in the Hemiascomycetes.

The main focus of this master’s thesis is the characterization of the DHA1 transporters in the yeasts commonly known as the Hemiascomycetes (subphylum Saccharomycotina). This goal required an exhaustive validation of the 1466 amino acid sequences gathered using the blastp network traversal approach, in order to decide whether each sequence corresponded to a bona fide full-size DHA1 protein. The validation of problematic translated ORFs was based on sequence analysis at the DNA level, on 2D-topology prediction software and on blast queries against the corresponding yeast genomes.

Later, the evolution of the DHA1 genes encoded in the genomes of the Saccharomycetaceae will be reconstructed by combining classical phylogenetic methodologies, gene neighbourhood analysis and protein evolution models, and the results obtained will be compared with the published literature (Dias et al. 2010a; Dias and Sá-Correia 2014). The FLR1 gene, the member of the DHA1 family with the most complex evolutionary history, will receive special attention, not only to understand why the results on its evolution are so heterogeneous but also to exemplify the state-of-the-art methodologies adopted in this study.

4.3.1. Correction of sequencing and annotation errors in the initial DHA1 protein dataset.

In a first stage, 1466 potential DHA1 transporters were gathered using the blastp network traversal approach described previously. An initial analysis of these amino acid sequences shows that, on average, the potential DHA1 transporters have a size of 574 amino acids, ranging from 431 to 800 amino acids in length. Besides the size of each protein, the analysis of these amino acid sequences using 2D-topology prediction software and the inspection of the global multiple alignment of the gathered protein set confirmed that 1310 were bona fide DHA1 proteins. The amino acid sequences of the remaining 156 translated ORFs presented a range of problems and were subjected to a more refined analysis at the DNA sequence level. These problems may be sequencing artefacts that affect the assembly of the DNA reads and introduce errors in the final genome sequence, or they may reflect true biological events leading, for instance, to gene non-functionalization.

The analysis of these 156 amino acid sequences led to the decision not to merge 66 translated ORFs with the bona fide DHA1 amino acid sequences. Three types of reasons justified this decision. First, the amino acid sequences of 42 translated ORFs contained stop signals, making it impossible to distinguish genes carrying multiple sequencing errors from the mutational pattern typically observed in pseudogenes. In addition, many of these 42 translated ORFs were encoded in yeast strains lacking a phylogenetically close species whose genome had been sequenced to high quality. In those cases, no genome sequence was available to serve as a reference for sequence correction, even in the particular cases where a low-quality sequencing scenario was the more plausible one and would have justified correcting the amino acid sequence of the corresponding translated ORF. Second, 16 of these 66 amino acid sequences were concluded not to encode a DHA1 protein and, third, 8 were considered to correspond to true pseudogenes. In the example shown in Figure. 7, the analysis of the amino acid sequences of two fragments, ORFs sace_14_7_g00290 and sace_14_7_g00300, identified as encoding the N-terminal and C-terminal portions, respectively, of a cluster J protein (homolog of S. cerevisiae Hol1p), showed the existence of stop codons. This fact, together with the absence of 21 nucleotides in the middle of the coding sequence, suggests that this potential DHA1 gene is non-functional, being a recently formed pseudogene.

On the other hand, the analysis of these 156 amino acid sequences led to the decision to correct the amino acid sequences of 90 translated ORFs and to merge them back into the group of bona fide DHA1 transporters. At the root of the sequencing errors observed in these amino acid sequences were either i) the insertion/deletion of a number of nucleotides that is not a multiple of 3, resulting in frameshifts and aberrant amino acid sequence translation, or ii) the substitution of an in-frame nucleotide, resulting in unexpected premature stop codons that shorten the amino acid sequence.
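The two error types described above can be illustrated with a minimal Python sketch. The function names `causes_frameshift` and `premature_stops` are hypothetical helpers introduced here for illustration only, and the check assumes the standard stop codons (TAA, TAG, TGA); it is not the procedure used in this work.

```python
from typing import List

# Standard stop codons (assumption: standard nuclear genetic code).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def causes_frameshift(indel_length: int) -> bool:
    """An insertion/deletion shifts the reading frame unless its length is a multiple of 3."""
    return indel_length % 3 != 0


def premature_stops(orf_dna: str) -> List[int]:
    """Return the 0-based codon indices of in-frame stop codons located before the last codon."""
    seq = orf_dna.upper()
    n_codons = len(seq) // 3
    codons = [seq[i * 3:i * 3 + 3] for i in range(n_codons)]
    # The final codon is excluded because a stop codon is expected there.
    return [i for i, codon in enumerate(codons[:-1]) if codon in STOP_CODONS]


if __name__ == "__main__":
    print(causes_frameshift(2))                # indel of 2 nt: frame is shifted
    print(causes_frameshift(21))               # indel of 21 nt: frame is kept
    print(premature_stops("ATGAAATAAGGGTGA"))  # internal TAA at codon index 2
```

Under this check, a deletion of 21 nucleotides, as in the HOL1 fragments of strain RM11-1a, keeps the reading frame and removes seven amino acids, whereas an indel of one or two nucleotides does not.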

Based on the genome sequencing and annotation errors observed and on the corrective strategy required to recover these 156 amino acid sequences, the corresponding translated ORFs were classified into four different groups (Table. 4).

Table. 4 Number and type of genomic errors found in the 1467 DHA1 genes.

Genomic Errors                             Nº of Strains   Nº of Corrected Errors   Nº of not corrected Errors
Double gene                                17              36                       0
Incomplete Sequence / Truncated Scaffold   15              13                       8
Frameshift (N's)                           11              24                       0
Frameshift (Stop Codons)                   36              19                       37
Total                                      79              90                       45

First, the amino acid sequences of 13 of these 90 translated ORFs were concluded to be DHA1 fragments resulting from errors of the YGAP genome annotation software. Gathering the missing portions of these amino acid sequences led to the recovery of full-size DHA1 proteins (Figure. 8). In some cases, the correction of the amino acid sequence was not possible due to the absence of DNA sequencing data (cases where the translated ORF was truncated).

Second, the amino acid sequences of 24 of these 90 translated ORFs were concluded to be frame-shifted due to the erroneous incorporation of spurious unknown nucleotides (represented by “N”) during the assembly of genome sequencing reads into contigs. The removal of these “N’s” led to the recovery of full-size DHA1 proteins. Figure. 9 illustrates this type of sequencing error, using as an example the amino acid sequence of the translated ORF sace_56_4_d04160, a member of the phylogenetic cluster N1 (comprising the TPO2/TPO3 homologs). The removal of the string of “N’s”, whose translation results in the string of unknown amino acids (“X’s”) identified by the red circle in this figure, was sufficient to correct the amino acid sequence of the translated ORF sace_56_4_d04160.
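The effect of a run of “N’s” on translation, and of its removal, can be sketched as follows. This is only an illustration under stated assumptions: the standard genetic code is used (appropriate for S. cerevisiae but not for CTG-clade yeasts), the toy DNA string is invented, and the helpers `translate` and `strip_n_runs` are hypothetical names. In practice each removal would still need to be checked against a reference sequence.

```python
import re
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: standard code).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate in frame 0; codons containing unknown bases become 'X'."""
    seq = dna.upper()
    usable = len(seq) - len(seq) % 3  # ignore a trailing incomplete codon
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3))


def strip_n_runs(dna: str) -> str:
    """Delete every run of unknown nucleotides ('N') from the sequence."""
    return re.sub(r"N+", "", dna.upper())


if __name__ == "__main__":
    toy = "ATGGCT" + "NNNN" + "GAATGGTAA"   # invented example with a 4-nt run of N's
    print(translate(toy))                   # MAXXMV  (X's, then a shifted frame)
    print(translate(strip_n_runs(toy)))     # MAEW*   (frame restored)
```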


Third, the amino acid sequences of 17 of these 90 translated ORFs were concluded to be frame-shifted due to the erroneous incorporation of stop codons during the assembly of genome sequencing reads into contigs. The correction of these wrongly assigned nucleotides led to the recovery of full-size DHA1 proteins.

Fourth, the amino acid sequences of 36 of these 90 translated ORFs were concluded to have been wrongly split into two separate translated ORFs residing next to each other, an error resulting (again) from the YGAP genome annotation software. Merging the two translated ORFs into a single one led to the recovery of full-size DHA1 proteins. Figure. 10 illustrates this type of error, using as an example the amino acid sequence of the translated ORF sace_2_7_g01505, a member of the phylogenetic cluster G (comprising the QDR3 homologs). In this case, a string of “N’s” (frameshift) or the substitution of an in-frame nucleotide introducing a stop codon in the middle of the sequence causes the artificial division of a full-size DHA1 gene into two fragments.

Figure. 7 Example of a biological pseudogene genomic error. Protein alignment of the representative S. cerevisiae HOL1 gene product against two gene fragments (ORFs sace_14_7_g00290 and sace_14_7_g00300) of the S. cerevisiae RM11-1a strain, with N's in the middle of the sequence. The green circle marks the gap of seven amino acids in the sequence. The MultAlin software was used in the analysis (Corpet 1988). Blue indicates a high consensus value (>50%); black indicates an absent or low consensus value.


Figure. 8 Example of the correction of an incomplete/truncated sequence genomic error. Protein alignment of the representative S. cerevisiae YHK8 gene product against one fragment (ORF sace_18_12_930) of the S. cerevisiae DBVPG1106 strain. The MultAlin software was used in the analysis (Corpet 1988). Blue indicates a high consensus value (>50%); black indicates an absent or low consensus value.

Figure. 9 Example of the correction of a frameshift genomic error. Protein alignment of the representative S. cerevisiae TPO3 gene product against one gene fragment (ORF sace_56_4_d04160) of the S. cerevisiae EC1118 strain, with N's in the middle of the sequence. The red circle marks the N's translated into X's. The MultAlin software was used in the analysis (Corpet 1988). Blue indicates a high consensus value (>50%); black indicates an absent or low consensus value.


Figure. 10 Example of the correction of a double gene genomic error. Protein alignment of the representative S. cerevisiae QDR3 gene product against two fragments (ORFs sace_2_7_g01500 and sace_2_7_g01510) of the S. cerevisiae AWRI796 strain. The MultAlin software was used in the analysis (Corpet 1988). Blue indicates a high consensus value (>50%); black indicates an absent or low consensus value.

4.3.2. Analysis of the 2D-topology of the DHA1 transporters.

The previous analyses gathered and confirmed a total of 1382 bona fide DHA1 proteins. This set of proteins comprises the twelve DHA1 members of S. cerevisiae, which are characterized in terms of topology as having two transmembrane domains, each consisting of six transmembrane segments interconnected by extracellular and intracellular loops (Sá-Correia et al. 2009; Redhu, Shah and Prasad 2016). In order to complement the preliminary topology analyses, and to predict more accurately the number of transmembrane segments (TMS) in the collected DHA1 proteins, three additional programs, PHOBIUS, SCAMPI and TOPPRED II (GES and KD scales), were used alongside the two used previously (HMMTOP and TMHMM). Interestingly, all of these programs except PHOBIUS are single-sequence based (Bernsel et al. 2008). In parallel, the number of TMS was also assessed with the TOPCONS 2.0 web server, which predicts TMS from the consensus of five individual programs: OCTOPUS, PHILIUS, POLYPHOBIUS, SPOCTOPUS and SCAMPI.

Taking into account the software described above, two approaches were considered to validate the true number of TMS in the 1382 DHA1 proteins. First, the number of TMS was predicted for these proteins with five individual programs, HMMTOP, TMHMM, PHOBIUS, SCAMPI and TOPPRED II (GES and KD scales), and the results were averaged to predict the final number of TMS in each protein. Second, the TOPCONS 2.0 web server was used, which provides for each protein a consensus topology derived from the five programs listed above. When further evidence was needed to resolve dubious cases, the output of each individual program of the TOPCONS 2.0 web server was inspected.
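The averaging step of the first approach can be sketched as below. The per-predictor counts in the example are invented for illustration, the helper name `average_tms` is hypothetical, and rounding the average half-up to an integer is an assumption, since the rounding rule is not stated in the text.

```python
import math
from statistics import mean
from typing import Dict


def average_tms(predictions: Dict[str, int]) -> int:
    """Average the TMS counts of several predictors and round half-up to an integer."""
    avg = mean(predictions.values())
    return math.floor(avg + 0.5)


if __name__ == "__main__":
    # Invented counts for a single protein; not taken from the thesis data.
    example = {
        "HMMTOP": 12,
        "TMHMM": 12,
        "PHOBIUS": 12,
        "SCAMPI": 11,
        "TOPPRED II (GES)": 12,
        "TOPPRED II (KD)": 13,
    }
    print(average_tms(example))  # 12
```

Averaging tolerates a single predictor that misses or adds one segment, which is why dubious cases were still inspected predictor by predictor.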


The first approach revealed a consensus prediction of twelve transmembrane segments for 1284 DHA1 proteins. Interestingly, this set of proteins contained 11 of the twelve DHA1 members of S. cerevisiae. The second approach, applied to the same 1284 DHA1 proteins, also showed a consistent prediction of twelve TMS. In agreement with the literature, these proteins have two transmembrane domains, each consisting of six transmembrane segments interconnected by extracellular and intracellular loops, as shown in Figure. 11.

Figure. 11 An example of the topology of the S. cerevisiae DTR1 gene product. Twelve TMS and the start/end localization of the protein are shown. The prediction was made with the transmembrane topology predictor Phobius (Käll, Krogh and Sonnhammer 2007).

The analysis of the remaining 98 proteins revealed a controversial case. First of all, it is important to consider that all of these proteins show high similarity to the HOL1 gene product of S. cerevisiae. Moreover, at a later stage the phylogenetic analyses assigned these 98 proteins to cluster J, for which the HOL1 gene is the reference. The first approach to topology determination showed an average of eleven TMS for each protein. This topology of eleven predicted membrane-spanning segments was supported by HMMTOP, TMHMM, PHOBIUS, SCAMPI and TOPPRED II (GES and KD scales). Curiously, according to these programs, the architecture of these 98 DHA1 proteins places the N-terminus on the extracellular side of the membrane, which contradicts the intracellular start considered in the literature for the DHA1 transporters (Figure. 12). Consistent with the prediction of eleven TMS, the hydrophobicity profile from TOPPRED II, in both the GES and KD scales, shows a small peak of hydrophobicity between the third and fourth transmembrane segments, which is probably the missing twelfth TMS (Figure. 13).
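A hydrophobicity profile of the kind discussed above can be approximated with a sliding-window average over the Kyte-Doolittle (KD) scale. The sketch below is a simplification and not a reimplementation of TOPPRED II: it uses a plain rectangular window instead of the program's own window shape, the window length and cutoff are left as parameters to be chosen by the user, and the helper names `hydropathy_profile` and `candidate_segments` are hypothetical.

```python
from typing import List, Tuple

# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KD_SCALE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein: str, window: int = 19) -> List[float]:
    """Mean KD hydropathy of every full window along the sequence."""
    scores = [KD_SCALE[aa] for aa in protein.upper()]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


def candidate_segments(profile: List[float], cutoff: float) -> List[Tuple[int, int]]:
    """Contiguous runs of window positions whose mean hydropathy is >= cutoff."""
    segments, start = [], None
    for i, value in enumerate(profile):
        if value >= cutoff and start is None:
            start = i
        elif value < cutoff and start is not None:
            segments.append((start, i - 1))
            start = None
    if start is not None:
        segments.append((start, len(profile) - 1))
    return segments


if __name__ == "__main__":
    # Invented toy sequence: two hydrophobic stretches separated by charged residues.
    toy = "K" * 10 + "L" * 20 + "K" * 10 + "I" * 20 + "K" * 10
    profile = hydropathy_profile(toy, window=19)
    print(candidate_segments(profile, cutoff=1.6))
```

With two cutoffs, as in the ‘Upper Cutoff’ and ‘Lower Cutoff’ of TOPPRED II, a peak that passes only the lower one would be reported as a putative rather than a certain segment, which is the situation described for the small peak between the third and fourth TMS.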

On the other hand, the consensus of the predictors in the TOPCONS 2.0 web server showed most of these 98 proteins as containing twelve TMS. The disagreement between the two approaches led us to perform a more detailed analysis within the second approach. This analysis consisted of evaluating, at the individual level, each of the TMS predictors used by TOPCONS 2.0. As expected, the majority of these programs (OCTOPUS, PHILIUS, POLYPHOBIUS and SPOCTOPUS) predicted 12 TMS for each of the 98 proteins. In contrast, and coherent with the result of the first approach, SCAMPI and the ΔG-scale predicted 11 TMS in this set of proteins.


Considering both approaches, the prediction of twelve TMS for the 1284 proteins analysed in this study is undisputed. In contrast, the case of the 98 proteins belonging to cluster J (homologs of S. cerevisiae HOL1) raised some doubts regarding the final topology. However, considering the total number of predictors used in the analyses, six out of ten pointed to a topology of eleven TMS in these proteins, which suggests that the HOL1 gene product and its homologs in other species actually comprise eleven transmembrane segments (Figure. 12).
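The six-out-of-ten tally can be reproduced with a simple vote count. The dictionary below summarizes, per predictor, the topology reported in the text for the cluster J proteins as a whole; the outputs for individual proteins may differ, and the helper name `majority_topology` is hypothetical.

```python
from collections import Counter
from typing import Dict, Tuple

# Predicted number of TMS for the cluster J (HOL1-like) proteins,
# summarized per predictor from the description in the text.
CLUSTER_J_PREDICTIONS = {
    "HMMTOP": 11,
    "TMHMM": 11,
    "PHOBIUS": 11,
    "SCAMPI": 11,
    "TOPPRED II (GES)": 11,
    "TOPPRED II (KD)": 11,
    "OCTOPUS": 12,
    "PHILIUS": 12,
    "POLYPHOBIUS": 12,
    "SPOCTOPUS": 12,
}


def majority_topology(predictions: Dict[str, int]) -> Tuple[int, int]:
    """Return the most frequently predicted TMS count and its number of votes."""
    counts = Counter(predictions.values())
    return counts.most_common(1)[0]


if __name__ == "__main__":
    tms, votes = majority_topology(CLUSTER_J_PREDICTIONS)
    print(f"{tms} TMS supported by {votes} of {len(CLUSTER_J_PREDICTIONS)} predictors")
```

A plain vote treats all predictors as independent and equally reliable, which is a simplification: several of these programs share methodology, so a 6-to-4 split is suggestive rather than conclusive.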

Figure. 13 Hydrophobicity profile of the S. cerevisiae HOL1 gene product obtained with the default parameters of the TOPPRED II software. This profile serves to identify ‘certain’ and ‘putative’ transmembrane segments according to the ‘Upper Cutoff’ (green line) and ‘Lower Cutoff’ (red line) parameters.

Figure. 12 Example of the topology of the S. cerevisiae HOL1 gene product. Eleven TMS and the start/end localization of the protein are shown. The prediction was made with the transmembrane topology predictor Phobius (Käll, Krogh and Sonnhammer 2007).


4.4. Global phylogenetic analysis of the DHA1 transporters encoded in the genomes of hemiascomycetes yeast

With the goal of extending the published results regarding the evolution of the DHA1 gene family in the subphylum Saccharomycotina (Dias et al. 2010a; Dias and Sá-Correia 2014), the 1382 bona fide full-size DHA1 proteins were used to construct a phylogenetic tree representing the diversity of these transporters in 94 hemiascomycetes strains, corresponding to 63 different yeast species. These yeast strains are divided in the following way: 53 strains belong to the Saccharomyces complex, 20 strains to the CTG complex, 15 strains to other taxonomic families, and 6 strains are early-divergent hemiascomycetes species residing near the root of the phylogenetic tree representing the subphylum Saccharomycotina, whose taxonomic classification is still unclear. Two different methods were used to construct the global phylogenetic tree representing the DHA1 transporters: the protdist/neighbour-joining distance-based functions made available by the PHYLIP software suite (Plotree and Plotgram 1989) and the maximum likelihood statistical approach made available by the PhyML software suite (Guindon et al. 2010). Two different criteria were used to assign cluster membership to each DHA1 protein: i) the distances separating the proteins in the phylogenetic tree and ii) the consistency of the cluster membership assigned by each of the two methodologies. The analysis of the gathered results showed that the phylogenetic trees calculated using the distance and maximum likelihood methodologies were highly consistent regarding cluster composition and, therefore, only the tree obtained using the PhyML software is shown (Figure. 14).

Figure. 14 Phylogenetic analysis of the DHA1 transporters gathered from 94 strains of 63 hemiascomycetous species using the PhyML 3.0 software. The circular cladogram shows the 26 phylogenetic clusters, identified by capital letters and by the names of their S. cerevisiae members.


The construction of these phylogenetic trees allowed the identification of 26 clusters. The PhyML software assigned the translated ORF saba_2_253_is00120, a member of cluster UV, as the root of the phylogenetic tree. In the absence of a molecular clock assumption and when outgroup sequences are not included in the input sequence dataset, PhyML tries to construct the “best” unrooted version of the phylogenetic tree and then selects the amino acid sequence that has the highest likelihood of being the “true” root of the tree (i.e. it selects as root the amino acid sequence that maximizes the likelihood function). In the distance-based phylogenetic tree, the root was assigned to the translated ORF pata_1_6_f00170, encoded in the genome of Pachysolen tannophilus, an early-divergent yeast species. Consistent with the published literature (Dias and Sá-Correia 2014), the analysis of the phylogenetic tree showed the existence of highly populated clusters, with more than 60 members each (clusters F, P, T, G, E, J, N1, R and S), medium-populated clusters, comprising between 20 and 60 members (clusters AB, C, D, H, I, K1, K2, N2 and Q), and poorly populated clusters, with fewer than 10 members (L, M, O, UV, X and Y).

The Needleman-Wunsch algorithm made available by the needle function of the EMBOSS software suite was used to calculate the all-against-all pairwise amino acid sequence identity and similarity between the members of each phylogenetic cluster. This analysis showed that, on average, the DHA1 proteins share 41.3% to 72.0% sequence identity and 57.8% to 81.0% sequence similarity.
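The principle behind the pairwise identity values can be illustrated with a minimal Needleman-Wunsch implementation. This sketch deliberately swaps in a much simpler scoring scheme (match +1, mismatch -1, linear gap penalty -1) than the substitution matrix and affine gap penalties used by EMBOSS needle, so its numbers are not comparable with those reported above. Identity is computed here over the full alignment length, and the function names are hypothetical.

```python
from typing import Tuple


def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1,
                     gap: int = -1) -> Tuple[str, str]:
    """Global alignment of two sequences with a linear gap penalty."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Traceback from the bottom-right corner to recover one optimal alignment.
    aligned_a, aligned_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        else:
            diag = None
        if diag is not None and score[i][j] == diag:
            aligned_a.append(a[i - 1]); aligned_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_a.append(a[i - 1]); aligned_b.append("-"); i -= 1
        else:
            aligned_a.append("-"); aligned_b.append(b[j - 1]); j -= 1
    return "".join(reversed(aligned_a)), "".join(reversed(aligned_b))


def percent_identity(a: str, b: str) -> float:
    """Identical aligned positions as a percentage of the alignment length."""
    x, y = needleman_wunsch(a, b)
    if not x:
        return 0.0
    matches = sum(1 for p, q in zip(x, y) if p == q and p != "-")
    return 100.0 * matches / len(x)


if __name__ == "__main__":
    print(percent_identity("MKTLLV", "MKTLLV"))
    print(percent_identity("MKTLLV", "MKTLV"))
```

An all-against-all comparison within a cluster would call `percent_identity` for every pair of members and report the range or mean of the resulting values.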

4.4.1. Phylogenetic analysis of the DHA1 transporters encoded in species of the Saccharomyces complex.

The 2010 and 2014 papers (Dias et al. 2010a; Dias and Sá-Correia 2014) focusing on the reconstruction of the evolution of this gene family used a total of 13 and 25 yeast genomes, respectively. For the 20 yeast species classified in the Saccharomyces complex, the 2014 paper reported that the DHA1 proteins divided into 18 phylogenetic clusters. Compared with the previously published literature on the DHA1 proteins, this master’s thesis has an important advantage, since the number of yeast species and strains analysed increased to 63 and 94, respectively. The increase in the number of yeast genomes considered and, consequently, in the number of DHA1 transporters improves the power and resolution of the phylogenetic analysis of these proteins.

The phylogenetic analysis of the 654 full-size DHA1 proteins found encoded in the 53 yeast strains classified in the Saccharomyces complex recovered two additional clusters not observed in the published 2010 and 2014 papers. Overall, on average, the yeast species classified in the Saccharomyces complex were found to encode 12.4 DHA1 proteins per genome and 0.47 DHA1 proteins per phylogenetic cluster (Table. 5). The DHA1 genes residing in phylogenetic clusters not previously reported in the Saccharomycetaceae yeasts are explained by the increase in the power and resolution of the phylogenetic analysis mentioned above. Although DHA1 proteins residing in the phylogenetic cluster D are not found encoded in the genomes of K. lactis and K. marxianus, K. wickerhamii and K. aestuarii encode 1 and 2 DHA1 proteins residing in this phylogenetic cluster, respectively. Similarly, K. marxianus and K. aestuarii were each found to encode one member of the phylogenetic cluster K2, a genetic feature not observed in the genome of any other yeast species of the Kluyveromyces genus. In fact, with the above exceptions, members of clusters D and K2 are not found encoded in the yeast species of the Saccharomycetaceae family under analysis in this study. On the other hand, DHA1 proteins residing in these two phylogenetic clusters are abundant in the genomes of yeast species belonging to the CTG complex as well as in some other hemiascomycetous taxonomic families (Table. 5).
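The per-strain summary columns of Table. 5 can be recomputed from the cluster counts as sketched below. The divisor of 26 clusters is an inference: it matches the 26 clusters identified in section 4.4 and reproduces the tabulated per-strain averages (for example 12 proteins giving 0.46), but the text does not state it explicitly. The helper name `summarize_strain` is hypothetical.

```python
from typing import Dict, Tuple

# Cluster counts for one row of Table. 5 (sace_1); clusters with zero members are omitted.
SACE_1 = {"E": 1, "F": 3, "G": 1, "J": 1, "N1": 2, "P": 1, "R": 1, "S": 1, "T": 1}


def summarize_strain(cluster_counts: Dict[str, int],
                     n_clusters: int = 26) -> Tuple[int, float]:
    """Return the total number of full-size proteins and the average per cluster."""
    total = sum(cluster_counts.values())
    # n_clusters = 26 is an assumption inferred from the tabulated averages.
    return total, total / n_clusters


if __name__ == "__main__":
    total, average = summarize_strain(SACE_1)
    print(total, round(average, 2))  # 12 0.46
```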

Previous literature reported that the genome of the S. cerevisiae S288c reference strain encodes 12 DHA1 transporters. However, with the accumulation of genome sequences from strains of this yeast species, it is possible to scrutinize whether the DHA1 gene number is a stable genetic feature or whether it varies depending, for instance, on the ecological niche from which the S. cerevisiae strain was isolated or on its geographic origin. The analysis of the 18 different S. cerevisiae strains considered in this study supports the latter scenario, with the RM11-1a, vin13 and Y12 strains showing variation in the DHA1 genes encoded in their genomes. Strain RM11-1a (sace_14), a fermentative grape must strain, encodes only 11 full-size DHA1 proteins, since the HOL1 gene has accumulated stop codons and is considered a pseudogene. Strain vin13 (sace_49), a commercial wine strain, also encodes 11 DHA1 proteins, lacking the TPO2 gene. Although the S. cerevisiae Y12 strain (sace_37), a palm wine strain, encodes 12 DHA1 proteins, the phylogenetic analysis showed that the TPO1 gene is missing from its genome sequence and that an extra copy of the TPO4 gene has been acquired by this yeast strain, possibly through horizontal gene transfer (Table. 5).

Compared with the 2010 and 2014 papers, the total number of DHA1 proteins identified in this master’s thesis increased for some of the yeast strains analysed. This was observed in the following hemiascomycetes species: i) S. mikatae, where one additional DHA1 protein, a homolog of the S. cerevisiae AQR1 gene, was identified; ii) S. kudriavzevii, where seven additional DHA1 proteins residing in clusters E, F, G, J, N1 and S were identified; iii) N. castellii NRRL Y-12630, in whose genome sequence three additional proteins, members of clusters G, J and T, were identified; iv) K. lactis, where one additional DHA1 protein, a member of the phylogenetic cluster F, was identified.

The analysis of Table. 5 revealed a number of strong discrepancies in the DHA1 gene number of yeast species classified in the same taxonomic family. The hemiascomycetes yeasts classified in the Kazachstania genus show a total difference of 4 DHA1 proteins encoded in their genomes. Although K. africana and K. naganishii show variations in the number of members of several phylogenetic clusters, the most important discrepancy concerns the number of cluster T members, with K. africana and K. naganishii encoding 5 and 1 homologs of the S. cerevisiae FLR1 gene, respectively. The increase in the cluster T gene number in K. africana is unexpected because the yeast species classified in the closest taxonomic genera (i.e. Saccharomyces, Kazachstania and Naumovozyma) encode only one member of this phylogenetic cluster. The hemiascomycetes yeasts classified in the Tetrapisispora genus show a total difference of five DHA1 proteins encoded in their genomes. Although the Tetrapisispora genomes show variations in the number of members of several phylogenetic clusters, the most important discrepancies concern the number of cluster F and cluster P members (homologs of the S. cerevisiae QDR1/QDR2/AQR1 and TPO1 genes, respectively). While the T. blattae genome encodes 4 cluster F members, the T. phaffii genome encodes only 2 members of this phylogenetic cluster. Regarding cluster P, the T. blattae and T. phaffii genome sequences encode 5 and 2 members, respectively. Consistent with the high DHA1 gene number reported for Z. rouxii in the 2010 paper (Dias et al. 2010a), totalling 21 genes, the two strains of the related yeast species Z. bailii considered in this master’s thesis were found to encode 21 and 18 DHA1 proteins in their genomes. The discrepancy in the DHA1 gene number between the two Z. bailii strains derives from the type strain CLIB 213 differing by one member in each of the phylogenetic clusters F, P and T when compared with strain IST302. As expected, considering that the Zygosaccharomyces ISA1307 strain is a Z. bailii hybrid strain, the number of DHA1 genes encoded in its polyploid genome, a total of 39, is higher than the number encoded in the genome of the type Z. bailii strain. Overall, on average, the yeast species classified in the Saccharomyces complex were found to encode 12.4 DHA1 proteins per genome (Table. 5).

In section 4.5 of the results, the genes found in the genomes of 33 strains of the Saccharomycetaceae taxonomic family will be analysed in detail. Although this subset comprises the same species, restricting the analysis to a reduced number of species yields new results.


Table. 5 Number of full-size DHA1 proteins in each cluster for each yeast strain. Additional information regarding the total number of full-size transporters and their average percentage of sequence identity and similarity is shown in the bottom lines.

Acronym AB C D E F G H I J K1 K2 L M N1 N2 O P Q R S T UV Y X Total 2014 Average
S. cerevisiae genes by cluster: E (DTR1); F (QDR1, QDR2, AQR1); G (QDR3); J (HOL1); N1 (TPO2, TPO3); P (TPO1); R (YHK8); S (TPO4); T (FLR1).
Total = total of full-size proteins; 2014 = number reported in the 2014 article (- when not available); Average = average of genes per cluster.

Saccharomyces complex, post-whole-genome duplication
sace_1 0 0 0 1 3 1 0 0 1 0 0 0 0 2 0 0 1 0 1 1 1 0 0 0 12 12 0.46
sace_2 0 0 0 1 3 1 0 0 1 0 0 0 0 2 0 0 1 0 1 1 1 0 0 0 12 - 0.46
sace_14 0 0 0 1 3 1 0 0 0 0 0 0 0 2 0 0 1 0 1 1 1 0 0 0 11 - 0.42
sace_7 0 0 0 1 3 1 0 0 1 0 0 0 0 2 0 0 1 0 1 1 1 0 0 0 12 - 0.46
sace_18 0 0 0 1 3 1 0 0 1 0 0 0 0 2 0 0 1 0 1 1 1 0 0 0 12 - 0.46
sace_19 0 0 0 1 3 1 0 0 1 0 0 0 0 2 0 0 1 0 1 1 1 0 0 0 12 - 0.46
sace_21 0 0 0 1 3 1 0 0 1 0 0 0 0 2 0 0 1 0 1 1 1 0 0 0 12 - 0.46
sace_22 0 0 0 1 3 1 0 0 1 0 0 0 0 2 0 0 1 0 1 1 1 0 0 0 12 - 0.46
sace_23 0 0 0 1 3 1 0 0 1 0 0 0 0 2 0 0 1 0 1 1 1 0 0 0 12 - 0.46
sace_25 0 0 0 1 3 1 0 0 1 0 0 0 0 2 0 0 1 0 1 1 1 0 0 0 12 - 0.46
sace_26 0 0 0 1 3 1 0 0 1 0 0 0 0 2 0 0 1 0 1 1 1 0 0 0 12 - 0.46
sace_27 0 0 0 1 3 1 0 0 1 0 0 0 0 2 0 0 1 0 1 1 1 0 0 0 12 - 0.46
sace_32 0 0 0 1 3 1 0 0 1 0 0 0 0 2 0 0 0 0 1 2 1 0 0 0 12 - 0.46
sace_37 0 0 0 1 3 1 0 0 1 0 0 0 0 2 0 0 1 0 1 1 1 0 0 0 12 - 0.46
sace_39 0 0 0 1 3 1 0 0 1 0 0 0 0 2 0 0 1 0 1 1 1 0 0 0 12 - 0.46
sace_49 0 0 0 1 3 1 0 0 1 0 0 0 0 1 0 0 1 0 1 1 1 0 0 0 11 - 0.42
sace_50 0 0 0 1 3 1 0 0 1 0 0 0 0 2 0 0 1 0 1 1 1 0 0 0 12 - 0.46
sace_56 0 0 0 1 3 1 0 0 1 0 0 0 0 2 0 0 1 0 1 1 1 0 0 0 12 - 0.46
sapa_1 0 0 0 1 3 1 0 0 1 0 0 0 0 2 0 0 1 0 1 1 1 0 0 0 12 - 0.46
sapa_4 0 0 0 1 3 1 0 0 1 0 0 0 0 2 0 0 1 0 1 1 1 0 0 0 12 11 0.46
sapa_26 0 0 0 1 3 1 0 0 1 0 0 0 0 1 0 0 1 0 1 1 1 0 0 0 11 - 0.42
sami_1 0 0 0 1 3 1 0 0 1 0 0 0 0 2 0 0 1 0 1 1 1 0 0 0 12 11 0.46
saku_1 0 0 0 1 2 1 0 0 1 0 0 0 0 2 0 0 1 0 1 1 2 0 0 0 12 5 0.46
saar_1 0 0 0 1 2 1 0 0 1 0 0 0 0 2 0 0 1 0 1 1 1 0 0 0 11 - 0.42
saba_1 0 0 0 1 2 1 0 0 1 0 0 0 0 1 0 0 1 0 1 1 1 1 0 0 11 11 0.42
saba_2 0 0 0 1 2 1 0 0 1 0 0 0 0 1 0 0 0 0 1 1 1 1 0 0 10 12 0.38
sauv_1 0 0 0 1 3 1 0 0 1 0 0 0 0 2 0 0 1 0 1 1 1 1 0 0 13 - 0.50
kaaf_1 0 0 0 1 2 0 0 0 0 0 0 0 0 1 0 0 2 0 1 1 5 0 0 0 13 - 0.50
kana_1 0 0 0 1 3 0 0 0 0 0 0 0 0 1 0 0 1 0 1 1 1 0 0 0 9 - 0.35
naca_1 0 0 0 1 3 0 0 0 0 0 0 0 0 2 0 0 1 0 1 1 1 0 0 0 10 - 0.38
naca_2 0 0 0 1 4 0 0 0 0 0 0 0 0 2 0 0 1 0 1 2 1 0 0 0 12 9 0.46
nada_1 0 0 0 1 2 0 0 0 0 0 0 0 0 1 0 0 1 0 1 0 1 0 0 0 7 - 0.27
cagl_1 0 0 0 1 2 0 0 0 0 0 0 0 0 1 0 0 2 0 1 1 2 0 0 0 10 10 0.38
cagl_2 0 0 0 1 2 0 0 0 0 0 0 0 0 1 0 0 2 0 1 1 2 0 0 0 10 - 0.38
teph_1 0 0 0 1 2 0 0 0 0 0 0 0 0 1 1 0 2 0 1 1 3 0 0 0 12 - 0.46
tebl_1 0 0 0 1 4 0 0 0 0 0 0 0 0 1 0 0 5 0 1 1 4 0 0 0 17 - 0.65
vapo_1 0 0 0 1 2 0 0 0 0 0 0 0 0 1 1 0 2 0 1 1 4 0 0 0 13 13 0.50

Saccharomyces complex, pre-whole-genome duplication
zyba_1 4 0 0 1 2 2 0 0 0 2 0 0 0 2 2 4 4 4 0 0 12 0 0 0 39 - 1.50
zyba_2 2 0 0 1 2 1 0 0 0 1 0 0 0 1 1 1 2 2 0 0 7 0 0 0 21 - 0.81
zyba_3 2 0 0 1 1 1 0 0 0 1 0 0 0 1 1 2 1 1 0 0 6 0 0 0 18 - 0.69
zyro_1 3 0 0 1 1 2 0 0 0 1 0 0 0 1 0 2 1 1 1 0 7 0 0 0 21 21 0.81
tode_1 1 0 0 1 2 1 0 0 1 0 0 0 0 1 1 0 1 0 0 1 1 0 0 0 11 - 0.42
lath_1 0 0 0 1 2 1 0 1 1 2 0 0 0 1 1 0 1 0 1 1 2 0 0 0 15 15 0.58
lawa_1 0 0 0 1 2 1 0 1 1 0 0 0 0 1 1 0 1 0 0 1 2 0 0 0 12 12 0.46
lakl_1 0 0 0 1 2 1 0 0 2 1 0 0 0 1 1 0 1 0 1 1 1 0 0 0 13 13 0.50
klma_1 0 0 0 1 2 1 0 0 1 1 1 0 0 1 0 0 1 0 0 1 0 0 0 0 10 - 0.38
klla_1 0 0 0 1 2 1 0 0 1 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 8 7 0.31
klwi_1 0 0 1 1 2 1 0 0 1 1 0 0 0 1 0 0 1 0 0 1 0 0 0 0 10 - 0.38
klae_1 0 0 2 1 2 1 0 0 1 1 1 0 0 1 0 0 1 0 0 1 0 1 0 0 13 - 0.50
ergo_1 0 0 0 1 1 1 0 1 2 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 8 8 0.31
ercy_1 0 0 0 1 0 1 0 1 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 6 - 0.23
asac_1 0 0 0 1 1 1 0 1 2 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 8 - 0.31
hava_1 0 0 0 0 2 2 0 1 2 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 10 - 0.38
Subtotal (Saccharomyces complex) 12 0 3 52 129 46 0 6 42 13 2 0 0 78 10 9 61 8 41 48 90 4 0 0 654 170 0.47

CTG complex
caal_1 0 1 0 1 2 1 0 1 2 0 2 0 0 1 2 0 2 0 1 1 1 0 0 0 18 18 0.69
caal_2 0 1 0 1 1 1 0 1 2 0 2 0 0 1 2 0 2 0 1 1 1 0 0 0 17 17 0.65
cadu_1 0 1 0 1 2 1 0 1 2 0 2 0 0 1 2 0 2 0 1 1 1 0 0 0 18 18 0.69
catr_1 0 1 0 1 3 1 0 1 2 0 2 0 0 1 1 0 2 0 1 1 2 0 0 0 19 18 0.73
capa_1 0 1 0 1 1 1 4 2 1 0 2 0 0 1 2 0 9 0 0 1 3 0 0 0 29 28 1.12
loel_1 0 1 0 1 1 1 2 1 1 0 0 0 0 1 1 0 3 0 1 0 4 0 0 0 18 19 0.69
caor_1 0 1 0 1 2 1 3 1 1 0 2 0 0 1 1 0 6 0 0 1 3 0 0 0 24 - 0.92
cama_1 0 1 0 1 3 1 1 1 2 0 1 0 0 1 2 0 4 0 1 1 4 0 0 0 24 - 0.92
cate_1 1 2 1 1 3 1 0 1 1 1 3 0 0 1 1 0 1 0 1 1 1 0 0 0 21 - 0.81
megu_1 1 2 1 0 1 1 1 1 1 1 2 0 0 1 2 0 3 2 1 1 9 0 0 0 31 31 1.19
scst_1 0 3 1 1 2 1 3 1 2 0 1 0 0 1 1 0 1 0 1 1 4 0 0 0 24 24 0.92

CTG deha_1 1 2 1 0 1 1 2 1 1 0 2 0 1 1 2 0 3 0 1 1 3 0 0 0 24 25 0,92 deha_2 0 2 1 0 2 1 3 1 1 0 2 0 0 1 1 0 3 0 1 2 2 0 0 0 23 - 0,88 piso_1 2 4 2 2 3 2 0 0 2 3 0 0 2 2 2 0 13 0 0 0 6 0 0 0 45 - 1,73 Pre Whole Genome Duplication Genome Pre Whole sppa_1 0 1 0 1 2 1 1 1 1 0 4 0 0 1 1 0 1 0 1 1 4 0 0 0 21 - 0,81 spar_1 0 1 0 1 2 1 1 1 1 0 1 0 0 1 1 0 1 0 1 1 2 0 0 0 16 - 0,62 cllu_1 1 2 0 1 2 1 0 1 2 0 2 0 0 1 1 0 1 0 1 1 1 0 0 0 18 17 0,69 cllu_2 1 2 0 1 2 1 0 1 2 0 2 0 0 1 1 0 1 0 1 1 1 0 0 0 18 - 0,69 mebi_1 1 1 0 1 0 2 0 1 1 1 1 0 0 1 1 0 1 0 1 1 3 0 0 0 17 - 0,65 hybu_1 1 2 2 1 3 1 0 1 1 1 3 0 0 1 1 0 1 0 1 1 1 0 0 0 22 - 0,85 3 2 2 subtotal CTG-complex 9 9 18 38 22 29 7 36 0 3 21 28 0 60 2 17 19 56 0 0 0 447 215 0,86 2 1 0

59

(QDR1, (TPO2, S. cerevisiae genes (DTR1) QDR2, (QDR3) (HOL1) (TPO1) (YHK8) (TPO4) (FLR1) Total of Average of TPO3) 2014 AQR1) Full-Size genes per Article

Proteins cluster Complex Phylogeny Acronym / Cluster AB C D E F G H I J K1 K2 L M N1 N2 O P Q R S T UV Y X

pata_1 1 0 0 0 1 1 0 1 2 2 0 0 0 0 1 0 1 0 2 1 0 0 0 1 14 - 0,54 debr_1 0 0 0 0 1 1 1 0 2 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 7 - 0,27 debr_2 0 0 0 0 2 2 2 0 3 2 1 0 0 0 0 0 0 1 0 0 0 0 0 0 13 - 0,50

piku_1 0 0 0 0 1 1 0 0 1 1 1 0 0 1 1 0 2 1 0 1 0 0 0 1 12 - 0,46 pime_1 0 0 0 0 1 1 0 0 2 1 2 0 0 1 1 0 1 1 0 1 0 0 0 1 13 - 0,50 asru_1 0 5 1 0 1 1 0 0 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 11 - 0,42 ogpa_1 0 0 2 0 1 1 0 0 2 1 0 0 0 1 1 0 1 1 1 1 0 0 0 1 14 - 0,54

caar_1 0 0 2 0 1 1 0 0 2 1 0 0 0 1 1 0 1 0 1 1 0 0 0 1 13 - 0,50 divergent families divergent

- cata_1 0 1 0 1 2 1 0 0 2 0 1 0 0 1 0 0 2 0 0 1 3 0 0 0 15 - 0,58 cyja_1 0 2 0 1 0 0 0 0 1 2 0 0 0 1 1 0 1 2 1 1 0 0 0 0 13 - 0,50 cyja_2 0 2 1 1 0 1 0 0 1 1 0 0 0 1 1 0 1 2 1 1 0 0 0 0 14 - 0,54

Other late Other wian_1 0 0 1 1 2 1 1 1 1 1 1 0 0 1 2 0 3 1 0 1 1 0 0 0 19 - 0,73 hapo_1 0 0 2 0 1 1 0 0 2 2 0 0 0 1 1 0 1 1 1 1 0 0 0 1 15 - 0,58 nafu_1 0 0 0 0 1 1 1 1 0 1 0 0 0 1 1 0 2 0 0 1 1 0 0 0 11 - 0,42 bain_1 0 3 0 0 0 1 0 1 2 0 2 0 0 1 1 0 1 0 0 1 0 0 0 0 13 - 0,50 Subtotal Other late-

Pre Whole Genome Duplication Genome Pre Whole 1 13 9 4 15 15 5 4 23 15 9 0 0 12 12 0 18 11 7 13 5 0 0 6 197 0 0,51 divergent families

kopa_1 0 0 0 0 1 1 0 0 1 0 0 0 0 1 1 0 1 0 0 1 1 0 0 0 8 - 0,31

kopa_2 0 0 0 0 1 1 0 0 1 0 0 0 0 1 1 0 0 0 0 1 1 0 0 0 7 - 0,27

kopa_3 0 0 0 0 1 1 0 0 1 0 0 0 0 1 1 0 1 0 0 1 1 0 0 0 8 8 0,31 divergent

- yali_1 0 1 2 0 1 1 0 1 1 8 0 6 0 1 0 0 2 2 2 0 4 0 0 0 32 32 1,23

list_1 0 2 0 0 2 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 2 1 1 0 11 - 0,42 Early caca_1 0 0 0 0 2 2 0 1 0 3 0 0 1 1 1 0 2 0 1 1 3 0 0 0 18 - 0,69 subtotal Early-div. 0 3 2 0 8 6 0 2 4 11 0 6 1 5 5 0 7 2 4 4 12 1 1 0 84 40 0,54 Total 21 48 23 74 187 86 26 30 94 44 47 6 4 115 54 9 144 23 66 83 163 5 1 5 1382 425 - Identity (%) 62 47 45 50,4 41,3 47,2 45 53 55,5 46 45 55 65 67,3 59 72 47 46 56,2 50 46,41 69 - - - - - Similarity (%) 74 62 61 64 57,8 62,7 62 66 69,2 62 62 68 73 77,5 72 81 62,2 68 70,5 62 62,93 78 - - - - -

60

4.4.2. Phylogenetic analysis of the DHA1 transporters encoded in species of the CTG complex.

The analysis of 17 yeast species of the CTG complex, corresponding to a total of 20 Hemiascomycetes strains, allowed the identification of 447 DHA1 genes in this phylogenetic clade. Compared with the genomes of the Saccharomyces complex, those of the CTG complex species encode a higher number of DHA1 genes, with an average of 0.86 genes per cluster against 0.47 in the Saccharomyces complex (Table. 5). The phylogenetic analysis distributed these DHA1 proteins over 19 clusters.
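The "average of genes per cluster" values quoted from Table. 5 can be reproduced with a short calculation. The sketch below is only illustrative: the denominator of 26 clusters is an assumption (the 24 table columns, with AB and UV each counted as two clusters), inferred because it reproduces the tabulated values, and the function name is hypothetical.

```python
# Assumption: the ratio column of Table. 5 is total / 26, i.e. 24 columns
# with AB and UV each counted as two clusters. This is inferred from the
# tabulated values, not stated in the text.
N_CLUSTERS = 26

def genes_per_cluster(total_proteins, n_strains=1):
    """Average number of full-size DHA1 genes per cluster and per strain."""
    return round(total_proteins / (n_strains * N_CLUSTERS), 2)

# Totals taken from Table. 5
print("piso_1:", genes_per_cluster(45))
print("capa_1:", genes_per_cluster(29))
print("CTG-complex subtotal:", genes_per_cluster(447, n_strains=20))
```

Under this assumption a strain exceeds a ratio of one as soon as it encodes more than 26 full-size DHA1 proteins, which is the case for megu_1, piso_1 and capa_1.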

One reason why the CTG complex species encode a higher number of DHA1 proteins per genome is that some yeast species show horizontal expansions of genes belonging to specific phylogenetic clusters. For instance, P. sorbitophila (piso_1), C. parapsilosis (capa_1) and C. orthopsilosis (caor_1) encode 13, 9 and 6 members of cluster P, respectively (homologs of the S. cerevisiae TPO1 and C. albicans FLU1 genes). Regarding cluster T (homologs of the S. cerevisiae FLR1 and C. albicans MDR1 genes), M. guilliermondii (megu_1) and P. sorbitophila (piso_1) encode 9 and 6 members, respectively (Table. 5). These horizontal expansions of cluster P and cluster T members explain why the genomes of M. guilliermondii (megu_1), P. sorbitophila (piso_1) and C. parapsilosis (capa_1) show a ratio above one DHA1 gene per cluster. As previously described for the cluster P members of C. parapsilosis and for the cluster T members of M. guilliermondii (Dias and Sá-Correia 2014), the increased redundancy of these DHA1 genes plausibly resulted from local gene duplications, in some cases followed by dispersal across the genome.

The small number of DHA1 proteins of cluster K1 and cluster M found encoded in the genomes of the CTG complex species suggests that the yeast strains encoding these types of DHA1 transporters acquired them by lateral gene transfer.

4.4.3. Phylogenetic analysis of the DHA1 transporters encoded in species classified in other late-divergent taxonomic families in the Hemiascomycetes

In the group of other late-divergent families, the D. bruxellensis and O. parapolymorpha strains, previously reported as occupying an evolutionary position between the CTG clade and the Saccharomyces complex (Dujon, 2010), showed an intermediate number of DHA1 genes, which reinforces the position of these species in the phylogenetic tree of Figure. 1. The remaining species of this group encode between 7 and 22 DHA1 proteins, with cluster J (homologs of S. cerevisiae HOL1) being the most abundant cluster, with 37 members. Although, evolutionarily, the CTG clade diverged first relative to the Saccharomyces complex, the group of other families, with an average of 0.51 DHA1 genes per cluster, is closer to the Saccharomyces complex than to the CTG clade in this respect.

Two clusters stand out in this group, T and E. With fewer members than the other gene sets (nine and six, respectively), the FLR1 and DTR1 homologs do not appear to be essential in these species, either for resistance to fluconazole and other drugs (cluster T) or for spore wall synthesis (cluster E). It should also be noted that the group of other families is the only one encoding members of the new cluster X (Table. 5).


4.4.4. Phylogenetic analysis of the DHA1 transporters encoded in species classified in early-divergent taxonomic families in the Hemiascomycetes

The early-divergent species form the most heterogeneous group analysed here. With an average of 0.54 DHA1 genes per cluster, this group includes species with a high number of members, such as Y. lipolytica (yali_1) with 32, species with an intermediate number, such as C. caseinolytica (caca_1) with 18, and species with a low number, such as those of the Komagataella genus (kopa_1, kopa_2 and kopa_3), with an average of 8 genes per strain. Compared with the other groups, clusters T and E once again show a different pattern. Cluster E is completely absent from these species, which reinforces the idea that it is not essential and suggests that the DTR1 homologs may have originated, after speciation, from a member of cluster F (in Figure. 14, cluster F is the closest to cluster E). Cluster T, on the other hand, is consistently present in this group, in contrast to what was observed in the group of other families and at the start of the Saccharomyces complex group.

At the species level, Y. lipolytica is a curious case. Besides encoding a large number of DHA1 proteins in clusters L and K1 (6 and 8, respectively), more than any other species in these clusters, it also encodes four FLR1 homologs in cluster T. The list_1 strain is also interesting, since, within this group, it is the only one with a member in the new cluster Y and in the sparsely populated cluster UV (Table. 5).

Finally, regarding the clusters absent from each group, the CTG complex lacks DHA1 genes in clusters L, O, UV, Y and X, the Saccharomyces complex species lack clusters C, H, L, M, Y and X, the other families lack clusters L, M, O, UV and Y, and the early-divergent group lacks clusters AB, E, H, K2, O and X (Table. 5).
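The lists of absent clusters can be checked directly against the subtotal rows of Table. 5. The following Python sketch is only a cross-check, with hypothetical names. One reading assumption applies: in the CTG-complex subtotal the entries for clusters C, H and I are taken as 32, 21 and 20, which is what the individual strain rows sum to.

```python
CLUSTERS = "AB C D E F G H I J K1 K2 L M N1 N2 O P Q R S T UV Y X".split()

# Subtotal rows of Table. 5 (full-size DHA1 proteins per cluster).
# Assumption: CTG-complex values for C, H and I are read as 32, 21 and 20.
SUBTOTALS = {
    "Saccharomyces complex": "12 0 3 52 129 46 0 6 42 13 2 0 0 78 10 9 61 8 41 48 90 4 0 0",
    "CTG complex": "9 32 9 18 38 22 21 20 29 7 36 0 3 21 28 0 60 2 17 19 56 0 0 0",
    "Other late-divergent": "1 13 9 4 15 15 5 4 23 15 9 0 0 12 12 0 18 11 7 13 5 0 0 6",
    "Early-divergent": "0 3 2 0 8 6 0 2 4 11 0 6 1 5 5 0 7 2 4 4 12 1 1 0",
}

def empty_clusters(row):
    """Return the clusters whose subtotal is zero in a whitespace-separated row."""
    counts = [int(value) for value in row.split()]
    if len(counts) != len(CLUSTERS):
        raise ValueError("row does not have one value per cluster")
    return [cluster for cluster, n in zip(CLUSTERS, counts) if n == 0]

for group, row in SUBTOTALS.items():
    print(group, "lacks", ", ".join(empty_clusters(row)))
```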

4.5. Phylogenetic, gene neighbourhood and protein evolution analyses of the DHA1 transporters encoded in the genomes of Saccharomycetaceae yeast species.

As mentioned previously, the main focus of this master’s thesis is the characterization of the DHA1 transporters. However, the analysis of the 1382 bona fide full-size DHA1 proteins found encoded in 94 Hemiascomycetes strains, comprised of more than 15 taxonomic families, using the state-of-the-art methodologies (phylogenetic, neighbourhood and positive/negative selection analyses) proposed in this section would fall beyond the limited scope of a master’s thesis.

Thus, 438 DHA1 proteins encoded in the genomes of 33 yeast strains belonging to the Saccharomycetaceae family were used for the phylogenetic and syntenic approaches. The construction of the phylogenetic tree and the subsequent clustering were based on 419 full-size genes (Figure. 15A/B). This allowed a detailed subdivision of the phylogenetic tree into twenty clusters, with identities and similarities between the proteins of each cluster ranging from 47.2 to 87% and from 62.6 to 92%, respectively. In addition, this analysis provided new information about the ratio of DHA1 proteins between clusters, in comparison with the general analysis of the Hemiascomycetes (Section 4.4.1). In fact, although the relative population density of the clusters is maintained, the ratios between clusters change, which is to be expected considering that reducing the number of species analysed also reduces their heterogeneity.

Regarding population density, clusters F and T (represented by QDR1/QDR2/AQR1 and FLR1) still contain a high number of members, with 70 and 71 DHA1 proteins, respectively. Clusters P and N1 (represented by TPO1 and TPO2/TPO3), which are medium-populated, hold a similar number of genes (41 and 42 full-size genes, respectively). However, the gene ratio changed: N1 previously appeared as the more populated cluster, with 77 members, seventeen more than cluster P (60 members), and with the reduced number of species analysed this difference declined (annex I, Table. 6). The medium-populated clusters do not follow a single pattern but two distinct ones. In the first, clusters G and J (represented by QDR3 and HOL1) only contain genes from the Eremothecium genus up to the WGD and from S. uvarum to S. cerevisiae, with a gap in the early post-WGD species. These gaps in clusters J and G, where the homologs were lost in fourteen and ten consecutive species, respectively, seem to involve horizontal gene transfer. In the second, clusters E, R and S (represented by DTR1, YHK8 and TPO4) show genes uniformly distributed across all species. Clusters A+B, D, I, K1, K2, N2, O, Q and U+V, which have no S. cerevisiae representatives, comprise between two and thirteen genes and are the least populated in this analysis. Curiously, all species of the Saccharomyces sensu stricto group encode a consistent number of genes in the nine clusters marked by S. cerevisiae members (E, F, G, J, N1, P, R, S and T), with the exception of the S. bayanus strains (623-6C and MCYC 623) and S. uvarum, which also encode a DHA1 protein in a tenth cluster, U+V (annex I, Table. 6).

The syntenic approach, in turn, was based on the paper by Dias et al. (2010), in which 10 phylogenetic clusters (E, F, G, J, N1, N2, P, R, S, T) were described in terms of synteny. During the syntenic assessment of the DHA1 family, all 438 members (fragments and full-size genes) were considered; however, only 433 proteins were found to have common genes in their vicinity. Each lineage is presented as a diagram showing the chromosomal location of each gene (sub-telomeric or centromeric), whether the gene sequence was corrected, and evolutionary events such as the WGD and HGT, where these exist. All of these results are directly comparable with the previously published study and the respective lineage diagrams (Figure. 4) (Dias et al. 2010a).

Figure. 15 Phylogenetic analysis of the DHA1 transporters gathered from 33 strains of 28 yeast species of the Saccharomycetaceae group, using the PhyML 3.0 software. A) Radial phylogram showing the amino acid sequence similarity distances between these 419 full-size DHA1 transporters. B) Circular cladogram showing the tree topology of these genes, the names of the S. cerevisiae genes in each cluster, and the colour of each cluster as used in the synteny analyses of Figure. 4 by Dias et al. (2010).

4.6. Phylogenetic, gene neighbourhood and protein evolution analyses of the FLR1 homologs of the S. cerevisiae genes encoded in the genomes of the Saccharomycetaceae species.

The published literature (Dias et al., 2010; Dias and Sá-Correia, 2014) reported that the DHA1 lineage 10, representing the homologs of the S. cerevisiae FLR1 gene, has the most complex evolutionary history among the DHA1 genes encoded in the genomes of the Saccharomycetaceae yeast species. It was decided to use this master’s thesis to show the theoretical basis and the operational implementation of the state-of-the-art methodologies on phylogeny, comparative genomics and protein evolution used to analyse the DHA1 genes. These methodologies include i) the tree building software suites PhyML and Mr. Bayes, ii) R and SQL scripting based on the information comprised in the local databases developed by the BSRG group, GenomeDB and BlastpDB, iii) pairwise alignment algorithms such as Needleman-Wunsch, blastp and TranslatorX, iv) the network visualization software suite Cytoscape, v) protein evolution functions such as FEL, MEME and FUBAR, made available by the HYPHY software suite, vi) the protein conservation quantification score made available by the Log-OddsLogo software suite and vii) a wide range of 2D topology prediction software suites. Because of the inherent complexity of the DHA1 lineage 10, the evolution of the FLR1 homolog genes was selected to illustrate the full potential of these methodologies. Consequently, this subsection, which focuses on the reconstruction of the evolutionary history of the FLR1 homologs, is longer than the subsequent subsections.

Two different phylogenetic trees representing the FLR1 homologs were constructed using state-of-the-art phylogenetic packages, the maximum likelihood-based PhyML, and the Bayesian-based Mr. Bayes software suites. For purposes of clarity, only the PhyML phylogenetic trees are shown in the supporting figures.

The first phylogenetic tree represents the whole set of 163 full-size cluster T members (Table. 5). The FLR1 homolog genes divide into two superclusters in this phylogenetic tree, one comprising the proteins encoded in the Saccharomycetaceae yeast species and the other comprising the proteins encoded in the early and late-divergent taxonomic families, including the Debaryomycetaceae yeast species (annex II, Figures.29,30).

The second phylogenetic tree represents the phylogenetic relationships established between 71 full-size cluster T proteins encoded in 33 Saccharomycetaceae yeast strains. Based on the consistency of the clusters in the phylogenetic trees obtained with the PhyML and Mr. Bayes software suites and on the distances separating the cluster T proteins, the PhyML phylogenetic tree was divided into 5 subclusters (Figure. 16). The translated ORF cagl_1_h0639g was assigned as the root of the tree (subcluster S5). Although the 71 cluster T members share, on average, 59.4% identity and 74.7% similarity, the analysis of this phylogenetic tree indicates that the different subclusters comprising the FLR1 homolog proteins tend to be radially dispersed.

Figure. 16 Phylogenetic analysis of the FLR1 homologs in 33 Saccharomycetaceae strains using the PhyML 3.0 software. The circular cladogram shows the tree topology of 71 full-size DHA1 transporters. S1 to S5 represent the five subgroups derived from the synteny approach. Taxa marked in blue represent different species of the same genus placed in different subgroups.

The gene neighbourhood approach allowed the reconstruction of the evolutionary history of the FLR1 homologues encoded in yeast species classified in the Saccharomycetaceae taxonomic family based on Comparative Genomics data. This approach provides independent information on gene evolution, complementing the classic analysis relying on the construction of phylogenetic trees based on the similarity between the amino acid sequences of these DHA1 transporters. These analyses confirmed the previously published report (Dias et al., 2010) that gene duplication was a frequent phenomenon in the evolution of the FLR1 homolog genes (Figure. 17). The gene neighbourhood analysis also showed that the chromosome environment where the FLR1 homologues reside was poorly conserved and that lineage 10 does not exhibit the normal WGD gene pattern, in which the ancestral gene originates two ohnolog genes that may or may not be subsequently lost. Nevertheless, some FLR1 homolog genes organize into different subgroups sharing strong synteny links between them.
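The idea behind a neighbourhood (synteny) connection can be pictured as counting the homologous genes shared by the chromosomal windows flanking two genes. The Python sketch below is a simplified illustration and not the R and SQL implementation built on GenomeDB and BlastpDB: the function names, the window size, the gene identifiers and the family labels are all hypothetical.

```python
def neighbourhood(chromosome, gene, window):
    """Homology families of up to `window` genes on each side of `gene`.

    `chromosome` is an ordered list of (gene_name, family_label) pairs.
    """
    names = [name for name, _ in chromosome]
    i = names.index(gene)
    flank = chromosome[max(0, i - window):i] + chromosome[i + 1:i + 1 + window]
    return {family for _, family in flank}

def shared_neighbours(chrom_a, gene_a, chrom_b, gene_b, window=15):
    """Families present in the neighbourhoods of both genes."""
    return neighbourhood(chrom_a, gene_a, window) & neighbourhood(chrom_b, gene_b, window)

# Hypothetical toy chromosomes from two strains; x3 and y3 are the homologs compared.
chrom_x = [("x1", "famA"), ("x2", "famB"), ("x3", "FLR1"), ("x4", "famC"), ("x5", "famD")]
chrom_y = [("y1", "famE"), ("y2", "famB"), ("y3", "FLR1"), ("y4", "famC"), ("y5", "famF")]

common = shared_neighbours(chrom_x, "x3", chrom_y, "y3", window=2)
print(len(common), "shared neighbours:", sorted(common))
```

In such a scheme, two homologs would be linked in a lineage diagram when the number of shared neighbours exceeds some chosen cut-off; the cut-off itself is a design choice that this sketch leaves open.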

The S. cerevisiae FLR1 gene, as well as the corresponding orthologues encoded in the genomes of the Saccharomyces sensu stricto group, share strong synteny with tebl_1_a04260, a translated ORF encoded in an early-divergent post-WGD species classified in the Tetrapisispora genus. Although the genomes of the Saccharomyces sensu stricto species tend to encode only one FLR1 homolog gene, one of these Hemiascomycetes yeasts was an exception to this rule. After the sequence correction stage, S. kudriavzevii was found to encode 2 full-size FLR1 homolog genes (Figure. 17) and 1 FLR1 fragment. Interestingly, this S. kudriavzevii FLR1 fragment shares strong synteny with 3 FLR1 fragments encoded in the S. mikatae genome. The amino acid and nucleotide sequence analyses indicate that these FLR1 fragments are true pseudogenes. The FLR1 homolog genes encoded in the genomes of S. arboricola and S. kudriavzevii also show poor synteny.

In the genomes of the Hemiascomycetes yeasts that evolved after the WGD event but do not belong to the Saccharomyces genus, 17 translated ORFs encoding cluster T members were found that do not establish any relation of synteny with any other FLR1 homolog gene: i) all 5 cluster T members encoded in K. africana (kaaf_1_e00140, kaaf_1_k00100, kaaf_1_k00150, kaaf_1_b00180, kaaf_1_d05140), ii) 1 cluster T member encoded in K. naganishii (kana_1_f03480), iii) 1 cluster T member encoded in N. dairenensis (nada_1_a07730), iv) 3 cluster T members encoded in T. phaffii (teph_1_o02010, teph_1_a00130, teph_1_i03320), v) 3 cluster T members encoded in T. blattae, and vi) 4 cluster T members encoded in V. polyspora (vapo_1_358.7, vapo_1_1057.25, vapo_1_1018.193, vapo_1_1018.195) (Figure. 17). Interestingly, 13 of these translated ORFs reside in sub-telomeric regions.

Because of the lack of synteny between cluster T members, translated ORFs belonging to the same subcluster in the phylogenetic tree representing the FLR1 homolog proteins (i.e. showing strong amino acid sequence similarity) are linked by red dashed lines (Figure. 17). The groups of genes linked by these dashed lines are coherent. In light of the phylogenetic analyses, the most differentiated and unexpected subgroup is S1, since the Lachancea genus and the Saccharomyces sensu stricto group appear in close proximity. The T. phaffii, T. blattae, V. polyspora and N. castellii species, on the other hand, form another subgroup, S2. This group makes sense considering the sequential position of these strains in the species tree (Figure. 2). In subgroup S3, one incoherence was found. Species of the same genus would be expected to group together; however, this does not happen for the Naumovozyma genus, with N. castellii and N. dairenensis placed in two different subgroups, S3 and S4 (taxa in blue). The last subgroup considered, S5, proved to be the most abundant, with a high number of local duplications (Figure. 16).

The FLR1 homolog genes encoded in the genomes of Hemiascomycetes strains belonging to the same yeast species share, as expected, a high number of neighbourhood connections. This is observed between the cluster T members encoded in the N. castellii CBS 4309 and NRRL Y-12630 strains and between those encoded in the C. glabrata CBS138 and CCTCC M202019 strains. In the Hemiascomycetes strains evolving from an ancestor that diverged before the WGD event, the gene neighbourhood analysis confirmed the published report identifying a strong horizontal expansion of FLR1 homolog genes in the genomes of yeast species classified in the Zygosaccharomyces genus (Dias et al. 2010a), leading to the generation of 7 full-size cluster T members in the Z. rouxii genome. The genomes of the Z. bailii IST302 and CLIB 213 strains and of the Z. bailii hybrid ISA1307 were found to encode 7, 6 and 12 translated ORFs corresponding to full-size cluster T members, respectively. All these translated ORFs encoded in the genomes of the four Zygosaccharomyces strains analysed in this study were found to share strong synteny with tode_1_d06500. Although the translated ORFs lawa_1_55.19590 and lath_1_e00528g share synteny with tode_1_d06500, three genes encoded in the genomes of three Lachancea species do not share synteny connections with any other cluster T member in the Saccharomycetaceae species.

The comparison of the results obtained in this master’s thesis with those previously published revealed a number of inconsistencies. First, the analysis of the chromosome environment where the cluster T members reside in the genomes of the Zygosaccharomyces strains did not show a higher number of synteny connections between the translated ORF zyro_1_b16808g and tode_1_d06500 than between tode_1_d06500 and the remaining FLR1 homolog genes encoded in the Z. rouxii genome. Second, the analysis of the synteny data also showed that the translated ORF zyro_1_b16808g did not establish a relevant connection with the cluster T members encoded in the N. castellii NRRL Y-12630 genome. Third, the tandem repeat comprising FLR1 homolog genes encoded in the C. glabrata genome did not share the same chromosome environment as the cluster T members encoded in the S. bayanus 623-6C and MCYC 623 strains. The absence of neighbourhood conservation of FLR1 homologues belonging to successive post-WGD species suggests the occurrence of either multiple gene transpositions or multiple gene losses and gains. This potential gain of cluster T members would require at least five horizontal gene transfers between different yeast species, which are discussed in Section 5.4 of this study.

Figure. 17 Lineage 10 (homologs of the S. cerevisiae FLR1 gene). Each box represents a gene. Lines connect genes sharing common neighbours. E, C and T indicate that the corresponding gene was classified as a fragment, a corrected gene or a sub-telomeric gene, respectively. The dashed line encompasses groups of proteins that are more similar in amino acid sequence (inferred from the analysis of the phylogenetic tree).

4.6.1. Positive / Negative Selection and Conservation analysis

It is well known that the integration of results obtained using different methodologies and mathematical models can lead to a deeper understanding of a given problem. Therefore, in this master’s thesis, it was decided to use functions and algorithms made available by the HYPHY and Log-OddsLogo software suites with the goal of understanding the evolution and the conservation of the amino acid sequences of cluster T members, respectively. Subsequently, these results can be integrated with the results obtained with 2D-topology prediction software and with the reconstruction of the evolutionary history of the cluster T members using the phylogenetic and gene neighbourhood analyses described above. This approach can give new insights into the functional diversification of cluster T members with respect to the potential cases of lateral gene transfer identified in this master’s thesis and to the observed cases of horizontal expansion of FLR1 homologs mediated by local gene duplication events or by genome-wide duplication events (for instance, the WGD event). In these analyses, with the exception of Z. bailii, for which the hybrid strain ISA1307 and the type strain CLIB213 were both included, only one strain was used to represent each hemiascomycetous species. This resulted in the selection of a total of 61 cluster T members encoded in the genomes of 29 Saccharomycetaceae yeast strains.

For these analyses, different methodologies were used, namely FUBAR and MEME, implemented in the HYPHY 2.2 package, for selection, and BILD ("Bayesian Integral Log-odds") for amino acid conservation (Altschul et al. 2010; Murrell et al. 2012, 2013). The Log-OddsLogo software uses per-observation multiple-alignment log-odds scores as measures of information content at each logo position, allowing the determination of the level of conservation across the aligned amino acid sequences of the cluster T members. The study of purifying selection in these amino acid sequences was based on the Fast Unconstrained Bayesian AppRoximation (FUBAR) function made available by the HYPHY software suite. As expected, the results obtained with these two methodologies are strongly associated: the portions where the protein sequence is strongly conserved are also assigned a high probability that the parameter representing the synonymous substitution rate (α) is larger than the parameter representing the non-synonymous substitution rate (β) (Figure. 18).
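For readers less familiar with these parameters, the relation between α, β and the direction of selection can be written out explicitly. The block below is the standard dN/dS reading of the two rates, given as a reading aid under the assumption that α > 0; it is not a formula taken from the HYPHY output.

```latex
\omega = \frac{\beta}{\alpha}, \qquad
\alpha > \beta \iff \omega < 1 \ \text{(purifying, i.e. negative, selection)}, \qquad
\beta > \alpha \iff \omega > 1 \ \text{(positive selection)}
```

The quantity plotted per alignment position in Figure. 18C is the posterior probability $\mathrm{Prob}[\alpha > \beta]$, so values close to 1 indicate sites under strong purifying selection.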

Six 2D-topology prediction software suites were also used to confirm the presence of the twelve transmembrane segments (TMS) assumed to exist in all DHA1 transporters (Figure. 18). The results show that amino acid sequence conservation and purifying selection are strongly associated with the 2D-topology of these plasma membrane proteins, which is consistent with the notion that the hydrophobic transmembrane portions of the cluster T members need to be conserved during evolution to maintain the structural stability of these transporters in the plasma membrane. This can be observed in Figure. 18B/C, where the values above 0.5 in the plot representing the amino acid conservation of the cluster T members largely overlap with the location of the TMS in these DHA1 transporters. The portions of the protein where the transmembrane segments of the cluster T members reside also show a high probability of the synonymous substitution rate being larger than the non-synonymous substitution rate (Figure. 18C).

Figure. 18 Correlation plot between negative selection, conservation and predicted topology of the aligned cluster T proteins in twenty-nine Saccharomycetaceae strains (x-axis: positions in the alignment sequence). A) Topology: the TOPCONS web server displays the prediction of twelve TMS from six independent predictors. B) Sequence conservation (y-axis: bits) predicted by the LogOddsLogo software, based on entropy and on the total frequency of each amino acid set per alignment position. The green and yellow bands each mark a set of two TMSs. C) Negative selection predicted with the FUBAR methodology; the values shown are Prob[alpha>beta].

With the goal of identifying potential cases of functional diversification in the cluster T members, it was decided to use the Mixed Effects Model of Evolution (MEME) in this master’s thesis to capture potential molecular footprints of episodic or pervasive positive selection in these DHA1 genes. The use of MEME required a number of prerequisites to be met. First, errors and potential stop codons introduced in the DHA1 genes during the sequencing process have to be corrected. As described in section 3.3.1, the correction of these errors was pursued thoroughly. Second, MEME requires as input the multiple alignment of the nucleotide sequences and the corresponding tree representing the phylogenetic relationships between the DHA1 genes under analysis. These two input files were obtained using i) the TranslatorX software suite (Abascal, Zardoya and Telford 2010), which performs the multiple alignment of nucleotide sequences guided by the corresponding translated amino acid sequences, and ii) the phylogenetic package PHYML 3.0 (Guindon et al. 2010).
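The principle of a protein-guided nucleotide alignment can be sketched in a few lines: once the translated sequences are aligned, each residue is replaced by its codon and each gap by three gap characters. The Python function below illustrates this back-translation idea only; it is not the TranslatorX code, the function name is hypothetical, and it assumes a coding sequence without stop codon whose length is exactly three times the number of residues.

```python
def thread_codons(aligned_protein, cds):
    """Project the gaps of an aligned protein onto its coding sequence.

    Assumption: `cds` contains exactly one codon per residue (no stop codon).
    """
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    n_residues = sum(1 for aa in aligned_protein if aa != "-")
    if len(cds) % 3 != 0 or len(codons) != n_residues:
        raise ValueError("coding sequence does not match the protein length")
    remaining = iter(codons)
    return "".join("---" if aa == "-" else next(remaining) for aa in aligned_protein)

# Hypothetical toy example: a four-column protein alignment row with one gap.
print(thread_codons("M-KV", "ATGAAAGTT"))
```

Keeping codons intact in this way is what allows site-wise methods such as MEME to compare synonymous and non-synonymous changes column by column.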

The results obtained with MEME were initially explored by adopting very conservative thresholds, namely requiring simultaneously a p-value below 0.05 and a Likelihood Ratio Test (LRT) value above 1.0 (Murrell et al. 2012). Out of a total of 619 possible positions in the multiple alignment, only 15 sites met these two requirements (Table. 6), corresponding to amino acid residues with high and constant rates of mutation.

Table. 6 Codon positions detected under diversifying selection with two methods are reported. The FEL rate of non-synonymous and synonymous substitutions is expressed as dN–dS or x (i.e. dN/dS). The statistical significance of positive selection is expressed in term of p-value (FEL and MEME methods). Two different thresholds of Empirical Bayes Factor and the number of associated nodes are described.

Codon   MEME p-value   MEME LRT   FEL p-value   FEL dN/dS   FEL LRT   Nº of nodes 5<EBF<20   Nº of nodes EBF>20
154     0.021          5.37       0.000         0.06        29.20     1                      3
170     0.003          8.60       0.003         0.27        8.97      2                      4
212     0.008          6.94       0.000         0.09        25.91     0                      3
262     0.011          6.47       0.000         0.12        21.90     0                      3
264     0.006          7.48       0.001         0.09        11.79     1                      3
327     0.031          4.67       0.000         0.13        16.16     0                      2
348     0.000          21.52      0.017         0.47        1.85      3                      2
349     0.000          13.05      0.000         0.13        22.60     2                      2
473     0.038          4.30       0.000         0.16        14.94     3                      7
591     0.050          3.85       0.000         0.01        47.40     0                      1
593     0.012          6.31       0.004         0.26        8.40      5                      2
595     0.000          16.84      0.000         0.08        24.84     2                      2
600     0.041          4.18       0.000         0.02        48.41     2                      2
610     0.003          8.77       0.033         0.33        4.55      3                      5
616     0.002          9.31       0.044         0.38        4.06      7                      6
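The site filter described above can be reproduced on the values of Table. 6 with a short Python sketch. The `Site` record and the variable names are hypothetical, the FEL columns are omitted for brevity, and the p-value comparison uses `<=` as an assumption, because codon 591 is reported with a rounded p-value of 0.050.

```python
from typing import NamedTuple


class Site(NamedTuple):
    codon: int
    meme_p: float
    meme_lrt: float
    nodes_ebf_5_20: int   # branches with an EBF score between 5 and 20
    nodes_ebf_gt_20: int  # branches with an EBF score above 20


# MEME and EBF columns transcribed from Table. 6.
SITES = [
    Site(154, 0.021, 5.37, 1, 3), Site(170, 0.003, 8.60, 2, 4),
    Site(212, 0.008, 6.94, 0, 3), Site(262, 0.011, 6.47, 0, 3),
    Site(264, 0.006, 7.48, 1, 3), Site(327, 0.031, 4.67, 0, 2),
    Site(348, 0.000, 21.52, 3, 2), Site(349, 0.000, 13.05, 2, 2),
    Site(473, 0.038, 4.30, 3, 7), Site(591, 0.050, 3.85, 0, 1),
    Site(593, 0.012, 6.31, 5, 2), Site(595, 0.000, 16.84, 2, 2),
    Site(600, 0.041, 4.18, 2, 2), Site(610, 0.003, 8.77, 3, 5),
    Site(616, 0.002, 9.31, 7, 6),
]

# Keep sites passing both thresholds (p-value cut-off inclusive, see lead-in).
selected = [s for s in SITES if s.meme_p <= 0.05 and s.meme_lrt > 1.0]

# Site with the largest number of branches above the stricter EBF threshold.
most_branches = max(selected, key=lambda s: s.nodes_ebf_gt_20)
total_strong_nodes = sum(s.nodes_ebf_gt_20 for s in selected)
```

With these thresholds all 15 tabulated sites are retained, and codon 473 is the position associated with the highest number of branches showing an EBF score above 20.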

In these 15 cases, the empirical Bayes factor (EBF) score was used to identify the branches of the phylogenetic tree where the mutational rate had increased during the evolution of each of these amino acid residues. Table. 6 shows a summary of the results obtained for each of these positions, with one column showing the number of phylogenetic branches with an EBF score between 5 and 20, and another column showing the number of branches with an EBF score higher than 20. These results are also shown graphically in the phylogenetic tree of Figure. 19, where branches with high mutation rates are coloured in red (cases where only one residue of the 15 considered showed an EBF score above 20) or in orange (cases where two or more residues of the 15 considered showed an EBF score above 20).
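The colouring rule for the branches can be written as a small Python sketch. The function `colour_branches`, the branch labels and the EBF values below are hypothetical and serve only to illustrate the rule stated in the text; they are not output from the thesis analysis.

```python
from collections import Counter


def colour_branches(ebf, threshold=20.0):
    """Map each branch to 'red' or 'orange' from per-site EBF scores.

    `ebf` maps (branch, codon) pairs to an empirical Bayes factor.
    A branch is red when exactly one codon exceeds the threshold and
    orange when two or more codons exceed it; other branches are omitted.
    """
    hits = Counter(branch for (branch, _codon), score in ebf.items() if score > threshold)
    return {branch: ("red" if n == 1 else "orange") for branch, n in hits.items()}


# Hypothetical EBF scores for three branches and a few of the 15 codons.
example_ebf = {
    ("branch_a", 154): 35.0,
    ("branch_a", 170): 8.0,    # between 5 and 20: not counted here
    ("branch_b", 473): 120.0,
    ("branch_b", 616): 41.5,
    ("branch_c", 348): 12.0,   # never above 20: branch stays uncoloured
}

colours = colour_branches(example_ebf)
```

In this toy input, `branch_a` would be drawn in red, `branch_b` in orange, and `branch_c` would keep the default colour.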

Figure. 19 Phylogenetic and positive selection analysis of FLR1 homologous genes in 29 Saccharomycetaceae strains using PHYML 3.0 software. The circular cladogram shows the tree topology. The red and orange lines correspond to branches where one or two of the fifteen pre-selected codon positions under positive selection have an EBF>20, respectively.

In the 2010 paper, Dias et al. proposed that the horizontal expansion of the FLR1 homolog genes observed in the Zygosaccharomyces strains could result from a decrease in the selective pressure over some of these cluster T members, creating a reservoir for the development of functional novelty in these DHA1 transporters. The analysis of the plot representing the number of phylogenetic branches associated with particular yeast species that showed a high mutational rate confirmed that the expansion of FLR1 homolog genes observed in the genome sequences of the Zygosaccharomyces strains correlates with high EBF scores. In fact, the analysis of these 15 amino acid residues leads to the conclusion that 10 of the 19 cluster T members encoded in these yeast species were under strong positive selection. In addition to the Zygosaccharomyces strains, the phylogenetic branches leading to the Tetrapisispora and Vanderwaltozyma genera, comprising yeast species that evolved immediately after the WGD event, also show a strong propensity for the accumulation of amino acid mutations in these 15 residue positions.

Analysing the relationship between the two EBF thresholds, a high number of nodes was found in the WGD Saccharomycetaceae species. Indeed, the pattern is clear in Figure. 20, where the increase in the number of nodes begins in the Lachancea genus and slows down significantly in the Tetrapisispora species.


The phylogenetic tree representation shows several cases of constant positive selection in different branches. The red branches present diversification in one node, while the orange branches have two or more nodes associated. Correlating the synteny with the positive selection clarifies some points. For instance, in the large expansion of FLR1 homologs in the Zygosaccharomyces genus, 10 of a total of 19 genes are under positive selection. Contrary to expectations, in Z. rouxii just one case of positive selection was found in seven genes. In turn, Z. bailii IST302 (zyba_1) and Z. bailii CLIB 213 (zyba_2) comprise 5 and 4 genes, respectively, under diversification (Figure. 19).

[Figure 20 chart: "Different levels of EBF in 29 Saccharomycetaceae Strains". Vertical axis: number of nodes (0 to 6). Horizontal axis: Saccharomycetaceae strains. Series: number of nodes; number of nodes under positive selection.]

Figure. 20 Evaluation of the total number of nodes in fifteen pre-determined codon positions, across twenty-nine Saccharomycetaceae strains.


4.7. Phylogenetic and gene neighbourhood analyses of the remaining homologs of the S. cerevisiae genes encoded in the genomes of the Saccharomycetaceae species.

4.7.1. Phylogenetic and gene neighbourhood analyses of the homologs of the S. cerevisiae DTR1 and TPO4 genes encoded in the genomes of the Saccharomycetaceae species.

Lineage 1 comprises a linear evolutionary history of 33 DHA1 genes. The most studied gene encoding a DHA1 protein residing in cluster E, the S. cerevisiae DTR1 gene, is associated with resistance to quinine, quinidine, propionic acid, butyric acid and benzoic acid (Felder et al., 2002).

Interestingly, all yeast strains classified in the Saccharomycetaceae encode in their genomes a sole cluster E member, a fact observed even in the Z. bailii ISA1307 hybrid strain, whose polyploid genome would be expected to encode two copies. This fact suggests not only that DTR1 homolog genes fulfil an important role in yeast physiology but also that the presence of additional copies could have a negative impact on the fitness of these microorganisms, explaining the rapid loss of the extra copy by the Z. bailii ISA1307 hybrid strain. Even though some cluster E members encoded in the protoploid (pre-WGD) species analysed in this study reside in a subtelomeric region of the chromosome, all DTR1 homolog genes share strong synteny throughout the evolution of the Saccharomycetaceae yeasts. The gene neighbourhood analysis identified the formation of two sublineages with origin in the WGD event. One of these sublineages comprises the orthologs of the S. cerevisiae DTR1 gene and is present in all genomes of yeast species evolving after the divergence of the Tetrapisispora genus. The other sublineage is composed of two translated ORFs, vapo_1_1048.45 and teph_1_l01760, encoded in the genomes of the V. polyspora and T. phaffii species. In the case of the DHA1 gene sublineage comprising the V. polyspora and T. phaffii translated ORFs, the clusters of amino acid sequence similarity used to determine its orthology were 2533, 14834 and 17093 (annex III, Figure. 31). In the case of the ortholog of the DTR1 gene encoded in the T. blattae genome, the clusters of amino acid sequence similarity used to determine its orthology were 764 and 14833 (annex III, Figure. 31).

The 29 homologs of the S. cerevisiae TPO4 gene form a simple evolutionary history. Each of the 33 Saccharomycetaceae strains analysed in this master’s thesis encodes a single cluster S member in its genome, with the exception of the yeast strain N. castellii NRRL Y-12630, which encodes two cluster S members disposed in tandem, and of the species classified in the Zygosaccharomyces genus and the N. dairenensis and K. lactis yeast species, which encode none. The gene neighbourhood analysis revealed that one T. phaffii translated ORF (teph_1_g01430) and one V. polyspora translated ORF (vapo_1_413.2) encoding cluster S members are paralogs of the S. cerevisiae TPO4 gene, with origin in the WGD event. These two translated ORFs form a DHA1 gene sublineage that is discontinued after the divergence of the Tetrapisispora genus, being lost in the late-divergent post-WGD species. This analysis also showed the existence of a T. blattae translated ORF (tebl_1_b03550) residing in the same chromosome environment as the T. phaffii and V. polyspora translated ORFs. However, the inspection of the coding sequence of tebl_1_b03550 provided strong evidence of its non-functionality, and it was classified in this master’s thesis as a pseudogene (Figure. 21).


Figure. 21 Lineages 1 and 9 (homologs of the S. cerevisiae DTR1 and TPO4 genes). Each box represents a gene. Lines connect genes sharing common neighbours, purple in lineage 1 and orange in lineage 9. E, C, and T indicate that the corresponding gene was classified as a fragment, corrected or subtelomeric gene, respectively. The grey box in zyba_1 gene represents the zone of WGD occurrence. The red arrows and HGT represent the plausible occurrence of events of horizontal gene transfer between species.


4.7.2. Phylogenetic and gene neighbourhood analyses of the homologs of the S. cerevisiae QDR1/QDR2/AQR1 genes encoded in the genomes of the Saccharomycetaceae species.

The phylogenetic cluster F comprises the largest number of DHA1 proteins when compared with the remaining clusters. The cluster F members are organized in the DHA1 gene lineage 2. Although this lineage comprises a great number of DHA1 genes, the corresponding evolutionary history fits well the commonly observed WGD pattern where the protoploid gene lineage is split into two sublineages, one originating the S. cerevisiae QDR1/QDR2 genes, and the other the S. cerevisiae AQR1 gene (Dias et al., 2010). Throughout the evolutionary history of lineage 2, the WGD event is quite clear, with the presence of two independent and full lineages of genes. In the pre-WGD species, a pattern of two genes in tandem repeat can be observed from K. aestuarii (klae_1) to T. delbrueckii (tode_1). In the case of the Eremothecium genus, just one gene was found in two of the three species presented. Similarly, Z. rouxii (zyro_1) and a Z. bailii strain (zyba_3) also exhibit one gene each (Figure. 22, Figure. 23). In Z. bailii ISA1307, a double local duplication was observed; however, just one of these genes has strong synteny with the remaining members of the lineage, while the other presents weak synteny with the QDR1/QDR2 orthologous gene in V. polyspora. The loss of synteny in these two genes seems natural, since they are located in the subtelomeric region of the chromosome. In the post-WGD species, a different pattern was observed in each sublineage.

In the AQR1 sublineage, most species show just one gene per strain, with a high number of neighbour connections. The exceptions are T. blattae (tebl_1), N. castellii CBS 4309 (naca_1) and N. castellii NRRL Y-12630 (naca_2), which present several gene expansions. Curiously, in the naca_2 strain expansions, just two of the four genes keep strong synteny: one with the remaining sublineage, and the other with the homologous gene in naca_1. The other two genes, fragmented and corrected, reveal weak synteny with the naca_1 and nada_1 strains (Figure. 21). In the QDR1/QDR2 sublineage, the evolutionary history after the WGD is quite different, since various tandem repeat genes are shown, mainly in the Saccharomyces sensu stricto species. From V. polyspora (vapo_1) to K. naganishii (kana_1), excluding T. blattae (tebl_1), just one member per strain is shown, with strong synteny among them. K. africana (kaaf_1), S. arboricola (saar_1) and the S. bayanus strains, the early-divergent species inside the Saccharomyces sensu stricto group, also contain just one member. Notably, the remaining members of this sublineage contain two genes in tandem (Figure. 22). Interestingly, despite the different scenarios shown in the post-WGD species, the phylogenetic analyses showed strong conservation between the members of each sublineage and one of the tandem repeat genes from the pre-WGD species (annex IV, Figure. 32). In order to understand the evolutionary origin of the QDR1 and QDR2 homolog genes, as well as the reasons that explain the high similarity between different genes of post- and pre-WGD species, several hypotheses are suggested and discussed in section 5.4 of this master’s thesis.

In order to clarify the evolutionary history of this cluster, a vertical synteny scheme was produced for both sublineages. In this scheme, the neighbour genes flanking the ohnolog genes (AQR1 and QDR1/QDR2) were positioned according to the synteny approach and represented by different colours consistent with the protein family of each neighbour gene (neighbour genes belonging to the same sublineage were represented by the same colour). Horizontal elements represent the chromosomal segments and vertical elements the homologous genes across the species (Figure. 23). The vertical scheme showing the neighbourhood environment did not allow the elucidation of the origin of QDR1 and QDR2; however, it reinforced the presence of the WGD event in cluster F, with both sublineages of ohnologs.

Figure. 22 Lineage 2 (homologs of S. cerevisiae QDR1/QDR2 and AQR1 genes). Conventions as in Figure. 21.


[Figure 23 panel labels: QDR1/QDR2 sublineage; AQR1 sublineage; pre-WGD; post-WGD; genes without synteny.]

Figure. 23 Gene neighbourhood comparison of the QDR1/QDR2 and AQR1 sublineages of genes in the genome of 32 yeast strains belonging to the Saccharomycetaceae. The QDR1/QDR2 and AQR1 homologous genes are represented in central boxes flanked by the corresponding neighbour genes highlighted at different colours according to the protein sequence cluster. White boxes represent genes without homologous neighbours in the represented chromosome region. In order to simplify the gene neighbourhood comparison, the retrieved 15 neighbour genes for each adjacent region of query genes were reduced to 5 neighbour genes here shown and the obtained scheme divided according to the sublineages identified. The neighbours’ order in each chromosome was scrutinized by exploring the synteny network in the Cytoscape software.


4.7.3. Phylogenetic and gene neighbourhood analyses of the homologs of the S. cerevisiae QDR3 and HOL1 genes encoded in the genomes of the Saccharomycetaceae species.

Lineage 3 comprises the members of the phylogenetic cluster G, encompassing a total of 25 DHA1 genes. The analysis of the gene neighbourhood results showed the existence of strong synteny linking all cluster G members encoded in the genome of the protoploid Saccharomycetaceae species. With the exception of the Z. rouxii CBS 732 strain, all protoploid Saccharomycetaceae parental strains encode a single cluster G member in their genome sequences. The Z. rouxii translated ORF zyro_1_a03564g shares strong synteny connections with the remaining cluster G members encoded in the protoploid Saccharomycetaceae genomes. On the contrary, the other cluster G member encoded in the Z. rouxii genome, the translated ORF zyro_1_f08228g, lacks any synteny connection with the remaining lineage 3 genes. This fact also suggests that, after the local gene duplication that originated the translated ORF zyro_1_f08228g, a transposition event has changed the place of residence of this gene to another chromosome environment. As expected, the genome sequence of the Z. bailii ISA1307 hybrid strain encodes two cluster G members, with both these genes sharing many synteny connections with the remaining lineage 3 members (Figure. 24).

With the exception of the Saccharomyces genus, the phylogenetic analysis of the cluster G members revealed that the genomes of all post-WGD yeast species analysed in this work, spanning 5 different taxonomic genera, Kazachstania (K. naganishii and K. africana), Naumovozyma (N. dairenensis and N. castellii), Nakaseomyces (C. glabrata), Tetrapisispora (T. blattae and T. phaffii) and Vanderwaltozyma (V. polyspora), do not encode any homolog of the S. cerevisiae QDR3 gene (Figure. 24). Although the cluster G members encoded in the yeast species classified in the Saccharomyces genus share synteny with those encoded in the protoploid Saccharomycetaceae species, the absence of these genes in so many successive genera suggests that the cluster G members were lost immediately after the WGD event and were re-acquired by the ancestral yeast that originated the Saccharomyces species. The presence of gene synteny between the pre- and post-WGD species is plausibly explained by the presence of adjacent genes together with the cluster G members in the DNA fragment transferred from a protoploid yeast species to the ancestral Saccharomyces yeast. The previously published study on the DHA1 genes in the Saccharomycetaceae family did not identify the lateral transfer of the cluster G members between these two groups of yeast species because the number of hemiascomycetes strains classified in this taxonomic family with publicly available genome sequences was very low at that point in time, resulting in low power and resolution during the characterization of these DHA1 genes (Figure. 24).

Lineage 4 comprises twenty-two members of the phylogenetic cluster J, although its S. cerevisiae member, the HOL1 gene, is not very well biochemically characterized. The cluster J members encoded in the genomes of the protoploid Saccharomycetaceae species reside in a highly conserved chromosome region (Figure. 24).

Three tandem repeats comprising cluster J members were identified in the genomes of the E. gossypii and L. kluyveri species and in the yeast isolate known as Ashbya aceri, a member of the Eremothecium genus. The genomes of the remaining pre-WGD species analysed in this work conserve a sole cluster J member. The cluster J member encoded in the genome of the Torulaspora delbrueckii CBS 1146 strain does not share strong synteny with the remaining DHA1 genes in the protoploid yeast species.

As reported in the previously published study focusing on the evolution of the DHA1 genes in the Saccharomycetaceae family, lineage 4 shows an evident gap between the pre- and post-WGD species. The lack of homologs of the S. cerevisiae HOL1 gene in six successive taxonomic genera, spanning protoploid species of the Zygosaccharomyces genus and post-WGD species from the early-divergent Vanderwaltozyma genus to the Kazachstania genus, reinforces the proposed hypothesis that the ancestral yeast originating the Saccharomyces species acquired a cluster J member through a horizontal gene transfer event (Dias et al. 2010a). The possibility of rapid gene loss during and after the WGD event cannot be ruled out. However, the occurrence of at least six consecutive and independent gene loss events seems very unlikely (Figure. 24).
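The parsimony argument behind this conclusion can be made explicit with a small Python sketch that counts the minimum number of gene losses under a loss-only model, assuming the gene was present in the common ancestor. The tree encoding, the function names and, in particular, the ladder-like branching order used below are simplifying assumptions made for illustration; they are not the phylogeny inferred in this thesis.

```python
def count_losses(node):
    """Return (clade_lacks_gene, losses_inside) for a subtree.

    A leaf is a (name, has_gene) pair; an internal node is a tuple of child nodes.
    """
    if isinstance(node[0], str):               # leaf
        return (not node[1], 0)
    results = [count_losses(child) for child in node]
    if all(lacks for lacks, _ in results):
        return (True, 0)                       # counted as one loss higher up the tree
    losses = sum(n for _, n in results)
    losses += sum(1 for lacks, _ in results if lacks)
    return (False, losses)


def minimum_losses(tree):
    """Minimum number of independent losses if the root ancestor had the gene."""
    lacks, losses = count_losses(tree)
    return 1 if lacks else losses


ABSENT = ["Zygosaccharomyces", "Vanderwaltozyma", "Tetrapisispora",
          "Naumovozyma", "Nakaseomyces", "Kazachstania"]

# Assumed ladder: each genus lacking the gene branches off separately.
ladder = ("Saccharomyces", True)
for genus in reversed(ABSENT):
    ladder = ((genus, False), ladder)
ladder = (("protoploid species with HOL1 homologs", True), ladder)

# Alternative assumption: the six genera form a single clade.
single_clade = (("protoploid species with HOL1 homologs", True),
                (tuple((genus, False) for genus in ABSENT), ("Saccharomyces", True)))
```

Under the ladder assumption the loss-only explanation needs six independent losses, whereas a single loss would suffice only if the six genera formed one clade, which is why the loss-only scenario is considered less parsimonious than one transfer event.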


Figure. 24 Lineage 3 and 4 (homologs of S. cerevisiae QDR3 and HOL1 genes). Conventions as in Figure. 21.


4.7.4. Phylogenetic and gene neighbourhood analyses of the homologs of the S. cerevisiae TPO2/TPO3 and YHK8 genes encoded in the genomes of the Saccharomycetaceae species.

Cluster N1 (TPO2p, TPO3p) is composed of 40 full-size DHA1 genes, three fragmented genes and one corrected gene (Figure. 25). The S. cerevisiae TPO2 and TPO3 genes have been reported as ohnologs. The origin of these two genes was assigned to the WGD event, where two distinct DHA1 sublineages were formed. The gene neighbourhood analysis performed in this work raised some questions about this conclusion. The inclusion of more post-WGD species belonging to taxonomic genera not represented in the first published study showed that, in the sublineage leading to the TPO3 gene, the genomes of yeast species classified in the Kazachstania (K. naganishii and K. africana), Nakaseomyces (C. glabrata), Tetrapisispora (T. blattae and T. phaffii) and Vanderwaltozyma genera do not encode orthologs of this S. cerevisiae gene. Among the yeast strains belonging to the Naumovozyma genus analysed in this study, both N. castellii strains, CBS 4309 and NRRL Y-12630, encode a TPO3 ortholog, whereas the N. dairenensis CBS 421 strain lacks one (Figure. 25).

Two evolutionary scenarios can explain these results. The first scenario consists of the S. cerevisiae TPO2 and TPO3 genes being true ohnologs and in the evolutionary history of the TPO3 gene, there has been a consistent gene loss of TPO3 orthologs. This scenario requires accepting a high number of consecutive and independent gene loss events. The second scenario consists of the rejection of the S. cerevisiae TPO2 and TPO3 genes being true ohnologs.

The synteny analyses in Figure. 26 supported the first scenario, since the genes in the neighbourhood of each sublineage after the WGD were clearly distinct. In the case of the sublineage comprising the TPO3 orthologs, the clusters of amino acid sequence similarity used to determine its orthology were 13470, 13467, 37361 and 13468 (Figure. 26). In the case of the orthologs of the TPO2 gene, the clusters of amino acid sequence similarity used to determine orthology were 1219, 259 and 13471 (Figure. 26). In addition, this approach allowed a clear and unambiguous determination of the members of each sublineage (TPO2/TPO3). Other aspects, such as subtelomeric location, were also checked, but provided no additional information to explain this case. Indeed, one-quarter of the genes presented are subtelomeric, indicating some chromosomal mobility of these TPO2 and TPO3 homologs during the evolution of the species (Figure. 25). Although the synteny analyses supported the first scenario, the phylogenetic analyses suggested the second scenario as more probable, which leaves some doubt about which scenario is correct (annex V, Figure. 33).
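The way neighbour clusters can discriminate between the two sublineages is sketched below in Python. The cluster identifiers are those quoted in the text, but the function `assign_sublineage`, the tie-breaking rule and the example neighbourhoods are hypothetical and do not reproduce the procedure or data actually used in this work.

```python
# Diagnostic clusters of amino acid sequence similarity quoted in the text.
TPO3_CLUSTERS = {13470, 13467, 37361, 13468}
TPO2_CLUSTERS = {1219, 259, 13471}


def assign_sublineage(neighbour_clusters):
    """Assign a gene to the TPO2 or TPO3 sublineage from its neighbours.

    `neighbour_clusters` is the set of similarity-cluster identifiers of the
    genes flanking the query gene. The sublineage sharing more diagnostic
    clusters wins; ties (including no shared cluster) are left unresolved.
    """
    shared_tpo2 = len(neighbour_clusters & TPO2_CLUSTERS)
    shared_tpo3 = len(neighbour_clusters & TPO3_CLUSTERS)
    if shared_tpo2 == shared_tpo3:
        return "unresolved"
    return "TPO2" if shared_tpo2 > shared_tpo3 else "TPO3"


# Hypothetical neighbourhoods; cluster 99999 stands for an unrelated neighbour.
examples = {
    "gene_x": {1219, 259, 99999},
    "gene_y": {13467, 13468, 37361},
    "gene_z": {99999},
}
assignments = {gene: assign_sublineage(clusters) for gene, clusters in examples.items()}
```

A gene whose neighbourhood shares no diagnostic cluster with either set, such as a relocated subtelomeric copy, remains unresolved and would need phylogenetic evidence instead.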

The majority of the genomes of the remaining post-WGD species encode a single ortholog of the S. cerevisiae TPO2 gene. However, the genome of the S. bayanus MCYC 623 strain encodes two translated ORFs showing amino acid sequence similarity with the S. cerevisiae TPO2 gene. One of these is a full-size cluster N1 member (whose sequence required correction) sharing weak synteny with the remaining TPO2 orthologs, and the other is a truncated cluster N1 fragment sharing a high number of synteny connections with the remaining TPO2 orthologs.

With the exception of E. cymbalariae, whose genome lacks a DHA1 gene classified in the phylogenetic cluster N1, the genome sequence of each parental protoploid Saccharomycetaceae species encodes a single member of this phylogenetic cluster. Due to its polyploid nature, as expected, the genome of the Z. bailii ISA1307 hybrid strain encodes two orthologs of the S. cerevisiae TPO2/TPO3 genes.

Figure. 25 Lineage 8 and 5 (homologs of S. cerevisiae YHK8 and TPO2/TPO3 genes). Conventions as in Figure. 21.


[Figure 26 panel labels: TPO2 sublineage; TPO3 sublineage; pre-WGD; post-WGD.]

Figure. 26 Gene neighbourhood comparison of the TPO2 and TPO3 sublineages of genes in the genome of 32 yeast strains belonging to the Saccharomycetaceae. The TPO2 and TPO3 homologous genes are represented in central boxes flanked by the corresponding neighbour genes highlighted at different colours according to the protein sequence cluster. White boxes represent genes without homologous neighbours in the represented chromosome region. In order to simplify the gene neighbourhood comparison, the retrieved 15 neighbour genes for each adjacent region of query genes were reduced to 5 neighbour genes here shown and the obtained scheme divided according to the sublineages identified. The neighbours’ order in each chromosome was scrutinized by exploring the synteny network in the Cytoscape software.


Lineage 8 comprises 21 members of the phylogenetic cluster R, with the S. cerevisiae YHK8 being the most well-characterized gene encoding this prototype of DHA1 transporter. The analysis of the distribution of the cluster R members in the protoploid yeast species classified in the taxonomic genera closest to the WGD event, the Zygosaccharomyces, Torulaspora and Lachancea genera, revealed that only three protoploid yeast species, Z. rouxii, L. thermotolerans and L. kluyveri, encode cluster R members. This fact leads to the following three conclusions. First, although one Zygosaccharomyces species encodes a cluster R member, the other Zygosaccharomyces species present in the hemiascomycetes strains analysed in this study, Z. bailii, encodes none. Second, the only yeast species analysed from the Torulaspora genus, T. delbrueckii, also lacks a cluster R member. Third, although two Lachancea species encode a cluster R member, the other Lachancea species present in the hemiascomycetes strains analysed in this study, L. waltii, encodes none (Figure. 25).

The sparseness of cluster R members in the genomes of the Zygosaccharomyces, Torulaspora and Lachancea genera, and the fact that the protoploid yeast species belonging to the two earliest-diverging genera in the Saccharomycetaceae taxonomic family, the Eremothecium and Kluyveromyces genera, do not encode any DHA1 transporter classified in this phylogenetic cluster, suggest that the most plausible origin of these DHA1 genes was lateral gene transfer. Since it is well known that the Saccharomycodaceae is more closely related to the Saccharomycetaceae yeast species than to the remaining hemiascomycetes strains classified in other taxonomic families, forming a clade named the ’Saccharomyces complex’ (Kurtzman 2003), and since the Saccharomycodaceae species H. valbyensis and the great majority of the early- and late-divergent hemiascomycetes species encode cluster R members in their genomes, it is plausible that the loss of this particular DHA1 gene occurred in the ancestral yeast that originated all the Saccharomycetaceae species.

Regarding the yeast species evolving after the WGD event, the analysis of the corresponding genome sequences shows that all of them encode a single cluster R member. In contrast with the suggestion proposed in the previously published study (Dias et al. 2010a), the evaluation of the chromosome environment where the cluster R member resides in the post-WGD species supports the existence of a single DHA1 gene sublineage.

The synteny evidence supporting this new hypothesis is shown in tabular format in Figure. 27. The colours and numbers identify the cluster of amino acid sequence similarity to which each neighbour of the DHA1 genes encoding cluster R proteins belongs, and show that the synteny evidence linking the cluster R encoding genes in the post-WGD species is very strong.

On the other hand, the cluster R member encoded in the T. phaffii genome sequence shares poor synteny with the remaining genes of lineage 8. In the newly proposed evolutionary scenario, the cluster R member encoded in the genome of C. glabrata is an ortholog of the S. cerevisiae YHK8 gene instead of a paralog. The phylogenetic analysis provides support for this new hypothesis, since the cluster R member encoded in the genome of the C. glabrata species clusters close to those encoded in T. phaffii.


[Figure 27 panel labels: pre-WGD; post-WGD.]

Figure. 27 Gene neighbourhood analyses of YHK8 homologous genes in the genome of 29 yeast species belonging to the Saccharomycetaceae. The YHK8 homologous genes are represented in central boxes flanked by the corresponding neighbour genes highlighted in different colours according to the protein sequence cluster. White boxes represent genes without homologous neighbours in the represented chromosome region. In order to simplify the gene neighbourhood environment analyses, the retrieved 15 neighbour genes for each adjacent region of query genes were reduced to 5 neighbour genes. The neighbours’ order in each chromosome was scrutinized by exploring the synteny network in the Cytoscape software.


4.7.5. Phylogenetic and gene neighbourhood analyses of the homologs of the S. cerevisiae TPO1 gene encoded in the genomes of the Saccharomycetaceae species.

Lineage 7 comprises a total of 43 genes encoding members of the phylogenetic cluster P, with 42 corresponding to full-size DHA1 transporters and 1 to a protein fragment. The S. cerevisiae member of phylogenetic cluster P is the TPO1 gene, known to encode a multidrug efflux pump responsible for providing yeast resistance to a wide range of different and chemically unrelated chemical compounds.

With the exception of the S. bayanus MCYC 623 strain and the Eremothecium species, all yeast strains analysed in this work encode at least one cluster P member in their genome sequences. Regarding the absence of a homolog of the S. cerevisiae TPO1 gene in the S. bayanus strain, the low coverage and the high number of scaffolds present in its released genome sequence suggest that this could have been caused by a sequencing artefact during the construction of the genome library or during the assembly phase. On the other hand, the absence of a homolog of the S. cerevisiae TPO1 gene in the E. gossypii ATCC 10895, E. cymbalariae DBVPG 7215 and Ashbya aceri strains provides support for the conclusion that the genomes of all yeast species classified in the Eremothecium genus do not encode a cluster P member (Figure. 28).

The reconstruction of the evolutionary history of the homologs of the S. cerevisiae TPO1 gene was consistent with the one previously proposed by Dias et al. (2010). The cluster P members encoded in the genomes of the protoploid Saccharomycetaceae species reside in a highly conserved chromosome environment. Interestingly, the Zygosaccharomyces strains show a heterogeneous composition regarding these efflux pumps: the genomes of the Z. bailii ISA1307 hybrid, the Z. bailii IST302 and the Z. bailii CLIB213 strains encode 4, 2 and 1 cluster P members, respectively. The high number of TPO1 homolog genes encoded in the Z. bailii ISA1307 hybrid strain is consistent with the polyploid nature of its genome sequence.

The gene neighbourhood analysis shows that the WGD event originated two sublineages, each comprising the ohnologs of the DHA1 genes in the other sublineage. With the exception of T. blattae, the yeast species classified in the Vanderwaltozyma, Tetrapisispora and Nakaseomyces genera encode both cluster P ohnologs. In the yeast species classified in the Naumovozyma genus, while the genome of N. dairenensis retained the ortholog of the S. cerevisiae TPO1 gene, the genomes of the two N. castellii strains retained the corresponding ohnolog (in the NRRL Y-12630 strain, a local gene duplication originated a tandem repeat of cluster P members). In the yeast species classified in the Kazachstania genus, while the genome of K. africana encodes both cluster P ohnologs, the genome of K. naganishii encodes only the ortholog of the TPO1 gene. The genomes of the yeast species classified in the Saccharomyces genus encode only orthologs of the S. cerevisiae TPO1 gene.


Figure. 28 Lineage 7 (homologs of the S. cerevisiae TPO1 gene). Conventions as in Figure. 21.


5. Discussion and conclusions

The main goal of this MSc thesis was to reconstruct the evolution of the genes encoding drug: H+ antiporters of family 1 using 33 yeast strains classified in the Saccharomycetaceae taxonomic family. In addition, the characterization of the diversity of the DAG family members encoded in the genomes of 78 hemiascomycetes strains was also attempted. The pursuit of these two objectives was based on two local databases developed by the BSRG group in the past five years. The first database comprises the genomic information gathered on these yeast ORFs and, therefore, is named Genome DB. The second database comprises pairwise amino acid sequence similarity information obtained using an all-against-all comparison between these yeast ORFs, and is named Blastp DB.

Due to the high number of DHA1 genes identified in the yeast strains under analysis in this work (1382 proteins from 94 hemiascomycetous strains), it was decided to limit the scope of this study and only attempt to extend the published studies performed by Dias and co-workers (2010, 2014), focusing only on the Saccharomycetaceae taxonomic family. This previous study combined classical phylogenetic tree building methods with gene neighbourhood analysis, allowing the reconstruction of the evolutionary history of the DHA1 genes in 13 Hemiascomycetes species (Dias et al., 2010). The results published in the 2010 paper have become outdated for two reasons. First, the number of yeast species classified in the Saccharomycetaceae taxonomic family with publicly available genome sequences was very small when the work underlying the 2010 paper was performed, leading to low power and resolution of the phylogenetic analysis of the DHA1 proteins. With the release of the genomes of newly sequenced yeast strains and species, and with the consequent update of the Genome DB, it is now possible to compare the results obtained in these two studies, separated by 6 years, and to assess how much the inferred evolutionary histories of the DHA1 genes have benefited from considering a much higher number of Saccharomycetaceae genome sequences in this MSc thesis. Second, the gene neighbourhood data used to reconstruct the evolution of the DHA1 genes was gathered from two different sources of evidence on yeast evolutionary comparative genomics: the Génolevures (Martin, Sherman and Durrens 2011) and the YGOB (Byrne and Wolfe 2006) databases. The lack of homogeneous gene neighbourhood data and the lack of direct access to the raw genomic evidence that these two databases use to identify synteny between yeast genomes lead, in certain cases, to inconsistent analyses and, consequently, to incorrect conclusions. This limitation was solved by the development of a unique and consistent source of gene neighbourhood evidence compiled in the above-mentioned local Genome and Blastp databases.

In the previous studies focusing on the evolution of the DHA1 and DAG genes and in this MSc thesis, the members of each of these gene families were identified by constraining and traversing a blastp pairwise similarity network (Dias et al., 2013; Dias et al., 2014). However, although the past studies were able to identify the DHA1 and DAG proteins using a single member of each of these two families as the starting node for network traversal, in this MSc thesis it was necessary to use one reference gene for each cluster identified in the previous phylogenetic analyses for the traversal of the blastp network. These multiple network traversals from different starting nodes were complemented by the use of more stringent e-values to constrain the blastp network. The fundamental reason for the necessity of multiple rounds of network traversal from different members of each of these gene families was the increase in the size of the Genome DB and, consequently, of the Blastp DB. The increase in the size of these databases led to more unspecific blastp pairwise hits and to the joint retrieval of the DHA1 and DAG proteins. It has been proposed by Saier et al (2006) that the 12-spanner MFS proteins gave rise to the 14-spanner MFS proteins through the duplication of the two central TMSs and, therefore, the joint retrieval of these two gene families at low-stringency e-values is reasonable. However, when the blastp network is constrained at high stringency levels, the separate retrieval of all members of each of these gene families would be expected. These results suggest that, in the future, as the size of the Genome DB increases with the addition of new yeast genomes, the gene families under study will require a preliminary phylogenetic analysis to identify key proteins to use as starting nodes for the network traversal, in order to ensure that the complete set of family members is gathered and to avoid retrieving many false positive proteins.
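The constrained traversal described above can be illustrated with a minimal sketch. It assumes the blastp hits are available as (query, subject, e-value) triples; the function name `traverse_network`, the protein identifiers and the e-value thresholds are hypothetical and do not reproduce the actual Blastp DB schema or the thresholds used in this work.

```python
from collections import deque

def traverse_network(edges, seeds, max_evalue):
    """Return every protein reachable from the seeds through hits
    whose e-value is at or below max_evalue (lower = more stringent)."""
    # Build an undirected graph containing only the hits that pass the constraint.
    graph = {}
    for query, subject, evalue in edges:
        if evalue <= max_evalue:
            graph.setdefault(query, set()).add(subject)
            graph.setdefault(subject, set()).add(query)
    # Breadth-first traversal started from all seed proteins at once.
    visited = set(seeds)
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(neighbour)
    return visited

# Hypothetical toy hits: (query, subject, e-value).
hits = [
    ("dha1_a", "dha1_b", 1e-80),
    ("dha1_b", "dha1_c", 1e-60),
    ("dha1_c", "dag_a", 1e-12),   # weak cross-family hit
    ("dag_a", "dag_b", 1e-90),
    ("dha1_c", "dha1_d", 1e-25),  # weak link between two DHA1 clusters
    ("dha1_d", "dha1_e", 1e-70),
]

loose = traverse_network(hits, ["dha1_a"], 1e-10)
strict_single = traverse_network(hits, ["dha1_a"], 1e-50)
strict_multi = traverse_network(hits, ["dha1_a", "dha1_d"], 1e-50)
```

In this toy network the loose threshold retrieves both families together, the stringent threshold with a single seed misses one DHA1 cluster, and the stringent threshold with one seed per cluster recovers all DHA1 members without the DAG proteins.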

The increase in the number of yeast strains and species analyzed in this MSc thesis, achieved by considering genome sequences determined using NGS methodologies, made it necessary to correct the errors found in the DHA1 gene sequences. These sequence corrections can be classified into four groups: 1) false double genes (a single ORF wrongly annotated as two ORFs residing next to each other), 2) truncated genes (resulting from an annotation error of the YGAP software), 3) frameshifted sequences (originating from the wrongful incorporation of non-existing nucleotides, frequently undetermined and represented by “N”, during genome assembly), and 4) truncated genes resulting from the wrongful incorporation of stop codons during genome assembly. These sequence errors can be corrected using MACSE and similar software suites, although these automated methods require strong prior quality control of the sequences to ensure that the software converges on the correct solution for the sequence error. In this MSc thesis, it was decided to use human judgement rather than automated software to correct these errors, thereby minimizing the possibility of losing DHA1 genes due to the intrinsic inaccuracy of the software or to the complexity of the sequence error itself, which would have a strong impact on the interpretation of the evolutionary results. As a consequence of the correction of several dubious DHA1 fragments, which were concluded to comprise full-size DHA1 transporters, the total number of DHA1 proteins encoded in the genomes of some yeast strains already analysed in the 2010 and 2014 papers increased.
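As an illustration only, candidate sequences for manual inspection could be pre-screened with a few simple checks. The sketch below is not the procedure followed in this work (where the corrections were made by hand); the function name `flag_orf`, the flag labels and the example sequences are hypothetical, and the check uses the three standard stop codons.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def flag_orf(seq):
    """Return a list of warning flags for a nucleotide ORF sequence."""
    seq = seq.upper()
    flags = []
    # Undetermined nucleotides are a frequent source of frameshifts.
    if "N" in seq:
        flags.append("undetermined_nucleotides")
    # A coding sequence should contain a whole number of codons.
    if len(seq) % 3 != 0:
        flags.append("length_not_multiple_of_3")
    # Split into complete codons and look for a stop before the last codon.
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        flags.append("internal_stop_codon")
    return flags

clean = flag_orf("ATGAAATAA")
premature_stop = flag_orf("ATGTAAAAATAA")
suspect_frameshift = flag_orf("ATGNNNAAATA")
```

Sequences returning one or more flags would then be examined manually, for example by comparing them with the orthologs of closely related strains.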

The use of a wide range of 2D-topology prediction software also allowed the number of TMS found in the 1382 full-size DHA1 proteins identified in this study to be established. Interestingly, not all DHA1 transporters seem to obey the presumed rule that all members of this gene family span the cellular membrane in which they are embedded twelve times. All 98 DHA1 transporters residing in phylogenetic cluster J, comprising the homologs of the S. cerevisiae HOL1 protein, were predicted to have only eleven TMS. This result was based on the consistent report of the absence of one TMS in the N-terminal extremity of the HOL1 homologs by six of the software suites used to predict the 2D-topology of these transmembrane proteins (out of a total of ten software suites considered).
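The way a consensus TMS number can be derived from several predictors is sketched below under simple assumptions: each predictor reports one TMS count per protein, and the most frequently reported count is taken as the consensus. The predictor names and the counts are hypothetical placeholders, chosen only to mirror the six-out-of-ten situation described above.

```python
from collections import Counter

def consensus_tms(predictions):
    """Return (consensus TMS count, number of supporting predictors,
    total number of predictors) from a {predictor: count} mapping."""
    counts = Counter(predictions.values())
    tms, support = counts.most_common(1)[0]
    return tms, support, len(predictions)

# Hypothetical predictions for one HOL1 homolog from ten predictors.
hol1_like = {f"predictor_{i}": 11 for i in range(1, 7)}
hol1_like.update({f"predictor_{i}": 12 for i in range(7, 11)})

result = consensus_tms(hol1_like)
```

A simple majority rule is only one possible design choice; a stricter criterion could require agreement on the position of each TMS and not just on their number.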

The construction of the phylogenetic tree allowed the identification of two additional clusters, labelled X and Y, when compared with the last published study focusing on the DHA1 gene family (Dias and Sá-Correia 2014). This result is not unexpected considering that the number of hemiascomycetous genomes has increased since then, spanning 63 additional strains corresponding to 38 additional species, many belonging to taxonomic families not sampled in the 2014 paper. Two phylogenetic clusters, D and K2, already described in the previously published studies (Dias et al. 2010a; Dias and Sá-Correia 2014) but not reported to exist in the Saccharomycetaceae yeasts, were found encoded in the Kluyveromyces wickerhamii, Kluyveromyces aestuarii and Kluyveromyces marxianus species. This result suggests that these cluster D and cluster K1 members could have been acquired by lateral gene transfer.

Regarding the number of DHA1 proteins encoded in the genomes of the Saccharomycetaceae species, some divergences in the number of DHA1 proteins residing in clusters F, P and T (homologs of the S. cerevisiae QDR1/QDR2/AQR1, TPO1, and FLR1, respectively) were also found between yeast species classified in the same taxonomic genus. Although these results were not expected, they can be explained when species of the same genus have a remote common ancestor, and thus do not share a close phylogenetic relationship (Dujon 2010), or when their ecological niches or geographic origins are very distinct.

The number of DHA1 proteins found encoded in the hemiascomycetous yeasts belonging to the CTG complex was high. In some cases, the reason for the abundance of these transporters lies in strong horizontal gene expansions. This phenomenon has already been described in some yeast species (Dias and Sá-Correia, 2014), namely in C. parapsilosis, where thirteen genes were assigned to cluster P (homologs of the S. cerevisiae TPO1 and C. albicans FLU1 genes), and in M. guilliermondii, which was described as having nine genes belonging to cluster T (homologs of the S. cerevisiae FLR1 and C. albicans MDR1 genes). Our study allowed the identification of two additional species where the same phenomenon occurred: P. sorbitophila and C. orthopsilosis. Two reasons can explain these cases of strong horizontal gene expansion: 1) the additional genes may be used as backups of the original FLR1 and TPO1 homologs in these species; 2) these horizontal expansions can be interpreted as a reservoir of genes for the creation of functional novelty. In the future, it would be interesting to test these genes for positive and negative selection in order to detect and understand the genomic variations, as well as the potential functional redundancy associated with them.

In other late-divergent taxonomic families, the average number of DHA1 genes in each phylogenetic cluster was more similar to the number observed in the yeast species of the Saccharomycetaceae family than to the abundance observed in the Debaryomycetaceae species. This is consistent with the phylogenetic tree of the hemiascomycetes presented in Kurtzman (2011), where the major families from this set are closer to the Saccharomycetaceae family than to the Debaryomycetaceae family. The lower number of cluster T and E members (homologs of S. cerevisiae FLR1 and DTR1) found encoded in the genomes of yeast species classified in the other late-divergent families was also remarkable. The early-divergent species analyzed in this work were the most heterogeneous group regarding the average number of DHA1 proteins per cluster (this number ranges from 0.27 to 1.23). This result may be explained by the large sequence divergence between the species of this group. Since this was not the aim of this MSc thesis, this possibility was not evaluated, but it should be addressed in future studies.
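For clarity, the kind of average compared above can be written as a short calculation. The sketch assumes that the average for a cluster is the number of cluster members found in a group of species divided by the number of species in that group; this definition, the function name and the toy data are assumptions made for illustration and are not taken from the results.

```python
from collections import Counter

def mean_members_per_cluster(assignments, n_species):
    """Average number of proteins per species for each cluster.

    assignments: iterable of (species, cluster) pairs, one per protein.
    n_species: number of species in the taxonomic group considered.
    """
    counts = Counter(cluster for _, cluster in assignments)
    return {cluster: n / n_species for cluster, n in sorted(counts.items())}

# Hypothetical group of four species with seven DHA1 proteins.
toy_assignments = [
    ("sp1", "P"), ("sp1", "P"), ("sp2", "P"), ("sp3", "P"), ("sp4", "P"),
    ("sp1", "T"), ("sp3", "T"),
]

averages = mean_members_per_cluster(toy_assignments, n_species=4)
```

Clusters with no member in the group do not appear in the returned mapping and correspond to an average of zero.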

In the last two decades, the accumulation of genome sequences has shown that events of gene duplication and horizontal transfer of genes between species are central to the evolutionary process, making possible the functional innovation of genes and, consequently, allowing species to evolve new phenotypic traits (Scannell, Butler and Wolfe 2007). Consistent with the previous studies focusing on the DHA1 genes (Dias et al., 2010; Dias and Sá-Correia, 2014), gene duplication was a frequent event in the majority of the DHA1 gene lineages reconstructed in this MSc thesis. A few DHA1 gene lineages show the opposite pattern, being “resistant” to gene duplication events, plausibly due to detrimental effects on the organism deriving from gene imbalance. Most of these gene duplications were identified at the WGD event, proposed to have occurred between the divergence of the Zygosaccharomyces and Vanderwaltozyma genera (Dujon 2010). The synteny and phylogenetic analyses of 33 yeast strains performed in this study also suggest the occurrence of HGT in lineage 3 (QDR3 homologs), lineage 4 (HOL1 homologs), lineage 8 (YHK8 homologs) and lineage 10 (FLR1 homologs). The HGT in lineage 4 (cluster J) proposed by Dias et al. (2010) was reinforced by the results gathered in this MSc thesis. The higher number of yeast species considered in this study also allowed the identification of a previously unreported HGT involving the homologs of the S. cerevisiae QDR3 gene (lineage 3). This study also gathered evidence supporting the existence of another potential HGT in the DHA1 gene lineage comprising the homologs of the S. cerevisiae YHK8 gene. In fact, these DHA1 genes are absent from the genomes of the hemiascomycetous species classified in the Eremothecium and Kluyveromyces genera, suggesting that the ancestral yeast that gave rise to the Saccharomycetaceae family lacked a YHK8 homolog. In contrast to the evolutionary scenario proposed by Dias et al. (2010), the strong synteny evidence gathered by the gene neighbourhood analysis performed in this study supports the assignment of ortholog status to the YHK8 homolog encoded in the C. glabrata genome, instead of the ohnolog status previously proposed.

In cluster F two independent sublineages were observed, one with the homologs of the QDR1/QDR2 genes and another with the homologs of AQR1. This revealed an unexpected evolutionary scenario, since the early-divergent post-WGD species showed only one gene in the QDR1/QDR2 sublineage. Two scenarios can explain the evolutionary history of this cluster. In the first, tandem repeat genes were initially created in early post-WGD species, followed by independent gene losses in at least eleven non-sequential species. In the second, one gene was lost in V. polyspora and a local duplication occurred at the base of the Saccharomyces sensu stricto group (in the S. uvarum species) and was passed on to the rest of the group. The gene neighbourhood environment supports both hypotheses equally, since the number of connections and shared neighbours between the tandem repeat genes of Saccharomyces sensu stricto and the remaining genes of the sublineage is approximately the same. Despite this, the loss of at least eleven gene members in eleven strains remains less likely than an initial loss in V. polyspora followed by a local duplication ten strains later. Still in cluster F, strong sequence conservation was identified between each sublineage of QDR1/QDR2 and AQR1 homologs in post-WGD species and one gene of the tandem repeat found in pre-WGD species. Two hypotheses can explain this observation: either the genes of the pre-WGD species were transferred to the post-WGD species by an HGT event or, more probably, convergent evolution occurred independently in the post- and pre-WGD species. The first scenario is supported by the phylogenetic analyses; however, in terms of synteny, it implies the transfer of a block of genes followed by the independent loss of different neighbouring genes, as occurs after a conventional WGD. In addition, this scenario suggests that the two sublineages were not founded in the WGD event. The second hypothesis, the more probable, is supported by both the synteny and the similarity analyses. The synteny analyses showed the WGD event and a highly conserved chromosomal environment across all pre- and post-WGD species, while convergent evolution accounts for the high sequence conservation between these species.
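The notion of shared neighbours used to weigh these hypotheses can be sketched as follows. The example assumes that each chromosome is available as an ordered list of gene identifiers and that neighbouring genes have already been assigned to homolog families; the window size, the function names, the gene identifiers and the family labels are hypothetical and do not reproduce the actual Genome DB and Blastp DB queries.

```python
def neighbourhood(gene_order, gene, window):
    """Genes located up to `window` positions on each side of `gene`."""
    i = gene_order.index(gene)
    left = gene_order[max(0, i - window):i]
    right = gene_order[i + 1:i + 1 + window]
    return left + right

def shared_neighbours(order_a, gene_a, order_b, gene_b, family_of, window=5):
    """Homolog families present in the neighbourhoods of both genes."""
    fam_a = {family_of[g] for g in neighbourhood(order_a, gene_a, window)
             if g in family_of}
    fam_b = {family_of[g] for g in neighbourhood(order_b, gene_b, window)
             if g in family_of}
    return fam_a & fam_b

# Hypothetical gene orders around two cluster F homologs.
order_a = ["a1", "a2", "a3", "qdr_a", "a4", "a5"]
order_b = ["b1", "b2", "qdr_b", "b3", "b4"]
# Hypothetical homolog family of each neighbouring gene.
family_of = {"a2": "F1", "a3": "F2", "a4": "F3",
             "b2": "F2", "b3": "F3", "b4": "F9"}

shared = shared_neighbours(order_a, "qdr_a", order_b, "qdr_b", family_of, window=2)
```

The larger the set of shared families, the stronger the synteny evidence that the two genes reside in the same ancestral chromosomal environment.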

Cluster N1 is composed of two independent sublineages, the first comprising the homologs of the TPO2 gene and the second comprising the homologs of the TPO3 gene. The evolutionary origin of both sublineages raised some doubts and opened the possibility of a scenario distinct from the WGD scenario to explain this cluster of genes. To explain lineage 5, two scenarios were proposed. In the first, the S. cerevisiae TPO2 and TPO3 genes are true ohnologs originating in the WGD event; in the second, the S. cerevisiae TPO2 and TPO3 genes are rejected as true ohnologs. The first scenario was coherent with the synteny analyses; however, the phylogenetic analysis pointed in a different direction. This scenario assumes that, in the evolutionary history of the TPO3 gene, there was a consistent loss of TPO3 homologs in some genera, with the gene being kept only in N. castellii and in the Saccharomyces sensu stricto species. The second hypothesis is consistent with the phylogenetic analyses and assumes that TPO2 and TPO3 were not founded in the WGD event. As an alternative within this second scenario, both sublineages may have originated from a local duplication in the N. castellii species, with the duplicate subsequently lost in the Kazachstania genus and retained in the Saccharomyces sensu stricto species. This scenario involves a subfunctionalization of the TPO2 and TPO3 homologs after the local duplication and indirectly considers the genes of both sublineages to be paralogs instead of ohnologs. Although possible, this hypothesis implicitly requires a loss of neighbouring genes in both sublineages similar to the one that follows the WGD event.

The remaining lineages of our study, with the exception of lineage 10, which shows a general absence of synteny between the FLR1 homologs, showed a high level of conservation in their chromosomal environments. The total lack of synteny in cluster T, together with the non-conventional WGD pattern, makes this lineage one of the most interesting studied here. In addition, both the similarity and the synteny analyses support the idea of at least five HGT occurrences in this cluster, which is coherent with the subtelomeric localization of most of the genes comprised in this cluster. The subtelomeric localization of these genes may also be an important factor in the high genomic mobility of the genes present in cluster T.

Apart from that, and considering the HGT events, a first point is that the cluster T members of yeast species classified in the Saccharomyces sensu stricto group and the translated ORF tebl_1_a04260 show strong synteny. This is not coherent with the phylogenetic evidence that assigned the cluster T members of the Lachancea yeast species as the ancestors of the FLR1 homologs classified in Saccharomyces sensu stricto. Three scenarios can explain this evolutionary history. The first scenario assumes that the genes of the Saccharomyces sensu stricto and T. blattae species diverged considerably in terms of nucleotide sequence similarity while keeping their gene neighbours. Alternatively, the block of neighbouring genes that establishes the synteny between these species may have been transferred, independently of the FLR1 homolog, from T. blattae to the Saccharomyces sensu stricto group. The third scenario considers the complete loss of neighbouring genes between the Lachancea and Saccharomyces sensu stricto species, while the nucleotide sequence conservation was kept.

The same question of HGT is raised for the cluster T members of species classified in the Tetrapisispora and Vanderwaltozyma genera (tode_1 and vapo_1). There is a complete lack of synteny between the FLR1 homolog genes (tebl_1_a04260, tebl_1_e02500, tebl_1_e02520, tebl_1_c07220; teph_1_o02010, teph_1_a00130, teph_1_i03320). Interestingly, the phylogenetic analyses keep all of these genes together in the phylogenetic tree. The phylogenetic analyses are coherent with the species tree of Figure 2; however, this implies a complete loss of the neighbouring genes of the FLR1 homologs in these species. On the other hand, the gain of these genes by HGT from other hemiascomycetous species was still possible, and this hypothesis is coherent with the synteny analyses. Although possible, this last scenario is less probable, since it implies a high divergence of the nucleotide sequence between closely related species. The phylogenetic analyses of the Naumovozyma castellii species showed a controversial case of evolutionary history in cluster T. As expected, the FLR1 homolog genes of Naumovozyma dairenensis (nada_1; nada_1_a07730) are clustered with the Kazachstania genes (kaaf_1 and kana_1). On the other hand, the FLR1 homolog genes of Naumovozyma castellii (naca_1 and naca_2) are grouped with the species of the Tetrapisispora and Vanderwaltozyma genera. These facts suggest that these genes of the Naumovozyma castellii species could have been acquired by HGT from Tetrapisispora. Although the phylogenetic analyses support this scenario, it was not possible to corroborate it through synteny, since a total absence of gene neighbourhood connections between the genes of these species was observed. The FLR1 homolog genes in the C. glabrata species (cagl_1_h06017g, cagl_1_h06039g; cagl_2_3_c02180, cagl_2_3_c02170) only showed synteny within their own subcluster. This, together with the phylogenetic analyses that placed the cagl_1_h06039g ORF as an outgroup of the phylogenetic tree and kept the remaining ORFs closer, suggests that these genes were possibly acquired by HGT. Considering the early-divergent Saccharomycetaceae genera, Eremothecium and Kluyveromyces, a total lack of these genes in these species was verified. On the other hand, the sole yeast species classified in the Saccharomycodaceae taxonomic family, Hanseniaspora valbyensis, a sister family of the Saccharomycetaceae family within the Saccharomyces complex, also lacks an FLR1 homolog gene (see section 4.4.1 of the results). Considering the lack of these genes in these species, the Saccharomyces complex group of species probably acquired the FLR1 homolog genes by HGT from another taxonomic family.

The construction of a phylogenetic tree representing the full-size DAG proteins allowed the classification of the gathered transporters into twenty clusters. The analysis of this phylogenetic tree showed that the DAG members of the phylogenetic clusters B (SGE1/AZR1/VBA3/VBA5), C (VBA1/VBA2), D (VBA4), E (ATR1/YMR279C) and P (Arn3) were found in the majority of the hemiascomycetous species considered in this MSc thesis. In addition, the DAG members of the phylogenetic cluster S, containing the S. cerevisiae glutathione exchangers (GEX1/GEX2), were found encoded only in the genomes of yeasts belonging to the Saccharomycetaceae family. The number of DAG members of cluster E (homologs of the S. cerevisiae ATR1/YMR279C genes) found encoded in the Saccharomycetaceae genomes was also high, spanning 49 of the total of 71 members identified. As previously reported (Dias et al., 2013), the members of cluster K, comprising the translated ORF caal_1_19.7336 encoded in the C. albicans genome, were absent from the Saccharomycetaceae species, although they were found encoded in the genomes of yeast species belonging to the Debaryomycetaceae family and other late-divergent families. The translated ORF caal_1_19.7336 has been described in the literature as medium spider biofilm-induced (Nobile et al. 2012), suggesting a role for the members of phylogenetic cluster K in yeast pathogenicity. Dias and Sá-Correia (2013) proposed multiple events of lateral gene transfer between species involving DAG members of the phylogenetic clusters B, C, F, J, N, S, and T. The higher number of yeast species considered in this MSc thesis allowed the validation of the cases of HGT in clusters B, F, J, N, and S. Furthermore, this work suggests the existence of new cases of HGT involving DAG genes in the phylogenetic clusters C, E, F, J, and T that, in the future, should be confirmed using the more resolving analyses based on synteny.

Overall, the results obtained in this MSc thesis shed light on some interesting evolutionary patterns involving the DHA1 genes. The detailed analysis of the evolution of this gene family in the Saccharomycetaceae taxonomic family allowed the understanding of dubious cases that could not be resolved with the tools and genome sequences available at the time of the 2010 paper. Here, the gene neighbourhood analyses clearly proved to be one of the most powerful methodologies for understanding the evolution of the DHA1 gene family. Notwithstanding, in the future, the DHA1 and DAG members of some specific phylogenetic clusters should be submitted to more detailed analyses using methods able to evaluate cases of positive and negative selection, with the goal of linking specific amino acids to the functional diversification of these proteins in cases where solid evidence exists that the encoding genes originated from a duplication or transfer event. This could enhance our understanding of the physiological role of specific DHA1 and DAG transporters and open the door to using the phenotypic features encoded by these important genes in Biotechnological, Pharmaceutical and Clinical applications.

6. Bibliography

Abascal F, Zardoya R, Telford MJ. TranslatorX: Multiple alignment of nucleotide sequences guided by amino acid translations. Nucleic Acids Res 2010;38:7–13.

Abramson J, Smirnova I, Kasho V et al. Structure and mechanism of the lactose permease of Escherichia coli. Science 2003;301:610–5.

Alarco AM, Balan I, Talibi D et al. Ap1-mediated multidrug resistance in Saccharomyces cerevisiae requires FLR1 encoding a transporter of the major facilitator superfamily. J Biol Chem 1997;272:19304–13.

Albertsen M, Bellahn I, Krämer R et al. Localization and function of the yeast multidrug transporter Tpo1p. J Biol Chem 2003;278:12820–5.

Alcoba-Flórez J, Méndez-Álvarez S, Cano J et al. Phenotypic and molecular characterization of Candida nivariensis sp. nov., a possible new opportunistic fungus. J Clin Microbiol 2005;43:4107–11.

Alenquer M, Tenreiro S, Sá-Correia I. Adaptive response to the antimalarial drug artesunate in yeast involves Pdr1p/Pdr3p-mediated transcriptional activation of the resistance determinants TPO1 and PDR5. FEMS Yeast Res 2006;6:1130–9.

Almeida P, Gonçalves C, Teixeira S et al. A Gondwanan imprint on global diversity and domestication of wine and cider yeast Saccharomyces uvarum. Nat Commun 2014;5:4044.

Altman RB. Editorial: Building successful biological databases. Brief Bioinform 2004;5:4–5.

Altschul SF, Wootton JC, Zaslavsky E et al. The construction and use of log-odds substitution scores for multiple sequence alignment. PLoS Comput Biol 2010;6:e1000852.

Anderson HW. Yeast-Like Fungi of the Human Intestinal Tract. J Infect Dis 1917;21:341–86.

Anisimova M, Gascuel O. Approximate Likelihood-Ratio Test for Branches: A Fast, Accurate, and Powerful Alternative. Syst Biol 2006;55:539–52.

Arnaud MB, Costanzo MC, Shah P et al. Gene Ontology and the annotation of pathogen genomes: the case of Candida albicans. Trends Microbiol 2009;17:295–303.

Babrzadeh F, Jalili R, Wang C et al. Whole-genome sequencing of the efficient industrial fuel-ethanol fermentative Saccharomyces cerevisiae strain CAT-1. Mol Genet Genomics 2012;287:485–94.

Barker KS, Pearson MM, Rogers PD. Identification of genes differentially expressed in association with reduced azole susceptibility in Saccharomyces cerevisiae. J Antimicrob Chemother 2003;51:1131–40.

Bauer BE, Rossington D, Mollapour M et al. Weak organic acid stress inhibits aromatic amino acid uptake by yeast, causing a strong influence of amino acid auxotrophies on the phenotypes of membrane transporter mutants. Eur J Biochem 2003;270:3189–95.

Bednarski W, Adamczak M, Tomasik J et al. Application of oil refinery waste in the biosynthesis of glycolipids by yeast. Bioresour Technol 2004;95:15–8.

Benson DA, Clark K, Karsch-Mizrachi I et al. GenBank. Nucleic Acids Res 2015;43:D30.

Bernsel A, Viklund H, Falk J et al. Prediction of membrane-protein topology from first principles. Proc Natl Acad Sci U S A 2008;105:7177–81.

Bernsel A, Viklund H, Hennerdal A et al. TOPCONS: Consensus prediction of membrane protein topology. Nucleic Acids Res 2009;37:465–8.

Blomqvist J, Eberhard T, Schnürer J et al. Fermentation characteristics of Dekkera bruxellensis strains. Appl Microbiol Biotechnol 2010;87:1487–97.

Borneman AR, Desany BA, Riches D et al. Whole-genome comparison reveals novel genetic elements that characterize the genome of industrial strains of Saccharomyces cerevisiae. PLoS Genet 2011;7:e1001287.

Boynton PJ, Greig D. The ecology and evolution of non-domesticated Saccharomyces species. Yeast 2014;31:449–62.

Bozdag GO, Uluisik I, Gulculer GS et al. Roles of ATR1 paralogs YMR279c and YOR378w in boron stress tolerance. Biochem Biophys Res Commun 2011;409:748–51.

Broco N, Tenreiro S, Viegas CA et al. FLR1 gene (ORF YBR008c) is required for benomyl and methotrexate resistance in Saccharomyces cerevisiae and its benomyl-induced expression is dependent on pdr3 transcriptional regulator. Yeast 1999;15:1595–608.

Brown SD, Klingeman DM, Johnson CM et al. Genome Sequences of Industrially Relevant Saccharomyces cerevisiae Strain M3707, Isolated from a Sample of Distillers Yeast and Four Haploid Derivatives. Genome Announc 2013;1:1–2.

Budtz-Jörgensen E. Etiology, pathogenesis, therapy, and prophylaxis of oral yeast infections. Acta Odontol Scand 1990;48:61–9.

Butcher RA, Bhullar BS, Perlstein EO et al. Microarray-based method for monitoring yeast overexpression strains reveals small-molecule targets in TOR pathway. Nat Chem Biol 2006;2:103–9.

Byrne KP, Wolfe KH. The Yeast Gene Order Browser: combining curated homology and syntenic context reveals gene fate in polyploid species. Genome Res 2005;15:1456–61.

Byrne KP, Wolfe KH. Visualizing syntenic relationships among the hemiascomycetes with the Yeast Gene Order Browser. Nucleic Acids Res 2006;34:D452–5.

Cabrito TR, Teixeira MC, Duarte AA et al. Heterologous expression of a Tpo1 homolog from Arabidopsis thaliana confers resistance to the herbicide 2,4-D and other chemical stresses in yeast. Appl Microbiol Biotechnol 2009;84:927–36.

Cannon RD, Lamping E, Holmes AR et al. Efflux-mediated antifungal drug resistance. Clin Microbiol Rev 2009;22:291–321.

Cereghino GPL, Cregg JM. Applications of yeast in biotechnology: Protein production and genetic analysis. Curr Opin Biotechnol 1999;10:422–7.

Cereghino JL, Cregg JM. Heterologous protein expression in the methylotrophic yeast Pichia pastoris. FEMS Microbiol Rev 2000;24:45–66.

Chang J-M, Di Tommaso P, Taly J-F et al. Accurate multiple sequence alignment of transmembrane proteins with PSI-Coffee. BMC Bioinformatics 2012;13:S1.

Chen KH, Miyazaki T, Tsai HF et al. The bZip transcription factor Cgap1p is involved in multidrug resistance and required for activation of multidrug transporter gene CgFLR1 in Candida glabrata. Gene 2007;386:63–72.

Chen S-F, Lo SF, Chang C-F et al. Tetrapisispora taiwanensis sp. nov. and Tetrapisispora pingtungensis sp. nov., two ascosporogenous yeast species isolated from soil. Int J Syst Evol Microbiol 2013;63:2351–5.

Chen Y, Yu P, Luo J et al. Secreted protein prediction system combining CJ-SPHMM, TMHMM, and PSORT. Mamm Genome 2003;14:859–65.

Chiasson DM, Loughlin PC, Mazurkiewicz D et al. Soybean SAT1 (Symbiotic Ammonium Transporter 1) encodes a bHLH transcription factor involved in nodule growth and NH4+ transport. Proc Natl Acad Sci U S A 2014;111:4814–9.

Claros MG, von Heijne G. TopPred II: an improved software for membrane protein structure predictions. CABIOS 1994;10:685–6.

Cliften P, Sudarsanam P, Desikan A et al. Finding functional features in Saccharomyces genomes by phylogenetic footprinting. Science 2003;301:71–6.

Conant GC, Wolfe KH. Turning a hobby into a job: how duplicated genes find new functions. Nat Rev Genet 2008;9:938–50.

Corpet F. Multiple sequence alignment with hierarchical clustering. Nucleic Acids Res 1988;16:10881–90.

Correia A, Sampaio P, James S et al. Candida bracarensis sp. nov., a novel anamorphic yeast species phenotypically similar to Candida glabrata. Int J Syst Evol Microbiol 2006;56:313–7.

Costa C, Dias PJ, Sá-Correia I et al. MFS multidrug transporters in pathogenic fungi: Do they have real clinical impact? Front Physiol 2014a;5:197.

Costa C, Henriques A, Pires C et al. The dual role of Candida glabrata drug:H+ antiporter CgAqr1 (ORF CAGL0J09944g) in antifungal drug and acetic acid resistance. Front Microbiol 2013a;4:170.

Costa C, Nunes J, Henriques A et al. Candida glabrata drug:H+ antiporter CgTpo3 (ORF CAGL0I10384G): Role in azole drug resistance and polyamine homeostasis. J Antimicrob Chemother 2014b;69:1767–76.

Costa C, Pires C, Cabrito TR et al. Candida glabrata drug: H+ antiporter CgQdr2 confers imidazole drug resistance, being activated by transcription factor CgPdr1. Antimicrob Agents Chemother 2013b;57:3159–67.

Costa C, Ribeiro J, Miranda IM et al. Clotrimazole drug resistance in Candida glabrata clinical isolates correlates with increased expression of the drug:H+ antiporters CgAqr1, CgTpo1_1, CgTpo3 and CgQdr2. Front Microbiol 2016;7:1–11.

Delling U, Raymond M, Schurr E. Identification of Saccharomyces cerevisiae genes conferring resistance to quinoline ring-containing antimalarial drugs. Antimicrob Agents Chemother 1998;42:1034–41.

Delport W, Poon AFY, Frost SDW et al. Datamonkey 2010: a suite of phylogenetic analysis tools for evolutionary biology. Bioinformatics 2010;26:2455–7.

Deng D, Xu C, Sun P et al. Crystal structure of the human glucose transporter GLUT1. Nature 2014;510:121–5.

Dequin S. The potential of genetic engineering for improving brewing, wine-making and baking yeasts. Appl Microbiol Biotechnol 2001;56:577–88.

Dhaoui M, Auchère F, Blaiseau P-L et al. Gex1 is a yeast glutathione exchanger that interferes with pH and redox homeostasis. Mol Biol Cell 2011;22:2054–67.

Dias PJ, Sá-Correia I. The drug:H+ antiporters of family 2 (DHA2), siderophore transporters (ARN) and glutathione:H+ antiporters (GEX) have a common evolutionary origin in hemiascomycete yeasts. BMC Genomics 2013;14:901.

Dias PJ, Sá-Correia I. Phylogenetic and syntenic analyses of the 12-spanner drug:H(+) antiporter family 1 (DHA1) in pathogenic Candida species: evolution of MDR1 and FLU1 genes. Genomics 2014;104:45–57.

Dias PJ, Seret M-L, Goffeau A et al. Evolution of the 12-spanner drug:H+ antiporter DHA1 family in hemiascomycetous yeasts. OMICS 2010a;14:701–10.

Dias PJ, Teixeira MC, Telo JP et al. Insights into the mechanisms of toxicity and tolerance to the agricultural fungicide mancozeb in yeast, as suggested by a chemogenomic approach. OMICS 2010b;14:211–27.

Dietrich FS, Voegeli S, Kuo S et al. Genomes of Ashbya fungi isolated from insects reveal four mating-type loci, numerous translocations, lack of transposons, and distinct gene duplications. G3 (Bethesda) 2013;3:1225–39.

Drillon G, Fischer G. Comparative study on synteny between yeasts and vertebrates. C R Biol 2011;334:629–38.

Dujon B, Sherman D, Fischer G et al. Genome evolution in yeasts. Nature 2004;430:35–44.

Dujon B. Yeasts illustrate the molecular mechanisms of eukaryotic genome evolution. Trends Genet 2006;22:375–87.

Dujon B. Yeast evolutionary genomics. Nat Rev Genet 2010;11:512–24.

Dunn B, Richter C, Kvitek DJ et al. Analysis of the Saccharomyces cerevisiae pan-genome reveals a pool of copy number variants distributed in diverse yeast strains from differing industrial environments. Genome Res 2012;22:908–24.

Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 2004;32:1792–7.

Ehrenhofer-Murray AE, Keller Seitz MU, Sengstag C. The Sge1 protein of Saccharomyces cerevisiae is a membrane-associated multidrug transporter. Yeast 1998;14:49–65.

El-Sayed Shalaby M, El-Nady MF. Application of Saccharomyces cerevisiae as a biocontrol agent against Fusarium infection of sugar beet plants. Acta Biol Szeged 2008;52:271–5.

Felder T, Bogengruber E, Tenreiro S et al. Dtr1p, a multidrug resistance transporter of the major facilitator superfamily, plays an essential role in spore wall maturation in Saccharomyces cerevisiae. Eukaryot Cell 2002;1:799–810.

Felsenstein J. Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol 1981;17:368–76.

Felsenstein J. Confidence limits on phylogenies: an approach using the bootstrap. Evolution 1985;39:783–91.

Feng D-F, Doolittle RF. Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J Mol Evol 1987;25:351–60.

Fernandes AR, Mira NP, Vargas RC et al. Saccharomyces cerevisiae adaptation to weak acids involves the transcription factor Haa1p and Haa1p-regulated genes. Biochem Biophys Res Commun 2005;337:95–103.

Fernández A, Gómez S. Solving non-uniqueness in agglomerative hierarchical clustering using multidendrograms. J Classif 2008;25:43–65.

Fischer G, James SA, Roberts IN et al. Chromosomal evolution in Saccharomyces. Nature 2000;405:451–4.

Flagfeldt DB, Siewers V, Huang L et al. Characterization of chromosomal integration sites for heterologous gene expression in Saccharomyces cerevisiae. Yeast 2009;26:545–51.

Fluman N, Bibi E. Bacterial multidrug transport through the lens of the major facilitator superfamily. Biochim Biophys Acta (BBA) - Proteins Proteomics 2009;1794:738–47.

Förster C, Santos MA, Ruffert S et al. Physiological Consequence of Disruption of the VMA1 Gene in the Riboflavin Overproducer Ashbya gossypii. J Biol Chem 1999;274:9442–8.

Friedrich A, Jung PP, Hou J et al. Comparative Mitochondrial Genomics within and among Yeast Species of the Lachancea Genus. PLoS One 2012;7:1–6.

Gabaldón T, Martin T, Marcet-Houben M et al. Comparative genomics of emerging pathogens in the Candida glabrata clade. BMC Genomics 2013;14:1.

Gasch AP, Spellman PT, Kao CM et al. Genomic expression programs in the response of yeast cells to environmental changes. Mol Biol Cell 2000;11:4241–57.

Gbelska Y, Krijger J-J, Breunig KD. Evolution of gene families: the multidrug resistance transporter genes in five related yeast species. FEMS Yeast Res 2006;6:345–55.

Ghiurcuta CG, Moret BME. A Formal Definition for Syntenic Blocks. 2009.

Goffeau A, Barrell BG, Bussey H et al. Life with 6000 Genes. Science 1996;274:546–67.

Goldway M, Teff D, Schmidt R et al. Multidrug Resistance in Candida albicans: Disruption of the BENr Gene. Microbiology 1995;39:422–6.

González SS, Barrio E, Gafner J et al. Natural hybrids from Saccharomyces cerevisiae, Saccharomyces bayanus and Saccharomyces kudriavzevii in wine fermentations. FEMS Yeast Res 2006;6:1221–34.

Goodstadt L, Ponting CP. Phylogenetic reconstruction of orthology, paralogy, and conserved synteny for dog and human. PLoS Comput Biol 2006;2:1134–50.

Guindon S, Dufayard JF, Lefort V et al. New algorithms and methods to estimate maximum-likelihood phylogenies: Assessing the performance of PhyML 3.0. Syst Biol 2010;59:307–21.

Hall C, Brachat S, Dietrich FS. Contribution of Horizontal Gene Transfer to the Evolution of Saccharomyces cerevisiae. Eukaryot Cell 2005;4:1102–15.

Hanlon SE, Rizzo JM, Tatomer DC et al. The stress response factors Yap6, Cin5, Phd1, and Skn7 direct targeting of the conserved co-repressor Tup1-Ssn6 in S. cerevisiae. PLoS One 2011;6:e19060.

Hardison RC. Comparative Genomics. PLoS Biol 2003;1:e58.

Von Heijne G. Membrane protein structure prediction: hydrophobicity analysis and the positive-inside rule. J Mol Biol 1992;225:487–94.

Herrero E, Ros J, Bellí G et al. Redox control and oxidative stress in yeast cells. Biochim Biophys Acta 2008;1780:1217–35.

Hessa T, Meindl-Beinker NM, Bernsel A et al. Molecular code for transmembrane-helix recognition by the Sec61 translocon. Nature 2007;450:1026–30.

Heymann P, Ernst JF, Winkelmann G. Identification and substrate specificity of a ferrichrome-type siderophore transporter (Arn1p) in Saccharomyces cerevisiae. FEMS Microbiol Lett 2000;186:221–7.

Higgins CF. Multiple molecular mechanisms for multidrug resistance transporters. Nature 2007;446:749–57.

Hinnebusch AG, Johnston M. YeastBook: an encyclopedia of the reference eukaryotic cell. Genetics 2011;189:683–4.

Hittinger CT. Saccharomyces diversity and evolution: A budding model genus. Trends Genet 2013;29:309–17.

Hou J, Tyo KEJ, Liu Z et al. Metabolic engineering of recombinant protein secretion by Saccharomyces cerevisiae. FEMS Yeast Res 2012;12:491–510.

Howell KS, Klein M, Swiegers JH et al. Genetic Determinants of Volatile-Thiol Release by Saccharomyces cerevisiae during Wine Fermentation. Appl Environ Microbiol 2005;71:5420–6.

Hu B, Xie G, Lo C-C et al. Pathogen comparative genomics in the next-generation sequencing era: genome alignments, pangenomics and metagenomics. Brief Funct Genomics 2011;10:322–33.

Huelsenbeck JP, Ronquist F, Nielsen R et al. Bayesian inference of phylogeny and its impact on evolutionary biology. Science 2001;294:2310–4.

Hulin M, Wheals A. Rapid identification of Zygosaccharomyces with genus-specific primers. Int J Food Microbiol 2014;173:9–13.

Huson DH, Richter DC, Rausch C et al. Dendroscope: An interactive viewer for large phylogenetic trees. BMC Bioinformatics 2007;8:460.

Jacobson T. Cellular Responses to Arsenite and Cadmium - Mechanisms of Toxicity and Defense in Saccharomyces cerevisiae. 2016.

Jain R, Rivera MC, Moore JE et al. Horizontal gene transfer accelerates genome innovation and evolution. Mol Biol Evol 2003;20:1598–602.

James SA, Carvajal Barriga EJ, Portero Barahona P et al. Kazachstania yasuniensis sp. nov., a novel ascomycetous yeast species found in mainland Ecuador and on the Galapagos. Int J Syst Evol Microbiol 2015;65:1304–9.

Jungwirth H, Wendler F, Platzer B et al. Diazaborine resistance in yeast involves the efflux pumps Ycf1p and Flr1p and is enhanced by a gain-of-function allele of gene YAP1. Eur J Biochem 2000;267:4809–16.

Käll L, Krogh A, Sonnhammer ELL. A combined transmembrane topology and signal peptide prediction method. J Mol Biol 2004;338:1027–36.

Käll L, Krogh A, Sonnhammer ELL. An HMM posterior decoder for sequence feature prediction that includes homology information. Bioinformatics 2005;21:251–7.

Käll L, Krogh A, Sonnhammer ELL. Advantages of combined transmembrane topology and signal peptide prediction—the Phobius web server. Nucleic Acids Res 2007;35:W429–32.

Kanazawa S, Driscoll M, Struhl K. ATR1, a Saccharomyces cerevisiae gene encoding a transmembrane protein required for aminotriazole resistance. Mol Cell Biol 1988;8:664–73.

Katoh K, Misawa K, Kuma K et al. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res 2002;30:3059–66.

Kaya A, Karakaya HC, Fomenko DE et al. Identification of a novel system for boron transport: Atr1 is a main boron exporter in yeast. Mol Cell Biol 2009;29:3665–74.

Keeling PJ, Palmer JD. Horizontal gene transfer in eukaryotic evolution. Nat Rev Genet 2008;9:605–18.

Kellis M, Birren BW, Lander ES. Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae. Nature 2004;428:617–24.

Kellis M, Patterson N, Endrizzi M et al. Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature 2003;423:241–54.

Kendall SK. Encyclopedia of Life Sciences. Charlest Advis 2012;13:19–21.

Kennedy MA, Bard M. Positive and negative regulation of squalene synthase (ERG9), an ergosterol biosynthetic gene, in Saccharomyces cerevisiae. Biochim Biophys Acta 2001;1517:177–89.

Kent WJ, Zahler AM. Conservation, regulation, synteny, and introns in a large-scale C. briggsae–C. elegans genomic alignment. Genome Res 2000;10:1115–25.

Khenoussi W, Vanhoutrève R, Poch O et al. SIBIS: A Bayesian model for inconsistent protein sequence estimation. Bioinformatics 2014;30:2432–9.

Kimball S, Mattis P. GIMP 2.8 - GNU Image Manipulation Program. 2015.

Kircher M, Stenzel U, Kelso J. Improved base calling for the Illumina Genome Analyzer using machine learning strategies. Genome Biol 2009;10:1–9.

Knop M. Evolution of the hemiascomycete yeasts: on life styles and the importance of inbreeding. Bioessays 2006;28:696–708.

Koonin EV. Orthologs, paralogs, and evolutionary genomics. Annu Rev Genet 2005;39:309–38.

Kosakovsky Pond SL, Pond SLK, Frost SDW et al. Not So Different After All: A Comparison of Methods for Detecting Amino Acid Sites Under Selection. Mol Biol Evol 2005;22:1208–22.

Kowallik V, Miller E, Greig D. The interaction of Saccharomyces paradoxus with its natural competitors on oak bark. Mol Ecol 2015;24:1596–610.

Krogh A, Larsson B, von Heijne G et al. Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol 2001;305:567–80.

Kschischo M, Ramos J, Sychrová H. Membrane Transport in Yeast, An Introduction. Yeast Membrane Transport. Springer, 2016, 1–10.

Kuchler K, Schüller C. “ABC Transporters in Yeast–Drug Resistance and Stress Response in a Nutshell.” Yeast as a Tool in Cancer Research. Springer, 2007, 289–314.

Kurtz S, Phillippy A, Delcher AL et al. Versatile and open software for comparing large genomes. Genome Biol 2004;5:R12.

Kurtzman C, Fell JW, Boekhout T. The Yeasts: A Taxonomic Study. Elsevier, 2011.

Kurtzman CP. Phylogenetic circumscription of Saccharomyces, Kluyveromyces and other members of the Saccharomycetaceae, and the proposal of the new genera Lachancea, Nakaseomyces, Naumovia, Vanderwaltozyma and Zygotorulaspora. FEMS Yeast Res 2003;4:233–45.

Lachance M, Starmer WT, Rosa CA et al. Biogeography of the yeasts of ephemeral flowers and their insects. FEMS Yeast Res 2001;1:1–8.

Lachance MA. Current status of Kluyveromyces systematics. FEMS Yeast Res 2007;7:642–5.

Lambrechts MG, Pretorius IS. Yeast and its importance to wine aroma-a review. South African J Enol Vitic 2000;21:129–97.

Law CJ, Maloney PC, Wang D. Ins and Outs of Major Facilitator Superfamily Antiporters. Annu Rev Microbiol 2009;62:289–305.

Lee CC-F, Liu CC-H, Ninomiya S et al. Vanderwaltozyma verrucispora sp. nov., a new ascomycetous yeast species. FEMS Yeast Res 2009;9:153–7.

Lee D, Redfern O, Orengo C. Predicting protein function from sequence and structure. Nat Rev Mol Cell Biol 2007;8:995–1005.

Lemieux MJ, Huang Y, Wang D-N. The structural basis of substrate translocation by the Escherichia coli glycerol-3-phosphate transporter: a member of the major facilitator superfamily. Curr Opin Struct Biol 2004;14:405–12.

Lesuisse E, Simon-Casteras M, Labbe P. Siderophore-mediated iron uptake in Saccharomyces cerevisiae: the SIT1 gene encodes a ferrioxamine B permease that belongs to the major facilitator superfamily. Microbiology 1998;144:12379.

Libkind D, Hittinger CT, Valério E et al. Microbe domestication and the identification of the wild genetic stock of lager-brewing yeast. Proc Natl Acad Sci U S A 2011;108:14539–44.

Lin CPC, Kim C, Smith SO et al. A Highly Redundant Gene Network Controls Assembly of the Outer Spore Wall in S. cerevisiae. PLoS Genet 2013;9:e1003700.

Lipman DJ, Pearson WR. Rapid and sensitive protein similarity searches. Science 1985;227:1435–41.

Liti G, Carter DM, Moses AM et al. Population genomics of domestic and wild yeasts. Nature 2009;458:337–41.

Liti G, Peruffo A, James SA et al. Inferences of evolutionary relationships from a population survey of LTR-retrotransposons and telomeric-associated sequences in the Saccharomyces sensu stricto complex. Yeast 2005;22:177–92.

Lotia S, Montojo J, Dong Y et al. Cytoscape app store. Bioinformatics 2013:btt138.

Lynch M, Force A. The probability of duplicate gene preservation by subfunctionalization. Genetics 2000;154:459–73.

Ma B, Tromp J, Li M. PatternHunter: faster and more sensitive homology search. Bioinformatics 2002;18:440–5.

Ma J. Applications of Functional Genomics in Studies of Yeast Signaling Networks and Genome Structure. ProQuest, 2008.

Maddison DR, Swofford DL, Maddison WP. NEXUS: an extensible file format for systematic information. Syst Biol 1997;46:590–621.

Madero-López L, Espinel-Ingroff A. Revista Iberoamericana de Micología of invasive fungal diseases: a review of the literature (2005-2009). Rev Iberoam Micol 2009;26:15–22.

Margulies M, Egholm M, Altman WE et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature 2005;437:376–80.

Markovich S, Yekutiel A, Shalit I et al. Genomic Approach to Identification of Mutations Affecting Caspofungin Susceptibility in Saccharomyces cerevisiae. Antimicrob Agents Chemother 2004;48:3871–6.

Marsit S, Mena A, Bigey F et al. Evolutionary advantage conferred by an eukaryote-to-eukaryote gene transfer event in wine yeasts. Mol Biol Evol 2015:msv057.

Martin T, Sherman DJ, Durrens P. The génolevures database. C R Biol 2011;334:585–9.

McCouch SR. Genomics and synteny. Plant Physiol 2001;125:152–5.

Mira NP, Henriques SF, Keller G et al. Identification of a DNA-binding site for the transcription factor Haa1, required for Saccharomyces cerevisiae response to acetic acid stress. Nucleic Acids Res 2011;39:6896–907.

Mira NP, Palma M, Guerreiro JF et al. Genome-wide identification of Saccharomyces cerevisiae genes required for tolerance to acetic acid. Microb Cell Fact 2010;9:79.

Mortimer RK, Johnston JR. Genealogy of principal strains of the yeast genetic stock center. Genetics 1986;113:35–43.

Mount DW. Bioinformatics: Sequence and Genome Analysis. New York: Cold Spring Harbor Laboratory Press, 2001.

Murrell B, Moola S, Mabona A et al. FUBAR: A fast, unconstrained bayesian AppRoximation for inferring selection. Mol Biol Evol 2013;30:1196–205.

Murrell B, Wertheim JO, Moola S et al. Detecting individual sites subject to episodic diversifying selection. PLoS Genet 2012;8:e1002764.

Nakase T, Jindamorakot S, Tanaka K et al. Vanderwaltozyma tropicalis sp. nov., a novel ascomycetous yeast species found in Thailand. J Gen Appl Microbiol 2010;56:31–6.

Naumov GI, Lee C-F, Naumova ES. Molecular genetic diversity of the Saccharomyces yeasts in Taiwan: Saccharomyces arboricola, Saccharomyces cerevisiae and Saccharomyces kudriavzevii. Antonie Van Leeuwenhoek 2013;103:217–28.

Naumova ES, Naumov GI, Nikitina TN et al. Molecular genetic and physiological differentiation of Kluyveromyces lactis and Kluyveromyces marxianus: Analysis of strains from the all-Russian collection of microorganisms (VKM). Microbiology 2012;81:216–23.

Needleman SB, Wunsch CD. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 1970;48:443–53.

Néron B, Ménager H, Maufrais C et al. Mobyle: a new full web bioinformatics framework. Bioinformatics 2009;25:3005–11.

Nguyen HV, Gaillardin C. Evolutionary relationships between the former species Saccharomyces uvarum and the hybrids Saccharomyces bayanus and Saccharomyces pastorianus; Reinstatement of Saccharomyces uvarum (Beijerinck) as a distinct species. FEMS Yeast Res 2005;5:471–83.

van Nimwegen E, Zavolan M, Rajewsky N et al. Probabilistic clustering of sequences: Inferring new bacterial regulons by comparative genomics. Proc Natl Acad Sci 2002;99:7323–8.

Nobile CJ, Fox EP, Nett JE et al. A recently evolved transcriptional network controls biofilm development in Candida albicans. Cell 2012;148:126–38.

Notredame C, Higgins DG, Heringa J. T-Coffee: A novel method for fast and accurate multiple sequence alignment. J Mol Biol 2000;302:205–17.

Novo M, Bigey F, Beyne E et al. Eukaryote-to-eukaryote gene transfer events revealed by the genome sequence of the wine yeast Saccharomyces cerevisiae EC1118. Proc Natl Acad Sci U S A 2009;106:16333–8.

Nunes PA, Tenreiro S, Sá-Correia I. Resistance and adaptation to quinidine in Saccharomyces cerevisiae: Role of QDR1 (YIL120w), encoding a plasma membrane transporter of the major facilitator superfamily required for multidrug resistance. Antimicrob Agents Chemother 2001;45:1528–34.

O’Brien SJ, Menotti-Raymond M, Murphy WJ et al. The promise of comparative genomics in mammals. Science 1999;286:458–81.

Odds FC. Resistance of yeasts to azole-derivative antifungals. J Antimicrob Chemother 1993;31:463–71.

Ohno S. Ancient linkage groups and frozen accidents. Nature 1973;244:259–62.

Pais P, Costa C, Pires C et al. Membrane Proteome-Wide Response to the Antifungal Drug Clotrimazole in Candida glabrata : Role of the Transcription Factor CgPdr1 and the Drug:H+ Antiporters CgTpo1_1 and CgTpo1_2. Mol Cell Proteomics 2016;15:57–72.

Palma M, Roque F de C, Guerreiro JF et al. Search for genes responsible for the remarkably high acetic acid tolerance of a Zygosaccharomyces bailii-derived interspecies hybrid strain. BMC Genomics 2015;16:1070.

Papon N, Courdavault V, Clastre M et al. Emerging and Emerged Pathogenic Candida Species: Beyond the Candida albicans Paradigm. PLoS Pathog 2013;9:e1003550.

Paulsen IT, Sliwinski MK, Nelissen B et al. Unified inventory of established and putative transporters encoded within the complete genome of Saccharomyces cerevisiae. FEBS Lett 1998;430:116–25.

Van de Peer Y, Maere S, Meyer A. The evolutionary significance of ancient genome duplications. Nat Rev Genet 2009;10:725–32.

Philpott CC, Protchenko O. Response to iron deprivation in Saccharomyces cerevisiae. Eukaryot Cell 2008;7:20–7.

Plotree D, Plotgram D. PHYLIP-phylogeny inference package (version 3.2). Cladistics 1989;5:163–6.

Pond SLK, Frost SDW. Datamonkey: rapid detection of selective pressure on individual sites of codon alignments. Bioinformatics 2005;21:2531–3.

Preuss TM. Who’s afraid of Homo sapiens? J Biomed Discov Collab 2006;1:1.

Pulvirenti A. Saccharomyces uvarum, a proper species within Saccharomyces sensu stricto. FEMS Microbiol Lett 2000;192:191–6.

Quistgaard EM, Löw C, Guettou F et al. Understanding transport by the major facilitator superfamily (MFS): structures pave the way. Nat Publ Gr 2016;17:1–10.

Rambaut A, Drummond A. FigTree 1.3.1. 2010.

Ranwez V, Harispe S, Delsuc F et al. MACSE: Multiple alignment of coding SEquences accounting for frameshifts and stop codons. PLoS One 2011;6:e22594.

Reddy VS, Shlykov MA, Castillo R et al. The major facilitator superfamily (MFS) revisited. FEBS J 2012;279:2022–35.

Redhu KA, Shah AH, Prasad R. MFS transporters of Candida species and their role in clinical drug resistance. FEMS Yeast Res 2016;16:1–26.

Remize F. Biotechnology of Wine Yeasts. Fermented Foods, Part I: Biochemistry and Biotechnology. CRC Press, 2016, 49.

Replansky T, Koufopanou V, Greig D et al. Saccharomyces sensu stricto as a model system for evolution and ecology. Trends Ecol Evol 2008;23:494–501.

Reynolds SM, Käll L, Riffle ME et al. Transmembrane topology and signal peptide prediction using dynamic bayesian networks. PLoS Comput Biol 2008;4:e1000213.

Rice P, Longden I, Bleasby A. EMBOSS: the European molecular biology open software suite. Trends Genet 2000;16:276–7.

Richard G-F, Kerrest A, Lafontaine I et al. Comparative genomics of hemiascomycete yeasts: genes involved in DNA replication, repair, and recombination. Mol Biol Evol 2005;22:1011–23.

Rigden DJ, de Mello LV. Computacional de proteínas. Biotecnol Ciência Desenvolv 2002;4:64–70.

Ríos G, Cabedo M, Rull B et al. Role of the yeast multidrug transporter Qdr2 in cation homeostasis and the oxidative stress response. FEMS Yeast Res 2013;13:97–106.

Rolland T, Dujon B. Yeasty clocks: dating genomic changes in yeasts. C R Biol 2011;334:620–8.

Ronquist F, Teslenko M, van der Mark P et al. MrBayes 3.2: efficient Bayesian phylogenetic inference and model choice across a large model space. Syst Biol 2012;61:539–42.

Rosindell J, Harmon LJ. OneZoom: a fractal explorer for the tree of life. PLoS Biol 2012;10:e1001406.

Sá-Correia I, dos Santos SC, Teixeira MC et al. Drug:H+ antiporters in chemical stress response in yeast. Trends Microbiol 2009;17:22–31.

Sá-Correia I, Tenreiro S. The multidrug resistance transporters of the major facilitator superfamily, 6 years after disclosure of Saccharomyces cerevisiae genome sequence. J Biotechnol 2002;98:215–26.

Saier Jr MH, Beatty JT, Goffeau A et al. The major facilitator superfamily. J Mol Microbiol Biotechnol 1999;1:257–79.

Saier MH, Tran C V, Barabote RD. TCDB: the Transporter Classification Database for membrane transport protein analyses and information. Nucleic Acids Res 2006;34:D181–6.

Sakaki K, Tashiro K, Kuhara S et al. Response of Genes Associated with Mitochondrial Function to Mild Heat Stress in Yeast Saccharomyces cerevisiae. J Biochem 2003;134:373–84.

Saluja P, Yelchuri RK, Sohal SK et al. Torulaspora indica a novel yeast species isolated from coal mine soils. Antonie van Leeuwenhoek, Int J Gen Mol Microbiol 2012;101:733–42.

Santos PM, Simões T, Sá-Correia I. Insights into yeast adaptive response to the agricultural fungicide mancozeb: a toxicoproteomics approach. Proteomics 2009;9:657–70.

dos Santos SC, Sá-Correia I. Yeast toxicogenomics: lessons from a eukaryotic cell model and cell factory. Curr Opin Biotechnol 2015;33:183–91.

dos Santos SC, Teixeira MC, Dias PJ et al. MFS transporters required for multidrug/multixenobiotic (MD/MX) resistance in the model yeast: Understanding their physiological function through post-genomic approaches. Front Physiol 2014;5:1–15.

Scannell DR, Butler G, Wolfe KH. Yeast genome evolution — the origin of the species. Yeast 2007;24:929–42.

Scannell DR, Frank AC, Conant GC et al. Independent sorting-out of thousands of duplicated gene pairs in two yeast species descended from a whole-genome duplication. Proc Natl Acad Sci U S A 2007;104:8397–402.

Scannell DR, Zill OA, Rokas A et al. The Awesome Power of Yeast Evolutionary Genetics: New Genome Sequences and Strain Resources for the Saccharomyces sensu stricto Genus. G3 (Bethesda) 2011;1:11–25.

Scordia D, Cosentino SL, Lee J-W et al. Bioconversion of giant reed (Arundo donax L.) hemicellulose hydrolysate to ethanol by Scheffersomyces stipitis CBS6054. Biomass and Bioenergy 2012;39:296–305.

Sengupta S, Chandra TS. Sequence analysis and structural characterization of a glyceraldehyde-3-phosphate dehydrogenase gene from the phytopathogenic fungus Eremothecium ashbyi. Mycopathologia 2011;171:123–31.

Seo J-H, Jin Y-S. Editorial Overview: Food Biotechnology: Critical Gap Filler in the Nexus of Food, Energy, and Waste for a Prosperous Future. Elsevier, 2016.

Seret M-L, Baret P V. IONS: Identification of Orthologs by Neighborhood and Similarity-an Automated Method to Identify Orthologs in Chromosomal Regions of Common Evolutionary Ancestry and its Application to Hemiascomycetous Yeasts. Evol Bioinform Online 2011;7:123–33.

Seret M-L, Diffels JF, Goffeau A et al. Combined phylogeny and neighborhood analysis of the evolution of the ABC transporters conferring multiple drug resistance in hemiascomycete yeasts. BMC Genomics 2009;10:459.

Shah AH, Singh A, Dhamgaye S et al. Novel role of a family of major facilitator transporters in biofilm development and virulence of Candida albicans. Biochem J 2014;460:223–35.

Shah SP, Huang Y, Xu T et al. Atlas–a data warehouse for integrative bioinformatics. BMC Bioinformatics 2005;6:34.

Shimazu M, Itaya T, Pongcharoen P et al. Vba5p, a Novel Plasma Membrane Protein Involved in Amino Acid Uptake and Drug Sensitivity in Saccharomyces cerevisiae. Biosci Biotechnol Biochem 2012;76:1993–5.

Shimazu M, Sekito T, Akiyama K et al. A family of basic amino acid transporters of the vacuolar membrane from Saccharomyces cerevisiae. J Biol Chem 2005;280:4851–7.

Singh GB. Fundamentals of Bioinformatics and Computational Biology. Springer, 2015.

Smith TF, Waterman MS. Identification of common molecular subsequences. J Mol Biol 1981;147:195–7.

Souciet J-L, Dujon B, Gaillardin C et al. Comparative genomics of protoploid Saccharomycetaceae. Genome Res 2009;19:1696–709.

Steensels J, Snoek T, Meersman E et al. Improving industrial yeast strains: Exploiting natural and artificial diversity. FEMS Microbiol Rev 2014;38:947–95.

Stöver BC, Müller KF. TreeGraph 2: combining and visualizing evidence from different phylogenetic analyses. BMC Bioinformatics 2010;11:7.

Stratford M, Steels H, Nebe-von-Caron G et al. Extreme resistance to weak-acid preservatives in the spoilage yeast Zygosaccharomyces bailii. Int J Food Microbiol 2013;166:126–34.

Svrbicka A, Toth Hervay N, Gbelska Y. The major facilitator superfamily transporter Knq1p modulates boron homeostasis in Kluyveromyces lactis. Folia Microbiol (Praha) 2016;61:101–7.

Tamura K, Peterson D, Peterson N et al. MEGA5: molecular evolutionary genetics analysis using maximum likelihood, evolutionary distance, and maximum parsimony methods. Mol Biol Evol 2011;28:2731–9.

Tatusova TA, Madden TL. BLAST 2 Sequences, a new tool for comparing protein and nucleotide sequences. FEMS Microbiol Lett 1999;174:247–50.

Teixeira MC, Cabrito TR, Hanif ZM et al. Yeast response and tolerance to polyamine toxicity involving the drug:H+ antiporter Qdr3 and the transcription factors Yap1 and Gcn4. Microbiology 2011;157:945–56.

Teixeira MC, Dias PJ, Monteiro PT et al. Refining current knowledge on the yeast FLR1 regulatory network by combined experimental and computational approaches. Mol Biosyst 2010;6:2471–81.

Teixeira MC, Dias PJ, Simões T et al. Yeast adaptation to mancozeb involves the up-regulation of FLR1 under the coordinate control of Yap1, Rpn4, Pdr3, and Yrr1. Biochem Biophys Res Commun 2008;367:249–55.

Teixeira MC, Sá-Correia I. Saccharomyces cerevisiae resistance to chlorinated phenoxyacetic acid herbicides involves Pdr1p-mediated transcriptional activation of TPO1 and PDR5 genes. Biochem Biophys Res Commun 2002;292:530–7.

Tenreiro S, Fernandes AR, Sá-Correia I. Transcriptional activation of FLR1 gene during Saccharomyces cerevisiae adaptation to growth with benomyl: role of Yap1p and Pdr3p. Biochem Biophys Res Commun 2001;280:216–22.

Tenreiro S, Nunes PA, Viegas CA et al. AQR1 gene (ORF YNL065w) encodes a plasma membrane transporter of the major facilitator superfamily that confers resistance to short-chain monocarboxylic acids and quinidine in Saccharomyces cerevisiae. Biochem Biophys Res Commun 2002;292:741–8.

Tenreiro S, Vargas RC, Teixeira MC et al. The yeast multidrug transporter Qdr3 (Ybr043c): Localization and role as a determinant of resistance to quinidine, barban, cisplatin, and bleomycin. Biochem Biophys Res Commun 2005;327:952–9.

Teresa Fernández-Espinar M, Barrio E, Querol A. Analysis of the genetic variability in the species of the Saccharomyces sensu stricto complex. Yeast 2003;20:1213–26.

Thorsen M, Jacobson T, Vooijs R et al. Glutathione serves an extracellular defence function to decrease arsenite accumulation and toxicity in yeast. Mol Microbiol 2012;84:1177–88.

Tkach JM, Yimit A, Lee AY et al. Dissecting DNA damage response pathways by analysing protein localization and abundance changes during DNA replication stress. Nat Cell Biol 2012;14:966–76.

Tomitori H, Kashiwagi K, Asakawa T et al. Multiple polyamine transport systems on the vacuolar membrane in yeast. Biochem J 2001;353:681–8.

Tsirigos KD, Peters C, Shu N et al. The TOPCONS web server for consensus prediction of membrane protein topology and signal peptides. Nucleic Acids Res 2015;43:W401–7.

Tusnady GE, Simon I. The HMMTOP transmembrane topology prediction server. Bioinformatics 2001;17:849–50.

Ueda-Nishimura K, Mikata K. A new yeast genus, Tetrapisispora gen. nov.: Tetrapisispora iriomotensis sp. nov., Tetrapisispora nanseiensis sp. nov. and Tetrapisispora arboricola sp. nov., from the Nansei Islands, and reclassification of Kluyveromyces phaffii (van der Walt) van der Wa. Int J Syst Evol Microbiol 1999;49:1915–24.

Do Valle Matta MA, Jonniaux JL, Balzi E et al. Novel target genes of the yeast regulator Pdr1p: A contribution of the TPO1 gene in resistance to quinidine and other drugs. Gene 2001;272:111–9.

Vargas RC, García-Salcedo R, Tenreiro S et al. Saccharomyces cerevisiae multidrug resistance transporter Qdr2 is implicated in potassium uptake, providing a physiological advantage to quinidine-stressed cells. Eukaryot Cell 2007;6:134–42.

Vargas RC, Tenreiro S, Teixeira MC et al. Saccharomyces cerevisiae multidrug transporter Qdr2p (Yil121wp): Localization and function as a quinidine resistance determinant. Antimicrob Agents Chemother 2004;48:2531–7.

Velasco I, Tenreiro S, Calderon IL et al. Saccharomyces cerevisiae Aqr1 is an internal-membrane transporter involved in excretion of amino acids. Eukaryot Cell 2004;3:1492–503.

Viklund H, Bernsel A, Skwark M et al. SPOCTOPUS: a combined predictor of signal peptides and membrane protein topology. Bioinformatics 2008;24:2928–9.

Viklund H, Elofsson A. OCTOPUS: improving topology prediction by two-track ANN-based preference scores and an extended topological grammar. Bioinformatics 2008;24:1662–8.

Vos RA, Caravas J, Hartmann K et al. Bio::Phylo - phyloinformatic analysis using Perl. BMC Bioinformatics 2011;12:63.

Van Der Walt JP. Kluyveromyces—A new yeast genus of the Endomycetales. Antonie Van Leeuwenhoek 1956;22:265–72.

Wang C, Mas A, Esteve-Zarzoso B. The Interaction between Saccharomyces cerevisiae and Non-Saccharomyces Yeast during Alcoholic Fermentation Is Species and Strain Specific. Front Microbiol 2016;7:502.

Wang Q, Xu J, Wang H et al. Torulaspora quercuum sp. nov. and Candida pseudohumilis sp. nov., novel yeasts from human and forest habitats. FEMS Yeast Res 2009;9:1322–6.

Waterhouse AM, Procter JB, Martin DMA et al. Jalview Version 2--a multiple sequence alignment editor and analysis workbench. Bioinformatics 2009;25:1189–91.

Wilkinson S, Greetham D, Tucker GA. Evaluation of different lignocellulosic biomass pretreatments by phenotypic microarray-based metabolic analysis of fermenting yeast. Biofuel Res J 2016;3:357–65.

Wisedchaisri G, Park M-S, Iadanza MG et al. Proton-coupled sugar transport in the prototypical major facilitator superfamily protein XylE. Nat Commun 2014;5:4521.

Wolfe KH, Shields DC. Molecular evidence for an ancient duplication of the entire yeast genome. Nature 1997;387:708–12.

Wright MB, Howell EA, Gaber RF. Amino acid substitutions in membrane-spanning domains of Hol1, a member of the major facilitator superfamily of transporters, confer nonselective cation uptake in Saccharomyces cerevisiae. J Bacteriol 1996;178:7197–205.

Xia X. “What Is Comparative Genomics?.” Comparative Genomics. Springer Berlin Heidelberg, 2013.

Xu EG, Ho PW-L, Tse Z et al. Revealing ecological risks of priority endocrine disrupting chemicals in four marine protected areas in Hong Kong through an integrative approach. Environ Pollut 2016;215:103–12.

Yan N. Structural advances for the major facilitator superfamily (MFS) transporters. Trends Biochem Sci 2013;38:151–9.

Yang Z, Rannala B. Molecular phylogenetics: principles and practice. Nat Rev Genet 2012;13:303–14.

Yang Z, Wong WSW, Nielsen R. Bayes empirical Bayes inference of amino acid sites under positive selection. Mol Biol Evol 2005;22:1107–18.

Yun CW, Tiedeman JS, Moore RE et al. Siderophore-iron uptake in Saccharomyces cerevisiae: Identification of ferrichrome and fusarinine transporters. J Biol Chem 2000;275:16354–9.

Zhang H, Gao S, Lercher MJ et al. EvolView, an online tool for visualizing, annotating and managing phylogenetic trees. Nucleic Acids Res 2012;40:W569–72.


7. Annexes

7.1. Annex I – Phylogenetic analysis of Saccharomycetaceae Taxonomic family

Table. 7. Number of full-size DHA1 proteins in each cluster of the Saccharomycetaceae family. Additional information regarding the total number of full-size transporters and their average percentage of sequence identity and similarity is shown in the bottom lines.

S. cerevisiae genes named in the header: (DTR1), (QDR1, QDR2, AQR1), (QDR3), (HOL1), (TPO1), (TPO2, TPO3), (YHK8), (TPO4), (FLR1).

Columns: Total = total of full-size proteins; 2014 = 2014 article; Average = average of genes per cluster.

Acronym           AB    D    E    F    G    I    J   K1   K2   N1   N2    O    P    Q    R    S    T   UV  Total  2014      Average

Post Whole Genome Duplication
sace_1             0    0    1    3    1    0    1    0    0    2    0    0    1    0    1    1    1    0     12    12         0,46
sapa_4             0    0    1    3    1    0    1    0    0    2    0    0    1    0    1    1    1    0     12    11         0,46
sami_1             0    0    1    3    1    0    1    0    0    2    0    0    1    0    1    1    1    0     12    11         0,46
saku_1             0    0    1    2    1    0    1    0    0    2    0    0    1    0    1    1    2    0     12     5         0,46
saar_1             0    0    1    2    1    0    1    0    0    2    0    0    1    0    1    1    1    0     11     -         0,42
saba_1             0    0    1    2    1    0    1    0    0    1    0    0    1    0    1    1    1    1     11    11         0,42
saba_2             0    0    1    2    1    0    1    0    0    1    0    0    0    0    1    1    1    1     10    12         0,38
sauv_1             0    0    1    3    1    0    1    0    0    2    0    0    1    0    1    1    1    1     13     -         0,50
kaaf_1             0    0    1    2    0    0    0    0    0    1    0    0    2    0    1    1    5    0     13     -         0,50
kana_1             0    0    1    3    0    0    0    0    0    1    0    0    1    0    1    1    1    0      9     -         0,35
naca_1             0    0    1    3    0    0    0    0    0    2    0    0    1    0    1    1    1    0     10     -         0,38
naca_2             0    0    1    4    0    0    0    0    0    2    0    0    1    0    1    2    1    0     12     9         0,46
nada_1             0    0    1    2    0    0    0    0    0    1    0    0    1    0    1    0    1    0      7     -         0,27
cagl_1             0    0    1    2    0    0    0    0    0    1    0    0    2    0    1    1    2    0     10    10         0,38
cagl_2             0    0    1    2    0    0    0    0    0    1    0    0    2    0    1    1    2    0     10     -         0,38
teph_1             0    0    1    2    0    0    0    0    0    1    1    0    2    0    1    1    3    0     12     -         0,46
tebl_1             0    0    1    4    0    0    0    0    0    1    0    0    5    0    1    1    4    0     17     -         0,65
vapo_1             0    0    1    2    0    0    0    0    0    1    1    0    2    0    1    1    4    0     13    13         0,50

Pre Whole Genome Duplication
zyba_1             4    0    1    2    2    0    0    2    0    2    2    4    4    4    0    0   12    0     39     -         1,50
zyba_2             2    0    1    2    1    0    0    1    0    1    1    1    2    2    0    0    7    0     21     -         0,81
zyba_3             2    0    1    1    1    0    0    1    0    1    1    2    1    1    0    0    6    0     18     -         0,69
zyro_1             3    0    1    1    2    0    0    1    0    1    0    2    1    1    1    0    7    0     21    21         0,81
tode_1             1    0    1    2    1    0    1    0    0    1    1    0    1    0    0    1    1    0     11     -         0,42
lath_1             0    0    1    2    1    1    1    2    0    1    1    0    1    0    1    1    2    0     15    15         0,58
lawa_1             0    0    1    2    1    1    1    0    0    1    1    0    1    0    0    1    2    0     12    12         0,46
lakl_1             0    0    1    2    1    0    2    1    0    1    1    0    1    0    1    1    1    0     13    13         0,50
klma_1             0    0    1    2    1    0    1    1    1    1    0    0    1    0    0    1    0    0     10     -         0,38
klla_1             0    0    1    2    1    0    1    1    0    1    0    0    1    0    0    0    0    0      8     7         0,31
klwi_1             0    1    1    2    1    0    1    1    0    1    0    0    1    0    0    1    0    0     10     -         0,38
klae_1             0    2    1    2    1    0    1    1    1    1    0    0    1    0    0    1    0    1     13     -         0,50
ergo_1             0    0    1    1    1    1    2    0    0    1    0    0    0    0    0    1    0    0      8     8         0,31
ercy_1             0    0    1    0    1    1    1    1    0    0    0    0    0    0    0    1    0    0      6     -         0,23
asac_1             0    0    1    1    1    1    2    0    0    1    0    0    0    0    0    1    0    0      8     -         0,31

Total             12    3   33   70   25    5   22   13    2   41   10    9   42    8   21   28   71    4    419   170  0,488344988
Identity (%)      75   74   53 47,2 59,5   70 73,6   66   78 72,5   74   73 62,5   82 68,2 61,8 59,4   87      -     -            -
Similarity (%)    82   84 66,8 62,6 73,4   80 83,5   77   85 81,9   84   81 74,9   89 80,3 72,8 74,7   92      -     -            -
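The total and average columns of Table 7 can be rechecked with a few lines of Python. The sketch below is illustrative only: it copies three rows from the table and recomputes the total number of full-size proteins and the average number of genes per cluster. The divisor of 26 clusters is an assumption inferred from the tabulated values (for example 12 / 26 ≈ 0,46 and 39 / 26 = 1,50); the table itself lists only the 18 clusters found in the Saccharomycetaceae. The names CLUSTERS, ROWS, ASSUMED_CLUSTER_COUNT and summarize are hypothetical helpers introduced here.

```python
# Cluster labels in the column order of Table 7.
CLUSTERS = ["AB", "D", "E", "F", "G", "I", "J", "K1", "K2",
            "N1", "N2", "O", "P", "Q", "R", "S", "T", "UV"]

# Three rows copied from Table 7 (number of full-size DHA1 proteins per cluster).
ROWS = {
    "sace_1": [0, 0, 1, 3, 1, 0, 1, 0, 0, 2, 0, 0, 1, 0, 1, 1, 1, 0],
    "tebl_1": [0, 0, 1, 4, 0, 0, 0, 0, 0, 1, 0, 0, 5, 0, 1, 1, 4, 0],
    "zyba_1": [4, 0, 1, 2, 2, 0, 0, 2, 0, 2, 2, 4, 4, 4, 0, 0, 12, 0],
}

# ASSUMPTION: the averages appear to be computed over 26 clusters;
# this value is inferred from the table, not stated in it.
ASSUMED_CLUSTER_COUNT = 26


def summarize(counts, n_clusters=ASSUMED_CLUSTER_COUNT):
    """Return (total proteins, average genes per cluster rounded to 2 decimals)."""
    total = sum(counts)
    return total, round(total / n_clusters, 2)


if __name__ == "__main__":
    for strain, counts in ROWS.items():
        assert len(counts) == len(CLUSTERS)  # one value per cluster column
        total, average = summarize(counts)
        print(f"{strain}: total={total}, average per cluster={average:.2f}")
```

Under the same assumption, the overall average in the Total row matches 419 proteins divided by 33 table rows times 26 clusters.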


7.2. Annex II - Phylogenetic trees of cluster T, comprising the S. cerevisiae FLR1 homologs from 94 hemiascomycetous strains.

Figure. 29 Phylogenetic analysis of the FLR1 homologs in 94 hemiascomycetous strains using the PhyML 3.0 software, with the retrieved results explored in the Dendroscope software. The circular cladogram shows the tree topology of the 163 full-size DHA1 transporters in cluster T.


Figure. 30 Phylogenetic analysis of the FLR1 homologs in 94 hemiascomycetous strains using the PhyML 3.0 software. The radial phylogram shows the amino acid divergence among the FLR1 homologous proteins, considering the total number of strains used in this master's thesis.


7.3. Annex III - Gene neighbourhood of the DTR1 lineage of genes in the genome of 32 yeast strains belonging to the Saccharomycetaceae.

[Figure 31 panel labels: Post-WGD, Post-WGD, Pre-WGD]

Figure. 31 Gene neighbourhood of the DTR1 lineage of genes in the genome of 32 yeast strains belonging to the Saccharomycetaceae. The DTR1 homologous genes are represented in the central boxes, flanked by the corresponding neighbour genes, which are highlighted in different colours according to their protein sequence cluster. White boxes represent genes without homologous neighbours in the represented chromosome region. To simplify the gene neighbourhood comparison, the 15 neighbour genes retrieved for each region adjacent to the query genes were reduced to the 5 neighbour genes shown here, and the resulting scheme was divided according to the sublineages identified. The order of the neighbours in each chromosome was scrutinized by exploring the synteny network in the Cytoscape software.


7.4. Annex IV - Phylogenetic tree of cluster F, comprising the S. cerevisiae QDR1/QDR2/AQR1 homologs from 32 Saccharomycetaceae strains.

Figure. 32 Phylogenetic analysis of the QDR1/QDR2/AQR1 homologs in 32 Saccharomycetaceae species using the PhyML 3.0 software, with the retrieved results explored in the Dendroscope software. The circular cladogram shows the tree topology of the 70 full-size DHA1 transporters of cluster F.


7.5. Annex V - Phylogenetic tree of cluster N1, comprising the S. cerevisiae TPO2/TPO3 homologs from 32 Saccharomycetaceae strains.

Figure. 33 Phylogenetic analysis of the TPO2/TPO3 homologs in 32 Saccharomycetaceae species using the PhyML 3.0 software, with the retrieved results explored in the Dendroscope software. The circular cladogram shows the tree topology of the 41 full-size DHA1 transporters of cluster N1.
