Genomics of the capybara, two emblematic Colombian

María José Gómez-Hughes¹, Santiago Herrera-Álvarez¹,², Andrew J. Crawford¹

¹Department of Biological Sciences, Universidad de los Andes, Bogotá, 111711, Colombia.
²Department of Ecology and Evolution, University of Chicago, Chicago, IL 60637, USA.

Abstract

Capybaras, which are native to South America, are not only the largest rodents in the world but also have several other characteristics that make them unique. They are semi-aquatic grazers that live in large groups in which females engage in communal breeding. Males communally defend the territory by scent marking with a specialized gland called the morillo and with two anal glands. Here we present the first genome assembly and annotation for the lesser capybara, Hydrochoerus isthmius, as well as the first transcriptome assembly for the capybara, H. hydrochaeris, both of which are comparable in completeness to previously published genomes, and we compare them with the previously published genome assembly for the capybara. We found evidence of reductions in the effective population size of both species, as well as large regions of genomic rearrangement relative to the guinea pig. Our phylogenetic analysis is consistent with previous phylogenies reported for the suborder Hystricomorpha, but there is evidence that the capybara is a paraphyletic species. We hope that this study contributes to conservation efforts for these species, as well as to a better understanding of the characteristics that make them unique.

Resumen

Los chigüiros, nativos de América del Sur, no solamente son los roedores más grandes del mundo sino que también tienen otras características que los hacen únicos. Son especies de mamíferos semiacuáticas que pastan y viven en grandes grupos en los que las hembras crían comunitariamente a sus crías y los machos defienden sus territorios mediante marcajes con el morillo, una glándula especializada, y dos glándulas anales. Aquí presentamos el primer ensamblaje genómico del chigüiro menor, Hydrochoerus isthmius, y el primer transcriptoma del chigüiro, H. hydrochaeris, como también comparaciones con el genoma del chigüiro publicado anteriormente. Encontramos evidencia de reducciones poblacionales de ambas especies, como también rearreglos genómicos en comparación con el conejillo de indias. Nuestro análisis filogenético es consistente con análisis publicados previamente para el suborden Hystricomorpha, pero hay evidencia para la parafilia del chigüiro. Esperamos que este estudio contribuya a esfuerzos de conservación en estas especies, como también a un mejor entendimiento de esas características que los hacen únicos.

Keywords: Hydrochoerus sp., chigüiro, population genomics, conservation genomics, 10X Genomics, genome assembly.

Ethics Statement: Tissue samples of the lesser capybara and capybara were obtained under research and collecting permit No. 1177 issued to the Universidad de los Andes by the Autoridad Nacional de Licencias Ambientales (ANLA; National Authority of Environmental Permits). Anesthetic and euthanization protocols used were approved by the Universidad de los Andes’ Comité Institucional de Cuidado y Uso de Animales de Laboratorio (CICUAL; approval number C.FUA_14-023).

Introduction

The Order Rodentia is the most diverse group of mammals in the world in terms of species richness and ecological diversity, as well as morphological variation (Samuels, 2009; Fabre et al., 2012). Rodents comprise almost 40% of all mammal species (Burgin et al., 2018) and inhabit almost all terrestrial biomes (Hafner & Hafner, 1988). There are currently five recognized suborders of rodents, together comprising 33 families (Wilson & Reeder, 2005): Sciuromorpha (dormice, mountain beavers, marmots, squirrels and squirrel-like rodents), Castorimorpha (beavers, kangaroo rats and pocket gophers), Myomorpha (hamsters, jerboas, mice, rats and mouse-like rodents), Anomaluromorpha (scaly-tailed squirrels and springhares), and Hystricomorpha (chinchillas, guinea pigs, gundis, porcupines and others).

Among these groups, in the suborder Hystricomorpha and family Caviidae (Wilson & Reeder, 2005), are the capybaras (genus: Hydrochoerus). Capybaras are known for being the largest extant rodents (Figure 1A; Moreira et al., 2013), for their semi-aquatic habits (Macdonald, 1981), and for being social, with communal breeding and communal defense (Macdonald, 1981). Capybaras are also known for distinctive feeding and scent-marking behaviors. Regarding feeding, capybaras graze on both aquatic and terrestrial herbaceous vegetation, which undergoes multiple passes through the digestive tract via either regurgitation or coprophagy (Lord, 1994). Regarding scent marking, capybaras possess two types of scent glands: the morillo, a protuberance on top of the snout of males whose size can be predictive of dominance (Rosenfield et al., 2019), and two anal glands. Scent marking is the most frequently observed social interaction in capybaras (Emilio & Macdonald, 1994).

There are currently two described species of capybaras: the capybara, H. hydrochaeris, and the lesser capybara, H. isthmius (Mones, 1991), inhabiting eastern Colombia, eastern , the , , Peru, northeastern and , and Panama, western Colombia and western Venezuela, respectively (Figure 1B; Reid, 2016; Delgado & Emmons, 2016). However, there is some dispute as to whether capybaras comprise one species or two. Some authors still refer to the lesser capybara as a subspecies (see Correa & Jorgenson, 2009 and Carrascal, Linares & Chacón, 2011), whereas it is classified as its own species in the database of mammalian (Wilson & Reeder, 2005) and by the International Union for Conservation of Nature and Natural Resources (IUCN; Delgado & Emmons, 2016). The position of capybaras in the Hystricomorpha phylogeny is also disputed (see Upham & Patterson, 2015; Álvarez, Arévalo, & Verzi, 2017; Rowe & Honeycutt, 2002).

Figure 1. (A) Relative size of the two species of capybara as compared to a 1.75 m tall human. The lesser capybara (Hydrochoerus isthmius) is shown in purple on the left while the capybara (Hydrochoerus hydrochaeris) is shown in blue on the right. (B) Geographic ranges of the capybara (blue) and the lesser capybara (purple) according to the IUCN (Reid, 2016; Delgado & Emmons, 2016). (C) A picture showing a male capybara with its morillo indicated by the red arrow. All images used were labeled for noncommercial use with modifications from Wiki Commons.

Currently, the capybara is listed as a species of Least Concern by the IUCN (Reid, 2016), but some concerns have arisen over the years and substantial population declines have been noted in this species (Corriale & Herrera, 2014). The lesser capybara is listed as Data Deficient due to lack of baseline studies on the status of these populations (Delgado & Emmons, 2016). This species has been neglected in studies of conservation despite being harvested for meat, leather and fat (Pinheiro & Moreira, 2013) and despite the threats to its native habitat (Aldana-Domínguez, Vieira-Muñoz & Bejarano, 2013).

In this paper we present the first genome assembly for the lesser capybara and the first transcriptome assembly for the capybara, as well as new analyses of the capybara genome assembly previously published by Herrera-Álvarez et al. (2018). We compare demographic changes over time in both species, compare synteny between the two capybara species and between each species and the guinea pig (Cavia porcellus), evaluate the distinctiveness of the capybaras as independent monophyletic groups, assess the position of Hydrochoerus in the Hystricomorpha tree, and analyze differentially expressed genes among tissues from the capybara.

Materials and Methods

Genome assembly of the capybara

Tissue from a wild-caught but captive-raised capybara (H. hydrochaeris), reportedly from Bolivia, was donated by the San Diego Zoo’s Frozen Zoo to the 200 Mammalian Genomes Project led by the Broad Institute, which then sequenced and assembled a draft genome using the DISCOVAR de novo sequencing and assembly method (Weisenfeld et al., 2014). This assembly was then ‘upgraded’ using Chicago libraries provided by Dovetail Genomics (Putnam et al., 2016) and financed by Colciencias. Details on the final assembly can be found in Herrera-Álvarez et al. (2018).

Genome assembly of the lesser capybara

Tissue samples of the lesser capybara were collected from one juvenile H. isthmius from San Juan del Carare, Santander, Colombia, on 22 June 2017. The specimen was subsequently accessioned into the mammals collection of the Museo Historia Natural ANDES, Universidad de los Andes, Bogotá, Colombia (field number AJC 7100, voucher number ANDES-M 2300). This sample was sequenced using 10X Genomics linked-read technology on two lanes of an Illumina HiSeq X10. The resulting reads were run through longranger v2.2.2 (10X Genomics) to estimate genome size and heterozygosity and to process the barcodes. The reads were then assembled with Supernova v2.0.1 (Weisenfeld et al., 2017), an assembler created by 10X Genomics that builds progressively larger contigs and applies its own trimming step to create phased scaffolds from the reads. We used the mkoutput pseudohap option to output only one haplotype in the resulting assembly. To improve this assembly we used the following three tools. 1) Tigmint v1.1.2 was used to identify and correct possible mis-assemblies by examining the alignment of linked reads to the draft assembly, with the aim of producing an assembly that is both more contiguous and more correct (Jackman et al., 2018). 2) Arcs v1.0.6 was used as an additional scaffolding step, using the information in the linked reads to join the scaffolds most likely to be adjacent and thereby create a more contiguous assembly (Yeo et al., 2017). 3) Sealer v2.0.2 was used to identify intra-scaffold gaps in the draft assembly, search for flanking sequences, and then fill the gaps by realigning the raw reads to the assembly (Paulino et al., 2015).

Sealer navigates de Bruijn graphs stored in Bloom filters, one per k value. We chose k values of 64, 80, 96, 112, and 128, a range intended to help close gaps in areas of low coverage with the lower values and in highly repetitive areas with the higher values (Paulino et al., 2015). Between each of the aforementioned steps, and at the end, we ran QUAST v5.0.2 (Gurevich et al., 2013) to measure improvement in the quality metrics of the assembly. These metrics included (1) contiguity: number of contigs, length of the largest contig, and contig and scaffold Nx (the length in base pairs such that all contigs of at least that length together add up to x% of the length of the whole assembly, e.g., N50), and (2) comparison to the domestic guinea pig reference genome assembly (RefSeq accession number: GCF_000151735.1; release: Cavpor 3.0) in terms of GC content (%), number of mismatches per 100 kilobases (Kb), and number of indels per 100 Kb (Gurevich et al., 2013; Table 1). Finally, to assess genome completeness, we ran BUSCO v3.0.2 using the Vertebrate dataset (Waterhouse et al., 2018) and compared the percentage of BUSCO genes recovered against the genome assemblies of other rodent species published in Ensembl (Table 2).
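As an illustration of the Nx statistic defined above, the following minimal Python sketch computes N50 and N90 from a list of sequence lengths. It is not QUAST's implementation; the function name `nx` and the toy lengths are hypothetical and are not data from this study.

```python
def nx(lengths, x=50):
    """Return Nx: the largest length L such that sequences of length >= L
    together cover at least x% of the total assembly length."""
    if not lengths:
        raise ValueError("no sequence lengths given")
    threshold = sum(lengths) * x / 100
    running = 0
    # Walk from the longest sequence downwards, accumulating length.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= threshold:
            return length


# Hypothetical toy assembly (lengths in bp), total length 300 bp.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
toy_n50 = nx(toy_contigs, 50)  # 80 + 70 reaches half of 300
toy_n90 = nx(toy_contigs, 90)  # 80 + 70 + 50 + 40 + 30 reaches 270
```

The same function applies to contigs or scaffolds; only the input list of lengths changes.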

Transcriptome sequencing, assembly, and functional annotation

Transcriptomic data were obtained from eleven tissues representing two H. hydrochaeris individuals of either sex (Table 3). Tissues were preserved in Nucleic Acid Preservation (NAP; Camacho-Sánchez et al., 2013) buffer to avoid RNA degradation. RNA was extracted with standard TRI Reagent® Solution (Ambion Inc., Austin, Texas, USA), cleaned using the RNeasy Plus Mini Kit (Qiagen, Hilden, Germany), and diluted to a final volume of 30 µl with nuclease-free water. The quantity of extracted RNA for library construction was measured with the Qubit® RNA HS Assay kit. Complementary DNA (cDNA) libraries were constructed with the Illumina TruSeq v. 2 kit using half reactions. The quality of the cDNA libraries was assessed using an Agilent 2100 BioAnalyzer and the Agilent High Sensitivity DNA kit. The 11 libraries were barcoded and run together in paired-end mode on one lane of an Illumina HiSeq 2000. For the transcriptome assembly, we used all reads from the 11 sequenced libraries. We first trimmed the reads with trimmomatic v0.39 (Bolger, Lohse & Usadel, 2014) and filtered them with the FASTX-Toolkit v0.0.14. We then normalized the reads based on median coverage and trimmed unreliable k-mers using the khmer v1 digital normalization algorithm (Crusoe et al., 2015; Brown et al., 2012). We then used Trinity to assemble the transcriptome from the remaining reads (Grabherr et al., 2011). To evaluate the quality of the transcriptome assembly in terms of completeness we used Trinity scripts and BUSCO v3.0.2 (Waterhouse et al., 2018). To functionally annotate the transcriptome, we extracted the longest open reading frames and predicted the most likely coding regions with Transdecoder v3.0.0 (Haas & Papanicolaou, 2015). Then we used Trinotate to functionally annotate the predicted polypeptides and to create a database for navigating these data (following Bryant et al., 2017).

We used BLAST v2.9.0 to search for homology hits against the UniProt Swiss-Prot database (UniProt Consortium, 2018), and identified the Pfam protein family to which each transcript belonged (El-Gebali et al., 2018) using profile hidden Markov models with HMMER v3.2.1 (Mistry et al., 2013). We predicted signal peptides using a deep neural network approach with SignalP v5.0 (Armenteros et al., 2019), predicted transmembrane protein domains using hidden Markov models with TMHMM v2.0 (Krogh et al., 2001), and assigned inferred proteins to Eggnog functional categories (Huerta-Cepas et al., 2015) and to gene ontology categories (GO; Ashburner et al., 2000) using BLAST v2.9.0.

Genomic repeat masking

The results of repeat masking and annotation of the capybara were taken from Herrera-Álvarez et al. (2018), and a similar approach was taken for the lesser capybara. To repeat mask the lesser capybara assembly, we used RepeatMasker v4.0.9_p2 (Smit, Hubley & Green, 2015), specifying “rodentia” as the species to guide the masking using repeat evidence from other rodents. We chose soft masking, which allowed us to see which subsequences were repeats without having them interfere with downstream analyses.

Genome annotation and gene content

To annotate the lesser capybara genome, we selected three high-quality, annotated genome assemblies from representative rodent species. We used the Maker v2.31 pipeline (Holt & Yandell, 2011), guided by proteomes from the guinea pig (Cavia porcellus; Cavpor 3.0), the mouse (Mus musculus; GRCm38.p6), and the common rat (Rattus norvegicus; Rnor6.0). We downloaded the proteomes from Ensembl release 97 and clustered them with CD-HIT v4.6.1 into a single non-redundant file (Li, Jaroszewski, & Godzik, 2001). The capybara genome annotation was taken from Herrera-Álvarez et al. (2018). We used the Swiss-Prot reviewed database (UniProt Consortium, 2018) to add functional information to the annotations of both species based on homology hits found by Blastp v2.9.0. Additionally, we used InterProScan v5.36-75 (Jones et al., 2014) to classify the annotated genes into Pfam protein families (El-Gebali et al., 2018).

Microsatellites are a class of short tandem repeat (STR) motifs that are frequently used as Mendelian markers in population genetic and kinship studies (Jarne & Lagoda, 1996). Here we define STRs as six or more repeats of a dinucleotide unit, or five or more repeats of a unit of three to twelve nucleotides. To annotate microsatellites for both species, we used the MIcroSAtellite identification tool (MISA-web; Beier et al., 2017).
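The STR definition above can be made concrete with a short regular-expression sketch. This is only an illustration of the thresholds, not MISA's algorithm: the names `MIN_REPEATS`, `is_primitive`, and `find_strs` are hypothetical, the toy sequence is invented, and a simple left-to-right regex scan can miss some overlapping or compound repeats that a dedicated tool would report.

```python
import re

# Minimum repeat counts from the definition in the text:
# >= 6 repeats for dinucleotides, >= 5 repeats for units of 3 to 12 nt.
MIN_REPEATS = {2: 6}
MIN_REPEATS.update({k: 5 for k in range(3, 13)})


def is_primitive(unit):
    """True if the unit is not itself a repetition of a shorter motif
    (e.g. 'ACAC' is rejected because it is 'AC' repeated twice)."""
    n = len(unit)
    return not any(n % d == 0 and unit == unit[:d] * (n // d) for d in range(1, n))


def find_strs(seq):
    """Return sorted (0-based start, repeat unit, repeat count) tuples."""
    seq = seq.upper()
    hits = []
    for k, min_rep in sorted(MIN_REPEATS.items()):
        # One unit of length k followed by at least (min_rep - 1) exact copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_rep - 1))
        for match in pattern.finditer(seq):
            unit = match.group(1)
            if is_primitive(unit):
                hits.append((match.start(), unit, len(match.group(0)) // k))
    return sorted(hits)


# Invented toy sequence: (AC)6 and (AGT)5 qualify, the trailing (AC)5 does not.
toy_seq = "GG" + "AC" * 6 + "TTGCA" + "AGT" * 5 + "CC" + "AC" * 5 + "G"
toy_hits = find_strs(toy_seq)
```

Rejecting non-primitive units keeps a dinucleotide run from being counted again as a tetra- or hexanucleotide repeat.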

Mitochondrial genome

We created a DNA database of the capybara and the lesser capybara genomes independently and used the guinea pig mitochondrial genome (Accession number: NC_000884) as a query against each database with Blastn v2.9.0. We then annotated the most probable scaffold to obtain the mitochondrial genome sequence of each species using the MITOS WebServer pipeline (Bernt et al., 2013). To visualize the mitogenomes we used the CGView Server (Grant & Stothard, 2008).

Demography

We used the pairwise sequentially Markovian coalescent (PSMC; Li & Durbin, 2011) to infer how the effective population size (Ne) may have changed over recent history. Briefly, this algorithm reconstructs the distribution of times to the most recent common ancestor (TMRCA) along chromosomes by examining the local density of heterozygous sites (Li & Durbin, 2011). To do this, we first indexed each assembly, to make alignment faster and less computationally expensive, and mapped the raw reads back to it with bwa v0.7.4 (Li & Durbin, 2009). We then sorted each alignment and converted it from a BAM to a VCF file with SAMtools v1.8 (Li et al., 2009), called SNPs and indels with bcftools v1.8 (Li, 2011), and converted the result to a fastq file with vcftools v4.2.0 (Danecek et al., 2011). We then used this file to estimate the parameters of the PSMC model with psmc v0.6.5 (Li & Durbin, 2011), using 100 bootstrap replicates and a time-interval pattern (the -p option, which sets how atomic time intervals are grouped into free parameters) of “4+25*2+4+6”.
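The string “4+25*2+4+6” can be unpacked mechanically. The sketch below assumes the convention described in the PSMC documentation for its -p option, in which a term a*b denotes a consecutive free parameters each spanning b atomic time intervals, and a bare term b denotes a single parameter spanning b intervals. The function name is hypothetical and the code is not part of psmc itself.

```python
def parse_psmc_pattern(pattern):
    """Expand a PSMC-style interval pattern into a list with one entry per
    free parameter, giving the number of atomic time intervals it spans."""
    spans = []
    for term in pattern.split("+"):
        if "*" in term:
            # "a*b": a parameters, each spanning b atomic intervals.
            n_params, width = (int(value) for value in term.split("*"))
        else:
            # "b": one parameter spanning b atomic intervals.
            n_params, width = 1, int(term)
        spans.extend([width] * n_params)
    return spans


spans = parse_psmc_pattern("4+25*2+4+6")
n_free_parameters = len(spans)    # number of free interval parameters
n_atomic_intervals = sum(spans)   # number of atomic time intervals
```

Under this reading, the pattern describes 64 atomic time intervals governed by 28 free parameters, with wider groupings at the most recent and most ancient ends of the time axis.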

Synteny between the capybaras and the common guinea pig

To identify genomic regions containing large rearrangements in the capybaras relative to the guinea pig, we performed global pairwise alignments between the two capybara species, between the capybara and the guinea pig, and between the lesser capybara and the guinea pig using bwa v0.7.4 (Li & Durbin, 2009). To visualize these alignments, we used Circos v0.69-8 (Krzywinski et al., 2009) to draw links, in 100 Kb windows, between aligned regions on two half circles, each representing one of the assemblies.

Genetic diversity within Hydrochoerus

To estimate genomic divergence between northern and southern H. hydrochaeris relative to H. isthmius, we ran a phylogenomic analysis of protein-coding sequences derived from genomic and transcriptomic analyses, with the guinea pig as outgroup. To minimize possible problems with paralogy, we analyzed only those genes included in the BUSCO Vertebrate orthologs dataset (Waterhouse et al., 2018). These genes were obtained using BUSCO v3.0.2 from either a reference genome assembly (H. isthmius, and southern H. hydrochaeris from Bolivia) or a transcriptome (the de novo assembly for northern H. hydrochaeris from the Colombian Llanos, and the guinea pig transcriptome; Cavia porcellus, genome version: Cavpor3.0; accession number: GCA_000151735.1). All BUSCO genes that were found complete in all four datasets were aligned independently with MAFFT v7.309 using a BLOSUM 62 matrix (Katoh & Standley, 2013). Alignments were then trimmed with trimAl v1.4 (Capella-Gutiérrez, Silla-Martínez & Gabaldón, 2009). To infer a species tree, we concatenated the resulting alignments with FASconCAT-G v1.04 (Kück & Meusemann, 2010) and used IQ-TREE v1.6.10 to select the best-fit substitution model for each gene, based on the corrected Akaike information criterion (AICc), and to implement a partitioned likelihood analysis (Nguyen et al., 2014; Chernomor, von Haeseler & Minh, 2016). Statistical support for relationships was estimated using 1000 non-parametric bootstraps for sites within partitions and 1000 likelihood ratio tests. The resulting tree was visualized in iTOL (Letunic & Bork, 2019).
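To make the concatenation step more concrete, here is a minimal sketch of how per-gene alignments might be joined into a supermatrix while recording the column range of each gene, which is the information a partitioned likelihood analysis needs. It is an illustration only and does not reproduce FASconCAT-G; the function name, the data layout (a dictionary of gene to taxon to aligned sequence), and the toy sequences are all assumptions.

```python
def concatenate(alignments):
    """Join per-gene alignments into one supermatrix.

    alignments: dict mapping gene name -> dict mapping taxon -> aligned sequence.
    Returns (supermatrix, partitions), where partitions is a list of
    (gene, first_column, last_column) with 1-based inclusive coordinates.
    """
    taxa = sorted({taxon for aln in alignments.values() for taxon in aln})
    pieces = {taxon: [] for taxon in taxa}
    partitions = []
    start = 1
    for gene in sorted(alignments):
        aln = alignments[gene]
        widths = {len(seq) for seq in aln.values()}
        if len(widths) != 1:
            raise ValueError(f"sequences in {gene} differ in length")
        width = widths.pop()
        for taxon in taxa:
            # A taxon missing from a gene is padded with gaps. In the analysis
            # described above every gene was complete in all datasets, so this
            # branch would not be triggered there.
            pieces[taxon].append(aln.get(taxon, "-" * width))
        partitions.append((gene, start, start + width - 1))
        start += width
    supermatrix = {taxon: "".join(parts) for taxon, parts in pieces.items()}
    return supermatrix, partitions


# Invented toy alignments for two hypothetical genes and two taxa.
toy = {
    "geneB": {"sp1": "MK-", "sp2": "MKL"},
    "geneA": {"sp1": "AC", "sp2": "AG"},
}
toy_matrix, toy_partitions = concatenate(toy)
```

Keeping the partition boundaries alongside the supermatrix is what allows a separate substitution model to be fitted to each gene.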

Phylogenomic analyses

To verify the position of Hydrochoerus in the Hystricomorpha phylogeny, we downloaded from Ensembl all available proteomes of Hystricomorpha and included the mouse as an outgroup, for a total of nine species (Table 2), and used OrthoFinder v2.3.3 to search for orthologs (Emms & Kelly, 2019). We then extracted all single-copy orthologs that were found in all nine species and applied a pre-alignment quality filter with PREQUAL v1.02 to identify and remove non-homologous sequences (Whelan, Irisarri & Burki, 2018). We performed a multiple sequence alignment of each orthologue independently with MAFFT v7.309, assuming a BLOSUM 62 matrix (Katoh & Standley, 2013), and then removed poorly aligned regions with trimAl v1.4 to retain only the most reliable parts of the alignments (Capella-Gutiérrez, Silla-Martínez & Gabaldón, 2009). To infer phylogenetic relationships, we used two approaches. First, we concatenated all the alignments with FASconCAT-G v1.04 (Kück & Meusemann, 2010) and constructed a maximum likelihood tree with RAxML v8.2.12, using a GAMMA model of rate heterogeneity that estimates the alpha parameter, and 100 bootstraps for statistical support (Stamatakis, 2014). Second, we estimated a species tree via a Bayesian approach using MCMCTree implemented in PAML v4.9 (Yang, 2007). We discarded the first 2000 generations of the Markov chain as burn-in and then sampled 20,000 trees, one every 20 iterations. The timetree was calibrated by constraining the root to a temporal interval of 68 - 78 million years ago, corresponding to the TMRCA of the Hystricomorpha group and the mouse (Hedges, Dudley & Kumar, 2006). The sample of posterior trees was used to generate a Hessian matrix using CODEML in PAML v4.9, assuming a WAG+GAMMA model. The Hessian matrix was used to run MCMCTree again to obtain a Bayesian consensus tree, which was visualized using the R package MCMCTreeR (Puttick, 2019).
As a check on the MCMCTree results, we used a second species-tree approach that takes into account the individual history of each gene. We used RAxML v8.2.12 to infer a maximum likelihood tree for each gene independently, also with a GAMMA model and 100 bootstraps per gene. The resulting gene trees were used as input to infer a species tree using NJst (Liu & Yu, 2011).

Results

Sequencing and genome assembly of the lesser capybara

The sequencing of the lesser capybara, Hydrochoerus isthmius, resulted in a total of 1.751 billion reads, each 150 bp long. From these reads it was inferred that the lesser capybara genome has a size of 2.7 Gb and a heterozygosity of 0.24% (Table 1). The Supernova assembly (step 1 of the assembly process) had a total size of 2.5 Gb, counting only scaffolds ≥ 10,000 bp, and a scaffold N50 of 694 Kb (Table 4). QUAST assembly statistics, and their improvement throughout the successive steps of the assembly process (see Materials and Methods), are reported in Table 4. The final lesser capybara genome draft contained 18,502 contigs with a contig N50 of 232 Kb and 7,702 scaffolds with a scaffold N50 of 787 Kb, and had a GC content of 40.01%. As a measure of genome completeness we used the fraction of genes recovered in our assemblies out of a total of 3,023 BUSCO genes in the vertebrate dataset. For the lesser capybara we recovered 2,563 genes that were assembled completely (84.8%) and 227 fragmented genes (7.5%), with 233 genes missing (7.7%), making the lesser capybara assembly comparable to other rodent genome assemblies in Ensembl (Figure 2).
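The completeness percentages follow directly from the BUSCO counts. The short sketch below, in which the function name is hypothetical, shows the arithmetic using the counts reported above for the lesser capybara assembly.

```python
def busco_percentages(complete, fragmented, missing):
    """Express each BUSCO category as a percentage of all genes searched,
    rounded to one decimal place."""
    counts = {"complete": complete, "fragmented": fragmented, "missing": missing}
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


# Counts reported in the text: 2,563 complete, 227 fragmented, 233 missing,
# which together account for all 3,023 vertebrate BUSCO genes.
lesser_capybara = busco_percentages(2563, 227, 233)
```

With these counts the function gives roughly 84.8% complete, 7.5% fragmented, and 7.7% missing.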

Figure 2. Percentage of Vertebrate BUSCO genes recovered in published rodent genome assemblies (plus rabbit), as a measure of assembly completeness. The genome assemblies of H. hydrochaeris (Herrera-Álvarez et al., 2018) and H. isthmius (this study) are indicated in bold inside the rectangle.

Transcriptome sequencing, assembly, and functional annotation

A total of 882.24 million reads with a length of 150 bp were sequenced from the 11 RNA-seq libraries made from Hydrochoerus hydrochaeris. After the quality filtering, normalization, and trimming steps (see Materials and Methods), a total of 140 million reads were kept and subsequently used for transcriptome assembly. The resulting transcriptome had a total of 994,100 transcripts belonging to 768,228 genes. Transcriptome GC content was estimated to be 46.58%, with an N50 of 704 bp and an average length of 574.17 bp, taking into account only the longest isoform per gene. Among the Eggnog functional categories, the most common were translation, ribosomal structure and biogenesis (14.9%), followed by amino acid transport and metabolism (9.1%) and energy production and conversion (8%; Figure 3A). The most represented Pfam protein families were immunoglobulin V-set domain, immunoglobulin domain, and zinc finger C2H2 type, with 18.2%, 8.5% and 5.4%, respectively (Figure 3B). The three most represented GO terms were cellular nitrogen compound metabolic process (5.1%), DNA metabolic process (3.7%) and biosynthetic process (3.2%; Figure 3C). Among the 768,228 genes, we predicted 273,824 coding regions, of which 5.3% (14,553 of 273,824) were predicted to have signal peptides.

Figure 3. Capybara (Hydrochoerus hydrochaeris) transcriptome functional annotation. (A) Functional categories from Eggnog mapping. (B) Protein families from Pfam. (C) 25 Gene ontology categories most represented.

Genomic repeat masking

Lesser capybara: The repeats identified by RepeatMasker occupied 27.72% of the total assembly. These repeats belonged mostly to the LINE class (51.9%; long interspersed elements), followed by LTRs (17.2%; long terminal repeats) and SINEs (15.4%; short interspersed elements) (Figure 4A-B). Almost half of the repetitive elements (47.08%) were LINEs from the subclass LINE-1, which is consistent with what is reported for humans (45.55%; Lander et al., 2001), mice (48.71%; Waterston et al., 2002) and other rodents (Figure 4C; Smit, Hubley & Green, 2015).

Figure 4. Frequency of classes (A) and subclasses (B) of repetitive elements within the lesser capybara (Hydrochoerus isthmius) genome assembly. (A) SINEs: short interspersed elements, LINEs: long interspersed elements, LTR: long terminal repeats, others: satellites, simple repeats, small RNA, and low complexity repeats. (B) ALU/B1, B2 - B4, and MIRs: subclasses of SINEs; LINE1, LINE2, and L3/CR1: subclasses of LINEs; ERVL, ERVL-MaLRs, ERV class I, and ERV class II: subclasses of LTR elements; hAT-Charlie and TcMar-Tigger: subclasses of DNA elements. (C) Frequency of subclasses of repetitive elements in different species of rodents. Data for all but the two capybara species are reported in Smit, Hubley & Green (2015).

Genomic annotation and gene content

We annotated a total of 26,080 genes, 82% of which had an annotation edit distance (AED) < 0.5, indicating high quality of the annotations. The higher number of genes annotated in the lesser capybara compared to the capybara can be explained by the greater fragmentation of the lesser capybara assembly (scaffold N50: 787 Kb, versus 12.2 Mb in the capybara). More than half of the annotations were involved in cellular process (31.2%) and metabolic process (20.1%), followed by biological regulation (15.3%) and localization (11.1%; Figure 5).

Figure 5. Genes predicted in the lesser capybara (Hydrochoerus isthmius) genome annotation that are involved in: (A) Biological processes, (B) cellular components, (C) molecular functions, and (D) Protein classes from Pfam.

Microsatellites - A total of 509,265 and 718,560 microsatellites were found in the capybara and lesser capybara, respectively. In both assemblies, microsatellites generally became less common as the length of the repeat unit increased. However, for some odd unit lengths n (n = 3, 7, 11), microsatellites with a unit length of n + 1 had a higher count (Table 5; Figure 6).

Figure 6. Counts of microsatellites found in each genome assembly. The y-axis represents the repeat unit length in base pairs, while the color and size of each circle represent the total count of microsatellites with that unit length found in each of the genome assemblies.

Mitochondrial genome

Capybara - The mitogenome assembly of the capybara, H. hydrochaeris, consisted of a single scaffold with three gaps. It contained two ribosomal RNA genes (12S and 16S), 21 transfer RNA genes, and 13 protein-coding genes (CDS). The assembly seemed to suggest a duplication of tRNA-W and the deletion of tRNA-R.

Lesser capybara - The mitogenome of H. isthmius assembled here had a length of 16,525 base pairs and a GC content of 39.37%, and contained two ribosomal RNA genes (12S and 16S), 22 transfer RNA genes, and 13 protein-coding genes (CDS). No major rearrangements or gains/losses were found relative to previously reported mammalian mitogenomes.

See Table 6 for the size and position of genes within the mitochondrial genome of each species, and Figure 7 for a visual comparison of both mitogenomes with that of the guinea pig.

Figure 7. Mitochondrial genome of the (A) capybara (Hydrochoerus hydrochaeris), (B) lesser capybara (H. isthmius) and (C) guinea pig (Cavia porcellus). Annotated by MITOS web server. The mitogenome sequences were found in a single scaffold in each of the capybaras assemblies with Blast v2.9.0 from similarity with the guinea pig (Cavia porcellus) mitochondrial genome. Image made in the CGView Server.

Demography

We fit a pairwise sequentially Markovian coalescent (PSMC) model to the genome assembly of each capybara species to evaluate possible changes in effective population size in the recent past (Figure 8). The capybara’s PSMC model suggested a relatively steady population size, fluctuating mildly between Ne = 10,000 and 25,000. For the lesser capybara, on the other hand, a sudden population expansion started ~500,000 years ago, peaked around 200,000 years ago, and crashed back down to roughly Ne = 20,000 some 100,000 years ago (Figure 8).

Figure 8. Pairwise sequentially Markovian coalescent (PSMC) analysis of the capybara and lesser capybara genome assemblies (in blue and red, respectively). Time runs from the present on the left towards the past on the right on the x-axis, and the y-axis represents effective population size (Ne) in units of 10⁴.

Synteny between the capybaras and the common guinea pig

We aligned the whole-genome assemblies of the two capybaras against each other, and each of them independently against the guinea pig genome assembly (AccNum GCA_000151735.1), to search for regions of large genomic change. Between the capybaras and the guinea pig we found one major unmatched region, where either the guinea pig gained or rearranged a region or the capybaras lost or rearranged it (red arrows in Figure 9A-B). Between the two capybara assemblies there were no major rearrangements, although the lower contiguity of the lesser capybara assembly is noticeable (Figure 9C). Due to the low contiguity of the assemblies used for this analysis, it is not possible to determine whether the mismatches detected are due to one or multiple rearrangements, nor in which specific parts of the capybara genomes they occur.

Figure 9. Pairwise circos plots showing synteny on 100 kb windows between the capybara, Hydrochoerus hydrochaeris, the lesser capybara, H. isthmius, and the guinea pig. Each species is represented by a color (Capybara: blue, lesser capybara: orange, and guinea pig: green). (A) Comparison between the guinea pig, left, and lesser capybara, right. (B) Comparison between the capybara, left, and the lesser capybara, right. (C) Comparison between the guinea pig, left, and the capybara, right.

Genomic diversity within Hydrochoerus

We extracted BUSCO genes from the transcriptome of the capybara from the eastern Llanos of Colombia, from the transcripts predicted in the gene annotations of the capybara (Bolivia) and lesser capybara (western Colombia) genome assemblies, and from the guinea pig, used as an outgroup. From these, we reconstructed a phylogenomic tree based on 2,325 concatenated genes to test the following hypothesis: if H. hydrochaeris and H. isthmius are distinct species, the two capybara samples should form a clade to the exclusion of the lesser capybara. Instead, we found that the lesser capybara was nested inside the capybara clade, being genetically closer to the Bolivian sample than to the geographically more proximal Colombian Llanos sample, indicating a complex phylogeographic history (Figure 10).

Figure 10. Simple phylogeographic analysis of the capybara (n = 2 localities) and the lesser capybara. For this analysis, transcriptomic data from a capybara from Colombia and genomic data from a capybara from Bolivia and a lesser capybara from Colombia were used. Each sample is connected by a yellow line from the phylogenetic tree to the region where it came from. The lesser capybara’s geographic range is indicated in purple, and the capybara’s geographic range in blue.

Phylogenomic analysis

We inferred phylogenetic relationships among the two capybara species and the hystricomorph species with proteomes available in Ensembl, using the mouse as an outgroup. We found a total of 508 single-copy orthologs present in all nine species, which we subsequently used in the phylogenomic analysis. Within the suborder Hystricomorpha we found three distinct clades: one composed of the Damara mole rat and the naked mole rat (Family: Bathyergidae), which diverged from the rest approximately 56 million years ago (mya); a second containing the chinchilla (Family: Chinchillidae) and the degu (Family: Octodontidae); and a third containing the guinea pigs and the capybaras (Family: Caviidae). The latter two clades separated from each other approximately 30 mya (Figure 11).
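As an illustration of the ortholog selection step, the sketch below keeps only orthogroups with exactly one gene in every species. It assumes a per-orthogroup table of gene counts by species. The orthogroup IDs, species labels, and counts are invented for the example and do not come from our data.

```python
# Sketch: keep orthogroups with exactly one gene copy in every species.
# Orthogroup IDs, species labels, and counts are made up for illustration.

def single_copy_orthogroups(counts, species):
    """counts maps orthogroup ID -> {species: number of genes}."""
    return sorted(
        og for og, per_species in counts.items()
        if all(per_species.get(sp, 0) == 1 for sp in species)
    )


species = ["capybara", "lesser_capybara", "guinea_pig", "mouse"]
counts = {
    "OG0001": {"capybara": 1, "lesser_capybara": 1, "guinea_pig": 1, "mouse": 1},
    "OG0002": {"capybara": 2, "lesser_capybara": 1, "guinea_pig": 1, "mouse": 1},  # duplicated
    "OG0003": {"capybara": 1, "lesser_capybara": 1, "guinea_pig": 1},              # absent in mouse
    "OG0004": {"capybara": 1, "lesser_capybara": 1, "guinea_pig": 1, "mouse": 1},
}

kept = single_copy_orthogroups(counts, species)
print(kept)  # ['OG0001', 'OG0004']
```

Requiring presence in all species is a strict filter. It trades the number of usable genes for a matrix with no missing taxa.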

Figure 11. Phylogenetic relationships among rodent species based on single-copy orthologs found in all nine species. These orthologs were aligned, filtered for quality, and concatenated with FASconCAT-G v1.04. A maximum likelihood tree was then constructed with RAxML, and species divergence times were estimated with MCMCTree. All Hystricomorpha proteomes available in Ensembl, plus that of the mouse, were used as inputs. Blue lines represent 95% credibility intervals around divergence times. Numbers on the upper x-axis represent millions of years ago, and letters on the lower x-axis represent geological epochs (La.: Late Cretaceous; Pa.: Paleocene; Eo.: Eocene; Ol.: Oligocene; Mi.: Miocene).
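The concatenation step mentioned in the caption can be sketched as follows. This is a simplified stand-in for what a tool such as FASconCAT-G does, not its actual code. The gene names and sequences are toy values.

```python
# Sketch of supermatrix concatenation (not FASconCAT-G's implementation).
# Gene names and sequences are toy values for illustration.

def concatenate(alignments, species):
    """alignments: list of (gene name, {species: aligned sequence}).

    Returns (supermatrix dict, list of 1-based (gene, start, end) partitions).
    """
    supermatrix = {sp: [] for sp in species}
    partitions = []
    position = 0
    for gene, aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"{gene}: sequences differ in length or are missing")
        length = lengths.pop()
        for sp in species:
            # A species missing from a gene is filled with gap characters.
            supermatrix[sp].append(aln.get(sp, "-" * length))
        partitions.append((gene, position + 1, position + length))
        position += length
    return {sp: "".join(parts) for sp, parts in supermatrix.items()}, partitions


alignments = [
    ("geneA", {"capybara": "MKT-", "guinea_pig": "MKTA", "mouse": "MRTA"}),
    ("geneB", {"capybara": "LLV", "guinea_pig": "LIV", "mouse": "L-V"}),
]
matrix, parts = concatenate(alignments, ["capybara", "guinea_pig", "mouse"])
print(matrix["capybara"])  # MKT-LLV
print(parts)               # [('geneA', 1, 4), ('geneB', 5, 7)]
```

The partition coordinates allow a separate substitution model to be assigned to each gene in the downstream maximum likelihood analysis.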

Discussion

Genome and transcriptome assemblies and annotation

Here we report the first genome assembly for the lesser capybara as well as the first transcriptome assembly for the capybara. As demonstrated previously, low-cost genome assemblies such as those provided by 10x Genomics are a valuable tool for understanding a species from a genomic perspective (e.g., Armstrong et al., 2019; Kocher et al., 2018; Hulse-Kemp et al., 2018). Even when the resulting assembly is not highly contiguous, these kinds of technologies allow one to infer a range of biological processes.

Rapid population changes in the lesser capybara

PSMC models use the distribution of heterozygous sites along a single diploid genome to estimate coalescence times and thereby infer changes in effective population size over time (Li & Durbin, 2011). Nadachowska‐Brzyska et al. (2016) suggested that PSMC models are reliable only for genome assemblies with a mean coverage over 18X and no more than 25% missing data, criteria that our capybara and lesser capybara genome assemblies meet. The PSMC models reported here show that the effective population sizes of both species are declining (Figure 8), a trend that is more drastic in the lesser capybara. Similar trends are seen in other large mammal species, which tend to need larger patches of uninterrupted habitat to persist (Berger, 2017). Currently, the capybara's habitat is under various threats and is shrinking because of extensive changes in land use (Göpel et al., 2019). Another possible contributor to this trend in capybaras is the shift by local people from hunting them to breeding them for meat (da Rosa et al., 2019). Active breeding reduces the effective population size, leading to problems such as inbreeding depression and higher genetic load (Wang, Santiago, & Caballero, 2016; Hedrick & Garcia-Dorado, 2016).
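For readers unfamiliar with how PSMC output is converted to real units, the sketch below applies the scaling described in the PSMC documentation: N0 = θ0 / (4μs), time in generations = 2·N0·t_k, and Ne = N0·λ_k, where s is the bin size used to build the PSMC input. The mutation rate, generation time, θ0, and interval values are placeholders chosen to give round numbers. They are not the values used in this study.

```python
# Sketch: convert PSMC's scaled output to years and effective population size.
# theta0, the intervals, mu, and generation_time are placeholders, not values from this study.

def scale_psmc(theta0, intervals, mu, generation_time, bin_size=100):
    """intervals: list of (t_k, lambda_k) pairs in PSMC's scaled units."""
    n0 = theta0 / (4.0 * mu * bin_size)  # reference effective population size N0
    return [
        (2.0 * n0 * t_k * generation_time, lambda_k * n0)  # (years ago, Ne)
        for t_k, lambda_k in intervals
    ]


scaled = scale_psmc(
    theta0=0.001,
    intervals=[(0.1, 2.0), (1.0, 0.5)],
    mu=2.5e-9,            # hypothetical per-site, per-generation mutation rate
    generation_time=3.0,  # hypothetical generation time in years
)
for years_ago, ne in scaled:
    print(f"{years_ago:.0f} years ago: Ne = {ne:.0f}")
```

With these placeholder inputs N0 is about 1000, so the two intervals correspond to roughly 600 and 6000 years ago with Ne of about 2000 and 500. Because both axes scale with the assumed mutation rate and generation time, the shape of the curve is more robust than the absolute values.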

Phylogenomic analyses place capybaras with other caviids

Our phylogenomic analysis places capybaras as sister to guinea pigs (Family: Caviidae; genus Cavia) and the caviids as sister to a group composed of chinchillas and degus (Families: Chinchillidae and Octodontidae). Together these families make up the Caviomorpha in this analysis, and this clade is sister to Phiomorpha, represented here by the family Bathyergidae, the mole rats. This topology agrees with those found previously (Upham & Patterson, 2015; Álvarez, Arévalo, & Verzi, 2017). Our divergence time estimates for the different groups match those reported by Álvarez, Arévalo, & Verzi (2017) but differ from those reported by Upham & Patterson (2015). However, Upham & Patterson (2015) used sequences from two mitochondrial genes and three nuclear genes, while Álvarez, Arévalo, & Verzi (2017) used mitochondrial genes in addition to five nuclear genes. We are highly confident in the results reported here for two reasons. First, we used whole-genome assemblies to identify single-copy orthologs and applied quality-filtering steps. Second, we took two different approaches, a concatenated maximum likelihood tree and a neighbour-joining approach that treats each gene as independent, to account for sources of error in phylogenetic inference such as incomplete lineage sorting.

Lesser capybara is nested inside capybara

Our phylogenomic analyses indicate that the lesser capybara is more closely related to the capybara sample from Bolivia than it is to the capybara from the Llanos of Colombia. Although this result suggests that the capybara (H. hydrochaeris) is a paraphyletic species, there is morphological evidence that the two taxa are divergent (Mones, 1991). Given the small sample size (n = 3), further population genetic studies coupled with morphological analyses should be carried out. If the pattern seen here is real, the artificial inbreeding of some populations through breeding by humans would be even more concerning, since it can disrupt the evolutionary processes that are isolating capybaras from eastern and western Colombia, which are separated by the Andes mountain range.

Acknowledgements

This work was supported by Colciencias grant 1204-659-44334 (to AJC). Special thanks to the DCB and the Facultad de Ciencias of the Universidad de los Andes for giving us access to the Magnus cluster, on which all the computational analyses were run. Thanks to the Vice-President's office of the Universidad de los Andes for help with collecting and mobilization permits. Thanks to the members of the Biom|ics lab, whose comments and help were invaluable for this project, and to Juanita Herrera, María José Páramo, and Diego Perico for their help in the field collecting samples. Thanks to Catalina Palacios, Phil Morin, and Alejandro Reyes for their analytical help and advice. Finally, thanks to Rachel Voyt and Melissa Hernández for their insightful comments on this manuscript.

Literature cited

Aldana-Domínguez, J., Vieira-Muñoz, M. I., & Bejarano, P. (2013). Conservation and use of the capybara and the lesser capybara in Colombia. In Capybara (pp. 321-332). Springer, New York, NY.
Álvarez, A., Arévalo, R. L. M., & Verzi, D. H. (2017). Diversification patterns and size evolution in caviomorph rodents. Biological Journal of the Linnean Society, 121(4), 907-922.
Armenteros, J. J. A., Tsirigos, K. D., Sønderby, C. K., Petersen, T. N., Winther, O., Brunak, S., ... & Nielsen, H. (2019). SignalP 5.0 improves signal peptide predictions using deep neural networks. Nature biotechnology, 37(4), 420.
Armstrong, E. E., Taylor, R. W., Prost, S., Blinston, P., van der Meer, E., Madzikanda, H., ... & Petrov, D. (2019). Cost-effective assembly of the African wild dog (Lycaon pictus) genome using linked reads. GigaScience, 8(2), giy124.
Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., ... & Harris, M. A. (2000). Gene ontology: tool for the unification of biology. Nature genetics, 25(1), 25.
Beier, S., Thiel, T., Münch, T., Scholz, U., & Mascher, M. (2017). MISA-web: a web server for microsatellite prediction. Bioinformatics, 33(16), 2583-2585. https://doi.org/10.1093/bioinformatics/btx198
Berger, J. (2017). The science and challenges of conserving large wild mammals in 21st-century American protected areas. In Science, Conservation, and National, 189-211.
Bernt, M., Donath, A., Jühling, F., Externbrink, F., Florentz, C., Fritzsch, G., ... & Stadler, P. F. (2013). MITOS: improved de novo metazoan mitochondrial genome annotation. Molecular Phylogenetics and Evolution, 69(2), 313-319.
Bolger, A. M., Lohse, M., & Usadel, B. (2014). Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics, 30(15), 2114-2120.
Brown, C. T., Howe, A., Zhang, Q., Pyrkosz, A. B., & Brom, T. H. (2012). A reference-free algorithm for computational normalization of shotgun sequencing data. arXiv preprint arXiv:1203.4802.
Bryant, D. M., Johnson, K., DiTommaso, T., Tickle, T., Couger, M. B., Payzin-Dogru, D., ... & Bateman, J. (2017). A tissue-mapped axolotl de novo transcriptome enables identification of limb regeneration factors. Cell reports, 18(3), 762-776.
Burgin, C. J., Colella, J. P., Kahn, P. L., & Upham, N. S. (2018). How many species of mammals are there? Journal of Mammalogy, 99(1), 1-14.
Bushmanova, E., Antipov, D., Lapidus, A., Suvorov, V., & Prjibelski, A. D. (2016). rnaQUAST: a quality assessment tool for de novo transcriptome assemblies. Bioinformatics, 32(14), 2210-2212.
Cahill, J. A., Soares, A. E., Green, R. E., & Shapiro, B. (2016). Inferring species divergence times using pairwise sequential Markovian coalescent modelling and low-coverage genomic data. Philosophical Transactions of the Royal Society B: Biological Sciences, 371(1699), 20150138.
Camacho‐Sanchez, M., Burraco, P., Gomez‐Mestre, I., & Leonard, J. A. (2013). Preservation of RNA and DNA from mammal samples under field conditions. Molecular Ecology Resources, 13(4), 663-673.
Capella-Gutiérrez, S., Silla-Martínez, J. M., & Gabaldón, T. (2009). trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics, 25(15), 1972-1973.
Carrascal, J., Linares, J., & Chacón, J. (2011). Behavior of the Hydrochoerus hydrochaeris isthmius in a productive system, department of Córdoba, Colombia. Revista MVZ Córdoba, 16(3), 2754-2764.
Chernomor, O., von Haeseler, A., & Minh, B. Q. (2016). Terrace aware data structure for phylogenomic inference from supermatrices. Systematic biology, 65(6), 997-1008.
Correa, J. B., & Jorgenson, J. P. (2009). Aspectos poblacionales del cacó (Hydrochoerus hydrochaeris isthmius) y amenazas para su conservación en el Nor-Occidente de Colombia. Mastozoología neotropical, 16(1), 27-38.
Corriale, M. J., & Herrera, E. A. (2014). Patterns of habitat use and selection by the capybara (Hydrochoerus hydrochaeris): a landscape‐scale analysis. Ecological research, 29(2), 191-201.
Crusoe, M. R., Alameldin, H. F., Awad, S., Boucher, E., Caldwell, A., Cartwright, R., ... & Fenton, J. (2015). The khmer software package: enabling efficient nucleotide sequence analysis. F1000Research, 4.
da Rosa, P. P., Ávila, B. P., Costa, P. T., Fluck, A. C., Scheibler, R. B., Ferreira, O. G. L., & Gularte, M. A. (2019). Analysis of the perception and behavior of consumers regarding capybara meat by means of exploratory methods. Meat science, 152, 81-87.
Danecek, P., Auton, A., Abecasis, G., Albers, C. A., Banks, E., DePristo, M. A., ... & McVean, G. (2011). The variant call format and VCFtools. Bioinformatics, 27(15), 2156-2158.
Delgado, C. & Emmons, L. (2016). Hydrochoerus isthmius. The IUCN Red List of Threatened Species 2016: e.T136277A22189896. https://dx.doi.org/10.2305/IUCN.UK.2016-2.RLTS.T136277A22189896.en.
Göpel, J., Schüngel, J., Schaldach, R., Stuch, B., & Löbelt, N. (2019). Assessing the effects of agricultural intensification on natural habitats and biodiversity in Southern Amazonia. bioRxiv, 846709.
Grabherr, M. G., Haas, B. J., Yassour, M., Levin, J. Z., Thompson, D. A., Amit, I., ... & Chen, Z. (2011). Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nature biotechnology, 29(7), 644.
Grant, J. R., & Stothard, P. (2008). The CGView Server: a comparative genomics tool for circular genomes. Nucleic acids research, 36(suppl_2), W181-W184.
Gurevich, A., Saveliev, V., Vyahhi, N., & Tesler, G. (2013). QUAST: quality assessment tool for genome assemblies. Bioinformatics, 29(8), 1072-1075.
El-Gebali, S., Mistry, J., Bateman, A., Eddy, S. R., Luciani, A., Potter, S. C., ... & Sonnhammer, E. L. L. (2018). The Pfam protein families database in 2019. Nucleic acids research, 47(D1), D427-D432.
Emilio, A. H., & Macdonald, D. W. (1994). Social significance of scent marking in capybaras. Journal of Mammalogy, 75(2), 410-415.
Emms, D. M., & Kelly, S. (2019). OrthoFinder: phylogenetic orthology inference for comparative genomics. BioRxiv, 466201.
Fabre, P. H., Hautier, L., Dimitrov, D., & Douzery, E. J. (2012). A glimpse on the pattern of rodent diversification: a phylogenetic approach. BMC evolutionary biology, 12(1), 88.
Haas, B., & Papanicolaou, A. (2015). TransDecoder (find coding regions within transcripts). GitHub, https://github.com/TransDecoder/TransDecoder.
Hafner, J. C., & Hafner, M. S. (1988). Heterochrony in rodents. In Heterochrony in Evolution (pp. 217-235). Springer, Boston, MA.
Hedges, S. B., Dudley, J., & Kumar, S. (2006). TimeTree: a public knowledge-base of divergence times among organisms. Bioinformatics, 22(23), 2971-2972.
Hedrick, P. W., & Garcia-Dorado, A. (2016). Understanding inbreeding depression, purging, and genetic rescue. Trends in Ecology & Evolution, 31(12), 940-952.
Herrera-Álvarez, S., Karlsson, E., Ryder, O. A., Lindblad-Toh, K., & Crawford, A. J. (2018). How to make a rodent giant: Genomic basis and tradeoffs of gigantism in the capybara, the world’s largest rodent. BioRxiv, 424606. https://doi.org/10.1101/424606
Holt, C., & Yandell, M. (2011). MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects. BMC bioinformatics, 12(1), 491.
Hulse-Kemp, A. M., Maheshwari, S., Stoffel, K., Hill, T. A., Jaffe, D., Williams, S. R., ... & Schatz, M. C. (2018). Reference quality assembly of the 3.5-Gb genome of Capsicum annuum from a single linked-read library. Horticulture research, 5(1), 1-13.
Huerta-Cepas, J., Szklarczyk, D., Forslund, K., Cook, H., Heller, D., Walter, M. C., ... & Jensen, L. J. (2015). eggNOG 4.5: a hierarchical orthology framework with improved functional annotations for eukaryotic, prokaryotic and viral sequences. Nucleic acids research, 44(D1), D286-D293.
Jackman, S. D., Coombe, L., Chu, J., Warren, R. L., Vandervalk, B. P., Yeo, S., ... & Birol, I. (2018). Tigmint: correcting assembly errors using linked reads from large molecules. BMC bioinformatics, 19(1), 393.
Jarne, P., & Lagoda, P. J. (1996). Microsatellites, from molecules to populations and back. Trends in ecology & evolution, 11(10), 424-429.
Jones, P., Binns, D., Chang, H. Y., Fraser, M., Li, W., McAnulla, C., ... & Pesseat, S. (2014). InterProScan 5: genome-scale protein function classification. Bioinformatics, 30(9), 1236-1240.
Katoh, K., & Standley, D. M. (2013). MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Molecular biology and evolution, 30(4), 772-780.
Kocher, S. D., Mallarino, R., Rubin, B. E., Douglas, W. Y., Hoekstra, H. E., & Pierce, N. E. (2018). The genetic basis of a social polymorphism in halictid bees. Nature communications, 9(1), 1-7.
Krogh, A., Larsson, B., Von Heijne, G., & Sonnhammer, E. L. (2001). Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. Journal of molecular biology, 305(3), 567-580.
Krzywinski, M., Schein, J., Birol, I., Connors, J., Gascoyne, R., Horsman, D., ... & Marra, M. A. (2009). Circos: an information aesthetic for comparative genomics. Genome research, 19(9), 1639-1645.
Kück, P., & Meusemann, K. (2010). FASconCAT: convenient handling of data matrices. Molecular Phylogenetics and Evolution, 56(3), 1115-1118.
Lander, E., Linton, L., Birren, B., et al. (2001). Initial sequencing and analysis of the human genome. Nature, 409, 860–921. doi:10.1038/35057062.
Letunic, I., & Bork, P. (2019). Interactive Tree Of Life (iTOL) v4: recent updates and new developments. Nucleic acids research.
Li, H. (2011). A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics, 27(21), 2987-2993.
Li, H., & Durbin, R. (2009). Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics, 25(14), 1754-1760.
Li, H., & Durbin, R. (2011). Inference of human population history from individual whole-genome sequences. Nature, 475(7357), 493.
Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., ... & Durbin, R. (2009). The sequence alignment/map format and SAMtools. Bioinformatics, 25(16), 2078-2079.
Li, W., Jaroszewski, L., & Godzik, A. (2001). Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics, 17(3), 282-283.
Liu, L., & Yu, L. (2011). Estimating species trees from unrooted gene trees. Systematic biology, 60(5), 661-667.
Lord, R. D. (1994). A descriptive account of capybara behaviour. Studies on neotropical fauna and environment, 29(1), 11-22.
Macdonald, D. W. (1981). Dwindling resources and the social behaviour of capybaras, (Hydrochoerus hydrochaeris) (Mammalia). Journal of Zoology, 194(3), 371-391.
Mistry, J., Finn, R. D., Eddy, S. R., Bateman, A., & Punta, M. (2013). Challenges in homology search: HMMER3 and convergent evolution of coiled-coil regions. Nucleic acids research, 41(12), e121-e121.
Mones, A. (1991). Monografía de la familia Hydrochoeridae (Mammalia: Rodentia).
Moreira, J. R., Alvarez, M. R., Tarifa, T., Pacheco, V., Taber, A., Tirira, D. G., ... & Macdonald, D. W. (2013). Taxonomy, natural history and distribution of the capybara. In Capybara (pp. 3-37). Springer, New York, NY.
Nadachowska‐Brzyska, K., Burri, R., Smeds, L., & Ellegren, H. (2016). PSMC analysis of effective population sizes in molecular ecology and its application to black‐and‐white Ficedula flycatchers. Molecular ecology, 25(5), 1058-1072.
Nguyen, L. T., Schmidt, H. A., von Haeseler, A., & Minh, B. Q. (2014). IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Molecular biology and evolution, 32(1), 268-274.
Patro, R., Duggal, G., Love, M. I., Irizarry, R. A., & Kingsford, C. (2017). Salmon provides fast and bias-aware quantification of transcript expression. Nature methods, 14(4), 417.
Paulino, D., Warren, R. L., Vandervalk, B. P., Raymond, A., Jackman, S. D., & Birol, I. (2015). Sealer: a scalable gap-closing application for finishing draft genomes. BMC bioinformatics, 16(1), 230.
Pinheiro, M. S., & Moreira, J. R. (2013). Products and uses of capybaras. In Capybara (pp. 211-227). Springer, New York, NY.
Putnam, N. H., O'Connell, B. L., Stites, J. C., Rice, B. J., Blanchette, M., Calef, R., ... & Haussler, D. (2016). Chromosome-scale shotgun assembly using an in vitro method for long-range linkage. Genome research, 26(3), 342-350.
Puttick, M. N. (2019). MCMCtreeR: functions to prepare MCMCtree analyses and visualize posterior ages on trees. Bioinformatics, 35(24), 5321-5322.
Reid, F. (2016). Hydrochoerus hydrochaeris. The IUCN Red List of Threatened Species 2016: e.T10300A22190005. https://dx.doi.org/10.2305/IUCN.UK.2016-2.RLTS.T10300A22190005.en.
Rosenfield, D. A., Nichi, M., Losano, J. D., Kawai, G., Leite, R. F., Acosta, A. J., ... & Pizzutto, C. S. (2019). Field-testing a single-dose immunocontraceptive in free-ranging male capybara (Hydrochoerus hydrochaeris): Evaluation of effects on reproductive physiology, secondary sexual characteristics, and agonistic behavior. reproduction science, 209, 106148.
Rowe, D. L., & Honeycutt, R. L. (2002). Phylogenetic relationships, ecological correlates, and molecular evolution within the Cavioidea (Mammalia, Rodentia). Molecular Biology and Evolution, 19(3), 263-277.
Samuels, J. X. (2009). Cranial morphology and dietary habits of rodents. Zoological Journal of the Linnean Society, 156(4), 864-888.
Smit, A. F. A., Hubley, R., & Green, P. (2015). RepeatMasker Open-4.0. 2013–2015.
Stamatakis, A. (2014). RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics, 30(9), 1312-1313.
Trillmich, F., Kraus, C., Künkele, J., Asher, M., Clara, M., Dekomien, G., ... & Sachser, N. (2004). Species-level differentiation of two cryptic species pairs of wild cavies, genera Cavia and , with a discussion of the relationship between social systems and phylogeny in the . Canadian Journal of Zoology, 82(3), 516-524.
UniProt Consortium. (2018). UniProt: a worldwide hub of protein knowledge. Nucleic acids research, 47(D1), D506-D515.
Upham, N. S., & Patterson, B. D. (2015). Evolution of caviomorph rodents: a complete phylogeny and timetree for living genera. Biology of caviomorph rodents: diversity and evolution. Buenos Aires: SAREM Series A, 1, 63-120.
Waterhouse, R. M., Seppey, M., Simão, F. A., Manni, M., Ioannidis, P., Klioutchnikov, G., ... & Zdobnov, E. M. (2017). BUSCO applications from quality assessments to gene prediction and phylogenomics. Molecular biology and evolution, 35(3), 543-548.
Waterston, R. H., Lindblad-Toh, K., Birney, E., Rogers, J., Abril, J. F., Agarwal, P., ... & Antonarakis, S. E. (2002). Initial sequencing and comparative analysis of the mouse genome. Nature, 420(6915), 520-562.
Wang, J., Santiago, E., & Caballero, A. (2016). Prediction and estimation of effective population size. Heredity, 117(4), 193-206.
Weisenfeld, N. I., Kumar, V., Shah, P., Church, D. M., & Jaffe, D. B. (2017). Direct determination of diploid genome sequences. Genome research, 27(5), 757-767. doi: 10.1101/gr.214874.116.
Weisenfeld, N. I., Yin, S., Sharpe, T., Lau, B., Hegarty, R., Holmes, L., ... & Nusbaum, C. (2014). Comprehensive variation discovery in single human genomes. Nature genetics, 46(12), 1350.
Whelan, S., Irisarri, I., & Burki, F. (2018). PREQUAL: detecting non-homologous characters in sets of unaligned homologous sequences. Bioinformatics, 34(22), 3929-3930.
Wilson, D. E., & Reeder, D. M. (Eds.). (2005). Mammal species of the world: a taxonomic and geographic reference (Vol. 1). JHU Press.
Yang, Z. (2007). PAML 4: phylogenetic analysis by maximum likelihood. Molecular biology and evolution, 24(8), 1586-1591.
Yeo, S., Coombe, L., Warren, R. L., Chu, J., & Birol, I. (2017). ARCS: scaffolding genome drafts with linked reads. Bioinformatics, 34(5), 725-731.
Young, M. D., Wakefield, M. J., Smyth, G. K., & Oshlack, A. (2010). Gene ontology analysis for RNA-seq: accounting for selection bias. Genome biology, 11(2), R14.

Tables

Table 1. Quality metrics reported by the software Supernova v2.0.1 before and after assembling the lesser capybara genome (Weisenfeld et al., 2017).

Input statistics

Number of reads 1751.10 M

Mean read length after trimming 139.50 b

Raw coverage 84.29X

Effective read coverage 50.16X

Fraction of Q30 bases in read 2 75.26%

Median insert size 345.00 b

Fraction of proper read pairs 89.81%

Fraction of barcodes used 1

Estimated genome size 3.14 Gb

Genome repetitivity index 9.95%

High AT index 0.06%

GC content of assembly 40.04%

Dinucleotide content 1.23%

Weighted mean molecule size 36.91 Kb

Molecule count extending 10 kb on both sides 67.21

Mean distance between heterozygous SNPs 2.28 Kb

Fraction of reads that are not barcoded 6.97%

N50 reads per barcode 1.36 K

Fraction of reads that are duplicates 30.66%

Nonduplicate and phased reads 38.94%

Table 2. Rodent proteomes used for comparative analyses.

Common name | Species | Genome version | Accession number
 | Mus spretus | SPRET_EiJ_v1 | GCA_001624865.1
Alpine marmot | Marmota marmota marmota | marMar2.1 | GCA_001458135.1
American beaver | Castor canadensis | C.can_genome_v1.0 | GCA_001984765.1
Arctic ground squirrel | Urocitellus parryii | ASM342692v1 | GCA_003426925.1
Brazilian guinea pig | Cavia aperea | CavAp1.0 | GCA_000688575.1
Chinese hamster CriGri | Cricetulus griseus | CriGri_1.0 | GCA_000223135.1
Daurian ground squirrel | Spermophilus dauricus | ASM240643v1 | GCA_002406435.1
Degu | Octodon degus | OctDeg1.0 | GCA_000260255.1
Golden Hamster | Mesocricetus auratus | MesAur1.0 | GCA_000349665.1
Guinea Pig | Cavia porcellus | Cavpor3.0 | GCA_000151735.1
Kangaroo rat | Dipodomys ordii | Dord_2.0 | GCA_000151885.2
Lesser Egyptian jerboa | Jaculus jaculus | JacJac1.0 | GCA_000280705.1
Long-tailed chinchilla | Chinchilla lanigera | ChiLan1.0 | GCA_000276665.1
Mongolian gerbil | Meriones unguiculatus | MunDraft-v1.0 | GCA_002204375.1
Damara mole rat | Fukomys damarensis | DMR_v1.0 | GCA_000743615.1
Naked mole-rat | Heterocephalus glaber | HetGla_female_1.0 | GCA_000247695.1
Squirrel | Ictidomys tridecemlineatus | SpeTri2.0 | GCA_000236235.1
Prairie vole | Microtus ochrogaster | MicOch1.0 | GCA_000317375.1
Ryukyu mouse | Mus caroli | CAROLI_EIJ_v1.1 | GCA_900094665.2
Mouse | Mus musculus | GRCm38.p6 | GCA_000001635.8
Shrew mouse | Mus pahari | PAHARI_EIJ_v1.1 | GCA_900095145.2
Steppe mouse | Mus spicilegus | MUSP714 | GCA_003336285.1
Upper Galilee mountains blind mole rat | Nannospalax galili | S.galili_v1.0 | GCA_000622305.1
Rabbit | Oryctolagus cuniculus | OryCun2.0 | GCA_000003625.1
Northern American deer mouse | Peromyscus maniculatus bairdii | HU_Pman_2.1 | GCA_003704035.1
Rat | Rattus norvegicus | Rnor_6.0 | GCA_000001895.4

Table 3. Tissues sequenced for the transcriptomic analysis.

Individuals sampled

Species | Coll. Number | Location (Lat, Lon) | Sex
Capybara (Hydrochoerus hydrochaeris) | AJC 05614 | 05.8106°, -70.9718° | Male
Capybara (Hydrochoerus hydrochaeris) | AJC 05615 | 05.8106°, -70.9718° | Female (gravid)

Tissues sampled

1. Heart

2. Brain

3. Kidney

4. Testes

5. Ovaries

6. Morillo

7. Anal gland

8. Fetal tissue

9. Bone marrow

10. Thyroid gland

11. Pancreas

Table 4. QUAST assembly statistics for each step of the assembly process. Each assembly step has two columns, scaffolds and contigs.

Assembly step Supernova Tigmint ARCS + LINKS Sealer (final version)

Quast analysis Scaffolds Contigs Scaffolds Contigs Scaffolds Contigs Scaffolds Contigs

# contigs (>= 0 bp) 29608 - 29762 - 28300 - 28300 -

# contigs (>= 1000 bp) 29608 - 29679 - 28217 - 28217 -

# contigs (>= 5000 bp) 16923 - 16982 - 15548 - 15551 -

# contigs (>= 10000 bp) 13322 50406 13367 50406 11991 50406 11994 29315

# contigs (>= 25000 bp) 10095 37807 10140 37807 9046 37807 9043 23474

# contigs (>= 50000 bp) 8503 25211 8543 25211 7702 25211 7702 18502

Largest contig 20810282 1295922 14115789 1295922 14115789 1295922 14116249 2052147

GC (%) 40.01 40.01 40.01 40.01 40.01 40.01 40.02 40.01

Reference GC (%) 39.95 39.95 39.95 39.95 39.95 39.95 39.95 39.95

N50 694764 116657 692348 116667 787285 116613 787090 232449

NG50 993541 156066 988664 156066 1101859 156066 1101324 308863

N75 344971 63185 344508 63189 389945 63157 390021 125489

NG75 649192 107821 645404 107821 726481 107821 725900 216001

L50 1465 9763 1483 9762 1328 9768 1328 4962

LG50 788 5738 803 5738 732 5738 732 2927

L75 3421 20802 3445 20801 3048 20815 3048 10522

LG75 1650 10992 1670 10992 1493 10992 1494 5567

# misassemblies 0 0 0 0 0 0 0 0

# unaligned mis. contigs 3846 6666 3865 6666 3685 6666 3687 5814

# unaligned contigs 44170 + 44374 + 7699 + 13918 7720 + 5647 44146 + 6719 + 13922 6717 + 5623 part part part 13918 part 5272 part part 5277 part 21537 + 10144 part

# N's per 100 kbp 523.76 0 492.54 0 496.33 0 474.79 0.11

# indels per 100 kbp 402.9 403.6 402.9 403.6 402.79 403.56 403.03 403.38

Complete BUSCO (%) 84.16 81.19 84.16 81.19 84.16 81.19 84.16 82.84

Partial BUSCO (%) 2.97 4.29 2.97 4.29 3.3 4.62 3.3 4.29
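For reference, the contiguity metrics in Table 4 follow the standard definitions. N50 is the length of the shortest sequence in the smallest set of longest sequences that together cover at least 50% of the assembly, and L50 is the number of sequences in that set. NG50 and LG50 use a reference or estimated genome size instead of the assembly length. The sketch below illustrates these definitions on toy lengths. It is not QUAST's implementation, and the function name and input values are hypothetical.

```python
# Sketch of the Nx/Lx and NGx/LGx definitions on toy data (not QUAST's code).

def nx_lx(lengths, fraction=0.5, genome_size=None):
    """Return (Nx, Lx); with genome_size set, return the NGx/LGx variants."""
    total = genome_size if genome_size is not None else sum(lengths)
    target = total * fraction
    running = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running >= target:
            return length, count
    return None, None  # sequences do not reach the requested fraction


lengths = [80, 70, 50, 40, 30, 20, 10]  # toy scaffold lengths, total 300

n50, l50 = nx_lx(lengths)
n75, l75 = nx_lx(lengths, fraction=0.75)
ng50, lg50 = nx_lx(lengths, genome_size=400)
print(n50, l50)    # 70 2
print(n75, l75)    # 40 4
print(ng50, lg50)  # 50 3
```

When the reference genome size is smaller than the assembly length, NG50 is at least as large as N50, which is the pattern seen throughout Table 4.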

Table 5. Microsatellites found in the capybara and lesser capybara genome assemblies.

Capybara - Hydrochoerus hydrochaeris

Unit size Number of SSRs

2 318056

3 60663

4 97980

5 25910

6 5210

7 319

8 721

9 187

10 126

11 8

12 85

Lesser capybara - Hydrochoerus isthmius

Unit size Number of SSRs

2 445243

3 87656

4 138686

5 38745

6 6562

7 493

8 727

9 163

10 151

11 17

12 117

Table 6. Capybara and lesser capybara mitogenome annotations.

Capybara - Hydrochoerus hydrochaeris

Name Feature Start Stop Strand

trnF tRNA 332 398 -

trnP tRNA 1756 1825 +

trnT tRNA 1832 1898 -

cob CDS 1908 3041 -

trnE tRNA 3049 3117 +

nad6 CDS 3130 3642 +

nad5 CDS 3657 5456 -

trnL1 tRNA 5457 5526 -

trnS1 tRNA 5526 5584 -

trnH tRNA 5588 5656 -

nad4 CDS 5667 7034 -

nad4l CDS 7031 7324 -

trnW tRNA 7326 7394 -

nad3 CDS 7397 7735 -

trnG tRNA 7742 7810 -

cox3_b CDS 7812 8033 -

cox3_a CDS 8039 8590 -

atp6 CDS 8596 9270 -

atp8 CDS 9246 9431 -

trnK tRNA 9433 9499 -

cox2-0 CDS 9506 10063 -

cox2-1 CDS 10062 10181 -

trnD tRNA 10183 10251 -

trnS2 tRNA 10259 10327 +

cox1_b CDS 10337 11872 -

trnY tRNA 11879 11947 +

trnC tRNA 11950 12016 +

trnN tRNA 12055 12127 +

trnA tRNA 12129 12197 +

trnW tRNA 12200 12269 -

nad2 CDS 12350 13297 -

trnM tRNA 13313 13381 -

trnQ tRNA 13384 13454 +

trnI tRNA 13452 13520 -

nad1 CDS 13528 14478 -

trnL2 tRNA 14482 14556 -

rrnL rRNA 14558 16125 -

trnV tRNA 16124 16192 -

rrnS rRNA 16190 16947 -

Lesser capybara - Hydrochoerus isthmius

Name Feature Start Stop Strand

trnP tRNA 1038 1107 +

trnT tRNA 1114 1181 -

cob CDS 1191 2324 -

trnE tRNA 2332 2400 +

nad6 CDS 2413 2925 +

nad5 CDS 2940 4739 -

trnL1 tRNA 4740 4809 -

trnS1 tRNA 4809 4867 -

trnH tRNA 4871 4939 -

nad4 CDS 4950 6317 -

nad4l CDS 6314 6607 -

trnR tRNA 6609 6677 -

nad3 CDS 6680 7018 -

trnG tRNA 7025 7093 -

cox3 CDS 7095 7877 -

atp6 CDS 7883 8557 -

atp8 CDS 8530 8718 -

trnK tRNA 8720 8786 -

cox2 CDS 8793 9470 -

trnD tRNA 9472 9540 -

trnS2 tRNA 9547 9615 +

cox1_b CDS 9625 10368 -

cox1_a CDS 10365 11159 -

trnY tRNA 11166 11234 +

trnC tRNA 11237 11303 +

trnN tRNA 11343 11415 +

trnA tRNA 11417 11485 +

trnW tRNA 11488 11557 -

nad2 CDS 11565 12584 -

trnM tRNA 12600 12668 -

trnQ tRNA 12671 12741 +

trnI tRNA 12739 12807 -

nad1 CDS 12815 13765 -

trnL2 tRNA 13769 13843 -

rrnL rRNA 13845 15413 -

trnV tRNA 15412 15480 -

rrnS rRNA 15478 16431 -

trnF tRNA 16431 16497 -