International Doctoral School

Merly Mayela Escalona Fermín

DOCTORAL DISSERTATION

Sensitivity of phylogenomic inference to the design of NGS target enrichment in non-model organisms

Supervised by: Sara Rodrigues Passos Rocha and David Posada González

2018 International mention

Escola Internacional de Doutoramento

David Posada González y Sara Rodrigues Passos Rocha, FAN CONSTAR que o presente traballo, titulado “Sensitivity of phylogenomic inference to the design of NGS target enrichment in non-model organisms”, que presenta Merly Mayela Escalona Fermín para a obtención do título de Doutor/a, foi elaborado baixo a súa dirección no programa de doutoramento “Metodologías y aplicaciones en Ciencias de la Vida”.

Vigo, 2 de febrero de 2018.

Os Directores da tese de doutoramento

Dr. David Posada González Dr. Sara Rodrigues Passos Rocha

International Doctoral School

David Posada González and Sara Rodrigues Passos Rocha DECLARE that the present work, entitled “Sensitivity of phylogenomic inference to the design of NGS target enrichment in non-model organisms”, submitted by Merly Mayela Escalona Fermín to obtain the title of Doctor, was carried out under their supervision in the PhD program “Metodologías y aplicaciones en Ciencias de la Vida”.

Vigo, 2 February 2018.

The supervisors of the doctoral thesis

Dr. David Posada González Dr. Sara Rodrigues Passos Rocha

To whoever reads this, I hope you find it as interesting and useful as I have.

Acknowledgements

I would like to start by thanking the committee and the international experts who are going to read this dissertation. I hope it is as interesting and useful to you as the journey of developing it has been to me. This work would not have been possible without financial support, hence my gratitude to the Spanish government, which has funded my work and my research visits abroad. A PhD thesis is a team effort, so I would like to thank all the (current and former) members of the Phylogenomics Lab. Special thanks to Joao, Tama and Laura, for our endless conversations about the humongous world of variant calling; to Andrés, for always being a realist, for sharing your statistical and R knowledge with me, and for your support in most of our extracurricular conversations; to Sereina and Harry, for helping me figure out my simulations while I helped with yours, for always offering advice and for those little details. To the single-cell ladies, Sonia and Nuria, for so many things, but especially for letting me into your homes; and to Pili, for all those coffee breaks and for being so positive. To the mollusc guys, Carlos and Maria, but especially to Carlos: thanks for letting me assist you in the Genetics II practicals, it was a really nice experience, and thanks for your patience; I have finished your simulated datasets and can give them to you now that this thesis is being delivered. To Diego, for developing SimPhy, for being so kind as to explain all of its functionalities and parameters to me, and for always being available (an email away) to answer all my doubts about it. To Merchi, because thanks to her we can focus on what we are meant to focus on; this lab would not work without her. To the tech support team, Ramón, Rubén and Fran, who helped me with my cluster issues and never complained about my large number of tickets, always high priority.
To my supervisors, David and Sara, thank you for your guidance, knowledge, understanding, support and tireless patience; I could not have asked for better supervisors. David, thanks for being such a great boss, for being so direct and for having such clear ideas. Sara, thanks for being you and for making the line between supervisor and student so thin. We would not have finished this otherwise.

I feel lucky to have been able to visit two labs during my PhD adventure, and even luckier to have done so under the supervision of such brilliant researchers. Alan, Emily and Rute, thank you so much. Rute, thanks for all your advice, and be aware that we still have a lot of work to do. Thanks also to all the people I got to meet during my visits: Ameer, Megan, Silvia, Paula, Anders, Capser, Rasmus, Emil, Yorgos, Ida, Stine, Fleur, Renata, Eduardo and Maria. I would also like to thank my previous supervisor, David Olivieri, for his support and for suggesting that I pursue this project. To my friends, whom I have not seen as much as I (and they) would have liked, and who understand the sacrifice: Cristina, Fabio, Sora and Mónica. And last but not least, to my family, for their unconditional support and for always pushing me to work hard towards achieving my goals.

Summary (Resumen en castellano)

1. Introduction

Phylogenetics is the branch of science that studies the evolutionary relationships among individuals or groups of organisms (phylogenies) and provides the means (phylogenetic methods) to estimate them. Phylogenetic reconstruction methods allow us to formulate hypotheses about these relationships in the form of phylogenetic trees. The use of phylogenetic information has spread throughout biology, and also to fields as diverse as linguistics, conservation and medicine, among others. Phylogenomics is a broad term that can be seen as the intersection between evolution and genomics. It comprises several research areas between molecular biology and evolution, allowing the use of genomic data to infer phylogenetic relationships and to gain insight into the mechanisms of genome evolution and function. Consequently, phylogenomics (and with it, phylogenies) puts comparative genomics studies into perspective, enriching our knowledge of how genes, genomes, species and molecular sequences evolve, and helping to predict how they might change in the future. Phylogenetic trees have many applications in different fields: in the classification of organisms and the study of their evolutionary relationships; in forensic medicine, for the evaluation of DNA evidence presented in court cases; and in pathogen identification, where molecular sequencing technologies and phylogenetic approaches are frequently used to identify outbreaks of new pathogens, their relationship to other species and, subsequently, the possible source of transmission, providing important information for public health policy.
In addition, phylogenetic trees provide the appropriate framework for comparing biological traits across species (i.e., the comparative method), as well as for estimating evolutionary and demographic parameters of populations and species at different levels (see phylodynamic studies, coalescent theory, or estimates of diversification and divergence, among many other applications).

Gene trees (“gene” understood as a region of the genome) reflect the process of DNA replication at the local level. A copy of a gene at a locus in the genome (for example, a protein-coding gene) replicates, generating new branches in the gene tree, and its copies are passed from parents to offspring. Species trees, in turn, represent the evolutionary history of organisms. They are composed of nodes, which represent speciation events, and branches, which reflect the history of the population between speciation events. The branches of a species tree can have an associated width, which represents the effective population size, and a length, which represents time, either in years or in generations. Importantly, the history of a genomic region is not necessarily equivalent to the history of the species that carry it; that is, gene trees are not necessarily equivalent to species trees. This notion is not new, as awareness of the discordance between gene trees and species trees dates back to the 1980s. However, perhaps out of ignorance of the importance of this discordance at the genomic level, but also out of convenience, gene trees were until very recently considered reliable approximations of species phylogenies. Discordance between gene trees and species trees can be caused by systematic errors (model misspecification) or stochastic errors (inherent to the finite amount of data and the sampling process), but it can also result from different evolutionary processes such as incomplete lineage sorting, gene duplication and loss, and horizontal gene transfer. This has motivated the development of phylogenetic approaches that take gene tree heterogeneity into account when estimating species trees.
Instead of equating gene trees with the phylogenetic history of the species, the new approaches explicitly consider the relationships between gene trees and the underlying history of species divergence, providing direct estimates of species trees.

A model organism is one that we use to study particular biological phenomena, including as a representative of a given group of taxa. Model organisms are usually simpler, smaller and more manageable than the organisms they represent. Studying them usually offers experimental advantages, because some can be bred in large numbers and/or have very short generation times, while others have genes similar to those of humans, which is useful, for example, in biomedicine. All these characteristics have made model organisms irreplaceable tools in biological and clinical research. Because of their convenience, the scientific community has focused heavily on studying them, which has led to greater development and optimization of resources, protocols, methods, bioinformatic analysis pipelines and tools for processing the resulting data. In addition, many of their genomes have already been completely sequenced and well characterized. In contrast, non-model organisms are those that have not been selected by the scientific community for extensive study, either for historical reasons or because they lack the features that make model organisms easy to investigate. Most of them cannot be bred (or cultured) in isolation in the laboratory, or are simply not well characterized at the molecular level, so working with them requires considerable effort.
Although model organisms have been extensively studied and are simple representations of larger and more complex organisms, this does not mean that non-model organisms are unimportant or have been forgotten. In general, large genome sequencing projects have been driven by their relevance to humans, hence the human genome; by economic interest, as with important crops such as maize or rice; by the study of disease, as with the genomes of cancer cells; by the reconstruction of the Tree of Life; or simply by considerations related to size and/or ease of reproduction.

DNA sequencing is the basis of any genomic study. The conventional Sanger sequencing method dominated the industry for approximately two decades and led to many achievements, including the completion of the first complete human genome sequence. Sanger sequencing also helped to improve our knowledge of nucleic acids and, consequently, our understanding of cellular mechanisms and diseases. However, this type of sequencing has limitations regarding the amount of data it can generate, the long data acquisition time, the quality of the sequences, and how laborious and costly its protocol is. In recent years, massively parallel sequencing techniques (next-generation sequencing, NGS) have revolutionized the field by providing large-scale sequencing of DNA (and RNA) at lower cost and without the need for large facilities. NGS techniques have enabled the fast and cost-effective generation of sequence data from multiple loci, and with it an explosion of phylogenomic studies.
Currently, the most popular NGS technologies on the market are Illumina sequencing by synthesis (probably the most widely used); pyrosequencing on the Roche 454 platform; SOLiD sequencing by ligation; IonTorrent semiconductor sequencing; Pacific Biosciences (PacBio) single-molecule real-time sequencing; and Oxford Nanopore single-molecule nanopore sequencing. Broadly speaking, the platforms differ in the type of reads they produce and in the errors they introduce.

It is clear that NGS technologies have changed the scale at which genomic sequence data are obtained. However, de novo sequencing of whole genomes remains difficult and expensive for many organisms, especially non-model organisms, and is unnecessary for many research questions. To obtain genomic information for non-model organisms that is meaningful for evolutionary inference at different levels (populations, species, genera, etc.), we need to identify common (homologous) regions among the individuals/species under study. The last decade has seen the development of numerous techniques that address this problem, the so-called genome reduction or capture methods. These methods dramatically reduce the sequencing space of the genome by focusing on specific regions. In fact, these methods predate the development of NGS technologies and have been widely used for years, but combined with NGS they allow us to obtain a large number of homologous loci from multiple individuals of multiple species.
These methods include the genomic capture of specific regions (target enrichment), such as exon capture, anchored and even anonymous enrichment, and the capture of ultraconserved elements; transcriptome sequencing; and restriction site-associated DNA sequencing (RAD-seq).

NGS technologies are constantly improving in terms of the efficiency, quality, quantity and cost of data production, as are the algorithms and bioinformatic tools developed to analyze the data. The analysis of NGS data for phylogenomics consists of many steps, which may include assembly, mapping, homology/orthology inference, variant and genotype calling, and/or the inference of gene trees and/or species trees. Moreover, there is an overwhelming number of tools, with even more parameters, that must be optimized for each step of the analysis and, most of the time, for each dataset, and these decisions influence the resulting dataset. The workflow for the analysis of phylogenomic NGS data is therefore complex and requires multiple methodological decisions and human interaction. Importantly, there is no standard approach for this type of analysis, and the influence of the different strategies and options on the accuracy of the results is unknown. Likewise, phylogenomic analysis approaches evolve as the technologies do, but many of them are ad hoc, that is, specific to the characteristics of the data and the types of questions asked of them.

The main goal of a phylogenomic study is usually to learn the history of the species, but the “truth” is unknown when dealing with empirical data.
This makes it very difficult, or even impossible, to choose among protocols, and for this reason computer simulations have become very useful tools in this field. Simulations allow us to evaluate the behavior of a system (existing or proposed, physical or abstract) under different configurations of interest and over long periods of time. This involves certain kinds of logical and mathematical models that describe the behavior of the system. The model is similar to the system it represents, but simpler, and a good model must clearly strike a balance between realism and simplicity, since the more complex the model, the harder it is to evaluate and understand. Simulations are inexpensive, generally fast, and allow us to model reality by generating as much data as needed under ideal conditions (controlled scenarios with predefined parameters whose true values are known). In silico approaches help to identify problems, bottlenecks and design flaws in the construction or modification of systems. They do, of course, have limitations: they make assumptions about the processes that may not be realistic, which means the model may turn out to be an inadequate description of the original system.
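As a small illustration of simulating under known parameters, the following sketch simulates incomplete lineage sorting for three species with the species tree ((A,B),C). It is not taken from the thesis: the function names are hypothetical, and it assumes the standard coalescent result that, when the internal branch lasts t coalescent units, a gene tree disagrees with the species tree with probability (2/3)·exp(-t). Because the true value is known, the simulated estimate can be checked against it.

```python
import math
import random


def discordance_probability(t):
    """Analytical probability that a three-taxon gene tree disagrees with the
    species tree ((A,B),C) when the internal branch lasts t coalescent units."""
    return (2.0 / 3.0) * math.exp(-t)


def simulate_discordance(t, n_loci, seed=1):
    """Estimate the same probability by simulating independent loci."""
    rng = random.Random(seed)
    discordant = 0
    for _ in range(n_loci):
        # Waiting time for the A and B lineages to coalesce
        # (rate 1 for two lineages, in coalescent units).
        if rng.expovariate(1.0) < t:
            continue  # coalesced inside the AB ancestor: concordant gene tree
        # Otherwise three lineages reach the root population, and the first
        # pair to coalesce is uniform among the three possible pairs.
        if rng.choice(("AB", "AC", "BC")) != "AB":
            discordant += 1
    return discordant / n_loci


if __name__ == "__main__":
    # Compare the analytical value and the simulated estimate for several
    # internal branch lengths (hypothetical values).
    for t in (0.1, 0.5, 1.0, 2.0):
        print(t, round(discordance_probability(t), 3),
              round(simulate_discordance(t, 100_000), 3))
```

Shorter internal branches (recent, rapid divergences) give more discordant gene trees, which is why gene trees alone can be poor proxies for the species tree in shallow phylogenies.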

2. Motivation and objectives

This thesis is motivated by the general interest in obtaining species trees that are as accurate as possible from NGS data. As explained above, the path from the design of an NGS experiment to the phylogenomic estimation of the species tree is long and complex, and involves multiple methodological decisions. So far, studies of the accuracy of species tree reconstruction have mainly been concerned with variations in the size of the multiple sequence alignments (e.g., number of loci, number of individuals per species, missing data) or with the effect of the species history (e.g., effective population size, substitution rate, ancestral polymorphism), but always starting from the alignments. This thesis aims to fill this gap by trying to understand the effect of the earlier decisions needed to arrive at the alignment, in the current context of massively parallel sequencing. In other words, the main purpose of this thesis is to understand the impact of the different variables in the pipeline for reconstructing species trees from target enrichment NGS data. To this end, I identified the following specific objectives:

1. To provide a better understanding of the wide variety of existing NGS simulators, as well as general guidelines for selecting simulators for specific purposes. This objective is addressed in Chapter 2: Simulation of genomic next-generation sequencing data. Published as Escalona et al. 2016.

2. To design and implement a realistic tool for the simulation of phylogenomic NGS data. Addressed in Chapter 3: NGSphy: phylogenomic simulation of next-generation sequencing data. Published as Escalona et al. 2017.

3. To identify a realistic parameter space for the simulation of target enrichment NGS data. Addressed in Chapter 4: Optimization of parameters for the simulation of targeted sequencing data of shallow phylogenomics. Publication in preparation.

4. To evaluate the sensitivity of phylogenomic inference to variations in the parameters of phylogenomic NGS pipelines. Addressed in Chapter 5: Sensitivity of phylogenomic inference to the design of target enrichment NGS experiments in non-model organisms. Publication in preparation.

3. Methodology

To carry out this thesis I had to become familiar with NGS sequencing techniques, with the bioinformatic analysis of NGS data and with phylogenomic reconstruction methods. I also studied in detail the different steps and decisions involved in the existing pipelines for the phylogenomic analysis of NGS data. After this initial familiarization, I focused on the simulation of NGS data. This required a study of the different tools, their features and their functionalities, which led me to select ART for the subsequent simulations. ART is a tool that generates NGS data by mimicking the real sequencing process with empirical error models. I chose it because it was suited to generating Illumina data and was the best documented, the best maintained and the fastest tool.

Next, I directed my efforts towards generating NGS data from multi-locus genomic sequences of closely related species. A tool existed for this purpose, TreeToReads, but although it is useful for generating NGS data from a single gene tree, it did not meet the needs of my simulations: it cannot directly generate data from distributions of gene trees or use diploid individuals, and it does not consider the heterogeneity of sequencing depth among species, individuals and loci. This led me to design and implement a new phylogenomic NGS simulator, which I named NGSphy. NGSphy is an open-source tool written in Python for the simulation of Illumina reads or read counts obtained from the genomes of haploid/diploid individuals, with thousands of independent gene families that have evolved under a common species tree.
To mimic real NGS experiments, it can model the heterogeneity of sequencing depth among species, individuals and loci, including off-target and non-captured loci.

Once I had established how to generate the NGS data, I needed to design simulations that represented realistic biological scenarios. To do so, I selected NGS parameters that are typical of target enrichment experiments: recent phylogenies with closely related individuals. Given the current debate about the merits of the different target enrichment strategies at different time scales, the time range chosen for the simulations spans a wide range of divergence, from 200,000 to 20 million years, that is, from the Pleistocene to the early Miocene. This range should cover most of the cases in which incomplete lineage sorting can affect species tree reconstruction and in which researchers would be most interested in collecting data for a large number of loci.

Subsequently, I implemented an NGS data analysis pipeline for species tree reconstruction, in order to study the sensitivity of species tree inference to the variation of different NGS parameters. This pipeline, together with NGSphy, allowed me to simulate NGS data from DNA sequence alignments generated under known evolutionary models and species trees. I analyzed these datasets as if they were real data obtained in a sequencing experiment, processing them with a single combination of methodologies, most of them with default parameters, until their evolutionary history was reconstructed.
Finally, I compared the “true” (i.e., simulated) species trees with the estimated ones to infer how the variation of methods and parameters in data processing and analysis influences the ability to infer the correct solution.
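The comparison between true and estimated species trees described above can be sketched with a topological distance. This summary does not state which metric is used, so the example assumes a rooted Robinson-Foulds distance (the number of clades found in only one of the two trees); the function names and the nested-tuple tree encoding are hypothetical and chosen only to keep the sketch self-contained. Both trees are assumed to contain the same set of species.

```python
def clades(tree):
    """Return the non-trivial clades (frozensets of leaf names) of a rooted
    tree written as nested tuples, e.g. ((("A", "B"), "C"), "D")."""
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])  # single leaves are not counted as clades
        below = frozenset().union(*(leaves(child) for child in node))
        found.add(below)
        return below

    all_leaves = leaves(tree)
    found.discard(all_leaves)  # the root clade is shared by every tree
    return found


def rooted_rf(true_tree, estimated_tree):
    """Number of clades present in exactly one of the two trees."""
    return len(clades(true_tree) ^ clades(estimated_tree))


def normalized_rf(true_tree, estimated_tree):
    """Scale the distance to [0, 1]; 0 means identical topologies."""
    total = len(clades(true_tree)) + len(clades(estimated_tree))
    return rooted_rf(true_tree, estimated_tree) / total if total else 0.0


# Hypothetical example: the estimate misplaces species B and C.
true_tree = ((("A", "B"), "C"), "D")
estimated_tree = ((("A", "C"), "B"), "D")
print(rooted_rf(true_tree, estimated_tree), normalized_rf(true_tree, estimated_tree))
```

Averaging such a distance over simulation replicates gives one accuracy value per combination of methods and parameters, which can then be compared across scenarios.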

4. Results

The study of the different tools for simulating next-generation sequencing (NGS) data resulted in a review of 23 NGS simulation tools, highlighting their functionalities, requirements and potential applications. In this review I also provide a decision tree for the informed selection of the most appropriate NGS simulation tool for the question at hand (presented in Chapter 2 and published as Escalona et al. 2016). Since none of these tools met our needs, and in order to evaluate how the different methodological decisions made during the production and analysis of NGS data affect the quality of phylogenomic inference, I developed NGSphy (presented in Chapter 3 and published as Escalona et al. 2017), an open-source tool for the simulation of Illumina reads or read counts obtained from the genomes of haploid/diploid individuals, with thousands of independent gene families that have evolved under a common species tree. The data simulated by NGSphy also resemble real NGS experiments, because the tool includes multiple options for modeling the heterogeneity of sequencing depth among species, individuals and loci, including off-target and non-captured loci.

Simulating complex scenarios under continuous distributions of different parameters is not an easy task. In particular, for these simulations to be relevant to the scientific community, they must reflect scenarios that are as common and realistic as possible. Accordingly, in Chapter 4 I describe the parameterization of the simulations of phylogenomic NGS data.
Finally, to achieve the general objective of this thesis, I designed a study with data simulated under four different NGS scenarios and processed them with a data pipeline that includes a reduced set of treatments. With this analysis I found that using a closely related mapping reference produces better trees. Greater sequencing depth gives better results, whereas variation in depth among individuals and loci does not seem to have an effect. In addition, read type and read length have a small but statistically significant effect, and I found that the most widely used NGS profile (PE 150 bp) may not be optimal for phylogenetic reconstruction problems. Regarding the number of species and the number of individuals per species, the relationships reported in other studies of species tree reconstruction hold: the fewer the species and the more individuals per species, the higher the accuracy of the methods. Finally, the shape of the species tree is one of the most important features affecting discordance among gene trees, and consequently the accuracy of species tree reconstruction. Reconstruction accuracy improves (very significantly) as tree height and the mean and maximum number of extra lineages (a direct measure of incomplete lineage sorting) decrease.
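The heterogeneity of sequencing depth among species, individuals and loci mentioned in these results can be pictured with a simple multiplicative model. The sketch below is only an illustration and does not reproduce NGSphy's interface or settings: the function name, every parameter name and default value, and the choice of gamma-distributed multipliers with mean 1 are assumptions made for this example.

```python
import random


def simulate_depths(n_species, n_individuals, n_loci, base_depth=50.0,
                    shape=5.0, off_target_fraction=0.2, off_target_scale=0.1,
                    not_captured_fraction=0.1, seed=1):
    """Return {(species, individual, locus): expected depth}.

    All parameter names and default values are hypothetical.
    """
    rng = random.Random(seed)

    def multiplier():
        # Gamma-distributed factor with mean 1; smaller shape = more variation.
        return rng.gammavariate(shape, 1.0 / shape)

    # The status of each locus is decided once and shared by every sample.
    status = []
    for _ in range(n_loci):
        u = rng.random()
        if u < not_captured_fraction:
            status.append("not_captured")
        elif u < not_captured_fraction + off_target_fraction:
            status.append("off_target")
        else:
            status.append("on_target")
    locus_factor = [multiplier() for _ in range(n_loci)]

    depths = {}
    for sp in range(n_species):
        sp_factor = multiplier()           # species-level variation
        for ind in range(n_individuals):
            ind_factor = multiplier()      # individual-level variation
            for locus in range(n_loci):
                depth = base_depth * sp_factor * ind_factor * locus_factor[locus]
                if status[locus] == "not_captured":
                    depth = 0.0            # locus missing from the data
                elif status[locus] == "off_target":
                    depth *= off_target_scale  # only a fraction of the depth
                depths[(sp, ind, locus)] = depth
    return depths
```

A call such as `simulate_depths(5, 2, 100)` yields one expected depth per sample and locus; a read simulator could then be asked for that depth at each locus, so that the effect of mean depth can be separated from the effect of its variation.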

5. Conclusions

The work presented in this thesis encompasses my efforts to understand how the different methodological decisions that must be made during the production and analysis of NGS data affect the quality of phylogenomic estimates, with a special focus on target enrichment experiments. This thesis provides the scientific community with:

• A better understanding of the variety of genomic NGS simulators, as well as a guide to which simulator is most appropriate for a given question (Chapter 2: Simulation of genomic next-generation sequencing data; Escalona et al. 2016).

• A more realistic simulation framework for phylogenomic NGS data, including multiple options for modeling experimental designs and sequencing parameters, which makes possible comparative analyses with different NGS parameters, over a wide space of evolutionary parameters and under the gene tree/species tree paradigm (Chapter 3: NGSphy: phylogenomic simulation of next-generation sequencing data; Escalona et al. 2017).

• A detailed description of the process of optimizing parameters for the simulation of NGS data from target enrichment experiments in recently diverged species (Chapter 4: Optimization of parameters for the simulation of targeted sequencing data of shallow phylogenomics).

• The implementation of a pipeline for a study of the sensitivity of phylogenomic inference to the design of target enrichment NGS experiments in non-model organisms (Chapter 5: Sensitivity of phylogenomic inference to the design of target enrichment NGS experiments in non-model organisms).

• According to the simulations presented in this thesis, different variables of the NGS analysis affect species tree reconstruction. The mapping reference and the sequencing depth have a fundamental effect, whereas read type and read length have a smaller effect.

In the near future I hope to extend the scenarios studied to provide a more exhaustive evaluation by adding other relevant methodological variables, such as de novo assembly instead of mapping, variant calling with different algorithms, or the consideration of allelic phase, among others.

Table of contents

Summary (Resumen en castellano)

List of figures

List of tables

List of boxes

1 Introduction
  1.1 Phylogenetics and phylogenomics
    1.1.1 Species trees and gene trees
    1.1.2 Applications of phylogenetics/phylogenomics
  1.2 Model vs. non-model organisms
  1.3 Next-generation sequencing (NGS)
    1.3.1 The NGS revolution
    1.3.2 NGS Platform
    1.3.3 Genome reduction methods
    1.3.4 Bioinformatic analysis of NGS data
      1.3.4.1 Quality control
      1.3.4.2 Mapping
      1.3.4.3 Assembly
      1.3.4.4 Variant and genotype calling
      1.3.4.5 Phasing
  1.4 Phylogenomic estimation
    1.4.1 Orthology inference
    1.4.2 Reconstruction of gene trees
    1.4.3 Reconstruction of species trees
  1.5 Simulations in phylogenomics
    1.5.1 Simulation of evolutionary histories (gene trees and species trees)
    1.5.2 Simulation of molecular evolution (DNA sequences)
    1.5.3 Simulation of NGS data
  1.6 Motivation
  1.7 Objectives

2 Simulation of genomic next generation sequencing data
  2.1 Introduction
  2.2 Simulation parameters
    2.2.1 Reference sequence
    2.2.2 Profiles
  2.3 Accounting for PCR amplification
  2.4 Read features
  2.5 Base-calling errors
    2.5.1 Indel errors
    2.5.2 Substitution errors
  2.6 Quality scores
  2.7 Sequencing depth
  2.8 Simulating genomic variants
  2.9 Output
  2.10 Conclusions

3 NGSphy: phylogenomic simulation of next-generation sequencing data
  3.1 Introduction
  3.2 Description and implementation
  3.3 Simulation modes
  3.4 Assignment of individuals
  3.5 Re-rooting process
  3.6 Coverage heterogeneity
    3.6.1 Distribution-based parameterization
  3.7 Next-generation sequencing data
    3.7.1 Simulation of Illumina reads
    3.7.2 Simulation of read counts
  3.8 Input and output
  3.9 Validation test: phylogenetic reconstruction from simulated alignments
  3.10 Use case: effect of the variation of depth of coverage on SNP recovery
  3.11 Execution
  3.12 Availability

4 Optimization of parameters for the simulation of targeted sequencing data of shallow phylogenomics
  4.1 Introduction
    4.1.1 SimPhy’s configuration
    4.1.2 Replicate parameters
    4.1.3 Species tree parameters
      4.1.3.1 Species tree height
      4.1.3.2 Number of species and number of individuals per species
      4.1.3.3 Speciation rate
      4.1.3.4 Tree-wide substitution rate
    4.1.4 Substitution rate heterogeneity parameters
  4.2 Indelible: simulating and characterizing DNA sequence alignments
  4.3 Simulating NGS reads
    4.3.1 Analysis of coverage in target enrichment experiments
      4.3.1.1 On-target regions
      4.3.1.2 Off-target regions
      4.3.1.3 Non-captured regions
      4.3.1.4 Phylogenetic decay
  4.4 NGSphy parameterization

5 Sensitivity of phylogenomic inference to the design of target enrichment NGS experiments in non-model organisms
  5.1 Introduction
  5.2 Methods
    5.2.1 Data simulation
      5.2.1.1 Generation of species trees and gene trees
      5.2.1.2 Generation of DNA Sequences
      5.2.1.3 Assignment of gene-trees leaves to diploid individuals
      5.2.1.4 Generation of NGS data
    5.2.2 NGS data analysis
      5.2.2.1 Quality control and trimming
      5.2.2.2 Construction of reference sequences
      5.2.2.3 Mapping
      5.2.2.4 Variant and genotype calling
      5.2.2.5 Consensus sequences
      5.2.2.6 Multiple sequence alignment
    5.2.3 Gene and species tree inference
    5.2.4 Accuracy of species trees reconstruction: relation to NGS and capture design variables
  5.3 Results and Discussion
    5.3.1 Effect of read type/length and mapping reference
    5.3.2 Effect of NGS coverage
    5.3.3 Effect of the number and loci size
    5.3.4 Effect of sampling
    5.3.5 Effect of the speciation history
  5.4 Conclusions

6 Overall conclusions

References

Appendix A List of deliverables

Appendix B Short CV

Appendix C NGSphy manual

List of figures

1.1 Gene tree species tree discordance
1.2 Some of the organisms whose genome have been sequenced.
1.3 Evolution of the raw per-megabase cost of DNA sequencing.
1.4 Representative workflow for the acquisition of genome-reduced datasets compared to whole-genome sequencing.
1.5 Typical workflow used for processing NGS data for phylogenomic inference.
1.6 Representation of the different species tree reconstruction approaches.

2.1 Decision tree for the selection of a suitable NGS genomic simulator.
2.2 General overview of the sequencing process and steps that can be parameterized in the simulations.
2.3 General overview of NGS simulation.
2.4 Flows available to generate reads with and without genomic variation.

3.1 NGSphy workflow.
3.2 Re-rooting process.
3.3 Distribution-based parameterization in NGSphy.
3.4 Taxon-specific effects.
3.5 Structure of the output folders of NGSphy.
3.6 Gene-tree with five tips used for the validation.
3.7 Gene-tree with five tips used for the use case simulation.
3.8 Use case variant calling.

4.1 Distribution of number of locus trees per species tree replicate.
4.2 Schema of relative outgroup to ingroup distance.
4.3 Distribution of species tree heights.
4.4 Distribution of the number of species and number of individuals per species.
4.5 LogNormal distribution of the prior for the speciation rate (log scale).
4.6 Heterogeneity within and across replicates.
4.7 Heterogeneity within and across replicates.
4.8 Average number of extra lineages (ILS) per species tree replicate across Ne values.
4.9 Robinson-Foulds pairwise distances among gene trees within species tree replicate for different Ne values.
4.10 Heterogeneity as hierarchically implemented in SimPhy.
4.11 Pairwise distance observed in the simulated alignments per species tree replicate.
4.12 Size distribution of genes (A) and targets (B) of the Anna’s (C. anna) and Swift (C. pelagica) datasets.
4.13 Breadth versus depth of coverage from the mapping against the C. anna ("Anna") reference.
4.14 Observed coverage per sample.
4.15 Distribution of average coverage across samples, genes and targets, before (A) and after (B) removing outliers.
4.16 Percentage of mapped reads to each of the references.
4.17 Breadth versus depth of coverage for off-target regions from the mapping against the Anna reference.
4.18 Percentage of non captured regions: gene (A) and targets (B).
4.19 Inferred ML concatenated tree.
4.20 Per sample coverage vs. phylogenetic distance.
4.21 Coverage heterogeneity parameterization in NGSphy.

5.1 Per read mean quality distribution of 14 random files.
5.2 Accuracy of the reconstructed species trees (n=121) for the combinations of read lengths (2), types (2), and reference used at mapping (2).
5.3 Species tree accuracy per scenario.
5.4 Expected coverage.
5.5 Distribution of average coverage per individual across all the NGS profiles.
5.6 Linear regression and Pearson r correlation test (p-value) between coverage and species tree accuracy.
5.7 Linear regression and Pearson r correlation test (p-value) between target enrichment design parameters (number and size of the loci) and species tree accuracy.
5.8 Linear regression and Pearson r correlation test (p-value) between number of species and number of individuals per species and species tree accuracy.
5.9 Linear regression and Pearson r correlation test (p-value) between species tree accuracy and species tree parameters.

List of tables

1.1 Main characteristics of NGS technologies

2.1 General information about 23 NGS genomic simulators.
2.2 Technical information about NGS genomic simulators.
2.3 Genomic variants.

4.1 Replicate parameters
4.2 Species tree parameters
4.3 Replicate Parameters
4.4 Summary statistics for the replicates with of extra lineages.

5.1 SimPhy simulation parameters.
5.2 Example of the parameter values used in NGSphy.

List of boxes

1.1 Types of homologous fragments.
1.2 NGS read types.
1.3 Concepts from the analysis of next-generation sequencing data for phylogenomics.
1.4 Bootstrap.
2.1 Definition of concepts related to the simulation of NGS data.
2.2 Definition of concepts related to the sequencing technologies and possible biases.
2.3 Websites of the reviewed simulators and further sites of interest
3.1 Example of the coverage sampling strategy introducing individual and locus heterogeneity

Chapter 1

Introduction

1.1 Phylogenetics and phylogenomics

Phylogenetics is the branch of science that studies the evolutionary relationships among individuals or groups of organisms, and that provides the means (phylogenetic methods) for estimating those relationships. Phylogenetic reconstruction methods yield hypotheses of the evolutionary relationships between organisms, which are called phylogenies and are depicted as trees. The use of phylogenetic information has become widespread in biology, but also in diverse fields ranging from linguistics (e.g. Bouckaert et al. 2012; Gray et al. 2009) to conservation (e.g. Vázquez and Gittleman 1998) and medicine (e.g. Smith et al. 2017).

Phylogenomics is a broad term that can be seen as the intersection of evolution and genomics. It comprises several areas of research at the interface between molecular biology and evolution, allowing the use of genomic data to infer phylogenetic relationships, to gain insights into the mechanisms of molecular evolution and comparative phylogenetics, and to infer putative functions for DNA or protein sequences (Philippe and Blanchette, 2007). The term was originally coined by J. Eisen (1998a; 1998b; 1997) when working on the prediction of gene function with genome-scale data. He argued that phylogenomic analyses provide insights into the evolution of gene families within and between species, and that while it was generally accepted that genome sequences were excellent tools for studying evolution, it was perhaps less well accepted that evolutionary analysis is a powerful tool in studies of the evolution and function of genome sequences (Eisen and Fraser, 2003). The field has grown hugely since then, and the power of such approaches increases dramatically as the sequences of a greater number of genomes, and with them large-scale genomic datasets, become available.

Consequently, phylogenomics (and the phylogenies within it) allows comparative genomic studies to be placed in perspective, enriching our understanding of how genes, genomes, species and molecular sequences evolve. It may also help to predict how they will change in the future (Lässig et al., 2017).

1.1.1 Species trees and gene trees

Importantly, the history of a homologous (Box 1.1) genomic region (gene tree or locus tree) is not necessarily equivalent to the underlying history of the species (species tree). The notion of this discordance is not new (Goodman et al., 1979; Pamilo and Nei, 1988; Takahata, 1989) and its sources are diverse, but for practical reasons (lack of proper computational methods, resources and power) gene trees were, until recently, most often treated as representative of species phylogenies. Discordance between gene trees and species trees can be caused by systematic errors (model misspecification) and stochastic errors (inherent to the finite amount of data and the sampling process), but it can also result from different evolutionary processes (Figure 1.1), mainly incomplete lineage sorting (ILS), gene duplication and loss (GDL) and horizontal gene transfer (HGT) (Degnan and Rosenberg, 2009; Jeffroy et al., 2006; Mallo and Posada, 2016). This has motivated the development of phylogenetic approaches that account for gene-tree heterogeneity in the estimation procedure. Rather than equating a gene tree with the phylogenetic history of the species, the new approaches explicitly consider the relationships between gene trees and the underlying history of species divergence, providing direct estimates of species trees (Knowles, 2009).
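To make the ILS source of discordance more concrete, the following minimal sketch applies a standard result of coalescent theory for three species with topology ((A,B),C): if the internal branch lasts T coalescent units, the gene tree matches the species tree with probability 1 - (2/3)e^(-T), and each of the two discordant topologies has probability (1/3)e^(-T). The function and parameter names are illustrative only, and the conversion T = t / (2Ne) assumes a diploid population of constant effective size.

```python
import math


def gene_tree_probabilities(internal_branch_generations, effective_population_size):
    """Three-species coalescent: return (P(concordant), P(each discordant topology)).

    Assumes a diploid population of constant size, so that the internal
    branch length in coalescent units is t / (2 * Ne).
    """
    t_coal = internal_branch_generations / (2.0 * effective_population_size)
    # Probability that the two lineages fail to coalesce on the internal branch
    p_no_coalescence = math.exp(-t_coal)
    p_concordant = 1.0 - (2.0 / 3.0) * p_no_coalescence
    p_each_discordant = (1.0 / 3.0) * p_no_coalescence
    return p_concordant, p_each_discordant


if __name__ == "__main__":
    # Hypothetical values chosen only to show the trend
    for generations in (1_000, 20_000, 200_000):
        p_conc, p_disc = gene_tree_probabilities(generations, 10_000)
        print(f"t = {generations:>7} generations: "
              f"P(concordant) = {p_conc:.3f}, P(each discordant) = {p_disc:.3f}")
```

Short internal branches or large effective population sizes push the three topologies towards equal frequency, which is why rapid radiations are where gene trees are least reliable as proxies for the species tree.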

Box 1.1 Types of homologous fragments.

Homologous fragments are gene copies that are related by descent from a common ancestral DNA sequence. For gene trees (and often species trees) one is mostly interested in assembling matrices of orthologous data.

Orthologs: gene copies that evolved from a common ancestral gene via a speciation event.
Paralogs: homologous gene copies that originated by a duplication event.
Xenologs: homologous gene copies that were transferred from another species by horizontal gene transfer.

Figure 1.1 Representation of evolutionary processes explaining gene tree-species tree discordance. A | Representation of gene and species tree agreement. B | Representation of incomplete lineage sorting (ILS): also known as deep coalescence or ancestral polymorphism, it is the result of the retention of a genetic polymorphism across several speciation events. C | Representation of horizontal gene transfer (HGT): also known as lateral gene transfer (LGT), it corresponds to the integration into the genome of a portion of genetic material coming from a different species through a non-sexual process. D | Representation of gene duplication and loss (GDL): it describes the copying of a locus into a different genomic location within an individual and/or its loss, and is the primary source of new genetic material driving the evolution of gene families (Ohno, 1970). GDL is the result of several known molecular mechanisms, such as unequal crossing-over and retroposition (Zhang, 2003).

1.1.2 Applications of phylogenetics/phylogenomics

Estimating phylogenies has several applications. These range from the classification of organisms, i.e., providing more accurate descriptions of patterns of relatedness that allow species to be classified (Behura, 2015; Prum et al., 2015), to forensics, for example when assessing DNA evidence presented in court cases (e.g. Arenas et al. 2017; Metzker et al. 2002), and to identifying the origin of pathogens: molecular sequencing technologies and phylogenetic approaches are often used to learn about a new pathogen outbreak, its relationship to other species and, subsequently, the likely source of transmission (e.g. Faria et al. 2014; Grubaugh et al. 2017), information that is later used in recommendations for public health policies. Further applications relate to a broad range of biological questions, such as the prediction of gene functions (Eisen, 1998b) and the evolution of gene families (e.g. Margres et al. 2017), or the understanding of speciation and extinction rates, their ecological and biogeographical causes, and the timing of these events (Nee and May, 1997; Vázquez and Gittleman, 1998). In bioinformatics and computing, the algorithms developed for phylogenetic and phylogenomic analyses have also been used to develop software in other fields. More on the applications of phylogenomics can be found in Philippe and Blanchette (2007).

1.2 Model vs. non-model organisms

A model organism is a species (or variety) that has been widely studied because it is smaller, simpler, and easier to maintain and breed in a laboratory setting than more complex organisms such as humans or trees. In addition, model organisms provide particular experimental advantages, since many of them can be bred in large numbers and/or have very short generation times (the time between being born and being able to reproduce), while others have genes similar to those of humans. These characteristics have made some organisms so suitable for studying a specific trait, disease, or phenomenon that they have become irreplaceable tools for biological and clinical research, helping scientists to amass a vast amount of knowledge (Hunter, 2008). Because they are so convenient, the research community flocked to them, which has led to the development and optimization of resources, protocols, methods and commercially available kits, and often of bioinformatic workflows and tools to handle their data. Their genomes have also often been fully sequenced and well characterized. Among the most widely used model organisms are the yeast Saccharomyces cerevisiae, the fruit fly (Drosophila melanogaster), the nematode worm Caenorhabditis elegans, the Western clawed frog (Xenopus tropicalis), the house mouse (Mus musculus) and the zebrafish (Danio rerio). For other examples of less popular model organisms and their corresponding fields of study, see Russell et al. (2017).

In contrast, non-model organisms are those that have not been selected by the research community for extensive study, either for historical reasons or because they lack the features that make model organisms easy to investigate (Nature | Non-model organisms - Latest research and news).
Most of them either cannot be bred as a stable isolated species under controlled laboratory conditions or are simply not well characterized at the molecular level, thus requiring huge efforts in protocol adaptation and method development (Armengaud et al., 2014).

Model organisms are extensively studied and serve as reduced-complexity representatives of larger, more complex ones, but that does not imply that non-model organisms are unimportant or have been completely left behind in genomics. Genome sequencing projects have been driven by their relevance to humans, hence the human genome (Lander et al., 2001); by economic interest (major crops like maize (Jiao et al., 2017) or rice (Jackson, 2016); disease agents or genomes of cancer cells (The Cancer Genome Atlas Research Network et al., 2013)); by their importance for the reconstruction of the tree of life; or by practical considerations regarding size or repetitiveness. Several re-sequencing projects also aim to understand the genomic variation within specific species or clades (e.g. the 1000 Genomes Project or the Maize Diversity Project). Others have targeted important nodes of the tree of life: the reconstruction of the evolutionary relationships of millions of species across billions of years is one of the fundamental challenges in biology (Ciccarelli et al., 2006), and many relationships remain controversial (Dunn et al., 2008).

Currently, over a thousand whole genomes (some in progress, Figure 1.2) can be found in common databases such as NCBI or Ensembl. All three main domains of life (Bacteria, Archaea, and Eukarya) are represented, as well as many viruses, phages, viroids, plasmids, and organelles. Other databases (e.g. Genetic & Genomic Resources for Model Organisms from the National Institute of General Medical Sciences) offer a wider variety of genomic resources for model organisms.
For updated information on projects involving model organisms in genetics, see Model Organisms in Genetics - an overview at ScienceDirect.

Figure 1.2 Some of the organisms whose genomes have been sequenced. Image by Daniel Ocampo Daza www.egosumdaniel.se.

1.3 Next-generation sequencing (NGS)

1.3.1 The NGS revolution

DNA sequencing is at the base of any genomic study. In recent years, next-generation sequencing (NGS) techniques have revolutionized the field, providing practical, massively parallel DNA (and RNA) sequencing at lower cost and without the requirement for large, automated facilities (Koboldt et al., 2013). NGS techniques have allowed the fast and cost-effective generation of multilocus sequence data (Goodwin et al., 2016; Mardis, 2011; Wetterstrand) and an explosion of genome-level phylogenetic studies (i.e., phylogenomics; e.g. Burki et al. 2016; Irisarri et al. 2017; Misof et al. 2014; Wickett et al. 2014), which nevertheless still often lead to dramatically different conclusions (e.g. Jarvis et al. 2014; Prum et al. 2015). This contrasts with the conventional Sanger method (Sanger, 1975; Sanger and Coulson, 1975; Sanger et al., 1977), which dominated the industry for about two decades and led to many accomplishments, including the completion of the finished-grade human genome sequence (International Human Genome Sequencing Consortium, 2004). Sanger sequencing also provided remarkable opportunities to the life sciences, shaping and improving knowledge about nucleic acids and thereby contributing to a better understanding of cellular mechanisms and diseases. However, this method has certain limitations in terms of the amount of data produced, the long time needed for data acquisition, sequencing quality, and the labor-intensive character of the protocol (Ari and Arikan, 2016).

1.3.2 NGS Platform

Currently, the most popular NGS technologies on the market are Illumina’s sequencing by synthesis, which is probably the most widely used platform (Pattnaik et al., 2014); Roche’s 454 pyrosequencing (454) (Ronaghi, 2001; Ronaghi et al., 1996, 1998); SOLiD sequencing-by-ligation (SOLiD); IonTorrent semiconductor sequencing (IonTorrent) (Rothberg et al., 2011); Pacific Biosciences’ (PacBio) single-molecule real-time sequencing (Eid et al., 2009); and Oxford Nanopore Technologies’ (Nanopore) single-molecule DNA strand sequencing. Detailed overviews (Shendure and Ji, 2008; Shendure et al., 2011) and comparisons (Liu et al., 2012b) of DNA sequencing strategies and systems can be found elsewhere; here, I will only highlight the main features of the main NGS platforms. Platforms differ, for example, with respect to the type of reads they produce (Box 1.2) or the kind of sequencing errors they introduce (Table 1.1). Only two of the current technologies (Illumina and SOLiD) are capable of producing all three sequencing read types: single end, paired end and mate pair. Read length also depends on the machine and the kit used; on platforms like Illumina, SOLiD, or IonTorrent it is possible to specify the desired number of base pairs per read (by specifying the number of cycles). Depending on the sequencing run type selected, it is possible to obtain reads with maximum lengths of 75 bp (SOLiD), 300 bp (Illumina) or 400 bp (IonTorrent). On the other hand, for platforms like 454, Nanopore or PacBio, information is only given about the mean and maximum read length that can be obtained, with average lengths of 700 bp, 10 kb and 15 kb and maximum lengths of 1 kb, 10 kb and 15 kb, respectively. Error rates vary depending on the platform, from ≤ 1% in Illumina to ∼ 30% in Nanopore. Further overviews and comparisons of NGS strategies can be found in Liu et al. (2012b); Quail et al. (2012); Shendure and Ji (2008); Shendure et al. (2004).

Figure 1.3 Evolution of the raw per-megabase cost of DNA sequencing. Change in costs of generating DNA sequence data from September 2001 to July 2017. Red line marks the first commercially available NGS platform (from https://www.genome.gov/27541954/dna-sequencing-costs-data/).

Box 1.2 NGS read types.

Single end: reads generated by single-read sequencing, which involves sequencing DNA fragments from only one end.

Paired end: in paired-end sequencing, a single fragment is sequenced from both the 5’ and 3’ ends, giving rise to reads in both forward and reverse orientations, in which read one is the forward read and read two is the reverse. The sequenced fragments may be separated by a certain number of bases (depending on insert size and read length) or overlapping.

Mate pair: mate-pair sequencing means generating long-insert paired-end DNA libraries. The inserts are circularized and fragmented, and the labelled fragments (corresponding to the ends of the original DNA ligated together) are purified, ligated to another set of adapters and finally sequenced at the paired end. The resulting inserts include two DNA segments that were originally separated by 2–5 kb, facilitating mapping and assembly.

Table 1.1 Main characteristics of NGS technologies

| Technology | Read type (SR PE MP) | Max. Read length | QS | Error rates | References |
|---|---|---|---|---|---|
| Illumina | X X X | 300 | > Q30 | 0.0034 - 1% | Ross et al. (2013) |
| SOLiD | X X X | 75 | > Q30 | 0.01 - 1% | Glenn (2011) |
| IonTorrent | X X | 400 | ∼ Q20 | 1.78% | Quail et al. (2012) |
| 454 | X X | ∼ 700 (up to 1000) | > Q20 | 1.07 - 1.7% | Gilles et al. (2011); Margulies et al. (2005) |
| PacBio | X | ∼ 15 Kb (up to 40 Kb) | > Q10 | 5 - 10 % | Carneiro et al. (2012); Koren et al. (2012); Quail et al. (2012); Salmela and Rivals (2014) |
| Nanopore | X X | 5.4 - 10 Kb | NAY | 10 - 40 % | Jain et al. (2015); Laver et al. (2015); Loman et al. (2015); Madoui et al. (2015); Quick et al. (2014) |

SR: Single end reads. PE: Paired end reads. MP: Mate pairs. QS: Quality Score. NAY: Not available yet.
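Table 1.1 lists quality scores (QS) next to error rates. As a reading aid, the sketch below converts between the two using the standard Phred definition, Q = -10 log10(p), where p is the probability that a base call is wrong. This is the general convention rather than anything specific to the table, the function names are illustrative, and the scores reported by each platform are not necessarily calibrated to match its empirically measured error rates.

```python
import math


def phred_to_error_probability(quality_score):
    """Probability that a base call is wrong, given its Phred quality score."""
    return 10 ** (-quality_score / 10)


def error_probability_to_phred(error_probability):
    """Phred quality score corresponding to a per-base error probability."""
    return -10 * math.log10(error_probability)


if __name__ == "__main__":
    for q in (10, 20, 30):
        p = phred_to_error_probability(q)
        print(f"Q{q}: expected error rate {p:.2%}")
```

Under this definition each additional 10 points of quality corresponds to a tenfold reduction in the expected error rate.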

1.3.3 Genome reduction methods

NGS has fundamentally changed the scale at which genomic sequencing data can be acquired. However, de novo whole-genome sequencing of several organisms, and especially of non-model ones, is still relatively difficult and expensive, and is not needed for most questions. In order to obtain genomic information for non-model organisms that is meaningful for evolutionary inference at the population (population genetics) or species level (phylogenetics), we need to identify regions that are common (i.e., homologous) to the individuals/species under study. The last decade has brought an explosion of new techniques addressing this problem, the so-called “genome reduction methods” (GRMs). GRMs can dramatically reduce the genome sequence space to specific regions, so that downstream sequencing efforts are enriched for these regions (Hirsch et al., 2014). These methods in fact predate the development of NGS technologies and have been widely used for several years in genome sciences (Altshuler et al., 2000) but, jointly with NGS, they now allow us to obtain large numbers of loci from multiple individuals of multiple species (Carstens et al., 2013; Godden et al., 2012; McCormack et al., 2013). GRMs include target enrichment strategies (Mamanova et al., 2010; Mertes et al., 2011), such as exon capture (Bi et al., 2012), anchored (hybrid) and anonymous enrichment (Lemmon et al., 2012), and ultraconserved elements (Bejerano et al., 2004; Faircloth et al., 2012); transcriptome sequencing; and restriction site associated DNA sequencing (RAD-seq) (Baird et al., 2008) and its variants (see Andrews et al., 2016). For a comparison of capture and RAD-seq methods see Harvey et al. (2016).
NGS, jointly with these kinds of strategies, allows us to obtain an unprecedented amount of genome-wide data from multiple individuals and multiple species (Godden et al., 2012; McCormack et al., 2013), which is driving the shift from “phylogenetics” to “phylogenomics”, and with it a) an increase in the use of genomic approaches in non-model organisms (da Fonseca et al., 2016; Ekblom and Galindo, 2011; McCormack et al., 2013), and b) the development of new approaches and methods for phylogenomic inference that acknowledge and model the discordance among individual gene trees, especially fast methods that can handle the huge amount of information currently produced (Liu et al., 2015a).

In the following lines, I will briefly explain the most popular approaches for genome reduction. Basically, these methods take advantage of library manipulations to obtain samples enriched in particular regions of the genome (Figure 1.4). Following Good (2011), one can classify GRMs into two groups, depending on whether the enrichment targets general regions (enzymatic enrichment methods like RAD-seq and its variants; transcriptome sequencing) or specific regions.

Restriction site associated DNA sequencing (RAD-seq): originally coined by Miller et al. (2007) and adapted to genotyping by NGS by Baird et al. (2008), with several variants developed subsequently. It refers broadly to a series of methods that use restriction endonuclease digestion (i.e., enzymes that recognize, and cut within, particular DNA sequences, called restriction sites) to fragment the genomic DNA and then sequence the sites adjacent to the cuts. Using this strategy across several individuals should enrich samples for sequences from the same genomic regions, because closely related individuals are expected to share most restriction sites.
Variations on the original RAD-seq protocol concern the type and number of restriction enzyme(s) used, the existence of a size selection step, and the number and order of adaptor ligations (comprehensively reviewed in Andrews et al. 2016). Varying these factors allows the number of loci recovered to be adjusted. While these methods are better known in the context of population genomics (Andrews et al., 2016; Catchen et al., 2013; Hohenlohe et al., 2010), RAD loci are increasingly used for phylogenetic inference at several scales of divergence (Cruaud et al., 2014; Eaton and Ree, 2013; Hipp et al., 2014; Leaché et al., 2015; McCluskey and Postlethwait, 2015), with several recent tools facilitating the assembly of such datasets (Catchen et al., 2013, 2011; Cruaud et al., 2016; Eaton, 2014).

Transcriptome sequencing (RNA-seq): another powerful method of genome reduction (Wang et al., 2009), which refers to the direct sequencing of the transcribed parts of the genome (by reverse transcription of the polyadenylated fraction of the genome). This fraction usually does not amount to more than 5% of the total genome (Pertea, 2012), and thus provides an easy way to obtain sets of homologous sequences. The NGS reads obtained can be assembled de novo into consensus transcript sequences, mapped to a reference genome, or mapped to consensus transcripts obtained by de novo assembly (in a process similar to the one used with target capture data). In order to use RNA-seq data for phylogenetic inference, some precautions, such as normalizing libraries, have to be taken, since the relative abundance of transcripts can vary by several orders of magnitude. Transcriptomes are now widely used in phylogenomic inference, especially at deeper levels (Behura, 2015; Dunn, 2017; González et al., 2015; Wang et al., 2017).
Recent developments of the technique that circumvent reverse transcription and amplification biases should further promote its use (Garalde et al., 2018).

Target enrichment: as reviewed in Mamanova et al. (2010) and Mertes et al. (2011), this refers to methods in which certain genomic regions (targets) are selectively isolated from DNA samples and sequenced. The result is samples totally or partially enriched for specific genomic segments. Methods can be categorized according to the technique used for enrichment or according to the type of target regions of interest.

According to the technique:

• Polymerase chain reaction (PCR): PCR (Saiki et al., 1988) can be seen as the oldest (and most common) way of obtaining sequences for particular (homologous) genomic regions. In the context of NGS, PCR products can be easily converted into NGS libraries for massive sequencing. Efforts have been aimed at conducting multiple long-range PCRs in parallel, a limited number of standard multiplex PCRs, or highly multiplexed PCR methods that amplify very large numbers of short fragments that are then sequenced. Despite these advances (Baslan et al., 2015; Tewhey et al., 2009; Varley and Mitra, 2008), multiplex PCR remains challenging.

• Molecular inversion probes (MIPs) (Porreca et al., 2007): developed to capture specific regions using two targeting probe sequences linked by an intervening universal sequence. The two linked probes are designed to flank a target, which is copied with DNA polymerase; single-stranded DNA circles that include the target region sequences are then formed (by gap-filling and ligation chemistries) in a highly specific manner, creating structures with common DNA elements that are used for the selective amplification of the targeted regions of interest.

• Primer extension capture (PEC) and PCR-generated probe methods: these broadly refer to a suite of methods that use PCR-generated probes (instead of de novo synthesized ones) to target orthologous regions from genomes (e.g. Briggs et al. 2009; Maricic et al. 2010; Peñalba et al. 2014). Ligating biotinylated adaptors to PCR products can also be used. PEC and molecular inversion probes are best suited to cases with a low number of targets and known homologous genomes.

• Hybrid capture: nucleic acid strands derived from the input sample are hybridized to pre-prepared DNA fragments (usually referred to as probes or baits) complementary to the targeted regions of interest, either in solution or on a solid support, so that one can physically “capture” those regions, i.e., retain them for sequencing, enriching the samples in these sequences of interest. A couple of commercial platforms currently provide bait sets and capture protocols. Target sizes commonly range from less than 1 Mbp up to more than 25 Mbp. Hybrid capture is currently the most common target enrichment technique used for obtaining large datasets from non-model organisms, and typically yields very high efficiency, with 50% or more of identifiable sequence reads matching targeted base pairs (Good, 2011).
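The capture efficiency mentioned for hybrid capture (the share of reads that match targeted regions) can be illustrated with a minimal sketch. It assumes that reads have already been mapped and are summarized as (contig, start, end) half-open intervals, and it counts a read as on-target if it overlaps any target by at least one base. The interval representation, the overlap criterion and all names and coordinates below are illustrative assumptions, not the definition used in the cited studies.

```python
def overlaps(read, target):
    """True if two (contig, start, end) half-open intervals share at least one base."""
    same_contig = read[0] == target[0]
    return same_contig and read[1] < target[2] and target[1] < read[2]


def on_target_fraction(reads, targets):
    """Fraction of mapped reads that overlap at least one target region."""
    if not reads:
        return 0.0
    on_target = sum(1 for read in reads if any(overlaps(read, t) for t in targets))
    return on_target / len(reads)


if __name__ == "__main__":
    # Hypothetical target regions and mapped reads
    targets = [("locus1", 100, 400), ("locus2", 1000, 1300)]
    reads = [
        ("locus1", 90, 190),     # overlaps the start of the first target
        ("locus1", 350, 450),    # overlaps the end of the first target
        ("locus1", 500, 600),    # off-target
        ("locus2", 1200, 1300),  # inside the second target
    ]
    print(f"On-target fraction: {on_target_fraction(reads, targets):.2f}")
```

The nested loop is quadratic in the worst case; for real datasets an interval index or a dedicated tool would be the natural choice, but the quantity being computed is the same.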

According to the type of target regions of interest:

• Exon capture: aimed at capturing the exon/coding sequences of an organism. Probes can be derived from known genomes or transcriptomes (Bi et al., 2012; Hodges et al., 2007).

• Anchored and anonymous hybrid enrichment: Lemmon et al. (2012) proposed an approach for rapidly capturing hundreds of loci useful for both shallow- and deep-level phylogenetic studies, by targeting highly conserved regions flanked by less conserved ones. The conserved probes are able to recover a large number of unassociated loci useful at a diversity of phylogenetic timescales, and this approach has been applied in a growing number of studies (e.g. Chen et al. 2017; Peloso et al. 2016; Prum et al. 2015; Pyron et al. 2016). Along the same lines, one can use known genomes (or data from reduced-representation libraries, or even whole-genome raw sequencing) to select regions with the desired level of variation across the taxa of interest for target enrichment. A growing number of studies follow such protocols, targeting a high number of anonymous (nuclear) loci, especially for shallow-scale phylogenetics (Lemmon and Lemmon, 2012).

• Ultra-conserved elements (UCEs): Faircloth et al. (2012) introduced the capture of UCEs (Bejerano et al., 2004). The level of conservation of such sequences makes them easy to identify and align across very divergent genomes. In addition, they appear in high numbers throughout the genome (Stephen et al., 2008), do not intersect with most types of paralogous genes (Derti et al., 2006) and have few retroelement insertions (Simons et al., 2006). Similar to anchored loci, they have also been widely used and shown to be useful at several phylogenetic scales (Branstetter et al., 2017a,b; Van Dam et al., 2017).

Very recent approaches aim to combine the advantages of both non-targeted and targeted methods, specifically RAD-seq and capture methods: RAD Capture, or Rapture (Ali et al., 2016), and the almost simultaneous RADcap (Hoffberg et al., 2016). Both methods are enrichment-based RAD sequencing approaches that combine restriction enzyme digests, ligation of adaptors, pooling, capture and sequencing, leading to significant advances that increase the density and consistency of genotype matrices while simultaneously reducing costs for large-scale projects.

Of all these methods, it is probably targeted capture that holds the highest potential to advance evolutionary and ecological research (reviewed in Jones and Good 2016). While whole genome sequencing will certainly become more common, genome reduction approaches will certainly remain preferred if they are cheaper for broader sampling and/or easier to analyse. Due to their flexibility with respect to the choice of the areas to be sequenced, and thus their adequacy to many questions, they are likely to remain among the favourites in the biologists’ toolbox for a while.

Figure 1.4 Representative workflow for the acquisition of genome-reduced datasets compared to whole-genome sequencing. Adapted from Davey et al. (2011) and Hirsch et al. (2014). A | Whole-genome sequencing: genomic DNA is fragmented, ligated to adaptors, amplified through PCR and sequenced. B | RNA-seq: mRNA is isolated from total RNA by its poly-A tails, fragmented and reverse transcribed. cDNA fragments are then amplified and sequenced. C | Target enrichment (capture): loci of interest are chosen and probes designed based on one (or more) reference sequences. DNA of samples of interest is hybridized to the designed probes, enriching the samples in the targeted regions. Captured fragments are amplified and sequenced. D | RAD-seq: DNA is fragmented by restriction enzymes in specific (homologous) sites across samples. A first set of different adaptors (P1) is ligated to each sample, samples are pooled and further fragmented. Fragments within a selected size range are isolated and a second set of adaptors (P2) is ligated to them. Fragments are then amplified with P1+P2, ensuring that only fragments containing restriction sites are amplified. These are then sequenced and aligned across samples.

1.3.4 Bioinformatic analysis of NGS data

NGS technologies are constantly improving with respect to the efficiency, quality, size and cost of data acquisition, and so are the algorithms and bioinformatic tools developed to analyse the resulting data. Below I describe the main steps of a typical pipeline (Figure 1.5) for processing NGS data in the phylogenomic context, with a focus on data from NGS target enrichment.

1.3.4.1 Quality control

First and foremost, one must assess the quality of the NGS dataset and prepare it for subsequent analyses. Quality assessment is fundamental in any NGS pipeline, since it provides information about how well the genomic data were sequenced (independently of read quality, specific analyses must be carried out to assess potential contamination). There are several measures for assessing NGS quality, among them: per-base and per-sequence quality scores, per-base sequence content, per-base and per-sequence GC content, sequence length distribution, sequence duplication levels, overrepresented sequences, and k-mer content. We, as researchers, are reluctant to discard data, but removal of low-quality sequences may improve and facilitate downstream analyses (e.g. Niu et al. 2010). There are several software tools and packages available to perform quality assessment, such as Rqc (https://github.com/labbcb/Rqc), FastQC (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/), NGS QC toolkit (Patel and Jain, 2012), FASTX-Toolkit (Gordon and Hannon, 2010), BIGpre (Zhang et al., 2011a) or seq_crumbs (https://github.com/JoseBlanca/seq_crumbs).

Additionally, we may need to merge or trim the reads. Trimming is a process in which bases with poor quality scores or adapter sequences are identified and removed. In addition, the available software tools usually allow extra filters to be applied based on the resulting read size, low complexity reads or identical forward and reverse reads, to avoid inflating coverage estimates. Some of the most commonly used tools for this process are Trimmomatic (Bolger et al., 2014), cutadapt (Martin, 2011), scythe (https://github.com/vsbuffalo/scythe), SICKLE (Joshi NA, 2011) and PRINSEQ (Schmieder and Edwards, 2011). Read-merging is a process that merges the overlapping forward and reverse paired-end reads, which can significantly improve genome assemblies. It can be done using FLASH (Magoč and Salzberg, 2011), COPE (Liu et al., 2012a), BBMerge (Bushnell et al., 2017), or tools built into specific pipelines, such as the one used in Rokyta et al. (2012).

Figure 1.5 Typical workflow used for processing NGS data for phylogenomic inference. Major steps and methodological decisions that must be made when using NGS data for phylogenomic inference. Dashed lines indicate equivalent or optional steps: a) when mapping reads to reference sequences, under the stringent parameters usually used for phylogenomics, one assumes they are reads coming from orthologous fragments. Mapping information (coverage of fragments, percentage of reads with multiple mappings, for example) can actually be used to discard potential paralogs; b) some types of data matrices (unphased alignments) can be obtained both with or without performing variant calling. In this case variant calling increases the confidence in the heterozygous call but can be skipped, depending on the specific question being studied.

I distinguish two different flows for how to proceed once quality assessment and pre-processing are finished, depending on the availability of a reference sequence. One is mapping, which needs a reference, and the other is de novo assembly. Note that “hybrid” approaches are also possible, such as assembling the targeted data (per individual or per clade of interest), choosing orthologous regions, and later mapping the reads to the de novo assembled contigs for variant calling (this is, for example, the kind of approach followed by researchers using UCEs, such as Faircloth et al. 2012 and Faircloth 2016).
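
To make the trimming step of the quality-control stage more concrete, the following is a minimal Python sketch of a sliding-window quality trimmer. It is only an illustration under stated assumptions: qualities are taken to be Phred+33 encoded, the window size and threshold are arbitrary defaults, and the function names (`phred_scores`, `sliding_window_trim`) are hypothetical. It does not reproduce the exact algorithm of any of the tools cited above.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores.

    Assumes Phred+33 encoding unless another offset is given.
    """
    return [ord(ch) - offset for ch in quality_string]


def sliding_window_trim(sequence, quality_string, window=4, min_mean_q=20, offset=33):
    """Cut the read at the start of the first window whose mean quality
    drops below min_mean_q; return the (sequence, quality) prefix kept."""
    scores = phred_scores(quality_string, offset)
    if not scores:
        return sequence, quality_string
    # Reads shorter than the window are evaluated as a single (short) window.
    last_start = max(len(scores) - window, 0)
    for start in range(last_start + 1):
        chunk = scores[start:start + window]
        if sum(chunk) / len(chunk) < min_mean_q:
            return sequence[:start], quality_string[:start]
    return sequence, quality_string


# Toy read: six high-quality bases ('I' = Q40) followed by four poor ones ('#' = Q2).
trimmed_seq, trimmed_qual = sliding_window_trim("ACGTACGTAC", "IIIIII####")
print(trimmed_seq, trimmed_qual)  # ACGTA IIIII
```

In this toy read the cut falls at position 5, where the window first contains enough low-quality bases to pull its mean below the threshold. This is why sliding-window trimmers may also remove a few good bases next to a low-quality tail.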

1.3.4.2 Mapping

When a reference sequence is available, an important step is to map the reads against it. The reference sequence could be a whole genome, a chromosome or a specific locus. The mapping results can indicate whether the sequencing experiment has succeeded, whether the correct sample was sequenced, and whether the biological experiment and DNA preparation succeeded (Flicek and Birney, 2009). Mapping of reads is a distinctive manifestation of perhaps the oldest bioinformatics problem: sequence alignment (Horner et al., 2010). Numerous software tools (for a review see Fonseca et al. 2012) have been developed with the primary challenge of efficiently finding the true location of each read within a potentially large quantity of reference data, while distinguishing between technical sequencing errors and true genetic variation within the sample. Four of the most widely used aligners are BWA (Li and Durbin, 2009), Bowtie2 (Langmead and Salzberg, 2012), NovoAlign (http://www.novocraft.com/products/novoalign/) and Stampy (Lunter and Goodson, 2011). Most of the algorithms used for mapping are based on ‘hashing’ or on an effective data compression algorithm called the ‘Burrows-Wheeler transform’ (BWT) (M. Burrows, 1994). According to the most recent evaluation of mappers in the literature (Thankaswamy-Kosalai et al., 2017), BWT-based tools (Bowtie2 and BWA) are fast, efficient in terms of memory and particularly useful for aligning repetitive reads, but they tend to be less sensitive than the hash-based algorithms (NovoAlign and Stampy). NovoAlign and Stampy currently produce the most accurate overall results, while also being practical in terms of running time. For more comparisons of the performance of various mappers/aligners see Lunter and Goodson (2011) and Hatem et al. (2013).
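
As an illustration of the Burrows-Wheeler transform mentioned above, the sketch below builds the transform of a short string in the naive way, by sorting all its rotations, and then inverts it. This is a didactic example only: the function names are hypothetical, the `$` terminator is an assumed convention, and real mappers rely on far more efficient index structures than this quadratic construction.

```python
def bwt(text, terminator="$"):
    """Naive Burrows-Wheeler transform: last column of the sorted rotations."""
    if terminator in text:
        raise ValueError("text must not contain the terminator symbol")
    text += terminator
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed, terminator="$"):
    """Recover the original text by repeatedly prepending the transformed
    column to the table and re-sorting it."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(transformed[i] + table[i] for i in range(len(transformed)))
    original = next(row for row in table if row.endswith(terminator))
    return original[:-1]


print(bwt("banana"))                   # annb$aa
print(inverse_bwt(bwt("GATTACA")))     # GATTACA
```

The transform tends to group identical characters into runs (as in `annb$aa`), which is what makes it useful for compression. It is also fully reversible, which is what allows compact, searchable indexes of a reference sequence to be built on top of it.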

1.3.4.3 Assembly

When a reference genome is not available, assembly of the reads is almost always essential for analysis. The assembly process maps the sequence data to a putative reconstruction of the target, grouping the reads into pieces (as long as possible) called contigs and scaffolds. Several biological, technical and computational variables have to be considered when performing an assembly, such as prior knowledge of the genomic data that have been sequenced (ploidy, GC-content, repetitive elements), the availability of genomic resources (closely related references), NGS data type and quality, and the computational requirements of the software (memory) (Bao et al., 2011; Martin and Wang, 2011; Pop, 2009). These will aid in the selection of the assembly approach and tool to be used.

Box 1.3 Concepts from the analysis of next-generation sequencing data for phylogenomics.

Coverage: the number of times a certain nucleotide has been sequenced.

Quality scores: also known as Phred Q scores; predictions of the probability of an error in a base call.

Hashing: application of a hash function, which is any function that can be used to map data of arbitrary size to data of fixed size.

K-mer: the possible subsequences of length k that can be obtained from a given sequence.

Contig: set of overlapping DNA segments that together represent a consensus region of DNA.

Scaffold: sometimes called supercontigs or metacontigs, defines the contig order and orientation and the sizes of the gaps between contigs.
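
Three of the concepts in Box 1.3 can be illustrated with a few lines of Python. The sketch uses the standard Phred relationship Q = -10 log10(P), where P is the probability that a base call is wrong. The function names and the 0-based, end-exclusive read intervals are assumptions made for this example.

```python
import math


def phred_to_error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Phred score corresponding to an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


def kmers(sequence, k):
    """All overlapping subsequences of length k, in order of occurrence."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def per_base_coverage(reference_length, read_intervals):
    """Number of reads covering each reference position.

    read_intervals holds (start, end) pairs, 0-based and end-exclusive.
    """
    coverage = [0] * reference_length
    for start, end in read_intervals:
        for pos in range(max(start, 0), min(end, reference_length)):
            coverage[pos] += 1
    return coverage


print(phred_to_error_probability(20))           # 0.01
print(kmers("ACGTA", 3))                        # ['ACG', 'CGT', 'GTA']
print(per_base_coverage(5, [(0, 3), (2, 5)]))   # [1, 1, 2, 1, 1]
```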

Martin and Wang (2011) described the possible assembly strategies as: reference-based assembly, de novo assembly, and a hybrid approach. The reference-based strategy involves the reconstruction of sequences by aligning NGS data to an available genomic resource. The de novo assembly strategy, on the other hand, does not use any reference, and takes advantage of the redundancy of the reads to find overlaps between them and assemble them into contigs (and later into scaffolds). Finally, the hybrid assembly combines the previous approaches to create a more comprehensively assembled sequence. Hybrid strategies may begin with assembly using a reference genome, followed by de novo assembly of reads that initially failed to align to the reference, or may start with de novo assembly, followed by alignment of contigs to a reference scaffold. Several software packages are available for assembly, and their underlying algorithms (for a review see Miller et al. 2010) can handle large data sets, short read lengths, and error rate variations among different NGS platforms. For specific pipelines and tools using transcriptomic datasets see Martin and Wang (2011) and for more general use see Zhang et al. (2011b). Moreover, numerous benchmarks and comparisons of de novo assemblers are available (Alkan et al., 2011; Bradnam et al., 2013; Earl et al., 2011; Lin et al., 2011; Salzberg et al., 2012; Zhang et al., 2011b).

1.3.4.4 Variant and genotype calling

As noted by Nielsen et al. (2011), the process of converting base calls and quality scores into a set of genotypes for each individual in a sample is often divided into two steps: variant calling (also referred to as single nucleotide polymorphism, or SNP, calling) and genotype calling. Variant calling determines the positions at which there are polymorphisms, or at which at least one base differs from the reference sequence, whereas genotype calling determines the genotype of each individual and is usually done for those positions where a SNP (or variant) has been called. Methods and algorithms for variant and genotype calling were originally based on allele counts per position. More recent methods incorporate uncertainty through probabilistic frameworks, which use not only the allele counts but also the quality scores of the base calls in order to provide genotype likelihoods. There are two different ways of carrying out this process, depending on the number of samples used: marginal (for each sample separately) or joint (considering all samples simultaneously) inference. It is important to note that these can provide different results (Kumar et al., 2014; Nho et al., 2014; Shringarpure et al., 2017). Variant calling accuracy will depend on the coverage of the regions and on the distribution of allele frequencies. For low-coverage data, multi-sample calling is currently considered the most accurate, but it is not free from errors and biases (Maruki and Lynch, 2017; Van der Auwera, 2015). Also, it is possible to perform variant calling with respect to a reference sequence or based on the overall allele frequencies. Variant and genotype calling can be performed jointly or as separate steps, and several tools have been developed for both cases. Among the most used ones are GATK (Mckenna et al., 2010) and samtools (Li et al., 2009a), but other up-to-date general-purpose variant callers are ANGSD (Korneliussen et al., 2014) and freebayes (Garrison and Marth, 2012). For a wider review of genotype calling methods and tools see Nielsen et al. (2011, 2012). Given the variety of variant and genotype callers, purposes and types of data sets, several benchmarks have been performed (Cheng et al., 2014; Cornish and Guda, 2015; Hwang et al., 2015; Pirooznia et al., 2014; Rieber et al., 2013; Talwalkar et al., 2014).
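
The idea of a genotype likelihood that combines allele counts with base qualities can be sketched as follows for a single diploid site. This is a simplified textbook-style model and not the implementation of any of the tools cited above. It assumes independent reads, a symmetric error model in which a miscalled base is equally likely to be any of the other three nucleotides, and equal sampling of the two alleles of a heterozygote. All function names are hypothetical.

```python
import math
from itertools import combinations_with_replacement


def genotype_log_likelihoods(bases, quals, alleles="ACGT"):
    """Log-likelihood of the observed bases for each diploid genotype.

    bases: observed base calls at one site (one per read).
    quals: matching Phred quality scores.
    """
    result = {}
    for a1, a2 in combinations_with_replacement(alleles, 2):
        loglik = 0.0
        for base, q in zip(bases, quals):
            # Cap the error at 0.75, the value at which a call is uninformative.
            err = min(10 ** (-q / 10), 0.75)
            p_given_a1 = 1 - err if base == a1 else err / 3
            p_given_a2 = 1 - err if base == a2 else err / 3
            # Each read is assumed to come from either allele with probability 1/2.
            loglik += math.log(0.5 * p_given_a1 + 0.5 * p_given_a2)
        result[a1 + a2] = loglik
    return result


def call_genotype(bases, quals):
    """Return the maximum-likelihood genotype (no prior on genotypes)."""
    logliks = genotype_log_likelihoods(bases, quals)
    return max(logliks, key=logliks.get)


print(call_genotype("AAAAAAAA", [30] * 8))  # AA
print(call_genotype("AAAAGGGG", [30] * 8))  # AG
```

Joint (multi-sample) callers extend this kind of per-sample likelihood with a prior informed by allele frequencies across samples. This is one reason why marginal and joint inference can give different genotypes for the same reads.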

1.3.4.5 Phasing

The sequence data obtained up to this point in the pipeline take the form of unphased genotypes, meaning that, for a diploid organism, there is no information about which of the two haplotypes/chromosomes a particular variant belongs to. Thus we may further be interested in inferring or determining which variants are co-located on the same chromosome, i.e., in phasing. In fact, most phylogenetic algorithms rely on single sequences representing each individual/species. For phylogenomic analyses, a consensus sequence of each locus of a diploid organism is often used instead of inferring its haplotypes, especially at deeper levels (when one is not concerned with variation within species or individuals at each locus). Yet, most of the existing gene-tree and species-tree reconstruction methods are unable to incorporate information from heterozygous sites properly (see Potts et al. 2014 for an exception). For shallow phylogenies (intraspecific or between closely related species), when pervasive ILS may be present and alleles may not be monophyletic within species or individuals, reconstructing haplotypic information becomes very important. In addition, properly accounting for ploidy and allelic variation best matches the multispecies coalescent model (which underlies most species-tree inference methods and traces the history of gene copies within populations and species) and thus increases inference power (for both topology and population sizes). There are two main ways of obtaining phased data: phase can be inferred (from consensus sequences with ambiguity codes) using computational approaches, or it can be determined directly (by lab-based experiments or, using NGS, from the read information itself, given that each read comes from a single DNA strand). A comprehensive review can be found in Browning and Browning (2011).

Population-based statistical phasing methods such as PHASE (Stephens et al., 2001; Stephens and Scheet, 2005) or fastPHASE (Scheet and Stephens, 2006) became widely used in phylogenetics/population genetics, although they are hardly usable with current large datasets. With NGS, a series of methods appeared that aim to recover phasing information directly from the reads: when a read encompasses two or more heterozygous sites of an individual, their phase is directly determined, as each fragment from which a read or pair of reads is obtained is a single haplotype. Thus, if the fragments are long and sequence coverage is sufficiently high, a substantial amount of haplotype phase information can be obtained (Levy et al., 2007). The GATK read-backed phasing algorithm (Broad Institute) and FreeBayes (Garrison and Marth, 2012) are two of the currently available tools for this task. Their application to non-model organisms (where SNP densities are not known, but are usually higher than in humans) is still problematic though.
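
The principle behind read-based phasing, namely that a read (or read pair) spanning two heterozygous sites reveals which alleles sit on the same haplotype, can be sketched as below. This is a deliberately simple majority-vote illustration with hypothetical names and data structures. It is not the algorithm used by GATK or FreeBayes, and it ignores base qualities, sequencing errors beyond the vote, and sites that no read links.

```python
def phase_from_reads(het_sites, reads):
    """Phase consecutive heterozygous sites using reads that span them.

    het_sites: list of (position, allele_a, allele_b), sorted by position.
    reads: list of dicts mapping position -> base observed on that read
           (or read pair), i.e. on a single haplotype.
    Returns the two haplotypes as strings over the heterozygous sites.
    """
    if not het_sites:
        return "", ""
    hap1 = [het_sites[0][1]]
    hap2 = [het_sites[0][2]]
    for (prev_pos, _, _), (pos, a, b) in zip(het_sites, het_sites[1:]):
        votes_a_on_hap1 = 0
        votes_b_on_hap1 = 0
        for read in reads:
            if prev_pos not in read or pos not in read:
                continue  # this read does not link the two sites
            if read[prev_pos] == hap1[-1]:
                on_hap1 = True
            elif read[prev_pos] == hap2[-1]:
                on_hap1 = False
            else:
                continue  # base matches neither allele (likely an error)
            if read[pos] == a:
                votes_a_on_hap1 += on_hap1
                votes_b_on_hap1 += not on_hap1
            elif read[pos] == b:
                votes_b_on_hap1 += on_hap1
                votes_a_on_hap1 += not on_hap1
        if votes_a_on_hap1 == votes_b_on_hap1:
            raise ValueError(f"sites {prev_pos} and {pos} cannot be phased")
        if votes_a_on_hap1 > votes_b_on_hap1:
            hap1.append(a)
            hap2.append(b)
        else:
            hap1.append(b)
            hap2.append(a)
    return "".join(hap1), "".join(hap2)


sites = [(10, "A", "G"), (25, "C", "T"), (40, "G", "T")]
reads = [{10: "A", 25: "T"}, {10: "G", 25: "C"}, {10: "A", 25: "T"},
         {25: "T", 40: "G"}, {25: "C", 40: "T"}]
print(phase_from_reads(sites, reads))  # ('ATG', 'GCT')
```

The `ValueError` branch corresponds to the practical limitation noted above: when fragments are short or coverage is low, adjacent heterozygous sites may not be linked by any read and the phase remains undetermined.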

1.4 Phylogenomic estimation

At this point in the phylogenomic pipeline we have multiple sequence alignments, with unphased (one sequence per individual) or phased (two sequences per individual) sequences. Some steps remain before we can estimate the history of the genes, individuals and species at hand.

1.4.1 Orthology inference

Orthology inference is central to phylogenetic analyses and, currently, when estimating relationships between species, we are mostly interested in assembling matrices of orthologous loci. Methods exist for species-tree inference that are able to deal with both paralogs and orthologs and that are agnostic regarding the sources of gene-tree disagreement (e.g. De Oliveira Martins et al. 2016) but, for the scope of this thesis, I will focus on the most common situation of using only orthologs. Accurate orthology inference is critical for phylogenomic reconstruction, and especially challenging when dealing with partial genomic data obtained through NGS. These data often contain misassemblies and partial or missing sequences, and the complexities of NGS, together with the frequent absence of a reference genome, make it extremely difficult to distinguish recently duplicated copies from allelic variation, splice variants and misassemblies (Yang and Smith, 2014). There are two main differences, in the phylogenomic context, that arise when dealing with mapped versus de novo assembled data:

(a) when using mapped data, we assume we are mapping orthologous reads to our reference fragments (instances of paralogy or copy number variation can in fact be inferred from variations in coverage, and are usually discarded or split), and

(b) for assembled data, a series of methods usually based on reciprocal similarity can be applied to cluster the assembled contigs sequentially into homologs and orthologs (see Yang and Smith (2014) for thorough discussion and comparison of methods).
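
One simple instance of a reciprocal-similarity criterion is the reciprocal best hit: two sequences from different samples are retained as a putative ortholog pair only if each is the other's highest-scoring match. The sketch below is an illustrative assumption on my part and not necessarily one of the methods compared by Yang and Smith (2014). The function names, contig identifiers and scores are hypothetical.

```python
def best_hits(scores):
    """Best-scoring subject for every query.

    scores: dict mapping (query, subject) -> similarity score.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical similarity scores between contigs of two samples.
a_vs_b = {("a1", "b1"): 95, ("a1", "b2"): 60, ("a2", "b2"): 80, ("a3", "b2"): 85}
b_vs_a = {("b1", "a1"): 94, ("b2", "a3"): 86, ("b2", "a2"): 79}
print(reciprocal_best_hits(a_vs_b, b_vs_a))  # [('a1', 'b1'), ('a3', 'b2')]
```

Here `a2` is left out because its best match, `b2`, is more similar to `a3`. This is the kind of pattern expected when `a2` and `a3` are paralogous copies.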

1.4.2 Reconstruction of gene trees

Gene trees reflect the process of DNA replication at a local level: a copy of a gene at a locus in the genome, e.g. a protein coding gene, replicates and its copies are passed on from parent to offspring, generating branching points in the gene tree (Szöllősi et al., 2015). The reconstruction of gene trees thus aims to estimate the phylogenetic relationships of a set of gene copies from different individuals from the same or different species. There are four main kinds of reconstruction methods that can be used to infer gene trees. Herein I provide a brief description of each method, whereas details can be found elsewhere (such methods have been reviewed in Holder and Lewis 2003; Rannala and Yang 1996).

Distance methods: these methods first convert the character matrix into a distance matrix that represents the evolutionary distances between all pairs of species. The gene tree is then inferred from this distance matrix using algorithms such as neighbour joining (NJ; Saitou and Nei 1987) or minimum evolution (ME; Rzhetsky and Nei 1992). The advantage of these methods is that they are fast (while reasonably accurate), since they do not examine all possible tree topologies, and they are therefore ideal for an initial data exploration or to generate starting trees for other methods. There are several implementations of these algorithms that improve running and computation times and memory usage: QuickTree (Howe et al., 2002), QuickJoin (Mailund and Pedersen, 2004; Mailund et al., 2006), RapidNJ (Simonsen et al., 2008), NINJA (Wheeler, 2009), FastTree (Price et al., 2009, 2010) and FastJoin (Wang et al., 2012a).

Maximum parsimony: this method selects the gene tree that requires the minimum number of character changes (substitutions) to explain the observed data (Saitou and Nei, 1986). It does not consider an explicit probabilistic model of change and it has a well-known systematic bias (long branch attraction; Felsenstein 1978), and is thus not widely used with current phylogenomic datasets.

Likelihood methods: the maximum likelihood (ML) method (Felsenstein, 1981) selects the tree that has the highest probability of explaining the sequence data, under a specific model of substitution (nucleotide or amino-acid changes in a sequence). The likelihood function allows the explicit incorporation of the processes of character evolution into probabilistic models. Some of the most popular tools using this approach are PhyML (Guindon et al., 2010), RAxML (Stamatakis, 2014) and IQ-TREE (Nguyen et al., 2015).

Bayesian methods (Rannala and Yang, 1996): these methods select trees according to their posterior probability (the probability that the tree is correct, given the data, the priors and the model), using Bayes’ formula to combine the likelihood function with prior probabilities on trees. Bayesian inference thus aims to obtain the posterior probability of the parameters given the data, and therefore intrinsically accounts for the reconstruction uncertainty without the need for pseudoreplication procedures to establish statistical confidence (bootstrap; Box 1.4). Several tools have been developed for this purpose, with different features, accuracy and popularity, among them BEAST (Bouckaert et al., 2014; Drummond et al., 2012), MrBayes (Huelsenbeck and Ronquist, 2001; Ronquist and Huelsenbeck, 2003) and PhyloBayes (Lartillot et al., 2009).
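
The first step of a distance method, converting an alignment into a matrix of pairwise evolutionary distances, can be sketched as follows. The example computes the proportion of differing sites (p-distance) and applies the Jukes-Cantor (JC69) correction, d = -3/4 ln(1 - 4p/3). The choice of JC69, the handling of gaps and ambiguous characters by pairwise deletion, and the function names are assumptions made for illustration. Tree building (e.g. neighbour joining) would then operate on the resulting matrix.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of differing sites, ignoring columns where either
    sequence has a gap or an ambiguous character (pairwise deletion)."""
    pairs = [(x, y) for x, y in zip(seq_a, seq_b) if x in "ACGT" and y in "ACGT"]
    if not pairs:
        raise ValueError("no comparable sites between the two sequences")
    return sum(x != y for x, y in pairs) / len(pairs)


def jc69_distance(p):
    """Jukes-Cantor corrected distance (expected substitutions per site)."""
    if p >= 0.75:
        return math.inf  # saturated: the correction is undefined
    return -0.75 * math.log(1 - 4 * p / 3)


def distance_matrix(alignment):
    """Pairwise JC69 distances for a dict {taxon: aligned sequence}."""
    names = sorted(alignment)
    return {(a, b): jc69_distance(p_distance(alignment[a], alignment[b]))
            for a in names for b in names}


alignment = {"sp1": "ACGTACGT", "sp2": "ACGTACGA", "sp3": "AC-TTCGA"}
matrix = distance_matrix(alignment)
print(round(matrix[("sp1", "sp2")], 4))  # 0.1367
```

The corrected distance (0.1367) is larger than the raw proportion of differences (0.125) because the correction accounts for multiple substitutions at the same site.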

Box 1.4 Bootstrap.

The bootstrap is a computationally intensive statistical method, introduced by Efron (1979), to obtain estimates of error in nonstandard situations by resampling the data set many times to provide a distribution against which hypotheses can be tested. Later, Felsenstein (1985) introduced the use of the bootstrap in the estimation of phylogenetic trees. It is a widely used technique, which provides assessments of “confidence” for each clade of an observed tree, based on the proportion of bootstrap trees showing that same clade. For estimating gene trees from genomic sequences, we usually have a matrix of taxa × characters (or sites). The bootstrap is then performed by resampling character (or site) columns from the matrix with replacement. Thus, each bootstrap sample consists of a new matrix (=alignment) with the same set of taxa, but with some of the original character columns duplicated, and others dropped (Felsenstein, 1985). The frequency at which each node appears across the bootstrapped datasets reflects its robustness to the resampling process. Importantly, reconstruction uncertainty measures cannot capture systematic errors, and therefore high support values do not always ensure that the reconstructed phylogeny is correct (Soltis and Soltis, 2003).
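
The resampling of alignment columns described in Box 1.4 can be written in a few lines. The sketch below generates one bootstrap pseudoreplicate and computes the support of a clade as the proportion of bootstrap trees that contain it. Tree inference itself is not included, the representation of a tree as a set of clades is an assumption, and the function names are hypothetical.

```python
import random


def bootstrap_alignment(alignment, rng=None):
    """Resample the columns of an alignment with replacement.

    alignment: dict {taxon: aligned sequence}, all of the same length.
    Returns a new alignment with the same taxa and the same length.
    """
    rng = rng or random.Random()
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {taxon: "".join(seq[i] for i in columns)
            for taxon, seq in alignment.items()}


def clade_support(clade, bootstrap_trees):
    """Proportion of bootstrap trees that contain the clade.

    Each tree is represented as a set of clades (frozensets of taxa).
    """
    return sum(clade in tree for tree in bootstrap_trees) / len(bootstrap_trees)


alignment = {"A": "ACGTAC", "B": "ACGTTC", "C": "TCGTAC"}
replicate = bootstrap_alignment(alignment, random.Random(42))
print({taxon: len(seq) for taxon, seq in replicate.items()})

trees = [{frozenset({"A", "B"})}, {frozenset({"A", "B"})},
         {frozenset({"A", "C"})}, {frozenset({"A", "B"})}]
print(clade_support(frozenset({"A", "B"}), trees))  # 0.75
```

Because the same column indices are applied to every taxon, each resampled site keeps the association between taxa that carries the phylogenetic signal.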

1.4.3 Reconstruction of species trees

Species trees represent the evolutionary history of the sampled organisms (see Section 1.1.1). As trees, they are composed of nodes that represent speciation events, and branches that reflect the population history between speciation events. The branches can also have an associated width, which represents the effective population size, and lengths that represent time (generations or years) (Mallo et al., 2014). Several species tree reconstruction methods have been developed in recent years. These methods have been comprehensively reviewed in Liu et al. (2015a), Mallo et al. (2014) and Szöllősi et al. (2015) and, following Mallo et al. (2014), one can classify them as: supermatrix (or concatenation), co-estimation (or single-step coalescent) or supertree (or two-step coalescent).

Figure 1.6 Representation of the different species tree reconstruction approaches. From Liu et al. (2015b).

Supermatrix (or concatenation) relies on joining all single-gene alignments into a multilocus alignment (“supergene”) which is then used as input data for any standard phylogenetic estimation methodology. The assumption is that either all loci share a single evolutionary history or that the different gene histories cancel out, so the “supergene” tree is a reasonable proxy of the species phylogeny (Mallo et al., 2014). The programs that can be used for concatenated inference are the same as those used for gene tree reconstruction.

Co-estimation methods estimate gene trees and species trees concurrently using the multilocus sequence data. Some of them can also estimate parameters like divergence times and ancestral population sizes. There are different algorithms to solve this problem. Among those based on the multispecies coalescent model (MSC; Rannala and Yang 2003), the ones implemented in BEST (Liu, 2008) and BEAST/*BEAST (Heled and Drummond, 2010; Ogilvie et al., 2017) are frequently used. PoMo (De Maio et al., 2015), revPoMo (Schrempf et al., 2016) and SNAPP (Bryant et al., 2012) are popular approaches when using SNP data. BPP (Rannala and Yang, 2003) is probably the most integrative approach, being able to co-estimate gene trees, species trees and the possible species delimitation(s).
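
The concatenation step of the supermatrix approach described above amounts to joining the per-locus alignments taxon by taxon, padding loci in which a taxon was not recovered. A minimal sketch follows. The use of `N` as the missing-data symbol, the generic partition names (`locus1`, `locus2`, ...) and the function name are assumptions for illustration.

```python
def concatenate(gene_alignments, missing="N"):
    """Join single-gene alignments into one supermatrix.

    gene_alignments: list of dicts {taxon: aligned sequence}, one per locus.
    Returns (supermatrix, partitions), where partitions holds the 1-based,
    inclusive coordinates of each locus in the concatenated alignment.
    """
    taxa = sorted({taxon for aln in gene_alignments for taxon in aln})
    pieces = {taxon: [] for taxon in taxa}
    partitions = []
    start = 1
    for index, aln in enumerate(gene_alignments, 1):
        length = len(next(iter(aln.values())))
        for taxon in taxa:
            # Taxa absent from this locus are filled with missing data.
            pieces[taxon].append(aln.get(taxon, missing * length))
        partitions.append((f"locus{index}", start, start + length - 1))
        start += length
    supermatrix = {taxon: "".join(parts) for taxon, parts in pieces.items()}
    return supermatrix, partitions


genes = [{"sp1": "ACGT", "sp2": "ACGA"}, {"sp1": "TT", "sp3": "TC"}]
supermatrix, partitions = concatenate(genes)
print(supermatrix)  # {'sp1': 'ACGTTT', 'sp2': 'ACGANN', 'sp3': 'NNNNTC'}
print(partitions)   # [('locus1', 1, 4), ('locus2', 5, 6)]
```

Keeping the partition coordinates allows a different substitution model to be assigned to each locus, even though a single tree is estimated for the whole matrix.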

Supertree approaches consist of first estimating the gene trees independently with any standard reconstruction method and later combining the estimated gene trees into a single species tree or “super tree”. They follow different strategies, such as a) minimizing the overall disagreement among gene trees (without assuming any underlying evolutionary process), or modeling b) single or c) multiple evolutionary processes (from GDL, HGT or ILS). Several implementations of each of these strategies exist, such as for a) BUCKy (Ané et al., 2007; Larget et al., 2010) or ASTRAL (Mirarab et al., 2014; Mirarab and Warnow, 2015); for b) MP-EST (Liu et al., 2010), STAR (Liu et al., 2009), STEM (Kubatko et al., 2009), NJst (Liu and Yu, 2011), STELLS (Wu, 2012); and for c) PhyloNet (Than et al., 2008), STEM-hy (Kubatko, 2009) or guenomu (De Oliveira Martins et al., 2016). Multiple simulation studies have been conducted to compare the performance of such strategies (e.g. Chaudhary et al. 2015; Lanier et al. 2014; Leaché et al. 2014; Mirarab et al. 2016; Ogilvie et al. 2016; Xi et al. 2015). As for which is the best method for estimation of species trees from genome-wide data, there is still a lot of debate and, as stated in Posada (2016), it seems very difficult to select an absolute winner for every scenario.

1.5 Simulations in phylogenomics

Computer simulation is a tool to evaluate the behaviour of a system (existing or proposed, physical or abstract) under different configurations of interest and over long periods of time (Maria, 1997; Naylor, 1966). This involves certain types of mathematical and logical models that describe the behaviour of the system (Rohrlich, 1990). The model represents the system itself, whereas the simulation represents the operation of the system over time (Gupta and Grover, 2013). The model is similar to the system it represents but simpler, and a good model should strike a good trade-off between realism and simplicity, since the higher the complexity of the model, the harder it becomes to evaluate and understand. In the words of Box (1976): “all models are wrong, but some are useful”. A computer simulation is thus the execution of a simulated experiment on a computer (herein the terms computer simulations and simulations will be used interchangeably). Simulations are low-cost and usually fast and allow us to model reality, generating as much data as desired, under ideal conditions (controlled scenarios with predefined parameters for which the true values are known) (Huelsenbeck, 1995). In silico approaches help with the identification of problems, bottlenecks and design flaws before building or modifying a system. They further allow for evaluation and comparison of many alternative experimental designs. They have, of course, limitations: they make assumptions about the processes that may not be realistic, in which case the model used is an inadequate description of the original system. Simulations have become very useful tools in a wide range of scientific fields, like physics (Gould et al., 1996), chemistry (Seader et al., 1977), medicine (Groothuis et al., 2001), engineering (Bahwaireth et al., 2016; Denkena and Winter, 2015), economics (Jadrić et al., 2014), and, indeed, phylogenetics (Huelsenbeck, 1995).
Simulation of genetic and genomic datasets has become increasingly popular for assessing and validating biological models or to gain understanding about specific datasets (e.g. Daber et al. 2013). In the latter field, simulations alone can be used as guidance for the development of new computational tools (Huang et al., 2012) and for debugging and/or evaluation of software performance (Caboche et al., 2014; Hu et al., 2012; Li, 2013). They also help in the design of sequencing projects (Shcherbina, 2014; Shendure and Aiden, 2012), allow us to generate new hypotheses (Hoban et al., 2012) and test them (DeChaine and Martin, 2006), evaluate methods (Arenas et al., 2008; Carvajal-Rodríguez et al., 2006) and are absolutely essential to verify distinct inferences such as the correctness of an assembly (Knudsen et al., 2010), the accuracy of gene prediction (Mavromatis et al., 2007), the power to reconstruct accurate genotypes and haplotypes (McElroy et al., 2012; Nielsen et al., 2011), the fidelity of de novo short read assemblies (Pignatelli and Moya, 2011), and the accuracy of phylogenetic reconstruction (Bayzid and Warnow, 2013; Mallo et al., 2015b; Mirarab and Warnow, 2015; Sayyari et al., 2017) or of tools (Li, 2013). Although it is true that simulations have limitations because they generate data under models of reality, and not the reality itself, they nicely complement validation with real data (Angly et al., 2012; Holtgrewe, 2010).

1.5.1 Simulation of evolutionary histories (gene trees and species trees)

There are several tools and software packages for the simulation of phylogenetic trees. R packages like ape (Paradis et al., 2004), apTreeshape (Bortolussi et al., 2006), phytools (Revell, 2012) or TreeSimGM (Hagen and Stadler, 2017) allow fast simulation of gene trees under several models. ape can simulate trees by randomly splitting the edges or by randomly clustering the tips (following the coalescence algorithm described in Paradis (2011)). phytools allows the simulation of stochastic birth-death trees. apTreeshape simulates trees under a variety of models (see Bortolussi et al. 2006), whereas TreeSimGM can simulate trees under the General Bellman and Harris model (Bellman and Harris, 1948), thus allowing simulations under common models such as the birth-death, proportional to distinguishable arrangement (PDA) and Yule (Yule, 1925) models, or even the formulation and exploration of new tree models (Hagen and Stadler, 2017). In Python we have DendroPy (Sukumaran and Holder, 2010), a library for phylogenetic computing that allows the simulation of trees under many scenarios: birth-death process, pure-neutral coalescent model and MSC. Bio::Phylo (Vos et al., 2011), a Perl5 toolkit for phyloinformatic analyses, can simulate tree topologies reflecting pure birth under the Hey (Hey, 1992) or Yule models; equiprobable topologies (Simberloff, 1987); and constant rate birth-death, evolving speciation rate, and beta binomial models implemented using novel algorithms (Hartmann et al., 2010). Mallo et al. (2015a) explain how the different evolutionary processes (ILS, GDL and HGT) are usually modeled: ILS with tools that implement the MSC; GDL using a birth-death process traversing the species tree (see Arvestad et al. 2003); and HGT as a series of transfer events following a Poisson distribution (as in Galtier 2007), or as randomly distributed transfer events as in HgtSim (Song et al., 2017).
Some tools simulate gene phylogenies under the MSC, like MCcoal (Rannala and Yang, 2003), GUMS (Heled et al., 2013) or Mesquite (Maddison and Maddison, 2017); others, like SGWE (Arenas and Posada, 2014) or scrm (Staab et al., 2015), do so under the multispecies coalescent with recombination and can also simulate DNA sequences. Moreover, ms (Hudson, 2004) and its faster re-implementation msprime (Kelleher et al., 2016) are able to simulate gene trees and also sample data (sequences of individuals or populations), assuming an infinite sites model of mutation. Another set of tools is used to co-simulate gene and species phylogenies, although very few of them consider multiple simultaneous sources of phylogenomic incongruence. GenPhyloData (Sjöstrand et al., 2013) combines the simulation of GDL and HGT, whereas DLCoal_sim (Rasmussen and Kellis, 2012) considers GDL and ILS. Some simulators (EvolSimulator (Beiko and Charlebois, 2007); ALF (Dalquen et al., 2012)) incorporate a series of complex “genomic” effects in an initial (real or simulated) sequence, such as gene duplications and losses, lateral gene transfers and genome rearrangements, although they do not incorporate this variation at the population level (ILS). The only tool able to simulate the evolution of multiple gene families under ILS, GDL, HGT and GC is, to my knowledge, SimPhy (Mallo et al., 2015a), which further implements a flexible hierarchical parameterization scheme that considers genome-wide and gene family specific conditions, including different sources of evolutionary rate variation among lineages.
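To make the idea of coalescent-based gene tree simulation more concrete, the following is a minimal sketch of the standard single-population (Kingman) coalescent, in which pairs of lineages are merged at random backwards in time. It is only an illustration: it does not implement the MSC, it is not the algorithm of any of the tools cited above, and the function name `kingman_newick` and the tip labels are hypothetical. Times are expressed in coalescent units.

```python
import random


def kingman_newick(n_tips, seed=None):
    """Simulate a gene tree under the single-population Kingman coalescent.

    Returns a Newick string with branch lengths in coalescent units.
    Illustrative sketch only; requires n_tips >= 1.
    """
    rng = random.Random(seed)
    # Each active lineage is stored as (newick_label, node_height).
    lineages = [(f"t{i + 1}", 0.0) for i in range(n_tips)]
    time = 0.0
    while len(lineages) > 1:
        k = len(lineages)
        # Waiting time to the next coalescence: exponential with rate k(k-1)/2.
        time += rng.expovariate(k * (k - 1) / 2)
        # Pick two distinct lineages uniformly at random and merge them.
        i, j = rng.sample(range(k), 2)
        (a, height_a), (b, height_b) = lineages[i], lineages[j]
        merged = (f"({a}:{time - height_a:.4f},{b}:{time - height_b:.4f})", time)
        lineages = [lin for idx, lin in enumerate(lineages) if idx not in (i, j)]
        lineages.append(merged)
    return lineages[0][0] + ";"


if __name__ == "__main__":
    print(kingman_newick(5, seed=42))
```

Extending such a sketch to the MSC would require running this process independently within each branch of a species tree, which is what the dedicated simulators above do.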

1.5.2 Simulation of molecular evolution (DNA sequences)

Random DNA sequences can be very easily simulated in any programming language by sampling the four nucleotides at specific frequencies. A wide variety of tools have been developed to simulate more complex sequence data under different substitution models, but also under different evolutionary processes such as selection, recombination, demographics, population structure, and migration (Arenas, 2012). There are several simulators of DNA, RNA or protein sequences along an underlying gene tree under a variety of substitution models (for a recent review of these see Arenas 2015). Probably the most popular ones are Seq-Gen (Rambaut and Grassly, 1997), ROSE (Stoye et al., 1998) and INDELible (Fletcher and Yang, 2009), of which ROSE and INDELible also allow the introduction of insertions and deletions (indels). Some tools only allow the simulation of DNA sequences, like Vanilla (a front end of PAL; Drummond and Strimmer 2001); ProSeq (Filatov, 2002), which simulates DNA sequences along a coalescent tree with or without recombination; or DAWG (DNA Assembly With Gaps; Cartwright 2005), which can also introduce gaps. Others, like indel-Seq-Gen (Strope et al., 2007), also simulate more realistic evolutionary processes of protein sequences, incorporating domains, motifs and indels.
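As an illustration of the simplest case mentioned at the start of this section, the sketch below draws independent nucleotides according to user-defined base frequencies. The function name `random_dna` and its parameters are hypothetical, and the sketch involves no tree and no substitution model, unlike the simulators cited above.

```python
import random


def random_dna(length, freqs=None, seed=None):
    """Return a random DNA string with sites drawn independently.

    freqs maps each nucleotide to its probability (assumed to sum to 1);
    equal frequencies are used when it is omitted.
    """
    if freqs is None:
        freqs = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    if abs(sum(freqs.values()) - 1.0) > 1e-9:
        raise ValueError("base frequencies must sum to 1")
    rng = random.Random(seed)
    bases = list(freqs)
    weights = [freqs[base] for base in bases]
    return "".join(rng.choices(bases, weights=weights, k=length))


if __name__ == "__main__":
    # A GC-rich random sequence of 60 sites.
    print(random_dna(60, {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15}, seed=1))
```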

1.5.3 Simulation of NGS data

Computer simulation of genomic data has become increasingly popular for assessing and validating biological models or to gain understanding about specific datasets. Multiple computational tools for the simulation of next-generation sequencing (NGS) data have been developed in recent years, which could be used to compare existing and new NGS analytical pipelines. This topic is the subject of Chapter 2, where I review 23 of these tools, highlighting their distinct functionality, requirements and potential applications. In addition, I provide a decision tree for the informed selection of an appropriate NGS simulation tool for the specific question at hand.

1.6 Motivation

The ever-evolving NGS field is continuously innovating (e.g. more variation of read lengths, larger throughput, faster data acquisition), allowing a better handling of the complexities of genomes (Goodwin et al., 2016). While such technological improvements have led to its widespread use (van Dijk et al., 2014), they have also brought new problems such as higher error rates, small reads needing to be assembled, systematic biases derived from using single sequencing platforms, or issues related to the large size of the datasets. All these factors pose significant challenges for data processing, storage and analyses (Catchen et al., 2013), and it is hence evident that the use of NGS data is still very costly in terms of time, space and money. As we have seen in Section 1.3, the analysis of NGS data for phylogenomics consists of several steps, mainly assembly and/or mapping, homology/orthology inference, variant and/or genotype calling, gene tree estimation and/or species tree (plus other related parameters) estimation. Therefore, the NGS phylogenomic pipeline is complex and requires multiple methodological decisions and human interaction. Importantly, there is no standard approach for the phylogenomic analysis of NGS data, and the influence of the different strategies and options on the accuracy of the resulting trees is unclear. Phylogenomic approaches appear as technology evolves, but they are mainly ad hoc to the characteristics of the specific NGS datasets in question (see Allard et al. 2012; Jex et al. 2010; Kumar et al. 2015; Tosso et al. 2017). The first main methodological decision (de novo assembly vs. mapping) depends on the existence of one or more sequences similar enough to be used as a reference to which the obtained reads can be mapped. In the absence of such a reference, one must proceed with de novo assemblies, which will also imply homology/orthology inference.
It is important to pay attention to the genetic distance between the reference sequence and the mapped organisms: it can strongly condition coverage, and thus all the downstream inferences. When performing de novo assemblies, we will have to deal with the non-trivial process of homology and orthology inference. Variant calling can be done per sample (single-sample calling) or taking all the samples into account (multi-sample calling), and it can be performed with respect to a reference sequence or based on the overall allele frequencies. Genotype calling is then made alongside variant calling or independently (Nielsen et al., 2011). Many of the choices related to the methods and thresholds for variant and genotype calling can influence the resulting inferences (Han et al., 2014; Nevado et al., 2014; Nielsen et al., 2011, 2012). Genotype likelihoods (Korneliussen et al., 2014; Mckenna et al., 2010; Nielsen et al., 2012) were introduced to better account for uncertainty in genotype inference (da Fonseca et al., 2016), but phylogenetic methods usually do not take these directly into account. If working with diploid (or other ploidy) organisms, one may also be interested in phasing (i.e., knowing the sequences of each locus within an individual) or in the generation of consensus sequences (with or without ambiguities, with the major or minor allele, or with a random allele). The most popular tools for this task in phylogenetic studies involved inferring the phase from alignments with ambiguities, making use of known haplotypes and assuming coalescent models of haplotype frequencies in populations (Stephens et al., 2001; Stephens and Donnelly, 2003), an approach that is highly time-consuming and hardly applicable to current-sized datasets. Current approaches for NGS data are based on the overlap between reads and SNPs to obtain the physical phase information for variants (read-backed phasing algorithms).
Looking deeper, each step by itself implies a specific pipeline (for alignment and quality assessment see Brouwer et al. 2012; for assembly see Gonzalez et al. 2017; and for variant discovery and genotyping see Bai and Cavalcoli 2013; Maruki and Lynch 2017; Van der Auwera et al. 2013). Furthermore, there is a panoply of software tools from which to choose and parameters that need to be tuned in each step and often for each dataset. All these different possible treatments will influence the final datasets obtained (e.g. Bokulich et al. 2013). The ultimate goal of a phylogenomic study is usually knowing the histories of species (here meaning any clade of organisms), but the underlying truth when dealing with empirical data (the correct genome sequences, the true gene and species trees behind the extant organisms, their current and past population sizes and divergence times) is unknown. This makes it very hard, or even impossible, to grasp which protocol performs better in the inference/estimation process, hence the need for computer simulations (Section 1.5) in order to benchmark the different phylogenomic strategies.

1.7 Objectives

Given the reasons just described, the main aim of my thesis is to understand the effect of different methodological decisions during the production and analysis of NGS data on the quality of the phylogenomic inference, focusing on targeted NGS experiments. To accomplish this, I have identified four specific objectives:

1. To provide a better understanding of the wide variety of existing genomic NGS simulators as well as general guidelines for their selection for specific purposes. Addressed in Chapter 2: Simulation of genomic next-generation sequencing data. Published as: Escalona et al. 2016.

2. To design and implement a realistic simulation tool for the generation of phylogenomic NGS data. Addressed in Chapter 3: NGSphy: phylogenomic simulation of next-generation sequencing data. Published as: Escalona et al. 2017.

3. To identify realistic parameters for the simulation of targeted NGS data. Addressed in Chapter 4: Optimization of parameters for the simulation of targeted sequencing data of shallow phylogenomics.

4. To assess the sensitivity of species tree inference to variations of the phylogenomic NGS pipeline. Addressed in Chapter 5: Sensitivity of phylogenomic inference to the design of target enrichment NGS experiments in non-model organisms.

Chapter 2

Simulation of genomic next generation sequencing data

This chapter is based on the following peer-reviewed publication: Escalona M, Rocha S, Posada D. 2016. A comparison of tools for the simulation of genomic NGS data. Nature Reviews Genetics 17: 459-469.

Contributions: Merly Escalona (ME) comparatively reviewed the available literature and software.

2.1 Introduction

Next-generation sequencing (NGS) techniques are the standard nowadays for the generation of genomic data, producing ever-increasing amounts of information rapidly and at a low cost. These techniques allow us to sequence DNA and RNA very quickly, facilitating the acquisition of massive genomic, transcriptomic, DNA-protein interaction and epigenomic datasets, and are radically changing the way we look at genomes (Koboldt et al., 2013; Metzker, 2010; Nielsen et al., 2011). Given their higher parallelism and smaller reaction volumes compared to conventional Sanger sequencing, NGS methods (Section 1.3) offer larger amounts of data, shorter sequencing time and reduced costs, albeit at the cost of increased error rates and shorter reads (Wang et al., 2012b). NGS clearly facilitates the accumulation of large data sets, but the downstream processing of these data is still an important bottleneck (Liu et al., 2012b). Not surprisingly, NGS data result in numerous bioinformatics challenges, including storage, transmission, manipulation and analysis. Better computational methods and more efficient software tools are constantly being developed in order to provide faster processing and more accurate inferences. However, it is essential that these methods are benchmarked against existing tools with similar functionality, in order to show their superiority at least in some aspect. In general, computational methods can be benchmarked using empirical and/or simulated data. Although validation with empirical data is essential as it represents real scenarios, the true process underlying it is usually unknown, complicating its use for the assessment of accuracy (that is, how close the estimated value is to the ‘true’ value). On the other hand, in silico data allow us to generate as much data as desired and under controlled scenarios with predefined parameters for which the ’true’ values are known, nicely complementing the validation with real data (Angly et al., 2012; Holtgrewe, 2010). 
Thus, computer simulation of genetic and genomic data has become increasingly popular for assessing and validating biological models or to gain understanding about specific datasets. Simulations alone can be used as guidance for the development of new computational tools (Huang et al., 2012), for debugging and to evaluate software performance (Caboche et al., 2014; Hu et al., 2012). Computer simulations also allow us to generate new hypotheses (Hoban et al., 2012), help in the design of sequencing projects (Shcherbina, 2014; Shendure and Aiden, 2012), and are essential to verify distinct inferences such as the correctness of an assembly (Knudsen et al., 2010), the accuracy of gene prediction (Mavromatis et al., 2007) or the power to reconstruct accurate genotypes and haplotypes (McElroy et al., 2012; Nielsen et al., 2011). Several computational tools for the simulation of NGS data have been developed in the past few years. These tools have very diverse input requirements and functionalities, which makes it quite difficult to choose the most appropriate one for the problem at hand. Here I present, to my knowledge, the first review of available software tools for the simulation of genomic NGS data. Note that I focus on the simulation of DNA sequences and do not discuss RNA sequencing (RNAseq) simulation, which has its own characteristics. I review 23 NGS simulation tools that were either recently published or developed, that were (in most cases) still maintained and that were freely available. I discuss their various features, such as the required input, the interaction with the user, the sequencing platforms, the type of reads, the error models, the possibility of introducing coverage bias, the simulation of genomic variants and the output provided.
This is done within the framework of potential applications, providing readers with guidelines for the identification of the NGS simulators that are best suited for their purposes (Figure 2.1).

2.2 Simulation parameters

The existing sequencing platforms (see Chapter 1, Section 1.3) use distinct protocols that result in datasets with different characteristics (Metzker, 2010). Many of these attributes can be taken into account by the simulators (Figure 2.2), although there is no single tool that incorporates all possible variations. The main characteristics of the 23 simulators considered here are summarized in Table 2.1 and Table 2.2. These tools differ in multiple aspects, such as sequencing technology, input requirements or output format, but maintain several common aspects. With some exceptions, all programs need a reference sequence, multiple parameter values indicating the characteristics of the sequencing experiment to be simulated (read length, error distribution, type of variation to be generated, if any, etc.) and/or a profile (a set of parameter values, conditions and/or data used for controlling the simulation), which can be provided by the simulator or estimated de novo from empirical data. The outcome will be aligned or unaligned reads in different standard file formats, such as FASTQ, FASTA or BAM. An overview of the NGS data simulation process is represented in Figure 2.3. In the following sections I delve into the different steps involved.

Figure 2.1 Decision tree for the selection of a suitable NGS genomic simulator. The selection of a next-generation sequencing (NGS) simulator requires a set of sequential decisions. First, decide whether there is a reference sequence or not. Then, decide whether reads should be simulated from one or several organisms. Next, specify whether genomic variants should be introduced (in addition to those that already exist in the reference sequence or sequences). Finally, determine the sequencing technology of interest. 454, 454 pyrosequencing (Roche); Illumina, Illumina’s sequencing by synthesis; Nanopore, Oxford Nanopore Technologies; PacBio, Pacific Biosciences; Sanger, Sanger sequencing; SOLiD, sequencing by oligonucleotide ligation and detection (Thermo Fisher).

Box 2.1 Definition of concepts related to the simulation of NGS data.

Reference sequence: A particular genomic region, multiple genomic regions concatenated, a chromosome or a complete genome from which next-generation sequencing reads will be generated.
Base calling: The analysis of the information obtained from the machine sensors during next-generation sequencing and posterior prediction of the individual bases. This converts the signal into actual sequence data with quality scores.
Profile: A set of biological (GC content, insertions and deletions, and substitution rates) and/or technological (insert sizes, read lengths, error rates and quality scores) parameter distributions or values that will be used in a specific simulation.
Abundance profile: A set of probabilities that represent the proportion of taxa within a community (and data set).

2.2.1 Reference sequence

Most NGS simulators require a reference sequence from which they will generate the simulated reads. This reference sequence can be a particular genomic region, multiple genomic regions concatenated, a chromosome, or a complete genome. The only exception in this regard is the XS (Pratas et al., 2014) read simulator, which only requires the read length, sequencing technology and nucleotide composition to generate completely de novo reads. Most of the current NGS simulators use a haploid genome as the reference sequence. Some tools such as EAGLE (https://github.com/sequencing/EAGLE), pIRS (Hu et al., 2012), ReadSim (Lee et al., 2014) and SimSeq (Earl et al., 2011) simulate reads from different ploidies. While in EAGLE and ReadSim one can specify any ploidy (or even a specific chromosome for EAGLE), pIRS and SimSeq simulate reads from diploid genomes given a haploid reference. Furthermore, several tools are able to generate pools of reads from multiple reference sequences, in some cases using an abundance profile that defines the proportion of reads that are generated from each sequence.
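The role of an abundance profile when pooling reads from several references can be sketched as follows. This is a minimal, error-free illustration and not the procedure of any particular simulator; the function name `sample_reads`, the reference names and the uniform choice of start positions are assumptions made for the example.

```python
import random


def sample_reads(references, abundance, n_reads, read_len, seed=None):
    """Draw error-free single-end reads from several reference sequences.

    references: dict mapping a (hypothetical) reference name to its sequence.
    abundance:  dict mapping the same names to their relative abundance.
    Each read is returned as (reference_name, start, read_sequence).
    Every reference is assumed to be at least read_len long.
    """
    rng = random.Random(seed)
    names = list(references)
    weights = [abundance[name] for name in names]
    reads = []
    for _ in range(n_reads):
        # Choose the source reference proportionally to its abundance.
        name = rng.choices(names, weights=weights, k=1)[0]
        seq = references[name]
        # Choose a start position uniformly along the chosen reference.
        start = rng.randint(0, len(seq) - read_len)
        reads.append((name, start, seq[start:start + read_len]))
    return reads


if __name__ == "__main__":
    refs = {"taxonA": "ACGT" * 50, "taxonB": "TTGCA" * 40}
    profile = {"taxonA": 0.8, "taxonB": 0.2}
    for read in sample_reads(refs, profile, n_reads=5, read_len=25, seed=7):
        print(read)
```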

2.2.2 Profiles

Most simulators require the setting of many parameters. This can be done in the command line and/or using a profile. Profiles can specify parameter distributions or discrete values for different biological features (e.g. GC content, indel and substitution rates) and/or technological features (e.g. insert sizes, read lengths, error rates and quality scores). Note that there is no standard format for profiles, and the information they include can change for the different tools. Because for many users it might be difficult to decide on particular parameter values or to construct their own profile, some simulators provide default profiles. Alternatively, many tools offer a way to estimate de novo profiles from empirical data.

Figure 2.2 General overview of the sequencing process and steps that can be parameterized in the simulations. NGS simulators try to imitate the real sequencing process as closely as possible by considering all the steps that could influence the characteristics of the reads. a | NGS simulators do not take into account the effect of the different DNA extraction protocols in the resulting data. However, they can consider whether the sample we want to sequence includes one or more individuals, from the same or different organisms (e.g., pool-sequencing, metagenomics). Pools of related genomes can be simulated by replicating the reference sequence and introducing variants on the resulting genomes. Some tools can also simulate metagenomes with distinct taxa abundance. b | Simulators can try to mimic the length range of DNA fragmentation (empirically obtained by sonication or digestion protocols) or assume a fixed amplicon length. c | Library preparation involves ligating sequencing-platform dependent adaptors and/or barcodes to the selected DNA fragments (inserts). Some simulators can control the insert size, and produce reads with adaptors/barcodes. d | Most NGS techniques include an amplification step for the preparation of libraries. Several simulators can take this step into account (for example, by introducing errors and/or chimaeras), with the possibility of specifying the number of reads per amplicon. e | Sequencing runs imply a decision about coverage, read length, read type (single-end, paired-end, mate-pair) and a given platform (with their specific errors and biases). Simulators exist for the different platforms, and they can use particular parameter profiles, often estimated from real data.
Several simulators are able to generate new profiles from alignments of reads mapped to a reference genome (SAM/BAM files) or from real sequencing data from a previous sequencing run (FASTQ files). Thus, BEAR (Johnson et al., 2014), NeSSM (Jia et al., 2013) and pIRS provide guidelines for the use of alignment and mapping tools such as BWA (Li and Durbin, 2009), BLAST (Altschul et al., 1990), SOAP (Li et al., 2008) or SOAP2 (Li et al., 2009b), and for error estimation programs such as DRISEE (Keegan et al., 2012), together with other scripts for parsing the data or for other tasks. The ART (Huang et al., 2012), FASTQsim (Shcherbina, 2014), GemSim (McElroy et al., 2012), SimSeq (Earl et al., 2011) and SInC (Pattnaik et al., 2014) packages provide their own standalone tools for the generation of error, quality and/or abundance profiles. ART and SInC generate quality profiles based on specific error models and/or the quality score distribution extracted from empirical data. NeSSM generates quality and error profiles. The quality profiles define the quality score given to each base along the read and are estimated based on an existing set of reads. The error profiles define the proportion of the different error types (substitutions and indels) and are estimated with specific scripts. pIRS generates quality profiles using mapped reads and known variations from re-sequencing data. The program BEAR, focused on metagenomics, generates error, quality and abundance profiles. For the generation of the error profile it uses a modified version of DRISEE to infer error rates by clustering artefactual duplicate reads in the supplied dataset. For the quality profile it uses the output of the error model to determine the average quality score assigned to erroneous nucleotides per position per read (Johnson et al., 2014). In addition, it generates an abundance profile from the relative frequency of the different taxa in a metagenomic dataset.
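As a rough illustration of what estimating a quality profile from empirical reads involves, the sketch below computes the mean quality score per read position from a set of FASTQ quality strings. It is a deliberate simplification and not the procedure used by any of the tools above; the function name `mean_quality_profile` is hypothetical and a Phred+33 encoding is assumed.

```python
def mean_quality_profile(quality_strings, offset=33):
    """Mean Phred quality per read position from FASTQ quality strings.

    Assumes Phred+33 encoding by default. Reads may differ in length;
    each position is averaged over the reads that reach it.
    """
    max_len = max(len(qual) for qual in quality_strings)
    totals = [0] * max_len
    counts = [0] * max_len
    for qual in quality_strings:
        for pos, char in enumerate(qual):
            totals[pos] += ord(char) - offset
            counts[pos] += 1
    return [total / count for total, count in zip(totals, counts)]


if __name__ == "__main__":
    # Three toy quality strings whose scores decay towards the 3' end.
    print(mean_quality_profile(["IIIH5", "IIH5", "IHH55"]))
```

A simulator could then sample qualities around such per-position values, although the actual tools typically store richer distributions than a single mean.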
Finally, other simulation programs such as ArtificialFastqGenerator (Frampton and Houlston, 2012) and CuReSim (Caboche et al., 2014) do not use a profile; their simulation parameter values are specified directly via the command line.

Table 2.1 General information about 23 NGS genomic simulators.

Simulator     | Technology         | G vs M | Run types  | Output format
454Sim        | 454                | G      | SR         | SFF
ART           | 454, ILL, SOL      | G      | SR, PE, MP | SFF/FQ
ArtFastqGen   | ILL                | G      | PE         | FQ
BEAR          | 454, ILL, ITO      | G,M    | SR, PE     | FQ
CuReSim       | 454, ILL, SOL, ITO | G      | SR         | FQ
dwgsim (dnaa) | ILL, SOL, ITO      | G      | SR, PE, MP | FQ
EAGLE         | 454, ILL, PCB, ITO | G      | SR, PE     | FQ
FASTQSim      | ILL, SOL, PCB, ITO | G,M    | SR         | FQ
Flowsim       | 454                | G      | SR, PE     | SFF
GemSim        | 454, ILL           | G,M    | SR, PE     | FQ
Grinder       | 454, ILL, SNG      | G,M    | SR, PE, MP | FQ
Mason         | 454, ILL, SNG      | G      | SR, PE, MP | FA/FQ
MetaSim       | 454, ILL, SNG      | G,M    | SR, PE, MP | FA
NeSSM         | 454, ILL           | M      | SR, PE     | FQ
pbsim         | PCB                | G      | CLR/CCS    | FQ
pIRS          | ILL                | G,M    | PE         | FQ
ReadSim       | PCB, ONT           | G      | SR         | FQ
simhtsd       | 454, ILL           | G      | SR, PE     | FQ
simNGS        | ILL                | G      | SR, PE     | FQ
SimSeq        | ILL                | G      | SR, PE, MP | SAM/BAM
SInC          | ILL                | G      | PE         | FQ
wgsim         | ILL, SOL           | G      | SR         | FQ
xs            | 454, ILL, SOL, ITO | G      | SR, PE     | FQ

454: Roche’s 454. ILL: Illumina. SOL: SOLiD. ITO: Ion Torrent. PCB: Pacific Biosciences. ONT: Oxford Nanopore Technologies. SNG: Sanger. G: Genomics. M: Metagenomics. PA: Parameters. RE: Reads. PR: Profile. DF: Default profile. GU: Guide to generate profiles. SW: Specific software to generate profile. PCR: Polymerase Chain Reaction. GV: Genomic variants. QS: Quality scores. FO: Format. AL: Alignments. FA: Fasta. FQ: Fastq. SFF: Standard Flowgram Format. SAM: Sequence Alignment Map. BAM: Compressed SAM File.

2.3 Accounting for PCR amplification

DNA amplification with polymerase chain reaction (PCR) is currently a necessary step in the preparation of libraries for the Illumina, 454, IonTorrent and SOLiD (Mardis, 2008; Morozova and Marra, 2008) sequencing platforms. One may therefore be interested in modeling the bias introduced by PCR (Aird et al., 2011; Haas et al., 2011; Metzker, 2010), as done by ART, Flowsim (Balzer et al., 2010, 2011) and Grinder (Angly et al., 2012). ART, which simulates reads for Illumina, 454 and SOLiD, can mimic PCR bias by specifying the number of reads (SR or PE) generated per amplicon (Huang et al., 2012). Flowsim is a suite of executables that simulate the entire 454 pyrosequencing process; using its module “kitsim” one can simulate the attachment of adapters to the end of each amplicon, which then serve as primers for their PCR amplification simulated by the “duplicator” module (Balzer et al., 2010, 2011). Grinder was specifically developed to simulate amplicon sequencing from user-supplied PCR-primer collections, introducing known experimental artifacts like chimeras (Haas et al., 2011) and spurious copy number variants. Grinder can generate chimeras in two ways: 1) by appending consecutive segments at given breakpoints, where both amplicon sequences and breakpoints are randomly selected; and 2) by concatenating fragments at breakpoints determined by specific k-mers that must be shared by the amplicons. In addition, the presence of several gene copies in a genome may affect the composition of the amplicon library, contributing extra amplicon reads. Grinder models this bias by sampling species proportionally to their relative abundance and to the number of copies of the amplicon in their genome (Angly et al., 2012).
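The copy-number bias described at the end of this section amounts to weighting each species by the product of its relative abundance and its amplicon copy number. The sketch below shows this arithmetic only; it is not Grinder's code, and the function name `amplicon_read_proportions` and the example values are hypothetical.

```python
def amplicon_read_proportions(abundance, copy_number):
    """Expected share of amplicon reads per species.

    Each species is weighted by (relative abundance x amplicon copy number)
    and the weights are normalised to sum to one.
    """
    weights = {sp: abundance[sp] * copy_number[sp] for sp in abundance}
    total = sum(weights.values())
    return {sp: weight / total for sp, weight in weights.items()}


if __name__ == "__main__":
    # Two equally abundant species; the second carries three amplicon copies.
    print(amplicon_read_proportions({"sp1": 0.5, "sp2": 0.5},
                                    {"sp1": 1, "sp2": 3}))
```

In this toy case the species with three copies is expected to contribute three quarters of the amplicon reads, even though both species are equally abundant.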

Box 2.2 Definition of concepts related to the sequencing technologies and possible biases.

Amplicon: A piece of DNA or RNA resulting from a natural or artificial amplification event (for example, PCR).
Coverage bias: A bias in the amount of reads for a particular region. For example, sequencing depth increases in regions of elevated GC content.

2.4 Read features

In an NGS experiment, the number, length and type of reads are determined by the specific sequencing machine and the library preparation. It is possible to simulate a specific amount of reads with different lengths and types according to the sequencing technology assumed. The number of reads can be specified or estimated according to the desired coverage. Also, it is possible to select a fixed length, the length of the longest read or a length distribution. The read type can be specified directly or indirectly by defining particular insert sizes. By default, most simulators assume single-end reads.

Table 2.2 Technical information about NGS genomic simulators.

Simulator                | Programming Language | Operative System | Interface     | Processing | LIC        | Open Source
454Sim                   | C++/Perl             | Win, Lnx, MOS    | CLI           | NP,P       | GNU GPL v1 | Y
ART                      | C++/Perl             | Win, Lnx, MOS    | CLI           | P          | GNU GPL    | Y
ArtificialFastqGenerator | Java                 | Win, Lnx, MOS    | CLI           | P          | GNU GPL v3 | Y
BEAR                     | Python/Perl          | Lnx              | CLI           | P          | AU         | Y
CuReSim                  | Java                 | Win, Lnx, MOS    | CLI           | P          | *          | N
dwgsim (dnaa)            | C/Perl/Python        | Lnx              | CLI           | P          | GNU GPL v2 | Y
EAGLE                    | C++                  | Lnx              | CLI           | NP,P       | BSD        | Y
FASTQSim                 | Bash/Python          | Lnx              | CLI           | NP,P       | GNU GPL v3 | Y
Flowsim                  | Haskell              | Lnx              | CLI           | P          | GNU        | Y
GemSim                   | Python               | Win, Lnx, MOS    | CLI           | P          | GNU GPL v3 | Y
Grinder                  | Perl                 | Win, Lnx, MOS    | CLI, GUI, API | P          | GPL        | Y
Mason                    | C++                  | Win, Lnx, MOS    | CLI           | P          | GPL/LGPL   | Y
MetaSim                  | Java                 | Win, Lnx, MOS    | CLI, GUI      | P          | PRO / AU   | N
NeSSM                    | C/Cuda/Perl          | Lnx              | CLI           | NP,P       | AU         | Y
pbsim                    | C++                  | Lnx              | CLI           | P          | GNU GPL v2 | Y
pIRS                     | C++/Perl             | Lnx              | CLI           | NP,P       | GNU GPL v2 | Y
ReadSim                  | Python               | Win, Lnx, MOS    | CLI           | P          | *          | Y
simhtsd                  | Perl                 | Lnx              | CLI           | P          | GNU GPL v3 | Y
simNGS                   | C                    | Lnx, MOS         | CLI           | P          | GNU GPL v3 | Y
SimSeq                   | Java                 | Lnx              | CLI           | P          | MIT        | Y
SInC                     | C++                  | Lnx              | CLI           | NP,P       | CCANCL     | N
wgsim                    | C                    | Lnx              | CLI           | P          | MIT        | Y
xs                       | C++                  | Lnx              | CLI           | P          | GNU GPL v3 | Y

(*) Information related to this topic is not available. Win: Windows. Lnx: Linux. MOS: MacOS. CLI: Command line interface. GUI: Graphical User Interface. API: Application Programming Interface. NP: No parallel processing. P: Parallel processing (accepts multi-threading). GNU GPL: GNU General Public License. PRO: Proprietary software. AU: Academic use only. BSD: Berkeley Software Distribution. CCANCL: Creative Commons Attribution Non-Commercial License.
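The coverage-based estimate of the number of reads mentioned in this section follows from the usual relation coverage = (number of reads × read length) / reference length. A minimal sketch of that calculation is given below; the function name `reads_for_coverage` is hypothetical, and fixed-length reads are assumed.

```python
import math


def reads_for_coverage(genome_length, read_length, coverage, paired=False):
    """Number of reads (or read pairs) needed to reach a mean coverage.

    Assumes fixed-length reads. For paired-end data each fragment
    contributes two reads, so the number of pairs is returned.
    """
    bases_per_fragment = read_length * (2 if paired else 1)
    return math.ceil(coverage * genome_length / bases_per_fragment)


if __name__ == "__main__":
    # 30x coverage of a 1 Mb reference with 100 bp single-end reads.
    print(reads_for_coverage(1_000_000, 100, 30))
```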

2.5 Base-calling errors

NGS technologies rely on a complex interplay between chemistry, hardware and optical sensors. Adding to this complexity is the software that analyzes the sensor data and predicts the individual bases. This last step is usually referred to as base calling (Ledergerber and Dessimoz, 2011). Base calling converts the signals into actual sequence data with quality scores (known as Phred Q scores; Ewing and Green, 1998; Ewing et al., 1998). The different sequencing platforms usually assume an explicit error model in order to assign a measure of uncertainty to each base call (Kao et al., 2009). Error-rate models determine the probability of erroneous substitutions, insertions or deletions at a given position within a read (Illumina, 2011; Johnson et al., 2014). For the generation of realistic reads, it is necessary to understand and incorporate as much as possible the different sources of sequencing error. Each sequencing platform has a specific error rate (Table 1.1), which can also vary within the same technology and among reads (McElroy et al., 2012). The importance of taking this into account and simulating sequencing data based on specific error models should not be underestimated. Simulators may generate sequence errors in different ways: based on the quality scores (ArtificialFastqGenerator); by introducing particular errors at specific positions (SimSeq); by using specific error parameters for each platform/technology, which can be user-defined (ART, Mason (Holtgrewe, 2010), pIRS) or fixed by the program (DWGSIM (Homer, 2010), FASTQsim); using variable error rates within reads (simhtsd, Bodi 2009; wgsim, Li 2011a); using error distributions (Grinder); or generating specific errors along with some noise (simNGS, Embl-Ebi 2010). In the following subsections I describe in more detail the different errors that are modeled and their occurrence in sequencing platforms, as well as how the different simulators implement them.
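For readers less familiar with Phred scores, the sketch below shows the standard relation between a quality score Q and the estimated probability p that the base call is wrong, Q = -10 log10(p), together with the decoding of a FASTQ quality string. The function names are hypothetical and a Phred+33 encoding is assumed.

```python
import math


def phred_to_error_prob(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score corresponding to an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


def decode_fastq_quality(qual, offset=33):
    """Convert a FASTQ quality string into a list of Phred scores."""
    return [ord(char) - offset for char in qual]


if __name__ == "__main__":
    for q in decode_fastq_quality("I5+"):
        print(q, phred_to_error_prob(q))
```

A simulator that works "based on the quality scores" can, for example, draw an error at each position with the probability implied by the score assigned to that position.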

2.5.1 Indel errors

It has been reported that Illumina platforms rarely produce indel errors (Hu et al., 2012), whereas for 454 and IonTorrent insertions and deletions (indels) are actually the main source of error, although they occur at very low rates (Dohm et al., 2008). However, in 454, determining the correct length of homopolymer runs is often quite difficult because differences in light signal among homopolymers of similar length can be undetectable (Ekblom et al., 2014; Kircher and Kelso, 2010; Liu et al., 2012b; Loman et al., 2012; Robasky et al., 2013; Yang et al., 2013). PacBio yields long single-molecule reads that are prone to false indels from non-fluorescing nucleotides (Kircher and Kelso, 2010; Robasky et al., 2013), which are stochastically modeled by the PacBio read simulator pbsim (Ono et al., 2013). With Nanopore it is also possible to have indel errors; insertions occur when the strand slips back and forth so that a given position is read more than once, and deletions occur when the rate of strand displacement in the pore sensor exceeds the rate of data acquisition. ReadSim, which is so far the only simulator available for Nanopore, assumes fixed error rates for indels and substitutions. Indel rates can be specified via the command line, or using a configuration profile in the cases of ART, CuReSim, Grinder, Mason, MetaSim (Richter et al., 2011), NeSSM, pbsim, ReadSim, SInC and XS. Some programs like BEAR, EAGLE and GemSim include utilities or use external tools like DRISEE for the estimation of indel rates from FASTQ or SAM files. On the other hand, 454 and IonTorrent homopolymer-specific errors (Margulies et al., 2003) may be extracted from a profile determining the position and corresponding error rate (as in ART), or introduced in the form of homopolymeric stretches using a specified empirical model (as in MetaSim, Flowsim or Grinder).
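A fixed-rate indel model of the kind described above can be sketched in a few lines. This is a generic illustration under simple assumptions (independent events per base, inserted bases drawn uniformly from A, C, G and T); the function name and rates are hypothetical and do not reproduce the implementation of any of the simulators cited.

```python
import random


def add_indel_errors(read, ins_rate, del_rate, rng):
    """Return a copy of `read` with random insertions and deletions.

    Each base is deleted with probability `del_rate`; after each retained
    base, a random base is inserted with probability `ins_rate`.
    """
    out = []
    for base in read:
        if rng.random() < del_rate:
            continue  # deletion: the base is skipped
        out.append(base)
        if rng.random() < ins_rate:
            out.append(rng.choice("ACGT"))  # insertion of a random base
    return "".join(out)


rng = random.Random(1)  # seeded only to make the sketch reproducible
clean = add_indel_errors("ACGTACGTAC", 0.0, 0.0, rng)
noisy = add_indel_errors("ACGT" * 250, 0.01, 0.01, rng)
```

Homopolymer-specific errors, as in 454 or IonTorrent, would require making the rates depend on the length of the current run of identical bases, which this sketch does not attempt.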

2.5.2 Substitution errors

Substitution errors are dominant in Illumina and SOLiD platforms. These may occur when incorrect bases are introduced during clonal amplification of templates (for example, by PCR; Hu et al. 2012; Nakamura et al. 2011; Robasky et al. 2013) or when the optical signals are translated into bases. In the latter process, a green laser is used to detect G and T nucleotides at the same time, afterwards using a filter to distinguish between G and T. A and C nucleotides are detected in a similar way but using a red laser. Thus, base-calling errors may arise because of insufficient discrimination of the respective base emission spectra (Dohm et al., 2008). It is also known that SOLiD sequencers are unable to read through palindromic regions, presumably owing to the formation of hairpin structures, and therefore such regions are interpreted as miscellaneous random sequences. ART simulates this kind of error. As with indels, substitution error rates have to be defined in the command line or within a profile. Some NGS platforms can produce position-specific substitution errors, with reads having significantly lower quality in the later cycles. In Illumina, this type of error possibly arises from either single-strand DNA folding or sequence-specific alterations in enzyme preference (Kircher and Kelso, 2010; Metzker, 2010; Nakamura et al., 2011; Robasky et al., 2013), and can be modelled by GemSim and pIRS. Similar errors can be observed for 454 platforms (Margulies et al., 2003). Flowsim, 454sim and MetaSim can simulate two kinds of sequencing flows with a degradation model. The positive flow, interpreted as the occurrence of one or more bases, is modelled as a normal distribution; the negative flow, with no base or noise, is modelled as a log-normal distribution. The degradation model is introduced as a standard deviation that gradually increases the probability of error along the sequence.
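The idea of position-specific substitution errors, with more errors in later cycles, can be illustrated with a minimal sketch. The linear increase of the error rate along the read is an assumption made here for simplicity; real simulators such as GemSim or pIRS rely on empirical profiles, and the function and parameter names below are hypothetical.

```python
import random


def add_substitution_errors(read, start_rate, end_rate, rng):
    """Substitute bases with a probability that rises linearly along the read."""
    n = len(read)
    out = []
    for i, base in enumerate(read):
        frac = i / (n - 1) if n > 1 else 0.0
        rate = start_rate + frac * (end_rate - start_rate)
        if rng.random() < rate:
            # Replace the base by one of the three alternative bases.
            base = rng.choice([b for b in "ACGT" if b != base])
        out.append(base)
    return "".join(out)


rng = random.Random(3)  # seeded only to make the sketch reproducible
# Hypothetical rates: 0.1% error at the first cycle, 5% at the last one.
simulated = add_substitution_errors("ACGT" * 25, 0.001, 0.05, rng)
```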

2.6 Quality scores

The quality score is a prediction of the probability of an error in a base call (Dohm et al., 2008; Ewing and Green, 1998; Ewing et al., 1998; Illumina, 2011). The distribution of base quality scores is position dependent, and the mean quality score decreases as a function of increasing base position for most of the available technologies (Huang et al., 2012). Some NGS read simulators separate the quality score from sequencing error, even though they are correlated measurements. Several strategies can be used to simulate the quality scores, in most cases using empirical information. 454sim, EAGLE, Flowsim and simNGS use fixed quality score profiles that are based on previous studies. ART, ArtificialFastqGenerator, BEAR, FASTQsim, GemSim, NeSSM, SimSeq and SInC also include utilities that allow the user to derive quality profiles from FASTQ files. On the other hand, pIRS determines both the base and quality score in relation to the cycle number and to the base position on the simulated read, using empirical parameters. Alternatively, the distribution of the quality scores can be controlled by the user. Some programs use a simple parameter that determines a fixed quality score for every read (ArtificialFastqGenerator, CuReSim, DWGSIM, ReadSim, simhtsd, wgsim and XS). Grinder assigns two quality scores, depending on whether the simulated base call is correct or not. More complex, realistic simulators use a Gaussian distribution (XS) or a Position Specific Normal Distribution (Mason) with mean, standard deviation and quality standard deviation for the first and last base. For PacBio the distribution of errors is considered to be constant along the chromosomes (Quail et al., 2012) and programs like pbsim use a Uniform distribution to assign the quality scores. In Illumina, each PE read can have equal or different quality scores. Simulators that explicitly allow two different quality distributions for PE reads are ArtificialFastqGenerator, DWGSIM, EAGLE, SimSeq and SInC.

Figure 2.3 General overview of NGS simulation. The simulation process begins with the input of a reference sequence (most cases) and simulation parameters. Some of the parameters can be given via a profile that is estimated (by the simulator or other tools) from other reads or alignments. The outcome of this process may be reads (with or without quality information) or genome alignments in different formats.
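The link between a quality score and the error probability it predicts follows the standard Phred definition, Q = -10 log10(P). The short sketch below only illustrates this conversion; the function names are hypothetical.

```python
import math


def phred_to_error_prob(q):
    """Error probability predicted by a Phred quality score."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred quality score corresponding to an error probability (p > 0)."""
    return -10 * math.log10(p)


# Q20 corresponds to 1 error in 100 base calls, Q30 to 1 in 1000.
p_q20 = phred_to_error_prob(20)
q_for_one_in_thousand = error_prob_to_phred(0.001)
```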

2.7 Sequencing depth

Sequencing depth or coverage is not uniform along genomes. This can be due to chance (Lander and Waterman, 1988) but also to the GC bias introduced during DNA amplification by PCR (Li et al., 2014; Sims et al., 2014), as sequencing depth increases in regions with elevated GC content (Aird et al., 2011; Dohm et al., 2008). This coverage bias is taken into account by ArtificialFastqGenerator, BEAR, EAGLE, NeSSM and pIRS. ArtificialFastqGenerator calculates the GC content of different genomic regions from the reference sequence and then samples coverage levels for these regions from a Normal distribution. BEAR, EAGLE, NeSSM and pIRS use data from previous studies to determine the variation of the GC content along the reference sequence, resulting in the simulation of variable regional coverage.
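The strategy of computing regional GC content and then sampling a coverage level per region from a Normal distribution can be sketched as follows. The linear dependence of the regional mean on GC content (controlled by the hypothetical parameter `gc_slope`) is an assumption introduced here for illustration only; it is not taken from ArtificialFastqGenerator or any other simulator.

```python
import random


def gc_content(seq):
    """Fraction of G and C bases in a sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def regional_coverage(reference, window, mean_cov, sd, gc_slope, rng):
    """Sample one coverage value per non-overlapping window.

    Assumed model: the window mean is mean_cov * (1 + gc_slope * (gc - 0.5)),
    so GC-rich windows receive more coverage when gc_slope > 0.
    """
    coverages = []
    for start in range(0, len(reference), window):
        gc = gc_content(reference[start:start + window])
        mu = mean_cov * (1 + gc_slope * (gc - 0.5))
        coverages.append(max(0.0, rng.gauss(mu, sd)))  # coverage cannot be negative
    return coverages


rng = random.Random(0)  # seeded only to make the sketch reproducible
# Hypothetical reference: an AT-only window followed by a GC-only window.
covs = regional_coverage("A" * 100 + "G" * 100, 100, 30, 2.0, 1.0, rng)
```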

2.8 Simulating genomic variants

Apart from sequencing error (Figure 2.4), many tools can also introduce different types of genomic variants in the simulated reads (Pattnaik et al., 2014) like single nucleotide polymorphisms (SNPs), indels, inversions, translocations, copy number variants (CNVs) and short tandem repeats (STRs) (Table 2.3). The general strategy is to create a mutated sequence by introducing genomic variants in the reference sequence before the generation of reads (Figure 2.4). In most cases, these variants are simulated using a given mutation rate, so the mutated sequence differs by a given percentage from the reference sequence. Programs like DWGSIM and EAGLE require instead a file with known mutations (in plain text, VCF or BED-like format). FASTQsim includes a separate tool that builds a mutation file from real data, using an NGS dataset (FASTQ files) and a reference genome, being best suited for re-sequencing. Some programs are capable of generating population-level diversity by creating several mutated sequences from a single reference sequence (Figure 2.4). Programs like GemSim and Mason can generate sets of related haplotypes differing by at least one SNP from the reference sequence. In GemSim users may also create their own tab-delimited haplotype file providing the specific position and mutation introduced. Tools like GemSim, BEAR, Grinder and NeSSM can introduce genomic variants in a given set of reference sequences belonging to different taxa to create a set of mutated genomes that will resemble a metagenomic community (Figure 2.4). As mentioned before, these programs use an abundance profile so the reads are generated from these sequences with a probability proportional to “taxa” abundance.
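The general strategy of mutating the reference before generating reads can be illustrated for the simplest case, SNPs introduced at a given per-site rate. This is a generic sketch under assumed names and values, not the procedure of any specific simulator; other variant types (indels, inversions, CNVs) would need additional logic.

```python
import random


def mutate_reference(reference, snp_rate, rng):
    """Return the mutated sequence and the list of introduced SNPs.

    Each A/C/G/T site is replaced, with probability `snp_rate`, by one of the
    three alternative bases. SNPs are reported as (position, ref, alt) with
    0-based positions.
    """
    seq = list(reference)
    variants = []
    for pos, ref in enumerate(seq):
        if ref in "ACGT" and rng.random() < snp_rate:
            alt = rng.choice([b for b in "ACGT" if b != ref])
            seq[pos] = alt
            variants.append((pos, ref, alt))
    return "".join(seq), variants


rng = random.Random(42)  # seeded only to make the sketch reproducible
# Hypothetical example: roughly 1% divergence from the reference.
mutated, snps = mutate_reference("ACGT" * 500, 0.01, rng)
```

Calling the function several times on the same reference would give a set of related mutated sequences, in the spirit of the population-level diversity mentioned above.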

2.9 Output

The generated NGS reads may be stored in different file formats. According to the specific NGS technology simulated, one can get SFF files (standard flowgram format) from 454 platforms (454sim and Flowsim), and FASTA or FASTQ files for IonTorrent, Illumina, PacBio, SOLiD and Nanopore. Other possible output files include alignment files, either in MAF (Multiple Alignment Format) or SAM/BAM formats. These can be output by default (as in Mason, pbsim and SimSeq), or as an option, complementary to the simulated reads (as in ART).
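As an illustration of the most common read output, the sketch below formats a single FASTQ record, assuming the widely used Phred+33 quality encoding. The function name and the example read are hypothetical.

```python
def fastq_record(name, sequence, qualities):
    """Format one read as a four-line FASTQ record (Phred+33 encoding)."""
    if len(sequence) != len(qualities):
        raise ValueError("one quality score is required per base")
    quality_string = "".join(chr(q + 33) for q in qualities)
    return f"@{name}\n{sequence}\n+\n{quality_string}\n"


# Hypothetical read with qualities dropping towards the 3' end.
record = fastq_record("read_1", "ACGT", [40, 40, 30, 2])
```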

Figure 2.4 Flows available to generate reads with and without genomic variation. Dots represent variants present in the reference sequence(s), and crosses represent the newly introduced variants (mutated sequences). a | Simulation of reads from a single reference sequence without adding new genomic variants. b | Generation of reads from a single mutated sequence generated from a single reference sequence. c | Reads are generated from a set of mutated sequences that were generated from a single reference sequence. d | Generation of reads from a set of mutated sequences obtained from a set of reference sequences. e | Reads are obtained directly from a set of reference sequences without introducing additional genomic variants.

Table 2.3 Genomic variants.

Input Genomic variants Simulators Single Multi MGC PLO SNPs Indels INVs TRA CNVs STRs BEAR x x dwgsim (dnaa) x x x x x x EAGLE x x x x x x x FASTQSim x x x x GemSim x x x x x Grinder x x x x x Mason x x x NeSSM x x pIRS x x x x x ReadSim x x x x x SimSeq x x SInC x x x x wgsim x x x x

Input: FASTA files. MGC: Metagenomic community. PLO: Ploidy. SNPs: Single Nucleotide Polymorphisms. Indels: Insertions and deletions. INVs: Inversions. TRA: Translocations. CNVs: Copy Number Variants. STRs: Short Tandem Repeats.

2.10 Conclusions

NGS is having a big impact in a broad range of areas that benefit from genetic information, from medical genomics, phylogenetic and population genomics, to the reconstruction of ancient genomes, epigenomics and environmental barcoding. These applications include approaches such as de novo sequencing, resequencing, target sequencing or genome reduction methods. In all cases, caution is necessary in choosing a proper sequencing design and/or a reliable analytical approach for the specific biological question of interest. The simulation of NGS data can be extremely useful for planning experiments, testing hypotheses, benchmarking tools and evaluating particular results. Given a reference genome or dataset, for instance, one can play with an array of sequencing technologies to choose the best-suited technology and parameters for the particular goal, possibly optimizing time and costs. Yet, this is still not the standard practice and researchers often base their choices on practical considerations like technology and money availability. As shown throughout this chapter, simulation of NGS data from known genomes or transcriptomes can be extremely useful when evaluating assembly, mapping, phasing or genotyping algorithms (e.g. Angly et al. 2012; Caboche et al. 2014; Li et al. 2014; Nielsen et al. 2011; Shcherbina 2014), exposing their advantages and drawbacks under different circumstances.

Altogether, current NGS simulators consider most, if not all, of the important features regarding the generation of NGS data. However, they are not problem-free. The different simulators are largely redundant, implementing the same or very similar procedures. In our opinion, many are poorly documented and can be difficult to use for non-experts, and some of them are no longer maintained. Most importantly, for the most part they have not been benchmarked or validated. Remarkably, among the 23 tools considered here, only 13 have been described in dedicated application notes, 3 have been mentioned as add-ons in the methods section of bigger articles, and 5 have never been referenced in a journal. Indeed, peer-reviewed publication of these tools in dedicated articles would be highly desirable. While this would not definitively guarantee quality, at least it would encourage authors to reach minimum standards in terms of validation, benchmarking, and documentation. Collaborative efforts like the Assemblathon (Earl et al., 2011) or iEvo (http://www.ievobio.org/) might also be a source of inspiration. Meanwhile, I hope that the decision tree presented in Figure 2.1 helps users make appropriate choices.

Box 2.3 Websites of the reviewed simulators and further sites of interest

Websites of the reviewed simulators:
• 454sim: http://sourceforge.net/projects/bioinfo-454sim
• ART: http://www.niehs.nih.gov/research/resources/software/biostatistics/art
• ArtificialFastqGenerator: http://sourceforge.net/projects/artfastqgen
• BEAR: https://github.com/sej917/BEAR
• CuReSim: http://www.pegase-biosciences.com/curesim-acustomized-read-simulator
• DWGSIM: https://github.com/nh13/DWGSIM
• EAGLE: https://github.com/sequencing/EAGLE
• FastqSim: http://sourceforge.net/projects/fastqsim
• Flowsim: http://biohaskell.org/Applications/FlowSim
• GemSim: http://sourceforge.net/projects/gemsim
• Grinder: http://sourceforge.net/projects/biogrinder
• Mason: http://www.seqan.de/projects/mason
• MetaSim: http://ab.inf.uni-tuebingen.de/software/metasim
• NeSSM: http://cbb.sjtu.edu.cn/~ccwei/pub/software/NeSSM.php
• PacBio reads simulator: https://code.google.com/archive/p/pbsim
• pIRS: https://github.com/galaxy001/pirs
• ReadSim: http://sourceforge.net/projects/readsim
• simhtsd: http://sourceforge.net/projects/simhtsd
• simNGS: http://www.ebi.ac.uk/goldman-srv/simNGS
• SimSeq: https://github.com/jstjohn/SimSeq
• SInC: http://sourceforge.net/projects/sincsimulator
• Wgsim: http://github.com/lh3/wgsim
• XS: http://bioinformatics.ua.pt/software/xs

Other websites of interest:
• iEvo: http://www.ievobio.org
• NGS simulators: http://darwin.uvigo.es/ngs-simulators

Chapter 3

NGSphy: phylogenomic simulation of next-generation sequencing data

Part of the work described in this chapter was first presented in the following meeting:
• Escalona M, Rocha S, Posada D. 2016. NGSphy: generation of next-generation sequencing data from phylogenies. XII Encontro Nacional de Biología Evolutiva (ENBE). Universidade de Aveiro. December 16, 2016. Aveiro, Portugal.

It is also now accepted for publication in Bioinformatics, pending minor changes, and is available at the moment as:
• Escalona M, Rocha S, Posada D. 2017. NGSphy: phylogenomic simulation of next-generation sequencing data. bioRxiv. doi: https://doi.org/10.1101/197715

The corresponding source code, full manual and tutorials are available on GitHub:
• NGSphy: https://github.com/merlyescalona/ngsphy
• NGSphy Manual: https://github.com/merlyescalona/ngsphy/wiki/manual
• INDELible-NGSphy: https://github.com/merlyescalona/indelible-ngsphy

Contributions: ME designed, implemented and tested NGSphy. ME also created INDELible-NGSphy, an adaptation of INDELible (Fletcher and Yang, 2009), a simulator of sequence evolution. Such adaptation allows the use of an ancestral sequence to evolve under a single partition.


3.1 Introduction

Next-generation sequencing (NGS) technologies now facilitate the collection of large “phylogenomic” datasets with hundreds or thousands of loci with several individuals from multiple species (McCormack et al., 2013). At the same time, genome-wide data have brought a renewed interest in the discordance between gene trees and species trees and in methods of simulation and phylogenetic reconstruction that deal with large numbers of loci and potential sources of phylogenetic incongruence (Mallo and Posada, 2016). Importantly, the assembly of multiple sequence alignments from NGS reads is not free from errors and biases. There are many variables that might interfere with the accuracy of the final gene trees and species trees inferred from NGS data, from experimental-design aspects such as number of samples, sequencing technology or depth of coverage (herein coverage), to methodological decisions during the processes of assembly or mapping, orthology inference, variant and genotype calling, or phasing. In this context, the simulation of NGS reads can be quite useful to properly understand and improve the NGS phylogenomic pipeline. However, at the phylogenomic scale we need to be able to simulate NGS data from multiple gene families or loci with potentially discordant phylogenies, and represented by multiple individuals. To the best of our knowledge, the only software which generates NGS data from phylogenies is TreeToReads (McTavish et al., 2017). Although this is a very useful tool for the simulation of NGS data along a single gene tree, it cannot directly simulate data along multiple gene trees, handle diploid individuals, produce read counts, or consider coverage heterogeneity across individuals and loci, and is therefore not easily applicable to the species tree / phylogenomic scenario.
In order to fill this gap, here I introduce NGSphy, an easy-to-use pipeline to generate NGS data (read counts or Illumina reads) from multiple loci belonging to multiple haploid/diploid individuals and species under the gene tree / species tree paradigm, and with different options to control coverage variation across species, loci and individuals. The kind of simulations allowed by NGSphy should be extremely useful to tune empirical NGS experiments or to benchmark phylogenomic pipelines.

3.2 Description and implementation

NGSphy is an open-source tool written in Python that takes advantage of the NumPy (der Walt et al., 2011) and DendroPy (Sukumaran and Holder, 2010) libraries. Its workflow is depicted in Figure 3.1. Parameter values and options for the simulations are specified in a settings file. Arguments and conditions for the different replicates can be sampled from user-defined statistical distributions, or set as fixed. In the simplest scenario, the user just needs to specify a single gene tree, a substitution model, and the sequencing design. Optionally, the user can provide an ancestral or tip nucleotide sequence that will be used at the root of the tree to start the simulation. Otherwise the root sequence is simulated according to the stationary frequencies of the specified substitution model. Nucleotide sequence alignments are then evolved using INDELible (Fletcher and Yang, 2009) or a version of it that forces the sequence at the root (INDELible-NGSphy; http://github.com/merlyescalona/indelible-ngsphy). For more complex scenarios, NGSphy is able to read directly the output of SimPhy (Mallo et al., 2015a), a program that simulates multiple gene trees evolving within a species tree under incomplete lineage sorting, gene duplication and loss, horizontal gene transfer or gene conversion, plus the corresponding multilocus alignments obtained with INDELible. Before producing the NGS reads, the genomic sequences at the tip of each gene tree are assigned to haploid (directly) or diploid (by random sampling within species) individuals. With the trees, multiple sequence alignments and individuals in place, NGSphy can produce read counts for single-nucleotide variants (SNVs) or generate Illumina reads using ART (Huang et al., 2012). Finally, NGSphy allows for multi-threaded, parallel runs. In addition, it can generate job templates for execution in computational clusters.

3.3 Simulation modes

NGSphy implements four different simulation scenarios (input modes), as follows:

1. Single gene tree: generates DNA sequences from a single gene tree, builds haploid or diploid individuals (by random mating within the same gene family; see Section 3.4 below) and produces reads or read counts.

2. Single gene tree with given ancestral sequence: simulates data from a single gene tree and a known ancestral sequence. DNA sequences are evolved from the ancestral sequence along the specified gene tree, haploid or diploid individuals are built, and reads or read counts are generated.

3. Single gene tree with given anchor sequence: simulates reads or read counts from a single gene tree and a known anchor (tip) sequence. The tree is re-rooted at the anchor sequence before the simulation of DNA sequences. This has to be done with the modified version of INDELible (INDELible-NGSphy; see Section 3.5 below for details).

4. Gene-tree/species-tree distributions: this mode uses the output from SimPhy to generate reads or read counts for its individuals. SimPhy generates distributions of gene trees and species trees under some desired conditions. Each species tree is here considered a replicate.

3.4 Assignment of individuals

For haploid individuals, each tip in the gene tree provided corresponds to a single individual. For diploid individuals, the number of gene-tree tips per species must be even. In this case, individuals are generated by randomly sampling, without replacement, two gene copies from a specific gene family until all gene-tree tips have been assigned to an individual. For the gene-tree distribution input mode only, given replicates in which all gene trees have the same number of tips, NGSphy can filter out the species-tree replicates whose gene trees do not match the requirements for downstream analyses (e.g. the number of gene-tree tips per species, which must be even in order to simulate diploids). Also, for the same input mode, the outgroup in the gene trees has one gene copy. Therefore, for the generation of diploid individuals, the outgroup will be homozygous, obtained by duplicating the sequence of its gene copy.
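The assignment of gene copies to diploid individuals described above can be sketched as follows. This is a minimal illustration rather than NGSphy's actual code: the function name, the dictionary layout and the tip labels are hypothetical, and pairing after a random shuffle is used here as an equivalent way of sampling two copies at a time without replacement.

```python
import random


def pair_gene_copies(tips_by_species, rng, outgroup=None):
    """Build diploid individuals for one gene family.

    tips_by_species maps a species name to the labels of its gene-tree tips.
    Returns a list of (species, copy_1, copy_2) tuples.
    """
    individuals = []
    for species, tips in tips_by_species.items():
        if species == outgroup and len(tips) == 1:
            # Single-copy outgroup: homozygous individual by duplication.
            individuals.append((species, tips[0], tips[0]))
            continue
        if len(tips) % 2 != 0:
            raise ValueError(f"species {species} has an odd number of tips")
        shuffled = list(tips)
        rng.shuffle(shuffled)  # random order, then consecutive pairs
        for k in range(0, len(shuffled), 2):
            individuals.append((species, shuffled[k], shuffled[k + 1]))
    return individuals


rng = random.Random(7)  # seeded only to make the sketch reproducible
# Hypothetical tip labels for two ingroup species and a single-copy outgroup.
tips = {
    "sp1": ["1_0_0", "1_0_1", "1_0_2", "1_0_3"],
    "sp2": ["2_0_0", "2_0_1"],
    "out": ["0_0_0"],
}
diploids = pair_gene_copies(tips, rng, outgroup="out")
```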

3.5 Re-rooting process

The simulation of DNA sequences with INDELible starts from a given tree, where the ancestral sequence is generated at the root node (Figure 3.2, T1) and sequence alignments are evolved following a specific set of parameters that define the evolutionary model. In NGSphy, for the simulation mode “Single gene tree with given anchor sequence”, the user knows the sequence of one tip of the given gene tree and wants to use that tip as the root for the alignment simulation, in order to generate sequences for the rest of the tips of the tree. To handle this situation, the gene tree has to be re-rooted at the anchor tip, so that the alignment simulation can proceed using the anchor sequence as the root node. It was not possible to run this scenario using INDELible, so I created a modified version (INDELible-NGSphy) which allows the use of an ancestral sequence to evolve under a single partition. In the example shown in Figure 3.2, NGSphy would transform the tree T2 (center) into T3 (right), using tip 2_0_0 as anchor. The key observation here is that the branch length from node A to tip 2_0_0 has to become zero. Then, the re-rooted tree plus the anchor (known) sequence are given to indelible-ngsphy, together with its control file (the format of this file is detailed in the Online Manual; Appendix C), to simulate the corresponding sequence alignments under the model specified in the control file.

Figure 3.1 NGSphy workflow. At start, NGSphy first verifies the settings file and/or the existence of the corresponding third-party applications. If the input data correspond to a single gene tree and a user-defined anchor sequence (input mode 3), the tree is first re-rooted at the selected gene-tree tip. Next, for any single gene tree input mode (input modes 1-3), nucleotide sequences are evolved under the specified substitution model, resulting in sequence alignments. Then (for any input mode), haploid or diploid individuals are generated, as desired; for haploid individuals, the resulting sequences are separated into different FASTA files (per genomic fragment); for diploid individuals, the sequences are randomly paired from tips within the same gene family and species, and FASTA files are generated, each comprising both sequences of each fragment of each individual. Variation of font emphasis in tip names represents differentiation in species (A, B: species 1; C, D: species 2). Variation in depth of coverage at the species, locus and individual levels is then generated if desired and, finally, the sequencing data, either Illumina reads (with ART) or read counts (VCF files), are obtained.

Figure 3.2 Re-rooting process. For a single gene tree, T1 represents the general scenario, where a simulated (simulation mode 1) or given (simulation mode 2) ancestral sequence is assumed to be at the root node. When a tip (anchor) sequence is given and the user wants to simulate other sequences related to it (T2), the tree has to be re-rooted at this node, turning it into the ancestral one (T3, node A).
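One way to picture the re-rooting step is the sketch below, which re-roots a tree at an anchor tip and writes it in Newick format. It is an illustration of my reading of Figure 3.2 only (the new root keeps the anchor as a child with branch length zero, and the rest of the tree hangs from the root through the anchor's original branch); it is not the NGSphy implementation, which relies on DendroPy. The edge-list input, the function name and the node labels are assumptions, and the input is taken as an unrooted view of the tree in which the old root has already been suppressed.

```python
def reroot_at_tip(edges, anchor):
    """Return a Newick string for the tree re-rooted at the tip `anchor`.

    edges: list of (node_u, node_v, branch_length) describing the tree as an
    undirected graph. The anchor must be a tip (exactly one neighbour).
    """
    adjacency = {}
    for u, v, length in edges:
        adjacency.setdefault(u, []).append((v, length))
        adjacency.setdefault(v, []).append((u, length))
    if len(adjacency.get(anchor, [])) != 1:
        raise ValueError("the anchor must be a tip of the tree")
    neighbour, anchor_length = adjacency[anchor][0]

    def subtree(node, parent):
        # Newick text of the part of the tree below `node`, away from `parent`.
        children = [(c, l) for c, l in adjacency[node] if c != parent]
        if not children:
            return node
        inner = ",".join(f"{subtree(c, node)}:{l}" for c, l in children)
        return f"({inner})"

    # New root: the anchor attached with zero length, plus the rest of the tree.
    return f"({anchor}:0,{subtree(neighbour, anchor)}:{anchor_length});"


# Hypothetical four-tip tree; internal nodes are labelled n1 and n2.
example_edges = [
    ("n1", "2_0_0", 1.0),
    ("n1", "1_0_0", 2.0),
    ("n1", "n2", 0.5),
    ("n2", "3_0_0", 1.5),
    ("n2", "4_0_0", 1.0),
]
rerooted = reroot_at_tip(example_edges, "2_0_0")
```

All original branch lengths are preserved in the result, so the evolutionary distance from the anchor sequence to every other tip is unchanged.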

3.6 Coverage heterogeneity

NGSphy can simulate the variation of coverage that may occur in NGS due to differences in quantity or quality of DNA samples, technical problems when generating libraries, or genomic changes in GC content. Coverage variation is implemented hierarchically. For each replicate, an experiment-wide coverage is assigned as specified by the user, either fixed or sampled from a distribution of choice. Coverage variation among loci and individuals is introduced by sampling multipliers from user-defined Gamma distributions, while coverage variation across species/taxa can be directly specified (see Section 3.6.1 and Box 3.1 below for details). For targeted-sequencing experiments NGSphy can simulate off-target loci (not targeted but captured with reduced coverage), uncaptured loci, or a coverage decay related to the phylogenetic distance to a selected reference sequence (Bragg et al., 2016).

3.6.1 Distribution-based parameterization

Most simulation studies apply a grid-based combination of discrete parameter values. Following Darriba et al. (2012), Leigh and Bryant (2015) and Mallo et al. (2015a), NGSphy instead introduces coverage heterogeneity multipliers by sampling the parameter values from prior statistical distributions. The distributions are defined by the user, and currently include Uniform, Normal, Lognormal, Exponential, Poisson, Binomial, Negative Binomial and Gamma, with the possibility of fixing parameter values.

Figure 3.3 shows how the parameterization of the coverage heterogeneity for each experiment across loci and individuals is calculated. For each replicate (r), an experiment-wide value is sampled from a given distribution E. The value sampled from E is used as the starting coverage for the experiment (including the loci and individuals belonging to that replicate). Then, if locus and/or individual coverage variation is desired, a value is sampled for replicate r from a given distribution LW (locus-wide) or IW (individual-wide). The value sampled from LW (or IW) is the shape of the Gamma distribution (with mean = 1) from which the multipliers that determine the coverage variation among loci (or individuals) are drawn.

Further, targeted-sequencing parameters allow the user to emulate the variation in depth of coverage that can occur in a targeted-sequencing experiment. This is possible when using gene tree distributions (SimPhy project) as input data. These parameters identify the on-target/off-target loci as well as the number of loci that may not be captured. While on-target loci keep their expected coverage, the off-target loci receive a (user-defined) fraction of it. The not-captured parameter indicates the fraction of targeted loci that will not be captured, and whose expected coverage will be 0x.

Introducing taxon-specific effects allows the user to define coverage variation for specific taxa (Figure 3.4). This can be used, for example, to emulate a decay in coverage related to the phylogenetic distance of a species to the reference species used to build the target-loci probes, which is sometimes called phylogenetic decay (Bragg et al., 2016), or to accommodate the conditions of particular samples (such as low amounts of DNA, degraded museum specimens, etc.).

Figure 3.3 Distribution-based parameterization in NGSphy. Schema of the distribution-based parameterization used to introduce the coverage heterogeneity across replicates, loci and individuals. For replicate r, the experiment-wide coverage $E_r$ is sampled from the experiment-wide distribution E. The value $IW_r$, sampled from the individual-wide distribution IW, is the shape of the Gamma distribution (mean = 1) from which the individual-wide multipliers $mIW_{r,1}, \ldots, mIW_{r,nind}$ are drawn; likewise, $LW_r$, sampled from the locus-wide distribution LW, is the shape of the Gamma distribution (mean = 1) from which the locus-wide multipliers $mLW_{r,1}, \ldots, mLW_{r,nloc}$ are drawn. In the coverage table for replicate r, the entry for individual i and locus l is $E_r \times mIW_{r,i} \times mLW_{r,l}$.
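The coverage table of Figure 3.3 can be sketched with the standard library alone. In this minimal illustration the experiment-wide coverage and the two Gamma shapes are passed as already-sampled values (in NGSphy they come from the user-defined distributions E, IW and LW); the function names and the numeric values are assumptions. A Gamma distribution with shape k and scale 1/k has mean 1, which is what the multipliers require.

```python
import random


def gamma_multipliers(shape, n, rng):
    """Draw n multipliers from a Gamma distribution with the given shape and mean 1."""
    return [rng.gammavariate(shape, 1.0 / shape) for _ in range(n)]


def coverage_table(experiment_cov, ind_shape, loc_shape, n_ind, n_loc, rng):
    """Expected coverage per individual (rows) and locus (columns).

    Each entry is experiment_cov * individual multiplier * locus multiplier.
    """
    m_ind = gamma_multipliers(ind_shape, n_ind, rng)
    m_loc = gamma_multipliers(loc_shape, n_loc, rng)
    return [[experiment_cov * mi * ml for ml in m_loc] for mi in m_ind]


rng = random.Random(11)  # seeded only to make the sketch reproducible
# Hypothetical replicate: 100x experiment-wide, 3 individuals, 4 loci.
table = coverage_table(100.0, 2.0, 5.0, 3, 4, rng)
```

Smaller shape values give more dispersed multipliers, and therefore stronger coverage heterogeneity, while large shapes concentrate the multipliers around 1.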

Figure 3.4 Taxon-specific effects. Taxon-specific coverage can be incorporated. Different clades (blue and orange) can be assigned a proportion of the experiment-wide defined coverage. For example, if the expected coverage for the experiment is 100X, with a taxon-specific variation of 0.5 for the blue clade and 0.25 for the orange clade, then individuals from the blue clade will have a coverage of 50X and individuals from the orange clade will have a coverage of 25X.
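The taxon-specific and targeted-sequencing effects amount to multiplying the expected coverage by additional factors. The sketch below reproduces the example of Figure 3.4; the function names, the status labels and the 0.1 off-target fraction are hypothetical and are not NGSphy settings.

```python
def locus_factor(status, off_target_fraction):
    """Coverage factor of a locus according to its capture status."""
    if status == "on-target":
        return 1.0
    if status == "off-target":
        return off_target_fraction
    if status == "not-captured":
        return 0.0
    raise ValueError(f"unknown locus status: {status}")


def expected_coverage(experiment_cov, taxon_fraction, status, off_target_fraction):
    """Expected coverage after taxon-specific and capture effects."""
    return experiment_cov * taxon_fraction * locus_factor(status, off_target_fraction)


# Example of Figure 3.4: 100X experiment-wide, blue clade 0.5, orange clade 0.25.
blue = expected_coverage(100, 0.5, "on-target", 0.1)
orange = expected_coverage(100, 0.25, "on-target", 0.1)
# Hypothetical off-target and uncaptured loci for the blue clade.
blue_off = expected_coverage(100, 0.5, "off-target", 0.1)
blue_lost = expected_coverage(100, 0.5, "not-captured", 0.1)
```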

Box 3.1 Example of the coverage sampling strategy introducing individual and locus heterogeneity

Working with a gene-tree/species-tree distribution (SimPhy’s output) where our data set is comprised of 2 species tree replicates (replicate 1, replicate 2), 2 gene trees (locus A, locus B) and 2 individuals (individual I, individual II), we want to generate coverage variation under the following parameters:

    experiment-wide: P:100
    locus-wide: LN:1.2,1
    individual-wide: E:1

First, to obtain the expected coverage per experiment (replicate) we sample 2 values from a Poisson, with mean = 100 (rep1c, rep2c).

The coverage within the experiment before adding locus/individual multipliers is thus:

Expected coverage

Replicate  Locus  Individual I  Individual II
1          A      102           102
1          B      102           102
2          A      112           112
2          B      112           112

Afterwards, we sample the locus-wide rate multipliers from the hyper-distribution, in this case a Log Normal with mean = 1.2 and standard deviation = 1 (locwrep1, locwrep2). This gives us the shape of the Gamma distribution with mean 1 from which we sample the rate multipliers, as many as loci (locAm, locBm).


Coverage variation after locus-wide multipliers:

Replicate  Individual  Rate multiplier    Resulting coverage  Resulting coverage
                       (per individual)   Locus A             Locus B
1          I           0.4849             82.64733            40.71036
1          II          1.6790             286.1721            140.9625
2          I           0.7437             177.0838            63.9867
2          II          1.3250             315.4984            114.0009
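The sampling scheme of Box 3.1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not NGSphy's actual code: the function names are hypothetical, the Poisson sampler is a simple textbook method, and the individual-wide level (E:1) is assumed to work like the locus-wide level, i.e. the sampled value is used as the shape of a Gamma distribution with mean 1.

```python
import math
import random


def sample_poisson(mean, rng):
    """Knuth's multiplication method; adequate for moderate means such as 100."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def gamma_multipliers(shape, n, rng):
    """Draw n multipliers from a Gamma with the given shape and mean 1."""
    return [rng.gammavariate(shape, 1.0 / shape) for _ in range(n)]


def sample_replicate_coverage(n_loci, n_individuals, rng):
    """Expected coverage per locus and individual for one replicate."""
    # Experiment-wide level (P:100): expected coverage of the replicate.
    experiment = sample_poisson(100, rng)
    # Locus-wide level (LN:1.2,1): the sampled value is the Gamma shape.
    locus_shape = rng.lognormvariate(1.2, 1.0)
    locus_mult = gamma_multipliers(locus_shape, n_loci, rng)
    # Individual-wide level (E:1): ASSUMED to mirror the locus-wide step.
    individual_shape = rng.expovariate(1.0)
    individual_mult = gamma_multipliers(individual_shape, n_individuals, rng)
    coverage = [[experiment * lm * im for im in individual_mult]
                for lm in locus_mult]
    return experiment, locus_mult, individual_mult, coverage
```

Calling `sample_replicate_coverage(2, 2, random.Random(1))` returns one experiment-wide value, two locus multipliers, two individual multipliers and a 2 x 2 table of resulting coverages, analogous to one replicate in the tables above.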


3.7 Next-generation sequencing data

3.7.1 Simulation of Illumina reads

The simulation of Illumina reads relies on ART (Huang et al., 2012), a very flexible program that uses customized read error model parameters and quality profiles. NGSphy calls ART with the parameter values indicated in the settings file and produces Illumina reads for each individual in ALN, BAM and/or FASTQ format. NGSphy makes one call to ART per locus and individual, which allows coverage variation to be controlled at different levels.

3.7.2 Simulation of read counts

The read counts approach is based on the assumption (Ritz et al., 2011) that the sequencing process is uniform in generating short reads from the target genome, and that the number of reads mapped to a region is expected to be proportional to the number of times the region appears in a DNA sample (Ji and Chen, 2015). Read counts are produced under a user-defined error rate. The simulation of read counts relies on the selection of a reference sequence from those in the multiple alignment file for the corresponding locus. The sequence label is given in the settings file and afterwards parsed for processing. The process itself starts with the identification of the variable sites (with respect to the given reference sequence). Then, coverage for each position is sampled from a Negative Binomial distribution whose mean and overdispersion parameter are both given by the sampled coverage for the specific locus and individual. For diploid individuals, coverage is further split among chromosomes with equal probability. Genotype likelihoods for every site are computed as in GATK (Mckenna et al. 2010; see also Korneliussen et al. 2014). The simulated read counts are written to a set of VCF files, one per locus.
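As an illustration of the per-site coverage step, the sketch below draws depths from a Negative Binomial with mean and dispersion both equal to the sampled coverage, and then splits a depth between the two chromosomes of a diploid individual. It is a minimal sketch rather than NGSphy's implementation: the function names are hypothetical, and the Negative Binomial is drawn as a Gamma-Poisson mixture using only the Python standard library, which is my choice for the example.

```python
import math
import random


def sample_poisson(mean, rng):
    """Knuth's multiplication method; intended for moderate means only."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def site_depths(coverage, n_sites, rng):
    """Per-site depth from a Negative Binomial with mean = dispersion = coverage.

    Drawn as a Gamma-Poisson mixture: rate ~ Gamma(shape=coverage, scale=1),
    depth ~ Poisson(rate). The mean is coverage and the variance 2 * coverage.
    """
    depths = []
    for _ in range(n_sites):
        rate = rng.gammavariate(coverage, 1.0)
        depths.append(sample_poisson(rate, rng))
    return depths


def split_diploid(depth, rng):
    """Assign each read to one of two chromosomes with probability 0.5."""
    first = sum(1 for _ in range(depth) if rng.random() < 0.5)
    return first, depth - first
```

For example, `site_depths(50, 1000, random.Random(1))` yields 1000 depths scattered around 50x, and `split_diploid(depth, rng)` returns two counts that add up to the original depth.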

3.8 Input and output

NGSphy was developed as a non-interactive command line program. Input parameters should be given through a settings file (for more details see the Online Manual or Appendix C). Depending on the specified input mode, users can also provide their own gene trees (in Newick format), ancestral or anchor sequences (in FASTA), SimPhy’s output path (gene-tree/species-tree distributions), reference allele files (used to simulate read counts) or evolutionary models (INDELible’s control file). The output of NGSphy will depend on the input mode and the NGS mode selected. The output structure, as depicted in Figure 3.5, can include: multiple sequence alignments, Illumina reads in FASTQ, ALN or BAM files, read counts in VCF, coverage variation tables in comma-separated values (CSV) format, bash scripts and log files. The log files will be more or less detailed according to the log level requested by the user.

Figure 3.5 Structure of the output folders of NGSphy. Main output folder includes the subfolders alignments: all the simulated sequence alignments from single gene-tree input modes; coverage: expected coverage tables and multipliers files, one per replicate; ind_labels: tables that keep track of the relationship between sequences in alignments and individuals generated; ref_alleles: the sequences of the reference alleles used for the simulation of read counts; individuals: the haploid/diploid individual sequence files, one per individual, hierarchically organized within replicates and loci; reads: for Illumina reads stores the ALN/BAM and/or FASTQ files generated by ART, while for read counts stores all the VCF files.

3.9 Validation test: phylogenetic reconstruction from simulated alignments

To test whether NGSphy works as expected, I performed several sanity checks and test runs. Here I describe a particular experiment to check that the simulated alignments have in fact evolved under the user-defined gene tree. The simulation process started from the gene tree in Figure 3.6, using the tip 1_0_0 as anchor (i.e., providing a known sequence corresponding to that tip). I ran 100 replicates of NGSphy in input mode 3 (single gene tree with user-defined anchor sequence). The sequence alignments were simulated under a JC69 model (Jukes and Cantor, 1969), with equal base frequencies and a length of 1000 bp. The simulated alignments were used to reconstruct maximum likelihood (ML) trees with raxml-ng (Kozlov, 2017), using the (known) JC69 model. Ten heuristic searches were performed per alignment, starting from maximum parsimony trees. The Robinson-Foulds (RF; Robinson and Foulds 1981) and Branch Score distances (BSD; Kuhner and Felsenstein 1994) were used to compare the input gene tree and the estimated ML trees with respect to topology and branch lengths, respectively. All RF distances were zero, while the BSD were always small (mean = 0.0555, standard deviation = 0.0175), suggesting that the alignment simulation is correct. In the input mode used for this test, the anchor tip is used to re-root the tree, which is then used by indelible-ngsphy to generate the locus alignment. This process involves the generation of a zero-length branch between the anchor tip and what INDELible considers the root node. To show that this re-rooting process did not introduce any error and that the generated anchor sequence is identical to the one defined by the user as anchor, I measured the p-distance between them. In all cases this distance was zero.
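The two checks used in this validation, the p-distance between two sequences and the RF distance between two topologies, can be illustrated with a short, self-contained sketch. It is not the code used for the test above: trees are written as nested tuples of tip labels (a representation chosen only for this example), and the function names are hypothetical.

```python
def p_distance(seq_a, seq_b):
    """Proportion of sites that differ between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned (same length)")
    differences = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return differences / len(seq_a)


def tip_labels(tree):
    """All tip labels of a tree given as nested tuples of strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(tip_labels(child) for child in tree))


def bipartitions(tree):
    """Non-trivial splits of the tree, treated as unrooted."""
    all_tips = tip_labels(tree)
    reference = min(all_tips)  # fixed tip used to orient every split
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        tips = frozenset().union(*(visit(child) for child in node))
        if 1 < len(tips) < len(all_tips) - 1:
            # Store the side of the split that lacks the reference tip.
            splits.add(all_tips - tips if reference in tips else tips)
        return tips

    visit(tree)
    return splits


def robinson_foulds(tree_a, tree_b):
    """Number of splits present in one tree but not in the other."""
    if tip_labels(tree_a) != tip_labels(tree_b):
        raise ValueError("trees must share the same tips")
    return len(bipartitions(tree_a) ^ bipartitions(tree_b))
```

With this representation, two different rootings of the same five-tip topology give an RF distance of zero, whereas swapping two tips across an internal branch gives a distance of two.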

Figure 3.6 Gene-tree with five tips used for the validation. Numbers above the branches represent branch lengths in expected number of substitutions.

3.10 Use case: effect of the variation of depth of coverage on SNP recovery

One possible use of NGSphy is the optimization of depth of coverage for a given purpose. Here I designed a small experiment to visualize the potential trade-off between NGS coverage and SNP discovery. I used NGSphy to simulate a single sequence alignment from a given gene-tree (Figure 3.7) and from it I generated 100 NGS datasets at each of several depths of coverage. The sequence alignment was simulated under a JC69 substitution model, with equal base frequencies and a length of 10,000 bp. The Illumina runs generated 150 bp paired-end reads for all individuals at a coverage of 2X, 10X, 50X, 100X and 200X (100 replicates for each level). The detailed settings are shown in Table 3.1.

Figure 3.7 Gene-tree with five tips used for the use case simulation. It represents four species with two individuals per species. Numbers above the branches represent branch lengths, in expected number of substitutions.

Mapping was carried out using the MEM algorithm of BWA Version 0.7.7-r441 (Li and Durbin, 2009), against a randomly chosen reference (sequence 102 in all cases). Following a standardized best-practices pipeline (Van der Auwera et al., 2013), mapped reads from all replicates were independently processed, performing local realignment around indels and removing PCR duplicates. Variant calling was carried out with GATK (Mckenna et al., 2010), following the single-sample variant calling and joint genotyping framework with the HaplotypeCaller and GenotypeGVCFs modules. SNP calls from each replicate were compared to the true variant sites, showing that SNP recovery increased very rapidly up to 10X, at which point almost all true variants were called (Figure 3.8).
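The comparison between called and true variant sites reduces to set operations on site positions. The following sketch shows one way to summarize it; the function name and the returned fields are hypothetical, and it assumes that both inputs are collections of positions on the same reference.

```python
def snp_recovery(true_sites, called_sites):
    """Summarize how many true variant sites were recovered by the caller."""
    true_sites = set(true_sites)
    called_sites = set(called_sites)
    true_positives = len(true_sites & called_sites)
    return {
        "called": len(called_sites),
        "true_positives": true_positives,
        "false_positives": len(called_sites - true_sites),
        "missed": len(true_sites - called_sites),
        # Fraction of the true variant sites that were called.
        "recall": true_positives / len(true_sites) if true_sites else float("nan"),
    }
```

Applied to every replicate at every depth of coverage, the "called" and "true_positives" fields correspond to the two quantities plotted in Figure 3.8.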

Figure 3.8 Use case variant calling. Called variants and true positives (mean and Q1/Q3) at different depths of coverage. The true number of SNPs is 386. At 100x and 200x, on average only 1 site is not recovered.

3.11 Execution

The execution of NGSphy can be split into two blocks: organization and NGS data generation. The organization block runs in a single thread, with all its components run sequentially. The NGS data generation block, comprising the read counts and the ART calls, can be run sequentially or, for the ART calls, in parallel using multiple threads. In addition, for the ART calls, NGSphy can generate job templates for execution in computational clusters running Sun Grid Engine (Gentzsch, 2001) or Simple Linux Utility for Resource Management (Yoo et al. 2003, https://slurm.schedmd.com/). Job script files generated by NGSphy are general templates, and in most cases they will have to be modified according to the particular cluster environment.

3.12 Availability

NGSphy is distributed under the GNU GPL v3 license. It is written in Python and relies on NumPy, DendroPy, INDELible and INDELible-NGSphy. NGSphy’s source code can be found in its GitHub repository: https://github.com/merlyescalona/ngsphy. More information about installation and usage, as well as detailed descriptions of all the input needed and output generated, can be found in the Online manual (Appendix C). It has been tested under Linux and MacOS environments. Also, all raw data and scripts used in Section 3.9 and Section 3.10 are available at: https://www.github.com/merlyescalona/ngsphy/manuscript/supp.material/scripts.

Chapter 4

Optimization of parameters for the simulation of targeted sequencing data of shallow phylogenomics

A substantial part of the contents of this Chapter is to be integrated into the following “in prep” publication:

• Escalona M, Rocha S, Posada D. (in prep.) Sensitivity of phylogenomic inference to the design of target enrichment NGS experiments in non-model organisms.

Contributions: ME was involved in the conception of the simulations, in the discussion on how to implement them and about the meaning of the results of each step, and how to tune the parameter space of the simulations. ME performed the simulations and analysed all results.

4.1 Introduction

The overarching goal of this thesis is to understand the impact of different variables of the phylogenomic pipeline on the final inferences obtained. For this, I designed a simulation study (described in Chapter 5) in which I evaluated the accuracy of the species trees obtained from different NGS data sets. However, simulating complex scenarios under continuous distributions of many parameters is not an easy task. Most importantly, for these simulations to be meaningful for other researchers they should implement, as much as possible, typical experimental conditions and realistic ranges for the different evolutionary parameters. In addition, we expect simulations to be accurate (they should simulate what they are expected to simulate; i.e., they are bug-free) and, also key for a PhD, doable in a reasonable amount of time. Thus, in this chapter I describe the process of parameterization of the phylogenomic NGS simulations described in the next chapter. The general biological scenario chosen was the one assumed to underlie the cases where most researchers will resort to target enrichment (also called capture) strategies: “shallow” species phylogenies, where multiple individuals from closely related species are available. Given the ongoing discussion over the relative merits of different enrichment strategies (see Ali et al. 2016; Harvey et al. 2016; Hoffberg et al. 2016) over different time-scales, the time frame of the tree depths was set to a wide range of divergence times, from 200,000 years to 20 My, i.e., possibly encompassing from Holocene up to early Miocene divergences. This range should include most (although not all) instances where incomplete lineage sorting (ILS) can lead to pervasive effects on species tree reconstruction (e.g. Degnan and Rosenberg 2009; National Academy of Sciences 2017; Szöllősi et al. 2015) and where researchers will thus be most interested in collecting data from a large number of loci.
I used SimPhy to simulate multiple gene families evolving under multiple evolutionary effects and uncorrelated relaxed clocks, sampling replicates from a continuous parameter landscape described by user-defined statistical distributions. In SimPhy, different parameters are sampled through different simulation layers (i.e., for each species, locus or gene tree), with the possibility of specifying certain dependencies among parameters (i.e., some parameters will act as hyper or hyper-hyper parameters for others). In brief, SimPhy samples for each species tree replicate sets of genome-wide parameters (i.e., species tree parameters), and species-specific and gene-family-specific rate variation parameters, all of which interact across and within the simulated gene families. Because of the complexity of setting up the whole parameter space and the potentially large variance introduced by the different evolutionary processes implemented, plus the possible interactions between them, I felt the need to carefully inspect and validate all parameters of our simulations to make sure we were producing data within a realistic parameter landscape. I also analyzed empirical data to ensure realistic values were used in the parameterization. In the sections below I describe how we set, optimized (when necessary) and validated this parameter space.

4.1.1 SimPhy’s configuration

This section explains the parameterization used to obtain the gene tree distributions from SimPhy, which give rise to the downstream data processed in Chapter 5. SimPhy parameters (Tables 4.1, 4.2, 4.3) can be divided into three classes: replicate, species tree and substitution rate heterogeneity parameters.

Table 4.1 Replicate parameters

Parameter  Value               Description
RS         121*                Number of species trees (= replicates)
RL         Uniform(100, 2000)  Number of locus trees per species tree (in this case identical to the number of gene trees, i.e., one gene tree per locus tree was simulated)
RG         1                   Number of gene trees per locus tree

* A total of 200 replicates were simulated, but only 121 were used in downstream simulations. Because the number of individuals (gene copies) per species is sampled from a distribution rather than fixed, odd values are necessarily sampled, and these must later be removed if one wants to emulate diploid organisms by pairing gene copies.

Table 4.2 Species tree parameters

Parameter  Value                      Description
ST         Uniform(200000, 20000000)  Species tree height (in years)
SL         Uniform(2, 8)              Number of species
SI         Uniform(2, 8)              Number of individuals per species
SB         LogNormal(-13.58, 1.85)    Speciation rate
SG         1                          Tree-wide generation time (in years)
SO         1                          Ratio between ingroup height and the branch from the root to the ingroup (outgroup branch length)
SP         10,000                     Tree-wide effective population size
SU         Uniform(10^-10, 10^-8)     Tree-wide substitution rate

Table 4.3 Substitution rate heterogeneity parameters

Parameter  Value              Description
GP         LogNormal(1.4, 1)  Gene-by-lineage-specific substitution rate heterogeneity parameter (to use with HG)
HH         –                  Gene-by-lineage-specific substitution locus tree parameter (to use with HG)
HG         = GP               Gene-by-lineage-specific substitution gene tree parameter

4.1.2 Replicate parameters

These parameters control the number of replicates (species trees), locus trees and gene trees desired for the simulation. Here, the number of desired gene trees was set through the RL parameter (locus trees). The appropriate way of simulating scenarios like ours, without duplications or losses, in SimPhy is to simulate, within a species tree, multiple locus trees and then a single gene tree along each of them. In this case the locus tree will be topologically equivalent to the species tree, but branch lengths will have different units and may be scaled up or down to generate branch length heterogeneity, allowing for incomplete lineage sorting (ILS). The resulting gene tree topologies may thus be different from the locus (and species) tree topologies. For these simulations, the number of gene trees was sampled from a Uniform(100, 2000) distribution (Figure 4.1). These boundaries are within the number of loci used in most empirical studies based on target enrichment experiments (Bi et al., 2012; Bragg et al., 2016; Faircloth et al., 2013; Lemmon and Lemmon, 2012).

Figure 4.1 Distribution of number of locus trees per species tree replicate. Identical to the number of gene trees in this case.

4.1.3 Species tree parameters

4.1.3.1 Species tree height

This parameter controls the height of the ingroup clade in the species tree. By default, in SimPhy, the branch between the outgroup and the root of the species tree is twice the ingroup height (Figure 4.2). The simulated species trees had thus total tree heights between 400 ky and 40 My (Figure 4.3).

Figure 4.2 Schema of relative outgroup to ingroup distance.

4.1.3.2 Number of species and number of individuals per species

The number of species and the number of individuals per species were both chosen to follow a Uniform(2, 8) (Figure 4.4). The simulations were designed to include an outgroup (parameter SO), hence the resulting distribution of the total number of species is a Uniform(3, 9). For the number of individuals per species, the parameter refers to the number of single gene copies, and not to diploid individuals as such.

Figure 4.3 Distribution of species tree heights.

Figure 4.4 Distribution of the number of species and number of individuals per species. A | Number of species. Shows the distribution of the total number of species including the outgroup. B | Number of individuals per species. Refers only to ingroup species.

4.1.3.3 Speciation rate

The speciation rate controls the generation of the tree topology. In SimPhy, species trees were generated under a Yule model (no extinction; most appropriate for shallow phylogenies) and thus only a birth rate needs to be sampled (one that gives rise to the intended number of tips). The expected species tree height in fact depends on the number of species and the speciation rate (λ) (Equation 4.1):

\[
E[ST_{height}] = \frac{\ln(n_{species})}{\lambda} \;\rightarrow\; \lambda = \frac{\ln(n_{species})}{E[ST_{height}]} \tag{4.1}
\]

Given that we want species trees rooted between 200 ky and 20 My, and a number of species between 2 and 8, we can take the average number of species (5) and obtain an approximate interval for the speciation rate by calculating the lambdas (λ) for 200k and 20M generations. I also had to look for an appropriate statistical distribution from which to sample values across this interval (soft boundaries). The Lognormal (LN) is in principle a good candidate because it allows us to easily sample across different orders of magnitude. We thus want to search for a mean and standard deviation of the LN that encompass these values (200k - 20M generations). We also have to make sure that the parameter space from which we are sampling does not get near the tails of the LN distribution (cumulative probabilities close to 0 and/or 1), by sampling within the 0.05 and 0.95 percentiles of the distribution. Thus, speciation rate values were sampled from a LN distribution described by a mean of -13.58 and a standard deviation of 1.85 (Figure 4.5).

To further ensure we were simulating tree topologies with branching events distributed across the whole tree depth space and not only deep or shallow divergences (i.e., that we were sampling an appropriately wide speciation rate parameter), we visually inspected the shape of the obtained species-trees (Figure 4.6).
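The reasoning above can be checked numerically. The sketch below applies Equation 4.1 to the two intended tree heights with five species and verifies that the resulting rates fall between the 0.05 and 0.95 percentiles of the LN(-13.58, 1.85). Variable and function names are mine, and one generation is taken as one year, as set by the SG parameter.

```python
import math
from statistics import NormalDist


def speciation_rate(n_species, tree_height):
    """Equation 4.1 solved for lambda under a Yule model."""
    return math.log(n_species) / tree_height


# Rates implied by the shallowest and deepest intended species trees.
rate_shallow = speciation_rate(5, 200_000)     # about 8.0e-06
rate_deep = speciation_rate(5, 20_000_000)     # about 8.0e-08

# On the log scale, LN(-13.58, 1.85) is a Normal(-13.58, 1.85).
log_rate = NormalDist(mu=-13.58, sigma=1.85)
lower = log_rate.inv_cdf(0.05)   # about -16.62
upper = log_rate.inv_cdf(0.95)   # about -10.54

inside = lower < math.log(rate_deep) < math.log(rate_shallow) < upper
print(f"log rates: {math.log(rate_deep):.2f} to {math.log(rate_shallow):.2f}")
print(f"5%-95% interval: {lower:.2f} to {upper:.2f}; inside: {inside}")
```

Both intended rates (about -16.34 and -11.73 on the log scale) lie inside the interval delimited by the two percentiles, which is the condition depicted in Figure 4.5.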

4.1.3.4 Tree-wide substitution rate

Most genome-wide capture empirical data are expected to be collected from (coding or non-coding) nuclear regions. We thus chose to allow for a tree-wide substitution rate within values commonly estimated for vertebrate and invertebrate nuclear regions (Kumar and Subramanian, 2002; Lynch, 2010; Scally, 2016), sampling this rate from a uniform distribution between 10^-10 and 10^-8 substitutions per year (Figure 4.7). With respect to the remaining species tree parameters, the tree-wide generation time (SG) was set to 1 year for simplicity and the tree-wide effective population size (SP) was fixed to 10,000 (a lower bound for vertebrates; Lynch and Conery 2003), which seemed to generate a “reasonable” number of extra lineages (Table 4.4 and Figure 4.8). In fact, we explored effective population size (Ne) values up to 100,000, comprising Ne estimates for both vertebrate and invertebrate populations (Lynch and Conery, 2003). We decided to set it as a fixed value in order to make ILS correlated only with species tree height and to ease data analyses, but please note that the parameterization could have been more complex. The final Ne value was thus chosen by looking at the generated distribution of extra lineages (a direct measure of ILS). For this, I simulated 100 species tree replicates under the previously set species tree height distribution (200 ky to 20 My), with 10 gene trees per species tree replicate, the number of species following a Uniform(4, 20), the number of individuals per species following a Uniform(2, 20), and Ne values of 2,000, 5,000, 10,000, 50,000 and 100,000. Figure 4.9 shows the distribution of Robinson-Foulds distances among gene trees within species tree replicates across Ne values.

Figure 4.5 LogNormal distribution of the prior for the speciation rate (log scale). Blue lines correspond to the (average) speciation rates intended given the species tree heights and numbers of individuals. These define the approximate limits from where to sample the speciation rates. Red dotted lines correspond to the 0.05 and 0.95 percentiles of the LN distribution, the interval used to constrain the speciation rates sampled.

Figure 4.6 Heterogeneity within and across replicates. These plots show the full set of gene-trees within two species tree replicates. A and B correspond to two different species tree replicates randomly sampled.

Figure 4.7 Heterogeneity within and across replicates. These plots show the full set of gene-trees within two species tree replicates. A and B correspond to two different species tree replicates randomly sampled.

Figure 4.8 Average number of extra lineages (ILS) per species tree replicate across Ne values. For these simulations, species trees had between 4 and 20 taxa, and 2 to 20 individuals per species, thus the maximum possible number of extra lineages was higher than in the final simulations in Chapter 5.

Figure 4.9 Robinson-Foulds pairwise distances among gene trees within species tree replicate for different Ne values.

Table 4.4 Summary statistics of the number of extra lineages across replicates.

Ne       Minimum  1Q  Median  Mean   3Q     Maximum  NGTs
2,000    0        0   0       1.27   0.00   43       826
5,000    0        0   0       3.30   1.00   134      645
10,000   0        0   0       5.79   5.00   122      516
50,000   0        0   1       6.78   6.00   91       484
100,000  0        3   10      21.05  23.25  255      121

1Q: 1st quartile. 3Q: 3rd quartile. NGTs: Number of gene trees with a number of extra lineages equal to 0.

4.1.4 Substitution rate heterogeneity parameters

In nature, different species, loci and lineages can accumulate nucleotide substitutions at different rates. SimPhy incorporates different sources of rate heterogeneity in the gene trees: at the species level (affecting all gene tree branches embedded in a given branch of the species tree; HS parameter), at the gene-family level (affecting all branches of gene trees pertaining to a given gene family; HL parameter) and at the lineage level (affecting specific gene tree branches at specific loci; HG parameter) (Figure 4.10). Moreover, the amount of substitution rate heterogeneity among gene tree lineages can further be controlled (using so-called hyper-distributions) across replicates (genome-wide; GP) and across loci (gene-by-lineage-specific locus tree parameter; HH).

Figure 4.10 Heterogeneity as hierarchically implemented in SimPhy. A | Species tree level (GP), the level of heterogeneity can be modulated among replicates, by sampling a value that acts as a hyper parameter for the lower levels. B | Species-specific heterogeneity (HS), where changes in rate affect a whole species or clade. C | Gene-family-specific heterogeneity (HL), controls heterogeneity across gene families. D | Gene-by-lineage-specific heterogeneity (HG), where changes in rate affect particular branches of particular locus and gene trees, independently. Adapted from Mallo et al. (2015a).

In this thesis we are interested in shallow phylogenies, where species share similar life-history traits, hence species-specific heterogeneity was not modeled. That is, we assumed all species share the same overall substitution rate. Although different loci can have different rates of evolution, target enrichment experiments often also “homogenize” this factor by not choosing very slow or very fast evolving loci. We thus chose not to introduce gene-family-specific rate variation, but we did introduce different levels of rate variation among lineages in the gene trees. For each replicate, we sampled the genome-wide parameter determining the gene-by-lineage-specific heterogeneity from a LogNormal(1.4, 1). For each gene tree within that replicate, this sampled value then became the alpha parameter for the Gamma distribution from which we got the multipliers that modified each gene tree branch. For more details about how to set rate heterogeneity see https://github.com/adamallo/SimPhy and Mallo et al. (2015a).
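The two-level sampling described in this section can be sketched as follows. This illustrates the hierarchy only and is not SimPhy's code: the function name is hypothetical, and the Gamma distribution is assumed to have mean 1 (shape alpha, scale 1/alpha) so that the multipliers rescale branches without changing the average rate.

```python
import random


def lineage_rate_multipliers(n_gene_trees, n_branches, rng):
    """Gene-by-lineage-specific multipliers for one species tree replicate."""
    # Genome-wide level: one alpha per replicate from a LogNormal(1.4, 1).
    alpha = rng.lognormvariate(1.4, 1.0)
    # Gene tree level: one multiplier per branch of each gene tree.
    # ASSUMPTION: Gamma(shape=alpha, scale=1/alpha), i.e. mean 1.
    multipliers = [
        [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_branches)]
        for _ in range(n_gene_trees)
    ]
    return alpha, multipliers
```

Large values of alpha give multipliers tightly clustered around 1 (nearly clock-like gene trees), whereas small values give strong rate variation among lineages.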

4.2 Indelible: simulating and characterizing DNA sequence alignments

Nucleotide sequences were evolved along the gene trees generated per replicate under a GTR+G substitution model (Tavaré, 1986), with identical equilibrium base frequencies and rates based on estimates from large sets of empirical alignments, sampled from a Dirichlet distribution (Arbiza et al., 2011; Darriba et al., 2012). The shape (α) of the gamma distribution was sampled from an exponential of mean 0.5, allowing for most values to be between 0 and 1, fitting empirical Gamma estimates (Yang, 1996). Nucleotide sequences were generated without indels (i.e., we work with the true ungapped alignments to reduce the number of nuisance variables) and their lengths sampled per replicate from a Uniform distribution between 150 and 2000. This distribution was chosen taking into consideration locus sizes commonly seen across target enrichment phylogenomic studies (Bi et al., 2012; Bragg et al., 2016; Faircloth et al., 2013; Lemmon and Lemmon, 2012).

Listing 4.1 Settings blocks and parameters used for the simulation of DNA sequences using INDELible with the wrapper provided within SimPhy.

    [TYPE] NUCLEOTIDE 1
    [SETTINGS]
      [fastaextension] fasta
      [output] FASTA
    [SIMPHY-UNLINKED-MODEL] sim_unlinked
      [submodel] GTR $(rd:20,2,4,6,8,16)
      [statefreq] $(d:1,1,1,1)  // Equilibrium frequencies
                                // sampled from a Dirichlet (1,1,1,1)
      [rates] 0 $(e:2) 0        // Site-specific rate heterogeneities
                                // 0 p-inv, alpha from an E(2) and
                                // using a continuous gamma distribution
    [SIMPHY-PARTITIONS] simUnlinked [1 sim_unlinked $(U:150,2000)]

Once simulated, the alignments were inspected to ensure a uniform range of nucleotide variation within the expected values (Figure 4.11).

Figure 4.11 Pairwise distance observed in the simulated alignments per species tree replicate. Kimura 2-parameter distance (Kimura, 1980) was computed for all sequences across alignments. Each boxplot represents a species tree replicate.

4.3 Simulating NGS reads

Illumina reads were simulated for all loci and individuals using NGSphy (Chapter 3). In target enrichment experiments, the variation of sequencing depth of coverage plays a major role. This variation may occur within the same and/or across different NGS experiments due to differences in quantity or quality of DNA samples, technical problems when generating libraries or genomic changes in GC content, and thus its inclusion in the simulations allows for more realistic scenarios. To model these effects, and properly understand this variation in real data, I explored an empirical dataset from a target enrichment experiment during a research visit to the University of Copenhagen.

4.3.1 Analysis of coverage in target enrichment experiments

During my visit to The Bioinformatics Centre, at the University of Copenhagen, I explored a dataset from a target enrichment experiment in hummingbirds (Aves: Trochilidae) (Fonseca et al. unpublished), to understand the kind and degree of coverage variation that one can find in this type of experiment. In this dataset some samples had low DNA quantity and/or quality; there were two available reference genomes at different phylogenetic distances from the captured samples, and a wide range of loci (with different characteristics) had been targeted. The analyses of these data provided me with a better understanding of capture experiments and helped me to model such effects in NGSphy (Chapter 3) and in the simulations presented in this thesis. The analyses I describe below in more detail helped me to understand that coverage heterogeneity not only exists across samples (= individuals), regions (= genes) and targets (in this dataset corresponding to exons), but that the distance between the samples and the organisms used to build the probes, as well as between the samples and the references used to map the reads, also has to be taken into serious consideration, as it can heavily affect the inferences. The coverage heterogeneity can be due to problems during the library preparation (i.e., low amount or low quality of the DNA sample) or probe design (wrongly designed baits, bad target region selection), or be inherent to a genomic region, affecting whole sets of samples or specific samples. In this dataset the depth of coverage varied across samples up to circa 10 times the expected coverage, whereas this variation increased across genes, and was even higher across targets, reaching up to 2 more orders of magnitude, although at low probability. Furthermore, I was also able to get a better idea of, and to model, the frequency of on-target, off-target and non-captured loci.
The coverage heterogeneity among these “types of loci” was clear; off-target loci had considerably lower coverage than the on-target regions (around 1% of the expected coverage), and circa 6% of the targets were not captured. Moreover, it was clear that the single most distant sample had lower average coverage than most of the remaining samples, although I could not observe any strong correlation between coverage and phylogenetic distance. In brief, these analyses allowed me to model the effects mentioned before, and add them to both NGSphy (Chapter 3) and my simulations. The targeted regions for this particular experiment had been chosen from one-to-one orthologs between chicken (Gallus gallus) and zebra finch (Taeniopygia guttata) as annotated in ENSEMBL Version 66, overlapping with the Anna’s hummingbird (Calypte anna) and Chimney swift (Chaetura pelagica) scaffolds (Zhang et al., 2014a,b). For most of the analyses presented here, from the total of 2750 genes captured overall, I used only coding regions (CDSs; exons, herein referred to as targets) that overlapped between the Anna’s hummingbird (ingroup) and Chimney swift (outgroup), i.e., 1479 genes. The one-to-one gene orthology between the Anna’s hummingbird and the Chimney swift does not necessarily mean that the sizes of genes and targets will be the same across species, but their distributions were in fact quite similar (Figure 4.12).

Figure 4.12 Size distribution of genes (A) and targets (B) of the Anna’s (C. anna) and Swift (C. pelagica) datasets.

The coverage module from bedtools (Version 2.22.0) (Quinlan and Hall, 2010) was used to obtain data on overall breadth (how much of the total region of interest has at least one read) vs. depth (how much of the region of interest is recovered at a certain depth), as well as the (average) depth of coverage per sample, per gene and per target. I also measured the number of unmapped reads for some samples and the coverage at regions that were not targeted but that mapped in both the C. pelagica and C. anna genomes (see below).
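To make the breadth and depth definitions concrete, the following is a minimal Python sketch and not the bedtools implementation. It assumes per-base depths are already available as a list of integers for one region; the function names and the toy values are hypothetical.

```python
from typing import Dict, Iterable, List


def breadth_at_depth(depths: List[int], thresholds: Iterable[int]) -> Dict[int, float]:
    """Fraction of positions covered by at least t reads, for each threshold t."""
    if not depths:
        raise ValueError("no positions given")
    total = len(depths)
    return {t: sum(d >= t for d in depths) / total for t in thresholds}


def mean_depth(depths: List[int]) -> float:
    """Average depth of coverage over all positions of a region."""
    if not depths:
        raise ValueError("no positions given")
    return sum(depths) / len(depths)


# Hypothetical per-base depths for a single 10 bp target.
toy_target = [0, 2, 5, 12, 15, 15, 14, 11, 3, 0]
summary = breadth_at_depth(toy_target, thresholds=[1, 10])
average = mean_depth(toy_target)
```

Here `summary[1]` is the breadth in the sense used above (positions with at least one read), while `summary[10]` is the fraction of the target recovered at 10x or more, the quantity plotted in breadth versus depth curves.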

4.3.1.1 On-target regions

Overall breadth vs. depth plots for all species (Figure 4.13) showed some degree of variation. In particular, two samples (out of 46; 4%) showed very little depth of coverage (something that is common when DNA quality or quantity is low). When mapped to C. anna (ingroup), coverage lay between 50x and 125x for most samples (Figure 4.14), with considerable variation. We could see a slight decrease in the estimated coverage when reads were mapped to the outgroup (C. pelagica). Looking at the distribution of average coverage across samples, genes and targets (Figure 4.15) is useful to understand how coverage is distributed and how frequent “outliers” are for these regions. One can see that the “targets” (exons) were the most affected by extreme values of coverage. This most probably occurred due to repetitive regions or similar domains across proteins. Usually these loci are discarded after mapping (due to the uncertainty about paralogy), but it may be useful to be aware of these distributions to parametrize simulations of capture experiments.

Figure 4.13 Breadth versus depth of coverage from the mapping against the C. anna ("Anna") reference. Each line represents a sample. 90% of the targets were covered at 10x for most of the samples.

Figure 4.14 Observed coverage per sample. Mean (solid red line), 1st. and 3rd. quartiles (dashed red lines). As calculated mapping reads to C. anna (“Anna”) versus C. pelagica (“Swift”).

Figure 4.15 Distribution of average coverage across samples, genes and targets, before (A) and after (B) removing outliers. Extreme outliers are visible especially in targets.

4.3.1.2 Off-target regions

Off-target regions are those regions of the genome that are not targeted but are nevertheless captured and sequenced. Depending on the quality of the probe set and the success of the capture experiment, they can make up a considerable proportion of the sequenced DNA. For “non-model” organisms, where detailed genomic data (other than those used to choose the targets) are often not available, a reasonable proxy for the off-target portion is the percentage of reads that do not map to the closest reference. In a reduced set of samples this value was quite constant, at around 20% (Figure 4.16). Other “non-model” organism datasets I had access to in our lab showed similar to slightly higher values, suggesting that a certain percentage of off-target regions across samples is a common feature. It is difficult to calculate the coverage for “unmapped” regions unless we assemble the reads and map them to their own assembled contigs. As I had curated scaffolds (large contiguous regions) of the C. anna genome where targets were quite accurately annotated, I was able to calculate coverage for some captured regions (a really low percentage, though) that were not targeted but mapped to known regions within scaffolds. Their coverage was really low, circa 1% or less of the average coverage of the targeted regions (Figure 4.17). Realizing that this is probably a common effect in capture datasets, I implemented in NGSphy the capability to simulate different proportions of off-target loci (see Chapter 3).

Figure 4.16 Percentage of mapped reads to each of the references.

Figure 4.17 Breadth versus depth of coverage for off-target regions from the mapping against the Anna reference. Each line represents a sample. While 90% of the on-target regions were covered at 10x in most of the samples, coverage drops dramatically in the off-target regions, of which only 25% reach a coverage of 10x.

4.3.1.3 Non-captured regions

Non-captured regions are those that were targeted but not captured (and therefore not sequenced). In the hummingbird dataset only a low percentage of genes (<1%) and targets (<6%) were not recovered (Figure 4.18).

Figure 4.18 Percentage of non-captured regions: genes (A) and targets (B).

4.3.1.4 Phylogenetic decay

Another known problem in capture experiments is the decrease of coverage with increasing distance to the organisms used to design the target probes (phylogenetic decay, sensu Bragg et al. 2016). I used the hummingbird data to explore this effect. An ML phylogenetic tree (Fonseca et al., unpublished) from the concatenated set of 2750 genes was available prior to my coverage analysis, so I used this tree (Figure 4.19) to estimate the phylogenetic coverage decay. First I calculated the distance from all the samples to each of the available references (Figure 4.20). As the distances from most samples to the references were similar, I could not observe any strong correlation between coverage and phylogenetic distance in this dataset, but it was clear that the single most distant sample had lower average coverage than the rest. In any case, I decided to implement the ability to simulate this effect in NGSphy.

Figure 4.19 Inferred ML concatenated tree. “Aan” corresponds to the Anna’s hummingbird (C. anna) and “Cpe” to the Chimney swift (C. pelagica), the outgroup (in red).

Figure 4.20 Per-sample coverage vs. phylogenetic distance. Median (plus or minus 1st and 3rd quartiles) sequencing coverage for 46 samples, plotted as a function of phylogenetic distance from the species used as mapping reference.

4.4 NGSphy parameterization

Thus, in the target-capture simulations, for each replicate the expected coverage (= sequencing depth) was sampled from a Uniform distribution between 1x and 100x (= “experiment level”). Individual- and locus-wide variation within each replicate was introduced by sampling multipliers in two stages (Figure 4.21). NGSphy introduces this variation by sampling the shape (α) of a Gamma distribution with mean 1 from a prior distribution. In Chapter 5 I used a LogNormal (LN) with mean 3.5 and standard deviation of 1, which allows a large variation of the expected coverage. Multipliers for each replicate were then sampled (for each locus and individual) from the generated Gamma distributions. As for the read types, we decided to work with four different Illumina settings: single-end and paired-end reads of both 150 and 250 bp.

Figure 4.21 Coverage heterogeneity parameterization in NGSphy. To illustrate the parameterization of the coverage we show here the possible range of the variation, corresponding to the lowest and highest variation within a replicate. A | Prior LN distribution. A LN distribution with mean 3.5 and standard deviation of 1 allows for shape parameters up to 700. B | Multiplier distribution (maximum variation). A value of 1.51 sampled from the LN gives rise to the Gamma distribution shown, which allows variation within multipliers of up to circa 5 times. These multipliers will be applied to the initial expected coverage values sampled from U(1,100). C | Multiplier distribution (minimum variation). When the sampled alpha is very large (727.73), the multipliers obtained from this Gamma distribution will have a low variation, with values around 1. I plotted the generated expected coverage matrices to ensure values were within the intended range (see Chapter 5).
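The two-stage sampling described above can be sketched in Python as follows. This is an illustration under stated assumptions, not NGSphy's code: it assumes that the LN mean and standard deviation are given on the log scale, that one shape value is drawn for the individual level and one for the locus level, and that the expected coverage of each individual/locus cell is the product of the experiment-level coverage and both multipliers. All function names are hypothetical.

```python
import random


def sample_gamma_multipliers(alpha, size, rng):
    """Draw `size` values from Gamma(shape=alpha, scale=1/alpha), which has mean 1."""
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(size)]


def expected_coverage_matrix(n_individuals, n_loci, rng, ln_mu=3.5, ln_sigma=1.0):
    """Return an n_individuals x n_loci matrix of expected coverage values."""
    # Experiment level: one expected coverage per replicate, U(1, 100).
    experiment = rng.uniform(1.0, 100.0)
    # Prior on the Gamma shape (assumption: log-scale LN parameters).
    alpha_individual = rng.lognormvariate(ln_mu, ln_sigma)
    alpha_locus = rng.lognormvariate(ln_mu, ln_sigma)
    # Mean-1 multipliers for each level.
    ind_mult = sample_gamma_multipliers(alpha_individual, n_individuals, rng)
    loc_mult = sample_gamma_multipliers(alpha_locus, n_loci, rng)
    # Assumption: levels combine multiplicatively.
    return [[experiment * i * l for l in loc_mult] for i in ind_mult]


rng = random.Random(1)
matrix = expected_coverage_matrix(n_individuals=4, n_loci=6, rng=rng)
check_mean = sum(sample_gamma_multipliers(2.0, 20000, rng)) / 20000
```

A small shape value gives widely spread multipliers, while a large one keeps them close to 1, which corresponds to the maximum and minimum variation cases illustrated in Figure 4.21.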

Chapter 5

Sensitivity of phylogenomic inference to the design of target enrichment NGS experiments in non-model organisms

This chapter is based on preliminary results of the following “in prep” publication:

• Escalona M, Rocha S, Posada D. (in prep.) Sensitivity of phylogenomic inference to the design of target enrichment NGS experiments in non-model organisms.

Contributions: ME was involved in the conception of the study and design of the analyses. ME performed the simulations and analysed all results.

5.1 Introduction

The NGS field is continuously innovating, e.g. adding more variation in read lengths, larger throughput and faster data acquisition. While such technological improvements have led to its widespread use (van Dijk et al., 2014), they have also brought new problems such as higher error rates, small reads needing to be assembled, systematic biases derived from using single sequencing platforms, or issues related to the large size of the datasets. All these factors pose significant challenges for data processing, storage and analyses (Catchen et al., 2013). The NGS phylogenomic pipeline is complex, consisting of several steps (assembly and/or mapping, homology/orthology inference, variant and/or genotype calling, gene tree estimation and/or species tree (plus other related parameters) estimation) and therefore requires multiple methodological decisions. Importantly, there is no standard approach for the phylogenomic analysis of NGS data, and the influence of the different strategies and options on the accuracy of the resulting inferences is unclear. Many of the available phylogenomic approaches are mainly specific to the characteristics of the NGS datasets in question (see Allard et al. 2012; Jex et al. 2010; Kumar et al. 2015; Tosso et al. 2017). Regarding targeted NGS data, a key initial aspect is the availability of reasonably closely related genomic sequences to which reads can be mapped (mapping). In the absence of such a reference, one must proceed with de novo assemblies, and when performing them, we have to deal with the non-trivial process of homology and orthology inference. Multiple sequence alignment (MSA) is needed after de novo assemblies and may be needed even when loci are assembled by mapping. The alignment may introduce an extra source of error. Variant calling can be done per sample or taking all the samples into account.
It is also possible to perform variant calling with respect to a reference sequence or based on the overall allele frequencies. Genotype calling is then performed alongside variant calling or subsequently (Nielsen et al., 2011). Many of the choices related to the methods and thresholds for variant and genotype calling can influence the resulting inferences (Han et al., 2014; Nevado et al., 2014; Nielsen et al., 2011). In fact, genotype likelihoods (Korneliussen et al., 2014; Mckenna et al., 2010; Nielsen et al., 2012) were introduced to better account for uncertainty in genotype inference (da Fonseca et al., 2016), but phylogenetic methods usually do not take this uncertainty directly into account. When working with diploid (or other ploidy) organisms, for downstream analyses one may also be interested in phasing (i.e., knowing the exact sequence of each locus on a given chromosome) or in the generation of consensus sequences (using IUPAC ambiguities, the major allele or a random allele). The most popular tools for phasing in phylogenetic studies infer the phase from MSAs with ambiguities, making use of known haplotypes or inferring them (Stephens et al., 2001; Stephens and Donnelly, 2003), an approach that is highly time-consuming and hardly applicable to large datasets. Each step of this process may by itself imply a specific computational pipeline and further methodological decisions, like the different approaches for variant discovery and genotyping presented in Bai and Cavalcoli (2013), Maruki and Lynch (2017) and Van der Auwera et al. (2013). Furthermore, there is a panoply of software tools from which to choose, and multiple parameters need to be optimized in each step, often differently for each dataset. All these different possible treatments will likely influence the final inferences (e.g. Bokulich et al. 2013).
The ultimate goal of this Chapter is to understand the effect of different methodological decisions during the production and analysis of NGS data on the quality of phylogenomic inference. Such a study might imply a massive computational time, as the analyses needed are many and lengthy, even using big computational clusters. Because of this, I was forced to design a simulation study that could be completed in a reasonable amount of time. Thus, I decided to simulate four different types of NGS datasets, and processed them through a pipeline including only a reduced subset of the methodological treatments through which such datasets can be processed. NGS datasets were generated for diploid individuals from loci whose phylogenies are known and potentially discordant, adequately simulated under the multispecies coalescent framework. The pipeline used is roughly a single combination of methodologies under mostly default settings. In the future I hope to expand these scenarios in order to provide a more comprehensive evaluation of the sensitivity of phylogenomic inference to the NGS pipeline.

5.2 Methods

5.2.1 Data simulation

I used a pipeline which includes SimPhy (Mallo et al., 2015a), INDELible (Fletcher and Yang, 2009), NGSphy (Escalona et al. 2017; Chapter 3) and ART (Huang et al., 2012) to simulate four types of NGS reads from a continuous landscape of phylogenomic parameters (Tables 5.1, 5.2 and Listing 5.1). The simulation presented here comprises 121 replicates. Each replicate refers to a species tree, an underlying distribution of gene trees and DNA sequences evolved along those gene trees. From those, sets of diploid individuals per species were obtained (by pairing gene copies of the same gene family) and Illumina reads were simulated for each of the individuals. The processes of the pipeline can be structured as follows:

1. Generation of species and gene trees

2. Generation of DNA sequences

3. Assignment of gene-tree tips to individuals

4. Generation of NGS data

5.2.1.1 Generation of species trees and gene trees

Simulation of species and gene trees was performed with SimPhy (Mallo et al., 2015a). Species trees were simulated following a Yule model (Yule, 1925). The birth-rate process was parameterized by the speciation rate, the number of species, the number of individuals (tips) per species and tree height. Species tree heights (ingroup) were sampled from a Uniform distribution between 200 Ky (thousand years) and 20 My (million years); the number of species was sampled from a Uniform distribution within the interval [2, 8], and the number of individuals per species (here, number of leaves of the gene trees) was sampled from a Uniform distribution within the interval [2, 8], with a fixed effective population size Ne = 10000. Between 100 and 2,000 gene trees were simulated within each species tree according to the multispecies coalescent, allowing for incomplete lineage sorting as the only source of gene/species tree incongruence. Among-lineage rate variation was introduced in the gene trees at different levels (Table 5.1).

5.2.1.2 Generation of DNA Sequences

DNA sequences were evolved along gene trees under a GTR (Tavaré, 1986) +G substitution model, whose parameters were sampled from a Dirichlet distribution parameterized according to different empirical estimates (Arbiza et al., 2011; Darriba et al., 2012). The shape (α) of the gamma distribution was sampled from an exponential of mean 0.5, allowing for most values to be between 0 and 1, fitting empirical gamma estimates. Sequences were generated without indels and their length was sampled per replicate from a Uniform distribution between 150 and 2000. This size distribution was chosen taking into consideration locus size distributions commonly seen across target enrichment phylogenomic studies (Bi et al., 2012; Bragg et al., 2016; Faircloth et al., 2013; Lemmon and Lemmon, 2012).
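As an illustration of this sampling scheme, the sketch below draws one set of substitution-model parameters in plain Python. It is not the SimPhy/INDELible wrapper: the Dirichlet concentration values are those shown in Listing 5.1, any rescaling the wrapper applies to the sampled rates is not modeled, and the function names are hypothetical.

```python
import random


def sample_dirichlet(concentrations, rng):
    """Sample from a Dirichlet by normalizing independent Gamma(a, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in concentrations]
    total = sum(draws)
    return [d / total for d in draws]


def sample_substitution_parameters(rng):
    """Draw one hypothetical parameter set for a GTR+G simulation."""
    return {
        # Relative exchangeabilities (concentrations taken from Listing 5.1).
        "exchangeabilities": sample_dirichlet([20, 2, 4, 6, 8, 16], rng),
        # Equilibrium base frequencies, flat Dirichlet(1,1,1,1).
        "base_frequencies": sample_dirichlet([1, 1, 1, 1], rng),
        # Gamma shape from an exponential with rate 2, i.e. mean 0.5.
        "gamma_shape": rng.expovariate(2.0),
        # Sequence length, Uniform between 150 and 2000 bp.
        "sequence_length": rng.randint(150, 2000),
    }


params = sample_substitution_parameters(random.Random(42))
```

Normalizing independent Gamma draws is a standard way to sample a Dirichlet with only the standard library; with an exponential of rate 2, roughly 86% of the sampled shape values fall below 1.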

Table 5.1 SimPhy simulation parameters.

| Parameter | Value | Description |
|---|---|---|
| RS | 121 | Number of species trees (= replicates) |
| RL | Uniform(100, 2000) | Number of locus trees per species tree (in this case identical to the number of gene trees, i.e. one gene tree per locus tree was simulated) |
| SB | LogNormal(−13.58, 1.85) | Speciation rate; depends on SU and SI (species tree height and number of inds. per taxa/tips) (see Chapter 4) |
| SG | Fixed (1) | Tree-wide generation time (in years) |
| SI | Uniform(2, 8) | Number of individuals (leaves) per species |
| SL | Uniform(2, 8) | Number of species |
| SO | Fixed (1) | Ratio between ingroup height and the branch from the root to the ingroup (outgroup branch length) |
| SP | Fixed (10 000) | Tree-wide effective population size |
| ST | Uniform(200000, 20000000) | Species tree height (in years) |
| SU | Uniform(10^-8, 10^-10) | Tree-wide substitution rate |
| GP | LogNormal(1.4, 1) | Gene-by-lineage-specific rate heterogeneity parameter (species tree parameter, to use with HG) |
| HH | LogNormal(1.2, 1) | Gene-by-lineage-specific locus tree parameter (to use with the HG) |
| HG | Fixed (GP) | Gene-by-lineage-specific rate heterogeneity modifier |

Listing 5.1 Parameters used for the simulation of DNA sequences using INDELible with the wrapper provided within SimPhy.

    [TYPE] NUCLEOTIDE 1
    [SETTINGS]
      [fastaextension] fasta
      [output] FASTA
    [SIMPHY-UNLINKED-MODEL] sim_unlinked
      [submodel] GTR $(rd:20,2,4,6,8,16)
      [statefreq] $(d:1,1,1,1)   // Equilibrium frequencies
                                 // sampled from a Dirichlet (1,1,1,1)
      [rates] 0 $(e:2) 0         // Site-specific rate heterogeneities:
                                 // 0 p-inv, alpha from an E(2) and
                                 // using a continuous gamma distribution
    [SIMPHY-PARTITIONS] simUnlinked [1 sim_unlinked $(U:150,2000)]

5.2.1.3 Assignment of gene-trees leaves to diploid individuals

The assignment of gene-tree leaves (i.e., gene copies) to diploid individuals was performed using NGSphy. The process consists of randomly pairing two gene copies within the same gene family and species (see Chapter 4). The outgroup was simulated in SimPhy by adding one gene copy to the tree after the simulation of the remaining phylogeny (Mallo et al., 2015a). NGSphy duplicates this sequence to emulate a diploid individual; thus the individual representing the outgroup will be homozygous at all positions.
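The pairing step can be pictured with a short Python sketch. It is an illustration only and does not reproduce NGSphy's implementation; it assumes that every ingroup species has an even number of gene copies per gene family, that a species with a single copy is the outgroup, and it uses hypothetical tip labels and function names.

```python
import random
from collections import defaultdict


def pair_gene_copies(tips, rng):
    """Pair gene copies of one gene family into diploid individuals.

    tips: list of (species, copy_id) tuples.
    Returns a list of (species, (copy_a, copy_b)) tuples.
    """
    by_species = defaultdict(list)
    for species, copy_id in tips:
        by_species[species].append(copy_id)

    individuals = []
    for species in sorted(by_species):
        copies = list(by_species[species])
        if len(copies) == 1:
            # Single copy (outgroup): duplicate it, giving a homozygous individual.
            individuals.append((species, (copies[0], copies[0])))
            continue
        if len(copies) % 2 != 0:
            raise ValueError(f"odd number of gene copies for species {species}")
        rng.shuffle(copies)  # random pairing within the species
        for i in range(0, len(copies), 2):
            individuals.append((species, (copies[i], copies[i + 1])))
    return individuals


toy_tips = [("A", 0), ("A", 1), ("A", 2), ("A", 3), ("B", 0), ("B", 1), ("outgroup", 0)]
toy_individuals = pair_gene_copies(toy_tips, random.Random(7))
```

In this toy case species A yields two individuals, species B one, and the outgroup one homozygous individual.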

5.2.1.4 Generation of NGS data

HiSeq2500 Illumina reads in FASTQ format were simulated for all loci and individuals using NGSphy (Chapter 3). I introduced variation in the depth of coverage across species trees, individuals and loci, as explained in detail in Chapter 4. The expected global coverage varied according to a Uniform distribution between 1x and 100x, with a series of multipliers later introducing correlated variation at individual and locus levels. Afterwards, Illumina single-end and paired-end reads of 150 and 250 bp were generated for each locus and individual and for all replicates, according to the replicate-specific generated coverage matrices. Finally, all reads from the same individual were joined, emulating the output of a targeted NGS experiment. For more details on the configuration of NGSphy go to https://github.com/merlyescalona/ngsphy/wiki/Manual.

Table 5.2 Example of the parameter values used in NGSphy.

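| Section | Parameter | Value |
|---|---|---|
| General | Ploidy | 2 |
| Data | inputmode | 4 |
| Data | simphy_folder_path | . |
| Data | simphy_data_prefix | data |
| Data | simphy_filter | TRUE |
| Coverage | experiment | Uniform(1, 100) |
| Coverage | individual | LogNormal(3.15, 1) |
| Coverage | locus | LogNormal(3.15, 1) |
| NGS-reads-art | fcov | TRUE |
| NGS-reads-art | ss | HS25 |
| NGS-reads-art | m | 215 |
| NGS-reads-art | s | 50 |
| NGS-reads-art | q | TRUE |
| NGS-reads-art | p | TRUE |
| Execution | environment | bash |
| Execution | runART | off |
| Execution | running_times | off |
| Execution | threads | 2 |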

5.2.2 NGS data analysis

5.2.2.1 Quality control and trimming

In order to validate the sequencing profile of the simulated FASTQ files, quality control was performed on a random set (14 FASTQ files, from a random locus from a random species tree replicate; Figure 5.1), using the R package Rqc (Souza and Carvalho, 2017). The observed variation in read quality across the selected samples was low (Figure 5.1) and within the expected base quality score values (Guo et al., 2014), hence trimming of the reads was not performed.

Figure 5.1 Per read mean quality distribution of 14 random files. Average quality of the reads above 36 (Q>36).

5.2.2.2 Construction of reference sequences

For read mapping I explored two distinct options usually seen in the literature: mapping to the outgroup or choosing an ingroup sequence as reference. In the latter case, for each replicate I chose one of the ingroup sequences at random¹. In order to map all reads from all loci at once, for each replicate the reference sequence was constructed by concatenating all the loci for the selected tip sequence. Moreover, as we do not expect the target loci to be contiguous in the genome, in the reference sequence these loci were separated by 300 N’s for the 150 bp reads and 500 N’s for the 250 bp reads (ensuring no cross-mapping of reads across loci).
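The construction of such a concatenated reference can be sketched as follows. This is a minimal Python illustration and not the REFSELECTOR code; the function name, locus names and sequences are hypothetical.

```python
def build_reference(loci, spacer_length):
    """Concatenate locus sequences, separating them with runs of N.

    loci: list of (name, sequence) tuples, in the desired order.
    Returns the reference string and a dict mapping each locus name to its
    0-based, end-exclusive (start, end) coordinates within the reference.
    """
    spacer = "N" * spacer_length
    pieces = []
    offsets = {}
    position = 0
    for name, sequence in loci:
        if pieces:
            # Spacer only between loci, never at the ends.
            pieces.append(spacer)
            position += spacer_length
        offsets[name] = (position, position + len(sequence))
        pieces.append(sequence)
        position += len(sequence)
    return "".join(pieces), offsets


# Toy example with a 3 bp spacer; the simulations used 300 or 500 N's.
reference, coordinates = build_reference(
    [("locus1", "ACGT"), ("locus2", "GGCC")], spacer_length=3
)
```

Keeping the coordinates of each locus makes it straightforward to split per-individual consensus sequences back into loci after mapping.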

5.2.2.3 Mapping

Mapping of the reads of each individual was carried out using the MEM algorithm of BWA Version 0.7.7-r441 (Li and Durbin, 2009), against the reference built for each replicate. Following a standardized best-practices pipeline (Van der Auwera et al., 2013), mapped reads from all replicates were independently processed, and local realignment around indels and removal of PCR duplicates performed. No further filtering was made. Depth of coverage across replicates, loci and individuals was estimated using the depth module of samtools Version 1.5.0 (Li et al., 2009a).

5.2.2.4 Variant and genotype calling

Variant and genotype calling were performed using GATK Version 3.5-g761ca1 (Mckenna et al., 2010), following the framework of single-sample variant calling followed by joint genotyping, using the HaplotypeCaller and GenotypeGVCFs modules.

5.2.2.5 Consensus sequences

Afterwards, I generated consensus sequences (with IUPAC ambiguities) for each individual, making use of the consensus module of samtools/bcftools Version 1.2 (Li, 2011b).
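To illustrate what a consensus with IUPAC ambiguities encodes, the sketch below collapses the two alleles of a diploid individual into a single sequence. It is a simplified Python illustration with hypothetical names, restricted to unambiguous A/C/G/T input of equal length; it does not reproduce how bcftools derives the consensus from called genotypes.

```python
# Standard IUPAC nucleotide codes for one or two distinct bases.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
}


def diploid_consensus(allele_a, allele_b):
    """Encode two aligned haplotypes as one sequence with IUPAC ambiguities."""
    if len(allele_a) != len(allele_b):
        raise ValueError("alleles must have the same length")
    return "".join(
        IUPAC[frozenset((x, y))] for x, y in zip(allele_a.upper(), allele_b.upper())
    )


homozygous_and_het = diploid_consensus("ACGT", "ACAT")
two_het_sites = diploid_consensus("AC", "CT")
```

Heterozygous positions become ambiguity codes (for example G/A becomes R), so phase information is lost while the variation itself is retained.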

5.2.2.6 Multiple sequence alignment

For each replicate, these consensus sequences were split by locus, gathered across individuals and each locus aligned using the FFT-NS-2 algorithm of MAFFT Version 7.212 (Katoh and Standley, 2013; Katoh et al., 2002).

¹ For this task I specifically developed a tool called REFSELECTOR (http://github.com/merlyescalona/refselector), able to work with the gene-tree/species-tree distributions obtained with SimPhy.

5.2.3 Gene and species tree inference

Maximum-likelihood gene trees were estimated for all loci with RAxML-NG (Kozlov, 2017; Stamatakis, 2014), under the GTR+G model and empirical base frequencies. Ten heuristic searches were performed (SPR algorithm), from 10 parsimony starting trees, from which the ML tree was chosen. Species trees were estimated from the ML gene trees with ASTRAL-III Version 5.5.9 (Mirarab et al., 2014; Mirarab and Warnow, 2015; Zhang et al., 2017).

5.2.4 Accuracy of species trees reconstruction: relation to NGS and capture design variables

The accuracy of the reconstructed species tree was measured as 1 minus the normalized error, i.e. the normalized Robinson-Foulds (RF) distance (Robinson and Foulds, 1981) (Equation 5.1):

$$\text{accuracy} = 1 - \frac{RF(\text{original tree}, \text{inferred tree})}{2n - 4} \tag{5.1}$$

where $n$ is the number of tips of the species tree. Linear regressions were used to test the correlation between the NGS and methodological variables and species tree accuracy. Analyses were carried out using R (Version 3.4.3) base packages and phangorn (Schliep, 2011).
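Equation 5.1 can be illustrated with a small self-contained Python sketch. This is not the phangorn code used in the analyses: it assumes rooted, binary trees written as nested tuples of leaf names, counts the clades found in only one of the two trees as the RF distance, and relies on the fact that two rooted binary trees with n tips differ in at most 2n − 4 clades. Function names and toy trees are hypothetical.

```python
def clades(tree):
    """Return (set of leaves, set of non-trivial clades) of a rooted tree.

    The tree is a nested tuple whose leaves are strings,
    e.g. ((("A", "B"), "C"), "D").
    """
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset().union(*(visit(child) for child in node))
        found.add(leaves)
        return leaves

    all_leaves = visit(tree)
    found.discard(all_leaves)  # the root clade is shared by every tree
    return all_leaves, found


def species_tree_accuracy(original_tree, inferred_tree):
    """Compute 1 - RF / (2n - 4) for two rooted trees on the same n >= 3 tips."""
    leaves_original, clades_original = clades(original_tree)
    leaves_inferred, clades_inferred = clades(inferred_tree)
    if leaves_original != leaves_inferred:
        raise ValueError("trees must share the same set of tips")
    n = len(leaves_original)
    if n < 3:
        raise ValueError("at least three tips are required")
    rf = len(clades_original ^ clades_inferred)  # symmetric difference
    return 1 - rf / (2 * n - 4)


true_tree = ((("A", "B"), "C"), "D")
identical = species_tree_accuracy(true_tree, ((("B", "A"), "C"), "D"))
one_clade_wrong = species_tree_accuracy(true_tree, ((("A", "C"), "B"), "D"))
all_clades_wrong = species_tree_accuracy(true_tree, (("A", "C"), ("B", "D")))
```

An accuracy of 1 means that the inferred topology is identical to the simulated one, and 0 means that no non-trivial clade is shared.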

5.3 Results and Discussion

Four NGS scenarios were finally simulated (paired-end reads of 150bp and 250bp, and single-end reads of 150bp and 250bp) and each mapped to both an ingroup and an outgroup reference, resulting in 8 scenarios for which variants and genotypes were called, consensus sequences per individual were produced and gene and species trees reconstructed (121 replicates per scenario). The following results refer to the accuracy of species trees reconstruction across all these scenarios.

5.3.1 Effect of read type/length and mapping reference

Distributions of accuracy are roughly similar across and within factors (Figure 5.2), with the true species tree being recovered in the majority of cases. Yet, p-values seem to indicate that all factors play a role in the inferred outcome, with the influence of the reference (outgroup versus ingroup) and read length (150 bp versus 250 bp) being highly significant, whereas the read type (PE versus SE) is marginally significant.

Figure 5.2 Accuracy of the reconstructed species trees (n=121) for the combinations of read lengths (2), types (2), and reference used at mapping (2). Upper plots show distributions and mean (dots) of accuracy per factor. Lower table shows the correlations between accuracy and 1) reference, 2) read type and 3) read length.

Nevertheless, the means are really similar across the different scenarios, most falling very close to 1 (Figure 5.3).

Figure 5.3 Species tree accuracy per scenario. Species tree accuracy values for each read type and length as mapped to ingroup (pink boxplots) or outgroup (blue boxplots) reference sequences.

Looking at each treatment individually, it seems there is an overall higher accuracy when an ingroup reference was used. Interestingly, the set of replicates of PE 150 bp reads mapped to outgroup references provided the set of species-trees with lower accuracies, which clearly increased again with longer reads. Certainly, more replicates (in these and different scenarios) might be important to confirm these trends, which are particularly interesting given that 150PE reads are perhaps most frequent in real experiments.

5.3.2 Effect of NGS coverage

Beyond the above-mentioned factors, we are very interested in understanding the effect of coverage on the accuracy of the results, as it is one of the factors most easily controlled by researchers, and probably also the one that most influences the cost of a real experiment. We thus chose to simulate loci across a very wide range of coverage values, reflected in Figure 5.4.

Distribution of expected coverage

Figure 5.4 Expected coverage. Histogram displaying the distribution of the average expected coverage per individual (i.e, across loci), across replicates. The overlapping density plot is shown solely to serve as visual comparison along the following figures.

Distribution of expected coverage vs. observed coverage

As we obtained DNA sequence matrices (for gene and species tree inference) by mapping the reads to reference genomes, observed coverage may differ from the expected one. This is shown in Figure 5.5 for all datasets produced. As a (small) percentage of the reads is lost, one observes an overall decrease of the coverage in both cases, much more pronounced (skewed to the left) when mapping to the outgroup reference. Distributions of coverage with respect to the other factors (read type and length) are basically identical.

Figure 5.5 Distribution of average coverage per individual across all the NGS profiles. NGS profile is here defined by the combination of read type and read length used to generate the NGS datasets. There are four different NGS profiles: PE 150 bp, SE 150 bp, PE 250 bp and SE 250 bp. A | Reads mapped to ingroup. Coverage distributions for all NGS profiles when mapped to the ingroup reference sequence. B | Reads mapped to outgroup. Coverage distributions for all profiles when mapped to the outgroup reference sequence.

Correlations between accuracy and expected/observed coverage, and coverage variation among sites

As expected, there are strong (highly significant) correlations between species tree accuracy and both expected and observed coverage. On the other hand, there is no correlation between accuracy and the parameters controlling the variation of coverage across individuals and loci. The absolute values of coverage are thus more important than the degree to which it varies (Figure 5.6).

Figure 5.6 Linear regression and Pearson r correlation test (p-value) between coverage and species tree accuracy. Alpha controls the shape of the gamma distribution controlling coverage variation across individuals and loci (plots below).

5.3.3 Effect of the number and size of loci

In these simulations the number of loci (per replicate) varies between 100 and 2000, and no significant correlation is seen between it and species tree accuracy. It would be interesting to further decrease its lower limit and explore accuracy with lower numbers of loci. The correlation between accuracy and locus size (here from 150 to 2000 bp) is, on the other hand, highly significant (Figure 5.7).

Figure 5.7 Linear regression and Pearson r correlation test (p-value) between target enrichment design parameters (number and size of the loci) and species tree accuracy.

5.3.4 Effect of sampling

Both the number of species and the number of individuals (diploid) per species are significantly correlated with species tree accuracy. As expected, accuracy decreases with the number of species and increases with the number of individuals per species (Figure 5.8).

Figure 5.8 Linear regression and Pearson r correlation test (p-value) between number of species and number of individuals per species and species tree accuracy.

5.3.5 Effect of the speciation history

The shape of the species tree will obviously be one of the most important drivers of gene tree incongruence and thus of the accuracy of the reconstructed species trees. Variables such as species tree height, population size and relative branch lengths will all affect the amount of ancestral polymorphism within species trees and thus the discordance between gene trees and the ability to recover the true species tree. We examined the correlation between species tree accuracy and species tree height, and the average and maximum number of extra lineages (per species tree). The strong correlation observed, especially for the last two factors, reflects their overall relevance. It will be important to explore if and how these variables interact with other parameters, regarding NGS (profiles and coverage) as well as locus number and size, and sampling variables. It is possible that different sets of NGS parameters and sampling designs work better for different speciation scenarios, knowledge that can be of great help for field and lab biologists in the real world.

Figure 5.9 Linear regression and Pearson r correlation test (p-value) between species tree accuracy and species tree parameters.

5.4 Conclusions

The relevance of NGS variables, sampling design, and methodological decisions for the reconstruction of species trees from target capture data is not yet well understood. The laboratory and computational burden associated with these experiments makes their replication under different conditions, with the same or downsampled data, so laborious that it is rarely done. Simulations are thus extremely useful for this kind of task, and they further allow us to explore a very wide parameter space. In the simulations presented here, the accuracy distributions for the different NGS parameters look very similar at first sight. However, the correlations between accuracy and reference type, read length, and read type indicate that these variables are relevant for the accuracy of the inferred species trees, especially the reference type. Looking at each NGS profile individually, accuracy seems to be higher overall when a closer reference is used. Interestingly, the NGS profile that is currently most frequently used (PE 150 bp) produced the worst results when mapped to outgroup references. This matters because distant references are often the only ones available in empirical studies. More replicates (in these and other scenarios) may be needed to confirm these trends, but this unexpected result highlights the usefulness of this kind of simulation. We tried to implement realistic simulations in which erroneous mapping and loss of reads are allowed, and observed the expected loss of coverage, which is one of the most important factors affecting tree accuracy. When performing a target capture experiment, researchers generally aim for the highest possible number of loci given the available budget. Nevertheless, our results show a more significant correlation of tree accuracy with loci size than with loci number. The amount of ancestral polymorphism is also, as expected, highly correlated with species tree accuracy.
Even within the parameter ranges explored here, where a very high number of loci are sequenced, the number of extra lineages is a strong determinant of the success of species tree inference. In the future, it will be important to increase the number of simulations in order to explore whether and how these variables interact with each other, to identify the speciation histories under which certain (or all) methods perform poorly, and to find appropriate sets of NGS variables and target enrichment designs for different diversification scenarios.
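To illustrate the budget trade-off between the number of loci and their size, the sketch below computes the expected mean depth per targeted base under a deliberately idealized model: all reads are on target and evenly spread across loci and samples. The function name and all numbers are hypothetical. The simulations described above additionally allow erroneous mapping and loss of reads, which this arithmetic ignores.

```python
def expected_mean_depth(total_reads, read_length, n_loci, locus_size, n_samples):
    """Expected mean sequencing depth per targeted base.

    Idealized assumption: every read maps on target and reads are spread
    evenly across all loci and all samples. For paired-end data, count each
    read of a pair separately in total_reads.
    """
    targeted_bases = n_loci * locus_size * n_samples
    if targeted_bases <= 0:
        raise ValueError("n_loci, locus_size and n_samples must be positive")
    return total_reads * read_length / targeted_bases


# Hypothetical sequencing effort: 10 million reads of 150 bp over 20 samples.
many_short_loci = expected_mean_depth(10_000_000, 150, n_loci=2000, locus_size=500, n_samples=20)
few_long_loci = expected_mean_depth(10_000_000, 150, n_loci=1000, locus_size=1000, n_samples=20)

print(f"2000 loci x 500 bp:  {many_short_loci:.1f}x")
print(f"1000 loci x 1000 bp: {few_long_loci:.1f}x")
```

Under these assumptions, designs with the same total targeted length have the same expected depth, so at a fixed sequencing effort the choice between more loci and longer loci does not by itself change the expected coverage.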

Overall conclusions

The work presented in this dissertation embodies my efforts to develop a more realistic framework for the simulation of phylogenomic NGS data, and to use it to better understand the impact of NGS methodological variables on phylogenomic inference, particularly species tree inference in non-model organisms, where the lack of genomic resources is a major limitation. The works presented here, whether published, accepted for publication, or in preparation, represent important contributions to the field of phylogenomics, providing the scientific community with:

1. A better understanding of the wide variety of existing DNA NGS simulators, as well as guidelines for identifying the NGS simulators best suited to the purpose at hand (Chapter 2: Simulation of genomic next-generation sequencing data; see also Escalona et al. 2016).

2. A more realistic simulation framework for the generation of NGS phylogenomic data, including multiple options to model experimental design and sequencing parameters. This makes possible comparative analyses of different NGS parameters under a wide range of evolutionary parameters and under the gene-tree/species-tree paradigm (Chapter 3: NGSphy: phylogenomic simulation of NGS data; see also Escalona et al. 2017).

3. A detailed description of the optimization of parameters for the simulation of data from NGS capture experiments of taxa with underlying shallow phylogenies (Chapter 4: Optimization of parameters for the simulation of targeted sequencing data of shallow phylogenomics). As this kind of simulation can be quite complex, it is helpful both to explore possible relationships between variables across the parameter space and to describe their parametrization in detail.

4. A comprehensive study of the sensitivity of phylogenomic inference to the design of NGS capture experiments (Chapter 5: Sensitivity of phylogenomic inference to the design of target enrichment NGS experiments in non-model organisms). Although preliminary, and not yet incorporating all intended variables, these analyses already revealed important and unexpected aspects, such as the fact that what is likely the most used NGS profile (PE 150 bp) may not be the optimal one. Loci size may also be more important than previously realized. Realistic coverage variation, which can now be implemented with NGSphy across a wider parameter space, will be important to further explore this question, together with other relevant methodological variables such as assembling loci de novo instead of mapping, performing variant calling with different algorithms, or exploring the importance of allele phasing.

I expect the tools constructed here and analyses published to be a significant contribution to the field of phylogenomics.

References

Daniel Aird, Michael G Ross, Wei-Sheng Chen, Maxwell Danielsson, Timothy Fennell, Carsten Russ, David B Jaffe, Chad Nusbaum, and Andreas Gnirke. Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries. Genome Biol., 12(2):R18, January 2011. Omar A Ali, Sean M O’Rourke, Stephen J Amish, Mariah H Meek, Gordon Luikart, Carson Jeffres, and Michael R Miller. RAD capture (rapture): Flexible and efficient Sequence-Based genotyping. Genetics, 202(2):389–400, February 2016. Can Alkan, Saba Sajjadian, and Evan E Eichler. Limitations of next-generation genome sequence assembly. Nat. Methods, 8(1):61–65, January 2011. Marc W Allard, Yan Luo, Errol Strain, Cong Li, Christine E Keys, Insook Son, Robert Stones, Steven M Musser, and Eric W Brown. High resolution clustering of salmonella enterica serovar montevideo strains using a next-generation sequencing approach. BMC Genomics, 13:32, January 2012. S F Altschul, W Gish, W Miller, E W Myers, and D J Lipman. Basic local alignment search tool. J. Mol. Biol., 215(3):403–410, October 1990. D Altshuler, V J Pollara, C R Cowles, W J Van Etten, J Baldwin, L Linton, and E S Lander. An SNP map of the human genome generated by reduced representation shotgun sequencing. Nature, 407(6803):513–516, September 2000. Kimberly R Andrews, Jeffrey M Good, Michael R Miller, Gordon Luikart, and Paul A Hohenlohe. Harnessing the power of RADseq for ecological and evolutionary genomics. Nat. Rev. Genet., 17(2):81–92, February 2016. Cécile Ané, Bret Larget, David A Baum, Stacey D Smith, and Antonis Rokas. Bayesian estimation of concordance among gene trees. Mol. Biol. Evol., 24(2):412–426, February 2007. Florent E Angly, Dana Willner, Forest Rohwer, Philip Hugenholtz, and Gene W Tyson. Grinder: A versatile amplicon and shotgun sequence simulator. Nucleic Acids Res., 40(12):e94, July 2012. Leonardo Arbiza, Mateus Patricio, Hernán Dopazo, and David Posada. Genome-wide heterogeneity of nucleotide substitution model fit.
Genome Biol. Evol., 3:896–908, August 2011.

Miguel Arenas. Simulation of molecular data under diverse evolutionary scenarios. PLoS Comput. Biol., 8(5):e1002495, 2012. Miguel Arenas. Trends in substitution models of molecular evolution. Front. Genet., 6:319, October 2015. Miguel Arenas and David Posada. Simulation of genome-wide evolution under heterogeneous substitution models and complex multispecies coalescent histories. Mol. Biol. Evol., 31(5):1295–1301, May 2014. Miguel Arenas, Gabriel Valiente, and David Posada. Characterization of reticulate networks based on the coalescent with recombination. Mol. Biol. Evol., 25(12):2517–2520, December 2008. Miguel Arenas, Filipe Pereira, Manuela Oliveira, Nadia Pinto, Alexandra M Lopes, Veronica Gomes, Angel Carracedo, and Antonio Amorim. Forensic genetics and genomics: Much more than just a human affair. PLoS Genet., 13(9):e1006960, September 2017. Şule Ari and Muzaffer Arikan. Next-Generation sequencing: Advantages, disadvantages, and future. In Plant Omics: Trends and Applications, pages 109–135. Springer, Cham, 2016. Jean Armengaud, Judith Trapp, Olivier Pible, Olivier Geffard, Arnaud Chaumot, and Erica M Hartmann. Non-model organisms, a species endangered by proteogenomics. J. Proteomics, 105:5–18, June 2014. Lars Arvestad, Ann-Charlotte Berglund, Jens Lagergren, and Bengt Sennblad. Bayesian gene/species tree reconciliation and orthology analysis using MCMC. Bioinformatics, 19 Suppl 1:i7–15, 2003. Khadijah Bahwaireth, Lo’ai Tawalbeh, Elhadj Benkhelifa, Yaser Jararweh, and Mohammad A Tawalbeh. Experimental comparison of simulation tools for efficient cloud and mobile cloud computing applications. EURASIP Journal on Information Security, 2016(1):15, June 2016. Yongsheng Bai and James Cavalcoli. SNPAAMapper: An efficient genome-wide SNP variant analysis pipeline for next-generation sequencing data. Bioinformation, 9(17):870–872, October 2013.
Nathan A Baird, Paul D Etter, Tressa S Atwood, Mark C Currey, Anthony L Shiver, Zachary A Lewis, Eric U Selker, William A Cresko, and Eric A Johnson. Rapid SNP discovery and genetic mapping using sequenced RAD markers. PLoS One, 3(10):e3376, October 2008. Susanne Balzer, Ketil Malde, Anders Lanzén, Animesh Sharma, and Inge Jonassen. Characteristics of 454 pyrosequencing data enabling realistic simulation with flowsim. Bioinformatics, 26(18):i420–i425, September 2010. Susanne Balzer, Ketil Malde, and Inge Jonassen. Systematic exploration of error sources in pyrosequencing flowgram data. Bioinformatics, 27(13):304–309, 2011.

Suying Bao, Rui Jiang, Wingkeung Kwan, Binbin Wang, Xu Ma, and You-Qiang Song. Evaluation of next-generation sequencing software in mapping and assembly. J. Hum. Genet., 56(6):406–414, June 2011. Timour Baslan, Jude Kendall, Brian Ward, Hilary Cox, Anthony Leotta, Linda Rodgers, Michael Riggs, Sean D’Italia, Guoli Sun, Mao Yong, Kristy Miskimen, Hannah Gilmore, Michael Saborowski, Nevenka Dimitrova, Alexander Krasnitz, Lyndsay Harris, Michael Wigler, and James Hicks. Optimizing sparse sequencing of single cells for highly multiplex copy number profiling. Genome Res., 25(5):714–724, May 2015. Md Shamsuzzoha Bayzid and Tandy Warnow. Naive binning improves phylogenomic analyses. Bioinformatics, 29(18):2277–2284, September 2013. S K Behura. Insect phylogenomics. Insect Mol. Biol., 24(4):403–411, August 2015. Robert G Beiko and Robert L Charlebois. A simulation test bed for hypotheses of genome evolution. Bioinformatics, 23(7):825–831, April 2007. Gill Bejerano, Michael Pheasant, Igor Makunin, Stuart Stephen, W James Kent, John S Mattick, and David Haussler. Ultraconserved elements in the human genome. Science, 304(5675):1321–1325, May 2004. R Bellman and T E Harris. On the theory of Age-Dependent stochastic branching processes. Proc. Natl. Acad. Sci. U. S. A., 34(12):601–604, December 1948. Ke Bi, Dan Vanderpool, Sonal Singhal, Tyler Linderoth, Craig Moritz, and Jeffrey M Good. Transcriptome-based exon capture enables highly cost-effective comparative genomic data collection at moderate evolutionary scales. BMC Genomics, 13:403, August 2012. Kip Bodi. simhtsd: Simulate High-throughput Sequencing Data. http://sourceforge.net/projects/simhtsd/, 2009. Nicholas A Bokulich, Sathish Subramanian, Jeremiah J Faith, Dirk Gevers, Jeffrey I Gordon, Rob Knight, David A Mills, and J Gregory Caporaso. Quality-filtering vastly improves diversity estimates from illumina amplicon sequencing. Nat. Methods, 10(1):57–59, January 2013. Anthony M Bolger, Marc Lohse, and Bjoern Usadel.
Trimmomatic: a flexible trimmer for illumina sequence data. Bioinformatics, 30(15):2114–2120, August 2014. Nicolas Bortolussi, Eric Durand, Michael Blum, and Olivier François. aptreeshape: statistical analysis of phylogenetic tree shape. Bioinformatics, 22(3):363–364, February 2006. Remco Bouckaert, Philippe Lemey, Michael Dunn, Simon J Greenhill, Alexander V Alekseyenko, Alexei J Drummond, Russell D Gray, Marc A Suchard, and Quentin D Atkinson. Mapping the origins and expansion of the Indo-European language family. Science, 337(6097):957–960, August 2012.

Remco Bouckaert, Joseph Heled, Denise Kühnert, Tim Vaughan, Chieh-Hsi Wu, Dong Xie, Marc A Suchard, Andrew Rambaut, and Alexei J Drummond. BEAST 2: a software platform for bayesian evolutionary analysis. PLoS Comput. Biol., 10(4):e1003537, April 2014. George E P Box. Science and statistics. J. Am. Stat. Assoc., 71(356):791–799, 1976. Keith R Bradnam, Joseph N Fass, Anton Alexandrov, Paul Baranay, Michael Bechner, Inanç Birol, Sébastien Boisvert, Jarrod A Chapman, Guillaume Chapuis, Rayan Chikhi, Hamidreza Chitsaz, Wen-Chi Chou, Jacques Corbeil, Cristian Del Fabbro, T Roderick Docking, Richard Durbin, Dent Earl, Scott Emrich, Pavel Fedotov, Nuno A Fonseca, Ganeshkumar Ganapathy, Richard A Gibbs, Sante Gnerre, Elénie Godzaridis, Steve Goldstein, Matthias Haimel, Giles Hall, David Haussler, Joseph B Hiatt, Isaac Y Ho, Jason Howard, Martin Hunt, Shaun D Jackman, David B Jaffe, Erich D Jarvis, Huaiyang Jiang, Sergey Kazakov, Paul J Kersey, Jacob O Kitzman, James R Knight, Sergey Koren, Tak-Wah Lam, Dominique Lavenier, François Laviolette, Yingrui Li, Zhenyu Li, Binghang Liu, Yue Liu, Ruibang Luo, Iain Maccallum, Matthew D Macmanes, Nicolas Maillet, Sergey Melnikov, Delphine Naquin, Zemin Ning, Thomas D Otto, Benedict Paten, Octávio S Paulo, Adam M Phillippy, Francisco Pina-Martins, Michael Place, Dariusz Przybylski, Xiang Qin, Carson Qu, Filipe J Ribeiro, Stephen Richards, Daniel S Rokhsar, J Graham Ruby, Simone Scalabrin, Michael C Schatz, David C Schwartz, Alexey Sergushichev, Ted Sharpe, Timothy I Shaw, Jay Shendure, Yujian Shi, Jared T Simpson, Henry Song, Fedor Tsarev, Francesco Vezzi, Riccardo Vicedomini, Bruno M Vieira, Jun Wang, Kim C Worley, Shuangye Yin, Siu-Ming Yiu, Jianying Yuan, Guojie Zhang, Hao Zhang, Shiguo Zhou, and Ian F Korf. Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. Gigascience, 2(1):10, July 2013. Jason G Bragg, Sally Potter, Ke Bi, and Craig Moritz.
Exon capture phylogenomics: efficacy across scales of divergence. Mol. Ecol. Resour., 16(5):1059–1068, September 2016. M G Branstetter, J T Longino, P S Ward, and others. Enriching the ant tree of life: enhanced UCE bait set for genome-scale phylogenetics of ants and other hymenoptera. Methods Ecol. Evol., 2017a. Michael G Branstetter, Bryan N Danforth, James P Pitts, Brant C Faircloth, Philip S Ward, Matthew L Buffington, Michael W Gates, Robert R Kula, and Seán G Brady. Phylogenomic insights into the evolution of stinging wasps and the origins of ants and bees. Curr. Biol., 27(7):1019–1025, April 2017b. Adrian W Briggs, Jeffrey M Good, Richard E Green, Johannes Krause, Tomislav Maricic, Udo Stenzel, Carles Lalueza-Fox, Pavao Rudan, Dejana Brajkovic, Zeljko Kucan, Ivan Gusic, Ralf Schmitz, Vladimir B Doronichev, Liubov V Golovanova, Marco de la Rasilla, Javier Fortea, Antonio Rosas, and Svante Pääbo. Targeted retrieval and analysis of five neandertal mtDNA genomes. Science, 325(5938):318–321, July 2009. Broad Institute. Purpose and operation of read-backed phasing. https://gatkforums.broadinstitute.org/gatk/discussion/45/purpose-and-operation-of-read-backed-phasing. Accessed: 2018-1-10.

R W W Brouwer, M C G N van den Hout, F G Grosveld, and W F J van Ijcken. NARWHAL, a primary analysis pipeline for NGS data. Bioinformatics, 28(2):284–285, January 2012. Sharon R Browning and Brian L Browning. Haplotype phasing: existing methods and new developments. Nat. Rev. Genet., 12(10):703–714, September 2011. David Bryant, Remco Bouckaert, Joseph Felsenstein, Noah A Rosenberg, and Arindam RoyChoudhury. Inferring species trees directly from biallelic genetic markers: bypassing gene trees in a full coalescent analysis. Mol. Biol. Evol., 29(8):1917–1932, August 2012. Fabien Burki, Maia Kaplan, Denis V Tikhonenkov, Vasily Zlatogursky, Bui Quang Minh, Liudmila V Radaykina, Alexey Smirnov, Alexander P Mylnikov, and Patrick J Keeling. Untangling the early diversification of eukaryotes: a phylogenomic study of the evolutionary origins of centrohelida, haptophyta and cryptista. Proc. Biol. Sci., 283(1823), January 2016. Brian Bushnell, Jonathan Rood, and Esther Singer. BBMerge - accurate paired shotgun read merging via overlap. PLoS One, 12(10):e0185056, October 2017. Ségolène Caboche, Christophe Audebert, Yves Lemoine, and David Hot. Comparison of mapping algorithms used in high-throughput sequencing: application to Ion Torrent data. BMC Genomics, 15:264, 2014. Mauricio O Carneiro, Carsten Russ, Michael G Ross, Stacey B Gabriel, Chad Nusbaum, and Mark A DePristo. Pacific biosciences sequencing technology for genotyping and variation discovery in human data. BMC Genomics, 13(1):375, 2012. Bryan C Carstens, Tara A Pelletier, Noah M Reid, and Jordan D Satler. How to fail at species delimitation. Mol. Ecol., 22(17):4369–4383, September 2013. Reed A Cartwright. DNA assembly with gaps (dawg): Simulating sequence evolution. Bioinformatics, 21(SUPPL. 3):iii31–8, November 2005. Antonio Carvajal-Rodríguez, Keith A Crandall, and David Posada. Recombination estimation under complex evolutionary models with the coalescent composite-likelihood method. Mol. Biol.
Evol., 23(4):817–827, April 2006. Julian Catchen, Paul A Hohenlohe, Susan Bassham, Angel Amores, and William A Cresko. Stacks: an analysis tool set for population genomics. Mol. Ecol., 22(11):3124–3140, June 2013. Julian M Catchen, Angel Amores, Paul Hohenlohe, William Cresko, and John H Postlethwait. Stacks: Building and genotyping loci de novo from Short-Read sequences. G3: Genes, Genomes, Genetics, 1(August):171–182, 2011. Ruchi Chaudhary, Bastien Boussau, J Gordon Burleigh, and David Fernández-Baca. Assessing approaches for inferring species trees from multi-copy genes. Syst. Biol., 64(2):325–339, March 2015.

Xin Chen, Alan R Lemmon, Emily Moriarty Lemmon, R Alexander Pyron, and Frank T Burbrink. Using phylogenomics to understand the link between biogeographic origins and regional diversification in ratsnakes. Mol. Phylogenet. Evol., 111:206–218, June 2017. Anthony Youzhi Cheng, Yik-Ying Teo, and Rick Twee-Hee Ong. Assessing single nucleotide variant detection and genotype calling on whole-genome sequenced individuals. Bioinformatics, 30(12):1707–1713, June 2014. Francesca D Ciccarelli, Tobias Doerks, Christian von Mering, Christopher J Creevey, Berend Snel, and Peer Bork. Toward automatic reconstruction of a highly resolved tree of life. Science, 311(5765):1283–1287, March 2006. Adam Cornish and Chittibabu Guda. A comparison of variant calling pipelines using genome in a bottle as a reference. Biomed Res. Int., 2015:456479, October 2015. Astrid Cruaud, Mathieu Gautier, Maxime Galan, Julien Foucaud, Laure Sauné, Gwenaëlle Genson, Emeric Dubois, Sabine Nidelet, Thierry Deuve, and Jean-Yves Rasplus. Empirical assessment of RAD sequencing for interspecific phylogeny. Mol. Biol. Evol., 31(5):1272–1274, May 2014. Astrid Cruaud, Mathieu Gautier, Jean-Pierre Rossi, Jean-Yves Rasplus, and Jérôme Gouzy. RADIS: analysis of RAD-seq data for interspecific phylogeny. Bioinformatics, 32(19):3027–3028, October 2016. Rute R da Fonseca, Anders Albrechtsen, Gonçalo Espregueira Themudo, Jazmín Ramos-Madrigal, Jonas Andreas Sibbesen, Lasse Maretty, M Lisandra Zepeda-Mendoza, Paula F Campos, Rasmus Heller, and Ricardo J Pereira. Next-generation biology: Sequencing and data analysis approaches for non-model organisms. Mar. Genomics, 30:3–13, December 2016. Robert Daber, Shrey Sukhadia, and Jennifer J D Morrissette. Understanding the limitations of next generation sequencing informatics, an approach to clinical pipeline validation using artificial data sets. Cancer Genet., 206(12):441–448, December 2013. Daniel A Dalquen, Maria Anisimova, Gaston H Gonnet, and Christophe Dessimoz.
ALF - a simulation framework for genome evolution. Molecular Biology and Evolution, 29(4):1115–1123, 2012. Diego Darriba, Guillermo L Taboada, Ramón Doallo, and David Posada. jModelTest 2: more models, new heuristics and parallel computing. Nat. Methods, 9(8):772, July 2012. John W Davey, Paul A Hohenlohe, Paul D Etter, Jason Q Boone, Julian M Catchen, and Mark L Blaxter. Genome-wide genetic marker discovery and genotyping using next-generation sequencing. Nat. Rev. Genet., 12(7):499–510, June 2011. Leonardo De Oliveira Martins, Diego Mallo, and David Posada. A bayesian supertree model for Genome-Wide species tree reconstruction. Syst. Biol., 65(3):397–416, May 2016.

Nicola De Maio, Dominik Schrempf, and Carolin Kosiol. PoMo: An allele Frequency-Based approach for species tree estimation. Syst. Biol., 64(6):1018–1031, November 2015. Eric G DeChaine and Andrew P Martin. Using coalescent simulations to test the impact of quaternary climate cycles on divergence in an alpine plant-insect association. Evolution, 60(5):1004–1013, May 2006. James H Degnan and Noah A Rosenberg. Gene tree discordance, phylogenetic inference and the multispecies coalescent. Trends Ecol. Evol., 24(6):332–340, June 2009. B Denkena and F Winter. Simulation-based planning of production capacity through integrative roadmapping in the wind turbine industry. Procedia CIRP, 33(Supplement C):105–110, January 2015. S van der Walt, S C Colbert, and G Varoquaux. The NumPy array: A structure for efficient numerical computation. Computing in Science Engineering, 13(2):22–30, March 2011. Adnan Derti, Frederick P Roth, George M Church, and C-Ting Wu. Mammalian ultraconserved elements are strongly depleted among segmental duplications and copy number variants. Nat. Genet., 38(10):1216–1220, October 2006. Juliane C Dohm, Claudio Lottaz, Tatiana Borodina, and Heinz Himmelbauer. Substantial biases in ultra-short read data sets from high-throughput DNA sequencing. Nucleic Acids Res., 36(16):e105, September 2008. A Drummond and K Strimmer. PAL: an object-oriented programming library for molecular evolution and phylogenetics. Bioinformatics, 17(7):662–663, July 2001. Alexei J Drummond, Marc A Suchard, Dong Xie, and Andrew Rambaut. Bayesian phylogenetics with BEAUti and the BEAST 1.7. Mol. Biol. Evol., 29(8):1969–1973, August 2012. Casey W Dunn. Ctenophore trees. Nat Ecol Evol, 1(11):1600–1601, November 2017.
Casey W Dunn, Andreas Hejnol, David Q Matus, Kevin Pang, William E Browne, Stephen A Smith, Elaine Seaver, Greg W Rouse, Matthias Obst, Gregory D Edgecombe, Martin V Sørensen, Steven H D Haddock, Andreas Schmidt-Rhaesa, Akiko Okusu, Reinhardt Møbjerg Kristensen, Ward C Wheeler, Mark Q Martindale, and Gonzalo Giribet. Broad phylogenomic sampling improves resolution of the animal tree of life. Nature, 452(7188):745–749, April 2008. Dent Earl, Keith Bradnam, Aaron Darling, Dawei Lin, Joseph Fass, Hung On, Ken Yu, Vince Buffalo, Daniel R Zerbino, Mark Diekhans, Ngan Nguyen, Pramila Nuwantha Ariyaratne, Wing-Kin Sung, Zemin Ning, Matthias Haimel, Jared T Simpson, Nuno A Fonseca, T Roderick Docking, Isaac Y Ho, Daniel S Rokhsar, Rayan Chikhi, Dominique Lavenier, Guillaume Chapuis, Delphine Naquin, Nicolas Maillet, Michael C Schatz, David R Kelley, Adam M Phillippy, Sergey Koren, Shiaw-Pyng Yang, Wei Wu, Wen-Chi Chou, Anuj Srivastava, Timothy I Shaw, J Graham Ruby, Peter Skewes-cox, Miguel Betegon, Michelle T Dimon, Victor Solovyev, Igor Seledtsov, Petr Kosarev,

Denis Vorobyev, Ricardo Ramirez-gonzalez, Richard Leggett, Dan Maclean, Fangfang Xia, Ruibang Luo, Zhenyu Li, Yinlong Xie, Binghang Liu, Sante Gnerre, Iain Maccallum, Dariusz Przybylski, Filipe J Ribeiro, Shuangye Yin, Ted Sharpe, Giles Hall, Paul J Kersey, Richard Durbin, Shaun D Jackman, Jarrod A Chapman, Xiaoqiu Huang, Joseph L Derisi, Mario Caccamo, Yingrui Li, David B Jaffe, Richard E Green, David Haussler, Ian Korf, and Benedict Paten. Assemblathon 1: A competitive assessment of de novo short read assembly methods. Genome Research, 21:2224–2241, 2011. Deren A R Eaton. PyRAD: assembly of de novo RADseq loci for phylogenetic analyses. Bioinformatics, 30(13):1844–1849, July 2014. Deren A R Eaton and Richard H Ree. Inferring phylogeny and introgression using RADseq data: an example from flowering plants (pedicularis: Orobanchaceae). Syst. Biol., 62(5):689–706, September 2013. B Efron. Computers and the theory of statistics: Thinking the unthinkable. SIAM Rev., 21(4):460–480, October 1979. John Eid, Adrian Fehr, Jeremy Gray, Khai Luong, John Lyle, Geoff Otto, Paul Peluso, David Rank, Primo Baybayan, Brad Bettman, Arkadiusz Bibillo, Keith Bjornson, Bidhan Chaudhuri, Frederick Christians, Ronald Cicero, Sonya Clark, Ravindra Dalal, Alex Dewinter, John Dixon, Mathieu Foquet, Alfred Gaertner, Paul Hardenbol, Cheryl Heiner, Kevin Hester, David Holden, Gregory Kearns, Xiangxu Kong, Ronald Kuse, Yves Lacroix, Steven Lin, Paul Lundquist, Congcong Ma, Patrick Marks, Mark Maxham, Devon Murphy, Insil Park, Thang Pham, Michael Phillips, Joy Roy, Robert Sebra, Gene Shen, Jon Sorenson, Austin Tomaney, Kevin Travers, Mark Trulson, John Vieceli, Jeffrey Wegener, Dawn Wu, Alicia Yang, Denis Zaccarin, Peter Zhao, Frank Zhong, Jonas Korlach, and Stephen Turner. Real-time DNA sequencing from single polymerase molecules. Science, 323(5910):133–138, 2009. J A Eisen. A phylogenomic study of the MutS family of proteins. Nucleic Acids Res., 26(18):4291–4300, September 1998a. J A Eisen.
Phylogenomics: improving functional predictions for uncharacterized genes by evolutionary analysis. Genome Res., 8(3):163–167, March 1998b. J A Eisen, D Kaiser, and R M Myers. Gastrogenomic delights: a movable feast. Nat. Med., 3(10):1076–1078, October 1997. Jonathan A Eisen and Claire M Fraser. Phylogenomics: intersection of evolution and genomics. Science, 300(5626):1706–1707, June 2003. R Ekblom and J Galindo. Applications of next generation sequencing in molecular ecology of non-model organisms. Heredity, 107(1):1–15, July 2011. Robert Ekblom, Linnéa Smeds, and Hans Ellegren. Patterns of sequencing coverage bias revealed by ultra-deep sequencing of vertebrate mitochondria. BMC Genomics, 15:467, June 2014. Embl-Ebi. simNGS and simLibrary: Software for simulating Next-Gen sequencing data. http://www.ebi.ac.uk/goldman-srv/simNGS/, 2010.

Merly Escalona, Sara Rocha, and David Posada. A comparison of tools for the simulation of genomic next-generation sequencing data. Nat. Rev. Genet., 17:459, June 2016. Merly Escalona, Sara Rocha, and David Posada. NGSphy: phylogenomic simulation of next-generation sequencing data. October 2017. B Ewing and P Green. Base-calling of automated sequencer traces using phred. II. error probabilities. Genome Res., 8(3):186–194, March 1998. Brent Ewing, Ladeana Hillier, Michael C Wendl, and Phil Green. Base-Calling of automated sequencer traces using phred. i. accuracy assessment. Genome Res., 8:175–185, 1998. Brant C Faircloth. PHYLUCE is a software package for the analysis of conserved genomic loci. Bioinformatics, 32(5):786–788, March 2016. Brant C Faircloth, John E McCormack, Nicholas G Crawford, Michael G Harvey, Robb T Brumfield, and Travis C Glenn. Ultraconserved elements anchor thousands of genetic markers spanning multiple evolutionary timescales. Syst. Biol., 61(5):717–726, October 2012. Brant C Faircloth, Laurie Sorenson, Francesco Santini, and Michael E Alfaro. A phylogenomic perspective on the radiation of Ray-Finned fishes based upon targeted sequencing of ultraconserved elements (UCEs). PLoS One, 8(6):e65923, June 2013. Nuno R Faria, Andrew Rambaut, Marc A Suchard, Guy Baele, Trevor Bedford, Melissa J Ward, Andrew J Tatem, João D Sousa, Nimalan Arinaminpathy, Jacques Pépin, David Posada, Martine Peeters, Oliver G Pybus, and Philippe Lemey. HIV epidemiology. the early spread and epidemic ignition of HIV-1 in human populations. Science, 346(6205):56–61, October 2014. J Felsenstein. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol., 17(6):368–376, 1981. Joseph Felsenstein. Cases in which parsimony or compatibility methods will be positively misleading. Syst. Biol., 27(4):401–410, December 1978. Joseph Felsenstein.
Confidence limits on phylogenies: an approach using the bootstrap. Evolution, 39(4):783–791, July 1985. Dmitry A Filatov. proseq: A software for preparation and evolutionary analysis of DNA sequence data sets. Mol. Ecol. Notes, 2(4):621–624, December 2002. William Fletcher and Ziheng Yang. INDELible: A flexible simulator of biological sequence evolution. Molecular Biology and Evolution, 26(8):1879–1888, 2009. Paul Flicek and Ewan Birney. Sense from sequence reads: methods for alignment and assembly. Nat. Methods, 6(11 Suppl):S6–S12, November 2009. Nuno A Fonseca, Johan Rung, Alvis Brazma, and John C Marioni. Tools for mapping high-throughput sequencing data. Bioinformatics, 28(24):3169–3177, December 2012.

Matthew Frampton and Richard Houlston. Generation of Artificial FASTQ Files to Evaluate the Performance of Next-Generation Sequencing Pipelines. PLoS One, 7(11): e49110, 2012. Nicolas Galtier. A model of horizontal gene transfer and the bacterial phylogeny problem. Syst. Biol., 56(4):633–642, August 2007. Daniel R Garalde, Elizabeth A Snell, Daniel Jachimowicz, Botond Sipos, Joseph H Lloyd, Mark Bruce, Nadia Pantic, Tigist Admassu, Phillip James, Anthony Warland, Michael Jordan, Jonah Ciccone, Sabrina Serra, Jemma Keenan, Samuel Martin, Luke McNeill, E Jayne Wallace, Lakmal Jayasinghe, Chris Wright, Javier Blasco, Stephen Young, Denise Brocklebank, Sissel Juul, James Clarke, Andrew J Heron, and Daniel J Turner. Highly parallel direct RNA sequencing on an array of nanopores. Nat. Methods, January 2018. Erik Garrison and Gabor Marth. Haplotype-based variant detection from short-read sequencing. July 2012. W Gentzsch. Sun grid engine: towards creating a compute power grid. In Proceedings First IEEE/ACM International Symposium on Cluster Computing and the Grid, pages 35–36. ieeexplore.ieee.org, 2001. André Gilles, Emese Meglécz, Nicolas Pech, Stéphanie Ferreira, Thibaut Malausa, and Jean-François Martin. Accuracy and quality assessment of 454 GS-FLX Titanium pyrosequencing. BMC Genomics, 12(1):245, 2011. Travis C Glenn. Field guide to next-generation DNA sequencers. Mol. Ecol. Resour., 11 (5):759–769, September 2011. Grant T Godden, Ingrid E Jordon-Thaden, Srikar Chamala, Andrew A Crowl, Nicolás García, Charlotte C Germain-Aubrey, J Michael Heaney, Maribeth Latvis, Xinshuai Qi, and Matthew A Gitzendanner. Making next-generation sequencing work for you: approaches and practical considerations for marker development and phylogenetics. Plant Ecol. Divers., 5(4):427–450, December 2012. Gabriel Gonzalez, Michihito Sasaki, Lucy Burkitt-Gray, Tomonori Kamiya, Noriko M Tsuji, Hirofumi Sawa, and Kimihito Ito. 
An optimistic protein assembly from sequence reads salvaged an uncharacterized segment of mouse picobirnavirus. Sci. Rep., 7:40447, January 2017. Vanessa L González, Sónia C S Andrade, Rüdiger Bieler, Timothy M Collins, Casey W Dunn, Paula M Mikkelsen, John D Taylor, and Gonzalo Giribet. A phylogenetic backbone for bivalvia: an RNA-seq approach. Proc. Biol. Sci., 282(1801):20142332, February 2015. Jeffrey M Good. Reduced representation methods for subgenomic enrichment and next-generation sequencing. Methods Mol. Biol., 772:85–103, 2011. Morris Goodman, John Czelusniak, G William Moore, A E Romero-Herrera, and Genji Matsuda. Fitting the gene lineage into its species lineage, a parsimony strategy illustrated by cladograms constructed from globin sequences. Syst. Biol., 28(2):132–163, June 1979.

Sara Goodwin, John D McPherson, and W Richard McCombie. Coming of age: ten years of next-generation sequencing technologies. Nat. Rev. Genet., 17(6):333–351, May 2016.
A Gordon and G J Hannon. Fastx-toolkit. FASTQ/A short-reads preprocessing tools (unpublished), 2010.
Harvey Gould, Jan Tobochnik, Dawn C Meredith, Steven E Koonin, Susan R McKay, and Wolfgang Christian. An introduction to computer simulation methods: Applications to physical systems, 2nd edition. Computers in Physics, 10(4):349–349, July 1996.
R D Gray, A J Drummond, and S J Greenhill. Language phylogenies reveal expansion pulses and pauses in pacific settlement. Science, 323(5913):479–483, January 2009.
S Groothuis, G G van Merode, and A Hasman. Simulation as decision tool for capacity planning. Comput. Methods Programs Biomed., 66(2-3):139–151, September 2001.
Nathan D Grubaugh, Jason T Ladner, Moritz U G Kraemer, Gytis Dudas, Amanda L Tan, Karthik Gangavarapu, Michael R Wiley, Stephen White, Julien Thézé, Diogo M Magnani, Karla Prieto, Daniel Reyes, Andrea M Bingham, Lauren M Paul, Refugio Robles-Sikisaka, Glenn Oliveira, Darryl Pronty, Carolyn M Barcellona, Hayden C Metsky, Mary Lynn Baniecki, Kayla G Barnes, Bridget Chak, Catherine A Freije, Adrianne Gladden-Young, Andreas Gnirke, Cynthia Luo, Bronwyn MacInnis, Christian B Matranga, Daniel J Park, James Qu, Stephen F Schaffner, Christopher Tomkins-Tinch, Kendra L West, Sarah M Winnicki, Shirlee Wohl, Nathan L Yozwiak, Joshua Quick, Joseph R Fauver, Kamran Khan, Shannon E Brent, Robert C Reiner, Jr, Paola N Lichtenberger, Michael J Ricciardi, Varian K Bailey, David I Watkins, Marshall R Cone, Edgar W Kopp, 4th, Kelly N Hogan, Andrew C Cannons, Reynald Jean, Andrew J Monaghan, Robert F Garry, Nicholas J Loman, Nuno R Faria, Mario C Porcelli, Chalmers Vasquez, Elyse R Nagle, Derek A T Cummings, Danielle Stanek, Andrew Rambaut, Mariano Sanchez-Lockhart, Pardis C Sabeti, Leah D Gillis, Scott F Michael, Trevor Bedford, Oliver G Pybus, Sharon Isern, Gustavo Palacios, and Kristian G Andersen. Genomic epidemiology reveals multiple introductions of zika virus into the united states. Nature, 546(7658):401–405, June 2017.
Stéphane Guindon, Jean-François Dufayard, Vincent Lefort, Maria Anisimova, Wim Hordijk, and Olivier Gascuel. New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Syst. Biol., 59(3):307–321, May 2010.
Yan Guo, Fei Ye, Quanghu Sheng, Travis Clark, and David C Samuels. Three-stage quality control strategies for DNA re-sequencing data. Brief. Bioinform., 15(6):879–889, November 2014.
N Gupta and S Grover. Introduction to modeling and simulation. Int. J. IT, Eng. Appl. Sci. Res, 2013.
Brian J Haas, Dirk Gevers, Ashlee M Earl, Mike Feldgarden, Doyle V Ward, Georgia Giannoukos, Dawn Ciulla, Diana Tabbaa, Sarah K Highlander, Erica Sodergren, Barbara Methé, Todd Z DeSantis, Joseph F Petrosino, Rob Knight, and Bruce W Birren. Chimeric 16S rRNA sequence formation and detection in Sanger and 454-pyrosequenced PCR amplicons. Genome Res., 21(3):494–504, 2011.
Oskar Hagen and Tanja Stadler. TreeSimGM: Simulating phylogenetic trees under general Bellman–Harris models with lineage-specific shifts of speciation and extinction in R. Methods Ecol. Evol., October 2017.
Eunjung Han, Janet S Sinsheimer, and John Novembre. Characterizing bias in population genetic inferences from low-coverage sequencing data. Mol. Biol. Evol., 31(3):723–735, March 2014.
Klaas Hartmann, Dennis Wong, and Tanja Stadler. Sampling trees from evolutionary models. Syst. Biol., 59(4):465–476, July 2010.
Michael G Harvey, Brian Tilston Smith, Travis C Glenn, Brant C Faircloth, and Robb T Brumfield. Sequence capture versus restriction site associated DNA sequencing for shallow systematics. Syst. Biol., 65(5):910–924, September 2016.
Ayat Hatem, Doruk Bozdağ, Amanda E Toland, and Ümit V Çatalyürek. Benchmarking short sequence mapping tools. BMC Bioinformatics, 14:184, June 2013.
Joseph Heled and Alexei J Drummond. Bayesian inference of species trees from multilocus data. Mol. Biol. Evol., 27(3):570–580, March 2010.
Joseph Heled, David Bryant, and Alexei J Drummond. Simulating gene trees under the multispecies coalescent and time-dependent migration. BMC Evol. Biol., 13:44, February 2013.
Jody Hey. Using phylogenetic trees to study speciation and extinction. Evolution, 46(3):627–640, June 1992.
Andrew L Hipp, Deren A R Eaton, Jeannine Cavender-Bares, Elisabeth Fitzek, Rick Nipper, and Paul S Manos. A framework phylogeny of the american oak clade based on sequenced RAD data. PLoS One, 9(4):e93975, April 2014.
Cory D Hirsch, Joseph Evans, C Robin Buell, and Candice N Hirsch. Reduced representation approaches to interrogate genome diversity in large repetitive plant genomes. Brief. Funct. Genomics, 13(4):257–267, July 2014.
Sean Hoban, Giorgio Bertorelle, and Oscar E Gaggiotti. Computer simulations: tools for population and evolutionary genetics. Nat. Rev. Genet., 13(February):110–122, 2012.
Emily Hodges, Zhenyu Xuan, Vivekanand Balija, Melissa Kramer, Michael N Molla, Steven W Smith, Christina M Middle, Matthew J Rodesch, Thomas J Albert, Gregory J Hannon, and W Richard McCombie. Genome-wide in situ exon capture for selective resequencing. Nat. Genet., 39(12):1522–1527, December 2007.
Sandra L Hoffberg, Troy J Kieran, Julian M Catchen, Alison Devault, Brant C Faircloth, Rodney Mauricio, and Travis C Glenn. RADcap: sequence capture of dual-digest RADseq libraries with identifiable duplicates and reduced missing data. Mol. Ecol. Resour., 16(5):1264–1278, September 2016.

Paul A Hohenlohe, Susan Bassham, Paul D Etter, Nicholas Stiffler, Eric A Johnson, and William A Cresko. Population genomics of parallel adaptation in threespine stickleback using sequenced RAD tags. PLoS Genet., 6(2):e1000862, February 2010.
Mark Holder and Paul O Lewis. Phylogeny estimation: traditional and bayesian approaches. Nat. Rev. Genet., 4(4):275–284, April 2003.
Manuel Holtgrewe. Mason – A Read Simulator for Second Generation Sequencing Data. Life Sci., (October):18, 2010.
Nils Homer. DWGSIM: Whole Genome Simulator for Next-Generation Sequencing. https://github.com/nh13/DWGSIM, 2010.
David Stephen Horner, Giulio Pavesi, Tiziana Castrignanò, Paolo D’onorio De Meo, Sabino Liuni, Michael Sammeth, Ernesto Picardi, and Graziano Pesole. Bioinformatics approaches for genomics and post genomics applications of next-generation sequencing. Brief. Bioinform., 11(2):181–197, March 2010.
Kevin Howe, Alex Bateman, and Richard Durbin. QuickTree: building huge Neighbour-Joining trees of protein sequences. Bioinformatics, 18(11):1546–1547, November 2002.
Xuesong Hu, Jianying Yuan, Yujian Shi, Jianliang Lu, Binghang Liu, Zhenyu Li, Yanxiang Chen, Desheng Mu, Hao Zhang, Nan Li, Zhen Yue, Fan Bai, Heng Li, and Wei Fan. pIRS: Profile-based Illumina pair-end reads simulator. Bioinformatics, 28(11):1533–1535, June 2012.
Weichun Huang, Leping Li, Jason R Myers, and Gabor T Marth. ART: A next-generation sequencing read simulator. Bioinformatics, 28(4):593–594, February 2012.
R R Hudson. ms a program for generating samples under neutral models. 2004.
J P Huelsenbeck and F Ronquist. MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics, 17(8):754–755, August 2001.
John P Huelsenbeck. Performance of phylogenetic methods in simulation. Syst. Biol., 44(1):17–48, March 1995.
Philip Hunter. The paradox of model organisms. the use of model organisms in research will continue despite their shortcomings. EMBO Rep., 9(8):717–720, August 2008.
Sohyun Hwang, Eiru Kim, Insuk Lee, and Edward M Marcotte. Systematic comparison of variant calling pipelines using gold standard personal exome variants. Sci. Rep., 5:17875, December 2015.
Illumina. MiSeq system. (Figure 3):3–6, 2011.
International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature, 431(7011):931–945, October 2004.

Iker Irisarri, Denis Baurain, Henner Brinkmann, Frédéric Delsuc, Jean-Yves Sire, Alexander Kupfer, Jörn Petersen, Michael Jarek, Axel Meyer, Miguel Vences, and Hervé Philippe. Phylotranscriptomic consolidation of the jawed vertebrate timetree. Nat Ecol Evol, 1(9):1370–1378, September 2017.
Scott A Jackson. Rice: The first crop genome. Rice, 9(1):14, December 2016.

M Jadrić, M Ćukušić, and A Bralić. Comparison of discrete event simulation tools in an academic environment. Croat. Oper. Res. Rev. CRORR, 2014.
Miten Jain, Ian T Fiddes, Karen H Miga, Hugh E Olsen, Benedict Paten, and Mark Akeson. Improved data analysis for the MinION nanopore sequencer. Nat. Methods, 12(4):351–356, 2015.
Erich D Jarvis, Siavash Mirarab, Andre J Aberer, Bo Li, Peter Houde, Cai Li, Simon Y W Ho, Brant C Faircloth, Benoit Nabholz, Jason T Howard, Alexander Suh, Claudia C Weber, Rute R da Fonseca, Jianwen Li, Fang Zhang, Hui Li, Long Zhou, Nitish Narula, Liang Liu, Ganesh Ganapathy, Bastien Boussau, Md Shamsuzzoha Bayzid, Volodymyr Zavidovych, Sankar Subramanian, Toni Gabaldón, Salvador Capella-Gutiérrez, Jaime Huerta-Cepas, Bhanu Rekepalli, Kasper Munch, Mikkel Schierup, Bent Lindow, Wesley C Warren, David Ray, Richard E Green, Michael W Bruford, Xiangjiang Zhan, Andrew Dixon, Shengbin Li, Ning Li, Yinhua Huang, Elizabeth P Derryberry, Mads Frost Bertelsen, Frederick H Sheldon, Robb T Brumfield, Claudio V Mello, Peter V Lovell, Morgan Wirthlin, Maria Paula Cruz Schneider, Francisco Prosdocimi, José Alfredo Samaniego, Amhed Missael Vargas Velazquez, Alonzo Alfaro-Núñez, Paula F Campos, Bent Petersen, Thomas Sicheritz-Ponten, An Pas, Tom Bailey, Paul Scofield, Michael Bunce, David M Lambert, Qi Zhou, Polina Perelman, Amy C Driskell, Beth Shapiro, Zijun Xiong, Yongli Zeng, Shiping Liu, Zhenyu Li, Binghang Liu, Kui Wu, Jin Xiao, Xiong Yinqi, Qiuemei Zheng, Yong Zhang, Huanming Yang, Jian Wang, Linnea Smeds, Frank E Rheindt, Michael Braun, Jon Fjeldsa, Ludovic Orlando, F Keith Barker, Knud Andreas Jønsson, Warren Johnson, Klaus-Peter Koepfli, Stephen O’Brien, David Haussler, Oliver A Ryder, Carsten Rahbek, Eske Willerslev, Gary R Graves, Travis C Glenn, John McCormack, Dave Burt, Hans Ellegren, Per Alström, Scott V Edwards, Alexandros Stamatakis, David P Mindell, Joel Cracraft, Edward L Braun, Tandy Warnow, Wang Jun, M Thomas P Gilbert, and Guojie Zhang. Whole-genome analyses resolve early branches in the tree of life of modern birds. Science, 346(6215):1320–1331, December 2014.
Olivier Jeffroy, Henner Brinkmann, Frédéric Delsuc, and Hervé Philippe. Phylogenomics: the beginning of incongruence? Trends Genet., 22(4):225–231, April 2006.
Aaron R Jex, Ross S Hall, D Timothy J Littlewood, and Robin B Gasser. An integrated pipeline for next-generation sequencing and annotation of mitochondrial genomes. Nucleic Acids Res., 38(2):522–533, January 2010.
Tieming Ji and Jie Chen. Modeling the next generation sequencing read count data for DNA copy number variant study. Stat. Appl. Genet. Mol. Biol., 14(4):361–374, August 2015.

Ben Jia, Liming Xuan, Kaiye Cai, Zhiqiang Hu, Liangxiao Ma, and Chaochun Wei. NeSSM: A Next-Generation Sequencing Simulator for Metagenomics. PLoS One, 8(10):e75448, January 2013.
Yinping Jiao, Paul Peluso, Jinghua Shi, Tiffany Liang, Michelle C Stitzer, Bo Wang, Michael S Campbell, Joshua C Stein, Xuehong Wei, Chen-Shan Chin, Katherine Guill, Michael Regulski, Sunita Kumari, Andrew Olson, Jonathan Gent, Kevin L Schneider, Thomas K Wolfgruber, Michael R May, Nathan M Springer, Eric Antoniou, W Richard McCombie, Gernot G Presting, Michael McMullen, Jeffrey Ross-Ibarra, R Kelly Dawe, Alex Hastie, David R Rank, and Doreen Ware. Improved maize reference genome with single-molecule technologies. Nature, 546(7659):524–527, June 2017.
Stephen Johnson, Brett Trost, Jeffrey R Long, Vanessa Pittet, and Anthony Kusalik. A better sequence-read simulator program for metagenomics. BMC Bioinformatics, 15 Suppl 9(Suppl 9):S14, January 2014.
Matthew R Jones and Jeffrey M Good. Targeted capture in evolutionary and ecological genomics. Mol. Ecol., 25(1):185–202, January 2016.
N A Joshi and J N Fass. Sickle: A sliding-window, adaptive, quality-based trimming tool for FastQ files, 2011.
Thomas H Jukes and Charles R Cantor. Evolution of protein molecules. In Mammalian Protein Metabolism, pages 21–132. 1969.
Wei-Chun Kao, Kristian Stevens, and Yun S Song. BayesCall: A model-based base-calling algorithm for high-throughput short-read sequencing. pages 1884–1895, 2009.
Kazutaka Katoh and Daron M Standley. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evol., 30(4):772–780, April 2013.
Kazutaka Katoh, Kazuharu Misawa, Kei-Ichi Kuma, and Takashi Miyata. MAFFT: a novel method for rapid multiple sequence alignment based on fast fourier transform. Nucleic Acids Res., 30(14):3059–3066, July 2002.
Kevin P Keegan, William L Trimble, Jared Wilkening, Andreas Wilke, Travis Harrison, Mark D’Souza, and Folker Meyer. A Platform-Independent Method for Detecting Errors in Metagenomic Sequencing Data: DRISEE. PLoS Comput. Biol., 8(6):e1002541, 2012.
Jerome Kelleher, Alison M Etheridge, and Gilean McVean. Efficient coalescent simulation and genealogical analysis for large sample sizes. PLoS Comput. Biol., 12(5):e1004842, May 2016.
M Kimura. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol., 16(2):111–120, December 1980.
Martin Kircher and Janet Kelso. High-throughput DNA sequencing - Concepts and limitations. Bioessays, 32(6):524–536, 2010.

L Lacey Knowles. Estimating species trees: methods of phylogenetic analysis when there is incongruence across genes. Syst. Biol., 58(5):463–467, October 2009.
Bjarne Knudsen, Roald Forsberg, and Michael M Miyamoto. A Computer Simulator for Assessing Different Challenges and Strategies of de Novo Sequence Assembly. pages 263–282, 2010.
Daniel C Koboldt, Karyn Meltz Steinberg, David E Larson, Richard K Wilson, and Elaine R Mardis. The Next-Generation sequencing revolution and its impact on genomics. Cell, 155(1):27–38, 2013.
Sergey Koren, Michael C Schatz, Brian P Walenz, Jeffrey Martin, Jason T Howard, Ganeshkumar Ganapathy, Zhong Wang, David A Rasko, W Richard McCombie, Erich D Jarvis, and Adam M Phillippy. Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nat. Biotechnol., 30(7):693–700, 2012.
Thorfinn Sand Korneliussen, Anders Albrechtsen, and Rasmus Nielsen. ANGSD: Analysis of next generation sequencing data. BMC Bioinformatics, 15:356, November 2014.
Alexey Kozlov. raxml-ng, 2017.
Laura S Kubatko, Bryan C Carstens, and L Lacey Knowles. STEM: species tree estimation using maximum likelihood for gene trees under coalescence. Bioinformatics, 25(7):971–973, April 2009.
Laura Salter Kubatko. Identifying hybridization events in the presence of coalescence via model selection. Syst. Biol., 58(5):478–488, October 2009.
M K Kuhner and J Felsenstein. A simulation comparison of phylogeny algorithms under equal and unequal evolutionary rates. Mol. Biol. Evol., 11(3):459–468, May 1994.
Pankaj Kumar, Mashael Al-Shafai, Wadha Ahmed Al Muftah, Nader Chalhoub, Mahmoud F Elsaid, Alice Abdel Aleem, and Karsten Suhre. Evaluation of SNP calling using single and multiple-sample calling algorithms by validation against array base genotyping and mendelian inheritance. BMC Res. Notes, 7:747, October 2014.
Sudhir Kumar and Sankar Subramanian. Mutation rates in mammalian genomes. Proc. Natl. Acad. Sci. U. S. A., 99(2):803–808, January 2002.
Surendra Kumar, Anders K Krabberød, Ralf S Neumann, Katerina Michalickova, Sen Zhao, Xiaoli Zhang, and Kamran Shalchian-Tabrizi. BIR pipeline for preparation of phylogenomic data. Evol. Bioinform. Online, 11:79–83, April 2015.
E S Lander and M S Waterman. Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics, 2(3):231–239, April 1988.
E S Lander, L M Linton, B Birren, C Nusbaum, M C Zody, J Baldwin, K Devon, K Dewar, M Doyle, W FitzHugh, R Funke, D Gage, K Harris, A Heaford, J Howland, L Kann, J Lehoczky, R LeVine, P McEwan, K McKernan, J Meldrim, J P Mesirov, C Miranda, W Morris, J Naylor, C Raymond, M Rosetti, R Santos, A Sheridan, C Sougnez, Y Stange-Thomann, N Stojanovic, A Subramanian, D Wyman, J Rogers, J Sulston, R Ainscough, S Beck, D Bentley, J Burton, C Clee, N Carter, A Coulson, R Deadman, P Deloukas, A Dunham, I Dunham, R Durbin, L French, D Grafham, S Gregory, T Hubbard, S Humphray, A Hunt, M Jones, C Lloyd, A McMurray, L Matthews, S Mercer, S Milne, J C Mullikin, A Mungall, R Plumb, M Ross, R Shownkeen, S Sims, R H Waterston, R K Wilson, L W Hillier, J D McPherson, M A Marra, E R Mardis, L A Fulton, A T Chinwalla, K H Pepin, W R Gish, S L Chissoe, M C Wendl, K D Delehaunty, T L Miner, A Delehaunty, J B Kramer, L L Cook, R S Fulton, D L Johnson, P J Minx, S W Clifton, T Hawkins, E Branscomb, P Predki, P Richardson, S Wenning, T Slezak, N Doggett, J F Cheng, A Olsen, S Lucas, C Elkin, E Uberbacher, M Frazier, R A Gibbs, D M Muzny, S E Scherer, J B Bouck, E J Sodergren, K C Worley, C M Rives, J H Gorrell, M L Metzker, S L Naylor, R S Kucherlapati, D L Nelson, G M Weinstock, Y Sakaki, A Fujiyama, M Hattori, T Yada, A Toyoda, T Itoh, C Kawagoe, H Watanabe, Y Totoki, T Taylor, J Weissenbach, R Heilig, W Saurin, F Artiguenave, P Brottier, T Bruls, E Pelletier, C Robert, P Wincker, D R Smith, L Doucette-Stamm, M Rubenfield, K Weinstock, H M Lee, J Dubois, A Rosenthal, M Platzer, G Nyakatura, S Taudien, A Rump, H Yang, J Yu, J Wang, G Huang, J Gu, L Hood, L Rowen, A Madan, S Qin, R W Davis, N A Federspiel, A P Abola, M J Proctor, R M Myers, J Schmutz, M Dickson, J Grimwood, D R Cox, M V Olson, R Kaul, C Raymond, N Shimizu, K Kawasaki, S Minoshima, G A Evans, M Athanasiou, R Schultz, B A Roe, F Chen, H Pan, J Ramser, H Lehrach, R Reinhardt, W R McCombie, M de la Bastide, N Dedhia, H Blöcker, K Hornischer, G Nordsiek, R Agarwala, L Aravind, J A Bailey, A Bateman, S Batzoglou, E Birney, P Bork, D G Brown, C B Burge, L Cerutti, H C Chen, D Church, M Clamp, R R Copley, T Doerks, S R Eddy, E E Eichler, T S Furey, J Galagan, J G Gilbert, C Harmon, Y Hayashizaki, D Haussler, H Hermjakob, K Hokamp, W Jang, L S Johnson, T A Jones, S Kasif, A Kaspryzk, S Kennedy, W J Kent, P Kitts, E V Koonin, I Korf, D Kulp, D Lancet, T M Lowe, A McLysaght, T Mikkelsen, J V Moran, N Mulder, V J Pollara, C P Ponting, G Schuler, J Schultz, G Slater, A F Smit, E Stupka, J Szustakowki, D Thierry-Mieg, J Thierry-Mieg, L Wagner, J Wallis, R Wheeler, A Williams, Y I Wolf, K H Wolfe, S P Yang, R F Yeh, F Collins, M S Guyer, J Peterson, A Felsenfeld, K A Wetterstrand, A Patrinos, M J Morgan, P de Jong, J J Catanese, K Osoegawa, H Shizuya, S Choi, Y J Chen, J Szustakowki, and International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature, 409(6822):860–921, February 2001.
Ben Langmead and Steven L Salzberg. Fast gapped-read alignment with bowtie 2. Nat. Methods, 9(4):357–359, March 2012.
Hayley C Lanier, Huateng Huang, and L Lacey Knowles. How low can you go? the effects of mutation rate on the accuracy of species-tree estimation. Mol. Phylogenet. Evol., 70:112–119, January 2014.
Bret R Larget, Satish K Kotha, Colin N Dewey, and Cécile Ané. BUCKy: gene tree/species tree reconciliation with bayesian concordance analysis. Bioinformatics, 26(22):2910–2911, November 2010.
Nicolas Lartillot, Thomas Lepage, and Samuel Blanquart. PhyloBayes 3: a bayesian software package for phylogenetic reconstruction and molecular dating. Bioinformatics, 25(17):2286–2288, September 2009.

Michael Lässig, Ville Mustonen, and Aleksandra M Walczak. Predicting evolution. Nat Ecol Evol, 1(3):77, February 2017.
T Laver, J Harrison, P A O’Neill, K Moore, A Farbos, K Paszkiewicz, and D J Studholme. Assessing the performance of the Oxford Nanopore Technologies MinION. Biomolecular Detection and Quantification, 3:1–8, 2015.
Adam D Leaché, Rebecca B Harris, Bruce Rannala, and Ziheng Yang. The influence of gene flow on species tree estimation: A simulation study. Syst. Biol., 63(1):17–30, January 2014.
Adam D Leaché, Andreas S Chavez, Leonard N Jones, Jared A Grummer, Andrew D Gottscho, and Charles W Linkem. Phylogenomics of phrynosomatid lizards: conflicting signals from sequence capture versus restriction site associated DNA sequencing. Genome Biol. Evol., 7(3):706–719, February 2015.
C Ledergerber and C Dessimoz. Base-calling for next-generation sequencing platforms. Brief. Bioinform., 12(5):489–497, 2011.
Hayan Lee, James Gurtowski, Shinjae Yoo, Shoshana Marcus, W Richard McCombie, and Michael Schatz. Error correction and assembly complexity of single molecule sequencing reads. bioRxiv, page 6395, 2014.
J W Leigh and D Bryant. popart: full-feature software for haplotype network construction. Methods Ecol. Evol., 2015.
Alan R Lemmon and Emily Moriarty Lemmon. High-throughput identification of informative nuclear loci for shallow-scale phylogenetics and phylogeography. Syst. Biol., 61(5):745–761, October 2012.
Alan R Lemmon, Sandra A Emme, and Emily Moriarty Lemmon. Anchored hybrid enrichment for massively high-throughput phylogenomics. Syst. Biol., 61(5):727–744, October 2012.
Samuel Levy, Granger Sutton, Pauline C Ng, Lars Feuk, Aaron L Halpern, Brian P Walenz, Nelson Axelrod, Jiaqi Huang, Ewen F Kirkness, Gennady Denisov, Yuan Lin, Jeffrey R MacDonald, Andy Wing Chun Pang, Mary Shago, Timothy B Stockwell, Alexia Tsiamouri, Vineet Bafna, Vikas Bansal, Saul A Kravitz, Dana A Busam, Karen Y Beeson, Tina C McIntosh, Karin A Remington, Josep F Abril, John Gill, Jon Borman, Yu-Hui Rogers, Marvin E Frazier, Stephen W Scherer, Robert L Strausberg, and J Craig Venter. The diploid genome sequence of an individual human. PLoS Biol., 5(10):e254, September 2007.
Bo Li, Nathanael Fillmore, and Yongsheng Bai. Evaluation of de novo transcriptome assemblies from RNA-Seq data. pages 0–30, 2014.
Heng Li. wgsim - Read simulator for next generation sequencing, 2011a.
Heng Li. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics, 27(21):2987–2993, November 2011b.

Heng Li. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. March 2013.
Heng Li and Richard Durbin. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25(14):1754–1760, July 2009.
Heng Li, Bob Handsaker, Alec Wysoker, Tim Fennell, Jue Ruan, Nils Homer, Gabor Marth, Goncalo Abecasis, and Richard Durbin. The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25(16):2078–2079, 2009a.
Ruiqiang Li, Yingrui Li, Karsten Kristiansen, and Jun Wang. SOAP: short oligonucleotide alignment program. Bioinformatics, 24(5):713–714, March 2008.
Ruiqiang Li, Chang Yu, Yingrui Li, Tak-Wah Lam, Siu-Ming Yiu, Karsten Kristiansen, and Jun Wang. SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics, 25(15):1966–1967, August 2009b.
Yong Lin, Jian Li, Hui Shen, Lei Zhang, Christopher J Papasian, and Hong-Wen Deng. Comparative studies of de novo assembly tools for next-generation sequencing technologies. Bioinformatics, 27(15):2031–2037, August 2011.
Binghang Liu, Jianying Yuan, Siu-Ming Yiu, Zhenyu Li, Yinlong Xie, Yanxiang Chen, Yujian Shi, Hao Zhang, Yingrui Li, Tak-Wah Lam, and Ruibang Luo. COPE: an accurate k-mer-based pair-end reads connection tool to facilitate genome assembly. Bioinformatics, 28(22):2870–2874, November 2012a.
L Liu, Z Xi, S Wu, C C Davis, and others. Estimating phylogenetic trees from genome-scale data. Ann. N. Y. Acad. Sci., 2015a.
Liang Liu. BEST: Bayesian estimation of species trees under the coalescent model. Bioinformatics, 24(21):2542–2543, November 2008.
Liang Liu and Lili Yu. Estimating species trees from unrooted gene trees. Syst. Biol., 60(5):661–667, October 2011.
Liang Liu, Lili Yu, Laura Kubatko, Dennis K Pearl, and Scott V Edwards. Coalescent methods for estimating phylogenetic trees. Mol. Phylogenet. Evol., 53(1):320–328, October 2009.
Liang Liu, Lili Yu, and Scott V Edwards. A maximum pseudo-likelihood approach for estimating species trees under the coalescent model. BMC Evol. Biol., 10(1):302, 2010.
Liang Liu, Shaoyuan Wu, and Lili Yu. Coalescent methods for estimating species trees from phylogenomic data. J. Syst. Evol., 53(5):380–390, September 2015b.
Lin Liu, Yinhu Li, Siliang Li, Ni Hu, Yimin He, Ray Pong, Danni Lin, Lihua Lu, and Maggie Law. Comparison of next-generation sequencing systems. J. Biomed. Biotechnol., 2012, 2012b.

Nicholas J Loman, Raju V Misra, Timothy J Dallman, Chrystala Constantinidou, Saheer E Gharbia, John Wain, and Mark J Pallen. Performance comparison of benchtop high-throughput sequencing platforms. Nat. Biotechnol., 30(5):434–439, May 2012.
Nicholas James Loman, Joshua Quick, and Jared T Simpson. A complete bacterial genome assembled de novo using only nanopore sequencing data. bioRxiv, (June):015552, 2015.
Gerton Lunter and Martin Goodson. Stampy: a statistical algorithm for sensitive and fast mapping of illumina sequence reads. Genome Res., 21(6):936–939, June 2011.
Michael Lynch. Evolution of the mutation rate. Trends Genet., 26(8):345–352, August 2010.
Michael Lynch and John S Conery. The origins of genome complexity. Science, 302(5649):1401–1404, November 2003.
M Burrows and D J Wheeler. A block-sorting lossless data compression algorithm. 1994.
W P Maddison and D R Maddison. Mesquite: a modular system for evolutionary analysis, 2017.
Mohammed-Amin Madoui, Stefan Engelen, Corinne Cruaud, Caroline Belser, Laurie Bertrand, Adriana Alberti, Arnaud Lemainque, Patrick Wincker, and Jean-Marc Aury. Genome assembly using nanopore-guided long and error-free DNA reads. BMC Genomics, 16(1):327, 2015.
Tanja Magoč and Steven L Salzberg. FLASH: fast length adjustment of short reads to improve genome assemblies. Bioinformatics, 27(21):2957–2963, November 2011.
Thomas Mailund and Christian N S Pedersen. QuickJoin—fast neighbour-joining tree reconstruction. Bioinformatics, 20(17):3261–3262, November 2004.
Thomas Mailund, Gerth S Brodal, Rolf Fagerberg, Christian N S Pedersen, and Derek Phillips. Recrafting the neighbor-joining method. BMC Bioinformatics, 7:29, January 2006.
Diego Mallo and David Posada. Multilocus inference of species trees and DNA barcoding. Philos. Trans. R. Soc. Lond. B Biol. Sci., 371(1702), September 2016.
Diego Mallo, Leonardo de Oliveira Martins, and David Posada. Estimation of Species Trees. John Wiley & Sons, Ltd, Chichester, UK, els edition, May 2014.
Diego Mallo, Leonardo de Oliveira Martins, and David Posada. SimPhy: Phylogenomic simulation of gene, locus and species trees. Syst. Biol., 65(2):334–344, November 2015a.
Diego Mallo, Agustín Sánchez-Cobos, and Miguel Arenas. Diverse considerations for successful phylogenetic tree reconstruction: Impacts from model misspecification, recombination, homoplasy, and pattern recognition. In Pattern Recognition in Computational Molecular Biology, pages 439–456. John Wiley & Sons, Inc, 2015b.

Lira Mamanova, Alison J Coffey, Carol E Scott, Iwanka Kozarewa, Emily H Turner, Akash Kumar, Eleanor Howard, Jay Shendure, and Daniel J Turner. Target-enrichment strategies for next-generation sequencing. Nat. Methods, 7(2):111–118, 2010.
Elaine R Mardis. The impact of next-generation sequencing technology on genetics. (February):133–141, 2008.
Elaine R Mardis. A decade’s perspective on DNA sequencing technology. Nature, 470(7333):198–203, February 2011.
Mark J Margres, Alyssa T Bigelow, Emily Moriarty Lemmon, Alan R Lemmon, and Darin R Rokyta. Selection to increase expression, not sequence diversity, precedes gene family origin and expansion in rattlesnake venom. Genetics, 206(3):1569–1580, July 2017.
Elliott H Margulies, Mathieu Blanchette, NISC Comparative Sequencing Program, David Haussler, and Eric D Green. Identification and characterization of multi-species conserved sequences. Genome Res., 13(12):2507–2518, December 2003.
Marcel Margulies, Michael Egholm, William E Altman, Said Attiya, Joel S Bader, Lisa A Bemben, Jan Berka, Michael S Braverman, Yi-Ju Chen, Zhoutao Chen, Scott B Dewell, Lei Du, Joseph M Fierro, Xavier V Gomes, Brian C Godwin, Wen He, Scott Helgesen, Chun Heen Ho, Gerard P Irzyk, Szilveszter C Jando, Maria L I Alenquer, Thomas P Jarvie, Kshama B Jirage, Jong-Bum Kim, James R Knight, Janna R Lanza, John H Leamon, Steven M Lefkowitz, Ming Lei, Jing Li, Kenton L Lohman, Hong Lu, Vinod B Makhijani, Keith E McDade, Michael P McKenna, Eugene W Myers, Elizabeth Nickerson, John R Nobile, Ramona Plant, Bernard P Puc, Michael T Ronan, George T Roth, Gary J Sarkis, Jan Fredrik Simons, John W Simpson, Maithreyan Srinivasan, Karrie R Tartaro, Alexander Tomasz, Kari A Vogt, Greg A Volkmer, Shally H Wang, Yong Wang, Michael P Weiner, Pengguang Yu, Richard F Begley, and Jonathan M Rothberg. Genome sequencing in microfabricated high-density picolitre reactors. Nature, 437(September):376–380, 2005.
Anu Maria. Introduction to modeling and simulation. In Proceedings of the 29th Conference on Winter Simulation, WSC ’97, pages 7–13, Washington, DC, USA, 1997. IEEE Computer Society.
Tomislav Maricic, Mark Whitten, and Svante Pääbo. Multiplexed DNA sequence capture of mitochondrial genomes using PCR products. PLoS One, 5(11):e14004, November 2010.
Jeffrey A Martin and Zhong Wang. Next-generation transcriptome assembly. Nat. Rev. Genet., 12(10):671–682, September 2011.
Marcel Martin. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal, 17(1):10–12, May 2011.
Takahiro Maruki and Michael Lynch. Genotype calling from Population-Genomic sequencing data. G3, 7(5):1393–1404, May 2017.

Konstantinos Mavromatis, Natalia Ivanova, Kerrie Barry, Harris Shapiro, Eugene Goltsman, Alice C McHardy, Isidore Rigoutsos, Asaf Salamov, Frank Korzeniewski, Miriam Land, Alla Lapidus, Igor Grigoriev, Paul Richardson, Philip Hugenholtz, and Nikos C Kyrpides. Use of simulated data sets to evaluate the fidelity of metagenomic processing methods. Nat. Methods, 4(6):495–500, 2007.
Braedan M McCluskey and John H Postlethwait. Phylogeny of zebrafish, a “model species,” within danio, a “model genus”. Mol. Biol. Evol., 32(3):635–652, March 2015.
John E McCormack, Sarah M Hird, Amanda J Zellmer, Bryan C Carstens, and Robb T Brumfield. Applications of next-generation sequencing to phylogeography and phylogenetics. Mol. Phylogenet. Evol., 66(2):526–538, February 2013.
Kerensa E McElroy, Fabio Luciani, and Torsten Thomas. GemSIM: general, error-model based simulator of next-generation sequencing data. BMC Genomics, 13(1):74, January 2012.
Aaron Mckenna, Matthew Hanna, Eric Banks, Andrey Sivachenko, Kristian Cibulskis, Andrew Kernytsky, Kiran Garimella, David Altshuler, Stacey Gabriel, Mark Daly, and Mark A Depristo. The genome analysis toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Research, 20:1297–1303, 2010.
Emily Jane McTavish, James Pettengill, Steven Davis, Hugh Rand, Errol Strain, Marc Allard, and Ruth E Timme. TreeToReads - a pipeline for simulating raw reads from phylogenies. BMC Bioinformatics, 18(1):178, March 2017.
Florian Mertes, Abdou Elsharawy, Sascha Sauer, Joop M L M van Helvoort, P J van der Zaag, Andre Franke, Mats Nilsson, Hans Lehrach, and Anthony J Brookes. Targeted enrichment of genomic DNA regions for next-generation sequencing. Brief. Funct. Genomics, 10(6):374–386, November 2011.
Michael L Metzker. Sequencing technologies - the next generation. Nat. Rev. Genet., 11(1):31–46, January 2010.
Michael L Metzker, David P Mindell, Xiao-Mei Liu, Roger G Ptak, Richard A Gibbs, and David M Hillis. Molecular evidence of HIV-1 transmission in a criminal case. Proc. Natl. Acad. Sci. U. S. A., 99(22):14292–14297, October 2002.
Jason R Miller, Sergey Koren, and Granger Sutton. Assembly algorithms for next-generation sequencing data. Genomics, 95(6):315–327, 2010.
Michael R Miller, Tressa S Atwood, B Frank Eames, Johann K Eberhart, Yi-Lin Yan, John H Postlethwait, and Eric A Johnson. RAD marker microarrays enable rapid mapping of zebrafish mutations. Genome Biol., 8(6):R105, 2007.
S Mirarab, R Reaz, Md S Bayzid, T Zimmermann, M S Swenson, and T Warnow. ASTRAL: genome-scale coalescent-based species tree estimation. Bioinformatics, 30(17):i541–8, September 2014.

Siavash Mirarab and Tandy Warnow. ASTRAL-II: coalescent-based species tree esti- mation with many hundreds of taxa and thousands of genes. Bioinformatics, 31(12): i44–52, June 2015. Siavash Mirarab, Md Shamsuzzoha Bayzid, and Tandy Warnow. Evaluating summary methods for multilocus species tree estimation in the presence of incomplete lineage sorting. Syst. Biol., 65(3):366–380, May 2016. Bernhard Misof, Shanlin Liu, Karen Meusemann, Ralph S Peters, Alexander Donath, Christoph Mayer, Paul B Frandsen, Jessica Ware, Tomáš Flouri, Rolf G Beutel, Oliver Niehuis, Malte Petersen, Fernando Izquierdo-Carrasco, Torsten Wappler, Jes Rust, Andre J Aberer, Ulrike Aspöck, Horst Aspöck, Daniela Bartel, Alexander Blanke, Simon Berger, Alexander Böhm, Thomas R Buckley, Brett Calcott, Junqing Chen, Frank Friedrich, Makiko Fukui, Mari Fujita, Carola Greve, Peter Grobe, Shengchang Gu, Ying Huang, Lars S Jermiin, Akito Y Kawahara, Lars Krogmann, Martin Kubiak, Robert Lanfear, Harald Letsch, Yiyuan Li, Zhenyu Li, Jiguang Li, Haorong Lu, Ryuichiro Machida, Yuta Mashimo, Pashalia Kapli, Duane D McKenna, Guanliang Meng, Yasutaka Nakagaki, José Luis Navarrete-Heredia, Michael Ott, Yanxiang Ou, Günther Pass, Lars Podsiadlowski, Hans Pohl, Björn M von Reumont, Kai Schütte, Kaoru Sekiya, Shota Shimizu, Adam Slipinski, Alexandros Stamatakis, Wenhui Song, Xu Su, Nikolaus U Szucsich, Meihua Tan, Xuemei Tan, Min Tang, Jingbo Tang, Gerald Timelthaler, Shigekazu Tomizuka, Michelle Trautwein, Xiaoli Tong, Toshiki Uchifune, Manfred G Walzl, Brian M Wiegmann, Jeanne Wilbrandt, Benjamin Wipfler, Thomas K F Wong, Qiong Wu, Gengxiong Wu, Yinlong Xie, Shenzhou Yang, Qing Yang, David K Yeates, Kazunori Yoshizawa, Qing Zhang, Rui Zhang, Wenwei Zhang, Yunhui Zhang, Jing Zhao, Chengran Zhou, Lili Zhou, Tanja Ziesmann, Shijie Zou, Yingrui Li, Xun Xu, Yong Zhang, Huanming Yang, Jian Wang, Jun Wang, Karl M Kjer, and Xin Zhou. Phylogenomics resolves the timing and pattern of insect evolution. 
Science, 346(6210):763–767, November 2014. Olena Morozova and Marco a Marra. Applications of next-generation sequencing technologies in functional genomics. Genomics, 92:255–264, 2008. K Nakamura, T Oshima, T Morimoto, S Ikeda, H Yoshikawa, Y Shiwa, S Ishikawa, M C Linak, A Hirai, H Takahashi, M Altaf-Ul-Amin, N Ogasawara, and S Kanaya. Sequence-specific error profile of Illumina sequencers. Nucleic Acids Res., 39(13): e90–e90, 2011. National Academy of Sciences. In the Light of Evolution: Volume X: Comparative Phylo- geography. The National Academies Press, Washington, DC, 2017. T H Naylor. Computer simulation techniques. 1966. S Nee and R M May. Extinction and the loss of evolutionary history. Science, 278(5338): 692–694, October 1997. B Nevado, S E Ramos-Onsins, and M Perez-Enciso. Resequencing studies of nonmodel organisms using closely related reference genomes: optimal experimental designs and bioinformatics approaches for population genomics. Mol. Ecol., 23(7):1764–1779, April 2014. 150 References

Lam-Tung Nguyen, Heiko A Schmidt, Arndt von Haeseler, and Bui Quang Minh. IQ- TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol. Biol. Evol., 32(1):268–274, January 2015. Kwangsik Nho, John D West, Huian Li, Robert Henschel, Apoorva Bharthur, Michel C Tavares, and Andrew J Saykin. Comparison of Multi-Sample variant calling methods for whole genome sequencing. IEEE Int Conf Systems Biol, 2014:59–62, October 2014. Rasmus Nielsen, Joshua S Paul, Anders Albrechtsen, and Yun S Song. Genotype and SNP calling from next-generation sequencing data. Nat. Rev. Genet., 12(6):443–451, 2011. Rasmus Nielsen, Thorfinn Korneliussen, Anders Albrechtsen, Yingrui Li, and Jun Wang. SNP calling, genotype calling, and sample allele frequency estimation from New-Generation sequencing data. PLoS One, 7(7):e37558, January 2012. Beifang Niu, Limin Fu, Shulei Sun, and Weizhong Li. Artificial and natural duplicates in pyrosequencing reads of metagenomic data. BMC Bioinformatics, 11:187, April 2010. Huw A Ogilvie, Joseph Heled, Dong Xie, and Alexei J Drummond. Computational performance and statistical accuracy of *BEAST and comparisons with other methods. Syst. Biol., 65(3):381–396, May 2016. Huw A Ogilvie, Remco R Bouckaert, and Alexei J Drummond. StarBEAST2 brings faster species tree inference and accurate estimates of substitution rates. Mol. Biol. Evol., 34(8):2101–2114, August 2017. Susumu Ohno. Evolution by Gene Duplication. 1970. Yukiteru Ono, Kiyoshi Asai, and Michiaki Hamada. PBSIM: PacBio reads simulator– toward accurate genome assembly. Bioinformatics, 29(1):119–121, January 2013. P Pamilo and M Nei. Relationships between gene trees and species trees. Mol. Biol. Evol., 5(5):568–583, September 1988. Emmanuel Paradis. Analysis of Phylogenetics and Evolution with R. Springer Science & Business Media, November 2011. Emmanuel Paradis, Julien Claude, and Korbinian Strimmer. APE: Analyses of phyloge- netics and evolution in R language. 
Bioinformatics, 20(2):289–290, January 2004. Ravi K Patel and Mukesh Jain. NGS QC toolkit: a toolkit for quality control of next generation sequencing data. PLoS One, 7(2):e30619, February 2012. Swetansu Pattnaik, Saurabh Gupta, Arjun A Rao, and Binay Panda. SInC: an accurate and fast error-model based simulator for SNPs, Indels and CNVs coupled with a read generator for short-read sequence data. BMC Bioinformatics, 15(1):1–9, 2014. References 151

Pedro L V Peloso, Darrel R Frost, Stephen J Richards, Miguel T Rodrigues, Stephen Donnellan, Masafumi Matsui, Cristopher J Raxworthy, S D Biju, Emily Moriarty Lemmon, Alan R Lemmon, and Ward C Wheeler. The impact of anchored phy- logenomics and taxon sampling on phylogenetic inference in narrow-mouthed frogs (anura, microhylidae). Cladistics, 32(2):113–140, April 2016. J V Peñalba, L L Smith, M A Tonione, and others. Sequence capture using PCRgen- erated probes: a costeffective method of targeted highthroughput sequencing for nonmodel organisms. Mol. Ecol., 2014. Mihaela Pertea. The human transcriptome: an unfinished story. Genes, 3(3):344–360, September 2012. Hervé Philippe and Mathieu Blanchette. Proceedings of the first international conference on phylogenomics. march 15-19, 2006. quebec, canada. BMC Evol. Biol., 7 Suppl 1(1): S1–16, February 2007. Miguel Pignatelli and Andrés Moya. Evaluating the fidelity of de novo short read metagenomic assembly using simulated data. PLoS One, 6(5):e19984, January 2011. Mehdi Pirooznia, Melissa Kramer, Jennifer Parla, Fernando S Goes, James B Potash, W Richard McCombie, and Peter P Zandi. Validation and assessment of variant calling pipelines for next-generation sequencing. Hum. Genomics, 8:14, July 2014. Mihai Pop. Genome assembly reborn: recent computational challenges. Brief. Bioinform., 10(4):354–366, July 2009. Gregory J Porreca, Kun Zhang, Jin Billy Li, Bin Xie, Derek Austin, Sara L Vassallo, Emily M LeProust, Bill J Peck, Christopher J Emig, Fredrik Dahl, Yuan Gao, George M Church, and Jay Shendure. Multiplex amplification of large sets of human exons. Nat. Methods, 4(11):931–936, November 2007. David Posada. Phylogenomics for systematic biology. Syst. Biol., 65(3):353–356, May 2016. Alastair J Potts, Terry A Hedderson, and Guido W Grimm. Constructing phylogenies in the presence of intra-individual site polymorphisms (2ISPs) with a focus on the nuclear ribosomal cistron. Syst. Biol., 63(1):1–16, January 2014. 
Diogo Pratas, Armando J Pinho, and João M O S Rodrigues. XS: a FASTQ read simulator. BMC Res. Notes, 7:40, January 2014. Morgan N Price, Paramvir S Dehal, and Adam P Arkin. FastTree: computing large minimum evolution trees with profiles instead of a distance matrix. Mol. Biol. Evol., 26(7):1641–1650, July 2009. Morgan N Price, Paramvir S Dehal, and Adam P Arkin. FastTree 2–approximately maximum-likelihood trees for large alignments. PLoS One, 5(3):e9490, March 2010. Richard O Prum, Jacob S Berv, Alex Dornburg, Daniel J Field, Jeffrey P Townsend, Emily Moriarty Lemmon, and Alan R Lemmon. A comprehensive phylogeny of birds (aves) using targeted next-generation DNA sequencing. Nature, 526(7574):569–573, October 2015. 152 References

R Alexander Pyron, Felisa W Hsieh, Alan R Lemmon, Emily M Lemmon, and Catri- ona R Hendry. Integrating phylogenomic and morphological data to assess candidate species-delimitation models in brown and red-bellied snakes (storeria). Zool. J. Linn. Soc., 177(4):937–949, August 2016. Michael Quail, Miriam E Smith, Paul Coupland, Thomas D Otto, Simon R Harris, Thomas Richard Connor, Anna Bertoni, Harold P Swerdlow, and Yong Gu. A tale of three next generation sequencing platforms: comparison of Ion torrent, pacific biosciences and illumina MiSeq sequencers. BMC Genomics, 13(1):1, 2012. Joshua Quick, Aaron R Quinlan, and Nicholas J Loman. A reference bacterial genome dataset generated on the MinION(TM) portable single-molecule nanopore sequencer. Gigascience, pages 1–6, 2014. Aaron R Quinlan and Ira M Hall. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics, 26(6):841–842, March 2010. Andrew Rambaut and Nicholas C Grassly. Seq-Gen: an application for the monte carlo simulation of DNA sequence evolution along phylogenetic trees. 13(3):235–238, 1997. B Rannala and Z Yang. Probability distribution of molecular evolutionary trees: a new method of phylogenetic inference. J. Mol. Evol., 43(3):304–311, September 1996. Bruce Rannala and Ziheng Yang. Bayes estimation of species divergence times and ancestral population sizes using DNA sequences from multiple loci. Genetics, 164(4): 1645–1656, August 2003. Matthew D Rasmussen and Manolis Kellis. Unified modeling of gene duplication, loss, and coalescence using a locus tree. Genome Res., 22(4):755–765, April 2012. L J Revell. phytools: an R package for phylogenetic comparative biology (and other things). Methods Ecol. Evol., 2012. Daniel C Richter, Felix Ott, Alexander F Auch, Ramona Schmid, and Daniel H Huson. MetaSim: A Sequencing Simulator for Genomics and Metagenomics. Handbook of Molecular Microbial Ecology I: Metagenomics and Complementary Approaches, 3(10): 417–421, January 2011. 
Nora Rieber, Marc Zapatka, Bärbel Lasitschka, David Jones, Paul Northcott, Barbara Hutter, Natalie Jäger, Marcel Kool, Michael Taylor, Peter Lichter, Stefan Pfister, Stephan Wolf, Benedikt Brors, and Roland Eils. Coverage bias and sensitivity of variant calling for four whole-genome sequencing technologies. PLoS One, 8(6):e66621, June 2013. Anna Ritz, Pamela L Paris, Michael M Ittmann, Colin Collins, and Benjamin J Raphael. Detection of recurrent rearrangement breakpoints from copy number data. BMC Bioinformatics, 12:114, April 2011. Kimberly Robasky, Nathan E Lewis, and George M Church. The role of replicates for error mitigation in next-generation sequencing. Nat. Rev. Genet., 15(1):56–62, 2013. References 153

D F Robinson and L R Foulds. Comparison of phylogenetic trees. Math. Biosci., 53(1): 131–147, February 1981. Fritz Rohrlich. Computer simulation in the physical sciences. PSA: Proceedings of the Biennial Meeting of the Philosophy of Science Association, 1990(2):507–518, 1990. Darin R Rokyta, Alan R Lemmon, Mark J Margres, and Karalyn Aronow. The venom- gland transcriptome of the eastern diamondback rattlesnake (crotalus adamanteus). BMC Genomics, 13(1):312, 2012. M Ronaghi. Pyrosequencing sheds light on DNA sequencing. Genome Res., 11(1):3–11, January 2001. M Ronaghi, S Ka ramohamed, B Pettersson, M Uhlén, and P Nyrén. Real-time DNA sequencing using detection of pyrophosphate release. Anal. Biochem., 242(1):84–89, November 1996. M Ronaghi, M Uhlén, and P Nyrén. A sequencing method based on real-time pyrophos- phate. Science, 281(9):363, 365, 1998. Fredrik Ronquist and John P Huelsenbeck. MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics, 19(12):1572–1574, August 2003. Michael G Ross, Carsten Russ, Maura Costello, Andrew Hollinger, Niall J Lennon, Ryan Hegarty, Chad Nusbaum, and David B Jaffe. Characterizing and measuring bias in sequence data. Genome Biol., 14(5):R51, May 2013. Jonathan M Rothberg, Wolfgang Hinz, Todd M Rearick, Jonathan Schultz, William Mileski, Mel Davey, John H Leamon, Kim Johnson, Mark J Milgrew, Matthew Ed- wards, Jeremy Hoon, Jan F Simons, David Marran, Jason W Myers, John F Davidson, Annika Branting, John R Nobile, Bernard P Puc, David Light, Travis A Clark, Martin Huber, Jeffrey T Branciforte, Isaac B Stoner, Simon E Cawley, Michael Lyons, Yutao Fu, Nils Homer, Marina Sedova, Xin Miao, Brian Reed, Jeffrey Sabina, Erika Feier- stein, Michelle Schorn, Mohammad Alanjary, Eileen Dimalanta, Devin Dressman, Rachel Kasinskas, Tanya Sokolsky, Jacqueline A Fidanza, Eugeni Namsaraev, Kevin J McKernan, Alan Williams, G Thomas Roth, and James Bustillo. 
An integrated semiconductor device enabling non-optical genome sequencing. Nature, 475(7356): 348–352, 2011. James J Russell, Julie A Theriot, Pranidhi Sood, Wallace F Marshall, Laura F Landweber, Lillian Fritz-Laylin, Jessica K Polka, Snezhana Oliferenko, Therese Gerbich, Amy Gladfelter, James Umen, Magdalena Bezanilla, Madeline A Lancaster, Shuonan He, Matthew C Gibson, Bob Goldstein, Elly M Tanaka, Chi-Kuo Hu, and Anne Brunet. Non-model model organisms. BMC Biol., 15(1):55, June 2017. A Rzhetsky and M Nei. Statistical properties of the ordinary least-squares, generalized least-squares, and minimum-evolution methods of phylogenetic inference. J. Mol. Evol., 35(4):367–375, October 1992. R K Saiki, D H Gelfand, S Stoffel, S J Scharf, R Higuchi, G T Horn, K B Mullis, and H A Erlich. Primer-directed enzymatic amplification of DNA with a thermostable DNA polymerase. Science, 239(4839):487–491, January 1988. 154 References

N Saitou and M Nei. The number of nucleotides required to determine the branch- ing order of three species, with special reference to the human-chimpanzee-gorilla divergence. J. Mol. Evol., 24(1-2):189–204, 1986. N Saitou and M Nei. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol., 4(4):406–425, July 1987. L Salmela and E Rivals. LoRDEC: accurate and efficient long read error correction. Bioinformatics, 30(24):3506–3514, 2014. Steven L Salzberg, Adam M Phillippy, Aleksey Zimin, Daniela Puiu, Tanja Magoc, Sergey Koren, Todd J Treangen, Michael C Schatz, Arthur L Delcher, Michael Roberts, Guillaume Marçais, Mihai Pop, and James A Yorke. GAGE: A critical evaluation of genome assemblies and assembly algorithms. Genome Res., 22(3): 557–567, March 2012. F Sanger. The croonian lecture, 1975. nucleotide sequences in DNA. Proc. R. Soc. Lond. B Biol. Sci., 191(1104):317–333, December 1975. F Sanger and A R Coulson. A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase. J. Mol. Biol., 94(3):441–448, May 1975. F Sanger, S Nicklen, and a R Coulson. DNA sequencing with chain-terminating inhibitors. Proc. Natl. Acad. Sci. U. S. A., 74(12):5463–5467, 1977. Erfan Sayyari, James B Whitfield, and Siavash Mirarab. Fragmentary gene sequences negatively impact gene tree and species tree reconstruction. Mol. Biol. Evol., October 2017. Aylwyn Scally. The mutation rate in human evolution and demographic inference. Curr. Opin. Genet. Dev., 41:36–43, December 2016. Paul Scheet and Matthew Stephens. A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am. J. Hum. Genet., 78(4):629–644, April 2006. Klaus Peter Schliep. phangorn: phylogenetic analysis in R. Bioinformatics, 27(4):592–593, February 2011. Robert Schmieder and Robert Edwards. Quality control and preprocessing of metage- nomic datasets. 
Bioinformatics, 27(6):863–864, March 2011. Dominik Schrempf, Bui Quang Minh, Nicola De Maio, Arndt von Haeseler, and Carolin Kosiol. Reversible polymorphism-aware phylogenetic models and their application to tree inference. J. Theor. Biol., 407:362–370, October 2016. J D Seader, W D Seider, A C Pauls, and R R Hughes. FLOWTRAN simulation: an introduction. 1977. Anna Shcherbina. FASTQSim: platform-independent data characterization and in silico read generation for NGS datasets. BMC Res. Notes, 7(1):533, 2014. References 155

Jay Shendure and Erez Lieberman Aiden. The expanding scope of DNA sequencing. Nature Publishing Group, 30(11):1084–1094, 2012. Jay Shendure and Hanlee Ji. Next-generation DNA sequencing. Nat. Biotechnol., 26(10): 1135–1145, 2008. Jay Shendure, Robi D Mitra, Chris Varma, and George M Church. Advanced sequencing technologies: methods and goals. Nat. Rev. Genet., 5(5):335–344, 2004. Jay A Shendure, Gregory J Porreca, and George M Church. Overview of DNA Sequenc- ing Strategies. Curr. Protoc. Mol. Biol., ( January):1–11, 2011. Suyash S Shringarpure, Rasika A Mathias, Ryan D Hernandez, Timothy D O’Connor, Zachary A Szpiech, Raul Torres, Francisco M De La Vega, Carlos D Bustamante, Kathleen C Barnes, Margaret A Taub, and CAAPA Consortium. Using genotype array data to compare multi- and single-sample variant calls and improve variant call sets from deep coverage whole-genome sequencing data. Bioinformatics, 33(8): 1147–1153, April 2017. Daniel Simberloff. Calculating probabilities that cladograms match: A method of biogeographical inference. Syst. Biol., 36(2):175–195, June 1987. Cas Simons, Michael Pheasant, Igor V Makunin, and John S Mattick. Transposon-free regions in mammalian genomes. Genome Res., 16(2):164–172, February 2006. M Simonsen, T Mailund, and CNS Pedersen. Rapid Neighbour-Joining. WABI, 2008. David Sims, Ian Sudbery, Nicholas E Ilott, Andreas Heger, and Chris P Ponting. Sequencing depth and coverage: key considerations in genomic analyses. Nat. Rev. Genet., 15(2):121–132, 2014. Joel Sjöstrand, Lars Arvestad, Jens Lagergren, and Bengt Sennblad. GenPhyloData: realistic simulation of gene family evolution. BMC Bioinformatics, 14:209, June 2013. R A Smith, E L Ionides, and A A King. Infectious disease dynamics inferred from genetic data via sequential monte carlo. Mol. Biol. Evol., 34(8):2065–2084, August 2017. Pamela S Soltis and Douglas E Soltis. Applying the bootstrap in phylogeny reconstruc- tion. Statistical Science, 18(2):256–267, 2003. 
Weizhi Song, Kerrin Steensen, and Torsten Thomas. HgtSIM: a simulator for horizontal gene transfer (HGT) in microbial communities. PeerJ, 5:e4015, November 2017. W Souza and B Carvalho. Rqc: quality control tool for high-throughput sequencing data.”, 2017. Paul R Staab, Sha Zhu, Dirk Metzler, and Gerton Lunter. scrm: efficiently simulating long sequences using the approximated coalescent with recombination. Bioinformatics, 31(10):1680–1682, May 2015. Alexandros Stamatakis. RAxML version 8: a tool for phylogenetic analysis and post- analysis of large phylogenies. Bioinformatics, 30(9):1312–1313, May 2014. 156 References

Stuart Stephen, Michael Pheasant, Igor V Makunin, and John S Mattick. Large-scale appearance of ultraconserved elements in tetrapod genomes and slowdown of the molecular clock. Mol. Biol. Evol., 25(2):402–408, February 2008. M Stephens, N J Smith, and P Donnelly. A new statistical method for haplotype reconstruction from population data. Am. J. Hum. Genet., 68(4):978–989, April 2001. Matthew Stephens and . A comparison of bayesian methods for haplotype reconstruction from population genotype data. Am. J. Hum. Genet., 73(5):1162–1169, November 2003. Matthew Stephens and Paul Scheet. Accounting for decay of linkage disequilibrium in haplotype inference and missing-data imputation. Am. J. Hum. Genet., 76(3):449–462, March 2005. J Stoye, D Evers, and F Meyer. Rose: generating sequence families. Bioinformatics, 14 (2):157–163, 1998. Cory L Strope, Stephen D Scott, and Etsuko N Moriyama. indel-Seq-Gen: a new protein family simulator incorporating domains, motifs, and indels. Mol. Biol. Evol., 24(1993): 640–649, March 2007. Jeet Sukumaran and Mark T Holder. DendroPy: a python library for phylogenetic computing. Bioinformatics, 26(12):1569–1571, June 2010. Gergely J Szöllősi, Eric Tannier, Vincent Daubin, and Bastien Boussau. The inference of gene trees with species trees. Syst. Biol., 64(1):e42–62, January 2015. N Takahata. Gene genealogy in three related populations: consistency probability between gene and population trees. Genetics, 122(4):957–966, August 1989. Ameet Talwalkar, Jesse Liptrap, Julie Newcomb, Christopher Hartl, Jonathan Terhorst, Kristal Curtis, Ma’ayan Bresler, Yun S Song, Michael I Jordan, and David Patterson. SMaSH: a benchmarking toolkit for human genome variant calling. Bioinformatics, 30(19):2787–2795, October 2014. S Tavaré. Some probabilistic and statistical problems in the analysis of DNA sequences. Lectures on mathematics in the life sciences, 1986. 
Ryan Tewhey, Jason B Warner, Masakazu Nakano, Brian Libby, Martina Medkova, Patricia H David, Steve K Kotsopoulos, Michael L Samuels, J Brian Hutchison, Jonathan W Larson, Eric J Topol, Michael P Weiner, Olivier Harismendy, Jeff Olson, Darren R Link, and Kelly A Frazer. Microdroplet-based PCR enrichment for large- scale targeted sequencing. Nat. Biotechnol., 27(11):1025–1031, November 2009. Cuong Than, Derek Ruths, and Luay Nakhleh. PhyloNet: a software package for analyzing and reconstructing reticulate evolutionary relationships. BMC Bioinformatics, 9:322, July 2008. Subazini Thankaswamy-Kosalai, Partho Sen, and Intawat Nookaew. Evaluation and assessment of read-mapping by multiple next-generation sequencing aligners based on genome-wide characteristics. Genomics, 109(3-4):186–191, July 2017. References 157

The Cancer Genome Atlas Research Network, John N Weinstein, Eric A Collisson, Gor- don B Mills, Kenna R Mills Shaw, Brad A Ozenberger, Kyle Ellrott, Ilya Shmulevich, Chris Sander, and Joshua M Stuart. The cancer genome atlas Pan-Cancer analysis project. Nat. Genet., 45:1113, September 2013. Félicien Tosso, Olivier J Hardy, Jean-Louis Doucet, Kasso Daïnou, Esra Kaymak, and Jérémy Migliore. Evolution in the Amphi-Atlantic tropical genus guibourtia (fabaceae, detarioideae), combining NGS phylogeny and morphology. Mol. Phylogenet. Evol., 120:83–93, December 2017. Matthew H Van Dam, Athena W Lam, Katayo Sagata, Bradley Gewa, Raymond Laufa, Michael Balke, Brant C Faircloth, and Alexander Riedel. Ultraconserved elements (UCEs) resolve the phylogeny of australasian smurf-weevils. PLoS One, 12 (11):e0188044, November 2017. G A Van der Auwera, M O Carneiro, and others. From FastQ data to highconfidence variant calls: the genome analysis toolkit best practices pipeline. Current protocols in Bioinformatics, 2013. Geraldine Van der Auwera. Should I analyze my samples alone or together? https://gatkforums.broadinstitute.org/gatk/discussion/4150/ should-i-analyze-my-samples-alone-or-together, May 2015. Accessed: 2018-1- 18. Erwin L van Dijk, Hélène Auger, Yan Jaszczyszyn, and Claude Thermes. Ten years of next-generation sequencing technology. Trends Genet., 30(9):418–426, September 2014. Katherine Elena Varley and Robi David Mitra. Nested patch PCR enables highly multiplexed mutation discovery in candidate genes. Genome Res., 18(11):1844–1850, November 2008. Diego P Vézquez and John L Gittleman. Biodiversity conservation: Does phylogeny matter? Curr. Biol., 8(11):R379–R381, May 1998. Rutger A Vos, Jason Caravas, Klaas Hartmann, Mark A Jensen, and Chase Miller. BIO::Phylo-phyloinformatic analysis using perl. BMC Bioinformatics, 12:63, February 2011. J Wang, M-Z Guo, and L L Xing. FastJoin, an improved neighbor-joining algorithm. Genet. Mol. Res., 11(3):1909–1922, July 2012a. 
Kai Wang, Wei Hong, Hengwu Jiao, and Huabin Zhao. Transcriptome sequencing and phylogenetic analysis of four species of luminescent beetles. Sci. Rep., 7(1):1814, May 2017. Xin Victoria Wang, Natalie Blades, Jie Ding, Razvan Sultana, and Giovanni Parmigiani. Estimation of sequencing error rates in short reads. BMC Bioinformatics, 13:185, 2012b. Zhong Wang, Mark Gerstein, and Michael Snyder. RNA-Seq: a revolutionary tool for transcriptomics. Nat. Rev. Genet., 10(1):57–63, January 2009. 158 References

K A Wetterstrand. DNA sequencing costs: Data from the NHGRI genome sequencing program (GSP). www.genome.gov/sequencingcostsdata. Travis J Wheeler. Large-Scale Neighbor-Joining with NINJA. In Lecture Notes in Computer Science, pages 375–389. 2009. Norman J Wickett, Siavash Mirarab, Nam Nguyen, Tandy Warnow, Eric Carpenter, Naim Matasci, Saravanaraj Ayyampalayam, Michael S Barker, J Gordon Burleigh, Matthew A Gitzendanner, Brad R Ruhfel, Eric Wafula, Joshua P Der, Sean W Graham, Sarah Mathews, Michael Melkonian, Douglas E Soltis, Pamela S Soltis, Nicholas W Miles, Carl J Rothfels, Lisa Pokorny, A Jonathan Shaw, Lisa DeGironimo, Dennis W Stevenson, Barbara Surek, Juan Carlos Villarreal, Béatrice Roure, Hervé Philippe, Claude W dePamphilis, Tao Chen, Michael K Deyholos, Regina S Baucom, Toni M Kutchan, Megan M Augustin, Jun Wang, Yong Zhang, Zhijian Tian, Zhixiang Yan, Xiaolei Wu, Xiao Sun, Gane Ka-Shu Wong, and James Leebens-Mack. Phylotran- scriptomic analysis of the origin and early diversification of land plants. Proc. Natl. Acad. Sci. U. S. A., 111(45):E4859–68, November 2014. Yufeng Wu. Coalescent-based species tree inference from gene tree topologies under incomplete lineage sorting by maximum likelihood. Evolution, 66(3):763–775, 2012. Zhenxiang Xi, Liang Liu, and Charles C Davis. The impact of missing data on species tree estimation. Mol. Biol. Evol., November 2015. X Yang, S P Chockalingam, and S Aluru. A survey of error-correction methods for next-generation sequencing. Brief. Bioinform., 14(1):56–66, 2013. Ya Yang and Stephen A Smith. Orthology inference in non-model organisms using transcriptomes and low-coverage genomes: improving accuracy and matrix occupancy for phylogenomics. Mol. Biol. Evol., 31(11):3081–3092, August 2014. Z Yang. Among-site rate variation and its impact on phylogenetic analyses. Trends Ecol. Evol., 11(9):367–372, September 1996. Andy B Yoo, Morris A Jette, and Mark Grondona. SLURM: Simple linux utility for resource management. 
In Job Scheduling Strategies for Parallel Processing, Lecture Notes in Computer Science, pages 44–60. Springer, Berlin, Heidelberg, June 2003. G U Yule. A mathematical theory of evolution, based on the conclusions of dr. JC willis, FRS. Philosophical transactions of the Royal Society of London., Series B, containing papers of a biological character(213):21–87, 1925. Chao Zhang, Erfan Sayyari, and Siavash Mirarab. ASTRAL-III: Increased scalability and impacts of contracting low support branches. In Comparative Genomics, Lecture Notes in Computer Science, pages 53–75. Springer, Cham, October 2017. G Zhang, B Li, C Li, M.T.P. Gilbert, C.V. Mello, E.D. Jarvis, The Avian Genome Consortium, and J Wang. Genomic data of the anna’s hummingbird (Calypte anna), 2014a. References 159

G Zhang, B Li, C Li, M.T.P. Gilbert, C.V. Mello, E.D. Jarvis, The Avian Genome Consortium, and J Wang. Genomic data of the chimney swift (Chaetura pelagica), 2014b. Jianzhi Zhang. Evolution by gene duplication: an update. Trends Ecol. Evol., 18(6): 292–298, June 2003. Tongwu Zhang, Yingfeng Luo, Kan Liu, Linlin Pan, Bing Zhang, Jun Yu, and Songnian Hu. BIGpre: a quality assessment package for next-generation sequencing data. Genomics Proteomics Bioinformatics, 9(6):238–244, December 2011a. Wenyu Zhang, Jiajia Chen, Yang Yang, Yifei Tang, Jing Shang, and Bairong Shen. A practical comparison of de novo genome assembly software tools for next-generation sequencing technologies. PLoS One, 6(3), 2011b.

Appendix A

List of deliverables

Publications

• Escalona M, Rocha S, Posada D. (2016) A comparison of tools for the simulation of genomic NGS data. Nature Reviews Genetics 17, 459–469 doi:10.1038/nrg.2016.57

• Escalona M, Rocha S, Posada D. (2017) NGSphy: phylogenomic simulation of next generation sequencing data. Accepted in Bioinformatics, pending minor revisions

Software

• NGSphy, implemented in Python. Available at https://github.com/merlyescalona/ngsphy under GNU/GPL v.3 Licence.

• INDELible-NGSphy, implemented in C. Available at https://github.com/merlyescalona/indelible-ngsphy under GNU/GPL v.3 Licence.

Appendix B

Short CV

Personal Information

Name: Merly Mayela Escalona Fermín

Mailing Address: Phylogenomics Lab. Edificio Torre CACTI. Campus Universitario, Universidad de Vigo. 36310 Vigo, Spain

Phone: (+34) 690 052952

E-mail: [email protected]

Nationality: Spanish

Date of birth: May 6th, 1989.

Education

2014-2018 (Expected) | Ph.D., Methodologies and Applications in Life Sciences. University of Vigo, Spain.

2012-2014 | Master of Science (M.Sc.), Intelligent and Adaptive Software Systems. University of Vigo, Spain.

2010-2013 | Degree in software engineering. University of Vigo, Spain.

2009-2010 | Erasmus course. University of Applied Sciences Bremerhaven (Hochschule Bremerhaven), Germany.

2007-2010 | Technical degree in software engineering. University of Vigo, Spain.

Publications


• Escalona M, Rocha S, Posada D. (2017) NGSphy: phylogenomic simulation of next generation sequencing data. bioRxiv, doi: https://doi.org/10.1101/197715. Accepted in Bioinformatics, pending minor revisions

• Escalona M, Rocha S, Posada D. (2016) A comparison of tools for the simulation of genomic NGS data. Nature Reviews Genetics 17, 459–469 doi: https://doi.org/10.1038/nrg.2016.57

• Díaz-Pereira MP, Gómez Conde I, Escalona M and Olivieri D. Automatic recognition and scoring of olympic rhythmic gymnastic movements. Elsevier, Human Movement Science February 4, 2014.

• Olivieri D, Escalona M and Faro J. Software tool for 3D reconstruction of Germinal Centers. BMC Bioinformatics 2013, 14 (Suppl 6):S5

Oral and written communications

• Escalona M, Rocha S, Posada D. 2016. NGSphy: generation of next-generation sequencing data from phylogenies. XII Portuguese Evolutionary Biology Meeting (ENBE). Universidade de Aveiro. December 16, 2016. Aveiro, Portugal.

• Rocha S, Escalona M, Lemmon AR, Lemmon EM, Posada D. 2014. Probe Design for Anchored Hybrid Enrichment in Trovaoconus marine snails. X Portuguese Evolutionary Biology Meeting (ENBE). December 22, 2014. Lisboa, Portugal.

Honors and awards

• Fellowships

June 2014 - May 2018. Graduate Research Fellowship (FPI 2013) of the Spanish Ministry of Economy and Competitiveness at University of Vigo.

September 2012 - June 2013. Collaboration Grant of the Spanish Ministry of Sciences at University of Vigo.

• Awards

– Award for academic excellence for new PhD students at the University of Vigo (academic years 2013-2014, 2014-2015). Universidad de Vigo. December 2014.

– First-Class Honours with Distinction (Premio Extraordinario de Fin de Carrera). Universidad de Vigo. January 2012.

Short-term visits

April, 2017 - July, 2017. Visit to The Bioinformatics Centre. Department of Biology. University of Copenhagen. Copenhagen, Denmark. (Advisor: Dr. Rute R. da Fonseca).

March, 2015 - June, 2015. Visit to the Lemmon Lab. Center of Anchored Phylogenomics. Department of Scientific Computing. Department of Biological Science. Florida State University. Tallahassee, Florida. (Advisors: Dr. Alan R. Lemmon and Dr. Emily Moriarty Lemmon).

September, 2014. Visit to the Lemmon Lab. Center of Anchored Phylogenomics. Department of Scientific Computing. Florida State University. Tallahassee, Florida. (Advisors: Dr. Alan R. Lemmon and Dr. Emily Moriarty Lemmon).

Teaching activities

October - December 2017. Genetics II, Biology. 30h of practical lectures. University of Vigo, Spain.

October - December 2016. Genetics II, Biology. 45h of practical lectures. University of Vigo, Spain.

October - December 2015. Genetics II, Biology. 45h of practical lectures. University of Vigo, Spain.

Workshops attended

March, 2017. Bioinformatics for adaptation. Red Temática en Genómica de la Adaptación. AdaptNet. FISABIO-Salud Pública, Valencia, Spain.

October, 2016. Wellcome Genome Campus Advanced Course: Next Generation Sequencing Bioinformatics. Wellcome Trust Genome Campus. Hinxton, UK.

July, 2016. Execution of R in HPC environments and Big Data. Rede Galega de Tecnoloxías Cloud e Big Data para HPC. CITIC (Centro de Investigación en Tecnoloxías da Información e as Comunicacións), A Coruña, Spain.

January, 2015. Administration and programming training for Apache Hadoop. Rede Galega de Tecnoloxías Cloud e Big Data para HPC. CITIC (Centro de Investigación en Tecnoloxías da Información e as Comunicacións), A Coruña, Spain.

October, 2014. Workshop on Marine Evolutionary Genomics and Proteomics. Universidad de Vigo, Vigo, Spain.

June, 2014. NGS sequence analysis. Universitat Politècnica de València, Valencia, Spain.

Appendix C

NGSphy manual

NGSphy documentation

v. 20171214 http://github.com/merlyescalona/ngsphy

© 2017 Merly Escalona, Sara Rocha, David Posada

University of Vigo, Spain, http://darwin.uvigo.es ​ [email protected]

1. About NGSphy

NGSphy is an open-source Python tool for the genome-wide simulation of NGS data (read counts or Illumina reads) obtained from thousands of gene families evolving under a common species tree, with multiple haploid and/or diploid individuals per species. Sequencing coverage (depth) can vary among species, individuals and loci, including off-target or uncaptured loci.

2. Citation

If you use NGSphy, please cite:

● Escalona M, Rocha S and Posada D. NGSphy: phylogenomic simulation of next-generation sequencing data. Submitted.
● Sukumaran J and Holder MT. (2010). DendroPy: A Python library for phylogenetic computing. Bioinformatics 26: 1569-1571.

If running ART, cite also:

● Huang W, Li L, Myers JR and Marth GT. (2012). ART: a next-generation sequencing read simulator. Bioinformatics 28(4): 593-594.

If using SimPhy, cite also:

● Mallo D, De Oliveira Martins L and Posada D. (2016). SimPhy: Phylogenomic Simulation of Gene, Locus, and Species Trees. Systematic Biology 65(2): 334-344.

If using single gene tree input, cite also:

● Fletcher W and Yang Z. (2009). INDELible: A flexible simulator of biological sequence evolution. Molecular Biology and Evolution 26(8): 1879-1888.

3. Getting started

NGSphy is designed to simulate reads (or read counts) from alignments originating from single gene trees or gene-tree distributions (which in turn originate from species-tree distributions). For gene-tree distributions it reads directly from the output of SimPhy (a simulator of gene family evolution), but it can also be fed with gene trees directly. These trees can contain orthologs, paralogs and xenologs. Alignments are simulated using INDELible and can represent multiple haploid and/or diploid individuals per species. Then, either Illumina reads (using ART) or read counts are simulated for each individual, with the depth of coverage allowed to vary between species, individuals and loci. This flexibility allows for the simulation of both off-target (captured but not targeted) and uncaptured (targeted but not captured) loci.

You will need an NGSphy settings file and the required files according to the input mode selected (see below).

● Examples of settings files can be found here (https://github.com/merlyescalona/ngsphy/tree/master/data/settings).
● For installation please go to Section 4. Installation.
● For detailed explanations please search through this manual.
● In Section 11, as well as in the NGSphy wiki, you can find tutorials for each of the possible input modes (https://github.com/merlyescalona/ngsphy/wiki/).

3.1. Input/output files

3.1.1. Input

[Single gene-tree scenario]
● NGSphy settings file
● INDELible control file
● Newick file with single gene tree
● ancestral sequence file (FASTA) (optional)
● reference allele file (optional)

[Species-tree scenario]
● NGSphy settings file
● SimPhy output
● reference allele file (optional)

3.1.2. Output

● NGS reads:
  ○ FASTQ
  ○ ALN
  ○ BAM
● read counts:
  ○ VCF
● sequence alignments:
  ○ FASTA
● coverage variation:
  ○ CSV
● log files
● bash scripts

3.2. Input modes

3.2.1. Single gene-tree scenarios:

● inputmode 1: allows you to generate DNA sequence ​ alignments from a single gene tree, generate haploid or diploid individuals (by random mating within the same species) and produce NGS reads or read counts [Tutorial 1]. ​ ​

● inputmode 2: allows you to simulate DNA sequence alignments from a single gene tree and a known ancestral sequence. DNA sequences are evolved from the ancestral sequence under the specified gene tree, and then haploid or diploid individuals and NGS reads or read counts are generated [Tutorial 2].

● inputmode 3: allows you to simulate reads/read counts from a single gene tree and a known anchor (tip) sequence. The tree is re-rooted at the anchor sequence before the simulation of the sequence alignments [Tutorial 3]. ​ ​

3.2.2. Gene-tree/Species-tree distributions

● inputmode 4: this mode uses the output from SimPhy to generate NGS reads or read counts. SimPhy generates distributions of gene trees and species trees under different conditions. Each species tree is here considered a replicate. For diploids, NGSphy requires that the number of SimPhy gene-tree tips per species be even. If this is not the case, NGSphy will issue a warning and stop. Alternatively, the user can specify that NGSphy uses only the SimPhy replicates that satisfy this requirement [Tutorial 4].

4. Installation

4.1 Computer requirements

NGSphy has been developed for Linux/Mac environments with Python 2.7.

4.2 NGSphy

To install NGSphy you can:

a) clone its git repository and download the required third-party software (Section 4.3):

    # 1. Clone NGSphy repository
    git clone https://github.com/merlyescalona/ngsphy.git
    # 2. Move to the ngsphy folder
    cd ngsphy
    # 3. Install:
    python setup.py install --user   # if user does not have sudoer permissions
    # sudo python setup.py install   # if user with sudoer permissions

b) install NGSphy through pip and download the required third-party software (Section 4.3):

    pip install --user ngsphy   # if user does not have sudoer permissions
    # sudo pip install ngsphy   # if user with sudoer permissions

4.3 Third-party software

4.3.1 ART (for Illumina reads generation)

ART is a set of simulation tools to generate synthetic next-generation sequencing reads. You can download it from:

http://www.niehs.nih.gov/research/resources/software/biostatistics/art/

Version ChocolateCherryCake or later.

Following installation instructions from ART, you can download the binaries or compile the source code. If you decide to compile the source code:

    # 1. Extract files from the compressed tgz
    cd /path/to/art-download
    tar -xvf artsrcmountrainier20160605linuxtgz.tgz
    # 2. Change current directory to the extracted one
    cd art_src_MountRainier_Linux/
    # 3. Make sure you have all the dependencies installed and generate the Makefile
    ./configure
    # 4. Run the Makefile
    make

4.3.2 INDELible (for sequence generation)

INDELible is an application for sequence simulation. You can download it from:

http://abacus.gene.ucl.ac.uk/software/indelible/

Version 1.03.

In order to get INDELible, you will need to register. It is free software, and is distributed under: GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or any later version. For more information go to http://www.gnu.org/licenses/. ​ ​

Once the software is downloaded:

1. Unpack the archive on Unix-like systems using:

    # 1. Change directory to the download folder.
    cd /path/to/indelible-download
    # 2. Extract file from the compressed file.
    tar -xvzf INDELibleV1.03.tar.gz

a. If you want to compile from source:

    # 3. Move to the source folder
    cd INDELibleV1.03/src/

i. Include the following line at the top of the MersenneTwister.h file.

#include ​ ​

ii. Compile INDELible using:

    # 4. Compile the program.
    g++ -o indelible indelible.cpp -lm

b. If you want to use the binaries, directly:

    # This is an example for MacOS
    # 3. Move to the binaries folder
    cd INDELibleV1.03/bin/
    # 4. Rename the binary (for proper NGSphy execution)
    mv indelible_1.03_OSX_intel indelible
    # 5. Modify execution permissions
    chmod +x indelible

4.3.3. INDELible - NGSphy version (for sequence generation with known ancestral sequence)

This is a version of INDELible that we have modified to be able to use a given ancestral sequence at the root for a single partition. It can be obtained from cloning its repository:

    # 1. Clone the repository
    git clone https://github.com/merlyescalona/indelible.git indelible-ngsphy
    # 2. Change directory to indelible-ngsphy source code folder.
    cd indelible-ngsphy/
    # 3. Compile
    make

4.3.4. SimPhy (multiple gene trees evolved under a species tree)

SimPhy can be obtained by cloning its repository and installing its dependencies. Detailed information on how to install SimPhy can be found in that repository.

# 1. Clone the repository git clone https://github.com/adamallo/SimPhy.git ​ ​

4.4. Adding NGSphy and third-party software to the path

Once all the software has been installed, it must be added to the path.

● First you have to add the lines below to the ~/.bashrc file to keep the changes ​ ​ permanently.

    ART="/path/to/art/executable"
    INDELIBLE="/path/to/indelible/executable"
    INDELIBLENP="/path/to/indelible-ngsphy/executable"
    NGSPHY="/path/to/ngsphy/executable"
    SIMPHY="/path/to/simphy/executable"
    export PATH="$ART:$INDELIBLE:$INDELIBLENP:$NGSPHY:$SIMPHY:$PATH"

● Apply changes

source ~/.bashrc ​ ​
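After reloading the shell configuration, it can be useful to confirm that every tool is actually reachable. The following is a minimal Python 3 sketch, not part of NGSphy. The executable names in REQUIRED are assumptions made for illustration (for instance, art_illumina for ART and simphy for SimPhy) and should be replaced by the names of the binaries installed on your system.

```python
import shutil

# Executable names are assumptions for illustration; adjust them to the
# names of the binaries actually installed on your system.
REQUIRED = ["ngsphy", "art_illumina", "indelible", "indelible-ngsphy", "simphy"]


def missing_executables(names, path=None):
    """Return the names that cannot be resolved on the given search path.

    When `path` is None, the PATH environment variable is used.
    """
    return [name for name in names if shutil.which(name, path=path) is None]


if __name__ == "__main__":
    missing = missing_executables(REQUIRED)
    if missing:
        for name in missing:
            print("not found on PATH:", name)
    else:
        print("all executables found")
```

An empty result means that all the listed names resolve to an executable file on the search path.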

5. Usage

NGSphy does not have a Graphical User Interface (GUI) and works on the Linux/Mac command line in a non-interactive fashion.

usage: ngsphy [-s ] ​ ​ ​ ​ ​ [-l ] [-v] [-h] ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

5.1. Arguments

● Optional arguments:
  ○ -s, --settings
    Path to the settings file. This argument is optional; by default NGSphy looks for a settings.txt file in the current working directory. You can also specify a particular settings file with:

ngsphy -s my_settings.txt

  ○ -l, --log
    Specifies the hierarchical log level that will be shown through the standard output. A detailed log will be stored in a separate file when level == DEBUG. Possible values:
    ● DEBUG: shows very detailed information about the program's process.
    ● INFO (default): shows only information about the state of the program.
    ● WARNING: shows only system warnings.
    ● ERROR: shows only execution errors.

● Information arguments:
  ○ -v, --version
    Show program's version number and exit.
  ○ -h, --help
    Show help message and exit.

NOTE: Examples of settings files can be found under the data/settings folder in the NGSphy source.

6. The settings file

NGSphy requires a settings file (by default “settings.txt”) that specifies the different options and parameter values for the simulations. A file with a different name can be specified with the -s/--settings option. The information in the settings file is organized in 6 optional/required blocks (default values are underlined):

1. [general]: general parameters.
2. [data]: specifies the type of input data as well as input parameters and files.
3. [coverage]: parameters that describe the variation of coverage in the dataset (optional).
4. [ngs-reads-art]: specifies ART execution parameters (optional).
5. [ngs-read-counts]: specifies parameters for read counts (optional).
6. [execution]: describes how the execution of the whole process will be made (optional).

6.1. [general] block

Stores general parameters for each NGSphy run.

    [general]
    path=/home/user/
    output_folder_name=NGSphy_output
    ploidy=1

● path
  ○ purpose: path where the output folder will be created.
  ○ type: string (path).
● output_folder_name
  ○ purpose: name of the output folder where NGSphy results will be stored. If the output folder already exists, the new output folder will get the same base name with a numerical suffix (outputFolder_n), representing the nth time the program was run with that output folder name.
  ○ type: string.
  ○ value: NGSphy_output
● ploidy
  ○ purpose: ploidy of the resulting individuals. So far it is only possible to generate haploid and diploid individuals.
  ○ type: number (integer).
  ○ values: 1, 2 (in the closed interval [1,2]).
● seed
  ○ purpose: random number generator seed.
  ○ type: number (integer).
  ○ value: in the closed interval [0, 2^32 - 1].
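As an illustration of these options, the following Python 3 sketch builds the text of a [general] block and checks the ranges stated above for ploidy and seed. It is an assumption-laden helper, not NGSphy code: the function name general_block is hypothetical, and the only facts taken from this manual are the option names and their valid ranges.

```python
import configparser
import io


def general_block(path, output_folder_name="NGSphy_output", ploidy=1, seed=None):
    """Return the text of a [general] block (hypothetical helper, not NGSphy API)."""
    if ploidy not in (1, 2):
        raise ValueError("ploidy must be 1 (haploid) or 2 (diploid)")
    if seed is not None and not 0 <= seed <= 2**32 - 1:
        raise ValueError("seed must lie in the closed interval [0, 2^32 - 1]")

    # Interpolation is disabled so that '%' characters in paths are kept as-is.
    parser = configparser.ConfigParser(interpolation=None)
    parser["general"] = {
        "path": path,
        "output_folder_name": output_folder_name,
        "ploidy": str(ploidy),
    }
    if seed is not None:
        parser["general"]["seed"] = str(seed)

    buffer = io.StringIO()
    # Write "key=value" without surrounding spaces, as in the examples of this manual.
    parser.write(buffer, space_around_delimiters=False)
    return buffer.getvalue()


if __name__ == "__main__":
    print(general_block("/home/user/", ploidy=2, seed=12345))
```

The returned text can be saved as part of a settings file and passed to NGSphy with the -s option.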

6.2. [data] block

Defines the input data for NGSphy, which consists of different modes:

Figure 1: Input modes: a) a single gene tree; b) a single gene tree with a user-defined ancestral sequence; c) a single gene tree with an anchor sequence; and d) gene-tree distributions (SimPhy output [species-tree simulations]).

6.2.1. Input data options

Single gene tree:

    [data]
    inputmode=1
    gene_tree_file=/home/myuser/my_gene_tree.tree
    indelible_control_file=/home/myuser/my_control_indelible.txt

Single gene tree with user-defined ancestral sequence:

    [data]
    inputmode=2
    gene_tree_file=/home/myuser/my_gene_tree.tree
    ancestral_sequence_file=/home/myuser/my_ancestral.fasta
    indelible_control_file=/home/myuser/my_control_indelible.txt

Single gene tree with user-defined anchor sequence:

    [data]
    inputmode=3
    gene_tree_file=/home/myuser/my_gene_tree.tree
    anchor_sequence_file=/home/myuser/my_anchor.fasta
    anchor_tip_label=1_0_0
    indelible_control_file=/home/myuser/my_control_indelible.txt

Gene-tree distribution, SimPhy output (species-tree simulations):

    [data]
    inputmode=4
    simphy_folder_path=testSimphy
    simphy_data_prefix=data
    simphy_filter=true

● inputmode
  ○ purpose: identifies the type of input.
  ○ type: number (integer).
  ○ value: values within the closed interval [1,4]:
    1. single gene tree
    2. single gene tree with an ancestral sequence
    3. single gene tree with an anchor sequence
    4. gene-tree distribution (SimPhy output [species-tree simulations])

6.2.2. Single gene tree

● gene_tree_file
  ○ purpose: path of the gene tree in Newick format. There must be a single path and a single tree in the file. The name of the file, without extension, must be the same as the name of the tree within the INDELible control file, in the [NGSPHYPARTITION] option.
  ○ type: string (path)
  ○ format: see specification in Section 6.2.6. (INDELible control file).
● indelible_control_file
  ○ purpose: path for the INDELible control file.
  ○ type: string (path)
  ○ format: see specification in Section 6.2.6. (INDELible control file).

6.2.3. Single gene tree with a user-defined ancestral sequence

● gene_tree_file
  ○ purpose: same as in Section 6.2.2. (Single gene tree)
  ○ type: string (path)
● ancestral_sequence_file
  ○ purpose: path to the FASTA file that contains the ancestral sequence.
  ○ type: string (path)
● indelible_control_file
  ○ purpose: same as in Section 6.2.2. (Single gene tree)
  ○ type: string (path)

6.2.4. Single gene tree with a user-defined anchor sequence

● gene_tree_file
  ○ purpose: same as in Section 6.2.2. (Single gene tree)
  ○ type: string (path)
● anchor_sequence_file
  ○ purpose: path to the FASTA file that contains the anchor sequence.
  ○ type: string (path)
● anchor_tip_label
  ○ purpose: label of the gene-tree tip that will be used as root.
  ○ type: string
  ○ format: see specification in Section 6.2.7. (Single gene-tree file format and labeling)
● indelible_control_file
  ○ purpose: same as in Section 6.2.2. (Single gene tree)
  ○ type: string (path)

6.2.4.1. Re-rooting process

If the user wants to use a tip sequence as the root for the alignment simulation (anchor_sequence), the gene tree has to be re-rooted at that tip (anchor_tip_label), so that the simulation can proceed using the anchor_sequence as the root node. In the example shown in Figure 2, NGSphy would transform the tree on the left into the one on the right, using 2_0_0 as the anchor tip. The key observation here is that the branch length from node A to tip 2_0_0 has to become zero. Then, the re-rooted tree plus the anchor (known) sequence are given to indelible-ngsphy, together with the INDELible control file (format in Section 6.2.6. INDELible control file), to simulate the corresponding sequence alignments under the model from the control file.

Figure 2: Re-rooting process. Numbers above branches indicate branch lengths.

6.2.5. Gene-tree distribution (SimPhy output [species-tree simulations])

● simphy_folder_path
  ○ purpose: path to the folder with SimPhy’s output.
  ○ type: string (path)
● simphy_data_prefix
  ○ purpose: prefix used in SimPhy's run.
  ○ type: string
● simphy_filter [optional]
  ○ purpose: filter out the replicates that do not satisfy the required ploidy. For the diploid case the number of gene-tree tips per species has to be an even number. See more in Section 6.2.8. (Assignment of individuals).
  ○ type: boolean
  ○ value:
    ■ 0, false, off: don’t filter
    ■ 1, true, on: filter

6.2.5.1. A valid SimPhy output

A detailed description of SimPhy's output can be found in https://github.com/adamallo/simphy. The SimPhy output required by NGSphy has to include:

● .command: a plain text file with the original command line arguments.
● .db: a SQLite database composed of three (3) linked tables with different information about species, locus and gene trees.
● .params: a plain text file summarizing the sampled options.
● a set of folders with the multiple sequence alignments and the corresponding trees in FASTA format.

6.2.6. INDELible control file - NGSphy version

When the input mode is a single gene tree, it is necessary to have a control file to call INDELible. Here, we use a slightly modified version of INDELible's control file. To properly set up the configuration file for INDELible, users should refer first to INDELible’s manual (http://abacus.gene.ucl.ac.uk/software/indelible/manual/). In our version, the file must include the following blocks:

● [TYPE]: 1 block
● [SETTINGS]: 1 block (optional)
● [MODEL]: 1 block
● [NGSPHYPARTITION]: 1 block

Including a wrong number of blocks or other types of blocks will result in an error message and will terminate NGSphy execution.

6.2.6.1. Block definitions

● [TYPE]: standard INDELible specification.
● [SETTINGS]: standard INDELible specification.
● [MODEL]: standard INDELible specification.
● [NGSPHYPARTITION]: this block defines:
  ○ the gene tree for INDELible. This name has to be the same as the name of the Newick file used as input (see Section 6.2).
  ○ the substitution model for INDELible. This name must match the name of the model used in the previous [MODEL] block.
  ○ the sequence length.

For example, we have a gene tree in the Newick file: tree1.tree, where sequences will ​ evolve under model m1, with a length of 500bp. ​ ​

[NGSPHYPARTITION] tree1 m1 500 ​ ​ ​ ​ ​

NOTE: Examples of the modified INDELible control files can be found under the data/indelible folder in the NGSphy source.

6.2.7. Single gene-tree file format and labeling

Tip labels in single gene trees must follow a specific format in order to be managed by NGSphy. This format indicates species, locus and individual with the scheme (X_Y_Z) where:

● X stands for the species identifier, where X > 0
● Y for the locus identifier, where Y > 0
● Z for the individual identifier, where Z > 0

The gene tree file must be in Newick format, rooted and with branch lengths. If the gene tree is not rooted, rooting will be forced following DendroPy specifications.

For example, if we have 3 species and 2 gene copies per species the labels would be:

Figure 3: Gene tree labeling example.

    (((1_0_1:1.0,1_0_0:2.0):1.0,(2_0_1:1.0,2_0_0:1.0)),((3_0_1:2.0,3_0_0:3.0)));

6.2.8. Assignment of individuals

For haploid individuals, each tip in the gene tree provided will correspond to a single individual. For diploid individuals the number of gene-tree tips per species must be even. In this case, the individuals are generated by randomly sampling without replacement two gene copies from a specific gene-family until all gene tree tips have been assigned to an individual.
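The assignment of diploid individuals described above can be sketched in Python 3 as follows. This is an illustration under stated assumptions, not the NGSphy implementation: the function names are hypothetical, tip labels are extracted from the Newick string with a regular expression instead of a tree parser, and shuffling the tips of each species and taking consecutive pairs is used as one way of sampling two gene copies at a time without replacement.

```python
import random
import re
from collections import defaultdict

# Tip labels follow the X_Y_Z scheme (species_locus_individual).
TIP_LABEL = re.compile(r"\d+_\d+_\d+")


def tips_by_species(newick):
    """Group the X_Y_Z tip labels found in a Newick string by species identifier X."""
    groups = defaultdict(list)
    for label in TIP_LABEL.findall(newick):
        species = label.split("_")[0]
        groups[species].append(label)
    return dict(groups)


def pair_into_diploids(newick, rng=None):
    """Randomly pair the tips of each species into diploid individuals."""
    rng = rng or random.Random()
    individuals = {}
    for species, tips in tips_by_species(newick).items():
        if len(tips) % 2 != 0:
            raise ValueError("species %s has an odd number of tips" % species)
        shuffled = list(tips)
        rng.shuffle(shuffled)  # random order, each tip used exactly once
        individuals[species] = [
            tuple(shuffled[i:i + 2]) for i in range(0, len(shuffled), 2)
        ]
    return individuals


if __name__ == "__main__":
    tree = "(((1_0_1:1.0,1_0_0:2.0):1.0,(2_0_1:1.0,2_0_0:1.0)),((3_0_1:2.0,3_0_0:3.0)));"
    for species, pairs in pair_into_diploids(tree, random.Random(1)).items():
        print(species, pairs)
```

With two gene copies per species, as in the labeling example of Figure 3, each species yields exactly one diploid individual.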

For the gene-tree distribution input mode only, the outgroup in the gene trees is called “0_0_0” ​ and has one gene copy. Therefore, for the generation of diploid individuals, the outgroup will be homozygous, obtained by the duplication of the sequence of its gene copy.

The description of the sequence(s) in the final FASTA file of the individual is formatted as:

>project:repID:locusID:sequence_file_prefix:indID:full_sequence_description ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

Where:

● project: if using any of the single gene tree input modes, it will be NGSphy. For the gene-tree distribution input mode, it will be the name of the SimPhy output folder.
● repID: replicate identifier.
● locusID: locus identifier.
● sequence_file_prefix: if using any of the single gene tree input modes, it will be ngsphydata. For the gene-tree distribution input mode, it will be the simphy_data_prefix.
● indID: individual identifier.
● full_sequence_description: the description of the original sequence.

6.3. [coverage] block

Sequencing coverage can be specified at three different levels: experiment, individual and locus-wide. It is also possible to mimic the variation in coverage expected for targeted sequencing, including off-target loci and taxon-specific effects.

    [coverage]
    experiment=F:100
    individual=LN:1.2,1
    locus=LN:1.3,1
    offtarget=0.4,0.01 # 40% of the loci are off-target and will have 1% of the coverage
    notcaptured=0.5
    taxon=1,0.5;2,0.25

6.3.1. Sampling notation

The parameters that define the coverage in NGSphy have to be provided using a specific notation in order to define statistical distributions and dependency between arguments. The sampling notation consists of the code of a statistical distribution (see the codes in Section 6.3.2. Statistical distributions), followed by a colon and a list of comma-separated parameter values:

distribution_code:param1,param2, ... ​ ​ ​ ​ ​ ​ ​

For example:

a) Fixed value=100.

F:100 ​ ​

b) Poisson distribution with mean=100.

P:100 ​ ​

Figure 4: Sampling notation example. Poisson distribution.

c) Negative Binomial, mean=100 and overdispersion=10.

NB:100,10 ​ ​ ​ ​

Figure 5: Sampling notation example. Negative Binomial distribution.

6.3.2. Statistical distributions

Distribution | Code | Num. parameters | Parameters | Description
Binomial | b/B | 2 | r, p | trials, probabilities
Exponential | e/E | 1 | s | scale
Fixed point | f/F | 1 | v | value
Gamma | g/G | 2 | sh, sc | shape, scale
Log. Normal | ln/LN | 2 | mu, sd | mean, standard deviation
Negative Binomial | nb/NB | 2 | mu, r | mean of the underlying Poisson distribution, overdispersion
Normal | n/N | 2 | mu, var | mean, variance
Poisson | p/P | 1 | l | mean
Uniform | u/U | 2 | min, max | minimum (included), maximum (excluded)
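To make the sampling notation concrete, the following Python 3 sketch parses a string such as NB:100,10 into a distribution code and its parameter values, and checks the number of parameters against the table above. It is only an illustration: the function name parse_notation and the error messages are assumptions, and NGSphy's own parser may behave differently (for example, in how it treats whitespace).

```python
# Number of parameters expected for each distribution code (from the table above).
N_PARAMS = {
    "b": 2,   # Binomial: trials, probabilities
    "e": 1,   # Exponential: scale
    "f": 1,   # Fixed point: value
    "g": 2,   # Gamma: shape, scale
    "ln": 2,  # Log. Normal: mean, standard deviation
    "nb": 2,  # Negative Binomial: mean, overdispersion
    "n": 2,   # Normal: mean, variance
    "p": 1,   # Poisson: mean
    "u": 2,   # Uniform: minimum, maximum
}


def parse_notation(text):
    """Split 'code:param1,param2,...' into (lowercase code, list of floats)."""
    code, separator, raw_params = text.strip().partition(":")
    code = code.strip().lower()
    if not separator or code not in N_PARAMS:
        raise ValueError("unknown or malformed distribution: %r" % text)
    params = [float(value) for value in raw_params.split(",")]
    if len(params) != N_PARAMS[code]:
        raise ValueError(
            "distribution %r expects %d parameter(s), got %d"
            % (code, N_PARAMS[code], len(params))
        )
    return code, params


if __name__ == "__main__":
    for example in ("F:100", "P:100", "NB:100,10", "LN:1.2,1"):
        print(example, "->", parse_notation(example))
```

Codes are accepted in lower or upper case, matching the two spellings listed for each distribution.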

6.3.3. Coverage options

● experiment
  ○ purpose: expected depth of coverage for a specific replicate.
  ○ type: fixed value or statistical distribution.
● locus [optional]
  ○ purpose: variation of expected coverage between loci.
  ○ type: fixed value or statistical distribution.
● individual [optional]
  ○ purpose: variation of expected coverage between individuals.
  ○ type: fixed value or statistical distribution.
● offtarget [optional]
  ○ purpose: related to targeted-sequencing experiments; proportion of loci that will be considered off-target (captured and sequenced but not originally targeted), together with the proportion of the experiment-wide coverage that these loci will receive (for example, 0.01 for 1%).
  ○ type: 1 pair (proportionLoci, proportionCoverage)
  ○ value:
    ■ proportionLoci: number (float) in the closed interval [0,1].
    ■ proportionCoverage: number (float) in the closed interval [0,1].
● notcaptured [optional]
  ○ purpose: related to targeted-sequencing experiments; fraction of originally targeted loci that will not be captured/sequenced.
  ○ type: number (float).
  ○ value: number in the closed interval [0,1].
● taxon [optional]
  ○ purpose: related to targeted-sequencing experiments; decrease in coverage for particular species. It can be due to the phylogenetic distance between a reference species (used to design the probes for the targeted loci) and the individuals from the targeted-sequencing experiment, or to species-specific sample conditions.
  ○ type: pairs (speciesID, coverageProportion)
  ○ values:
    ■ speciesID: one or more of the existing species in the tree.
    ■ coverageProportion: value in the closed interval [0,1].
  ○ format: taxon=speciesID1,coverageProportion1;speciesID2,coverageProportion2 ...

6.3.4. Coverage sampling strategy

The experiment-wide coverage is sampled for each replicate from the specified statistical distribution, and this value becomes the expected coverage for every locus and individual in that replicate. For example, if experiment=P:100, we might sample a value of 104 for replicate 1, so the expected coverage would be 104x for that particular experiment.

Figure 6: Experiment-wide coverage sampling example.

An individual-wide coverage multiplier is sampled for each individual within a given replicate. ​ The value indicated in the settings file is in fact a hyper-parameter that controls a specific hyper-distribution from which a single value is sampled per replicate. For that replicate, this value will become the shape of a Gamma distribution with mean = 1, from which a multiplier is sampled for each individual.

In exactly the same manner, a locus-wide coverage multiplier is sampled for each locus within a given replicate. The value indicated in the settings file is again a hyper-parameter that controls a specific hyper-distribution from which a single value is sampled per replicate. For that replicate, this value will become the shape of a Gamma distribution with mean = 1, from which a multiplier is sampled for each locus.
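The two-step sampling of multipliers can be sketched in Python 3 with the standard library. This is an illustration under assumptions rather than the NGSphy code: the function name sample_multipliers is hypothetical, and the shape value is passed in directly, since how it is drawn depends on the hyper-distribution chosen in the settings file. The only property taken from the text is that the multipliers come from a Gamma distribution with the sampled shape and mean 1, which is obtained by setting the scale to 1/shape.

```python
import random


def sample_multipliers(shape, count, rng=None):
    """Draw `count` coverage multipliers from a Gamma(shape, 1/shape) distribution.

    The mean of a Gamma distribution is shape * scale, so a scale of
    1/shape gives multipliers with mean 1.
    """
    if shape <= 0:
        raise ValueError("the Gamma shape must be positive")
    rng = rng or random.Random()
    return [rng.gammavariate(shape, 1.0 / shape) for _ in range(count)]


if __name__ == "__main__":
    rng = random.Random(2018)
    # Illustrative shape value; in NGSphy it would be sampled once per
    # replicate from the hyper-distribution given in the settings file.
    shape = 2.0
    expected_coverage = 104  # experiment-wide value for this replicate
    for locus, multiplier in zip("ABC", sample_multipliers(shape, 3, rng)):
        print(locus, round(expected_coverage * multiplier, 2))
```

Small shape values produce very uneven multipliers, whereas large shape values keep all multipliers close to 1.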

For example, imagine we have 2 replicates, 2 loci, 2 individuals and input the following coverage settings:

    [coverage]
    experiment=P:100
    locus=LN:1.2,1
    individual=E:1

First, we sample from a Poisson, with mean=100, to obtain the expected coverage per experiment (rep1c, rep2c).

Figure 7: Experiment-wide coverage sampling, a more complex example.

● Coverage variation before locus/individual multipliers (expected coverage):

Replicate | Locus | Individual I | Individual II
1 | A | 102 | 102
1 | B | 102 | 102
2 | A | 112 | 112
2 | B | 112 | 112

● Afterwards, we sample the locus-wide rate multipliers from the hyper-distribution, in this case a Log Normal with mean=1.2 and standard deviation=1 (locrep1α, locrep2α). This gives us the shape of the Gamma distribution with mean 1 from which we sample the rate multipliers, as many as loci (locAm, locBm).

Figure 8: Locus-wide coverage sampling.

● Coverage variation after locus-wide multipliers (resulting coverage):

Replicate | Locus | Rate multiplier (per locus) | Individual I | Individual II
1 | A | 1.6710 | 170.4420 | 170.4420
1 | B | 0.8231 | 83.9562 | 83.9562
2 | A | 2.1260 | 238.1120 | 238.1120
2 | B | 0.7682 | 86.0384 | 86.0384

● Next, we get the individual-wide rate multipliers, sampling from the hyper-distribution, an Exponential with rate 1 (indwrep1α, indwrep2α). This gives us the shape of the Gamma distribution with mean 1 from which we sample the rate multipliers, as many as individuals (indAm, indBm).

Figure 9: Individual-wide coverage sampling.

Finally, we apply all the multipliers. Coverage variation after locus-wide and individual-wide multipliers:

Replicate | Individual | Rate multiplier (per individual) | Resulting coverage, locus A | Resulting coverage, locus B
1 | I | 0.4849 | 82.64733 | 40.71036
1 | II | 1.6790 | 286.1721 | 140.9625
2 | I | 0.7437 | 177.0838 | 63.9867
2 | II | 1.3250 | 315.4984 | 114.0009
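The values in this last table are simply the product of the experiment-wide coverage, the locus-wide multiplier and the individual-wide multiplier. The short Python 3 sketch below recomputes them from the numbers given in the previous tables; the variable and function names are chosen here for illustration only.

```python
# Values taken from the tables of this example.
EXPERIMENT = {1: 102, 2: 112}  # experiment-wide coverage per replicate
LOCUS_MULTIPLIER = {
    1: {"A": 1.6710, "B": 0.8231},
    2: {"A": 2.1260, "B": 0.7682},
}
INDIVIDUAL_MULTIPLIER = {
    1: {"I": 0.4849, "II": 1.6790},
    2: {"I": 0.7437, "II": 1.3250},
}


def resulting_coverage(replicate, locus, individual):
    """Expected coverage after applying locus-wide and individual-wide multipliers."""
    return (
        EXPERIMENT[replicate]
        * LOCUS_MULTIPLIER[replicate][locus]
        * INDIVIDUAL_MULTIPLIER[replicate][individual]
    )


if __name__ == "__main__":
    for replicate in (1, 2):
        for individual in ("I", "II"):
            row = [resulting_coverage(replicate, locus, individual) for locus in "AB"]
            print(replicate, individual, ["%.4f" % value for value in row])
```

For instance, for replicate 1, locus A and individual I the product is 102 x 1.6710 x 0.4849, which is approximately 82.647.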

Targeted sequencing parameters allow the user to emulate the variation in depth of coverage that can occur in a targeted-sequencing experiment. This is possible when using gene-tree distributions (SimPhy project) as input data. These parameters identify the on-target/off-target loci as well as the number of loci that may not be captured. While on-target loci will keep their expected coverage, the off-target loci will have a (user-defined) fraction of it. The notcaptured parameter indicates the fraction of targeted loci that will not be captured; their expected coverage will be 0x. For example, if we have 2 replicates, 3 loci, and input the following coverage:

    [coverage]
    experiment=P:100
    offtarget=0.33,0.1
    notcaptured=0.5 # half of the on-target loci

If we consider the same coverage sampling as before, P:100:

Replicate | Locus | Category | Expected coverage | Rate multiplier | Sampled coverage
1 | A | on-target | 105 | 1 | 105
1 | B | on-target, not captured | 105 | 0 | 0
1 | C | off-target | 105 | 0.1 | 10.5
2 | A | on-target | 92 | 1 | 92
2 | B | on-target, not captured | 92 | 0 | 0
2 | C | off-target | 92 | 0.1 | 9.2
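The arithmetic behind this table can be sketched in Python 3 as follows. The function names are hypothetical, and the way fractions are turned into numbers of loci (rounding to the nearest integer) is an assumption made here to reproduce the example with 3 loci; NGSphy may assign loci to categories differently.

```python
def category_counts(n_loci, offtarget_fraction, notcaptured_fraction):
    """Number of loci per category (rounding rule is an assumption of this sketch)."""
    n_off = round(n_loci * offtarget_fraction)
    n_targeted = n_loci - n_off
    n_not_captured = round(n_targeted * notcaptured_fraction)
    return {
        "on-target": n_targeted - n_not_captured,
        "not-captured": n_not_captured,
        "off-target": n_off,
    }


def sampled_coverage(expected, category, offtarget_coverage):
    """Apply the multiplier of each category to the expected coverage."""
    multipliers = {
        "on-target": 1.0,                  # keeps the expected coverage
        "not-captured": 0.0,               # targeted but not captured: 0x
        "off-target": offtarget_coverage,  # user-defined fraction
    }
    return expected * multipliers[category]


if __name__ == "__main__":
    print(category_counts(3, 0.33, 0.5))
    for expected in (105, 92):
        for category in ("on-target", "not-captured", "off-target"):
            print(expected, category, sampled_coverage(expected, category, 0.1))
```

With 3 loci, offtarget=0.33,0.1 and notcaptured=0.5, this yields one locus in each category, as in the table.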

Taxon-specific effects allow the user to define coverage variation for specific taxa. They can be used, for example, to emulate a decay in coverage related to the phylogenetic distance of a species to the reference species used to build the target-loci probes (Bragg et al, 2016) (this is sometimes called phylogenetic decay) or, in a more general context, particular sample conditions (low amount of DNA, museum specimens, etc.).

Figure 10: Taxon-specific coverage can be incorporated. Different clades (blue and orange) can be assigned a proportion of the experiment-wide defined coverage.

For example:

    [coverage]
    experiment=F:60
    taxon=1,0.5;2,0.25

This means that, if the expected coverage for the experiment is 60x, individuals from the species with speciesID=1 will have a coverage of 30x (50% of the expected coverage), and individuals from the species with speciesID=2 will have a coverage of 15x (25% of the expected coverage).
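The following Python 3 sketch parses a taxon value in the format given in Section 6.3.3 and applies it to an experiment-wide coverage. It is a hedged illustration: the function names are hypothetical and the tolerance to spaces around separators is an assumption of this sketch, not a documented feature of NGSphy.

```python
def parse_taxon(value):
    """Parse 'speciesID1,proportion1;speciesID2,proportion2' into a dict."""
    effects = {}
    for pair in value.split(";"):
        if not pair.strip():
            continue  # ignore empty trailing entries
        species_id, proportion = pair.split(",")
        proportion = float(proportion)
        if not 0.0 <= proportion <= 1.0:
            raise ValueError("coverage proportion must lie in [0, 1]")
        effects[species_id.strip()] = proportion
    return effects


def taxon_coverage(expected, value):
    """Expected coverage per species after applying taxon-specific proportions."""
    return {
        species_id: expected * proportion
        for species_id, proportion in parse_taxon(value).items()
    }


if __name__ == "__main__":
    print(taxon_coverage(60, "1,0.5;2,0.25"))
```

Species that are not listed in the taxon option keep the experiment-wide expected coverage.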

6.4. [ngs-reads-art] block

Defines the options for ART. If the user specifies here any input (in, i) or output (out, o) arguments, these will be ignored, since these values will be auto-generated upstream by NGSphy.

For coverage, ART offers two different parameterizations:

● -c, --rcount: number of reads/read pairs to be generated per sequence/amplicon.
● -f, --fcov: mean fold of read coverage to be simulated for each amplicon.

Here, these parameters will be treated as Boolean (true, false), while their expected value will be set with the distributions given in Section 6.3. [coverage] block. Units will be reads or depth of coverage according to the ART option selected.

Parameter | Possible values | Units
-c, --rcount | True/on/1 | reads
-f, --fcov | False/off/0 | depth (x)

IMPORTANT: In NGSphy the coverage is defined per individual not per sequence/amplicon. ​

6.5. [ngs-read-counts] block

Generates a VCF file per locus per replicate that contains the variable positions, haplotype/genotype and likelihoods.

    [ngs-read-counts]
    read_counts_error=0.1
    reference_alleles_file=/home/myuser/my_reference_alleles.txt

● read_counts_error
  ○ purpose: to emulate sequencing error.
  ○ type: number (float)
  ○ value: value in the left-closed interval [0,1).
● reference_alleles_file
  ○ purpose: identifiers of the sequences used as reference for the variable sites.
  ○ type: string (path)

6.5.1. Reference allele file [optional]

Defines which alleles will be used as references to generate the VCF files. The description of the allele sequences follows the labeling explained above in Section 6.2.5 (Single gene-tree file labeling). The content of the file should be formatted as:

repID,spID,locID,indID

Where:
● repID: replicate ID.
● spID: species ID (X value of the sequence description).
● locID: locus ID (Y value of the sequence description).
● indID: gene tree tip ID (Z value of the sequence description).

IMPORTANT: By default, if the reference allele file is not specified (or badly formatted), the ​ reference allele will correspond to the sequence named 1_0_0.
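As an illustration of this format, the following Python sketch reads such a file and returns the tip label (spID_locID_indID) to use for each replicate, falling back to 1_0_0. The function names are hypothetical, and two details are assumptions rather than documented NGSphy behavior: that the file holds one entry per replicate, and that badly formatted lines are simply skipped.

```python
import csv
import io

DEFAULT_LABEL = "1_0_0"  # default reference sequence named in the text


def read_reference_alleles(handle):
    """Map each replicate ID to a tip label 'spID_locID_indID'."""
    labels = {}
    for row in csv.reader(handle):
        fields = [field.strip() for field in row]
        if len(fields) != 4 or not all(field.isdigit() for field in fields):
            continue  # assumption: badly formatted lines are ignored
        rep_id, sp_id, loc_id, ind_id = fields
        labels[int(rep_id)] = f"{sp_id}_{loc_id}_{ind_id}"
    return labels


def reference_label(labels, rep_id):
    """Return the label for a replicate, or the default if none was given."""
    return labels.get(rep_id, DEFAULT_LABEL)


example = io.StringIO("1,2,0,1\n")
labels = read_reference_alleles(example)
print(reference_label(labels, 1))  # replicate listed in the file
print(reference_label(labels, 2))  # replicate not listed: default label
```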

6.5.1.1. Example

The simplest case is when the input is a single gene tree and all individuals have the same number of loci. Suppose we want to run NGSphy with the single gene tree input mode (inputmode=1). The gene tree is the following (as in Section 6.2.6):

FIGURE 11: Example of gene tree with proper label notation.

Also, we want to generate read counts, with no errors, and we want to use the gene-tree tip with label “2_0_1” as the reference allele. The reference allele file should contain: ​ ​

1,2,0,1

6.6. [execution] block

This section defines how NGSphy is executed. If the user has access to a computational cluster, the ART commands can be converted into jobs for the SGE or SLURM schedulers (see Section 6.6.2). If desired, NGSphy can also make the ART calls itself, transparently to the user (sequentially or in parallel, using multi-threading).

[execution]
environment=bash
runART=on
running_times=off
threads=4

6.6.1. Options

● environment
    ○ purpose: specify in which environment the ART runs are going to be executed (more details below).
    ○ type: enumerate (possible environments)
    ○ values:
        ■ bash: generates a bash file with all the commands used to call ART.
        ■ sge: generates the necessary files to run a job array in a cluster environment running Sun Grid Engine. Includes: seed file, job script and a possible script to launch ART jobs.
        ■ slurm: generates the necessary files to run a job array in a cluster environment running Simple Linux Utility for Resource Management. Includes: seed file, job script and a possible script to launch ART jobs.
● threads
    ○ purpose: number of threads to execute NGSphy.
    ○ type: number (integer)
    ○ value: 1
● runART
    ○ purpose: indicate whether the user actually wants to generate NGS reads. This will only run locally, under the bash environment.
    ○ type: boolean
    ○ values:
        ■ 1, true, on: run ART.
        ■ 0, false, off: don't run ART; bash scripts will be generated.
● running_times
    ○ purpose: obtain the running times file for the NGS mode processes (read counts or ART).
    ○ type: boolean
    ○ values:
        ■ 0, false, off: don't generate the file.
        ■ 1, true, on: generate the file.


IMPORTANT: the generation of this file increases the execution time of the program. ​

NOTES
● If the execution block is missing, a bash script will be generated and ART instances will not be run.
● If the option environment is missing, a bash script will be generated (default behavior) and ART instances will not be run, unless the runART option is set.
● If the option runART is missing, ART instances will not be run.
● If the value chosen for the option runART is invalid and bash is the value of environment, ART instances will not be run.
● If the value chosen for the option environment is invalid, the behavior will be as if there were no execution section: a bash script will be generated and ART instances will not be run.
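The fallback rules in these notes can be summarized as a small decision function. The sketch below is a hypothetical restatement of the notes, not NGSphy's own code: the function name, the dictionary representation of the [execution] block and the exact set of accepted strings are assumptions.

```python
TRUE_VALUES = {"1", "true", "on"}          # values that enable runART
ENVIRONMENTS = {"bash", "sge", "slurm"}    # valid execution environments


def resolve_execution(options):
    """Return (environment, run_art) for an [execution] block.

    `options` is a dict of option names to strings, or None when the
    whole block is missing from the settings file.
    """
    if options is None:
        return "bash", False               # no block: bash script, ART not run
    environment = options.get("environment", "bash").strip().lower()
    if environment not in ENVIRONMENTS:
        return "bash", False               # invalid value: as if no block existed
    run_value = options.get("runART", "off").strip().lower()
    # ART is only launched locally, under the bash environment.
    run_art = environment == "bash" and run_value in TRUE_VALUES
    return environment, run_art


print(resolve_execution(None))
print(resolve_execution({"runART": "on"}))
print(resolve_execution({"environment": "sge", "runART": "on"}))
```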

6.6.2. Cluster execution options (SGE,SLURM)

NGSphy can generate job templates for execution in computational clusters running Sun Grid Engine (Gentzsch 2001, Oracle Corp.) or Simple Linux Utility for Resource Management (Yoo et al. 2003, https://slurm.schedmd.com/). In this case, NGSphy generates two files: project.XXX.sh (the job script) and project.seedfile.txt (the seed file for job arrays).

Where:

● XXX: will be sge or slurm, according to the selected execution environment.
● project: if using any of the single gene tree input modes, it will be NGSphy. For the gene-tree distribution input mode, it will be the name of the SimPhy output folder.

To submit the jobs, use the command that corresponds to the job scheduler (SGE or SLURM):

● SGE:

qsub -t 1-100 project.sge.sh ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

● SLURM:

sbatch --array 1-100 project.slurm.sh ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

Here are some arbitrary examples of the generated files:

● SEED FILE

art_illumina art_illumina -ss GA2 -amp -p -sam -na -i /home/user/NGSphy_output/individuals/1/01/testwsimphy_1_1_data_0.fasta -l 50 -f 10 -o /home/user/NGSphy_output/reads/1/01/testwsimphy_1_1_data_0_R
art_illumina art_illumina -ss GA2 -amp -p -sam -na -i /home/user/NGSphy_output/individuals/1/01/testwsimphy_1_1_data_1.fasta -l 50 -f 10 -o /home/user/NGSphy_output/reads/1/01/testwsimphy_1_1_data_1_R
art_illumina art_illumina -ss GA2 -amp -p -sam -na -i /home/user/NGSphy_output/individuals/1/01/testwsimphy_1_1_data_2.fasta -l 50 -f 10 -o /home/user/NGSphy_output/reads/1/01/testwsimphy_1_1_data_2_R
...

● SGE job script:

#!/bin/bash
# SGE submission options
#$ -l num_proc=1      # number of processors to use
#$ -l h_rt=00:10:00   # Set 10 mins - Average amount of time for up to 1000bp
#$ -t 1-{0}           # Number of jobs/files that will be treated
#$ -N art.sims        # A name for the job

command=$(awk 'NR==$SGE_TASK_ID{{print $1}}' $SEEDFILE)
$command

● SLURM job script

#!/bin/sh
#SBATCH -n 1
#SBATCH --cpus-per-task 1
#SBATCH -t 00:10:00
#SBATCH --mem 4G
#SBATCH --array=1-1000

command=$(awk 'NR==$SLURM_ARRAY_TASK_ID{{print $1}}' $SEEDFILE)
$command


IMPORTANT: The job script files generated by NGSphy are general templates, and in most cases they will have to be modified according to the particular cluster environment. Users are strongly encouraged to consult the cluster administrator for proper execution.
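When adapting the templates, it may help to keep in mind the basic idea behind a job array: task number N runs the command stored on line N of the seed file. The following Python sketch illustrates that selection step only. The function name and the shortened seed lines are hypothetical, and the sketch does not reproduce the generated scripts.

```python
import shlex


def command_for_task(seed_lines, task_id):
    """Return the argument list stored on line `task_id` (1-based)."""
    if not 1 <= task_id <= len(seed_lines):
        raise IndexError(f"task {task_id} is outside 1-{len(seed_lines)}")
    return shlex.split(seed_lines[task_id - 1])


# Hypothetical, shortened seed-file content (one ART command per line).
seed_lines = [
    "art_illumina -ss GA2 -amp -p -sam -na -i in_0.fasta -l 50 -f 10 -o out_0_R",
    "art_illumina -ss GA2 -amp -p -sam -na -i in_1.fasta -l 50 -f 10 -o out_1_R",
]

args = command_for_task(seed_lines, 2)
print(args[0], args[args.index("-i") + 1])
```

The resulting argument list could then be handed to subprocess.run, provided ART is installed and the input file exists.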

6.6.3. Running times file

Generated to keep track of the timings for each ART call or each NGS read counts process. The file name follows the format:

project.info

where project will be NGSphy if using any of the single gene tree input modes, whereas for the gene-tree distribution input mode it will be the name of the SimPhy output folder.

Content of the file is formatted as follows:

repID,locID,indID,inputFile,cpuTime,seed,outputFilePrefix ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

● repID: replicate identifier.
● locID: locus identifier.
● indID: individual identifier.
● inputFile: path of the input file, corresponding to the individual FASTA file.
● cpuTime: processing time.
● seed: if the NGS mode needs a seed for the generation of random numbers, it will be here.
● outputFilePrefix: prefix of the file generated.

7. Output

The output of NGSphy will depend on the NGS mode selected (ngs-read-counts or ngs-reads-art). In both cases, the user will get a detailed log file and a folder structure as:


FIGURE 12: Folder structure of the NGSphy output.

The folder structure includes:
1. alignments: for single gene tree modes, stores the alignments and files generated for the INDELible run.
2. coverage: stores tables describing the coverage for each locus and individual, one per replicate.
3. individuals: stores the FASTA files with the individual sequences, structured along the hierarchy replicate > locus > individuals.
4. ind_labels: stores the correspondence between sequences and individuals.
5. reads: for Illumina reads, stores the ALN/BAM and/or FASTQ files generated by ART. For read counts, stores all the VCF files. Structured as the hierarchy:
    a. replicate > locus > ALN/BAM/FASTQ/VCF files
6. ref_alleles: stores the sequences of the reference alleles used for the simulation of read counts.
7. scripts: stores all the bash scripts generated.

7.1. Alignments

Stores the simulated sequences in FASTA format, both unaligned (*.FASTA) and aligned (*_TRUE.fasta), together with the INDELible/INDELible-NGSphy control file, ancestral sequence, and gene trees.

alignments/
|__ngsphy.tree # if inputmode = 3
|__NGSphy.indelible.times # if running_times=1
|__1
   |__control.txt
   |__ancestral.fasta # if inputmode in [2,3]
   |__ngsphydata_1.fasta
   |__ngsphydata_1_TRUE.fasta
   |__LOG.txt # default indelible file
   |__tree.txt # default indelible file

7.2. Coverage

This folder will contain comma-separated values (CSV) files with the coverage distribution of each individual per replicate. Each file stores a matrix of shape (number of individuals x number of loci), where each cell corresponds to the depth of coverage of the locus for the specific individual. The format of the file name is:

project.repID.csv ​ ​ ​ ​

Where:
● project: if using any of the single gene tree input modes, it will be NGSphy. For the gene-tree distribution input mode, it will be the name of the SimPhy output folder.
● repID: number of the replicate.

Folder structure will look like this:

coverage/
|__SimPhyOutput.1.csv
|__SimPhyOutput.2.csv
|__SimPhyOutput.3.csv
...

7.3. Ind_labels

These files store the correspondence between the original sequences and the generated individuals. Each table is a CSV file named as follows:

project.repID.individuals.csv ​ ​ ​ ​ ​ ​


Where:
● project: if using any of the single gene tree input modes, it will be NGSphy. For the gene-tree distribution input mode, it will be the name of the SimPhy output folder.
● repID: number of the replicate.

Folder structure will look like this:

ind_labels/
|__NGSphy.1.individuals.csv

7.3.1. Haploid individuals

This folder will contain tables with the correspondence between the individual identifier and the corresponding sequence identifier. CSV file format:

repID,indID,spID,locID,geneID
1,0,0,0,0
1,1,1,0,1
1,2,1,0,2

Where:
● repID: identifier of the replicate to which the gene trees and sequences belong.
● indID: identifier of the haploid individual.
● spID: identifier of the species.
● locID: identifier of the locus.
● geneID: identifier of the gene copy.

7.3.2. Diploid individuals

These tables will contain the correspondence between each individual and its two sequences. CSV file format:

repID,indID,spID,locID,mateID1,mateID2
1,1,1,0,3,0
1,2,1,0,4,1
1,3,1,0,2,5
1,4,3,0,4,0

Where:
● repID: identifier of the replicate.
● indID: identifier of the generated diploid individual.
● spID: identifier of the species.
● locID: identifier of the locus.
● mateID(1&2): identifiers of the gene copies paired within each individual.

7.4. Individuals

This folder will store the diploid individual sequence files (i.e., 2 sequences for each locus), hierarchically organized within replicates and loci. For example:

individuals/
|__1/
   |__1/
      |__prefix_1_1_ind1.fasta
      |__prefix_1_1_ind2.fasta
      |__prefix_1_1_ind3.fasta
   |__2/
      |__prefix_1_2_ind1.fasta
      |__prefix_1_2_ind2.fasta
      |__prefix_1_2_ind3.fasta
|__2/
   |__1/
      |__prefix_2_1_ind1.fasta
      |__prefix_2_1_ind2.fasta
      |__prefix_2_1_ind3.fasta
   |__2/
      |__prefix_2_2_ind1.fasta
      |__prefix_2_2_ind2.fasta
      |__prefix_2_2_ind3.fasta

7.5. Ref_alleles

This folder contains the FASTA files with the reference allele sequences used in the VCF files with the read counts. The folder is structured per replicate. There is one reference allele file per locus, and each file contains a single sequence. The format of each file name is:

project_REF_repID_locID.fasta ​ ​

Where:
● project: if using any of the single gene tree input modes, it will be NGSphy. For the gene-tree distribution input mode, it will be the name of the SimPhy output folder.
● repID: replicate identifier.
● locID: locus identifier.

Folder structure will look like this:

ref_alleles/
|__1/
   |__NGSphy_REF_1_1.fasta
   |__NGSphy_REF_1_2.fasta
   ...
|__2/
   |__NGSphy_REF_2_1.fasta
   |__NGSphy_REF_2_2.fasta
   ...

7.6. Scripts

This folder will store all the scripts for ART execution, according to the options in the execution block. If NGSphy is run for a cluster environment, the folder will contain the job script and the seed file. If bash is chosen as the environment and the ART commands are not executed within NGSphy, it will contain a single bash script.

SGE:
reads/
|__project.sge.sh
|__project.sh

SLURM:
reads/
|__project.slurm.sh
|__project.sh

bash:
reads/
|__project.sh

Where project will be NGSphy if using any of the single gene tree input modes, whereas for the gene-tree distribution input mode it will be the name of the SimPhy output folder.

7.7. NGS mode

Data will be structured per replicate.

7.7.1. NGS reads ART

This folder will store the output of ART. It follows the same folder structure as the individuals folder, but instead of FASTA files it will contain the FASTQ files (and the alignment and mapping files, ALN and SAM, if requested) generated by ART.

reads/
|__1/
   |__1/
      |__prefix_1_1_ind1_R1.fq
      |__prefix_1_1_ind1_R2.fq
      |__prefix_1_1_ind2_R1.fq
      |__prefix_1_1_ind2_R2.fq
   |__2/
      |__prefix_1_2_ind1_R1.fq
      |__prefix_1_2_ind1_R2.fq
      |__prefix_1_2_ind2_R1.fq
      |__prefix_1_2_ind2_R2.fq
|__2/
   |__1/
      |__prefix_2_1_ind1_R1.fq
      |__prefix_2_1_ind1_R2.fq
      |__prefix_2_1_ind2_R1.fq
      |__prefix_2_1_ind2_R2.fq
   |__2/
      |__prefix_2_2_ind1_R1.fq
      |__prefix_2_2_ind1_R2.fq
      |__prefix_2_2_ind2_R1.fq
      |__prefix_2_2_ind2_R2.fq

NOTE: Independently of the environment chosen and the value of the “runART” option, ​ NGSphy will generate the hierarchical folder structure.

7.7.2. NGS read counts

This folder will store the output obtained from the read count simulation. It is structured in 2 sub-folders (with and without sequencing errors), each structured per replicate and containing as many VCF files as loci.

Sub-folders will be:
● no_error: VCF files with the simulated read counts without sequencing error.
● with_error: VCF files with the simulated read counts with the introduced sequencing error.

reads
|__no_error/
   |__1/
      |__prefix_1_1_TRUE.VCF
      |__prefix_1_2_TRUE.VCF
      |__prefix_1_3_TRUE.VCF
   |__2/
      |__prefix_2_1_TRUE.VCF
      |__prefix_2_2_TRUE.VCF
      |__prefix_2_3_TRUE.VCF
|__with_error/
   |__1/
      |__prefix_1_1.VCF
      |__prefix_1_2.VCF
      |__prefix_1_3.VCF
   |__2/
      |__prefix_2_1.VCF
      |__prefix_2_2.VCF
      |__prefix_2_3.VCF

7.8. Other files

7.8.1. Summary log file

In the output folder there will be a file which contains a summary of the parameters used for the simulation. The name of the file will have the format:

NGSPHY.YYYYMMDD-HH:mm:SS.summary.log ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

● YYYY: year
● MM: month
● DD: day
● HH: hours
● mm: minutes
● SS: seconds
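As a small illustration of this naming pattern, the timestamp can be built with Python's strftime. The function name is hypothetical, and the use of a 24-hour clock for HH is an assumption, since the manual does not state it.

```python
from datetime import datetime


def summary_log_name(when):
    """Build a file name following NGSPHY.YYYYMMDD-HH:mm:SS.summary.log."""
    # %H is the 24-hour clock (assumption), %M minutes, %S seconds.
    return when.strftime("NGSPHY.%Y%m%d-%H:%M:%S.summary.log")


name = summary_log_name(datetime(2017, 8, 13, 11, 21, 19))
print(name)  # NGSPHY.20170813-11:21:19.summary.log
```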

Here is an example of the contents of the summary log file:

Settings:
[GENERAL]
    path : ./
    output_folder_name : NGSphy_case1_100x_RC
    ploidy : 2 (Diploid individuals)
    seed : 50426717
    numreplicates : 1
    numlociperreplicate : 1
    filtered_replicates : 1
    numindividualsperreplicate : 8
[DATA]
    inputmode : 4 (Gene-tree distribution-SimPhy output)
    simphy_folder_path : SimPhy_usecase
    simphy_data_prefix : data
[COVERAGE]
    experiment : f:100
[NGS-READ-COUNTS]
    read_counts_error : 0
    reference_alleles_file : files/my_reference_allele_file.case1.txt
[EXECUTION]
    environment : bash
    threads : 2
    runart : off
    running_times : 0

7.8.2. Running times file

Stores information related to the time used in each ART run or read-count thread per locus. This file contains the input/output files for each process and its corresponding individual, locus (gene tree) and replicate (repID). See Section 6.6.3 for more details.

Example of the file

1,1,0,output/individuals/1/01/test_wrapper_1_01_data_0.fasta,0.013984,1479977980,output/reads/1/01/test_wrapper_1_01_data_0_R
1,1,1,output/individuals/1/01/test_wrapper_1_01_data_1.fasta,0.014757,1479977980,output/reads/1/01/test_wrapper_1_01_data_1_R
1,1,2,output/individuals/1/01/test_wrapper_1_01_data_2.fasta,0.013589,1479977980,output/reads/1/01/test_wrapper_1_01_data_2_R
1,1,3,output/individuals/1/01/test_wrapper_1_01_data_3.fasta,0.013404,1479977980,output/reads/1/01/test_wrapper_1_01_data_3_R
1,1,4,output/individuals/1/01/test_wrapper_1_01_data_4.fasta,0.013775,1479977980,output/reads/1/01/test_wrapper_1_01_data_4_R
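A file in this format can be summarized with a few lines of Python, for instance to add up the CPU time spent per replicate and locus. This is a sketch: the function name is hypothetical, and it assumes the file has no header row and uses the column order given in Section 6.6.3.

```python
import csv
import io
from collections import defaultdict

FIELDS = ["repID", "locID", "indID", "inputFile",
          "cpuTime", "seed", "outputFilePrefix"]


def cpu_time_per_locus(handle):
    """Sum cpuTime for every (repID, locID) pair found in the file."""
    totals = defaultdict(float)
    reader = csv.DictReader(handle, fieldnames=FIELDS, skipinitialspace=True)
    for row in reader:
        key = (int(row["repID"]), int(row["locID"]))
        totals[key] += float(row["cpuTime"])
    return dict(totals)


example = io.StringIO(
    "1,1,0,output/individuals/1/01/test_wrapper_1_01_data_0.fasta,"
    "0.013984,1479977980,output/reads/1/01/test_wrapper_1_01_data_0_R\n"
    "1,1,1,output/individuals/1/01/test_wrapper_1_01_data_1.fasta,"
    "0.014757,1479977980,output/reads/1/01/test_wrapper_1_01_data_1_R\n"
)
totals = cpu_time_per_locus(example)
print(totals)
```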

7.8.3. Debug file

Getting a debug log file is optional for each NGSphy run. If the “-l/--log” option in the command line is set to DEBUG, the file will be generated in the current working directory under the name:

NGSPHY.YYYYMMDD-HH:mm:SS.log ​ ​ ​ ​ ​ ​ ​ ​ ​ ​

● YYYY: year
● MM: month
● DD: day
● HH: hours
● mm: minutes
● SS: seconds

This file stores information of the program execution, at a very detailed level. A debug log file will look like this:

13/08/2017 11:21:19 AM - ERROR (__main__|handlingCmdArguments:82): Something happened while parsing the arguments. Please verify. Exiting.

8. Additional information

8.1. Motivation

Advances in sequencing technologies have made it very common for phylogenomic datasets to consist of large numbers of loci from multiple species and individuals. The use of next-generation sequencing (NGS) for phylogenomics implies a complex computational pipeline in which multiple technical and methodological decisions, from coverage to assembly, mapping, variant calling and/or phasing, might influence the final tree obtained. In order to assess the influence of these variables, here we introduce NGSphy, an open-source tool for the genome-wide simulation of Illumina reads obtained from thousands of gene families evolving under a common species tree, with multiple haploid and/or diploid individuals per species, where sequencing coverage (depth) heterogeneity can be modeled across individuals and loci, including off-target loci and phylogenetic decay. Moreover, parameter values for the different replicates can be sampled from user-defined statistical distributions.

FIGURE 13: A possible analysis pipeline for multilocus, multispecies datasets with multiple individuals with the final goal of exploring the sensitivity of species tree inferences to NGS parameterization variation.

8.2. What can be done with NGSphy? The detailed scenarios

With NGSphy you can generate:
● haploid/diploid individuals from gene-tree distributions
● genome sequences of haploid/diploid individuals from a single gene tree
● genome sequences of haploid/diploid individuals from a single gene tree and a user-defined ancestral sequence
● genome sequences of haploid/diploid individuals from a single gene tree, a user-defined ancestral sequence and an anchor tip
● NGS Illumina reads of haploid/diploid individuals
● NGS read counts of haploid/diploid individuals
● For the NGS data generation, variation of coverage due to the following:
    ○ variation across individuals and/or loci
    ○ targeted-sequencing effects
        ■ on/off target loci
        ■ on-target loci not captured
        ■ taxon-specific variation

8.3. Third-party software involved

8.3.1. ART ● Huang W, Li L, Myers JR, and Marth, GT (2012) ART: a next-generation sequencing read simulator. Bioinformatics 28 (4): 593-594 ​ ​

ART (http://www.niehs.nih.gov/research/resources/software/biostatistics/art/) is a set of simulation tools to generate synthetic next-generation sequencing reads. ART simulates sequencing reads by mimicking the real sequencing process with empirical error models or quality profiles summarized from large recalibrated sequencing data. ART can also simulate reads using the user's own read error models or quality profiles. ART supports simulation of single-end and paired-end/mate-pair reads of three major commercial next-generation sequencing platforms: Illumina’s Solexa, Roche’s 454 and Applied Biosystems’ SOLiD. ART can be used to test or benchmark a variety of methods or tools for next-generation sequencing data analysis, including read alignment, de novo assembly, and SNP and structural variation discovery. ART outputs reads in the FASTQ format, and alignments in the ALN format. ART can also generate alignments in the SAM alignment or UCSC BED file format.

8.3.2. INDELible ● William Fletcher and Ziheng Yang (2009) INDELible: A flexible simulator of biological sequence evolution. Molecular Biology and Evolution. 26 (8): 1879–88. doi:10.1093/molbev/msp098

INDELible (http://abacus.gene.ucl.ac.uk/software/indelible/) is an application for biological ​ ​ sequence simulation that combines many features. Using a length-dependent model of indel formation it can simulate evolution of multi-partitioned nucleotide, amino-acid, or codon data sets through the processes of insertion, deletion, and substitution in continuous time.

Nucleotide simulations may use the general unrestricted model or the general time reversible model and its derivatives, and amino-acid simulations can be conducted using fifteen different empirical rate matrices. Substitution rate heterogeneity can be modelled via the continuous and discrete gamma distributions, with or without a proportion of invariant sites. INDELible can also simulate under non-homogenous and non-stationary conditions where evolutionary models are permitted to change across a phylogeny. Unique among indel simulation programs, INDELible offers the ability to simulate using codon models that exhibit nonsynonymous/synonymous rate ratio heterogeneity among sites and/or lineages.

8.3.3. SimPhy ● Diego Mallo, Leonardo De Oliveira Martins and David Posada (2015). SimPhy : Phylogenomic Simulation of Gene, Locus, and Species Trees. Systematic Biology., November, syv082. doi:10.1093/sysbio/syv082

SimPhy (https://github.com/adamallo/simphy) is a program for the simulation of gene family evolution under incomplete lineage sorting (ILS), gene duplication and loss (GDL), replacing horizontal gene transfer (HGT) and gene conversion (GC). SimPhy simulates species, locus and gene trees with different levels of rate heterogeneity, and uses INDELible to evolve nucleotide/codon/aminoacid sequences along the gene trees. The inputs for SimPhy are the simulation parameter values, which can be fixed or sampled from user-defined statistical distributions. The output consists of sequence alignments and a relational database that facilitates posterior analyses.

8.4. NGSphy workflow

FIGURE 14: NGSphy workflow


NGSphy first verifies the content of the project, the settings files involved and/or the existence of the corresponding third-party applications required to run. If the input data corresponds to a single gene tree and a user-defined ancestral sequence, the tree is first rooted at the selected gene-tree tip. The next step (for any single gene tree input mode) is to evolve sequences along the tree under the specified evolutionary model to obtain the expected genome sequences. Then (for any input mode), the individuals are generated, whether haploid or diploid:

● For haploid individuals, the resulting genome sequences are separated into single FASTA files and identified. In addition, a file is generated with the correspondence between each individual generated and the description of the sequence it belongs to.
● For diploid individuals, NGSphy verifies that the project content includes species trees with an even number of individuals per taxon. Sequences are then "paired": individuals are generated by randomly sampling, without replacement, two sequences within the same gene family and species. The output will include a table for each replicate with the identifiers of the sequences paired and the individuals generated.
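The pairing step for diploid individuals can be sketched as follows. This is an illustration of sampling without replacement within a species, not NGSphy's own code: the function name is hypothetical, and the tip labels are assumed to follow the spID_locID_indID notation and to belong to a single gene family.

```python
import random
from collections import defaultdict


def pair_gene_copies(tip_labels, rng):
    """Pair gene copies of the same species into diploid individuals."""
    by_species = defaultdict(list)
    for label in tip_labels:
        species_id = label.split("_")[0]
        by_species[species_id].append(label)

    individuals = []
    for species_id in sorted(by_species):
        copies = by_species[species_id]
        if len(copies) % 2 != 0:
            raise ValueError(
                f"species {species_id} has an odd number of gene copies")
        # Shuffling and taking consecutive pairs is equivalent to drawing
        # two sequences at a time without replacement.
        rng.shuffle(copies)
        for i in range(0, len(copies), 2):
            individuals.append((species_id, copies[i], copies[i + 1]))
    return individuals


tips = ["1_0_0", "1_0_1", "1_0_2", "1_0_3", "2_0_0", "2_0_1"]
individuals = pair_gene_copies(tips, random.Random(1))
for individual in individuals:
    print(individual)
```

Each gene copy ends up in exactly one individual, and both copies of an individual always come from the same species.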

Afterwards, the coverage variation matrices are computed according to the parameters introduced, and finally the sequencing data are generated, consisting of either Illumina reads or read counts (VCF files).

● For Illumina reads, the program calls ART, the NGS simulator, with the parameters established in the settings file and generates reads from the previously generated individuals. The resulting files depend on the settings introduced: they are the files related to the execution of the ART processes (scripts and text files) and the output of such processes (ALN, BAM and/or FASTQ files).
● For read counts, two scenarios are simultaneously computed, with and without errors.

8.5. Read count simulation

The read count approach is based on the assumption (Ritz et al., 2011) that the sequencing process is uniform in generating short reads from the target genome, and that the number of reads mapped to a region is expected to be proportional to the number of times the region appears in a DNA sample (Ji and Chen, 2015). Read counts are produced under a user-defined error rate. First, the variable sites (with respect to the reference sequences) are identified. Then, the coverage for each position is sampled from a Negative Binomial distribution whose mean and overdispersion parameter are both set to the sampled coverage for the specific locus and individual. For diploid individuals, coverage is further split among chromosomes with equal probability. Genotype likelihoods for every site are computed as in GATK (McKenna et al., 2010) (see also Korneliussen et al., 2014). The output is a set of VCF files, one per locus.
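The steps of this procedure can be sketched in Python using only the standard library. The sketch makes several assumptions that should be read as such: the Negative Binomial is drawn as a Gamma-Poisson mixture with mean and size both equal to the coverage, the genotype likelihood uses the common diploid model in which each read comes from either allele with probability 0.5 and a sequencing error yields any of the three other bases with equal probability, and all function names are hypothetical. Whether NGSphy implements each step exactly this way is not confirmed here.

```python
import math
import random


def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for moderate values of lam."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def sample_site_depth(coverage, rng):
    """Negative Binomial draw with mean = size = coverage (Gamma-Poisson)."""
    if coverage <= 0:
        return 0
    lam = rng.gammavariate(coverage, 1.0)  # shape = size, scale = mean / size
    return sample_poisson(lam, rng)


def split_between_chromosomes(depth, rng):
    """Assign each read to one of two chromosomes with probability 0.5."""
    first = sum(rng.random() < 0.5 for _ in range(depth))
    return first, depth - first


def genotype_log10_likelihood(bases, genotype, error):
    """log10 P(reads | genotype) under the assumed diploid error model."""
    total = 0.0
    for base in bases:
        p = sum(0.5 * ((1 - error) if base == allele else error / 3)
                for allele in genotype)
        if p == 0:
            return float("-inf")
        total += math.log10(p)
    return total


rng = random.Random(7)
depth = sample_site_depth(100, rng)
first, second = split_between_chromosomes(depth, rng)
print(depth, first, second)
print(genotype_log10_likelihood(["A", "A", "C"], ("A", "C"), 0.01))
```

For the reads A, A, C, the heterozygous genotype AC obtains a higher likelihood than the homozygous genotype AA, as expected.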

9. Getting help

Most common issues, doubts and questions should be solved by reading this manual. If that is not the case, or if you find a bug, you can post an issue to this repository with the following files attached, for reproducibility purposes:
● the settings file
● the .command file or the indelible_control.txt file.

10. Development and testing

This software has been developed for Linux/Mac environments and specifically tested under:

● Linux Kernel:

4.8.0-58-generic #63~16.04.1-Ubuntu SMP Mon Jun 26 18:08:51 UTC 2017 x86_64 x86_64 ​ ​ ​ ​ ​ ​ ​ x86_64 GNU/Linux

● Distribution

Ubuntu 16.04.2 LTS ​ ​ ​ ​ ​

● Hardware

Dual core Intel Core i5-3427U (-HT-MCP-) cache: 3072 KB ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ ​ 8GB RAM ​ ​

11. Tutorials

Here we provide settings files for 4 simple test-case scenarios. They are available in ngsphy/data/settings (see Escalona et al. submitted for further details) and correspond to the 4 possible input modes. The trees, sequences, reference alleles and INDELible control files needed for their execution can be found, respectively, in:

● ngsphy/data/trees
● ngsphy/data/sequences/
● ngsphy/data/reference_alleles/
● ngsphy/data/indelible/

You can use them to check the proper installation of the pipeline and adapt them to your particular case. In addition, there is a script file for each case scenario in: ngsphy/test

1. Generating read counts from a single gene tree
    a. Corresponds to inputmode=1
    b. Script: ngsphy.test.1.sh
2. Generating Illumina reads from a single gene tree, using an ancestral sequence
    a. Corresponds to inputmode=2
    b. Script: ngsphy.test.2.sh
3. Generating read counts from a single gene tree, using an anchor sequence
    a. Corresponds to inputmode=3
    b. Script: ngsphy.test.3.sh
4. Generating Illumina reads from gene tree distributions
    a. Corresponds to inputmode=4
    b. Script: ngsphy.test.4.sh

NOTES:
- All settings files described here assume that:
    a. executables are accessible from any folder, and are properly (re)named.
    b. all the requested files are in the same directory; otherwise you will have to change the path-related options accordingly.

11.1. Generating read counts from a single gene tree

Here we will generate read counts for the tips of a single gene tree. The read counts will have a sequencing error rate of 0.1 (read_counts_error=0.1) and the expected coverage is 100x. No reference allele file is given, so the sequence with the label 1_0_0 is used (default). Output will be stored in the current working directory. The settings file is named ngsphy.settings.1.txt and looks as follows:

[general]
path=.
output_folder_name=NGSphy_output
ploidy=1
[data]
inputmode=1
gene_tree_file=t1.tree
indelible_control_file=control.1.txt
[coverage]
experiment=F:100
[ngs-read-counts]
read_counts_error=0.1
[execution]
environment=bash
running_times=off
threads=2

FILE: ngsphy.settings.1.txt
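Because the settings file uses an INI-like layout, its values can be inspected with Python's standard configparser module, for instance to check a file before launching a run. This is a sketch under the assumption that the file is plain INI. How NGSphy itself reads the file is not described here.

```python
import configparser

SETTINGS = """
[general]
path=.
output_folder_name=NGSphy_output
ploidy=1
[data]
inputmode=1
gene_tree_file=t1.tree
indelible_control_file=control.1.txt
[coverage]
experiment=F:100
[ngs-read-counts]
read_counts_error=0.1
[execution]
environment=bash
running_times=off
threads=2
"""

parser = configparser.ConfigParser()
parser.read_string(SETTINGS)

ploidy = parser.getint("general", "ploidy")
error = parser.getfloat("ngs-read-counts", "read_counts_error")
running_times = parser.getboolean("execution", "running_times")
# 'F:100' is split into the distribution code and its parameter.
kind, value = parser.get("coverage", "experiment").split(":")

print(ploidy, error, running_times, kind, float(value))
```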

● For the generation of sequences with INDELible:

[TYPE] NUCLEOTIDE 1 // nucleotide simulation using algorithm from method 1
[SETTINGS]
  [ancestralprint] NEW // generates a file with the ancestral sequence
  [output] FASTA
[MODEL] m1 // no insertions, no gamma
  [submodel] HKY 2.5 // HKY with kappa=2.5
  [statefreq] 0.1 0.2 0.3 0.4 // frequencies for T C A G
[NGSPHYPARTITION] t1 m1 1000 // t1 with model m1 to generate 1000 bp long sequences

FILE: control.1.txt

● The tree has 4 tips, one per individual.

FIGURE 15: Tree file t1.tree

((1_0_0:1.0, 2_0_0:1.0):1.0,(3_0_0:1.0, 4_0_0:1.0):1.0);

FILE: t1.tree
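To check that the tips of a tree file follow the expected spID_locID_indID notation, the labels can be extracted with a regular expression. The sketch below uses a hypothetical helper name and only handles labels made of three underscore-separated integers followed by a branch length.

```python
import re

# A tip label is three integers joined by underscores, followed by ':'.
TIP = re.compile(r"(\d+)_(\d+)_(\d+)(?=:)")


def tip_labels(newick):
    """Return (spID, locID, indID) tuples for every tip in a Newick string."""
    return [tuple(int(x) for x in match.groups())
            for match in TIP.finditer(newick)]


newick = "((1_0_0:1.0, 2_0_0:1.0):1.0,(3_0_0:1.0, 4_0_0:1.0):1.0);"
tips = tip_labels(newick)
print(tips)
```

For t1.tree this yields four tips, one per species, all for locus 0 and gene copy 0.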

11.1.1. Execution

To run this example, use:

ngsphy -s ngsphy.settings.1.txt ​ ​ ​ ​ ​ ​

11.1.2. Output

Several output folders/files are produced under the main directory:

● alignments: contains the data used and generated by INDELible.
● coverage: the exact coverage for each locus (L) of each individual (I), written into a table with dimensions (I x L). In this case, this value was fixed at 100x for all individuals.
● individuals: where the sequence files (FASTA) for all loci and individuals are written. There is a subfolder structure reflecting the number of replicates and gene trees. In this case, the single gene tree t1 has 4 tips, corresponding to 4 haploid individuals, so we will have 4 FASTA files, each containing a single sequence corresponding to one individual. Here is an example of a single individual file (see more in Section 6.2.7, Individual assignment):

$ cat individuals/1/1/NGSphy_1_1_ngsphydata_0.fasta ​ ​ ​ ​ ​ ​ ​ ​

>NGSphy:1:1:ngsphydata:0:1_0_0
GGCCGTGGCCCGGGGTGGGAAACGGCCGACGAAAATGGGGGAATCCAACCAGTGG....

● ind_labels: it has as many files as replicates, and stores the relation replicate/individual/species/locus/sequence. In this case:

$ cat ind_labels/NGSphy.1.individuals.csv
indexREP,indID,speciesID,locusID,geneID
1,0,1,0,0
1,1,2,0,0
1,2,3,0,0
1,3,4,0,0
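A table like this one is easy to load for downstream checks, for example to list which individuals belong to each species. The sketch below uses Python's csv module with the column names shown in the header of this example. The function name is hypothetical.

```python
import csv
import io
from collections import defaultdict

TABLE = """indexREP,indID,speciesID,locusID,geneID
1,0,1,0,0
1,1,2,0,0
1,2,3,0,0
1,3,4,0,0
"""


def individuals_per_species(handle):
    """Group individual identifiers by species identifier."""
    groups = defaultdict(list)
    for row in csv.DictReader(handle):
        groups[int(row["speciesID"])].append(int(row["indID"]))
    return dict(groups)


groups = individuals_per_species(io.StringIO(TABLE))
print(groups)
```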

● reads: stores the VCF files with the simulated read counts. If sequencing error is introduced, the VCF files without errors will also be written.

|__reads/
   |__no_error/ # VCF files without sequencing error
      |__1/ # replicate identifier
         |__ngsphydata_1_1_NOERROR.vcf
   |__with_error/ # VCF files with sequencing error
      |__1/ # replicate identifier
         |__ngsphydata_1_1.vcf

● ref_alleles: FASTA files with the sequences of the reference alleles used for the read count process.

|__ref_alleles/
    |__1/                          # replicate identifier
        |__NGSphy_REF_1_1.fasta    # FASTA file with the reference allele sequence for replicate 1, locus 1.

11.2. Generating Illumina reads from a single gene tree, using an ancestral sequence

Here we will simulate Illumina reads from diploid individuals evolving under a single gene tree with a known ancestral sequence. The Illumina reads will have the following characteristics:
● Machine: HiSeq2000
● 100bp PE reads
● Fragments with a mean length of 250bp (standard deviation 50bp)
● Expected coverage of 50x

The settings file, named ngsphy.settings.2.txt, looks as follows:

[general]
path=.
output_folder_name=NGSphy_output
ploidy=2
[data]
inputmode=2
gene_tree_file=t2.tree
ancestral_sequence_file=my_ancestral.fasta
indelible_control_file=control.2.txt
[coverage]
experiment=F:50
[ngs-reads-art]
fcov=true
amp=true
l=100
m=250
p=true
q=true
s=50
sam=true
ss=HS20
[execution]
environment = bash
runART=off
threads=2
FILE: ngsphy.settings.2.txt
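Since the settings file follows an INI-like layout, its contents can be inspected programmatically before launching a run. The following is a minimal sketch using Python's standard configparser module; the function name summarize is hypothetical, and the sketch only reads the values shown above without assuming anything about how NGSphy itself parses or validates them.

```python
import configparser

# Copy of ngsphy.settings.2.txt as listed above.
SETTINGS = """\
[general]
path=.
output_folder_name=NGSphy_output
ploidy=2
[data]
inputmode=2
gene_tree_file=t2.tree
ancestral_sequence_file=my_ancestral.fasta
indelible_control_file=control.2.txt
[coverage]
experiment=F:50
[ngs-reads-art]
fcov=true
amp=true
l=100
m=250
p=true
q=true
s=50
sam=true
ss=HS20
[execution]
environment = bash
runART=off
threads=2
"""


def summarize(settings_text):
    """Return a few key values from a settings file (hypothetical helper)."""
    # Only "=" is used as delimiter so that values such as "F:50" stay intact.
    parser = configparser.ConfigParser(delimiters=("=",))
    parser.read_string(settings_text)
    code, _, value = parser["coverage"]["experiment"].partition(":")
    return {
        "ploidy": parser.getint("general", "ploidy"),
        "inputmode": parser.getint("data", "inputmode"),
        "coverage_distribution": code,
        "coverage_value": float(value),
        "read_length": parser.getint("ngs-reads-art", "l"),
        "run_art": parser.getboolean("execution", "runART"),
    }


print(summarize(SETTINGS))
```

A check of this kind can catch a mistyped section or option name before a long simulation is started.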

INDELible control file:

[TYPE] NUCLEOTIDE 1
[SETTINGS]
  [output] FASTA
  [ancestralprint] NEW
[MODEL] m1 // no insertions, no gamma
  [submodel] HKY 0.5 // HKY with kappa=0.5
[NGSPHYPARTITION] t2 m1 500
FILE: control.2.txt

The tree, in this case, has 8 tips belonging to 4 individuals of 4 species.

FIGURE 16: Tree file t2.tree

(((1_0_1:1.0,1_0_0:1.0):1.0,(2_0_1:1.0,2_0_0:1.0):1.0):1.0,((3_0_1:1.0,3_0_0:1.0):1.0,(4_0_1:1.0,4_0_0:1.0):1.0):1.0);
FILE: t2.tree
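The relation between tips and individuals stated for this tree can be checked with a short script. The sketch below only looks at the tip labels, assuming each label has three underscore-separated integer fields whose first field identifies the species; the function names are hypothetical and this is not part of NGSphy.

```python
import re
from collections import defaultdict

# Tip labels and branch lengths of t2.tree.
T2 = ("(((1_0_1:1.0,1_0_0:1.0):1.0,(2_0_1:1.0,2_0_0:1.0):1.0):1.0,"
      "((3_0_1:1.0,3_0_0:1.0):1.0,(4_0_1:1.0,4_0_0:1.0):1.0):1.0);")


def tips_by_species(newick):
    """Group tip labels by their first field (assumed to be the species)."""
    groups = defaultdict(list)
    for species, locus, copy in re.findall(r"(\d+)_(\d+)_(\d+):", newick):
        groups[species].append(f"{species}_{locus}_{copy}")
    return dict(groups)


def individuals_per_species(newick, ploidy):
    """Number of individuals each species yields for a given ploidy."""
    return {sp: len(tips) // ploidy
            for sp, tips in tips_by_species(newick).items()}


print(tips_by_species(T2))
print(individuals_per_species(T2, ploidy=2))
```

With ploidy=2, the 8 tips collapse into one diploid individual for each of the 4 species.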

11.2.1. Execution

To run this example, use:

ngsphy -s ngsphy.settings.2.txt

11.2.2. Output

Several output files are produced under the main directory:
● alignments: contains the data used and generated by INDELible.
● coverage: the exact coverage for each locus (L) of each individual (I), written as a table with dimensions (I x L). In this case, the value was fixed at 50x for all individuals.
● individuals: where the sequence files (FASTA) for all loci and individuals are written. The subfolder structure reflects the number of replicates and gene trees. In this case, the single gene tree t2 has 8 tips, corresponding to 4 diploid individuals, so we will have 4 FASTA files, each containing the 2 sequences of one individual.
● ind_labels: correspondence replicate/individual/species/locus/sequence. In this case:

$ cat NGSphy.1.individuals.csv
indexREP,indID,speciesID,locusID,mateID1,mateID2
1,0,2,0,0,1
1,1,3,0,0,1
1,2,4,0,1,0
1,3,1,0,1,0

● reads: stores the FASTQ files generated by ART. In this case the execution of ART has been turned off (runART=off), so we obtain an empty hierarchical folder structure.
● scripts: a file with all the command lines needed to generate the Illumina reads for all the diploid individuals with ART. For medium to large datasets it is convenient to use this feature and run ART separately. The file looks like this:

$ cat NGSphy_output/scripts/NGSphy.sh
art_illumina -amp -l 100 -m 250 -p -q -s 50 -sam -ss HS20 --fcov 50.0 --in /home/user/git/test-ngsphy/test2/NGSphy_output/individuals/1/1/NGSphy_1_1_ngsphydata_0.fasta --out /home/user/git/test-ngsphy/test2/NGSphy_output/reads/1/1/NGSphy_1_1_ngsphydata_0_R
art_illumina -amp -l 100 -m 250 -p -q -s 50 -sam -ss HS20 --fcov 50.0 --in /home/user/git/test-ngsphy/test2/NGSphy_output/individuals/1/1/NGSphy_1_1_ngsphydata_1.fasta --out /home/user/git/test-ngsphy/test2/NGSphy_output/reads/1/1/NGSphy_1_1_ngsphydata_1_R
art_illumina -amp -l 100 -m 250 -p -q -s 50 -sam -ss HS20 --fcov 50.0 --in /home/user/git/test-ngsphy/test2/NGSphy_output/individuals/1/1/NGSphy_1_1_ngsphydata_2.fasta --out /home/user/git/test-ngsphy/test2/NGSphy_output/reads/1/1/NGSphy_1_1_ngsphydata_2_R
art_illumina -amp -l 100 -m 250 -p -q -s 50 -sam -ss HS20 --fcov 50.0 --in /home/user/git/test-ngsphy/test2/NGSphy_output/individuals/1/1/NGSphy_1_1_ngsphydata_3.fasta --out /home/user/git/test-ngsphy/test2/NGSphy_output/reads/1/1/NGSphy_1_1_ngsphydata_3_R
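To illustrate how the entries of the [ngs-reads-art] section relate to the command lines in this script, the following sketch rebuilds one such line from a dictionary of options. The function art_command and the dictionary layout are hypothetical; the flags themselves are copied from the script above, and the coverage is passed separately because the script shows the numeric value after --fcov.

```python
import shlex

# Options as they appear in the script above: True marks a flag without a
# value, anything else is written after the flag.
ART_OPTIONS = {"amp": True, "l": 100, "m": 250, "p": True,
               "q": True, "s": 50, "sam": True, "ss": "HS20"}


def art_command(art_options, coverage, fasta_in, out_prefix):
    """Assemble an art_illumina command line (illustrative helper)."""
    args = ["art_illumina"]
    for flag, value in art_options.items():
        if value is True:
            args.append(f"-{flag}")
        else:
            args += [f"-{flag}", str(value)]
    args += ["--fcov", str(float(coverage)),
             "--in", fasta_in, "--out", out_prefix]
    return shlex.join(args)


base = "/home/user/git/test-ngsphy/test2/NGSphy_output"
for ind in range(4):
    name = f"NGSphy_1_1_ngsphydata_{ind}"
    print(art_command(ART_OPTIONS, 50,
                      f"{base}/individuals/1/1/{name}.fasta",
                      f"{base}/reads/1/1/{name}_R"))
```

Generating the lines this way makes it straightforward to split them across cluster jobs when ART is run separately.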

11.3. Generating read counts from a single gene tree, using an anchor sequence

In this example we use a sequence from a specific tip of the tree, called the anchor sequence, to root the tree and start the simulation. We will obtain sequences for the rest of the gene tree tips, keeping their relationships, and then read counts for all tips with 0.1% sequencing error and 100x expected coverage. In this case the allele used as reference for the read counts is the anchor sequence (here 2_0_0; in my_anchor_sequence.fasta), but a different one can be specified.

[general]
path=.
output_folder_name=NGSphy_output
ploidy=1
[data]
inputmode=3
gene_tree_file=t3.tree
anchor_sequence_file=my_anchor_sequence.fasta
anchor_tip_label=2_0_0
indelible_control_file=control.3.txt
[coverage]
experiment=F:100
[ngs-read-counts]
read_counts_error=0.1
reference_alleles_file=my_reference_allele_file.txt
[execution]
environment=bash
running_times=off
threads=2
FILE: ngsphy.settings.3.txt

The tree:

FIGURE 17: Tree file t3.tree

((((1_0_0:1.0, 2_0_0:1.0):1.0),(3_0_0:1.0, 4_0_0:1.0):1.0),5_0_0:3.0);
FILE: t3.tree

INDELible control file:

[TYPE] NUCLEOTIDE 1
[SETTINGS]
  [ancestralprint] NEW
  [output] FASTA
[MODEL] m1
  [submodel] HKY 0.1
[NGSPHYPARTITION] t3 m1 100
FILE: control.3.txt

11.3.1. Execution

To run this example, use:

ngsphy -s ngsphy.settings.3.txt

11.3.2. Output

● alignments: contains the data used and generated by INDELible.
● coverage: the exact coverage for each locus (L) of each individual (I), written as a table with dimensions (I x L). In this case, the value was fixed at 100x for all individuals.
● individuals: where the sequence files (FASTA) for all loci and individuals are written. The subfolder structure reflects the number of replicates and gene trees. In this case, the single gene tree (t3) has 5 tips, corresponding to 5 haploid individuals, so we will have 5 FASTA files, each containing the single sequence of one individual.
● ind_labels: correspondence replicate/individual/species/locus/sequence.
● reads: stores the VCF files with the simulated read counts.
● ref_alleles: FASTA files with the sequences of the reference alleles used for the read counts. These files are generated whether or not the reference allele matches the anchor sequence.
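The idea of read counts with sequencing error can be pictured with a toy model for a single site of a haploid individual. This is only an illustrative sketch and not NGSphy's actual implementation: it assumes a fixed depth, a per-read error probability, and errors that pick one of the three other nucleotides uniformly. The value 0.001 corresponds to the 0.1% mentioned in the text; how read_counts_error is scaled inside NGSphy is not assumed here.

```python
import random
from collections import Counter

NUCLEOTIDES = "ACGT"


def read_counts(true_allele, depth, error_rate, rng):
    """Count observed nucleotides at one site (toy error model)."""
    counts = Counter({n: 0 for n in NUCLEOTIDES})
    for _ in range(depth):
        if rng.random() < error_rate:
            # Sequencing error: any of the three other nucleotides.
            observed = rng.choice([n for n in NUCLEOTIDES if n != true_allele])
        else:
            observed = true_allele
        counts[observed] += 1
    return counts


rng = random.Random(2018)
print(read_counts("G", depth=100, error_rate=0.001, rng=rng))
```

Counts of this kind, one set per site and individual, are the information that the VCF files in the reads folder record.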

11.4. Generating Illumina reads from a gene tree distribution

In this more complex example, we will simulate Illumina reads from a gene tree distribution, which first needs to be obtained with SimPhy.

11.4.1. SimPhy run

For this example we will simulate 2 species tree replicates with a variable height between 200,000 years and 20,000,000 years (u:200000,20000000). These trees will have 5 ingroup + 1 outgroup species. The ingroup species will have 6 individuals per species, while the outgroup is a single individual. Each replicate will have 10 gene trees. The effective population size (100,000) of each species and the substitution rate (0.00001) of each gene tree are fixed. Finally, we will add heterogeneity at different levels (-h parameters). To run the simulation we use:

simphy -rs 2 -rl f:10 -sb ln:-15,1 -st u:200000,20000000 -sl f:5 -so f:1 -sp f:100000 -su f:0.00001 -si f:6 -hh ln:1.2,1 -hl ln:1.4,1 -hg f:200 -v 1 -o testwsimphy -cs 6656 -od 1 -op 1 -oc 1 -on 1

In detail:

Parameter  Value              Description
-rs        2                  Number of species tree replicates
-rl        f:10               Number of locus trees per replicate
-rg        f:1                Number of gene trees per replicate
-sb        ln:-15,1           Speciation rate (events/time unit)
-st        u:200000,20000000  Species tree height (time units)
-sl        f:5                Number of taxa
-so        f:1                Ratio between ingroup height and the branch from the root to the ingroup
-sp        f:100000           Tree-wide effective population size
-su        f:0.00001          Tree-wide substitution rate
-si        f:6                Number of individuals per species
-hh        ln:1.2,1           Gene-by-lineage-specific locus tree parameter
-hl        ln:1.4,1           Gene-family-specific rate heterogeneity modifiers
-hg        f:200              Gene-by-lineage-specific rate heterogeneity modifiers
-v         1                  Verbosity: global settings summary, simulation progress per replicate (number of simulated gene trees), warnings and errors
-o         testwsimphy        Common output prefix-name (for folders and files)
-cs        6656               Random number generator seed
-od        1                  Activates the SQLite database output
-op        1                  Activates logging of sampled options
-on        1                  Activates the output of the bounded locus subtrees file
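The values in this table use a compact notation in which a letter code names a distribution and the numbers after the colon are its parameters. As a rough illustration, the sketch below draws one value for three of the codes used here, assuming f means a fixed value, u a uniform distribution between two bounds, and ln a log-normal distribution parameterized on the log scale. The function sample is hypothetical and is not SimPhy code.

```python
import random


def sample(spec, rng):
    """Draw one value from a specification such as 'u:200000,20000000'."""
    code, _, raw = spec.partition(":")
    params = [float(x) for x in raw.split(",")]
    code = code.lower()
    if code == "f":    # fixed value
        return params[0]
    if code == "u":    # uniform between the two bounds
        return rng.uniform(params[0], params[1])
    if code == "ln":   # log-normal, parameters on the log scale (assumed)
        return rng.lognormvariate(params[0], params[1])
    raise ValueError(f"unsupported distribution code: {code}")


rng = random.Random(6656)
print(sample("f:10", rng))               # loci per replicate, always 10.0
print(sample("u:200000,20000000", rng))  # one species tree height
print(sample("ln:-15,1", rng))           # one speciation rate
```

Each species tree replicate would draw its own values in this way, which is why the two replicates can differ in height.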

To simulate the DNA alignment for each simulated gene tree, we now use the script provided with SimPhy (INDELIble_wrapper.pl) and the following INDELible control file.

[TYPE] NUCLEOTIDE 1
[SETTINGS]
  [output] FASTA
  [fastaextension] fasta
[MODEL] complex_common
  [submodel] GTR $(rd:6,16,2,8,20,4)
  [statefreq] $(d:1,1,1,1)
  [rates] 0 $(e:2) 0

[SIMPHY-UNLINKED-MODEL] simple_unlinked
  [submodel] HKY $(e:1)
  [statefreq] $(d:1,1,1,1)

[SIMPHY-PARTITIONS] simple [1 simple_unlinked 500] // All gene families evolve under the model "simple_unlinked", with a fixed sequence length of 500.
[SIMPHY-EVOLVE] 1 data // One sequence alignment for each gene tree, saved in files with "data" as common prefix (it will generate data_1, data_2, etc.)
FILE: control.4.txt. This file is based on a SimPhy example case. For more information on this example, see the SimPhy wiki.

To run we use:

# perl INDELIble_wrapper.pl
perl INDELIble_wrapper.pl testwsimphy/ control.4.txt $RANDOM 2

Here, we use the Bash shell variable $RANDOM, which returns a different pseudo-random integer in the range [0, 32767] each time it is referenced.

After this, we will obtain folders in a hierarchical structure which look like this:

|__testwsimphy/
    |__testwsimphy.command                # SimPhy log files
    |__testwsimphy.db                     # SimPhy log files
    |__testwsimphy.params                 # SimPhy log files
    |__1/                                 # Species tree replicate
        |__data_[1-10].fasta              # INDELible output
        |__data_[1-10]_TRUE.phy           # INDELible output
        |__g_trees[1-10].trees            # SimPhy output
        |__control.txt                    # INDELIble_wrapper.pl output
        |__bounded_locus_subtrees.out     # SimPhy output
        |__LOG.txt                        # INDELible output
        |__l_trees.trees                  # SimPhy output
        |__s_tree.trees                   # SimPhy output
        |__trees.txt                      # INDELible output
    |__2/                                 # Species tree replicate
        |__data_[1-10].fasta              # INDELible output
        |__data_[1-10]_TRUE.phy           # INDELible output
        |__g_trees[1-10].trees            # SimPhy output
        |__control.txt                    # INDELIble_wrapper.pl output
        |__bounded_locus_subtrees.out     # SimPhy output
        |__LOG.txt                        # INDELible output
        |__l_trees.trees                  # SimPhy output
        |__s_tree.trees                   # SimPhy output
        |__trees.txt                      # INDELible output

11.4.2. Running NGSphy

Now we use NGSphy to generate the Illumina reads from the tips of these gene trees with the settings file ngsphy.settings.4.txt (see below). As we want to generate diploid individuals in this case, NGSphy will randomly “mate” tips (gene copies) within each taxon/species. The outgroup taxon always has a single sequence (when it is generated in SimPhy) and is assumed to be homozygous (so its sequence is duplicated before generating reads/read counts). As the species trees generated have 6 taxa (5 ingroup + 1 outgroup), with each ingroup taxon having 6 tips, we will have 16 individuals in total:
○ 6 gene copies * 5 ingroup taxa = 30 sequences = 15 diploid individuals.
○ 1 outgroup * 2 (homozygous) = 2 sequences = 1 diploid individual.
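The count of 16 individuals can be reproduced with a small sketch of the mating step. This is a simplified illustration rather than NGSphy's own code: the function names are hypothetical, the tip labels are built by hand, and the outgroup label 0_0_0 is an assumption made only for the example.

```python
import random


def mate_within_taxon(tips, rng):
    """Randomly pair the gene copies of one taxon into diploid individuals."""
    if len(tips) % 2:
        raise ValueError("an even number of gene copies is needed")
    shuffled = list(tips)
    rng.shuffle(shuffled)
    return [(shuffled[i], shuffled[i + 1]) for i in range(0, len(shuffled), 2)]


def build_individuals(n_ingroup, copies, rng):
    """Pair tips within each ingroup taxon, then add a homozygous outgroup."""
    individuals = []
    for taxon in range(1, n_ingroup + 1):
        tips = [f"{taxon}_0_{c}" for c in range(copies)]
        individuals += mate_within_taxon(tips, rng)
    # Outgroup: a single sequence, duplicated (label is hypothetical).
    individuals.append(("0_0_0", "0_0_0"))
    return individuals


individuals = build_individuals(n_ingroup=5, copies=6, rng=random.Random(1))
print(len(individuals))  # 5 taxa * 3 pairs + 1 outgroup = 16
```

Because the pairing is random, the composition of each individual changes with the seed, while the total number of individuals does not.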

Also, we want the base coverage to be 100x for both replicates (experiment=F:100), but we want to add some variation among loci and individuals. In this example, this variation is modeled with a Log Normal distribution with mean 1.2 and standard deviation 1 for the individuals and a Log Normal distribution with mean 1.3 and standard deviation 1 for the loci. These distributions provide the shape of the underlying Gamma distributions from which the rate multipliers for each individual and locus are sampled. For example, the multipliers for the individual variation, for each replicate, might be sampled from distributions like these: first (to the left) the shapes of the Gamma distributions are sampled, and then the multipliers are sampled from those Gamma distributions:

FIGURE 18: Example of possible distributions of the rate multipliers according to the settings file ngsphy.settings.4.txt to model the coverage variation among individuals.
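The two-step sampling described above can be sketched as follows. This is an illustration under stated assumptions, not NGSphy's implementation: the Log Normal parameters are taken on the log scale, the Gamma distribution is assumed to have scale 1/shape so that multipliers average one, and the expected coverage is assumed to be the product of the base coverage and both multipliers. The function names are hypothetical.

```python
import random


def gamma_multipliers(n, ln_mu, ln_sigma, rng):
    """Sample a Gamma shape from a Log Normal, then n mean-one multipliers."""
    shape = rng.lognormvariate(ln_mu, ln_sigma)
    # Scale 1/shape gives the Gamma distribution a mean of one (assumption).
    return shape, [rng.gammavariate(shape, 1.0 / shape) for _ in range(n)]


def expected_coverage(base, n_individuals, n_loci, rng):
    """Coverage table (individuals x loci) for one replicate."""
    _, ind = gamma_multipliers(n_individuals, 1.2, 1.0, rng)  # individual=LN:1.2,1
    _, loc = gamma_multipliers(n_loci, 1.3, 1.0, rng)         # locus=LN:1.3,1
    return [[base * i * l for l in loc] for i in ind]


table = expected_coverage(100.0, n_individuals=16, n_loci=10,
                          rng=random.Random(42))
for row in table[:3]:
    print(", ".join(f"{x:.3f}" for x in row))
```

Under these assumptions every row of the table is a scaled copy of the same locus profile, since each cell is the product of one individual multiplier and one locus multiplier.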

The file indicates the general parameters, the input mode (gene tree distribution) and SimPhy-related parameters, the sampling process of the expected coverage, and the sequencing features.

[general]
path=.
output_folder_name=NGSphy_output
ploidy=2
[data]
inputmode=4
simphy_folder_path=./testwsimphy
simphy_data_prefix=data
simphy_filter=true
[coverage]
experiment=F:100
individual=LN:1.2,1
locus=LN:1.3,1
offtarget=0.25, 0.01
notcaptured=0.5
taxon= 1,0.5;2,0.25
[ngs-reads-art]
fcov=true
l=100
m=250
p=true
q=true
s=50
sam=true
ss=HS20
[execution]
environment = bash
runART = on
threads=2
FILE: ngsphy.settings.4.txt

11.4.3. Execution

To run this example, use:

ngsphy -s ngsphy.settings.4.txt

11.4.4. Output

● coverage: the exact coverage for each locus (L) of each individual (I), written as a table with dimensions (I x L). In this case, we include variation in coverage among loci and individuals, as in targeted-sequencing experiments. We also simulate off-target loci (which have lower coverage), and some loci are assumed not to be captured/sequenced at all. We have 2 files, one per replicate. From the simulated species trees we get 16 individuals per replicate and 10 gene trees (loci) each. Hence, the coverage tables have dimensions 16 x 10 (I x L). Content of the coverage file (testwsimphy.1.coverage.csv):

$ cat testwsimphy.1.coverage.csv
indID,L.01,L.02,L.03,L.04,L.05,L.06,L.07,L.08,L.09,L.10
0,4.317,401.540,467.337,0.000,0.0,222.453,0.0,0.0,190.841,286.270
1,7.646,711.199,827.737,0.0,0.0,394.004,0.0,0.0,338.013,507.035
2,6.401,595.411,692.976,0.0,0.0,329.857,0.0,0.0,282.982,424.486
3,10.602,986.243,1147.850,0.0,0.0,546.378,0.0,0.0,468.733,703.121
4,15.316,1424.710,1658.165,0.0,0.0,789.289,0.0,0.0,677.124,1015.717
5,8.363,777.926,905.398,0.0,0.0,430.971,0.0,0.0,369.726,554.606
6,3.749,348.727,405.870,0.0,0.0,193.195,0.0,0.0,165.740,248.618
7,6.680,621.410,723.236,0.0,0.0,344.261,0.0,0.0,295.339,443.022
8,7.538,701.225,816.129,0.0,0.0,388.478,0.0,0.0,333.272,499.924
9,3.724,346.376,403.134,0.0,0.0,191.892,0.0,0.0,164.623,246.942
10,6.686,621.916,723.824,0.0,0.0,344.541,0.0,0.0,295.579,443.382
11,3.334,310.150,360.971,0.0,0.0,171.823,0.0,0.0,147.405,221.115
12,18.427,1714.101,1994.976,0.0,0.0,949.611,0.0,0.0,814.664,1222.033
13,7.502,697.831,812.178,0.0,0.0,386.598,0.0,0.0,331.659,497.504
14,6.712,624.364,726.674,0.0,0.0,345.898,0.0,0.0,296.743,445.128
15,0.015,1.401,1.631,0.0,0.0,0.776,0.0,0.0,0.666,0.999

● individuals: the sequence files for the generated individuals. In this case, we asked for diploid individuals, so we will have 16 individuals per locus, and each file contains 2 sequences.
● ind_labels: correspondence replicate/individual/species/locus/sequences.
● reads: this folder follows a hierarchical structure like the one obtained in SimPhy, and stores the FASTQ files generated by ART.
● scripts: this folder will be empty since ART is, in this example, run within NGSphy.
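A coverage table like the one listed for this example can be inspected with a few lines of Python, for instance to find the loci that were not captured in any individual. The sketch below is illustrative only: it embeds the first and last rows of the table shown above as plain comma-separated text, and the function names are hypothetical.

```python
import csv
import io

# First and last rows of testwsimphy.1.coverage.csv, as listed above.
COVERAGE_CSV = """\
indID,L.01,L.02,L.03,L.04,L.05,L.06,L.07,L.08,L.09,L.10
0,4.317,401.540,467.337,0.000,0.0,222.453,0.0,0.0,190.841,286.270
15,0.015,1.401,1.631,0.0,0.0,0.776,0.0,0.0,0.666,0.999
"""


def load_coverage(text):
    """Return the locus names and a dict mapping individual to coverages."""
    reader = csv.reader(io.StringIO(text))
    header = [h.strip() for h in next(reader)]
    table = {row[0].strip(): [float(x) for x in row[1:]] for row in reader}
    return header[1:], table


def uncaptured_loci(loci, table):
    """Loci whose coverage is zero in every individual."""
    return [locus for j, locus in enumerate(loci)
            if all(cov[j] == 0.0 for cov in table.values())]


loci, table = load_coverage(COVERAGE_CSV)
print(uncaptured_loci(loci, table))
```

The same loading function can be pointed at the full file of each replicate to compare realized coverage with the expected 100x.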

List of Figures

● FIGURE 1: Input modes: a) a single gene tree; b) single gene tree with a user-defined ancestral sequence; c) a single gene tree with an anchor sequence and d) gene-tree distributions (SimPhy output [species-tree simulations])
● FIGURE 2: Re-rooting process
● FIGURE 3: Gene tree labeling example.
● FIGURE 4: Sampling notation example. Poisson distribution.
● FIGURE 5: Sampling notation example. Negative Binomial distribution.
● FIGURE 6: Experiment-wide coverage sampling example.
● FIGURE 7: Experiment-wide coverage sampling, a complex example.
● FIGURE 8: Locus-wide coverage sampling.
● FIGURE 9: Individual-wide coverage sampling.
● FIGURE 10: Taxon-specific coverage explanation.
● FIGURE 11: Reference allele file example.
● FIGURE 12: Folder structure of the NGSphy output.
● FIGURE 13: A possible analysis pipeline for multilocus, multispecies datasets with multiple individuals with the final goal of exploring the sensitivity of species tree inferences to NGS parameterization variation.
● FIGURE 14: NGSphy workflow
● FIGURE 15: Tree file t1.tree.
● FIGURE 16: Tree file t2.tree.
● FIGURE 17: Tree file t3.tree.
● FIGURE 18: Example of possible distributions of the rate multipliers according to the settings file ngsphy.settings.4.txt to model the coverage variation among individuals

References

● Bragg JG, Potter S, Bi K and Moritz C. (2016). Exon capture phylogenomics: efficacy across scales of divergence. Molecular Ecology Resources, 16(5), 1059-1068.
● Escalona M. (2017). indelible-ngsphy. Github. http://www.github.com/merlyescalona/indelible-ngsphy
● Fletcher W and Yang Z. (2009). INDELible: A flexible simulator of biological sequence evolution. Molecular Biology and Evolution, 26(8), 1879-1888.
● Gentzsch W. (2001). Sun grid engine: Towards creating a compute power grid. In Cluster Computing and the Grid, 2001. Proceedings. First IEEE/ACM International Symposium on (pp. 35-36). IEEE.
● Huang W, Li L, Myers JR and Marth GT. (2012). ART: a next-generation sequencing read simulator. Bioinformatics, 28(4), 593-594.
● Ji T and Chen J. (2015). Modeling the next generation sequencing read count data for DNA copy number variant study. Stat. Appl. Genet. Mol. Biol., 14(4), 361-374.
● Korneliussen TS, Albrechtsen A and Nielsen R. (2014). ANGSD: Analysis of Next Generation Sequencing Data. BMC Bioinformatics, 15, 356.
● Kozlov A. (2017, April 4). amkozlov/raxml-ng: RAxML-NG v0.2.0 BETA (Version 0.2.0). Zenodo. http://doi.org/10.5281/zenodo.492245
● Mallo D, De Oliveira Martins L and Posada D. (2016). SimPhy: Phylogenomic Simulation of Gene, Locus, and Species Trees. Systematic Biology, 65(2), 334-344.
● McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M and DePristo MA. (2010). The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Research, 20(9), 1297-1303.
● Ritz A, Paris PL, Ittmann MM, Collins C and Raphael BJ. (2011). Detection of recurrent rearrangement breakpoints from copy number data. BMC Bioinformatics, 12(1), 114.
● Yoo AB, Jette MA and Grondona M. (2003). Slurm: Simple linux utility for resource management. In Workshop on Job Scheduling Strategies for Parallel Processing (pp. 44-60). Springer, Berlin, Heidelberg.
● Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, Handsaker R, Lunter G, Marth G, Sherry ST, McVean G, Durbin R and 1000 Genomes Project Analysis Group. (2011). The Variant Call Format and VCFtools. Bioinformatics, 27(15), 2156-2158.
● Van der Auwera G. (2015). Together is (almost always) better than alone: GATK forum entry. (https://gatkforums.broadinstitute.org/gatk/discussion/4150/should-i-analyze-my-samples-alone-or-together)
● Li H and Durbin R. (2009). Fast and Accurate Short Read Alignment with Burrows-Wheeler Transform. Bioinformatics, 25(14), 1754-1760.
● Van der Auwera GA and Carneiro MO. (2013). From FastQ Data to High-Confidence Variant Calls: The Genome Analysis Toolkit Best Practices Pipeline. Current Protocols in Bioinformatics. Wiley Online Library. http://onlinelibrary.wiley.com/doi/10.1002/0471250953.bi1110s43/full
