Plant systematics to cancer biology: Transferrable skills and evolutionary thinking in bioinformatics

Kate L Hertweck
The University of Texas at Tyler, Department of Biology
Twitter @k8hert

I am...
...an educator and researcher.
...an evolutionary biologist.
...a data-driven bioinformaticist.
...committed to reproducible science.

Goal: Relate genomic variation to organismal function and evolution to understand complex traits.

Objectives:
1. Identify associations between genomic and organismal variation
2. Consider opportunities for transferring bioinformatic skills among model systems

Outline: 1. Evolution in monocots 2. Population genomics in Drosophila 3. Biomarkers in cancer

Image: Biodiversity Heritage Library

Can we use genomic data to determine relationships among species and identify patterns of genomic evolution across deep time?

Monocots are a delicious and diverse model system

● ca. 60,000 species, many edible and ornamental
● Variation in traits
  ● life history: growth habit, habitat
  ● genome: size, chromosome number, ploidy
● Few genomic resources except in grasses

Darlington 1963

Image credits: Iris, Asparagus, Allium (users Ram-Man, Bob Gutowski, Evan-Amos); Bozzini 1964

Monocots exhibit varying rates of evolution and shifts in diversification rates

● Data: Eight loci from three genomic partitions (mt, cp, nuclear; including one low-copy nuclear gene)
● Analysis: tree-building with RAxML, divergence time analysis with r8s and multidivtime, diversification with MEDUSA and apTreeShape

Figure legend: fossil calibration; species-rich lineage (MEDUSA); species-poor lineage (MEDUSA); apTreeShape

Hertweck et al., 2015 Bot J Linn Soc

Plastid genomes resolve relationships in Asparagales

[Figure: plastome phylogeny. Tip labels: Doryanthaceae, Iridaceae, Xeronemataceae, Hemerocallidoideae, Xanthorrhoeoideae, Asphodeloideae, Agapanthoideae, Aphyllanthoideae, Lomandroideae, Asparagoideae, Agavoideae, Brodiaeoideae. Legend: * = increase in bootstrap support; one placement flagged "problematic".]

● Data: Genomic survey sequences (GSS; anonymous, low-coverage NGS data)
● Analysis: plastome and mt/nrDNA assembly, tree building with PAUP and Garli
● Used less than 10% of the data collected!

Steele, Hertweck, Mayfield, McKain, Leebens-Mack, and Pires, 2012 AJB

Transposable elements are an underappreciated source of genomic variation

● Transposable elements (TEs): mobile genetic elements or jumping genes
  ● Independently replicating
  ● Similar to or derived from viruses
  ● Occur in multiple copies throughout the genome
● TEs are an important driver in genomic evolution
  ● Interactions with genes
  ● Genome-wide modifications
  ● Source of mutation on which natural selection can act

Approach: assembly of TEs from GSS
● contigs are consensus of most abundant TEs in the genome
● TEs must exist in high copy to have sufficient reads for detection (assembly)
● the older a TE insertion, the more likely it has accumulated mutations which will inhibit detection
● data presented as percentage of TE type in nuclear genome (relative abundance)

Heslop-Harrison et al, 1997

TE content does not vary with genome size

[Figure: percent of nuclear genome composed of TEs (0% to 70%) and genome size (Mb/1C, 0 to 25000) for each taxon in the dataset. Annotation: one of the largest genomes in the dataset, but a very small proportion of repeats!]

● Data: Previously published GSS data
● Analysis: assembly with MaSuRCA, BLAST to remove organellar sequences, annotate with RepeatMasker
● Inconsistent with hypothesis that TE proliferation is related to an increase in genome size

Hertweck, 2013, Genome

TEs are difficult to annotate but appear to vary with ploidy

● Data: GSS from Agavoideae (tequila): tetraploid, largest (known) genome in dataset
● Analysis: additional annotation methods with CDD

● Agavoideae TEs are particularly difficult to sequence
● CDD more than doubles identifiable sequence!

Image: Agave tequilana from user Stan Shebs

Allium have much lower proportions of Copia LTR retrotransposons than closely related genera

● Data: GSS from Allioideae (garlic, leek)
● Allium has 800+ species, related genera have relatively few
● Low proportion of copia counter to expectations of diversification from TE expansion

[Figure: proportions of copia and gypsy elements in other Allioideae vs. Allium]

Image: Allium senescens from user Adamantios

Conclusions: Evolution in monocots

Can we use genomic data to determine relationships among species and identify patterns of genomic evolution across deep time?

● Monocot phylogenetics
  ● Unlinked loci from across the genome provide the framework for diversification analyses
  ● Complete plastomes resolve Asparagales relationships
● Asparagales TEs
  ● GSS can suggest what parts of the genome may be interesting for further investigation

1. Evolution in monocots 2. Population genomics in Drosophila 3. Biomarkers in cancer

Collaborators: Michael R. Rose (UC Irvine) Joseph L. Graves (NC A&T, UNCG)

Image: D. melanogaster male from user Aka

Do populations experimentally selected for specific phenotypes yield similar genomic patterns?

Experimental evolution in Drosophila results in parallel responses to selection for time to development

Long-term experimental evolution system (established 1980) with the following treatments:
A: short life cycle (9 days)
B: baseline life cycle (14 days)
C: long life cycle (28 days)

● Data: Whole-genome pooled population resequencing, three selection types, six treatments, five populations each

● Analysis: phenotypes, SNPs, structural variants, TEs

[Figure: treatment labels ACO, AO, B, BO, CO, NCO]

Populations with accelerated development have higher TE load

● Analysis: Identification of per-population TE load using PoPoolation TE
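As a rough illustration of how per-population TE loads might be compared between two treatments, here is a minimal sketch of an exact permutation test on the difference in mean load. The slides do not say which statistical test was used, so the choice of test is an assumption, and the insertion counts below are made up for illustration (five populations per treatment, as in the study design).

```python
from itertools import combinations
from statistics import mean


def exact_permutation_test(group_a, group_b):
    """Two-sided exact permutation test on the difference in group means.

    Enumerates every way of splitting the pooled values into groups of the
    original sizes and returns the fraction of splits whose absolute mean
    difference is at least as large as the observed one.
    """
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    total = sum(pooled)
    observed = abs(mean(group_a) - mean(group_b))

    extreme = 0
    count = 0
    for idx in combinations(range(len(pooled)), n_a):
        a_sum = sum(pooled[i] for i in idx)
        diff = abs(a_sum / n_a - (total - a_sum) / n_b)
        # Small tolerance guards against floating-point round-off.
        if diff >= observed - 1e-12:
            extreme += 1
        count += 1
    return extreme / count


# Hypothetical TE insertion counts per population (NOT data from the talk).
a_type = [1510, 1495, 1530, 1502, 1521]
c_type = [1302, 1288, 1315, 1296, 1309]
p_value = exact_permutation_test(a_type, c_type)
print(f"between-treatment p = {p_value:.4f}")
```

With five populations per group there are only 252 possible splits, so full enumeration is cheap and avoids distributional assumptions. The same function could be applied to populations within one treatment to mirror the within-treatment comparison.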

● Within-treatment TE load is not significantly different (p>0.05)
● Between-treatment TE load does differ
● Consistent with expectation that TEs are more tightly controlled in populations with longer life spans

[Figure: TE load by selection type (B, C, A)]

Heterozygosity of TE insertions is higher in populations with accelerated development

● Analysis: T-lex to identify insertion frequencies for TEs compared to reference genome

● Within-treatment TE load is not significantly different (p>0.05)
● Between-treatment TE load does differ
● Consistent with expectation that A-type selection is more intense

[Figure: heterozygosity of TE insertions by selection type (B, C, A)]

Between-treatment comparisons have more significantly differentiated TEs

● Analysis: T-lex to identify insertion frequencies for TEs compared to reference genome followed by CMH test
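For readers unfamiliar with the CMH (Cochran-Mantel-Haenszel) test named above, the following is a minimal sketch of the statistic for a set of 2x2 tables, one table per replicate pair. The layout of each table (reads supporting presence or absence of an insertion in two treatments) and all counts are assumptions made for illustration; the slides do not show how the test was set up.

```python
import math


def cmh_test(tables, continuity=True):
    """Cochran-Mantel-Haenszel test for K stratified 2x2 tables.

    Each table is (a, b, c, d), read here as:
        a = treatment 1, insertion present    b = treatment 1, insertion absent
        c = treatment 2, insertion present    d = treatment 2, insertion absent
    Returns (chi-square statistic with 1 df, p-value).
    """
    sum_a = sum_expected = sum_var = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        if n < 2:
            continue  # a stratum this small carries no information
        sum_a += a
        sum_expected += (a + b) * (a + c) / n
        sum_var += (a + b) * (c + d) * (a + c) * (b + d) / (n * n * (n - 1))
    if sum_var == 0:
        raise ValueError("no informative strata")

    diff = abs(sum_a - sum_expected)
    if continuity:
        diff = max(diff - 0.5, 0.0)
    stat = diff ** 2 / sum_var
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical read counts for one insertion in three replicate pairs.
differentiated = [(40, 10, 10, 40)] * 3
stat, p_value = cmh_test(differentiated)
print(f"CMH chi-square = {stat:.2f}, p = {p_value:.3g}")
```

Stratifying by replicate pair is what lets the test ask whether a frequency difference is consistent across replicates rather than driven by one population. In practice a multiple-testing correction across all insertions would also be needed.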

● 177 insertions vary in frequency between two or more populations
● 91 insertions were significantly differentiated among at least one treatment comparison
● Within-treatment comparisons have few to no significantly differentiated TEs

Conclusions: Population genomics of Drosophila

Do populations experimentally selected for specific phenotypes yield similar genomic patterns?

● Yes, with evidence from across the genome

● Many types of TEs are responding to selective pressures

● Comparisons of treatment types show parallel response to selection

● These data are a powerful tool for continuing to assess TE responses to selection at a genomic level

1. Evolution in monocots 2. Population genomics in Drosophila 3. Biomarkers in cancer

Collaborator: Santanu Dasgupta (UT Health Northeast)

Philley et al, 2015, J Cell Phys

Can we integrate genomic data with experimental studies to identify biomarkers and cancer pathways?

Background

● Both detection and treatment of cancer remain problematic because of complex and heterogeneous genetics
● Integration of NGS analysis with traditional wet lab work can inform the relevance of particular genetic variants and be used for biomarker development

Philley et al, 2015, J Cell Phys

Haplotype phylogeny identifies variants potentially linked to cancer

● Data: mitochondrial genome sequencing from prostate cancer patients

● Analysis: Variant calling, haplotypes assigned with HaploGrep and PhyloTree

● Differentiates variants due to common ancestry from variants possibly related to cancer
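The idea of separating ancestry-derived variants from candidate cancer-related ones can be sketched with simple set operations. This is only an illustration of the logic, not the HaploGrep or PhyloTree workflow itself: the variant labels are invented placeholders, and in a real analysis the haplogroup-defining set would come from the assigned haplogroup.

```python
def partition_variants(sample_variants, haplogroup_variants):
    """Split a sample's variants relative to its assigned haplogroup.

    haplogroup:       variants explained by common ancestry
    private:          variants not expected for the haplogroup (candidates
                      for follow-up, e.g. possibly cancer-related)
    missing_expected: haplogroup variants absent from the sample
    """
    sample = set(sample_variants)
    expected = set(haplogroup_variants)
    return {
        "haplogroup": sorted(sample & expected),
        "private": sorted(sample - expected),
        "missing_expected": sorted(expected - sample),
    }


# Placeholder variant labels (hypothetical, not real mtDNA positions).
haplogroup_defining = {"100G", "250T", "4000A"}
patient_variants = {"100G", "4000A", "9999C"}

result = partition_variants(patient_variants, haplogroup_defining)
for category, variants in result.items():
    print(category, variants)
```

Private variants are only candidates: some will be rare inherited polymorphisms, so they still need comparison against matched normal tissue or population databases.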

Figure legend: turquoise = heteroplasmy; @ = reversion

Philley et al, 2015, J Cell Phys

Somatic mutations inform analyses in genes of interest for HNSCC

● Data: whole-genome NGS data from paired tumor/non-tumor tonsil tissue (HPV-induced head/neck squamous cell carcinoma)

● Analysis: Variant calling, filter for only somatic variants, mine genes of interest
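The filtering step above can be sketched as follows, assuming variant calls have already been parsed into simple records. The `Variant` type, the gene names, and every coordinate here are hypothetical; dedicated somatic callers also weigh read depth and allele fraction, which this sketch ignores.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str


def somatic_only(tumor_calls, normal_calls):
    """Keep variants called in the tumor but absent from the matched normal."""
    return set(tumor_calls) - set(normal_calls)


def variants_in_genes(variants, gene_regions):
    """Group variants by the gene interval (chrom, start, end) containing them."""
    hits = {}
    for gene, (chrom, start, end) in gene_regions.items():
        hits[gene] = sorted(
            v for v in variants if v.chrom == chrom and start <= v.pos <= end
        )
    return hits


# Made-up calls and gene intervals, for illustration only.
tumor = [
    Variant("chr1", 1050, "A", "G"),
    Variant("chr1", 2000, "C", "T"),
    Variant("chr2", 500, "G", "A"),
]
normal = [Variant("chr1", 2000, "C", "T")]
gene_regions = {"GENE_A": ("chr1", 1000, 1500), "GENE_B": ("chr2", 900, 1200)}

somatic = somatic_only(tumor, normal)
by_gene = variants_in_genes(somatic, gene_regions)
print(by_gene)
```

Matching on chromosome, position, reference, and alternate allele treats a variant as germline only if the identical change appears in the normal sample.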

● Provides the genetic context to match with protein expression studies
● Opportunities for data re-use to examine evolutionary questions

Kannan, Hertweck et al., in review

Conclusions: Biomarkers in cancer

Can we integrate genomic data with experimental studies to identify biomarkers and cancer pathways?

● Paired tumor/normal samples are a powerful tool for identifying variants related to multiple types of cancer

● The integration of genomic data with wet-lab work contributes to both biomarker development and elucidation of cancer pathways

● Evolutionary thinking is valuable for interpreting integrative studies

General conclusions

Goal: Relate genomic variation to organismal function and evolution to understand complex traits.

1. Evolution in monocots 2. Population genomics in Drosophila 3. Biomarkers in cancer

● You can answer really interesting questions about evolutionary biology by combining NGS data with other types of biological information
● Skills to assess variation in large datasets are very transferrable and offer great opportunity for novel research approaches

Considerations for diversifying your research

● Learning reproducible science skills is well worth your time!

● Find a community.

● Be prepared to spend lots of time managing and organizing data.

● Choose collaborations carefully, but don't be afraid to branch out.

Images: Clostridium acetobutylicum by user Geoman3; Hibiscus dasycalyx by user Sesamehoneytart; image by Sugar Research Australia

Acknowledgements

For research support: For intellectual support and training:

For images:

Research: https://sites.google.com/site/k8hertweck

Blog: k8hert.blogspot.com

Twitter @k8hert
Google+ [email protected]
Bulbapedia
GitHub https://github.com/k8hertweck