Basics of phage annotation & classification – How to get started

Dr. Evelien Adriaenssens (she/her)
Career Track Group Leader Gut & viromics
Chair Bacterial Viruses Subcommittee (ICTV)
@EvelienAdri

Disclaimers

The following presentation is my personal opinion. It aims at communicating best practices at this time for phage genome annotation. Practices may change over time.

I do not receive any remuneration for this presentation and any content may only be reproduced for non-commercial purposes.

Any mention of a software tool does not constitute an institutional endorsement.

Database submission

Concepts

Assembly Read mapping Annotation

Next-generation sequencing and assembly

Majority of phages: sequencing platform does not matter much

Things to keep in mind:
• Smallest sequencers are sufficient for phages
• 30-50 fold read coverage of the genome is ideal
• Possible to co-sequence multiple phages
E.g.: Illumina MiSeq, MiniSeq; Ion Torrent PGM; ONT MinION
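
The coverage target can be turned into a read count with simple arithmetic. The sketch below is illustrative only: the function names and the example numbers (a 50 kb genome, 150 bp reads) are assumptions, not values from this presentation.

```python
import math


def mean_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Average fold coverage = total sequenced bases / genome size."""
    return n_reads * read_length / genome_size


def reads_needed(target_coverage: float, read_length: int, genome_size: int) -> int:
    """Smallest number of reads that reaches the target fold coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


if __name__ == "__main__":
    # Hypothetical example: 50 kb phage genome, 150 bp reads, 40-fold target.
    print(reads_needed(40, 150, 50_000))
    print(mean_coverage(10_000, 150, 50_000))
```

Because so few reads are needed per genome, this kind of estimate also helps when deciding how many phages can share one run.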

• Long-read technologies have the potential to give the genome in one read
• Popular Nextera library prep = transposon based, ends of linear DNA will be missed
• Assembly of : choose tools that are accessible to you and in line with your bioinformatic skills
• In general: the easier to use, the more expensive

Do read mapping: find assembly errors, find genome ends
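
One way read mapping can hint at genome ends: with a library prep that preserves the ends of linear DNA, far more reads tend to start exactly at a fixed terminus than at any other position. The sketch below is a deliberately naive illustration of that idea, not the method of any published tool. The function name, the `fold` threshold and the input format (0-based mapped read start positions) are assumptions; dedicated software applies proper statistics.

```python
from collections import Counter


def candidate_termini(read_starts, genome_length, fold=10.0):
    """Return positions where read starts pile up far above the average.

    read_starts: iterable of 0-based mapping start positions (one per read).
    fold: how many times the mean start count per position counts as a
          spike (arbitrary illustrative threshold).
    """
    read_starts = list(read_starts)
    counts = Counter(read_starts)
    mean_starts_per_position = len(read_starts) / genome_length
    threshold = fold * mean_starts_per_position
    return sorted(pos for pos, n in counts.items() if n > threshold)


if __name__ == "__main__":
    # Toy data: one read start per position, plus a pile-up at position 42.
    starts = list(range(100)) + [42] * 50
    print(candidate_termini(starts, genome_length=100))
```

A flat start-position profile, by contrast, is what one would expect when ends are random or were lost during library prep.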

Owen, Perez-Sepulveda & Adriaenssens, 2021, Detection of : Sequence-Based Systems: https://link.springer.com/referenceworkentry/10.1007%2F978-3-319-41986-2_19

Phage genome structure & implications (1)

Circularly permuted
• Headful packaging: random ends
• pac sites: start fixed, end random

Merrill et al, 2016, BMC Genomics: Excellent examples of genome organisations

Dickeya phage LIMEstone1, Adriaenssens et al 2012, PLoS ONE
Escherichia phage P1, Lobocka et al 2004, J. Bact.

Phage genome structure & implications (2)

Circularly permuted
• Headful packaging: random ends
• pac sites: start fixed, end random

Useful software: PhageTerm: Garneau et al 2017, Scientific Reports

Phage genome structure & implications (3)

Defined ends
• Cohesive ends
• Terminal repeats

Merrill et al, 2016, BMC Genomics: Excellent examples of genome organisations

Phage genome structure & implications (4)

Defined ends
• Cohesive ends
• Terminal repeats

MSc thesis Vincent Dunon

Reorganise your genome!

Defined ends: verify (experimentally) and arrange correctly

No defined ends:
- Compare with database genomic relative
- Rearrange to be colinear with best-annotated relative
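
For a genome without defined ends, rearranging amounts to rotating the sequence so that it starts at the position matching the start of the chosen relative, flipping the strand first if needed. The sketch below assumes that position (`new_start`, 0-based) has already been found by aligning against the relative; the function names are illustrative. Rotation is only appropriate when the assembly represents a circular or circularly permuted genome.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence: str) -> str:
    """Reverse-complement a DNA sequence (plain A/C/G/T only)."""
    return sequence.translate(_COMPLEMENT)[::-1]


def rotate_genome(sequence: str, new_start: int) -> str:
    """Rotate a circular(ly permuted) sequence so it begins at new_start."""
    if not 0 <= new_start < len(sequence):
        raise ValueError("new_start must lie within the sequence")
    return sequence[new_start:] + sequence[:new_start]


if __name__ == "__main__":
    print(rotate_genome("ACGTTGCA", 3))
    print(reverse_complement("AACG"))
```

After rotating, it is worth re-running read mapping across the old junction to confirm the sequence there is supported by reads.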

Genome annotation: Escherichia phage T7 example

UGENE, http://ugene.net/, Okonechnikov et al, 2012, Bioinformatics

ORF vs CDS

ORFfinder @ NCBI

Final annotation
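
The difference can be made concrete: an ORF is any in-frame stretch from a start codon to a stop codon, whereas a CDS is an ORF that the annotator has judged to be genuinely coding. A naive ORF scan therefore reports many more candidates than end up in the final annotation. The sketch below is a minimal forward-strand scan, assuming the three commonest bacterial start codons (ATG, GTG, TTG) and the standard stops; the function name and the `min_codons` cut-off are illustrative choices, and this is not how ORFfinder itself is implemented.

```python
STARTS = {"ATG", "GTG", "TTG"}   # assumed common bacterial/phage starts
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 30):
    """Return (start, end) of forward-strand ORFs, 0-based, end-exclusive.

    For each stop codon, the most upstream in-frame start since the
    previous stop is used, so each ORF is the longest for that stop.
    min_codons counts codons from the start up to (not including) the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in STOPS:
                if start is not None and (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
            elif start is None and codon in STARTS:
                start = i
    return sorted(orfs)


if __name__ == "__main__":
    toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "G"
    print(find_orfs(toy, min_codons=3))
```

To cover both strands, the same function can be run on the reverse complement and the coordinates converted back.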

Gene prediction

Know your organism:

• Common start codons

• Alternative stop codons?

• Which translation table?

• What is the coding density?

• Presence of introns?
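
Coding density, one of the questions above, can be computed as the fraction of genome positions covered by at least one CDS. The sketch below assumes CDS coordinates are given as 0-based, end-exclusive `(start, end)` pairs and counts overlapping or nested CDSs only once; the function name and the example coordinates are illustrative.

```python
def coding_density(cds_intervals, genome_length: int) -> float:
    """Fraction of the genome covered by at least one CDS.

    cds_intervals: iterable of (start, end), 0-based, end-exclusive.
    Overlapping and nested intervals are merged so no base counts twice.
    """
    covered = 0
    current_end = 0
    for start, end in sorted(cds_intervals):
        start = max(start, current_end)  # skip bases already counted
        if end > start:
            covered += end - start
            current_end = end
    return covered / genome_length


if __name__ == "__main__":
    # Hypothetical annotation: two overlapping CDSs and one separate CDS.
    print(coding_density([(0, 100), (90, 200), (300, 400)], 500))
```

A value far below that of well-annotated relatives can indicate missed genes; a sudden drop in one region is worth a closer look.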

Decisions, decisions

Shine-Dalgarno sequence: ribosome-binding site

E. coli: AGGAGG; phage T4 early genes GAGG

Decision criteria for phage CDSs

• Presence of ribosome-binding site
• Alternative start codons allowed, most phages: Translation table 11 (some exceptions)
• Maximize coding density
• Small CDS overlap allowed
• Nested CDS possible (depends on phage)
• CDS on both strands (depends on phage)
• Introns possible
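
The first criterion, presence of a ribosome-binding site, can be checked by looking for a Shine-Dalgarno-like motif just upstream of a candidate start codon. The sketch below is a rough illustration for forward-strand starts only. The motif list (AGGAGG and its sub-motifs of at least four bases, including GAGG) and the 20-nt search window are assumptions made for this example; real genomes show more variation, and gene callers score this more carefully.

```python
# Sub-motifs of AGGAGG, longest first (illustrative choice).
SD_MOTIFS = ("AGGAGG", "AGGAG", "GGAGG", "AGGA", "GGAG", "GAGG")


def find_rbs(seq: str, start_pos: int, window: int = 20):
    """Look for an SD-like motif upstream of a start codon.

    start_pos: 0-based index of the first base of the start codon.
    Returns (motif, spacer) where spacer is the number of bases between
    the motif and the start codon, or None if no motif is found.
    """
    upstream = seq[max(0, start_pos - window):start_pos].upper()
    for motif in SD_MOTIFS:
        idx = upstream.rfind(motif)  # closest occurrence to the start codon
        if idx != -1:
            spacer = len(upstream) - (idx + len(motif))
            return motif, spacer
    return None


if __name__ == "__main__":
    toy = "TTTTAGGAGGTTTTTTTATGAAA"  # start codon ATG at index 17
    print(find_rbs(toy, 17))
```

When two candidate start codons compete for the same ORF, comparing the motif and spacer reported for each is one way to apply this criterion alongside the others in the list.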

Functional annotation

Based on similarity with proteins of known function
• BLASTp or derivatives (PSI-BLAST, PHI-BLAST…)
• HMM-based searches (e.g. HHpred)
• Structural predictions (e.g. PHYRE2)

Know your database: a comparison search can’t find what isn’t in the database
• RefSeq: curated databases (genomes, proteins), not all phages present
• nr/nt: non-redundant datab