Blast2GO PRO Plugin for Geneious User Manual

Geneious 8.0 Version 1.0 October 2015

BioBam S.L. Valencia, Spain

Contents

1 Introduction
1.1 Blast2GO methodology
1.2 Blast2GO and BioBam
1.3 Geneious and Biomatters

2 Quick-Start

3 User Interface Overview
3.1 Toolbar
3.2 Document Table
3.3 Document Viewer
3.3.1 Statistics Tab
3.3.2 Graph Viewer Tab
3.4 Sources Panel

4 Blast2GO functions
4.1 Load Example Data
4.2 Convert to Blast2GO
4.3 Add Blast Hits
4.4 CloudBlast
4.5 Add InterProScan
4.6 InterProScan
4.7 Mapping
4.8 Show GO Description
4.9 Annotation
4.10 Merge InterProScan
4.11 ANNEX
4.12 GO-Slim
4.13 Remove Results
4.14 Fisher's Exact Test
4.15 Selection
4.16 Activate Subscription Key
4.17 Remove Subscription Key
4.18 CloudBlast History

5 Geneious functions
5.1 Import Data
5.2 Export Data
5.3 Workflows

Copyright 2015 - BioBam Bioinformatics S.L.

1 Introduction

Support: [email protected]
Website: http://www.blast2go.com/geneious-plugin

1.1 Blast2GO methodology

Blast2GO (Conesa et al., 2005) is a methodology for the functional annotation and analysis of gene or protein sequences. The method uses local sequence alignments (BLAST) to find similar sequences (potential homologs) for one or several input sequences. The program extracts all GO terms associated with each of the obtained hits and returns an evaluated GO annotation for the query sequence(s). Enzyme codes are obtained by mapping from equivalent GOs, while InterPro motifs can be queried directly at the InterProScan web service. A basic annotation process with Blast2GO consists of three steps: blasting, mapping and annotation. This manual describes these steps, together with further explanations and information on additional functions (Götz et al., 2008).

Figure 1.1: Table and Graph Viewer visualizing a small data-set

1.2 Blast2GO and BioBam

Blast2GO PRO Plugin for Geneious is developed, maintained and distributed by BioBam Bioinformatics S.L.

1.3 Geneious and Biomatters

Geneious is a powerful and comprehensive suite of molecular biology software tools. Geneious provides an easy-to-use interface with simple and intuitive data management capabilities within a customizable and extendable platform. This allows researchers working with sequencing data to gain immediate access to a wide range of essential data analysis features. Blast2GO PRO is now part of this feature set in the form of the plugin described in this document.

Geneious is developed by Biomatters, a New Zealand-based company founded in 2003, with a mission to create bioinformatics solutions for the analysis, interpretation, and application of molecular sequence data.

2 Quick-Start

This section provides an overview of a typical Blast2GO session. Detailed descriptions of the individual steps and options of this plugin are given in the remaining sections of this manual.

1. To start an annotation process with Blast2GO, load a sequence file in FASTA format containing your nucleotide or amino acid sequences: File → Import → From File... Alternatively, you may use any sequence data already available in the Sources navigator on the left.

2. Convert your data to Blast2GO sequences: Select the sequence document list (or one or several loose sequence documents) and choose Right Click → Blast2GO → Convert to Blast2GO. This converts your sequences into Blast2GO sequences and activates three Blast2GO viewers in the bottom editor area (B2G Table, B2G Statistics, B2G Graph). You can now select B2G Table to view the sequence names in a list. Observe that all sequence rows are white and do not contain information. Select the B2G Statistics tab and click on the white-colored statistic names in the right side panel (one of the first four options).

3. Blast sequences: Various options are available to add sequence alignments to your Blast2GO sequences in Geneious. Here we use the CloudBlast option; the other options are described in sections 4.3 (Add Blast Hits) and 4.4 (CloudBlast). Choose Right Click → Blast2GO → CloudBlast. In most cases the default settings can be applied. However, depending on the data-set, the following parameters might need to be adjusted:
   • Select a Blast DB that fits your data-set.
   • In the advanced section (More Options), choose where a copy of your Blast results should be created.
   The sequences are then sent to the CloudBlast resource. Progress can be observed under the Sources panel. While processing, the color of the rows in the B2G Table changes from white to red or orange. Right-clicking on an orange sequence allows you to Show Blast Result via a context menu. This option opens an extra window called the Blast2GO viewer component, an independent window containing a list of result tabs. Unlike the integrated Document Viewers, these tabs do not change their contents when the user switches to a different data-set. Once Blast results have been obtained, the corresponding statistics can be viewed in the B2G Statistics viewer.

4. Perform Mapping: Orange sequences contain at least one hit (probably more than one) and are suitable for mapping (the number of hits is shown in the #hits column of the B2G Table). Select all blasted sequences (whether they are white, red or orange) and choose Right Click → Blast2GO → Mapping. Orange sequences may now change their color to green. Green sequences contain candidate GO terms, which will be considered for annotation in the next step.

5. Annotation: Once sequences have been GO-mapped (green), the annotation step can be applied: Right Click → Blast2GO → Annotation. We will use the default parameters. Some sequences will now change from green to blue and are now successfully annotated. The green and blue options of the B2G Statistics viewer can be used to review the annotation process. Annotated sequences (blue) can also be visualized with the B2G Graph tab.

6. InterProScan: To complement the blast-based annotations with domain-based annotations, run an InterProScan search. Select your data-set, choose Right Click → Blast2GO → InterProScan, enter your email address and press OK. This adds information to the InterPro column in the B2G Table viewer, which must then be merged with your already obtained GOs via Right Click → Blast2GO → Merge InterProScan.

7. Export Results: Once the annotation process is finished, several options are available to export the results: File → Export → Selected Documents. The Files of Type combo-box shows the possible export types; those beginning with Blast2GO are the most suitable for our data-set.
   • annot-file: The annot file is the standard format to export GO annotations. It is a tab-separated text file in which each row contains one GO term.
   • dat-file: The standard Blast2GO project file. This file can also be opened with the standalone Blast2GO application.
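As an illustration of the annot format described above, the following Python sketch reads a tab-separated annotation stream and groups the GO terms by sequence. The column layout (sequence name first, GO ID second, optional further columns) and the sample sequence names are assumptions made for this example; inspect an exported file to confirm the exact columns.

```python
import csv
import io
from collections import defaultdict


def read_annot(handle):
    """Group GO IDs by sequence name from a tab-separated annot stream.

    Assumed layout (not confirmed by this manual): column 1 is the
    sequence name, column 2 the GO ID, further columns are optional.
    """
    annotations = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2 or not row[1].startswith("GO:"):
            continue  # skip blank lines and rows without a GO ID
        annotations[row[0]].append(row[1])
    return dict(annotations)


# Hypothetical sample content; a real file would come from the export dialog.
sample = io.StringIO(
    "seq_1\tGO:0005515\tprotein binding\n"
    "seq_1\tGO:0003677\tDNA binding\n"
    "seq_2\tGO:0005634\tnucleus\n"
)
go_by_sequence = read_annot(sample)
print(go_by_sequence)
```

For a file on disk, the same function can be called with an open file handle instead of the in-memory sample.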

3 User Interface Overview

This section provides an overview of the Blast2GO PRO Plugin for Geneious user interface.

3.1 Toolbar

By default, the main Geneious toolbar hosts an icon for all Blast2GO functions and a Blast2GO toolbox icon. The functions icon provides access to algorithms such as Blast, Mapping or Annotation. The toolbox contains options such as Activate Subscription or the CloudBlast History viewer. All menu options are also available via the context menu of the Document Table (right-click on a document).

Figure 3.1: An overview of the graphical elements the Blast2GO plugin makes available to the user

3.2 Document Table

The Document Table shows all available documents. The plugin works with sequence documents (nucleotide or amino acid) that have been converted into B2G documents. These sequences can be loose sequences or grouped into lists of B2G Sequences. Note: At the moment, Blast2GO list documents cannot be distinguished by their icon from normal nucleotide or amino acid lists (Figure 3.3).

(a) Group sequences into a list document (b) Extract sequences from list

Figure 3.2: Managing sequences and lists of sequences

Note: The Blast2GO viewers work significantly faster with un-grouped sequences. However, un-grouped sequences require more RAM, so un-grouping is not feasible when working with large lists.

3.3 Document Viewer

The Document Viewer in the lower-right area visualizes the data selected in the Document Table. There are three viewers for grouped or un-grouped B2G Sequences: B2G Table, B2G Statistics and B2G Graph.

• B2G Table Tab - The Blast2GO table shows the results obtained for each sequence, including the analysis status (blasted, mapped, annotated, etc.). When working on a list, selection can also be performed via the selection tool (toolbox).

• B2G Statistics Tab - The statistics viewer can be used to create charts for all analysis steps and to export results in various file formats via the sidepanel. Statistics related to blasted sequences are colored orange, statistics related to the annotation step are blue, and so on. Note: A dataset containing only green (mapped) sequences, for example, cannot be used to generate blue (annotation) statistics; a GO Distribution Level statistic makes no sense when applied to sequences that have only been mapped. Section 3.3.1 gives an overview of the available charts and their meaning.

• B2G Graph Tab - To visualize a combined GO graph, the viewer offers many options in the sidepanel (Figure 3.3(b)). The first rendering has to be started via the Make it so button. Subsequent changes to the Graph Options have a direct impact on the graph shown. Please see section 3.3.2 for more information. Jump to Node searches for the GO term matching the search criteria and centers the graph view on it. The options available in the Charts area are explained in section 3.3.2; essentially, the graph's information is reduced to a bar or pie chart. The graph information can be exported in four file formats (svg, pdf, png, txt).

3.3.1 Statistics Tab

• General
  • Data Distribution - This bar chart shows the distribution of un-blasted, blasted, mapped and annotated sequences over the whole data-set.
  • Data Distribution (pie) - The same as the former, but as a pie chart.
  • Sequence Length - Plots the sequence length for all sequences.
  • Analysis Progress - Gives an overview of the current analysis progress of the data-set.

• Blast
  • E-Value Distribution - This chart plots the distribution of E-values for all selected BLAST hits. It is useful for evaluating the success of the alignment for a given sequence database and helps to adjust the E-value cutoff in the annotation step.
  • Similarity Distribution - This chart displays the distribution of all calculated sequence similarities (percentages), shows the overall performance of the alignments and helps to adjust the annotation score in the annotation step.
  • Species Distribution - This chart lists the species to which most sequences were aligned during the BLAST step.
  • Top-Hit Distribution - Bar chart showing the species distribution of all top BLAST hits.
  • Hit Distribution - This chart shows the distribution of the number of hits for the blasted sequences in a data-set.
  • Hsp Distribution - This bar chart shows the distribution of HSPs per hit.
  • Hsp/Seq Distribution - This chart shows the distribution of percentages representing the coverage between the HSPs and their corresponding sequences.
  • Hsp/Hit Distribution - Same as above, but for hits instead of sequences.

• Mapping
  • EC Distribution for Sequences - This chart shows the distribution of GO evidence codes for the functional terms obtained during the mapping step. It gives an idea of how many annotations derive from automatic/computational annotation and how many from manual curation.
  • EC Distribution for Blast Hits - Same as above, but per Blast hit.
  • DB Resources of Mapping - This chart gives the distribution of the number of annotations (GO terms) retrieved from the different source databases, e.g. UniProt, PDB, TAIR.

• Annotation
  • Annotation Distribution - This chart shows the number of GO terms assigned per sequence.
  • GO Annotation Lvl Distribution - A bar chart which shows all GO terms of all 3 categories for a given GO level, taking into account the GO hierarchy (parent-child relationships).
  • Annotation Score Distribution - A chart that shows the number of sequences per annotation score.
  • Seq/Length Relative - This chart shows the relative correlation between the length of the sequences and the number of assigned annotations.
  • Seq/Length Absolute - Same as above, but absolute.
  • GO Distribution Level - A bar chart which shows all GO terms of all 3 categories for GO level 2, taking into account the GO hierarchy.
  • Direct GO Count MF - A chart for the Molecular Function GO category, which shows the most frequent GO terms within a data-set without taking into account the GO hierarchy.
  • Direct GO Count BP - Same as above, but for Biological Process.
  • Direct GO Count CC - Same as above, but for Cellular Component.

• IPS
  • InterProScan Overview - This chart reflects the effect of adding the GO terms retrieved through the InterProScan results.

• Enzymes
  • Main Enzyme Classes - Shows the distribution of the 6 main enzyme classes over all sequences.
  • Second Level Classes - Same as above, but for the corresponding subclasses.

3.3.2 Graph Viewer Tab

Once the graph has been created via the Make it so button, changes to the Graph Options have a direct impact on the graph shown. The sidepanel options are explained in detail later in this section.

Directed Acyclic Graphs

Blast2GO offers the possibility of visualizing the hierarchical structure of the Gene Ontology as directed acyclic graphs (DAGs). This functionality is available to visualize results at different stages of the application and, although the configuration dialogs may vary, some features are shared when generating graphs.

1. Software. Blast2GO integrates a graph viewer based on the ZVTM framework developed by Emmanuel Pietriga at INRIA (France) (Pietriga, 2005). This high-performance, vector-based visualization framework allows fast navigation and zooming on the GO DAG. A graph overview is permanently shown in the upper right corner of the graph tab, making it easier to keep track of the current position while exploring the DAG. Zooming in and out is done with the mouse wheel, and a double click on a DAG node zooms quickly to a readable level. Information about the current node is given in the lower application bar.

2. Parameters.
   Node Filters. A potential drawback of drawing Gene Ontology DAGs for numerous sequences is an excessive number of nodes, which makes the graph hard to read and demands large memory resources. Blast2GO allows the graph size to be controlled with node filters that depend on the type of graph considered. Additionally, there is a maximum number of nodes that can be displayed.
   Coloring mode. Blast2GO colors nodes with an intensity proportional to a parameter of the analysis whose result is visualized on the DAG. This variation in color intensity gives relevant terms more visual weight, which helps guide visual inspection of the results.

Sidepanel Graph Options

Blast2GO generates combined graphs, in which the combined annotation of a group of sequences is visualized together. This can be used to study the joint biological meaning of a set of sequences. Combined graphs are a good alternative to an enrichment analysis when there is no reference set to be considered or when the number of sequences involved is low. The different node shapes are explained in the Graph Legend (Figure 3.4).

1. Graph Options:

• Ontology - Choose which Gene Ontology category should be visualized. If "All" is selected, the three graphs are visualized at once.
• Graph Coloring:
  • By Node-Score - A score is computed for each node according to the formula:

    $\mathrm{score} = \sum_{\mathrm{GOs}} \mathrm{seq} \cdot \alpha^{\mathrm{dist}}$    (3.1)

    where seq is the number of different sequences annotated at a child GO term and dist is the distance to that child node. Coloring by Node-Score highlights areas of high annotation density.
  • By Sequence Count - Node color intensity is proportional to the number of contributing sequences at each node.
  • By Ontology - Each node takes the color of its ontology.
  • By Sequence Percentage - Node color intensity is proportional to the percentage of sequences annotated at the node. The root node of each graph is implicitly associated with all annotated sequences (100%); the more specific a term, the lower its percentage. For example, if the root node covers all 10 sequences and one GO node is annotated to 6 sequences of your data-set, that GO node is colored with 60% intensity.
  • Without Colors - No node coloring is applied.

• Sequence Filter - The minimum number of sequences a GO node must have assigned in order to be displayed. This filter is used to control the number of nodes present in the graph. It is recommended to start the analysis with a high number that, given the total number of sequences, is not expected to overload the graph, and then to adjust this value until the graph is satisfactory (10% of the total number of annotated sequences is a good start). Additionally, nodes can be filtered out by the Node Score Filter (see below).
• Score Alpha - The value of the parameter alpha in equation 3.1. Only nodes with a Node-Score higher than the Node Score Filter are shown. Use this parameter to thin out the GO DAG and to remove uninformative nodes.
• Node Score Filter - Setting this value thins out the graph by deleting nodes with a score lower than the given value.
• The following checkboxes allow the user to modify the graph's visual appearance:
  • Edge Labels - Show edge labels such as "is a" or "part of".
  • GO ID - If checked, the GO ID is included in the node.
  • GO Name - The GO Name is shown in the node.
  • GO Definition - If checked, the GO Definition is included in the node.
  • GO Node Score - The Node-Score is shown in the node.
  • Sequence Name - The names of the sequences annotated at each GO are included in the node. The maximum number of names displayed is 15.
  • Sequence Count - The absolute number of sequences annotated with that particular GO is displayed in the node.
  • Sequence Percentage - If checked, the percentage of sequences within the data-set annotated with that particular GO is included in the node.

2. Jump to Node: Tries to find the entered text and, if found, centers the graph view on the corresponding node.

3. Charts: The options available in the Charts area are explained under Sidepanel Graph Charts; essentially, the graph's information is reduced to a bar or pie chart.

4. Export: The graphs can be exported in 4 different file formats.
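To make the Node-Score of equation 3.1 more concrete, the following Python sketch computes it on a toy hierarchy. It is only an illustration under stated assumptions, not the plugin's implementation: the node itself is counted with dist = 0, dist is taken as the shortest path length when a term can be reached by several paths, and the term names and counts are hypothetical.

```python
from collections import deque


def node_score(node, children, seq_count, alpha):
    """Sum seq * alpha**dist over the node and all of its descendants.

    Assumptions for this sketch: the node itself counts with dist = 0,
    and dist is the shortest path length when several paths exist.
    """
    dist = {node: 0}
    queue = deque([node])
    while queue:  # breadth-first search yields shortest distances
        current = queue.popleft()
        for child in children.get(current, []):
            if child not in dist:
                dist[child] = dist[current] + 1
                queue.append(child)
    return sum(seq_count.get(go, 0) * alpha ** d for go, d in dist.items())


# Toy hierarchy with hypothetical term names: root -> a -> b, root -> c.
children = {"root": ["a", "c"], "a": ["b"]}
seq_count = {"a": 2, "b": 4, "c": 1}  # sequences annotated directly at each term

print(node_score("root", children, seq_count, alpha=0.5))  # 2*0.5 + 1*0.5 + 4*0.25 = 2.5
print(node_score("a", children, seq_count, alpha=0.5))     # 2*1 + 4*0.5 = 4.0
```

With alpha below 1, annotations far below a node contribute less to its score, which is why a specific but well-populated term such as "a" can score higher than the root.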

Sidepanel Graph Charts

GO term associations in a set of sequences can also be analyzed with pie and bar charts. Once the graph is visible, the Charts area allows the creation of 4 different charts:

• Cuts through the graph at a specific level and generates a pie chart of the number of sequences per GO node.

• Lets the user select a minimum filter value, so that only GO nodes with a higher Node-Score or sequence count are included in the resulting pie chart.

• Same as the first one, but as a bar chart.

• Shows a bar chart with the number of sequences that have been annotated with a specific GO term.

All charts open in the Geneious B2G Window and can be saved in different file formats.

Graph Legend

The nodes of the GO graphs are displayed in different shapes (Figure 3.4):
• octagon - Annotated GO Terms
• square - Intermediate GO Terms
• ellipse - GO Terms linked to a Blast Hit

(a) Blast2GO summary charts. (b) Configure GO graph visualization.

Figure 3.3: Sidepanel options of the Statistics and Graph Viewer

Figure 3.4: Graph Legend that shows the graph shapes

3.4 Sources Panel

In the Sources Panel there are two Blast2GO services, which define the Blast2GO Database and the GO DAG file. Usually it is not necessary to modify these settings unless you want to connect to a local database installation or manually define the Gene Ontology hierarchy used, e.g., during the annotation or GO-Slim step.

4 Blast2GO functions

This chapter describes all options available in the Blast2GO PRO Plugin for Geneious. It is intended as a quick reference for finding information about a given menu entry.

4.1 Load Example Data

This menu option loads a small data-set of 100 plant nucleotide sequences, which allows you to experiment with the different Blast2GO functionalities. This dataset can also be used as a reference in case your own dataset returns dubious or no results (e.g. if an analysis step does not work with your own data, you may try the same function with the example dataset).

4.2 Convert to Blast2GO

This option converts sequences into B2G Sequence documents. Amino acid as well as nucleotide sequences can be converted. Converting sequences (or lists) into B2G Sequence documents is required in order to work with the Blast2GO plugin.

4.3 Add Blast Hits

Add Blast Hits adds existing Blast results to your sequence dataset. Blast hits can be added from the Geneious sources or from external files. Hits obtained via the Geneious built-in Sequence Search are added by choosing the corresponding list of hits in the Sources folders: select Hits and choose the protein hits from the Sources. To add external Blast results, select one or multiple XML files. These files can be created anywhere but should comply with the NCBI Blast XML format (blast+ parameter: -outfmt 5); otherwise they may not be recognized correctly by the plugin.

Note: Do not attempt to load a Blast XML file via File → Import → From File... and then add the resulting Blast hits to your data. This will not work; select the XML file directly in Add Blast Hits instead.
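Before adding an external result file, it can be useful to check that it really is NCBI Blast XML. The following Python sketch counts the hits per query using the element names of that format (BlastOutput, Iteration, Iteration_query-def, Iteration_hits, Hit). It is an independent illustration, not part of the plugin, and the embedded sample document is a hypothetical, heavily shortened example.

```python
import xml.etree.ElementTree as ET


def count_hits_per_query(xml_text):
    """Return {query definition: number of hits} for NCBI Blast XML text."""
    root = ET.fromstring(xml_text)
    if root.tag != "BlastOutput":
        raise ValueError("not an NCBI Blast XML (-outfmt 5) document")
    counts = {}
    for iteration in root.iter("Iteration"):
        query = iteration.findtext("Iteration_query-def", default="?")
        counts[query] = len(iteration.findall("Iteration_hits/Hit"))
    return counts


# Hypothetical, shortened sample; real files contain many more elements.
sample = """<BlastOutput>
  <BlastOutput_program>blastx</BlastOutput_program>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_query-def>seq_1</Iteration_query-def>
      <Iteration_hits>
        <Hit><Hit_id>hit_a</Hit_id></Hit>
        <Hit><Hit_id>hit_b</Hit_id></Hit>
      </Iteration_hits>
    </Iteration>
    <Iteration>
      <Iteration_query-def>seq_2</Iteration_query-def>
      <Iteration_hits></Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>"""

print(count_hits_per_query(sample))
```

A file whose root element is not BlastOutput, for example tabular or plain-text Blast output, raises a ValueError here and would likewise not match the format expected above.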

4.4 CloudBlast

CloudBlast runs Blast searches on our cloud system, using the official NCBI Blast+ tools as well as databases from NCBI. To use this service the user needs CloudBlast Computation Units; the current balance can be viewed under Blast2GO → CloudBlast History (Section 4.18). CloudBlast Computation Units are spent in proportion to the amount of computing time consumed, so they directly represent the time needed to blast the data. We believe this is a fair approach, because users pay only for what they consume and can reduce their consumption of units, for example by blasting against smaller databases. Imagine a user blasting tomato (Solanum lycopersicum) genes against the nr database while being interested only in results from plants (Viridiplantae). In this case we recommend blasting not against nr but against its Viridiplantae subset, in order to finish faster and consume fewer CloudBlast Computation Units. Available parameters:

• Blast Program - blastx for nucleotide sequences and blastp for amino acid sequences are available.

• Blast DB - Only protein databases are available. Note: consider looking for a suitable subset of nr in order to speed up your analysis and spend fewer Computation Units.

• Number of Blast Hits - The number of sequence alignments to retrieve per sequence.

• Blast Expectation Value - The statistical significance threshold for reporting matches against a sequence database. If the E-value of an alignment is greater than this threshold, the hit is not reported. Lower E-value thresholds are more stringent and lead to fewer results; increasing the threshold also reports less stringent matches.

• Blast Description Annotator - Finds the best possible description for a new sequence based on a given Blast result.

• Word Size - The length of the initial word of the local alignment, one of the important parameters governing the sensitivity of Blast searches.

• Low Complexity Filter - The Blast program employs the SEG algorithm to filter out low complexity regions before executing a database search.

• HSP Length Cutoff - Cutoff value for the minimum length of the first HSP of a Blast hit, used to exclude hits with only small local alignments from the Blast result. The given length corresponds to amino acids or nucleotides, depending on the type of Blast performed.

• Filter by Description - All Blast hits whose description line contains the text provided here are removed from the result list.

• XML - Additionally save the results to an XML file. This is recommended in order to be able to use the Blast results in other software or simply to keep a copy of the data. The results are appended to the selected file.
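The effect of the result filters in the list above (Blast Expectation Value, HSP Length Cutoff, Filter by Description, Number of Blast Hits) can be sketched in Python as follows. The order in which the filters are applied, the case-insensitive description match, and all names and values are assumptions for this illustration; the sketch does not reproduce the CloudBlast implementation.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    description: str
    evalue: float
    first_hsp_length: int


def filter_hits(hits, max_evalue, min_hsp_length, banned_text, max_hits):
    """Apply the four result filters in a fixed, assumed order."""
    kept = []
    for hit in sorted(hits, key=lambda h: h.evalue):  # best E-value first
        if hit.evalue > max_evalue:
            continue  # E-value above the threshold: hit is not reported
        if hit.first_hsp_length < min_hsp_length:
            continue  # first HSP shorter than the length cutoff
        if banned_text and banned_text.lower() in hit.description.lower():
            continue  # description line contains the filter text
        kept.append(hit)
    return kept[:max_hits]  # keep at most the requested number of hits


# Hypothetical hits for a single query sequence.
hits = [
    Hit("kinase [Solanum lycopersicum]", 1e-40, 120),
    Hit("hypothetical protein", 1e-30, 90),
    Hit("transporter", 0.5, 200),
    Hit("short match", 1e-10, 12),
]
kept = filter_hits(hits, max_evalue=1e-3, min_hsp_length=33,
                   banned_text="hypothetical", max_hits=20)
print([h.description for h in kept])
```

In this example only the kinase hit survives: the transporter fails the E-value threshold, the short match fails the HSP length cutoff, and the hypothetical protein is removed by the description filter.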

Once the search is done, the Blast result data can be viewed via Right Click → Show Blast Result on a sequence in the B2G Table.

4.5 Add InterProScan

This function imports already existing InterProScan results (version 4.8 or 5.0), generated outside of this plugin, and adds them to your dataset. Running InterProScan from within the plugin is explained in Section 4.6. Valid XML result files can be generated with a local installation of the InterProScan executable or directly via the EBI web page (http://www.ebi.ac.uk/Tools/pfa/iprscan5/). Multiple files can be selected.

4.6 InterProScan

What is InterPro? (Ref: EBI web-page, http://www.ebi.ac.uk/interpro/about.html)

InterPro is a resource that provides functional analysis of protein sequences by classifying them into families and predicting the presence of domains and important sites. To classify proteins in this way, InterPro uses predictive models, known as signatures, provided by several different databases (referred to as member databases) that make up the InterPro consortium.

Why is InterPro useful?

InterPro combines signatures from multiple, diverse databases into a single searchable resource, reducing redundancy and helping users interpret their sequence analysis results. By uniting the member databases, InterPro capitalizes on their individual strengths, producing a powerful diagnostic tool and integrated resource.

InterProScan allows you to find similar sequences annotated with GOs, either to confirm GOs that are already annotated or to add GOs that did not appear via the conventional Blast2GO annotation pipeline. The latter must be done explicitly; in other words, performing InterProScan does not automatically augment the annotations (neither does Add InterProScan). The transfer of GOs from the InterPro column to the GO IDs column is called merging the InterProScan results into the data (Merge InterProScan, Section 4.10). To perform InterProScan via the public EBI service, a valid email address is required. Choose one or multiple algorithms or databases for the search. Once the search is done, the InterPro data can be viewed via Right Click → Show InterProScan Result on a sequence in the B2G Table.

4.7 Mapping

Mapping is the process of retrieving GO terms associated with the hits obtained by the Blast search. Blast2GO performs four different mapping steps:

1. Blast result accessions are used to retrieve gene names or symbols, making use of two mapping files provided by the NCBI (gene_info, gene2accession). Identified gene names are then searched in the species-specific entries of the gene-product table of the GO database.
2. GenBank identifiers (gi), the primary Blast hit IDs, are used to retrieve UniProt IDs, making use of a mapping file from PIR (Non-redundant Reference Protein Database) that includes PSD, UniProt, Swiss-Prot, TrEMBL, RefSeq, GenPept and PDB.
3. Accessions are searched directly in the "dbxref" table of the GO database.
4. Blast result accessions are searched directly in the gene-product table of the GO database.

4.8 Show GO Description

To review certain GOs in more detail, use Right Click → Show GO Description on any sequence in the sequence table. This opens the Geneious B2G Window with all GOs listed together with their names and descriptions, along with the option to obtain further information on AmiGO or to visualize a GO's graph in the Gene Ontology structure by clicking show.

4.9 Annotation

This is the process of selecting GO terms from the GO pool obtained by the Mapping step and assigning them to the query sequences. In the current Blast2GO version this is the core type of functional annotation. The GO annotation is carried out by applying an annotation rule (AR) to the ontology terms found during mapping. The rule seeks to find the most specific annotations with a certain level of reliability. This process is adjustable in specificity and stringency.

For each candidate GO an annotation score (AS) is computed. The AS is composed of two additive terms. The first, the direct term (DT), represents the highest hit similarity of this GO, weighted by a factor corresponding to its evidence code (EC). The second, the abstraction term (AT), provides the possibility of abstraction, defined as annotation to a parent node when several child nodes are present in the GO candidate collection. This term multiplies the total number of GOs unified at the node by a user-defined GO weight factor that controls the possibility and strength of abstraction. When the GO weight is set to 0, no abstraction is done. Finally, the AR selects the lowest term per branch that lies over a user-defined threshold. DT, AT and the AR are defined as given in Figure 4.1.

$DT = \max(\mathrm{similarity} \times EC_{\mathrm{weight}})$

$AT = (\#GO - 1) \times GO_{\mathrm{weight}}$

$AR: \mathrm{lowest.node}(AS(DT + AT)) \geq \mathrm{threshold}$

Figure 4.1: Annotation Rule

To better understand how the annotation score works, consider the following reasoning: when the EC weight is set to 1 for all ECs (no EC influence) and the GO weight equals zero (no abstraction), the annotation score equals the maximum similarity value of the hits that have that GO term, and the sequence is annotated with that GO term if this score is above the given threshold. When the EC weights are lower than 1, higher similarities are required to reach the threshold. If the GO weight is different from 0, a parent node may reach the threshold even though its child nodes do not.

The annotation rule provides a general framework for annotation. The actual way annotation occurs depends on how the different parameters of the AS are set. These can be adjusted in the Annotation dialog.

• E-Value Hit Filter - This value can be understood as a pre-filter: only GO terms obtained from hits with an e-value more significant (i.e. lower) than the value given here are used for annotation.

• Annotation Cut-Off (threshold) - The annotation rule selects the lowest term per branch that lies over this threshold.

• GO-Weight - The weight given to the contribution of mapped child terms to the annotation of a parent term.

• Hsp-Hit Coverage Cutoff - Sets the minimum required coverage between a hit and its HSP. For example, a value of 80 means that the aligned HSP must cover at least 80% of the length of its hit. Only annotations from hits fulfilling this criterion are considered for annotation transfer.

• Filter GO by taxonomy - This filter removes the Gene Ontology terms known not to occur in the given taxonomy, using the restrictions defined by the Gene Ontology.

• EC-Weight - EC code weights can be set between 0.0 and 1.0. If no influence of the evidence codes is wanted, they can all be set to 1. Alternatively, to exclude GO annotations of a certain EC (for example IEA), set its EC weight to 0.
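The annotation score of Figure 4.1 can be worked through with a few numbers. The following Python sketch computes AS = DT + AT for a single candidate GO term and compares it with a threshold. It does not implement the selection of the lowest node per branch, and the similarity values, weights and threshold are illustrative assumptions rather than plugin defaults.

```python
def annotation_score(hits, n_unified_gos, go_weight):
    """AS = DT + AT for one candidate GO term.

    hits: (similarity, ec_weight) pairs for the hits carrying this GO.
    n_unified_gos: the #GO value, i.e. the GOs unified at this node.
    """
    direct_term = max(sim * ec_weight for sim, ec_weight in hits)  # DT
    abstraction_term = (n_unified_gos - 1) * go_weight             # AT
    return direct_term + abstraction_term


threshold = 55  # illustrative value, not necessarily the plugin default

# EC weight 1 and GO weight 0: the score is the maximum hit similarity.
print(annotation_score([(80, 1.0), (62, 1.0)], 1, 0) >= threshold)  # 80.0 -> True

# A lower EC weight means a higher similarity is needed to pass.
print(annotation_score([(80, 0.5)], 1, 0) >= threshold)             # 40.0 -> False

# With GO weight 5, a parent unifying 4 GOs gains (4 - 1) * 5 = 15.
print(annotation_score([(45, 1.0)], 4, 5) >= threshold)             # 60.0 -> True
```

The third case shows abstraction at work: a similarity of 45 alone would not reach the threshold, but the contribution of the unified child terms lifts the parent node above it.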

4.10 Merge InterProScan

The Merge InterProScan function is used to transfer and merge the Gene Ontology terms identified via InterPro into the Blast-based GO annotations (GO ID column). Figure 4.2 shows an example of how to examine this process. To better understand what is going on, we can configure the B2G Table as shown in the figure, with Switch GO View disabled and Switch IPS View disabled. This way we can directly see which GOs are already annotated (or candidates) and which ones would be added after the merge.

4.11 ANNEX

Blast2GO integrates the Second Layer concept developed by the Norwegian University of Science and Technology (Myhre et al., 2006) for augmenting GO annotation. Basically, this approach uses univocal relationships between GO terms from the different GO categories to add implicit annotation.

4.12 GO-Slim

What is a GO Slim? (Ref: Gene Ontology website, http://geneontology.org/page/go-slim-and-subset-guide)

Figure 4.2: (a) Before and (b) after the merge. Notice that sequence C02008E09 does show functional annotation after the merge. Sequence C02008E05 also profits from merging by adding GO:0005515.

GO slims are cut-down versions of the Gene Ontology containing a subset of the terms in the whole GO. They give a broad overview of the ontology content without the detail of the specific fine grained terms.

GO slims are particularly useful for giving a summary of the results of GO annotation of a genome, microarray, or cDNA collection when broad classification of gene product function is required.

GO slims are created by users according to their needs, and may be specific to species or to particular areas of the ontologies. GO provides a generic GO slim which, like the GO itself, is not species specific, and which should be suitable for most purposes. Alternatively, users can create their own GO slims or use one of the model organism-specific slims integrated into the GO flat file. Please email the GO helpdesk for more information about creating and submitting your GO slim.

To get a better understanding of what GO Slim does in practice and how it works, here (Figure 4.3) is a small visual example. Imagine figure 4.3(a) to be the subset of GO terms called GO Slim; figure 4.3(b) shows a data-set with GOs 6, 9 and 10 annotated. The GO Slim methodology will pull up the 3 annotated GOs as follows:

• 6 → 1

• 9 → 4

• 10 → 5

The result is shown in figure 4.3(c). Keep in mind that this example describes a data-set containing several sequences: a single sequence annotated with both GO 1 and GO 4 would keep only GO 4 because of the true-path rule.

In the application, the GO Slim subset is represented by a file with the extension .obo; this file contains the GO nodes and their hierarchical structure. The Gene Ontology Consortium provides various GO Slims that can be used and accessed directly from within the application. To use a predefined GO Slim, choose Obo file from GO-Website and select your preferred file. It will then be used in combination with the obo file currently selected under Sources B2G GO Dag. The latter file contains the whole set of Gene Ontology terms. Users who want to experiment with a separate setup can choose Custom Obo files and select the two obo files by hand. Keep in mind that the GO Slim file has to contain a true subset of the GOs in the full file, otherwise the result is undefined.
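The pull-up described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the plugin's code: the parent links in `PARENTS` are a hypothetical hierarchy chosen only so that it reproduces the mappings 6 → 1, 9 → 4 and 10 → 5, and the function names are invented for the example.

```python
# Hypothetical hierarchy (child -> parents); only the three mappings listed
# above are taken from the text, the rest is assumed for illustration.
PARENTS = {2: [1], 3: [1], 6: [1], 4: [2], 5: [2], 8: [4], 9: [8], 10: [5]}
SLIM = {1, 2, 3, 4, 5}


def nearest_slim_terms(term, parents, slim):
    """Return the term itself if it is in the slim, otherwise the closest
    slim terms found when walking up towards the root."""
    if term in slim:
        return {term}
    found = set()
    for parent in parents.get(term, ()):
        found |= nearest_slim_terms(parent, parents, slim)
    return found


def ancestors(term, parents):
    """All terms above `term` in the hierarchy."""
    result, stack = set(), list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in result:
            result.add(current)
            stack.extend(parents.get(current, ()))
    return result


def slim_annotation(annotated, parents, slim):
    """Map the annotated terms of ONE sequence onto the slim, then drop slim
    terms that are ancestors of another retained term (true-path rule)."""
    mapped = set()
    for term in annotated:
        mapped |= nearest_slim_terms(term, parents, slim)
    redundant = set()
    for term in mapped:
        redundant |= ancestors(term, parents)
    return mapped - redundant


per_sequence = {name: slim_annotation(terms, PARENTS, SLIM)
                for name, terms in {"seqA": {6}, "seqB": {9}, "seqC": {10}}.items()}
```

Applied per sequence, a sequence annotated with both GO 6 and GO 9 would first map to GO 1 and GO 4 and then keep only GO 4, since GO 1 is an ancestor of GO 4 in this assumed hierarchy.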

Figure 4.3: This shows an example of GO Slim in practice; each node represents one GO. White stands for normal, yellow for GO Slim and blue for directly annotated. (a) GO Slim subset. (b) Whole data-set with GOs 6, 9 and 10 annotated. (c) The final annotation of the data-set after applying the GO Slim.

4.13 Remove Results

During a Blast2GO analysis (Blast, GO mapping, Annotation, InterPro, etc.), each step adds pieces of information to the dataset. In order to redo a particular analysis step, the existing results must first be removed: Right click → Remove Results. In Blast2GO most analysis steps are only applied to sequences which have not yet been processed for that particular step. On the one hand, this means that if you want to run the Blast step, for example, only white sequences will be considered. This makes it possible to start and stop any step at any time without redoing already processed sequences. On the other hand, it means that if you want to re-blast a sequence or a whole dataset, the existing Blast results have to be removed first. Please note that removing e.g. the mapping results automatically removes the annotation and GO Slim results as well. This is because the data in Blast2GO is hierarchically structured: a sequence cannot possess GO-Slim results without having been mapped and annotated in the first place.
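The dependency between result types can be pictured with a small sketch. It assumes a simple linear order of steps (Blast, mapping, annotation, GO-Slim); the text above only states that removing mapping results also removes annotation and GO-Slim results, so the position of Blast in the chain, the step names and the function names are assumptions for illustration.

```python
# Assumed linear order of analysis steps (hypothetical names).
STEPS = ["blast", "mapping", "annotation", "go_slim"]


def remove_results(done, step):
    """Return the steps that still have results after removing `step`
    together with every step that depends on it."""
    cutoff = STEPS.index(step)
    return {s for s in done if STEPS.index(s) < cutoff}


def sequences_to_process(results, step):
    """A step only considers sequences that have no results for it yet.

    results: dict mapping sequence name -> set of completed steps.
    """
    return [name for name, done in results.items() if step not in done]


results = {
    "seq1": {"blast", "mapping", "annotation", "go_slim"},
    "seq2": {"blast"},
    "seq3": set(),
}
# Removing the mapping results of seq1 also drops annotation and GO-Slim.
results["seq1"] = remove_results(results["seq1"], "mapping")
```

After this call, running the mapping step again would consider seq1, seq2 and seq3, whereas running Blast would only consider seq3.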

4.14 Fisher’s Exact Test

Enrichment analysis can be performed in Blast2GO with a Fisher’s Exact Test (FET). Blast2GO implements the FatiGO package (Al-Shahrour et al., 2004) for the statistical assessment of annotation differences between 2 sets of sequences. FatiGO includes multiple testing correction. For this analysis, all annotated sequences have to be selected. After choosing Right click → Fishers Exact Test, a test set and a reference set can be selected. Blast2GO will perform the FET for the test set against the reference set. If no reference is chosen explicitly, the whole data-set automatically becomes the reference set. Both files need to be .txt files containing one sequence ID per line. Additional options are:

• Name - Choose a name for the resulting FET Result.

• Remove Double IDs - This option automatically removes all sequence IDs which are present in both the test set and the reference set. By default, double/common IDs are only removed from the reference set.

• Two Sided - Performs a two-sided test, i.e. tests for both over- and under-representation (the test set is tested against the reference set and vice versa).

• P-Value Filter Mode - Choose the type of value used for filtering.
    ◦ P-Value - Single test p-value: the p-value without multiple testing correction.
    ◦ FDR - The p-value corrected by False Discovery Rate control.

• P-Value Filter Value - GO terms with a value higher than the given one are not shown.

For further details please refer to the FatiGO publication (Al-Shahrour et al., 2004).

Once the enrichment result has been calculated, two result viewers are available:

• B2G FET Table - A table showing information about the enriched GOs.

• B2G FET Graph - A graph viewer to visualize the hierarchy of the enriched GOs. The nodes are colored proportionally to their significance value. The user can choose which type of calculated p-value to use for highlighting and the threshold for filtering out nodes. Additionally, the Thinned out Graph Node Filter will hide nodes with a significance value higher than the given value.
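To make the test concrete, the sketch below computes a one-sided (over-representation) Fisher's Exact Test for a single GO term from sets of sequence IDs, followed by an FDR correction. It is an illustration only and not the FatiGO implementation: the function names are invented, and the Benjamini-Hochberg procedure is used as an assumed stand-in, since the exact FDR method is not named here.

```python
from math import comb


def contingency(test_ids, reference_ids, annotated_ids, remove_double_ids=False):
    """Build the 2x2 table (a, b, c, d) for one GO term.

    a/b: test sequences with/without the term,
    c/d: reference sequences with/without the term.
    """
    test, reference = set(test_ids), set(reference_ids)
    annotated = set(annotated_ids)
    common = test & reference
    reference -= common          # default: common IDs leave the reference set
    if remove_double_ids:
        test -= common           # option: common IDs leave both sets
    a = len(test & annotated)
    c = len(reference & annotated)
    return a, len(test) - a, c, len(reference) - c


def fisher_over(a, b, c, d):
    """One-sided p-value: probability of seeing at least `a` annotated
    sequences in the test set (hypergeometric tail)."""
    n_test, n_term, total = a + b, a + c, a + b + c + d
    upper = min(n_test, n_term)
    tail = sum(comb(n_term, i) * comb(total - n_term, n_test - i)
               for i in range(a, upper + 1))
    return tail / comb(total, n_test)


def benjamini_hochberg(pvalues):
    """FDR-adjusted p-values (assumed correction method)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted, running_min = [0.0] * m, 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


table = contingency(
    test_ids=["s1", "s2", "s3", "s4"],
    reference_ids=["s4", "s5", "s6", "s7", "s8"],
    annotated_ids=["s1", "s2", "s3", "s5"],
)
p_value = fisher_over(*table)
```

In a full analysis one p-value would be computed per GO term and the whole list passed to the correction step; the P-Value Filter Value would then be applied to either the raw or the corrected values, depending on the P-Value Filter Mode.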

4.15 Selection

The Selection function available in the Blast2GO Plugin allows a group of sequences to be selected based on different search criteria. Selections can be made based on the analysis step, functional annotation, etc. As described earlier, in Geneious we can work with loose, un-grouped sequence documents as well as sequence lists. The selection functions can only be applied to list documents. Available options:

• Select Type - General

• Selection - New will first clear the selection and then select, whereas Add and Remove work with an existing selection.

• Select - Indicates the main search criterion.

• Match Type
    ◦ Contains - Matches any string within another.
    ◦ Exact Match - Searches for the entire string of characters, including spaces, in the same order. Only returns a result if the query matches it exactly and completely.
    ◦ Whole Word - This is an “exact match” applied to each word.

• Case Sensitive - Distinguish between lowercase and uppercase letters.

• Rename Sequence Name - When filtering for Sequence Description, the user may decide to rename all filtered sequences after their description.

• Include GO Parents - When filtering for GO Id or GO Name, this option will also consider all child terms. Let GO:2 be the child of GO:1. If we filter for GO:1, we will also select sequences that have GO:2 annotated, because they are implicitly annotated with GO:1 as well.

• Select Type - Color

• Selection - New will first clear the selection and then select, whereas Add and Remove work with an existing selection.

• Color - Select sequences based on their sequence table color.
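The match types and the Include GO Parents option can be illustrated with a short sketch. It reflects one reading of the descriptions above and is not the plugin's code; the function names, the option strings and the data layout are assumptions made for the example.

```python
def matches(query, text, match_type="contains", case_sensitive=False):
    """Apply one of the three match types to a single text field."""
    if not case_sensitive:
        query, text = query.lower(), text.lower()
    if match_type == "contains":
        return query in text            # any substring
    if match_type == "exact":
        return query == text            # whole string, including spaces
    if match_type == "whole_word":
        return query in text.split()    # exact match applied to each word
    raise ValueError(f"unknown match type: {match_type}")


def descendants(term, children):
    """All child terms below `term` (children: parent -> list of children)."""
    found, stack = set(), list(children.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(children.get(current, ()))
    return found


def select_by_go(annotations, term, children, include_children=True):
    """Select sequences annotated with `term`, optionally also those
    annotated with one of its child terms (implicit annotation)."""
    wanted = {term}
    if include_children:
        wanted |= descendants(term, children)
    return {name for name, terms in annotations.items() if wanted & set(terms)}


children = {"GO:1": ["GO:2"]}
annotations = {"seqA": {"GO:2"}, "seqB": {"GO:3"}}
selected = select_by_go(annotations, "GO:1", children)
```

Here seqA is selected when filtering for GO:1 because it is annotated with the child term GO:2; with the option switched off, nothing would be selected.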

4.16 Activate Subscription Key

Brings up the initial Activate Blast2GO dialog in order to activate the plugin. It is only available when the plugin is currently not active (e.g. after a fresh install or after calling Remove Subscription Key).

4.17 Remove Subscription Key

Deactivates the plugin.

4.18 CloudBlast History

Provides information regarding the CloudBlast usage and consumed ComputationUnits.

Figure 4.4: Provides information regarding the CloudBlast usage and consumed ComputationUnits. A link directs to a recharge option.

Geneious functions

5.1 Import Data

To import data in Geneious we can use the Import dialog (File → Import → From File). The dialog offers an auto-detect function which makes it possible to load annot, dat or fasta files without specifying the exact import data type.

Figure 5.1: Geneious offers many different possibilities to load data. All Blast2GO related options are also listed which includes Project files (.dat) and annotation files (.annot).

5.2 Export Data

The Export function (File → Export → Selected Documents) makes it possible to export Blast2GO Projects as .dat files as well as to save all generated functional information in .annot or .gff format.

5.3 Workflows

The Geneious Workflow Manager allows analysis pipelines to be predefined.

Figure 5.2: Results can be exported with the data export dialog.

Figure 5.3: A basic workflow to convert, blast, map and annotate a list of sequences.

Bibliography

Al-Shahrour, F., Díaz-Uriarte, R., and Dopazo, J. (2004). FatiGO: a web tool for finding significant associations of gene ontology terms with groups of genes. Bioinformatics, 20(4):578–580.

Conesa, A., Götz, S., García-Gómez, J. M., Terol, J., Talón, M., and Robles, M. (2005). Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics, 21(18):3674–3676.

Götz, S., García-Gómez, J. M., Terol, J., Williams, T. D., Nagaraj, S. H., Nueda, M. J., Robles, M., Talón, M., Dopazo, J., and Conesa, A. (2008). High-throughput functional annotation and data mining with the Blast2GO suite. Nucl. Acids Res., pages gkn176+.

Myhre, S., Tveit, H., Mollestad, T., and Laegreid, A. (2006). Additional gene ontology structure for improved biological reasoning. Bioinformatics, 22(16):2020–2027.

Pietriga, E. (2005). A toolkit for addressing HCI issues in visual language environments. In Visual Languages and Human-Centric Computing, 2005 IEEE Symposium on, pages 145–152.
