interproscan-docs Documentation

EMBL-EBI

Aug 18, 2021

Contents:

1 Introduction
  1.1 What is InterProScan?
  1.2 Supported platforms
  1.3 To install and run InterProScan
    1.3.1 LSF cluster users

2 Release notes: InterProScan 5.52-86.0
  2.1 What’s new
    2.1.1 Data update
    2.1.2 Software updates
    2.1.3 Other updates
    2.1.4 Known issues
    2.1.5 Reporting issues

3 Installation requirements
  3.1 How to check these on a system?
    3.1.1 Which version of Linux am I running?
    3.1.2 Testing your Perl installation
    3.1.3 Testing your Python installation
    3.1.4 Testing the Java environment

4 Obtaining a copy of InterProScan
  4.1 Obtaining the core InterProScan software
  4.2 Index hmm models
  4.3 Panther models
  4.4 Using the Local Pre-calculated Match Lookup Service (optional)

5 Running InterProScan
  5.1 InterProScan test run
  5.2 Command-line options
    5.2.1 -dp / --disable-precalc (optional)
    5.2.2 -appl / --applications application_name (optional)
    5.2.3 -i / --input sequence_file
    5.2.4 -iprlookup,--iprlookup
    5.2.5 -goterms,--goterms (optional)
    5.2.6 -b / --output-file-base file_name (optional)
    5.2.7 -o / --outfile (optional)
    5.2.8 -pa / --pathways (optional)
    5.2.9 -t / --seqtype (optional)
    5.2.10 -T / --tempdir (optional)
    5.2.11 -dra / --disable-residue-annot (optional)
    5.2.12 -version / --version (optional)
  5.3 Included analyses
  5.4 Output format
  5.5 Optional configuration
    5.5.1 Working directory for temporary files
    5.5.2 Configuring the Pre-calculated Match Lookup Service
  5.6 Running InterProScan on an LSF/SGE Cluster

6 Input formats
  6.1 Supported input file format
  6.2 Supported sequence format

7 Output formats
  7.1 Tab-separated values format (TSV)
    7.1.1 Example output
  7.2 Extensible Markup Language (XML)
    7.2.1 Example output
  7.3 The XML Schema Definition
  7.4 JavaScript Object Notation (JSON)
    7.4.1 Example output
  7.5 Generic Feature Format Version 3 (GFF3)
    7.5.1 Example output
  7.6 SVG and HTML
    7.6.1 Example output

8 Nucleic acid sequences scan
  8.1 The prediction tool
  8.2 How can I scan nucleic acid sequences in InterProScan 5?
  8.3 Which output formats are supported?
  8.4 Redundant sequences and identifiers in your FASTA file
  8.5 Improving performance
    8.5.1 Selecting the ORFs to analyse

9 The InterProScan Lookup Match Service
  9.1 Installing the lookup service locally
  9.2 System requirements
  9.3 Obtaining the lookup service
    9.3.1 Run with graphical user interface (to set port number)
    9.3.2 Run “Headless” (no graphical user interface)
  9.4 Waiting for the lookup service to start
  9.5 Testing the service
  9.6 Configure InterProScan 5 to use your local lookup service

10 Running InterProScan 5 in Cluster Mode
  10.1 Initial Setup
    10.1.1 Cluster submission commands
    10.1.2 Master configuration options
  10.2 Example usage on an LSF, SGE and other clusters
  10.3 clusterrunid
  10.4 In house tested cluster versions
  10.5 Related issues

11 Running InterProScan 5 in CONVERT mode
  11.1 Usage instructions
  11.2 Example Usage

12 Improving performance
  12.1 Review your CPU (and memory) command options
  12.2 Consider chunking large input files
  12.3 Review your command line input options
    12.3.1 Running InterProScan in CLUSTER mode
  12.4 Configure to analyse fewer ORFs (applies to nucleic acid sequences only)

13 Activating Phobius/SignalP/TMHMM analyses
  13.1 Phobius
  13.2 SignalP
  13.3 TMHMM

14 Providing your feedback
  14.1 Support requests
  14.2 General discussion and suggestions

15 Known issues
  15.1 Open issues in InterProScan
    15.1.1 1. CDD/RPSBlast errors
    15.1.2 2. Coils errors
    15.1.3 Contacting us

16 FAQ
  16.1 What should I do if one of the binaries included with InterProScan doesn’t work on my system?
  16.2 Where can I find the XSD of the XML output?
  16.3 Can I use different binary versions than listed?
  16.4 Which cluster does InterProScan support?
  16.5 Is there Galaxy has a wrapper for InterProScan?
    16.5.1 Documentation and contact details
    16.5.2 Publication
  16.6 I get Java errors on running InterProScan
  16.7 How to analyse a huge amount of sequences (>30000)?
  16.8 Should I filter by e-value?
  16.9 Why do I see “Pre-calculated match lookup service failed - analysis proceeding to run locally”?
  16.10 How is InterProScan 5 different from InterProScan 4? How do I migrate?

17 Installing and compiling binaries used in Interproscan
  17.1 cath-resolve-hits (used by CATH-Gene3D)
  17.2 Pfscan/Pfsearch (used by ProSite Profiles, ProSite Patterns and HAMAP)
  17.3 Hmmer 2 (used by SMART)
  17.4 Hmmer 3 (used by CATH-Gene3D, HAMAP, PANTHER, Pfam, PIRSF, SFLD, SUPERFAMILY and TIGRFAMs)
  17.5 ncoils (used by Coils)
  17.6 fingerPRINTScan (used by PRINTS)
  17.7 rpsblast/rpsbproc (used by CDD)
  17.8 sfld_preprocess/sfld_postprocess (used by SFLD)
  17.9 Phobius, TMHMM or SignalP

18 Configuration Options

19 Cluster mode benchmark run
  19.1 Benchmark run setup
    19.1.1 Which version of InterProScan 5 (I5) was used for this run?
    19.1.2 How was the set of input sequences assembled for this run?
    19.1.3 Which I5 command was used for this run?
    19.1.4 How does the interproscan.properties file look like?
    19.1.5 On which cluster/farm did we run I5?
  19.2 Benchmark run outcome

20 Change log for InterProScan JSON output format
  20.1 InterProScan 5.31-70.0

21 Contact us
  21.1 Helpdesk
  21.2 Subscribe to the mailing list
  21.3 Follow us on Twitter

22 Indices and tables

CHAPTER 1

Introduction

1.1 What is InterProScan?

InterPro is a database which integrates together predictive information about proteins’ function from a number of partner resources, giving an overview of the families that a protein belongs to and the domains and sites it contains. Users who have novel nucleotide or protein sequences that they wish to functionally characterise can use the software package InterProScan to run the scanning algorithms from the InterPro database in an integrated way. Sequences are submitted in FASTA format. Matches are then calculated against all of the required member databases’ signatures and the results are output in a variety of formats.

1.2 Supported platforms

• 64-bit Linux that meets the Installation requirements

There are no versions planned for Windows or Apple (Mac OS X) operating systems. This is due to constraints in the various third-party binaries that InterProScan runs.

1.3 To install and run InterProScan

For more information about using InterProScan, please see the other chapters of this documentation, for example Obtaining a copy of InterProScan (how to download a copy) and Running InterProScan (how to run it).

1.3.1 LSF cluster users

As an alternative to the default “standalone” mode, InterProScan 5 allows components of the analysis to be farmed out on an LSF or SGE cluster. Full details can be found in Running InterProScan 5 in Cluster Mode.

CHAPTER 2

Release notes: InterProScan 5.52-86.0

7th June 2021. We are pleased to announce the release of InterProScan 5 (version 5.52-86.0). This release includes a data update (using InterPro version 86.0 data).

2.1 What’s new

2.1.1 Data update

• Synchronized with InterPro version 86.0.
• The addition of 299 InterPro entries.
• An update to PROSITE patterns (2021_01) and PROSITE profiles (2021_01).
• Integration of 454 new methods from CATH-Gene3D (80), CDD (27), PANTHER (295), PROSITE profiles (39), Pfam (7), SFLD (1), SMART (2) and SUPERFAMILY (3).

2.1.2 Software updates

• InterProScan requires at least Java 11

2.1.3 Other updates

• Deprecated HTML and SVG output formats.

2.1.4 Known issues

Documented on the following page: Known issues.

2.1.5 Reporting issues

Have you found a bug, or do you want to give us your feedback? Please use EMBL-EBI’s support form.

CHAPTER 3

Installation requirements

InterProScan is developed to run on Linux. There are no versions planned for Windows or Apple (Mac OS X) operating systems. This is due to constraints in the various third-party binaries that InterProScan runs.

Note that InterProScan and the individual member database analyses are processor and memory intensive. The minimum specification is a machine with 2 cores and 4 GB of RAM, which will allow the analysis of a small number of sequences at a time. The more resources that are available, the faster the analysis runs and the more sequences can be analysed at a time.

Software requirements:

• 64-bit Linux
• Perl 5 (default on most Linux distributions)
• Python 3 (InterProScan 5.30-69.0 onwards)
• Java JDK/JRE version 11 (InterProScan 5.37-76.0 onwards)
• Environment variables set
  – $JAVA_HOME should point to the location of the JVM
  – $JAVA_HOME/bin should be added to the $PATH
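The two environment variables listed above could be set for the current shell session roughly as follows. This is a minimal sketch: the path /opt/java/jdk-11 is a hypothetical install location, not one named by this documentation, so substitute the directory of your own Java 11 installation.

```shell
# Hypothetical install location (an assumption): replace with your own Java 11 directory.
JAVA_HOME=/opt/java/jdk-11

# Put the Java binaries first on the PATH so that `java` resolves to this installation.
PATH="$JAVA_HOME/bin:$PATH"

export JAVA_HOME PATH
```

To make the setting permanent, the same lines would typically go into a shell start-up file such as ~/.bashrc.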

3.1 How to check these on a system?

3.1.1 Which version of Linux am I running?

InterProScan has been prepared with 64-bit binaries. To determine if you have a 32-bit or a 64-bit system, enter on the command line:

uname -a

The exact response will depend upon the hardware vendor and architecture; typical responses may look like the following.

64-bit, as hinted by x86_64:

$ uname -a
Linux bob.com 2.6.32-358.6.2.el6.x86_64 #1 SMP Tue May 14 15:48:21 EDT 2013 x86_64 x86_64 x86_64 GNU/Linux

32-bit, as hinted by i686:

$ uname -a
Linux jim.com 2.6.32-50-generic-pae #112-Ubuntu SMP Tue Jul 9 20:44:31 UTC 2013 i686 GNU/Linux

If you are still in any doubt, ask your systems administrator.
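The check above can also be scripted. The following is a minimal sketch that classifies the machine name printed by `uname -m`; the function name arch_bits and the short list of machine names are illustrative assumptions and are not exhaustive.

```shell
# Classify a machine name (as printed by `uname -m`) as 64-bit or 32-bit.
# Only the names shown in the examples above, plus a few common variants, are recognised.
arch_bits() {
    case "$1" in
        x86_64|amd64) echo 64 ;;
        i386|i486|i586|i686) echo 32 ;;
        *) echo unknown ;;
    esac
}

# Report on the current machine.
echo "uname -m reports $(uname -m): $(arch_bits "$(uname -m)")"
```

Any machine name that is not in the list is reported as unknown, in which case asking your systems administrator remains the safest route.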

3.1.2 Testing your Perl installation

To test that Perl 5 is installed, enter on the command line:

perl -version

This should report that a version of Perl is available, similar to:

This is perl, v5.10.1 (*) built for i486-linux-gnu-thread-multi

Copyright 1987-2009, Larry Wall

...etc

A default Perl installation is sufficient: no third-party Perl modules need to be installed. Alternatively, you can change the value of the ‘perl.command’ property in your interproscan.properties configuration file to point at a suitable Perl installation. The default value is:

perl.command=perl

3.1.3 Testing your Python installation

To test that Python 3 is installed, enter on the command line:

python3 --version

This should report that a version of Python is available, similar to:

Python 3.5.1

A default Python installation is sufficient: no third-party Python modules need to be installed. You can also change the value of the ‘python3.command’ property in your interproscan.properties configuration file to point at a suitable Python installation. The default value is:

python3.command=python3

3.1.4 Testing the Java environment

To test your environment, enter on the command line:

java -version

This should report that a version of Java is available, similar to:

openjdk version "11.0.4" 2019-07-16
OpenJDK Runtime Environment AdoptOpenJDK (build 11.0.4+11)
OpenJDK 64-Bit Server VM AdoptOpenJDK (build 11.0.4+11, mixed mode)

InterProScan release 5.37-76.0 or later will only run with Java version 11 or later.

You can get Java from many places. We have tested Java 11 from the OpenJDK binaries from https://adoptopenjdk.net/. You can get information on OpenJDK reference implementations at https://jdk.java.net/ and download from https://openjdk.java.net/install/index.html.

InterProScan releases prior to 5.37-76.0 required Java 8.
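The version requirement can be checked in a script by extracting the major version from the first line of the `java -version` output. The sketch below is illustrative only: java_major is a hypothetical helper, and it assumes the version appears in double quotes on the first line, as in the example output above (some environments print extra lines first, in which case the check reports the version as too old).

```shell
# Print the major Java version found in a `java -version` first line.
# Handles the legacy "1.8.0_66" scheme (major 8) and the newer "11.0.4" scheme (major 11).
java_major() {
    ver=$(printf '%s\n' "$1" | sed -n 's/.*version "\([^"]*\)".*/\1/p')
    case "$ver" in
        "") return 1 ;;              # no quoted version string found
        1.*) ver=${ver#1.} ;;        # legacy scheme: drop the leading "1."
    esac
    echo "${ver%%[._-]*}"            # keep everything before the first ".", "_" or "-"
}

# Check the Java found on the PATH, if any (java -version writes to stderr).
if command -v java >/dev/null 2>&1; then
    major=$(java_major "$(java -version 2>&1 | head -n 1)") || major=0
    if [ "$major" -ge 11 ]; then
        echo "Java $major found: meets the Java 11 requirement"
    else
        echo "Java $major found: too old, Java 11 is required"
    fi
else
    echo "java not found on PATH"
fi
```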

Appendix - Historical Java version testing information

Any Oracle/Open JDK/JRE with Java 1.8.x should work with InterProScan. Historical information about Java versions tested and confirmed to work or not work is included below for information, but this is not an exhaustive list.

Oracle JDK/JRE for InterProScan 5.17-56.0 or later

Version     Build           Operating System   Architecture   Status
1.8.0_74    1.8.0_74-b02    Linux              x64            Works
1.8.0_60    1.8.0_60-b27    Linux              x86            Works
1.7.*       -               Linux              x86            Doesn’t work

OpenJDK for InterProScan 5.17-56.0 or later

Version     Operating System   Architecture   Status          Misc
1.8.0_66    Linux              x64            Works
1.7.*       Linux              x64            Doesn’t work

Oracle JDK/JRE for InterProScan 5.16-55.0 or before

Version     Build           Operating System   Architecture   Status
1.8.0       1.8.0-Works     Linux              x64            Doesn’t work
1.7.0_51    1.7.0_51-b13    Linux              x86            Works
1.7.0_40    -               Linux              x64            Works
1.7.0       -               Linux              x64            Works
1.6.0_45    -               Linux              x64            Works
1.6.0_37    -               Linux              x64            Works
1.6.0_22    -               Linux              x64            Works
1.6.0_11    -               Linux              x64            Works
1.6.0_07    -               Linux              x64            Works
1.6.0_05    -               Linux              x64            Works
1.6.0_04    -               Linux              x64            Works
1.6.0_03    -               Linux              amd64          Doesn’t work
1.6.0_02    -               Linux              amd64          Doesn’t work

OpenJDK for InterProScan 5.16-55.0 or before

Version     Operating System                 Architecture   Status          Misc
1.7.0_25    Linux                            x64            Works
1.6.0_30    Linux                            i686           Works
1.6.0_27    Linux                            x64            Works
1.6.0_24    Linux (Red Hat Distribution)     x64            Doesn’t work    Reported by user

CHAPTER 4

Obtaining a copy of InterProScan

Firstly, check that your system satisfies the Installation requirements. To install the InterProScan 5 software you then need to complete the following steps:

• Install the core InterProScan software
• Configure the Pre-calculated Match Lookup

4.1 Obtaining the core InterProScan software

mkdir my_interproscan
cd my_interproscan
wget https://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.52-86.0/interproscan-5.52-86.0-64-bit.tar.gz
wget https://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.52-86.0/interproscan-5.52-86.0-64-bit.tar.gz.md5

# Recommended checksum to confirm the download was successful:
md5sum -c interproscan-5.52-86.0-64-bit.tar.gz.md5
# Must return *interproscan-5.52-86.0-64-bit.tar.gz: OK*
# If not - try downloading the file again as it may be a corrupted copy.
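The checksum step above can be wrapped in a small helper so that a script stops when the download is corrupted. This is a sketch rather than part of the InterProScan distribution: verify_download is a hypothetical name, and it assumes the archive and its .md5 file sit in the same directory.

```shell
# Succeed only if md5sum confirms the archive against its accompanying .md5 file.
verify_download() {
    archive=$1
    # md5sum -c resolves file names relative to the current directory, so change into it first.
    ( cd "$(dirname "$archive")" && md5sum -c "$(basename "$archive").md5" >/dev/null 2>&1 )
}

# Usage with the file downloaded above (skipped if the archive is not present).
f=interproscan-5.52-86.0-64-bit.tar.gz
if [ -f "$f" ]; then
    if verify_download "$f"; then
        echo "checksum OK"
    else
        echo "checksum FAILED: try downloading the file again"
    fi
fi
```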

(Direct link: https://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.52-86.0/interproscan-5.52-86.0-64-bit.tar.gz)

As the compressed file is large, it is strongly recommended that you use md5sum to check that the file has been downloaded without errors, as described above.

Extract the tarball:

tar -pxvzf interproscan-5.52-86.0-*-bit.tar.gz

# where:
# p = preserve the file permissions
# x = extract files from an archive
# v = verbosely list the files processed
# z = filter the archive through gzip
# f = use archive file

This is a completely self-contained version that includes member database specific binaries and model / signature files. This should run ‘out of the box’ on a Linux system. Note that it excludes analyses that contain components for which you are obliged to acquire your own license.

4.2 Index hmm models

Before you run InterProScan for the first time, you should run the command:

python3 initial_setup.py

This command will press and index the hmm models to prepare them into a format used by hmmscan.

4.3 Panther models

Previous versions of InterProScan required a separate installation of Panther data. From InterProScan 5.52-86.0 onwards, this is not necessary: Panther data is bundled together with the rest of the application data.

4.4 Using the Local Pre-calculated Match Lookup Service (optional)

This service is switched on by default, so you don’t need to do any more installation or configuration unless you want to install your own Pre-calculated Match Lookup Service. The uncompressed Match Lookup Service uses more than 1 TB of disk space, so it is recommended just to use the default setup.

The pre-calculated match lookup web service is able to provide matches to more than 400 million protein sequences, including all of the sequences in UniProtKB. By default InterProScan is configured (in the interproscan.properties file) to use the web service hosted at the EBI. Your servers will need to have external access to http://www.ebi.ac.uk to use it. InterProScan uses this service to retrieve pre-calculated matches, reducing the need for compute on your server and speeding up the response time.

If you are behind a firewall that prevents such access and you are unable to configure access, you could either install a local copy of the lookup service (see The InterProScan Lookup Match Service) or turn off the use of this service, which means the analysis will run locally without any match lookup.

To turn off the use of the service, either use the -dp command line option, or edit interproscan.properties and comment out (by adding a # to the start of the line) or delete the following line, near the bottom of the file:

precalculated.match.lookup.service.url=http://www.ebi.ac.uk/interpro/match-lookup
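Commenting out the property can also be done from the command line. The following is a minimal sketch, assuming a sed that supports in-place editing with a backup suffix (-i.bak); disable_lookup is a hypothetical helper name, and the original file is kept alongside with a .bak extension.

```shell
# Prefix the lookup-service URL property with "#", keeping a .bak copy of the original file.
# Lines that are already commented out do not match the pattern and are left untouched.
disable_lookup() {
    sed -i.bak 's|^precalculated\.match\.lookup\.service\.url=|#&|' "$1"
}

# Usage from the InterProScan installation directory (skipped if the file is not present).
if [ -f interproscan.properties ]; then
    disable_lookup interproscan.properties
fi
```

Using the -dp command line option instead leaves the configuration file unchanged and affects only the run it is given to.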

It is important to note that we run the latest available version of the pre-calculated match lookup service at the EBI. In the event of a new release, you will be required either to install the latest version of InterProScan 5, or to install the required version of the lookup service locally (see The InterProScan Lookup Match Service).

CHAPTER 5

Running InterProScan

Once you have uncompressed your copy of InterProScan (see Obtaining a copy of InterProScan), you can run it directly from the command line using the supplied shell script. If you run this script with no arguments, you will be presented with the usage instructions:

./interproscan.sh

After a short delay, you will see the following usage instructions:

Welcome to InterProScan-5.XX-XX.X
usage: java -XX:+UseParallelGC -XX:+AggressiveOpts -XX:+UseFastAccessorMethods -Xms128M -Xmx2048M -jar interproscan-5.jar

Please give us your feedback by sending an email to [email protected]

 -appl,--applications            Optional, comma separated list of analyses. If this
                                 option is not set, ALL analyses will be run.
 -b,--output-file-base           Optional, base output filename (relative or absolute
                                 path). Note that this option, the --output-dir (-d)
                                 option and the --outfile (-o) option are mutually
                                 exclusive. The appropriate file extension for the
                                 output format(s) will be appended automatically. By
                                 default the input file path/name will be used.
 -d,--output-dir                 Optional, output directory. Note that this option, the
                                 --outfile (-o) option and the --output-file-base (-b)
                                 option are mutually exclusive. The output filename(s)
                                 are the same as the input filename, with the
                                 appropriate file extension(s) for the output format(s)
                                 appended automatically.
 -dp,--disable-precalc           Optional. Disables use of the precalculated match
                                 lookup service. All match calculations will be run
                                 locally.
 -dra,--disable-residue-annot    Optional, excludes sites from the XML, JSON output
 -f,--formats                    Optional, case-insensitive, comma separated list of
                                 output formats. Supported formats are TSV, XML, JSON,
                                 GFF3, HTML and SVG. Default for protein sequences are
                                 TSV, XML and GFF3, or for nucleotide sequences GFF3
                                 and XML.
 -goterms,--goterms              Optional, switch on lookup of corresponding Gene
                                 Ontology annotation (IMPLIES -iprlookup option)
 -help,--help                    Optional, display help information
 -i,--input                      Optional, path to fasta file that should be loaded on
                                 Master startup. Alternatively, in CONVERT mode, the
                                 InterProScan 5 XML file to convert.
 -iprlookup,--iprlookup          Also include lookup of corresponding InterPro
                                 annotation in the TSV and GFF3 output formats.
 -ms,--minsize                   Optional, minimum nucleotide size of ORF to report.
                                 Will only be considered if n is specified as a
                                 sequence type. Please be aware of the fact that if you
                                 specify a too short value it might be that the
                                 analysis takes a very long time!
 -o,--outfile                    Optional explicit output file name (relative or
                                 absolute path). Note that this option, the
                                 --output-dir (-d) option and the --output-file-base
                                 (-b) option are mutually exclusive. If this option is
                                 given, you MUST specify a single output format using
                                 the -f option. The output file name will not be
                                 modified. Note that specifying an output file name
                                 using this option OVERWRITES ANY EXISTING FILE.
 -pa,--pathways                  Optional, switch on lookup of corresponding Pathway
                                 annotation (IMPLIES -iprlookup option)
 -t,--seqtype                    Optional, the type of the input sequences (dna/rna (n)
                                 or protein (p)). The default sequence type is protein.
 -T,--tempdir                    Optional, specify temporary file directory (relative
                                 or absolute path). The default location is temp/.
 -version,--version              Optional, display version number

Copyright (c) EMBL European Bioinformatics Institute, Hinxton, Cambridge, UK. (http://www.ebi.ac.uk) The InterProScan software itself is provided under the Apache License, Version 2.0 (http://www.apache.org/licenses/LICENSE-2.0.html). Third party components (e.g. member database binaries and models) are subject to separate licensing - please see the individual member database websites for details.

Available analyses:
TIGRFAM (XX.X)              : TIGRFAMs are protein families based on Hidden Markov Models or HMMs
SFLD (X.X)                  : SFLDs are protein families based on Hidden Markov Models or HMMs
Hamap (XXXXXX.XX)           : High-quality Automated and Manual Annotation of Microbial Proteomes
SMART (X.X)                 : SMART allows the identification and analysis of domain architectures based on Hidden Markov Models or HMMs
CDD (X.XX)                  : Prediction of CDD domains in Proteins
ProSiteProfiles (XX.XXX)    : PROSITE consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them
ProSitePatterns (XX.XXX)    : PROSITE consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them
SUPERFAMILY (X.XX)          : SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.
PRINTS (XX.X)               : A fingerprint is a group of conserved motifs used to characterise a protein family
PANTHER (X.X)               : The PANTHER (Protein ANalysis THrough Evolutionary Relationships) Classification System is a unique resource that classifies genes by their functions, using published scientific experimental evidence and evolutionary relationships to predict function even in the absence of direct experimental evidence.
Gene3D (X.X.X)              : Structural assignment for whole genes and genomes using the CATH domain structure database
PIRSF (X.XX)                : The PIRSF concept is being used as a guiding principle to provide comprehensive and non-overlapping clustering of UniProtKB sequences into a hierarchical order to reflect their evolutionary relationships.
Pfam (XX.X)                 : A large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs)
Coils (X.X)                 : Prediction of Coiled Coil Regions in Proteins
MobiDBLite (X.X)            : Prediction of disordered domains Regions in Proteins

Deactivated analyses:
SignalP_GRAM_POSITIVE (X.X) : Analysis SignalP_GRAM_POSITIVE-X.X is deactivated, because the following parameters are not set in the interproscan.properties file: binary.signalp.4.0.path
SignalP_EUK (X.X)           : Analysis SignalP_EUK-X.X is deactivated, because the following parameters are not set in the interproscan.properties file: binary.signalp.4.0.path
Phobius (X.XX)              : Analysis Phobius-X.XX is deactivated, because the following parameters are not set in the interproscan.properties file: binary.phobius.pl.path.1.01
TMHMM (X.Xc)                : Analysis TMHMM-X.Xc is deactivated, because the following parameters are not set in the interproscan.properties file: binary.tmhmm.path, tmhmm.model.path
SignalP_GRAM_NEGATIVE (X.X) : Analysis SignalP_GRAM_NEGATIVE-X.X is deactivated, because the following parameters are not set in the interproscan.properties file: binary.signalp.4.0.path

The latest analysis versions can be obtained by running the InterProScan script without any options specified.

5.1 InterProScan test run

This distribution of InterProScan provides a set of protein test sequences, which you can use to check how InterProScan behaves on your system. First, if you have not yet run the initialisation script, run the following command:

python3 initial_setup.py

This command will press and index the hmm models to prepare them into a format used by hmmscan. This command need only be run once. You can then run the following two test case commands:

./interproscan.sh -i test_all_appl.fasta -f tsv -dp
./interproscan.sh -i test_all_appl.fasta -f tsv

The first test should create an output file with the default file name test_all_appl.fasta.tsv, and the second would then create test_all_appl.fasta_1.tsv (since the default filename already exists). Both test commands should run successfully before you run InterProScan on your own input set of sequences.

What should you get? InterProScan should run through without any warnings and create a TSV output file containing several member database matches, including Gene3D, PIRSF etc. The member database binaries supplied with InterProScan should run on most Linux systems; if they don’t work on a particular system, see the FAQ entry “What should I do if one of the binaries included with InterProScan doesn’t work on my system?”.

5.2 Command-line options

5.2.1 -dp / --disable-precalc (optional)

InterProScan is a computationally expensive program, sometimes taking a couple of minutes to characterise a single sequence. It calculates matches to InterPro signatures based purely on the sequence that is submitted to it. Therefore, two identical amino acid sequences will produce identical outputs (although if the sequences differ by just one residue, the outputs may or may not be the same). We can take advantage of this feature, and increase the speed of InterProScan, by pre-calculating matches for sequences already found in UniProtKB.

When a sequence is submitted to it, InterProScan calculates an MD5 checksum for the amino acid sequence and then uses that checksum to query the pre-calculated match lookup service (see The InterProScan Lookup Match Service) to see whether the sequence has already been encountered. If it has, the pre-calculated results are returned to the user; if not, the InterProScan search algorithms are run against the sequence.

By default, InterProScan has this lookup turned on. If you wish to turn it off, add the --disable-precalc option to the command line. Users also have the option of using an EBI-hosted instance of the lookup service (this is what is enabled by default) or downloading a copy and running it locally. For more information, read the section on configuring the match lookup service below.
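The idea of using a checksum as a lookup key can be illustrated as follows. This is a sketch of the principle only, not InterProScan’s own code: seq_md5 is a hypothetical helper, and stripping whitespace and upper-casing the sequence before hashing are assumptions made for this example, since the normalisation InterProScan applies is not described here.

```shell
# Print the MD5 checksum of an amino acid sequence given as a string.
# Whitespace removal and upper-casing are assumptions of this sketch.
seq_md5() {
    printf '%s' "$1" | tr -d '[:space:]' | tr '[:lower:]' '[:upper:]' | md5sum | cut -d ' ' -f 1
}

# Identical sequences give the same key; a single-residue change gives a different one.
seq_md5 "MKVLAAGIVGLCAKNH"
seq_md5 "MKVLAAGIVGLCAKNH"
seq_md5 "MKVLAAGIVGLCAKNQ"
```

The first two calls print the same 32-character value, which is what allows a pre-calculated result to be found for a sequence that has been seen before; the third differs, so that sequence would be analysed locally.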

5.2.2 -appl / --applications application_name (optional)

By default, all available analyses are run. If you wish to restrict the run to a single analysis, use the -appl option. The argument to the -appl option should be one of the analyses named at the bottom of the usage instructions. Analysis names may or may not contain version numbers. For example:

./interproscan.sh -appl Pfam -i /path/to/sequences.fasta

If you wish to run two or more specific analyses, you can include multiple -appl arguments:

./interproscan.sh -appl Pfam-33.1 -appl PRINTS-42.0 -i /path/to/sequences.fasta

or you can use a single -appl option with a comma-separated list of analyses:

./interproscan.sh -appl CDD,COILS,Gene3D,HAMAP,MobiDBLite,PANTHER,Pfam,PIRSF,PRINTS,PROSITEPATTERNS,PROSITEPROFILES,SFLD,SMART,SUPERFAMILY,TIGRFAM -i /path/to/sequences.fasta

A list of all available analyses is in the section “Included analyses”.

5.2.3 -i / --input sequence_file

To analyse the contents of a fasta file, add one argument as in the following example:

./interproscan.sh -i /path/to/sequences.fasta

This will return results in the default formats as described above, i.e. TSV, XML and GFF3 files for protein sequences, or GFF3 and XML files for nucleotide sequences, with file names based upon the name of the fasta file (sequences.fasta.tsv, sequences.fasta.xml and sequences.fasta.gff3 in this case).

5.2.4 -iprlookup,--iprlookup

Option that provides mappings from matched member database signatures to the InterPro entries that they are integrated into. Starting from InterProScan 5.40-77.0, you don’t have to explicitly specify this option, as InterProScan will always provide mappings to InterPro entries.

5.2.5 -goterms,--goterms (optional)

Option that provides mappings to the Gene Ontology (GO). These mappings are based on the matched manually curated InterPro entries. (IMPLIES -iprlookup option)

5.2.6 -b / --output-file-base file_name (optional)

Optionally, you can supply a path and base name (excluding a file extension) for the results file as follows:

./interproscan.sh -i /path/to/sequences.fasta -b /path/to/output_file

The appropriate file extension will be added to each output file, depending upon the format(s) requested. (It is therefore recommended that you do not include a file extension yourself.) Note that using this option will not overwrite existing files. If a file with the required name exists at the path specified, the provided file name will have an underscore and a number (for example _1) inserted before the file extension.
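The naming rule described above can be sketched as a small shell function. This illustrates the described behaviour and is not InterProScan’s implementation: next_free_name is a hypothetical helper, and the assumption that the number simply counts up (_1, _2, ...) goes beyond what is stated here, which only shows the _1 case.

```shell
# Print the first name of the form base.ext, base_1.ext, base_2.ext, ... that does not exist yet.
next_free_name() {
    base=$1
    ext=$2
    candidate="$base.$ext"
    n=1
    while [ -e "$candidate" ]; do
        candidate="${base}_${n}.${ext}"
        n=$((n + 1))
    done
    echo "$candidate"
}

# Example: which TSV file name would be free for the base name "output_file"?
next_free_name output_file tsv
```

With no existing files this prints output_file.tsv; once that file exists, the next call prints output_file_1.tsv.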

5.2.7 -o / --outfile (optional)

This option can be given instead of the -b option. If you provide this argument, you must specify a single output format. The output file will be given the name specified by this option. Note that this option will overwrite existing files with the same path / name.

5.2.8 -pa / --pathways (optional)

Option that provides mappings from matches to pathway information, which is based on the matched manually curated InterPro entries. (IMPLIES -iprlookup option). The different pathways databases that InterProScan provides cross links to are:

• MetaCyc
• Reactome

5.2.9 -t / --seqtype (optional)

InterProScan supports analysis of both protein and nucleic acid sequences (DNA/RNA). Your input sequences are interpreted as protein sequences by default. If you would like to scan nucleotide sequences, you must set the -t option:

./interproscan.sh -t n -i /path/to/sequences.fasta

5.2.10 -T / --tempdir (optional)

Optionally, you can specify the location of the InterProScan temporary directory. This directory is used as a working directory. The default temporary directory will be in the same directory as the InterProScan script file (interproscan.sh). By default, this directory is completely cleaned up after InterProScan has finished all analyses successfully. Example usage:

./interproscan.sh -T /path/to/temp-directory -i /path/to/sequences.fasta

5.2.11 -dra / --disable-residue-annot (optional)

Optionally, you can prevent InterProScan from calculating the residue level annotations and displaying them in the output where available. If you don’t require this information, disabling the feature will improve performance and result in smaller output files.

5.2.12 -version / --version (optional)

Display the version number of the InterProScan software you are running.

5.3 Included analyses

This distribution of InterProScan includes:
• CDD
• COILS
• Gene3D
• HAMAP
• MOBIDB
• PANTHER
• Pfam
• PIRSF
• PRINTS
• PROSITE (Profiles and Patterns)
• SFLD
• SMART (unlicensed components only by default - this analysis has simplified post-processing that includes an E-value filter, however you should not expect it to give the same match output as the fully licensed version of SMART)
• SUPERFAMILY
• TIGRFAMs

A number of other analyses are available in InterProScan. These analyses use licensed code and data provided by third parties. If you wish to run these analyses it will be necessary for you to obtain a licence from the vendor and configure your local InterProScan installation to use them:
• Phobius (licensed software)
• SignalP
• SMART (licensed components)
• TMHMM

The InterPro team would like to thank the developers and maintainers of all of these analyses for their valued and on-going support.

5.4 Output format

Please see Output formats.

5.5 Optional configuration

5.5.1 Working directory for temporary files

Besides the -T option, there is a second way of changing the temporary/working directory (where FASTA files, binary output etc. are written). Edit the interproscan.properties file and change the path for the property:
temporary.file.directory=temp/[UNIQUE]

NOTE: Leave /[UNIQUE] on the end - this is replaced with a timestamped / unique directory for each run. This directory is cleaned up and deleted at the end of each run of InterProScan.

5.5.2 Configuring the Pre-calculated Match Lookup Service

As this is a web service, your servers will need to have external access to http://www.ebi.ac.uk to use it. If you are behind a firewall that prevents such access and you are unable to configure access, you can either turn off use of this service or download a copy and run a local match lookup service. To turn off use of the service, either use the -dp command line option, or edit interproscan.properties and comment out* or delete the following line, near the bottom of the file:
precalculated.match.lookup.service.url=http://www.ebi.ac.uk/interpro/match-lookup

* (To comment the line out, add a # to the start of the line.)

5.6 Running InterProScan on an LSF/SGE Cluster

Please see Cluster Mode.

CHAPTER 6

Input formats

6.1 Supported input file format

InterProScan 5 supports the FASTA file format. An example of a simple FASTA format file containing unaligned sequences:

> seq1 Description of seq1.
AGTACGTAGTAGCTGCTGCTACGTGCGCTAGCTAGTACGTCA
TAGTA
> seq2 Description of seq2.
CGATCGATCGTACGTCGACTGATCGTAGCTACGTCGTACGTAG
CATCGTCAGTTACTGC

6.2 Supported sequence format

InterProScan 5 supports unaligned sequences only. Sequences should contain only valid IUPAC amino acid or nucleic acid characters. In addition, gap (‘-’), period (‘.’), asterisk (‘*’) and underscore (‘_’) symbols are not allowed: they produce warnings and InterProScan will exit immediately. Example for supported protein sequence:

MPIGSKERPTFFEIFKTRCNKADLGPISLNWFEELSSEAPPYNSEPAEESEHKNNNYEPN

Example for supported nucleic acid sequence:

atgaaatataaacgcattgtgtttaaagtgggcaccagcagcctgaccaacg

Unsupported sequences:


-RFLLLSLARFSNNRFGVQLLQIANVNLKVRRYG (illegal gap character at the start)

RFLLLSL--ARFSNNRFGVQLLQIANVNLKVRRYG (illegal gap character in the middle)

RFLLLSLARFSNNRFGVQLLQIANVNLKVRRYG* (illegal asterisk character at the end)

RFLLLSL_ARFSNNRFGVQLLQIANVNLKVRRYG (illegal underscore character)

RFLLLSL.ARFSNNRFGVQLLQIANVNLKVRRYG (illegal period character)
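The rule above can be checked before submitting a file. The following is a minimal Python sketch, not part of InterProScan: the function name `find_forbidden_characters` is hypothetical, and it tests only for the four symbols named in this section (it does not validate the full IUPAC alphabet).

```python
# Symbols that this section lists as not allowed in input sequences.
FORBIDDEN = {"-": "gap", ".": "period", "*": "asterisk", "_": "underscore"}


def find_forbidden_characters(sequence):
    """Return (1-based position, character, name) for each forbidden symbol."""
    return [
        (position, char, FORBIDDEN[char])
        for position, char in enumerate(sequence, start=1)
        if char in FORBIDDEN
    ]


if __name__ == "__main__":
    examples = [
        "MPIGSKERPTFFEIFKTRCNKADLGPISLNWFEELSSEAPPYNSEPAEESEHKNNNYEPN",
        "RFLLLSL--ARFSNNRFGVQLLQIANVNLKVRRYG",
        "RFLLLSLARFSNNRFGVQLLQIANVNLKVRRYG*",
    ]
    for seq in examples:
        problems = find_forbidden_characters(seq)
        print(seq[:20], "OK" if not problems else problems)
```

Running such a check on each sequence of a FASTA file lets you correct the input before InterProScan rejects it.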

CHAPTER 7

Output formats

In this version of InterProScan, you can retrieve output in any of the following formats:
• TSV: A simple tab-delimited file format
• XML: The InterProScan XML format (XSD available here).
• JSON: Full output of results in JSON format
• GFF3: The GFF 3.0 format
• HTML (deprecated): An HTML representation of the protein matches
• SVG (deprecated): A Scalable Vector Graphics representation of the protein matches

InterProScan 5 can output results for protein and nucleotide sequences in all formats. Please note you can only trace protein match positions to the original nucleotide sequence with GFF3, XML and JSON outputs. You can override the default output formats using the -f option, e.g.:

./interproscan.sh -f XML -f JSON -i /path/to/sequences.fasta -b /path/to/output_file

or

./interproscan.sh -f XML, JSON -i /path/to/sequences.fasta -b /path/to/output_file

These two equivalent commands will output the results in XML and JSON format.

7.1 Tab-separated values format (TSV)

Basic tab delimited format. Outputs only those sequences with domain matches.


7.1.1 Example output

P51587	14086411a2cdf1c4cba63020e1622579	3418	Pfam	PF09103	BRCA2, oligonucleotide/oligosaccharide-binding, domain 1	2670	2799	7.9E-43	T	15-03-2013
P51587	14086411a2cdf1c4cba63020e1622579	3418	ProSiteProfiles	PS50138	BRCA2 repeat profile.	1002	1036	0.0	T	18-03-2013	IPR002093	BRCA2 repeat	GO:0005515|GO:0006302
P51587	14086411a2cdf1c4cba63020e1622579	3418	Gene3D	G3DSA:2.40.50.140		2966	3051	3.1E-52	T	15-03-2013
...

The TSV format presents the match data in columns as follows:
1. Protein accession (e.g. P51587)
2. Sequence MD5 digest (e.g. 14086411a2cdf1c4cba63020e1622579)
3. Sequence length (e.g. 3418)
4. Analysis (e.g. Pfam / PRINTS / Gene3D)
5. Signature accession (e.g. PF09103 / G3DSA:2.40.50.140)
6. Signature description (e.g. BRCA2 repeat profile)
7. Start location
8. Stop location
9. Score - the e-value (or score) of the match reported by the member database method (e.g. 3.1E-52)
10. Status - the status of the match (T: true)
11. Date - the date of the run
12. InterPro annotations - accession (e.g. IPR002093)
13. InterPro annotations - description (e.g. BRCA2 repeat)
14. GO annotations (e.g. GO:0005515) - optional column; only displayed if the --goterms option is switched on
15. Pathways annotations (e.g. REACT_71) - optional column; only displayed if the --pathways option is switched on

If a value is missing in a column, for example, the match has no InterPro annotation, a ‘-’ is displayed.
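The column layout described above maps directly onto a small reader. The following Python sketch is illustrative only: the field names in `COLUMNS` and the function `parse_tsv` are hypothetical labels chosen for this example, and rows that lack the trailing optional columns are padded with `None`.

```python
import csv
import io

# Hypothetical field names for the 15 columns described above.
COLUMNS = [
    "protein_accession", "sequence_md5", "sequence_length", "analysis",
    "signature_accession", "signature_description", "start", "stop",
    "score", "status", "date", "interpro_accession",
    "interpro_description", "go_annotations", "pathway_annotations",
]

SAMPLE_ROW = (
    "P51587\t14086411a2cdf1c4cba63020e1622579\t3418\tProSiteProfiles\t"
    "PS50138\tBRCA2 repeat profile.\t1002\t1036\t0.0\tT\t18-03-2013\t"
    "IPR002093\tBRCA2 repeat\tGO:0005515|GO:0006302\n"
)


def parse_tsv(handle):
    """Yield one dict per match line; '-' and empty fields become None."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if not fields:
            continue  # skip blank lines
        padded = fields + [None] * (len(COLUMNS) - len(fields))
        row = {
            name: (None if value in ("-", "", None) else value)
            for name, value in zip(COLUMNS, padded)
        }
        for key in ("sequence_length", "start", "stop"):
            row[key] = int(row[key])
        yield row


if __name__ == "__main__":
    for match in parse_tsv(io.StringIO(SAMPLE_ROW)):
        print(match["signature_accession"], match["start"], match["stop"])
```

The score column is left as a string because, as noted above, it may hold either an e-value or a score depending on the member database.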

7.2 Extensible Markup Language (XML)

XML representation of the matches - this is the richest form of the data. The XML Schema Definition (XSD) file links are below the example output.

7.2.1 Example output

˓→MPIGSKERPTFFEIFKTRCNKADLGPISLNWFEELSSEAPPYNSEPAEESEHKNNNYEPNLFKTPQRKPSYNQLASTPIIFKEQGLTLPLYQSPVKELDKFKLDLGRNVPNSRHKSLRTVKTKMDQADDVSCPLLNSCLSESPVVLQCTHVTPQRDKSVVCGSLFHTPKFVKGRQTPKHISESLGAEVDPDMSWSSSLATPPTLSSTVLIVRNEEASETVFPHDTTANVKSYFSNHDESLKKNDRFIASVTDSENTNQREAASHGFGKTSGNSFKVNSCKDHIGKSMPNVLEDEVYETVVDTSEEDSFSLCFSKCRTKNLQKVRTSKTRKKIFHEANADECEKSKNQVKEKYSFVSEVEPNDTDPLDSNVAHQKPFESGSDKISKEVVPSLACEWSQLTLSGLNGAQMEKIPLLHISSCDQNISEKDLLDTENKRKKDFLTSENSLPRISSLPKSEKPLNEETVVNKRDEEQHLESHTDCILAVKQAISGTSPVASSFQGIKKSIFRIRESPKETFNASFSGHMTDPNFKKETEASESGLEIHTVCSQKEDSLCPNLIDNGSWPATTTQNSVALKNAGLISTLKKKTNKFIYAIHDETSYKGKKIPKDQKSELINCSAQFEANAFEAPLTFANADSGLLHSSVKRSCSQNDSEEPTLSLTSSFGTILRKCSRNETCSNNTVISQDLDYKEAKCNKEKLQLFITPEADSLSCLQEGQCENDPKSKKVSDIKEEVLAAACHPVQHSKVEYSDTDFQSQKSLLYDHENASTLILTPTSKDVLSNLVMISRGKESYKMSDKLKGNNYESDVELTKNIPMEKNQDVCALNENYKNVELLPPEKYMRVASPSRKVQFNQNTNLRVIQKNQEETTSISKITVNPDSEELFSDNENNFVFQVANERNNLALGNTKELHETDLTCVNEPIFKNSTMVLYGDTGDKQATQVSIKKDLVYVLAEENKNSVKQHIKMTLGQDLKSDISLNIDKIPEKNNDYMNKWAGLLGPISNHSFGGSFRTASNKEIKLSEHNIKKSKMFFKDIEEQYPTSLACVEIVNTLALDNQKKLSKPQSINTVSAHLQSSVVVSDCKNSHITPQMLFSKQDFNSNHNLTPSQKAEITELSTILEESGSQFEFTQFRKPSYILQKSTFEVPENQMTILKTTSEECRDADLHVIMNAPSIGQVDSSKQFEGTVEIKRKFAGLLKNDCNKSASGYLTDENEVGFRGFYSAHGTKLNVSTEALQKAVKLFSDIENISEETSAEVHPISLSSSKCHDSVVSMFKIENHNDKTVSEKNNKCQLILQNNIEMTTGTFVEEITENYKRNTENEDNKYTAASRNSHNLEFDGSDSSKNDTVCIHKDETDLLFTDQHNICLKLSGQFMKEGNTQIKEDLSDLTFLEVAKAQEACHGNTSNKEQLTATKTEQNIKDFETSDTFFQTASGKNISVAKESFNKIVNFFDQKPEELHNFSLNSELHSDIRKNKMDILSYEETDIVKHKILKESVPVGTGNQLVTFQGQPERDEKIKEPTLLGFHTASGKKVKIAKESLDKVKNLFDEKEQGTSEITSFSHQWAKTLKYREACKDLELACETIEITAAPKCKEMQNSLNNDKNLVSIETVVPPKLLSDNLCRQTENLKTSKSIFLKVKVHENVEKETAKSPATCYTNQSPYSVIENSALAFYTSCSRKTSVSQTSLLEAKKWLREGIFDGQPERINTADYVGNYLYENNSNSTIAENDKNHLSEKQDTYLSNSSMSNSYSYHSDEVYNDSGYLSKNKLDSGIEPVLKNVEDQKNTSFSKVISNVKDANAYPQTVNEDICVEELVTSSSPCKNKNAAIKLSISNSNNFEVGPPAFRIASGKIVCVSHETIKKVKDIFTDSFSKVIKENNENKSKICQTKIMAGCYEALDDSEDILHNSLDNDECSTHSHKVFADIQSEEILQHNQNMSGLEKVSKISPCDVSLETSDICKCSIGKLHKSVSSANTCGIFSTASGKSVQVSDASLQNARQ
VFSEIEDSTKQVFSKVLFKSNEHSDQLTREENTAIRTPEHLISQKGFSYNVVNSSAFSGFSTASGKQVSILESSLHKVKGVLEEFDLIRTEHSLHYSPTSRQNVSKILPRVDKRNPEHCVNSEMEKTCSKEFKLSNNLNVEGGSSENNHSIKVSPYLSQFQQDKQQLVLGTKVSLVENIHVLGKEQASPKNVKMEIGKTETFSDVPVKTNIEVCSTYSKDSENYFETEAVEIAKAFMEDDELTDSKLPSHATHSLFTCPENEEMVLSNSRIGKRRGEPLILVGEPSIKRNLLNEFDRIIENQEKSLKASKSTPDGTIKDRRLFMHHVSLEPITCVPFRTTKERQEIQNPNFTAPGQEFLSKSHLYEHLTLEKSSSNLAVSGHPFYQVSATRNEKMRHLITTGRPTKVFVPPFKTKSHFHRVEQCVRNINLEENRQKQNIDGHGSDDSKNKINDNEIHQFNKNNSNQAAAVTFTKCEEEPLDLITSLQNARDIQDMRIKKKQRQRVFPQPGSLYLAKTSTLPRISLKAAVGGQVPSACSHKQLYTYGVSKHCIKINSKNAESFQFHTEDYFGKESLWTGKGIQLADGGWLIPSNDGKAGKEEFYRALCDTPGVDPKLISRIWVYNHYRWIIWKLAAMECAFPKEFANRCLSPERVLLQLKYRYDTEIDRSRRSAIKKIMERDDTAAKTLVLCVSDIISLSANISETSSNKTSSADTQKVAIIELTDGWYAVKAQLDPPLLAVLKNGRLTVGQKIILHGAELVGSPDACTPLEAPESLMLKISANSTRPARWYTKLGFFPDPRPFPLPLSSLFSDGGNVGCVDVIIQRAYPIQWMEKTSSGLYIFRNEREEEKEAAKYVEAQQKRLEALFTKIQEEFEEHEENTTKPYLPSRALTRQQVRALQDGAELYEAVKNAADPAYLEGYFSEEQLRALNNHRQMLNDKKQAQIQLEIRKAMESAEQKEQGLSRDVTTVWKLRIVSYSKKEKDSVILSIWRPSSDLYSLLTEGKRYRIYHLATSKSKSKSERANIQLAATKKTQYQQLPVSDEILFQIYQPREPLHFSKFLDPDFQPSCSEVDLIGFVVSVVKKTGLAPFVYLSDECYNLLAIKFWIDLNEDIIKPHMLIAASNLQWRPESKSGLLTLFAGDFSVFSASPKEGHFQETFNKMKNTVENIDILCNEAENKLMHILHANDPKWSTPTKDCTSGPYTAQIIPGTGNKLLMSSPNCEIYYQSPLSLCMAKRKSVSTPVSAQMTSKSCKGEKEIDDQKNCKKRRALDFLSRLPLPPPVSPICTFVSPAAQKAFQPPRSCGTKYETPIKKKELNSPQMTPFKKFNEISLLESNSIADEELALINTQALLSGSTGEKQFISVSESTRTAPTSSEDYLRLKRRCTTSLIKEQESSQASTEECEKNKQDTITTKKYI

...

hmm-start="1" evalue="9.6E-102" score="0.0" end="2667" start="2479"/> ... ...

numLocations="51"> ... ... ...

7.3 The XML Schema Definition

The XML Schema Definition (XSD) is available here. Listed below are the XSD files for the InterProScan 5 XML output format (with the InterProScan release versions they apply to noted in brackets afterwards).
• interproscan-model-4.5.xsd (as produced by InterProScan 5 from version 5.51-85.0 onwards)
• interproscan-model-3.0.xsd (as produced by InterProScan 5 from version 5.31-70.0 to 5.50-84.0)
• interproscan-model-2.2.xsd (as produced by InterProScan 5 from version 5.28-67.0 to 5.30-69.0)
• interproscan-model-2.1.xsd (as produced by InterProScan 5 from version 5.26-65.0 to 5.27-66.0)
• interproscan-model-2.0.xsd (as produced by InterProScan 5 from version 5.21-60.0 to 5.25-64.0)
• interproscan-model-1.4.xsd (as produced by InterProScan 5 in version 5.20-59.0 only)
• interproscan-model-1.3.xsd (as produced by InterProScan 5 in version 5.19-58.0 only)
• interproscan-model-1.2.xsd (as produced by InterProScan 5 from version 5.17-56.0 to 5.18-57.0)
• interproscan-model-1.1.xsd (as produced by InterProScan 5 from version RC7 to 5.16-55.0)
• interproscan-model-1.0.xsd (InterProScan 5 version RC1 to RC6)

7.4 JavaScript Object Notation (JSON)

JSON representation of the matches - an alternative to XML format. As new releases are made public, the changes to the expected JSON format are documented in Change log for InterProScan JSON output format.


7.4.1 Example output

{
  "interproscan-version": "5.26-65.0",
  "results": [{
    "sequence": "MSKIGKSIRLERIIDRKTRKTVIVPMDHGLTVGPIPGLIDLAAAVDKVAEGGANAVLGHMGLPLYGHRGYGKDVGLIIHLSASTSLGPDANHKVLVTRVEDAIRVGADGVSIHVNVGAEDEAEMLRDLGMVARRCDLWGMPLLAMMYPRGAKVRSEHSVEYVKHAARVGAELGVDIVKTNYTGSPETFREVVRGCPAPVVIAGGPKMDTEADLLQMVYDAMQAGAAGISIGRNIFQAENPTLLTRKLSKIVHEGYTPEEAARLKL",
    "md5": "88d47cc807fe8e977130b0cc93e0bd61",
    "matches": [{
      "signature": {
        "accession": "PIRSF038992",
        "name": "Aldolase_Ia",
        "description": null,
        "type": null,
        "signatureLibraryRelease": {
          "library": "PIRSF",
          "version": "3.01"
        },
        "models": {
          "PIRSF038992": {
            "accession": "PIRSF038992",
            "name": "Aldolase_Ia",
            "description": null,
            "key": "PIRSF038992"
          }
        },
        "entry": {
          "accession": "IPR002915",
          "name": "DeoC/FbaB/lacD_aldolase",
          "description": "DeoC/FbaB/ lacD aldolase",
          "type": "FAMILY",
          "goXRefs": [{
            "identifier": "GO:0016829",
            "name": "lyase activity",
            "databaseName": "GO",
            "category": "MOLECULAR_FUNCTION"
          }],
          "pathwayXRefs": [{
            "identifier": "R-HSA-71336",
            "name": "Pentose phosphate pathway (hexose monophosphate shunt)",
            "databaseName": "Reactome"
          }, {
            "identifier": "R-HSA-6798695",
            "name": "Neutrophil degranulation",
            "databaseName": "Reactome"
          }]
        }
      },
      "locations": [{
        "start": 1,
        "end": 265,
        "hmmStart": 2,
        "hmmEnd": 262,
        "hmmBounds": "INCOMPLETE",
        "evalue": 3.3E-94,
        "score": 302.6,
        "envelopeStart": 1,
        "envelopeEnd": 265
      }],
      "evalue": 3.0E-94,
      "score": 302.7
    }, {
      ...
    }]
  }
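As an illustration of how this structure can be consumed, the following Python sketch lists one record per match location. It relies only on keys visible in the example output above (`results`, `md5`, `matches`, `signature`, `entry`, `locations`, `start`, `end`); the function name `iter_match_locations` and the trimmed `SAMPLE` document are hypothetical, and the JSON layout may differ between releases.

```python
import json

# A trimmed, hypothetical document using only keys seen in the example above.
SAMPLE = """
{"results": [{"md5": "88d47cc807fe8e977130b0cc93e0bd61",
              "matches": [{"signature": {"accession": "PIRSF038992",
                                         "entry": {"accession": "IPR002915"}},
                           "locations": [{"start": 1, "end": 265}]}]}]}
"""


def iter_match_locations(document):
    """Yield one flat dict per match location in an InterProScan JSON document."""
    for result in document.get("results", []):
        for match in result.get("matches", []):
            signature = match.get("signature", {})
            entry = signature.get("entry") or {}  # entry may be null
            for location in match.get("locations", []):
                yield {
                    "md5": result.get("md5"),
                    "signature": signature.get("accession"),
                    "entry": entry.get("accession"),
                    "start": location.get("start"),
                    "end": location.get("end"),
                }


if __name__ == "__main__":
    for row in iter_match_locations(json.loads(SAMPLE)):
        print(row)
```

Using `.get()` throughout keeps the sketch tolerant of matches that have no InterPro entry.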

7.5 Generic Feature Format Version 3 (GFF3)

The GFF3 format is a flat tab-delimited file, which is much richer than the TSV output format. It allows you to trace back from matches to predicted proteins and to nucleic acid sequences. It also contains a FASTA format representation of the predicted protein sequences and their matches. Documentation of all the columns and attributes used is available at http://www.sequenceontology.org/gff3.shtml. Please note that in GFF3, sequence identifiers “... may contain any characters, but must escape any characters not in the set ...” (1) a-zA-Z0-9.:^*$@!+_?-|

1. http://www.sequenceontology.org/gff3.shtml

7.5.1 Example output

##gff-version 3
##feature-ontology http://song.cvs.sourceforge.net/viewvc/song/ontology/sofa.obo?revision=1.269
##interproscan-version 5.26-65.0
##sequence-region AACH01000027 1 1347
##seqid|source|type|start|end|score|strand|phase|attributes
AACH01000027	provided_by_user	nucleic_acid	1	1347	.	+	.	Name=AACH01000027;md5=b2a7416cb92565c004becb7510f46840;ID=AACH01000027
AACH01000027	getorf	ORF	1	1347	.	+	.	Name=AACH01000027.2_21;Target=pep_AACH01000027_1_1347 1 449;md5=b2a7416cb92565c004becb7510f46840;ID=orf_AACH01000027_1_1347
AACH01000027	getorf	polypeptide	1	449	.	+	.	md5=fd0743a673ac69fb6e5c67a48f264dd5;ID=pep_AACH01000027_1_1347
AACH01000027	Pfam	protein_match	84	314	1.2E-45	+	.	Name=PF00696;signature_desc=Amino acid kinase family;Target=null 84 314;status=T;ID=match$8_84_314;Ontology_term="GO:0008652";date=15-04-2013;Dbxref="InterPro:IPR001048","Reactome:REACT_13"
##sequence-region 2
...
>pep_AACH01000027_1_1347
LVLLAAFDCIDDTKLVKQIIISEIINSLPNIVNDKYGRKVLLYLLSPRDPAHTVREIIEV
LQKGDGNAHSKKDTEIRRREMKYKRIVFKVGTSSLTNEDGSLSRSKVKDITQQLAMLHEA
GHELILVSSGAIAAGFGALGFKKRPTKIADKQASAAVGQGLLLEEYTTNLLLRQIVSAQI
LLTQDDFVDKRRYKNAHQALSVLLNRGAIPIINENDSVVIDELKVGDNDTLSAQVAAMVQ
ADLLVFLTDVDGLYTGNPNSDPRAKRLERIETINREIIDMAGGAGSSNGTGGMLTKIKAA
TIATESGVPVYICSSLKSDSMIEAAEETEDGSYFVAQEKGLRTQKQWLAFYAQSQGSIWV
DKGAAEALSQYGKSLLLSGIVEAEGVFSYGDIVTVFDKESGKSLGKGRVQFGASALEDML
RSQKAKGVLIYRDDWISITPEIQLLFTEF
...
>match$8_84_314
KRIVFKVGTSSLTNEDGSLSRSKVKDITQQLAMLHEAGHELILVSSGAIAAGFGALGFKK
RPTKIADKQASAAVGQGLLLEEYTTNLLLRQIVSAQILLTQDDFVDKRRYKNAHQALSVL
LNRGAIPIINENDSVVIDELKVGDNDTLSAQVAAMVQADLLVFLTDVDGLYTGNPNSDPR
AKRLERIETINREIIDMAGGAGSSNGTGGMLTKIKAATIATESGVPVYICS
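To show how the nine columns and the attribute field fit together, here is a minimal Python sketch that parses a single feature line. The function name `parse_gff3_line` is hypothetical; the sketch assumes tab-separated columns and `key=value` attributes joined by semicolons, and it does not decode GFF3 escape sequences or strip quotes from values.

```python
GFF3_COLUMNS = [
    "seqid", "source", "type", "start", "end",
    "score", "strand", "phase", "attributes",
]

SAMPLE_LINE = (
    "AACH01000027\tPfam\tprotein_match\t84\t314\t1.2E-45\t+\t.\t"
    "Name=PF00696;signature_desc=Amino acid kinase family;status=T"
)


def parse_gff3_line(line):
    """Return a dict for a feature line, or None for directives, FASTA and blanks."""
    if not line.strip() or line.startswith("#") or line.startswith(">"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GFF3_COLUMNS):
        return None  # e.g. a sequence line in the FASTA section
    record = dict(zip(GFF3_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    attributes = {}
    for item in fields[8].split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attributes[key] = value
    record["attributes"] = attributes
    return record


if __name__ == "__main__":
    feature = parse_gff3_line(SAMPLE_LINE)
    print(feature["type"], feature["start"], feature["end"],
          feature["attributes"]["Name"])
```

Filtering on `type == "protein_match"` then gives the matches, while the `ORF` and `polypeptide` features carry the information needed to trace a match back to the nucleic acid sequence.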

7.6 SVG and HTML

Scalable Vector Graphics (SVG) and HyperText Markup Language (HTML)

These two graphical output formats are now deprecated! HTML is not available for InterProScan-5.48-83.0 and later versions. SVG will be removed in the second quarter of 2021.

We have always aimed to keep the protein sequence view in InterProScan’s HTML and SVG outputs almost exactly the same as the protein view on the website. But with the release of a new website with new features at the end of 2019, maintaining the HTML and SVG output formats in InterProScan became unsustainable. However, we have a new import feature on the website: after you import an InterProScan JSON output, you can get a graphical view of your results and export the results to PNG, PDF etc. We would appreciate any feedback on this feature; send us your comments using EMBL-EBI’s support form.

In versions before InterProScan-5.48-83.0, InterProScan outputs a single HTML/SVG file for each protein sequence analysed. The HTML/SVG file(s) are compressed into a single gzipped tar archive (or “tarball”) that includes the resources (images, JavaScript, style etc.) needed to render the pages/images in a browser or image viewer. (Note that from version 5RC4, the SVG format has no external dependencies.) The tarball will be named something similar to:
base_output_file_name.html.tar.gz
OR
base_output_file_name.svg.tar.gz

To access the HTML pages/SVG images, extract the tarball using a command like:
tar -xvzf base_output_file_name.html(svg).tar.gz

You can then open the unzipped HTML/SVG files in any browser or image viewer (for SVG).

7.6.1 Example output


Fig. 1: SVG example output

CHAPTER 8

Nucleic acid sequences scan

8.1 The Open Reading Frame prediction tool

InterProScan 5 takes advantage of the Open Reading Frame (ORF) prediction tool EMBOSS getorf. The getorf application itself and all of its dependencies are integrated in InterProScan. You do not need to install the EMBOSS package on your own, but you may use a local installation if you wish. If you want to use a local installation you must edit the interproscan.sh script. This script sets 2 environment variables for EMBOSS getorf. Set these to the correct paths for your installation of EMBOSS.

# set environment variables for getorf
export EMBOSS_ACDROOT=bin/nucleotide
export EMBOSS_DATA=bin/nucleotide

In addition, open and edit your properties file (interproscan.properties), which you will find in your InterProScan root directory. Search for the property ‘binary.getorf.path’ and change the path to your local getorf binary.
binary.getorf.path=/path/to/bin/nucleotide/getorf

8.2 How can I scan nucleic acid sequences in InterProScan 5?

./interproscan.sh -t n -i /path/to/nucleic_acid_sequences.fasta

or run the following commands:

# translate the nucleic_acid_sequences
./bin//translate -i /path/to/nucleic_acid_sequences.fasta -o /path/to/output_orfs_sequences.fasta
# if output_orfs_sequences.fasta has more than 32,000 sequences then chunk the file
# then send the chunks to InterProScan
# run InterProScan on the translated output
./interproscan.sh -i /path/to/output_orfs_sequences.fasta


8.3 Which output formats are supported?

Supported output formats are GFF3 and XML, which allow you to trace back from the match to the position inside your nucleic acid sequence. Other InterProScan 5 output formats like SVG, HTML and TSV are not available for nucleic acid sequences.

8.4 Redundant sequences and identifiers in your FASTA file

InterProScan 5 is able to handle FASTA file entries with the same sequence, but different identifiers. For instance you have the following 2 sequences in your input file:

>sequence_1
ABC
>sequence_2
ABC

InterProScan 5 will condense these into a single sequence with two identifier cross-references in the XML output file:

ABC ...

and in the GFF3 output:

##sequence-region sequence_1|sequence_2 1 3
sequence_1|sequence_2 provided_by_user nucleic_acid 1 3 ...

Entries with the same identifier and the same sequence will be merged into one. Please note: non-unique identifiers are not supported. InterProScan 5 will exit (with exit code 0) and will print out a list of all non-unique identifiers.
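Because InterProScan exits when it finds non-unique identifiers, it can be useful to check a FASTA file beforehand. The Python sketch below is not InterProScan's own check: the function names are hypothetical, and it assumes that the identifier is the first whitespace-delimited word of each header line and that an identifier is a problem only when it is attached to more than one distinct sequence.

```python
def read_fasta(lines):
    """Yield (identifier, sequence) pairs from an iterable of FASTA lines."""
    identifier, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if identifier is not None:
                yield identifier, "".join(chunks)
            header = line[1:].strip()
            # Assumption: the identifier is the first word of the header.
            identifier = header.split()[0] if header else ""
            chunks = []
        elif identifier is not None:
            chunks.append(line)
    if identifier is not None:
        yield identifier, "".join(chunks)


def conflicting_identifiers(lines):
    """Return identifiers that appear with more than one distinct sequence."""
    seen = {}
    for identifier, sequence in read_fasta(lines):
        seen.setdefault(identifier, set()).add(sequence.upper())
    return sorted(name for name, seqs in seen.items() if len(seqs) > 1)


if __name__ == "__main__":
    example = [">sequence_1", "ABC", ">sequence_2", "ABC", ">sequence_2", "ABD"]
    print(conflicting_identifiers(example))
```

To check a file, pass an open file handle, for example `conflicting_identifiers(open("/path/to/sequences.fasta"))`.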

8.5 Improving performance

InterProScan does not select a single best ORF from the getorf output. Instead it takes the ORFs generated, selects the N longest and submits them for analysis. N is set by the binary.getorf.parser.filtersize property described below; the default is 8. This means analysing nucleotide sequences can take much longer than analysing protein sequences. To improve InterProScan performance when running large nucleotide input files (> 10,000 sequences) you can:
1. First use an external program to translate your input. This is the best approach. There are various options, one of which is transeq (http://emboss.open-bio.org/rel/rel6/apps/transeq.html) from EMBOSS. If you use transeq then please use the -clean option to change STOP codon positions from ‘*’ to ‘X’, because InterProScan does not accept sequences with the ‘*’ character.
and/or
2. Chunk the input and then send the chunks to InterProScan.

For tips on configuring the general InterProScan CPU usage see also improving performance.
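Chunking the input, as suggested in step 2, can be done with any tool. The following Python sketch shows one possible approach; the function names are hypothetical and the chunk size is left to the caller (the figure of 32,000 sequences mentioned in this chapter could be used as an upper bound).

```python
def iter_fasta_records(lines):
    """Group FASTA lines into records (header line followed by sequence lines)."""
    record = []
    for line in lines:
        if line.startswith(">") and record:
            yield record
            record = []
        if line.strip():
            record.append(line.rstrip("\n"))
    if record:
        yield record


def chunk_records(records, chunk_size):
    """Yield lists of at most chunk_size records."""
    chunk = []
    for record in records:
        chunk.append(record)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


if __name__ == "__main__":
    lines = [">a\n", "ATGAAA\n", ">b\n", "ATGCCC\n", ">c\n", "ATGGGG\n"]
    for number, chunk in enumerate(chunk_records(iter_fasta_records(lines), 2), 1):
        print("chunk", number, [record[0] for record in chunk])
```

Each chunk can then be written to its own file (one line per list item) and passed to interproscan.sh with the -i option.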

8.5.1 Selecting the ORFs to analyse

For improved performance, InterProScan will select the longest 8 ORFs predicted for each nucleic acid sequence. This can be changed using the “binary.getorf.parser.filtersize” setting in the interproscan.properties file:
binary.getorf.parser.filtersize=8

CHAPTER 9

The InterProScan Lookup Match Service

The InterProScan match lookup service stores pre-calculated InterProScan results for the sequences in the InterPro database. When InterProScan is queried with a known sequence, it retrieves the result from the lookup service and reports the result immediately, thereby reducing compute requirements and improving performance. For sequences not in the lookup service, InterProScan will calculate the results from scratch using the various analyses requested by the user. The default interproscan.properties configuration will use the lookup service hosted at EBI (http://www.ebi.ac.uk/interpro/match-lookup/version). This will be the most recent lookup service version and is only compatible with the most recent InterProScan release:
precalculated.match.lookup.service.url=http://www.ebi.ac.uk/interpro/match-lookup

#proxy set up
precalculated.match.lookup.service.proxy.host=
precalculated.match.lookup.service.proxy.port=3128

The default lookup service will not be used in the following scenarios (therefore all calculations are performed locally):
• The version number of the service does not match your version of InterProScan.
• The service cannot be accessed (e.g., for firewall reasons or it is temporarily unavailable).
• You disable the lookup feature. To disable the service you could either:
  – Use the “-dp” command line option.
  – Set the “precalculated.match.lookup.service.url=” property in your interproscan.properties configuration file (to an empty value).

9.1 Installing the lookup service locally

You can choose to download and install the InterProScan lookup service locally if required. This offers you several advantages:

• Control over the version of the lookup service. If you choose to upgrade InterProScan less frequently than the release cycle, you can ensure that you are using a lookup service that is synchronized with the version of InterProScan that you are running.
• A dedicated service. You will not be competing with other users for access to the service.
• Control over the scale of the service. The service is extremely responsive (a few milliseconds per sequence request) and a single web server will cope with a high load. However, if you expect to put the service under a very high load, you may choose to run the service in parallel on multiple machines, potentially with load balancing.
• Run the service behind your firewall for maximum security.

9.2 System requirements

Because of the very large size of the Berkeley database used by the Lookup Service, you are recommended to observe the following minimum requirements:
• Recommended minimum 2 cores (processors).
• 4GB RAM (of which > 2GB will be consumed by the service when you run it).
• The lookup service version 5.37-76.0 onwards requires Java 11 (previous versions required Java 8).

9.3 Obtaining the lookup service

Version 5.52-86.0 of the lookup service is only compatible with version 5.52-86.0 of InterProScan. The instructions below are for installing the latest version; you can download previous versions of the lookup service from ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/lookup_service/. This service is a very large download! You are strongly recommended to check the md5 checksum (as described below) to ensure that the file has been downloaded correctly.

# Create and enter a suitable directory
mkdir i5_lookup_service
cd i5_lookup_service

# Download the tarball and the MD5 file.
wget ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/lookup_service/lookup_service_5.52-86.0.tar.gz
wget ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/lookup_service/lookup_service_5.52-86.0.tar.gz.md5

# Recommended checksum to confirm the download was successful:
md5sum -c lookup_service_5.52-86.0.tar.gz.md5
# Must return *lookup_service_5.52-86.0.tar.gz: OK*
# If not - try downloading the file again as it may be a corrupted copy.
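On systems without the md5sum command, the same check can be approximated in Python. This is a sketch only: the function names are hypothetical, and it assumes the .md5 file uses the usual md5sum layout, with the hexadecimal digest as the first whitespace-delimited token.

```python
import hashlib


def md5_of_file(path, block_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in blocks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def expected_md5(md5_file_text):
    """Extract the digest from md5sum-style text ('<digest>  <file name>')."""
    return md5_file_text.split()[0].lower()


def download_is_intact(archive_path, md5_path):
    """Return True if the archive's MD5 matches the digest in the .md5 file."""
    with open(md5_path) as handle:
        return md5_of_file(archive_path) == expected_md5(handle.read())


# Example usage (paths as downloaded above):
# download_is_intact("lookup_service_5.52-86.0.tar.gz",
#                    "lookup_service_5.52-86.0.tar.gz.md5")
```

Reading the archive in blocks keeps memory use low, which matters for a download of this size.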

(Direct link: ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/lookup_service/lookup_service_5.52-86.0.tar.gz)

Extract the tarball:
tar -pxvzf lookup_service_5.52-86.0.tar.gz

# where:
# p = preserve the file permissions
# x = extract files from an archive
# v = verbosely list the files processed
# z = filter the archive through gzip
# f = use archive file

The service can be run in one of two ways:

9.3.1 Run with graphical user interface (to set port number)

If you are running it on a machine with a desktop interface and just want to test the lookup service, a simple user interface is included to allow you to set the port number to run the service. Note that in the example below, the memory available to Java has been set to 8000MB (using the -Xmx8000m switch). This is recommended as a good starting value - you may choose to set this higher if the service will be used heavily. (We have tested it with -Xmx36000m without problems.)

cd lookup_service_5.52-86.0
java -Xmx8000m -jar server-5.52-86.0-jetty-console.war

A new window will open. Set the port number as required and click the “Start” button to start the web service running. The initialization of the web service usually takes a while, depending on the machine you are running it on. After successful initialization you will be forwarded to the ‘InterProScan 5 Pre-calculated Match Lookup Service’ landing page within your browser, and from then on the lookup service is ready to be used.

9.3.2 Run “Headless” (no graphical user interface)

It is most likely that you will want to run the lookup service “headless”, i.e. purely as a command line tool. In this case, the port number and other options can be passed in on the command line as shown below. Note that in the example below, the memory available to Java has been set to 8000MB (using the -Xmx8000m switch). This is recommended as a good starting value - you may choose to set this higher if the service will be used heavily. (We have tested it with -Xmx36000m.)

cd lookup_service_5.52-86.0
java -Xmx8000m -jar server-5.52-86.0-jetty-console.war [--option=value] [--option=value]

# Example command:
# java -Xmx8000m -jar server-5.52-86.0-jetty-console.war --headless --port 8080

Where options include:

Options:
--sslProxied          - Running behind an SSL proxy
--port n              - Create an HTTP listener on port n (default 8080)
--bindAddress addr    - Accept connections only on address addr (default: accept on any address)
--forwarded           - Set reverse proxy handling using X-Forwarded-For headers
--contextPath /path   - Set context path (default: /)
--headless            - Don't open graphical console, even if available
--help                - Print this help message
--tmpDir /path        - Temporary directory, default is /tmp

9.4 Waiting for the lookup service to start

The lookup service is very large and could take over an hour to start. Example output from a successful startup is given below:

$ java -Xmx8000m -jar server-5.52-86.0-jetty-console.war
10242 [Thread-2] INFO org.simplericity.jettyconsole.DefaultJettyManager - Added web application on path / from war /example/path/to/server-5.52-86.0-jetty-console.war
10243 [Thread-2] INFO org.simplericity.jettyconsole.DefaultJettyManager - Starting web application on port 8080
10245 [Thread-2] INFO org.eclipse.jetty.server.Server - jetty-8.1.12.v20130726
10818 [Thread-2] INFO org.eclipse.jetty.plus.webapp.PlusConfiguration - No Transaction manager found - if your webapp requires one, please configure one.
12226 [Thread-2] INFO org.eclipse.jetty.webapp.StandardDescriptorProcessor - NO JSP Support for /, did not find org.apache.jasper.servlet.JspServlet
12243 [Thread-2] INFO / - No Spring WebApplicationInitializer types detected on classpath
12344 [Thread-2] INFO / - Initializing Spring root WebApplicationContext
Initializing BerkeleyDB Match Database (creating indexes): Please wait...
Initializing BerkeleyDB MD5 Database (creating indexes): Please wait...
1049793 [Thread-2] INFO / - Initializing Spring FrameworkServlet 'mvc'
Initializing BerkeleyDB Match Database (creating indexes): Please wait...
Initializing BerkeleyDB MD5 Database (creating indexes): Please wait...
1050000 [Thread-2] INFO org.eclipse.jetty.server.AbstractConnector - Started @0.0.0.0:8080

Note that an “Address already in use” error indicates that the lookup service (or another existing service) appears to be already running on that machine and port. Either stop the existing service, or configure the lookup service to use a different port using the --port option. Once successfully started the service will wait, ready to receive any requests that are passed its way. It will continue listening for requests until the service is stopped. To confirm all is running correctly you can now test the service.

9.5 Testing the service

To test the service:

# Assuming the lookup service has been started on the same machine and you are using
# the default port of 8080 then...

# in a web browser:
http://localhost:8080/version
http://localhost:8080/matches?md5=2E38C8D754C63117A4FA5F5E44F2194E

# or using curl on the command line:
curl http://localhost:8080/version
curl http://localhost:8080/matches?md5=2E38C8D754C63117A4FA5F5E44F2194E

# To access your lookup service from another machine replace "localhost" with
# the fully qualified name of the machine where the lookup service is running.
# The Linux command "uname -n" can be used to find the machine name.
# Alternatively you could use the machine's IP address instead of the hostname.

This should return an XML file containing match data (you may need to “view source” on your web browser to see this properly).
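The same test can be scripted. The Python sketch below builds the two URLs shown above and fetches them with the standard library; the function names `lookup_urls` and `fetch` are hypothetical, and the network call is left commented out so that nothing is contacted unless you enable it.

```python
import urllib.request


def lookup_urls(base_url, md5):
    """Return the version URL and the match URL for one sequence MD5."""
    base = base_url.rstrip("/")
    return base + "/version", base + "/matches?md5=" + md5


def fetch(url, timeout=30):
    """Fetch a URL and return the response body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


version_url, matches_url = lookup_urls(
    "http://localhost:8080", "2E38C8D754C63117A4FA5F5E44F2194E"
)
print(version_url)
print(matches_url)

# With the service running, uncomment to query it:
# print(fetch(version_url))
# print(fetch(matches_url))
```

Replace `http://localhost:8080` with the host and port of your own service when testing from another machine.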

If you leave it running then the lookup service is now ready to receive any requests that may come its way.

9.6 Configure InterProScan 5 to use your local lookup service

To configure your local installation of InterProScan 5 to use your lookup service, edit the interproscan.properties file and set the property precalculated.match.lookup.service.url to point to your service. Replace host with the machine name and port with the port number your server is running on:
precalculated.match.lookup.service.url=http://host:port

# Note: You can check your lookup service URL is accessible using curl on
# the command line of the machine you will be running InterProScan from
# For example, "curl http://host:port/" should return the expected HTML source

For example, if you are running the server on a machine named lookuphost on port 8080, you should set the property as follows: precalculated.match.lookup.service.url=http://lookuphost:8080

Or if you are running the server on locally on port 8080, you should set the property as follows: precalculated.match.lookup.service.url=http://localhost:8080

You can also substitute the server name with an IP address if necessary. Please note that if you need to access the internet through a proxy server then you will also need to update the following properties: precalculated.match.lookup.service.proxy.host= precalculated.match.lookup.service.proxy.port=3128
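If you manage several installations, the property edit can be scripted. The following is a minimal sketch, not an InterProScan tool: set_property is a hypothetical helper that rewrites one key in the text of a properties file, leaving comments and other keys untouched, and appends the key if it is absent.

```python
def set_property(text: str, key: str, value: str) -> str:
    """Return properties text with `key` set to `value` (replace or append)."""
    lines = text.splitlines()
    replaced = False
    for index, line in enumerate(lines):
        stripped = line.strip()
        # Leave comments and lines without "=" untouched.
        if stripped.startswith("#") or "=" not in stripped:
            continue
        name = stripped.split("=", 1)[0].strip()
        if name == key:
            lines[index] = f"{key}={value}"
            replaced = True
    if not replaced:
        lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"
```

A typical use would be to read interproscan.properties, call set_property(text, "precalculated.match.lookup.service.url", "http://lookuphost:8080"), and write the result back after keeping a backup copy of the original file.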

CHAPTER 10

Running InterProScan 5 in Cluster Mode

In “cluster” mode, InterProScan 5 activates a master/worker parallelisation scheme that distributes the analysis components across your cluster, so that large jobs complete faster. The benefits of this mode are seen with larger inputs (approximately more than 32,000 protein sequences, depending on resources). For smaller inputs the default “standalone” mode (or “singleseq” mode for one sequence) is still preferable because of the overhead of initialising InterProScan in cluster mode. This documentation should be read in conjunction with the information on the page Running InterProScan 5.

Currently we support Load Sharing Facility (LSF) and Sun Grid Engine (SGE), now known as Oracle Grid Engine. InterProScan 5 has been tested on SGE 8.1.2 running 64-bit Linux. However, cluster mode is currently not as fault tolerant as the default “standalone” mode, so we recommend the more stable “standalone” mode. You can configure InterProScan 5 to run on other clusters by changing the submission commands below.

10.1 Initial Setup

Before running InterProScan 5 in cluster mode, the following configuration must be completed correctly for your cluster setup. Edit the interproscan.properties file. Add or modify the properties below appropriately for your cluster. Note - you must set the submission command including the ‘QUEUE_NAME’ correctly for your LSF, SGE or other cluster. If you are in any doubt about any of these settings, you should consult the systems administrator who maintains your cluster.

#Specify your cluster (LSF, SGE or any other cluster)
grid.name=lsf
#grid.name=other-cluster

#Java Virtual Machine (JVM) maximum idle time for jobs.
#Default is 180 seconds, if not specified. When reached the worker will shutdown.
jvm.maximum.idle.time.seconds=180

#JVM maximum life time for workers.
#Default is 14400 seconds, if not specified. After this period has passed the worker will shutdown unless it is busy.
jvm.maximum.life.seconds=14400

#Maximum number of jobs per clusterRunId. Default is 3000.
grid.jobs.limit=3000

#commands to start new jvms
worker.command=java -Xms256m -Xmx1024m -jar interproscan-5.jar
worker.high.memory.command=java -Xms256m -Xmx2048m -jar interproscan-5.jar

#directory for any log files generated by InterProScan
log.dir=logs

10.1.1 Cluster submission commands

On your cluster the following submission command properties should be configured. LSF example:

#Grid submission commands (e.g. LSF bsub or SGE qsub) for starting remote workers
#The following 2 commands are used by the master to spawn normal or high memory workers
grid.master.submit.command=bsub -q QUEUE_NAME
grid.master.submit.high.memory.command=bsub -q QUEUE_NAME -M 8192

#The following 2 commands are used by workers to spawn normal or high memory workers
grid.worker.submit.command=bsub -q QUEUE_NAME
grid.worker.submit.high.memory.command=bsub -q QUEUE_NAME -M 8192

#network growth
#if the main/master InterProScan job runs on a submission node and other nodes cannot submit jobs set max.tier.depth to 1 else it can be greater than 1
max.tier.depth=1

SGE equivalent:

grid.master.submit.command=qsub -cwd -V -b y -N i5t1worker
grid.master.submit.high.memory.command=qsub -cwd -V -b y -N i5t1hmworker
grid.worker.submit.command=qsub -cwd -V -b y -N i5t2worker
grid.worker.submit.high.memory.command=qsub -cwd -V -b y -N i5t2hmworker

We recommend reading the SGE manual (http://gridscheduler.sourceforge.net/htmlman/htmlman1/qsub.html) for the different qsub options. Note: the SGE cluster mode is a new feature that has not been tested extensively, and we would welcome any feedback you may have.

Other clusters

For other clusters, change the submission property grid.master.submit.command to suit your cluster requirements.
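Because a forgotten QUEUE_NAME placeholder is an easy mistake, a quick sanity check before the first cluster run may help. The sketch below is a hypothetical helper, not part of InterProScan. It parses properties text and reports submission commands that are unset or still contain the literal placeholder. The four property names are the ones shown in this section.

```python
SUBMIT_KEYS = (
    "grid.master.submit.command",
    "grid.master.submit.high.memory.command",
    "grid.worker.submit.command",
    "grid.worker.submit.high.memory.command",
)


def parse_properties(text: str) -> dict:
    """Parse simple key=value lines, ignoring blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def submission_problems(props: dict) -> list:
    """List human-readable problems with the grid submission commands."""
    problems = []
    for key in SUBMIT_KEYS:
        command = props.get(key, "")
        if not command:
            problems.append(f"{key} is not set")
        elif "QUEUE_NAME" in command:
            problems.append(f"{key} still contains the QUEUE_NAME placeholder")
    return problems
```

An empty list from submission_problems(parse_properties(text)) only means the obvious mistakes are absent. Whether the queue name and memory options are valid remains a question for your cluster administrator.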


10.1.2 Master configuration options

If you require that the master InterProScan should not run any analysis but only do housekeeping, change the following property to false (from version 5.1-44.0 onwards).

#allow master interproscan to run binaries
master.can.run.binaries=false

10.2 Example usage on an LSF, SGE and other clusters

To enable InterProScan 5 to “farm out” analysis components on LSF, it is necessary to run the interproscan.sh script with the -mode cluster switch. This turns on the ability for the “master” to create child “worker” processes on the cluster that are able to take analysis steps from the master and run them remotely. As an example:

./interproscan.sh -mode cluster -clusterrunid uniqueName -i /path/to/sequences.fasta -b /path/to/output_file

Please note, in cases where the main (master) InterProScan JVM dies unexpectedly you might still see workers running, but they will shut down as soon as they reach their maximum idle time.

10.3 clusterrunid

--clusterrunid (alias -crid) is a mandatory option that takes an argument. This can be used for monitoring your distributed jobs within a single run. On LSF clusters, the value for --clusterrunid is passed as the LSF project option -P. In cluster mode InterProScan 5 spawns new “worker” Java processes according to the volume of analysis that needs to be performed.
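When cluster runs are launched from a script, it can help to generate the run identifier automatically so that each run is distinguishable in the scheduler. The sketch below is illustrative only: cluster_command is a hypothetical helper, and the timestamp-based identifier is an assumption rather than an InterProScan convention. The flags are the ones used in the example above.

```python
import time
from typing import List, Optional


def cluster_command(
    fasta: str,
    output_base: str,
    run_id: Optional[str] = None,
    script: str = "./interproscan.sh",
) -> List[str]:
    """Build the argument list for a cluster-mode run."""
    if run_id is None:
        # Hypothetical naming scheme: one identifier per launch time.
        run_id = time.strftime("i5run-%Y%m%d-%H%M%S")
    return [
        script,
        "-mode", "cluster",
        "-clusterrunid", run_id,
        "-i", fasta,
        "-b", output_base,
    ]
```

The resulting list can be passed to subprocess.run(cmd, check=True) from the InterProScan installation directory.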

10.4 In house tested cluster versions

Platform LSF

Version    Result
8.0.1      Tested successfully
9.1.1.1    Tested successfully 1)

1) From this LSF version onwards you have to include the -n option in your bsub command if you want to set more than 1 CPU for workers (1 CPU is the default value in this version). We strongly recommend doing so, otherwise InterProScan will be much slower in CLUSTER mode. How many CPUs you need to reserve depends on your cluster nodes and your binary CPU settings. If you need help with that, please don’t hesitate to contact us using EMBL-EBI’s support form.

SGE

Version    Result
8.1.2      Tested successfully


10.5 Related issues

Known issues

CHAPTER 11

Running InterProScan 5 in CONVERT mode

InterProScan 5’s CONVERT mode allows you to reformat an existing InterProScan XML result file into any other possible output format (TSV, GFF3, SVG and HTML). For compatibility reasons you can also convert XML results into InterProScan 4.8 raw format (RAW). This will give our users enough time to migrate their pipeline to InterProScan 5. Please note it is NOT possible to reformat any non-XML format. XML is the richest data type and is therefore the only format which allows us to produce any other format of interest. For more information on InterProScan formats available see output formats. To enable InterProScan 5 to run in CONVERT mode you need to set the mode option to ‘CONVERT’.

11.1 Usage instructions

./interproscan.sh -mode convert

You will see the following usage instructions:

Welcome to InterProScan5RC7
usage: java -XX:+UseParallelGC -XX:+AggressiveOpts -XX:+UseFastAccessorMethods -Xms512M -Xmx2048M -jar interproscan-5.jar

Please give us your feedback by sending an email to [email protected]

-b,--output-file-base
    Optional, base output filename (relative or absolute path). Note that this option and the --outfile (-o) option are mutually exclusive. The appropriate file extension for the output format(s) will be appended automatically. By default the input file path/name will be used.
-d,--output-dir
    Optional, output directory. Note that this option and the --outfile (-o) option or the --output-file-base (-b) option are mutually exclusive. The appropriate file extension for the output format(s) will be appended automatically. By default the input file path/name will be used.
-f,--formats
    Optional, case-insensitive, comma separated list of output formats. Available formats are TSV, GFF3 (default set) and RAW (InterProScan4 TSV), HTML, SVG.
-i,--input
    Optional, path to fasta file that should be loaded on Master startup. Alternatively, in CONVERT mode, the InterProScan5 XML file to convert.
-o,--outfile
    Optional explicit output file name (relative or absolute path). Note that this option and the --output-file-base (-b) option are mutually exclusive. If this option is given, you MUST specify a single output format using the -f option. The output file name will not be modified. Note that specifying an output file name using this option OVERWRITES ANY EXISTING FILE.
-T,--tempdir
    Optional, specify temporary file directory (relative or absolute path). The default location is temp/.

Copyright (c) EMBL European Bioinformatics Institute, Hinxton, Cambridge, UK. (http://www.ebi.ac.uk)
The InterProScan software itself is provided under the Apache License, Version 2.0 (http://www.apache.org/licenses/LICENSE-2.0.html).
Third party components (e.g. member database binaries and models) are subject to separate licensing - please see the individual member database websites for details.

11.2 Example Usage

# Convert from XML format to all other available formats
./interproscan.sh -mode convert -f tsv,gff3,svg,raw -i /path/to/existing_output_file.xml -b /path/to/output_file_basename

# Convert from XML format to TSV format (which automatically includes all available InterPro entry/GO term/pathways information)
./interproscan.sh -i /path/to/existing_output_file.xml -mode convert -f tsv -o /path/to/new_output_file.tsv
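The option rules in the usage text (-o and -b are mutually exclusive, and -o requires exactly one format) are easy to violate when conversions are scripted. The following sketch encodes those rules before building the command. convert_command is a hypothetical helper. The format names are taken from the usage text above, and no check is made that the installed release accepts each of them.

```python
from typing import List, Optional, Sequence

# Format names as listed in the CONVERT mode usage text.
FORMATS = {"tsv", "gff3", "raw", "html", "svg"}


def convert_command(
    xml_path: str,
    formats: Sequence[str],
    outfile: Optional[str] = None,
    output_base: Optional[str] = None,
    script: str = "./interproscan.sh",
) -> List[str]:
    """Build a CONVERT mode command, enforcing the documented option rules."""
    chosen = [name.lower() for name in formats]
    unknown = sorted(set(chosen) - FORMATS)
    if unknown:
        raise ValueError(f"unknown format(s): {', '.join(unknown)}")
    if outfile and output_base:
        raise ValueError("-o and -b are mutually exclusive")
    if outfile and len(chosen) != 1:
        raise ValueError("-o requires exactly one output format")
    cmd = [script, "-mode", "convert", "-f", ",".join(chosen), "-i", xml_path]
    if outfile:
        cmd += ["-o", outfile]
    if output_base:
        cmd += ["-b", output_base]
    return cmd
```

Raising an error before launching avoids waiting for the JVM to start only to be shown the usage message.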

CHAPTER 12

Improving performance

If InterProScan is taking a long time to run, or you just want to improve on the run time you are getting, then consider some of the following:

12.1 Review your CPU (and memory) command options

By default InterProScan uses 8 CPU cores on your machine. Most of the time this configuration is sufficient. However, if you have more cores available and enough memory to support more threads, you can change the number of CPU cores used by adding the option below to the InterProScan command line, where N is the desired number of cores:

-cpu N

The value N for -cpu represents the maximum number of threads (embedded workers) InterProScan will start and run at a time. You have to remember, the more cores you specify, the more memory InterProScan will require to run successfully. Here are some observed numbers that may act as a guide, but you may have to experiment for your own data. The input sequences were taken from UniProt

Table 1: Run time statistics for selected input

-cpu   max memory used (GB)   input sequence count   input sequence size (MB)   run time
16     8                      8,000                  3                          2 hrs
16     12                     16,000                 6                          4 hrs
16     15                     160,000                56                         12 hrs

Let’s say you have a super machine with 32 cores available and you want to use all or most of the cores. It would be recommended to specify -cpu 30, as the main InterProScan process will always use 1 core.
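As a rough illustration of the rule of thumb above, the sketch below derives a -cpu value from the number of cores on the machine. suggest_cpu is a hypothetical helper, and the default of two reserved cores is an assumption chosen only so that the 32-core example yields 30. Adjust it to your own memory and workload.

```python
import os
from typing import Optional


def suggest_cpu(total_cores: Optional[int] = None, reserved: int = 2) -> int:
    """Suggest a value for -cpu, leaving `reserved` cores free."""
    if total_cores is None:
        # os.cpu_count() may return None on unusual platforms.
        total_cores = os.cpu_count() or 1
    return max(1, total_cores - reserved)
```

Remember that more worker threads also need more memory, so the table above remains the better guide when memory is the limiting factor.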


Each database analysis may also have options to specify how many threads to assign to it; for example, HMMER3 based analyses such as Gene3D have this option. However, we do not recommend changing the default CPU values for each analysis.

12.2 Consider chunking large input files

If your FASTA input file contains a large number of sequences, say over 160,000 protein sequences, then you may consider splitting your input into smaller chunks (this depends on resources, but batches of 80,000 protein sequences are a suggested starting point). You can then submit the smaller input files to InterProScan and process the results afterwards. For DNA/RNA sequences a much smaller number is suggested (e.g. 12,000 sequences). However, for improved performance you could translate these using an external tool and then submit the necessary protein sequences instead, see running nucleic acid sequences for more information.
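Chunking can be done with any FASTA-aware tool. As one possible approach, the sketch below splits a FASTA file into files of at most chunk_size records using only the Python standard library. The function names and the chunk_NNNN.fasta naming scheme are illustrative assumptions, not something InterProScan provides or requires.

```python
from typing import Iterable, Iterator, List


def read_fasta_records(lines: Iterable[str]) -> Iterator[str]:
    """Yield each FASTA record (header plus sequence lines) as one string."""
    record: List[str] = []
    for line in lines:
        if line.startswith(">") and record:
            yield "".join(record)
            record = []
        if line.strip():
            record.append(line if line.endswith("\n") else line + "\n")
    if record:
        yield "".join(record)


def chunk_records(records: Iterable[str], chunk_size: int) -> Iterator[List[str]]:
    """Group records into lists of at most chunk_size items."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    chunk: List[str] = []
    for record in records:
        chunk.append(record)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def split_fasta(path: str, chunk_size: int = 80000, prefix: str = "chunk") -> List[str]:
    """Write chunk files next to the working directory; return their names."""
    written = []
    with open(path) as handle:
        chunks = chunk_records(read_fasta_records(handle), chunk_size)
        for number, chunk in enumerate(chunks, start=1):
            out_path = f"{prefix}_{number:04d}.fasta"
            with open(out_path, "w") as out:
                out.writelines(chunk)
            written.append(out_path)
    return written
```

The file is streamed record by record, so memory use stays proportional to one chunk rather than to the whole input.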

12.3 Review your command line input options

Do you need all the output InterProScan supplies by default? See How to run InterProScan for more details. For example, you may consider the following:
• Which result data are you interested in? Do you require all applications (see the -appl option)?
• Do you require the residue level annotation? If not, this calculation can be disabled with the -dra option.
• Make use of the default lookup service, or your own local lookup service, to avoid the need for calculating known results again (on by default, read more).

12.3.1 Running InterProScan in CLUSTER mode

This mode is still experimental, so we do not recommend running it in production. It may be of interest if you want to analyse sequences on a cluster/farm and would like to set the number of reserved cores for each node. See Running InterProScan in CLUSTER mode.

12.4 Configure to analyse fewer ORFs (applies to nucleic acid sequences only)

For nucleic acid sequences, consider reducing the number of ORFs to analyse.

CHAPTER 13

Activating Phobius/SignalP/TMHMM analyses

By default the Phobius, SignalP and TMHMM member database analyses are deactivated because they contain licensed components. In order to activate these analyses please obtain the relevant license and files from the provider (ensuring the software version numbers are the same as those supported by your current InterProScan installation). An example of how to activate the Phobius 1.01, SignalP 4.1 and TMHMM 2.0 analyses with InterProScan 5.19-58.0 is given below. Files can be placed in any location as long as your interproscan.properties configuration is updated accordingly.

13.1 Phobius

Website: http://phobius.sbc.su.se/data.html

Files required by InterProScan:
• bin/phobius/1.01/decodeanhmm
• bin/phobius/1.01/phobius.model
• bin/phobius/1.01/phobius.options
• bin/phobius/1.01/phobius.pl

Example interproscan.properties configuration:

phobius.signature.library.release=1.01
binary.phobius.pl.path=bin/phobius/1.01/phobius.pl

13.2 SignalP

Website: http://www.cbs.dtu.dk/services/SignalP/

For academic users there is a download site at: http://www.cbs.dtu.dk/cgi-bin/nph-sw_request?signalp

Other users are requested to contact [email protected].

Files required by InterProScan:
• bin/signalp/4.1/signalp
• bin/signalp/4.1/bin/nnhowplayer.Linux_i386
• bin/signalp/4.1/bin/nnhowplayer.Linux_i486
• bin/signalp/4.1/bin/nnhowplayer.Linux_i586
• bin/signalp/4.1/bin/nnhowplayer.Linux_i686
• bin/signalp/4.1/bin/nnhowplayer.Linux_ia64
• bin/signalp/4.1/bin/nnhowplayer.Linux_x86_64

Example interproscan.properties configuration:

signalp_euk.signature.library.release=4.1
signalp_gram_positive.signature.library.release=4.1
signalp_gram_negative.signature.library.release=4.1
binary.signalp.path=bin/signalp/4.1/signalp
signalp.perl.library.dir=bin/signalp/4.1/lib

Please confirm that the following line in the “signalp” binary is set to the required location:

BEGIN { $ENV{SIGNALP} = 'bin/signalp/4.1'; }
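This check can be automated when several installations need to be verified. The sketch below is a hypothetical helper that extracts the value assigned to $ENV{SIGNALP} from the text of the signalp script. It assumes the assignment is written with a quoted string as in the line above, and it returns the first such assignment it finds, including one inside a comment.

```python
import re
from typing import Optional

# Matches: $ENV{SIGNALP} = 'some/path';  (single or double quotes)
SIGNALP_ENV = re.compile(r"""\$ENV\{SIGNALP\}\s*=\s*['"]([^'"]*)['"]""")


def signalp_home(script_text: str) -> Optional[str]:
    """Return the path assigned to $ENV{SIGNALP}, or None if not found."""
    match = SIGNALP_ENV.search(script_text)
    return match.group(1) if match else None
```

Comparing the returned value with the directory part of binary.signalp.path gives a quick indication of whether the two settings agree.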

13.3 TMHMM

Website: http://www.cbs.dtu.dk/services/TMHMM/

There is a download page http://www.cbs.dtu.dk/cgi-bin/nph-sw_request?tmhmm for academic users; other users are requested to contact CBS Software Package Manager at [email protected].

Files required by InterProScan:
• bin/tmhmm/2.0c/decodeanhmm
• data/tmhmm/2.0c/TMHMM2.0c.model

Example interproscan.properties configuration:

tmhmm.signature.library.release=2.0c
binary.tmhmm.path=bin/tmhmm/2.0c/decodeanhmm
tmhmm.model.path=data/tmhmm/2.0c/TMHMM2.0c.model

CHAPTER 14

Providing your feedback

14.1 Support requests

Support requests should be sent by using EBI Support & Feedback. We will endeavour to respond to support requests as quickly as possible.

14.2 General discussion and suggestions

Send your comments and suggestions about InterProScan 5 to EBI’s Support & Feedback as well.

CHAPTER 15

Known issues

15.1 Open issues in InterProScan

This page documents the latest list of known issues, and we are working to fix them as soon as possible. For assistance with other InterProScan problems, please contact us using EMBL EBI’s support form.

15.1.1 1. CDD/RPSBlast errors.

On some Linux systems, you may get rpsblast errors like:

bin//ncbi-blast-2.10.1+/rpsbproc: error while loading shared libraries: libgomp.so.1: cannot open shared object file: No such file or directory

The missing library is libgomp1. On Ubuntu you might install it as follows:

sudo apt-get install -y libgomp1

On other systems, use the equivalent installation command of your package manager.

15.1.2 2. Coils errors

If you see an error concerning Coils, for example the error below, it means the binary we provide is not compatible with your system.

Cannot run program ".../bin/ncoils/2.2.1/ncoils": error=2, No such file or directory

In this case, you may need to compile the Coils binary yourself, which is straightforward:

cd src/coils/ncoils/2.2.1
make
cd ../../../..
cp src/coils/ncoils/2.2.1/ncoils bin/ncoils/2.2.1/ncoils


These steps should update the Coils binary. If you encounter errors not listed above, please contact us using EMBL EBI’s support form.

15.1.3 Contacting us

Please give us enough background information when you contact us, such as:
• the Linux distribution and version
• the InterProScan version
• the Java version
• command line used
• the complete error log if possible
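Some of this information can be gathered automatically. The following sketch is a hypothetical helper (not shipped with InterProScan) that collects the platform string and the first line printed by java -version. The InterProScan version and command line are passed in by the caller, and platform.platform() is only an approximation of "distribution and version".

```python
import platform
import subprocess


def java_version() -> str:
    """First line printed by `java -version`, or a note if unavailable."""
    try:
        result = subprocess.run(
            ["java", "-version"],
            capture_output=True,
            text=True,
            check=False,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return "java not available"
    # `java -version` conventionally writes to stderr.
    output = (result.stderr or result.stdout).strip()
    return output.splitlines()[0] if output else "unknown"


def support_report(interproscan_version: str, command_line: str) -> dict:
    """Collect the background details requested for a support ticket."""
    return {
        "os": platform.platform(),
        "interproscan_version": interproscan_version,
        "java_version": java_version(),
        "command_line": command_line,
    }
```

The complete error log still needs to be attached separately, since it depends on where your run wrote its output.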

CHAPTER 16

FAQ

16.1 What should I do if one of the binaries included with InterProScan doesn’t work on my system?

Please see the section Compiling binaries for instructions on how to compile the various binaries on your own system.

16.2 Where can I find the XSD of the XML output?

The XML Schema Definition (XSD) is linked under the Extensible Markup Language (XML) section of the InterProScan OutputFormats page.

16.3 Can I use different binary versions than listed?

InterProScan 5 is designed to run with the same binaries used by the supported member database analysis versions. This ensures that the output results returned are as the member database intended. This is why for example you will find multiple versions of HMMER (e.g. for the SMART and Pfam analyses) bundled with InterProScan and referenced in the interproscan.properties configuration file. Swapping the binary versions is not recommended. InterProScan could fail (e.g. if the input/output of the binary has changed and is no longer recognised). Even if no errors are thrown, you would be running with an unexpected binary and we cannot guarantee the results would match what the analysis intended. If you are having problems running the provided versions of certain binaries on your system, please follow these instructions.


16.4 Which cluster does InterProScan support?

In theory InterProScan is flexible enough to run on any cluster platform, not only on LSF and SGE. However, LSF and SGE are the only platforms we can test here at the EBI. We have had feedback from users who run it successfully on a PBS cluster. For further information on how to configure your cluster version please follow the documentation.

16.5 Is there a Galaxy wrapper for InterProScan?

Do you want to add InterProScan 5 to your Galaxy analysis pipeline? You can find the wrapper for InterProScan 5 on GitHub. When using InterProScan 5 with Galaxy, the cluster integration is done via Galaxy, which means you cannot use InterProScan 5’s in-built CLUSTER mode.

16.5.1 Documentation and contact details

Galaxy Tool Shed link for InterProScan 5: http://toolshed.g2.bx.psu.edu/view/bgruening/interproscan5

Contact: Bjoern Gruening ([email protected])

16.5.2 Publication

Peter J.A. Cock, Björn A. Grüning, Konrad Paszkiewicz and Leighton Pritchard (2013). Galaxy tools and workflows for sequence analysis with applications in molecular plant pathology. PeerJ 1:e167 (http://dx.doi.org/10.7717/peerj.167)

16.6 I get Java errors on running InterProScan

If a simple test of InterProScan fails, please check that your installed version of Java is suitable; see installation requirements for more details. The latest version runs with Java 11.

16.7 How to analyse a huge number of protein sequences (>30000)?

The following guidance is good practice when you use InterProScan to annotate large sequence sets. To give an example of InterProScan’s run time: in-house we are able to annotate a complete Escherichia coli proteome (~3,000 protein sequences) on our farm (standalone mode) within ~1 hour. Other sequence sets of 16,000 protein sequences have taken ~5 hours on a machine with 8 cores and 8GB RAM.

If you want to annotate huge numbers of protein sequences, we strongly recommend splitting your input into chunks of, say, 80,000 sequences. If you are analysing nucleic acid sequences, the chunk size should be even smaller. You would then run an individual InterProScan job for each chunk file. That way you get intermediate results, and if InterProScan crashes halfway through you do not lose everything. See improving performance.
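Once the input has been split, each chunk needs its own InterProScan invocation. As a minimal sketch under stated assumptions, the hypothetical helper below builds one command per chunk file, using -i for the input and -b for an output base name derived from the chunk name. POSIX-style paths are assumed, and the commands are only constructed here, not executed.

```python
from pathlib import PurePosixPath
from typing import Iterable, List


def chunk_commands(
    chunk_files: Iterable[str],
    output_dir: str,
    script: str = "./interproscan.sh",
) -> List[List[str]]:
    """Build one InterProScan command (argument list) per chunk file."""
    commands = []
    for chunk in chunk_files:
        # Output base name: <output_dir>/<chunk file name without extension>
        base = PurePosixPath(output_dir) / PurePosixPath(chunk).stem
        commands.append([script, "-i", str(chunk), "-b", str(base)])
    return commands
```

Each argument list can be run with subprocess.run or handed to your scheduler. Because every chunk writes to its own output base, a failed chunk can be rerun without touching completed results.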


16.8 Should I filter by e-value?

The e-values are specific to each individual InterPro member database and therefore cannot be compared directly, or a single threshold applied to them all. This is because some member databases use the e-values for post-processing (e.g. SMART, Panther), others just output it as part of their results but actually use other measures for filtering of results (e.g. Pfam and the Hmmer GA cut-off). Therefore as far as InterProScan is concerned, if a match is in the output then it is a match!

16.9 Why do I see “Pre-calculated match lookup service failed - analysis proceeding to run locally”?

This is a warning to say that the match lookup service you are trying to use could not be used, therefore InterProScan will calculate the results locally on your system instead. In this situation InterProScan will continue to run, however this is likely to result in slower performance than normal. This warning could occur because the lookup service your installation of InterProScan is configured to use is either:
• Not (or no longer) compatible with your version of InterProScan.
• Not accessible through your internet, proxy or firewall system configuration.
• Temporarily down.

See more information about the lookup service to understand what it does and how to configure it.

16.10 How is InterProScan 5 different from InterProScan 4? How do I migrate?

InterProScan 4 is obsolete. If you are still using InterProScan 4, we recommend you send us a support request as soon as possible.

InterProScan 5 differs from InterProScan v4.x in the following ways:
• New analysis type: Phobius for transmembrane and signal peptide prediction
• New feature: ability to map InterPro results back to the original nucleotide sequences that were submitted
• New feature: option to look up biological pathways that the protein is potentially involved in
• New output formats: “IMPACT” XML format and GFF3.0
• Improved graphical (HTML and SVG) representations of the protein matches

InterProScan 4.8 is no longer supported or updated. For more details on how to migrate to InterProScan 5 send us a support request.

CHAPTER 17

Installing and compiling binaries used in InterProScan

The binaries that we distribute with InterProScan should work on most Linux systems. However, in some cases they may not work on a particular system. If you are trying to run InterProScan and you get an error, you may need to compile the binary causing the error on your own system in order for it to work. Once a binary has been compiled you can either:
• Replace the binary in the relevant bin subdirectory in your InterProScan installation with your newly compiled version
• Or update the location of the binary in your interproscan.properties configuration to point to your newly compiled version

InterProScan is designed to work with the same binary versions as used by the supported member database analyses. Therefore it is important to use the binary version numbers listed below, see the FAQ for more information.
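Before recompiling anything, it can be useful to see which configured binaries are actually missing or not executable. The sketch below is a hypothetical check, not an InterProScan feature. It assumes that binary locations are held in properties whose names start with "binary." and end with ".path" (as in the defaults listed in this chapter) and that relative paths are resolved against the installation directory.

```python
import os


def check_binaries(props: dict, install_dir: str) -> dict:
    """Map each problematic binary.*.path property to a short diagnosis."""
    problems = {}
    for key, value in props.items():
        if not (key.startswith("binary.") and key.endswith(".path")):
            continue
        path = value if os.path.isabs(value) else os.path.join(install_dir, value)
        if os.path.isdir(path):
            # Some properties point at a directory of binaries; skip those.
            continue
        if not os.path.isfile(path):
            problems[key] = "missing"
        elif not os.access(path, os.X_OK):
            problems[key] = "not executable"
    return problems
```

A binary that exists and is executable can still fail at run time, for example because of a missing shared library, so an empty result here does not rule out the need to recompile.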

17.1 cath-resolve-hits (used by CATH-Gene3D)

cath-resolve-hits is a tool written in C/C++ and is used as part of the postprocessing for CATH-Gene3D. The binary bundled in InterProScan should work on most systems. If you get errors, download the cath-resolve-hits v0.15.2 binary that corresponds to your system from https://github.com/UCLOrengoGroup/cath-tools/releases/tag/v0.15.2 into bin/gene3d/4.2.0/ and rename it to bin/gene3d/4.2.0/cath-resolve-hits.

If the precompiled binary doesn’t solve your problems, compile the binary for your system by following the instructions at http://cath-tools.readthedocs.io/en/latest/build/

Then either replace the relevant binary with your new one or update the relevant interproscan.properties values to point at the new file location. The default property values are:

cath.resolve.hits.path=bin/gene3d/4.2.0/cath-resolve-hits

17.2 Pfscan/Pfsearch (used by ProSite Profiles, ProSite Patterns and HAMAP)

Pfscan and pfsearch are written in fortran, so you may need to install gfortran. On Ubuntu this can easily be done by:


sudo apt-get install gfortran

Otherwise, source code and instructions for compiling gfortran can be found at: http://gcc.gnu.org/wiki/GFortranBinaries

Next, download the source code and compile:

wget ftp://ftp.lausanne.isb-sib.ch/pub/software/unix/pftools/pft2.3/pft2.3.5.d.tar.gz
tar -xzf pft2.3.5.d.tar.gz
cd pftools/
make io.o pfscan pfsearch

Then either replace the relevant files with your new ones or update the relevant interproscan.properties values to point at the new file locations. The default property values are:

binary..pfscan.path=bin/prosite/pfscan
binary.prosite.pfsearch.path=bin/prosite/pfsearch

17.3 Hmmer 2 (used by SMART)

wget ftp://selab.janelia.org/pub/software/hmmer/2.3.2/hmmer-2.3.2.tar.gz
tar -xzvf hmmer-2.3.2.tar.gz
cd hmmer-2.3.2
./configure --enable-threads
make
make check
make install

Then either replace the relevant binary with your new one or update the relevant interproscan.properties values to point at the new file location. The default property values are:

binary.hmmer2.hmmsearch.path=bin/hmmer/hmmer2/2.3.2/hmmsearch
binary.hmmer2.hmmpfam.path=bin/hmmer/hmmer2/2.3.2/hmmpfam

17.4 Hmmer 3 (used by CATH-Gene3D, HAMAP, PANTHER, Pfam, PIRSF, SFLD, SUPERFAMILY and TIGRFAMs)

Instructions for downloading and compiling Hmmer 3.1b1 can be found at: http://hmmer.org/download.html

Then either replace the relevant binary with your new one or update the relevant interproscan.properties values to point at the new file location. The default property values are:

binary.hmmer3.path=bin/hmmer/hmmer3/3.1b1
binary.hmmer3.hmmscan.path=bin/hmmer/hmmer3/3.1b1/hmmscan
binary.hmmer3.hmmsearch.path=bin/hmmer/hmmer3/3.1b1/hmmsearch

17.5 ncoils (used by Coils)

If you get Coils (ncoils) errors, you may need to compile the Coils binary yourself, which is straightforward:

cd src/coils/ncoils/2.2.1
make
cd ../../../..
cp src/coils/ncoils/2.2.1/ncoils bin/ncoils/2.2.1/ncoils

The steps above normally solve the problem. Instructions for compiling the “ncoils” binary can also be found in the src/coils/ncoils/2.2.1/README file in your extracted InterProScan 5 distribution (release 5.17-56.0 onwards).

Then either replace the relevant binary with your new one or update the relevant interproscan.properties values to point at the new file location. The default property values are:

binary.coils.path=bin/ncoils/2.2.1/ncoils

17.6 fingerPRINTScan (used by PRINTS)

Instructions for compiling the “fingerPRINTScan” binary can be found in the src/prints/fingerprintscan/3597/INSTALL file in your extracted InterProScan 5 distribution (release 5.17-56.0 onwards) and are summarised below:

cd src/prints/fingerprintscan/3597/
./configure
make
cd _interproscan_dir
cp src/prints/fingerprintscan/3597/fingerPRINTScan bin/prints/

where “_interproscan_dir” is the directory where you have installed InterProScan 5.

If you choose not to replace the relevant binary with your new one then instead you can update the relevant interproscan.properties values to point at the new file location. The default property values are:

binary.fingerprintscan.path=bin/prints/fingerPRINTScan

17.7 rpsblast/rpsbproc (used by CDD)

There are two separate applications from NCBI that CDD uses for analysis in InterProScan. If the applications rpsblast and rpsbproc provided in InterProScan are not working for you:
• download rpsblast/rpsbproc from NCBI (https://blast.ncbi.nlm.nih.gov/Blast.cgi)
  – rpsblast is part of the main blast package, so download https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.11.0+-x64-linux.tar.gz and look for rpsblast after uncompressing the tar file.
  – for rpsbproc, get it from ftp://ftp.ncbi.nih.gov/pub/mmdb/cdd/rpsbproc/
• if they don’t work, then you have to compile these binaries for your system.

We are working on a summary of how to compile rpsblast/rpsbproc for the latest Blast release, ncbi-blast-2.11.0+. For the older release ncbi-blast-2.6.0+, the instructions are given below. They could be adapted to work for ncbi-blast-2.11.0+.

Instructions on how to compile rpsblast/rpsbproc for InterProScan are summarised as follows. First check the c++ compiler version:

c++ --version

If the c++ version is less than 4.8, compilation will most likely fail and you should upgrade to a c++ compiler version 4.8 or above. If you have a c++ version 4.8 or above then follow the instructions below.

mkdir cddblast
cd cddblast
wget ftp://ftp.ncbi.nih.gov/blast/executables/blast+/2.6.0/ncbi-blast-2.6.0+-src.tar.gz
wget ftp://ftp.ncbi.nih.gov/blast/executables/blast+/2.6.0/ncbi-blast-2.6.0+-src.tar.gz.md5
md5sum -c ncbi-blast-2.6.0+-src.tar.gz.md5
# Above command should return "ncbi-blast-2.6.0+-src.tar.gz: OK" if download successful
tar xvzf ncbi-blast-2.6.0+-src.tar.gz
cd ncbi-blast-2.6.0+-src/c++/src/app/
wget -r --no-parent -l1 -np -nd -nH -P rpsbproc ftp://ftp.ncbi.nih.gov/pub/mmdb/cdd/rpsbproc/rpsbproc-src/
#edit Makefile.in and make sure SUB_PROJ is assigned two applications as follows: SUB_PROJ = blast rpsbproc
cd ../../
./configure
/usr/bin/make
#after compilation is complete
cp ReleaseMT/bin/rpsblast /bin/blast/ncbi-blast-2.6.0+/
cp ReleaseMT/bin/rpsbproc /bin/blast/ncbi-blast-2.6.0+/

The complete instruction set can be found here: ftp://ftp.ncbi.nih.gov/pub/mmdb/cdd/rpsbproc/README

If you choose not to replace the relevant binary with your new one, you can instead update the relevant interproscan.properties values to point at the new file location. The default property values are:

    binary.rpsblast.path=bin/blast/ncbi-blast-2.6.0+/rpsblast
    binary.rpsbproc.path=bin/blast/ncbi-blast-2.6.0+/rpsbproc
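If you go the properties route, the edit can be made by hand or with a small helper. The sketch below is one possible way to do it, not an InterProScan tool: `set_property` is a hypothetical function, and the path in the usage comment is a placeholder for wherever your compiled binaries live.

```shell
# set_property FILE KEY VALUE
# Replace the "KEY=..." line in FILE, or append it when the key is absent.
# Hypothetical helper for illustration; it is not shipped with InterProScan.
set_property() {
    local file="$1" key="$2" value="$3" tmp status
    tmp="$(mktemp)" || return 1
    # Key and value are passed through the environment so awk compares and
    # prints them literally (dots in the key are not regex wildcards).
    KEY="$key" VALUE="$value" awk -F= '
        $1 == ENVIRON["KEY"] { print ENVIRON["KEY"] "=" ENVIRON["VALUE"]; seen = 1; next }
        { print }
        END { if (!seen) print ENVIRON["KEY"] "=" ENVIRON["VALUE"] }
    ' "$file" > "$tmp" && cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    return "$status"
}

# Example, run from the InterProScan directory (the target path is a placeholder):
# set_property interproscan.properties binary.rpsblast.path /path/to/new/rpsblast
```

Writing the result back with `cat` rather than `mv` keeps the original file's permissions and ownership.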

17.8 sfld_preprocess/sfld_postprocess (used by SFLD)

Instructions for compiling the “sfld_preprocess” and “sfld_postprocess” binaries can be found in the src/sfld/1/README file in your extracted InterProScan 5 distribution (release 5.22-61.0 onwards). Then either replace the relevant binary with your new one or update the relevant interproscan.properties values to point at the new file location. The default property values are:

    sfld.postprocess.command=bin/sfld/sfld_postprocess

17.9 Phobius, TMHMM or SignalP

By default, the Phobius, SignalP and TMHMM member database analyses are deactivated because they contain licensed components. For instructions on how to activate these analyses, obtain the relevant licenses and compile the binaries, please see “activating licensed analyses”.

CHAPTER 18

Configuration Options

This page gives an overview and a detailed description of some of the configuration options available in your InterProScan 5 properties file (interproscan.properties).

| Option | Description | Default setting |
|:----------|:----------|:----------|
| Precalculated match lookup and proxy setup | | |
| precalculated.match.lookup.service.proxy.host | Host name of your proxy (e.g. http://proxy.example.ebi.ac.uk). You need to set this option if the pre-calculated match lookup service is enabled and you have a proxy (communication layer) between you and the World Wide Web. Please note that user proxy authentication is not supported at the moment. | Not set |
| precalculated.match.lookup.service.proxy.port | Open port of your proxy (e.g. 8080) | Not set |
| precalculated.match.lookup.service.url | Web address of the precalculated match lookup service. Used if the pre-calculated match lookup service is enabled. You would only want to change this if you have installed a local version of the lookup service. | http://www.ebi.ac.uk/interpro/match-lookup |
| Other properties | | |
| exclude.sites.from.output | Calculate residue level annotation and include in the output where available? | false |

CHAPTER 19

Cluster mode benchmark run

We ran InterProScan 5 (I5) in CLUSTER mode against a complete Escherichia coli proteome to give you some benchmark figures for analysis runtime. This page can serve as a reference point for runtime, and it also shows how to set up I5 for improved speed.

19.1 Benchmark run setup

19.1.1 Which version of InterProScan 5 (I5) was used for this run?

5.7-48

19.1.2 How was the set of input sequences assembled for this run?

For this run we decided to run I5 against the complete proteome of Escherichia coli (Taxon 83333). We downloaded the proteome from the Reference proteomes website (RELEASE 2014_04), which is based on UniProt Release 2014_04, Ensembl release 75 and Ensembl Genomes release 21. The E. coli proteome for this release contains 4303 protein sequences. You can download the sequence file here.

19.1.3 Which I5 command was used for this run?

We switched off the pre-calculated match lookup service and turned on the CLUSTER mode.

    ./interproscan.sh -i 83333.fasta -dp -f tsv,html --goterms \
        -mode cluster -clusterrunid benchmark-5.7-48.0

19.1.4 What does the interproscan.properties file look like?

The default settings are very conservative. To speed up the CLUSTER mode we’ve added or changed the values of the following 6 attributes within the DEFAULT interproscan.properties file:

    grid.throttle=false
    master.steps.to.consumer.ratio=1
    steps.to.consumer.ratio=1
    max.tier.depth=2
    thinmaster.number.of.embedded.workers=5
    thinmaster.maxnumber.of.embedded.workers=5
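Before starting a long cluster run, it can be worth confirming that the properties file really contains these six values. The following sketch shows one way to check; `check_benchmark_settings` is a hypothetical helper written for this page, and it only looks at plain `key=value` lines.

```shell
# check_benchmark_settings FILE
# Compare FILE against the six benchmark values listed above and print every
# key whose value differs. Returns non-zero if any value differs.
# Hypothetical helper for illustration; it is not shipped with InterProScan.
check_benchmark_settings() {
    local file="$1" status=0 key want got
    while IFS='=' read -r key want; do
        # Value of the last "key=value" line whose key matches exactly.
        got="$(KEY="$key" awk -F= '
            $1 == ENVIRON["KEY"] { v = substr($0, length($1) + 2) }
            END { print v }' "$file")"
        if [ "$got" != "$want" ]; then
            echo "$key: expected $want, found ${got:-<unset>}"
            status=1
        fi
    done <<EOF
grid.throttle=false
master.steps.to.consumer.ratio=1
steps.to.consumer.ratio=1
max.tier.depth=2
thinmaster.number.of.embedded.workers=5
thinmaster.maxnumber.of.embedded.workers=5
EOF
    return "$status"
}

# Example: check_benchmark_settings interproscan.properties
```

The helper prints nothing and returns zero when all six values match, so it can be used as a guard in a submission script.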

The full settings file can be found here.

19.1.5 On which cluster/farm did we run I5?

We ran I5 on our internal LSF cluster. As of 1st of May 2014, it had approximately 680 nodes comprising 16,000 hyper-threaded CPU cores.

19.2 Benchmark run outcome

| | Run 1 | Run 2 | Run 3 |
|:----------|:----------|:----------|:----------|
| Wall clock time | 1h 37min | N/A | N/A |
| Max number of workers | 45 | N/A | N/A |
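As a rough reading of Run 1, the wall clock time can be turned into an average throughput. This is an illustration derived from the figures on this page (4303 input sequences, 1h 37min = 97 min = 5820 s), not an additional measurement:

```latex
\frac{4303\ \text{sequences}}{97\ \text{min}} \approx 44.4\ \text{sequences/min},
\qquad
\frac{5820\ \text{s}}{4303\ \text{sequences}} \approx 1.35\ \text{s/sequence}
```

Because up to 45 workers ran concurrently, these are wall-clock averages for the whole cluster run, not per-worker figures.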

CHAPTER 20

Change log for InterProScan JSON output format

20.1 InterProScan 5.31-70.0

In InterProScan 5.30-69.0 a MobiDB Lite match would consist of the following output:

    {
      "signature": {
        "accession": "-lite",
        "name": "disorder_prediction",
        "description": "consensus disorder prediction",
        "type": null,
        "signatureLibraryRelease": {
          "library": "MOBIDB_LITE",
          "version": "1.5"
        },
        "entry": null
      },
      "locations": [
        {
          "start": 1508,
          "end": 1530
        }
      ],
      "model-ac": "mobidb-lite"
    }

Changes in InterProScan 5.31-70.0 have the following impact on the JSON output:

1. A ‘location’ will always consist of one or more ‘location-fragments’ [applies to all analyses]. Most locations will only have one fragment; however, multiple fragments are possible in Pfam, CATH-Gene3D and SUPERFAMILY analyses where discontinuous domains are present. A ‘location-fragment’ will contain a start and end position, and a ‘dc-status’ to indicate whether it is:
   • CONTINUOUS (a continuous single chain domain)
   • N_TERMINAL_DISC (N-terminal discontinuous)
   • C_TERMINAL_DISC (C-terminal discontinuous)
   • NC_TERMINAL_DISC (N and C-terminal discontinuous)
2. A ‘location’ may have an optional ‘sequence-feature’ [only applies to MobiDB Lite locations].
3. The ‘type’ was removed from the ‘signature’ as these were never populated [applies to all analyses].
4. For a HMMER3 based ‘location’, a ‘postProcessed’ boolean attribute now indicates whether the native HMMER3 output was subject to analysis specific post-processing [applies to HMMER3 based analyses only].

Example new output:

    {
      "signature": {
        "accession": "mobidb-lite",
        "name": "disorder_prediction",
        "description": "consensus disorder prediction",
        "signatureLibraryRelease": {
          "library": "MOBIDB_LITE",
          "version": "2.0"
        },
        "entry": null
      },
      "locations": [
        {
          "start": 1508,
          "end": 1530,
          "sequence-feature": "Polyampholyte",
          "location-fragments": [
            {
              "start": 1508,
              "end": 1530,
              "dc-status": "CONTINUOUS"
            }
          ]
        }
      ],
      "model-ac": "mobidb-lite"
    }
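Scripts that consumed the old format and read ‘start’ and ‘end’ directly from a location may want to iterate over the fragments instead. The sketch below shows one way to list them for a single match object shaped like the example above. It assumes the `jq` command-line tool is installed (it is not an InterProScan dependency), and `list_fragments` is a name made up for this illustration.

```shell
# list_fragments FILE
# Print one tab-separated line (start, end, dc-status) for every location
# fragment of a single match object shaped like the example above.
# Assumes jq is installed; the function name is illustrative only.
list_fragments() {
    jq -r '.locations[]
           | .["location-fragments"][]
           | [.start, .end, .["dc-status"]]
           | @tsv' "$1"
}

# Example: list_fragments match.json
```

A full InterProScan JSON result nests matches inside further objects, so the `jq` path would need to be extended to reach them.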

CHAPTER 21

Contact us

21.1 Helpdesk

For further assistance with installing and using InterProScan please contact us.

21.2 Subscribe to the mailing list

To get the latest InterProScan news, such as announcements of new releases, please subscribe to the following mailing list: interproscan-announce mailing list

21.3 Follow us on Twitter

The InterPro team have a Twitter account. We use it to announce InterProScan releases, data updates and other items that we think would be of interest to InterPro’s users. Follow us on Twitter: [@InterProDB](https://twitter.com/InterProDB)

CHAPTER 22

Indices and tables

• genindex
• modindex
• search
