Sequedex Documentation Release 1.0-Rc1
Total Page:16
File Type:pdf, Size:1020Kb
Sequedex Documentation Release 1.0-rc1 Joel Berendzen, Judith Cohn, Nicolas Hengartner, Mira Dimitrijevic, Benjamin McMahon January 06, 2016 Contents 1 Copyright notice 1 2 Introduction to Sequedex 3 2.1 What is Sequedex?............................................3 2.2 What does Sequedex do?.........................................3 2.3 How is Sequedex different from other sequence analysis packages?..................4 2.4 Who uses Sequedex?...........................................4 2.5 How is Sequedex used with other software?...............................5 2.6 How does Sequedex work?........................................5 2.7 Sequedex’s outputs............................................8 3 Installation instructions 15 3.1 System requirements........................................... 16 3.2 Downloading and unpacking for Mac.................................. 16 3.3 Downloading and unpacking for Linux................................. 17 3.4 Downloading and unpacking for Windows 7 or 8............................ 17 3.5 Using Sequedex with Cygwin installed under Windows......................... 19 3.6 Installation and updates without network access............................. 20 3.7 Testing your installation......................................... 20 3.8 Running Sequedex on an example data file............................... 20 3.9 Obtaining a node-locked license file................................... 21 3.10 Installing new data modules and upgrading Sequedex - User-installs.................. 21 3.11 Installing new data modules and upgrading Sequedex - System-installs................ 21 4 Getting started 23 4.1 Running Sequescan........................................... 23 4.2 Output files - stats............................................ 25 4.3 Graphical output: Sequinator...................................... 26 4.4 Output files - Who?........................................... 27 4.5 Output files - What?........................................... 28 4.6 Output files - Who does what?...................................... 29 4.7 Obtaining annotated reads: Analysis of E. coli proteome........................ 30 4.8 Analysis of multiple files in parallel with the GUI............................ 31 4.9 The virus1252 data module....................................... 32 4.10 Alternative data modules and memory usage.............................. 32 4.11 Running Sequedex from the command line............................... 33 4.12 The Sequedex configuration utility and sequescan.conf......................... 34 4.13 Directory structures for data, output files, and analysis......................... 35 i 5 Initial analysis with Sequedex: Phylogenetic and functional profiles 37 5.1 Acquiring data files............................................ 38 5.2 Phylogenetic rollups........................................... 39 5.3 Functional rollups............................................ 44 5.4 Normalized functional profiles...................................... 47 6 Using Sequestat 53 6.1 Visualizing phylogenetic data with Sequestat.............................. 53 6.2 Reading Life2550 output files into Sequestat:.............................. 56 6.3 Reading virus1252 output files into Sequestat:............................. 57 6.4 Decomposing mixture into components with Sequestat......................... 58 7 Functional analysis 63 7.1 Visualizing SEED functional profiles with Sequestat.......................... 63 7.2 Top 100 functional categories...................................... 66 7.3 Identifying enriched functions...................................... 73 7.4 Visualizing distribution of functions................................... 74 7.5 Profiling Chlorella kinases........................................ 74 8 Comparing Samples 75 8.1 Scalar product of profiles......................................... 77 8.2 Principal component analysis...................................... 77 8.3 Ploting significant differences between two profiles........................... 78 8.4 ANalysis Of VAriance.......................................... 78 8.5 Clustering multiple profiles with R................................... 79 8.6 Visualizing the clusters with R...................................... 81 8.7 Niche determinants, gulf oil....................................... 82 9 Annotated reads 85 9.1 Obtaining annotated reads........................................ 85 9.2 Phylogenetic filtering and assemblies of genomes............................ 86 9.3 Functional filtering............................................ 86 9.4 Translated reads............................................. 86 9.5 Gene-specific BLAST databases..................................... 86 9.6 Placing reads into a multiple sequence alignment: conserved part of RNAP.............. 86 9.7 Obtaining specific genes......................................... 86 9.8 Targeted assemblies of genes....................................... 86 9.9 Nucleotide alignments.......................................... 86 9.10 Obtaining ribosomal (and tRNA) reads................................. 86 9.11 Aligning to a reference genome..................................... 86 10 Reference 87 10.1 Command line options for sequescan.................................. 87 10.2 Environmental variables......................................... 88 10.3 Configuration options.......................................... 89 11 Additional software tools and resources for use in conjunction with Sequedex 91 11.1 Graphing and sorting with Excel..................................... 91 11.2 How to install Cygwin and what to include................................ 92 11.3 The command line with standard tools (Linux, Mac, Cygwin for Windows).............. 92 11.4 Exploring trees with Archaeoptrix, FigTree, or NJPlot......................... 93 11.5 Comparing and annotating phylogenetic and functional profiles with Gnuplot............. 93 11.6 Statistical analysis and graphing with R and R Studio.......................... 94 11.7 Obtaining reference data from NCBI.................................. 95 11.8 Creating synthetic data.......................................... 95 ii 11.9 Comparing functional profiles to KEGG................................. 95 11.10 The Ribosomal Database Project..................................... 96 11.11 Bergey’s manual............................................. 96 11.12 Phylogeny vs. taxononomy - MEGAN and Krona plots......................... 96 11.13 Removal of duplicate reads with CD-HIT................................ 96 11.14 BioPython, BioPerl, and EMBOSS utilities............................... 96 11.15 BioEdit and reference alignments.................................... 96 11.16 BLAST and nr / nt databases....................................... 97 11.17 HMMER and Pfam............................................ 98 11.18 Muscle.................................................. 98 11.19 Velvet and de-novo assembly...................................... 98 11.20 Tree building with phyML or FastTree................................. 98 11.21 p-values and Pplacer........................................... 99 11.22 BWA and BowTie2............................................ 99 11.23 Samtools and Bamtools......................................... 99 11.24 The Integrated Genomics Viewer.................................... 99 12 Tree of Life, 2550 taxa 101 12.1 Bacteroidetes, et al............................................. 102 12.2 Alpha proteobacteria, et al......................................... 108 12.3 Beta and Gamma Proteobacteria..................................... 117 12.4 Actinobacteria.............................................. 128 12.5 Firmicutes, et al.............................................. 134 12.6 Cyanobacteria.............................................. 147 12.7 Archaea.................................................. 149 12.8 Eukaryotes................................................ 153 13 Definition of functional classifications 157 13.1 Amino Acids and Derivatives...................................... 157 13.2 Carbohydrates.............................................. 158 13.3 Cell Division and Cell Cycle....................................... 161 13.4 Cell Wall and Capsule.......................................... 161 13.5 Clustering-based subsystems....................................... 162 13.6 Cofactors, Vitamins, Prosthetic Groups, Pigments........................... 165 13.7 DNA Metabolism............................................ 166 13.8 Dormancy and Sporulation........................................ 167 13.9 Fatty Acids, Lipids, and Isoprenoids................................... 168 13.10 Iron acquisition and metabolism..................................... 169 13.11 Membrane Transport........................................... 169 13.12 Metabolism of Aromatic Compounds.................................. 170 13.13 Miscellaneous.............................................. 172 13.14 Motility and Chemotaxis......................................... 172 13.15 Nitrogen Metabolism........................................... 173 13.16 Nucleosides and Nucleotides....................................... 173 13.17 Phages, Prophages, Transposable elements, Plasmids.......................... 174 13.18 Phosphorous Metabolism........................................ 174 13.19 Photosynthesis.............................................. 175 13.20 Potassium metabolism.......................................... 175 13.21 Protein Metabolism..........................................