CLC Genomics Workbench Features & Benefits

Solving the data analysis challenges of High-Throughput Sequencing

High-Throughput Sequencing machines have made High-Throughput Sequencing accessible to a very large group of researchers. However, data analysis represents a serious bottleneck in the NGS pipelines of most R&D departments, which in turn dramatically reduces the return on investment of current NGS assets.

CLC Genomics Workbench solves this problem and will enable everyone to rapidly analyze and visualize the huge amounts of data generated by NGS machines. The user-friendly and intuitive interface essentially takes High-Throughput Analysis away from hardcore programmers doing command-line scripts, and hands it to scientists searching for biological results. Furthermore, the versatile nature of CLC Genomics Workbench allows it to blend seamlessly into existing sequencing analysis workflows, easing implementation and maximizing return on investment.

Some features of CLC Genomics Workbench

Genomics
• Read mapping of Sanger, 454, Illumina Genome Analyzer and SOLiD sequencing data
• De novo assembly of genomes of any size (only limited by RAM available)
• Color space mapping
• Advanced visualization, scrolling, and zooming tools
• SNP detection using advanced quality filtering
• Support for multiplexing with DNA barcoding

Transcriptomics
• RNA-seq incl. support for paired data and transcript-level expression
• Small RNA analysis
• Expression profiling by tags
• EST library construction
• Advanced visualization, scrolling, and zooming tools
• Gene expression analysis

Epigenomics
• ChIP-seq analysis
• Peak finding and peak refinement
• Case/control analysis

Classical sequence analysis tools
• Primer design
• Molecular cloning
• BLAST
• Alignments
• Phylogenetic trees
• Advanced RNA structure prediction and editing
• Integrated 3D molecule analysis
• Secondary protein structure predictions
• And much more...

Director of the Einstein Center for Epigenomics at the Albert Einstein College of Medicine, Dr. John Greally:

"CLC bio's tools are going to put sophisticated analytical ability into the hands of molecular biologists at Einstein, and will greatly enhance their ability to explore the massively-parallel sequencing data that we are generating. We see this as a way of lowering barriers for scientists who have not previously performed these high-throughput epigenomic assays, allowing them to explore their data and explore hypotheses."

Multi technology – multi platform

CLC Genomics Workbench includes High Performance Computing accelerated assembly of High-Throughput Sequencing data as well as a large number of downstream analysis tools.

CLC Genomics Workbench is the first comprehensive analysis package which can analyze and visualize data from all major NGS platforms, like SOLiD, 454, Sanger, Illumina and Ion Torrent. Collaboration with instrument manufacturers is a natural part of CLC bio’s development process.

For Windows, Mac OS X, and Linux

CLC bio © Copyright 2011

clcbio.com

Like all other Workbenches from CLC bio, CLC Genomics Workbench runs on Mac OS X, Windows, and Linux platforms. You decide which computer to run your software on – not us.

Genomics Features

CLC bio’s world renowned scientists have designed completely new and innovative algorithms to power the features of CLC Genomics Workbench. These highly advanced and cutting edge algorithms incorporate SIMD processor accelerating technology to yield a significant speed-up of the read mapping as well as the de novo assembly processes.

Support for analysis of hybrid data

Read mapping as well as de novo assembly support the analysis of different kinds of data at the same time. An example would be the de novo assembly of Sanger data, 454 single read data, and Illumina paired end data in the same analysis. This functionality dramatically reduces manual work for the scientists, facilitating focus on deriving biological results from the data instead of doing tedious data-crunching.

Multiplexing

When doing batch sequencing of different samples, you can use multiplexing techniques to run different samples in the same run. There is often a data analysis challenge to separate the sequencing reads, so that the reads from one sample are analyzed together.

CLC Genomics Workbench supports a large number of multiplexing protocols, covering both multiplexing based on name and multiplexing based on tags or barcoding.
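
Separating multiplexed reads by barcode can be sketched in a few lines of Python. This is only a minimal illustration and not the Workbench's implementation: the barcode sequences, the sample names, and the demultiplex function are hypothetical, and the sketch assumes that the barcode sits at the very start of each read and must match exactly.

```python
from collections import defaultdict


def demultiplex(reads, barcodes):
    """Group reads by the barcode found at the start of each read.

    reads    -- iterable of (read_name, sequence) pairs
    barcodes -- dict mapping barcode sequence to sample name (hypothetical values)
    """
    samples = defaultdict(list)
    for name, seq in reads:
        for barcode, sample in barcodes.items():
            if seq.startswith(barcode):
                # Strip the barcode so only the biological sequence remains.
                samples[sample].append((name, seq[len(barcode):]))
                break
        else:
            # No barcode matched: keep the read aside instead of guessing.
            samples["unassigned"].append((name, seq))
    return dict(samples)


# Hypothetical barcodes and reads, for illustration only.
barcodes = {"ACGT": "sample_A", "TGCA": "sample_B"}
reads = [
    ("r1", "ACGTTTGACC"),
    ("r2", "TGCAGGATTA"),
    ("r3", "ACGTCCCAAG"),
    ("r4", "GGGGAAAACC"),
]
groups = demultiplex(reads, barcodes)
for sample, sample_reads in groups.items():
    print(sample, sample_reads)
```

Real barcoding protocols typically also tolerate sequencing errors in the barcode and may place it elsewhere in the read; exact prefix matching is used here only to keep the example short.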

SNP detection

CLC Genomics Workbench offers automated SNP detection. The SNP detection in CLC Genomics Workbench is based on the Neighborhood Quality Standard (NQS) algorithm of [Altshuler et al., 2000] (also see [Brockman et al., 2008] for more information).

If the reference sequence is annotated with ORF or CDS annotations, the SNP detection will also report whether the SNP is synonymous or non-synonymous. If the SNP variant changes the amino acid in the protein translation, the new amino acid will be reported.

The graphical user interface allows the user to easily identify SNPs and get a graphical overview of SNPs in smaller or larger genomic regions.

Identifying genomic rearrangements

Through the advanced graphical user interface, CLC Genomics Workbench supports the identification of a variety of genomic rearrangements like insertions, deletions, duplications and inversions.

Fig. 1: A region of low coverage has been found in the assembly view, and the corresponding region of the contig sequence is automatically highlighted.

Read mapping

The read mapping functionality of CLC Genomics Workbench supports both short and long reads, paired reads, gapped and ungapped alignments, and Sanger, 454, Illumina Genome Analyzer and SOLiD sequencing data.

CLC Genomics Workbench maps reads to genomes of any size as long as the computer has the necessary RAM. Read mapping of a human genome at 10-fold coverage can be carried out on a standard computer with 16 GB of RAM.

Mapping of SOLiD data is carried out in native color space, using a high performance computing based algorithm. Up to 80% more hits have been found when assembling 35mer SOLiD data in color space, compared to assembling the same data in base space.

De novo assembly

The de novo assembly of CLC Genomics Workbench supports both short and long reads, paired reads, and Sanger, 454, Illumina Genome Analyzer and SOLiD sequencing data.

The de novo assembler can perform scaffolding for joining contigs based on paired reads information. A combination of paired data protocols can be used, mixing paired end and mate pair data with various insert sizes in the same assembly.

Depending on the coverage and quality of the data, CLC Genomics Workbench de novo assembles genomes of any size.

Benchmarks – E. coli

Minutes:
• 454: Read mapping and visualization of 439,000 reads to E. coli (5 megabases) on a 1,500 USD, 2 GB, dual core, 2.13 GHz, 32 bit laptop computer: 2
• Illumina Genome Analyzer: Read mapping and visualization of 2 x 2.7 = 5.4 million paired end reads (1 lane) to E. coli (5 megabases) on a 32 GB, 8 core, 2.5 GHz, 64 bit desktop computer: 3

Transcriptomics Features

CLC Genomics Workbench has tools to support a full workflow in the analysis of expression data. These include visual quality control tools, such as principal component plots and box plots, transformation and normalization tools, tools for statistical testing and false discovery rate control, clustering algorithms, heat-map visualization, and tests on gene annotations, such as hypergeometric tests and Gene Set Enrichment Analysis.

Data supported for expression analysis is RNA-seq, small RNA, tag-based expression profiling and single color microarray gene expression data.

The interactivity of the multiple available views allows easy navigation and overview of data and analysis results. The complete integration of the expression analysis in the workbench enables the user to carry out downstream analysis of genes of interest with the comprehensive set of sequence analysis tools provided, immediately and without the hassle of switching between software packages.

Fig. 2: Heat-map visualization tool letting you depict the table of expression values.

Digital Gene Expression

CLC Genomics Workbench includes mRNA-seq based on the approach from Mortazavi A, et al., "Mapping and quantifying mammalian transcriptomes by RNA-Seq", Nat Methods. 2008 Jul;5(7):585-7.

One of the advantages of this model is that the statistics are based on RPKM (Reads Per Kilobase of exon model per Million mapped reads), which is a good and easy way of normalizing values for the expression level of a gene when using Digital Gene Expression.

brain_sample1

Feature ID   Expression values   Transcripts   Unique gene reads   Unique exon reads   Total gene reads
ABHD8                 5.416,87             1                 656                 595                695
ABHD9                    21,02             1                  18                   2                 32
AKAP8                   673,58             1                 222                 124                361
AKAP8L                4.311,30             1                 772                 478                897
ANKRD41                 125,49             1                  27                  20                 31
AP1M1                 2.749,13             1                 426                 326                468
ARMC6                 1.238,56             1                 201                 149                230
ARRDC2                1.034,80             2                 236                 160                333
ATP13A1               1.332,95             1                 325                 244                341
B3GNT3                    0,00             1                  36                   0                 76
BRD4                  1.427,11             2                 656                 554                693
BST2                  1.479,23             1                  67                  60                 80
C19orf42                720,63             1                  91                  51                107
C19orf44                943,97             1                  92                  13                316
C19orf50              2.653,11             1                 264                 195                307
C19orf60              5.254,14             2                 346                 242                359
C19orf62              3.789,73             2                 378                 288                428
CALR3                     0,00             1                  47                   0                 90
CASP14                    0,00             1                   5                   0                  7
CCDC105                   0,00             1                  14                   0                 16
CCDC124               5.040,96             1                 320                 274                342
CHERP                 1.668,09             1                 239                 172                474
CIB3                      0,00             1                  25                   0                 44
CILP2                    31,72             1                   9                   7                 10
COMP                    124,16             1                  29                  16                 35
COPE                  7.315,62             3                 582                 429                649
CPAMD8                   98,50             1                 280                  31                553
CRLF1                   881,29             1                 116                  83                145
CRTC1                 3.396,18             2                1613                1244               1790
CYP4F11                 193,77             1                  50                  31                 58
CYP4F12                  33,09             1                  44                   2                 76
CYP4F2                    8,06             1                   3                   0                  9

Fig. 3: A table view of an expression sample generated from a sequence file of NGS mRNA reads.

Small RNA analysis

Small RNA sequenced on SOLiD, Illumina or 454 systems can be analyzed using the CLC Genomics Workbench. Adapter trimming and optionally de-multiplexing are the first steps in the analysis, followed by tag counting and finally powerful tools for annotating the small RNAs using miRBase and other resources. The annotations can be grouped on the precursor or mature miRNA level. The final results can be visualized and analyzed using the expression analysis tools.

Expression profiling by tags

CLC Genomics Workbench includes a powerful tag profiling functionality which is an extension to SAGE, using NGS technology. The full workflow is supported: extracting tags from sequence reads, tag counting, creating a virtual tag list, and annotating tag counts with gene names.

EST library construction

An EST library can be constructed using the de novo assembly algorithm, e.g. to be used as reference sequences for mRNA-seq or tag based transcriptomics.

Epigenomics analyses

CLC Genomics Workbench includes a fully integrated ChIP-seq analysis solution which enables researchers to go from raw data, through reference alignment, to advanced visual and tabular output of ChIP-seq results. The analysis can be based on the information contained in a single sample subjected to immunoprecipitation (ChIP-sample) or on a comparison of a ChIP-sample with a control sample.

Classical Sequence Analysis

In addition to all the High-Throughput Sequencing analysis tools, CLC Genomics Workbench includes all of the more than 100 features of CLC Main Workbench for carrying out downstream analysis and for designing follow-up lab experiments. A few examples are primer design, molecular cloning, BLAST, 5 different types of alignments, and phylogenetic analyses.

Fig. 4: Overview of our three-tier solution, the CLC Genomics Server. People can access the server from their laptop computer and easily work on large projects.

Benchmarks – Human

Hours:
• Illumina Genome Analyzer: Read mapping and visualization of 2 x 43 million = 86 million paired end reads to Human (3 gigabases) on a 32 GB, 8 core, 2.5 GHz, 64 bit desktop computer (ungapped alignment): 4 1
• Illumina Genome Analyzer: Read mapping and visualization of 2 x 43 million = 86 million paired end reads to Human (3 gigabases) on a 32 GB, 8 core, 2.5 GHz, 64 bit desktop computer (gapped alignment): 7
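
The RPKM normalization mentioned under Digital Gene Expression can be written out as a short calculation. Following the definition in Mortazavi et al. (2008), RPKM = 10^9 x C / (N x L), where C is the number of reads mapped to the exons of a gene, N is the total number of mapped reads in the sample, and L is the summed exon length in base pairs. The sketch below is a minimal illustration under these assumptions; the function name and the example numbers are hypothetical, and which read count the Workbench uses as input is not stated here.

```python
def rpkm(exon_reads, exon_length_bp, total_mapped_reads):
    """Reads Per Kilobase of exon model per Million mapped reads.

    exon_reads         -- reads mapped to the exons of the gene (C)
    exon_length_bp     -- summed exon length of the gene in base pairs (L)
    total_mapped_reads -- all mapped reads in the sample (N)
    """
    if exon_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("exon length and total mapped reads must be positive")
    # 1e9 = 1e3 (per kilobase) * 1e6 (per million mapped reads)
    return exon_reads * 1e9 / (exon_length_bp * total_mapped_reads)


# Hypothetical numbers: 500 reads on a 2,000 bp gene in a sample
# with 10 million mapped reads.
example = rpkm(exon_reads=500, exon_length_bp=2000, total_mapped_reads=10_000_000)
print(example)  # 25.0
```

Dividing by both gene length and sequencing depth is what makes the values comparable between long and short genes and between samples with different numbers of reads.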

Server Integration

The CLC Genomics Workbench integrates smoothly with CLC Genomics Server (figure 4, page 3). This enables the Genomics Workbench user to run heavy jobs like whole genome assemblies on one or more central, powerful computers while working with downstream analyses of other data on the local computer.

In addition to computational power, CLC Genomics Server offers a flexible job queueing system, easy integration with other applications, easy data sharing opportunities, and a range of other functionalities to ensure that your High-Throughput Sequencing analyses are carried out in a fast, secure, and flexible IT environment.

CLC Genomics Server can be fully integrated with CLC Bioinformatics Database, supporting MySQL, PostgreSQL, Oracle, H2, and Microsoft SQL Server.

Customization

A new and fast evolving technology, High-Throughput Sequencing constantly provides researchers with new scientific opportunities and new ways of analyzing the huge amounts of data.

The problem is not lack of ideas or lack of data. The problem is lack of efficient software for carrying out the analyses or for removing manual bottlenecks in the workflow.

CLC bio eliminates these challenges by offering a free Java based Software Developer Kit (SDK) for CLC Genomics Workbench and for CLC Genomics Server. Using the SDK, you will be able to integrate your own algorithms with our products.

We also design and develop customized add-on modules for CLC Genomics Workbench and CLC Genomics Server based on specific customer requests. This is a quick and cost effective way of improving both the speed and the quality of your research.

System requirements

• Windows XP, Windows Vista, or Windows 7, Windows Server 2003 or Windows Server 2008
• Mac OS X 10.5 or later. PowerPC G4, G5 or Intel CPU required
• Linux: RedHat 5 or later. SuSE 10 or later.
• Intel or AMD CPU required
• 64 bit computer & operating system recommended for more than 2 GB RAM
• 2 to 4 GB RAM required for small data sets
• 16 to 32 GB RAM required for large data sets
• 1024 x 768 display recommended

Contact your local sales representative or send an e-mail to [email protected] if you would like to try CLC Genomics Workbench.

CLC bio EMEA · Finlandsgade 10-12 · Katrinebjerg · DK-8200 Aarhus N · Denmark · Phone: +45 7022 5509

CLC bio Americas · 10 Rogers St #101 · Cambridge · MA 02142 · USA · Phone: +1 (617) 444 8765

CLC bio AsiaPac · 69 · Lane 77 · Xin Ai Road · 7th fl. · Neihu District · Taipei · Taiwan 114 · Taiwan · Phone: +886 2 2790 0799

8.02.2012 CLC bio