Introduction to Label-Free Quantification


SeqAn and OpenMS Integration Workshop
Temesgen Dadi, Julianus Pfeuffer, Alexander Fillbrunn
The Center for Integrative Bioinformatics (CIBI)

Mass-spectrometry data analysis in KNIME
Julianus Pfeuffer, Alexander Fillbrunn

OpenMS
• OpenMS – an open-source C++ framework for computational mass spectrometry
• Jointly developed at ETH Zürich, FU Berlin, and the University of Tübingen
• Open source: BSD 3-clause license
• Portable: available on Windows, macOS, and Linux
• Vendor-independent: supports all standard formats, and vendor formats through ProteoWizard
• OpenMS TOPP tools – The OpenMS Proteomics Pipeline tools
  – Building blocks: one application for each analysis step
  – All applications share identical user interfaces
  – Use PSI standard formats
• Can be integrated into various workflow systems
  – Galaxy
  – WS-PGRADE/gUSE
  – KNIME
Kohlbacher et al., Bioinformatics (2007), 23:e191

OpenMS Tools in KNIME
• OpenMS tools are wrapped in KNIME via GenericKNIMENodes (GKN)
• Every tool writes its Common Tool Description (CTD) via its command-line parser
• GKN generates the Java source code for the nodes that show up in KNIME
• Wraps the C++ executables and provides file-handling nodes

Installation of the OpenMS plugin
• Community-contributions update site (stable & trunk) – Bioinformatics & NGS
• Provides > 180 OpenMS TOPP tools as Community nodes
  – SILAC, iTRAQ, TMT, label-free, SWATH, SIP, …
  – Search engines: OMSSA, MASCOT, X!TANDEM, MSGF+, …
  – Protein inference: FIDO

Data Flow in Shotgun Proteomics
[Diagram: sample → HPLC/MS → raw data → signal processing → peak maps → data reduction → feature maps → identification and differential quantification → annotated maps → differentially expressed proteins; the data volume shrinks from ~100 GB of raw data to ~50 kB of results along the way]

Quantification Strategies
[Diagram: quantitative proteomics divides into relative and absolute quantification (AQUA, SISCAPA); relative quantification divides into labeled approaches (in vivo: 14N/15N, SILAC; in vitro: iTRAQ, TMT, 16O/18O) and label-free approaches (spectral counting, feature-based, MRM). After: Lau et al., Proteomics, 2007, 7, 2787]

Quantitative Data – LC-MS Maps
• Spectra are acquired at rates of up to dozens per second
• Stacking the spectra yields maps
• Resolution:
  – Up to millions of points per spectrum
  – Tens of thousands of spectra per LC run
• Huge 2D datasets of up to hundreds of GB per sample
• MS intensity follows the chromatographic concentration

LC-MS Data (Map) Quantification
[Example LC-MS map with quantitative annotations such as "15 nmol/µl" or "3x over-expressed"]

Label-Free Quantification (LFQ)
• Label-free quantification is probably the most natural way of quantifying
  – No labeling required, which removes further sources of error; no restrictions on sample generation; cheap
  – Data on the different samples are acquired in separate measurements – higher reproducibility is needed
  – Manual analysis is difficult
  – Scales very well with the number of samples; there is basically no limit and no difference in the analysis between 2 or 100 samples

LFQ – Analysis Strategy
1. Find features in all maps
2. Align maps
3. Link corresponding features
4. Identify features (e.g. the peptide GDAFFGMSCK)
5. Quantify (e.g. intensity ratios 1.0 : 1.2 : 0.5 across samples)
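To make these steps concrete, here is a minimal, self-contained Python sketch of steps 2, 3, and 5 (align, link, quantify) on toy feature lists. The (m/z, RT, intensity) tuples, the tolerances, and the simple median RT-shift model are illustrative assumptions for this sketch; they are not the models or parameters used by the OpenMS tools.

```python
def align_rt(reference, other):
    """Dewarp 'other' onto 'reference' with a single RT shift, estimated as the
    median RT offset of naively m/z-matched features (toy model, assumption)."""
    offsets = []
    for mz_r, rt_r, _ in reference:
        best = min(other, key=lambda f: abs(f[0] - mz_r))
        if abs(best[0] - mz_r) < 0.01:          # 10 mDa match window (assumption)
            offsets.append(rt_r - best[1])
    shift = sorted(offsets)[len(offsets) // 2] if offsets else 0.0
    return [(mz, rt + shift, inten) for mz, rt, inten in other]

def link_features(maps, mz_tol=0.02, rt_tol=30.0):
    """Greedily link features that agree within the m/z and RT tolerances,
    using the first (reference) map as the seed for each consensus feature."""
    consensus = []
    for mz0, rt0, int0 in maps[0]:
        intensities = [int0]
        for other in maps[1:]:
            matches = [f for f in other
                       if abs(f[0] - mz0) < mz_tol and abs(f[1] - rt0) < rt_tol]
            intensities.append(max(matches, key=lambda f: f[2])[2] if matches else None)
        consensus.append(((mz0, rt0), intensities))
    return consensus

# Toy example: one peptide feature measured in three samples.
map_a = [(445.23, 1200.0, 1.0e6)]
map_b = [(445.23, 1215.0, 1.2e6)]
map_c = [(445.23, 1190.0, 0.5e6)]
maps = [map_a, align_rt(map_a, map_b), align_rt(map_a, map_c)]
for (mz, rt), intensities in link_features(maps):
    base = intensities[0]
    ratios = [round(i / base, 2) if i is not None else None for i in intensities]
    print(f"feature at m/z {mz}, RT {rt:.0f} s -> ratios {ratios}")
```

Running the sketch links the single toy feature across the three maps and prints the 1.0 : 1.2 : 0.5 intensity ratios used as the example above.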
Feature-Based Alignment
• LC-MS maps can contain millions of peaks
• The retention time of peptides and metabolites can shift between experiments
• In label-free quantification, maps therefore need to be aligned in order to identify corresponding features
• Alignment can be done on the raw maps (where it is usually called 'dewarping') or on already detected features
• The latter is simpler, as it does not require the alignment of millions of peaks, but only of tens of thousands of features
• Disadvantage: it relies on accurate feature finding
[Example: a raw map of ~350,000 peaks reduces to ~700 features]

Feature Finding
• Identify all peaks belonging to one peptide
• Key idea:
  – Identify suspicious regions (e.g. the highest peaks) as seeds
  – Fit a model to each region and identify the peaks explained by it
• Extension: collect all data points close to the seed
• Refinement: remove peaks that are not consistent with the model
• Fit an optimal model for the reduced set of peaks
• Iterate until no further improvement can be achieved

Multiple Alignment
• Dewarp the k maps onto a comparable coordinate system
• Choose one map (usually the one with the largest number of features) as the reference map (here: map 2, i.e. T2 = 1)
[Diagram: maps 1…k are transformed by T1…Tk in the rt/m/z plane and combined into a consensus map]

LFQ with OpenMS in KNIME
• Identification
• Feature finding and mapping
• Map alignment
• Feature linking
• Statistical analysis with R Snippets
• Visualization with KNIME plotting nodes
The workflow is organized into three stages: preprocessing of single maps, combining the information of all maps, and statistical post-processing and visualization.
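For orientation, the same LFQ workflow can also be run with the OpenMS TOPP tools directly on the command line; each call below roughly corresponds to one KNIME node. This is a sketch, not a ready-made pipeline: the tool names and the generic -in/-out/-id parameters follow the usual TOPP conventions, but the file names are placeholders (ids.idXML would come from a separate search-engine step) and all algorithm parameters are left at their defaults, so check each tool's -help output or the corresponding KNIME node dialog before use.

```python
import subprocess

samples = ["sample1", "sample2"]

# 1. Feature finding on every centroided LC-MS map
for s in samples:
    subprocess.run(["FeatureFinderCentroided",
                    "-in", f"{s}.mzML", "-out", f"{s}.featureXML"], check=True)

# 2. Map alignment (dewarping) of the feature maps onto a common RT axis
subprocess.run(["MapAlignerPoseClustering",
                "-in", *[f"{s}.featureXML" for s in samples],
                "-out", *[f"{s}_aligned.featureXML" for s in samples]], check=True)

# 3. Link corresponding features across maps into a consensus map
subprocess.run(["FeatureLinkerUnlabeledQT",
                "-in", *[f"{s}_aligned.featureXML" for s in samples],
                "-out", "linked.consensusXML"], check=True)

# 4./5. Annotate consensus features with peptide identifications (placeholder
#       idXML from a search engine) and export quantities for statistical
#       post-processing, e.g. in an R Snippet node
subprocess.run(["IDMapper", "-id", "ids.idXML",
                "-in", "linked.consensusXML", "-out", "annotated.consensusXML"], check=True)
subprocess.run(["ProteinQuantifier",
                "-in", "annotated.consensusXML", "-out", "protein_quant.csv"], check=True)
```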