Development of Chemoinformatic Tools to Enumerate Functional Groups in Molecules for Organic Aerosol Characterization
Atmos. Chem. Phys., 16, 4401–4422, 2016
www.atmos-chem-phys.net/16/4401/2016/
doi:10.5194/acp-16-4401-2016
© Author(s) 2016. CC Attribution 3.0 License.

Technical Note: Development of chemoinformatic tools to enumerate functional groups in molecules for organic aerosol characterization

Giulia Ruggeri and Satoshi Takahama
ENAC/IIE Swiss Federal Institute of Technology Lausanne (EPFL), Lausanne, Switzerland

Correspondence to: Satoshi Takahama (satoshi.takahama@epfl.ch)

Received: 1 October 2015 – Published in Atmos. Chem. Phys. Discuss.: 27 November 2015
Revised: 4 March 2016 – Accepted: 9 March 2016 – Published: 11 April 2016

Published by Copernicus Publications on behalf of the European Geosciences Union.

Abstract. Functional groups (FGs) can be used as a reduced representation of organic aerosol composition in both ambient and controlled chamber studies, as they retain a certain chemical specificity. Furthermore, FG composition has been informative for source apportionment, and various models based on a group contribution framework have been developed to calculate physicochemical properties of organic compounds. In this work, we provide a set of validated chemoinformatic patterns that correspond to (1) a complete set of functional groups that can entirely describe the molecules comprised in the α-pinene and 1,3,5-trimethylbenzene MCMv3.2 oxidation schemes, (2) FGs that are measurable by Fourier transform infrared spectroscopy (FTIR), (3) groups incorporated in the SIMPOL.1 vapor pressure estimation model, and (4) bonds necessary for the calculation of carbon oxidation state. We also provide example applications for this set of patterns. We compare available aerosol composition reported by chemical speciation measurements and FTIR for different emission sources, and calculate the FG contribution to the O : C ratio of simulated gas-phase composition generated from α-pinene photooxidation (using the MCMv3.2 oxidation scheme).

1 Introduction

Atmospheric aerosols are complex mixtures of inorganic salts, mineral dust, sea salt, black carbon, metals, organic compounds, and water (Seinfeld and Pandis, 2006). Of these components, the organic fraction can comprise as much as 80 % of the aerosol mass (Lim and Turpin, 2002; Zhang et al., 2007) and yet eludes definitive characterization due to the number and diversity of molecule types. There have been many proposals for reducing representations in which a mixture of 10 000+ different types of molecules (Hamilton et al., 2004) are represented by some combination of their molecular size, carbon number, polarity, or elemental ratios (Pankow and Barsanti, 2009; Kroll et al., 2011; Daumit et al., 2013; Donahue et al., 2012), many of which are associated with observable quantities (e.g., by aerosol mass spectrometry (AMS; Jayne et al., 2000), gas chromatography–mass spectrometry (GC-MS and GC×GC-MS; Rogge et al., 1993; Hamilton et al., 2004)). Molecular bonds or organic functional groups (FGs), which are the focus of this manuscript, can also be used to provide reduced representations for mixtures and have been shown useful for organic mass (OM) quantification, source apportionment, and prediction of hygroscopicity and volatility (e.g., Russell, 2003; Donahue, 2011; Russell et al., 2011; Suda et al., 2014). Examples of property estimation methods include models for pure-component vapor pressure (Pankow and Asher, 2008; Compernolle et al., 2011), UNIFAC, and its variations for activity coefficients and viscosity (Ming and Russell, 2001; Griffin et al., 2002; Zuend et al., 2008, 2011). The FGs that can be detected or quantified by measurement vary widely by analytical technique, which include Fourier transform infrared spectroscopy (FTIR; Maria et al., 2002), Raman spectroscopy (Craig et al., 2015), spectrophotometry (Aimanant and Ziemann, 2013), nuclear magnetic resonance (NMR; Decesari et al., 2000; Cleveland et al., 2012), and gas chromatography with mass spectrometry and derivatization (Dron et al., 2010).

Projecting specific molecular information available through various forms of mass spectrometry (e.g., Williams et al., 2006; Kalberer et al., 2006; Laskin et al., 2012; Chan et al., 2013; Nguyen et al., 2013; Vogel et al., 2013; Yatavelli et al., 2014; Schilling Fahnestock et al., 2015; Chhabra et al., 2015) or model simulations employing explicit chemical mechanisms (e.g., Jenkin, 2004; Aumont et al., 2005; Herrmann et al., 2005) to a reduced dimensional space represented by some combination of FGs can be useful for measurement intercomparisons, or model–measurement comparisons. For this task, the aerosol community can benefit from developments in the chemoinformatics community. If the structure of a substance is described through its molecular (also referred to as chemical) graph (Balaban, 1985) – which is a set of atoms and their association through bonds – the abundance of arbitrary substructures (also called fragments) can be estimated through pattern-matching algorithms called subgraph isomorphisms (Barnard, 1993; Ehrlich and Rarey, 2012; Kerber et al., 2014). Structural information of molecules can be encoded in various representations, including a linear string of ASCII characters denoted as SMILES (Weininger, 1988). A corresponding set of fragments can be specified by SMARTS, which is a superset of the SMILES specification (DAYLIGHT Chemical Information Systems, Inc.). There are many chemoinformatic packages that implement algorithms for pattern matching – for instance, OpenBabel (O’Boyle et al., 2011), Chemistry Development Kit (Steinbeck et al., 2003), OEChem (Openeye Scientific Software, Inc.), RDKit (Landrum, 2015), and Indigo (GGA Software Services). The concept of using SMILES and SMARTS patterns has been reported for applications in the atmospheric chemistry community (Barley et al., 2011; COBRA, Fooshee et al., 2012).

While some sets of SMARTS patterns for substructure matching can additionally be found in the literature (Hann et al., 1999; Walters and Murcko, 2002; Olah et al., 2004; Enoch et al., 2008; Barley et al., 2011; Kenny et al., 2013) or on web databases – e.g., DAYLIGHT Chemical Information Systems, Inc. (DAYLIGHT Chemical Information Systems, Inc.) – knowledge regarding the extent of specificity and validation of the defined patterns is not available.

In this work, we report specifications for four specific sets of substructures:

1. FGs contained in α-pinene and 1,3,5-trimethylbenzene photooxidation products defined in MCMv3.2 (Jenkin et al., 1997; Saunders et al., 2003; Jenkin et al., 2003; Bloss et al., 2005), obtained via http://mcm.leeds.ac.uk/MCM;

2. FGs that are measured or measurable (i.e., have absorption bands) for FTIR analysis (Pavia et al., 2008);

3. molecular fragments used by SIMPOL.1 for estimation of pure organic compound vapor pressures;

4. bonds used for calculation of carbon oxidation state

formulating patterns in such a way that permits a user to not only match and test the total number of FGs within a molecule but also confirm that all atoms within a molecule are classified uniquely into a set of FGs (except polyfunctional carbon, which can be associated with many FGs). We present a validation test for the groups defined, and show example applications for mapping molecules onto two-dimensional volatility basis set (2-D VBS) space, inter-measurement comparison between OM composition reported by GC-MS and FTIR for several source classes, and discuss implications for further applications. The patterns and software written for this manuscript are provided in a version-controlled repository (Appendix A).

2 Methods

In this section, we present a series of patterns corresponding to substructures useful for vapor pressure estimation of FGs in molecules defined by measurements and chemical mechanisms (Sect. 2.1) as well as the methods and compound sets used for their validation (Sect. 2.2). We further describe the data set used for constructing a few example applications (Sect. 2.3).

2.1 Pattern specification for matching substructures

Four groups of patterns are defined: the first group (Table 1, substructures 1–33) corresponds to the complete set of FGs that can be found in the MCMv3.2 α-pinene and 1,3,5-trimethylbenzene oxidation scheme (Jenkin et al., 1997; Saunders et al., 2003), the second group is used to study the FG abundance associated with FTIR measurements (FGs not specified before, containing carbon, oxygen, and nitrogen atoms; Table 1, substructures 33–57), the third group corresponds to the FGs used to build the SIMPOL.1 model (Pankow and Asher, 2008) to predict pure-component vapor pressures that are not present in the first set of patterns (Table 2), and the fourth group is used to calculate the oxidation state of carbon atoms (Table 3). The regions of absorption in the IR spectrum associated with FG patterns are reported in Table 4 as an additional reference. The OpenBabel toolkit (O’Boyle et al., 2011) is called through the Pybel library (O’Boyle et al., 2008) in Python to search and enumerate abundances of fragments (most of which are specified by SMARTS) in each molecule (specified by SMILES). A few groups for which SMARTS patterns were difficult to obtain were calculated through algebraic relations specified through the string formatting syntax of the Python programming language. In this syntax, values pre-computed through SMARTS matching are combined together to estimate properties for another group. While SMARTS