<<

Evan Hepler-Smith, Princeton University Cheminformatics OLCC, Fall 2015 Leah McEwen, Cornell University Module 1: Chemical Information Literacy

Topic IV Communicating with chemical and names

Learning Objectives ...... 1 Overview ...... 2 Definitions ...... 4 4.1. Unit 1: Formulas ...... 6 4.2. Unit 2: Names ...... 19 4.3. Unit 3: Identifying chemical compounds on computers...... 23 4.4. Unit 4: Translating names and formulas ...... 29 4.5. Exercises ...... 32 4.6. Further Reading ...... 32

4. Communicating with chemical formulas and names

Abstract:

Learning Objectives Once you have completed this topic, you will have learned:

- To recognize various different kinds of chemical names, formulas, and other identifiers - What you do and do not know about a based on one of these names, formulas, or identifiers. - How to translate one kind of chemical name, , or other identifier into another, and what sorts of information can be inadvertently lost or added in translation - How chemists and computer systems interpret various kinds of chemical names, formulas, and other identifiers in chemically meaningful ways. - The kind of factors that you should consider in selecting an appropriate kind of chemical name, formula, or identifier to use, including the information you want to communicate, the kind of chemical entity you’re referring to, the audience with whom you’re communicating, and the medium in which you’re communicating.

2015-08-02

Body:

Overview involves a lot of communication. In the classroom, in the laboratory, or at the computer screen, as a chemist, you are constantly referring to all sorts of different chemical substances and molecular entities. You do so using chemical names, formulas, and notation. You’re probably already so accustomed to chemical names, formulas, and notation that you barely need to think about them when you use them, and can instead focus on the that you’re drawing, writing, or talking about. In the following four units, we’re going to turn things around and think about chemical names, formulas, and notation themselves.

Why would we want to do that?

Where there’s communication, there’s always a danger of misunderstanding. Experienced human chemists are generally able to figure out when they’ve misunderstood each other over the identity of a particular compound. However, work in cheminformatics almost always involves communicating not just with other chemists, but with computer systems. Often, it also involves different computer systems communicating with each other. In these cases, it’s often easier for miscommunication to go undetected. When it is detected, it’s often difficult to figure out what went wrong.

You can minimize the impact of this kind of miscommunication by keeping in mind what various sorts of chemical names and formulas DO and DO NOT tell you about a particular compound, and by documenting the sources of the names and formulas that you use.

In the first two units of this module, we will dig into the most common kinds of chemical names, formulas, and notation to figure out a) how they work, b) why they work like they do, and c) where they are most often used.

In the third unit, we’ll introduce several chemical identifiers and representations developed specifically for use on computers. In later modules, you’ll learn about these in more detail. But whether or not you’re doing cheminformatics, most chemical communication involves computers in one way or another. This unit will discuss how that may matter.

Miscommunication is especially apt to arise when you’re translating one kind of representation or identifier for a compound into another. Therefore, we’ll spend the fourth unit discussing how to translate between one kind of name or formula and another (and what can get lost in translation).

Later modules of this course will focus on how these various sorts of identifiers are used in cheminformatics applications. In this module, we’ll focus on the communications tasks that almost all chemists engage in. A convenient mnemonic for these tasks is RSVP: Register, Search, View, Publish. Most forms of chemical representation were developed with these uses in mind.

When representations originally built for these purposes are picked up for cheminformatics purposes, either directly or translated into more convenient formats, sometimes what had been

2

features of these names and formulas in previous contexts turn into bugs. If you understand the original purposes served by these “bugs,” you’ll be better able to deal with them when they arise in your cheminformatics work.

(A quick note to reassure you before we dive in: we’re not going to be memorizing any nomenclature rules. Systematic has become so complicated that even experts in the field use computer systems to review their work and catch their mistakes. We’ll talk a little bit about how this has happened, since it will help you understand how do deal with some of the challenges that might come up when you have to deal with systematic chemical names in your own work.)

The ability to communicate effectively using chemical names, formulas, and notation is a kind of literacy. As with regular literacy, this chemical literacy is something that you will get better at with practice. The better you understand what’s going on “under the hood” of various forms of chemical representation and the computer systems that make use of them, the better a chemical communicator you will become.

3

4.0.1. Definitions

Abstract:

Chemical identifiers and representations There a lots of different kind of chemical names and formulas. Confusingly, many of the terms that refer to them can be used in different ways.

Instead of trying to specify a single, unambiguous meaning for each term, we’re going to lay out the various different things that people might mean when they’re talking about, for example, an “.”

Body:

Formulas A is any formula that indicates the connectivity of a compound – that is, which of its are linked to each other by covalent bonds. There are various different kinds of structural formulas:

A line formula depicts connectivity but no three-dimensional structural information.

A condensed formula expresses the same information as a line formula using atomic symbols only.

A Lewis formula explicitly shows lone pairs in addition to bonds.

A is a simplified line formula in which atoms are depicted as unlabeled vertices and atoms bonded to carbon are suppressed. Skeletal formulas are the most common structural formulas.

Dash-wedge formulas use dashes and wedges to represent at sp3 stereocenters.

Projection formulas indicate conformation.

These different ways of drawing structural formulas are often combined or used alongside one another, sometimes in different parts of the same formula. For this reason, it’s not especially important or useful to memorize these terms and their definitions. Rather, you need to be able to interpret the kind of information that each of these formulas expresses. We’ll discuss this in more detail below.

Empirical and molecular formulas indicate the composition of a compound only:

An empirical formula expresses the ratio of the elements (or sometimes polyatomic ions) that make up a compound, in lowest integer terms.

4

A molecular formula indicates the total number of atoms of each element in one of a compound.

Names A systematic name is a chemical name based on the structural formula of a compound. If you know the rules and vocabulary of the system in question, you should be able to write a name based on a structural formula and vice-versa. Chemists have developed various ways of translating formulas into names, so it is nearly always possible to write more than one systematic name for a given compound.

Locants and sterochemical descriptors are numbers, letters (such as R, S, E, and Z), and prefixes (cis, trans) that indicate how the molecular fragments indicated by different parts of a systematic name fit together in the named compound.

A trivial name is a relatively short, memorable name that identifies a chemical entity without describing its structure.

IUPAC nomenclature is a well-known international system of chemical names. In general, IUPAC nomenclature is systematic but flexible, offering several ways of writing a systematic name for any given compound. IUPAC nomenclature rules also allow the use of certain well-established trivial names as IUPAC names.

A preferred IUPAC name (PIN) is one of the possible IUPAC names for a compound, singled out as the name to be used in official contexts such as regulation.

Notation Line notation expresses the structure of a compound using a string of characters. Line notation is designed to be easy for computers to process rapidly and reliably (and is usually not particularly legible to people). Currently, the most commonly used forms of line notation are SMILES/SMARTS and InChI.

Registry numbers are unique identifiers for chemical substances. They are designed not to give you any information whatsoever about a compound’s structure or its relationships to other compounds.

CAS Registry Numbers (CAS RNs) are the registry numbers used in the Chemical Abstracts Service Chemical Substance Registry, a major that can be searched with CAS applications including SciFinder and STN. They have often been used as official identifiers for chemical substances, especially in the US.

A connection table is a table listing all of the atoms and bonds in a molecule. It is the most common format used by computer programs to store, search, compare, and sort chemical structures. Connection tables are even harder for humans to read than line notation.

The MDL Molfile (.mol file) is a widely-used file format for connection tables.

5

4.1. Unit 1: Formulas

4.1.1. Structural formulas “The purpose of a diagram,” begins an article on how to draw these diagrams, “is to convey information—typically the identity of a molecule—to another human reader or as input to a computer program. Any form of communication, however, requires that all participants understand each other.”1

Below, we’ll go over the various ways in which structural formulas are most often drawn. Once again, our goal is to get you thinking about the kinds of structural formulas that you’ve gotten used to using without having to think too much about them. What could you possibly be misunderstanding in someone else’s structural formula? How could somebody misunderstand your structural formula? Is there a chemical feature in your head that didn’t make it into the formula that you drew? Is there more in the formula that you drew than you meant to express?

(We’ll be going over the ways in which formulas can be drawn. If you are interested in learning more about how formulas should be drawn, and in sharpening your own formula-drawing, we highly recommend checking out this detailed guide. Here’s what you’ll find:

Production of good chemical structure depictions will likely always remain something of an art form. There are few cases where it can be said that a specific representation is “right” and that all others are “wrong”. These guidelines do not try to do that. Rather, they try to codify the sorts of general rules that most chemists understand intuitively but that have never been collected in a single printed document. Adherence to these guidelines should help produce drawings that are likely to be interpreted the same way by most chemists and, as importantly, that most chemists feel are “good- looking” diagrams.2

When chemists talk about “structure,” what do they mean? Chemical structure can mean several different things:

- Connectivity (also known as constitution): which atoms are linked to which by covalent bonds? - Stereochemistry: what is the relative arrangement of these atoms and bonds in three- dimensional space? Are two groups across a or ring cis or trans to each other? Is a stereocenter R or S? - Conformation: in which of the many configurations permitted by rotation around single bonds are all of the atoms of a compound arranged in space? - Crystal structure: what is the precise position of each in the compound, in three- dimensional coordinates?

1 Jonathan Brecher, “Graphical Representation Standards for Chemical Structure Diagrams (IUPAC Recommendations 2008),” Pure and Applied Chemistry 80, no. 2 (January 1, 2008), 278. 2 Ibid., 280.

6

Structural formulas always express connectivity and often express stereochemistry. Both of these aspects of structure can usually be translated in a relatively straightforward way between different chemical formulas and names.

While structural formulas may also contain information about conformation, it is often more difficult to translate conformation from one formula to another or to a name. And while structural formulas may be drawn to suggest the shape of a molecule, they almost never contain reliable information about crystal structure.

4.1.1.1. How do they work? To get us started, here are some structural formulas:

I.A I.B I.C

I.D

7

II.A II.B II.C

(trans-but-2-enedioic acid) (cis-but-2-enedioic acid)

II.D II.E

8

III.A III.B III.C

III.D III.E

= =

IV.A IV.B IV.C

isopropyl acetate

9

V.A V.B

Compressing structural formulas: skeletal formulas, condensed formulas, and abbreviations. In order to draw structural formulas more quickly and clearly, chemists typically draw carbon atoms as unlabeled vertices leave out lone pairs and atoms bonded to carbon. Formulas drawn in this way are sometimes call skeletal formulas.

Even skeletal formulas can take up a lot of space, and sometimes, you’re only really interested in the structure of one part of a molecule.

Structural formulas often make use of abbreviations for common molecular subunits: i-Pr (isopropyl), Ph (phenyl), Me (methyl), Et (ethyl), Bu (butyl), t-Bu (tert-butyl), Ac (acetyl), among others. (Here’s a list.) (I.C, IV.B, IV.C)

In order to abbreviate structural formulas even more, condensed formulas express structure without using any lines. A condensed formula can be written in place of an entire structural formula (II.C, III.C) or in place of a portion of a structural formula (I.B).

Parentheses are used when more than two non-hydrogen atoms are bonded to the same atom (III.C).

The advantage of condensed formulas is that they can be written in normal type. However, it is often more difficult to perceive structural features in a condensed formula than in one of the graphical alternatives.

Stereochemistry Structural formulas typically indicate cis-trans isomerism across double-bonds (II.D, II.E).

Structural formulas are sometimes drawn in a way that keeps this ambiguous. A crossed double bond and/or a double bond aligned linearly with its neighboring single bonds indicates unknown cis-trans configuration (II.A, II.B).

The configuration of chiral centers is shown using dashes and wedges. (III.D, III.E)

10

A wavy line explicitly indicates unspecified stereochemistry or a mixture of stereoisomers (III.B). A chiral carbon with only regular bond lines, on the other hand, could indicate that the chemist who drew the formula just didn’t notice the stereocenter (III.A).

Condensed formulas usually do not show stereochemistry (II.C, III.C).

Here’s a little more on stereochemistry in structural formulas.

Delocalization Delocalization may be drawn via resonance structures (I.D), circles within aromatic rings (I.D, also). Dashed or dotted double bonds are also sometimes used to show delocalization. In some contexts, these can be confusing, since dotted and dashed bonds are also used to depict transition states, coordination relationships, hydrogen bonds, and other bonds that behave differently than covalent sigma and pi bonds.

4.1.1.2. Why do they work that way? Structural formulas tell you a lot more about the atoms that make up a compound than its valence electrons. Bonds represent electrons, of course, and you can draw in lone pairs. You can add curved arrows to show electron movement, draw resonance structures or dotted bonds to show delocalization, and sketch in orbitals, of course. But in structural formulas themselves, as usually drawn, the electrons are mostly implicit – you know they’re there, but they aren’t actually what the drawing depicts.

This is peculiar, since the majority of chemical phenomena depend on interactions involving valence electrons.

There’s a reason for this. Chemists began using structural formulas a hundred and fifty years ago, after they’d figured out the basic features of organic chemical structure (carbon atoms form chains, carbon forms four bonds, etc.) but before things like cis-trans isomerism, tetrahedral carbon, and even the electron itself had even been hypothesized.

By the time electrons and stereochemistry came along, structural formulas had come into general use, and chemists were quite familiar with them and fond of them. So they kept on using the same formulas, even though they’d been developed without electrons or stereochemistry in mind.

Eventually, chemists developed additional bits of notation – electron dots, dashes and wedges, and the like – to incorporate the electronic theory of bonding and stereochemistry into these familiar formulas. But even though electrons and stereochemical relationships became absolutely central features of how chemists think, it has always been a little bit difficult to represent them in structural formulas. It’s just not what structural formulas were built for.

Of course, structural formulas continued and continue to be enormously productive ways of representing compounds. Chemists have learned to think of these formulas as expressions of contemporary chemical ideas. However, in some situations – cheminformatics among them – the awkward disconnection between their origins as maps of connections between atoms and the current understanding of what chemical substances are and how they work can come out.

11

One more point: structural formulas were originally developed within the context of , and then applied in other fields such as coordination chemistry. Be aware that both people and computer programs will tend to assume, as a default, that structural formulas represent covalently bonded organic compounds. If you are working with structural formulas for complexes involving coordination or hydrogen bonding, make sure that these bonds aren’t accidentally mistaken for the covalent bonds of organic compounds. (Some suggestions on how to avoid this pitfall are available on pages 292–295 here.)

4.1.1.3. Where they are most often used? Everywhere in which you’re able to draw diagrams. Unfortunately, this excludes a lot of places, such as word processing programs, free-text search boxes, databases, and anytime you find yourself talking chemistry without a notepad in your pocket.

There’s an easy solution for the last of these cases (keep a notepad in your pocket!); for the others, the solution is systematic nomenclature and notation, which we will discuss in the next two units.

One particularly useful feature of structural formulas is that you can easily draw a structural formula for a section of a molecule or identify one molecule as a section of another. Many applications for searching chemical databases (such as SciFinder and PubChem) allow you to perform substructure searches (for all molecules containing a certain structural formula subunit) and superstructure searches (for all molecules whose structural formulas can be found within a certain structural formula).

4.1.1.4. What questions should I ask?

Are we showing any implicit H’s or lone pairs? Are we worried about the ones we aren’t showing? One or more H’s can be drawn in when there’s chemistry happening at an H, or if you want to indicate the configuration of a stereocenter (V.B). The same goes for lone pairs, when you have reason to call attention to them (I.D).

When you look at a skeletal formula, you know that all of the hydrogen atoms and valence long pairs that you would expect to be there are in fact present, even though they aren’t drawn in. Keep this in mind if you find yourself communicating with a human or a computer that you can’t count on to fill in those missing H’s and electrons.

How are we dealing with stereochemistry? Structural formulas can specify stereochemistry (II.D-E, III.D-E, V.A-B) or leave it unspecified (II.A- C, III.A-C). In the latter case, it is typically impossible to tell from the structural formula alone whether you’re dealing with a mixture of stereoisomers or unknown stereochemistry.

If you’re concerned about stereochemistry – and in most cases, you probably are – be alert for stereocenters (including rings with multiple ) with unspecified stereochemistry.

12

Watch out for double-bonds just drawn on top of single bonds without considering cis-trans isomerism (and don’t make this mistake yourself!).

Note that there is a chemical difference between substances of unspecified stereochemistry and mixtures of stereoisomers. (“What’s the stereochemistry? I don’t know!” vs. “What’s the stereochemistry? We’ve got both !,” respectively.) However, when you’re dealing with a structural formula drawn without stereochemistry specified, it can be difficult to know which of these cases you’re dealing with.

How are we dealing with delocalization? If there’s a delocalized π system in your molecule, think about whether you’ve chosen the appropriate resonance form, or whether it’s worth drawing multiple forms or indicating delocalization with a dotted bond.

How are we dealing with tautomers? If your compound can tautomerize, think about whether you’ve chosen the appropriate tautomer for your purposes, whether it’s worth drawing both.

Keep in mind that both tautomerism and delocalization are much easier to recognize when you’re working with structural formulas then when you’re working with systematic names or other sorts of notation. (Delocalization is very difficult even to represent using any other form of name or notation.) When you translate structural formulas into another form, make sure delocalization and tautomerism don’t get lost in the shuffle.

13

4.1.2. Empirical and molecular formulas You don’t always know, or need to express, or want to express the structure of a compound that you’re working with. In the case of inorganic salts, there’s little or no molecular structure (connectivity, that is) to represent.

In these cases, empirical and molecular formulas give you a way to identify the compound by its composition alone. And if you’re interested in a compound’s composition for its own sake, better to write down a molecular formula than keep counting each atom in a structural formula.

4.1.2.1. How do they work? Empirical and molecular formulas are pretty straightforward: you just count the atoms or the ions.

Empirical formulas are most often used to identify salts. Empirical formulas typically express the relative amount of each element that the compound contains, in lowest integer terms.

NaCl AlCl3 Fe2O3

Salts containing polyatomic ions are frequently represented with a formula expressing the relative amount of each ion that the compound contains, in lowest integer terms. Such formulas sometimes just referred to as “chemical formulas” and sometimes as empirical formulas.

NH4NO2 (NH4)2SO4 Mg3(PO4)2

To write a molecule formula, just count the atoms in one molecule of the compound.

C2H6O () C7H6O2 (benzoic acid) C2H2 (ethylene) C6H6 (benzene)

Salts containing polyatomic ions are sometimes represented by a “molecular formula” expressing the total number of atoms of each element that are present when the ions combine in lowest integer terms.

C2H7NO2 (NH4C2H3O2, ammonium acetate) H4N2O2 (NH4NO2, ammonium nitrite)

4.1.2.2. Why do they work that way? Empirical and molecular formulas predate structural formulas, but they actually became more important, not less, once structural formulas appeared on the scene. This was because:

Looking up chemical compounds was hard. If you knew the structural formula for a compound and wanted to look it up in a big chemical dictionary, it was usually pretty easy to find it if the dictionary was organized by molecular

14

formula. That way, you only had to look through the names for the couple dozen isomers that shared a molecular formula.

Sometimes chemists were wrong about their structure determinations. It was helpful to have a formula that you could go on using unchanged if it turned out that the double bond wasn’t where you thought it was, for multiple tautomeric forms of a compound

It’s useful to think of a molecular formula as a general label for a compound rather a specific one, especially when you’re dealing with an organic small molecule that almost certainly has a bunch of isomers.

That is: don’t think to yourself “diethyl ether is C4H10O” but rather “diethyl ether is one of the things in the C4H10O box, along with 1-butanol, 2-butanol, etc.”

C4H10O

1-butanol 2-butanol methylpropan-1-ol diethyl ether methylpropyl ether

etc… [methylisopropyl ether, methylpropan-2-ol, 2-butanol enantiomers]

4.1.2.3. Where they are most often used? Empirical formulas are most often used to represent inorganic salts.

Molecular formulas are most often used to identify molecular entities (organic compounds, covalently bonded inorganic compounds, coordination complexes) and their salts.

They usually show up in database entries for compounds, so you can use them to search for compounds (particularly useful if you suspect that tautomers might be throwing off your search).

You can also use them as a check to reveal if you

15

4.1.2.4. What questions should I ask?

Are we grouping atoms into ions or just listing them element by element?

How are we ordering the atoms? The two most common ways in which empirical and molecular formulas are ordered are:

 From electropositive to electronegative / cation to anion

NaCl CaCO3 AlCl3 Fe2O3

SO2 H2O

NH4C2H3O2 NH4NO2

Exceptions:

NH3

 In the order C, then H, then everything else, in alphabetical order. (This is sometimes called Hill system order.)

ClNa CCaO3 AlCl3 Fe2O3

O2S H2O

C2H7NO2 H4N2O2

H3N

Is this an empirical formula (a ratio of lowest terms) or a molecular formula (the total count of atoms in a particular chemical structure)?

C2H2 (ethylene) C6H6 (benzene)

…or

CH (ethylene and benzene)

Are we referring to a specific , and how do we know which one?

This almost goes without saying: for organic compounds and many inorganic ones, there are almost always a bunch of isomers that share the same molecular formula.

16

4.1.3. Other kinds of compounds and formulas As we mentioned above, most of the general principles behind chemical names and formulas were originally worked out for organic compounds and were later adapted to other sorts of chemical entities.

Molecular formulas for coordination complexes are often written in brackets, in the order [central atom (usually a metal), then negative ligands, then neutral ligands]. They may also be written in Hill system order.

[CoCl3(NH3)3] [CoCl(NH3)5]2+ [CoCl(NH3)5]Cl2

= = =

H9Cl3CoN3 H15ClCoN52+ H15Cl3CoN5

Projection formulas indicate stereochemistry or relative conformation.

D- (anti conformation)

The sequence of a fragment of biological polymer (a polypeptide or nucleic acid) is similar to a condensed formula, since it represents a linear chain of chemical units.

You may come across formulas in which one of these units is expanded.

17

18

4.2. Unit 2: Names

4.2.1. How do they work? There are two kinds of chemical name: trivial names and systematic names. Trivial names identify a compound (or sometimes a few closely related compounds), but provide little or no information about its structure and its relationships to other compounds. A trivial name may be a technical chemical term, or it may be a common name taken from regular, nonscientific language. You can think of acronyms for systematic names (THF, DMSO, and so forth) as a kind of trivial name.

Systematic names indicate the complete constitution of the compound. Systematic names are based on structural formulas. Writing a systematic name involves taking apart a structural formula into subunits, finding the appropriate term for each subunit, and putting those terms together to form the name. You should therefore be able to draw a structural formula for a compound based on its systematic name, by taking apart the name into its subunits and writing down the structural formulas for each of these subunits, connecting them as specified in the name.

Trivial:

Systematic:

19

Semi-systematic names take a trivial name of a related compound as a root and name the compound systematically as a derivative of that compound.

The root of a systematic name indicates the compound’s primary chain or parent compound, and prefixes and suffixes indicate the atoms or groups that are attached to that parent compound.

Locant numbers (and occasionally letters) indicate where these groups are attached to the parent compound (or “substituted” for hydrogen atoms of this parent, hence the term “substituent.”) Stereochemical prefixes – cis and trans, E and Z, R and S – are used to indicate stereochemistry. (If you need a refresher on assigning E/Z and R/S, here’s a primer.)

Several different forms of systematic nomenclature have been used both in the past and in the present. Furthermore, the best-known nomenclature system, that of the International Union of Pure and Applied Chemistry (IUPAC), provides various options for how to name a compound. Therefore, most compounds have more than one systematic name. Fortunately, most systems of nomenclature in wide use are based on more or less the same principles and the same vocabulary.

20

L-threo-Hex-2-enonic acid, γ-lactone

L-3-Keto-threo-hexuronic acid lactone

2-oxo-L-threo-hexono-1,4-lactone-2,3-enediol

(R)-3,4-dihydroxy-5-((S)-1,2-dihydroxyethyl)furan-2(5H)-one

(R)-5-((S)- 1,2-dihydroxyethyl)-3,4-dihydroxyfuran-2(5H)-one

Five systematic names for vitamin C (L-ascorbic acid)

IUPAC recently published rules for determining one Preferred IUPAC Name (PIN) for each compound. The rules for determining these names are rather complicated; however, as we will see shortly, other forms of notation are often used when you need a unique identifier for a compound.

4.2.2. Why do they work that way? Systematic names were originally designed primarily for use in alphabetical indexes of chemical substances. However, the effort to make these names both unambiguous and canonical (see Unit 3.b in this module) for this purpose made many of these names extraordinarily difficult to read, let alone say out loud. Chemists came up with different approaches to systematic nomenclature tailored for different sorts of compounds and different ways of organizing a chemical index; that’s how we ended up with so many different systematic names for the same compound.

Though some chemists initially predicted that systematic nomenclature would completely replace trivial names, this never happened. Trivial names convey little or no chemical information, but they have the advantage over systematic names in many of the qualities that we usually associate with good names: they are short, memorable, pronounceable, and easy to distinguish from other names.

4.2.3. Where they are most often used? Trivial names are used constantly in informal chemical communication. Chemists working together on specific complex compounds will typically develop their own trivial “nicknames” for their compounds of interest.

Systematic names are often required if you want to register a new compound and for compounds discussed in publications. They are typically listed in database records accessible through search applications like PubChem, SciFinder,Reaxys, and ChemSpider, as well as on the Wikipedia pages for chemical substances. However, because of the various different systems of nomenclature in use, because IUPAC names are not unique, because names formed according to now-defunct rules often stick around, and because of human error (a particularly issue in a crowd-curated site like Wikipedia), the systematic names that you find in these locations can sometimes vary.

Sections of systematic and semi-systematic names corresponding to a substructure of interest can be useful in searching for compounds containing that substructure, particularly in non-chemical settings like Google. However, this approach is generally less reliable than substructure searches that accept structural formulas as input.

21

4.2.4. What questions should I ask?

Are there structural ambiguities that the structural formula would clearly indicate but that the systematic name obscures? When you’re dealing with systematic names rather than structural formulas, it’s much harder to recognize when you need to pay attention to delocalization, stereochemistry, and tautomerism. You may wish to sketch a structural formula based on the name (or make use of a computer program that does so) to determine whether any of these factors – particularly stereochemistry – apply.

What system of nomenclature does the name fit within? Are you dealing with an IUPAC name? A Preferred IUPAC Name (PIN)? A CAS index name? A name that describes a structural formula without quite following any specific set of nomenclature rules?

Why am I using a systematic name, anyway? Systematic names are difficult to read and to write. Before you decide to use them, make sure that there isn’t a different chemical identifier that serves your purposes better. (See Unit 3.c below.)

22

4.3. Unit 3: Identifying chemical compounds on computers

4.3.1. How do they work? Three forms of chemical notation have been developed especially for communication on computers and between humans and computers.

Computers most often store information about chemical structure in the form of a connection table. Connection tables do for computers what systematic nomenclature does for human chemists: they take the structural information expressed visually in a structural formula and express it in a form that is easier to read and to order in a list. The difference is that computers can read, sort, search, and group connection tables far faster than humans can do so with systematic names or any other kind of formula or notation.

Connection tables may be generated in various ways and stored in different formats. However, you can think of a connection table, schematically, as a numbered list of entries corresponding to each of the atoms in a compound, where each entry indicates the identity of the atom, the other atoms to which it is linked by bonds, and the order of each of these bonds (single, double, triple).

The following depicts the connection table for benzoic acid in the MOL file format (with structural formula and annotations added):

No matter what form of input and output a computer program uses, anytime a computer does any “thinking” about chemical structure, the analysis most likely makes use of connection tables.

23

Even for a computer, reading connection tables can be a time-consuming process. Therefore, computers use another sort of identifier for cases in which it’s only necessary to communicate the identity of a compound, not its structure. These are called registry numbers.

The most widely used system of chemical identifiers is the Chemical Abstracts Service Registry Number (CAS RN). Just as the trivial names “vitamin C” or “L-ascorbic acid” identify that compound more directly than the systematic name “(R)-5-((S)-1,2-dihydroxyethyl)-3,4-dihydroxyfuran-2(5H)- one,” the CAS RN 50-81-7 identifies this compound in a more simple, straightforward, and reliable way than a connection table.

Connection tables are typically employed behind the scenes of chemical computer programs, out of the user’s view. As the table depicted above demonstrates, even when you can get a look at one, it’s pretty hard to learn much from it until the computer translates it into a more human-readable format.

Human chemists can see and use registry numbers, but a registry number on its own tells you nothing about a compound (unless you happen to have memorized a particular compound’s registry number!).

CAS RNs for many common compounds are available in many places online, including Wikipedia and PubChem. For less well-known compounds, you must have access to SciFinder or another CAS system in order to easily obtain and use CAS RNs. For this reason, open-access unique identifiers have been developed. The PubChem CID and the ChemSpider ID are two such open registry numbers. A somewhat different kind of unique alphanumerical identifier is the InChIKey, derived from the InChI line notation (see below).

CAS RN 65-85-0 InChIKey WPYMKLBDIGXBTP-UHFFFAOYSA- N

There is no way to tell the relationship among compounds – even enantiomers – from registry numbers.

L-lactic acid (S enantiomer) D-lactic acid (R enantiomer) CAS RN 79-33-4 10326-41-7 InChIKey JVTAAEKCZFNVCJ-REOHCLBHSA-N JVTAAEKCZFNVCJ-UWTATZPHSA-N

24

Line notation is designed to express the structure of compounds in a form that is readable and writable (at least in principle) in a straightforward way by both humans and computers. The two most widely-used forms of line notation are InChI and SMILES.

You will learn more about InChI, InChIKey, and SMILES in Module 5.

SMILES O=C(O)C1=CC=CC=C1 InChI InChI=1S/C7H6O2/c8-7(9)6-4-2-1-3-5-6/h1- 5H,(H,8,9)

4.3.2. Different notation for different purposes Different chemical names and formulas serve different purposes. Some help you identify individual compounds. Others describe the structure of a compound very clearly, or help you sort and compare compounds according to their structure.

In an unambiguous system of notation, each name or formula refers to exactly one chemical entity, typically in a way that allows you to draw a structural formula for it. However, each chemical entity might be represented by more than one name or formula. This is true of IUPAC names and SMILES.

A canonical system of notation contains a unique identifier for every chemical entity (a compound, a substructure, a ligand, a monomer, etc.) that can be represented within the system.

A canonical identifier may be the one and only representation for a chemical entity within a system (as with CAS Registry Numbers).

Alternatively, there may be several ways of representing a chemical entity using a certain system of nomenclature or notation, and an additional set of rules or an algorithm may be used to define one of these identifiers as canonical. This is true of Preferred IUPAC Names (PINs) and canonical SMILES (discussed in Module 5).

Ambiguous notation can refer to more than one chemical entity. This is true of most chemical names when used unsystematically (“octane,” used as a common term for all saturated with eight carbon atoms rather than systematically to indicate the straight-chain isomer only). It is also true of empirical and molecular formulas.

25

In general, canonical notation is most reliable if your goal is to identify a compound.

Unambiguous notation generally describes the structure of a compound effectively.

Ambiguous notation is often easier to interpret than canonical or unambiguous notation, and can be useful for sorting and comparing compounds.

Keep in mind: Just because a name or formula is canonical does not mean that it identifies a compound with absolute precision. For example, there are three CAS registry numbers for lactic acid: one for each enantiomer and one for the racemic or unspecified version of the compound. You may need to use all three if you mean to refer to lactic acid in general. Similarly, as will be discussed in module 5, canonical SMILES does not take R/S stereoisomerism into account, so each enantiomer of a compound will have the same canonical SMILES formula.

4.3.3. Where did all of these names and formulas come from? Names or formulas designed for one purpose can also be employed for another one. In fact, the ability to repurpose names and formulas in this way is part of what makes a particular kind of name or formula useful. However, a lot of the confusion that can arise over chemical names and notation arises from lack of awareness of the disjunction between the kinds of things that a name or formula was meant to do and the kind of things that you’re trying do with it.

This sort of re-use of notation happens a lot in cheminformatics – after all, some kinds of cheminformatics analysis weren’t even conceivable when most common forms of chemical names and formula first caught on. But the repurposing of notation isn’t unique to cheminformatics – in fact, as long as chemical names and formulas as we know them have been around, chemists have been re-using names, deciding that they fit other purposes better than the ones for which they were intended, or trying to change them in ways that undermine their original purpose.

For example, in 1892, a few dozen leading European chemists got together for the Geneva Nomenclature Congress, to develop for the first time an international system of organic chemical

26

nomenclature. Some of them wanted unambiguous but non-canonical names, since it was often helpful, especially in teaching, to be able to name a compound in a way that emphasized one or another of its functional group. They wished, for example, to be able to name vitamin C as either a lactone or a tetra-alcohol. Others thought that their nomenclature should be canonical as well as unambiguous. This would make it easier to use indexes of chemical substances. A reader would no longer have to try to think of all of the different names of a compound and check each of them, but would instead know that each compound could be found in an alphabetical list under one and only one name. At Geneva, the latter group triumphed, and the Congress created about sixty rules for translating structural formulas into canonical, unambiguous names.

Almost immediately, however, chemists started using these names in their teaching, research, and all sorts of other settings for which they weren’t designed. About the only place they weren’t used was in alphabetical indexes of chemical substances! The editors in charge of making these indexes found these canonical systematic names to be too difficult to write or to understand, and decided to incorporate a lot of trivial names into their indexes and/or to organize them according to molecular formulas rather than names.

Meanwhile, however, some chemists have become convinced that only the canonical systematic names that followed the Geneva nomenclature rules could guarantee clear, unambiguous communication. When another committee took up the question of international chemical nomenclature standards during the 1920s, it was the chemical editors who opposed canonical names, whereas chemists who were primarily interested in the use of names in teaching and laboratory research supported them. In the end, the result was the non-canonical IUPAC nomenclature rules, which offered numerous different options for systematically naming each compound, in the hope of satisfying as many of the different purposes for which chemists wished to use names as possible. Of course, one purpose that this approach could NOT satisfy was establishing a single identifier for each compound. That is why, over recent years, IUPAC has introduced even more rules for determining a canonical Preferred IUPAC Name for each compound. Both PINs and other changes in IUPAC nomenclature are also oriented toward making systematic names more easily readable by machines.

27

You don’t need to know any of the specifics of this history. What you do need to know is that chemists have been re-using notation and tinkering with how it works to make it fit the new use for a very long time. The lesson: carefully select the best existing kind of name, formula, or notation for your particular needs, be aware of the purpose for which it was originally designed, and think about whether you need to account for any differences between that purpose and yours.

28

4.4. Unit 4: Translating names and formulas

4.4.1. From one chemical representation to another: translation and identifier exchange We can summarize what we’ve learned so far:

- Different names and compounds may be designed to represent different sorts of chemical compounds, different structural features of these compounds, for different purposes. - You can figure out what a particular name or formula DOES and DOES NOT tell you about the structure of a chemical entity by asking the kinds of questions that we have discussed above. - Almost all chemical names and formulas, even ones designed for a very specific purpose, get re-used in other ways. Cheminformatics involves a lot of this sort of re-use.

Often, effective re-use of a particular name or formula involves swapping the identifier that you’ve found for another identifier for the same compound that’s more convenient for your purposes. For example, if you are interested in comparing the structures of a list of compounds for which you have registry numbers, you need to swap those registry numbers for structural formulas, connection tables, or another sort of representation that gives you the structural information you’re looking for.

The final unit of this module provides an overview of how you should think about this process of swapping one kind of identifier for another.

There are two ways of exchanging identifiers: lookup and translation. In the case of lookup, you locate the identifier that you have in an existing database that lists various different identifiers for each compound, and you select the other identifier that you want. This is like using a dictionary.

In the case of translation, you use a set of rules (or a computer uses an algorithm) to take apart one sort of representation of a compound and to create another sort of representation for the same compound.

Like words for the same object in different languages, even when two names or formulas are meant to refer to exactly the same compound, they differ in their connotations. They describe the compound’s structure in more or less specific ways, they emphasize different kinds of family relationships, and they draw upon different ways of understanding chemical objects and phenomena.

Scholars of literature like to emphasize that there is no such thing as a literal, perfect translation of a poem or novel from one language into another. Translation always involves decisions about what aspects of meaning to try to preserve and which to allow to become obscured. Literary language is often purposefully ambiguous, whereas chemical nomenclature and notation is extraordinarily precise. Nevertheless, even when it comes to chemical names and formulas, it is still often true that what one kind of chemical name or formula communicates cannot be perfectly expressed in another kind of name or formula.

29

Of course, this does not mean that translating one kind of name or formula into another is a bad idea. In fact, communication about chemical compounds depends upon chemists and computers constantly engaging in this kind of chemical translation. In this unit, we will discuss how you can approach this process carefully, anticipate where misunderstandings might arise, and take measures to avoid them.

4.4.2. Documentation Above, we discussed how the history of where a system of notation came from and the purpose for which it was designed affects what kinds of chemical information you can express using that sort of name or formula.

Individual names and formulas have their own individual histories, which we call their provenance. Where did you find the name or formula? Who put it there? Who or what created it, and why? Has it already been translated from another format? Can you trace it back to an original experiment, calculation, or hypothesis?

You should keep these questions in mind for your own sake. You should especially keep in mind that the person or computer that you’re communicating with won’t necessarily know the answers to these questions, and that this is a potential source of misunderstanding.

Whenever you exchange one chemical name or formula for another via translation or lookup, you should keep track of:

- The source and the form of the original name or formula - The tools and resources that you used in doing the lookup or (if applicable) the translation.

You may wish to share this information along with your new name or formula itself, in order to avoid miscommunication. Whether or not you do that, it’s your responsibility to keep track of your translation process in this way. Doing so will help you figure out what could have gone wrong when confusion arises (and perhaps to prove that the confusion isn’t your fault!)

4.4.3. Validation Naoki Sakai, a scholar of translation in literature and politics, has written, “Every translation calls for a countertranslation.” Any time you take an idea from one language A and put it into another B, you should think about how someone encountering the idea in the second language would translate it into the first.

This goes for chemical communication as well. When you “translate” a formula or name from one format to another, you should perform a countertranslation: that is, you should take your new name or formula and make sure that you can get back to the one you started with. The same goes for lookup: you should make sure that you can look up the identifier that you started with using the identifier that you generated.

This will help you be aware of what might have gotten lost or inadvertently added in translation. You won’t always be able to completely solve any potential problems that arise. Sometimes, the identifier that you started with and the one that you generate are not equally specific: for example,

30

you can translate a structural formula into a single molecular formula, but you cannot translate that molecular formula back into a structural formula. As we have said, perfect translation is often not possible. But the validation exercises of countertranslation and reverse-lookup will help you be aware of any problems that might arise, so that you can figure out other ways to head them off.

We’ll cover validation in greater detail later on in this course.

4.4.4. Knowing your audience Chemical communication takes a wide variety of forms. Different formats of chemical nomenclature and notation are more appropriate for different settings. Sometimes, it’s pretty clear what format is right (or wrong) in a given situation. Systematic names aren’t usually much good in casual conversation; you can’t do a google search for a sketch of a structural formula; you can’t illustrate a reaction mechanism using trivial names. However, there are plenty of cases in which it takes some thought to figure out what kind of name or formula is most effective for what you want to communicate. In addition to thinking about the object that you’re communicating about, as discussed in the last unit, you should always know with whom or with what you are communicating, and select an appropriate variety of name or formula.

One simple way to think about your audience is in terms of chemists and non-chemists, humans and computers.

Of course, things are a little more complicated than this. Synthetic organic chemists have a certain area of expertise, and materials scientists another. Wikipedia “knows” a fair amount of chemistry because human experts in chemistry have manually added chemical information to many of its pages. Treat this simple table not as a set of boxes into which you must slot all of the different people and programs with which you communicate, but rather as a reminder that you should keep your audience in mind.

31

4.5. Exercises

4.5.1. Sorbic acid is commonly used as a food preservative, often in the form of its calcium, potassium and sodium salts.

a) Write the molecular formula of sorbic acid.

C6H8O2

b) Look up calcium sorbate, potassium sorbate, and sodium sorbate on PubChem. Write the molecular formulas that you find there.

calcium sorbate: C12H14CaO4

potassium sorbate: C6H7KO2

sodium sorbate: C6H7NaO2

c) Typically, in chemical indexes arranged by molecular formula, organic salts are grouped under the molecular formula of the parent acid rather than under their individual molecular formulas. Based on your answers to a) and b), explain why.

Acceptable answers may note that the Ca2+ counterion leads to a doubled atom count for sorbate and a substantially different molecular formula and/or that the metal interposed between H and O in the molecular formulas for the salts.

4.5.2. What kind of structural information can an empirical formula provide? A molecular formula?

Empirical formulas often indicate the grouping of atoms into polyatomic ions.

Molecular formulas indicate the number of atoms in a single molecule of the compound, thus suggesting its size.

32

4.5.3.

Using PubChem’s tool for compound search (go here and click the hexagon to search by structural formula), or other programs of your choice (SciFinder, ChemSpider, Wikipedia (if you dare)), fill in the following table:

molecul Structural formula Systematic SMILES InChI CAS RN ar name formula C5H8 2- CC(=C)C=C InChI=1S/C5H8/c1 78-79-5 methylbut -4-5(2)3/h4H,1- a-1,3- 2H2,3H3 diene C2H7NO 1- C(C(N)O)O InChI=1S/C2H7NO 13053-46-8 2 aminoetha 2/c3-2(5)1-4/h2,4- (this one is ne-1,2-diol 5H,1,3H2 kind of hidden in PubChem) C2H7NO Ammoniu CC(=O)[O- InChI=1S/C2H4O2. 631-61-8 2 m acetate ].[NH4+] H3N/c1- 2(3)4;/h1H3,(H,3,4 );1H3

C6H6O Trick Question Lots Of Compounds C7H12O4 Diethyl CCOC(=O)CC InChI=1S/C7H12O 105-53-3 propanedi (=O)OCC 4/c1-3-10-6(8)5- oate 7(9)11-4-2/h3- 5H2,1-2H3

33

Note: the aminoethanediol example is kind of instructive. The R enantiomer first got into the CAS system in a 2014 patent and the S enantiomer isn’t registered at all. Only the racemic version is accessible in PubChem, and none of them show up in ChemSpider.

4.5.4. Which of above form(s) of notation is/are preferable for:

a) Ordering a specific compound from a supplier CAS RN, perhaps InChI

b) Identifying an important compound in a journal article Structural formula, definitely. Likely systematic name, and you could make a case for the rest of them, really.

c) Sorting a list of a bunch of compounds into chemically-meaningful groups Molecular formula and systematic name, probably. Perhaps structural formula. Definitely not CAS RN and almost certainly not SMILES or InChI.

4.5.5. Search PubChem for lactic acid (racemic) and its two enantiomers, shown above in Unit I figure III.a-d and in Unit 3, Part b.

a) Paste the urls, IUPAC names, and CAS numbers that you find below.

https://pubchem.ncbi.nlm.nih.gov/compound/612 2-hydroxypropanoic acid 50-21-5

https://pubchem.ncbi.nlm.nih.gov/compound/61503 (2R)-2-hydroxypropanoic acid 50-21-5 10326-41-7

https://pubchem.ncbi.nlm.nih.gov/compound/107689 (2S)-2-hydroxypropanoic acid 79-33-4

a) You should see something strange in your answer to Part b). What is it?

Two CAS numbers for the R enantiomer, one of which matches the racemic compound’s CAS number. There is supposed to be only one unique CAS number for each compound.

b) Luckily, the makers of PubChem have followed some of the best practices that we’ve outlined above. What have they done that can help you get to the bottom of the mistake that you’ve discovered in parts b-c)?

34

They documented their translation. We can follow the links that the pubchem record provides to the source of these conflicting CAS numbers, DrugBank. c) EXTRA CREDIT: Follow the clue that PubChem left you as to the source of the error and describe what went wrong. (If you aren’t sure, what to do, click here and here).

The DrugBank record for D-lactic acid (the R enantiomer) is fine. The DrugBank record for racemic lactic acid, however, contains the CAS number for the racemic compound but the structural formula of the R enantiomer. PubChem must have pulled in this CAS number for the R enantiomer’s PubChem record (incorrectly) by matching the connection table or InChI. PubChem must have pulled in this CAS number for the racemic compound (correctly) by matching the name.

35

4.6. Further reading & references

Formulas

Jonathan Brecher, “Graphical Representation Standards for Chemical Structure Diagrams (IUPAC Recommendations 2008),” Pure and Applied Chemistry 80, no. 2 (January 1, 2008), 227–410.

Antony Williams, “Chemical Structures,” in The ACS Style Guide (American Chemical Society, 2006), 375–83.

Neil G. Connelly and Ture Damhus, eds., IUPAC Nomenclature of Inorganic Chemistry (Cambridge: Royal Society of Chemistry, 2005), 53–67. (The “Red Book”)

Wikipedia entry on the Red Book

Compound Interest, http://www.compoundchem.com/ (good examples of effective communication using formulas)

Nomenclature

ACS/CAS

“Names and Numbers for Chemical Compounds,” in The ACS Style Guide (American Chemical Society, 2006), 233–54.

American Chemical Society, Naming and Indexing of Chemical Substances for Chemical Abstracts, 2007 Edition (Columbus, OH: American Chemical Society, 2008).

IUPAC

Henri A. Favre and Warren H. Powell, eds., Nomenclature of Organic Chemistry: IUPAC Recommendations and Preferred Names 2013 (Cambridge: Royal Society of Chemistry, 2014). (The “Blue Book”)

Wikipedia entry on the Blue Book

Neil G. Connelly and Ture Damhus, eds., IUPAC Nomenclature of Inorganic Chemistry (Cambridge: Royal Society of Chemistry, 2005), 53–67. (The “Red Book”)

Wikipedia entry on the Red Book

36