
Rassokhin J Cheminform (2020) 12:10 https://doi.org/10.1186/s13321-020-0415-y Journal of Cheminformatics

REVIEW Open Access

The C++ programming language in cheminformatics and computational chemistry

Dmitrii Rassokhin*

Abstract

This paper describes salient features of the C++ programming language and its programming ecosystem, with emphasis on how the language affects scientific software development. A brief history of C++ and its predecessor, the C language, is provided. The most important aspects of the language that define models of programming are described in greater detail and illustrated with code examples. Special attention is paid to the interoperability between C++ and other high-level languages commonly used in cheminformatics, machine learning, data processing and statistical computing.

Keywords: Programming languages, C, C++, Scientific computing, Computational chemistry, Cheminformatics

*Correspondence: [email protected]
Janssen Research & Development, LLC, 1400 McKean Road, Spring House, PA 19477, USA

Introduction

In recent years, a plethora of high-level domain-specific and general-purpose programming languages have been developed to greatly increase the productivity of programmers working on various types of software projects. Scientific programming, which used to be dominated by Fortran up until about the mid-1980s, now enjoys a healthy choice of tools, languages and libraries that excel in helping solve all types of problems computational scientists and scientific software developers deal with in their everyday work. For example, MATLAB is widely used for numerical computing, R dominates statistical computing and data visualization, and Python is a great choice for a wide range of scientific applications from machine learning and natural language processing to typical cheminformatics tasks like chemical structure search and retrieval, virtual compound screening and molecular property prediction, just to name a few. However, among modern high-level programming languages, C++ plays a special role of being the language that de facto dominates modern scientific software development, even though, at first glance, this may not be so obvious. In this paper, we will briefly describe the history of C++ and focus on its main characteristics that make it so special.

Brief history of C and C++

The predecessor of C++, C, was developed in the early 1970s by Dennis M. Ritchie, then an employee of Bell Labs (AT&T), when Ritchie and his colleagues were working on Unix, a multi-user time-sharing operating system for mainframe computers. Early versions of this now ubiquitous operating system were written in architecture-specific non-portable assembly languages. As Unix was being further extended and gained popularity, the developers realized the need to re-write parts of it in a platform-independent high-level programming language to make the codebase more manageable and easily portable to different computer architectures. Back then, Fortran was one of the most commonly used high-level languages. Being the language of choice for numerical computing, Fortran circa the early 1970s was not suitable for low-level programming due to its verbose flow control

© The Author(s) 2020. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

structures and the absence of direct memory access operations. Fortran was also ill-suited for non-numerical computing, which typically involves defining complex data structures and operations on them, while languages designed for symbolic computing and list processing, such as Lisp, the second-oldest high-level computer language after Fortran, were quite difficult to master, and often required specialized and very expensive hardware to achieve acceptable performance [1]. It is remarkable that one of the first very large and complex cheminformatics software packages, an interactive computer program designed to assist planning syntheses of complex organic molecules called LHASA (Logic and Heuristics Applied to Synthetic Analysis), was largely written in Fortran and contained nearly 30,000 lines of very complex Fortran code [2, 3].

A better alternative for further Unix development was the programming language B, which was derived from BCPL in the 1960s by Ken Thompson for coding machine-independent applications, such as operating systems and compilers for other languages. The B language can be considered the direct predecessor of C. B was much more suitable for operating system development than Fortran, since it provided constructs that map efficiently to typical machine instructions, had a clear and concise syntax and supported efficient direct memory access operations. The main shortcoming of B was the lack of support for data types. In fact, it supported only one type, the architecture-dependent computer word treated as an integer. Therefore, in B, operations on data types other than the machine word (such as, for example, single-byte characters or structures composed of fields) were difficult to implement in a portable way. These shortcomings also made B totally unsuitable as a general-purpose programming language. In the early 70s, Dennis M. Ritchie gradually added support for primitive (integer and floating-point numbers, and characters) and complex (user-defined structures) data types to B and cleaned up its syntax. Eventually, the improved B diverged from the original B so much that it became a different language, which was half-jokingly called C after the next letter of the English alphabet. In 1978, the first edition of the famous “The C Programming Language” book written by Brian Kernighan and Dennis Ritchie was published [4]. The version of the C language described in the book is often referred to as K&R C, after the book authors. The C language quickly gained popularity among operating system and device driver developers. Subsequently, most of the Unix components were rewritten in C. Due to its relative simplicity, portability, and efficiency, the popularity of C soon went far beyond its original intended purpose of operating system development, and it became one of the most commonly used general-purpose programming languages for a range of applications from device drivers, microcontrollers and operating systems to videogames and high-performance data analysis packages.

In 1983, a committee was formed by the American National Standards Institute (ANSI) to develop a standard version of the C language based on K&R C. ANSI published the standard definition in 1989, and it is commonly called “ANSI C”. Subsequently, the ANSI X3.159-1989 C standard has undergone several revisions, the most recent of which (informally named C18) is ISO/IEC 9899:2018 [5].

In the 1970s, the object-oriented programming (OOP) paradigm was quickly gaining popularity. Simula 67, the first programming language to support OOP, was developed primarily for discrete event simulation, process modeling, large scale integrated circuit simulations, the analysis of telecommunication protocols and other niche applications. In 1979, Bjarne Stroustrup, while working towards his Ph.D. in Computer Science at the University of Cambridge, England, used Simula 67 to implement calculations for his research and found the OOP paradigm to be very productive, but all its existing implementations inefficient. At that time, C had already become one of the most used general-purpose programming languages, so Stroustrup got a brilliant idea of adding OOP features to C and started his work on “C with Classes”, the superset of K&R C, which would support object-oriented programming while preserving the portability, low-level functionality and efficiency of C [6]. Early implementations of C with Classes were translators that converted “C with Classes” code into the standard K&R C, which could be compiled by any available C compiler. “C with Classes” was extended by adding, among other important features, improved type checking, operator overloading, and virtual functions. In 1983 Stroustrup renamed “C with Classes” to C++. The ++ operator in the C language is an operator for incrementing a variable, which reflected Stroustrup’s notion of C++ being the next generation of the C language. In 1986, Stroustrup published his famous book called The C++ Programming Language [7], which became the de facto language reference manual. Very soon, C++ started gaining widespread popularity in the developer community, and several good quality C++ compilers and libraries became available for practically all major computer platforms and operating systems.

Probably, the most important C++ release was C++ 2.0 in 1989, documented in The Annotated C++ Reference Manual by Ellis and Stroustrup [8]. C++ 2.0 was a full-fledged object-oriented language with support for multiple inheritance, abstract classes, static member functions, constant member functions and protected class members, templates for generic programming,

exceptions for structured error handling, namespaces, and a Boolean type.

The next important release came in 2011, when the C++11 standard was published. C++11 was augmented with several features affecting runtime performance, most importantly, the “move constructor”, which eliminated the bane of earlier C++, the costly and unneeded copying of large objects when they are passed to or returned from functions by value. C++11 also included a number of significant features for producing terser, more readable code; chief among these are auto variables (removing the need for detailed variable declarations while preserving type safety) and range-based “for” loops (allowing looping over the elements of a container with an almost Python-like syntax).

After the long delay to reach C++11, the C++ Standard Committee has been updating the C++ standard every three years. In 2014, the C++14 standard was published, followed by C++17 in 2017, which, at the time of writing this article, is the most recent revision of the ISO/IEC 14882:2017 standard for the C++ programming language [9]. The next standard release is planned for 2020. The language is quickly evolving to improve code readability and expressive power. For example, lambda-expressions with closures introduced in C++11 and enhanced in C++14 [10], obviously inspired by functional programming languages like Haskell and F#, make it possible to pass function-like objects to generic methods, such as sorting, searching and filtering, which considerably shortens the code using these methods without sacrificing performance. The latest versions of C++ make it easier to write portable code that takes advantage of modern multicore computer architecture by providing facilities to create and manage sequences of instructions executed concurrently with other such sequences (commonly referred to as “threads”) and synchronize memory accesses among different threads running in parallel.

As of 2019, C and C++ remain extremely popular programming languages for a wide range of applications [11]. In scientific programming, including cheminformatics and computational chemistry, scripting languages like Python (general-purpose) or R (statistical applications, predictive modeling and machine learning) have seen an explosion of popularity in recent years; however, as will be discussed further below, it is a very common scenario that Python or R are used to assemble computational workflows from components of numerical, machine learning, statistical, cheminformatics, molecular mechanics and other specialized packages written in C or C++.

C++ programmers enjoy a great ecosystem of development tools. Compilers, debuggers, and integrated development environments, both free and commercial, are easily available for all modern computer platforms. The competition between the two major open source C++ compilers GCC [12] and Clang [13] has led to rapid progress in the quality of the produced code and, importantly, the utility of the feedback provided to programmers in the case of errors, warnings, and diagnostic messages. Both GCC and Clang are widely and easily available on Linux and MacOS computers. Windows does not, by default, come with a C++ compiler, but one can readily download the Microsoft Visual Studio integrated development environment, which includes a C++ compiler, runtime libraries and tools, directly from Microsoft [14]. Compilers that generate very efficient code targeting specific hardware are also available from various vendors. For example, Intel C and C++ compilers are highly optimized for processors that support Intel architectures [15]. Sophisticated integrated development environments that offer built-in C/C++ code editors with syntax highlighting, context-sensitive help, powerful debugging, profiling and refactoring tools, visual interface designers, and various features that facilitate large developer teams working together on large-scale software projects are readily available, both commercial (such as Microsoft Visual Studio [14] and Visual Studio Code [16] from Microsoft or CLion [17] from JetBrains) and open-source, such as, for example, the widely-used Eclipse CDT [18]. Libraries of C and C++ code are available for every programming task imaginable, from low-level hardware control to machine learning and natural language processing.

C++: Basic language features

Let us first discuss basic features of C++, which it inherited from C and which are not related to advanced concepts like object-oriented or generic programming. It should be noted that modern C is not a true subset of modern C++, and a modern C++ compiler will not compile most non-trivial programs written in modern C without at least some minor modifications. However, for the purposes of this paper we can consider modern C++ to be an extension of “classical C with better type safety and without some relatively rarely used features”. In this section, for brevity, C++ will mean “C or C++”.

C++ is primarily a compiled language

Before it can be executed, an entire program in C++ must be “built”, that is, translated to the target machine’s native instructions by a program called a compiler and linked with external pre-compiled libraries by a program called a linker. High-quality compilers perform extensive local and global code optimization and produce very efficient and compact code. Compiled programs do not need any

additional runtime environments to be present on target computers in order to be executed. Compare this to interpreted languages, such as Python, or languages that are typically compiled into and delivered to users as platform-independent intermediate code, such as Java. Python code needs a Python interpreter in order to be run, and programs compiled into the intermediate Java byte code need a Java runtime environment to translate the intermediate code into the host machine instructions at runtime. A large C++ program can take significant time to compile, since every single line of its source code has to be processed by the compiler, regardless of whether it will actually be executed during an invocation of the program. This slows down the development cycle, but typically results in more reliable code, as the compiler can catch many errors at compile time, thus avoiding unpleasant “runtime error” surprises so typical for interpreted languages like Python. Yet another downside of a compiled language is that the executable files produced by a compiler from source code are not portable and will only run on the target platform (that is, the hardware plus the operating system) for which they are compiled, or a binary-compatible platform. Special care must be taken when writing C++ code, specifying compiler options and choosing code libraries to link with to satisfy specific binary compatibility requirements (see, for example, the Wikipedia article on binary compatibility [19] and Red Hat Enterprise Linux 7: Application Compatibility Guide [20] just to get an idea of how complicated the issue of binary compatibility can be). In order to port a C++ program or library to a different platform, the source code must be re-compiled specifically for that platform. Since nowadays C++ compilers exist for all major computer platforms and operating systems, generally, C++ source code is highly portable. However, complex programs written in C++ using non-standard or poorly supported language features or having dependencies on code libraries that have not been widely ported, or relying on specific machine or OS features, such as, for example, the machine word size, byte order, or the support for certain specific CPU instructions, can be extremely difficult to port and may require making changes at the code level by an experienced C++ programmer. There exists a very useful online tool called Compiler Explorer [21], which can compile snippets of programs in many programming languages including C++ with various compilers and options interactively and visualize the compiler output. This makes it a great teaching tool, which can also be used for low-level code optimization.

It should be noted that the C++ language standard does not prescribe that a program in C++ must first be compiled in its entirety into an executable file containing the target platform machine instructions before it can be run. C++ interpreters that allow the execution of C++ code in the interpreted and/or interactive mode line-by-line do exist (for example, Cling [22]), but the very nature of the language, in particular, the static typing, does not play along well with the interactive read-evaluate-print-loop (REPL) execution mode, so C++ interpreters remain very specialized niche tools for quick prototyping and compiler development.

C++ is imperative

Imperative programming is a programming paradigm in which a program consists of statements (or commands to the computer) that change a program’s state. Imperative programming focuses on describing how a program operates, and imperative code closely maps to machine code native to the computer. At the low level, the program state is defined by the contents of memory, and the instructions in the native machine language of the computer tell the hardware how to change the data in memory. Higher-level imperative languages abstract away platform-specific instructions, for example, use variables instead of memory locations and statements written in human-readable notation rather than instruction codes, but still follow the same pattern.

Compare the imperative to the declarative paradigm, which focuses on what the desired result should be and leaves it up to the execution engine to “decide” how to obtain it. A common declarative language familiar to most data scientists is SQL (Structured Query Language), which is designed to manage data stored in a relational database system, such as Oracle or PostgreSQL. For example, a typical SQL ‘select A, B, C from Table1 join Table2 on Table1.K = Table2.FK’ data query statement describes what records to retrieve from which tables in a relational database, but does not instruct the database engine how to do this, and the implementations of SQL query processors can be vastly different between different database engines.

C++ is procedural

A procedural programming language is an imperative programming language which supports the concept of procedures and subroutines isolating segments of code into reusable units that can be “called” to perform individual tasks. Procedures and subroutines are known as functions in C or C++. A C++ function can take zero or more parameters (sometimes called arguments) and return zero or one value.

C++ supports structured programming

Structured programming languages provide intuitive mechanisms to control the flow of a program (that is, the order in which statements are executed). The structured

flow control statements in C++ are similar to the ones found in many other structured programming languages. These are if/else for implementing branching logic, and for, while, and do/while for implementing iterations (loops). C++ does have the notorious goto statement that can be used to pass control to an arbitrary location within a function in a “non-structured” way, but it is rarely used.

C++ has lexical variable scope

Like most modern languages, C++ uses lexical scoping for variables and functions. A variable or function in C++ may only be referenced from within the block of code in which it is declared. The scope is determined when the code is compiled. The opposite of lexical scope, dynamic scope, refers to the scope of a variable defined at run time and depending upon the program state when the name of a variable is encountered.

C++ is statically typed, but not type-safe

The compiler does the type checking when a C++ program is being compiled. This helps detect common programming mistakes. In dynamically typed languages (such as, for example, Python or JavaScript) the types of variables and functions are checked at run-time, which allows for extra flexibility and sometimes shortens the code, but often results in runtime errors when an operation or function is applied to an object of inappropriate type. It should be noted that C++ is not a type-safe language. C++ compilers will allow many operations on typed variables that might lead to undefined behavior or errors, but usually the programmer must “let the compiler know” his or her intention, for example, by “casting” a pointer to a memory location to a certain type. This comes in very handy in low-level programming where efficient access to hardware is a must, but the programmers are expected to know what they are doing, since errors arising from unsafe type conversions are notoriously difficult to debug and are often platform-dependent.

C++ has facilities for low-level memory manipulation

C++ provides operations on pointers to arbitrary memory locations, which makes C++ a perfect choice for programming operating systems, embedded systems and device drivers. For instance, a peripheral input/output device driver may map (or associate) the memory and registers of the controlled device with certain reserved addresses [12]. To control the device, the device driver assigns values having special meaning according to the device specifications to those reserved memory locations. For example, the following statement in the driver’s code (assuming it is implemented in C or C++) sets the byte at the memory location 40008000 (in hexadecimal notation) to 1.

    *(char*)0x40008000 = 1;

The char data type in C/C++ is the smallest addressable unit of the machine (one byte consisting of eight bits on most modern computers). The (char*) is the type cast operator telling the compiler to interpret 0x40008000 as a pointer to a byte at the memory location 0x40008000, and the prefix * (the asterisk character) is the pointer dereferencing operator used to access (read or write) the value stored at that location.

Manipulating data via memory pointers in C++ is a very common practice not only in low-level system programming, but also in the implementation of a wide variety of algorithms and data structures with minimum possible overhead. Common vector-type data structures such as vectors, matrices and character strings are efficiently represented in C++ by contiguous memory blocks containing data of a certain type, and C++ provides very terse syntax for operations on these memory blocks. For example, finding the position of a character in a zero-terminated C string using C pointer operations can be done with just one line of code, the while loop in the code snippet shown below:

    // Text following double forward slash to
    // the end of line is a comment.
    const char* text = "this is a C string";
    const char* p = text;
    const char charToFind = 'C';
    // While p is not pointing to the terminating zero character
    // and p is not pointing to the character to find,
    // advance the pointer to point to the next character.
    while (*p && *p != charToFind) p++;
    // If p does not point to the zero character, set index to the
    // position of the found character, otherwise, set it to -1.
    int index = *p ? (int)(p - text) : -1;

C++ has deterministic memory allocation and de-allocation

The lifetime of objects in C++ is deterministic and defined by the programmer. This eliminates the overhead of “garbage collection”, where the runtime environment (such as, for example, the Java virtual machine or Python interpreter) must trace the lifetime of objects during the program execution and, when an object is no longer used, free up the resources associated with it [23]. It also allows placing an object at a specified memory address. This makes C and C++ particularly suitable for writing code for resource-limited systems, such as real-time systems and microcontrollers. An example illustrating C/C++ deterministic heap and stack [24] memory management follows.
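Before the annotated listing, two points from the paragraph above can be sketched in compilable form. This is a minimal illustration under stated assumptions, not code from the article: the helper names construct_int_at and sum_heap_block are hypothetical. The first helper uses the standard placement form of new to construct an int at an address supplied by the caller; the second allocates a heap block with new[] and releases it with delete[] at a point chosen by the programmer.

```cpp
#include <cassert>
#include <cstddef>
#include <new>   // declares the placement form of operator new

// Hypothetical helper (not from the article): constructs an int inside
// caller-provided storage and returns a pointer to it. No memory is
// allocated; the object lives at the address passed in.
int* construct_int_at(void* storage, int value) {
    return new (storage) int(value);
}

// Hypothetical helper: allocates `count` zero-initialized ints on the heap,
// sets element 5 to 1 when it exists, sums the block, and frees it.
int sum_heap_block(std::size_t count) {
    int* block = new int[count]();   // () value-initializes every element to 0
    if (count > 5) {
        block[5] = 1;
    }
    int total = 0;
    for (std::size_t i = 0; i < count; ++i) {
        total += block[i];
    }
    delete[] block;                  // the block is released exactly here
    return total;
}
```

A caller might pass a suitably aligned buffer, for example one declared as alignas(int) unsigned char storage[sizeof(int)], to construct_int_at; the returned pointer then refers to that same storage. Because placement new performs no allocation, nothing has to be freed afterwards for a trivially destructible type such as int.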

    // Note: this is not a complete C++ program.
    // Include the header files containing declarations of
    // the standard C library functions malloc,
    // free and memset called below.
    #include <stdlib.h>
    #include <string.h>
    // C++ uses curly braces (also referred to as just "braces"
    // or as "curly brackets") to group declarations and statements
    // into blocks.
    { // Code block begins.
        // Compiler reserves space for ints1
        // to hold a memory address
        // (commonly referred to as
        // a pointer). Typically,
        // the pointer size is 4 or 8 bytes on
        // most modern computer platforms.
        // C-style dynamic memory allocation:
        // allocate a block sufficient to
        // accommodate 10 integers and store
        // its address in ints1.
        // malloc, memset and free are standard
        // C functions.
        int* ints1 = (int*)malloc(sizeof(int) * 10);
        // Set all array elements to 0
        memset(ints1, 0, sizeof(int) * 10);
        // Set the element at index 5 to 1
        ints1[5] = 1;
        // Some other code goes here...
        // ...
        // When we no longer need the previously
        // reserved memory block pointed to
        // by the value stored in ints1, we should
        // free the block and release the memory.
        // Note: memory areas reserved with malloc
        // can be used outside of the code blocks
        // where they are reserved and freed much
        // later, but forgetting to free previously
        // allocated memory is a common source of
        // "memory leaks" in C or C++ programs.
        free(ints1);
        // C++ new and delete operators
        // can be used instead of
        // C-style malloc and free:
        int* ints11 = new int[10];
        // ...
        delete[] ints11;
        // Some other code potentially calling
        // the malloc and free functions and
        // reusing the memory area we have just
        // released can go here...

        // The following statement will compile,
        // but will likely cause a runtime
        // error (a crash with core dump
        // or memory access violation, or
        // an erratic program behavior, which
        // may include producing incorrect results):
        ints1[5] = 1;
        // Memory reserved for ints1 (4 or 8 bytes,
        // depending on the pointer size on
        // the host platform) is deterministically
        // freed when ints1 goes out of scope.
    } // Code block ends.

    { // Code block begins.
        // Compiler reserves a block of memory
        // for the local variable to hold 10
        // integers (on stack in
        // nearly all C implementations).
        // Note that stack size is relatively small
        // on most systems and rarely exceeds 8MB.
        int ints2[10];
        // Set all array elements to 0.
        memset(ints2, 0, sizeof(int) * 10);
        // Set the element at index 5 to 1
        ints2[5] = 1;
        // Do something here...
        // Memory reserved for ints2
        // is deterministically freed
        // at the end of the block of code
        // where ints2 is defined.

        // Writing to a location past the end of the array
        // is a sure recipe for disaster at runtime.
        // The following statement, when executed,
        // will almost certainly result in a crash.
        ints2[110] = 10;
    } // Code block ends.

The ultimate efficiency of C++ achieved via direct access to memory via pointers, explicit deterministic memory management and a very close mapping of C++ language constructs to hardware makes C++ the language of choice in high-performance scientific computing, when implementing CPU- and memory-intensive algorithms, for example, molecular mechanics modeling, machine learning and statistical analysis of very large volumes of data. The efficiency comes at a cost though. Programmer’s errors, such as accessing an array out-of-bounds, or forgetting to properly initialize a pointer, result in random program crashes or unexpected behavior, which, in complex code, can be extremely difficult to locate and fix. Intentionally (for efficiency) or unintentionally skipping checks for common manual memory management errors, such as buffer overflow, in C or C++ code is a security vulnerability that has often been exploited by computer viruses, ransomware and other types of malware [25]. Numerous static and dynamic code analyzers and debuggers exist that help programmers detect memory management errors in C++ code, such as, for example, the GDB debugger [26] and the Valgrind toolkit [27]. Nevertheless, even with the help of the most sophisticated C++ compilers and development tools, memory management errors in non-trivial C and C++ code are hard to avoid even for experienced programmers. As was mentioned above, many high-level languages, such as Python or Java, provide automatic memory management with ‘garbage collection’ and disallow or restrict direct memory access via pointers, thus eliminating the possibility of manual memory management bugs altogether. However, automatic memory management has substantial performance implications and makes these languages unsuitable for low-level programming.

C++ is a high-level language with low-level functionality

C++ offers ample means for programmers to express their ideas at the high or low level of abstraction, depending on the specific task at hand. C++ (especially, its C subset) has very little runtime overhead and, as was already mentioned above, uses deterministic explicit memory allocation/deallocation. If desired, a C++ program can be written in a minimalistic ‘portable assembly language’ style to effectively control the hardware, for example, when programming device drivers. At the same time, C++ allows coding in terms of abstract computer science concepts, such as functions, programmer-defined types and operators, generic types, lambda-expressions and closures,

which makes it suitable for implementing complex algorithms with non-trivial execution flow logic, for example, graphical user interfaces and compilers. In scientific programming, C++ is often used as a high-level object-oriented language, taking full advantage of its expressive power. High-level features of C++ will be described in more detail below in the sections of this paper discussing object-oriented and generic programming.

C++ has pre-processor, which adds some meta-programming capabilities to the language

Before being passed to the compiler, C++ code is pre-processed to expand the so-called pre-processor directives. The most common directives in C++ are expandable macros, file inclusion and conditional compilation directives. A detailed description of these is beyond the scope of this paper, but the interested reader will find a few examples of pre-processing directives in the Hello, World code below. They can be identified in the source by the # (hash) character that marks the beginning of a directive.

Hello, World in C

Before we address more advanced concepts related to object-oriented and generic programming in C++, let us consider a working example of a simple program that demonstrates the “C subset” of C++. The code below shows a slightly extended and commented version of the traditional “Hello, World!” program that can be run from a command line to display “Hello, World!” or a greeting addressed to another name, depending on the command-line arguments it is invoked with. Note the #include <filename> directive that includes the contents of the header file identified by the filename into the current source file.

    // Command-line build and run instructions:
    // Microsoft Visual Studio 2015 or 2017 command line:
    // cl hello.c
    // hello.exe
    // Linux and Mac OS:
    // gcc hello.c -o hello
    // ./hello
    // Note: depending on the compiler installed,
    // you may have to use gcc, cc, g++ or c++.
    // Everything after // and to the end of line
    // is a comment ignored by the compiler.
    // Include the standard header file containing
    // declarations of common input-output
    // functions, such as the formatted print function
    // 'printf'. Expanded by the pre-processor
    // to include the entire contents of stdio.h file
    // verbatim in this compilation unit.
    #include <stdio.h>
    // Pre-processor directive to define a macro
    // PRINT_BUILD_TIMESTAMP. __TIME__ and __DATE__ are the
    // standard pre-defined macros that are set to the
    // current time and date by the pre-processor when compiling
    // the source. \ at the end of line is the line continuation
    // marker.
    #define PRINT_BUILD_TIMESTAMP \
        printf("This file was compiled at %s on %s\n", \
        __TIME__, __DATE__);
    // This is a definition of a function called greet.
    // The function takes one argument and returns no value.
    // When invoked, it prints a greeting addressed to 'whom' to
    // the standard output device (console, if not redirected).
    // 'whom' must be a pointer to a memory location
    // containing a zero-terminated string
    // (a sequence of bytes representing printable
    // characters and ending with the zero byte).
    void greet(const char* whom)
    {
        // printf is declared in the standard
        // C runtime library header file stdio.h,
        // and its implementation is contained in
        // the standard C library file,
        // which is usually named libc and searched
        // automatically by the compiler
        // when the target output is an executable module.
        // "Hello, %s!\n" is the format string. The additional
        // arguments following the format string
        // are formatted and inserted in the resulting output
        // string replacing their respective specifiers.
        // The %s format specifier will cause printf to interpret
        // the corresponding argument
        // as a zero-terminated string. Other format specifiers
        // are %d for signed integers,
        // %e for floating-point numbers to be printed in
        // the scientific notation, etc.
        // '\n' is the "escape sequence" for the end-of-line
        // character. See the 'printf' documentation for additional info.
        printf("Hello, %s!\n", whom);
    }
    // For a stand-alone executable, function 'main'
    // is the program entry point.
    // argc will contain the number of command-line arguments
    // argv will contain the arguments encoded as arrays
    // of characters. The 0th argument is always the path
    // to the executable file itself, so normally argc is
    // greater than or equal to 1.
    // In C and C++ array indexes are zero-based.
    int main(int argc, char* argv[])
    {
        // Using the macro defined above,
        // print the build timestamp to the console.
        PRINT_BUILD_TIMESTAMP;

if (argc > 2) { printf("Error: too many arguments.\n"); return 1; }

// In C++, == is the equality operator, // whereas = is the assignment operator. const char* whom; if (argc == 1) { // If invoked with no command-line arguments, // print 'Hello, World' whom = "World"; } elseif (argc == 2) { // If invoked with exactly one command-line argument, // print 'Hello, ' whom = argv[1]; } greet(whom); return 0; } Rassokhin J Cheminform (2020) 12:10 Page 8 of 16

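The Hello, World listing above exercises file inclusion and macro expansion, but not conditional compilation, the third kind of directive mentioned earlier. The following is a minimal sketch of that mechanism, not part of the original program: the macro name VERBOSE_GREETING and the functions platformName and makeGreeting are hypothetical names introduced only for illustration, while _WIN32 is a macro that compilers targeting Microsoft Windows normally pre-define.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical switch (an assumption of this sketch): defining
// VERBOSE_GREETING, for example with -DVERBOSE_GREETING on the compiler
// command line, selects the longer format string at compile time.
#ifdef VERBOSE_GREETING
#define GREETING_FORMAT "Hello and welcome, %s!\n"
#else
#define GREETING_FORMAT "Hello, %s!\n"
#endif

// Only one of the two return statements is passed on to the compiler;
// the pre-processor discards the other branch entirely.
const char* platformName()
{
#if defined(_WIN32)
    return "Windows";
#else
    return "non-Windows";
#endif
}

// Builds the greeting using whichever format string was selected above.
std::string makeGreeting(const char* whom)
{
    assert(whom != nullptr);
    char buffer[128];
    std::snprintf(buffer, sizeof(buffer), GREETING_FORMAT, whom);
    return std::string(buffer);
}
```

Because the selection happens before compilation, the choice costs nothing at runtime, but each variant has to be built (and tested) separately.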
The Hello, World program also illustrates the typical compile/link/run C++ program execution pattern. In order to produce an executable file from the hello.c source code, one has to compile it to an intermediate module and link the module with standard and custom libraries that contain pre-built implementations of functions used in the source but not defined there. The example is very simple and depends only on the standard C runtime library for the program initialization routines and the implementation of the printf function, so it can be compiled and linked to produce the executable on most modern Unix-like systems (Linux, Mac OS X, FreeBSD, AIX and others) with a very simple command:

    gcc hello.c -o hello

When run with the above command-line options, the compiler will invoke the linker and link the standard C runtime libraries automatically to produce the executable module. Most modern Unix-like systems come with, or make it easy to install, a C/C++ compiler, but, depending on the specific OS version and distribution, you may have to substitute cc, c++ or g++ for gcc. If the compiler discovers one or more syntax errors in the source code during the compilation, it will report them to the programmer and generate no executable file. Running the resulting executable from the command line will output “Hello, World!” or “Hello, <argument>!”.

    ./hello

It should be pointed out again that the compiled executable file contains the machine code for the target platform and does not need an interpreter or runtime environment in order to run. However, it is platform-specific and will not run on hardware and operating systems other than the one for which it was built, or its emulator. This contrasts with interpreted languages like Python, where the interpreter translates the program source code into the machine instructions at runtime and immediately executes these instructions. Programs in many languages including C or C++ may be either compiled or interpreted, so being “compiled” or “interpreted” is not an essential property of a language per se. The overwhelming majority of C and C++ implementations are compilers rather than interpreters though. The structure of C/C++, primarily the static typing, makes its use as an interpreted language quite cumbersome and does not realize its full potential as a language for system programming and high-performance computing.

C++: Object-oriented and generic programming

Basic ideas

As mentioned in the Brief history of C and C++ section above, one of the distinctive features of C++ is its extensive support for object-oriented and generic programming. A programming language that serves the purpose of representing a programmer’s ideas in an understandable form to the computer dictates not only a way of representation but also, to a considerable extent, the ideas themselves. All programming languages consist of certain systems of terms and concepts set in a framework into which the programmer subconsciously “squeezes” the program he or she creates as early as during the design stage. In other words, the way a program is coded dictates to a considerable extent the way the program is designed. One cannot free oneself from a language’s dictates, nor is it necessary to do so. The desired solution is to use a computer language that closely supports the system of concepts on which we base our vision of the world; the path from design to implementation then becomes easier and the productivity of the labor involved increases.

This is exactly what object-oriented programming (OOP) suggests. OOP demands an object-oriented approach to program design, the so-called object-oriented design (OOD), that, in turn, successfully exploits our natural human abilities of classification and abstraction. For instance, in speaking the word “window” we imply that something can be seen through it. Both a window in a house through which we view a street and a “window” on a computer screen, which is just a (usually rectangular) area with distinct boundaries containing various graphical elements drawn by a program, possess that property. So, these window “instances” can be thought of as belonging to a class (or type, or concept) called “Window”. Classes and objects, inheritance, and hierarchy are intrinsic to human thinking and intuitively understood.

OOD and OOP are really the processes for the design and creation of a specific world (a program) inhabited by objects that are born, change their internal state, interact with each other, and die. OOP requires that the programmer first become a creator who considers a program not as a sequence of actions but as a specific world living its own life.

Rather than thinking in terms of data and procedures, the OOP paradigm encourages thinking in terms of interacting objects that possess certain properties and exhibit certain behaviors.

Let us consider a specific example from the field of cheminformatics. Practically all cheminformatics
toolkits support chemical structure (or molecule) representation based on graph theory. The most natural representation of a molecule is a graph where the atoms are encoded as the graph nodes and the bonds are the graph edges. In the “traditional” non-OOP approach, one would design this program by first defining a data structure that represents the basic graph, for example, as an $N \times N$ square symmetric connection matrix $M$, where $N$ is the number of atoms in the molecule. If atom $i$ is connected to atom $j$, the corresponding elements of the matrix $M_{ij}$ and $M_{ji}$ will contain 1; otherwise, they will contain 0. In addition to the connection matrix, one will need to define data structures to represent properties of each atom and bond, for example, the atomic number and bond type. Having defined the data structures, the developer would define a set of procedures to operate on these structures, for example, to add an atom to the molecule, connect an atom to another atom with a bond, determine how many atoms and bonds are in a molecule, read from and save a molecule into a structure file, and so on. Data in such a program are, so to speak, low-men-on-the-totem-pole, being considered only as a sphere of action for functions.

The OOP paradigm encourages a completely different mode of thinking, based on data abstraction and encapsulation. When designing code to represent molecules in the OOP style, one should focus on data fields representing a state of a molecule and common operations that can be applied to all instances of a molecule. In this train of thought, molecules are represented as objects (or instances) of the abstract data type (or “class”, using C++ terminology) Molecule. In this context, ‘abstract’ means that the type is defined in terms of operations that can be applied to it and the expected behavior of these operations rather than its internal structure and details of its implementation.

Bundling (or encapsulating) the data and methods that operate on that data in one conceptual unit, a class, exposing only operations that define its behavior to the “outside world” and hiding implementation details greatly facilitates code reusability and modularity. For example, in the code snippet below, the adjacency matrix-based molecular graph representation can be replaced with an alternative representation based, for example, on a graph edge list. After such a change, any dependent code using only public methods and fields of Molecule can be re-compiled and used with no modifications.

    #include <cassert>
    #include <cstring>
    #include <iostream>

    class Molecule
    {
    private:
        // Private fields and methods can be accessed only
        // by functions ("methods") that are members of
        // the class where they are defined.

        // Pointer to a memory area where
        // elements of the symmetric matrix
        // representing atom/atom connection
        // table will be stored.
        bool* connectionMatrix_;
        int numAtoms_;
        // ...other data fields...
        // ... private methods ...

        // Translates atom indices to the index into
        // the symmetric connection matrix.
        // Marked as const because it does not change
        // any fields of the class.
        int symMatIdx(int atomIdx1, int atomIdx2) const
        {
            // Make sure atom indices are valid:
            assert(atomIdx1 >= 0 && atomIdx2 >= 0 && atomIdx1 != atomIdx2);
            assert(atomIdx1 < numAtoms_ && atomIdx2 < numAtoms_);
            // The ? ... : ... operator is a convenient
            // shorthand for 'if-else'.
            int idx = atomIdx2 < atomIdx1 ?
                atomIdx1 * (atomIdx1 - 1) / 2 + atomIdx2 :
                atomIdx2 * (atomIdx2 - 1) / 2 + atomIdx1;
            return idx;
        }

    public:
        // This is the constructor called to initialize
        // class members when a class instance is being created.
        // More than one constructor can be defined for a class,
        // differing by the number and types of the arguments.
        Molecule()
        {
            // initialize connectionMatrix_ and other data fields...
            // Initially, reserve memory for 10 atoms.
            int nInitCapacity = 10;
            // Size for a symmetric matrix with no diagonal elements:
            int size = nInitCapacity * (nInitCapacity - 1) / 2;
            connectionMatrix_ = new bool[size];
            // Initially, there are no bonds.
            // Set all elements to 0 ('false').
            memset(connectionMatrix_, 0, sizeof(bool) * size);
            numAtoms_ = 0;
        }

        // This is the destructor called when instances of class
        // are being disposed of.
        ~Molecule()
        {
            delete[] connectionMatrix_;
        }

        int addAtom(int atomicNo)
        {
            // ...implementation (incomplete)...
            // A complete implementation would also store atomicNo and
            // grow connectionMatrix_ once the reserved capacity is exceeded.
            // Return index of newly added atom
            // and increment the atom count.
            // Note: 'this' is a C++ keyword
            // that represents a pointer to 'self'
            // which is passed as implicit argument
            // to all non-static class member functions.
            // In this context, the use of this->
            // is optional.
            return this->numAtoms_++;
        }

        void addBond(int atomIdx1, int atomIdx2, int bondOrder = 1)
        {
            // ...implementation (incomplete)...
            int idx = symMatIdx(atomIdx1, atomIdx2);
            connectionMatrix_[idx] = true;
            // ... set other bond properties ...
            // ...
        }

        bool bondExists(int atomIdx1, int atomIdx2) const
        {
            int idx = symMatIdx(atomIdx1, atomIdx2);
            return connectionMatrix_[idx];
        }
        // ... other methods ...
        // ...
    }; // End of Molecule class definition.

    // The statements below would appear inside a function, such as main.
    // Create an instance of Molecule.
    Molecule mol;
    // Call its methods to perform
    // operations on it.
    // Users of class Molecule
    // are not normally concerned about
    // the details of its implementation.
    int a1 = mol.addAtom(6);
    int a2 = mol.addAtom(8);
    mol.addBond(a1, a2);
    bool hasBond = mol.bondExists(a1, a2);
    if (hasBond)
        std::cout << "Bond between " << a1 << " and " << a2
                  << " exists." << std::endl;

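The claim that the matrix-based representation can be swapped for an edge list without touching client code can be made concrete with a short sketch. The version below is not the author's code: it assumes that a std::set of atom index pairs is an acceptable edge list, keeps only the three public methods used in the snippet above, and, like the original, ignores atomic numbers and bond orders.

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <utility>

// Same public interface as the matrix-based Molecule, but the bonds are
// stored as an edge list. The private members are assumptions of this
// sketch; client code never sees them.
class Molecule
{
private:
    std::set<std::pair<int, int>> bonds_;
    int numAtoms_ = 0;

    // Orders the two indices so that (i, j) and (j, i) give the same key.
    std::pair<int, int> bondKey(int atomIdx1, int atomIdx2) const
    {
        assert(atomIdx1 >= 0 && atomIdx2 >= 0 && atomIdx1 != atomIdx2);
        assert(atomIdx1 < numAtoms_ && atomIdx2 < numAtoms_);
        return {std::min(atomIdx1, atomIdx2), std::max(atomIdx1, atomIdx2)};
    }

public:
    // Adds an atom and returns its index (atomic number not stored here).
    int addAtom(int /*atomicNo*/) { return numAtoms_++; }

    // Records a bond between two existing atoms (bond order not stored here).
    void addBond(int atomIdx1, int atomIdx2, int /*bondOrder*/ = 1)
    {
        bonds_.insert(bondKey(atomIdx1, atomIdx2));
    }

    bool bondExists(int atomIdx1, int atomIdx2) const
    {
        return bonds_.count(bondKey(atomIdx1, atomIdx2)) > 0;
    }
};
```

The usage lines from the snippet above (addAtom, addBond, bondExists) compile against this class unchanged. A side benefit of this particular choice is that the standard container manages its own memory, so no hand-written constructor or destructor is needed.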
Data abstraction is one of the key concepts of OOP and OOD. Other key concepts on which OOP is based are inheritance, composition and polymorphism.

Inheritance means deriving more complex data types from simpler ones. C++ offers special mechanisms to successfully exploit that idea. A derived class “inherits” properties and behaviors of its ancestor classes, while adding new properties and behavior. Using class inheritance, one can design complex classes from the general to the specific. Using our Molecule class as an example, it would be natural to think of it as a subclass of a base class called Graph, inheriting the internal representation and graph algorithm functions and adding features specific to molecules, such as methods to access and change properties of atoms and bonds in a molecule, compute basic molecular properties, etc.

Composition in OOP is yet another method of building complex types, alternative to inheritance. Types extended via composition contain instances (or pointers to instances) of other classes implementing additional functionality rather than deriving from those classes. For example, if we want molecules to be able to encode themselves into the SMILES linear notation [28], we can derive a class called, for example, SmilesEncodableMolecule from the base class Molecule and implement the method that returns SMILES-encoded molecules, called, for example, getSmiles(), plus all additional methods and fields needed for its implementation in the derived class. Alternatively, using the composition-based approach, we can re-design the base class Molecule to have a container-type data field to hold pointers to various encoders, develop a class that represents a SMILES encoder, and add an instance of the SMILES encoder to an instance of Molecule at runtime. A detailed discussion of composition vs. inheritance is beyond the scope of this paper, and an interested reader can refer to the Wikipedia article [29], which has multiple references to publications where the pros and cons of either approach in various development scenarios and programming languages are debated.

Polymorphism is a word of Greek origin meaning “having many shapes”. Applied to OOP, this term is usually regarded as the property of an object to respond to an operation according to the object’s type, even if its type is unknown at compile time. For example, we can define types Square and Circle as deriving from the base type Shape and pass a reference or a pointer to an instance of type Shape to some function as an argument (for example, that function may be defined as void f(Shape* s)). Inside that function, we would call the function area() declared (as a virtual function) in the base type Shape and defined in types Square and Circle. Even though at compile time the compiler would have no information on the exact type of the object that can potentially be passed to the function f (as long as it derives from the base type Shape), it will generate the code to invoke the correct type-specific implementation of the function area(), defined either in type Square or in type Circle, depending on the actual type of the object, and applying the correct formula to calculate the area of the object.

Generic programming is a style of programming in which algorithms are written in terms of to-be-specified-later types that are then instantiated when needed for specific types provided as parameters [30]. C++ provides very effective template-based mechanisms for generic programming, which make the generalization possible without sacrificing efficiency, since the compiler generates the type-dependent code, so the type determination and the type-dependent function binding do not have to happen at runtime. A trivial example of defining and instantiating a function template is shown below.

    // Include the definitions of the basic
    // standard input/output types, operators
    // and the string type from
    // the Standard Template Library (STL).
    // Note, most of the STL types are
    // declared in the namespace std
    // and accessed using the std:: prefix.
    #include <iostream>
    #include <string>

    // Define a template for function tmax.
    template <typename T>
    T tmax(T a, T b)
    {
        // if a > b return a, otherwise, return b
        // Note that the actual type that replaces
        // template type parameter T when the template
        // is being instantiated must support operator >.
        return a > b ? a : b;
    }

    void testMax()
    {
        double x = 0.0;
        double y = 1.0;
        // At this point, the compiler
        // instantiates the function template
        // and generates the actual function that
        // takes two parameters of type double
        // and returns a double.
        double m = tmax(x, y);
        // C++ -style console output:
        std::cout << "x: " << x << ", y: " << y
                  << ", tmax(x,y): " << m << std::endl;

        std::string sx = "C";
        std::string sy = "C++";
        // Here, the compiler generates and calls
        // a different function, which operates on
        // arguments of the string type and calls
        // the implementation of operator > for string.
        std::string sm = tmax(sx, sy);
        std::cout << "sx: " << sx << ", sy: " << sy
                  << ", tmax(sx, sy): " << sm << std::endl;
    }
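Returning to the composition-based design described earlier in this section, the following sketch shows one way a Molecule could hold a container of encoders attached at runtime. All names here (MoleculeEncoder, AtomListEncoder, addEncoder, encodeAs) are hypothetical, and the example encoder merely lists atomic numbers; it is a stand-in, not a SMILES writer, although a real SMILES encoder could implement the same interface.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

class Molecule;  // forward declaration for the encoder interface

// Abstract encoder interface (hypothetical name).
class MoleculeEncoder
{
public:
    virtual ~MoleculeEncoder() = default;
    virtual std::string encode(const Molecule& mol) const = 0;
};

class Molecule
{
public:
    int addAtom(int atomicNo)
    {
        atomicNumbers_.push_back(atomicNo);
        return static_cast<int>(atomicNumbers_.size()) - 1;
    }
    int numAtoms() const { return static_cast<int>(atomicNumbers_.size()); }
    int atomicNumber(int atomIdx) const { return atomicNumbers_.at(atomIdx); }

    // Composition: encoders are attached to an instance at runtime
    // instead of being inherited from a base class.
    void addEncoder(const std::string& format,
                    std::shared_ptr<const MoleculeEncoder> encoder)
    {
        assert(encoder != nullptr);
        encoders_[format] = std::move(encoder);
    }

    // Returns the encoded molecule, or an empty string
    // if no encoder is registered under the given format name.
    std::string encodeAs(const std::string& format) const
    {
        auto it = encoders_.find(format);
        return it == encoders_.end() ? std::string()
                                     : it->second->encode(*this);
    }

private:
    std::vector<int> atomicNumbers_;
    std::map<std::string, std::shared_ptr<const MoleculeEncoder>> encoders_;
};

// Placeholder encoder: a comma-separated list of atomic numbers.
class AtomListEncoder : public MoleculeEncoder
{
public:
    std::string encode(const Molecule& mol) const override
    {
        std::string out;
        for (int i = 0; i < mol.numAtoms(); ++i)
        {
            if (i > 0) out += ",";
            out += std::to_string(mol.atomicNumber(i));
        }
        return out;
    }
};
```

With this layout, supporting another output format means writing one more encoder class and registering it; Molecule itself does not change, which is the trade-off usually cited in favor of composition.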

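The Shape, Square and Circle scenario used above to explain polymorphism can be sketched in a few lines. The class layout below is an assumption made for illustration; in particular, f returns the computed area (rather than void, as in the text) so that the result can be checked.

```cpp
#include <cassert>

// Base type: declares area() as a pure virtual function.
class Shape
{
public:
    virtual ~Shape() = default;
    virtual double area() const = 0;
};

class Square : public Shape
{
public:
    explicit Square(double side) : side_(side) { assert(side >= 0.0); }
    double area() const override { return side_ * side_; }
private:
    double side_;
};

class Circle : public Shape
{
public:
    explicit Circle(double radius) : radius_(radius) { assert(radius >= 0.0); }
    double area() const override { return kPi * radius_ * radius_; }
private:
    static constexpr double kPi = 3.14159265358979323846;
    double radius_;
};

// f is compiled knowing only the base type Shape; which area()
// implementation runs is decided at runtime from the actual object.
double f(const Shape* s)
{
    return s->area();
}
```

Calling f with the address of a Square of side 2 yields 4, while calling it with the address of a Circle of radius 1 yields approximately 3.14159, even though f itself contains no type checks.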
Nearly all current C++ distributions include the Standard Template Library (STL), originally developed by Alexander Stepanov, Meng Lee and David Musser [31], whose design follows the object-oriented and generic programming patterns. STL is a set of C++ template classes that provide common programming data structures and functions such as input/output streams, lists, stacks, arrays, etc., and common algorithms, such as searching, sorting, and set operations. In a sense, STL can be thought of as an essential part of C++, and nowadays C++ programmers increasingly rely on STL rather than their own “homegrown” implementations of the basic data structures and algorithms. Many C++ scientific libraries are supplied as ‘header-only’ source code libraries and heavily rely on C++ templates to make the code generalizable so it can work with many compatible data types and various options yet be as efficient as possible. For example, a widely used C++ library for linear algebra is distributed as source files containing definitions of various parameterized data types and functions. C++ compilers can generate machine instructions highly optimized for speed of execution from function and class templates, but under certain code building scenarios template instantiation can introduce binary code bloat, since the compiler may create multiple instantiations of the same templated function or class that are never actually used during the program execution. Yet another notorious shortcoming of templated code is cryptic and hard-to-interpret compile-time error messages. Even a minor accidental typo somewhere in the code using templates can result in a horrific compiler error message several pages long due to very complex templated type definitions and template instantiation rules. It should also be noted that C++ template code libraries must be distributed to users as C++ source files rather than binary code libraries, which means that users of these libraries will be able to browse the source code and study its inner workings. There exist tools that can “obfuscate” C++ source code, making it intentionally hard to understand without changing its functionality, which are sometimes used by authors of closed-source software that must be delivered to customers in the form of C/C++ source files.

An illustrative example of object-oriented and generic C++ code

To summarize this short overview of the C++ language, let us consider a somewhat more elaborate example, which illustrates concepts of object-oriented and generic programming in modern C++. The source code for this example can be obtained from Additional file 1. It can be compiled and run from the command line on most modern platforms that come with a C++ compiler supporting at least C++11. The example contains a very basic implementation of the data type (or “class”, in C++ parlance) Molecule to represent chemical structures in a program. The example is extensively commented, and the reader is encouraged to examine it closely. At the top of the class hierarchy lies the template class Graph, which represents an abstract graph of nodes connected by edges and implements such basic operations as adding nodes, connecting them by edges, and accessing node and edge objects. Classes Atom and Bond represent chemical atoms and bonds, respectively. Class BaseMolecule derives from Graph<Atom, Bond> and adds molecule-specific behavior to the generic graph operations. Note that, when template arguments are provided, they are substituted for the template parameters to obtain a specialization of the template, so the class Graph<Atom, Bond> is a specialization of the template class Graph. In addition to the data members and methods inherited from Graph<Atom, Bond>, BaseMolecule adds methods specific to molecules, such as the functions that add atoms and bonds with certain atom- and bond-specific properties, such as the atomic number and bond order. Class Molecule is derived from BaseMolecule and further extends it with the additional property called name and adds a method to compute the molecule formula. It also redefines (“overrides”) the toString method inherited from the base class. The re-defined method returns a more detailed textual description of an object of the type Molecule compared to the base class method.

A more sophisticated “real-world” C++ code example of the manipulation of chemical structures using the RDKit open-source cheminformatics library [33] can be found in the Getting Started section of the RDKit code repository [34]. The reader is encouraged to configure the build environment and first compile and run the simple molecule.cpp example from this paper, and then attempt to follow instructions in [34] to install RDKit and its dependencies, configure the build environment, build the RDKit library, and then compile and run the example.

Interoperability between C/C++ and other high-level languages

C and C++ are universal languages equally well suited for practically all types of coding, which still stay solidly on top of the most popular languages for system and embedded software programming, office desktop application development, and the implementation of high-performance number-crunching, image and text processing algorithms for real-time stock market data analysis, 3D animation and numerous life science
applications. However, it is a very common practice to implement different parts of a program or software library in more than one programming language. There are many compelling reasons for “mixed language” development, which can be roughly split into the following two categories:

Legacy software library reuse: For example, there exist comprehensive high-quality software libraries for numerical analysis and linear algebra written in Fortran: BLAS [35], NAG [36], and LAPACK [37]. When developing a C or C++ application or a software module that relies on certain numerical algorithms already implemented in one or more of these mature Fortran libraries, optimized and thoroughly tested, the time and effort required to incorporate the existing Fortran modules into C or C++ code is much smaller compared to the time and effort that would be needed to translate these libraries from Fortran to C or C++ in order to develop “mono-lingual” code. Even though automatic converters from Fortran to C do exist, for example, f2c [38] and FABLE [39], the result of conversion of non-trivial Fortran code to C or C++ often leaves a lot to be desired and a substantial amount of work is usually required to clean up, debug and test the output of these automated converters.

Coding convenience and productivity: It is a very common scenario where the main “driver” language in which a particular program or server-side application is implemented is much better suited for a specific task than C or C++, but, in order to achieve required performance characteristics and/or implement low-level hardware access, certain critical modules have to be written in C or C++. For instance, statistical computing is ruled by R [40], MATLAB [41] is a very popular platform/language for numerical computing, a significant fraction of server-side components for various business applications are written in Java, and Python has recently climbed up to the top ranks as a general-purpose language for a wide range of applications from quick prototyping to scripting complex data processing pipelines, and to programming sophisticated large-scale server-side applications. Coding everything entirely in C or C++, even though theoretically possible, would be highly counterproductive, because C and C++ are difficult to use in read-eval-print-loop interactive environments, have a steep learning curve, and lack direct language support for certain domain-specific data types and operations (for example, C++ does not have built-in operations on matrices and data frames found in MATLAB or R). Many cheminformatics, bioinformatics, statistical and machine learning toolkits are mostly implemented in C/C++ and provide “wrappers” to expose their programmatic interfaces to interpreted languages, such as Python, and/or virtual machine-based runtime environments, such as Java or Microsoft .NET. Typical examples are the well-known and widely used CACTVS [42], RDKit [33], OpenBabel [43] and OEChem [44] cheminformatics toolkits, NumPy [45] and other packages that are part of SciPy, a Python-based ecosystem of open-source software for mathematics, science, and engineering [46], and the TensorFlow data processing and machine learning library [47], just to name a few. The computationally-intensive parts of these toolkits are mostly implemented in C and C++, with wrappers provided to make it possible to use all these toolkits in Python, and some of them in Tcl, Java and Microsoft .NET environments. The ThirdDimension Explorer (3DX) data retrieval, analysis and modeling application with “native” support for chemistry and biology developed at Johnson & Johnson Pharmaceutical Research & Development, L.L.C with a significant contribution by the author of this paper was also implemented using this approach. The front-end parts of the application were mostly written in the C# language for the Microsoft .NET platform, with the core chemistry, machine learning and some high-quality graphics implemented in C++ and exposed to the .NET runtime via a wrapper interface [48].

The “reverse” or “hybrid” scenarios, where an application is largely written in C or C++, but an interpreter for a scripting language is embedded in it to provide a convenient interface for program customization and control, are also quite common. The VMD (Visual Molecular Dynamics) molecular modeling and visualization program [49] and the PyMOL molecular visualization system [50] are archetypal examples, as both include embedded Python and Tcl interpreters to allow users to run Python or Tcl scripts from within these applications to perform automated tasks and execute complicated workflows.

Mixing modules written in C/C++ and other languages (for example, Fortran or Ada) compiled into machine code for the same hardware and OS can be relatively easy, especially if the same compiler and toolchain technology is used to build all modules and libraries comprising the target software (for example, LLVM [51] or GNU compiler collection [12]). The modern language Julia, which is quickly gaining popularity in scientific computing [52], has built-in support for calling C and Fortran code (and C++ code exposed through a C-compatible interface) using relatively simple and straightforward syntax. However, a programmatic interface between modules produced from C or C++ source code (and usually packaged as dynamically-loaded libraries on Microsoft Windows platforms or shared object libraries on Linux-based platforms) and modules in other languages which are interpreted at runtime (such as Python) or compiled into virtual machine bytecode (such as C# or Java, often called “managed” runtime environments) requires a “wrapper”. Wrapper code is usually
also written in C or C++ and compiled into a shared object or dynamically linked library, which is then loaded by the host execution environment at runtime. The aim of a wrapper is to allow the calling of functions written in C or C++ and compiled into machine code from other programming languages and calling functions written in other languages from C or C++ code, passing complex data types between functions, coordinating memory management between C/C++ and other language runtime environments, and reusing non-trivial data types across languages. Depending on the complexity of the programmatic interface exposed by a module written in C/C++ to the host runtime environment and the type of that environment (a Python or R interpreter, Java or .NET runtime, etc.), the additional effort required to create the “wrapper” code can greatly vary from trivial (for example, exposing a small set of functions taking arguments of built-in types such as integers or floating-point numbers or pointers to contiguous memory blocks containing data of built-in types) to very substantial (for example, exposing an object-oriented programmatic interface with complex type hierarchies and/or depending on a large number of third-party libraries, which have to be built in a certain way in order to be compatible with the host runtime). There exist multiple tools and libraries that simplify the creation of wrapper interfaces for C/C++ code to expose it to scripting or managed runtime environments. One of the most widely used tools of this kind is SWIG [53], which is very flexible and highly configurable and can generate wrappers for a large number of host languages, such as Lua, Perl, PHP, Python, R, Ruby, Tcl, C#, Java, JavaScript, Go, Modula-3, OCaml, Octave, Scilab and Scheme. SWIG relies on manually written annotated interface definition files and requires programmers to learn the SWIG-specific interface-definition language. Another widely used C/C++ wrapper aid is the Boost.Python library [54], which is limited to interfacing C/C++ modules with only one (albeit very popular) language, Python. Boost.Python is part of Boost, which is a very comprehensive collection of free open source peer-reviewed portable C++ source libraries. As stated in the project documentation, Boost.Python attempts to maximize convenience and flexibility without introducing a separate wrapping language. Instead, it presents the user with a high-level C++ interface for wrapping C++ classes and functions, managing much of the complexity behind the scenes with static metaprogramming. This library is probably the best choice for experienced C++ programmers who are also well-versed in Python. For example, a very popular open-source cheminformatics toolkit RDKit [33] is mostly written in C++ and heavily relies on Boost.Python in the implementation of its Python interface. An alternative to Boost.Python is the pybind11 library, which offers functionality similar to that of Boost.Python, but is much more compact and has far fewer dependencies; however, it can only be used with modern C++ compilers that support C++11 or later standards of C++ [55].

Driven by the increasing popularity of Web-based applications offering rich functionality on par with that of their desktop counterparts but delivered seamlessly over the Web and running completely inside standard Web browsers, several methods of packaging compiled C++ code have been developed to allow its execution inside a browser, driven from JavaScript. They are not yet widely used, but the corresponding standards are emerging and look very promising. An interesting discussion with some working examples of the popular cheminformatics toolkit RDKit [33] adding interactive chemical functionality to web pages can be found in Greg Landrum’s blog [56].

The ability to package modules written in C++ in such a way that they can be accessed from common interpreted or managed runtime environments, such as Python and Java, allows a treasure trove of C/C++ code already written for all kinds of data processing needs to be reused in these environments and saves tremendous amounts of time and effort that would be required to port these libraries from C/C++ to these other languages. It also allows the implementation of performance-critical parts of software in C/C++ and compiling these parts into highly-optimized machine code for maximum performance, which is especially important for interpreted scripting languages like R and Python. However, as the famous “there’s no free lunch” adage goes, mixed-language programming adds a substantial layer of complexity to the software development process. Programs designed to run in a scripting (for instance, Python or R) or managed (for instance, Java or .NET) environment become hardware- and platform-dependent once they include modules compiled into architecture- and OS-specific machine code. For example, a program implemented in “pure” Python will run on any platform without any additional porting effort, as long as a Python interpreter for that platform is available and supports the version of Python language in which the program is written. However, if a Python program depends on a C/C++ library wrapped as a Python package, one has to find a version of that package that has been built specifically for the host hardware and operating system on which the program needs to be executed. And not only that, the package must be built separately for as many different commonly used Python implementations as practically possible. For example, a version of that package built for Python 3.6.4 MSC v.1900 64 bit (AMD64) for Microsoft Windows won’t work with Python 3.6.4 on Ubuntu Linux
18 distribution or even with the same version of Python for Windows but compiled as a 32-bit rather than 64-bit release, let alone using that module with a completely different Python implementation, for instance, IronPython for the .NET platform [57]. This tremendously complicates the package building and publishing process. One may discover that a critical package on which a particular application depends is simply not available for a specific Python implementation (for example, there is a requirement that the software must run on a Windows machine, but the dependency package is only available for Linux), or that two critically important packages are incompatible with each other because they depend on different versions of some third-party shared runtime library. It also happens that the same mixed-language package behaves differently on different hosting platforms. Certain routines implemented in the package may run as expected on one platform but crash with a core dump on some other platform, or, which is often the worst possible scenario, produce different and non-reproducible results. This is most often caused by bugs in the C/C++ source code that are sensitive to such details of implementation as memory alignment, the size of memory pointers and certain primitive built-in data types (for example, 32-bit vs 64-bit), the availability of certain hardware features, etc. Last but not least, there can be significant overhead associated with crossing the boundary and passing data structures (also known as “marshalling”) between the Python or Java runtime and native machine code compiled from C/C++ when calling functions implemented in C/C++ from Python or Java and vice versa. When the performance of mixed-language code becomes an issue, it is generally advised to rewrite the code to minimize the number of calls that cross the language barrier. Using a good code profiler can be a great help and an eye-opening experience when working on mixed-language code optimization. Having said that, we have to point out that mixed-language scenarios are extremely common in scientific software development, and the advantages of reusing existing code and the substantial performance gains that can be achieved by implementing the most critical parts of the code in C/C++ outweigh the disadvantages of the additional complexity of the mixed-language software build and distribution process.

Conclusion: C++ as a language for scientific software development

C++ is a universal multi-paradigm imperative, object-oriented and generic programming language with great library and development tool support and a very large developer community. Modern C++ compilers produce highly optimized executable code that can very efficiently utilize hardware resources. In scientific software development, C++ is widely used to write entire software packages (including stand-alone command-line or GUI applications and server backend components), or to implement just the performance-critical parts of computational algorithms in applications and packages programmed in multiple languages. An excellent review of open-source molecular modeling tools was recently published by Pirhadi et al. [58]. The companion online up-to-date catalog maintained by Koes [59] lists over two hundred toolkits and stand-alone programs for cheminformatics, molecular visualization, QSAR/ADMET modeling, quantum chemistry, ligand dynamics and free energy calculations, and virtual screening and ligand design. The catalog does not classify the software by programming language and mentions the language only for a small fraction of the programs and libraries described in it. However, since the programs listed in the catalog are open-source, the author of this paper was able to browse the respective source code repositories and collect statistics on their implementation languages. As it turned out, most packages listed in the catalog are implemented in C/C++ as the primary language (75), followed by Python (52), Java (34), Fortran (18), JavaScript (9), R (7), Pascal (1), Perl (1), Haskell (1), OCaml (1), PHP (1), Scala (1) and C# (1). Nine programs or libraries out of 52 implemented mostly in Python and three out of seven implemented mostly in R have substantial performance-critical parts written in C or C++. It is worth mentioning that Fortran still remains a popular choice in the development of software heavily relying on numerical methods, such as, for instance, programs for ab initio calculations (11 out of the 21 listed in the catalog) and ligand dynamics and free energy calculations (7 out of 21), but many of those applications whose major parts are programmed in Fortran include some components implemented in C or C++. There is also a clear trend for newer versions of packages that were originally programmed in Fortran to be completely or partially rewritten in C/C++ (quite often, with Python providing the scripting interface), or in Python (with performance-critical parts written in C/C++). Detailed analysis of C++ usage in areas of scientific programming not directly related to cheminformatics or computational chemistry is beyond the scope of this paper, but there has been an apparent tendency in recent years towards mixed-language programming, with general-purpose scripting languages, such as Python or Julia, or domain-specific languages, such as R or MATLAB, being used to implement the majority of a stand-alone application or a software package, and performance-critical and/or hardware-dependent parts programmed in C or C++.

Even though C++ is a universal general-purpose language suitable for most types of scientific programming, it
is rather difficult to learn, lacks built-in support and “shorthand” syntax for operations on common data structures such as, for example, matrices and data frames found in domain-specific languages such as R or MATLAB, and is not a good choice for interactive read-evaluate-print-loop execution mode. Typically, end-user applications or software libraries are coded in C and C++ by experienced programmers with domain expertise combined with technical skills and deep knowledge of hardware architecture. Data scientists, computational chemists, biologists and statisticians tend to use languages like Python, R or MATLAB, which are easier to learn, better suited for interactive execution, and come with comprehensive computing environments supporting package management infrastructure, interactive notebooks containing “live” code and graphics, and a plethora of project management and collaboration tools. However, most of these computing environments themselves are written in C and C++, and a significant fraction of reusable packages for them have critical parts programmed in C or C++. Therefore, it is fair to say that C and C++ still dominate scientific programming, perhaps not in terms of the total number of lines of code written in these languages, but in terms of how many times these lines of code have been executed.

Supplementary information
Supplementary information accompanies this paper at https://doi.org/10.1186/s13321-020-0415-y.

Additional file 1. A simple implementation of type Molecule in C++11 to illustrate the concepts of object-oriented and generic programming.

Acknowledgements
The author thanks Dimitris Agrafiotis, Walter Cedeno and Greg Landrum for taking their time to thoroughly read the manuscript. Their insightful comments and constructive criticism are much appreciated by the author.

Authors’ contributions
The author wrote the entire manuscript. The author read and approved the final manuscript.

Funding
The author has received no funding for this work.

Availability of data and materials
Source code for all code examples included in this manuscript is either contained in the manuscript body or available as Additional files.

Competing interests
The author declares that he has no competing interests.

Received: 8 November 2019 Accepted: 27 January 2020

References
1. Neuss N (2003) On Using Common Lisp for Scientific Computing. In: Bänsch E (ed) Challenges in scientific computing-CISC 2002 lecture notes in computational science and engineering, vol 35. Springer, Berlin
2. Corey EJ, Howe WJ, Pensak DA (1974) Computer-assisted synthetic analysis. Methods for machine generation of synthetic intermediates involving multistep look-ahead. J Am Chem Soc 96(25):7724–7737
3. Pensak DA, Corey EJ (1977) LHASA: Logic and heuristics applied to synthetic analysis computer-assisted organic synthesis. ACS symposium series 61. American Chemical Society, Washington, pp 1–32
4. Kernighan BW, Ritchie DM (1978) The C programming language. Prentice-Hall, Englewood Cliffs, p 228
5. ISO/IEC. ISO/IEC 9899:2018. Programming languages: C. 2018
6. Stroustrup B. Evolving a language in and for the real world: C++ 1991-2006. Proceedings of the third ACM SIGPLAN conference on History of programming languages; San Diego, California. 1238848: ACM; 2007. p 4-1-4-59
7. Stroustrup B (1986) The C++ programming language. Reading, mass. Addison-Wesley, Boston, p 327
8. Ellis MA, Stroustrup B (1990) The annotated C++ reference manual. Reading, mass. Addison-Wesley, Boston, p 447
9. ISO/IEC. ISO/IEC 14882:2017. Programming languages: C++. 2017
10. Lambda expressions. cppreference.com website. https://en.cppreference.com/w/cpp/language/lambda. Accessed 27 Sep 2019
11. Interactive: the top programming languages. IEEE SPECTRUM website: IEEE. https://spectrum.ieee.org/static/interactive-the-top-programming-languages-2019. Accessed 27 Sep 2019
12. GCC, the GNU compiler collection. GCC, the GNU Compiler Collection website. https://gcc.gnu.org/. Accessed 27 Sep 2019
13. Clang: a C language family frontend for LLVM. LLVM website. https://clang.llvm.org/. Accessed 27 Sep 2019
14. Visual Studio Technologies: Develop C and C++ applications. Microsoft(R) Visual Studio website. https://visualstudio.microsoft.com/vs/features/cplusplus/. Accessed 27 Sep 2019
15. Intel® C++ Compiler release notes and new features. Intel® Developer Zone website. https://software.intel.com/en-us/articles/intel-cpp-compiler-release-notes#2019u5. Accessed 14 Dec 2019
16. Visual Studio Code. Microsoft(R) Visual Studio Code website. https://code.visualstudio.com/. Accessed 27 Sep 2019
17. JetBrains CLion. JetBrains(R) CLion website. https://www.jetbrains.com/clion/. Accessed 27 Sep 2019
18. Eclipse CDT (C/C++ Development Tooling). The Eclipse Foundation CDT website. https://www.eclipse.org/cdt/. Accessed 27 Sep 2019
19. Binary-code compatibility. Wikipedia. https://en.wikipedia.org/wiki/Binary-code_compatibility. Accessed 16 Dec 2019
20. Red Hat Enterprise Linux 7: application compatibility guide. Red Hat Knowledgebase website. https://access.redhat.com/articles/rhel-abi-compatibility. Accessed 16 Dec 2019
21. Compiler Explorer. Compiler Explorer website. https://godbolt.org/. Accessed 30 Dec 2019
22. Cling. ROOT Data analytics Framework Cling website. https://gcc.gnu.org/. Accessed 27 Sep 2019
23. Garbage collection (computer science). Wikipedia. https://en.wikipedia.org/wiki/Garbage_collection_(computer_science). Accessed 10 Dec 2019
24. Stack vs Heap memory allocation. GeeksforGeeks website. https://www.geeksforgeeks.org/stack-vs-heap-memory-allocation/. Accessed 10 Dec 2019
25. Buffer overflow. Wikipedia. https://en.wikipedia.org/wiki/Buffer_overflow. Accessed 16 Dec 2019
26. GDB: The GNU Project Debugger. GDB: The GNU Project Debugger website. https://www.gnu.org/software/gdb/. Accessed 16 Dec 2019
27. Valgrind. Valgrind website. https://valgrind.org/. Accessed 16 Dec 2019
28. SMILES: A simplified chemical language. Daylight Chemical Information Systems website. https://www.daylight.com/dayhtml/doc/theory/theory.smiles.html. Accessed 27 Dec 2019
29. Composition over inheritance. Wikipedia. https://en.wikipedia.org/wiki/Composition_over_inheritance. Accessed 13 Dec 2019
30. Generic programming. Wikipedia. https://en.wikipedia.org/wiki/Generic_programming. Accessed 27 Sep 2019
31. Plauger PJ, Lee M, Musser D, Stepanov AA (2000) C++ Standard template library. Prentice Hall PTR, Upper Saddle River
32. Eigen is a C++ template library for linear algebra: matrices, vectors, numerical solvers, and related algorithms. Eigen hosted by Tuxfamily website. http://eigen.tuxfamily.org/. Accessed 17 Dec 2019
33. RDKit: Open-Source Cheminformatics Software. RDKit website. https://www.rdkit.org/. Accessed 08 Oct 2019
34. Getting Started/sample.cpp source code. RDKit repository hosted by GitHub. https://github.com/rdkit/rdkit/blob/master/Code/Demos/RDKit/GettingStarted/sample.cpp. Accessed 16 Dec 2019
35. BLAS (Basic Linear Algebra Subprograms). Netlib Repository at UTK and ORNL website. http://www.netlib.org/blas/. Accessed 27 Sep 2019
36. The NAG Library. The Numerical Algorithms Group (NAG) website: The Numerical Algorithms Group, Ltd. https://www.nag.com/content/nag-library. Accessed 27 Sep 2019
37. LAPACK: Linear Algebra PACKage. Netlib Repository at UTK and ORNL website. http://www.netlib.org/lapack/. Accessed 27 Sep 2019
38. Netlib/f2c. Netlib Repository at UTK and ORNL website. http://www.netlib.org/f2c/. Accessed 08 Oct 2019
39. Grosse-Kunstleve R, Terwilliger T, Sauter N, Adams P (2012) Automatic Fortran to C++ conversion with FABLE. Source Code Biol Med 7:5
40. The R Project for statistical computing. The R Project website. https://www.r-project.org/. Accessed 08 Oct 2019
41. MATLAB. MathWorks website. https://www.mathworks.com/products/matlab.html. Accessed 08 Oct 2019
42. The Xemistry Tools Universe. Xemistry GmbH website. https://xemistry.com/tooluniverse.shtml. Accessed 08 Oct 2019
43. Open Babel: The Open Source Chemistry Toolbox. Open Babel website. http://openbabel.org/wiki/Main_Page. Accessed 08 Oct 2019
44. OEChem TK. OpenEye Scientific website. https://www.eyesopen.com/oechem-tk. Accessed 08 Oct 2019
45. NumPy. NumPy.Org website. https://numpy.org/. Accessed 08 Oct 2019
46. SciPy. SciPy.Org website. https://www.scipy.org/. Accessed 08 Oct 2019
47. TensorFlow. TensorFlow website. https://www.tensorflow.org/. Accessed 08 Oct 2019
48. Agrafiotis DK, Alex S, Dai H, Derkinderen A, Farnum M, Gates P et al (2007) Advanced biological and chemical discovery (ABCD): Centralizing discovery knowledge in an inherently decentralized world. J Chem Inf Model 47(6):1999–2014
49. VMD Visual molecular dynamics. Theoretical and Computational Biophysics Group, The University of Illinois at Urbana–Champaign website. https://www.ks.uiuc.edu/Research/vmd/. Accessed 08 Oct 2019
50. PyMol. PyMOL by Schrödinger website. https://pymol.org/2/. Accessed 08 Oct 2019
51. The LLVM Compiler Infrastructure. The LLVM Project website. https://llvm.org/. Accessed 08 Oct 2019
52. The Julia Programming Language. JuliaLang.org website. https://julialang.org/. Accessed 08 Oct 2019
53. Welcome to SWIG. SWIG.ORG website. http://www.swig.org/. Accessed 13 Oct 2019
54. Boost.Python. Boost.org website. https://www.boost.org/doc/libs/1_71_0/libs/python/doc/html/index.html. Accessed 08 Oct 2019
55. pybind11: Seamless operability between C++11 and Python. Pybind project hosted by GitHub. https://github.com/pybind/pybind11. Accessed 08 Oct 2019
56. Landrum G. Introducing new RDKit JavaScript wrappers. RDKit blog. http://rdkit.blogspot.com/2019/11/introducing-new-rdkit-javascript.html. Accessed 27 Dec 2019
57. IronPython. IronPython.net website. https://ironpython.net/. Accessed 14 Oct 2019
58. Pirhadi S, Sunseri J, Koes DR (2016) Open source molecular modeling. J Mol Graph Model 69:127–143
59. Koes D. Open Source Molecular Modeling. Open Source Molecular Modeling hosted by GitHub. https://opensourcemolecularmodeling.github.io/. Accessed 17 Dec 2019

Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
