The C++ Programming Language in Cheminformatics And
Total Page:16
File Type:pdf, Size:1020Kb
Rassokhin J Cheminform (2020) 12:10 https://doi.org/10.1186/s13321-020-0415-y Journal of Cheminformatics REVIEW Open Access The C programming language in cheminformatics++ and computational chemistry Dmitrii Rassokhin* Abstract This paper describes salient features of the C programming language and its programming ecosystem, with emphasis on how the language afects scientifc++ software development. Brief history of C and its predecessor the C language is provided. Most important aspects of the language that defne models of programming++ are described in greater detail and illustrated with code examples. Special attention is paid to the interoperability between C and other high-level languages commonly used in cheminformatics, machine learning, data processing and statistical++ computing. Keywords: Programming languages, C, C , Scientifc computing, Computational chemistry, Cheminformatics ++ Introduction role of being the language that de-facto dominates mod- In recent years, a plethora of high-level domain-specifc ern scientifc software development, even though, at frst and general-purpose programming languages have been glance, this may not be so obvious. In this paper, we will developed to greatly increase the productivity of pro- briefy describe the history of C++ and focus on its main grammers working on various types of software projects. characteristics that make it so special. Scientifc programming, which used to be dominated by Brief history of C and C Fortran up until about mid-1980s, now enjoys a healthy ++ choice of tools, languages and libraries that excel in help- Te predecessor of C++, C was developed in the early ing solve all types of problems computational scientists 1970s by Dennis M. Ritchie, then an employee of Bell and scientifc software developers deal with in their eve- Labs (AT&T), when Ritchie and his colleagues were ryday work. For example, MATLAB is widely used for working on Unix, a multi-user time-sharing operating numerical computing, R dominates statistical computing system for mainframe computers. Early versions of this and data visualization, and Python is a great choice for a now ubiquitous operating system were written in archi- wide range of scientifc applications from machine learn- tecture-specifc non-portable assembly languages. As ing and natural language processing to typical cheminfor- Unix was being further extended and gained popularity, matics tasks like chemical structure search and retrieval, the developers realized the need to re-write parts of it virtual compound screening and molecular property in a platform-independent high-level programming lan- prediction, just to name a few. However, among modern guage to make the codebase more manageable and easily high-level programming languages, C++ plays a special portable to diferent computer architectures. Back then, Fortran was one of the most commonly used high-level languages. Being the language of choice for numerical *Correspondence: [email protected] computing, Fortran circa early 1979s was not suitable for Janssen Research & Development, LLC, 1400 McKean Road, Spring House, PA 19477, USA low-level programming due to its verbose fow control © The Author(s) 2020. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creat iveco mmons .org/licen ses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creat iveco mmons .org/publi cdoma in/ zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. Rassokhin J Cheminform (2020) 12:10 Page 2 of 16 structures and the absence of direct memory access general-purpose programming languages for a range of operations. Fortran was also ill-suited for non-numerical applications from device drivers, microcontrollers and computing, which typically involves defning complex operating systems to videogames and high-performance data structures and operations on them, while languages data analysis packages. designed for symbolic computing and list processing, In 1983, a committee formed by the American National such as Lisp, the second-oldest high-level computer lan- Standards Institute (ANSI) to develop a standard version guage after Fortran, were quite difcult to master, and of the C language based on the K&R C. ANSI published often required specialized and very expensive hardware the standard defnition in 1989 and is commonly called to achieve acceptable performance [1]. It is remarkable “ANSI C”. Subsequently, the ANSI X3.159-1989 C stand- that one of the frst very large and complex cheminfor- ard has undergone several revisions, the most recent of matics software packages, an interactive computer pro- which (informally named C18) is ISO/IEC 9899:2018 [5]. gram designed to assist planning syntheses of complex In the 1970, the object-oriented programming (OOP) organic molecules called LHASA (Logic and Heuristics paradigm was quickly gaining popularity. Simula 67, the Applied to Synthetic Analysis), was largely written in For- frst programming language to support the OOP, was tran and contained nearly 30,000 lines of very complex developed primarily for discrete event simulation, pro- Fortran code [2, 3]. cess modeling, large scale integrated circuit simulations, A better alternative for further Unix development the analysis of telecommunication protocols and other was the programming language B, which was derived niche applications. In 1979, Bjarne Stroustrup, while from BCPL in the 1960s by Ken Tompson for coding working towards his Ph.D. in Computer Science at the machine-independent applications, such as operating University of Cambridge, England, used Simula 67 to systems and compilers for other languages. Te B lan- implement calculations for his research and found the guage can be considered the direct predecessor of C. B OOP paradigm to be very productive, but all its exist- was much more suitable for the operating system devel- ing implementations inefcient. At that time, C had opment compared to Fortran, since it provided con- already become one of the most used general-purpose structs that map efciently to typical machine, had a clear programming languages, so Stroustrup got a brilliant and concise syntax and supported efcient direct mem- idea of adding OOP features to C and started his work ory access operations. Te main shortcoming of B was on “C with Classes”, the superset of K&R C, which would the lack of support for data types. In fact, it supported support object-oriented programming while preserv- only one type, the architecture-dependent computer ing the portability, low-level functionality and efciency word treated as an integer. Terefore, in B, operations of C [6]. Early implementations of C with Classes were on data types other than the machine word (such as, for translators that converted “C with Classes” code into the example, single-byte characters or structures composed standard K&R C, which could be compiled by any avail- of felds) were difcult to implement in a portable way. able C compiler. “C with Classes” was extended by add- Tere shortcomings also made B totally unsuitable as a ing, among other important features, improved type general-purpose programming language. In the early 70s, checking, operator overloading, and virtual functions. In Dennis M. Ritchie gradually added support for primitive 1983 Stroustrup renamed “C with Classes” to C++. Te (integer and foating-point numbers, and characters) and ++ operator in the C language is an operator for incre- complex (user-defned structures) data types to B and menting a variable, which refected Stroustrup’s notion cleaned up its syntax. Eventually, the improved B dif- of C++ being the next generation of the C language. In ferentiated from the original B so much that it become 1986, Stroustrup published his famous book called Te a diferent language, which was half-jokingly called C C++ Programming Language [7], which became the after the next letter of the English alphabet. In 1978, the de-facto language reference manual. Very soon, C++ frst edition of the famous “Te C Programming Lan- started gaining a widespread popularity in the developer guage” book written by Brian Kernighan and Dennis community, and several good quality C++ compilers and Ritchie was published [4]. Te version of the C language libraries become available for practically all major com- described in the book is often referred to as K&R C, after puter platforms and operating systems. the book authors. Te C language quickly gained popu- Probably, the most important C++ release was C++ larity among operating system and device driver devel- 2.0 in 1989, documented in Te Annotated C++ Refer- opers. Subsequently, most of the Unix components were ence Manual by Ellis and Stroustrup [8]. C++ 2.0 was a rewritten in C. Due to the relative simplicity, portability, full-fedged object-oriented language with support for and efciency, the popularity of C soon went far beyond multiple inheritance, abstract classes, static member its original intended purpose of operating system devel- functions, constant member functions and protected opment, and it became one of the most commonly used class members, templates for generic programming, Rassokhin J Cheminform (2020) 12:10 Page 3 of 16 exceptions for structured error handling, namespaces, development environments, both free and commercial, and a Boolean type.