Chemoinformatics and Structural Bioinformatics in Ocaml Francois Berenger1* , Kam Y
Total Page:16
File Type:pdf, Size:1020Kb
Berenger et al. J Cheminform (2019) 11:10 https://doi.org/10.1186/s13321-019-0332-0 Journal of Cheminformatics RESEARCH ARTICLE Open Access Chemoinformatics and structural bioinformatics in OCaml Francois Berenger1* , Kam Y. J. Zhang2 and Yoshihiro Yamanishi1,3 Abstract Background: OCaml is a functional programming language with strong static types, Hindley–Milner type inference and garbage collection. In this article, we share our experience in prototyping chemoinformatics and structural bioin- formatics software in OCaml. Results: First, we introduce the language, list entry points for chemoinformaticians who would be interested in OCaml and give code examples. Then, we list some scientifc open source software written in OCaml. We also present recent open source libraries useful in chemoinformatics. The parallelization of OCaml programs and their performance is also shown. Finally, tools and methods useful when prototyping scientifc software in OCaml are given. Conclusions: In our experience, OCaml is a programming language of choice for method development in chemoin- formatics and structural bioinformatics. Keywords: Chemoinformatics, Structural bioinformatics, Bisector tree, Scientifc software, Software prototyping, Open source, Functional programming, OCaml Introduction inherited so that generic code can be reused between Tere are several schools of thought in computer software components. C++, Java, Eifel, Ruby and programming. Each school is represented by sev- Python are famous members of this family of languages. eral programming languages and some languages are Most Object-Oriented languages use the imperative style multi-paradigm. of programming. In declarative languages (like SQL), a programmer In functional programming, a program is a collection writes a kind of mathematical specifcation of what to of functions. State passing is done explicitly via function compute, and the compiler will automatically derive a parameters. Functional programming has a mathematical program implementing this specifcation. Prolog [1], is taste and dates back to Lisp. Lisp, Scheme, OCaml, F#, also such a programming language where the specifca- Haskell, Scala, Racket and Clojure are representatives of tion is given as a collection of logic predicates. the functional style of programming. Tere are several On the contrary, in imperative programming, the pro- advantages to using functional programming [2]. Since grammer writes in extensive details how to compute the state passing is explicit, functional programs are easy to result he wants. Ada, C, Fortran and Pascal are famous reason about. Tey easily ft in the head of the program- representatives of this style of programming. mer. Some functional programming languages are pure In Object-Oriented programming, data structures and (e.g. Haskell); they guarantee referential transparency, the the allowed operations on them are grouped into classes. fact that an expression can be replaced by its correspond- Classes can be hierarchically organized, and behavior ing value without changing the program behavior. Tere are some articles about the productivity boost associated with functional programming [3, 4]. *Correspondence: [email protected] While there are not many, some functional program- 1 Department of Bioscience and Bioinformatics, Faculty of Computer Science and Systems Engineering, Kyushu Institute of Technology, Iizuka, ming libraries for chemoinformatics do exist. In Haskell, Fukuoka, Japan the ‘smiles’ library [5] provides full support for the Full list of author information is available at the end of the article OpenSMILES specifcation [6]. While the ‘radium’ library © The Author(s) 2019. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creat iveco mmons .org/licen ses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creat iveco mmons .org/ publi cdoma in/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Berenger et al. J Cheminform (2019) 11:10 Page 2 of 13 [7] provides the periodic table plus readers and writ- a function, from modules to module is called a functor ers forf SMILES and condensed formulas. In Scala, the and allows high level generic programming. ‘chem ’ library [8] provides a purely functional chemin- OCaml’s evaluation strategy is strict. All parameters formatics toolkit [9]. to a function are evaluated prior to entering the func- In this article, we concentrate on Objective Caml tion’s body. Te compilation of OCaml programs is fast. 3000 (OCaml [10]), in the context of scientifc software pro- For example, the ∼ OCaml lines (without com- totyping for chemoinformatics and structural bio- ments) of the consent software [14] and its four executa- 3.8 1.1 informatics. OCaml is a general purpose functional bles compile from scratch and link in ∼ s (resp ∼ ) programming language developed at INRIA, the French using dune (version 1.6.2) and a single core (resp. up to national research institute for computer science, robot- all cores) of our desktop computer (16 cores, Intel Xeon ics and applied mathematics. OCaml focuses on expres- 2.1GHz, 64GB RAM, Linux Ubuntu 18.04.1 LTS). siveness and safety. Some of the language’s strengths Strong static types are types which are enforced by the include its type system, with parametric polymorphism compiler. Due to the use of types and garbage collection, (called generics in Java, templates in C++) and type several run-time errors which plague other programming inference. Tanks to Hindley–Milner type inference languages are absent from OCaml programs: null pointer [11, 12], the OCaml programmer is freed from explic- exception, dereference after free, type cast exception, itly providing function parameters and result types. For segmentation fault, unhandled switch cases and most a course on programming languages and types, we refer memory leaks. In functional programming, more com- interested readers to Pierce [13]. OCaml supports user- plex properties can be encoded and statically enforced by defned algebraic data types, records, sums/enums and structuring code using monads [15], which are pervasive pattern matching. Pattern matching is a generalization of in Haskell [16], or by using dependent types (not avail- the switch statement present in other languages. When able in OCaml, but in Coq [17], Idris [18, 19] and Agda pattern matching, a program is driven by the type of the [20]). A function that is guaranteed to produce a result in parameter being matched upon. In OCaml, memory is a fnite time is called total. Functions for which there is managed automatically, by an incremental garbage col- no such guarantee are called partial. For some functions, lector, preventing memory corruption. Interactive use of Idris can check if they are total. However, such advanced OCaml is possible via a read-eval-print loop called the functional programming concepts are out of the scope of OCaml interpreter. Interacting with the interpreter is this article. a standard way to test a function or to check that some Despite not being very popular, OCaml is not a niche functionality provided by a library works the way one language. Most of its academic users work in computer understands it. In addition to its byte-code compiler and science, on compilers and formal methods. But, OCaml interpreter, OCaml ofers a compiler that produces ef- is also used in bioinformatics [21–23], structural bioin- cient executables. Tail-recursive functions are automati- formatics [24–27], chemoinformatics [14, 28], systems cally translated to efcient loops by the OCaml compiler. biology [29–32] and ecotoxicology [33]. OCaml also features an object-oriented layer, with mul- Tere are several industrial users of the language [34] tiple inheritance, parametric and virtual classes. While including Bloomberg, Citrix, Dassault Systèmes, Face- OCaml was initially used to develop symbolic computing book [35], Jane Street (a proprietary high frequency trad- applications, such as automatic theorem provers, compil- ing frm) and Microsoft. ers, interpreters and static program analyzers, it is now OCaml has some successes in the industrial world: used to develop software in many other areas. Lexif’s Modeling Language for Finance [36], the ASTRÉE Functions are frst-class values in OCaml. A function Static Analyzer [37] used by Airbus to certify on-board can be passed as an argument to, or returned by, another software and Microsoft’s static driver verifer [38]. function. OCaml is a multi-paradigm language. For per- OCaml has several successes in the open-source world formance reasons, OCaml ofers many imperative fea- too: the Unison fle synchronizer [39], the MLdonkey tures (exceptions, modifable variables, records, arrays [40] multi-protocol peer-to-peer client, the Coq [17] and loop statements). OCaml built-in data types include proof assistant [41] and FFTW’s symbolic optimizer of not only integers, foating point numbers, booleans, char- fast Fourier transforms [42]. acters and strings but also more advanced data types In the remaining of this article, resources to learn such as tuples, records, arrays and lists. OCaml are listed in “Resources to learn OCaml” sec- Large programs are easy to structure due to modules,