Dynamic Path Diagrams

E. F. Haghish
Center for Medical Biometry and Medical Informatics (IMBI)
University of Freiburg, Germany
and
Department of Mathematics and Computer Science
University of Southern Denmark
[email protected]

Abstract. Diagrams are frequently used in data analysis and visualization. Stata provides an SEM builder interface to develop SEM path diagrams, execute the analysis, and export the analysis diagram to an image file. However, this process is manual and cannot be reproduced in an automated analysis. This article introduces the diagram package, which renders diagrams written in "DOT" (Graph Descriptive Language) and exports them from within Stata to a variety of formats, including PDF, PNG, JPEG, GIF, and BMP, with adjustable resolution. Rendering diagrams from a markup language makes it possible to create diagrams for different purposes, which extends their applications beyond building SEM models. The article discusses potential applications of the package in data analysis and visualization and provides several examples for generating dynamic path diagrams, producing path diagrams from data sets, and visualizing function calls in Stata ado-programs.

Keywords: Graphviz, graphics, DOT, diagram, reproducible research, visualization, SEM

1 Introduction

Evaluating relationships among variables is one of the primary concepts in statistics. Path diagrams are often used to represent and visualize multivariate analytic models, with arrows indicating the relationships between variables. A one-headed arrow indicates one variable's direct effect on another, whereas a two-headed arrow indicates an unanalyzed correlation. Residual variance is shown by arrows that do not originate from a variable (Mitchell 1992; Li et al. 1975).

Despite their popularity for visualizing relationships between variables in Path Analysis and Structural Equation Models (SEM) (Andersson et al. 1999), the applications of path diagrams are not limited to statistical models. Path diagrams are commonly applied to visualize a summary, process, hierarchical structure, or interconnected system such as a network, as well as to profile statistical packages and visualize computer software in general (Haghish 2016a).

Nevertheless, popular commercial statistical packages such as Stata and SPSS Amos provide an SEM builder interface in which users manually build a path diagram, execute the analysis, and export the resulting diagram to an image file. As a result, this process cannot be reproduced automatically. Moreover, the SEM builder interface limits applications of path diagrams to modeling the relationship between variables, neglecting all other potential uses of path diagrams for data visualization.

While there have been a few attempts to develop diagram tools in R (Haghish 2016c; Murrell 2015; Sadeghi and Marchetti 2012), the Stata community has not made any endeavor to develop a graphical system for creating reproducible diagrams. Furthermore, both statistical languages lack a comprehensive package that can execute SEM models and generate an automated path diagram.

In the current article I introduce the diagram package, which renders diagrams written in DOT (Graph Descriptive Language) (Gansner et al. 2006). DOT is a simple, intuitive, and yet highly customizable markup language for drawing reproducible diagrams (Gansner and North 2000). I also provide several example programs for generating dynamic diagrams to demonstrate the potential applications of the package for data visualization.

2 DOT graph descriptive language

The DOT markup is a graph description language that specifies the relations between the elements of a diagram. DOT was invented in 1988 and was included in the Graphviz software (Ellson et al. 2002). DOT[1] can create two types of diagrams, directed and undirected, as shown in figure 1.
The syntax of the DOT language is available at http://www.graphviz.org/content/dot-language.

Example

. diagram `"digraph {directed -> "node 1"; directed -> "node 2";}"', ///
      export(directed.pdf)
. diagram `"graph {undirected -- "node 1"; undirected -- "node 2";}"', ///
      export(undirected.pdf)

Figure 1: Example of directed and undirected graphs

[1] Initially, the DOT markup, which is rendered with the dot program, could only create directed diagrams; undirected layouts were added later with the neato program (North 2004). Therefore, the DOT markup should not be confused with the dot program.

2.1 Using DOT language for data visualization

DOT is not the only markup language available for generating path diagrams. GraphML (Brandes et al. 2014), Graph eXchange Language (GXL) (Winter et al. 2002; Holt et al. 2006), and Graph Modelling Language (GML) (Himsolt 1997) are only a few of the markup languages available for creating diagrams from a script. However, compared to these languages, DOT has several advantages, particularly for statistics and data science applications.

In contrast to GraphML and GXL, which are based on XML, DOT is easy for humans to read and write because of its simple syntax. In addition, DOT is highly customizable, allowing detailed manipulation of different elements of the graph. Despite its syntactic simplicity, DOT is also a compact language in which many arguments for customizing the nodes, edges, and the overall graph can be written one after another. Moreover, plenty of free software is available for rendering DOT diagrams, such as Graphviz and DotEditor. Another benefit of DOT markup is that it can not only be laid out automatically by several engines with customizable options, but also be drawn exactly as the user wishes by specifying the coordinates of the center of each node (see section 5.1).
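As a rough illustration of the last point, the sketch below pins three nodes at fixed coordinates using Graphviz's pos attribute. The graph name, node names, and coordinates are hypothetical, and the snippet assumes a layout engine such as neato, which honors pinned positions (the dot engine computes its own layout). The approach taken in section 5.1 may differ in detail.

```dot
// Hypothetical layout: node names and coordinates are illustrative only.
graph pinned {
    A [pos="0,0!"];     // the trailing "!" asks the engine to keep this position
    B [pos="2,0!"];
    C [pos="1,1.5!"];

    // undirected edges between the pinned nodes
    A -- B;
    B -- C;
    C -- A;
}
```

Saved to a file, a script like this could presumably be rendered with the using filename form of the diagram command together with engine(neato); see section 4.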
Finally, the DOT markup has been available for many years, which means there is a large and supportive online community to ask for help.

The same script file can be rendered differently by using a different rendering program. The diagram package supports dot, neato, fdp, twopi, circo, and osage (Gansner 2004; Gansner et al. 2006; North 2004). Any of these programs can be specified to render the DOT diagram using the engine(name) option (see section 4). The examples provided in the current article only feature the dot and neato programs. The reader is referred to http://graphviz.org/Documentation.php for documentation, manuals, and journal articles related to the algorithms used in these programs.

2.2 Elements of DOT diagram

The biggest advantage of the DOT language over XML languages such as GraphML and GXL is that it is human readable and based on a simple structure. DOT diagrams are usually created by defining three types of objects: graph, node, and edge (Tsoukalos 2004). The graph object defines the graph type (directed or undirected) and can take arguments for general customization of the diagram, whereas nodes define and customize the nodes of the diagram and edges define the connections between the nodes. DOT can also define a subgraph object, which puts nodes in a cluster and can nest other subgraphs as well.

2.3 DOT customization

The DOT markup language allows extensive customization of any of the four objects (graph, subgraph, node, and edge), either individually or globally, in which case the attributes of all objects of that type in a graph are changed. The customization allows changing shapes, thickness, color, background color, labels, and many more attributes of nodes, edges, subgraphs, and graphs. Although memorizing the arguments imposes a learning curve on DOT newcomers, an IDE for developing graphs interactively can be of great help both for learning the markup and for building and customizing DOT graphs.
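To give a flavor of these customizations, the following sketch sets graph-level attributes, global defaults for nodes and edges, a cluster subgraph, and a few individual overrides. All names, labels, and colors are made up for illustration. The attributes themselves (rankdir, bgcolor, shape, style, fillcolor, penwidth, dir) are standard Graphviz attributes, and the sketch assumes the dot engine, which draws clusters.

```dot
// Illustrative only: names, labels, and colors are invented for this sketch.
digraph customized {
    // graph-level attributes
    rankdir=LR;
    bgcolor="white";

    // global defaults for the nodes and edges declared below
    node [shape=box, style=filled, fillcolor=lightgrey];
    edge [color=gray40];

    // a subgraph whose name starts with "cluster" is drawn as a labeled box
    subgraph cluster_predictors {
        label="Predictors";
        x1; x2;
    }

    // individual overrides of the global defaults
    y [shape=ellipse, fillcolor=white, label="Outcome"];
    x1 -> y [label="b1", penwidth=2];
    x2 -> y [style=dashed];
    x1 -> x2 [dir=both];    // two-headed arrow
}
```

Attributes given in a node [...] or edge [...] statement apply to the nodes and edges declared after it, while attributes attached to a single node or edge override those defaults.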
A list of IDEs for DOT is provided on the Graphviz website (under "list of DOT editors and resources"). I recommend DotEditor (http://vincenthee.github.io/DotEditor/), which allows building and customizing the graph through a graphical interface and is available free of charge for Windows, Macintosh, and Linux operating systems.

Explaining the attributes of each of these objects is beyond the purpose of the current manuscript. The reader is referred to http://graphviz.org/content/attrs for a complete list of graph, node, and edge attributes.

3 diagram package

3.1 Installation

The diagram package relies on the webimage package (Haghish 2016b) to adjust the resolution of the diagram and export it to PNG, GIF, JPEG, BMP, or PDF format. Both packages can be installed from the SSC server:

. ssc install diagram
. ssc install webimage

In addition, the phantomJS software (Friesel 2014) is required, which is a headless WebKit scriptable with a JavaScript API. phantomJS is open-source software, available free of charge for Windows, Mac, and Linux. Using phantomJS, the webimage package can render any webpage or web-based application to an image. The path to the phantomJS software should be given to the diagram package using the phantomjs(str) option (see section 4). Alternatively, the path to the phantomJS software can be defined permanently as shown below:

. diagram setpath "/path/to/phantomJS"

In the example below I first permanently define the path to the phantomJS executable on my Mac and then test the installation of the diagram package by rendering a directed diagram that includes two nodes named Hello and World. To preserve the highest image quality without enlarging the diagram (using the magnify option), I export it to PDF format.

Example

. diagram setpath "/Applications/phantomjs/bin/phantomjs"
. diagram "digraph {Hello -> World}", export(helloworld.pdf)

Figure 2: Example of directed and undirected graphs

4 Syntax

The main command of the package is diagram, which renders DOT diagrams and exports them to image files.

4.1 diagram command

The diagram command can be used in two different ways: by rendering a DOT script directly or by reading a file that contains the DOT script.

diagram using filename | "DOT-script", export(name) replace magnify(real) phantomjs(str) engine(name)

The diagram options are explained below:

export(name) specifies the filename and the file extension of the exported image.
