1

Visual Analytics

Visualizing multivariate data:

High density time-series plots Scatterplot matrices Parallel coordinate plots Temporal and spectral correlation plots Box plots Wavelets Radar and /or polar plots Others……

Demonstrations of DVAtool and Matlab examples

CCA 2017 Workshop on Process Data Analytics, S. Africa, December 2017

2 Visualizing Multivariate Data

Many statistical analyses involve only two variables: a predictor variable and a response variable. Such data are easy to visualize using 2D scatter plots, bivariate histograms, boxplots, etc. It's also possible to visualize trivariate data with 3D scatter plots, or 2D scatter plots with a third variable encoded with, for example color. However, many datasets involve a larger number of variables, making direct visualization more difficult. This demo explores some of the ways to visualize high- dimensional data in MATLAB®, using the Toolbox™. Contents

High Density plots Scatterplot Matrices Parallel Coordinates Plots

In this demo, we'll use the carbig dataset, a dataset that contains various measured variables for about 400 automobiles from the 1970's and 1980's. We'll illustrate multivariate visualization using the values for fuel (in miles per gallon, MPG), acceleration (time from 0-60MPH in sec), engine displacement (in cubic inches), weight, and horsepower. We'll use the number of cylinders to group observations.

>>load carbig >>X = [MPG,Acceleration,Displacement,Weight,Horsepower]; >>varNames = {'MPG'; 'Acceleration'; 'Displacement'; 'Weight'; 'Horsepower'};

Can do a simple graphical plot as follows: gplotmatrix(X);

But this plot is not as revealing or informative as the next plot using the same ‘gplotmatrix’ function. Scatterplot Matrices

CCA 2017 Workshop on Process Data Analytics, S. Africa, December 2017

3 Viewing slices through lower dimensional subspaces is one way to partially work around the limitation of two or three dimensions. For example, we can use the gplotmatrix function to display an array of all the bivariate scatterplots between our five variables, along with a univariate histogram for each variable. >>figure >>gplotmatrix(X,[],Cylinders,['c' 'b' 'm' 'g' 'r'],[],[],false); >>text([.08 .24 .43 .66 .83], repmat(-.1,1,5), varNames, 'FontSize',11); >>text(repmat(-.12,1,5), [.86 .62 .41 .25 .02], varNames, 'FontSize',11, 'Rotation',90);

The points in each scatterplot are color-coded by the number of cylinders: blue for 4 cylinders, green for 6, and red for 8. There is also a handful of 5 cylinder cars, and rotary-engined cars are listed as having 3 cylinders. This array of plots makes it easy to pick out patterns in the relationships between pairs of variables. However, there may be important patterns in higher dimensions, and those are not easy to recognize in this plot.

Parallel Coordinates Plots The scatterplot matrix only displays bivariate relationships. However, there are other alternatives that display all the variables together, allowing you to investigate higher-dimensional relationships among variables. The most straight-forward multivariate plot is the parallel coordinates plot. In this plot, the coordinate axes are all laid out horizontally, instead of using orthogonal axes as in the usual Cartesian graph. Each observation is represented in the plot as a series of connected line

CCA 2017 Workshop on Process Data Analytics, S. Africa, December 2017

4 segments. For example, we can make a plot of all the cars with 4, 6, or 8 cylinders, and color observations by group.

>>Cyl468 = ismember(Cylinders,[4 6 8]); >>parallelcoords(X(Cyl468,:), 'group',Cylinders(Cyl468), ... 'standardize','on', 'labels',varNames)

The horizontal direction in this plot represents the coordinate axes, and the vertical direction represents the data. Each observation consists of measurements on five variables, and each measurement is represented as the height at which the corresponding line crosses each coordinate axis. Because the five variables have widely different ranges, this plot was made with standardized values, where each variable has been standardized to have zero mean and unit variance. With the color coding, the graph shows, for example, that 8 cylinder cars typically have low values for MPG and acceleration, and high values for displacement, weight, and horsepower.

Even with color coding by group, a parallel coordinates plot with a large number of observations can be difficult to read. We can also make a parallel coordinates plot where only the and (25% and 75% points) for each group are shown. This makes the typical differences and similarities among groups easier to distinguish. On the other hand, it may be the outliers for each group that are most interesting, and this plot does not show them at all. >>parallelcoords(X(Cyl468,:), 'group',Cylinders(Cyl468), 'standardize','on', CCA 2017 Workshop on Process Data Analytics, S. Africa, December 2017

5 …'labels',varNames, 'quantile',.25)

Here are examples of High density and correlation colour plots. Such plots allow one to visualize ‘multivariate’ relationships at work. These can also be experienced using ‘hands-on’ software tools such as the DVAtool developed by the U of Alberta group.

CCA 2017 Workshop on Process Data Analytics, S. Africa, December 2017

6 Tag names Time Trends 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1

1 512 Samples

The above plot has a scroll feature so the relationships (or correlation or cause and effect analysis) can be explored. According to Shewhart, data should never be buried, it should be plotted and examined visually.

The above data is from a refinery that had experienced plant-wide oscillations. The root cause of the oscillations was difficult to diagnose. A simplified schematic of the refinery appears below:

The colour-coded temporal correlation plot of the data appears below. It is important to remember that even if 2 variables are oscillating at the same CCA 2017 Workshop on Process Data Analytics, S. Africa, December 2017

7 frequency, their correlation can be zero if they are phase shifted by 90 degrees, that is if they are orthogonal. Thus one has to be careful when looking at correlation analysis in the time domain. Ideally such analysis should be carried out on lag- adjusted variables. The following plots illustrate this concept.

Information overload??

1 -0.185294206 0.027055773 -0.171819975 0.093276222 -0.027994229 0.098787646 -0.020583535 0.032817213 -0.132101895 0.007443381 -0.185294206 1 0.046731844 0.192725409 0.06221338 -0.121835155 0.007001147 0.072787318 0.256501727 -0.332697412 0.301121428 0.027055773 0.046731844 1 0.497561061 0.097669495 -0.042977998 0.073150009 0.01752308 0.446311963 -0.065613281 0.577804913 -0.171819975 0.192725409 0.497561061 1 0.098720947 0.04428365 0.109814696 0.207062089 0.277424246 0.017018503 0.374057968 0.093276222 0.06221338 0.097669495 0.098720947 1 -0.16665222 0.088320804 0.017838289 0.083761143 0.0118807 0.047680363 -0.027994229 -0.121835155 -0.042977998 0.04428365 -0.16665222 1 -0.083250958 -0.113249872 -0.066049714 0.083771862 -0.024636992 0.098787646 0.007001147 0.073150009 0.109814696 0.088320804 -0.083250958 1 0.11444586 0.18562112 -0.115407265 0.14030574 -0.020583535 0.072787318 0.01752308 0.207062089 0.017838289 -0.113249872 0.11444586 1 0.008346999 -0.031577329 -0.029811791 0.032817213 0.256501727 0.446311963 0.277424246 0.083761143 -0.066049714 0.18562112 0.008346999 1 -0.548843663 0.897175324 -0.132101895 -0.332697412 -0.065613281 0.017018503 0.0118807 0.083771862 -0.115407265 -0.031577329 -0.548843663 1 -0.410961064 0.007443381 0.301121428 0.577804913 0.374057968 0.047680363 -0.024636992 0.14030574 -0.029811791 0.897175324 -0.410961064 1 0.074563952Imagine-0.290999002 -0.166431901 looking-0.191828797 at0.137354964 thousands-0.318379029 0.361032714 of0.140153123 numerical0.041112456 0.020956227 -0.121890887 -0.113935147 0.096219583 -0.05286439 -0.015289902 0.019324048 0.016093852 0.079420974 0.30604406 0.012369794 -0.118651352 -0.024855722 -0.004367924values0.415302265 0.212895416of data!0.422025428 0.121021716 -0.096013616 0.208845673 0.062699756 0.305190719 -0.192434667 0.383145433 -0.041892367 0.292542753 0.088478765 0.242138917 0.091224942 -0.044085648 0.030151736 -0.048883722 -0.018817843 0.001127689 0.090586488 -0.012796946 0.428193732 0.148027299 0.374229362 -0.036415302 0.031485232 0.014195414 0.093444237 0.111307503 -0.142990907 0.198198421 -0.072987889Looking0.639653323 0.240224974 for a0.430130045 needle0.025582625 in0.028934191 a haystack?0.074669894 -0.003315806 0.264468969 -0.263129122 0.384595716 -0.081425541 0.650446804 0.264165028 0.526494862 0.024959155 0.016578929 0.086833576 0.043588107 0.234276435 -0.236008471 0.358375999 0.220771143 -0.298196834 -0.095225139 -0.369718105 0.018743973 0.082060135 -0.102865842 0.113548775 -0.131073669 0.077167167 -0.117265396 -0.024869432 -0.190880521 0.216974667 0.238447454 0.043845472 0.02495333 -0.005302967 -0.052698509 0.117642296 0.030820275 0.12686478 -0.030647599 0.099397397 0.060271104 -0.149448693 0.101629373 0.107726218 0.138653219 -0.135938599 0.29944496 -0.143383865 0.324332399 -0.105195337 -0.1838672 -0.148673467 -0.213141108 0.059956448 0.018061058 -0.017384883 -0.104230537 0.034619576 0.092101398 -0.031918291 0.017414435Without-0.037744275 0.109437444 a proper-0.048190722 -0.019413361 data 0.085313857analysis-0.03284987 tool,-0.090511989 such0.148464685 -0.022750366 0.185438833 0.038764349 -0.713947863 -0.052827235 -0.199718753 -0.077278511 0.145219414 -0.072139066 -0.148773823 -0.170502021 0.300067528 -0.221413011 -0.045966454 -0.478970547 0.019338226 0.217745669 -0.040564805 0.097708972 0.112071592 0.174626221 -0.127211683 0.192457187 -0.198799394 0.004199633an-0.557289909 exercise-0.28471802 -0.50961326will generate-0.10351578 0.076308499 more-0.029991343 heat-0.086972975 -0.114870901than -0.040561871light! -0.28198484 0.08599719 0.007025338 0.154179893 0.232546946 0.007835329 -0.086667223 0.156685077 0.023192903 -0.02996409 0.102102098 0.014799742 0.07875747 -0.429241921 -0.344287431 -0.244148911 -0.023116012 -0.027914037 -0.035809918 0.05492866 -0.371758851 0.202342995 -0.500472826 -0.058448605 0.611130166 -0.119253918 0.057550975 0.117777025 -0.183408655 0.082432127 0.10213768 -0.008270005 -0.326478145 -0.006977244 -0.02580405 -0.150816152 0.120793341 0.066957772 0.025269938 0.006378438 -0.075330271 0.07187571 -0.106953812 0.090117193 -0.067876774 0.159960053 -0.46153707 -0.061782981 -0.275932869 0.000719008 0.025290936 0.007663131 -0.053698037 -0.101831356 0.058117208 -0.173365206 0.026409868 -0.563862298 -0.386080423 -0.546002219 -0.101953064 0.031863056 -0.01116078 -0.088594064 -0.187499891 0.05573025 -0.370093034 0.215815744 -0.782048799 -0.133757845 -0.413764596 -0.063567993 0.057507301 0.005673666 -0.045465041 -0.198905102 0.157678121 -0.292768372 0.210107163 -0.762753731 -0.113514688 -0.377067501 -0.080202766 0.058553599 -0.013312535 -0.049311774 -0.178286335 0.116737358 -0.262262234 -0.160786539 0.822759906 0.082111296 0.319369724 0.073604799 -0.119269237 0.02934483 -0.005308015 0.179202768 -0.192898469 0.239070492 0.17932223 -0.707919187 -0.147742155 -0.468437394 -0.086347239 0.046364378 0.023420083 -0.056932385 -0.199912109 0.13527489 -0.320831812 0.149319082 -0.752830887 -0.06056419 -0.348884293 -0.090379995 0.156053854 0.00402923 -0.087617203 -0.010696796 0.026324582 -0.1081197 0.010059676 0.616117603 0.346256652 0.323705183 0.037419038 -0.036121112 0.01731208 0.015883625 0.443358533 -0.280423399 0.559601776 41 ARAMCO talk: 2014 Here is the temporal correlation map of this same data:

Correlation Color Map 37 1 33 32 31 0.9 30 29 28 27 0.8 26 25 24 23 0.7 22 21 19 16 0.6 15 14 13 0.5 12 11 10 9 0.4 Variables 8 7 6 5 0.3 2 1 36 0.2 35 18 17 34 0.1 20 4 3 0 3 42034171835361 2 5 6 7 8 910111213141516192122232425262728293031323337 Variables

Even though many variables appear to be oscillating at the same frequency, the temporal correlation plot is only able to identify a high degree of correlation between a few variables. There should be more than 4 variables that cluster

CCA 2017 Workshop on Process Data Analytics, S. Africa, December 2017

8 together as evident from the parallel temporal and spectral analysis of the data as shown below.

The corresponding colour-coded correlation plot with all highly correlated spectral shapes clustered together appears below:

CCA 2017 Workshop on Process Data Analytics, S. Africa, December 2017

9 Power Spectral Correlation Map 32 1 30 27 26 0.9 14 6 1 29 0.8 23 18 17 31 0.7 22 21 5 0.6 37 36 35 12 0.5 7 34 33 Variables 28 0.4 25 24 20 19 0.3 16 15 13 0.2 11 10 9 8 0.1 4 3 2 0 2 3 4 8 9101113151619202425283334 712353637 521223117182329 1 61426273032 Variables

The plot essentially shows that many variables have very similar or almost identical spectral shapes and are therefore clustered together.

Also explore the following graphics plus 3D plots in Matlab and other software.

Distributions as a way to visualize descriptive statistics

CCA 2017 Workshop on Process Data Analytics, S. Africa, December 2017

10 The distribution of the univariate data string is always insightful. It gives a picture of where the data lies, whether the data is symmetric or not and also which metric to use to describe the gross behavior of the process: mean, mode or median?

(Figure from Wikipedia)

When the distribution is symmetric, the mean, median and the mode are identical.

The mode corresponds to the most frequent observation.

The median gives the location such that 50% of the observations will be above and 50% will be below this point.

The mean is the arithmetic of all the observations.

In the same way the dispersion of the distributions should be carefully considered.

CCA 2017 Workshop on Process Data Analytics, S. Africa, December 2017

11

Box plots:

CCA 2017 Workshop on Process Data Analytics, S. Africa, December 2017

12 In descriptive statistics, a or boxplot is a convenient way of graphically depicting groups of numerical data through their quartiles. Box plots may also have lines extending vertically from the boxes (whiskers) indicating variability outside the upper and lower quartiles, hence the terms box-and-whisker plot and box-and- whisker diagram. Outliers may be plotted as individual points. Box plots are non- parametric: they display variation in samples of a statistical population without making any assumptions of the underlying statistical distribution. The spacings between the different parts of the box indicate the degree of dispersion (spread) and skewness in the data, and show outliers. In addition to the points themselves, they allow one to visually estimate various L-estimators, notably the , , range, mid-range, and . Box plots can be drawn either horizontally or vertically.

CCA 2017 Workshop on Process Data Analytics, S. Africa, December 2017

13 Radar or polar plots:

Tag26 Tag27 Tag28 Tag29 Tag30 Tag31 Tag32 Tag33

Tag25 Tag24 Tag23 Tag34 Tag22 Tag21 Tag35 Tag20 Tag19 Tag36 Tag18 Tag37 Tag17 Tag16 Tag38 Tag15 Tag39 Tag14 Tag40 Tag13 Tag12 Tag41 Tag11 Tag42 Tag10 Tag43 Tag9 Tag44 Tag8 Tag45 Tag7 Tag46 Tag6 Tag5 Tag47 Tag4 Tag49 Tag3 Tag48 Tag2 Tag50 Alarm Tag1 Tag51 Count Tag100 Tag52 Tag99 Tag53 Tag98 Tag97 Tag54 Tag96 Tag55 Tag95 Tag57 Tag94 Tag58 Tag93 Tag59 Tag92 Tag56 Tag91 Tag90 Tag60 Tag89 Tag61 Tag88 Tag62 Tag87 Tag63 Tag86 Tag85 Tag64 Tag84 Tag65 Tag83 Tag82 Tag66 Tag81 Tag79 Tag67 Tag80 Tag78 Tag77 Tag68 Tag76 Tag69 Tag70 Tag71 Tag72 Tag73

Tag74 Tag75

Tag37 Tag38 Tag40 Tag43 Tag47 Tag35 Tag36 Tag48

Tag30 Tag34 Tag27 Tag50 Tag23 Tag26 Tag49 Tag24 Tag21 Tag42 Tag22 Tag32 Tag12 Tag13 Tag63 Tag20 Tag39 Tag19 Tag41 Tag17 Tag68 Tag16 Tag15 Tag75 Tag11 Tag80 Tag8 Tag45 Tag7 Tag74 Tag10 Tag78 Tag9 Tag6 Tag83 Tag5 Tag87 Tag2 Tag88 Tag3 Tag44 Alarm Tag1 Tag76 Count Tag25 Tag84 Tag4 Tag85 Tag46 Tag31 Tag91 Tag67 Tag94 Tag18 Tag96 Tag82 Tag98 Tag28 Tag99 Tag56 Tag93 Tag72 Tag52 Tag14 Tag55 Tag79 Tag33 Tag81 Tag100 Tag29 Tag89 Tag73 Tag51 Tag69 Tag58 Tag53 Tag61 Tag59 Tag90 Tag54 Tag70 Tag97 Tag66 Tag60 Tag77 Tag92 Tag95 Tag64 Tag65 Tag57 Tag62

Tag71 Tag86

CCA 2017 Workshop on Process Data Analytics, S. Africa, December 2017

14

Diagnosis of Plant-Wide Oscillations

Time Trends

37 36

35 34 25

24 20 19

14 13 4

3 2 1

1 400 Samples

CCA 2017 Workshop on Process Data Analytics, S. Africa, December 2017

15

SEA Refinery Data Set

CCA 2017 Workshop on Process Data Analytics, S. Africa, December 2017

16

Power Spectral Correlation Map

CCA 2017 Workshop on Process Data Analytics, S. Africa, December 2017

17

Tags Corresponding to the 1st group

CCA 2017 Workshop on Process Data Analytics, S. Africa, December 2017

18

Continuous Wavelet Example

Signal made of three frequencies (0.05, 0.2 and 0.1 Hz) each present in 128 consecutive samples.

CCA 2017 Workshop on Process Data Analytics, S. Africa, December 2017

19

Application of Wavelets

CCA 2017 Workshop on Process Data Analytics, S. Africa, December 2017