Visualizing Multivariate Data
Total Page:16
File Type:pdf, Size:1020Kb
1 Visual Analytics Visualizing multivariate data: High density time-series plots Scatterplot matrices Parallel coordinate plots Temporal and spectral correlation plots Box plots Wavelets Radar and /or polar plots Others…… Demonstrations of DVAtool and Matlab examples CCA 2017 Workshop on Process Data Analytics, S. Africa, December 2017 2 Visualizing Multivariate Data Many statistical analyses involve only two variables: a preDictor variable and a response variable. Such data are easy to visualize using 2D scatter plots, bivariate histograms, boxplots, etc. It's also possible to visualize trivariate data with 3D scatter plots, or 2D scatter plots with a third variable encoded with, for example color. However, many datasets involve a larger number of variables, making direct visualization more difficult. This demo explores some of the ways to visualize high- dimensional data in MATLAB®, using the Statistics Toolbox™. Contents High Density plots Scatterplot Matrices Parallel Coordinates Plots In this demo, we'll use the carbig dataset, a dataset that contains various measured variables for about 400 automobiles from the 1970's and 1980's. We'll illustrate multivariate visualization using the values for fuel efficiency (in miles per gallon, MPG), acceleration (time from 0-60MPH in sec), engine Displacement (in cubic inches), weight, and horsepower. We'll use the number of cylinders to group observations. >>load carbig >>X = [MPG,Acceleration,Displacement,Weight,Horsepower]; >>varNames = {'MPG'; 'Acceleration'; 'Displacement'; 'Weight'; 'Horsepower'}; Can do a simple graphical plot as follows: gplotmatrix(X); But this plot is not as revealing or informative as the next plot using the same ‘gplotmatrix’ function. Scatterplot Matrices CCA 2017 Workshop on Process Data Analytics, S. Africa, December 2017 3 Viewing slices through lower dimensional subspaces is one way to partially work around the limitation of two or three dimensions. For example, we can use the gplotmatrix function to display an array of all the bivariate scatterplots between our five variables, along with a univariate histogram for each variable. >>figure >>gplotmatrix(X,[],Cylinders,['c' 'b' 'm' 'g' 'r'],[],[],false); >>text([.08 .24 .43 .66 .83], repmat(-.1,1,5), varNames, 'FontSize',11); >>text(repmat(-.12,1,5), [.86 .62 .41 .25 .02], varNames, 'FontSize',11, 'Rotation',90); The points in each scatterplot are color-coded by the number of cylinders: blue for 4 cylinders, green for 6, and red for 8. There is also a handful of 5 cylinder cars, and rotary-engined cars are listed as having 3 cylinders. This array of plots makes it easy to pick out patterns in the relationships between pairs of variables. However, there may be important patterns in higher dimensions, and those are not easy to recognize in this plot. Parallel Coordinates Plots The scatterplot matrix only displays bivariate relationships. However, there are other alternatives that display all the variables together, allowing you to investigate higher-dimensional relationships among variables. The most straight-forward multivariate plot is the parallel coordinates plot. In this plot, the coordinate axes are all laiD out horizontally, instead of using orthogonal axes as in the usual Cartesian graph. Each observation is represented in the plot as a series of connected line CCA 2017 Workshop on Process Data Analytics, S. Africa, December 2017 4 segments. For example, we can make a plot of all the cars with 4, 6, or 8 cylinders, and color observations by group. >>Cyl468 = ismember(Cylinders,[4 6 8]); >>parallelcoords(X(Cyl468,:), 'group',Cylinders(Cyl468), ... 'standardize','on', 'labels',varNames) The horizontal direction in this plot represents the coordinate axes, and the vertical direction represents the data. Each observation consists of measurements on five variables, and each measurement is represented as the height at which the corresponding line crosses each coordinate axis. Because the five variables have widely different ranges, this plot was made with standardized values, where each variable has been standardized to have zero mean and unit variance. With the color coding, the graph shows, for example, that 8 cylinder cars typically have low values for MPG and acceleration, and high values for displacement, weight, and horsepower. Even with color coding by group, a parallel coordinates plot with a large number of observations can be difficult to read. We can also make a parallel coorDinates plot where only the median and quartiles (25% and 75% points) for each group are shown. This makes the typical differences and similarities among groups easier to distinguish. On the other hand, it may be the outliers for each group that are most interesting, and this plot does not show them at all. >>parallelcoorDs(X(Cyl468,:), 'group',Cylinders(Cyl468), 'standardize','on', CCA 2017 Workshop on Process Data Analytics, S. Africa, December 2017 5 …'labels',varNames, 'quantile',.25) Here are examples of High density and correlation colour plots. Such plots allow one to visualize ‘multivariate’ relationships at work. These can also be experienced using ‘hands-on’ software tools such as the DVAtool developed by the U of Alberta group. CCA 2017 Workshop on Process Data Analytics, S. Africa, December 2017 6 Tag names Time Trends 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 1 512 Samples The above plot has a scroll feature so the relationships (or correlation or cause and effect analysis) can be explored. According to Shewhart, data should never be buried, it should be plotted and examined visually. The above data is from a refinery that had experienced plant-wiDe oscillations. The root cause of the oscillations was difficult to diagnose. A simplified schematic of the refinery appears below: The colour-coded temporal correlation plot of the data appears below. It is important to remember that even if 2 variables are oscillating at the same CCA 2017 Workshop on Process Data Analytics, S. Africa, December 2017 7 frequency, their correlation can be zero if they are phase shifted by 90 degrees, that is if they are orthogonal. Thus one has to be careful when looking at correlation analysis in the time domain. Ideally such analysis should be carried out on lag- adjusteD variables. The following plots illustrate this concept. Information overload?? 1 -0.185294206 0.027055773 -0.171819975 0.093276222 -0.027994229 0.098787646 -0.020583535 0.032817213 -0.132101895 0.007443381 -0.185294206 1 0.046731844 0.192725409 0.06221338 -0.121835155 0.007001147 0.072787318 0.256501727 -0.332697412 0.301121428 0.027055773 0.046731844 1 0.497561061 0.097669495 -0.042977998 0.073150009 0.01752308 0.446311963 -0.065613281 0.577804913 -0.171819975 0.192725409 0.497561061 1 0.098720947 0.04428365 0.109814696 0.207062089 0.277424246 0.017018503 0.374057968 0.093276222 0.06221338 0.097669495 0.098720947 1 -0.16665222 0.088320804 0.017838289 0.083761143 0.0118807 0.047680363 -0.027994229 -0.121835155 -0.042977998 0.04428365 -0.16665222 1 -0.083250958 -0.113249872 -0.066049714 0.083771862 -0.024636992 0.098787646 0.007001147 0.073150009 0.109814696 0.088320804 -0.083250958 1 0.11444586 0.18562112 -0.115407265 0.14030574 -0.020583535 0.072787318 0.01752308 0.207062089 0.017838289 -0.113249872 0.11444586 1 0.008346999 -0.031577329 -0.029811791 0.032817213 0.256501727 0.446311963 0.277424246 0.083761143 -0.066049714 0.18562112 0.008346999 1 -0.548843663 0.897175324 -0.132101895 -0.332697412 -0.065613281 0.017018503 0.0118807 0.083771862 -0.115407265 -0.031577329 -0.548843663 1 -0.410961064 0.007443381 0.301121428 0.577804913 0.374057968 0.047680363 -0.024636992 0.14030574 -0.029811791 0.897175324 -0.410961064 1 0.074563952Imagine-0.290999002 -0.166431901 looking-0.191828797 at0.137354964 thousands-0.318379029 0.361032714 of0.140153123 numerical0.041112456 0.020956227 -0.121890887 -0.113935147 0.096219583 -0.05286439 -0.015289902 0.019324048 0.016093852 0.079420974 0.30604406 0.012369794 -0.118651352 -0.024855722 -0.004367924values0.415302265 0.212895416of data!0.422025428 0.121021716 -0.096013616 0.208845673 0.062699756 0.305190719 -0.192434667 0.383145433 -0.041892367 0.292542753 0.088478765 0.242138917 0.091224942 -0.044085648 0.030151736 -0.048883722 -0.018817843 0.001127689 0.090586488 -0.012796946 0.428193732 0.148027299 0.374229362 -0.036415302 0.031485232 0.014195414 0.093444237 0.111307503 -0.142990907 0.198198421 -0.072987889Looking0.639653323 0.240224974 for a0.430130045 needle0.025582625 in0.028934191 a haystack?0.074669894 -0.003315806 0.264468969 -0.263129122 0.384595716 -0.081425541 0.650446804 0.264165028 0.526494862 0.024959155 0.016578929 0.086833576 0.043588107 0.234276435 -0.236008471 0.358375999 0.220771143 -0.298196834 -0.095225139 -0.369718105 0.018743973 0.082060135 -0.102865842 0.113548775 -0.131073669 0.077167167 -0.117265396 -0.024869432 -0.190880521 0.216974667 0.238447454 0.043845472 0.02495333 -0.005302967 -0.052698509 0.117642296 0.030820275 0.12686478 -0.030647599 0.099397397 0.060271104 -0.149448693 0.101629373 0.107726218 0.138653219 -0.135938599 0.29944496 -0.143383865 0.324332399 -0.105195337 -0.1838672 -0.148673467 -0.213141108 0.059956448 0.018061058 -0.017384883 -0.104230537 0.034619576 0.092101398 -0.031918291 0.017414435Without-0.037744275 0.109437444 a proper-0.048190722 -0.019413361 data 0.085313857analysis-0.03284987 tool,-0.090511989 such0.148464685 -0.022750366 0.185438833 0.038764349 -0.713947863 -0.052827235 -0.199718753 -0.077278511 0.145219414 -0.072139066 -0.148773823 -0.170502021 0.300067528