Graphics and visualisation in practice: for exploring multidimensional reality

Sugnet Gardner & Niël J le Roux
Department of and , University of Stellenbosch
Private Bag X1, Matieland, 7602, South Africa

1. Introduction

In considering the traditional biplot of Gabriel (1971) as the multivariate analogue of a scatterplot, Gower & Hand (1996) provide a unified methodology for representing multivariate data graphically. This paper demonstrates several extensions and novel applications of this philosophy in exploring multidimensional reality. When this biplot methodology (‘biplotology’) is applied in practice, the graphical display of the sample points is enhanced by adding information about the variables. The method is flexible: it accommodates both continuous and categorical measurements, it can represent large data sets, which makes it suitable for data mining applications, and it extends the mere representation of data to an exploratory analysis in its own right through the application of several novel ideas. The focus here is not on the underlying theoretical development of these ideas, which is discussed in Gardner (2001), but on illustrating the extensions through practical examples.

2. Multidimensional scatterplot with acceptance regions

The principal component analysis (PCA) biplot is the most basic multidimensional extension of a scatterplot. PCA is based on the singular value decomposition of the data matrix; a complete discussion can be found in Gower & Hand (1996). A few core concepts nevertheless need explaining. The principal axes resulting from the PCA are used only as scaffolding for plotting purposes and are not shown in the graphical representation. Instead, biplot axes are constructed to represent the original variables. The terms interpolation and prediction refer to ‘moving’ from the original p-dimensional space to the biplot space (usually two-dimensional) and back, respectively. In general p > 2, so the biplot display is a best-fit approximation (according to some criterion) of the original data matrix. Unlike a scatterplot, a biplot needs two sets of biplot axes: interpolation axes for interpolating new samples onto the display, and prediction axes for inferring the values of the original variables for a point in the biplot. Specific to the PCA biplot is the optimal representation of multidimensional variation.

The quality index (QI) biplot optimally represents the variation in monthly values of 15 quality measurements from a manufacturing process. By interpolating a multivariate target onto the biplot, the ‘distance’ from each month to the target is represented graphically. The prediction biplot axes suggest on which variables a particular month did not conform to the target. A QI is calculated from the monthly values, resulting in index values where 0 – 50 is considered poor, 50 – 80 satisfactory and 80 – 100 superior quality.

[Figure: QI biplot of the monthly values (Jan00 to Mar01) with the interpolated TARGET and the prediction biplot axes.]
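The interpolation and prediction steps described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the data are column-centred before the singular value decomposition, the function names (`pca_biplot`, `interpolate`, `predict`, `prediction_marker`) are hypothetical, and the data and target are made up.

```python
import numpy as np


def pca_biplot(X, r=2):
    """Fit the scaffolding of an r-dimensional PCA biplot to X (n samples x p variables)."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    # SVD of the column-centred data matrix; the rows of Vt are the principal axes.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    Vr = Vt[:r].T  # p x r matrix whose columns span the biplot space
    return mean, Vr


def interpolate(x, mean, Vr):
    """Interpolation: map samples from the p-dimensional space onto the biplot."""
    return (np.atleast_2d(x) - mean) @ Vr


def predict(z, mean, Vr):
    """Prediction: infer the values of all p variables for points in the biplot."""
    return np.atleast_2d(z) @ Vr.T + mean


def prediction_marker(value, k, mean, Vr):
    """Biplot position of the marker for `value` on the prediction axis of variable k."""
    v = Vr[k]  # direction of the k-th prediction axis in the biplot
    return (value - mean[k]) / (v @ v) * v


# Made-up example: 12 "months" measured on 5 variables, plus a made-up target.
rng = np.random.default_rng(0)
X = rng.normal(size=(12, 5))
mean, Vr = pca_biplot(X)
target = X.mean(axis=0) + 0.5
Z = interpolate(X, mean, Vr)              # sample points in the display
z_target = interpolate(target, mean, Vr)  # target interpolated onto the display
distance_to_target = np.linalg.norm(Z - z_target, axis=1)
```

Predicting at a marker returned by `prediction_marker` reproduces the requested value on variable k, which is what allows a value to be read off a calibrated prediction axis by orthogonal projection.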
To associate the index values with the biplot display, a grid is superimposed on the biplot space. For the midpoint of each grid cell the values of the original variables are predicted and used to calculate a QI value. By colour-coding the cells, acceptance regions are constructed on the biplot.

3. Classification of an unknown entity

The canonical variate analysis (CVA) biplot is used to optimally separate classes in data. This biplot, used as a graphical representation in a linear discriminant analysis (LDA) context, is also discussed by Gower & Hand (1996). Interpolating the sample points onto the CVA biplot of the class means allows for visual appraisal of the separation or overlap among classes.

Three species of wood of the genus Ocotea are used to demonstrate the CVA biplot. Two of these, O. bullata (stinkwood) and O. porosa (imbuia), were traditionally used in the manufacturing of Old Cape furniture, produced in the Cape region of South Africa during 1652 – 1900. Correct identification of the type of wood is important since furniture made from stinkwood has a high prestige value. To classify the type of wood of a chair to be auctioned, a CVA biplot optimally separating the three species is constructed.

Through the transformation from the original to the canonical (biplot) space, Mahalanobis distance in the original space becomes Pythagorean distance in the biplot space. A sample is therefore classified to its nearest class mean in the biplot space. By colour-coding grid cells according to the nearest class mean, classification regions can be constructed. When the measurements of the chair are interpolated onto the biplot, the chair is clearly classified as manufactured from imbuia.

[Figure: CVA biplot of the three Ocotea species (Obul, Oken, Opor) with classification regions.]
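The canonical transformation and the nearest-class-mean rule described above can be sketched as follows. This is an illustrative sketch only: it assumes the pooled within-class covariance matrix is used to define Mahalanobis distance, all function names (`cva_fit`, `to_biplot`, `nearest_class`, `classification_grid`) are hypothetical, and the data are simulated rather than the wood measurements.

```python
import numpy as np


def cva_fit(X, labels):
    """Return grand mean, class names, class means and the p x p canonical matrix M."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    n, p = X.shape
    grand = X.mean(axis=0)
    W = np.zeros((p, p))  # within-class sums of squares and products
    B = np.zeros((p, p))  # between-class sums of squares and products
    means = np.zeros((len(classes), p))
    for j, c in enumerate(classes):
        Xc = X[labels == c]
        means[j] = Xc.mean(axis=0)
        R = Xc - means[j]
        W += R.T @ R
        d = means[j] - grand
        B += len(Xc) * np.outer(d, d)
    W /= n - len(classes)  # pooled within-class covariance (an assumption of this sketch)
    # Symmetric inverse square root of W, then eigenvectors of the scaled B.
    w, E = np.linalg.eigh(W)
    W_inv_sqrt = E @ np.diag(w ** -0.5) @ E.T
    lam, V = np.linalg.eigh(W_inv_sqrt @ B @ W_inv_sqrt)
    M = W_inv_sqrt @ V[:, np.argsort(lam)[::-1]]  # columns ordered by decreasing eigenvalue
    return grand, classes, means, M


def to_biplot(x, grand, M, r=2):
    """Interpolate samples onto the first r canonical axes."""
    return (np.atleast_2d(x) - grand) @ M[:, :r]


def nearest_class(Z, class_points, classes):
    """Assign each row of Z to the class whose mean is nearest in Pythagorean distance."""
    d2 = ((Z[:, None, :] - class_points[None, :, :]) ** 2).sum(axis=2)
    return classes[d2.argmin(axis=1)]


def classification_grid(class_points, classes, xlim, ylim, n=50):
    """Label the points of an n x n grid over the biplot by nearest class mean."""
    gx, gy = np.meshgrid(np.linspace(*xlim, n), np.linspace(*ylim, n))
    cells = np.column_stack([gx.ravel(), gy.ravel()])
    return nearest_class(cells, class_points, classes).reshape(n, n)


# Simulated example: three classes measured on four variables.
rng = np.random.default_rng(1)
centres = np.array([[0, 0, 0, 0], [8, 0, 1, 0], [0, 8, 0, 1]], dtype=float)
X = np.vstack([rng.normal(size=(30, 4)) + c for c in centres])
labels = np.repeat(np.array(["a", "b", "c"]), 30)
grand, classes, means, M = cva_fit(X, labels)
Zm = to_biplot(means, grand, M)                            # class means in the biplot
new_class = nearest_class(to_biplot(X[:1], grand, M), Zm, classes)
regions = classification_grid(Zm, classes, (-10, 10), (-10, 10))
```

With all p canonical axes retained, squared Pythagorean distance between transformed points equals squared Mahalanobis distance (with respect to the pooled covariance) between the original points; the array `regions` can be colour-coded to draw classification regions.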

4. Exploring classes in data

In an allometric study investigating morphological differences between tortoises of the species Homopus areolatus from different regions in South Africa, the CVA biplot proves to be a useful exploratory tool when the very few observations available severely limit formal statistical analysis. The Western Cape Nature Conservation Board (WCNCB) suspected that tortoises from the Karoo region have a flatter shell profile. PCA is used to decompose the data matrix into a size component and a shape + ‘error’ component. Utilising only the length, width and height of the shells, a PCA biplot constructed after the size component is removed clearly displays the Karoo tortoises separately from the other data points. The prediction biplot axes (calibrated in log-transformed units) thus provide enough evidence of a flatter shell profile to support an application for funding a more in-depth study.

[Figure: PCA biplot of the shell measurements after removal of size, with prediction biplot axes CL, CW and CH.]

5. Application of bagplots in exploring classes

When 16828 observations are plotted in a CVA biplot to explore differences between the shape of export lemons from different cultivation areas, the spread of observations is concealed by overplotting. The bagplot (Rousseeuw, Ruts & Tukey, 1999), a form of two-dimensional boxplot, is useful as a graphical summary of the points. Suppressing the plotting of the sample points and superimposing a bagplot onto the biplot allows for visual comparison of classes via the prediction biplot axes. Since export standards set requirements on two of the three variables, the biplots can be fitted with acceptance regions. Comparing each region’s bagplot with the acceptance region clearly indicates its suitability as an export lemon producing region.

[Figure: CVA biplots with superimposed bagplots for the Nelspruit and Vaalharts regions; prediction biplot axes Length, Diameter and Ratio.]

6. Introduction of the α-bag

Although superimposing bagplots onto the biplot scaffolding is very useful as a summary of the cloud of points, each bagplot has to be plotted separately. The ‘bag’ of the bagplot is constructed from the concept of halfspace location depth such that it contains the inner 50% of the data points. Using the algorithm of Rousseeuw, Ruts & Tukey (1999), the 50% cut-off is replaced by a value α ranging between 0 and 1. Typically a value of 0.9 or 0.95 is useful for enclosing a class of observations, excluding the 10% or 5% of the observations at the extremes.

The living standards and development survey was conducted among approximately 9000 South African households preceding the 1994 elections to provide policy makers with information regarding literacy in order to best address racial inequalities. Furnishing the CVA biplot with an 80% bag for each of four race groups allows the differences and overlap to be assessed. The α-bag also provides a quantitative measure of the overlap between groups: the smallest α for which there is overlap. This measure provides a reference for evaluating progress in a follow-up study.

[Figure: CVA biplot of the survey data with 80% bags for the Black, Coloured, Indian and White groups; prediction biplot axes Age, EduYrs, ExpDec, MEdY and TotScore.]
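The idea of an α-bag built on halfspace location depth can be illustrated with a rough approximation. The sketch below is not the algorithm of Rousseeuw, Ruts & Tukey (1999): it approximates the depth of each point by scanning a finite set of directions, and it approximates the α-bag by the convex hull of the ⌈αn⌉ deepest points. The function names and the simulated data are assumptions made for illustration.

```python
import numpy as np


def halfspace_depth(points, cloud, n_dir=360):
    """Approximate halfspace depth (as a count) of each row of `points` within `cloud`."""
    points = np.asarray(points, dtype=float)
    cloud = np.asarray(cloud, dtype=float)
    theta = np.linspace(0.0, 2.0 * np.pi, n_dir, endpoint=False)
    U = np.column_stack([np.cos(theta), np.sin(theta)])  # unit directions
    proj_cloud = cloud @ U.T    # n x n_dir
    proj_pts = points @ U.T     # m x n_dir
    # For each point and direction, count cloud points in the closed halfplane
    # bounded by the line through the point; a small tolerance guards rounding.
    counts = (proj_cloud[None, :, :] >= proj_pts[:, None, :] - 1e-12).sum(axis=1)
    return counts.min(axis=1)


def convex_hull(P):
    """Vertices of the convex hull of 2-D points (Andrew's monotone chain)."""
    P = sorted(map(tuple, P))
    if len(P) <= 2:
        return np.array(P)

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in P:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(P):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return np.array(lower[:-1] + upper[:-1])


def alpha_bag(cloud, alpha=0.9, n_dir=360):
    """Approximate alpha-bag: hull of the ceil(alpha * n) deepest points, plus their indices."""
    cloud = np.asarray(cloud, dtype=float)
    depth = halfspace_depth(cloud, cloud, n_dir)
    k = int(np.ceil(alpha * len(cloud)))
    keep = np.argsort(-depth, kind="stable")[:k]  # deepest points first
    return convex_hull(cloud[keep]), keep


# Simulated biplot coordinates for one class.
rng = np.random.default_rng(2)
pts = rng.normal(size=(200, 2))
hull, keep = alpha_bag(pts, alpha=0.8)  # polygon enclosing roughly the inner 80%
```

The returned polygon can be drawn on top of the biplot scaffolding in place of the individual sample points; repeating the call per class gives one bag per class on a single display.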

7. Exploring overlap between classes

A sample of Middle Stone Age artefacts excavated from the caves at Klasies River on the southern coast of South Africa is explored in a CVA biplot. The construction of α-bags in the biplot allows for exploring differences and overlap between artefacts classified as belonging to the sub-stage layers MSA I, MSA II upper and MSA II lower. There is a considerable amount of overlap between the two MSA II sub-stages, with some difference on blade width. The MSA I sub-stage differs from the MSA II sub-stages on blade and platform thickness as well as the ratio length:platform thickness. The shape and overlap of the α-bags further highlight that the MSA II artefacts overlap with the MSA I artefacts, but that a substantial proportion of the MSA I artefacts are distinctly different from MSA II.

[Figure: CVA biplot of the artefacts with α-bags for MSA I, MSA II L and MSA II U; prediction biplot axes Bladelength, Bladewidth, Bladethickn, Platfwidth, Platfthickn, Platfangle and Ratio.]

8. Probing deeper into reality

Human perception is restricted to objects in three dimensions. The human brain fuses two different two-dimensional retinal images into the perception of a single three-dimensional object. Data, however, typically consist of multidimensional measurements. Although this multidimensional reality cannot be visually displayed in more than three dimensions, the human mind can appreciate it conceptually. In a biplot, multidimensional data points are displayed in (usually) two dimensions. Although these projections might result in apparent contradictions and ambiguities, biplotology allows for probing deeper into reality.

When constructing a biplot, what is represented is the view from the third axis, in a direction perpendicular to the two orthogonal axes forming the scaffolding of the display plane. Now imagine viewing from the second axis, in a direction perpendicular to the plane formed by axis one in its original position and axis three. This view can be shown in a biplot with the first eigenvector held fixed in its original position and the second eigenvector in the scaffolding replaced by the third. Another possibility is a biplot scaffolding where the second eigenvector is kept fixed while the first is replaced by the third. This scaffolding is analogous to an observer looking from the first dimension of the canonical space, perpendicular to the plane formed by the second and third dimensions.

The appearance of copper froth from a flotation plant in South Africa plays a significant role in the operation of the plant. Based on digital images, five froth structures are considered in an ordinary two-dimensional CVA biplot. The α-bags indicated in the biplot show clear separation between three clusters of classes, but almost complete overlap between two classes in each of two clusters. The error rate for classification in two dimensions can be considerably improved by classification in four dimensions.

When a biplot is constructed using eigenvectors one and three, or three and two, as scaffolding, the α-bags for two of the overlapping classes are completely separated, while the other two overlapping classes can be partially separated by constructing a biplot using eigenvectors four and two. Fusing these two-dimensional biplot representations conceptually provides graphical evidence of the improved classification in four dimensions. Although the biplot is two-dimensional, it can be employed to probe deeper into the higher-dimensional space.

9. Conclusion

“Imagine, for instance, a biologist who spends two years catching and measuring animals of two species. Certainly the biologist would hope to get more information out of the data than just a statement like ‘the mean difference between the two species is significant’” (Flury, 1997). This paper endeavoured to demonstrate that biplotology provides Flury’s biologist with tools to actively explore multidimensional reality.

REFERENCES

Flury, B. 1997. A first course in multivariate statistics. New York: Springer-Verlag.
Gabriel, K.R. 1971. The biplot graphical display of matrices with application to principal component analysis. Biometrika, 58, 453 – 467.
Gardner, S. 2001. Extensions of biplot methodology to discriminant analysis with applications of non-parametric principal components. Unpublished PhD thesis. Stellenbosch: University of Stellenbosch.
Gower, J.C. & Hand, D.J. 1996. Biplots. London: Chapman & Hall.
Rousseeuw, P.J., Ruts, I. & Tukey, J.W. 1999. The bagplot: a bivariate boxplot. The American Statistician, 53, 382 – 387.

RÉSUMÉ

The philosophy of Gower & Hand (1996), viewing biplots as multivariate analogues of scatterplots, provides a methodology for visualising and analysing multidimensional data originating from many applications in practice. Biplotology encompasses a unified approach allowing for many novel extensions when analysing data. The versatility of biplots for exploring multidimensional reality is demonstrated with a diverse range of applications.
