Functional Boxplots based on orders ∗

Belen Martin-Barragan Universidad Carlos III de Madrid email: [email protected] Rosa Lillo Universidad Carlos III de Madrid email: [email protected] Juan Romo Universidad Carlos III de Madrid email: [email protected]

August 22, 2012

Abstract We propose a new functional boxplot that extends the classical univariate boxplot. Instead of depth measures, our approach makes use of new orders for functional data. These orderings are based on the epigraphs and hypographs of the functions and allow us to define functional quartiles. Thus, our proposal for a functional boxplot is a natural extension of the univariate case. The method is computationally efficient, requiring a computational time of O(n^2), in contrast to O(n^3) for existing methods. Simulated and real examples show that this method provides a convenient visualization technique with great potential for analysing functional data.

Keywords: Functional data, Box and Whisker Plots, Visualization, Functional Data Orderings.

∗Belen Martin-Barragan (email: [email protected]), Rosa Lillo (email: lillo@est-econ.uc3m.es) and Juan Romo (email: [email protected]). Mailing address: Departamento de Estadística, Universidad Carlos III de Madrid, c/ Madrid 126, Getafe, Spain. This research was partially supported by projects MTM2009-14039 and ECO2011-25706 of Ministerio de Educación y Ciencia (Spain) and CCG07-UC3M/ESP-3389 of the Comunidad de Madrid (Spain).

1 Introduction

Visualization techniques are very useful in data analysis. Their aim is to summarize information in a graph or plot. A popular visualization tool for univariate data is the boxplot, which can summarize a large set of measurements by plotting the quartiles, the range and the outliers. A key step in constructing a boxplot is to sort the data. Thus, a meaningful way of sorting the data is required to extend the concept of the boxplot to complex data.

Functional data analysis is a statistical approach that covers a special kind of complex data: the observed units are functions (???). It has been applied in many fields including biology, meteorology, medicine and speech recognition. A wide range of applications and techniques can be found in ?. Visualization methods for functional data have been proposed in recent years. In particular, different extensions of the univariate boxplot have been proposed (??). Two different ideas for extending univariate boxplots to functional data are addressed in ?: functional bagplots and functional highest density region boxplots. Both techniques require reducing the data to the first two principal components, and the visualization is based only on this information. To overcome this limitation, ? proposed a visualization tool that is based on the notion of depth: given a data set, the depth of an observation measures its centrality with respect to the remaining data. Thus, depth provides a center-outward ordering of the data. Different notions of depth have been proposed for multivariate data (??????) and functional data (???). The functional boxplot proposed in ? sorts the functions according to the modified band-based depth (MBD) (?). The band-based depth (BD) considers that a curve is deep if it is contained in many of the bands that can be formed with functions of the sample. MBD is a variant of BD that considers the proportion of the curve lying in the band.
The main characteristics of the univariate boxplot are reflected in the functional boxplot: a central function that represents the median; a central 50% region limited by the first and third quartiles; and a fence to detect outliers. Note that, since the order used here is center-outward, the first and third quartiles are defined by using the 50% deepest curves. This procedure differs from the univariate case, where the data are sorted from lowest to highest. In principle, for functional data, no clear and direct order exists for sorting the functions from lowest to highest.

In this article we propose a new functional boxplot that naturally extends the classical univariate boxplot to functional data. Instead of using depth measures, we propose alternative orders that allow us to define the first and third functional quartiles without taking into account the depth of the functions. This is a natural extension of the univariate boxplot and an alternative to the proposals available in the literature, which use either the first two principal components or the depth-based order. These orderings are based either on epigraphs or hypographs. Roughly speaking, the epigraph of a function is the area above its graph, see e.g. ??. The epigraph index of a function is low if many functions from the sample are contained in its epigraph. An analogous index can be defined using the hypograph. The proposed method is computationally more efficient than the MBD-based boxplot: our technique requires a computational time of O(n^2), whereas the MBD-based boxplot requires a computational time of O(n^3).

The ordering indexes are introduced in Section ?? and analyzed in Section ??. Both are combined in Section ?? to define the functional quartiles and construct a boxplot, which we call the epi-hypo functional boxplot. Sections ?? and ?? evaluate the procedure with simulations and real data. Finally, some conclusions are given in Section ??.

2 Ordering functional data

In functional data analysis the observations are real functions yi(t), i = 1, 2, . . . , n, t ∈ I, where I is an interval in ℝ. Once we fix a criterion to order the sample curves, y[i] will denote the i-th lowest curve. There is a natural ordering for univariate data, but the choice is not unique for multivariate or functional data.

The importance and difficulty of ordering functional data are highlighted for instance in ?. Similar difficulties exist for multivariate data. A four-fold classification of possible ordering principles is proposed in ?: marginal ordering, reduced (aggregate) ordering, partial ordering and conditional (sequential) ordering. Each of these multivariate orderings has its own weaknesses. For instance, marginal orderings lose multivariate information about correlations and data structure. A special kind of multivariate ordering has attracted a lot of attention since Barnett's paper. The idea is to sort the data from the most central to the most outlying (?). These notions are better known as data depth. See e.g. ??? and ? and the reviews ? and ?. Extensions of data depth to functional data have been analyzed over the last decade (???). A new definition in which the plays a key role is proposed in ?.

In this paper we are interested in a functional data ordering that is not based on a center-outward order, but on a down-up order. Hence, we need an appropriate index that allows this ordering. A simple choice could be, for instance, the integral of the absolute value of the function over the interval. However, this definition does not take into account the form of the other functions in the sample. Our proposal is inspired by ?, where a natural notion of depth is based on bands of curves, and adapts the notion of extremality proposed in ? to define an index based on epigraphs and hypographs of curves. The graph of a function f in I is denoted as

G(f) = {(t, f(t)), t ∈ I}.

The epigraph and hypograph of a function f are defined as

epi(f) = {(t, y) ∈ I × ℝ, y ≥ f(t)},   (1)
hyp(f) = {(t, y) ∈ I × ℝ, y ≤ f(t)}.   (2)

Let X(·) denote the indicator function. We propose two indexes to order functional data:

• Epigraph index (EI): A function f has a high index if few functions are contained in its epigraph epi(f).

$$EI(f) = 1 - BE(f) = 1 - \frac{\sum_{i=1}^{n} X\left(G(y_i) \subseteq \mathrm{epi}(f)\right)}{n}$$

Note that BE(f) is the proportion of functions in the sample that are contained inside the epigraph of f.

• Hypograph index (HI): A function f has a high index if many functions are contained in its hypograph hyp(f).

$$HI(f) = \frac{\sum_{i=1}^{n} X\left(G(y_i) \subseteq \mathrm{hyp}(f)\right)}{n}$$

Note that HI(f) is the proportion of functions in the sample that are contained inside the hypograph of f.

Figure ?? shows a simple example with n = 4 curves. The yellow area in the left panel is the epigraph of y3, whereas the one in the right panel is its hypograph. Two of the three other functions are contained in epi(y3); hence EI(y3) = 1 − 2/3 = 1 − 0.67 = 0.33. No function is contained in the hypograph, so HI(y3) = 0/3 = 0.

Inspired by ?, one could also define modified versions of these measures: the modified epigraph index (MEI) and the modified hypograph index (MHI), analogously to the MBD, taking into account the proportion of I where curve yi is inside epi(f) (respectively, hyp(f)). The MEI of curve y3 also considers for how long y4 is in the epigraph, which is an interval of length 0.25. Hence MEI(y3) = 1 − (2 + 0.25)/3 = 0.25. Analogously, MHI(y3) = 0.75/3 = 0.25. Note that MEI(f) and MHI(f) only differ when the function f coincides with yi on some subset of I with positive Lebesgue measure, for some i = 1, 2, . . . , n.

Figure 1: Epigraph (left) and hypograph (right) for function y3.
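The four indexes can be sketched in a few lines of Python for curves sampled on a common grid. This is only an illustration under stated assumptions, not the authors' implementation: the proportion of I is approximated by the fraction of grid points, and the index of f is normalized by the number of other curves, as in the worked example above (which divides by 3 for n = 4). The toy curves are hypothetical and are not the curves of Figure 1; they are chosen so that the output reproduces the numbers of the worked example.

```python
def epigraph_index(f, others):
    """EI(f): 1 minus the share of curves lying entirely in epi(f)."""
    inside = sum(all(y >= v for y, v in zip(g, f)) for g in others)
    return 1.0 - inside / len(others)


def hypograph_index(f, others):
    """HI(f): share of curves lying entirely in hyp(f)."""
    inside = sum(all(y <= v for y, v in zip(g, f)) for g in others)
    return inside / len(others)


def modified_epigraph_index(f, others):
    """MEI(f): each curve counts by the fraction of grid points where it is in epi(f)."""
    share = sum(sum(y >= v for y, v in zip(g, f)) / len(f) for g in others)
    return 1.0 - share / len(others)


def modified_hypograph_index(f, others):
    """MHI(f): each curve counts by the fraction of grid points where it is in hyp(f)."""
    share = sum(sum(y <= v for y, v in zip(g, f)) / len(f) for g in others)
    return share / len(others)


# Hypothetical toy curves on a 4-point grid (assumption, not the data of Figure 1).
y1 = [1.0, 1.0, 1.0, 1.0]
y2 = [2.0, 2.0, 2.0, 2.0]
y3 = [0.0, 0.0, 0.0, 0.0]
y4 = [-1.0, -1.0, -1.0, 1.0]  # above y3 on a quarter of the grid
others = [y1, y2, y4]

print(round(epigraph_index(y3, others), 2))            # 0.33
print(hypograph_index(y3, others))                     # 0.0
print(round(modified_epigraph_index(y3, others), 2))   # 0.25
print(round(modified_hypograph_index(y3, others), 2))  # 0.25
```

Each index compares f with every other curve once, so indexing all n curves takes on the order of n^2 curve comparisons.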

3 Analysis of EI and HI

This section examines the behavior of the indexes defined in Section ??. We consider both EI and HI, together with their modified versions MEI and MHI. We generate the data following Model 1, previously considered in ? and ?.

• Model 1 is Yi(t) = g(t) + ei(t), i = 1, 2, . . . , n, with mean g(t) = 4t, t ∈ [0, 1], and where ei(t) is a Gaussian stochastic process with zero mean and covariance function γ(s, t) = exp(−|s − t|).

A set of n = 100 curves is generated using Model 1. The simulated functions can be seen in Figure ??. Five randomly selected curves are shown in color, whereas the rest are shown in grey. The value of each index for these five curves is given in Table ?? (columns EI and HI). Among the five selected curves, the green and the cyan ones, represented as solid curves, are the most extreme, since their indexes are the lowest and the highest, respectively, for both EI and HI, which seems natural. For the other three curves, the plot suggests that no curve can be said to be higher or lower than the others. Indeed, each index is very similar for the three functions, and each index yields a different ordering of them. Although the behavior of these three functions is coherent with the fact that there exists no unique order for functions, the behavior of the green and the cyan curves illustrates that the proposed indexes are adequate to sort the functions.

Figure 2: Example of 5 curves among 100 generated by Model 1.

color   form    EI      HI      BD      MBD
red     dotted  0.9495  0.1818  0.0382  0.4431
green   solid   0.8889  0.0404  0.0289  0.4547
blue    dotted  0.9798  0.1313  0.0253  0.4496
purple  dotted  0.9899  0.1313  0.0226  0.3201
cyan    solid   1.0000  0.6162  0.0200  0.1219

Table 1: Values of the proposed indexes and the band-based depths for five curves randomly selected among the curves generated by Model 1.

The last two columns of Table ?? show the values of the two band-based depths BD and MBD. By chance, the green function seems to be very central, so MBD gives it the maximal value. For BD, the red function is the most central one. In general, sorting the functions by depth value (using either BD or MBD) has a very different meaning from sorting them by one of the proposed indexes EI or HI. That is, the ordering based on the proposed indexes sorts the functions in a down-up sense, whereas the depth measures do so center-outward.

Figure ?? shows all the curves in different shades of gray depending on the proposed indexes. The 10% lowest are in black, the next 10% in very dark gray, the next 10% in slightly lighter gray, and so on. Hence, a light shade means high EI (respectively HI, MEI, MHI). These plots make clear that the higher the curve, the higher its index. This is true for the four indexes. Hence the proposed indexes are useful to sort the functions in a down-up sense.
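Curves from Model 1 can be simulated with the standard library alone. The sketch below is an illustration, not the authors' code. It relies on the fact that a zero-mean Gaussian process with covariance exp(−|s − t|) is a stationary Ornstein-Uhlenbeck process, so on an equispaced grid it can be sampled exactly with an AR(1) recursion. The grid size m = 50, the seed and the function name `simulate_model1` are assumptions made for this example.

```python
import math
import random


def simulate_model1(n=100, m=50, seed=0):
    """Return (grid, curves) with n curves Y_i(t) = 4t + e_i(t) on m equispaced points of [0, 1]."""
    rng = random.Random(seed)
    grid = [j / (m - 1) for j in range(m)]
    rho = math.exp(-1.0 / (m - 1))      # correlation between adjacent grid points
    scale = math.sqrt(1.0 - rho * rho)  # keeps the marginal variance equal to 1
    curves = []
    for _ in range(n):
        e = rng.gauss(0.0, 1.0)         # stationary start: gamma(t, t) = 1
        curve = []
        for j, t in enumerate(grid):
            if j > 0:
                e = rho * e + scale * rng.gauss(0.0, 1.0)
            curve.append(4.0 * t + e)   # mean function g(t) = 4t
        curves.append(curve)
    return grid, curves


grid, curves = simulate_model1()
print(len(curves), len(curves[0]))  # 100 50
```

The recursion gives Cov(e(t_j), e(t_k)) = rho^|j − k| = exp(−|t_j − t_k|), which matches γ(s, t) at the grid points.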

4 Epi-hypo functional boxplot

This section illustrates the construction of the functional boxplot based on the indexes defined in Section ??.

For univariate data, the main elements of a boxplot are: the first and third quartiles, which define the limits of the box; the outliers, represented by isolated points; and the minimum and maximum among the nonoutlying data, which define the whiskers. A first step in the construction of the boxplot is to sort the data. Then, the first (respectively, third) quartile is computed as the datum that leaves below (resp. above) it 25% of the observations in the sample. The quartiles are represented as a box. The distance between the third and first quartiles is the interquartile range (IQR). After computing the quartiles, there is an outlier detection step. Two fences are obtained by the empirical 1.5 times IQR outlier criterion. The lower fence is computed as the first quartile minus 1.5 times the IQR, and all the data lower than this value are identified as outliers. The upper fence is analogously computed as the third quartile plus 1.5 times the IQR, and all the data higher than this value are identified as outliers. Outliers are usually plotted as stars or plus signs in univariate boxplots. After outlier detection, the minimum and the maximum of the nonoutliers are computed. They are represented as the whiskers, two lines that extend from the box and whose ends are indicated with short lines.

We now propose a procedure to construct the epi-hypo functional boxplot. Since our approach tries to extend the univariate boxplot in a natural way, it is composed of the same elements: the first and third functional quartiles, the outliers, and the minimum and maximum nonoutliers. This is illustrated in Figure ??. The first and third functional quartiles are represented as solid black curves. Their construction is discussed in Section ??. They define a box that is represented in cyan.

Figure 3: Data generated with Model 1. The shade of gray indicates the index. The 10% of functions with the lowest index are black, the next 10% are slightly lighter, and so on.

Figure 4: Data generated with Model 2. Construction of the epi-hypo functional boxplot.

Once the functional quartiles are computed, our outlier detection step is a direct extension of the univariate case. Note that now the first and third quartiles are functions q1(t) and q3(t). The IQR is a function IQR(t) = q3(t) − q1(t). The lower fence is computed as f1(t) = q1(t) − 1.5 IQR(t), and the upper fence is computed analogously as f3(t) = q3(t) + 1.5 IQR(t). Outliers are detected as the functions g for which there exists t ∈ I such that g(t) < f1(t) or g(t) > f3(t), i.e. the functions that go out of the fences. In Figure ??, outliers are represented as dotted red curves. Finally, the minimal (resp. maximal) nonoutlying function is computed as the lower (resp. upper) envelope of all the functions that are not outliers. They are represented in Figure ?? as solid blue curves and correspond to the whiskers of the univariate plot.

Functions for computing the epi-hypo functional boxplot and the indexes proposed in Section ?? are available upon request.
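The fence and whisker steps translate directly into code once the functional quartiles are available on a grid. The following is a minimal sketch under that assumption, not the authors' implementation; the names `detect_outliers` and `whiskers` are hypothetical, and "there exists t ∈ I" is checked only at the grid points.

```python
def detect_outliers(curves, q1, q3, factor=1.5):
    """Return (lower_fence, upper_fence, outlier_indexes) for curves on a common grid."""
    iqr = [b - a for a, b in zip(q1, q3)]               # IQR(t) = q3(t) - q1(t)
    lower = [a - factor * r for a, r in zip(q1, iqr)]   # f1(t)
    upper = [b + factor * r for b, r in zip(q3, iqr)]   # f3(t)
    outliers = [
        i for i, g in enumerate(curves)
        if any(v < lo or v > hi for v, lo, hi in zip(g, lower, upper))
    ]
    return lower, upper, outliers


def whiskers(curves, outliers):
    """Pointwise lower and upper envelopes of the nonoutlying curves."""
    excluded = set(outliers)
    kept = [g for i, g in enumerate(curves) if i not in excluded]
    lower_env = [min(column) for column in zip(*kept)]
    upper_env = [max(column) for column in zip(*kept)]
    return lower_env, upper_env


# Toy data (hypothetical): the second curve leaves the upper fence at one grid point.
q1, q3 = [0.0, 0.0, 0.0], [1.0, 1.0, 1.0]
curves = [[0.5, 0.5, 0.5], [0.5, 3.0, 0.5], [-1.0, 0.0, 2.0]]
lower, upper, out = detect_outliers(curves, q1, q3)
print(out)                    # [1]
print(whiskers(curves, out))  # ([-1.0, 0.0, 0.5], [0.5, 0.5, 2.0])
```

A single excursion beyond a fence is enough to flag a curve, which is why outliers that deviate only on a short subinterval can still be detected.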

4.1 Construction of the functional quartiles

In this section, different ways of computing the functional quartiles are discussed. They are based on the index measures proposed in Section ??. We first focus on computing the first functional quartile. Since our work is inspired by ?, we use convex envelopes of sets of functions to compute it. We have four different ways of using convex envelopes and the two proposed indexes to define the first functional quartile:

(i) upper limit of the convex envelope of the 25% of functions with lowest EI.
(ii) upper limit of the convex envelope of the 25% of functions with lowest HI.
(iii) lower limit of the convex envelope of the 75% of functions with highest EI.
(iv) lower limit of the convex envelope of the 75% of functions with highest HI.

This is illustrated in Figure ??. For each option, the functions used for computing the convex envelope are represented as solid blue curves, whereas the rest are black dashed curves. The proposed quartile is represented as a thick red curve.

Figure 5: Data generated with Model 1. The four ways of computing the first quartile.

A natural question arises: which of these four options is the best representative of the first quartile? We now proceed with a careful analysis of the four options. In options (i) and (ii), the first quartile is computed using the upper limit of an envelope. This is equivalent to taking the pointwise maximum of all the functions taking part in the envelope. As a result, the proposed quartile ends up close to the middle of all the curves, instead of approximately at the first quarter. This is illustrated in Figure ?? (left), where options (i) and (ii) are represented by solid and dotted curves, respectively. Hence, we discard options (i) and (ii) and now choose between options (iii) and (iv).

Options (iii) and (iv) compute the envelope of the 75% of the functions with the highest index in the dataset, using indexes EI and HI respectively. In particular, functions with EI or HI equal or very close to 1 will be part of the envelope.

Imagine a function f whose behavior is very different from that of the other functions in the dataset. Take for instance the thick dashed blue curve in Figure ?? (right). In such a setting, function f is an outlier. A sensible quartile should not be affected by it, because quartiles are robust measures. It seems reasonable to think that function f will rarely have many other functions completely above it (inside its epigraph). This means that EI(f) is equal or close to 1. Using option (iii) to compute the first quartile, function f will be part of the envelope, and hence can distort this quartile, which is represented in the figure as a solid curve. Option (iv) avoids this problem. No function is completely below f, hence HI(f) = 0, so f is not part of the envelope and does not cause any distortion in the first quartile (represented as a dotted curve).

It might be argued that functions having HI equal or close to 1 are probably outliers that belong to the envelope and might cause distortions when using option (iv). We now argue that such functions can exist, but cause no distortion. Consider a function g with HI(g) = 1. This means that g has all the other functions in the dataset below it (inside its hypograph). Hence g is the maximum of all the functions of the dataset. Since option (iv) computes the first quartile as the lower limit of the envelope, the maximum of all the functions does not cause distortions in such a lower limit.

Figure 6: (left) Data generated with Model 1. Thick curve corresponds to (i). Dotted curve corresponds to (ii). (right) Data generated with Model 1 with one obvious outlier (dashed curve). Solid curve corresponds to (iii). Dotted curve corresponds to (iv).

In view of all this, we conclude that option (iv) is the best to compute the first functional quartile. An analogous reasoning can be done for the third functional quartile. Hence, we have the following definitions:

• The first functional quartile is defined as the lower limit of the convex envelope generated by the 75% of functions with highest HI.
• The third functional quartile is defined as the upper limit of the convex envelope generated by the 75% of functions with lowest EI.
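The two definitions above can be sketched as follows, taking the lower and upper limits of an envelope as the pointwise minimum and maximum of the selected curves, in line with the remark that the upper limit equals the maximum of the functions in the envelope. This is an illustration rather than the authors' code: the function name, the rounding of 75% of n with `math.ceil` and the tie-breaking by input order are assumptions, and the index values are expected to be computed beforehand.

```python
import math


def functional_quartiles(curves, ei, hi, keep=0.75):
    """Return (q1, q3) on the common grid of the curves.

    q1: pointwise minimum of the `keep` share of curves with highest HI.
    q3: pointwise maximum of the `keep` share of curves with lowest EI.
    """
    n = len(curves)
    k = math.ceil(keep * n)  # assumption: round the 75% share upwards
    top_hi = sorted(range(n), key=lambda i: hi[i], reverse=True)[:k]
    low_ei = sorted(range(n), key=lambda i: ei[i])[:k]
    m = len(curves[0])
    q1 = [min(curves[i][j] for i in top_hi) for j in range(m)]
    q3 = [max(curves[i][j] for i in low_ei) for j in range(m)]
    return q1, q3


# Toy data (hypothetical): four constant curves at levels 0, 1, 2, 3,
# with EI and HI normalized by the three other curves.
curves = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
ei = [0.0, 1 / 3, 2 / 3, 1.0]
hi = [0.0, 1 / 3, 2 / 3, 1.0]
print(functional_quartiles(curves, ei, hi))  # ([1.0, 1.0], [2.0, 2.0])
```

In the toy example the lowest curve is excluded from the first quartile and the highest curve from the third, so an extreme curve at either end does not move the box.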

5 Simulation studies

We first consider a series of experiments with data generated according to some models proposed in the literature. We use the same models as ?. Similar model structures have also been used in ? and ?. Model 1 has already been defined in Section ??. This is the basic model without contamination. Models 2-4 are modifications of Model 1 that contain magnitude outliers, while Model 5 illustrates shape contamination.

• Model 2 includes a symmetric contamination: Yi(t) = Xi(t) + ciσiK, where Xi(t) follows Model 1, ci is 1 with probability q and 0 with probability 1 − q, K is a contamination size constant, and σi is a sequence of random variables independent of ci taking values 1 and -1 with probability 1/2.

• Model 3 is partially contaminated: Yi(t) = Xi(t) + ciσiK, if t ≥ Ti and Yi(t) = Xi(t) otherwise, where Ti is a random number generated from a uniform distribution on [0, 1].

• Model 4 is contaminated by peaks: Yi(t) = Xi(t) + ciσiK, if Ti ≤ t ≤ Ti + ℓ and Yi(t) = Xi(t) otherwise, where Ti is a random number generated from a uniform distribution on [0, 1 − ℓ].

• Model 5 considers shape contamination with different parameters in the covariance function γ(s, t) = k exp(−c|t − s|^µ). The basic Model 1, Xi(t) = g(t) + e1i(t), has parameter values k = 1, c = 1, µ = 1 for the covariance function of e1i. To generate irregular curves, let Yi(t) = g(t) + e2i(t), where e2i(t) is a Gaussian process with zero mean and covariance function parameters k = 8, c = 1, µ = 0.2. The contaminated model is given by Zi(t) = (1 − ci)Xi(t) + ciYi(t), i = 1, 2, . . . , n, where ci is 1 with probability q and 0 with probability 1 − q.

In the simulation studies, n = 100 curves are generated with parameters q = 0.1, K = 8, and ℓ = 3/49. We compare the epi-hypo functional boxplot (hereafter EH-fb) with the functional boxplot proposed in ?, which uses the MBD to compute the central region. ? propose the use of MBD instead of BD because it is more flexible. We also consider an analogous version that uses BD. Hereafter we will refer to these two approaches as MBD functional boxplot (MBD-fb) and BD functional boxplot (BD-fb), respectively. The three approaches considered (EH-fb, BD-fb and MBD-fb) are shown in Figure ?? (first, second and third columns, respectively). Each row corresponds to a different data generating model. The number of false positives (FP), i.e. nonoutliers erroneously detected as outliers, and false negatives (FN), i.e. undetected outliers, is also shown.

The three methods perform similarly for Model 1. In our generated sample, BD-fb seems inadequate for Model 2. The central region is too large and, consequently, no outlier is detected. Indeed, a closer look at the picture reveals that one of the outliers actually belongs to the central region. A reason for this might be that this outlier is completely inside many bands as compared with the non-outlying data. The irregularities of the curves make it improbable for them to stay completely inside the bands formed by two other curves. MBD-fb, as proposed in ?, does not suffer from this problem. However, in Models 3 and 4, MBD-fb erroneously considers one or several outliers as part of the central box; hence the boxplot has a strange form and is not able to detect the outliers. Since MBD takes into account the proportion of the interval I where a curve is inside a band, outliers that differ from normal curves only in a small subinterval are difficult to detect. This happens for instance in Models 3 and 4, where EH-fb overcomes this difficulty.

The three functional boxplots look very similar in Model 5. The central box seems a good representation of the data in the three cases, with MBD-fb providing a slightly thinner box. This makes EH-fb and BD-fb more conservative in outlier detection, giving 1 and 2 false negatives respectively.

In this randomly generated example, EH-fb provides a reliable boxplot, whose central box resembles the form of the non-outlying data for the five models. The band-based versions of the functional boxplot fail to represent the central box in at least one of the models.

In Figure ?? we can only show the behavior of one random generation of the set of functions. In order to analyze whether the behavior shown in the graphics is common or not, we repeat the simulation experiment 50 times. We consider the worst possible scenario, that is, one or more outliers belonging to the central box. This behavior produces an important distortion in the central box, which, in these cases, becomes a bad representative of the form of the functions. Table ?? shows the number of runs where this behavior is present, i.e. at least one outlier is part of the central box. It is remarkable how, for Model 4, MBD-fb erroneously considers at least one outlier as part of the central box in all 50 runs. This is the worst possible case. Two other very bad cases are the behavior of BD-fb for Model 2 and MBD-fb for Model 3, where at least 64% of the runs give this kind of bad representation of the central box. For EH-fb, the worst case is Model 3, where it happens in 36% of the runs. This analysis shows the potential of EH-fb as a robust alternative to the band-based functional boxplots.
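The magnitude contamination of Models 2-4 can be sketched as a single function applied to base curves that follow Model 1. This is an illustration under stated assumptions, not the simulation code behind the results above: the function name `contaminate`, the seed handling and the returned flags are hypothetical, and the curves are assumed to be sampled on a grid of [0, 1]. Model 5 is not covered, because it requires sampling a second Gaussian process.

```python
import random


def contaminate(base, grid, model, q=0.1, K=8.0, ell=3 / 49, seed=0):
    """Apply the contamination of Model 2, 3 or 4 to base curves sampled on grid.

    Returns (curves, flags), where flags[i] is True if curve i was contaminated.
    """
    if model not in (2, 3, 4):
        raise ValueError("model must be 2, 3 or 4")
    rng = random.Random(seed)
    curves, flags = [], []
    for x in base:
        c = rng.random() < q                       # c_i = 1 with probability q
        sigma = 1.0 if rng.random() < 0.5 else -1.0  # sign of the shift
        if model == 2:                             # shift on the whole interval
            start, end = float("-inf"), float("inf")
        elif model == 3:                           # shift for t >= T_i
            start, end = rng.uniform(0.0, 1.0), float("inf")
        else:                                      # peak on [T_i, T_i + ell]
            start = rng.uniform(0.0, 1.0 - ell)
            end = start + ell
        if c:
            shift = sigma * K
            y = [v + shift if start <= t <= end else v for t, v in zip(grid, x)]
        else:
            y = list(x)
        curves.append(y)
        flags.append(c)
    return curves, flags


# Usage with flat placeholder base curves (hypothetical data, not Model 1 draws).
grid = [j / 49 for j in range(50)]
base = [[0.0] * 50 for _ in range(100)]
curves, flags = contaminate(base, grid, model=4)
print(sum(flags), "contaminated curves out of", len(curves))
```

Returning the flags makes it possible to count false positives and false negatives by comparing them with the outliers reported by a boxplot.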

6 Real data

We analyze three datasets: Rain Australia, Growth and US Precipitations.

Figure 7: EH-fb (left), BD-fb (center) and MBD-fb (right). Simulated data.

model  EH-fb  BD-fb  MBD-fb
1      0      0      0
2      0      37     0
3      18     9      32
4      10     3      50
5      9      4      0

Table 2: Number of runs where at least one outlier erroneously belongs to the central region.

6.1 Rain Australia Dataset

In the first real example, Rain Australia, we consider 191 rainfall curves from different weather stations in Australia. This dataset has been used before in ?? and can be downloaded from http://dss.ucar.edu/datasets/ds482.1. Each curve xi(t) represents the averaged rainfall in station i at day t, with t ∈ [1, 365]. The raw data are observed daily and contain some missing values. We try to use the data with as little preprocessing as possible. For a missing value, the next available value contains the accumulated rainfall of the preceding missed days. We have distributed this accumulated rainfall equally over the consecutive missed days. For every station and day, the averaged rainfall is computed over all the years in which the station was operative. The dataset contains data from 1840 until 1990. Over these years there have been changes in the location of the stations. Hence, none of the stations has been operative over the whole period. There is one particular station that was operative for less than two years, whereas the others were operative for between 17 and 126 years. Since xi(t) represents the rainfall at station i and day t averaged over the years with available data, the consequence is that some of the curves are smoother than others.

Figure ?? shows a plot of the 191 curves (top-left graphic), and the EH-fb (top-right), BD-fb (bottom-left) and MBD-fb (bottom-right). The proportion of detected outliers is given under each graphic.

The three boxplots look very different. EH-fb detects two outliers. One of them corresponds to a station where little information is available. For this station, the database only contains one complete year of data and two months of other years. Hence, the function of averaged daily rainfall is very different from those of other stations where more years are available. Indeed, it has one day where the curve almost doubles the maximum value of the other curves. The other outlier is a function that behaves differently from the other curves (see top-left graphic). The rainfall around September-October is much higher than in other stations. This station is situated in Queenstown, on the island of Tasmania.

BD-fb also detects two outliers. They correspond to stations with high rainfall in winter and low rainfall in summer. This behavior does not seem very different from that of the data plotted in the top-left graphic. The station where very few data are available and the Queenstown station are not detected as outliers, and the Queenstown station even belongs to the central box.

MBD-fb has a completely different behavior and 14.66% of the curves are detected as outliers. In this graphic we have chosen to plot the outliers behind the boxplot. Indeed, there are so many outliers and they are so similar to the data that, if they are plotted in front, they cover the central box and make the boxplot difficult to see.

Note that we have not performed any smoothing on the curves. Smoothing the curves before plotting the functional boxplot could give very different results, especially for MBD-fb. We believe that functional boxplots, like many visualization techniques, are often used to get a first glance at the data, as a step prior to a more thorough analysis. Hence, it is important to have good visualization methods that work well with as little preprocessing as possible. This example illustrates that EH-fb is a good option in such a setting.

The next two real examples were also used in ?. One of them is again a weather application, whereas the other is a medical dataset that has very often been used to illustrate functional data analysis techniques.

6.2 Growth Dataset

The Growth database contains data from two groups: boys and girls. The left column of Figure ?? shows the graphics corresponding to the boys, whereas the right column shows the graphics of the girls. Only MBD-fb detects one outlier, for the girls. This corresponds to a girl who is consistently taller than her peers at the different ages. Hence, we again see that EH-fb and BD-fb are more conservative in detecting outliers. The form of the central boxes is similar, with the central box of BD-fb being wider than the other two.

Figure 8: Comparing Boxplots. Historical daily precipitations in Australia.

Figure 9: Comparing Boxplots. Growth data. Boys (left column) and girls (right column).

6.3 US Precipitations Dataset

The last example, US Precipitations, corresponds again to rainfall observations. In Rain Australia we compared the three methods on almost raw data, to see the behavior of the functional boxplots when data are not preprocessed. In this example we use a dataset where the curves have been previously smoothed. Each curve represents total annual rainfall from 1895 to 1997 at a station in the United States. The original data are provided by the Institute for Mathematics Applied to Geosciences (http://www.image.ucar.edu/Data/US.monthly.met/). A preprocessing step to smooth the curves is suggested in ?, and the data after this preprocessing can be found on Sun's webpage (http://www.stat.tamu.edu/~sunwards/publication.html).¹

A functional boxplot is computed for each of the nine climatic regions defined by the National Climatic Data Center. Figure ?? shows the curves, and Figures ??-?? show EH-fb, BD-fb and MBD-fb.

BD-fb produces boxplots that are very similar to those of EH-fb. The main difference is that in this example BD-fb is more conservative, as it detects a lower number of outliers.

It can be observed that, in general, EH-fb produces central boxes that are flatter than the central boxes produced by MBD-fb. The most relevant cases are the South and West regions, where the central box of MBD-fb has some peaks whereas that of EH-fb is completely flat. The reason for this difference might be that functions that are low in the plot but contain many ups and downs are considered by MBD-fb as central functions, whereas EH-fb considers them as low functions.

We have seen in previous examples that EH-fb is usually more conservative because it produces wider central boxes. However, in this example, despite having wider boxes, there are some outliers that are detected by EH-fb and not by MBD-fb. One of these cases is the East North Central region, where at a certain station the rainfall increased at the end of the period considered. The other example is the Central region, where a curve has a high peak around 1950 and some other peaks in other years. These two curves seem to behave differently from the other curves of their regions, and are detected by EH-fb but not by MBD-fb. Another curious example is the South East region. The two graphics look very similar: they both detect many outliers in the upper part of the graphic, above the central box. However, EH-fb also detects an outlying curve below the central box, whereas MBD-fb does not detect any curves in that part of the graphic.

7 Computational times

For a given function f, computing one of the proposed indexes requires comparing f with every other function g in the dataset. For

¹ More specifically, the data are in the zip file datasets listed under the article Functional Boxplots. The files named sfitreg1cy.data and similar contain the curves for the different regions.

Figure 10: Observed yearly precipitation curves over the nine climatic regions for the coterminous United States from 1895 to 1997.

Figure 11: EH-fb of observed yearly precipitation over the nine climatic regions for the coterminous United States from 1895 to 1997.

Figure 12: BD-fb of observed yearly precipitation over the nine climatic regions for the coterminous United States from 1895 to 1997.

Figure 13: MBD-fb of observed yearly precipitation over the nine climatic regions for the coterminous United States from 1895 to 1997.

each such function g, we check whether g lies inside epi(f) or hyp(f), respectively. Hence, each function f in the dataset requires O(n) comparisons. This is an advantage over the band-based depth, which compares f with each band formed by two other functions g and h, yielding O(n^2) comparisons for each function f.

Computing the functional boxplot requires either the proposed indexes or the depth value of every function f in the dataset. Hence, the total computational time is O(n^2) for the EH-fb and O(n^3) for the BD-fb or MBD-fb. This can make band-based methods prohibitive for datasets with many curves.

Table ?? shows the time in seconds needed to compute each index for all the functions in the dataset, for the datasets used in Sections ?? and ??. The first and second columns give the name and size of the dataset, columns 3-6 give the times for the proposed indexes, and columns 7-8 give the times for the band-based depths. The experiment was performed on a machine with a 2.6 GHz Intel processor and 4 GB of RAM. For the US precipitation dataset, only the results for one of the regions are shown; the other regions behave similarly. For this region, which has 1,424 curves, the proposed indexes can be computed in about six seconds, whereas the band-based measures need over one hour.

dataset               n      MEI    EI     MHI    HI     BD         MBD
simulation, model 1   100    0.01   0.01   0.01   0.01   1.10       1.15
simulation, model 2   100    0.01   0.01   0.01   0.01   1.18       1.13
simulation, model 3   100    0.01   0.01   0.01   0.01   1.14       1.15
simulation, model 4   100    0.01   0.01   0.01   0.01   1.16       1.13
simulation, model 5   100    0.01   0.01   0.01   0.01   1.18       1.15
Australia             191    0.23   0.23   0.24   0.25   29.65      29.51
Growth, girls         54     0.01   0.01   0.01   0.01   0.23       0.22
Growth, boys          39     0.01   0.01   0.01   0.01   0.13       0.11
US (North East)       1424   5.98   6.14   5.98   6.14   5,158.72   5,226.69

Table 3: Time in seconds needed to compute each index for the whole set of functions in the dataset.
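The comparison counts discussed in this section can be illustrated with a minimal Python sketch. It assumes that the curves are sampled on a common grid and stored as the rows of a NumPy array, that EI and HI are taken as the proportion of sample curves lying entirely in epi(f) and hyp(f), that MEI and MHI average the same pointwise checks over the grid, and that bands are formed by pairs of sample curves. The exact normalisation in the definitions of the indexes may differ, and the function names are illustrative only.

```python
from itertools import combinations

import numpy as np


def epi_hypo_indexes(curves):
    """Epigraph/hypograph-type indexes for curves of shape (n, T).

    Assumed definitions (see the lead-in): each index is a proportion
    over all n sample curves, including the curve itself.
    """
    curves = np.asarray(curves, dtype=float)
    n = curves.shape[0]
    out = {name: np.empty(n) for name in ("EI", "HI", "MEI", "MHI")}
    for i, f in enumerate(curves):
        # One pass over the n curves per function f: O(n) comparisons.
        above = curves >= f  # above[j, t]: g_j(t) >= f(t)
        below = curves <= f  # below[j, t]: g_j(t) <= f(t)
        out["EI"][i] = above.all(axis=1).mean()   # curves entirely in epi(f)
        out["HI"][i] = below.all(axis=1).mean()   # curves entirely in hyp(f)
        out["MEI"][i] = above.mean()              # pointwise (modified) version
        out["MHI"][i] = below.mean()
    return out


def band_depth(curves):
    """Band depth with bands formed by pairs of curves (assumed J = 2)."""
    curves = np.asarray(curves, dtype=float)
    n = curves.shape[0]
    pairs = list(combinations(range(n), 2))
    depth = np.zeros(n)
    for i, f in enumerate(curves):
        # One pass over all n(n-1)/2 bands per function f: O(n^2) comparisons.
        for j, k in pairs:
            lower = np.minimum(curves[j], curves[k])
            upper = np.maximum(curves[j], curves[k])
            depth[i] += np.all((lower <= f) & (f <= upper))
    return depth / len(pairs)


# Toy usage: three constant curves on a grid of four points.
toy = np.array([[0.0] * 4, [1.0] * 4, [2.0] * 4])
indexes = epi_hypo_indexes(toy)
depths = band_depth(toy)
```

Under these assumptions, a dataset with n = 1,424 curves needs roughly two million curve-to-curve checks for the indexes, against roughly 1.4 billion curve-to-band checks for the band depth. This is in line with the gap between seconds and hours reported in Table 3.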

8 Some conclusions

Descriptive statistics, in particular visualization methods, still play an important role in science. Functional boxplots (?) and other related

boxplot approaches for functional data ? have proven useful for this purpose. The functional boxplot proposed in (?) is based mainly on data depth, which gives a center-outward ordering of the curves. We propose a new version of the functional boxplot that uses a down-up ordering of the data. This type of ordering seems to be a more natural extension of the univariate boxplot.

The simulated examples show that EH-fb is a good alternative to its depth-based counterparts, and that it can give a robust representation of the central box when either BD-fb or MBD-fb fails. In our simulated examples, the EH-fb approach was less affected than BD-fb by irregularities in the data, a case in which MBD-fb also works well. However, when the outliers differed from the rest only on small subintervals, the MBD-fb approach easily failed to detect them, whereas EH-fb remained useful. Our real examples illustrate that EH-fb detects different outliers than depth-based approaches. In particular, it has proven to be a useful tool when the data have yet to be preprocessed. Hence, EH-fb is a promising visualization technique for functional data. Moreover, the computational time required to compute the EH-fb is O(n^2), much lower than the O(n^3) required by the existing band-based approaches.

An extension of EH-fb to spatio-temporal data, dealing with surfaces instead of curves, is straightforward, as noted by ? for depth-based functional boxplots. In general, these approaches do not extend directly to multivariate data, unless an intuitive index makes sense for a particular application. Extensions to other complex data, such as images or networks, are interesting topics for further research. In recent years, internet-based social networks have become very popular, and the information captured by those networks brings up new challenges for data analysis and visualization techniques.

9 TO GET OUT OF THE PAPER

9.1 El Niño

The El Niño dataset contains monthly sea surface temperatures (SST), measured in degrees Celsius, over the east-central tropical Pacific Ocean. Figure ?? contains the plots of the curves and the three functional boxplots considered. The EH-fb is more conservative and does not detect any outliers. The BD-fb detects only one outlier, which corresponds to the year 1998. MBD-fb detects two outliers: 1983 and 1997. The three central boxes have a similar shape, the main difference being that the central box of the MBD-fb is slightly tighter.

Figure 14: Comparing Boxplots. El Niño.

Figure 15: Comparing Boxplots. Simulated data, models 6-7.

9.2 Models 6 and 7

Models 6 and 7 simulations:
