East Tennessee State University Digital Commons @ East Tennessee State University

Electronic Theses and Dissertations Student Works

5-2015 Exploring Ways of Identifying in Spatial Point Patterns Jie Liu East Tennessee State University

Follow this and additional works at: https://dc.etsu.edu/etd Part of the Mathematics Commons

Recommended Citation Liu, Jie, "Exploring Ways of Identifying Outliers in Spatial Point Patterns" (2015). Electronic Theses and Dissertations. Paper 2528. https://dc.etsu.edu/etd/2528

This Thesis - Open Access is brought to you for free and open access by the Student Works at Digital Commons @ East Tennessee State University. It has been accepted for inclusion in Electronic Theses and Dissertations by an authorized administrator of Digital Commons @ East Tennessee State University. For more information, please contact [email protected]. Exploring Ways of Identifying Outliers in Spatial Point Patterns

A thesis

presented to

the faculty of the Department of Mathematics and

East Tennessee State University

In partial fulfillment

of the requirements for the degree

Master of Science in Mathematical Sciences

by

Jie Liu

May 2015

Edith Seier, Ph.D., Chair

Michele Joyner, Ph.D.

Yali Liu, Ph.D.

Keywords: spatial data, , distance method, statistics. ABSTRACT

Exploring Ways of Identifying Outliers in Spatial Point Patterns

by

Jie Liu

This work discusses alternative methods to detect outliers in spatial point patterns.

Outliers are defined based on location only and also with respect to associated vari- ables. Throughout the thesis we discuss five case studies, three of them come from experiments with spiders and bees, and the other two are data from earthquakes in a certain region. One of the main conclusions is that when detecting outliers from the point of view of location we need to take into consideration both the degree of clustering of the events and the context of the study. When detecting outliers from the point of view of an associated variable, outliers can be identified from a global or local perspective. For global outliers, one of the main questions addressed is whether the outliers tend to be clustered or randomly distributed in the region. All the work was done using the programming language.

2 Copyright by Jie Liu 2015

3 DEDICATION

I would like to dedicate this work to my beloved family for always standing behind me, and my sweet boyfriend Kai Pei for the love and support.

4 ACKNOWLEDGMENTS

First, I would like to express my sincerest appreciation to my thesis advisor, Dr.

Seier Edith, for all her support, guidance, advice, encouragement and patience for this work. I could not have expected to have a better advisor during my Masters studies. Also, I would like to thank the other committee members, Dr. Michele

Joyner and Dr. Yali Liu, for the assistance and trust they offered. Furthermore, I must acknowledge the Department of Mathematics and Statistics at East Tennessee

State University for providing me with financial support. In the end, a very special thanks goes to my parents and my boyfriend, Bingshan Liu, Yuzhen Li and Kai Pei for being by my side all the way.

5 TABLE OF CONTENTS

ABSTRACT ...... 2

DEDICATION ...... 4

ACKNOWLEDGMENTS ...... 5

TABLE OF CONTENTS ...... 7

LIST OF FIGURES ...... 11

1 INTRODUCTION ...... 12

2 ANALYSIS OF SPATIAL POINT PATTERNS ...... 14

2.1 Spatial Point Patterns ...... 14

2.1.1 Case Studies ...... 14

2.2 Types of Distributions ...... 17

2.3 Analysis of Case Studies ...... 22

3 OUTLIERS IN SPATIAL POINT PATTERNS WITH RESPECT TO

LOCATION ONLY ...... 29

3.1 Outliers Regarding Location Only ...... 29

3.2 Outlier Detection Methods ...... 29

3.3 Case Studies ...... 31

4 OUTLIERS IN SPATIAL POINT PATTERNS WITH RESPECT TO

AN ASSOCIATED VARIABLE ...... 55

4.1 Working with One Associated Variable ...... 56

4.2 Working with Two Associated Variables ...... 75

5 CONCLUSIONS ...... 82

BIBLIOGRAPHY ...... 85

6 VITA ...... 87

7 LIST OF FIGURES

1 Locations of spiders in two different colonies ...... 15

2 The location of waggle dances of bees in a hive ...... 16

3 Locations of earthquakes within certain longitude (-68◦, -83◦) and lat-

itude (0◦, -18◦)...... 18

4 Different types of distances ...... 19

5 Analysis of Case Study 1 ...... 23

6 Analysis of Case Study 2 ...... 24

7 Analysis of Case Study 3 ...... 25

8 Analysis of Case Study 4 ...... 26

9 Analysis of Case Study 5 ...... 27

10 Outliers detection based on boxplot for Case 1 ...... 31

11 Outliers detection based on boxplot for Case 2 ...... 32

12 Outliers detection based on boxplot for Case 3 ...... 33

13 Histogram of the dnn for Case 3 ...... 34

14 Outliers detection based on first gap for Case 3 ...... 36

15 Histogram of dnn and lndnn for Case 3 ...... 37

16 Outliers detection based on boxplot for lndnn for Case 3 ...... 38

17 Outliers detection based on three standard deviations from of

lndnn for Case 3 ...... 39

18 Outliers detection based on linear adjusted boxplot for Case 3 . . . . 40

19 Outliers detection based on exponential adjusted boxplot for Case 3 . 41

20 Outliers Detection based on boxplot for Case 4 ...... 42

8 21 Histogram of the dnn for Case 4 ...... 43

22 Outliers detection based on first gap for Case 4 ...... 49

23 Histogram of dnn and lndnn for Case 4 ...... 49

24 Outliers detection based on boxplot for lndnn for Case 4 ...... 50

25 Outliers detection based on three standard deviations from mean of

lndnn for Case 4 ...... 51

26 Outliers detection based on linear adjusted boxplot for Case 4 . . . . 52

27 Outliers detection based on exponential adjusted boxplot for Case 4 . 53

28 Outliers detection based on boxplot for Case 5 ...... 54

29 Boxplot and histogram of depth for Case 4 ...... 56

30 Boxplot and histogram of magnitude for Case 4 ...... 57

31 3D-plot with outliers identified based on depth for Case 4 ...... 58

32 3D-plot with outliers identified based on magnitude for Case 4 . . . . 58

33 Gˆ with depth outliers for Case 4 ...... 59

34 Gˆ with magnitude outliers for Case 4 ...... 59

35 2D-plot with outliers identified based on depth for Case 4 ...... 60

36 2D-plot with outliers identified based on magnitude for Case 4 . . . . 61

37 3D-plot for earthquakes of magnitude 5 or more ...... 61

38 Boxplot and histogram of depth for Case 5 ...... 62

39 Boxplot and histogram of magnitude for Case 5 ...... 63

40 3D-plot with outliers identified based on depth for Case 5 ...... 64

41 3D-plot with outliers identified based on magnitude for Case 5 . . . . 64

42 Gˆ with depth outliers for Case 5 ...... 65

9 43 Gˆ with magnitude outliers for Case 5 ...... 65

44 2D-plot with outliers identified based on depth for Case 5 ...... 66

45 2D-plot with outliers identified based on magnitude for Case 5 . . . . 66

46 3D-plot for earthquakes of magnitude 6 or more ...... 67

47 Boxplots and histograms of differences from mean and max for depth

for Case 4 ...... 69

48 Boxplots and histograms of differences from mean and max for magni-

tude for Case 4 ...... 70

49 3D-plot with outliers identified based on differences from mean and

max for depth for Case 4 ...... 71

50 3D-plot with outliers identified based on differences from mean and

max for magnitude for Case 4 ...... 71

51 Boxplots and histograms of differences from mean and max for depth

for Case 5 ...... 72

52 Boxplots and histograms of differences from mean and max for magni-

tude for Case 5 ...... 73

53 3D-plot with outliers identified based on differences from mean and

max for depth for Case 5 ...... 74

54 3D-plot with outliers identified based on differences from mean and

max for magnitude for Case 5 ...... 74

55 Plot of the standardized value of one variable >3 for Case 4 . . . . . 76

56 Plot of the standardized value of one variable >3 for Case 5 . . . . . 77

57 Plot of the standardized values of both variables >3 for Case 4 . . . . 78

10 58 Plot of the sum of the standardized values >3 for Case 4 ...... 79

59 Plot of the sum of the standardized values >6 for Case 4 ...... 80

60 Plot of the sum of the standardized values >3 for Case 5 ...... 81

11 1 INTRODUCTION

Spatial point pattern is a set of points distributed irregularly within a specific space

[1]. In Section 2 of the thesis we review the literature on the analysis of spatial point

patterns. There are three types of distributions for spatial point patterns, random,

regular and clustered [2]. There are three estimated functions which can be used to

determine which distribution a given spatial point pattern exhibits, the G function, K function and F function. We use as motivation 5 case studies of spatial point patterns that have not been studied in the literature. Three come from data produced at ETSU by the NSF funded CRAWL (Collaborative Research in the Arthopod Way of Life) project. Two of those three cases refer to spiders’ location in a web at a certain time and the other one is the location of waggle dances in an experimental bee hive during one day. The other two cases have been created retrieving data from a large data base on earthquakes. We do basic analysis of the five cases based on the methods from the literature review and then we propose some methods for outliers’ identification and apply them to those case studies. For the first three cases, we apply the G function,

K function and F function using Euclidean distance. The earthquake data are given in latitude and longitude. The R package sp [7] is used to handle these coordinates.

Regarding the outliers in a spatial point pattern, I focus on two topics. The first one is when only the location of the events is considered and the second one is when there is one or more variables associated to the event besides the coordinates. In

Section 3, we discuss alternative methods to detect outliers in spatial point patterns when we only consider the location of the events. Most of the methods explored are based on the distances to the nearest neighbor (dnn). To determine whether a point is

12 an outlier or not, we need to discuss the distribution of dnn because a single method might not be appropriate for all shapes of distributions. The method to detect outliers might depend on the shape of the distribution. The methods considered are (1) the usual application of the boxplot, (2) the first gap method, (3) applying Method 1 to the transformed data, (4) three standard deviations from the mean of the transformed data, (5) using the adjusted boxplot based on a linear model and (6) using the adjusted boxplot based on an exponential model [9].

Section 4 addresses the problem of detecting outliers with respect to one or more associated variables. In some cases, like the earthquakes examples, there might be other variables such as magnitude and depth of epicenter associated to each event.

We discuss methods to identify outliers with respect to these variables, i.e., when the value of the variable is unusually high or low considering the values associated to all of the surrounding events. We consider the concepts of global and local outliers.

We study the spatial distribution of the global outliers. To identify local outliers, we define a circle around the event, with a defined value of radius, and check whether the value for the event would be an outlier compared to the values for the points inside the circle. We first work with each associated variable separately and then with both variables together to detect outliers in the spatial point pattern. To work with two associated variables at the same time, we standardize the values of each variable and plot their absolute standardized values in one plot.

It is necessary to write computer code to perform the proposed analysis. All the work is done using the R programming language.

13 2 ANALYSIS OF SPATIAL POINT PATTERNS

2.1 Spatial Point Patterns

A set of points distributed irregularly within a specific space, is called spatial point pattern [1]. Classical examples from the literature include the location of trees in a forest, nests of birds, cell nuclei in tissue, people with a certain illness in a region.

The trees, nest, nuclei, sick people in the examples are called ‘events’, and we care for the location of the events. We will work with the coordinates of where events happen. Here we will work mainly with five case studies. The first two come from observations of the location of spiders in a web; the third case is about the location of waggle dances done by bees in an experimental hive. These three data sets were produced in the undergraduate research project CRAWL (Collaborative Research in the Arthropod Way of Life) at ETSU. The last two case studies are the location of earthquakes within certain values of longitude and latitude; these two cases are ex- amples where the information about the value of an associated variable of interest in each location is available, such as the intensity or the depth at which the earthquake happened. Next we will describe these examples that will be analyzed in this work.

2.1.1 Case Studies

Cases 1 and 2 refer to experiments done with spiders of the species Anelosimus studiosus. Brooding spiders were located each one in an enclosure of 28cm x 28cm.

When the offspring was visible, the location of each juvenile spider was recorded in 14 four different stages of development (observation day 0, 23, 35 and 48) 6 times a day.

Colony 35 in Case 1 is formed by a mother and her offspring; mother is no longer alive or visible in Colony 32 for Case 2.

Case 1. Spiders in Colony 35, first day of observation at midnight, the mother is absent

The location of the spiders in Case 1 is displayed in Figure 1A.

Case 2. Spiders in Colony 32, first day of observation at 8am in the morn- ing, mother is present

The location of the spiders in Case 2 is displayed in Figure 1B. The mother is the point distant from the others in Figure 1B.

A.Spiders without Mother B.Spiders with Mother y y 0 5 10 15 20 25 0 5 10 15 20 25

0 5 10 15 20 25 0 5 10 15 20 25

x x

Figure 1: Locations of spiders in two different colonies

15 Case 3. The Location of waggle dances in a hive

As part of the data collection in the CRAWL project, the location of the waggle

dances that happened during the day in a square (2 connected panels, each 50x25cm),

experimental hive were recorded. Figure 2 displays the location of the waggle dances

during observation day 3 in a given hive.

Location of waggle dances y 0 10 20 30 40 50

0 10 20 30 40 50

x

Figure 2: The location of waggle dances of bees in a hive

Case 4 and 5 refer to earthquakes that happened within certain longitude (-68◦,

-83◦) and latitude (0◦, -18◦). That region includes the country Peru. The location of the earthquakes (since 1973) can be found at the website of Earthquake Hazards Pro- gram [3]. As for the earthquakes, I prepared two data sets, one for all the earthquakes with magnitude 5 or more and the other is for all the earthquakes with magnitude 6 or more.

16 Case 4. Earthquakes of magnitude 5 or more within certain longitude (-

68◦, -83◦) and latitude (0◦, -18◦)

The location of the earthquakes of magnitude 5 or more within certain longitude

(-68◦, -83◦) and latitude (0◦, -18◦) are in Figure 3A.

Case 5. Earthquakes of magnitude 6 or more within certain longitude (-

68◦, -83◦) and latitude (0◦, -18◦)

The location of the earthquakes with magnitude 6 or more within certain longi- tude (-68◦, -83◦) and latitude (0◦, -18◦) are displayed in Figure 3B.

Case studies 1-5 are examples of spatial point patterns. In the next section, the tools used to describe the distribution of spatial point patterns will be summarized.

2.2 Types of Distributions

Basically, there are three types of distributions for the spatial point patterns, which are ‘random’, ‘regular’ and ‘aggregated’ distributions [2].

Random distribution indicates a completely random pattern, which shows no ob- vious structure, i.e., there is equal chance for any point to occur at any location within the space and the events will not influence each other at all. Regular distri- bution the events are distributed regularly to some degree, like every event within the space is at nearly the same distance from all of its neighbors. Aggregated distribution means clustered points on the space, where many points are distributed close together.

17 A.Earthquake of magniitude 5+ B.Earthquake of magniitude 6+ y y −15 −10 −5 0 −15 −10 −5 0

−80 −75 −70 −82 −78 −74 −70

x x

Figure 3: Locations of earthquakes within certain longitude (-68◦, -83◦) and latitude

(0◦, -18◦)

2.2.1 Testing for Complete Spatial Randomness

Before doing further analysis of the spatial point patterns, we should test for com- plete spatial randomness (CSR). CSR assumes the points within the region follow a homogeneous Poisson point process. There are several methods to check for CSR, the most important ones are the distance methods [6]. They are nearest neighbour distance method, inter-event distance method and point to nearest event distance method. The following statistical tests can be conducted to test for significant pat- terns in our data.

18 Ho: events exhibit complete spatial randomness (CSR)

Ha: events are spatially clustered or dispersed

If the hypothesis of the complete spatial randomness for a spatial point patten is not rejected, then it is assumed that the number of events in the region follows a Poisson distribution and the events in the region are distributed randomly and independently.

In other words, all the events have equal chance to occur anywhere over the space and will not influence each other [4].

A B C y y y 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0

0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0

x x x

Figure 4: Different types of distances

There are three distance methods for testing CSR. The decision about CSR is made based on the cumulative distribution function of a variable that is a distance; the difference between different methods is the type of distance they consider. Figure

4 depicts these different types of distributions. Consider a spatial point pattern with only 4 events, Figure 4A shows the distance to the nearest neighbor for the particular event in the center of the region. Figure 4B shows all the inter-event distances for the events in the space. Figure 4C shows a random point in the upper corner of the

19 region and the distance from that random point to the closest event.

2.2.2 Distance Methods Testing the Complete Spatial Randomness

Nearest Neighbor Distance Method

Illian [1] indicates that in a specific space with n events, rmin denotes the distance from the ith event to its nearest neighbour (Figure 4A). The G function is the CDF of the variables ‘distance to the nearest neighbor’. The Gˆ function is the cumulative frequency distribution of the nearest neighbor distances calculated from the data.

N[r (s ) ≤ r] Number of point pairs where r ≤ r Gˆ(r) = min i = min (1) n Number of points in study area

In the analysis of case studies, the simulated confidence envelope is constructed.

The confidence envelope includes the central 95% of the values obtained by simulation assuming complete spatial randomness. If the Gˆ for a given spatial point pattern is within the envelope, it indicates the spatial point pattern is randomly distributed; if the Gˆ is above the envelope, it indicates the spatial point pattern is clustered. An example of Gˆ can be seen in Figure 5b.

Inter-event Distance Analysis

The second type of distance is the inter-event distance. Compared to the near- est neighbour distance, which only considers the shortest distance, the inter-event distance method is based on all the distances between events in the study area [1].

Figure 4B shows all the inter-event distances for an sample with just four events.

Note that if there are n events, there should be 0.5n(n − 1) inter-event distances

[5]. In the inter-event analysis, we calculate the empirical distribution function of the inter-event distance called the Kˆ function.

20 ˆ R XX Ih(dij) K(h) = 2 (2) n wij i6=j

where R is the area of the region, n is the number of points, Ih is a dummy variables that takes value 1 if dij ≤ h and 0 otherwise, where dij is the inter-event distance between the i event and the j event. The symbol wij in the equation is the edge cor- rection, i.e., the proportion of circumference of a circle centered on point i, containing point j that is in the study area (proportion is 1 if the whole circle is in the study area).

The simulated confidence envelope helps to check the distribution of a given spa- tial point pattern. If the Kˆ function we estimated from the spatial point pattern is within the envelope, it indicates that it is random distributed; if the Kˆ function is

above the envelope, it indicates a clustered spatial point pattern. An example of Kˆ

can be seen in Figure 5d.

Point to the Nearest Event Distance Analysis

In this process, we randomly select m points on the space, then we calculate the

distance from each of the m points to the closest event located on the space. Figure

4C shows the distance from a random point to its nearest event. The F function is

the CDF of those ‘points to the nearest event distance’. The Fˆ is defined as

N[d (p ) ≤ r] Number of points pairs where r ≤ r Fˆ(r) = min i = min (3) m Number of sample points

The simulated confidence envelope is constructed in the analysis of case studies. If

the Fˆ function is within the envelope, it means the spatial point pattern is completely

randomly distributed; if the Fˆ function is below the envelope, it indicates a clustered

21 spatial point pattern. An example of Fˆ can be seen in Figure 5c.

2.3 Analysis of Case Studies

Now we will use the methods described in Section 2.2 to analyze the case studies mentioned in Section 2.1.

For Case 1 (Spiders in Colony 35), Case 2 (Spiders in Colony 32) as well as Case

3 ( Location of waggle dances in a hive), we are dealing with Euclidean distance, the

Gˆ function, the Kˆ function as well as the Fˆ function can be calculated as mentioned

in Section 2.2. However, with regard to the earthquake locations in Case 4 and Case

5, Euclidean distance can not be applied since the earth is round and we are dealing

with longitude and latitude. Therefore, in Case 4 ( Earthquakes of magnitude 5 or

more within certain longitude (-68◦, -83◦) and latitude (0◦, -18◦)) and Case 5 (Earth-

quakes of magnitude 6 or more within certain longitude (-68◦, -83◦) and latitude (0◦,

-18◦)), the R package sp [7] is used to calculate the distances between points where

coordinates are given in longitude and latitude. To calculate such distances, the great

circle method is used.

Case 1. Spiders in Colony 35, first day of observation at midnight, the

mother is absent

Based on Figure 5, we can see that the Gˆ and the Kˆ are above the envelope, while

the Fˆ is below the envelope, indicating that the spatial point pattern in Case 1 is

clustered.

22 a. Locations of spiders b. G Function c. F function

^ Gobs(r) G theo(r) ^ Ghi(r)

) ^ ) r Glo(r) r ( ( ^ F G Fobs(r) F theo(r) ^ Fhi(r) ^ Flo(r) 0 5 10 15 20 25 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0

0 5 10 15 20 25 0 1 2 3 4 0 2 4 6 8 10

r r

d.K Function e. Histogram of dnn f. Boxplot of dnn

^ Kobs(r) K theo(r) ^ Khi(r) ^ ) r Klo(r) ( K Frequency 0 100 200 300 400 500 0 2 4 6 8 10 12

0 1 2 3 4 5 6 7 0 2 4 6 8 10 12 0 2 4 6 8 10

r dnn

Figure 5: Analysis of Case Study 1

Case 2. Spiders in Colony 32, first day of observation at midnight, the mother is present

Figure 6 clearly indicates that the Gˆ and the Kˆ are at least partially above the envelope, while the Fˆ is below the envelope. Thus, the spatial point pattern in Case

2 is considered to be clustered.

23 a. Locations of spiders b. G Function c. F function

^ ^ Gobs(r) Fobs(r)

G theo(r) F theo(r) ^ ^ G r F r

) hi( ) hi( ) ) r r ( ^ ( ^ F G Glo(r) Flo(r) 0 5 10 15 20 25 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0

0 5 10 15 20 25 0.0 0.5 1.0 1.5 2.0 0 2 4 6 8 10 12

r r

d.K Function e. Histogram of dnn f. Boxplot of dnn

^ Kobs(r)

K theo(r) ^ Khi(r) ) r ( ^ K Klo(r) Frequency 0 200 400 600 800 0 1 2 3 4 5 6 7

0 1 2 3 4 5 6 7 0 5 10 15 20 25 30 0 5 10 15 20 25

r dnn

Figure 6: Analysis of Case Study 2

Case 3. The location of waggle dances in a hive

Based on Figure 7, the Gˆ and the Kˆ are above the envelope, while the Fˆ is below the envelope. Thus, the spatial point pattern in Case 3 is clustered.

Case 4. Earthquakes of magnitude 5 or more within certain longitude (-

68◦, -83◦) and latitude (0◦, -18◦)

Based on Figure 8, by comparing the Gˆ estimated from Case 4 with the Gˆ under

CSR calculated with a simulated large data set, we conclude the spatial pattern of the location of the earthquakes with magnitude 5 is clustered. Because the Gˆ estimated

24 from the given spatial point pattern is above the Gˆ estimated from the simulated data set assuming spatial randomness.

a.Location of waggle dances b. G Function c. F function

^ ^ Gobs(r) Fobs(r)

G theo(r) F theo(r) ^ ^ G r F r

) hi( ) hi( ) ) r r ( y ^ ( ^ F G Glo(r) Flo(r) 0 10 20 30 40 50 0.0 0.2 0.4 0.6 0.8 0.0 0.2 0.4 0.6 0.8 1.0

0 10 20 30 40 50 0.0 0.2 0.4 0.6 0 2 4 6 8 10 12

x r r

d.K Function e. Histogram of dnn f. Boxplot of dnn

^ Kobs(r)

K theo(r) ^ Khi(r) ) r ( ^ K Klo(r) Frequency 0 500 1000 1500 0 100 300 500 700

0 2 4 6 8 10 12 0 1 2 3 4 5 0 1 2 3 4

r dnn

Figure 7: Analysis of Case Study 3

Earthquakes do not randomly happen anywhere, they are determined by geology.

That explains why Case 4 shows a clustered pattern as well. Earths outer shell is formed of approximately ten large and about twenty small rigid tectonic plates that move slowly but continuously. According to the plate tectonics theory, earthquakes happen when tectonic plates touch each other or separate. The region under study

25 is located between longitudes -68◦ and -83◦ and from the equatorial line to latitude

-18◦. Part of that region is situated on the encounter of two tectonic plates: the

Nazca Plate and the South American plate. Peru is located along the boundary of two tectonic plates. These two plates are located closely putting huge strain on the

Earth’s crust. The pressures are periodically released through earthquakes. Earth- quakes do not randomly happen anywhere, they are determined by geology. That explains why Case 4 shows a clustered pattern as well.

a.Location of Earthquake5+ b.G estimated Fn(x) Latitude −15 −10 −5 0 0.0 0.4 0.8

−80 −75 −70 0 50 100

Longitude x

c.Histogram of dnn d.Boxplot of dnn Frequency 0 200 400 600

0 40 80 120 0 20 60 100

dnn

Figure 8: Analysis of Case Study 4

26 Case 5. Earthquakes of magnitude 6 or more within certain longitude (-

68◦, -83◦) and latitude (0◦, -18◦)

Based on Figure 9, a conclusion can be reached that the spatial pattern concerning the location of earthquakes with magnitude 6 or greater is clustered.

a.Location of Earthquake6+ b.G estimated Fn(x) Latitude −15 −10 −5 0 0.0 0.4 0.8

−82 −78 −74 −70 0 50 100 150

Longitude x

c.Histogram of dnn d.Boxplot of dnn Frequency 0 5 15 25 35

0 50 100 150 0 50 100 150

dnn

Figure 9: Analysis of Case Study 5

Cases 1-5 show a clustered or aggregated spatial distribution. However, looking at the distributions of dnn in Figures 5-9, we can see that the degree of aggregation is not the same. The locations of spiders in Case 2 are more clustered except for an

27 outlier than the spiders in Case 1. The spatial point pattern in which the clustering or aggregation is stronger than the other cases is Case 3, the location of the waggle dances in an experimental hive during one day. The location of earthquakes in Case

4 are much more clustered than that in Case 5. The boxplot in Figure 8d indicates a more skewed distribution than the boxplot in Figure 9d.

28 3 OUTLIERS IN SPATIAL POINT PATTERNS WITH RESPECT TO

LOCATION ONLY

3.1 Outliers Regarding Location Only

We will consider the term ‘outlier’ with two different meanings in the context of spatial point patterns depending on whether we are considering only the location of the events or we are considering the values of a variable associated with the events.

For example, in the earthquake example we can consider only the location of the epicenter or the location and the magnitude or the depth of the earthquake. The outliers with respect to an associated variable will be considered in Section 4. Within the study area, the points which are not expected to occur on the space according to the general structure of the pattern, but they appear on the space are considered as outliers [1].

3.2 Outlier Detection Methods

In Illian et al. [1] the following comment is found ‘the basic statistical idea for outlier detection is quite simple: assign numerical or functional marks to all points in the pattern, analyze these marks statistically and regard points with extreme marks as outliers’. We will work with the distances to the nearest neighbor, which is the value assigned to each point but several criteria to determine above which distances are to be considered outliers are to be explored and compared. For our case studies, we consider outliers as the ones far away from their neighbors. Thus, we will only care for the large dnn values, but not the small ones.

29 As indicated in Section 2, the distances to the nearest neighbor can be calculated using packages in R. For points in a plane, the function nndist in the package spatstat

[7] calculates the distances to the nearest neighbor using Euclidean distance. For points expressed in longitudes and latitudes, we will use the package sp [7] to handle with latitude and longitude coordinates [8]. The function spDists in the package sp is able to calculate distances using the great circle method. We will discuss alternative ways of identifying outliers based on the distribution of the distances to the nearest neighbor. The different ways to identify outliers that we will use are:

Method 1. The usual application of the boxplot

One simple method of identifying outliers would be to prepare a boxplot with those distances. Any value greater than Q3 + 1.5IQR would be considered as an outlier.

Method 2. First gap method

We define this by looking at the histogram or the stem and leaf display and consider outliers the values beyond the first large gap that is highly visible at the right side of the distribution.

Method 3. Applying Method 1 to the transformed data

The usual boxplot is applied to the transformed data after applying the logarithm transformation or any other Box − Cox transformation.

Method 4. Three standard deviations from the mean of the transformed data

Values beyondy ¯ + 3sd where y is the transformed data are considered to be outliers.

Method 5. Using the adjusted boxplot based on a linear model

In the original boxplot, we use the cutoff value Q3 + 1.5IQR. However, when the

30 distribution is highly skewed, we use a different value instead of 1.5. For the linear model approach, we use 1.5 + aMC, where MC is the medcouple value [9].

Method 6. Using the adjusted boxplot based on an exponential model

There is another model for the adjusted boxplot, the R package Robustbase is used to make the adjusted boxplot based on the exponential model [9].

In the next section, those methods will be presented, applied and discussed in the context of analyzing these case studies.

3.3 Case Studies

Case 1. Spiders in Colony 35, first day of observation at midnight, the mother is absent

Boxplot of dnn Spider without Mother y 0 5 10 15 20 25 0 2 4 6 8 10

0 5 10 15 20 25

x

Figure 10: Outliers detection based on boxplot for Case 1

31 Based on Figure 10, it can be seen that an obvious outlier is identified which is

shown in the purple circle by the general method that events with values greater than

Q3 + 1.5IQR can be considered as outliers. The distribution of dnn is not highly

skewed except for the outlier and Method 1 works satisfactory.

Case 2. Spiders in Colony 32, first day of observation at 8am in the morn-

ing, mother is present

Boxplot of dnn spiders with mother 0 5 10 15 20 25 0 5 10 15 20 25

0 5 10 15 20 25

Figure 11: Outliers detection based on boxplot for Case 2

Based on Figure 11, it is clear that an obvious outlier is identified shown in the

purple circle by the general method that events can be considered as outliers if they

have values greater than Q3 + 1.5IQR. In this case, the outlier is the mother; a guess is that she went on a hunting trip to catch a prey. The boxplot without the distant outlier indicates a fairly symmetric distribution and Method 1 works well.

32 Case 3. Location of waggle dances in a hive

Method 1. The usual application of the boxplot

Boxplot of dnn(q3+1.5*iqr) Location of waggle dances(q3+1.5*iqr) y 0 10 20 30 40 50 0 1 2 3 4

0 10 20 30 40 50

x

Figure 12: Outliers detection based on boxplot for Case 3

Based on Figure 12, it can be seen that a large number of outliers are detected which is shown in red by the usual method that events can be considered as outliers with values greater than Q3 + 1.5IQR. Figure 12 is an application of Method 1.

There is a considerable number of events that are located at a moderate distance from the nearest neighbor are identified as outliers with this method. When most of the points are highly clustered, the ones that are even at a moderate distance from the nearest neighbor are identified as outliers. In this example, most distances (in cm) were very small and any point with distance to the nearest neighbor larger than

0.7844 is considered an outlier by applying the usual rule of the boxplot. Below are

33 the basic statistics for the dnn of Case 3.

Min. 1st Qu. Mean 3rd Qu. Max.

0.02054 0.17450 0.28390 0.38540 0.41850 4.62400

In cases like this we might want to rethink the way outliers are identified. Due to the overcrowding of the events on the space, some otherwise acceptable nearest neighbor distances for the events turned out to be outliers based on the Method 1.

Method 2. First gap method

In order to get a better view of how our data are clustered, a histogram and a stemplot of the dnn for Case 3 are displayed.

Histogram of dnn Frequency 0 100 300 500 700

0 1 2 3 4 5

dnn

Figure 13: Histogram of the dnn for Case 3

Based on Figure 13, the histogram of dnn for Case 3, it is clear that a gap exists.

The gap can indicate what the cutoff value to detect the outliers should be. However 34 the stemplot give us a better idea of where the gaps in the dnn exist.

The decimal point is 1 digit(s) to the left of the |

0 | 2233444444555566666666667777777777777777777777888888888888888*

2 | 0000000000000000011111111111111111112222222222222222222222333*

4 | 0000111111111111111122233334444444455555555555566666777777888*

6 | 000000111111223333344444555566667788888990122222223334668999

8 | 011446688889233557

10 | 1112256617

12 | 1144816

14 | 03579

16 | 2285666

18 | 2299

20 | 918

22 | 7

24 | 66

26 | 352

28 |

30 |

32 |

34 | 88

36 |

38 |

35 40 |

42 |

44 | 4

46 | 2

Note: *indicates that the line has been truncated for formatting purposes.

All the points after the first gap in the sorted distances could be considered as outliers. For Case 3, the dances with distance to the nearest neighbor equal or greater than 3.48 would be the natural candidates. After applying the first gap method, four outliers are detected which are shown in the darkgreen circles in Figure 14.

Boxplot of dnn(gap idea) Location of waggle dances(gap idea) y 0 10 20 30 40 50 0 1 2 3 4

0 10 20 30 40 50

x

Figure 14: Outliers detection based on first gap for Case 3

36 Method 3. Applying Method 1 to the transformed data

A. Histogram of dnn B. Histogram of lndnn Frequency Frequency 0 100 200 300 400 500 600 700 0 50 100 150 200 250

0 1 2 3 4 5 −4 −3 −2 −1 0 1 2

dnn lndnn

Figure 15: Histogram of dnn and lndnn for Case 3

Based on the histogram in Figure 15A, the distribution of dnn is highly skewed,

and it looks like the distribution of dnn for the waggle dances is log normal. We

calculated lndnn = log(dnn); the distribution of the lndnn shown in Figure 15B is fairly symmetric.

Figure 16 is based on the method that identifies outliers with values greater than

Q3 + 1.5IQR using the transformed distances lndnn. There is a considerable number

of events that are located at moderate distances from the nearest neighbor but are

identified as outliers.

37 rmnraiy ewn ha n ple hsmto hthstenra model normal the has that mind. method in this applied departure and small ahead any went to We sensitive very normality. is from test such the with (876), that observations aware of be number to large need We a case. this in normality of assumption the The rejected distribution. normal the than distribution higher symmetric with fairly apparently a indicates the 15B of Figure normality transformation. the non-linear for their first check should we speaking, Strictly distribution. transformed the of mean the from data deviations standard Three 4. Method nte dafrdtcigotir sbsdo h 89-97 uefrtenormal the for rule 68-95-99.7% the on based is outliers detecting for idea Another

−4 −3 −2 −1 0 1 iue1:Otir eeto ae nbxltfor boxplot on based detection Outliers 16: Figure Boxplot oflndnn(q3+1.5*iqr) 38

y Location ofwaggle dances(q3+1.5*iqr) 0 10 20 30 40 50 02 04 50 40 30 20 10 0 lndnn x o ae3 Case for Shapiro-Wilk dnn test or rae hn355 ilb osdrda ules hyaesoni h darkgreen the in shown are 18. They Figure of in outliers. values cicles as with considered ones the be only will so 3.5259 values, than large greater the about care only We outliers. as with events case, model linear a on based Boxplot adjusted the Using 14. Figure 5. with in Method 17 shown Figure method in gap identified first outliers the four the same applying the are these that Notice circle. green y of 3 mean Case from for deviations standard three on based detection Outliers 17: Figure ¯ 3 + iue1 sa plcto fajse opo sn iermdl[] nthis In [9], model linear a using boxplot adjusted of application an is 18 Figure of value cutoff the is boxplot the on line horizontal darkgreen the 17, Figure In sd −4 −3 −2 −1 0 1 yapyn hsctffvle orotir r dnie hw ntedark- the in shown identified are outliers four value, cutoff this applying By . Boxplot oflndnn(u+3*sd) dnn ausgetrta .29o oe hn000 r identified are 0.0306 than lower or 3.5259 than greater values 39

y

0 10 20 30 40Location ofwaggle dances(u+3*sd) 50 02 04 50 40 30 20 10 0 x lndnn dnn elfrtemdrt kwes u h itiuinof distribution the but skewed. skewness, works moderate boxplot Adjusted the which detected. for be of well can outliers high of too number not are considerable distances a neighbor However, nearest [9]. reference the in proposed the loaded function We the 19. use Figure in applied is boxplot adjusted The Robustbase which in than greater Values bution. model exponential an on based boxplot adjusted the Using 6. Method hr sa xoeta oe o h dutdbxltapidt kwddistri- skewed to applied boxplot adjusted the for model exponential an is There

iue1:Otir eeto ae nlna dutdbxltfrCs 3 Case for boxplot adjusted linear on based detection Outliers 18: Figure 0 1 2 3 4 Boxplot ofdnn(linearadjusted) MC 7 a eue omk dutdbxltbsdo h xoeta model. exponential the on based boxplot adjusted make to used be can [7] stemdopevle[] nR h function the R, In [9]. value medcouple the is adjust e ehdt dniyotir o kwddistribution skewed a for outliers identify to method new a , Q 1 + 3 . 5 exp 40 (3

MC y

) 0 10 20 30 40 50 IQR 02 04 50 40 30 20 10 0 Location ofwaggle dances a ecniee soutliers, as considered be can dnn robustbase adjust nti aei highly is case this in x ntepackage the in akg and package rmternaetnihosaeietfida ules hnms ftepit are points the of most When distances outliers. moderate as identified at are located neighbors events nearest of their number from considerable a is There 1. Method boxplot the of application usual The (- 1. longitude Method certain within more or 5 68 magnitude pattern. of point spatial Earthquakes the 4. in out Case 18) stand (Figure that model outliers linear four a same on the based identify boxplot adjusted the transformed using the and of 17) mean (Figure the data from deviations first standard the three of 14), those application (Figure of the method However, Some gap events. outliers. other to of close number pretty high actually a are identify outliers 6) Method 3, Method 1, (Method iue1:Otir eeto ae nepnnilajse opo o ae3 Case for boxplot adjusted exponential on based detection Outliers 19: Figure ◦ nsmay o ae3weeteeet r ihycutrd oemethods some clustered, highly are events the where 3 Case for summary, In iue2 hw htalrenme fotir i e)aedtce yapplying by detected are red) (in outliers of number large a that shows 20 Figure -83 ,

0 1 2 3 4 ◦ n aiue(0 latitude and ) Adjusted Boxplot ofdnn ◦ -18 , ◦ ) 41

y

0 10 20 30 40 50 02 04 50 40 30 20 10 0 Location ofwaggle dances x highly clustered, the ones that are even at a moderate distance from the nearest neighbor are identified as outliers. In this example, most distances (in km) were very small and any point with dnn larger than 0.2892 is considered as an outlier by apply- ing the usual rule that values greater than Q3+1.5IQR can be considered as outliers.

Boxplot of dnn(q3+1.5*iqr) Location of Earthquake5+(q3+1.5*iqr) Latitude −15 −10 −5 0 0 20 40 60 80 100 120

−80 −75 −70

Longitude

Figure 20: Outliers Detection based on boxplot for Case 4

Method 2. First gap method

It is clear that Method 1 does not work well for Case 4. To apply the first gap method, we would look at the histogram and stemplot of dnn for Case 4.

42 Histogram of dnn for Case4 Frequency 0 100 200 300 400 500 600 700

0 20 40 60 80 100 120

dnn

Figure 21: Histogram of the dnn for Case 4

The decimal point is 2 digit(s) to the left of the |

0 | 111111444444445566889999

1 | 0000111111111111111112222333344444455555555666666667777788888*

2 | 0000001111112222222222333333334444444444455555666666666677777*

3 | 0000000000000000111111111111111222222333333333333333444444444*

4 | 0000000000011111112222222222222333333333344444444455555555555*

5 | 0000000000011111122222222233344444444455556666666666777777778*

6 | 0001111112222222223333333334444444444445555555666666666777788*

43 7 | 0000000000111122222222222333333333334444444555555566666666667*

8 | 0000000000011222333333444444444455555666666666666777778888888*

9 | 0000000000011122222222233444444444555555555555666666777777788*

10 | 0000001111122222333333333333444444455555566666677777777788888*

11 | 00000112222223333333334445555555566666778888888899999999

12 | 000111122223333444445556666667777778889999

13 | 0001122222222233334556666667778899999999

14 | 01111113333444555555666666777888999

15 | 01111222233334555566666677778888999

16 | 001123334555556777888888888899

17 | 0233333334455555567778899

18 | 002333333445566667777788999

19 | 000001111444555666688899

20 | 0000011333333455566799

21 | 0022235555568899

22 | 01223666788899999

23 | 111111122233344445577788

24 | 12333456678

25 | 0056667999

26 | 22344489

27 | 00444447

28 | 036668

29 | 2379

44 30 | 00134458

31 | 136

32 | 00125

33 | 1223369

34 | 136

35 | 0

36 | 2224

37 |

38 | 488

39 | 35568

40 | 12334457

41 | 11

42 | 145

43 | 003366

44 |

45 | 456

46 | 66

47 | 5

48 |

49 | 68

50 | 235

51 | 8

52 | 5

45 53 | 88

54 | 55

55 | 2

56 |

57 |

58 | 79

59 | 7

60 | 228

61 |

62 | 2

63 |

64 |

65 | 2

66 |

67 |

68 | 5

69 | 2

70 |

71 |

72 |

73 |

74 |

75 | 0

46 76 |

77 | 9

78 |

79 |

80 |

81 | 8

82 |

83 |

84 |

85 | 3

86 |

87 | 4

88 |

89 |

90 |

91 | 1

92 |

93 |

94 |

95 |

96 |

97 |

98 |

47 99 |

100 |

101 |

102 |

103 |

104 |

105 |

106 | 9

107 |

108 |

109 |

110 |

111 | 9

112 |

113 | 0

Note: *indicates that the line has been truncated for formatting purposes.

Case 4 regarding the earthquakes of magnitude 5 experience the same situation as Case 3 regarding the waggle dances. Based on the histogram and the stemplot, the large gap exists between 92 and 105. We consider the dnn values greater than

100 can be identified as outliers. Applying the first gap method, four outliers are identified and shown in red in Figure 22.

48 ehd3 pligMto otetasomddata transformed the to 1 Method Applying 3. Method nFgr 3,w a e httedsrbto of distribution the that see can we 23A, Figure In

0 20 40 60 80 100 120

Frequency Boxplot ofdnn(first gap) 0 100 200 300 400 500 600 700 iue2:Otir eeto ae nfis a o ae4 Case for gap first on based detection Outliers 22: Figure 04 08 100 80 60 40 20 0 iue2:Hsormof Histogram 23: Figure A.Histogram ofdnn dnn 49 dnn

Frequency Latitude and 0 50 100 150 200 250 300

−15 −10 −5 0 lndnn 2024 2 0 −2 B.Histogram oflndnn Location ofEarthquake5+ dnn o ae4 Case for 8 7 −70 −75 −80 lndnn shgl kwd The skewed. highly is Longitude distribution of the transformed distances lndnn = log(dnn) is fairly symmetric except for a few extreme small outlier shown in Figure 23B. We are only concerned with large outliers but not small ones, so these small values should not matter.

Boxplot of lndnn(lnq3+1.5*lniqr) Location of Earthquake5+ Latitude −15 −10 −5 0 −2 −1 0 1 2 3 4 5

−80 −75 −70

Longitude

Figure 24: Outliers detection based on boxplot for lndnn for Case 4

Figure 24 is prepared by applying Method 1 to the transformed data lndnn. Three outliers were identified in this method shown in the darkgreen circles.

Method 4. Three standard deviations from the mean of the transformed data

The boxplot in Figure 25 is the boxplot of lndnn with a horizontal line y =y ¯+3sd wherey ¯ is the mean of lndnn and sd is the of lndnn. For a normal distribution, 99.7% of the population should fall into the interval (¯y − 3sd, y¯ + 3sd).

Based on this idea, we want to detect the outliers greater thany ¯ + 3sd since we only care about the large outliers. No event had values of lndnn greater than

50 y¯ + 3sd = 5.072. Thus, no outlier is identified using this method.

Boxplot of lndnn(u+3*sd) −2 −1 0 1 2 3 4 5

Figure 25: Outliers detection based on three standard deviations from mean of lndnn for Case 4

Method 5. Using the adjusted boxplot based on a linear model

As we indicated in Case 3, there is a way of identifying outliers with the adjusted boxplot using a linear model [9]. Figure 25 is an application of this method. Events with dnn value greater than 94.4810 are identified as outliers. Five outliers shown in red are identified by this method in Figure 26.

Method 6. Using the adjusted boxplot based on an exponential model

There is an adjusted boxplot using the exponential model. Values greater than

Q3+1.5exp(3MC)IQR can be considered as outliers, in which MC is the medcouple

51 dniyteotir htsadoti h pta on pattern. point spatial the 26) in (Figure out model stand linear that a outliers on transformed the based the identify boxplot adjusted to the 1 using Method and 23) applying events. (Figure other 22), data to (Figure close method pretty gap are first outliers the these However, of some but outliers, of number large moderate have them considerable of a some However, ues. identified; are red outlier. in as shown outliers considered of be number can 63.1390 than greater value [7]. model exponential an on based boxplot function The [9]. value nsmay h vnsi ae4aehgl lsee.Sm ehd dniya identify methods Some clustered. highly are 4 Case in events the summary, In any that indicates which 63.1390, is value boundary adjusted the 26, Figure In iue2:Otir eeto ae nlna dutdbxltfrCs 4 Case for boxplot adjusted linear on based detection Outliers 26: Figure 0 20 40 60 80 100 120 Boxplot ofdnn(linearmodel) adjust ntepackage the in 52 Robustbase Latitude

−15 −10 −5 0 Location ofEarthquake5+ sue omk h adjusted the make to used is 8 7 −70 −75 −80 Longitude dnn val- rpsdb ue 9 o kwddsrbtos hntedt r ihyskewed highly are data the When distributions. skewed for boxplot [9] adjusted the Huber consider by to need proposed we When clustered enough. moderately be are distribution. not events might the that boxplot of of most the shape to associated the outlier on of of depend definition distribution usual might The the outliers discuss define to to to need method However, we the outlier and neighbor. an nearest is the point to a whether distances determine the on based are methods current outliers. identify to enough good is 1 Method skewed, extremely not have that because (- longitude certain within more or 6 68 magnitude of Earthquakes 5. Case iue2:Otir eeto ae nepnnilajse opo o ae4 Case for boxplot adjusted exponential on based detection Outliers 27: Figure ◦ st eetn uleswt epc otelcto fteeet oto the of most event, the of location the to respect with outliers detecting to As 83 , iue2 hw htteeaefu vnsta a ecniee soutliers, as considered be can that events four are there that shows 28 Figure 0 20 40 60 80 100 120 ◦ n aiue(0 latitude and ) Adjusted Boxplot ofdnn dnn ausgetrthan greater values ◦ -18 , ◦ ) 53 Q

1 + 3 Latitude

. −15 −10 −5 0 5 IQR Location ofEarthquake5+ h itiuinof distribution The . 8 7 −70 −75 −80 Longitude dnn dnn is rnfre aai loago a odtc ulesfrahgl kwddistribution. skewed highly a for with outliers Working detect an to is way distributions. good method skewed a also gap highly is first data with transformed The deal to alternatives. method consider intuitive might and we easy 4, Case and 3 Case in

0 50 100 150 iue2:Otir eeto ae nbxltfrCs 5 Case for boxplot on based detection Outliers 28: Figure Boxplot ofdnn 54

Latitude

−15 −10 −5 0 8 7 7 −70 −74 −78 −82 Location ofEarthquake6+ Longitude 4 OUTLIERS IN SPATIAL POINT PATTERNS WITH RESPECT TO AN

ASSOCIATED VARIABLE

In this section we will address the problem of identifying outliers with respect to the values of one or more associated variables. In some cases, like the earthquake examples, there might be other variables such as magnitude and depth of the epicenter associated to each event. We will discuss methods to identify outliers with respect to these variables. We will consider two types of outliers, global and local. Global outliers are the events with unusually high or low values of their associated variables.

Local outliers are the outliers with respect to the values associated to the surrounding events.

There is very little in the statistical literature about outliers with regard to an associated variable in spatial point patterns. However, recently a book written from the point of view of computer science has been published [11]. The following rule of spatial data is found in Outlier Analysis [11]: ‘Everything is related to everything else, but nearby objects are more related than distant objects’. It indicates that we need to detect the events with behavioral attribute values varying much from the neighboring spatial data. The same reference [11] also indicates that when dealing with multiple behavioral attributes, we need to work with each variable separately, then make a combination of these two to get the final outlier score to do a further analysis of spatial outliers.

55 4.1 Working with One Associated Variable

When it comes to the earthquake data, there are two associated variables, which are depth and magnitude. In this section, we will discuss methods to identify outliers when we consider each associated variable separately.

Global Outliers

Case 4. Earthquakes of magnitude 5 or more within certain longitude (-

68◦, -83◦) and latitude (0◦, -18◦)

We can determine outliers for each variable based on the usual concept that values greater than Q3 + 1.5IQR can be identified as outliers. An event with an extremely high value for the associated variable can be identified as an outlier [11]. We first prepare histograms and boxplots for each associated variable.

Boxplot of depth histogram of depth Frequency 0 100 200 300 400 500 600 700 0 100 200 300 400 500 600

0 200 400 600

earth.d$depth

Figure 29: Boxplot and histogram of depth for Case 4

56 o utteoutliers. the the just compare for clustered to more propose We be to general. tend in with events pattern outliers the point the than spatial if a visualize in to variable tool associated a have an to to nice respect be would It scattered [10]. more clus- region are be magnitude the to to over respect all seem with view outliers of the and point region depth certain a the in from tered outliers the that is attention our calls those are red boxplots. in the points to The according outliers magnitude. be 3D for Two to one considered and located. depth are for outliers one those how created, see are to plots want these we since However, location, spatial 30. a Figure have and data 29 Figure in boxplots the to according variables entc htteesest ealto uleswt epc obt associated both to respect with outliers of lot a be to seems there that notice We fe xmnn h Dsatrposi iue3 n iue3,smtigthat something 32, Figure and 31 Figure in plots scatter 3D the examining After

5.0 5.5 6.0 6.5 7.0 7.5 8.0 8.5 Boxplot ofmagnitude iue3:Bxltadhsormo antd o ae4 Case for magnitude of histogram and Boxplot 30: Figure 57

Frequency

0 200 400 600 800 1000 histogram ofmagnitude . . . 8.0 7.0 6.0 5.0 G ˆ o l h aawt the with data the all for earth.d$mag G ˆ Boxplot ofmagnitude for earthquake5+ hr r three are There 5.0 5.5 6.0 6.5 7.0 7.5 8.0 8.5 0 100 300 500 Boxplot ofdepthfor earthquake5+ iue3:3-ltwt ulesietfidbsdo antd o ae4 Case for magnitude on based identified outliers with 3D-plot 32: Figure iue3:3-ltwt ulesietfidbsdo et o ae4 Case for depth on based identified outliers with 3D-plot 31: Figure G ˆ nFgr 3adi iue3;tebakoei o h whole the for is one black the 34; Figure in and 33 Figure in 58

mag depth

2 1 1 − 0 −5 −10 −15 −20 5.0 5.5 6.0 6.5 7.0 7.5 8.0 8.5 0 −5 −10 −15 −20 0 100 200 300 400 500 600 700 3D plotfor Earthquake5+ 3D plotfor Earthquake5+ latitude latitude −85 −85 −80 −80 −75 −75 −70 −70 −65 −65

longitude longitude data set, the green one is for the outliers only and the purple one is for the simulated data under the condition of complete spatial randomness.

G estimated for earthquake5+−depth outlier Fn(x) 0.0 0.2 0.4 0.6 0.8 1.0

0 50 100

x

Figure 33: Gˆ with depth outliers for Case 4

G estimated for earthquake5+−magnitude outlier Fn(x) 0.0 0.2 0.4 0.6 0.8 1.0

0 50 100

x

Figure 34: Gˆ with magnitude outliers for Case 4

59 Based on Figure 33, it is clear that the outliers with regard to depth are super aggregated, which is much more clustered than all the events in Case 4, even though the whole data set is very aggregated as we discussed in Section 2. The Gˆ for the outliers with regard to magnitude in Figure 34 indicates that they are less clustered than all the events of earthquakes of magnitude 5 or more.

Figure 35 and Figure 36 display the 2D plots for the location with the outliers shown in red for depth and magnitude respectively. We make 2D plots of the location to have a better perception of where they are located.

Boxplot of depth for earthquake5+ 2D plot for Earthquake5+ latitude −15 −10 −5 0 0 100 200 300 400 500 600

−80 −75 −70

longitude

Figure 35: 2D-plot with outliers identified based on depth for Case 4

Figure 35 shows that the outliers with regard to depth in Case 4 are clustered and practically all in one region. However, the outliers with respect to magnitude (Figure

36) are scattered all over the whole pattern.

60 iue3:2-ltwt ulesietfidbsdo antd o ae4 Case for magnitude on based identified outliers with 2D-plot 36: Figure Boxplot ofmagnitude for earthquake5+

5.0 5.5 6.0 6.5 7.0 7.5 8.0 8.5 iue3:3-ltfrerhukso antd rmore or 5 magnitude of earthquakes for 3D-plot 37: Figure

depth 0 100 200 300 400 500 600 700 2 1 1 − 0 −5 −10 −15 −20 ...... 8.4 8.2 8.0 7.8 7.6 7.4 7.2 7.0 6.8 6.6 6.4 6.2 6.0 5.8 5.6 5.4 5.2 5.0 color codeofvariable Magnitude" "Earthquake latitude 3D plotfor earthquake5+ 61 latitude

−15 −10 −5 0 −85 2D plotfor Earthquake5+ 8 7 −70 −75 −80 −80 longitude −75 −70 −65

longitude In Figure 37, we make a 3D-plot with varying color to represent the fourth vari- able which is magnitude in our case to get a better perception of how the outliers are located in terms of two different associated variables.

Case 5. Earthquakes of magnitude 6 or more within certain longitude (-

68◦, -83◦) and latitude(0◦, -18◦)

We can determine outliers with respect to an associated variable as those values greater than Q3 + 1.5IQR, Figure 38 and Figure 39 show the outliers identified with boxplot in terms of depth and magnitude respectively.

Boxplot of depth histogram of depth Frequency 0 10 20 30 40 50 60 70 0 100 200 300 400 500 600

0 200 400 600

earth.d$depth

Figure 38: Boxplot and histogram of depth for Case 5

62 vns h re n sfrteotir nyadteprl n sfrcmlt spatial complete for is one simulation. purple by the obtained and randomness only outliers the for is one green three the has events, one Each 5. Case for respectively magnitude region. one identified in outliers this clustered the In are that boxplot seems boxplots. usual it the the magnitude, by by and outliers depth be to to respect considered with were both they case, if red in points the ing to want we located. location, are spatial outliers a those have where data see these since However, boxplots. the by plied oieta ohvralsse ohv ulesacrigt h sa ueap- rule usual the to according outliers have to seem variables both that Notice iue4 n iue4 ipa the display 43 Figure and 42 Figure mark- magnitude for one and depth for one plots, 3D are 41 Figure and 40 Figure

6.0 6.5 7.0 7.5 8.0 Boxplot ofmagnitude iue3:Bxltadhsormo antd o ae5 Case for magnitude of histogram and Boxplot 39: Figure 63 G ˆ Frequency o uleswt epc odphand depth to respect with outliers for

0 10 20 30 40 50 60 . . . . 8.0 7.5 7.0 6.5 6.0 histogram ofmagnitude G ˆ h lc n sfralthe all for is one black the , earth.d$mag 6.0 6.5 7.0 7.5 8.0 0 100 300 500 iue4:3-ltwt ulesietfidbsdo antd o ae5 Case for magnitude on based identified outliers with 3D-plot 41: Figure iue4:3-ltwt ulesietfidbsdo et o ae5 Case for depth on based identified outliers with 3D-plot 40: Figure Boxplot ofmagnitude Boxplot ofdepth 64 3D plotfor withmagnitude Earthquake6+ magnitude depth

2 1 1 − 0 −5 −10 −15 −20 6.0 6.5 7.0 7.5 8.0 8.5 0 −5 −10 −15 −20 0 100200300400500600700 3D plotfor Earthquake6+ latitude latitude −82 −82 −80 −80 −78 −78 −76 −76 −74 −74 −72 −72 −70 −70 −68 −68

longitude longitude G estimated for earthquake6+−depth outlier Fn(x) 0.0 0.2 0.4 0.6 0.8 1.0

0 50 100 150

x

Figure 42: Gˆ with depth outliers for Case 5

G estimated for earthquake6+−magnitude outlier Fn(x) 0.0 0.2 0.4 0.6 0.8 1.0

0 50 100 150

x

Figure 43: Gˆ with magnitude outliers for Case 5

Also, we want to make 2D plots, one for depth and one for magnitude, with location of the points and marking the points in red if they are considered as outliers 65 to get a better view of how the outliers are located. They are shown in Figure 44 and

Figure 45.

Boxplot of depth for earthquake6+ 2D plot for Earthquake6+ latitude −15 −10 −5 0 0 100 200 300 400 500 600

−82 −78 −74 −70

longitude

Figure 44: 2D-plot with outliers identified based on depth for Case 5

Boxplot of magnitude for earthquake6+ 2D plot for Earthquake6+ latitude −15 −10 −5 0 6.0 6.5 7.0 7.5 8.0

−82 −78 −74 −70

longitude

Figure 45: 2D-plot with outliers identified based on magnitude for Case 5

66 In Figure 46, we make a 3D plot with varying color to represent the fourth variable which is magnitude in our case to get a better perception of how the outliers are located in terms of two different associated variables.

3D plot for earthquake6+

depth −68 −70

−72 longitude −74 −76 −78 −80 −82 0 100 200 300 400 500 600 700 −20 −15 −10 −5 0

latitude

6.0 6.2 6.4 6.6 6.8 7.0 7.2 7.4 7.6 7.8 8.0 8.2 8.4

color code of variable "Earthquake Magnitude"

Figure 46: 3D-plot for earthquakes of magnitude 6 or more

Local Outliers

In the previous section, global outliers for one variable were identified using the usual method of the boxplot considering the values of the associated variable for all the events. Now we will address the question: Would they be outliers if we consider the values of the variable only for the events around them? In reference [11], we find the phrase ‘spatial outliers are objects which have behavioral attribute values that are distinct from those of their surrounding spatial neighbors’. We will call this ‘local 67 outliers’. We explore the idea of defining a circle around the point (event) and see if it would be an outlier compared to the other points or events inside the circle. The definition depends on the value of the radius, and we arbitrarily choose radius (r) equals 50km for the case of the earthquakes.

For each event in the spatial point pattern, we define a circle with the event as the center and r equalling to 50km, for each event we form a subset of points that are of distance r or less from it. Because the location for earthquakes is given in longitude and latitude, distances are calculated accordingly using the sp [7] package that implements the great circle method in R. We calculate the average depth (without the event in the center) and compare it with the depth for all the events inside the circle, those differences are stored in an object. Also, we compare the maximum depth with the depth for all the events inside the circle and store those differences in an object so that we can later examine them. The usual criterion to identify outliers in a boxplot is applied to those differences.

Case 4. Earthquakes of magnitude 5 or more within certain longitude (-

68◦, -83◦) and latitude (0◦, -18◦)

We apply two criteria, the difference between the value for the event and the mean of the surrounding events and the difference between the value for the event and the maximum value of the surrounding events. In Figure 47 and Figure 48, we make boxplots and histograms in terms of those two differences for depth and magnitude respectively.

68 soitdvral et o ae4wt eadt ohcriteria. both to regard with 4 Case for depth variable associated for depth for max and mean 4 from Case differences of histograms and Boxplots 47: Figure Frequency ae nFgr 7 ecnseteei ubro ulesietfidwt the with identified outliers of number a is there see can we 47, Figure on Based

0 400 800 Histogram ofdiffmean−depth 0 100 200 300 Boxplot ofdiffmean−depth 010250 150 50 0 thedif 69 Frequency

0 400 800 Histogram ofdiffmax−depth 0 200 400 600 Boxplot ofdiffmax−depth 0 0 500 300 100 0 thedmax soitdvral antd o ae4wt eadt ohcriteria. both to regard with 4 Case for magnitude variable associated magnitude for 4 max Case and for mean from differences of histograms and Boxplots 48: Figure Frequency ae nFgr 8 ecnseteei ubro ulesietfidwt the with identified outliers of number a is there see can we 48, Figure on Based

0 200 400 600 Histogram ofdiffmean−mag 0.0 1.0 2.0 3.0 Boxplot ofdiffmean−mag . . . 3.0 2.0 1.0 0.0 thedif 70 Frequency

0 100 300 0.0 1.0 2.0 3.0 Histogram ofdiffmax−mag . . . 3.0 2.0 1.0 0.0 Boxplot ofdiffmax−mag thedmax 3D plot for Earthquake5+(diffmean) 3D plot for Earthquake5+(diffmax) depth depth

−65 −65 −70 −70

−75 longitude −75 longitude −80 −80 −85 −85 −20 0 100 200 −15 300 400 500 −10 600 700 −5 0 −20 0 100 200 −15 300 400 500 −10 600 700 −5 0

latitude latitude

Figure 49: 3D-plot with outliers identified based on differences from mean and max for depth for Case 4

3D plot for Earthquake5+(diffmean−mag) 3D plot for Earthquake5+(diffmax−mag) depth depth

−65 −65 −70 −70 −75 −75 −80 longitude −80 longitude −85 −85 −205.0 5.5 6.0 −15 6.5 7.0 7.5 8.0−10 8.5 −5 0 −205.0 5.5 6.0 −15 6.5 7.0 7.5 8.0−10 8.5 −5 0

latitude latitude

Figure 50: 3D-plot with outliers identified based on differences from mean and max for magnitude for Case 4

71 hwweetelclotir r oae.Telclotir r dnie yboth by identified are events. outliers same local the The exactly are located. criteria are outliers local the where show for depth for max and mean 5 from Case differences of histograms and Boxplots 51: Figure

nFgr 9 Dposfrbt rtrai em fteascae aibedepth variable associated the of terms in criteria both for plots 3D 49, Figure In Frequency

0 20 40 60 80 0 10 20 30 40 50 Histogram ofdiffmean−depth 02 04 50 40 30 20 10 0 Boxplot ofdiffmean−depth thedif

72 Frequency

0 20 40 60 80 0 50 100 150 010150 100 50 0 Histogram ofdiffmax−depth Boxplot ofdiffmax−depth thedmax iueso hr h oa ulesaelctd h oa ulesaeietfidby identified events. are same outliers the local exactly The are criteria located. both are outliers local the where show nitude magnitude for 5 max Case and for mean from differences of histograms and Boxplots 52: Figure

nFgr 0 Dposfrbt rtrai em fteascae aibemag- variable associated the of terms in criteria both for plots 3D 50, Figure In Frequency

0 20 40 60 80 0.0 0.4 0.8 1.2 ...... 1.4 1.2 1.0 0.8 0.6 0.4 0.2 0.0 Histogram ofdiffmean−mag Boxplot ofdiffmean−mag thedif

73 Frequency

0 20 40 60 80 0 1 2 3 4 5 5 4 3 2 1 0 Histogram ofdiffmax−mag Boxplot ofdiffmax−mag thedmax 3D plot for Earthquake6+(diffmean) 3D plot for Earthquake6+(diffmax)

−68 −68 depth −70 depth −70 −72 −72 −74 longitude −74 longitude −76 −76 −78 −78 −80 −80 −82 −82 −20 0 100 200 300 −15 400 500 600 700 −10 −5 0 −20 0 100 200 300 −15 400 500 600 700 −10 −5 0

latitude latitude

Figure 53: 3D-plot with outliers identified based on differences from mean and max for depth for Case 5

3D plot for Earthquake6+(diffmean) 3D plot for Earthquake6+(thedmax)

−68 −68 −70 −70

−72 longitude −72 longitude mag mag −74 −74 −76 −76 −78 −78 −80 −80 −82 −82 −206.0 6.5 −15 7.0 7.5 −10 8.0 8.5 −5 0 −206.0 6.5 −15 7.0 7.5 −10 8.0 8.5 −5 0

latitude latitude

Figure 54: 3D-plot with outliers identified based on differences from mean and max for magnitude for Case 5

74 Case 5. Earthquakes of magnitude 6 or more within certain longitude (-

68◦, -83◦) and latitude (0◦, -18◦)

Similarly to Case 4, two criteria are applied, the first one is the difference com- pared to the mean, and the second one is the difference compared to the maximum.

Figure 51 and Figure 52 display the boxplots and histograms for the two different criteria for depth and magnitude for Case 5. A number of outliers are identified based on both criteria with regard to the associated variable depth in Figure 51. Also, there is a number of outliers identified based on both criteria with regard to the associated variable magnitude. They are displayed in Figure 52.

In Figure 53 and Figure 54, 3D plots based on both criteria with regard to the associated variable depth and magnitude are displayed. 3D plots give us a better view of where the outliers are located. The two criteria identify exactly the same outliers.

4.2 Working with Two Associated Variables

According to the reference [11], there are several attributes for each point, they should be first analyzed separately and then combined on a single deviation value.

Following that suggestion we combined magnitude and depth in the following way.

First the values of each variable were standardized because they are in different units and they have a different of values, then we plot their absolute standardized values in one plot. We decided to use the criterion ‘more than 3 standard deviations far from the mean’ to identify outliers. That criterion could be applied in different ways. Here are three of them:

75 • One of the standardized values is greater than three.

• Both of standardized values are greater than three.

• The sum of the standardized values is greater than three or another arbitrary

quantity.

One of the standardize value greater than three abs(zy) 0 1 2 3 4 5 6 7

0 1 2 3 4 5 6

abs(zx)

Figure 55: Plot of the standardized value of one variable >3 for Case 4

Figure 55 is an application of the first criterion for Case 4, the outliers identified would include all the points shown in red beyond either the green or the blue line. 76 One of the standardize value greater than three abs(zy) 0 1 2 3 4

0.0 0.5 1.0 1.5 2.0 2.5

abs(zx)

Figure 56: Plot of the standardized value of one variable >3 for Case 5

In Case 5, there is no absolute value of the standardized values of depth (zx) greater than three. Thus, only two outliers shown in red are identified in Figure 56.

77 Both of the standardize values greater than three abs(zy) 0 1 2 3 4 5 6 7

0 1 2 3 4 5 6

abs(zx)

Figure 57: Plot of the standardized values of both variables >3 for Case 4

Figure 57 shows the outliers identified in red when both of the standardized values are greater than 3. In Case 4, we see there are 5 outliers shown in red on the upper right that are outliers based on this criterion.

Since there is no abs(zx) for Case 5 greater than three, then there is no need to apply this criterion to Case 5.

We could define a criterion on the sum of the standardized values. One event could be considered an outlier if the sum of the absolute standardized values is greater than 78 a quantity c. In Figures 58 and 59 c = 3 and c = 6 are used for Case 4 respectively.

The sum of the standaridize values greater than three abs(zy) 0 1 2 3 4 5 6 7

0 1 2 3 4 5 6

abs(zx)

Figure 58: Plot of the sum of the standardized values >3 for Case 4

All the outliers shown in red for Case 4 identified by the criterion that the sum of the standardized values is greater than three are beyond the purple line in Figure 58.

79 The sum of the standaridize values greater than six abs(zy) 0 1 2 3 4 5 6 7

0 1 2 3 4 5 6

abs(zx)

Figure 59: Plot of the sum of the standardized values >6 for Case 4

All the outliers shown in red for Case 4 identified by the criterion that the sum of the standardized values is greater than six are beyond the purple line in Figure 59.

In Figure 60, the criterion ‘the sum of the standardized values greater than three’ is applied. All the outliers for Case 5 which are the events with the sum of the stan- dardized values greater than three are beyond the purple line and shown in red.

80 The sum of the standaridize values greater than six abs(zy) 0 1 2 3 4

0.0 0.5 1.0 1.5 2.0 2.5

abs(zx)

Figure 60: Plot of the sum of the standardized values >3 for Case 5

81 5 CONCLUSIONS

The thesis focuses on the exploration of alternative ways to detect outliers in spa- tial point patterns both when outliers are determined based on location alone and when they are determined based on the values of an associated variable. Several methods to identify outliers were proposed and explored for different types of spatial point patterns by using five case studies. The first two case studies are the locations of spiders in a web. The third case study is the location of waggle dances in an experimental bee hive during one day. The last two are the location of earthquakes within certain longitude (-68, -93◦) and latitude (0◦, -18◦).

For outliers with respect to the location only, the identification of outliers was done using the distances to the nearest neighbor. We developed the idea that differ- ent methods could be applied based on the shape of the distribution of the dnn. If the dnn is fairly symmetric, the usual way of identifying outliers based on the boxplot does work well. If the dnn is moderately skewed, the adjusted boxplot defined in the reference [9] works well because it identify the outliers that are clearly outliers.

However, when the distribution of the dnn is highly skewed, which means the events are highly clustered, several methods are proposed and some of them give the same results identifying the most relevant outliers. For the case of the dances of the bees in which the events are highly clustered the same four outliers are identified by the

first gap method, three standard deviations from the mean of the transformed dis- tances and the adjusted boxplot based on an exponential model. For the example of the earthquakes in which the events are clustered but not as extremely as in the bee dances case, the first gap method, the boxplot of the transformed distances and the

82 modified boxplot based on a linear model identified the most revelent outliers.

When an associated variable is available, such as depth or magnitude of earth- quakes in Case 4 and Case 5, we explored ways of identifying spatial outliers with respect to one or two associated variables. Two types of outliers were defined: global and local. The comparison of the Gˆ functions for the global outliers and all the events was used to explore whether global outliers tended to be more clustered than all the events or not, meaning whether it was more likely for outliers with respect to asso- ciated variables to happen in some regions and not in others. For the earthquakes of magnitude 5 or more, we found that the outliers identified in terms of depth are clustered, but the outliers identified in terms of magnitude are much more randomly distributed. The local outliers were determined by comparing the value of the variable at each point with the values of the variable at the surrounding points and a method based on circles of radius r was applied to identify the points with which each event was to be compared.

When working with two associated variables at the same time, we standardized the values of both associated variables since they are expressed in different units and used the criterion of three standard deviations from the mean to identify outliers for each variable. Several ways of looking at both standardized variables simultaneously were explored and compared.

Searching for outliers in spatial point patterns with respect to location only is a complex issue and it is not convenient to give one single rule to be applied to all cases.

The shape of the distribution of the distances to the nearest neighbor, either fairly symmetric, moderately skewed or highly skewed needs to be taken into account when

83 deciding the method to identify outliers.

Searching for outliers in spatial point patterns with respect to an associated vari- able is relatively more straight forward but it should include the discussion of the spatial distribution of the outliers for which we find the G function to be useful.

84 BIBLIOGRAPHY

[1] Illian, J., Penttinen, A., Stoyan, H. and Stoyan, D.(2008). Statistical Analysis

and Modelling of Spatial Point Pattern, Wiley, New Jersey.

[2] Chen, D., Lu, C., Kou, Y. and Chen, F.(2008). “On detecting spatial outliers”

Geoinformatica, 12, 455–475.

[3] Earthquake hazards program. Accessed March 12, 2015.

http://earthquake.usgs.gov/

[4] Cucala, L. and Agnan, C. T. Test for Spatial Randomness Based on Spacings.

Accessed November, 2004.

http://www.math.univ-montp2.fr/~cucala/testsCSR.pdf

[5] Knorr, E. M. and Ng, R. T.(1998). Algorithms for Mining Distance-Based Out-

liers in Large Datasets. In proceedings of 24th International Conference on Very

Large Data Bases, New York.

[6] Jin, W., Jiang, Y. L., Qian W. N. and Tung, A. K.(2006). Mining Outliers in

Spatial Networks, Springer, 3882, 156–170.

http://www.comp.nus.edu.sg/~atung/publication/DASFAA06Outlier.pdf

[7] R Core Team. R: A language and environment for statistical computing, R Foun-

dation and Statistical Computing, Vienna, Austria, 2012. Accessed March, 2015.

http://www.r-project.org/

85 [8] Pebesma, E., Customising Spatial data Classes and Methods. Accessed February,

2008.

http://cran.r-project.org/web/packages/sp/vignettes/csdacm.pdf

[9] Vanderviere, M. and Huber, M.(2004). An Adjusted Boxplot for Skewed Distri-

bution, Physica-Verlag/Springer 2004.

[10] Zimeras, S.(2008). Exploratory Point Pattern Analysis for Modeling Earthquake

data, 1st WSEAS International Conference on ENVIRONMENTAL and GEO-

LOGICAL science and ENGINEERING, Malta.

[11] Aggarwal C. C.(2013). Outlier Analysis, Springer, New York.

86 VITA

JIE LIU

Education: B.S. Life Science, ShanDong Normal University JiNan, ShanDong, China 2013 B.S. Biological Science, East Tennessee State University Johnson City, Tennessee 2013 M.S. Mathematical Sciences, East Tennessee State University Johnson City, Tennessee 2015

Professional Experience: Graduate Assistant , East Tennessee State University Johnson City, Tennessee, 2013–2015

87