Adaptive Wavelet Clustering for Highly Noisy Data



Zengjian Chen (Department of Computer Science, Huazhong University of Science and Technology, Wuhan, China; [email protected])
Jiayi Liu (Department of Computer Science, University of Massachusetts Amherst, Massachusetts, USA; [email protected])
Yihe Deng (Department of Mathematics, University of California, Los Angeles, California, USA; [email protected])
Kun He* (Department of Computer Science, Huazhong University of Science and Technology, Wuhan, China; [email protected])
John E. Hopcroft (Department of Computer Science, Cornell University, Ithaca, NY, USA; [email protected])

This work is supported by National Natural Science Foundation (61772219, 61572221). *: Corresponding author.

arXiv:1811.10786v2 [cs.DB] 7 Jan 2019

Abstract—In this paper we make progress on the unsupervised task of mining arbitrarily shaped clusters in highly noisy datasets, a task present in many real-world applications. Based on the fundamental work that first applied a wavelet transform to data clustering, we propose an adaptive clustering algorithm, denoted AdaWave, which exhibits favorable characteristics for clustering. By a self-adaptive thresholding technique, AdaWave is parameter free and can handle data in various situations. It is deterministic, fast (running in linear time), order-insensitive, shape-insensitive, robust to highly noisy data, and requires no pre-knowledge of data models. Moreover, AdaWave inherits from the wavelet transform the ability to cluster data at different resolutions. We adopt the "grid labeling" data structure to drastically reduce the memory consumption of the wavelet transform so that AdaWave can be used for relatively high dimensional data. Experiments on synthetic as well as natural datasets demonstrate the effectiveness and efficiency of our proposed method.

Index Terms—Clustering, high noise data, wavelet transform, shape-insensitive

I. INTRODUCTION

Clustering is a fundamental data mining task that finds groups of similar objects while keeping dissimilar objects separated in different groups or in the group of noise (noisy points) [15], [17]. The objects can be spatial data points, feature vectors, or patterns. Typical clustering techniques include centroid-based clustering [18], spectral clustering [16], density-based clustering [17], etc. These techniques usually perform well on "clean" data. However, they face a big challenge when dealing with real-world applications where patterns are usually mixed with noise. Furthermore, complex and irregularly shaped groups render these once-effective clustering techniques intractable, because typical clustering approaches either lack a clear "noise" concept or are limited to specific situations due to their shape-sensitive nature. This work addresses the problem of effectively uncovering arbitrarily shaped clusters when the noise is extremely high.

Based on the pioneering work of Sheikholeslami that applies the wavelet transform, originally used for signal processing, to spatial data clustering [12], we propose a new wavelet-based algorithm called AdaWave that can adaptively and effectively uncover clusters in highly noisy data. To tackle general applications, we assume that the clusters in a dataset do not follow any specific distribution and can be arbitrarily shaped.

To show the hardness of the clustering task, we first design a highly noisy running example with clusters of various types, as illustrated in Fig. 1(a). Fig. 1(b) shows that AdaWave can correctly locate these clusters. Without any estimation of explicit models on the data, the proposed AdaWave algorithm exhibits properties that are favorable to large scale real-world applications.

Fig. 1. A running example: (a) raw data; (b) AdaWave clustering.

For comparison, we illustrate the results of several typical clustering methods on the running example: k-means [23], [24] as the representative of centroid-based clustering methods, DBSCAN [19] as the representative of density-based clustering methods [17], [20], and SkinnyDip [14] as a newly proposed method that handles extremely noisy environments. Besides visual inspection, we also use Adjusted Mutual Information (AMI) [21], a standard metric ranging from 0 at worst to 1 at best, to evaluate the performances.

• Centroid-based clustering algorithms tend to lack a clear notion of "noise" and behave poorly on very noisy data. As shown in Fig. 2(b), the standard k-means yields poor results; the AMI is very low at 0.25.

• Density-based clustering algorithms usually perform well when clusters have heterogeneous form. As a representative, DBSCAN is known to robustly discover clusters of varying shapes in noisy environments. After fine-tuning the parameters, which requires graphical inspection, the result of DBSCAN is shown in Fig. 2(c). DBSCAN roughly finds the shapes of the clusters, but there is still much room for improvement: it detects 21 clusters with an AMI of 0.28, primarily because there are various areas in which, through the randomness of noise, the local density threshold is exceeded. We also tested various parameters and did not find a parameter combination for DBSCAN that solves the problem.

• SkinnyDip is known to robustly cluster spatial data in extremely noisy environments. For the running example, we sample from the nodes of a sparse grid [13], which is regarded as the best mechanism for choosing a starting point for gradient ascent in SkinnyDip. However, the clustering result in Fig. 2(c) shows that SkinnyDip performs poorly, as the dataset does not satisfy its assumption that the projection of the clusters onto each dimension is unimodal.

The above methods rely heavily on the estimation of explicit models on the data, and produce high quality results when the data is organized according to the assumed models. However, when facing more complicated and unknown cluster shapes or highly noisy environments, these methods do not perform as well as expected. By comparison, as shown in Fig. 2(d), the proposed AdaWave, which applies wavelet decomposition for denoising and "elbow theory" for adaptive threshold setting, correctly detects all five clusters and groups all the noisy points as the sixth cluster. AdaWave achieves an AMI value as high as 0.76, and furthermore computes deterministically and runs in linear time.
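For readers who want to reproduce this style of evaluation, the following is a minimal sketch (not the paper's code; the dataset, parameters, and resulting scores are illustrative) of scoring k-means and DBSCAN against a noisy ground truth with AMI, using scikit-learn:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import adjusted_mutual_info_score

rng = np.random.default_rng(0)

# Two crescent-shaped clusters plus a large block of uniform noise.
X_clusters, y_clusters = make_moons(n_samples=1000, noise=0.05, random_state=0)
X_noise = rng.uniform(low=-1.5, high=2.5, size=(1000, 2))
X = np.vstack([X_clusters, X_noise])
y_true = np.concatenate([y_clusters, np.full(1000, -1)])  # -1 marks noise points

for name, model in [
    ("k-means", KMeans(n_clusters=2, n_init=10, random_state=0)),
    ("DBSCAN", DBSCAN(eps=0.1, min_samples=10)),
]:
    labels = model.fit_predict(X)
    print(name, "AMI =", round(adjusted_mutual_info_score(y_true, labels), 3))
```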
We propose AdaWave to cluster highly noisy data, which frequently appears in real-world applications where clusters are in extreme-noise settings with arbitrary shapes. We also demonstrate the efficiency and effectiveness of AdaWave on synthetic as well as real-world data. The main characteristics of AdaWave that distinguish it from existing clustering methods are as follows:

• AdaWave is deterministic, fast, order-insensitive, shape-insensitive, robust in highly noisy environments, and requires no pre-knowledge of the data models. To our knowledge, no other clustering method meets all these properties.

• We design a new data structure for the wavelet transform such that, compared to the classic WaveCluster algorithm, the memory consumption is drastically reduced, allowing AdaWave to handle relatively high dimensional data.

II. RELATED WORK

Various approaches have been proposed to improve the robustness of clustering algorithms on noisy data [9], [14], [19]. Here we highlight several algorithms most related to this problem and focus on illustrating the preconditions for these clustering methods.

DBSCAN [19] is a typical density-based clustering method designed to reveal clusters of arbitrary shapes in noisy data. When varying the noise level on the running example, we find that DBSCAN performs well only when the noise is kept below 15%. Its performance degrades drastically as we continue to increase the noise percentage. Meanwhile, the overall average time complexity of DBSCAN is O(N log N) for N data points and, in the worst case, O(N²). Thus, its running time can also be a limiting factor when dealing with large scale datasets. Another density-based clustering method, Sync [21], exhibits the same weakness in time complexity (reliance on pair-wise distances, with time complexity O(N²)).

Regarding data with high noise, as early as 1998, Dasgupta et al. [9] proposed an algorithm to detect minefields and seismic faults from "cluttered" data. Their method is limited to the two-dimensional case, and an extended version for slightly higher dimensions (d ≤ 5) requires significant parameter tuning.
In 2016, Samuel et al. proposed an intriguing method called SkinnyDip [14]. SkinnyDip optimizes DipMeans [7] with an elegant dip test of unimodality [22]. It focuses on the high noise situation, and yields good results when clusters take a unimodal form in each coordinate direction. However, the condition is very strict: the projections of the clusters have to be unimodal in every dimension. When this condition does not hold, SkinnyDip cannot uncover the clusters correctly.

A newly proposed work in 2017 [25] applies a sparse and latent decomposition of the similarity graph used in spectral clustering to jointly learn the spectral embedding and the corrupted data. It proposes a robust spectral clustering (RSC) technique that is not sensitive to noisy data. Their experiments demonstrated the robustness of RSC against spectral clustering methods. However, it can only deal with low-noise data (up to 10% to 20% noise).

Our proposed AdaWave algorithm targets extreme-noise settings, as SkinnyDip does. Besides, as a grid-based algorithm, AdaWave shares a common characteristic with STING [10] and CLIQUE [11]: it is fast and independent of the number of data points.
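The excerpt names the ingredients of this family of grid-based wavelet methods (grid quantization, wavelet decomposition, thresholding of the transformed grid) without giving pseudocode. Below is a rough, hedged sketch of such a pipeline in the spirit of WaveCluster/AdaWave; the fixed-quantile threshold stands in for AdaWave's adaptive "elbow" rule, and all names and parameters are illustrative:

```python
import numpy as np
import pywt
from scipy.ndimage import label

def wavelet_grid_cluster(X, bins=64, wavelet="db2", quantile=0.7):
    # 1. Quantize the 2-D points onto a regular grid of cell counts.
    counts, _, _ = np.histogram2d(X[:, 0], X[:, 1], bins=bins)
    # 2. One level of 2-D wavelet decomposition; the smoothed approximation
    #    band suppresses high-frequency (noise-dominated) variation.
    approx, _ = pywt.dwt2(counts, wavelet)
    # 3. Keep only dense cells. A fixed quantile is used here; AdaWave
    #    instead sets this threshold adaptively (the "elbow" rule).
    dense = approx > np.quantile(approx, quantile)
    # 4. Connected components of the dense cells form the clusters.
    grid_labels, n_clusters = label(dense)
    return grid_labels, n_clusters
```

Mapping the original points back onto the labeled grid cells (and collecting unlabeled points as noise) is omitted for brevity.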
Recommended publications
  • Lecture 19: Wavelet Compression of Time Series and Images
Lecture 19: Wavelet compression of time series and images. © Christopher S. Bretherton, Winter 2014. Ref: Matlab Wavelet Toolbox help.

19.1 Wavelet compression of a time series. The last section of wavelet leleccum notoolbox.m demonstrates the use of wavelet compression on a time series. The idea is to keep the wavelet coefficients of largest amplitude and zero out the small ones.

19.2 Wavelet analysis/compression of an image. Wavelet analysis is easily extended to two-dimensional images or datasets (data matrices) by first doing a wavelet transform of each column of the matrix, then transforming each row of the result (see wavelet image). The wavelet coefficient matrix has the highest level (largest-scale) averages in the first rows/columns, then successively smaller detail scales further down the rows/columns. The example also shows fine results with 50-fold data compression.

19.3 Continuous Wavelet Transform (CWT). Given a continuous signal u(t) and an analyzing wavelet $\psi(x)$, the CWT has the form

$$W(\lambda, t) = \lambda^{-1/2} \int_{-\infty}^{\infty} \psi\!\left(\frac{s-t}{\lambda}\right) u(s)\, ds \qquad (19.3.1)$$

Here λ, the scale, is a continuous variable. We insist that $\psi$ have mean zero and that its square integrates to 1. The continuous Haar wavelet is defined:

$$\psi(t) = \begin{cases} 1 & 0 < t < 1/2 \\ -1 & 1/2 < t < 1 \\ 0 & \text{otherwise} \end{cases} \qquad (19.3.2)$$

$W(\lambda, t)$ is proportional to the difference of running means of u over successive intervals of length λ/2. In practice, for a discrete time series, the integral is evaluated as a Riemann sum using the Matlab wavelet toolbox function cwt.
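A rough Python equivalent of the keep-the-largest-coefficients idea, using PyWavelets instead of the Matlab toolbox the lecture assumes (the wavelet choice and the 2% keep fraction are illustrative):

```python
import numpy as np
import pywt

def wavelet_compress(signal, wavelet="db4", keep_fraction=0.02):
    coeffs = pywt.wavedec(signal, wavelet)
    # Threshold chosen so roughly `keep_fraction` of detail coefficients survive.
    flat = np.concatenate([np.abs(c) for c in coeffs[1:]])
    threshold = np.quantile(flat, 1.0 - keep_fraction)
    kept = [coeffs[0]] + [pywt.threshold(c, threshold, mode="hard")
                          for c in coeffs[1:]]
    return pywt.waverec(kept, wavelet)

t = np.linspace(0, 1, 1024)
x = np.sin(2 * np.pi * 5 * t) + 0.1 * np.random.default_rng(0).normal(size=1024)
x_hat = wavelet_compress(x)  # reconstruction from ~2% of the detail coefficients
```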
  • A Fourier-Wavelet Monte Carlo Method for Fractal Random Fields
JOURNAL OF COMPUTATIONAL PHYSICS 132, 384–408 (1997), Article No. CP965647. A Fourier–Wavelet Monte Carlo Method for Fractal Random Fields. Frank W. Elliott Jr., David J. Horntrop, and Andrew J. Majda. Courant Institute of Mathematical Sciences, 251 Mercer Street, New York, New York 10012. Received August 2, 1996; revised December 23, 1996.

A new hierarchical method for the Monte Carlo simulation of random fields, called the Fourier–wavelet method, is developed and applied to isotropic Gaussian random fields with power-law spectral density functions. This technique is based upon the orthogonal decomposition of the Fourier stochastic integral representation of the field using wavelets. The Meyer wavelet is used here because its rapid decay properties allow for a very compact representation of the field. The Fourier–wavelet method is shown to be straightforward to implement, given the nature of the necessary precomputations and the run-time calculations, and yields comparable results with scaling behavior over as many decades as the physical-space multiwavelet methods developed recently by two of the authors.

Such fractal fields satisfy

$$\langle [v(x) - v(y)]^2 \rangle = C_H |x - y|^{2H}, \qquad (1.1)$$

where 0 < H < 1 is the Hurst exponent and ⟨·⟩ denotes the expected value. Here we develop a new Monte Carlo method based upon a wavelet expansion of the Fourier-space representation of the fractal random fields in (1.1). This method is capable of generating a velocity field with the Kolmogoroff spectrum (H = 1/3 in (1.1)) over many (10 to 15) decades of scaling behavior, comparable to the physical-space multiwavelet methods.
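To make the spectral representation concrete, here is a minimal sketch of plain Fourier synthesis of a one-dimensional Gaussian field with a power-law spectrum. It illustrates only the representation behind (1.1), not the paper's Fourier–wavelet decomposition, and the normalization is illustrative:

```python
import numpy as np

def power_law_field(n=4096, H=1/3, seed=0):
    rng = np.random.default_rng(seed)
    k = np.fft.rfftfreq(n)
    k[0] = np.inf                      # drop the zero (mean) mode
    amplitude = k ** (-(H + 0.5))      # S(k) ~ k^-(2H+1)  =>  amp ~ k^-(H+1/2)
    noise = rng.normal(size=k.size) + 1j * rng.normal(size=k.size)
    v = np.fft.irfft(amplitude * noise, n)
    return v / v.std()

v = power_law_field()  # increments scale roughly like |x - y|^(2H), as in (1.1)
```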
  • Cluster Analysis, a Powerful Tool for Data Analysis in Education
International Statistical Institute, 56th Session, 2007: Rita Vasconcelos, Márcia Baptista. Cluster Analysis, a powerful tool for data analysis in Education.

Vasconcelos, Rita. Universidade da Madeira, Department of Mathematics and Engineering, Caminho da Penteada, 9000-390 Funchal, Portugal. E-mail: [email protected]. Baptista, Márcia. Direcção Regional de Saúde Pública, Rua das Pretas, 9000 Funchal, Portugal. E-mail: [email protected].

1. Introduction. A database was created after an inquiry to 14-15-year-old students, developed with the purpose of identifying the factors that could socially and pedagogically frame the results in Mathematics. The data was collected in eight schools in Funchal (Madeira Island), and we performed a Cluster Analysis as a first multivariate statistical approach to this database. We also developed a logistic regression analysis, as the study was carried out as a contribution to explaining success/failure in Mathematics. As a final step, the responses of both statistical analyses were studied.

2. Cluster Analysis approach. The questions that arise when we try to frame socially and pedagogically the results in Mathematics of 14-15-year-old students concern the types of decisive factors in those results. Underlying our objectives is the classification of the students according to the factors we understand to be decisive in students' results. This is exactly the aim of Cluster Analysis. The hierarchical solution that can be observed in the dendrogram presented on the next page suggests that we should consider the following 3 clusters, since the distances increase substantially after it. Variables in Cluster 1: mother's qualifications; father's qualifications; student's results in Mathematics as classified by the school teacher; student's results in the exam of Mathematics; time spent studying.
  • A Practical Guide to Wavelet Analysis
A Practical Guide to Wavelet Analysis. Christopher Torrence and Gilbert P. Compo. Program in Atmospheric and Oceanic Sciences, University of Colorado, Boulder, Colorado.

ABSTRACT. A practical step-by-step guide to wavelet analysis is given, with examples taken from time series of the El Niño–Southern Oscillation (ENSO). The guide includes a comparison to the windowed Fourier transform, the choice of an appropriate wavelet basis function, edge effects due to finite-length time series, and the relationship between wavelet scale and Fourier frequency. New statistical significance tests for wavelet power spectra are developed by deriving theoretical wavelet spectra for white and red noise processes and using these to establish significance levels and confidence intervals. It is shown that smoothing in time or scale can be used to increase the confidence of the wavelet spectrum. Empirical formulas are given for the effect of smoothing on significance levels and confidence intervals. Extensions to wavelet analysis such as filtering, the power Hovmöller, cross-wavelet spectra, and coherence are described. The statistical significance tests are used to give a quantitative measure of changes in ENSO variance on interdecadal timescales. Using new datasets that extend back to 1871, the Niño3 sea surface temperature and the Southern Oscillation index show significantly higher power during 1880–1920 and 1960–90, and lower power during 1920–60, as well as a possible 15-yr modulation of variance. The power Hovmöller of sea level pressure shows significant variations in 2–8-yr wavelet power in both longitude and time.

1. Introduction. Wavelet analysis is becoming a common tool for analyzing localized variations of power within a time series. A complete description of geophysical applications can be found in Foufoula-Georgiou and Kumar (1995), while a theoretical treatment of wavelet analysis is given in Daubechies (1992).
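As a hedged illustration of the first step of such an analysis (the red-noise significance testing the guide develops is not reproduced here), a Morlet wavelet power spectrum can be computed with PyWavelets; the signal and scale range are illustrative:

```python
import numpy as np
import pywt

t = np.linspace(0, 10, 1000)
x = np.sin(2 * np.pi * 2 * t) + 0.5 * np.random.default_rng(0).normal(size=t.size)

scales = np.arange(1, 128)
coeffs, freqs = pywt.cwt(x, scales, "morl", sampling_period=t[1] - t[0])
power = np.abs(coeffs) ** 2   # wavelet power spectrum, shape (scales, time)
```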
  • An Overview of Wavelet Transform Concepts and Applications
An overview of wavelet transform concepts and applications. Christopher Liner, University of Houston. February 26, 2010.

Abstract. The continuous wavelet transform utilizing a complex Morlet analyzing wavelet has a close connection to the Fourier transform and is a powerful analysis tool for decomposing broadband wavefield data. A wide range of seismic wavelet applications have been reported over the last three decades, and the free Seismic Unix processing system now contains a code (succwt) based on the work reported here.

Introduction. The continuous wavelet transform (CWT) is one method of investigating the time-frequency details of data whose spectral content varies with time (non-stationary time series). Motivation for the CWT can be found in Goupillaud et al. [12], along with a discussion of its relationship to the Fourier and Gabor transforms. As a brief overview, we note that French geophysicist J. Morlet worked with non-stationary time series in the late 1970's to find an alternative to the short-time Fourier transform (STFT). The STFT was known to have poor localization in both time and frequency, although it was a first step beyond the standard Fourier transform in the analysis of such data. Morlet's original wavelet transform idea was developed in collaboration with theoretical physicist A. Grossmann, whose contributions included an exact inversion formula. A series of fundamental papers flowed from this collaboration [16, 12, 13], and connections were soon recognized between Morlet's wavelet transform and earlier methods, including harmonic analysis, scale-space representations, and conjugated quadrature filters. For further details, the interested reader is referred to Daubechies' [7] account of the early history of the wavelet transform.
  • Cluster Analysis Y H Chan
Basic Statistics For Doctors. Singapore Med J 2005; 46(4): 153. CME Article. Biostatistics 304: Cluster analysis. Y H Chan.

In cluster analysis, we seek to identify the "natural" structure of groups based on a multivariate profile, if it exists, which both minimises the within-group variation and maximises the between-group variation. The objective is to perform data reduction into manageable bite-sizes which could be used in further analysis or in developing hypotheses concerning the nature of the data. It is exploratory, descriptive and non-inferential. This technique will always create clusters, be it right or wrong. The solutions are not unique, since they are dependent on the variables used and how cluster membership is defined. There are no essential assumptions required for its use, except that there must be some regard to the theoretical/conceptual rationale upon which the variables are selected.

SPSS offers three separate approaches to cluster analysis, namely: TwoStep, K-Means and Hierarchical. We shall discuss the Hierarchical approach first. This is chosen when we have little idea of the data structure. There are two basic hierarchical clustering procedures: agglomerative and divisive. Agglomerative starts with each object as a cluster, and new clusters are combined until eventually all individuals are grouped into one large cluster. Divisive proceeds in the opposite direction to agglomerative methods. For n cases, there will be one-cluster to n−1 cluster solutions. In SPSS, go to Analyse, Classify, Hierarchical Cluster to get Template I (Template I: Hierarchical cluster analysis). For simplicity, we shall use 10 subjects to demonstrate how cluster analysis works.
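For readers outside SPSS, a minimal open-source sketch of the same agglomerative procedure, assuming invented demo data for the 10 subjects:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
subjects = rng.normal(size=(10, 3))      # 10 subjects measured on 3 variables

Z = linkage(subjects, method="average")  # agglomerative merge tree (dendrogram data)
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters
print(labels)
```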
  • Cluster Analysis Objective: Group Data Points Into Classes of Similar Points Based on a Series of Variables
Multivariate Fundamentals: Distance. Cluster Analysis. Objective: group data points into classes of similar points based on a series of variables.

Cluster analysis is useful to find the true groups that are assumed to really exist, BUT if the analysis generates unexpected groupings it could inform new relationships you might want to investigate. It is also useful for data reduction: finding which data points are similar allows for subsampling of the original dataset without losing information. [Photo: Alfred Louis Kroeber (1876-1961).]

The math behind cluster analysis: once we calculate a distance matrix between points, we use that information to build a tree.

      A     B     C     D
A   0.0   1.8   0.6   3.0
B   1.8   0.0   2.5   3.3
C   0.6   2.5   0.0   2.2
D   3.0   3.3   2.2   0.0

Ordination visualizes the information in the distance calculations. The result of a cluster analysis is a tree or dendrogram. [Figure: dendrogram of points A, C, D and B against a distance axis from 0 to 4; when distances between points are not equal, a "hanging tree" can be drawn to illustrate the distances.]

Building trees & creating groups:
1. Nearest Neighbour Method: create groups by starting with the smallest distances and building branches. In effect we keep asking the data matrix "Which plot is my nearest neighbour?" to add branches.
2. Centroid Method: creates a group based on the smallest distance to the group centroid rather than to a group member. It first creates a group based on a small distance, then uses the centroid of that group to find which additional points belong in the same group.
3.
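The merge order described by the nearest-neighbour method can be computed directly from the distance matrix above; here is a minimal sketch with SciPy (single linkage is the nearest-neighbour rule):

```python
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage

# Distance matrix between points A, B, C, D from the text.
D = np.array([[0.0, 1.8, 0.6, 3.0],
              [1.8, 0.0, 2.5, 3.3],
              [0.6, 2.5, 0.0, 2.2],
              [3.0, 3.3, 2.2, 0.0]])

Z = linkage(squareform(D), method="single")  # single linkage = nearest neighbour
print(Z)  # each row: the two nodes merged, the merge distance, the new cluster size
```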
  • Fourier, Wavelet and Monte Carlo Methods in Computational Finance
Fourier, Wavelet and Monte Carlo Methods in Computational Finance. Kees Oosterlee (1: CWI, Amsterdam; 2: Delft University of Technology, the Netherlands). AANMPDE-9-16, 7/7/2016. Joint work with Fang Fang, Marjon Ruijter, Luis Ortiz, Shashi Jain, Alvaro Leitao, Fei Cong, Qian Feng.

Agenda:
• Derivatives pricing, Feynman-Kac Theorem
• Fourier methods: basics of the COS method; basics of the SWIFT method
• Options with early-exercise features: COS method for Bermudan options
• Monte Carlo method: BSDEs, BCOS method (very briefly)

Feynman-Kac Theorem. The linear partial differential equation:

$$\frac{\partial v(t,x)}{\partial t} + \mathcal{L}v(t,x) + g(t,x) = 0, \qquad v(T,x) = h(x),$$

with operator

$$\mathcal{L}v(t,x) = \mu(x)\, D v(t,x) + \tfrac{1}{2}\sigma^2(x)\, D^2 v(t,x).$$

Feynman-Kac theorem:

$$v(t,x) = \mathbb{E}\left[ \int_t^T g(s, X_s)\, ds + h(X_T) \right],$$

where $X_s$ is the solution to the FSDE $dX_s = \mu(X_s)\, ds + \sigma(X_s)\, d\omega_s$, $X_t = x$.
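The expectation in the Feynman-Kac theorem suggests a direct Monte Carlo estimator. The sketch below uses an Euler-Maruyama discretization of the FSDE; the choices of μ, σ, g and h are illustrative, not taken from the slides:

```python
import numpy as np

def feynman_kac_mc(x0, t, T, mu, sigma, g, h, n_paths=100_000, n_steps=200, seed=0):
    rng = np.random.default_rng(seed)
    dt = (T - t) / n_steps
    X = np.full(n_paths, x0, dtype=float)
    running = np.zeros(n_paths)
    s = t
    for _ in range(n_steps):
        running += g(s, X) * dt                  # accumulate the integral of g
        dW = rng.normal(scale=np.sqrt(dt), size=n_paths)
        X += mu(X) * dt + sigma(X) * dW          # Euler-Maruyama step
        s += dt
    return float(np.mean(running + h(X)))

# Example: v(0, 1) = E[X_T] for geometric Brownian motion (g = 0), ~ exp(0.05).
v = feynman_kac_mc(1.0, 0.0, 1.0,
                   mu=lambda x: 0.05 * x, sigma=lambda x: 0.2 * x,
                   g=lambda s, x: np.zeros_like(x), h=lambda x: x)
print(v)
```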
  • Cluster Analysis: What It Is and How to Use It Alyssa Wittle and Michael Stackhouse, Covance, Inc
PharmaSUG 2019 - Paper ST-183. Cluster Analysis: What It Is and How to Use It. Alyssa Wittle and Michael Stackhouse, Covance, Inc.

ABSTRACT. A cluster analysis is a great way of looking across several related data points to find possible relationships within your data which you may not have expected. The basic approach of a cluster analysis is the following: transform the results of a series of related variables into standardized values such as Z-scores, then combine these values and determine whether there are trends across the data which allow the data to be divided into separate, distinct groups, or "clusters". A cluster is assigned at a subject level, to be used as a grouping variable or even as a response variable. Once these clusters have been determined and assigned, they can be used in your analysis model to observe whether there is a significant difference between the results of these clusters within various parameters. For example, is a certain age group more likely to give more positive answers across all questionnaires in a study or integration? Cluster analysis can also be a good way of determining exploratory endpoints or focusing an analysis on a certain number of categories for a set of variables. This paper will present approaches to a clustering analysis, how the results can be interpreted, and how clusters can be determined and analyzed using several programming methods and languages, including SAS, Python and R. Examples of clustering analyses and their interpretations will also be provided.

INTRODUCTION. A cluster analysis is a multivariate data exploration method gaining popularity in the industry.
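A minimal sketch of the workflow described (Z-scores, then clustering, then a subject-level cluster label), in Python rather than SAS; the data and cluster count are invented:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
scores = rng.normal(size=(200, 5))          # 200 subjects, 5 related questionnaire scores

z = StandardScaler().fit_transform(scores)  # Z-scores per variable
cluster = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(z)
# `cluster` is a subject-level label that can join the analysis dataset
# as a grouping (or even response) variable.
```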
  • Factors Versus Clusters
    Paper 2868-2018 Factors vs. Clusters Diana Suhr, SIR Consulting ABSTRACT Factor analysis is an exploratory statistical technique to investigate dimensions and the factor structure underlying a set of variables (items) while cluster analysis is an exploratory statistical technique to group observations (people, things, events) into clusters or groups so that the degree of association is strong between members of the same cluster and weak between members of different clusters. Factor and cluster analysis guidelines and SAS® code will be discussed as well as illustrating and discussing results for sample data analysis. Procedures shown will be PROC FACTOR, PROC CORR alpha, PROC STANDARDIZE, PROC CLUSTER, and PROC FASTCLUS. INTRODUCTION Exploratory factor analysis (EFA) investigates the possible underlying factor structure (dimensions) of a set of interrelated variables without imposing a preconceived structure on the outcome (Child, 1990). The analysis groups similar items to identify dimensions (also called factors or latent constructs). Exploratory cluster analysis (ECA) is a technique for dividing a multivariate dataset into “natural” clusters or groups. The technique involves identifying groups of individuals or objects that are similar to each other but different from individuals or objects in other groups. Cluster analysis, like factor analysis, makes no distinction between independent and dependent variables. Factor analysis reduces the number of variables by grouping them into a smaller set of factors. Cluster analysis reduces the number of observations by grouping them into a smaller set of clusters. There is no right or wrong answer to “how many factors or clusters should I keep?”. The answer depends on what you’re going to do with the factors or clusters.
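The contrast drawn above (factor analysis reduces the number of variables; cluster analysis reduces the number of observations) can be made concrete in a few lines. This sketch is illustrative, with invented data; scikit-learn's FactorAnalysis stands in for PROC FACTOR and KMeans for PROC FASTCLUS:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 12))    # 300 observations on 12 items

factors = FactorAnalysis(n_components=3, random_state=0).fit_transform(X)
clusters = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
print(factors.shape)              # (300, 3): fewer variables, same observations
print(np.bincount(clusters))      # group sizes: fewer groups, same variables
```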
  • Wavelet Clustering in Time Series Analysis
Wavelet clustering in time series analysis. Carlo Cattani and Armando Ciancio.

Abstract. In this paper we briefly summarize the many advantages of the discrete wavelet transform in the analysis of time series. The data are transformed into clusters of wavelet coefficients and rates of change of the wavelet coefficients, both belonging to a suitable finite-dimensional domain. It is shown that the wavelet coefficients are strictly related to the scheme of finite differences, thus giving information on the first and second order properties of the data. In particular, this method is tested on financial data, such as stock prices, by characterizing the trends and the abrupt changes. The wavelet coefficients projected into the phase space give rise to characteristic clusters which are connected with the volatility of the time series. By their localization they represent a sufficiently good and new estimate of the risk.

Mathematics Subject Classification: 35A35, 42C40, 41A15, 47A58; 65T60, 65D07. Key words: wavelet, Haar function, time series, stock pricing.

1 Introduction. In the analysis of time series such as financial data, one of the main tasks is to construct a model able to capture the main features of the data, such as fluctuations, trends, jumps, etc. If the model both fits the already available data well and is expressed by some mathematical relations, then it can be tentatively used to forecast the evolution of the time series according to the given model. However, in general any reasonable model depends on an extremely large number of parameters, both deterministic and/or probabilistic [10, 18, 23, 2, 1, 29, 24, 25, 3], implying a complex set of complicated mathematical relations.
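As a small illustration of the construction described (Haar coefficients paired with their rate of change as points in a phase plane), assuming a simulated stand-in for a price series:

```python
import numpy as np
import pywt

rng = np.random.default_rng(3)
price = np.cumsum(rng.normal(size=1024))      # stand-in for a stock price series

_, detail = pywt.dwt(price, "haar")           # first-level Haar detail coefficients
rate = np.diff(detail)                        # rate of change of the coefficients
phase = np.column_stack([detail[:-1], rate])  # points in the phase plane
# Dense regions of `phase` correspond to quiet stretches of the series;
# outlying points flag abrupt changes (the clusters the paper studies).
```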
  • Statistical Hypothesis Testing in Wavelet Analysis: Theoretical Developments and Applications to India Rainfall Justin A
Nonlin. Processes Geophys. Discuss., https://doi.org/10.5194/npg-2018-55. Manuscript under review for journal Nonlin. Processes Geophys. Discussion started: 13 December 2018. © Author(s) 2018. CC BY 4.0 License.

Statistical Hypothesis Testing in Wavelet Analysis: Theoretical Developments and Applications to India Rainfall. Justin A. Schulte (Science Systems and Applications, Inc., Lanham, 20782, United States). Correspondence to: Justin A. Schulte ([email protected]).

Abstract. Statistical hypothesis tests in wavelet analysis are reviewed and developed. The output of a recently developed cumulative area-wise test is shown to be the ensemble mean of individual estimates of statistical significance calculated from a geometric test assessing statistical significance based on the area of contiguous regions (i.e. patches) of point-wise significance. This new interpretation is then used to construct a simplified version of the cumulative area-wise test to improve computational efficiency. Ideal examples are used to show that the geometric and cumulative area-wise tests are unable to differentiate features arising from singularity-like structures from those associated with periodicities. A cumulative arc-wise test is therefore developed to test for periodicities in a strict sense. A previously proposed topological significance test is formalized using persistent homology profiles (PHPs), measuring the number of patches and holes corresponding to the set of all point-wise significance values. Ideal examples show that the PHPs can be used to distinguish time series containing signal components from those that are purely noise. To demonstrate the practical uses of the existing and newly developed statistical methodologies, a first comprehensive wavelet analysis of India rainfall is also provided.