Periodicity Detection, Time-Series Correlation, Burst Detection
Total Page:16
File Type:pdf, Size:1020Kb
Foundations - 2 Periodicity Detection, Time-series Correlation, Burst Detection Temporal Information Retrieval Time Series • An ordered sequence of values (data points) of variables at equally spaced time intervals Periodicity Detection • How does one identify periodic values Periodicity Detection • Time-series is in the time domain • Method1 (DFT): Identify the underlying periodic patterns by transforming into the frequency domain • Method 2 (Autocorrelation) Correlate the signal with itself Find dominant frequencies is the Discrete Fourier Transform of the sequence . We may write this equation in matrix form as: . Fourier Transform. where and etc. • A signal has an amplitude amplitude (strength), DFT – example Let the continuous signal be frequency (periodicity) frequency and phase (offset) phase dc 1Hz 2Hz 10 • Fourier Transform 8 6 converts a signal from 4 2 the time domain to the 0 −2 frequency domain −4 0 1 2 3 4 5 6 7 8 9 10 Figure 7.2: Example signal for DFT. Let us sample at 4 times per second (ie. = 4Hz) from to . The values of the discrete samples are given by: by putting 84 Discrete Fourier Transform (DFT) This work targets similar applications and provides coefficients to record, we can perform a variety of tasks • A Fourier analysisto olsis thata method can significantly for ease expressing the “mining” of a useful functionsuch as compression, as a sum denoising, of periodic etc. information. Specifically, this paper makes the following components, andcontributions: for recovering the function from those components.Signal & Reconstruction 1. We present a novel automatic method for accurate • periodicity detection in time-series data. Our algorithm When both the functionis the first oneand that its exploits Fourier the information transform in both are replaced with discretized counterparts,periodogram and it is autocorrelation called the to provide discrete accurate Fourier transform (DFT). periodic estimates without upsampling. This work targets similar applications and provides coefficients to record, we can perform a variety of tasks tools that can significantly ease the “mining” of useful such as compression, denoising, etc. 2. We introduce new periodic distance measures that information. Specifically, this paper makes the following exploit the power of the dominant periods, as provided contributions: Signal & Reconstruction by the Fourier Transform. By ignoring the phase infor- Fourier Coefficients mation we can provide more compact representations, fourier transform f 1. We present a novel automatic method for accurate that also capture similarities under time-shift transfor- 4 f periodicity detection in time-series data. Our algorithm mations. 3 is the first one that exploits the information in both f 3. Finally, we present comprehensive experiments 2 periodogram and autocorrelation to provide accurate demonstrating the applicability and efficiency of the f periodic estimates without upsampling. 1 proposed methods, on a variety of real world datasets f 0 2. We introduce new periodic distance measures that (online query logs, manufacturing diagnostics, medical exploit the power of the dominant periods, as provided data, etc.). inv. fourier transform Figure 1: Reconstruction of a signal from its first 5 by the Fourier Transform. By ignoring the phase infor- Fourier Coefficients Fourier coefficients mation we can provide more compact representations, 2 Backgroundf that also capture similarities under time-shift transfor- 4 We provide a brieff introduction to harmonic analysis mations. using the discrete3 Fourier Transform, because we will 2.2 Power Spectral Density Estimation. In or- f der to discover potential periodicities of a time-series, 3. Finally, we present comprehensive experiments Advantages ofuse these DFT tools as apart2 the building blocks from of our algorithms.periodicity detection ? demonstrating the applicability and efficiency of the f one needs to examine its power spectral density (PSD 1 or power spectrum).denoising, The PSD compression essentially tells us how proposed methods, on a variety of real world datasets 2.1 Discrete Fourierf Transform. The normalized 0 much is the expected signal power at each frequency (online query logs, manufacturing diagnostics, medical Discrete Fourier Transform of a sequence x(n),n = of the signal. Since period is the inverse of frequency, data, etc.). 0, 1 ...N − 1 is a sequence of complex numbers X(f): Figure 1: Reconstruction of a signal from its first 5 by identifying the frequencies that carry most of the N 1 2 Fourier coefficients 1 − j πkn X(f )= x(n)e− N , k = 0, 1 ...N − 1 energy, we can also discover the most dominant peri- 2 Background k/N √N n=0 ods. There are two well known estimators of the PSD; We provide a brief introduction to harmonic analysis the periodogram and the circular autocorrelation. Both 2.2 Power Spectral Density Estimation.where the subscriptIn or- k/N denotes the frequency that using the discrete Fourier Transform, because we will each coefficient captures. Throughout the text we will of these methods can be computed using the DFT of der to discover potential periodicities of a time-series, use these tools as the building blocks of our algorithms. also utilize the notation F(x) to describe the Fourier a sequence (and can therefore exploit the Fast Fourier one needs to examine its power spectral density (PSD Transform. Since we are dealing with real signals, the Transform for execution in O(N log N) time). or power spectrum). The PSD essentially tells us how 2.1 Discrete Fourier Transform. The normalized Fourier coefficients are symmetric around the middle Discrete Fourier Transform of a sequence x(n),n = much is the expected signal power atone each (or frequency to be more exact, they will be the complex 2.2.1 Periodogram Suppose that X is the DFT of 0, 1 ...N − 1 is a sequence of complex numbers X(f): of the signal. Since period is the inverseconjugate of frequency, of their symmetric). The Fourier transform a sequence x. The periodogram P is provided by the by identifying the frequencies that carry most of the N 1 j2πkn represents the original signal as a linear combination of squared length of each Fourier coefficient: 1 − 2 N energy, we can also discover the most dominant peri- j πfn/N X(fk/N )= √ x(n)e− , k = 0, 1 ...N − 1 e 2 N 1 N the complex sinusoids sf (n)= . Therefore, the P(f )=∥X(f )∥ k =0, 1 ...⌈ − ⌉ n=0 ods. There are two well known estimators of the PSD; √N k/N k/N 2 Fourier coefficients record the amplitude and phase of the periodogram and the circular autocorrelation. Both Notice that we can only detect frequencies that are at where the subscript k/N denotes the frequency that these sinusoids, after signal x is projected on them. each coefficient captures. Throughout the text we will of these methods can be computed usingWe the can DFT return of from the frequency domain back to most half of the maximum signal frequency, due to the also utilize the notation F(x) to describe the Fourier a sequence (and can therefore exploitthe the time Fast domain, Fourier using the inverse Fourier transform Nyquist fundamental theorem. In order to find the k 1 Transform. Since we are dealing with real signals, the Transform for execution in O(N log NF)− time).(x) ≡ x(n): dominant periods, we need to pick the k largest values 1 Fourier coefficients are symmetric around the middle of the periodogram. N 1 2 1 − j πkn one (or to be more exact, they will be the complex 2.2.1 Periodogram Suppose that X isx(n the)= DFT ofX(fk/N )e N , k = 0, 1 ...N − 1 √N 1 conjugate of their symmetric). The Fourier transform a sequence x. The periodogram P is provided byn the=0 Due to the assumption of the Fourier Transform that the data is periodic, proper windowing of the data might be necessary for represents the original signal as a linear combination of squared length of each Fourier coefficient:Note that if during this reverse transformation we j2πfn/N achieving a more accurate harmonic analysis. In this work we will e 2 discardN some1 of the coefficients (e.g., the last k), then the complex sinusoids sf (n)= . Therefore, the P(f )=∥X(f )∥ k =0, 1 ...⌈ − ⌉ sidestep this issue, since it goes beyond the scope of this paper. √N k/N k/N the outcome2 will be an approximation of the original Fourier coefficients record the amplitude and phase of However, the interested reader is directed to [5] for an excellent Notice that we can only detect frequenciessequence that (Figure are at 1). By carefully selecting which review of data windowing techniques. these sinusoids, after signal x is projected on them. We can return from the frequency domain back to most half of the maximum signal frequency, due to the the time domain, using the inverse Fourier transform Nyquist fundamental theorem. In order to find the k 1 F − (x) ≡ x(n): dominant periods, we need to pick the k largest values of the periodogram. 1 N 1 2 1 − j πkn x(n)= X(f )e N , k = 0, 1 ...N − 1 √N k/N n=0 1Due to the assumption of the Fourier Transform that the data Note that if during this reverse transformation we is periodic, proper windowing of the data might be necessary for achieving a more accurate harmonic analysis. In this work we will discard some of the coefficients (e.g., the last k), then sidestep this issue, since it goes beyond the scope of this paper. the outcome will be an approximation of the original However, the interested reader is directed to [5] for an excellent sequence (Figure 1). By carefully selecting which review of data windowing techniques. is the Discrete Fourier Transform of the sequence .