
Foundations - 2

Periodicity Detection, Time-series Correlation, Burst Detection

Temporal Information Retrieval

• Time series: an ordered sequence of values (data points) of variables at equally spaced time intervals

Periodicity Detection

• How does one identify periodic values?

• A time series is in the time domain

• Method 1 (DFT): Identify the underlying periodic patterns by transforming into the frequency domain

• Method 2 (autocorrelation): Correlate the signal with itself

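• Find dominant frequencies

• A signal has an amplitude (strength), frequency (periodicity) and phase (offset)

• Fourier Transform converts a signal from the time domain to the frequency domain

$$F[n] = \sum_{k=0}^{N-1} f[k]\, e^{-j \frac{2\pi}{N} n k}, \qquad n = 0, 1, \ldots, N-1$$

is the Discrete Fourier Transform of the sequence $f[k]$.

We may write this equation in matrix form as:

$$\begin{bmatrix} F[0] \\ F[1] \\ F[2] \\ \vdots \\ F[N-1] \end{bmatrix} = \begin{bmatrix} 1 & 1 & 1 & \cdots & 1 \\ 1 & W & W^{2} & \cdots & W^{N-1} \\ 1 & W^{2} & W^{4} & \cdots & W^{2(N-1)} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & W^{N-1} & W^{2(N-1)} & \cdots & W^{(N-1)^{2}} \end{bmatrix} \begin{bmatrix} f[0] \\ f[1] \\ f[2] \\ \vdots \\ f[N-1] \end{bmatrix}$$

where $W = e^{-j 2\pi / N}$ and $W^{N} = W^{2N} = \cdots = 1$ etc.

DFT – example

Let the continuous signal be the sum of a dc component, a 1 Hz component and a 2 Hz component.

Figure 7.2: Example signal for DFT.

Let us sample the signal 4 times per second (i.e. at a sampling frequency of 4 Hz). The values of the discrete samples are given by substituting the sample times into the expression for the continuous signal.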
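The example can be sketched in a few lines of Python. The amplitudes and the phase of the signal below are illustrative assumptions (only the dc, 1 Hz and 2 Hz structure is taken from the example), and `dft_matrix`, `dft` and `signal` are hypothetical helper names. The sketch builds the matrix of powers of $W = e^{-j2\pi/N}$ and applies it to four samples taken at 4 Hz.

```python
import cmath
import math

def dft_matrix(N):
    """N x N matrix with entries W**(n*k), where W = exp(-2j*pi/N)."""
    W = cmath.exp(-2j * math.pi / N)
    return [[W ** (n * k) for k in range(N)] for n in range(N)]

def dft(samples):
    """Unnormalized DFT written as the matrix-vector product F = M f."""
    N = len(samples)
    M = dft_matrix(N)
    return [sum(M[n][k] * samples[k] for k in range(N)) for n in range(N)]

def signal(t):
    """Assumed signal: dc term + 1 Hz cosine (shifted by 90 degrees) + 2 Hz cosine."""
    return (5.0
            + 2.0 * math.cos(2 * math.pi * t - math.pi / 2)
            + 3.0 * math.cos(4 * math.pi * t))

fs = 4                                        # sampling frequency in Hz
samples = [signal(k / fs) for k in range(4)]  # t = 0, 1/4, 2/4, 3/4
coeffs = dft(samples)
magnitudes = [abs(c) for c in coeffs]

for n, c in enumerate(coeffs):
    print(f"F[{n}] (bin at {n * fs / len(samples):.0f} Hz): |F| = {abs(c):.3f}")
```

With these assumed parameters the dc bin carries the largest magnitude, and bins 1 and 3 have equal magnitude because the coefficients of a real signal come in complex-conjugate pairs.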

Discrete Fourier Transform (DFT)

• Fourier analysis is a method for expressing a function as a sum of periodic components, and for recovering the function from those components.

• When both the function and its Fourier transform are replaced with discretized counterparts, it is called the discrete Fourier transform (DFT).

This work targets similar applications and provides tools that can significantly ease the “mining” of useful information. Specifically, this paper makes the following contributions:

1. We present a novel automatic method for accurate periodicity detection in time-series data. Our algorithm is the first one that exploits the information in both periodogram and autocorrelation to provide accurate periodic estimates without upsampling.

2. We introduce new periodic distance measures that exploit the power of the dominant periods, as provided by the Fourier Transform. By ignoring the phase information we can provide more compact representations, that also capture similarities under time-shift transformations.

3. Finally, we present comprehensive experiments demonstrating the applicability and efficiency of the proposed methods, on a variety of real world datasets (online query logs, manufacturing diagnostics, medical data, etc.).

2 Background

We provide a brief introduction to harmonic analysis using the discrete Fourier Transform, because we will use these tools as the building blocks of our algorithms.

2.1 Discrete Fourier Transform. The normalized Discrete Fourier Transform of a sequence $x(n)$, $n = 0, 1 \ldots N-1$ is a sequence of complex numbers $X(f)$:

$$X(f_{k/N}) = \frac{1}{\sqrt{N}} \sum_{n=0}^{N-1} x(n)\, e^{-\frac{j 2\pi k n}{N}}, \qquad k = 0, 1 \ldots N-1$$

where the subscript $k/N$ denotes the frequency that each coefficient captures. Throughout the text we will also utilize the notation $\mathcal{F}(x)$ to describe the Fourier Transform. Since we are dealing with real signals, the Fourier coefficients are symmetric around the middle one (or to be more exact, each is the complex conjugate of its symmetric counterpart). The Fourier transform represents the original signal as a linear combination of the complex sinusoids $s_f(n) = \frac{e^{j 2\pi f n / N}}{\sqrt{N}}$. Therefore, the Fourier coefficients record the amplitude and phase of these sinusoids, after signal $x$ is projected on them.

We can return from the frequency domain back to the time domain, using the inverse Fourier transform $\mathcal{F}^{-1}(X) \equiv x(n)$:

$$x(n) = \frac{1}{\sqrt{N}} \sum_{k=0}^{N-1} X(f_{k/N})\, e^{\frac{j 2\pi k n}{N}}, \qquad n = 0, 1 \ldots N-1$$

Note that if during this reverse transformation we discard some of the coefficients (e.g., the last k), then the outcome will be an approximation of the original sequence (Figure 1). By carefully selecting which coefficients to record, we can perform a variety of tasks such as compression, denoising, etc.

Figure 1: Reconstruction of a signal from its first 5 Fourier coefficients. (Panels: Signal & Reconstruction; Fourier Coefficients f0 to f4. Arrows: fourier transform, inv. fourier transform.)

2.2 Power Spectral Density Estimation. In order to discover potential periodicities of a time-series, one needs to examine its power spectral density (PSD or power spectrum). The PSD essentially tells us how much is the expected signal power at each frequency of the signal. Since period is the inverse of frequency, by identifying the frequencies that carry most of the energy, we can also discover the most dominant periods. There are two well known estimators of the PSD; the periodogram and the circular autocorrelation. Both of these methods can be computed using the DFT of a sequence (and can therefore exploit the Fast Fourier Transform for execution in $O(N \log N)$ time).

2.2.1 Periodogram. Suppose that $X$ is the DFT of a sequence $x$. The periodogram $P$ is provided by the squared length of each Fourier coefficient:

$$P(f_{k/N}) = \lVert X(f_{k/N}) \rVert^{2}, \qquad k = 0, 1 \ldots \left\lceil \frac{N-1}{2} \right\rceil$$

Notice that we can only detect frequencies that are at most half of the sampling frequency, due to the Nyquist fundamental theorem. In order to find the k dominant periods, we need to pick the k largest values of the periodogram.¹

¹ Due to the assumption of the Fourier Transform that the data is periodic, proper windowing of the data might be necessary for achieving a more accurate harmonic analysis. In this work we will sidestep this issue, since it goes beyond the scope of this paper. However, the interested reader is directed to [5] for an excellent review of data windowing techniques.

Advantages of DFT apart from periodicity detection?

denoising, compression
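The reconstruction idea behind Figure 1 can be illustrated with a minimal sketch. It assumes a synthetic series (a slow wave plus a weaker fast wave) and hypothetical helpers `ndft`, `indft` and `keep_low_frequencies`. Keeping each low-frequency coefficient together with its conjugate mirror is one design choice that keeps the reconstruction real-valued; it is not necessarily how the figure was produced.

```python
import cmath
import math

def ndft(x):
    """Normalized DFT: X[k] = (1/sqrt(N)) * sum_n x[n] * exp(-2j*pi*k*n/N)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N)) / math.sqrt(N)
            for k in range(N)]

def indft(X):
    """Normalized inverse DFT: x[n] = (1/sqrt(N)) * sum_k X[k] * exp(+2j*pi*k*n/N)."""
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * n / N) for k in range(N)) / math.sqrt(N)
            for n in range(N)]

def keep_low_frequencies(X, m):
    """Zero every coefficient except the m lowest frequencies and their conjugate mirrors."""
    N = len(X)
    return [X[k] if min(k, N - k) < m else 0j for k in range(N)]

N = 64
# Assumed test series: 2 cycles of a strong wave plus 20 cycles of a weak one.
x = [math.sin(2 * math.pi * 2 * n / N) + 0.2 * math.sin(2 * math.pi * 20 * n / N)
     for n in range(N)]

X = ndft(x)
exact = [v.real for v in indft(X)]                            # all coefficients kept
approx = [v.real for v in indft(keep_low_frequencies(X, 5))]  # frequencies 0..4 only

max_exact_err = max(abs(a - b) for a, b in zip(x, exact))
max_approx_err = max(abs(a - b) for a, b in zip(x, approx))
print(max_exact_err, max_approx_err)
```

Keeping all coefficients recovers the series up to rounding error. Keeping only the lowest frequencies removes the weak fast wave, which is the effect that compression and denoising rely on.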

DFT – example (continued)

Applying the matrix form of the DFT to the four samples yields four DFT coefficients. The magnitude of the DFT coefficients is shown in Fig. 7.3.

Figure 7.3: DFT of four point sequence. (Axes: $|F[n]|$ against $f$ (Hz).)

Inverse Discrete Fourier Transform

The inverse transform of

$$F[n] = \sum_{k=0}^{N-1} f[k]\, e^{-j \frac{2\pi}{N} n k}$$

is

$$f[k] = \frac{1}{N} \sum_{n=0}^{N-1} F[n]\, e^{+j \frac{2\pi}{N} n k}$$

Discrete Fourier Transform (DFT)

$$X(f_{k/N}) = \frac{1}{\sqrt{N}} \sum_{n=0}^{N-1} x(n)\, e^{-\frac{j 2\pi k n}{N}}$$

• The Fourier coefficients encode both the amplitude and phase

Power Spectral Density (PSD) Estimation

• To find the dominant frequency, we need to find the power at each frequency

• The periodogram encodes the strength at a given frequency
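A minimal sketch of periodogram-based period detection follows. It assumes a synthetic series with a weekly (7 sample) and a weaker 28 sample cycle, and the helper names `periodogram` and `top_k_periods` are hypothetical. The DFT is evaluated naively here; an FFT would compute the same values in $O(N \log N)$ time.

```python
import cmath
import math

def periodogram(x):
    """P[k] = |X(f_{k/N})|^2 for k = 0 .. ceil((N-1)/2), using the normalized DFT."""
    N = len(x)
    P = []
    for k in range(math.ceil((N - 1) / 2) + 1):
        Xk = sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N)) / math.sqrt(N)
        P.append(abs(Xk) ** 2)
    return P

def top_k_periods(x, k):
    """Return (period, power) for the k strongest non-dc periodogram bins."""
    N = len(x)
    P = periodogram(x)
    # Bin 0 is the dc component and has no finite period, so it is skipped.
    strongest = sorted(range(1, len(P)), key=lambda i: P[i], reverse=True)[:k]
    return [(N / i, P[i]) for i in strongest]

N = 140
# Assumed series: constant offset + strong 7-sample cycle + weaker 28-sample cycle.
x = [3.0 + math.sin(2 * math.pi * n / 7) + 0.5 * math.sin(2 * math.pi * n / 28)
     for n in range(N)]

hints = top_k_periods(x, 2)
for period, power in hints:
    print(f"period {period:.2f} samples, power {power:.2f}")
```

The series length is chosen so that both periods fall exactly on DFT bins (N/7 = 20 and N/28 = 5), which avoids spectral leakage in this illustration.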
PSD estimation using Periodogram

Each element of the periodogram provides the power at frequency $k/N$ or, equivalently, at period $N/k$. Being more precise, each DFT ‘bin’ corresponds to a range of periods (or frequencies). That is, coefficient $X(f_{k/N})$ corresponds to periods $[\frac{N}{k} \ldots \frac{N}{k-1})$. It is easy to see that the resolution of the periodogram becomes very coarse for longer periods. For example, for a sequence of length $N = 256$, the DFT bin margins will be $N/1, N/2, N/3, N/4, \ldots = 256, 128, 85.3, 64$ etc.

Essentially, the accuracy of the discovered periods deteriorates for large periods, due to the increasing width of the DFT bins ($N/k$). Another related issue is spectral leakage, which causes frequencies that are not integer multiples of the DFT bin width to disperse over the entire spectrum. This can lead to ‘false alarms’ in the periodogram. However, the periodogram can still provide an accurate indicator of important short (to medium) length periods. Additionally, through the periodogram it is easy to automate the extraction of important periods (peaks) by examining the statistical properties of the Fourier coefficients (such as in [15]).

2.2.2 Circular Autocorrelation. The second way to estimate the dominant periods of a time-series $x$ is to calculate the circular AutoCorrelation Function (or ACF), which examines how similar a sequence is to its previous values for different $\tau$ lags:

$$ACF(\tau) = \frac{1}{N} \sum_{n=0}^{N-1} x(n) \cdot x(n + \tau)$$

where the index $n + \tau$ is taken modulo $N$. Therefore, the autocorrelation is formally a convolution, and we can avoid the quadratic calculation in the time domain by computing it efficiently as a dot product in the frequency domain using the normalized Fourier transform (up to a constant scale factor):

$$ACF = \mathcal{F}^{-1}\left( X \cdot X^{*} \right)$$

The star ($*$) symbol denotes complex conjugation.

The ACF provides a more fine-grained periodicity detector than the periodogram, hence it can pinpoint with greater accuracy even larger periods. However, it is not sufficient by itself for automatic periodicity discovery for the following reasons:

1. Automated discovery of important peaks is more difficult than in the periodogram. Approaches that utilize forms of autocorrelation require the user to manually set the significance threshold (such as in [2, 3]).

2. Even if the user picks the level of significance, multiples of the same basic period also appear as peaks. Therefore, the method introduces many false alarms that need to be eliminated in a post-processing phase.

3. Low amplitude events of high frequency may appear less important (i.e., have lower peaks) than high amplitude patterns, which nonetheless appear more scarcely (see example in fig. 2).

Figure 2: The 7 day period is latent in the autocorrelation graph, because it has lower amplitude (even though it happens with higher frequency). However, the 7 day peak is very obvious in the Periodogram. (Panels: Sequence, i.e. the time series data or signal; Periodogram, with P1 = 7 and P2 = 30.3333 marked on the Power axis; Circular Autocorrelation, with the 7 day lag marked.)

The advantages and shortcomings of the periodogram and the ACF are summarized in Table 1. From the above discussion one can realize that although the periodogram and the autocorrelation cannot provide sufficient spectral information separately, there is a lot of potential when both methods are combined. We delineate our approach in the following section.

3 Our Approach

We utilize a two-tier approach, by considering the information in both the autocorrelation and the periodogram. We call this method AUTOPERIOD. Since the discovery of important periods is more difficult on the autocorrelation, we can use the periodogram for extracting period candidates. Let’s call the period candidates ‘hints’. These ‘hints’ may be false (due to spectral leakage), or provide a coarse estimate of the period (remember that DFT bins increase gradually in size); therefore a verification phase using the autocorrelation is required, since it provides a more fine-grained estimation of potential periodicities. The intuition is that if the candidate period from the periodogram lies on a hill of the ACF then we can consider it as a valid period, otherwise we discard it as a false alarm. For the periods that reside on a hill, further refinement may be required if the periodicity hint refers to a large period.

Figure 3 summarizes our methodology and Figure 4 depicts the visual intuition behind our approach with a working example.

• To find the dominant frequencies, choose the k frequencies with the largest periodogram power
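The circular autocorrelation and the ‘hill’ check can be sketched as follows. This is a simplified illustration, not the AUTOPERIOD procedure itself: how a hill is detected is not spelled out here, so the local-maximum test in `on_hill` is an assumption, and `acf_direct`, `acf_via_dft` and `on_hill` are hypothetical names. The frequency-domain route uses a naive unnormalized DFT, so the scale factor is made explicit in the code.

```python
import cmath
import math

def acf_direct(x):
    """Circular autocorrelation from the definition (quadratic time)."""
    N = len(x)
    return [sum(x[n] * x[(n + tau) % N] for n in range(N)) / N for tau in range(N)]

def acf_via_dft(x):
    """Same values via the frequency domain: inverse DFT of X * conj(X) = |X|^2."""
    N = len(x)
    X = [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
         for k in range(N)]
    power = [abs(v) ** 2 for v in X]
    # Unnormalized forward DFT: one factor 1/N belongs to the inverse transform,
    # the other to the 1/N average in the ACF definition.
    return [sum(power[k] * cmath.exp(2j * math.pi * k * tau / N) for k in range(N)).real / (N * N)
            for tau in range(N)]

def on_hill(acf, period):
    """Crude validity test (an assumption): the ACF has a local maximum at this lag."""
    return acf[period - 1] < acf[period] > acf[period + 1]

N = 84
x = [math.sin(2 * math.pi * n / 7) for n in range(N)]   # assumed 7-sample cycle

acf = acf_direct(x)
acf_fast = acf_via_dft(x)
max_diff = max(abs(a - b) for a, b in zip(acf, acf_fast))

print(max_diff)
print(on_hill(acf, 7), on_hill(acf, 5))
```

A candidate period of 7 sits on a peak of the ACF and would be kept, whereas a candidate of 5 does not and would be discarded as a false alarm under this simplified test.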
Disadvantages of the Periodogram

• Good only for short and medium periodicities

• Spectral leakage: frequencies that are not integer multiples of the DFT bin width spread over other bins, causing false alarms
lution,lution, and and we we can can avoid avoid the the quadratic quadratic calculation calculation in in thethe time time domain domain by by computing computing it it e effifficientlyciently as as a a dot dot 33 Our Our Approach Approach productproduct in in the the frequency frequency domain domain using using the the normalized normalized We utilize a two-tier approach, by considering the in- FourierFourier transform: transform: We utilize a two-tier approach, by considering the in- 1 formation in both the autocorrelation and the peri- ACF = F − 1 formation in both the autocorrelation and the peri- ACF = F − odogram.odogram. We We call call this this method methodAUTOPERIODAUTOPERIOD.. Since Since the the TheThe star star (∗ (∗)) symbol symbol denotes denotes complex complex conjugation. conjugation. discoverydiscovery of of important important periods periods is is more more di diffifficultcult on on the the TheThe ACF ACF provides provides a a more more fine-grained fine-grained periodicity periodicity autocorrelation,autocorrelation, we we can can use use the the periodogram periodogram for for extract- extract- detectordetector than than the the periodogram, periodogram, hence hence it it can can pinpoint pinpoint inging period period candidates. candidates. Let’s Let’s call call the the period period candidates candidates withwith greater greater accuracy accuracy even even larger larger periods. periods. However, However, ‘hints’.‘hints’. These These ‘hints’ ‘hints’ may may be be false false (due (due to to spectral spectral leak- leak- itit is is not not su suffifficientcient by by itself itself for for automatic automatic periodicity periodicity age),age), or or provide provide a a coarse coarse estimate estimate of of the the period period (remem- (remem- discoverydiscovery for for the the following following reasons: reasons: berber that that DFT DFT bins bins increase increase gradually gradually in in size); size); there- there- 1.1. 
Automated Automated discovery discovery of of important important peaks peaks is is forefore a a verification verification phase phase using using the the autocorrelation autocorrelation is is re- re- moremore di diffifficultcult than than in in the the periodogram. periodogram. Approaches Approaches quired,quired, since since it it provides provides a a more more fine-grained fine-grained estimation estimation thatthat utilize utilize forms forms of of autocorrelation autocorrelation require require the the user user ofof potential potential periodicities. periodicities. The The intuition intuition is is that that if if the the toto manually manually set set the the significance significance threshold threshold (such (such as as in in candidatecandidate period period from from the the periodogram periodogram lies lies on on a a hill hill of of [2,[2, 3]). 3]). thethe ACF ACF then then we we can can consider consider it it as as a a valid valid period, period, oth- oth- 2.2. Even Even if if the the user user picks picks the the level level of of significance, significance, erwiseerwise we we discard discard it it as as false false alarm. alarm. For For the the periods periods that that multiplesmultiples of of the the same same basic basic period period also also appear appear as as peaks. peaks. residereside on on a a hill, hill, further further refinement refinement may may be be required required if if Therefore,Therefore, the the method method introduces introduces many many false false alarms alarms thethe periodicity periodicity hint hint refers refers to to a a large large period. period. thatthat need need to to be be eliminated eliminated in in a a post-processing post-processing phase. phase. FigureFigure 3 3 summarizes summarizes our our methodology methodology and and Figure Figure 3.3. 
Low Low amplitude amplitude events events of of high high frequency frequency may may 44 depicts depicts the the visual visual intuition intuition behind behind our our approach approach with with appearappear less less important important (i.e., (i.e., have have lower lower peaks) peaks) than than high high aa working working example. example. The The sequence sequence is is obtained obtained from from the the 33 Autocorrelation
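The periodogram and the circular ACF defined above can be made concrete with a few lines of code. The following is a minimal sketch using only the Python standard library: `dft` is a direct O(N²) evaluation of the normalized DFT (a real implementation would call an FFT routine to reach O(N log N)), and the 7-sample sine series is an assumption chosen purely for illustration. The function names are hypothetical, not taken from the excerpted paper.

```python
import cmath
import math


def dft(x):
    """Normalized DFT: X[k] = (1/sqrt(N)) * sum_n x[n] * exp(-j*2*pi*k*n/N)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)) / math.sqrt(n)
        for k in range(n)
    ]


def periodogram(x):
    """Power |X[k]|^2 for k = 0 .. ceil((N-1)/2); bin k corresponds to period N/k."""
    coeffs = dft(x)
    return [abs(coeffs[k]) ** 2 for k in range(math.ceil((len(x) - 1) / 2) + 1)]


def circular_acf(x):
    """Direct O(N^2) circular autocorrelation; the index t + tau wraps modulo N."""
    n = len(x)
    return [sum(x[t] * x[(t + tau) % n] for t in range(n)) / n for tau in range(n)]


def acf_via_periodogram(x):
    """The same ACF obtained by inverting the power spectrum |X[k]|^2."""
    n = len(x)
    spectrum = [abs(c) ** 2 for c in dft(x)]
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * tau / n) for k in range(n)) / n).real
        for tau in range(n)
    ]


# Toy series (an assumption for illustration): a sine with a 7-sample period, 8 full cycles.
N = 56
series = [math.sin(2 * math.pi * t / 7) for t in range(N)]

power = periodogram(series)
best_k = max(range(1, len(power)), key=lambda k: power[k])  # skip the DC bin k = 0
print("dominant period from the periodogram:", N / best_k)

acf = circular_acf(series)
print("ACF at lags 7 and 14:", round(acf[7], 3), round(acf[14], 3))
```

In this toy case the ACF is equally high at lag 7 and at its multiple 14, which is the "multiples also appear as peaks" problem listed above, and `acf_via_periodogram` agrees with the direct computation up to rounding.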

• Correlate the time series with itself

$$ACF(\tau) = \frac{1}{N} \sum_{i=1}^{N} Y_i \cdot Y_{i+\tau}$$

• Peaks get amplified

• Fine-grained periodicity detector

• To determine the dominant period, a significance threshold needs to be specified

• Multiples of the same period are also peaks, so post-processing is needed

Auto-Period

• Auto-correlation: good for large periods, but difficult to determine periods automatically

• Periodogram: easy to threshold, but not accurate for large periods

• Idea: get candidate periods from the periodogram and validate them (dismissing false alarms) using the auto-correlation

Method           Easy to threshold   Accurate short periods   Accurate large periods   Complexity
Periodogram      yes                 yes                      no                       O(N log N)
Autocorrelation  no                  yes                      yes                      O(N log N)
Combination      yes                 yes                      yes                      O(N log N)

Table 1: Concise comparison of approaches for periodicity detection.

Figure 3: Diagram of our methodology (AUTOPERIOD method). Sequence → Periodogram → candidate periods → Autocorrelation; a candidate that lies on a hill is kept and its period refined, a candidate that lies in a valley is dismissed as a false alarm.
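The two-tier idea in the diagram can be sketched as follows. This is a simplified illustration, not the procedure of the excerpted paper: hints are simply the `num_hints` strongest periodogram bins, and a hint is said to lie "on a hill" when the ACF maximum over the lags spanned by that bin and its two neighbours falls strictly inside the lag range. Both rules, the function names and the pulse-train test series are assumptions made here.

```python
import cmath
import math


def periodogram(x):
    """|X[k]|^2 of the normalized DFT for k = 0 .. ceil((N-1)/2)."""
    n = len(x)
    return [
        abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))) ** 2 / n
        for k in range(math.ceil((n - 1) / 2) + 1)
    ]


def circular_acf(x):
    """Direct circular autocorrelation (lag index wraps modulo N)."""
    n = len(x)
    return [sum(x[t] * x[(t + tau) % n] for t in range(n)) / n for tau in range(n)]


def autoperiod_sketch(x, num_hints=4):
    """Return the periods whose periodogram hint is confirmed by an ACF hill."""
    n = len(x)
    mean = sum(x) / n
    centered = [v - mean for v in x]
    power = periodogram(centered)
    acf = circular_acf(centered)

    # Tier 1: period hints are the strongest bins (k >= 2 so that N/(k-1) is defined).
    hints = sorted(range(2, len(power)), key=lambda k: power[k], reverse=True)[:num_hints]

    periods = []
    for k in hints:
        # Tier 2: look at the ACF over the lags covered by bin k and its neighbours.
        lo = math.ceil(n / (k + 1))
        hi = min(math.floor(n / (k - 1)), n // 2)
        if hi - lo < 2:
            continue  # fewer than three lags: cannot tell a hill from a valley
        best = max(range(lo, hi + 1), key=lambda tau: acf[tau])
        if lo < best < hi and best not in periods:  # interior maximum = hill
            periods.append(best)  # refined period taken from the ACF peak
    return periods


# Hypothetical test series: a 5-sample pulse every 37 samples, 200 samples in total.
series = [1.0 if t % 37 < 5 else 0.0 for t in range(200)]
print("validated periods:", autoperiod_sketch(series))
```

The nearest periodogram bin to the true period is k = 5, i.e. a coarse hint of 200/5 = 40; the ACF check refines it, while hints that come from harmonics of the pulse train are dismissed because the ACF has no interior peak in their lag range.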

The example sequence is obtained from the MSN query request logs and represents the aggregate demand for the query 'Easter' for 1000 days after the beginning of 2002. The demand for the specific query peaks during Easter time and we can observe one yearly peak. Our intuition is that periodicity should be approximately 365 (although not exactly, since Easter is not celebrated at the same date every year). Indeed the most dominant periodogram estimate is 333.33 = (1000/3), which is located on a hill of the ACF, with a peak at 357 (the correct periodicity, at least for this 3 year span). The remaining periodic hints can be discarded upon verification with the autocorrelation.

Figure 4: Visual demonstration of our method. Candidate periods from the periodogram are verified against the autocorrelation. Valid periods are further refined utilizing the autocorrelation information. (Panels for MSN Query "Easter": the sequence over 1000 days; Periodogram, with P1 = 333.3333, P2 = 166.6667, P3 = 90.9091; Circular Autocorrelation over lag in days, with the hill marked as candidate period, 357 marked as the correct period, and valleys marked as false alarms.)

Essentially, we have leveraged the information of both metrics for providing an accurate periodicity detector. In addition, our method is computationally efficient, because both the periodogram and the ACF can be directly computed through the Fast Fourier Transform of the examined sequence in $O(N \log N)$ time.

3.1 Discussion. First, we need to clarify succinctly that the combined use of the periodogram and the autocorrelation does not carry more information than each metric separately. This perhaps surprising statement can be verified by noting that:

$$\langle X, X^* \rangle = \|X\|^2$$

Therefore, the autocorrelation is the inverse Fourier transform of the periodogram, which means that the ACF can be considered as the dual of the periodogram, from the time into the frequency domain. In essence, our intention is to solve each problem in its proper domain: (i) the period significance in the frequency domain, and (ii) the identification of the exact period in the time domain.

Another issue that we would like to clarify is the reason that we are not considering a (seemingly) simpler approach for accurate periodicity estimation. Looking at the problem from a signal processing perspective, one could argue that the inability to discover the correct period is due to the 'coarse' sampling of the series. If we would like to increase the resolution of the DFT, we could 'sample' our dataset at a finer resolution (upsampling). Higher sampling rate essentially translates into padding the time-series with zeros, and calculating the DFT of the longer time-series. Indeed, if we increase the size of the example sequence from 1000 to 16000, we will be able to discover the correct periodicity which is 357 (instead of the incorrect 333, given in the original estimate).

However, upsampling also imposes a significant performance overhead. If we are interested in obtaining online periodicity estimates from a data stream, this alternative method may result in a serious system [...]
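The zero-padding argument above can be checked on a small synthetic case. This is a sketch under stated assumptions: the series is a plain sine with a hypothetical period of 37 samples observed for 200 samples (not the MSN data), and `dominant_period` is an illustrative helper that evaluates the DFT sum directly instead of calling an FFT.

```python
import cmath
import math


def dominant_period(x, n_fft):
    """Period n_fft / k of the strongest bin k >= 1 when x is zero-padded to length n_fft."""
    best_k, best_power = 1, -1.0
    for k in range(1, n_fft // 2 + 1):
        # The appended zeros contribute nothing, so the sum only visits the original samples.
        coeff = sum(v * cmath.exp(-2j * math.pi * k * t / n_fft) for t, v in enumerate(x))
        bin_power = abs(coeff) ** 2
        if bin_power > best_power:
            best_k, best_power = k, bin_power
    return n_fft / best_k


TRUE_PERIOD = 37
series = [math.sin(2 * math.pi * t / TRUE_PERIOD) for t in range(200)]

coarse = dominant_period(series, n_fft=200)   # bins at 200/k: ..., 50, 40, 33.3, ...
fine = dominant_period(series, n_fft=1600)    # bins at 1600/k: ..., 38.1, 37.2, 36.4, ...
print("no padding:", round(coarse, 2), "| 8x zero-padding:", round(fine, 2))
```

Padding by a factor of 8 makes the frequency grid 8 times finer, but it also multiplies the number of bins to evaluate by 8, which is the performance overhead the text warns about.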

Time Series

• Similar time series suggest similar things

Matching Time Series

• Correlating time series is used for clustering, classification, anomaly detection, speech recognition, etc.

Matching Time Series

What measure would you use to match two time series?

$$d = \sum_t |y_t - x_t|$$

Euclidean Distance

Why is Euclidean matching not good enough?

Dynamic Time Warping

Time series might be shifted

Time series might be compressed at some point in time

Random noise at some points

Dynamic time warping measures the distance between two sequences under certain restrictions.

Not a metric: the triangle inequality does not hold

Detour - Edit Distance

• Edit distance measures how many steps it takes to convert one string into another, subject to restrictions

• Restrictions define the cost function: insertion, deletion, replacement

Insertions and deletions only:

      f  o  x
   f  0  1  2
   a  1  2  3
   x  2  3  2

Insertions, deletions and replacements:

      f  o  x
   f  0  1  2
   a  1  1  2
   x  2  2  1

Edit Distance
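The two tables for "fax" versus "fox" can be reproduced with a small dynamic program. This is a minimal sketch with unit insertion and deletion costs; `allow_replace` is a hypothetical flag (not from the slides) that switches the substitution cost between 1 and 2, where a cost of 2 has the same effect as allowing only insertions and deletions. Like the slide tables, the returned table omits the row and column for the empty prefix.

```python
def edit_distance_table(source, target, allow_replace=True):
    """Return d[i][j] for the non-empty prefixes of source (rows) and target (columns)."""
    sub_cost = 1 if allow_replace else 2  # 2 = one deletion plus one insertion
    m, n = len(source), len(target)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i  # delete all i characters of the source prefix
    for j in range(1, n + 1):
        d[0][j] = j  # insert all j characters of the target prefix
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if source[i - 1] == target[j - 1]:
                d[i][j] = d[i - 1][j - 1]  # characters match: no extra cost
            else:
                d[i][j] = min(
                    d[i - 1][j] + 1,             # deletion
                    d[i][j - 1] + 1,             # insertion
                    d[i - 1][j - 1] + sub_cost,  # replacement
                )
    return [row[1:] for row in d[1:]]  # drop the empty-prefix row and column


for flag in (False, True):
    print("replacements allowed:", flag)
    for row in edit_distance_table("fax", "fox", allow_replace=flag):
        print("  ", row)
```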


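$$d_{ij} = \begin{cases} d_{i-1,j-1} & \text{for } a_j = b_i \\ \min \begin{cases} d_{i-1,j} + w_{\mathrm{del}}(b_i) \\ d_{i,j-1} + w_{\mathrm{ins}}(a_j) \\ d_{i-1,j-1} + w_{\mathrm{sub}}(a_j, b_i) \end{cases} & \text{for } a_j \neq b_i \end{cases} \quad \text{for } 1 \le i \le m,\ 1 \le j \le n.$$

Costs: $w_{\mathrm{del}} = 1$, $w_{\mathrm{ins}} = 1$, $w_{\mathrm{sub}} = 1$ or $2$ (a substitution cost of 2 is equivalent to allowing only insertions and deletions).

Dynamic Time Warping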

• DTW aligns two sequences of feature vectors by warping the time axis iteratively until an optimal match (according to a suitable metric) between the two sequences is found.

$x_1, x_2, \ldots, x_n$ and $y_1, y_2, \ldots, y_n$
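DTW on two such sequences is usually computed as a dynamic program over all index pairs (i, j). The following is a minimal sketch, assuming the absolute difference |x_i - y_j| as the local cost c(i, j) and steps from the three neighbouring cells; the function names and the two toy series are illustrative assumptions.

```python
def dtw_distance(x, y):
    """Cumulative DTW cost d(n, m) with local cost |x_i - y_j|."""
    n, m = len(x), len(y)
    inf = float("inf")
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0  # empty prefixes align at zero cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            # Extend the cheapest of the three admissible predecessor alignments.
            d[i][j] = cost + min(d[i - 1][j], d[i - 1][j - 1], d[i][j - 1])
    return d[n][m]


def pointwise_distance(x, y):
    """Sum of absolute differences at equal time points (no warping)."""
    return sum(abs(a - b) for a, b in zip(x, y))


x = [0, 0, 1, 2, 1, 0, 0, 0]
y = [0, 0, 0, 1, 2, 1, 0, 0]  # same shape as x, shifted by one time step

print("point-wise distance:", pointwise_distance(x, y))
print("DTW distance:", dtw_distance(x, y))
```

The shifted copy is penalised by the point-wise measure but not by DTW, which is the motivation given above for warping the time axis. The table costs O(n·m) time and memory.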

$$d(i, j) = c(i, j) + \min \begin{cases} d(i-1, j) \\ d(i-1, j-1) \\ d(i, j-1) \end{cases}$$

Burst Detection

• Bursts are rare but extremely beneficial in time-series

• Used in a number of applications

• Twitter: Trending topics

• Stock markets: Trending Stocks

• Text Mining: finding important time periods

• Elastic burst detection:

• Stream of data

• Quadratic computations not allowed

Burst Detection

• Global Average

• Damped Average

2.5 Elastic Burst Detection and Shifted Binary Tree

The elastic burst detection problem [84] is to detect bursts across multiple window sizes. Formally:

Problem 1. Given a data source producing non-negative data elements $x_1, x_2, \ldots$, a set of window sizes $W = \{w_1, w_2, \ldots, w_m\}$, a monotonic, associative aggregation function $A$ (such as "sum" or "maximum") that maps a consecutive sequence of data elements to a number (it is monotonic in the sense that $A[x_t \cdots x_{t+w-1}] \le A[x_t \cdots x_{t+w}]$, for all $w$), and thresholds associated with each window size, $f(w_j)$, for $j = 1, 2, \ldots, m$, elastic burst detection is the problem of finding all pairs $(t, w)$ such that $t$ is a time point, $w$ is a window size in $W$, and $A[x_t \cdots x_{t+w-1}] \ge f(w)$.

Elastic Burst Detection

• Given a time-series $\{x_i\}$

• A set of window sizes $W$

• A monotonic, associative aggregation function $A$ which maps a sequence of values to a number, e.g. sum, max

• Thresholds $f(w)$ associated with each window size $w$

• Find all pairs $(t, w)$ such that $t$ is a time point, $w$ is a window size in $W$, and $A[x_t \cdots x_{t+w-1}] \ge f(w)$
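The problem statement can be made concrete by checking every window of every size directly. This is a minimal sketch, assuming "sum" as the aggregation function, 0-based time points, and thresholds given as a dictionary from window size to f(w); the function name and the sample data are hypothetical.

```python
from itertools import accumulate


def elastic_bursts_naive(x, thresholds):
    """Return all (t, w) with x[t] + ... + x[t+w-1] >= thresholds[w]."""
    prefix = [0] + list(accumulate(x))  # prefix[i] = x[0] + ... + x[i-1]
    bursts = []
    for w, f_w in sorted(thresholds.items()):   # one pass per window size
        for t in range(len(x) - w + 1):         # every start position
            if prefix[t + w] - prefix[t] >= f_w:
                bursts.append((t, w))
    return bursts


data = [1, 0, 9, 8, 1, 0]
print(elastic_bursts_naive(data, {2: 10, 3: 15}))
```

Each window size costs one scan of the sequence, so m window sizes over N points take time proportional to m·N.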

A naive algorithm is to check each window size of interest one at a time. To detect bursts over m window sizes in a sequence of length N naively requires $\Theta(mN)$ time. This is unacceptable in a high-speed data stream environment. In [84], the authors show that a simple data structure called the Shifted Binary Tree could be the basis of a filter that would detect all bursts, and perform in time independent of the number of windows when the probability of bursts is very low.

Burst Detection - Shifted Binary Tree

A Shifted Binary Tree is a hierarchical data structure inspired by the Haar wavelet tree. The leaf nodes of this tree (denoted level 0) correspond to the time points of the incoming data; a node at level 1 aggregates two adjacent nodes at level 0. In general, a node at level $i+1$ aggregates two nodes at level $i$, thus includes $2^{i+1}$ time points. There are only $\log_2 N + 1$ levels, where N is the maximum window size. The Shifted Binary Tree includes a shifted sublevel for each level above level 0. In shifted sublevel $i$, the corresponding windows are still of length $2^i$ but those windows are shifted by $2^{i-1}$ from the base sublevel. Figure 2.1 shows an example of a Shifted Binary Tree.

Figure 2.1: An example of a Shifted Binary Tree (levels 0 to 4; every level above 0 has a base level and a shifted level). The two shaded sequences in level 0 are included in the shaded nodes in level 4 and level 3 respectively.

The overlap between the base sublevels and the shifted sublevels guarantees that all the windows of length $w$, $w \le 1 + 2^i$, are included in one of the windows at level $i+1$. Because the aggregation function $A$ is monotonically increasing, if $A[x_t \cdots x_{t+w+c}] \le f(w)$, then surely $A[x_t \cdots x_{t+w-1}] \le f(w)$. The Shifted Binary Tree takes advantage of this monotonic property as follows: the threshold value $f(2 + 2^{i-1})$ is associated with level $i+1$. Whenever more than $f(2 + 2^{i-1})$ events are found in a window of size $2^{i+1}$, then a detailed search must be performed to check if some subwindow of size $w$, $2 + 2^{i-1} \le w \le 1 + 2^i$, has $f(w)$ events. All bursts are guaranteed to be reported and many non-burst windows are filtered away without requiring a detailed check when the burst probability is very low.

However, some detailed searches will turn out to be fruitless (i.e. there is no burst at all). For example, assume the threshold for window size 4 is 100, for 5 is 120, and for 8 is 150. Because each node at level 8 covers window size 4 and [...]
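The filtering idea can be sketched for the "sum" aggregate as follows. This is an illustration under stated assumptions, not the implementation from [84]: node aggregates are read from a prefix-sum array instead of being maintained incrementally on a stream, nodes at the end of the sequence are clipped to its length, time points are 0-based, and level 1 is taken to cover window sizes 1 and 2. The function name and the sample data are hypothetical.

```python
from itertools import accumulate


def elastic_bursts_sbt(x, thresholds):
    """Shifted-Binary-Tree style filter; returns (set of (t, w) bursts, detailed searches run)."""
    n = len(x)
    prefix = [0] + list(accumulate(x))  # prefix[i] = x[0] + ... + x[i-1]
    bursts = set()
    detailed_searches = 0
    max_w = max(thresholds)
    level = 1
    while True:
        size = 2 ** level            # node length at this level
        step = size // 2             # base and shifted nodes together start every size/2 points
        lo = 1 if level == 1 else size // 4 + 2   # smallest window size covered by this level
        hi = step + 1                             # largest window size covered by this level
        covered = [w for w in thresholds if lo <= w <= hi]
        for start in range(0, n, step) if covered else ():
            end = min(start + size, n)
            node_sum = prefix[end] - prefix[start]
            for w in covered:
                # Monotonicity: a subwindow can only reach f(w) if the whole node does.
                if node_sum >= thresholds[w]:
                    detailed_searches += 1
                    for t in range(start, end - w + 1):
                        if prefix[t + w] - prefix[t] >= thresholds[w]:
                            bursts.add((t, w))
        if hi >= max_w:
            break
        level += 1
    return bursts, detailed_searches


data = [1, 0, 2, 1, 0, 9, 8, 1, 0, 1, 2, 0, 1, 1, 0, 7]
thresholds = {2: 10, 3: 14, 5: 18}
bursts, searches = elastic_bursts_sbt(data, thresholds)
print(sorted(bursts), "found with", searches, "detailed searches")
```

Most nodes fail the cheap node-level test and are skipped; only the nodes around the spike trigger a detailed search, and the set of reported bursts matches what an exhaustive check of every window would return.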

Summary

• Periodicity of Events

• Auto-correlation, Periodograms and their combinations

• Burst Detection Techniques and elastic detection

• Matching of time series

• Euclidean matching

• Dynamic Time Warping

References

• Michail Vlachos, Philip Yu, Vittorio Castelli. “On Periodicity Detection and Structural Periodic Similarity”, 2005.

• Jon Kleinberg. “Bursty and Hierarchical Structure in Streams”.

• “Everything you know about Dynamic Time Warping is wrong”

• Eamonn Keogh

Projects

http://www.l3s.de/~anand/tir14/projects.html

• Temporal and Phrase-based Indexing - Avishek ([email protected])

• Temporal Retrieval Models - Jaspreet ([email protected])

• Temporal Query Autocompletion - Avishek

• Crawling for Temporal Collections - Gerhard ([email protected])

• Temporal Query Suggestions - Helge ([email protected])