
Discrete Signal Processing on Graphs: Frequency Analysis

Aliaksei Sandryhaila, Member, IEEE, and José M. F. Moura, Fellow, IEEE

Copyright (c) 2014 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to [email protected]. This work was supported in part by AFOSR grant FA95501210087. A. Sandryhaila and J. M. F. Moura are with the Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213-3890. Ph: (412) 268-6341; fax: (412) 268-3890. Email: [email protected], [email protected].

Abstract—Signals and datasets that arise in physical and engineering applications, as well as in social, genetics, biomolecular, and many other domains, are becoming increasingly larger and more complex. In contrast to traditional time and image signals, data in these domains are supported by arbitrary graphs. Signal processing on graphs extends concepts and techniques from traditional signal processing to data indexed by generic graphs. This paper studies the concepts of low and high frequencies on graphs, and of low-, high-, and band-pass graph signals and graph filters. In traditional signal processing, these concepts are easily defined because of a natural frequency ordering that has a physical interpretation. For signals residing on graphs, in general, there is no obvious frequency ordering. We propose a definition of total variation for graph signals that naturally leads to a frequency ordering on graphs and defines low-, high-, and band-pass graph signals and filters. We study the design of graph filters with specified frequency response, and illustrate our approach with applications to sensor malfunction detection and data classification.

Keywords: Signal processing on graphs, total variation, low pass, high pass, band pass, filter design, regularization.

I. INTRODUCTION

Signals indexed by graphs arise in many applications, including the analysis of preferences and opinions in social and economic networks [1], [2], [3]; research on collaborative activities, such as paper co-authorship and citations [4]; topics and relevance of documents in the World Wide Web [5], [6]; customer preferences for service providers; measurements from sensor networks; interactions in molecular and gene regulatory networks; and many others.

Signal processing on graphs extends the classical discrete signal processing (DSP) theory for time signals and images [7] to signals indexed by vertices of a graph. There are two basic approaches to signal processing on graphs. The first uses the graph Laplacian as its basic building block (see the recent review [8] and references therein). The second adopts the adjacency matrix of the underlying graph as its fundamental building block [9], [10], [11]. Both frameworks define fundamental signal processing concepts on graphs, but the difference in their foundations leads to different definitions and techniques for signal analysis and processing.

Methods for Laplacian-based graph signal analysis emerged from research on spectral graph theory [12] and manifold discovery and embedding [13], [14]. Implicitly or explicitly, in these works graphs discretize continuous high-dimensional manifolds from R^M: graph vertices sample a manifold and connect to nearest neighbors as determined by their geodesic distances over the underlying manifold. In this setting, the graph Laplacian operator is the discrete counterpart to the continuous Laplace-Beltrami operator on a manifold [12], [15].

This connection is propagated conceptually to Laplacian-based methods for signal processing on graphs. For example, the graph Fourier transform defined and considered in [8], as well as in [16], [17], [18], [19], [20], [21], expands graph signals in the eigenbasis of the graph Laplacian. This parallels the classical Fourier transform, which expands signals into the basis of complex exponentials that are eigenfunctions of the one-dimensional Laplace operator, the negative second-order derivative operator [8]. The frequencies are the eigenvalues of the Laplace operator. Since the operator is symmetric and positive semi-definite, graph frequencies are real-valued and hence totally ordered. So, just as for time signals, the notions of low and high frequencies are easily defined in this model. However, due to the symmetry and positive semi-definiteness of the operator, Laplacian-based methods are only applicable to undirected graphs with real, non-negative weights.

In [9], [10], [11] we take a different route. Our approach is motivated by the algebraic signal processing (ASP) theory introduced in [22], [23], [24], [25], [26]; see also [27], [28], [29], [30], [31] for additional developments.
In ASP, the shift is the elementary non-trivial filter that generates, under an appropriate notion of shift invariance, all linear shift-invariant filters for a given class of signals. The key insight in [9] to build the theory of signal processing on graphs is to identify the shift operator. We adopted the weighted adjacency matrix of the graph as the shift operator and then developed appropriate concepts of z-transform, impulse and frequency response, filtering, convolution, and Fourier transform. In particular, the graph Fourier transform in this framework expands a graph signal into a basis of eigenvectors of the adjacency matrix, and the corresponding spectrum is given by the eigenvalues of the adjacency matrix. This contrasts with the Laplacian-based approach, where the Fourier transform and spectrum are defined by the eigenvectors and eigenvalues of the graph Laplacian.

The association of the graph shift with the adjacency matrix is natural and has multiple intuitive interpretations. The graph shift is an elementary filter, and its output is a graph signal with the value at vertex n given approximately by a weighted linear combination of the input signal values at neighbors of n [9]. With appropriate edge weights, the graph shift can be interpreted as a (minimum mean square) first-order linear predictor [23], [9]. Another interpretation of the graph shift comes from Markov chain theory [32], where the adjacency matrix represents the one-step transition probability matrix of the chain governing its dynamics. Finally, the graph shift can also be seen as a stencil approximation of the first-order derivative on the graph.^1

^1 This analogy is more intuitive to understand if the graph is regular.
We demon- the corresponding interpretation of the graph Laplacian: the strate how these anomalies can be detected using appropriately adjacency matrix is associated with a first-order differen- designed high-pass graph filters, and how unknown parts of tial operator, while the Laplacian, if viewed as a shift, is graph signals can be recovered with appropriately designed associated with a second-order differential operator. In the regularization techniques. In particular, our experiments show one-dimensional case, the eigenfunctions for both, the first that classifiers designed using the graph lead to order and second order differential operators, are complex higher classification accuracy than classifiers based on the exponentials, since graph Laplacian matrices, combinatorial or normalized. Summary of the paper. In Section II, we present the 1 d e2πjft = fe2πjft. (1) notation and review from [9] the basics of discrete signal pro- 2πj dt cessing on graphs (DSPG). In Section III, we define the local Interpreting the Laplacian as a shift introduces an even sym- and total variation for graph signals. In Section IV, we use the metry assumption into the corresponding signal model, and proposed total variation to impose an ordering on frequency for one-dimensional signals [25], this model assumes that components from lowest to highest. In Section V, we discuss the signals are defined on lines of an image (undirected line low-, high-, and band-pass graph filters and their design. In graphs) rather than on a time line (directed line graphs). The Section VI, we illustrate these concepts with applications to use of the adjacency matrix as the graph shift does not impose corrupted measurement detection in sensor networks and data such assumptions, and the corresponding framework can be classification, and provide experimental results for real-world used for arbitrary signals indexed by general graphs, regardless datasets. Finally, Section VII concludes the paper. whether these graphs have undirected or directed edges with real or complex, non-negative or negative weights. II. DISCRETE SIGNAL PROCESSING ON GRAPHS This paper is concerned with defining low and high fre- In this section, we briefly review notation and concepts of quencies and low-, high-, and band-pass graph signals and the DSPG framework that are relevant to this paper. A complete filters on generic graphs. In traditional discrete signal pro- introduction to the theory can be found in [9], [10], [11]. cessing (DSP), these concepts have an intuitive interpretation, since the frequency contents of time series and digital images are described by complex or real sinusoids that oscillate at A. Graph Signals different rates [33]. The oscillation rates provide a physical Signal processing on graphs is concerned with the analysis notion of “low” and “high” frequencies: low-frequency com- and processing of datasets in which data elements can be ponents oscillate less and high-frequency ones oscillate more. connected to each other according to some relational property. However, these concepts do not have a similar interpretation This relation is expressed though a graph G = (V, A), where on graphs, and it is not obvious how to order graph frequencies V = {v0,...,vN−1} is a set of nodes and A is a weighted to describe the low- and high-frequency contents of a graph adjacency matrix of the graph. Each data element corresponds signal. 
Each data element corresponds to node v_n (we also say the data element is indexed by v_n), and each weight A_{n,m} ∈ C of a directed edge from v_m to v_n reflects the degree of relation of the m-th data element to the n-th one. Node v_m is an in-neighbor of v_n, and v_n is an out-neighbor of v_m, if A_{n,m} ≠ 0. All in-neighbors of v_n form its in-neighborhood, and we denote the set of their indices as N_n = {m | A_{n,m} ≠ 0}. If the graph is undirected, the relation goes both ways, A_{n,m} = A_{m,n}, and the nodes are neighbors.

Using this graph, we refer to the dataset as a graph signal, which is defined as a map

s : V → C,  v_n ↦ s_n.    (2)

We assume that each dataset element s_n is a complex number. Since each signal is isomorphic to a complex-valued vector with N elements, we write graph signals as vectors

s = [ s_0  s_1  ...  s_{N−1} ]^T ∈ C^N.

However, we emphasize that each element s_n is indexed by node v_n of a given representation graph G = (V, A), as defined by (2). The space S of graph signals (2) is isomorphic to C^N, and its dimension is dim S = N.

B. Graph Filters

In general, a graph filter is a system H that takes a graph signal s as an input, processes it, and produces another graph signal s̃ = H(s) as an output. A basic non-trivial filter defined on a graph G = (V, A), called the graph shift, is a local operation that replaces a signal value s_n at node v_n with the linear combination of values at the neighbors of node v_n:

s̃_n = \sum_{m ∈ N_n} A_{n,m} s_m.    (3)

Hence, the output of the graph shift is given by the product of the input signal with the adjacency matrix of the graph:

s̃ = [ s̃_0  ...  s̃_{N−1} ]^T = A s.    (4)

The graph shift is the basic building block in DSP_G.
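For readers who want to experiment, a minimal numerical sketch of the graph shift (3)-(4) follows; the 4-node directed graph, its edge weights, and the signal values are hypothetical, chosen only for illustration.

```python
# Minimal sketch of the graph shift (3)-(4); the 4-node directed graph
# and its edge weights are hypothetical, chosen only for illustration.
import numpy as np

# Weighted adjacency matrix A: A[n, m] is the weight of the edge from v_m to v_n.
A = np.array([[0.0, 0.5, 0.0, 1.0],
              [1.0, 0.0, 0.0, 0.0],
              [0.0, 2.0, 0.0, 0.0],
              [0.0, 0.0, 1.5, 0.0]])

# A graph signal: one value per node.
s = np.array([1.0, -2.0, 0.5, 3.0])

# Graph shift (4): each output value is a weighted combination of in-neighbor values (3).
s_shifted = A @ s
print(s_shifted)
```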
All linear, shift-invariant^2 graph filters in DSP_G are polynomials in the adjacency matrix A of the form [9]

h(A) = h_0 I + h_1 A + ... + h_L A^L.    (5)

The output of the filter (5) is the signal

s̃ = H(s) = h(A) s.

Linear, shift-invariant graph filters possess a number of useful properties. They have at most L ≤ N_A taps h_ℓ, where N_A = deg m_A(x) is the degree of the minimal polynomial^3 m_A(x) of A. If a graph filter (5) is invertible, i.e., the matrix h(A) is non-singular, then its inverse is also a graph filter g(A) = h(A)^{−1} on the same graph G = (V, A). Finally, the space of graph filters is an algebra, i.e., a vector space that is simultaneously a ring.

^2 Filters are linear if for a linear combination of inputs they produce the same linear combination of outputs. Filters are shift-invariant if the result of consecutive processing of a signal by multiple graph filters does not depend on the order of processing; i.e., shift-invariant filters commute with each other.

^3 The minimal polynomial of A is the unique monic polynomial of the smallest degree that annihilates A, i.e., m_A(A) = 0 [34], [35].

These properties guarantee that multiplying A by any non-zero constant does not change the set of corresponding linear, shift-invariant graph filters. In particular, we can define the normalized graph shift matrix

A^{norm} = \frac{1}{|λ_{max}|} A,    (6)

where λ_{max} denotes the eigenvalue of A with the largest magnitude, i.e.,

|λ_{max}| ≥ |λ_m|    (7)

for all 0 ≤ m ≤ M − 1. The normalized matrix (6) ensures the numerical stability of computing with graph filters h(A^{norm}), as it prevents excessive scaling of the shifted signal, since ||A^{norm} s|| / ||s|| ≤ 1. In this paper, we use the graph shift A^{norm} instead of A where appropriate.

C. Graph Fourier Transform

In general, a Fourier transform performs the expansion of a signal into a Fourier basis of signals that are invariant to filtering. In the DSP_G framework, a graph Fourier basis corresponds to the Jordan basis of the graph adjacency matrix A (the Jordan decomposition is reviewed in Appendix A). Following the DSP notation, distinct eigenvalues λ_0, λ_1, ..., λ_{M−1} of the adjacency matrix A are called the graph frequencies and form the spectrum of the graph, and the Jordan eigenvectors (see Appendix A) that correspond to a frequency λ_m are called the frequency components corresponding to the m-th frequency. Since multiple eigenvectors can correspond to the same eigenvalue, in general, a graph frequency can have multiple graph frequency components associated with it.

As reviewed in Appendix A, the Jordan eigenvectors form the columns of the matrix V in the Jordan decomposition (47), A = V J V^{−1}. Hence, the graph Fourier transform of a graph signal s is

ŝ = F s,    (8)

where F = V^{−1} is the graph Fourier transform matrix. The values ŝ_n of the signal's graph Fourier transform (8) characterize the frequency content of the signal s.

The inverse graph Fourier transform is given by

s = F^{−1} ŝ = V ŝ.    (9)

It reconstructs the original signal from its frequency contents by constructing a linear combination of frequency components weighted by the signal's Fourier transform coefficients.

D. Frequency Response

The graph Fourier transform (8) also allows us to characterize the effect of a filter on the frequency content of an input signal. As follows from (5) and (8), as well as (47) in Appendix A,

s̃ = h(A) s = F^{−1} h(J) F s   ⇔   F s̃ = h(J) ŝ.    (10)

Hence, the frequency content of the output signal is obtained by multiplying the frequency content of the input signal by the block-diagonal matrix

h(J) = \begin{bmatrix} h(J_{r_{0,0}}(λ_0)) & & \\ & \ddots & \\ & & h(J_{r_{M−1,D_{M−1}−1}}(λ_{M−1})) \end{bmatrix}.

We call this matrix the graph frequency response of the filter h(A), and denote it as

ĥ(A) = h(J).    (11)
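The following sketch illustrates (5)-(11) numerically under the simplifying assumption of a diagonalizable adjacency matrix, so that the Jordan basis reduces to an eigenbasis; the example graph and filter taps are hypothetical.

```python
# Sketch of a polynomial graph filter (5), the graph Fourier transform (8),
# and the frequency response (11), assuming a diagonalizable adjacency matrix
# (the general case uses the Jordan decomposition, see Appendix A).
import numpy as np

def normalized_shift(A):
    """Normalized graph shift (6): A / |lambda_max|."""
    lam = np.linalg.eigvals(A)
    return A / np.abs(lam).max()

def graph_filter(A, taps):
    """h(A) = h_0 I + h_1 A + ... + h_L A^L, see (5)."""
    H = np.zeros_like(A, dtype=complex)
    P = np.eye(A.shape[0], dtype=complex)
    for h in taps:
        H += h * P
        P = P @ A
    return H

A = normalized_shift(np.array([[0, 1, 0, 1],
                               [1, 0, 1, 0],
                               [0, 1, 0, 1],
                               [1, 0, 1, 0]], dtype=float))

lam, V = np.linalg.eig(A)     # graph frequencies and frequency components
F = np.linalg.inv(V)          # graph Fourier transform matrix (8): F = V^{-1}

s = np.array([1.0, 2.0, 3.0, 4.0])
s_hat = F @ s                 # frequency content of s, eq. (8)

taps = [0.5, 0.3, 0.2]        # hypothetical filter coefficients h_0, h_1, h_2
H = graph_filter(A, taps)
h_of_lambda = np.polyval(taps[::-1], lam)   # frequency response h(lambda_m), eq. (11)
```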

Notice that (10) extends the convolution theorem from classical signal processing [7] to graphs, since filtering a signal on a graph is equivalent in the frequency domain to multiplying the signal's spectrum by the frequency response of the filter.
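As a quick sanity check of (10), the sketch below verifies numerically that filtering in the vertex domain matches multiplication by the frequency response in the spectral domain; the shift is the cyclic matrix of Fig. 1 and the filter taps are hypothetical.

```python
# Numerical check of (10) for a diagonalizable shift: filtering in the vertex
# domain equals multiplying the spectrum by the frequency response.
import numpy as np

A = np.array([[0, 1, 0, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [1, 0, 0, 0]], dtype=float)   # a 4-node cyclic shift, cf. Fig. 1

lam, V = np.linalg.eig(A)
F = np.linalg.inv(V)                         # graph Fourier transform matrix

taps = [1.0, -0.5, 0.25]                     # hypothetical h_0, h_1, h_2
H = taps[0] * np.eye(4) + taps[1] * A + taps[2] * A @ A

s = np.random.default_rng(0).standard_normal(4)

lhs = F @ (H @ s)                            # spectrum of the filtered signal
rhs = np.polyval(taps[::-1], lam) * (F @ s)  # frequency response times spectrum
assert np.allclose(lhs, rhs)
```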

Fig. 1. Traditional graph representation for a finite discrete periodic time series of length N.

E. Consistency with the Classical DSP

The DSP_G framework is consistent with the classical DSP theory. Finite (or periodic) time series can be represented by the directed cycle graph shown in Fig. 1; see [24], [9]. The direction of the edges represents the flow of time from past to future, and the edge from the last vertex v_{N−1} to v_0 captures the periodic signal extension s_N = s_0 for time series. The adjacency matrix of the graph in Fig. 1 is the N × N cyclic permutation matrix

A = C = \begin{bmatrix} & & & 1 \\ 1 & & & \\ & \ddots & & \\ & & 1 & \end{bmatrix}.    (12)

Substituting (12) into the graph shift (4) yields the standard time delay

s̃_n = s_{n−1 mod N}.    (13)

The matrix (12) is diagonalizable. Its eigendecomposition, which coincides with the Jordan decomposition (47), is

C = DFT_N^{−1} \begin{bmatrix} e^{−j\frac{2π·0}{N}} & & \\ & \ddots & \\ & & e^{−j\frac{2π·(N−1)}{N}} \end{bmatrix} DFT_N,

where DFT_N is the discrete Fourier transform matrix and DFT_N^{−1} = \frac{1}{N} DFT_N^H. Thus, as expected, the graph Fourier transform for signals indexed by the graph in Fig. 1 is F = DFT_N, and the corresponding frequencies are, for 0 ≤ n < N,

λ_n = e^{−j\frac{2πn}{N}}.    (14)

III. TOTAL VARIATION ON GRAPHS

In classical DSP, the total variation of a finite discrete signal is defined as the cumulative magnitude of differences between consecutive signal samples,

TV(s) = \sum_n | s_n − s_{n−1} |;    (16)

it compares how the signal varies with time or space. These concepts lie at the heart of many applications of DSP, including signal regularization and denoising [33], [36], image compression [37], and others.

The variation (16) compares two consecutive signal samples and calculates a cumulative magnitude of the signal change over time. In terms of the time shift (13), we can say that the total variation compares a signal s to its shifted version: the smaller the difference between the original signal and the shifted one, the lower the signal's variation. Using the cyclic shift matrix (12), we can write (16) as

TV(s) = || s − C s ||_1.    (17)

The total variation (17) measures the difference between the signal samples at each vertex and at its neighbor on the graph that represents finite time series in Fig. 1.

The DSP_G framework generalizes the DSP theory from lines and regular lattices to arbitrary graphs. Hence, we extend (17) to an arbitrary graph G = (V, A) by defining the total variation on a graph as a measure of similarity between a graph signal and its shifted version (4):

Definition 1 (Total Variation on Graphs): The total variation on a graph (TV_G) of a graph signal s is defined as

TV_G(s) = || s − A^{norm} s ||_1.    (18)

The definition uses the normalized adjacency matrix A^{norm} to guarantee that the shifted signal is properly scaled for comparison with the original signal, as discussed in Section II-B.

The intuition behind Definition 1 is supported by the underlying mathematical model. Similarly to the calculus on discrete signals that defines the discretized derivative as ∇_n(s) = s_n − s_{n−1} [33], in DSP_G the derivative (and the gradient) of a graph signal at the n-th vertex is defined by the graph shift, ∇_n(s) = s_n − (A^{norm} s)_n.
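A minimal sketch of Definition 1 follows, using NumPy; the directed cycle of Fig. 1 is used as the example graph, for which (18) reduces to the classical total variation (17) since |λ_max| = 1.

```python
# Sketch of the graph total variation of Definition 1: TV_G(s) = ||s - A_norm s||_1.
import numpy as np

def total_variation(A, s):
    """Total variation (18) of graph signal s on the graph with adjacency matrix A."""
    lam_max = np.abs(np.linalg.eigvals(A)).max()
    A_norm = A / lam_max                      # normalized shift (6)
    return np.abs(s - A_norm @ s).sum()

# Directed cycle (Fig. 1): here |lambda_max| = 1 and A_norm = C, so (18) equals (17).
N = 8
C = np.roll(np.eye(N), 1, axis=0)             # C[n, n-1 mod N] = 1
s = np.sin(2 * np.pi * np.arange(N) / N)
print(total_variation(C, s))
```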
IV. LOW AND HIGH FREQUENCIES ON GRAPHS

Consider a graph frequency λ of the adjacency matrix A with a Jordan chain of generalized eigenvectors v_0, ..., v_{R−1} (see Appendix A), and let the indicators i_0 = 0 and i_r = 1 for 1 ≤ r < R specify whether v_r is a proper eigenvector of A or a generalized one. Then we can write the condition (43) on generalized eigenvectors (see Appendix A) as

A v_r = λ v_r + i_r v_{r−1}.    (22)

Using (22), we write the total variation (18) of the generalized eigenvector v_r as

TV_G(v_r) = || v_r − A^{norm} v_r ||_1    (23)
          = || v_r − \frac{1}{|λ_{max}|} A v_r ||_1
          = || v_r − \frac{λ}{|λ_{max}|} v_r − \frac{i_r}{|λ_{max}|} v_{r−1} ||_1.

In particular, when v_r is a proper eigenvector of A, i.e., r = 0 and v_0 = v, we have i_0 = 0. In this case, it follows from (23) that the total variation of the eigenvector v is

TV_G(v) = \left| 1 − \frac{λ}{|λ_{max}|} \right| \, ||v||_1.    (24)

When a frequency component is a proper eigenvector of the adjacency matrix A, its total variation (24) is determined by the corresponding eigenvalue, since we can scale all eigenvectors to have the same ℓ1-norm. Moreover, all proper eigenvectors corresponding to the same eigenvalue have the same total variation. However, when a frequency component is not a proper eigenvector,^5 we must use (23) to compare its variation with other frequency components. Finally, it follows from (7) for ||v||_1 = 1 that

TV_G(v) = \left| 1 − \frac{λ}{|λ_{max}|} \right| ≤ 1 + \frac{|λ|}{|λ_{max}|} ≤ 2.    (25)

Hence, the total variation of a normalized proper eigenvector is a real number between 0 and 2. In particular, for an eigenvector v_m normalized as ||v_m||_1 = 1,

TV_G(v_m) = \left| 1 − \frac{λ_m}{|λ_{max}|} \right|.    (26)

The following theorem establishes the relative ordering of two frequency components corresponding to distinct real eigenvalues.

Theorem 1: Consider two distinct real eigenvalues λ_m, λ_n ∈ R of the adjacency matrix A with corresponding eigenvectors v_m and v_n normalized as ||v_m||_1 = ||v_n||_1 = 1. If λ_m < λ_n, then

TV_G(v_m) > TV_G(v_n).    (27)

Proof: Since the eigenvalues are real, it follows from (26) that the difference between the total variations of the two eigenvectors satisfies

TV_G(v_m) − TV_G(v_n) = \left| 1 − \frac{λ_m}{|λ_{max}|} \right| − \left| 1 − \frac{λ_n}{|λ_{max}|} \right|
  \overset{(a)}{=} \left( 1 − \frac{λ_m}{|λ_{max}|} \right) − \left( 1 − \frac{λ_n}{|λ_{max}|} \right)
  = \frac{λ_n − λ_m}{|λ_{max}|} > 0,

which yields (27). Here, equality (a) follows from (7).

As follows from Theorem 1, if a graph has a real spectrum and its frequencies are ordered as

λ_0 > λ_1 > ... > λ_{M−1},    (28)

then λ_0 represents the lowest frequency and λ_{M−1} is the highest frequency. Moreover, the ordering (28) is a unique ordering of all frequencies from lowest to highest. This frequency ordering for matrices with real spectra is visualized in Fig. 2(a).

^5 Some real-world datasets are described by graphs with non-diagonalizable adjacency matrices and thus require a proper ordering of frequency components that correspond to generalized eigenvectors. An example of such a dataset is the network of hyperlink references between political blogs used in Section VI-B.

Fig. 3. Frequency ordering for finite discrete periodic time series. Frequencies λ_m and λ_{N−m} have the same total variations since they lie on the same circle centered around 1.

Fig. 2. Frequency ordering from low frequencies (LF) to high frequencies (HF) for graphs with real and complex spectra: (a) ordering of a real spectrum; (b) ordering of a complex spectrum.

The next theorem extends Theorem 1 and establishes the relative ordering of two distinct frequencies corresponding to complex eigenvalues.

Theorem 2: Consider two distinct complex eigenvalues λ_m, λ_n ∈ C of the adjacency matrix A. Let v_m and v_n be the corresponding eigenvectors. The total variations of these eigenvectors satisfy

TV_G(v_m) < TV_G(v_n)    (29)

if the eigenvalue λ_m is located closer to the value |λ_{max}| on the complex plane than the eigenvalue λ_n.

Proof: This result follows immediately from the interpretation of the total variation (24) as a distance function on the complex plane. Since λ_{max} ≠ 0 (otherwise A would have been a nilpotent matrix), multiplying both sides of (29) by |λ_{max}| yields the equivalent inequality

\bigl| |λ_{max}| − λ_m \bigr| < \bigl| |λ_{max}| − λ_n \bigr|.    (30)

The expressions on both sides of (30) are the distances from λ_m and λ_n to |λ_{max}| on the complex plane.

As follows from Theorem 2, frequencies of a graph with a complex spectrum are ordered by their distance from |λ_{max}|. As a result, in contrast to graphs with real spectra, the induced ordering of complex frequencies from lowest to highest is not unique, since distinct complex frequencies can yield the same total variation for their corresponding frequency components. In particular, all eigenvalues lying on a circle of radius ρ centered at the point |λ_{max}| on the complex plane have the same total variation ρ/|λ_{max}|. It also follows from (7) that all graph frequencies λ_m can lie only inside and on the boundary of the circle of radius |λ_{max}| centered at the origin. The frequency ordering for adjacency matrices with complex spectra is visualized in Fig. 2(b).

Consistency with DSP theory. The frequency ordering induced by the total variation (18) is consistent with classical DSP. Recall from (14) that the cycle graph in Fig. 1, which represents finite time series, has the complex spectrum λ_n = e^{−j2πn/N} for 0 ≤ n < N. Hence, the total variation of the n-th frequency component is TV_G(v_n) = | 1 − e^{−j2πn/N} |, so the frequencies λ_m and λ_{N−m} have the same variation, and the induced order from lowest to highest frequencies is λ_0, λ_1, λ_{N−1}, λ_2, λ_{N−2}, ..., with the lowest frequency corresponding to λ_0 = 1 and the highest frequency corresponding to λ_{N/2} = −1 for even N or λ_{(N±1)/2} for odd N. This ordering is visualized in Fig. 3, and it is the conventional frequency ordering in DSP [7].
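The ordering described above can be computed directly from the spectrum. The sketch below sorts the graph frequencies by their distance from |λ_max| on the complex plane (Theorem 2), which for a real spectrum reduces to the ordering (28); a directed cycle is used as the example graph to reproduce the ordering of Fig. 3.

```python
# Sketch of the frequency ordering induced by the total variation: frequencies
# are sorted by their distance from |lambda_max| on the complex plane.
import numpy as np

def order_frequencies(A):
    """Return the graph frequencies of A ordered from lowest to highest."""
    lam = np.linalg.eigvals(A)
    lam_max = np.abs(lam).max()
    return lam[np.argsort(np.abs(lam_max - lam))]

# Directed cycle of length 6: the ordering starts at lambda = 1 (lowest
# frequency) and ends at lambda = -1 (highest frequency), as in Fig. 3.
N = 6
C = np.roll(np.eye(N), 1, axis=0)
print(np.round(order_frequencies(C), 3))
```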

−j 2π n sponds to the eigenvalue λ is λn = e N 1 norm 2 for . Hence, the total variation of the th frequency S2(v) = ||v − A v|| 0 ≤ n

If λ_m < λ_n are two distinct real eigenvalues of A, then it follows from (33) that, for eigenvectors normalized as ||v_m||_2 = ||v_n||_2 = 1,

S_2(v_m) − S_2(v_n) = \frac{1}{2} \left| 1 − \frac{λ_m}{|λ_{max}|} \right|^2 − \frac{1}{2} \left| 1 − \frac{λ_n}{|λ_{max}|} \right|^2
  = \frac{1}{2} \left( 1 − \frac{λ_m}{|λ_{max}|} \right)^2 − \frac{1}{2} \left( 1 − \frac{λ_n}{|λ_{max}|} \right)^2
  = \frac{(λ_n − λ_m)\,(2|λ_{max}| − λ_m − λ_n)}{2 |λ_{max}|^2} > 0.

Hence, S_2(v_m) > S_2(v_n), and we obtain a reformulation of Theorem 1 for the graph shift quadratic form. As a consequence, arranging frequency components in the increasing order of their graph shift quadratic form leads to the same ordering (27) from lowest to highest as the total variation. A similar reformulation of Theorem 2 for the graph shift quadratic form is demonstrated analogously, which leads to the same ordering of complex frequencies as the ordering induced by the total variation.

V. FILTER DESIGN

When a graph signal is processed by a graph filter, its frequency content changes according to the frequency response (11) of the filter. Similarly to classical DSP, we can characterize graph filters as low-, high-, and band-pass filters based on their frequency response.

A. Low-, High-, and Band-pass Graph Filters

Following the DSP convention, we call filters low-pass if they do not significantly affect the frequency content of low-frequency signals but attenuate, i.e., reduce, the magnitude of high-frequency signals. Analogously, high-pass filters pass high-frequency signals while attenuating low-frequency ones, and band-pass filters pass signals with frequency content within a specified frequency band while attenuating all others.

The action of a graph filter h(A) of the form (5) on the frequency content of a graph signal s is completely specified by its frequency response (11). For simplicity of presentation, we discuss here graphs with diagonalizable adjacency matrices, for which (11) is a diagonal matrix with h(λ_m) on the main diagonal, 0 ≤ m < M. As an example, consider an ideal low-pass graph filter h(A) with cutoff frequency λ_cut, and the complementary ideal high-pass filter g(A), which attenuates completely all frequencies λ_m that are lower frequencies than λ_cut. The frequency responses of these filters are defined as

h(λ_m) = 1 − g(λ_m) = \begin{cases} 1, & λ_m > λ_{cut}, \\ 0, & λ_m ≤ λ_{cut}. \end{cases}    (34)

As we demonstrate next, the design of such filters, as well as any low-, high-, and band-pass graph filters, is a linear problem.

B. Frequency Response Design for Graph Filters

A graph filter can be defined through its frequency response h(λ_m) at its distinct frequencies λ_m, m = 0, ..., M−1. Since a graph filter (5) is a polynomial of degree L, the construction of a filter with frequency response h(λ_m) = α_m corresponds to inverse polynomial interpolation, i.e., solving a system of M linear equations with L+1 unknowns h_0, ..., h_L:

h_0 + h_1 λ_0 + ... + h_L λ_0^L = α_0,
h_0 + h_1 λ_1 + ... + h_L λ_1^L = α_1,
  ...
h_0 + h_1 λ_{M−1} + ... + h_L λ_{M−1}^L = α_{M−1}.    (35)

This system can be written as

\begin{bmatrix} 1 & λ_0 & ... & λ_0^L \\ 1 & λ_1 & ... & λ_1^L \\ \vdots & & & \vdots \\ 1 & λ_{M−1} & ... & λ_{M−1}^L \end{bmatrix} \begin{bmatrix} h_0 \\ h_1 \\ \vdots \\ h_L \end{bmatrix} = \begin{bmatrix} α_0 \\ α_1 \\ \vdots \\ α_{M−1} \end{bmatrix}.    (36)

The system matrix in (36) is a full-rank M × (L+1) Vandermonde matrix [34], [35]. Hence, the system has infinitely many exact solutions if M ≤ L and one unique exact solution if M = L+1.

When M > L+1, the system is overdetermined and, in general, does not have an exact solution. This is a frequent case in practice, since the number of coefficients in the graph filter may be restricted by computational efficiency or numerical stability requirements. In this case, we can find an approximate solution, for example, in the least-squares sense.
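The least-squares design just described amounts to solving the Vandermonde system (36) in the least-squares sense. The sketch below does this with NumPy for a hypothetical undirected path graph and an ideal high-pass response of the form (34); the cutoff λ_cut and the filter length are arbitrary choices, not values from the paper.

```python
# Sketch of the least-squares filter design of Section V-B: fit L+1 taps so that
# the polynomial frequency response (11) approximates a desired response alpha_m
# at the graph frequencies, using the Vandermonde system (36).
import numpy as np

def design_taps(freqs, alpha, L):
    """Least-squares solution of (36) for taps h_0, ..., h_L."""
    Phi = np.vander(freqs, L + 1, increasing=True)   # M x (L+1) Vandermonde matrix
    taps, *_ = np.linalg.lstsq(Phi, alpha, rcond=None)
    return taps

# Example: an undirected path graph, normalized shift, ideal high-pass response (34).
N = 12
A = np.diag(np.ones(N - 1), 1) + np.diag(np.ones(N - 1), -1)
lam = np.linalg.eigvalsh(A)
lam = lam / np.abs(lam).max()                        # normalized, real spectrum

lam_cut = 0.0                                        # hypothetical cutoff frequency
alpha = (lam <= lam_cut).astype(float)               # pass high frequencies (small lambda)
taps = design_taps(lam, alpha, L=5)

# Resulting frequency response at the graph frequencies.
print(np.round(np.vander(lam, 6, increasing=True) @ taps, 2))
```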



Fig. 6. The frequency content of the graph signal in Fig. 4. Frequencies are ordered from the lowest to highest.

Fig. 4. Temperature measured by 150 weather stations across the United States on February 1, 2003



As an example, Fig. 5 shows least-squares designs of the low-pass and high-pass filters (34) for the sensor graph in Fig. 4. The frequency response in (35) for the low-pass filter is α_m = 1 for frequencies lower than λ_cut and 0 otherwise; and vice versa for the high-pass filter. By design, the constructed filters satisfy the relation [39]

h(A) = I_N − g(A).    (38)

If we require that h(A) and g(A) do not have the same number of coefficients, or if we use an approximation metric other than least squares, the constructed polynomials will not satisfy (38).

Fig. 5. Frequency responses of low-pass and high-pass filters for the sensor graph in Fig. 4. The length of the filters is restricted to 10 coefficients.

Fig. 7. A subgraph of the sensor graph in Fig. 4 showing the (a) true and (b) corrupted measurement by the sensor located in Colorado Springs, CO.

VI. APPLICATIONS

In this section, we apply the theory discussed in this paper to graphs and datasets that arise in different contexts. We demonstrate how the DSP_G framework extends standard signal processing techniques of band-pass filtering and signal regularization to solve interesting problems in sensor networking and data classification.

A. Malfunction Detection in Sensor Networks

Today, sensors are ubiquitous. They are usually cheap to manufacture and deploy, so networks of sensors are used to measure and monitor a wide range of physical quantities, from the structural integrity of buildings to air pollution. However, the sheer quantity of sensors and the area of their deployment may make it challenging to check that every sensor is operating correctly. As an alternative, it is desirable to detect a malfunctioning sensor solely from the data it generates. We illustrate here how the DSP_G framework can be used to devise a simple solution to this problem.

Many physical quantities represent graph signals with small variation with respect to the graph of sensors. As an illustration, consider the temperature across the United States measured by 150 weather stations located near major cities [38]. An example temperature measurement is shown in Fig. 4, and the construction of the corresponding weather station graph is discussed in Section V-B. The graph Fourier transform of this temperature snapshot is shown in Fig. 6, with frequencies ordered from lowest to highest. Most of the signal's energy is concentrated in the low frequencies. This suggests that the signal varies slowly across the graph, i.e., that cities located close to each other have similar temperatures.

A sensor malfunction may cause an unusual difference between its measurements and the measurements of nearby stations. Fig. 7 shows an example of an (artificially) corrupted measurement, where the station located near Colorado Springs, CO, reports a temperature that contains an error of 20 degrees (the temperature at each sensor is color-coded using the same color scheme as in Fig. 4). The true measurement in Fig. 7(a) is very similar to measurements at neighboring cities, while the corrupted measurement in Fig. 7(b) differs significantly from its neighbors.

Such a difference in temperature at closely located cities results in the increased presence of higher frequencies in the corrupted signal. By high-pass filtering the signal and then thresholding the filtered output, we can detect this anomaly.
However, the simulate a signal corruption by changing the measurement of sheer quantity of sensors and the area of their deployment one sensor by 20 degrees; such an error is reasonably small may make it challenging to check that every sensor is op- and is hard to detect by direct inspection of measurements of erating correctly. As an alternative, it is desirable to detect each station separately. To detect the malfunction, we extract 9

If one or more Fourier transform coefficients exceed the threshold value, we conclude that a sensor is malfunctioning. The cut-off threshold is selected automatically as the maximum absolute value of the graph Fourier transform coefficients of the high-pass filtered measurements from the previous three days.

Results. We considered 365 measurements collected during the year 2003 by all 150 stations, and conducted 150 × 365 = 54,750 tests. The resulting average detection accuracy was 89%, so the proposed approach, despite its relative simplicity, correctly detected a corrupted measurement almost 9 times out of 10.

Fig. 8 illustrates the conducted experiment. It shows the frequency contents of high-pass filtered signals that contain a corrupted measurement from a sensor at five different locations. A comparison with the high-pass component of the uncorrupted signal in Fig. 8(a) shows coefficients above thresholds that lead to the detection of a corrupted signal.

Fig. 8. The magnitudes of spectral coefficients of the original and corrupted temperature measurements after high-pass filtering: (a) the true signal from Fig. 4; (b)-(f) signals obtained from the true signal by corrupting the measurement of a single station located at the indicated city: (b) Colorado Springs, CO; (c) Tampa, FL; (d) Atlantic City, NJ; (e) Reno, NV; (f) Portland, OR.
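The sketch below mimics the detection rule just described (high-pass filtering followed by thresholding of the graph Fourier coefficients) on synthetic stand-ins: the sensor graph, the ideal high-pass response, the "clean" signal used to set the threshold, and the injected error are all hypothetical, not the weather-station data or the 10-tap filter of Fig. 5.

```python
# Sketch of the detection rule: high-pass filter the measurements, then flag
# the signal if any graph Fourier coefficient of the filtered signal exceeds
# a threshold learned from a clean reference signal. All data are synthetic.
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sensor graph: symmetric k-nearest-neighbor graph of random locations.
N, k = 30, 4
pos = rng.random((N, 2))
d = np.linalg.norm(pos[:, None] - pos[None, :], axis=-1)
A = np.zeros((N, N))
for n in range(N):
    for m in np.argsort(d[n])[1:k + 1]:
        A[n, m] = A[m, n] = 1.0
A /= np.abs(np.linalg.eigvals(A)).max()              # normalized shift (6)

lam, V = np.linalg.eig(A)
F = np.linalg.inv(V)

# Hypothetical high-pass filter: keep the half of the spectrum farthest from |lambda_max| = 1.
order = np.argsort(np.abs(1.0 - lam))
response = np.zeros(N)
response[order[N // 2:]] = 1.0
H = (V * response) @ F                               # a filter with this frequency response

def max_coeff(s):
    return np.abs(F @ (H @ s)).max()

smooth = V[:, order[:3]] @ rng.standard_normal(3)    # smooth (low-frequency) signal
threshold = max_coeff(smooth)

corrupted = smooth.copy()
corrupted[0] += 5.0 * np.abs(smooth).max()           # simulated sensor error
print(max_coeff(corrupted) > threshold)              # flags the corrupted measurement
```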

B. Data Classification

Data classification is an important problem in machine learning and data mining [40], [41]. Its objective is to classify each element of the dataset based on specified characteristics of the data. For example, image and video databases may be classified based on their contents; documents are grouped with respect to their topics; and customers may be distinguished based on their shopping preferences. In all cases, each dataset element is assigned a label from a pre-defined group of labels.

Large datasets often cannot be classified manually. In this case, a common approach is to classify only a subset of elements and then use a known structure of the dataset to predict the labels for the remaining elements. A key assumption in this approach is that similar elements tend to be in the same class. In this case, the labels tend to form a signal with small variation over the graph of similarities. Hence, information about similarity between dataset elements provides means for inferring unknown labels from known ones.

Consider a graph G = (V, A) with N vertices that represent N data elements. We assume that two elements are similar to each other if the corresponding vertices are connected; if their connection is directed, the similarity is assumed only in the direction of the edge. We define a signal s^{(known)} on this graph that captures the known labels. For a two-class problem, this signal is defined as

s_n^{(known)} = \begin{cases} +1, & \text{the } n\text{-th element belongs to class 1}, \\ −1, & \text{the } n\text{-th element belongs to class 2}, \\ 0, & \text{the class is unknown}. \end{cases}

The predicted labels for all data elements are found as the signal that varies the least on the graph G = (V, A). That is, we find the predicted labels as the solution to the optimization problem

s^{(predicted)} = \arg\min_{s ∈ R^N} S_2(s)    (39)
subject to  || C s^{(known)} − C s ||_2^2 < ε,    (40)

where C is an N × N diagonal matrix such that

C_{n,n} = \begin{cases} 1, & \text{if } s_n^{(known)} ≠ 0, \\ 0, & \text{otherwise}. \end{cases}

The parameter ε in (40) controls how well the known labels are preserved. Alternatively, the problem (39) with condition (40) can be formulated and solved as

s^{(predicted)} = \arg\min_{s ∈ R^N} S_2(s) + α || C s^{(known)} − C s ||_2^2.    (41)

Here, the parameter α controls the relative importance of the conditions (39) and (40). Once the predicted signal s^{(predicted)} is calculated, the unlabeled data elements are assigned to class 1 if s_n^{(predicted)} > 0 and to the other class otherwise.

In classical DSP, minimization-based approaches to signal recovery and reconstruction are called signal regularization. They have been used for signal denoising, deblurring and recovery [42], [43], [44], [45]. In signal processing on graphs, minimization problems similar to (39) and (41), formulated with the Laplacian quadratic form (see (52) in Appendix B), are used for data classification [46], [47], [41], [48], characterization of graph signal smoothness [18], and recovery [8].
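Setting the gradient of the objective in (41) to zero gives the linear system ((I − A^norm)^H (I − A^norm) + 2αC) s = 2αC s^(known); the sketch below solves it for a small hypothetical two-cluster graph. It is a minimal illustration of the approach, not the experimental code used in the paper.

```python
# Sketch of the regularized label prediction (41):
# minimize S_2(s) + alpha * ||C (s - s_known)||_2^2 via its normal equations.
import numpy as np

def predict_labels(A, s_known, alpha=1.0):
    N = A.shape[0]
    A_norm = A / np.abs(np.linalg.eigvals(A)).max()
    M = (np.eye(N) - A_norm).T @ (np.eye(N) - A_norm)   # from S_2(s) in (31)
    C = np.diag((s_known != 0).astype(float))           # selection matrix of known labels
    s = np.linalg.solve(M + 2 * alpha * C, 2 * alpha * C @ s_known)
    return np.where(s > 0, 1, -1)                        # class assignment rule

# Two loosely connected clusters with one known label each (hypothetical data).
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
s_known = np.array([+1.0, 0, 0, 0, 0, -1.0])
print(predict_labels(A, s_known))
```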
The problems (39) and (41) minimize the graph shift quadratic form (31) and represent an alternative approach to graph signal regularization.

Experiments. We illustrate the application of graph signal regularization to data classification by solving the minimization problem (41) for two datasets. The first dataset is a collection of images of handwritten digits 4 and 9 from the NIST database [49]. Since these digits look quite similar, their automatic recognition is a challenging task. For each digit, we use 1000 grayscale images of size 28 × 28. The graph is constructed by viewing each image as a point in a 28^2 = 784-dimensional vector space, computing Euclidean distances between all images, and connecting each image with its six nearest neighbors by directed edges, so the resulting graph is a directed six-nearest-neighbor graph. We consider unweighted graphs,^6 for which all non-zero edge weights are set to 1.

^6 We have also considered weighted graphs with edge weights set to exp(−d_{n,m}^2), where d_{n,m} is the Euclidean distance between the n-th and m-th images. This is a common way of assigning edge weights for graphs that reflect similarity between objects [50], [51], [16]. Results obtained for these weighted graphs were practically indistinguishable from the results in Fig. 9 and Fig. 10 obtained for unweighted graphs.

The second dataset is a collection of 1224 online political blogs [5]. The blogs can be either "conservative" or "liberal." The dataset is represented by a directed graph with vertices corresponding to blogs and directed edges corresponding to hyperlink references between blogs. For this dataset we also use only the unweighted graph (since we cannot assign a similarity value to a hyperlink).

For both datasets, we consider trade-offs between the two parts of the objective function in (41) ranging from 1 to 100. In particular, for each ratio of known labels 0.5%, 1%, 2%, 3%, 5%, 7%, 10% and 15%, we run experiments for 199 different values of α ∈ {1/100, 1/99, ..., 1/2, 1, 2, ..., 100}, a total of 8 × 199 = 1592 experiments. In each experiment, we calculate the average classification accuracy over 100 runs. After completing all experiments, we select the highest average accuracy for each ratio of known labels.

For comparison, we consider the Laplacian quadratic form (52), and solve the minimization problems

s^{(predicted)} = \arg\min_{s ∈ R^N} s^T L s + α || C (s − s^{(known)}) ||_2^2    (42)

for two definitions of the Laplacian: standard (49) and normalized (50). As before, the values of the parameter α vary between 0.01 and 100. Since the Laplacian can only be used with undirected graphs, we convert the original directed graphs for digits and blogs to undirected graphs by making all edges undirected.

Fig. 9. Classification accuracy of images of handwritten digits 4 and 9 using the graph shift-based regularization and Laplacian-based regularization on weighted and unweighted similarity graphs.

Fig. 10. Classification accuracy of political blogs using the graph shift-based regularization and Laplacian-based regularization on an unweighted graph of hyperlink references.
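A minimal sketch of the similarity-graph construction described above (a directed, unweighted k-nearest-neighbor graph built from Euclidean distances) follows; random vectors stand in for the 784-dimensional digit images, and the edge-direction convention follows the definition A_{n,m} ≠ 0 for an edge from v_m to v_n.

```python
# Sketch of the similarity-graph construction: each element is connected to its
# k nearest neighbors (Euclidean distance) by directed, unit-weight edges.
# Random vectors stand in for the digit images.
import numpy as np

def knn_directed_graph(X, k=6):
    """Unweighted, directed k-nearest-neighbor adjacency matrix from the rows of X."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                    # exclude self-loops
    A = np.zeros((X.shape[0], X.shape[0]))
    for n in range(X.shape[0]):
        A[n, np.argsort(d[n])[:k]] = 1.0           # edges from the k nearest neighbors into v_n
    return A

X = np.random.default_rng(2).standard_normal((50, 784))   # stand-in for 50 images
A = knn_directed_graph(X, k=6)
print(A.sum(axis=1))                                        # each node has 6 in-neighbors
```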
For a fair comparison with the Laplacian-based minimization (42), we also test our approach (41) on the same undirected graphs. These experiments provide an equal testing ground for the two methods. In addition, by comparing results for our approach on directed and undirected graphs, we determine whether the direction of graph edges provides additional valuable information that can improve the classification accuracy.

Results. Average classification accuracies for the handwritten digit images dataset and the blog dataset are shown, respectively, in Fig. 9 and Fig. 10. For both datasets, the total variation minimization approach (41) applied to directed graphs has produced the highest accuracies. This observation demonstrates that using the information about the direction of graph edges improves the accuracy of regularization-based classification.

Furthermore, the approach (41) that uses the graph shift-based regularization significantly outperforms the Laplacian-based approach (42) on undirected graphs for the standard and normalized Laplacian matrices. In particular, for small ratios of known labels (under 5%), the differences in average accuracies can exceed 10% for image recognition and 20% for blog classification.

Discussion. The following example illustrates how classification based on signal regularization works. Fig. 11 shows a subgraph of 40 randomly selected blogs with their mutual hyperlinks. Fig. 11(a) contains the true labels for these blogs obtained from [5], while the labels in Fig. 11(b) are obtained by randomly switching 7 out of 40 labels to the opposite values. The frequency contents of the true and synthetic labels, viewed as signals on this subgraph, are shown in Fig. 12.

Fig. 11. A subgraph of 40 blogs and their labels: blue corresponds to "liberal" blogs and red corresponds to "conservative" ones. Labels in (a) form a smoother graph signal than labels in (b). (a) True labels; (b) synthetic labels.

Fig. 12. Magnitudes of the spectral coefficients for graph signals formed by the true and synthetic labels in Fig. 11.

The true labels form a signal that has more energy concentrated in lower frequencies, i.e., has a smaller variation than the signal formed by the synthetic labels. This observation supports our assumption that the solution to the regularization problem (41) should correspond to a correct label assignment.

Incidentally, this assumption also explains why the maximum classification accuracy achieved in our experiments is 96%, as seen in Fig. 10. We expect that every blog contains more hyperlinks to blogs of its own type than to blogs of the opposite type. However, after inspecting the entire dataset, we discovered that 50 blogs out of 1224, i.e., 4% of the dataset, do not satisfy this assumption.

VII. CONCLUSIONS

We discussed the design of graph filters with a specified frequency response by finding least-squares approximations to solutions of systems of linear algebraic equations. We applied these concepts and methodologies to the analysis and learning of real-world datasets. We studied detection of corrupted data in the dataset of temperature measurements collected by a network of 150 weather stations across the U.S. during one year, and classification of handwritten digit images from the NIST database [49] and hyperlinked documents [5]. The experiments showed that the techniques presented in this paper are promising in problems like detecting sensor malfunctions, graph signal regularization, and classification of partially labeled data.

APPENDIX A: JORDAN DECOMPOSITION

An arbitrary matrix A ∈ C^{N×N} has M ≤ N distinct eigenvalues λ_0, ..., λ_{M−1}. Each eigenvalue λ_m has D_m corresponding eigenvectors v_{m,0}, ..., v_{m,D_m−1} that satisfy

(A − λ_m I_N) v_{m,d} = 0.

Moreover, each eigenvector v_{m,d} can generate a Jordan chain of R_{m,d} ≥ 1 generalized eigenvectors v_{m,d,r}, 0 ≤ r < R_{m,d}, where v_{m,d,0} = v_{m,d}, that satisfy

(A − λ_m I_N) v_{m,d,r} = v_{m,d,r−1}.    (43)

All eigenvectors and corresponding generalized eigenvectors are linearly independent. For each eigenvector v_{m,d} and its Jordan chain of size R_{m,d}, we define a Jordan block of dimensions R_{m,d} × R_{m,d} as

J_{R_{m,d}}(λ_m) = \begin{bmatrix} λ_m & 1 & & \\ & λ_m & \ddots & \\ & & \ddots & 1 \\ & & & λ_m \end{bmatrix} ∈ C^{R_{m,d} × R_{m,d}}.    (44)

By construction, each eigenvalue λ_m is associated with D_m Jordan blocks, each of dimension R_{m,d} × R_{m,d}, where 0 ≤ d < D_m. Next, for each eigenvector v_{m,d}, we collect its Jordan chain into a N × R_{m,d} matrix

V_{m,d} = [ v_{m,d,0}  ...  v_{m,d,R_{m,d}−1} ].    (45)

We concatenate all blocks V_{m,d}, 0 ≤ d < D_m, 0 ≤ m < M, into one N × N matrix V, and collect the corresponding Jordan blocks (44) into one N × N block-diagonal matrix J. This yields the Jordan decomposition

A = V J V^{−1}.    (47)
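For small matrices, the Jordan decomposition (44)-(47) can be computed symbolically; the sketch below uses SymPy on a hypothetical 3 × 3 non-diagonalizable matrix.

```python
# Sketch of the Jordan decomposition (44)-(47) used for the graph Fourier basis,
# computed symbolically with SymPy for a small hypothetical non-diagonalizable shift.
import sympy as sp

A = sp.Matrix([[2, 1, 0],
               [0, 2, 0],
               [0, 0, 3]])        # one Jordan block of size 2 for eigenvalue 2

V, J = A.jordan_form()            # A = V * J * V**-1; columns of V are the Jordan basis
F = V.inv()                       # graph Fourier transform matrix F = V^{-1}, see (8)

assert sp.simplify(V * J * V.inv() - A) == sp.zeros(3, 3)
print(J)                          # block-diagonal matrix of Jordan blocks (44)
```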

APPENDIX B: CONNECTION WITH LAPLACIAN-BASED VARIATION

The Laplacian matrix for an undirected graph G = (V, A) with real, non-negative edge weights A_{n,m} is defined as

L = D − A,    (49)

where D is a diagonal matrix with diagonal elements

D_{n,n} = \sum_{m=0}^{N−1} A_{n,m}.

Alternatively, the (normalized) Laplacian matrix is defined as

L = I_N − D^{−1/2} A D^{−1/2}.    (50)

The Laplacian matrix (49) has real non-negative eigenvalues 0 = β_0 < β_1 ≤ β_2 ≤ ... ≤ β_{N−1} and a complete set of corresponding orthonormal eigenvectors u_n for 0 ≤ n < N.

For a regular graph with vertex degree d, the eigenvalues of the adjacency matrix A and of the Laplacian (49) are related as β_m = d − λ_m. Since the smallest eigenvalue of L is β_0 = 0, we also obtain λ_max = d. The graph shift quadratic form (31) of the eigenvector u_m satisfies

S_2(u_m) = \frac{1}{2} \left\| \left( I − \frac{1}{d} A \right) u_m \right\|_2^2
         = \frac{1}{2} \left| 1 − \frac{λ_m}{d} \right|^2
         = \frac{1}{2d^2} ( d − λ_m )^2
         = \frac{1}{2d^2} β_m^2
         = \frac{1}{2d^2} \bigl( S_2^{(L)}(u_m) \bigr)^2,    (55)

where S_2^{(L)}(s) = s^H L s denotes the Laplacian quadratic form (52).

[19] V. Ekambaram, G. Fanti, B. Ayazifar, and K. Ramchandran, "Multiresolution graph signal processing via circulant structures," in Proc. IEEE DSP/SPE Workshop, 2013.
[20] V. Ekambaram, G. Fanti, B. Ayazifar, and K. Ramchandran, "Circulant structures and graph signal processing," in Proc. IEEE Int. Conf. Image Proc., 2013.
[21] X. Zhu and M. Rabbat, "Approximating signals supported on graphs," in Proc. IEEE Int. Conf. Acoust., Speech and Signal Proc., 2012, pp. 3921–3924.
[22] M. Püschel and J. M. F. Moura, "The algebraic approach to the discrete cosine and sine transforms and their fast algorithms," SIAM J. Comp., vol. 32, no. 5, pp. 1280–1316, 2003.
[23] M. Püschel and J. M. F. Moura, "Algebraic signal processing theory," http://arxiv.org/abs/cs.IT/0612077.
[24] M. Püschel and J. M. F. Moura, "Algebraic signal processing theory: Foundation and 1-D time," IEEE Trans. Signal Proc., vol. 56, no. 8, pp. 3572–3585, 2008.
[25] M. Püschel and J. M. F. Moura, "Algebraic signal processing theory: 1-D space," IEEE Trans. Signal Proc., vol. 56, no. 8, pp. 3586–3599, 2008.
[26] M. Püschel and J. M. F. Moura, "Algebraic signal processing theory: Cooley-Tukey type algorithms for DCTs and DSTs," IEEE Trans. Signal Proc., vol. 56, no. 4, pp. 1502–1521, 2008.
[27] M. Püschel and M. Rötteler, "Fourier transform for the directed quincunx lattice," in Proc. ICASSP, 2005, vol. 4, pp. 401–404.
[28] M. Püschel and M. Rötteler, "Fourier transform for the spatial quincunx lattice," in Proc. ICIP, 2005, vol. 2, pp. 494–497.
[29] M. Püschel and M. Rötteler, "Algebraic signal processing theory: 2-D hexagonal spatial lattice," IEEE Trans. on Image Proc., vol. 16, no. 6, pp. 1506–1521, 2007.
[30] A. Sandryhaila, J. Kovacevic, and M. Püschel, "Algebraic signal processing theory: 1-D nearest-neighbor models," IEEE Trans. on Signal Proc., vol. 60, no. 5, pp. 2247–2259, 2012.
[31] A. Sandryhaila, J. Kovacevic, and M. Püschel, "Algebraic signal processing theory: Cooley-Tukey type algorithms for polynomial transforms based on induction," SIAM J. Matrix Analysis and Appl., vol. 32, no. 2, pp. 364–384, 2011.
[32] A. Papoulis and S. U. Pillai, Probability, Random Variables and Stochastic Processes, McGraw-Hill, 4th edition, 2002.
[33] S. Mallat, A Wavelet Tour of Signal Processing, Academic Press, 3rd edition, 2008.
[34] P. Lancaster and M. Tismenetsky, The Theory of Matrices, Academic Press, 2nd edition, 1985.
[35] F. R. Gantmacher, Matrix Theory, vol. I, Chelsea, 1959.
[36] M. Vetterli and J. Kovačević, Wavelets and Subband Coding, Signal Processing, Prentice Hall, Englewood Cliffs, NJ, 1995.
[37] A. Bovik, Handbook of Image and Video Processing, Academic Press, 2nd edition, 2005.
[38] "National Climatic Data Center," 2011, ftp://ftp.ncdc.noaa.gov/pub/data/gsod.
[39] T. J. Rivlin, An Introduction to the Approximation of Functions, Dover Publications, 1969.
[40] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, Wiley, 2nd edition, 2000.
[41] O. Chapelle, B. Schölkopf, and A. Zien, Semi-Supervised Learning, MIT Press, 2006.
[42] L. I. Rudin, S. Osher, and E. Fatemi, "Nonlinear total variation based noise removal algorithms," Physica D, vol. 60, no. 1–4, pp. 259–268, 1992.
[43] T. F. Chan, S. Osher, and J. Shen, "The digital TV filter and nonlinear denoising," IEEE Trans. Image Proc., vol. 10, no. 2, pp. 231–241, 2001.
[44] J. P. Oliveira, J. M. Bioucas-Dias, and M. A. T. Figueiredo, "Adaptive total variation image deblurring: A majorization-minimization approach," Signal Proc., vol. 89, no. 9, pp. 1683–1693, 2009.
[45] I. W. Selesnick and P. Chen, "Total variation denoising with overlapping group sparsity," in Proc. IEEE Int. Conf. Acoust., Speech and Signal Proc., 2013.
[46] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf, "Learning with local and global consistency," in Proc. Neural Inf. Proc. Syst., 2004, pp. 321–328.
[47] M. Belkin, I. Matveeva, and P. Niyogi, "Regularization and semi-supervised learning on large graphs," in Proc. Conf. Learn. Th., 2004, pp. 624–638.
[48] F. Wang and C. Zhang, "Label propagation through linear neighborhoods," in Proc. Int. Conf. Mach. Learn., 2006, pp. 985–992.
[49] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[50] M. Belkin and P. Niyogi, "Laplacian eigenmaps for dimensionality reduction and data representation," Neural Comp., vol. 15, no. 6, pp. 1373–1396, 2003.
[51] R. R. Coifman and M. Maggioni, "Diffusion wavelets," Appl. Comp. Harm. Anal., vol. 21, no. 1, pp. 53–94, 2006.

Aliaksei Sandryhaila (S'06–M'10) received a B.S. degree in Computer Science from Drexel University, Philadelphia, PA, in 2005, and a Ph.D. degree in Electrical and Computer Engineering from Carnegie Mellon University (CMU), Pittsburgh, PA, in 2010. He is currently a Research Scientist in the Department of Electrical and Computer Engineering at CMU. Previously, he was a Postdoctoral Researcher in the Department of Electrical and Computer Engineering at CMU and a Senior Research Scientist at SpiralGen in Pittsburgh, PA. His research interests include signal processing, large-scale data analysis, machine learning, biomedical signal and image analysis, high-performance computing, and the design and optimization of fast algorithms.

José M. F. Moura (S'71–M'75–SM'90–F'94) is the Philip and Marsha Dowd University Professor at Carnegie Mellon University in the Departments of Electrical and Computer Engineering and Biomedical Engineering (by courtesy). In the academic year 2013-14, he is a Visiting Professor at New York University (NYU) with the Center for Urban Science and Progress (CUSP). He holds an EE degree from Instituto Superior Técnico (IST, Portugal) and MSc, EE, and DSc degrees from the Massachusetts Institute of Technology. He has held visiting professor appointments with MIT and NYU and was on the faculty of IST (Portugal). He is currently ECE Associate Department Head for Research and Strategic Initiatives. He founded and directs the CMU Information and Communication Technologies Institute (ICTI) that manages the CMU/Portugal Program, and co-founded the Center for Sensed Critical Infrastructure Research (CenSIR). With CMU colleagues, he co-founded SpiralGen, which commercializes, under a CMU license, the Spiral technology (www.spiral.net). His interests are in statistical signal and image processing and data science. He holds eleven issued patents and has published over 450 papers. Moura has been an IEEE Board Director, IEEE Division IX Director, member of several IEEE Boards, President of the IEEE Signal Processing Society (SPS), and Editor-in-Chief for the IEEE Transactions on Signal Processing. He has been on the Editorial Boards of several IEEE and ACM journals. He received several awards, including the IEEE Signal Processing Society Technical Achievement Award and the IEEE Signal Processing Society Award for outstanding technical contributions and leadership in signal processing. Moura is a Fellow of the IEEE, a Fellow of AAAS, a corresponding member of the Academy of Sciences of Portugal, and a member of the US National Academy of Engineering.