
IEEE ICC 2015 SAC - Social Networking

Communication Theoretic Prediction on Networked Data

Tzu-Yu Chuang, Jia-Pei Lu, and Kwang-Cheng Chen, Fellow, IEEE
Graduate Institute of Communication Engineering, National Taiwan University, Taipei, Taiwan
Email: {d99942022, r01942050, ckc}@ntu.edu.tw

Abstract—Prediction based on observed data is one of the major purposes of (big) data analytics, and has shown great impact in many applications, including engineering, social science, and medical treatment. Statistical machine learning has been widely adopted to deal with such problems. In this paper, we analogize the relationship among data variables as a sort of generalized social network [1], that is, networked data. Consequently, a direct causal relationship from one data variable to another is equivalent to information transfer over a communication channel, and prediction based on data variables amounts to maximizing the utilization of the information conveyed over communication channels. We therefore introduce the concept of adaptive equalization to data analytics, which allows us to select appropriate data variables and the optimum depth of observations for prediction. We illustrate with financial market data to show the surprisingly good performance of this simple methodology. This result not only indicates a new direction for knowledge discovery and inference in big networked data based on communication theory, but also shows consistency with the newly developed information coupling.

Index Terms—Statistical communication theory, networked data, data analytics, communication channel, equalizer, receiver diversity, knowledge discovery.

I. INTRODUCTION

DATA analysis is an emerging technology involving statistics, optimization, and computer science; one of the most promising approaches to data modeling is networked data, which has a close relationship with social networks and thus communication networks. The concept of networked data, i.e., linking data together to form a network by their mutual relationships, is equivalent to the model of social networks. It has been shown that the general structure of social networks is equivalent to that of communication networks [1]. Therefore, one can expect that signal processing and communication theory can bring more insights and opportunities to networked data analysis.

The key to relating the data model and social/communication networks is the channel. In modeling data, each attribute (feature) is represented by a random variable and embedded into the network as a node. An arbitrary pair of variables (nodes) is linked via their statistical dependency. Furthermore, we can extend each random variable into a stochastic process by considering its evolution and dynamics in time. Such a relationship is consistent with the definition of the communication channel [2], i.e., a sequence of conditional probabilities. Therefore, a correspondence from data analysis to statistical communication theory can be established, and many fundamental limits can be understood and explained from the knowledge of communication networks.

In this paper, we address networked data prediction from the viewpoint of communication theory. A sequence of observed data points (observations) is given, and the objective is to predict a future outcome of a related variable (target). This task can be regarded as information transfer over a communication channel, with the target and the observations taken as the information transmitter and the information receiver, respectively. The information transfer surely suffers noisy distortion, and it is well known that equalization can optimize the reception. Therefore, we are motivated to explore prediction optimized by adaptive equalization, which is similar to linear estimation but allows adaptation to time-varying circumstances. Moreover, multi-source prediction is equivalent to single-input multiple-output (SIMO) communication, in which receiver diversity techniques can be directly applied. The idea proposed in this paper is verified by an experiment on the prediction of stock prices in the financial market. The result not only motivates a novel viewpoint on knowledge discovery [3], but also shows consistency with information coupling from the frontier of information theory [4].

There is a large body of literature touching on networked data analysis from different aspects. The idea of information transfer can be traced back to [5] and related works on the analysis of time series. The authors in [6] proposed using compressed sensing to process relational data. Besides, constrained learning and estimation in networks were addressed in [7], and decision and inference on networked data via local network topology were considered in [8]. There are also recent works discussing signal processing over graphs for big data [9] and knowledge discovery in databases [10]. On the shoulders of these excellent studies of networked data, we first relate data analysis with social and communication networks and provide a novel insight into data processing.

In this paper, we model general data analysis by the communication channel first, and then the fundamentals of linear predictors are provided through information transfer. The techniques of receiver diversity are introduced later to solve the multi-source prediction. We also present a selection criterion based on the information transfer. Finally, an experiment on stock price prediction is demonstrated and discussed.

This research was supported in part by the US Air Force under grant AOARD-14-4053 and by the Ministry of Science and Technology under grant MOST-103-2221-E-002-022-MY3.

978-1-4673-6432-4/15/$31.00 ©2015 IEEE


II. PROBLEM FORMULATION

A novel explanation of data prediction from the viewpoint of communication theory is provided in this section. Our work is based on the observation that large datasets have a distributional basis; i.e., there exists an underlying (sometimes implicit) statistical model for the data. We formulate data prediction as follows.

A. Model for Data

Our proposed model focuses on data generated from a large system composed of many interacting units. Each unit is capable of generating data continuously or sporadically over time. Therefore, the dataset D generated by such a system is modeled by a family of stochastic processes

    D ≜ { {X_{(i),t}}_{t∈I} : i = 1, 2, ... },    (1)

where I is a proper index set corresponding to the scenario of interest. In this paper, we only consider the case of discrete time; i.e., I = ℕ. We write {X_{(i),t}} for {X_{(i),t}}_{t∈ℕ} for simplicity. Furthermore, we use [X_i]_a^b = [X_{(i),a}, X_{(i),a+1}, ..., X_{(i),b}] to denote a sub-sequence of the stochastic process {X_{(i),t}}. In the following, we start from a simplified case to give a glimpse of communication theoretic prediction of networked data.

Fig. 1. A general model of prediction. The target is a single information transmitter Y_n and the observations are the multiple information receivers [X_i]_{n−L_i}^{n−1}. The statistical dependency of each pair is represented by the channel, i.e., the conditional probability P{[X_i]_{n−L_i}^{n−1} | Y_n}. This is also the general model of SIMO communications.

B. Data Analysis via Communication Channels

Consider D = {{X_t}, {Y_t}} only. Given a time instance n, the problem of prediction is stated as follows:

Problem 1. Suppose we have two random sequences, {X_t} and {Y_t}. We observe {X_t} over some set of times n−L ≤ t ≤ n−1, and we wish to estimate Y_n from these observations.

The relationship between Y_n and [X]_{n−L}^{n−1} is linked by the conditional probability P{[X]_{n−L}^{n−1} | Y_n}. This dependency can be considered as a channel, which links the information transmitter, Y_n, and the information receiver, [X]_{n−L}^{n−1}. Therefore, data prediction is a generalized signal detection upon receiving a realization of [X]_{n−L}^{n−1}. The optimal predictor of Y_n, Ŷ_n([X]_{n−L}^{n−1}), can be derived by minimizing a measure of error. The most common one is in the mean-square sense:

    E{ [ Y_n − Ŷ_n([X]_{n−L}^{n−1}) ]² }.    (2)

With this model, one can immediately identify that Problem 1 is equivalent to signal detection over a highly distorted channel, such as a frequency selective channel. The receiver has access to the information transmitted by Y_n up to the depth (equivalently, the delay in communication) of observation L, i.e., {X}_{n−L}^{n−1}. The goal of signal detection is to find an optimum estimate of Y_n based on the reception of [X]_{n−L}^{n−1}.

It is well known that the minimum value of (2), referred to as the minimum mean-square error or MMSE, is achieved by the conditional mean estimator:

    Ŷ_n([X]_{n−L}^{n−1}) = E{ Y_n | [X]_{n−L}^{n−1} }.    (3)

Upon the observation of the receiver [X]_{n−L}^{n−1}, one would like to infer the information-bearing transmitter Y_n. The mutual information between [X]_{n−L}^{n−1} and Y_n is

    I_Y(X) ≜ I([X]_{n−L}^{n−1}; Y_n) = E{ log [ p_{Y_n|[X]_{n−L}^{n−1}}(Y_n | [X]_{n−L}^{n−1}) / p_{Y_n}(Y_n) ] }.

Equivalently, we call this mutual information the information transfer to emphasize the idea of modeling data prediction as communication of information over the channel.

More generally, we can consider data prediction with more than one source; that is, the prediction of Y_n with the observation set { [X_i]_{n−L_i}^{n−1} : i = 1, 2, ..., M }. Since each pair of random sequences ({X_i}, {Y}) forms a channel, a signal detection problem with single input and multiple outputs, or SIMO, can be formulated (see Fig. 1). Under this network, the total information transfer is measured by

    I_Y(X_1, X_2, ..., X_M) ≜ I( [X_1]_{n−L_1}^{n−1}, [X_2]_{n−L_2}^{n−1}, ..., [X_M]_{n−L_M}^{n−1} ; Y_n ).    (4)

The following theorem is useful when we need to measure the information transfer over some specific channels.

Theorem 1. Let the target and the data variables be of the linear form, that is, [X]_{n−L}^{n−1} = h Y_n + N_n, where N_n ∼ N(0, Σ_N). For any Y_n independent of N_n with Var{Y_n} ≤ σ², we have

    I([X]_{n−L}^{n−1}; Y_n) ≤ (1/2) log( 1 + σ² hᵀ Σ_N^{−1} h ).    (5)

Proof: Let Y_G ∼ N(0, σ²) and X_G = h Y_G + N_n. It is straightforward to show that

    I([X]_{n−L}^{n−1}; Y_n) = D( P_{[X]_{n−L}^{n−1}|Y_n} ‖ P_{X_G} | P_{Y_n} ) − D( P_{Y_n} ‖ P_{Y_G} )
                            = E{ ı_{Y_G;X_G}(Y, hY + N_n) } − D( P_{Y_n} ‖ P_{Y_G} )    (6)
                            ≤ (1/2) log( 1 + σ² hᵀ Σ_N^{−1} h ),

where D(·‖·) is the relative entropy and ı_{Y_G;X_G} is the information density of the Gaussian pair (Y_G, X_G). The equality holds when Y ∼ N(0, σ²).

In the following section, we will discuss a practical realization of data prediction based on communication theory.
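To make the bound of Theorem 1 concrete, the short sketch below evaluates the right-hand side of (5) numerically (in nats). The function name and the example numbers are our own illustration, not part of the paper.

```python
import numpy as np

def info_transfer_bound(h, Sigma_N, var_Y):
    """Right-hand side of (5): for the linear channel [X] = h*Y + N with
    N ~ N(0, Sigma_N), the information transfer is at most
    0.5 * log(1 + var_Y * h^T Sigma_N^{-1} h) nats when Var{Y} <= var_Y."""
    h = np.asarray(h, dtype=float)
    snr = var_Y * float(h @ np.linalg.solve(Sigma_N, h))  # sigma^2 h^T Sigma_N^{-1} h
    return 0.5 * np.log1p(snr)

# Two observation taps, independent unit-variance noise, Var{Y} <= 3:
bound = info_transfer_bound(h=[1.0, 0.0], Sigma_N=np.eye(2), var_Y=3.0)
# Here snr = 3, so the bound is 0.5*ln(4) = ln 2, about 0.693 nats.
```

As the theorem suggests, enlarging the variance budget of the target (or improving the channel gain) can only increase this upper bound on the extractable information.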



We consider the single channel case first, and then give an extension to the prediction with multiple channels.

III. DATA ANALYSIS OVER ONE CHANNEL

In order to give a computable estimate, we have to consider a constrained family of estimates. In the following, the linear estimate will be introduced, and then the selection of the depth of observation will be discussed.

A. Equalization for Data Prediction

Fig. 2. Communication-inspired equalizer for data prediction. The equalizer coefficients {w_{n,t}} can be adaptively adjusted to fit dynamic signals.

In Problem 1, of course, the optimum estimator (in the MMSE sense) is the conditional mean, Ŷ_n = E{Y_n | [X]_{n−L}^{n−1}}. However, the computation of such an estimator can be quite cumbersome unless the problem exhibits special structure. Furthermore, the determination of the conditional mean generally requires knowledge of the joint distribution of Y_n, X_{n−L}, ..., X_{n−1}, which may be impractical (or impossible) to obtain in practice.

One way of circumventing these problems is to constrain the estimate to some computationally convenient form, and then to minimize the MSE over this constrained class. The linear constraint serves this purpose, in which we consider estimates Ŷ_n of the form

    Ŷ_n = ∑_{t=n−L}^{n−1} w_{n,t} X_t + c_n,    (7)

where {w_{n,t}}_{t=n−L}^{n−1} and c_n are scalars. This setting is equivalent to the MMSE equalizer in communication theory [11]; that is, the coefficients {w_{n,t}} are chosen to minimize (2). As such, finding the optimal filter coefficients {w_{n,t}} becomes a standard problem in linear estimation. In fact, this is a standard Wiener filtering problem, whose solution is provided by the following proposition [11]:

Proposition 1. Ŷ_n minimizes (2) if and only if

    E{Ŷ_n} = E{Y_n}    (8)

and

    E{ (Y_n − Ŷ_n) X_l } = 0, ∀ n−L ≤ l ≤ n−1.    (9)

Proof: Please refer to [11].

Proposition 1 gives conditions that are necessary and sufficient for the set of coefficients {w_{n,t}} and c_n to yield an optimum linear estimator of Y_n from [X]_{n−L}^{n−1}. An optimal linear estimate will always be of the form

    Ŷ_n = E{Y_n} + ∑_{t=n−L}^{n−1} w_{n,t} ( X_t − E{X_t} )    (10)

and

    σ_YX(n) = Σ_X w_n,    (11)

where σ_YX(n) ≜ [cov(Y_n, X_{n−L}), ..., cov(Y_n, X_{n−1})]ᵀ and Σ_X is the covariance matrix of the vector (X_{n−L}, ..., X_{n−1})ᵀ. Equation (11) is known as the Wiener-Hopf equation. Assuming that Σ_X is positive definite, we see that the optimum estimator coefficients are given by

    w_n = Σ_X^{−1} σ_YX(n).    (12)

The realization of this equalizer is depicted in Fig. 2. We can further implement the adaptive equalizer, i.e., adjustable equalizer coefficients, to better fit a dynamic environment. The detail is omitted in this paper and can be found in [12].

B. Optimum Selection of Tap Numbers

There is an important difference between actual equalizers in communications and our communication-inspired equalizer for data prediction: the depth (delays) of observations. In a communication system, the tap delay is determined by the nature of the fading channel itself, while the depth of observations in data prediction should be determined under some optimality constraints. Furthermore, data usually evolve with circumstance and do not retain consistent statistical properties for too long. Therefore, it is of importance to consider a prediction based on a proper depth of observations to avoid either a heavy computational load or statistical overfitting.

Generally, a larger L suggests more information transfer over the channel. However, since

    I_Y(X) = I(X_{n−1}; Y) + ∑_{i=2}^{L} I(X_{n−i}; Y | X_{n−i+1}^{n−1}),    (13)

it is believed that the latter terms are relatively small due to the weak correlation between X_{n−i} | X_{n−i+1}^{n−1} and Y_n | X_{n−i+1}^{n−1}. An intuitive selection is constrained on the increase of information transfer. That is,

    L* = min_{L∈ℕ} { L : I(X_{n−L}; Y | X_{n−L+1}^{n−1}) < ε },    (14)

where ε is a predefined threshold. However, this increase of information transfer is not always computable. Instead, we constrain on the maximum increase of information transfer:

    L* = min_{L∈ℕ} { L : max I(X_{n−L}; Y | X_{n−L+1}^{n−1}) < ε' }
       = min_{L∈ℕ} { L : (1/2) log( 1 + σ²_{Y|X_{n−L+1}^{n−1}} / σ_n² ) < ε' },    (15)

where the analytical form is derived under the linear model with the power spectral density of the noise σ_n² and the conditional variance σ²_{Y|X_{n−L+1}^{n−1}}. This is a direct result of Theorem 1; we omit the derivation here.
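The Wiener-Hopf solution (11)-(12) and the linear estimate (10) can be sketched in a few lines of numerical code. The sample-covariance estimation, the function name, and the synthetic series below are our own illustration under the linear model; the paper itself does not prescribe an implementation.

```python
import numpy as np

def wiener_predictor(x, y, L):
    """Depth-L linear predictor of y[n] from [x[n-L], ..., x[n-1]].
    Solves the Wiener-Hopf equation sigma_YX = Sigma_X w (eqs. (11)-(12))
    with sample covariances, and returns the predictor of eq. (10)."""
    X = np.array([x[t - L:t] for t in range(L, len(y))])  # rows: observation windows
    Y = np.asarray(y[L:], dtype=float)
    mx, my = X.mean(axis=0), Y.mean()
    Xc, Yc = X - mx, Y - my
    Sigma_X = Xc.T @ Xc / len(Y)            # sample covariance of (X_{n-L},...,X_{n-1})
    sigma_YX = Xc.T @ Yc / len(Y)           # sample cross-covariances cov(Y_n, X_t)
    w = np.linalg.solve(Sigma_X, sigma_YX)  # w_n = Sigma_X^{-1} sigma_YX(n), eq. (12)
    def predict(window):                    # eq. (10)
        return my + w @ (np.asarray(window, dtype=float) - mx)
    return w, predict

# Synthetic check: y[n] = 0.8*x[n-1] + small noise should give w near [0, 0.8].
rng = np.random.default_rng(0)
x = rng.normal(size=2000)
y = np.zeros(2000)
y[1:] = 0.8 * x[:-1] + 0.01 * rng.normal(size=1999)
w, predict = wiener_predictor(x, y, L=2)
```

On this toy channel the recovered taps concentrate on the single informative delay, which is exactly the behavior the tap-number selection of (14)-(15) tries to exploit.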



Fig. 3. The illustration of information diversity and combining. One can either combine different estimates of y_n from different sources to form a new estimate, or simply combine all observations via proper selections.

Another way to deal with the selection of depth is regularization, which penalizes models with an extreme number of parameters. The original problem (2) can be equivalently represented as follows:

    w_n = arg min_{w∈ℝ^L} E{ ( Y_n − wᵀ [X]_{n−L}^{n−1} )² }.    (16)

With regularization, the objective in (16) to be minimized becomes

    E{ ( Y_n − wᵀ [X]_{n−L}^{n−1} )² } + λ f(w),    (17)

where f(·) is a measure of the model complexity of w. Many types of complexity measure have been shown to have good and consistent properties, such as the Akaike information criterion (AIC) and the minimum description length (MDL) in information theory [13], or the LASSO and ridge regression in statistics [14].

Remark. From the viewpoint of communication, we do not wish L to be too large, because a large L implies heavy inter-symbol interference (ISI) in the channel. We refer to the problem of selecting the depth of observation as the mitigation of the cross-interference of information. It is potentially advantageous over many conventional methods.

IV. DATA PREDICTION OVER MULTIPLE CHANNELS

Consider a process {Y_t} to be predicted. Suppose we have M sets of data variables that can help us in the prediction; that is, D = {{X_{1,t}}, {X_{2,t}}, ..., {X_{M,t}}}. There exists a channel between each pair of data variable and target, i.e., ({Y_t}, {X_{m,t}}). The prediction of Y_n with multiple sources is equivalent to signal detection in a SIMO environment. Thus, the techniques of receiver diversity can be directly applied to data prediction over multiple channels to obtain information diversity.

A. Information Diversity and Combining

So far, we have established the basic framework for single-pair data prediction by the equalizer. This idea can be easily extended to the case of multiple information sources. Consider M sets of data, { [X_m]_{t=n−L_m}^{n−1} }_{m=1}^{M}. We aim to find an estimate of Y_n based on the realization of these data. This problem is equivalent to signal detection in SIMO, and the techniques of receiver diversity can be directly applied. One can construct an estimate of Y_n based on each set [X_m]_{n−L_m}^{n−1} respectively, and then combine these estimates to form a new estimate of Y_n. Such combination can be done in several ways, all of which depend on the measure of prediction performance.

The adoption of linear prediction is based on the belief that the signal model is

    [X]_{n−L}^{n−1} = h_n Y_n + N_n,    (18)

where N_n ∼ N(0, σ_n²) at a particular time n. While the channel coefficients h_n are unknown, we can use an equalizer and training sequences to find the optimal equalizer coefficients, {w_{n,t}}, to detect the signal. We also believe that the power spectrum of the embedded error in each channel is time-varying. Therefore, at every time instance, each equalizer output has a different and time-varying performance, and the combination can leverage this property to obtain a better predictor of Y_n.

The performance of each branch can be measured in terms of the prediction error at every time instance; that is,

    γ_{m,n} = ( y_n − ŷ_{m,n} )²,    (19)

which is an unbiased estimator of the power spectral density of the error, i.e., σ_n². It is also of interest to develop other measures of per-branch performance, but this is beyond the scope of this paper. In the following subsections, we introduce the two most common combining schemes: selection combining (SC) and maximal-ratio combining (MRC).

B. Selection Combining

In selection combining (SC), the combiner outputs the estimate on the branch with the highest performance:

    Ŷ_n = Ŷ_{m*,n},    (20)

where m* = arg min_m γ_{m,n−1}, m ∈ {1, ..., M}, i.e., the branch with the smallest previous prediction error. The performance of such combination relies on the belief that the channel condition exhibits strong time dependency.

C. Maximal-Ratio Combining

Since we believe that the error process is time-varying, it is somewhat risky and wasteful to select only one branch and abandon the other estimates at each time. An alternative is to choose a convex combination of the branches to form a new estimate; that is, Ŷ_n = ∑_{m=1}^{M} α_{m,n} Ŷ_{m,n}. The coefficients {α_{m,n}} can be determined as follows:

    α_{m,n} = γ_{m,n−1}^{−1} / ∑_{i=1}^{M} γ_{i,n−1}^{−1}.    (21)

A special case of MRC is equal-gain combining (EGC), which assumes that all channels suffer from noises with the same power spectral density, so that the combining coefficients are exactly the same. Generally, this situation is rare; however, EGC is still of some value because it works as a robust combining scheme. In the situation where the channel changes faster than we expect, and the estimate of the noise density is believed to be inaccurate, EGC is still valid because it is the average of all the estimates we have.
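The three combining rules of this section, SC in (20), MRC in (21), and EGC, reduce to a few lines once the branch estimates and the previous-step errors γ_{m,n−1} of (19) are available. The function below is our own sketch; the names and numbers are illustrative, not the paper's implementation.

```python
import numpy as np

def combine(estimates, gamma_prev, scheme="MRC"):
    """Combine branch estimates yhat_{m,n} using the previous squared
    errors gamma_{m,n-1} (eq. (19)) as per-branch performance.
    SC (eq. (20)): output the branch with the smallest past error.
    MRC (eq. (21)): convex weights proportional to inverse past errors.
    EGC: equal weights (robust special case of MRC).
    Assumes all gamma_prev entries are strictly positive."""
    estimates = np.asarray(estimates, dtype=float)
    gamma_prev = np.asarray(gamma_prev, dtype=float)
    if scheme == "SC":
        return float(estimates[np.argmin(gamma_prev)])
    if scheme == "MRC":
        alpha = (1.0 / gamma_prev) / np.sum(1.0 / gamma_prev)  # eq. (21)
        return float(alpha @ estimates)
    return float(estimates.mean())  # EGC

yhats = [101.0, 103.0, 99.0]  # three branch predictions of Y_n
gammas = [4.0, 1.0, 4.0]      # previous squared errors per branch
# SC picks the middle branch (smallest gamma); MRC weights are [1/6, 4/6, 1/6].
```

Note that the MRC weights are convex by construction, so when all branches perform equally the rule degenerates to EGC, consistent with the discussion above.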



D. Equalizer Combining

The equalizer approach to prediction and the diversity combining can all be regarded as types of linear prediction. Besides, unlike a communication system, which is limited by its antennas, there is no restriction that we have to separate the observations of different sources. The most straightforward way to find a linear predictor of Y_n is to construct an estimate of the following form:

    Ŷ_n = ∑_{j=1}^{M} ∑_{t=n−L_j}^{n−1} w_{j,t} X_{j,t}.    (22)

In such a case, we allow different numbers of taps for different sources to guarantee that each channel has the least ISI. This is also a tradeoff between prediction performance and computational complexity. The selection of the numbers of taps can also be established by either the minimization of conditional information transfer or the regularization method.

Generally speaking, this combining should dominate all methods discussed above, because it is consistent with the most general MMSE estimate of Y_n. However, the price to pay is the computational complexity. The number of candidate models grows exponentially with the number of sources, while the number of coefficients we have to compute grows with the product of tap numbers, ∏_{i=1}^{M} L_i. This fact leads to the computation of cumbersome auto-correlations and cross-correlations among all sources and the target variables.

V. EXPERIMENT AND DISCUSSIONS

A. Experiment

In the financial market, it is of particular interest to understand the mechanism among international stock prices and the exchange rates of foreign currencies. In this experiment, we select the NASDAQ Composite index (NASDAQ) and the stock price of Morgan Stanley (MS) from the U.S. stock market to predict the stock prices of the Taiwan Semiconductor Manufacturing Company (TSMC) and the Hon Hai/Foxconn Technology Group (FOXCONN) from Taiwan, respectively. Also, the exchange rate of the New Taiwan Dollar to the U.S. Dollar (ExR) is included in the prediction. The period we sampled is from Jan. 1, 2009 to Aug. 14, 2014, including 1,382 samples of stock prices and exchange rates in total.

We predict the target stock prices (TSMC and FOXCONN) by the other data, respectively. Both the prediction over one channel and over multiple channels are conducted. The selection of the depth of observations is done both by the Akaike information criterion (AIC) and by the information transfer (Eq. 15); both selections show consistent results. One of our results is visualized in Fig. 4. One can see that, even with one-channel prediction, the accuracy of the equalizer-based approach is satisfactory. The overall prediction performance is shown in Table I and Table II. In the prediction of FOXCONN, we only use NASDAQ and ExR, while we adopt NASDAQ, ExR, and MS to predict TSMC. Several observations and insights can be found in this experiment; we discuss them in the next subsection.

Fig. 4. The prediction of the TSMC price based on Morgan Stanley (MS) with the equalizer approach (prices roughly 95-140 over Jan.-Sep. 2014; the curves for TSMC and its prediction nearly coincide). The depth of observations is 1.

TABLE I
DATA PREDICTION OF FOXCONN PRICES

FOXCONN Data Prediction over One Channel
  Source  | Normalized MSE^a | Depth of Obs.
  ExR     | 0.99             | 2
  NASDAQ  | 1.00             | 2

FOXCONN Data Prediction over Multi-Channel
  Combining | Normalized MSE | Depth of Obs.^b
  EGC       | 0.99           | [2, 2]
  SC        | 0.98           | [2, 2]
  MRC       | 0.97           | [2, 2]
  Eq-MR     | 0.93           | [2, 1]

^a The normalized mean square error for each method. The reference MSE is 2.8306, the MSE of predicting FOXCONN by NASDAQ.
^b The number of taps corresponding to ExR and NASDAQ, respectively.

TABLE II
DATA PREDICTION OF TSMC PRICES

TSMC Prediction over One Channel
  Source  | Normalized MSE^a | Depth of Obs.
  ExR     | 0.99             | 2
  NASDAQ  | 1.00             | 2
  MS      | 0.83             | 1

TSMC Data Prediction over Multi-Channel
  Combining | Normalized MSE | Depth of Obs.^b
  EGC       | 0.85           | [2, 2, 1]
  SC        | 0.86           | [2, 2, 1]
  MRC       | 0.84           | [2, 2, 1]
  Eq-MR     | 0.83           | [0, 0, 1]

^a The normalized mean square error for each method. The reference MSE is 2.9409, the MSE of predicting TSMC by NASDAQ.
^b The number of taps corresponding to ExR, NASDAQ, and MS, respectively.

B. Discussion

The stock prices we select are believed to have dependency based on our domain knowledge, but the exact influence may be too complicated for us to explain the mechanism among them by a proper financial model. For example, we
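The normalized MSE reported in Tables I and II is the MSE of a candidate predictor divided by that of a fixed reference predictor (prediction by NASDAQ, per the table footnotes). A minimal sketch of the metric, with made-up numbers rather than the paper's data:

```python
import numpy as np

def normalized_mse(y_true, y_pred, reference_mse):
    """MSE of a candidate predictor divided by the MSE of the
    reference predictor, as in the footnotes of Tables I and II."""
    err = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(err ** 2)) / reference_mse

# Illustrative values only (not the paper's series):
nmse = normalized_mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0], reference_mse=2.0)
# candidate MSE = 4/3, reference MSE = 2.0, so nmse = 2/3
```

Values below 1.0 thus indicate an improvement over the reference predictor, which is how the table entries should be read.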


believe that ExR has a strong impact on TSMC and that the observation of ExR can help us predict TSMC, but we do not know the exact relation. This problem is referred to as causal inference in time series [15]. Furthermore, the determination of the causal relationship and the model of prediction belong to the research field of knowledge discovery in machine learning [3].

In our experiment, all three data variables (NASDAQ, ExR, and MS) work well in the accuracy of the predictions of the targets (TSMC and FOXCONN). However, there are several interesting observations worth discussing.

• As expected, equal-gain combining (EGC) is a robust strategy of information combining. Its performance is simply the average of the respective predictions. Generally, maximal-ratio combining (MRC) outperforms selection combining (SC) and equal-gain combining (EGC).
• Selection combining (SC) is not as good as we expected in the prediction of TSMC. The reason is that, in the single-source test, we discover that the prediction of the stock price does not rely on a long length of historical data. This fact indicates that the channel condition changes very rapidly, such that little dependency can be found between the time series. In addition, the prediction based on MS outperforms all other sources, which makes selection combining meaningless. Likewise, MRC does not perform well when there exists a strong information transfer (TSMC-MS).
• The equalizer combining is competitive with all the other combining schemes. The selection of the depth of observations is equivalent to model selection in knowledge discovery. However, in our experience, its implementation complexity and computational cost are heavier than those of all the other combining schemes.

In this experiment, we believe that, given the information transferred to MS, the other information in NASDAQ and ExR is not necessary for the prediction. Too much information included in the prediction only creates ISI and degrades the performance. This is a straightforward result, since the information transfer (15) conditioned on MS to all other sources is close to zero.

The above fact implies that the information transfer can serve as an explanation, even a criterion, for knowledge discovery. Not only can the condition of sufficient information in data prediction be derived via information theory, but the criterion can also be justified by the experiments. Furthermore, this viewpoint is rationalized by information coupling [4], which studies the geometric structure on the space of probability distributions and finds a similarity measure for probability measures. Also, it should be noted that the equalization of time series is equivalent to Kalman filtering with a deterministic model in a more general setting [16], which is one of the standard approaches in the analysis of time series. This work not only provides a new methodology for data analytics, but also gives explanations for many useful and well-known techniques in data analysis.

VI. CONCLUSION

In this paper, we present a totally new understanding of data prediction from the viewpoint of communication theory and social networks. A prediction process is identified as a process of information transfer, which can be quantified by mutual information and modeled by the communication channel. Based on this idea, many techniques, such as the equalizer, receiver diversity, and combining, are directly applied to predict data. An experiment based on stock prices and exchange rates is conducted. Many interesting observations can be found in these experiments, including the relation of our idea to the most advanced research in machine learning and information theory. Of course, more in-depth studies of the communication-inspired techniques of data analysis, such as combining and information transfer, remain open, with more opportunities in statistical data processing and learning, where we wish this paper to be a cornerstone toward a new technology.

REFERENCES

[1] K.-C. Chen, M. Chiang, and H. V. Poor, "From technological networks to social networks," IEEE J. Sel. Areas Commun., vol. 31, no. 9, pp. 548-572, Sep. 2013.
[2] C. E. Shannon, "A mathematical theory of communication," Bell System Technical Journal, vol. 27, pp. 379-423, 623-656, Jul. and Oct. 1948.
[3] M. Last, Y. Klein, and A. Kandel, "Knowledge discovery in time series databases," IEEE Trans. Syst., Man, Cybern. B, vol. 31, no. 1, pp. 160-169, Feb. 2001.
[4] S. Huang and L. Zheng, "Linear information coupling problems," in Proc. 2012 IEEE International Symposium on Information Theory (ISIT 2012), Cambridge, MA, USA, Jul. 2012, pp. 1029-1033.
[5] T. Schreiber, "Measuring information transfer," Phys. Rev. Lett., vol. 85, pp. 461-464, Jul. 2000.
[6] J. Haupt, W. Bajwa, M. Rabbat, and R. Nowak, "Compressed sensing for networked data," IEEE Signal Process. Mag., vol. 25, no. 2, pp. 92-101, Mar. 2008.
[7] S. Yang and E. D. Kolaczyk, "Target detection via network filtering," IEEE Trans. Inf. Theory, vol. 56, no. 5, pp. 2502-2515, May 2010.
[8] T.-Y. Chuang and K.-C. Chen, "Cognition on the networked data of stochastic topology," in Proc. 2014 IEEE International Conference on Communications (ICC), Jun. 2014, pp. 3753-3757.
[9] A. Sandryhaila and J. M. F. Moura, "Big data analysis with signal processing on graphs: Representation and processing of massive data sets with irregular structure," IEEE Signal Process. Mag., vol. 31, no. 5, pp. 80-90, Sep. 2014.
[10] X. Wu, X. Zhu, G.-Q. Wu, and W. Ding, "Data mining with big data," IEEE Trans. Knowl. Data Eng., vol. 26, no. 1, pp. 97-107, Jan. 2014.
[11] H. V. Poor, An Introduction to Signal Detection and Estimation, 2nd ed. Springer, 1994.
[12] A. Goldsmith, Wireless Communications. New York, NY, USA: Cambridge University Press, 2005.
[13] J. Rissanen, "Universal coding, information, prediction, and estimation," IEEE Trans. Inf. Theory, vol. 30, no. 4, pp. 629-636, Jul. 1984.
[14] A. Neumaier, "Solving ill-conditioned and singular linear systems: A tutorial on regularization," SIAM Review, vol. 40, pp. 636-666, 1998.
[15] Z. Wang and L. Chan, "Learning causal relations in multivariate time series data," ACM Trans. Intell. Syst. Technol., vol. 3, no. 4, pp. 76:1-76:28, Sep. 2012.
[16] A. Harvey, Forecasting, Structural Time Series Models and the Kalman Filter. Cambridge University Press, 1990.
