
Distributed Density Estimation Using Non-parametric Statistics

Yusuo Hu (Microsoft Research Asia, No.49 Zhichun Road, Beijing 100080, China)
Hua Chen (Tsinghua University, Beijing 100084, China)
Jian-guang Lou (Microsoft Research Asia, No.49 Zhichun Road, Beijing 100080, China)
Jiang Li (Microsoft Research Asia, No.49 Zhichun Road, Beijing 100080, China)

The work presented in this paper was carried out at Microsoft Research Asia.

Abstract

Learning the underlying model from distributed data is often useful for many distributed systems. In this paper, we study the problem of learning a non-parametric model from distributed observations. We propose a gossip-based distributed kernel density estimation algorithm and analyze the convergence and consistency of the estimation process. Furthermore, we extend our algorithm to distributed systems under communication and storage constraints by introducing a fast and efficient data reduction algorithm. Experiments show that our algorithm can estimate the underlying density distribution accurately and robustly with only small communication and storage overhead.

Keywords: Kernel Density Estimation, Non-parametric Statistics, Distributed Estimation, Data Reduction, Gossip

1 Introduction

With the great advance of networking technology, many distributed systems such as peer-to-peer (P2P) networks, computing grids, and sensor networks have been deployed in a wide variety of environments. As these systems grow in scale, there is an increasing need for efficient methods to deal with large amounts of data that are distributed over a set of nodes. In particular, many applications require learning a global distribution from scattered measurements. For example, in a sensor network, we may need to know the distribution of some variable over a target area. In a P2P system, we may want to learn the global distribution of resources for indexing or load balancing. In a distributed agent system, each agent may need to acquire some global information about the environment through collaborative learning and make decisions based on the current knowledge.

In this paper, we focus on the distributed density estimation problem, which can be described as follows: given a network consisting of N nodes, where each node holds some local measurements of a hidden random variable, the task is to estimate the globally unknown probability density function (pdf) f(x) from all the measurements observed on the nodes.

Recently, some distributed density estimation algorithms based on parametric models have been proposed [12, 14, 18]. In these approaches, the unknown distribution is modeled as a mixture of Gaussians, and the parameters are estimated by distributed implementations of the Expectation-Maximization (EM) algorithm. However, we argue that a parametric model is not always suitable for distributed learning. It often requires strong prior information about the global distribution, such as the number of components and the form of the distribution. Furthermore, the EM algorithm is highly sensitive to its initial parameters. With a bad initialization, it may require many steps to converge, or get trapped in a local maximum. Therefore, for most distributed systems, a general and robust approach to distributed density estimation is still needed.

Non-parametric statistical methods have proven robust and efficient in many practical applications. One of the most widely used non-parametric techniques is Kernel Density Estimation (KDE) [23], which can estimate an arbitrary distribution from empirical data without much prior knowledge. However, since KDE is a data-driven approach, a distributed implementation needs an efficient mechanism to disseminate data samples. Furthermore, to incrementally learn the non-parametric model from the distributed measurements, the nodes need to frequently exchange their current local estimates, so a compact and efficient representation of the local estimate on each node is also needed.

In this paper, we propose a gossip-based distributed kernel density estimation algorithm to estimate the unknown distribution from data. We also extend the algorithm to handle the case where communication and storage resources are constrained. Through theoretical analysis and extensive simulations, we show that the proposed algorithm is flexible, robust, and applicable to arbitrary distributions. Compared with distributed EM algorithms such as Newscast EM [14], our estimation method based on non-parametric statistics is much more robust and accurate.

Our main contributions are as follows:

- We propose a distributed non-parametric density estimation algorithm based on a gossip protocol. To the best of our knowledge, it is the first effort to generalize KDE to the distributed case.

- We prove that the distributed estimate is asymptotically consistent with the global KDE when there is no resource constraint. We also provide a rigorous analysis of the convergence speed of the distributed estimation protocol.

- We propose a practical distributed density estimation algorithm for situations with limited storage and communication bandwidth. Experiments show that our distributed protocol can estimate the global distribution quickly and accurately.

This paper is organized as follows. Section 2 briefly reviews related work on distributed density estimation. Section 3 discusses the distributed density estimation problem and presents our solution. Section 4 presents the experimental results, and Section 5 concludes the paper.

2 Related Work

Recently, distributed estimation has raised interest in many practical systems. Nowak [18] uses a mixture of Gaussians to model the measurements on the nodes and proposes a distributed EM algorithm to estimate the means and variances of the Gaussians. Later, Kowalczyk et al. [14] propose a gossip-based implementation of the distributed EM protocol. Jiang et al. [12] apply a similar parametric model and distributed EM algorithm in sensor networks and use multi-path routing to improve the resilience of the estimation to link and node failures. However, the EM-based algorithms depend on proper initialization of the number and parameters of the Gaussian components. To alleviate this issue, a distributed greedy learning algorithm is proposed in [24] to incrementally estimate the components of the Gaussian mixture. However, learning the model in this way requires much more time, and there is still the possibility of being trapped in a local minimum because of the greedy nature of the EM algorithm.

Some other research focuses on extracting the true signals from noisy observations in sensor networks. Ribeiro et al. [20, 21] studied the problem of bandwidth-constrained distributed mean-location parameter estimation in additive Gaussian or non-Gaussian noise. Delouille et al. [4] used a graphical model to describe the measurements of the sensor network and proposed an iterative distributed algorithm for linear minimum mean-squared-error (LMMSE) estimation of the mean-location parameters. These approaches put more effort into exploiting the correlations between sensor measurements and eliminating the noise in the observations, while we aim to discover a completely unknown pattern from distributed measurements with few assumptions or dependencies.

Non-parametric models [22, 23], which do not rely on assumptions about the underlying data distribution, are well suited to this task. However, there is little research on exploiting non-parametric methods in distributed systems, except for a few papers that address decentralized classification and clustering problems using kernel methods [13, 17]. To the best of our knowledge, our work is the first effort to generalize kernel density estimation to the distributed case. We believe that this distributed estimation technique is a useful building block for many distributed systems, and that non-parametric methods will play an increasingly important role in such systems.

3 Distributed Non-parametric Density Estimation

In this section, we first review some definitions related to kernel density estimation, and then present our distributed estimation algorithm. We prove that the estimation process converges without bias to the global KDE if each node has enough storage space and communication bandwidth. Our analysis shows that the estimation error decreases exponentially as the number of gossip cycles increases. We then extend the basic distributed algorithm to a more flexible version with constrained resource usage by introducing an efficient data reduction mechanism.

3.1 Kernel Density Estimation

Kernel Density Estimation (KDE) is a widely used non-parametric method for density estimation.

Given a set of i.i.d. data samples $x_1, x_2, \ldots, x_N$ drawn from some unknown distribution $f$, the kernel density estimate of $f$ is defined as [23]

$$\hat{f}(x) = \sum_{i=1}^{N} w_i K_{H_i}(x - x_i), \qquad (1)$$

where

$$K_{H_i}(x - x_i) = |H_i|^{-1/2} K\big(H_i^{-1/2}(x - x_i)\big) \qquad (2)$$

is the $i$-th kernel, located at $x_i$. Here $H_i$ is called the bandwidth matrix, which is symmetric and positive definite, and the $w_i$ are sample weights satisfying

$$\sum_{i=1}^{N} w_i = 1. \qquad (3)$$

The kernel function $K(\cdot)$ is a $d$-variate non-negative, symmetric, real-valued function. In this paper, we use the Gaussian kernel due to its simplicity:

$$K(x) = \frac{1}{(2\pi)^{d/2}} \exp\!\left(-\frac{1}{2} x^{T} x\right). \qquad (4)$$

It has been proved that, given enough data samples, the kernel density estimate (1) can approximate any arbitrary distribution to any given precision [22]; that is, KDE is asymptotically consistent with the underlying distribution. In practice, it usually requires only a few samples for KDE to generate an estimate that approximates the underlying distribution very well.
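To make the estimator in (1)-(4) concrete, the following is a minimal Python sketch of a weighted Gaussian KDE. It is our own illustration and not part of the original paper; the function names, the use of NumPy, and the simplifying choice of a fixed isotropic bandwidth matrix H = h^2 I are assumptions made for clarity.

```python
import numpy as np

def gaussian_kernel(u):
    """d-variate standard Gaussian kernel K(u), Eq. (4)."""
    d = u.shape[-1]
    return np.exp(-0.5 * np.sum(u * u, axis=-1)) / (2.0 * np.pi) ** (d / 2.0)

def kde(x, samples, weights, h):
    """Weighted KDE (Eqs. (1)-(2)) with bandwidth matrix H = h^2 * I.

    x: (d,) query point; samples: (N, d) data; weights: (N,) summing to 1.
    """
    det_factor = h ** samples.shape[1]          # |H|^{1/2} for H = h^2 I
    u = (x - samples) / h                       # H^{-1/2} (x - x_i)
    return np.sum(weights * gaussian_kernel(u) / det_factor)

# Illustrative example: a small 1-D dataset with equal weights 1/N.
rng = np.random.default_rng(0)
samples = np.concatenate([rng.normal(-2, 1, 100), rng.normal(3, 0.5, 100)])[:, None]
weights = np.full(len(samples), 1.0 / len(samples))
print(kde(np.array([0.0]), samples, weights, h=0.5))
```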

3.2 Gossip-based Distributed KDE Algorithm

Consider a distributed network consisting of $N$ nodes. Initially, each node $i$ holds a local measurement $x_i$, and the measurements can be regarded as i.i.d. samples of a random variable with unknown distribution $f$. The problem of distributed density estimation is to let each node obtain an estimate of the global distribution $f$, or equivalently of the probability density function $f(x)$, quickly and efficiently. In the following discussion, we assume that there is an underlying communication mechanism that allows any two nodes in the system to establish a (physical or virtual) communication channel and exchange messages.

The basic idea of our distributed kernel density estimation method is to incrementally collect information and approximate the global KDE through a gossip mechanism. To achieve this goal, we maintain a local kernel set on each node $i$:

$$S_i = \{(w_{ij}, x_{ij}, H_{ij}) \mid j = 1, \ldots, m_i\}, \qquad (5)$$

where $m_i$ is the current number of kernels on node $i$, and $(w_{ij}, x_{ij}, H_{ij})$ is the $j$-th kernel: $x_{ij}$ is referred to as the location of the kernel, and $w_{ij}$ and $H_{ij}$ are its weight and bandwidth matrix. Initially, each node has only one kernel $(w_{i1}, x_{i1}, H_{i1})$, where the weight is $w_{i1} = 1$, $x_{i1} = x_i$ is its local sample, and $H_{i1}$ is the corresponding bandwidth matrix.

According to (1), the local estimate of the pdf on node $i$ can be calculated as

$$\hat{f}_i(x) = \sum_{j=1}^{m_i} w_{ij} K_{H_{ij}}(x - x_{ij}). \qquad (6)$$

The gossip-based distributed estimation algorithm is illustrated in Algorithm 1. In the estimation process, every node periodically selects a random node from all other nodes in the network and exchanges kernels with it. After the exchange, both the initiating node and the target node merge the newly received kernels into their local kernel sets and update their local estimates.

During this process, the size of the local kernel set on a node grows quickly. Intuitively, each node collects many kernels and can estimate the global density distribution within a very short time, and the local estimates on the nodes get closer and closer to the global KDE as the algorithm proceeds.
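Before the formal pseudocode of Algorithm 1 below, the following is a minimal, non-authoritative Python sketch of one gossip exchange as we read it: kernels are keyed by their location so that kernels originating from the same sample are merged, and every weight is halved on each side, which averages the weights of shared kernels. The data structures and function names are our own assumptions; the paper specifies the protocol only in pseudocode.

```python
import random

def merge_kernel_sets(set_a, set_b):
    """Merge two local kernel sets after a gossip exchange.

    Each set maps a kernel key (here: the sample location as a tuple) to
    (weight, location, bandwidth). Both peers end up with the union of the
    kernels; every weight is halved, so a kernel present in both sets gets
    the average of the two weights, as in Eq. (7) of the analysis.
    """
    merged = {}
    for kernels in (set_a, set_b):
        for key, (w, x, h) in kernels.items():
            w_old = merged.get(key, (0.0, x, h))[0]
            merged[key] = (w_old + 0.5 * w, x, h)
    return merged

def gossip_cycle(nodes):
    """One gossip cycle: every node contacts one uniformly random peer."""
    for i in range(len(nodes)):
        j = random.choice([k for k in range(len(nodes)) if k != i])
        merged = merge_kernel_sets(nodes[i], nodes[j])
        nodes[i] = dict(merged)   # both peers adopt the merged set
        nodes[j] = dict(merged)

# Example: 4 nodes, each starting with one unit-weight kernel at its sample.
samples = [0.0, 1.0, 2.0, 3.0]
nodes = [{(x,): (1.0, x, 0.5)} for x in samples]
gossip_cycle(nodes)
print(sorted(nodes[0].values()))  # weights on node 0 still sum to 1
```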

Algorithm 1: Distributed KDE without Resource Constraints

1. Initialization: the local kernel set at node $i$ is initialized as $S_i = \{(1, x_i, H_i)\}$.

2. Gossip: on each node $i$, the following two procedures run in parallel.

   Active procedure. Node $i$ periodically executes the following steps, once per gossip cycle:

   (a) Randomly select a node $j$ as the gossip target.

   (b) Send the local kernel set $S_i = \{(w_{ik}, x_{ik}, H_{ik}) \mid k = 1, \ldots, m_i\}$ to node $j$.

   (c) After receiving the response from $j$, which consists of node $j$'s local kernel set $S_j = \{(w_{jk}, x_{jk}, H_{jk}) \mid k = 1, \ldots, m_j\}$, node $i$ updates its own local kernel set as

   $$S_i \leftarrow \Big\{\big(\tfrac{w_{ik}}{2}, x_{ik}, H_{ik}\big)\Big\}_{k=1}^{m_i} \cup \Big\{\big(\tfrac{w_{jk}}{2}, x_{jk}, H_{jk}\big)\Big\}_{k=1}^{m_j}.$$

   During the union, kernels with the same location and bandwidth matrix are merged, and their weights are summed.

   Passive procedure. When node $j$ receives a gossip message from another node $i$, the following steps are triggered:

   (a) Node $j$ responds to node $i$ with its local kernel set.

   (b) Node $j$ updates its local kernels with the kernel set received from $i$, using the same rule as in the active procedure.

3. After the active procedure terminates, each node uses its final kernel set to calculate $\hat{f}_i(x)$.

To analyze the consistency and the convergence speed of our estimation algorithm, we define the relative estimation error on node $i$ as $\varepsilon_i = \|\hat{f}_i - \hat{f}\| / \|\hat{f}\|$, where $\|\cdot\|$ is the $L_2$ norm, $\hat{f}_i$ is the local density estimate based on the kernel set $S_i$ at the end of a gossip cycle, and

$$\hat{f}(x) = \frac{1}{N}\sum_{i=1}^{N} K_{H_i}(x - x_i)$$

is the global kernel density estimate.

Theorem 1. For Algorithm 1, given any $\varepsilon > 0$ and $\delta > 0$, after $t \geq \log\!\big(N/(\delta\varepsilon^2)\big) / \log(2\sqrt{e})$ gossip cycles, the relative estimation error on node $i$ satisfies

$$P(\varepsilon_i \leq \varepsilon) \geq 1 - \delta.$$

Proof. Since all kernels originate from the original data samples, we can re-write the local density estimate at node $i$ as

$$\hat{f}_i(x) = \sum_{l=1}^{N} w_{il} K_{H_l}(x - x_l),$$

where $w_{il}$ is the weight of the kernel corresponding to the original data sample $x_l$, and $w_{il} = 0$ if the sample $x_l$ is not present in $S_i$. Suppose node $j$ is node $i$'s gossip target in cycle $t$. Similarly, the local estimate at node $j$ can be written as

$$\hat{f}_j(x) = \sum_{l=1}^{N} w_{jl} K_{H_l}(x - x_l).$$

After the gossip exchange, nodes $i$ and $j$ have updated kernel sets $S_i'$ and $S_j'$, and the new local estimates can again be represented as weighted sums of the original kernels. According to Algorithm 1, the new sample weights $w_{il}'$ and $w_{jl}'$ are

$$w_{il}' = w_{jl}' = \frac{w_{il} + w_{jl}}{2}. \qquad (7)$$

Thus, the gossip process is in fact an averaging process over the weights of the original data samples. For any data sample $x_l$, denote by $\mu_l(t)$ and $\sigma_l^2(t)$ the mean and variance, over the nodes, of its weight in the $t$-th gossip cycle. It is easy to see that the mean value of the weights is unchanged, i.e. $\mu_l(t) = 1/N$, while the variance of the kernel weights decreases quickly through the gossip process. According to the properties of the averaging process [11], when the gossip target is selected uniformly at random, the reduction rate of the variance satisfies

$$E\big[\sigma_l^2(t+1)\big] \leq \frac{1}{2\sqrt{e}}\, E\big[\sigma_l^2(t)\big]. \qquad (8)$$

Noting that $\sigma_l^2(0) \leq 1$, we have

$$E\big[\sigma_l^2(t)\big] \leq (2\sqrt{e})^{-t}. \qquad (9)$$

In the $t$-th gossip cycle, the relative estimation error on node $i$ can be written as

$$\varepsilon_i = \frac{\Big\|\sum_{l=1}^{N}\big(w_{il}-\tfrac{1}{N}\big)K_{H_l}(x - x_l)\Big\|}{\Big\|\tfrac{1}{N}\sum_{l=1}^{N}K_{H_l}(x - x_l)\Big\|}, \qquad (10)$$

which, since the kernels are fixed non-negative functions, is bounded by a multiple of $\big(\sum_{l=1}^{N}(w_{il}-\tfrac{1}{N})^2\big)^{1/2}$ with a factor that does not depend on the gossip process. By the exchangeability of the nodes, $E\big[(w_{il}-\tfrac{1}{N})^2\big] = E\big[\sigma_l^2(t)\big] \leq (2\sqrt{e})^{-t}$, and therefore $E\big[\sum_{l=1}^{N}(w_{il}-\tfrac{1}{N})^2\big] \leq N (2\sqrt{e})^{-t}$. Applying Markov's inequality to $\varepsilon_i^2$ then gives

$$P(\varepsilon_i \leq \varepsilon) \geq 1 - \delta \qquad (11)$$

when $t \geq \log\!\big(N/(\delta\varepsilon^2)\big) / \log(2\sqrt{e})$, which completes the proof.
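The variance-reduction argument behind (8) and (9) can also be checked numerically. The short Python simulation below is our own illustration, not part of the paper: it applies the pairwise weight averaging of Eq. (7) with uniformly random targets and prints the empirical weight variance after each cycle, which decays geometrically, roughly in line with the $1/(2\sqrt{e})$ rate.

```python
import random

def gossip_weight_cycle(weights):
    """One cycle of pairwise weight averaging, Eq. (7).

    weights[i][l] is node i's weight for original sample l. Every node
    initiates one exchange with a uniformly random peer.
    """
    n = len(weights)
    for i in range(n):
        j = random.choice([k for k in range(n) if k != i])
        for l in range(n):
            avg = 0.5 * (weights[i][l] + weights[j][l])
            weights[i][l] = weights[j][l] = avg

def mean_variance(weights):
    """Average, over samples l, of the across-node variance of weight l."""
    n = len(weights)
    total = 0.0
    for l in range(n):
        col = [weights[i][l] for i in range(n)]
        m = sum(col) / n
        total += sum((w - m) ** 2 for w in col) / n
    return total / n

random.seed(1)
N = 64
# Initially node i holds weight 1 for its own sample and 0 for all others.
weights = [[1.0 if l == i else 0.0 for l in range(N)] for i in range(N)]
for cycle in range(1, 9):
    gossip_weight_cycle(weights)
    print(f"cycle {cycle}: mean variance = {mean_variance(weights):.2e}")
```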

The above theorem shows that the relative estimation error of Algorithm 1 decreases exponentially, and the local estimate at each node converges to the global estimate in $O(\log N)$ steps. Since the number of kernels transmitted in one gossip cycle never exceeds the number of initial measurements $N$, at most $O(N \log N)$ kernels are transmitted in total.

3.3 Resource-Constrained Distributed KDE

Algorithm 1 does not take into account the resource limitations of the nodes in a distributed network. We assumed that each node has enough storage space to keep up to $N$ local kernels and enough communication capacity to transmit up to $N$ kernels during a gossip cycle. However, this assumption is not always valid in distributed systems, especially as the system scales up. Here we extend the basic distributed estimation algorithm to a more flexible version that meets resource-constraint requirements.

Three kinds of constraints are usually considered in distributed systems:

1. Communication constraints. The communication bandwidth between nodes is often limited by the channel capacity. When the bandwidth is limited, it becomes quite inefficient for our estimation algorithm to send the full local kernel set from one node to another in a single gossip cycle.

2. Storage constraints. The storage at a node is also limited, and it is sometimes impractical to store a very large kernel set on a node. When the number of kernels that can be stored on a node is smaller than the total number of samples, the distributed estimate cannot approach the global density function without any information loss. Instead, the goal becomes to estimate the global KDE as accurately as possible with a limited number of kernels.

3. Power constraints. In some systems, e.g. sensor networks, the computation on a node is also limited in order to save energy. Calculating a KDE over a large sample set is computation intensive and consumes considerable power, so it would be inefficient to compute the KDE if the resulting kernel set is too large.

Taking these constraints into consideration, we explicitly set some parameters of our gossip algorithm to limit the resource consumption of the estimation process. For simplicity, rather than specifying the exact size of the packets to be transmitted or stored, we set the maximal number of kernels sent from one node to another in each gossip cycle to be $c$, and the maximal number of kernels that can be stored at each node to be $m$; the final estimate is then represented by at most $m$ kernels.

To maintain the same estimation efficiency when storage and communication are limited, the key problem is how to represent the current density estimate with only a small set of representative samples, which is in fact a data reduction problem. A number of data reduction methods have been proposed [6-8, 16]. However, most previous methods aim to extract representative samples and are computationally expensive, and thus are not suitable for our distributed algorithm. In our case, we need to compress the data set until its size falls below the communication or storage threshold, while preserving the density estimate as much as possible. Furthermore, the data reduction method should be simple and fast.

Here we propose a simple yet efficient data reduction method, which can be regarded as a bottom-up construction of a KD-tree structure [1]. Suppose a density estimate is represented by $n_1$ Gaussian kernels $\{(w_i, x_i, H_i) \mid i = 1, \ldots, n_1\}$; our goal is to output $n_2 < n_1$ kernels $\{(w_i', x_i', H_i') \mid i = 1, \ldots, n_2\}$ that approximate the original density estimate. To this end, we iteratively choose the two kernels with minimum distance between their centers and merge them into a new kernel. After $n_1 - n_2$ iterations, we obtain a reduced kernel set whose size is exactly $n_2$. The detailed data reduction method is described in Algorithm 2.

Based on this data reduction algorithm, we modify the distributed KDE algorithm to meet the resource constraints: during the estimation process, we invoke the data reduction method whenever the size of a kernel set exceeds the communication or storage constraint.

There is some information loss in the data reduction procedure, which affects the accuracy of the whole estimation process. The information loss is determined by the parameters $c$ and $m$; in practice, we may need to adjust these parameters to achieve the desired trade-off between the overhead (bandwidth and storage) and the performance (accuracy and speed) of the algorithm. Because KDE is highly robust and requires only a few data samples to generate an acceptable estimate, our data reduction method can represent the local kernel set efficiently even with relatively small constraint parameters. Experiments show that, with small communication and storage overhead, our algorithm produces estimates whose accuracy is comparable with the global KDE, and the estimates converge almost as quickly as in Algorithm 1. We further analyze the performance of the algorithm in the experimental section.
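As an illustration of how the constraints $c$ and $m$ plug into the gossip step, here is a hedged Python sketch (our own, not the authors' code): before sending, a node reduces its kernel set to at most $c$ kernels, and after merging it reduces the stored set to at most $m$ kernels, using a `reduce_kernels` routine in the spirit of Algorithm 2 (a concrete version of that routine is sketched after Algorithm 2 below). The parameter names `c_max` and `m_max` and the injected callables are assumptions made for this sketch.

```python
def constrained_gossip_exchange(local, peer, c_max, m_max,
                                reduce_kernels, merge_kernel_sets):
    """One resource-constrained gossip exchange between two kernel sets.

    local, peer: kernel sets (e.g. collections of (weight, location, bandwidth)).
    c_max: max kernels per message; m_max: max kernels stored per node.
    reduce_kernels(kernels, n): data reduction down to n kernels (Algorithm 2).
    merge_kernel_sets(a, b): the weight-halving union of Algorithm 1.
    Returns the new kernel sets for both nodes.
    """
    msg_out = reduce_kernels(local, c_max)    # what the initiator sends
    msg_back = reduce_kernels(peer, c_max)    # what the target sends back
    new_local = reduce_kernels(merge_kernel_sets(local, msg_back), m_max)
    new_peer = reduce_kernels(merge_kernel_sets(peer, msg_out), m_max)
    return new_local, new_peer
```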

Algorithm 2: Data Reduction Algorithm for Distributed KDE

Input: the kernel set $S_1 = \{(w_i, x_i, H_i) \mid i = 1, \ldots, n_1\}$ and the target size $n_2$ ($n_2 < n_1$).

Output: a compressed kernel set $S_2 = \{(w_i', x_i', H_i') \mid i = 1, \ldots, n_2\}$ approximating the original KDE.

1. $S \leftarrow S_1$;
2. while $|S| > n_2$ do
3.   Select the two kernels $(w_1, x_1, H_1)$ and $(w_2, x_2, H_2)$ in $S$ with minimal distance between their centers $x_1$ and $x_2$;
4.   Merge the two kernels into a parent kernel $(w', x', H')$ according to the following rule:

$$w' = w_1 + w_2, \qquad x' = \frac{w_1 x_1 + w_2 x_2}{w_1 + w_2},$$
$$H' = \frac{w_1 (H_1 + x_1 x_1^{T}) + w_2 (H_2 + x_2 x_2^{T})}{w_1 + w_2} - x' x'^{T}; \qquad (12)$$

5.   Delete the kernels $(w_1, x_1, H_1)$ and $(w_2, x_2, H_2)$ from $S$, and add the kernel $(w', x', H')$ to $S$;
6. end
7. $S_2 \leftarrow S$.
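Below is a small Python sketch of Algorithm 2 as we read it, using the moment-matching merge rule (12) for one-dimensional kernels so that locations and bandwidths are plain floats. It is an illustration under those simplifying assumptions, not the authors' implementation, and the O(n^2) closest-pair search is kept naive for clarity.

```python
def merge_pair(k1, k2):
    """Moment-matching merge of two 1-D Gaussian kernels, Eq. (12).

    A kernel is a tuple (w, x, h2) with weight w, center x and variance h2
    (the 1-D analogue of the bandwidth matrix H).
    """
    w1, x1, h1 = k1
    w2, x2, h2 = k2
    w = w1 + w2
    x = (w1 * x1 + w2 * x2) / w
    h = (w1 * (h1 + x1 * x1) + w2 * (h2 + x2 * x2)) / w - x * x
    return (w, x, h)

def reduce_kernels(kernels, n2):
    """Iteratively merge the two closest kernels until only n2 remain."""
    kernels = list(kernels)
    while len(kernels) > n2:
        # Naive O(n^2) search for the pair with the closest centers.
        best = None
        for i in range(len(kernels)):
            for j in range(i + 1, len(kernels)):
                d = abs(kernels[i][1] - kernels[j][1])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = merge_pair(kernels[i], kernels[j])
        # Remove index j first so index i stays valid.
        kernels.pop(j)
        kernels.pop(i)
        kernels.append(merged)
    return kernels

# Example: compress five kernels down to two; total weight is preserved.
ks = [(0.2, x, 0.25) for x in (0.0, 0.1, 0.2, 5.0, 5.1)]
print(reduce_kernels(ks, 2))
```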

3.4 Practical Considerations

In the discussion above, we assumed that each node holds only one measurement. In some practical networks, each node may hold a number of measurements. In that case, the initial kernel set on a node simply contains more than one kernel, and our algorithm extends to this case without any special difficulty.

There are some other implementation considerations when applying the estimation algorithm in practical systems. First, most large-scale distributed systems have no global clock, so it is difficult for all nodes to run the estimation algorithm synchronously. Furthermore, in dynamic systems where nodes join and leave the network, the estimation process needs to be restarted periodically to provide up-to-date distribution estimates. To implement restarting and asynchronous estimation, we run the algorithm in consecutive rounds of length $\Delta = T \cdot \delta$ (where $\delta$ is the length of a gossip cycle and $T$ is the number of cycles in a round), and start a new instance of the distributed KDE in each round. In the implementation, we attach a unique, increasing round identifier to each gossip message so that messages from different rounds can be distinguished. Once a node receives a round identifier larger than its current one, it immediately joins the new estimation process. At any time, the result of the last completed round is taken as the current estimate of the density distribution. Newly joining nodes can obtain the current estimate from other nodes immediately and join the next round of estimation. Similar restarting mechanisms have been adopted in some aggregation algorithms [11].

Another factor that may affect the estimation result is the choice of the bandwidth matrix. In our experiments, we set a global bandwidth matrix $H = hI$, with a single scalar bandwidth $h$, in advance for the sake of simplicity and clarity. We observed that, in most cases, simply setting $h$ to a small value results in a good estimate. However, it is also possible to estimate the bandwidth matrix from the local kernel set during the gossip process, and a number of data-driven methods already exist for estimating global or local bandwidth matrices from sample sets [3, 5, 19]. In practice, our estimation algorithm can be combined with any bandwidth selection method if necessary.

In some distributed systems, such as sensor networks or P2P networks, a node can only contact a set of connected neighbors, and it is difficult or impossible to select a gossip target uniformly at random from all other nodes. However, previous work on gossip protocols shows that the topology of the overlay network only affects the convergence speed of the averaging process [2, 11]; the nodes can still obtain the same estimation result by periodically exchanging their local estimates only with their neighbors. For P2P networks, another possible solution is to deploy a peer sampling service [9] over the network. The peer sampling service provides each node with a continuously updated cache of nodes with good randomness, obtained by periodically exchanging and updating the node caches. A detailed evaluation and discussion of the implementation of the peer sampling service can be found in [9].

4 Simulation and Discussion

To evaluate the performance of our distributed KDE algorithm, we test it on a number of datasets of different sizes, dimensions, and distributions. We explore how the constraint parameters affect the convergence and accuracy of the distributed algorithm. We also compare the performance of our algorithm with the Newscast EM algorithm [10] on some non-Gaussian data distributions. In addition, we examine the robustness of our algorithm when there are frequent node failures during the estimation process.

4.1 Experimental Methodology

We simulate a distributed network consisting of a number of connected nodes. Initially, each node holds only one data sample drawn from an underlying distribution, and we implement our distributed KDE algorithms on these nodes.

ÜÔ Ö ÑÒØÐ ÅØÓ ÓÐÓÝ In the implementation, we attach a unique and increasing º½ round identifer to each gossip message so that the messages of different rounds can be distinguished from each other. We simulate a distributed network which consists of a Once the node receives a round identifier which is larger number of connected nodes. Initially, each node holds only than its current one, it will immediately join the new esti- one data sample which is drawn from an underlying distri- mation process. At any time, the result of the last round bution. We implement our distributed KDE algorithms on

In each experiment, we run the algorithm for a number of gossip cycles and investigate the estimation result. To quantitatively analyze the estimation accuracy and compare our algorithm with others, we use the Kullback-Leibler divergence (KL-divergence) [15] as the performance metric. In our experiments, we monitor the change of the local density estimate on each node and calculate the KL-divergence from the baseline distribution, which may be the global KDE or the real underlying distribution, to the local estimate.
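As a concrete reading of this metric, the snippet below shows one way to compute a discretized KL-divergence D(baseline || estimate) on a grid. It is our own illustration of the evaluation step, not the authors' code, and the grid-based renormalization is an assumption.

```python
import numpy as np

def kl_divergence_on_grid(p_baseline, p_estimate, eps=1e-12):
    """Discretized KL-divergence D(baseline || estimate) on a common grid.

    Both inputs are density values evaluated on the same grid; they are
    renormalized to sum to one so the result behaves like a divergence
    between discrete distributions.
    """
    p = np.asarray(p_baseline, dtype=float) + eps
    q = np.asarray(p_estimate, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

# Example: compare a Gaussian density with a slightly shifted estimate.
grid = np.linspace(-5.0, 5.0, 501)
baseline = np.exp(-0.5 * grid**2)
estimate = np.exp(-0.5 * (grid - 0.3)**2)
print(kl_divergence_on_grid(baseline, estimate))
```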

4.2 Simulation Results

4.2.1 Estimation Accuracy and Convergence

Since the convergence of the distributed KDE without constraints is guaranteed by Theorem 1, we mainly focus on the performance of the resource-constrained distributed estimation algorithm. Fig. 1 shows the estimation result on a 1D dataset consisting of 200 samples drawn from a mixture of five Gaussians. Such a heavily overlapped distribution (five weighted Gaussians with different variances, located close to each other) is difficult to estimate even in a centralized setting. Fig. 1(a) shows the local estimation result on a node after 6 gossip cycles, together with the global KDE result and the original dataset. The curve estimated by our algorithm is close to the global result.

[Figure 1. Distributed KDE on 1D dataset: (a) the 1D dataset and estimation results; (b) convergence of the KL-divergence over gossip cycles.]

To further verify the overall performance of our algorithm, we calculate the KL-divergence from the global KDE result to the distributedly estimated pdf on every node in each gossip cycle, and then compute the mean and variance of the KL-divergence values over all nodes within each cycle. The change of the KL-divergence is shown in Fig. 1(b). Both the mean and the variance decrease exponentially, and after four cycles the estimation errors on most nodes drop to a small value.

We also tested our distributed KDE algorithm on higher-dimensional datasets. Fig. 2 shows the estimation result on a 2D dataset consisting of 600 samples drawn from a mixture of five Gaussian components. Fig. 2(a) shows the locations of the original data samples. The result of the global KDE is shown in Fig. 2(b), and the distributed estimation result on a randomly chosen node is shown in Fig. 2(c). The distributed kernel density estimate is close to the global KDE. Fig. 2(d) shows the change of the statistics of the KL-divergence over the gossip cycles; the convergence trend is similar to the 1D case.

[Figure 2. Distributed KDE on 2D dataset: (a) the 2D dataset; (b) global KDE; (c) distributed KDE; (d) convergence of the KL-divergence over gossip cycles.]

4.2.2 Non-Gaussian Distribution

We compare our proposed distributed KDE algorithm with the Newscast EM algorithm [14]. Fig. 3 shows the estimation results of our distributed KDE algorithm and Newscast EM on the same 1D non-Gaussian dataset, which consists of 200 samples drawn from a mixture of three bi-exponential distributions.

We run the Newscast EM algorithm with different initial values. In the first experiment, in order to obtain the best possible estimate, we set the initial locations of the Gaussians to the centers of the three distributions; the result is shown in Fig. 3(b) and Fig. 3(e). In the second experiment, we slightly change the initial locations of the Gaussians; the corresponding result is shown in Fig. 3(c) and Fig. 3(f).

[Figure 3. Comparison with the Newscast EM algorithm: (a) Distributed KDE; (b) Newscast EM with good initialization; (c) Newscast EM with bad initialization; (d) convergence of Distributed KDE; (e) convergence of Newscast EM with good initialization; (f) convergence of Newscast EM with bad initialization. Each top panel shows the samples, the real model, and the estimated pdf f(x); each bottom panel shows the KL-divergence over gossip cycles.]

From Fig. 3(a) and Fig. 3(d), we can see that, even without any prior information about the distribution, our distributed KDE algorithm works well and achieves the highest estimation accuracy. From Fig. 3(b) and Fig. 3(e), we can see that, even with the best tuned parameters, the EM estimate still deviates from the underlying distribution; this estimation error of the distributed EM algorithm is due to its wrong model assumption. From Fig. 3(c) and Fig. 3(f), we observe that a sub-optimal initialization of the EM algorithm leads to a much worse estimate of the probability density function, which shows that the distributed EM algorithm is highly sensitive to its initial parameters. On the other hand, we also find that the convergence speed of Newscast EM with good initialization is slightly faster than that of our proposed algorithm. This is because the EM algorithm uses extra prior information (e.g. the number of Gaussian components) and can therefore quickly reduce the estimation error at the beginning.

4.2.3 Resilience to Node Failure

To test the robustness of our distributed algorithm, we simulate a situation in which nodes fail frequently during the estimation process. At the beginning of each gossip cycle, we select $P_f \times N$ nodes at random and discard them from the network. We conduct a series of experiments in which $P_f$ ranges from 5% to 30%, and compare the estimation results with the global KDE result calculated from the original complete dataset. Fig. 4 shows the estimation results under node failure. The estimation result is still acceptable even when a high percentage of nodes fail in each cycle, which shows that the distributed KDE is robust against node failure.

[Figure 4. Impact of node failure: (a) KL-divergence as a function of the failure ratio P_f; (b) estimation result with P_f = 30%, showing the samples, the global KDE, and the distributed KDE.]

5 Conclusion

In this paper, we proposed a distributed kernel density estimation algorithm that can efficiently learn a non-parametric model from distributed measurements with an arbitrary distribution. It requires little prior information about the underlying distribution and converges to the global KDE to any given precision in $O(\log N)$ steps. To extend the algorithm to situations with limited resources, we proposed an efficient data reduction method for exchanging local density estimates and designed a resource-constrained distributed kernel density estimation algorithm. Through extensive simulation experiments, we found that the resource-constrained distributed estimation algorithm achieves high estimation accuracy with only small communication and storage overhead. Compared with the distributed EM algorithm, our estimation method is more robust and accurate.

6 Acknowledgement

The authors would like to thank the anonymous reviewers for their valuable suggestions and comments.

References

[1] J. L. Bentley. Multidimensional binary search trees used for associative searching. Commun. ACM, 18(9):509–517, Sep. 1975.
[2] S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah. Randomized gossip algorithms. IEEE Trans. Inform. Theory, 52(6):2508–2530, Jun. 2006.
[3] D. Comaniciu. An algorithm for data-driven bandwidth selection. IEEE Trans. Pattern Anal. Mach. Intell., 25(2):281–288, 2003.
[4] V. Delouille, R. Neelamani, and R. Baraniuk. Robust distributed estimation in sensor networks using the embedded polygons algorithm. In ACM IPSN '04, pages 405–413, Berkeley, California, USA, Apr. 2004.
[5] T. Duong and M. L. Hazelton. Cross-validation bandwidth matrices for multivariate kernel density estimation. Scandinavian Journal of Statistics, 32(3):485–506, 2005.
[6] M. Girolami and C. He. Probability density estimation from optimally condensed data samples. IEEE Trans. on PAMI, 25(10):1253–1264, Oct. 2003.
[7] L. Holmstrom and A. Hamalainen. The self-organizing reduced kernel density estimator. In Proc. IEEE International Conference on Neural Networks, pages 417–421, 1993.
[8] A. T. Ihler, J. W. Fisher III, and A. S. Willsky. Using sample-based representations under communications constraints. Massachusetts Institute of Technology, LIDS Technical Report No. 2601, Dec. 2004.
[9] M. Jelasity, R. Guerraoui, A.-M. Kermarrec, and M. Steen. The peer sampling service: Experimental evaluation of unstructured gossip-based implementations. Lecture Notes in Computer Science, 3231:79–98, 2004.
[10] M. Jelasity, W. Kowalczyk, and M. Steen. Newscast computing. Technical Report IR-CS-006, Department of Computer Science, Vrije Universiteit Amsterdam, Amsterdam, The Netherlands, Nov. 2003.
[11] M. Jelasity, A. Montresor, and O. Babaoglu. Gossip-based aggregation in large dynamic networks. ACM Trans. on Computer Systems, 23(3):219–252, Aug. 2005.
[12] H. Jiang and S. Jin. Scalable and robust aggregation techniques for extracting statistical information in sensor networks. In Proceedings of the 26th IEEE International Conference on Distributed Computing Systems (ICDCS'06), 2006.
[13] M. Klusch, S. Lodi, and G. Moro. Distributed clustering based on sampling local density estimates. In Proc. IJCAI-03. AAAI Press, 2003.
[14] W. Kowalczyk and N. A. Vlassis. Newscast EM. In Advances in Neural Information Processing Systems 17, pages 713–720, Cambridge, MA, 2005. MIT Press.
[15] S. Kullback and R. A. Leibler. On information and sufficiency. Annals of Mathematical Statistics, 22:79–86, 1951.
[16] P. Mitra, C. A. Murthy, and S. K. Pal. Density-based multiscale data condensation. IEEE Trans. on PAMI, 24(6):734–747, 2002.
[17] X. Nguyen, M. J. Wainwright, and M. I. Jordan. Decentralized detection and classification using kernel methods. In ICML '04: Proceedings of the Twenty-First International Conference on Machine Learning, page 80. ACM Press, 2004.
[18] R. Nowak. Distributed EM algorithms for density estimation and clustering in sensor networks. IEEE Trans. on Signal Processing, 51(8):2245–2253, Aug. 2003.
[19] V. C. Raykar and R. Duraiswami. Very fast optimal bandwidth selection for univariate kernel density estimation. Technical Report CS-TR-4774, Department of Computer Science, University of Maryland, College Park, 2005.
[20] A. Ribeiro and G. B. Giannakis. Bandwidth-constrained distributed estimation for wireless sensor networks, Part I: Gaussian case. IEEE Transactions on Signal Processing, 54(3):1131–1143, 2006.
[21] A. Ribeiro and G. B. Giannakis. Bandwidth-constrained distributed estimation for wireless sensor networks, Part II: unknown probability density function. IEEE Transactions on Signal Processing, 54(7):2784–2796, 2006.
[22] P. J. Rousseeuw and A. M. Leroy. Robust Regression and Outlier Detection. New York: Wiley-Interscience, 1987.
[23] D. W. Scott. Multivariate Density Estimation: Theory, Practice, and Visualization. Wiley-Interscience, 1992.
[24] N. Vlassis, Y. Sfakianakis, and W. Kowalczyk. Gossip-based greedy Gaussian mixture learning. In Proc. 10th Panhellenic Conf. on Informatics, Volos, Greece, Nov. 2005.
