2013 Proceedings IEEE INFOCOM

Dynamic Chinese Restaurant Game in Cognitive Radio Networks

Chunxiao Jiang∗†, Yan Chen∗, Yu-Han Yang∗, Chih-Yu Wang∗‡, and K. J. Ray Liu∗

∗Department of Electrical and Computer Engineering, University of Maryland, College Park, MD 20742, USA
†Department of Electronic Engineering, Tsinghua University, Beijing 100084, P. R. China
‡Graduate Institute of Communication Engineering, National Taiwan University, Taipei, Taiwan
E-mail: {jcx, yan, yhyang}@umd.edu, [email protected], [email protected]

Abstract—In a cognitive radio network with mobility, secondary users can arrive at and leave the primary users' licensed networks at any time. After arrival, secondary users are confronted with channel access under an uncertain primary channel state. On one hand, they have to estimate the channel state, i.e., the primary users' activities, by performing spectrum sensing and learning from other secondary users' sensing results. On the other hand, they need to predict subsequent secondary users' access decisions to avoid competition when accessing the "spectrum hole". In this paper, we propose a Dynamic Chinese Restaurant Game to study such a learning and decision-making problem in cognitive radio networks. We introduce a Bayesian learning based method for secondary users to learn the channel state and propose a Multi-dimensional Markov Decision Process based approach for secondary users to make optimal channel access decisions. Finally, we conduct simulations to verify the effectiveness and efficiency of the proposed scheme.

Index Terms—Chinese Restaurant Game, Bayesian Learning, Markov Decision Process, Cognitive Radio, Game Theory.

I. INTRODUCTION

With the emergence of various wireless applications, the available electromagnetic radio spectrum is becoming more and more crowded. The traditional static spectrum allocation policy results in a large portion of the assigned spectrum being under-utilized [1]. Recently, dynamic spectrum access in cognitive radio networks has shown great potential to improve spectrum utilization efficiency. In a cognitive radio network, Secondary Users (SUs) can opportunistically utilize the Primary User's (PU's) licensed spectrum bands without harmful interference to the PU [2].

Two essential issues of dynamic spectrum access are spectrum sensing and channel access. Since SUs are uncertain about the primary channel state, i.e., whether the PU is absent, they need to estimate the channel state by performing spectrum sensing [3]. To counter channel fading and shadowing, cooperative spectrum sensing was proposed recently, in which SUs share their spectrum sensing results to improve the sensing performance [4]. After channel sensing, SUs access one vacant primary channel for data transmission. When choosing a vacant channel, each SU should consider not only the immediate utility but also the utility in the future, since the more subsequent SUs access the same channel, the less throughput each SU can obtain. Such a phenomenon is known as negative network externality [5], i.e., the negative influence of other users' behaviors on one user's reward, due to which users tend to avoid making the same decisions as others in order to maximize their own payoffs. However, traditional cooperative sensing schemes simply combine all SUs' sensing results while ignoring the structure of sequential decision making [4], especially in a dynamic scenario where the primary channel state is time-varying and SUs arrive and leave stochastically. Moreover, negative network externality has not been considered in previous channel access methods [6]. Therefore, how SUs can sequentially learn the uncertain primary channel state and make the best channel selections while taking the negative network externality into account are challenging issues in cognitive radio networks.

In our previous work [7], we proposed a new game, called the Chinese Restaurant Game, to study how users in a social network make decisions when confronted with an uncertain network state [8][9]. The Chinese Restaurant Game originates from the well-known Chinese Restaurant Process [10], which is widely adopted in non-parametric Bayesian learning to model distributions that are not parametric but stochastic. In the Chinese Restaurant Game, there are finitely many tables with different sizes and finitely many customers sequentially requesting tables for a meal. Since customers do not know the exact size of each table, they have to learn the table sizes from some external information. Moreover, when requesting a table, each customer should consider the subsequent customers' decisions due to the limited dining space at each table, i.e., the negative network externality. Applications of the Chinese Restaurant Game in various research fields are discussed in [11] and [12].

The channel sensing and access problems in cognitive radio networks can be ideally modeled as a Chinese Restaurant Game, where the tables are the primary channels and the customers are SUs seeking vacant channels. Moreover, how a SU learns the PU's activities can be regarded as how a customer learns the restaurant state, and how a SU chooses a channel to access can be formulated as how a customer selects a table. One assumption in the Chinese Restaurant Game [7] is the fixed population setting, i.e., a finite number of customers choose tables sequentially. However, in cognitive radio networks, SUs may arrive and leave at any time, which results in a dynamic population setting. In such a case, a SU's utility will change over time due to the dynamic number of SUs in each channel.


Considering these challenges, in this paper we extend the Chinese Restaurant Game in [7] to a dynamic population setting and propose a Dynamic Chinese Restaurant Game for cognitive radio networks. In this Dynamic Chinese Restaurant Game, we consider the scenario where SUs arrive at and leave the primary network according to Poisson processes. Each newly arriving SU not only learns the channel state from the information received and revealed by former SUs, but also predicts the subsequent SUs' decisions to maximize his/her utility. We introduce a channel state learning method based on the Bayesian learning rule, where each SU constructs his/her own belief on the channel state from his/her own signal and the former SU's belief information. Moreover, we formulate the channel access problem as a Multi-dimensional Markov Decision Process (M-MDP) and design a modified value iteration algorithm to find the best strategies. We prove theoretically that there is a threshold structure in the optimal strategy profile for the two primary channel scenario. For the multiple primary channel scenario, we propose a fast algorithm with much lower computational complexity that achieves comparable performance. Finally, we conduct simulations to verify the effectiveness and efficiency of the proposed Dynamic Chinese Restaurant Game theoretic scheme.

The rest of this paper is organized as follows. Our system model is described in Section II. In Section III, we discuss how SUs learn the primary channel state using the Bayesian learning rule. We introduce the M-MDP model for the channel access problem in Section IV and the modified value iteration algorithm in Section V. Finally, we show simulation results in Section VI and draw conclusions in Section VII.

II. SYSTEM MODEL

A. Network Entity

We consider a primary network with $N$ independent primary channels, each of which can host at most $L$ users, as shown in Fig. 1. The PU has priority to occupy the channels at any time, while SUs are allowed to access a channel only under the condition that the PU's communication QoS is guaranteed [6]. We denote the primary channel state as $\boldsymbol{\theta} = \{\theta_1, \theta_2, ..., \theta_N\}$ (all subscripts in this paper index the channel number), where $\theta_i \in \{\mathcal{H}_0, \mathcal{H}_1\}$ denotes the state of channel $i$: $\mathcal{H}_0$ means the PU is absent and $\mathcal{H}_1$ means the PU is present. Note that the channel state $\theta_i(t)$ is time-varying since the PU may appear on the primary channel at any time.

[Fig. 1. System model of the cognitive radio network: channels 1 through N of the primary network, accessed by secondary users.]

For the secondary network, we assume SUs arrive and depart according to Poisson processes with rates $\lambda$ and $\mu$, respectively. All SUs can independently perform spectrum sensing using energy detection [1]. For each SU, the action set is $\mathcal{A} = \{1, 2, ..., N\}$, i.e., choosing one channel out of all $N$ channels. Let us define the grouping state when the $j$th SU arrives as $\mathbf{G}^j = (g_1^j, g_2^j, ..., g_N^j)$ (all superscripts in this paper index the SU), where $g_i^j \in [0, L]$ stands for the number of SUs in channel $i$. Assuming that the $j$th SU finally chooses channel $i$ to access, his/her utility function can be given by $U\big(\theta_i(t), g_i^j(t)\big)$, where $\theta_i(t)$ denotes the state of channel $i$ at time $t$ and $g_i^j(t)$ is the number of SUs in channel $i$ during the $j$th SU's access time. Note that the utility function is decreasing in $g_i^j(t)$, which can be regarded as the characteristic of negative network externality, since the more subsequent SUs join channel $i$, the less utility the $j$th SU can achieve. In the following analysis, the time index $(t)$ is omitted.

As discussed above, the channel state $\boldsymbol{\theta}$ changes with time. Newly arriving SUs may not know the exact state of each channel $\theta_i$. Nevertheless, SUs can estimate the state through channel sensing and the former SU's sensing result. Therefore, we assume that all SUs know the common prior distribution of the state $\theta_i$ of each channel, denoted as $\mathbf{b}^0 = \{b_i^0 \,|\, b_i^0 = \Pr(\theta_i = \mathcal{H}_0), \forall i \in 1, 2, ..., N\}$. Moreover, each SU can receive a signal $\mathbf{s}^j = \{s_i^j, \forall i \in 1, 2, ..., N\}$ through spectrum sensing, where $s_i^j = 1$ if the $j$th SU detects some activity on channel $i$ and $s_i^j = 0$ if no activity is detected on channel $i$. In such a case, the detection and false-alarm probabilities of channel $i$ can be expressed as $P_i^d = \Pr(s_i = 1 | \theta_i = \mathcal{H}_1)$ and $P_i^f = \Pr(s_i = 1 | \theta_i = \mathcal{H}_0)$, which are considered common priors for all SUs. Furthermore, we assume that there is a log-file in the server of the secondary network, which records each SU's channel belief and channel selection result. By querying this log-file, a newly arriving SU can obtain the current grouping state information, i.e., the number of SUs in each channel, as well as the former SU's belief on the channel state.

B. ON-OFF Primary Channel Model

We model the PU's behavior on a primary channel as a general alternating ON-OFF renewal process, where the ON state means the PU is present and the OFF state means the PU is absent. This general ON-OFF switch model applies even when SUs have no knowledge about the PU's communication mechanism [13]. Let $T_{ON}$ and $T_{OFF}$ denote the lengths of the ON state and OFF state, respectively. Depending on the type of primary service (e.g., digital TV broadcasting or cellular communication), $T_{ON}$ and $T_{OFF}$ statistically follow different types of distributions. Here we assume that $T_{ON}$ and $T_{OFF}$ are independent and exponentially distributed with parameters $r_1$ and $r_0$, respectively, denoted by $f_{ON}(t)$ and $f_{OFF}(t)$ as follows:

$$T_{ON} \sim f_{ON}(t) = \frac{1}{r_1} e^{-t/r_1}, \qquad T_{OFF} \sim f_{OFF}(t) = \frac{1}{r_0} e^{-t/r_0}. \tag{1}$$

In such a case, the expected lengths of the ON state and OFF state are $r_1$ and $r_0$, respectively. These two parameters $r_1$ and $r_0$ can be effectively estimated by a maximum likelihood estimator [14]. Such an ON-OFF behavior of the PU is a combination of two Poisson processes, which is a renewal process [15]. The renewal interval is $T_p = T_{ON} + T_{OFF}$ and the distribution of $T_p$, denoted by $f_p(t)$, is

$$f_p(t) = f_{ON}(t) * f_{OFF}(t). \tag{2}$$

III. BAYESIAN LEARNING FOR THE CHANNEL STATE

In this section, we discuss how a SU estimates the channel state from his/her own sensing results and the former SU's beliefs. We first introduce the concept of belief to describe a SU's uncertainty about the channel state. The belief $b_i^j$ denotes the $j$th SU's belief on the state of channel $i$, $\theta_i^j$. It is assumed that each SU reveals his/her beliefs after making the channel selection. In such a case, the $j$th SU's belief on channel $i$ is learned from the former SU's belief $b_i^{j-1}$ and his/her own signal $s_i^j$, and can be defined as

$$\mathbf{b}^j = \{b_i^j \,|\, b_i^j = \Pr(\theta_i^j = \mathcal{H}_0 | b_i^{j-1}, s_i^j), \forall i \in 1, 2, ..., N\}. \tag{3}$$

From the definition above, we can see that the belief $b_i^j \in [0, 1]$ is a continuous parameter. In a practical system, it is impossible for a SU to reveal his/her continuous belief using infinitely many data bits. Therefore, we quantize the continuous belief into $M$ belief levels $\{B_1, B_2, ..., B_M\}$: if $b_i^j \in [\frac{k-1}{M}, \frac{k}{M}]$, then $B_i^j = B_k$. Since each SU can only reveal and receive the quantized belief, the former SU's quantized belief $B^{j-1}$ is first mapped back into a continuous belief according to the rule that if $B_i^{j-1} = B_k$ then $\hat{b}_i^{j-1} = \frac{1}{2}\big(\frac{k-1}{M} + \frac{k}{M}\big)$. Note that the mapped continuous belief $\hat{b}_i^{j-1}$ here is not the former SU's real continuous belief $b_i^{j-1}$. Then, $\hat{b}^{j-1}$ is combined with the signal $s^j$ to calculate the continuous belief $b^j$. Finally, $b^j$ is quantized into the belief $B^j$. Since all primary channels are assumed to be independent, the learning processes of these channels are also independent. Thus, the learning process for the $j$th SU on channel $i$ can be summarized as $B_i^{j-1} \xrightarrow{C} \hat{b}_i^{j-1} \xrightarrow{s_i^j} b_i^j \xrightarrow{Q} B_i^j$.

In the learning process, the most important step is how to calculate the current belief $b_i^j$ from the current signal $s_i^j$ and the former SU's belief $\hat{b}_i^{j-1}$, which is a classical social learning problem. Based on the approaches to belief formation, social learning can be classified into Bayesian learning [16] and non-Bayesian learning [17]. In Bayesian learning, rational individuals use Bayes' rule to form the best estimate of the unknown parameters, such as the channel state in our model, while non-Bayesian learning requires individuals to follow some predefined rules to update their beliefs, which inevitably limits rational users' optimal decision making. Since SUs are assumed to be fully rational, they adopt Bayesian learning to update their beliefs $\mathbf{b}^j = \{b_i^j\}$ by

$$b_i^j = \frac{\Pr\big(\theta_i^j = \mathcal{H}_0 | \hat{b}_i^{j-1}\big) \Pr(s_i^j | \theta_i^j = \mathcal{H}_0)}{\sum_{l=0}^{1} \Pr\big(\theta_i^j = \mathcal{H}_l | \hat{b}_i^{j-1}\big) \Pr(s_i^j | \theta_i^j = \mathcal{H}_l)}. \tag{4}$$

As discussed in the system model, the channel state varies with time. Here, we define the state transition probability $\Pr(\theta_i^j = \mathcal{H}_0 | \theta_i^{j-1} = \mathcal{H}_0)$, which represents the probability that channel $i$ is currently available when the $j$th SU arrives, given that channel $i$ was available when the $(j-1)$th SU arrived. Similarly, we have $\Pr(\theta_i^j = \mathcal{H}_1 | \theta_i^{j-1} = \mathcal{H}_0)$, $\Pr(\theta_i^j = \mathcal{H}_0 | \theta_i^{j-1} = \mathcal{H}_1)$ and $\Pr(\theta_i^j = \mathcal{H}_1 | \theta_i^{j-1} = \mathcal{H}_1)$. In such a case, the SU can calculate the terms $\Pr\big(\theta_i^j = \mathcal{H}_0 | \hat{b}_i^{j-1}\big)$ and $\Pr\big(\theta_i^j = \mathcal{H}_1 | \hat{b}_i^{j-1}\big)$ in (4) as follows:

$$\Pr\big(\theta_i^j = \mathcal{H}_0 | \hat{b}_i^{j-1}\big) = \Pr(\theta_i^j = \mathcal{H}_0 | \theta_i^{j-1} = \mathcal{H}_0)\, \hat{b}_i^{j-1} + \Pr(\theta_i^j = \mathcal{H}_0 | \theta_i^{j-1} = \mathcal{H}_1)\, (1 - \hat{b}_i^{j-1}), \tag{5}$$

$$\Pr\big(\theta_i^j = \mathcal{H}_1 | \hat{b}_i^{j-1}\big) = \Pr(\theta_i^j = \mathcal{H}_1 | \theta_i^{j-1} = \mathcal{H}_0)\, \hat{b}_i^{j-1} + \Pr(\theta_i^j = \mathcal{H}_1 | \theta_i^{j-1} = \mathcal{H}_1)\, (1 - \hat{b}_i^{j-1}). \tag{6}$$

Since the primary channel is modeled as an ON-OFF process, the channel state transition probability depends on the time interval $t^j$ between the $(j-1)$th and $j$th SUs' arrival times. Note that $t^j$ can be directly obtained from the log-file in the server. For notational simplicity, in the rest of this paper we use $P_{00}(t^j)$, $P_{01}(t^j)$, $P_{10}(t^j)$ and $P_{11}(t^j)$ to denote $\Pr(\theta_i^j = \mathcal{H}_0 | \theta_i^{j-1} = \mathcal{H}_0)$, $\Pr(\theta_i^j = \mathcal{H}_1 | \theta_i^{j-1} = \mathcal{H}_0)$, $\Pr(\theta_i^j = \mathcal{H}_0 | \theta_i^{j-1} = \mathcal{H}_1)$ and $\Pr(\theta_i^j = \mathcal{H}_1 | \theta_i^{j-1} = \mathcal{H}_1)$, respectively, where $P_{01}(t^j) = 1 - P_{00}(t^j)$ and $P_{11}(t^j) = 1 - P_{10}(t^j)$.

We can derive the closed-form expression for $P_{01}(t^j)$ using renewal theory as follows [18]:

$$P_{01}(t^j) = \frac{r_1}{r_0 + r_1}\Big(1 - e^{-\frac{r_0+r_1}{r_0 r_1} t^j}\Big). \tag{7}$$

Thus, we have $P_{00}(t^j)$ as

$$P_{00}(t^j) = 1 - P_{01}(t^j) = \frac{r_0}{r_0 + r_1} + \frac{r_1}{r_0 + r_1}\, e^{-\frac{r_0+r_1}{r_0 r_1} t^j}. \tag{8}$$

Similarly, the closed-form expression for $P_{11}(t^j)$ can be obtained by solving the following renewal equation

$$P_{11}(t) = r_1 f_{ON}(t) + \int_0^t P_{11}(t - w)\, f_p(w)\, dw, \tag{9}$$

where $f_{ON}(t)$ is the probability density function (p.d.f.) of the ON state's length given in (1) and $f_p(t)$ is the p.d.f. of the PU's renewal interval given in (2).

By solving (9), we obtain the closed-form expression for $P_{11}(t^j)$ given by

$$P_{11}(t^j) = \frac{r_1}{r_0 + r_1} + \frac{r_0}{r_0 + r_1}\, e^{-\frac{r_0+r_1}{r_0 r_1} t^j}. \tag{10}$$

Then, we have $P_{10}(t^j)$ as

$$P_{10}(t^j) = 1 - P_{11}(t^j) = \frac{r_0}{r_0 + r_1}\Big(1 - e^{-\frac{r_0+r_1}{r_0 r_1} t^j}\Big). \tag{11}$$

By substituting (5)-(6), (7)-(8) and (10)-(11) into (4), we can calculate the $j$th SU's belief $b_i^j$ for the corresponding sensing results $s_i^j = 1$ and $s_i^j = 0$, as given in (12)-(13) below.
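To make the ON-OFF model concrete, here is a minimal simulation sketch in Python (ours, not from the paper; the function names are illustrative and the parameter values follow Section VI-A). It evolves the alternating renewal process and checks the empirical transition frequency against the closed form $P_{01}(t^j)$ of (7):

```python
# Minimal sketch (assumed parameters): simulate the exponential ON-OFF channel
# of Section II-B and compare with the closed-form P01(t) of Eq. (7).
import math
import random

r1, r0 = 50.0, 55.0   # mean ON / OFF durations in seconds (Section VI-A values)

def state_after(t, start_on):
    """Evolve the alternating ON-OFF renewal process for t seconds.
    With exponential durations the process is memoryless, so starting at a
    renewal point also models an SU observing the channel at a random time."""
    on, elapsed = start_on, 0.0
    while True:
        elapsed += random.expovariate(1.0 / (r1 if on else r0))
        if elapsed >= t:
            return on
        on = not on

def p01(t):
    """Eq. (7): Pr(ON at time t | OFF at time 0)."""
    beta = (r0 + r1) / (r0 * r1)
    return r1 / (r0 + r1) * (1.0 - math.exp(-beta * t))

t, runs = 30.0, 20000
emp = sum(state_after(t, start_on=False) for _ in range(runs)) / runs
print(f"P01({t:.0f}s): closed form {p01(t):.4f}, simulated {emp:.4f}")
```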


In the following, we denote (12) as $b_i^j|_{s_i^j=1} = \phi(\hat{b}_i^{j-1}, t^j, s_i^j = 1)$ and (13) as $b_i^j|_{s_i^j=0} = \phi(\hat{b}_i^{j-1}, t^j, s_i^j = 0)$ for simplicity.

IV. MULTI-DIMENSIONAL MARKOV DECISION PROCESS BASED CHANNEL ACCESS

In this section, we investigate the channel access game by modeling it as a Markov Decision Process (MDP) problem [19]. In this game, each SU selects a channel to access with the objective of maximizing his/her own expected utility during the access time. To achieve this goal, rational SUs need to consider not only the immediate utility but also the following SUs' decisions. In our model, SUs arrive by a Poisson process and make their channel selections sequentially. When making the decision, a SU is confronted only with the current grouping information $\mathbf{G}^j$ and belief information $\mathbf{B}^j$. In order to take into account the SU's expected utility in the future, we use the Bellman equation to formulate the SU's utility and the MDP model to formulate this channel selection problem. In the traditional MDP problem, a player can adjust his/her decision when the system state changes. However, in our system, after choosing a certain primary channel, a SU cannot adjust his/her channel selection even if the system state has already changed. Therefore, the traditional MDP cannot be directly applied here. To solve this problem, we propose a Multi-dimensional MDP (M-MDP) model and a modified value iteration method to derive the best response (strategy) for each SU.

A. System State

To construct the MDP model, we first define the system state and verify the Markov property of the state transition. Let the quantized belief $\mathbf{B} = (B_1, B_2, ..., B_N) \in [1, M]^N$ be the belief state. Thus, we can define the system state $S$ as the belief state $\mathbf{B}$ together with the grouping state $\mathbf{G} = (g_1, g_2, ..., g_N) \in [0, L]^N$, i.e., $S = (\mathbf{B}, \mathbf{G})$, where the finite state space is $\mathcal{X} = [1, M]^N \times [0, L]^N$. When the $j$th SU arrives, the system state he/she encounters is $S^j = (\mathbf{B}^j, \mathbf{G}^j)$. In such a case, with multiple SUs arriving sequentially, the system states at different arrival times $\{S^1, S^2, ..., S^j, ...\}$ form a stochastic process. In our learning rule, only the $(j-1)$th SU's belief is used to update the $j$th SU's belief; therefore, $\mathbf{B}^j$ depends only on $\mathbf{B}^{j-1}$. Moreover, since SUs arrive by a Poisson process, the grouping state $\mathbf{G}^j$ is also memoryless. In such a case, we can verify that $\{S^1, S^2, ..., S^j, ...\}$ is a Markov process.

B. Belief State Transitions

Note that a SU's belief transition is independent of his/her action and is only related to the channel state and the Bayesian learning rule. Here, we define the belief state transition probability as $P\big(\mathbf{B}^j | \mathbf{B}^{j-1}\big)$. Since all channels are independent of each other, we have

$$P\big(\mathbf{B}^j | \mathbf{B}^{j-1}\big) = \prod_{i=1}^{N} P\big(B_i^j | B_i^{j-1}\big), \tag{14}$$

where $P\big(B_i^j | B_i^{j-1}\big)$ is the belief state transition probability of channel $i$. In such a case, there is an $M \times M$ belief state transition matrix for each channel, which can be derived according to the Bayesian learning rule. To find $P\big(B_i^j = B_q | B_i^{j-1} = B_p\big)$, given the quantized belief $B_i^{j-1} = B_p$, we calculate the corresponding continuous belief $\hat{b}_i^{j-1} = \frac{1}{2}\big(\frac{p-1}{M} + \frac{p}{M}\big)$. Then, $B_i^j = B_q$ corresponds to the value interval $b_i^j \in [\frac{q-1}{M}, \frac{q}{M}]$. Thus, the belief state transition probability can be computed by

$$P\big(B_i^j = B_q | B_i^{j-1} = B_p\big) = \int_{\frac{q-1}{M}}^{\frac{q}{M}} P\big(b_i^j | \hat{b}_i^{j-1}\big)\, db_i^j. \tag{15}$$

According to (12) and (13), we have $b_i^j = \phi\big(\hat{b}_i^{j-1} = \frac{1}{2}(\frac{p-1}{M} + \frac{p}{M}),\, t^j,\, s_i^j\big)$. Therefore, the belief state transition probability can be rewritten as (16) below, where the second equality follows from the assumption that the arrival interval $t^j$ between two SUs obeys an exponential distribution with parameter $\lambda$ and is independent of the belief.

$$b_i^j\Big|_{s_i^j=1} = \frac{\Big(r_0 e^{\frac{r_0+r_1}{r_0 r_1} t^j} - r_0 + (r_1 + r_0)\hat{b}_i^{j-1}\Big) P_i^f}{\Big(r_0 e^{\frac{r_0+r_1}{r_0 r_1} t^j} - r_0 + (r_1 + r_0)\hat{b}_i^{j-1}\Big) P_i^f + \Big(r_1 e^{\frac{r_0+r_1}{r_0 r_1} t^j} + r_0 - (r_1 + r_0)\hat{b}_i^{j-1}\Big) P_i^d}, \tag{12}$$

$$b_i^j\Big|_{s_i^j=0} = \frac{\Big(r_0 e^{\frac{r_0+r_1}{r_0 r_1} t^j} - r_0 + (r_1 + r_0)\hat{b}_i^{j-1}\Big) (1 - P_i^f)}{\Big(r_0 e^{\frac{r_0+r_1}{r_0 r_1} t^j} - r_0 + (r_1 + r_0)\hat{b}_i^{j-1}\Big) (1 - P_i^f) + \Big(r_1 e^{\frac{r_0+r_1}{r_0 r_1} t^j} + r_0 - (r_1 + r_0)\hat{b}_i^{j-1}\Big) (1 - P_i^d)}. \tag{13}$$
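The learning chain $B_i^{j-1} \xrightarrow{C} \hat{b}_i^{j-1} \xrightarrow{s_i^j} b_i^j \xrightarrow{Q} B_i^j$ can be illustrated with the short Python sketch below (ours; the detection and false-alarm probabilities and the number of levels $M$ are assumed values, not taken from the paper). The function `phi()` implements the posterior of (12)-(13), and the two helpers perform the quantization and inverse mapping of Section III:

```python
# Minimal sketch (assumed Pd, Pf, M): Bayesian belief update of Eqs. (12)-(13)
# plus the quantize / map-back chain of Section III.
import math

r1, r0 = 50.0, 55.0
Pd, Pf = 0.9, 0.1     # detection / false-alarm probabilities (illustrative)
M = 5                 # number of quantized belief levels

def phi(b_hat, t, s):
    """Eqs. (12)-(13): Pr(theta = H0 | b_hat, s) after inter-arrival time t."""
    e = math.exp((r0 + r1) / (r0 * r1) * t)
    h0 = r0 * e - r0 + (r1 + r0) * b_hat   # proportional to Pr(H0 | b_hat), Eq. (18)
    h1 = r1 * e + r0 - (r1 + r0) * b_hat   # proportional to Pr(H1 | b_hat), Eq. (19)
    l0, l1 = (Pf, Pd) if s == 1 else (1.0 - Pf, 1.0 - Pd)  # signal likelihoods
    return h0 * l0 / (h0 * l0 + h1 * l1)   # Bayes' rule, Eq. (4)

def quantize(b):
    """Continuous belief in [0, 1] -> level index k in {1, ..., M}."""
    return min(M, int(b * M) + 1)

def level_to_belief(k):
    """Level B_k -> mid-point belief ((k-1)/M + k/M)/2."""
    return 0.5 * ((k - 1) / M + k / M)

# One learning step: previous SU revealed level B_4, no activity sensed.
b = phi(level_to_belief(4), t=10.0, s=0)
print(f"updated belief {b:.3f} -> quantized level B_{quantize(b)}")
```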

$$\begin{aligned} \Pr\big(B_i^j = B_q | B_i^{j-1} = B_p\big) &= \iint_{\frac{q-1}{M} \le \phi\big(\hat{b}_i^{j-1} = \frac{1}{2}(\frac{p-1}{M} + \frac{p}{M}),\, t^j,\, s_i^j\big) \le \frac{q}{M}} \Pr\big(t^j, s_i^j | \hat{b}_i^{j-1}\big)\, dt^j\, ds_i^j \\ &= \int_{\frac{q-1}{M} \le \phi(\hat{b}_i^{j-1},\, t^j,\, s_i^j = 0) \le \frac{q}{M}} \lambda e^{-\lambda t^j} \Pr\big(s_i^j = 0 | \hat{b}_i^{j-1}\big)\, dt^j + \int_{\frac{q-1}{M} \le \phi(\hat{b}_i^{j-1},\, t^j,\, s_i^j = 1) \le \frac{q}{M}} \lambda e^{-\lambda t^j} \Pr\big(s_i^j = 1 | \hat{b}_i^{j-1}\big)\, dt^j. \end{aligned} \tag{16}$$
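Because the integration regions in (16) have no simple closed form, one practical way to fill the $M \times M$ matrix of (15) is to sample the arrival interval $t^j \sim \mathrm{Exp}(\lambda)$ and the sensing result $s_i^j$ according to (20)-(21), then histogram where $\phi$ lands. The Monte Carlo sketch below is ours (it reuses `phi()`, `quantize()`, `level_to_belief()` and the constants from the previous sketch; the arrival rate `lam` is an assumed value):

```python
# Monte Carlo sketch of Eqs. (15)-(16); reuses phi(), quantize(),
# level_to_belief(), r0, r1, Pd, Pf and M from the previous sketch.
import math
import random

lam = 0.2   # assumed SU arrival rate: t ~ Exp(lam)

def prob_h0(b_hat, t):
    """Pr(theta^j = H0 | b_hat), Eq. (18), via the same terms as in phi()."""
    e = math.exp((r0 + r1) / (r0 * r1) * t)
    h0 = r0 * e - r0 + (r1 + r0) * b_hat
    h1 = r1 * e + r0 - (r1 + r0) * b_hat
    return h0 / (h0 + h1)

def belief_transition_matrix(samples=20000):
    """T[p-1][q-1] ~ Pr(B^j = B_q | B^{j-1} = B_p), estimated per Eq. (16)."""
    T = [[0.0] * M for _ in range(M)]
    for p in range(1, M + 1):
        b_hat = level_to_belief(p)
        for _ in range(samples):
            t = random.expovariate(lam)
            p0 = prob_h0(b_hat, t)
            s = 1 if random.random() < Pf * p0 + Pd * (1.0 - p0) else 0  # Eq. (21)
            T[p - 1][quantize(phi(b_hat, t, s)) - 1] += 1.0 / samples
    return T

for row in belief_transition_matrix():
    print(" ".join(f"{x:.3f}" for x in row))
```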


To calculate (16), we need to derive $\Pr\big(s_i^j | \hat{b}_i^{j-1}\big)$, which represents the distribution of the $j$th SU's received signal given the $(j-1)$th SU's belief. Note that given the current channel state $\theta_i^j$, the signal $s_i^j$ is independent of the belief $\hat{b}_i^{j-1}$. Thus, $\Pr\big(s_i^j | \hat{b}_i^{j-1}\big)$ can be calculated as follows:

$$\begin{aligned} \Pr\big(s_i^j | \hat{b}_i^{j-1}\big) &= \Pr\big(s_i^j, \theta_i^j = \mathcal{H}_0 | \hat{b}_i^{j-1}\big) + \Pr\big(s_i^j, \theta_i^j = \mathcal{H}_1 | \hat{b}_i^{j-1}\big) \\ &= \Pr(s_i^j | \theta_i^j = \mathcal{H}_0) \Pr\big(\theta_i^j = \mathcal{H}_0 | \hat{b}_i^{j-1}\big) + \Pr(s_i^j | \theta_i^j = \mathcal{H}_1) \Pr\big(\theta_i^j = \mathcal{H}_1 | \hat{b}_i^{j-1}\big). \end{aligned} \tag{17}$$

Moreover, given the previous channel state $\theta_i^{j-1}$, the current state $\theta_i^j$ is also independent of the former SU's belief $\hat{b}_i^{j-1}$. Thus, $\Pr\big(\theta_i^j = \mathcal{H}_0 | \hat{b}_i^{j-1}\big)$ in (17) can be obtained as follows:

$$\begin{aligned} \Pr\big(\theta_i^j = \mathcal{H}_0 | \hat{b}_i^{j-1}\big) &= \Pr\big(\theta_i^j = \mathcal{H}_0, \theta_i^{j-1} = \mathcal{H}_0 | \hat{b}_i^{j-1}\big) + \Pr\big(\theta_i^j = \mathcal{H}_0, \theta_i^{j-1} = \mathcal{H}_1 | \hat{b}_i^{j-1}\big) \\ &= \Pr(\theta_i^j = \mathcal{H}_0 | \theta_i^{j-1} = \mathcal{H}_0)\, \hat{b}_i^{j-1} + \Pr(\theta_i^j = \mathcal{H}_0 | \theta_i^{j-1} = \mathcal{H}_1)\, (1 - \hat{b}_i^{j-1}) \\ &= P_{00}(t^j)\, \hat{b}_i^{j-1} + P_{10}(t^j)\, (1 - \hat{b}_i^{j-1}). \end{aligned} \tag{18}$$

Similarly, for $\Pr\big(\theta_i^j = \mathcal{H}_1 | \hat{b}_i^{j-1}\big)$ we have

$$\Pr\big(\theta_i^j = \mathcal{H}_1 | \hat{b}_i^{j-1}\big) = P_{01}(t^j)\, \hat{b}_i^{j-1} + P_{11}(t^j)\, (1 - \hat{b}_i^{j-1}). \tag{19}$$

By substituting (18)-(19) into (17), the conditional distribution of the signal is obtained as (20)-(21) below, with which we can calculate the transition probability matrix using (16).

C. Actions and System State Transitions

The finite action set for SUs is the set of $N$ channels, i.e., $\mathcal{A} = \{1, 2, ..., N\}$. Let $a \in \mathcal{A}$ denote a new SU's action under the system state $S = (\mathbf{B}, \mathbf{G})$, and let $P\big(S' = (\mathbf{B}', \mathbf{G}') | S = (\mathbf{B}, \mathbf{G}), a\big)$ denote the probability that action $a$ in state $S$ will lead to state $S'$. Since the SU's belief transition is independent of his/her action, we have

$$P\big(S' = (\mathbf{B}', \mathbf{G}') | S = (\mathbf{B}, \mathbf{G}), a\big) = P\big(\mathbf{B}' | \mathbf{B}\big)\, P\big(\mathbf{G}' | \mathbf{G}, a\big), \tag{22}$$

where $P\big(\mathbf{G}' | \mathbf{G}, a\big)$ is the system grouping state transition probability. Suppose that $\mathbf{G} = (g_1, g_2, ..., g_N)$. If a new SU arrives and accesses channel $i$, i.e., $a = i$, we have the system state transition probabilities in (23)-(24) below, where $\lambda$ is the SUs' arrival rate. If no SU arrives but some SU leaves a channel at state $\mathbf{G}$, we have the remaining system state transition probabilities in (25)-(26) below, where $\mu$ is the SUs' departure rate. Note that $\lambda$ and $\mu$ are normalized such that $\lambda + NL\mu \le 1$. In such a case, the system state transition probabilities $P(S'|S)$ form an $M^N(L+1)^N \times M^N(L+1)^N$ state transition matrix for a given action $a$.

D. Expected Utility

The immediate utility of a SU in channel $i$ at state $S$ is

$$U_i(S) = \hat{b}_i \cdot R_i(g_i), \tag{27}$$

where $\hat{b}_i$ is the continuous belief mapped from $B_i$ and $R_i$ is a decreasing function of the number of SUs in channel $i$, $g_i$. In general, each SU will access the selected channel for a period of time, during which the system state may change. Therefore, when making the channel selection, SUs should consider not only the immediate utility but also the future utility. Here, we define a SU's expected utility in channel $i$, $V_i(S)$, based on the Bellman equation [19] as follows:

$$V_i(S) = U_i(S) + (1 - \mu) \sum_{S' \in \mathcal{X}} P_i(S'|S)\, V_i(S'), \tag{28}$$

where $(1 - \mu)$ is the discount factor, which can be regarded as the probability that the SU keeps staying in the selected channel since $\mu$ is the departure probability, and $P_i(S'|S)$ is the state transition probability defined as

$$P_i\big(S' = (\mathbf{B}', \mathbf{G}') | S = (\mathbf{B}, \mathbf{G})\big) = P\big(\mathbf{B}' | \mathbf{B}\big) \cdot P_i\big(\mathbf{G}' | \mathbf{G}\big), \tag{29}$$

where $P\big(\mathbf{B}' | \mathbf{B}\big)$ is the belief state transition probability and $P_i\big(\mathbf{G}' | \mathbf{G}\big)$ is the grouping state transition probability conditioned on the event that the SUs in channel $i$ stay in channel $i$ in the next state $S'$, which is different from $P\big(\mathbf{G}' | \mathbf{G}\big)$ in (23)-(26). Note that $P_i\big(\mathbf{G}' | \mathbf{G}\big)$ is closely related to the newly arriving SU's action. Suppose that the new SU's action is $a_S = k$, i.e., he/she accesses channel $k$ at state $S$; then we have the state transition probability in (30) below. For the leaving transition probability, since we have already accounted for the discount factor $(1 - \mu)$ in the future utility, i.e., the tagged SU does not leave the channel, we have the state transition probabilities in (31)-(33) below, where the term $(g_i - 1)$ appears because the grouping in channel $i$, $g_i$, already includes this SU, who will not leave the channel at state $S$.

$$\Pr\big(s_i^j = 0 | \hat{b}_i^{j-1}\big) = \big(1 - P_i^f\big)\Big[P_{00}(t^j)\, \hat{b}_i^{j-1} + P_{10}(t^j)\, (1 - \hat{b}_i^{j-1})\Big] + \big(1 - P_i^d\big)\Big[P_{01}(t^j)\, \hat{b}_i^{j-1} + P_{11}(t^j)\, (1 - \hat{b}_i^{j-1})\Big], \tag{20}$$

$$\Pr\big(s_i^j = 1 | \hat{b}_i^{j-1}\big) = P_i^f\Big[P_{00}(t^j)\, \hat{b}_i^{j-1} + P_{10}(t^j)\, (1 - \hat{b}_i^{j-1})\Big] + P_i^d\Big[P_{01}(t^j)\, \hat{b}_i^{j-1} + P_{11}(t^j)\, (1 - \hat{b}_i^{j-1})\Big]. \tag{21}$$

$$P\big(\mathbf{G}' = (g_1, g_2, ..., g_i + 1, ..., g_N) \,|\, \mathbf{G} = (g_1, g_2, ..., g_i, ..., g_N),\, a = i\big) = \lambda, \tag{23}$$

$$P\big(\mathbf{G}' = (g_1, g_2, ..., g_i + 1, ..., g_N) \,|\, \mathbf{G} = (g_1, g_2, ..., g_i, ..., g_N),\, a \neq i\big) = 0, \tag{24}$$

$$P\big(\mathbf{G}' = (g_1, g_2, ..., g_i - 1, ..., g_N) \,|\, \mathbf{G} = (g_1, g_2, ..., g_i, ..., g_N)\big) = g_i \mu, \quad \forall i \in [1, N], \tag{25}$$

$$P\big(\mathbf{G}' = \mathbf{G} \,|\, \mathbf{G} = (g_1, g_2, ..., g_i, ..., g_N)\big) = 1 - \lambda - \sum_{i=1}^{N} g_i \mu. \tag{26}$$
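The grouping-state kernel (23)-(26) is sparse: at most $N + 1$ successor states have non-zero probability. Below is a minimal sketch of this kernel (ours; channel indices are 0-based, and since the paper does not say what happens to an arrival when the chosen channel is already full, that corner case is folded into the self-transition here):

```python
# Minimal sketch of the grouping transitions of Eqs. (23)-(26). Assumes the
# paper's normalization lam + N*L*mu <= 1, so the returned values are
# probabilities. Eq. (24) (arrival into a channel other than a) is implicit:
# such states simply get no entry.
def grouping_transitions(G, a, lam, mu, L):
    """Return {G': prob} for grouping state G (tuple) and new-SU action a."""
    out = {}
    if G[a] < L:                       # Eq. (23): the arriving SU joins channel a
        Gp = list(G)
        Gp[a] += 1
        out[tuple(Gp)] = lam
    for i, g in enumerate(G):          # Eq. (25): one of the g_i SUs departs channel i
        if g > 0:
            Gm = list(G)
            Gm[i] -= 1
            out[tuple(Gm)] = g * mu
    out[G] = 1.0 - sum(out.values())   # Eq. (26): no arrival and no departure
    return out

print(grouping_transitions((2, 1), a=0, lam=0.2, mu=0.05, L=3))
```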


In such a case, we have a multi-dimensional expected utility function set in (34), where $\mathbf{P}_i(S'|S) = \{P_i(S'|S) \,|\, \forall S' \in \mathcal{X}\}$ and $\mathbf{V}_i(S') = \{V_i(S') \,|\, \forall S' \in \mathcal{X}\}$.

E. Best Strategy

The strategy profile $\pi = \{a_S \,|\, \forall S \in \mathcal{X}\}$ is a mapping from the state space to the action space, i.e., $\pi: \mathcal{X} \to \mathcal{A}$. Due to the selfish nature of SUs, each SU will choose the best strategy to maximize his/her own expected utility. Suppose that a SU arrives at the primary network with system state $S = \big(\mathbf{B}, \mathbf{G} = (g_1, g_2, ..., g_i, ..., g_N)\big)$; his/her best strategy can be defined as

$$a_S = \arg\max_{i \in [1, N]} V_i\big(\mathbf{B}, \mathbf{G} = (g_1, ..., g_i + 1, ..., g_N)\big). \tag{35}$$

Since the strategy profile satisfying (34) and (35), denoted by $\pi^*$, maximizes every arriving SU's utility, $\pi^*$ is a Nash equilibrium of the proposed game.

V. MODIFIED VALUE ITERATION ALGORITHM

As discussed at the beginning of Section IV, although the channel access problem can be modeled as an MDP problem, it differs from the traditional MDP problem in that a SU cannot adjust his/her action even when the system state changes. In the traditional MDP problem, there is only one Bellman equation associated with each system state, and the optimal strategy is directly obtained by optimizing the Bellman equation. In our Multi-dimensional MDP problem, there is a set of Bellman equations as shown in (34), and the optimal strategy profile should satisfy (34) and (35) simultaneously. Therefore, the traditional dynamic programming method in [19] cannot be directly applied. To solve this problem, we design a modified value iteration algorithm.

Given an initial strategy profile $\pi$, the conditional state transition probability $P_i(S'|S)$ can be calculated by (29)-(33), and thus the conditional expected utility $V_i(S)$ can be found by (34). Then, with $V_i(S)$, the strategy profile $\pi$ can be updated using (35). In such an iterative way, we can finally find the optimal strategy $\pi^*$. The proposed modified value iteration algorithm for the Multi-dimensional MDP problem is summarized in Algorithm 1.

Algorithm 1 Modified Value Iteration Algorithm for the Multi-dimensional MDP Problem
1: Given tolerances $\eta_1$ and $\eta_2$, initialize $\epsilon_1$ and $\epsilon_2$.
2: Initialize $V_i^{(0)}(S) = 0$, $\forall S \in \mathcal{X}$, and randomize $\pi = \{a_S, \forall S \in \mathcal{X}\}$.
3: while $\epsilon_1 > \eta_1$ or $\epsilon_2 > \eta_2$ do
4:   for all $S \in \mathcal{X}$ do
5:     Calculate the transition probability $P_i(S'|S)$, $\forall i \in [1, N]$, using $\pi$ and (29)-(33).
6:     Update the utility function $V_i^{(n+1)}(S)$, $\forall i \in [1, N]$, using (34).
7:   end for
8:   for all $S \in \mathcal{X}$ do
9:     Update $\pi' = \{a_S\}$ using (35).
10:  end for
11:  Update the parameter $\epsilon_1$ by $\epsilon_1 = \|\pi' - \pi\|_2$.
12:  Update the parameter $\epsilon_2$ by $\epsilon_2 = \|V_i^{(n+1)}(S) - V_i^{(n)}(S)\|_2$.
13:  Update the strategy profile $\pi = \pi'$.
14: end while
15: The optimal strategy profile is $\pi^* = \pi$.

A. Channel Access: Two Primary Channels Case

In this subsection, we discuss the case of two primary channels. In such a case, the system state is $S = (B_1, B_2, g_1, g_2)$, where $B_1$ and $B_2$ are the beliefs of the two channels and $g_1$ and $g_2$ are the numbers of SUs in the two channels. We define the SUs' immediate utility in channel $i$, $U(B_i, g_i)$, as

$$U(B_i, g_i) = \hat{b}_i R(g_i) = \hat{b}_i \log\Big(1 + \frac{\mathrm{SNR}}{(g_i - 1)\mathrm{INR} + 1}\Big), \tag{36}$$

where $\hat{b}_i$ is the continuous mapping of the quantized belief $B_i$. According to (34), the expected utility functions of the two channels can be written as

$$V_1(S) = U(B_1, g_1) + (1 - \mu) \sum_{S'} P_1(S'|S)\, V_1(S'), \tag{37}$$

$$V_2(S) = U(B_2, g_2) + (1 - \mu) \sum_{S'} P_2(S'|S)\, V_2(S'), \tag{38}$$

$$P_i\big(\mathbf{G}' = (g_1, ..., g_k + 1, ..., g_N) \,|\, \mathbf{G} = (g_1, ..., g_k, ..., g_N)\big) = \lambda, \tag{30}$$

$$P_i\big(\mathbf{G}' = (g_1, g_2, ..., g_i - 1, ..., g_N) \,|\, \mathbf{G} = (g_1, g_2, ..., g_i, ..., g_N)\big) = (g_i - 1)\mu, \tag{31}$$

$$P_i\big(\mathbf{G}' = (g_1, g_2, ..., g_{i'} - 1, ..., g_N) \,|\, \mathbf{G} = (g_1, g_2, ..., g_{i'}, ..., g_N)\big) = g_{i'} \mu, \quad \forall i' \in [1, N],\ i' \neq i, \tag{32}$$

$$P_i\big(\mathbf{G}' = \mathbf{G} \,|\, \mathbf{G} = (g_1, g_2, ..., g_N)\big) = 1 - \lambda - \Big(\sum_{i'=1}^{N} g_{i'} - 1\Big)\mu. \tag{33}$$

$$\begin{bmatrix} V_1(S) \\ V_2(S) \\ \vdots \\ V_N(S) \end{bmatrix} = \begin{bmatrix} U_1(S) \\ U_2(S) \\ \vdots \\ U_N(S) \end{bmatrix} + (1 - \mu) \begin{bmatrix} \mathbf{P}_1(S'|S) & \mathbf{0} & \cdots & \mathbf{0} \\ \mathbf{0} & \mathbf{P}_2(S'|S) & \cdots & \mathbf{0} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{0} & \mathbf{0} & \cdots & \mathbf{P}_N(S'|S) \end{bmatrix} \begin{bmatrix} \mathbf{V}_1(S')^T \\ \mathbf{V}_2(S')^T \\ \vdots \\ \mathbf{V}_N(S')^T \end{bmatrix}. \tag{34}$$
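To illustrate how the modified value iteration couples the Bellman evaluation (34) with the strategy update (35), the toy sketch below (ours) runs the same two-level loop of Algorithm 1 on an abstract M-MDP. The kernel `build_Pi()` is only a stand-in for the policy-dependent transition probabilities (29)-(33), and the direct argmax over $V_i(s)$ abbreviates the post-join evaluation $V_i(\mathbf{B}, \mathbf{G} + \mathbf{e}_i)$ of (35):

```python
# Toy sketch of Algorithm 1: alternate policy evaluation (Eq. (34)) and
# policy update (Eq. (35)) until the strategy profile stops changing.
import random

N_CH, N_STATES, mu = 2, 6, 0.05        # assumed toy sizes
random.seed(1)
U = [[random.random() for _ in range(N_STATES)] for _ in range(N_CH)]

def build_Pi(i, policy):
    """Stand-in for Eqs. (29)-(33): a policy-dependent row-stochastic kernel."""
    P = [[1.5 if sp == (policy[s] + i) % N_STATES else 0.5
          for sp in range(N_STATES)] for s in range(N_STATES)]
    return [[x / sum(row) for x in row] for row in P]   # normalize rows

policy = [random.randrange(N_CH) for _ in range(N_STATES)]
V = [[0.0] * N_STATES for _ in range(N_CH)]
for _ in range(100):                                    # outer loop of Algorithm 1
    P = [build_Pi(i, policy) for i in range(N_CH)]
    for _ in range(60):                                 # evaluate Eq. (34)
        V = [[U[i][s] + (1 - mu) * sum(P[i][s][sp] * V[i][sp]
              for sp in range(N_STATES)) for s in range(N_STATES)]
             for i in range(N_CH)]
    new_policy = [max(range(N_CH), key=lambda i: V[i][s])   # Eq. (35)
                  for s in range(N_STATES)]
    if new_policy == policy:                            # eps_1 = 0: converged
        break
    policy = new_policy
print("equilibrium strategy profile:", policy)
```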


where $S = (B_1, B_2, g_1, g_2)$ is the system state, and $P_1$ and $P_2$ are the state transition probabilities conditioned on the event that SUs stay in the channels they have chosen, which can be calculated according to (29)-(33).

According to (35), the best strategy $a_S$ for a SU arriving with system state $S = (B_1, B_2, g_1, g_2)$ is as follows:

$$a_S = \begin{cases} 1, & V_1(B_1, B_2, g_1 + 1, g_2) \ge V_2(B_1, B_2, g_1, g_2 + 1), \\ 2, & V_1(B_1, B_2, g_1 + 1, g_2) < V_2(B_1, B_2, g_1, g_2 + 1). \end{cases} \tag{39}$$

In the following, we show that, given the beliefs of the two channels, there exists a threshold structure in the optimal strategy profile $\pi^*$.

Lemma 1: For $g_1 \ge 0$ and $g_2 \ge 1$,

$$V_1(B_1, B_2, g_1, g_2) \ge V_1(B_1, B_2, g_1 + 1, g_2 - 1), \tag{40}$$

$$V_2(B_1, B_2, g_1, g_2) \le V_2(B_1, B_2, g_1 + 1, g_2 - 1). \tag{41}$$

Proof: Due to the page limitation, we give the proof in the supplementary information [20].

Lemma 1 shows that, given the beliefs of the two channels, $V_1$ is non-decreasing and $V_2$ is non-increasing along the line $g_1 + g_2 = m$, $\forall m \in \{0, 1, ..., 2L\}$, as $g_2$ increases. Based on Lemma 1, we show the threshold structure of the optimal strategy profile $\pi^*$ in Theorem 1.

Theorem 1: For the two-channel case, given the belief state, the optimal strategy profile $\pi^* = \{a_S\}$ derived from the modified value iteration algorithm has a threshold structure.

Proof: Due to the page limitation, we give the proof in the supplementary information [20].

Note that the optimal strategy profile $\pi^*$ can be obtained off-line and stored in a table for the SUs. For a specific belief state, the number of system states is $(L+1)^2$, which means the corresponding strategy profile contains $(L+1)^2$ strategies. With the proved threshold structure on each line $g_1 + g_2 = m$, $\forall m \in \{0, 1, 2, ..., 2L\}$, we only need to store the threshold point on each line. In such a case, the storage of the strategy profile can be reduced from $O(L^2)$ to $O(2L)$.

B. Channel Access: Multiple Primary Channels Case

In this subsection, we discuss the case of multiple primary channels. Although the optimal strategy profile of the multi-channel case can also be obtained using Algorithm 1, the computational complexity grows exponentially with the number of primary channels $N$. Besides, the storage and retrieval of the strategy profile are also challenging when the number of system states increases exponentially with $N$. Therefore, it is important to develop a fast algorithm for the multi-channel case.

Suppose the channel number $N$ is even. We can randomly divide the $N$ primary channels into $N/2$ pairs; for each pair, SUs choose one channel using the threshold strategy derived in the previous subsection. Then, SUs further divide the selected $N/2$ channels into $N/4$ pairs, and so on. In such a case, SUs finally select one suboptimal channel to access. If the channel number $N$ is odd, the suboptimal channel can be selected in a similar way. With such an iterative dichotomy method, a SU can find one suboptimal primary channel in only $\log N$ steps, where the complexity of each step is the same as that of the two-channel case. This fast algorithm is summarized in Algorithm 2, and a code sketch of the dichotomy is given below. In the simulation section, we compare the performance of this fast algorithm with that of the optimal algorithm using the modified value iteration method.

Algorithm 2 Fast Algorithm for the Multi-channel Case
1: if $N$ is even then
2:   while $N > 1$ do
3:     Randomly divide the $N$ channels into $N/2$ pairs.
4:     for all $N/2$ pairs do
5:       Select one channel according to Algorithm 1.
6:     end for
7:     $N = N/2$.
8:   end while
9: else
10:  while $N > 1$ do
11:    Randomly divide the $N$ primary channels into $(N-1)/2$ pairs and one channel.
12:    for all $(N-1)/2$ pairs do
13:      Select one channel according to Algorithm 1.
14:    end for
15:    $N = (N-1)/2 + 1$.
16:  end while
17: end if
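The dichotomy of Algorithm 2 can be sketched compactly as follows (ours; `two_channel_pick()` is a stand-in for the two-channel threshold rule of (39), and the score-based comparator is purely illustrative):

```python
# Sketch of Algorithm 2: repeatedly pair channels at random and keep the
# winner of each pairwise comparison; about log2(N) rounds select a channel.
import random

def two_channel_pick(i, j, state):
    """Stand-in for Eq. (39): the real rule compares V1 and V2."""
    return i if state["score"][i] >= state["score"][j] else j

def fast_select(channels, state):
    chans = list(channels)
    while len(chans) > 1:
        random.shuffle(chans)
        nxt = [two_channel_pick(chans[k], chans[k + 1], state)
               for k in range(0, len(chans) - 1, 2)]
        if len(chans) % 2:            # odd round: the unpaired channel survives
            nxt.append(chans[-1])
        chans = nxt
    return chans[0]

state = {"score": {c: random.random() for c in range(8)}}   # toy inputs
print("selected channel:", fast_select(range(8), state))
```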
VI. SIMULATION RESULTS

In this section, we conduct simulations to evaluate the performance of the proposed scheme in cognitive radio networks. Specifically, we evaluate the performance of channel sensing and access, as well as the interference to the PU.

A. Bayesian Channel Sensing

In this simulation, we evaluate the performance of channel sensing with Bayesian learning. We first generate one primary channel based on the ON-OFF model, with channel parameters $r_0 = 55$ s and $r_1 = 50$ s. Then, a number of SUs with some arrival rate $\lambda$ sequentially sense the primary channel and construct their own beliefs by combining their sensing results with the former SU's belief. In Fig. 2, we compare the detection and false-alarm probabilities of channel sensing with Bayesian learning and sensing without learning under different arrival rates $\lambda$. Overall, the detection probability is enhanced and the false-alarm probability decreases when Bayesian learning is adopted. Moreover, we can see that with Bayesian learning, the larger the arrival rate $\lambda$, the higher the detection probability and the lower the false-alarm probability. This is because a larger $\lambda$ means a shorter arrival interval between two SUs, and thus the former SU's belief information is more useful for the current SU's belief construction.


[Fig. 2. Detection and false-alarm probability.]

[Fig. 3. Social welfare under 2-channel with M = 2 and L = 2.]

B. Channel Access of the Two Primary Channel Case

In this subsection, we evaluate the performance of the proposed Multi-dimensional MDP model, as well as the modified value iteration algorithm, for the two-channel case. The parameters of the two primary channels are set as follows: for channel 1, $r_0 = 55$ s and $r_1 = 25$ s; for channel 2, $r_0 = 25$ s and $r_1 = 55$ s, which means channel 1 is statistically better than channel 2. In the simulation, our proposed strategy is compared with the centralized strategy, the myopic strategy and the random strategy in terms of social welfare. We first define the social welfare $W$, given a strategy profile $\pi = \{a_S, \forall S \in \mathcal{X}\}$, as

$$W = \sum_{S \in \mathcal{X}} \sigma^{\pi}(S) \big[g_1 U(B_1, g_1) + g_2 U(B_2, g_2)\big], \tag{42}$$

where $S = (B_1, B_2, g_1, g_2)$ in the two-channel case and $\sigma^{\pi}(S)$ is the stationary probability of state $S$. The four strategies we test are defined as follows.

- Proposed strategy: obtained by our modified value iteration algorithm in Algorithm 1.
- Centralized strategy: obtained by exhaustively searching all possible $2^{|\mathcal{X}|}$ strategy profiles to maximize the social welfare, i.e., $\pi^c = \arg\max_{\pi} W^{\pi}$, where the superscript $c$ means centralized strategy. We can see that finding the centralized strategy is NP-hard.
- Myopic strategy: maximizes the immediate utility, i.e., chooses the channel with the largest immediate reward, $\pi^m = \{a_S = \arg\max_{i \in [1,2]} U(B_i, g_i), \forall S \in \mathcal{X}\}$, where the superscript $m$ means myopic strategy.
- Random strategy: randomly chooses one channel with equal probability 0.5, i.e., $\pi^r = \{a_S = \mathrm{rand}(1, 2), \forall S \in \mathcal{X}\}$, where the superscript $r$ means random strategy.

In the simulation, we use the myopic strategy as the comparison baseline and show the results by normalizing the performance of each strategy by that of the myopic strategy. In Fig. 3, we evaluate the social welfare performance of the different methods. Due to the extremely high complexity of the centralized strategy, we consider the case with 2 belief levels and at most 2 SUs in each channel, i.e., $M = 2$ and $L = 2$. Note that if $M = 2$ and $L = 3$, there are in total $2^{2^2 \cdot (3+1)^2} = 2^{64}$ possible strategy profiles to verify, which is computationally intractable. Therefore, although it slightly outperforms our proposed strategy as shown in Fig. 3, the centralized method is not applicable to time-varying primary channels. Moreover, we also compare the proposed strategy with the myopic and random strategies for the case with $M = 5$ and $L = 5$ in Fig. 4; the proposed strategy performs the best among all the strategies.

[Fig. 4. Social welfare under 2-channel with M = 5 and L = 5.]

We verify that the proposed strategy is a Nash equilibrium (NE) by simulating a newly arriving SU's expected utility in Fig. 5. The deviation probability on the x-axis stands for the probability that a newly arriving SU deviates from the proposed strategy or the centralized strategy. We can see that when there is no deviation, our proposed strategy performs better than the centralized strategy. This is because the centralized strategy maximizes the social welfare and thus sacrifices the newly arriving SU's expected utility. Moreover, the expected utility of a newly arriving SU decreases as the deviation probability increases, which verifies that the proposed strategy is a NE. On the other hand, by deviating from the centralized strategy, a newly arriving SU can obtain a higher expected utility, which means that the centralized strategy is not a NE and SUs have an incentive to deviate from it.
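For reference, evaluating the social welfare (42) of a given strategy profile amounts to computing the stationary distribution $\sigma^{\pi}$ of the chain induced by $\pi$ and averaging the per-state total utility. A minimal sketch follows (ours; the random toy chain and utilities stand in for the actual state transition matrix under $\pi$ and for $g_1 U(B_1, g_1) + g_2 U(B_2, g_2)$):

```python
# Sketch of Eq. (42): sigma_pi by power iteration, then W = sum_S sigma(S) u(S).
import random

random.seed(2)
n = 8                                                  # toy number of system states
P_pi = [[random.random() for _ in range(n)] for _ in range(n)]
P_pi = [[x / sum(row) for x in row] for row in P_pi]   # row-stochastic toy chain
util = [random.random() for _ in range(n)]             # total utility per state

sigma = [1.0 / n] * n
for _ in range(500):                                   # power iteration: sigma <- sigma * P
    sigma = [sum(sigma[s] * P_pi[s][sp] for s in range(n)) for sp in range(n)]

W = sum(sigma[s] * util[s] for s in range(n))          # Eq. (42)
print(f"social welfare W = {W:.4f}")
```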


[Fig. 5. NE verification under 2-channel with M = 2 and L = 2.]

C. Fast Algorithm for Multiple Channel Access

In this simulation, we evaluate the performance of the proposed fast algorithm for the multi-channel case, which is denoted as the suboptimal strategy hereafter. In Fig. 6, the suboptimal strategy is compared with the proposed strategy, the myopic strategy and the random strategy in terms of social welfare for the 3-channel case, where the channel parameters are set as follows: for channel 1, $r_0 = 55$ s and $r_1 = 25$ s; for channel 2, $r_0 = 45$ s and $r_1 = 40$ s; for channel 3, $r_0 = 25$ s and $r_1 = 55$ s. We can see that the suboptimal strategy achieves a social welfare very close to that of the optimal one, i.e., the proposed strategy using modified value iteration, and is still better than the myopic and random strategies. Therefore, considering its low complexity, it is more practical to use the suboptimal strategy in the multi-channel case.

[Fig. 6. Social welfare under 3-channel with M = 5 and L = 5.]

VII. CONCLUSION

In this paper, we extended the previous Chinese Restaurant Game work [7] to a dynamic population setting and proposed a Dynamic Chinese Restaurant Game for cognitive radio networks. Based on the Bayesian learning rule, we introduced a channel state learning method for SUs to estimate the primary channel state. We modeled the channel access problem as a Multi-dimensional MDP and designed a modified value iteration algorithm to find the optimal strategy. The simulation results show that, compared with the centralized approach that maximizes the social welfare at an intractable computational cost, the proposed scheme achieves comparable social welfare performance with much lower complexity, while compared with the random and myopic strategies, the proposed scheme achieves much better social welfare performance. Moreover, the proposed scheme maximizes a newly arriving SU's expected utility and thus achieves a Nash equilibrium where no user has an incentive to deviate.

REFERENCES

[1] B. Wang and K. J. R. Liu, "Advances in cognitive radios: A survey," IEEE J. Sel. Topics Signal Process., vol. 5, no. 1, pp. 5-23, 2011.
[2] K. J. R. Liu and B. Wang, Cognitive Radio Networking and Security: A Game Theoretical View. Cambridge University Press, 2010.
[3] B. Wang, Y. Wu, and K. J. R. Liu, "Game theory for cognitive radio networks: An overview," Computer Networks, vol. 54, no. 14, pp. 2537-2561, 2010.
[4] I. F. Akyildiz, B. F. Lo, and R. Balakrishnan, "Cooperative spectrum sensing in cognitive radio networks: A survey," Physical Communication, vol. 4, no. 3, pp. 40-62, 2011.
[5] W. H. Sandholm, "Negative externalities and evolutionary implementation," The Review of Economic Studies, vol. 72, no. 3, pp. 885-915, 2005.
[6] I. F. Akyildiz, W.-Y. Lee, M. C. Vuran, and S. Mohanty, "Next generation/dynamic spectrum access/cognitive radio wireless networks: A survey," Computer Networks, vol. 50, no. 9, pp. 2127-2159, 2006.
[7] C.-Y. Wang, Y. Chen, and K. J. R. Liu, "Chinese restaurant game - part I: Theory of learning with negative network externality," http://arxiv.org/abs/1112.2188.
[8] H. V. Zhao, W. S. Lin, and K. J. R. Liu, Behavior Dynamics in Media-Sharing Social Networks. Cambridge University Press, 2011.
[9] Y. Chen and K. J. R. Liu, "Understanding microeconomic behaviors in social networking: An engineering view," IEEE Signal Process. Mag., vol. 29, no. 2, pp. 53-64, 2012.
[10] D. Aldous, I. Ibragimov, and J. Jacod, "Exchangeability and related topics," Lecture Notes in Mathematics, vol. 1117, 1985.
[11] C.-Y. Wang, Y. Chen, and K. J. R. Liu, "Chinese restaurant game - part II: Applications to wireless networking, cloud computing, and online social networking," http://arxiv.org/abs/1112.2187.
[12] B. Zhang, Y. Chen, C.-Y. Wang, and K. J. R. Liu, "Learning and decision making with negative externality for opportunistic spectrum access," in Proc. IEEE Globecom, 2012, pp. 1-6.
[13] C. Jiang, Y. Chen, K. J. R. Liu, and Y. Ren, "Analysis of interference in cognitive radio networks with unknown primary behavior," in Proc. IEEE ICC, 2012, pp. 1746-1750.
[14] H. Kim and K. G. Shin, "Efficient discovery of spectrum opportunities with MAC-layer sensing in cognitive radio networks," IEEE Trans. Mobile Computing, vol. 7, no. 5, pp. 533-545, 2008.
[15] D. R. Cox, Renewal Theory. Butler and Tanner, 1967.
[16] D. Gale and S. Kariv, "Bayesian learning in social networks," Games and Economic Behavior, vol. 45, no. 11, pp. 329-346, 2003.
[17] L. G. Epstein, J. Noor, and A. Sandroni, "Non-Bayesian learning," The B.E. Journal of Theoretical Economics, vol. 10, no. 1, pp. 1-16, 2010.
[18] C. Jiang, Y. Chen, and K. J. R. Liu, "A renewal-theoretical framework for dynamic spectrum access with unknown primary behavior," in Proc. IEEE Globecom, 2012, pp. 1-6.
[19] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 1994.
[20] C. Jiang, Y. Chen, Y.-H. Yang, C.-Y. Wang, and K. J. R. Liu, "Supplementary information for dynamic Chinese restaurant game in cognitive radio networks," [Online]. Available: http://www.sig.umd.edu/jcx/INFOCOM2013SuppInfo.pdf.
