Performance Evaluation of Queueing Networks - Outline

• Introduction - networks of queues are the example family of systems to be studied
• Deterministic models, including network calculus
• Dynamic routing
• Review of elements of probability and statistics & statistical confidence; overview of simulation

• Stationary models
• Markovian models in continuous and discrete time
• Caching
• Fork-join queues
• Markov decision processes
• Constrained optimization and duality, with examples

November 9, 2020 George Kesidis


Performance Evaluation of Queueing Networks - Outline (cont)

• Queueing system models have been used in a wide range of applications including computer/communication networking, computation, supply chain and logistics.

• The focus of this course will be (unambiguous) theoretical derivations of performance objectives based on models of queueing systems and their workloads.

• To this end, we will review the basic, relevant elements of probability theory.
• We will also discuss performance evaluation based on simulation.
• Simulation is useful when system or workload complexity precludes simple models that lead to closed-form analytical results for the performance objectives.

• We also will review the use of statistical confidence when reporting the results of a simulation study.


Performance Evaluation of Queueing Networks - Outline (cont)

• In the following, our approach to performance evaluation will be to consider models of increasing detail:
  1. deterministic, including worst-case analysis
  2. stationary and ergodic
  3. stationary Markovian

• We will demonstrate how increased model complexity (assumed suitable for the physical system under consideration) leads to more refined and detailed performance results.


Deterministic models of queues and queueing networks

• Arrivals, departures and queue occupancy
• Traffic shaping - token buckets, service curves
• Flow
• Network calculus


Queues - preliminaries

• A queue or buffer is simply a waiting room with an identified arrival process and departure (completed "jobs") process.

• Work is performed on jobs by servers according to a service policy.
• In some applications, jobs arriving to the queue will be packets of information; in others, the arrivals will represent calls attempting to be set up in the network.

• Some jobs may be blocked from entering the queue (if the queue's waiting room is full) or join the queue and be expelled from the queue before reaching the server.

• For jobs reaching the server, their queueing delay plus service time is called their sojourn time, i.e., the time between the arrival of the job to the queue and its departure from the server.

• We will consider queues that serve jobs in the order of their arrival, known as first come, first served (FCFS) or first in, first out (FIFO).


Arrivals, departures, and queue occupancy

• Over the time interval (0,t], the counting processes
  – {A(0,t] | t ∈ R+} represents the number of jobs arriving at the queue,
  – {D(0,t] | t ∈ R+} represents the number of departures from the queue,
  – {L(0,t] | t ∈ R+} represents the number of jobs lost (blocked upon arrival or pushed-out/preempted).

• Let Q(t) be the number of jobs in the queueing system at time t, i.e.,
  – the occupancy of the queue plus the number of jobs being served at time t,
  – including the arrivals at t but not the departures at t.

• We assume no jobs with zero sojourn time.


Arrivals, departures, and queue occupancy (cont)

• Clearly, a previously "arrived" job is either queued, has departed, or has been blocked, i.e.,
  Q(0) + A(0,t] = Q(t) + D(0,t] + L(0,t].

• If we take the origin of time to be −∞, we can simply write
  Q(t) = A(−∞,t] − D(−∞,t] − L(−∞,t].
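This conservation law can be checked numerically on a toy sample path. The following is a minimal sketch (the event-list representation and helper names are ours, not from the text):

```python
# Sketch (names ours): checking Q(0) + A(0,t] = Q(t) + D(0,t] + L(0,t].
# Each event is (time, kind) with kind in {"arr", "dep", "loss"}.

def counts(events, t):
    """Counting processes A(0,t], D(0,t], L(0,t] over (0, t]."""
    A = sum(1 for s, k in events if 0 < s <= t and k == "arr")
    D = sum(1 for s, k in events if 0 < s <= t and k == "dep")
    L = sum(1 for s, k in events if 0 < s <= t and k == "loss")
    return A, D, L

def occupancy(events, t, Q0=0):
    """Q(t): includes arrivals at t (counted in A), excludes departures at t."""
    A, D, L = counts(events, t)
    return Q0 + A - D - L

events = [(1.0, "arr"), (2.0, "arr"), (2.5, "dep"), (3.0, "arr"),
          (3.5, "loss"), (4.0, "dep")]
for t in [0.5, 2.0, 3.0, 4.5]:
    A, D, L = counts(events, t)
    assert 0 + A == occupancy(events, t) + D + L   # conservation law
```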


Basic assumptions

• We'll typically assume that:
  – Servers are nonidling (or "work conserving") in that they are busy whenever Q(t) > 0.
  – A job's service cannot be preempted by another job.
  – Jobs may only be blocked upon arrival to a queue.
  – All servers associated with a given queue work at the same, constant rate (otherwise, we would need to define the work each job brings).

• Thus, we can unambiguously define Si to be the service time required by the ith job.
• In addition, each job i will have the following two quantities associated with it:
  – its arrival time to the queueing system Ti, assumed to be a nondecreasing sequence in i (∀i, Ti ≤ Ti+1), and
  – its departure (service completion) time from the server Vi if the job is not lost.


Queue workload (not blocked jobs)

• Let Ri(t) be the residual service time required by the ith job at time t.

• Clearly, 0 ≤ Ri(t) ≤ Si for all i, t; Ri(t) = 0 for t > Vi; and Ri(t) = Si for t < Vi − Si.

• For jobs i that are not lost, let Vi be the departure time of the job from the server.
• Clearly, Vi − Si is the time at which the ith job enters a server and, for all i and t ∈ [Vi − Si, Vi], Ri(t) = Vi − t.

• Clearly, a job i is in the queue but not in service if Ti ≤ t < Vi − Si.

[Figure: the residual service time Ri(t): equal to Si for t ≤ Vi − Si, then decreasing at unit rate to 0 at t = Vi.]


Parameterizing queue arrival and departure processes

• The arrival process A is parameterized above as {(Ti, Si)}, i ∈ Z or Z+.
• The queueing discipline determines how jobs are enqueued and in which order they are served (dequeued), i.e., the dynamics of the queue Q and workload W processes.

• The departure process D, parameterized by {(Vi, Si)}, is determined by both the queueing discipline and the arrival process.


Lossless queues

• Now assume the queue we have just introduced is lossless, i.e., L(−∞,t] = 0 for all t.
• Define the indicator 1{B} = 1 if B is true, else = 0. Since
  A(s,t] = Σ_i 1{Ti ∈ (s,t]} and D(s,t] = Σ_i 1{Vi ∈ (s,t]},
  we get (by recalling ∀i, Vi > Ti by assumption) that
  Q(t) = A(−∞,t] − D(−∞,t] = Σ_i 1{Ti ≤ t < Vi}.

• Again, this sojourn time consists of two components: the queueing delay, Vi − Ti − Si, plus the service time, Si.
• Expressions will be derived for quantities of interest such as the number of jobs in the queue, the workload, and job sojourn times.

• The objective is to express quantities of interest in terms of the job arrival times and service times alone.


The case of no waiting room

• Suppose the queueing system consists only of the servers and no waiting room.
• Thus, if the job flow is demultiplexed (demux'ed) to one of K servers, the queueing system can only hold K jobs at any given time.

• Since the system is assumed lossless: for all jobs i,
  Vi = Ti + Si.

• If there are infinitely many servers (K = ∞), the system is always lossless, and so the number of jobs queued and the workload are
  Q(t) = Σ_i 1{Ti ≤ t < Ti + Si} and W(t) = Σ_i Ri(t).
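For the infinite-server case, Q(t) and W(t) can be computed directly from the arrival and service times, since each job enters service immediately. A sketch (helper names are ours); here the workload sums the residual services Ri(t) = max{Ti + Si − t, 0} of jobs that have arrived by t:

```python
# Sketch (names ours): K = infinity servers and no waiting room, so
# Vi = Ti + Si; Q(t) counts jobs with Ti <= t < Ti + Si.

def infinite_server_Q(T, S, t):
    return sum(1 for Ti, Si in zip(T, S) if Ti <= t < Ti + Si)

def infinite_server_W(T, S, t):
    # workload = residual services of jobs that have arrived by time t
    return sum(max(Ti + Si - t, 0.0) for Ti, Si in zip(T, S) if Ti <= t)

T = [0.0, 1.0, 1.5]   # arrival times
S = [2.0, 0.5, 3.0]   # service times
assert infinite_server_Q(T, S, 1.2) == 2   # jobs 1 and 2 are in service
assert abs(infinite_server_W(T, S, 1.2) - (0.8 + 0.3)) < 1e-12
```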


The case of no waiting room - example sample path

[Figure: example sample paths of the workload W(t) and occupancy Q(t) for the no-waiting-room case; W jumps by Si at each arrival time Ti, and Q steps up at each Ti and down at each Vi = Ti + Si.]


The case of a lossless single-server queue

[Diagram: arriving jobs enter a waiting room feeding a single server, from which departing jobs exit.]

• Now suppose that the queue has a waiting room and only a single server.
• Clearly, if the waiting room were infinite in size, the queue would be lossless irrespective of the job arrival and service times.

• For the following example sample path, note that upon arrival of the ith job at time Ti, Q increases by 1 and W increases by Si.

• The process Q is piecewise constant and, due to the action of the server, W(t) has zero time derivative if Q(t) = 0 (i.e., W is constant) and otherwise has time derivative −1 for any t that is not a job arrival time.

• Upon departure of the ith job, Q decreases by 1.


The case of a lossless single-server queue - example sample path

[Figure: example sample paths of W(t) and Q(t) for the lossless single-server queue; W jumps by Si at each arrival time Ti and otherwise drains at unit rate while positive.]


Lossless single-server queue: Departure-times recursion

• Theorem: For a work-conserving, single-server, lossless FIFO queue, the ith job's departure time is
  Vi = max{Vi−1, Ti} + Si
  for all jobs i ∈ Z+, where V0 ≡ 0.
• Proof: For the ith job arriving at the lossless queue, there are two cases.

• If Ti > Vi−1, then:
  – job i−1 has already departed the queue by time Ti,
  – so Q(Ti−) = 0, and
  – when the ith job joins the queue, it immediately enters the server.
  – So, it departs Si seconds after it arrives, i.e., Vi = Ti + Si.

• On the other hand, if Ti ≤ Vi−1:
  – job i−1 is present in the queue (and immediately ahead of the ith job) when the ith job joins the queue.
  – Thus, the ith job will depart the queue Si seconds after job i−1, i.e., Vi = Vi−1 + Si.
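The recursion is immediate to implement. The following sketch (helper name ours) transcribes Vi = max{Vi−1, Ti} + Si with V0 = 0:

```python
# Sketch (name ours): departure times of a lossless single-server FIFO queue
# via the recursion Vi = max(V_{i-1}, Ti) + Si, with V0 = 0.

def departure_times(T, S):
    V, prev = [], 0.0
    for Ti, Si in zip(T, S):
        prev = max(prev, Ti) + Si
        V.append(prev)
    return V

T = [0.0, 1.0, 1.2, 10.0]   # nondecreasing arrival times
S = [2.0, 0.5, 1.0, 0.3]    # service times
V = departure_times(T, S)
# the last job arrives after the queue has emptied, so V4 = T4 + S4
assert V == [2.0, 2.5, 3.5, 10.3]
```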


Lossless single-server queue: Departure-times recursion (cont)

• Note that, by subtracting Ti from both sides of the departure-times recursion, we get a statement involving the sojourn times Vi − Ti and the interarrival times Ti − Ti−1:
  Vi − Ti = max{Vi−1 − Ti, 0} + Si
         = max{(Vi−1 − Ti−1) − (Ti − Ti−1), 0} + Si,
  where T0 ≡ 0.
• An immediate consequence of the FIFO nature of a single-server queue is this relation to workload:
  Vi = Ti + W(Ti).
• Again, here we take the work brought by each job i, Si, as its required service time.
• Also note that the time at which the ith job enters the server is max{Vi−1, Ti}.


Single server and constant service times

• Suppose each job requires the same amount of service, i.e., for some constant c > 0, Si = 1/c for all i.

• So, the service rate of any server can be described as c jobs per second.
• Further suppose that the (assumed lossless) queue has a waiting room.

• Because each job contributes c⁻¹ to the workload upon its arrival, the number of jobs in the system in terms of the workload is, ∀t,
  Q(t) = ⌈cW(t)⌉.
• That is, whenever Q(t) > 0,
  c⁻¹(Q(t) − 1) < W(t) ≤ c⁻¹Q(t).
• So, Q(t) = ⌈cW(t)⌉ follows because Q(t) is integer valued.


Max-plus expression for workload

• Theorem: For a work-conserving, single-server, lossless, initially empty (W(0) = 0) FIFO queue with constant service times,
  W(t) = max_{0 ≤ s ≤ t} { (1/c)A(s,t] − (t − s) }
  for all times t ≥ 0, where the maximizing value of s is t if W(t) = 0, else the starting time of the busy period containing t.
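The theorem can be sanity-checked numerically by comparing the max-plus expression with the workload implied by the departure-times recursion, under the constant-service assumption Si = 1/c (a sketch; helper names are ours). Since A(s,t] jumps only at arrival times, the max over s need only be evaluated at s = t and just before each arrival time Tj ≤ t:

```python
# Sketch (names ours): W(t) = max over s of (1/c)A(s,t] - (t-s), checked
# against the recursion-based workload, for constant service times 1/c.

def workload_maxplus(T, c, t):
    # the max over s is attained at s = t (value 0) or just before some Tj <= t
    best = 0.0
    for Tj in T:
        if Tj <= t:
            count = sum(1 for Ti in T if Tj <= Ti <= t)   # A(Tj-, t]
            best = max(best, count / c - (t - Tj))
    return best

def workload_recursive(T, c, t):
    # W(Ti) = Vi - Ti with Vi = max(V_{i-1}, Ti) + 1/c; between arrivals,
    # W drains at unit rate (down to zero)
    V, last_T, W_last = 0.0, None, 0.0
    for Ti in T:
        if Ti > t:
            break
        V = max(V, Ti) + 1.0 / c
        last_T, W_last = Ti, V - Ti
    if last_T is None:
        return 0.0
    return max(W_last - (t - last_T), 0.0)

T, c = [0.0, 0.2, 0.3, 2.0], 2.0   # arrival times; c = 2 jobs/s, so Si = 0.5
for t in [0.1, 0.4, 1.0, 2.1, 5.0]:
    assert abs(workload_maxplus(T, c, t) - workload_recursive(T, c, t)) < 1e-9
```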


Max-plus expression for workload - proof

• We first define the notion of a queue busy period as an interval of time [s,t] such that:
  – W(s−) = 0, W(r) > 0 (and Q(r) > 0) for all times r ∈ [s,t), and
  – W(t) = Q(t) = 0, i.e., the system is empty at time t.

• Queue busy periods (each started by a job arrival to an empty queue) are separated by idle periods, which are intervals of time over which W (and Q) are both always zero.

• So, the evolution of W is an alternating sequence of busy and idle periods.

[Figure: sample path of Q(t) showing a busy period followed by an idle period.]


Max-plus expression for workload - proof (cont)

• Arbitrarily fix a time t somewhere in a queue busy period, i.e., Q(t), W(t) > 0.
• Define b(t) as the starting time of the busy period containing time t, so that, in particular, b(t) ≤ t and W(b(t)−) = 0.
• The total work that arrived over [b(t),t] is A[b(t),t]/c, and the total service done over [b(t),t] was t − b(t).
• Since W(s) > 0 for all s ∈ [b(t),t],
  W(t) = (1/c)A(b(t)−,t] − (t − b(t)),
  where A(b(t)−,t] = A[b(t),t].
• Furthermore, for any s ∈ (b(t),t),
  W(t) = W(s) + (1/c)A(s,t] − (t − s) ≥ (1/c)A(s,t] − (t − s).


Max-plus expression for workload - proof (cont)

• Now consider a time s < b(t). Since the system is empty just before b(t), all of the work (1/c)A(s,b(t)) arriving over (s,b(t)) has been served by time b(t), so (1/c)A(s,b(t)) − (b(t) − s) ≤ 0.

• Therefore,
  (1/c)A(s,t] − (t − s) = (1/c)A(s,b(t)) − (b(t) − s) + (1/c)A(b(t)−,t] − (t − b(t))
                        ≤ (1/c)A(b(t)−,t] − (t − b(t))
                        = W(t).

• So, we have proved the desired result for the case where W(t) > 0.
• The other case, where t is in an idle period (i.e., Q(t) = W(t) = 0), is similarly proved (with maximizing time s = t).


Max-plus expression for queue backlog

• Combining the last two results gives
  Q(t) = ⌈ max_{0 ≤ s ≤ t} ( A[s,t] − (t − s)c ) ⌉.
• Also, when the ith job is in the server at time t,
  W(t) = (1/c) max{Q(t) − 1, 0} + Vi − t.


Single server and general service times

• Now consider a lossless FIFO single-server queue wherein the ith arriving job has service time Si.

• Here,
  W(t) = max_{0 ≤ s ≤ t} { Σ_i Si 1{s ≤ Ti ≤ t} − (t − s) },
  since A[s,t] = Σ_i Si 1{s ≤ Ti ≤ t}.


Single server and general service times (cont)

• Alternatively, focusing just on job arrival times, let i(t) be the index of the last job arriving prior to time t, i.e.,
  i(t) ≡ max{ j | Tj ≤ t }.
• For this queue, the workload is given by
  W(t) = max_{j ≤ i(t)} ( Σ_{k=j}^{i(t)} Sk − (t − Tj) )+,
  where (x)+ ≡ max{x, 0}.
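This job-indexed formula is easy to evaluate directly. A sketch (helper names are ours, not from the text):

```python
# Sketch (names ours): W(t) = max over j <= i(t) of
# ( sum_{k=j}^{i(t)} Sk - (t - Tj) )+, where i(t) indexes the last arrival by t.

def workload(T, S, t):
    arrived = [j for j, Tj in enumerate(T) if Tj <= t]
    if not arrived:
        return 0.0
    i_t = arrived[-1]   # i(t)
    return max(0.0, max(sum(S[j:i_t + 1]) - (t - T[j]) for j in range(i_t + 1)))

T = [0.0, 1.0, 4.0]   # arrival times
S = [2.0, 2.0, 1.0]   # service times
assert workload(T, S, 0.5) == 1.5   # only job 1 present: 2.0 - 0.5
assert workload(T, S, 3.0) == 1.0   # 4.0 units of work arrived, 3.0 elapsed
assert workload(T, S, 4.5) == 0.5   # a new busy period started at t = 4.0
```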


Queues in communication/computer networks

• Now consider packet queues/buffers in communication/computer networks operated by network providers.

• In particular, such queues reside in network switches and routers.
• At their network boundaries, network providers strike service-level agreements (SLAs) wherein the transmitting network agrees that its egress packet flow will conform to certain parameters.


A 3×3 Router

[Diagram: a 3×3 router; input links 1-3 enter the routing fabric R, which connects them to output links 1-3.]


Linecards of a Router (ca. 2000s)

[Diagram: ingress linecards 0-2 connect through the switch fabric to egress linecards 0-2. Within a linecard: SONET frames are deframed, IP packets are processed by the NP (becoming labeled IP packets), and the iTM with packet memory and the iSIF pass fabric segments to the switch fabric.]

Note: VOQs and VIQs about the switch fabric, and eTM in egress linecard.


SLA parameters regarding packet flows

• A preferable choice of flow parameters would be those that are:
  – significant from a queueing perspective,
  – simple to ensure conformity to by the sending network, and
  – simple to police by the receiving network.

• We will see how useful the mean arrival rate is in terms of predicting the queueing behavior/performance.

• The mean arrival rate is, however, difficult to police as it is only known after the flow has terminated.

• Instead of the mean arrival rate, we consider flow parameters that are policeable on a packet-by-packet basis.


The burstiness of a packet-flow

• Suppose that when the flow of packets arrives to a lossless, dedicated FIFO queue
  – with a constant service rate of ρ bytes per second (Bps),
  – the backlog of the queue never exceeds σ bytes.

• One can define σ as the burstiness of a flow of packets as a function of the rate ρ used to service it.

• So, a node can allocate both memory and bandwidth resources in order to accommodate such a regulated flow.

• Moreover, by limiting the burstiness of a flow, one also limits the degree to which it can affect other flows with which it shares network resources.

• Such traffic regulation was standardized by the ATM Forum and adopted by the Internet Engineering Task Force (IETF); see RFCs 2697 and 2698 at www.ietf.org.


Token (leaky) buckets for packet-traffic shaping - preliminaries

• Suppose that at some location there is a flow of packets specified by the sequence of pairs (Ti, ℓi), where
  – Ti is the arrival time of the ith packet in seconds (Ti+1 > Ti), and
  – ℓi is the length of that packet in bytes (both the work that the ith packet brings and the memory it occupies in the queue).

• The total number of bytes that arrives over an interval of time (s,t] is
  B(s,t] = Σ_i ℓi 1{s < Ti ≤ t}.


Token (leaky) buckets for packet-traffic shaping (cont)

• Assume that this packet flow arrives to a token bucket mechanism.
• A token represents a byte, and tokens arrive at a constant rate of ρ tokens/s to the token bucket, which has a limited capacity of σ tokens.

• A (head-of-line) packet i leaves the packet FIFO queue when ℓi tokens are present in the token bucket;

• when the packet leaves, it consumes ℓi tokens, i.e., they are removed from the bucket.
• Note that this mechanism requires that σ be larger than the largest packet length (again, in bytes) of the flow.


Token (leaky) buckets for packet-traffic shaping (cont)

• Let Bo(s,t] be the total number of bytes departing from the packet queue over the interval of time (s,t].

• The following result is directly proved by considering the maximal amount of tokens that can be consumed over an interval of time.

• Theorem: For all arrival processes B to the packet queue,
  Bo(s,t] ≤ σ + ρ(t − s), ∀s ≤ t.

• Any flow Bo that satisfies this inequality is said to satisfy a (σ, ρ) constraint.
• In the jargon of the IETF RFCs, ρ could be a sustained information rate (SIR), and σ a maximum burst size (MBS).

• Alternatively, ρ could be a peak information rate (PIR > SIR), in which case σ would usually be taken to be the number of bytes in a (single) maximally sized packet (< MBS).

• Note that the mean departure rate over (s,t] is Bo(s,t]/(t − s) ≤ ρ + σ/(t − s) ≈ ρ for large t − s.
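A fluid-token simulation of the shaper lets one check the (σ, ρ) constraint on sampled intervals. This is our own sketch (function and variable names are not from the text); a head-of-line packet departs as soon as its ℓi tokens have accumulated:

```python
# Sketch (names ours): (sigma, rho) token-bucket shaper. Tokens replenish at
# rate rho, capped at sigma; each departing packet consumes ell tokens.

def shape(packets, sigma, rho):
    """packets: list of (arrival_time, length_bytes) in FIFO order.
    Returns the packet departure times."""
    tokens, last, out = sigma, 0.0, []
    for T, ell in packets:
        assert ell <= sigma, "bucket must hold at least one max-size packet"
        t = max(last, T)   # head-of-line no earlier than arrival or prior departure
        tokens = min(sigma, tokens + rho * (t - last))
        if tokens < ell:                 # wait for the token deficit to accrue
            t += (ell - tokens) / rho
            tokens = ell
        tokens -= ell
        last = t
        out.append(t)
    return out

pkts = [(0.0, 500), (0.0, 500), (0.1, 400), (5.0, 100)]
deps = shape(pkts, sigma=1000, rho=1000.0)   # 1000-token bucket, 1000 B/s

# departures satisfy Bo(s,t] <= sigma + rho*(t-s) on sampled intervals
for i in range(len(deps)):
    for j in range(i, len(deps)):
        s, t = deps[i] - 1e-9, deps[j]
        Bo = sum(ell for d, (_, ell) in zip(deps, pkts) if s < d <= t)
        assert Bo <= 1000 + 1000.0 * (t - s) + 1e-6
```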


Bounded queue backlog if (σ, ρ) constrained arrivals

• Let W(t) be the backlog in bytes at time t of a queue with arrival flow Bo and a dedicated server with constant rate ρ.

• Theorem: The flow Bo is (σ, ρ) constrained if and only if W(t) ≤ σ for all time t.
• Proof: The maximum queue size is
  max_t W(t) = max_t max_{s: s ≤ t} { Bo(s,t] − ρ(t − s) }.

• Substituting Bo's (σ, ρ) inequality gives the result.


Traffic shaping and policing

• We have shown how the token bucket can delay packets of the arrival flow B so that the departure flow Bo is (σ, ρ) constrained.

• This is known as traffic shaping.
• The receiving network of the exchange of flows described above may wish to:
  – shape the flow using a (σ, ρ) token bucket, or
  – police the flow by simply identifying (marking) any packets that are deemed out of the (σ, ρ) profile of the flow, or
  – police the flow by dropping any out-of-profile packets.

• There are two main devices used for traffic policing.
• The first is a token-bucket device but without the packet queue: a packet is dropped or marked out of profile if and only if there are not sufficient tokens (according to its length) in the token bucket upon its arrival (no tokens are consumed if the packet is dropped).


Traffic policing

• Alternatively, by the previous theorem, one can employ a policer as depicted above which does not delay any packets.

• A packet is dropped or marked out-of-profile if and only if its arrival and inclusion in the virtual queue would cause its backlog W to become larger than σ;

• when this happens, the arriving packet is not included in the virtual queue.
• Note that the virtual queue can be maintained by simply keeping track of two state variables:
  – the queue length, W bytes, upon arrival of the previous packet, and
  – the arrival time, a, of the previous packet.

Traffic policing (cont)

• Thus, if a packet of length ℓ bytes arrives at time T and is admitted into the virtual queue, then
  W ← max{W − ρ(T − a), 0} + ℓ and a ← T.
• This (event-driven) operation requires one multiplication operation per packet.
• Alternatively, one could maintain the departure time d of the most recently admitted packet instead of the queue occupancy W.
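The update rule translates directly into code. A sketch (helper names ours); per the text, the state (W, a) is updated only when a packet is admitted:

```python
# Sketch (names ours): the two-state-variable virtual-queue policer.
# A packet is in profile iff the drained backlog plus its length fits in sigma.

def make_policer(sigma, rho):
    state = {"W": 0.0, "a": 0.0}
    def police(T, ell):
        W = max(state["W"] - rho * (T - state["a"]), 0.0)   # drained backlog
        if W + ell > sigma:
            return False            # out of profile: drop/mark, state unchanged
        state["W"], state["a"] = W + ell, T
        return True                 # in profile: admit to the virtual queue
    return police

police = make_policer(sigma=1000, rho=1000.0)
assert police(0.0, 600) is True    # W = 600
assert police(0.1, 600) is False   # drained to 500; 500 + 600 > 1000
assert police(0.2, 600) is True    # drained to 400; 400 + 600 = 1000
```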


Traffic policing: 2R3CM

• If two such virtual queues are used, one for (SIR, MBS) and the other for PIR, then every packet has one of four fates:
  – in-profile for both,
  – out-of-profile for PIR but in-profile for (SIR, MBS),
  – in-profile for PIR but out-of-profile for (SIR, MBS),
  – out-of-profile for both.

• Thus, one of three different "colors" can be used to mark the out-of-profile packets (by setting a field in their headers).

• This policing system with two virtual queues is called a two-rate, three-color marker (2R3CM) - again, see RFCs 2697 and 2698 at www.ietf.org.


Scheduling flows of variable-length packets - Introduction

• Suppose that at some location, N flows are to be multiplexed (scheduled) into a single flow.

• Similar considerations apply to scheduling sequences of jobs of variable work amounts.
• The flows are indexed by n ∈ {1,...,N} below.

• Each flow n is assigned its own tributary FIFO queue with "relative allocation" fn, and the output flows of the tributary queues are multiplexed into the transmission FIFO queue.

• How the multiplexing occurs depends on the kinds of relative priorities of the flows.


FIFO scheduling

• First suppose a system without tributary queues, i.e., all flows directly arrive to the transmission queue.

• In FIFO scheduling, packets are served in first-come first-served (first-in first-out) fashion.

• It is hard to differentially manage per-flow service (fn) this way - perhaps via a differential rule for queue admission/blocking.

• Also, flows more readily "interfere" with each other.
• Note that FIFO queues without overtaking or push-out have minimal per-packet overhead: operations only at the head (join, block) or tail (serve) of the queue.


Strict priority scheduling

• Now and hereafter suppose that each flow n has a separate tributary FIFO queue/buffer so that "flow" and "queue" (or "transmission queue") may be used interchangeably.

• In strict priority multiplexing, flows are ranked according to priority.
• A flow is served by the scheduler only if no packets of any higher-priority flows are queued.
• Even when the volume of high-priority traffic is limited (perhaps by a leaky bucket mechanism), there remains the potential problem of service starvation to lower-priority flows.

• The problems with both priority and single FIFO-queue multiplexing can be solved by using a scheduler that can in some way allocate service bandwidth to a flow in order to prevent long-term service starvation.


Lottery scheduling

• Let ℓn be the size in bytes of the head-of-line packet of queue/flow n, and
• c Bps be the capacity of the server shared by the flows.
• Under a kind of (work-conserving) "lottery" scheduling, the discrete distribution
  ( (f1/ℓ1)/Z, (f2/ℓ2)/Z, ..., (fN/ℓN)/Z )
  is sampled to determine which queue to serve next, where the normalizing term is
  Z = Σ_{n=1}^{N} fn/ℓn,
  and we take ℓn = ∞ when queue n is empty.
• In addition to sampling this discrete distribution, the overhead of lottery scheduling involves recomputing its terms (fn/ℓn)/Z when the size of a head-of-line packet changes, particularly when a queue becomes empty or nonempty.
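One lottery-scheduling decision can be sketched as follows (the RNG and helper names are ours, not from the text; an empty queue gets weight 0, corresponding to ℓn = ∞):

```python
# Sketch (names ours): sample queue n with probability (fn/ln)/Z among
# nonempty queues; an empty queue (head_len None) has weight zero.

import random

def lottery_pick(f, head_len, rng):
    """head_len[n]: head-of-line packet size of queue n, or None if empty."""
    weights = [fn / ln if ln is not None else 0.0
               for fn, ln in zip(f, head_len)]
    Z = sum(weights)
    u, acc = rng.random() * Z, 0.0
    for n, w in enumerate(weights):
        acc += w
        if u < acc:
            return n
    return max(range(len(f)), key=lambda n: weights[n])   # float-roundoff guard

rng = random.Random(0)
f = [0.5, 0.25, 0.25]
head = [100, 100, None]   # queue 2 is empty
picks = [lottery_pick(f, head, rng) for _ in range(10000)]
assert picks.count(2) == 0                        # empty queue never chosen
assert abs(picks.count(0) / 10000 - 2/3) < 0.05   # weights 0.005 : 0.0025
```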


Lottery scheduling (cont)

• Let Dn and Dm be the total number of bytes transmitted by two different flows n ≠ m over an interval of time T during which both of their queues are never empty.

• By the law of large numbers, the throughput of non-idle flows n will be proportional to fn in the long run, i.e.,
  lim_{T→∞} | Dn(T)/fn − Dm(T)/fm | = 0.
• But due to random sampling, it's possible that a queue is starved of service, i.e., P(Dn(T) = 0) > 0 for (finite) T > 0.


Deficit round-robin

• Under round-robin multiplexing (scheduling), time is divided into successive rounds, perhaps each not necessarily of the same time duration depending on which flows (tributary queues) are active.

• Each flow is visited once per round by the scheduler.
• Suppose that in each round there is a rule allowing for at most one packet per tributary queue to be transmitted into the transmission queue.

• A problem here is that flows with large-sized packets (e.g., large file transfers using TCP) will monopolize the bandwidth and starve out flows of small-sized packets (e.g., those of streaming media).

• Thus, one might want to regulate the total number of bytes that can be extracted from any given tributary queue in a round.

• This leads to the notion of deficit round-robin (DRR) scheduling.


Deficit round-robin - definition

• To describe a DRR mechanism, we need the following definitions.

• Let Lmax be the size, in bytes, of the largest packet and Lmin be the size of the smallest.

• Here, the priority of a flow has to do with the fraction fn of the total link bandwidth c bytes per second assigned to it, where we assume no overbooking:
  Σ_{n=1}^{N} fn ≤ 1.
• Note: In practice, resources may be overbooked to exploit "statistical multiplexing".
• Finally, let the minimal allotment of bandwidth to a queue be
  fmin = min_n fn.


Deficit round-robin - definition (cont)

• Under DRR, at the beginning of each round, each nonempty FIFO queue is allocated a certain number of tokens.

• Packets departing a queue consume their byte length in tokens from the queue's allotment.
• Queues are serviced in a round until their token allotment becomes insufficient to transmit their next head-of-line packet.

• For example, if a queue is allocated 8000 tokens at the start of a round and has six packets queued, each of length 1500 bytes, then the first five of those packets are served, leaving the trailing sixth packet at the head of the queue and 8000 − 5 × 1500 = 500 tokens unused.

• If it's not empty, the nth queue is allocated
  (fn/fmin) Lmax tokens
  at the start of a round, thereby ensuring that at least one packet from this queue will be transmitted in the round irrespective of the packet's size.

• If a queue has no packets at the end of a round, its remaining token allotment may be reset to zero (esp. if the queue is empty); in the following, assume that at most one round's worth of tokens can carry over to the next.
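One round of DRR under these rules can be sketched as follows (our own simplification, with names not from the text: quanta are (fn/fmin)Lmax, carryover is capped at one round's allotment, and an empty queue's allotment is reset):

```python
# Sketch (names ours): one DRR round. Each nonempty queue n gets a quantum of
# (fn/fmin)*Lmax tokens; packets consume their byte length in tokens.

from collections import deque

def drr_round(queues, f, Lmax, deficits):
    """queues[n]: deque of packet lengths; deficits: per-queue carryover."""
    fmin = min(f)
    served = [[] for _ in queues]
    for n, q in enumerate(queues):
        if not q:
            deficits[n] = 0.0          # reset the allotment of an empty queue
            continue
        quantum = (f[n] / fmin) * Lmax
        deficits[n] = min(deficits[n], quantum) + quantum   # capped carryover
        while q and q[0] <= deficits[n]:
            deficits[n] -= q[0]
            served[n].append(q.popleft())
    return served

queues = [deque([1500] * 6), deque([300] * 4)]
deficits = [0.0, 0.0]
out = drr_round(queues, [0.5, 0.25], Lmax=1500, deficits=deficits)
# queue 0's quantum is (0.5/0.25)*1500 = 3000: exactly two 1500-byte packets
assert out[0] == [1500, 1500] and out[1] == [300, 300, 300, 300]
```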

Deficit round-robin - discussion and performance

• Note that the token allotments per round can be precomputed given service requirements fn, where the fn themselves change at a much slower "connection-level" time scale than that of the transmission time required for a single packet (Lmax/c).

• One could replace fmin in the token allocation rule by the minimum bandwidth allocation among nonempty queues at the start of a round, but the result would be a significant amount of computation per round, possibly precluding a high-speed implementation.

• Claim: If the nth queue is always not empty over k > 1 consecutive rounds with constant fmin, then the cumulative bytes Dn(k) transmitted from this queue over this period satisfy
  (k − 1)(fn/fmin)Lmax − (Lmax − 1) + Lmin ≤ Dn(k) ≤ k(fn/fmin)Lmax + Lmax − 1.
• Proof: "Always nonempty" means the queue is not empty after each of its service epochs, except possibly the last. The upper bound is obtained by assuming that all allocated tokens are consumed, in addition to a maximal amount of initial carryover tokens (Lmax − 1) from the round prior to the k consecutive ones under consideration. The lower bound is obtained by assuming no initial carryover tokens from a previous round, and that Lmax − 1 tokens are carried over in each round except the final one, where only Lmin tokens are consumed.


DRR is rate-proportionally fair

• The previous claim demonstrates that DRR scheduling indeed allocates bandwidth consistent with the parameters fn.

• Exercise: If two queues n and m are never idle over an interval of k consecutive rounds, show that
  | Dn(k)/fn − Dm(k)/fm | ≤ max{1/fn, 1/fm} Lmax
  (the bound doesn't depend on k), and
  lim_{k→∞} Dn(k)/Dm(k) = fn/fm.

• Exercise: Derive related bounds if Lmax is different for the two queues or if fmin changes, i.e., fmin,i for round i.

• Exercise: Explain a potential problem if more than one round's worth of unused tokens is allowed to accumulate for a flow.


Shaped VirtualClock

• We will now describe a scheduler
  – that employs timestamps to give packets service priority over others,
  – but restricts consideration only to packets that meet an eligibility criterion,
  – to limit the jitter of the individual output flows.

• This trait, which is lacking in DRR, is important for link channelization (partitioning a link into smaller channels) at network boundaries where SLAs are struck and policed.

• A general problem of timestamp-based scheduling is that O(log N) dequeue complexity is required to determine the flow with the smallest head-of-line/queue packet timestamp.


Shaped VirtualClock - definition

• For all i and n, (n, i) denotes the ith packet of the nth flow.

• Packet (n, i) is assigned a service deadline dn,i and a service eligibility time εn,i. A packet is said to be eligible for service at time t if εn,i ≤ t.
• As with DRR, once a packet begins service, its service is not interrupted.
• Upon service completion of a packet, the next packet selected for service will be the one with the smallest deadline among all eligible packets.

• Assuming the queues are FIFO, only head-of-queue packets need to be considered by the multiplexing (scheduling) algorithm.

• Each packet (n, i) has two other important attributes: its arrival time an,i to the multiplexer and its size in bytes, ℓn,i.

• Under what we will hereafter call shaped VirtualClock (SVC) scheduling, packet (n, i)'s eligibility time and deadline are
  εn,i := max{dn,i−1, an,i} and dn,i := εn,i + ℓn,i/(fn c).
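The eligibility/deadline recursion is a per-flow Lindley-type recursion and is straightforward to compute. A sketch (helper names are ours, not from the text):

```python
# Sketch (names ours): SVC tags for one flow:
# eps_i = max(d_{i-1}, a_i), d_i = eps_i + ell_i/(fn*c), with d_0 = 0.

def svc_tags(arrivals, lengths, fn, c):
    eps, dl, d_prev = [], [], 0.0
    for a, ell in zip(arrivals, lengths):
        e = max(d_prev, a)            # eligible once the prior deadline passes
        d_prev = e + ell / (fn * c)   # deadline: eligibility + virtual service
        eps.append(e)
        dl.append(d_prev)
    return eps, dl

# a flow with a 25% share of a 1000 B/s link; two back-to-back 100-byte
# packets, then one arriving after the virtual queue has drained
eps, dl = svc_tags([0.0, 0.0, 2.0], [100, 100, 100], fn=0.25, c=1000.0)
assert eps == [0.0, 0.4, 2.0]
assert [round(d, 9) for d in dl] == [0.4, 0.8, 2.4]
```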


SVC - performance evaluation - preliminaries

• That is, if the nth flow were instead to arrive to a queue with a dedicated server of constant rate fn c bytes per second, then packet (n, i) would:

– reach the head of the queue (and begin service) at its eligibility time εn,i and

– completely depart the server at its service deadline dn,i.

• The Lindley recursion of the packet departure times for this virtual queue n is:
  dn,i = max{dn,i−1, an,i} + ℓn,i/(fn c) = εn,i + ℓn,i/(fn c).

• Lemma: Just prior to the start time of a busy period of the multiplexer, the aggregate eligible work to be done of all N virtual queues is zero.

• This lemma is used to prove a guaranteed-rate property of SVC.


SVC - guaranteed rate property

• Now recall Lmax is the maximum size of a packet (in bytes), a quantity that is typically about 1500 in the Internet.

• The following theorem demonstrates that SVC schedules bandwidth appropriately in our time-division multiplexing context:

• Theorem: For all n and i, the time at which packet (n, i) completely departs from the multiplexer is not more than
  dn,i + Lmax/c.

• This is a kind of guaranteed-rate result for the SVC multiplexer.
• Such results can easily be extended to an end-to-end guaranteed-rate property of a tandem system of such multiplexers.


SVC - output burstiness

• The SVC multiplexer also has an appealing property of bounding the jitter of every output flow.

• Consider any flow/queue n and note that the ith packet of this flow
  – will have completely departed the multiplexer between times εn,i + ℓn,i/c and dn,i + Lmax/c,
  – where ℓn,i/c is its total transmission time.

• We can use this fact and the fact that dn,i ≤ εn,i+1 to show:
• Theorem: The cumulative departures from the nth queue of the multiplexer over an interval of time [s,t] is less than or equal to fn c (t − s) + 2Lmax bytes.

• That is, the departure process is (2Lmax, fn c)-constrained.


SVC - discussion

• Another perspective for SVC is that flows
  – just get what they pay for (i.e., service fn c), and
  – either "use it or lose it", i.e., the scheduler is not obligated to distribute unreserved ((1 − Σn fn)c) or currently-reserved-but-unused resources (owing to idle flows/queues) to currently nonidling flows.
• This perspective may be that of a public, for-profit utility (ISP, cloud services provider).
• Exercise: Consider how eligibility times could be used in DRR to limit output burstiness.


Fair scheduling with timestamps - SCFQ and SFQ

• There is a significant literature on fair scheduling, including the timestamp-based
  – Self-Clocked Fair Queueing (SCFQ) [Golestani INFOCOM'94], and
  – Start-Time Fair Queueing (SFQ) [Goyal et al. SIGCOMM'96].

They are “fair” in that they track work-conserving, rate based scheduling of a Generalized • (GPS) fluid trafficflowmodelthataccountsfor queue idle times.

Let the start-time and finish-time of the ith packet of flow/queue n be, respectively, • #n,i Sn,i =max v(an,i),Fn,i 1 and Fn,i = Sn,i + , { − } fnc where v is the virtual time and Fn,0 =0.

Under SCFQ, v(t) is the finish-tag of the packet in service (or most recently in service if t • is during a server idle period), and the head-of-queue packets with the smallest finish-tag are serviced first.

Under SFQ, v(t) is the start-tag of the packet in service (or most recently in service if t • is during a server idle period), and the head-of-queue packets with the smallest start-tag are serviced first.
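The tag recursion above is easy to sketch in code. Here is a minimal, hypothetical helper (the function name and the back-to-back arrival scenario are assumptions for illustration, not from the course) computing start- and finish-tags for a single continuously backlogged flow:

```python
# Sketch of SCFQ/SFQ tag computation for one flow (hypothetical helper).
# Assumes the flow's queue stays backlogged, so each packet's start-tag is
# max(v(arrival), previous finish-tag) per the recursion above.

def tags(lengths, rate, v_at_arrival):
    """Start/finish tags: S = max(v(a), F_prev), F = S + length/rate."""
    S, F, F_prev = [], [], 0.0
    for length, v in zip(lengths, v_at_arrival):
        s = max(v, F_prev)
        f = s + length / rate
        S.append(s)
        F.append(f)
        F_prev = f
    return S, F

# Flow with reserved rate f_n * c = 500 bytes/s and three 1000-byte packets,
# all arriving at virtual time 0 (so the queue stays backlogged):
S, F = tags([1000, 1000, 1000], 500.0, [0.0, 0.0, 0.0])
print(S)  # [0.0, 2.0, 4.0]
print(F)  # [2.0, 4.0, 6.0]
```

The scheduler would then serve, among head-of-queue packets of all flows, the one with the smallest finish-tag (SCFQ) or start-tag (SFQ).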

SCFQ and SFQ - fair scheduling performance & cost

• For SCFQ and SFQ, over an interval of time T where both queues n, m are never idle,

    ∀ n, m, T:  | D_n(T)/f_n − D_m(T)/f_m |  ≤  (1/f_n + 1/f_m) Lmax.

• Again, SCFQ and SFQ have O(1) enqueue complexity, because they use a simple virtual-time function to compute the start-tags and finish-tags,

• but they have O(log N) dequeue complexity, again because, to decide which head-of-queue packet to serve, the one with the smallest timestamp needs to be determined.

Deterministic network calculus

• A more powerful formulation of guaranteed service is given by the service curve concept, on which a kind of "network calculus" is based for determining delay and jitter bounds for a packet flow as it traverses a series of multiplexed FIFO queues, each of which may be shared with other flows.

• The following discussion is principally based on
  R. Cruz, "SCED+ ...," in Proc. IEEE INFOCOM, 1998; see also
  C.-S. Chang, Performance Guarantees in Communication Networks, Springer, 2000.

• Network calculus provides a succinct way
  – to describe the burstiness of job/packet arrival flows
  – and the service guarantees provided by tandem (lossless) multiplexers/schedulers,
  – to derive bounds on delay and queue backlog.

• The burstiness curves are typically piecewise linear in practice - recall token/leaky buckets.
• Extensions to time-varying envelopes have been developed.
• Extensions to stochastic settings (where packet-by-packet policing is not possible) will be discussed later.

Convolution and deconvolution operators

• We will now revisit some previous calculations via the convolution (⊗) and deconvolution (⊘) operators,
  – as used in "min-plus" algebras,
  – on flows, i.e., initially zero and non-decreasing (and hence non-negative) functions of continuous time t ∈ R+ := [0, ∞) (or t ∈ Z+ if time is discrete); i.e., X is a flow if

    ∀ t ≥ v ≥ 0−,  X(t) ≥ X(v)  with  X(0−) = 0,

  e.g., cumulative arrivals or departures, or maximum or minimum service.

• For any two flows X and Y at time t ≥ 0:
  – X convolved with Y is, ∀ t ≥ 0,

    (X ⊗ Y)(t) = min_{0≤v≤t} X(v) + Y(t − v) = (Y ⊗ X)(t).

  – X deconvolved with Y is, ∀ t ≥ 0,

    (X ⊘ Y)(t) = max_{s≥0} X(t + s) − Y(s).
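In discrete time these operators can be computed directly from the definitions. A minimal sketch on finite horizons (the truncation of the deconvolution's max to the available horizon is an assumption of this illustration):

```python
# Discrete-time min-plus convolution and deconvolution of flows, computed
# directly from the definitions above on a finite horizon.

def conv(X, Y):
    """(X ⊗ Y)(t) = min over 0 <= v <= t of X(v) + Y(t - v)."""
    T = min(len(X), len(Y))
    return [min(X[v] + Y[t - v] for v in range(t + 1)) for t in range(T)]

def deconv(X, Y):
    """(X ⊘ Y)(t) = max over s >= 0 of X(t + s) - Y(s), truncated to horizon."""
    T = len(X)
    return [max(X[t + s] - Y[s] for s in range(T - t)) for t in range(T)]

# Token-bucket envelope b(t) = sigma + rho*t and rate-latency service curve
# s(t) = max(0, c*(t - d)) on horizon 0..9 (sigma=4, rho=1, c=2, d=2):
b = [4 + 1 * t for t in range(10)]
s = [max(0, 2 * (t - 2)) for t in range(10)]
print(conv(b, s)[:5])    # [4, 4, 4, 5, 6]: the service delays the envelope
print(deconv(b, s)[:3])  # [6, 7, 8]: the output envelope b ⊘ s
```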

Basic properties of convolution and deconvolution

• X ≤ Y means ∀ t ≥ 0, X(t) ≤ Y(t).

• X = min{Y | Y ∈ G} means ∀ t ≥ 0, X(t) = min_{Y∈G} Y(t),
  i.e., X is the largest flow such that X ≤ Y for all Y ∈ G.
• The identity function of the convolution and deconvolution operators is the infinite step

    u_∞(t) = 0 if t ≤ 0, +∞ if t > 0,

  i.e., for all flows X, X ⊗ u_∞ = X and X ⊘ u_∞ = X.
• Convolution is commutative and associative.
• One can directly show that for all flows X, Y, Z:

    (X ⊘ Y) ⊘ Z = X ⊘ (Y ⊗ Z)
    X ⊘ Y = min{Z | Z ⊗ Y ≥ X}
    ⇒ (X ⊘ Y) ⊗ Y ≥ X
    ⇒ (X ⊗ Y) ⊘ Z ≤ X ⊗ (Y ⊘ Z).

• Exercise: Prove the above identities.

Exercise: Delay function

• Define the delay function

    ∆_d(t) = 0 if t ≤ d, +∞ if t > d.

• That is, ∆_d(t) = u_∞(t − d).
• Exercise: Show that for any function f and a constant d ≥ 0,

    ∀ t, f(t − d) = (f ⊗ ∆_d)(t).

Flow burstiness curves (traffic envelopes)

• Consider an initially empty, lossless queue in a network device with cumulative arrivals and departures over [0, t] respectively denoted A(t) and D(t).

• A flow A is said to have burstiness bounded by (or an upper envelope) b_in if

    ∀ t ≥ v ≥ 0,  A(t) − A(v) ≤ b_in(t − v)  ⇔  A ≤ A ⊗ b_in,

  more succinctly denoted as A ∝ b_in.
• Note that this is a bound on arrivals over any time interval (v, t].

• For example, if A is the output of dual token-bucket regulators, then b_in is piecewise linear:

    b_in(r) = min{σ + ρr, ε + πr},

  where the maximum "burst size" σ > ε ≥ 0 (ε is small) and the peak rate is greater than the "sustainable" rate, π > ρ.

• In the following, we assume the arrival flow A ∝ b_in.

Service curves

• Now consider a single (lossless) queue of a multiplexer (mux) within a network device (e.g., a router).

• A and D are respectively the queue's arrival and departure flows.
• The cumulative departures D of a given queue depend on any service guarantees as scheduled by the mux and possibly (in the case of nonidling service) on how busy the other queues are.

• If W(0) = 0, then the queue backlog at time t ≥ 0 is

    W(t) = A(t) − D(t).

• In the special case where the queue receives exact, deterministic service at rate c > 0: ∀ t ≥ 0,

    W(t) = max_{0≤r≤t} A(t) − A(r) − (t − r)c
    ⇒ D(t) = min_{0≤r≤t} A(r) + (t − r)c = (A ⊗ s0)(t),

  where the "service flow" s0(t) = tc for all t ≥ 0.

• More generally, a scheduler is said to give the queue a minimum service-curve s_min, respectively a maximum service-curve s_max, if for all arrival flows A,

    D ≥ A ⊗ s_min,  respectively  D ≤ A ⊗ s_max.

Guaranteed rate property and minimum service curve

• Exercise: If a scheduler has the guaranteed-rate property with parameter µ (SVC has µ = Lmax/c) for a queue with bandwidth allocation c, show that the queue has minimum service-curve

    s_min(t) = max{ct − cµ, 0}.

• Cruz's Service-Curve Earliest Deadline First (SCED+) scheduler was designed to achieve output service-curves.

Output burstiness

• Theorem: If A ∝ b_in and the initially empty queue has minimum service-curve s_min, then

    D ∝ b_out := b_in ⊘ s_min.

• Proof: ∀ t ≥ 0:

    D(t) ≤ A(t)
         ≤ min_{0≤v≤t} A(v) + b_in(t − v)
         ≤ min_{0≤v≤t} A(v) + min_{0≤r≤t−v} [ s_min(t − v − r) + max_{x≥0} ( b_in(x + r) − s_min(x) ) ]
         = min_{0≤r≤t} min_{0≤v≤t−r} A(v) + s_min(t − v − r) + b_out(r)
         = min_{0≤r≤t} b_out(r) + min_{0≤v≤t−r} A(v) + s_min(t − v − r)
         ≤ min_{0≤r≤t} b_out(r) + D(t − r),

  where we have switched the order of minimization for the first equality.

• Thus, D ∝ b_out.
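For the common special case of a token-bucket envelope b_in(t) = σ + ρt and a rate-latency minimum service curve s_min(t) = max{0, c(t − d)} with c ≥ ρ, the deconvolution works out to b_out(t) = σ + ρ(t + d). A brute-force check of this closed form on a discrete grid (the parameter values are assumed for illustration):

```python
# Brute-force check (on a discrete grid) that b_out = b_in ⊘ s_min equals
# sigma + rho*(t + d) for a token bucket b_in(t) = sigma + rho*t and a
# rate-latency curve s_min(t) = max(0, c*(t - d)) with c >= rho.
sigma, rho, c, d = 4.0, 1.0, 2.0, 2.0

def b_in(t):
    return sigma + rho * t

def s_min(t):
    return max(0.0, c * (t - d))

def b_out(t, horizon=1000):
    # (b_in ⊘ s_min)(t) = max over s >= 0 of b_in(t + s) - s_min(s)
    return max(b_in(t + s) - s_min(s) for s in range(horizon))

for t in range(5):
    assert b_out(t) == sigma + rho * (t + d)
print("b_out(t) = sigma + rho*(t + d) verified on t = 0..4")
```

Intuitively, the service latency d lets an extra ρd of traffic pile up on top of the burst σ before it is drained.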

Output burstiness via convolution and deconvolution

• We now redo the previous proof using convolution notation and basic properties:

    D ≤ A
      ≤ A ⊗ b_in
      ≤ A ⊗ (s_min ⊗ (b_in ⊘ s_min))
      = A ⊗ (s_min ⊗ b_out)
      = (A ⊗ s_min) ⊗ b_out
      ≤ D ⊗ b_out.

• Exercise: Prove the extension of this result to also account for a maximum service-curve s_max of the queue:

    D ∝ (b_in ⊗ s_max) ⊘ s_min.

Virtual delay processes (for arrivals) and delay jitter bound

• For a queue with arrival flow A and departure flow D, at time t ≥ 0,
  – the queue backlog is W(t) = A(t) − D(t), i.e., the "vertical" difference between the flows, and
  – the virtual delay for a hypothetical arrival at time t is D^{-1}(A(t)) − t, where D^{-1}(a) is the smallest time t such that D(t) = a, i.e., the "horizontal" difference between the flows.

• Note that the virtual delay process does not depend on arrivals after t under FIFO queueing, and recall our discussion of a virtual-queue policer.

Virtual delay processes and delay jitter bound - theorem

• Theorem: If a queue has arrival flow A ∝ b_in, minimum service-curve s_min, and maximum service-curve s_max, then

    ∀ t ≥ 0,  A(t − d_min) ≥ D(t) ≥ A(t − d_max),

  where

    d_min = max{x ≥ 0 : s_max(x) = 0},
    d_max = min{z ≥ 0 : s_min(x) ≥ (b_in ⊗ ∆_z)(x) = b_in(x − z) ∀ x ≥ 0}.

• Re. the virtual delays, this theorem implies that, ∀ t ≥ 0,

    D^{-1}(A(t − d_min)) − (t − d_min) ≥ d_min  and
    D^{-1}(A(t − d_max)) − (t − d_max) ≤ d_max.

Maximum delay - remarks

• Note that d_max is the largest horizontal difference between b_in and s_min.
• Also note that, equivalently,

    d_max = min{z ≥ 0 : b_in(x − z) − s_min(x) ≤ 0 ∀ x ≥ 0}
          = min{z ≥ 0 : max_{x≥0} b_in(x − z) − s_min(x) ≤ 0}
          = min{z ≥ 0 : (b_in ⊘ s_min)(−z) ≤ 0}.

• Moreover, s_max ≤ ∆_{d_min}.
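The characterization of d_max as the smallest feasible horizontal shift can be evaluated numerically. A grid-search sketch, assuming a token-bucket envelope and a rate-latency service curve (for which d_max = d + σ/c analytically):

```python
# Grid search for d_max = min{ z >= 0 : b_in(x - z) <= s_min(x) for all x },
# the largest horizontal gap between an (assumed) token-bucket envelope and
# a rate-latency service curve; analytically d_max = d + sigma/c here.
sigma, rho, c, d = 4.0, 1.0, 2.0, 2.0

def b_in(t):
    return sigma + rho * t if t >= 0 else 0.0  # flows are zero before time 0

def s_min(t):
    return max(0.0, c * (t - d))

def d_max(step=0.01, steps=1000):
    for k in range(steps):
        z = k * step  # candidate horizontal shift
        if all(b_in(i * step - z) <= s_min(i * step) + 1e-9
               for i in range(steps)):
            return z
    return float("inf")

print(round(d_max(), 2))  # 4.0 = d + sigma/c
```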

Delay jitter bound - proof via convolution notation

• First,

    ∀ t ≥ 0,  D(t) ≥ (A ⊗ s_min)(t)
                   ≥ (A ⊗ (b_in ⊗ ∆_{d_max}))(t)
                   = ((A ⊗ b_in) ⊗ ∆_{d_max})(t)
                   ≥ (A ⊗ ∆_{d_max})(t)
                   = A(t − d_max).

• Finally,

    ∀ t ≥ 0,  D(t) ≤ (A ⊗ s_max)(t)
                   ≤ (A ⊗ ∆_{d_min})(t)
                   = A(t − d_min).

End-to-end network calculus - exercise

• Consider the tandem queues of static flow-routes across multiple network devices.

• Suppose a given end-to-end flow with network arrivals A ∝ b_in visits FIFO queues indexed j on its path, where each queue j has minimum and maximum service curves respectively s_min,j and s_max,j, and each queue j handles only the given flow.

• Extend the previous results on delay and output jitter from a single queue to the entire network of tandem queues as experienced by the given flow.

• See https://www.comm.utoronto.ca/~jorg/teaching/ece1545/

Dynamic routing

• Routing algorithms are highly distributed/decentralized in their response to network state because network operating conditions potentially involve:
  – a large scale with respect to traffic volume or geography or both, and/or
  – high variability in the traffic volume at both packet and connection/call level on short time-scales (possibly due in part to the routing algorithm itself), and/or
  – potentially high variability in the network topology due to, for example, node mobility, channel conditions, or node or link removals because of faults or energy depletion.

Additive path costs

• Routing algorithms often assume that the costs (or "metrics") C_r of paths/routes r are additive, i.e.,

    C_r = Σ_{l∈r} c_l,

  where c_l represents the cost of link l.

• Such nonnegative link costs include Boolean hops, i.e., c_l = 1 for all active links l (leading to path costs C_r that are hop counts, as used in the Internet), and those based on estimates of access delays at the transmitting node of the link.

Path costs based on bottlenecks

• Alternatively, path costs could be based on the bottleneck link on the path, i.e.,

    C_r = max_{l∈r} c_l.

• In a multihop wireless context, such link costs include those based on the residual energy e_l of transmitting nodes, e.g.,

    c_l = 1/e_l,

  or an estimate of the lifetime of the transmitting node of the link.

Hybrid path costs

• More complex two-dimensional link metrics of the form (c^x, c^y) may be employed to consider more than one quantity simultaneously, e.g.,
  – delay and energy, or
  – hop count and BGP policy domain factors.
• One can define the (lexicographic) order

    (c^x_1, c^y_1) ≤ (c^x_2, c^y_2)

  to mean c^x_1 < c^x_2 or {c^x_1 = c^x_2 and c^y_1 ≤ c^y_2},
  and define the cost of a path composed of links indexed 1 and 2 as (c^x_1 + c^x_2, max{c^y_1, c^y_2}).
• For example,
  – if c^x ∈ {1, ∞}, in order to count the hops of a path,
  – and c^y is based on the residual energy of the transmitting node,
  – then the chosen paths will be those with the highest bottleneck energy among those with the shortest hop count to the destination.
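Python tuples already compare lexicographically, so the two-dimensional metric and its path-composition rule can be sketched directly (the toy paths and their values are assumptions for illustration):

```python
# Sketch of the two-dimensional (hop count, bottleneck-energy cost) metric:
# hops add, energy costs take the max, and comparison is lexicographic
# (which Python tuples provide natively).

def combine(m1, m2):
    """Cost of concatenating two path segments: hops add, energy costs max."""
    return (m1[0] + m2[0], max(m1[1], m2[1]))

# Two candidate paths to the same destination, as (hops, 1/residual-energy):
path_a = combine((1, 0.2), (1, 0.5))                      # 2 hops, cost 0.5
path_b = combine((1, 0.3), combine((1, 0.1), (1, 0.2)))   # 3 hops, cost 0.3
best = min(path_a, path_b)  # lexicographic: fewer hops wins first
print(best)  # (2, 0.5): shortest hop count, then highest bottleneck energy
```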

Hybrid path costs - examples

• Or one can determine optimal paths
  – according to one metric (the primary objective) and
  – choose among these paths conditional on another metric (the secondary objective) being less than a threshold.

• For instance, suppose the primary objective is to minimize (bottleneck) energy costs, and suppose a route r has C^x_r hops and C^y_r energy cost.

• Appending link l to r, r′ = r ∪ {l}, will be considered based on costs

    (C^x_{r′}, C^y_{r′}) = (c^x_l + C^x_r, max{c^y_l, C^y_r})

  if c^x_l + C^x_r < θ^x for some threshold θ^x > 0.

• Otherwise it will set (C^x_{r′}, C^y_{r′}) = (∞, ∞) and, consequently, the network will not use route r′ nor any route r* that uses r′ (i.e., r′ ⊂ r*).
• Similarly, the network can find routes with minimal hop counts (primary objective) while avoiding any link with energy cost c^y ≥ θ^y > 0 (i.e., the residual energy of the transmitting node of the link e ≤ 1/θ^y).

Optimal routing frameworks: link states

• Within an autonomous system (AS) of the Internet, it may be feasible for routers to periodically flood the network with their link-state information.

• So, each router can build a picture of the entire layer-3 AS graph from which loop-free optimal (minimal-hop-count) intra-AS paths can be found by the OSPF and ISIS interior-gateway routing protocols (IGPs) based on Dijkstra's algorithm.

• A hierarchical OSPF framework can be employed on the component "areas" of a large AS.
• Under OSPF, each router z will forward packets ultimately destined to router v along the subpath p to a neighboring (predecessor) router r_p of v that is

    arg min_p C_p + c(r_p, v),

  where C_p is the path cost (hop count) of p.

• Dijkstra's algorithm works iteratively at each node z based on a consistent graph of the AS owing to flooded link-states:
  – optimal paths to nodes are found in order of increasing distance from z,
  – and so a spanning tree rooted at z is built out from its leaves.

Optimal routing frameworks: distance vectors

• A distributed distance-vector approach involves computing (at z) the optimal path cost from z to S as

    arg min_w c(z,w) + C(w,S),

  where c(z,w) is the single-hop/link cost of the link (z, w) between z and its neighboring node w, and C(w,S) is w's current path cost to S as advertised to z.
• Requiring only nearest-neighbor communication is generally more scalable than flooding.
• In the Internet, BGP and the IGP RIP are distance-vector based.
• BGP maintains whole-path vectors to avoid loops and to implement important inter-domain routing policies (that may take precedence over distance).

• Also, BGP employs route reflectors, poison reverse, dynamic minimum route advertisement interval (MRAI) adjustments, and other mechanisms to dampen the frequency of route updates, reduce responsiveness (to, e.g., changing traffic conditions, link or node withdrawals), and improve stability/convergence properties.

• So, both Dijkstra's and the distributed Bellman-Ford algorithms use the fundamental "principle of optimality" (easily proved by contradiction): all subroutes of any optimal (minimum-cost) route are themselves optimal.

Example - shortest path on a graph

• Suppose we are planning the construction of a highway from city A to city K.
• Different construction alternatives and their "edge" costs g ≥ 0 between directly connected cities (nodes) are given in the following graph.

• The problem is to determine the highway (edge sequence) with the minimum total (additive) cost.

Bellman's principle of optimality - exercise

• If C belongs to an optimal (by edge-additive cost J*) path from A to B, then the sub-paths A to C and C to B are also optimal,
• i.e., any sub-path of an optimal path is optimal (easy proof by contradiction).

• Dijkstra's algorithm uses the predecessor node of the destination (path penultimate node) and is based on complete link-state (edge-state) info consistently shared among all nodes:

    J*(A, B) = min_C {J*(A, C) + g(C, B) | C is a predecessor of B},

  i.e., C and B are adjacent nodes in the graph (endpoints of the same edge).
• The distributed Bellman-Ford algorithm uses the successor node of the path origin and only nearest-neighbor distance-vector information sharing:

    J*(A, B) = min_C {g(A, C) + J*(C, B) | C is a successor of A}.
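The distributed Bellman-Ford recursion above can be sketched as repeated local relaxations using only neighbors' advertised costs (the toy graph and convergence-by-sweeps are assumptions of this illustration):

```python
# Bellman-Ford sketch: each node z repeatedly sets J(z) = min over neighbors w
# of c(z,w) + J(w), using only the neighbors' advertised costs, until no
# estimate changes (destination is "B" in this toy undirected graph).

def bellman_ford(neighbors, dest):
    J = {v: (0.0 if v == dest else float("inf")) for v in neighbors}
    changed = True
    while changed:
        changed = False
        for z in neighbors:
            if z == dest:
                continue
            best = min((c + J[w] for w, c in neighbors[z].items()),
                       default=float("inf"))
            if best < J[z]:
                J[z] = best
                changed = True
    return J

neighbors = {
    "A": {"C": 1, "D": 6},
    "C": {"A": 1, "D": 2, "B": 5},
    "D": {"A": 6, "C": 2, "B": 1},
    "B": {"C": 5, "D": 1},
}
print(bellman_ford(neighbors, "B"))  # {'A': 4.0, 'C': 3.0, 'D': 1.0, 'B': 0.0}
```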

Review of Elements of Probability

• The sample space (Ω, F, P).
• Random variables and their distributions.
• The law of large numbers.
• See slide deck at http://www.cse.psu.edu/~gik2/teach/Prob-4.pdf

Stationary, Ergodic, Stable and Lossless Stochastic Systems

• Finite-dimensional distributions of a stochastic process
• Stationarity and ergodicity
• Little's result for stable and lossless queueing systems
• Probabilistic service curves
• Flow-balance equations of a network of queues

Stochastic Processes - Introduction

• A stochastic (or random) process is a set of random variables indexed by a parameter (e.g., time, location).

• If the time parameter takes values only in Z+ (or any other countable subset of R), the stochastic process is said to be discrete-time, i.e., {X(t) | t ∈ Z+}.
• If the time parameter t takes values over R or R+ (or any real interval), the stochastic process is said to be continuous-time.

• The dependence on the sample ω ∈ Ω can be explicitly indicated by writing X_ω(t).
• For a given sample ω, the random object mapping t → X_ω(t), for all t ∈ R+ say, is called a sample path of the stochastic process X.

Stochastic Processes - Introduction (cont)

• The state space of a stochastic process is simply the union of the strict ranges of the random variables {X(t) | t ∈ Z+}.
• We will restrict our attention to stochastic processes with countable state spaces, typically Z, Z+, or a finite subset {0, 1, 2, ..., K}.
• Of course, this means that the random variables X(t) are all discretely distributed.
• We will also focus on continuous time, so that the queueing systems we consider will be a little easier to analyze.

Finite-dimensional distributions of a stochastic process

• Consider a continuous-time stochastic process X = {X(t) | t ∈ R+} with state space Z+.

• Let p_{t1,t2,...,tn} be the joint PMF of X(t1), X(t2), ..., X(tn) for some finite n and different t_k ∈ R+ for all k ∈ {1, 2, ..., n}, i.e.,

    p_{t1,t2,...,tn}(x1, x2, ..., xn) = P(X(t1) = x1, X(t2) = x2, ..., X(tn) = xn).

• This is called an n-dimensional distribution of X.
• The family of all such joint PMFs is called the set of finite-dimensional distributions (FDDs) of X.

Consistent finite-dimensional distributions

• A family of FDDs (on state-space Z+, with time t ∈ R+) is consistent if marginalizing (reducing the dimension of) one results in another, e.g.,

    p_{t1,t2,t4}(x1, x2, x4) ≡ Σ_{x3∈Z+} p_{t1,t2,t3,t4}(x1, x2, x3, x4).

• Recall that consistency ought to hold simply because

    P(A) = Σ_{x3∈Z+} P(A, X(t3) = x3),  where A := {X(t1) = x1, X(t2) = x2, X(t4) = x4}.

• Beginning with a family of consistent FDDs, Kolmogorov's extension (or "consistency") theorem demonstrates the existence of a stochastic process t → X_ω(t), ω ∈ Ω, that possesses them.

Stationarity of a stochastic process

• A stochastic process X is said to be (strongly) stationary if all of its FDDs are time-shift invariant.

• That is, if

    p_{t1,t2,...,tn} ≡ p_{t1+τ, t2+τ, ..., tn+τ}

  for all integers n ≥ 1, all t_k ∈ R+, and all τ ∈ R such that t_k + τ ∈ R+ for all k.

Stationary queues

• Consider the ith job arriving at time T_i to a FIFO single-server, nonidling queue.
• The departure time of this job is given by

    V_i = T_i + W(T_i),

  where W is the queue's workload process.

• If the queue is stationary, the sojourn times of the jobs are identically distributed.
• Indeed, suppose we are interested in the distribution or just the mean of the job sojourn times.

• One is tempted to identify the distribution of the sojourn times V − T with the stationary distribution of W; cf. the "PASTA" (Poisson Arrivals See Time Averages) rule, a.k.a. the arrival theorem.

• But in general the distribution of W(T_n−) (i.e., the distribution of the W process viewed just before a "typical" job arrival time T_n) is not equal to the stationary distribution of W (i.e., viewed at a typical time).

Loynes' construction of a stationary queue viewed at finite time (0)

• Consider a stationary marked point process on R, where a mark S is a random variable associated with an arrival time T (point).

• Assume that the marks S are the service times of the arrivals by a unit server (one unit of work per second), which do not depend on future arrivals/marks (i.e., are non-anticipative, causal).

Loynes' construction of a stationary queue viewed at time 0 (cont)

• Suppose that the arrivals commence at some negative time r < 0, i.e., ignore arrivals at times T_i < r.

• So the work-to-be-done of a single-server queue at (typical) time 0 is

    W_r(0) = max_{r≤t≤0} [ ( Σ_{i : t≤T_i≤0} S_i ) + t ],

  where S_i is the service time of the ith job arriving at time T_i.

• Note that as r → −∞, W_r(0) monotonically increases.
• If λ = (E(T_i − T_{i−1}))^{-1} < ∞ and the queue is stable, i.e., 1 > λ E S_i, then

    lim_{t→−∞} ( Σ_{i : t≤T_i≤0} S_i ) + t = −∞,  and so

    lim_{r→−∞} W_r(0) ↑ W(0) < ∞ a.s.,

  the stationary queue on R viewed at a typical, finite time 0.
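Loynes' monotone construction is the analogue, embedded at arrival times, of iterating the Lindley recursion W ← max(0, W + S − τ). A small simulation sketch for a stable queue (the exponential interarrival and service distributions are assumptions for illustration):

```python
import random

# Lindley-recursion sketch: the workload embedded at successive arrivals
# satisfies W_{i+1} = max(0, W_i + S_i - tau_i); for a stable queue
# (lambda * E[S] < 1) it stays finite and converges in distribution.
random.seed(1)

def workload(num_jobs, lam, mean_service):
    """Workload just before the (num_jobs + 1)th arrival, starting empty."""
    W = 0.0
    for _ in range(num_jobs):
        S = random.expovariate(1.0 / mean_service)  # service time S_i
        tau = random.expovariate(lam)               # next interarrival tau_i
        W = max(0.0, W + S - tau)                   # Lindley recursion
    return W

# Stable since lam * E[S] = 1.0 * 0.8 < 1, so the workload stays finite:
print(workload(10_000, lam=1.0, mean_service=0.8))
```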

Stationary queueing system viewed at typical time vs at typical job

• We will now explore the relationship between the stationary distribution of a queueing system (i.e., as viewed from a typical time) and the distribution of the queueing system at the arrival time of a typical job - we now illustrate the potential difference.

• Consider a stationary and ergodic point process on R whose interarrival times τ are discretely distributed as

    P(τ = 5) = 1/4  and  P(τ = 10) = 3/4.

• Also consider a large interval of time H ≫ 1 spanning N consecutive interarrivals.

• Consider an interarrival interval T1 − T0 viewed at a typical time 0, i.e., by definition T0 < 0 ≤ T1 a.s.
• The probability of selecting such an interval of length, say, 5 is equal to the fraction of H covered by interarrivals of length 5.

• That is, since H ≫ 1, by the law of large numbers H ≈ N(5 · (1/4) + 10 · (3/4)), and so

    P(T1 − T0 = 5) = N · 5(1/4) / [N(5(1/4) + 10(3/4))] = 1/7 ≠ 1/4 = P(τ = 5).

• T1 − T0 ∼ τ when job arrivals are Poisson, cf. PASTA.

A lossless, stationary, stable queue: input rate equals output rate

• Let λ be the mean arrival rate and µ the mean service rate of jobs (data packets) at a stable queue, i.e., µ > λ.

• Theorem: For a stable, lossless and stationary queue, the mean (net) arrival rate equals the mean departure rate in steady state, i.e.,

    λ := lim_{t→∞} A(0,t]/t = lim_{t→∞} D[0,t)/t,

  where A(0,t] and D[0,t) are the cumulative arrivals and departures, respectively.

• Proof: That the queue is stable implies Q(t)/t → 0 almost surely as t → ∞.
• Since

    Q(0) + A(0,t] = Q(t) + D[0,t),

  dividing this equation by t and letting t → ∞ gives the desired result.
• Note: The mean departure rate of the stable queue (λ) is less than µ, as the server is active only when Q > 0.

Little's result: L = λw

• Consider a stationary, lossless, and stable queueing system.
• Partition an interval of time of length T ≫ 1 so that the number of jobs in the system is constant in each subinterval.

• That is, jobs arrive at or depart the queueing system only at partition boundaries.
• Let J be the number of departures of jobs over [0, T].
• Let t_k be the duration of the kth interval, so that Σ_{k=1}^K t_k = T.
• Let n_k be the average number of jobs in the system during the kth interval.
• Thus, the time-average number of jobs in the system over [0, T] is

    L ≈ Σ_{k=1}^K n_k (t_k / T) = (1/T) Σ_{k=1}^K n_k t_k.

Little's result: L = λw (cont)

• Assume any jobs initially in the system (i.e., Q(0)) or any that remain (i.e., Q(T)) are negligible compared to J when T ≫ 1; so J is approximately the number of arrivals over [0, T] too.

• Thus,

    λ ≈ J/T.

• Similarly, the mean sojourn time (queueing delays plus service times) of jobs in the queueing system is

    w ≈ (1/J) Σ_{k=1}^K n_k t_k,

  where the numerator is the total sojourn time of all jobs in the interval [0, T].

• By substitution, we arrive at Little's result: L = λw.
• A rigorous proof of Little's result is based on a powerful conservation law for stationary marked point processes, Campbell's theorem.
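Little's result can be checked numerically on any sample path that begins and ends empty; a toy example with three jobs (the arrival/departure times are assumed for illustration):

```python
# Toy numeric check of L = lambda * w: three jobs with known arrival and
# departure times over a window [0, T] that begins and ends empty.
jobs = [(0.0, 3.0), (1.0, 4.0), (2.0, 6.0)]    # (arrival, departure) times
T = 8.0

lam = len(jobs) / T                            # arrival rate J/T
w = sum(d - a for a, d in jobs) / len(jobs)    # mean sojourn time
area = sum(d - a for a, d in jobs)             # integral of N(t) over [0, T]
L = area / T                                   # time-average number in system

print(L, lam * w)  # both equal 1.25
assert abs(L - lam * w) < 1e-12
```

The identity holds exactly here because the window contains whole sojourns, mirroring the "negligible edge effects" assumption above.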

Little's result - discussion and example

• To reiterate, Little's result relates
  – the average number of jobs in the stationary lossless queueing system (i.e., the average number of jobs viewed at a typical time 0)
  – to the mean sojourn time of a typical job.

• For example: We will see that the mean number of jobs in a stationary "M/M/1" queue is

    L = ρ/(1 − ρ),

  where ρ = λ/µ < 1 is the traffic intensity.

• By Little's result, the mean workload in the M/M/1 queue upon arrival of a typical job (i.e., the mean sojourn time of a job) is

    w = L/λ = 1/(µ − λ).

Little's result: mean server busy-time

• Now consider again a lossless, FIFO, single-server queue Q with mean job interarrival time 1/λ and mean job service time 1/µ < 1/λ,

• i.e., mean job arrival rate λ and mean job service rate µ > λ.
• Suppose the queue and arrival process are stationary at time zero.
• The following result identifies the traffic intensity λ/µ with the fraction of time that the stationary queue is busy.

Little's result: mean server busy-time (cont)

• Theorem: For a nonidling, stationary and stable (λ < µ) queue Q,

    P(Q(0) = 0) = 1 − λ/µ = 1 − ρ.

• Proof: Consider the server separately from the waiting room and note that the server is idle if and only if the entire queueing system is empty, i.e., Q = 0.

• Since the mean departure rate of the waiting room is λ too, and w = 1/µ for the server, Little's result implies that the mean number of jobs in the server is L = λ/µ.

• Finally, since the stationary number of jobs in the server is Bernoulli distributed (with parameter L), the mean corresponds to the probability that the server is occupied (has one job) in steady state.

• As above, note that the mean departure rate is

    µ · P(Q > 0) + 0 · P(Q = 0) = µ · ρ = λ.

Probabilistic service curves - gSBB

• Recall that a scheduler acting on a queue is said to offer a service-curve β if
  – β is nondecreasing with β(0) = 0, and
  – for all cumulative arrivals A and for all times t ≥ 0 such that the queue is always backlogged over [0, t], the cumulative departures D from that queue satisfy

    D[0,t] ≥ min_{0≤s≤t} A[0,s) + β(t − s) = min_{0≤s≤t} A[0,t−s) + β(s).

• Now consider a queue occupancy process Q with cumulative arrivals A and a service rate of exactly ρ bytes/s.

• A is said to have generalized stochastically bounded burstiness (gSBB, or strong SBB) with bound f_ρ at ρ if

    ∀ t ≥ 0,  P(Q_ρ(t) ≥ σ) ≤ f_ρ(σ),

  where f_ρ ≥ 0 is a nonincreasing function with f_ρ(0) = 1 and, as before, Q(0) = 0 and, for t > 0,

    Q(t) = max_{0≤s≤t} A(s,t] − ρ(t − s);

  see Y. Jiang et al., "Fundamental Calculus on gSBB ...," Comp. Nets 53(12), Aug. 2009.

Other probabilistic service curves

• Alternatively, we can work with a weaker definition: define A as having (weak) SBB by f_ρ at ρ if

    ∀ t ≥ s ≥ 0,  P(A(s,t] − ρ(t − s) ≥ σ) ≤ f_ρ(σ);

  see D. Starobinski and M. Sidi, "SBB for Comm. Nets," IEEE ITT 46(1), Jan. 2000.

• An earlier framework involves bounds/envelopes on the log moment generating function of the cumulative arrivals A; see C.-S. Chang, "Stability, Queue Length, ...," IEEE TAC 39(5), May 1994.

Probabilistic service curves - gSBB (cont)

• We denote A ∝ (ρ, f) if A has gSBB with bound f at constant service rate ρ.
• Clearly, A ∝ (ρ, f) implies A ∝ (r, f) for r > ρ.
• Also note that this definition reduces to the (σ*, ρ) constraint when f(σ) ≡ u(σ − σ*).
• Note that, unlike the gSBB, the deterministic (σ, ρ) constraint is policeable on a packet-by-packet basis.

Probabilistic service curves - gSBB (cont)

(Block diagram: arrivals A enter a queue with service curve β; its departures D feed a second queue Q̃ served at rate ρ.)

• Theorem: If an initially empty queue Q has service curve β and arrivals A ∝ (ρ, f), then its departures D ∝ (ρ, g), where g(x) ≡ f(x + min_{s≥0} {β(s) − ρs}).
• Proof: Consider the backlogs of the initially empty tandem queues Q, Q̃, with the latter having arrivals D and service rate ρ, so that

    Q̃(t) = max_{0≤s≤t} D[0,t] − D[0,s] − (t − s)ρ
         ≤ max_{0≤s≤t} A[0,t] − min_{0≤u≤s} [ A[0,s−u) + β(u) ] − (t − s)ρ
         = max_{0≤s≤t} max_{0≤u≤s} A[s−u,t) − ρ(t − s + u) + ρu − β(u)
         ≤ Q(t) + max_u { ρu − β(u) }.

  Applying this inequality to the definition of gSBB proves the theorem.

• Exercise: Extend this theorem to an end-to-end result for a flow crossing tandem schedulers, each giving the flow a different service curve β.

Flow-balance equations - preliminaries

• Consider a stationary system consisting of a group of N ≥ 2 lossless, single-server, work-conserving queueing stations.

• Jobs at the nth station have a mean required service time of 1/µ_n.
• The job arrival process to the nth station is a superposition of N + 1 component arrival processes.

• Jobs departing the mth station are forwarded to (and immediately arrive at) the nth station with probability r_{m,n}.

• Also, with probability r_{m,0}, a job departing station m leaves the queueing network forever; here we use station index 0 to denote the world outside the network.

• Clearly, for all m,

    Σ_{n=0}^N r_{m,n} = 1.

• Arrivals from the outside world arrive at the nth station at rate Λ_n; it's these interactions with the outside world that make the network open.

Flow-balance equations (cont)

• Let λ_n be the total arrival rate to the nth (stable) station (= its total departure rate).
• These are found by solving the so-called flow-balance equations, which are based on the notion of conservation of flow and require that all queues be stable, i.e.,

    ∀ n, µ_n > λ_n.

• Since the mean arrival rate equals the mean departure rate, the flow-balance equations are

    λ_n = Λ_n + Σ_{m=1}^N λ_m r_{m,n},  ∀ n ∈ {1, 2, ..., N}.

• Note that the flow-balance equations can be written in matrix form:

    λ^T (I − R) = Λ^T,

  where the N × N matrix R has entry r_{m,n} in the mth row and nth column.
• Note: We could define the total throughput of the system λ_0 = Σ_{m=1}^N Λ_m so that r_{0,m} = Λ_m/λ_0.

Flow-balance equations - solution requirements

• Thus,

    λ^T = Λ^T (I − R)^{-1}.

• Again, we are assuming that λ < µ (componentwise) for stability.
• Also, we clearly require that det(I − R) ≠ 0, i.e., that I − R is invertible.

• This (and stability and stationarity) requires that r_{m,0} > 0 for some station m, i.e., jobs can exit the network and don't on-average accumulate in it.

• Otherwise,
  – on average work accumulates in the system and so it cannot be stationary, and
  – R would be a stochastic matrix (all entries nonnegative and all rows summing to 1), so that 1 is an eigenvalue of R and, therefore, 0 is an eigenvalue of I − R, i.e., I − R is not invertible.

• Note: One can define stationary queueing systems that are closed, i.e., with r_{n,0} = 0 = r_{0,n} for all n; in such systems there are no such stability requirements.

Flow-balance equations - solution requirements

• We can also write a flow-balance equation between the outside world and the queueing network as a whole by summing over the individual queueing stations n ∈ {1, ..., N} to get:

    Σ_{n=1}^N Λ_n = Σ_{n=1}^N λ_n r_{n,0},

  i.e., the total flow into the queueing network equals the total flow out of the network, as in the previous theorem.

• The flow-balance equations hold in great generality.
• In the following, we will apply them to derive the stationary distribution of a special network with Markovian dynamics.

Flow-balance equations - example

(Figure: a three-queue network; exogenous arrivals Λ1 and Λ2 enter queues 1 and 2; jobs are routed between the queues with probabilities r12, r21, r13, r31, r23, r32, and exit from queue 3 with probability r30.)

• This example network has three lossless FIFO queues; queues 1 and 2 respectively have exogenous arrival rates Λ1 and Λ2 jobs per second.

• The mean service time at queue k is 1/µ_k.
• The nonzero job routing probabilities are

    r12 = r13 = 1/2,  r21 = r23 = 1/2,  r31 = r32 = r30 = 1/3,

  where again the subscript 0 represents the outside world.

Flow-balance equations - example (cont)

• Assuming that the queues are all stable, the flow-balance equations are

  λ1 = Λ1 + (1/2)λ2 + (1/3)λ3,
  λ2 = Λ2 + (1/2)λ1 + (1/3)λ3,
  λ3 = (1/2)λ1 + (1/2)λ2.

• Thus, in matrix form,

  ⎡  1    −1/2  −1/3 ⎤ ⎡ λ1 ⎤   ⎡ Λ1 ⎤
  ⎢ −1/2    1   −1/3 ⎥ ⎢ λ2 ⎥ = ⎢ Λ2 ⎥
  ⎣ −1/2  −1/2    1  ⎦ ⎣ λ3 ⎦   ⎣  0 ⎦

  which implies

  λ = ((I − R)T)^{-1} Λ.
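As a numerical sketch of this example, the matrix form above can be handed to a linear solver; the exogenous rates Λ1 = 1 and Λ2 = 2 jobs/s below are hypothetical values chosen only for illustration (numpy assumed available).

```python
import numpy as np

# Routing-probability matrix R for the 3-queue example (outside world omitted):
# r12 = r13 = 1/2, r21 = r23 = 1/2, r31 = r32 = 1/3 (and r30 = 1/3 to outside).
R = np.array([[0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5],
              [1/3, 1/3, 0.0]])

Lam = np.array([1.0, 2.0, 0.0])   # hypothetical exogenous rates Λ1, Λ2, Λ3 = 0

# Flow balance: (I - R)^T λ = Λ  =>  λ = ((I - R)^T)^{-1} Λ.
lam = np.linalg.solve((np.eye(3) - R).T, Lam)
print(lam)

# Conservation check: flow out to the outside world, λ3 * r30, equals Λ1 + Λ2.
assert abs(lam[2] * (1/3) - (Lam[0] + Lam[1])) < 1e-12
```

Any service rates with µk > λk then keep this example stable.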

Flow-balance equations - example (cont)

• Given the total flow rates λ, the service rates µk need to be chosen so that µk > λk for all queues k to achieve stability and stationarity (so that the flow-balance equations hold).

• Note that the mean departure rate to the “outside world,” λ3 r30, will work out from the flow-balance equations to be

  λ3 r30 = Λ1 + Λ2.

• Finally, the stability assumption requires that the service rates satisfy

  µT > λT = ΛT (I − R)^{-1} (componentwise).

Exercise - maximum throughput of a network processor

• Consider a network processor (NP) with multiple internal engines/stations, e.g., for: 1. header-checksum processing, 2. TTL decrement, 3. forwarding look-up, and 4. flow-based processing (e.g., token-bucket based policing or shaping, or prioritizing) by a flow engine.

• An NP needs to be able to operate at a “worst-case” prescribed packet (job) arrival rate;

• e.g., for an OC-48 line, 2.5 Gbps = 7.8 Mpps =: λ0, assuming the worst case that all IP packets are 40 bytes long and all packets pass through the first four engines.

• Suppose all packets arriving to the 4th (flow) engine cause a flow lookup operation, and thereafter a number N of different “flow sub-engine” visits, indexed 5 to N + 4, may occur.

• Define the probabilities r0; rk,k+1 = 1 ∀k < 4; rk,0 = r0 ∀k ≥ 4; r4,j = (1 − r0)/N ∀j > 4; rk,j = (1 − r0)/(N − 1) ∀k ≠ j ≥ 5.

• Exercise: In terms of r0 and N:
  – Find the average number of flow sub-engine visits by a packet.
  – Find the minimum service capacity of each engine and sub-engine so that λ0 is the throughput of the NP.

Markovian queuing systems in continuous time

• Introduction
• Memoryless property of the exponential distribution
• Finite-dimensional distributions and stationarity
• The Poisson counting process
• Poisson Arrivals See Time Averages (PASTA)
• Time-homogeneous Markov processes on countable state space (Markov chains)
• Fitting a Markov model to data
• Birth-death Markov chains
• Markovian queuing models: single queues and queuing networks
• Models with embedded Markov chains
• Discrete-time Markov chains and queues

Markov modeling - state variables

• More complex performance metrics, such as the distribution of delays experienced by jobs, require more detailed modeling of the (stationary) queueing system.

• Application of Markovian models begins with identifying state variables in the data (or the system that generated the data).

• The current state summarizes the past evolution of the data so that one need not remember the past in order to determine/predict the future evolution of the data/system.

• This is consistent with the notion of a finite-state machine in computer science.

• In deterministic linear circuits, the state variables are “outputs of integrators,” i.e., voltages across capacitors C,

  vC(t) = vC(s) + (1/C) ∫_s^t iC(τ) dτ  ∀t ≥ s,

  and currents through inductors.

• In a stochastic setting, continuous-time Markov processes have a special structure involving the (memoryless) exponential distribution.

Memoryless property of the exponential distribution

• If X is exponentially distributed, then ∀x, y > 0,

  P(X > x + y | X > y) = P(X > x).

• The proof is an immediate consequence of the distribution of an exponential, P(X > x) = e^{−λx} u(x), where EX = λ^{−1} and u is the unit step, u(x) = 1{x ≥ 0}.

• This is the memoryless property and its simple proof is left as an exercise.

• For example, if X represents the duration of the lifetime of a light bulb, the memoryless property implies that, given that X > y, the probability that the residual lifetime (X − y) is greater than x is equal to the probability that the unconditioned lifetime is greater than x.

• So, in this sense, given X > y, the lifetime has “forgotten” that X > y.

• Among all continuously distributed random variables, only exponentially distributed ones have this property; among all discretely distributed random variables, only geometrically distributed ones do.
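A small Monte Carlo sketch of the memoryless property; the rate 1.5 and the points x = 0.4, y = 0.7 are arbitrary choices made only for this illustration.

```python
import random

# Check that P(X > x + y | X > y) matches P(X > x) for an exponential X.
random.seed(0)
rate, x, y = 1.5, 0.4, 0.7
samples = [random.expovariate(rate) for _ in range(200_000)]

exceed_y = [s for s in samples if s > y]
cond = sum(s > x + y for s in exceed_y) / len(exceed_y)   # P(X > x+y | X > y)
uncond = sum(s > x for s in samples) / len(samples)        # P(X > x)
print(cond, uncond)  # both should be near exp(-1.5 * 0.4) ≈ 0.549
```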


Minimum of independent exponentially distributed random variables

• If X1 ∼ exp(λ1) and X2 ∼ exp(λ2) are independent, then

  min{X1, X2} ∼ exp(λ1 + λ2).

• Proof: Define Z = min{X1, X2} and let FZ(z) = P(Z ≤ z), F1, and F2 be the c.d.f.s of Z, X1, and X2, respectively.

• Clearly, FZ(z) = 0 for z < 0 and, for z ≥ 0,

  1 − FZ(z) = P(min{X1, X2} > z)
            = P(X1 > z, X2 > z)
            = P(X1 > z) P(X2 > z)  (by independence)
            = exp(−(λ1 + λ2)z),

  as desired.

Minimum of independent exponentially distr’d random variables (cont)

• Again, if X1 ∼ exp(λ1) and X2 ∼ exp(λ2) are independent, then

  P(min{X1, X2} = X1) = λ1/(λ1 + λ2).

• Proof:

  P(min{X1, X2} = X1) = P(X1 ≤ X2)
   = ∫_0^∞ ∫_{x1}^∞ λ1 e^{−λ1 x1} λ2 e^{−λ2 x2} dx2 dx1
   = ∫_0^∞ λ1 e^{−(λ1 + λ2) x1} dx1
   = λ1/(λ1 + λ2)

• (which is increasing in λ1 = 1/EX1).

• Two independent geometrically distributed random variables also have these properties.
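Both properties of competing exponentials can be checked by simulation; the rates λ1 = 2 and λ2 = 3 below are arbitrary illustrative choices.

```python
import random

# Monte Carlo check: min{X1, X2} ~ exp(λ1 + λ2), so E min = 1/(λ1+λ2) = 0.2,
# and P(min = X1) = λ1/(λ1+λ2) = 0.4.
random.seed(1)
l1, l2, n = 2.0, 3.0, 200_000
pairs = [(random.expovariate(l1), random.expovariate(l2)) for _ in range(n)]

mean_min = sum(min(x1, x2) for x1, x2 in pairs) / n   # ≈ 1/(λ1+λ2)
p_first = sum(x1 <= x2 for x1, x2 in pairs) / n        # ≈ λ1/(λ1+λ2)
print(mean_min, p_first)
```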

A counting process on R+

• A counting process X on R+ is characterized by the following properties:
  (a) X has state space Σ = Z+,
  (b) X has nondecreasing (in time) sample paths that are continuous from the right, i.e., lim_{t↓s} X(t) = X(s), and
  (c) X(t) ≤ X(t−) + 1, so that X does not make a single transition of size 2 or more, where t− is a time immediately prior to t, i.e., X(t−) := lim_{s↑t} X(s).

• We take the origin of time to be zero and, clearly, Ti ≤ Ti+1 for all i.

A counting process on R+ (cont)

• The total number of customers that arrived over the interval of time [0, t] is defined to be X(t).

• Note that X(Ti) = i and

  X(t) = Σ_{i=1}^∞ 1{Ti ≤ t} = max{i | Ti ≤ t}.

• Of course, X is an example of a continuous-time counting process whose sample paths are continuous from the right:

[Figure: staircase sample path of X(t), stepping from i − 1 up to i at each arrival time Ti; the jumps at T1, T2, T3, T4 are shown.]


The Poisson counting process - definition by interarrival times

• Now let the sequence of job interarrival times be

  Si = Ti − Ti−1

  for job indexes i ∈ {1, 2, 3, ...}, where T0 ≡ 0.

• A Poisson process is a continuous-time counting process whose interarrival times {Si}_{i=1}^∞ are mutually i.i.d. exponential random variables.

• Let the parameter of the exponential distribution of the Si’s be λ, i.e., ESi = λ^{−1} for all i.

• Since

  Tn = Σ_{i=1}^n Si,

  Tn is (gamma) distributed with parameters λ and n.
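This interarrival-time definition translates directly into a simulation: sum i.i.d. exponentials until time t is exceeded. The parameters λ = 2 and t = 10 below are arbitrary illustrative choices; the count X(t) should then be Poisson(λt), with mean and variance both ≈ 20.

```python
import random

random.seed(2)
lam, t, runs = 2.0, 10.0, 20_000

def count_by(t, lam):
    """Return X(t): the number of arrival times T_n = S_1 + ... + S_n <= t."""
    total, n = 0.0, 0
    while True:
        total += random.expovariate(lam)   # next interarrival time
        if total > t:
            return n
        n += 1

counts = [count_by(t, lam) for _ in range(runs)]
mean_count = sum(counts) / runs
var_count = sum((c - mean_count) ** 2 for c in counts) / runs
print(mean_count, var_count)   # both should be near λt = 20
```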

Marginal distribution of the Poisson process

• X(t) is Poisson distributed with parameter λt.

• For this reason, λ is sometimes called the intensity (or “mean intensity”, “mean rate”, or just “rate”) of the Poisson process X.

• Proof: First note that, for t ≥ 0,

  P(X(t) = 0) = P(T1 > t) = P(S1 > t) = e^{−λt}.

• Now, for an integer i > 0 and a real t ≥ 0,

  P(X(t) ≤ i) = P(Ti+1 > t) = ∫_t^∞ (λ^{i+1} z^i e^{−λz} / i!) dz,

  where we have used the gamma p.d.f.

• By integrating by parts, we get the Poisson c.d.f.:

  P(X(t) ≤ i) = [−((λz)^i / i!) e^{−λz}]_{z=t}^∞ + ∫_t^∞ (λ^i z^{i−1} e^{−λz} / (i−1)!) dz
              = (λt)^i e^{−λt} / i! + ∫_t^∞ (λ^i z^{i−1} e^{−λz} / (i−1)!) dz.


Marginal distribution of the Poisson process - Proof (cont)

• After successively integrating by parts in this manner, we get

  P(X(t) ≤ i) = (λt)^i e^{−λt}/i! + ··· + (λt) e^{−λt}/1! + ∫_t^∞ λ e^{−λz} dz
              = Σ_{j=0}^i (λt)^j e^{−λt}/j!.

• Now note that {X(t) = i} and {X(t) ≤ i − 1} are disjoint events and

  {X(t) = i} ∪ {X(t) ≤ i − 1} = {X(t) ≤ i}.

• Thus,

  P(X(t) = i) = P(X(t) ≤ i) − P(X(t) ≤ i − 1)
              = Σ_{j=0}^i (λt)^j e^{−λt}/j! − Σ_{j=0}^{i−1} (λt)^j e^{−λt}/j!
              = (λt)^i e^{−λt}/i!.

Increments of a Poisson Process

• X is a Poisson process if and only if, for all k, all disjoint intervals (s1,t1], (s2,t2], ..., (sk,tk] ⊂ R+, and all n1, n2, ..., nk ∈ Z+,

  P(X(t1) − X(s1) = n1, X(t2) − X(s2) = n2, ..., X(tk) − X(sk) = nk)
   = Π_{i=1}^k ([λ(ti − si)]^{ni} / ni!) e^{−λ(ti − si)}.

• X(ti) − X(si), called an increment of X, is the number of transitions of the Poisson process in the interval of time (si, ti].

• Thus, the Poisson process has independent (nonoverlapping) increments.

• Also, the increment over a time interval of length τ is Poisson distributed with parameter λτ.
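The independent-increments property can also be observed empirically: counts over the disjoint intervals (0, 1] and (1, 2] should each have mean λ and (near-)zero covariance. The rate λ = 3 is an arbitrary illustrative choice.

```python
import random

random.seed(3)
lam, runs = 3.0, 50_000
inc1, inc2 = [], []
for _ in range(runs):
    t, n1, n2 = 0.0, 0, 0
    while True:
        t += random.expovariate(lam)   # next arrival time
        if t > 2.0:
            break
        if t <= 1.0:
            n1 += 1                     # arrival in (0, 1]
        else:
            n2 += 1                     # arrival in (1, 2]
    inc1.append(n1)
    inc2.append(n2)

m1 = sum(inc1) / runs
m2 = sum(inc2) / runs
cov = sum(a * b for a, b in zip(inc1, inc2)) / runs - m1 * m2
print(m1, m2, cov)   # means near λ = 3, covariance near 0
```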


Equivalent definitions of a Poisson process

• The Poisson process is the only counting process that
  – has Poisson distributed independent increments, or
  – has i.i.d. exponentially distributed interarrival times, or
  – possesses the conditional uniformity property.

• That is, all of these properties are equivalent.

• The memoryless property of the exponential distribution is principally responsible for the independent increments property of a Poisson process.

Poisson process - finite-dimensional distributions

• We will now derive the k(≥ 1)-dimensional distribution of a Poisson process by using the independent increments property.

• Consider times 0 ≤ t1 < t2 < ··· < tk and the joint probability

  P(X(t1) = m1, X(t2) = m2, ..., X(tk) = mk),

  where m1, m2, ..., mk ∈ Z+ and mi ≤ mi+1 for all i (otherwise the probability above would be zero).

• Define ∆mi := mi − mi−1 and ∆Xi := X(ti) − X(ti−1) to get

  P(X(t1) = m1, ∆X2 = ∆m2, ..., ∆Xk = ∆mk)
   = P(∆X2 = ∆m2, ..., ∆Xk = ∆mk | X(t1) = m1) P(X(t1) = m1)
   = P(∆X2 = ∆m2, ..., ∆Xk = ∆mk) P(X(t1) = m1),

  where the last equality is by the independent increments property.

• By repeating this argument, we get that the above k-dimensional distribution is

  P(X(t1) = m1) Π_{i=2}^k P(X(ti) − X(ti−1) = mi − mi−1)
   = ((λt1)^{m1}/m1!) e^{−λt1} Π_{i=2}^k ((λ(ti − ti−1))^{mi − mi−1}/(mi − mi−1)!) e^{−λ(ti − ti−1)}.

Poisson Arrivals See Time Averages (PASTA)

• Consider a stationary system X that is causally dependent upon a Poisson process N.

• Consider dt > 0 very small.

• Consider the distribution of X at times just prior to (or as seen by) a Poisson arrival:

  P(X(t) ∈ A | N(t + dt) − N(t) = 1)
   = P(N(t + dt) − N(t) = 1 | X(t) ∈ A) P(X(t) ∈ A) / P(N(t + dt) − N(t) = 1)
   = P(N(t + dt) − N(t) = 1) P(X(t) ∈ A) / P(N(t + dt) − N(t) = 1)
   = P(X(t) ∈ A),

  where the second equality is due to the independent increments of the Poisson process.

• Formally, PASTA is an immediate consequence of Palm’s theorem for stationary systems; see F. Baccelli and P. Bremaud, Elements of Queueing Theory, Springer, 2003.
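PASTA can be illustrated with a small event-driven M/M/1 simulation: the fraction of Poisson arrivals that find the server busy should match the long-run fraction of time the server is busy. The rates λ = 0.5 and µ = 1.0 (so ρ = 0.5) and the horizon are arbitrary choices for this sketch.

```python
import random

random.seed(4)
lam, mu, horizon = 0.5, 1.0, 200_000.0
t, q = 0.0, 0
busy_time = 0.0             # total time with q >= 1
seen_busy, arrivals = 0, 0
next_arr = random.expovariate(lam)
next_dep = float('inf')

while t < horizon:
    if next_arr < next_dep:              # next event: a Poisson arrival
        if q >= 1:
            busy_time += next_arr - t
        t = next_arr
        seen_busy += (q >= 1)            # state seen just before the arrival
        arrivals += 1
        q += 1
        if q == 1:                       # server was idle: schedule a departure
            next_dep = t + random.expovariate(mu)
        next_arr = t + random.expovariate(lam)
    else:                                # next event: a departure
        busy_time += next_dep - t
        t = next_dep
        q -= 1
        next_dep = t + random.expovariate(mu) if q >= 1 else float('inf')

frac_time_busy = busy_time / t
frac_seen_busy = seen_busy / arrivals
print(frac_time_busy, frac_seen_busy)    # both near ρ = λ/µ = 0.5
```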

Poisson processes on Rn for n ≥ 1

• A stationary Poisson process on the whole real line R is defined by
  – a countable collection of points {τi}_{i=−∞}^∞,
  – where the interarrival times τi − τi−1 are i.i.d. exponential random variables.

• Alternatively, we can characterize a Poisson process on R by stipulating that
  – the number of points in any interval of length t is Poisson distributed with mean λt, and
  – the numbers of points in nonoverlapping intervals are independent.

• This last characterization naturally extends to that of a Poisson process on Rn for all dimensions n ≥ 1, i.e., a spatial Poisson process:

• If v(A) is the volume of A ⊂ Rn,
  – then the number of points in A is Poisson distributed with mean δv(A),
  – where δ is the intensity of the Poisson process with [δ] = points/metre^n.
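A spatial Poisson process on the unit square can be simulated by the conditional-uniformity construction (sample the total count, then place the points i.i.d. uniformly); the intensity δ = 40 and the subregion A = [0, 0.5]² are arbitrary choices for this sketch.

```python
import random

random.seed(5)
delta, runs = 40.0, 5_000

def poisson_sample(mean):
    """Sample a Poisson(mean) count via unit-rate exponential interarrivals."""
    total, n = 0.0, 0
    while True:
        total += random.expovariate(1.0)
        if total > mean:
            return n
        n += 1

counts_in_A = []
for _ in range(runs):
    n = poisson_sample(delta)                      # points in the whole unit square
    pts = [(random.random(), random.random()) for _ in range(n)]
    counts_in_A.append(sum(x <= 0.5 and y <= 0.5 for x, y in pts))

mean_A = sum(counts_in_A) / runs
print(mean_A)   # should be near δ·v(A) = 40 × 0.25 = 10
```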


Example: Hand-off rates among wireless cells

• For this example, we need the following result: the Poisson property is preserved by i.i.d. random shifts of the points.

• Theorem: If {τi} is a Poisson process in Rn with intensity δ and the random vectors {Yi} in Rn are i.i.d. and a.s. bounded, then {τi + Yi} is a Poisson process with intensity δ as well.

• In the two-dimensional plane R2 covered by roughly circular cells, assume each mobile takes a direct path through each cell.

• At a cell boundary, an independent and uniformly distributed random change of direction occurs for each mobile.

• A sample path of a single mobile is depicted in the following figure, where the dot at the center of a cell is its base station.

Example: Hand-off rates among wireless cells (cont)

[Figure: sample path of a single mobile crossing several roughly circular cells along straight-line segments, changing direction at cell boundaries; the dot at the center of each cell is its base station.]

Example: Hand-off rates among wireless cells (cont)

• Further assume that the average velocities of a mobile through the cells are i.i.d. with density f(v) over [vmin, vmax].

• The mobiles are initially distributed in the plane according to a spatial Poisson process with density δ mobile nodes per unit area.

• Finally, assume that the cells themselves are also distributed in the plane so that, at any given time, the total displacements of the mobiles are i.i.d.

• Note: The base stations could also be randomly placed according to a spatial Poisson process with density δ′ ≪ δ, and the resulting circular cells approximate Voronoi sets about each of them.

• Exercise:
  (a) Find the mean rate λm of mobiles crossing into a cell of diameter ∆. Hint: consider the length of a chord and use Little’s result.
  (b) How would the expression in (a) differ if velocity and direction through a cell were dependent?

Cts-time, time-homog. Markov processes with countable state-space

• We will now define a kind of stochastic process called a Markov process.

• The Poisson process is a (transient) pure-birth Markov process.

• A Markov process on a countable state space Ξ is called a Markov chain.

• A Markov chain is a kind of random walk on Ξ (= Z+ w.l.o.g.).

• It visits a state, stays there for an exponentially distributed amount of time, then makes a transition at random to another state, stays at this new state for an exponentially distributed amount of time, then makes a transition at random to another state, etc.

• All of these visit times and transitions are independent in a way that will be more precisely explained in the following.

The Markov property

• If, for all integers k ≥ 1, all subsets A, B, B1, ..., Bk ⊂ Ξ, and all times t, s, s1, ..., sk ∈ R+ such that t > s > s1 > ··· > sk,

  P(X(t) ∈ A | X(s) ∈ B, X(s1) ∈ B1, ..., X(sk) ∈ Bk) = P(X(t) ∈ A | X(s) ∈ B),

  then the stochastic process X is said to possess the Markov property.

• If we identify
  – X(t) as a future value of the process,
  – X(s) as the present value,
  – and X(s1), ..., X(sk) as past values,
  then the Markov property asserts that the future and the past are conditionally independent given the present.

• In other words, given the present state X(s) of a Markov process, one does not require knowledge of the past to determine its future evolution.

The Markov property (cont)

• Any stochastic process (on any state space with any time domain) that has the Markov property is called a Markov process.

• As such, the Markov property is a “stochastic extension” of notions of state associated with, e.g., finite-state machines and linear time-invariant systems.

• The Markov property as stated above is an immediate consequence of a slightly stronger and more succinctly stated Markov property: for all times s < t and all A ⊂ Ξ,

  P(X(t) ∈ A | X(r), 0 ≤ r ≤ s) = P(X(t) ∈ A | X(s)).

Sample path construction of a continuous-time Markov chain

• For a time-homogeneous Markov chain, consider each state n ∈ Z+ and let

  ES = −1/qn,n > 0

  be the mean visiting time of state n by the Markov process, i.e., qn,n < 0.

• That is, a Markov chain is said to enter state n at time T and subsequently visit state n for S seconds if X(T−) ≠ n and X(t) = n for all T ≤ t < T + S.

[Figure: sample path holding at state n over the interval [Ti, Ti+1), with an intermediate time s ∈ (Ti, Ti+1) marked.]

• Also, define the set of states to which a transition is possible directly from n,

  Ξn ⊂ Ξ \ {n} = Z+ \ {n}.

Sample path construction of a Markov chain - transition rates

• For all m ∈ Ξn, define qn,m > 0 such that the probability of a transition from n to m is

  qn,m/(−qn,n) > 0.

• Thus, by the law of total probability,

  Σ_{m ∈ Ξn} qn,m/(−qn,n) = 1,

  i.e., for all n ∈ Z+,

  Σ_{m ∈ Z+} qn,m = 0,

  where qn,m := 0 for all m ∉ Ξn ∪ {n}.

Sample path construction of a Markov chain - initial distribution

• Now let Ti be the time of the i-th state transition with T0 ≡ 0, i.e., the process X is constant on intervals [Ti−1, Ti) and

  X(Ti−1) = X(Ti−) ≠ X(Ti)

  for all i ∈ Z+.

• Let the column vector π(t) represent the distribution of X(t) on Z+, so that the entry in the n-th row is

  πn(t) = P(X(t) = n),

  where π(0) is the initial distribution of the stochastic process X, i.e., X(0) ∼ π(0).

Sample path construction of a Markov chain - alternative construction

• Suppose that X(Ti) = n ∈ Z+.

• To the states m ∈ Ξn, associate an exponentially distributed random variable Si(n, m) with parameter qn,m > 0 (recall this means ESi(n, m) = 1/qn,m).

• Given X(Ti) = n, the smallest of the random variables

  {Si(n, m) | m ∈ Ξn}

  determines X(Ti+1) and the intertransition time Ti+1 − Ti.

• That is, X(Ti+1) = j if and only if

  Ti+1 − Ti = Si(n, j) = min_{m ∈ Ξn} Si(n, m).

• The entire collection of exponential random variables

  {Si(n, m) | i ∈ Z+, n ∈ Z+, m ∈ Ξn}

  is assumed mutually independent.
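This competing-exponentials construction is directly executable; the 3-state rate values below are arbitrary illustrative choices. From state 0 (exit rates 2 and 1, so −q0,0 = 3), the empirical mean holding time should be ≈ 1/3 and the fraction of transitions to state 1 should be ≈ 2/3.

```python
import random

random.seed(6)
Q = [[-3.0, 2.0, 1.0],     # hypothetical rate matrix for illustration
     [1.0, -1.0, 0.0],
     [2.0, 2.0, -4.0]]

def step(n):
    """One transition from state n: race independent exponentials S(n, m),
    one per reachable state m; the smallest wins.  Returns (hold_time, next)."""
    races = [(random.expovariate(Q[n][m]), m)
             for m in range(len(Q)) if m != n and Q[n][m] > 0]
    return min(races)

samples = [step(0) for _ in range(100_000)]
mean_hold = sum(s for s, _ in samples) / len(samples)
p_to_1 = sum(m == 1 for _, m in samples) / len(samples)
print(mean_hold, p_to_1)   # near 1/(-q00) = 1/3 and q01/(-q00) = 2/3
```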


Sample path construction - alternative construction (cont)

• Therefore, the intertransition time Ti+1 − Ti is exponentially distributed with parameter

  −qn,n := Σ_{m ∈ Ξn} qn,m,

  ⇒ E(Ti+1 − Ti) = 1/(−qn,n) > 0 in particular.

• Also, the state transition probabilities are

  P(X(Ti+1) = j | X(Ti) = n) = P(Si(n, j) = min_{m ∈ Ξn} Si(n, m)) = qn,j/(−qn,n).

• Note again that if a transition from state n to state j is impossible (has probability zero), then qn,j = 0.

• Note: The parameters (rates) q are not probabilities.

Conservativeness and time-homogeneity assumptions

• In the following, we assume that

  −qn,n < ∞ for all states n,

  i.e., the Markov chain is conservative.

• Also, we have assumed that the Markov chain is temporally (time) homogeneous, i.e., for all times s, t ≥ 0 and all states n, m:

  P(X(s + t) = n | X(s) = m) = P(X(t) = n | X(0) = m).

• In summary, assuming the initial distribution π(0) and the parameters {qn,m | n, m ∈ Z+} are known, we have described how to construct a sample path of the Markov chain X from a collection of independent random variables

  {Si(n, m) | i ∈ Z+, n ∈ Ξ = Z+, m ∈ Ξn},

  where Si(n, m) is exponentially distributed with parameter qn,m.

• When a Markov chain visits state n, it stays an exponentially distributed amount of time with mean 1/(−qn,n) and then makes a transition to another state m ∈ Ξn with probability qn,m/(−qn,n).

Proof that thus constructed process is Markovian

• The Markov property is due to the independent and exponentially distributed state visit times and the independent state transitions.

• Let
  – n := X(s) and
  – i be the number of transitions of X prior to the present time s.

• Clearly, the random variables i, n, and Ti (the last transition time prior to s) can be discerned from {Xr, 0 ≤ r ≤ s} and can therefore be considered “given” as well.

• The memoryless property of the random variable Ti+1 − Ti, distributed exponentially with parameter −qn,n, implies that for all x > 0,

  P(Ti+1 − s > x | Ti+1 > s)
   = P(Ti+1 − Ti > x + (s − Ti) | Ti+1 − Ti > s − Ti)
   = P(Ti+1 − Ti > x)
   = exp(qn,n x).

• Note that exp(qn,n x) depends on {Xr, 0 ≤ r ≤ s} only through n = X(s).

Proof that thus constructed process is Markovian (cont)

• So, Ti+1 − s is exponentially distributed with parameter −qn,n and conditionally independent of s − Ti given {Xr, 0 ≤ r ≤ s}.

• Furthermore, by construction, the evolution of X from the next transition time Ti+1 onward depends on {Xr, 0 ≤ r ≤ s} only through the present state n = X(s).

[Figure: sample path holding at state n over [Ti, Ti+1), with the present time s ∈ (Ti, Ti+1) marked.]

• Since the exponential distribution is the only continuous one that is memoryless, one can conversely show that the Markov property implies the properties of the previous constructions.

The Poisson process is Markovian

• Clearly, a Poisson process with intensity λ is an example of a Markov chain.

• The transition rates of a Poisson process are, ∀n ∈ Z+,

  qn,m = λ  if m = n + 1,
  qn,m = −λ if m = n,
  qn,m = 0  else.

Transition-rate matrix (generator) of a cts-time Markov chain

• The matrix Q having qn,m as its entry in the n-th row and m-th column is called the transition rate matrix (or just “rate matrix” or “generator”) of the continuous-time, time-homogeneous Markov chain X.

• Note that, by the definition of qn,n < 0, the sum of the entries in any row of the matrix Q equals zero.

• The n-th row of Q corresponds to the state n from which transitions occur, and

• the m-th column of Q corresponds to the states m to which transitions occur.

• For n ≠ m, the parameter qn,m is called a transition rate (or probability flux) because, for any i ∈ Z+, ESi(n, m) = 1/qn,m.

• Thus, we expect that, if qn,m > qn,j, then transitions from state n to m are more likely (will tend to be made more frequently, or at a higher rate) by the Markov chain than transitions from state n to j.

Rate matrix of a Poisson process

• The transition rate matrix of a Poisson process with intensity λ > 0 is

      ⎡ −λ   λ   0   0   0  ··· ⎤
  Q = ⎢  0  −λ   λ   0   0  ··· ⎥
      ⎢  0   0  −λ   λ   0  ··· ⎥
      ⎣  ⋮   ⋮   ⋮   ⋮   ⋮   ⋱  ⎦

Rate matrix - example 2

• Suppose the strict state space of X is {0, 1, 2} and the rate matrix is

      ⎡ −5   2   3 ⎤
  Q = ⎢  0  −4   4 ⎥
      ⎣  1   0  −1 ⎦

• Q is just 3 × 3 since the strict state space is just the (finite-sized) set {0, 1, 2}, rather than all of the nonnegative integers Z+.

• A direct transition from state 2 to state 1 is impossible (as is a direct transition from state 1 to state 0).

• Also, each visit to state 0 lasts an exponentially distributed amount of time with parameter 5 (i.e., with mean 0.2); a transition to state 1 then occurs with probability 2/5 or a transition to state 2 occurs with probability 3/5.
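These visit-time and branching probabilities can be read directly off the rows of Q; a minimal check in Python:

```python
# Rate matrix of this example; rows must sum to zero.
Q = [[-5.0, 2.0, 3.0],
     [0.0, -4.0, 4.0],
     [1.0, 0.0, -1.0]]

for row in Q:
    assert abs(sum(row)) < 1e-12      # each row of a rate matrix sums to 0

mean_visit_0 = 1.0 / -Q[0][0]         # exponential holding time in state 0
p_01 = Q[0][1] / -Q[0][0]             # P(next state is 1 | leaving state 0)
p_02 = Q[0][2] / -Q[0][0]             # P(next state is 2 | leaving state 0)
print(mean_visit_0, p_01, p_02)       # 0.2 0.4 0.6
```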


Graphical depiction of a Markov chain’s transition rates

• We can also represent the transition rates (and states) graphically by what is called a transition rate diagram.

• The states of the Markov chain are circled and arrows are used to indicate the possible transitions (labeled with assumed positive transition rates) between states.

• The transition rate itself labels the corresponding arrow (transition).

• For the two previous examples:

[Figure: left, the transition rate diagram of the Poisson process — states 0, 1, 2, 3, 4, ... with a rate-λ arrow from each state n to state n + 1; right, the diagram of the 3-state example with arrows labeled q0,1 = 2, q0,2 = 3, q1,2 = 4, q2,0 = 1.]

The Kolmogorov equations

• Consider the Markov chain X on Z+ with rate matrix Q and initial distribution π(0).

• For τ ∈ R+ and n, m ∈ Z+, define

  pn,m(τ) = P(X(s + τ) = m | X(s) = n).

• Again, we are assuming that the chain is temporally homogeneous so that the right-hand side of the above equation does not depend on the time s.

• The matrix P(τ) whose entry in the n-th row and m-th column is pn,m(τ) is called the transition probability matrix.

• Finally, for all times s ∈ R+ and all states n ∈ Z+, define

  πn(s) := P(X(s) = n).

• So, the column vector π(s), whose n-th entry is πn(s), is the marginal distribution of X at time s, i.e., the p.m.f. of X(s).

The Kolmogorov equations (cont)

• Conditioning on X(s) and using the law of total probability: for τ > 0,

  P(X(s + τ) = m) = Σ_{n=0}^∞ P(X(s + τ) = m | X(s) = n) P(X(s) = n)

  for all m ∈ Z+, i.e.,

  πm(s + τ) = Σ_{n=0}^∞ pn,m(τ) πn(s) for all m ∈ Z+.

• We can write these equations compactly in matrix form:

  πT(s + τ) = πT(s) P(τ),

  where πT(s) is the transpose of the column vector π(s), i.e., πT(s) is a row vector.

The Kolmogorov equations (cont)

• Moreover, any finite-dimensional distribution (FDD) of the Markov chain can be computed from the transition probability functions and the initial distribution.

• For example, for times 0 < r < s < t,

  P(X(r) = k, X(s) = m, X(t) = n)
   = Σ_{i=0}^∞ πi(0) pi,k(r) pk,m(s − r) pm,n(t − s)
   = [πT(0) P(r)]k pk,m(s − r) pm,n(t − s).

• In the second-to-last expression, we clearly see the transition from some initial state i to state k at time r, then to state m at time s (s − r seconds later), and finally to state n at time t (t − s seconds later).

Computing the transition probability matrix with the rate matrix

• First note that a transition in an interval of time of length zero occurs with probability zero:

  pn,m(0) = 1{n = m} ∀n, m, i.e., P(0) = I,

  where I is the (multiplicative) identity matrix, i.e., the square matrix with 1’s in every diagonal entry and 0’s in every off-diagonal entry.

• For states n ≠ m, a small amount of time 0 < ε ≪ 1, and an arbitrarily chosen time s ∈ R+, consider

  pn,m(ε) = P(X(s + ε) = m | X(s) = n).

• Let Vn be the residual holding time in state n after time s, i.e., X(t) = n for all t ∈ [s, s + Vn) and X(s + Vn) ≠ n.

Computing the TPM with the rate matrix (cont)

• The total holding time in state n is ∼ exp(−qn,n).

• So, by the memoryless property, Vn ∼ exp(−qn,n) and, for all m ≠ n,

  pn,m(ε) = P(Vn ≤ ε) × qn,m/(−qn,n) + o(ε).

• The first term on the right-hand side represents the probability that the Markov chain X makes only a single transition (from n to m) in the interval of time (s, s + ε].

• Recall that the probability that X makes a transition to state m from state n is qn,m/(−qn,n).

• The symbol o(ε) (“little oh of ε”) represents a function satisfying

  lim_{ε→0} o(ε)/ε = 0,

  specifically here the probability that the Markov chain has two or more transitions in the interval of time (s, s + ε].

Computing the TPM with the rate matrix (cont)

• Substituting

  P(Vn ≤ ε) = 1 − exp(ε qn,n) = −ε qn,n + o(ε)

  gives, for all m ≠ n,

  pn,m(ε) = qn,m ε + o(ε)
  ⇒ (pn,m(ε) − pn,m(0))/ε = qn,m + o(ε)/ε,

  where we recall that pn,m(0) = 0 for all m ≠ n.

• Letting ε → 0, we get

  ∀m ≠ n, ṗn,m(0) := (dpn,m/dt)(0) = qn,m,

  where the left-hand side is the time derivative of pn,m at time 0.

The Kolmogorov backward equations

• Finally, since

  pn,n(ε) = 1 − Σ_{m ∈ Z+, m ≠ n} pn,m(ε),

  we get, after differentiating with respect to time,

  ṗn,n(0) = −Σ_{m ≠ n} qn,m = qn,n < 0,

  where we have used the definition of qn,n.

• In matrix form,

  Ṗ(0) = Q.

• This statement can be generalized to obtain the Kolmogorov backward equations:

  ∀τ ≥ 0, Ṗ(τ) = P(τ) Q with P(0) = I.

The Kolmogorov backward equations - proof

• First, we have already established (for s = 0) that

  Ṗ(0) = IQ = P(0) Q.

• For s > 0, take a real ε such that 0 < ε ≪ min{s, 1}. So,

  pn,m(s) = P(X(s) = m | X(0) = n)
   = P(X(s) = m, X(0) = n) / P(X(0) = n)
   = Σ_{k=0}^∞ [P(X(s) = m, X(s − ε) = k, X(0) = n) / P(X(s − ε) = k, X(0) = n)]
       × [P(X(s − ε) = k, X(0) = n) / P(X(0) = n)]
   = Σ_{k=0}^∞ P(X(s − ε) = k | X(0) = n) P(X(s) = m | X(s − ε) = k, X(0) = n)
   = Σ_{k=0}^∞ P(X(s − ε) = k | X(0) = n) P(X(s) = m | X(s − ε) = k)   (by the Markov property)
   = Σ_{k=0}^∞ pn,k(s − ε) pk,m(ε).

The Kolmogorov backward equations - proof (cont)

• Therefore,

  pn,m(s) = pn,m(s − ε) pm,m(ε) + ε Σ_{k ≠ m} pn,k(s − ε) qk,m + o(ε)

          = pn,m(s − ε) (1 − Σ_{i ≠ m} pm,i(ε)) + ε Σ_{k ≠ m} pn,k(s − ε) qk,m + o(ε)

          = pn,m(s − ε) (1 − ε Σ_{i ≠ m} qm,i) + ε Σ_{k ≠ m} pn,k(s − ε) qk,m + o(ε)

          = pn,m(s − ε) (1 + ε qm,m) + ε Σ_{k ≠ m} pn,k(s − ε) qk,m + o(ε).

The Kolmogorov backward equations - proof (cont)

• After a simple rearrangement, we get

  (pn,m(s) − pn,m(s − ε))/ε = pn,m(s − ε) qm,m + Σ_{k ≠ m} pn,k(s − ε) qk,m + o(ε)/ε
                            = Σ_{k=0}^∞ pn,k(s − ε) qk,m + o(ε)/ε.

• So, letting ε → 0 in the previous equation, we get, for all n, m ∈ Z+ and all real s > 0,

  ṗn,m(s) = Σ_{k=0}^∞ pn,k(s) qk,m,

  as desired.

Kolmogorov forward equations

• Using a similar argument, one can condition on the distribution of X(ε), i.e., move forward in time from the origin.

• We will then arrive at the Kolmogorov forward equations:

  Ṗ(s) = Q P(s).

Transition probability matrix by matrix exponential

• Recall that

  P(0) = I.

• Equipped with this initial condition, we can solve the Kolmogorov equations for the case of a finite state space to get, for all t ≥ 0,

  P(t) = e^{Qt},

  where the matrix exponential

  exp(Qt) ≡ I + Qt + (1/2!) Q²t² + (1/3!) Q³t³ + ···.

• Note that the terms t^k/k! are scalars and the terms Q^k (including Q⁰ = I) are all square matrices of the same dimensions.
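The power series can be evaluated numerically by accumulating terms; a sketch using the 3-state rate matrix from the “Rate matrix - example 2” slide, with an arbitrary time t = 0.7 and truncation at 40 terms (numpy assumed available):

```python
import numpy as np

Q = np.array([[-5.0, 2.0, 3.0],
              [0.0, -4.0, 4.0],
              [1.0, 0.0, -1.0]])
t = 0.7

# Truncated power series for exp(Qt): I + Qt + Q^2 t^2/2! + ...
P = np.zeros_like(Q)
term = np.eye(3)                 # current term Q^k t^k / k!, starting at k = 0
for k in range(1, 41):
    P += term
    term = term @ Q * (t / k)    # build Q^k t^k / k! from the previous term
P += term

assert np.allclose(P.sum(axis=1), 1.0)   # each row of P(t) is a p.m.f.
assert P.min() > 0
print(P)
```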

Transition probability matrix by matrix exponential (cont)

• Indeed, clearly exp(Q·0) = I and, for all t > 0,

  (d/dt) exp(Qt) = Q + Q²t + (1/2!) Q³t² + ···
                 = [I + Qt + (1/2!) Q²t² + ···] Q
                 = exp(Qt) Q,

  where, in the second equality, we could have instead factored Q out to the left to obtain the forward equations.

• In summary, for all s, t ∈ R+ such that s ≤ t, the distribution of the continuous-time, time-homogeneous Markov chain X(t) is

  πT(t) = πT(s) P(t − s) = πT(s) exp(Q(t − s)).

Transition probability matrix - example

• Consider an example where the TRM Q has distinct real eigenvalues,

      ⎡ −2   1   1 ⎤
  Q = ⎢  1  −1   0 ⎥
      ⎣  1   0  −1 ⎦

• The corresponding transition rate diagram (TRD) is

[Figure: transition rate diagram with states 0, 1, 2; rate-1 arrows in both directions between states 0 and 1 and between states 0 and 2.]

Transition probability matrix - example (cont)

• The eigenvalues are the roots of Q’s characteristic polynomial:

  det(zI − Q) ≡ z(z + 1)(z + 3).

• Taking the eigenvalues z ∈ {0, −1, −3} and then solving for the right-eigenvectors x from Qx = zx gives:
  – [1 1 1]T is a right-eigenvector corresponding to eigenvalue 0 (true for all rate matrices Q),
  – [0 1 −1]T is a right-eigenvector corresponding to eigenvalue −1, and
  – [2 −1 −1]T is a right-eigenvector corresponding to eigenvalue −3.

• Combining these three statements in matrix form gives

    ⎡ 1   0   2 ⎤   ⎡ 1   0   2 ⎤ ⎡ 0   0   0 ⎤
  Q ⎢ 1   1  −1 ⎥ = ⎢ 1   1  −1 ⎥ ⎢ 0  −1   0 ⎥ =: VΛ.
    ⎣ 1  −1  −1 ⎦   ⎣ 1  −1  −1 ⎦ ⎣ 0   0  −3 ⎦

Computing matrix exponential by Jordan form

• Thus, we arrive at a Jordan decomposition of the matrix Q for the special case of distinct eigenvalues:

  Q = VΛV^{-1}.

• So, for all integers k ≥ 1,

  Q^k = VΛ^k V^{-1}, where

        ⎡ 0     0        0    ⎤
  Λ^k = ⎢ 0  (−1)^k      0    ⎥
        ⎣ 0     0     (−3)^k  ⎦

  ⇒ exp(Qt) = V exp(Λt) V^{-1} = V diag(1, e^{−t}, e^{−3t}) V^{-1}.

• Note that we could have developed this example using left eigenvectors instead of right; e.g., the stationary distribution σT is the left eigenvector corresponding to eigenvalue 0.
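The eigendecomposition route to exp(Qt) can be checked numerically for this example; t = 1.0 is an arbitrary time chosen for illustration (numpy assumed available):

```python
import numpy as np

Q = np.array([[-2.0, 1.0, 1.0],
              [1.0, -1.0, 0.0],
              [1.0, 0.0, -1.0]])
V = np.array([[1.0, 0.0, 2.0],       # columns are the right-eigenvectors
              [1.0, 1.0, -1.0],
              [1.0, -1.0, -1.0]])
eig = np.array([0.0, -1.0, -3.0])

assert np.allclose(Q @ V, V @ np.diag(eig))      # Q V = V Λ, as on the slide

t = 1.0
P = V @ np.diag(np.exp(eig * t)) @ np.linalg.inv(V)   # exp(Qt) = V exp(Λt) V^{-1}
assert np.allclose(P.sum(axis=1), 1.0)           # rows of P(t) sum to one
print(P)
```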

Stationary distribution of a Markov chain

• Suppose there exists a distribution σ on the state space Ξ = Z+ that satisfies the full balance equations

  σTQ = 0T, i.e., Σ_{n=0}^∞ σn qn,m = 0 for all m ∈ Z+,

  so that σ is a nonnegative left eigenvector corresponding to Q’s zero eigenvalue (recall Q1 = 0).

• Therefore, for all integers k > 0,

  σTQ^k = 0T ⇒ σTP(t) = σT e^{Qt} = σT I = σT ∀t ∈ R+.

• Recall that π(t) is defined to be the distribution of the Markov chain X(t).

• Therefore, if π(0) = σ, then π(t) = σ for all real t > 0.

• So, σ is called a stationary or invariant distribution of the Markov chain X with TRM Q.

• The Markov chain X itself is said to be stationary if π(0) = σ.

Stationary distribution of a Markov chain - balance equations

• σTQ = 0T ⇔ for all states m, the probability flux into m equals that out of m:

  Σ_{n ≠ m} σn qn,m = σm (−qm,m) = σm Σ_{n ≠ m} qm,n.

Stationary distribution of a Markov chain - examples

• For the previous example 3-state TRM, the unique invariant distribution is uniform, σT = [1/3 1/3 1/3].

• To model the packet flow generated by a voice source:
  – First, let the talkspurt state be denoted by 1 and the silent state be denoted by 0, i.e., our modeling assumption is that successive talkspurts and silent periods are independent and exponentially distributed.
  – In steady state, the mean duration of a talkspurt is 352 ms and the mean duration of a silence period is 650 ms.
  – The mean number of packets generated per second is 22, i.e., 22 48-byte (ATM) payloads, or about 8 kbits per second on average.
  – Solving the balance equations for a two-state Markov chain gives the invariant distribution:

    σ0 = q1,0/(q0,1 + q1,0) and σ1 = q0,1/(q0,1 + q1,0).

  – So, q1,0 = 1/0.352 and q0,1 = 1/0.650.
  – Finally, the mean transmission rate is 0·σ0 + r·σ1 = 22 packets/s.
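A minimal computation for this voice-source model; note the talkspurt-state ("peak") packet rate r is not given on the slide and is derived here from the stated mean rate of 22 packets/s.

```python
# Two-state voice source: state 1 = talkspurt (mean 0.352 s),
# state 0 = silence (mean 0.650 s).
q10 = 1 / 0.352              # talkspurt -> silence transition rate
q01 = 1 / 0.650              # silence -> talkspurt transition rate

sigma0 = q10 / (q01 + q10)   # long-run fraction of time silent
sigma1 = q01 / (q01 + q10)   # long-run fraction of time in talkspurt

# Mean rate: 0*sigma0 + r*sigma1 = 22 packets/s => implied in-talkspurt rate r.
r = 22 / sigma1
print(sigma1, r)
```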


Existence and uniqueness of stationary distribution

We now consider the properties of Markov chains that have bearing on the issues of existence • and uniqueness of stationary distributions.

By the definition of its diagonal entries, the sum of the columns of a rate matrix Q is the • zero vector.

• Thus, the balance equations σ^T Q = 0^T are dependent.
• Obviously another requirement is that σ be a p.m.f. on the state space: σ_i ≥ 0 for all i.
• That is, we replace one of the columns of Q, say the i-th, with a column all of whose entries are 1, resulting in the matrix Q̃_i so that

    σ^T Q̃_i = e_i^T,

  where e_i is a column vector whose entries are all zero except that the i-th entry is 1.

• Thus, we are interested in conditions on the rate matrix Q that result in the invertibility (nonsingularity) of Q̃_i (for any i), giving a unique

    σ^T = e_i^T Q̃_i^{−1}.
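The column-replacement construction can be sketched numerically. Here Q is the asymmetric 3-state TRM that appears in a later example in these notes (its invariant distribution is (3, 4, 5)/12):

```python
import numpy as np

# Solve sigma^T Q_tilde_i = e_i^T by replacing the i-th column of Q with ones.
Q = np.array([[-3.0,  1.0,  2.0],
              [ 1.0, -2.0,  1.0],
              [ 1.0,  1.0, -2.0]])
i = 0
Qt = Q.copy()
Qt[:, i] = 1.0                      # i-th column -> all ones (enforces normalization)
e_i = np.zeros(3); e_i[i] = 1.0
sigma = np.linalg.solve(Qt.T, e_i)  # Qt^T sigma = e_i  <=>  sigma^T Qt = e_i^T
print(sigma)                        # stationary distribution (3/12, 4/12, 5/12)
```

Replacing a column with ones both removes the linear dependence among the balance equations and imposes Σ_i σ_i = 1 in one step.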

Doeblin's theory for Markov chains - recurrence & transience

• First note that the quantity

    V_X(i) ≡ ∫_0^∞ 1{X(t) = i} dt

  represents the total amount of time the stochastic process X visits state i.

• A state i of a Markov chain is said to be recurrent if

    P(V_X(i) = ∞ | X(0) = i) = 1,

  i.e., the Markov chain will visit state i infinitely often with probability 1.

• On the other hand, if

    P(V_X(i) = ∞ | X(0) = i) = 0,

  i.e., P(V_X(i) < ∞) = 1 so that the Markov chain will visit state i only finitely often, then i is said to be a transient state.

• All states are recurrent in the previous example of a 3-state TRM, whereas all states are transient for the Poisson process.

If all of the states of a Markov chain are recurrent, then the Markov chain itself is said to • be recurrent.


Positive and null recurrence

Suppose that i is a recurrent state. •

• Let τ_i > 0 be the time of the first transition back into state i by the Markov chain.
• The state i is said to be positive recurrent if

    E(τ_i | X(0) = i) < ∞.

• On the other hand, if the state i is recurrent and

    E(τ_i | X(0) = i) = ∞,

  then it is said to be null recurrent.

If all of the states of the (temporally homogeneous) Markov chain are positive recurrent, • then the Markov chain itself is said to be positive recurrent.

cf. the example of a birth-death Markov chain with infinite state-space and the M/M/1 • queue special case.

Irreducibility

• A Markov chain X, or associated TRM Q, is irreducible if there is a directed path from any state of the transition rate diagram to any other state of the diagram.

The following example is an irreducible transition rate diagram. •

[Transition rate diagram: states 0, 2, 3 connected by rates q_{0,2}, q_{2,0}, q_{2,3}, q_{3,2}, plus state 1 with rates q_{2,1} and q_{1,0}.]


Irreducibility (cont)

The following transition rate diagram does not have a path from state 2 to state 0; therefore, • the associated Markov chain is reducible.

[Transition rate diagram: states 0-5 with rates q_{5,0}, q_{0,5}, q_{0,1}, q_{1,2}, q_{2,1}, q_{0,3}, q_{3,4}, q_{4,3}; there is no transition from state 2 back to state 0.]

Irreducibility (cont)

• The state space of a reducible Markov chain can be partitioned into one transient class (subset) and a number of recurrent (or "communicating") classes.

If a Markov chain begins somewhere in the transient class, it will ultimately leave it if there • are one or more recurrent classes.

Once in a recurrent class, the Markov chain never leaves it (when a single state constitutes • an entire recurrent class, it is sometimes called an absorbing state of the Markov chain).

• For the previous reducible example, {0, 5} is the transient class and {1, 2} and {3, 4} are recurrent classes.

Irreducibility is a property only of the transition rate diagram (i.e., whether the transition • rates are zero or not); irreducibility is otherwise not dependent on the values of transition rates.

If the Markov chain has a finite number of states, then all recurrent states are positive • recurrent and the recurrent and transient states can be determined by the TRD’s structure.


Existence and uniqueness of stationary distribution

Theorem: If a continuous-time Markov chain is irreducible and positive recurrent, then • there exists a unique stationary (invariant) distribution.

• In the following theorem, the associated Markov chain X(t) ∼ π(t) is not necessarily stationary.

• Theorem: For any irreducible and positive recurrent TRM Q and any initial distribution π(0),

    lim_{t→∞} π^T(t) = lim_{t→∞} π^T(0) exp(Qt) = σ^T,

  where σ is the (unique) invariant distribution of Q.

• That is, the Markov chain will converge in distribution to its stationary σ.
• For this reason, σ is also known as the steady-state distribution of the Markov chain X with rate matrix Q.

Existence and uniqueness of stationary distribution (cont)

• Consistent with the previous theorem, if Q is the TRM of an irreducible and positive recurrent Markov chain, then

    lim_{t→∞} exp(Qt) = [σ^T; σ^T; ...; σ^T],

  i.e., the matrix each of whose rows is σ^T, where σ is the unique invariant distribution of Q.

• Note that this limit is a matrix of rank 1.
• Also, for any summable function g on Z^+,

    E g(X(t)) = Σ_{i=0}^∞ π_i(t) g(i) → Σ_{i=0}^∞ σ_i g(i)  as t → ∞.


Time-reversed Markov chain

Consider a Markov chain X on (the entire) R with TRM Q and unique stationary distri- • bution σ.

• The stochastic process that is X reversed in time is

    Y(t) ≡ X(−t)  for t ∈ R.

• Theorem: The time-reversed process Y of X is itself a Markov chain and, if X is stationary, the transition rate matrix of Y is R, whose entry in the m-th row and n-th column is

    r_{m,n} = q_{n,m} σ_n / σ_m,

where qn,m are the transition rates of X.

• It is easy to show that the reverse-time chain Y(t) ≡ X(−t) also has stationary distribution σ; clearly, this should be true since the fraction of time that Y visits any given state is the same as for (the forward-time chain) X.

Theorem on time-reversed Markov chains - proof

• First note that R is indeed a transition rate matrix because, by the balance equations,

    Σ_{n=0}^∞ r_{m,n} = 0.

• Consider an arbitrary integer k ≥ 1, arbitrary subsets A, B, B_1, ..., B_k of Z^+, and arbitrary times t, s, s_1, ..., s_k ∈ R such that −t > −s > −s_1 > ··· > −s_k.


Theorem on time-reversed Markov chains - proof (cont)

• The transition probabilities for the reverse-time chain Y are

    P(Y(−t) ∈ A | Y(−s) ∈ B, Y(−s_1) ∈ B_1, ..., Y(−s_k) ∈ B_k)
      = P(X(t) ∈ A | X(s) ∈ B, X(s_1) ∈ B_1, ..., X(s_k) ∈ B_k)
      = P(X(t) ∈ A, X(s) ∈ B, X(s_1) ∈ B_1, ..., X(s_k) ∈ B_k)
          / P(X(s) ∈ B, X(s_1) ∈ B_1, ..., X(s_k) ∈ B_k)
      = [ P(X(s_k) ∈ B_k | X(t) ∈ A, ..., X(s_{k−1}) ∈ B_{k−1})
          / P(X(s_k) ∈ B_k | X(s) ∈ B, ..., X(s_{k−1}) ∈ B_{k−1}) ]
        × [ P(X(t) ∈ A, ..., X(s_{k−1}) ∈ B_{k−1})
            / P(X(s) ∈ B, ..., X(s_{k−1}) ∈ B_{k−1}) ]
      = [ P(X(s_k) ∈ B_k | X(s_{k−1}) ∈ B_{k−1})
          / P(X(s_k) ∈ B_k | X(s_{k−1}) ∈ B_{k−1}) ]
        × [ P(X(t) ∈ A, ..., X(s_{k−1}) ∈ B_{k−1})
            / P(X(s) ∈ B, ..., X(s_{k−1}) ∈ B_{k−1}) ]
      = P(X(t) ∈ A, X(s) ∈ B, X(s_1) ∈ B_1, ..., X(s_{k−1}) ∈ B_{k−1})
          / P(X(s) ∈ B, X(s_1) ∈ B_1, ..., X(s_{k−1}) ∈ B_{k−1}),

  where the second-to-last equality is by the Markov property of X.

Theorem on time-reversed Markov chains - proof (cont)

• We can repeat this argument k − 1 more times to get

    P(Y(−t) ∈ A | Y(−s) ∈ B, Y(−s_1) ∈ B_1, ..., Y(−s_k) ∈ B_k)
      = P(X(t) ∈ A, X(s) ∈ B) / P(X(s) ∈ B)
      = P(X(t) ∈ A | X(s) ∈ B)
      = P(Y(−t) ∈ A | Y(−s) ∈ B).

• So, we have just shown that Y is Markovian.


Theorem on time-reversed Markov chains - proof (cont)

• We now want to find R in terms of Q and σ.
• For t < s,

    p^Y_{m,n}(−t − (−s)) = p^X_{n,m}(s − t) σ_n / σ_m,

  where n ≠ m and the left-hand side is the transition probability for Y.
• Differentiating this equation with respect to s − t = −t − (−s), and then evaluating the result at s − t = 0, gives

    r_{m,n} = q_{n,m} σ_n / σ_m.

Time-reversible Markov chains and detailed balance equations

• A Markov chain X is said to be time reversible if

    q_{m,n} = r_{m,n} := q_{n,m} σ_n / σ_m  for all states n ≠ m,

  i.e., the transition rates of the stationary (forward-time) Markov chain X, q_{m,n}, are the same as those of the reverse-time Markov chain Y(t) = X(−t).
• These are the simplified detailed balance equations for a time-reversible Markov chain:

    σ_m q_{m,n} = σ_n q_{n,m}  for all states n ≠ m.

• So, X is time reversible if the average rate at which transitions from state m to n occur in reverse time equals the average rate at which transitions from state n back to m occur forward in time.

Many of the Markov chains subsequently considered will be time reversible. •


Time-reversible Markov chains and detailed balance equations

• Exercise: Show that if a distribution σ satisfies the detailed balance equations for a rate matrix Q, then it also satisfies the balance equations for the invariant distribution of Q.

Given an irreducible and positive recurrent rate matrix Q,ifonefindsadistributionσ that • satisfies detailed balance, the associated Markov chain is time reversible.

That is, time reversibility is a property that holds if and only if the detailed balance equations • are satisfied.

Note that all two-state Markov chains are time reversible since the single balance equation • is also a detailed balance equation.

Time-reversible Markov chains - examples

The previous example 3-state TRM is trivially time-reversible since the stationary distribu- • tion is uniform and the TRM is symmetric.

• Exercise: Does every symmetric TRM have a uniform invariant distribution?
• Consider the following (asymmetric) TRM with states {1, 2, 3}:

        ( −3   1   2 )
    Q = (  1  −2   1 ).
        (  1   1  −2 )

• Its invariant distribution is σ^T = (3/12, 4/12, 5/12), so that

    σ_1 q_{1,2} = (3/12)·1 ≠ (4/12)·1 = σ_2 q_{2,1},

  i.e., this Markov chain is not time reversible.


Modeling time-series data using a Markov chain

• Consider a single sample path X_ω(t), t ∈ [0, T], of a stationary process, where T ≫ 1.
• We may be interested in estimating its marginal mean, µ ≡ EX(t), by

    (1/T) ∫_0^T X_ω(t) dt.

• If this quantity converges to µ as T → ∞ (for almost all sample paths X_ω), then X is said to be ergodic in the mean.

• If the stationary distribution σ of X can be similarly approximated, i.e.,

    σ_n = lim_{T→∞} (1/T) ∫_0^T 1{X_ω(t) = n} dt,

  then X is said to be ergodic in distribution.

• Such estimates are sometimes used even when the process X is known not to be stationary, assuming that the transient portion of the sample path will be negligible.

Fitting a Markov model to data - states

• Given sample path measurements, we now describe how to obtain the most likely TRM Q for one or more measured sample paths (time series) of the physical process to be modeled.

We first assume that the states themselves are readily discernible from the data. • Quantization (aggregation) of the observed/physical states may be required to obtain a • discrete state space if the physical state space is uncountable.

• Even if the physical state space is already discrete, it may be further simplified by judicious quantization/clustering.

• However, assuming the data was generated by a Markov process, excessive state aggregation may compromise its Markovian character.


Fitting a Markov model to data - pertinent statistics

• Given a space of N defined states, one can glean the following information from sample-path data X_ω:
  – the total time duration of the sample path, T,
  – the total time spent in state n, τ_n, for each element n of the defined state space, i.e.,

      τ_n = ∫_0^T 1{X_ω(t) = n} dt,

  – the total number of jumps taken out of state n (i.e., the number of visits to state n), J_n, and
  – the total number of jumps out of state n to state m, J_{n,m}.

• Clearly,

    T = Σ_n τ_n  and, for all states n,  J_n = Σ_{m≠n} J_{n,m}.

Most likely Markov model of data

• From this information, we can derive:
  – the sample occupation time for each state n,

      σ_n = τ_n / T  and  1/(−q_{n,n}) = τ_n / J_n,

  – the sample probability of transiting from state n to state m,

      r_{n,m} = J_{n,m} / J_n.

• From this derived information, we can directly estimate the "most likely" transition rates of the process:

    q_{n,m} = r_{n,m} (−q_{n,n})  for all n ≠ m.

• Exercise: Using the definitions above, do the full balance equations σ^T Q = 0^T hold?
• Note that the total "speed" of the Markov chain, i.e., the aggregate mean rate of jumps, is

    Σ_n σ_n (−q_{n,n}) = (1/T) Σ_n J_n.


Fitting a Markov model to data - example

• For N = 3 states, consider sample path data leading to the following information.
• The time duration and occupation times were observed to be:

    T = 100,  τ_0 = 20,  τ_1 = 50,  τ_2 = 30.

• The total number of transitions out of each state were observed to be:

    J_0 = 10,  J_1 = 40,  J_2 = 30.

• The specific transition counts J_{i,j} were observed to be:

    from\to    0    1    2
       0       -    5    5
       1      10    -   30
       2      20   10    -

Fitting a Markov model to data - example (cont)

• First,

    −q_{0,0} = J_0/τ_0 = 1/2,  −q_{1,1} = J_1/τ_1 = 4/5,  −q_{2,2} = J_2/τ_2 = 1.

• So, finding q_{i,j} = r_{i,j}(−q_{i,i}) for all i ≠ j as above gives

        ( −1/2   1/4   1/4 )
    Q = (  1/5  −4/5   3/5 ).
        (  2/3   1/3   −1  )
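The estimate can be computed directly from the counts; note that q_{i,j} = r_{i,j}(−q_{i,i}) = (J_{i,j}/J_i)(J_i/τ_i) = J_{i,j}/τ_i. A sketch using the example data:

```python
import numpy as np

# Most-likely TRM from the example counts; q[n,m] = J[n,m] / tau_n for n != m.
tau = np.array([20.0, 50.0, 30.0])         # occupation times
J = np.array([[ 0.0,  5.0,  5.0],
              [10.0,  0.0, 30.0],
              [20.0, 10.0,  0.0]])         # jump counts J[n, m]
Q = J / tau[:, None]                       # off-diagonal rates
np.fill_diagonal(Q, -J.sum(axis=1) / tau)  # -q[n,n] = J_n / tau_n
print(Q)
```

Each row of the resulting Q sums to zero, as a rate matrix must.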


Birth-death Markov chains

We now define an important class of Markov chains on Ξ = Z+ that are called birth-death • processes.

• The terminology comes from Markovian population models wherein
  – X(t) is the number of living individuals at time t,
  – a birth, represented by a state change from i ≥ 0 to i + 1, occurs at rate q_{i,i+1} = λ_i, and
  – a death, represented by a state change from i > 0 to i − 1, occurs at rate q_{i,i−1} = µ_i.

Birth-death processes with finite state space

• Consider a finite state space

    Ξ = Z^+_K ≡ {0, 1, 2, ..., K}

  and transition rates
  – λ_i > 0 for all i ∈ {0, 1, 2, ..., K−1} and
  – µ_i > 0 for all i ∈ {1, 2, ..., K}, but
  – µ_0 = 0 and λ_K = 0.

• So, the finite birth-death process has a (K+1) × (K+1) tridiagonal transition rate matrix

        ( −λ_0      λ_0         0        ···      0     )
        (  µ_1   −(µ_1+λ_1)    λ_1       ···      0     )
    Q = (   0       µ_2     −(µ_2+λ_2)   λ_2      ⋮     ).   (1)
        (   ⋮        ⋱          ⋱         ⋱     λ_{K−1} )
        (   0       ···         0        µ_K    −µ_K    )


Birth-death processes with finite state space (cont)

• Note that this rate matrix is irreducible.
• The finiteness of the state space implies that the birth-death process is also positive recurrent.

• We will now compute the stationary distribution σ, which is a vector of size K + 1, by solving σ^T Q = 0^T, which is a compact representation for the following system of K + 1 balance equations:

    −λ_0 σ_0 + µ_1 σ_1 = 0,
    λ_{i−1} σ_{i−1} − (µ_i + λ_i) σ_i + µ_{i+1} σ_{i+1} = 0  for 0 < i < K,
    λ_{K−1} σ_{K−1} − µ_K σ_K = 0.

Birth-death processes with finite state space (cont)

• The solution to these equations is given by

    σ_i = σ_0 Π_{j=1}^i (λ_{j−1} / µ_j)  for 0 < i ≤ K.

• Choosing σ_0 to normalize (so Σ_{n≥0} σ_n = 1) gives

    σ_0 = ( 1 + Σ_{i=1}^K Π_{n=1}^i (λ_{n−1} / µ_n) )^{−1}.

• Exercise: Check whether this Markov chain is time reversible and detailed balance holds. Show how detailed balance can be used to derive the above expression for σ_i.


Birth-death processes with finite state space - example

[Transition rate diagram: states 0, 1, 2, ..., K−1, K with birth rate λ on each forward edge and death rates µ, 2µ, ..., Kµ.]

• Consider the example where λ_i = λ and µ_i = i·µ for some positive constants λ and µ.

• Define the constant ρ ≡ λ/µ.
• In this case the stationary distribution is a truncated Poisson:

    σ_i = σ_0 ρ^i / i!  for 1 ≤ i ≤ K,  and  σ_0 = ( Σ_{n=0}^K ρ^n / n! )^{−1}.

• Here, births arrive according to a Poisson process.

Birth-death processes with infinite state space

• The balance equations σ^T Q = 0^T for an infinite state space Z^+ with transition rates λ_i > 0 for all i ≥ 0 and µ_i > 0 for all i ≥ 1 are

    −λ_0 σ_0 + µ_1 σ_1 = 0

  and, for i > 0,

    λ_{i−1} σ_{i−1} − (µ_i + λ_i) σ_i + µ_{i+1} σ_{i+1} = 0.

• As for the finite case, the infinite birth-death process is irreducible.
• Assuming for the moment that it is positive recurrent as well, we can solve the balance equations to get

    σ_i = σ_0 Π_{j=1}^i (λ_{j−1} / µ_j)  for i > 0.

• Choosing σ_0 to normalize (so Σ_{i=0}^∞ σ_i = 1), we get

    σ_0 = ( 1 + Σ_{i=1}^∞ Π_{n=1}^i (λ_{n−1} / µ_n) )^{−1}.


Birth-death processes with infinite state space - recurrence

• The condition for positive recurrence is that

    Σ_{i=1}^∞ Π_{n=1}^i (λ_{n−1} / µ_n) < ∞,

  because σ_0 > 0 under this condition and, therefore, σ is a well-defined distribution (p.m.f.) on Z^+.
• Otherwise (i.e., if σ_0 = 0), the Markov chain is null recurrent or transient (even though the Markov chain is irreducible).

Birth-death processes with infinite state space - example

[Transition rate diagram: states 0, 1, 2, 3, ... with birth rate λ and death rate µ on every edge.]

• We now consider the example where λ_i = λ and µ_i = µ for all i, for constants λ, µ > 0.
• Again define the constant ρ ≡ λ/µ.
• The invariant distribution is geometric when ρ < 1: σ_i = (1 − ρ)ρ^i for i ≥ 0.
• Note that

    σ_0 = ( Σ_{i=0}^∞ ρ^i )^{−1} = 1 − ρ > 0

  if and only if ρ < 1, which is the condition for positive recurrence in this example.


Birth-death processes with infinite state space - example (cont)

• That is, if λ < µ this process is positive recurrent.
• If λ > µ the process is transient, i.e., each state is visited only finitely often a.s.
• If λ = µ the process is null recurrent, i.e., though each state is visited infinitely often a.s., the expected time between visits is infinite.

A queue described by an underlying Markov chain - notation

• The previous birth-death Markov chain example is an "M/M/1" queue, where
  – The first "M" means that the job interarrival times are i.i.d. Memoryless (exponentially distributed); i.e., the job arrival process is Poisson.
  – The second "M" means that the job service times or lifetimes are i.i.d. Memoryless.
  – The "1" means that there is one work-conserving server, i.e., only one job is being served at any given time.
  – The service times are independent of the arrival times.

• The queue is implicitly assumed to have an infinite capacity to hold jobs in this notation: M/M/1 = M/M/1/∞. (So the M/M/1 queue is lossless.)
• When a general distribution is involved, the terms "G" or "GI" are used instead of "M"; "GI" denotes general and i.i.d.

• So, an M/GI/∗ queue has a Poisson job arrival process and i.i.d. job service times of some distribution that is not necessarily exponential.

The finite birth-death process previously considered is an M/M/K/K queue, K servers and • no waiting room.


Forward-equation applications to b.d. processes

• Consider an interval J = {i, i+1, ..., j−1, j} ⊂ Z^+ of states, where i < j, and suppose the birth-death process is in J at time t; for a state m, let Z_m denote the first time at or after t at which the process visits state m.

• Note that P(Z_i ≥ t) = 1 by the definition of t above.

• Also, by the assumed temporal homogeneity, the distribution of Z_i − t does not depend on t;

• so we take t = 0 to simplify notation in the following.
• Assume that the birth-death process is such that there is a positive probability that it will exit the interval J at either end.

We will show how the Kolmogorov equations can be used to compute the probability that • the Markov chain exits the interval J at i.

Probability a b.d. process exits an interval at a given end

• For k ∈ {i−1, i, ..., j, j+1}, define

    g(k) = P(Z_{i−1} < Z_{j+1} | X(0) = k),

  i.e., the probability that, starting from state k, the process exits the interval J at the lower end.
• Conditioning on the first jump out of state k ∈ J,

    g(k) = g(k−1) · q_{k,k−1}/(−q_{k,k}) + g(k+1) · q_{k,k+1}/(−q_{k,k}).


Probability a b.d. process exits an interval at a given end

• Recall that

    −q_{k,k} = Σ_{m≠k} q_{k,m} = q_{k,k−1} + q_{k,k+1}.

• Therefore, we get the following set of j − i + 1 equations in as many unknowns (g(k) for i ≤ k ≤ j):

    Σ_{m=k−1}^{k+1} g(m) q_{k,m} = 0  for i ≤ k ≤ j,

  with boundary conditions g(i−1) = 1 and g(j+1) = 0.

Probability a b.d. process exits an interval at a given end

• The unique solution of these equations can be found by, e.g., the systematic method of Z-transforms; in particular, the desired quantity g(i) can be found.

• For the example where there is a constant q > 0 such that, for all k ∈ J,

    q_{k,k+1} = q = q_{k,k−1},

  the solution is g(k) = Ak + B for all k ∈ J, for some constants A and B found using the boundary conditions, i.e.,

    1 = g(i−1) = A(i−1) + B,
    0 = g(j+1) = A(j+1) + B.

• Therefore, A = −1/(j−i+2), B = (j+1)/(j−i+2), and

    g(k) = (j − k + 1) / (j − i + 2).


Mean time to return to a given state by a b.d. process

• Considering that there are only finitely many states less than a given state i, state i is positive recurrent only if h_i(i+1) < ∞, where

    h_i(j) ≡ E(Z_i | X(0) = j).

• By time-homogeneity, for small ε > 0,

    E(Z_i | X(ε) = j) = ε + E(Z_i | X(0) = j) = h_i(j) + ε + o(ε).

• Again, by using a forward-equation argument,

    h_i(j) = 1/(q_{j,j+1} + q_{j,j−1})
             + [q_{j,j+1}/(q_{j,j+1} + q_{j,j−1})] h_i(j+1)
             + [q_{j,j−1}/(q_{j,j+1} + q_{j,j−1})] h_i(j−1)
           = 1/(λ+µ) + [λ/(λ+µ)] h_i(j+1) + [µ/(λ+µ)] h_i(j−1)

  for all j > i, with (by definition) h_i(i) ≡ 0.
• Intuitively, the first term on the right-hand side is the mean visiting time of state j, and the coefficient of h_i(j±1) is the probability of transitioning from j to j±1 in one step.

Mean time to return to a given state by a b.d. process

• If we define

    η_i(j) ≡ h_i(j) − h_i(j+1),

  then the above equations for h_i become

    η_i(j) = 1/q_{j,j+1} + (q_{j,j−1}/q_{j,j+1}) η_i(j−1) = 1/λ + (1/ρ) η_i(j−1).

• Iterating, we get

    η_i(j) = (1/λ)(1 + 1/ρ) + (1/ρ²) η_i(j−2) = ··· = (1/λ) Σ_{k=0}^{j−i−1} ρ^{−k} + ρ^{−(j−i)} η_i(i).

• Multiplying through by ρ^{j−i} and then rewriting in terms of h_i gives

    ρ^{j−i} (h_i(j) − h_i(j+1)) = −h_i(i+1) + (1/λ) Σ_{k=1}^{j−i} ρ^k,

  where we note that η_i(i) = h_i(i) − h_i(i+1) = −h_i(i+1).


Mean time to return to a given state by a b.d. process

• Now consider this equation as j → ∞.
• First note that the difference h_i(j) − h_i(j+1) → 0.
• Now if ρ ≥ 1, then this clearly requires that h_i(i+1) = ∞, since the summation on the right-hand side tends to infinity, i.e., state i and, by the same argument, all other states are not positive recurrent.
• If ρ < 1, the summation on the right-hand side converges and the left-hand side tends to zero as j → ∞, so that

    0 = −h_i(i+1) + (1/λ) · ρ/(1−ρ),  i.e.,  h_i(i+1) = 1/(µ − λ).

Forward-equation applications - further reading

• A more general statement along these lines for birth-death Markov chains is given at the end of Section 4.7 of [Karlin & Taylor, "A First Course in Stochastic Processes", 2nd Ed., 1975].

Explore use of backward equation for similar problems. • Explore these problems for discrete-time birth-death processes. •


The M/M/1 queue

• The previous example birth-death process with infinite state space is the M/M/1 queue with
  – Poisson job arrivals of rate λ jobs per second,
  – identically distributed exponential service times with mean 1/µ seconds that are mutually independent and independent of the arrivals, and
  – a single server.

• That is, the job interarrival times are independent and exponentially distributed with mean 1/λ seconds and, therefore, for all times s < t, the number of arrivals in the interval (s, t] is Poisson distributed with mean λ(t − s).

The M/M/1 queue (cont)

[Transition rate diagram of the M/M/1 queue: states 0, 1, 2, 3, ... with birth rate λ and death rate µ on every edge.]


The M/M/1 queue (cont)

• When the traffic intensity ρ ≡ λ/µ < 1, Q is a positive recurrent birth-death process with ρ-geometric stationary distribution.
• So, the stationary mean number of jobs in (backlog of) the system is

    EQ = L = ρ/(1−ρ) = λ/(µ−λ),

  and, by Little's formula, the stationary mean sojourn time of jobs is

    w = L/λ = 1/(µ−λ).

• For the M/M/1 queue, we can obtain the stationary distribution of the sojourn time.

The "arrival theorem" - recall PASTA

• For a causal, stationary and ergodic, and stable queue, suppose the job arrival times form a Poisson point process.

• PASTA: If Q is the state of such a queueing system and T ∈ R is distributed as a Poisson arrival time, then Q(T−) is distributed as the stationary distribution of Q.


PASTA - sojourn time of stationary M/M/1 queue

• Recall that the stationary distribution (i.e., at a typical time) of the number of jobs in an M/M/1 queue with traffic intensity ρ = λ/µ < 1 is geometric with parameter ρ.

• By PASTA, the distribution of the number of jobs in the queue just before the arrival time T of a typical job is also geometric with parameter ρ:

    P(Q(T−) = i) = (1−ρ)ρ^i  for all i ∈ Z^+.

• Note that Q(T) = Q(T−) + 1 ≥ 1.
• Thus, the stationary sojourn time W is distributed as Erlang(µ, i+1) (= Γ(µ, i+1), the sum of i+1 i.i.d. exp(µ) random variables) with probability (1−ρ)ρ^i, for all i ≥ 0.
• Exercise: Using this distribution, show that the mean sojourn time of an M/M/1 queue is

    1/(µ−λ) = Σ_{i=0}^∞ E(W | Q(T−) = i) P(Q(T−) = i) = Σ_{i=0}^∞ (i+1)(1/µ)(1−ρ)ρ^i.
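The exercise above can be checked numerically by truncating the series (the sample rates are arbitrary choices with ρ < 1):

```python
# Check: E[W] = sum_i (i+1)/mu * (1-rho)*rho^i  should equal  1/(mu - lambda).
lam, mu = 3.0, 5.0                         # sample rates (assumption), rho = 0.6 < 1
rho = lam / mu
EW = sum((i + 1) / mu * (1 - rho) * rho**i for i in range(5000))  # truncated series
print(EW, 1 / (mu - lam))
```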

The stationary M/M/K/K queue

Consider a queue with Poisson arrivals, i.i.d. exponential service times, K servers, and no • waiting room.

That is, a lossy M/M/K/K queue described by a finite-state birth-death Markov chain. • Since the capacity to hold jobs equals the number of servers, there is no waiting room (each • server holds one job).

Again, let λ be the rate of the Poisson job arrivals and let 1/µ be the mean service time • of a job.

• Suppose that there are n jobs in the system at time t, i.e., Q(t) = n.
• As before, we can show Q is a birth-death Markov chain.

[Transition rate diagram: states 0, 1, 2, ..., K−1, K with birth rate λ on each forward edge and death rates µ, 2µ, ..., Kµ.]


The stationary M/M/K/K queue (cont)

• Indeed, suppose Q(t) = n > 0 and suppose that the past evolution of Q is known (i.e., {Q(s) | s ≤ t} is given).
  – By the memoryless property of the exponential distribution, the residual service times of the n jobs are exponentially distributed random variables with mean 1/µ.
  – Therefore, Q makes a transition to state n − 1 at rate nµ, i.e., q_{n,n−1} = nµ for 0 < n ≤ K.

• Thus, the stationary distribution of Q is the truncated Poisson given before:

    σ_i = σ_0 ρ^i / i!  for 1 ≤ i ≤ K,  and  σ_0 = ( Σ_{i=0}^K ρ^i / i! )^{−1}.

Erlang's blocking formula for the stationary M/M/K/K queue

• Now consider a stationary M/M/K/K queue.
• Suppose we are interested in the probability that an arriving job is blocked (dropped) because, upon its arrival, the system is full, i.e., every server is occupied.

Note above that when we assumed a “lossless” queue, we meant internally lossless. •

• More formally, we want to find P(Q(T_n−) = K), where we recall that T_n is the arrival time of the n-th job.

• Since the arrivals are Poisson, we can invoke PASTA to get

    P(Q(T_n−) = K) = σ_K = σ_0 ρ^K / K! =: E(ρ, K),

  which is called Erlang's blocking or Erlang B formula.

Obviously, the mean sojourn time of all admitted arrivals is w =1/µ. •
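Erlang B can be computed two ways: directly from the truncated-Poisson expression above, or via the standard numerically stable recursion B(ρ, k) = ρB(ρ, k−1)/(k + ρB(ρ, k−1)) (the recursion is a well-known computational device, not derived on this slide):

```python
from math import factorial

def erlang_b_direct(rho, K):
    # direct formula: sigma_K = (rho^K / K!) / sum_{i=0}^K rho^i / i!
    return (rho**K / factorial(K)) / sum(rho**i / factorial(i) for i in range(K + 1))

def erlang_b(rho, K):
    # stable recursion, avoids overflow of rho^K and K! for large K
    b = 1.0
    for k in range(1, K + 1):
        b = rho * b / (k + rho * b)
    return b

print(erlang_b(2.0, 4))                    # 2/21, about 0.0952
```

The recursion is preferred in practice because ρ^K and K! overflow quickly as K grows.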


Erlang’s blocking formula for the stationary M/M/K/K queue

For more general (non-exponential) service time distributions, it can be shown that Erlang’s • blocking formula still holds.

• Therefore, given the mean service time 1/µ, Erlang's result is said to be otherwise "insensitive" to the service time distribution.

Finally note that, by Little’s theorem, the mean number of busy servers in steady state is • 1 L = λ(1 σK) = ρ(1 σK), − µ −

where λ(1 σK) is the mean rate of arrivals that are admitted (by PASTA), i.e., successfully join the queue.−

• Exercise: check that L = EQ = Σ_{i=0}^K i σ_i.

M/M/K/(K + W) queue - K servers and W ≥ 1 waiting room

Modeling a call center as an M/M/K/K queue, customers calling when all servers are • occupied will be blocked with probability given by Erlang’s blocking formula (indicated by slow busy signal).

• If we add a waiting room of W ≥ 1 jobs, then we have an M/M/K/(K + W) queue, with blocking (fast busy signal) probability, by PASTA, equal to the stationary

    σ_{K+W} = P(Q = K+W) = σ_0 Π_{j=1}^{K+W} (λ/µ_j) = σ_0 ρ^{K+W} / (K! K^W),

  where

    σ_0 = P(Q = 0) = ( 1 + Σ_{i=1}^{K+W} Π_{n=1}^i (λ/µ_n) )^{−1}
                   = ( Σ_{i=0}^K ρ^i/i! + (ρ^K/K!) Σ_{j=1}^W (ρ/K)^j )^{−1},

  ρ = λ/µ, the birth rates are λ_n = λ, and the death rates are

    µ_n = nµ  if 1 ≤ n ≤ K,    µ_n = Kµ  if K ≤ n ≤ K + W.


M/M/K/(K + W ) queue with impatient customers (abandonment)

• Now consider customers that depart (abandon) the queue if their queueing delay is larger than an independent, exponentially distributed amount with mean 1/δ, i.e., the death rates are

    µ_n = nµ  if 1 ≤ n ≤ K,    µ_n = Kµ + (n−K)δ  if K ≤ n ≤ K + W.

• In steady state, the total arrival rate equals the total "departure" rate due to blocking, abandonment or successful service:

    λ = λσ_{K+W} + Σ_{q=K+1}^{K+W} (q−K)δσ_q + Σ_{q=1}^{K+W} (q∧K)µσ_q.

• So, the probabilities of successful service and abandonment (departure due to impatience) are, respectively,

    S(K,W) := λ^{−1} Σ_{q=1}^{K+W} (q∧K)µσ_q  and  A(K,W) := λ^{−1} Σ_{q=K+1}^{K+W} (q−K)δσ_q,

  and again σ_{K+W} is the probability of blocking upon arrival.

M/M/K/(K+W) queue with impatient customers (cont)

• For a call center, one can consider an optimization problem of the form

    max_{K,W}  r_s S(K,W) − c_a A(K,W) − c_b σ_{K+W} − K,

  where c_a ≥ 0, resp. c_b ≥ 0, is the cost of abandoned, resp. blocked, customers per unit server, and r_s is the reward for served customers per unit server.

• Normally, c_a > c_b, as a customer who abandons after being on hold may naturally be more irate than one who is immediately blocked.

Exercise: Verify that σK+W decreases in K and W , A decreases in K but increases with • W ,andS increases in both K and W .


Exercise: Delays in memoried, multiserver systems - Erlang C formula

• Now consider an M/M/K (i.e., M/M/K/∞) system with infinite waiting room.
• Here, λ < Kµ is required for positive recurrence (stability).
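For the exercise above, the standard Erlang C formula gives the stationary probability that an arrival must wait; it is stated here as a sketch (not derived on this slide):

```python
from math import factorial

def erlang_c(lam, mu, K):
    # standard Erlang C: P(wait) in an M/M/K queue, requires a = lam/mu < K
    a = lam / mu
    assert a < K, "requires lam < K*mu for stability"
    top = a**K / (factorial(K) * (1 - a / K))
    return top / (sum(a**k / factorial(k) for k in range(K)) + top)

print(erlang_c(0.5, 1.0, 1))               # M/M/1 special case: P(wait) = rho = 0.5
```

Note the K = 1 case reduces to the M/M/1 result P(wait) = ρ, consistent with PASTA and the geometric stationary distribution.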

Pollaczek-Khintchine formula for M/GI/1 queue

• Now let
  – Q = stationary (at a typical time) number of queued customers NOT including in-service customers,
  – R = stationary residual service time of the in-service customer,
  – T = queueing delay NOT including service time of a typical customer,
  – S = service times of jobs,
  – λ = mean arrival rate of jobs.

• By PASTA for the M/GI/1 queue,

    ET = EQ·ES + ER.

• By Little's formula for the queue,

    EQ = λET.

• By eliminating EQ, we get

    ET = ER / (1 − λES).


Pollaczek-Khintchine formula for M/GI/1 queue (cont)

• Plotting the residual service time R(t) over time, we see alternating busy and idle periods, where the busy periods consist of i.i.d. triangle pulses of height ∼ S and duration ∼ S.
• Over a large interval of time t, suppose there are n arrivals:

    (1/t) ∫_0^t R(u) du = (1/t) Σ_{i=1}^n (1/2) S_i² = (n/t) · (1/n) Σ_{i=1}^n (1/2) S_i².

• So, as t → ∞, this implies

    ER = (1/2) λ ES².

• Substituting for ER gives

    ET = λES² / (2(1 − λES)).

• Note that the mean sojourn time is ET + ES.

Queuing models with Markov embeddings

In some queueing models, the queue occupancy itself is not Markovian but is based on an • “embedded” Markov chain.

• The M/GI/1 queue-length distribution can be obtained by considering an embedded, discrete-time Markov chain; see, e.g., https://www.netlab.tkk.fi/opetus/s383143/kalvot/E_mg1jono.pdf and L. Kleinrock, Queueing Systems, Vol. 1, Chapter 5.

In some cases, non-exponential service times or interarrival times are modeled with a plurality • of i.i.d. exponential “phases” (PH).


Markovian queueing networks with static routing

We now introduce two classical Markovian queueing network models: • – loss networks modeling circuit-switched networks (e.g., the former telephone network, MPLS networks) with static routing, and – Jackson networks that can be used to model packet-switched networks and packet-level processors, with purely randomized routing with static routing probabilities.

Both will be shown to have “product-form” invariant distributions. •

Loss networks - example

[Figure: example network with 13 enumerated links connecting end nodes m and n.]


Loss networks - example (cont)

• The previous figure depicts a network with 13 enumerated links.
• Note that the cycle-free routes connecting nodes (end systems) m and n are

    r1 = {1, 3, 6, 9},
    r2 = {1, 4, 13, 9},
    r3 = {1, 4, 5, 6, 9},
    r4 = {1, 3, 5, 13, 9},

  where we have described each route by its link membership as above.

We will return to this example in the following. •

Loss networks - preliminaries

• Consider a network connecting together a number of network access points (behind each of which is a community of end-users/systems).

• Bandwidth in the network is divided into fixed-size amounts called circuits; e.g., a circuit could be a 64 kbps channel (voice line) or a T1 line of 1.544 Mbps.

• Let L be the number of network links (edges) and let c_l circuits be the fixed capacity of network link l.

• Let R be the number of distinct bidirectional routes in the network, where a route r is defined by a group of links l ∈ r.
• Let 𝓡 be the set of distinct routes, so that R = |𝓡|.
• Finally, define the L × R matrix A with Boolean entries a_{l,r} in the l-th row and r-th column, where
a_{l,r} = 1{l ∈ r} = 1 if l ∈ r, and 0 if l ∉ r.
• That is, each column of A corresponds to a route and each row of A corresponds to a link.


Loss networks - example (cont)

• The four routes r1 to r4 for the previous example network (of 13 links) are described by the 13 × 4 matrix

A = [ 1 1 1 1
      0 0 0 0
      1 0 0 1
      0 1 1 0
      0 0 1 1
      1 0 1 0
      0 0 0 0
      0 0 0 0
      1 1 1 1
      0 0 0 0
      0 0 0 0
      0 0 0 0
      0 1 0 1 ].
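As an illustrative sketch (not part of the text), the matrix A can be built mechanically from the route sets:

```python
# Build the L x R link-route incidence matrix A from route membership.
L = 13
routes = [{1, 3, 6, 9},      # r1
          {1, 4, 13, 9},     # r2
          {1, 4, 5, 6, 9},   # r3
          {1, 3, 5, 13, 9}]  # r4
A = [[1 if (l + 1) in r else 0 for r in routes] for l in range(L)]
# Links 1 and 9 (rows 0 and 8) lie on every route:
print(A[0], A[8])  # [1, 1, 1, 1] [1, 1, 1, 1]
```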

Loss networks - preliminaries (cont)

• Assume each route (path) r has an independent associated Poisson connection arrival (circuit setup) process with intensity λ_r.

• This situation arises if, for each pair of end nodes π, there is an independent Poisson connection arrival process with rate Λ_π that is randomly thinned among the routes 𝓡_π connecting π, so that there are independent Poisson arrivals to each route r ∈ 𝓡_π with rate λ_r = Λ_π p_{π,r} ≥ 0.
• These fixed "routing" parameters satisfy
Σ_{r ∈ 𝓡_π} p_{π,r} = 1.

Define Xr(t) as the number of occupied connections on route r at time t and define the • corresponding random vector X(t).

• Let e_r be the R-vector with zeros in every entry except for the r-th row, whose entry is 1.
• The following result generalizes that for an M/M/K/K queue.


Loss networks - preliminaries (cont)

• If an existing connection on route r terminates at time t,
X(t) = X(t−) − e_r.
• Similarly, if a connection on route r is admitted into the network at time t,
X(t) = X(t−) + e_r.

• Clearly, a connection cannot be admitted (i.e., is blocked) along route r* at time t if any associated link capacity constraint is violated, i.e., if, for some l ∈ r*,
Σ_{r | l ∈ r} X_r(t−) = c_l.
• An R-vector x is said to be a feasible state if it satisfies all link capacity constraints, i.e., for all links l,
(Ax)_l = Σ_{r | l ∈ r} x_r ∈ {0, 1, 2, ..., c_l}.
• Thus, the state space of the stochastic process X is
S(c) ≡ {x ∈ (Z+)^R | Ax ≤ c}.

Loss networks - example (cont)

For the previous network example of L =13links, note that link 1 (and link 9) is common • to all R =4routes.

We now illustrate how link capacities c are used to determine whether route occupancies x • are feasible via corresponding link occupancies Ax.

• For example,
x = [1 1 1 1]^T
is feasible if the capacities c_l ≥ 4 for all links l, but is not feasible if c_1 < 4 because (Ax)_1 = 4.


Loss Networks - product-form invariant of the Markovian model

• In addition to assuming that each route r has independent Poisson connection arrivals with rate λ_r, also assume independent and exponentially distributed connection lifetimes with mean 1/μ_r.

• Now it is easily seen that the stochastic process X is a Markov chain wherein
– the state transition x → x + e_r ∈ S(c) occurs with rate λ_r, and
– the state transition x → x − e_r ∈ S(c) occurs with rate x_r μ_r.
• Theorem: The process X is time-reversible with stationary distribution on S(c) given by the product form
σ(x) = (1/G(c)) Π_{r ∈ 𝓡} ρ_r^{x_r}/x_r!, where ρ_r = λ_r/μ_r,
c is the L-vector of link capacities, and
G(c) = Σ_{x ∈ S(c)} Π_{r ∈ 𝓡} ρ_r^{x_r}/x_r!
is the normalizing term (partition function) chosen so that Σ_{x ∈ S(c)} σ(x) = 1.

Loss networks - proof of product-form invariant

[Transition diagram fragment: x → x + e_r at rate λ_r, and x + e_r → x at rate (x_r + 1)μ_r.]

• Assuming x, x + e_r ∈ S(c) for some r ∈ 𝓡, a generic detailed balance equation is
λ_r σ(x) = (x_r + 1) μ_r σ(x + e_r).

• The theorem statement therefore follows if the claimed σ satisfies this equation.
• So, substituting the claimed expression for σ and canceling from both sides the normalizing term G(c) and all terms pertaining to routes other than r gives
λ_r ρ_r^{x_r}/x_r! = (x_r + 1) μ_r ρ_r^{x_r + 1}/(x_r + 1)!.
• This equation is clearly seen to be true after canceling x_r + 1 on the right-hand side, then canceling ρ_r^{x_r}/x_r! from both sides, and finally recalling ρ_r ≡ λ_r/μ_r.
• Note how the normalizing term G depends on c through the state space S(c).


Loss networks - connection blocking

• An arriving connection at time t is admitted on route r only if a circuit is available on all of r's links (edges), i.e., only if
– (AX(t−))_l ≤ c_l − 1 for all l ∈ r, where
– (Ax)_l represents the l-th component of the L-vector Ax.

• Consider the L-vector Ae_r, i.e., the r-th column of A, whose l-th entry is
(Ae_r)_l = a_{l,r} = 1{l ∈ r} = 1 if l ∈ r, and 0 if l ∉ r.

• Thus, the L-vector c − Ae_r has l-th entry
c_l − (Ae_r)_l = c_l − 1 if l ∈ r, and c_l if l ∉ r.

Loss networks - connection blocking (cont)

• Theorem: The stationary probability that a connection is blocked on route r is
B_r = 1 − G(c − Ae_r)/G(c).

• Proof: First note that B_r is 1 minus the stationary probability that the connection is admitted (on every link l ∈ r).
• Therefore, by PASTA,
B_r = 1 − Σ_{x | Ax ≤ c − Ae_r} σ(x)
    = 1 − (1/G(c)) Σ_{x ∈ S(c − Ae_r)} Π_{r' ∈ 𝓡} ρ_{r'}^{x_{r'}}/x_{r'}!,
from which the expression for exact blocking directly follows by definition of the normalizing term G.
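For a small network, this exact blocking formula can be evaluated by brute force; a sketch under assumed toy data (two independent single-link routes, so the answer must match Erlang's formula):

```python
# Sketch: exact route blocking B_r = 1 - G(c - A e_r)/G(c) by brute-force
# enumeration of S(c) = {x >= 0 : A x <= c}; feasible for small networks only.
from itertools import product
from math import factorial

def G(c, A, rho):
    R, num_links = len(rho), len(c)
    total = 0.0
    for x in product(range(max(c) + 1), repeat=R):
        if all(sum(A[l][r] * x[r] for r in range(R)) <= c[l] for l in range(num_links)):
            term = 1.0
            for r in range(R):
                term *= rho[r] ** x[r] / factorial(x[r])
            total += term
    return total

# Assumed toy data: 2 links, 2 disjoint single-link routes, so each route
# sees Erlang blocking E(1, 2) = (1/2)/(1 + 1 + 1/2) = 0.2.
A = [[1, 0], [0, 1]]
c = [2, 2]
rho = [1.0, 1.0]
B = [1 - G([c[l] - A[l][r] for l in range(2)], A, rho) / G(c, A, rho)
     for r in range(2)]
print(B)
```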


Fixed-point iteration for approximate connection blocking

• The computational complexity of the partition function G grows rapidly as the network dimensions (L, R, N, etc.) grow.

We now formulate an iterative method for determining approximate blocking probabilities • under the assumption that the individual links block connections independently.

• Consider a single link l* and let b_{l*} be its unknown blocking probability.

• For the moment, assume that the link blocking probabilities b_l of all other links l ≠ l* are known.

• Consider a route r containing link l*.

• By the independent blocking assumption, the incident load (traffic intensity) of l* from this route, after blocking by all of the route's other links has been taken into account, is
ρ_r Π_{l ∈ r, l ≠ l*} (1 − b_l).

Fixed-point iteration for approximate connection blocking (cont)

• Thus, the total load of link l* is reduced/thinned by blocking to
ρ̂_{l*}(b_{−l*}) ≡ Σ_{r | l* ∈ r} ρ_r Π_{l ∈ r, l ≠ l*} (1 − b_l),
where b_{−l*} is the (L − 1)-vector of link blocking probabilities not including that of link l*.

• By the independent blocking assumption, the blocking probability of link l* must therefore be the reduced-load approximation,
b_{l*} = E(ρ̂_{l*}(b_{−l*}), c_{l*}) for all l* ∈ {1, 2, ..., L},
where again E is Erlang's blocking formula
E(ρ, c) ≡ E_c(ρ) ≡ (ρ^c/c!) / Σ_{j=0}^c ρ^j/j!.
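In practice Erlang's formula is usually evaluated with the stable recursion E_0(ρ) = 1, E_c(ρ) = ρE_{c−1}(ρ)/(c + ρE_{c−1}(ρ)), which avoids the overflow in ρ^c/c! for large c; a sketch:

```python
# Sketch: Erlang-B via the standard recursion, checked against the
# direct formula rho^c/c! / sum_j rho^j/j! for modest c.
from math import factorial

def erlang_b(rho, c):
    E = 1.0                       # E_0(rho) = 1
    for k in range(1, c + 1):
        E = rho * E / (k + rho * E)
    return E

direct = (5.0 ** 10 / factorial(10)) / sum(5.0 ** j / factorial(j) for j in range(11))
print(erlang_b(5.0, 10), direct)
```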


Fixed-point iteration for approximate connection blocking (cont)

• Clearly, the link blocking probabilities b must simultaneously satisfy the reduced-load approximation for all links l*, giving a system of L equations in L unknowns.

Approaches to numerically finding such an L-vector b include Newton’s method and the • following fixed-point iteration (method of successive approximations).

• Beginning from an arbitrary initial b^(0), after j iterations set
b_l^(j) = E(ρ̂_l(b_{−l}^(j−1)), c_l) for all links l.
• Brouwer's fixed-point theorem gives that a solution b ∈ [0, 1]^L exists.
• Uniqueness of the solution follows from the fact that this solution is the minimum of a convex function.
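A sketch of this successive-approximations iteration on an assumed toy network (3 links of capacity 3; routes {1,2} and {2,3} each offering load 1.5):

```python
# Sketch: Erlang fixed-point (reduced-load) iteration on assumed toy data.
from math import prod

def erlang_b(rho, c):
    # stable Erlang-B recursion
    E = 1.0
    for k in range(1, c + 1):
        E = rho * E / (k + rho * E)
    return E

cap = [3, 3, 3]                  # link capacities c_l
routes = [{0, 1}, {1, 2}]        # 0-indexed link sets of the two routes
rho = [1.5, 1.5]                 # offered load per route

b = [0.0] * len(cap)             # initial guess b^(0) = 0
for _ in range(200):             # b_l <- E(rho_hat_l(b_{-l}), c_l)
    b = [erlang_b(sum(rho[r] * prod(1 - b[k] for k in routes[r] if k != l)
                      for r in range(len(routes)) if l in routes[r]), cap[l])
         for l in range(len(cap))]

B = [1 - prod(1 - b[l] for l in r) for r in routes]   # route blocking
print(b, B)   # by symmetry b[0] == b[2]; the shared middle link blocks most
```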

Fixed-point iteration for approximate connection blocking (cont)

• Given the link blocking probabilities b, under the independence assumption, the route blocking probabilities are
B_r = 1 − Π_{l ∈ r} (1 − b_l) = Σ_{l ∈ r} b_l + o(Σ_{l ∈ r} b_l);
i.e., if Σ_{l ∈ r} b_l ≪ 1, then
B_r ≈ Σ_{l ∈ r} b_l.
• That is, the blocking probability is approximately additive.
• Now, instead of "jobs" (connections or calls) occupying circuits on every link of a route and without waiting rooms, in the following we will consider jobs as spatially localized packets.


Example: A loss model of a server bank

Consider a set of (physical or virtual) servers for an aggregate stream of (computational) • jobs.

Each server has a finite capacity of executors (containers in virtual machines or cloud/Lambda • functions).

The aggregate job stream is a superposition of independent Poisson arrival processes, each • for jobs of a certain type.

Suppose each type of job requires for its component tasks a certain number of executors, • where each executor needs to be from a certain server because of “data locality” constraints for example.

• The correspondence for this example is: servers to links, executors to circuits, and jobs to connections/routes.

Because all servers have a limited number of executors, some arriving jobs may not receive • their requested set of executors, i.e., blocking.

Stable open networks of queues

• Consider an idealized packet-switched network where
– the forwarding decisions, made at each node forwarding the packets (jobs), are independently random, and
– the service times at the nodes that a given packet visits are independently random with distribution depending on the forwarding node.

• Consider a group of N ≥ 2 lossless, single-server, work-conserving queueing stations.
• Packets at the n-th station have a mean required service time of 1/μ_n for all n ∈ {1, 2, ..., N}, and external arrival rate Λ_n, where Λ_n > 0 for at least one station n (open network).

The packet arrival process to the nth station is a superposition of N +1 component arrival • processes.

• Packets departing the m-th station are forwarded to and immediately arrive at the n-th station with probability r_{m,n}.

• Also, with probability r_{m,0}, a packet departing station m leaves the queueing network forever; here we use station index 0 to denote the world outside the network.

• Clearly, for all m, Σ_{n=0}^N r_{m,n} = 1.


The flow balance equations

• Defining λ_n as the total arrival rate to the n-th station, recall the flow balance equations based on the notion of conservation of flow, and require that all queues are stable, i.e., μ_n > λ_n and
λ_n = Λ_n + Σ_{m=1}^N λ_m r_{m,n}, ∀n ∈ {1, 2, ..., N},
or in matrix form,
λ^T(I − R) = Λ^T  ⇒  λ^T = Λ^T(I − R)^{−1} < μ^T.
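A numerical sketch of solving these equations (assumed rates; numpy for the linear algebra):

```python
# Sketch: solving the flow balance equations lambda^T = Lambda^T (I - R)^{-1}
# for an assumed 3-station routing matrix R (rows sum to < 1; the deficit
# r_{m,0} is the probability of leaving the network).
import numpy as np

R = np.array([[0.0, 0.5, 0.2],
              [0.3, 0.0, 0.3],
              [0.2, 0.4, 0.0]])
Lam = np.array([1.0, 0.5, 0.0])          # exogenous arrival rates (assumed)
lam = Lam @ np.linalg.inv(np.eye(3) - R)
r_out = 1 - R.sum(axis=1)                # r_{m,0}
print(lam)
# conservation of flow at each station, and in aggregate:
print(np.allclose(lam, Lam + lam @ R), np.isclose(Lam.sum(), lam @ r_out))
```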

• Also recall the conditions under which I − R is nonsingular, including that r_{m,0} > 0 for at least one station m (open network), so R is a strictly sub-stochastic matrix.

• Exercise: Show there is "aggregate" flow balance between the outside world and the network:
Σ_{m ≠ 0} Λ_m = Σ_{m ≠ 0} λ_m r_{m,0}.

Open Jackson networks - introduction

• Suppose that the exogenous arrival processes are independent Poisson and that the service time distributions are exponential.

• All routing decisions, service times and exogenous (exponential) interarrival times are mutually independent.

• The resulting network is Markovian and is called an open Jackson network.

• Note that if X_n(t) is the number of packets in station n at time t, then the vector of such processes X(t), taking values in (Z+)^N, is Markovian.

Jackson networks can be used to model aggregate packet-flows between networked queueing • stations and the outside world.


Jackson network - example

[Figure: a three-station Jackson network with exogenous arrival rates Λ1, Λ2, forwarding probabilities r_{1,2}, r_{2,1}, r_{1,3}, r_{3,1}, r_{2,3}, r_{3,2} among stations 1, 2, 3, and departure probability r_{3,0} from station 3.]

Open Jackson networks - Markovian transition rates

• Consider a vector x = (x_1, ..., x_N)^T ∈ (Z+)^N and define the following operators δ mapping (Z+)^N → (Z+)^N.
• If x_m > 0 and 1 ≤ n, m ≤ N, then δ_{m,n} represents a packet departing from station m and arriving at station n:
(δ_{m,n} x)_i = x_i if i ≠ m, n;  x_m − 1 if i = m;  x_n + 1 if i = n;
i.e., δ_{m,n} x ≡ x − e_m + e_n.
• If x_m > 0 and 1 ≤ m ≤ N, then δ_{m,0} represents a packet departing the network to the outside world from station m:
(δ_{m,0} x)_i = x_i if i ≠ m;  x_m − 1 if i = m.
• If 1 ≤ n ≤ N, then δ_{0,n} represents a packet arriving at the network at station n from the outside world:
(δ_{0,n} x)_i = x_i if i ≠ n;  x_n + 1 if i = n.


Open Jackson networks - Markovian transition rates (cont)

• In the following TRD fragment, Λ_m > 0 for the transition at left, and both x_m > 0 and r_{m,n} > 0 for the transition at right.
• For example, for N = 4 stations, suppose the network is currently in state x = [17 5 0 6]^T, so that:
– Assuming Λ_1 > 0, an exogenous arrival to station 1 causes transition
x → δ_{0,1} x = [18 5 0 6]^T with rate Λ_1.
– Assuming r_{2,4} > 0, a departure from station 2 that arrives to station 4 causes transition
x → δ_{2,4} x = [17 4 0 7]^T with rate μ_2 r_{2,4}.
– A departure from station 3 is impossible because it's empty (x_3 = 0).

Open Jackson networks - Markovian transition rates and invariant

• The transition rate matrix of the Jackson network is given by the following equations:
q(x, δ_{m,n} x) = μ_m r_{m,n} if 1 ≤ m ≤ N, 0 ≤ n ≤ N, x_m > 0;
q(x, δ_{0,n} x) = Λ_n if 1 ≤ n ≤ N.
• Theorem: The stationary distribution of an open Jackson network is product form,
σ(x) = (1/G) Π_{n=1}^N ρ_n^{x_n},
where, for all n, the traffic intensity
ρ_n ≡ λ_n/μ_n < 1
for stability, and the normalizing term (partition function) is
G = Σ_{x ∈ (Z+)^N} Π_{n=1}^N ρ_n^{x_n} = Π_{n=1}^N 1/(1 − ρ_n).


Proof of Jackson’s theorem

• We need to verify the (full) balance equations for the claimed invariant distribution σ:
∀x, Σ_y σ(y) q(y, x) = 0.

Recall the balance equations for the Jackson network “at” state x, i.e., corresponding to • x’s column in the network’s transition rate matrix.

• To do this, we consider states from which the network makes transitions into x, including:
– δ_{n,m} x for stations n such that x_n > 0 and stations m, from which a packet moves from m to n, where
q(δ_{n,m} x, x) = μ_m r_{m,n};
– δ_{n,0} x for stations n such that x_n > 0, from which an exogenous arrival at n occurs, where
q(δ_{n,0} x, x) = Λ_n;
– δ_{0,m} x for all stations m, from which a packet departs the network from m, where
q(δ_{0,m} x, x) = μ_m r_{m,0};
– all other transitions that do not occur in one step, i.e., q = 0.
• Note that δ_{m,n} δ_{n,m} x = x.

Proof of Jackson's theorem (cont)

• Therefore, the balance equations at x are
Σ_{n | x_n > 0} [ q(δ_{n,0} x, x) σ(δ_{n,0} x) + Σ_{m=1}^N q(δ_{n,m} x, x) σ(δ_{n,m} x) ]
+ Σ_{m=1}^N q(δ_{0,m} x, x) σ(δ_{0,m} x)
= [ Σ_{m | x_m > 0} ( q(x, δ_{m,0} x) + Σ_{n=1}^N q(x, δ_{m,n} x) ) + Σ_{n=1}^N q(x, δ_{0,n} x) ] σ(x).


Proof of Jackson’s theorem (cont)

• Substituting the transition rates we get
Σ_{n | x_n > 0} [ Λ_n σ(δ_{n,0} x) + Σ_{m=1}^N μ_m r_{m,n} σ(δ_{n,m} x) ]
+ Σ_{m=1}^N μ_m r_{m,0} σ(δ_{0,m} x)
= [ Σ_{m | x_m > 0} ( μ_m r_{m,0} + Σ_{n=1}^N μ_m r_{m,n} ) + Σ_{n=1}^N Λ_n ] σ(x)
= [ Σ_{m | x_m > 0} μ_m + Σ_{n=1}^N Λ_n ] σ(x).

Proof of Jackson's theorem (cont)

The theorem is proved if we can show that the claimed product-form invariant distribution • σ satisfies this balance equation at every state x.

• Substituting it and factoring out σ(x) on the left-hand side, we get
Σ_{n | x_n > 0} (1/ρ_n) [ Λ_n + Σ_{m=1}^N μ_m r_{m,n} ρ_m ] + Σ_{m=1}^N μ_m r_{m,0} ρ_m
= Σ_{m | x_m > 0} μ_m + Σ_{n=1}^N Λ_n.

• Substituting λ_m = μ_m ρ_m, we get
Σ_{n | x_n > 0} (1/ρ_n) [ Λ_n + Σ_{m=1}^N λ_m r_{m,n} ] + Σ_{m=1}^N λ_m r_{m,0}
= Σ_{m | x_m > 0} μ_m + Σ_{n=1}^N Λ_n.
• Finally, substitution of the flow balance equations implies this equation does indeed hold.

Note that the product form implies the queue occupancies are statistically independent in • steady state.


Jackson’s theorem - exercise

• If the network has the following properties
r_{m,n} > 0 ⇔ r_{n,m} > 0, and Λ_n > 0 ⇔ r_{n,0} > 0 ∀n,
determine whether Jackson's theorem for product-form invariant distribution holds by detailed balance, i.e., whether the open Jackson network is time-reversible.

Jackson network - example

[Figure: the same three-station Jackson network as above, with exogenous arrival rates Λ1, Λ2, forwarding probabilities r_{1,2}, r_{2,1}, r_{1,3}, r_{3,1}, r_{2,3}, r_{3,2}, and departure probability r_{3,0}.]

• If this Jackson network is stable with the forwarding probabilities and exogenous arrival rates indicated, the stationary distribution of the number of jobs in, e.g., the third queue is
P(Q_3 = k) = ρ_3^k (1 − ρ_3),
where ρ_3 = λ_3/μ_3 < 1 and λ_3 is found by solving the flow-balance equations
λ^T = Λ^T (I − R)^{−1} = [Λ_1 Λ_2 0] [ 1, −r_{1,2}, −r_{1,3} ; −r_{2,1}, 1, −r_{2,3} ; −r_{3,1}, −r_{3,2}, 1 ]^{−1} < μ^T,
where r_{3,1} + r_{3,2} = 1 − r_{3,0} < 1.

• Thus, the mean number of jobs in the third queue is L_3 := EQ_3 = ρ_3/(1 − ρ_3).

Little’s formula and open Jackson networks

Can use Little’s theorem to get the mean sojourn time of jobs through Q3, w3 = L3/λ3. • To find the stationary mean sojourn time w of jobs through the entire network: • – First use Jackson’s theorem to find the mean number of jobs at each station n, Ln = EQn = ρn/(1 ρn). − N N – Then use Little’s formula w = n=1 Ln/ n=1 Λn - -

Discrete-time Markov chains - Outline

1. The Bernoulli process, its counting process, and the arrival theorem in discrete time

2. The Markov property

3. Transition probability matrices and diagrams

4. Transience, recurrence and periodicity

5. Invariant distribution

6. Birth-death Markov chains in discrete-time

7. Discrete-time queues

8. Examples: modeling passwords, PageRank

9. Simulating a discrete-time Markov chain


Overview of discrete-time Markov chains

We now consider Markov processes in discrete time on countable state spaces, i.e., discrete- • time Markov chains.

We covered continuous time Markov chains first because applications are somewhat simpler. • For example, in a queueing network operating in discrete time, it would be possible, e.g., • that an arrival occurs at one station, while a departure from asecondstationarrivestoa third, all in the same (discrete) time-slot.

• Recall that a stochastic process X is said to be "discrete time" if its time domain is countable, e.g., {X(n) | n ∈ D} for D = Z+ or for D = Z, where, in discrete time, we will typically use n instead of t to denote time.

• In discrete time, the Markov property is defined as in continuous time, relying on the memoryless property of the (discrete) geometric distribution - see http://www.cse.psu.edu/~gik2/teach/Prob-4.pdf

Markovian counting process in discrete-time

• If the random variables B(n) are i.i.d. Bernoulli distributed for, say, n ∈ Z+, then B is said to be a Bernoulli process on Z+.

• Assume that for all time n, P(B(n) = 1) =: q is constant.
• Thus, the duration of time B visits state 1 (respectively, state 0) is geometrically distributed with mean 1/(1 − q) (respectively, mean 1/q).
• The analog to the Poisson counting process can be constructed on Z+:
X(n) = Σ_{m=0}^n B(m).
• The marginal distribution of X follows a binomial distribution, i.e., X(n − 1) ∼ binom(n, q): for k ∈ {0, 1, 2, ..., n},
P(X(n − 1) = k) = (n choose k) q^k (1 − q)^{n−k}.
• Recall the law of small numbers relating the binomial to Poisson distributions: http://www.cse.psu.edu/~gik2/teach/Prob-4.pdf
• The arrival theorem (PASTA in continuous time) holds for the Bernoulli process in discrete time.


One-step transition probabilities

• The one-step transition probabilities of a discrete-time Markov chain Y are defined to be
P(Y(n + 1) = a | Y(n) = b),
where a, b are, of course, taken from the countable state space of Y.

• If these one-step transition probabilities do not depend on time n, then the Markov process Y is time-homogeneous.

• The one-step transition probabilities of a time-homogeneous Markov chain can be graphically depicted in a transition probability diagram (TPD).

The transition probability diagrams of B and X are given below. • Note that the nodes are labeled with elements in the state space and the branches are • labeled with the one-step transition probabilities.

• Graphically unlike TRDs, TPDs may have "self-loops", e.g.,
P(B(n + 1) = 1 | B(n) = 1) = q > 0.
[TPDs of B (states 0, 1) and of the counting process X (states 0, 1, 2, ...) omitted.]

Transition probability matrices (TPMs)

• From the one-step transition probabilities of a discrete-time Markov chain, one can construct its (one-step) transition probability matrix (TPM);
• i.e., the entry in the a-th column and b-th row of the TPM P(n + 1) for Y is
P(Y(n + 1) = a | Y(n) = b) = P_{b,a}(n + 1).
• For example, the Bernoulli process B has state space {0, 1} and TPM
P = [ 1 − q  q
      1 − q  q ].
• That of the counting process X defined above is
P = [ 1 − q  q      0      0      ...
      0      1 − q  q      0      ...
      0      0      1 − q  q      ...
      ...                          ].
• Note that the previous two examples are time-homogeneous.


Example transition probability matrix on {0, 1, 2}

Another example is a discrete-time Markov chain with state space {0, 1, 2} and TPM

P = [ 0.3  0.2  0.5
      0.5  0    0.5
      0.1  0.2  0.7 ].

[TPD omitted: states 0, 1, 2, with each state's outbound arrows labeled by the corresponding row of P.]

TPMs are stochastic matrices

• All TPMs P are row-stochastic matrices, i.e., they satisfy the following two properties:
– All entries are nonnegative and real (the entries are all probabilities).
– The sum of the entries of any row is 1, i.e., P1 = 1 (for the all-ones vector 1) by the law of total probability.
• Clearly, all such matrices have eigenvalue 1 with
• a nonnegative associated left eigenvector, which is of interest in the following.
• So, the PMF of the transition from state k is given by
– the k-th row of the TPM, or
– the labels of the outbound arrows of state k in the TPD.


TPMs and marginal distributions

• Given the TPM P and initial distribution π(0) of the process Y (i.e., P(Y(0) = k) =: π_k(0)), one can easily compute the other marginal distributions of Y.
• For example, by conditioning on Y(0) we can compute the distribution of Y(1) as
π^T(1) = π^T(0) P,
i.e., for all k in the state space S of Y:
π_k(1) := P(Y(1) = k) = Σ_{b ∈ S} P(Y(1) = k, Y(0) = b)
        = Σ_{b ∈ S} P(Y(1) = k | Y(0) = b) P(Y(0) = b)
        = Σ_{b ∈ S} P_{b,k} π_b(0) = (π^T(0) P)_k.
• By induction, we can compute the distribution of Y(n):
π^T(n) = π^T(0) P^n.
• The quantity P^n can be computed using a similarity transform to its diagonal matrix of Jordan blocks.
• General finite-dimensional distributions can be found by sequential conditioning.

Time-inhomogeneous Markov chains

• Note that a time-inhomogeneous discrete-time Markov chain will simply have time-dependent transition probabilities.
• If P(n) is the one-step TPM of Y from time n − 1 to time n, then the distribution of Y(n) is
π^T(n) = π^T(0) P(1) P(2) ··· P(n).


Forward Kolmogorov equations

For a time-inhomogeneous Markov chain Y, the forward Kolmogorov equations in discrete time can be obtained by conditioning on Y(1):
(P(0, n))_{a,b} ≡ P(Y(n) = a | Y(0) = b)
= Σ_k P(Y(n) = a, Y(0) = b, Y(1) = k) / P(Y(0) = b)
= Σ_k [ P(Y(n) = a, Y(0) = b, Y(1) = k) / P(Y(1) = k, Y(0) = b) ] · [ P(Y(1) = k, Y(0) = b) / P(Y(0) = b) ]
= Σ_k P(Y(n) = a | Y(0) = b, Y(1) = k) P(Y(1) = k | Y(0) = b)
= Σ_k P(Y(n) = a | Y(1) = k) P(Y(1) = k | Y(0) = b),
where the last equality is the Markov property.

Kolmogorov equations in matrix form

• The Kolmogorov forward equations in matrix form are
P(0, n) = P(1) P(1, n).
• Similarly, the backward Kolmogorov equations are generated by conditioning on Y(n − 1):
P(0, n) = P(0, n − 1) P(n).
• Note that both are consistent with
P(0, n) ≡ P(1) P(2) ··· P(n),
• which simply reduces to P(0, n) = P^n in the time-homogeneous case.


Invariant distribution for the time-homogeneous case

• For a time-homogeneous Markov chain, we can define an invariant or stationary distribution of its TPM P as any distribution σ satisfying the balance equations in discrete time:
σ^T = σ^T P with Σ_i σ_i = σ^T 1 = 1 and σ_i ≥ 0 ∀i.
• Clearly, if the initial distribution π(0) = σ for a stationary distribution σ, then π(1) = σ as well and, by induction, the marginal distribution of the Markov chain is σ forever,
• i.e., π(n) = σ for all time n ≥ 1 and the Markov chain is stationary.

Invariant distribution - examples

The counting process X with binomially distributed marginal does not have an invariant • distribution as it is transient.

• By inspection, the stationary distribution of the Bernoulli Markov chain is
σ = [1 − q, q]^T.
• The stationary distribution of the previous TPM on {0, 1, 2} is unique because the chain is positive recurrent (only a finite number of states), irreducible, and aperiodic.
• The invariant can be computed by solving
σ^T(I − P) = 0^T, σ^T 1 = 1.
• Note that the first block of equations (three in this example) are equivalent to σ^T = σ^T P and are linearly dependent, i.e., I − P is singular since P is row-stochastic.


Example - computing an invariant distribution (cont)

• We can replace one of the columns of I − P, say column 3, with all 1's (corresponding to 1 = σ^T 1 = σ_0 + σ_1 + σ_2) and replace 0^T with [0 0 1] to obtain three linearly independent equations:
σ^T [ 0.7  −0.2  1
      −0.5  1    1
      −0.1  −0.2 1 ] = [0 0 1]  ⇒  σ^T = [0.20833 0.16667 0.625].
• Suppose that this Markov chain on {0, 1, 2} has an initial distribution that is uniform, i.e., π^T(0) = [1/3 1/3 1/3].
• The distribution at time 2 is
π^T(2) = π^T(0) P^2 = π^T(0) [ 0.24  0.16  0.6
                               0.2   0.2   0.6
                               0.2   0.16  0.64 ] = [0.2133 0.1733 0.6133].
• So, we see that after just two time steps from the uniform initial π(0), the distribution is approximately the invariant, π(2) ≈ σ.

Recurrence, irreducibility and periodicity

• Individual states of a discrete-time Markov chain can be null recurrent, positive recurrent, or transient.

We can call the Markov chain itself “positive recurrent” if all of its states are. • Also, a discrete-time Markov chain can possess the irreducible property. • Unlike continuous-time chains, all discrete-time chains also possess either a periodic or an • aperiodic property through their TPDs (as with the irreducibility property).


Periodicity

• A state b of a time-homogeneous Markov chain Y is periodic if there is a time n > 1 such that
P(Y(m) = b | Y(0) = b) > 0 ⇔ m is a multiple of n,
where n is the period of b.

• That is, given Y(0) = b, Y(m) = b is only possible when m = kn for some integer k.
• A Markov chain is said to be aperiodic if it has no periodic states; otherwise it is said to be periodic.

The examples of discrete-time Markov chains considered previously are all aperiodic. •

Periodicity - example

• This Markov chain is periodic with n = 2 being the period of state 2.
[TPD: states 0, 1, 2; from state 0 and from state 1 the chain moves to state 2 with probability 1; from state 2 it moves to state 0 with probability 0.4 and to state 1 with probability 0.6.]

• One can solve the balance equations of this Markov chain to get the unique solution
σ = [0.2 0.3 0.5]^T,
• which represents the average fraction of time the Markov chain spends in the states in steady state,
• but, e.g., if X(0) = 2, then X(n) = 2 almost surely (i.e., P(X(n) = 2 | X(0) = 2) = 1) for all even n, and X(n) ≠ 2 a.s. for all odd n.


Existence and uniqueness of invariant distribution

• Theorem: A time-homogeneous discrete-time Markov chain has a unique stationary (invariant) and steady-state distribution if and only if it is irreducible, positive recurrent and aperiodic.

• The proof of this basic statement of Doeblin is given in the 1968 book by Feller.
• The unique invariant σ is also the unique steady-state distribution because: if P is the TPM (of an irreducible, positive recurrent and aperiodic Markov chain), then
lim_{n→∞} P^n = 1 σ^T,
the matrix each of whose rows is σ^T.
• Thus, for any initial distribution π(0), lim_{n→∞} π^T(0) P^n = σ^T.

Birth-death Markov chains with finite state-space

• The counting process X defined above is a "pure birth" process on Z+.
[TPD: states 0, 1, ..., K with up-transition probabilities q_0, q_1, ..., q_{K−1}, down-transition probabilities p_1, ..., p_K, and self-loop probabilities 1 − q_0, 1 − q_k − p_k, 1 − p_K.]
• This is the TPD of a birth-death process on a finite state space {0, 1, ..., K} (naturally assuming q_k + p_k ≤ 1 for all k, where p_0 = 0 and q_K = 0).


Birth-death Markov chains with finite state-space (cont)

• The balance equations (σ^T = σ^T P) are
(1 − q_0) σ_0 + p_1 σ_1 = σ_0,
q_{n−1} σ_{n−1} + (1 − q_n − p_n) σ_n + p_{n+1} σ_{n+1} = σ_n for 0 < n < K,
q_{K−1} σ_{K−1} + (1 − p_K) σ_K = σ_K,
which reduce to the detailed balance equations q_{n−1} σ_{n−1} = p_n σ_n, so that
σ_n = σ_0 Π_{j=1}^n (q_{j−1}/p_j).
• The example with q_n ≡ q and p_n = np again yields a truncated Poisson distribution for σ with parameter ρ = q/p.

Birth-death process on an infinite state-space

• The process will be positive recurrent if and only if
R ≡ Σ_{i=1}^∞ Π_{n=1}^i (q_{n−1}/p_n) < ∞,
in which case σ_0 = (1 + R)^{−1} > 0 and, ∀n ≥ 1,
σ_n = σ_0 Π_{j=1}^n (q_{j−1}/p_j) > 0.

The example where pn = p and qn = q also yields a geometric, invariant stationary • distribution with parameter ρ = q/p < 1.

Discrete-time M/M/1 queue

Discrete-time M/M/1 queue

Consider a FIFO queue with a single nonidling server and infinite waiting room in discrete • time.

• Suppose that the job interarrival times are i.i.d. geometrically distributed with mean 1/q.
• The service times of the jobs are also i.i.d. geometric with mean 1/p, where ρ ≡ q/p < 1.
• So, the number of jobs in the queue Q is a birth-death Markov chain, i.e., a discrete-time M/M/1 queue.

• From the invariant geom(ρ) distribution σ, the mean number of jobs in the queue is
L = Σ_{i=0}^∞ i σ_i = ρ/(1 − ρ).
• Thus, by Little's formula in discrete time, the mean sojourn time is L/q = 1/(p − q).

Discrete-time M/M/1 queue - simultaneous events

• However, our model of the discrete-time M/M/1 queue may not be quite right as stated,
• because it's possible that an arrival and a departure occur simultaneously.
• For example, a one-step transition from state k > 0 to state k + 1 is the event that an arrival occurs but a departure does not, i.e., with one-step transition probability q(1 − p).
• Considering such simultaneous events, the one-step TPM of the M/M/1 queue is
P = [ 1 − q     q                   0          0         ...
      p(1 − q)  qp + (1 − q)(1 − p) q(1 − p)   0         ...
      0         p(1 − q)            qp + (1 − q)(1 − p)  q(1 − p) ...
      ...                                                 ].
• Exercise: Find the invariant distribution and mean sojourn time and compare to that of geom(ρ).

• Exercise: Explore the discrete-time M/M/K/K queue. Is it a birth-death Markov chain?


Discrete-time queues with constant service-rate - event ordering

• To show how the order in which events are accounted for may impact a discrete-time queueing model,
• we now repeat a deterministic analysis for a single-server queue, but in discrete time n ∈ Z+ or n ∈ Z.
• Suppose that the server works at a rate of c ∈ Z+ jobs per unit time and that a(n) is the number of jobs that arrive at time n.
• If we assume that, in a given unit of time, service on the queue is performed prior to accounting for arrivals (in that same unit of time), then the number of jobs at time n is
Q(n) = (Q(n − 1) − c)^+ + a(n), where (ξ)^+ ≡ max{0, ξ}.
• Thus, work cannot begin on a job in the time-slot in which it arrives.

Cut-through discrete-time queues with constant service-rate

• Alternatively, if the arrivals are counted before service in a time slot,
Q(n) = (Q(n − 1) − c + a(n))^+;
these dynamics are sometimes called "cut-through" because it's possible that arrivals to empty queues can depart immediately, incurring no delay.
• By induction, under cut-through and with the queue initially empty,
Q(n) = max_{m ≤ n} [A(m, n] − c(n − m)],
where A(m, n] ≡ Σ_{k=m+1}^n a(k) is the number of arrivals in the interval (m, n].
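A small sketch comparing the two recursions on an assumed arrival sequence; the cut-through occupancy is never larger:

```python
# Sketch: serve-before-arrivals vs. cut-through recursions, c = 2 jobs/slot.
c = 2
arrivals = [3, 0, 4, 1, 0, 0, 5, 0]      # assumed a(n)
q_serve, q_cut = 0, 0
trace = []
for a in arrivals:
    q_serve = max(q_serve - c, 0) + a    # Q(n) = (Q(n-1) - c)^+ + a(n)
    q_cut = max(q_cut - c + a, 0)        # Q(n) = (Q(n-1) - c + a(n))^+
    trace.append((q_serve, q_cut))
print(trace)
```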


Markov models of discrete-time queues with constant service-rate

Suppose c, Q(0) Z+ and that a is i.i.d., Z+-valued process such that • ∈ c>Ea(n), i.e., so that the Q queue is stable.

• Given the (stationary) distribution α of a(n), we can compute Q's TPM on Z+.
• For the case of cut-through: for all j, i ∈ Z+,
P(Q(n) = j | Q(n − 1) = i) = P(j = (i − c + a(n))^+) = Σ_{k ≥ 0 : j = (i − c + k)^+} P(a(n) = k).

Discussion - modeling queueing networks in discrete-time

• Again, in a queueing network operating in discrete time, it would be possible, e.g., that an arrival occurs at one station while a departure from a second station arrives to a third, all in the same (discrete) time-slot.

• So, a discrete-time analog of a continuous-time model would not simply be the "jump chain" of transitions of the latter (i.e., with TPM P_{a,b} = −Q_{a,b}/Q_{a,a} for all states a ≠ b, so that ∀a, P_{a,a} = 0, where Q is the TRM of the continuous-time Markov model of the network).

Rather, a much larger number of state transitions would need to be considered to account • for the possibility of such simultaneous events in discrete time.

• Moreover, the order of occurrence of such simultaneous events in a time slot (unit of discrete time) would need to be specified to clarify the dynamics of the system state.

Simulating a sample path of a discrete-time Markov chain x(n)

n = 0
u = rand()
x(0) = F⁻¹(init, u)
while n < max number of time steps do
    u = rand()
    x(n+1) = F⁻¹(x(n), u)
    n = n + 1
end while

• the rand function returns i.i.d. (continuously distributed) uniform[0,1] samples,
• F⁻¹(init, ·) is the inverse CDF of the initial distribution, and
• F⁻¹(x, ·) is the inverse CDF of the PMF that is the xth row of the TPM P,
• e.g., for a uniform initial distribution on state-space {0, 1, 2}: if u < 1/3 then F⁻¹(init, u) = 0, else if u < 2/3 then F⁻¹(init, u) = 1, else F⁻¹(init, u) = 2.
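The pseudocode above can be sketched directly in Python; the TPM, initial distribution and seed below are illustrative assumptions.

```python
# Simulate a discrete-time Markov chain by inverse-CDF sampling of each row.
import random

def inv_cdf(pmf, u):
    """Smallest state whose cumulative probability exceeds u (inverse CDF)."""
    acc = 0.0
    for state, prob in enumerate(pmf):
        acc += prob
        if u < acc:
            return state
    return len(pmf) - 1

P = [[0.5, 0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]
init = [1/3, 1/3, 1/3]                 # uniform initial distribution

rng = random.Random(0)
x = [inv_cdf(init, rng.random())]
for n in range(1000):
    x.append(inv_cdf(P[x[-1]], rng.random()))

assert all(0 <= s <= 2 for s in x)     # sample path stays on the state space
```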

Simulating a sample path of a continuous-time Markov chain x(t)

n = 0
u = rand()
x(0) = F⁻¹(init, u)
t(0) = 0
while t(n) < max simulation time do
    u = rand()
    t(n+1) = t(n) + log(1−u)/Q(x(n), x(n))
    u = rand()
    x(n+1) = F⁻¹(x(n), u)
    n = n + 1
end while

• where t(n) is the nth jump/transition time, and
• F⁻¹(x, ·) is the inverse CDF of the xth row of the jump chain with TPM: ∀i, P_{i,i} = 0; and ∀j ≠ i, P_{i,j} = −Q_{i,j}/Q_{i,i}.
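A sketch of this jump-chain simulation for a two-state chain; the transition-rate matrix Q, horizon and seed are illustrative assumptions. Note log(1−u)/Q(i,i) > 0 since both factors are negative.

```python
# Jump-chain simulation of a CTMC with transition-rate matrix Q.
import math, random

Q = [[-2.0, 2.0],
     [1.0, -1.0]]

rng = random.Random(1)
t, x = [0.0], [0]
while t[-1] < 100.0:
    i = x[-1]
    # Holding time ~ exp(-Q[i][i]).
    u = rng.random()
    t.append(t[-1] + math.log(1 - u) / Q[i][i])
    # Jump chain: P[i][j] = -Q[i][j]/Q[i][i] for j != i, P[i][i] = 0.
    u = rng.random()
    probs = [0.0 if j == i else -Q[i][j] / Q[i][i] for j in range(2)]
    x.append(0 if u < probs[0] else 1)

# Fraction of time in state 0 should approach the invariant value:
# pi Q = 0 gives pi = (1/3, 2/3) for this Q.
time_in_0 = sum(t[k + 1] - t[k] for k in range(len(t) - 1) if x[k] == 0)
frac = time_in_0 / t[-1]
assert abs(frac - 1/3) < 0.12
```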


Continuous-time Markov chain simulation by uniformization

• For any γ > max_j |Q_{j,j}|, instead of the jump chain above, use the (non-jump) TPM
P = I + Q/γ.

• ∀n > 0, the t(n) − t(n−1) are i.i.d. exp(γ) random variables, i.e.,
t(n) = t(n−1) − log(1−u)/γ.
• So, the number of iterations of the while loop over an interval of (continuous) time [0, t] will be ∼ Poisson(γt).
• It follows that the TPM in continuous time is
exp(Qt) = Σ_{n=0}^∞ P^n (γt)^n e^{−γt} / n!.
• Exercise: Verify this by using the definition of P.
• There is an alternative approach called perfect simulation.
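The uniformization identity can be checked numerically for a two-state generator, for which exp(Qt) has the closed form exp(Qt) = Π + e^{−(a+b)t}(I − Π), where each row of Π is the invariant distribution (b, a)/(a+b). The values a, b, t, γ below are illustrative assumptions.

```python
# Check exp(Qt) = sum_n P^n (gamma t)^n e^{-gamma t}/n! with P = I + Q/gamma.
import math

a, b, t = 2.0, 1.0, 0.7
Q = [[-a, a], [b, -b]]
gamma = 3.0                               # any gamma > max_j |Q_jj| works
P = [[(1.0 if i == j else 0.0) + Q[i][j] / gamma for j in range(2)]
     for i in range(2)]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

# Truncated Poisson-weighted sum of powers of P.
S = [[0.0, 0.0], [0.0, 0.0]]
Pn = [[1.0, 0.0], [0.0, 1.0]]             # P^0
for n in range(60):
    w = math.exp(-gamma * t) * (gamma * t) ** n / math.factorial(n)
    S = [[S[i][j] + w * Pn[i][j] for j in range(2)] for i in range(2)]
    Pn = matmul(Pn, P)

# Closed form for this 2x2 generator.
pi = [b / (a + b), a / (a + b)]
expQt = [[pi[j] + math.exp(-(a + b) * t) * ((1.0 if i == j else 0.0) - pi[j])
          for j in range(2)] for i in range(2)]

assert all(abs(S[i][j] - expQt[i][j]) < 1e-9 for i in range(2) for j in range(2))
```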

Example of fitting a discrete-time Markov chain to data

• Consider a known/given corpus of typical passwords which a hacker could use to guess at a password, i.e., a "dictionary attack."

• Each password, an ordered list of alpha-numeric characters, is modeled as a trajectory of a common Markov chain modeling (generating) the given corpus.

In a second-order model, the state of the Markov chain is an ordered pair (bigram) of • characters, e.g., “1a”, “b ”, “dA”, “%2”.

• We can augment the character set to include a symbol, say ε, indicating the termination of the password, i.e., all bigrams of the form "xε" are absorbing: P_{xε,xε} ≡ 1 ∀x.
• Using the corpus, directly count the number of times
– N_{xy} that each bigram xy appears (anywhere in any password),
– N_{xyz} that each trigram xyz appears.

• Define the Markov transition probabilities on bigrams, P_{xy,yz} = N_{xyz}/N_{xy}.

Also, let πxy be the fraction of the corpus’ passwords beginning with the bigram xy. •


Rejecting passwords using a generative model

• Let w(k) be the kth character of password w and l(w) ≥ 2 be the length of w, where w(l(w)+1) ≡ ε.

• Given the transition probabilities P_{xy,yz} learned from a password corpus, the likelihood L(w) of any given password w can be assessed:
L(w) = π_{w(1)w(2)} Π_{k=1}^{l(w)−1} P_{w(k)w(k+1), w(k+1)w(k+2)}.
• From the given corpus of passwords, we can compute the mean and variance of L(w) for passwords of the same length l = l(w): µ(l), σ²(l), respectively.

• A newly suggested password w̃ could be rejected if, e.g., L(w̃) < µ(l(w̃)) − 2σ(l(w̃)) (assuming this threshold is > 0, which depends on the password corpus), i.e., if its likelihood is more than two standard deviations below the mean of known passwords of the same length.

• Additionally, a minimum length for new passwords is typically required.
• For a related problem, see: J. Raghuram, D.J. Miller and G. Kesidis. Unsupervised, low-latency anomaly detection of algorithmically generated domain names by generative probabilistic modeling. NSF USA-Egypt Workshop on Cyber Security, Cairo, May 28, 2013 (Springer JAR 5(4), July 2014).
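The bigram/trigram fitting and the likelihood L(w) can be sketched on a toy corpus; the corpus, the termination marker standing in for ε, and the comparison at the end are all illustrative assumptions.

```python
# Fit the bigram Markov model N3/N2 from a toy corpus and score passwords.
from collections import Counter

corpus = ["password", "pass1234", "letmein", "password1", "passw0rd"]
EPS = "\x00"                               # termination symbol (stand-in for epsilon)

N2, N3, start = Counter(), Counter(), Counter()
for w in corpus:
    w = w + EPS
    start[w[:2]] += 1                      # pi_{xy}: fraction starting with xy
    for k in range(len(w) - 1):
        N2[w[k:k + 2]] += 1
    for k in range(len(w) - 2):
        N3[w[k:k + 3]] += 1

def likelihood(w):
    """pi_{w(1)w(2)} * prod_k P_{w(k)w(k+1), w(k+1)w(k+2)} with P = N3/N2."""
    w = w + EPS
    L = start[w[:2]] / len(corpus)
    for k in range(len(w) - 2):
        if N2[w[k:k + 2]] == 0:
            return 0.0                     # unseen bigram: zero likelihood
        L *= N3[w[k:k + 3]] / N2[w[k:k + 2]]
    return L

# In-corpus strings score higher than an unrelated random-looking string.
assert likelihood("password") > likelihood("zqxv7!kp")
```

In practice, smoothing would be needed so that unseen n-grams do not force the likelihood to zero.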

Web-page ranking via discrete-time Markov chain

• Web search results are prioritized, e.g., pages can be listed in order of the number of other pages which link to them, as in Google's PageRank, i.e., a measure of the "popularity" of the page.

Such measures of popularity are important for setting the price of advertising on commercial • web sites.

• A simple iterative procedure for determining the relative popularity of web pages is as follows.


Inferring relative popularity through page links

• For a population of N pages numerically indexed 1, 2, ..., N, let d_i be the number of different pages which are linked-to by page i, i.e., i's out-degree.

• Define the N × N stochastic matrix P with entries P_{i,j} = 1/d_i if i links to j, otherwise P_{i,j} = 0 (with P_{i,i} = 0 for all i, i.e., a "pure jump" chain).

• Define the popularity/rank π_i ≥ 0 of page i so that:
∀i, π_i = Σ_j π_j P_{j,i} and 1 = Σ_j π_j.
• Note how j contributes to i's popularity, but that contribution is reduced through division by the total out-degree d_j of j.

Exercise: Relate π to the “eigenvector centrality” of the web-page graph. •

Inferring relative popularity through page links (cont)

• In matrix form, the first set of equations is simply π^T = π^T P.
• So, π is the invariant distribution of a discrete-time Markov chain on the web pages with transition probabilities P,

• i.e., a random walk on the graph formed by the N web pages as vertices and the links between them as directed edges, with time corresponding to the number of transitions to other web pages (clicked-on web links).


The stationary distribution as page ranks

• The marginal distribution of the Markov chain at time k, π(k), satisfies the Kolmogorov equations
(π(k))^T = (π(k−1))^T P,

• i.e., π_i(k) is the probability that the random walk is at page i at time k.
• Recall that if P is aperiodic and irreducible then there is a unique stationary/invariant distribution π such that lim_{k→∞} π(k) = π.

Google's PageRank

Google’s PageRank considers a parameter that models how web surfers do not always select • links from web pages but may select links from among their bookmarks or manually type a URL.

• Suppose that a bookmark selection occurs with probability b and that the probability of a specific bookmarked page being selected is b/(N−1).
• To this end, instead of P, an alternative is to use the stochastic matrix
P̃ := (1−b)P + (b/(N−1))(1 − I),
where 1 is the N × N matrix all of whose entries are 1 and I is the identity matrix.
• With 0 < b, the chain P̃ is irreducible and aperiodic, so its invariant distribution is unique.
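A power-iteration sketch of the modified chain P̃ = (1−b)P + (b/(N−1))(1 − I); the small link graph and the value of b are illustrative assumptions.

```python
# Rank pages by the invariant distribution of the bookmark-augmented chain.
N, b = 4, 0.15
links = {0: [1, 2], 1: [2], 2: [0], 3: [0, 1, 2]}   # out-links of each page

P = [[0.0] * N for _ in range(N)]
for i, outs in links.items():
    for j in outs:
        P[i][j] = 1.0 / len(outs)

Pt = [[(1 - b) * P[i][j] + (b / (N - 1)) * (0.0 if i == j else 1.0)
       for j in range(N)] for i in range(N)]

pi = [1.0 / N] * N
for _ in range(200):                      # power iteration: pi^T <- pi^T P~
    pi = [sum(pi[i] * Pt[i][j] for i in range(N)) for j in range(N)]

assert abs(sum(pi) - 1.0) < 1e-9
# Page 3 has no in-links, so it should be ranked lowest.
assert all(pi[3] <= pi[j] for j in range(N))
```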


Optimization issues

• Markov Decision Processes (MDPs)
• Background on constrained optimization
• A network application of game theory

Markov decision processes (MDPs) - References

• D.P. Bertsekas. Dynamic Programming. Prentice Hall, 1987, Vols. I and II.
• M. Puterman. Markov Decision Processes. John Wiley & Sons, 1994.
• C. Cassandras and S. Lafortune. Introduction to Discrete Event Systems. Springer, 2007.
• Recall our previous discussion of
– link-state and distance-vector routing, and
– discrete-time Markov chains.


Example - shortest path on a graph

• Suppose we are planning the construction of a highway from city A to city K.
• Different construction alternatives and their "edge" costs g ≥ 0 between directly connected cities (nodes) are given in the following graph.

The problem is to determine the highway (edge sequence) with the minimum total (additive) • cost (path length).

Recall Bellman's principle of optimality

• If C belongs to an optimal (by edge-additive cost J*) path from A to B, then the sub-paths A to C and C to B are also optimal,
• i.e., any sub-path of an optimal path is optimal (easy proof by contradiction).

Dijkstra’s algorithm uses the predecessor node of the destination (path penultimate node), • and is based on complete link-state (edge-state) info consistently shared among all nodes:

J ∗(A, B)=minJ ∗(A, C)+g(C, B) C is a predecessor of B , C { | } i.e., CandBareadjacentnodesinthegraph(endpointsofthesameedge). The Iterated distributed Bellman-Ford algorithm instead uses the successor node of the path • origin and only nearest-neighbor distance-vector information sharing:

J ∗(A, B)=ming(A, C)+J ∗(C, B) C is a successor of A C { | }


Discrete-time, deterministic scenario

At “time” n, • – gn(xn,un) 0 is the cost, ≥ – xn is the state, and

– un is the control.

State evolves according to • x +1 = fn(xn,un), n 0, 1, 2,...,N 1 . n ∀ ∈ { − }

Given initial state x0,theadditivecostis • N 1 − J0(x0,u0)= gn(xn,un)+gN (xN ), n=0 ! where gN is the terminal cost.

• The objective is to find the control u_0 = {u_n}_{n=0}^{N−1} (N decision variables) that minimizes J_0(x_0, u_0) - i.e., given the initial state x_0, dynamics f and costs g, solve
min_{u_0} J_0(x_0, u_0).

Discrete-time, deterministic scenario - problem variations

• We can, alternatively, maximize an additive total reward J_0 of rewards g_n at each n.
• Or, J_0 = max_{n≥0} g_n as a maximum of signed rewards g_n ∈ R.
• Or, J_0 = min_{n≥0} g_n as a minimum of signed costs.


Discrete-time, deterministic scenario - backward induction

• The cost-to-go from time k < N is
J_k(x_k, u_k) = Σ_{n=k}^{N−1} g_n(x_n, u_n) + g_N(x_N),
which, via the state dynamics, we have written as a function of just x_k and u_k = {u_k, u_{k+1}, ..., u_{N−1}}.

• Applying the optimality principle and state dynamics to minimize J_0,
– we can work backward from time N to find the optimal control u*_{k+1} before u*_k,
– thus finding u*_k = {u*_k, u*_{k+1}, ...}:
∀x, u*_{N−1}(x) = argmin_u J_{N−1}(x, u) = argmin_u [g_{N−1}(x, u) + g_N(f_{N−1}(x, u))];
∀x, k < N−1, u*_k(x) = argmin_u [g_k(x, u) + J_{k+1}(f_k(x, u), u*_{k+1})].

• Note how the optimal control at time k depends on the state x_k and on the optimal controls at later times.

Discrete-time Markov decision processes with state's TPM

• We will also model x as a Markov chain on its state space with transition probability matrix (TPM) P(k, u) which depends on the (non-state-anticipative) control u_k = u at all times k, i.e.,
P_{ij}(k, u) = P(x_{k+1} = j | x_k = i, u_k = u),
and we've dispensed with the recursive update f_k.

• So, at each time k we choose from a (controlled) family of TPMs P(k, ·).
• The marginal distribution π(k) of x_k satisfies
π^T(k+1) = π^T(k) P(k, u_k).

• Given the initial distribution π(0) ∼ x_0, we wish to find the optimal control u*_0 = {u_n}_{n=0}^{N−1} minimizing the expected additive cost:
E_{π(0)} J_0(x_0, u*_0) = min_{u_0} E_{π(0)} [ Σ_{n=0}^{N−1} g_n(x_n, u_n) + g_N(x_N) ].
• Recall that the expectation operator E is linear.


Discrete-time Markov decision processes with state’s TPM (cont)

• Given a state x governed by TPMs P, we can write the principle of optimality for the minimum expected cost-to-go at time k < N:
V_k(π(k)) := min_{u_k} E_{π(k)} J_k(x_k, u_k; {u*_{k+1}})
= min_{u_k} [ E_{π(k)} g_k(x_k, u_k) + E_{π(k)} ( Σ_{n=k+1}^{N−1} g_n(x_n, u*_n) + g_N(x_N) ) ]
= min_{u_k} [ E_{π(k)} g_k(x_k, u_k) + E_{π(k+1)} J_{k+1}(x_{k+1}, u*_{k+1}) ]
= min_{u_k} [ E_{π(k)} g_k(x_k, u_k) + V_{k+1}(π(k+1)) ],
where
π(k+1)^T = π(k)^T P(k, u_k)
depends on u_k.
• Note how the minimizing u*_k will depend on π(k) ∼ x_k and the future optimal controls u*_{k+1}.

Discrete-time Markov decision processes - perturbations model

• As a special case, suppose the cost at time n is g_n(x_n, u_n, w_n) ≥ 0, where
– u is the control,
– w is a (discrete-time) "driving" Markov process (of "perturbations" in the state), so P^(w)(n) is the (uncontrolled) TPM of w at time n, and
– x is the state, which evolves according to the modified recursive update
x_{n+1} = f_n(x_n, u_n, w_n), ∀n ∈ {0, 1, 2, ..., N−1}, i.e.,
P^{(x)}_{ij}(n, u_n) = E_{α(n)} P(f_n(i, u_n, w_n) = j | x_n = i), where w_n ∼ α(n),
= Σ_w P(f_n(i, u_n, w) = j | x_n = i) α_w(n), where α_w(n) := P(w_n = w).
• That is, x is also a Markov process.
• Again, the additive cost is the sum of non-negative components g at each time n:
J_0(x_0, u_0) = Σ_{n=0}^{N−1} g_n(x_n, u_n, w_n) + g_N(x_N).


Discrete-time Markov decision processes - perturbations model (cont)

• Given the initial state x_0, the initial distribution α(0) ∼ w_0, and its TPM P^(w), we wish to find the optimal control u*_0 achieving the expected minimum cost
V_0(x_0, α(0)) := min_{u_0} E_{α(0)} J_0(x_0, u_0).
• So, we can again write the principle of optimality for the expected cost-to-go at time k < N.

Discrete-time Markov decision processes - perturbations model (cont)

For the special case of i.i.d. disturbances w:
• ∀n, P^(w)(n) = I,
• w is stationary, so that there is a distribution α such that ∀k, w_k ∼ α(k) = α (does not depend on time k), and
• so indicating the dependence of V and J on α may be suppressed.


Example - playing chess

• A strategic player plays against an opponent, where the (non-strategic) opponent does not change his actions in accordance with the current state.
• A draw fetches 0 points for both; a win fetches 1 point for the winner and 0 for the loser.
• They play N independent games.
• If the scores are tied after N games, then the players go to sudden death, where they play until one wins a game.

Example - playing chess - Timid and Bold strategies

• The (strategic) player can play "Timid", in which case he draws a game with probability p_d and loses with probability 1 − p_d, i.e., he cannot win playing Timid.
• The player can play "Bold", in which case he wins a game with probability p_w and loses with probability 1 − p_w.
• Consideration of strategy is nontrivial when p_d > p_w > 0.
• Optimal strategy in sudden death? Play Bold (to win!)


Example - playing chess - set-up

• u_k: control for game k, either Timid (0) or Bold (1)
• w_k: outcome of the kth game
– Given Timid play, P(w_k = 0 | u_k = 0) = p_d, P(w_k = −1 | u_k = 0) = 1 − p_d
– Given Bold play, P(w_k = 1 | u_k = 1) = p_w, P(w_k = −1 | u_k = 1) = 1 − p_w
• After k games, the strategic player leads by x_k = w_{k−1} + x_{k−1} wins, with x_0 := 0.
• S_k = {−k, −(k−1), ..., −1, 0, 1, ..., k−1, k}: state space of x_k
• N: time horizon of the optimization

Example - playing chess - reward function to optimize

• Now consider maximization of reward instead of minimization of cost.
• At time N, the probability of winning the whole match is
E J_N(x_N) = E g_N(x_N) = { 0 if x_N < 0; p_w if x_N = 0 (need sudden death); 1 if x_N > 0 }.
• The probability of winning the whole match is accounted for entirely by the terminal reward: for k < N,
E g_k(x_k, u_k, w_k) = 0.


Example - playing chess - optimal strategy

• V_N(x_N) = E g_N(x_N) = ... (see previous slide), and V_{N−1} is computed from V_N by maximizing over the play action of the Nth game.
• So for the Nth game: if trailing by 1 then play to win; else if leading by 1 then play to draw; else if tied (as in sudden death) then play to win; else the play action doesn't matter as the winner has already been determined.
• Similarly, compute V_{N−2} using V_{N−1}, etc., down to V_0.
• Exercise: Show by backwards induction that the optimal strategy is
u*_{N−k}(x) = Bold (1) if −k ≤ x ≤ 0; Timid (0) if 1 ≤ x ≤ k; arbitrary otherwise.

The Linear dynamics and Quadratic cost (LQ) framework

• Assume a perturbed model with linear dynamics f for the state: for state and perturbations x_k, w_k ∈ R^n and control u_k ∈ R^m, there are deterministic matrix sequences A_k ∈ R^{n×n}, B_k ∈ R^{n×m} such that, for k < N,
x_{k+1} = A_k x_k + B_k u_k + w_k.
• The costs g_k are quadratic in the state and control (with weight matrices Q_k, R_k as in the Riccati equation on the next slide), and the cost-to-go at time j,
J_j(x_j) = Σ_{k=j}^N g_k(x_k),
depends on the control u_j = {u_k}_{k=j}^{N−1}.
• When w is a zero-mean sequence of unit variance, one can directly show that the optimal control is linear, u_k(x_k) = L_k x_k, where
L_k = −(B_k^T K_{k+1} B_k + R_k)^{−1} B_k^T K_{k+1} A_k for k < N,
with K_k given by a backward (Riccati) recursion.

Linear dynamics, Quadratic cost (LQ) - time-invariant case

• If A_k = A, B_k = B, R_k = R, Q_k = Q (time-invariant/homogeneous case), then as time k becomes large, K_k converges to the steady-state solution of the algebraic Riccati equation,
K = A^T (K − KB(B^T K B + R)^{−1} B^T K) A + Q.
• So, the LQ-optimal control is u(x) = Lx, where
L = −(B^T K B + R)^{−1} B^T K A.

Optimal stopping problems

• Suppose that in any time-slot, one of the control actions stops the system.
• The decision maker can terminate the system at a certain loss or choose to continue at a certain cost.

The challenge will be when to stop so as to minimize the total/final expected cost. •

• For example, the decision maker possesses an asset that is the subject of sequential offers w_n, and the question is which offer to take.


Optimal stopping example - asset selling problem

• A decision-maker has an asset for which he receives quotes/offers in every time-slot, w_0, ..., w_{N−1} > 0.
• Quotes are independent from slot to slot and identically distributed.
• If an offer is accepted, the proceeds are then invested to earn a fixed rate of interest r > 0.
• The control action u_k for k > 0 is to sell or not to sell in slot k based on the offer w_{k−1}.
• The state x_k is the offer in the previous slot if the asset has not yet been sold, or a flag S if it was sold in a previous slot (a terminating state).

Asset selling problem - rewards

• So, x_k ≠ S means x_k = w_{k−1} > 0.
• The reward at N is
J_N(x_N) = g_N(x_N) = { x_N = w_{N−1} if x_N ≠ S (not previously sold, take the final offer); 0 if x_N = S (previously sold) }.
• The optimal expected reward-to-go at step k < N is
V_k(x_k) = { max{ (1+r)^{N−k} w_{k−1}, E J_{k+1}(w_k) } if x_k ≠ S (x_k = w_{k−1}); 0 if x_k = S }.


Asset selling problem - optimal control is threshold

• Let the expected discounted future reward be
α_k = E J_{k+1}(w_k)/(1+r)^{N−k} = E J_{k+1}(x_{k+1})/(1+r)^{N−k} when x_k ≠ S.
• So, J_k(x_k) = (1+r)^{N−k} max{x_k, α_k}.
• So by backward induction, the optimal (reward-maximizing) control strategy u_k is:
– Accept the offer (x_k = w_{k−1}) if x_k > α_k
– Reject the offer if x_k < α_k
– Act either way otherwise

Asset selling problem - threshold non-increasing in time

• Theorem: α_k is a non-increasing function of k, i.e., α_k ≥ α_{k+1} for all k < N − 1.

Proof: We will show this by backward induction (for x_k ≠ S, i.e., x_k ≥ 0):
• α_{N−1} := E J_N(w)/(1+r) = E w/(1+r), noting the w_k are assumed i.i.d.
• α_{N−2} := E J_{N−1}(w)/(1+r)² = E max{w, α_{N−1}}/(1+r) ≥ E w/(1+r) =: α_{N−1}, i.e., we've established the base case.
• Assume α_k ≤ α_{k−1}, i.e., E J_{k+1}(w)/(1+r)^{N−k} ≤ E J_k(w)/(1+r)^{N−k+1}, for some arbitrary k ≤ N − 1.
• Thus,
α_{k−2} := E J_{k−1}(w)/(1+r)^{N−k+2} = E max{w, α_{k−1}}/(1+r)
≥ E max{w, α_k}/(1+r) (by the inductive assumption)
= E J_k(w)/(1+r)^{N−k+1} =: α_{k−1}.
• Q.E.D.


Asset selling problem - iterative computation of threshold

• Let V_k(x_k) = J_k(x_k)/(1+r)^{N−k} when x_k ≠ S, i.e., when x_k = w_{k−1} > 0, the decision to sell has not been made, and the threshold is still relevant.
• V_N(x_N) = x_N (again, x_N ≠ S), and for k < N,
• since α_k := E V_{k+1}(w)/(1+r),
V_k(x_k) = max{x_k, α_k} = max{w_{k−1}, α_k}.

Asset selling problem - iterative computation of threshold (cont)

• Thus,
α_k = E V_{k+1}(w)/(1+r)
= E max{w, α_{k+1}}/(1+r)
= [ ∫_0^{α_{k+1}} α_{k+1} dF_w(z) + ∫_{α_{k+1}}^∞ z dF_w(z) ] / (1+r),
where F_w is the cumulative distribution function of w.

• Note that the first term is α_{k+1} P(w ≤ α_{k+1}) ≤ α_{k+1} < ∞ and the second term is ≤ Ew < ∞ by assumption.
• So, α_k > 0 is a bounded, monotonically non-increasing sequence, so it must converge.
• As the number of remaining stages N − k grows, the sequence converges to the solution α of
α = [ ∫_0^α α dF_w(z) + ∫_α^∞ z dF_w(z) ] / (1+r)
= [ α P(w ≤ α) + ∫_α^∞ z dF_w(z) ] / (1+r).
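The threshold recursion can be computed in closed form for uniform(0,1) offers, for which E max{w, a} = (1 + a²)/2 when 0 ≤ a ≤ 1; the interest rate r and horizon N are illustrative assumptions.

```python
# Iterate alpha_k = E max{w, alpha_{k+1}} / (1+r) backward for uniform(0,1) offers.
r, N = 0.05, 50                            # illustrative interest rate and horizon

alpha = [0.0] * N
alpha[N - 1] = 0.5 / (1 + r)               # alpha_{N-1} = E w / (1+r)
for k in range(N - 2, -1, -1):
    a = alpha[k + 1]
    alpha[k] = (1 + a * a) / 2 / (1 + r)   # E max{w, a} = (1 + a^2)/2

# Monotone non-increasing in k, and (for many remaining stages) converging to
# the fixed point alpha = (1 + alpha^2) / (2 (1+r)).
assert all(alpha[k] >= alpha[k + 1] for k in range(N - 1))
a = alpha[0]
assert abs(a - (1 + a * a) / (2 * (1 + r))) < 1e-6
```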


Background on constrained optimization and duality

• Consider a primal optimization problem with a set of m inequality constraints: Find
argmin_{x∈D} f_0(x),
where the constrained domain of optimization is
D ≡ {x ∈ R^n | f_i(x) ≤ 0, ∀i ∈ {1, 2, ..., m}}.
• To study the primal problem, we define the corresponding Lagrangian function on R^n × [0, ∞)^m:
L(x, v) ≡ f_0(x) + Σ_{i=1}^m v_i f_i(x),
where, by implication, the vector of Lagrange multipliers (dual variables) is v ∈ [0, ∞)^m, i.e., non-negative v ≥ 0.

Primal constrained optimization with Lagrange multipliers

• Theorem:
min_{x∈R^n} max_{v≥0} L(x, v) = min_{x∈D} f_0(x) ≡ p*.
• Proof: Simply,
max_{v≥0} L(x, v) = { ∞ if x ∉ D; f_0(x) if x ∈ D } (cf. complementary slackness).
• Note that if x ∉ D then ∃i > 0 s.t. f_i(x) > 0 ⇒ the optimal v*_i = ∞.
• So, we can optimize the Lagrangian over x in an unconstrained fashion to find the solution to the constrained primal problem.


Complementary slackness of primal solution

• Define the maximizing values of the Lagrange multipliers,
v*(x) ≡ argmax_{v≥0} L(x, v),
and note that the complementary slackness conditions
v*_i(x) f_i(x) = 0
hold for all x ∈ D and i ∈ {1, 2, ..., m}.
• That is, if there is slackness in the ith constraint, i.e., f_i(x) < 0, then there is no slackness in the constraint of the corresponding Lagrange multiplier, i.e., v*_i(x) = 0.
• Conversely, if f_i(x) = 0, then the optimal value of the Lagrange multiplier v*_i(x) is not relevant to the Lagrangian.

The dual problem

• Now define the dual function of the primal problem:
g(v) = min_{x∈R^n} L(x, v),
i.e., unconstrained optimization w.r.t. the primal variables first.
• Note that g(v) may be infinite for some values of v and that g is always concave.
• Theorem: For all x ∈ D and v ≥ 0,
g(v) ≤ f_0(x).
• Proof: For v ≥ 0 and x ∈ D,
g(v) ≤ L(x, v) ≤ max_{v≥0} L(x, v) = f_0(x),
where the last equality is the bound on L assuming x ∈ D.


The dual problem (cont)

• So, by the previous theorem, if we solve the dual problem, i.e., find
d* ≡ max_{v≥0} g(v),
then we will have obtained a (hopefully good) lower bound on the primal problem, i.e.,
max_{v≥0} g(v) = max_{v≥0} min_{x∈R^n} L(x, v) = d* ≤ p* = min_{x∈R^n} max_{v≥0} L(x, v) = min_{x∈D} f_0(x).

• Under certain conditions in this finite-dimensional setting, in particular when the primal problem is convex and a strictly feasible solution exists, the duality gap is zero:
p* − d* = 0.

The dual problem for a linear program

• If f_0(x) = Σ_{j=1}^n φ_j x_j and all f_i(x) = ξ_i + Σ_{j=1}^n γ_{i,j} x_j are linear functions, then the above primal problem, min_x f_0(x) s.t. f_i(x) ≤ 0 ∀i ≥ 1, is called a Linear Program (LP).
• Exercise: Find an equivalent dual LP. Hint: first show the Lagrangian of the primal problem can be written as
L(x, v) = Σ_{i=1}^m ξ_i v_i + Σ_{j=1}^n x_j ( φ_j + Σ_{i=1}^m v_i γ_{i,j} ).
• LPs can be solved by the simplex algorithm (along feasible-region boundaries) or by interior-point methods.


Iterated subgradient method

• Using duality to find p* and x* = argmin_{x∈D} f_0(x) in this case, consider a slow ascent method used to maximize g,
v_n = v_{n−1} + α_1 ∇_v L(x*(v_{n−1}), v_{n−1}),
and, between steps of the ascent method, a fast descent method used to evaluate g(v_n) by minimizing L(x, v_n),
x_k = x_{k−1} − α_2 ∇_x L(x_{k−1}, v_n) → x*(v_n).
• The process described by such an ascent/descent method is called an iterative subgradient method.

• The step sizes α can be chosen dynamically, e.g., steepest ascent/descent (i.e., themselves the result of an optimization).
• Instead of slow ascent, the descent step can be projected onto the feasible domain D.

KKT conditions

• Consider again a primal optimization problem with a set of m inequality constraints: Find
argmin_{x∈D} f_0(x),
where the constrained domain of optimization is
D ≡ {x ∈ R^n | f_i(x) ≤ 0, ∀i ∈ {1, 2, ..., m}}.
• So the Lagrangian on (x, v) ∈ R^n × (R+)^m is
L(x, v) ≡ f_0(x) + Σ_{i=1}^m v_i f_i(x),
and our objective is to find min_x max_{v≥0} L.
• If f_0 is convex and, ∀i ≥ 1, f_i is linear, then the following Karush-Kuhn-Tucker (KKT) conditions suffice for optimality:
∀j, ∂L/∂x_j = 0, and
∀i, v_i f_i = 0 (complementary slackness).


Example - load balancing in a network of parallel routes

• Consider a total demand of Λ between two network end-systems having R disjoint routes connecting them.
• On route r, the service capacity is c_r and the fraction of the demand applied to it is π_r, where Σ_r π_r = 1 and, ∀r, c_r > π_r Λ (the latter for stability).
• Consider the problem of the routing decisions that minimize the mean number of jobs in the system,
N(π) = Σ_r π_r Λ/(c_r − π_r Λ) = Σ_r ( −1 + c_r/(c_r − π_r Λ) ),
where each summand is taken from an M/M/1 queue and increases with π_r.
• To find the optimal π, we can first try to use a Lagrangian with just one of the inequality constraints, Σ_r π_r ≥ 1:
L(π, v) = N(π) + v(1 − Σ_r π_r).
• Since L is convex in π, there will be zero duality gap, allowing us to minimize over π first.

Example - load balancing (cont)

• By the first-order necessary conditions, ∀r, ∂L/∂π_r = 0, the minimizing
π*_r = c_r/Λ − √(c_r/(Λv)).
• To meet the equality constraint Σ_r π*_r = 1 (maximizing the dual function), the Lagrange multiplier v* satisfies
√(Λv*) = Σ_r √c_r / (−1 + Σ_r c_r/Λ), so that π*_r = c_r/Λ − (√c_r/Λ)(Σ_j c_j − Λ)/Σ_j √c_j,
where
– the first equality requires the system stability condition Σ_r c_r > Λ, and
– stability in each route is achieved, c_r > π*_r Λ (with slackness).
• Note that if the route capacities c_r are highly imbalanced, it's possible that this π*_r < 0 for the routes r with smallest c_r, in which case the constraints π_r ≥ 0 need to be considered in the Lagrangian (exercise); else, if c_r ≈ c_s ∀r, s, then π* ≈ uniform (> 0).
• By Little's theorem, π* also minimizes the mean delay Σ_r π_r (c_r − π_r Λ)^{−1} = Λ^{−1} N(π).
• This model was extended to an end-user game in [Korilis et al. INFOCOM'97].

Example - Max-Min Fair (MMF) allocation: problem set-up and def’n

Suppose a set of N processes require service from a set of M cores (processors). •

• Let δ_{n,m} ∈ {0, 1} indicate whether process n ∈ N prefers core m ∈ M.
• Let φ_n be the weight or priority of process n ∈ N.
• Let s_m be the capacity of core m.

• Finally, let x_{n,m} be the fraction of core m allocated to process n, where
δ_{n,m} = 0 ⇒ x_{n,m} = 0.
• The normalized total allocation to process n is
F_n := ( Σ_{m∈M} x_{n,m} δ_{n,m} s_m ) / φ_n.
• x is an MMF allocation if the following condition holds:
If x_{n,m} > 0, δ_{k,m} = 1 and F_k > F_n, then x_{k,m} = 0.
• That is, at an MMF allocation, all processes receiving a positive allocation from any given core must have the same normalized total allocations: x_{n,m}, x_{k,m} > 0 ⇒ F_n = F_k.

Example - Max-Min Fair (MMF) allocation by constrained convex opt

• Consider the Lagrangian with Lagrange multipliers v ≥ 0:
L = Σ_{n∈N} φ_n g(F_n) + Σ_{m∈M} v_m ( Σ_{n∈N} x_{n,m} − 1 ) + Σ_{n,m} v_{n,m} (−x_{n,m}),
where g is strictly convex and g′ strictly increasing (e.g., g(F) = −log(F)).
• The KKT conditions for optimality require that if δ_{n,m} > 0 then
s_m g′(F_n) + v_m − v_{n,m} = 0 ⇒ F_n = (g′)^{−1}( (v_{n,m} − v_m)/s_m ),
where we note that g′ strictly increasing ⇒ (g′)^{−1} strictly increasing.
• If x_{n,m} > 0, then v_{n,m} = 0 by complementary slackness.
• Additionally, if δ_{k,m} = 1 then
F_k = (g′)^{−1}( (v_{k,m} − v_m)/s_m ) ≥ (g′)^{−1}( −v_m/s_m ) = F_n,
which is the definition of an MMF allocation.
• So, the solution of the above convex optimization is an MMF allocation.

An “efficient” game among routed flows in a network

Reference: F. Kelly. Charging and rate control for elastic traffic. European Trans. • Telecommun. 8:33-37, 1997.

Consider R users sharing a network consisting of m links (hopefully without cycles) each • connecting a pair of nodes.

We identify a single fixed route r with each user, where, again, a route is simply a group • of connected links.

Thus, the user associated with each route could, in reality, be an aggregation of many • individual flows of smaller users.

Each link l has a capacity of cl bits per second and each user r transmits at xr bits per • second.

• Link l charges κ_l X dollars per second to a user transmitting X bits per second over it.

Noncooperative network-game formulation

Suppose that user r derives a certain benefit from transmission of xr bits per second on • route r.

The value of this benefit can be quantified as Ur(xr) dollars per second. •

• A user utility function U_r is often assumed to have the following properties: U_r(0) = 0, U_r is nondecreasing, and, for elastic traffic, U_r is concave.

The concavity property is sometimes called a principle of diminishing returns or diminishing • marginal utility.


Noncooperative network-game formulation (cont)

Note that user r has net benefit (net utility) •

Ur(xr) xr κ . − l l r !∈ Suppose that, as with the loss networks, the network wishes toselectitspricesκ so as to • optimize the total benefit derived by the users, i.e., the network wishes to maximize “social welfare,” for example,

R f0(x) Ur(xr), − ≡ r=1 ! subject to the link capacity constraints f (x)=(Ax) c 0 for 1 l m. l l − l ≤ ≤ ≤ We can therefore cast this problem in the primal form using theLagrangian, • m L(x,v) f0(x)+ vifi(x). ≡ i=1 !

Dual problem formulation

• Since all of the individual utilities U_r are assumed concave functions on R, f_0 is convex on R^n.

Since the inequality constraints fi are all linear, the conditions for zero duality gap are • satisfied.

• So, we will now formulate a distributed solution to the dual problem in order to solve the primal problem.

• First note that, because of convexity, a necessary and sufficient condition to minimize the Lagrangian L(x, v) over x (to evaluate the dual function g) is
∇_x L(x*(v), v) = 0.


Solving the dual problem

• For the problem under consideration,
∂L(x, v)/∂x_r = −U′_r(x_r) + Σ_{l∈r} v_l = −U′_r(x_r) + (A^T v)_r.
• Therefore, for all r,
x*_r(v) = (U′_r)^{−1}( Σ_{l∈r} v_l ) = (U′_r)^{−1}( (A^T v)_r ),
where the right-hand side is made unambiguous by the above assumptions on U_r.

Solving the dual problem - ascent-descent framework

• Assume that, at any given time, user r will act (select x_r) so as to maximize their net benefit, i.e., select
argmax_{x≥0} [ U_r(x) − x Σ_{l∈r} κ_l ] = (U′_r)^{−1}( Σ_{l∈r} κ_l ) =: y_r,
where this quantity is simply x*_r(κ).
• That is, the prices κ correspond to the Lagrange multipliers v.
• So, the dual function is
g(κ) = L(x*(κ), κ),
i.e., for fixed link costs κ, the decentralized actions of greedy users minimize the Lagrangian and, thereby, evaluate the dual function.

• So, at fixed prices, the noncooperative game played by the users is efficient in that social welfare −f_0 is maximized at their Nash equilibrium.

A Nash equilibrium is a set of play-actions x∗ where no single user can benefit from • unilateral defection.


Solving the dual problem - ascent-descent framework (cont)

• Following the ascent-descent framework of the dual algorithm, suppose that the network slowly modifies its link prices to maximize g(κ), where by "slowly" we mean that the greedy users are able to react to a new set of link prices well before they change again.
• To apply the ascent method to modify the link prices, we need to evaluate the gradient of g to obtain the ascent direction.
• Since
∂g(κ)/∂κ_l = [(Ax*(κ))_l − c_l] + Σ_r [ −U′_r(x*_r(κ)) + Σ_{l′∈r} κ_{l′} ] ∂x*_r(κ)/∂κ_l
= (Ax*(κ))_l − c_l,
for each link l the ascent rule for link prices becomes
(κ_l)_n = (κ_l)_{n−1} + α_1 ( (Ax*(κ_{n−1}))_l − c_l ),
or, in vector form,
κ_n = κ_{n−1} + α_1 (Ax*(κ_{n−1}) − c).
• Note that these link price updates depend only on "local" information such as the link's capacity and price and the link demand, (Ax*(κ_{n−1}))_l, where the latter can be empirically evaluated.

Solving the dual problem - ascent-descent framework (cont)

• Suppose that we initially begin with very high prices κ_0, so that the demands x*(κ_0) are very small.
• The action of the previous link-price updates will be to lower prices and, correspondingly, increase demand.
• The prices will try to converge to a point κ*, where supply c equals demand Ax*(κ*).
