
TOPIC 0

Introduction

1 Review

Course: Math 535 (http://www.math.wisc.edu/~roch/mmids/) - Mathematical Methods in Data Science (MMiDS) Author: Sebastien Roch (http://www.math.wisc.edu/~roch/), Department of Mathematics, University of Wisconsin-Madison Updated: Sep 21, 2020 Copyright: © 2020 Sebastien Roch

We first review a few basic concepts.

1.1 Vectors and norms

For a vector
$$\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{bmatrix} \in \mathbb{R}^d,$$
the Euclidean norm of $\mathbf{x}$ is defined as
$$\|\mathbf{x}\|_2 = \sqrt{\sum_{i=1}^d x_i^2} = \sqrt{\mathbf{x}^T \mathbf{x}} = \sqrt{\langle \mathbf{x}, \mathbf{x} \rangle},$$

where $\mathbf{x}^T$ denotes the transpose of $\mathbf{x}$ (seen as a single-column matrix) and
$$\langle \mathbf{u}, \mathbf{v} \rangle = \sum_{i=1}^d u_i v_i$$
is the inner product (https://en.wikipedia.org/wiki/Inner_product_space) of $\mathbf{u}$ and $\mathbf{v}$. This is also known as the 2-norm. More generally, for $p \geq 1$, the $p$-norm (https://en.wikipedia.org/wiki/Lp_space#The_p-norm_in_countably_infinite_dimensions_and_ℓ_p_spaces) of $\mathbf{x}$ is given by
$$\|\mathbf{x}\|_p = \left( \sum_{i=1}^d |x_i|^p \right)^{1/p}.$$

Here (https://commons.wikimedia.org/wiki/File:Lp_space_animation.gif#/media/File:Lp_space_animation.gif) is a nice visualization of the unit ball, that is, the set $\{\mathbf{x} : \|\mathbf{x}\|_p \leq 1\}$, under varying $p$.
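As a quick illustration (a sketch added here, not part of the original notes), the norm function from Julia's LinearAlgebra package, which is loaded in the Numerical Corner below, accepts $p$ as an optional second argument:

    using LinearAlgebra
    x = [1., -2., 3.]
    for p in (1, 2, Inf)
        # norm(x, p) computes the p-norm; p = Inf gives the max-norm
        println("p = $p: ", norm(x, p))
    end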

There exist many more norms. Formally:

Definition (Norm): A norm is a function $\ell$ from $\mathbb{R}^d$ to $\mathbb{R}_+$ that satisfies, for all $a \in \mathbb{R}$, $\mathbf{u}, \mathbf{v} \in \mathbb{R}^d$:

(Homogeneity): $\ell(a\mathbf{u}) = |a|\,\ell(\mathbf{u})$
(Triangle inequality): $\ell(\mathbf{u} + \mathbf{v}) \leq \ell(\mathbf{u}) + \ell(\mathbf{v})$
(Point-separating): $\ell(\mathbf{u}) = 0$ implies $\mathbf{u} = \mathbf{0}$. ⊲

The triangle inequality for the 2-norm follows (https://en.wikipedia.org/wiki/Cauchy–Schwarz_inequality#Analysis) from the Cauchy–Schwarz inequality (https://en.wikipedia.org/wiki/Cauchy–Schwarz_inequality).

Theorem (Cauchy–Schwarz): For all $\mathbf{u}, \mathbf{v} \in \mathbb{R}^d$,
$$|\langle \mathbf{u}, \mathbf{v} \rangle| \leq \|\mathbf{u}\|_2 \|\mathbf{v}\|_2.$$

The Euclidean distance (https://en.wikipedia.org/wiki/Euclidean_distance) between two vectors $\mathbf{u}$ and $\mathbf{v}$ in $\mathbb{R}^d$ is the 2-norm of their difference
$$d(\mathbf{u}, \mathbf{v}) = \|\mathbf{u} - \mathbf{v}\|_2.$$
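As a small numeric sanity check (a sketch added here, not a proof and not part of the original notes), one can draw random vectors and verify the Cauchy–Schwarz inequality and compute a Euclidean distance:

    using LinearAlgebra
    u, v = randn(5), randn(5)            # two random vectors in R^5
    abs(dot(u, v)) <= norm(u) * norm(v)  # Cauchy–Schwarz: evaluates to true
    norm(u - v)                          # the Euclidean distance d(u, v)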

Throughout we use the notation $\|\mathbf{x}\| = \|\mathbf{x}\|_2$ to indicate the 2-norm of $\mathbf{x}$ unless specified otherwise.

We will often work with collections of $n$ vectors $\mathbf{x}_1, \ldots, \mathbf{x}_n$ in $\mathbb{R}^d$ and it will be convenient to stack them up into a matrix
$$X = \begin{bmatrix} \mathbf{x}_1^T \\ \mathbf{x}_2^T \\ \vdots \\ \mathbf{x}_n^T \end{bmatrix} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1d} \\ x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nd} \end{bmatrix},$$
where $^T$ indicates the transpose (https://en.wikipedia.org/wiki/Transpose):

(Source (https://commons.wikimedia.org/wiki/File:Matrix_transpose.gif))

NUMERICAL CORNER In Julia, a vector can be obtained in different ways. The following method gives a row vector as a two-dimensional array.

In [1]: # Julia version: 1.5.1
        using LinearAlgebra, Statistics, Plots

In [2]: t = [1. 3. 5.]

Out[2]: 1×3 Array{Float64,2}: 1.0 3.0 5.0

To turn it into a one-dimensional array, use vec (https://docs.julialang.org/en/v1/base/arrays/#Base.vec).

In [3]: vec(t)

Out[3]: 3-element Array{Float64,1}:
 1.0
 3.0
 5.0

To construct a one-dimensional array directly, use commas to separate the entries.

In [4]: u = [1., 3., 5., 7.]

Out[4]: 4-element Array{Float64,1}: 1.0 3.0 5.0 7.0

To obtain the norm of a vector, we can use the function norm (https://docs.julialang.org/en/v1/stdlib/LinearAlgebra/#LinearAlgebra.norm) (which requires the LinearAlgebra package):

In [5]: norm(u)

Out[5]: 9.16515138991168

which we can check "by hand"

In [6]: sqrt(sum(u.^2))

Out[6]: 9.16515138991168

The . above is called broadcasting (https://docs.julialang.org/en/v1.2/manual/mathematical-operations/#man-dot-operators-1). It applies the operator following it (in this case taking a square) element-wise.

Exercise: Compute the inner product of 푢 = (1, 2, 3, 4) and 푣 = (5, 4, 3, 2) without using the function dot (https://docs.julialang.org/en/v1/stdlib/LinearAlgebra/#LinearAlgebra.dot). Hint: The product of two real numbers 푎 and 푏 is a * b .

In [7]: # Try it!
        u = [1., 2., 3., 4.];
        # EDIT THIS LINE: define v
        # EDIT THIS LINE: compute the inner product between u and v
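One possible solution sketch (added here, not part of the original notebook; skip it if you want to try the exercise first), using broadcasting and sum:

    u = [1., 2., 3., 4.]
    v = [5., 4., 3., 2.]
    sum(u .* v)   # element-wise products summed: 1*5 + 2*4 + 3*3 + 4*2 = 30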

To create a matrix out of two vectors, we use the function hcat (https://docs.julialang.org/en/v1.2/base/arrays/#Base.hcat) and transpose.

In [8]: u = [1., 3., 5., 7.];
        v = [2., 4., 6., 8.];
        X = hcat(u,v)'

Out[8]: 2×4 Adjoint{Float64,Array{Float64,2}}:
 1.0  3.0  5.0  7.0
 2.0  4.0  6.0  8.0

With more than two vectors, we can use the reduce (https://docs.julialang.org/en/v1/base/collections/#Base.reduce-Tuple{Any,Any}) function.

In [9]: u = [1., 3., 5., 7.];
        v = [2., 4., 6., 8.];
        w = [9., 8., 7., 6.];
        X = reduce(hcat, [u, v, w])'

Out[9]: 3×4 Adjoint{Float64,Array{Float64,2}}:
 1.0  3.0  5.0  7.0
 2.0  4.0  6.0  8.0
 9.0  8.0  7.0  6.0
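As a small follow-up sketch (added, not in the original notes), the 2-norm of each stacked vector, that is of each row of X from the cell above, can be computed with a comprehension (norm comes from the LinearAlgebra package loaded earlier):

    row_norms = [norm(X[i, :]) for i in 1:size(X, 1)]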

1.2 Multivariable calculus

1.2.1 Limits and continuity

Throughout this section, we use the Euclidean norm $\|\mathbf{x}\| = \sqrt{\sum_{i=1}^d x_i^2}$ for $\mathbf{x} = (x_1, \ldots, x_d)^T \in \mathbb{R}^d$.

The open $r$-ball around $\mathbf{x} \in \mathbb{R}^d$ is the set of points within Euclidean distance $r$ of $\mathbf{x}$, that is,
$$B_r(\mathbf{x}) = \{\mathbf{y} \in \mathbb{R}^d : \|\mathbf{y} - \mathbf{x}\| < r\}.$$

A point $\mathbf{x} \in \mathbb{R}^d$ is a limit point (or accumulation point) of a set $A \subseteq \mathbb{R}^d$ if every open ball around $\mathbf{x}$ contains an element $\mathbf{a}$ of $A$ such that $\mathbf{a} \neq \mathbf{x}$. A set $A$ is closed if every limit point of $A$ belongs to $A$. (Source (https://www.math212.com/2017/12/limit-point-closure.html))

A point $\mathbf{x} \in \mathbb{R}^d$ is an interior point of a set $A \subseteq \mathbb{R}^d$ if there exists an $r > 0$ such that $B_r(\mathbf{x}) \subseteq A$. A set $A$ is open if it consists entirely of interior points.

(Source (https://commons.wikimedia.org/wiki/File:Open_set_-_example.png))

A set $A \subseteq \mathbb{R}^d$ is bounded if there exists an $r > 0$ such that $A \subseteq B_r(\mathbf{0})$, where $\mathbf{0} = (0, \ldots, 0)^T$.

Definition (Limits of a Function): Let $f : D \to \mathbb{R}$ be a real-valued function on $D \subseteq \mathbb{R}^d$. Then $f$ is said to have a limit $L \in \mathbb{R}$ as $\mathbf{x}$ approaches $\mathbf{a}$ if: for any $\epsilon > 0$, there exists a $\delta > 0$ such that $|f(\mathbf{x}) - L| < \epsilon$ for all $\mathbf{x} \in D \cap B_\delta(\mathbf{a}) \setminus \{\mathbf{a}\}$. This is written as $\lim_{\mathbf{x} \to \mathbf{a}} f(\mathbf{x}) = L$. ⊲

Note that we explicitly exclude 퐚 itself from having to satisfy the condition |푓(퐱) − 퐿| < 휖. In particular, we may have 푓(퐚) ≠ 퐿. We also do not restrict 퐚 to be in 퐷.

Definition (Continuity): Let $f : D \to \mathbb{R}$ be a real-valued function on $D \subseteq \mathbb{R}^d$. Then $f$ is said to be continuous at $\mathbf{a} \in D$ if $\lim_{\mathbf{x} \to \mathbf{a}} f(\mathbf{x}) = f(\mathbf{a})$. ⊲ (Source (https://commons.wikimedia.org/wiki/File:Example_of_continuous_function.svg))

We will not prove the following fundamental analysis result, which will be useful below. See e.g. Wikipedia (https://en.wikipedia.org/wiki/Extreme_value_theorem). Suppose $f : D \to \mathbb{R}$ is defined on a set $D \subseteq \mathbb{R}^d$. We say that $f$ attains a maximum value $M$ at $\mathbf{z}^*$ if $f(\mathbf{z}^*) = M$ and $M \geq f(\mathbf{x})$ for all $\mathbf{x} \in D$. Similarly, we say that $f$ attains a minimum value $m$ at $\mathbf{z}^*$ if $f(\mathbf{z}^*) = m$ and $m \leq f(\mathbf{x})$ for all $\mathbf{x} \in D$.

Theorem (Extreme Value): Let $f : D \to \mathbb{R}$ be a real-valued, continuous function on a nonempty, closed, bounded set $D \subseteq \mathbb{R}^d$. Then $f$ attains a maximum and a minimum on $D$.

1.2.2 Derivatives

We begin by reviewing the single-variable case. Recall that the derivative of a function of a real variable is the rate of change of the function with respect to the change in the variable. Formally:

Definition (Derivative): Let $f : D \to \mathbb{R}$ where $D \subseteq \mathbb{R}$ and let $x_0 \in D$ be an interior point of $D$. The derivative of $f$ at $x_0$ is
$$f'(x_0) = \frac{\mathrm{d} f(x_0)}{\mathrm{d} x} = \lim_{h \to 0} \frac{f(x_0 + h) - f(x_0)}{h}$$
provided the limit exists. ⊲

(Source (https://commons.wikimedia.org/wiki/File:Tangent_to_a_curve.svg))
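As a quick numerical sketch of this definition (added here; the function and step size below are arbitrary illustrative choices, not from the original notes), the difference quotient with a small h approximates the derivative:

    f(x) = sin(x)
    x0, h = 1.0, 1e-6
    (f(x0 + h) - f(x0)) / h   # ≈ cos(1.0) ≈ 0.5403, the exact derivative of sin at 1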

Exercise: Let $f$ and $g$ have derivatives at $x$ and let $\alpha$ and $\beta$ be constants. Show that $[\alpha f(x) + \beta g(x)]' = \alpha f'(x) + \beta g'(x)$.

The following lemma encapsulates a key insight about the derivative of 푓 at 푥0 : it tells us where to find smaller values.

Lemma (Descent Direction): Let $f : D \to \mathbb{R}$ with $D \subseteq \mathbb{R}$ and let $x_0 \in D$ be an interior point of $D$ where $f'(x_0)$ exists. If $f'(x_0) > 0$, then there is an open ball $B_\delta(x_0) \subseteq D$ around $x_0$ such that for each $x$ in $B_\delta(x_0)$:

(a) $f(x) > f(x_0)$ if $x > x_0$,
(b) $f(x) < f(x_0)$ if $x < x_0$.

If instead $f'(x_0) < 0$, the opposite holds.

Proof idea: Follows from the definition of the derivative by taking $\epsilon$ small enough that $f'(x_0) - \epsilon > 0$.

Proof: Take $\epsilon = f'(x_0)/2$. By definition of the derivative, there is $\delta > 0$ such that
$$f'(x_0) - \frac{f(x_0 + h) - f(x_0)}{h} < \epsilon$$
for all $0 < h < \delta$. Rearranging gives
$$f(x_0 + h) > f(x_0) + [f'(x_0) - \epsilon]\, h > f(x_0)$$
by our choice of $\epsilon$. The other direction is similar. ◻

For functions of several variables, we have the following generalization. As before, we let $\mathbf{e}_i \in \mathbb{R}^d$ be the $i$-th standard basis vector.

Definition (Partial Derivative): Let $f : D \to \mathbb{R}$ where $D \subseteq \mathbb{R}^d$ and let $\mathbf{x}_0 \in D$ be an interior point of $D$. The partial derivative of $f$ at $\mathbf{x}_0$ with respect to $x_i$ is
$$\frac{\partial f(\mathbf{x}_0)}{\partial x_i} = \lim_{h \to 0} \frac{f(\mathbf{x}_0 + h \mathbf{e}_i) - f(\mathbf{x}_0)}{h}$$

provided the limit exists. If $\frac{\partial f(\mathbf{x}_0)}{\partial x_i}$ exists and is continuous in an open ball around $\mathbf{x}_0$ for all $i$, then we say that $f$ is continuously differentiable at $\mathbf{x}_0$. ⊲
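The same difference-quotient idea extends directly to partial derivatives. Here is a small sketch (added; the example function and step size are arbitrary choices, not from the original notes) that approximates all partial derivatives of a function of several variables:

    # approximate the gradient of f at x0 by difference quotients along e_1, ..., e_d
    function grad_fd(f, x0, h)
        d = length(x0)
        g = zeros(d)
        for i in 1:d
            e = zeros(d); e[i] = 1.0   # i-th standard basis vector
            g[i] = (f(x0 + h*e) - f(x0)) / h
        end
        return g
    end
    f(x) = x[1]^2 + 3*x[1]*x[2]        # illustrative function
    grad_fd(f, [1.0, 2.0], 1e-6)       # ≈ [2*1 + 3*2, 3*1] = [8.0, 3.0]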

Definition (Jacobian): Let $\mathbf{f} = (f_1, \ldots, f_m) : D \to \mathbb{R}^m$ where $D \subseteq \mathbb{R}^d$ and let $\mathbf{x}_0 \in D$ be an interior point of $D$ where $\frac{\partial f_j(\mathbf{x}_0)}{\partial x_i}$ exists for all $i, j$. The Jacobian of $\mathbf{f}$ at $\mathbf{x}_0$ is the $m \times d$ matrix
$$\mathbf{J}_{\mathbf{f}}(\mathbf{x}_0) = \begin{pmatrix} \frac{\partial f_1(\mathbf{x}_0)}{\partial x_1} & \cdots & \frac{\partial f_1(\mathbf{x}_0)}{\partial x_d} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m(\mathbf{x}_0)}{\partial x_1} & \cdots & \frac{\partial f_m(\mathbf{x}_0)}{\partial x_d} \end{pmatrix}.$$

For a real-valued function $f : D \to \mathbb{R}$, the Jacobian reduces to the row vector
$$\mathbf{J}_f(\mathbf{x}_0) = \nabla f(\mathbf{x}_0)^T$$
where the vector
$$\nabla f(\mathbf{x}_0) = \left( \frac{\partial f(\mathbf{x}_0)}{\partial x_1}, \ldots, \frac{\partial f(\mathbf{x}_0)}{\partial x_d} \right)^T$$
is the gradient of $f$ at $\mathbf{x}_0$. ⊲

Example: Consider the affine function
$$f(\mathbf{x}) = \mathbf{q}^T \mathbf{x} + r$$

where $\mathbf{x} = (x_1, \ldots, x_d)^T$, $\mathbf{q} = (q_1, \ldots, q_d)^T \in \mathbb{R}^d$. The partial derivatives of the linear term are given by
$$\frac{\partial}{\partial x_i}\left[\mathbf{q}^T \mathbf{x}\right] = \frac{\partial}{\partial x_i}\left[\sum_{j=1}^d q_j x_j\right] = q_i.$$

So the gradient of $f$ is $\nabla f(\mathbf{x}) = \mathbf{q}$. ⊲

Example: Consider the quadratic function
$$f(\mathbf{x}) = \frac{1}{2} \mathbf{x}^T P \mathbf{x} + \mathbf{q}^T \mathbf{x} + r,$$

where $\mathbf{x} = (x_1, \ldots, x_d)^T$, $\mathbf{q} = (q_1, \ldots, q_d)^T \in \mathbb{R}^d$ and $P \in \mathbb{R}^{d \times d}$. The partial derivatives of the quadratic term are given by
$$\begin{aligned}
\frac{\partial}{\partial x_i}\left[\mathbf{x}^T P \mathbf{x}\right] &= \frac{\partial}{\partial x_i}\left[\sum_{j,k=1}^d P_{jk} x_j x_k\right] \\
&= \frac{\partial}{\partial x_i}\left[P_{ii} x_i^2 + \sum_{j=1, j \neq i}^d P_{ji} x_j x_i + \sum_{k=1, k \neq i}^d P_{ik} x_i x_k\right] \\
&= 2 P_{ii} x_i + \sum_{j=1, j \neq i}^d P_{ji} x_j + \sum_{k=1, k \neq i}^d P_{ik} x_k \\
&= \sum_{j=1}^d [P^T]_{ij} x_j + \sum_{k=1}^d [P]_{ik} x_k.
\end{aligned}$$

So the gradient of $f$ is
$$\nabla f(\mathbf{x}) = \frac{1}{2}\left[P + P^T\right] \mathbf{x} + \mathbf{q}. \qquad ⊲$$
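As a numeric sanity check of this formula (a sketch added here; the matrix, vector and point below are arbitrary illustrative choices, not from the original notes), we can compare a finite-difference gradient with $\frac{1}{2}[P + P^T]\mathbf{x} + \mathbf{q}$:

    using LinearAlgebra
    P = [2.0 1.0; 0.0 3.0]; q = [1.0, -1.0]; r = 5.0
    f(x) = 0.5 * x' * P * x + q' * x + r
    x0, h = [1.0, 2.0], 1e-6
    fd = [(f(x0 + h * Matrix(1.0I, 2, 2)[:, i]) - f(x0)) / h for i in 1:2]
    exact = 0.5 * (P + P') * x0 + q    # the two should agree up to O(h)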

1.2.3 Optimization

Optimization problems play an important role in data science. Here we look at unconstrained optimization problems of the form
$$\min_{\mathbf{x} \in \mathbb{R}^d} f(\mathbf{x})$$

where $f : \mathbb{R}^d \to \mathbb{R}$. Ideally, we would like to find a global minimizer to the optimization problem above.

Definition (Global Minimizer): Let $f : \mathbb{R}^d \to \mathbb{R}$. The point $\mathbf{x}^* \in \mathbb{R}^d$ is a global minimizer of $f$ over $\mathbb{R}^d$ if
$$f(\mathbf{x}) \geq f(\mathbf{x}^*), \quad \forall \mathbf{x} \in \mathbb{R}^d.$$

Example: The function $f(x) = x^2$ over $\mathbb{R}$ has a global minimizer at $x^* = 0$. Indeed, we clearly have $f(x) \geq 0$ for all $x$ while $f(0) = 0$.

In [10]: f(x) = x^2
         x = LinRange(-2,2,100)
         y = f.(x)
         plot(x, y, lw=2, legend=false)

Out[10]:

The function $f(x) = e^x$ over $\mathbb{R}$ does not have a global minimizer. Indeed, $f(x) > 0$ for all $x$, but no $x$ achieves the value $0$; and, for any $m > 0$, there is $x$ small enough such that $f(x) < m$.

In [11]: f(x) = exp(x)
         x = LinRange(-2,2,100)
         y = f.(x)
         plot(x, y, lw=2, legend=false, ylim=(0,5))

Out[11]:

The function $f(x) = (x+1)^2 (x-1)^2$ over $\mathbb{R}$ has two global minimizers, at $x^* = -1$ and $x^{**} = 1$. Indeed, $f(x) \geq 0$ and $f(x) = 0$ if and only if $x = x^*$ or $x = x^{**}$.

In [12]: f(x) = (x+1)^2*(x-1)^2
         x = LinRange(-2,2,100)
         y = f.(x)
         plot(x, y, lw=2, legend=false, ylim=(0,5))

Out[12]:

In general, finding a global minimizer and certifying that one has been found can be difficult unless some special structure is present. Therefore weaker notions of solution have been introduced.

Definition (Local Minimizer): Let $f : \mathbb{R}^d \to \mathbb{R}$. The point $\mathbf{x}^* \in \mathbb{R}^d$ is a local minimizer of $f$ over $\mathbb{R}^d$ if there is $\delta > 0$ such that
$$f(\mathbf{x}) \geq f(\mathbf{x}^*), \quad \forall \mathbf{x} \in B_\delta(\mathbf{x}^*) \setminus \{\mathbf{x}^*\}.$$

If the inequality is strict, we say that $\mathbf{x}^*$ is a strict local minimizer. ⊲

In words, $\mathbf{x}^*$ is a local minimizer if there is an open ball around $\mathbf{x}^*$ where it attains the minimum value. The difference between global and local minimizers is illustrated in the next figure.

(Source (https://commons.wikimedia.org/wiki/File:Extrema_example_original.svg))

Local minimizers can be characterized in terms of the gradient, at least in terms of a necessary condition. We will prove this result later in the course.

Theorem (First-Order Necessary Condition): Let $f : \mathbb{R}^d \to \mathbb{R}$ be continuously differentiable on $\mathbb{R}^d$. If $\mathbf{x}_0$ is a local minimizer, then $\nabla f(\mathbf{x}_0) = \mathbf{0}$.
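As a small numeric illustration (an added sketch, not part of the original notes), we can check the condition on the earlier example $f(x) = (x+1)^2(x-1)^2$, whose global minimizers are $x = \pm 1$: the difference quotient is approximately zero there.

    f(x) = (x+1)^2*(x-1)^2
    h = 1e-6
    [(f(x0 + h) - f(x0)) / h for x0 in (-1.0, 1.0)]   # both entries ≈ 0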

1.3 Probability

1.3.1 Expectation, variance and Chebyshev's inequality

Recall that the expectation (https://en.wikipedia.org/wiki/Expected_value) (or mean) of a function $h$ of a discrete random variable $X$ taking values in $\mathcal{X}$ is given by
$$\mathbb{E}[h(X)] = \sum_{x \in \mathcal{X}} h(x)\, p_X(x)$$
where $p_X(x) = \mathbb{P}[X = x]$ is the probability mass function (https://en.wikipedia.org/wiki/Probability_mass_function) (PMF) of $X$. In the continuous case, we have
$$\mathbb{E}[h(X)] = \int h(x)\, f_X(x)\, \mathrm{d}x$$
if $f_X$ is the probability density function (https://en.wikipedia.org/wiki/Probability_density_function) (PDF) of $X$.

Two key properties of the expectation:

linearity, that is, 피[훼ℎ(푋) + 훽] = 훼 피[ℎ(푋)] + 훽

monotonicity, that is, if ℎ1(푥) ≤ ℎ2(푥) for all 푥 then

피[ℎ1(푋)] ≤ 피[ℎ2(푋)].

The variance (https://en.wikipedia.org/wiki/Variance) of a real-valued random variable $X$ is
$$\mathrm{Var}[X] = \mathbb{E}[(X - \mathbb{E}[X])^2]$$
and its standard deviation is $\sigma_X = \sqrt{\mathrm{Var}[X]}$. The variance does not satisfy linearity, but we have the following property:
$$\mathrm{Var}[\alpha X + \beta] = \alpha^2\, \mathrm{Var}[X].$$
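As a quick simulation sketch of this last property (added here, not part of the original notes), using the var function from the Statistics package loaded in the Numerical Corner above:

    using Statistics
    α, β = 3.0, 2.0
    x = rand(10^6)                    # a large sample from U[0,1]
    var(α .* x .+ β), α^2 * var(x)    # the two values agree (up to floating-point error)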

The variance is a measure of the typical deviation of $X$ around its mean. A quantified version of this statement is given by Chebyshev's inequality.

Lemma (Chebyshev): For a random variable $X$ with finite variance, we have, for any $\alpha > 0$,
$$\mathbb{P}[|X - \mathbb{E}[X]| \geq \alpha] \leq \frac{\mathrm{Var}[X]}{\alpha^2}.$$

The intuition is the following: if the expected squared deviation from the mean is small, then the deviation from the mean is unlikely to be large. To formalize this we prove a more general inequality, Markov's inequality. In words, if a non-negative random variable has a small expectation then it is unlikely to be large.

Lemma (Markov): Let $Z$ be a non-negative random variable with finite expectation. Then, for any $\beta > 0$,
$$\mathbb{P}[Z \geq \beta] \leq \frac{\mathbb{E}[Z]}{\beta}.$$

Proof idea: The quantity 훽 ℙ[푍 ≥ 훽] is a lower bound on the expectation of 푍 restricted to the range {푍 ≥ 훽}, which by non-negativity is itself lower bounded by 피[푍].

Proof: Formally, let $\mathbf{1}_A$ be the indicator of the event $A$, that is, the random variable that is $1$ when $A$ occurs and $0$ when $A^c$ occurs, where $A^c$ is the complement of $A$. By definition, the expectation of $\mathbf{1}_A$ is
$$\mathbb{E}[\mathbf{1}_A] = 0\, \mathbb{P}[\mathbf{1}_A = 0] + 1\, \mathbb{P}[\mathbf{1}_A = 1] = \mathbb{P}[A].$$

Hence, by linearity and monotonicity,
$$\beta\, \mathbb{P}[Z \geq \beta] = \beta\, \mathbb{E}[\mathbf{1}_{Z \geq \beta}] = \mathbb{E}[\beta\, \mathbf{1}_{Z \geq \beta}] \leq \mathbb{E}[Z],$$
where the last inequality uses that $\beta\, \mathbf{1}_{Z \geq \beta} \leq Z$ by the non-negativity of $Z$.

Rearranging gives the claim. ◻

Finally we return to the proof of Chebyshev.

Proof idea (Chebyshev): Simply apply Markov to the squared deviation of 푋 from its mean.

Proof (Chebyshev): Let $Z = (X - \mathbb{E}[X])^2$, which is non-negative by definition. Hence, by Markov, for any $\beta = \alpha^2 > 0$,
$$\mathbb{P}[|X - \mathbb{E}[X]| \geq \alpha] = \mathbb{P}[(X - \mathbb{E}[X])^2 \geq \alpha^2] = \mathbb{P}[Z \geq \beta] \leq \frac{\mathbb{E}[Z]}{\beta} = \frac{\mathrm{Var}[X]}{\alpha^2},$$
where we used the definition of the variance in the last equality. ◻
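A quick simulation sketch of Chebyshev's inequality (added here, not part of the original notes), using a uniform sample whose mean is $1/2$ and variance is $1/12$:

    using Statistics
    x = rand(10^6)                      # X ~ U[0,1]
    α = 0.4
    mean(abs.(x .- 0.5) .>= α)          # empirical P[|X - E[X]| ≥ α] ≈ 0.2
    (1/12) / α^2                        # Chebyshev's bound ≈ 0.52, which indeed dominates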

Chebyshev's inequality is particularly useful when combined with independence.

1.3.2 Independence and limit theorems

Recall that discrete random variables $X$ and $Y$ are independent if their joint PMF factorizes, that is,
$$p_{X,Y}(x, y) = p_X(x)\, p_Y(y), \quad \forall x, y$$
where $p_{X,Y}(x, y) = \mathbb{P}[X = x, Y = y]$. Similarly, continuous random variables $X$ and $Y$ are independent if their joint PDF factorizes. One consequence is that expectations of products of single-variable functions factorize as well, that is, for functions $g$ and $h$ we have
$$\mathbb{E}[g(X) h(Y)] = \mathbb{E}[g(X)]\, \mathbb{E}[h(Y)].$$

The latter has the following important implication for the variance. If 푋1, … , 푋푛 are independent, real-valued random variables, then Var[푋1 + ⋯ + 푋푛] = Var[푋1] + ⋯ + Var[푋푛].

Notice that, unlike the case of the expectation, this equation for the variance requires independence in general.

Applied to the sample mean of $n$ independent, identically distributed (i.i.d.) random variables $X_1, \ldots, X_n$, we obtain
$$\mathrm{Var}\left[\frac{1}{n}\sum_{i=1}^n X_i\right] = \frac{1}{n^2}\sum_{i=1}^n \mathrm{Var}[X_i] = \frac{1}{n^2}\, n\, \mathrm{Var}[X_1] = \frac{\mathrm{Var}[X_1]}{n}.$$

So the variance of the sample mean decreases as $n$ gets large, while its expectation remains the same by linearity:
$$\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^n X_i\right] = \frac{1}{n}\sum_{i=1}^n \mathbb{E}[X_i] = \frac{1}{n}\, n\, \mathbb{E}[X_1] = \mathbb{E}[X_1].$$

Together with Chebyshev's inequality, we immediately get that the sample mean approaches its expectation in the following probabilistic sense.

Theorem (Law of Large Numbers): Let $X_1, \ldots, X_n$ be i.i.d. For any $\varepsilon > 0$, as $n \to +\infty$,
$$\mathbb{P}\left[\left|\frac{1}{n}\sum_{i=1}^n X_i - \mathbb{E}[X_1]\right| \geq \varepsilon\right] \to 0.$$

Proof: By Chebyshev and the formulas above,
$$\mathbb{P}\left[\left|\frac{1}{n}\sum_{i=1}^n X_i - \mathbb{E}[X_1]\right| \geq \varepsilon\right] = \mathbb{P}\left[\left|\frac{1}{n}\sum_{i=1}^n X_i - \mathbb{E}\left[\frac{1}{n}\sum_{i=1}^n X_i\right]\right| \geq \varepsilon\right] \leq \frac{\mathrm{Var}\left[\frac{1}{n}\sum_{i=1}^n X_i\right]}{\varepsilon^2} = \frac{\mathrm{Var}[X_1]}{n \varepsilon^2} \to 0$$
as $n \to +\infty$. ◻

NUMERICAL CORNER We can use simulations to confirm the Law of Large Numbers. Recall that a uniform random variable over the interval $[a, b]$ has density
$$f_X(x) = \begin{cases} \frac{1}{b-a} & x \in [a, b] \\ 0 & \text{otherwise.} \end{cases}$$

We write 푋 ∼ U[푎, 푏]. We can obtain a sample from U[0, 1] by using the function rand (https://docs.julialang.org/en/v1.2/stdlib/Random/#Base.rand) in Julia.

In [13]: rand(1)

Out[13]: 1-element Array{Float64,1}: 0.6833180625581647

Now we take $n$ samples from U[0, 1] and compute their sample mean. We repeat $k$ times and display the empirical distribution of the sample means using a histogram (https://en.wikipedia.org/wiki/Histogram).

In [14]: function lln_unif(n, k)
             sample_mean = [mean(rand(n)) for i=1:k]
             histogram(sample_mean, legend=false, title="n=$n", xlims=(0,1), nbin=15)  # "$n" is string with value n
         end

Out[14]: lln_unif (generic function with 1 method)

In [15]: lln_unif(10, 1000)

Out[15]:

Taking 푛 much larger leads to more concentration around the mean.

In [16]: lln_unif(100, 1000)

Out[16]:

Exercise: Recall that the cumulative distribution function (CDF) of a random variable $X$ is defined as
$$F_X(z) = \mathbb{P}[X \leq z], \quad \forall z \in \mathbb{R}.$$

(a) Let $\mathcal{I}$ be the interval where $F_X(z) \in (0, 1)$ and assume that $F_X$ is strictly increasing on $\mathcal{I}$. Let $U \sim \mathrm{U}[0, 1]$. Show that $\mathbb{P}[F_X^{-1}(U) \leq z] = F_X(z)$.

(b) Generate a sample from U[푎, 푏] for arbitrary 푎, 푏 using rand and the observation in (a). This is called the inverse transform sampling method.

In [17]: # Try it!
         a, b = -1, 1;
         X = rand(1);
         # EDIT THIS LINE: transform X to obtain a random variable Y ~ U[a,b]
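One possible solution sketch (added, not part of the original notebook; skip it if you want to try the exercise yourself first): for $U \sim \mathrm{U}[0,1]$, the CDF of U[a, b] has inverse $F_X^{-1}(u) = a + (b - a)u$, so

    a, b = -1, 1
    X = rand(1)
    Y = a .+ (b - a) .* X   # Y ~ U[a,b] by the inverse transform method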