Lecture 10: Sublinear Time (contd)

CSC2420 – Allan Borodin & Nisarg Shah

Recap

• Sublinear time algorithms
  ➢ Deterministic + exact: binary search
  ➢ Deterministic + inexact: estimating diameter in a metric space
  ➢ Randomized + exact: searching in a sorted list
    o Lower bound (thus optimality) using Yao's principle
  ➢ Randomized + inexact:
    o Estimating average degree in a graph
    o Estimating size of maximal matching in a graph
    o Property testing
      • Testing linearity of a Boolean function

Today

• Continue sublinear time property testing
  ➢ Testing if an array is sorted
  ➢ Testing if a graph is bipartite
• Some comments about sublinear space algorithms
• Begin streaming algorithms
  ➢ Finding the missing element(s)
  ➢ Finding very frequent or very rare elements
  ➢ Counting the number of distinct elements

Testing Monotonicity of Array

• Input: Array A of length n with O(1) access to A[i]
• Check: A[i] < A[i+1] for every i ∈ {1, …, n−1}
• Definition of "at least ε-far": at least εn entries must be changed to make the array monotonic
  ➢ Equivalently, there are at least εn entries that are not between their adjacent values.

• Goal: 1-sided error with O((log n)/ε) queries

Testing Monotonicity of Array

• Proposal:
  ➢ Pick t random indices i, and return "no" if x_i > x_{i+1} for even one of them.
• No!
  ➢ For 1 1 1 … 1 0 0 0 … 0 (n/2 each), we'll need t = Ω(n)
• Proposal:
  ➢ Pick t random pairs (i, j) with i < j, and return "no" if x_i > x_j for even one of them.
• No!
  ➢ 1 0 2 1 3 2 4 3 5 4 6 5 … (two interleaved sorted lists)
  ➢ ½-far (WHY?), but we need t = Ω(n) pairs (and by the Birthday Paradox, we must access Ω(√n) elements) (WHY?)

Testing Monotonicity of Array

• Algorithm:
  ➢ Choose 2/ε random indices i.
  ➢ For each i, do a binary search for A[i].
  ➢ Return "yes" iff all binary searches succeed (i.e., each lands at index i).
• Assume all elements are distinct w.l.o.g.
  ➢ Can replace A[i] by (A[i], i) and use lexicographic comparison
• Important observation:
  ➢ "Searchable" elements form an increasing subsequence! (WHY?)
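A minimal Python sketch of this tester, assuming distinct entries as on the slide (function names are illustrative):

```python
import random

def searchable(A, i):
    """Binary-search A for A[i]; 'succeed' means the search lands at index i.
    Assumes distinct entries (otherwise compare (A[i], i) lexicographically)."""
    lo, hi = 0, len(A) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if A[mid] == A[i]:
            return mid == i
        elif A[mid] < A[i]:
            lo = mid + 1
        else:
            hi = mid - 1
    return False

def monotonicity_tester(A, eps, rng=random.Random(0)):
    """One-sided tester: always accepts sorted arrays; rejects eps-far
    arrays w.p. >= 2/3 using about 2/eps binary searches."""
    for _ in range(max(1, round(2 / eps))):
        if not searchable(A, rng.randrange(len(A))):
            return False  # witnessed a violation of sortedness
    return True
```

On a sorted array every index is searchable, so the tester never errs on "yes" instances, matching the one-sided guarantee.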

Testing Monotonicity of Array

• Thus:
  ➢ If α·n elements are searchable ⇒ the array is at most (1−α)-far from monotonic
  ➢ If the array is at least ε-far from monotonic ⇒ at least ε·n elements are not searchable
    o Each iteration fails to detect a violation w.p. at most 1 − ε
    o All 2/ε iterations fail to detect w.p. at most (1 − ε)^(2/ε) ≤ e^(−2) ≤ 1/3

Graph Property Testing

• It's an active area of research by itself.
• Let G = (V, E) with n = |V| and m = |E|

• Input models:
  ➢ Dense: represented by an adjacency matrix
    o Query whether (i, j) ∈ E in O(1) time
    o ε-far from satisfying P if εn² matrix entries must be changed to satisfy P
    o Change required = ε-fraction of the input

Graph Property Testing


  ➢ Sparse: represented by adjacency lists
    o Query (v, i) to get the i-th neighbor of v in O(1) time
    o We only use this model for graphs with degrees bounded by d
    o ε-far from satisfying P if ε(dn) adjacency-list entries must be changed to satisfy P
    o Change required = ε-fraction of the input

➢ Generally, dense is easier than sparse

Testing Bipartiteness

• Dense model:
  ➢ Upper bound: O(1/ε²) (independent of n)
  ➢ Lower bound: Ω(1/ε^1.5)

• Sparse model (for constant d):
  ➢ Upper bound: O(√n · poly(log n / ε))
  ➢ Lower bound: Ω(√n)

Testing Bipartiteness

• In the dense model:
• Algorithm [Goldreich, Goldwasser, Ron]:
  ➢ Pick a random subset of vertices S with |S| = Θ(log(1/ε) / ε²)
  ➢ Output "bipartite" iff the induced subgraph G[S] is bipartite

• Analysis:
  ➢ Easy: If the graph is bipartite, the algorithm always accepts.
  ➢ Claim: If the graph is ε-far, it rejects w.p. at least 2/3
  ➢ Running time: trivially constant (i.e., independent of n)
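A runnable sketch of the GGR tester, with the graph given as a 0/1 adjacency matrix; the exact sample-size expression is an illustrative stand-in for the Θ(log(1/ε)/ε²) bound:

```python
import math
import random
from collections import deque

def induced_bipartite(adj, S):
    """BFS 2-coloring of the subgraph induced on vertex set S."""
    color = {}
    for s in S:
        if s in color:
            continue
        color[s] = 0
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in S:
                if adj[u][v]:
                    if v not in color:
                        color[v] = 1 - color[u]
                        queue.append(v)
                    elif color[v] == color[u]:
                        return False  # odd cycle inside G[S]
    return True

def ggr_bipartite_tester(adj, eps, rng=random.Random(0)):
    """Sample |S| ~ log(1/eps)/eps^2 vertices; accept iff G[S] is bipartite."""
    n = len(adj)
    size = min(n, math.ceil(math.log(1 / eps) / eps ** 2))
    return induced_bipartite(adj, rng.sample(range(n), size))
```

Since a subgraph of a bipartite graph is bipartite, "yes" instances are always accepted, as in the analysis above.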

Testing Bipartiteness

• Q: Why doesn't this work for the sparse model?
  ➢ Take a line graph of n nodes. Throw in εn additional edges.
  ➢ In the dense model, we don't care about this instance because it's not ε-far (only ε/n-far).
  ➢ In the sparse model, we do care about it, and the previous algorithm will not work.

Testing Bipartiteness

• In the sparse model:
• Algorithm [Goldreich, Ron]:
• Repeat O(1/ε) times:
  ➢ Pick a random vertex v
  ➢ Run OddCycle(v); if it finds an odd cycle, REJECT.
• If no trial rejected, then ACCEPT.
• OddCycle:
  ➢ Performs poly(log n / ε) random walks from v, each of length poly(log n / ε).
  ➢ If some vertex is reachable from v by both an even-length and an odd-length walk, an odd cycle is detected.
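A simplified sketch of OddCycle; the walk count and length are small illustrative constants rather than the poly(log n/ε) values from the analysis:

```python
import random

def odd_cycle(adj, v, num_walks=50, walk_len=10, rng=random.Random(0)):
    """Random walks from v, recording the parity of the step at which each
    vertex is reached. Reaching some vertex at both parities yields an odd
    closed walk through it, hence an odd cycle in an undirected graph."""
    parities = {v: {0}}  # v itself is reached by the empty (even-length) walk
    for _ in range(num_walks):
        u = v
        for step in range(1, walk_len + 1):
            if not adj[u]:
                break
            u = rng.choice(adj[u])
            parities.setdefault(u, set()).add(step % 2)
            if len(parities[u]) == 2:
                return True  # even- and odd-length walks meet: odd cycle
    return False
```

On a bipartite graph every walk from v reaches each vertex with a fixed parity, so the detector never errs on "yes" instances.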

Limitations of Sublinear Time

• The problems we saw are rather exceptions.
• For most problems, there is not much you can do in sublinear time.
• For instance, these problems require Ω(n²) time:
  ➢ Estimating min_{i,j} d(i,j) in a metric space d
    o Contrast this with the sublinear algorithm we saw for estimating the diameter max_{i,j} d(i,j)
  ➢ Estimating the cost of the minimum-cost matching
  ➢ Estimating the cost of k-median for k = Ω(n)
  ➢ …

Sublinear Space Algorithms

• An important topic in complexity theory

• Fundamental unsolved questions:
  ➢ Is NSPACE(S) = DSPACE(S) for S ≥ log n?
  ➢ Is P = L? (L = DSPACE(log n), and we know L ⊆ P)
  ➢ What is the relation between P and polyL = DSPACE(log^O(1) n)?
    o We know P ≠ polyL, but we don't know if P ⊂ polyL, polyL ⊂ P, or neither is contained in the other.

• Savitch's theorem:
  ➢ DSPACE(S) ⊆ NSPACE(S) ⊆ DSPACE(S²)

USTCON vs STCON

• USTCON (resp. STCON) is the problem of checking whether a given source node has a path to a given target node in an undirected (resp. directed) graph.
  ➢ USTCON ∈ RSPACE(log n) was shown in 1979 through a random-walk-based algorithm
  ➢ After much effort, Reingold [2008] finally showed that USTCON ∈ DSPACE(log n)

• Open questions:
  ➢ Is STCON in RSPACE(log n), or maybe even in DSPACE(log n)?
  ➢ What about o(log² n) instead of log² n space?
  ➢ Is RSPACE(S) = DSPACE(S)?

Streaming Algorithms

• Input data comes as a stream a_1, …, a_m, where, say, each a_i ∈ {1, …, n}.
  ➢ The stream is typically too large to fit in memory.
  ➢ We want to use only S(m, n) memory for sublinear S.
    o We can measure this in terms of the number of integers stored, or the number of actual bits stored (which might be a log n factor larger).
  ➢ It is also desirable not to take too much processing time per element of the stream.
    o O(1) is ideal, but O(log(m + n)) might be okay!
  ➢ If we don't know m in advance, this can often act as an online algorithm.

Streaming Algorithms

• Most questions are about some statistic of the stream.
  ➢ E.g., "how many distinct elements does it have?", or "count the number of times the most frequent element appears"
  ➢ Once again, we will often approximate the answer.
  ➢ Most algorithms process the stream in one pass, but sometimes you can achieve more with two or more passes.

Missing Element Problem

• Problem: Given a stream a_1, …, a_{n−1}, where each element is a distinct integer from {1, …, n}, find the unique missing element.

• An n-bit algorithm is obvious:
  ➢ Keep a bit for each integer.
  ➢ At the end, spend O(n) time to search for the 0 bit.
• We can do O(log n) bits by maintaining the sum.
  ➢ Missing element = n(n+1)/2 − SUM
• Deterministic + exact.
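The O(log n)-bit sum trick, as a one-pass Python sketch:

```python
def find_missing(stream, n):
    """Maintain only the running sum; the missing value is n(n+1)/2 - sum."""
    total = 0
    for a in stream:   # one pass, O(log n)-bit state
        total += a
    return n * (n + 1) // 2 - total
```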

Missing Elements Problem

• Problem: Given a stream a_1, …, a_{n−k}, where each element is a distinct integer from {1, …, n}, find all k missing elements.
• The previous algorithm can be generalized:
  ➢ Instead of just computing the sum, compute the power-sums S_j = Σ_{i=1}^{n−k} (a_i)^j for 1 ≤ j ≤ k.
  ➢ At the end, we have k equations and k unknowns.
  ➢ This uses O(k²·log n) space.
  ➢ It is computationally expensive to solve the equations
    o Using Newton's identities followed by finding roots of a polynomial
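For k = 2 the power sums can be inverted by hand: S_1 and S_2 determine x + y and x² + y², and the two missing values are the roots of a quadratic. A sketch of this special case (function name illustrative):

```python
import math

def find_two_missing(stream, n):
    """Power sums S1, S2 of the stream give p = x + y and q = x^2 + y^2
    for the two missing values x, y; they are the roots of
    t^2 - p*t + (p^2 - q)/2 = 0."""
    s1 = sum(stream)
    s2 = sum(a * a for a in stream)
    p = n * (n + 1) // 2 - s1                  # x + y
    q = n * (n + 1) * (2 * n + 1) // 6 - s2    # x^2 + y^2
    prod = (p * p - q) // 2                    # x * y
    disc = math.isqrt(p * p - 4 * prod)
    return sorted(((p - disc) // 2, (p + disc) // 2))
```

For general k, the same idea yields a degree-k polynomial whose roots are the missing values, which is where the computational expense comes in.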

Missing Elements Problem

• We can design much more efficient algorithms if we use randomization.
  ➢ There is a streaming algorithm with space and time-per-item O(k·log k·log n).
  ➢ It can also be shown that Ω(k·log(n/k)) space is necessary.

Frequency Moments

• Another classic problem is that of computing frequency moments.
  ➢ Let A = a_1, …, a_m be a data stream with a_i ∈ {1, …, n}.
  ➢ Let m_i denote the number of occurrences of value i.
  ➢ Then for k ≥ 0, the k-th frequency moment is defined as F_k = Σ_{i∈[n]} m_i^k
  ➢ F_0 = # distinct elements
  ➢ F_1 = m
  ➢ F_2 = Gini's homogeneity index
    o The greater the value of F_2, the greater the homogeneity in A
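For reference, the exact (linear-space) computation that the streaming algorithms approximate:

```python
from collections import Counter

def frequency_moment(stream, k):
    """F_k = sum over distinct values of (occurrence count)^k."""
    return sum(m ** k for m in Counter(stream).values())
```

For example, on the stream [1, 1, 2, 3]: F_0 = 3 distinct values, F_1 = 4 elements, and F_2 = 2² + 1² + 1² = 6.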

Frequency Moments

• Goal: Given ε, δ, find F′_k s.t. Pr[|F′_k − F_k| > εF_k] ≤ δ
• Seminal paper by Alon, Matias, Szegedy [AMS'99]:
  ➢ k = 0: For every c > 2, an O(log n) space algorithm s.t. Pr[(1/c)·F_0 ≤ F′_0 ≤ c·F_0] ≥ 1 − 2/c
  ➢ k = 2: O((log n + log m)·log(1/δ)/ε²) = Õ(1) space
  ➢ k ≥ 3: Õ(m^(1−1/k)·poly(1/ε)·polylog(m, n, 1/δ)) space
  ➢ k > 5: Lower bound of Ω(m^(1−5/k))

Frequency Moments


➢ Exactly counting F_0 requires Ω(n) space:
  o Once the stream is processed, the algorithm acts as a membership tester: on a new element x, the count increases by 1 iff x was not part of the stream.
  o The algorithm must therefore have enough memory to distinguish between all 2^n possible states.

Frequency Moments


➢ The state of the art is the "HyperLogLog" algorithm
  o Uses hash functions
  o Widely used, theoretically near-optimal, practically quite fast
  o Uses O(ε⁻²·log log n + log n) space
  o It can estimate > 10⁹ distinct elements with 98% accuracy using only 1.5 kB of memory!
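A toy Flajolet–Martin-style estimator, the ancestor of HyperLogLog; the hash family here is an ad-hoc stand-in, and it lacks HyperLogLog's bucketing and bias correction:

```python
import random

def fm_distinct_estimate(stream, num_hashes=64, rng=random.Random(0)):
    """For each hash, track the max number of trailing zeros among hashed
    elements; with d distinct elements the max is about log2(d).
    Averaging the exponent over several hashes reduces the variance."""
    mask = (1 << 64) - 1
    coeffs = [(rng.getrandbits(64), rng.getrandbits(64) | 1)
              for _ in range(num_hashes)]
    total = 0.0
    for a, b in coeffs:
        max_zeros = 0
        for x in stream:
            h = (a ^ (x * b)) & mask               # toy 64-bit hash
            zeros = (h & -h).bit_length() - 1 if h else 64
            max_zeros = max(max_zeros, zeros)
        total += max_zeros
    return 2 ** (total / num_hashes)
```

The key point, shared with HyperLogLog, is that the state per hash is a single small integer (≤ 64 here), independent of the number of distinct elements.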

Frequency Moments

➢ k > 2: The Ω(m^(1−5/k)) lower bound was improved to Ω(m^(1−2/k)) by Bar-Yossef et al.
  o Their bound also works for real-valued k.
➢ Indyk and Woodruff [2005] gave an algorithm for real-valued k > 2 with a matching upper bound of Õ(m^(1−2/k)).

AMS F_k Algorithm

• The basic idea is to define a random variable Y whose expected value is close to F_k and whose variance is sufficiently small, such that it can be calculated under the space constraint.

• We will present the AMS algorithm for computing F_k, and sketch the proof for k ≥ 3 as well as the improved proof for k = 2.

AMS F_k Algorithm

• Algorithm:
  ➢ Let s_1 = (8/ε²)·k·m^(1−1/k) and s_2 = 2 log(1/δ).
  ➢ Let Y = median(Y_1, …, Y_{s_2}), where
  ➢ Y_i = mean(X_{i,1}, …, X_{i,s_1}), where
    o the X_{i,j} are i.i.d. random variables, calculated as follows:
    o For each X_{i,j}, choose a random p ∈ {1, …, m} in advance.
    o When a_p arrives, note down its value.
    o In the remaining stream, maintain r = |{q : q ≥ p and a_q = a_p}|.
    o Set X_{i,j} = m·(r^k − (r−1)^k).
• Space:
  ➢ For each of the s_1·s_2 variables X: log n bits to store a_p, and log m bits to store r.
• Note: This assumes we know m. But it can be estimated as the stream unfolds.
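A sketch of the estimator; for clarity it keeps the stream in memory and re-scans it, whereas the true streaming version draws p in advance and maintains only a_p and the counter r:

```python
import random
from statistics import median

def ams_fk(stream, k, s1, s2, rng=random.Random(0)):
    """X = m * (r^k - (r-1)^k), where r counts occurrences of a random
    position's value from that position onward; E[X] = F_k (telescoping).
    Average s1 copies, then take the median of s2 averages."""
    m = len(stream)

    def sample_x():
        p = rng.randrange(m)
        r = sum(1 for q in range(p, m) if stream[q] == stream[p])
        return m * (r ** k - (r - 1) ** k)

    return median(sum(sample_x() for _ in range(s1)) / s1
                  for _ in range(s2))
```

A quick sanity check: for k = 1 every sample equals m·(r − (r−1)) = m, so the estimator returns F_1 = m exactly.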

AMS F_k Algorithm

• We want to show: E[X] = F_k, and Var[X] is small.

• E[X] = E[m·(r^k − (r−1)^k)]
  ➢ Each of the m choices of p ∈ [m] has probability 1/m.
  ➢ Thus, E[X] is just the sum of r^k − (r−1)^k across all choices of p.
  ➢ For each distinct value i, there are m_i terms, which telescope:
    (m_i^k − (m_i−1)^k) + ((m_i−1)^k − (m_i−2)^k) + ⋯ + (1^k − 0^k) = m_i^k
  ➢ Thus, the overall sum is F_k = Σ_i m_i^k.
• Thus, E[Y] = E[X] = F_k

AMS F_k Algorithm

• To show: Pr[|Y_i − F_k| > εF_k] ≤ 1/8
  ➢ Taking the median over s_2 = 2 log(1/δ) copies Y_i will do the rest.
• Chebyshev's inequality:
  ➢ Pr[|Y_i − E[Y_i]| > ε·E[Y_i]] ≤ Var[Y_i] / (ε²·E[Y_i]²)
  ➢ Var[Y_i] ≤ Var[X]/s_1 ≤ E[X²]/s_1, and E[Y_i] = E[X] = F_k.
  ➢ Thus, the probability bound is E[X²] / (s_1·ε²·F_k²) = E[X²] / (8·k·m^(1−1/k)·F_k²)
  ➢ To show this is at most 1/8, it suffices to show E[X²] ≤ k·m^(1−1/k)·F_k²
  ➢ Just more algebra: show E[X²] ≤ k·F_1·F_{2k−1}, and F_1·F_{2k−1} ≤ m^(1−1/k)·F_k²

Sketch of F_2 Improvement

• They retain s_2 = 2 log(1/δ), but decrease s_1 to just a constant, 16/ε².
  ➢ The idea is that X will not maintain a count for each value separately, but rather an aggregate:
  ➢ Z = Σ_{t=1}^{n} b_t·m_t, and X = Z²
  ➢ The vector (b_1, …, b_n) ∈ {−1, 1}^n is chosen at random as follows:
    o Let V = {v_1, …, v_h} be a set of O(n²) "four-wise independent" vectors
    o Each v_p = (v_{p,1}, …, v_{p,n}) ∈ {−1, 1}^n
    o Choose p ∈ {1, …, h} at random, and set (b_1, …, b_n) = v_p.
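A sketch of this "tug-of-war" estimator; fully independent random signs stand in here for the four-wise independent family (which is what makes an O(log n)-bit representation of the signs possible, and suffices for the variance bound):

```python
import random
from statistics import median

def ams_f2(stream, n, s1, s2, rng=random.Random(0)):
    """Maintain Z = sum_t b_t * m_t in one pass via a single counter;
    X = Z^2 satisfies E[X] = F_2, since cross terms cancel in expectation."""
    def sample_x():
        b = [rng.choice((-1, 1)) for _ in range(n + 1)]  # signs for values 1..n
        z = 0
        for a in stream:   # one pass: add the sign of each arriving value
            z += b[a]
        return z * z

    return median(sum(sample_x() for _ in range(s1)) / s1
                  for _ in range(s2))
```

Note that the per-sample state is just the single counter Z, rather than one counter per value as in the generic AMS estimator.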

Majority Element

• Input: Stream A = a_1, …, a_m, where a_i ∈ [n]
• Q: Is there a value i that appears more than m/2 times?

• Algorithm:
  ➢ Store a candidate a* and a counter c (initially c = 0).
  ➢ For i = 1 … m:
    o If c = 0: set a* = a_i, and c = 1.
    o Else:
      • If a* = a_i, c ← c + 1
      • If a* ≠ a_i, c ← c − 1
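This is the classic Boyer–Moore majority vote; in Python:

```python
def majority_candidate(stream):
    """One pass, O(log m + log n) bits of state: if some value occurs more
    than m/2 times, it is guaranteed to be the surviving candidate."""
    candidate, count = None, 0
    for a in stream:
        if count == 0:
            candidate, count = a, 1   # adopt a new candidate
        elif a == candidate:
            count += 1
        else:
            count -= 1
    return candidate
```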

Majority Element

• Space: Clearly O(log m + log n) bits
• Claim: If there exists a value v that appears more than m/2 times, then a* = v at the end.
• Proof:
  ➢ Take an occurrence of v (say a_i), and pair it up:
    o If it decreases the counter, pair it with the unique element a_j (j < i) that contributed the 1 we just decreased.
    o If it increases the counter:
      • If the added 1 is never taken back, QED!
      • If it is later decreased by some a_j (j > i), pair it with that.
  ➢ Because at least one occurrence of v is not paired, the "never taken back" case happens at least once.

Majority Element

• A simpler proof:
  ➢ At any step, let c′ = c if a* = v, and c′ = −c otherwise.
  ➢ Every occurrence of v must increase c′ by 1.
  ➢ Every occurrence of a value other than v either increases or decreases c′ by 1.
  ➢ Majority ⇒ more increments than decrements of c′.
  ➢ Thus, c′ is positive at the end!

Majority Element

• Note 1: When a majority element does not exist, the algorithm doesn’t necessarily find the mode.

• Note 2: If a majority element exists, the algorithm correctly finds it. However, if there is no majority element, the algorithm does not detect that and still returns a value.
  ➢ Whether the returned value is indeed a majority element can be trivially checked if a second pass over the stream is allowed.
  ➢ Surprisingly, we can prove that this cannot be done in one pass. (Next lecture!)
