
Bocconi Ph.D. in Statistics and Computer Science 2021-2022

TITLE: Computer Science II (Algorithms)

SPEAKER: Paolo Ferragina, University of

GOAL: In this course we will study, design, and analyze algorithms and data structures for efficiently solving combinatorial problems on Big Data of several types, such as integers, strings, trees, and graphs. Special attention will be devoted to the architectural features of modern storage technologies, which are a key concern when designing scalable Data Science platforms that process large and complex datasets. Every lecture will follow a problem-driven approach: it starts from a real software-design problem, abstracts it into a combinatorial form suitable for algorithmic investigation, and then introduces algorithms that minimize the use of computational resources such as time, space, I/O, and energy.

PREREQUISITE: Basic algorithms and data structures; basic notions of probability and discrete math.

TEACHING MATERIAL: Lecture notes by the teacher

PRELIMINARY PROGRAM: 10 lectures, for a total of 24h. Every lecture will consist of a few “sessions” of 45 minutes each.

Lecture 1 (Warm Up – 2 sessions) • Models of computation: RAM, 2-level memory, streaming • Scanning versus jumping in algorithm design • The issue of virtual memory

Lecture 2 (Sorting – 2 sessions) • Sorting Data: Mergesort in internal memory (small data) and on disk (big data) • The I/O Lower Bound • Permuting versus Sorting

Lecture 3 (Sampling – 2 sessions) • Sampling Data uniformly at random from a stream of items (known length, unknown length)
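For the unknown-length case, one classical technique is reservoir sampling. The following is a minimal sketch, not taken from the course notes: the function name `reservoir_sample` and its parameters are illustrative assumptions. It keeps a reservoir of m items and replaces a random slot with the i-th item (0-based) with probability m/(i+1), so every item seen so far is in the sample with equal probability.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    stream: Iterable[T], m: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Return m items drawn uniformly at random from a stream of unknown length.

    Hypothetical helper for illustration: one pass over the stream,
    O(m) memory, no knowledge of the stream length required.
    """
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < m:
            # The first m items fill the reservoir directly.
            reservoir.append(item)
        else:
            # Pick a position in [0, i]; the item survives only if the
            # position falls inside the reservoir (probability m / (i + 1)).
            j = rng.randint(0, i)
            if j < m:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    # Sample 5 items from a stream of 1000 integers.
    print(reservoir_sample(range(1000), 5))
```

If the stream holds fewer than m items, the sketch simply returns all of them.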

Lecture 4 (Dictionary problem: Exact search – 2 sessions) • Cuckoo hashing • Bloom filters and Spectral Bloom Filters • Application to search engines and DBs
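As an illustration of the Bloom filter idea, here is a minimal sketch under stated assumptions: the class name `BloomFilter`, the parameters `num_bits` and `num_hashes`, and the choice of deriving the hash positions from a single SHA-256 digest via double hashing are all illustrative, not the construction used in the course. The filter may answer "present" for a key that was never inserted (a false positive), but never answers "absent" for an inserted key.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Hypothetical minimal Bloom filter over string keys."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # packed bit array

    def _positions(self, key: str) -> Iterator[int]:
        # Assumption: simulate num_hashes hash functions by double hashing
        # two 64-bit values taken from one SHA-256 digest.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key: str) -> None:
        # Set the bit at every position selected for this key.
        for p in self._positions(key):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, key: str) -> bool:
        # The key is possibly present only if all of its bits are set.
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(key))


if __name__ == "__main__":
    bf = BloomFilter(num_bits=1024, num_hashes=4)
    bf.add("suffix array")
    print("suffix array" in bf)  # inserted keys are always reported as present
```

The space used is `num_bits` bits regardless of key length, which is what makes the structure attractive for the search-engine and database applications listed above.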

Lecture 5 (Dictionary problem: Approximate search – 2 sessions) • Locality Sensitive Hashing • Hamming distance on vectors • Shingling and document deduplication • Applications: deduplication, clustering, approximate search, ...

Lecture 6 (Strings: Prefix search – 2 sessions) • Tries (uncompacted, compacted) • Patricia trie • 2-level indexing and prefix search • Application to key-value store and Auto-completion search

Lecture 7 (Strings: Substring search – 2 sessions) • Suffix Arrays and LCP array • Suffix Tree • Searching strings by substring • Application to string statistics in efficient space and time

Lecture 8 (Strings: Dynamic Programming – 2 sessions) • Dynamic Programming • Application to Edit distance, Knapsack, Viterbi, Data compression
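To make the edit-distance application concrete, the following is a minimal sketch of the standard dynamic-programming recurrence for the Levenshtein distance (unit-cost insertions, deletions, and substitutions). The function name `edit_distance` is an illustrative assumption, and keeping only two rows of the table is one possible space optimization, not necessarily the formulation presented in the lectures.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b in O(len(a) * len(b)) time
    and O(len(b)) space (hypothetical helper for illustration)."""
    # prev[j] = distance between the empty prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        # curr[0] = distance between a[:i] and the empty string.
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(
                min(
                    prev[j] + 1,         # delete ca from a
                    curr[j - 1] + 1,     # insert cb into a
                    prev[j - 1] + cost,  # match or substitute
                )
            )
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))  # prints 3
```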

Lecture 9 (Data Compression – 4 sessions) • 0-th order entropy compressors: Huffman coding and Arithmetic coding • K-th order entropy compressors: Lempel-Ziv parsing and the Burrows-Wheeler Transform • Applications: gzip and bzip

Lecture 10 (Graphs – 4 sessions) • Graph representation • Algorithms for graphs: DFS, BFS, … • Graph storage (compressed?): from basic codes to Elias-Fano codes • Centrality measures: PageRank and HITS, and some of their variants