DATA SCIENCE MEI/1 University of Beira Interior, Department of Informatics
Hugo Pedro Proença, [email protected], 2020/2021

Key Data Structures in Data Science

• Data structures are used to store data in an organized way, in order to make data manipulation efficient.
• Typically, using ETL processes, data are imported from one (or several) databases into this kind of structure.
• Vectors
  • They are one of the most efficient and simple data structures, due to their homogeneous nature.
  • In Python, the "NumPy" library is typically used for creating vectors:
    • vec_row = np.array([1, 2, 3])
• Matrices
  • Matrices are two-dimensional data structures, also homogeneous (i.e., all elements are of the same type).
  • The "NumPy" library is also typically used to create matrices:
    • Matrix = np.mat([[1, 2], [3, 4], [5, 6]])
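• As a minimal, runnable sketch of both structures (assuming NumPy is installed and imported as np; np.array is used here instead of np.mat since it covers both cases):

    import numpy as np

    vec_row = np.array([1, 2, 3])                  # 1-D vector: homogeneous elements
    matrix = np.array([[1, 2], [3, 4], [5, 6]])    # 3x2 matrix: also homogeneous

    print(vec_row.shape)   # (3,)
    print(matrix.shape)    # (3, 2)
    print(matrix.T)        # transpose: a 2x3 matrix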

Key Data Structures in Data Science

• Arrays
  • Arrays are the general form of vectors and matrices, and have a multi-dimensional shape.
  • Typically, they do not have the homogeneity constraint, i.e., different data types can be included in each dimension of the array.
  • In Python, "lists" are the closest semantic match to the concept of array:
    • A = [[1, 'Volvo'], [2, 'BMW']]
• Data Frames
  • Data frames are 2-dimensional arrays that resemble database tables. Each column contains one variable and each row contains one instance.
  • In Python, they are typically created using the "pandas" library:
    • cars = {'Brand': ['Honda Civic', 'Toyota Corolla', 'Ford Focus'], 'Price': [22000, 25000, 27000]}
    • df = pd.DataFrame(cars, columns = ['Brand', 'Price'])
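• A self-contained sketch of the same data frame (the car data are the slide's illustrative values), showing how rows and columns can be inspected:

    import pandas as pd

    cars = {'Brand': ['Honda Civic', 'Toyota Corolla', 'Ford Focus'],
            'Price': [22000, 25000, 27000]}
    df = pd.DataFrame(cars, columns=['Brand', 'Price'])

    print(df.shape)            # (3, 2): 3 instances (rows), 2 variables (columns)
    print(df['Price'].mean())  # 24666.67 (approximately): a column-wise operation
    print(df.iloc[0])          # first instance: Brand = 'Honda Civic', Price = 22000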

Key Data Structures in Data Science

• Dictionaries
  • Also known as "hash maps", they support arbitrary keys and values. Keys are unique identifiers of instances in the data structure.
  • They are unordered, mutable and indexed.
  • In Python, they are created using curly brackets:
    • D = {1: [1, 2, 3, 4], 'Name': 'Bill'}
• Tuples
  • Tuples regard one instance, where elements are ordered and immutable. A tuple can have any number of items of different types.
  • In Python, we simply create a variable with parentheses:
    • tuple1 = ("apple", 1, False)
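• A short sketch contrasting the two structures (the values are the slide's examples):

    # Dictionary: mutable, indexed by its keys
    D = {1: [1, 2, 3, 4], 'Name': 'Bill'}
    D['Name'] = 'Anne'         # mutable: an existing value can be replaced
    D['Age'] = 30              # and new key/value pairs can be added
    print(D[1])                # access by key: [1, 2, 3, 4]

    # Tuple: ordered and immutable, with items of different types
    tuple1 = ("apple", 1, False)
    print(tuple1[0])           # access by position: 'apple'
    # tuple1[0] = "pear"       # would raise a TypeError: tuples cannot be modified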

Amortized Analysis and Computational Performance

• Algorithmic complexity is a crucial concept in Data Science. Knowing the complexity of algorithms allows us to answer various questions:
  • Is the problem solvable?
  • How long will my processing chain take to run?
  • How much space will it take?
• The concept of amortized analysis is closely related to asymptotic analysis.
• The classical asymptotic analysis aims at analyzing the performance of an individual operation asymptotically, as a function of the size of the problem.
• The goal is to perceive how the performance of a given operation will scale to a large data set.

Amortized Analysis and Computational Performance

• The key difference between asymptotic and amortized analysis is that the former depends on the input itself, while the latter depends on the sequence of operations the algorithm will execute.
• In summary:
  • Asymptotic analysis allows us to assert that the complexity of the algorithm, when it is given a worst/average case input of size n, is bounded by some function f(n).
  • Amortized analysis allows us to assert that the complexity of the algorithm, when it is given an input of unknown characteristics but known size n, is no worse than the value of a function f(n).

Asymptotic Analysis

• Typically, there are two modes for performing the asymptotic analysis of an algorithm (processing chain):
• The worst-case mode considers a single operation.
  • To find the overall cost of the algorithm we need to find the worst-case cost of every single operation and then count the number of their executions.
  • If the algorithm runs in time T(n), this is an upper bound for any input of size n.
  • Even if the algorithm may take less time on some inputs of that size, because particular operations may be cheaper for them, the idea is to always count the worst cost of every operation in the algorithm.
• The average-case mode aims at obtaining the running time for randomly chosen inputs. It is considered harder to obtain, because it needs probabilistic arguments and some assumptions about the distribution of the inputs.
  • Despite that, it may be a lot more useful, since the worst-case analysis is often misleading. For example, the worst-case temporal complexity of the quick-sort algorithm is n², while the average case is n*log(n) (see the sketch below).
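• A minimal quick-sort sketch (not the lecture's implementation) that makes the gap concrete: with the first element as pivot, an already-sorted input triggers the n² worst case, while random inputs stay close to n*log(n) comparisons:

    def quicksort(a):
        # Naive quick-sort, first element as pivot.
        # Random input: about n*log(n) comparisons on average.
        # Already-sorted input: one side of every partition is empty,
        # so the number of comparisons grows as n².
        if len(a) <= 1:
            return a
        pivot, rest = a[0], a[1:]
        left = [x for x in rest if x < pivot]
        right = [x for x in rest if x >= pivot]
        return quicksort(left) + [pivot] + quicksort(right)

    print(quicksort([3, 1, 4, 1, 5, 9, 2, 6]))   # [1, 1, 2, 3, 4, 5, 6, 9]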

Asymptotic Analysis: Big-O Notation

• The order of growth describes how the time and space complexity of an algorithm/processing chain will increase with respect to the size of the input.
• There are various notations to measure the order of growth, but the most popular is the Big-O notation, which gives the worst-case time complexity. Writing f(x) = O(g(x)) means that the growth of the function f() will never surpass the growth of the function g(), up to a constant factor.
• In this setting, g() is the asymptotic upper bound on the time complexity of f(). A small numerical illustration follows.
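• For instance, a hypothetical step count f(n) = 5n² + 3n is O(n²): the ratio f(n)/n² stays below a constant for all sufficiently large n, as a quick check shows:

    # f(n) = 5*n**2 + 3*n is O(n**2): f(n)/n**2 = 5 + 3/n never exceeds 6 for n >= 3
    for n in [10, 100, 1000, 10000]:
        f = 5 * n**2 + 3 * n
        print(n, f / n**2)   # 5.3, 5.03, 5.003, 5.0003 (bounded by the constant 6)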

Asymptotic Analysis: Big-O Notation

• For instance, consider a simple nested for loop:

    for (i = 1; i <= n; i++) {
        for (j = 1; j <= n; j++) {
            if (condition) break;
        }
    }

• Even if, at some point, the condition is met and the inner loop breaks, this cycle performs at most n² steps, i.e., g() = n².
• Accordingly, its order of growth is O(n²). A Python version that counts the executed steps is sketched below.
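• A Python sketch of the same loop, counting the steps actually executed; `condition` is a hypothetical predicate on (i, j), not something from the lecture:

    def nested_loop(n, condition):
        steps = 0
        for i in range(1, n + 1):
            for j in range(1, n + 1):
                steps += 1
                if condition(i, j):
                    break        # the inner loop may stop early...
        return steps             # ...but steps never exceeds n*n

    print(nested_loop(10, lambda i, j: j == i))   # 55, which is at most 10*10 = 100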

Asymptotic Analysis: Big-O Notation

• The asymptotic analysis has two major weaknesses:
  • As it ignores constants in the g() function, two algorithms that in practice have very different performance will get the same asymptotic bound.
    • For example, if one algorithm takes 999*n*log(n) steps, and another one takes 2*n*log(n), their asymptotic bound is the same: n*log(n).
  • Another weakness is that the worst-case scenario (input) might never happen, or have extremely low probability. In practice, this means that an algorithm that is asymptotically slower than another might actually perform better, because of the distribution of the inputs. The short comparison below illustrates both points.
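• A quick numerical comparison of the hypothetical step counts above: the two n*log(n) algorithms share the same bound yet differ by a factor of about 500, and for moderate n the "asymptotically worse" quadratic algorithm actually executes fewer steps than the slower of the two:

    import math

    for n in [100, 1000, 100000]:
        a = 999 * n * math.log2(n)   # O(n*log n), large constant
        b = 2 * n * math.log2(n)     # O(n*log n), small constant
        c = n ** 2                   # O(n²)
        print(n, int(a), int(b), int(c))
    # At n = 1000: c is about 1.0e6 while a is about 1.0e7,
    # even though O(n²) is the "worse" asymptotic class.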

Amortized Analysis

• Considering the weaknesses of asymptotic analysis, the concept of amortized analysis can be seen as more reliable, particularly for complex processing chains.
• Amortized analysis aims at perceiving how the average performance of all the operations on a large data set scales.
• Compared to the average-case mode of asymptotic analysis, amortized analysis gives an upper bound on the actual cost of an algorithm, which the average case doesn't guarantee.
• In summary, it gives the average performance (over time) of each operation in the worst case.

Amortized Analysis

• Considering a particular sequence of operations, it is not expected that the worst case occurs very often in each operation.
• In practice, the operations vary in their costs: some may be cheap and some may be expensive.
• For example, consider a data structure that keeps its elements ordered: inserting an element can take different times, linear (when existing elements have to be shifted) or constant (if elements are inserted already in order).
• In this case, if the operations have different costs, how can we correctly obtain the total time?
• This is where amortized analysis comes into play. It assigns an artificial cost to each operation in the sequence, which is called the amortized cost.
• This way, the total cost of the algorithm is bounded by the sum of the amortized costs of all operations.

Amortized Analysis

• There are three methods for obtaining the amortized cost:
  • Aggregate Method (brute force);
  • Accounting Method (the banker's method);
  • Potential Method (the physicist's method).
• Aggregate Method
  • Taking the dynamic array as an example, suppose that when the array has space available, we simply insert the new item in the first available position. Otherwise, the following steps are performed:
    • Allocate memory for a larger array, twice the size of the old one;
    • Copy the contents of the old array to the new one.
  • Let us assume that an insertion costs 1 unit and that resizing the array costs 1 unit per element copied.
  • The cost of inserting the i-th element is then given by (see the sketch below):
    • cost(i) = i, if i-1 is a power of 2 (copy the i-1 existing elements, then insert);
    • cost(i) = 1, otherwise.
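• A sketch of this dynamic-array insertion, with the cost model stated above (the 1-unit charge for allocating the very first slot is an assumption, added so the numbers match the next slide):

    def insert(array, capacity, item):
        # Insert item; return (array, new capacity, cost of this insertion).
        cost = 0
        if len(array) == capacity:            # array is full:
            cost += max(1, len(array))        # create the first slot, or copy every old element
            capacity = max(1, 2 * capacity)   # new array twice the size of the old one
        array.append(item)                    # insert in the first available position
        cost += 1
        return array, capacity, cost

    a, cap = [], 0
    for i in range(1, 8):
        a, cap, c = insert(a, cap, i)
        print(i, c)    # costs 2, 2, 3, 1, 5, 1, 1 for the first seven insertions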

Amortized Analysis

• The cost of inserting the first element is 2 (create the first space and insert).
• The cost of inserting the second element is also 2.
• The cost of inserting the third element is 3.
• The cost of inserting the fourth element is 1.
• The cost of inserting the fifth element is 5.
• The cost of inserting the sixth element is 1, and the same holds for the seventh.
• Overall, the average cost per insertion is (2 + 2 + 3 + 1 + 5 + 1 + 1) / 7 ≈ 2.14.
• Considering that we omit constants, the amortized cost is O(1).
• Aggregate analysis determines an upper bound T(n) on the cost of n operations, and then obtains the amortized cost as T(n)/n (see the check below).
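• A self-contained check of the aggregate method, using the closed-form cost of the i-th insertion defined above; the average T(n)/n stays bounded no matter how large n gets:

    def cost(i):
        # Cost of the i-th insertion: i units when i-1 is a power of two
        # (copy i-1 elements, then insert), 1 unit otherwise.
        # The first insertion is charged 2 units (create the first space and insert).
        if i == 1:
            return 2
        return i if (i - 1) & (i - 2) == 0 else 1

    for n in [7, 100, 100000]:
        total = sum(cost(i) for i in range(1, n + 1))
        print(n, total / n)   # about 2.14 for n = 7; always below 3, i.e., O(1) amortized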

Amortized Analysis

• The Accounting Method has a simple rationale. There is an account where we can save up time, and every operation is allowed to take some time from the account.
• The cheap operations help to pay for the most expensive ones. By distributing the costs this way, we get some kind of average.
• The chosen charge is too small if some operation drives the balance below 0.
• Suppose that we charge a cost of 3 units per insertion:
  • The cost of inserting the first element is 2 (create the first space and insert) (Balance = 1, i.e., 3 - 2).
  • The cost of inserting the second element is also 2 (Balance = 2, i.e., 1 + 3 - 2).
  • The cost of inserting the third element is 3 (Balance = 2).
  • The cost of inserting the fourth element is 1 (Balance = 4).
  • The cost of inserting the fifth element is 5 (Balance = 2, i.e., 4 + 3 - 5).
• The balance never becomes negative for this sequence, so a charge of 3 units per insertion is enough, which agrees with the potential-method result below. Had the balance become negative, the charge would be too small and the analysis would be repeated with a larger value, e.g., 4. A simulation of this check is sketched below.
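• A small simulation of the banker's method for this cost sequence: deposit a fixed charge per insertion, pay the real cost, and report whether the balance ever becomes negative:

    def charge_is_enough(charge, n):
        # Real cost of the i-th insertion, as in the aggregate example above.
        def cost(i):
            if i == 1:
                return 2
            return i if (i - 1) & (i - 2) == 0 else 1
        balance = 0
        for i in range(1, n + 1):
            balance += charge - cost(i)
            if balance < 0:
                return False        # the charge does not cover this operation
        return True

    print(charge_is_enough(2, 1000))   # False: 2 units per insertion is too little
    print(charge_is_enough(3, 1000))   # True: 3 units always cover the resizing costs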

Amortized Analysis

• The Potential Method is based on a potential function Φ that should have two properties: Φ(h0) = 0, where h0 is the initial state of the structure, and Φ(hi) ≥ 0 for every state hi.
• The amortized time of an operation is then given by: c + Φ(hi) - Φ(hi-1), where c is the real cost of the operation, hi is the state of the structure after the operation, and hi-1 the corresponding state before the operation.
• Ideally, Φ should be defined such that the amortized time of each operation is small: the change in potential should be positive for the cheap operations and negative for the expensive ones, so that the accumulated potential pays for their real cost.
• Considering the previous case of dynamic arrays, if:

• Φ(hi) = 2n - m, where n is the number of elements in the array and m is the array length (its capacity),
• We have two cases:
  • n < m: the actual cost is 1, n increases by 1, and m does not change. The potential increases by 2, so the amortized time is 1 + 2 = 3.
  • n = m: the array is doubled, so the actual time is n + 1. But the potential drops from n to 2, so the amortized time is n + 1 + (2 - n) = 3.
• In both of the aforementioned cases, the amortized time is O(1). The sketch below verifies this bound numerically.
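• A short check of the potential argument for dynamic arrays, computing the amortized cost of every insertion with Φ(h) = 2n - m and verifying that it never exceeds 3:

    def amortized_costs(n_inserts):
        n, m, phi = 0, 0, 0          # elements, capacity, current potential
        out = []
        for _ in range(n_inserts):
            if n == m:               # array full: double it
                actual = n + 1       # copy n elements, then insert
                m = max(1, 2 * m)
            else:
                actual = 1           # just insert
            n += 1
            new_phi = 2 * n - m                  # potential after the operation
            out.append(actual + new_phi - phi)   # amortized = actual + change in potential
            phi = new_phi
        return out

    print(amortized_costs(8))   # [2, 3, 3, 3, 3, 3, 3, 3]: every value is at most 3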