
WIPRO CONFIDENTIAL Agenda


and Program Design

By FCG Team

Agenda

§ Introduction to Data Structures
§ Array as a Data Structure
§ Linked Lists – Definition and Implementation
§ Introduction to data structures related to linked lists and arrays
  Ÿ Doubly linked lists
  Ÿ Circular linked lists
  Ÿ Chunk lists
  Ÿ Stacks and Queues
  Ÿ Lookup tables
  Ÿ Hash tables
  Ÿ Associative arrays
  Ÿ Circular buffers
  Ÿ Dynamic arrays
§ Binary trees
§ Graphs
§ Examples of choosing data structures

2 Prerequisites

§ Basics of
§ Basic understanding of pointers
§ Basic understanding of recursion
§ Some understanding of search and sort techniques

3 Introduction

§ Take the example of a text processing program.
§ One of the requirements is to sort a list of words.
§ How would you design your program to perform this function?
§ Specifically, how would you store the data for this program?

[Figure: the words W1 to W6 stored in different arrangements]

4 What is a Data structure

§ A structured way of organizing a collection of data elements, which are related to each other.

§ A data structure is also defined by
  Ÿ a set of defined operations on the structure, which determines how the data is processed.

§ Given a collection of data elements, there can be several ways to organize it, i.e. a programmer can choose to represent the data with different types of data structures.

§ Choosing an appropriate data structure for your program depends on
  Ÿ how the input set of data is interrelated,
  Ÿ what the data processing requirements are,
  Ÿ what the program constraints are in terms of time and space.

5 Importance of Data structures

§ The processing stage of a program revolves around the data structure.
§ Typically, if the data structure is chosen correctly, the rest of the logic will fall into place.
§ Frequently used data structures:
  Ÿ Arrays/Tables
  Ÿ Stack
  Ÿ Linked list (for example, singly linked list)
§ The basic data structures can be combined to form more complex data structures
  Ÿ Arrays and linked lists can be combined to form data structures like chunk list, hashed list, etc.
§ Selecting the wrong data structure would result in
  Ÿ the processing stage becoming unnecessarily complex and buggy

6 Data structures – Algorithms and implementation

§ For each of the data structures, there are standard algorithms available for
  Ÿ Parsing
  Ÿ Searching
  Ÿ Sorting
§ Combining these algorithms, several derived algorithms exist for
  Ÿ Splitting
  Ÿ Merging

§ Implementation of Data structures generally uses program defined basic data types along with references/pointers and structures/unions.


7 Array as a Data Structure

§ Stores a collection of elements of the same type.
§ The entire array is allocated as one contiguous block of memory.
§ The only defining operation for an array is indexing.
§ Static and dynamically allocated arrays:
  Ÿ Static arrays are allocated at compile time.
  Ÿ Dynamically allocated arrays are allocated at run time.
§ Fixed size arrays:
  Ÿ Static arrays: the size is fixed at compile time.
  Ÿ Dynamically allocated: the size can be determined at run time, but once allocated it most often remains fixed.
§ Dynamic arrays:
  Ÿ You can dynamically allocate an array in the heap, and resize it with a realloc() call.
  Ÿ Managing a heap block in this way is fairly complex, but can have excellent efficiency for storage and iteration.
  Ÿ If contiguous free memory is not available after the old block, realloc() copies the old block to a new block and returns a pointer to the new block. This makes realloc() expensive.
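The dynamic array idea can be sketched as follows. This is only a minimal illustration, not a reference implementation: the `IntArray` type, the function names, the starting capacity of 4 and the doubling policy are all assumptions made for the example. Doubling keeps the number of realloc() calls small, which matters because each call may copy the whole block.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical growable array of ints; all names here are illustrative. */
typedef struct {
    int   *items;     /* heap block holding the elements */
    size_t count;     /* number of elements in use */
    size_t capacity;  /* number of elements the block can hold */
} IntArray;

/* Start empty; the first append allocates the block. */
void IntArrayInit(IntArray *a)
{
    assert(a != NULL);
    a->items = NULL;
    a->count = 0;
    a->capacity = 0;
}

/* Append one value, doubling the block when it is full.
   Returns 0 on success, -1 if memory could not be obtained. */
int IntArrayAppend(IntArray *a, int value)
{
    assert(a != NULL);
    if (a->count == a->capacity) {
        size_t new_capacity = (a->capacity == 0) ? 4 : a->capacity * 2;
        /* realloc may move the block, so keep the old pointer until it succeeds */
        int *new_items = realloc(a->items, new_capacity * sizeof *new_items);
        if (new_items == NULL)
            return -1;
        a->items = new_items;
        a->capacity = new_capacity;
    }
    a->items[a->count++] = value;
    return 0;
}

/* Release the heap block and return to the empty state. */
void IntArrayFree(IntArray *a)
{
    assert(a != NULL);
    free(a->items);
    IntArrayInit(a);
}
```

Elements are still read with ordinary indexing, for example `a.items[i]` for `i < a.count`.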

8 Advantages and Disadvantages of an array

§ Pros
  Ÿ Access to an array element is convenient and fast.
    - Any element can be accessed directly using the [ ] syntax.
    - Array access is almost always implemented using fast address arithmetic: the address of an element is computed as an offset from the start of the array, which only requires one multiplication and one addition.
  Ÿ Because arrays are contiguous, processing of sequential data stored in an array can make best use of memory caches.
§ Cons
  Ÿ Because of the fixed size restriction on an array, programmers most often allocate an array that is "large enough", which wastes memory in most cases and crashes in the special out-of-bounds cases.
  Ÿ When large arrays are used as local variables, the stack space becomes unmanageable.
  Ÿ Inserting or deleting elements at the front or middle of an array is potentially expensive because existing elements need to be shifted over.

9 Linked Lists

§ A list of nodes at non-contiguous memory locations in the heap, connected to each other through pointers/links
§ Need to keep track of the head of the list to be able to traverse the list
§ Size of the list varies dynamically during the execution of the program
§ Typical operations are insertion, deletion and searching

    typedef struct node {
        int data;            /* DATA */
        struct node* next;   /* NP: next pointer */
    } NODE_TYPE;

[Figure: a head pointer in the stack/data segment pointing to nodes 1, 2, 3 in the heap]
§ A "head" pointer, local to a function or global, keeps the whole list by storing a pointer to the first node.
§ Each node stores one data element (int in this example).
§ Each node stores one next pointer.

10 Linked Lists – Insert at beginning of list

Input values: 2 8 10 4 6 3 2 5

§ Allocate memory for a new node.
§ Assign data.
§ Set next of the new node to next of head.
§ Update next of head to point to the new node.

[Figure: starting from a dummy head node X, inserting 2 gives X -> 2, and inserting 8 then gives X -> 8 -> 2]

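11 Linked Lists - Code

    #include <stdlib.h>

    /* Node and List are not defined on the slide; these typedefs are assumed. */
    typedef struct node {
        int data;
        struct node *next;
    } Node;
    typedef Node *List;

    /* Create the dummy head node X; its data field is unused. */
    List Initialize(void)
    {
        Node* temp;
        temp = (Node*)malloc(sizeof(Node));
        if (temp != NULL)
            temp->next = NULL;      /* empty list */
        return temp;
    }

    /* Insert d right after the dummy head. */
    void InsertBegin(List head, int d)
    {
        Node* temp;
        temp = (Node*)malloc(sizeof(Node));
        if (temp == NULL)
            return;
        temp->data = d;
        temp->next = head->next;
        head->next = temp;
    }

    int main(void)
    {
        List head;
        head = Initialize();
        if (head != NULL)
            InsertBegin(head, 2);
        return 0;
    }

[Figure: head -> X after Initialize(); head -> X -> 2 after InsertBegin(head, 2)]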

12 Linked Lists – Insert at any point in the list

Input values: 2 8 1 4 10 3 6 5

§ Allocate memory for a new node.
§ Assign data.
§ Find the insert point after which the new node has to be inserted.
§ Set next of the new node to next of the insert point.
§ Update next of the insert point to point to the new node.

[Figure: the list X -> 8 -> 2 becomes X -> 8 -> 2 -> 1 after inserting 1 at the insert point 2, and X -> 8 -> 4 -> 2 -> 1 after inserting 4 at the insert point 8; 10 is the next value to insert]
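The insertion steps above can be sketched in C roughly as follows. The function names are hypothetical, and the rule used to find the insert point (keep the list in descending order) is an assumption read off the values in the figure, not something the slide states.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct node {
    int data;
    struct node *next;
} Node;

/* Create the dummy head node; its data field is unused. */
Node *ListCreate(void)
{
    Node *head = malloc(sizeof *head);
    if (head != NULL)
        head->next = NULL;
    return head;
}

/* Find the node after which d belongs, assuming descending order. */
Node *FindInsertPoint(Node *head, int d)
{
    assert(head != NULL);
    Node *point = head;
    while (point->next != NULL && point->next->data > d)
        point = point->next;
    return point;
}

/* Insert d right after 'point'. Returns the new node, or NULL if out of memory. */
Node *InsertAfter(Node *point, int d)
{
    assert(point != NULL);
    Node *temp = malloc(sizeof *temp);   /* allocate memory for a new node */
    if (temp == NULL)
        return NULL;
    temp->data = d;                      /* assign data */
    temp->next = point->next;            /* new node takes over the rest of the list */
    point->next = temp;                  /* insert point now links to the new node */
    return temp;
}
```

Because of the dummy head, inserting at the beginning is just `InsertAfter(head, d)`; no special case is needed.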

13 Linked Lists - Delete

§ Identify 'del', the node to be deleted.
§ Store this in temp.
§ Modify the next pointer of its previous node (prev) to point to del->next.
§ Free the memory pointed to by temp.

[Figure: deleting 2 from X -> 10 -> 2 -> 4 -> 8 gives X -> 10 -> 4 -> 8; further deletions give X -> 10 -> 4 and then X -> 4]
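A minimal sketch of the deletion steps, assuming the same dummy-head list layout as the earlier slides. The name `DeleteValue` and the choice to delete the first node holding a given value are assumptions for the example.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct node {
    int data;
    struct node *next;
} Node;

/* Delete the first node holding d from a list with a dummy head.
   Returns 1 if a node was removed, 0 if d was not found. */
int DeleteValue(Node *head, int d)
{
    assert(head != NULL);
    Node *prev = head;
    /* Walk until prev->next is the node to delete ('del' on the slide). */
    while (prev->next != NULL && prev->next->data != d)
        prev = prev->next;
    if (prev->next == NULL)
        return 0;
    Node *temp = prev->next;   /* store 'del' in temp */
    prev->next = temp->next;   /* bypass the deleted node */
    free(temp);                /* free its memory */
    return 1;
}
```

Tracking `prev` rather than `del` itself avoids a second pass to find the previous node.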

14 Linked Lists – Pros and Cons

§ Pros
  Ÿ Optimal memory usage when the number of elements is not fixed at compile time.
  Ÿ It is easier to do insertions and deletions at any random point in the list.
§ Cons
  Ÿ Iteration is costlier because of non-contiguous storage.
  Ÿ Optimal only for sequential processing of elements.
  Ÿ Extra storage is needed for references, which often makes linked lists impractical for small lists of small data items.
  Ÿ Locality of data is poor. Linked list data elements do not make the best use of memory caches.

15 Linked Lists - Exercise

• Write code to implement a phone book as found in a mobile phone
  • Add new entry
  • Edit existing entry
  • Delete an entry
• The user should be able to search for the relevant details based on the name.
• The details to be stored in an entry are:
  • Name of the person
  • Mobile number
  • Landline number
  • A text section of 128 characters to store some notes or an address related to the entry
• The program should flag a warning if a new entry is being attempted with an existing name, and add the new details to the existing name.

• Given a linked list, write a function reverse() that reverses its contents
  - reverse(List* head)
  - At the end of the call to this function, head should be pointing to the last node in the list that was passed to it, and its contents should be reversed.
  - This program should not allocate memory for any additional nodes.

16 Arrays or Linked Lists?

Choose an array when:
§ Random access of elements is a requirement.
§ The maximum number of elements is known.
§ Program performance with respect to time is more important than with respect to memory.
§ During most of the runtime, the array elements are all occupied with valid values, so that there is no wastage of memory.
§ There is no need to insert elements at the beginning or middle of the list. Insertion is needed only at the end of the list.
§ There are no deletions required.

Choose a linked list when:
§ The maximum number of elements is not known.
§ Access of elements is mostly sequential.
§ Frequent insertions and deletions at any point in the list are required.
§ Performance in terms of memory is more critical than in terms of time.

17 Related Data Structures

§ Doubly-linked lists:
  Ÿ Each node has both a next and a prev pointer, so that the list can be traversed in both directions. Insertions and deletions are easier in this list, since there is no need to track a previous pointer.
  Ÿ Use this when there is a need to frequently traverse the list sequentially in both directions. It is commonly used in editors, which require moving the cursor in both directions.

[Figure: HEAD -> 1 -> 2 -> 3 as a singly linked list, and HEAD -> 1 <-> 2 <-> 3 as a doubly linked list]

§ Circular lists:
  Ÿ The last node is connected back to the first node. Use this for naturally circular data relations.
  Ÿ Instead of needing a fixed head end, any pointer into the list will do. This can help to speed up accesses.

18 Related Data Structures

§ Tail lists:
  Ÿ The list is represented by a head pointer, which points to the first node, and a tail pointer, which points to the last node.
  Ÿ This allows operations at the end of the list, such as adding an end element or appending two lists, to work efficiently.

[Figure: HEAD and TAIL pointers on the list 1 2 3, and a longer list 1 2 3 4 5 6 7 8 9 formed by appending lists]
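A rough sketch of a tail list follows. The `TailList` type and function names are made up for the illustration. Both the append and the concatenation touch only the head and tail pointers, so neither needs to walk the list.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct node {
    int data;
    struct node *next;
} Node;

/* Hypothetical tail list: head points to the first node, tail to the last. */
typedef struct {
    Node *head;
    Node *tail;
} TailList;

void TailListInit(TailList *list)
{
    assert(list != NULL);
    list->head = NULL;
    list->tail = NULL;
}

/* Add d at the end without traversing. Returns 0 on success, -1 if out of memory. */
int TailListAppend(TailList *list, int d)
{
    assert(list != NULL);
    Node *n = malloc(sizeof *n);
    if (n == NULL)
        return -1;
    n->data = d;
    n->next = NULL;
    if (list->tail == NULL)
        list->head = n;          /* first node in an empty list */
    else
        list->tail->next = n;
    list->tail = n;
    return 0;
}

/* Move all nodes of 'second' to the end of 'first'; 'second' becomes empty. */
void TailListConcat(TailList *first, TailList *second)
{
    assert(first != NULL && second != NULL);
    if (second->head == NULL)
        return;
    if (first->tail == NULL)
        first->head = second->head;
    else
        first->tail->next = second->head;
    first->tail = second->tail;
    second->head = NULL;
    second->tail = NULL;
}
```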

§ Chunk list:
  Ÿ Instead of storing a single data element in each node, store a small constant-size array of data elements in each node.
  Ÿ Tuning the number of elements per node can provide different performance characteristics: many elements per node gives performance more like an array, few elements per node gives performance more like a linked list.
  Ÿ The chunk list is a good way to build a linked list with better performance.

19 Related Data Structures

§ Hash list:
  Ÿ This is an array of linked lists.

[Figure: array buckets B1, B2, B3, each heading a linked list of keys such as K1, K2, K3]

20 Stacks

§ How do you organise a set of plates for disposal during a lunch buffet?
§ A stack is similarly organized, so that only the most recently added data element is accessible. It is based on the principle of Last In First Out (LIFO).
§ Push and Pop are its defining data operations.
§ It can be considered a specialized linked list whose only operations are insert-begin (push) and delete-begin (pop).
§ Applications:
  Ÿ Modern PCs use stacks in the basic design of the operating system, for interrupt handling and operating system function calls.
  Ÿ Expression evaluation: calculators employing reverse Polish notation use a stack to hold values.
  Ÿ Syntax parsing: for example, evaluating expressions in parentheses or HTML-like tags.
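The view of a stack as a linked list restricted to insert-begin and delete-begin can be sketched like this. The function signatures are assumptions for the example; here the caller owns a plain `Node *` top pointer instead of a dummy head.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct node {
    int data;
    struct node *next;
} Node;

/* Push: insert at the beginning. Returns 0 on success, -1 if out of memory. */
int Push(Node **top, int d)
{
    assert(top != NULL);
    Node *n = malloc(sizeof *n);
    if (n == NULL)
        return -1;
    n->data = d;
    n->next = *top;
    *top = n;
    return 0;
}

/* Pop: delete at the beginning. Returns 0 and stores the value in *out,
   or returns -1 if the stack is empty. */
int Pop(Node **top, int *out)
{
    assert(top != NULL && out != NULL);
    Node *temp = *top;
    if (temp == NULL)
        return -1;
    *out = temp->data;
    *top = temp->next;
    free(temp);
    return 0;
}
```

An empty stack is simply `Node *top = NULL;`.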

21 Queues

§ Data is organized in such a way that only the front (beginning) and back (last) elements are accessible.
§ Furthermore, elements can only be removed from the front and can only be added to the back. Thus it is a First In First Out (FIFO) or first-come-first-served (FCFS) structure.
§ It may be considered a specialized linked list with the above constraints, i.e. insert only at the tail and delete only at the head of the list.
§ Appropriate metaphors often used for queues are people traveling up an escalator, machine parts on an assembly line, or cars in line at a gas station.
§ The recurring theme is clear: queues are essentially waiting lines.
§ Application examples:
  Ÿ An event scheduler in a real time operating system
  Ÿ A simple to-do list
  Ÿ A message queue in real time operating systems
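A queue as a linked list with insertion only at the tail and deletion only at the head might look like the sketch below. The `Queue` type and function names are assumptions for the example.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct node {
    int data;
    struct node *next;
} Node;

/* Hypothetical queue: remove at head (front), add at tail (back). */
typedef struct {
    Node *head;
    Node *tail;
} Queue;

void QueueInit(Queue *q)
{
    assert(q != NULL);
    q->head = NULL;
    q->tail = NULL;
}

/* Add d at the back. Returns 0 on success, -1 if out of memory. */
int Enqueue(Queue *q, int d)
{
    assert(q != NULL);
    Node *n = malloc(sizeof *n);
    if (n == NULL)
        return -1;
    n->data = d;
    n->next = NULL;
    if (q->tail == NULL)
        q->head = n;
    else
        q->tail->next = n;
    q->tail = n;
    return 0;
}

/* Remove the front element into *out. Returns 0 on success, -1 if empty. */
int Dequeue(Queue *q, int *out)
{
    assert(q != NULL && out != NULL);
    Node *temp = q->head;
    if (temp == NULL)
        return -1;
    *out = temp->data;
    q->head = temp->next;
    if (q->head == NULL)
        q->tail = NULL;      /* queue became empty */
    free(temp);
    return 0;
}
```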

22 Lookup Tables

§ A lookup table is simply an array-like data structure used to replace a runtime computation with a simpler lookup operation.
§ A classic example is a trigonometry calculation, i.e. calculating the sine of a value every time. Such an operation can substantially slow some applications. If the sine values are pre-calculated and stored in a table, later on we need only look up the value in this table.
§ The defining operations are fill_table and indexing.
§ In image processing, lookup tables give an output value for each of a range of index values. They are also used for a colormap or palette, to determine the colors and intensity values with which a particular image will be displayed.
§ Though this should speed up computational lookups, we must be aware of its limitations:
  Ÿ If the table is large, each table access will almost certainly cause a cache miss.
  Ÿ One cannot construct a lookup table larger than the space available for the table.
  Ÿ The table values must be computed in the first instance. Although this usually needs to be done only once, if it takes a prohibitively long time, it may make the use of a lookup table inappropriate.
  Ÿ If the computation it replaces is relatively simple, retrieving the result from memory may require more time, and it may increase memory requirements and pollute the cache.
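The two defining operations, fill_table and indexing, can be illustrated with a small sketch. To keep the example free of floating point, it swaps the sine table for a different computation: counting the set bits in a 32-bit value through a 256-entry table. The names `FillBitCountTable` and `CountBits` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

static unsigned char bit_count_table[256];
static int table_filled = 0;

/* fill_table: compute every entry once, up front. */
void FillBitCountTable(void)
{
    bit_count_table[0] = 0;
    for (int i = 1; i < 256; i++) {
        /* bits in i = lowest bit of i + bits in i with that bit shifted out */
        bit_count_table[i] = (unsigned char)((i & 1) + bit_count_table[i / 2]);
    }
    table_filled = 1;
}

/* Indexing: four table lookups replace a loop over all 32 bits. */
int CountBits(uint32_t value)
{
    assert(table_filled);
    return bit_count_table[value & 0xFFu]
         + bit_count_table[(value >> 8) & 0xFFu]
         + bit_count_table[(value >> 16) & 0xFFu]
         + bit_count_table[(value >> 24) & 0xFFu];
}
```

The table is only 256 bytes, so it is likely to stay in the cache, which sidesteps the large-table limitation listed above.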

23 Hash Tables

§ A hash table is made up of two parts: an array (the actual table where the data to be searched is stored) and a mapping function, known as a hash function.
§ The hash function is a mapping from a subset of the input data (keys) to the integer space that defines the indices of the array. In other words, hashing means taking some input data (keys) and producing a single number from the data. The complete data can then be stored at the array index (hash value) corresponding to the key.
§ The key is normally converted into the index/hash value by taking a modulo; sometimes masking is used where the array size is a power of two.
§ Take the example of string inputs. The hash function evaluates the sum of the ASCII values of the characters that make up the string, mod the size of the table. This determines where the string will be placed in the hash table.
§ Typical data operations on a hash table include insertion, deletion and lookup.
§ All the above operations take constant time on average, which makes a hash table very efficient.

24 Hash Tables

§ Two or more keys could hash to the same index, in that case the keys are hashed into hash buckets. Each bucket is a list of key value pairs. Since different keys may hash to the same bucket, the goal of hash table design is to spread out the key-value pairs evenly with each bucket containing as few key-value pairs as possible. When an item is looked up, its key is hashed to find the appropriate bucket. Then, the bucket is searched for the right key-value pair.

KEY -> Hashing algorithm -> Hash value (bucket address)
ARAB -> Bucket 1 address

Bucket 1: ARAB Data, ASIA Data, ARG Data
Bucket 2: BRAZ Data, BRIT Data, BEUN Data
Bucket 3: CHIL Data, CEZH Data, CEYL Data

25 Hash Tables

§ One way to implement this is to store the records that hash to the same bucket as a linked list at that location. Thus this data structure becomes an array of linked lists.
§ Like arrays, hash tables provide constant-time lookup on average, regardless of the number of items in the table. While theoretically the worst-case lookup time can be as bad as that of a linked list, this is, for practical purposes, statistically unlikely unless the hash function is poorly designed.
§ Disadvantages: poor locality of reference, resulting in cache misses; they are also difficult to implement and error prone.
§ Use a hash table in place of direct array lookups only if the data set is very large and there is a need to hash the index in order to optimize memory storage.
§ Applications: building caches, associative arrays, call processing message-to-handler mappings, etc.
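An array of linked lists (chaining) can be sketched as below. Everything here is an assumption for illustration: the table size of 8, the `Entry` and `HashTable` types, the int values, and the sum-of-characters hash.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 8

typedef struct entry {
    char *key;
    int value;
    struct entry *next;   /* next entry in the same bucket */
} Entry;

typedef struct {
    Entry *buckets[TABLE_SIZE];   /* array of linked lists */
} HashTable;

static unsigned int Hash(const char *key)
{
    unsigned int sum = 0;
    while (*key != '\0')
        sum += (unsigned char)*key++;
    return sum % TABLE_SIZE;
}

void HashInit(HashTable *t)
{
    assert(t != NULL);
    for (int i = 0; i < TABLE_SIZE; i++)
        t->buckets[i] = NULL;
}

/* Insert or update. Returns 0 on success, -1 if out of memory. */
int HashPut(HashTable *t, const char *key, int value)
{
    assert(t != NULL && key != NULL);
    unsigned int idx = Hash(key);
    for (Entry *e = t->buckets[idx]; e != NULL; e = e->next) {
        if (strcmp(e->key, key) == 0) {
            e->value = value;            /* key already present: update */
            return 0;
        }
    }
    Entry *fresh = malloc(sizeof *fresh);
    if (fresh == NULL)
        return -1;
    fresh->key = malloc(strlen(key) + 1);
    if (fresh->key == NULL) {
        free(fresh);
        return -1;
    }
    strcpy(fresh->key, key);
    fresh->value = value;
    fresh->next = t->buckets[idx];       /* insert at the beginning of the bucket */
    t->buckets[idx] = fresh;
    return 0;
}

/* Lookup. Returns 0 and stores the value in *out, or -1 if the key is absent. */
int HashGet(const HashTable *t, const char *key, int *out)
{
    assert(t != NULL && key != NULL && out != NULL);
    for (const Entry *e = t->buckets[Hash(key)]; e != NULL; e = e->next) {
        if (strcmp(e->key, key) == 0) {
            *out = e->value;
            return 0;
        }
    }
    return -1;
}
```

Keys that collide, such as "ARAB" and "BARA" (same letters, same sum), simply share a bucket and are told apart by the `strcmp` during the bucket search.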

26 Associative Arrays

§ An associative array can be viewed as a generalization of an array. While a regular array maps an integer index to a value, an associative array's keys can be of any arbitrary type, for example a string. The values of an associative array do not need to be of the same type.
§ One can think of a telephone book as an example of an associative array, where names are the keys and phone numbers are the values. Using the usual array-like notation, we might write:
    telephone['ada']='01-1234-56'
    telephone['bert']='02-4321-56'
  Such associative array types are built into some programming languages.
§ Operations on an associative array are
  Ÿ Add: bind a new key to a new value
  Ÿ Reassign: bind an old key to a new value
  Ÿ Remove: unbind a key from a value and remove the key from the key set
  Ÿ Lookup: find the value (if any) that is bound to a key
§ They can be implemented as hash tables, binary search trees or sometimes a linked list (association lists). Implementations are usually designed to allow speedy lookup, at the expense of slower insertion and a larger storage footprint.

27 Circular Buffers

§ A circular buffer or ring buffer is a data structure that uses a single, fixed-size buffer as if it were connected end-to-end.
§ Memory is reused (reclaimed) when an index, incremented modulo the buffer size, writes over a previously used location.
§ This can be used when we know that only a certain portion of the data is useful for processing while the data continuously streams in. While one part of the data is being processed, the other part of the buffer receives incoming data. Once a portion of data has been processed, it can be overwritten while the remaining data is processed. Thus memory and time are optimized in embedded systems.
§ In other programs, if we know that only chunks of data are significant for processing, circular buffers can be used to optimize memory storage. For example, the unix "tail" program could use a buffer of size n to store the last n lines read from a file, until it reaches the end of the file.
§ Operations on a circular buffer can be add, delete, read and write.
§ In a circular buffer, a read would read only a subset of the latest elements.
§ A write would overwrite only a subset of the oldest elements.

28 Circular Buffers

§ Generally, a circular buffer requires three pointers:
  Ÿ one to the actual buffer in memory
  Ÿ one to point to the start of valid data
  Ÿ one to point to the end of valid data

[Figure: successive states of the buffer buf as the values 2, 4, 6, ... 24 are written, with the start and end pointers wrapping around and the oldest values being overwritten]

§ Circular buffers can be used to receive analog data from a decoder. While one sample is being played by the audio player, the rest of the buffer can be filled by the decoder. Once a sample is played, it can be overwritten.
§ A circular buffer can be used in a TCP stack to hold a window of data.
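A minimal ring buffer sketch follows. It deviates slightly from the three-pointer description: it keeps a start index and a count instead of start and end pointers, which is one common way to tell a full buffer from an empty one. The size of 4 and all names are assumptions for the example.

```c
#include <assert.h>
#include <stddef.h>

#define RING_SIZE 4

typedef struct {
    int buf[RING_SIZE];   /* the fixed-size buffer */
    size_t start;         /* index of the oldest valid element */
    size_t count;         /* number of valid elements */
} Ring;

void RingInit(Ring *r)
{
    assert(r != NULL);
    r->start = 0;
    r->count = 0;
}

/* Write one value; when the buffer is full the oldest value is overwritten. */
void RingWrite(Ring *r, int value)
{
    assert(r != NULL);
    size_t end = (r->start + r->count) % RING_SIZE;   /* index incremented modulo the size */
    r->buf[end] = value;
    if (r->count == RING_SIZE)
        r->start = (r->start + 1) % RING_SIZE;        /* oldest element was reclaimed */
    else
        r->count++;
}

/* Read the oldest value into *out. Returns 0 on success, -1 if empty. */
int RingRead(Ring *r, int *out)
{
    assert(r != NULL && out != NULL);
    if (r->count == 0)
        return -1;
    *out = r->buf[r->start];
    r->start = (r->start + 1) % RING_SIZE;
    r->count--;
    return 0;
}
```

Writing 1 through 6 into this buffer of size 4 leaves 3, 4, 5, 6 available to read, in that order.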

29 Binary Trees

§ A binary tree is composed of nodes, each of which stores data and also links to up to two child nodes, which can be visualized spatially as below the first node, with one placed to the left and one placed to the right.
§ A node is constructed of a left pointer (LP), a right pointer (RP), and a data field (DATA). The whole structure can be visualized as a tree, having a right sub-tree and a left sub-tree.

[Figure: a tree of LP | DATA | RP nodes, with the root at the top, non leaf nodes in the middle and leaf nodes at the bottom]

§ The root node is the node that is not a child of any other node, and is found at the "top" of the tree structure.
§ A node with no children is referred to as a leaf node.
§ Nodes that are neither a root node nor a leaf node are often referred to as non leaf nodes.
§ If the number of nodes in the left and right sub-trees is equal, then the tree is perfectly balanced.

30 Binary Trees

§ The binary tree is a useful data structure for rapidly storing sorted data and rapidly retrieving stored data. It is used for implementations.

§ The binary search technique is a fundamental method for locating an element of a particular value within a sequence of sorted elements. The idea is to eliminate half of the search space with each comparison.

§ It is the relationship between the child nodes linked to the parent node which makes the binary tree useful for the efficient binary search technique.
  Ÿ The child on the left has a lesser key value and the child on the right has an equal or greater key value.
  Ÿ The nodes on the farthest left of the tree have the lowest values, whereas the nodes on the right of the tree have the greatest values.
  Ÿ As each node connects to up to two other nodes, it is the beginning of a new, smaller, binary tree. Due to this nature, it is possible to easily access data in a binary tree using recursive functions called on successive nodes.

§ Ideally, the tree would be structured so that it is perfectly balanced, for faster insertion and retrieval. The worst case scenario is a tree in which each node has only one child node, so it behaves like a linked list in terms of speed. But keeping the tree balanced is itself computationally intensive.
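The ordering rule (lesser keys to the left, equal or greater keys to the right) and the balanced versus worst-case behaviour can be seen in a small sketch. The names are hypothetical, and for brevity a failed allocation simply leaves the tree unchanged.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct tnode {
    int key;
    struct tnode *left;    /* LP: keys less than this node's key */
    struct tnode *right;   /* RP: keys equal to or greater than this node's key */
} TreeNode;

/* Insert key and return the (possibly new) root of this sub-tree. */
TreeNode *TreeInsert(TreeNode *root, int key)
{
    if (root == NULL) {
        TreeNode *n = malloc(sizeof *n);
        if (n != NULL) {
            n->key = key;
            n->left = NULL;
            n->right = NULL;
        }
        return n;   /* NULL on allocation failure: the key is dropped */
    }
    if (key < root->key)
        root->left = TreeInsert(root->left, key);
    else
        root->right = TreeInsert(root->right, key);
    return root;
}

/* Binary search: each comparison discards one sub-tree. Returns 1 if found. */
int TreeContains(const TreeNode *root, int key)
{
    while (root != NULL) {
        if (key == root->key)
            return 1;
        root = (key < root->key) ? root->left : root->right;
    }
    return 0;
}

/* Number of levels in the tree; an empty tree has height 0. */
int TreeHeight(const TreeNode *root)
{
    if (root == NULL)
        return 0;
    int l = TreeHeight(root->left);
    int r = TreeHeight(root->right);
    return 1 + (l > r ? l : r);
}
```

Inserting 4, 2, 6, 1, 3, 5, 7 produces a perfectly balanced tree of height 3, while inserting the same keys in ascending order produces a chain of height 7, which is the linked-list-like worst case.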

31 Binary Trees

§ Traversals:

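The numbers in each tree show the order in which the nodes are visited (L = left sub-tree, N = node, R = right sub-tree).

In Order: L, N, R
        4
    2       6
  1   3   5   7

Pre Order: N, L, R
        1
    2       5
  3   4   6   7

Post Order: L, R, N
        7
    3       6
  1   2   4   5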
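The three traversals differ only in where the node itself is visited relative to its sub-trees. A sketch, with hypothetical function names, that records the visited keys into a caller-supplied array:

```c
#include <assert.h>
#include <stddef.h>

typedef struct tnode {
    int key;
    struct tnode *left;
    struct tnode *right;
} TreeNode;

/* Each function appends visited keys to out[] starting at index count
   and returns the new count. The caller must provide enough room. */

size_t InOrder(const TreeNode *n, int *out, size_t count)     /* L, N, R */
{
    assert(out != NULL);
    if (n == NULL)
        return count;
    count = InOrder(n->left, out, count);
    out[count++] = n->key;
    return InOrder(n->right, out, count);
}

size_t PreOrder(const TreeNode *n, int *out, size_t count)    /* N, L, R */
{
    assert(out != NULL);
    if (n == NULL)
        return count;
    out[count++] = n->key;
    count = PreOrder(n->left, out, count);
    return PreOrder(n->right, out, count);
}

size_t PostOrder(const TreeNode *n, int *out, size_t count)   /* L, R, N */
{
    assert(out != NULL);
    if (n == NULL)
        return count;
    count = PostOrder(n->left, out, count);
    count = PostOrder(n->right, out, count);
    out[count++] = n->key;
    return count;
}
```

On a binary search tree, the in-order traversal yields the keys in sorted order.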

32 Graphs and Trees

§ A graph is defined as a set of vertices (nodes) which are joined by edges.
§ Generic graphs can be cyclic, consisting of a maze or network of connections (edges) from/to nodes, with any number of edges, vertices and paths. Each path may have an associated weight.
§ Trees are graphs which are acyclic, i.e. if you start traversing from one node, there is no path that leads back to that node.
§ A binary tree has at most two child edges for each node. E.g. an ancestral family tree (a hierarchy of children to parents, starting from one child).
§ An n-ary tree is a graph in which a node is connected to at most n nodes, with at most n edges terminating on each node. E.g. an organizational ladder or a direct family hierarchy tree. Here it is possible that a single child node can have multiple parents, or that there is a parallel reporting structure.
§ Common examples of graphs are:
  Ÿ Flight paths between cities
  Ÿ A maze/grid problem
  Ÿ Network connectivity between a network of computers
  Ÿ Telecommunication networks

33 Graphs and Trees

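§ Data structure for a graph:

    typedef int DATA_T;   /* assumed; the slide does not define DATA_T */

    typedef struct graph_t {
        DATA_T data;                       /* Data of this node */
        struct graph_t **neighbour_list;   /* List of pointers to all other nodes connected */
        int neighbour_count;               /* Number of entries in neighbour_list */
    } GRAPH_T;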

§ For a binary tree, the above structure reduces to a neighbor list always of length 2, namely the left and right pointers.

§ However, if the number of nodes and edges in a graph is known to be small and of fixed size, then the graph can be represented as a matrix, i.e. a two dimensional array. This is typically done for grids and mazes.

[Figure: graph with edges A -> B (weight 1), A -> C (weight 5) and B -> C (weight 1)]

         A    B    C
    A    0    1    5
    B   -1    0    1
    C   -1   -1    0

  Ÿ 0 means that it is connected with a weight of 0
  Ÿ -1 means no connection
  Ÿ Any other positive number is the weight on that connection

§ Nearly all graph problems will somehow use a grid or network in the problem, but sometimes these will be well disguised. Secondly, if you are required to find a path of any sort, it is usually a graph problem as well. Some common keywords associated with graph problems are: vertices, nodes, edges, connections, connectivity, paths, cycles and direction.
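The matrix representation can be sketched as a two dimensional array, using the three-node example and the -1 convention for "no connection". The `PathWeight` helper is a hypothetical illustration of how the matrix is consulted; it is not part of the original material.

```c
#include <assert.h>
#include <stddef.h>

#define NODE_COUNT 3
#define NO_EDGE (-1)

enum { A, B, C };   /* row and column indices */

/* weight[from][to]: 0 or a positive weight if connected, NO_EDGE otherwise. */
static const int weight[NODE_COUNT][NODE_COUNT] = {
    /*          A        B        C   */
    /* A */ {   0,       1,       5 },
    /* B */ { NO_EDGE,   0,       1 },
    /* C */ { NO_EDGE, NO_EDGE,   0 },
};

/* Total weight along a sequence of nodes, or NO_EDGE if some hop is not connected. */
int PathWeight(const int *path, size_t length)
{
    assert(path != NULL);
    int total = 0;
    for (size_t i = 0; i + 1 < length; i++) {
        assert(path[i] >= 0 && path[i] < NODE_COUNT);
        assert(path[i + 1] >= 0 && path[i + 1] < NODE_COUNT);
        int w = weight[path[i]][path[i + 1]];
        if (w == NO_EDGE)
            return NO_EDGE;
        total += w;
    }
    return total;
}
```

With this matrix the route A, B, C costs 1 + 1 = 2, which is cheaper than the direct edge A, C at 5, and there is no route from C back to A.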

34 Minimum Spanning Tree (MST)

§ Spanning tree: a spanning tree of a graph is a sub-graph that contains all the vertices and is a tree (no cycles).
§ Assuming each edge has a weight, the MST is the cheapest subset of edges that keeps all the vertices connected without cycles.
§ Applications
  Ÿ Wherever connectivity is needed across multiple sites (example: telephone companies).
  Ÿ To be critically viewed during any network design.

[Figure: a weighted graph on the vertices A, B, C and D, next to its MST]
  • No loops should be there
  • C->D: is through C->A and A->D
  • B->D: is through B->C, C->A, A->D

    typedef struct {
        int node1;
        int node2;
        int weight;
    } EDGE, *EDGEP;

    typedef struct mst {
        EDGE *ep;
        struct mst *next_p;
    } MST, *MSTP;
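One well-known way to build an MST is Kruskal's algorithm: sort the edges by weight and keep each edge that does not close a cycle. The slide does not say which algorithm was intended, so the sketch below is only one possibility. It reuses an `EDGE` struct like the one above but returns the chosen edges in a plain array rather than the linked `MST` list; `BuildMst` and its helpers are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct {
    int node1;
    int node2;
    int weight;
} EDGE;

/* Follow parent links to the representative of v's connected group. */
static int FindRoot(const int *parent, int v)
{
    while (parent[v] != v)
        v = parent[v];
    return v;
}

static int CompareEdges(const void *a, const void *b)
{
    const EDGE *ea = a;
    const EDGE *eb = b;
    return (ea->weight > eb->weight) - (ea->weight < eb->weight);
}

/* Sorts edges[] in place by weight and copies the chosen edges into mst[],
   which needs room for node_count - 1 entries.
   Returns the number of edges chosen, or -1 if out of memory. */
int BuildMst(EDGE *edges, int edge_count, int node_count, EDGE *mst)
{
    assert(edges != NULL && mst != NULL);
    assert(edge_count >= 0 && node_count > 0);
    int *parent = malloc((size_t)node_count * sizeof *parent);
    if (parent == NULL)
        return -1;
    for (int i = 0; i < node_count; i++)
        parent[i] = i;                       /* every node starts in its own group */
    qsort(edges, (size_t)edge_count, sizeof *edges, CompareEdges);
    int chosen = 0;
    for (int i = 0; i < edge_count && chosen < node_count - 1; i++) {
        int r1 = FindRoot(parent, edges[i].node1);
        int r2 = FindRoot(parent, edges[i].node2);
        if (r1 != r2) {                      /* different groups: no cycle is formed */
            parent[r1] = r2;
            mst[chosen++] = edges[i];
        }
    }
    free(parent);
    return chosen;
}
```

If fewer than node_count - 1 edges are returned, the input graph was not connected.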

35 Examples: How to chose and design data structures

§ Snakes and Ladders: If you have to design a snakes and ladders game, what data structure would you use?

§ Things to note:
  Ÿ A pawn moves through the squares in a single direction (reversing on each alternate line).
  Ÿ A ladder jumps up to a square, vertically or diagonally.
  Ÿ A snake jumps down to a square, vertically or diagonally.

    struct square_t {
        int value;
        struct square_t * next;
        int snake_or_ladder_type;
        struct square_t * snake_or_ladder_ptr;
    };

§ Most data structures with multiple links will eventually reduce to a specialized graph data structure. The above structure is a graph with at most 2 children per node, not necessarily acyclic, since one can start from a square and come back to that square through a snake in another square. So it is not a binary tree.

36 Examples : How to chose right data structures

§ Editor Implement a basic text editor program that allows user to open/create a file and edit it and save the changes. (For editing, it should support movement of cursor in all four directions)

§ Search and Replace Write a program that replaces all occurrences of given string (say string1) in the input files by another string (say string 2).

§ Parenthesis mismatch detection Write a program that can detect and report unmatched parenthesis and braces in a C source file.


37 Examples: How to chose right data structures

Editor:
§ Things to consider while choosing the data structure
  Ÿ The editor has a standard window of, say, around 20 lines.
  Ÿ The cursor in an editor moves right and left, up and down.
§ Doubly linked list of lines, where each line is a doubly linked list of characters.
§ Dynamic array of lines. Movement of the cursor in the editor should be in all four directions. Having a dynamic array helps in moving up, down, left or right. However, resizing the array will be expensive.
§ The above two approaches have memory vs time tradeoffs. The first is optimal for memory, but has a time cost while traversing up or down. The second may give faster up and down traversals, but the time advantage is largely lost because of the realloc calls that might be needed, and if the size is large, it might result in non-optimal use of heap storage.

38 Examples: How to chose right data structures

Search and Replace:
§ Things to consider:
  Ÿ When a string is replaced, the new string may be smaller or larger than the original string. If it is smaller, the original line has to be shrunk; if it is larger, all subsequent words have to be shifted in the original file, so that nothing else is modified in the input file.
  Ÿ How do we read the file and process the data to find the string to replace: character by character, line by line, block by block, or the whole file at one time?
§ Since the input file has to be modified, and the replacement string may be shorter or longer than the original, we would need to shift subsequent characters, which is very expensive. Alternatively, we may write into another temporary file, then remove the original input file and rename the temporary file to the input file name at the end.
§ We need string buffers to hold the input string to be replaced and the replacement string. In addition, we need a line buffer to read a line of the file, after which we can search for the string, replace it and write the modified line to the output file.

39 Examples: How to chose right data structures

Parenthesis mismatch:

§ Things to consider:
  Ÿ Parentheses can be nested. When they are nested, the match takes place in a last in first out order.
§ A stack of opening parentheses is ideal for this problem. Every time an opening parenthesis is seen, add it to the stack. Every time a closing parenthesis is seen, pop an element from the stack.

[Figure: stack states for a sequence of braces. Each opening brace { is added to the top of the stack and each closing brace } pops the stack. Stack empty at end: matched. Stack not empty at end: mismatch.]

40 Thank you.

Information contained and transmitted by this presentation is proprietary to Wipro Limited and is intended for use only by the individual or entity to which it is addressed, and contains information that is privileged, confidential or exempt from disclosure under applicable law.