
Master Thesis
Thesis no: MCS-2011-28
December 2011

Parallel parsing of context-free grammars

Piotr Skrzypczak

School of Computing
Blekinge Institute of Technology
SE-371 79 Karlskrona, Sweden

This thesis is submitted to the School of Computing at Blekinge Institute of Technology in partial fulfillment of the requirements for the degree of Master of Computer Science. The thesis is equivalent to 20 weeks of full time studies.

Contact Information:
Author: Piotr Skrzypczak, 871216-P317
E-mail: [email protected]

University advisor(s): Olgierd Unold, PhD, DSc Faculty of Electronics, Wroclaw University of Technology, Poland

Bengt Aspvall, PhD School of Computing

School of Computing
Blekinge Institute of Technology
SE-371 79 Karlskrona, Sweden
Internet: www.bth.se/com
Phone: +46 455 38 50 00
Fax: +46 455 38 50 57

ABSTRACT

During the last decade an increasing interest in parallel programming can be observed. It is caused by the tendency to develop microprocessors as multicore units that can perform instructions simultaneously. A popular and widely used example of such a platform is the graphics processing unit (GPU). Its ability to perform calculations simultaneously is being investigated as a way of improving the performance of complex computations. Therefore, GPUs now have architectures that allow programmers and software developers to use their computational power in the same way as a CPU. One of these architectures is the CUDA platform, developed by nVidia. The aim of this thesis is to implement the parallel CYK algorithm, which is one of the most popular parsing algorithms, for the CUDA platform, so that it gains a significant speed-up in comparison with the sequential CYK algorithm. The thesis presents a review of existing parallelisations of the CYK algorithm, descriptions of the implemented algorithms (a basic version and a few modifications), and an experimental stage that includes testing these versions for various inputs in order to determine which version of the algorithm gives the best performance. Three versions of the algorithm are presented, from which one was selected as the best (giving about 10 times better performance for the longest instances of inputs). Also, a limited version of the algorithm is presented, which gives the best performance (even 100 times better in comparison with the non-limited sequential version), but requires some conditions to be fulfilled. The motivation for the thesis is to use the developed algorithm in GCS.

Contents

1 Introduction 7
  1.1 Problem definition...... 7
  1.2 Background...... 8
    1.2.1 GCS...... 8
    1.2.2 CUDA...... 8
  1.3 Aims and Objectives...... 9
  1.4 Research questions...... 9
  1.5 Expected outcomes...... 9
  1.6 Research methodology...... 10
  1.7 Thesis outline...... 11

2 Context-free grammars 12
  2.1 Grammar definition...... 12
  2.2 Chomsky hierarchy...... 12
  2.3 Chomsky normal form...... 13

3 CYK algorithm 14
  3.1 Parsing process...... 14
  3.2 Sequential CYK algorithm...... 14
  3.3 Existing parallelisations of CYK...... 16

4 CUDA 19
  4.1 Overview...... 19
  4.2 CUDA Enabled GPU architecture...... 20
  4.3 Programming model...... 20
  4.4 CUDA compute capability versions...... 23
  4.5 Memory Levels...... 23
    4.5.1 Global memory...... 23
    4.5.2 Constant memory...... 24
    4.5.3 Shared memory...... 24
    4.5.4 Registers...... 24
    4.5.5 Local memory...... 25
    4.5.6 Texture/Surface memory...... 25
  4.6 Driver API vs Runtime API...... 25
  4.7 CUDA C...... 25
  4.8 Atomic operations...... 26

  4.9 CUDA Occupancy Calculator...... 26

5 Application 28
  5.1 Environment...... 28
  5.2 Data representation...... 29
  5.3 Parallel CYK...... 30
    5.3.1 Basic version...... 30
    5.3.2 Modified version...... 34
    5.3.3 Version with shared memory...... 35

6 Experiments 37
  6.1 First round of experiments...... 38
    6.1.1 Tests...... 38
    6.1.2 Results...... 40
  6.2 Second round of experiments...... 45
    6.2.1 Tests...... 46
    6.2.2 Results...... 46
  6.3 Third round of experiments...... 47
    6.3.1 Limited version...... 47
    6.3.2 Tests...... 52
    6.3.3 Results...... 52

7 Discussion 53
  7.1 Research questions...... 53
  7.2 Discussion of time complexity...... 54
  7.3 Discussion of validity...... 55
  7.4 Conclusion...... 55

References 55

A CUDA 2.x compute capability 58

B GeForce 440 GT specification 59

C Descriptions of the grammars used in tests 60
  C.1 First and second round...... 60
    C.1.1 Grammar 1...... 60
    C.1.2 Grammar 2...... 60
    C.1.3 Grammar 3...... 60

  C.2 Third round...... 61
    C.2.1 Grammar 1...... 61
    C.2.2 Grammar 2...... 61
    C.2.3 Grammar 3...... 61
    C.2.4 Grammar 4...... 62
    C.2.5 Grammar 5...... 62

List of Tables

1 Times of algorithms’ executions, first round, grammar 1...... 38
2 Times of algorithms’ executions, first round, grammar 2...... 39
3 Times of algorithms’ executions, first round, grammar 3...... 39
4 Times of algorithms’ executions, second round, grammar 1...... 42
5 Times of algorithms’ executions, second round, grammar 2...... 43
6 Times of algorithms’ executions, second round, grammar 3...... 44
7 Times of algorithms’ executions, third round, grammar 1...... 49
8 Times of algorithms’ executions, third round, grammar 2...... 49
9 Times of algorithms’ executions, third round, grammar 3...... 50
10 Times of algorithms’ executions, third round, grammar 4...... 50
11 Times of algorithms’ executions, third round, grammar 5...... 51
12 CUDA 2.x technical specifications...... 58

List of Figures

1 CYK Algorithm run...... 16
2 Cells important for one thread...... 17
3 Architecture of CUDA enabled GPU...... 21
4 Threads configuration within CUDA application...... 22
5 One row of CYK table...... 30
6 Threads within basic version...... 32
7 Threads within modified version...... 34
8 CUDA Occupancy Calculator output for 17 and 18 registers per thread...... 35
9 Relationship between length of input and algorithms’ performance - Grammar 1, 1st round...... 40
10 Relationship between length of input and algorithms’ performance - Grammar 2, 1st round...... 40
11 Relationship between length of input and algorithms’ performance - Grammar 3, 1st round...... 41
12 Relationship between length of input and algorithms’ performance - Grammar 1, 2nd round...... 41
13 Relationship between length of input and algorithms’ performance - Grammar 2, 2nd round...... 45
14 Relationship between length of input and algorithms’ performance - Grammar 3, 2nd round...... 45
15 CUDA Occupancy Calculator output for 11 registers per thread...... 46
16 Relationship between length of input and algorithms’ performance - Grammar 4, 3rd round...... 51
17 Relationship between length of input and algorithms’ performance - Grammar 5, 3rd round...... 52
18 Relationship between length of input, size of grammar and modified algorithm’s performance...... 53
19 Relationship between length of input, size of grammar and limited algorithm’s performance...... 53
20 GeForce 440 GT specification...... 59

1 Introduction

1.1 Problem definition

Parsing is a process of syntactic analysis of input data, given as text, in order to determine its grammatical structure according to a given grammar. Usually the output of the parsing process is the answer whether the given text belongs to the language that the given grammar describes. If so, an additional output can be the syntax tree of the input text, which represents the meaning of the text. The process of parsing is crucial for systems which are dedicated to work with natural or artificial languages, e.g. interpreters of scripting languages, compilers, and systems which concern pattern recognition [9]. There is a wide spectrum of parsing algorithms dedicated to the parsing process, from which the most effective (in terms of computational complexity) are the CYK algorithm [16] and the Earley algorithm [7]. Therefore, those algorithms are usually used for parsing [9].

It can be estimated that the time of parsing is a significant part of the time of the whole process of computing text in those systems (e.g. for interpreting HTML with CSS and JavaScript it is approximately 40 %) [7]. The time consumption of the process increases greatly in systems dedicated to working with large inputs, such as RNA sequences [23]. Therefore, the process of parsing is usually the bottleneck of text computing. That is the reason and motivation for looking for ways of improving the performance of parsing. One of the environments where the process of parsing is conducted frequently is GCS - the Grammar-Based Classifier System. This is the system which is the motivation and purpose for investigating the possibilities of speeding up the parsing process. Because GCS uses the CYK algorithm, investigating this algorithm will be the focus of the thesis.

The idea for improving the effectiveness of the CYK algorithm is its parallelisation - an implementation for a multi-core system. Because of the previously mentioned, proven best computational complexity of CYK, and its use in GCS, this thesis concerns only this algorithm.

The CYK algorithm [16] is a dynamic programming algorithm which fills a triangular matrix whose height and width equal the length of the parsed sentence. The table is filled according to the given grammar rules. The first row of the matrix represents the parsed sentence (each cell contains a token that corresponds with a particular word of the sentence). The cells of the following rows contain the tokens which correspond with the subsequent, longer sequences of symbols. During each step of the algorithm, which corresponds to one cell of the matrix, the algorithm checks entries in the previous rows of the matrix in order to find pairs of tokens that can be mapped into a new token, according to the grammar rules. If such a rule exists, the token is written to the cell.

Parallelisation of the algorithm can be done by dividing the steps of the algorithm between different threads (processors). The current state of the art for this method is as follows: several approaches for the parallelisation of parsing algorithms were introduced for CYK [8][16][9][24]. Also, some modifications of these algorithms for dealing with RNA structures were introduced: for sequence alignment [23], or for sequence prediction [14]. According to this, there are some existing approaches and models for parallelising these algorithms.

Nevertheless, the implementations of these algorithms demand access to a multi-processor system. A particular example of a multiprocessor system, which is rather cheap and accessible, is the Graphics Processing Unit (GPU). The thesis concerns the implementation of the parsing algorithm for a GPU, using nVidia CUDA technology [10][20][21]. Using CUDA is a suggestion from one of the supervisors. The choice of and motivation for using this technology is justified by its availability, as well as its small cost and accessibility, which allows using it on a regular PC with a CUDA-supported graphics card.

The goal of the master thesis is to deliver a parallel implementation of the CYK algorithm for parsing context-free grammar languages. The expected outcome is an implementation in which the time of the threads' execution, their synchronization, and the data distribution between them is significantly shorter than the time of the sequential algorithm execution.

1.2 Background

1.2.1 GCS

GCS - Grammar-Based Classifier System - is a particular form of learning classifier system (LCS) [28]. LCS is a system in which knowledge is represented as a set of condition-action rules, called classifiers. Each classifier represents a different output. Each of them also has a prediction value, which indicates the reward that is expected when the classifier is used, and a fitness value, which represents its usability for the environment in which it is used. During the LCS learning process, for each given input from the learning set there is also a reinforcement given, which represents a reward if the output is correct, and a penalty if the output is wrong. The system chooses the classifier that fits best with the given input, and responds with the action which corresponds to the chosen classifier. If the output was correct (expected), then the chosen classifier is rewarded with reinforcement (its fitness value increases); otherwise, it is punished (its fitness value decreases). The last stage is a genetic algorithm, which manages the population of classifiers. According to the fitness values, it can choose the best classifiers, create new ones (by cross-over or mutation), and reject the worst. The goal of the learning process is to obtain a set of classifiers which will answer correctly for each possible input.

GCS, introduced in 2005 [28], is a system where every classifier is represented as a grammar rule in Chomsky Normal Form (CNF). Input data is given in the form of a text string; the learning set is a set of correct and incorrect sentences. During the learning process, inputs are parsed according to the classifiers, and the classifiers are created and evaluated by the genetic algorithm. The goal is to deliver a population of classifiers that forms a grammar able to parse every correct input, and unable to parse every incorrect input. This system was successfully applied to many environments, such as natural language processing [25] or biosequences [26]. However, it still demands improvement in the field of parsing, which is the most critical stage of the process (it is executed many times during the learning).

1.2.2 CUDA

CUDA (Compute Unified Device Architecture) is a parallel computing architecture for GPUs (and other multi-core systems), introduced in 2007. This architecture enables using a set of parallel processors for executing numerical calculations, instead of sequential processing. Also, sets of processor units may use shared memory, which gives faster access to the data than the global memory. It delivers a special C language extension, containing instructions for dynamic allocation of GPU memory, and instructions for executing one function as many threads in parallel. Using CUDA demands having a GPU which supports this technology. However,

it can also be emulated, so continuous access to the device during the work is not necessary. The idea of implementing CYK for CUDA is as follows [21]: calculating each cell in the matrix can be done by an individual thread. There are two main challenges in this implementation: firstly, because there are dependencies between rows, only cells from the same row can be calculated simultaneously, so proper synchronization of threads is needed in order to calculate only those cells for which the data actually exists. Moreover, the most time-consuming operations are global memory accesses. For this reason there is a problem of proper distribution of data to the threads' shared memory.
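To illustrate this idea, the following is a minimal sketch of how one row of the CYK table could be computed by a CUDA kernel, with one thread per cell and one kernel launch per row acting as the synchronization point between rows. It is only an illustration of the scheme described above; the identifiers, the encoding of a cell as a 64-bit bitmask of non-terminals, and the flat rule arrays are assumptions made for this sketch, not the implementation evaluated later in the thesis.

/* One thread computes one cell of the given row. A cell is assumed to be a
   bitmask of non-terminals; binary rules are given as three parallel arrays. */
__global__ void cykRow(unsigned long long *table, int n, int row,
                       const int *ruleLhs, const int *ruleRhs1,
                       const int *ruleRhs2, int numRules)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col >= n - row) return;                 /* the row has n - row cells */

    unsigned long long result = 0;
    for (int k = 0; k < row; k++) {
        unsigned long long left  = table[k * n + col];
        unsigned long long right = table[(row - k - 1) * n + col + k + 1];
        for (int r = 0; r < numRules; r++)
            if ((left  & (1ULL << ruleRhs1[r])) &&
                (right & (1ULL << ruleRhs2[r])))
                result |= 1ULL << ruleLhs[r];
    }
    table[row * n + col] = result;
}

/* Host side: one launch per row; row 0 (terminal rules) is assumed to be
   filled beforehand. Finishing a launch guarantees the whole row is complete
   before the next one starts. */
void parseRows(unsigned long long *dTable, int n, const int *dLhs,
               const int *dRhs1, const int *dRhs2, int numRules)
{
    for (int row = 1; row < n; row++) {
        int cells = n - row;
        cykRow<<<(cells + 255) / 256, 256>>>(dTable, n, row,
                                             dLhs, dRhs1, dRhs2, numRules);
    }
    cudaThreadSynchronize();
}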

1.3 Aims and Objectives

The main aim of the research is to deliver a parallel CYK parsing algorithm implementation for CUDA, based on an existing parallel model of this algorithm, which will be more effective than currently used solutions.

The objectives of this project are:

• research in literature for finding existing parallel CYK algorithm models and implementations,

• implementation of an existing parallel CYK algorithm in the CUDA environment,

• implementation of new parallel CYK algorithms, adjusted to the possibilities of the CUDA environment,

• evaluating the algorithms' effectiveness,

• optimizing the CUDA environment for gaining the best efficiency of these algorithms,

• identifying the benefits (the improvement) of using the CUDA algorithms.

By formulating the problem this way, we are stating the hypothesis of the research: that this kind of improvement is possible to achieve (there are no obstacles to implementing it, e.g. memory access limitations).

1.4 Research questions

There are four main research questions in this study:

• Does the CUDA implementation of CYK give a significant speed-up?

• How can this implementation be further improved (by modifying first implementation)?

• What is the best thread configuration for the specific version of platform and algorithm?

• How effective is delivered solution in comparison with the existing ones?

1.5 Expected outcomes

The expected outcomes of the study are:

• parallel implementations of CYK algorithm,

• evaluation of the algorithms, comparison between each other and the sequential version,

• choice of the best solution - one version, which can be used in a GCS.

9 1.6 Research methodology

The first part of the project was the literature study. The study allowed gaining knowledge about the problem. As the problem of parsing is very significant, it can be assumed that a lot of approaches and research have been realized in this field. It was needed to find papers which describe the existing parsing algorithms (CYK) and approaches for designing them for parallel systems. It was also very important to find and use existing approaches. If there were attempts at parallelising this algorithm, it would be useful to find all papers which describe these approaches, instead of reinventing them. For this part of the research, a systematic literature review [11] was conducted. It required formulating the keyword phrases and performing a search for existing papers and articles, using available search engines. Because this problem is not a new issue, it was not necessary to restrict the search to new articles only. Both new and older works and papers connected with this issue were useful. Also, the literature provided by the supervisor was analyzed, as a good source of knowledge.

Several databases were searched: ISI, Scholar, ACM, IEEE, Scopus. The significant keyword phrases were:

• parallel parsing,

• parallel CYK,

• simultaneous parsing,

• simultaneous CYK,

• parallel parsing CUDA,

• parallel CYK CUDA,

• simultaneous parsing CUDA,

• simultaneous CYK CUDA.

The second part of the thesis was focused on the platform for the implementation of the solution. By default, the CUDA architecture was suggested by the supervisor and chosen because of its advantages and adequacy. This technology is the one on which this thesis focuses; however, there is a series of concurrent technologies for GPU programming (like OpenCL, for instance). Choosing CUDA can be justified by its availability and effectiveness. Also, a point of the thesis was to deliver a model of the algorithm that can be easily implemented on other platforms with a similar architecture. For getting knowledge about CUDA, the materials delivered by nVidia on their web page were investigated. The website contains full documentation of the project, presentations, tutorials, and university lessons about CUDA. Because it is a rather new project, there is a lack of handbooks available for learning it. However, the materials delivered by nVidia on their website were sufficient.

The third stage of the thesis was the experimental stage. It demanded implementing the sequential algorithm, designing the models of the algorithms for the GPU, and implementing these algorithms for the GPU. The performed experiments focused on executing these versions of the algorithm for prepared input data, in order to find the best and most universal solution. The experimental stage was iterative - after performing influential optimization changes in the algorithm, the experiments were performed again in order to check if the changes were beneficial. There were three rounds of experiments. In each round, different implementations were tested (four in the first and second round, two in the third).

The input data for the algorithm is the input word, which should be parsed, and the grammar of the language. Thus, the experiments were performed for testing data sets which contained words of different lengths and different grammars. A proper number of experiment instances was chosen, in order to ensure that the experiment is authoritative. The value which justifies the effectiveness of the algorithms is, obviously, the time of execution, so the execution time of every instance was recorded. This part resulted in a set of data which evaluates the implementations, and could be used to choose the most effective one.

1.7 Thesis outline

The thesis is organized as follows:

• Chapter one describes the background of the thesis, its main objectives and outcomes.

• Chapter two is dedicated to the definition and description of grammars, the four components of each grammar, and grammar types.

• Chapter three describes the parsing process and algorithms that are used for parsing. It also gives a precise description of the CYK algorithm, as well as a literature review regarding existing attempts at its parallelisation.

• Chapter four describes main aspects of CUDA platform, from the perspective of this thesis.

• Chapter five describes the application prepared for testing the algorithms. Also, it describes the details of each implementation of the algorithm.

• Chapter six describes the experimental stage of the thesis, and its results.

• Chapter seven concludes the thesis.

2 Context-free grammars

2.1 Grammar definition

In general, a formal language uses an alphabet (a set of letters), which can be used to create words (finite, ordered series of letters). The set of possible words is infinite; however, a language uses only a finite subset of them. A formal grammar is a description of the language formulated as a set of rules. The rules describe the way of forming the words of the formal language from the language alphabet symbols.

Formally, grammar G contains four components [27]:

• set of terminal symbols, which are equivalent to the letters of language,

• set of auxiliary symbols (called non-terminal symbols),

• set of production rules, in the form α → β, where α and β are words of the language. Each production is defined as the assignment of a series of terminal and non-terminal symbols to another series. The word αn is derived directly from the word α1 if the production rule α1 → αn exists, and indirectly when the series of production rules α1 → α2 → α3 → ... → αn exists,

• start symbol (one of non-terminal symbols). Each word of the language is derivable from this start symbol, by applying set of the production rules.

In the following chapters of the thesis, the following notation will be used: α, β, γ for any string of terminal or non-terminal symbols, A, B, C for single non-terminal symbols, and a, b, c for single terminal symbols. The non-terminal S indicates the start symbol.

2.2 Chomsky hierarchy

Grammar types differ with the form of production rules. According to Chomsky hierarchy of formal grammars [27] there are 4 levels of grammars, from type 0 to type 3. The higher grammar type requires more strict production rules than the lower level, therefore each type of grammar includes all higher types.

Four levels of Chomsky hierarchy are:

• Type 0 – recursively enumerable grammars, or unrestricted grammars. This set includes all formal grammars with production rules in the form s1 → s2, where s1 and s2 are any possible words. Languages generated by those grammars are recognizable by a Turing Machine. The decision problem of finding whether a given word α belongs to the language defined by a given unrestricted grammar G is undecidable.

• Type 1 – context-sensitive grammars. The production rules in this set have the form αAβ → αγβ, with a non-terminal symbol A and strings α, β, γ of terminal and non-terminal symbols (for example aAc → abc, with terminals a, b and c). In context-sensitive grammars, the production rule always defines the assignment for the non-terminal symbol depending on the symbols on its left and right side (its context). In other words, the meaning of a concrete symbol depends on its context - its position in the word. Languages generated by those grammars are recognizable by a linear-bounded Non-Deterministic Turing Machine. Context-sensitive grammars are used for describing natural languages.

• Type 2 – context-free grammars. The production rules in this set have the form A → α, with a non-terminal symbol A and a series of terminal and non-terminal symbols α. Production rules of this type of grammar are insensitive to the context of the symbol. Languages generated by those grammars are recognizable by finite state machines with an infinite stack (pushdown automata). For artificial formal languages, such as programming languages, context-free grammars are usually sufficient.

• Type 3 – regular grammars. The production rules in this set have the form A → a, where A is a non-terminal symbol and a is a terminal symbol, or A → aB, where B is also a non-terminal symbol. Languages generated by those grammars are recognizable by finite state machines.

For the purposes of this thesis, only context-free grammars are considered, as they are the most useful and sufficient for describing formal languages, they are used in GCS, and they can be processed by the CYK parsing algorithm.

2.3 Chomsky normal form

Every context-free grammar can be expressed in a Chomsky normal form. This form of grammar requires all production rules to be in a form of either:

• A → a (non-terminal symbol is mapped to the terminal symbol), or

• A → BC (non-terminal symbol is mapped to the pair of non-terminal symbols).

The algorithms that will be further described and implemented require the input grammar to be in Chomsky normal form. This is the form which is used within the GCS system to describe classifiers; therefore, it is not a limitation within the context of this thesis. However, general instances of this problem (with grammars which are not in CNF) demand converting them to CNF. According to [13], converting a context-free grammar to CNF may result in a significant increase in the number of non-terminal symbols (in some cases, the increase can even be exponential). Therefore, such a requirement influences the computational complexity of the algorithm. This issue can be a limitation for a wide use of the presented algorithms, and is a significant reason for future work in this field.
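As a small illustration of such a conversion (an example constructed here, not one of the grammars used in the thesis), consider a single context-free rule that is not in CNF:

A → aBc.

It can be rewritten by introducing fresh non-terminals for the terminals on the right-hand side and then splitting the right-hand side into pairs:

A → Xa Y, Y → B Xc, Xa → a, Xc → c.

Both versions derive exactly the same strings from A, but this one rule already required three additional non-terminal symbols (Xa, Xc, Y), which shows how the conversion grows the grammar.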

3 CYK algorithm

3.1 Parsing process

In terms of computer science, the process of parsing (or syntax analysis) is a process of analyzing input text data, given as a series of tokens (e.g. letters), in order to determine its structure according to a given formal grammar. The output of the parsing process is the answer whether the computed string belongs to the language described by the grammar or not. An additional output is a structure which represents the parsed text, e.g. a syntax tree. Constructing the syntax tree requires remembering additional information during the algorithm's run. The basic version of the algorithm only answers whether the word belongs to the language or not.

The process of parsing cannot be conducted efficiently on every kind of grammar. Effective parsing algorithms require context-free grammars, which have some bounds and restrictions in describing languages (e.g. the lack of context sensitivity, mentioned above). Therefore, not every language can be expressed by this kind of grammar.

There are two approaches to the parsing process: top-down and bottom-up. The first approach (top-down) parses the given text starting from the left, attempting to apply the start production rule first and then the rules derived from it. The second one (bottom-up) applies the production rules to the basic tokens, and replaces them with the corresponding non-terminal symbols, with the goal of obtaining the start symbol at the end.

3.2 Sequential CYK algorithm

The problem of parsing has been investigated for a long time, due to its significant role in many fields of computer science. Efficient recognition and parsing algorithms for context-free grammars have been the subject of interest of many scientific works [17]. Sequential approaches to parsing have been well studied. One of the most effective and, therefore, widely-used algorithms [16] is the CYK (Cocke-Younger-Kasami) algorithm, which is the subject of this thesis.

The CYK algorithm is a dynamic programming algorithm that represents the bottom-up parsing approach. Basically this means that every entry in the table on which the algorithm is working depends on the previous entries in this table. Because of its time complexity of O(gn³), where g is the size of the grammar (number of rules) and n is the length of the input string, and its spatial complexity of O(gn²), it is one of the most efficient recognition and parsing algorithms [8]. It is the most effective in its worst-case scenario. The main disadvantage of CYK is that it requires the grammar, according to which the process of parsing is conducted, to be in CNF. This disadvantage is not a limitation, since every context-free grammar can be transformed into an equivalent CNF form (this problem was discussed in section 2.3). Also, GCS is a system which is dedicated to work on CNF grammars, so within the context of GCS there is no limitation. The problem of this limitation can also be addressed by investigating a different algorithm, such as the Earley or Earley-Packrat algorithm, since these algorithms do not require the grammar to be in CNF. For the CYK algorithm, this problem can be addressed as a problem of parallelisation of the grammar conversion, because the algorithm that converts a grammar to CNF can also be optimized by implementing it for CUDA.

The CYK algorithm works as follows: it fills a triangular n × n matrix, in which every cell is a list of non-terminal symbols, initially empty. The input of the algorithm is a string in the form a1a2a3...an, which will be computed, and a given grammar G in CNF. The process of filling the table can be divided into two main stages. During the first stage the algorithm iterates over the first row of the matrix, filling the cell with index i with the non-terminal symbol A according to the terminal rules A → ai in the grammar. The i-th cell corresponds with the i-th element of the input string ai. During the second stage of the process, the algorithm iterates over all substrings, and searches for non-terminal symbols from which the subsequent substrings can be derived. It starts from all substrings of length 2 for the second row in the first iteration, then increases the length of the substrings for the following rows in the next iterations, up to the whole string in the last row in the last iteration. For each cell, the algorithm checks whether two neighboring substrings that were already parsed to non-terminals can be derived from another non-terminal symbol. The algorithm iterates over all cells of the matrix, putting non-terminal symbols into them according to the grammar rules. For each cell Ci,j, and every k such that 0 ≤ k < i, the pair of cells Ck,j and Ci−k−1,j+k+1 is checked, and if there is a rule A → BC such that B is a non-terminal symbol in Ck,j and C is a non-terminal symbol in Ci−k−1,j+k+1, then the non-terminal symbol A is assigned to Ci,j. In other words, Ci,j contains the non-terminal symbol A if the substring aj...aj+i can be derived from A. The algorithm recognizes the input string according to the given grammar if the start symbol of the grammar is in the Cn−1,0 cell of the matrix.

Listing 1 shows the sequential version of the described basic CYK algorithm in pseudocode:

Algorithm 1 CYK Algorithm

Require: grammar G, input string α = a1a2a3...an
  for all ai, 0 ≤ i < n do
    if A → ai ∈ G then
      Add A to C0,i
    end if
  end for
  for all 1 ≤ i < n do
    for all 0 ≤ j < n − i do
      for all 0 ≤ k < i do
        for all A → BC ∈ G, B ∈ Ck,j and C ∈ Ci−k−1,j+k+1 do
          Add A to Ci,j
        end for
      end for
    end for
  end for

For example, with the given grammar G = {{A, B, C, S}, {a, b}, {S → AC, C → SB, S → AB, A → a, B → b}, S} and the input string aabb, the steps of the algorithm will look as presented in figure 1.

For the C3,0 cell, the pairs C0,0C2,1, C1,0C1,2 and C2,0C0,3 will be checked. Because for the first pair there is a rule S → AC, the S symbol will be assigned to the last cell. It can be easily observed that to calculate the value of each cell, only the data from the column and diagonal corresponding to this cell are necessary (as shown in

15 Figure 1: CYK Algorithm run

figure 2). This observation is really important when considering the distribution of data between processing units in parallel versions of CYK, discussed later in the text.
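The listing and example above translate directly into a compact sequential implementation. The following sketch in plain C is given only as an illustration of the algorithm (it is not the implementation described in chapter 5); the representation of a cell as a 64-bit bitmask of non-terminals, the structure names and the constant MAX_N are assumptions made here.

#include <stdio.h>
#include <string.h>

#define MAX_N 64   /* maximum input length assumed for this sketch */

/* Grammar in CNF, non-terminals numbered 0..63, start symbol = 0.
   Terminal rules: A -> a, binary rules: A -> B C. */
typedef struct { int lhs; char rhs; } TermRule;
typedef struct { int lhs; int rhs1; int rhs2; } BinRule;

/* Each cell of the CYK table is a bitmask of non-terminals. */
static unsigned long long table[MAX_N][MAX_N];

int cyk(const char *word, int n,
        const TermRule *tr, int num_tr,
        const BinRule *br, int num_br)
{
    memset(table, 0, sizeof(table));   /* n is assumed to be <= MAX_N */

    /* First stage: terminal rules fill row 0. */
    for (int i = 0; i < n; i++)
        for (int r = 0; r < num_tr; r++)
            if (tr[r].rhs == word[i])
                table[0][i] |= 1ULL << tr[r].lhs;

    /* Second stage: rows 1..n-1, substrings of length i+1 starting at j. */
    for (int i = 1; i < n; i++)
        for (int j = 0; j < n - i; j++)
            for (int k = 0; k < i; k++)
                for (int r = 0; r < num_br; r++)
                    if ((table[k][j]                 & (1ULL << br[r].rhs1)) &&
                        (table[i - k - 1][j + k + 1] & (1ULL << br[r].rhs2)))
                        table[i][j] |= 1ULL << br[r].lhs;

    /* The word is accepted if the start symbol (0) is in the top cell. */
    return (table[n - 1][0] & 1ULL) != 0;
}

With the grammar from figure 1 encoded as S=0, A=1, B=2, C=3, the example run can be reproduced as follows:

int main(void)
{
    TermRule tr[] = { {1, 'a'}, {2, 'b'} };              /* A -> a, B -> b */
    BinRule  br[] = { {0, 1, 3}, {3, 0, 2}, {0, 1, 2} }; /* S->AC, C->SB, S->AB */
    printf("%d\n", cyk("aabb", 4, tr, 2, br, 3));        /* prints 1 (accepted) */
    return 0;
}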

3.3 Existing parallelisations of CYK

In order to improve the effectiveness of the algorithm, several attempts at its parallelisation have been made. The first attempt at parallel recognition was based on arrays of finite-state machines. In 1979, S. Kosaraju [12] presented a parallel implementation of the CYK algorithm for a two-way two-dimensional array of n² finite-state machines. He proved that recognition of a string of length n can be performed with computational complexity O(n). Later, J. Chang et al. [2] proposed a parallel CYK algorithm for a two-dimensional array of n² finite-state machines, with linear times of parsing and recognizing. Also, several similar solutions, but using Earley's algorithm, were considered. The VLSI implementations of these algorithms were presented by Y. Chiang et al. in [3]. However, these approaches were based on the assumption that the number of processors depends on the length of the input string n. In real systems, a given architecture of parallel processors contains a constant, fixed number of them. Therefore, the solutions mentioned above are not easy to implement on real systems [9].

Figure 2: Cells important for one thread

Particular solutions for the parallel CYK algorithm were extensively described by A. Nijholt [16]. He presented the sequential CYK in a few different versions. Also, he presented ideas of how the tasks of the algorithm can be divided among a number of available processors, for each version. The parallel version of tabular CYK (which is the main form of the algorithm) that he described was based on the assumption that the processing units are arranged in a matrix that covers the CYK matrix. In other words, each processor is responsible for calculating the values of one cell. Also, communication between the processors is done in one direction. Processors transfer the table entries they have calculated to the neighboring processors that require these entries. It is assumed that every unit receives data from the units that are one row lower in the corresponding column and diagonal. Similarly, it sends this data, expanded by the results computed by it, to the units that are one row higher in the same column and diagonal. A disadvantage of this approach is that many processors finish their calculations rather quickly and remain idle till the end of the algorithm. Also, although Nijholt gave an exhaustive description of CYK parallelisation, there is a lack of any real implementations of it. He did not show the effectiveness of the proposed approaches - this was only a theoretical contribution.

Later, O. Ibarra et al. [9] presented the algorithm which deals with the situation, where the number of processors is lower than number of tasks. In their approach every unit is responsible for calculating all the cells in a set of columns, instead of assigning each processor to only one cell. Moreover, they proposed the hypercube model of architecture, which reduces the cost of communication between processes. Later on, they extended their approach by introducing dynamic distribution of tasks. It allows reusing idle processors, by assigning the cells to the units after every set of iterations (while number of cells in the row is decreasing). In their article they also present the results of implementing their solution on the NCUBE7 Hypercube machine. They showed the speedup gained by these two algorithms (with and without dynamic partitioning) in comparison with sequential version of the algorithm, with promising results. For some instances of the problem, their version was even 30 times faster [9].

J. Hill et al. [8] showed an approach similar to the one described above, with an algorithm that assigns all cells from a set of neighboring columns to each processor. They showed that within every set, it is more effective to calculate the cells in diagonal order than calculating them row after row. Before the beginning of the first stage, the grammar description and the substring of the input string are distributed between the processors. Then, in each iteration, some data from the next processor are collected. After calculating, some required data are distributed to the previous processor. They analyzed the results of their implementation of this approach for architectures with 3 and 7 processing units. They showed that their approach can work approximately 2 times faster in the first case, and 1.5 times faster in the second.

T. Ninomiya et al. [24] proposed an algorithm similar to the one presented by Nijholt. They proposed the algorithm for large-scale distributed-memory parallel machines. Their approach to CYK was, similarly to Nijholt's, to divide the computation into processes, each one responsible for a particular CYK table entry. In contrast to Nijholt's solution, where every unit communicates only with the neighboring units, they proposed asynchronous communication between processors. In their model, each unit is able to send data directly to the unit that requires it. Each process, in each iteration, waits until the processes that calculate the data demanded in this iteration are ready. As soon as the computation is done, the processor is able to send the result to all processors that require it. In other words, they parallelised the iterations of the two outer loops of the algorithm, assigning the most nested loop to every process. Their approach was evaluated by implementing it on a machine with 256 nodes. Running this implementation gave approximately 40 times better results for the longest instances they tested.

Later, some other dynamic programming algorithms similar to or based on CYK were introduced and implemented, for example an RNA alignment algorithm [23], or an RNA structure prediction algorithm [1]. These articles showed significant speedups of the parallel solutions in comparison with the sequential versions.

Despite the fact that the problem of parallel parsing is widely described in the professional literature, these solutions are still not widely used. Authors often point out the main problems of parallel parsing algorithms, which in fact concern all parallel dynamic programming algorithms. These problems are:

• High cost of communication between processors, which mostly influences the time complexity of implementations.

• Problem of proper distribution of the tasks to the processors. The solutions mentioned above always face the problem of the last, longest stage of the algorithm, which has to be done sequentially.

• Problem of re-using the processors that are idle, by redistributing the tasks.

Also, the problem with parallel algorithms is that even if the experimental results are promising, the programs can run only on expensive, large-scale multiprocessor computers. For this reason, these solutions could not be distributed for wide usage [10]. Therefore, this issue was abandoned for a long time. Now, as the relevance of parallel algorithms increases, due to the fact that multi-core architectures are becoming more popular and available [7], as well as the introduction of architectures like CUDA or OpenCL, it is worth reconsidering these approaches.

4 CUDA

4.1 Overview

During the last decade the development of single central processing units (CPUs) has slowed down. It is caused by restrictions, such as energy consumption and heat dissipation, that no longer allow increasing the clock frequencies and the amount of CPU activity in a time period [10]. Thus, the main tendency of CPU development switched to the multiprocessing model, which is based on many processing cores that can work in parallel, instead of one. On the other hand, the increasing emphasis on developing powerful graphics processing units (GPUs), demanded by the expectations of the market, caused their evolution into highly multi-threaded, multi-core parallel processors. Therefore, and because of their wide availability, the computational power of GPUs is now often used as a platform for parallel programming. These platforms give a significant speedup in comparison to sequential solutions. This is why parallel programming should be considered as a way of getting better performance. Obviously, parallel programming is not a new approach. Parallel applications were developed in the past, for high-cost large-scale computers.

The CUDA architecture, introduced by nVidia in 2007, allows programmers to access the computational power of the GPU without having extensive knowledge about graphics APIs (application programming interfaces), like OpenGL or Direct3D. Before CUDA was introduced, several other ways of programming GPUs were used (such as C for Graphics - Cg, OpenGL extensions, or the High Level Shader Language - HLSL). However, the programming possibilities of these solutions were limited. What is more, this way of programming demanded having knowledge about graphics programming. Units that support CUDA differ in their hardware [10] from other GPUs. Thanks to this, programming graphics units can be done using the provided general interface. It allows developers to use the device memory and execute functions on the multiprocessors.

CUDA enables access to the multiprocessors of the graphics card, and allows a programmer to use them for any calculations in a similar way as programming standard sequential applications. Because of the rapid development of the computational power of GPUs (96 cores in one of the first CUDA enabled devices from 2007, and 1024 cores in one of the newest devices, the GeForce GTX 590, give almost a 10 times increase of computational power [21]), CUDA is also promising for future development.

Later in the paper, a CUDA-enabled GPU device will simply be called a device.

The Instruction Set Architecture for CUDA enabled devices is called Parallel Thread Execution (PTX) ISA [19]. Instructions within this set are able to access the device's multiprocessors and memory. Programming in CUDA is possible using the PTX assembly language; however, the CUDA API (see section 4.6) delivers tools for programming in ANSI C with extensions, and in many high-level languages as well.

The main components that form the CUDA software development environment are: CUDA enabled drivers, the CUDA compiler (nvcc), the CUDA Software Development Kit, and an emulator of a CUDA enabled device (up to the 3.0 version). Additionally, language bindings for many programming languages (such as Python, FORTRAN or Matlab) are provided.

19 4.2 CUDA Enabled GPU architecture

In general, the architecture of a CUDA enabled GPU (a unit that supports CUDA) is organized as an array of streaming multiprocessors (SMs), grouped into building blocks. Every SM contains an array of streaming processors (SPs) for integer and floating point operations (single precision in 1.x, double precision in 2.x versions of CUDA), each one of which is multi-threaded. Each unit also has a global DRAM memory, accessible from every SM, which contains all the textures, images and matrices used by the unit, giving better bandwidth than CPU RAM memory. The number of SPs within one SM also depends on the CUDA version; 8 SPs for CUDA 1.x, and 32 for 2.x. Every streaming processor has additional special function units for double-precision floating point operations. Also, every streaming multiprocessor has its own data cache, texture cache, and scheduler (one in CUDA 1.x, two in 2.x) - the unit responsible for managing the sequence of threads' execution. A simplified scheme of the CUDA device architecture is presented in Figure 3.

Depending on the version of the device, the architecture can vary in the number of SMs in the whole matrix, the number of SMs in a block, or the number of SPs in every multiprocessor, as well as in the sizes of memories. The architecture differs also in the number of threads that can be assigned to one SM (up from 512 in the G80 card). Because of these differences, an application written for CUDA is highly device-dependent and, therefore, should be adjusted to the version of CUDA on which calculations are being performed, in order to gain the maximum performance available for a device.

4.3 Programming model

Computing a large set of data by a program obviously results in a long time of its execution. Parallelism in programming refers to the process of dividing the sequence of calculations that has to be performed into a series of arithmetical operations, each of which can be executed simultaneously and does not influence the other operations. On the programming level, CUDA uses an abstract architecture of threads and blocks. Within this architecture, all threads are grouped into blocks, and each block contains the same number of threads. The configuration of threads and blocks is chosen by the programmer. Depending on this configuration of threads and blocks of threads, they can differ in the way they can be synchronized or how they can use shared resources. This is because, based on that configuration, threads within one block are (during the execution of the program) assigned to the same streaming multiprocessor. Threads within one SM can be synchronized and can use the same cache. Threads that are in different blocks can be assigned to different units, and, if so, can neither be synchronized nor share the memory. More threads in a block give the possibility to synchronize them and let them communicate with each other. However, this solution may result in worse occupation of the SMs (because of the limit on the number of threads that can be executed on each of them). Therefore, choosing the proper configuration is crucial for performance.

Figure 4 shows the scheme of the CUDA threads architecture. A top-level structure that includes all threads created to perform a particular set of calculations is called a grid. A grid is a one- or two-dimensional array of blocks. Each block is a one-, two- or three-dimensional set of threads. In this case, the grid has a width of four and a height of two blocks. Each block has a width of three and a height of two threads. The number of threads in every block is the same. The reason for grouping threads into blocks, as was mentioned two paragraphs above, is that threads within one block can be synchronized, and they can also efficiently use shared memory - a very fast memory module assigned to every block.

20 Figure 3: Architecture of CUDA enabled GPU

Basically, the code of a parallel application is divided into phases. Each phase is either a host or a device phase. The host phase is the part of the code that will be executed on the CPU. The device phase is the code that will be performed on the device - the GPU. Host code includes instructions for collecting data, allocating and copying them to the device memory (since CUDA up to version 2.0 does not allow threads to have access to the CPU internal memory), collecting data from the device memory, and calling the kernel function. A kernel is a function - in terms of the C language - which is to be performed on the GPU. While invoking each kernel function, some additional arguments are required. These arguments are: the configuration of threads and blocks, in the way the programmer wants them to be performed, and the size of shared memory. Invocation of a kernel results in creating a set of threads. Later, the kernel code is executed by them independently.

The size and number of dimensions of both the grid and the blocks are configurable. However, the maximum size of both the grid and the blocks (every coordinate) is restricted by the version of CUDA that the given device supports. Moreover, devices can differ in the number of blocks and threads that can possibly be assigned to the SMs. Usually, the number of threads and the number of blocks are adjusted to the size of the data that the program will be operating on, and the capabilities of the particular device.
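For example, a 2-D problem such as processing a 1024 × 1024 matrix could be covered by a two-dimensional grid of two-dimensional blocks, as sketched below (an illustrative configuration only; the kernel name and sizes are assumptions):

__global__ void matrixKernel(float *m)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   /* column index */
    int y = blockIdx.y * blockDim.y + threadIdx.y;   /* row index */
    if (x < 1024 && y < 1024)
        m[y * 1024 + x] *= 2.0f;                     /* operate on one element */
}

void launchOnMatrix(float *dMatrix)
{
    dim3 block(16, 16);                              /* 256 threads per block */
    dim3 grid((1024 + block.x - 1) / block.x,
              (1024 + block.y - 1) / block.y);       /* enough blocks to cover the matrix */
    matrixKernel<<<grid, block>>>(dMatrix);
}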

Figure 4: Threads configuration within CUDA application

Although each thread performs the same set of instructions, they can operate on different data. Each thread has unique id's that identify its position in the grid. A thread has access to its coordinates in the block and the coordinates of the block, as well as to the size of the blocks and the whole grid. Therefore, the data on which CUDA performs calculations are usually arrays of numbers. Because each thread has its unique id, they can all work on different entries of an array independently.
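A typical pattern (a generic illustration, not code from the thesis application) is to combine the block and thread coordinates into one global index and let each thread handle a single array element:

__global__ void addOne(int *data, int n)
{
    /* Global index of this thread within the whole grid. */
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)            /* guard threads that fall beyond the array end */
        data[idx] += 1;
}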

The independence of threads is crucial for performing calculations correctly. After launching the kernel it is impossible to determine the order in which threads will be performed on the SMs. The configuration of blocks is a software concept. In hardware, threads within an SM (threads from all blocks assigned to one SM) are always grouped into warps. Each warp is a sequence of threads that will be executed simultaneously on one SM. The size of a warp depends on the version of CUDA; for 2.0 it is 32 threads per warp. The warp scheduler unit is responsible for choosing the warps that are active and ready to be performed. A warp contains threads in the order of their id's in the block. Only a limited part of the warps is being executed at a time, and the warps are not scheduled in order. When one or more of the threads from a warp have to wait for a result for a longer time (e.g. because of a global memory access, a floating point operation, or due to branch instructions), another warp, with all threads ready to perform, will be chosen. This solution allows avoiding latencies. The asynchronous mechanism of warp scheduling may allow covering the memory access latencies or latencies that emerge due to branch instructions. It simply chooses the warps that are ready to be performed, instead of warps that contain one or more threads that are waiting for data.

Usually, the device part of the application looks as follows: memory allocation and transfer of input data to the device memory, calling the kernel function on an appropriate number of threads, and transfer of output data from the device memory.
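As a sketch of this sequence (error checking omitted; it reuses the illustrative addOne kernel from above), the host code could look as follows:

void runAddOne(int *hData, int n)
{
    size_t bytes = n * sizeof(int);
    int *dData;

    cudaMalloc((void **)&dData, bytes);                       /* allocate device memory */
    cudaMemcpy(dData, hData, bytes, cudaMemcpyHostToDevice);  /* copy input to the device */

    addOne<<<(n + 255) / 256, 256>>>(dData, n);               /* launch the kernel */

    cudaMemcpy(hData, dData, bytes, cudaMemcpyDeviceToHost);  /* copy the result back */
    cudaFree(dData);                                          /* release device memory */
}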

4.4 CUDA compute capability versions

CUDA enabled devices differ in their capabilities, depending on the type and year of release. There have been 6 levels of capability until now: 1.0, 1.1, 1.2, 1.3, 2.0, 2.1. Units with compute capability from version 1.0 to 1.3 have the same architecture, but differ in features and possibilities. Starting with the 2.0 version, the core architecture of the device changed, and some technical specifications (such as the number of warps residing in one SM, or the size of shared memory) were improved. Appendix A shows the main features of all versions of CUDA compute capability. For this thesis, the 2.0 version was used. The GPU used in the experiments supports CUDA compute capability 2.1. The reason for choosing the 2.0 version was that it was the newest version of CUDA supported by the last stable version of the CUDA toolkit at the moment when this thesis project was started.

CUDA Compute Capability describes the version of the CUDA device, and should not be confused with the version of the CUDA toolkit, which describes the version of the software.

4.5 Memory Levels

As was mentioned earlier, a CUDA application consists of two stages of code - host code and device code (kernel). The host stage is able to operate on host memory and on device memory as well. However, the kernel is able to work only on device memory units. There are plenty of memory units within the device that differ in their size, properties, bandwidth and access latencies. Because memory accesses cause latencies, the memory management issue is crucial for the execution time of the program. Moreover, data within CUDA memory units have to be coalesced properly in order to maximize the effectiveness of fetching them by threads. Therefore, memory management in CUDA is supported by a set of API functions that allow allocating memory and filling it with data correctly in terms of their alignment and avoiding memory conflicts. The particular types of device memory are described below.

4.5.1 Global memory

Global memory is the main memory on the device. It is allocatable and accessible from the host. This part of memory is accessible by any thread, no matter the grid or block configuration. Depending on the version of compute capability, it is addressed by 32-bit cells (CUDA 1.x) or 40-bit cells (CUDA 2.x). Global memory's main disadvantage is its high access latency. To obtain better performance, global memory accesses are done simultaneously for many threads. When threads within one warp are accessing global memory, the request for one warp is split into two independent requests - each one for one half-warp. Because the size of a warp is 32, one transaction covers 16 threads' requests. There are some requirements to make such transactions possible - some of them depend on the version of CUDA compute capability. Data within the memory should be aligned properly (i.e. the data that are requested are of a size of 1, 2, 4, 8 or 16 bytes, and the addresses of these data in memory are a multiple of this size). The memory alignment condition is fulfilled when using the built-in CUDA types of variables, or types of the proper size while allocating memory (CUDA instructions automatically coalesce the data in memory for optimal performance). If structures are used, manual alignment of them to the proper size is demanded.

Also, when using the 1.0 and 1.1 versions, data accessed simultaneously have to be in adjacent cells in the memory. The requested data have to be in the same segment, and the consecutive threads have to access consecutive addresses - the k-th thread of a warp accesses the k-th entry in the segment. Higher versions of CUDA do not have this restriction. Data for the threads of one half-warp do not have to be in the same segment, and do not have to be coalesced. Still, high performance of memory operations can be achieved, due to a different management of these operations than in CUDA 1.0 and 1.1 [21]. This feature is crucial for an algorithm such as CYK, within which threads need to access data that are widely distributed in memory. For the CUDA 2.0 and 2.1 versions, data from global memory can additionally be cached.

Under all these conditions, the memory transaction can be performed. If the conditions are not fulfilled, fetching the memory in one memory instruction is not possible. In that situation many single operations have to be performed, separately for each thread in a warp. That situation can significantly extend the time of execution and, therefore, should be avoided.
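The difference can be illustrated with two trivial kernels (a generic example, not from the thesis code): in the first, consecutive threads read consecutive addresses, so the reads of a half-warp can be combined into one transaction; in the second, the reads are strided, so on 1.0/1.1 devices each one becomes a separate transaction.

__global__ void coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];          /* thread k reads element k of the segment */
}

__global__ void strided(const float *in, float *out, int n, int stride)
{
    /* Assumes in[] holds at least n * stride elements. */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i * stride]; /* neighbouring threads read far-apart addresses */
}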

4.5.2 Constant memory

Constant memory is another memory block, similar to global memory. It is also accessible and allocatable by host code, but the data that are transferred to this memory are unchangeable during the whole time of executing the device code. It gives a much lower access latency. Moreover, it can be cached on the SM (regardless of the CUDA version), which makes it useful for keeping constant data that are used by many threads.
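For instance (an illustrative snippet; the symbol name and sizes are assumptions), a small read-only table such as a set of grammar rules could be kept in constant memory and filled from the host with cudaMemcpyToSymbol:

__constant__ int dRules[3 * 256];   /* up to 256 binary rules stored as (lhs, rhs1, rhs2) triples */

void uploadRules(const int *hRules, int numRules)
{
    /* Copy the rule triples into constant memory; the data cannot change
       while the device code is running, but every thread can read it cheaply. */
    cudaMemcpyToSymbol(dRules, hRules, 3 * numRules * sizeof(int));
}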

4.5.3 Shared memory

Shared memory is an on-chip memory that resides on every streaming multiprocessor. From a thread's view, it is accessible by any thread within one block. Shared memory is very fast - data can be accessed immediately, because they are placed on the chip. Using shared memory is desirable every time the same parts of memory are accessed by many or all threads in a block. While using it, every piece of data can be copied once to the shared memory (the first time it is read from global memory). Later, the same data can be accessed by all threads using shared memory accesses instead of global memory accesses. Shared memory cannot be initialized by the host. Instead, the size of shared memory has to be given while calling the kernel. It also gives the advantage of aligning data. Even when the data that threads access are scattered in global memory, they can be simultaneously accessible after copying them into shared memory, where they are ordered in sequence. There are two ways of working with shared memory. The first one is allocating a constant size of memory (known at the moment of compilation) using the __shared__ directive. The second one - dynamic allocation - is done by passing the size of the shared memory that will be used as an additional argument of the kernel invocation.
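Both ways of using shared memory can be sketched as follows (a generic illustration; the kernels assume blocks of 256 threads and are not taken from the thesis application):

/* Static allocation: the size is known at compile time. */
__global__ void staticShared(const float *in, float *out)
{
    __shared__ float tile[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = in[i];      /* each thread copies one element from global memory */
    __syncthreads();                /* make the whole tile visible to every thread of the block */

    out[i] = tile[blockDim.x - 1 - threadIdx.x];   /* read another thread's element from fast memory */
}

/* Dynamic allocation: the size is given as the third launch parameter. */
__global__ void dynamicShared(const float *in, float *out)
{
    extern __shared__ float tile[];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = in[i];
    __syncthreads();
    out[i] = tile[blockDim.x - 1 - threadIdx.x];
}

/* Example launches:
   staticShared<<<blocks, 256>>>(dIn, dOut);
   dynamicShared<<<blocks, 256, 256 * sizeof(float)>>>(dIn, dOut);  */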

4.5.4 Registers

Registers are memory banks that are unique for each thread. Processors keep in them the data of variables that differ in each thread, such as thread id's and variables initialized in the kernel by the programmer. The number of registers on every SM is obviously limited. Therefore, depending on how many registers each thread is using, the number of threads assigned to one SM can change. It is important to control the number of registers used. Each SM has limits on the number of threads and blocks that can possibly be assigned to it, but ineffective usage of registers can decrease these limits, or force the use of local memory, and thereby decrease the performance.

4.5.5 Local memory

Local memory is a part of the global memory that is accessible only to one thread. The term "local" refers to the fact that it is exclusive for each of the threads that are using it. Physically this memory is allocated in the global memory of the device. If the number of registers is insufficient for the data that all threads within one SM use, the rest of the variables of each thread are kept in a part of global memory. This also concerns arrays that are dynamically allocated within a kernel. Generally, all the data that is unique for one thread and cannot be kept in registers is located in local memory. Because of memory latencies, local memory usage is undesirable. For this reason, a kernel should not use dynamic memory allocation.

4.5.6 Texture/Surface memory

Texture memory resides in global memory, but it is optimized for storing two-dimensional arrays of variables. Data stored in texture memory can be cached in the texture cache. The texture memory mechanism is dedicated to storing two-dimensional arrays without the coalescing restrictions; it can be used to improve access to matrices of data whose access pattern cannot be improved by the shared memory mechanism.

4.6 Driver API vs Runtime API

There are two programming interfaces that CUDA supports [21]. The first one - the driver API - is a low-level programming interface which supports programming directly in the PTX ISA (assembly language) or through C wrappers of these functions. This way of programming makes the code completely dependent on the programmer, who decides when every CUDA instruction is executed. The second one is the CUDA runtime API, which includes all the CUDA C extensions (described in section 4.7) that allow launching functions and using device memory in a way similar to traditional C programming. The runtime API also includes additional bindings that support other programming languages. When this API is used, the code is compiled by nvcc to PTX instructions in an optimized way. This way of programming makes the execution of CUDA PTX instructions less dependent on the programmer, but more optimized and easier to use. For the purposes of this thesis, CUDA C with the runtime API is used.

4.7 CUDA C

CUDA C is an extension of the ANSI C language that delivers a set of functions and directives for managing device memory and executing device code. There are plenty of C extensions that allow using all CUDA features; the main components used in this work are listed below (a minimal example combining several of them follows the list):

• cudaMalloc/cudaFree - functions that allocate and free global memory,

• cudaMemcpy - function that copies data between host memory and device memory,

• __constant__, __device__, __shared__ - directives that describe the type of memory; each directive precedes the declaration of a variable or array,

• __host__, __device__ - directives that precede function declarations, describing whether they are host or device functions,

• __global__ - directive that precedes a kernel function declaration,

• __syncthreads() - function that synchronizes the threads within a block,

• cudaThreadSynchronize() - function that blocks the host until all threads of the grid have finished,

• cudaGetLastError()/cudaGetErrorString() - functions that return the last error and its description, if an error occurred,

• blockDim/gridDim, blockIdx, threadIdx - built-in variables accessible from every thread that contain the sizes of the block and grid and the coordinates of the thread and block,

• built-in vector types, such as char1, char2, uint1, uint2 - structures that contain vectors of variables; to initialize such a structure, the corresponding make_ constructor has to be used (i.e. make_char1(x)),

• many mathematical, atomic and bitwise functions, optimized for device calculations.
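As an illustration, the following minimal runtime-API program combines several of the components listed above (allocation, copy, a __global__ kernel, built-in index variables and host-side synchronization). It is a self-contained sketch; the kernel and variable names are illustrative and do not come from the thesis application.

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void add_one(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   /* built-in coordinates */
    if (i < n)
        data[i] += 1;
}

int main(void)
{
    const int n = 1024;
    int h[n], *d;
    for (int i = 0; i < n; ++i) h[i] = i;

    cudaMalloc((void **)&d, n * sizeof(int));               /* device allocation   */
    cudaMemcpy(d, h, n * sizeof(int), cudaMemcpyHostToDevice);

    add_one<<<(n + 255) / 256, 256>>>(d, n);                /* kernel launch       */
    cudaThreadSynchronize();                                /* wait for the grid   */

    cudaMemcpy(h, d, n * sizeof(int), cudaMemcpyDeviceToHost);
    printf("h[0] = %d, h[n-1] = %d\n", h[0], h[n - 1]);
    cudaFree(d);
    return 0;
}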

4.8 Atomic operations

When many threads operate on the same set of variables, two or more threads may access the same value. This causes a problem when two threads perform a read-modify-write instruction: because of the asynchronous and unordered execution of threads, some data may be overwritten. It can happen whenever two or more threads modify the same variable residing in memory. To perform such updates correctly, the modifications have to be executed sequentially; otherwise a thread can read an obsolete value before it has been updated by other threads already working on this variable. To avoid these situations, atomic operations can be used. An atomic operation is an operation on a variable in memory during which it is guaranteed that no other thread will access the same memory address. CUDA offers a wide range of such functions; however, most of them work only on 32-bit variables (except for atomic add and atomic exchange, which since CUDA 2.x can also work on 64-bit variables). Atomic operations in CUDA include arithmetic and bitwise functions.
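The lost-update problem and its atomic solution can be sketched as follows (an illustrative histogram example, not code from the thesis; the bitwise variant shown in the comment is the one used later by the limited version of the algorithm):

__global__ void histogram_racy(const unsigned char *in, int *bins, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        bins[in[i]]++;                 /* read-modify-write: concurrent updates may be lost */
}

__global__ void histogram_atomic(const unsigned char *in, int *bins, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[in[i]], 1);    /* serialized per address, no lost updates */
}

/* bitwise variant, as used by the limited CYK version:
     atomicOr(&cell, 1u << nonterminal);    set one flag bit safely */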

4.9 CUDA Occupancy Calculator

As was written in section 4.3, the threads of the blocks assigned to one SM are divided into 32-element warps. Warps and their execution are managed by the warp scheduler, which chooses the set of threads that is ready to be executed at a given time. Due to memory latencies - when one or more instruction operands reside in off-chip memory - some warps are not ready to perform their next instruction, and the warp scheduler then chooses another warp that is ready. The number of threads in a warp is constant and equals 32, but the number of warps per SM depends on the chosen configuration of threads and blocks, as well as on the device capabilities.

There is a finite number of registers within one SM, and a limited number of threads, blocks and warps that can be assigned to one SM (depending on the device model and compute capability). There is also a limited size of shared memory for each SM. While programming, these four restrictions (the limits on blocks and warps, the limit on registers and the size of shared memory) have to be taken into account, since the performance of every device code execution depends on how the chosen grid configuration, the number of registers the kernel demands and the amount of shared memory used fit these restrictions. For instance, if the number of threads in a block is too high, fewer blocks will be assigned to one SM at a time, because the configuration does not fit the thread-number restriction, even though the SM would allow more blocks to be assigned. Another example is a too high number of registers demanded by every thread: even if the number of blocks and threads fits the restrictions, the number of on-chip registers is insufficient, and it is not possible to assign all the threads that should be assigned to one SM. Because many factors influence the performance (CUDA compute capability version, kernel complexity), finding the optimal configuration of threads within a grid can be problematic, and an improper configuration may greatly decrease the performance. To find the best option - the one giving the most effective performance - the CUDA Occupancy Calculator [18] can be used.

The Occupancy Calculator is a spreadsheet delivered with the CUDA SDK that calculates the occupancy ratio and the percentage occupation of every SM, based on a specification given by the user (CUDA version, shared memory usage, number of registers, warps and threads). These data can be obtained from the compiler during code compilation. Obviously, the goal is to find a configuration that occupies 100 % of each SM's resources. The Occupancy Calculator output will be shown for every implementation, and the obtained numbers of threads will be used in the experiments.

5 Application

The first implementation was the sequential CYK algorithm, which was used to compare the performance of the parallel algorithms with the basic one. Then, several versions of parallel CYK were implemented. The starting point for implementing the parallel CYK was a version based on the models presented by Nijholt [16] and Ninomiya [24] (the first algorithm - the basic version). The algorithm from [24] was adjusted to the CUDA environment by taking CUDA features into account. This algorithm was then modified in two ways: firstly, by changing the algorithm model (the distribution of tasks between threads) - this version is called the modified version; secondly, by using an advanced CUDA feature - shared memory - which was considered promising for achieving better performance (the third algorithm - the shared version). These implementations were tested by executing them together with the sequential version on the same input data and collecting their execution times. After all basic tests, the version of the algorithm that gave the most satisfying results was turned into a fourth algorithm - the limited version (a version with some restrictions regarding the size of the grammar) - which was tested further. The first three algorithms are described in section 5.3; the limited version is described in section 6.3.1.

The application is attached to the thesis on a CD, as a Microsoft Visual Studio 2008 console application project.

5.1 Environment

All the algorithms and the application were implemented in CUDA C, which is the extension of ANSI C, using the CUDA runtime API functions. The specification of the platform used is given below:

• OS: Windows 7

• CPU: Athlon X2 2.0 GHz, 3 GB DDR2 RAM

• GPU: Nvidia GeForce GT 440, with CUDA compute capability 2.1

Using CUDA requires installing three components:

• CUDA graphics card driver - version 270.81 for Windows 7 was used,

• CUDA toolkit, with the nvcc compiler, additional libraries and the CUDA Visual Profiler - the 32-bit version of toolkit 3.2 was used,

• CUDA SDK (optional) - set of code samples and additional tools - version 3.2 was used.

The sequential and parallel versions can be considered as running on two different platforms, because they are performed on different units (CPU and GPU). These platforms are hardly comparable - a GPU with the same computational power as a given CPU cannot be defined precisely. The criterion for choosing this configuration was that both units belong to the same price range. The development environment used was Microsoft Visual Studio 2008, with the CUDA Build Rules for the runtime API, version 3.2, applied. However, most of the code was developed using the CUDA emulation mode, which required using an older version of the CUDA toolkit (3.0), because the CUDA emulator was removed in newer versions.

The implemented application is a console application that works on input files containing grammars and strings. The device used as the platform for testing the algorithms has compute capability 2.1. However, since version 3.2 of the CUDA toolkit supports CUDA compute capability only up to 2.0, all experiments were performed for compute capability 2.0. The full technical specification of this device is given in Appendix B.

5.2 Data representation

The application requires the grammar to be given in CNF. Input files with grammars contain all rules of the grammar, using lowercase characters as terminal symbols, uppercase characters as non-terminal symbols, > as the transition operator, and the character 'S' as the start symbol. In the application, grammar symbols are remembered in two arrays:

• one-dimensional array of terminal symbols, and

• one-dimensional array of non-terminal symbols.

During the process of reading the grammar file, every terminal or non-terminal symbol that has not appeared yet is added to the end of the respective array. Later, the algorithm identifies a symbol by its index in this array; the algorithms operate only on the indexes that represent the symbols, instead of on characters.

Each rule in the grammar is represented in one of two arrays:

• a one-dimensional array term_nonterm that represents every transition of the form A → a. The index of a cell in this array is the number of the terminal symbol on the right side of the rule, and the content of the cell is the number of the non-terminal on the left side of the rule. For instance, if A is represented by 0 and a by 1, then if the rule A → a exists, the cell of index 1 of this array contains the value 0.

• a two-dimensional array nonterm_nonterm that represents every transition rule of the form A → BC. The indexes of the array's row and column represent the non-terminal symbols B and C on the right side of the rule, and the content of the cell indexed by these values represents the non-terminal symbol on the left side of the rule. For example, if A is represented by 0, B by 1, and C by 2, and the rule A → BC exists, then the cell in the 1st row and 2nd column of the array contains the value 0. If there is no such rule, the value is −1 (since symbols are represented only by non-negative values, −1 represents an empty cell). Note that this representation allows at most one rule per right-hand side; any pair of non-terminals can be mapped to at most one non-terminal (a small illustrative sketch of both arrays follows this list).
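The following sketch illustrates the two arrays for a hypothetical toy grammar S → AB, A → a, B → b, assuming symbol indices S = 0, A = 1, B = 2 and a = 0, b = 1 (the grammar, sizes and helper function are illustrative, not taken from the thesis test sets):

#define NT 3   /* number of non-terminals */
#define T  2   /* number of terminals     */

int term_nonterm[T];          /* term_nonterm[a] = A  for rules A -> a        */
int nonterm_nonterm[NT][NT];  /* nonterm_nonterm[B][C] = A for rules A -> BC,
                                 or -1 if no such rule exists                 */

void build_toy_grammar(void)
{
    for (int i = 0; i < NT; ++i)
        for (int j = 0; j < NT; ++j)
            nonterm_nonterm[i][j] = -1;   /* empty cells */

    term_nonterm[0] = 1;                  /* A -> a  */
    term_nonterm[1] = 2;                  /* B -> b  */
    nonterm_nonterm[1][2] = 0;            /* S -> AB */
}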

After parsing each rule from the grammar file, the parser adds the value representing this rule to one of the arrays. When the parser recognizes the symbol S, the index of this symbol is remembered in an additional variable, which is later used to recognize this symbol in the CYK table. If the parser encounters an incorrect rule (one that does not obey the CNF restrictions, or contains symbols other than lowercase or uppercase characters and the other permitted characters), the parsing process is interrupted and the grammar description has to be revised.

An input string is represented as a string of lowercase characters. Files with input strings have the following form: the first line contains the number of input strings in the file, and every other line contains one string, preceded by the number of characters in that string. If a string contains a character that does not belong to the given grammar, the string will not be parsed.
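A hypothetical input file in this format could look as follows (the strings themselves are made up for illustration):

3
5 aabba
4 abab
6 aababb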

Figure 5: One row of CYK table

The concept of the CYK algorithm uses a two-dimensional matrix (with width and height equal to the string length), in which every entry is a list of non-terminals assigned to this cell - so the size of each cell differs. This concept, presented in section 3.2, has to be distinguished from its implementation. In the program, the table is implemented as a three-dimensional array, with the third dimension representing the list of non-terminals and having constant size, instead of keeping a list of non-terminals in each cell. Each cell of the CYK table as described in section 3.2 corresponds to a row of the third dimension of the matrix in the program, and one row of the CYK table corresponds to a slice of the matrix, as pictured in Figure 5. The value 1 at the ith position of the third dimension indicates that the ith non-terminal symbol is assigned to the cell; the value −1 means that this non-terminal is not assigned. For example, the value 1 in CYK[j][k][l] means that the lth non-terminal is assigned to the cell in the jth row and kth column. This solution requires more space in memory, but makes it easier to allocate the data on the device, without varying the size of the third dimension of the matrix. In the limited version of the algorithm, described below, the CYK table is implemented as a two-dimensional matrix, with every cell being a vector of constant length. To keep the descriptions clear, the pieces of the matrix are later named as follows: CYK[i][j][k] is one entry (containing the value 1 or −1), CYK[i][j] is one cell (a list of non-terminal flags), and CYK[i] is one row of the table.
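When such a table is kept in one linear device allocation, an entry CYK[row][col][nt] is addressed by simple index arithmetic, as in the following sketch (the helper name is illustrative):

/* n - length of the input string (width and height of the table) */
/* k - number of non-terminals (size of the third dimension)      */
__device__ __host__ inline int cyk_index(int row, int col, int nt, int n, int k)
{
    return (row * n + col) * k + nt;        /* linear offset of CYK[row][col][nt] */
}

/* marking "non-terminal l present in cell (j, k)":
     table[cyk_index(j, k, l, n, num_nonterminals)] = 1;  */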

5.3 Parallel CYK

This section describes in detail the first three implemented algorithms - basic, modified and shared. The fourth version of the algorithm is described in section 6.3.1.

5.3.1 Basic version

CYK is an algorithm that fills the cells of a matrix, so the obvious way of parallelising it is to perform the calculation of each cell separately, in a different thread. Because CYK is a dynamic programming algorithm, the calculation of a cell can be started as soon as particular earlier entries have been calculated. Threads working at the same time have to be independent - the result of any of them cannot depend on any other thread running at the same time. The first problem of parallelising the CYK algorithm is therefore the proper synchronization of the threads - choosing correctly which threads can run simultaneously. In the case of the CYK algorithm, before the calculation of any cell starts, all previous entries from the column and diagonal containing this cell have to be finished first.

The algorithms described in [16] and [24] assign every cell of the matrix to one thread (processor) at the beginning. Threads that cannot proceed yet wait until all threads from their column and diagonal are ready. This approach demands direct communication and synchronization between the threads, which is impossible in CUDA.

Algorithm 2 CYK Algorithm - basic parallel version
Require: Grammar G of l terminals and k non-terminals, term_nonterm matrix of transitions, nonterm_nonterm matrix of transitions, α = a1a2a3...an - the input string, X, Y - block size.

begin
  Copy term_nonterm and nonterm_nonterm arrays to device memory
  Initialize CYK table in device memory
  Gx = n/X + 1   {calculate threads and blocks configuration for the init kernel}
  Gy = n/Y + 1
  Bx = X
  By = Y
  Call init kernel for Bx, By threads in a block and Gx, Gy blocks
  Bx = X * Y     {calculate threads and blocks configuration for the main kernel}
  By = 1
  Gx = n/Bx + 1
  Gy = 1
  for all i, 1 ≤ i < n do
    Call main kernel for Bx, By threads in a block and Gx, Gy blocks, with step = i
    if (n − i) % Bx == 0 then
      Gx = Gx − 1
    end if
  end for
  Copy CYK table from device to host memory
end

init kernel
begin
  idx = BlockIdx.x * X + ThreadIdx.x   {x coordinate of the thread in the grid}
  idy = BlockIdx.y * Y + ThreadIdx.y   {y coordinate of the thread in the grid}
  if idx < n and idy < n then
    for i, 0 ≤ i < k do
      CYK[idx][idy][i] = −1
    end for
    if idy == 0 then
      CYK[idx][0][term_nonterm[a_idx]] = 1
    end if
  end if
end

main kernel
begin
  id = BlockIdx.x * X + ThreadIdx.x   {x coordinate of the thread in the grid}
  if id < n − step then
    for all s, 0 ≤ s < step do
      for all m, 0 ≤ m < k do
        if CYK[s][id][m] ≠ −1 then
          for all q, 0 ≤ q < k do
            if CYK[step − s − 1][id + s + 1][q] ≠ −1 then
              if nonterm_nonterm[m][q] ≠ −1 then
                CYK[step][id][nonterm_nonterm[m][q]] = 1
              end if
            end if
          end for
        end if
      end for
    end for
  end if
end

Therefore, the basic version of the algorithm, presented above, is Nijholt's parallel CYK approach adjusted to the specifics of CUDA. The CUDA architecture only allows synchronizing threads within one block, with the __syncthreads() function, or synchronizing the whole grid by waiting for a kernel to finish and then launching the next kernel. The size of a block is limited (to 1024 threads in CUDA 2.0) and is therefore insufficient for larger instances of the problem under this approach. The synchronization problem can be solved by launching the kernel for every row of the matrix sequentially, because the cells of one row can be calculated independently. After launching the kernel for one row, the program waits for all threads to finish and then starts the kernel for the next row. The number of threads that is used is decremented in each iteration.

Figure 6: Threads within basic version

Each row has one cell fewer than the previous one, but the number of threads in one block is constant for every run of the algorithm, so the total number of launched threads is always rounded up to the nearest multiple of the block size. For this reason, if the size of a row does not match the number of threads, some threads on the right side of the grid are redundant. This is unavoidable, since there is no way to launch the exact number of threads in every iteration. Figure 6 shows the idea of this version for an instance of 8 threads, the 4th iteration and 8 symbols in the word - the grid is superimposed on the 4th (currently calculated) row of the CYK table. Threads marked as green (1-5) work on previous entries in their columns and diagonals; threads marked as white (6-8) are redundant.

In this version, threads are launched in a one-dimensional grid: in the ith iteration, the jth thread of the grid fills the entries of CYK[i][j]. Every thread iterates over all previous cells in the same column and diagonal and checks whether the grammar rules allow assigning a non-terminal to the calculated entry. For each row, one thread less is needed; therefore, if the size of the block is b, after every b iterations the last block is no longer needed and the total number of blocks can be decremented.
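A CUDA C sketch of the main kernel of this version, following the pseudo-code of listing 2, could look as follows; it is launched once per row, with one thread per cell of that row. The flat indexing, the char element type and all names are illustrative assumptions, not the exact thesis code.

__global__ void cyk_basic_step(char *table, const int *nonterm_nonterm,
                               int n, int k, int step)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;   /* column computed by this thread */
    if (id >= n - step)
        return;                                       /* redundant threads do nothing   */

    for (int s = 0; s < step; ++s)                    /* every way of splitting the span */
        for (int m = 0; m < k; ++m)
            if (table[(s * n + id) * k + m] != -1)    /* left part derives m */
                for (int q = 0; q < k; ++q)
                    if (table[((step - s - 1) * n + id + s + 1) * k + q] != -1  /* right part derives q */
                        && nonterm_nonterm[m * k + q] != -1)
                        table[(step * n + id) * k + nonterm_nonterm[m * k + q]] = 1;
}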

Before launching the algorithm, the matrix has to be initialized with the value −1 in every entry. This is also performed in parallel, by another kernel that launches threads in a two-dimensional grid of n² threads; each thread with coordinates i, j fills the entries of CYK[i][j] with −1. The first row of the matrix is also filled by this kernel: it puts into it all non-terminal symbols that correspond to the particular terminal tokens of the input string, according to the rules A → a. Pseudo-code of this algorithm is presented in listing 2, for a grammar G of k non-terminals and l terminals.

Algorithm 3 CYK Algorithm - modified parallel version
Require: Grammar G of l terminals and k non-terminals, term_nonterm matrix of transitions, nonterm_nonterm matrix of transitions, α = a1a2a3...an - the input string, X, Y - block size.

begin
  Copy term_nonterm and nonterm_nonterm arrays to device memory
  Initialize CYK table in device memory
  Gx = n/X + 1   {calculate threads and blocks configuration for the init kernel}
  Gy = n/Y + 1
  Bx = X
  By = Y
  Call init kernel for Bx, By threads in a block and Gx, Gy blocks
  Gy = 1
  for all i, 1 ≤ i < n do
    Call main kernel for Bx, By threads in a block and Gx, Gy blocks, with step = i
    if (n − i) % X == 0 then
      Gx = Gx − 1
    end if
    if i % Y == 0 then
      Gy = Gy + 1
    end if
  end for
  Copy CYK table from device to host memory
end

init kernel
begin
  idx = BlockIdx.x * X + ThreadIdx.x   {x coordinate of the thread in the grid}
  idy = BlockIdx.y * Y + ThreadIdx.y   {y coordinate of the thread in the grid}
  if idx < n and idy < n then
    for i, 0 ≤ i < k do
      CYK[idx][idy][i] = −1
    end for
    if idy == 0 then
      CYK[idx][0][term_nonterm[a_idx]] = 1
    end if
  end if
end

main kernel
begin
  idx = BlockIdx.x * X + ThreadIdx.x    {x coordinate of the cell}
  idy = BlockIdx.y * Y + ThreadIdx.y    {y coordinate of the cell}
  pidx = step − idy − 1                 {x coordinate of the corresponding cell}
  pidy = idx + idy + 1                  {y coordinate of the corresponding cell}
  if idx < n − step and idy < step then
    for all m, 0 ≤ m < k do
      if CYK[idy][idx][m] ≠ −1 then
        for all q, 0 ≤ q < k do
          if CYK[pidy][pidx][q] ≠ −1 then
            if nonterm_nonterm[m][q] ≠ −1 then
              CYK[step][idx][nonterm_nonterm[m][q]] = 1
            end if
          end if
        end for
      end if
    end for
  end if
end

Figure 7: Threads within modified version

5.3.2 Modified version

The next approach to parallelising CYK was to divide the calculations among a greater number of threads, creating a larger number of tasks. CUDA in compute capability 2.x allows launching 65536 x 65536 blocks, each of at most 1024 threads. Due to this fact, the approach of dividing the task among as many threads as possible was taken. In this approach, as in the previous one, the kernel is launched once for every row of the array; however, the thread configuration is different. Instead of assigning one thread to calculate every entry, in this approach every thread checks only one pair of cells. As was written in section 3.2, calculating one cell of the CYK table consists of checking every corresponding pair of entries from the column and diagonal that contain this cell. Because reading any of these pairs can be done independently, the process of calculating one cell can be split into a series of simple operations, each of which can be assigned to one thread. The series of threads responsible for one cell are performed simultaneously. The grid of threads is in this situation two-dimensional, and every thread is assigned to one cell from the part of the CYK table that is already calculated. Each thread calculates the coordinates of the corresponding cell, checks whether these two cells contain non-terminals that can be mapped to another non-terminal, and if so, adds the new value to the cell that is being calculated.

Figure 7 shows the idea of this version for an instance of 64 threads, the 4th iteration and 8 symbols in the word - the grid is again superimposed on the CYK table. Each thread assigned to a cell above the row being calculated in the current iteration (green in the figure) checks the values of this cell and of the corresponding cell. What was done by one thread in the previous version is now divided among as many threads as there are rows calculated up to the current step of the algorithm. This solution also makes some threads redundant - the number of threads in the grid will usually be greater than the number of threads that are necessary (white in the figure). This is due to the fact that both dimensions of the grid are multiples of the block size, while the size of the CYK table does not fulfil this condition. The last threads of each row of the grid, which are not assigned to any cell of the CYK table, are idle. To distinguish between the threads, a conditional instruction was added: every thread calculates its coordinates and then checks whether it has to perform the rest of the calculations. According to [20], branch instructions always hurt performance (threads within one warp have different instructions to execute, which results in the warp being executed again), so this could increase the execution time. One attempt to deal with this problem was to skip the conditional instruction and allow the threads to do the irrelevant calculations. Unfortunately, this approach would occupy a much bigger part of the memory and invoke too many threads performing irrelevant calculations in later iterations; therefore, it was not implemented.

Initialization in this version, and in all following versions, is done in the same way as in the previous one. Also, the number of blocks in a row of the grid is decremented every X iterations, where X is the width of the block. However, the number of rows of blocks has to be incremented after every Y iterations, where Y is the height of the block; this is caused by the growing part of the table that is already filled and has to be checked by the algorithm.
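The host-side loop that performs these grid adjustments can be sketched as follows (a simplified sketch, under the assumption that the kernel of listing 3 is available as cyk_modified_step; all names are illustrative):

__global__ void cyk_modified_step(char *table, const int *rules, int n, int k, int step);

void cyk_modified_host(char *d_table, const int *d_rules, int n, int k, int X, int Y)
{
    dim3 block(X, Y);
    int gx = n / X + 1;                   /* block columns covering the current row  */
    int gy = 1;                           /* block rows covering the finished part   */
    for (int step = 1; step < n; ++step) {
        cyk_modified_step<<<dim3(gx, gy), block>>>(d_table, d_rules, n, k, step);
        if ((n - step) % X == 0) gx--;    /* current row got shorter by a block width  */
        if (step % Y == 0)       gy++;    /* finished part grew by a block height      */
    }
    cudaThreadSynchronize();              /* wait for the last row to be completed     */
}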

Pseudo-code of this algorithm is presented in listing 3.

Figure 8: CUDA Occupancy Calculator output for 17 and 18 registers per thread

5.3.3 Version with shared memory

It can be noticed that every thread in the modified version operates on two cells in memory. Each thread examines the k entries of one cell, and for each entry with a positive value it searches the k corresponding entries of the second cell. This results in between k (if no non-terminal is assigned to the cell) and k² (if all non-terminals are assigned to the cell) accesses to global memory. This number of accesses can be limited by using shared memory. Until now, all data accessed by threads was located in global memory, which, as was said, has the disadvantage of long access latency. The next attempt to improve the performance of the algorithm was therefore to make every thread copy the contents of these two cells to shared memory at the beginning of its execution; all other instructions of the thread then operate only on data in shared memory, which can be accessed almost instantly. This approach was expected to give better performance because many time-consuming global memory accesses can be avoided. It limits the number of global accesses per thread to 2k, which is worse than the best case of the previous version (k) but far better than the worst case (k²). The performance of this version therefore depends greatly on the characteristics of the input. This was the reason for also testing the algorithms on a grammar and inputs that cause the worst case of the process - where every, or almost every, entry of the CYK table has to be read.
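How each thread obtains its private slot in the dynamically allocated shared-memory block (sized as described in the next paragraph) can be sketched as follows; only the addressing is shown, the kernel body is omitted, and all names as well as the exact caching scheme are illustrative assumptions:

__global__ void cyk_shared_step(char *table, const int *rules, int n, int k, int step)
{
    extern __shared__ char cache[];                       /* sized at launch time       */
    int tid  = threadIdx.y * blockDim.x + threadIdx.x;    /* linear id within the block */
    char *my = &cache[2 * tid];                           /* this thread's two bytes    */

    /* my[0] and my[1] hold the copies of the two examined entries, so that  */
    /* repeated comparisons read on-chip memory instead of global memory.    */
}

/* launch with the shared memory size as the third configuration parameter:
     cyk_shared_step<<<grid, block, 2 * block.x * block.y>>>(d_table, d_rules, n, k, step); */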

In this implementation every block uses shared memory. The size of the shared memory block (in bytes) is twice the number of threads in the block, since each thread needs two bytes for caching data from global memory. Therefore, one additional variable per thread, pointing to that thread's place in shared memory, is needed.

6 Experiments

The experimental stage of the work was conducted in three rounds. Each round consisted of implementing several versions of the parallel algorithm (three in the first and second rounds, one in the third round), executing them on a series of input data, and collecting the execution time of each run of each algorithm. Descriptions of the grammars used in each round are given in Appendix C. The input strings were generated randomly from the set of terminal symbols (an additional generator of both correct and incorrect words was written).

The first round proceeded as follows. Each of the three algorithms described in section 5.3 was implemented; these versions are, as was written in section 5.3, called basic (the basic version), modified (the modified version) and shared (the version with shared memory). These three implementations, together with the sequential implementation, were tested for different grammars (three grammars were chosen), different lengths of strings (in the range from 100 to 1000 characters), and two different sets of strings - those recognized by the algorithm as belonging to the language, and those which are unrecognized (do not belong to the language). The range of the inputs (up to 1000 symbols) is justified by the limit on the size of a CUDA array. For each string length, 10 different strings were given as input, and their execution times were collected and averaged. Thus, for each algorithm, its execution time was collected for three grammars, ten different string lengths and two types of strings (correct and incorrect). One of the grammars and input sets was prepared to check the performance of the algorithms when the input forces the algorithm to fill every entry of the CYK table (the worst-case scenario). The number of threads in a block in these tests was 256 (in a 16 x 16 configuration), according to the output of the CUDA Occupancy Calculator (Figure 8). According to these figures there is a wide range of values that give maximum occupancy of the SMs; choosing the value 256 from this range was arbitrary.

Because the first round of experiments did not give satisfying results (the parallel algorithms were only 10-15 % faster), some modifications of the three algorithms were made, described in section 6.2. The tests were repeated for these algorithms on similar data. The range of inputs was changed to 100 - 2000 symbols, because the changes in implementation made it possible to use larger tables. Inputs longer than 2000 symbols were not tested, because increasing the input beyond this level would greatly increase the duration of the tests - even for this range the tests lasted a few days.

Again, the execution times of each run were collected. Because in this round the parallel versions of the algorithm gave much better results, a statistical analysis of the results was conducted. Following Demsar et al. [6], who theoretically and practically analyzed and evaluated several statistical tests for comparing classifiers, the Friedman test was chosen as the best method to analyse the results of the experimental stage.

After analysis of the output of the second round, which gave more promising results, one version of the algorithm was chosen for further analysis in the third round of the experiment (described in more detail in section 6.3.1). In this round, the version of the algorithm that gave the best results - the modified version from the second round - was chosen for a further upgrade. A different variant of this algorithm - with a limited size of the non-terminal set - was implemented, since limiting the number of non-terminals was considered a promising way of obtaining better performance. The version from the second round (modified) and the new version (limited) were tested together, to check whether the modifications had an impact on performance. The testing set was changed - there were five grammars instead of three. The reason for using more grammars was the need to check the impact of the grammar size - the number of non-terminals - on the performance. Also, one grammar was prepared as an input that forces the algorithm to perform the worst case, as in the previous tests.

Table 1: Times of algorithms' executions, first round, grammar 1

Length  sequential  basic  modified  shared
Correct
100  2.548592  47.742493  33.592535  36.435916
200  19.378497  139.615417  85.721899  113.191260
300  64.015344  287.522095  185.830530  275.173169
400  153.667822  485.565820  342.567139  567.498389
500  296.470166  763.636084  586.090576  1024.440527
600  511.024072  1084.355859  927.432227  1689.646289
700  816.207617  1508.874414  1402.228711  2610.041406
800  1449.615918  2047.734375  2021.547461  3822.636328
900  2311.105664  2655.650000  2778.041211  5357.817969
1000  3450.842969  3394.794922  3742.634766  7281.211719
Incorrect
100  2.646510  51.633264  34.366461  39.474011
200  19.710774  140.865454  90.800195  133.508704
300  64.785852  291.960205  189.888440  285.584766
400  154.201892  497.576318  354.920435  582.802441
500  297.377563  774.609082  599.807666  1043.834277
600  510.936865  1097.694629  948.479395  1719.469336
700  817.939746  1531.998828  1428.747266  2654.277539
800  1453.539160  2067.797266  2065.667578  3879.991797
900  2334.156836  2685.177734  2832.300391  5437.751562
1000  3459.837891  3421.238281  3794.807422  7368.5914068

According to the previous results, whether the string belongs to the language does not influence the performance of the algorithm - the execution times are almost the same regardless of whether the string is correct or not. Therefore, the testing set was reduced to 10 strings per length (both correct and incorrect) instead of two different sets. These tests allowed analysing the algorithm's performance depending on the grammar characteristics. A description of this testing set (grammar rules, input strings) is given in Appendix C.

6.1 First round of experiments

6.1.1 Tests

Performance was measured as the time of executing the whole algorithm (in milliseconds) - from the initialization until the whole table is filled. The number of registers used by the threads was obtained during compilation of the code by nvcc:

• 17 registers for basic version,

• 17 registers for modified version,

• 18 registers for shared memory version.

For these values, one of the best thread configurations (giving the best performance) obtained from the CUDA Occupancy Calculator (Figure 8) was 256 threads per block, and this configuration was used during the tests. Execution times were measured using the CUDA timers mechanism. For each problem instance, ten launches were performed and their execution times were averaged. Three parallel algorithms were tested, with the sequential version as a fourth.

Table 2: Times of algorithms' executions, first round, grammar 2

Length  sequential  basic  modified  shared
Correct
100  6.140781  75.710291  37.786786  48.035046
200  48.654535  264.926367  109.600415  202.132275
300  165.886646  577.468066  249.998291  583.618115
400  402.929053  1001.556055  497.691699  1298.227344
500  935.362500  1611.768262  871.779297  2457.749805
600  1778.995313  2308.471484  1415.487988  4162.283594
700  2929.121875  3321.806250  2147.416211  6561.800000
800  4405.679688  4480.725781  3083.708984  9691.389844
900  6719.765625  5985.336328  4434.001172  13905.131250
1000  9132.607031  7663.628906  6019.976953  18887.918750
Incorrect
100  6.467274  75.114301  39.619336  48.715234
200  53.310712  289.120728  113.487756  204.047363
300  180.759631  614.630713  258.882397  584.875977
400  417.511279  1057.126172  498.051709  1300.390234
500  972.253027  1671.848633  879.270605  2464.157813
600  1816.597852  2395.263281  1426.343262  4177.201563
700  2981.028906  3409.231250  2159.934375  6576.627344
800  4419.991016  4556.474219  3102.282227  9709.364063
900  6631.042969  5982.898828  4326.918359  13646.846875
1000  8953.753906  7697.674219  5943.507422  18548.329688

Table 3: Times of algorithms' executions, first round, grammar 3

Length  sequential  basic  modified  shared
Correct
100  105.784509  772.901270  160.233447  62.450116
200  871.444238  3889.335156  1331.116797  279.366162
300  2981.251758  9751.830469  4556.101172  800.786865
400  7392.471875  18892.278125  10928.656250  1799.5623
500  14593.971875  31228.440625  21579.146875  3415.4308
600  25403.546875  46667.793750  37468.821875  5812.6148
700  40386.83125  67176.512500  59791.931250  9147.9351
800  60905.86875  91774.0500  89722.5687  13596.3343
900  87818.6750  124231.9500  127959.3375  19262.3062
1000  119630.0375  160625.5250  176135.5250  26360.8375
Incorrect
100  109.584741  794.117334  163.746423  61.505804
200  882.465527  3941.543359  1350.439746  278.6599
300  3020.010547  9845.433594  4649.821875  805.811768
400  7404.908594  19095.246875  11124.035156  1812.2132
500  14563.759375  31455.893750  21857.050000  3443.4007
600  25374.06870  47099.218750  37666.143750  5805.7015
700  40282.56875  67638.337500  60059.875000  9140.3203
800  60490.21875  92416.0437  90305.9750  13608.0265
900  89528.19375  125172.2375  129616.9375  19608.8406
1000  123443.9000  162088.1125  178317.7875  26712.7593

Figure 9: Relationship between length of input and algorithms' performance - Grammar 1, 1st round

Figure 10: Relationship between length of input and algorithms' performance - Grammar 2, 1st round

Each of the algorithms was tested for three grammars and two sets of input words (correct and incorrect), which gives six different environments. The tests were performed with a configuration of 256 threads in one block.

6.1.2 Results

Tables 1, 2 and 3 show all measured times. Figures 9 and 10 show the execution time characteristics of every algorithm, depending on the length of the input word, for grammars 1 and 2 respectively. It can be seen that the parallel versions did not give satisfactory improvements. The parallel versions (basic and modified) are approximately 10-15 % faster than the sequential version, which is a much worse result than expected, since the algorithms were tested for inputs of up to 1000 symbols. The shared memory version is even slower than the sequential one. The reasons for this, and the modifications that improved the algorithms, are presented in the following section. Figure 11 shows the results for the third grammar, which recognizes any input string and therefore forces every entry of the CYK table to be filled - this input was tested to check the worst-case scenario of execution. For this grammar, the shared memory version is the one that gives better results compared with the other versions.

Figure 11: Relationship between length of input and algorithms' performance - Grammar 3, 1st round

Figure 12: Relationship between length of input and algorithms' performance - Grammar 1, 2nd round

Table 4: Times of algorithms' executions, second round, grammar 1

Length  sequential  basic  modified  shared
Correct
100  2.548592 (1)  21.894571 (4)  18.625385 (2)  18.881923 (3)
200  19.378497 (1)  51.952301 (4)  34.881696 (2)  37.908612 (3)
300  64.015344 (2)  95.579517 (4)  62.690729 (1)  66.417102 (3)
400  153.667822 (4)  150.033655 (3)  93.947241 (1)  96.641528 (2)
500  296.470166 (4)  228.892480 (3)  135.804163 (1)  140.796716 (2)
600  511.024072 (4)  305.289404 (3)  186.964600 (1)  193.863123 (2)
700  816.207617 (4)  409.435742 (3)  256.352588 (1)  264.341260 (2)
800  1449.615918 (4)  486.729639 (3)  341.849512 (2)  341.280957 (1)
900  2311.105664 (4)  633.183057 (3)  441.680420 (1)  442.375049 (2)
1000  3450.842969 (4)  759.910400 (3)  570.444678 (1)  587.942139 (2)
1100  5070.887109 (4)  912.398633 (3)  712.582227 (1)  713.443848 (2)
1200  6482.962891 (4)  1046.527148 (3)  900.451660 (2)  870.858789 (1)
1300  8696.946094 (4)  1238.594531 (3)  1109.622656 (1)  1111.003711 (2)
1400  10934.219531 (4)  1416.542578 (3)  1337.173340 (2)  1309.622949 (1)
1500  15797.771875 (4)  1621.220215 (3)  1592.436816 (2)  1579.755957 (1)
1600  20093.253125 (4)  1741.676758 (1)  1916.024609 (3)  1838.655469 (2)
1700  22005.018750 (4)  2061.106445 (1)  2240.314258 (3)  2186.890625 (2)
1800  25484.400000 (4)  2299.349414 (1)  2619.313281 (3)  2529.039453 (2)
1900  33116.271875 (4)  2566.777734 (1)  3031.678711 (3)  2946.673438 (2)
2000  36263.350000 (4)  2793.986719 (1)  3539.542969 (3)  3421.678516 (2)
Incorrect
100  2.646510 (1)  89.549536 (4)  20.893549 (3)  19.732201 (2)
200  19.710774 (1)  55.402350 (4)  39.905740 (2)  41.075665 (3)
300  64.785852 (1)  98.266925 (4)  66.299908 (2)  66.834393 (3)
400  154.201892 (3)  160.585852 (4)  98.259894 (2)  97.644177 (1)
500  297.377563 (4)  237.203564 (3)  141.894458 (1)  146.390894 (2)
600  510.936865 (4)  313.136938 (3)  198.826477 (1)  201.559436 (2)
700  817.939746 (4)  416.551660 (3)  263.787720 (1)  271.730933 (2)
800  1453.539160 (4)  497.981250 (3)  355.359351 (2)  351.598340 (1)
900  2334.156836 (4)  644.525830 (3)  451.581934 (1)  452.137158 (2)
1000  3459.837891 (4)  775.090088 (3)  587.971973 (1)  602.916895 (2)
1100  5076.684766 (4)  927.626367 (3)  729.601367 (1)  730.010107 (2)
1200  6486.773438 (4)  1056.300293 (3)  911.369727 (2)  886.568750 (1)
1300  8692.637500 (4)  1253.449219 (3)  1122.987305 (1)  1126.627539 (2)
1400  10922.044531 (4)  1434.444434 (3)  1350.094922 (2)  1337.513184 (1)
1500  15769.289063 (4)  1621.598242 (3)  1598.479492 (2)  1582.051465 (1)
1600  20087.406250 (4)  1744.858984 (1)  1916.019727 (3)  1841.072266 (2)
1700  22035.489063 (4)  2064.499414 (1)  2244.722461 (3)  2182.919531 (2)
1800  25583.029687 (4)  2288.082812 (1)  2621.391797 (3)  2528.912891 (2)
1900  33173.865625 (4)  2570.284961 (1)  3049.801953 (3)  2960.759766 (2)
2000  36219.956250 (4)  2808.780469 (1)  3555.090625 (3)  3438.407031 (2)

Table 5: Times of algorithms' executions, second round, grammar 2

Length  sequential  basic  modified  shared
Correct
100  6.140781 (1)  32.992938 (4)  21.163220 (2)  21.916275 (3)
200  48.654535 (2)  94.363983 (4)  43.519214 (1)  53.886041 (3)
300  165.886646 (3)  187.329761 (4)  76.947382 (1)  108.398218 (2)
400  402.929053 (4)  307.940576 (3)  124.227185 (1)  200.393506 (2)
500  935.362500 (4)  460.630225 (3)  187.489465 (1)  326.661255 (2)
600  1778.995313 (4)  640.203906 (3)  274.323242 (1)  508.817969 (2)
700  2929.121875 (4)  852.631250 (3)  395.091187 (1)  768.867139 (2)
800  4405.679688 (4)  1078.629199 (2)  537.834766 (1)  1085.841016 (3)
900  6719.765625 (4)  1395.228027 (2)  715.474805 (1)  1518.339063 (3)
1000  9132.607031 (4)  1705.030859 (2)  952.338770 (1)  1996.622070 (3)
1100  11729.857813 (4)  2072.217187 (2)  1280.608105 (1)  2813.316797 (3)
1200  15956.390625 (4)  2481.540820 (2)  1538.695898 (1)  3358.964063 (3)
1300  20039.445313 (4)  2953.567578 (2)  1894.499414 (1)  4196.450781 (3)
1400  26157.853125 (4)  3545.603516 (2)  2381.701562 (1)  5251.349609 (3)
1500  31991.181250 (4)  4151.108984 (2)  2931.768750 (1)  6771.717188 (3)
1600  39713.665625 (4)  4767.688281 (2)  3431.461719 (1)  7748.625000 (3)
1700  52338.615625 (4)  5572.183594 (2)  4075.008203 (1)  9227.660156 (3)
1800  59740.150000 (4)  6384.563672 (2)  4769.262500 (1)  10932.068750 (3)
1900  71113.475000 (4)  7232.849219 (2)  5612.513281 (1)  12860.559375 (3)
2000  83779.568750 (4)  8113.031250 (2)  6461.308594 (1)  14858.898438 (3)
Incorrect
100  6.467274 (1)  95.351813 (4)  19.595871 (2)  21.279744 (3)
200  53.310712 (3)  108.480151 (4)  45.169370 (1)  54.952234 (2)
300  180.759631 (3)  208.820435 (4)  77.501050 (1)  105.794592 (2)
400  417.511279 (4)  337.631055 (3)  123.396521 (1)  195.508960 (2)
500  972.253027 (4)  497.922559 (3)  186.785449 (1)  321.483862 (2)
600  1816.597852 (4)  686.230762 (3)  270.076929 (1)  501.753271 (2)
700  2981.028906 (4)  905.504395 (3)  388.920264 (1)  761.102930 (2)
800  4419.991016 (4)  1134.081348 (3)  528.890967 (1)  1069.794629 (2)
900  6631.042969 (4)  1431.029297 (2)  701.254297 (1)  1468.861230 (3)
1000  8953.753906 (4)  1745.768359 (2)  920.276465 (1)  1945.590430 (3)
1100  11336.753125 (4)  2096.758203 (2)  1225.032813 (1)  2738.099805 (3)
1200  15118.821875 (4)  2490.502148 (2)  1479.197266 (1)  3242.317969 (3)
1300  19446.453125 (4)  2984.109375 (2)  1834.240039 (1)  4093.516406 (3)
1400  24829.050000 (4)  3526.690625 (2)  2288.385938 (1)  5085.367578 (3)
1500  29135.450000 (4)  4119.700000 (2)  2816.079688 (1)  6580.186719 (3)
1600  36557.421875 (4)  4708.926953 (2)  3238.543359 (1)  7441.588281 (3)
1700  47779.368750 (4)  5491.296094 (2)  3883.760547 (1)  8903.942188 (3)
1800  53744.212500 (4)  6253.195703 (2)  4545.821094 (1)  10459.667187 (3)
1900  70516.331250 (4)  7450.325000 (2)  5599.823438 (1)  12874.519531 (3)
2000  85577.275000 (4)  8353.540625 (2)  6520.178906 (1)  14949.984375 (3)

Table 6: Times of algorithms' executions, second round, grammar 3

Length  sequential  basic  modified  shared
Correct
100  105.784509 (3)  293.090991 (4)  33.228580 (2)  31.328815 (1)
200  871.444238 (3)  1262.472852 (4)  150.025122 (2)  125.012280 (1)
300  2981.251758 (4)  2919.509766 (3)  423.191016 (2)  321.749194 (1)
400  7392.471875 (4)  5267.929688 (3)  946.330273 (2)  668.814941 (1)
500  14593.971875 (4)  8313.376562 (3)  1817.293555 (2)  1225.367969 (1)
600  25403.546875 (4)  12053.749219 (3)  3111.574023 (2)  2045.863672 (1)
700  40386.831250 (4)  16513.695313 (3)  4835.642187 (2)  3155.161914 (1)
800  60905.868750 (4)  21719.985938 (3)  7185.243750 (2)  4627.073438 (1)
900  87818.675000 (4)  27854.615625 (3)  10385.600781 (2)  6508.595703 (1)
1000  119630.037500 (4)  34693.246875 (3)  14272.178125 (2)  8847.769531 (1)
1100  160661.337500 (4)  42306.959375 (3)  18733.739063 (2)  11714.818750 (1)
1200  208738.500000 (4)  51082.893750 (3)  24606.879688 (2)  15096.037500 (1)
1300  262685.700000 (4)  60697.350000 (3)  31108.231250 (2)  19131.190625 (1)
1400  331293.250000 (4)  71794.687500 (3)  38668.462500 (2)  23814.490625 (1)
1500  413379.500000 (4)  83249.225000 (3)  48129.728125 (2)  29177.668750 (1)
1600  494018.400000 (4)  95673.400000 (3)  57795.012500 (2)  35308.556250 (1)
1700  594860.500000 (4)  110431.550000 (3)  69673.081250 (2)  42304.384375 (1)
1800  710089.600000 (4)  124503.575000 (3)  82870.043750 (2)  50128.306250 (1)
1900  850779.000000 (4)  141564.037500 (3)  95090.400000 (2)  58917.612500 (1)
2000  981815.900000 (4)  158023.650000 (3)  113064.200000 (2)  68507.175000 (1)
Incorrect
100  109.584741 (3)  366.265479 (4)  35.694171 (2)  33.048584 (1)
200  882.465527 (3)  1272.958301 (4)  151.989783 (2)  126.307629 (1)
300  3020.010547 (4)  2937.044531 (3)  424.778174 (2)  321.361963 (1)
400  7404.908594 (4)  5290.696484 (3)  950.943359 (2)  666.518213 (1)
500  14563.759375 (4)  8340.456250 (3)  1823.233203 (2)  1221.704004 (1)
600  25374.068750 (4)  12084.078125 (3)  3112.191406 (2)  2033.315820 (1)
700  40282.568750 (4)  16547.175000 (3)  4852.723437 (2)  3148.680469 (1)
800  60490.218750 (4)  21750.293750 (3)  7207.053906 (2)  4622.027344 (1)
900  89528.193750 (4)  28033.793750 (3)  10481.420313 (2)  6557.342188 (1)
1000  123443.900000 (4)  34973.293750 (3)  14488.367188 (2)  9040.335938 (1)
1100  160536.400000 (4)  42384.625000 (3)  18788.815625 (2)  11710.995312 (1)
1200  208782.812500 (4)  51145.943750 (3)  24656.014062 (2)  15088.868750 (1)
1300  263405.525000 (4)  60745.831250 (3)  31173.303125 (2)  19123.551563 (1)
1400  332134.150000 (4)  71889.731250 (3)  38742.234375 (2)  23799.312500 (1)
1500  414392.825000 (4)  83341.631250 (3)  48207.262500 (2)  29165.215625 (1)
1600  495410.750000 (4)  95738.831250 (3)  57883.300000 (2)  35293.303125 (1)
1700  596184.300000 (4)  110548.850000 (3)  69773.093750 (2)  42286.125000 (1)
1800  711664.100000 (4)  124616.212500 (3)  82996.856250 (2)  50114.650000 (1)
1900  847468.400000 (4)  141563.937500 (3)  95079.343750 (2)  58785.700000 (1)
2000  997198.000000 (4)  158430.800000 (3)  113453.262500 (2)  68614.818750 (1)

Figure 13: Relationship between length of input and algorithms' performance - Grammar 2, 2nd round

Figure 14: Relationship between length of input and algorithms’ performance - Grammar 3, 2nd round

6.2 Second round of experiments

The implementations of parallel CYK from the first round of the experiment did not give satisfying results. As was written in section 4.5, the crucial issue for thread performance is global memory access, which takes many cycles of the stream processors. In the algorithms presented above, global memory was allocated using functions dedicated to 2D and 3D arrays - cudaMalloc3D for the three-dimensional array (the CYK table) and cudaMallocPitch for two-dimensional arrays. According to [21], using these functions gives the best effects, because the data are aligned properly. However, these functions are dedicated to working with more complex data structures, whereas in the algorithms of this round the data are written as entries of one byte each. The idea for improving the algorithms was to use the linear memory allocation function cudaMalloc, to narrow the size of the data being accessed. This approach requires more calculations, because every thread has to compute the index of the memory cell it accesses. However, because of the smaller size of the data that each thread accesses in these versions of the algorithm, the data can be accessed and cached more efficiently. Another disadvantage of cudaMalloc3D is that the size of each dimension of the structure allocated with this function is limited to 2^11 [21]; the cudaMalloc function has no such restriction.

Figure 15: CUDA Occupancy Calculator output for 11 registers per thread
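The second-round allocation scheme can be sketched as follows - one linear allocation, with the index arithmetic done by hand inside the kernels (a sketch under the assumption of one byte per entry; names are illustrative):

char *alloc_cyk_table(int n, int k)
{
    char *d_table = NULL;
    size_t bytes = (size_t)n * n * k;          /* one byte per entry of CYK[n][n][k] */
    cudaMalloc((void **)&d_table, bytes);      /* plain linear device memory         */
    cudaMemset(d_table, 0xFF, bytes);          /* fill every byte with -1            */
    return d_table;
}

/* inside a kernel, the entry CYK[row][col][nt] is then addressed as:
     d_table[((size_t)row * n + col) * k + nt]  */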

6.2.1 Tests

Again, three versions of the algorithm were tested, on the same testing set as in the first round. The number of registers used by the threads was obtained during compilation of the code by nvcc:

• 12 registers for basic version,

• 11 registers for modified version,

• 17 registers for shared memory version.

For these values, one of the best thread configurations (giving the best performance) obtained from the CUDA Occupancy Calculator (Figure 15) was again 256, so the thread configuration did not need to be changed with respect to the previous round.

6.2.2 Results

The results are shown in tables 4, 5 and 6. Figures 12 and 13 show the relationship between the string length and the algorithms' performance for grammars 1 and 2, both for correct and incorrect inputs. It can be seen that the parallel algorithms are slightly less effective for shorter strings; however, for longer strings (with more than 200 symbols) the parallel versions become more effective than the sequential one. The differences between the times of the parallel algorithms and the sequential one become more significant as the size of the input increases. Figure 14 shows the relationship for the third grammar; for this grammar the shared version gives better results, as was expected.

Three different parallel algorithms were provided in this round, each of them giving promising results. In order to evaluate them, a statistical validation of the results was performed using the Friedman test [6]. The test was conducted as follows: in this round there are results of all 4 algorithms (3 parallel and the sequential one) for 120 inputs (40 for each grammar). For each input the results are ranked - rank 1 for the best result, rank 4 for the worst. The ranks are given in tables 4, 5 and 6, together with the results. Then the average rank of each algorithm was calculated, with the following results:

• 3.72 for sequential,

• 2.75 for basic,

• 1.64 for modified,

• 1.86 for shared.

The test showed that the modified version of the algorithm is, for the chosen testing set, the most efficient one. The second best is the shared version, which, as was written, works well for specific inputs. According to the post-hoc Nemenyi test, the critical difference (the minimum difference between the average ranks of two algorithms that indicates a significant advantage of one of them) is equal to 0.382. This shows that all three parallel algorithms are significantly better than the sequential one, and that the modified and shared versions may be considered equally effective. Still, the modified version gives the best results of the versions tested in this round; the advantage of the shared version comes mostly from the worst-case instances. These versions could still be improved by imposing some limitations and restrictions on the size and type of the grammar, which allow decreasing both the memory and the time complexity. These modifications are described in the next section.

6.3 Third round of experiments

6.3.1 Limited version

The previously presented implementations of CYK, including the version that gives the best results, use a three-dimensional CYK table whose depth equals the number of non-terminals. It can be assumed that increasing the number of non-terminals in the grammar influences both the time and the memory complexity of the algorithm. These two factors can be decreased by assuming a constant third dimension of the CYK table - a constant size of the cell, which implies a limited number of non-terminals in the grammar. With one vector representing every cell, there is no need to reserve a whole entry (of as many bytes as there are non-terminals) for the information about one non-terminal, as in the previous versions; instead, the presence of every non-terminal in a cell is represented by one bit of the vector. This version uses cells of the size of a long int - 32 bits - and is therefore dedicated to grammars with at most 32 non-terminals. This solution has some disadvantages: mainly, the size of the grammar has to fit the size of the cell, and for bigger grammars the algorithm has to be modified. Because every entry of the matrix now represents information about more than one non-terminal, some changes had to be made in the kernel function.
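A sketch of a kernel working on such bit-vector cells is given below; it follows the structure of the modified kernel and uses an atomic OR (discussed further below) so that concurrent updates of the same cell are not lost. The flat indexing, the unsigned int cell type and all names are illustrative assumptions, not the exact thesis code.

__global__ void cyk_limited_pair(unsigned int *table, const int *nonterm_nonterm,
                                 int n, int k, int step)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;    /* cell of the current row        */
    int idy = blockIdx.y * blockDim.y + threadIdx.y;    /* which split this thread checks */
    if (idx >= n - step || idy >= step)
        return;

    unsigned int x = table[idy * n + idx];                            /* left part of the span  */
    unsigned int y = table[(step - idy - 1) * n + (idx + idy + 1)];   /* right part of the span */
    if (x == 0 || y == 0)
        return;

    for (int m = 0; m < k; ++m)
        if (x & (1u << m))
            for (int q = 0; q < k; ++q)
                if ((y & (1u << q)) && nonterm_nonterm[m * k + q] != -1)
                    atomicOr(&table[step * n + idx],
                             1u << nonterm_nonterm[m * k + q]);       /* set the flag safely */
}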

The algorithm works as follows: each thread copies a pair of CYK entries. For each set bit of the first entry and each set bit of the second entry, the algorithm checks whether the two non-terminals represented by these bits are mapped to another non-terminal in the matrix of rules. If they are, the bit representing this non-terminal is set in the entry that is being calculated. Since many threads access the same cell and modify its value, some information could be lost when the modifications happen simultaneously; this is the reason for using atomic functions. Each thread, when adding information about a new non-terminal, performs an atomic logical disjunction (A ∨ B, atomic OR) of the current value of the cell and a value in which the single bit indicating the added non-terminal is set. Using the atomic function eliminates the risk of losing updates. However, the CUDA architecture does not allow atomic functions on variables larger than 32 bits; for this reason, to use bigger grammars, vectors of variables would have to be used - for instance, a vector of two long int values allows grammars with up to 64 non-terminals. For the purposes of this thesis only the version with 32-bit long int variables was implemented. The limited version of the algorithm was implemented and compared with the best unlimited version from the previous round - the modified one. Pseudo-code of this version is shown in listing 4.

Algorithm 4 CYK Algorithm - limited version
Require: Grammar G of l terminals and k non-terminals, term_nonterm matrix of transitions, nonterm_nonterm matrix of transitions, α = a1a2a3...an - the input string, X, Y - block size.

begin
  Copy term_nonterm and nonterm_nonterm arrays to device memory
  Initialize CYK table in device memory
  Gx = n/X + 1   {calculate threads and blocks configuration for the init kernel}
  Gy = n/Y + 1
  Bx = X
  By = Y
  Call init kernel for Bx, By threads in a block and Gx, Gy blocks
  Gy = 1
  for all i, 1 ≤ i < n do
    Call main kernel for Bx, By threads in a block and Gx, Gy blocks, with step = i
    if (n − i) % X == 0 then
      Gx = Gx − 1
    end if
    if i % Y == 0 then
      Gy = Gy + 1
    end if
  end for
  Copy CYK table from device to host memory
end

init kernel
begin
  idx = BlockIdx.x * X + ThreadIdx.x   {x coordinate of the thread in the grid}
  idy = BlockIdx.y * Y + ThreadIdx.y   {y coordinate of the thread in the grid}
  if idx < n and idy < n then
    for i, 0 ≤ i < k do
      CYK[idx][idy][i] = −1
    end for
    if idy == 0 then
      CYK[idx][0][term_nonterm[a_idx]] = 1
    end if
  end if
end

main kernel
begin
  idx = BlockIdx.x * X + ThreadIdx.x    {x coordinate of the cell}
  idy = BlockIdx.y * Y + ThreadIdx.y    {y coordinate of the cell}
  pidx = step − idy − 1                 {x coordinate of the corresponding cell}
  pidy = idx + idy + 1                  {y coordinate of the corresponding cell}
  if idx < n − step and idy < step and CYK[idy][idx] ≠ 0 and CYK[pidy][pidx] ≠ 0 then
    x = CYK[idy][idx]
    y = CYK[pidy][pidx]
    for all m, 0 ≤ m < k, such that bit m of x is set do
      for all q, 0 ≤ q < k, such that bit q of y is set do
        if nonterm_nonterm[m][q] ≠ −1 then
          CYK[step][idx] = CYK[step][idx] ∨ 2^nonterm_nonterm[m][q]   {atomic OR}
        end if
      end for
    end for
  end if
end

Table 7: Times of algorithms' executions, third round, grammar 1

Length  modified  limited
100  87.183453  22.926881
200  46.924008  40.363846
300  71.916431  61.262903
400  108.129297  80.524286
500  178.350671  119.387158
600  239.819849  155.869092
700  324.423853  192.990979
800  435.035156  234.543262
900  544.212061  285.784082
1000  703.427148  339.270703
1100  888.579004  408.605078
1200  1066.865723  485.074756
1300  1371.066406  577.433740
1400  1597.172754  664.328564
1500  2026.482812  771.995361
1600  2261.583984  913.874316
1700  2754.192187  1020.956641
1800  3115.892773  1169.359277
1900  3580.977734  1086.923926
2000  3907.892969  1187.540039

Table 8: Times of algorithms’ executions, third round, grammar 2

Length  modified  limited
100  81.832550  20.382256
200  53.055402  39.560977
300  87.282574  60.156390
400  152.765662  90.16528
500  275.776489  121.522949
600  366.977051  166.504309
700  571.664355  217.758423
800  758.636816  256.438770
900  1040.746875  297.920972
1000  1258.093359  359.048975
1100  1748.910742  432.414258
1200  2076.938477  531.077295
1300  2857.895703  626.170313
1400  3189.558398  740.807617
1500  4243.507031  865.139551
1600  4572.712891  992.832422
1700  5972.576563  1159.146094
1800  6906.695313  1315.886914
1900  7994.775781  1200.032910
2000  8288.764844  1316.392871

Table 9: Times of algorithms' executions, third round, grammar 3

Length  modified  limited
100  78.470477  16.888934
200  54.165015  31.524371
300  122.596057  53.002716
400  207.739551  72.905688
500  416.540869  110.470056
600  643.027295  147.728955
700  1034.873535  194.719397
800  1282.362891  241.085913
900  2080.310742  300.783276
1000  2667.724805  376.931689
1100  3731.328516  457.846240
1200  4137.491406  545.324219
1300  6054.976172  666.020410
1400  7085.269531  767.616260
1500  8998.199219  725.231006
1600  9093.201563  832.475781
1700  13010.128125  999.202832
1800  14569.890625  1145.785449
1900  18085.389062  1302.423145
2000  17716.260937  1497.960156

Table 10: Times of algorithms’ executions, third round, grammar 4

Length  modified  limited
100  92.626453  18.097272
200  83.110059  37.614471
300  208.818213  59.600977
400  427.263135  84.925934
500  737.941699  108.497705
600  1221.296289  142.425769
700  1889.393164  182.497583
800  2779.358398  234.494116
900  3898.339063  280.151855
1000  5296.940625  347.250537
1100  7023.460156  417.474170
1200  9053.360156  492.970947
1300  11796.786719  685.110742
1400  14540.137500  792.546289
1500  18035.812500  894.226270
1600  21417.909375  1028.361719
1700  25620.657813  1225.885742
1800  30431.850000  1347.455859
1900  35141.078125  1219.373242
2000  41963.259375  1351.579297

Table 11: Times of algorithms' executions, third round, grammar 5

Length modified limited 100 91.322247 23.535196 200 156.072546 78.263190 300 435.878223 191.371875 400 971.070020 400.748779 500 1870.525586 718.459961 600 3167.457812 1183.611133 700 4934.672266 1821.663281 800 7315.722656 2669.623242 900 10566.528125 3742.166797 1000 14490.820313 5077.050781 1100 18951.892188 6713.239062 1200 24950.701562 8706.985156 1300 31665.203125 10993.566406 1400 39233.703125 13601.442188 1500 48736.940625 16662.123437 1600 58747.156250 20350.423438 1700 72062.837500 24874.821875 1800 84305.043750 28661.881250 1900 96463.981250 33604.593750 2000 114263.550000 38974.712500 logical disjunction (A ∨ B, atomic OR) of actual value in cell, and the value with one bit, that indicates added non-terminal, set. Using atomic function eliminates risk of cohesion. However, CUDA architecture does not allow to use atomic function for variables bigger than 32 bits. For this reason, for using bigger grammars, vectors of variables has to be used. For instance, a vector of two long int values allow to use grammars with up to 64 non-terminals. For the purposes of this thesis only the version with 32-bit long int variables was implemented. The limited version of the algorithm was implemented and compared with the best unlimited version from the previous version - modified. Pseudo-code of this version is shown on listing 4.
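As an illustration of how such a kernel can be expressed in CUDA C, a minimal sketch of one step of the limited version, together with the host-side step loop, is given below. This is not the actual implementation used in the experiments: the identifiers (cykStep, runCyk, cyk, rules), the flat row-major layout of the CYK table and the 16 x 16 block size are assumptions made only for this sketch; the cell addressing follows the pseudo-code of listing 4.

#include <cuda_runtime.h>

// One step of the limited parallel CYK version (sketch, assumed names and layout).
// Each cell of the CYK table is a 32-bit mask: bit k is set iff non-terminal k derives the substring.
// rules[m * numNonterms + k] holds the index of the non-terminal produced by the pair (m, k), or -1.
__global__ void cykStep(unsigned int *cyk, const int *rules,
                        int numNonterms, int n, int step)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // column of the cell being filled in
    int idy = blockIdx.y * blockDim.y + threadIdx.y;  // which split of the substring this thread checks
    if (idx >= n - step || idy >= step)
        return;

    // The two previously computed cells whose combination may contribute to cyk[step][idx].
    unsigned int x = cyk[idy * n + idx];
    unsigned int y = cyk[(idx + idy + 1) * n + (step - idy - 1)];
    if (x == 0 || y == 0)
        return;

    for (int m = 0; m < numNonterms; ++m) {
        if (!(x & (1u << m))) continue;
        for (int k = 0; k < numNonterms; ++k) {
            if (!(y & (1u << k))) continue;
            int produced = rules[m * numNonterms + k];
            if (produced >= 0)
                // Many threads update the same cell, so an atomic OR prevents lost updates.
                atomicOr(&cyk[step * n + idx], 1u << produced);
        }
    }
}

// Host-side step loop: one kernel launch per row of the CYK table.
// Row 0 is assumed to be already initialised from the terminal rules.
void runCyk(unsigned int *d_cyk, const int *d_rules, int numNonterms, int n)
{
    dim3 block(16, 16);  // a configuration obtained e.g. from the CUDA Occupancy Calculator
    for (int step = 1; step < n; ++step) {
        dim3 grid((n - step + block.x - 1) / block.x,
                  (step + block.y - 1) / block.y);
        cykStep<<<grid, block>>>(d_cyk, d_rules, numNonterms, n, step);
    }
    cudaDeviceSynchronize();
}

Extending this sketch to grammars with up to 64 non-terminals would, as described above, require storing each cell in two 32-bit words and issuing one atomic OR per word.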

Figure 16: Relationship between length of input and algorithms’ performance - Grammar 4, 3rd round

Figure 17: Relationship between length of input and algorithms’ performance - Grammar 5, 3rd round

6.3.2 Tests

Two versions of the algorithm were tested (modified and limited) for four grammars that differ in the number of non-terminals (8, 16, 24 and 32 non-terminals), in order to check the impact of the grammar size. They were also tested for one grammar that forces the algorithm to fill in the whole table (the same grammar that was used in the previous rounds). The kernel of the limited version used twelve registers, so the thread configuration did not need to be changed since the previous rounds. Results are shown in Tables 7, 8, 9 and 10, respectively, for the four grammars. Table 11 shows the results for the grammar which forces the algorithm to perform the worst case scenario (the same as in the previous rounds). Figures 16 and 17 show the relationship between input length and execution time for the compared algorithms, for two of the grammars.

6.3.3 Results

It can be seen that, again, an improvement is gained. For every instance of the problem the limited algorithm gives better results. For the largest instances, the limited algorithm runs from 3 up to 40 times faster (depending on the grammar). Another conclusion concerns the influence of the grammar size on performance. Figures 18 and 19 show the impact of both the grammar size and the input length on performance, respectively for the modified and the limited version of the algorithm. It can be seen that the execution time of the first algorithm (modified) doubles when the number of non-terminals is increased by eight. More detailed tests (i.e. for more grammars, with smaller differences between them) were not possible due to the long execution time of the tests. For the limited version, the size of the grammar does not influence the performance at all. Of course, this is a result of bounding the memory complexity of the second algorithm. The only factor that impacts performance in this version is the number of comparisons and assignments that have to be done, which can be seen when analyzing the worst case scenario results (Table 11). However, this factor cannot be reduced. It is a consequence of the CYK algorithm's properties: each pair of non-terminals in two corresponding entries has to be checked against the list of rules, and the number of these checks depends on the grammar and the input.

Figure 18: Relationship between length of input, size of grammar and modified algorithm’s performance

Figure 19: Relationship between length of input, size of grammar and limited algorithm’s performance

7 Discussion

7.1 Research questions

The following paragraphs give an answer to each of the research questions:

• Does the CUDA implementation of CYK give a significant speed-up?

In this work, four different parallel implementations of the CYK parsing algorithm for the CUDA platform were presented: a version based on an existing model of parallel CYK, two modifications of this version, and one version with an initial condition regarding the size of the grammar. The performed experiments showed that each of these implementations can outperform the sequential algorithm, starting from a specific length of the input (about 200 symbols). As could be predicted, the parallel algorithms give poor performance for inputs shorter than about 200 symbols and can even be slower than the sequential algorithm there. This is caused by the cost of managing the parallel execution: creating threads and copying data to and from the device memory. These operations are not present in the sequential algorithm, and in the parallel algorithms they have to be performed sequentially. As the input gets longer, the performance gain becomes more and more significant.

• How can this implementation be further improved (by modifying the first implementation)?

The first implementation was later improved by deeper parallelisation (dividing the calculations among more threads), by using shared memory, and by limiting the size of the grammar. Each of these improvements (the modified, shared and limited versions) gave a significant speed-up, especially in the worst case tests.

• What is the best thread configuration for the specific version of platform and algorithm?

It was not necessary to search experimentally for the best configuration; it can be obtained by using the CUDA Occupancy Calculator.

• How effective is the delivered solution in comparison with the existing ones (according to the literature found)?

Analyzing the performance characteristics of the algorithms (Figures 12 and 13) shows that the parallel algorithms (the modified and shared versions) run even 10 times faster for the longest inputs. Moreover, the later versions (modified and shared) give better results than the version based on previous work (basic). The results of the statistical tests indicated the most promising solution. The results also showed that the performance of the algorithms depends on the size of the grammar. Therefore, the limited version of the algorithm, introduced in the third round of the experiment, allowed this factor to be eliminated by defining the maximum grammar size in advance (Figures 18 and 19). In the limited version, the cost of global memory accesses and their latency was minimised - each thread accesses memory only twice while reading the previous entries. This version gives up to 10 times better results than the modified version, which translates into up to 100 times better performance in comparison with the sequential version. For the worst case the improvement is smaller, but still significant.

7.2 Discussion of time complexity

As was said in section 3.3, a perfect parallelisation of the algorithm may theoretically result in O(n) time complexity. The theoretical complexity of the basic version was O(n²), because there are n steps of the algorithm and each thread scans up to n entries. In other words, one inner loop from listing 1 (the one that iterates over the cells within one row) is parallelised. For the modified and limited versions it can be assumed that the theoretical complexity is O(n), because all calculations within one step of the algorithm are done in parallel; in other words, both inner loops from listing 1 were parallelised. These theoretical complexities are based on the assumption that the parallelisation of the algorithm is complete. However, such results cannot be obtained, as the experiments showed. This is mainly because of the previously mentioned sequential parts of the algorithm that cannot be parallelised, i.e. memory allocation and thread management. The second reason is CUDA's limitations - it is not possible to run all the created threads simultaneously. For longer inputs, due to the fixed number of SPs, only a part of the threads is executed at any moment. Therefore, the parallelisation is only partial.
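This reasoning can be summarised with a simple work/depth estimate (an illustrative sketch, not taken from the thesis; p denotes the number of threads that the device can actually execute concurrently, and constant factors are ignored):

T_{seq}(n)   = O(n^3)                     % n steps, up to n cells per step, up to n splits per cell
T_{basic}(n) = O(n \cdot n) = O(n^2)      % cells of a row handled in parallel, splits still scanned by each thread
T_{mod}(n)   = T_{lim}(n) = O(n)          % cells and splits both handled by separate threads
T_{p}(n)     = O\left( n + \frac{n^{3}}{p} \right)   % with only p concurrently executing threads

The last bound is consistent with the observations above: once the number of created threads exceeds what the device can run at the same time, the n³/p term starts to dominate, so the measured curves lie between the ideal O(n) and the sequential O(n³).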

By analyzing 6, which shows the worst case results, the complexity can be discussed. It can be observed that the growth of the basic version is close to n², while the growth of the best version (in this case shared) is close to linear. However, reaching the theoretical complexity should not be expected of any of these implementations, for the reasons presented in the previous paragraph. Achieving complete parallelisation will always depend on the computational power of the device.

7.3 Discussion of validity

The performed experiment was limited to a single platform. Because the performance of parallel applications depends heavily on the platform, it would be interesting to perform a wider experiment on different CUDA devices with higher computational power. Time was also a limitation - performing more tests, for additional grammars and inputs, was not possible because the tests were very time-consuming.

To obtain valuable results, the algorithms were tested on a wide set of grammars and inputs. To eliminate the factor of randomness in the experiment, each run of an algorithm was repeated ten times and the results were averaged. Also, only the execution time of the device code was measured, using CUDA built-in functions, so the influence of other factors (such as logging) on the collected time values was eliminated.
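A minimal sketch of how such device-only timing can be collected with CUDA events is shown below; the exact functions used in the thesis are not specified here, and runCyk merely stands for whatever device code is being measured:

#include <cuda_runtime.h>

// Measure only the device execution time, excluding logging and other host-side work.
float timeDeviceCode(unsigned int *d_cyk, const int *d_rules, int numNonterms, int n)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    runCyk(d_cyk, d_rules, numNonterms, n);   // placeholder for the measured device code
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);               // wait only for the device work to finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}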

In the experiment, a sequential version of the algorithm, running on the CPU, was used as a reference point for testing the parallel implementations. There is a threat to validity in comparing a sequential device (CPU) with a parallel device (GPU), because such devices cannot be compared directly. For the needs of this research, the price range of the devices was the factor used to choose similarly capable platforms - CPU and GPU. Nevertheless, a comparison of implementations for two such different platforms remains uncertain.

7.4 Conclusion

Based on the results, it can be concluded that CUDA mechanisms such as shared memory and SIMD-style execution can be usefully applied to the parsing problem, as well as to dynamic programming in general. The obtained results, which indicate that the achieved increase in performance is on a similar or better level than former attempts at parallelising the parsing process (see section 3.3), can be considered a satisfactory outcome of the work. Also, the fact that the application does not require a special platform or computer, but only a specific GPU, makes such a solution more usable. Therefore, future work on this topic should be applying the algorithm to the learning process in GCS, the system this algorithm was dedicated to.

References

[1] M. Brown. Small Subunit Ribosomal RNA Modeling Using Stochastic Context-Free Grammars. Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology, 2000.

[2] J. H. Chang, O. H. Ibarra, and M. A. Palis. Parallel Parsing on a One-Way Array of Finite-State Machines. IEEE Transactions on Computers, C-36(1):64–75, January 1987.

[3] Y. T. Chiang and K. S. Fu. Parallel Parsing Algorithms and VLSI Implementations for Syntactic Pattern Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-6(3):302–314, May 1984.

[4] L. Cielecki and O. Unold. Real-Valued GCS Classifier System. International Journal of Applied Mathematics and Computer Science, 17(4):539–547, December 2007.

[5] L. Cielecki and O. Unold. 3D Function Approximation with rGCS Classifier System. 2008 Eighth International Conference on Intelligent Systems Design and Applications, pages 136–141, November 2008.

[6] Janez Demšar. Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res., 7, December 2006.

[7] S. Fowler and J. Paul. Parallel Parsing: The Earley and Packrat Algorithms. ReCALL, 2009.

[8] J. C. Hill and A. Wayne. A CYK approach to parsing in parallel: a case study. ACM SIGCSE Bulletin, 23(1):240–245, March 1991.

[9] O.H. Ibarra, T.C. Pong, and S.M. Sohn. Parallel recognition and parsing on the hypercube. Computers, IEEE Transactions on, 40(6):764–770, 1991.

[10] D. B. Kirk and Wen-mei W. Hwu. Programming Massively Parallel Processors: A Hands-on Approach. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2010.

[11] B. Kitchenham and S. Charters. Guidelines for performing Systematic Literature Reviews in Software Engineering. Technical Report EBSE 2007-001, Keele University and Durham University Joint Report, 2007.

[12] S. Rao Kosaraju. Speed of recognition of context-free languages by array automata. SIAM Journal on Computing, 4(3):331–340, 1975.

[13] M. Lange. To CNF or not to CNF? An Efficient Yet Presentable Version of the CYK Algorithm. Informatica Didactica, 8:1–21, 2009.

[14] D. K. Lowenthal. A Parallel, Out-of-Core Algorithm for RNA Secondary Structure Prediction. International Conference on Parallel Processing (ICPP’06), 2006.

[15] Y. Matsumoto. A parallel parsing system for natural language analysis. In Third International Conference on Logic Programming, pages 396–409. Springer, 1986.

[16] A. Nijholt. The CYK-approach to serial and parallel parsing. Language Research, 27 (2), pages 229–254, 1990.

[17] A. Nijholt and R. D. Akker. CYK- and Earley’s Approach to Parsing. Computer, pages 1–25, 1979.

[18] NVIDIA. NVIDIA CUDA Occupancy Calculator. http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls. [online; accessed 2011-09-01].

[19] NVIDIA. NVIDIA Compute PTX: Parallel Thread Execution, ISA Version 1.4, 2009.

[20] NVIDIA. NVIDIA CUDA C Getting Started Guide for Microsoft Windows, 2010.

[21] NVIDIA. NVIDIA CUDA C Programming Guide, 2010.

[22] J. Rekers and W. Koorn. Substring parsing for arbitrary context-free grammars. ACM SIGPLAN Notices, 26(5):59–66, May 1991.

[23] B. Schmidt. Parallel RNA sequence-structure alignment. 18th International Parallel and Distributed Processing Symposium, 2004. Proceedings., pages 190–197, 2004.

[24] N. Takashi, T. Kentaro, K. Taura, and J. Tsujii. A Parallel CKY Parsing Algorithm on Large-Scale Distributed-Memory Parallel Machines. Science, 1997.

[25] O. Unold. Context-free grammar induction with grammar-based classifier system. Archives of Control Sciences, 15(4):681–690, 2005.

[26] O. Unold. Playing a Toy-Grammar with GCS. Lecture Notes in Computer Science, 3562(4):87–98, 2005.

[27] O. Unold. Ewolucyjne wnioskowanie gramatyczne. Prace Naukowe Instytutu Informatyki, Automatyki i Robotyki Politechniki Wrocławskiej. Monografie, 105(29), 2006.

[28] O. Unold. Grammar-based Classifier System: A Universal Tool for Grammatical Inference. WSEAS Transactions on Computers, 7(10):1584–1593, 2008.

A CUDA 2.x compute capability

Table 12: CUDA 2.x technical specifications

Technical specifications                                                            Value
Maximum x and y dimension of the grid of blocks                                     65535
Maximum number of threads in one block                                              1024
Maximum x and y dimension of the block                                              1024
Maximum z dimension of the block                                                    64
Warp size                                                                           32
Maximum number of resident blocks per one multiprocessor                            8
Maximum number of resident warps per one multiprocessor                             48
Maximum number of resident threads per one multiprocessor                           1536
Number of 32-bit registers per one multiprocessor                                   32k
Maximum amount of shared memory                                                     48 KB
Number of shared memory banks                                                       32
Amount of local memory per thread                                                   512 KB
Constant memory size                                                                64 KB
Maximum width for 1D texture reference bound to a CUDA array                        32768
Maximum width and height for 2D texture reference bound to a CUDA array             65536 x 65535
Maximum width, height and depth for 3D texture reference bound to a CUDA array      2048 x 2048 x 2048
Maximum number of instructions per one kernel                                       2 million
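These limits can also be checked at runtime instead of being looked up in a table; the following is a small sketch using the CUDA runtime API (device 0 is assumed, and only a few of the fields from Table 12 are printed):

#include <cstdio>
#include <cuda_runtime.h>

// Print a few of the device limits listed in Table 12 for the first CUDA device.
int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    printf("Compute capability:      %d.%d\n", prop.major, prop.minor);
    printf("Max threads per block:   %d\n", prop.maxThreadsPerBlock);
    printf("Warp size:               %d\n", prop.warpSize);
    printf("Registers per block:     %d\n", prop.regsPerBlock);
    printf("Shared memory per block: %zu bytes\n", (size_t)prop.sharedMemPerBlock);
    printf("Constant memory:         %zu bytes\n", (size_t)prop.totalConstMem);
    return 0;
}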

B GeForce 440 GT specification

Figure 20: GeForce 440 GT specification

C Descriptions of the grammars used in tests

C.1 First and second round

C.1.1 Grammar 1

Set of terminal symbols: a, b
Set of non-terminal symbols: A, B, C, S
Set of rules:

S → AC, S → AB, C → SB, A → a, B → b
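As an illustration of how such a grammar maps onto the rule matrix and the bit masks used by the parallel kernels (the bit assignment A=0, B=1, C=2, S=3 and the function name are chosen only for this example, not taken from the implementation), the rules above could be encoded as follows:

#include <cstring>

enum { A = 0, B = 1, C = 2, S = 3, NT = 4 };  // assumed bit indices for the non-terminals

// Build the rule matrix for grammar 1: rules[m][n] = left-hand side of "LHS -> M N", or -1.
void buildGrammar1(int rules[NT][NT])
{
    memset(rules, -1, sizeof(int) * NT * NT);  // -1 means: no rule produces this pair
    rules[A][C] = S;  // S -> AC
    rules[A][B] = S;  // S -> AB
    rules[S][B] = C;  // C -> SB
    // The terminal rules (A -> a, B -> b) only initialise the first row of the CYK table,
    // e.g. the cell for an input symbol 'a' receives the mask (1u << A).
}

This mirrors the nonterm_nonterm matrix from listing 4 and assumes that at most one non-terminal produces a given ordered pair, which is the case for this grammar.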

C.1.2 Grammar 2

Set of terminal symbols: a, b, c, d
Set of non-terminal symbols: A, B, C, D, F, H, L, M, P, R, S
Set of rules:

S → RS, S → FH, S → AS, S → PS, S → FB, S → SM,
F → BD, H → AB, L → AP, M → CB, P → BC, R → LP,
A → a, B → b, C → c, D → d

C.1.3 Grammar 3

Set of terminal symbols: a, b
Set of non-terminal symbols: A, B, C, D, E, F, G, H, I, J, K, L, M, N, S
Set of rules:

S → AS, S → AA, C → SS, D → SA, E → AC, F → AD, G → CA, H → DA,
I → AE, J → AF, K → EA, L → EF, M → CS, N → DS, A → a, B → b

C.2 Third round

C.2.1 Grammar 1

Set of terminal symbols: a, b, c, d, e
Set of non-terminal symbols: A, B, C, D, E, F, G, S
Set of rules:

S → AF, S → AB, F → SB, G → BS, A → a, B → b, C → c, D → d, E → e

C.2.2 Grammar 2

Set of terminal symbols: a, b, c, d, e, f, g
Set of non-terminal symbols: A, B, C, D, E, F, G, H, I, J, K, L, M, U, V, S
Set of rules:

S → RS, S → AS, S → IJ, S → PS, S → EI, S → SS, S → UU, S → FH, S → SM, S → UV, S → FB, S → VU,
F → BD, H → AC, L → AP, M → CB, P → BC, R → LP, U → AB, V → AI,
A → a, B → b, C → c, D → d, E → e, I → f, J → g

C.2.3 Grammar 3

Set of terminal symbols: a, b, c, d, e, f, g, h, i, j
Set of non-terminal symbols: A, B, C, D, F, H, L, M, P, R, U, V, W, X, Y, S, 1, 2, 3, 4, 5, 6, 9, 0
Set of rules:

S → SS, S → 12, S → 34, S → 56, S → 90, S → VW, S → XZ, S → YZ, S → RS, S → AS, S → PS, S → FB, S → FH, S → SM,
F → BD, H → AB, L → AP, M → CB, P → BC, R → LP,
A → a, B → b, C → c, D → d, 1 → e, 2 → f, 3 → g, 4 → h, 5 → i, 6 → j

C.2.4 Grammar 4

Set of terminal symbols: a, b, c, d, e, f, g, h, i, j, k, l, m, n, o
Set of non-terminal symbols: A, B, C, D, E, F, G, H, I, J, K, L, M, N, P, R, T, V, W, X, Y, Z, S, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0
Set of rules:

S → SS, S → 12, S → 34, S → 56, S → 78, S → 90, S → VW, S → XZ, S → YZ, S → RS, S → EG, S → IJ, S → KN, S → FH,
S → FB, S → AS, S → PS, S → SM,
F → BD, H → AB, L → AP, M → CB, P → BC, R → LP,
1 → e, 2 → f, 3 → g, 4 → h, 5 → i, 6 → j, 7 → k, 8 → l, 9 → m, 0 → n,
A → a, B → b, C → c, D → d, T → o

C.2.5 Grammar 5

Set of terminal symbols: a, b
Set of non-terminal symbols: A, B, C, S
Set of rules:

S → AC, S → AB, C → SB, A → a, B → b
