Parallel Parsing of Context-Free Grammars
Master Thesis
Computer Science
Thesis no: MCS-2011-28
December 2011

Piotr Skrzypczak

School of Computing
Blekinge Institute of Technology
SE-371 79 Karlskrona
Sweden

This thesis is submitted to the School of Computing at Blekinge Institute of Technology in partial fulfillment of the requirements for the degree of Master of Computer Science. The thesis is equivalent to 20 weeks of full time studies.

Contact Information:
Author(s):
Piotr Skrzypczak 871216-P317
E-mail: [email protected]

University advisor(s):
Olgierd Unold, PhD, DSc
Faculty of Electronics, Wroclaw University of Technology, Poland
Bengt Aspvall, PhD
School of Computing

School of Computing
Blekinge Institute of Technology
SE-371 79 Karlskrona
Sweden
Internet: www.bth.se/com
Phone: +46 455 38 50 00
Fax: +46 455 38 50 57

ABSTRACT

Over the last decade, interest in parallel programming has grown. This is driven by the trend of building microprocessors as multicore units that can execute instructions simultaneously. A popular and widely used example of such a platform is the graphics processing unit (GPU). Its ability to perform calculations simultaneously is being investigated as a way of improving the performance of complex algorithms. GPUs therefore now have architectures that allow programmers and software developers to use their computational power in the same way as that of a CPU. One of these architectures is the CUDA platform, developed by NVIDIA. The aim of this thesis is to implement a parallel version of the CYK algorithm, one of the most popular parsing algorithms, for the CUDA platform, and to obtain a significant speed-up in comparison with the sequential CYK algorithm.
The thesis presents a review of existing parallelisations of the CYK algorithm, descriptions of the implemented algorithms (a basic version and a few modifications), and an experimental stage in which these versions are tested on various inputs in order to determine which version gives the best performance. Three versions of the algorithm are presented, of which one was selected as the best (giving about 10 times better performance for the longest inputs). A limited version of the algorithm is also presented; it gives the best performance (up to 100 times better than the non-limited sequential version) but requires the grammar to fulfil certain conditions. The motivation for the thesis is to use the developed algorithm in GCS.

Contents

1 Introduction
  1.1 Problem definition
  1.2 Background
    1.2.1 GCS
    1.2.2 CUDA
  1.3 Aims and Objectives
  1.4 Research questions
  1.5 Expected outcomes
  1.6 Research methodology
  1.7 Thesis outline
2 Context-free grammars
  2.1 Grammar definition
  2.2 Chomsky hierarchy
  2.3 Chomsky normal form
3 CYK algorithm
  3.1 Parsing process
  3.2 Sequential CYK algorithm
  3.3 Existing parallelisations of CYK
4 CUDA
  4.1 Overview
  4.2 CUDA Enabled GPU architecture
  4.3 Programming model
  4.4 CUDA compute capability versions
  4.5 Memory Levels
    4.5.1 Global memory
    4.5.2 Constant memory
    4.5.3 Shared memory
    4.5.4 Registers
    4.5.5 Local memory
    4.5.6 Texture/Surface memory
  4.6 Driver API vs Runtime API
  4.7 CUDA C
  4.8 Atomic operations
  4.9 CUDA Occupancy Calculator
5 Application
  5.1 Environment
  5.2 Data representation
  5.3 Parallel CYK
    5.3.1 Basic version
    5.3.2 Modified version
    5.3.3 Version with shared memory
6 Experiments
  6.1 First round of experiments
    6.1.1 Tests
    6.1.2 Results
  6.2 Second round of experiments
    6.2.1 Tests
    6.2.2 Results
  6.3 Third round of experiments
    6.3.1 Limited version
    6.3.2 Tests
    6.3.3 Results
7 Discussion
  7.1 Research questions
  7.2 Discussion of time complexity
  7.3 Discussion of validity
  7.4 Conclusion
References
A CUDA 2.x compute capability
B GeForce 440 GT specification
C Descriptions of the grammars used in tests
  C.1 First and second round
    C.1.1 Grammar 1
    C.1.2 Grammar 2
    C.1.3 Grammar 3
  C.2 Third round
    C.2.1 Grammar 1
    C.2.2 Grammar 2
    C.2.3 Grammar 3
    C.2.4 Grammar 4
    C.2.5 Grammar 5

List of Tables

1 Times of algorithms’ executions, first round, grammar 1
2 Times of algorithms’ executions, first round, grammar 2
3 Times of algorithms’ executions, first round, grammar 3
4 Times of algorithms’ executions, second round, grammar 1
5 Times of algorithms’ executions, second round, grammar 2
6 Times of algorithms’ executions, second round, grammar 3
7 Times of algorithms’ executions, third round, grammar 1
8 Times of algorithms’ executions, third round, grammar 2
9 Times of algorithms’ executions, third round, grammar 3
10 Times of algorithms’ executions, third round, grammar 4
11 Times of algorithms’ executions, third round, grammar 5
12 CUDA 2.x technical specifications

List of Figures

1 CYK Algorithm run
2 Cells important for one thread
3 Architecture of CUDA enabled GPU
4 Threads configuration within CUDA application
5 One row of CYK table
6 Threads within basic version
7 Threads within modified version
8 CUDA Occupancy Calculator output for 17 and 18 registers per thread
9 Relationship between length of input and algorithms’ performance - Grammar 1, 1st round
10 Relationship between length of input and algorithms’ performance - Grammar 2, 1st round
11 Relationship between length of input and algorithms’ performance - Grammar 3, 1st round
12 Relationship between length of input and algorithms’ performance - Grammar 1, 2nd round
13 Relationship between length of input and algorithms’ performance - Grammar 2, 2nd round
14 Relationship between length of input and algorithms’ performance - Grammar 3, 2nd round
15 CUDA Occupancy Calculator output for 11 registers per thread
16 Relationship between length of input and algorithms’ performance - Grammar 4, 3rd round
17 Relationship between length of input and algorithms’ performance - Grammar 5, 3rd round
18 Relationship between length of input, size of grammar and modified algorithm’s performance
19 Relationship between length of input, size of grammar and limited algorithm’s performance
20 GeForce 440 GT specification

1 Introduction

1.1 Problem definition

Parsing is the process of syntactically analysing input data, given as text, in order to determine its grammatical structure according to a given grammar. Usually output