
Concurrent Dynamic Programming for Grid-Based Optimisation Problems

Stephen Cossell

A DISSERTATION SUBMITTED IN FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY.

May 2015

School of Mechanical and Manufacturing Engineering
Faculty of Engineering
The University of New South Wales
Sydney, NSW 2052, Australia

Concurrent Dynamic Programming for Grid-Based Optimisation Problems

Stephen Cossell

Abstract

A particular class of optimisation problems can be solved using a technique known as dynamic programming. This technique applies to problems that have many possible solutions, each consisting of a number of individual decision points. In theory, a globally optimal solution relative to a given metric can be obtained by recursively choosing the most optimal option at each decision point. In practice, however, applications of dynamic programming are computationally expensive for the scale of real-world domains. This thesis examines the existing array of robot motion planning algorithms and applications, a core application of dynamic programming for ground and aerial vehicles. In particular, the thesis highlights that the sequential nature of traditional algorithms does not scale in practice as modern central processing units begin to reach physical limits in terms of computational throughput. This thesis then outlines current parallel processing unit architectures that have emerged in the last decade with particular focus on graphical processing units.

The main contribution of this thesis is a new class of concurrent dynamic programming algorithms that are applicable to multi-core processor architectures. The core mechanic of the algorithms is proven to generate an equally optimal global solution to existing sequential algorithms, with a computational complexity of O(n), assuming enough cores are available relative to the problem size. Various implementation flavours are developed and benchmarked over a variety of two-dimensional configuration spaces, with the most efficient being able to plan on the scale of the main campus of the University of New South Wales (a 1000m × 500m area) with 1m × 1m resolution in sub-second time.

Higher dimensional configuration spaces are also investigated with a proof-of-concept experiment presented to assess the feasibility and performance as the dimensionality increases — a factor notorious in traditional approaches that increases the computational complexity exponentially. The work presented here gains an increased concurrent benefit as the dimensionality of the problem increases and hence is able to perform an order of magnitude more efficiently in the presented three-dimensional experiments. Designing concurrent algorithms can be more difficult, but the implementation benefits will continue to increase as the level of parallelism found in modern hardware approaches the level found in nature.

Originality Statement

I hereby declare that this submission is my own work and to the best of my knowledge it contains no materials previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree or diploma at UNSW or any other educational institution, except where due acknowledgement is made in the thesis. Any contribution made to the research by others, with whom I have worked at UNSW or elsewhere, is explicitly acknowledged in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project’s design and conception or in style, presentation and linguistic expression is acknowledged.

Stephen Cossell

Signed: Date: 21st September 2015

Copyright Statement

I hereby grant the University of New South Wales or its agents the right to archive and to make available my thesis or dissertation in whole or in part in the University libraries in all forms of media, now or hereafter known, subject to the provisions of the Copyright Act 1968. I retain all proprietary rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation.

I also authorise University Microfilms to use the 350 word abstract of my thesis in Dissertation Abstracts International (this is applicable to doctoral theses only). I have either used no substantial portions of copyright material in my thesis or I have obtained permission to use copyright material; where permission has not been granted I have applied/will apply for a partial restriction of the digital copy of my thesis or dissertation. Signed: Date: 21st September 2015

Authenticity Statement

I certify that the Library deposit digital copy is a direct equivalent of the final officially approved version of my thesis. No emendation of content has occurred and if there are minor variations in formatting, they are the result of the conversion to digital format.

Signed: Date: 21st September 2015

Publications

Relevant Peer Reviewed Journal Publications

• S. Cossell and J. Guivant, “Concurrent dynamic programming for grid-based problems and its application for real-time path planning,” Robotics and Autonomous Systems, Vol. 62, No. 6, pp. 737-751, June 2014. (DOI: 10.1016/j.robot.2014.03.002)

• S. Cossell and J. Guivant, “Parallel evaluation of a spatial traversability cost function on GPU for efficient path planning,” Journal of Intelligent Learning Systems and Applications, Vol. 3, No. 4, pp. 191-200, November 2011. (DOI: 10.4236/jilsa.2011.34022)

Relevant Peer Reviewed Conferences

• S. Cossell and J. Guivant, “A GPU-Based Concurrent Motion Planning Algorithm for 3D Euclidean Configuration Spaces,” Australasian Conference on Robotics and Automation, Sydney, Australia, 2013.

• S. Cossell and J. Guivant, “An Optimised GPU-Based Robot Motion Planner,” Australasian Conference on Robotics and Automation, Sydney, Australia, 2013.

Other Articles Published During Candidature

• J. Guivant, S. Cossell, M. Whitty and J. Katupitiya, “Internet-based operation of autonomous robots: the role of data replication, compression, bandwidth allocation and visualization,” Journal of Field Robotics, Vol. 29, No. 5, pp. 793-818, 2012.

• S. Cossell, “A novel approach to automated systems engineering on a multi-agent robotics platform using enterprise grade configuration testing,” Australasian Conference on Robotics and Automation, Wellington, New Zealand, December, 2012.

• S. Cossell, M. Whitty and J. Guivant, “Streaming kinect data for robot teleoperation,” Australasian Conference on Robotics and Automation, Melbourne, Australia, 2011.

• F. Norzahari, K. Fairlie, A. White and M. Leach, M. Whitty, S. Cossell, J. Guivant and J. Katupitiya, “Spatially smart wine — testing geospatial technologies for sustainable wine production,” FIG Working Week, Marrakech, Morocco, May, 2011.

• M. Whitty, S. Cossell, K. S. Dang, J. Guivant and J. Katupitiya, “Autonomous navigation using a real-time 3D point cloud,” Australasian Conference on Robotics and Automation, Brisbane, Australia, 2010.

Acknowledgements

To José, thank you for being the perfect supervisor. You always had the right level of interest and understanding in my work, the right level of benevolence to my progress (even when I was tempted by money and a “real” job), the right level of advice and guidance when required, knowing when to check up on me, knowing when to let me be, and having a good honest brash sense of humour that was always appreciated.

To my co-supervisor Dr Jay, thanks for the trust, support and providing a different point of view to how I approached research as a whole. Thanks for giving me the opportunity to teach — something I’ve grown to really enjoy over many years, especially in the context of robotics.

To Mark1, a good friend for the entirety of my candidature, thank you for being a great enabler, for the late night Maccas runs and most importantly for being a good influence and great example of how to be a proper researcher. Thanks for the last minute proof reading.

To Michael, thanks for the endless discussions about each other’s research during the final months. Stop procrastinating.

To Mark2, thanks for proof reading and the very helpful and constructive feedback. You are one of the few people who really understood my work from both sides of the coin.

To all the Mechatronics Postgrads I’ve met over the years, from the fall of the L205 dynasty, to the nomads from ME-305 and the kids in the Blockhouse: thanks for all the chats, lunches, dinners, jogs and movie nights.

1Whitty 2Sheahan

To Glen, thanks for the OpenGL books. They have been a valuable resource during my research. To Dennis, thanks for the display cable.

To Alan, Mike and Hamish, thanks for the continual “are we there yet” hassling, but more importantly for being incredibly flexible with my research commitments on our journey towards world domination. To Paul and Cheyne, thanks for helping promote a good work-life balance so that I could have time after work to have no life and work on my research.

To Mum, thanks for teaching me how to count, and Dad, thanks for teaching me how to count in binary. Thanks for everything!

Contents

Abstract
Publications
Acknowledgements
List of Figures
List of Tables
Nomenclature

1 Introduction
1.1 Background
1.1.1 Traditional Robot Motion Planning
1.1.2 Other Grid-Based Optimisation Problems
1.1.3 Processors and the Human Brain
1.1.4 Concurrent Dynamic Programming
1.2 Objectives
1.3 Contributions
1.4 Thesis Organisation

2 Literature Review
2.1 Optimisation and Planning Before Robots
2.2 Motion Planning for Robots
2.2.1 The Computational Burden of Grid-Based Planners
2.2.2 Planning with Computational Burden as a Constraint
2.3 Computing Processor Capability
2.4 Graphics Processing Unit Architecture
2.5 GPGPU and the Open Graphics Library

2.6 GPGPU post Open Graphics Library
2.7 GPGPU in the Wider Community
2.8 GPGPU and Robotics
2.9 Summary

3 Foundations of Concurrent Grid-Based Optimisation Algorithms
3.1 Problem Definition
3.2 The Core Mechanism of Concurrent Dynamic Programming
3.3 Maintenance of Optimality Relative to a Cost Metric
3.4 Initial Implementation and Experimental Results
3.5 Complexity Analysis
3.6 Summary

4 Applications in Two Dimensional Grid-Based Problems
4.1 Problem Definition
4.2 Optimisation by the Method of Expanding Texture
4.3 Larger Kernel Observation Area
4.4 Subregions
4.4.1 Active and Inactive Subregions
4.4.2 Subregion Metadata Data Structures
4.4.3 Analysis of Execution Time of Components
4.5 Initial Optimisation Techniques
4.6 Statistical Schedule of Checking
4.7 Experimental Results
4.7.1 Spiral Corridor
4.7.2 Office Floor Plan
4.7.3 UNSW Help Point
4.7.4 Multiple UNSW Help Points
4.8 Comparison of Execution Time of Flavours of Algorithm
4.9 Complexity Analysis Comparison
4.10 Live Test on an Unmanned Ground Vehicle
4.11 Summary

5 Extending to Three and Higher Dimensional Configuration Spaces
5.1 Problem Definition
5.2 Algorithm Modification
5.3 Theoretical Benefit of Increasing the Dimensionality of the Configuration Space for Three Dimensions
5.4 Implementation Considerations and Limitations
5.5 Experimental Results
5.5.1 Empty Volume
5.5.2 Rotating Planes
5.5.3 Barker St Carpark - Single Source
5.5.4 Barker St Carpark - All Exits
5.6 Case Study: A State Lattice Planner in (x, y, θ) Space
5.7 Higher Dimensional Applications
5.7.1 Problem Definition
5.7.2 Cell Layout in Higher Dimensions
5.7.3 Concurrent Benefit of Higher Dimensions
5.7.4 Case Study: A 6-DoF Articulated Robot
5.7.5 Implementation Limitations
5.8 Summary

6 Conclusions and Future Work
6.1 Summary of Contributions
6.2 Proposed Future Work
6.2.1 Implementation using Open Computing Language
6.2.2 Applications in Protein Folding
6.3 Final Remarks

A Derivations
A.1 Derivation of the relationship between |E| and |V| in a Cobst = ∅ two-dimensional occupancy grid
A.2 Derivation of the relationship between |E| and |V| in a Cobst = ∅ three-dimensional occupancy volume
A.3 Derivation of the relationship between |E| and |V| in a Cobst = ∅ d-dimensional occupancy hyper-volume
A.3.1 Example for d = 2
A.3.2 Example for d = 3
A.3.3 Example for d = 6

B Relevant Experimental Logs
B.1 Emulating the rasterisation step of the graphics pipeline using OpenCL
B.2 Comparing rastering single lines in CPU against rendering 2D areas in GPU

References

List of Figures

2.1 Fixed function graphics pipeline adapted from Shreiner et al. [1]. The per-vertex operations stage is highlighted in green and was later replaced by a programmable vertex shader. The per-fragment operations stage is highlighted in blue and was later replaced by a programmable fragment shader.
2.2 The layout of stream processor cores in the NVIDIA Tesla GPU. The diagram here is adapted from Nickolls et al. [2]. SM stands for Streaming Multiprocessor, SP stands for Streaming Processor, SFU stands for Special Function Unit, ROP stands for Raster Operations Pipeline, and MT IU stands for Multithreaded Instruction Unit.

3.1 Representation of an occupancy grid’s initial layout and the first five iterations of the algorithm on that occupancy grid. Black cells represent cells in Cobst. Blue cells represent cells that have been evaluated with their final global cost-to-go value. Green cells highlight areas that are currently being evaluated at that given iteration. White cells represent unexplored areas.
3.2 Example of a configuration space with differing local traversal costs either side of an obstacle. Black cells represent cells in Cobst. Red cells represent cells in Cfree with a local cost of 2 units. Blue cells represent cells in Cfree with a local cost of 1 unit.

4.1 General prediction of how the GPU and CPU side of the implementation will affect overall execution time for different sizes of pieces being managed as a unit (referred to later in this chapter as subregions).

4.2 The first four iterations of the expanding texture technique.
4.3 Comparison of the execution time of the vanilla implementation (Whole Texture) against the Expanding Texture implementation. The results are based on appropriately scaling the Office Floor Plan occupancy grid in Section 4.7.2 to a range of dimensions.
4.4 A 5 × 5 cell observation area of a given cell. The kernel is being run for the centre cell marked as blue. Obstacle cells are marked as black with free cells marked as white.
4.5 The cumulative number of sub-iteration steps required by the 3 × 3 and 5 × 5 cell observation variants of the algorithm to be able to evaluate a given number of cells.
4.6 A logarithmically scaled graph of the number of GPU cell calculations requested for different flavours of the algorithm and different subregion sizes. It should be noted that this graph only reflects the cell counts and does not reflect true GPU or CPU load.
4.7 Snapshot of the subregion flavour of the algorithm at a particular iteration. The subregions of size 16 × 16 cells are shown by the grid pattern, with active subregions highlighted by a frosted border. Only active subregions have the cost-to-go function shown in full colour, with inactive subregion cells greyed out.
4.8 Execution time of the various subregion implementation flavours on the on-board machine over the single help point test case presented in Section 4.7.3.
4.9 Execution time of the various subregion implementation flavours on the desktop machine over the single help point test case presented in Section 4.7.3.
4.10 Histogram of the number of iterations a subregion is active using a subregion size of 16 × 16 cells. The iteration equal to the subregion size is highlighted by the red dotted line.

4.11 Histogram of the number of iterations a subregion is active using a subregion size of 32 × 32 cells. The iteration equal to the subregion size is highlighted by the red dotted line.
4.12 Histogram of the number of iterations a subregion is active using a subregion size of 48 × 48 cells. The iteration equal to the subregion size is highlighted by the red dotted line.
4.13 Histogram of the number of iterations a subregion is active using a subregion size of 64 × 64 cells. The iteration equal to the subregion size is highlighted by the red dotted line.
4.14 The generated cost-to-go function overlaid on the original occupancy grid for the spiral corridor test case. The global cost is shown ranging from blue at the low cost destination cells to red at the highest cost cells. Obstacle cells are shown in black.
4.15 The execution time of the algorithm running on the spiral environment for two different graphics cards.
4.16 The generated cost-to-go function overlaid on the original occupancy grid for the office floor plan test case. The global cost is shown ranging from blue at the low cost destination cells to red at the highest cost cells. Obstacle cells are shown in black.
4.17 The execution time of the algorithm running on the office floor plan environment for two different graphics cards.
4.18 The generated cost-to-go function overlaid on the original UNSW Campus occupancy grid. The global cost ranges from blue for low cost cell values to red for high cost cell values, with obstacle cells shown in black.
4.19 The execution time of the algorithm running on the campus help point experiment for different graphics cards, with a single destination cell.
4.20 The generated cost-to-go function overlaid on the original UNSW Campus occupancy grid. Cell colours range from blue for low cost cells to red for high cost cells.

4.21 The execution time of the algorithm running on the multiple destination campus help points occupancy grid. The concurrent algorithm is allowed to spread out into more regions concurrently due to multiple wavefronts, as compared to the single destination point experiment outlined in Section 4.7.3.
4.22 The execution time of each evolutionary flavour of the algorithm. Results gathered here are from the on-board machine using the single destination cell test case from Section 4.7.3. The graph to the right shows a magnified area of the graph to the left focused on the shorter execution times to allow the reader easier comparison of subregion implementations.
4.23 The execution time of each evolutionary flavour of the algorithm. Results gathered here are from the on-board machine using the multiple destination cell test case from Section 4.7.4. The graph to the right shows a magnified area of the graph to the left focused on shorter execution times to allow the reader easier comparison of subregion implementations.
4.24 The execution time of each evolutionary flavour of the algorithm. Results gathered here are from the desktop machine using the single destination cell test case from Section 4.7.3. The graph to the right shows a magnified area of the graph to the left focused on the shorter execution times to allow the reader easier comparison of subregion implementations.
4.25 The execution time of each evolutionary flavour of the algorithm. Results gathered here are from the desktop machine using the multiple destination cell test case from Section 4.7.4. The graph to the right shows a magnified area of the graph to the left focused on shorter execution times to allow the reader easier comparison of subregion implementations.

4.26 A © 2011 Google Maps overview of a subsection of campus alongside the generated cost-to-go function over a generated occupancy grid of the same region. Blue represents low cost cells tending towards red for high cost cells.

5.1 A comparison of the number of cells evaluated over the number of sub-iterations of the respective two-dimensional or three-dimensional kernel. The three-dimensional kernel is represented by blue marks, while the two-dimensional kernel is represented by red marks.
5.2 The method by which planes in a three-dimensional volume are arranged into a two-dimensional texture using a raster format.
5.3 The final cost-to-go function over the empty volume test case. Blue represents low cost cells whereas red cells indicate high cost cells.
5.4 The final cost-to-go function over the rotating planes test case. Blue represents low cost cells whereas red indicates high cost cells.
5.5 A side view of the Barker Street Carpark occupancy volume showing relevant features.
5.6 The final cost-to-go function over the single source Barker St carpark test case. Blue represents low cost cells whereas red indicates high cost cells.
5.7 The final cost-to-go function over the multiple source Barker Street carpark test case. Blue represents low cost cells whereas red indicates high cost cells.
5.8 The recursive raster-in-raster format by which cells from a four-dimensional occupancy hyper-volume are arranged into a two-dimensional texture buffer. Here it is assumed that z represents coordinates in the third dimension and w represents coordinates in the fourth dimension.

5.9 A logarithmically scaled comparison of the number of cells evaluated against the number of sub-iterations of each higher dimensional version of the algorithm. Each algorithm is operating over a completely free occupancy volume or hyper-volume with dimensions 360^d, where d is the dimensionality of the configuration space. Note that the three-dimensional graph terminates execution at sub-iteration 4680 as it has evaluated all 360^3 cells in its corresponding configuration space in R^3.
5.10 A normalised, logarithmically scaled comparison of the number of cells evaluated against the number of executed sub-iterations worth of operations. Each algorithm is operating over a volume or hyper-volume with total cells equal to 360^d, where d is the dimension of the configuration space. All values are normalised against the total number of cells required to exhaustively evaluate the entire configuration space. That is, 360^d cells, where d is the dimensionality. Note that the three-dimensional graph terminates at sub-iteration 4680 as it has finished evaluating the entire volume of cells.
5.11 A two degree-of-freedom articulated robot. Shaded areas represent obstacles in the environment. Except where either piece A or piece B may collide with an obstacle, each joint is capable of 360° revolution in the two-dimensional environment.
5.12 Occupancy grid representation of the two degree-of-freedom robot shown in Figure 5.11 with angles of each joint represented as the two axes. The horizontal axis represents possible angles of the joint at the base of piece A, with the vertical axis representing possible angles of the joint at the connection of piece A and B, in the frame of reference of piece A. The red dot represents the location of the non-joint end of piece B relative to the Euclidean coordinate representation of the environment.

6.1 A sample 8 × 8 cell occupancy grid with labels.

List of Tables

1.1 Comparison of processor performance. Estimated throughput is measured in operations per second for the CPU and single precision floating point operations per second (FLOPs) for the GPUs listed. Although these two scales are not exactly equivalent they are enough to represent the performance of each architecture relative to that of the human brain.

3.1 Comparison of characteristics of the shortest paths North and South of the obstacle between cells A and B.
3.2 Summary of the execution time of the vanilla implementation of the algorithm. Times listed here are given in seconds and reflect the time taken to load the occupancy grid into texture memory on the GPU, exhaustively calculate the cost-to-go function over many iterations, and transfer the resulting function back to CPU memory. CPU based implementations are also included for comparison. Exhaustive Dijkstra implies an implementation of Dijkstra’s algorithm that continues until all cells in a grid are evaluated rather than terminating when a specific cell is reached by the algorithm. The test case tagged with α involved a 2007 ATI Radeon 2400 XT. The test case tagged with β involved a 2010 NVIDIA GeForce GTX 480. Test cases tagged with γ involved a 2012 Intel Core i7 2.6GHz. The test case tagged with † was implemented using a linked list for the priority queue. The test case tagged with ? was implemented using a min-heap for the priority queue.

4.1 Sub-iteration cost breakdown of the workload of a kernel that observes 3 × 3 cells.
4.2 Sub-iteration cost breakdown of the workload of a kernel observing just bishop’s move cells for a 5 × 5 cell observation area.
4.3 Sub-iteration cost breakdown of the workload of a kernel observing just knight’s move cells for a 5 × 5 cell observation area.
4.4 Sub-iteration cost breakdown of the workload of a kernel observing just rook’s move cells for a 5 × 5 cell observation area.
4.5 Comparison of the total sub-iteration steps performed by each of the 3 × 3 and 5 × 5 cell observation area algorithms.
4.6 Cumulative number of cells that each kernel approach can reach at a given iteration. This assumes the best case scenario of no obstacle cells and the algorithm not reaching the edge of an occupancy grid. Column headings represent the observation area used by each approach.
4.7 Cumulative sub-iteration steps required by each approach to evaluate a given number of cells. Values are calculated by multiplying the number of cells that will have been evaluated by the sub-iteration cost of evaluating each cell.
4.8 Breakdown of the execution time of components of the standard subregions implementation for a subregion size of 16 × 16. Results here were collected on the on-board machine given in Table 4.14 over the Office Floor Plan test case outlined in Section 4.7.2.
4.9 Breakdown of the execution time of components of the standard subregions implementation for a subregion size of 80 × 80. Results here were collected from the on-board machine outlined in Table 4.14 over the Office Floor Plan test case presented in Section 4.7.2.
4.10 Breakdown of the execution time of components of the subregions approach, but checking every 2nd iteration, for a subregion size of 16 × 16. Results given here were collected over the Office Floor Plan test case presented in Section 4.7.2 using the on-board graphics card outlined in Table 4.14.

4.11 Breakdown of the execution time of components of the subregions approach, but checking every 2nd iteration, for a subregion size of 80 × 80. Results given here are based on the on-board graphics card outlined in Table 4.14 over the Office Floor Plan test case presented in Section 4.7.2.
4.12 Comparison of the execution times of that experiment set over the two sample subregion sizes. Here % Execution Time represents the ratio of the two recorded times for the particular subregion size.
4.13 Percentage of the subregions that will have been checked and promoted to mature on or by the corresponding subregion lifetime iteration counter i. Here “Check No.” represents the post-iteration check number that is being performed for that iteration. That is, a check number of 2 implies the second time the subregion is checked.
4.14 Hardware specifications for the on-board machine used to benchmark algorithm implementations.
4.15 Hardware specifications for the desktop machine used to benchmark algorithm implementations.
4.16 A comparison of the fastest execution times for each of the seven implemented algorithm flavours presented in this chapter and the last. Times given in this table are in seconds. († On-board) (? Desktop)

5.1 Comparison of the number of cells evaluated at a particular iteration for two-dimensional and three-dimensional domains.
5.2 Hardware specifications for the on-board machine used to benchmark algorithm implementations.
5.3 Hardware specifications for the desktop machine used to benchmark algorithm implementations.

5.4 Execution time of the three-dimensional implementation on the Empty Volume test case for each test machine. CPU testing was performed on a 2012 Intel Core i7 2.6GHz processor. The † test case involved a linked list priority queue, while the ? test case involved a min-heap priority queue.
5.5 Execution time of the three-dimensional implementation on the Rotating Planes test case for each of the benchmarking machines.
5.6 Execution time of the three-dimensional implementation on the Single Source Barker Street Carpark test case for each benchmarking machine.
5.7 The five zero cost destination locations used in the All Exits Barker Street Carpark test case. The first four of these exits are located on the ground floor of the structure, with the fifth representing access to the road at the top of the slope via a foot bridge.
5.8 Execution time of the three-dimensional implementation on the All Exits Barker Street Carpark test case for each benchmarking machine.
5.9 The six motions of a differentially constrained robot relative to a lattice of discretised states.
5.10 A breakdown of the theoretical number of sub-iterations required to exhaustively evaluate a 360^d cell configuration space given Cobst = ∅.
5.11 Configuration space size given that all cells within that space must fit inside the bounds of a 4096 × 4096 texel texture buffer.
5.12 Configuration space size given that 128MB can be used to store cell values. This table assumes a cell value can be represented using an IEEE 32-bit single precision floating point format.

Nomenclature

Abbreviations
CPU  Central Processing Unit
GCS  Ground Control Station
GPU  Graphical Processing Unit
GPGPU  General Purpose computing on Graphical Processing Units
UAV  Unmanned Aerial Vehicle
UGV  Unmanned Ground Vehicle

Symbols
C  A set of cells in a grid.
Cobst  A set of cells in C that are non-traversable (obstacle cells).
Cfree  C \ Cobst
i, j, k, ...  A coordinate of a cell in a grid, in the first, second, third, ... dimension.
ci,j,k,...  An individual cell in a grid at row i, column j, plane k, hyperplane ....
C∗()  The global cost-to-go value of a cell relative to a zero cost destination cell or cells.
c∗()  The local cost of traversing a cell.
d  The dimensionality of the problem.
n × n = N  The dimensions of an occupancy grid with a total of N cells.
s × s = S  The dimensions of a subregion within an occupancy grid with a total of S cells.
Ti, ti  The i-th iteration of a given algorithm.

Chapter 1

Introduction

1.1 Background

Robotics and artificial intelligence applications are becoming more prevalent and critical in our society. We are at the beginning of the next industrial revolution. However, according to Bekey et al. [3], the main high level limitations and challenges roboticists face when trying to match the abilities of autonomous beings in nature, and in particular relative to humans, can be categorised into four main areas:

• Poor energy storage and use for computation and propulsion relative to living creatures;

• Poor sensor capabilities to sense the surrounding environment;

• Poor quality actuators and the mechanics of reactive, agile and dexterous interactions with such an environment; and

• Lack of computational power or the correct algorithms to adapt and react to our natural environment in real-time.

The research presented in this thesis contributes to the last of these four areas. In particular, this thesis presents published research that improves the performance of one of the key areas of robotics that also coincides with applications in diverse areas, from control theory to protein folding — solving dynamic programming and optimisation problems. The remainder of this section gives a brief outline of traditional

approaches to solving the dynamic programming problem of robot motion planning, with the aim of placing the presented research in an applicable context. Other grid-based optimisation problems are briefly discussed to demonstrate the generality of the proposed work. Graphical Processing Units (GPUs) and their architectural design are then independently introduced in the context of performance trends of single and multi-core central processing units (CPUs), as well as processing units found in nature. The general premise behind concurrent dynamic programming is introduced in the context of these two previously disjointed fields.

1.1.1 Traditional Robot Motion Planning

Robot motion planning is used extensively in this thesis as a use case for dynamic programming as motion planning has applications in many fields within robotics. The need for exact and efficient motion planners is becoming more apparent as robotics applications begin to move from the laboratory into the public arena. Globally optimal solutions may be required over the scale of kilometres, given dense local regions that require accurate planning and heavily dynamic environments. Current forefront approaches use subsampled, statistically approximated or graph-based methods, which trade precision for real-time solution generation. While this trade-off can provide a satisfactory solution in real-time for many laboratory environments, there are many real world applications that require a more accurate solution. However, traditional grid-based approaches are often insufficient for real-time applications when any one of the required scale, resolution or replan frequency of a problem domain increases beyond the computational capabilities of available processor hardware. Grid-based approaches are important when the obstacles in an environment are dense. In the most extreme case, if a particularly critical path can only be represented by a single unit wide corridor, sampling and randomised methods are likely to miss the detail in this path. This can lead to either a suboptimal solution, or more critically the algorithm falsely concluding that a solution does not exist. If the environment is highly dynamic and dense, it is important to replan at a global level quickly, rather than just calculate a locally optimal path. Grid-based planners will

always find the most optimal solution given the configuration space representation of an environment is able to show the detail of dense regions adequately. Traditionally, exact planning algorithms suffer from a common inefficiency — they explore potential paths of motion sequentially. Two of the most commonly used exact sequential path planning algorithms are Dijkstra’s algorithm and A*-search, with many derivatives of these also existing in common use. Even non-exact approaches such as rapidly exploring random trees (RRTs) [4] and probabilistic roadmap methods [5] have mainly involved sequential operations to avoid race conditions on generating state space representations of an environment and during planning. A more in depth discussion of these algorithms is given in Chapter 2, but a brief outline will be given here. For the entirety of this thesis, when referring to any traditional path planning algorithm in the context of performance comparison, the exhaustive or single-source shortest path version of each algorithm is implied. That is, given a destination location as a node or cell, the assumption is that the shortest path is calculated for all nodes or cells to the given destination cell, for graph and grid-based contexts, respectively. In each exact algorithm, a sequential process is repeated until a solution is discovered. Each iteration involves deciding which unexplored path to explore next via a priority queue data structure and a given cost metric. Once the most optimal step is decided upon, the path is explored and newly discovered paths are inserted into the priority queue. These algorithms have been increasingly useful in their implementation over the past 50 years as processor speeds increased, without the core algorithm requiring modification. Now that the processing speed of a single core is beginning to reach a physical limit, robotics researchers cannot continue to expect a free ride with existing algorithms simply by upgrading to a faster single core processor. Gordon Moore observed that the number of transistors fabricated onto a chip doubled every 18 months [6], which roughly translates to a doubling of the processing power at the same rate. As processors have begun to reach physical limits of transistor number and therefore processing speed [7], the new metric of growth has shifted towards the number of cores that comprise a processor in recent years.
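To make the sequential structure described above concrete, the following is a minimal sketch, in Python, of the exhaustive single-source form of Dijkstra's algorithm over a 4-connected occupancy grid. The function name, grid encoding and cost values are illustrative assumptions and are not taken from the thesis implementation. Note that exactly one cell is expanded per loop iteration, regardless of how many processor cores are available.

import heapq

def exhaustive_dijkstra(grid, goal):
    """Cost-to-go for every free cell of a 4-connected occupancy grid,
    relative to a single zero-cost goal cell. grid[i][j] is the local
    traversal cost of a cell, or None for an obstacle cell."""
    rows, cols = len(grid), len(grid[0])
    cost = [[float("inf")] * cols for _ in range(rows)]
    gi, gj = goal
    cost[gi][gj] = 0.0
    queue = [(0.0, gi, gj)]               # priority queue ordered by cost-to-go
    while queue:                          # one path expansion per iteration
        c, i, j = heapq.heappop(queue)
        if c > cost[i][j]:
            continue                      # stale entry, a cheaper path was found
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ni, nj = i + di, j + dj
            if 0 <= ni < rows and 0 <= nj < cols and grid[ni][nj] is not None:
                nc = c + grid[ni][nj]     # accumulate the local traversal cost
                if nc < cost[ni][nj]:
                    cost[ni][nj] = nc
                    heapq.heappush(queue, (nc, ni, nj))
    return cost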

Since the speed of a single core has plateaued, while the number of cores continues to increase, existing robotics algorithms must be re-invented and re-implemented for parallel architectures. This thesis presents foundational work in designing dynamic programming techniques for such parallel architectures. In addition to the paradigm shift to multi-core CPUs, which in recent years have been fabricated with two, four, or eight cores, modern Graphics Processing Units (GPUs) have evolved to contain thousands of cores, but under a reduced and less flexible instruction set. The next section introduces the parallel architecture found in modern GPUs and how this type of hardware is a major step towards reaching the processing power found in nature, compared with the progress currently being made by CPU manufacturers.

1.1.2 Other Grid-Based Optimisation Problems

While the methods presented in this thesis focus on motion planning, the approach can be generalised to concurrently solve a wide range of dynamic programming problems that can be mapped to a grid representation. For example, the longest common subsequence problem [8] is a classical dynamic programming problem that attempts to find the longest sequence of elements in a series common to two or more sequences. A common technique for solving this problem is to create a grid with the elements of each sequence as increments along orthogonal axes. The algorithm then generates a cost value for each cell in the grid based on a recursive function. This sequential process can be transformed into a concurrent one by calculating values of grid cells concurrently when they do not ancestrally depend on each other. This algorithm is important as it is the basis of common file differencing tools used in software version control suites and for DNA sequence matching. The knapsack problem is another classical dynamic programming problem that can be mapped to a grid-based domain. The problem involves a set of objects that each have a corresponding benefit and burden value. The aim is to find the optimal subset of objects that maximise the benefit metric given that the burden metric must be no greater than a given value [9]. Again, all possible combinations of optimal solutions to subproblems can be placed in a grid. Traditionally, each cell in the grid

is calculated sequentially, but many of these cells can be calculated concurrently if all dependent cells have been previously calculated.
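As an illustration of the grid formulation, a minimal sequential longest common subsequence solver is sketched below in Python; the names are illustrative and the sketch is not drawn from the thesis implementation. Every cell on one anti-diagonal depends only on cells with a smaller row-plus-column index, which is the dependency structure that permits concurrent evaluation; the knapsack table has an analogous structure, with each row depending only on the row above it.

def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b,
    filled in row-major (sequential) order."""
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1          # extend a match
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[rows - 1][cols - 1]

# All cells (i, j) with the same i + j depend only on cells with smaller i + j,
# so each anti-diagonal could, in principle, be evaluated in parallel.
print(lcs_length("dynamic", "programming"))    # -> 3 ("ami")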

1.1.3 Computer Processors and the Human Brain

If we continue to draw from nature in our understanding and inspiration for robotic research and development, an autonomous system can be described in the simplest form as a machine that takes sensory inputs, decides upon actions that align with an internal goal, and then performs a sequence of actions. In this analogy the processor that turns sensory inputs into actions has the same role as the brain of a creature. As stated above, over the past 60 years the processing power of mainstream CPUs has doubled every 18 months relative to Moore’s Law in relation to the number of transistors being fabricated onto each chipset [10]. Even with the recent shift to measuring CPU performance relative to the number of cores, processors still have orders of magnitude less capability relative to processors found in nature. Moravec [11] estimates the processing power of the human brain to be 1.0 × 10^14 operations per second, based on the equivalent capabilities of computer vision techniques relative to human vision capabilities, then extrapolated by the ratio of the number of neurons in the human eye relative to the entire volume of the human brain. Alternatively, Westbury [12] estimates the human brain to be capable of 2 × 10^19 operations per second based on the number of neurons, the speed of each neuron and the number of neurons connected to the output of a given neuron. The current processing power of the human brain can be compared to current CPU and GPU performance via Table 1.1. Using Westbury’s estimate as a reference point, if the processing throughput of the human brain were analogous to the equatorial circumference of the Earth, as given by NASA [13], then the throughput of current consumer multi-core CPUs is in the order of a few centimetres, modern GPUs in the order of ten metres, and Moravec’s estimate in the order of most modern city buildings in terms of distance. In contrast, one of the world’s fastest supercomputers has an estimated computational throughput of 33.86 petaflops, or around 67km in the distance analogy [14]. The sources cited in Table 1.1 claim a throughput measurement for the human

Table 1.1: Comparison of processor performance. Estimated throughput is measured in operations per second for the CPU and single precision floating point operations per second (FLOPs) for the GPUs listed. Although these two scales are not exactly equivalent they are enough to represent the performance of each architecture relative to that of the human brain.

Processor                                               Estimated Throughput
AMD Quad Core CPU with each core's clock at 2.5GHz      1 × 10^10
NVIDIA Tesla GPU (May 2010)                             5.18 × 10^11
NVIDIA Tesla GPU (May 2011)                             2.06 × 10^12
NVIDIA Tesla GPU (May 2012)                             5.54 × 10^12
Human Brain (Moravec)                                   1 × 10^14
MilkyWay-2 (Tianhe-2)                                   3.39 × 10^16
Human Brain (Westbury)                                  2 × 10^19

brain many orders of magnitude greater than current CPU and GPU technology, yet the rate at which electrical signals are triggered in the human brain is estimated to be closer to 200Hz, according to Westbury [12], far slower than current microprocessors. The immense computational rate for a processor with such a slow electrical clock frequency mainly comes from the highly parallel nature of neuron wiring in the brain. The result of each neuron’s calculation is thought to be able to propagate to millions of other neurons. Many consumer-grade machines appear to have far superior processors than their predecessors five years ago, yet quote slower clock frequencies. This may be due to many factors such as more and faster memory, different caching strategies, and faster bus speeds, but it could mainly be due to machines being designed with multiple cores to share the workload. The current state of processor architectures and parallel programming is reviewed in Chapter 2 in greater detail. In addition to the potential benefits of transitioning to multi-core implementations, assigning work to a GPU has the inherent benefit of releasing workload on a CPU for other tasks. That is, distributing the workload among processors, let alone cores, is a beneficial step towards more reactive systems.
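The distance analogy above can be reproduced with a few lines of arithmetic. The Python sketch below assumes an equatorial circumference of roughly 40,075 km for the NASA figure cited above (the exact value is not quoted in the text) and scales a selection of the throughputs in Table 1.1 against the Westbury estimate.

EARTH_CIRCUMFERENCE_M = 40_075_000      # approximate equatorial circumference
BRAIN_WESTBURY_OPS = 2e19               # reference scale: operations per second

throughputs = {
    "Quad-core CPU": 1e10,
    "NVIDIA Tesla GPU (May 2012)": 5.54e12,
    "Human brain (Moravec)": 1e14,
    "MilkyWay-2 (Tianhe-2)": 3.39e16,
}

for name, ops in throughputs.items():
    metres = EARTH_CIRCUMFERENCE_M * ops / BRAIN_WESTBURY_OPS
    print(f"{name}: {metres:.3g} m")

# Roughly 0.02 m for the CPU, 11 m for the GPU, 200 m for Moravec's estimate and
# about 68 km for Tianhe-2, consistent with the centimetres, tens of metres,
# building-height and ~67 km comparisons quoted in the text above.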

1.1.4 Concurrent Dynamic Programming

For robots to be able to react to sensor data and make decisions as quickly as humans, or at least many animals, the multi-core revolution must be embraced. Many robotics algorithms can be directly mapped to concurrent implementations, but there is a class of algorithms particularly used for path planning in ground and aerial vehicles that, as stated already, is inherently sequential in nature. Dynamic programming problems are recursive — the result of one calculation in the overall process of generating a global solution depends on the previous value being calculated first. This research presents a concurrent version of the basis of all grid-based path planning and dynamic programming algorithms for parallel architectures that performs more efficiently by orders of magnitude and gains greater concurrent benefit as the dimensionality of the problem increases.
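The flavour of the concurrent alternative can be illustrated with a toy wavefront relaxation over an occupancy grid: every cell is recomputed from its neighbours in each pass, rather than one cell being expanded at a time. The Python sketch below is only indicative (and still executes sequentially); it is not the GPU implementation developed in later chapters, and the names and grid encoding are illustrative assumptions. For positive local costs it converges to the same cost-to-go field that the sequential algorithms produce, and with one core per cell each pass could be evaluated as a single parallel step.

def wavefront_relax(grid, goals, max_passes=10_000):
    """Iteratively relax a cost-to-go field on a 4-connected occupancy grid:
    each pass recomputes every free cell as min(neighbour cost + local cost).
    grid[i][j] is the local traversal cost of a cell, or None for an obstacle."""
    rows, cols = len(grid), len(grid[0])
    cost = [[float("inf")] * cols for _ in range(rows)]
    for gi, gj in goals:
        cost[gi][gj] = 0.0                         # zero cost destination cells
    for _ in range(max_passes):
        changed = False
        new_cost = [row[:] for row in cost]
        for i in range(rows):
            for j in range(cols):                  # each cell is independent work
                if grid[i][j] is None or (i, j) in goals:
                    continue
                best = cost[i][j]
                for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < rows and 0 <= nj < cols:
                        best = min(best, cost[ni][nj] + grid[i][j])
                if best < new_cost[i][j]:
                    new_cost[i][j], changed = best, True
        cost = new_cost
        if not changed:                            # fixed point reached
            break
    return cost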

1.2 Objectives

This research aimed to exploit certain properties of approximated grid-based motion planning problems for efficient solving on parallel architectures. Such architectures include emerging and highly parallel Graphics Processing Units (GPUs) currently available for mainstream application. In particular, the objectives are:

• To determine whether an algorithm could be created that produced a mathematically identical global solution to traditional approaches, while performing many steps of that algorithm in parallel rather than sequentially.

• To determine whether a two-dimensional grid-based implementation of the algorithm on modern GPU hardware could perform to, or improve upon, the execution time of traditional sequential implementations.

• To benchmark this implementation on a range of generations of GPU hardware to predict future performance trends of the algorithm’s implementation relative to characteristics of the hardware.

• To determine whether the proposed algorithm and implementation can be extended to applications in higher dimensions and how such an implementation performs relative to existing sequential approaches.

• To properly assess and understand the relationship between the achievable level of parallelism from the algorithm in relation to the dimensionality of the configuration space and whether this relationship proves beneficial for current and predicted future hardware.

1.3 Contributions

The main contributions of this research are as follows:

• A set of techniques for solving grid-based optimisation problems using a concurrent approach. These techniques are presented from a theoretical point of view and analysed as such. An initial implementation is given to prove that the proposed algorithm generates a numerically identical solution to existing grid-based approaches.

• An evolving set of implementation flavours of the core algorithm on modern GPU architecture. Each flavour is an incremental improvement on the previous, and the reasoning behind each incremental approach is presented. Each flavour is thoroughly benchmarked on a range of configuration space types to properly assess real-world performance. These flavours are proven to be able to run in O(N) time, where N is the number of cells or nodes in the configuration space representation, assuming at least O(√N) cores are available.

• A technique and analysis of extending the two-dimensional algorithm to higher dimensional configuration spaces. A theoretical analysis is given showing the relationship of efficiency and level of achievable parallelism as the dimensionality increases.

• A proof-of-concept implementation of a three-dimensional algorithm with benchmarked performance results over a range of occupancy volume types.

1.4 Thesis Organisation

The remainder of this thesis is arranged as follows.

• Chapter 2 gives a review of existing research from the traditionally disjointed fields of robot motion planning and concurrent programming. The review concludes by presenting recent research on combining these fields and offers examples where other fields have greatly benefited from the move from traditional sequential computing to parallel computing.

• Chapter 3 introduces the core concepts of solving grid-based optimisation problems via concurrent dynamic programming. The chapter presents benchmarking results of an initial implementation attempt before giving a thorough proof of the optimality and computational complexity of this initial approach.

• Chapter 4 presents a deeper analysis of the two-dimensional grid-based solver implementation and provides details of several optimisation steps used to allow the implementation to perform in real-time. Results of benchmarking experiments performed on a variety of occupancy grid examples are presented. A complexity analysis of the final implementation is given in comparison to a traditional sequential algorithm, namely Dijkstra’s algorithm.

• Chapter 5 extends the two-dimensional implementation to higher-dimensional configuration spaces. A theoretical analysis of the algorithm’s computational complexity as the dimensionality of the configuration space increases is given. An implementation is also presented for a set of three-dimensional grid-based configuration spaces, ranging from contrived cases to real-world examples. This implementation is benchmarked on a range of generations of graphics cards in an attempt to gauge the current and future performance of the algorithm.

• The thesis concludes with a brief discussion covering other areas outside robot motion planning that could potentially benefit from concurrent dynamic programming before final remarks are given.

Chapter 2

Literature Review

The previous chapter brought to light the need for concurrent dynamic programming. To put the research presented in this thesis into context, this chapter gives a chronological background as to how various approaches to robot motion planning¹ have arisen in the last 50 to 60 years. The review begins by covering research on solving optimisation problems before the emergence of real-time robotics applications. Here the emphasis was more on the mathematical process of generating an accurate solution rather than how efficiently the solution could be generated in practice.

The chapter then continues to review initial applications of dynamic programming in robotics and the constraints and shortcomings that arose from these applications. This section focuses particularly on existing grid-based approaches to robot motion planning as they are known to give exact solutions² while potentially being the most computationally expensive. As the robotics community began to realise that simple grid-based approaches were too inefficient for practical use on current (at the time) processor hardware, many sub-optimal approaches have been proposed that aim to give a satisfactory solution in a more computationally efficient way. The chapter continues by reviewing these approaches in the context of their

¹ Here, robot motion planning relates primarily to the process of planning a path between an initial and goal state, with considerations of the shape and volume of the robot taken into account after an optimal path has been generated. It does not consider the subcategory of literature concerned with taking the volume of the robot into consideration when planning.
² Accurate to a chosen grid resolution.

computational efficiencies as well as the quantitative sacrifices to the generated solution.

A popular technique used by researchers to gain more performance from their planners over the last few decades was to simply purchase the next generation of processor hardware. The performance trends of processor hardware over the last few decades are reviewed. This area of revision concludes with emphasis on the throughput capabilities of single core processors plateauing due to fabrication techniques reaching their physical limits in recent years. To maintain the desired performance trends relative to the rate proposed by Moore’s Law, manufacturers have shifted to producing multi-core processors. A paradigm shift in processor design translates to a paradigm shift in how engineers design parallel capable software. Some of these differences are highlighted before the architecture of the highly parallel modern Graphics Processing Unit (GPU) is introduced.

The technique of General Purpose computing on GPUs (GPGPU) is then reviewed from the point of view of graphical and non-graphical applications. Applications of GPGPU that have benefited various fields of research are then reviewed to demonstrate how the technique of GPGPU can benefit many areas, with robotics applications highlighted at the conclusion of this chapter.

2.1 Optimisation and Planning Before Robots

Dynamic programming algorithms have been applied to a number of optimisation problems over the last 60 years. Gavish [15] presented work in 1982 on optimal packet routing over computer networks, while as recently as 2009, Ehmke et al. [16] applied traditional dynamic programming approaches to transportation and logistics planning. Many of the first dynamic programming algorithms were presented as mathematical proofs of a method to generate an optimal path or minimal subtree without any emphasis on the complexity required to generate the solution. In these two areas of application, the generation of an optimal solution would only be required every few years, months or hours. This is especially evident in logistics planning as

the road and rail topology of a city is unlikely to change at such a high frequency. Kruskal in 1956 [17] proposed an algorithm for generating the shortest spanning tree of a graph with possible applications to the travelling salesman problem. Here the short paper concentrated solely on describing the algorithm and proving the optimality of a generated minimal subgraph. Later, in 1959, Dijkstra [18] presented an algorithm to discover the shortest path between two given points in a weighted graph. Similarly, the paper focused on the proof of optimality of the solution rather than any complexity analysis of generating the solution. The lack of comment on the computational complexity of these algorithms is understandable for the era in which they were published. In fact, Dijkstra's original published algorithm differs slightly from the version taught in tertiary level courses and implemented in many modern applications. Specifically, Zhan [19] notes that Dijkstra's original version did not specify that unexplored paths be kept in a sorted priority queue. It was not until 1969, when Dial [20] first implemented a shortest path algorithm, that the notion of sorting unexplored paths was considered, using a "topological ordering" into buckets that maintained the global potential cost of exploring a particular path. Here experimental results of the assembly code implementation running over a 12,000 node, 36,000 edge graph are included. The approach claims to generate a single-source all-shortest-paths solution in one second. It is around this time that published research on generating solutions to optimisation problems shifted from mathematical and theoretical publications to engineering and practical applications.

2.2 Motion Planning for Robots

Prior to applying optimisation solvers in the field of robotics, the previously mentioned research was mainly applied to higher level planning contexts. That is, such applications likely involved problems such as finding the optimal path for a delivery driver or postal worker to perform their daily rounds. Here the focus was on the order in which waypoints should be visited rather than the lower level task of trying to navigate between two waypoints in an environment — a task humans

are fairly capable of performing optimally. As the first field robots began to come into existence, motion planning began to split into two subfields: the traditional high level planning via multiple waypoints in a graph, and the low level planning between a particular pair of waypoints from a more metric and obstacle avoidance point of view. One of the first thorough attempts at applying path planning algorithms to the field of mobile robotics was Stanford's Shakey robot [21]. Between 1966 and 1972, researchers at the Artificial Intelligence Center at the Stanford Research Institute developed Shakey as an autonomous ground vehicle able to sense its surrounding environment via a TV camera and other sensors and perform tasks including traversing between waypoints in its environment. From this research arose work by Hart et al. [22] on a modified version of Dijkstra's algorithm known as A*-search. In its most basic form, A*-search differs from Dijkstra's algorithm only by how unexplored paths are ordered in the priority queue. Instead of sorting only by the global potential cost of a path back to a source node, an extra heuristic is included that describes additional information about the environment. For example, in many real-world applications of A*-search, the Euclidean distance from a node to a given destination node is used as the additional heuristic value to coax the algorithm to explore nodes that are spatially more likely to lead to the destination node. See Cossell and Guivant [23] for a visual demonstration of the difference between Dijkstra's algorithm and A*-search for exploring grid-based environments. Robot motion planners can be classified into two broad categories: metric and topological. Here the difference in classification lies not only in the algorithm, but also in how the environment is represented. Topological approaches map to the more traditional and abstract graph-based structure of waypoints connected by edges, whereas metric based approaches divide an environment into a tessellating layout of cells. In the context of this thesis, a tessellating grid of equally sized square cells, known as an occupancy grid, is used to represent an environment, but in other literature this can also imply exact cell decompositions of the configuration space. Elfes [24] published one of the first papers, in 1989, involving occupancy grids and grid-based planners for robot motion planning.

Here cells in the grid are assigned a traversal cost based on how sensor readings of the environment are interpreted. Cells that are not traversable by an agent are assigned, throughout the literature, either large or infinite traversal cost values, as is the case with Khatib's potential field approach [25], or specially defined constants that deem a cell inaccessible, as is the case with the first occupancy grid approach by Elfes [24] and the approach taken in this thesis. Hwang and Ahuja [26] presented a comprehensive literature review and definition of robot motion planning in 1992. The main realisation at the time, with regards to this thesis, was that computer processing capabilities were beginning to allow robotics applications to transition from automated to autonomous. In addition, the review references work by Tannenbaum and Yomdin [27] that implies that exhaustive evaluation of a configuration space with d degrees-of-freedom is likely to run in exponential time in d. The review continues by classifying motion planning algorithms into the resolution complete and the probabilistic complete, with the main differences highlighted as a trade-off between generating an optimal solution or a satisfactory solution, against the computational complexity and execution time of an approach. Latombe [28] published a more extensive review of the classification and state of robot motion planning a year prior. This review classifies approaches into four categories: roadmap methods, exact cell decomposition, approximate cell decomposition and potential field methods. Via this classification of approaches, the work presented in this thesis falls into the approximate cell decomposition field as a tessellating grid of cells is used to represent each configuration space. A significant contribution of this review is the acknowledgement of the computational complexity of motion planning and a review of possible approaches researchers have developed to mitigate this complexity in applications. The complexity of grid-based planners is reviewed in more detail in the next section, followed by a more directed review of motion planning approaches that attempt to counter the computational complexity of generating a solution.
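To make the difference in queue ordering between the two algorithms concrete, the small C++ sketch below contrasts the two priority keys. It is purely illustrative — the function and type names are hypothetical and are not taken from the thesis implementation or from [23].

#include <cmath>

struct Cell { int x, y; };

// Dijkstra's algorithm: unexplored cells are ordered purely by the cost
// accumulated along the best known path back to the source.
double dijkstra_key(double accumulated_cost) {
    return accumulated_cost;
}

// A*-search: the same accumulated cost plus a heuristic, here the Euclidean
// distance to the destination, biasing exploration towards spatially
// promising cells.
double astar_key(double accumulated_cost, const Cell& c, const Cell& goal) {
    double h = std::hypot(static_cast<double>(c.x - goal.x),
                          static_cast<double>(c.y - goal.y));
    return accumulated_cost + h;
}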

2.2.1 The Computational Burden of Grid-Based Planners

Grid-based planners have the advantage of being able to plan accurate, optimal and intricate solutions to the resolution of a chosen grid size. They also map well to memory models used in computing applications and are easy to comprehend. However, as mentioned already in this chapter, grid-based planners are more computationally expensive than approximate, probabilistic and randomised approaches such as roadmap methods. Barbehenn [29] gave a short paper in 1999 proving the Big-Oh complexity of Dijkstra's algorithm to be O(|E| + |V| log |V|) for a binary heap implementation. Assuming the calculation of an external heuristic is O(1), the computational complexity of A*-search can also be expressed with this complexity. Via the derivation mapping graph-based to grid-based edge and vertex numbers given in Appendix A.1, this gives an exhaustive grid-based planner a runtime order of O(N log N). Since N can be quite large for many real world applications, let alone higher-dimensional configuration space representations, traditional approaches such as Dijkstra's algorithm and A*-search are not feasible. Appendix A.3 shows a derivation from the basic structure of grid-based layouts of how the number of cells in a d-dimensional space grows with increased dimensionality. Given a d-dimensional grid of n^d = N cells, a traditional exhaustive grid-based planner would run in the order of O(n^d log n^d). This section reviews grid-based planners that attempt to mitigate this computational burden. Khatib [25] introduced a potential field method for low level robot motion planning. Here, an environment is broken into a grid-based structure and obstacles are given a high repulsive value of potential in the grid. A path can then be planned through an environment by giving a destination cell an attractive artificial potential and then instructing the robot to blindly follow the potential field. It is conceded that this method is susceptible to planned paths getting trapped in local minima rather than guaranteeing a generated solution if one exists. However, in contrast to low level, metric-based planners at the time, this method was deemed more computationally efficient by Khatib and would benefit real-world real-time applications if augmented by a high level, topological planner.
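As a rough, illustrative aside on the complexity mapping described above (the full derivations are in Appendices A.1 and A.3; the constants here are indicative only): in an 8-connected two-dimensional grid of N cells, each cell is a vertex and touches at most eight edges, so

\[ |V| = N, \qquad |E| \le \frac{8N}{2} = 4N, \]

and the binary heap bound becomes

\[ O(|E| + |V|\log|V|) = O(4N + N\log N) = O(N\log N). \]

In d dimensions with n cells per axis, N = n^d and each cell has at most 3^d − 1 neighbours, so an exhaustive planner runs in the order of

\[ O\!\left(n^d \log n^d\right) = O\!\left(d\, n^d \log n\right). \]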

An exact cell decomposition approach presented by Brooks [30] in 1983 attempted to give a more efficient planner by modifying the representation of the space traversable by an agent in a polygonal configuration space. On the subject of computational complexity, the method is proposed as a more efficient approach than existing methods at the time, but it is conceded that it is unable to find possible paths in cluttered environments. It is proposed that this approach be used to guide a more computationally expensive, but exact, planner to discover such paths. Latombe [28] suggests reducing the dimensionality of a higher-dimensional problem temporarily to plan more efficiently. An example of a six degree-of-freedom articulated robot is given, with the notion that during certain stages of such a robot's motion, the last three joints and a potential payload need not move and can be approximated as a single static object. This object can then be represented in a lower-dimensional configuration space as a bounding box of the last three joints and the payload. While these joints are held stationary, the motion planning problem is reduced to that of a three-dimensional configuration space. A similar approach to reducing the dimensionality is also suggested by slicing two-dimensional planes from a three-dimensional volume. For example, an Unmanned Aerial Vehicle (UAV) might require a three-dimensional occupancy volume to properly plan a path through a multi-storey indoor environment. However, to ease the computational burden of planning in this dimensionality, a planner could assume that when on a particular floor of such a building, it is only required to plan in a two-dimensional representation of that floor. This can be achieved if the vehicle itself is instructed to maintain a constant vertical distance from the ground and only switches back to a three-dimensional planner when a floor change is required. One of the key shortcomings of the sub-optimal grid-based planners reviewed in this section is that they concede that they can fail to find a solution given a complex or dense environment. Rather than regressing to a more generic, computationally expensive planner, another approach is to use "local experts." For example, Yap [31] presented work in 1987 on a model and method for moving a three-dimensional object through a doorway. More recently, Meeussen et al. [32] presented work on a personal robot, specifically discussing the sub-tasks of detecting, opening and traversing doorways and self-recharging.

While grid-based planners are often preferred, they are also often not feasible for practical applications. The next section reviews approaches to robot motion planning outside of the domain of grid-based planners. Each approach is discussed from the point of view of reducing computational complexity of motion planning over existing approaches.

2.2.2 Planning with Computational Burden as a Constraint

In an attempt to design motion planners that are satisfactory for practical applications, while being computationally feasible, a number of approaches have been proposed over the last few decades. This section reviews these alternate methods of motion planning and describes each approach in the context of its computational complexity. The review covers planners ranging from those that map a configuration space to an approximate topological representation, to hybrid metric and topological approaches that limit regular metric-based state updates and planning to a subset of the total environment. Amato and Wu [33] presented work in 1996 on a randomised roadmap method of solving robot motion planning, particularly referring to metric and grid-based methods as being too computationally expensive relative to topological approaches such as theirs. They also refer to the fact that as the dimensionality of the problem increases, the computational complexity of finding an exhaustive solution also increases. Kavraki et al. [5] also presented a probabilistic roadmap method in the same year that reduced the dimensionality of the problem by one. That is, for a two-dimensional configuration space, random points within Cfree are sampled, then connected into a graph structure using a k-nearest neighbours search. The source and goal locations are then connected directly to the closest point on the randomly generated graph and motion planning is subsequently performed over this graph. These approaches are considered to be probabilistic complete as the probability of finding a solution, if one exists, increases as additional random sample points are used. As the number of sample points increases towards the size of the set of points that can be sampled from, the problem tends towards a grid-based approach. For sparse environments, these graph-based approaches are more computationally

efficient than traditional grid-based approaches because the number of sample points required to discover a solution is far less than the number of grid cells required. However, for dense environments, such approaches tend towards the same computational complexity as traditional grid-based approaches. Amato and Dale [34] have since proposed an embarrassingly parallel approach to probabilistic roadmap generation, which distributes the workload of generating and testing the sampling of possible points in Cfree. Carpin and Pagello [35] also later presented two parallel implementations of probabilistic roadmap generation — one matching the embarrassingly parallel approach given by Amato and Dale, and the other using many sequential roadmap generators in parallel until any generator reaches the desired goal state. While each of these implementations demonstrated a speed up proportional to the number of processors available, the generation of the roadmap still tends towards that of a traditional grid-based approach once an optimal path needs to be generated between two locations relative to a cost metric. Gochev et al. [36] presented in 2011 a method of dynamically adapting the dimensionality of a problem at a particular point in time to balance computational cost over the cells that actually need to be planned over. For example, if the entire configuration space is represented in three dimensions but, for a period of time, the planner only requires planning in two dimensions, then their approach slices an appropriate two-dimensional plane and plans upon that plane. In particular, they cite this approach as exchanging computational burden for a slightly sub-optimal solution for the environments they tested upon. Challou et al. [37] presented a parallel path planning algorithm in 1993 that claimed to approximate an optimal solution with far less computational burden than existing non-parallel approaches. Their approach randomly explored a subset of all next possible paths, then decided after each exploration step which of the explored paths was the most optimal. They claim that the parallel nature of their approach contributes to lower computational burden, but concede that the true optimal path may not be randomly chosen during exploration. In addition, they also concede that this approach may not discover a solution at all, even if one exists, and that it is susceptible to being trapped in local minima, but includes a probabilistic

mechanism for escaping such situations. Thrun and Bucken [38] presented work in 1996 on integrating grid-based and topological configuration space representations. They explicitly state that the reasoning behind this approach is that grid-based representations and path planners alone are too computationally expensive for large indoor environments. They also state that topological approaches, while far more efficient to plan over, are not as accurate and consistent. Tomatis et al. [39] also presented work in 2003 on merging metric and topological representations of an environment. Similarly, a high level topological graph connects metric-based representations of local areas of an environment. High level global planning can be completed over a less dense topological representation, while more fine grained local planning can be completed over a smaller scale grid-based representation. Many of the approaches previously reviewed involve static or well defined configuration spaces. As robotics research has entered more mainstream and time dependent applications, robot perception and action must be more adaptive and robust to environments that are not well defined. Simultaneous Localisation and Mapping (SLAM), first presented by Leonard and Durrant-Whyte [40], is a popular approach to sensing and navigating an environment in real-time without prior knowledge of the layout of an environment. In particular, from a path planning point of view, this approach to live map building means that an optimal path to a given location can change rapidly as the environment is explored. Guivant et al. [41] presented work in 2004 on an approach to motion planning and SLAM in dense environments using a similar hybrid approach. Here a high level global feature map is used to reference smaller metric-based representations of an environment relative to each other. In terms of computational complexity, they note that a metric-based representation is important when building a map during exploration, while a higher level topological approach is better suited to global path planning. Later, Whitty and Guivant [42] presented a hybrid metric and topological approach to maintaining map layout and state, even with the inherent deformations that arise from inaccurate spatial and motion sensors. They propose an efficient

representation, correction method and path planning strategy over deformable maps. Pereida and Guivant [43] proposed a different representation of an occupancy grid space by merging cells into a quadtree representation. In this approach, open free space regions are represented by large squares in the quadtree, while smaller squares are used to represent finer detail closer to obstacles. While this approach does quote significantly reduced cell traversals when planning, no execution times of an implementation against existing methods are given. Using larger elements to represent sub-areas of an occupancy grid also has the potential to generate sub-optimal solutions. However, this approach is still applicable to real world applications as these sub-areas are only used for free space. Likhachev and Stentz [44] proposed a variation of A*-search called R*-search. In this approach, a short-term random location is chosen and a path is planned to that location. This process is repeated until a goal location is reached.

2.3 Computing Processor Capability

The previous section outlined current approaches to robot motion planning. The main observation from this literature is the large number of approaches that claim that traditional grid-based approaches, while favourable for planning and live map building, are computationally too expensive for practical use. Each proposed approach aimed to find a satisfactory sub-optimal solution that was more computationally feasible. What has not been highlighted so far in the literature is the ability to gain improved planner performance simply by upgrading the processor hardware used to execute an implementation. This section reviews processor throughput capability trends over the last few decades and highlights the current paradigm shift from single to multi-core processors in an attempt to maintain the desired rate of increase of capability in line with Moore's Law [10]. Schaller [6] presented an article in 1997 concerning Moore's Law and the looming physical limits processor manufacturers were about to face as they fabricated denser transistor designs. Kish [7] in 2002 predicted the decline of Moore's Law for single core processors from a thermodynamic point of view. Kish cites a number

of factors for the possible decline, two of which are the thermal noise (known as Johnson-Nyquist noise [45][46]) between lower voltage adjacent components in the form of interference, and the problems associated with the dissipation of heat from such dense designs. Borkar [47] in 2007 discussed the benefits of many-core and multi-core processors over single-core processors in terms of power usage, memory bandwidth, resiliency and general performance. While multi-core processors were said to give a performance gain over single-core processors, the paper's main theme involves taking this one step further and promoting many-core systems. That is, instead of fabricating a small number of complex cores, which Borkar defines as a multi-core processor, it is shown that having 100s to 1000s of small cores, which Borkar defines as a many-core processor, is far more beneficial in each area discussed. Modern GPUs, Sony's Cell processor [48] and NVIDIA's Tesla [49] are the closest commodity processors available to this many-core model at the current time. This chapter continues by reviewing GPU architecture in closer detail before shifting focus to the programmer's point of view for GPUs. Andrews [50] talks about the additional difficulties of concurrent programming over sequential programming. In particular, issues such as race conditions, accessing and modifying a shared resource, and synchronisation are discussed in detail. While early GPU programming languages such as OpenGL's GLSL abstracted away concurrency concerns such as synchronisation and access to shared resources, later languages enable a wider level of flexibility. Hence, basic concepts discussed in traditional CPU based concurrency contexts are also relevant for general purpose GPU programming.

2.4 Graphics Processing Unit Architecture

In addition to the industry's shift from single-core to multi-core CPUs, Graphics Processing Units (GPUs) have also evolved substantially in the last decade. Lindholm et al. [51] discuss the design trends of modern GPUs and state that the first graphics cards contained a fixed function render pipeline, with separate components for the vertex transformation and the pixel fragment stages of the pipeline. This

was ample for early graphical applications, with any customisation in the graphics pipeline configured via state changes in the graphics library. Figure 2.1 shows the stages of the graphics pipeline, with the vertex and fragment operations stages highlighted. As graphics cards evolved at the beginning of the 2000s, the vertex and fragment operation stages began to be programmable rather than just customisable via library state changes. The OpenGL Shading Language was one of the first high level specifications for each of the programmable components of the graphics pipeline. A small program, called a kernel, could be compiled and loaded onto a graphics card to replace the default, fixed function behaviour of each of the vertex and fragment stages.

Figure 2.1: Fixed function graphics pipeline adapted from Shreiner et al. [1]. The per-vertex operations stage is highlighted in green and was later replaced by a programmable vertex shader. The per-fragment operations stage is highlighted in blue and was later replaced by a programmable fragment shader.

Lindholm et al. [51] then continue to introduce a new unified graphics

architecture where each of the different programmable processors were now designed with a more general purpose instruction set, so that the same shader processor design could be applied to primitives in the vertex operation and pixels in the fragment operation stages. In addition, this assisted researchers and engineers alike to begin to productively use graphics hardware for non-graphical purposes (GPGPU). Figure 2.2 shows the unified stream processor layout found on modern graphics cards. Here, each Streaming Multiprocessor (SM) has access to global memory, each SM has its own shared memory that is accessible by any Streaming Processor (SP) within that SM group, and each Streaming Processor itself contains private memory accessible only by itself. Streaming processors operate using a Single Instruction, Multiple Data (SIMD) model in which the same kernel is executed over multiple instances of data concurrently. In the context of graphics applications, this could involve applying the same texturing or lighting calculation to each fragment in a scene, as described via many examples by Rost [52]. Moreland [53] presents an early implementation of a fast Fourier transform on GPU, which applies an identical kernel to each pixel in a sample image as a convolution. The next two sections shift focus to review GPU literature and background concepts from more of a software point of view. First, a traditional approach is reviewed in terms of OpenGL, a common graphics library, as the field of GPGPU has its origins here. As researchers continued to use graphics hardware for non-graphical purposes, vendors began to develop programming languages and standards that more closely resemble traditional CPU based languages. The review naturally continues by focusing on these non-graphical approaches to programming on graphics hardware.

2.5 GPGPU and the Open Graphics Library

Graphics Processing Units (GPUs) have evolved over the past two decades to contain one to two orders of magnitude more cores than current multi-core Central Processing Units (CPUs). The main driver of this trend is the massive number

of matrix transformation, lighting and texturing calculations that must be applied to many vertices and fragments to render a scene. Each calculation performs the same basic steps on a large array of data individually. As such, graphics hardware uses Flynn's [54] Single Instruction, Multiple Data (SIMD) paradigm, rather than applying different instructions to different streams of data as is the case with multi-core CPUs.

Figure 2.2: The layout of stream processor cores in the NVIDIA Tesla GPU. The diagram here is adapted from Nickolls et al. [2]. SM stands for Streaming Multiprocessor, SP stands for Streaming Processor, SFU stands for Special Function Unit, ROP stands for Raster Operations Pipeline, and MT IU stands for Multithreaded Instruction Unit.

Early programmable graphics hardware allowed the vertex and fragment

operation stages to be programmable. These two stages are highlighted in Figure 2.1. While graphical applications can make productive use of both stages in the graphics pipeline, early GPGPU programmers focused primarily on the fragment shader stage. The standard GPGPU technique was to attempt to render a rectangle to the screen and apply a texture to that rectangle. Careful dimensions of the rectangle had to be specified so that the bounds of the rectangle lined up with the edge of the display buffer and, more importantly, so that individual texture units matched the number of pixels used to render the rectangle. This last requirement is important as the texture applied to the rectangle contained input values to the GPGPU algorithm rather than simply pixels from an image. It is desirable for each input value to map exactly to one fragment in the shader processor. If texture units do not line up, under some early graphics cards and under particular settings, the graphics hardware would automatically interpolate the "colour" of the texture units closest to the fragment's location. While this is visually appealing when given proper texture data, it would provide incorrect input values for GPGPU applications. By default, fragments that are generated in the fragment shader stage are destined for a framebuffer that will be directly rendered to the screen. An advanced technique in graphics programming that is particularly useful for rendering reflections in a scene is redirecting fragment output to a separate framebuffer that can be later read back into another render pass. For example, a scene could be rendered once from the point of view of a mirror, with the result temporarily stored in a framebuffer. Then when rendering the final scene for display, the contents of that framebuffer can be textured over the model of the mirror. Early GPGPU programmers exploited this technique to be able to retrieve output calculations from the fragment shader stage as framebuffer contents can also be sent back to CPU memory. In summary, early GPGPU programmers fooled the graphics hardware into loading input data in the form of a texture, then executed many parallel calculations on that input data by requesting that a rectangle be rendered to the screen. The output of this render process was then redirected back into a separate framebuffer,

then data was able to be read back from GPU to CPU memory. Buck and Purcell [55] outline this programming model in Chapter 37 of GPU Gems and provide simple examples of reduction, sorting and binary search algorithms on GPU. An important facet of the algorithms mentioned above is that they require multiple passes to generate a final result. To achieve this, a method introduced by Harris [56] known as Ping-Ponging Textures is used. Here two texture framebuffers are allocated. Before execution begins, input data is loaded into one of these framebuffers. Each iteration of the particular algorithm being executed uses one framebuffer to read input values and the other to store generated output fragments. After each iteration completes over all fragments, the roles of the two framebuffers are reversed, hence data bounces back and forth between a read-only input buffer and a write-only output buffer. This technique, together with the programmer's view of how the graphics hardware is exposed in OpenGL, alleviates three common concurrent programming challenges: synchronisation, consistent common resource access and determinism while multiple sections of an algorithm execute in an arbitrary order. Between iterations of the algorithm all kernels running on each fragment must finish before the roles of buffers are reversed for the next iteration. From the programmer's point of view the entire parallel process is abstracted away into rendering a rectangle and thus the algorithm is synchronised between iterations. Many of the algorithms mentioned above in Buck and Purcell [55] require each kernel to observe multiple cells to calculate a result. This is also the case with the proposed set of algorithms presented in this thesis, which must observe all eight neighbouring cells around a given output cell for the two-dimensional case. However, using the read-only and write-only versions of the data at each iteration removes the chance of inconsistent data value lookup. Each kernel is responsible for calculating the value of one fragment, so no multiple source data modification is allowed in this model. Having all kernels read from the same read-only source as input also prevents non-deterministic behaviour, even when kernels execute in an arbitrary order.
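The following condensed C++/OpenGL fragment sketches the ping-ponging pattern described above. It is illustrative rather than a copy of the thesis implementation: it assumes an existing OpenGL context and loader headers, a compiled GPGPU fragment-shader program prog whose sampler uniform "prevCost" reads the previous iteration's values, a w × h float array initialData, a float array output, and a helper drawFullScreenQuad() that produces one fragment per cell; error checking is omitted.

#include <utility>   // std::swap

GLuint tex[2], fbo[2];
glGenTextures(2, tex);
glGenFramebuffers(2, fbo);
for (int i = 0; i < 2; ++i) {
    glBindTexture(GL_TEXTURE_2D, tex[i]);
    // One 32-bit float per cell; nearest filtering so texels are never interpolated.
    glTexImage2D(GL_TEXTURE_2D, 0, GL_R32F, w, h, 0, GL_RED, GL_FLOAT,
                 i == 0 ? initialData : nullptr);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
    glBindFramebuffer(GL_FRAMEBUFFER, fbo[i]);
    glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                           GL_TEXTURE_2D, tex[i], 0);
}

int readIdx = 0, writeIdx = 1;
for (int iter = 0; iter < maxIterations; ++iter) {
    glBindFramebuffer(GL_FRAMEBUFFER, fbo[writeIdx]);   // write-only output buffer
    glViewport(0, 0, w, h);
    glUseProgram(prog);
    glActiveTexture(GL_TEXTURE0);
    glBindTexture(GL_TEXTURE_2D, tex[readIdx]);          // read-only input buffer
    glUniform1i(glGetUniformLocation(prog, "prevCost"), 0);
    drawFullScreenQuad();                                 // one kernel execution per cell
    std::swap(readIdx, writeIdx);                         // roles reverse each iteration
}

// After the loop the most recent results can be copied back to CPU memory.
glBindFramebuffer(GL_FRAMEBUFFER, fbo[readIdx]);
glReadPixels(0, 0, w, h, GL_RED, GL_FLOAT, output);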

2.6 GPGPU post Open Graphics Library

As the number of researchers and engineers practising GPGPU grew over recent years, the two main chip manufacturers released vendor specific programming languages and frameworks to allow applications to be written in a more C-like language and still be compiled and run on their respective architectures. NVIDIA released CUDA [57], a language specification to enable more C-like programming on its graphics hardware. Buck et al. [58] also released the Brook programming language, which allowed a C style streaming language to be compiled and run on common graphics hardware. For a period of time, Brook was the chosen language for programming on AMD graphics hardware. In the last five years, Apple led a group of hardware and software companies to develop a common, cross platform standard for computation on heterogeneous hardware. That is, one language specification that can be compiled to run on single and multi-core CPUs, GPUs and custom hardware like Sony's Cell processor [48]. The language is called the Open Computing Language (OpenCL) and the latest specification release is version 2.0. While the OpenCL 1.0 specification has been available since late 2009, a particular function required to implement the proposed work in this thesis in OpenCL has a fatal flaw. The OpenCL 2.0 specification outlines how this flaw has been resolved, but hardware manufacturers are still implementing the specification in their hardware at the time of writing this thesis. A deeper discussion on this issue is given as future work in Section 6.2.1.

2.7 GPGPU in the Wider Community

This section is aimed at giving the reader a taste of the applications that have benefited from the introduction of highly parallel programming techniques and the magnitude of that benefit. The section will not give as much in-depth research and coverage as the previous sections. It simply aims to educate the reader on the extent to which highly parallel computation techniques are benefiting diverse areas, and to show that the proposed work presented in this dissertation may be applicable to areas of optimisation outside of robot motion planning.

Trapnell and Schatz [59] presented work in 2009 on gene sequence alignment using commodity graphics processing units. They claim to be able to align DNA sequence data against reference sequences with a 13× speed up relative to traditional CPU implementations. Bakkum and Skadron [60] presented an implementation of a subset of SQLite's command processor using CUDA in 2010. They focused on optimising the SQL SELECT statement used to query tables based on a search condition and claimed to achieve a speed up of between 20× and 70× depending on the size of the resulting query set. Preis et al. [61] presented a CUDA implementation of a method of simulating the ferromagnetic properties of a material from a statistical mechanics point of view. The implementation involved applying Monte Carlo techniques to the two and three-dimensional Ising Model [62] and claimed a speed up of 60× and 35× for the two and three-dimensional implementations, respectively. Tzeng and Wei [63] presented work in 2009 on a method of parallel random number generation using existing cryptographic methods for parallel architectures. They claim to be able to generate "white noise" quality random values from Rivest's [64] MD5 hash function in a shorter time than all other approaches reviewed. Approaches compared in this study had to meet certain minimum requirements of generated random number spread over the domain and non-predictability to be considered. A parallel radix sort and merge sort were presented by Satish et al. [65] in 2009. In addition to claiming a 23% faster execution time compared to traditional CPU-based approaches, they also claim in the order of a 2× and 4× speed up over existing GPU based sorting algorithms at the time.

2.8 GPGPU and Robotics

The fields of robotics and artificial intelligence have also benefited from the shift to GPGPU programming in recent years. One of the earliest fields of GPGPU research was computer vision, as graphics hardware had already been designed with image

manipulation capabilities in mind when applying textures in rendered environments. Podlozhnyuk [66] presented an early white paper from NVIDIA in 2007 on using CUDA to perform certain image convolutions. Furukawa et al. [67] presented work in 2010 on performing the matrix arithmetic involved in maintaining state using an extended Kalman filter for SLAM on GPU. The work noted that both the update and prediction steps of the filter require matrix operations and other groups of independent calculations on a number of cells, each with the same process. The approach claims to provide on average a 95% speed up compared to existing methods. Pisula et al. [68][69] presented work in 2000 on three-dimensional Voronoi diagram generation using GPU hardware. They note that the efficient generation of such a topological layout of an environment's free space can greatly assist in robot motion applications. Harish and Narayanan [70] presented a graph-based implementation of Dijkstra's algorithm using CUDA in 2007. Their GPU-based implementation claimed an 11.1× speed up compared to an existing CPU-based implementation. Merrill et al. [71] also presented a graph-based solver that claimed O(|V| + |E|) time complexity for |V| vertices and |E| edges in a graph. Although this approach claimed favourable speed ups compared to existing implementations on both single and multiple graphics cards, it is unclear whether their implementation produces a correct result for certain weighted graph configurations. In particular, the algorithm may sub-optimally generate a path between nodes if, for example, one path involved fewer edge traversals but cost more than another that involved more edge traversals but globally cost less. This is similar to the grid-based example given by Figure 3.2 in Section 3.3, where the optimality of the proposed algorithms presented in this thesis is discussed.

30 spaces, the CPU version of R*-search outperforms the GPU implementation, but cites that many real-world applications involve more complex configuration space representations, which are better suited to the GPU implementation. Olson in 2009 [74], Newcombe et al. in 2011 [75] and Ratter and Sammut in 2013 [76] have all published various ICP implementations on GPU hardware. Each provides a real-time capable method of processing the large amount of data required for accurate scan matching, map building and agent localisation.

2.9 Summary

Robot motion planning is becoming an important and critical area of research as more robotics applications are exposed to and used in society. While there has been at least half a century of research into developing efficient path planning algorithms, a common theme from the literature is that proper, accurate and exhaustive evalu- ation of the cost of traversal of an environment for planning is expensive. Many of the approaches reviewed in this chapter either use less accurate probabilistic or ran- domised methods of exploring near optimal paths, or map the configuration space to a lower-dimensional or topological representation. In addition to improvements in algorithm design, implementations have bene- fited in terms of performance over the last few decades by improvements in the processor hardware they run on. However, as single core processor designs have become increasingly dense, they have begun to reach physical limits of density in terms of safe and deterministic operation in recent years. Patterson [77] summarises the industry’s shift from single to multi-core architectures by citing the then CEO of Intel Paul S. Otellini in saying in 2004 that Intel would dedicate “all of our future product designs to multi-core environments.” In addition to this paradigm shift for CPUs, GPU manufacturers have been driven to increase the performance of their products by the lucrative gaming and interactive entertainment industries. As a result of this shift, concurrent computing approaches have arisen in recent years for both multi-core CPU and highly parallel GPU architectures in solving many scientific and engineering problems. This thesis contributes to the field of robot

31 motion planning and more generally grid-based optimisation problems by providing and exhaustive and accurate concurrent grid-based solver. The next chapter begins by outlining the core mechanics of concurrent dynamic programming.

32 Chapter 3

Foundations of Concurrent Grid-Based Optimisation Algorithms

This chapter introduces a concurrent algorithm and implementation for solving grid- based optimisation problems and is based on work presented by Cossell and Guivant [78]. A formal definition of the domain of this solution is given, before a full de- scription of the algorithm is presented. An initial implementatoin of the proposed concurrent algorithm on modern graphics hardware is then presented, including a full analysis of implementation performance via experimental results. Although the implementation presented in this chapter does not perform as efficiently as prior work by other sequential solvers, it does prove that a concurrent implementation is able to produce mathematically identical results in terms of the final cost-to-go function. The implementation inefficiencies presented in this chapter are caused by a flood- gate approach to requesting occupancy grid cell calculations on the GPU. That is, the approach makes basic unintelligent decisions as to which cells to evaluate at each iteration, at the expense of many of these calculations peforming redundant actions. Chapter 4 discusses a range of more directed implementations of this algo- rithm, with the best approaches outperforming traditional sequential algorithms by an order of magnitude.

33 3.1 Problem Definition

A two-dimensional grid-based motion planning problem can be defined as follows. An agent A is able to move within a two-dimensional environment referred to as the configuration space1. At the benefit of implementing a solution for actual computing architectures, a configuration space is defined throughout this thesis as rectangular in shape and consists of w × h equally sized square grid cells tessellating over the area covered by a given environment. The agent’s location at any instant in time is defined by the coordinates of the cell in which it resides. Grid cells are assigned a local value c∗, which is used to represent the cost of agent A traversing through the given cell relative to the metric being optimised. In practice, this local cost can relate to factors such as terrain traversability or properly utilising an energy source, or a weighted combination of factors. For the purposes of this problem definition, it is enough to state that traversing 8 cells in a straight line, each with local cost of 1 unit, is equally costly to traversing 4 cells in a straight line, each with 2 unit cost. A subset of cells comprising the configuration space C are designated obstacle cells if the agent A cannot traverse that cell. In many similar problem definitions these obstacle cells are given a significantly large local cost value c∗, while others, as defined here, give obstacle cells an infinite cost to show that traversal is not difficult or costly, but impossible. These obstacle cells together comprise the set of cells Cobst.

All cells in the set C\Cobst= Cfree are cells that are traversable by the agent A and have a finite positive local cost c∗. Many of the examples and experiments in this thesis assume that the local cost of traversing a cell c ∈ Cfree is of unit cost. However, as outlined in Section 3.3, the algorithm is able to calculate identically optimal2 results for varying positive finite local costs across cells in Cfree. The agent can move between any two cells in Cfree, on the condition that the cells are either horizontally, vertically or diagonally adjacent.

1Throughout this thesis, an agent A is taken to occupy a single cell. In reality, a configuration space can be generated with the dimensions of a mobile robot taken into account, for example, by dilating obstacle regions relative to the size of the robot. 2Optimal relative to the cost metric.

34 More formally, traversal between cells ci,j, cp,q ∈ Cfree is allowed if

|i − p| ∈ {0, 1} and (3.1.1) |j − q| ∈ {0, 1}

+ for i, j, p, q ∈ Z and ci,j 6= cp,q. To plan an optimal path from the agent’s current location to a given destination cell, the destination cell cd ∈ Cfree is assigned a global cost-to-go value of zero. Then all other cells in Cfree are recursively assigned a cost-to-go value based on Equations 3.1.2 and 3.1.3.

∗ ∗ Ci,j = min(Ct ) (3.1.2)

∗ Here Ct is the set of global costs of all neighbouring cells to the cell at i, j plus the cost of traversing from cell ci,j to cell ci0,j0 ∈ Ct. The cost of traversal between cells ci,j and ci0,j0 is given by

∗ ∗ c + c 0 0 c∗ = k × i,j i ,j (3.1.3) t 2 where k = 1 for cells adjacent by an edge (that is, North, East, South and √ West) and k = 2 for cells diagonally adjacent. Once enough3 cells are assigned a cost-to-go value, an optimal path can be generated by repeatedly traversing from the agent’s location to the neighbouring cell with the lowest global cost-to-go value until the destination cell is reached. This mechanism is the basis for the majority of exhaustive grid-based motion planning solvers, with the major difference in the presented work being concerned with how the global cost-to-go values are generated. Traditional grid-based dynamic programming implementations evaluate a single cell each iteration using the definitions given in Equations 3.1.2 and 3.1.3. While these cell value definitions completely describe a cell’s value from an abstract math- ematical point of view, in the most efficient implementation, cells must be evaluated in a particular order to maintain optimality relative to a given cost metric. Many

3 All cells in Cfree can be exhaustively assigned a global cost-to-go value, or cells can be radially assigned cost-to-go values until the agent’s cell is assigned a value.

35 existing implementations employ a priority queue data structure to sort unexplored cells into an optimal explore order to maintain this optimality. Two of the most commonly used path planning algorithms differ only by the metric used to sort the priority queue, namely Dijkstra’s algorithm and A*-search. That is, Dijkstra’s algo- rithm sorts unexplored paths based on how the local cost of traversing a particular path adds to the global cost of reaching that path’s node in a graph [18], while A*-search includes an extra heuristic that relates to external information about the configuration space [22] — most commonly spatial in nature. As described in Section 3.2, the proposed concurrent algorithm is not required to decide which particular cell to explore at a given iteration and simply expands multiple possible next cells at once. Section 3.3 reinforces the lack a requirement for a priority queue as, although cells at a particular iteration may not have their globally optimal value assigned, they will eventually be given a globally optimal value relative to a cost metric. That is, instead of each cell being evaluated correctly once, one at a time, many cells are continually being reevaluated with an equally or more optimal value than they were assigned on a previous iteration of the algorithm. In other words, every cell tends towards its eventual globally optimal value as a steady state by repeatedly apply- ing Equations 3.1.2 and 3.1.3 concurrently across all cells in Cfree. The difference between the next cell decision process in Dijkstra’s algorithm, A*-search and the proposed concurrent algorithm can be seen in Cossell and Guivant [23].

3.2 The Core Mechanism of Concurrent Dynamic Programming

As stated in Section 3.1, the proposed concurrent algorithm presented here does not use a priority queue. Instead it explores all cells that are on the edge of the explored/unexplored border each iteration on the condition that at least one neigh- bouring cell or each cell being evaluated has been explored. This mechanism is best demonstrated by Figure 3.1. The initial implementation outlined in this chapter maps the two-dimensional grid structure of a configuration space to an OpenGL texture. Textures are used in

36 high performance computer games and other interactive entertainment applications to project a particular image onto a plane of an object in a scene. In its simplest and most common usage in computer graphics, a texture is loaded from an image file into graphical hardware memory. When a scene is rendered, the graphics pipeline reads texel4 data and applies each texel to the surface of a triangle or other planar shape in the scene, with the final result being sent to the display buffer. Consequently, the display buffer contains the final raster of the scene displayed to the user. A more advanced texturing technique involves redirecting the output of the graphics pipeline to a temporary buffer for later use. For example, to create the effect of the reflection of a city’s lights in a car’s window in a rendered night time scene, the scene is first rendered without any consideration for reflection to a tempo- rary buffer from the point of view of the car’s window. Then this temporary buffer is combined with the actual window colour and texture to the surface of the car’s model for a final pass before being displayed. Rendering to a texture buffer rather than directly to the screen is the basis for General Purpose computing on Graphics Processing Units (GPGPU). When a pixel traverses the final stages of the graphics pipeline before being sent to a display buffer it is processed by a component called the fragment shader. In graphics applications, this component accepts a pixel that has already been as- signed to its final location in a two-dimensional display buffer and is responsible for assigning the pixel the correct colour. For graphics hardware produced more than a decade ago, this functionality was hard coded into the chip itself. More recently however, graphics hardware manufacturers have allowed developers to program their own fragment shader functionality. The developer builds a small program called a kernel that is concurrently applied to each pixel that passes through the fragment shader. As modern graphics hardware contains hundreds or thousands of shader processors, the same kernel is run concurrently over many pixels at once. Algorithm 1 demonstrates a pseudocode representation of the kernel for the two-dimensional version of the proposed algorithm. As stated in Section 3.1, a cell or pixel’s final value in the implementation is

4A texel is a texture pixel.

37 Algorithm 1 Kernel pseudocode run on each cell in the configuration space.

bestcost := undefined N := set of 8-way neighbours of current cell for n ∈ N do

if ncost is defined then

currentcost := ncost + travelcost(n, current cell)

if bestcost is undefined or currentcost is better than bestcost then

bestcost := currentcost end if end if end for

current cellcost := bestcost based on the values of all explored neighbouring cells and the cost of traversing between each of those cells. As many cells are evaluated concurrently, at each iteration a virtual wavefront emanates from the original zero cost destination cell. The wavefront in these initial iterations of the algorithm can be seen in Figure 3.1. An exhaustive implementation of this algorithm continues to expand into unex- plored cells until all cells in Cfree are evaluated with a final global cost-to-go value. As the concurrent algorithm does not maintain a priority queue of cells left to be evaluated, an implementation must determine when to terminate evaluating cells by some other mechanism. Formally, a terminating condition is reached between iterations i − 1 and i, if the condition

∗ ∗ Ci−1(x) = Ci (x) ∀x ∈ Cfree (3.2.1) is met. A more computationally convenient representation involves not terminating if the condition

∗ ∗ ∃x ∈ Cfree s.t. Ci−1(x) 6= Ci (x) (3.2.2) is met. Section 3.4 describes how this terminating condition is implemented.

38 Figure 3.1: Representation of an occupancy grid’s initial layout and the first five iterations of the algorithm on that occupancy grid. Black cells represent cells in Cobst. Blue cells represent cells that have been evalu- ated with their final global cost-to-go value. Green cells highlight areas that are currently being evaluated at that given iteration. White cells represent unexplored areas.

39 3.3 Maintenance of Optimality Relative to a Cost Metric

In Dijkstra’s algorithm, as well as its derivatives, computational effort is dedicated to deciding which of the unexplored edges or paths should be explored next. Each of these algorithms gains this optimality by conforming to Bellman’s Principle of Opti- mality [79][80], which reduces an optimisation problem to finding the most optimal decision at a given instant, then solving a slightly smaller dynamic programming problem until a goal state is reached. This means that when a cell is assigned a global cost it is the final globally optimal value of that cell. In traditional algorithms, edges are expanded on a least-cost-next basis, whereas the concurrent algorithm expands purely on adjacency to cells that have already been evaluated. Due to expanding on cell proximity rather than actual traversal cost, a cell may initially be given a suboptimal value, with an optimal value later being assigned to the cell on a subsequent iteration. This is especially true when local costs between cells vary. Again, all cells will eventually end up being assigned a globally optimal value given enough iterations. Figure 3.2 shows a contrived example of a configuration space with an obstacle in the middle and two possible paths of traversal on either side of the obstacle. The difference between the two paths is that the path through row 1 traverses cells with a 1-unit local cost, whereas the more spatially direct route via row 3 must traverse the majority of cells having a 2-unit local cost. In reality this may be the mathematical representation of the difference between a ground vehicle travelling on a sealed road versus travelling on grass, gravel or an uneven surface. Given cell B (h3) is the zero cost destination cell, the wavefront of cell evaluation expands out from this cell at a rate proportional to the number of grid cells being traversed rather than the lowest cost cells first. This leads to the cell at the opposite end of the obstacle (a3) being initially sub-optimally evaluated with a value based on travelling through the 2-unit local cost side of the obstacle. Two iterations later the virtual wavefront will wash over that cell again with a more optimal cost based on travelling through the 1-unit local cost cells. Characteristics of the shortest

40 Figure 3.2: Example of a configuration space with differing local traver-

sal costs either side of an obstacle. Black cells represent cells in Cobst.

Red cells represent cells in Cfree with a local cost of 2 units. Blue cells

represent cells in Cfree with a local cost of 1 unit.

Table 3.1: Comparison of characteristics of the shortest paths North and South of the obstacle between cells A and B.

Path         Cost (3 d.p.)   Number of cells
via row 1    9.828           9
via row 3    13.000          7

Characteristics of the shortest paths either side of the obstacle between points A and B are given in Table 3.1. In particular, note that the upper path of traversal via row 1 has a lower total cost of traversal even though it traverses two more cells than the lower path via row 3.

3.4 Initial Implementation and Experimental Results

The presented algorithm was implemented using the OpenGL Shading Language (GLSL) and C++ on the CPU side of execution. The initial implementation, designated vanilla, was implemented with the aim of proving that the kernel presented in Algorithm 1 could produce identically optimal results to existing methods of cost-to-go function generation relative to a cost metric.

An exhaustive version of Dijkstra's algorithm was also implemented as a pure CPU side application in C++. The generated exhaustive global cost-to-go functions of each implementation were then compared, with results found to be identical to within the negligible rounding errors inherent in 32-bit floating point arithmetic. The vanilla implementation was then benchmarked on the UNSW Help Point occupancy grid test case outlined in Section 4.7.3 with both single and multiple destination contexts. Table 3.2 summarises the performance over both contexts for two generations of graphics hardware relative to various implementations of both the proposed algorithm and traditional algorithms. Complete specifications of each graphics card are given in Tables 4.14 and 4.15. While the vanilla implementation of the proposed algorithm performed well on the desktop machine relative to existing approaches, the on-board graphics card performed an order of magnitude worse. Further analysis of the algorithm's implementation and performance in Chapter 4 revealed that the large number of redundant cell evaluations requested is a significant factor in the poor execution time recorded here. This is especially apparent for graphics hardware with a lower shader processor count. The algorithm terminates when no cells receive or improve on their global cost value between the current and previous iterations — formally when Equation 3.4.1 holds true.

C*(cell) at T_{i−1} = C*(cell) at T_{i}   ∀cell ∈ Cfree        (3.4.1)

From an implementation point of view, this condition can be realised at iteration i by either Equations 3.4.2 or 3.4.3.

Σ_{c ∈ Cfree} |C*_{i−1}(c) − C*_{i}(c)| = 0        (3.4.2)

max_{c ∈ Cfree} |C*_{i−1}(c) − C*_{i}(c)| = 0        (3.4.3)

Harris [81] has reviewed well known sum and maximum methods in GPU programming, both of which can run in O(log₄ N) time. While either of these approaches can be used to detect when to terminate, it is recommended that the maximum

Table 3.2: Summary of the execution time of the vanilla implementation of the algorithm. Times listed here are given in seconds and reflect the time taken to load the occupancy grid into texture memory on the GPU, exhaustively calculate the cost-to-go function over many iterations, and transfer the resulting function back to CPU memory. CPU based implementations are also included for comparison. Exhaustive Dijkstra implies an implementation of Dijkstra's algorithm that continues until all cells in a grid are evaluated rather than terminating when a specific cell is reached by the algorithm. The test case tagged with α involved a 2007 ATI Radeon 2400 XT. The test case tagged with β involved a 2010 NVIDIA GeForce GTX 480. Test cases tagged with γ involved a 2012 Intel Core i7 2.6GHz. The test case tagged with † was implemented using a linked list for the priority queue. The test case tagged with ⋆ was implemented using a min-heap for the priority queue.

Algorithm Implementation            Single Destination   Multiple Destination
Vanilla on GPU α                    31.370s              31.564s
Vanilla on GPU β                    0.999s               1.013s
Vanilla on CPU γ                    19.835s              19.610s
Exhaustive Dijkstra on CPU γ†       1.839s               -
Exhaustive Dijkstra on CPU γ⋆       0.874s               -

approach be chosen over the sum approach for older hardware that might not conform to or provide full IEEE 32-bit single precision floating point support. The reasoning behind this is that the maximum approach always stores values in the same order of magnitude as values known to be able to be represented correctly in the global cost-to-go function. If a sum were to be kept, values may climb out of the representable range of the floating point representations of older graphics hardware. Further analysis of the termination of the vanilla implementation of the algorithm was not pursued as the implementation flavours presented in Chapter 4 perform far more favourably than this flavour. In addition, the best performing of these approaches keeps track of which cells are changing each iteration as part of its design. Therefore, these subsequent implementation approaches require no extra cost to determine when the algorithm should terminate.
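As an illustration of the termination test in Equation 3.4.3, the following is a minimal CPU-side C++ sketch, assuming the two cost buffers have already been read back from GPU memory; the function and buffer names are hypothetical, and a production implementation would instead perform the reduction on the GPU as discussed above.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Returns true when no free cell changed its global cost between iterations
// i-1 and i, i.e. when the maximum absolute delta of Equation 3.4.3 is zero.
bool hasConverged(const std::vector<float>& prev,   // C* values at iteration i-1
                  const std::vector<float>& curr,   // C* values at iteration i
                  const std::vector<bool>& isFree)  // true for cells in Cfree
{
    float maxDelta = 0.0f;
    for (std::size_t c = 0; c < curr.size(); ++c)
        if (isFree[c])
            maxDelta = std::max(maxDelta, std::fabs(curr[c] - prev[c]));
    return maxDelta == 0.0f;
}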

3.5 Complexity Analysis

This section gives a complexity analysis of the proposed initial vanilla flavour of the concurrent algorithm and compares it to a traditional sequential algorithm, Dijkstra's shortest path algorithm.

An occupancy grid consisting of n × n = N cells, with Cobst = ∅, is considered for this analysis. Each iteration, all cells within the occupancy grid are requested for evaluation. Therefore, at each iteration, N cell evaluations are requested. The rate at which the wavefront progresses through the occupancy grid, regardless of the location of the original zero cost destination cell, is in the order of O(n). Therefore, the computational complexity of the vanilla implementation of the algorithm is in the order of O(N√N). This is worse than the estimated computational complexity of Dijkstra's algorithm given by Barbehenn [29], being O(N log N). This level of inefficiency is understandable, as at each iteration, when an order of O(N) cells are being requested for evaluation, there are in the order of O(N − n) ≈ O(N) cells being redundantly evaluated by the kernel. The main focus of this chapter has been to introduce the mechanics of the algorithm and quantitatively prove that it generates identically correct global cost functions relative to traditional approaches.

Chapter 4 focuses on properly analysing and evolving the algorithm's implementation to meet the constraints of real-time robotics applications, with the main contribution of that chapter, the method of subregions, capable of running closer to O(N), given enough cores are available.

3.6 Summary

This chapter introduced the theoretical foundations of a concurrent solver for grid-based optimisation problems and provided a formal definition of such problems, which applies for the remainder of this thesis. The generated cost-to-go function of the proposed solution was compared to existing approaches to prove that identically optimal cost-to-go functions could be generated relative to a given cost metric. While this was the main focus of the work presented in this chapter, an initial implementation of the algorithm compares favourably relative to traditional mainstream solvers. The next chapter changes focus in this research towards the computational efficiency of various implementations to properly assess the benefits of the proposed approach for real world robotics applications.

Chapter 4

Applications in Two Dimensional Grid-Based Problems

The previous chapter introduced the algorithm from a theoretical point of view and concentrated on the GPU side of execution in terms of mathematical correctness of a generated solution. The initial vanilla implementation presented in Section 3.4 was developed and benchmarked with little concern for efficiency or optimisation on the CPU side, or overall performance of the implemented algorithm. This chapter presents the method, results and repetitive analysis of various further developed flavours of the algorithm, with a focus on overall implemented performance. Work presented in this chapter is based on research published in Cossell and Guivant 2011 [78] and Cossell and Guivant 2013 [82] [83]. Initially, the method of expanding textures is introduced as a low cost, low complexity attempt to increase efficiency of the vanilla implementation. The construct of a subregion is then introduced, which is a grouping of adjacent cells into a single object. The defined subregion size that tessellates an occupancy grid can be tuned to find the appropriate balance between CPU and GPU load, which in turn results in a faster execution time. At a particular iteration of the algorithm a subregion can be in one of three states: unexplored, maturing and mature. These states govern whether cells within that subregion are evaluated on the GPU at that iteration. This chapter describes the algorithm and data structures used to manage the state of each subregion as the algorithm progresses.

Subregion management steps were initially implemented to run post-iteration, every iteration; however, different schedules of when to perform these post-iteration checks are also analysed. Simulation results of each flavour of the algorithm are presented over a number of occupancy grid types and two generations of graphics hardware. The chapter concludes with a thorough analysis of the experimental results and a theoretical computational complexity analysis. The main contribution of this chapter is the most evolved flavour of implementation, which performs at least an order of magnitude faster than traditional sequential implementations designed for single-core processor architectures.

4.1 Problem Definition

As stated in Chapter 3, the initial vanilla implementation of the algorithm was susceptible to the major drawback of performing an excessive number of redundant cell calculations on the GPU. That is, every cell was requested to be evaluated at every iteration, even though only one or a few of those iterations would have actually changed a particular cell's value from unexplored to a valid finite global cost value. In theory, the most optimal approach for reducing redundant cell evaluations on the GPU would be to only request cells that have a current value of unexplored and have at least one neighbouring cell with a valid finite global cost value. However, initial implementation attempts at performing this level of micromanagement performed far worse than existing sequential implementations. Considering these two approaches as the two extremes of how cells are requested to be evaluated on the GPU, the vanilla approach manages the whole occupancy grid as one management piece and unintelligently requests all cells be evaluated every iteration, whereas the micromanagement approach maps each cell to an individual management piece and works excessively to intelligently decide on which cells should be evaluated. A hypothesis that arose at this stage of the research can be represented via Figure 4.1. Here the CPU graph is based on the two data points of the vanilla implementation doing zero CPU side calculations against the micromanagement approach undertaking more work than traditional sequential algorithms.

The GPU graph is based on estimating the number of redundant cell calculations being performed relative to the number of cells being requested for evaluation. The graph is deliberately vague as it was intended as a hypothesis at this stage and does not suggest that CPU and GPU inefficiencies contribute equally to overall execution time. The remainder of this chapter outlines an evolving set of implementation flavours that attempt to find the optimal use of CPU and GPU load.

Figure 4.1: General prediction of how the GPU and CPU side of the implementation will affect overall execution time for different sizes of pieces being managed as a unit (referred to later in this chapter as subregions).

4.2 Optimisation by the Method of Expanding Texture

One of the main bottlenecks in the vanilla implementation is that many cells are being requested for evaluation many times with redundant results. Again, picture a wavefront emanating from the zero cost destination cell, advancing one pixel width each iteration.

Cells that are a number of iterations away from the wavefront actually reaching them are still requested for evaluation every iteration. These cells will undergo the complete set of algorithm steps: looking up and comparing the values of each of their eight neighbouring cells, finding that those neighbours are also all unexplored, and then assigning themselves a value of unexplored for that given iteration. Additionally, cells that have already been assigned a global cost value a number of iterations prior will also be requested for evaluation each iteration. Here they will again look at their neighbouring cells, each of which will also not have changed value since the last iteration. They will calculate an identical result, thus not actually changing the value already assigned to that cell. These two types of redundant cell calculations are referred to as unexplored redundant and explored redundant calculations, respectively. In each iteration of the vanilla implementation, the CPU side requests the whole texture buffer to be evaluated by requesting a rectangle that matches the bounds of the entire input texture. The method of expanding textures attempts to unintelligently request cell evaluations in a similar way, but minimises the number of cells that are calculated as unexplored redundant. Figure 4.2 shows the initial iterations of the algorithm. Blue cells have been evaluated with a global cost value. Green cells are currently being evaluated with a cost at the given iteration. Black cells represent obstacles and white cells are unexplored. Without making any assumptions about the layout of the obstacle cells within a given occupancy grid, at iteration i the wavefront of green cells will not have propagated further than i cells in any direction from the initial zero cost cell. In Figure 4.2 this range is shown by the red outline. Corners of the rectangle must be reevaluated each iteration, but this step is negligible relative to the other steps performed each iteration, being O(1). Given a destination cell at (d_x, d_y) and iteration i, the four corners of the rectangle are based on:

Figure 4.2: The first four iterations of the expanding texture technique.

left_x   = max(d_x − i, 0)
top_y    = max(d_y − i, 0)
right_x  = min(d_x + i, w)        (4.2.1)
bottom_y = min(d_y + i, h)

where w and h are the width and height of the occupancy grid respectively¹. The corners of the rectangle are then assigned from top-left in a clockwise direction as

(left_x, top_y), (right_x, top_y), (right_x, bottom_y) and (left_x, bottom_y). An initial benchmark comparison was completed between the vanilla and expanding texture methods over a range of occupancy grid sizes based on the Office Floor Plan test case given in Section 4.7.2.

¹ It should be noted at this stage that the expanding texture method only works for occupancy grids with a single zero cost destination cell. Later improvements to the algorithm's implementation presented in this chapter actually allow for multiple zero cost destination cells to be defined, which enables the algorithm to harness more parallel calculations each iteration.

The method of expanding texture was found to run in an order of magnitude less time than the vanilla approach. A summary of the results of this experiment is given in Figure 4.3. Even with a significant increase in performance, this approach does not solve the problem of redundant explored cell evaluations being performed. In addition, many redundant unexplored cell evaluations can still take place as soon as the wavefront progression is limited by a wall of obstacle cells, as seen in the bottom area of Figure 4.2.
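As a small illustration of Equation 4.2.1, a minimal C++ sketch of the request rectangle computation follows; the struct and function names are hypothetical and the clamping simply mirrors the equations above.

#include <algorithm>

// Hypothetical helper: the expanding texture request rectangle at iteration i
// for a single destination cell (dx, dy) on a w x h occupancy grid.
struct Rect { int left, top, right, bottom; };

Rect requestRect(int dx, int dy, int i, int w, int h) {
    Rect r;
    r.left   = std::max(dx - i, 0);   // left_x in Equation 4.2.1
    r.top    = std::max(dy - i, 0);   // top_y
    r.right  = std::min(dx + i, w);   // right_x
    r.bottom = std::min(dy + i, h);   // bottom_y
    return r;
}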

Figure 4.3: Comparison of the execution time of the vanilla implementation (Whole Texture) against the Expanding Texture implementation. The results are based on appropriately scaling the Office Floor Plan occupancy grid in Section 4.7.2 to a range of dimensions.

4.3 Larger Kernel Observation Area

Where the expanding texture approach attempted to reduce the number of redundant calculations being performed on the GPU, other approaches were also assessed to reduce the overall execution time of the algorithm's implementation.

One such hypothesis was whether the algorithm could expand into more cells sooner by using a 5 × 5 cell observation area rather than a 3 × 3 cell area. As stated in Algorithm 1, when a cell's global cost value is being calculated the eight adjacent neighbouring cells are observed. The major benefit that arose out of any flavour of the proposed concurrent algorithm throughout this thesis was that it performed much more productively when allowed to expand into as many cells as possible as soon as possible. The main argument for this hypothesis was that after the first iteration of a 3 × 3 cell observing kernel, 3 × 3 cells will have been evaluated, whereas a 5 × 5 cell observing kernel could enable 5 × 5 cells to be given a value. Subsequently at the second iteration, 5 × 5 cells relative to 9 × 9 would be evaluated, respectively². The individual running time of each kernel of these approaches was then assessed from a theoretical point of view. Firstly, the existing 3 × 3 approach can be generalised to performing 40 sub-iteration units of work. This value is calculated by multiplying the number of steps the kernel uses to observe one neighbouring cell, as outlined in Table 4.1, by the number of neighbouring cells, being eight³. Next the 5 × 5 cell approach was assessed using the same sub-iteration scale. Of the 5 × 5 cells, the central 3 × 3 cells can be evaluated using the same 40 sub-iteration units worth of work as the 3 × 3 cell approach. Of the next layer of cells outside the central 3 × 3, extra steps would be required to check that a cell should be considered reachable from the central cell being evaluated. Figure 4.4 shows one possible observation area for a potential 5 × 5 cell kernel. First the cost of outer bishop's move cells is considered (viz. A1, E1, A5 and E5 in Figure 4.4) in Table 4.2. The total number of sub-iteration steps performed when checking a bishop's move cell is 7, which, multiplied by 4 cells, gives a total sub-iteration cost of 28.

² This assumes the theoretical best case of the wavefront not confronting any obstacle cells or reaching the edge of the occupancy grid.
³ When observing each cell, up to three branching statements and a possible texture write are performed. Under the GPU architecture all possible paths of a branching structure are executed and the correct result is conditionally written to texture as a final step. Therefore, if a set of texture read and update operations should not logically be executed, as they sit inside a conditional that evaluates to false, they are still executed on the GPU and therefore contribute to execution time.

Table 4.1: Sub-iteration cost breakdown of the workload of a kernel that observes 3 × 3 cells.

Step Description                                                                     Cost
Read value of neighbouring cell.                                                     1
Check if value is not obstacle or unexplored.                                        1
Check if best cost so far is either undefined or, if it is defined, whether it       2
is less than the best cost so far.
Update the best cost variable.                                                       1
Total                                                                                5
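To make the per-neighbour steps tallied in Table 4.1 concrete, the following is a minimal CPU-side C++ sketch of the 3 × 3 observation performed for a single cell. The thesis kernel itself is a GLSL shader; the sentinel values, array layout and diagonal cost scaling used here are illustrative assumptions rather than the exact implementation.

#include <cmath>
#include <vector>

const float OBSTACLE   = -1.0f;   // assumed sentinel values
const float UNEXPLORED = -2.0f;

// Evaluate one cell (x, y) of a w x h grid: read each of the eight neighbours,
// skip obstacle or unexplored neighbours, and keep the best "neighbour cost plus
// traversal cost", where the traversal cost is assumed to be the local cell cost
// scaled by 1 for rook moves and sqrt(2) for diagonal moves.
float evaluateCell(const std::vector<float>& cost, const std::vector<float>& local,
                   int x, int y, int w, int h) {
    float best = UNEXPLORED;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx) {
            if (dx == 0 && dy == 0) continue;
            int nx = x + dx, ny = y + dy;
            if (nx < 0 || nx >= w || ny < 0 || ny >= h) continue;
            float neighbour = cost[ny * w + nx];                     // read neighbour value
            if (neighbour == OBSTACLE || neighbour == UNEXPLORED) continue;
            float k = (dx != 0 && dy != 0) ? std::sqrt(2.0f) : 1.0f; // diagonal scaling
            float candidate = neighbour + k * local[y * w + x];      // assumed cost metric
            if (best == UNEXPLORED || candidate < best)              // keep best cost so far
                best = candidate;
        }
    return best;   // new value for cell (x, y); unchanged if no valid neighbour exists
}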

Figure 4.4: A 5 × 5 cell observation area of a given cell. The kernel is being run for the centre cell marked as blue. Obstacle cells are marked as black with free cells marked as white.

Table 4.2: Sub-iteration cost breakdown of the workload of a kernel observing just bishop's move cells for a 5 × 5 cell observation area.

Step Description                                                                     Cost
Read value of outer bishop's move cell.                                              1
Read value of inner bishop's move cell.                                              1
Check if inner cell is not an obstacle.                                              1
Check if outer cell is not obstacle or unexplored.                                   1
Check if best cost so far is either undefined or, if it is defined, whether it       2
is less than the best cost so far.
Update the best cost variable.                                                       1
Total                                                                                7

From a logical flow point of view, an outer cell with a defined cost would be considered if the intermediate cell is not an obstacle (e.g. cell E3 through D3) and ignored if the intermediate cell is an obstacle (e.g. cell C1 blocked via C2). Next the cost of outer knight's move cells is considered (viz. B1, D1, A2, A4, E2, E4, B5 and D5 in Figure 4.4). Here, more steps are required relative to other outer cell observations as a path to a knight's move cell can pass through two possible inner cells with the same traversal pattern. For example, cell E2 can be reached from cell C3 via both cells D2 and D3, using both a k = 1 and a k = √2 scaled traversal. The sub-iteration step descriptions and counts for observing a single knight's move cell are given in Table 4.3. Each outer knight's move cell uses 9 sub-iteration steps, which, with 8 cells of this type, gives a total sub-iteration step count of 72. Lastly the cost of outer rook's move cells is considered (viz. C1, A3, E3 and C5 in Figure 4.4). The steps required to observe this type of cell are given in Table 4.4. Here each cell requires 7 sub-iteration steps, which, with 4 such cells, gives a total sub-iteration count of 28 for this type of cell. An alternate approach of observing, for example, cell C1 via B2 or D2 may also be considered, but this analysis is attempting to underestimate the cost of a 5 × 5 cell kernel using the simplest approach, to be able to fairly compare to the cost of the 3 × 3 cell observation area approach.

Table 4.3: Sub-iteration cost breakdown of the workload of a kernel observing just knight's move cells for a 5 × 5 cell observation area.

Step Description                                                                     Cost
Read value of outer knight's move cell.                                              1
Read value of inner rook's move cell.                                                1
Read value of inner bishop's move cell.                                              1
Check that at least one of the inner cells is not an obstacle.                       2
Check that the outer cell is not obstacle or unexplored.                             1
Check if best cost so far is either undefined or, if it is defined, whether it       2
is less than the best cost so far.
Update the best cost variable.                                                       1
Total                                                                                9

A summary of the total sub-iteration step count of each of the 3 × 3 cell and 5 × 5 cell approaches is given in Table 4.5 for a kernel evaluating a single cell's global cost. Of particular interest is that the hypothetical 5 × 5 cell version of the algorithm performs at least four times as many sub-iteration steps as the 3 × 3 cell algorithm. The size of the observation area was then factored back into the sub-iteration costs given in Table 4.5. The theoretical best case of an occupancy grid with no obstacle cells is used for this analysis. To compare the two approaches, the total number of sub-iteration steps required to evaluate a given number of cells was calculated. Table 4.6 shows the cumulative number of cells that are evaluated at initial iterations of the algorithm. The data given here highlights the higher rate of expansion into more cells in fewer iterations for the hypothesised 5 × 5 cell observation area approach. By combining the cell coverage values from Table 4.6 with the kernel sub-iteration costs of each cell evaluation from Table 4.5, the performance of the two approaches can be compared.

Table 4.4: Sub-iteration cost breakdown of the workload of a kernel observing just rook's move cells for a 5 × 5 cell observation area.

Step Description                                                                     Cost
Read value of outer rook's move cell.                                                1
Read value of inner rook's move cell.                                                1
Check if inner cell is not an obstacle cell.                                         1
Check if the outer cell is not obstacle or unexplored.                               1
Check if best cost so far is either undefined or, if it is defined, whether it       2
is less than the best cost so far.
Update the best cost variable.                                                       1
Total                                                                                7

Table 4.5: Comparison of the total sub-iteration steps performed by each of the 3 × 3 and 5 × 5 cell observation area algorithms.

Approach    Sub-iteration Cost
3 × 3       40 units
5 × 5       168 units
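As a check on these totals, the 5 × 5 figure combines the component counts from Tables 4.1 to 4.4: 168 = 8 × 5 (inner 3 × 3 cells) + 4 × 7 (bishop's move cells) + 8 × 9 (knight's move cells) + 4 × 7 (rook's move cells), compared with 8 × 5 = 40 units for the 3 × 3 kernel.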

Table 4.6: Cumulative number of cells that each kernel approach can reach at a given iteration. This assumes the best case scenario of no obstacle cells and the algorithm not reaching the edge of an occupancy grid. Column headings represent the observation area used by each approach.

Iteration   3 × 3        5 × 5
1           3² = 9       5² = 25
2           5² = 25      9² = 81
3           7² = 49      13² = 169
4           9² = 81      17² = 289
5           121          441
6           169          625
7           225          841
8           289          1089
...
i           (2i + 1)²    (4i + 1)²

Table 4.7: Cumulative sub-iteration steps required by each approach to evaluate a given number of cells. Values are calculated by multiplying the number of cells that will have been evaluated by the sub-iteration cost of evaluating each cell.

Evaluated Cells   3 × 3     5 × 5
25                1360      4200
81                6560      17808
169               18160     46200
289               38720     94752
441               70800     168840
625               116960    273840
841               179760    415128
1089              261760    598080
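The values in Table 4.7 can be regenerated from the coverage and cost figures above; the short C++ sketch below does so under the same assumptions (an obstacle-free grid, and 40 or 168 sub-iteration units per cell evaluation) and is purely illustrative.

#include <cstdio>

int main() {
    long cost3 = 0, cost5 = 0;
    for (int i = 1; i <= 8; ++i) {
        long cells3 = (2L * i + 1) * (2L * i + 1);  // coverage of the 3 x 3 kernel (Table 4.6)
        long cells5 = (4L * i + 1) * (4L * i + 1);  // coverage of the 5 x 5 kernel
        cost3 += cells3 * 40;                       // kernel costs from Table 4.5
        cost5 += cells5 * 168;
        std::printf("%5ld cells: %7ld units (3x3) | %5ld cells: %7ld units (5x5)\n",
                    cells3, cost3, cells5, cost5);
    }
    return 0;
}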

Table 4.7 shows selected cell counts and the equivalent sub-iteration steps required to reach those evaluated cell counts. Figure 4.5 shows the data from Table 4.7 graphed against the cumulative number of cells evaluated. This shows that a 5 × 5 cell approach to observing neighbouring cells is less efficient than a 3 × 3 cell approach, as it takes far more sub-iteration steps to evaluate the same number of cells. Therefore, it was concluded that this approach should not be pursued through to implementation.

Figure 4.5: The cumulative number of sub-iteration steps required by the 3 × 3 and 5 × 5 cell observation variants of the algorithm to be able to evaluate a given number of cells.

4.4 Subregions

Drawing from the hypothesis presented in Figure 4.1 and the desire to discover the optimal number of cells to collect into a management group, the method of subregions was proposed. A subregion is defined as a square grouping of cells that are managed by the CPU side of the implementation as a group. Subregions tessellate to cover the area of the original occupancy grid and may be created as rectangular groupings where the number of cells in a given dimension of the overall configuration space does not divide evenly by the size of the subregion in that dimension.

More formally, two cells A and B are in the same subregion of size s cells if all of these conditions are true:

(z_x)s ≤ A_x < (z_x + 1)s
(z_x)s ≤ B_x < (z_x + 1)s
(z_y)s ≤ A_y < (z_y + 1)s        (4.4.1)
(z_y)s ≤ B_y < (z_y + 1)s

for z_x, z_y ∈ Z⁺ ∪ {0}.
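The conditions of Equation 4.4.1 reduce to an integer division check; the following C++ sketch, with hypothetical helper names, illustrates this for square subregions of side s cells.

#include <utility>

// Subregion indices (z_x, z_y) from Equation 4.4.1 for a cell at (cx, cy).
std::pair<int, int> subregionIndex(int cx, int cy, int s) {
    return { cx / s, cy / s };
}

// Two cells share a subregion exactly when their indices agree.
bool sameSubregion(int ax, int ay, int bx, int by, int s) {
    return subregionIndex(ax, ay, s) == subregionIndex(bx, by, s);
}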

By varying the size of the subregion for a particular application and generation of graphics hardware, the optimal balance between CPU and GPU load can be achieved to minimise execution time. A subregion size of 1 × 1 cells implies each cell is micromanaged by the CPU.

A subregion size of n × n cells, where the occupancy grid is n × n cells, resembles the vanilla method outlined in Chapter 3. Execution performance characteristics will be presented later in Section 4.7, but a 2007 generation graphics card performed well with subregion sizes of around 32 × 32 cells. The same real-world occupancy grid performed well on a 2010 generation graphics card using subregion sizes closer to 128 × 128 cells. To highlight how the choice of subregion size affects the computational burden on the GPU, a simulation was run to calculate the raw number of individual cell calculations requested for different subregion sizes, as well as the number required for the vanilla and expanding texture implementation flavours. Figure 4.6 shows a logarithmically scaled graph of the number of cells being requested at a particular iteration of the algorithm. This graph does not take into account an estimate of the possible increase in computational burden contributed by the CPU, and therefore overall execution time, in managing smaller subregions. It is purely used to demonstrate that the GPU burden can be repeatedly reduced by an order of magnitude as the chosen subregion size decreases.

Figure 4.6: A logarithmically scaled graph of the number of GPU cell calculations requested for different flavours of the algorithm and different subregion sizes. It should be noted that this graph only reflects the cell counts and does not reflect true GPU or CPU load.

4.4.1 Active and Inactive Subregions

Each subregion is managed as a single unit. Subregions are classified as either being in an active or maturing state, or in an inactive state. Active subregions at a given iteration of the algorithm are requested for evaluation on the GPU, while inactive subregions are not. Before iterations begin, the subregion or subregions containing any zero cost destination cells are added to the active subregions list. As the algorithm iterates, subregions are promoted to and demoted from being active depending on the delta of cell values between the current iteration and the previous iteration.

Figure 4.7: Snapshot of the subregion flavour of the algorithm at a particular iteration. The subregions of size 16 × 16 cells are shown by the grid pattern, with active subregions highlighted by a frosted border. Only active subregions have the cost-to-go function shown in full colour, with inactive subregion cells greyed out.

After the active subregions have been requested to evaluate cells within their bounds for a given iteration, the cells in the borders of the subregion are checked. If any cell in the border of the subregion satisfies the condition:

c_{x,y} at T_{i−1} ≠ c_{x,y} at T_{i}        (4.4.2)

then the subregion adjacent to that subregion border is added to the active subregions list if it is not already active. If a subregion is promoted into the active list, it changes state from unexplored to maturing. If all cells in a given subregion satisfy the condition:

c_{x,y} at T_{i−1} = c_{x,y} at T_{i}        (4.4.3)

then the subregion is demoted from the active list. This is the mechanism by which subregions change from a maturing state to mature. As the algorithm iterates, the wavefront of productively evaluated cells advances within the subregions that comprise the active list. The algorithm terminates when the active list of subregions is empty. That is, all subregions that were active on the previous iteration were found to have not changed from t = i − 1 to t = i. To perform cell value comparisons between iterations of the algorithm's implementation, subregion cell values are read back from GPU to CPU memory.
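A minimal sketch of these post-iteration checks is given below, assuming the subregion's border and interior cell values for the previous and current iterations have already been read back to CPU memory; the function names are hypothetical and only the comparison logic of Equations 4.4.2 and 4.4.3 is shown.

#include <cstddef>
#include <vector>

// Condition 4.4.2: did any border cell of this subregion change this iteration?
// If so, the neighbouring subregion across that border is promoted to maturing.
bool anyBorderCellChanged(const std::vector<float>& prevBorder,
                          const std::vector<float>& currBorder) {
    for (std::size_t i = 0; i < currBorder.size(); ++i)
        if (prevBorder[i] != currBorder[i]) return true;
    return false;
}

// Condition 4.4.3: did every cell of this subregion stay unchanged?
// If so, the subregion is demoted from the active list and becomes mature.
bool noCellChanged(const std::vector<float>& prevCells,
                   const std::vector<float>& currCells) {
    for (std::size_t i = 0; i < currCells.size(); ++i)
        if (prevCells[i] != currCells[i]) return false;
    return true;
}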

4.4.2 Subregion Metadata Data Structures

Each subregion has a metadata object associated with it that contains the state and location of the subregion. Subregion metadata objects are stored in two interdependent data structures in CPU memory. At initialisation, each subregion object is allocated in a two-dimensional array as a contiguous block of memory. A subregion's location in this array relates directly to its spatial location relative to the occupancy grid. This data structure is used to perform neighbouring subregion discovery by O(1) simple pointer arithmetic. The second data structure is a C-style linked list and is used to collect subregions that are currently active. This list is referred to as the active list in later sections. A linked list was chosen as it also minimises the required computational complexity to perform the actions of adding and removing items from the list and traversing the list. Subregions that are promoted to an active state are added to the front of the list, thus giving this operation a complexity of O(1).

List traversal happens at each post-iteration check and currently runs in O(|L|) time, with |L| being the length of the list. Since list element removal happens during the normal post-iteration list traversal, removing a subregion from the list happens in virtually O(1) time. At initialisation, only subregions with zero cost destination cells are added to the active list. As cells are gradually explored and evaluated, subregions are continually added to and removed from the list in an overlapping proximity to cells that are being usefully evaluated. When the active list is first detected to be completely empty — an O(1) operation — the algorithm terminates.
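The two data structures described above might look roughly like the following C++ sketch; the type and field names are hypothetical, but the layout follows the text: a contiguous two-dimensional array for O(1) neighbour discovery and a singly linked active list with O(1) insertion at the front.

#include <vector>

enum class State { Unexplored, Maturing, Mature };

struct SubregionMeta {
    int x = 0, y = 0;               // subregion indices within the grid of subregions
    State state = State::Unexplored;
    SubregionMeta* next = nullptr;  // intrusive link for the active list
};

struct SubregionGrid {
    int cols, rows;
    std::vector<SubregionMeta> cells;   // contiguous block, row-major
    SubregionMeta* activeHead = nullptr;

    SubregionGrid(int c, int r) : cols(c), rows(r), cells(c * r) {
        for (int y = 0; y < r; ++y)
            for (int x = 0; x < c; ++x) { cells[y * c + x].x = x; cells[y * c + x].y = y; }
    }

    // O(1) neighbour discovery by index arithmetic into the contiguous array.
    SubregionMeta* at(int x, int y) {
        if (x < 0 || x >= cols || y < 0 || y >= rows) return nullptr;
        return &cells[y * cols + x];
    }

    // Promotion: push onto the front of the active list in O(1).
    void promote(SubregionMeta* s) {
        s->state = State::Maturing;
        s->next = activeHead;
        activeHead = s;
    }
};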

4.4.3 Analysis of Execution Time of Algorithm Components

An analysis of the execution times of significant components in the implementation was carried out to attempt to identify possible bottlenecks. The main components measured were the reading of cell data from GPU to CPU memory and the two stages of the subregion post-iteration checks. Tables 4.8 and 4.9 show the recorded execution times of each of these components on the on-board machine (Table 4.14) over the Office Floor Plan test case (Section 4.7.2) for two different subregion sizes. The core process of loading an occupancy grid into GPU memory, computing an exhaustive global cost-to-go function over many iterations, and reading the solution back was repeated 32 times to gain a proper average of each component. Values shown in each component breakdown table in this section are calculated from the total recorded time around each component's execution, divided by 32. In each of these tables the component named glReadPixels() is a single function call responsible for reading cell data between GPU and CPU memory. It can be concluded from these values that, although the GPU calculations and CPU post-iteration checks perform well, the process of repeatedly synchronising cell data back to CPU memory to be able to perform informed post-iteration checks is the most significant bottleneck. As an exploratory step, the implementation was modified to perform the transfer of cell data and the CPU based post-iteration checks every second iteration of the algorithm in a direct attempt to measure the significance of the glReadPixels component.

Table 4.8: Breakdown of the execution time of components of the standard subregions implementation for a subregion size of 16 × 16. Results here were collected on the on-board machine given in Table 4.14 over the Office Floor Plan test case outlined in Section 4.7.2.

Component                       Average Time   %
glReadPixels()                  5.272s         98.34%
Unexplored to Maturing check    0.004s         0.08%
Maturing to Mature check        0.009s         0.17%
Other                           0.076s         1.41%

Table 4.9: Breakdown of the execution time of components of the standard subregions implementation for a subregion size of 80 × 80. Results here were collected from the on-board machine outlined in Table 4.14 over the Office Floor Plan test case presented in Section 4.7.2.

Component                       Average Time   %
glReadPixels()                  3.846s         97.19%
Unexplored to Maturing check    0.007s         0.18%
Maturing to Mature check        0.006s         0.14%
Other                           0.098s         2.48%

Table 4.10: Breakdown of the execution time of components of the subregions approach, but checking every 2nd iteration, for a subregion size of 16 × 16. Results given here were collected over the Office Floor Plan test case presented in Section 4.7.2 using the on-board graphics card outlined in Table 4.14.

Component                       Average Time   %
glReadPixels()                  2.894s         97.49%
Unexplored to Maturing check    0.005s         0.16%
Maturing to Mature check        0.003s         0.11%
Other                           0.067s         2.24%

The experiment was repeated, with the execution time of relevant components shown in Tables 4.10 and 4.11. The generated cost-to-go function in both experiment sets was also confirmed to be identically correct within the acceptable error margins of 32-bit floating point data types. While the two check and Other components saw no significant improvement in the execution time, the glReadPixels component saw a noticeable decrease in its execution time. Although this result shows that the algorithm still functions correctly when the post-iteration checks are performed less often, even on this reduced schedule the glReadPixels component still contributed to over 95% of overall execution time in both cases. Table 4.12 shows a comparison of the overall execution time of the standard subregions implementation against performing the post-iteration checks every second iteration. Given that the reading of cell values from GPU to CPU memory was the most significant burden on overall execution time and that performing the post-iteration check less often still generated a correct optimal solution, a number of experimental optimisations outlined in the next sections were investigated.

Table 4.11: Breakdown of the execution time of components of the subregions approach, but checking every 2nd iteration, for a subregion size of 80 × 80. Results given here are based on the on-board graphics card outlined in Table 4.14 over the Office Floor Plan test case presented in Section 4.7.2.

Component                       Average Time   %
glReadPixels()                  2.283s         96.15%
Unexplored to Maturing check    0.003s         0.14%
Maturing to Mature check        0.004s         0.15%
Other                           0.084s         3.55%

Table 4.12: Comparison of the overall execution times when performing the post-iteration checks every iteration and every second iteration, over the two sample subregion sizes. Here % Execution Time represents the ratio of the two recorded times for the particular subregion size.

Subregion Size     Check Every Iteration   Check Every Second Iteration   % Execution Time
16 × 16 (cells)    5.361s                  2.968s                         55.36%
80 × 80 (cells)    3.957s                  2.374s                         59.99%

4.5 Initial Optimisation Techniques

At this point, the subregions algorithm required cell values to be read back from GPU memory to CPU memory every iteration, and it required the cell values from both the current and previous iterations so that a difference comparison could be performed on cell values. For the remainder of this chapter, this standard flavour of the algorithm will be abbreviated to "subreg." in figures and comparisons. An obvious initial step to reduce the amount of data read back by half was to cache subregion cell values in CPU memory between iterations. For example, at iteration T_i, post-iteration checks require cell values from iterations T_{i−1} and T_i.

On the subsequent iteration T_{i+1}, the algorithm requires cell values from iterations T_i and T_{i+1}. Here, cell values from iteration T_i are required twice on consecutive iterations, so are read back into CPU memory on iteration T_i as the cells' current values. Then on iteration T_{i+1}, the algorithm refers to the cached copy in CPU memory and is only required to read back cell values from iteration T_{i+1}.

In the algorithm's implementation, this cached cell data was stored in a pair of buffers linked to the corresponding subregion's metadata object. This optimisation step saw overall execution times drop to between 50% and 51% of their original execution times. For the remainder of this chapter, this flavour of the algorithm will be designated as "cached" in figures and comparisons. In addition to caching subregion cell values and performing the post-iteration checks every second iteration, an additional set of experiments was run with the post-iteration checks performed every third iteration. For the remainder of this chapter, these two flavours will be designated "check 2nd" and "check 3rd" in figures and comparisons, respectively. While checking every third iteration showed an additional reduction in overall execution time, progressively less frequent check schedules quickly hit a wall of diminishing returns. Figures 4.8 and 4.9 show the execution time of each evolutionary flavour of the algorithm over a range of subregion sizes. Each graph shows how the overall execution time reduces with each subsequent flavour. Instead of performing fewer post-iteration checks on an arbitrarily periodic schedule, a more efficient and directed approach is presented in the next section.
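The caching step amounts to double-buffering each subregion's cell values in CPU memory; the sketch below, with hypothetical names, shows the idea of retaining the previous read-back so that only one GPU-to-CPU transfer is needed per check.

#include <utility>
#include <vector>

struct SubregionCache {
    std::vector<float> previous;  // cell values read back at iteration T_{i-1}
    std::vector<float> current;   // cell values read back at iteration T_i

    // Called once per post-iteration check with freshly read-back values:
    // the old "current" becomes "previous" without a second GPU transfer.
    void pushReadback(std::vector<float> fresh) {
        previous = std::move(current);
        current  = std::move(fresh);
    }
};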

Figure 4.8: Execution time of the various subregion implementation flavours on the on-board machine over the single help point test case presented in Section 4.7.3.

4.6 Statistical Schedule of Checking

Instead of polling subregions at regular intervals to perform active cell checks, an experiment was run to track the lifetimes of subregions for a range of subregion sizes on a real-world occupancy grid. Figures 4.10, 4.11, 4.12 and 4.13 show the lifetimes of subregions used to evaluate the Office Floor Plan occupancy grid, for subregion sizes of 16, 32, 48 and 64, respectively. Of significance is that most subregions either live for one or two iterations, or for just greater than the number of iterations matching the size of the subregion. The algorithm was then further modified in an attempt to concentrate post-iteration checks at iterations when a subregion was statistically likely to mature, and not when it was in the middle stages of maturing. For the remainder of the chapter, this implementation flavour is designated "statistical" in figures and comparisons. A per-subregion iteration lifetime counter was assigned to each subregion metadata object and was reset to zero when the subregion was promoted to the active list. For subregion sizes of s × s, the algorithm performed post-iteration checks for that subregion at iteration T_i, where (a sketch of this schedule is given after the list):

Figure 4.9: Execution time of the various subregion implementation flavours on the desktop machine over the single help point test case presented in Section 4.7.3.

• i = 1 and 2,

• i = s/2 and s/2 + 1, and

• i = s + 2z for z ∈ Z⁺.
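A compact way to express this schedule is as a predicate over the subregion's lifetime counter; the following C++ sketch (hypothetical function name, assuming an even subregion size s) is one possible reading of the rules above.

// Returns true if a post-iteration check should be performed for a subregion
// of side s cells whose lifetime counter (iterations since promotion) is i.
bool shouldCheck(int i, int s) {
    if (i == 1 || i == 2) return true;                 // most subregions mature young
    if (i == s / 2 || i == s / 2 + 1) return true;     // second cluster of lifetimes
    if (i >= s + 2 && (i - s) % 2 == 0) return true;   // i = s + 2z for z = 1, 2, ...
    return false;
}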

Table 4.13 shows the cumulative percentage of subregions that would have statistically been promoted to being mature by the corresponding iteration counter i based on the schedule given above. For each of the subregion sizes sampled, more than 50% of subregions will have been promoted to mature within the first two post-iteration checks. By the end of the sixth post-iteration check on a subregion, statistically at least 95% of subregions will have been promoted. That is, the majority of subregions are checked a number of times in the order of O(1), rather than closer to the order of O(s) as in prior implementation flavours.

Figure 4.10: Histogram of the number of iterations a subregion is active using a subregion size of 16×16 cells. The iteration equal to the subregion size is highlighted by the red dotted line.

Table 4.13: Percentage of the subregions that will have been checked and promoted to mature on or by the corresponding subregion lifetime iteration counter i. Here “Check No.” represents the post-iteration check number that is being performed for that iteration. That is, a check number of 2 implies the second time the subregion is checked.

Iteration      Check No.   s = 16   s = 32   s = 48   s = 64
i = 2          2           55.2%    59.3%    58.1%    52.6%
i = s/2 + 1    4           58.2%    62.7%    64.0%    62.3%
i = s + 4      6           99.7%    99.0%    97.9%    96.5%

Figure 4.11: Histogram of the number of iterations a subregion is active using a subregion size of 32×32 cells. The iteration equal to the subregion size is highlighted by the red dotted line.

Figure 4.12: Histogram of the number of iterations a subregion is active using a subregion size of 48×48 cells. The iteration equal to the subregion size is highlighted by the red dotted line.

Figure 4.13: Histogram of the number of iterations a subregion is active using a subregion size of 64×64 cells. The iteration equal to the subregion size is highlighted by the red dotted line.

4.7 Experimental Results

This section provides the performance characteristics of the statistical active subregion list check schedule implementation flavour running on a range of contrived and real world occupancy grids to demonstrate how the algorithm performs in certain conditions. When running, the implementation consumes a small amount of time to initialise and terminate. In practical applications, this would not affect the generation time of a cost-to-go solution as the program would be started before a robotics application was performed. However, to benchmark the implementations in terms of real-world clock time, the program as a whole is run inside an external timestamping command. In an attempt to factor out the initialisation and termination times, the core generation of a cost-to-go solution over an occupancy grid was run 100 times in a loop. This loop includes the proper initialisation of a raw occupancy grid in graphics memory (including the transfer from CPU to GPU memory), performing the iterative steps of the algorithm, and copying the final solution back to CPU memory.

Table 4.14: Hardware specifications for the on-board machine used to benchmark algorithm implementations.

Property             ATI Radeon HD 2400 XT
Shader processors    40
Peak FLOPs           40 to 48 GigaFLOPs
Bus transfer         4 GB/s (PCIe x16)
Release Date         28th June 2007
Vendor               AMD, Inc. [84] [85]

The estimated execution time of a single run of the core implementation is calculated based on Equation 4.7.1.

T = E / 100 = (s1 + 100 × A + s2) / 100        (4.7.1)

where:

• T is the estimated execution time of the core algorithm’s implementation,

• E is the recorded real world (clock) execution time surrounding the running of the implementation program,

• s1 is the initialisation time,

• s2 is the termination time,

• A is the true real world execution time of the core algorithm’s implementation.

Each simulation was run on two separate graphics cards to gauge the performance on differing architectures, particularly in relation to the number of shader processors available. Tables 4.14 and 4.15 list the relevant hardware characteristics of the two chosen graphics cards.

Table 4.15: Hardware specifications for the desktop machine used to benchmark algorithm implementations.

Property             GeForce GTX 480
Shader processors    448
Peak FLOPs           1344.96 GigaFLOPs
Bus transfer         8 GB/s (PCIe 2.0 x16)
Release Date         29th March 2010
Vendor               NVIDIA Corporation [86] [87]

4.7.1 Spiral Corridor

The first test case is a contrived layout of a spiral corridor. The occupancy grid has a single zero cost destination cell at the bottom left corner. Figure 4.14 shows the final generated cost-to-go function on the 512 × 512 cell occupancy grid. The corridor is on average between 55 and 65 cells wide. As the algorithm excels when allowed to expand out into wide open areas, the corridor test case is a good example for demonstrating performance when expansion is restricted. As the wavefront expands away from the zero cost cell around the spiral corridor, the level of useful concurrent calculations allowed is restricted by the width of the corridor. The execution time of the algorithm's implementation over a range of subregion sizes is given in Figure 4.15. Of particular note is the execution time recorded for a subregion size of 60 cells, which corresponds to the average corridor width. It is most likely that this particular execution benefit comes from subregion border locations coinciding with corridor walls. The subregion tessellation does not attempt to coincide with obstacles in the occupancy grid, so a subregion may become active for multiple periods of time as the algorithm washes over cells on either side of an obstacle intersecting a particular subregion. In this test case however, as many walls coincide with subregion borders, many subregions exist for a single period of time with the majority of non-obstacle cells being evaluated in that period of being active.


Figure 4.14: The generated cost-to-go function overlaid on the original occupancy grid for the spiral corridor test case. The global cost is shown ranging from blue at the low cost destination cells to red at the highest cost cells. Obstacle cells are shown in black.

In Figure 4.15, the top-left graph shows the raw execution time of the algorithm on each of the on-board and desktop graphics cards for a range of subregion sizes. The bottom-left graph shows the comparison of each of the raw performance graphs normalised relative to the fastest recorded time. The graphs to the right show a magnified section of the corresponding graph to the left to highlight more detail in relevant areas of the graph. Results here are based on the statistical flavour of the algorithm.

Figure 4.15: The execution time of the algorithm running on the spiral environment for two different graphics cards.

4.7.2 Office Floor Plan

To benchmark the algorithm's implementation in a real world indoor environment, this test case represents the author's office layout. The occupancy grid is post-processed from a combination of architectural floor plans and laser range data captured from an unmanned ground vehicle driven around the office. As such the occupancy grid shows both wall and door layout as well as furniture and other large objects situated in the office.

The occupancy grid is comprised of 1099 × 731 cells and is given in Figure 4.16 with the final generated cost-to-go function. The width of the area covered by this occupancy grid is approximately 26 m, which translates to a resolution of 2–3 cm for a single grid cell. This resolution greatly exceeds the 25 cm × 25 cm resolution required by an unmanned ground vehicle to navigate within this environment, as given by Whitty et al. [88]. The zero cost destination cell is chosen as the location of the author's chair. The execution time of the algorithm's implementation over a range of subregion sizes is given in Figure 4.17. The top-left graph shows the raw execution time on both the on-board and desktop graphics cards. The bottom-left graph shows raw data normalised relative to the minimum recorded execution time for both graphics cards. The graphs to the right show a magnified area of the corresponding graph to the left to highlight a relevant section in finer detail. The results here are based on the statistical flavour of the algorithm.

Figure 4.16: The generated cost-to-go function overlaid on the original occupancy grid for the office floor plan test case. The global cost is shown ranging from blue at the low cost destination cells to red at the highest cost cells. Obstacle cells are shown in black.

Figure 4.17: The execution time of the algorithm running on the office floor plan environment for two different graphics cards.

4.7.3 UNSW Help Point

The author's university has 13 emergency help points distributed around campus to allow staff and students radio access to campus security. To benchmark the algorithm's implementation in a real-world outdoor example, a map of campus was used in this test case, with the help point closest to the author's office location chosen as the zero cost destination cell. Figure 4.18 shows the occupancy grid with the generated cost-to-go function overlaid. Of particular note in this example is that the occupancy grid covers a campus that is approximately 1,000 m × 500 m using 1,158 × 559 cells. This translates to a grid resolution in the order of 1 m × 1 m, which in practice has been more than adequate for planning cross-campus autonomous navigation, as obstacles are spaced further apart than in indoor environments.

Figure 4.18: The generated cost-to-go function overlaid on the original UNSW Campus occupancy grid. The global cost ranges from blue for low cost cell values to red for high cost cell values, with obstacle cells shown in black.

Figure 4.19 shows the execution time of the algorithm's implementation over a range of subregion values. The top-left graph shows the raw execution time of the implementation for both graphics cards. The bottom-left graph shows the raw data normalised relative to the corresponding minimum execution time of each graphics card's results. The graphs to the right correspond to a magnified area of each graph to the left to show more detail for relevant values of the data. The results presented here are based on the statistical flavour of the algorithm.

Figure 4.19: The execution time of the algorithm running on the campus help point experiment for different graphics cards, with a single destination cell.

4.7.4 Multiple UNSW Help Points

The UNSW campus occupancy grid is used here again, but with all 13 emergency help points set as zero cost destination cells. This test case is used to highlight the extra concurrency achievable by the algorithm when given more cells to evaluate in parallel. The resulting cost-to-go function is given in Figure 4.20 overlaid on the original occupancy grid. Figure 4.21 shows the execution time of the statistical flavour of the algorithm's implementation over a range of subregion values. The top-left graph shows the raw execution time of the implementation for each of the chosen graphics cards.

The bottom-left graph shows raw execution times normalised relative to the minimum execution times. Each graph to the right corresponds to a magnified area of interest of the corresponding graph to the left.

Figure 4.20: The generated cost-to-go function overlaid on the original UNSW Campus occupancy grid. Cell colours range from blue for low cost cells to red for high cost cells.

Figure 4.21: The execution time of the algorithm running on the multiple destination campus help points occupancy grid. The concurrent algorithm is allowed to spread out into more regions concurrently due to multiple wavefronts, as compared to the single destination point experiment outlined in Section 4.7.3.

4.8 Comparison of Execution Time of Flavours of Algorithm

For the single campus emergency help point (Section 4.7.3) and multiple help point (Section 4.7.4) examples, benchmarking was performed over each flavour of the algorithm presented in this chapter. Both the on-board and desktop graphics cards were used for this benchmark to give a more accurate representation of the performance characteristics relative to hardware.

The vanilla flavour of the algorithm introduced in Chapter 3 is also given here as a reference point. The horizontal axis of each graph represents the subregion size used for the corresponding experimental run. As the vanilla and expanding texture (abbreviated as exp.tex.) flavours do not involve subregions, they are shown in each figure as a horizontal line. In each graph, subreg designates the original subregion implementation presented with no further optimisations. The cached label represents the next evolution of the algorithm in which subregion values are cached in CPU memory rather than being read back from GPU memory twice on two consecutive iterations. The labels check 2nd and check 3rd represent the cached subregion flavour with post-iteration checks performed only every second and third iteration, respectively. Finally, the statistical label designates the approach where post-iteration checks are performed on a statistically optimal schedule of iterations. Figures 4.22 and 4.23 show execution times of benchmarking performed on the on-board machine for the single and multiple help point occupancy grids, respectively. Of particular note in these benchmarking contexts is that the vanilla and expanding texture methods performed an order of magnitude slower than all subregion implementations. Figures 4.24 and 4.25 show execution times on the desktop machine for the single and multiple help point test cases, respectively. The graph to the right in each figure shows a magnified section of the same data to the left to show more detail where relevant. Of particular interest here is the contrast between the performance of the vanilla and expanding texture approaches between the two generations of graphics hardware. The ratio between the reported computational throughput of each graphics card in Tables 4.14 and 4.15 approximately matches the ratio of the vanilla and expanding texture execution times between Figures 4.22 and 4.24, and between Figures 4.23 and 4.25, being 30 to 1. However, for both graphics card options, the choice of using the statistical flavour with a subregion size that matches the problem domain is the best approach. Table 4.16 summarises the fastest execution times for each of the flavours of the algorithm, over each graphics card for both the single and multiple help point test cases.

Figure 4.22: The execution time of each evolutionary flavour of the algorithm. Results gathered here are from the on-board machine using the single destination cell test case from Section 4.7.3. The graph to the right shows a magnified area of the graph to the left focused on the shorter execution times to allow the reader easier comparison of subregion implementations.

Figure 4.23: The execution time of each evolutionary flavour of the algorithm. Results gathered here are from the on-board machine using the multiple destination cell test case from Section 4.7.4. The graph to the right shows a magnified area of the graph to the left focused on shorter execution times to allow the reader easier comparison of subregion implementations.

Figure 4.24: The execution time of each evolutionary flavour of the algorithm. Results gathered here are from the desktop machine using the single destination cell test case from Section 4.7.3. The graph to the right shows a magnified area of the graph to the left focused on the shorter execution times to allow the reader easier comparison of subregion implementations.

Figure 4.25: The execution time of each evolutionary flavour of the algorithm. Results gathered here are from the desktop machine using the multiple destination cell test case from Section 4.7.4. The graph to the right shows a magnified area of the graph to the left focused on shorter execution times to allow the reader easier comparison of subregion implementations.

Table 4.16: A comparison of the fastest execution times for each of the seven implemented algorithm flavours presented in this chapter and the last. Times given in this table are in seconds. († On-board) (‡ Desktop)

Occupancy Grid          vanilla    exp.tex.   subreg.    cached
Single Destination†     31.37s     29.646s    2.4874s    1.8631s
Multi-Destination†      31.564s    29.667s    2.3634s    1.5726s
Single Destination‡     0.999s     1.496s     1.1357s    0.6128s
Multi-Destination‡      1.013s     1.5063s    0.9156s    0.4789s

Occupancy Grid          check 2nd   check 3rd   statistical
Single Destination†     0.999s      0.8357s     0.6971s
Multi-Destination†      0.9333s     0.7801s     0.6428s
Single Destination‡     0.3591s     0.2765s     0.1872s
Multi-Destination‡      0.2841s     0.2283s     0.1323s

hardware for real-time applications.

4.9 Complexity Analysis Comparison

The reader should be reminded at this point that the results presented in the previous section are highly dependent on the characteristics of the hardware used to run the experiments. To that effect, the fact that individual CPU speeds are hitting their physical limit, while the number of cores available on a processor continues to grow, is a significant factor in the strength of this work. The remainder of this section attempts to provide a technique for comparing the traditional sequential Dijkstra’s algorithm with the proposed concurrent algorithm from a computational complexity point of view. The implementation performance of Dijkstra’s algorithm can be represented by Equation 4.9.1. Similarly, an approximation of the implementation performance of

the GPU-based solver presented in this thesis can be represented by Equation 4.9.2.

$T_{CPU} = C_{seq} + M_{PQ}$    (4.9.1)

$T_{GPU} = C_{conc} + M_{AL} + T_{SR}$    (4.9.2)

Here, $C_{*}$ represents cell computation time, comprised of observing neighbouring cells, calculating a cell’s value and assigning the cell that value in memory. The values given by $M_{*}$ represent management overhead steps. In Dijkstra’s algorithm, $M_{PQ}$ represents the time taken to maintain an ordered priority queue. In the concurrent algorithm, $M_{AL}$ represents the time taken to perform the active list membership maintenance operations. The additional term $T_{SR}$ represents the time allocated to the transfer of data between GPU and CPU memory. A Big-Oh analysis of the most efficient approach to Dijkstra’s algorithm is given by Barbehenn [29]. In particular, for a graph G(V,E), using a binary heap data structure to represent the priority queue, the complexity given is:

$T_{CPU}: O(|E| + |V|\log|V|)$.    (4.9.3)

Here, the $O(|E|)$ component contributes to the calculation of cell values, similar to $C_{seq}$ in Equation 4.9.1, and the $O(|V|\log|V|)$ component contributes to the maintenance of the priority queue, or $M_{PQ}$ in Equation 4.9.1. If an occupancy grid with dimensions n × n, N cells and $C_{obst} = \emptyset$ is used, given that traversability from a cell is available in eight directions (including diagonals), then the number of edges relative to the number of vertices is given by Equation 4.9.4. A derivation of this relationship is given in Appendix A.1.

$|E| = 4|V| - 6\sqrt{|V|} + 2 = 4N - 6\sqrt{N} + 2$    (4.9.4)

The relationship between |E| and |V| is in the order of O(N), therefore Equation 4.9.3 can be generalised as Equation 4.9.5.

$T_{CPU}: O(N + N\log N) \Longrightarrow O(N\log N)$    (4.9.5)

As shown in Equation 4.9.2, the time taken to calculate an exhaustive cost-to-go function is comprised of the time allocated to calculating cell values ($C_{conc}$), the time allocated to maintaining the active list ($M_{AL}$), and the time taken to transfer intermediate subregion cell data from GPU memory to CPU memory ($T_{SR}$). In the following derivations the term s is used to represent the chosen subregion size and is maintained within the Big-Oh notations until the final discussion of this proof. This proof also assumes the subregion flavour of the algorithm without the optimisation of caching subregion cell values, and with the post-iteration active list checks performed after every iteration, rather than every 2nd, 3rd, or on a statistically appropriate schedule. These optimisations will be raised at the end of this proof. Each cell in the occupancy grid is evaluated approximately s times, as each subregion is active for approximately s iterations (the subregion’s dimension), as per Figures 4.10, 4.11, 4.12 and 4.13, and all cells within a subregion are evaluated together. Many cells are evaluated concurrently, limited, however, by the number of processors p available. Therefore, the $C_{conc}$ term can be approximated as that given in Equation 4.9.6.

$C_{conc}: O\!\left(\frac{Ns}{p}\right)$    (4.9.6)

Empirically, this highlights that smaller subregions are more favourable for performing GPU calculations⁴ as they reduce the number of redundant calculations, as seen in the vanilla and expanding texture flavours. In addition, where the proportion of $C_{obst}$ to $C_{free}$ is relatively low, the number of cells that can be evaluated concurrently is in the order of $O(\sqrt{N})$, if enough processors are available. Therefore, in low obstacle configuration spaces, the $C_{conc}$ term can be in the order of $O(\sqrt{N}\,s)$. The process of maintaining the active list, on the other hand, is empirically more efficient with larger subregion sizes as there are fewer items to maintain. First, the derivation of the approximate time taken to manage one subregion throughout its active lifetime, $m_{AL}$, is given by Equation 4.9.7.

⁴ This proof assumes that the number of processors available on the GPU is in the order of O(n).

$m_{AL} = 4s(s-1)$    (the subregion’s borders, comprising 4s cells, are checked s − 1 times)
    $+\ s^2$          (at the end of a subregion’s lifetime, all s² cells are checked once)
    $+\ 2s$           (each iteration, on average 2 of the 4 subregion borders will find a difference in cell values and will be required to ensure that the neighbouring subregion is also present in the active list)    (4.9.7)
    $+\ 1$            (removing a subregion at the end of its lifetime runs once and in constant time)

$m_{AL} = 5s^2 - 2s + 1$

The number of subregions in an occupancy grid can be denoted by $\frac{N}{s^2}$, therefore the order of $M_{AL}$ is given in Equation 4.9.8.

$M_{AL}: O\!\left(N(5 - 2s^{-1} + s^{-2})\right)$    (4.9.8)

The component representing the transfer of cell data from GPU memory to CPU memory is given by Equation 4.9.9.

$T_{SR} = \frac{N}{s^2}$    (the number of subregions)
    $\times\ s$             (each subregion is active for approximately s iterations)
    $\times\ 2s^2$          (each iteration, all cells within a subregion are transferred from both GPU texture buffers)    (4.9.9)

$T_{SR} = 2Ns \Longrightarrow O(Ns)$

Therefore, the intermediate complexity of the concurrent algorithm can be summarised by Equation 4.9.10.

$T_{GPU}: O\!\left(\frac{Ns}{p} + N(5 - 2s^{-1} + s^{-2}) + Ns\right)$    (4.9.10)

The effect of s should now be considered. If s is in the same order of magnitude as $\sqrt{N}$, as is the case in the vanilla and later stages of the expanding texture flavours of the algorithm, $s = \sqrt{N}$ can be substituted into Equation 4.9.10 to get an order of complexity given by Equation 4.9.11.

$T_{GPU}: O\!\left(\frac{N\sqrt{N}}{p} + N\!\left(5 - \frac{2}{\sqrt{N}} + \frac{1}{N}\right) + N\sqrt{N}\right) \Longrightarrow O(N\sqrt{N})$    (4.9.11)

However, using the method of subregions and given the optimal choice of s from experimental results given earlier in this section, s is considerably smaller than N. Given the campus emergency help point test case from Section 4.7.3, the value of N = 1158 × 559 = 647,322 is several orders of magnitude above the experimentally optimal s value of 128. If s is then considered an insignificant constant in the complexity analysis, then the proposed algorithm’s complexity can be represented by Equation 4.9.12.

$T_{GPU}: O\!\left(\frac{N}{p} + N + N\right) \Longrightarrow O(N)$    (4.9.12)

The profound result here is that the proposed algorithm can produce mathematically identical results in O(N) time compared to a traditional sequential algorithm that runs in $O(N\log N)$ time, assuming that the concurrent implementation has at least in the order of $O(\sqrt{N})$ cores available to handle the concurrent load at a particular iteration. Furthermore, each of the implementation optimisations presented in this thesis benefits the initial standard subregion implementation presented. The method of caching halves the amount of cell data required to be read from GPU memory to CPU memory and hence, in practice, $T_{SR}$ is halved. As stated in Section 4.6, using a statistically appropriate schedule for performing the post-iteration checks reduces the number of post-iteration checks a single subregion undergoes from O(s) to O(1). It should be conceded, however, that in practice this complexity analysis breaks down for small values of N. In particular, the overhead of setting up data transfer

between GPU and CPU memory becomes more significant with small data transfers. Volkov and Demmel [89] estimate that the time consumed by data transfer on a series of GPUs similar to the on-board machine outlined in Table 4.14 is given by Equation 4.9.13.

$\text{Time} = 11\,\mu s + \frac{\text{bytes transferred}}{3.3\,\text{GB/s}}$    (4.9.13)

The reader can assume that current sequential path planning implementations are capable of exhaustively evaluating, for example, a 16 × 16 cell occupancy grid on modern central processing units in under 11 µs.
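The relationship in Equation 4.9.13 is simple to evaluate numerically. The following Python fragment is an illustrative sketch only (it is not part of the benchmarked implementation); the latency and bandwidth constants are simply the figures quoted above and will differ on other hardware.

    # Rough transfer-time estimate from Equation 4.9.13 (Volkov and Demmel's figures).
    def transfer_time_seconds(bytes_transferred: int,
                              latency_s: float = 11e-6,
                              bandwidth_bytes_per_s: float = 3.3e9) -> float:
        """Estimated GPU<->CPU transfer time for a single transfer."""
        return latency_s + bytes_transferred / bandwidth_bytes_per_s

    if __name__ == "__main__":
        # A 16x16 grid of 32-bit floats is only 1 KiB, so the 11 us latency dominates.
        small = transfer_time_seconds(16 * 16 * 4)
        # A 1158x559 campus-sized grid of 32-bit floats (~2.6 MB) is bandwidth bound.
        large = transfer_time_seconds(1158 * 559 * 4)
        print(f"16x16 grid:  {small * 1e6:.1f} us")
        print(f"campus grid: {large * 1e3:.3f} ms")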

4.10 Live Test on an Unmanned Ground Vehicle

The statistically scheduled implementation, which had the most efficient performance in Table 4.16, was implemented as a module that interfaced with an existing Unmanned Ground Vehicle (UGV) and Ground Control Station (GCS) platform. The complete implementation and technical specifications of this platform are given in Whitty et al. [88] and Guivant et al. [90]. The module was designed to read raw occupancy grid and local traversability data from a centralised round robin database (outlined by Guivant [91]), generate an exhaustive cost-to-go function, and push the result back into the database for further planning and path smoothing by a subsequent module. The module was run on the desktop machine outlined in Table 4.15, which acted as the remote GCS. The module was set to use a subregion size of 128 × 128 cells, as suggested by the performance of this flavour of the algorithm on this particular machine in Figures 4.24 and 4.25. Figure 4.26 shows an instantaneous cost-to-go function generated on a live occupancy grid during an outdoor campus experiment.

Figure 4.26: A © 2011 Google Maps overview of a subsection of campus alongside the generated cost-to-go function over a generated occupancy grid of the same region. Blue represents low cost cells, tending towards red for high cost cells.

4.11 Summary

This chapter provided a complete analysis of the initial implementation of the algorithm in terms of computational complexity and raw implementation execution time. Several evolutionary iterations of the implementation were presented, each improving on the performance of the last. The main contribution of this chapter to the field of robot motion planning is a grid-based approach to generating exhaustive cost-to-go functions in O(N) time, assuming enough cores are available to meet the parallel requirements at a given iteration. The approach provides mathematically identical results to other exhaustive dense grid-based approaches in less time. Therefore, it offers extra flexibility in configuration space size and granularity for a particular environment, or more flexibility in refresh rate for a given application. Given an exhaustive cost-to-go function of an environment, an optimal path can be extracted by using existing hill climbing methods. That is, for any given cell $c \in C_{free}$, an optimal path to the zero cost destination cell can be discovered by repeatedly traversing to the neighbouring cell $c' \in C_{free}$ with the lowest global cost.
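As a concrete illustration of this hill-climbing extraction step, the following is a minimal Python sketch (not the GPU implementation described in this chapter) that walks a completed cost-to-go grid from an arbitrary free cell down to the zero cost destination. The 2D list layout and the use of NaN to mark obstacle or undefined cells are assumptions made for the example only.

    import math

    def extract_path(cost, start):
        """Greedy hill-climbing descent over a completed cost-to-go grid.

        cost  -- 2D list of global cost values; obstacle or undefined cells
                 are marked with math.nan (an assumption for this sketch).
        start -- (row, col) of the cell to plan from.
        Returns the list of cells visited, ending at the zero cost destination.
        """
        rows, cols = len(cost), len(cost[0])
        path = [start]
        current = start
        while cost[current[0]][current[1]] > 0.0:
            r, c = current
            best, best_cost = None, cost[r][c]
            # Examine the eight neighbours and move to the lowest-cost one.
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    nr, nc = r + dr, c + dc
                    if (dr, dc) == (0, 0) or not (0 <= nr < rows and 0 <= nc < cols):
                        continue
                    v = cost[nr][nc]
                    if not math.isnan(v) and v < best_cost:
                        best, best_cost = (nr, nc), v
            if best is None:      # no strictly better neighbour: should not occur
                break             # on a correctly generated cost-to-go function
            current = best
            path.append(current)
        return path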

An additional benefit of implementing an exhaustive planner on graphics hardware is that one of the more computationally expensive modules executed on modern mobile robots is delegated away from the central processor, therefore freeing it for other tasks. So far this thesis has focussed on two-dimensional applications. The next chapter covers algorithm and implementation modifications to apply the proposed approach to three and higher dimensional problems, for possible applications in unmanned aerial vehicles and n degree-of-freedom articulated robots. While increasing the dimensionality of a configuration space has historically increased the computational complexity of a planner exponentially, the concurrent approach gains some of this added burden back by increasing the dimensionality of the wavefront of cells evaluated each iteration. That is, while a two-dimensional planner has a one-dimensional wavefront, a three-dimensional planner with the proposed approach has a two-dimensional surface wavefront.

Chapter 5

Extending to Three and Higher Dimensional Configuration Spaces

Previous chapters have defined a solution to robot motion planning for two-dimensional grid-based configuration spaces. Aerial vehicles benefit from motion planning in a three-dimensional configuration space while an n degree-of-freedom articulated robot requires an n dimensional configuration space representation. This chapter outlines how the two-dimensional concurrent algorithm can be mapped to a three-dimensional algorithm and includes implementation hurdles and bottlenecks, as well as theoretical analysis of the concurrent benefit of increasing the dimensionality of the problem. Elements of this chapter are based on work published in [92].

5.1 Problem Definition

For three-dimensional applications, a coordinate in the configuration space is designated by the standard nomenclature x, y, z, with the dimensions of the configuration space respectively being referred to as width, height and depth. The configuration space is broken into N equally sized cubes that tessellate the volume of the configuration space being represented. As with the two-dimensional definition, grid cells are assigned a local traversal cost value $c^*$, which relates to the metric being optimised for. Traversal between cells in $C_{free}$ is allowed in the horizontal, vertical and diagonal directions in

three dimensions. More formally, traversal between cells $c_{i,j,k}, c_{p,q,r} \in C_{free}$ is allowed if

$|i - p| \in \{0, 1\}$ and
$|j - q| \in \{0, 1\}$ and    (5.1.1)
$|k - r| \in \{0, 1\}$

for $i, j, k, p, q, r \in \mathbb{Z}^+$ and $c_{i,j,k} \neq c_{p,q,r}$. To plan an optimal path from the agent’s current location to a given destination cell, the destination cell $c_d \in C_{free}$ is assigned a cost-to-go value of zero. All other cells in $C_{free}$ are recursively assigned a cost-to-go value based on Equations 5.1.2 and 5.1.3.

$C^*_{i,j,k} = \min(C^*_t)$    (5.1.2)

Here $C^*_t$ is the set of costs of all neighbouring cells to the cell at i, j, k plus the cost of traversing from cell $c_{i,j,k}$ to cell $c_{i',j',k'} \in C^*_t$. The cost of traversal between cells $c_{i,j,k}$ and $c_{i',j',k'}$ is given by

$C^*_t = k \times \frac{c^*_{i,j,k} + c^*_{i',j',k'}}{2}$    (5.1.3)

where $k = 1$ for cells adjacent by a face of the grid’s volume, $k = \sqrt{2}$ for cells adjacent by an edge of the grid’s volume, and $k = \sqrt{3}$ for cells adjacent by a corner of the grid’s volume.

5.2 Algorithm Modification

The core mathematical definition of how individual cell values are determined is identical to the two-dimensional definition. The main difference between the two and three-dimensional algorithms is the number of cells observed during a single kernel’s execution. In addition to observing the eight neighbouring cells in the same plane as the cell being evaluated, the three-dimensional kernel also observes adjacent cells in the planes above and below the kernel’s cell. Algorithm 2 shows the

three-dimensional algorithm, which considers 26 neighbouring cells rather than 8 in the two-dimensional case.

Algorithm 2 Kernel pseudocode run on each cell in the three-dimensional configuration space.

    best_cost := undefined
    N := set of 26-way neighbours of current_cell
    for n ∈ N do
        if n.cost is defined then
            current_cost := n.cost + travel_cost(n, current_cell)
            if best_cost is undefined or current_cost is better than best_cost then
                best_cost := current_cost
            end if
        end if
    end for
    current_cell.cost := best_cost
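To make the kernel’s behaviour concrete, the following is a minimal serial Python re-statement of a single application of the 26-neighbour update from Algorithm 2, using the traversal cost of Equation 5.1.3. It is an illustrative CPU sketch only, not the GPU kernel used in the benchmarks; the nested-list layout and the use of None for undefined costs are assumptions of the sketch.

    import math

    def update_cell(cost, local_cost, i, j, k):
        """One application of the 26-neighbour kernel to cell (i, j, k).

        cost       -- 3D list of cost-to-go values; None marks undefined cells
                      (an assumption of this sketch).
        local_cost -- 3D list of local traversal costs c* for each cell.
        Returns the new cost-to-go value, or None if no neighbour is defined yet.
        """
        ni, nj, nk = len(cost), len(cost[0]), len(cost[0][0])
        best = None
        for di in (-1, 0, 1):
            for dj in (-1, 0, 1):
                for dk in (-1, 0, 1):
                    if (di, dj, dk) == (0, 0, 0):
                        continue
                    p, q, r = i + di, j + dj, k + dk
                    if not (0 <= p < ni and 0 <= q < nj and 0 <= r < nk):
                        continue
                    if cost[p][q][r] is None:
                        continue
                    # k factor from Equation 5.1.3: 1, sqrt(2) or sqrt(3) for
                    # face-, edge- or corner-adjacent neighbours respectively.
                    weight = math.sqrt(abs(di) + abs(dj) + abs(dk))
                    travel = weight * (local_cost[i][j][k] + local_cost[p][q][r]) / 2.0
                    candidate = cost[p][q][r] + travel
                    if best is None or candidate < best:
                        best = candidate
        return best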

5.3 Theoretical Benefit of Increasing the Dimensionality of the Configuration Space for Three Dimensions

At each iteration of the previously presented two-dimensional algorithm, a virtual wavefront abstraction was used to conceptualise the expansion of cells changing from being undefined to being evaluated with a global cost. In the two-dimensional example, the wavefront was a one-dimensional arrangement of cells. When expanding the configuration space to three dimensions, the wavefront becomes a two-dimensional surface, and as such, more cells are able to be evaluated at a given iteration relative to two-dimensional domains. For example, in the very first iteration of the two-dimensional algorithm, the eight neighbouring cells of the initial zero cost destination cell can be evaluated. In

Table 5.1: Comparison of the number of cells evaluated at a particular iteration for two-dimensional and three-dimensional domains.

iteration   2D_i                  Σ(0...i) 2D   3D_i                  Σ(0...i) 3D
0           1                     1             1                     1
1           8                     9             26                    27
2           16                    25            98                    125
...         ...                   ...           ...                   ...
i           (2i+1)² − (2i−1)²     (2i+1)²       (2i+1)³ − (2i−1)³     (2i+1)³

the equivalent first iteration of the three-dimensional algorithm the 26 neighbouring cells of the zero cost destination cell can be evaluated. In the second iteration the comparison grows to 5² − 3² = 16 against 5³ − 3³ = 98. Table 5.1 shows that the number of possible cell evaluations each iteration increases at a greater rate for three-dimensional configuration spaces as opposed to two-dimensional configuration spaces. It should be noted that these figures assume that the wavefront does not encounter any obstacle cells during any of the iterations and that the wavefront does not encounter the defined edge of the configuration space. For a configuration space comprising n×n×n = N cells, given a destination cell

at the centre of the configuration space and $C_{obst} = \emptyset$, the algorithm can evaluate n³ cells in O(n) time. This result is possible as the wavefront is able to expand towards each of the configuration space’s six faces concurrently. Conversely, given the same configuration space dimensions in $\mathbb{R}^3$, consisting of n × n × n cells, but with $C_{free}$ trapped in a single cell wide corridor, the algorithm would require in the order of O(n³) steps to exhaustively evaluate all cells in $C_{free}$, equivalent to the order of traditional algorithms. Three-dimensional configuration space representations of real-world environments are situated between these two extremes of obstacle layout. Particularly for three-dimensional configuration spaces, the ratio of the number of cells in $C_{free}$ relative to cells in $C_{obst}$ is quite large, which allows the algorithm’s implementation to perform closer to the order of O(n). The reader should note that at this stage the benefit of the three-dimensional

algorithm over the two-dimensional algorithm does not equate to the trend seen in Table 5.1, due to the three-dimensional kernel performing a greater number of cell observations each iteration. Removing a layer of abstraction, consider each kernel as performing a certain number of sub-iteration steps equalling the number of cells it observes. That is, the two-dimensional algorithm performs 8 sub-iteration steps whereas the three-dimensional algorithm performs 26 sub-iteration steps. Factoring this sub-iteration definition into the number of cells evaluated after a particular whole iteration, the number of cells evaluated after a certain number of sub-iteration steps is given by Equations 5.3.1 and 5.3.2 for the two-dimensional and three-dimensional cases, respectively.

$|E_{2D}| = \left(2 \times \left\lfloor \frac{j}{8} \right\rfloor + 1\right)^2$    (5.3.1)

$|E_{3D}| = \left(2 \times \left\lfloor \frac{j}{26} \right\rfloor + 1\right)^3$    (5.3.2)

Here j counts sub-iterations, where one sub-iteration is the unit of time required to observe and interpret one neighbouring cell. Hence one iteration of the two-dimensional algorithm consists of

eight sub-iterations. Generally, for any dimensionality $d \in \mathbb{Z}^+$, the relationship can be expressed as

$|E_{nD}| = \left(2 \times \left\lfloor \frac{j}{3^d - 1} \right\rfloor + 1\right)^d$.    (5.3.3)

Considering Equations 5.3.1 and 5.3.2, at the initial iterations of the proposed algorithms, the two-dimensional algorithm is able to evaluate more cells than its three-dimensional counterpart. This is because the wavefront has not yet expanded into enough cells to gain a concurrent benefit that overcomes the fact that the three-dimensional kernel performs over three times the number of sub-iteration steps of the two-dimensional kernel. However, as the algorithm progresses, the three-dimensional approach begins to perform more productively. Figure 5.1 shows the number of cells evaluated after a given number of sub-iterations. Again, the results shown here assume an ideal case where the wavefront does not encounter any obstacle cells and does not breach the edge of the defined configuration space.
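The growth described by Equation 5.3.3 can be tabulated directly. The sketch below is illustrative only (the obstacle-free, unbounded-wavefront assumptions stated above apply); it computes the number of cells evaluated after j sub-iterations for several dimensionalities, which reproduces the crossover behaviour plotted in Figure 5.1.

    def cells_evaluated(j: int, d: int) -> int:
        """Cells evaluated after j sub-iterations in d dimensions (Equation 5.3.3).

        Assumes an obstacle-free space and a wavefront that never reaches the
        configuration space boundary.
        """
        neighbours = 3 ** d - 1          # cells observed by one kernel execution
        iterations = j // neighbours     # whole wavefront iterations completed
        return (2 * iterations + 1) ** d

    if __name__ == "__main__":
        for j in (0, 100, 500, 2000, 6700):
            row = ", ".join(f"{d}D: {cells_evaluated(j, d):>12}" for d in (2, 3, 4))
            print(f"j = {j:>5} -> {row}")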

Figure 5.1: A comparison of the number of cells evaluated over the number of sub-iterations of the respective two-dimensional or three-dimensional kernel. The three-dimensional kernel is represented by blue marks, while the two-dimensional kernel is represented by red marks.

Theoretically, the two-dimensional and three-dimensional versions of the algorithm are able to evaluate obstacle free configuration spaces of n² and n³ cells, respectively, in the same order of magnitude of computational complexity. Setting n = 512, the two-dimensional version is able to evaluate in the order of 512² cells in approximately 2044 sub-iteration units of time. The three-dimensional version is able to evaluate in the order of 512³ cells in approximately 6643 sub-iteration units of time. Generally, the increased concurrent benefit gained from the increased parallelism of higher dimensional spaces partially counters the traditional rule that the computational complexity of generating a solution increases exponentially as the dimensionality of the space increases. A deeper discussion of this relationship is given in Section 5.7.3 for higher dimensional configuration spaces.

Figure 5.2: The method by which planes in a three-dimensional volume are arranged into a two-dimensional texture using a raster format.

In practice, this theoretical concurrent benefit is limited by the number of cores available in a processor. As the number of concurrent cell evaluations (Q) surpasses the number of available cores in the processor (P), cell evaluations queue up sequentially in groups of size P. However, unlike the physical limits being faced by single core processors, the number of cores available in a processor is likely to increase for the foreseeable future, according to Borkar [47].

5.4 Implementation Considerations and Limitations

A three-dimensional occupancy volume was mapped to a two-dimensional texture in OpenGL by slicing the volume into planes perpendicular to the z-axis and laying each plane in a raster pattern inside the texture. Figure 5.2 shows the raster layout, with the z-axis coordinates labelled. When the kernel is required to observe cells in the same plane as the cell being evaluated for, the two-dimensional coordinate lookup procedure is used. When

observing cells in the planes above and below the kernel’s output cell’s plane, a coordinate offset is added to the two-dimensional procedure to shift the observation to the correct plane. This offset is given by Algorithms 3 and 4 for the planes of lower and higher z value, respectively; a small coordinate-mapping sketch is also given after the symbol definitions below.

Algorithm 3 Coordinate offset calculation for observing cells in the plane of lower z value.

    O_x := −w
    O_y := 0
    if c_x + O_x < 0 then
        O_x := O_x + T_w
        O_y := O_y − h
    end if

Algorithm 4 Coordinate offset calculation for observing cells in the plane of higher z value.

    O_x := w
    O_y := 0
    if c_x + O_x ≥ T_w then
        O_x := O_x − T_w
        O_y := O_y + h
    end if

In Algorithms 3 and 4:

• O_x and O_y represent the offset coordinates required to observe the correct plane in the entire texture;

• w and h are the occupancy grid’s width and height in the x and y axes, respectively;

• c_x is the position of the cell the kernel is evaluating for in the texture’s coordinate space; and

• T_w is the width of the texture.
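The following Python sketch illustrates the slice-to-raster mapping of Figure 5.2 together with the plane-offset logic of Algorithms 3 and 4. It is an illustrative model of the indexing only (the real implementation performs these lookups inside the GPU kernel); the function names are hypothetical, and the texture width is assumed to be a whole multiple of the plane width.

    def texel_of(x, y, z, w, h, tex_w):
        """Map a volume coordinate (x, y, z) to 2D texture coordinates.

        The volume is sliced into xy-planes of size w x h, laid left-to-right
        in rows across a texture of width tex_w (tex_w assumed to be a
        multiple of w)."""
        planes_per_row = tex_w // w
        row, col = divmod(z, planes_per_row)
        return (col * w + x, row * h + y)

    def plane_offset(c_x, w, h, tex_w, direction):
        """Offset (O_x, O_y) to reach the same (x, y) in the plane below/above.

        direction = -1 follows Algorithm 3 (lower z value),
        direction = +1 follows Algorithm 4 (higher z value).
        c_x is the evaluated cell's x position in texture coordinates."""
        o_x, o_y = direction * w, 0
        if direction < 0 and c_x + o_x < 0:          # wrap to the end of the row above
            o_x, o_y = o_x + tex_w, o_y - h
        elif direction > 0 and c_x + o_x >= tex_w:   # wrap to the start of the row below
            o_x, o_y = o_x - tex_w, o_y + h
        return o_x, o_y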

One of the major limitations of mapping a three-dimensional configuration space to a two-dimensional texture buffer in OpenGL is the maximum allowed texture size. On the two graphics cards used for benchmarking the proposed algorithms in Sections 4.7 and 5.5, the maximum texture size was 4096 × 4096 texels. This equates to a maximum three-dimensional configuration space volume in the order of $\sqrt[3]{4096 \times 4096}$, or 256 texels, along each axis. One possible solution to this problem, depending on the configuration space, is to employ a hybrid metric-topological approach similar to that presented by Tomatis et al. [39]. Dense sub-volumes of an environment can be represented by an occupancy volume in separate textures, with each sub-volume’s relative position stored in a topological data structure. On the other hand, if the problem domain is highly dense, a pure metric approach may still be required. An alternative to OpenGL, known as the Open Computing Language (OpenCL), allows much more flexible use of global and local memory on modern graphics cards. For example, on a 2012 generation graphics card, a global memory buffer can be allocated up to 128MB [93]. As a rough calculation, assuming a cell’s cost-to-go value can be represented by a 32-bit single precision floating point data type, 128MB translates to 32 × 10⁶ cells. This could represent a three-dimensional occupancy volume in the order of n × n × n cells, where $n = \sqrt[3]{32 \times 10^6} \approx 317.48$. However, the current version of OpenCL that is implemented on modern graphics hardware is missing a crucial feature required to implement this algorithm. More information on implementing the algorithm in OpenCL is given as possible future work in Section 6.2.1.
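The two capacity bounds quoted above are straightforward to recompute. The following sketch is a back-of-the-envelope check only, using the 4096 × 4096 texel and 128 MB figures quoted in the text; it derives the maximum cubic volume side length under each constraint.

    def max_side_from_texture(tex_w: int = 4096, tex_h: int = 4096) -> float:
        """Largest cubic volume side length that fits in a tex_w x tex_h texture,
        with one texel per cell."""
        return (tex_w * tex_h) ** (1.0 / 3.0)

    def max_side_from_buffer(buffer_bytes: int = 128 * 10**6,
                             bytes_per_cell: int = 4) -> float:
        """Largest cubic volume side length whose cells fit in a single buffer,
        assuming 32-bit single precision cost values."""
        return (buffer_bytes / bytes_per_cell) ** (1.0 / 3.0)

    if __name__ == "__main__":
        print(f"OpenGL 4096x4096 texture: n = {max_side_from_texture():.0f}")   # ~256
        print(f"OpenCL 128 MB buffer:     n = {max_side_from_buffer():.2f}")    # ~317.48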

5.5 Experimental Results

This section provides benchmarks of the implemented vanilla three-dimensional algorithm over a range of occupancy volumes. As stated in the two-dimensional experimental results in Section 4.7, the implementation includes a small amount of initialisation and termination time. As the benchmarks recorded in this section use the same method of an external timing command run either side of the algorithm implementation, the same equation is used to estimate the real world (clock)

Table 5.2: Hardware specifications for the on-board machine used to benchmark algorithm implementations.

Property             ATI Radeon HD 2400 XT
Shader processors    40
Peak FLOPs           40 to 48 GigaFLOPs
Bus transfer         4 GB/s (PCIe x16)
Release date         28th June 2007
References           [84] and [85]

execution time. This equation is restated as Equation 5.5.1.

$T = \frac{E}{100} = \frac{s_1 + 100 \times A + s_2}{100}$    (5.5.1)

where:

• T is the estimated execution time of the core algorithm’s implementation,

• E is the recorded real world (clock) execution time surrounding the running of the implementation program,

• s1 is the initialisation time,

• s2 is the termination time,

• A is the true real world execution time of the core algorithm’s implementation.

Each experiment was run on the two graphics cards listed in Section 4.7. The characteristics of these graphics cards will also be repeated in Tables 5.2 and 5.3 for the convenience of the reader.

5.5.1 Empty Volume

The first three-dimensional test case is an empty volume that has dimensions of 64 × 64 × 64 cells. The centre of the volume is chosen as the zero cost destination

Table 5.3: Hardware specifications for the desktop machine used to benchmark algorithm implementations.

Property             NVIDIA GeForce GTX 480
Shader processors    448
Peak FLOPs           1344.96 GigaFLOPs
Bus transfer         8 GB/s (PCIe 2.0 x16)
Release date         29th March 2010
References           [86] and [87]

cell (viz. cell (31, 31, 31)). This test case was designed to allow the two-dimensional wavefront maximum expansion into the free cell volume. Figure 5.3 shows a raster representation of the final cost-to-go function overlaid on the original occupancy volume. Table 5.4 shows the estimated execution time to generate the exhaustive cost-to-go function shown in Figure 5.3. As a comparison, ordered linked list and min-heap priority queue implementations of exhaustive Dijkstra’s algorithm were run on a 2012 Intel CPU for this test case.

5.5.2 Rotating Planes

This test case is designed as the three-dimensional analogy to the Spiral Corridor test case in Section 4.7.1. Here the destination cell is chosen as a cell with a low z value, as is shown in Figure 5.4 by the blue areas at the top of the figure. Every second xy-plane along the z-axis is filled in as obstacle cells except for one cell’s width of free cells at the edge of the plane. As the z value increases, the edge on which this channel is situated rotates among the four compass directions in a spiral pattern. Again, Figure 5.4 shows the final cost-to-go function in raster layout with the lowest z value being the plane in the top-left corner of the figure. Table 5.5 shows the estimated execution time of the implementation generating a complete cost-to-go function over the occupancy grid. Of the three-dimensional

Figure 5.3: The final cost-to-go function over the empty volume test case. Blue represents low cost cells whereas red cells indicate high cost cells.

test cases outlined in this section, this is the only test case that benchmarked on the on-board machine with an execution time an order of magnitude worse than that required for real-time application.

5.5.3 Barker St Carpark - Single Source

Adjacent to the author’s lab building is a multi-storey carpark that contains split levels, multiple car ramps and four stairwells adjoining the main structure. This test case was designed to benchmark the three-dimensional algorithm on a real world occupancy volume. Figure 5.6 shows the final generated cost-to-go function over the occupancy volume. This test case has a single zero cost destination cell located just above the ground at the car entrance to the carpark, which can be seen by the blue area in the second plane in the top row (z = 1). Figure 5.5 shows a

Table 5.4: Execution time of the three-dimensional implementation on the Empty Volume test case for each test machine. CPU testing was performed on a 2012 Intel Core i7 2.6GHz processor. The † test case involved a linked list priority queue, while the ‡ test case involved a min-heap priority queue.

Machine                          Generation Time
On-board GPU                     1.0916s
Desktop GPU                      0.0630s
Exhaustive Dijkstra on CPU †     83.31s
Exhaustive Dijkstra on CPU ‡     45.7s

Table 5.5: Execution time of the three-dimensional implementation on the Rotating Planes test case for each of the benchmarking machines.

Machine      Generation Time
On-board     12.6192s
Desktop      0.3758s

Figure 5.4: The final cost-to-go function over the rotating planes test case. Blue represents low cost cells whereas red indicates high cost cells.

side view of the environment and is included here to help the reader interpret the occupancy volume. As the implementation is planning through any free cells, the reader should assume this could be applied to an unmanned aerial vehicle rather than a ground vehicle able to drive on all levels of the carpark. This can be seen by valid cost values being assigned to paths of cells travelling to the roof of the carpark from outside the carpark’s bounds. The slope of car ramps between levels is approximated to the resolution of the occupancy volume and the four stairwells contain approximated staircases. By observing the black obstacle cells in the lower z value planes, it can be seen that the carpark is built into the side of a slope. Aside from the ground floor and roof, the perimeter of each platform is fenced off to the height of the next platform above. Only stairwells have a lower fence connecting their landing to the platform the stairwell landing is level with. Table 5.6 gives the estimated execution

Figure 5.5: A side view of the Barker Street Carpark occupancy volume showing relevant features.

time of the three-dimensional algorithm’s implementation on this occupancy volume for each graphics hardware generation.

5.5.4 Barker St Carpark - All Exits

The final test case for this section uses the Barker Street carpark occupancy volume again, but benchmarks the algorithm’s implementation using multiple destination locations. On the z = 1 plane, which corresponds to the first cell unit of free space above the ground floor, three destination cells are defined: one matching the destination cell from the previous test case in Section 5.5.3 as the car entrance to the carpark; and one at the entrances to each of the stairwells on the right side of the representation. On the z = 6 plane, another destination cell is defined at the base of the stairwell located at the top of the representation. The last destination cell is defined on the roof of the carpark on the z = 51 plane and represents a rooftop

Figure 5.6: The final cost-to-go function over the single source Barker St carpark test case. Blue represents low cost cells whereas red indicates high cost cells.

exit via a pedestrian bridge to a street outside campus. The coordinates of the five destination cells are given in Table 5.7. Figure 5.7 shows the cost-to-go function generated for this occupancy volume for the five aforementioned destination cells. Table 5.8 shows the estimated execution time for the algorithm’s implementation running on the two benchmarking graphics cards for the multiple destination cell test case.

Table 5.6: Execution time of the three-dimensional implementation on the Single Source Barker Street Carpark test case for each benchmarking machine.

Machine      Generation Time
On-board     1.6540s
Desktop      0.0658s

Table 5.7: The five zero cost destination locations used in the All Exits Barker Street Carpark test case. The first four of these exits are located on the ground floor of the structure, with the fifth representing access to the road at the top of the slope via a foot bridge.

Description                       Coordinate
Car entrance                      (10, 31, 1)
Top-left stairwell (base)         (1, 15, 1)
Bottom-left stairwell (base)      (1, 38, 1)
Top-middle stairwell (base)       (39, 15, 6)
Rooftop bridge exit               (59, 39, 51)

Table 5.8: Execution time of the three-dimensional implementation on the All Exits Barker Street Carpark test case for each benchmarking machine.

Machine      Generation Time
On-board     1.4860s
Desktop      0.0501s

Figure 5.7: The final cost-to-go function over the multiple source Barker Street carpark test case. Blue represents low cost cells whereas red indicates high cost cells.

5.6 Case Study: A State Lattice Planner in (x, y, θ) Space

The state lattice planner is a grid or lattice-based planner that accounts for differentially constrained robot motion. Introduced by Pivtoraiko et al. in 2009 [94], it can be used to map out a lattice of possible positions and orientations that a robot can reach given kinematic constraints. Pivtoraiko et al. state that once a lattice has been pre-calculated for all possible motions within the lattice, a traditional planner such as A*-search can find an appropriate optimal path between two locations. This section discusses how the proposed concurrent algorithm can be modified to generate an exhaustive cost-to-go function over an area using the state lattice approach.

Table 5.9: The six motions of a differentially constrained robot relative to a lattice of discretised states.

ID    Motion              Traversal Cost
FL    forward left        π/2
FS    forward straight    1
FR    forward right       π/2
BL    back left           π/2
BS    back straight       1
BR    back right          π/2

An agent A is differentially constrained so that it is only able to perform the six motions given in Table 5.9. The cost of each of these motions is relative to the lattice spacing. A motion to the left or right is constrained to trace out a quarter circle with radius matching the lattice spacing and therefore such motions will change the state parameters of A by (±1, ±1, ±90°) for (x, y, θ) respectively. A w × h configuration space is mapped to a w × h × 4 grid, where each of the four w × h layers represents one of the four possible orientations given by the state lattice (viz. 0°, 90°, 180°, 270°). The kernel must also be modified to cater for both the constrained motion and the fact that the θ dimension can “wrap around”, as seen when rotating 90° from a state in the θ = 270° plane to the θ = 0° plane. An expansion of the modified kernel is given by Algorithm 5.

Given a fully free configuration space, the modified kernel runs in O(n) time in an n × n two-dimensional environment, given enough processors. The case study has shown that the proposed concurrent algorithm can be applied to motion planning models that are subject to kinematic constraints.

Algorithm 5 Kernel pseudocode with kinematic constraints run on each cell in the configuration space.

    Input: the current position (x0, y0, θ0), with θ0 ∈ {0, 1, 2, 3} → {0°, 90°, 180°, 270°}

    best_cost := undefined

    FL_x := x0 + 1 if θ0 ∈ {0°, 270°} else x0 − 1
    FL_y := y0 + 1 if θ0 ∈ {0°, 90°} else y0 − 1
    FL_θ := (θ0 + 1) mod 4
    if FL_(x,y,θ) ∈ C_free and C*(FL) is defined then
        current_cost := C*(FL) + (π/2) × (c*(x0, y0, θ0) + c*(FL_x, FL_y, FL_θ)) / 2
        best_cost := current_cost if best_cost is undefined or current_cost < best_cost
    end if

    FS_(x,y,θ) := (x0 + cos(θ0), y0 + sin(θ0), θ0)
    if FS_(x,y,θ) ∈ C_free and C*(FS) is defined then
        current_cost := C*(FS) + 1 × (c*(x0, y0, θ0) + c*(FS_x, FS_y, FS_θ)) / 2
        best_cost := current_cost if best_cost is undefined or current_cost < best_cost
    end if

    FR_x := x0 + 1 if θ0 ∈ {0°, 90°} else x0 − 1
    FR_y := y0 + 1 if θ0 ∈ {90°, 180°} else y0 − 1
    FR_θ := (θ0 − 1) mod 4
    if FR_(x,y,θ) ∈ C_free and C*(FR) is defined then
        current_cost := C*(FR) + (π/2) × (c*(x0, y0, θ0) + c*(FR_x, FR_y, FR_θ)) / 2
        best_cost := current_cost if best_cost is undefined or current_cost < best_cost
    end if

    ... (similarly for BL, BS and BR)

    current_cell_cost := best_cost
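As an illustration of the constrained successor generation in Algorithm 5, the following Python sketch enumerates the forward motion successors of a lattice state and their traversal weights from Table 5.9. It is a serial re-statement for clarity only, not the GPU kernel; the function name is hypothetical, and the backward motions (only outlined in the algorithm above) are deliberately omitted.

    import math

    # Orientation index -> (cos, sin) for 0, 90, 180 and 270 degrees.
    HEADING = {0: (1, 0), 1: (0, 1), 2: (-1, 0), 3: (0, -1)}

    def forward_successors(x0, y0, t0):
        """Forward successors (FL, FS, FR) of state (x0, y0, t0) and their
        traversal weights, following the update rules of Algorithm 5.

        t0 is the orientation index (0..3 for 0, 90, 180, 270 degrees).
        The backward motions BL, BS and BR are analogous and omitted here.
        """
        cos_t, sin_t = HEADING[t0]
        fl = ((x0 + 1 if t0 in (0, 3) else x0 - 1),
              (y0 + 1 if t0 in (0, 1) else y0 - 1),
              (t0 + 1) % 4)
        fs = (x0 + cos_t, y0 + sin_t, t0)
        fr = ((x0 + 1 if t0 in (0, 1) else x0 - 1),
              (y0 + 1 if t0 in (1, 2) else y0 - 1),
              (t0 - 1) % 4)
        # Quarter-circle turns cost pi/2; a straight move costs 1 (Table 5.9).
        return [("FL", fl, math.pi / 2), ("FS", fs, 1.0), ("FR", fr, math.pi / 2)]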

5.7 Higher Dimensional Applications

Just as the original two-dimensional algorithm was redesigned for three-dimensional applications, it can also be extended to higher dimensions. This section reintroduces the problem definition, algorithm and implementation method for dimensions greater than three.

5.7.1 Problem Definition

For configuration spaces in $\mathbb{R}^d$ for $d \in (\mathbb{Z}^+ \setminus \{1, 2, 3\})$, a d-dimensional Euclidean tessellation of equally sized hypercubes is used to represent a d-dimensional environment. Two grid cells $c_{x_1,x_2,\ldots,x_d}$ and $c_{y_1,y_2,\ldots,y_d}$ in such an occupancy hyper-volume are deemed to be adjacent if the following conditions are all met:

$|x_1 - y_1| \in \{0, 1\}$ and
$|x_2 - y_2| \in \{0, 1\}$ and
⋮    (5.7.1)
$|x_d - y_d| \in \{0, 1\}$

for $x_1, x_2, \ldots, x_d, y_1, y_2, \ldots, y_d \in \mathbb{Z}^+$ and $c_{x_1,x_2,\ldots,x_d} \neq c_{y_1,y_2,\ldots,y_d}$. The cost of traversal between two adjacent cells can be generalised to that given in Equation 5.7.2.

$C^*_t = k \times \frac{c^*_{x_1,x_2,\ldots,x_d} + c^*_{y_1,y_2,\ldots,y_d}}{2}$    (5.7.2)

where $k = \sqrt{|x_1 - y_1| + |x_2 - y_2| + \ldots + |x_d - y_d|}$. The kernel definition for an n-dimensional application remains the same, except that the number of neighbouring cells that are observed is given by Equation 5.7.3.

$|neighbours| = 3^d - 1$    (5.7.3)
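A small sketch makes the generalised adjacency rule concrete. The following Python fragment (illustrative only) enumerates the $3^d - 1$ neighbours of a cell in d dimensions and the traversal weight k of Equation 5.7.2 for each.

    import itertools
    import math

    def neighbours_with_weights(coord):
        """All 3^d - 1 adjacent offsets of a d-dimensional cell and their k weights.

        coord is a tuple of d integer coordinates. Each neighbour differs by
        -1, 0 or +1 in every axis (Equation 5.7.1); the weight is the square
        root of the number of axes that differ (Equation 5.7.2).
        """
        d = len(coord)
        result = []
        for offset in itertools.product((-1, 0, 1), repeat=d):
            if all(o == 0 for o in offset):
                continue  # skip the cell itself
            neighbour = tuple(c + o for c, o in zip(coord, offset))
            k = math.sqrt(sum(abs(o) for o in offset))
            result.append((neighbour, k))
        return result

    # For example, a 4-dimensional cell has 3**4 - 1 == 80 neighbours:
    assert len(neighbours_with_weights((0, 0, 0, 0))) == 80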

5.7.2 Cell Layout in Higher Dimensions

Just as the xy-planes in a three-dimensional occupancy volume can be mapped to a two-dimensional texture by placing planes in raster layout, so can higher dimensional applications recursively be placed in the same way. Taking the entire representation of a three-dimensional volume, a four-dimensional hyper-volume layout places each of these volume units in raster layout. Figure 5.8 shows a labeled layout of such a four-dimensional configuration. As the dimensionality increases further, this pattern continues and matches the Z-curve layout initially published by Morton [95].

Figure 5.8: The recursive raster-in-raster format by which cells from a four-dimensional occupancy hyper-volume are arranged into a two-dimensional texture buffer. Here it is assumed that z represents coordinates in the third dimension and w represents coordinates in the fourth dimension.
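To illustrate the recursive raster-in-raster arrangement, the sketch below maps a four-dimensional coordinate to a position in a flat two-dimensional texture, nesting the three-dimensional slicing used earlier inside an outer raster of volumes. The function names and tiling widths are hypothetical parameters of the example, not values taken from the implementation.

    def texel_3d(x, y, z, nx, ny, planes_per_row):
        """Place xy-planes of an nx x ny x nz volume in a raster of width
        planes_per_row planes (the 3D layout of Figure 5.2)."""
        row, col = divmod(z, planes_per_row)
        return (col * nx + x, row * ny + y)

    def texel_4d(x, y, z, w, nx, ny, nz, planes_per_row, volumes_per_row):
        """Nest the 3D layout inside an outer raster of volumes indexed by w
        (the raster-in-raster layout of Figure 5.8)."""
        # Size, in texels, of one fully laid-out 3D volume.
        vol_w = planes_per_row * nx
        vol_h = ((nz + planes_per_row - 1) // planes_per_row) * ny
        inner_x, inner_y = texel_3d(x, y, z, nx, ny, planes_per_row)
        vol_row, vol_col = divmod(w, volumes_per_row)
        return (vol_col * vol_w + inner_x, vol_row * vol_h + inner_y)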

5.7.3 Concurrent Benefit of Higher Dimensions

As the dimensionality of the configuration space increases, the proposed algorithm gains added potential for parallelism. In the two-dimensional approach presented in Chapters 3 and 4, a one-dimensional wavefront of cell evaluations and later active

subregions are allowed to progress each iteration. In the three-dimensional approach, the wavefront becomes a two-dimensional surface emanating away from the zero cost destination cell each iteration. Naturally, for a four-dimensional configuration space, a three-dimensional volume is allowed to emanate each iteration. As outlined in Section 5.3, the notion of sub-iterations is reintroduced to normalise the amount of work required by kernels for comparison of theoretical execution time between configuration spaces with differing dimensions. That is, as the dimensionality increases, the kernel must observe a larger number of neighbouring cells, as given by Equation 5.7.3. This increase in the number of neighbours observed is inherent in traditional algorithms as well, and although it is outside the scope of this thesis, future work could examine how to parallelise this multi-neighbour observation series within a kernel. Section 5.7.4 outlines how a six-dimensional version of the proposed algorithm could be applied to a six degree-of-freedom articulated robot. In this example, the possible angles of each joint are mapped to Euclidean space with 1° increments. While a proper analysis of this case study will follow, the example of a configuration space consisting of 360ᵈ cells for a d-dimensional configuration space will be used. Taking the general form of the number of cells evaluated after a particular sub-iteration given by Equation 5.3.3, Figure 5.9 shows a logarithmically scaled graph projecting this equation over a range of sub-iteration counts for three, four, five and six-dimensional configuration spaces of size 360ᵈ. This projection assumes the theoretical best case of a configuration space with $C_{obst} = \emptyset$. It should be noted at this stage that, although the higher dimensional versions of the algorithm take a larger number of sub-iteration steps to evaluate a given number of cells early on, they do eventually embrace a higher level of theoretical concurrency as the wavefront is allowed to spread out. It should also be noted that a planner exhaustively evaluating cells in a higher dimensional configuration space must also evaluate exponentially more cells than a lower dimensional planner. For reference, projections based on Equation 5.3.3 for dimensions three, four, five and six are given again in Figure 5.10, but normalised relative to the total number of cells the planner will eventually have to evaluate. Again a configuration space of

Figure 5.9: A logarithmically scaled comparison of the number of cells evaluated against the number of sub-iterations of each higher dimensional version of the algorithm. Each algorithm is operating over a completely free occupancy volume or hyper-volume with dimensions 360ᵈ, where d is the dimensionality of the configuration space. Note that the three-dimensional graph terminates execution at sub-iteration 4680 as it has

evaluated all 360³ cells in its corresponding configuration space in $\mathbb{R}^3$.

360ᵈ cells is assumed, with $C_{obst} = \emptyset$. It is apparent that a higher-dimensional planner would still require a greater execution time than lower-dimensional versions as the configuration space is potentially exponentially larger. However, not as apparent from Figure 5.10 is the increased concurrent benefit at higher dimensions. Table 5.10 shows the theoretical number of sub-iterations required to exhaustively evaluate a d-dimensional configuration space based on Equation 5.3.3. Of significance is the column headed by Cells per S-i, which is simply the total number of cells comprising the configuration space divided

Figure 5.10: A normalised, logarithmically scaled comparison of the number of cells evaluated against the number of executed sub-iterations worth of operations. Each algorithm is operating over a volume or hyper-volume with total cells equal to 360ᵈ, where d is the dimension of the configuration space. All values are normalised against the total number of cells required to exhaustively evaluate the entire configuration space, that is, 360ᵈ cells, where d is the dimensionality. Note that the three-dimensional graph terminates at sub-iteration 4680 as it has finished evaluating the entire volume of cells.

by the theoretical number of sub-iterations required to evaluate that space. In practice, however, this will always be slightly affected by the number of cores available to execute such a concurrent load as well as obstacle cell layouts within the represented environment.

Table 5.10: A breakdown of the theoretical number of sub-iterations required to exhaustively evaluate a 360ᵈ cell configuration space given $C_{obst} = \emptyset$.

d    Total Cells    Sub-iterations    Cells per S-i
2    360²           1440              9.00 × 10¹
3    360³           4680              9.97 × 10³
4    360⁴           14400             1.17 × 10⁶
5    360⁵           43560             1.39 × 10⁸
6    360⁶           131040            1.66 × 10¹⁰
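The entries of Table 5.10 follow directly from Equation 5.3.3: with the destination at the centre of an obstacle-free 360ᵈ space, the wavefront needs 180 iterations to reach the boundary, and each iteration costs $3^d - 1$ sub-iterations. The short sketch below is an illustrative check only, assuming this idealised setting, and reproduces the table.

    def table_5_10_row(d: int, n: int = 360):
        """Sub-iterations to exhaustively evaluate an n^d obstacle-free space
        with the destination at its centre, and the resulting cells per
        sub-iteration."""
        total_cells = n ** d
        iterations = n // 2                 # wavefront radius needed: 180 for n = 360
        sub_iterations = iterations * (3 ** d - 1)
        return total_cells, sub_iterations, total_cells / sub_iterations

    if __name__ == "__main__":
        for d in range(2, 7):
            cells, subs, per = table_5_10_row(d)
            print(f"d = {d}: {subs:>7} sub-iterations, {per:.3g} cells per sub-iteration")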

5.7.4 Case Study: A 6 D-o-F Articulated Robot

Three-dimensional Euclidean configuration spaces are easy to comprehend and visualise and are commonly applied to path planning for aerial or underwater vehicles. For higher dimensional configuration spaces, a technique first proposed by Gouzènes [96] and later visualised by Lozano-Pérez [97] demonstrates how the possible angles of each joint of an n degree-of-freedom articulated robot can be mapped to an axis in an n-dimensional Euclidean configuration space. Although it is difficult to illustrate how this technique applies to higher dimensional configuration spaces in the two-dimensional medium of a dissertation, Figures 5.11 and 5.12 show how the free and obstacle space of a two degree-of-freedom articulated robot can be mapped to a two-dimensional occupancy grid. The configuration space shown in Figure 5.12 was pre-computed¹ by placing each joint of the articulated robot in 360 different locations, each separated by 1°, then determining whether that state is feasible in terms of obstacle locations. For this two degree-of-freedom example, a total of 360² states were tested. The reader should realise that every non-shaded location in the Euclidean representation is reachable as this is a map of all possible states of the robot’s possible locations, not all possible

¹ This process can also be trivially mapped to a GPU implementation as each state can be tested for validity independently.

Figure 5.11: A two degree-of-freedom articulated robot. Shaded areas represent obstacles in the environment. Except where either piece A or piece B may collide with an obstacle, each joint is capable of 360° revolution in the two-dimensional environment.

cells in the original $C_{free}$ shown in Figure 5.11. Therefore, a path from any valid configuration can be planned to any other using the proposed algorithm applied to the Euclidean representation of an articulated robot’s configuration space. The test case of a six degree-of-freedom articulated robot was also examined in the same regard. A valid six-dimensional Euclidean occupancy hyper-volume was considered, which comprised 360⁶ ≈ 2.18 × 10¹⁵ cells. Assuming that each cell’s global cost is represented by a 4 byte single precision floating point data type, this equates to approximately 8.7 petabytes. This limitation² of current hardware in this context is discussed in the next section.

² It might seem ironic that the proposed algorithm also has limitations for higher-dimensional problems in practice. However, these limitations are related to hardware properties that, at least for the foreseeable future, are predicted to continue to increase in capability.
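The memory figure quoted above follows from a one-line calculation; the sketch below is an arithmetic check only, assuming 4 bytes per cell, and reproduces it for an arm with an arbitrary number of joints at 1° resolution.

    def cspace_bytes(joints: int, resolution_deg: float = 1.0,
                     bytes_per_cell: int = 4) -> float:
        """Memory required to hold a dense cost-to-go hyper-volume for an
        articulated robot, one cell per joint-angle combination."""
        cells_per_joint = int(360 / resolution_deg)
        return (cells_per_joint ** joints) * bytes_per_cell

    if __name__ == "__main__":
        size = cspace_bytes(6)
        print(f"6 D-o-F at 1 degree resolution: {size / 1e15:.1f} petabytes")  # ~8.7 PB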

Figure 5.12: Occupancy grid representation of the two degree-of-freedom robot shown in Figure 5.11 with angles of each joint represented as the two axes. The horizontal axis represents possible angles of the joint at the base of piece A, with the vertical axis representing possible angles of the joint at the connection of piece A and B, in the frame of reference of piece A. The red dot represents the location of the non-joint end of piece B relative to the Euclidean coordinate representation of the environment.

5.7.5 Implementation Limitations

An inherent property of configuration spaces is that, as the dimensionality increases, the number of cells used to represent that configuration space, while maintaining the same level of resolution, increases exponentially. This is true regardless of the planning approach used. The algorithm implementations given in this thesis all map a configuration space in any dimension to a two-dimensional texture. Therefore, the maximum number of cells that can be used to represent a configuration space relates directly to the maximum texture size available. For many modern graphics cards, including the two used for experimental benchmarking throughout this thesis, the

Table 5.11: Configuration space size given that all cells within that space must fit inside the bounds of a 4096 × 4096 texel texture buffer.

Dimensionality    Space Dimensions
2                 4096²
3                 256³
4                 64⁴
5                 27⁵
6                 16⁶
d                 $(\sqrt[d]{4096 \times 4096})^d$

maximum texture dimensions are 4096 × 4096 texels. Table 5.11 summarises the maximum square configuration space able to be stored within such a texture as the dimensionality of the space increases. Clearly the texture size limitation increasingly hinders the available coverage and resolution for higher-dimensional configuration spaces. One possible approach to counter this is to store the cost function in multiple textures and provide access to each texture for every kernel execution. While this would allow a larger configuration space to be stored, a longer-term implementation choice would be to switch from OpenGL to OpenCL, as suggested in Section 5.4. The Open Computing Language (OpenCL) provides a more C-like interface to graphics hardware and as such allows graphics memory to be allocated and accessed in a manner closer to a traditional C-style array. As an example of the possible memory block allocation size, a 2012 model NVIDIA GeForce GT 650M graphics card has a maximum allocatable buffer size in OpenCL of 128MB. Table 5.12 shows the maximum square configuration space dimensions again, but with the OpenCL buffer limit relating to the maximum number of cells. Although this does not appear to be an improvement over the OpenGL texture size, especially for higher dimensions, this limit was discovered on a graphics card with 1,024MB of RAM. The latest model of the NVIDIA Tesla personal super-

Table 5.12: Configuration space size given that 128MB can be used to store cell values. This table assumes a cell value can be represented using an IEEE 32-bit single precision floating point format.

Dimensionality    Space Dimensions
2                 5656²
3                 317³
4                 75⁴
5                 31⁵
6                 17⁶
d                 $(\sqrt[d]{128 \times 10^6 / 4})^d$

computer [49], a K40 GPU, contains 12GB of graphical memory. In addition, Schaa and Kaeli [98] in 2009 presented initial work on using multiple GPUs to solve larger problems.

5.8 Summary

This chapter has shown how the proposed two-dimensional algorithm can be extended to three and higher-dimensional configuration spaces. Algorithm and implementation modification steps are discussed when translating to higher dimensional domains, with the main differences relating to grid layout and kernel observation area. The three and higher-dimensional versions of the algorithm are independently compared to lower-dimensional versions using a fair sub-iteration metric of steps taken to evaluate a given number of cells. A proof of concept three-dimensional version of the algorithm was implemented and tested on a number of configuration spaces. Benchmarking results reveal a reasonable and favourable execution time for the on-board and desktop generations of graphics hardware, respectively. While higher dimensional applications are possible in theory, current GPU hardware capabilities hinder such implementations. A

shift from OpenGL to an OpenCL implementation is also suggested as a possible short-term improvement, but current shortcomings of the implemented OpenCL specification are discussed in more detail as future work in the next section.

Chapter 6

Conclusions and Future Work

This thesis presented foundational research in the field of concurrent dynamic programming for the emerging parallel architecture found in modern graphics hardware. Both theoretical analyses and practical implementation details were presented, not only to introduce the research but also to provide an immediate path to application for the potential reader. Chapter 2 presented a review of current literature in the generally disjointed fields of robot motion planning and GPGPU. The general consensus with many proposed path planning approaches was that discovering exact and accurate solutions was too computationally expensive for practical application. As such, a wide variety of random, probabilistic and hybrid planning approaches have been suggested which aim to find the perfect balance between the desired accuracy of a solution and the computational capabilities of the hardware they plan upon. Roboticists have also relied on the increasing capabilities of processor hardware to achieve greater planning capabilities over the last half century. However, as predicted more than a decade ago, the processing capabilities of single-core processors have begun to plateau as the fabrication density begins to reach physical limits. In response, scientists and engineers have been switching their attention to multi-core CPU and GPU architectures in an attempt to continue to gain performance benefits with new hardware. This thesis contributes to this transition for path planning and other optimisation problems. The remainder of this chapter restates the core objectives of this research and dis-

cusses each in terms of their contribution. The chapter then continues on to discuss two major avenues of further investigation. Over the course of the author’s candidature the use of the Open Graphics Library for non-graphical applications on GPUs has declined as a more applicable library has been designed and implemented by key industry manufacturers. The Open Computing Language (OpenCL) was developed within the last few years to be an architecture agnostic programming language for parallel, data heavy applications. That is, code written in OpenCL could be run on traditional single core CPUs, modern multi-core CPUs, modern GPUs as well as experimental architectures like Sony’s Cell Processor [48], without any modification to the code itself. Implementing the algorithms presented in this thesis in OpenCL was investigated, but a key shortcoming of the currently implemented specification hinders proper applicability for the time being. Further details on this investigation and how to proceed when this shortcoming is addressed in future versions of the specification are discussed in Section 6.2.1. The techniques presented in this thesis are applicable to areas of optimisation outside of robot motion planning. One such application, as discussed in Section 6.2.2, is protein folding simulations for medical research. According to Alberts et al. [99], protein folding is the process where a polypeptide folds into a specific configuration to perform a desired function in the body. When a protein fails to fold into its correct configuration it can lead to one of many diseases classified as proteopathic. For example, Walker and LeVine [100] postulate that malformed proteins in the brain contribute to Alzheimer’s disease, Parkinson’s disease and Huntington’s disease, to name a few. The field of research into simulating how proteins fold is driven by analysing the significantly large number of configurations a protein can fold into. From a mechanical point of view, a protein can be abstracted to behave like an n degree-of-freedom articulated robot. Section 6.2.2 discusses how the research presented in this thesis could be mapped to applications in protein folding analysis.

6.1 Summary of Contributions

The main contributions of this thesis are as follows:

• Foundational work is presented on solving grid-based optimisation problems using a concurrent approach. As opposed to existing methods, the proposed method is exact to the resolution of the grid chosen and does not employ random or probabilistic approaches. The algorithm is also proven to eventually give optimal results, even with non-uniform local traversal costs between grid cells.

• Various implementation flavours of the proposed algorithm are given, with performance characteristics for each in a two-dimensional configuration space. Each flavour is assessed on possible negative contributions to overall execution time and various evolutionary improvements are made to each subsequent flavour based on these analyses. It is shown that on modern consumer-grade graphics hardware, the most efficient flavour is able to exhaustively evaluate a densely populated occupancy grid the size of a 1000m × 500m campus with 1m × 1m resolution in sub-second time. It is also shown to be more than adequate for common indoor path planning requirements of autonomous vehicles and is able to plan from a single or multiple zero cost destination cells.

• Foundational work is presented on extending the algorithm to three- and higher-dimensional configuration spaces. A discussion is given on the differences in configuration space representation as well as kernel design. A proof-of-concept implementation is also presented that is able to evaluate contrived and sample outdoor occupancy volumes in sub-second time on modern graphics hardware.

• A complete analysis is given of the theoretical concurrent benefit this algorithm can achieve for higher-dimensional configuration spaces. In theory, given enough cores, this algorithm can exhaustively evaluate all n^d = N cells within a d-dimensional occupancy grid, volume, or hyper-volume, in the order of O(N) time. This assumes that the configuration space is largely free and there are no contrived snaking corridors.

6.2 Proposed Future Work

6.2.1 Implementation using Open Computing Language

The Open Computing Language (OpenCL) is a specification for an architecture-agnostic programming language. The basic premise behind OpenCL is that a programmer can write a streaming program in OpenCL once and be able to compile and run that program on any compute device. Such devices can range from single and multi-core CPUs to GPUs and, according to Singh [101], Field-Programmable Gate Arrays (FPGAs). Initial investigations were undertaken on the feasibility of implementing the proposed algorithm in OpenCL. One of the main requirements of each implementation flavour, from the expanding texture method onwards, is to be able to request that a subset of cells in an occupancy grid be evaluated. In OpenGL this could be achieved by rendering a rectangle over a sub-area of the entire occupancy volume. In OpenCL, kernel execution is requested via the clEnqueueNDRangeKernel function call. According to the OpenCL 1.0 specification [102], this function is missing a crucial argument that would allow such an offset to be specified. This shortcoming is covered in more detail in the next subsection.

Shortcomings of OpenCL 1.0

This section outlines the inability to request a specific offset for kernel executions over a loaded dataset in OpenCL. Take, for example, an 8×8 cell occupancy grid, as given in Figure 6.1. At a given iteration, if the algorithm was required to update only the cells in the bottom right 4×4 quadrant, an OpenGL implementation would request a GL_QUAD to be rendered at (4,4) with a size of (4,4). The OpenCL analogy to requesting cells be evaluated by a kernel is the clEnqueueNDRangeKernel function call. The two main arguments of this function that determine which cells are evaluated are global_work_size and global_work_offset. As expected, the global_work_size argument is analogous to the size of the sub-rectangle in the OpenGL case. From the OpenCL 1.0 specification:

Figure 6.1: A sample 8 × 8 cell occupancy grid with labels.

global_work_size: Points to an array of work_dim unsigned values that describe the number of global work-items in work_dim dimensions that will execute the kernel function...

For the current example, to request the bottom right quadrant of the 8 × 8 cell occupancy grid, this argument would be submitted as [4, 4]. As for the position of this subregion, one should logically be able to submit a similar array of [4, 4]. However, the currently implemented OpenCL specification does not allow custom positioning. Specifically, from the OpenCL 1.0 specification:

global_work_offset: Must currently be a NULL value. In a future version of OpenCL, global_work_offset can be used to specify an array of work_dim unsigned values that describe the offset used to calculate the global ID of a work-item instead of having the global IDs always start at offset (0, 0, ... 0).

This year, the OpenCL 2.0 specification [103] was ratified. Within this specification, the description of the global_work_offset parameter has been updated to the following:

global_work_offset: ...can be used to specify an array of work_dim unsigned values that describe the offset used to calculate the global ID of a work-item. If global_work_offset is NULL, the global IDs start at offset (0, 0, ... 0).

Although this now allows a sub-area over which to execute a group of kernels to be properly described, hardware vendors' implementations of this version of the specification have not yet reached consumers.
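Once OpenCL 2.0 capable drivers do become available, the sub-area request described above reduces to a single host-side call. The following fragment is a minimal sketch of how the bottom right quadrant of the 8 × 8 grid could be requested; the command queue and kernel are assumed to have been created elsewhere, and the function name is illustrative only.

#include <CL/cl.h>

/* Request evaluation of only the bottom right 4x4 quadrant of an 8x8 grid.
 * Under OpenCL 1.0 the offset argument must be NULL; under OpenCL 2.0 it can
 * carry the (4, 4) origin of the sub-rectangle. */
cl_int enqueue_bottom_right_quadrant(cl_command_queue queue, cl_kernel kernel)
{
    const size_t offset[2] = {4, 4};   /* origin of the sub-rectangle */
    const size_t size[2]   = {4, 4};   /* extent of the sub-rectangle */

    return clEnqueueNDRangeKernel(queue, kernel,
                                  2,       /* work_dim                          */
                                  offset,  /* global_work_offset (NULL in 1.0)  */
                                  size,    /* global_work_size                  */
                                  NULL,    /* local_work_size chosen by runtime */
                                  0, NULL, NULL);
}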

Emulating Sub-Area Rendering in OpenCL 1.0

The standard OpenCL application requests that a kernel be executed on an entire buffer. As stated in Section 6.2.1, OpenCL 1.0 only allows configuration of the number of cells to be evaluated and fixes the offset to the origin. An attempt was made to investigate whether an offset could be emulated, and if so, whether it performed efficiently. An experimental log given in Appendix B.1 outlines a successful attempt at emulating the rasterisation stage of the graphics pipeline, that is, requesting a particular sub-area of a two-dimensional occupancy grid. First, the experiment showed that a one-dimensional offset value could be sent to all kernels to have a specific subset of adjacent cells evaluated. The technique was then extended to two dimensions by requesting a rectangle for execution as a series of individually requested 1D lines within that rectangle's area. The experiment proved that a two-dimensional offset could be emulated. A second experiment was conducted to determine whether using multiple one-dimensional slices was an efficient method of evaluating a two-dimensional texture, compared with simply requesting that two-dimensional area in one request. Since a true offset could not be applied in the two-dimensional case, an offset of (0, 0) was used for the comparison. The time to request execution of cells and the time taken to read cell values back from GPU to CPU memory were recorded for each case. The experiment showed that the one-dimensional slice emulation mentioned above was 2.5 to 3.5 times slower during code execution and 7.5 to 8.5 times slower at reading values back, compared to simply requesting the entire two-dimensional area of cells with one function call. The complete log of this experiment is available in Appendix B.2.
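For reference, the host-side structure of the slice-based emulation can be sketched as follows, assuming the offset kernel listed in Appendix B.1 (whose third argument is the flat offset) and a command queue and kernel created elsewhere; the function name and parameters are illustrative only.

#include <CL/cl.h>

/* Emulate a 2D sub-rectangle request under OpenCL 1.0 by enqueueing one 1D
 * line of work-items per row.  Each line is given a flat offset argument so
 * that the kernel writes into the correct cells of the de-rastered grid. */
cl_int enqueue_sub_rectangle(cl_command_queue queue, cl_kernel kernel,
                             cl_uint grid_width,
                             cl_uint x0, cl_uint y0, cl_uint w, cl_uint h)
{
    cl_int err = CL_SUCCESS;
    for (cl_uint row = y0; row < y0 + h && err == CL_SUCCESS; ++row) {
        cl_uint offset = row * grid_width + x0;   /* flat 1D offset of this line */
        size_t  line   = w;                       /* number of cells in the line */

        err = clSetKernelArg(kernel, 2, sizeof(cl_uint), &offset);
        if (err == CL_SUCCESS)
            err = clEnqueueNDRangeKernel(queue, kernel, 1,
                                         NULL,    /* offset fixed to 0 in 1.0 */
                                         &line, NULL, 0, NULL, NULL);
    }
    return err;
}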

It was therefore determined that implementing the proposed algorithm in OpenCL was not feasible given the currently implemented specification, and that such an implementation is best left for future work when OpenCL 2.0 specification capable hardware reaches the market. The next section introduces a field of optimisation outside of robot motion planning that could benefit from the proposed research.

6.2.2 Applications in Protein Folding

As stated above, the mechanics behind protein folding can be abstracted to the same model as an articulated robot. That is, from a molecular mechanics point of view, single atoms can be abstracted to joints, while bonded interactions between two atoms in proximity can be abstracted to either a rigid interval or a spring. Simulating protein folding is of great significance to the medical community, as many common diseases and illnesses are thought to be caused by mis-folded proteins. The simulation process involves taking a protein out of its native healthy folded state and reconfiguring it in another random state. Then, using known molecular models, a simulation is performed of how the protein naturally attempts to fold back into its native desired state. The mechanism of folding can be mapped to a motion planning mechanism for an n degree-of-freedom protein. Stefani and Dobson [104] presented work in 2003 that discussed new insights into the classification of amyloids, which are an accumulation of misfolded proteins. The accumulation of such amyloids is a symptom of a number of neurological diseases such as Alzheimer's and Parkinson's disease. Dobson [105] later proposed that protein misfolding could also cause Type-II diabetes. Pande et al. [106] showed how a massive distributed network of the world's computers can process a large number of protein folding calculations over time. Their system, called Folding@home, makes use of users' idle CPU cycles to perform protein folding simulations. Elsen et al. [107] proposed a GPU-based N-body approach which can benefit protein folding simulation when spring forces between atoms are required. While reviewing this technique, Owens et al. [108] note that an O(N^2) approach on the GPU is observed to run faster than an O(N) approach on the CPU. This result alone should arouse the curiosity of a modern roboticist to explore

the true practical benefits of implementing solutions on highly parallel architectures.
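To make the abstraction concrete, one possible and purely illustrative mapping is to treat each torsion angle of the chain as one dimension of the configuration space and to discretise it into a fixed number of bins, yielding exactly the kind of n^d cell hyper-volume the proposed algorithm evaluates. The bin count and degrees of freedom below are assumptions chosen for illustration only, not a validated molecular model.

#include <math.h>
#include <stdio.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* Illustrative mapping of an articulated (protein-like) chain onto a
 * configuration hyper-volume: D torsion angles, each discretised into
 * BINS cells, gives a BINS^D cell grid. */
#define D    4     /* degrees of freedom (torsion angles), assumed */
#define BINS 36    /* 10-degree resolution per angle, assumed      */

/* Convert a vector of torsion angles (radians) into a flat cell index. */
static long cell_index(const double angle[D])
{
    long index = 0;
    for (int i = 0; i < D; ++i) {
        /* wrap the angle into [0, 2*pi) before binning */
        double wrapped = fmod(fmod(angle[i], 2.0 * M_PI) + 2.0 * M_PI, 2.0 * M_PI);
        long bin = (long)(wrapped / (2.0 * M_PI / BINS));
        index = index * BINS + bin;   /* row-major flattening */
    }
    return index;
}

int main(void)
{
    const double sample[D] = {0.1, 1.7, 3.0, 5.9};
    printf("cells in hyper-volume: %.0f\n", pow(BINS, D));
    printf("sample configuration maps to cell %ld\n", cell_index(sample));
    return 0;
}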

6.3 Final Remarks

Path planning is a crucial component of autonomous robotics, and it is a pity that such a deterministic task is still too computationally expensive for many practical applications. By embracing the recent paradigm shift to parallel processor architectures, this thesis hopes to contribute another level of feasibility to the possibilities of exact motion planners. While parallel architectures are expected to continue to increase in speed and memory capability for the foreseeable future, there may come a time when new physical limits begin to hinder further progress. Due to its determinism, path planning deserves to be a solved problem from an application point of view. It is hoped that the next developments in the field of motion planning (and robotics in general) are towards dedicated hardware for certain expensive core tasks. The robots of tomorrow may be composed of dedicated processors for scan matching, vision processing and map building, with a central processor simply responsible for directing data traffic and commands between each dedicated expert processor. In just a few decades, an exhaustive and exact path planning module may involve an arrangement of billions of molecules settling into a solution state in nanoseconds...

Appendix A

Derivations

A.1 Derivation of the relationship between |E| and |V| in a C_obst = ∅ two-dimensional occupancy grid

For an n × n occupancy grid with N cells and C_obst = ∅, the following derivation shows the relationship between |E|, the number of edges, and |V|, the number of vertices, or cells, in the grid. Each cell is 8-way connected to adjacent cells. Directions are denoted by the compass directions.

\begin{align*}
|E_{N \leftrightarrow S}| &= n(n-1) \\
|E_{E \leftrightarrow W}| &= n(n-1) \\
|E_{NW \leftrightarrow SE}| &= (n-1)(n-1) \\
|E_{NE \leftrightarrow SW}| &= (n-1)(n-1) \\
\Sigma|E_{*}| &= 2 \times n(n-1) + 2 \times (n-1)(n-1) \\
&= 4n^2 - 6n + 2 \\
&= 4N - 6\sqrt{N} + 2
\end{align*}
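As a quick sanity check (an illustration added here, not part of the original derivation), the closed form can be compared against a direct count of the 8-way edges in an obstacle-free n × n grid; the short sketch below does so for a few small values of n.

#include <stdio.h>

/* Brute-force count of undirected 8-way edges in an n x n grid with no
 * obstacles, compared against the closed form 4n^2 - 6n + 2.  Each edge is
 * counted once by only looking "forward" (E, S, SE, SW) from every cell. */
static long count_edges(long n)
{
    long edges = 0;
    for (long y = 0; y < n; ++y) {
        for (long x = 0; x < n; ++x) {
            if (x + 1 < n)               ++edges;   /* E  */
            if (y + 1 < n)               ++edges;   /* S  */
            if (x + 1 < n && y + 1 < n)  ++edges;   /* SE */
            if (x - 1 >= 0 && y + 1 < n) ++edges;   /* SW */
        }
    }
    return edges;
}

int main(void)
{
    for (long n = 2; n <= 5; ++n)
        printf("n=%ld  counted=%ld  closed form=%ld\n",
               n, count_edges(n), 4 * n * n - 6 * n + 2);
    return 0;
}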

A.2 Derivation of the relationship between |E| and |V| in a C_obst = ∅ three-dimensional occupancy volume

For an n × n × n occupancy volume with N cells and C_obst = ∅, the following derivation shows the relationship between |E|, the number of edges, and |V|, the number of vertices, or cells, in the volume. Each cell is 26-way connected to adjacent cells. Directions on the same 2D plane are denoted with compass directions, with up (U) and down (D) denoting a change in planes.

\begin{align*}
|E_{N \leftrightarrow S}| &= n^2(n-1) \\
|E_{E \leftrightarrow W}| &= n^2(n-1) \\
|E_{U \leftrightarrow D}| &= n^2(n-1) \\
|E_{NW \leftrightarrow SE}| &= n(n-1)^2 \\
|E_{NE \leftrightarrow SW}| &= n(n-1)^2 \\
|E_{UN \leftrightarrow DS}| &= n(n-1)^2 \\
|E_{US \leftrightarrow DN}| &= n(n-1)^2 \\
|E_{UE \leftrightarrow DW}| &= n(n-1)^2 \\
|E_{UW \leftrightarrow DE}| &= n(n-1)^2 \\
|E_{UNW \leftrightarrow DSE}| &= (n-1)^3 \\
|E_{UNE \leftrightarrow DSW}| &= (n-1)^3 \\
|E_{DNW \leftrightarrow USE}| &= (n-1)^3 \\
|E_{DNE \leftrightarrow USW}| &= (n-1)^3 \\
\Sigma|E_{*}| &= 3 \times n^2(n-1) + 6 \times n(n-1)^2 + 4 \times (n-1)^3 \\
&= 13n^3 - 27n^2 + 18n - 4 \\
&= 13N - 27\sqrt[3]{N^2} + 18\sqrt[3]{N} - 4
\end{align*}

A.3 Derivation of the relationship between |E| and |V| in a C_obst = ∅ d-dimensional occupancy hyper-volume

For an n^d occupancy hyper-volume with N cells and C_obst = ∅, the following derivation shows the relationship between |E|, the number of edges, and |V|, the number of vertices, or cells, in the hyper-volume. Each cell is (3^d − 1)-way connected to adjacent cells.

\begin{align*}
\Sigma|E_{*}| &= \frac{1}{2} \sum_{i=0}^{d-1} a(i, d) \times b(i, d) \\
a(i, d) &= \frac{d!}{(d-i)!\, i!} \times 2^{d-i} \\
b(i, d) &= n^i (n-1)^{d-i}
\end{align*}

a(i, d) is the number of i-cubes that exist on a d-dimensional cube.¹

¹ Referring to the number of 2-cubes that exist on a 3-dimensional cube (i = 2, d = 3) is referring to the number of faces on a cube. Likewise, counting the number of 1-cubes that exist on a 2-dimensional cube is counting the number of edges on a square.
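The closed form can also be evaluated numerically. The sketch below (an illustration added here, with hypothetical function names) implements a(i, d) and b(i, d) directly and reproduces the totals derived in the worked examples that follow.

#include <stdio.h>

/* Evaluate Sigma|E| = (1/2) * sum_{i=0}^{d-1} a(i,d) * b(i,d) for an
 * obstacle-free n^d occupancy hyper-volume, using
 *   a(i,d) = d! / ((d-i)! i!) * 2^(d-i)   and   b(i,d) = n^i (n-1)^(d-i). */
static double factorial(int k)
{
    double f = 1.0;
    for (int j = 2; j <= k; ++j) f *= j;
    return f;
}

static double ipow(double base, int exp)
{
    double p = 1.0;
    for (int j = 0; j < exp; ++j) p *= base;
    return p;
}

static double edge_count(double n, int d)
{
    double sum = 0.0;
    for (int i = 0; i < d; ++i) {
        double a = factorial(d) / (factorial(d - i) * factorial(i)) * ipow(2.0, d - i);
        double b = ipow(n, i) * ipow(n - 1.0, d - i);
        sum += a * b;
    }
    return 0.5 * sum;
}

int main(void)
{
    /* Should agree with 4n^2 - 6n + 2, 13n^3 - 27n^2 + 18n - 4, and the
       d = 6 polynomial derived in the examples below. */
    printf("d=2, n=8:  %.0f edges\n", edge_count(8.0, 2));
    printf("d=3, n=8:  %.0f edges\n", edge_count(8.0, 3));
    printf("d=6, n=8:  %.0f edges\n", edge_count(8.0, 6));
    return 0;
}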

A.3.1 Example for d = 2

\begin{align*}
\Sigma|E_{*}| &= \frac{1}{2} \sum_{i=0}^{1} a(i, 2) \times b(i, 2) \\
&= \frac{1}{2} \left( a(0, 2) \times b(0, 2) + a(1, 2) \times b(1, 2) \right) \\
a(0, 2) &= \frac{2!}{(2-0)!\, 0!} \times 2^{2-0} = 4 \\
a(1, 2) &= \frac{2!}{(2-1)!\, 1!} \times 2^{2-1} = 4 \\
\therefore \Sigma|E_{*}| &= \frac{1}{2} \left( 4\, n^0(n-1)^{2-0} + 4\, n^1(n-1)^{2-1} \right) \\
&= 2(n-1)^2 + 2n(n-1) \\
&= 4n^2 - 6n + 2
\end{align*}

A.3.2 Example for d = 3

\begin{align*}
\Sigma|E_{*}| &= \frac{1}{2} \sum_{i=0}^{2} a(i, 3) \times b(i, 3) \\
&= \frac{1}{2} \left( a(0, 3) \times b(0, 3) + a(1, 3) \times b(1, 3) + a(2, 3) \times b(2, 3) \right) \\
a(0, 3) &= \frac{3!}{(3-0)!\, 0!} \times 2^{3-0} = 8 \\
a(1, 3) &= \frac{3!}{(3-1)!\, 1!} \times 2^{3-1} = 12 \\
a(2, 3) &= \frac{3!}{(3-2)!\, 2!} \times 2^{3-2} = 6 \\
\therefore \Sigma|E_{*}| &= \frac{1}{2} \left( 8\, n^0(n-1)^{3-0} + 12\, n^1(n-1)^{3-1} + 6\, n^2(n-1)^{3-2} \right) \\
&= 4(n-1)^3 + 6n(n-1)^2 + 3n^2(n-1) \\
&= 13n^3 - 27n^2 + 18n - 4
\end{align*}

A.3.3 Example for d = 6

\begin{align*}
\Sigma|E_{*}| &= \frac{1}{2} \sum_{i=0}^{5} a(i, 6) \times b(i, 6) \\
&= \frac{1}{2} \left( a(0, 6) b(0, 6) + a(1, 6) b(1, 6) + a(2, 6) b(2, 6) + a(3, 6) b(3, 6) + a(4, 6) b(4, 6) + a(5, 6) b(5, 6) \right) \\
a(0, 6) &= \frac{6!}{(6-0)!\, 0!} \times 2^{6-0} = 64 \\
a(1, 6) &= \frac{6!}{(6-1)!\, 1!} \times 2^{6-1} = 192 \\
a(2, 6) &= \frac{6!}{(6-2)!\, 2!} \times 2^{6-2} = 240 \\
a(3, 6) &= \frac{6!}{(6-3)!\, 3!} \times 2^{6-3} = 160 \\
a(4, 6) &= \frac{6!}{(6-4)!\, 4!} \times 2^{6-4} = 60 \\
a(5, 6) &= \frac{6!}{(6-5)!\, 5!} \times 2^{6-5} = 12 \\
\therefore \Sigma|E_{*}| &= \frac{1}{2} \left( 64\, n^0(n-1)^{6-0} + 192\, n^1(n-1)^{6-1} + 240\, n^2(n-1)^{6-2} + 160\, n^3(n-1)^{6-3} + 60\, n^4(n-1)^{6-4} + 12\, n^5(n-1)^{6-5} \right) \\
&= 32(n-1)^6 + 96n(n-1)^5 + 120n^2(n-1)^4 + 80n^3(n-1)^3 + 30n^4(n-1)^2 + 6n^5(n-1) \\
&= 364n^6 - 1458n^5 + 2430n^4 - 2160n^3 + 1080n^2 - 288n + 32
\end{align*}

Appendix B

Relevant Experimental Logs

This appendix documents experimental logs of areas of research that relate to future work presented in this thesis. These experiments mainly involve attempting work-arounds for a particular shortcoming in the current OpenCL 1.0 specification [102], while hardware vendors implement a fix for this shortcoming with the recently released OpenCL 2.0 specification [103].

B.1 Emulating the rasterisation step of the graphics pipeline using OpenCL

Aim

To see whether it is possible to emulate the rasterisation step in the graphics pipeline to request a certain sub-area of an occupancy grid to be evaluated in OpenCL.

Method

I am going to attempt to simulate the rasterisation step on the CPU by requesting certain sub-lines of kernels to be evaluated. To request a sub-line, an offset is sent to each kernel relating to the offset in a 1D de-rastered representation of a 2D occupancy grid. If I want 4 cells to be evaluated in a row, I will request the first 4 cells to be executed (since I can't specify an offset when enqueueing kernels); each kernel then knows which cell values to actually evaluate by observing the offset. By requesting each line in a rectangle, a 2D sub-rectangle of the occupancy grid could potentially be requested.

Apparatus

The following kernel will be used:

__kernel void offset(__global float * output,
                     const unsigned int count,
                     const unsigned int offset)
{
    /* Each work-item writes a recognisable value (its local index + 1)
       into the cell shifted by the emulated offset. */
    int i = get_global_id(0);
    output[offset+i] = i+1;
}

CPU: Intel Core i7 2.6 GHz, 8 GB 1600 MHz DDR3
GPU: NVIDIA GeForce GT 650M, 1024 MB

Experimental Log

Set up a 1D memory object with 64 cells and called the kernel with "offset" = 6 and a global_size of 6.

Ran the program and printed the output of the buffer after execution:

0 0 0 0 0 0 1 2
3 4 5 6 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0

The above result is as expected.

Modified the kernel line that used to look like this:

output[offset+i] = i+1;

to be this:

output[offset+i] = offset+i;

Re-ran, this time emulating 4 kernel enqueues to cover the bottom right quadrant of the 8x8 occupancy grid.

Printed result from execution:

0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 36 37 38 39
0 0 0 0 44 45 46 47
0 0 0 0 52 53 54 55
0 0 0 0 60 61 62 63

As expected.

Results

It looks like sub-area rasterisation can be emulated on the CPU so that OpenCL kernels execute on only a subset of a dataset.

B.2 Comparing rasterising single lines on the CPU against rendering 2D areas on the GPU

Aim

To compare the performance of requesting a 2D region to be evaluated in OpenCL by either:

1. rendering as 1D strips from the CPU, or

2. rendering a 2D area as part of a single OpenCL kernel enqueue.

Method

The 1D strip method is just a clone of the offset testing experiment (Section B.1). The 2D method will just enqueue a 2D kernel to execute over the entire memory object.

Apparatus

The following kernel was used for the 1D case:

__kernel void offset(__global float * output,
                     const unsigned int count,
                     const unsigned int offset)
{
    int i = get_global_id(0);
    output[offset+i] = offset+i;
}

The following kernel was used for the 2D case:

__kernel void normal_2d(__global float * output,
                        const unsigned int width,
                        const unsigned int height)
{
    int x = get_global_id(0);
    int y = get_global_id(1);
    if (x < width && y < height) {
        output[x*width+y] = x * width + y;
    }
}

CPU: Intel Core i7 2.6 GHz, 8 GB 1600 MHz DDR3
GPU: NVIDIA GeForce GT 650M, 1024 MB

Experiment Log

Placed C-based timers around each execution of the code, like so:

start_time = clock();
... code that enqueues kernel ...
... code that reads result back ...
total_time = (double)(clock() - start_time) / CLOCKS_PER_SEC;

Placed an N iteration loop around the execute and read code in an attempt to gain a better average run time of the code blocks being timed. Ran 1D kernel over an 8x8 grid memory object, with N=100. Ran this two more times. Ran 2D kernel over same 8x8 memory object, with N=100. Ran this two more times. Ran 1D with N=1000. Ran this two more times. Ran 2D with N=1000. Ran this two more times.

Results

Recorded times are in seconds.

For N = 100

1D: exec(0.041065) read(0.125484)
    exec(0.043492) read(0.130664)
    exec(0.042514) read(0.126741)
avg: exec(0.042357) read(0.127630)
per: exec(0.42357ms) read(1.2763ms)

2D: exec(0.012231) read(0.016563)
    exec(0.011161) read(0.016540)
    exec(0.011909) read(0.016933)
avg: exec(0.011767) read(0.016679)
per: exec(0.11767ms) read(0.16679ms)

1D/2D: exec(3.59964) read(7.65214)

For N = 1000

1D: exec(0.292396) read(1.135410)
    exec(0.289736) read(1.145670)
    exec(0.289563) read(1.159130)
avg: exec(0.290565) read(1.146737)
per: exec(2.90565ms) read(11.467370ms)

2D: exec(0.107811) read(0.132492)
    exec(0.113155) read(0.139689)
    exec(0.107482) read(0.131591)
avg: exec(0.109483) read(0.134591)
per: exec(1.09483ms) read(1.34591ms)

1D/2D: exec(2.65397) read(8.52016)

Conclusion

1D with slices is slower than 2D: 2.5 to 3.5 times slower for executing the kernel(s), and 7.5 to 8.5 times slower for the read-back operation. Requesting 2D areas is more efficient, but a custom subset cannot be requested for execution.

Bibliography

[1] Dave Shreiner, Mason Woo, Jackie Neider, and Tom Davis. OpenGL Program- ming Guide. Addison-Wesley — Pearson Education, Inc., One Lake Street, Upper Saddle River, NJ, 07458, U.S.A., fifth edition, 2006.

[2] John Nickolls, Ian Buck, Michael Garland, and Kevin Skadron. Scalable par- allel computing with CUDA. Queue, 6(2):40–53, 2008.

[3] George A. Bekey, Robert Ambrose, Vijay Kumar, David Lavery, Arthur Sanderson, Brian Wilcox, Junku Yuh, and Yuan Zheng. Robotics: state of the art and future challenges. Imperial College Press, 2008.

[4] Steven LaValle and James J Kuffner. Randomized kinodynamic planning. The International Journal of Robotics Research, 20(5):378–400, 2001.

[5] Lydia E. Kavraki, Petr Svestka, J-C Latombe, and Mark H Overmars. Probabilistic roadmaps for path planning in high-dimensional configuration spaces. IEEE Transactions on Robotics and Automation, 12(4):566–580, 1996.

[6] R. R. Schaller. Moore’s law: past, present and future. IEEE Spectrum, 34(6):52–59, June 1997.

[7] Laszlo B. Kish. End of moore’s law: thermal (noise) death of integration in micro and nano electronics. Physics Letters A, 305(3-4):144–149, December 2002.

[8] Lasse Bergroth, Harri Hakonen, and Timo Raita. A survey of longest common subsequence algorithms. In String Processing and Information Retrieval, 2000. Proceedings, Seventh International Symposium on, pages 39–48. IEEE, 2000.

[9] Ellis Horowitz and Sartaj Sahni. Computing partitions with applications to the knapsack problem. Journal of the ACM, 21(2):277–292, 1974.

[10] Gordon E. Moore. Cramming more components into integrated circuits. Elec- tronics, 38(8), April 1965.

[11] Hans Moravec. When will computer hardware match the human brain. Journal of Evolution and Technology, 1(1):10, 1998.

[12] Chris Westbury. How fast is the brain? http://www.ualberta.ca/~chrisw/ howfast.html, 2006.

[13] National Aeronautics and Space Administration. Solar system explo- ration: Earth: Facts and figures. http://solarsystem.nasa.gov/planets/ profile.cfm?Dislpay=Facts&Object=Earth, accessed 2014.

[14] Xiangke Liao, Liquan Xiao, Canqun Yang, and Yutong Lu. Milkyway-2 super- computer: system and application. Frontiers of Computer Science, 8(3):345– 356, 2014.

[15] Bezalel Gavish. Topological design of centralized computer networks — for- mulations and algorithms. Networks, 12(4):355–377, 1982.

[16] Jan Fabian Ehmke, Stephan Meisel, Stefan Engelman, and Dirk Christian Mattfeld. Data chain management for planning in city logistics. International Journal of , Modelling and Management, 1(4):335–356, 2009.

[17] Joseph B. Kruskal. On the shortest spanning subtree of a graph and the trav- eling salesman problem. Proceedings of the American Mathematical Society, 7(1):48–50, February 1956.

[18] E. W. Dijkstra. A note on two problems in connexion with graphs. Numerische Mathematik, 1(1):269–271, 1959.

[19] F. Benjamin Zhan. Three fastest shortest path algorithms on real road net- works: Data structures and procedures. Journal of geographic information and decision analysis, 1(1):69–82, 1997.

[20] Robert B. Dial. Algorithm 360: shortest-path forest with topological ordering. Communications of the ACM, 12(11):632–633, November 1969.

[21] Nils J. Nilsson. Shakey the robot. SRI International — Technical Note 323, April 1984.

[22] Peter E. Hart, Nils J. Nilsson, and Bertram Raphael. A formal basis for the heuristic determination of minimal cost paths. IEEE Transactions on Systems Science and Cybernetics, 4(2):100–107, July 1968.

[23] Stephen Cossell and José Guivant. Dijkstra's algorithm vs. A* search vs. concurrent algorithm. http://www.youtube.com/watch?v=cSxnOm5aceA, June 2013.

[24] Alberto Elfes. Using occupancy grids for mobile robot perception and naviga- tion. Computer, 22(6):46–57, June 1989.

[25] Oussama Khatib. Real-time obstacle avoidance for manipulators and mobile robots. International Journal of Robotics Research, 5(1):90–98, 1986.

[26] Y. Hwang and N. Ahuja. Gross motion planning — a survey. ACM computing surveys, 24(3):219–291, 1992.

[27] Allen R. Tannenbaum and Yoseph Yomdin. Robotic manipulators and the ge- ometry of real semialgebraic sets. IEEE Journal of Robotics and Automation, RA-3(4):301–307, August 1987.

[28] Jean-Claude Latombe. Robot Motion Planning. Kluwer Academic Publishers, 1991.

[29] Michael Barbehenn. A note on the complexity of Dijkstra’s algorithm for graphs with weighted vertices. IEEE Transactions on Computers, 47(2):263, February 1998.

[30] R. A. Brooks. Solving the find-path problem by good representation of free space. Systems, Man and Cybernetics, IEEE Transactions on, SMC-13(2):190–197, March–April 1983.

[31] Chee-Keng Yap. How to move a chair through a door. Robotics and Automation, IEEE Journal of, 3(3):172–181, 1987.

[32] Wim Meeussen, Melonee Wise, Stuart Glaser, Sachin Chitta, Conor McGann, Patrick Mihelich, Eitan Marder-Eppstein, Marius Muja, Victor Eruhimov, Tully Foote, et al. Autonomous door opening and plugging in with a per- sonal robot. In Robotics and Automation (ICRA), 2010 IEEE International Conference on, pages 729–736. IEEE, 2010.

[33] Nancy M. Amato and Yan Wu. A randomized roadmap method for path and manipulation planning. In Proceedings of the 1996 IEEE Interational Conference on Robotics and Automation, Minneapolis, Minnesota, USA, April 1996.

[34] Nancy M Amato and Lucia K Dale. Probabilistic roadmap methods are em- barrassingly parallel. In Robotics and Automation, 1999. Proceedings. 1999 IEEE International Conference on, volume 1, pages 688–694. IEEE, 1999.

[35] Stefano Carpin and Enrico Pagello. On parallel rrts for multi-robot systems. In Proc. 8th Conf. Italian Association for Artificial Intelligence, pages 834–841, 2002.

[36] Kalin Gochev, Benjamin Cohen, Jonathan Butzke, Alla Safonova, and Maxim Likhachev. Path planning with adaptive dimensionality. In Fourth Annual Symposium on Combinatorial Search, July 2011.

[37] D. J. Challou, M. Gini, and V. Kumar. Parallel search algorithms for robot motion planning. In Robotics and Automation, Proceedings, IEEE Conference on, Atlanta, GA, U.S.A., May 1993.

[38] Sebastian Thrun and Arno Bücken. Integrating grid-based and topological maps for mobile robot navigation. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, Portland, Oregon, USA, August 1996.

[39] Nicola Tomatis, Illah Nourbakhsh, and Roland Siegwart. Hybrid simultaneous localization and map building: a natural integration of topological and metric. Robotics and Autonomous Systems, 44(1):3–14, February 2003.

[40] John J. Leonard and Hugh F. Durrant-Whyte. Simultaneous map build- ing and localization for an autonomous mobile robot. In Intelligent Robots and Systems’ 91.’Intelligence for Mechanical Systems, Proceedings IROS’91. IEEE/RSJ International Workshop on, pages 1442–1447. IEEE, 1991.

[41] José Guivant, Eduardo Nebot, Juan Nieto, and Favio Masson. Navigation and mapping in large unstructured environments. International Journal of Robotics Research, 23(4-5):449–472, April-May 2004.

[42] Mark A. Whitty and José E. Guivant. Efficient path planning in deformable maps. In Proceedings of the IEEE International Conference on Intelligent Robots and Systems, St. Louis, USA, October 2009.

[43] Karime Pereida and José Guivant. PWL approximation for dense mapping and associated Dijkstra process for the concurrent synthesis of multiple full cost-to-go functions. In Proceedings of the Australasian Conference on Robotics and Automation, Sydney, Australia, December 2013.

[44] Maxim Likhachev and Anthony Stentz. R* search. Lab Papers (GRASP), page 23, 2008.

[45] J. B. Johnson. Thermal agitation of electricity in conductors. Phys. Rev., 32:97–109, July 1928.

[46] H. Nyquist. Thermal agitation of electric charge in conductors. Phys. Rev., 32:110–113, July 1928.

[47] Shekhar Borkar. Thousand core chips: a technology perspective. In Proceed- ings of the 44th Annual Design Automation Conference, San Diego, California, USA, June 2007.

151 [48] Dac C Pham, Tony Aipperspach, David Boerstler, Mark Bolliger, Rajat Chaudhry, Dennis Cox, Paul Harvey, Paul M Harvey, H Peter Hofstee, Charles Johns, et al. Overview of the architecture, circuit design, and physical im- plementation of a first-generation cell processor. Solid-State Circuits, IEEE Journal of, 41(1):179–196, 2006.

[49] nVidia Corporation. High performance computing - supercomputing with tesla GPUs. http://www.nvidia.com/object/tesla_computing_ solutions.html, 2011.

[50] G. R. Andrews. Concurrent Programming: Principles and Practice. The Benjamin/Cummings Publishing Company, 1991.

[51] Erik Lindholm, John Nickolls, Stuart Oberman, and John Montrym. NVIDIA Tesla: A unified graphics and computing architecture. IEEE Micro, 28(2):39– 55, 2008.

[52] Randi J. Rost. OpenGL Shading Language (Second Edition). Addison-Wesley — Pearson Education, Inc., 75 Arlington Street, Suite 900, Boston, MA, 02116, U.S.A., second edition, 2006.

[53] Kenneth Moreland and Edward Angle. The FFT on a GPU. In Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware, pages 112–119. Eurographics Association, 2003.

[54] Michael Flynn. Some computer organizations and their effectiveness. Com- puters, IEEE Transactions on, 100(9):948–960, 1972.

[55] Ian Buck and Tim Purcell. Chapter 37: A toolkit for computation on GPUs. In GPU Gems. Addison-Wesley — Pearson Education Inc., 2007.

[56] Mark Harris. Mapping computational concepts to GPUs. In ACM SIG- GRAPH, Los Angeles, California, USA, July 2005.

[57] NVIDIA Corporation. Cuda toolkit. https://developer.nvidia.com/ cuda-toolkit, 2014.

[58] Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman, Kayvon Fatahalian, Mike Houston, and Pat Hanrahan. Brook for GPUs: stream computing on graphics hardware. In ACM Transactions on Graphics (TOG), volume 23, pages 777–786. ACM, 2004.

[59] Cole Trapnell and Michael C. Schatz. Optimizing data intensive GPGPU computations for DNA sequence alignment. Parallel Computing, 35(8-9):429–440, August-September 2009.

[60] Peter Bakkum and Kevin Skadron. Accelerating SQL database operations on a GPU with CUDA. In Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, pages 94–103, New York, NY, USA, 2010.

[61] Tobias Preis, Peter Virnau, Wolfgang Paul, and Johannes J. Schneider. GPU accelerated Monte Carlo simulation of the 2D and 3D Ising model. Journal of Computational Physics, 228(12):4468–4477, 2009.

[62] Ernst Ising. A contribution to the theory of ferromagnetism. Z. Phys, 31(1):253–258, 1925.

[63] Stanley Tzeng and Li-Yi Wei. Parallel white noise generation on a GPU via cryptographic hash. In Proceedings of the 2008 symposium on Interactive 3D graphics and games, pages 79–87. ACM, 2008.

[64] Ronald Rivest. The MD5 message-digest algorithm. 1992.

[65] N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In IEEE International Symposium on Parallel and Dis- tributed Processing, Rome, May 2009.

[66] Victor Podlozhnyuk. Image convolution with CUDA. NVIDIA Corporation white paper, 2097(3), June 2007.

[67] Tomonari Furukawa, Benjamin Lavis, and Hugh F. Durrant-Whyte. Parallel grid-based recursive Bayesian estimation using GPU for real-time autonomous navigation. In International Conference on Robotics and Automation, Anchorage, Alaska, May 2010.

[68] Charles Pisula, K. Hoff, Ming Lin, and Dinesh Manocha. Randomized path planning for a rigid body based on hardware accelerated voronoi sampling. In Proc. Workshop on Algorithmic Foundation of Robotics, volume 18, 2000.

[69] Kenneth Hoff III, Tim Culver, John Keyser, Ming C. Lin, and Dinesh Manocha. Interactive motion planning using hardware-accelerated compu- tation of generalized Voronoi diagrams. In Robotics and Automation, 2000. Proceedings. ICRA’00. IEEE International Conference on, volume 3, pages 2931–2937. IEEE, 2000.

[70] Pawan Harish and P. J. Narayanan. Accelerating large graph algorithms on the GPU using CUDA. In IEEE High Performance Computing, 2007.

[71] Duane Merrill, Michael Garland, and Andrew Grimshaw. Scalable GPU graph traversal. In ACM SIGPLAN Notices, volume 47, pages 117–128. ACM, 2012.

[72] Jia Pan, Christian Lauterbach, and Dinesh Manocha. g-planner: Real-time motion planning and global navigation using GPUs. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, 2010.

[73] Joseph T. Kider, Mark Henderson, Maxim Likhachev, and Alla Safonova. High-dimensional planning on the GPU. In Robotics and Automation (ICRA), 2010 IEEE International Conference on, pages 2515–2522. IEEE, 2010.

[74] Edwin B. Olson. Real-time correlative scan matching. In Robotics and Au- tomation, 2009. ICRA’09. IEEE International Conference on, pages 4387– 4393. IEEE, 2009.

[75] Richard A. Newcombe, Andrew J. Davison, Shahram Izadi, Pushmeet Kohli, Otmar Hilliges, Jamie Shotton, David Molyneaux, Steve Hodges, David Kim, and Andrew Fitzgibbon. KinectFusion: Real-time dense surface mapping and tracking. In Mixed and augmented reality (ISMAR), 2011 10th IEEE International Symposium on, pages 127–136. IEEE, 2011.

[76] Adrian Ratter and Claude Sammut. GPU accelerated parallel occupancy voxel based ICP for position tracking. In Proceedings of Australasian Conference on Robotics and Automation, Sydney, Australia, December 2013.

[77] David Patterson. The trouble with multicore. IEEE Spectrum, July 2010.

[78] Stephen Cossell and José Guivant. Parallel evaluation of a spatial traversability cost function on GPU for efficient path planning. Journal of Intelligent Learning Systems and Applications, 3(4):191–200, November 2011.

[79] R. Bellman. The Theory of Dynamic Programming. RAND Corporation, California, 1954.

[80] Richard Bellman. On the theory of dynamic programming. Proceedings of the National Academy of Sciences of the United States of America, 38(8):716, 1952.

[81] Mark Harris. GPU Gems 2. Addison-Wesley, April 2005.

[82] Stephen Cossell and José Guivant. An optimised GPU-based robot motion planner. In Australasian Conference on Robotics and Automation, Sydney, Australia, December 2013.

[83] Stephen Cossell and José Guivant. Concurrent dynamic programming for grid-based problems and its application for real-time path planning. Robotics and Autonomous Systems, 62(6):737–751, June 2014.

[84] Inc. Advanced Micro Devices. ATI Mobility Radeon HD 2400 XT. http://www.amd.com/us/products/notebook/graphics/ ati-mobility-hd-2000/hd-2400-xt/, accessed 2011.

[85] Inc. Advanced Micro Devices. AMD unleashes the ATI Radeon HD 2600 and ATI Radeon HD 2400 series, delivering DirectX 10 graphics with built-in high-definition video processing, June 2007.

[86] nVidia Corporation. GeForce GTX 480. http://www.nvidia.com/object/ product_geforce_gtx_480_us.html, accessed 2011.

[87] nVidia Corporation. New nVidia GeForce GTX 480 GPU cranks up PC gaming to new heights, March 2010.

[88] Mark Whitty, Stephen Cossell, Kim Son Dang, José Guivant, and Jayantha Katupitiya. Autonomous navigation using a real-time 3D point cloud. In Australasian Conference on Robotics and Automation, Brisbane, Australia, December 2010.

[89] Vasily Volkov and James W. Demmel. Benchmarking GPUs to tune dense linear algebra. In Proceedings of the 2008 ACM/IEEE Conference on Super- computing, pages 31:1–31:11. IEEE Press, 2008.

[90] José Guivant, Stephen Cossell, Mark Whitty, and Jayantha Katupitiya. Internet-based operation of autonomous robots: the role of data replication, compression, bandwidth allocation and visualization. Journal of Field Robotics, 29(5):793–818, 2012.

[91] José Guivant. Possum robot. http://www.possumrobot.com, accessed 2012.

[92] Stephen Cossell and José Guivant. A GPU-based concurrent motion planning algorithm for 3D Euclidean configuration spaces. In Australasian Conference on Robotics and Automation, Sydney, Australia, December 2013.

[93] nVidia Corporation. GeForce GT 650M. http://www.geforce.com/ hardware/notebook-gpus/geforce-gt-650m/specifications, 2014.

[94] Mihail Pivtoraiko, Ross A. Knepper, and Alonzo Kelly. Differentially con- strained mobile robot motion planning in state lattices. Journal of Field Robotics, 26(3):308–333, 2009.

[95] Guy M. Morton. A computer oriented geodetic data base and a new technique in file sequencing. International Business Machines Company, 1966.

[96] Laurent Gouzènes. Strategies for solving collision-free trajectories problems for mobile and manipulator robots. International Journal of Robotics Research, 3(4):51–65, 1984.

[97] Tomas Lozano-Perez. A simple motion-planning algorithm for general robot manipulators. Robotics and Automation, IEEE Journal of, 3(3):224–238, 1987.

[98] Dana Schaa and David Kaeli. Exploring the multiple-GPU design space. In Parallel & Distributed Processing, 2009. IPDPS 2009. IEEE International Symposium on, pages 1–12. IEEE, 2009.

[99] B. Alberts, A. Johnson, J. Lewis, et al. Molecular Biology of the Cell, 4th edition. Garland Science, New York, 2002.

[100] Lary C. Walker and Harry LeVine. The cerebral proteopathies. Molecular Neurobiology, 21(1–2):83–95, February–April 2000.

[101] Desh Singh. Higher level programming abstractions for FPGAs using OpenCL. In Workshop on Design Methods and Tools for FPGA-Based Acceleration of Scientific Computing, 2011.

[102] Aaftab Munshi. OpenCL 1.0 specification. Khronos OpenCL Working Group, 2009.

[103] Aaftab Munshi. OpenCL 2.0 specification. Khronos OpenCL Working Group, 2014.

[104] Massimo Stefani and Christopher M. Dobson. Protein aggregation and aggregate toxicity: new insights into protein folding, misfolding diseases and biological evolution. Journal of Molecular Medicine, 81(11):678–699, 2003.

[105] Christopher M. Dobson. Experimental investigation of protein folding and misfolding. Methods, 34(1):4–14, 2004.

[106] V. S. Pande, I. Baker, J. Chapman, S. P. Elmer, S. Khaliq, S. M. Larson, Y. M. Rhee, M. R. Shirts, C. D. Snow, E. J. Sorin, and B Zagrovic. Atomistic protein folding simulations on the submillisecond time scale using worldwide distributed computing. Biopolymers, 68(1):91–109, January 2003.

[107] Erich Elsen, Vaidyanathan Vishal, Mike Houston, Vijay Pande, Pat Hanrahan, and Eric Darve. N-body simulations on GPUs. arXiv preprint arXiv:0706.3060, 2007.

[108] John D. Owens, Mike Houston, David Luebke, Simon Green, John E. Stone, and James C. Phillips. GPU computing. Proceedings of the IEEE, 96(5):879–899, 2008.
