
To appear in Proceedings of IROS 2009-IEEE/RSJ International Conference on Intelligent RObots and Systems (IROS 09), St. Louis, USA, October 2009.

Improving Particle Filter Performance Using SSE Instructions

Peter Djeu and Michael Quinlan and Peter Stone

Abstract— Robotics researchers are often faced with real-time constraints, and for that reason algorithmic and implementation-level optimization can dramatically increase the overall performance of a robot. In this paper we illustrate how a substantial run-time gain can be achieved by taking advantage of the extended instruction sets found in modern processors, in particular the SSE1 and SSE2 instruction sets. We present an SSE version of Monte Carlo Localization that results in an impressive 9x speedup over an optimized scalar implementation. In the process, we discuss SSE implementations of atan, atan2 and exp that achieve up to a 4x speedup in these mathematical operations alone.

All authors are in the Department of Computer Science, The University of Texas at Austin. djeu,mquinlan,[email protected]

I. INTRODUCTION

In robotics, it is often critical to process sensory data or to compute decisions in real-time, that is to say at the rate at which sensory data is captured. For example, a robot with a camera that captures frames at 30 Hz should complete its sense-act loop at the same rate or risk missing sensing opportunities. Sometimes this constrains the processing algorithms that can be used (e.g. Hough transforms may be ruled out entirely) and sometimes it constrains the quality of the processing (e.g. the number of particles that can be processed in a particle filtering algorithm). As a result, an important component of robotics is code streamlining and optimization. A common source of untapped potential for code optimization is the use of the vector unit on CPUs to perform efficient and parallel computation.

Since 1998 (when AMD released the K6-2 processors, followed in 1999 by Intel's Pentium III processors) most modern processors, including desktop and notebook/netbook processors, can perform SIMD (Single Instruction Multiple Data) operations, which allow a single instruction stream to drive parallel computation. In particular, both Intel and AMD support the SSE (Streaming SIMD Extension) instruction set and the follow-up instruction set SSE2 via an SSE vector unit [1]. Using the SSE vector unit allows the CPU to perform 4-wide operations in place of 1-wide operations; four additions can be performed in the time it usually takes to perform one. Under ideal settings, an SSE vector implementation can be up to 4x faster than the scalar implementation, and in cases such as the ones described in this paper, it can be even faster.

However, developing code that effectively uses the SSE instruction set has generally been restricted to graphics researchers or to the developers of heavily optimized libraries for specific tasks such as vision [2] and linear algebra [4]. While these libraries provide excellent performance on their intended tasks, the average roboticist has failed to take advantage of the SSE vector unit in developing his or her code and algorithms.

In this paper we present ground-up SSE implementations of key functions required in the robotics domain (atan, atan2, exp) and apply them to a commonly used and practical robotics algorithm: Monte Carlo Localization (MCL) [3]. In our implementation we achieve an impressive 9x speedup, surpassing the ideal speedup of 4x.

While this improvement in and of itself is noteworthy, one of the main objectives of this paper is to encourage and also instruct the robotics community on how to implement its own algorithms using the SSE instruction set. To help facilitate this task we are releasing the full source code for our MCL implementation, including the SSE components.

The remainder of this paper is structured as follows: Section II provides an overview of the basics of SSE. Section III introduces the Monte Carlo Localization algorithm and steps through the construction of the SSE version, with particular focus on the development of the SSE math extensions (such as atan2). Section IV reviews and discusses the run-time performance results of both the overall MCL algorithm and the individual math operations. We conclude and present future work in Section V.

II. USING THE SSE INSTRUCTION SET

Traditional CPU instructions take single values as their operands. A typical floating point addition instruction has the following form, which we call 1-wide addition.

```cpp
float a = 1.0f;
float b = 5.0f;
float c = a + b;   // c: 6.0f
```

In SSE, the data types are expanded from 'float' to 'sse4Floats' (the built-in SSE data type is actually '__m128'; we encapsulate this type into a C++ class, which we call 'sse4Floats'), which holds the values of 4 separate floating points. When an addition is performed on two SSE operands, each of the 4 elements is independently summed. In other words, element 0 in the first operand is added to element 0 in the second operand, element 1 in the first operand is added to element 1 in the second operand, and so on. We will refer to this operation as a 4-wide addition.

In the following example, we use a C++ constructor for sse4Floats that takes 4 floating point values. The '+' operator is overloaded so that if both operands are sse4Floats, the SSE unit is invoked to perform a 4-wide addition.

```cpp
sse4Floats a = sse4Floats(1.0f, 2.0f, 3.0f, 4.0f);
sse4Floats b = sse4Floats(5.0f, 6.0f, 7.0f, 8.0f);
sse4Floats c = a + b;   // c: (6.0f, 8.0f, 10.0f, 12.0f)
```

We will use the terms 1-wide and 4-wide to refer to mathematical operations that are performed analogously to the addition operations above. The term scalar will be used interchangeably with 1-wide, and vector interchangeably with 4-wide.
Apart from 4-wide floating point addition, the SSE vector unit also provides support for other arithmetic operations, such as subtraction, multiplication, division, minimum, maximum, and bitwise operations. The vector unit also supports vectors of four 32-bit integers (aka ints) and has built-in support for addition, subtraction, and bitwise operations on them. Processors which support later versions of SSE also provide multiplication, minimum, and maximum operations for integers.

A. SSE Masks

One difficulty that occurs when using SSE is that a single instruction stream must be used for all processing. This is especially troubling when implementing algorithms that would normally require branching. Consider the following example, which returns the absolute value of f.

```cpp
float abs(float f) {
    return (f >= 0.0f) ? f : -f;
}
```

When implementing a 4-wide version of absolute value for an sse4Floats consisting of the values (-1.0f, -2.0f, +3.0f, +4.0f), the first 2 elements need to be negated, while the last 2 elements do not. However, we must use the same instruction stream on all elements as per the definition of SIMD.

The solution is to evaluate both sides of the branch and to mask out the result from one branch and mask in the result from the other. The two results are then combined and only the masked-in result survives. We now revisit the scalar implementation to support this branching paradigm using bitwise operations. For brevity, the implementations of the two reinterpretation helper functions are not shown.

```cpp
int   reint_f2i(float f);   // reinterpret float as int
float reint_i2f(int i);     // reinterpret int as float

float abs(float f) {
    int MASK_TRUE  = 0xffffffff;
    int MASK_FALSE = 0x00000000;
    int mask = (f >= 0.0f) ? MASK_TRUE : MASK_FALSE;
    int true_result  =  mask & reint_f2i( f);
    int false_result = ~mask & reint_f2i(-f);
    int blend_result = true_result | false_result;
    return reint_i2f(blend_result);
}
```

As you will notice, this version of absolute value still contains one branch when selecting the value of the variable 'mask', the value being either a 32-bit int with all bits set (the true mask) or a 32-bit int with all bits unset (the false mask). The SSE vector unit solves this problem by mandating that all of its comparison operations produce either a true mask or a false mask in each element of the 4-wide result. We implement SSE masks using the sseMask data type, which is a C++ class that encapsulates a built-in SSE data type representing a 4-wide mask. In the following example, we generate an SSE mask by comparing the input value to another sse4Floats set completely to zero.

```cpp
sse4Floats a = sse4Floats(1.0f, 0.0f, -1.0f, -2.0f);
sse4Floats b = sse4Floats(0.0f, 0.0f,  0.0f,  0.0f);
sseMask mask = (a >= b);
// mask: (MASK_TRUE, MASK_TRUE, MASK_FALSE, MASK_FALSE)
```

This sseMask can then be used in a 4-wide blend operation, which takes three operands: an sseMask, a true operand, and a false operand. The return value is an sse4Floats that is set to the value of the true operand at all elements where the sseMask had value MASK_TRUE; the return value is set to the false operand at all other locations. The operators '&', '|', and '~' are overloaded to invoke the corresponding bitwise operations in the SSE unit.

```cpp
sse4Floats blend4(sseMask mask, sse4Floats in_true, sse4Floats in_false) {
    return (mask & in_true) | (~mask & in_false);
}
```

The following is an SSE implementation of absolute value. The '>=' operator is overloaded to invoke the greater-than-or-equal-to comparison in the SSE unit, while the '-' operator is overloaded to invoke unary negation in the SSE unit.

```cpp
sse4Floats abs(sse4Floats f) {
    sse4Floats zeros = sse4Floats(0.0f, 0.0f, 0.0f, 0.0f);
    sseMask is_non_neg = (f >= zeros);
    return blend4(is_non_neg, f, -f);
}
```

To summarize, masking allows us to conditionally perform operations on only certain elements within an sse4Floats rather than on all elements. Masking comes with a cost, however, since both the true input and the false input must be evaluated whenever a mask is used. In scalar code, work can often be saved by descending down only one branch of a conditional, while in pure SSE code both sides of the branch must always be taken, even in cases where the sseMask consists of 4 MASK_TRUEs or 4 MASK_FALSEs.

B. Reduction Operations

It is often necessary to perform an operation which reduces the 4 values within an sse4Floats to a single floating point value.

```cpp
sse4Floats a = sse4Floats(1.0f, 2.0f, 3.0f, 4.0f);
float b = reduce_add(a);   // b: 10.0f
```

These operations, which work internally within a single sse4Floats, are known as reduction operations. Reduction operations are typically slower than their 4-wide counterparts; however, they are usually needed only sparingly. For example, the following block of code returns the average value of all floats in an array of sse4Floats. Notice that only one reduction operation is needed.

```cpp
// n - the number of 4-wides in arr
float getAverageValue(sse4Floats *arr, int n) {
    sse4Floats accum = sse4Floats(0.0f, 0.0f, 0.0f, 0.0f);
    for (int i = 0; i < n; i++)
        accum = accum + arr[i];     // 4-wide add
    float total = reduce_add(accum);
    return total / (n * 4.0f);
}
```
C. SSE Performance

Using the SSE vector unit to perform 4-wide operations in place of 1-wide operations allows us to perform four arithmetic operations in the time it would take to perform one. Under ideal settings, an SSE implementation can be up to 4x faster than the scalar implementation. We will refer to the ratio of the scalar version's run-time to the SSE version's run-time as parallel speedup.

There are many reasons why parallel speedup falls below 4x. Often, reformatting data to conform to an SSE-compatible layout is expensive, as is converting from SSE format to an output format. More generally, we must operate within the constraints of Amdahl's Law: any part of an algorithm that cannot be parallelized will limit the overall speedup. In applications where data is not readily available in multiples of 4, some elements in the SSE vector will be left empty, which further limits speedup. Finally, for algorithms that have deep conditional branches, where one or both sides of the branch contain a significant amount of work, the SSE version will suffer because it must perform the work on both sides of the branch, while the scalar version only needs to execute the taken branch.

It is possible, however, to achieve a parallel speedup that is greater than the ideal speedup of 4x, as evidenced in this work. The reason is a fundamental difference, present on many platforms, between the hardware that evaluates SSE code and the hardware that evaluates scalar code. For example, on many Intel chips the ALU that performs SSE floating point operations is faster than the ALU which handles scalar floats, since the latter is based on the aging design of the x87 math co-processor. In other words, a parallel speedup that exceeds 4x is possible due to a combination of good data parallelism (SIMD) and intrinsic speedup from better hardware (the SSE ALU is faster than the scalar ALU).

In our experience, the best time to switch from a scalar implementation to an SSE implementation is when the scalar algorithm requires performing arithmetic operations independently but identically over a large data set. For example, an algorithm that evaluates exp(x) on every element of an array is a prime candidate for conversion to SSE.
III. SSE AND MONTE CARLO LOCALIZATION

A. Monte Carlo Localization (Particle Filters)

For the purpose of this paper we describe the implementation of a simple version of Monte Carlo Localization (MCL) [3]. We hope that our description provides others with the necessary framework to begin including SSE in deployable real-world implementations. We have chosen to keep the example as straightforward as possible (see Algorithm 1) in an attempt to clearly demonstrate the modifications needed to use the SSE instruction set. Notice that in Line 10 of our implementation, we compute particle likelihoods by accumulating the exponent of the probability in the inner loop and perform the exponentiation later in the estimatePose() helper function (Line 13). Both versions of the MCL algorithm (scalar and SSE) benefit from this optimization.

Our MCL implementation takes observations (measurements) in the form of the distance (mm) and bearing (radians) to known reference objects. Figure 1 presents the output of the algorithm when either one or three objects are being observed.

Algorithm 1 Particle Filter Localization
 1: for all p ∈ particles do
 2:   p.exponent ← 0
 3: end for
 4: for all o ∈ observations do
 5:   for all p ∈ particles do
 6:     dist_expected ← getDistanceToPoint(p.pos, o.pos)
 7:     bear_expected ← getBearingToPoint(p.pos, o.pos)
 8:     pe_dist ← getDistanceSimExponent(dist_expected, o.dist)
 9:     pe_bear ← getBearingSimExponent(bear_expected, o.bear)
10:     p.exponent ← p.exponent + pe_dist + pe_bear
11:   end for
12: end for
13: estimatePose(particles)

Fig. 1. Example Particles: In the left image, the robot (drawn as an X) observes a single object. One reference object is not enough for localization, which results in the most likely particles forming a ring around that object. In the image on the right, the robot is observing all three objects, resulting in the most likely particles clustering around the robot's true location.

B. SSE Math Extensions

The processor's built-in support for SSE, which in this paper is taken to be the instruction sets 'SSE1' and 'SSE2', is sufficient for implementing many algorithms [1]. We choose to limit our scope to these two instruction sets in order to support older processors. For Monte Carlo localization, we need to evaluate atan2(y, x) and exp(x) for 4-wide inputs and return results that are also 4-wide. Neither function is part of the built-in instruction set, so we provide details on our implementations.

C. SSE Data Structures

For our SSE implementation, we use SSE data structures to represent four particles at a time. Our 4-wide particle data structure represents 4 separate particles, where each particle consists of an x coordinate, a y coordinate, an orientation (aka bearing), a probability exponent, and a probability.

```cpp
class Particle_4Wide {
    sse4Floats x;          // 4-wide x coordinate
    sse4Floats y;          // 4-wide y coordinate
    sse4Floats o;          // 4-wide bearing
    sse4Floats exponent;   // prob exponent
    sse4Floats prob;       // equal to exp(exponent)
};
```

We reuse the basic outline for a particle filter from Algorithm 1 for our SSE implementation, where we overload our functions from the scalar version to accept and return an sse4Floats in place of a float and a Particle_4Wide in place of a Particle.

Another modification we need for the SSE particle filter is to expand the observations. Normally, observations and particles are both 1-wide. However, now that particles are 4-wide, the observation must also be 4-wide. This is accomplished by duplicating the data from one observation across all elements of a 4-wide in a process called expansion. The expanded observation can then be compared against the 4-wide particle. To implement expansion, we use the function expand(float x), which creates an sse4Floats from a single float.

D. Bearing To A Point

Calculating the expected bearing from a single particle to an object (Line 7 in Algorithm 1) requires the following code. The normalizeAngle() function converts an angle from the interval (-∞, ∞) to the interval [-π, π].

```cpp
AngRad getBearingToPoint(Particle particle, Point2D obj) {
    float dx = obj.x - particle.x;
    float dy = obj.y - particle.y;
    AngRad theta = atan2(dy, dx);
    return normalizeAngle(theta - particle.o);
}
```

Notice that we need to perform an atan2 during the bearing computation for every combination of observation and particle. To implement atan2(y, x), we first implement the helper function atan(x). In the following sections, we describe our SSE implementations for both of these functions.

1) Function 1: atan: The atan(x) function determines the angle theta on [-π/2, π/2] for which tan(theta) = x. Our implementation computes atan by evaluating the first seven terms in Euler's power series for atan. This series expansion converges more rapidly than the more commonly used power series for atan. The helper method atan_rd(x) evaluates the actual power series on the reduced domain (rd) of [0, 1], which guarantees that the power series will converge. In order to convert a generic input to atan_rd(x)'s reduced domain, we use the following two trigonometric identities:

Trig Ident 1: atan(x) = -atan(-x)
Trig Ident 2: atan(x) = PI/2 - atan(1/x)

These two identities can be used to transform any input x from the real number line to the interval [0, 1]. We will call this process of compressing inputs to a smaller interval domain reduction. When domain reduction is used, it usually has a corresponding range expansion. In this case, we first apply domain reduction in atan() in preparation for the call to atan_rd(x). After atan_rd(x) is evaluated, we use range expansion to fix the sign of the return value from atan_rd(x) if we used the first trigonometric identity during domain reduction. The second part of range expansion is to subtract from π/2 if we used the second trigonometric identity during domain reduction. We provide a further optimization by subtracting from either -π/2 or π/2 to guarantee that the output from atan(x) is in the range [-π/2, π/2].

In order to improve the accuracy of the atan_rd(x) function, we provide a small fix for values of x that are near 0. These values will undergo floating point underflow during self-multiplication, which drives the output of atan_rd(x) to zero. To alleviate this problem, we simply return the value of x for all values for which the reference implementation of atan() from math.h also returns the value x. The largest such value of x is about 0.000352.

```cpp
// domain: [0.0, 1.0], range: [-PI/2, PI/2]
sse4Floats atan_rd(sse4Floats x) {
    sse4Floats ps;   // contains result of power series
    // ...evaluation of power series omitted...
    // fix values that generate 0 but should just be x
    sse4Floats thresh = expand(0.000352f);
    return blend4(x < thresh, x, ps);
}

// domain: [-INF, INF], range: [-PI/2, PI/2]
sse4Floats atan(sse4Floats x) {
    sse4Floats zero = expand(0.0f);
    sse4Floats one  = expand(1.0f);
    sse4Floats pi_over_two = expand(M_PI * 0.5f);
    // take absolute value
    sseMask neg_x = x < zero;
    sse4Floats sign_conv = blend4(neg_x, -one, one);
    sse4Floats abs_x = sign_conv * x;
    // invert all values that are greater than one
    sseMask needs_invert = (abs_x > one);
    sse4Floats inv_abs_x = one / abs_x;
    sse4Floats x_rd = blend4(needs_invert, inv_abs_x, abs_x);
    sse4Floats raw_atan = atan_rd(x_rd);
    // fix signs based on the signs of the input
    sse4Floats theta = sign_conv * raw_atan;
    // correct output range by subtracting from PI/2 or -PI/2
    sse4Floats cons = blend4(neg_x, -pi_over_two, pi_over_two);
    return blend4(needs_invert, cons - theta, theta);
}
```

2) Function 2: atan2: The atan2(y, x) function is a more robust version of atan(x). It returns the angle theta on [-π, π] that is formed by the positive x axis and the line segment connecting the origin to the point (x, y), and it is used during bearing computations in the particle filter.

Our implementation uses atan(x) as a helper method by computing y/x and then invoking atan(x). The raw result from atan(x) is then put through a range expansion based on the quadrant in which the input to atan2(y, x) lies. The range expansion takes the return value of atan(x) and converts it from quadrant 4 to quadrant 2 by adding π, or from quadrant 1 to quadrant 3 by subtracting π, as necessary.

```cpp
// returns MASK_TRUE if element has its sign bit set,
// returns MASK_FALSE otherwise
sseMask is_neg_special(sse4Floats x);

// domain: [-INF, INF]x[-INF, INF], range: [-PI, PI]
sse4Floats atan2(sse4Floats y, sse4Floats x) {
    sse4Floats pi = expand(M_PI);
    sse4Floats theta = atan(y / x);
    // treat -0 as though it were negative
    sseMask neg_x = is_neg_special(x);
    sseMask neg_y = is_neg_special(y);
    // move from quadrant 4 to 2 by adding PI
    sseMask in_q2 = neg_x & ~neg_y;
    sse4Floats q2f = blend4(in_q2, theta + pi, theta);
    // move from quadrant 1 to 3 by subtracting PI
    sseMask in_q3 = neg_x & neg_y;
    return blend4(in_q3, theta - pi, q2f);
}
```
E. Distance/Bearing Similarity

Typically, the distance and bearing similarities are computed by evaluating an exponential e^x for each similarity within the inner loop of the particle filter. These similarities are then accumulated into a total likelihood for the particle via multiplication. However, we can reduce the number of exponentiations that are needed by deferring the exponentiation until pose estimation for the robot. To do so, we simply sum the exponents of the similarity values within the inner loop of the particle filter (Lines 8–10 in Algorithm 1). Later, we produce a mathematically identical result by evaluating the exponential in the call to the estimatePose() helper function (Line 13 of Algorithm 1). Note the use of exp(x) in Line 4 of Algorithm 2.

Algorithm 2 estimatePose() function
1: pose_accum ← 0
2: prob_accum ← 0
3: for all p ∈ particles do
4:   p.prob ← exp(p.exponent)
5:   pose_accum ← pose_accum + p.pose * p.prob
6:   prob_accum ← prob_accum + p.prob
7: end for
8: pose_mean ← pose_accum / prob_accum
9: pose_stddev ← computeStdDev(particles, pose_mean)

By deferring the evaluation of exp(x) until estimatePose(), we can reduce the number of exponentiations needed by a factor of N, where N is the number of observations from Algorithm 1. In the remainder of this section we describe our SSE implementation of exp(x).

1) Function 3: exp: The exp(x) function evaluates e^x and is used for computing likelihoods in the particle filter. Our implementation for exp(x) is broken down into three helper methods: exp_rd(x) and its helper methods exp_exponent(ipart) and exp_mantissa(x).

The first step in the exp(x) function is to pass its input to exp_rd(x), which works on the reduced domain of approximately (-87.3, 88.7). Inputs outside of this domain are trivial to evaluate and are handled by the main exp(x) function later.

The exp_rd(x) function evaluates e^x in two parts based on the composition of a 32-bit float. Recall that a 32-bit float consists of 1 bit for the sign, 8 bits for the exponent, and 23 bits for the mantissa. The first part of our algorithm determines the correct 8 bits for the exponent in the helper function exp_exponent(ipart). The second part determines the correct 23 bits for the mantissa in the helper function exp_mantissa(x). These results are combined and returned by exp_rd(x). Finally, for any input that exceeds the reduced domain of exp_rd(x), the main exp(x) function masks in either 0.0 or ∞ based on whether the input is less than or greater than the reduced domain, respectively.

It should be noted that exp_mantissa(x) evaluates e^x on the reduced domain [0.0, log2(e)] based on a special seven-term power series that has been fitted for high accuracy on the reduced domain. The coefficients differ from the ones used in the standard series expansion for e^x, so we refer the reader to our code release.

The following code uses the new data type sse4Ints. This data type is a C++ class that encapsulates an '__m128i', which represents four 32-bit integers rather than four 32-bit floats. For this class we have overloaded the << operator to invoke the SSE unit's bitwise left shift instruction.

```cpp
// domain: [-126, 127], range: [2^-126, 2^127]
sse4Floats exp_exponent(sse4Ints ipart) {
    sse4Floats one = expand(1.0f);
    sse4Ints fasi = (ipart << 23) + reint_f2i(one);
    return reint_i2f(fasi);
}

// domain: [0.0, log_2(e)], range: [1.0, 2.0)
sse4Floats exp_mantissa(sse4Floats x) {
    sse4Floats ps;   // contains result of power series
    // ...evaluation of power series omitted...
    return ps;
}

// domain: approx (-87.3, 88.7), range: [0, INF]
sse4Floats exp_rd(sse4Floats x) {
    sse4Floats log_e2 = expand(0.693147f);
    sse4Floats log_2e = expand(1.442695f);
    sse4Ints ipart = cast_f2i(log_2e * x);
    sse4Floats fpart = x - log_e2 * cast_i2f(ipart);
    return exp_exponent(ipart) * exp_mantissa(fpart);
}

// domain: [-INF, INF], range: [0, INF]
sse4Floats exp(sse4Floats x) {
    sse4Floats raw_exp = exp_rd(x);
    // fix values that are below or above a certain threshold
    sse4Floats zero     = expand(0.0f);
    sse4Floats infinity = expand(INF);
    sse4Floats min_thr  = expand(-87.336548f);
    sse4Floats max_thr  = expand( 88.722839f);
    sse4Floats min_fix = blend4(x <= min_thr, zero, raw_exp);
    return blend4(x >= max_thr, infinity, min_fix);
}
```
Inputs outside of this domain The second experiment examines how many particles each are trivial to evaluate and will be handled by the main exp(x) version can support when there is a hard real-time constraint. function later. The third experiment isolates the performance of the SSE The exp rd(x) function evaluates ex by evaluating the versions of the advanced math operations. All experiments result in two parts based on the composition of a 32-bit float. were conducted on a 2.66 GHz Core 2 Duo using the Intel Recall that a 32-bit float consists of 1 bit for the sign, 8 Compiler version 10.1.029. bits for the exponent, and 23 bits for the mantissa. The first part of our algorithm determines the correct 8 bits for the A. Scaling Experiment exponent in the helper function exp exponent(ipart). The The purpose of this experiment is to measure the run- second part of our algorithm determines the correct 23 bits time performance the SSE and scalar particle filters while for the mantissa in the helper function exp mantissa(x). varying the number of particles. Maximum optimization These results are combined and returned by the function settings were used for both the scalar and SSE versions. exp rd(x). Finally, for any input that exceeds the reduced These experiments were run with noise-free observations and domain of exp rd(x), the main exp(x) function masks in with three observations at each measurement update. The either 0.0 or ∞ based on whether the input is less than or results are shown in Table I and Figure 2. greater than the reduced domain, respectively. We achieve a consistent parallel speedup of at least 7.6x It should be noted that exp mantissa(x) evaluates ex on in all of our runs, surpassing the ideal limit of 4x due the reduced domain [0.0, log2(e)] based on a special seven- to a combination of good data parallelization and use of term power series that has been fitted for high accuracy on better ALUs in the SSE version. 
This result provides strong motivation for using an SSE implementation over a scalar implementation.

As expected, localization error decreases as the number of particles is increased. We reach a saturation point at ≥ 16384 particles, where additional particles no longer improve the accuracy of the pose estimate. Notice that for sufficiently low n, the particles are too sparsely distributed and all have 0 likelihood. When this occurs, the pose of the robot cannot be estimated and the error is infinite, which is represented by N/A in the table.

TABLE I
SCALING RESULTS OVER n PARTICLES. ERROR IS THE AVERAGE ERROR OVER 100 RANDOM CONFIGURATIONS OF PARTICLES.

        n    Scalar (s)    SSE (s)   Speed-up   Error (mm)   Error (rad)
        4     0.000023    0.000003      7.7        N/A          N/A
       16     0.000028    0.000003      9.3        N/A          N/A
       64     0.000047    0.000006      7.8       1590.0        0.513
      256     0.000116    0.000015      7.7        820.0        0.307
     1024     0.000403    0.000053      7.6        359.6        0.155
     4096     0.001607    0.00020       7.7        171.0        0.0742
    16384     0.006647    0.000831      8.0        109.0        0.0447
    65536     0.031178    0.003308      9.4        108.0        0.0446
   262144     0.125763    0.013664      9.2        110.0        0.0450
  1048576     0.460547    0.057089      8.1        109.0        0.0446

Fig. 2. Timing results for our scalar and SSE implementations of MCL.

B. Real-time Experiment

This experiment determines the number of observations we can process at real-time rates without ever going over a hard deadline of 30 Hz. We use 16384 particles, the saturation point determined in the scaling experiment. We vary the number of observations until we find the largest value that can meet the real-time deadline for 500 consecutive invocations of the particle filter, where each invocation is given a new set of observations.

The scalar version of the particle filter can handle up to 20 observations in this test, while the SSE version can handle 157 observations. The observed increase in observations by using SSE is a factor of 7.9x, which closely matches the speedup of 8.0x for 16384 particles from Table I.

C. Performance of SSE Math Functions

In this section we present the raw performance of the SSE math algorithms (atan, atan2 and exp). Table II summarizes the performance. Notice that the exp() function approaches the optimal 4x speedup.

TABLE II
SSE MATH RESULTS: exp TESTS ALL FLOATS ON THE INTERVAL [-80.0, 80.0]. atan TESTS ALL FLOATS ON THE INTERVAL [-∞, ∞]. atan2 TESTS ALL PAIRS OF FLOATS ON THE UNIT CIRCLE.

  Function   Scalar (sec)   SSE (sec)   Speedup
  atan        38.086196     20.303419    1.8769
  atan2       65.351570     28.582232    2.2644
  exp         19.501226      4.967288    3.9259

V. CONCLUSION AND FUTURE WORK

This article described the construction of an SSE version of a Monte Carlo Localization algorithm, including the development of SSE mathematical operations that are frequently required in the robotics domain. We have shown that a 9x speedup is possible over optimized scalar code. This speedup allows us to either a) process the same number of particles in less time or b) increase the accuracy of our localization by using a greater number of particles. The ability to support more particles also reduces the need for MCL algorithms to perform sophisticated and often expensive particle resampling.

Almost all processors on the market directly support SSE. Our results demonstrate that there is significant untapped potential in the SSE vector unit on these processors. To facilitate the exploration of this potential by the robotics community, we have developed and will open-source the implementation of our SSE particle filter and underlying components. In future work we will extend our library to additional mathematical operations and sub-routines. We hope to roll out fully vectorized versions of common robotics algorithms and to take advantage of SSE in loop-heavy tasks such as processing lidar sensor readings.

We refer the reader to our source code release for additional implementation details not covered in this paper:
http://www.cs.utexas.edu/users/mquinlan/mcl sse.html

ACKNOWLEDGEMENTS

We would like to thank Warren Hunt for contributing the design and implementation of the exp function. This research is supported in part by grants and funding from the National Science Foundation (CNS-0615104), DARPA (FA8750-05-2-0283 and FA8650-08-C-7812), the Federal Highway Administration (DTFH61-07-H-00030), General Motors, and Intel Corporation. The authors also thank Donald Fussell and Katherine Bontrager for comments and feedback on the paper.

REFERENCES

[1] Intel Corporation, Getting Started with SSE/SSE2 for the Intel Pentium 4 Processor, white paper, http://www.developers.net/intelisnshowcase/view/116.
[2] Intel Open Source Computer Vision Library (OpenCV), http://opencvlibrary.sourceforge.net.
[3] Sebastian Thrun, Wolfram Burgard and Dieter Fox, Probabilistic Robotics (Intelligent Robotics and Autonomous Agents), The MIT Press, 2005.
[4] R. Clint Whaley and Jack Dongarra, Automatically Tuned Linear Algebra Software, Ninth SIAM Conference on Parallel Processing for Scientific Computing, 1999. Latest library at http://math-atlas.sourceforge.net/.