Understanding Regular Expressions
by Peter Robson, JustSQL

This article is about the new regular expressions that were introduced by Oracle in 10g. It is based closely on a presentation that I have made to several conferences and meetings. One thing became clear from these events: a great many people are peripherally aware of these new regular expressions, but have never tried to use them. So this article is for these people, strictly beginners in the field, but people who are very comfortable using SQL itself. It is a first-timer's introduction, and will prepare you to explore the topic on your own.

Overview

Regular expressions have long been used in the Unix world. With Oracle 10g, they have now entered the SQL toolbox. For those unfamiliar with them, regular expressions can be both complex and confusing, but they do bring substantial flexibility to the developer. They enable the user to drill down into the detailed constituents of individual database text fields. They can hunt for complex text patterns embedded within fields, constrained by position in the fields (as defined by any arbitrary character position). Therefore, they add considerably to the ability of the user to explore and manipulate in great detail the contents of their database.

This paper attempts to introduce them to the competent SQL user who has no experience with regular expressions, either in 10g or in the Unix world. Rather than simply list their parameters, this paper will demonstrate, using simple worked examples, how the four types of regular expressions can be used in everyday coding situations. It will show how these expressions are built on the existing and familiar SQL character functions, and will explain in detail each of the various parameters associated with the four regular expressions.

Background

The four regular expressions introduced into Oracle 10g are all based on the four SQL character functions substr, like, instr, and replace. Their names too are recognizable, all being prefaced with the text ‘regexp’, as follows:

REGEXP_LIKE
REGEXP_SUBSTR
REGEXP_INSTR
REGEXP_REPLACE

Each one of these terms will be explained and demonstrated, showing how the original abilities of the SQL character functions have been extended into the regular expression forms. Their function falls into one of two categories: either to identify patterns of data within discrete strings or attributes, or to manipulate target patterns within discrete strings or attributes. By ‘string’ or ‘attribute’, I mean a string literal which may be part of a SQL expression, or alternatively (and more likely) an attribute value held within a database table. It is when the two activities of identification and manipulation are combined that the true strength of regular expressions can be seen. They are indeed very powerful tools with which to both search and modify data, but with this power comes danger, which we will discuss later.

REGEXP_LIKE

As indicated, this is based on the SQL character function ‘like’. Let's look at three example uses of ‘like’:

Select * from tab where field like 'START%';

Select * from tab where field like '_START%';

Select * from tab where field like '%START%';

In each case the pattern to be found is ‘START’, but the constructs enable a certain amount of choice over where that string is to be found. The ‘%’ symbol acts as the familiar wild card. The text pattern may be found at the beginning of the string (irrespective of what may follow), or commencing at the 2nd character position in the string (again irrespective of what may follow), or thirdly, anywhere in the string, irrespective of what may precede or follow. Note that in the second example, I have used the underscore symbol (‘_’), which stands for exactly one character of any kind.

The regular expression regexp_like can of course achieve the same result, but uses a slightly different construct:

Select * from tab where REGEXP_LIKE(field,'START');

Note that no percentage wild card is required: the match will happen if the string ‘START’ is found anywhere within the target field. This is where additional parameters allow you to control far more precisely what will be matched with a regular expression. A variety of ‘metacharacters’, embedded within the target search string, enables you to control the condition of the match.

Metacharacters

It is worth listing examples of metacharacters before discussing in detail how some of them work:

.        (a full stop) Any character
+        One or more of the previous expression
?        Zero or one of the previous expression
*        Zero or more of the previous expression
{m}      Exact count of the previous expression
{m,}     Count of 'm' or more
{m,n}    Count of at least 'm', maximum 'n'
[…]      List of characters to be matched
[^…]     List of characters NOT to be matched
|        Or, alternatively
(…)      Group or sub-expression of characters
^        Match from the first character in line or field
$        Match the very last character in line or field

Most of the above are self-explanatory, but you should experiment with them in simple examples to see exactly how they vary the results you can obtain.

To return to our simple example, let's use a simple metacharacter (^) to force our pattern search to begin at the first character position in our data source:

Select * from tab where REGEXP_LIKE(field,'^START');

Here we would only match those fields in which the string START was located at the beginning of the field. This we achieve by inserting the metacharacter ^ (caret, or hat) immediately before the required string. Notice that Oracle automatically distinguishes between metacharacters and normal text. If you need to include a character in your search template that is also a metacharacter, you can do this by using the ‘escape’ device of a backward slash, which forces the next character to be interpreted as normal text.

If we use the metacharacter ‘dot’ (or ‘stop’, ‘.’), we can create a slightly anomalous query. Look at this:

Select * from tab where REGEXP_LIKE(field,'^...');

Here we are looking for any three contiguous characters in our data source; they can be anything. We have also prefaced those dots with the hat (^), instructing the search to begin at the beginning of the data field. Syntactically this construct is fine, it is just a bit superfluous: any field that holds three or more characters will match, with or without the hat. It is easy to make errors like this with regular expressions, and conversely, to create an expression which is syntactically correct, but which cannot possibly retrieve a result!

Our simple query could also be written as follows, which makes more explicit the use of a pair of (complementary) metacharacters, the round brackets:

Select * from tab where REGEXP_LIKE(field,'(^START)');

But in this case the round brackets are not strictly required, and might confuse.

There is a final aspect of regexp_like to be noted, namely the optional ‘match parameter’, which is used to control the case of the character pattern being searched for. So again, our simple example can include this parameter as follows:

Select * from tab where REGEXP_LIKE(field,'^START','i');

Here the letter ‘i’ is used to indicate that the string START can be any mixture of upper or lower case; the match is to be case independent. Use of the alternative value ‘c’ will force the character string to match exactly the case of the string in the query. (The match parameter also accepts other values, such as ‘n’ and ‘m’, which govern the handling of newlines rather than case.)

The formal definition of regexp_like is as follows:

REGEXP_LIKE
( source_string, pattern
  [, match_parameter] )

Because it is enclosed in square brackets, the match parameter is defined as optional.

REGEXP_SUBSTR

We now know enough to progress on to the next regular expression, regexp_substr. This is in fact highly complementary with regexp_like; the two work together most effectively. The reason for this is that you can never actually see what is being matched by regexp_like. All you see is the retrieval that you requested if the match is made. The two are entirely disconnected. For example, you may inadvertently forget to switch the match parameter from case dependent to case independent, and may therefore get a result, but based entirely on a false premise. This is where regexp_substr allows you to display the actual pattern that is successfully matched with the regexp_like function.

Let's start with the formal definition of this expression:

REGEXP_SUBSTR
( source, pattern
  [, position
  [, occurrence
  [, match_parameter ] ] ] )

Unlike regexp_like, this expression enables you to retrieve exactly the pattern within the text string or table field that matches the pattern as defined. So if the regexp_like and regexp_substr constructs are identical, you can be absolutely confident that what you see retrieved by regexp_substr is exactly what is being matched by regexp_like. Notice that there are a couple more optional parameters available for use with regexp_substr. This makes it a superset of the regexp_like construct, so enabling any construct in regexp_like to be used with regexp_substr. Let's see how this can work:

Select REGEXP_SUBSTR(field,'START') from tab where REGEXP_LIKE(field,'START');

Complexity starts to appear if you want to use the match_parameter. In regexp_substr it is the third optional parameter, which obliges you to insert values for the previous two parameters, namely for ‘position’ and ‘occurrence’. This is easy, however. ‘Position’ refers to the position in the text string from which to start looking for a pattern. The numeric ‘1’ will force the examination to commence from the beginning, and is the default. The ‘occurrence’ parameter indicates which occurrence of the pattern you wish to retrieve; the default is 1, the first. (The occurrence parameter of regexp_instr, which we will discuss later, is read in the same way.)

We can modify our example to force a case independent search as follows, but we must insert at least the default values for the two preceding parameters in the regexp_substr expression:

Select REGEXP_SUBSTR(field,'START',1,1,'i') from tab where REGEXP_LIKE(field,'START','i');

The great benefit of using regexp_like and regexp_substr together is that they are an ideal method for debugging complex expressions. As with so much of SQL, it is the old problem of the declarative versus the procedural. Often one can build a syntactically correct SQL regular expression, but the actual semantic meaning of the expression can be very different from what you think you have written. By using these two expressions in tandem, and by doing several tests, you can both learn a lot more about regular expressions, and furthermore, be confident that the answer you get is the answer to the question you really asked, and not what you thought you asked.

REGEXP_INSTR

The third of the four regular expressions is distinctly different from the two previous examples that we have looked at. Now, the value returned from regexp_instr is a numeric value which identifies the position of the start (or the end) of a particular pattern in the data source character string. The original SQL character function is defined as follows:

INSTR
( source, pattern
  [, starting at M
  [, the Nth occurrence ] ] )

The regular expression version takes this basic form and adds a couple of additional properties, the return_option and the match_parameter. Let's have a look at the formal definition:

REGEXP_INSTR
( source, pattern
  [, starting at M
  [, the Nth occurrence
  [, return_option
  [, match_parameter ] ] ] ] )

It may be easier to understand what is happening if the above definition is transposed into free text: the expression is looking for the position of the Nth occurrence of pattern in the source, starting at character position M, with the return_option of either the start or end position of the pattern, with an optional match_parameter governing the case sensitivity of the pattern search.

As with the two previous expressions, the source and the pattern operate in exactly the same way, complete with controlling metacharacters when required. Let's look at the following example:

select regexp_instr('Two beers or three beers, sir?','beer',1,1,0,'i') result from dual;

RESULT
------
     5

The result gives the character position in the string of the first occurrence of the first letter of the pattern ‘beer’, which is of course the integer 5. We can modify the query to look for the second instance of ‘beer’:

select regexp_instr('Two beers or three beers, sir?','beer',1,2,0,'i') result from dual;

RESULT
------
    20

If the return option is changed from ‘0’ to ‘1’, to return the integer indicating the end of the pattern string, the query and result appear as follows:

select regexp_instr('Two beers or three beers, sir?','beer',1,2,1,'i') result from dual;

RESULT
------
    24

But here is a subtlety that is easily overlooked. The result returned is not the numeric position of the last character of the search pattern ‘beer’ (which is 23), but rather the next position after the end of the pattern, which is the letter ‘s’.

REGEXP_REPLACE

The fourth of the regular expressions is the only one to actually manipulate a target pattern. As previously, it is based on the original SQL character function of ‘replace’, with the addition of three extra constraints. The definition of replace looks like this:

REPLACE
( source, pattern, replace_string )

of which an example could be as follows:

Select replace('this is a small example','small','tiny') from dual;

The result takes the target string, substitutes ‘small’ with ‘tiny’ and returns the modified string ‘this is a tiny example’.

This structure is developed further in regexp_replace, which has the following definition:

REGEXP_REPLACE
( source, pattern
  [, replace_string
  [, position
  [, occurrence
  [, match_parameter ] ] ] ] )

Once again, let's see what an example of this expression might look like. We can use it to prune out multiple instances of spaces between characters, as here:

select regexp_replace('This   written  by a    not-very   good  typist','( ){2,}',' ') Result from dual;

The construct above contains three items: a string with multiple spaces between words, a pattern to search for (a single space identified in a closed bracket construct, followed by a numeric argument which says ‘find at least 2 contiguous spaces with no upper limit on how many occur together’), and finally the string with which to replace these multiple spaces, namely one single space.

The result would return the following string:

'This written by a not-very good typist'

In the above example, we have not seen the use of either ‘position’, ‘occurrence’, or ‘match_parameter’. Just as with the previous regular expressions, their use is exactly the same. The following example shows how one can apply the replacement selectively within a string:

select regexp_replace('Regexp could be used to encrypt text','(e)','*XYZ*',10,1,'i') result from dual;

In this example, we are going to use regexp_replace as a crude form of encryption, by looking for the first occurrence of the letter ‘e’ after character position 10 in the string, and then replacing that letter with the string ‘*XYZ*’. The final result looks like this:

'Regexp could b*XYZ* used to encrypt text'

Syntax

So far we have only considered the simplest of examples, and deliberately so. We have touched on the variety of metacharacters available to use. These are the characters which retain their own meaning within a pattern string, and which are used to apply constraints to the interpretation of that character string. But there are traps here. Look at the use of the ‘hat’ or ‘caret’ (^). It can have two entirely different meanings, dependent on its context. In this form ( [^…] ) it is defining characters which must not be matched. Here ( ^... ) it requires the parser to start hunting for the pattern from the beginning of the text data source (where ‘...’ stands for any three characters). Note also the use of the back slash \, the ‘escape’ character, which will force a metacharacter to be treated as an ordinary text character. It is all too easy to overlook these small aspects of regular expression construction, which if misused can alter entirely the behaviour of the query.

There are also groups of characters known as ‘character classes’, as distinct from metacharacters. These are less confusing, and indeed if used carefully, can do much to remove some of the confusion which can arise in trying to manage complex strings. Let's have a look at some of the more useful character classes:

[:alpha:]    alphabetic characters only
[:digit:]    numerics
[:lower:]    lower case alphabetic
[:cntrl:]    nonprinting or control characters
[:xdigit:]   hexadecimal
[:alnum:]    an alpha-numeric string

You can see immediately how a compound pattern expression, looking for both upper and lower case letters and numerics, constructed as follows:

[A-Za-z0-9]

can be simplified into a single expression as follows:

[[:alnum:]]

Note how the character classes are defined with opening and closing square brackets. Because we have replaced the pattern string ‘A-Za-z0-9’, which was itself contained within square brackets, those outer brackets have to be retained. The pairs of square brackets are a familiar feature of regular expressions, so you will immediately recognize that a character class is being used.

Synthesis

The above gentle introduction to regular expressions might seem straightforward, but it masks a degree of complexity which can become seriously confusing. Users familiar with the old SQL character functions will recall that you often had to use multiple functions to achieve the desired result. For example, a replace action could be used with one or more instr functions to restrict both the start and the end point of a replace action, possibly even with the use of the ‘length’ function. This degree of complexity has been partially resolved with regular expressions, in that this sort of control can now be obtained within one expression. But equally, there is no reason why one cannot embed a succession of regular expressions, each within the other. Any one of these expressions can be embedded in any other, which will of course automatically increase the overall complexity.

If you are building a complex suite of expressions, be sure to build incrementally, and test at each stage as you build up the layers. If you are trying to understand what one expression is actually doing, take it apart and break it down into its individual components, according to the formal definition of each expression. Clarity of layout is the great issue here, just as it is in writing and reading a complex piece of SQL code.

Conclusions

Regular expressions are a powerful addition to the SQL toolbox. As well as becoming familiar with their complexity and slightly confusing syntax (and this may not be a rapid task!), you should be rather careful where you employ these techniques. Inevitably, their use in a simple query against a text field in a table, such as the examples discussed in this article, will result in a full table scan. With very large tables, this can result in disastrously slow execution times. Where large tables are involved, try to ensure that the regular expression operates on the smallest possible subset of a query result, by ensuring that as much data as possible is retrieved using high performance indexes, before applying time consuming regular expression processing.

Regular expressions can also provide beneficial functionality as part of pre-load validation, but again, only so long as load volumes are kept under careful review.

Finally, beware of using regular expressions when there is an easier way of accomplishing the same task! It may well be the case that with some carefully constructed SQL code, you can achieve the same result, with a consequent performance benefit too. If you still can't achieve your objectives with ‘pure’ SQL, look again at the old character functions before finally resorting to the regular expressions.

Further help

Any internet search in which the words ‘Oracle’ and ‘Regular Expressions’ are submitted will return many useful pages of information. In particular, I would commend the following works for further reading, and indeed for ongoing reference material. Be careful with any text book addressing regular expressions in general. They will tend to be directed at the Unix / shell script community. Just to confuse things even further, every different context in which regular expressions appear tends to have its own syntax variants.

“Writing Better SQL Using Regular Expressions”, Alice Rischert. A previous article in Oramag, which a Google search will find. Highly recommended.

Oracle on-line documentation: http://download-uk.oracle.com/docs/cd/B14117_01/server.101/b10759/toc.htm (This is ‘SQL Reference for 10g’; search on ‘regexp’.)

An excellent blog / discussion on regular expressions by Andrew Clarke (Radio Free Tooting): http://radiofreetooting.blogspot.com/2006/09/useful-article-on-regex-in-oracle-10g.html

“Oracle Regular Expression Pocket Reference” (O’Reilly), Jonathan Gennick & Peter Linsley (cheap at $10!)

Peter Robson is a database specialist with a particular interest in SQL. His web site www.justsql.com contains a number of unusual examples of the use of SQL. Based in Scotland, he is also a Director of the UK Oracle User Group.
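
A worked script

The article recommends running regexp_like and regexp_substr in tandem and building expressions up in small, tested steps. The script below is a minimal sketch of that habit, assuming a scratch Oracle 10g (or later) schema in which you are allowed to create a table. The table name regex_demo, its rows and the telephone-style literal are hypothetical and chosen only for illustration; the ‘beers’ queries reuse the worked example from the REGEXP_INSTR discussion.

```sql
-- Hypothetical scratch table; the name and the rows are illustrative only.
CREATE TABLE regex_demo (field VARCHAR2(40));

INSERT INTO regex_demo VALUES ('START of the line');
INSERT INTO regex_demo VALUES ('a late start');
INSERT INTO regex_demo VALUES ('RESTART here');
INSERT INTO regex_demo VALUES ('no match at all');

-- 1. The debugging pair: REGEXP_SUBSTR displays what REGEXP_LIKE matched.
--    With the 'i' match parameter, the first three rows qualify, and the
--    MATCHED column shows the text exactly as it is stored in each row.
SELECT field,
       REGEXP_SUBSTR(field, 'START', 1, 1, 'i') AS matched
FROM   regex_demo
WHERE  REGEXP_LIKE(field, 'START', 'i');

-- 2. Anchoring with ^ : only rows that begin with the pattern qualify,
--    so 'a late start' and 'RESTART here' drop out.
SELECT field
FROM   regex_demo
WHERE  REGEXP_LIKE(field, '^START', 'i');

-- 3. Start and end of the second 'beer'.
--    return_option 0 gives the position of the first matched character (20);
--    return_option 1 gives the position just after the match (24),
--    so the last matched character sits at ends_after - 1.
SELECT REGEXP_INSTR(s, 'beer', 1, 2, 0, 'i')     AS starts_at,
       REGEXP_INSTR(s, 'beer', 1, 2, 1, 'i')     AS ends_after,
       REGEXP_INSTR(s, 'beer', 1, 2, 1, 'i') - 1 AS last_char_at
FROM  (SELECT 'Two beers or three beers, sir?' AS s FROM dual);

-- 4. A character class inside a negated list: remove every character
--    that is not alphanumeric (the replacement string is empty).
SELECT REGEXP_REPLACE('Tel: (0131) 555-0100', '[^[:alnum:]]', '') AS cleaned
FROM   dual;

-- Tidy up the scratch table.
DROP TABLE regex_demo;
```

Running query 1 first and only then tightening the pattern, as in query 2, keeps each step small enough to check by eye. The same approach carries over to regexp_replace: preview the match with regexp_substr before letting the replacement loose on real data.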