
Introduction to Regular Expressions in 4D v11 SQL

By Robert Molina, Technical Support Engineer, 4D Inc.

Technical Note 07-47

Abstract
------

Regular expressions (regex) are a meta-language that gives developers an easy way to validate and search for characters within text. 4D v11 SQL natively supports this language, reflecting 4D’s commitment to industry-wide standards. This Technical Note provides background information, the benefits of using regex, and examples within 4D.

Background Information
------

Regular expressions can be found in many programming and scripting languages, and they are also used in word processing applications and text editors. The foundations of regular expressions come from the early work of Stephen Cole Kleene and Ken Thompson. Ken Thompson built Kleene’s notation, called regular sets, into the Unix editor “ed”, which in turn led to the popular search tool “grep”. Since then, many variations of regular expression tools have been developed based on Thompson’s implementation.

One of the popular programming languages to take advantage of regular expressions is Perl. Perl took the regex library written by Henry Spencer and extended it for use in its language. Based on Perl’s implementation, Philip Hazel then developed PCRE (Perl Compatible Regular Expressions), a regex library that is commonly used in Apache web servers and PHP today.

4D’s regex engine comes from the regular expression package of ICU (International Components for Unicode). Information about this package can be found here: http://www.icu-project.org/userguide/regexp.html

What is a regular expression?
------

A regular expression is a structured string of characters that matches patterns within text. Below is an example of a regular expression.

.*\.txt

This regular expression matches any run of characters followed by “.txt”, such as a file name with that extension. At first glance the above expression may look foreign, which is common when seeing a new language. Like any language, it has a syntax that needs to be learned and followed. The syntax involves three components: literals, metacharacters, and operators.

Literals

Literals are constant values that can only mean one thing and nothing else. A literal in a regular expression stands on its own and is matched character by character. For instance, the literal:

regex

informs the regular expression engine to search for the text “regex” and nothing else.

Metacharacters

Along with literals, there are also metacharacters. Metacharacters are the opposite of literals in that they have special meanings. For instance, a period (.) represents any single character. Therefore the regular expression:

.example

will match strings such as Aexample and Bexample.

In another example, what if the task is to find “Regex” right before the end of the line? The metacharacter that will help aid in this task is $, which represents the end of the line. Below is the regular expression:

Regex$

The regular expression engine will search the text for the string “Regex” immediately followed by the end of the line.
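As a rough illustration of the $ anchor, the sketch below uses the Match regex command that is described later in this note. The sample strings and the variable names ($atEnd, $inMiddle, $pos, $len) are made up for this example.

```4d
C_LONGINT($pos;$len)
C_BOOLEAN($atEnd;$inMiddle)
` "Regex" is the last word of the text, so the $ anchor is satisfied
$atEnd:=Match regex("Regex$";"Learning Regex";1;$pos;$len)
` Here "Regex" is followed by more text, so the same pattern should not match
$inMiddle:=Match regex("Regex$";"Regex is useful";1;$pos;$len)
```

With these inputs, $atEnd should be true and $inMiddle false.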

For the list of metacharacters used in ICU regular expressions please go to: http://www.icu-project.org/userguide/regexp.html

Operators

Regular expressions also contain operators. Some common operators are *, +, and |. For instance, what if the task is to find either “Regex” or “regex”? The above example can be rewritten with an operator:

(R|r)egex

The operator used here is the “OR” operator, which is symbolized by the pipe character (|). The parentheses limit the alternatives to the first letter. Note that the pipe only acts as an operator outside square brackets; inside a character class such as [R|r] it is treated as a literal character. For the list of operators used in ICU regular expressions please go to: http://www.icu-project.org/userguide/regexp.html
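Two common ways to accept either capitalization are alternation inside parentheses, (R|r)egex, and a character class, [Rr]egex. The following sketch tries both with the Match regex command described later in this note; the sample strings and variable names are assumptions chosen for illustration.

```4d
C_LONGINT($pos;$len)
C_BOOLEAN($viaGroup;$viaClass)
` Alternation: the group accepts either "R" or "r" before "egex"
$viaGroup:=Match regex("(R|r)egex";"Using regex in 4D";1;$pos;$len)
` Character class: [Rr] accepts exactly one of the listed characters
$viaClass:=Match regex("[Rr]egex";"Using Regex in 4D";1;$pos;$len)
```

Both calls are expected to return true. For a single character the character class is the more compact form, while alternation also works for whole words such as (Regex|Pattern).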

In addition, to learn more about pattern matching with regular expressions, a popular reference is the book "Mastering Regular Expressions, Second Edition" by Jeffrey E. F. Friedl, O'Reilly & Associates; 2nd edition (July 15, 2002).

Why Use Regular Expressions in 4D?
------

There are two main reasons for using regex in 4D:

Efficiency

One of the main advantages of using regular expressions, in 4D or elsewhere, is efficiency in code. Regular expressions reduce the amount of code that has to be written for a given task involving text data. For instance, here is some 4D code that will search for the string “Regex”.

$result:=Position("Regex";[Table_1]Field_2;1;lengthfound)
If ($result>0)
  `The string has been found, but Position ignores case
  $resstring:=Substring([Table_1]Field_2;$result;lengthfound)
  `Compare character codes to make sure the first letter is an uppercase R
  If (Character code($resstring[[1]])=Character code("R"))
    ALERT("We found "+$resstring)
  End if
End if

As the above example shows, the Position approach needs an extra substring extraction and a character-code comparison to achieve the same result as the simple “Regex” regular expression, shown below with the Match regex command:

$result:=Match regex("Regex";[Table_1]Field_2;1;posfound;lengthfound)
If ($result=True)
  `The string has been found
  ALERT("We found "+Substring([Table_1]Field_2;posfound;lengthfound))
End if

What makes the regular expression more efficient than the Position-based code is that its literals are matched exactly, including case. 4D string comparisons are not case sensitive, so Position also finds “regex” when asked for “Regex”, and the extra character-code test is needed to tell the two apart.

Industry Standard

Because regular expressions are used in numerous applications, they have become the de facto industry standard for pattern matching in text. Although there is more than one regular expression engine, the basics are essentially the same. Therefore, learning regular expressions may help in future projects that involve text searching. Below is a list of regular expression libraries along with applications and languages that use regular expressions:

The Match regex Command
------

Regular expressions are now part of the 4D language in 4D v11 SQL. The command that makes this possible is Match regex. Below is the description of the parameters for the command:

Optional Parameter: start

The optional start parameter allows searching for a text pattern at a specific position. For example, what if the task is to find the second instance of the string “crunch” within “crunch crunch”? The 4D code to do so would be:

vfound:=Match regex("crunch";"crunch crunch";8;pos_found;length_found)

The command tells the regex engine to go to position 8 within the string and find “crunch” from there. Using the start parameter provides a means to skip earlier matches as well as prevent the engine from parsing more than it needs to. In contrast, if no optional parameters are added:

vfound:=Match regex("crunch";"crunch crunch")

this tells the regex engine to look for a complete match, that is, the pattern must match the entire string. Therefore, since “crunch” does not match the whole of “crunch crunch”, the result is false.
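The complete-match behavior of the two-parameter form is convenient for validating input. The sketch below checks that a string has the shape MM/DD/YYYY; the variable names ($input, $valid) are assumptions for this example, and the pattern only checks the layout of digits and slashes, not whether the date exists in the calendar.

```4d
C_TEXT($input)
C_BOOLEAN($valid)
$input:="09/25/2007"
` Without optional parameters the pattern must match the entire string
$valid:=Match regex("[0-9]{2}/[0-9]{2}/[0-9]{4}";$input)
If ($valid)
  ALERT($input+" has the expected format")
Else
  ALERT($input+" does not have the expected format")
End if
```

A value such as "Due 09/25/2007" would be rejected by this form, because the extra text prevents a complete match.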

Optional Parameters: pos_found and length_found

These two parameters can either return a single value or an array of values. Below is an example of the parameters returning single values:

$start:=1
$result:=Match regex("XL";"Super Bowl XL.";$start;posfound;lengthfound)

The variable posfound receives the position of the match, which is 12. Counting characters from 1, “Super” occupies positions 1 to 5, “Bowl” occupies positions 7 to 10, and “XL” begins at position 12.

The lengthfound variable will return 2 since the string being matched is the two characters “XL”.

As mentioned earlier in this section, these parameters can also be arrays. The use of arrays allows the feature of “Capture Groups” within regular expressions (Capture Groups will be explored in the next section).
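Because the position and length of each match are returned, they can be fed back into the start parameter to visit every occurrence in a string. The following is a minimal sketch of that idea, with variable names chosen for this example. It assumes the pattern never produces an empty match, since an empty match would leave $start unchanged and the loop would not advance.

```4d
C_TEXT($text)
C_LONGINT($start;$pos;$len;$count)
C_BOOLEAN($found)
$text:="crunch crunch"
$start:=1
$count:=0
$found:=True
` Stop when a search fails or when the start position runs past the text
While ($found & ($start<=Length($text)))
  $found:=Match regex("crunch";$text;$start;$pos;$len)
  If ($found)
    $count:=$count+1
    ` Resume the search just past the match that was found
    $start:=$pos+$len
  End if
End while
ALERT(String($count)+" occurrence(s) found")
```

For the sample text the loop should report two occurrences, found at positions 1 and 8.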

Optional Parameter: *

The asterisk (*) parameter restricts the search to the position given by the start parameter: the pattern must match beginning exactly at that position. Adding this parameter can produce different results with the same regular expression. For instance, here is an example:

$result:=Match regex("quick";"The quick brown fox";1;$pos_found; $length_found)

The above code will return true since “quick” is within the string. Here is another example:

$result:=Match regex("quick";"The quick brown fox";1;$pos_found; $length_found;*)

The code above will return false since the * parameter has been added, which tells the regex engine to only search at position 1. The text starting at position 1 is “The”, which does not match “quick”. Here is a third example:

vfound:=Match regex("quick";"The quick brown fox";5;$pos_found; $length_found;*)

The code above will return true since the position parameter is set to 5 and at position 5 is the string “quick” which is the target string.
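One practical use of the * parameter is testing what a string begins with. The sketch below checks whether a reference code starts with two uppercase letters; the value of $code and the variable names are hypothetical and only serve the example.

```4d
C_TEXT($code)
C_LONGINT($pos;$len)
C_BOOLEAN($hasPrefix)
$code:="TN0747"
` With *, the pattern is only tried at position 1 and is not searched for further along
$hasPrefix:=Match regex("[A-Z]{2}";$code;1;$pos;$len;*)
```

For this value $hasPrefix should be true, whereas a code such as "07TN47" should give false, because the two letters do not sit at position 1.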

Capture Groups
------

This feature allows separating a regular expression into groups. For instance, what if the task was to match the date format “MM/DD/YYYY” and group MM, DD, and YYYY? Here is the 4D code to do it:

ARRAY LONGINT(posfound;0)
ARRAY LONGINT(lengthfound;0)
$start:=1
$result:=Match regex("([0-9]{2})/([0-9]{2})/([0-9]{4})";"09/25/2007";$start;posfound;lengthfound)

One thing to note here is the use of parentheses to specify each group. The end result of the code above is that $result will have the value of true and the arrays posfound and lengthfound will have the following values:

posfound{0}=1
posfound{1}=1
posfound{2}=4
posfound{3}=7

As these values show, there are four entries. posfound{0} contains the first position of the whole matched pattern, which is 09/25/2007. posfound{1} belongs to the first subgroup and holds the first position of the month. posfound{2} belongs to the second subgroup and holds the first position of the day. Lastly, posfound{3} belongs to the third subgroup and holds the first position of the year.

Note: When using the capture group feature, the parameters pos_found and length_found need to be declared as arrays.

Note: Groups are numbered from left to right within the regular expression.

Below are the values for the corresponding lengths.

lengthfound{0}=10
lengthfound{1}=2
lengthfound{2}=2
lengthfound{3}=4
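Taken together, the position and length arrays make it possible to pull out every captured group in a loop. The sketch below uses local arrays and variable names chosen for this example ($pos, $len, $date, $parts) rather than the names used above.

```4d
ARRAY LONGINT($pos;0)
ARRAY LONGINT($len;0)
C_TEXT($date;$parts)
C_LONGINT($i)
$date:="09/25/2007"
$parts:=""
If (Match regex("([0-9]{2})/([0-9]{2})/([0-9]{4})";$date;1;$pos;$len))
  ` Element 0 describes the whole match, so the groups start at element 1
  For ($i;1;Size of array($pos))
    $parts:=$parts+"Group "+String($i)+": "+Substring($date;$pos{$i};$len{$i})+Char(13)
  End for
  ALERT($parts)
End if
```

For the sample date the alert should list 09, 25, and 2007 as groups 1 to 3.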

With these values, extracting the specific data from the text will be simple. For instance, to extract the year, the following code will display an alert with the year displayed:

ALERT(Substring("09/25/2007";posfound{3};lengthfound{3}))

Regex Errors
------

When a regular expression is invalid, 4D will display an error dialog.

Although the error dialog may not point out the specific error, here are some tips for writing regular expressions:

When searching for characters that are operators or metacharacters, precede them with the “\” character, which tells the regex engine to treat the character that follows as a literal. For instance, the + character is an operator. In order to search for it within a string the regular expression will need to be:

\+

In 4D, the code would look like:

vfound:=Match regex("\\+";"test+";1;$pos_found;$length_found)

The above code contains a double backslash, “\\”. This is because 4D also uses “\” as an escape character in string literals, thus “\\” tells 4D to treat the second backslash as a literal, and the regex engine receives a single “\”.
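The same double-backslash rule applies to the .*\.txt expression shown at the beginning of this note. As a small sketch, with made-up file names and variable names:

```4d
C_BOOLEAN($isTxt;$notTxt)
` "\\." reaches the regex engine as \. and matches only a real period
$isTxt:=Match regex(".*\\.txt";"notes.txt")
` Here the character before "txt" is an underscore, so the match should fail
$notTxt:=Match regex(".*\\.txt";"notes_txt")
```

Both calls use the two-parameter form, so the whole string has to match. $isTxt should be true and $notTxt false; with an unescaped period, ".*.txt", the second string would match as well.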

Conclusion
------

Regular expressions in 4D give developers a means to validate, search, and find text data. They decrease the amount of code written, which also reduces the time spent in development. In addition, knowledge of regular expressions will help in future development projects, since regex is a widely used meta-language.

Related Resources
------

http://www.regular-expressions.info/tutorial.html
http://www.evolt.org/article/rating/20/22700/