Introduction to Regular Expressions in 4D v11 SQL

By Robert Molina, Technical Support Engineer, 4D Inc.
Technical Note 07-47

Abstract
----------------------------------------------------------------------
Regular expressions (regex) are a meta-language that gives developers a means to easily validate and search for characters within text. 4D v11 SQL natively supports this language, reflecting 4D’s commitment to industry-wide standards. This Technical Note provides background information, the benefits of using regex, and examples within 4D.

Background Information
----------------------------------------------------------------------
Regular expressions can be found in many programming and scripting languages. They are also used in word processing applications and text editors.

The foundations of regular expressions come from the early work of Stephen Cole Kleene and Ken Thompson. Thompson built Kleene’s notation for regular sets into the Unix editor “ed”, which in turn led to the popular search tool “grep”. Since then, many variations of regular expression tools have been developed based on Thompson’s implementation.

One of the popular programming languages to take advantage of regular expressions is Perl. Perl took the regex library written by Henry Spencer and made additions for use in its language. Based on Perl’s implementation, Philip Hazel then developed PCRE (Perl Compatible Regular Expressions), a version of regex that is commonly used in the Apache web server and PHP today.

4D’s regular expression engine comes from the regular expression package of ICU (International Components for Unicode).
Information about this package can be found here: http://www.icu-project.org/userguide/regexp.html

What is a regular expression?
----------------------------------------------------------------------
A regular expression is a structured string of characters that describes a pattern to be matched within text. Below is an example of a regular expression.

    .*\.txt

This regular expression matches any run of characters followed by “.txt”, so it finds a match in any string that contains “.txt”. At first glance the expression may look foreign, which is common when seeing a new language. Like any language, it has a syntax that needs to be learned and followed. The syntax has three kinds of components: literals, metacharacters, and operators.

Literals
Literals are constant values that can mean only one thing. A literal in a regular expression stands on its own and matches character for character. For instance, the literal:

    regex

tells the regular expression engine to search for the text “regex” and nothing else.

Metacharacters
Along with literals, there are metacharacters. Metacharacters are the opposite of literals in that they have special meanings. For instance, a period (.) represents any single character. Therefore the regular expression:

    .example

will match Aexample, Bexample, and <example. Any single character in front of the string “example” results in a match.

As another example, what if the task is to find “Regex” right before the end of the line? The metacharacter that helps with this task is $, which represents the end of the line. Below is the regular expression:

    Regex$

The regular expression engine will search the text for the string “Regex” followed by the end of the line.

For the list of metacharacters used in ICU regular expressions please go to: http://www.icu-project.org/userguide/regexp.html

Operators
Regular expressions also contain operators. Some common operators are *, +, and |.
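Before turning to an operator example, the literal and metacharacter patterns above can be tried out in a small sketch. It uses Python's `re` module purely as a stand-in that is easy to run; it is not the ICU engine that 4D uses, and the two engines may differ in details. The helper name `found` and the sample strings are assumptions made for illustration.

```python
import re

def found(pattern, text):
    """Hypothetical helper: True if the pattern matches anywhere in the text."""
    return re.search(pattern, text) is not None

# Literal: only the exact character sequence "regex" matches.
literal_results = [found(r"regex", s)
                   for s in ("a regex primer", "a regexp primer", "a reg ex primer")]

# Metacharacter ".": some single character must come before "example".
dot_results = [found(r".example", s)
               for s in ("Aexample", "Bexample", "<example", "example")]

# Metacharacter "$": "Regex" must sit at the end of the text.
end_results = [found(r"Regex$", s)
               for s in ("Intro to Regex", "Regex intro")]

# ".*\.txt": any run of characters followed by a literal ".txt".
txt_results = [found(r".*\.txt", s)
               for s in ("notes.txt", "notes.txt.bak", "notestxt")]

print(literal_results)  # [True, True, False]
print(dot_results)      # [True, True, True, False]
print(end_results)      # [True, False]
print(txt_results)      # [True, True, False]
```

Note that “example” on its own does not match `.example`, because the period requires one character in front of it, and that the backslash in `\.txt` turns the period back into a literal.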
For instance, if the task is to find either “Regex” or “regex”, the earlier pattern can be written with an operator:

    (R|r)egex

The operator used here is the “OR” operator, which is symbolized by the pipe character (|). The alternatives are grouped with parentheses; inside square brackets the pipe is an ordinary character, so [R|r]egex would also match “|egex”.

For the list of operators used in ICU regular expressions please go to: http://www.icu-project.org/userguide/regexp.html

In addition, to learn more about pattern matching with regular expressions, a popular reference is the book "Mastering Regular Expressions, Second Edition" by Jeffrey E. F. Friedl, O'Reilly & Associates; 2nd edition (July 15, 2002).

Why Use Regular Expressions in 4D?
----------------------------------------------------------------------
There are two main reasons for using regex in 4D:

Efficiency
One of the main advantages of using regular expressions within 4D, or any programming language, is efficiency in code. Using regular expressions reduces the amount of code written for a given text-handling task. For instance, here is some 4D code that searches for the string “Regex”.

    $result:=Position("Regex";[Table_1]Field_2;1;lengthfound)
    If ($result>0)
      `The string has been found
      $resstring:=Substring([Table_1]Field_2;$result;lengthfound)
      `Position ignores case here, so confirm that the first character is an uppercase R
      If (Character code($resstring[[1]])=Character code("R"))
        ALERT("We found "+$resstring)
      End if
    End if

As the above example shows, plain 4D needs a search, a substring extraction, and an explicit character-code test to achieve the same result as the simple “Regex” regular expression used with the Match regex command, shown below:

    $result:=Match regex("Regex";[Table_1]Field_2;1;posfound;lengthfound)
    If ($result=True)
      `The string has been found
      ALERT("We found "+Substring([Table_1]Field_2;posfound;lengthfound))
    End if

What makes the regular expression efficient compared to the 4D code is its support for literals, which match character for character, including case.
As the example above shows, string comparison in 4D is not case sensitive, so finding exactly “Regex” takes more work in plain 4D than it does with a regular expression.

Industry Standard
Because regular expressions are used in numerous applications, they have become the de facto industry standard for pattern matching in text. Although there is more than one regular expression engine, the basics are essentially the same. Therefore, learning regular expressions may help in future projects that involve text searching. Below is a list of regular expression libraries along with applications and languages that use regular expressions:

The Match regex Command
----------------------------------------------------------------------
Regular expressions are now part of the 4D language in 4D v11 SQL. The command that makes this possible is Match regex. Below is the description of the parameters for the command:

Optional Parameter: start
The optional start parameter allows searching for a text pattern from a specific position. For example, what if the task is to find the second instance of the string “crunch” within “crunch crunch”? The 4D code to do so would be:

    vfound:=Match regex("crunch";"crunch crunch";8;pos_found;length_found)

The command tells the regex engine to go to position 8 within the string and find “crunch”. Using the start parameter provides a means to skip earlier matches as well as to prevent the engine from parsing more than it needs to.

In contrast, if no optional parameters are added:

    vfound:=Match regex("crunch";"crunch crunch")

the regex engine looks for a complete match, meaning the pattern must match the entire string. Since “crunch” is not equal to “crunch crunch”, the result is false.

Optional Parameters: pos_found and length_found
These two parameters can return either a single value or an array of values.
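To make the start parameter and the returned position and length more concrete, here is a rough analogue of the behaviour described above, written in Python rather than 4D so that it can be run anywhere. The helpers `matches_whole` and `find_from` are hypothetical names, Python's `re` module stands in for the ICU engine, and returning `None` for the position and length when nothing is found is a choice made for this sketch, not a statement about what 4D returns.

```python
import re

def matches_whole(pattern, text):
    """Analogue of calling the command with no optional parameters:
    True only if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

def find_from(pattern, text, start):
    """Analogue of passing start, pos_found and length_found.

    start is 1-based, as in the 4D examples. Returns a tuple
    (found, position, length), with a 1-based position.
    """
    m = re.compile(pattern).search(text, start - 1)  # Python positions are 0-based
    if m is None:
        return (False, None, None)
    return (True, m.start() + 1, m.end() - m.start())

print(matches_whole("crunch", "crunch crunch"))  # False
print(find_from("crunch", "crunch crunch", 1))   # (True, 1, 6)
print(find_from("crunch", "crunch crunch", 8))   # (True, 8, 6)
```

Starting at position 2 also reports the second “crunch” at position 8, since the first occurrence has already been passed; this is the sense in which a start position lets a search skip earlier matches.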
Below is an example of the parameters returning single values:

    $start:=1
    $result:=Match regex("XL";"Super Bowl XL.";$start;posfound;lengthfound)

The variable posfound receives the position of the match, which is 12 here. Here is an illustration of the character positions:

    S  u  p  e  r     B  o  w  l     X  L  .
    1  2  3  4  5  6  7  8  9  10 11 12 13 14

The lengthfound variable receives 2, since the string being matched is the two characters “XL”.

As mentioned earlier in this section, these parameters can also be arrays. The use of arrays enables “Capture Groups” within regular expressions (Capture Groups are explored in the next section).

Optional Parameter: *
The asterisk (*) parameter restricts the search to the position given by the start parameter. Adding this parameter can produce different results with the same regular expression. For instance, here is an example:

    $result:=Match regex("quick";"The quick brown fox";1;$pos_found;$length_found)

The above code returns true since “quick” is within the string. Here is another example:

    $result:=Match regex("quick";"The quick brown fox";1;$pos_found;$length_found;*)

The code above returns false since the * parameter has been added, which tells the regex engine to search only at position 1. The string at position 1 is “The”, which does not match “quick”. Here is a third example:

    vfound:=Match regex("quick";"The quick brown fox";5;$pos_found;$length_found;*)

The code above returns true since the start parameter is set to 5, and the string at position 5 is “quick”, which is the target string.

Capture Groups
----------------------------------------------------------------------
This feature allows separating a regular expression into groups. For instance, what if the task is to match the date format “MM/DD/YYYY” and group MM, DD, and YYYY? Here is the 4D code to do it:

    ARRAY LONGINT(posfound;0)
    ARRAY LONGINT(lenghtfound;0)
