Package 'Rebus'
Total Page:16
File Type:pdf, Size:1020Kb
Package ‘rebus’ April 25, 2017 Type Package Title Build Regular Expressions in a Human Readable Way Version 0.1-3 Date 2017-04-25 Author Richard Cotton [aut, cre] Maintainer Richard Cotton <[email protected]> Description Build regular expressions piece by piece using human readable code. This package is designed for interactive use. For package development, use the rebus.* dependencies. Depends R (>= 3.1.0) Imports rebus.base (>= 0.0-3), rebus.datetimes, rebus.numbers, rebus.unicode (>= 0.0-2) Suggests testthat License Unlimited LazyLoad yes LazyData yes Acknowledgments Development of this package was partially funded by the Proteomics Core at Weill Cornell Medical College in Qatar <http://qatar-weill.cornell.edu>. The Core is supported by 'Biomedical Research Program' funds, a program funded by Qatar Foundation. RoxygenNote 6.0.1 Collate 'export-base.R' 'export-datetimes.R' 'export-numbers.R' 'export-unicode.R' 'imports.R' 'regex-package.R' NeedsCompilation no Repository CRAN Date/Publication 2017-04-25 21:42:46 UTC 1 2 Anchors R topics documented: Anchors . .2 as.regex . .3 Backreferences . .3 capture . .3 CharacterClasses . .3 char_class . .3 ClassGroups . .4 Concatenation . .4 DateTime . .4 escape_special . .4 exactly . .4 format.regex . .5 get_weekdays . .5 IsoClasses . .5 literal . .5 lookahead . .5 modify_mode . .6 number_range . .6 or ..............................................6 rebus.............................................6 recursive . .8 regex.............................................8 repeated . .8 ReplacementCase . .8 roman . .9 SpecialCharacters . .9 Unicode . .9 UnicodeGeneralCategory . .9 UnicodeOperators . .9 UnicodeProperty . 10 whole_word . 10 WordBoundaries . 10 Index 11 Anchors The start or end of a string Description See Anchors. as.regex 3 as.regex Convert or test for regex objects Description See as.regex. Backreferences Backreferences Description See Backreferences. capture Capture a token, or not Description See capture. CharacterClasses Class Constants Description See CharacterClasses. char_class A range or char_class of characters Description See char_class. 4 exactly ClassGroups Character classes Description See ClassGroups. Concatenation Combine strings together Description See Concatenation. DateTime Date-time regexes Description See DateTime. escape_special Escape special characters Description See escape_special. exactly Make a regex exact Description See exactly. format.regex 5 format.regex Print or format regex objects Description See format.regex. get_weekdays Get the days of the week or months of the year Description See get_weekdays. IsoClasses ISO 8601 date-time classes Description See IsoClasses. literal Treat part of a regular expression literally Description See literal. lookahead Lookaround Description See lookahead. 6 rebus modify_mode Apply mode modifiers Description See modify_mode. number_range Generate a regular expression for a number range Description See number_range. or Alternation Description See or. rebus rebus: Regular Expression Builder, Um, Something Description Build regular expressions in a human readable way. Details Regular expressions are a very powerful tool, but the syntax is terse enough to be difficult to read. This makes bugs easy to introduce, and hard to find. This package contains functions to make building regular expressions easier. Author(s) Richard Cotton <[email protected]> rebus 7 See Also regex and regexpr The ‘stringr‘ and ‘stringi‘ packages provide tools for matching regular expres- sions and nicely complement this package. http://www.regular-expressions.info has good advice on using regular expression in R. In particular, see http://www.regular-expressions. info/rlanguage.html and http://www.regular-expressions.info/examples.html https: //www.debuggex.com is a visual regex debugging and testing site. Examples ### Match a hex colour, like `"#99af01"` # This reads *Match a hash, followed by six hexadecimal values.* "#" %R% hex_digit(6) # To match only a hex colour and nothing else, you can add anchors to the # start and end of the expression. START %R% "#" %R% hex_digit(6) %R% END ### Simple email address matching. # This reads *Match one or more letters, numbers, dots, underscores, percents, # plusses or hyphens. Then match an 'at' symbol. Then match one or more letters, # numbers, dots, or hyphens. Then match a dot. Then match two to four letters.* one_or_more(char_class(ASCII_ALNUM %R% "._%+-")) %R% "@" %R% one_or_more(char_class(ASCII_ALNUM %R% ".-")) %R% DOT %R% ascii_alpha(2, 4) ### IP address matching. # First we need an expression to match numbers between 0 and 255. Both the # following syntaxes read *Match two then five then a number between zero and # five. Or match two then a number between zero and four then a digit. Or match # an optional zero or one followed by an optional digit folowed by a compulsory # digit. Make this a single token, but don't capture it.* # Using the %|% operator ip_element <- group( "25" %R% char_range(0, 5) %|% "2" %R% char_range(0, 4) %R% ascii_digit() %|% optional(char_class("01")) %R% optional(ascii_digit()) %R% ascii_digit() ) # The same again, this time using the or function ip_element <- or( "25" %R% char_range(0, 5), "2" %R% char_range(0, 4) %R% ascii_digit(), optional(char_class("01")) %R% optional(ascii_digit()) %R% ascii_digit() ) # It's easier to write using number_range, though it isn't quite as optimal 8 ReplacementCase # as handcrafted regexes. number_range(0, 255, allow_leading_zeroes = TRUE) # Now an IP address consists of 4 of these numbers separated by dots. This # reads *Match a word boundary. Then create a token from an `ip_element` # followed by a dot, and repeat it three times. Then match another `ip_element` # followed by a word boundary.* BOUNDARY %R% repeated(group(ip_element %R% DOT), 3) %R% ip_element %R% BOUNDARY recursive Make the regular expression recursive. Description See recursive. regex Create a regex Description See regex. repeated Repeat values Description See repeated. ReplacementCase Force the case of replacement values Description See ReplacementCase. roman 9 roman Roman numerals Description See roman. SpecialCharacters Special characters Description See SpecialCharacters. Unicode Unicode classes Description See Unicode. UnicodeGeneralCategory Unicode General Categories Description See UnicodeGeneralCategory. UnicodeOperators Unicode Operators Description See UnicodeOperators. 10 WordBoundaries UnicodeProperty Unicode Properties Description See UnicodeProperty. whole_word Match a whole word Description See whole_word. WordBoundaries Word boundaries Description See WordBoundaries. Index %R% (Concatenation),4 arabic_presentation_forms_b (Unicode),9 %c% (Concatenation),4 ARABIC_SUPPLEMENT (Unicode),9 arabic_supplement (Unicode),9 ADDITIONAL_ARROWS (Unicode),9 ARMENIAN (Unicode),9 additional_arrows (Unicode),9 armenian (Unicode),9 AEGEAN_NUMBERS (Unicode),9 ARMENIAN_LIGATURES (Unicode),9 aegean_numbers (Unicode),9 armenian_ligatures (Unicode),9 ALCHEMICAL_SYMBOLS (Unicode),9 as.regex, 3,3 alchemical_symbols (Unicode),9 as_lower (ReplacementCase),8 ALNUM (CharacterClasses),3 as_upper (ReplacementCase),8 alnum (ClassGroups),4 ASCII_ALNUM (CharacterClasses),3 ALPHA (CharacterClasses),3 ascii_alnum (ClassGroups),4 alpha (ClassGroups),4 ASCII_ALPHA (CharacterClasses),3 ALPHABETIC_PRESENTATION_FORMS ascii_alpha (ClassGroups),4 (Unicode),9 ASCII_DIGIT (CharacterClasses),3 alphabetic_presentation_forms ascii_digit (ClassGroups),4 (Unicode),9 ASCII_LOWER (CharacterClasses),3 AM_PM (DateTime),4 ascii_lower (ClassGroups),4 Anchors, 2,2 ASCII_UPPER (CharacterClasses),3 ANCIENT_GREEK_MUSICAL_NOTATION ascii_upper (ClassGroups),4 (Unicode),9 AVESTAN (Unicode),9 ancient_greek_musical_notation avestan (Unicode),9 (Unicode),9 ANCIENT_GREEK_NUMBERS (Unicode),9 Backreferences, 3,3 ancient_greek_numbers (Unicode),9 BACKSLASH (SpecialCharacters),9 ANCIENT_SYMBOLS (Unicode),9 BALINESE (Unicode),9 ancient_symbols (Unicode),9 balinese (Unicode),9 ANY_CHAR (CharacterClasses),3 BAMUN (Unicode),9 any_char (ClassGroups),4 bamun (Unicode),9 ARABIC (Unicode),9 BAMUN_SUPPLEMENT (Unicode),9 arabic (Unicode),9 bamun_supplement (Unicode),9 ARABIC_EXTENDED_A (Unicode),9 BASSA_VAH (Unicode),9 arabic_extended_a (Unicode),9 bassa_vah (Unicode),9 ARABIC_MATHEMATICAL_ALPHANUMERIC_SYMBOLS BATAK (Unicode),9 (Unicode),9 batak (Unicode),9 arabic_mathematical_alphanumeric_symbols BENGALI_AND_ASSAMESE (Unicode),9 (Unicode),9 bengali_and_assamese (Unicode),9 ARABIC_PRESENTATION_FORMS_A (Unicode),9 BLANK (CharacterClasses),3 arabic_presentation_forms_a (Unicode),9 blank (ClassGroups),4 ARABIC_PRESENTATION_FORMS_B (Unicode),9 BLOCK_ELEMENTS (Unicode),9 11 12 INDEX block_elements (Unicode),9 CJK_COMPATIBILITY_IDEOGRAPHS_SUPPLEMENT BOPOMOFO (Unicode),9 (Unicode),9 bopomofo (Unicode),9 cjk_compatibility_ideographs_supplement BOPOMOFO_EXTENDED (Unicode),9 (Unicode),9 bopomofo_extended (Unicode),9 CJK_IDEOGRAPHIC_DESCRIPTION_CHARACTERS BOUNDARY (WordBoundaries), 10 (Unicode),9 BOX_DRAWING (Unicode),9 cjk_ideographic_description_characters box_drawing (Unicode),9 (Unicode),9 BRAHMI (Unicode),9 CJK_STROKES (Unicode),9 brahmi (Unicode),9 cjk_strokes (Unicode),9 BRAILLE_PATTERNS (Unicode),9 CJK_SYMBOLS_AND_PUNCTUATION (Unicode),9 braille_patterns (Unicode),9 cjk_symbols_and_punctuation (Unicode),9 BUGINESE (Unicode),9 CJK_UNIFIED_IDEOGRAPHS (Unicode),9 buginese (Unicode),9 cjk_unified_ideographs (Unicode),9 BUHID (Unicode),9 CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A buhid (Unicode),9 (Unicode),9 BYZANTINE_MUSICAL_SYMBOLS