re
Starlark module for working with regexp.
Emulates Python's re module but uses Google's RE2. More on the syntax and what is allowed and what is not here:
Java's standard regular expression package, java.util.regex, and many other widely used regular expression packages such as PCRE, Perl and Python use a backtracking implementation strategy: when a pattern presents two alternatives such as a|b, the engine will try to match subpattern a first, and if that yields no match, it will reset the input stream and try to match b instead.
If such choices are deeply nested, this strategy requires an exponential number of passes over the input data before it can detect whether the input matches. If the input is large, it is easy to construct a pattern whose running time would exceed the lifetime of the universe. This creates a security risk when accepting regular expression patterns from untrusted sources, such as users of a web application.
In contrast, the RE2 algorithm explores all matches simultaneously in a single pass over the input data by using a nondeterministic finite automaton.
There are certain features of PCRE or Perl regular expressions that cannot be implemented in linear time, for example, backreferences, but the vast majority of regular expression patterns in practice avoid such features.
A good portion of findall and finditer code was ported from: pfalcon’s pycopy-lib located at:
re.compile(pattern, flags=0)
Compile a regular expression pattern into a regular expression object, which can be used for matching using its match(), search() and other methods, described below. The expression’s behaviour can be modified by specifying a flags value. Values can be any of the following variables, combined using bitwise OR (the | operator).
The sequence
prog = re.compile(pattern)
result = prog.match(string)
is equivalent to
result = re.match(pattern, string)
but using re.compile() and saving the resulting regular expression object for reuse is more efficient when the expression will be used several times in a single program.
Parameters:
pattern – regexp pattern.
flags – regexp flags, see RegexFlags.
Returns: regular expression object.
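As a minimal sketch of the reuse pattern described above, the snippet below compiles one pattern with two flags combined using | and applies it to several inputs. It is written against CPython's re, which this module emulates; the sample strings and the variable names are hypothetical, and the Starlark port is assumed (not confirmed here) to behave the same way.

```python
import re

# Compile once, combining two flags with bitwise OR.
prog = re.compile(r'^error: (\w+)', re.IGNORECASE | re.MULTILINE)

# Hypothetical sample input.
lines = ['Error: disk', 'warning: cpu', 'ERROR: net']

# Reuse the compiled object instead of passing the pattern string each time.
hits = []
for line in lines:
    m = prog.match(line)
    if m:
        hits.append(m.group(1))
```

Here hits collects the captured word from each line that starts with "error:" in any letter case.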
re.escape(pattern)
Escape special characters in pattern. This is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it.
Example:
re.escape('https://www.python.org')
https://www\.python\.org
legal_chars = string.ascii_lowercase + string.digits + "!#$%&'*+-.^_`|~:"
'[%s]+' % re.escape(legal_chars)
[abcdefghijklmnopqrstuvwxyz0123456789!\#\$%\&'\*\+\-\.\^_`\|\~:]+
operators = ['+', '-', '*', '/', '**']
'|'.join(map(re.escape, sorted(operators, reverse=True)))
/|\-|\+|\*\*|\*
Parameters:
pattern – regexp pattern.
Returns: escaped regexp pattern.
re.findall(pattern, string, flags=0)
Return all non-overlapping matches of pattern in string, as a list of strings or tuples. The string is scanned left-to-right, and matches are returned in the order found. Empty matches are included in the result.
The result depends on the number of capturing groups in the pattern. If there are no groups, return a list of strings matching the whole pattern. If there is exactly one group, return a list of strings matching that group. If multiple groups are present, return a list of tuples of strings matching the groups. Non-capturing groups do not affect the form of the result.
Examples:
re.findall(r'f[a-z]*', 'which foot or hand fell fastest')
['foot', 'fell', 'fastest']
re.findall(r'(\w+)=(\d+)', 'set width=20 and height=10')
[('width', '20'), ('height', '10')]
Parameters:
pattern – regexp pattern.
string – string to apply pattern to.
flags – regexp flags, see RegexFlags.
Returns: all non-overlapping matches of pattern in string, as a list of strings or tuples.
re.finditer(pattern, string, flags=0)
Return an iterator yielding match objects over all non-overlapping matches for the RE pattern in string. The string is scanned left-to-right, and matches are returned in the order found. Empty matches are included in the result.
Parameters:
pattern – regexp pattern.
string – string to apply pattern to.
flags – regexp flags, see RegexFlags.
Returns: iterator yielding match objects over all non-overlapping matches for the RE pattern in string.
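A short sketch of how finditer differs from findall: each item is a match object, so positions are available alongside the captured groups. The example uses CPython's re; that the Starlark match object exposes group(), start() and end() in the same way is an assumption, and the input string is illustrative only.

```python
import re

text = 'set width=20 and height=10'

# Each item yielded is a match object, so groups and positions are both available.
found = []
for m in re.finditer(r'(\w+)=(\d+)', text):
    found.append((m.group(1), m.group(2), m.start(), m.end()))
```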
re.fullmatch(pattern, string, flags=0)
If the whole string matches the regular expression pattern, return a corresponding match object. Return None if the string does not match the pattern; note that this is different from a zero-length match.
Parameters:
pattern – regexp pattern.
string – string to apply pattern to.
flags – regexp flags, see RegexFlags.
Returns: corresponding match object or None if no match was found.
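To illustrate the distinction between None and a zero-length match, here is a small sketch using CPython's re, which this module emulates. The variable names are hypothetical, and the Starlark port is assumed to return the same results.

```python
import re

# The whole string must match, not just a prefix.
whole = re.fullmatch(r'\d+', '12345')
partial = re.fullmatch(r'\d+', '12345abc')

# A zero-length match is still a match object, not None.
empty = re.fullmatch(r'\d*', '')
```

Testing the result with "is None" rather than truthiness of the matched text keeps the empty-match case distinct from the no-match case.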
re.match(pattern, string, flags=0)
If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding match object. Return None if the string does not match the pattern; note that this is different from a zero-length match. Note that even in MULTILINE mode, re.match() will only match at the beginning of the string and not at the beginning of each line. If you want to locate a match anywhere in string, use search() instead (see also search() vs. match()).
Parameters:
pattern – regexp pattern.
string – string to apply pattern to.
flags – regexp flags, see RegexFlags.
Returns: corresponding match object or None if no match was found.
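The anchoring behaviour described above can be sketched as follows, using CPython's re. The two-line input is a made-up example, and it is an assumption that the Starlark module treats MULTILINE identically.

```python
import re

text = 'first line\nsecond line'

# match() is anchored at the start of the string, even with MULTILINE.
at_start = re.match(r'first', text, flags=re.MULTILINE)
not_at_line_start = re.match(r'second', text, flags=re.MULTILINE)

# search() scans the whole string, so ^ can match at the second line.
anywhere = re.search(r'^second', text, flags=re.MULTILINE)
```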
re.search(pattern, string, flags=0)
Scan through string looking for the first location where the regular expression pattern produces a match, and return a corresponding match object. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.
Parameters:
pattern – regexp pattern.
string – string to apply pattern to.
flags – regexp flags, see RegexFlags.
Returns: corresponding match object or None if no match was found.
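A brief sketch of the three outcomes described above: a match at the first possible location, no match, and a zero-length match. It is written against CPython's re; the inputs are illustrative and the Starlark port is assumed to agree.

```python
import re

# The first location wins: 'cat' at index 4, not the later 'cot'.
m = re.search(r'c[a-z]t', 'the cat and the cot')

# No position matches, so the result is None.
missing = re.search(r'dog', 'the cat')

# A zero-length match at some position is a match object, not None.
zero = re.search(r'x*', 'abc')
```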
re.split(pattern, string, maxsplit=0, flags=0)
Split string by the occurrences of pattern. If capturing parentheses are used in pattern, then the text of all groups in the pattern is also returned as part of the resulting list. If maxsplit is nonzero, at most maxsplit splits occur, and the remainder of the string is returned as the final element of the list.
Examples:
re.split(r'\W+', 'Words, words, words.')
['Words', 'words', 'words', '']
re.split(r'(\W+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']
re.split(r'\W+', 'Words, words, words.', 1)
['Words', 'words, words.']
re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)
['0', '3', '9']
If there are capturing groups in the separator and it matches at the start of the string, the result will start with an empty string. The same holds for the end of the string:
re.split(r'(\W+)', '...words, words...')
['', '...', 'words', ', ', 'words', '...', '']
That way, separator components are always found at the same relative indices within the result list. Empty matches for the pattern split the string only when not adjacent to a previous empty match.
re.split(r'\b', 'Words, words, words.')
['', 'Words', ', ', 'words', ', ', 'words', '.']
re.split(r'\W*', '...words...')
['', '', 'w', 'o', 'r', 'd', 's', '', '']
re.split(r'(\W*)', '...words...')
['', '...', '', '', 'w', '', 'o', '', 'r', '', 'd', '', 's', '...', '', '', '']
Parameters:
pattern – regexp pattern.
string – string to apply pattern to.
maxsplit – if maxsplit is nonzero, at most maxsplit splits occur, and the remainder of the string is returned as the final element of the list.
flags – regexp flags, see RegexFlags.
Returns: the list of substrings obtained by splitting string at each occurrence of pattern.
re.sub(pattern, repl, string, count=0, flags=0)
Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn’t found, string is returned unchanged. repl can be a string or a function; if it is a string, any backslash escapes in it are processed. That is, \n is converted to a single newline character, \r is converted to a carriage return, and so forth. Unknown escapes of ASCII letters are reserved for future use and treated as errors. Other unknown escapes such as \& are left alone. Backreferences, such as \6, are replaced with the substring matched by group 6 in the pattern.
Examples:
re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
... r'static PyObject*\npy_\1(void)\n{',
... 'def myfunc():')
static PyObject*\npy_myfunc(void)\n{
def dashrepl(matchobj):
...     if matchobj.group(0) == '-': return ' '
...     else: return '-'
re.sub('-{1,2}', dashrepl, 'pro----gram-files')
'pro--gram files'
re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE)
'Baked Beans & Spam'
Parameters:
pattern – regexp pattern.
repl – replacement string or function. If repl is a function, it is called for every non-overlapping occurrence of pattern; it takes a single match object argument and returns the replacement string.
string – string to apply pattern to.
count – the maximum number of pattern occurrences to be replaced; count must be a non-negative integer. If omitted or zero, all occurrences will be replaced.
flags – regexp flags, see RegexFlags.
Returns: the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl.
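As a small sketch of the count parameter and of group backreferences in repl, using CPython's re (the inputs are hypothetical, and identical behaviour in the Starlark port is assumed):

```python
import re

# count=1 replaces only the leftmost occurrence.
once = re.sub(r'\d+', '#', 'a1 b22 c333', count=1)

# \2 and \1 in repl refer to the text matched by groups 2 and 1.
swapped = re.sub(r'(\w+)=(\w+)', r'\2=\1', 'key=value')
```

Writing repl as a raw string keeps the backslashes intact so that the regular expression engine, not the string literal, interprets the group references.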
re.subn(pattern, repl, string, count=0, flags=0)
Perform the same operation as sub(), but return a tuple (new_string, number_of_subs_made).
Parameters:
pattern – regexp pattern.
repl – replacement string or function. If repl is a function, it is called for every non-overlapping occurrence of pattern; it takes a single match object argument and returns the replacement string.
string – string to apply pattern to.
count – the maximum number of pattern occurrences to be replaced; count must be a non-negative integer. If omitted or zero, all occurrences will be replaced.
flags – regexp flags, see RegexFlags.
Returns: a tuple (new_string, number_of_subs_made), where new_string is the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl.
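A minimal sketch of the tuple returned by subn, written against CPython's re. The inputs are illustrative, and the Starlark port is assumed to return the same pair.

```python
import re

# Same replacement as sub(), plus the number of substitutions made.
new_string, n = re.subn(r'\s+', ' ', 'a   b \t c')

# When nothing matches, the string is unchanged and the count is 0.
unchanged = re.subn(r'x', 'y', 'abc')
```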