Mastering Regular Expressions - Regex
The implementation system of regex functionality is often called "regular expression engine". Basically a regex engine tries to match the pattern to the given string. There are two main types of regex engines: DFA and NFA, also referred to as text-directed and regex-directed engines.
h you can build complex patterns that can match a wide range of combinations.
| Metacharacter | Description |
|---|---|
| . | Any single character |
| ^ | Match the beginning of a line |
| $ | Match the end of a line |
| a|b | Match either a or b |
| \d | any digit |
| \D | Any non-digit character |
| \w | Any word character |
| \W | Any non-word character |
| \s | matches any whitespace character |
| \S | Match any non-whitespace character |
| \b | Matches a word boundary |
| \B | Match must not occur on a \b boundary. |
| [\b] | Backspace character |
| \xYY | Match hex character YY |
| \ddd | Octal character ddd |
[] |
Start/close a charaters class |
() |
Start/close a characters group |
\ |
Escape special characters |
| | | It means OR |
{} |
Start/close repetitions of a characters class |
Quantifiers
| Regex Quantifier | Description |
|---|---|
| + | + indicates that the previous character must occur at least one or more times. |
| ? | ? indicates that the preceding character is optional. It means the preceding character can occur zero or one time. |
| * | Matches zero or more of the preceding character. |
| {n} | Matches exactly n occurrences of the preceding character. |
| {n,} | Matches n or more occurrences of the preceding character. |
| {n,m} | Matches between n and m occurrences of the preceding element |
The followings are common examples of character classes:
[abc]- matches any one character that is either 'a', 'b', or 'c'.[a-z]- matches any one lowercase letter from 'a' to 'z'.[A-Z]- matches any one upper case letter from 'A' to 'Z'.[0-9]- matches any one digit from '0' to '9'. Optionaly, use \d metacharacter.[^abc]- matches any one character that is not 'a', 'b', or 'c'.[\w]- matches any one-word character, including letters, digits, and underscore.[\s]- matches any whitespace character, including space, tab, and newline.[^a-z]- matches any one character that is not a lowercase letter from 'a' to 'z'.
In regex, any subpattern enclosed within the parentheses () is considered a group. For example, (xyz) creates a group that matches the exact sequence "xyz".
| Non-printing character | Description |
|---|---|
| \0 | NULL Byte. In many programming language marks the end of a string |
| \b | Within a character class represent the backspace character, while outside \b matches a word boundary |
| \t | Tab key. |
| \n | New line |
| \v | Vertical tabulation |
| \f | Form feed |
| \r | In HTTP the \r\n sequence is used as the end-of-line marker |
| \e | Escape character |
Unicode
Regular expression flavors that work with Unicode use specific meta-sequences to match code points: