Mastering Regular Expressions - Regex

The component that implements regex functionality is often called a "regular expression engine". A regex engine tries to match a pattern against a given string. There are two main types of regex engines: DFA and NFA, also referred to as text-directed and regex-directed engines, respectively.
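One observable difference between the two engine types is how they treat alternation. The following is a minimal sketch, assuming Python's standard `re` module, which is a backtracking (regex-directed) engine. A text-directed engine that follows POSIX rules would typically report the longest match, "ab", in both cases.

```python
import re

# A regex-directed engine tries the alternatives from left to right and
# stops at the first one that lets the overall match succeed.
first_alternative = re.search(r"a|ab", "ab").group()

# Reordering the alternatives changes the result, which shows that the
# pattern, not the text, drives the matching process.
reordered = re.search(r"ab|a", "ab").group()

print(first_alternative)  # a
print(reordered)          # ab
```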

With metacharacters you can build complex patterns that can match a wide range of combinations.

Metacharacter   Description
.               Matches any single character
^               Matches the beginning of a line
$               Matches the end of a line
a|b             Matches either a or b
\d              Matches any digit
\D              Matches any non-digit character
\w              Matches any word character
\W              Matches any non-word character
\s              Matches any whitespace character
\S              Matches any non-whitespace character
\b              Matches a word boundary
\B              Matches a position that is not a word boundary
[\b]            Matches the backspace character
\xYY            Matches the character with hex code YY
\ddd            Matches the character with octal code ddd
[]              Delimits a character class
()              Delimits a group
\               Escapes a special character
|               Alternation (OR)
{}              Delimits a repetition count for the preceding element
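Several of the metacharacters above can be tried out directly. This is a small sketch using Python's standard `re` module; the sample strings and variable names are made up for illustration, and other flavors may differ in details.

```python
import re

text = "cat concat cats cat."

# \b anchors the match at word boundaries, so only standalone "cat" matches.
whole_words = re.findall(r"\bcat\b", text)

# \B requires a position that is NOT a word boundary: the "cat" in "concat".
inside_word = re.findall(r"\Bcat\b", text)

# \d+ finds runs of digits; \s+ splits on runs of whitespace.
digits = re.findall(r"\d+", "Order 66 shipped in 2024")
words = re.split(r"\s+", "one  two\tthree")

# ^ matches at the start of every line when re.MULTILINE is set.
line_starts = re.findall(r"^\w+", "alpha beta\ngamma delta", flags=re.MULTILINE)

# A backslash escapes a metacharacter so that it matches literally.
literal_dot = re.findall(r"3\.14", "3.14 and 3x14")
any_char = re.findall(r"3.14", "3.14 and 3x14")

print(whole_words)   # ['cat', 'cat']
print(inside_word)   # ['cat']
print(digits)        # ['66', '2024']
print(words)         # ['one', 'two', 'three']
print(line_starts)   # ['alpha', 'gamma']
print(literal_dot)   # ['3.14']
print(any_char)      # ['3.14', '3x14']
```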

Quantifiers

Quantifier   Description
+            Matches one or more of the preceding character.
?            Matches zero or one of the preceding character, making it optional.
*            Matches zero or more of the preceding character.
{n}          Matches exactly n occurrences of the preceding character.
{n,}         Matches n or more occurrences of the preceding character.
{n,m}        Matches between n and m occurrences of the preceding character.
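The quantifiers in the table can be checked with a few whole-string matches. This is a sketch using Python's `re.fullmatch`; the helper `matches` and the sample strings are hypothetical and exist only for this illustration.

```python
import re


def matches(pattern: str, text: str) -> bool:
    """Return True when the whole text matches the pattern (illustrative helper)."""
    return re.fullmatch(pattern, text) is not None


# ? makes the preceding character optional (zero or one time).
optional_u = [matches(r"colou?r", s) for s in ("color", "colour", "colouur")]

# + requires at least one occurrence, * also allows none.
plus = [matches(r"ab+", s) for s in ("a", "ab", "abbb")]
star = [matches(r"ab*", s) for s in ("a", "ab", "abbb")]

# {n}, {n,} and {n,m} set explicit repetition counts.
exactly_three = [matches(r"\d{3}", s) for s in ("12", "123", "1234")]
at_least_two = [matches(r"\d{2,}", s) for s in ("1", "12", "12345")]
two_to_four = [matches(r"\d{2,4}", s) for s in ("1", "123", "12345")]

print(optional_u)     # [True, True, False]
print(plus)           # [False, True, True]
print(star)           # [True, True, True]
print(exactly_three)  # [False, True, False]
print(at_least_two)   # [False, True, True]
print(two_to_four)    # [False, True, False]
```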

The following are common examples of character classes:

  • [abc] - matches any one character that is either 'a', 'b', or 'c'.
  • [a-z] - matches any one lowercase letter from 'a' to 'z'.
  • [A-Z] - matches any one uppercase letter from 'A' to 'Z'.
  • [0-9] - matches any one digit from '0' to '9'. Alternatively, use the \d metacharacter.
  • [^abc] - matches any one character that is not 'a', 'b', or 'c'.
  • [\w] - matches any one word character, including letters, digits, and underscore.
  • [\s] - matches any whitespace character, including space, tab, and newline.
  • [^a-z] - matches any one character that is not a lowercase letter from 'a' to 'z'.
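The character classes listed above behave as follows on a short sample. This is a sketch with Python's `re` module; the sample string is invented for the example.

```python
import re

sample = "Room 42b, Level C"

# [abc] matches one character from the set; only the lowercase "b" qualifies.
from_set = re.findall(r"[abc]", sample)

# Ranges match one character each; adding + collects consecutive runs.
lowercase_runs = re.findall(r"[a-z]+", sample)
uppercase = re.findall(r"[A-Z]", sample)
digit_runs = re.findall(r"[0-9]+", sample)

# A leading ^ negates the class: everything that is not a lowercase letter.
not_lowercase = re.findall(r"[^a-z]+", sample)

print(from_set)        # ['b']
print(lowercase_runs)  # ['oom', 'b', 'evel']
print(uppercase)       # ['R', 'L', 'C']
print(digit_runs)      # ['42']
print(not_lowercase)   # ['R', ' 42', ', L', ' C']
```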

In regex, any subpattern enclosed in parentheses () is a group. For example, (xyz) creates a group that matches the exact sequence "xyz".
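Grouping matters most when a quantifier follows the group, and many flavors also capture what each group matched. This is a short sketch in Python; the date string is only sample input.

```python
import re

# (xyz) groups the three characters, so + repeats the whole sequence.
grouped = re.fullmatch(r"(xyz)+", "xyzxyz") is not None

# Without the group, + applies only to the final "z".
ungrouped = re.fullmatch(r"xyz+", "xyzxyz") is not None

# Groups also capture the text they matched, in order of their opening parenthesis.
date = re.search(r"(\d{4})-(\d{2})-(\d{2})", "Created: 2024-04-16")
year, month, day = date.groups()

print(grouped)           # True
print(ungrouped)         # False
print(year, month, day)  # 2024 04 16
```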

Non-printing character   Description
\0                       Null byte. In many programming languages it marks the end of a string
\b                       Backspace character inside a character class; outside a class, \b matches a word boundary
\t                       Horizontal tab
\n                       New line (line feed)
\v                       Vertical tab
\f                       Form feed
\r                       Carriage return. In HTTP, the \r\n sequence is used as the end-of-line marker
\e                       Escape character
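Some of these escapes can be seen at work in a short sketch with Python's `re` module. The header text is made-up sample data. Python's `re` does not accept `\e`, so the example writes the escape character as `\x1b` instead.

```python
import re

# HTTP-style text where each line ends with \r\n (sample data).
raw = "Host: example.com\r\nAccept: */*\r\n\r\n"
header_lines = [line for line in re.split(r"\r\n", raw) if line]

# \t matches a tab, here used as a column separator.
columns = re.split(r"\t", "id\tname\temail")

# Inside a character class, \b is the backspace character (0x08).
backspace_count = len(re.findall(r"[\b]", "ab\x08c"))

# The escape character, written as \x1b, followed by a literal "[".
has_escape = re.search(r"\x1b\[", "\x1b[31mred") is not None

print(header_lines)     # ['Host: example.com', 'Accept: */*']
print(columns)          # ['id', 'name', 'email']
print(backspace_count)  # 1
print(has_escape)       # True
```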

Unicode

Regular expression flavors that work with Unicode use specific meta-sequences to match code points:

  • `\u` followed by the code point, where the code point is the hexadecimal number of the character to match, for example `\u2603`.
  • `\x{code-point}` in the PCRE library used by Apache and PHP, where code-point is the hexadecimal number of the character to match, for example `\x{2603}`.

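As an illustration, Python's `re` module is one flavor that accepts the `\u` form inside a pattern. The sketch below also checks the `\x{...}` form, which Python's `re` rejects because it is PCRE syntax; the variable names are chosen for this example only.

```python
import re

snowman = "\u2603"  # the character with code point U+2603

# The raw string passes the six characters \u2603 to the regex engine,
# which resolves them to the code point itself.
found = re.search(r"\u2603", "Let it snow " + snowman)
position = found.start()

# The PCRE form is not part of Python's re syntax and fails to compile.
try:
    re.compile(r"\x{2603}")
    pcre_form_accepted = True
except re.error:
    pcre_form_accepted = False

print(position)            # 12
print(pcre_form_accepted)  # False
```
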
Last update: 2024-04-16
Created: April 16, 2024 18:30:30