Mastering Regular Expressions - Regex

The component that implements regex functionality is usually called a "regular expression engine". A regex engine tries to match a pattern against a given string. There are two main types of regex engine: DFA (deterministic finite automaton) and NFA (nondeterministic finite automaton), also referred to as text-directed and regex-directed engines, respectively.
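The difference between the two engine types can be observed with alternation. The following is a minimal sketch, assuming Python's standard `re` module (a backtracking, regex-directed engine); the pattern and sample string are made up for illustration.

```python
import re

# A regex-directed engine tries the alternatives left to right
# and stops at the first one that succeeds.
match = re.search(r"Set|SetValue", "SetValue")
first_alternative = match.group(0)
print(first_alternative)  # Set

# Reordering the alternatives changes the result, which shows that
# the pattern, not the text, drives the match.
reordered = re.search(r"SetValue|Set", "SetValue").group(0)
print(reordered)  # SetValue
```

A text-directed engine with POSIX leftmost-longest semantics would be expected to return `SetValue` for both orderings.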

Metacharacters are characters with a special meaning in a pattern. With them you can build complex patterns that can match a wide range of combinations.

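| Metacharacter | Description |
| --- | --- |
| `.` | Matches any single character (in most flavors, except a newline by default) |
| `^` | Matches the beginning of a line (or of the whole string, depending on flavor and mode) |
| `$` | Matches the end of a line (or of the whole string, depending on flavor and mode) |
| `a\|b` | Matches either a or b |
| `\d` | Matches any digit |
| `\D` | Matches any non-digit character |
| `\w` | Matches any word character (letter, digit, or underscore) |
| `\W` | Matches any non-word character |
| `\s` | Matches any whitespace character |
| `\S` | Matches any non-whitespace character |
| `\b` | Matches a word boundary |
| `\B` | Matches a position that is not a word boundary |
| `[\b]` | Matches the backspace character |
| `\xYY` | Matches the character with hexadecimal code YY |
| `\ddd` | Matches the character with octal code ddd |
| `[]` | Open/close a character class |
| `()` | Open/close a group |
| `\` | Escapes a special character |
| `\|` | Alternation (OR) |
| `{}` | Open/close a repetition count for the preceding element |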
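A few of the metacharacters above in action. This is a minimal sketch, assuming Python's standard `re` module; the sample text and variable names are invented for illustration.

```python
import re

text = "Order 66 shipped\nInvoice 7 pending"

# \d matches one digit; + repeats it one or more times
digits = re.findall(r"\d+", text)

# ^ anchors at the start of each line when re.MULTILINE is set
first_words = re.findall(r"^\w+", text, flags=re.MULTILINE)

# $ anchors at the end of each line in MULTILINE mode
last_words = re.findall(r"\w+$", text, flags=re.MULTILINE)

# | chooses between alternatives; \b keeps the match on word boundaries
states = re.findall(r"\b(?:shipped|pending)\b", text)

# . matches any single character except a newline by default
dot = re.findall(r"6.", text)

print(digits, first_words, last_words, states, dot)
```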

Quantifiers

| Regex Quantifier | Description |
| --- | --- |
| `+` | Matches one or more occurrences of the preceding element. |
| `?` | Makes the preceding element optional: it can occur zero or one time. |
| `*` | Matches zero or more occurrences of the preceding element. |
| `{n}` | Matches exactly n occurrences of the preceding element. |
| `{n,}` | Matches n or more occurrences of the preceding element. |
| `{n,m}` | Matches between n and m occurrences of the preceding element. |

Here the "preceding element" is a single character, a character class, or a group.
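The quantifiers can be compared side by side. This is a minimal sketch, assuming Python's standard `re` module; the sample strings are illustrative only.

```python
import re

# ? makes the preceding element optional
optional = re.findall(r"colou?r", "color colour colouur")

# * allows zero occurrences, + requires at least one
star = re.fullmatch(r"ab*", "a") is not None
plus = re.fullmatch(r"ab+", "a") is not None

numbers = "7 42 2024 12345"

# {n}, {n,} and {n,m} bound the repetition count
exact = re.findall(r"\b\d{4}\b", numbers)
at_least = re.findall(r"\b\d{2,}\b", numbers)
between = re.findall(r"\b\d{2,4}\b", numbers)

# A quantifier placed after a group repeats the whole group
grouped = re.fullmatch(r"(?:ab){2}", "abab") is not None

print(optional, star, plus, exact, at_least, between, grouped)
```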

The following are common examples of character classes:

[abc] - matches any one character that is either 'a', 'b', or 'c'.
[a-z] - matches any one lowercase letter from 'a' to 'z'.
[A-Z] - matches any one uppercase letter from 'A' to 'Z'.
[0-9] - matches any one digit from '0' to '9'. Alternatively, use the \d metacharacter.
[^abc] - matches any one character that is not 'a', 'b', or 'c'.
[\w] - matches any one word character, including letters, digits, and underscore.
[\s] - matches any whitespace character, including space, tab, and newline.
[^a-z] - matches any one character that is not a lowercase letter from 'a' to 'z'.
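Each class in the list above can be checked against a small sample. This is a minimal sketch, assuming Python's standard `re` module; the sample string is made up for illustration.

```python
import re

sample = "Cab 42_x\tZ"

abc = re.findall(r"[abc]", sample)         # only lowercase a, b, c
lower = re.findall(r"[a-z]", sample)       # lowercase letters
upper = re.findall(r"[A-Z]", sample)       # uppercase letters
digits = re.findall(r"[0-9]", sample)      # digits
spaces = re.findall(r"[\s]", sample)       # whitespace (space and tab here)
word_chars = "".join(re.findall(r"[\w]", sample))  # letters, digits, underscore

# A leading ^ negates the class: remove everything that is not a, b, or c
only_abc = re.sub(r"[^abc]", "", sample)

print(abc, lower, upper, digits, spaces, word_chars, only_abc)
```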

In regex, any subpattern enclosed in parentheses `()` is considered a group. For example, `(xyz)` creates a group that matches the exact sequence "xyz".
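A short illustration of groups. This is a minimal sketch, assuming Python's standard `re` module, where a plain `()` group also captures the text it matched; the sample strings are invented.

```python
import re

# (xyz) matches the exact sequence "xyz" and captures it as group 1
m = re.search(r"(xyz)", "abcxyzdef")
captured = m.group(1)

# A quantifier after a group repeats the whole sequence
repeated = re.fullmatch(r"(xyz)+", "xyzxyz") is not None

# Several groups can pull separate pieces out of one match
date = re.search(r"(\d{4})-(\d{2})-(\d{2})", "Created 2024-04-16")
year, month, day = date.groups()

print(captured, repeated, year, month, day)
```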

| Non-printing character | Description |
| --- | --- |
| `\0` | NULL byte. In many programming languages it marks the end of a string |
| `\b` | Inside a character class it represents the backspace character; outside, `\b` matches a word boundary |
| `\t` | Horizontal tab |
| `\n` | New line (line feed) |
| `\v` | Vertical tab |
| `\f` | Form feed |
| `\r` | Carriage return. In HTTP the `\r\n` sequence is used as the end-of-line marker |
| `\e` | Escape character (not supported by every flavor) |
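Some of these escapes applied to a small HTTP-like message. This is a minimal sketch, assuming Python's standard `re` module; the message content is invented for illustration.

```python
import re

raw = "HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nname\tvalue"

# Split on the \r\n end-of-line marker
lines = re.split(r"\r\n", raw)

# \t matches the tab that separates the two fields on the last line
fields = re.split(r"\t", lines[-1])

# Inside a character class, \b is the backspace character (0x08)
has_backspace = re.search(r"[\b]", "a\x08b") is not None

# Outside a class, \b is a word boundary: "concat" is not counted
boundary_hits = len(re.findall(r"\bcat\b", "cat concat cat"))

print(lines, fields, has_backspace, boundary_hits)
```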

Unicode

Regular expression flavors that work with Unicode use specific meta-sequences to match code points:

# `\u`+code-point 

code-point is the hexadecimal number of the character to match 
`\u2603`

# `\x{code-point}` in the PCRE library (used by Apache and PHP)

code-point is the hexadecimal number of the character to match
`\x{2603}`
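As an illustration of flavor differences, the sketch below assumes Python's standard `re` module, which accepts the `\u` form in string patterns but not the PCRE-style `\x{...}` form. The sample text is made up.

```python
import re

text = "snow \u2603 man"  # the string contains the code point U+2603

# In a raw pattern string, the re module itself interprets \u2603
m = re.search(r"\u2603", text)
found = m.group(0)
code_point = hex(ord(found))
print(code_point)  # 0x2603

# The PCRE-style form is rejected by this flavor
try:
    re.compile(r"\x{2603}")
    pcre_form_ok = True
except re.error:
    pcre_form_ok = False
print(pcre_form_ok)  # False
```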

Last update: 2024-04-16
Created: April 16, 2024 18:30:30