Regex Basics: A Regular Expression Tutorial
One might ask, “Why on earth is Canonical writing about regex basics?” Most new webmasters don’t know what it is, and if they do, they don’t realize how it applies to SEO. Believe it or not, there is a big correlation between regex and search engine optimization.
Having at least a basic knowledge of the regex vocabulary and how to construct simple or intermediate regular expressions is crucial to the success of any webmaster. Mod Rewrite (like many Un*x-based tools) is built on top of a regular expression parser, so learning the basics of regex is a necessary prerequisite for learning Mod Rewrite. And understanding how to use Mod Rewrite is, IMO, a mandatory requirement for calling yourself a webmaster.
So let’s start learning some regex basics.
What is regex?
Regex, short for regular expression (and often written regexp), is a versatile language or syntax used to “express” or define patterns in data, usually text data or strings. It could be thought of as “Wildcarding on Steroids.”
A regular expression consists of a mixture of literal characters and/or metacharacters, which have a special meaning for a regex parser. The parser takes two inputs (the regular expression and the input string) and evaluates them to decide whether the input string contains the sequence of characters described by the regex pattern. It then returns a Boolean value to show the answer.
The first step to understanding regular expressions is to understand the metacharacters used for pattern matching.
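As a minimal sketch of this "two inputs in, Boolean out" behavior, here is how it might look with Python's standard `re` module standing in for the parser. The choice of Python and the helper name `matches` are assumptions made for illustration; other regex parsers, including the one behind Mod Rewrite, can differ in details.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True if `text` contains a sequence matching `pattern`."""
    # re.search scans the input string for the first match and
    # returns None when the pattern is not found anywhere.
    return re.search(pattern, text) is not None

print(matches("cat", "concatenate"))  # True
print(matches("dog", "concatenate"))  # False
```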
Regular Expression Cheat Sheet for Metacharacters
Regular expressions exist because we need a flexible and powerful way to match strings of text or patterns. As always, with great flexibility and power comes great confusion when you actually put them to use, which is why I have written this blog post: to help you get started with regular expressions, or simply to copy whichever expression you need.
Regex Metacharacter | Meaning |
^ | An anchor representing the start of a string. |
$ | An anchor representing the end of a string. |
. | Matches any single character. |
\ | Escapes a regex metacharacter so that it will be treated as a literal by the regular expression parser. It can also be used to add special meaning to characters that would otherwise be treated as a literal. |
* | Matches zero or more occurrences of the previous construct. |
? | Matches zero or one occurrences of the previous construct. |
+ | Matches one or more occurrences of the previous construct. |
[ ] | Called a character class. Matches only one of the characters contained inside the brackets. |
( ) | It provides a grouping functionality so that you can treat a group of characters as a single unit. It also provides the ability to capture a group of characters which you can later use as a back reference. |
Become familiar with the above regular expression metacharacters. It pays to understand regular expressions, especially if you work in a Un*x environment and/or host a web site on Apache.
Basic regex pattern matching
The best way to learn is through basic regex examples. The abstract syntax of regular expressions can be confusing at first. But after you’ve walked through some simple examples, you’ll probably start learning quickly.
As I mentioned earlier, the regular expression parser returns a true or false Boolean value to indicate whether the input string matches the regex pattern. In some cases, where the pattern uses parentheses to group characters, it can also be used to capture substrings from the input string as back references for later use.
You will find that there are many ways to write a regular expression that fits a specific pattern. Some regular expressions are more efficient than others. Writing efficient expressions comes with practice. At first, you should simply focus on learning the fundamentals, regardless of whether or not the patterns you write are efficient.
Anchoring matches to the start and end of string (^ and $)
The ^ character is used to anchor a regex pattern to the beginning of the input string. The $ character is used as an end-of-string anchor. Below are some examples of basic regular expression patterns utilizing the start and end of string anchors:
Regex Pattern | Matches |
^$ | Any input string where there is nothing between the start of string and end of string (in other words, it matches only the empty string) |
^abc | Any input string that begins with abc. |
abc$ | Any input string that ends with abc |
^abc$ | Only the input string abc |
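The anchor patterns in the table can be tried out with a short sketch like the following. It assumes Python's `re` module as the parser, and the sample input strings are made up for illustration.

```python
import re

# (pattern, input string) pairs; the inputs are illustrative only.
checks = [
    (r"^$", ""),            # empty string
    (r"^abc", "abcdef"),    # begins with abc
    (r"^abc", "xabc"),      # abc is present, but not at the start
    (r"abc$", "xyzabc"),    # ends with abc
    (r"^abc$", "abc"),      # exactly abc
    (r"^abc$", "abcabc"),   # more than just abc
]

results = [bool(re.search(pattern, text)) for pattern, text in checks]
print(results)  # [True, True, False, True, True, False]
```

One Python-specific detail: `$` also matches just before a single trailing newline, so `abc$` matches `"abc\n"`. Python offers `\Z` for a strict end-of-string anchor.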
Matching any character (.)
The . character is used in a regex pattern to match any character. Below are some examples of using . in regular expressions:
Regex Pattern | Matches |
a.c | Any input string that contains the letter a followed immediately by any character followed immediately by the letter c. Examples: abc, 1a2c3, accept, match |
^.$ | Any input string that is exactly one character in length. |
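To see why words like accept and match qualify, here is a small sketch, again assuming Python's `re` module. The non-matching inputs (`ac`, `abbc`, the empty string, `xy`) are added for contrast and are not from the table.

```python
import re

words = ["abc", "1a2c3", "accept", "match", "ac", "abbc"]

# Keep the words that contain "a", then any one character, then "c".
with_a_any_c = [w for w in words if re.search(r"a.c", w)]
print(with_a_any_c)  # ['abc', '1a2c3', 'accept', 'match']

# "^.$" only accepts strings that are exactly one character long.
one_char = [s for s in ["x", "", "xy"] if re.search(r"^.$", s)]
print(one_char)  # ['x']
```

In "accept" the match is `acc`, and in "match" it is `atc`. Note that `ac` fails because `.` must consume exactly one character.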
Escaping characters (\)
The \ (backslash) or escape character is used in a regular expression to remove special meaning from a metacharacter, causing it to be treated as a literal. The following regex examples demonstrate the use of the \ to escape metacharacters so that they are treated as literal characters:
Regex Pattern | Matches |
\.jpg$ | Any input string that ends in .jpg |
\$1\.00$ | Any input string that ends in $1.00 |
The backslash character can also be used to add special meaning to characters which would otherwise be treated as literals. For example, \d can be used to match a decimal digit (i.e. 0, 1, …, 9) and \s to match a whitespace character, such as a space or a tab.
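The difference an escape makes can be sketched as follows, assuming Python's `re` module. The sample strings are invented, and the patterns are written as raw strings (`r"..."`) so that Python itself leaves the backslashes alone.

```python
import re

# Escaped dot: only a literal "." before "jpg" will do.
escaped = bool(re.search(r"\.jpg$", "photojpg"))
# Unescaped dot: any character before "jpg" will do (here, the "o").
unescaped = bool(re.search(r".jpg$", "photojpg"))
print(escaped, unescaped)  # False True

# Escaping "$" and "." to match a literal price at the end of the string.
price = bool(re.search(r"\$1\.00$", "Total: $1.00"))
print(price)  # True

# Backslash adding meaning: \d is a digit, \s is a whitespace character.
has_digit = bool(re.search(r"\d", "room 7"))
has_space = bool(re.search(r"\s", "nospace"))
print(has_digit, has_space)  # True False
```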
Matching zero or more characters (*)
The * character is used in a regular expression to match the previous character in the pattern zero or more times. Some simple examples of using the * character in a regular expression are as follows:
Regex Pattern | Matches |
.* | Any input string (actually, the entire input string) even if it is the empty string. |
^\s*# | Any input string that begins with zero or more whitespace characters followed immediately by a # character. |
^a*bc$ | Any input string that begins with zero or more occurrences of the a character immediately followed by bc at the end of the string. Examples: bc, abc, aabc, aaabc |
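A rough sketch of the three patterns above, assuming Python's `re` module. The sample lines and strings are hypothetical.

```python
import re

# ".*" matches the entire input string, even an empty one.
whole = re.search(r".*", "anything").group()
empty = re.search(r".*", "").group()
print(repr(whole), repr(empty))  # 'anything' ''

# "^\s*#" finds lines whose first non-whitespace character is "#".
lines = ["# top", "   # indented", "x = 1  # trailing"]
comment_lines = [ln for ln in lines if re.search(r"^\s*#", ln)]
print(comment_lines)  # ['# top', '   # indented']

# "^a*bc$" allows any number of leading a's, including none.
candidates = ["bc", "abc", "aaabc", "abcd", "xbc"]
a_star_bc = [s for s in candidates if re.search(r"^a*bc$", s)]
print(a_star_bc)  # ['bc', 'abc', 'aaabc']
```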
Matching an optional character (?)
The ? character is used in a regular expression to make the previous character in the pattern optional. In other words, it matches zero or one occurrences of the previous character in the pattern.
Regex Pattern | Matches |
e-?mail | Any input string that contains the string email or e-mail. |
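The optional hyphen can be checked with a sketch like this one, assuming Python's `re` module. The extra inputs (`e--mail`, `emal`) are made up to show what does not match.

```python
import re

samples = ["email", "e-mail", "e--mail", "send an e-mail", "emal"]

# "-?" allows zero or one hyphen between "e" and "mail", never two.
optional_hyphen = [s for s in samples if re.search(r"e-?mail", s)]
print(optional_hyphen)  # ['email', 'e-mail', 'send an e-mail']
```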
Matching one or more characters (+)
The + character is used in regex to match the previous character in the pattern one or more times. Some simple examples of using + in a regular expression are as follows:
Regex Pattern | Matches |
^\d+\.jpg$ | Any input string that begins with one or more decimal digits immediately followed by .jpg at the end of the string |
a+bc | Any input string that contains a series of one or more consecutive occurrences of the letter a immediately followed by bc. Examples: abc, aabc, aaabc, aabcc |
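The contrast with * is that + insists on at least one occurrence. A small sketch, assuming Python's `re` module and some invented file names:

```python
import re

files = ["1.jpg", "2024.jpg", ".jpg", "a1.jpg", "12.jpeg"]

# At least one digit is required, so ".jpg" on its own is rejected.
numbered_jpgs = [f for f in files if re.search(r"^\d+\.jpg$", f)]
print(numbered_jpgs)  # ['1.jpg', '2024.jpg']

# "a+" consumes the whole run of a's before "bc".
matched_text = re.search(r"a+bc", "xaaabcc").group()
print(matched_text)  # aaabc

# With no "a" at all, "a+bc" cannot match.
no_match = re.search(r"a+bc", "bc")
print(no_match)  # None
```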
Matching a character class ( [ and ] )
You can use [ and ] to match a character class or character list in regex. The construct matches one and only one of the characters between the [ and ]. Below are some examples of using character classes in a regular expression:
Regex Pattern | Matches |
[bcf]at | Any input string that contains either the letter b or c or f immediately followed by at (in other words, any input string containing bat, cat, or fat) |
You can also use a hyphen inside the start and end brackets of a character class to indicate a range of characters.
Regex Pattern | Matches |
^[b-f]at$ | Any input string that is exactly 3 characters in length where the first character is b, c, d, e, or f followed immediately by at (in other words, any input string that is exactly bat, cat, dat, eat, or fat) |
^image[0-9]\.jpg$ | Any input string that starts with image followed immediately by a single decimal digit followed immediately by .jpg at the end of the string (in other words, image0.jpg, image1.jpg, …, image9.jpg only) |
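The character class patterns above can be exercised with a sketch along these lines, assuming Python's `re` module. The word and file lists are illustrative only.

```python
import re

# "[bcf]at" is unanchored, so "combat" qualifies because it contains "bat".
words = ["bat", "cat", "fat", "rat", "combat"]
bcf_at = [w for w in words if re.search(r"[bcf]at", w)]
print(bcf_at)  # ['bat', 'cat', 'fat', 'combat']

# "^[b-f]at$" is anchored, so only exact three-letter strings qualify.
range_at = [w for w in ["bat", "eat", "gat", "beat"] if re.search(r"^[b-f]at$", w)]
print(range_at)  # ['bat', 'eat']

# "[0-9]" matches exactly one digit, so "image10.jpg" is rejected.
names = ["image0.jpg", "image9.jpg", "image10.jpg", "imageA.jpg"]
single_digit_images = [n for n in names if re.search(r"^image[0-9]\.jpg$", n)]
print(single_digit_images)  # ['image0.jpg', 'image9.jpg']
```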
Grouping patterns and capturing results ( ( and ) )
Parentheses can be used to group characters in a regular expression so that they can be treated as a single unit. This is very useful in pattern matching as it allows you to apply metacharacters to sub-patterns within a bigger pattern. An example of using parentheses in a regex pattern for grouping is as follows:
Regex Pattern | Matches |
(xyz)+ | Any input string that contains one or more consecutive occurrences of xyz (in other words, xyz, xyzxyz, xyzxyzxyz, etc.) |
Applications like Mod Rewrite, which are built on top of regex parsers, also utilize parentheses to capture the results of a match for later use. These captured values are stored in variables called back references and can be used to determine what string of characters matched the pattern. I will discuss back references when I get around to writing a post on the basics of Mod Rewrite.
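As a small preview of capturing, here is a sketch that assumes Python's `re` module rather than Mod Rewrite. The file name and the `photo-` prefix are hypothetical. In Python, a back reference in a replacement string is written as `\1` for the first group; other tools use their own notation.

```python
import re

# Grouping: "+" applies to the whole group "xyz", not just to "z".
repeated = re.search(r"(xyz)+", "--xyzxyz--").group()
print(repeated)  # xyzxyz

# Capturing: the parentheses remember which digits matched.
m = re.search(r"^(\d+)\.jpg$", "2024.jpg")
captured = m.group(1)
print(captured)  # 2024

# Back reference: reuse the captured digits in a replacement string.
renamed = re.sub(r"^(\d+)\.jpg$", r"photo-\1.jpg", "2024.jpg")
print(renamed)  # photo-2024.jpg
```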
This is a very quick and basic introduction to the world of Regular Expressions. With this, you should be able to perform basic matching and validation. Feel free to browse this post and use any expressions you might come across in your own projects!
Learning more about regular expressions
Familiarize yourself with the basics of regular expressions by experimenting with an online regex parser. Once you have the regex basics down, you can move on to more advanced techniques. This post barely scratches the surface of what you can do using regular expressions. The sky is the limit!