Regular Expressions
Is this string a phone number? Is it an e-mail address? Is it a URL? How do I parse the time of day out of this date string? How do I find all words in the string that contain the letter "a"?
The tool we use to answer these questions is regular expressions. Regular expressions are a powerhouse for string matching.
This guide serves as an introduction to regular expressions. However, there is enough to discuss about them that regular expressions could be their own class. We will only be scratching the surface so that you can get up and running with regular expressions and start to use them in your projects.
Testing Regular Expressions
There are several tools freely available online to assist you when writing regular expressions. In this wiki, we will be using two:
- RegExPal: Performs a global search on an input string. Best for writing regular expressions used for parsing text.
- Debuggex: Matches a regular expression against the input string, but does not support searches. Shows a schematic to help you understand what your regular expression is doing. Best for fine-tuning regular expressions that act on known input formats.
You can view all regular expressions on this page in RegExPal or Debuggex by pressing "View in _____" above each example.
Your First Regular Expression
Here we go. If you wish to follow along, click "View in RegExPal" below; Debuggex does not work well for this example.
This regular expression finds all words of the form "_are", where _ is any single word character (a letter, digit, or underscore), and matches that first letter. Let's go through the parts of this regular expression.
- \b means "word boundary". If we didn't have the \b's, then this regular expression would also match inside words like "daycare", "apparel", or "barest".
- \w means "any alphanumeric character or underscore". Thus, this regular expression will match dare, care, zare, 5are, _are, and so on.
- are are literal characters in the regular expression.
Here is a little animation to help you visualize how this regular expression is assembled:
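A minimal Python sketch of the same idea, assuming the pattern assembled from the parts above is `\b\ware\b` (the sample sentence is made up for illustration):

```python
import re

# Assumed pattern: word boundary, one word character, literal "are", word boundary.
pattern = re.compile(r"\b\ware\b")
text = "I dare you to care about daycare, apparel, and 5are."

whole_words = pattern.findall(text)
print(whole_words)  # ['dare', 'care', '5are']

# Without the \b anchors, the pattern also matches inside longer words.
unanchored = re.findall(r"\ware", text)
print(unanchored)  # ['dare', 'care', 'care', 'pare', '5are']
```

The second result shows why the word boundaries matter: "care" is found inside "daycare" and "pare" inside "apparel".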
Regular Expression Syntax
Character classes and modifiers enable you to fine-tune your matches.
Character Classes
A character class enables you to match any one of a set of characters. Surround your characters of interest in square brackets.
The following regular expression matches all occurrences of a vowel:
Character classes can be negated by starting the character set with a caret ^. For example, the following regular expression matches all occurrences of a consonant:
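The \W is a shorthand character class; placing it inside the negated set prevents the above regular expression from also matching non-word characters such as spaces and punctuation. Digits and underscores are word characters, so they are still matched.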
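As a rough illustration in Python, assuming the two patterns described above are `[aeiou]` and `[^aeiou\W]` (the sample text is hypothetical):

```python
import re

text = "Hi, Bob_2"

# Character class: any one lowercase vowel.
vowels = re.findall(r"[aeiou]", text)
print(vowels)  # ['i', 'o']

# Negated class: anything that is neither a lowercase vowel
# nor a non-word character (space, comma, and so on).
non_vowels = re.findall(r"[^aeiou\W]", text)
print(non_vowels)  # ['H', 'B', 'b', '_', '2']
```

Note that the negated class also picks up the underscore and the digit, so it is only a crude definition of "consonant".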
Shorthand Character Classes
Some character classes are provided for you by default.
Description | Literal Character Class | Shorthand |
---|---|---|
Digit | [0-9] | \d |
Word Character | [A-Za-z0-9_] | \w |
Whitespace | [ \t\r\n] | \s |
Non-Digit | [^0-9] | \D |
Non-Word Character | [^A-Za-z0-9_] | \W |
Non-Whitespace | [^ \t\r\n] | \S |
Wildcard character except for line breaks | [^\r\n] | . |
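A short Python sketch comparing two of the shorthands in the table with their literal forms. It passes the `re.ASCII` flag because Python's `\d` and `\w` otherwise also match non-ASCII digits and letters; the sample text is made up:

```python
import re

text = "Room 42, floor_7"

# re.ASCII restricts \d and \w to the ASCII sets listed in the table.
digits_short = re.findall(r"\d", text, flags=re.ASCII)
digits_long = re.findall(r"[0-9]", text)
print(digits_short == digits_long)  # True

words_short = re.findall(r"\w+", text, flags=re.ASCII)
words_long = re.findall(r"[A-Za-z0-9_]+", text)
print(words_short)  # ['Room', '42', 'floor_7']
```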
Quantifiers
For a character, character class, or group, you can specify a required number of occurrences.
Description | Syntax | Example |
---|---|---|
Zero or more of | * | Match any string, even the empty string: .* |
One or more of | + | Match any string, except for the empty string: .+ |
Zero or one of | ? | Match either "color" or "colour": colou?r |
N of | {N} | Match all three-digit numbers: \b\d{3}\b |
Between M and N of | {M,N} | Match all two- or three-digit numbers: \b\d{2,3}\b |
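The quantifier examples from the table can be tried out in Python like this (the input strings are hypothetical):

```python
import re

# ? makes the "u" optional.
spellings = re.findall(r"colou?r", "color colour colouur")
print(spellings)  # ['color', 'colour']

numbers = "7 42 123 4567"

# {3} requires exactly three digits between word boundaries.
three_digit = re.findall(r"\b\d{3}\b", numbers)
print(three_digit)  # ['123']

# {2,3} accepts two or three digits.
two_or_three = re.findall(r"\b\d{2,3}\b", numbers)
print(two_or_three)  # ['42', '123']
```

The word boundaries keep `\d{3}` from matching the first three digits of 4567.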
Greedy and Lazy Quantifiers
By default, the * and + quantifiers are greedy: they consume as many characters as they can while still allowing the rest of the regular expression to match, which can carry them all the way to the end of the string. More often, however, you want them to stop as soon as something matching the next part of your regular expression is found. This is where lazy quantifiers come into play.
To make * or + lazy, simply add a ?, as in *? and +?.
For example, you might write the following regular expression to match all HTML tags (view in RegExPal to follow along):
However, this will not separately match the HTML tags; instead, it will start from the first occurrence of < and continue until the last occurrence of >. This is the greedy behavior. The following regular expression will perform what you want:
Now, the match will stop as soon as a > is encountered.
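A minimal Python sketch of the difference, assuming the two patterns discussed above are `<.+>` (greedy) and `<.+?>` (lazy); the HTML snippet is made up:

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ runs from the first "<" to the last ">".
greedy = re.findall(r"<.+>", html)
print(greedy)  # ['<b>bold</b> and <i>italic</i>']

# Lazy: .+? stops at the first ">" it can.
lazy = re.findall(r"<.+?>", html)
print(lazy)  # ['<b>', '</b>', '<i>', '</i>']
```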
Anchors
If your regular expression begins with a caret ^, you will generate at most one match, which must start at the beginning of the string. If your regular expression ends with a dollar sign $, you will generate at most one match, which must finish at the end of the string. If your regular expression starts with a ^ and ends with a $, you will match if and only if the entire string matches the entire regular expression.
For example, the following regular expression generates a match on all strings that start with a capital letter:
The following regular expression tests whether the entire string is a valid US phone number:
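A Python sketch of both anchor examples. The capital-letter pattern `^[A-Z]` follows from the description above; the phone pattern is a simplified assumption that accepts only the hypothetical format 123-456-7890, not every valid way of writing a US number:

```python
import re

# ^ anchors the match to the start of the string.
starts_with_capital = re.compile(r"^[A-Z]")
print(bool(starts_with_capital.search("Hello world")))  # True
print(bool(starts_with_capital.search("hello World")))  # False

# Assumed format for illustration only: 3 digits, 3 digits, 4 digits.
phone = re.compile(r"^\d{3}-\d{3}-\d{4}$")
print(bool(phone.search("212-555-0123")))           # True
print(bool(phone.search("Call 212-555-0123 now")))  # False
```

With both anchors in place, extra text before or after the number makes the whole test fail.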
Groups
Groups enable you to perform operations on multi-character strings. To specify a group, surround it with parentheses ().
For example, the following regular expression crudely matches a domain name:
First, we match one or more word characters or hyphens, and then we match one or more suffixes, each suffix consisting of a period followed by one or more word characters.
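In Python, this might look like the sketch below, assuming the pattern described above is `[\w\-]+(\.\w+)+` (the domain names are placeholders):

```python
import re

# Assumed pattern: one or more word characters or hyphens,
# then one or more ".suffix" groups.
domain = re.compile(r"[\w\-]+(\.\w+)+")

print(bool(domain.fullmatch("www.example.com")))  # True
print(bool(domain.fullmatch("localhost")))        # False (no suffix)

# A repeated group only remembers its last repetition.
last_suffix = domain.fullmatch("www.example.com").group(1)
print(last_suffix)  # .com
```

The + after the closing parenthesis applies to the whole group, which is what lets a quantifier act on a multi-character string.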
Regular Expression Examples
You now know enough to write some pretty sophisticated regular expressions.
Can you figure out what the following regular expression matches?
If you guessed "e-mail address", you are correct! You can recognize that our domain-name regular expression has been copied after the "@" sign, and before the "@" sign is simply one or more of an alphanumeric or a handful of other characters. We also surrounded the regex with ^ and $ to ensure that we match the entire string.
Can you guess this one?
This will match an IP address (IPv4). We use the quantifier-on-group trick like we did earlier when we matched valid domain names.
Note: Both of these examples are crude. You should use formal regular expressions issued by organizations like the IETF when matching e-mail addresses and the like.
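To see how crude such a pattern can be, here is a Python sketch with an assumed IP address pattern, `^\d{1,3}(\.\d{1,3}){3}$`, which is a guess at the general shape rather than the exact expression from the example:

```python
import re

# Assumed crude pattern: 1-3 digits, then three more ".digits" groups.
ipv4 = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")

print(bool(ipv4.search("192.168.0.1")))  # True
print(bool(ipv4.search("192.168.0")))    # False (only three parts)
print(bool(ipv4.search("999.1.1.1")))    # True, although 999 is not a valid octet
```

The last line is the kind of false positive the note above warns about.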
Groups
One use of regular expressions is when you want to extract information of a known format out of a string. This is when groups come into play.
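As a small Python sketch of extraction with groups, here is one way to pull the time of day out of a date string. The input format "YYYY-MM-DD HH:MM:SS" and the pattern are assumptions chosen for illustration:

```python
import re

# Hypothetical input format: "YYYY-MM-DD HH:MM:SS".
stamp = "2013-01-15 14:35:09"

# Each pair of parentheses captures one piece of the time of day.
match = re.search(r"(\d{2}):(\d{2}):(\d{2})", stamp)
if match:
    hour, minute, second = match.groups()
    print(hour, minute, second)  # 14 35 09
```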
under construction
Using Regular Expressions in Programming Languages
You are learning three new programming languages in CSE 330: PHP (from Module 2), Python (from Module 4), and JavaScript (from Module 6). Below, you may find how to implement regular expressions in each of these three languages.
The example is a function that tests whether a string matches a regular expression representing an e-mail address, and if it does, the function returns the domain name from the e-mail.
PHP
The magic function in PHP is preg_match($regex, $str, $matches), where $regex is the regular expression, $str is the string to test, and $matches is an extra argument which will be modified to contain the matches array.
function domain_from_email($str){
    // Group 1 captures the domain, i.e. everything after the "@".
    if(preg_match('/^[\w\-\+\.]+@([\w\-]+(?:\.[\w]{2,4})+)$/', $str, $matches)){
        return $matches[1];
    }
    return false;
}
Python
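A minimal sketch of the same function in Python, using the standard `re` module and the same pattern as the PHP version. Returning `None` instead of `false` on failure is a choice made here, not something prescribed by the guide:

```python
import re

# Same crude e-mail pattern as in the PHP example; group 1 is the domain.
EMAIL_RE = re.compile(r"^[\w\-\+\.]+@([\w\-]+(?:\.[\w]{2,4})+)$")

def domain_from_email(s):
    """Return the domain part of an e-mail address, or None if s does not match."""
    match = EMAIL_RE.search(s)
    return match.group(1) if match else None

print(domain_from_email("user@example.com"))  # example.com
print(domain_from_email("not-an-email"))      # None
```

Because each suffix is limited to two to four word characters, this pattern rejects some real addresses; it is meant only to mirror the example.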
JavaScript
Basics of Regular Expressions. Uses JavaScript as the language of focus.
Regular Expressions in JavaScript Part 2.
Regular Expressions in JavaScript Part 3.
Using Regular Expressions in Your Workflow
Regular expressions aren't limited to use in programming languages. In fact, you can use them right now in your text editor!
In Komodo, go to File->Find, and in the dialog, select "Regex". Bam: whatever you type into the search box will be evaluated as a regular expression! Here is an example where I find all end tags in my document:
You can also do regular expression find and replace in Komodo. You can use groups, too: in the "Replace with" box, a \1 will insert your first group, a \2 will insert your second group, and so on. Nifty!
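The same group-reference idea exists in code. As a rough Python sketch (this shows Python's `re.sub`, not Komodo's behavior, and the names are made up), the replacement string can refer to groups as \1 and \2:

```python
import re

names = "Doe, Jane\nRoe, Richard"

# \1 is the first group (last name), \2 the second (first name).
swapped = re.sub(r"(\w+), (\w+)", r"\2 \1", names)
print(swapped)
```

This turns each "Last, First" line into "First Last".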