Regular Expressions

From CSE330 Wiki

Is this string a phone number? Is it an e-mail address? Is it a URL? How do I parse the time of day out of this date string? How do I find all words in the string that contain the letter "a"?

The tool we use to answer these questions is regular expressions. Regular expressions are a powerhouse for string matching.

This guide is an introduction to regular expressions. The topic is deep enough to fill its own class, so we will only scratch the surface: enough to get you up and running with regular expressions and using them in your projects.

Testing Regular Expressions

There are several tools freely available online to assist you when writing regular expressions. In this wiki, we will be using two:

  • RegExPal: Performs a global search on an input string. Best for writing regular expressions used for parsing text.
  • Debuggex: Matches a regular expression against the input string, but does not support searches. Shows a schematic to help you understand what your regular expression is doing. Best for fine-tuning regular expressions that act on known input formats.

You can view all regular expressions on this page in RegExPal or Debuggex by pressing "View in _____" above each example.

Your First Regular Expression

Here we go. If you wish to follow along, click "View in RegExPal" below; Debuggex does not work well for this example.

This regular expression finds all words of the form "_are", where _ is any single word character (a letter, a digit, or an underscore). Let's go through the parts of this regular expression.

  • \b means "word boundary". If we didn't have the \b's, then this regular expression would also match inside longer words like "daycare", "apparel", or "rarest".
  • \w means "any alphanumeric character or underscore". Thus, this regular expression will match dare, care, zare, 5are, _are, and so on.
  • are is a sequence of literal characters: it matches exactly the letters "a", "r", "e" in that order.
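The parts listed above can be tried out in code. The following is a minimal sketch in Python, assuming the pattern is \b\ware\b (reconstructed from the parts described above, since the pattern itself is not shown here); the sample text is hypothetical.

```python
import re

# Assumed pattern, reconstructed from the parts listed above: \b, \w, "are", \b.
pattern = re.compile(r"\b\ware\b")

# Hypothetical sample text for illustration.
text = "I dare you to care about daycare, apparel, 5are and _are."

# With word boundaries, only standalone four-character words match.
matches = pattern.findall(text)
print(matches)

# Without the \b's, the pattern also matches inside longer words.
loose = re.findall(r"\ware", text)
print(loose)
```

The first list holds only dare, care, 5are, and _are. The second list also picks up "care" from "daycare" and "pare" from "apparel", which is the behavior the word boundaries prevent.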

Animation (RegexExampleAnimation-32.gif): a step-by-step view of how this regular expression is assembled.

Groups

One use of regular expressions is extracting information of a known format from a string. This is where groups come into play.

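For example, the following regular expression matches an e-mail address. The parentheses form a group that captures the domain name, the part after the @:

[\w\-\+\.]+@([\w\-]+(?:\.[\w]{2,4})+)

The inner (?: ... ) is a non-capturing group: it groups part of the pattern so that it can be repeated, without saving what it matched. Debuggex (http://www.debuggex.com/) is a good tool for visualizing this expression.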
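To make groups concrete, here is a small sketch in Python that runs the e-mail pattern from this section over a piece of text and reads back both the whole match and the captured group. The sample text and the names EMAIL, text, and domains are assumptions made for illustration.

```python
import re

# The e-mail pattern from this section; the outer parentheses form group 1.
EMAIL = re.compile(r"[\w\-\+\.]+@([\w\-]+(?:\.[\w]{2,4})+)")

# Hypothetical sample text for illustration.
text = "Contact alice@example.com or bob.smith+lists@mail.test.org today."

for m in EMAIL.finditer(text):
    # group(0) is the whole match; group(1) is the first capturing group.
    print(m.group(0), "->", m.group(1))

# With exactly one capturing group, findall returns just that group's text.
domains = EMAIL.findall(text)
print(domains)
```

Note that {2,4} limits each dot-separated part after the first to four characters, so when searching inside longer text the pattern can stop early on domains with longer parts.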

Using Regular Expressions in Programming Languages

You are learning three new programming languages in CSE 330: PHP (from Module 2), Python (from Module 4), and JavaScript (from Module 6). Below, you will find how to use regular expressions in each of these three languages.

The example is a function that tests whether a string matches a regular expression representing an e-mail address, and if it does, the function returns the domain name from the e-mail.

PHP

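The key function in PHP is preg_match($regex, $str, $matches), where $regex is the regular expression, $str is the string to test, and $matches is an optional argument, passed by reference, that is filled with the matches array. $matches[0] holds the text that matched the full pattern, and $matches[1] holds the text captured by the first group.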

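function domain_from_email($str){
	// Single quotes keep the backslashes literal; group 1 captures the domain after the @.
	if(preg_match('/^[\w\-\+\.]+@([\w\-]+(?:\.[\w]{2,4})+)$/', $str, $matches)){
		return $matches[1];	// the captured domain name
	} else {
		return false;	// $str does not look like an e-mail address
	}
}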

Python
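
One possible Python counterpart to the PHP function above, sketched with the standard re module. The names EMAIL_REGEX and domain_from_email and the choice to return None (rather than false) on failure are assumptions made for this illustration, not something this page prescribes.

```python
import re

# Same pattern as the PHP example, minus the ^ and $ anchors:
# re.fullmatch already requires the entire string to match.
EMAIL_REGEX = re.compile(r"[\w\-\+\.]+@([\w\-]+(?:\.[\w]{2,4})+)")

def domain_from_email(s):
    """Return the domain of an e-mail address, or None if s is not one."""
    match = EMAIL_REGEX.fullmatch(s)
    if match:
        # group(1) is the text captured by the first pair of parentheses.
        return match.group(1)
    return None

print(domain_from_email("student@example.com"))
print(domain_from_email("not an e-mail"))
```

Compiling the pattern once with re.compile avoids re-parsing it on every call, and the raw string (the r prefix) keeps the backslashes from being interpreted by Python before the regular expression engine sees them.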

JavaScript