Regular Expressions

What Is a Regular Expression?

Regular expressions are patterns used to match character combinations in strings. In JavaScript, regular expressions are also objects. These patterns are used with the exec and test methods of RegExp, and with the match, replace, search, and split methods of String.
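As a quick orientation, the sketch below calls each of the methods named above once. The sample string and variable names are made up for illustration; only the method names come from the text.

```javascript
// Hypothetical sample text, chosen only for illustration.
const sample = 'Chapter 3.14 begins here';
const chapterPattern = /Chapter (\d+)\.\d*/;

// RegExp methods
const hasChapter = chapterPattern.test(sample);   // boolean
const execResult = chapterPattern.exec(sample);   // match array, or null

// String methods
const matchResult = sample.match(chapterPattern); // like exec() for a non-global pattern
const position = sample.search(/begins/);         // index of the first match, or -1
const replaced = sample.replace(/begins/, 'ends');
const words = sample.split(/\s+/);

console.log(hasChapter);     // true
console.log(execResult[0]);  // 'Chapter 3.14'
console.log(execResult[1]);  // '3'
console.log(matchResult[1]); // '3'
console.log(position);       // 13
console.log(replaced);       // 'Chapter 3.14 ends here'
console.log(words);          // [ 'Chapter', '3.14', 'begins', 'here' ]
```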

How to Write a Regular Expression Pattern

A regular expression pattern is composed of simple characters, such as /abc/, or a combination of simple and special characters, such as /ab*c/ or /Chapter (\d+)\.\d*/. The last example includes parentheses, which are used as a memory device: the match made with this part of the pattern is remembered for later use, as described in Using Parenthesized Substring Matches.

Using Simple Patterns

Simple patterns are constructed of characters for which you want to find a direct match. For example, the pattern /abc/ matches character combinations in strings only when exactly the characters ‘abc’ occur together and in that order. Such a match would succeed in the strings “Hi, do you know your abc’s?” and “The latest airplane designs evolved from slabcraft.” In both cases the match is with the substring ‘abc’. There is no match in the string ‘Grab crab’ because while it contains the substring ‘ab c’, it does not contain the exact substring ‘abc’.
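The three strings from the paragraph above can be checked directly. This is a minimal sketch; the variable names are only illustrative.

```javascript
const abc = /abc/;
const candidates = [
  "Hi, do you know your abc's?",
  'The latest airplane designs evolved from slabcraft.',
  'Grab crab',
];

// test() returns true when the pattern occurs anywhere in the string.
const simpleResults = candidates.map((s) => abc.test(s));
console.log(simpleResults); // [ true, true, false ]
```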

Using Special Characters

When the search for a match requires something more than a direct match, such as finding one or more b’s, or finding white space, the pattern includes special characters. For example, the pattern /ab*c/ matches any character combination in which a single ‘a’ is followed by zero or more ‘b’s (* means 0 or more occurrences of the preceding item) and then immediately followed by ‘c’. In the string “cbbabbbbcdebc,” the pattern matches the substring ‘abbbbc’.
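A short sketch of the /ab*c/ example, using the same string as the text. The second check is an extra case added here to show that "zero or more" really does allow no b at all.

```javascript
const starMatch = 'cbbabbbbcdebc'.match(/ab*c/);
console.log(starMatch[0]);    // 'abbbbc'
console.log(starMatch.index); // 3

// Zero b's is also acceptable to b*.
const zeroBs = 'ac'.match(/ab*c/);
console.log(zeroBs[0]);       // 'ac'
```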

The following table provides a quick list and description of the special characters that can be used in regular expressions:

.                         Any character except newline.
\.                        A period (and so on for \*, \(, \\, etc.)
^                         The start of the string.
$                         The end of the string.
\d, \w, \s                A digit, word character [A-Za-z0-9_], or whitespace.
\D, \W, \S                Anything except a digit, word character, or whitespace.
[abc]                     Character a, b, or c.
[a-z]                     a through z.
[^abc]                    Any character except a, b, or c.
aa|bb                     Either aa or bb.
?                         Zero or one of the preceding element.
*                         Zero or more of the preceding element.
+                         One or more of the preceding element.
{n}                       Exactly n of the preceding element.
{n,}                      n or more of the preceding element.
{m,n}                     Between m and n of the preceding element.
??, *?, +?, {n}?, etc.    Same as above, but as few as possible.
(expr)                    Capture expr for use with \1, etc.
(?:expr)                  Non-capturing group.
(?=expr)                  Followed by expr.
(?!expr)                  Not followed by expr.
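A few of the table entries are easier to grasp when run. The sketch below exercises the lazy quantifiers, groups, and lookaheads; all input strings are invented for illustration.

```javascript
// Greedy vs. lazy: +? takes as few characters as possible.
const greedy = 'aaaa'.match(/a+/)[0];
const lazy = 'aaaa'.match(/a+?/)[0];
console.log(greedy, lazy); // 'aaaa' 'a'

// {m,n}: between 2 and 3 digits, preferring the longer match.
const bounded = '12345'.match(/\d{2,3}/)[0];
console.log(bounded); // '123'

// (expr) with \1: the same word twice in a row.
const doubled = /(\w+) \1/.test('hello hello');
console.log(doubled); // true

// (?:expr): groups without capturing, so the result has no extra slot.
const nonCapturing = 'abab'.match(/(?:ab)+/);
console.log(nonCapturing.length); // 1

// (?=expr) and (?!expr): check what follows without consuming it.
const followed = 'price: 100USD'.match(/\d+(?=USD)/)[0];
const notFollowed = /foo(?!bar)/.test('foobar');
console.log(followed, notFollowed); // '100' false
```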

To find a sequence of characters, you have to define the rules that will always be true for them and then turn those rules into an expression. Here are codes used to represent different types of characters:

\d This represents any digit (0 to 9)
\D This represents anything that isn’t a digit
\s This represents anything considered white space (space, tab, newline, etc.)
\S This represents anything not considered white space
\w This represents any word character (a letter, a digit, or an underscore)
\W This represents anything that is not a word character
. Matches any character, except a line break
\b Matches a word boundary: the position just before or just after a whole word
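The codes above can be tried on a short string. This is only a sketch; the sample line is made up.

```javascript
const line = 'Room 42, floor_3';

const digits = line.match(/\d+/g);      // runs of digits
const wordRuns = line.match(/\w+/g);    // runs of letters, digits, or underscores
const spaceCount = line.match(/\s/g).length;

console.log(digits);     // [ '42', '3' ]
console.log(wordRuns);   // [ 'Room', '42', 'floor_3' ]
console.log(spaceCount); // 2

// \b matches a position, not a character, so 'cat' must stand alone.
const wholeWord = /\bcat\b/.test('the cat sat');
const insideWord = /\bcat\b/.test('concatenate');
console.log(wholeWord, insideWord); // true false
```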

Searching for a Name

If you were tasked with searching through tons of documents for anyone named Emre, how would you do that? You can combine literal text with the codes above, as in this expression:

‘Emre\s\w+\s’ This will search for the word Emre followed by a white space character, 1 or more word characters, and then another white space character. The plus sign (+) stands for 1 or more of the code that precedes it. In this case I’m stating that I’m looking for 1 or more word characters (\w).

There are other codes like the plus (+):

? Signifies you are looking for 0 or 1 repetitions of the code that precedes
* Signifies you expect 0 or more repetitions
{n} Used when you expect a specific number (n) of repetitions
{x,y} Used when you expect between (x) to (y) repetitions
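Here is a rough sketch of the name search and the other repetition codes. The sentence and the surnames in it are hypothetical, and the g flag is added so that match() returns every occurrence.

```javascript
// Hypothetical document text.
const doc = 'Call Emre Kaya today, then email Emre Demir tomorrow.';
const namePattern = /Emre\s\w+\s/g;

// Note that each result includes the trailing white space matched by the final \s.
const names = doc.match(namePattern);
console.log(names); // [ 'Emre Kaya ', 'Emre Demir ' ]

// ? : zero or one 'u'
const spellings = 'color colour'.match(/colou?r/g);
console.log(spellings); // [ 'color', 'colour' ]

// {n} and {x,y} : exact and bounded repetition
const exactlyThree = /^\d{3}$/.test('123');
const twoToFour = /^\d{2,4}$/.test('12345');
console.log(exactlyThree, twoToFour); // true false
```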

Some Characters Need Special Care

Some characters have a special meaning in a regex and need to be escaped with a backslash when you want to match them literally. They include:

(
)
*
+
?
[
\
^
{
|
.
$

For instance, we could search for a dollar amount with this regex:

\$\d*\.\d{2}

Explanation of the regex above:

Looking for a dollar sign
Followed by 0 or more digits
Followed by a period
Followed by exactly 2 digits
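The dollar-amount pattern can be checked against a few made-up strings. This sketch also shows two consequences of the breakdown above: \d* accepts an amount with no digits before the period, and the cents are mandatory.

```javascript
const price = /\$\d*\.\d{2}/;

const total = 'Total: $19.99'.match(price)[0];
const centsOnly = '$.50'.match(price)[0];
const noCents = price.test('$20');

console.log(total);     // '$19.99'
console.log(centsOnly); // '$.50'
console.log(noCents);   // false
```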

How to Search for Specific White Space

If you want to search for specific white space, you use the following codes:

\e Escape (supported in some regex flavors, but not in JavaScript, where \x1B can be used instead)
\f Form Feed
\n Newline
\r Carriage Return
\t Horizontal Tab

Just place them in the pattern as if they were any other character.
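As a small sketch, the codes can be used to pull apart tab-separated text with Windows-style line endings. The data here is invented for illustration.

```javascript
// Hypothetical tab-separated data with \r\n line endings.
const raw = 'name\tage\r\nAda\t36';

// Split into lines on \r\n, then into fields on \t.
const rows = raw.split(/\r\n/).map((row) => row.split(/\t/));
console.log(rows); // [ [ 'name', 'age' ], [ 'Ada', '36' ] ]

// A form feed is matched like any other character.
const hasFormFeed = /\f/.test('page one\fpage two');
console.log(hasFormFeed); // true

// JavaScript has no \e, so the escape character is written as \x1B.
const hasEscape = /\x1B/.test('\x1B[0m');
console.log(hasEscape); // true
```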

Match One of a Couple of Characters

What could you do if you wanted to search for commonly misspelled words? Turkiye is commonly misspelled, and here is how you could search for both Turkiye and Turkiya.

Turkiy[ae] : This regex will match the word spelled either way. Only one of the letters inside the square brackets is used for each match, however. Square brackets can also hold a range of characters, as in these examples:

[a-z] This would match any lowercase letter
[0-9] This would match any digit
[A-Fa-z1-4] This would match uppercase letters from A to F, all lowercase letters, and the digits 1 to 4
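The bracket examples above can be verified with a few test strings. This is a minimal sketch; the third spelling and the other inputs are made up to show a non-match.

```javascript
const spelling = /Turkiy[ae]/;
const spellingResults = ['Turkiye', 'Turkiya', 'Turkiyo'].map((w) => spelling.test(w));
console.log(spellingResults); // [ true, true, false ]

// Ranges inside brackets
const lower = 'a1B2'.match(/[a-z]/g);
const nums = 'a1B2'.match(/[0-9]/g);
console.log(lower, nums); // [ 'a' ] [ '1', '2' ]

// Several ranges in one set; ^ and $ force the whole string to fit.
const mixed = /^[A-Fa-z1-4]+$/;
console.log(mixed.test('Fz4'), mixed.test('G5')); // true false
```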

Remember searching through tons of pages for a name? If the name also appears in shortened forms, such as Tur and Turk for Turkiye, a single literal pattern will miss a few. Don’t worry, we can easily find the Tur’s and Turk’s with the vertical bar code. The vertical bar (|) is read as the word OR in a regex. To find all the variants, each followed by another word, we can use this code instead:

\b(Turkiye|Turk|Tur)\s\w+\b

Note: The code \b matches a word boundary, that is, the zero-width position between a word character and a non-word character (or the start or end of the string). \B matches any position that is not a word boundary.
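The sketch below tries alternation and word boundaries on an invented sentence. It assumes a form of the pattern with \s between the name and the following word, since a \b placed directly before \w+ at the end of a word can never match there.

```javascript
// Assumed variant of the pattern: name, white space, then one more word.
const variants = /\b(Turkiye|Turk|Tur)\s\w+\b/g;

// Hypothetical sentence.
const note = 'Turkiye Airlines, Turk Telekom and Tur Bus';
const foundVariants = note.match(variants);
console.log(foundVariants); // [ 'Turkiye Airlines', 'Turk Telekom', 'Tur Bus' ]

// \b requires a boundary; \B requires the absence of one.
const standalone = /\bTur\b/.test('Turkiye');
const prefixOnly = /\bTur\B/.test('Turkiye');
console.log(standalone, prefixOnly); // false true
```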

Using Search Codes Multiple Times

By surrounding part of a pattern in parentheses, you can refer back to whatever it matched with a backslash (\) followed by a number giving the group’s position in the regex. In the pattern above, (Turkiye|Turk|Tur) is the first parenthesized group, so it can be used again with \1. The next parenthesized group would be referenced with \2, and so on up to \9. How groups after the ninth are referenced depends on the regex flavor; in JavaScript, \10 and higher work the same way as long as the pattern contains that many capturing groups.
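A minimal sketch of numbered references in JavaScript follows. Both patterns and inputs are invented for illustration: the first finds a word that is accidentally repeated, the second uses two groups in reverse order.

```javascript
// \1 must match exactly the same text that group 1 captured.
const repeatedWord = /(\w+)\s\1/;
const repeat = 'it is is here'.match(repeatedWord);
console.log(repeat[0]); // 'is is'
console.log(repeat[1]); // 'is'

// Two groups: \2 then \1 gives a mirrored four-character sequence.
const mirrored = /(\d)(\w)\2\1/;
console.log(mirrored.test('1aa1')); // true
console.log(mirrored.test('1ab1')); // false
```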

It would also be useful to grab just the text that lies between tags in HTML code. For a pair of bold tags, for example, you could do that with the following code:

‘<b>(.+)<\/b>’

Are you starting to see why people get confused by regexes? I’ll break this down for you:

Everything is surrounded with quotes
<b> : The opening tag; you don’t have to escape the angle brackets
(.+) : Capture 1 or more characters and store them in \1
< : The angle bracket that opens the closing tag; again, no escape is needed
\/ : Escape the forward slash, which would otherwise end a /…/ regex literal in JavaScript
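As a rough sketch of the idea, the code below extracts the text between a pair of bold tags. The choice of the b tag and the HTML snippets are assumptions made for illustration. It also shows a pitfall of (.+): it is greedy, so with two tag pairs it runs to the last closing tag, while the lazy (.+?) stops at the first.

```javascript
// Assumed example: text between <b> and </b>.
const boldText = /<b>(.+)<\/b>/;
const html = '<p>Regexes are <b>powerful</b> tools.</p>';
console.log(html.match(boldText)[1]); // 'powerful'

// With two pairs, the greedy group swallows everything in between.
const twoPairs = '<b>one</b> and <b>two</b>';
const greedyText = twoPairs.match(/<b>(.+)<\/b>/)[1];
const lazyText = twoPairs.match(/<b>(.+?)<\/b>/)[1];
console.log(greedyText); // 'one</b> and <b>two'
console.log(lazyText);   // 'one'
```

For anything beyond simple, well-formed snippets, a real HTML parser is usually the safer tool.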

More on…

You can also reference the beginning of a line of text with the caret symbol (^). So if you wanted to capture a sentence that starts with “The cat”, you’d use this code:

‘^The cat\s\w*\.’

You can also reference the end of a line of text with the Dollar Sign ($), in the same way.
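The anchors can be tried out as follows. The inputs are made up, and the last two checks rely on the m (multiline) flag, which in JavaScript makes ^ and $ apply to each line rather than to the whole string.

```javascript
const startsWithCat = /^The cat\s\w*\./;
console.log(startsWithCat.test('The cat sat.'));       // true
console.log(startsWithCat.test('I saw The cat sat.')); // false

// $ anchors the match to the end.
const endsWithDay = /day\.$/;
console.log(endsWithDay.test('Make my day.'));         // true
console.log(endsWithDay.test('Make my day. Please.')); // false

// Without the m flag, ^ only matches at the start of the string.
const twoLines = 'Intro\nThe cat sat.';
console.log(startsWithCat.test(twoLines));             // false
console.log(/^The cat\s\w*\./m.test(twoLines));        // true
```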

Examples

Example 1: Using a regular expression to change data format

The following script uses the replace() method of the String instance to match a name in the format first last and output it in the format last, first. In the replacement text, the script uses $1 and $2 to indicate the results of the corresponding matching parentheses in the regular expression pattern.

var re = /(\w+)\s(\w+)/;
var str = 'John Smith';
var newstr = str.replace(re, '$2, $1');
console.log(newstr);

This displays “Smith, John”.

Example 2: Using regular expression to split lines with different line endings/ends of line/line breaks

The default line ending varies depending on the platform (Unix, Windows, etc.). The line splitting provided in this example works on all platforms.

var text = 'Some text\nAnd some more\r\nAnd yet\rThis is the end';
var lines = text.split(/\r\n|\r|\n/);
console.log(lines) // prints [ 'Some text', 'And some more', 'And yet', 'This is the end' ]

Note that the order of the patterns in the regular expression matters.
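To see why the order matters, this sketch compares the pattern from the example with a reordered one. When \r is tried before \r\n, a Windows line ending is split twice and leaves an empty string behind.

```javascript
const mixedText = 'Some text\nAnd some more\r\nAnd yet\rThis is the end';

// \r\n is tried first, so it is consumed as a single separator.
const goodSplit = mixedText.split(/\r\n|\r|\n/);
console.log(goodSplit.length); // 4

// \r is tried first, so \r\n is treated as two separators.
const badSplit = mixedText.split(/\r|\r\n|\n/);
console.log(badSplit);
// [ 'Some text', 'And some more', '', 'And yet', 'This is the end' ]
```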

Example 3: Using regular expression on multiple lines

var s = 'Please yes\nmake my day!';
s.match(/yes.*day/);
// Returns null
s.match(/yes[^]*day/);
// Returns 'yes\nmake my day'

Example 4: Using a regular expression with the “sticky” flag

This example demonstrates how one could use the sticky flag on regular expressions to match individual lines of multiline input.

var text = 'First line\nSecond line';
var regex = /(\S+) line\n?/y;

var match = regex.exec(text);
console.log(match[1]);        // prints 'First'
console.log(regex.lastIndex); // prints '11'

var match2 = regex.exec(text);
console.log(match2[1]);       // prints 'Second'
console.log(regex.lastIndex); // prints '22'

var match3 = regex.exec(text);
console.log(match3 === null); // prints 'true'

One can test at run-time whether the sticky flag is supported, using try { … } catch { … }. For this, either an eval(…) expression or the RegExp(regex-string, flags-string) syntax must be used (since the /regex/flags notation is processed at compile-time, so throws an exception before the catch block is encountered). For example:

var supports_sticky;
try { RegExp('', 'y'); supports_sticky = true; }
catch(e) { supports_sticky = false; }
console.log(supports_sticky); // prints 'true'

Example 5: Regular expression and Unicode characters

As mentioned above, \w and \W only match ASCII-based characters; for example, “a” to “z”, “A” to “Z”, “0” to “9” and “_”. To match characters from other languages such as Cyrillic or Hebrew, use \uhhhh, where “hhhh” is the character’s Unicode value in hexadecimal. This example demonstrates how one can separate out Unicode characters from a word.

var text = 'Образец text на русском языке';
var regex = /[\u0400-\u04FF]+/g;

var match = regex.exec(text);
console.log(match[0]);        // prints 'Образец'
console.log(regex.lastIndex); // prints '7'

var match2 = regex.exec(text);
console.log(match2[0]);       // prints 'на' [did not print 'text']
console.log(regex.lastIndex); // prints '15'

// and so on

Here’s an external resource for getting the complete Unicode block range for different scripts: Regexp-unicode-block.

Example 6: Extracting subdomain name from URL

var url = 'http://xxx.domain.com';
console.log(/[^.]+/.exec(url)[0].substr(7)); // prints 'xxx'

Reference: RegExp: Special characters in regular expressions
