Previous Page
Next Page

Using MySQL Regular Expressions

So what does this have to do with MySQL? As already explained, all regular expressions do is match text, comparing a pattern (the regular expression) with a string of text. MySQL provides rudimentary support for regular expressions with WHERE clauses, allowing you to specify regular expressions that are used to filter data retrieved using SELECT.

Note

Just a Subset of the Regular Expression Language If you are already familiar with regular expressions, take note. MySQL only supports a small subset of what is supported in most regular expression implementations, and this chapter covers most of what is supported.


This will all become much clearer with some examples.

Basic Character Matching

We'll start with a very simple example. The following statement retrieves all rows where column prod_name contains the text 1000:

Input

SELECT prod_name
FROM products
WHERE prod_name REGEXP '1000'
ORDER BY prod_name;

Output

+--------------+
| prod_name    |
+--------------+
| JetPack 1000 |
+--------------+

Analysis

This statement looks much like the ones that used LIKE (in Chapter 8, "Using Wildcard Filtering"), except that the keyword LIKE has been replaced with REGEXP. This tells MySQL that what follows is to be treated as a regular expression (one that just matches the literal text 1000).

So, why bother using a regular expression? Well, in the example just used, regular expressions really add no value (and probably hurt performance), but consider this next example:

Input

SELECT prod_name
FROM products
WHERE prod_name REGEXP '.000'
ORDER BY prod_name;

Output

+--------------+
| prod_name    |
+--------------+
| JetPack 1000 |
| JetPack 2000 |
+--------------+

Analysis

Here the regular expression .000 was used. . is a special character in the regular expression language. It means match any single character, and so both 1000 and 2000 matched and were returned.

Of course, this particular example could also have been accomplished using LIKE and wildcards (as seen in Chapter 8).

Note

LIKE Versus REGEXP There is one very important difference between LIKE and REGEXP. Look at these two statements:

SELECT prod_name
FROM products
WHERE prod_name LIKE '1000'
ORDER BY prod_name;

and

SELECT prod_name
FROM products
WHERE prod_name REGEXP '1000'
ORDER BY prod_name;

If you were to try them both you'd discover that the first returns no data and the second returns one row. Why is this?

As seen in Chapter 8, LIKE matches an entire column. If the text to be matched existed in the middle of a column value, LIKE would not find it and the row would not be returned (unless wildcard characters were used). REGEXP, on the other hand, looks for matches within column values, and so if the text to be matched existed in the middle of a column value, REGEXP would find it and the row would be returned. This is a very important distinction.

So can REGEXP be used to match entire column values (so that it functions like LIKE)? Actually, yes, using the ^ and $ anchors, as will be explained later in this chapter.


Tip

Matches Are Not Case-Sensitive Regular expression matching in MySQL (as of version 3.23.4) are not case-sensitive (either case will be matched). To force case-sensitivity, you can use the BINARY keyword, as in

WHERE prod_name REGEXP BINARY 'JetPack .000'


Performing OR Matches

To search for one of two strings (either one or the other), use | as seen here:

Input

SELECT prod_name
FROM products
WHERE prod_name REGEXP '1000|2000'
ORDER BY prod_name;

Output

+--------------+
| prod_name    |
+--------------+
| JetPack 1000 |
| JetPack 2000 |
+--------------+

Analysis

Here the regular expression 1000|2000 was used. | is the regular expression OR operator. It means match one or the other, and so both 1000 and 2000 matched and were returned.

Using | is functionally similar to using OR statements in SELECT statements, with multiple OR conditions being consolidated into a single regular expression.

Tip

More Than Two OR Conditions More than two OR conditions may be specified. For example, '1000|2000|3000' would match 1000 or 2000 or 3000.


Matching One of Several Characters

. matches any single character. But what if you wanted to match only specific characters? You can do this by specifying a set of characters enclosed within [ and ], as seen here:

Input

SELECT prod_name
FROM products
WHERE prod_name REGEXP '[123] Ton'
ORDER BY prod_name;

Output

+-------------+
| prod_name   |
+-------------+
| 1 ton anvil |
| 2 ton anvil |
+-------------+

Analysis

Here the regular expression [123] Ton was used. [123] defines a set of characters, and here it means match 1 or 2 or 3, so both 1 ton and 2 ton matched and were returned (there was no 3 ton).

As you have just seen, [] is another form of OR statement. In fact, the regular expression [123] Ton is shorthand for [1|2|3] Ton, which also would have worked. But the [] characters are needed to define what the OR statement is looking for. To better understand this, look at the next example:

Input

SELECT prod_name
FROM products
WHERE prod_name REGEXP '1|2|3 Ton'
ORDER BY prod_name;

Output

+---------------+
| prod_name     |
+---------------+
| 1 ton anvil   |
| 2 ton anvil   |
| JetPack 1000  |
| JetPack 2000  |
| TNT (1 stick) |
+---------------+

Analysis

Well, that did not work. The two required rows were retrieved, but so were three others. This happened because MySQL assumed that you meant '1' or '2' or '3 ton'. The | character applies to the entire string unless it is enclosed with a set.

Sets of characters can also be negated. That is, they'll match anything but the specified characters. To negate a character set, place a ^ at the start of the set. So, whereas [123] matches characters 1, 2, or 3, [^123] matches anything but those characters.

Matching Ranges

Sets can be used to define one or more characters to be matched. For example, the following will match digits 0 tHRough 9:

[0123456789]

To simplify this type of set, - can be used to define a range. The following is functionally identical to the list of digits just seen:

[0-9]

Ranges are not limited to complete sets[1-3] and [6-9] are valid ranges, too. In addition, ranges need not be numeric, and so [a-z] will match any alphabetical character.

Here is an example:

Input

SELECT prod_name
FROM products
WHERE prod_name REGEXP '[1-5] Ton'
ORDER BY prod_name;

Output

+--------------+
| prod_name    |
+--------------+
| .5 ton anvil |
| 1 ton anvil  |
| 2 ton anvil  |
+--------------+

Analysis

Here the regular expression [1-5] Ton was used. [1-5] defines a range, and so this expression means match 1 through 5, and so three matches were returned. .5 ton was returned because 5 ton matched (without the . character).

Matching Special Characters

The regular expression language is made up of special characters that have specific meanings. You've already seen ., [], |, and -, and there are others, too. Which begs the question, if you needed to match those characters, how would you do so? For example, if you wanted to find values that contain the . character, how would you search for it? Look at this example:

Input

SELECT vend_name
FROM vendors
WHERE vend_name REGEXP '.'
ORDER BY vend_name;

Output

+----------------+
| vend_name      |
+----------------+
| ACME           |
| Anvils R Us    |
| Furball Inc.   |
| Jet Set        |
| Jouets Et Ours |
| LT Supplies    |
+----------------+

Analysis

That did not work. . matches any character, and so every row was retrieved.

To match special characters they must be preceded by \\. So, \\- means find and \\. means find .:

Input

SELECT vend_name
FROM vendors
WHERE vend_name REGEXP '\\.'
ORDER BY vend_name;

Output

+--------------+
| vend_name    |
+--------------+
| Furball Inc. |
+--------------+

Analysis

That worked. \\. matches ., and so only a single row was retrieved. This process is known as escaping, and all characters that have special significance within regular expressions must be escaped this way. This includes ., |, [], and all of the other special characters used thus far.

\\ is also used to refer to metacharacters (characters that have specific meanings), as listed in Table 9.1.

Table 9.1. Whitespace Metacharacters

Metacharacter

Description

\\f

Form feed

\\n

Line feed

\\r

Carriage return

\\t

Tab

\\v

Vertical tab


Tip

To Match \ To match the backslash character itself (\), you would need to use \\\.


Note

\ or \\? Most regular expression implementation use a single backslash to escape special characters to be able to use them as literals. MySQL, however, requires two backslashes (MySQL itself interprets one and the regular expression library interprets the other).


Matching Character Classes

There are matches that you'll find yourself using frequently, digits, or all alphabetical characters, or all alphanumerical characters, and so on. To make working with these easier, you may use predefined character sets known as character classes. Table 9.2 lists the character classes and what they mean.

Table 9.2. Character Classes

Class

Description

[:alnum:]

Any letter or digit, (same as [a-zA-Z0-9])

[:alpha:]

Any letter (same as [a-zA-Z])

[:blank:]

Space or tab (same as [\\t ])

[:cntrl:]

ASCII control characters (ASCII 0 tHRough 31 and 127)

[:digit:]

Any digit (same as [0-9])

[:graph:]

Same as [:print:] but excludes space

[:lower:]

Any lowercase letter (same as [a-z])

[:print:]

Any printable character

[:punct:]

Any character that is neither in [:alnum:] nor [:cntrl:]

[:space:]

Any whitespace character including the space (same as [\\f\\n\\r\\t\\v ])

[:upper:]

Any uppercase letter (same as [A-Z])

[:xdigit:]

Any hexadecimal digit (same as [a-fA-F0-9])


Matching Multiple Instances

All of the regular expressions used thus far attempt to match a single occurrence. If there is a match, the row is retrieved and if not, nothing is retrieved. But sometimes you'll require greater control over the number of matches. For example, you might want to locate all numbers regardless of how many digits the number contains, or you might want to locate a word but also be able to accommodate a trailing s if one exists, and so on.

This can be accomplished using the regular expressions repetition metacharacters, listed in Table 9.3.

Table 9.3. Repetition Metacharacters

Metacharacter

Description

*

0 or more matches

+

1 or more matches (equivalent to {1,})

?

0 or 1 match (equivalent to {0,1})

{n}

Specific number of matches

{n,}

No less than a specified number of matches

{n,m}

Range of matches (m not to exceed 255)


Following are some examples.

Input

SELECT prod_name
FROM products
WHERE prod_name REGEXP '\\([0-9] sticks?\\)'
ORDER BY prod_name;

Output

+----------------+
| prod_name      |
+----------------+
| TNT (1 stick)  |
| TNT (5 sticks) |
+----------------+

Analysis

Regular expression \\([0-9] sticks?\\) requires some explanation. \\( matches (, [0-9] matches any digit (1 and 5 in this example), sticks? matches stick and sticks (the ? after the s makes that s optional because ? matches 0 or 1 occurrence of whatever it follows), and \\) matches the closing ). Without ? it would have been very difficult to match both stick and sticks.

Here's another example. This time we'll try to match four consecutive digits:

Input

SELECT prod_name
FROM products
WHERE prod_name REGEXP '[[:digit:]]{4}'77
ORDER BY prod_name;

Output

+--------------+
| prod_name    |
+--------------+
| JetPack 1000 |
| JetPack 2000 |
+--------------+

Analysis

As explained previously, [:digit:] matches any digit, and so [[:digit:]] is a set of digits. {4} requires exactly four occurrences of whatever it follows (any digit), and so [[:digit:]]{4} matches any four consecutive digits.

It is worth noting that when using regular expressions there is almost always more than one way to write a specific expression. The previous example could have also been written as follows:

Input

SELECT prod_name
FROM products
WHERE prod_name REGEXP '[0-9][0-9][0-9][0-9]'
ORDER BY prod_name;

Anchors

All of the examples thus far have matched text anywhere within a string. To match text at specific locations, you need to use anchors as listed in Table 9.4.

Table 9.4. Anchor Metacharacters

Metacharacter

Description

^

Start of text

$

End of text

[[:<:]]

Start of word

[[:>:]]

End of word


For example, what if you wanted to find all products that started with a number (including numbers starting with a decimal point)? A simple search for [0-9\\.] (or [[:digit:]\\.]) would not work because it would find matches anywhere within the text. The solution is to use the ^ anchor, as seen here:

Input

SELECT prod_name
FROM products
WHERE prod_name REGEXP '^[0-9\\.]'
ORDER BY prod_name;

Output

+--------------+
| prod_name    |
+--------------+
| .5 ton anvil |
| 1 ton anvil  |
| 2 ton anvil  |
+--------------+

Analysis

^ matches the start of a string. As such, ^[0-9\\.] matches . or any digit only if they are the first characters within a string. Without the ^, four other rows would have been retrieved, too (those that have digits in the middle).

Note

The Dual Purpose ^ ^ has two uses. Within a set (defined using [ and ]) it is used to negate that set. Otherwise it is used to refer to the start of a string.


Note

Making REGEXP Behave Like LIKE Earlier in this chapter I mentioned that LIKE and REGEXP behaved differently in that LIKE matched an entire string and REGEXP matched substrings, too. Using anchors, REGEXP can be made to behave just like LIKE by simply starting each expression with ^ and ending it with $.


Tip

Simple Regular Expression Testing You can use SELECT to test regular expressions without using database tables. REGEXP checks always return 0 (not a match) or 1 (match). You can use REGEXP with literal strings to test expressions and to experiment with them. The syntax would look like this:

SELECT 'hello' REGEXP '[0-9]';

This example would obviously return 0 (as there are no digits in the text hello).



Previous Page
Next Page