java.util.regex
This series of lessons covering regular expressions in Java was modeled
after the tutorial created to teach the com.stevesoft.pat
package.
The com.stevesoft.pat
package is available for download and use
from JavaRegex.com. It's an excellent
alternative package to harvest the power of regular expressions in Java.
This is the second part of a four part introduction to the java.util.regex
package. Part one can be found in The
September Newsletter.
capturing groups and back references
-- (X) , Matcher's group(int)
non-capturing groups -- (?:X)
"look ahead" and "look behind" constructs -- (?=X) , (?!X) , (?<=X) , (?<!X)
flags -- (?idmsux-idmsux) , (?idmsux-idmsux:X)
, Pattern.compile(String, int)
One function of parentheses is to provide a grouping ability for parts of a regular expression. The quantifiers and operators introduced in the previous lesson, that were applied to a single character or character class, can then be applied to a group.
String input = "Fee! Fie! Foe! Fum! " + "I smell the blood of an Englishman. " + "Be he 'live, or be he dead, " + "I'll grind his bones to make my bread." ; Pattern pattern = Pattern.compile( "(F[a-z]{2}! ){4}" ); // Matches four occurrences of a pattern that begins // with "F" followed by two lower case letters, a "!" // and a space. Matcher matcher = pattern.matcher( input ); System.out.println( matcher.find() ); // Prints true. System.out.println( matcher.group() ); // Prints "Fee! Fie! Foe! Fum! ". |
Capturing groups are numbered according to their appearance in the regular expression. The first opening parenthesis is the start of the first capturing group; the second opening parenthesis is the start of the second capturing group; and so on. Each capturing group ends at the matching closing parenthesis. It is possible to have one capturing group embedded in another. So, the pattern "I (am (Sam))" has two capturing groups. The first capturing group is the pattern "am Sam" and the second capturing group is the pattern "Sam".
The capturing group count and corresponding matched subsequence data are maintained
in the Matcher
object. A Matcher
object's String
group( int )
method "returns the input subsequence captured
by the given group during the previous match operation." A Matcher
object's int groupCount()
method "returns the number of capturing
groups in this matcher's pattern5."
Group count number zero refers to the entire pattern match, so matcher.group(
0 )
returns the entire previously matched subsequence and is equivalent
to matcher.group()
. Note that capturing group number zero
is not included in the total group count returned by the groupCount()
method7.
Consider this mildly more involved example demonstrating capturing groups. Note that the group construct limits the scope of the OR operator.
input = "Humpty Dumpty sat on a wall. " + "Humpty Dumpty had a great fall. " + "All the king's horses and all the king's men " + "Couldn't put Humpty together again! " ; pattern = Pattern.compile( "((H|D)(umpty) ){2}" ); // Matches six characters ending in "umpty" and // beginning with "H" or "D". Three capturing // groups are defined and remembered by the Matcher. matcher = pattern.matcher( input ); System.out.println( matcher.find() ); // Prints true. System.out.println( matcher.groupCount() ); // Prints 3. System.out.println( matcher.group( 1 ) ); // Prints "Dumpty ". System.out.println( matcher.group( 2 ) ); // Prints "D". System.out.println( matcher.group( 3 ) ); // Prints "umpty". System.out.println( matcher.group( 0 ) ); // Prints "Humpty Dumpty ". // If it was expected that matcher.group( 1 ) should contain // "Humpty", then remember that the group( int ) method // returns the input subsequence captured by the specified // group during the previous match operation. This match // operation was performed two times - the first time matching // "Humpty" and the second time matching "Dumpty". |
Each matched group maintained in the Matcher
object is called
a "back reference". Referencing a matched group as demonstrated above
is one style of back referencing in Java regular expressions. A later lesson
will introduce another style and use of back referencing.
A slight performance cost is associated with maintaining back references (the
group count and matched subsequence data) in the Matcher
object.
The non-capturing group construct provides the function of grouping pattern
elements without the cost of remembering each matched group as a back reference.
The syntax for a non-capturing group is simply "(?:X)". A non-capturing
group functions much like a capturing group with the distinction that no capturing
group specific data is maintained in the Matcher
.
input = "Humpty Dumpty sat on a wall. " + "Humpty Dumpty had a great fall. " + "All the king's horses and all the king's men " + "Couldn't put Humpty together again! " ; pattern = Pattern.compile( "((?:H|D)(?:umpty) ){2}" ); // Matches six characters ending in "umpty" and // beginning with "H" or "D". Three groups // are defined, one is a capturing group that // will be remembered by the Matcher. matcher = pattern.matcher( input ); System.out.println( matcher.find() ); // Prints true. System.out.println( matcher.groupCount() ); // Prints 1. System.out.println( matcher.group( 1 ) ); // Prints "Dumpty ". System.out.println( matcher.group( 0 ) ); // Prints "Humpty Dumpty ". |
According to J-Sprint's memory profiler, when the previous two code examples (the searches for Humpty Dumpty) were performed one hundred thousand times each, the non-capturing group strategy demonstrated a performance improvement of roughly 0.0003% - not much to write home about. Alternatively, tests against a larger input character sequence (50KB) composed of one hundred groups per match, resulted in the non-capturing group test consuming approximately 40% as much memory as the capturing group test.
Java regular expressions provide two "look ahead" constructs. These
constructs allow the description of a pattern where a specified pattern only
matches if it is followed by the pattern described in the look ahead construct.
The pattern described in the look ahead construct is not part of any matched
subsequence described by the Matcher
object - it is only a requirement
that must be met in order for the specified pattern to match. Though the look
ahead construct is contained within matching opening and closing parentheses,
it is a non-capturing group construct.
input = "Today's specials are apple chocolate pie and cherry banana pie." ; pattern = Pattern.compile( "(apple|cherry)(?= chocolate)" ); // Matches "apple" or "cherry" where the following pattern // matches " chocolate". " chocolate" is not a part of the // resulting match, it follows it. matcher = pattern.matcher( input ); System.out.println( matcher.find() ); // Prints true. System.out.println( matcher.groupCount() ); // Prints 1. System.out.println( matcher.group( 1 ) ); // Prints "apple". System.out.println( matcher.group() ); // Prints "apple". pattern = Pattern.compile( "(apple|cherry)(?! chocolate)" ); // Matches "apple" or "cherry" where the following pattern // does not match " chocolate". matcher = pattern.matcher( input ); System.out.println( matcher.find() ); // Prints true. System.out.println( matcher.groupCount() ); // Prints 1. System.out.println( matcher.group( 1 ) ); // Prints "cherry". System.out.println( matcher.group() ); // Prints "cherry". |
Two "look behind" constructs provide a similar function as the look ahead constructs, the distinction being that the look behind constructs try to match whatever precedes a specified pattern. The pattern described in the look behind construct is not part of any matched subsequence described by the Matcher object - it is only a requirement that must be met in order for the specified pattern to match. The look behind construct is also non-capturing.
input = "Tomorrow's special is fried bananas with baked clam." ; pattern = Pattern.compile( "(?<=fried )(bananas|clam)" ); // Matches "bananas" or "clam" if preceded by "fried ". // "fried " is not part of the resulting match, it precedes it. matcher = pattern.matcher( input ); System.out.println( matcher.find() ); // Prints true. System.out.println( matcher.groupCount() ); // Prints 1. System.out.println( matcher.group( 1 ) ); // Prints "bananas". System.out.println( matcher.group() ); // Prints "bananas". pattern = Pattern.compile( "(?<!fried )(bananas|clam)" ); // Matches "bananas" or "clam" if not preceded by "fried ". matcher = pattern.matcher( input ); System.out.println( matcher.find() ); // Prints true. System.out.println( matcher.groupCount() ); // Prints 1. System.out.println( matcher.group( 1 ) ); // Prints "clam". System.out.println( matcher.group() ); // Prints "clam". |
During the critiquing process of this lesson, Jim Yingst pointed out an important issue concerning look ahead and look behind constructs:
Normally when we use find() repeatedly to match a pattern
several times within a given input, each find() "consumes"
characters in the input string up to the last character matched by the
find. This means that a subsequent find() will not normally
"see" any characters which were already matched by previous finds. Lookahead
and lookbehind are workarounds for this limitation. We can look ahead
to match characters without consuming them, or we can look back to match
characters which were already consumed. |
This non-consumptive behavior of the look ahead and look behind constructs is also known as zero-width matching.
Consider this related example, also suggested by Jim (and adapted to fit a nursery rhyme).
input = "John Jacob Jingleheimer Schmidt " + "His name is my name, too! " + "Whenever we go out, " + "The people always shout " + "There goes John Jacob Jingleheimer Schmidt!" ; pattern = Pattern.compile( "(J\\w+)(?=.+Schmidt )" ); // Matches all words starting with "J" that // precede "Schmidt " (note the space following the t). // The ".+Schmidt " part of the regular // expression is not consumed. matcher = pattern.matcher( input ); while ( matcher.find() ) // Prints { // "John" System.out.println( matcher.group() ); // "Jacob" } // "Jingleheimer" |
Pattern.compile(String, int)
As previously mentioned, the Pattern
class contains two static
factory methods that create and return references to a Pattern
object. Pattern.compile( String )
creates a new Pattern
object from the specified String
. Pattern.compile( String ,
int )
creates a new Pattern
object from the specified
String
with the specified flag. This integer flag changes the way
a Pattern
matches an input sequence, such as turning on or off
case sensitivity or whether the "." meta character can match a line
terminator.
Another way to specify flags that adjust the way a Pattern
matches
is with a flag construct embedded in the regular expression itself. This construct
takes the form "(?flags)" and appears as part of the String
used to construct a Pattern
. The embedded flag construct is a non-capturing
group construct (in other words, the opening parenthesis does not count towards
the overall group count). The embedded flag construct can be used in combination
with the flag passed to the Pattern.compile( String , int )
factory method.
Available embedded flag constructs and available flags to specify when constructing
a new Pattern
object include:
embedded flags | construction flags | meanings * |
---|---|---|
(?i) | Pattern.CASE_INSENSITIVE |
Enables case-insensitive matching. |
(?d) | Pattern.UNIX_LINES |
Enables Unix lines mode. |
(?m) | Pattern.MULTILINE |
Enables multi line mode. |
(?s) | Pattern.DOTALL |
Enables "." to match line terminators. |
(?u) | Pattern.UNICODE_CASE |
Enables Unicode-aware case folding. |
(?x) | Pattern.COMMENTS |
Permits white space and comments in the pattern. |
--- | Pattern.CANON_EQ |
Enables canonical equivalence. |
Consider an example with case insensitivity turned on.
input = "Hey, diddle, diddle, " + "The cat and the fiddle, " + "The cow jumped over the moon. " + "The little dog laughed " + "To see such sport, " + "And the dish ran away with the spoon." ; pattern = Pattern.compile( "the \\w+?(?=\\W)" , Pattern.CASE_INSENSITIVE ); // Matches "the " followed by any word, regardless of case. matcher = pattern.matcher( input ); while ( matcher.find() ) // Prints { // The cat System.out.println( matcher.group() ); // the fiddle } // The cow // the moon // The little // the dish // the spoon |
An equivalent Pattern
could be constructed using the "(?i)" embedded
flag. If embedded at the beginning of the regular expression, this embedded
flag would affect the entire regular expression as would any integer flag specified
in Pattern.compile( String , int )
. In the previous
example, pattern = Pattern.compile( "(?i)the \\w+?(?=\\W)" )
would
have resulted in the same matched subsequences.
Multiple flags can be specified as embedded flags or as the integer argument
to Pattern.compile( String , int )
. To specify
multiple embedded flags, simply list them, one after the other. "(?is)"
would specify that the Pattern
is to match regardless of case and
the "." meta character can match line terminators. To specify multiple
flags as the integer argument to Pattern.compile( String , int )
,
OR together the integer constants of the Pattern
class that represent
the desired behavior.
Consider the following example that demonstrates equivalent use of multiple
embedded flags and OR'ed together integer constants passed as the integer flag
argument to Pattern.compile( String , int ) .
input = "Green cheese,\n" + "Yellow laces,\n" + "Up and down\n" + "The market places." ; pattern = Pattern.compile( "(?is)[a-z]*,.[a-z]*" ); // Regardless of case, matches consecutive letters // followed by a comma, any character, then more // consecutive letters where the meta character "." // may match line terminators. matcher = pattern.matcher( input ); while ( matcher.find() ) // Prints { // cheese, System.out.println( matcher.group() ); // Yellow } // laces, // Up int flags = Pattern.CASE_INSENSITIVE | Pattern.DOTALL ; pattern = Pattern.compile( "[a-z]*,.[a-z]*" , flags ); // Regardless of case, matches consecutive letters // followed by a comma, any character, then more // consecutive letters where the meta character "." // may match line terminators. matcher = pattern.matcher( input ); while ( matcher.find() ) // Prints { // cheese, System.out.println( matcher.group() ); // Yellow } // laces, // Up |
The embedded flag construct affects the parts of a regular expression that appear after the embedded flag. If an embedded flag appears in the middle of a regular expression, then only the half of the expression, that appears after the flag, is potentially affected.
Using an embedded flag construct, it is possible to specify that only a section (a group) of a regular expression be affected by the flag. The syntax for such a construct is "(?flags:X)" , where "X" is the regular expression to be affected by the flags. This limits the scope of the flags to within the closing parentheses.
The embedded flag construct can also be used to turn off flags. The syntax
to turn off a flag or flags is "(?-flags)". So, it is possible to
specify a flag be turned on for the entire regular expression by specifying
the appropriate integer using Pattern.compile( String , int )
and that the same flag should be turned off for a specified group of the regular
expression using "(?-flag:X)".
input = "HARK! HARK! The dogs do bark, " + "The beggars are coming to town. " + "Some in rags, " + "And some in tags, " + "And one in a velvet gown!" ; pattern = Pattern.compile( "(?-i:[A-Z])[A-Z]*" , Pattern.CASE_INSENSITIVE ); // Matches any word, regardless of case except // the first letter which must be capitalized. matcher = pattern.matcher( input ); // Prints while ( matcher.find() ) // HARK { // HARK System.out.println( matcher.group() ); // The } // The // Some // And // And |
The embedded flag constructs are non-capturing. So, the enclosing parentheses
do not contribute to the Matcher's
group count.
Notes and Resources | |
1 | The Regular Expressions Tutorial at JavaRegex.com |
2 | java.util.regex
Package API Documentation |
3 | java.lang.CharSequence is a new Interface in Java 1.4.
A CharSequence is a readable sequence of characters. This
interface provides uniform, read-only access to many different kinds of
character sequences. String and StringBuffer
both implement CharSequence . -- CharSequence
API Documentation |
4 | java.util.regex.Pattern
API Documentation |
5 | java.util.regex.Matcher
API Documentation |
6 | The text for the nursery rhymes in this lesson can be found at http://www.zelo.com/family/nursery/index.asp . |
7 | Jim Yingst notes: The API documentation for the groupCount() method of Matcher
is misleading - it should say "Any non-negative integer smaller than or equal to the value returned by this method is guaranteed to be a valid group index for this matcher." The italicized section does not appear in the current API. (It's fixed in 1.4.1 beta source code, but that's not what you see when you browse the API online.) |
8 | The Regular Expression Library |
9 | For further reading on regular expressions, take a look at Mastering Regular Expressions by Jeffrey Friedl. A sample chapter from O'Reilly is available on-line. Jeffrey Friedl maintains a website assiciated with his book at http://regex.info at which he has posted a nice list of alternative Java Regex Packages. |