Regular expressions are prominent in scripting languages and have also been available in the Java library since JDK 1.4. Groovy relies on Java’s regex (regular expression) support and adds three operators for convenience: the regex find operator =~, the regex match operator ==~, and the regex pattern operator ~String.
Regular expressions are defined by patterns. Patterns are declared by a sequence of symbols. In fact, the pattern description is a language of its own.
Specifying patterns in string literals:
Patterns use lots of backslashes, and to get a backslash in a Java string literal, you need to double it. This makes patterns in Java strings hard to read. It gets even worse if you need to match an actual backslash: the pattern language escapes that with a backslash too, so the Java string literal needed to match the text a\b is "a\\\\b". (With only two backslashes, "a\\b", the pattern is a\b, which the regex engine reads as an a followed by a word boundary.)
Groovy does much better. As you saw earlier, there is the slashy form of string literal, which doesn’t require you to escape the backslash character and still works like a normal GString. Listing 3.4 shows how to declare patterns conveniently.
- Listing 3.4 Regular expression GStrings
- assert "abc" == /abc/
- assert "\\d" == /\d/
- def reference = "hello"
- assert reference == /$reference/
- assert "\$" == /$/
- def p = /abc\/\// // Using '\' to escape '/' in pattern literal string
- assert 'abc//' =~ p
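The backslash counting described above can be checked directly. The following is a minimal sketch; the variable names (text, javaStyle, slashy) are illustrative and not part of the original listing.

```groovy
// The text to match: 'a', one backslash, 'b' (three characters)
def text = 'a\\b'
assert text.size() == 3

// Java-style string literal: four backslashes in the source code
def javaStyle = "a\\\\b"
// Slashy string: only the two backslashes that the pattern itself needs
def slashy = /a\\b/

// Both literals produce the same pattern text
assert javaStyle == slashy
assert text ==~ slashy

// With only two backslashes, the pattern is a\b:
// an 'a' followed by a word boundary, not a literal backslash
assert !(text ==~ "a\\b")
assert 'a' ==~ "a\\b"
```

The slashy form keeps the pattern readable because it contains exactly the characters the regex engine sees.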
Symbols
The key to using regular expressions is knowing the pattern symbols. For convenience, table 3.8 provides a short list of the most common ones. Put an earmark on this page so you can easily look up the table. You will use it a lot.
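As a quick reminder of a few widely used symbols, the assertions below give a small sketch based on standard java.util.regex syntax. The sample strings and the variable name colour are made up for illustration, and the selection is not meant to reproduce table 3.8.

```groovy
// . matches any single character
assert 'x' ==~ /./
// \d is a digit; + means one or more
assert '2024' ==~ /\d+/
// \w is a word character; \s is a whitespace character
assert 'a b' ==~ /\w\s\w/
// ? makes the preceding element optional
def colour = /colou?r/
assert 'color' ==~ colour
assert 'colour' ==~ colour
// [a-c] is a character class
assert 'abcabc' ==~ /[a-c]+/
// ^ anchors a find at the start of the input
assert 'say cheese' =~ /^say/
assert !('please say cheese' =~ /^say/)
// | separates alternatives
assert 'cat' ==~ /cat|dog/
```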
More to consider:
Applying patterns:
Applied to a given string, Groovy supports the following tasks for regular expressions:
Listing 3.5 shows how Groovy sets patterns into action. Unlike most other examples, this listing contains some comments. This reflects real life and is not for illustrative purposes. The use of regexes is best accompanied by this kind of comment for all but the simplest patterns.
- Listing 3.5 Regular expressions
- twister = 'she sells sea shells at the sea shore of seychelles'
- // twister must contain a substring of size 3
- // that starts with s and ends with a
- assert twister =~ /s.a/ // 1) Regex find operator as usable in if
- finder = (twister =~ /s.a/) // 2) Find expression evaluates to a matcher object
- assert finder instanceof java.util.regex.Matcher
- // twister must contain only words delimited by single spaces
- assert twister ==~ /(\w+ \w+)*/
- WORD = /\w+/
- matches = (twister ==~ /($WORD $WORD)*/)
- assert matches instanceof java.lang.Boolean
- // Match is full, not partial like find
- assert (twister ==~ /s.e/) == false
- wordsByX = twister.replaceAll(WORD, 'x')
- assert wordsByX == 'x x x x x x x x x x'
- // Split returns an array of words (String[])
- words = twister.split(/ /)
- assert words.size() == 10
- assert words[0] == 'she'
Common regex pitfalls
You do not need to fall into the regex trapdoors yourself. We have already done this for you. We have learned the following:
Patterns in action:
You’re now ready to do everything you wanted to do with regular expressions, except we haven’t covered “do something with each occurrence.” “Something” and “each” sound like a cue for a closure to appear, and that’s the case here. String has a method called eachMatch that takes a regex as a parameter along with a closure that defines what to do on each match.
The match gets passed into the closure for further analysis. In our musical example in listing 3.6, we append each match to a result string.
- Listing 3.6 Working on each match of a pattern
- myFairStringy = 'The rain in Spain stays mainly in the plain!'
- // words that end with 'ain': \b\w*ain\b
- BOUNDS = /\b/
- rhyme = /$BOUNDS(\w*ain)$BOUNDS/
- found = ''
- // 1) string.eachMatch(pattern_string)
- myFairStringy.eachMatch(rhyme) { match ->
- found += match[1] + ' '
- }
- printf "Found='$found'\n"
- assert found == 'rain Spain plain '
- found = ''
- // 2) matcher.each(closure)
- (myFairStringy =~ rhyme).each { match ->
- found += match[1] + ' '
- }
- assert found == 'rain Spain plain '
- // 3) string.replaceAll(pattern_string, closure)
- cloze = myFairStringy.replaceAll(rhyme){ it[0]-'ain'+'___'}
- printf "Cloze='$cloze'\n"
- assert cloze == 'The r___ in Sp___ stays mainly in the pl___!'
In order to fully understand how the Groovy regular expression support works, we need to look at the java.util.regex.Matcher class. It is a JDK class that encapsulates knowledge about:
The GDK enhances the Matcher class with simplified array-like access to this information. This is what happens in the following (already familiar) example that matches all non-whitespace characters:
- matcher = 'a b c' =~ /\S/
- assert matcher[0] == 'a'
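- assert matcher[1..2] == ['b', 'c'] // a range index returns a list of matches in current Groovy versions (older releases returned 'bc')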
- assert matcher.count == 3
- matcher = 'a:1 b:2 c:3' =~ /(\S+):(\S+)/
- assert matcher.hasGroup()
- assert matcher[0] == ['a:1', 'a', '1']
- ('xy' =~ /(.)(.)/).each { all, x, y ->
- assert all == 'xy'
- assert x == 'x'
- assert y == 'y'
- }
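For comparison, the same kind of information can be read through the plain JDK methods of java.util.regex.Matcher. This is a minimal sketch with an illustrative input string; the names m and seen are not from the original text.

```groovy
// Groovy's find operator returns a java.util.regex.Matcher
def m = 'a:1 b:2' =~ /(\w):(\d)/

def seen = []
// find() advances to the next occurrence; group(), start() and end()
// describe the occurrence that was just found
while (m.find()) {
    seen << [m.group(), m.group(1), m.group(2), m.start(), m.end()]
}

assert seen == [['a:1', 'a', '1', 0, 3],
                ['b:2', 'b', '2', 4, 7]]
// groupCount() reports the number of capturing groups in the pattern
assert m.groupCount() == 2
```

The GDK index access shown above wraps this find-and-group loop, which is why matcher[0] on a pattern with groups returns the full match together with each group.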
Patterns and performance:
Finally, let’s look at performance and the pattern operator ~String. The pattern operator transforms a string into an object of type java.util.regex.Pattern. For a given string, this pattern object can be asked for a matcher object.
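A minimal sketch of that sequence, using a made-up pattern and input strings: the pattern operator creates the Pattern once, and matcher(String) then produces a fresh Matcher for each string.

```groovy
import java.util.regex.Matcher
import java.util.regex.Pattern

// Note the space between = and ~; written as =~ it would be the find operator
def pattern = ~/\d+/
assert pattern instanceof Pattern

// Ask the pattern object for a matcher on a given string
Matcher matcher = pattern.matcher('room 42')
assert matcher.find()
assert matcher.group() == '42'

// The same pattern object can be reused for other strings
assert pattern.matcher('2024').matches()
assert !pattern.matcher('no digits').find()
```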
The rationale behind this construction is that patterns are internally backed by a so-called finite state machine that does all the high-performance magic. This machine is compiled when the pattern object is created. The more complicated the pattern, the longer the creation takes. In contrast, the matching process as performed by the machine is extremely fast.
The pattern operator allows you to split pattern-creation time from pattern-matching time, increasing performance by reusing the finite state machine. Listing 3.7 shows a poor man’s performance comparison of the two approaches. In this comparison the precompiled pattern version is at least 20% faster, which the final assertion checks; the exact figures depend on the JVM and the machine.
- Listing 3.7 Increase performance with pattern reuse.
- twister = 'she sells sea shells at the sea shore of seychelles'
- // some more complicated regex:
- // word that starts and ends with same letter
- regex = /\b(\w)\w*\1\b/ // At this point, regex is still a String
- start = System.currentTimeMillis()
- 100000.times{
- twister =~ regex // Find operator with implicit pattern construction
- }
- first = System.currentTimeMillis() - start
- printf "First(Implicit pattern) =%d ms\n", first
- start = System.currentTimeMillis()
- pattern = ~regex // 1) Explicit pattern construction
- 100000.times{
- pattern.matcher(twister)
- }
- second = System.currentTimeMillis() - start
- printf "Second(Explicit pattern) =%d ms\n", second
- assert first > second * 1.20
Patterns for classification:
Listing 3.8 completes the domain of patterns. The Pattern object, as returned from the pattern operator, implements an isCase(String) method that is equivalent to a full match of that pattern with the string. This classification method is a prerequisite for using patterns conveniently with the grep method and in switch cases.
- Listing 3.8 Patterns in grep() and switch()
- assert (~/..../).isCase('bear')
- switch('bear'){
- case ~/..../ : assert true; break
- default : assert false
- }
- beasts = ['bear','wolf','tiger','regex']
- assert beasts.grep(~/..../) == ['bear','wolf']
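Because isCase performs a full match, a switch can act as a small classifier. The sketch below assumes a hypothetical helper named classify and made-up categories; the order of the cases matters, since the first pattern that fully matches wins.

```groovy
// Hypothetical helper: classifies a string by the first pattern that fully matches it
def classify(String s) {
    switch (s) {
        case ~/\d+/      : return 'integer'
        case ~/\d+\.\d+/ : return 'decimal'
        case ~/\w+/      : return 'word'
        default          : return 'other'
    }
}

assert classify('42') == 'integer'
assert classify('3.14') == 'decimal'
assert classify('bear') == 'word'
// A space is not a word character, so no case matches the whole string
assert classify('hello world') == 'other'

// grep applies the same full-match classification to a collection
assert ['42', 'bear', '7'].grep(~/\d+/) == ['42', '7']
```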
Supplement:
* Documenting Regular Expressions in Groovy