程式扎記: [ In Action ] The Simple Groovy datatypes - Working with regular expressions

2014年1月15日 星期三

Preface: 
Regular expressions are prominent in scripting languages and have also been available in the Java library since JDK 1.4. Groovy relies on Java’s regex (regular expression) support and adds three operators for convenience: 
■ The regex find operator =~ 
■ The regex match operator ==~
■ The regex pattern operator ~String
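As a quick orientation, the three operators can be sketched as follows. The sample text and variable names are made up for illustration.

```groovy
def text = 'groovy 2014'

// Find operator: yields a java.util.regex.Matcher, which is true in a
// boolean context when the pattern occurs somewhere in the string.
def finder = (text =~ /\d+/)
assert finder instanceof java.util.regex.Matcher
assert text =~ /\d+/

// Match operator: yields a Boolean that is true only if the whole string matches.
assert text ==~ /\w+ \d+/
assert !(text ==~ /\d+/)

// Pattern operator: turns the string into a java.util.regex.Pattern.
def digits = ~/\d+/
assert digits instanceof java.util.regex.Pattern
```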

Regular expressions are defined by patterns. Patterns are declared by a sequence of symbols. In fact, the pattern description is a language of its own. 

Specifying patterns in string literals: 
Patterns use lots of backslashes, and to get a backslash in a Java string literal, you need to double it. This makes patterns in Java strings hard to read. It gets even worse if you need to match an actual backslash: the pattern language escapes that with a backslash too, so the Java string literal needed to match the text a\b is "a\\\\b"!
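The backslash doubling can be checked with a small sketch. It is written in Groovy, whose conventional quoted strings escape backslashes the same way Java literals do. The variable names are illustrative only.

```groovy
// The text under test has three characters: a, backslash, b.
def target = 'a\\b'
assert target.size() == 3

// The regex that matches it is a\\b (four characters), so a conventional
// string literal needs four backslashes to spell it.
def javaStyle = "a\\\\b"
assert javaStyle.size() == 4
assert target ==~ javaStyle
```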

Groovy does much better. As you saw earlier, there is the slashy form of string literal, which doesn’t require you to escape the backslash character and still works like a normal GString. Listing 3.4 shows how to declare patterns conveniently. 
- Listing 3.4 Regular expression GStrings 
    assert "abc" == /abc/
    assert "\\d" == /\d/
    def reference = "hello"
    assert reference == /$reference/
    assert "\$" == /$/
    def p = /abc\/\//  // Use '\' to escape '/' inside a slashy pattern literal
    assert 'abc//' =~ p
Tips. 
Sometimes the slashy syntax interferes with other valid Groovy expressions such as line comments or numerical expressions with multiple slashes for division. When in doubt, put parentheses around your pattern like (/pattern/) . Parentheses force the parser to interpret the content as an expression.

Symbols 
The key to using regular expressions is knowing the pattern symbols. For convenience, table 3.8 provides a short list of the most common ones. Put an earmark on this page so you can easily look up the table. You will use it a lot. 
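A few commonly used symbols can be illustrated with assertions. This is a small selection chosen for this sketch, not a reproduction of table 3.8.

```groovy
assert 'a1'    ==~ /\w\d/       // \w word character, \d digit
assert 'a b'   ==~ /a\sb/       // \s whitespace character
assert 'abc'   ==~ /a.c/        // .  any single character
assert 'aaa'   ==~ /a+/         // +  one or more repetitions
assert ''      ==~ /a*/         // *  zero or more repetitions
assert 'color' ==~ /colou?r/    // ?  zero or one occurrence
assert 'aaa'   ==~ /a{2,3}/     // {n,m} between n and m repetitions
assert 'b'     ==~ /[abc]/      // [...] character class
assert 'd'     ==~ /[^abc]/     // [^...] negated character class
assert 'ab'    ==~ /(a|x)b/     // (...) grouping, | alternation
assert 'one two' =~ /^one/      // ^  start of line
assert 'one two' =~ /\btwo\b/   // \b word boundary
```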
 
 

More to consider: 
  • Use grouping properly. The expanding operators such as star and plus bind closely; ab+ matches abbbb . Use (ab)+ to match ababab .

  • In normal mode, the expanding operators are greedy, meaning they try to match the longest substring that matches the pattern. Add a question mark after the operator to put it into reluctant (non-greedy) mode. You may be tempted to extract the href from an HTML anchor element with this regex: href="(.*)" . But href="(.*?)" is probably better. The first version matches until the last double quote in your text; the latter matches until the next double quote.
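The difference between the two href patterns shows up as soon as the text contains more than one anchor. A minimal sketch with a made-up HTML fragment:

```groovy
def html = '<a href="one.html">1</a> <a href="two.html">2</a>'

// Greedy: .* runs on to the last double quote in the text.
def greedy = (html =~ /href="(.*)"/)
assert greedy[0][1] == 'one.html">1</a> <a href="two.html'

// Reluctant: .*? stops at the next double quote.
def reluctant = (html =~ /href="(.*?)"/)
assert reluctant[0][1] == 'one.html'
assert reluctant[1][1] == 'two.html'
```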

Applying patterns:
    Applied to a given string, Groovy supports the following tasks for regular expressions: 
    ■ Tell whether the pattern fully matches the whole string.
    ■ Tell whether there is an occurrence of the pattern in the string.
    ■ Count the occurrences.
    ■ Do something with each occurrence.
    ■ Replace all occurrences with some text.
    ■ Split the string into multiple strings by cutting at each occurrence.
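Counting is the one task in this list that is easy to overlook. A short sketch, assuming the GDK additions Matcher.count and String.findAll, with an illustrative sample phrase:

```groovy
def phrase = 'she sells sea shells'

// Count through the matcher: count is a GDK property on java.util.regex.Matcher.
assert (phrase =~ /\w+/).count == 4

// Count through String.findAll, which collects every occurrence into a list.
def shWords = phrase.findAll(/sh\w+/)
assert shWords == ['she', 'shells']
assert shWords.size() == 2
```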

    Listing 3.5 shows how Groovy puts patterns into action. Unlike most other examples, this listing contains some comments. This reflects real life and is not for illustrative purposes. The use of regexes is best accompanied by this kind of comment for all but the simplest patterns.
    - Listing 3.5 Regular expressions 
    twister = 'she sells sea shells at the sea shore of seychelles'
    // twister must contain a substring of size 3
    // that starts with s and ends with a
    assert twister =~ /s.a/                     // 1) Regex find operator as usable in if
    finder = (twister =~ /s.a/)                 // 2) Find expression evaluates to a matcher object
    assert finder instanceof java.util.regex.Matcher

    // twister must contain only words delimited by single spaces
    assert twister ==~ /(\w+ \w+)*/
    WORD = /\w+/
    matches = (twister ==~ /($WORD $WORD)*/)
    assert matches instanceof java.lang.Boolean

    // Match is full, not partial like find
    assert (twister ==~ /s.e/) == false
    wordsByX = twister.replaceAll(WORD, 'x')
    assert wordsByX == 'x x x x x x x x x x'

    // split returns an array of words
    words = twister.split(/ /)
    assert words.size() == 10
    assert words[0] == 'she'
    Tips. 
    To remember the difference between the =~ find operator and the ==~ match operator, recall that match is more restrictive, because the pattern needs to cover the whole string. The demanded coverage is “longer” just like the appearance of its operator.

    Common regex pitfalls 
    You do not need to fall into the regex trapdoors yourself. We have already done this for you. We have learned the following: 
    ■ When things get complex (note, this is when, not if), comment verbosely.
    ■ Use the slashy syntax instead of the regular string syntax, or you will get lost in a forest of backslashes.
    ■ Don’t let your pattern look like a toothpick puzzle. Build your pattern from subexpressions like WORD in listing 3.5.
    ■ Put your assumptions to the test. Write some assertions or unit tests to test your regex against static strings. Please don’t send us any more flowers for this advice; an email with the subject “assertion saved my life today” will suffice.
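The last two points can be combined in a small sketch: a pattern assembled from named subexpressions and then checked against static strings. The date format and the names YEAR, MONTH, DAY and DATE are hypothetical, chosen only for this illustration.

```groovy
// Subexpressions, each small enough to read at a glance.
def YEAR  = /\d{4}/
def MONTH = /(?:0[1-9]|1[0-2])/         // 01 to 12
def DAY   = /(?:0[1-9]|[12]\d|3[01])/   // 01 to 31

// The full pattern is built from the parts, in the style of WORD in listing 3.5.
def DATE = /$YEAR-$MONTH-$DAY/

// Assumptions put to the test against static strings.
assert '2014-01-15' ==~ DATE
assert !('2014-13-15' ==~ DATE)   // month out of range
assert !('2014-01-32' ==~ DATE)   // day out of range
assert !('14-01-15'   ==~ DATE)   // year too short
```

This sketch only checks the shape of the text; it does not reject impossible dates such as 2014-02-31.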

    Patterns in action: 
    You’re now ready to do everything you wanted to do with regular expressions, except we haven’t covered “do something with each occurrence.” Something and each sounds like a cue for a closure to appear, and that’s the case here. String has a method called eachMatch that takes a regex as a parameter along with a closure that defines what to do on each match.

    The match gets passed into the closure for further analysis. In our musical example in listing 3.6, we append each match to a result string. 
    - Listing 3.6 Working on each match of a pattern 
    myFairStringy = 'The rain in Spain stays mainly in the plain!'
    // words that end with 'ain': \b\w*ain\b
    BOUNDS = /\b/
    rhyme = /$BOUNDS(\w*ain)$BOUNDS/
    found = ''
    // 1) string.eachMatch(pattern_string)
    myFairStringy.eachMatch(rhyme) { match ->
        found += match[1] + ' '
    }
    printf "Found='$found'\n"
    assert found == 'rain Spain plain '
    found = ''
    // 2) matcher.each(closure)
    (myFairStringy =~ rhyme).each { match ->
        found += match[1] + ' '
    }
    assert found == 'rain Spain plain '
    // 3) string.replaceAll(pattern_string, closure)
    cloze = myFairStringy.replaceAll(rhyme) { it[0] - 'ain' + '___' }
    printf "Cloze='$cloze'\n"
    assert cloze == 'The r___ in Sp___ stays mainly in the pl___!'
    There are two different ways to iterate through matches with identical behavior: use (1) String.eachMatch(Pattern), or use (2) Matcher.each(), where the Matcher is the result of applying the regex find operator to a string and a pattern. (3) shows a special case for replacing each match with some dynamically derived content from the given closure. Because the pattern contains a group, the variable it holds the full match followed by its groups, so it[0] is the matching substring. The result is to replace “ain” with underscores, but only where it forms part of a rhyme.

    In order to fully understand how the Groovy regular expression support works, we need to look at the java.util.regex.Matcher class. It is a JDK class that encapsulates knowledge about: 
    ■ How often and at what position a pattern matches
    ■ The groupings for each match

    The GDK enhances the Matcher class with simplified array-like access to this information. This is what happens in the following (already familiar) example that matches all non-whitespace characters: 
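    matcher = 'a b c' =~ /\S/
    assert matcher[0]    == 'a'
    // Recent Groovy versions return a list for a range index;
    // the listing originally compared against the joined string 'bc'.
    assert matcher[1..2] == ['b', 'c']
    assert matcher.count == 3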
    The interesting part comes with groupings in the match. If the pattern contains parentheses to define groups, the matcher returns not a single string for each match but an array, where the full match is at index 0 and each extracted group follows. Consider this example, where each match finds pairs of strings that are separated by a colon. For later processing, the match is split into two groups, for the left and the right string: 
    matcher = 'a:1 b:2 c:3' =~ /(\S+):(\S+)/
    assert matcher.hasGroup()
    assert matcher[0] == ['a:1', 'a', '1']
    In other words, what matcher[0] returns depends on whether the pattern contains groupings. This also applies to the matcher’s each method, which comes with a convenient notation for groupings. When the processing closure defines multiple parameters, the list of groups is distributed over them: 
    ('xy' =~ /(.)(.)/).each { all, x, y ->
        assert all == 'xy'
        assert x == 'x'
        assert y == 'y'
    }
    This matcher matches only one time but contains two groups with one character each. 
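The same notation pays off when a matcher yields several matches. As a sketch, the colon-separated pairs from the earlier example can be collected into a map. The names pairs, key and value are illustrative.

```groovy
def pairs = [:]

// Each match is spread over the closure parameters:
// the full match first, then one parameter per group.
('a:1 b:2 c:3' =~ /(\S+):(\S+)/).each { all, key, value ->
    pairs[key] = value
}

assert pairs == [a: '1', b: '2', c: '3']
```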

    Patterns and performance: 
    Finally, let’s look at performance and the pattern operator ~String. The pattern operator transforms a string into an object of type java.util.regex.Pattern. For a given string, this pattern object can be asked for a matcher object.

    The rationale behind this construction is that patterns are internally backed by a so-called finite state machine that does all the high-performance magic. This machine is compiled when the pattern object is created. The more complicated the pattern, the longer the creation takes. In contrast, the matching process as performed by the machine is extremely fast.

    The pattern operator allows you to split pattern-creation time from pattern-matching time, increasing performance by reusing the finite state machine. Listing 3.7 shows a poor-man’s performance comparison of the two approaches. The precompiled pattern version is at least 20% faster. 
    - Listing 3.7 Increase performance with pattern reuse. 
    twister = 'she sells sea shells at the sea shore of seychelles'
    // some more complicated regex:
    // word that starts and ends with same letter
    regex = /\b(\w)\w*\1\b/   // at this point regex is still a String
    start = System.currentTimeMillis()
    100000.times {
        twister =~ regex   // Find operator with implicit pattern construction
    }
    first = System.currentTimeMillis() - start
    printf "First(Implicit pattern) =%d ms\n", first

    start = System.currentTimeMillis()
    pattern = ~regex    // 1) Explicit pattern construction
    100000.times {
        pattern.matcher(twister)
    }

    second = System.currentTimeMillis() - start
    printf "Second(Explicit pattern) =%d ms\n", second
    assert first > second * 1.20
    To find words that start and end with the same character, we used the \1 backreference to refer to that character. We prepared its usage by putting the word’s first character into a group, which happens to be group 1. Note the spelling at (1): this is not pattern =~ regex but pattern = ~regex, with a space after the equals sign. Without that space it would be treated as the find operator.
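Apart from timing loops, reuse typically looks like compiling the pattern once and asking it for a matcher per input. A minimal sketch with made-up input lines; the names lines, sameEnds and selected are illustrative.

```groovy
def lines = ['she sells', 'sea shells', 'at the seashore']

// Compiled once; note the space between = and ~.
def sameEnds = ~/\b(\w)\w*\1\b/

// The same Pattern object serves every line.
def selected = lines.findAll { line -> sameEnds.matcher(line).find() }

// 'sells' and 'shells' start and end with the same letter.
assert selected == ['she sells', 'sea shells']
```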

    Patterns for classification: 
    Listing 3.8 completes the domain of patterns. The Pattern object, as returned from the pattern operator, implements an isCase(String) method that is equivalent to a full match of that pattern with the string. This classification method is a prerequisite for using patterns conveniently with the grep method and in switch cases.
    - Listing 3.8 Patterns in grep() and switch() 
    assert (~/..../).isCase('bear')
    switch ('bear') {
        case ~/..../ : assert true; break
        default      : assert false
    }
    beasts = ['bear', 'wolf', 'tiger', 'regex']
    assert beasts.grep(~/..../) == ['bear', 'wolf']
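Because each case is a full match, the order of the cases matters when patterns overlap. The following sketch uses a hypothetical classify helper, not part of the listing, to sort tokens into categories:

```groovy
// Hypothetical helper: the first pattern that fully matches the token wins.
def classify(String token) {
    switch (token) {
        case ~/\d+/:      return 'integer'
        case ~/\d+\.\d+/: return 'decimal'
        case ~/\w+/:      return 'word'
        default:          return 'other'
    }
}

assert classify('42')   == 'integer'   // also matches \w+, but \d+ comes first
assert classify('3.14') == 'decimal'
assert classify('bear') == 'word'
assert classify('?!')   == 'other'

// grep applies the same classification to a whole collection.
assert ['42', 'bear', '3.14'].grep(~/\d+/) == ['42']
```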
    Regular expressions are difficult beasts to tame, but mastering them adds a new quality to all text-manipulation tasks. Once you have a grip on them, you’ll hardly be able to imagine having programmed (some would say lived) without them. Groovy makes regular expressions easily accessible and straightforward to use. 

    Supplement: 
    Documenting Regular Expressions in Groovy
