程式扎記: [ 文章收集 ] Using Regular Expression in Ruby - Part3

標籤

2014年10月8日 星期三

[ 文章收集 ] Using Regular Expression in Ruby - Part3

Source From Here 
Preface 
This is the final installation in our three part series on regular expressions. Before continuing be sure to check out Part I and Part II

Regular Expression Behaviors 
Regular expressions are powerful. As a famous superhero once said, with great power comes great responsibility. To keep a regular expression from causing havoc, you need to know how to control its behavior. 

Regular expressions have three distinct, recognizable behaviors: greedy, lazy, and possessive. These words sound pretty negative but they’re not necessarily bad ways for your regular expression to behave. These are descriptions of different types of behavior your regular expression might have. Behavior you can recognize and control! I’m going to show you how. 

To understand these regular expression attitudes, we need to understand quantifiers. Quantifiers simply tell your regular expressions engines how many times a character or group of characters should appear in your string. One of the quantifiers I use the most is the "+" quantifier. When I add this to a character, it means that character needs to appear at least one time. It can appear as many times as it wants, but it needs to be there at least once. 

So this regular expression: 
>> regr = /.+/
=> /.+/

would match any character appearing at least once. It guarantees a character will be there. 

Quantifiers lie at the root of whether your regular expression is greedy, lazy, or possessive. By default, they’re greedyA greedy quantifier tries to match as much of the string as it can. It grabs as much of the string as it can get its greedy little hands on and tries to make a match. If the whole string doesn’t work, it backs up one character and tries again. It repeats this process until there are no more characters for it to test. 

Greedy quantifiers use maximum effort for maximum return. A greedy quantifier will try as many ways as it can to find a match and will return the maximum characters that could possibly be a part of that match. Let’s look at an example: 
>> string = "There’s no time to talk about time we don’t have the time"

Now if I run the match method on this regular expression, passing in the string: 
>> regr = /.+time/
=> /.+time/
>> regr.match(string) // By default, '+' will do greedy match
=> #"There’s no time to talk about time we don’t have the time"
>
It matches our entire string. 

When this regular expression sees the string, it tries to match the first part of the regular expression, the ".+," first. This matches the entire string. Then it tries to match the second part of our regular expression, the word "time." Because it already has the entire string marked as a match, it’s going to first look for the word "time" beyond the end of the string. It’s not going to find it since there’s nothing there, so it backtracks. It moves back one character at a time until it finds a match. When it finds it, it returns the whole match. In this case, it’s our entire string. 

Greedy quantifiers try to match the whole string, then backtrack. Backtracking means if the entire string doesn’t match the entire regular expression, it will try as many ways as possible to find a match. It needs to keep track of what ways it’s tried so it doesn’t repeat them. This can potentially take up a lot of system resources, particularly when you have multiple matches running on large amounts of text

Oniguruma has optimizations that make backtracking quicker. Patrick Shaughnessy has a fantastic blog post that goes into the details of how Oniguruma handles backtracking. Even with optimizations, however, a greedy regular expression will chew through a lot of resources. 

When you want a more contained match that uses much less resources, you want a lazy quantifier. Also known as a reluctant quantifier, it starts at the very beginning of the string and tries to make a match with the very first character. If it doesn’t find a match, it grabs another character. As an absolute last resort, it will grab the whole string to try and find a match. 

A lazy quantifier uses minimum effort for minimum return. It returns as few characters as possible to make match. If it finds a match in the first character of the string, it will return just the first character. It’s lazy. It does just enough to get by, nothing more. 

You make a quantifier lazy simply by adding a question mark after it. 
>> regr = /.+?time/

If I run the match method on my string using this lazy regular expression: 
>> regr.match(string)
=> #"There’s no time"
>
I only get "There’s no time" back. It started at the very beginning of the string and delivered just enough to be a match. Lazy regular expressions use much less backtracking and, therefore, fewer resources than greedy regular expressions. 

What if you do want to match as many characters as possible, but don’t want backtracking to consume your resources? There’s a third kind of quantifier, possessive quantifiers. These are all or nothing. Either there’s a match on the first try or they fail. Like a greedy quantifier, they grab as much of the string as they can - the entire string- and try to make a match. If that match fails, though, they won’t backtrack or try again. 

Possessive quantifiers use minimum effort for maximum return. They try to return as many characters as possible for the bare minimum effort - they give it one go then give up. To make a quantifier possessive, you add a plus sign to it: 
>> regr = /.++time/

Let’s run match on the string using this possessive regular expression: 
>> regr.match(string)
=> nil

The match fails. Why would it fail? It seems like our entire string should match this regular expression. The reason this fails is because there is no backtracking. The first thing our regular expression tries to match is the ".+." This matches the entire string. When it tries to match the second part of our regular expression, "time", it already has the entire string marked as a match for ".+." It looks for the word "time" AFTER our entire string. This is the same thing a greedy quantifier does, but a greedy quantifier can go back earlier in the string and look for a match. A possessive quantifier can’t go back in the string to look for a match because it can’t backtrack. Therefore, it fails. 

The main advantage Possessive quantifiers offer is they fail fast. They don’t backtrack, so they use minimal resources. A greedy quantifier will try every possible way to try to make a match. If it fails, all that work, all those resources, will be for nothing. A possessive quantifier prevents this. If it’s going to fail, it fails quickly. 

Generally, you only want to use possessive quantifiers for very small regular expressions, usually when you have small sub expressions nested within larger expressions. They’re very useful, but use with caution. 

Conclusion 
Regular expressions are powerful. So powerful they inspire fear in many of us. That fear can be overcome. As cryptic as they might seem, they do have a logical reasoning and structure. Use them. Fire up Rubular and try some lookaheads and lookbehinds, experiment with greedy, lazy, and possessive quantifiers. Explore the fantastic ways Ruby works with regular expressions. I think you’ll be amazed at what you find. 

Supplement 
Java Regular Expression的學習筆記 [精華] 
Wiki - regular expression 
Regular-expression.info

沒有留言:

張貼留言

網誌存檔

關於我自己

我的相片
Where there is a will, there is a way!