Source From Here
This is the final installation in our three part series on regular expressions. Before continuing be sure to check out Part I and Part II.
Regular Expression Behaviors
Regular expressions are powerful. As a famous superhero once said, with great power comes great responsibility. To keep a regular expression from causing havoc, you need to know how to control its behavior.
Regular expressions have three distinct, recognizable behaviors: greedy, lazy, and possessive. These words sound pretty negative but they’re not necessarily bad ways for your regular expression to behave. These are descriptions of different types of behavior your regular expression might have. Behavior you can recognize and control! I’m going to show you how.
To understand these regular expression attitudes, we need to understand quantifiers. Quantifiers simply tell your regular expressions engines how many times a character or group of characters should appear in your string. One of the quantifiers I use the most is the "+" quantifier. When I add this to a character, it means that character needs to appear at least one time. It can appear as many times as it wants, but it needs to be there at least once.
So this regular expression:
would match any character appearing at least once. It guarantees a character will be there.
Quantifiers lie at the root of whether your regular expression is greedy, lazy, or possessive. By default, they’re greedy. A greedy quantifier tries to match as much of the string as it can. It grabs as much of the string as it can get its greedy little hands on and tries to make a match. If the whole string doesn’t work, it backs up one character and tries again. It repeats this process until there are no more characters for it to test.
Greedy quantifiers use maximum effort for maximum return. A greedy quantifier will try as many ways as it can to find a match and will return the maximum characters that could possibly be a part of that match. Let’s look at an example:
Now if I run the match method on this regular expression, passing in the string:
It matches our entire string.
When this regular expression sees the string, it tries to match the first part of the regular expression, the ".+," first. This matches the entire string. Then it tries to match the second part of our regular expression, the word "time." Because it already has the entire string marked as a match, it’s going to first look for the word "time" beyond the end of the string. It’s not going to find it since there’s nothing there, so it backtracks. It moves back one character at a time until it finds a match. When it finds it, it returns the whole match. In this case, it’s our entire string.
Greedy quantifiers try to match the whole string, then backtrack. Backtracking means if the entire string doesn’t match the entire regular expression, it will try as many ways as possible to find a match. It needs to keep track of what ways it’s tried so it doesn’t repeat them. This can potentially take up a lot of system resources, particularly when you have multiple matches running on large amounts of text.
Oniguruma has optimizations that make backtracking quicker. Patrick Shaughnessy has a fantastic blog post that goes into the details of how Oniguruma handles backtracking. Even with optimizations, however, a greedy regular expression will chew through a lot of resources.
When you want a more contained match that uses much less resources, you want a lazy quantifier. Also known as a reluctant quantifier, it starts at the very beginning of the string and tries to make a match with the very first character. If it doesn’t find a match, it grabs another character. As an absolute last resort, it will grab the whole string to try and find a match.
A lazy quantifier uses minimum effort for minimum return. It returns as few characters as possible to make match. If it finds a match in the first character of the string, it will return just the first character. It’s lazy. It does just enough to get by, nothing more.
You make a quantifier lazy simply by adding a question mark after it.
If I run the match method on my string using this lazy regular expression:
I only get "There’s no time" back. It started at the very beginning of the string and delivered just enough to be a match. Lazy regular expressions use much less backtracking and, therefore, fewer resources than greedy regular expressions.
What if you do want to match as many characters as possible, but don’t want backtracking to consume your resources? There’s a third kind of quantifier, possessive quantifiers. These are all or nothing. Either there’s a match on the first try or they fail. Like a greedy quantifier, they grab as much of the string as they can - the entire string- and try to make a match. If that match fails, though, they won’t backtrack or try again.
Possessive quantifiers use minimum effort for maximum return. They try to return as many characters as possible for the bare minimum effort - they give it one go then give up. To make a quantifier possessive, you add a plus sign to it:
Let’s run match on the string using this possessive regular expression:
The match fails. Why would it fail? It seems like our entire string should match this regular expression. The reason this fails is because there is no backtracking. The first thing our regular expression tries to match is the ".+." This matches the entire string. When it tries to match the second part of our regular expression, "time", it already has the entire string marked as a match for ".+." It looks for the word "time" AFTER our entire string. This is the same thing a greedy quantifier does, but a greedy quantifier can go back earlier in the string and look for a match. A possessive quantifier can’t go back in the string to look for a match because it can’t backtrack. Therefore, it fails.
The main advantage Possessive quantifiers offer is they fail fast. They don’t backtrack, so they use minimal resources. A greedy quantifier will try every possible way to try to make a match. If it fails, all that work, all those resources, will be for nothing. A possessive quantifier prevents this. If it’s going to fail, it fails quickly.
Generally, you only want to use possessive quantifiers for very small regular expressions, usually when you have small sub expressions nested within larger expressions. They’re very useful, but use with caution.
Regular expressions are powerful. So powerful they inspire fear in many of us. That fear can be overcome. As cryptic as they might seem, they do have a logical reasoning and structure. Use them. Fire up Rubular and try some lookaheads and lookbehinds, experiment with greedy, lazy, and possessive quantifiers. Explore the fantastic ways Ruby works with regular expressions. I think you’ll be amazed at what you find.
* Java Regular Expression的學習筆記 [精華]
* Wiki - regular expression
- [ 英文學習 ] (7)
- [ 計算機概論 ] (1)
- [ 深入雲計算 ] (8)
- [ 雜七雜八 ] (5)
- [ Algorithm in Java ] (26)
- [ Data Structures with Java ] (82)
- [ IR Class ] (19)
- [ Java 文章收集 ] (21)
- [ Java 代碼範本 ] (42)
- [ Java 套件 ] (11)
- [ JVM 應用 ] (7)
- [ LFD Note ] (2)
- [ MangoDB ] (7)
- [ Math CC ] (3)
- [ MongoDB ] (8)
- [ MySQL 小學堂 ] (1)
- [ Python 考題 ] (2)
- [ Python 常見問題 ] (13)
- [ Python 範例代碼 ] (7)
- [心得扎記] (1)
- [網路教學] (3)
- [C 常見考題] (2)
- [C 範例代碼] (4)
- [C/C++ 範例代碼] (18)
- [Intro Alg] (4)
- [Java 代碼範本] (25)
- [Java 套件] (15)
- [Linux 命令] (60)
- [Linux 小技巧] (34)
- [Linux 小學堂] (32)
- [ML In Action] (14)
- [ML] (45)
- [MLP] (7)
- [Python 學習筆記] (1)
- [Quick Python] (20)
- [Software Engineering] (8)
- [The python tutorial] (7)
- 工具收集 (21)
- 設計模式 (18)
- 資料結構 (68)
- ActiveMQ In Action (13)
- AI (6)
- Algorithm (4)
- Android (11)
- Big Data 研究 (15)
- C/C++ (68)
- C++ (19)
- CCDH (20)
- Coursera (2)
- Database (1)
- Design Pattern (1)
- Device Driver Programming (42)
- Docker (31)
- Docker 工具 (1)
- Docker Practice (6)
- Eclipse (1)
- English Writing (52)
- ExtJS 3.x (4)
- FP (1)
- FreeBSD (1)
- GCC (2)
- Git (4)
- Git Pro (4)
- GNU (30)
- Groovy (81)
- Hadoop (65)
- Hadoop. Hadoop Ecosystem (1)
- Java (258)
- Java Framework (1)
- Java UI (3)
- JavaIDE (2)
- JFreeChart (2)
- Kali/Metasploit (6)
- KVM (1)
- Learn Spark (10)
- Linux (240)
- Lucene (19)
- Math (8)
- MPI (3)
- Nachos (4)
- Network (3)
- NLP (1)
- OO (28)
- OpenCL (1)
- OpenMP (3)
- OSC (1)
- OSGi (10)
- Perl (24)
- Python (187)
- Python Std Library (36)
- Python tools (4)
- QEMU (1)
- R (1)
- RIA (14)
- RTC (5)
- Ruby (68)
- Ruby Packages (8)
- Scala (75)
- ScalaIA (15)
- TensorFlow (1)
- Tools (11)
- UML (2)
- Unix (18)
- Verilog (3)
- Vmware (2)
- Windows 技巧 (10)
- ► 2016 (273)
- ► 2015 (276)
- [ 常見問題 ] How do you add an array to another array ...
- [ 文章收集 ] RB Learning - More On Strings
- [Toolkit] Simple Web Service - SimpleHttp.groovy -...
- [文章收集] How to Install, Run and Uninstall VMware Pl...
- [ Ruby Gossip ] Basic : 類別 - 特殊方法定義
- [ 範例代碼 ] Array's collect with index?
- [文章收集] Snort : Customized AppId Lua Script as Dete...
- [ InAction Note ] Ch2. Understand - Concept of mes...
- [ Ruby Gossip ] Basic : 類別 - 定義類別
- [ Ruby Gossip ] Basic : 方法 - 變數範圍
- [ 文章收集 ] RB Learning - Fun with Strings
- [ 文章收集 ] RB Learning - Writing Own Ruby Methods
- [ 文章收集 ] Serializing (And Deserializing) Objects W...
- [Linux 文章收集] Linux CLI 提示字元的設定
- [ Ruby Gossip ] Basic : 方法 - 迭代器與程式區塊
- [文章收集] Snort : Firing up OpenAppID
- [ Ruby Gossip ] Basic : 方法 - def 定義方法
- [ 文章收集 ] How to Sort a Hash in Ruby
- [Linux 小技巧] 設置 Jumbo frame
- [ Ruby Gossip ] Basic : 流程控制 - begin...rescue...en...
- [ Ruby Gossip ] Basic : 流程控制 - case...when...else
- [ Ruby Gossip ] Basic : 流程控制 - while、until、loop 與 ...
- [ Ruby Gossip ] Basic : 流程控制 - if 與 unless
- [ Ruby Gossip ] Basic : 變數、操作與物件 - 淺談物件、訊息與方法
- [ 文章收集 ] How Ruby method dispatch works
- [ Ruby Gossip ] Basic : 變數、操作與物件 - 操作方法
- [ Ruby Gossip ] Basic : 變數、操作與物件 - 變數
- [ 文章收集 ] alias vs alias_method
- [ Ruby Gossip ] Basic : 內建型態與操作 - 範圍型態
- [ Ruby Gossip ] Basic : 內建型態與操作 - 雜湊型態
- [ Ruby Gossip ] Basic : 內建型態與操作 - 陣列型態
- [ 文章收集 ] Using Regular Expression in Ruby - Part3
- [ Ruby Gossip ] Basic : 內建型態與操作 - 符號型態
- [ 文章收集 ] Using Regular Expression in Ruby - Part2
- [ 文章收集 ] Using Regular Expression in Ruby - Part1
- [Toolkit] Simple Web Service - SimpleHttp.groovy
- [ Ruby Gossip ] Basic : 內建型態與操作 - 關於編碼
- [ Ruby Gossip ] Basic : 內建型態與操作 - 字串型態
- [ Ruby Gossip ] Basic : 內建型態與操作 - 數值型態
- [ Metasploit 常見問題 ] Database not connected or cach...
- [Ubuntu 常見問題] Why is chkconfig no longer available...
- [ Python 文章收集 ] python Pexpect
- [ Python 範例代碼 ] Auto-login script for vpnc
- [ Ruby Gossip ] Basic : 基本指令與觀念 - 基本輸入輸出
- [ Ruby Gossip ] Basic : 基本指令與觀念 - load 與 require
- ▼ 十月 (45)
- ► 2013 (112)
- ► 2012 (197)
- ► 2011 (265)