Source From Here
This is the final installment in our three-part series on regular expressions. Before continuing, be sure to check out Part I and Part II.
Regular Expression Behaviors
Regular expressions are powerful. As a famous superhero once said, with great power comes great responsibility. To keep a regular expression from causing havoc, you need to know how to control its behavior.
Regular expressions have three distinct, recognizable behaviors: greedy, lazy, and possessive. These words sound pretty negative but they’re not necessarily bad ways for your regular expression to behave. These are descriptions of different types of behavior your regular expression might have. Behavior you can recognize and control! I’m going to show you how.
To understand these regular expression attitudes, we need to understand quantifiers. Quantifiers simply tell your regular expression engine how many times a character or group of characters should appear in your string. One of the quantifiers I use the most is the "+" quantifier. When I add this to a character, it means that character needs to appear at least one time. It can appear as many times as it wants, but it needs to be there at least once.
So the regular expression ".+" would match any character appearing at least once. It guarantees a character will be there.
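A minimal sketch of the "+" quantifier in Ruby, assuming the pattern under discussion is /.+/ (a dot, which matches any character except a newline, followed by "+"). The sample strings are made up for illustration.

```ruby
# "." matches any single character except a newline; "+" requires one or more.
at_least_one = /.+/

# At least one character is present, so both of these match.
puts at_least_one.match("a")[0]
puts at_least_one.match("abc")[0]

# An empty string has no character to satisfy "+", so match returns nil.
p at_least_one.match("")
```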
Quantifiers lie at the root of whether your regular expression is greedy, lazy, or possessive. By default, they’re greedy. A greedy quantifier tries to match as much of the string as it can. It grabs as much of the string as it can get its greedy little hands on and tries to make a match. If the whole string doesn’t work, it backs up one character and tries again. It repeats this process until there are no more characters for it to test.
Greedy quantifiers use maximum effort for maximum return. A greedy quantifier will try as many ways as it can to find a match and will return the maximum characters that could possibly be a part of that match. Let’s look at an example: the regular expression ".+time". If I run the match method on this regular expression, passing in a string that ends with the word "time", it matches our entire string.
When this regular expression sees the string, it tries to match the first part of the regular expression, the ".+", first. This matches the entire string. Then it tries to match the second part of our regular expression, the word "time". Because it already has the entire string marked as a match, it first looks for the word "time" beyond the end of the string. It doesn’t find it since there’s nothing there, so it backtracks. It moves back one character at a time until it finds a match. When it finds one, it returns the whole match. In this case, it’s our entire string.
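The walkthrough above can be sketched in Ruby. This assumes the pattern is /.+time/, as the description implies, and uses a hypothetical sample string chosen to begin with "There's no time" and end with the word "time".

```ruby
# Hypothetical sample string (an assumption for illustration).
string = "There's no time like the present time"

# Greedy: ".+" first claims the whole string, then backtracks until
# the literal "time" can match.
greedy = /.+time/

match = greedy.match(string)
puts match[0]   # prints the entire sample string
```

The greedy quantifier gives back only the last four characters, so the match runs through the final "time" and covers the whole string.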
Greedy quantifiers try to match the whole string, then backtrack. Backtracking means if the entire string doesn’t match the entire regular expression, it will try as many ways as possible to find a match. It needs to keep track of what ways it’s tried so it doesn’t repeat them. This can potentially take up a lot of system resources, particularly when you have multiple matches running on large amounts of text.
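The cost of backtracking can be made visible with a toy model. The method below is a deliberately simplified simulation written for illustration: greedy_backtrack is a hypothetical helper, and this is not how Oniguruma is implemented. It mimics ".+" claiming the whole string and then giving back one character at a time, counting each attempt to match the literal that follows.

```ruby
# Simplified model of a greedy ".+" followed by a literal, anchored at the
# start of the string. Returns [matched_text, attempts]; matched_text is nil
# when no match is found. A real engine would also retry from later start
# positions, which this sketch ignores.
def greedy_backtrack(string, literal)
  attempts = 0
  # Start by taking every character, then give back one at a time.
  string.length.downto(1) do |taken|
    attempts += 1
    if string[taken, literal.length] == literal
      return [string[0, taken + literal.length], attempts]
    end
  end
  [nil, attempts]
end

text, attempts = greedy_backtrack("There's no time like the present time", "time")
puts text      # the whole sample string
puts attempts  # 5 attempts: the literal is found after giving back 4 characters

# When there is no match, every position is tried and the work is wasted.
miss, wasted = greedy_backtrack("a" * 20, "time")
p miss         # nil
puts wasted    # 20
```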
Oniguruma has optimizations that make backtracking quicker. Patrick Shaughnessy has a fantastic blog post that goes into the details of how Oniguruma handles backtracking. Even with optimizations, however, a greedy regular expression will chew through a lot of resources.
When you want a more contained match that typically uses fewer resources, you want a lazy quantifier. Also known as a reluctant quantifier, it starts at the very beginning of the string and tries to make a match with the very first character. If it doesn’t find a match, it grabs another character. As an absolute last resort, it will grab the whole string to try to find a match.
A lazy quantifier uses minimum effort for minimum return. It returns as few characters as possible to make a match. If it finds a match in the first character of the string, it will return just the first character. It’s lazy. It does just enough to get by, nothing more.
You make a quantifier lazy simply by adding a question mark after it: ".+" becomes ".+?".
If I run the match method on my string using this lazy regular expression, I only get "There’s no time" back. It started at the very beginning of the string and delivered just enough to be a match. Because it stops at the first possible match, a lazy regular expression often does less backtracking, and so uses fewer resources, than a greedy one, particularly when the match sits near the start of a long string.
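A minimal sketch of the lazy version, assuming the pattern /.+?time/ and the same hypothetical sample string (any string that starts with "There's no time" and contains a later "time" would behave the same way).

```ruby
# Hypothetical sample string (an assumption for illustration).
string = "There's no time like the present time"

# Lazy: ".+?" takes one character, then grows only as far as needed
# for the literal "time" to match.
lazy = /.+?time/

puts lazy.match(string)[0]   # prints "There's no time"
```

The match stops at the first "time" it can reach, even though a longer match exists.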
What if you do want to match as many characters as possible, but don’t want backtracking to consume your resources? There’s a third kind of quantifier: the possessive quantifier. These are all or nothing. Either there’s a match on the first try or they fail. Like a greedy quantifier, they grab as much of the string as they can (the entire string) and try to make a match. If that match fails, though, they won’t backtrack or try again.
Possessive quantifiers use minimum effort for maximum return. They try to return as many characters as possible for the bare minimum effort: they give it one go, then give up. To make a quantifier possessive, you add a plus sign after it: ".+" becomes ".++".
Let’s run match on the string using this possessive regular expression. The match fails. Why would it fail? It seems like our entire string should match this regular expression. It fails because there is no backtracking. The first thing our regular expression tries to match is the ".+". This matches the entire string. When it tries to match the second part of our regular expression, "time", it already has the entire string marked as a match for ".+". It looks for the word "time" AFTER our entire string. This is the same thing a greedy quantifier does, but a greedy quantifier can go back earlier in the string and look for a match. A possessive quantifier can’t go back in the string to look for a match because it can’t backtrack. Therefore, it fails.
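A minimal sketch of the failing possessive match, assuming the pattern /.++time/ and the same hypothetical sample string used for illustration. The greedy pattern is included only for comparison.

```ruby
# Hypothetical sample string (an assumption for illustration).
string = "There's no time like the present time"

# Possessive: ".++" claims the whole string and never gives anything back,
# so the literal "time" has nothing left to match.
possessive = /.++time/
p possessive.match(string)   # nil

# The greedy pattern can backtrack, so it still matches.
greedy = /.+time/
puts greedy.match(string)[0]
```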
The main advantage possessive quantifiers offer is that they fail fast. They don’t backtrack, so they use minimal resources. A greedy quantifier will try every possible way to make a match. If it fails, all that work, all those resources, will be for nothing. A possessive quantifier prevents this. If it’s going to fail, it fails quickly.
Generally, you only want to use possessive quantifiers for very small regular expressions, usually when you have small sub expressions nested within larger expressions. They’re very useful, but use with caution.
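As a sketch of a small possessive sub-expression inside a larger one, consider matching a double-quoted string. The pattern below is a hypothetical example, not taken from this series, and it assumes quotes are never escaped. Because [^"] can never match the closing quote, giving characters back could never help, so making the quantifier possessive loses nothing and lets an unterminated string fail quickly.

```ruby
# Hypothetical pattern: an opening quote, any run of non-quote characters
# (possessive, so never given back), then a closing quote.
quoted = /"[^"]*+"/

# A properly closed string matches, quotes included.
puts quoted.match('She said "hello" and left')[0]

# With no closing quote, the possessive run consumes the rest of the
# string and the match fails without backtracking.
p quoted.match('She said "hello and left')
```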
Regular expressions are powerful. So powerful they inspire fear in many of us. That fear can be overcome. As cryptic as they might seem, they do have a logical structure. Use them. Fire up Rubular and try some lookaheads and lookbehinds. Experiment with greedy, lazy, and possessive quantifiers. Explore the fantastic ways Ruby works with regular expressions. I think you’ll be amazed at what you find.