Preface
This is the final installment in our three-part series on regular expressions. Before continuing, be sure to check out Part I and Part II.
Regular Expression Behaviors
Regular expressions are powerful. As a famous superhero once said, with great power comes great responsibility. To keep a regular expression from causing havoc, you need to know how to control its behavior.
Regular expressions have three distinct, recognizable behaviors: greedy, lazy, and possessive. These words sound pretty negative but they’re not necessarily bad ways for your regular expression to behave. These are descriptions of different types of behavior your regular expression might have. Behavior you can recognize and control! I’m going to show you how.
To understand these regular expression attitudes, we need to understand quantifiers. Quantifiers simply tell your regular expression engine how many times a character or group of characters should appear in your string. One of the quantifiers I use the most is the "+" quantifier. When I add this to a character, it means that character needs to appear at least one time. It can appear as many times as it wants, but it needs to be there at least once.
So the regular expression /.+/ would match any character appearing at least once. It guarantees a character will be there.
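A quick way to see the "+" quantifier at work is the minimal Ruby sketch below. The constant and method names (ONE_OR_MORE, one_or_more) are made up for this illustration; String#[] with a Regexp argument returns the matched text, or nil when nothing matches.

```ruby
# Sketch: "+" requires at least one occurrence of whatever precedes it.
# ONE_OR_MORE and one_or_more are illustrative names, not from the article.
ONE_OR_MORE = /.+/

# Returns the matched text, or nil when the pattern does not match.
def one_or_more(text)
  text[ONE_OR_MORE]
end

p one_or_more("a")    # a single character satisfies "+"
p one_or_more("abc")  # so do several characters
p one_or_more("")     # an empty string has no character to match
```

The empty string is the interesting case: because "+" demands at least one character, there is nothing to match and the result is nil.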
Quantifiers lie at the root of whether your regular expression is greedy, lazy, or possessive. By default, they’re greedy. A greedy quantifier tries to match as much of the string as it can. It grabs as much of the string as it can get its greedy little hands on and tries to make a match. If the whole string doesn’t work, it backs up one character and tries again. It repeats this process until there are no more characters for it to test.
Greedy quantifiers use maximum effort for maximum return. A greedy quantifier will try as many ways as it can to find a match and will return the maximum number of characters that could possibly be part of that match. Let’s look at an example, the regular expression /.+time/.
Now if I run the match method on this regular expression, passing in a string that starts with "There’s no time" and ends with the word "time", it matches our entire string.
When this regular expression sees the string, it tries to match the first part of the regular expression, the ".+", first. This matches the entire string. Then it tries to match the second part of our regular expression, the word "time". Because ".+" has already claimed the entire string, the engine first looks for the word "time" beyond the end of the string. It’s not going to find it since there’s nothing there, so it backtracks. It gives back one character at a time until "time" matches. When it does, it returns the whole match. In this case, it’s our entire string.
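The greedy behavior described above can be tried out with a short Ruby sketch. The sample string is hypothetical (the original example string is not reproduced here); it was chosen only because it contains "time" twice and ends with it. The names GREEDY_SAMPLE, GREEDY_PATTERN, and greedy_match are likewise made up for this illustration.

```ruby
# Hypothetical sample text, chosen so that "time" appears twice
# and the string ends with it.
GREEDY_SAMPLE = "There's no time like the present time"

# ".+" is greedy by default: it grabs everything, then gives characters
# back one at a time until "time" can match.
GREEDY_PATTERN = /.+time/

# Returns the matched text, or nil when the pattern does not match.
def greedy_match(text)
  match = GREEDY_PATTERN.match(text)
  match && match[0]
end

puts greedy_match(GREEDY_SAMPLE)
```

With this sample, the greedy quantifier only has to back up far enough to expose the final "time", so the match covers the whole string.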
Greedy quantifiers try to match the whole string, then backtrack. Backtracking means if the entire string doesn’t match the entire regular expression, it will try as many ways as possible to find a match. It needs to keep track of what ways it’s tried so it doesn’t repeat them. This can potentially take up a lot of system resources, particularly when you have multiple matches running on large amounts of text.
Oniguruma has optimizations that make backtracking quicker. Patrick Shaughnessy has a fantastic blog post that goes into the details of how Oniguruma handles backtracking. Even with optimizations, however, a greedy regular expression can chew through a lot of resources.
When you want a more contained match that can use fewer resources, you want a lazy quantifier. Also known as a reluctant quantifier, it starts at the very beginning of the string and tries to make a match with the very first character. If it doesn’t find a match, it grabs another character. As an absolute last resort, it will grab the whole string to try to find a match.
A lazy quantifier uses minimum effort for minimum return. It returns as few characters as possible to make a match. If it finds a match in the first character of the string, it will return just the first character. It’s lazy. It does just enough to get by, nothing more.
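You make a quantifier lazy simply by adding a question mark after it, which turns our example into /.+?time/.
If I run the match method on my string using this lazy regular expression, I only get "There’s no time" back. It started at the very beginning of the string and delivered just enough to be a match. Because a lazy quantifier stops at the first point where the rest of the pattern matches, it can do much less work than a greedy one when that point comes early in the string. It still has to test the rest of the pattern each time it grabs another character, though, so it is not automatically cheaper.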
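Here is a small Ruby sketch of the lazy version. The sample string is hypothetical; it only needs to start with "There's no time" and contain more text after that. The names LAZY_SAMPLE, LAZY_PATTERN, and lazy_match are illustrative.

```ruby
# Hypothetical sample text: "time" appears early and again at the end.
LAZY_SAMPLE = "There's no time like the present time"

# ".+?" is lazy: it takes one character, then grows one character at a
# time until "time" can match right after it.
LAZY_PATTERN = /.+?time/

# Returns the matched text, or nil when the pattern does not match.
def lazy_match(text)
  match = LAZY_PATTERN.match(text)
  match && match[0]
end

puts lazy_match(LAZY_SAMPLE)  # stops at the first "time"
```

The match ends at the first occurrence of "time" instead of the last, which is the practical difference between the lazy and greedy forms.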
What if you do want to match as many characters as possible, but don’t want backtracking to consume your resources? There’s a third kind of quantifier, the possessive quantifier. These are all or nothing. Either there’s a match on the first try or they fail. Like a greedy quantifier, they grab as much of the string as they can (the entire string) and try to make a match. If that match fails, though, they won’t backtrack or try again.
Possessive quantifiers use minimum effort for maximum return. They try to return as many characters as possible for the bare minimum effort: they give it one go, then give up. To make a quantifier possessive, you add a plus sign to it, which turns our example into /.++time/.
Let’s run match on the string using this possessive regular expression.
The match fails. Why would it fail? It seems like our entire string should match this regular expression. The reason this fails is that there is no backtracking. The first thing our regular expression tries to match is the ".++". This matches the entire string. When it tries to match the second part of our regular expression, "time", it already has the entire string marked as a match for ".++". It looks for the word "time" AFTER our entire string. This is the same thing a greedy quantifier does, but a greedy quantifier can go back earlier in the string and look for a match. A possessive quantifier can’t go back in the string to look for a match because it can’t backtrack. Therefore, it fails.
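The failure is easy to reproduce in Ruby. In this sketch the sample string is hypothetical, and the names POSSESSIVE_SAMPLE, POSSESSIVE_PATTERN, and possessive_match are made up for the illustration.

```ruby
# Hypothetical sample text that a greedy /.+time/ would match in full.
POSSESSIVE_SAMPLE = "There's no time like the present time"

# ".++" is possessive: it grabs everything it can and never gives any of
# it back, so nothing is left for "time" to match.
POSSESSIVE_PATTERN = /.++time/

# Returns the matched text, or nil when the pattern does not match.
def possessive_match(text)
  match = POSSESSIVE_PATTERN.match(text)
  match && match[0]
end

p possessive_match(POSSESSIVE_SAMPLE)  # nil: the match fails
```

Because ".++" always runs to the end of the line, this particular pattern can never succeed on single-line text. That is a sign the possessive quantifier is in the wrong place here, not that possessive quantifiers are useless.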
The main advantage possessive quantifiers offer is that they fail fast. They don’t backtrack, so they use minimal resources. A greedy quantifier will try every possible way to make a match. If it fails, all that work, all those resources, will be for nothing. A possessive quantifier prevents this. If it’s going to fail, it fails quickly.
Generally, you only want to use possessive quantifiers for very small regular expressions, usually when you have small subexpressions nested within larger expressions. They’re very useful, but use them with caution.
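As one illustration of a small subexpression where a possessive quantifier is safe, consider matching a double-quoted string. This example is not from the original series; the names QUOTED and quoted_string are hypothetical. The class [^"] can never match the closing quote, so giving characters back could never help, and the possessive form simply fails sooner on unterminated input.

```ruby
# Sketch: a possessive quantifier inside a small subexpression.
# [^"]*+ swallows everything up to the next quote and never backtracks.
QUOTED = /"[^"]*+"/

# Returns the first double-quoted chunk of text, or nil if there is none.
def quoted_string(text)
  text[QUOTED]
end

p quoted_string('say "hello" now')  # finds the quoted word
p quoted_string('say "hello')       # unterminated quote: no match
```

The design point is that possessive quantifiers work best when the quantified part and the part that follows it cannot overlap, as with [^"] and the closing quote here.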
Conclusion
Regular expressions are powerful. So powerful they inspire fear in many of us. That fear can be overcome. As cryptic as they might seem, they do have a logical reasoning and structure. Use them. Fire up Rubular and try some lookaheads and lookbehinds, experiment with greedy, lazy, and possessive quantifiers. Explore the fantastic ways Ruby works with regular expressions. I think you’ll be amazed at what you find.
Supplement
* Java Regular Expression study notes [highlights]
* Wiki - regular expression
* Regular-expression.info