程式扎記: [Quick Python] 17. Regular expressions

Friday, March 2, 2012


Preface : 
In some sense, we shouldn’t discuss regular expressions in this book at all. They’re implemented by a single Python module and are advanced enough that they don’t even come as part of the standard library in languages like C or Java. But if you’re using Python, you’re probably doing text parsing; and if you’re doing that, then regular expressions are too useful to be ignored. If you use Perl, Tcl, or UNIX, you may be familiar with regular expressions; if not, this chapter will go into them in some detail. 

This chapter covers : 
* Understanding regular expressions
* Creating regular expressions with special characters
* Using raw strings in regular expressions
* Extracting matched text from strings
* Substituting text with regular expressions

What is a regular expression? 
A regular expression (RE) is a way of recognizing and often extracting data from certain patterns of text. A regular expression that recognizes a piece of text or a string is said to match that text or string. An RE is defined by a string in which certain characters (the so-called metacharacters) can have a special meaning, which enables a single RE to match many different specific strings. 

It’s easier to understand this through example than through explanation. Here’s a program using a regular expression, which counts how many lines in a text file contain the word hello. A line that contains hello more than once will be counted only once : 
- Exam01.py :
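  import re
  regexp = re.compile("hello")   # compile the pattern once and reuse it for every line
  count = 0
  with open("textfile.txt", "r") as file:   # the with block closes the file when done
      for line in file:
          if regexp.search(line):   # a line counts once, however many matches it has
              count += 1
  print(count)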

The program starts by importing the Python regular expression module, called re. It then takes the text string "hello" as a textual regular expression and compiles it into a compiled regular expression, using the re.compile() function. This isn’t strictly necessary, but compiled regular expressions can significantly increase a program’s speed, so they’re almost always used in programs that process large amounts of text. 

What can the regular expression compiled from "hello" be used for? You can use it to recognize other instances of the word "hello" within another string; in other words, you can use it to determine whether another string contains "hello" as a substring. This is accomplished by the search() method, which returns None if the regular expression isn’t found in the string argument; Python interprets None as false in a Boolean context. If the regular expression is found in the string, then Python returns a special object that you can use to determine various things about the match (such as where in the string it occurred). We’ll discuss this later. 
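
As a small sketch of the two outcomes described above, the following snippet calls search() on one string that contains "hello" and one that does not. The sample strings are made up for illustration.

```python
import re

regexp = re.compile("hello")

# search() returns a match object when the pattern occurs anywhere in the string.
found = regexp.search("say hello to everyone")
# search() returns None when the pattern does not occur.
missing = regexp.search("goodbye")

print(found is not None)   # True
print(found.start())       # 4, the index where the match begins
print(missing)             # None
```

Because None is false in a Boolean context, `if regexp.search(line):` is enough when you only need a yes/no answer.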

Regular expressions with special characters : 
The previous example has a small flaw—it counts how many lines contain "hello" but ignores lines that contain "Hello" because it doesn’t take capitalization into account. One way to solve this would be to use two regular expressions, one for "hello" and one for "Hello", and test each of these REs against every line. A better way is to use the more advanced features of regular expressions. For the second line in the program, substitute : 
regexp = re.compile("hello|Hello")

This regular expression uses the vertical bar special character |. A special character is a character in a regular expression that isn’t interpreted as itself—it has some special meaning. | means or, so the regular expression matches "hello" or "Hello". 

Another way of doing this is to use : 
regexp = re.compile("(h|H)ello")

In addition to using |, this regular expression uses the parentheses special characters to group things, which in this case means that the | only chooses between a small or capital H. The resulting regular expression matches either an h or an H, followed by ello.

Another way of performing the match is : 
regexp = re.compile("[hH]ello")

The special characters [ and ] take a string of characters between them and match any single character in that string. There’s a special shorthand to denote ranges of characters inside [ and ]: [a-z] will match a single character between a and z, [0-9A-Z] will match any digit or any uppercase character, and so forth. Sometimes you may want to include a real hyphen in the [], in which case you should put it as the first character to avoid defining a range; [-012] will match a hyphen, or a 0 or a 1 or a 2, and nothing else. 
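
To compare the three spellings discussed so far, here is a minimal sketch that runs each pattern against the same made-up sample strings, and then tries the [-012] character class. It uses re.search() and findall(), which are part of the re module but have not been introduced in this chapter.

```python
import re

# Three ways to say: match "hello" or "Hello".
patterns = ["hello|Hello", "(h|H)ello", "[hH]ello"]
samples = ["hello there", "Hello there", "jello"]

results = {}
for pattern in patterns:
    # Record, for each sample, whether the pattern is found somewhere in it.
    results[pattern] = [bool(re.search(pattern, text)) for text in samples]
    print(pattern, results[pattern])   # each pattern gives [True, True, False]

# A leading hyphen inside [] is a literal hyphen, not a range marker.
hyphen_matches = re.findall("[-012]", "a-1b2c3")
print(hyphen_matches)   # ['-', '1', '2']
```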

Quite a few special characters are available in Python regular expressions, and describing all of the subtleties of using them is beyond the scope of this book. For a complete list of the special characters available in Python regular expressions, along with descriptions of what they mean, refer to the online documentation for the re module. For the remainder of this chapter, we’ll describe the special characters we use as they appear. 

Regular expressions and raw strings : 
The functions that compile REs, or search for matches to REs, understand that certain character sequences in strings have special meanings in the context of regular expressions. For example, RE functions understand that \n represents a newline character. But if you use normal Python strings as regular expressions, the RE functions will typically never see such special sequences, because many of these sequences also possess a special meaning in normal strings. \n, for example, also means newline in the context of a normal Python string, and Python will automatically replace the string sequence \n with a newline character before the RE function ever sees that sequence. The RE function, as a result, will compile strings with embedded newline characters, not with embedded '\n' sequences. 

In the case of \n, this makes no difference because RE functions interpret a newline character as exactly that and do the expected thing—they attempt to match it with another newline character in the text being searched. Let’s look at another special sequence, \\, which represents a single backslash to REs. Assume that we wish to search some text for an occurrence of the string "\ten". Because we know that we have to represent a backslash as a double backslash, we might try : 
regexp = re.compile("\\ten")

This will compile without complaining, but it’s wrong. The problem is that \\ also means a single backslash in Python strings. Before re.compile is invoked, Python interprets the string we typed as meaning \ten, which is what is passed to re.compile. In the context of regular expressions, \t means tab, so our compiled regular expression searches for a tab character followed by the two characters en.

To fix this while using regular Python strings, we need four backslashes. Python interprets the first two backslashes as a special sequence representing a single backslash, and likewise for the second pair of backslashes, resulting in two actual backslashes in the Python string. That string is then passed in to re.compile(), which interprets the two actual backslashes as an RE special sequence representing a single backslash. Our code looks like this : 
regexp = re.compile("\\\\ten")
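
The difference between the two-backslash and four-backslash versions can be checked directly. This is only a sketch; the variable names and the two test strings are invented for illustration.

```python
import re

two_backslash_source = "\\ten"      # Python stores 4 characters: \ t e n
four_backslash_source = "\\\\ten"   # Python stores 5 characters: \ \ t e n
print(len(two_backslash_source), len(four_backslash_source))   # 4 5

text_with_backslash = "a\\ten"   # a, backslash, t, e, n
text_with_tab = "a\ten"          # a, tab, e, n

wrong = re.compile(two_backslash_source)    # the RE sees \t, which means tab
right = re.compile(four_backslash_source)   # the RE sees \\, a literal backslash

print(wrong.search(text_with_backslash))            # None
print(wrong.search(text_with_tab) is not None)      # True: it matched the tab
print(right.search(text_with_backslash) is not None)   # True
print(right.search(text_with_tab))                  # None
```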

That seems confusing, which is why Python offers another way of defining strings, called raw strings.

- Raw strings to the rescue 
A raw string looks similar to a normal string, except that it has a leading r character immediately preceding the initial quotation mark of the string : 
r"Hello"
r"""\tTo be\n\tor not to be"""
r'Goodbye'
r'''12345'''

As you can see, you can use raw strings with either single or double quotation marks and with the regular or triple-quoting convention. You can also use a leading R instead of r if you wish. No matter how you do it, raw string notation can be taken as an instruction to Python saying, "Don’t process special sequences in this string." In the previous examples, all the raw strings are equivalent to their normal string counterparts except the second example, in which the \t and \n sequences aren’t interpreted as tabs or newlines but are left as two-character sequences beginning with a backslash. 

Raw strings aren’t a different type of string. They’re a different way of defining strings. It’s easy to see what’s happening by running a few examples interactively : 
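
The original interactive session is not preserved here. The following is a small stand-in, written as a script rather than a prompt session, that compares raw and normal strings; the variable names are chosen only for this sketch.

```python
plain = "Hello"
raw = r"Hello"

same_without_backslashes = (raw == plain)
print(same_without_backslashes)   # True: with no backslashes there is nothing to skip
print(type(raw) is str)           # True: a raw string is an ordinary str object

# With a backslash, the raw form keeps it while the normal form turns \t into a tab.
raw_equals_escaped = (r"\the" == "\\the")
raw_equals_tab = (r"\the" == "\the")
print(raw_equals_escaped)   # True
print(raw_equals_tab)       # False

lengths = (len(r"\the"), len("\the"))
print(lengths)   # (4, 3): backslash,t,h,e versus tab,h,e
```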
 

Using raw strings with regular expressions means you don’t need to worry about any funny interactions between string special sequences and regular expression special sequences. You use the regular expression special sequences. The previous RE example then becomes : 
regexp = re.compile(r"\\ten")

which works as expected. The compiled RE looks for a single backslash followed by the letters ten. You should get into the habit of using raw strings whenever defining REs, and we’ll do so for the remainder of this chapter. 
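
A quick check of that claim, as a sketch: the raw-string pattern is the same string as the four-backslash version, and it finds a literal backslash followed by ten. The sample text is made up.

```python
import re

# The raw string r"\\ten" and the normal string "\\\\ten" are the same 5 characters.
same_pattern = (r"\\ten" == "\\\\ten")
print(same_pattern)   # True

regexp = re.compile(r"\\ten")
match = regexp.search(r"number\ten")   # the text contains a real backslash
print(match.group())                   # \ten
```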

Extracting matched text from strings : 
One of the most common uses of regular expressions is to perform simple pattern-based parsing on text. This is something you should know how to do, and it’s also a good way to learn more regular expression special characters. Assume, for example, that we have a list of people and phone numbers in a text file. Each line of the file will look like this : 
surname, firstname middlename: phonenumber

with a surname, followed by a comma and space, followed by a first name, followed by a space, followed by a middle name, followed by colon and a space, followed by a phone number. But to make things complicated, the middle name may or may not exist, and the phone number may or may not have an area code. It might be 800-123-4567, or it might be 123-4567. You could write code to explicitly parse data out from such a line, but it would be a tedious and error-prone job. Regular expressions provide a simpler answer. 

We’ll start by coming up with a regular expression that will match lines of the given form. The next few paragraphs will throw quite a few special characters at you. Don’t worry if you don’t get them all on the first read—as long as you understand the gist of things, that’s all right. For simplicity’s sake, let’s assume for right now that first names, surnames, and middle names consist of letters and possibly a hyphen. We can use the [] special characters defined in the previous section to define a pattern that defines only name characters : 
[-a-zA-Z]

This pattern will match a single hyphen, or a single lowercase letter, or a single uppercase letter. To match a full name (like McDonald), we need to repeat this pattern. The + metacharacter repeats whatever comes before it one or more times as necessary to match the string being processed. So, the pattern : 
[-a-zA-Z]+

will match a single name, like Kenneth or McDonald or Perkin-Elmer. It will also match some strings that aren’t names, like --- or -a-b-c-, but that’s all right for our purposes. Now, what about the phone number? The special sequence \d matches any digit, and a hyphen outside of [] is a normal hyphen. A good pattern to match the phone number is : 
\d\d\d-\d\d\d-\d\d\d\d

That’s three digits, followed by a hyphen, followed by three digits, followed by a hyphen, followed by four digits. This will match only phone numbers with an area code, and our list may contain numbers that don’t have one. The best solution is to enclose the area code part of the pattern in () to group it, and then follow that group with a ? special character, which says that the thing coming immediately before the ? is optional : 
(\d\d\d-)?\d\d\d-\d\d\d\d

This pattern will match a phone number that may or may not contain an area code. We can use the same sort of trick to account for the fact that some of the people in our list have their middle name included, and some don’t. (To do this, make the middle name optional using grouping and the ? special character.) Commas, colons, and spaces don’t have any special meaning in regular expressions (they mean themselves). Putting everything together, we come up with a pattern that looks like this : 
[-a-zA-Z]+, [-a-zA-Z]+( [-a-zA-Z]+)?: (\d\d\d-)?\d\d\d-\d\d\d\d
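
The building blocks can be tried one at a time before relying on the combined pattern. This sketch uses fullmatch(), which succeeds only if the whole string matches (search() would also accept a match in the middle of a string). The sample names and lines are invented for illustration.

```python
import re

name = re.compile(r"[-a-zA-Z]+")
phone = re.compile(r"(\d\d\d-)?\d\d\d-\d\d\d\d")
full_line = re.compile(
    r"[-a-zA-Z]+, [-a-zA-Z]+( [-a-zA-Z]+)?: (\d\d\d-)?\d\d\d-\d\d\d\d"
)

# + repeats the character class one or more times.
print(name.fullmatch("Perkin-Elmer") is not None)   # True
print(name.fullmatch("R2D2") is not None)           # False: digits are not allowed

# ? makes the grouped area code optional.
print(phone.fullmatch("800-123-4567") is not None)  # True
print(phone.fullmatch("123-4567") is not None)      # True
print(phone.fullmatch("1234567") is not None)       # False: the hyphen is required

# Hypothetical sample lines in the "surname, firstname middlename: phone" format.
sample_lines = [
    "McDonald, Ronald: 800-123-4567",
    "Smith, John Paul: 123-4567",
    "no phone number here",
]
line_results = [full_line.search(line) is not None for line in sample_lines]
print(line_results)   # [True, True, False]
```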

A real pattern would probably be a bit more complex, because we wouldn’t assume that there is exactly one space after the comma, exactly one space after the first and middle names, and exactly one space after the colon. But that’s easy to add later. The problem is that, whereas the above pattern will let us check to see if a line has the anticipated format, we can’t extract any data yet. All we can do is write a program like this : 
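  import re
  # Adjacent string literals are joined by Python into one pattern string.
  regexp = re.compile(
      r"[-a-zA-Z]+,"                   # surname, then a comma
      r" [-a-zA-Z]+"                   # a space, then the first name
      r"( [-a-zA-Z]+)?"                # optional space and middle name
      r": (\d\d\d-)?\d\d\d-\d\d\d\d"   # colon, space, phone number
  )
  with open("textfile", "r") as file:   # the with block closes the file when done
      for line in file:
          if regexp.search(line):
              print("Yeah, I found a line with a name and number. So what?")
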
Notice that we have split up our regular expression pattern using the fact that Python will implicitly concatenate any set of strings separated by whitespace. As your pattern grows, this can be a great aid in keeping it maintainable and understandable. It also solves the problem with the line length possibly increasing beyond the right edge of the screen. 
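
To confirm that splitting the pattern changes nothing, this short sketch compares the split form with the single-line form. The variable names are chosen only for this example.

```python
# Adjacent string literals inside the parentheses are concatenated at compile time.
split_pattern = (
    r"[-a-zA-Z]+,"
    r" [-a-zA-Z]+"
    r"( [-a-zA-Z]+)?"
    r": (\d\d\d-)?\d\d\d-\d\d\d\d"
)
single_pattern = r"[-a-zA-Z]+, [-a-zA-Z]+( [-a-zA-Z]+)?: (\d\d\d-)?\d\d\d-\d\d\d\d"

print(split_pattern == single_pattern)   # True
```

Note that the leading space in each fragment is part of the pattern; leaving one out changes what the RE matches.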

Fortunately, you can use regular expressions to extract data from patterns, as well as to check to see if the patterns exist. The first part of doing this is to group each subpattern corresponding to a piece of data you wish to extract using the () special characters and then give each subpattern a unique name with the special sequence ?P<name>, like this : 
(?P<last>[-a-zA-Z]+), (?P<first>[-a-zA-Z]+)( (?P<middle>([-a-zA-Z]+)))?: (?P<phone>(\d\d\d-)?\d\d\d-\d\d\d\d)

There’s an obvious point of confusion here: The question marks in ?P<...> and the question mark special characters that say the middle name and area code are optional have nothing to do with one another. It’s an unfortunate semicoincidence that they happen to be the same character. 

Now that we have named the elements of the pattern, we can extract them as matches are made, by using the group() method. This is possible because when the search() function returns a successful match, it doesn’t just return a truth value; it returns a data structure that records what was matched. We can write a simple program to extract names and phone numbers from our list and print them right out again as follows : 
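
The original listing is not preserved here, so what follows is a sketch of such a program rather than the original. It assumes the group names last, first, middle, and phone (middle and phone appear in the notes that follow; last and first are assumed), and it reads from a hypothetical in-memory list of lines instead of a file so that it runs on its own.

```python
import re

regexp = re.compile(
    r"(?P<last>[-a-zA-Z]+),"
    r" (?P<first>[-a-zA-Z]+)"
    r"( (?P<middle>[-a-zA-Z]+))?"
    r": (?P<phone>(\d\d\d-)?\d\d\d-\d\d\d\d)"
)

# Hypothetical sample data standing in for the lines of the text file.
sample_lines = [
    "McDonald, Ronald: 800-123-4567",
    "Smith, John Paul: 123-4567",
    "this line is not a record",
]

records = []
for line in sample_lines:
    result = regexp.search(line)
    if result is None:
        print("Oops, this does not look like a record:", line)
        continue
    record = {
        "last": result.group("last"),
        "first": result.group("first"),
        "middle": result.group("middle"),   # None when there is no middle name
        "phone": result.group("phone"),
    }
    records.append(record)
    print(record)
```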
 

There are some points of interest here : 
* We can find out whether a match succeeded by checking the value returned by search(). If the value is None, the match failed; otherwise, the match succeeded, and we can extract information from the object returned by search().
* group() is used to extract whatever data matched with our named subpatterns. We pass in the name of the subpattern we’re interested in.
* Because the middle subpattern is optional, we can’t count on it having a value, even if the match as a whole is successful. If the match succeeds, but the match for the middle name doesn’t, then using group() to access the data associated with the middle subpattern will return the value None.
* Part of the phone number is optional, but part isn’t. If the match succeeds, the phone subpattern must have some associated text, so we don’t have to worry about it having a value of None.

Substituting text with regular expressions : 
In addition to extracting strings from text, you can use Python’s regular expression module to find strings in text and substitute other strings in place of those that were found. You accomplish this using the regular expression substitution method sub(). The following example replaces instances of "the the" (presumably a typo) with a single "the" : 
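
The original listing is not preserved here; the sketch below stands in for it. The names regexp and string follow the description in the text, while the sample sentence is an assumption made for illustration.

```python
import re

# Sample sentence with the duplicated word twice (illustrative text).
string = "If the the problem is textual, use the the re module"

regexp = re.compile(r"the the")
fixed = regexp.sub("the", string)   # replace every "the the" with "the"
print(fixed)   # If the problem is textual, use the re module
```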
 

The sub method uses the invoking regular expression (regexp, in this case) to scan its second argument (string, in the example) and produces a new string by replacing all matching substrings with the value of the first argument ("the", in this example). 

But what if you want to replace the matched substrings with new ones that reflect the value of those that matched? This is where the elegance of Python comes into play. The first argument to sub—the replacement substring, "the" in the example—doesn’t have to be a string at all. Instead, it can be a function, and if it’s a function, Python calls it with the current match object and lets that function compute and return a replacement string. 

To see this in action, we’ll build an example that will take a string containing integer values (no decimal point or decimal part) and return a string with the same numerical values but as floating numbers (with a trailing decimal point and zero) : 
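
The original listing is not preserved here; this sketch follows the description in the text, using the function name int_match_to_float, the group name num, and the input "1 2 3 4 5". The other variable names are assumptions.

```python
import re

def int_match_to_float(match_obj):
    """Return the matched integer text with ".0" appended."""
    return match_obj.group("num") + ".0"

int_string = "1 2 3 4 5"
regexp = re.compile(r"(?P<num>[0-9]+)")

# sub() calls int_match_to_float once per match and uses its return value.
float_string = regexp.sub(int_match_to_float, int_string)
print(float_string)   # 1.0 2.0 3.0 4.0 5.0
```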
 

In this case, the pattern looks for a number consisting of one or more digits (the [0-9]+ part). But it’s also given a name (the ?P<num> part) so that the replacement function can extract any matched substring by referring to that name. The sub method then scans down the argument string "1 2 3 4 5", looking for anything that matches [0-9]+. When sub finds a substring that matches, it makes a match object defining exactly which substring has matched the pattern, and it calls the int_match_to_float function with that match object as the sole argument. int_match_to_float uses group to extract the matching substring from the match object (by referring to the group name num) and produces a new string by concatenating the matched substring with a ".0". sub takes the string returned by the function and incorporates it as a substring into the overall result. Finally, sub starts scanning again right after the place where it found the last matching substring, and it keeps going like that until it can’t find any more matching substrings. 

Supplement : 
[Python Std Library] 7.2. re — Regular expression operations
