程式扎記: [Quick Python] 6. Strings

Preface :
Handling text—from user input, to filenames, to processing chunks of text—is a common chore in programming. Python comes with powerful tools to handle and format text. This chapter discusses the standard string and string-related operations in Python.

* Understanding strings as sequences of characters
* Using basic string operations
* Inserting special characters and escape sequences
* Converting from objects to strings
* Formatting strings
* Using the byte type

Strings as sequences of characters :
For the purposes of extracting characters and substrings, strings can be considered sequences of characters, which means you can use index or slice notation :

>>> x = "Hello"
>>> x[0]
'H'
>>> x[-1]
'o'
>>> x[1:]
'ello'

One use for slice notation with strings is to chop the newline off the end of a string, usually a line that’s just been read from a file :

>>> x = "Goodbye\n"
>>> x = x[:-1]
>>> x
'Goodbye'

This is just an example—you should know that Python strings have other, better methods to strip unwanted characters, but this illustrates the usefulness of slicing. You can also determine how many characters are in the string by using the len() function, just like finding out the number of elements in a list :

>>> len("Goodbye\n")
8

But strings aren’t lists of characters. The most noticeable difference between strings and lists is that, unlike lists, strings can’t be modified. Attempting to say something like string.append('c') or string[0] = 'H' will result in an error. You’ll notice in the previous example that we stripped off the newline from the string by creating a string that was a slice of the previous one, not by modifying the previous string directly. This is a basic Python restriction, imposed for efficiency reasons.
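The immutability described above can be checked directly. The following is a minimal sketch (the variable names are only illustrative) showing that both kinds of in-place modification are rejected, and that the usual workaround is to build a new string :

```python
# Strings are immutable: item assignment and list-style methods are rejected.
s = "hello"

try:
    s[0] = "H"            # str does not support item assignment
except TypeError as exc:
    assignment_error = type(exc).__name__

try:
    s.append("c")         # str has no append method
except AttributeError as exc:
    append_error = type(exc).__name__

# The workaround: build a new string from pieces of the old one.
capitalized = "H" + s[1:]

print(assignment_error)   # TypeError
print(append_error)       # AttributeError
print(capitalized)        # Hello
print(s)                  # hello (the original is unchanged)
```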

Basic string operations :
The simplest (and probably most common) way of combining Python strings is to use the string concatenation operator + :

>>> x = "Hello " + "World"
>>> x
'Hello World'

There is an analogous string multiplication operator that I have found sometimes, but not often, useful :

>>> 8 * "x"
'xxxxxxxx'

Special characters and escape sequences :
You’ve already seen a few of the character sequences Python regards as special when used within strings: \n represents the newline character and \t represents the tab character. Sequences of characters that start with a backslash and that are used to represent other characters are called escape sequences. Escape sequences are generally used to represent special characters—that is, characters (such as tab and newline) that don’t have a standard one-character printable representation. This section covers escape sequences, special characters, and related topics in more detail.

- Basic escape sequences
Python provides a brief list of two-character escape sequences to use in strings (table 6.1) :

The ASCII character set, which is a subset of the Unicode character set used by Python 3 strings and the basic character set on almost all computers, defines quite a few more special characters. They’re accessed by the numeric escape sequences, described in the next section.
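Table 6.1 is not reproduced in these notes. As a rough stand-in, the sketch below checks a handful of the standard two-character escapes; the selection and the dictionary name are my own choice for illustration, not the full contents of the table :

```python
# Each two-character escape written in source code denotes ONE character.
escapes = {
    "newline": "\n",
    "tab": "\t",
    "backslash": "\\",
    "single quote": "\'",
    "double quote": "\"",
}

for name, char in escapes.items():
    # len() is 1 for every entry; ord() gives the character's numeric code.
    print(name, len(char), ord(char))
```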

- Numeric (octal and hexadecimal) and Unicode escape sequences
You can include any ASCII character in a string by using an octal (base 8) or hexadecimal (base 16) escape sequence corresponding to that character. An octal escape sequence is a backslash followed by one to three octal digits; the character whose code is that octal number is substituted for the escape sequence. A hexadecimal escape sequence starts with \x rather than just \ and, in Python 3, is followed by exactly two hexadecimal digits; anything after those two digits is ordinary text, even if it happens to be a valid hexadecimal digit. For example, in the ASCII character table, the character m happens to have decimal value 109. This is octal value 155 and hexadecimal value 6D, so :

>>> 'm'
'm'
>>> '\155a'
'ma'
>>> '\x6Dake'
'make'

Because all strings in Python 3 are Unicode strings, they can also contain almost every character from every language available. Although a discussion of the Unicode system is far beyond this book, the following examples illustrate that you can also escape any Unicode character, either by number similar to that shown earlier or by Unicode name :

The Unicode character set includes the common ASCII characters.
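The original listing for this passage is not reproduced here. As a small sketch of the two styles of Unicode escape mentioned above (by number and by name), assuming the character á (U+00E1) as the example :

```python
# Three spellings of the same character, U+00E1.
by_hex_16 = "\u00e1"                             # \u takes exactly four hex digits
by_hex_32 = "\U000000e1"                         # \U takes exactly eight hex digits
by_name = "\N{LATIN SMALL LETTER A WITH ACUTE}"  # lookup by Unicode name

print(by_hex_16 == by_hex_32 == by_name)   # True
print(ord(by_name))                        # 225

# ASCII characters are part of Unicode, so every escape style reaches them too.
print("\u006d" == "m" == "\x6d" == "\155")  # True
```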

- Printing vs. evaluating strings with special characters
We talked before about the difference between evaluating a Python expression interactively and printing the result of the same expression using the print function. Although the same string is involved, the two operations can produce screen outputs that look different. A string that is evaluated at the top level of an interactive Python session will be shown with its special characters written as escape sequences (such as \n and \t), which makes clear what is in the string. Meanwhile, the print function passes the string directly to the terminal program, which may interpret special characters in special ways. For example, here’s what happens with a string consisting of an a followed by a newline, a tab, and a b :
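The original interactive listing is missing from these notes. A minimal sketch of the same comparison follows; it uses repr() to stand in for what the interactive prompt displays, since the prompt shows the repr of an evaluated expression :

```python
s = "a\n\tb"

# repr() returns the escaped form that the interactive prompt displays.
shown_by_prompt = repr(s)
print(shown_by_prompt)   # 'a\n\tb'

# print() writes the characters themselves, so the newline and tab take effect.
print(s)

# Either way, the string holds four characters.
print(len(s))            # 4
```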

In the first case, the newline and tab are shown explicitly in the string; in the second, they’re used as newline and tab characters. A normal print() function also adds a newline to the end of the string. Sometimes (that is, when you have lines from files that already end with newlines) you may not want this behavior. Giving the print function an end parameter of "" causes the print function to not append the newline :

>>> print("abc\n") # abc followed by two newlines
abc

>>> print("abc\n", end="") # suppress the newline that print appends
abc
>>>

String methods :
Most of the Python string methods are built into the standard Python string class, so all string objects have them automatically. The standard string module also contains some useful constants. Modules will be discussed in detail in chapter 10.

Because strings are immutable, the string methods are used only to obtain their return value and don’t modify the string object they’re attached to in any way. We’ll begin with those string operations that are the most useful and commonly used and then go on to discuss some less commonly used but still useful operations. At the end, we’ll discuss a few miscellaneous points related to strings. Not all of the string methods are documented here. See the documentation for a complete list of string methods.

- The split and join string methods
Anyone who works with strings is almost certain to find the split and join methods invaluable. They’re the inverse of one another—split returns a list of substrings in the string, and join takes a list of strings and puts them together to form a single string with the original string between each element. Typically, split uses whitespace as the delimiter to the strings it’s splitting, but you can change that via an optional argument.

String concatenation using + is useful but not efficient for joining large numbers of strings into a single string, because each time + is applied, a new string object is created. Our previous “Hello World” example produced two string objects, one of which was immediately discarded. A better option is to use the join method :

>>> " ".join(["join", "puts", "spaces", "between", "elements"])
'join puts spaces between elements'

By changing the string used to join, you can put anything you want between the joined strings :

>>> "::".join(["Separated", "with", "colons"])
'Separated::with::colons'

The most common use of split is probably as a simple parsing mechanism for string delimited records stored in text files. By default, split splits on any whitespace, not just a single space character, but you can also tell it to split on a particular sequence by passing it an optional argument :

>>> x = "You\t\t can have tabs\t\n\t and newlines \n\n" \
... "mixed in"
>>> x.split()
['You', 'can', 'have', 'tabs', 'and', 'newlines', 'mixed', 'in']
>>> x = "Mississippi"
>>> x.split("ss")
['Mi', 'i', 'ippi']

Sometimes it’s useful to permit the last field in a joined string to contain arbitrary text, including, perhaps, substrings that may match what split splits on when reading in that data. You can do this by specifying how many splits split should perform when it’s generating its result, via an optional second argument. If you specify n splits, then split will go along the input string until it has performed n splits (generating a list with n+1 substrings as elements) or until it runs out of string. Here are some examples :

>>> x = 'a b c d'
>>> x.split(" ", 1)
['a', 'b c d']
>>> x.split(" ", 2)
['a', 'b', 'c d']
>>> x.split(" ", 9)
['a', 'b', 'c', 'd']

When using split with its optional second argument, you must supply a first argument. To get it to split on runs of whitespace while using the second argument, use None as the first argument. I use split and join extensively, usually when working with text files generated by other programs. But you should know that if you’re able to define your own data file format for use solely by your Python programs, there’s a much better alternative to storing data in text files. We’ll discuss it in chapter 13 when we talk about the pickle module.
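As a sketch of the None-as-first-argument idiom, assume a made-up record whose last field is free text that may itself contain whitespace :

```python
# A hypothetical record: name, number, then free text.
record = "alice   42\t free text with   spaces"

# None keeps the default "split on runs of whitespace" behaviour, while the
# second argument (2) stops after two splits, so the rest stays in one field.
fields = record.split(None, 2)
print(fields)   # ['alice', '42', 'free text with   spaces']

# In Python 3 the same call can also be written with a keyword argument.
same_fields = record.split(maxsplit=2)
print(fields == same_fields)   # True
```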

- Converting strings to numbers
You can use the functions int() and float() to convert strings into integer or floating point numbers, respectively. If they’re passed a string that can’t be interpreted as a number of the given type, they will raise a ValueError exception. In addition, you may pass int an optional second argument, specifying the numeric base to use when interpreting the input string :
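The listing for this passage is missing from these notes. The conversions described above might look like the following sketch (the sample values are my own) :

```python
as_float = float("123.456")
as_int = int("3333")

# The optional second argument to int() is the base of the input string.
from_octal = int("10000", 8)
from_binary = int("101", 2)
from_hex = int("ff", 16)

print(as_float)      # 123.456
print(as_int)        # 3333
print(from_octal)    # 4096
print(from_binary)   # 5
print(from_hex)      # 255

# A string that is not a valid number of the requested type raises ValueError.
try:
    int("123.456")
    conversion_failed = False
except ValueError:
    conversion_failed = True

print(conversion_failed)   # True
```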

- Getting rid of extra whitespace
A trio of simple methods that are surprisingly useful are the strip, lstrip, and rstrip functions. strip returns a new string that’s the same as the original string, except that any whitespace at the beginning or end of the string has been removed. lstrip and rstrip work similarly, except that they remove whitespace only at the left or right end of the original string, respectively :

>>> x = " Hello, World\t\t "
>>> x.strip()
'Hello, World'
>>> x.lstrip()
'Hello, World\t\t '
>>> x.rstrip()
' Hello, World'

In this example, tab characters are considered to be whitespace. You can see which ASCII characters Python treats as whitespace by accessing the string.whitespace constant; in Python 3 this constant has the same value on every platform. On my Windows system, it gives the following :

>>> import string
>>> string.whitespace
' \t\n\r\x0b\x0c'
>>> " \t\n\r\v\f"
' \t\n\r\x0b\x0c'

The characters given in backslashed hex (\xnn) format represent the vertical tab and formfeed characters. The space character is in there as itself. It may be tempting to change the value of this variable, to attempt to affect how strip and so forth work, but don’t do it. Such an action isn’t guaranteed to give you the results you’re looking for.

But you can change which characters strip, rstrip, and lstrip remove by passing a string containing the characters to be removed as an extra parameter :

(Note that strip removes any and all of the characters in the extra parameter string, no matter in which order they occur.)
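The listing for the extra parameter is not reproduced in these notes. A short sketch, using an arbitrary domain name as the sample string :

```python
x = "www.python.org"

# The argument is a SET of characters to remove from the ends, not a prefix
# or suffix; stripping stops at the first character that is not in the set.
only_w = x.strip("w")
only_gor = x.strip("gor")
mixed = x.strip(".gorw")

print(only_w)     # .python.org
print(only_gor)   # www.python.
print(mixed)      # python
```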

The most common use for these functions is as a quick way of cleaning up strings that have just been read in. This is particularly helpful when you’re reading lines from files (discussed in chapter 13), because Python always reads in an entire line, including the trailing newline, if it exists. When you get around to processing the line read in, you typically don’t want the trailing newline. rstrip is a convenient way to get rid of it.

- String searching
The string objects provide a number of methods to perform simple string searches. Before I describe them, though, let’s talk about another module in Python: re. (This module will be discussed in depth in chapter 17, “Regular expressions.”)

The four basic string-searching methods are all similar: find, rfind, index, and rindex. A related method, count, counts how many times a substring can be found in another string. We’ll describe find in detail and then examine how the other methods differ from it.

find takes one required argument: the substring being searched for. find returns the position of the first character of the first instance of substring in the string object, or –1 if substring doesn’t occur in the string :

>>> x = "Mississippi"
>>> x.find("ss") # Indexing starts at 0.
2
>>> x.find("zz") # Returns -1 if the substring is not found.
-1

find can also take one or two additional, optional arguments. The first of these, if present, is an integer start; it causes find to ignore all characters before position start in string when searching for substring. The second optional argument, if present, is an integer end; it causes find to ignore characters at or after position end in string :

>>> x = "Mississippi"
>>> x.find("ss", 3)
5
>>> x.find("ss", 0, 3) # Searches positions 0-2 only; position 3 is excluded.
-1

rfind is almost the same as find, except that it starts its search at the end of string and so returns the position of the first character of the last occurrence of substring :

>>> x = "Mississippi"
>>> x.rfind("ss")
5

rfind can also take one or two optional arguments, with the same meanings as those for find.

index and rindex are identical to find and rfind, respectively, except for one difference: if index or rindex fails to find an occurrence of substring in string, it doesn’t return –1 but rather raises a ValueError exception.
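The difference between the two families can be sketched as follows, reusing the "Mississippi" string from the earlier examples :

```python
x = "Mississippi"

# When the substring is present, index and rindex agree with find and rfind.
first = x.index("ss")
last = x.rindex("ss")
print(first, last)   # 2 5

# When it is absent, find returns -1 ...
missing_find = x.find("zz")
print(missing_find)  # -1

# ... whereas index raises ValueError, which has to be caught.
try:
    x.index("zz")
    index_raised = False
except ValueError:
    index_raised = True

print(index_raised)  # True
```

The -1 from find is easy to misuse as a valid index (it refers to the last character), so index is the safer choice when a missing substring should be treated as an error.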

count is used identically to any of the previous four functions but returns the number of non-overlapping times the given substring occurs in the given string :

>>> x = "Mississsippi"
>>> x.count("ss") # "sss" contributes only one non-overlapping match of "ss".
2

You can use two other string methods to search strings: startswith and endswith. These methods return a True or False result depending on whether the string they’re used on starts or ends with one of the strings given as parameters :

>>> x = "Mississippi"
>>> x.startswith("Miss")
True
>>> x.startswith("Mist")
False
>>> x.endswith("pi")
True
>>> x.endswith("p")
False

Both startswith and endswith can look for more than one string at a time. If the parameter is a tuple of strings, both methods check for all of the strings in the tuple and return a True if any one of them is found :

>>> x.endswith(("u", "i"))
True

- Modifying strings
Strings are immutable, but string objects have a number of methods that can operate on that string and return a new string that’s a modified version of the original string. This provides much the same effect as direct modification for most purposes. You can find a more complete description of these methods in the documentation.

You can use the replace method to replace occurrences of substring (its first argument) in the string with newstring (its second argument). It also takes an optional third argument (see the documentation for details) :

>>> x = "Mississippi"
>>> x.replace("ss", "+++")
'Mi+++i+++ippi'
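The optional third argument to replace is a maximum number of replacements, applied from the left. A brief sketch :

```python
x = "Mississippi"

# With a count of 1, only the leftmost occurrence is replaced.
first_only = x.replace("ss", "+++", 1)
print(first_only)   # Mi+++issippi

# Without the count, every occurrence is replaced.
every = x.replace("ss", "+++")
print(every)        # Mi+++i+++ippi

# The original string is untouched in both cases.
print(x)            # Mississippi
```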

As with the string search functions, the re module provides a much more powerful method of substring replacement.

The methods str.maketrans and str.translate may be used together to translate characters in strings into different characters. Although rarely used, these methods can simplify your life when they’re needed.

Let’s say, for example, that you’re working on a program that translates string expressions from one computer language into another. The first language uses ~ to mean logical not, whereas the second language uses !; the first language uses ^ to mean logical and, whereas the second language uses &; the first language uses ( and ), where the second language uses [ and ]. In a given string expression, you need to change all instances of ~ to !, all instances of ^ to &, all instances of ( to [, and all instances of ) to ]. You could do this using multiple invocations of replace, but an easier and more efficient way is :

>>> x = "~x ^ (y % z)"
>>> table = x.maketrans("~^()", "!&[]")
>>> x.translate(table)
'!x & [y % z]'

(In Python 3, you can give an optional third argument to maketrans, to specify characters that should be removed from the string entirely. See the documentation for details.)
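In Python 3, removal of characters is requested through a third argument to str.maketrans. As a sketch, extending the example above so that spaces are also dropped (removing spaces is my own addition for illustration) :

```python
x = "~x ^ (y % z)"

# First two arguments: characters to map from and to, position by position.
# Third argument: characters to delete from the result.
table = str.maketrans("~^()", "!&[]", " ")

translated = x.translate(table)
print(translated)   # !x&[y%z]
```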

Other string methods perform more specialized tasks. lower converts all alphabetic characters in a string to lowercase, and upper does the opposite. capitalize capitalizes the first character of a string, and title capitalizes all words in a string. swapcase converts lowercase characters to uppercase and uppercase to lowercase in the same string. expandtabs gets rid of tab characters in a string by replacing each tab with a specified number of spaces. ljust, rjust, and center pad a string with spaces, to justify it in a certain field width. zfill left-pads a numeric string with zeros. Refer to the documentation for details of these methods.
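A quick tour of these methods, as a sketch with arbitrary sample strings; repr() is used for the padded results so that the added spaces are visible :

```python
s = "hello, World"

lowered = s.lower()
uppered = s.upper()
capitalized = s.capitalize()   # also lowercases the rest of the string
titled = s.title()
swapped = s.swapcase()

print(lowered)       # hello, world
print(uppered)       # HELLO, WORLD
print(capitalized)   # Hello, world
print(titled)        # Hello, World
print(swapped)       # HELLO, wORLD

# Tabs are expanded to the next multiple of the given tab size.
expanded = "a\tb".expandtabs(4)
print(repr(expanded))            # 'a   b'

# Padding to a field width of 5 (or 6 for center).
print(repr("ab".ljust(5)))       # 'ab   '
print(repr("ab".rjust(5)))       # '   ab'
print(repr("ab".center(6)))      # '  ab  '
print("42".zfill(5))             # 00042
```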

- Modifying strings with list manipulations
Because strings are immutable objects, there’s no way to directly manipulate them in the same way you can lists. Although the operations that operate on strings to produce new strings (leaving the original strings unchanged) are useful for many things, sometimes you want to be able to manipulate a string as if it were a list of characters. In that case, just turn it into a list of characters, do whatever you want, and turn the resulting list back into a string:
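The original listing is missing here. A minimal sketch of the round trip (string to list, modify, list back to string), with operations chosen only for illustration :

```python
text = "Hello, World"

# Turn the string into a list of one-character strings.
chars = list(text)

# Lists are mutable, so in-place operations are now allowed.
chars.append("!")
chars.reverse()

# Join the characters back together with an empty separator.
result = "".join(chars)

print(result)   # !dlroW ,olleH
print(text)     # Hello, World (the original string is unchanged)
```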

Although you can use split to turn your string into a list of characters, the type-conversion function list() is easier to use and to remember (and, for what it’s worth, you can turn a string into a tuple of characters using the built-in tuple() function). To turn the list back into a string, use "".join.

You shouldn’t go overboard with this method because it causes the creation and destruction of new string objects, which is relatively expensive. Processing hundreds or thousands of strings in this manner probably won’t have much of an impact on your program. Processing millions probably will.

- Useful methods and constants
String objects also have several useful methods that report qualities of the string, such as whether it consists of digits or alphabetic characters, or is all uppercase or lowercase :

>>> x = "123"
>>> x.isdigit()
True
>>> x.isalpha()
False
>>> x = "M"
>>> x.islower()
False
>>> x.isupper()
True

Finally, the string module defines some useful constants. You’ve already seen string.whitespace, which is a string made up of the characters Python thinks of as whitespace. string.digits is the string '0123456789'. string.hexdigits includes all the characters in string.digits, as well as 'abcdefABCDEF', the extra characters used in hexadecimal numbers. string.octdigits contains '01234567', just those digits used in octal numbers. In Python 3, string.ascii_lowercase contains all lowercase ASCII letters; string.ascii_uppercase contains all uppercase ASCII letters; string.ascii_letters contains all of the characters in string.ascii_lowercase and string.ascii_uppercase. (The older names string.lowercase, string.uppercase, and string.letters exist only in Python 2.) You might be tempted to try assigning to these constants to change the behavior of the language. Python would let you get away with this, but it would probably be a bad idea.

Remember that strings are sequences of characters, so you can use the convenient Python in operator to test for a character’s membership in any of these strings, although usually the existing string methods are simpler and easier. The most common string operations are shown in table 6.2 :

(Note that these methods don’t change the string itself but return either a location in the string or a new string.)
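As a sketch of the membership test mentioned above, using the Python 3 names of the string module constants (the sample inputs are arbitrary) :

```python
import string

print(string.digits)            # 0123456789
print(string.octdigits)         # 01234567
print(string.ascii_lowercase)   # abcdefghijklmnopqrstuvwxyz

# The in operator tests whether a character belongs to a constant.
f_is_hex = "f" in string.hexdigits
g_is_hex = "g" in string.hexdigits
eight_is_octal = "8" in string.octdigits

print(f_is_hex)         # True
print(g_is_hex)         # False
print(eight_is_octal)   # False

# Checking a whole string one character at a time.
all_hex = all(ch in string.hexdigits for ch in "DeadBeef")
print(all_hex)          # True
```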

Supplement :
* [Quick Python] 6. Strings - Part 2
* Python v2.7.2 documentation > String Services
* Tutorials point > Python - Strings

Thursday, February 2, 2012

[Quick Python] 6. Strings - Part 1
