程式扎記: [Quick Python] 13. Reading and writing files

2012年2月15日 星期三

Preface :
Probably the single most common thing you’ll want to do with files is open and read them. This chapter covers :
* Opening files and file objects
* Closing files
* Opening files in different modes
* Reading and writing text or binary data
* Redirecting screen input/output
* Using the struct module
* Pickling objects into files
* Shelving objects

Opening files and file objects :
In Python, you open and read a file using the built-in open() function and various built-in reading operations. The following short Python program reads in one line from a text file named myfile :
file_object = open('myfile', 'r')
line = file_object.readline()

open() doesn’t read anything from the file; instead it returns an object called a file object that you can use to access the opened file. A file object keeps track of a file and how much of the file has been read or written. All Python file I/O is done using file objects rather than filenames.

The first call to readline() returns the first line in the file object, everything up to and including the first newline character or the entire file if there is no newline character in the file; the next call to readline() would return the second line, and so on. The first argument to the open() function is a pathname. In the previous example, we’re opening what we expect to be an existing file in the current working directory. The following opens a file at the given absolute location :
import os
file_name = os.path.join("c:\\", "My Documents", "test", "myfile")  # "c:" alone would give a drive-relative path
file_object = open(file_name, 'r')

Closing files :
After all data has been read from or written to a file object, it should be closed. Closing a file object frees up system resources, allows the underlying file to be read or written to by other code, and, in general, makes the program more reliable. For small scripts, not closing a file object generally doesn’t have much of an effect; file objects are automatically closed when the script or program terminates. For larger programs, too many open file objects may exhaust system resources, causing the program to abort. You close file objects using the close() method, after the file object is no longer needed. The earlier short program then becomes this :
file_object = open("myfile", 'r')
line = file_object.readline()
# . . . any further reading on the file_object . . .
file_object.close()
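
The same cleanup can also be left to a with statement, which closes the file when the block ends, even if an exception is raised inside it. The following is a minimal sketch; the temporary directory and the sample text are assumptions made only so the example is self-contained :

```python
import os
import tempfile

# Hypothetical scratch file so the sketch does not touch real data.
path = os.path.join(tempfile.mkdtemp(), "myfile")
with open(path, "w") as file_object:
    file_object.write("first line\nsecond line\n")

# The with block closes the file on exit, even if reading raises.
with open(path, "r") as file_object:
    line = file_object.readline()

print(repr(line))          # 'first line\n'
print(file_object.closed)  # True
```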

Opening files in write or other modes :
The second argument of the open() function is a string denoting how the file should be opened. 'r' means open the file for reading, 'w' means open the file for writing (any data already in the file will be erased), and 'a' means open the file for appending (new data will be appended to the end of any data already in the file). If you want to open the file for reading, you can leave out the second argument; 'r' is the default. The following short program writes "Hello, World" to a file :
file_object = open("myfile", 'w')
file_object.write("Hello, World\n")
file_object.close()
file_object.close()

Depending on the operating system, open may also have access to additional file modes. These aren’t necessary for most purposes. As you write more advanced Python programs, you may wish to consult the Python reference manuals for details.

As well, open() can take an optional third argument, which defines how reads or writes for that file are buffered. Buffering is the process of holding data in memory until enough has been requested or written to justify the time cost of doing a disk access. Other parameters to open() control the encoding for text files and the handling of newline characters in text files. Again, these features aren’t something you typically need to worry about, but as you become more advanced in your use of Python, you may wish to read up on them.
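
As a rough illustration of the modes and of the encoding parameter mentioned above, the sketch below writes a file with 'w', extends it with 'a', and reads it back with the default mode. The scratch path and the choice of UTF-8 are assumptions for the example :

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "myfile")  # hypothetical scratch file

# 'w' truncates (or creates) the file; an explicit encoding avoids
# depending on the platform default.
with open(path, "w", encoding="utf-8") as f:
    f.write("Hello, World\n")

# 'a' keeps the existing data and adds to the end.
with open(path, "a", encoding="utf-8") as f:
    f.write("Goodbye, World\n")

with open(path, encoding="utf-8") as f:  # mode defaults to 'r'
    contents = f.read()

print(contents, end="")
```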

Functions to read and write text or binary data :
I’ve presented the most common text file–reading function, readline(). It reads and returns a single line from a file object, including any newline character on the end of the line. If there is nothing more to be read from the file, readline returns an empty string. This makes it easy to, for example, count the number of lines in a file :
file_object = open("myfile", 'r')
count = 0
while file_object.readline() != "":
    count = count + 1
print(count)
file_object.close()
For this particular problem, an even shorter way of counting all the lines is to use the built-in readlines() method, which reads all the lines in a file and returns them as a list of strings, one string per line (with trailing newlines still included) :
file_object = open("myfile", 'r')
print(len(file_object.readlines()))
file_object.close()

Of course, if you happen to be counting all the lines in a huge file, this may cause your computer to run out of memory, because it reads the entire file into memory at once. It’s also possible to overflow memory with readline() if you have the misfortune to try to read a line from a huge file that contains no newline characters, although this is highly unlikely. To handle such circumstances, both readline() and readlines() can take an optional argument affecting the amount of data they read at any one time. See the Python reference documentation for details.
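
A small sketch of those optional arguments, assuming a made-up three-line file: readline(size) returns at most size characters of the current line, and readlines(hint) stops reading further lines once roughly hint characters have been collected :

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "myfile")  # hypothetical scratch file
with open(path, "w") as f:
    f.write("abcdefghij\nsecond\nthird\n")

with open(path) as f:
    piece = f.readline(4)        # at most 4 characters of the current line
    rest = f.readline()          # the remainder of that same line
    some_lines = f.readlines(1)  # stops after the first line that reaches the hint

print(repr(piece), repr(rest), some_lines)
```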

Another way to iterate over all of the lines in a file is to treat the file object as an iterator in a for loop :
file_object = open("myfile", 'r')
count = 0
for line in file_object:
    count = count + 1
print(count)
file_object.close()
This method has the advantage that the lines are read into memory as needed, so even with large files, running out of memory isn’t a concern. The other advantage of this method is that it’s simpler and easier to read.

On some occasions, you may wish to read all the data in a file into a single bytes object, especially if the data isn’t text and you want it all in memory so you can treat it as a byte sequence. Or you may wish to read data from a file as bytes objects of a fixed size; for example, you may be reading data without explicit newlines, where each record is assumed to be a fixed number of bytes. To do this, use the read() method on a file opened in binary mode. Without an argument, it reads the rest of the file and returns that data as a bytes object. With a single integer argument, it reads up to that number of bytes (fewer if the file doesn’t hold enough data to satisfy the request) and returns them as a bytes object. On a file opened in text mode, read() returns a string instead, and the argument counts characters :
input_file = open("myfile", 'rb')
header = input_file.read(4)
data = input_file.read()
input_file.close()

The first line opens a file for reading in binary mode, the second line reads the first four bytes as a header string, and the third line reads the rest of the file as a single piece of data.
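
Reading fixed-size pieces in a loop follows the same pattern. The sketch below assumes 4-byte records and a made-up 14-byte file, so the last read comes back short :

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "myfile")  # hypothetical scratch file
with open(path, "wb") as f:
    f.write(b"AAAABBBBCCCCDD")  # 14 bytes: three full records and a short one

records = []
with open(path, "rb") as input_file:
    while True:
        chunk = input_file.read(4)  # up to 4 bytes per call
        if not chunk:               # b'' signals end of file
            break
        records.append(chunk)

print(records)  # [b'AAAA', b'BBBB', b'CCCC', b'DD']
```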

A possible problem with the read() method arises from the newline translation that happens when you use open() in text mode, that is, without adding a b to the mode. By default, text mode reads with universal newlines: "\r\n" pairs (the Windows convention) and lone "\r" characters (the classic Macintosh convention) are both converted to "\n". You can control the treatment of newline characters with the newline parameter of open(): specifying newline="\n", "\r", or "\r\n" makes only that string count as a line ending when reading, and the line endings are returned untranslated. If the file has been opened in binary mode, the newline parameter isn’t needed (and isn’t accepted); all bytes are returned exactly as they are in the file :
input_file = open("myfile", newline="\n")

This forces only "\n" to be considered a newline.
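
To make the difference concrete, here is a minimal sketch that writes mixed line endings in binary mode and reads them back three ways. The file contents are made up for illustration :

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "myfile")  # hypothetical scratch file
with open(path, "wb") as f:  # binary mode: bytes are written verbatim
    f.write(b"one\r\ntwo\rthree\n")

with open(path) as f:  # default text mode: universal newlines
    translated = f.read()

with open(path, newline="\n") as f:  # only "\n" ends a line, nothing is translated
    untranslated = f.readlines()

with open(path, "rb") as f:  # binary mode: no translation at all
    raw = f.read()

print(repr(translated))  # 'one\ntwo\nthree\n'
print(untranslated)      # ['one\r\n', 'two\rthree\n']
print(raw)               # b'one\r\ntwo\rthree\n'
```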

The converses of the readline() and readlines() methods are the write() and writelines() methods. Note that there is no writeline() function. write() writes a single string, which can span multiple lines if newline characters are embedded within the string; for example, something like :
myfile.write("Hello")

write() doesn’t write out a newline after it writes its argument; if you want a newline in the output, you must put it there yourself. If you open a file in text mode (using 'w'), any '\n' characters are mapped to the platform-specific line ending (os.linesep, which is '\r\n' on Windows and '\n' on Linux and macOS). Again, opening the file with a specified newline will avoid this.

writelines() is something of a misnomer; it doesn’t necessarily write lines. It takes a list of strings as an argument and writes them, one after the other, to the given file object, without writing newlines. If the strings in the list end with newlines, they’re written as lines; otherwise, they’re effectively concatenated together in the file. But writelines() is a precise inverse of readlines() in that it can be used on the list returned by readlines() to write a file identical to the file readlines() read from. For example, assuming myfile.txt exists and is a text file, this bit of code will create an exact copy of myfile.txt called myfile2.txt :
input_file = open("myfile.txt", 'r')
lines = input_file.readlines()
input_file.close()
output = open("myfile2.txt", 'w')
output.writelines(lines)
output.close()
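
The point about missing newlines is easy to see with strings that don’t end in "\n". This is a small sketch with made-up names and a scratch file :

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "myfile2.txt")  # hypothetical scratch file
names = ["alpha", "beta", "gamma"]

with open(path, "w") as output:
    output.writelines(names)  # no newlines are added
with open(path) as f:
    joined = f.read()

with open(path, "w") as output:
    output.writelines(name + "\n" for name in names)  # any iterable of strings works
with open(path) as f:
    lines = f.readlines()

print(joined)  # alphabetagamma
print(lines)   # ['alpha\n', 'beta\n', 'gamma\n']
```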

- Using binary mode
Sometimes, you want to access the data in a file as a string of bytes, with no translation or text encoding. To do this, you need to open files in binary mode, which will return a bytes object, instead of a string when reading. To open the file in binary mode, use the 'b' (binary) argument with the mode—open("file", 'rb') or open("file", 'wb') :
input_file = open("myfile", 'rb')
header = input_file.read(4)
data = input_file.read()
input_file.close()

This example opens a file for reading, reads the first four bytes as a header string, and then reads the rest of the file as a single piece of data.

Keep in mind that files open in binary mode deal only in bytes, not strings. To use the data as strings, you must decode any bytes objects to string objects. This is often an important point in dealing with network protocols, where streams of data appear as files but are in binary mode as a rule.
>>> str = "string"
>>> bts = str.encode() # Return bytes object
>>> bts
b'string'
>>> bts.decode() # Decode the bytes into string object
'string'
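
With non-ASCII text, the codec matters. encode() and decode() default to UTF-8 in Python 3, but naming the encoding makes the intent explicit. A short sketch with a made-up sample string :

```python
text = "naïve café"
data = text.encode("utf-8")  # str -> bytes
print(len(text), len(data))  # 10 characters, 12 bytes

restored = data.decode("utf-8")  # bytes -> str

# Decoding with the wrong codec does not necessarily fail;
# here it silently produces different text.
wrong = data.decode("latin-1")
print(restored == text, wrong == text)  # True False
```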

Screen input/output and redirection :
You can use the built-in input() function to prompt for and read an input string :
>>> x = input("Enter file name to use: ")
Enter file name to use: myfile
>>> x
'myfile'

The prompt line is optional, and the newline at the end of the input line is stripped off.

To read in numbers using input(), you need to explicitly convert the string that input returns to the correct number type. The following example uses int() :
>>> x = int(input("enter your number: "))
enter your number: 39
>>> x
39

input() writes its prompt to the standard output and reads from the standard input. Lower-level access to these and standard error can be had using the sys module, which has sys.stdin, sys.stdout, and sys.stderr attributes. These can be treated as specialized file objects.

For sys.stdin, you have read(), readline(), and readlines() methods. For sys.stdout and sys.stderr, you can use the standard print() function as well as the write() and writelines() methods, which operate as they do for other file objects.
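
A minimal sketch of the output side follows. The strings are made up, and sys.stdin is left out so the example doesn’t wait for input :

```python
import sys

sys.stdout.write("Write to the standard output.\n")
n = sys.stdout.write("no newline is added")  # returns the number of characters written
sys.stdout.write("\n")

print("A line sent to standard error.", file=sys.stderr)
sys.stderr.writelines(["first\n", "second\n"])
```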


You can redirect standard input to read from a file. Similarly, standard output or standard error can be set to write to files. They can also be subsequently programmatically restored to their original values using sys.__stdin__, sys.__stdout__, and sys.__stderr__.
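
The redirection can be sketched as follows. The file name is an assumption, and the sketch restores the previously saved sys.stdout rather than sys.__stdout__, because some environments (IDEs, test runners) have already replaced standard output before your code runs :

```python
import os
import sys
import tempfile

path = os.path.join(tempfile.mkdtemp(), "outfile.txt")  # hypothetical scratch file

saved_stdout = sys.stdout  # remember whatever stdout currently is
f = open(path, "w")
sys.stdout = f
try:
    print("A first line.")   # goes to the file, not the screen
    print("A second line.")
finally:
    sys.stdout = saved_stdout  # sys.__stdout__ holds the interpreter's original
    f.close()

with open(path) as check:
    captured = check.read()
print(captured, end="")
```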


The print() function also can be redirected to any file without changing standard output :
>>> import sys
>>> f = open("outfile.txt", 'w')
>>> print("A first line.\n", "A second line.\n", file=f)
>>> 3 + 4
7
>>> f.close()

You’d normally use this when you’re running from a script or program. But if you’re using the interactive mode on Windows, you may want to temporarily redirect standard output in order to capture what might otherwise scroll off the screen. The short module in listing 13.1 implements a set of functions that provide this capability.
- mio.py :
"""mio: module, (contains functions capture_output, restore_output,
print_file, and clear_file )"""
import sys

_file_object = None

def capture_output(file="capture_file.txt"):
    """capture_output(file='capture_file.txt'): redirect the standard
    output to 'file'."""
    global _file_object
    print("output will be sent to file: {0}".format(file))
    print("restore to normal by calling 'mio.restore_output()'")
    _file_object = open(file, 'w')
    sys.stdout = _file_object

def restore_output():
    """restore_output(): restore the standard output back to the
    default (also closes the capture file)"""
    global _file_object
    sys.stdout = sys.__stdout__
    _file_object.close()
    print("standard output has been restored back to normal")

def print_file(file="capture_file.txt"):
    """print_file(file="capture_file.txt"): print the given file to the
    standard output"""
    f = open(file, 'r')
    print(f.read())
    f.close()

def clear_file(file="capture_file.txt"):
    """clear_file(file="capture_file.txt"): clears the contents of the
    given file"""
    f = open(file, 'w')
    f.close()

Here, capture_output() redirects standard output to a file that defaults to "capture_file.txt". The function restore_output() restores standard output to the default. Also, print_file() prints this file to the standard output, and clear_file() clears its current contents.
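
For a short-lived capture, the standard library’s contextlib.redirect_stdout (Python 3.4 and later) is an alternative that isn’t part of the listing above. It restores standard output automatically when the block ends. A minimal sketch that captures into an in-memory buffer instead of a file :

```python
import contextlib
import io

buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    print("captured line")  # goes into the buffer
    print(3 + 4)

captured = buffer.getvalue()
print("back on the normal standard output")
print(repr(captured))  # 'captured line\n7\n'
```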

Reading structured binary data with the struct module :
Generally speaking, when working with your own files, you probably don’t want to read or write binary data in Python. For very simple storage needs, it’s usually best to use textual input and output as described earlier. For more sophisticated applications, Python provides the ability to easily read or write arbitrary Python objects (pickling, described later in this chapter). This ability is much less error prone than directly writing and reading your own binary data and is highly recommended.

But there’s at least one situation in which you’ll likely need to know how to read or write binary data, and that’s when you’re dealing with files that are generated or used by other programs. This section gives a short description of how to do this using the struct module. Refer to the Python reference documentation for more details.

As you’ve seen, Python supports explicit binary input or output by using bytes instead of strings if you open the file in binary mode. But because most binary files rely on a particular structure to help parse the values, writing your own code to read and split them into variables correctly is often more work than it’s worth. Instead, you can use the standard struct module to permit you to treat those strings as formatted byte sequences with some specific meaning.

Assume that we wish to read in a binary file called data, containing a series of records generated by a C program. Each record consists of a C short integer, a C double float, and a sequence of four characters that should be taken as a four-character string. We wish to read this data into a Python list of tuples, with each tuple containing an integer, a floating-point number, and a string.

The first thing to do is to define a format string understandable to the struct module, which tells the module how the data in one of our records is packed. The format string uses characters meaningful to struct to indicate what type of data is expected where in a record. For example, the character 'h' indicates the presence of a single C short integer, and the character 'd' indicates the presence of a single C double-precision floating-point number. Not surprisingly, 's' indicates the presence of a string and may be preceded by an integer to indicate the length of the string; '4s' indicates a string consisting of four characters. For our records, the appropriate format string is therefore 'hd4s'. struct understands a wide range of numeric, character, and string formats. See the Python Library Reference for details.

Before we start reading records from our file, we need to know how many bytes to read at a time. Fortunately, struct includes a calcsize() function, which takes our format string as an argument and returns the number of bytes used to contain data in such a format.

To read each record, we’ll use the read() method described previously. Then, the struct.unpack() function conveniently returns a tuple of values by parsing a read record according to our format string. The program to read our binary data file is remarkably simple :
import struct

record_format = 'hd4s'
record_size = struct.calcsize(record_format)
result_list = []
input_file = open("data", 'rb')
while True:
    record = input_file.read(record_size)  # Read in a single record
    if not record:  # (1) read() returns b'' at end of file
        input_file.close()
        break
    # Unpack the record into a tuple and append it to the result
    result_list.append(struct.unpack(record_format, record))
If the record is empty, we’re at end of file, so we quit the loop (1). Note that there is no checking for file consistency. But if the last record is an odd size, the struct.unpack() function raises an error.
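
To try the loop without a C program at hand, you can write a few records with struct.pack() first. The sketch below uses made-up values and a scratch file, stops on a short final record instead of letting struct.unpack() raise, and decodes the four bytes to a string as the text describes :

```python
import os
import struct
import tempfile

record_format = "hd4s"
record_size = struct.calcsize(record_format)
path = os.path.join(tempfile.mkdtemp(), "data")  # hypothetical scratch file

# Write two sample records (made-up values standing in for the C program's output).
with open(path, "wb") as out:
    out.write(struct.pack(record_format, 7, 3.14, b"john"))
    out.write(struct.pack(record_format, 8, 2.5, b"mary"))

result_list = []
with open(path, "rb") as data_file:
    while True:
        record = data_file.read(record_size)
        if len(record) < record_size:  # empty at end of file, or a truncated tail
            break
        number, value, name = struct.unpack(record_format, record)
        result_list.append((number, value, name.decode("ascii")))

print(result_list)  # [(7, 3.14, 'john'), (8, 2.5, 'mary')]
```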

As you may already have guessed, struct also provides the ability to take Python values and convert them into packed byte sequences. This is accomplished through the struct.pack() function, which is almost, but not quite, an inverse of struct.unpack(). The "almost" comes from the fact that whereas struct.unpack() returns a tuple of Python values, struct.pack() doesn’t take a tuple of Python values; rather, it takes a format string as its first argument and then enough additional arguments to satisfy the format string. To produce a binary record of the form used in the previous example, we might do something like this :
>>> import struct
>>> record_format = 'hd4s'
>>> struct.pack(record_format, 7, 3.14, 'john'.encode())
b'\x07\x00\x00\x00\x00\x00\x00\x00\x1f\x85\xebQ\xb8\x1e\t@john'

struct gets even better than this; you can insert other special characters into the format string to indicate that data should be read/written in big-endian, little-endian, or machine-native-endian format (default is machine-native) and to indicate that things like C short integer should be sized either as native to the machine (the default) or as standard C sizes. If you need these features, it’s nice to know they exist.
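
As a brief sketch of those prefix characters: '<' selects little-endian and '>' big-endian byte order, both with standard sizes and no alignment padding, whereas the unprefixed format uses the machine’s native order, sizes, and alignment. The values are the same made-up record as before :

```python
import struct

record_format = "hd4s"
values = (7, 3.14, b"john")

native = struct.pack(record_format, *values)        # native order, sizes, alignment
little = struct.pack("<" + record_format, *values)  # little-endian, standard sizes
big = struct.pack(">" + record_format, *values)     # big-endian, standard sizes

print(struct.calcsize(record_format))        # platform dependent (padding after the short)
print(struct.calcsize("<" + record_format))  # 14: 2 + 8 + 4, no padding
print(little[:2], big[:2])                   # b'\x07\x00' b'\x00\x07'
```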

Pickling objects into files :
Python can write almost any data structure into a file and read that data structure back out of a file and re-create it, with just a few commands. This is an unusual ability but one that’s highly useful. It can save you many pages of code that do nothing but dump the state of a program into a file (and can save a similar amount of code that does nothing but read that state back in).

Python provides this ability via the pickle module. Pickling is powerful but simple to use. For example, assume that the entire state of a program is held in three variables: a, b, and c. We can save this state to a file called state as follows :
import pickle
.
.
.
file = open("state", 'wb')  # pickle writes bytes, so the file must be opened in binary mode
pickle.dump(a, file)
pickle.dump(b, file)
pickle.dump(c, file)
file.close()

It doesn’t matter what was stored in a, b, and c. It might be as simple as numbers or as complex as a list of dictionaries containing instances of user-defined classes. pickle.dump() will save everything.

Now, to read that data back in on a later run of the program, just write :
import pickle
file = open("state", 'rb')  # binary mode again, to match how the data was written
a = pickle.load(file)
b = pickle.load(file)
c = pickle.load(file)
file.close()

Any data that was previously in the variables a, b, or c is restored to them by pickle.load().

The pickle module can store almost anything in this manner. It can handle lists, tuples, numbers, strings, dictionaries, and just about anything made up of these types of objects, which includes all class instances. It also handles shared objects, cyclic references, and other complex memory structures correctly, storing shared objects only once and restoring them as shared objects, not as identical copies. But code objects (what Python uses to store byte-compiled code) and system resources (like files or sockets) can’t be pickled.
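
The handling of shared objects and cycles can be checked with a round trip through pickle.dumps() and pickle.loads(), the in-memory counterparts of dump() and load(). The data here is made up for illustration :

```python
import pickle

shared = [1, 2, 3]
state = {"first": shared, "second": shared}  # both keys refer to one list

restored = pickle.loads(pickle.dumps(state))  # round trip through bytes in memory

print(restored["first"] is restored["second"])  # True: sharing is preserved
print(restored["first"] is shared)              # False: it is a new object

cyclic = []
cyclic.append(cyclic)  # a list that contains itself
copy = pickle.loads(pickle.dumps(cyclic))
print(copy[0] is copy)  # True
```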


More often than not, you won’t want to save your entire program state with pickle. For example, most applications can have multiple documents open at one time. If you saved the entire state of the program, you would effectively save all open documents in one file. An easy and effective way of saving and restoring only data of interest is to write a save function that stores all data you wish to save into a dictionary and then uses pickle to save the dictionary. Then, you can use a complementary restore function to read the dictionary back in (again using pickle) and to assign the values in the dictionary to the appropriate program variables. This also has the advantage that there’s no possibility of reading values back in an incorrect order—that is, an order different from the order in which they were stored. Using this approach with the previous example, we get code looking something like this :
import pickle
.
.
.
def save_data():
    global a, b, c
    file = open("state", 'wb')  # pickle needs a binary-mode file
    data = {'a': a, 'b': b, 'c': c}
    pickle.dump(data, file)
    file.close()

def restore_data():
    global a, b, c
    file = open("state", 'rb')
    data = pickle.load(file)
    file.close()
    a = data['a']
    b = data['b']
    c = data['c']
    .
    .
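
A self-contained variant of the same idea, for trying it out, might look like the sketch below. Passing the path and the values as arguments instead of using globals is a choice made for this example, and the state file location and sample values are assumptions :

```python
import os
import pickle
import tempfile

state_path = os.path.join(tempfile.mkdtemp(), "state")  # hypothetical location

def save_data(path, **values):
    """Pickle the named values as one dictionary."""
    with open(path, "wb") as file:
        pickle.dump(values, file)

def restore_data(path):
    """Return the dictionary written by save_data()."""
    with open(path, "rb") as file:
        return pickle.load(file)

save_data(state_path, a=1, b=[2, 3], c={"four": 4})
data = restore_data(state_path)
a, b, c = data["a"], data["b"], data["c"]
print(a, b, c)  # 1 [2, 3] {'four': 4}
```
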
A real-life application is an extension of the cache example given in chapter 7, "Dictionaries." Recall that there, we were calling a function that performed a time-intensive calculation based on its three arguments. During the course of a program run, many of our calls to it ended up using the same set of arguments. We were able to obtain a significant performance improvement by caching the results in a dictionary, keyed by the arguments that produced them. But it was also the case that many different sessions of this program were being run many times over the course of days, weeks, and months. Therefore, by pickling the cache, we can keep from having to start over with every session. The sole.py listing below is a pared-down version of the module for doing this :
- sole.py :
"""sole module: contains function sole, save, show"""
import pickle

_sole_mem_cache_d = {}
_sole_disk_file_s = "solecache"
file = open(_sole_disk_file_s, 'rb')
_sole_mem_cache_d = pickle.load(file)
file.close()

def sole(m, n, t):
    """sole(m, n, t): perform the sole calculation using the cache."""
    global _sole_mem_cache_d
    if (m, n, t) in _sole_mem_cache_d:  # dict.has_key() was removed in Python 3
        return _sole_mem_cache_d[(m, n, t)]
    else:
        # . . . do some time-consuming calculations . . .
        _sole_mem_cache_d[(m, n, t)] = result
        return result

def save():
    """save(): save the updated cache to disk."""
    global _sole_mem_cache_d, _sole_disk_file_s
    file = open(_sole_disk_file_s, 'wb')
    pickle.dump(_sole_mem_cache_d, file)
    file.close()

def show():
    """show(): print the cache"""
    global _sole_mem_cache_d
    print(_sole_mem_cache_d)

Note that for production code, this is a situation where you probably would use an absolute pathname for your cache file. Also, concurrency isn’t being handled here. If two people run overlapping sessions, you’ll end up with only the additions of the last person to save. If this were an issue, you could limit this overlap window significantly by using the dictionary update() method in the save() function.
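
One way the update() idea could look is sketched below. load_cache() and the path-taking save() are hypothetical variants of the module’s functions, written for this illustration; the sketch also tolerates a cache file that doesn’t exist yet, which the listing above doesn’t. It narrows the overlap window but still provides no locking :

```python
import os
import pickle
import tempfile

cache_path = os.path.join(tempfile.mkdtemp(), "solecache")  # hypothetical path

def load_cache(path):
    """Return the pickled cache, or an empty dict if no cache file exists yet."""
    try:
        with open(path, "rb") as file:
            return pickle.load(file)
    except FileNotFoundError:
        return {}

def save(path, cache):
    """Merge entries saved by other sessions before writing."""
    on_disk = load_cache(path)
    on_disk.update(cache)  # our entries win on conflicting keys
    with open(path, "wb") as file:
        pickle.dump(on_disk, file)

# Simulate two sessions saving different results.
save(cache_path, {(1, 2, 3): "first session"})
save(cache_path, {(4, 5, 6): "second session"})
merged = load_cache(cache_path)
print(sorted(merged))  # [(1, 2, 3), (4, 5, 6)]
```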

Shelving objects :
This is a somewhat advanced topic but certainly not a difficult one. This section is likely of most interest to people whose work involves storing or accessing pieces of data in large files, because the Python shelve module does exactly that—it permits the reading or writing of pieces of data in large files, without reading or writing the entire file. For applications that perform many accesses of large files (such as database applications), the savings in time can be spectacular. Like the pickle module (which it uses), the shelve module is simple.

Let’s explore it through an address book. This sort of thing is usually small enough that an entire address file can be read in when the application is started and written out when the application is done. If you’re an extremely friendly sort of person, and your address book is too big for this, it would be better to use shelve and not worry about it.

We’ll assume that each entry in our address book consists of a tuple of three elements, giving the first name, phone number, and address of a person. Each entry will be indexed by the last name of the person the entry refers to. This is so simple that our application will be an interactive session with the Python shell.

First, import the shelve module and open the address book. shelve.open() creates the address book file if it doesn’t exist :
>>> import shelve
>>> book = shelve.open("addresses")

Now, add a couple of entries. Notice that we’re treating the object returned by shelve.open() as a dictionary (although it’s a dictionary that can use only strings as keys) :
>>> book['flintstone'] = ('fred', '555-1234', '1233 Bedrock Place')
>>> book['rubble'] = ('barney', '555-4321', '1235 Bedrock Place')

Finally, close the file and end the session :
>>> book.close()

So what? Well, in that same directory, start Python again, and open the same address book :
>>> import shelve
>>> book = shelve.open("addresses")

But now, instead of entering something, let’s see if what we put in before is still around :
>>> book['flintstone']
('fred', '555-1234', '1233 Bedrock Place')

The addresses file created by shelve.open() in the first interactive session has acted just like a persistent dictionary. The data we entered before was stored to disk, even though we did no explicit disk writes. That’s exactly what shelve does.

More generally, shelve.open() returns a shelf object that permits basic dictionary operations: key assignment or lookup, del, in, and the keys() method. But unlike a normal dictionary, shelf objects store their data on disk, not in memory. Unfortunately, shelf objects do have one significant restriction as compared to dictionaries: they can use only strings as keys, versus the wide range of key types allowable in dictionaries.

It’s important to understand the advantage shelf objects give you over dictionaries when dealing with large data sets. shelve.open() makes the file accessible; it doesn’t read an entire shelf object file into memory. File accesses are done only when needed, typically when an element is looked up, and the file structure is maintained in such a manner that lookups are very fast. Even if your data file is really large, only a couple of disk accesses will be required to locate the desired object in the file. This can improve your program in a number of ways. It may start faster, because it doesn’t need to read a potentially large file into memory. It may execute faster, because more memory is available to the rest of the program, and thus less code will need to be swapped out into virtual memory. You can operate on data sets that are otherwise too large to fit in memory.

There are a few restrictions when using the shelve module. As previously mentioned, shelf object keys can be only strings; but any Python object that can be pickled can be stored under a key in a shelf object. Also, shelf objects aren’t suitable for multiuser databases because they provide no control for concurrent access. Finally, make sure you close a shelf object when you’ve finished—this is sometimes required in order for changes you’ve made (entries or deletions) to be written back to disk.
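
Since a shelf should always be closed, it can help to open it in a with statement (supported since Python 3.4), which closes it when the block ends. The sketch below repeats the address-book session that way, using a scratch directory as an assumption, and exercises in, del, and keys() :

```python
import os
import shelve
import tempfile

path = os.path.join(tempfile.mkdtemp(), "addresses")  # hypothetical location

with shelve.open(path) as book:  # closed automatically on exit
    book["flintstone"] = ("fred", "555-1234", "1233 Bedrock Place")
    book["rubble"] = ("barney", "555-4321", "1235 Bedrock Place")

with shelve.open(path) as book:  # reopen: the entries come back from disk
    has_rubble = "rubble" in book
    del book["rubble"]
    names = sorted(book.keys())
    first_name = book["flintstone"][0]

print(has_rubble, names, first_name)  # True ['flintstone'] fred
```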

Supplement :
[ The python tutorial ] 7. Input and Output
This covers Python’s I/O operations, including printing messages to the console and working with files.

[ Python article collection ] Common file operation examples
