程式扎記: [Quick Python] 12. Using the filesystem

Monday, February 13, 2012

Preface : 
Working with files involves one of two things: basic I/O (described in chapter 13, "Reading and writing files") and working with the filesystem (for example, naming, creating, moving, or referring to files), which is a bit tricky, because different operating systems have different filesystem conventions. 

It would be easy enough to learn how to perform basic file I/O without learning all the features Python has provided to simplify cross-platform filesystem interaction — but I wouldn’t recommend it. Instead, read at least the first part of this chapter. This will give you the tools you need to refer to files in a manner that doesn’t depend on your particular operating system. Then, when you use the basic I/O operations, you can open the relevant files in this manner. 

This chapter covers : 
* Managing paths and pathnames
* Getting information about files
* Performing filesystem operations
* Processing all files in a directory subtree

Paths and pathnames : 
All operating systems refer to files and directories with strings naming a given file or directory. Strings used in this manner are usually called pathnames (or sometimes just paths), which is the word we’ll use for them. The fact that pathnames are strings introduces possible complications into working with them. Python does a good job of providing functions that help avoid these complications; but to make use of these Python functions effectively, you need an understanding of what the underlying problems are. This section discusses these details. 

Pathname semantics across different operating systems are very similar, because the filesystem on almost all operating systems is modeled as a tree structure, with a disk being the root, and folders, subfolders, and so forth being branches, subbranches, and so on. This means that most operating systems refer to a specific file in fundamentally the same manner: with a pathname that specifies the path to follow from the root of the filesystem tree (the disk) to the file in question. (This characterization of the root corresponding to a hard disk is an oversimplification. But it’s close enough to the truth to serve for this chapter.)

This pathname consists of a series of folders to descend into, in order to get to the desired file. Different operating systems have different conventions regarding the precise syntax of pathnames. For example, the character used to separate sequential file or directory names in a Linux/UNIX pathname is /, whereas the character used to separate file or directory names in a Windows pathname is \. In addition, the UNIX filesystem has a single root (which is referred to by having a / character as the very first character in a pathname), whereas the Windows filesystem has a separate root for each drive, labeled A:\, B:\, C:\, and so forth (with C: usually being the main drive). Because of these differences, files will have different pathname representations on different operating systems. For example, a file called C:\data\myfile in MS Windows might be called /data/myfile on UNIX and on the Macintosh. Python provides functions and constants that allow you to perform common pathname manipulations without worrying about such syntactic details. With a little care, you can write your Python programs in such a manner that they will run correctly no matter what the underlying filesystem happens to be. 

- Absolute and relative paths 
These operating systems allow two different types of pathnames. Absolute pathnames specify the exact location of a file in a filesystem, without any ambiguity; they do this by listing the entire path to that file, starting from the root of the filesystem. Relative pathnames specify the position of a file relative to some other point in the filesystem, and that other point isn’t specified in the relative pathname itself; instead, the absolute starting point for relative pathnames is provided by the context in which they’re used. 

As examples of this, here are two Windows absolute pathnames : 
C:\Program Files\Doom
A:\backup\June

And here are two Linux absolute pathnames and a Mac absolute pathname : 
/bin/Doom
/floppy/backup/June
/Applications/Utilities

The following are two Windows relative pathnames : 
mydata\project1\readme.txt
games\tetris

And these are two Linux/UNIX relative pathnames and one Mac relative pathname : 
mydata/project1/readme.txt
games/tetris
Utilities/Java

Relative paths need context to anchor them. This is typically provided in one of two ways. The simplest is to append the relative path to an existing absolute path, producing a new absolute path. For example, we might have a relative Windows path, Start Menu\Programs\Explorer, and an absolute path, C:\Documents and Settings\Administrator. By appending the two, we have a new absolute path, C:\Documents and Settings\Administrator\Start Menu\Programs\Explorer, which refers to a specific file in the filesystem. By appending the same relative path to a different absolute path (say, C:\Documents and Settings\kmcdonald), we produce a path that refers to the Explorer program in a different user’s (kmcdonald’s) Profiles directory. 

The second way in which relative paths may obtain a context is via an implicit reference to the current working directory, which is the particular directory where a Python program considers itself to be at any point during its execution. Python commands may implicitly make use of the current working directory when they’re given a relative path as an argument. For example, if you use the os.listdir(path) command with a relative path argument, the anchor for that relative path is the current working directory, and the result of the command is a list of the filenames in the directory whose path is formed by appending the current working directory with the relative path argument. 
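The two ways of anchoring a relative path can be sketched as follows. The relative path is made up for illustration, and the root of the current drive or filesystem is used as the explicit absolute path only because it exists everywhere.

```python
import os

# Hypothetical relative path, used only for illustration.
relative = os.path.join("mydata", "project1", "readme.txt")

# Way 1: append the relative path to an explicit absolute path.
base = os.path.abspath(os.sep)        # root of the current drive or filesystem
explicit = os.path.join(base, relative)

# Way 2: let Python anchor it at the current working directory.
implicit = os.path.abspath(relative)

print(explicit)
print(implicit)
print(implicit == os.path.join(os.getcwd(), relative))
```

Neither path has to exist; both calls only manipulate strings.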

- The current working directory 
Whenever you edit a document on a computer, you have a concept of where you are in that computer’s file structure because you’re in the same directory (folder) as the file you’re working on. Similarly, whenever Python is running, it has a concept of where in the directory structure it is at any moment. This is important because the program may ask for a list of files stored in the current directory. The directory that a Python program is in is called the current working directory for that program. This may be different from the directory the program resides in. 

To see this in action, start Python and use the os.getcwd() (get current working directory) command to find out what Python’s initial current working directory is : 
>>> import os
>>> os.getcwd()

Note that os.getcwd is used as a zero-argument function call, to emphasize the fact that the value it returns isn’t a constant but will change as you issue commands that change the value of the current working directory. (It will probably be either the directory the Python program itself resides in or the directory you were in when you started up Python.) On Windows machines, you’ll see extra backslashes inserted into the path. This is because Windows uses \ as its path separator, and in Python strings \ has a special meaning unless it’s itself backslashed. Now, type : 
>>> os.listdir(os.curdir)

The constant os.curdir holds whatever string your system happens to use as the same-directory indicator. On both UNIX and Windows, this is a single dot; but to keep your programs portable, you should always use os.curdir instead of typing just the dot. This string is a relative path, meaning that os.listdir() will append it to the path for the current working directory, giving the same path. This command returns a list of all of the files or folders inside the current working directory. Choose some folder (called folder here), and type : 
>>> os.chdir(folder)
>>> os.getcwd()

As you can see, Python moves into the folder specified as an argument of the os.chdir() function. Another call to os.listdir(os.curdir) would return a list of files in folder, because os.curdir would then be taken relative to the new current working directory. Many Python filesystem operations (discussed later in this chapter) use the current working directory in this manner. 
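Because os.chdir() changes state for the whole program, it can help to remember the starting directory and move back afterwards. The following is a minimal sketch that uses a throwaway temporary directory in place of folder, so it should run on any system.

```python
import os
import tempfile

original = os.getcwd()                    # remember where we started
with tempfile.TemporaryDirectory() as folder:
    os.chdir(folder)
    try:
        inside = os.getcwd()              # now the temporary directory
        listing = os.listdir(os.curdir)   # a fresh temporary directory is empty
    finally:
        os.chdir(original)                # always move back

print(inside)
print(listing)
print(os.getcwd() == original)
```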

- Manipulating pathnames 
Now that you have the background to understand file and directory pathnames, it’s time to look at the facilities Python provides for manipulating these pathnames. These facilities consist of a number of functions and constants in the os.path sub-module, which you can use to manipulate paths without explicitly using any operating system–specific syntax. Paths are still represented as strings, but you need never think of them or manipulate them as such. 

Let’s start out by constructing a few pathnames on different operating systems, using the os.path.join() function. Note that importing os is sufficient to bring in the os.path submodule also. There’s no need for an explicit import os.path statement. First, let’s start Python under Windows : 
>>> import os
>>> print(os.path.join('bin', 'utils', 'disktools'))
bin\utils\disktools

The os.path.join() function interprets its arguments as a series of directory names or filenames, which are to be joined to form a single string understandable as a relative path by the underlying operating system. In a Windows system, that means path component names should be joined together with backslashes, which is what was produced. Now, try the same thing in UNIX : 
>>> import os
>>> print(os.path.join('bin', 'utils', 'disktools'))
bin/utils/disktools

The result is the same path, but using the Linux/UNIX convention of forward slash separators rather than the Windows convention of backslash separators. In other words, os.path.join() lets you form file paths from a sequence of directory or filenames without any worry about the conventions of the underlying operating system. os.path.join() is the fundamental way by which file paths may be built in a manner that doesn’t constrain the operating systems on which your program will run. 

The arguments to os.path.join() need not be a single directory or filename; they may also be subpaths that are then joined together to make a longer pathname. The following example illustrates this in the Windows environment and is also a case where we find it necessary to use double backslashes in our strings. Note that we could enter the pathname with forward slashes (/) as well, because Windows accepts them as path separators too : 
>>> import os
>>> print(os.path.join('mydir\\bin', 'utils\\disktools\\chkdisk'))
mydir\bin\utils\disktools\chkdisk
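If you have only one operating system at hand, you can still compare both conventions: the standard library ships the Windows flavor of os.path as ntpath and the Linux/UNIX flavor as posixpath, and either can be imported anywhere. This is a small sketch for experimenting; ordinary programs should keep using os.path.

```python
import ntpath      # the Windows implementation of os.path
import posixpath   # the Linux/UNIX and macOS implementation of os.path

parts = ("bin", "utils", "disktools")

windows_style = ntpath.join(*parts)
unix_style = posixpath.join(*parts)

print(windows_style)   # bin\utils\disktools
print(unix_style)      # bin/utils/disktools
```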

The os.path.join() command also has some understanding of absolute versus relative pathnames. In Linux/UNIX, an absolute path always begins with a / (because a single slash denotes the topmost directory of the entire system, which contains everything else, including the various floppy and CD drives that might be available). A relative path in UNIX is any legal path that does not begin with a slash. Under any of the Windows operating systems, the situation is more complicated because the way in which MS Windows handles relative and absolute paths is messier. Rather than going into all of the details, I’ll just say that the best way to handle this is to work with the following simplified rules for Windows paths : 
* A pathname beginning with a drive letter followed by a backslash and then a path is an absolute path: C:\Program Files\Doom. (Note that C: by itself, without a trailing backslash, can’t reliably be used to refer to the top-level directory on the C: drive. You must use C:\ to refer to the top-level directory on C:. This is a result of DOS conventions, not Python design.)
* A pathname beginning with neither a drive letter nor a backslash is a relative path: mydirectory\letters\business.
* A pathname beginning with \\ followed by the name of a server is the path to a network resource.
* Anything else can be considered as an invalid pathname.
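The simplified rules above can be checked from any platform with the ntpath module. The sketch below also shows what "some understanding of absolute versus relative pathnames" means for join(): an absolute component discards everything before it.

```python
import ntpath
import posixpath

absolute = ntpath.isabs("C:\\Program Files\\Doom")          # drive letter plus backslash
relative = ntpath.isabs("mydirectory\\letters\\business")   # neither drive nor backslash
bare_drive = ntpath.isabs("C:")                             # no trailing backslash
drive, rest = ntpath.splitdrive("C:\\Program Files\\Doom")

print(absolute, relative, bare_drive)   # True False False
print(drive, rest)                      # C: \Program Files\Doom

# An absolute component makes join() drop the components before it.
restarted = posixpath.join("/usr/local", "/bin")
print(restarted)                        # /bin
```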

Regardless of the operating system used, the os.path.join() command doesn’t perform sanity checks on the names it’s constructing. It’s possible to construct pathnames containing characters that, according to your OS, are forbidden in pathnames. If such checks are a requirement, probably the best solution is to write a small path-validity-checker function yourself. 
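As a rough idea of what such a checker might look like, here is a minimal sketch for a single Windows file or directory name. The function name is_valid_windows_name is hypothetical, and the check is deliberately incomplete: it looks only at forbidden and control characters and ignores other Windows restrictions such as reserved device names.

```python
# Characters that Windows does not allow inside a single file or directory name.
FORBIDDEN_ON_WINDOWS = set('<>:"/\\|?*')


def is_valid_windows_name(name):
    """Return True if name is non-empty and has no forbidden or control characters."""
    if not name:
        return False
    return not any(ch in FORBIDDEN_ON_WINDOWS or ord(ch) < 32 for ch in name)


print(is_valid_windows_name("readme.txt"))   # True
print(is_valid_windows_name("what?.txt"))    # False
```

To validate a whole path, you would apply a check like this to each component separately, because the separators themselves are in the forbidden set.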

The os.path.split() command returns a two-element tuple splitting the basename of a path (the single file or directory name at the end of the path) from the rest of the path. For example, I use this on my Windows system : 
>>> import os
>>> os.getcwd()
'C:\\Software\\Python3.2.2\\tutorial\\Quick\\CH12'
>>> print(os.path.split(os.getcwd()))
('C:\\Software\\Python3.2.2\\tutorial\\Quick', 'CH12')

The os.path.basename() function returns only the basename of the path, and the os.path.dirname() function returns the path up to but not including the last name : 
>>> path="C:\\Software\\Python3.2.2\\tutorial\\Quick\\CH12\\exam01.py"
>>> os.path.basename(path)
'exam01.py'
>>> os.path.dirname(path)
'C:\\Software\\Python3.2.2\\tutorial\\Quick\\CH12'

To handle the dotted extension notation used by most filesystems to indicate file type (the Macintosh is a notable exception), Python provides os.path.splitext() : 
>>> os.path.splitext(os.path.join('some', 'directory', 'path.jpg'))
('some/directory/path', '.jpg')

The last element of the returned tuple contains the dotted extension of the indicated file (if there was a dotted extension.) The first element of the returned tuple contains everything from the original argument except the dotted extension. 
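A few edge cases of these splitting functions are worth seeing side by side. The sketch uses posixpath so that the results are the same on every platform; the path itself is made up.

```python
import posixpath

path = "/home/user/project/archive.tar.gz"   # hypothetical path

head, tail = posixpath.split(path)
root, ext = posixpath.splitext(path)

print(head)   # /home/user/project
print(tail)   # archive.tar.gz
print(root)   # /home/user/project/archive.tar
print(ext)    # .gz  (only the last extension is split off)

# A leading dot is not treated as an extension, and a name may have none.
hidden = posixpath.splitext(".bashrc")
plain = posixpath.splitext("README")
print(hidden, plain)   # ('.bashrc', '') ('README', '')
```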

You can also use more specialized functions to manipulate pathnames. os.path.commonprefix(list) finds the common prefix (if any) for a list of paths. It compares the paths character by character, so the result may stop in the middle of a directory name; trim it back to the last separator if you wish to find the lowest-level directory that contains every file in a set of files. os.path.expanduser() expands username shortcuts in paths, such as ~ for UNIX. Similarly, os.path.expandvars() does the same for environment variables. Here’s an example on a Windows XP system : 
>>> import os
>>> os.path.expandvars('$WINDIR\\temp')
'C:\\Windows\\temp'
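The sketch below shows commonprefix() on two made-up paths, next to os.path.commonpath(), which compares whole path components (it was added in Python 3.5, so this assumes a reasonably recent interpreter). The environment variable DEMO_DIR is invented for the example and set by the script itself.

```python
import os
import posixpath

paths = ["/usr/lib/python", "/usr/local/bin"]   # hypothetical paths

by_character = posixpath.commonprefix(paths)    # compares character by character
by_component = posixpath.commonpath(paths)      # compares whole directory names

print(by_character)   # /usr/l
print(by_component)   # /usr

# expandvars() substitutes environment variables; DEMO_DIR is set here for the demo.
os.environ["DEMO_DIR"] = "/opt/demo"
expanded = posixpath.expandvars("$DEMO_DIR/temp")
print(expanded)       # /opt/demo/temp
```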

- Useful constants and functions 
You can access a number of useful path-related constants and functions to make your Python code more system independent than it otherwise would be. The most basic of these constants are os.curdir and os.pardir, which respectively define the symbol used by the operating system for the current directory and parent directory path indicators. (In Windows as well as Linux/UNIX and Mac OS X, these are . and .. respectively.) These can be used as normal path elements : 
os.path.isdir(os.path.join(path, os.pardir, os.pardir))

asks if the parent of the parent of path is a directory. os.curdir is particularly useful for requesting commands on the current working directory. For example : 
os.listdir(os.curdir)

returns a list of filenames in the current working directory (because os.curdir is a relative path, and os.listdir always takes relative paths as being relative to the current working directory). 
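Joining os.pardir onto a path leaves the .. elements in the string. If you want to see where such a path ends up, os.path.normpath() collapses them. A small sketch with posixpath and a made-up starting path:

```python
import posixpath

path = "/home/user/project"   # hypothetical starting point

grandparent = posixpath.join(path, posixpath.pardir, posixpath.pardir)
collapsed = posixpath.normpath(grandparent)

print(grandparent)   # /home/user/project/../..
print(collapsed)     # /home
```

normpath() works purely on the string, so the result can differ from the real location if symbolic links are involved.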

The os.name constant returns the name of the Python module imported to handle the operating system–specific details. Here’s an example on my Windows XP system : 
>>> import os
>>> os.name
'nt'

Note that os.name returns 'nt' even though the actual version of Windows is XP. Most versions of Windows, except for Windows CE, are identified as 'nt'. 

On a Mac running OS X and on Linux/UNIX, the response is posix. You can use this to perform special operations, depending on the platform you’re working on : 
import os
if os.name == 'posix':
    root_dir = "/"
elif os.name == 'nt':
    root_dir = "C:\\"
else:
    print("Don't understand this operating system!")

You may also see programs use sys.platform, which gives more exact information. On Windows XP, it’s set to win32. On Linux, you may see linux2 (Python 3.3 and later report linux), whereas on Solaris, it may be set to sunos5 depending on the version you’re running. 
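Because sys.platform values can carry version suffixes, programs usually test them with startswith() rather than with ==. The helper below, platform_family, is a hypothetical name and its family labels are chosen only for this sketch.

```python
import sys


def platform_family(platform=sys.platform):
    """Map a sys.platform value to a coarse family name (hypothetical helper)."""
    if platform.startswith("win"):
        return "windows"
    if platform.startswith("linux"):
        return "linux"
    if platform == "darwin":
        return "mac"
    return "other"


print(platform_family())            # depends on the machine running this
print(platform_family("linux2"))    # linux
print(platform_family("win32"))     # windows
print(platform_family("sunos5"))    # other
```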

All your environment variables, and the values associated with them, are available in a dictionary called os.environ; in most operating systems, this includes variables related to paths, typically search paths for binaries and so forth. If what you’re doing requires this, you know where to find it now. 
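For example, the search path for binaries is normally stored in the PATH variable as one string, with entries separated by os.pathsep (a colon on Linux/UNIX, a semicolon on Windows). A minimal sketch that splits it into a list, assuming PATH may be missing or contain empty entries:

```python
import os

search_path = os.environ.get("PATH", "")   # empty string if PATH is not set
directories = [d for d in search_path.split(os.pathsep) if d]

for directory in directories:
    print(directory)
```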

Getting information about files : 
File paths are supposed to indicate actual files and directories on your hard drive. Of course, you’re probably passing a path around because you wish to know something about what it points to. Various Python functions are available to do this. 

The most commonly used Python path-information functions are os.path.exists(), os.path.isfile(), and os.path.isdir(), which all take a single path as an argument. os.path.exists() returns True if its argument is a path corresponding to something that exists in the filesystem. os.path.isfile() returns True if and only if the path it’s given indicates a normal data file of some sort (executables fall under this heading), and it returns False otherwise, including the possibility that the path argument doesn’t indicate anything in the filesystem. os.path.isdir() returns True if and only if its path argument indicates a directory; it returns False otherwise. These examples are valid on my system. You may need to use different paths on yours to investigate the behavior of these functions : 
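A minimal sketch of the three functions that should run on any system: it builds a throwaway temporary directory containing one file and also probes a path that does not exist. The file names are made up for illustration.

```python
import os
import tempfile


def describe(path):
    """Return (exists, isfile, isdir) for path."""
    return (os.path.exists(path), os.path.isfile(path), os.path.isdir(path))


with tempfile.TemporaryDirectory() as folder:
    data_file = os.path.join(folder, "notes.txt")
    with open(data_file, "w") as f:
        f.write("hello")
    missing = os.path.join(folder, "no_such_file")

    folder_info = describe(folder)
    file_info = describe(data_file)
    missing_info = describe(missing)

print(folder_info)    # (True, False, True)
print(file_info)      # (True, True, False)
print(missing_info)   # (False, False, False)
```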
 

A number of similar functions provide more specialized queries. os.path.islink() and os.path.ismount() are useful in the context of Linux and other UNIX operating systems that provide file links and mount points. They return True if, respectively, a path indicates a file that’s a link or a mount point. os.path.islink() does not return True on Windows shortcut files (files ending with .lnk), for the simple reason that such files aren’t true links. The OS doesn’t assign them a special status, and programs can’t transparently use them as if they were the actual file. os.path.samefile(path1, path2) returns True if and only if the two path arguments point to the same file. os.path.isabs(path) returns True if its argument is an absolute path, False otherwise. os.path.getsize(path), os.path.getmtime(path), and os.path.getatime(path) return the size, last modify time, and last access time of a pathname, respectively. 
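Some of these queries can be tried out on a temporary file, as in the sketch below. The times come back as seconds since the epoch, so time.ctime() is used to make one readable; the file name is made up.

```python
import os
import tempfile
import time

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "data.bin")
    with open(path, "wb") as f:
        f.write(b"12345")

    size = os.path.getsize(path)                    # number of bytes written
    modified = time.ctime(os.path.getmtime(path))   # human-readable timestamp
    is_absolute = os.path.isabs(path)
    # A differently spelled path that still names the same file.
    alias = os.path.join(folder, os.curdir, "data.bin")
    same = os.path.samefile(path, alias)

print(size)          # 5
print(modified)
print(is_absolute)   # True
print(same)          # True
```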

More filesystem operations : 
In addition to obtaining information about files, Python lets you perform certain filesystem operations directly. This is accomplished through a set of basic but highly useful commands in the os module. I’ll describe only those true cross-platform operations. Many operating systems also have access to more advanced filesystem functions, and you’ll need to check the main Python library documentation for the details. 

You’ve already seen that to obtain a list of files in a directory, you use os.listdir() : 
>>> import os
>>> os.chdir(os.path.join('C:\\', 'my documents', 'tmp'))
>>> os.listdir(os.curdir)
['book1.doc.tmp', 'a.tmp', '1.tmp', '7.tmp', '9.tmp', 'registry.bkp']

Note that unlike the list directory command in many other languages or shells, Python does not include the os.curdir and os.pardir indicators in the list returned by os.listdir().

The glob function from the glob module (named after an old UNIX function that did pattern matching) expands Linux/UNIX shell-style wildcard characters and character sequences in a pathname, returning the files in the current working directory that match. A * matches any sequence of characters. A ? matches any single character. A character sequence ([hH] or [0-9]) matches any single character in that sequence : 
>>> import glob
>>> glob.glob("*")
['book1.doc.tmp', 'a.tmp', '1.tmp', '7.tmp', '9.tmp', 'registry.bkp']
>>> glob.glob("*bkp")
['registry.bkp']
>>> glob.glob("?.tmp")
['a.tmp', '1.tmp', '7.tmp', '9.tmp']
>>> glob.glob("[0-9].tmp")
['1.tmp', '7.tmp', '9.tmp']
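glob() also accepts patterns that include a directory, so you don't have to change the current working directory first. The sketch below recreates the files from the session above in a temporary directory; results are sorted because glob() makes no promise about order, and glob.escape() guards against wildcard characters in the directory name itself. The helper name match is made up.

```python
import glob
import os
import tempfile

names = ["book1.doc.tmp", "a.tmp", "1.tmp", "7.tmp", "9.tmp", "registry.bkp"]

with tempfile.TemporaryDirectory() as folder:
    for name in names:
        open(os.path.join(folder, name), "w").close()   # create empty files

    def match(pattern):
        """Return the sorted base names in folder that match pattern."""
        found = glob.glob(os.path.join(glob.escape(folder), pattern))
        return sorted(os.path.basename(p) for p in found)

    backups = match("*bkp")
    single = match("?.tmp")
    digits = match("[0-9].tmp")

print(backups)   # ['registry.bkp']
print(single)    # ['1.tmp', '7.tmp', '9.tmp', 'a.tmp']
print(digits)    # ['1.tmp', '7.tmp', '9.tmp']
```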

To rename (move) a file or directory, use os.rename() : 
>>> os.rename('registry.bkp', 'registry.bkp.old')
>>> os.listdir(os.curdir)
['book1.doc.tmp', 'a.tmp', '1.tmp', '7.tmp', '9.tmp', 'registry.bkp.old']

You can use this command to move files across directories as well as within directories. Remove or delete a data file with os.remove() : 
>>> os.remove('book1.doc.tmp')
>>> os.listdir(os.curdir)
['a.tmp', '1.tmp', '7.tmp', '9.tmp', 'registry.bkp.old']

Note that you can’t use os.remove to delete directories. This is a safety feature, to ensure that you don’t accidentally delete an entire directory substructure by mistake. 
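Here is a sketch of both commands in a temporary directory: a rename that moves a file into a subdirectory, and an attempt to os.remove() a directory, which raises an exception (a subclass of OSError; the exact subclass varies by platform). The file and directory names are made up.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    src = os.path.join(folder, "registry.bkp")
    open(src, "w").close()                       # create an empty file
    subdir = os.path.join(folder, "archive")
    os.mkdir(subdir)

    # Rename and move in one step: the new name lives in another directory.
    dst = os.path.join(subdir, "registry.bkp.old")
    os.rename(src, dst)
    moved = os.path.exists(dst) and not os.path.exists(src)

    # os.remove() refuses to delete a directory.
    try:
        os.remove(subdir)
        refused = False
    except OSError:
        refused = True

    os.remove(dst)                               # deleting the file works
    file_gone = not os.path.exists(dst)

print(moved, refused, file_gone)   # True True True
```

Note that os.rename() may fail when the source and destination are on different drives or filesystems; shutil.move() is the usual fallback in that case.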

Files can be created by writing to them, as described in chapter 13. To create a directory, use os.makedirs() or os.mkdir(). The difference between them is that os.mkdir doesn’t create any necessary intermediate directories, but os.makedirs does : 
>>> os.makedirs('mydir')
>>> os.listdir(os.curdir)
['mydir', 'a.tmp', '1.tmp', '7.tmp', '9.tmp', 'registry.bkp.old']
>>> os.path.isdir('mydir')
True

To remove a directory, use os.rmdir(). This removes only empty directories. Attempting to use it on a nonempty directory raises an exception : 
>>> os.rmdir('mydir')
>>> os.listdir(os.curdir)
['a.tmp', '1.tmp', '7.tmp', '9.tmp', 'registry.bkp.old']

To remove nonempty directories, use the shutil.rmtree() function. It will recursively remove all files in a directory tree. 
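The differences between these functions show up as soon as nested directories are involved. A minimal sketch in a temporary directory, with made-up directory names a, b, and c:

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as folder:
    nested = os.path.join(folder, "a", "b", "c")

    # os.mkdir() fails because the intermediate directories don't exist yet.
    try:
        os.mkdir(nested)
        mkdir_failed = False
    except OSError:
        mkdir_failed = True

    os.makedirs(nested)                  # creates a, a/b, and a/b/c
    created = os.path.isdir(nested)

    # os.rmdir() fails because a still contains b.
    top = os.path.join(folder, "a")
    try:
        os.rmdir(top)
        rmdir_failed = False
    except OSError:
        rmdir_failed = True

    shutil.rmtree(top)                   # removes a and everything below it
    removed = not os.path.exists(top)

print(mkdir_failed, created, rmdir_failed, removed)   # True True True True
```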

Processing all files in a directory subtree : 
Finally, a highly useful function for traversing recursive directory structures is the os.walk() function. You can use it to walk through an entire directory tree, returning three things for each directory it traverses: the root, or path, of that directory; a list of its subdirectories; and a list of its files. 

os.walk() is called with the path of the starting, or top, folder and three optional arguments: os.walk(directory, topdown=True, onerror=None, followlinks=False). directory is a starting directory path; if topdown is True or not present, the files in each directory are processed before its subdirectories, resulting in a listing that starts at the top and goes down; whereas if topdown is False, the subdirectories of each directory are processed first, giving a bottom-up traversal of the tree. The onerror parameter can be set to a function to handle any errors that result from calls to os.listdir(), which are ignored by default. Finally, os.walk() by default doesn’t walk down into folders that are symbolic links, unless you give it the followlinks=True parameter. 

When called, os.walk() creates an iterator that recursively applies itself to all the directories contained in the top parameter. In other words, for each subdirectory subdir in the list of subdirectories, os.walk() recursively invokes a call to itself, of the form os.walk(subdir,...). Note that if topdown is True or not given, the list of subdirectories may be modified (using any of the list-modification operators or methods) before its items are used for the next level of recursion; you can use this to control which subdirectories, if any, os.walk will descend into. 

To get a feel for os.walk(), I recommend iterating over the tree and printing out the values returned for each directory. As an example of the power of os.walk(), list the current working directory and all of its subdirectories along with a count of the number of files in each of them, excluding any '.git' directories : 
import os
for root, dirs, files in os.walk(os.curdir):
    print("{0} has {1} files".format(root, len(files)))
    if ".git" in dirs:
        dirs.remove(".git")

This is complex, and if you want to use os.walk() to its fullest extent, you should probably play around with it quite a bit to understand the details of what’s going on. 
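One way to play around with it safely is to build a small tree in a temporary directory and record the order in which directories are visited. The sketch below does a pruned top-down walk and then a bottom-up walk; the tree layout and file names are made up.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as top:
    # Build a small made-up tree: src/pkg plus a .git directory to skip.
    os.makedirs(os.path.join(top, "src", "pkg"))
    os.makedirs(os.path.join(top, ".git", "objects"))
    for parts in (("readme.txt",), ("src", "main.py"),
                  ("src", "pkg", "mod.py"), (".git", "config")):
        open(os.path.join(top, *parts), "w").close()

    # Top-down: editing dirs in place controls where os.walk descends.
    top_down = []
    for root, dirs, files in os.walk(top):
        if ".git" in dirs:
            dirs.remove(".git")
        top_down.append(os.path.relpath(root, top))

    # Bottom-up: every directory is reported after its subdirectories.
    bottom_up = [os.path.relpath(root, top)
                 for root, dirs, files in os.walk(top, topdown=False)]

print(top_down)
print(bottom_up)
```

With topdown=False the list of subdirectories has already been used by the time you see it, so pruning it has no effect on the traversal.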

The copytree() function of the shutil module recursively makes copies of all the files in a directory and all of its subdirectories, preserving permission mode and stat (that is, access/modify times) information. The shutil module also has the already-mentioned rmtree() function, for removing a directory and all of its subdirectories, as well as a number of functions for making copies of individual files. 
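A minimal sketch of copytree() and rmtree() working together in a temporary directory. The directory and file names are made up; note that in this form copytree() expects the destination not to exist yet.

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as work:
    source = os.path.join(work, "original")
    os.makedirs(os.path.join(source, "sub"))
    with open(os.path.join(source, "sub", "data.txt"), "w") as f:
        f.write("payload")

    target = os.path.join(work, "copy")
    shutil.copytree(source, target)          # copies the whole subtree

    with open(os.path.join(target, "sub", "data.txt")) as f:
        copied_text = f.read()

    shutil.rmtree(source)                    # removing the original...
    source_gone = not os.path.exists(source)
    copy_survives = os.path.isdir(target)    # ...leaves the copy intact

print(copied_text)                  # payload
print(source_gone, copy_survives)   # True True
```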

Summary : 
Handling filesystem references (pathnames) and filesystem operations in a manner independent of the underlying operating system is never simple. Fortunately, Python provides a group of functions and constants that make this task much easier. For convenience, a summary of the functions discussed is given in table 12.1 : 
 
