Saturday, January 20, 2018

[ Python Article Collection ] A tutorial on python-daemon – or – Why doesn’t python-daemon have any documentation?

Source From Here 
Introduction 
A few weeks ago I needed to create a daemon for a school project. I had never really dealt with daemons before so I browsed on the Internet for what they were and how they worked. After reading some pages, I found out how hard it seemed to be working with daemons: you need to deal with the correct forking of a process, prevent core dump generation, change root and working directories, change process and file umasks and ownership, and quite a lot of other OS-related stuff you can find listed here. By the way, at this time I’m still not quite sure of how these conditions came to be. 

Wait. What is a daemon? 
A little side note. Although the word “daemon” might sound like something diabolical, it has nothing to do with Satan. In fact it comes from the Greek word δαίμων, which refers to the spirits people have inside and which eventually define them. A daemon is a process, no more and no less than your browser’s process is. The key difference, though, is that a daemon is a process that doesn’t need user input to work. Think, for instance, about a web server: it isn’t waiting for its own user to perform some action, but rather for some other host on the network to make a request. Such a request needs to be processed without any human taking part in it. 

So a daemon is a process that performs what you could think of as a background task. It may be a server, like a web server or an ssh server, or something more complicated like systemd. OS-wise, daemons are specific to the world of Unix. If you’re running Windows or Mac OS X, keep in mind that Unix daemons do have their respective counterparts in other OSes: Windows has its so-called services; OS X sometimes calls them “daemons”, sometimes “agents”, and they still work pretty much like Unix daemons, but OS X more or less expects you to make them compatible with launchd. 

How can I create a daemon? 
There are a lot of ways you can create a daemon in Unix, since nobody enforces or supports one in favor of the others. The quickest one: within the shell of your choice, type in the command you want to daemonize and add ‘&’ at the end of it. 
# python spam.py &

And that’s it. You have a daemon. 

Now, while this is a very fast way to spawn a daemon, it might not be the most reasonable choice for a number of reasons: the process will write all of its output to your current shell (using ‘&’ doesn’t close the stdout and stderr file descriptors), you can’t assign the daemon a PID lock file, so multiple daemons might be running at the same time (I’ll come back to the lock file later), and in general it is harder to control the daemon. 

Nonetheless this remains the fastest way to create a daemon in Unix. Just a character away. And sometimes speed and simplicity are exactly what you need. 

There are also a couple of recipes online that will do the job for you. I won’t go through the details of all of them, I’ll just link a few here: 
* Creating a daemon the python way — by Chad J. Schroeder
* A simple unix/linux daemon in Python — by Sander Marechal
* Daemon with start/stop/restart behavior — by Clark Evans

Using a recipe does have some consequences: you cannot pip install it and you have no simple way to know whether there is going to be an update for it (crucial in a security-aware context), but still these recipes will get the job done. 

Another option is the “daemon” utility. Here’s a nice article about it. This is a pretty nice utility, which will do most of the job for you, similarly to python-daemon, but if you’re relying on Python code for running your daemon, it may not be the best choice. Anyway, it’s simple enough that you might want to look at it if you need to spawn daemons in C or any other language. 

While quite a lot of people (like me) find it tedious to re-invent the wheel, others sometimes feel like they need to re-invent the whole TCP/IP stack from the ground up. I won’t judge you for this. You’re going to have a lot of fun with that. Here are a few modules I found a lot of people using while writing their homemade daemons: os.fork, os.setsid, signal, subprocess.
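To give an idea of what the homemade route involves, here is a minimal sketch of the classic double fork built on os.fork and os.setsid. The name spawn_detached is made up for this illustration, and unlike most recipes the calling process keeps running instead of exiting. It leaves out most of the housekeeping mentioned in the introduction, so treat it as a sketch rather than a complete daemonizer.

```python
import os


def spawn_detached(task):
    """Run task() in a detached grandchild process and return in the caller.

    spawn_detached is a hypothetical helper written for this example.
    """
    pid = os.fork()
    if pid > 0:
        # Original process: reap the short-lived first child and carry on.
        os.waitpid(pid, 0)
        return

    # First child: start a new session, detaching from the controlling terminal.
    os.setsid()

    if os.fork() > 0:
        # The first child exits at once, so the grandchild is orphaned
        # and can no longer be waited on by the original process.
        os._exit(0)

    # Grandchild: this is the process that plays the daemon.
    os.chdir("/")
    os.umask(0)
    devnull = os.open(os.devnull, os.O_RDWR)
    for fd in (0, 1, 2):
        # Point stdin, stdout and stderr at /dev/null.
        os.dup2(devnull, fd)
    try:
        task()
    finally:
        # Never fall back into the caller's code from the daemon process.
        os._exit(0)
```

Calling spawn_detached(main) returns immediately in the caller, while main keeps running in a process that no longer belongs to your shell session.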

Ok, here we are. This is the way I create daemons and the way I would recommend to most people: using python-daemon.

It doesn’t require a lot of knowledge about the underlying machinery that gets the daemon to run, its source code is rather readable, its APIs are very simple (we’ll go through these later), you can pip install it, it is still maintained, and it was going to be the standard way to create daemons in Python. Its main weakness is its lack of documentation: what you’ll find online is sparse and you’ll often need to look at the source code if you encounter any bugs. 

So what is python-daemon? 
Back in the first weeks of 2009 PEP 3143 was created. Its aim was to create “a package [in] the Python standard library that provides a simple interface to the task of becoming a daemon process.” While the goal was not an impossible task and quite some people were interested in seeing this project succeed, it didn’t make it. The guy that was in charge of doing it simply didn’t have enough time anymore and no one stepped in to save the project. Such a sad death for such a nice project. 

This tragedy didn’t affect the functionality of python-daemon too much (it does include basically anything you need for a daemon), but rather its documentation. As I said earlier, its weakest point is documentation: you’ll find some inside the PEP and some within the code itself. 

And how do you make it work? 
Ok, I guess I got you interested in python-daemon since you’re still reading. First, let me start off by telling you what you shouldn’t use in this library: DaemonRunner. Googling python-daemon will find some pages that will point you to the DaemonRunner object to handle your daemon, but it is a deprecated part of the library. 

Instead, you want to use the DaemonContext API which is used inside of DaemonRunner. It’s true that DaemonRunner extends DaemonContext functionality, but it does so in a very old-fashioned way (doesn’t use argparse for instance). This is probably the reason why it ended up being deprecated. 

Without further ado, DaemonContext makes it super simple to start your daemon with just a context manager: 
    import daemon

    with daemon.DaemonContext():
        main()  # main() stands for your program's entry point
This is the most basic configuration you can pass to DaemonContext, and it will actually create a well-behaving daemon with just one line of code and four spaces of indentation. I’ll give you an overview of what you can set in order to have a more complex and detailed configuration for the daemon you need. 

Dealing with the file system 
A daemon is a fairly peculiar process: since it is unbound from human interaction, a daemon will have its own keys to be identified user-wise. This means that, regardless of the user that started a daemon, the daemon will have its own UID, GID (User/Group ID), its own root and working directories, and its own umask. 

Don’t be afraid, DaemonContext will take care of this stuff for you, even with just the default configuration, but let’s say that you need to customize it. To change the root directory, which is useful for confining your daemon, simply set the chroot_directory argument to a valid directory on your file system. The same goes for the working directory, which is more commonly changed, through the working_directory argument. By default, DaemonContext will set your working directory to the root directory “/”. For example: 
    import os
    import daemon

    with daemon.DaemonContext(
            chroot_directory=None,
            working_directory='/var/lib/myprettylittledaemon'):
        print(os.getcwd())
In case you don’t see on-screen the result of print, that’s because you need to keep the stdout stream open. Such configuration is explained in the “Preserve files” paragraph below. For the UID and GID, DaemonContext by default “will relinquish any effective privilege elevation inherited by the process” which is usually the reason why you need to change them. In case you don’t find this satisfactory, the process is still pretty straight-forward: set them to what you need, provided that your user is granted permission to do so. In case your user doesn’t have root permissions, DaemonContext will raise a DaemonOSEnvironmentError exception. 
    with daemon.DaemonContext(
            uid=1001,
            gid=777):
        print(os.getuid())
        print(os.getgid())
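Hard-coded numeric IDs like the ones above differ from machine to machine. As a sketch, the standard pwd module can look them up by user name; ids_for_user and run_as are names made up for this example, and run_as assumes python-daemon is installed and that the calling user is allowed to switch identity.

```python
import pwd


def ids_for_user(username):
    """Return (uid, gid) for a user name from the system password database."""
    entry = pwd.getpwnam(username)
    return entry.pw_uid, entry.pw_gid


def run_as(username, main):
    """Daemonize and run main() under the given user's UID and GID."""
    # Imported here so that the lookup helper above works without python-daemon.
    import daemon

    uid, gid = ids_for_user(username)
    with daemon.DaemonContext(uid=uid, gid=gid):
        main()
```

pwd.getpwnam raises KeyError for an unknown user name, which makes a typo visible before the process daemonizes.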
Additionally, you might want to set the daemon umask, which will set the mode the daemon will create files with (Check os.umask): 
    with daemon.DaemonContext(
            umask=0o002):
        your_mask = os.umask(0)  # I'm doing this weird three-line trick
        print(your_mask)         # to print the umask set by DaemonContext
        os.umask(your_mask)      # due to the behaviour of os.umask.
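To see what a umask actually does to the files a process creates, here is a small sketch that uses only the standard library and no daemon at all. The helper name mode_of_new_file is made up for this example.

```python
import os
import stat
import tempfile


def mode_of_new_file(umask):
    """Create a file under the given umask and return its permission bits."""
    previous = os.umask(umask)  # os.umask returns the mask it replaces
    try:
        directory = tempfile.mkdtemp()
        path = os.path.join(directory, "example.txt")
        # Ask for rw-rw-rw-; the umask clears bits from this request.
        fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o666)
        os.close(fd)
        return stat.S_IMODE(os.stat(path).st_mode)
    finally:
        os.umask(previous)  # restore the original mask


print(oct(mode_of_new_file(0o002)))  # 0o664
print(oct(mode_of_new_file(0o077)))  # 0o600
```

The bits set in the umask are removed from the requested mode, which is why 0o002 only takes away write permission for "others".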
Preserve files. 
One thing to take into account when creating a daemon is that on start DaemonContext will close any open files you have around. This is normal and it’s what it is supposed to do. Now, even though this is the behavior we should expect from the library, you might still need some files to be opened in your program. You can do this by declaring what files you won’t need to be closed through the files_preserve argument. For instance: 
    some_important_file = open('AVERYBIGDATABASE', 'r')

    with daemon.DaemonContext(
            files_preserve=[some_important_file]):
        print(some_important_file.readlines())
Along with your open files, DaemonContext will also close the standard stream file descriptors, namely stdin, stdout and stderr. By default it will redirect them to os.devnull. If you need to keep them open, simply set the stdin, stdout and stderr arguments according to your needs. 
    import sys

    with daemon.DaemonContext(
            stdout=sys.stdout,
            stderr=sys.stderr):
        print("Hello World! Daemon here.")
Handling OS signals 
Signals coming from the OS are important whether or not your program is daemonized. Daemonizing makes handling them even more important, since signals might become the only way a human interacts with your process. DaemonContext conveniently lets you pass a dictionary in the signal_map argument that maps each signal you want to configure to its handler. Some popular ones are SIGINT, SIGKILL, SIGTERM and SIGTSTP (note that SIGKILL cannot be caught or handled by a process). You can find further details here.
    import signal
    import sys

    def shutdown(signum, frame):  # signum and frame are mandatory
        sys.exit(0)

    with daemon.DaemonContext(
            signal_map={
                signal.SIGTERM: shutdown,
                signal.SIGTSTP: shutdown
            }):
        main()
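The same kind of mapping can be tried out without daemonizing anything. The sketch below installs a handler map with the standard signal module, which is roughly what happens to signal_map when the context opens, and then sends a signal to the current process. SIGUSR1 and SIGUSR2 are used here only because they are harmless to experiment with; the names record and install are made up for this example.

```python
import os
import signal
import time

received = []


def record(signum, frame):
    """Handler: remember which signal arrived."""
    received.append(signum)


# Same shape as the dictionary passed to DaemonContext(signal_map=...).
signal_map = {
    signal.SIGUSR1: record,
    signal.SIGUSR2: record,
}


def install(mapping):
    """Register every handler in the mapping with the interpreter."""
    for signum, handler in mapping.items():
        signal.signal(signum, handler)


install(signal_map)

# Send SIGUSR1 to ourselves, as an administrator would with `kill -USR1 <pid>`.
os.kill(os.getpid(), signal.SIGUSR1)
time.sleep(0.1)  # give the interpreter a moment to run the handler
```

After this runs, received holds the number of the signal that was delivered. Handlers can only be installed from the main thread of the program.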
One at a time 
More often than not, daemons will use resources, such as a TCP port for a listening server or some files on disk. You’ll probably want to make sure that there aren’t multiple daemons competing for these resources. To make sure that only one instance of your daemon is running at a time, you can use a PID lock file, which is a file containing the PID of a process and which prevents the same program from running as more than one instance. Please note that it is the duty of the newly spawned process (handled within DaemonContext) to check the lock file and abort the start procedure. If you’re already familiar with threading.Lock, the concept is basically the same. 

You can set a lock file like this: 
    import lockfile

    with daemon.DaemonContext(
            pidfile=lockfile.FileLock('/var/run/spam.pid')):
        main()
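To make the idea of a PID lock file concrete, here is a simplified standard-library sketch of the mechanism. It is not how lockfile or python-daemon implement it, and the names PidFileLocked, acquire_pidfile and release_pidfile are made up for this example.

```python
import os


class PidFileLocked(Exception):
    """Raised when another process already holds the PID file."""


def acquire_pidfile(path):
    """Create the PID file atomically, failing if it already exists."""
    try:
        # O_EXCL makes the creation fail if the file is already there,
        # so two processes cannot both believe they own the lock.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        raise PidFileLocked(path)
    with os.fdopen(fd, "w") as handle:
        handle.write("%d\n" % os.getpid())


def release_pidfile(path):
    """Remove the PID file so that the next instance can start."""
    os.remove(path)
```

A second call to acquire_pidfile with the same path raises PidFileLocked until release_pidfile has been called. A real implementation also has to deal with stale files left behind by a crashed process, which this sketch ignores.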
Start/stop/reload 
A common pattern for a daemon to interact with its administrator is to provide a start/stop/reload behavior, which is usually implemented as a set of command line arguments. This is particularly useful if you’re planning to support initd. DaemonContext, though, will not take care of this for you. DaemonRunner does have code for this behavior, but I wouldn’t advise you to use it directly since it is deprecated. You can still use its source code as a reference, though: for further details take a look at the _start and _stop methods. 
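Since DaemonContext leaves this part to you, here is one possible sketch of such a command line built with argparse. Everything in it is an assumption made for illustration: the program name spamd, the --pidfile option, the choice of SIGHUP for reload, and above all the premise that the PID file really contains the daemon's PID, which depends on the pidfile object you hand to DaemonContext.

```python
import argparse
import os
import signal


def build_parser():
    """Command line with a single positional action: start, stop or reload."""
    parser = argparse.ArgumentParser(prog="spamd")
    parser.add_argument("action", choices=["start", "stop", "reload"])
    parser.add_argument("--pidfile", default="/var/run/spam.pid")
    return parser


def read_pid(pidfile):
    """Read the daemon's PID from its PID file."""
    with open(pidfile) as handle:
        return int(handle.read().strip())


def dispatch(args, start):
    """Run the requested action; start is the callable that daemonizes."""
    if args.action == "start":
        start()
    elif args.action == "stop":
        # Ask the running daemon to shut down.
        os.kill(read_pid(args.pidfile), signal.SIGTERM)
    else:
        # reload: by convention many daemons re-read their config on SIGHUP.
        os.kill(read_pid(args.pidfile), signal.SIGHUP)
```

A script would end with something like dispatch(build_parser().parse_args(), start_daemon), where start_daemon wraps the with daemon.DaemonContext(...) block and the daemon itself maps SIGTERM and SIGHUP in its signal_map.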

Conclusions 
The package python-daemon is absolutely not the only way you can create a daemon for a Python program, so you should carefully consider every possibility you have. If your choice is python-daemon, we have gone through most of the configuration options of DaemonContext. I haven’t covered all of them, though. If you’re still looking for more options you should look in the PEP; if you can’t find enough information there, have a look at DaemonContext’s source code.

Supplement 
PEP 3143 -- Standard daemon process library

Thursday, January 18, 2018

[ Python Article Collection ] Using Python's Watchdog to monitor changes to a directory

Source From Here 
Preface 
Watchdog is a handy Python package which uses the inotify Linux kernel subsystem to watch for any changes to the filesystem. This makes it an excellent foundation for a small script which takes action whenever a file is received in a directory, or whenever any of the directory's contents change. An example might be a client-facing sftp server where you want to receive an email when a file is received. 

You can install the package with the command below: 
# pip install watchdog


How to 
First create the monitoring script. It will run daemonized and observe any changes to the given directory. The script uses 3 modules/classes: 
* time from Python will be used to sleep the main loop
* watchdog.observers.Observer is the class that will watch for any change, and then dispatch the event to the specified handler
* watchdog.events.PatternMatchingEventHandler is the class that will take the event dispatched by the observer and perform some action

- watch_for_changes.py 
    import time
    from watchdog.observers import Observer
    from watchdog.events import PatternMatchingEventHandler
PatternMatchingEventHandler inherits from FileSystemEventHandler and exposes some useful methods: 
* on_any_event: if defined, executed for any event
* on_created: executed when a file or a directory is created
* on_modified: executed when a file is modified or a directory renamed
* on_moved: executed when a file or directory is moved
* on_deleted: executed when a file or directory is deleted

Each one of those methods receives the event object as first parameter, and the event object has 3 attributes: 
* event_type: 'modified' | 'created' | 'moved' | 'deleted'
* is_directory: True | False
* src_path: path/to/observed/file

So to create a handler, just inherit from one of the existing handlers. For this example PatternMatchingEventHandler will be used to match only xml files. To keep things simple I will put the file processing in a single method, and I will implement only on_modified and on_created, which means that my handler will ignore any other events. 

The patterns attribute is also defined, so that only files with xml or lxml extensions are watched. 
    class MyHandler(PatternMatchingEventHandler):
        patterns = ["*.xml", "*.lxml"]

        def process(self, event):
            """
            event.event_type
                'modified' | 'created' | 'moved' | 'deleted'
            event.is_directory
                True | False
            event.src_path
                path/to/observed/file
            """
            # the file will be processed here
            print("%s %s" % (event.src_path, event.event_type))  # print for now, only for debugging

        def on_modified(self, event):
            self.process(event)

        def on_created(self, event):
            self.process(event)
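The entries in patterns are shell-style wildcards. As a rough stand-in, and explicitly not watchdog's own matching code (whose exact rules may differ), the standard fnmatch module shows which paths the two patterns above would select. The helper name matches_any is made up for this example.

```python
from fnmatch import fnmatchcase

PATTERNS = ["*.xml", "*.lxml"]


def matches_any(path, patterns=PATTERNS):
    """Return True if the path matches at least one wildcard pattern."""
    return any(fnmatchcase(path, pattern) for pattern in patterns)


print(matches_any("/tmp/test.xml"))   # True
print(matches_any("/tmp/test.lxml"))  # True
print(matches_any("/tmp/notes.txt"))  # False
```

Events for paths that match none of the patterns never reach on_created or on_modified.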
With the above handler only creation and modification will be watched. Now the Observer needs to be scheduled. 
    import sys

    if __name__ == '__main__':
        args = sys.argv[1:]
        observer = Observer()
        # watch the directory given as first argument, or the current one
        observer.schedule(MyHandler(), path=args[0] if args else '.')
        observer.start()

        try:
            while True:
                time.sleep(1)
        except KeyboardInterrupt:
            observer.stop()

        observer.join()
Notes. 
You can set the named argument "recursive" to True in observer.schedule if you want to watch for files in subfolders.

That's all that is needed to watch for modifications in the given directory. The script takes the path given as first parameter, or the current directory by default. 
# python watch_for_changes.py /path/to/directory

Let it run in a shell and open another one, or the file browser, to change or create new .xml files in /path/to/directory.
# echo "testing" > /tmp/test.xml

Since the handler is printing the results, the output should be: 
/tmp/test.xml created
/tmp/test.xml modified

Now, to complete the script, you only need to implement in the process method the logic needed to parse the file and insert its data into the database. For example, if the xml file contains some data about the current track on a web radio: 


The easiest way to parse this small xml is to use the xmltodict library. 

# pip install xmltodict

With the xmltodict.parse function, the above xml is returned as an OrderedDict:
    OrderedDict([(u'Pulsar',
        OrderedDict([(u'OnAir',
            OrderedDict([(u'media_type', u'default'),
            (u'media',
                OrderedDict([(u'title1', u'JOVEM PAN FM'),
                             (u'title2', u'100,9MHz'),
                             (u'title3', u'A maior rede de radio do Brasil'),
                             (u'title4', u'00:00:00'),
                             (u'media_id1', u'#ID_Title#'),
                             (u'media_id2', u'#ID_SubTitle#'),
                             (u'media_id3', u'#ID_Album#'),
                             (u'hour', u'2013-12-07 11:44:32'),
                             (u'length', u'#Duration#'),
                             (u'ISRC', u'#Code#'),
                             (u'id_singer', u'#ID_Singer#'),
                             (u'id_song', u'#ID_Song#'),
                             (u'id_album', u'#ID_Album#'),
                             (u'id_jpg', u'#Jpg#')]))]))]))])
Now we can just access that dict to create the record on the filesystem or somewhere else. Notice that I make heavy use of the dict get method to avoid KeyErrors.
    # this fragment lives inside the process method of the handler
    with open(event.src_path, 'r') as xml_source:
        xml_string = xml_source.read()
        parsed = xmltodict.parse(xml_string)
        element = parsed.get('Pulsar', {}).get('OnAir', {}).get('media')
        if not element:
            return
        print(dict(element))
and the output will be: 
{u'hour': u'2013-12-07 11:44:32',
u'title2': u'100,9MHz',
u'id_album': u'#ID_Album#',
u'title1': u'JOVEM PAN FM',
u'length': u'#Duration#',
u'title3': u'A maior rede de radio do Brasil',
u'title4': u'00:00:00',
u'ISRC': u'#Code#',
u'id_song': u'#ID_Song#',
u'media_id2': u'#ID_SubTitle#',
u'media_id1': u'#ID_Title#',
u'id_jpg': u'#Jpg#',
u'media_id3': u'#ID_Album#',
u'id_singer': u'#ID_Singer#'}

Much better than XPATH, and for this particular case, where the xml_source is small, there will be no relevant performance issue. Now we only need to get the values and populate the database. In my case I will use Redis DataModel as storage. I will also use the magicdate module to automagically convert the date string to a datetime object. The complete code is as below: 
    import sys
    import time
    import xmltodict
    from magicdate import magicdate
    from watchdog.observers import Observer
    from watchdog.events import PatternMatchingEventHandler

    from .models import Media


    class MyHandler(PatternMatchingEventHandler):
        patterns = ["*.xml"]

        def process(self, event):
            """
            event.event_type
                'modified' | 'created' | 'moved' | 'deleted'
            event.is_directory
                True | False
            event.src_path
                path/to/observed/file
            """

            with open(event.src_path, 'r') as xml_source:
                xml_string = xml_source.read()
                parsed = xmltodict.parse(xml_string)
                element = parsed.get('Pulsar', {}).get('OnAir', {}).get('media')
                if not element:
                    return

                media = Media(
                    title=element.get('title1'),
                    description=element.get('title3'),
                    media_id=element.get('media_id1'),
                    hour=magicdate(element.get('hour')),
                    length=element.get('title4')
                )
                media.save()

        def on_modified(self, event):
            self.process(event)

        def on_created(self, event):
            self.process(event)


    if __name__ == '__main__':
        args = sys.argv[1:]
        observer = Observer()
        # watch the directory given as first argument, or the current one
        observer.schedule(MyHandler(), path=args[0] if args else '.')
        observer.start()

        try:
            while True:
                time.sleep(1)
        except KeyboardInterrupt:
            observer.stop()

        observer.join()
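For comparison, if you would rather not add xmltodict and magicdate as dependencies, the same fields can be extracted with the standard library alone. This is only a sketch: the SAMPLE document is a partial reconstruction based on the parsed structure shown earlier, not the original file, and media_fields is a name made up for this example.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Assumed sample, rebuilt from the keys of the parsed output shown earlier.
SAMPLE = """
<Pulsar>
  <OnAir>
    <media_type>default</media_type>
    <media>
      <title1>JOVEM PAN FM</title1>
      <title3>A maior rede de radio do Brasil</title3>
      <hour>2013-12-07 11:44:32</hour>
    </media>
  </OnAir>
</Pulsar>
"""


def media_fields(xml_string):
    """Return the children of Pulsar/OnAir/media as a dict, or None."""
    root = ET.fromstring(xml_string.strip())
    media = root.find("OnAir/media")
    if media is None:
        return None
    return {child.tag: child.text for child in media}


fields = media_fields(SAMPLE)
# Convert the timestamp explicitly instead of guessing its format.
hour = datetime.strptime(fields["hour"], "%Y-%m-%d %H:%M:%S")
print(fields["title1"], hour.isoformat())
```

The trade-off is that the date format has to be spelled out by hand, which is exactly the part magicdate handles for you.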
Supplement 
Using Python's Watchdog to monitor changes to a directory
