2018年5月29日 星期二

[ FP In Python ] Ch3. Lazy Evaluation

Preface 
A powerful feature of Python is its iterator protocol (which we will get to shortly). This capability is only loosely connected to functional programming per se, since Python does not quite offer lazy data structures in the sense of a language like Haskell. However, use of the iterator protocol—and Python's many built-in or standard library iterables—accomplishes much the same effect as an actual lazy data structure. Let us explain the contrast here in slightly more detail. In a language like Haskell, which is inherently lazily evaluated, we might define a list of all the prime numbers in a manner like the following:
  -- Define a list of ALL the prime numbers
  primes = sieve [2 ..]
    where sieve (p:xs) = p : sieve [x | x <- xs, x `rem` p > 0]
This report is not the place to try to teach Haskell, but you can see a comprehension in there, which is in fact the model that Python used in introducing its own comprehensions. There is also deep recursion involved, which is not going to work in Python. Apart from syntactic differences, or even the ability to recurse to indefinite depth, the significant difference here is that the Haskell version of primes is an actual (infinite) sequence, not just an object capable of sequentially producing elements (as was the primes object we demonstrated in the chapter entitled “Callables”). In particular, you can index into an arbitrary element of the infinite list of primes in Haskell, and the intermediate values will be produced internally as needed based on the syntactic construction of the list itself. 

Mind you, one can replicate this in Python too, it just isn’t in the inherent syntax of the language and takes more manual construction. Given the get_primes() generator function discussed earlier, we might write our own container to simulate the same thing, for example: 
  def get_primes():
      "Simple lazy Sieve of Eratosthenes"
      candidate = 2
      found = []
      while True:
          if all(candidate % prime != 0 for prime in found):
              yield candidate
              found.append(candidate)
          candidate += 1


  from collections.abc import Sequence

  class ExpandingSequence(Sequence):
      def __init__(self, it):
          self.it = it
          self._cache = []

      def __getitem__(self, index):
          while len(self._cache) <= index:
              self._cache.append(next(self.it))
          return self._cache[index]

      def __len__(self):
          return len(self._cache)
This new container can be both lazy and also indexable:
>>> primes = ExpandingSequence(get_primes())
>>> for _, p in zip(range(10), primes):
...     print(p, end=" ")
...
2 3 5 7 11 13 17 19 23 29
>>> primes[10]
31
>>> primes[5]
13
>>> primes[100]  # forces the while loop to fill the cache up to index 100
547
>>> len(primes)
101

Of course, there are other custom capabilities we might want to engineer in, since lazy data structures are not inherently intertwined into Python. Maybe we’d like to be able to slice this special sequence. Maybe we’d like a prettier representation of the object when printed. Maybe we should report the length as inf if we somehow signaled it was meant to be infinite. All of this is possible, but it takes a little bit of code to add each behavior rather than simply being the default assumption of Python data structures. 
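As one sketch of such an addition (this variant class and its _fill_to() helper are our own illustration, not from the text), slicing support might look like this:

```python
from collections.abc import Sequence

class SliceableExpandingSequence(Sequence):
    "Hypothetical variant of ExpandingSequence that also supports slicing."
    def __init__(self, it):
        self.it = it
        self._cache = []

    def _fill_to(self, index):
        # Pull from the underlying iterator until the cache covers `index`
        while len(self._cache) <= index:
            self._cache.append(next(self.it))

    def __getitem__(self, index):
        if isinstance(index, slice):
            # Only bounded slices make sense on a possibly infinite sequence
            if index.stop is None:
                raise ValueError("cannot take an unbounded slice of a lazy sequence")
            self._fill_to(index.stop - 1)
        else:
            self._fill_to(index)
        return self._cache[index]

    def __len__(self):
        # Reports only what has been computed so far
        return len(self._cache)

    def __repr__(self):
        return f"<lazy sequence, {len(self._cache)} items computed so far>"
```

Handling negative indices, steps, and an "infinite" length report would take similar incremental additions.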

The Iterator Protocol 
The easiest way to create an iterator—that is to say, a lazy sequence—in Python is to define a generator function, as was discussed in the chapter entitled "Callables." Simply use the yield statement within the body of a function to define the places (usually in a loop) where values are produced. Or, technically, the easiest way is to use one of the many iterable objects already produced by built-ins or the standard library rather than programming a custom one at all. Generator functions are syntactic sugar for defining a function that returns an iterator.
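For instance, this tiny generator function (our own illustrative example, not from the text) acts as a factory for iterators, and no body code runs until the iterator is advanced:

```python
def countdown(n):
    "Generator function: calling it returns an iterator; no body runs yet."
    while n > 0:
        yield n
        n -= 1

it = countdown(3)
assert iter(it) is it          # an iterator's __iter__ returns itself
assert next(it) == 3           # the body executes lazily, one yield at a time
assert list(it) == [2, 1]      # consuming the remaining values
```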


Many objects have a method named .__iter__(), which will return an iterator when it is called, generally via the iter() built-in function, or even more often simply by looping over the object (e.g., for item in collection: ...). An iterator is the object returned by a call to iter(something), which itself has a method named .__iter__() that simply returns the object itself, and another method named .__next__(). The reason the iterator itself still has an .__iter__() method is to make iter() idempotent. That is, this identity should always hold (or raise TypeError("object is not iterable")):
  iter_seq = iter(sequence)
  iter(iter_seq) == iter_seq
The above remarks are a bit abstract, so let us look at a few concrete examples: 
>>> lazy = open('06-laziness.md')  # iterate over lines of file
>>> '__iter__' in dir(lazy) and '__next__' in dir(lazy)
True
>>> plus1 = map(lambda x: x+1, range(10))
>>> plus1  # iterate over deferred computations
<map object at 0x...>
>>> '__iter__' in dir(plus1) and '__next__' in dir(plus1)
True
>>> def to10():
...     for i in range(10):
...         yield i
...
>>> '__iter__' in dir(to10)
False
>>> '__iter__' in dir(to10()) and '__next__' in dir(to10())
True
>>> l = [1,2,3]
>>> '__iter__' in dir(l)
True
>>> '__next__' in dir(l)
False
>>> li = iter(l)  # iterate over concrete collection
>>> li
<list_iterator object at 0x...>
>>> li == iter(li)
True

In a functional programming style—or even just generally for readability— writing custom iterators as generator functions is most natural. However, we can also create custom classes that obey the protocol; often these will have other behaviors (i.e., methods) as well, but most such behaviors necessarily rely on statefulness and side effects to be meaningful. For example: 
  from collections.abc import Iterable

  class Fibonacci(Iterable):
      def __init__(self):
          self.a, self.b = 0, 1
          self.total = 0

      def __iter__(self):
          return self

      def __next__(self):
          self.a, self.b = self.b, self.a + self.b
          self.total += self.a
          return self.a

      def running_sum(self):
          return self.total
So you can test it this way: 
>>> fib = Fibonacci()
>>> fib.running_sum()
0
>>> for _, i in zip(range(10), fib):
...     print(i, end=" ")
...
1 1 2 3 5 8 13 21 34 55
>>> fib.running_sum()
143
>>> next(fib)
89

This example is trivial, of course, but it shows a class that both implements the iterator protocol and also provides an additional method to return something stateful about its instance. Since this sort of statefulness is exactly what functional programming avoids, in a functional programming style we will generally avoid classes like this.

Module: itertools 
The module itertools is a collection of very powerful—and carefully designed—functions for performing iterator algebra. That is, these allow you to combine iterators in sophisticated ways without having to concretely instantiate anything more than is currently required. As well as the basic functions in the module itself, the module documentation provides a number of short, but easy to get subtly wrong, recipes for additional functions that each utilize two or three of the basic functions in combination. The third-party module more_itertools mentioned in the Preface provides additional functions that are likewise designed to avoid common pitfalls and edge cases. 

The basic goal of using the building blocks inside itertools is to avoid performing computations before they are required, to avoid the memory requirements of a large instantiated collection, to avoid potentially slow I/O until it is strictly required, and so on. Iterators are lazy sequences rather than realized collections, and when combined with functions or recipes in itertools they retain this property.

Here is a quick example of combining a few things. Rather than the stateful Fibonacci class to let us keep a running sum, we might simply create a single lazy iterator to generate both the current number and this sum: 
  def fibonacci():
    a, b = 1, 1
    while True:
      yield a
      a, b = b, a+b

>>> from itertools import tee, accumulate
>>> s, t = tee(fibonacci())  # split into two independent iterators s and t
>>> pairs = zip(t, accumulate(s))
>>> for _, (fib, total) in zip(range(7), pairs):
...     print(fib, total)
...
1 1
1 2
2 4
3 7
5 12
8 20
13 33

Figuring out exactly how to use the functions in itertools correctly and optimally often requires careful thought, but once combined they provide remarkable power for dealing with large, or even infinite, iterators that could not be handled as concrete collections. The documentation for the itertools module contains details on its combinatorial functions as well as a number of short recipes for combining them. This report does not have space to repeat those descriptions, so exhibiting a few of them above will suffice.
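As a single illustration of the flavor of those recipes, the take() recipe from the itertools documentation realizes just a finite prefix of a possibly infinite iterator:

```python
from itertools import islice, count

def take(n, iterable):
    "Return the first n items of the iterable as a list (an itertools recipe)."
    return list(islice(iterable, n))

# Safe even on an infinite iterator, since only n items are ever pulled
take(5, count(10))   # → [10, 11, 12, 13, 14]
```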

Note that for practical purposes, zip(), map(), filter(), and range() (which is, in a sense, just a terminating itertools.count()) could well live in itertools if they were not built-ins. That is, all of those functions lazily generate sequential items (mostly based on existing iterables) without creating a concrete sequence. Built-ins like all(), any(), sum(), min(), max(), and functools.reduce() also act on iterables, but all of them, in the general case, need to exhaust the iterator rather than remain lazy. The function itertools.product() might be out of place in its module since it also creates concrete cached sequences, and cannot operate on infinite iterators.
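A sketch of that distinction: the lazy builders compose freely over infinite iterators, while the reducers must be handed something finite.

```python
from itertools import count, islice

# map() over an infinite iterator is fine: nothing is computed yet
squares = map(lambda n: n * n, count())

# Forcing a bounded prefix computes only those five elements
first_five = list(islice(squares, 5))    # [0, 1, 4, 9, 16]

# By contrast, sum() must exhaust its argument, so we bound it first;
# sum(count()) would simply never return
total = sum(islice(count(), 10))         # 0+1+...+9 == 45
```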

Chaining Iterables 
The itertools.chain() and itertools.chain.from_iterable() functions combine multiple iterables. Built-in zip() and itertools.zip_longest() also do this, of course, but in manners that allow incremental advancement through the iterables. A consequence of this is that while chaining infinite iterables is valid syntactically and semantically, no actual program will exhaust the earlier iterable. For example: 
  from itertools import chain, count
  thrice_to_inf = chain(count(), count(), count())
Conceptually, thrice_to_inf will count to infinity three times, but in practice once would always be enough. However, for merely large iterables—not for infinite ones—chaining can be very useful and parsimonious: 
  def from_logs(fnames):
    yield from (open(file) for file in fnames)

  lines = chain.from_iterable(from_logs(['huge.log', 'gigantic.log']))
Notice that in the example given, we didn't even need to pass in a concrete list of files—that sequence of filenames itself could be a lazy iterable per the API given. A more concrete example:
>>> from itertools import chain, count
>>> alist = [1, 2, 3]; blist = ['a', 'b', 'c']
>>> elist = chain.from_iterable([alist, blist])
>>> elist  # lazy evaluation: nothing consumed yet
<itertools.chain object at 0x...>
>>> for e in elist:
...     print('{} '.format(e), end='')
...
1 2 3 a b c
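Returning to thrice_to_inf above, itertools.islice() lets us consume a bounded prefix of an infinite chain safely, as this sketch shows:

```python
from itertools import chain, count, islice

thrice_to_inf = chain(count(), count(), count())

# Only the first count() is ever touched; the chain itself stays lazy
prefix = list(islice(thrice_to_inf, 5))   # [0, 1, 2, 3, 4]

# Chaining finite and infinite iterables also works: the finite one
# is exhausted first, then the infinite one takes over
mixed = chain(['a', 'b'], count())
head = list(islice(mixed, 4))             # ['a', 'b', 0, 1]
```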

Besides the chaining with itertools, we should mention collections.ChainMap() in the same breath. Dictionaries (or generally any collections.abc.Mapping) are iterable (over their keys). Just as we might want to chain multiple sequence-like iterables, we sometimes want to chain together multiple mappings without needing to create a single larger concrete one. ChainMap() is handy, and does not alter the underlying mappings used to construct it. 
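A brief sketch of ChainMap in action (the dictionaries here are our own illustrative data):

```python
from collections import ChainMap

defaults = {"color": "red", "user": "guest"}
overrides = {"user": "admin"}

settings = ChainMap(overrides, defaults)

# Lookup searches the maps left to right; no merged dict is ever built
assert settings["user"] == "admin"     # found in overrides first
assert settings["color"] == "red"      # falls through to defaults

# The underlying mappings are not altered
assert defaults["user"] == "guest"

# Iteration covers the keys of all chained maps
assert set(settings) == {"user", "color"}
```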

Supplement 
Python3 Course - Generators
