程式扎記: [ FP with Java ] Ch5 - Data handling with lists

Preface
This chapter covers:

* Classifying data structures in functional programming
* Using the ubiquitous singly linked list
* Understanding the importance of immutability
* Handling lists with recursion and functions

Data structures are among the most important concepts in programming, as well as in everyday life. The world as we see it is itself a huge data structure composed of simpler data structures, which are in turn composed of simpler structures. Each time we try to model something, be it objects or facts, we end up with data structures. There are many types of data structures. In computing, data structures are generally represented as a whole by the term collections. A collection is a group of data items that have some relation to each other. In the simplest form, this relation is the fact that they belong to the same group.

How to classify data collections
Data collections can be classified from many different points of view. You can classify data collections as linear, associative, and graph:

* Linear collections are collections in which elements are related along a single dimension. In such a collection, each element has a relation to the next element. The most common example of a linear collection is the list.
* Associative collections are collections that can be viewed as a function. Given an object o, a function f(o) will return true or false according to whether this object belongs to the collection or not. Unlike in linear collections, there’s no relation between the elements of the collection. These collections aren’t ordered, although it is possible to define an order on the elements. The most common examples of associative collections are the set and the associative array (which is also called a map or dictionary). We’ll study a functional implementation of maps in chapter 11.
* Graphs are collections in which each element is in relationships with multiple other elements. A particular example is the tree, and more specifically the binary tree, where each element is related to two other elements. You’ll learn more about trees from a functional perspective in chapter 10.

Different types of lists
In this chapter, we’ll focus on the most common type of linear collections, the list. The list is the most used data structure in functional programming, so it’s generally used to teach functional programming concepts. Be aware, however, that what you’ll learn in this chapter is not specific to lists but is shared by many other data structures (which may not be collections).

Lists can be further classified based on several different aspects, including the following:

* Access— Some lists will be accessed from one end only, and others will be accessed from both ends. Some will be written from one end and read from the other end. Finally, some lists may allow access to any element using its position in the list; the position of an element is also called its index.
* Type of ordering— In some lists, the elements will be read in the same order in which they were inserted. This kind of structure is said to be FIFO (first in, first out). In others, the order of retrieval will be the inverse of the order of insertion (LIFO, or last in, first out). Finally, some lists will allow you to retrieve the elements in a completely different order.
* Implementation— Access type and ordering are concepts strongly related to the implementation you choose for the list. If you choose to represent the list by linking each element to the next, you’ll get a completely different result, from the access point of view, than from an implementation based on an indexed array. Or if you choose to link each element to the next as well as to the previous element, you’ll get a list that can be accessed from both ends.

Figure 5.1 shows different types of lists offering different kinds of access. Note that this figure shows the principle behind each type of list, but not the way the lists are implemented.

Figure 5.1. Different types of lists offer different types of access to their elements.

Relative expected list performance
One very important criterion when choosing a type of list is the expected performance for various kinds of operations. Performance is often expressed in Big O notation. This notation is mainly used in mathematics, but when used in computing, it indicates the way the complexity of an algorithm changes when responding to a change of input size. When used to characterize the performance of list operations, this notation shows how the performance varies as a function of the length of the list. For example, consider the following performances:

* O(1)—This means that the time needed for an operation will be constant. (You may think of it as meaning that the time for one element will be multiplied by 1 for n elements.)
* O(log(n))—This means that the time for an operation on n elements will be the time for one element multiplied by log(n).
* O(n)—The time for n elements will be the time for one element multiplied by n.
* O(n^2)—The time for n elements will be the time for one element multiplied by n^2.

It would be ideal to create a data structure with O(1) performance for all types of operations. Unfortunately, this has not been found possible yet. Each type of list offers different performance for different operations. Indexed lists offer O(1) performance for data retrieval and near to O(1) for insertion. The singly linked list offers O(1) performance for insertion and retrieval on one end, and O(n) for the other end. Choosing the best structure is a compromise. Most often, you’ll seek O(1) performance for the most frequent operations, and you’ll have to accept O(log(n)) or even O(n) for some operations that don’t occur very often.

Be aware that this way of measuring performance has a real meaning for structures that can be scaled infinitely. This is not the case for the data structures we manipulate, because your structures are limited in size by the available memory. A structure with O(n) access time might always be faster than another one with O(1) due to this size limit. If the time for one element is much smaller for the first structure, its memory limitation may prevent the second from showing its benefits. It’s often better to have O(n) performance with an access time of 1 nanosecond to one element than O(1) with an access time of 1 millisecond. (The latter will be faster than the former only for sizes over 1,000,000 elements.)

Trading time against memory space, and time against complexity
You just saw that choosing an implementation for a data structure is generally a question of trading time against time. You’ll choose an implementation that’s faster on some operations, but slower on others, based on which operations are the most frequent. But there are other trading decisions to make.
Imagine you want a structure from which elements can be retrieved in sorted order, the smallest first. You might choose to sort the elements on insertion, or you might prefer to store them as they arrive and search for the smallest on retrieval only. One important criterion for making the decision would be whether the retrieved element is systematically removed from the structure. If not, it might be accessed several times without removal, so it would probably be better to sort the elements at insertion time, in order to avoid sorting them several times on retrieval. This use case corresponds to what’s called a priority queue, in which you’re waiting for a given element. You might test the queue many times until the expected element is returned. Such a use case requires that elements be sorted at insertion time.

But what if you want to access elements by several different sort orders? For example, you might want to access elements in the same order they were inserted, or in reverse order. The result might correspond to the doubly linked list of figure 5.1. It seems that in such a case, elements should be sorted at retrieval time. You might favor one order, leading to O(1) access time from one end and O(n) from the other end, or you might invent a different structure, perhaps giving O(log(n)) access time from both ends. Another solution would be to store two lists, one in insertion order and one in reverse order. This way, you’d have a slower insertion time, but O(1) retrieval from both ends. One drawback is that this approach would probably use more memory. Thus you can see that choosing the right structure might also be a question of trading time against memory space.

But you might also invent some structure minimizing both insertion time and retrieval time from both ends. These types of structures have already been invented, and you’d only have to implement them, but such structures are much more complex than the simplest ones, so you’d be trading time against complexity.

In-place mutation
Most data structures change over time because elements are inserted and removed. Basically, there are two ways to handle such operations. The first one is update in place.

Update in place consists of changing the elements of the data structure by mutating the structure itself. It would have been considered a good idea when all programs were single threaded, although it wasn’t. It’s much worse now that all programs are multithreaded. This doesn’t only concern replacing elements. It’s the same for adding or removing, sorting, and all operations that mutate the structure. If programs are allowed to mutate data structures, these structures simply can’t be shared without sophisticated protections that are rarely done right the first time, leading to deadlock, livelock, thread starving, stale data, and all those sorts of troubles.

So what’s the solution? Simply use immutable data structures. Many imperative programmers are shocked when they first read this. How can you do useful things with data structures if you can’t mutate them? After all, you often start with empty structures and want to add data to them. How can you possibly do this if they’re immutable?

Update in place
In a 1981 article titled “The transaction concept: virtues and limitations,” Jim Gray wrote this “The transaction concept: virtues and limitations”...hnical Report 81.3, June 1981),

Update in place: a poison apple? When bookkeeping was done with clay tablets or paper and ink, accountants developed some clear rules about good accounting practices. One of the cardinal rules is double-entry bookkeeping so that calculations are self checking, thereby making them fail-fast. A second rule is that one never alters the books; if an error is made, it is annotated and a new compensating entry is made in the books. The books are thus a complete history of the transactions of the business... Update-in-place strikes many systems designers as a cardinal sin: it violates traditional accounting practices which have been observed for hundreds of years.

The answer is simple. As with double-entry accounting, instead of changing what existed previously, you create new data to represent the new state. Instead of adding an element to an existing list, you create a new list with the added element. The main benefit is that if another thread was manipulating the list at insertion time, it’s not affected by the change because it doesn’t see it.

Generally, this conception immediately raises two protests:

* If the other thread doesn’t see the change, it’s manipulating stale data.
* Making a new copy of the list with the added element is a time- and memory-consuming process, so immutable data structures lead to very poor performance.

Both arguments are fallacious. The thread manipulating the “stale data” is in fact manipulating the data as it was when it started reading it. If inserting an element occurs after the manipulation is finished, there’s no concurrency problem. But if the insertion occurs while the manipulation is going on, what would occur with a mutable data structure? Either it wouldn’t be protected against concurrent access, and the data might be corrupted or the result false (or both), or some protection mechanism would lock the data, delaying the insertion until after the manipulation by the first thread is completed. In the second case, the end result would be exactly the same as with an immutable structure.

Persistent data structures
As you saw in the previous section, making a copy (sometimes called a defensive copy) of the data structure before inserting an element is often considered a time-consuming operation that leads to poor performance. This isn’t the case if you use data sharing, which is possible because immutable data structures are persistent. Figure 5.2 shows how elements could be removed and added to create a new, immutable, singly linked list with optimal performance.

Figure 5.2. Removing and adding elements without mutation or copying

As you can see, no copying occurs at all. The result is that such a list might be more performant for removing and inserting elements than a mutable list. So functional data structures (immutable and persistent) are not always slower than mutable ones. They’re often even faster (although they might be slower on some operations). In any case, they’re much safer.

An immutable, persistent, singly linked list implementation
The structure of the singly linked list shown in figures 5.1 and 5.2 is theoretical. The list can’t be implemented that way, because elements can’t be linked to one another. They’d have to be special elements to allow linking, and you want your lists to be able to store any elements. The solution is to devise a recursive list structure composed of the following:

* An element that will be the first element of the list, also called the head.
* The rest of the list, which is a list by itself and is called the tail.

Instead of using a Tuple with properties _1 and _2, you’ll create a specific List class with two properties: head and tail. This will simplify the handling of the Nil case. Figure 5.3 shows the structure of your list implementation.

Figure 5.3. The representation of the singly linked list implementation

Listing 5.1 shows the basic implementation of this list.
Listing 5.1. Singly linked lists

view plaincopy to clipboardprint?
package fp.utils;  
  
import java.util.ArrayList;  
  
public abstract class List {  
    public static  java.util.List iterate(T seed, Function fun, int loop)  
    {  
        java.util.List list = new ArrayList(); list.add(seed);  
        T tmp = seed;  
        for(int i=0; i1; i++)  
        {  
            tmp = fun.apply(tmp); list.add(tmp);  
        }  
          
        return list;  
    }  
      
    public abstract A head();  
    public abstract List tail();  
    public abstract boolean isEmpty();  
      
    @SuppressWarnings("rawtypes")  
    public static final List NIL = new Nil();  
      
    private List(){}  
      
    private static class Nil extends List  
    {  
        private Nil(){};  
        public A head(){ throw new IllegalStateException("head called an empty list");}  
        public List tail() { throw new IllegalStateException("tail called an empty list"); }  
        public boolean isEmpty(){ return true; }  
    }  
      
    private static class Cons extends List  
    {  
        private final A head;  
        private final List tail;  
          
        private Cons(A head, List tail) {  
            this.head = head;   
            this.tail = tail;  
        }  
        public A head(){ return head; }  
        public List tail(){ return tail; }  
        public boolean isEmpty(){ return true;}  
    }  
      
    @SuppressWarnings("unchecked")  
    public static  List list(){ return NIL; }  
      
    @SafeVarargs  
    public static  List list(A... a)  
    {  
        List n = list();  
        for(int i = a.length-1; i >= 0; i--)  
        {  
            n = new Cons<>(a[i], n);  
        }  
        return n;  
    }  
}  

The list class is implemented as an abstract class. The List class contains two private static subclasses to represent the two possible forms a List can take: Nil for an empty list, and Cons for a non-empty one. The List class defines three abstract methods: head(), which will return the first element of the list; tail(), which will return the rest of the list (without the first element); and isEmpty(), which will return true if the list is empty and false otherwise. The List class is parameterized with type parameter A, which represents the type of the list elements.

Subclasses have been made private, so you construct lists through calls to the static factory methods. These methods can be statically imported:

view plaincopy to clipboardprint?
import static fp.utils.List.*;  
import fp.utils.List;  

They can then be used without referencing the enclosing class, as follows:

view plaincopy to clipboardprint?
List ex1 = list();  
List ex2 = list(1);  
List ex3 = list(1, 2);  

Note that the empty list has no type parameter. In other words, it’s a raw type that can be used to represent an empty list of elements of any types. As such, creating or using an empty list will generate a warning by the compiler. The advantage is that you can use a singleton for the empty list. Another solution would have been to use a parameterized empty list, but this would have caused much trouble. You’d have had to create a different empty list for each type parameter. To solve this problem, you use a singleton empty list with no parameter type. This generates a compiler warning. In order to restrict this warning to the List class and not let it leak to the List users, you don’t give direct access to the singleton. That’s why there’s a (parameterized) static method to access the singleton, and a @SuppressWarnings("rawtypes") on the NIL property, as well as a @SuppressWarnings("unchecked") on the list() method.

Note that the list(A ... a) method is annotated with @SafeVarargs to indicate that the method doesn’t do anything that could lead to heap pollution. This method uses an imperative implementation based on a for loop. This isn’t very “functional,” but it’s a trade-off for simplicity and performance. If you insist on implementing it in a functional way, you can do so. All you need is a function taking an array as its argument and returning its last element, and another one to return the array without its last element. Here’s one possible solution:

view plaincopy to clipboardprint?
@SafeVarargs  
public static  List list(A... as) {  
  return list_(list(), as).eval();  
}  
  
public static  TailCall> list_(List acc, A[] as) {  
  return as.length == 0  
      ? ret(acc)  
      : sus(() -> list_(new Cons<>(as[as.length -1], acc),  
          Arrays.copyOfRange(as, 0, as.length - 1)));  
}  

Be sure, however, not to use this implementation, because it’s 10,000 times slower than the imperative one. This is a good example of when not to be blindly functional. The imperative version has a functional interface, and this is what you need. Note that recursion isn’t the problem. Recursion using TailCall is nearly as fast as iteration. The problem here is the copyOfRange method, which is very slow.

Data sharing in list operations
One of the huge benefits of immutable persistent data structures like the singly linked list is the performance boost provided by data sharing. You can already see that accessing the first element of the list is immediate. It’s just a matter of calling the head() method, which is a simple accessor for the head property. Removing the first element is equally fast. Just call the tail() method, which will return the tail property. Now let’s see how to get a new list with an additional element.

Exercise 5.1
Let's implement the instance functional method cons, adding an element at the beginning of a list. (Remember cons stands for construct.) This instance method has the same implementation for the Nil and Cons subclasses:

view plaincopy to clipboardprint?
public List cons(A a) {  
    return new Cons<>(a, this);  
}  

Exercise 5.2
Implement setHead, an instance method for replacing the first element of a List with a new value. You might think of implementing a static method for this, but you’d have to test for an empty list:

view plaincopy to clipboardprint?
public static  List setHead(List list, A h) {  
    if (list.isEmpty()) {  
        throw new IllegalStateException("setHead called on an empty list");  
    } else {  
        return new Cons<>(h, list.tail());  
    }  
}  

This makes little sense. As a general rule, if you find yourself forced to use an if ... else structure, you’re probably on the wrong path. Think of how you’d implement instance methods calling this static one. A much better solution is to add an abstract method to the List class:

view plaincopy to clipboardprint?
public abstract List setHead(A h);  

Implementation in the Nil subclass is straightforward. Just throw an exception, because trying to access the head of an empty list is considered a bug:

view plaincopy to clipboardprint?
public List setHead(A h) { throw new IllegalStateException("setHead called on empty list"); }  

The Cons implementation corresponds to the else clause of the static method:

view plaincopy to clipboardprint?
public List setHead(A h) { return new Cons<>(h, tail()); }  

And if you need a static method, it can simply call the instance implementation:

view plaincopy to clipboardprint?
public static  List setHead(List list, A h) {  
  return list.setHead(h);  
}  

Exercise 5.3
Write a toString method to display the content of a list. An empty list will be displayed as "[NIL]", and a list containing the integers from 1 to 3 will be displayed as "[1, 2, 3, NIL]". For a list of arbitrary objects, the toString method will be called to display each object. The Nil implementation is very simple:

view plaincopy to clipboardprint?
public String toString() { return "[NIL]"; }  

The cons method is recursive and uses a StringBuilder as the accumulator. Note that the StringBuilder, although it’s a mutable object, has a functional-friendly append method, because it returns the mutated StringBuilder instance.

view plaincopy to clipboardprint?
public String toString() {  
    return String.format("[%sNIL]",  
                           toString(new StringBuilder(), this).eval());  
}  
private TailCall toString(StringBuilder acc, List list) {  
    return list.isEmpty()  
          ? ret(acc)  
          : sus(() -> toString(acc.append(list.head()).append(", "),  
                              list.tail()));  
}  

More list operations
You can rely on data sharing to implement various other operations in a very efficient way—often more efficiently than what can be done with mutable lists. In the rest of this section, you’ll add functionality to the linked list based on data sharing.

Exercise 5.4
The tail method, although it doesn’t mutate the list in any way, has the same effect as removing the first element. Write a more general method, drop, that removes the first n elements from a list. Of course, this method won’t remove the element, but will return a new list corresponding to the intended result. This “new” list won’t be anything new, because data sharing will be used, so nothing will be created. Figure 5.4 shows how you should proceed.

Figure 5.4. Dropping the n first elements of a list while not mutating or creating anything.

The signature of the method will be

view plaincopy to clipboardprint?
public List drop(int n);  

You should use recursion to implement the drop method. And don’t forget to consider every special case, such as an empty list, or n being higher than the list length.

Here, you have the choice to implement a static method or instance methods. Instance methods are needed if you want to use object notation, which is much easier to read. For example, if you want to drop two elements of a list of integers and then replace the first element of the result with 0, you could use static methods:

view plaincopy to clipboardprint?
List newList = setHead(drop(list, 2), 0);  

Using object notation makes the code much easier to read:

view plaincopy to clipboardprint?
List newList = drop(list, 2).setHead(0);  

The implementation of the drop method in the Nil class simply returns this:

view plaincopy to clipboardprint?
public List drop(int n) { return this; }  

In the Cons class, you use a private helper method to implement recursion in the same way you learned in chapter 4.

view plaincopy to clipboardprint?
public List drop(int n) {  
    return n <= 0  
          ? this  
          : drop_(this, n).eval();  
}  
  
private TailCall> drop_(List list, int n) {  
    return n <= 0 || list.isEmpty()  
          ? ret(list)  
          : sus(() -> drop_(list.tail(), n - 1));  
}  

Note that you have to test for an empty list parameter. This wouldn’t be necessary if the drop method were recursive. But only the drop_ helper method is recursive, and this method isn’t defined for Nil. Forgetting to test for the empty list would result in an exception being thrown while calling list.tail(). Of course, you’d need a better way to handle this case. After all, dropping four elements of a list of three makes little sense. You could throw an exception, but it would be better to use more-functional techniques that you’ll learn in the next chapter.

Exercise 5.5
Implement a dropWhile method to remove elements from the head of the List as long as a condition holds true. Here’s the signature to add to the List abstract class:

view plaincopy to clipboardprint?
public abstract List dropWhile(Function f);  

We won’t look at the Nil implementation because it will only return this. The implementation for the Cons class is recursive:

view plaincopy to clipboardprint?
@Override  
public List dropWhile(Function f) {  
    return dropWhile_(this, f).eval();  
}  
  
private TailCall> dropWhile_(List list,  
                                     Function f) {  
  return !list.isEmpty() && f.apply(list.head())  
      ? sus(() -> dropWhile_(list.tail(), f))  
      : ret(list);  
}  

Note that when calling dropWhile on an empty list, you may face a problem. The following code, for example, won’t compile:

view plaincopy to clipboardprint?
list().dropWhile(f)  

The reason for this is that Java is unable to infer the type of the list from the function you pass to the dropWhile method. Let’s say you’re dealing with a list of integers. You can then use either this solution:

view plaincopy to clipboardprint?
List list = list();  
list.dropWhile(f);  

or this one:

view plaincopy to clipboardprint?
List.list().dropWhile(f);  

Concatenating lists
A very common operation on lists consists of “adding” one list to another to form a new list that contains all elements of both original lists. It would be nice to be able to simply link both lists, but this isn’t possible. The solution is to add all elements of one list to the other list. But elements can only be added to the front (head) of the list, so if you want to concatenate list1 to list2, you must start by adding the last element of list1 to the front of list2, as indicated in figure 5.6.

Figure 5.6. Sharing data by concatenation. You can see that both lists are preserved and that list2 is shared by the resulting list. But you can also see that you can’t proceed exactly as is indicated in the figure, because you’d have to access the last element of list1 first, which isn’t possible due to the structure of the list.

One way to proceed is to first reverse list1, producing a new list, and then add each element to list2, this time starting from the head of the reversed list. But you haven’t yet defined a reverse method. Can you still define concat? Yes you can. Just consider how you could define this method:

* If list1 is empty, return list2.
* Else return the addition of the first element (list1.head) of list1 to the concatenation of the rest of list1 (list1.tail) to list2.

This recursive definition can be translated into code as follows:

view plaincopy to clipboardprint?
public static  List concat(List list1, List list2) {  
      return list1.isEmpty()  
          ? list2  
          : new Cons<>(list1.head(), concat(list1.tail(), list2));  
}  

The beauty of this solution (for some readers) is that you don’t need a figure to expose how it works, because it isn’t “working.” It’s just a mathematical definition translated into code. The main drawback of this definition (for other readers) is that, for the same reason, you can’t easily represent it in a figure. This may sound like humor, but it’s not. Both solutions represent exactly the same “operation,” but one represents the process (from which you can see the result) and the other expresses the result directly. Whichever is better is a matter of choice. But functional programming most often involves thinking in terms of what the intended result is, rather than how to obtain it. Functional code is a direct translation of the definition into code.

Obviously, this code will overflow the stack if list1 is too long, although you’ll never have a stack problem with the length of list2. The consequence is that you won’t have to worry if you’re careful to only add small lists to the front end of lists of any length. An important point to note is that what you’re actually doing is adding elements of the first list, in reverse order, to the front of the second list. This is obviously different from the common sense understanding of concatenation: adding the second list to the tail of the first one. This is definitely not how it works with the singly linked list.

If you need to concatenate lists of arbitrary length, you can just apply what you learned in chapter 4 to make the concat method stack-safe. If you ponder what you’ve done, you might guess that there’s much room left for abstraction here. What if the concat method were only a specific application of a much more general operation? Maybe you could abstract this operation, make it stack-safe, and then reuse it to implement many other operations? Wait and see!

You may have noticed that the complexity of this operation (and hence the time it’ll take to be executed by Java) is proportional to the length of the first list. In other words, if you concatenate list1 and list2, of length n1 and n2, the complexity is O(n1), which means it’s independent of n2. In other words, depending on n2, this operation may be more efficient than concatenating two mutable lists in imperative Java.

Exercise 5.6
Write a method to remove the last element from a list. This method should return the resulting list. Implement it as an instance method with the following signature:

view plaincopy to clipboardprint?
List init()  

There might be a way to express this function in terms of another one, and one we’ve already spoken about. Maybe now would be the right time to create this helper function. To remove the last element, you have to traverse the list (from front to back) and build up the new list (from back to front, because the “last” element in a list must be Nil). This is a consequence of the way lists are created with Cons objects. This results in a list with the elements in reverse order, so the resulting list must be reversed. That means you only have to implement a reverse method:

view plaincopy to clipboardprint?
public List reverse() { return reverse_(list(), this).eval();}  
  
private TailCall> reverse_(List acc, List list) {  
    return list.isEmpty()  
          ? ret(acc)  
          : sus(() -> reverse_(new Cons<>(list.head(), acc), list.tail()));  
}  

With the reverse method, you can implement init very easily:

view plaincopy to clipboardprint?
public List init() { return reverse().tail().reverse(); }  

Of course, these are the implementations for the Cons class. In the Nil class, the reverse method returns this, and the init method throws an exception.

程式扎記

標籤

2017年12月9日星期六

[ FP with Java ] Ch5 - Data handling with lists - Part1

沒有留言:

張貼留言

[Git 常見問題] error: The following untracked working tree files would be overwritten by merge

檢舉濫用情形

學習筆記

標籤

2017年12月9日 星期六