This chapter covers
In chapter 5, you created your first data structure, the singly linked list. At that point, you didn’t have at your disposal all the techniques needed to make it a complete tool for data handling. One particularly useful tool you were missing was some way to represent operations producing optional data, or operations that can produce an error. In chapters 6 and 7, you learned how to represent optional data and errors. In this chapter, you’ll learn how to compose operations that produce optional data or errors with lists.
You also developed some functions that were far from optimal, such as length, and I said that you’d eventually learn more-efficient techniques for these operations. In this chapter, you’ll learn how to implement these techniques. You’ll also learn how to automatically parallelize some list operations in order to benefit from the multicore architecture of today’s computers.
The problem with length
Folding a list involves starting with a value and composing it successively with each element of the list. This obviously takes an amount of time proportional to the length of the list. Is there any way to make this operation faster? Or, at least, is there a way to make it appear faster? As an example of a fold application, you created a length method in List in exercise 5.9 with the following implementation:
You can compare the preceding operation with one that computes the sum of a list of integers:
There are other operations that can be applied to lists in this way, and, among them, several for which the type of the list elements is irrelevant:
Some operations may depend on some characteristics of the element’s type, but not on the specific type itself. For example, a max method that returns the maximum element of a list will only need the type to be Comparable or a Comparator.
The performance problem
All these methods can be implemented using a fold, but such implementations have a major drawback: the time needed to compute the result is proportional to the length of the list. Imagine you have a list of about a million elements, and you want to check the length. Counting the elements may seem the only way to go (this is what the fold-based length method does). But if you were adding elements to the list until it reaches a million, you surely wouldn’t count the elements after adding each one.
In such a situation, you’d keep a count of the elements somewhere, and add one to this count each time you added an element to the list. Maybe you’d have to count once if you were starting with a non-empty list, but that’s it. This technique is what you learned in chapter 4: memoization. The question is, where can you store the memoized value? The answer is obvious: in the list itself.
The benefit of memoization
Maintaining a count of the elements in a list will take some time, so adding an element to a list will be slightly slower than if you didn’t keep the count. It might look like you’re trading time against time. If you build a list of 1,000,000 elements, you’ll lose 1,000,000 times the amount of time needed to add one to the count. In compensation, however, the time needed to get the length of the list will be near 0 (and obviously constant). Maybe the total time lost in incrementing the count will equal the gain when calling length. But as soon as you call length more than once, the gain is absolutely obvious.
The drawbacks of memoization
Memoization can turn a function that works in O(n) time (time proportional to the number of elements) into O(1) time (constant time). This is a huge benefit, although it has a time cost, because it makes the insertion of elements slightly slower. But slowing insertion is generally not a big problem.
A much more important problem is the increase in memory space. Data structures implementing in-place mutation don’t have this problem. In a mutable list, nothing keeps you from memoizing the list length as a mutable integer, which takes only 32 bits. But with an immutable list, you have to memoize the length in each element. It’s difficult to know the exact increase in size, but if the size of a singly linked list is around 40 bytes per node (for the nodes themselves), plus two 32-bit references for the head and the tail (on a 32-bit JVM), this would result in about 100 bytes per element. In this case, adding the length would cause an increase of slightly over 30%. The result would be the same if the memoized values were references, such as memoizing the max or min of a list of Comparable objects. On a 64-bit JVM, it’s even more difficult to calculate due to some optimization in the size of the references, but you get the idea.
Sizes of object references
For more information about the size of object references in Java 7 and Java 8, see Oracle’s documentation on compressed oops (http://mng.bz/TjY9) and JVM performance enhancements (http://mng.bz/8X0o).
It’s up to you to decide whether you want to use memoization in your data structures. It may be a valid option for functions that are often called and don’t create new objects for their results. For example, the length and hashCode functions return integers, and the max and min functions return references to already existing objects, so they may be good candidates. On the other hand, the toString function creates new strings that would have to be memoized, so that would probably be a huge waste of memory space. The other factor to take into consideration is how often the function is used. The length function may be used more often than hashCode, because using lists as map keys is not a common practice.
Let's create a memoized version of the length method. Its signature in the List class will be (Exercise 8.1):
Note that memoizing the maximum or minimum value in a list of Comparable could be done the same way (although with a static method), but it wouldn’t help in the case where you want to remove the max or min value from the list. Min or max elements are often accessed to retrieve elements by priority. In that case, the elements’ compareTo method would compare their priorities. Memoizing priority would let you know immediately which element has the maximum priority, but it wouldn’t help much because what you often need is to remove the corresponding element. For such use cases, you’ll need a different data structure, which you’ll learn to create in chapter 11.
As I said, it’s up to you to decide if you should memoize some functions of the List class. A few experiments should help you make your decision. Measuring the available memory size just before and after the creation of a list of 1,000,000 integers shows a very small increase when using memoization. Although this measurement method isn’t very precise, the average decrease in available memory is about 22 MB in both cases (with or without memoization), varying between 20 MB and 25 MB. This shows that the theoretical increase of 4 MB (1,000,000 x 4 bytes) isn’t as significant as you’d expected. On the other hand, the increase in performance is huge. Asking for the length ten times might cost more than 200 milliseconds without memoization. With memoization, the time is 0 (too short a time to be measured in milliseconds).
Note that although adding an element increases the cost (adding one to the tail length and storing the result), removing an element has zero cost, because the tail length is already memoized. Another way to go, if memoization isn’t desirable, is to optimize the length method. Instead of using a fold, you can resort to imperative style, with a loop and a local mutable variable. Here’s the length implementation borrowed from the Scala List class:
Composing List and Result
In the previous chapter, you saw that Result and List are very similar data structures, mainly differing in their cardinality but sharing some of their most important methods, such as map, flatMap, and even foldLeft and foldRight. You saw how lists could be composed with lists, and results with results. Now, you’re going to see how results can be composed with lists.
Methods on List returning Result
At this point, you’ve noticed that I try to avoid accessing the elements of results and lists directly. Accessing the head or the tail of a list will throw an exception if the list is Nil, and throwing an exception is one of the worst things that can happen in functional programming. But you saw that you could safely access the value in a Result by providing a default value to be used in the case of a failure or empty result. Can you do the same when accessing the head of a list? Not exactly, but you can return a Result.
Let's implement a headOption method in List<A> that will return a Result<A>. Use the following abstract method declaration in List, and implement it in each subclass (Exercise 8.2):
Converting from List%lt;Result> to Result<List>
When a list contains the results of some computations, it will often be a List<Result>. For example, mapping a function from T to Result<U> on a list of T will produce a list of Result<U>. Such values will often have to be composed with functions taking a List<T> as their argument. This means you’ll need a way to convert the resulting List<Result<U>> into a List<U>, which is the same kind of flattening involved in the flatMap method, with the huge difference that two different data types are involved: List and Result. You can apply several strategies to this conversion:
The first solution would correspond to a list of results where all results are optional. The second solution means that there should be at least one success in the list for the result to be a success. The third solution corresponds to the case where all results are mandatory. Let's write a method called flattenResult that takes a List<Result<A>> as its argument and returns a List<A> containing all the success values in the original list, ignoring the failures and empty values (Exercise 8.5). This will be a static method in List<A> with the following signature:
Write a sequence function that combines a List<Result<T>> into a Result<List<T>>. It will be a Success<List<T>> if all values in the original list were Success instances, or a Failure<List<T>> otherwise. Here’s its signature:
Define a more generic traverse method that traverses a list of A while applying a function from A to Result and producing a Result
Many common use cases of the List data type deserve to be abstracted so you don’t have to repeat the same code again and again. You’ll regularly find yourself discovering new use cases that can be implemented by combining basic functions. You should never hesitate to incorporate these use cases as new functions in the List class. The following exercises show several of the most common use cases.
Zipping and unzipping lists
Zipping is the process of assembling two lists into one by combining the elements of the same index. Unzipping is the reverse procedure, consisting of making two lists out of one by “deconstructing” the elements, such as producing two lists of x and y coordinates from one list of points. Let's write a zipWith method that combines the elements of two lists of different types to produce a new list, given a function argument. Here’s the signature (Exercise 8.8):
The previous exercise consisted of creating a list by matching elements of both lists by their indexes. Write a product method that will produce a list of all possible combinations of elements taken from both lists (Exercise 8.9). In other words, given the two lists list("a", "b", "c") and list("d", "e", "f") and string concatenation, the product of the two lists should be List("ad", "ae", "af", "bd", "be", "bf", "cd", "ce", "cf").
The solution is similar to the comprehension pattern you used to compose Result in chapter 7. The only difference here is that it produces as many combinations as the product of the number of elements in the lists, whereas for combining Result, the number of combinations was always limited to one.
The second line will only produce the list of tuples built from elements with the same index:
Of course, you may use any constructor of any class. (Java objects are in fact tuples with special names.)
Now let's write an unzip static method to transform a list of tuples into a tuple of lists. Here’s its signature (Exercise 8.10):
The singly linked list isn’t the best structure for indexed access to its elements, but sometimes it’s necessary to use indexed access. As usual, you should abstract such a procedure into List methods.
Let's write a getAt method that takes an index as its argument and returns the corresponding element. The method should not throw an exception in the case of the index being out of bounds (Exercise 8.12). This time, start with an explicitly recursive version. Then try to answer the following questions:
The explicitly recursive solution is easy:
This looks like the best possible recursive solution. Is it possible to use a fold? Yes, it is, and it should be a left fold. But the solution is tricky:
Let's find a solution that makes the fold-based version terminate as soon as the result is found (Exercise 8.13). You’ll need a special version of foldLeft for this, and also a special version of Tuple.
First, you need a special version of foldLeft in which you can escape the fold when the absorbing element (or “zero” element) of the folding operation is found. Think of a list of integers that you want to fold by multiplying them. The absorbing element for the multiplication is 0. Here’s the declaration of a short-circuiting (or escaping) version of foldLeft in the List class:
It’s by analogy that the absorbing element of any operation is sometimes called “zero,” but remember that it’s not always equal to 0. The 0 value is only the absorbing element for multiplication. For the addition of positive integers, it would be infinity. And here’s the Cons implementation:
Sometimes you need to split a list into two parts at a specific position. Although the singly linked list is far from ideal for this kind of operation, it’s relatively simple to implement. Splitting a list has several useful applications, among which is processing its parts in parallel using several threads.
Let's write a splitAt method that takes an int as its parameter and returns two lists by splitting the list at the given position. There shouldn’t be any IndexOutOfBoundExceptions. Instead, an index below 0 should be treated as 0, and an index above max should be treated as the maximum value for the index. (Exercise 8.14)
An explicitly recursive solution is easy to design:
Searching for sublists
One common use case for lists is searching to find out whether a list is contained in another (longer) list. In other words, you want to know whether a list is a sublist of another list.
Let's implement a hasSubList method to check whether a list is a sublist of another. For example, the list (3, 4, 5) is a sublist of (1, 2, 3, 4, 5) but not of (1, 2, 4, 5, 6). Implement it as a static method with the following signature: (Exercise 8.16)
Many other useful functions can be developed to work with lists. The following exercises will give you some practice in this domain. Note that the proposed solutions are certainly not the only ones. Feel free to invent your own.
Create a groupBy method taking a function from A to B as a parameter and returning a Map, where keys are the result of the function applied to each element of the list and values are lists of elements corresponding to each key. In other words, given a list of Payments such as these (Exercise 8.17):
Now let's write an unfold method that takes a starting element S and a function f from S to Result
To continue, let's write a range method that takes two integers as its parameters and produces a list of all integers greater than or equal to the first and less than the second (Exercise 8.19). This is very simple if you reuse the method from exercise 8.18: