This chapter covers:
Data structures are among the most important concepts in programming, as well as in everyday life. The world as we see it is itself a huge data structure composed of simpler data structures, which are in turn composed of simpler structures. Each time we try to model something, be it objects or facts, we end up with data structures. There are many types of data structures. In computing, data structures as a group are generally referred to by the term collections. A collection is a group of data items that have some relation to each other. In the simplest form, this relation is the fact that they belong to the same group.
How to classify data collections
Data collections can be classified from many different points of view. You can classify data collections as linear, associative, and graph:
Different types of lists
In this chapter, we’ll focus on the most common type of linear collections, the list. The list is the most used data structure in functional programming, so it’s generally used to teach functional programming concepts. Be aware, however, that what you’ll learn in this chapter is not specific to lists but is shared by many other data structures (which may not be collections).
Lists can be further classified based on several different aspects, including the following:
Figure 5.1 shows different types of lists offering different kinds of access. Note that this figure shows the principle behind each type of list, but not the way the lists are implemented.
Figure 5.1. Different types of lists offer different types of access to their elements.
Relative expected list performance
One very important criterion when choosing a type of list is the expected performance for various kinds of operations. Performance is often expressed in Big O notation. This notation is mainly used in mathematics, but when used in computing, it indicates the way the complexity of an algorithm changes when responding to a change of input size. When used to characterize the performance of list operations, this notation shows how the performance varies as a function of the length of the list. For example, consider the following performances:
It would be ideal to have a data structure offering O(1) performance for all types of operations. Unfortunately, no such structure has been found yet. Each type of list offers different performance for different operations. Indexed lists offer O(1) performance for data retrieval and near-O(1) performance for insertion. The singly linked list offers O(1) performance for insertion and retrieval at one end, and O(n) at the other end. Choosing the best structure is a compromise. Most often, you’ll seek O(1) performance for the most frequent operations, and you’ll have to accept O(log(n)) or even O(n) for operations that don’t occur very often.
Be aware that this way of measuring performance is truly meaningful only for structures that can scale infinitely. That’s not the case for the data structures we manipulate, because they’re limited in size by the available memory. Because of this size limit, a structure with O(n) access time may always be faster in practice than one with O(1): if the per-element access time of the first structure is much smaller, the memory limit may prevent the second structure from ever growing large enough to show its benefits. It’s often better to have O(n) performance with an access time of 1 nanosecond per element than O(1) with an access time of 1 millisecond. (The latter is faster than the former only for sizes over 1,000,000 elements.)
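To make the arithmetic concrete, here’s a small sketch comparing the two hypothetical structures from the example above (the 1 ns and 1 ms figures are the assumptions stated in the text, not measurements):

```java
// Compares the assumed costs from the example above: an O(n) structure at
// 1 ns per element versus an O(1) structure with a flat 1 ms (1,000,000 ns) cost.
public class Crossover {

    static double scanNanos(long n) {    // O(n): 1 ns per element
        return n * 1.0;
    }

    static double lookupNanos(long n) {  // O(1): 1 ms, regardless of n
        return 1_000_000.0;
    }

    public static void main(String[] args) {
        // Below 1,000,000 elements, the "slower" O(n) structure wins.
        System.out.println(scanNanos(1_000) < lookupNanos(1_000));          // true
        // Only past 1,000,000 elements does the O(1) structure pay off.
        System.out.println(scanNanos(2_000_000) > lookupNanos(2_000_000));  // true
    }
}
```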
Trading time against memory space, and time against complexity
You just saw that choosing an implementation for a data structure is generally a question of trading time against time. You’ll choose an implementation that’s faster on some operations, but slower on others, based on which operations are the most frequent. But there are other trading decisions to make.
Imagine you want a structure from which elements can be retrieved in sorted order, the smallest first. You might choose to sort the elements on insertion, or you might prefer to store them as they arrive and search for the smallest on retrieval only. One important criterion for making the decision would be whether the retrieved element is systematically removed from the structure. If not, it might be accessed several times without removal, so it would probably be better to sort the elements at insertion time, in order to avoid sorting them several times on retrieval. This use case corresponds to what’s called a priority queue, in which you’re waiting for a given element. You might test the queue many times until the expected element is returned. Such a use case requires that elements be sorted at insertion time.
But what if you want to access elements by several different sort orders? For example, you might want to access elements in the same order they were inserted, or in reverse order. The result might correspond to the doubly linked list of figure 5.1. It seems that in such a case, elements should be sorted at retrieval time. You might favor one order, leading to O(1) access time from one end and O(n) from the other end, or you might invent a different structure, perhaps giving O(log(n)) access time from both ends. Another solution would be to store two lists, one in insertion order and one in reverse order. This way, you’d have a slower insertion time, but O(1) retrieval from both ends. One drawback is that this approach would probably use more memory. Thus you can see that choosing the right structure might also be a question of trading time against memory space.
But you might also invent some structure minimizing both insertion time and retrieval time from both ends. These types of structures have already been invented, and you’d only have to implement them, but such structures are much more complex than the simplest ones, so you’d be trading time against complexity.
Most data structures change over time because elements are inserted and removed. Basically, there are two ways to handle such operations. The first one is update in place.
Update in place consists of mutating the data structure itself in order to change its elements. This was considered a good idea back when all programs were single-threaded, although it wasn’t a good idea even then. It’s much worse now that most programs are multithreaded. And this doesn’t only concern replacing elements: the same goes for adding, removing, sorting, and every other operation that mutates the structure. If programs are allowed to mutate data structures, these structures simply can’t be shared without sophisticated protections that are rarely done right the first time, leading to deadlocks, livelocks, thread starvation, stale data, and all those sorts of trouble.
So what’s the solution? Simply use immutable data structures. Many imperative programmers are shocked when they first read this. How can you do useful things with data structures if you can’t mutate them? After all, you often start with empty structures and want to add data to them. How can you possibly do this if they’re immutable?
Update in place
In a 1981 article titled “The transaction concept: virtues and limitations” (Technical Report 81.3, June 1981), Jim Gray wrote this:
Update in place: a poison apple? When bookkeeping was done with clay tablets or paper and ink, accountants developed some clear rules about good accounting practices. One of the cardinal rules is double-entry bookkeeping so that calculations are self checking, thereby making them fail-fast. A second rule is that one never alters the books; if an error is made, it is annotated and a new compensating entry is made in the books. The books are thus a complete history of the transactions of the business... Update-in-place strikes many systems designers as a cardinal sin: it violates traditional accounting practices which have been observed for hundreds of years.
The answer is simple. As with double-entry accounting, instead of changing what existed previously, you create new data to represent the new state. Instead of adding an element to an existing list, you create a new list with the added element. The main benefit is that if another thread was manipulating the list at insertion time, it’s not affected by the change because it doesn’t see it.
Generally, this approach immediately raises two protests:
Both arguments are fallacious. The thread manipulating the “stale” data is in fact manipulating the data as it was when it started reading it. If an element is inserted after the manipulation is finished, there’s no concurrency problem. But if the insertion occurs while the manipulation is going on, what would happen with a mutable data structure? Either the structure wouldn’t be protected against concurrent access, and the data might be corrupted or the result incorrect (or both), or some protection mechanism would lock the data, delaying the insertion until the manipulation by the first thread is completed. In the latter case, the end result would be exactly the same as with an immutable structure.
Persistent data structures
As you saw in the previous section, making a copy (sometimes called a defensive copy) of the data structure before inserting an element is often considered a time-consuming operation that leads to poor performance. This isn’t the case if you use data sharing, which is possible because immutable data structures are persistent. Figure 5.2 shows how elements could be removed and added to create a new, immutable, singly linked list with optimal performance.
Figure 5.2. Removing and adding elements without mutation or copying
As you can see, no copying occurs at all. The result is that such a list might be more performant for removing and inserting elements than a mutable list. So functional data structures (immutable and persistent) are not always slower than mutable ones. They’re often even faster (although they might be slower on some operations). In any case, they’re much safer.
An immutable, persistent, singly linked list implementation
The structure of the singly linked list shown in figures 5.1 and 5.2 is theoretical. The list can’t be implemented that way, because elements can’t be linked to one another. They’d have to be special elements to allow linking, and you want your lists to be able to store any elements. The solution is to devise a recursive list structure composed of the following:
Instead of using a Tuple with properties _1 and _2, you’ll create a specific List class with two properties: head and tail. This will simplify the handling of the Nil case. Figure 5.3 shows the structure of your list implementation.
Figure 5.3. The representation of the singly linked list implementation
Listing 5.1 shows the basic implementation of this list.
Listing 5.1. Singly linked lists
Subclasses have been made private, so you construct lists through calls to the static factory methods. These methods can be statically imported:
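The listing itself isn’t reproduced here, but the following is a minimal sketch of what such an implementation might look like. The names List, Nil, Cons, head, tail, and the list factory methods come from the text; every implementation detail below is an assumption:

```java
// Minimal sketch of the chapter's singly linked list. The names List, Nil,
// Cons, head, tail, and list(...) come from the text; the rest is assumed.
abstract class List<A> {

    @SuppressWarnings("rawtypes")
    private static final List NIL = new Nil();

    private List() {}

    public abstract A head();
    public abstract List<A> tail();
    public abstract boolean isEmpty();

    @SuppressWarnings("unchecked")
    public static <A> List<A> list() {
        return NIL;
    }

    // Imperative, but simple and efficient: cons each element from the end.
    @SafeVarargs
    public static <A> List<A> list(A... a) {
        List<A> result = list();
        for (int i = a.length - 1; i >= 0; i--) {
            result = new Cons<>(a[i], result);
        }
        return result;
    }

    private static class Nil<A> extends List<A> {
        public A head() { throw new IllegalStateException("head called on empty list"); }
        public List<A> tail() { throw new IllegalStateException("tail called on empty list"); }
        public boolean isEmpty() { return true; }
    }

    private static class Cons<A> extends List<A> {
        private final A head;
        private final List<A> tail;
        Cons(A head, List<A> tail) { this.head = head; this.tail = tail; }
        public A head() { return head; }
        public List<A> tail() { return tail; }
        public boolean isEmpty() { return false; }
    }
}

public class ListDemo {
    public static void main(String[] args) {
        List<Integer> xs = List.list(1, 2, 3);     // built through the static factory
        System.out.println(xs.head());             // 1
        System.out.println(xs.tail().head());      // 2
        System.out.println(List.list().isEmpty()); // true
    }
}
```

Because Nil and Cons are private, client code can only build lists through the factory methods, which is also what allows a single shared NIL instance to be reused for every empty list.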
Note that the list(A ... a) method is annotated with @SafeVarargs to indicate that the method doesn’t do anything that could lead to heap pollution. This method uses an imperative implementation based on a for loop. This isn’t very “functional,” but it’s a trade-off for simplicity and performance. If you insist on implementing it in a functional way, you can do so. All you need is a function taking an array as its argument and returning its last element, and another one to return the array without its last element. Here’s one possible solution:
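The solution itself isn’t shown above; here’s one possible sketch. It relies on the two helper functions described (last element of an array, and the array without its last element) and builds the list with a recursive helper. The helper names and the use of null to stand in for the empty list are assumptions of this sketch:

```java
import java.util.Arrays;

public class FunctionalVarargs {

    // Minimal cons cell standing in for the chapter's List/Nil/Cons;
    // in this sketch, null plays the role of Nil.
    static class List<A> {
        final A head;
        final List<A> tail;
        List(A head, List<A> tail) { this.head = head; this.tail = tail; }
    }

    // Returns the last element of a (non-empty) array.
    static <A> A last(A[] a) {
        return a[a.length - 1];
    }

    // Returns a copy of the array without its last element.
    static <A> A[] init(A[] a) {
        return Arrays.copyOfRange(a, 0, a.length - 1);
    }

    // Functional version of list(A... a): repeatedly cons the last remaining
    // element onto the accumulator. Correct, but the copying makes it O(n^2).
    @SafeVarargs
    static <A> List<A> list(A... a) {
        return list_(null, a);
    }

    static <A> List<A> list_(List<A> acc, A[] a) {
        return a.length == 0 ? acc : list_(new List<>(last(a), acc), init(a));
    }

    public static void main(String[] args) {
        List<Integer> l = list(1, 2, 3);
        System.out.println(l.head);       // 1
        System.out.println(l.tail.head);  // 2
    }
}
```

This also shows why the imperative loop is a reasonable trade-off: the functional version allocates a fresh copy of the array at every step.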
Data sharing in list operations
One of the huge benefits of immutable persistent data structures like the singly linked list is the performance boost provided by data sharing. You can already see that accessing the first element of the list is immediate. It’s just a matter of calling the head() method, which is a simple accessor for the head property. Removing the first element is equally fast. Just call the tail() method, which will return the tail property. Now let’s see how to get a new list with an additional element.
Let’s implement cons, an instance method that adds an element at the beginning of a list. (Remember that cons stands for construct.) This instance method has the same implementation for the Nil and Cons subclasses:
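The method body isn’t reproduced above; a sketch of it, on top of a minimal version of the list (names from the text, implementation details assumed), might be:

```java
// Minimal sketch of the chapter's list, plus cons. All details are assumed.
abstract class List<A> {

    @SuppressWarnings("rawtypes")
    private static final List NIL = new Nil();

    private List() {}

    public abstract A head();
    public abstract List<A> tail();
    public abstract boolean isEmpty();

    // cons adds an element at the front of the list. The implementation is
    // the same for Nil and Cons, so it can live in the parent class.
    public List<A> cons(A a) {
        return new Cons<>(a, this);
    }

    @SuppressWarnings("unchecked")
    public static <A> List<A> list() {
        return NIL;
    }

    @SafeVarargs
    public static <A> List<A> list(A... a) {
        List<A> result = list();
        for (int i = a.length - 1; i >= 0; i--) {
            result = new Cons<>(a[i], result);
        }
        return result;
    }

    private static class Nil<A> extends List<A> {
        public A head() { throw new IllegalStateException("head called on empty list"); }
        public List<A> tail() { throw new IllegalStateException("tail called on empty list"); }
        public boolean isEmpty() { return true; }
    }

    private static class Cons<A> extends List<A> {
        private final A head;
        private final List<A> tail;
        Cons(A head, List<A> tail) { this.head = head; this.tail = tail; }
        public A head() { return head; }
        public List<A> tail() { return tail; }
        public boolean isEmpty() { return false; }
    }
}

public class ConsDemo {
    public static void main(String[] args) {
        List<Integer> xs = List.list(2, 3).cons(1); // original list fully shared
        System.out.println(xs.head());              // 1
        System.out.println(xs.tail().head());       // 2
    }
}
```

Note that cons is O(1): the original list is shared, unchanged, as the tail of the new one.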
Implement setHead, an instance method for replacing the first element of a List with a new value. You might think of implementing a static method for this, but you’d have to test for an empty list:
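The code isn’t shown above; a sketch (details assumed) can declare setHead abstract in List and implement it in each subclass, which removes the need for an explicit empty-list test:

```java
// Minimal sketch of the chapter's list, plus setHead. All details are assumed.
abstract class List<A> {

    @SuppressWarnings("rawtypes")
    private static final List NIL = new Nil();

    private List() {}

    public abstract A head();
    public abstract List<A> tail();
    public abstract boolean isEmpty();

    // Declared abstract: each subclass handles its own case, so no
    // explicit test for the empty list is needed at the call site.
    public abstract List<A> setHead(A a);

    @SuppressWarnings("unchecked")
    public static <A> List<A> list() {
        return NIL;
    }

    @SafeVarargs
    public static <A> List<A> list(A... a) {
        List<A> result = list();
        for (int i = a.length - 1; i >= 0; i--) {
            result = new Cons<>(a[i], result);
        }
        return result;
    }

    private static class Nil<A> extends List<A> {
        public A head() { throw new IllegalStateException("head called on empty list"); }
        public List<A> tail() { throw new IllegalStateException("tail called on empty list"); }
        public boolean isEmpty() { return true; }
        public List<A> setHead(A a) {
            throw new IllegalStateException("setHead called on empty list");
        }
    }

    private static class Cons<A> extends List<A> {
        private final A head;
        private final List<A> tail;
        Cons(A head, List<A> tail) { this.head = head; this.tail = tail; }
        public A head() { return head; }
        public List<A> tail() { return tail; }
        public boolean isEmpty() { return false; }
        public List<A> setHead(A a) {
            return new Cons<>(a, tail);   // new head, shared tail
        }
    }
}

public class SetHeadDemo {
    public static void main(String[] args) {
        List<Integer> xs = List.list(1, 2, 3).setHead(0);
        System.out.println(xs.head());        // 0
        System.out.println(xs.tail().head()); // 2
    }
}
```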
Write a toString method to display the content of a list. An empty list will be displayed as "[NIL]", and a list containing the integers from 1 to 3 will be displayed as "[1, 2, 3, NIL]". For a list of arbitrary objects, the toString method will be called to display each object. The Nil implementation is very simple:
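The implementations aren’t reproduced above; here’s a sketch of both. The output format comes from the text; the implementation is assumed, and the Cons version uses an iterative loop rather than recursion to avoid one stack frame per element:

```java
// Minimal sketch of the chapter's list, plus toString. Details are assumed.
abstract class List<A> {

    @SuppressWarnings("rawtypes")
    private static final List NIL = new Nil();

    private List() {}

    public abstract A head();
    public abstract List<A> tail();
    public abstract boolean isEmpty();

    @SuppressWarnings("unchecked")
    public static <A> List<A> list() {
        return NIL;
    }

    @SafeVarargs
    public static <A> List<A> list(A... a) {
        List<A> result = list();
        for (int i = a.length - 1; i >= 0; i--) {
            result = new Cons<>(a[i], result);
        }
        return result;
    }

    private static class Nil<A> extends List<A> {
        public A head() { throw new IllegalStateException("head called on empty list"); }
        public List<A> tail() { throw new IllegalStateException("tail called on empty list"); }
        public boolean isEmpty() { return true; }
        @Override
        public String toString() {
            return "[NIL]";
        }
    }

    private static class Cons<A> extends List<A> {
        private final A head;
        private final List<A> tail;
        Cons(A head, List<A> tail) { this.head = head; this.tail = tail; }
        public A head() { return head; }
        public List<A> tail() { return tail; }
        public boolean isEmpty() { return false; }
        @Override
        public String toString() {
            // Builds "[1, 2, 3, NIL]" by walking the list.
            StringBuilder sb = new StringBuilder("[");
            List<A> current = this;
            while (!current.isEmpty()) {
                sb.append(current.head()).append(", ");
                current = current.tail();
            }
            return sb.append("NIL]").toString();
        }
    }
}

public class ToStringDemo {
    public static void main(String[] args) {
        System.out.println(List.list(1, 2, 3)); // [1, 2, 3, NIL]
        System.out.println(List.list());        // [NIL]
    }
}
```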
You can rely on data sharing to implement various other operations in a very efficient way—often more efficiently than what can be done with mutable lists. In the rest of this section, you’ll add functionality to the linked list based on data sharing.
The tail method, although it doesn’t mutate the list in any way, has the same effect as removing the first element. Write a more general method, drop, that removes the first n elements from a list. Of course, this method won’t actually remove any elements, but will return a new list corresponding to the intended result. This “new” list won’t be anything new: because data sharing is used, nothing is created at all. Figure 5.4 shows how you should proceed.
Figure 5.4. Dropping the first n elements of a list without mutating or creating anything.
The signature of the method will be
Here, you have the choice to implement a static method or instance methods. Instance methods are needed if you want to use object notation, which is much easier to read. For example, if you want to drop two elements of a list of integers and then replace the first element of the result with 0, you could use static methods:
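Here’s a sketch of drop as an instance method on a minimal version of the list. The loop-based implementation is an assumption of this sketch (the chapter may well use recursion instead); either way, the result is a shared suffix of the original list:

```java
// Minimal sketch of the chapter's list, plus drop. Details are assumed.
abstract class List<A> {

    @SuppressWarnings("rawtypes")
    private static final List NIL = new Nil();

    private List() {}

    public abstract A head();
    public abstract List<A> tail();
    public abstract boolean isEmpty();

    // drop returns the list without its first n elements. Nothing is copied:
    // the result is simply a suffix of the original list (data sharing).
    public List<A> drop(int n) {
        List<A> result = this;
        while (n > 0 && !result.isEmpty()) {
            result = result.tail();
            n--;
        }
        return result;
    }

    @SuppressWarnings("unchecked")
    public static <A> List<A> list() {
        return NIL;
    }

    @SafeVarargs
    public static <A> List<A> list(A... a) {
        List<A> result = list();
        for (int i = a.length - 1; i >= 0; i--) {
            result = new Cons<>(a[i], result);
        }
        return result;
    }

    private static class Nil<A> extends List<A> {
        public A head() { throw new IllegalStateException("head called on empty list"); }
        public List<A> tail() { throw new IllegalStateException("tail called on empty list"); }
        public boolean isEmpty() { return true; }
    }

    private static class Cons<A> extends List<A> {
        private final A head;
        private final List<A> tail;
        Cons(A head, List<A> tail) { this.head = head; this.tail = tail; }
        public A head() { return head; }
        public List<A> tail() { return tail; }
        public boolean isEmpty() { return false; }
    }
}

public class DropDemo {
    public static void main(String[] args) {
        List<Integer> xs = List.list(1, 2, 3, 4, 5);
        System.out.println(xs.drop(2).head());     // 3
        // Dropping more elements than the list holds yields the empty list.
        System.out.println(xs.drop(10).isEmpty()); // true
    }
}
```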
Implement a dropWhile method to remove elements from the head of the List as long as a condition holds true. Here’s the signature to add to the List abstract class:
or this one:
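Neither signature is reproduced above; this sketch assumes an instance method taking a `Function<A, Boolean>` (the parameter type and the loop-based body are assumptions):

```java
import java.util.function.Function;

// Minimal sketch of the chapter's list, plus dropWhile. Details are assumed.
abstract class List<A> {

    @SuppressWarnings("rawtypes")
    private static final List NIL = new Nil();

    private List() {}

    public abstract A head();
    public abstract List<A> tail();
    public abstract boolean isEmpty();

    // dropWhile removes elements from the head as long as the condition holds.
    // Like drop, it returns a shared suffix of the original list.
    public List<A> dropWhile(Function<A, Boolean> f) {
        List<A> result = this;
        while (!result.isEmpty() && f.apply(result.head())) {
            result = result.tail();
        }
        return result;
    }

    @SuppressWarnings("unchecked")
    public static <A> List<A> list() {
        return NIL;
    }

    @SafeVarargs
    public static <A> List<A> list(A... a) {
        List<A> result = list();
        for (int i = a.length - 1; i >= 0; i--) {
            result = new Cons<>(a[i], result);
        }
        return result;
    }

    private static class Nil<A> extends List<A> {
        public A head() { throw new IllegalStateException("head called on empty list"); }
        public List<A> tail() { throw new IllegalStateException("tail called on empty list"); }
        public boolean isEmpty() { return true; }
    }

    private static class Cons<A> extends List<A> {
        private final A head;
        private final List<A> tail;
        Cons(A head, List<A> tail) { this.head = head; this.tail = tail; }
        public A head() { return head; }
        public List<A> tail() { return tail; }
        public boolean isEmpty() { return false; }
    }
}

public class DropWhileDemo {
    public static void main(String[] args) {
        // Stops at the first element that fails the test: the trailing 1 stays.
        List<Integer> xs = List.list(1, 2, 3, 4, 1);
        System.out.println(xs.dropWhile(x -> x < 3).head()); // 3
    }
}
```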
A very common operation on lists consists of “adding” one list to another to form a new list that contains all elements of both original lists. It would be nice to be able to simply link both lists, but this isn’t possible. The solution is to add all elements of one list to the other list. But elements can only be added to the front (head) of the list, so if you want to concatenate list1 to list2, you must start by adding the last element of list1 to the front of list2, as indicated in figure 5.6.
Figure 5.6. Sharing data by concatenation. You can see that both lists are preserved and that list2 is shared by the resulting list. But you can also see that you can’t proceed exactly as is indicated in the figure, because you’d have to access the last element of list1 first, which isn’t possible due to the structure of the list.
One way to proceed is to first reverse list1, producing a new list, and then add each element to list2, this time starting from the head of the reversed list. But you haven’t yet defined a reverse method. Can you still define concat? Yes you can. Just consider how you could define this method:
This recursive definition can be translated into code as follows:
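The code itself isn’t reproduced above; a direct translation of the recursive definition might look like the following sketch (built on a minimal version of the list, with all details assumed):

```java
// Minimal sketch of the chapter's list, plus concat. Details are assumed.
abstract class List<A> {

    @SuppressWarnings("rawtypes")
    private static final List NIL = new Nil();

    private List() {}

    public abstract A head();
    public abstract List<A> tail();
    public abstract boolean isEmpty();

    // concat(list1, list2) = list2 if list1 is empty; otherwise the head of
    // list1 consed onto concat(tail of list1, list2). list2 is never copied:
    // it's shared, unchanged, as the tail of the result.
    public static <A> List<A> concat(List<A> list1, List<A> list2) {
        return list1.isEmpty()
                ? list2
                : new Cons<>(list1.head(), concat(list1.tail(), list2));
    }

    @SuppressWarnings("unchecked")
    public static <A> List<A> list() {
        return NIL;
    }

    @SafeVarargs
    public static <A> List<A> list(A... a) {
        List<A> result = list();
        for (int i = a.length - 1; i >= 0; i--) {
            result = new Cons<>(a[i], result);
        }
        return result;
    }

    private static class Nil<A> extends List<A> {
        public A head() { throw new IllegalStateException("head called on empty list"); }
        public List<A> tail() { throw new IllegalStateException("tail called on empty list"); }
        public boolean isEmpty() { return true; }
    }

    private static class Cons<A> extends List<A> {
        private final A head;
        private final List<A> tail;
        Cons(A head, List<A> tail) { this.head = head; this.tail = tail; }
        public A head() { return head; }
        public List<A> tail() { return tail; }
        public boolean isEmpty() { return false; }
    }
}

public class ConcatDemo {
    public static void main(String[] args) {
        List<Integer> result = List.concat(List.list(1, 2), List.list(3, 4));
        System.out.println(result.head());               // 1
        System.out.println(result.tail().tail().head()); // 3
    }
}
```

Each recursive call pushes a stack frame until list1 is exhausted, which is why this version isn’t stack-safe for a long list1.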
Obviously, this code will overflow the stack if list1 is too long, although the length of list2 will never cause a stack problem. The consequence is that you won’t have to worry as long as you’re careful to add only small lists to the front of lists of any length. An important point to note is that what you’re actually doing is adding the elements of the first list, starting from the last one, to the front of the second list. The result is the same as what the common-sense understanding of concatenation would produce, but the process is different: appending the second list to the tail of the first one is definitely not how it can work with a singly linked list.
If you need to concatenate lists of arbitrary length, you can just apply what you learned in chapter 4 to make the concat method stack-safe. If you ponder what you’ve done, you might guess that there’s much room left for abstraction here. What if the concat method were only a specific application of a much more general operation? Maybe you could abstract this operation, make it stack-safe, and then reuse it to implement many other operations? Wait and see!
You may have noticed that the complexity of this operation (and hence the time it’ll take Java to execute it) is proportional to the length of the first list. In other words, if you concatenate list1 and list2, of lengths n1 and n2, the complexity is O(n1), independent of n2. As a consequence, when n2 is large, this operation may be more efficient than concatenating two mutable lists in imperative Java.
Write a method to remove the last element from a list. This method should return the resulting list. Implement it as an instance method with the following signature:
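The signature isn’t reproduced above; this sketch assumes it’s `List<A> removeLast()`, with the method name itself an assumption. Note that, unlike drop, no data sharing is possible for the kept elements: every cell before the last one must be rebuilt, so the operation is O(n) in both time and allocations (and, as written, not stack-safe):

```java
// Minimal sketch of the chapter's list, plus removeLast. The method name
// and all implementation details are assumptions.
abstract class List<A> {

    @SuppressWarnings("rawtypes")
    private static final List NIL = new Nil();

    private List() {}

    public abstract A head();
    public abstract List<A> tail();
    public abstract boolean isEmpty();

    // Returns a list containing all elements but the last. Every cell before
    // the last must be recreated, so nothing is shared with the original.
    public List<A> removeLast() {
        if (isEmpty()) {
            throw new IllegalStateException("removeLast called on empty list");
        }
        return tail().isEmpty()
                ? List.<A>list()
                : new Cons<>(head(), tail().removeLast());
    }

    @SuppressWarnings("unchecked")
    public static <A> List<A> list() {
        return NIL;
    }

    @SafeVarargs
    public static <A> List<A> list(A... a) {
        List<A> result = list();
        for (int i = a.length - 1; i >= 0; i--) {
            result = new Cons<>(a[i], result);
        }
        return result;
    }

    private static class Nil<A> extends List<A> {
        public A head() { throw new IllegalStateException("head called on empty list"); }
        public List<A> tail() { throw new IllegalStateException("tail called on empty list"); }
        public boolean isEmpty() { return true; }
    }

    private static class Cons<A> extends List<A> {
        private final A head;
        private final List<A> tail;
        Cons(A head, List<A> tail) { this.head = head; this.tail = tail; }
        public A head() { return head; }
        public List<A> tail() { return tail; }
        public boolean isEmpty() { return false; }
    }
}

public class RemoveLastDemo {
    public static void main(String[] args) {
        List<Integer> xs = List.list(1, 2, 3).removeLast();
        System.out.println(xs.head());                  // 1
        System.out.println(xs.tail().head());           // 2
        System.out.println(xs.tail().tail().isEmpty()); // true
    }
}
```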