F Sharp Programming/Advanced Data Structures

F# comes with its own set of data structures; however, it's very important to know how to implement data structures from scratch.

Incidentally, hundreds of authors have written thousands of lengthy volumes on this single topic alone, so it's unreasonable to provide a comprehensive picture of data structures in the short amount of space available for this book. Instead, this chapter is intended as a cursory introduction to the development of immutable data structures using F#. Readers are encouraged to use the resources listed at the bottom of this page for a more comprehensive treatment of algorithms and data structures.

Stacks
F#'s built-in list data structure is essentially an immutable stack. While it's certainly usable, for the purposes of writing exploratory code, we're going to implement a stack from scratch. We can represent each node in a stack using a simple union:
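A minimal sketch of such a union:

```fsharp
// Each node either terminates the stack or holds a value and the rest
// of the stack.
type 'a stack =
    | EmptyStack
    | StackNode of 'a * 'a stack
```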

It's easy enough to create an instance of a stack using:
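For example, a five-element stack can be built by nesting the constructors (a sketch using the union above):

```fsharp
type 'a stack =
    | EmptyStack
    | StackNode of 'a * 'a stack

// A stack holding 1 through 5, with 1 on top
let stack = StackNode(1, StackNode(2, StackNode(3, StackNode(4, StackNode(5, EmptyStack)))))
```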

Each StackNode contains a value and a pointer to the rest of the stack. The resulting data structure can be diagrammed as follows:

     ___    ___    ___    ___    ___
    |_1_|->|_2_|->|_3_|->|_4_|->|_5_|->Empty

We can create a boilerplate stack module as follows:
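A sketch of such a module; the operation names hd, tl, and cons follow the conventions of F#'s own list module:

```fsharp
module Stack =
    type 'a stack =
        | EmptyStack
        | StackNode of 'a * 'a stack

    let empty = EmptyStack

    // Returns the top value of the stack
    let hd = function
        | EmptyStack -> failwith "Empty stack"
        | StackNode(hd, _) -> hd

    // Returns the stack minus its top value
    let tl = function
        | EmptyStack -> failwith "Empty stack"
        | StackNode(_, tl) -> tl

    // Pushes a value onto the front of a stack
    let cons hd tl = StackNode(hd, tl)
```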

Let's say we wanted to add a few methods to our stack, such as a method which updates an item at a certain index. Since our nodes are immutable, we can't update our list in place; we need to copy all of the nodes up to the node we want to update.

Setting the item at index 2 to the value 9:

              0      1      2      3      4
              ___    ___    ___    ___    ___
    let x =  |_1_|->|_2_|->|_3_|->|_4_|->|_5_|->Empty
                                   ^
                                  /
              ___    ___    ___  /
    let y =  |_1_|->|_2_|->|_9_|

So, we copy all of the nodes up to index 2 and reuse the remaining nodes. A function like this is very easy to write:
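A sketch of such a function, assuming the stack union shown earlier (the name update is illustrative):

```fsharp
type 'a stack =
    | EmptyStack
    | StackNode of 'a * 'a stack

// Copies every node before `index`, replaces the node at `index`,
// and shares the remaining nodes with the original stack.
let rec update index value s =
    match index, s with
    | _, EmptyStack -> failwith "Index out of range"
    | 0, StackNode(_, tl) -> StackNode(value, tl)
    | n, StackNode(hd, tl) -> StackNode(hd, update (n - 1) value tl)
```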

Appending the items from one stack to the rear of another uses a similar technique. Since we can't modify stacks in place, we append two stacks by copying all of the nodes from the "front" stack and pointing the last copied node to our "rear" stack, resulting in the following:

Append x and y:

              ___    ___    ___    ___    ___
    let x =  |_1_|->|_2_|->|_3_|->|_4_|->|_5_|->Empty

                                                ___    ___    ___
    let y =                                    |_6_|->|_7_|->|_8_|->Empty
                                                ^
              ___    ___    ___    ___    ___  /
    let z =  |_1_|->|_2_|->|_3_|->|_4_|->|_5_|

We can implement this function with minimal effort using the following:
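A sketch, again assuming the stack union shown earlier:

```fsharp
type 'a stack =
    | EmptyStack
    | StackNode of 'a * 'a stack

// Copies the nodes of x; the last copied node points directly at y,
// so y's nodes are shared rather than copied.
let rec append x y =
    match x with
    | EmptyStack -> y
    | StackNode(hd, tl) -> StackNode(hd, append tl y)
```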

Stacks are very easy to work with and implement. The principle behind copying nodes to "modify" a stack is fundamentally the same for all persistent data structures.

Complete Stack Module
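A sketch of a complete module gathering the operations above (not necessarily identical to the chapter's original listing):

```fsharp
module Stack =
    type 'a stack =
        | EmptyStack
        | StackNode of 'a * 'a stack

    let empty = EmptyStack

    let hd = function
        | EmptyStack -> failwith "Empty stack"
        | StackNode(hd, _) -> hd

    let tl = function
        | EmptyStack -> failwith "Empty stack"
        | StackNode(_, tl) -> tl

    let cons hd tl = StackNode(hd, tl)

    // Reverses a stack with an accumulator in O(n) time
    let rev s =
        let rec loop acc = function
            | EmptyStack -> acc
            | StackNode(hd, tl) -> loop (StackNode(hd, acc)) tl
        loop EmptyStack s

    // Replaces the item at `index`, copying only the nodes before it
    let rec update index value s =
        match index, s with
        | _, EmptyStack -> failwith "Index out of range"
        | 0, StackNode(_, tl) -> StackNode(value, tl)
        | n, StackNode(hd, tl) -> StackNode(hd, update (n - 1) value tl)

    // Copies x's nodes; the last copy points at y, which is shared
    let rec append x y =
        match x with
        | EmptyStack -> y
        | StackNode(hd, tl) -> StackNode(hd, append tl y)
```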

Naive Queue
Queues aren't quite as straightforward as stacks. A naive queue can be implemented using a stack, with the caveat that:


 * Items are always appended to the end of the list, and dequeued from the head of the stack.
 * -OR- Items are prepended to the front of the stack, and dequeued by reversing the stack and getting its head.

We use an interface file to hide the queue class's constructor. Although this technically satisfies the function of a queue, every dequeue is an O(n) operation, where n is the number of items in the queue. There are many variations on this approach, but they are rarely practical. We can certainly improve on the implementation of immutable queues.
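A sketch of the second approach above (the type and function names are illustrative): items are prepended to an immutable list in O(1), and every dequeue pays the O(n) cost of reversing the list to reach the oldest item.

```fsharp
// A naive immutable queue backed by a single list
type 'a naiveQueue = NaiveQueue of 'a list

let empty = NaiveQueue []

// O(1): new items are pushed onto the front of the underlying list
let enqueue item (NaiveQueue q) = NaiveQueue(item :: q)

// O(n): reverse the list to reach the item enqueued first
let dequeue (NaiveQueue q) =
    match List.rev q with
    | [] -> failwith "Empty queue"
    | hd :: tl -> hd, NaiveQueue(List.rev tl)
```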

Queue From Two Stacks
The implementation above isn't very efficient because it requires reversing our underlying data representation several times. Why not keep those reversed stacks around for future use? Rather than using one stack, we can use two: a front stack and a rear stack.

The front stack holds items in the correct order, while the rear stack holds items in reverse order; this allows the first element of the front stack to be the head of the queue, and the first element of the rear stack to be the last item in the queue. So, a queue of the numbers 1..6 might be represented with a front stack of [1; 2; 3] and a rear stack of [6; 5; 4].

To enqueue a new item, push it onto the rear stack; to dequeue an item, pop it off the front stack. Both enqueues and dequeues are O(1) operations. Of course, at some point the front stack will be empty and there will be no more items to dequeue; in this case, simply move all items from the rear stack to the front stack by reversing it. While the queue certainly has O(n) worst-case behavior, it has acceptable O(1) amortized (average-case) bounds.

The code for this implementation is straightforward:
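A sketch using immutable F# lists as the two stacks (the names front, rear, and normalize are illustrative):

```fsharp
// A queue represented as a pair of stacks
type 'a queue = { front : 'a list; rear : 'a list }

module Queue =
    let empty = { front = []; rear = [] }

    // Maintains the invariant that `front` is non-empty whenever the
    // queue holds any items, reversing `rear` only when necessary
    let private normalize q =
        match q with
        | { front = []; rear = r } -> { front = List.rev r; rear = [] }
        | _ -> q

    // O(1) amortized
    let enqueue item q =
        normalize { q with rear = item :: q.rear }

    // O(1) amortized; returns the head and the remaining queue
    let dequeue q =
        match q.front with
        | [] -> failwith "Empty queue"
        | hd :: tl -> hd, normalize { q with front = tl }
```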

This is a simple, common, and useful implementation of an immutable queue. The magic is in the normalization function, which maintains the invariant that the front stack contains items whenever any are available.


 * Note: The queue's periodic O(n) worst-case behavior can give it unpredictable response times, especially in applications which rely heavily on persistence, since it's possible to hit the pathological case each time the queue is accessed. However, this particular implementation of queues is perfectly adequate for the vast majority of applications which do not require persistence or uniform response times.

As shown above, we often want to wrap our underlying data structure in a class for two reasons:
 * 1) To simplify the interface to the data structure. For example, clients neither know nor care that our queue uses two stacks; they only know that items in the queue obey the principle of first-in, first-out.
 * 2) To prevent clients from putting the underlying data structure into an invalid state.

Beyond stacks, virtually all data structures are complex enough to require a wrapper class which hides complex details from clients.

Binary Search Trees
Binary search trees are similar to stacks, but each node points to two other nodes called the left and right child nodes:
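A minimal sketch of such a tree:

```fsharp
// Each node holds a value and its left and right children
type 'a tree =
    | EmptyTree
    | TreeNode of 'a * 'a tree * 'a tree
```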

Additionally, nodes in the tree are ordered in a particular way: each item in a tree is greater than all items in its left subtree and less than all items in its right subtree.

Since our tree is immutable, we "insert" into the tree by returning a brand new tree with the node inserted. This process is more efficient than it sounds: we copy nodes as we traverse down the tree, so we only copy nodes which are in the path of our node being inserted. Writing a binary search tree is relatively straightforward:
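A sketch of insertion and lookup, assuming the tree union shown earlier:

```fsharp
type 'a tree =
    | EmptyTree
    | TreeNode of 'a * 'a tree * 'a tree

// Returns a new tree with `item` inserted; only the nodes on the
// path from the root to the insertion point are copied.
let rec insert item tree =
    match tree with
    | EmptyTree -> TreeNode(item, EmptyTree, EmptyTree)
    | TreeNode(x, left, right) ->
        if item < x then TreeNode(x, insert item left, right)
        elif item > x then TreeNode(x, left, insert item right)
        else tree  // item already exists

// Walks down the tree, following the ordering rule at each node
let rec contains item tree =
    match tree with
    | EmptyTree -> false
    | TreeNode(x, left, right) ->
        if item < x then contains item left
        elif item > x then contains item right
        else true
```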

We're using an interface and a wrapper class to hide the implementation details of the tree from the user, otherwise the user could construct a tree which invalidates the specific ordering rules used in the binary tree.

This implementation is simple, and it allows us to add and look up any item in the tree in O(log n) best-case time. However, it suffers from a pathological case: if we add items in sorted order, or mostly sorted order, the tree can become heavily unbalanced. For example, inserting the numbers 1 through 7 in ascending order results in this tree:
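A hypothetical snippet demonstrating the problem, assuming an insert function like the one sketched earlier:

```fsharp
// Folding sorted input into the tree produces a degenerate,
// list-like tree
let unbalanced = List.fold (fun tree item -> insert item tree) EmptyTree [1 .. 7]
```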

     1
    / \
   E   2
      / \
     E   3
        / \
       E   4
          / \
         E   5
            / \
           E   6
              / \
             E   7
                / \
               E   E

A tree like this isn't much better than our inefficient queue implementation above! Trees are most efficient when they have a minimum height and are as full as possible. Ideally, we'd like to represent the tree above as follows:

            _ 4 _
           /     \
          2       6
         / \     / \
        1   3   5   7
       / \ / \ / \ / \
       E E E E E E E E

The minimum height of the tree is ceiling(log2(n + 1)), where n is the number of items in the tree. When we insert items into the tree, we want the tree to balance itself to maintain the minimum height. There are a variety of self-balancing tree implementations, many of which are easy to implement as immutable data structures.

Red Black Trees
Red-black trees are self-balancing trees which attach a "color" attribute to each node in the tree. In addition to the rules defining a binary search tree, red-black trees must maintain the following set of rules:
 * 1) A node is either red or black.
 * 2) The root node is always black.
 * 3) No red node has a red child.
 * 4) Every simple path from a given node to any of its descendant leaves contains the same number of black nodes.



We can augment our binary tree with a color field as follows:
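A sketch of the augmented type:

```fsharp
type color = Red | Black

// Each node now carries a color in addition to its value and children
type 'a tree =
    | EmptyTree
    | TreeNode of color * 'a * 'a tree * 'a tree
```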

When we insert into the tree, we need to rebalance it to restore these rules. In particular, we need to eliminate every case in which a red node has a red child. There are four such cases, depicted in the diagram below by the top, left, right, and bottom trees; the center tree is the balanced version that each case is rewritten to.

                         B(z)
                         /  \
                      R(x)   d
                      /  \
                     a   R(y)
                         /  \
                        b    c

                          ||
                          \/

         B(z)            B(y)                B(x)
         /  \            /  \                /  \
      R(y)   d   =>   R(x)  R(z)   <=      a   R(y)
      /  \            / \    / \               /  \
    R(x)   c         a   b  c   d             b   R(z)
    /  \                                      /  \
   a    b                                    c    d

                          /\
                          ||

                         B(x)
                         /  \
                        a   R(z)
                            /  \
                         R(y)   d
                         /  \
                        b    c

We can modify our binary tree class as follows:
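A sketch of the rebalancing in the style of Okasaki's red-black trees, assuming the color-augmented union above (the names balance and insert are illustrative). All four violating cases are handled by a single pattern match:

```fsharp
type color = Red | Black

type 'a tree =
    | EmptyTree
    | TreeNode of color * 'a * 'a tree * 'a tree

// Rewrites each of the four red-red cases into the balanced
// center tree from the diagram above
let balance = function
    | Black, z, TreeNode(Red, y, TreeNode(Red, x, a, b), c), d
    | Black, z, TreeNode(Red, x, a, TreeNode(Red, y, b, c)), d
    | Black, x, a, TreeNode(Red, z, TreeNode(Red, y, b, c), d)
    | Black, x, a, TreeNode(Red, y, b, TreeNode(Red, z, c, d)) ->
        TreeNode(Red, y, TreeNode(Black, x, a, b), TreeNode(Black, z, c, d))
    | color, value, left, right -> TreeNode(color, value, left, right)

let insert item tree =
    let rec ins = function
        | EmptyTree -> TreeNode(Red, item, EmptyTree, EmptyTree)
        | TreeNode(color, x, left, right) as node ->
            if item < x then balance (color, x, ins left, right)
            elif item > x then balance (color, x, left, ins right)
            else node
    // The root node is always colored black
    match ins tree with
    | EmptyTree -> failwith "Insert never returns an empty tree"
    | TreeNode(_, x, left, right) -> TreeNode(Black, x, left, right)
```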

All of the magic that makes this tree work happens in the balance function. We're not performing any terribly complicated transformations on the tree, yet it comes out relatively balanced (in fact, the maximum depth of this tree is 2 * log2(n + 1)).

AVL Trees
AVL trees are named after their two inventors, G.M. Adelson-Velskii and E.M. Landis. These trees are self-balancing because the heights of the two child subtrees of any node differ by at most one; for this reason, these trees are said to be height-balanced.

An empty node in a tree has a height of 0; non-empty nodes have a height >= 1. We can store the height of each node in our tree definition:
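A sketch of the height-carrying type:

```fsharp
// Each node caches its own height alongside its value and children
type 'a tree =
    | EmptyTree
    | TreeNode of int * 'a * 'a tree * 'a tree  // height, value, left, right
```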

The height of any node is equal to one plus the maximum height of its children. For convenience, we'll use the following constructor to create a tree node and initialize its height:
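A sketch of the helpers, assuming the height-carrying union above (the names height and make are illustrative):

```fsharp
type 'a tree =
    | EmptyTree
    | TreeNode of int * 'a * 'a tree * 'a tree

let height = function
    | EmptyTree -> 0
    | TreeNode(h, _, _, _) -> h

// Builds a node, computing its height from the heights of its children
let make value left right =
    TreeNode(1 + max (height left) (height right), value, left, right)
```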

Inserting into an AVL tree is very similar to inserting into an unbalanced binary tree with one exception: after we insert a node, we use a series of tree rotations to re-balance the tree. Each node has an implicit property, its balance factor, which refers to the left-child's height minus the right-child's height; a positive balance factor indicates the tree is weighted on the left, negative indicates the tree is weighted on the right, otherwise the tree is balanced.

We only need to rebalance the tree when the balance factor of a node is +2 or -2. There are four scenarios which can cause our tree to become unbalanced:

Left-left case: root balance factor = +2, left child's balance factor = +1. Balanced by right-rotating the root node:

          5                           3
        /   \        Root           /   \
       3     D   right rotation    2     5
      / \          ->             / \   / \
     2   C                       A   B C   D
    / \
   A   B

Left-right case: root balance factor = +2, left child's balance factor = -1. Balanced by left-rotating the left child, then right-rotating the root (this operation is called a double right rotation):

        5                           5                               4
      /   \     Left child        /   \          Root             /   \
     3     D  left rotation      4     D    right rotation       3     5
    / \           ->            / \              ->             / \   / \
   A   4                       3   C                           A   B C   D
      / \                     / \
     B   C                   A   B

Right-right case: root balance factor = -2, right-child's balance factor = -1. Balanced by left-rotating the root node:

     3                              5
    / \          Root             /   \
   A   5    left rotation        3     7
      / \         ->            / \   / \
     B   7                     A   B C   D
        / \
       C   D

Right-left case: root balance factor = -2, right-child's balance factor = +1. Balanced by right-rotating the right child, then left-rotating the root (this operation is called a double-left rotation):

     3                           3                               4
    / \      Right child       /   \          Root             /   \
   A   5   right rotation     A     4    left rotation        3     5
      / \        ->                / \         ->            / \   / \
     4   D                        B   5                     A   B C   D
    / \                              / \
   B   C                            C   D

With this in mind, it's very easy to put together the rest of our AVL tree:
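A sketch of the rotations and insertion, assuming the height and make helpers above (all names are illustrative):

```fsharp
type 'a tree =
    | EmptyTree
    | TreeNode of int * 'a * 'a tree * 'a tree

let height = function
    | EmptyTree -> 0
    | TreeNode(h, _, _, _) -> h

let make value left right =
    TreeNode(1 + max (height left) (height right), value, left, right)

// Positive: left-heavy; negative: right-heavy
let balanceFactor = function
    | EmptyTree -> 0
    | TreeNode(_, _, left, right) -> height left - height right

let rotateRight = function
    | TreeNode(_, x, TreeNode(_, y, a, b), c) -> make y a (make x b c)
    | node -> node

let rotateLeft = function
    | TreeNode(_, x, a, TreeNode(_, y, b, c)) -> make y (make x a b) c
    | node -> node

// Applies the appropriate single or double rotation for the four cases
let rebalance node =
    match balanceFactor node, node with
    | 2, TreeNode(_, x, left, right) when balanceFactor left = -1 ->
        rotateRight (make x (rotateLeft left) right)   // left-right case
    | 2, _ -> rotateRight node                         // left-left case
    | -2, TreeNode(_, x, left, right) when balanceFactor right = 1 ->
        rotateLeft (make x left (rotateRight right))   // right-left case
    | -2, _ -> rotateLeft node                         // right-right case
    | _ -> node

let rec insert item = function
    | EmptyTree -> TreeNode(1, item, EmptyTree, EmptyTree)
    | TreeNode(_, x, left, right) as node ->
        if item < x then rebalance (make x (insert item left) right)
        elif item > x then rebalance (make x left (insert item right))
        else node
```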


 * Note: The [&lt;GeneralizableValue&gt;] attribute indicates to F# that a construct can give rise to generic code through type inference. Without the attribute, F# will infer a non-generic, undefined type for a generic value such as an empty tree, resulting in a "value restriction" error at compilation.


 * Optimization tip: The tree supports inserts and lookups in O(log n) time, where n is the number of nodes in the tree. This is already pretty good, but we can make it faster by eliminating unnecessary comparisons. Notice that when we insert a node into the left side of the tree, we can only add weight to the left child; however, the rebalancing function checks both sides of the tree on each insert. By splitting the rebalancing function into separate left and right versions, we can handle left- and right-child inserts separately. Similar optimizations are possible on the red-black tree implementation as well.

An AVL tree's height is limited to about 1.44 * log2(n), whereas a red-black tree's height is limited to 2 * log2(n + 1). The AVL tree's smaller height and more rigid balancing lead to slower inserts and removals but faster retrieval than red-black trees. In practice, the difference will be hardly noticeable: a lookup on a 10,000,000-node AVL tree requires at most 34 comparisons, compared to 47 comparisons on a red-black tree.

Heaps
Binary search trees can efficiently find arbitrary elements in a set; however, it can occasionally be useful to access the minimum element in a set. Heaps are a special data structure which satisfies the heap property: the value of every node is greater than the value of any of its child nodes. Additionally, we can keep the tree approximately balanced using the leftist property, meaning that the height of any left child is at least as large as that of its right sibling. We can hold the height of each tree in each heap node.
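A sketch of such a heap node:

```fsharp
// Each node caches its height alongside its value and children
type 'a heap =
    | EmptyHeap
    | HeapNode of int * 'a * 'a heap * 'a heap  // height, value, left, right
```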

Finally, since heaps can be implemented as min- or max-heaps, where the root element will either be the smallest or largest element in the set, we support both types of heaps by passing an ordering function into the heap's constructor, as such:
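As a minimal illustration of the idea (the BinaryHeap name, the list-backed representation, and the member names here are stand-ins, not the chapter's original code):

```fsharp
// `leq x y` returns true when x should sit nearer the root, so the
// same implementation serves as a min-heap or a max-heap.
type BinaryHeap<'a>(leq : 'a -> 'a -> bool, items : 'a list) =
    member this.Head =
        match items with
        | [] -> failwith "Empty heap"
        | hd :: tl -> List.fold (fun best x -> if leq x best then x else best) hd tl
    member this.Insert x = BinaryHeap(leq, x :: items)

let minHeap = BinaryHeap((<=), [3; 1; 2])   // minHeap.Head is the smallest item
let maxHeap = BinaryHeap((>=), [3; 1; 2])   // maxHeap.Head is the largest item
```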


 * Note: the functionality we gain by passing the ordering function into the heap's constructor approximates OCaml functors, although it's not quite as elegant.

An interesting consequence of the leftist property is that elements along any path in a heap are stored in sorted order. This means we can merge any two heaps by merging their right spines and swapping children as necessary to restore the leftist property. Since the left spine of each node is at least as long as its right spine, the right spine's length is proportional to the logarithm of the number of elements in the heap, so merging two heaps can be performed in O(log n) time. We can implement all of the properties of our heap as follows:
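A sketch of a leftist heap, repeating the heap type for completeness (all names are illustrative). The ordering function decides which of two elements belongs nearer the root, so (&lt;=) yields a min-heap and (&gt;=) a max-heap:

```fsharp
type 'a heap =
    | EmptyHeap
    | HeapNode of int * 'a * 'a heap * 'a heap  // height, value, left, right

module Heap =
    let height = function
        | EmptyHeap -> 0
        | HeapNode(h, _, _, _) -> h

    // Swaps the children if necessary so the taller child sits on
    // the left, preserving the leftist property
    let make value left right =
        if height left >= height right
        then HeapNode(1 + height right, value, left, right)
        else HeapNode(1 + height left, value, right, left)

    // Merges two heaps along their right spines in O(log n) time
    let rec merge leq x y =
        match x, y with
        | EmptyHeap, h | h, EmptyHeap -> h
        | HeapNode(_, a, left1, right1), HeapNode(_, b, left2, right2) ->
            if leq a b
            then make a left1 (merge leq right1 y)
            else make b left2 (merge leq x right2)

    // Inserting is just merging with a one-element heap
    let insert leq value h =
        merge leq (HeapNode(1, value, EmptyHeap, EmptyHeap)) h

    let head = function
        | EmptyHeap -> failwith "Empty heap"
        | HeapNode(_, value, _, _) -> value

    // Removing the root merges its two children
    let tail leq = function
        | EmptyHeap -> failwith "Empty heap"
        | HeapNode(_, _, left, right) -> merge leq left right
```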

This heap can also implement the IEnumerable interface, allowing us to iterate through it like a seq. In addition to the leftist heap shown above, it's very easy to implement immutable versions of splay heaps, binomial heaps, Fibonacci heaps, pairing heaps, and a variety of other tree-like data structures in F#.

Lazy Data Structures
It's worth noting that some purely functional data structures above are not as efficient as their imperative implementations. For example, appending two immutable stacks takes O(n) time, where n is the number of elements in the first stack. However, we can exploit laziness in ways which make purely functional data structures just as efficient as their imperative counterparts.

For example, it's easy to create a stack-like data structure which delays all computation until it's really needed:
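A sketch of such a lazy stack, where each tail is a delayed computation (the names are illustrative, and the print statement exists only to show when work actually happens):

```fsharp
type 'a lazyStack =
    | EmptyStack
    | StackNode of 'a * Lazy<'a lazyStack>

let hd = function
    | EmptyStack -> failwith "Empty stack"
    | StackNode(hd, _) -> hd

// Forces only the first delayed step of the tail
let tl = function
    | EmptyStack -> failwith "Empty stack"
    | StackNode(_, tl) -> tl.Value

// O(1): returns one node immediately and delays the rest
let rec append x y =
    match x with
    | EmptyStack -> y
    | StackNode(hd, tl) ->
        printfn "appending %A" hd   // demonstrates when evaluation occurs
        StackNode(hd, lazy (append tl.Value y))
```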

In the example above, the append operation returns one node and delays the rest of the computation, so appending two lists occurs in constant time. A print statement has been added to demonstrate that we really don't compute appended values until the first time they're accessed:
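A hypothetical usage sketch, assuming the lazy stack and append just described; the "appending" message appears only as heads are actually demanded:

```fsharp
let x = StackNode(1, lazy (StackNode(2, lazy EmptyStack)))
let y = StackNode(3, lazy (StackNode(4, lazy EmptyStack)))

// Prints "appending 1" for the first node; the rest stays delayed
let z = append x y

match z with
| StackNode(_, tail) ->
    // Forcing the tail triggers exactly one more append step
    tail.Value |> ignore
| EmptyStack -> ()
```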

Interestingly, the append method clearly runs in O(1) time because the actual appending operation is delayed until a user grabs the head of the list. At the same time, grabbing the head of the list may have the side effect of triggering, at most, one call to the append method without monolithically rebuilding the rest of the data structure, so grabbing the head is itself an O(1) operation. This stack implementation supports constant-time consing and appending, and linear-time lookups.

Similarly, implementations of lazy queues exist which support O(1) worst-case behavior for all operations.

Additional Resources

 * Purely Functional Data Structures by Chris Okasaki ISBN 978-0521663502. Highly recommended. Provides techniques and analysis of immutable data structures using SML.
 * The Algorithm Design Manual by Steven S. Skiena ISBN 978-0387948607. Highly recommended. Provides language-agnostic description of a variety of algorithms, data structures, and techniques for solving hard problems in computer science.
 * Tutorial on immutable data structures using C#:
 * Kinds of Immutability
 * A Simple Immutable Stack
 * A Covariant Immutable Stack
 * An Immutable Queue
 * LOLZ!
 * A Simple Binary Tree
 * More on Binary Trees
 * Even More on Binary Trees
 * AVL Tree Implementation
 * A Double-ended Queue
 * A Working Double-ended Queue