Haskell/Graph reduction

Notes and TODOs

 * TODO: Pour lazy evaluation explanation from ../Laziness/ into this mold.
 * TODO: better section names.
 * TODO: ponder the graphical representation of graphs.
 * No graphical representation, do it with let. Pro: reductions are easiest to perform that way anyway. Cons: no graphic.
 * ASCII art / line art similar to the one in Bird&Wadler? Pro: displays only the relevant parts truly as graph, easy to perform on paper. Cons: Ugly, no large graphs with that.
 * Full blown graphs with @-nodes? Pro: look graphy. Cons: nobody needs to know @-nodes in order to understand graph reduction. Can be explained in the implementation section.
 * Graphs without @-nodes. Pro: easy to understand. Cons: what about currying?
 * ! Keep this chapter short. The sooner the reader knows how to evaluate Haskell programs by hand, the better.
 * First sections closely follow Bird&Wadler

Introduction
Programming is not only about writing correct programs, a question answered by denotational semantics, but also about writing fast ones that require little memory. For that, we need to know how they are executed on a machine, which is commonly described by an operational semantics. This chapter explains how Haskell programs are commonly executed on a real computer and thus serves as a foundation for analyzing time and space usage. Note that the Haskell standard deliberately does not give an operational semantics; implementations are free to choose their own. But so far, every implementation of Haskell more or less closely follows the execution model of lazy evaluation.

In the following, we will detail lazy evaluation and subsequently use this execution model to explain and exemplify the reasoning about time and memory complexity of Haskell programs.

Reductions
Executing a functional program, i.e. evaluating an expression, means repeatedly applying function definitions until all function applications have been expanded. Take for example the expression pythagoras 3 4 together with the definitions

square x       = x * x
pythagoras x y = square x + square y

One possible sequence of such reductions is

pythagoras 3 4
 ⇒ square 3 + square 4   (pythagoras)
 ⇒ (3*3) + square 4      (square)
 ⇒ 9 + square 4          (*)
 ⇒ 9 + (4*4)             (square)
 ⇒ 9 + 16                (*)
 ⇒ 25

Every reduction replaces a subexpression, called a reducible expression or redex for short, with an equivalent one, either by appealing to a function definition, as for pythagoras, or by using a built-in function like (*). An expression without redexes is said to be in normal form. Of course, execution stops once a normal form is reached, which is then the result of the computation.

Clearly, the fewer reductions have to be performed, the faster the program runs. We cannot expect each reduction step to take the same amount of time, because its implementation on real hardware looks very different from step to step; but in terms of asymptotic complexity, the number of reductions is an accurate measure.

Reduction Strategies
There are many possible reduction sequences, and the number of reductions may depend on the order in which they are performed. Take for example the expression fst (square 3, square 4). One systematic possibility is to evaluate all function arguments before applying the function definition:

fst (square 3, square 4)
 ⇒ fst (3*3, square 4)   (square)
 ⇒ fst (9, square 4)     (*)
 ⇒ fst (9, 4*4)          (square)
 ⇒ fst (9, 16)           (*)
 ⇒ 9                     (fst)

This is called an innermost reduction strategy; an innermost redex is a redex that contains no other redex as a subexpression.

Another systematic possibility is to apply all function definitions first and only then evaluate arguments:

fst (square 3, square 4)
 ⇒ square 3   (fst)
 ⇒ 3*3        (square)
 ⇒ 9          (*)

which is named outermost reduction and always reduces outermost redexes, i.e. redexes that are not inside another redex. Here, the outermost reduction uses fewer reduction steps than the innermost reduction. Why? Because the function fst doesn't need the second component of the pair, and the reduction of square 4 was superfluous.
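As a quick sanity check in GHC (a sketch of our own): since fst never demands the second component of the pair, that component may even be left undefined without harm.

```haskell
-- fst only needs the first component; under outermost (lazy)
-- reduction the second component is never evaluated
main :: IO ()
main = print (fst (9 :: Int, undefined))  -- prints 9
```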

Termination
For some expressions like

loop = 1 + loop

no reduction sequence terminates and program execution enters a neverending loop; these expressions do not have a normal form. But there are also expressions where some reduction sequences terminate and some do not, an example being

fst (42, loop)
 ⇒ 42   (fst)

fst (42, loop)
 ⇒ fst (42, 1+loop)       (loop)
 ⇒ fst (42, 1+(1+loop))   (loop)
 ⇒ ...

The first reduction sequence is outermost reduction and the second is innermost reduction, which tries in vain to evaluate the loop even though it is ignored by fst anyway. The ability to evaluate function arguments only when needed is what makes outermost reduction optimal when it comes to termination:


 * Theorem (Church Rosser II): If there is one terminating reduction, then outermost reduction will terminate, too.
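GHC's call-by-need strategy is outermost in this sense, so the theorem can be observed directly; a small sketch of ours, with loop defined as in the example above:

```haskell
-- loop has no normal form: innermost reduction of fst (42, loop)
-- would diverge, but outermost reduction terminates
loop :: Int
loop = 1 + loop

main :: IO ()
main = print (fst (42 :: Int, loop))  -- prints 42
```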

Graph Reduction (Reduction + Sharing)
Despite the ability to discard arguments, outermost reduction doesn't always take fewer reduction steps than innermost reduction:

square (1+2)
 ⇒ (1+2)*(1+2)   (square)
 ⇒ (1+2)*3       (+)
 ⇒ 3*3           (+)
 ⇒ 9             (*)

Here, the argument (1+2) is duplicated and subsequently reduced twice. But because it is one and the same argument, the solution is to share the reduction (1+2) ⇒ 3 with all other incarnations of this argument. This can be achieved by representing expressions as graphs. For example,

 __________
 |  |     ↓
 ◊ * ◊    (1+2)

represents the expression square (1+2). Now, the outermost graph reduction of square (1+2) proceeds as follows

square (1+2)
 ⇒ __________     (square)
   |  |     ↓
   ◊ * ◊    (1+2)
 ⇒ __________     (+)
   |  |     ↓
   ◊ * ◊    3
 ⇒ 9              (*)

and the work has been shared. In other words, outermost graph reduction now reduces every argument at most once. For this reason, it never takes more reduction steps than innermost reduction, a fact we will prove when reasoning about time.
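Sharing can be made visible in GHC with Debug.Trace (an illustrative sketch of ours; trace prints its message when the wrapped value is first demanded):

```haskell
import Debug.Trace (trace)

square :: Int -> Int
square x = x * x

-- the traced argument is a single shared node of the graph, so
-- "reducing 1+2" is printed only once even though x is used twice
main :: IO ()
main = print (square (trace "reducing 1+2" (1 + 2)))  -- prints 9
```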

Sharing of expressions is also introduced with let and where constructs. For instance, consider Heron's formula for the area of a triangle with sides a, b and c:

area a b c = let s = (a+b+c)/2
             in  sqrt (s*(s-a)*(s-b)*(s-c))

Instantiating this to an equilateral triangle of side length 1 will reduce as

area 1 1 1
 ⇒  _____________________            (area)
   | |    |     |      ↓
   sqrt (◊*(◊-a)*(◊-b)*(◊-c))   ((1+1+1)/2)
 ⇒  _____________________            (+),(+),(/)
   | |    |     |      ↓
   sqrt (◊*(◊-a)*(◊-b)*(◊-c))   1.5
 ⇒ ...
 ⇒ 0.433012702

which is $$\sqrt{3}/4$$. Put differently, let-bindings simply give names to nodes in the graph. In fact, one can dispense entirely with a graphical notation and solely rely on let to mark sharing and express a graph structure.

Any implementation of Haskell is in some form based on outermost graph reduction which thus provides a good model for reasoning about the asymptotic complexity of time and memory allocation. The number of reduction steps to reach normal form corresponds to the execution time and the size of the terms in the graph corresponds to the memory used.

Pattern Matching
So far, our description of outermost graph reduction is still underspecified when it comes to pattern matching and data constructors. Explaining these points will enable the reader to trace most cases of the reduction strategy that is commonly the base for implementing non-strict functional languages like Haskell. It is called call-by-need or lazy evaluation in allusion to the fact that it "lazily" postpones the reduction of function arguments to the last possible moment. Of course, the remaining details are covered in subsequent chapters.

To see how pattern matching needs specification, consider for example the boolean disjunction

or True  y = True
or False y = y

and the expression or (1==1) loop with the non-terminating loop = not loop. The following reduction sequence

or (1==1) loop
 ⇒ or (1==1) (not loop)        (loop)
 ⇒ or (1==1) (not (not loop))  (loop)
 ⇒ ...

only reduces outermost redexes and therefore is an outermost reduction. But

or (1==1) loop
 ⇒ or True loop   (==)
 ⇒ True           (or)

makes much more sense. Of course, we just want to apply the definition of or and are only reducing arguments to decide which equation to choose. This intention is captured by the following rules for pattern matching in Haskell:
 * Left hand sides are matched from top to bottom
 * When matching a left hand side, arguments are matched from left to right
 * Evaluate arguments only as much as needed to decide whether they match or not.

Thus, for our example, we have to reduce the first argument to either True or False, then match the second against the variable pattern y, and then expand the matching function definition. As the match against a variable always succeeds, the second argument will not be reduced at all. It is the second reduction sequence above that reproduces this behavior.

With these preparations, the reader should now be able to evaluate most Haskell expressions. Here are some random encounters to test this ability:
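A few expressions of our own choosing that can first be reduced by hand and then checked against GHCi:

```haskell
-- each of these can be evaluated by hand with lazy evaluation
-- and then compared against GHC's answer
main :: IO ()
main = do
  print (take 3 (iterate (2 *) (1 :: Int)))   -- prints [1,2,4]
  print (head (filter even [1 :: Int ..]))    -- prints 2
  print (zip [1, 2, 3 :: Int] "ab")           -- prints [(1,'a'),(2,'b')]
```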

Higher Order Functions
The remaining point to clarify is the reduction of higher order functions and currying. For instance, consider the definitions

id x    = x
a       = id (+1) 41

twice f = f . f
b       = twice (+1) (13*3)

where both id and twice are defined with only one argument. The solution is to see multiple arguments as subsequent applications to one argument; this is called currying: a = (id (+1)) 41 and b = (twice (+1)) (13*3). To reduce an arbitrary application expression1 expression2, call-by-need first reduces expression1 until it becomes a function whose definition can be unfolded with the argument expression2. Hence, the reduction sequences are

a
 ⇒ (id (+1)) 41   (a)
 ⇒ (+1) 41        (id)
 ⇒ 42             (+)

b
 ⇒ (twice (+1)) (13*3)   (b)
 ⇒ ((+1).(+1)) (13*3)    (twice)
 ⇒ (+1) ((+1) (13*3))    (.)
 ⇒ (+1) ((+1) 39)        (*)
 ⇒ (+1) 40               (+)
 ⇒ 41                    (+)

Admittedly, the description is a bit vague and the next section will detail a way to state it clearly.

While it may seem that pattern matching is the workhorse of time intensive computations and higher order functions are only for capturing the essence of an algorithm, functions are indeed useful as data structures. One example is difference lists, which permit concatenation in $$O(1)$$ time; another is the representation of a stream by a fold. In fact, all data structures are represented as functions in the pure lambda calculus, the root of all functional programming languages.
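A minimal sketch of difference lists (the names below are our own): a list is represented as the function that prepends it, so concatenation becomes function composition and costs $$O(1)$$.

```haskell
-- a difference list is a function that prepends its contents
type DList a = [a] -> [a]

fromList :: [a] -> DList a
fromList xs = (xs ++)

toList :: DList a -> [a]
toList d = d []

-- concatenation is just composition: O(1), independent of list length
append :: DList a -> DList a -> DList a
append = (.)

main :: IO ()
main = print (toList (fromList [1, 2] `append` fromList [3 :: Int]))
-- prints [1,2,3]
```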

''Exercises! Or not? Diff-Lists Best done with  but this requires knowledge of the fold example. Oh, where do we introduce the foldl VS. foldr example at all? Hm, Bird&Wadler sneak in an extra section "Meet again with fold" for the (++) example at the end of "Controlling reduction order and space requirements" :-/ The complexity of (++) is explained when arguing about .''

Weak Head Normal Form
To formulate precisely how lazy evaluation chooses its reduction sequence, it is best to abandon equational function definitions and replace them with an expression-oriented approach. In other words, our goal is to translate function definitions like f x = expression into the form f = \x -> expression. This can be done with two primitives, namely case-expressions and lambda abstractions.

In their primitive form, case-expressions only allow the discrimination of the outermost constructor. For instance, the primitive case-expression for lists has the form

case expression of
  []   -> ...
  x:xs -> ...

Lambda abstractions are functions of one parameter, so that the following two definitions are equivalent

f x = expression
f   = \x -> expression

Here is a translation of the definition of or to case-expressions and lambda abstractions:

or = \x -> \y -> case x of
       True  -> True
       False -> y

Assuming that all definitions have been translated to those primitives, every redex now has the form of either
 * a function application
 * or a case-expression

Lazy evaluation always reduces the outermost of these redexes and stops as soon as the whole expression reaches weak head normal form:


 * Weak Head Normal Form: An expression is in weak head normal form (WHNF) iff it is either
 * a constructor, possibly applied to arguments, like True, 2:xs or Just 1,
 * a built-in function applied to too few arguments (perhaps none), like (+) 2 or sqrt,
 * or a lambda abstraction \x -> expression.

Note that expressions of function type cannot be pattern matched anyway, but the devious seq can evaluate them to WHNF nonetheless. "Weak" means that there is no reduction under lambdas; "head" means that the function application itself is reduced first, then the arguments.

Strict and Non-strict Functions
''A non-strict function doesn't need its argument. A strict function needs its argument in WHNF, as long as we do not distinguish between different forms of non-termination (a function that ignores its argument doesn't need it, for example).''

Controlling Space
"Space" here may be better visualized as traversal of a graph. Either a data structure, or an induced dependencies graph. For instance : Fibonacci(N) depends on : Nothing if N = 0 or N = 1 ; Fibonacci(N-1) and Fibonacci(N-2) else. As Fibonacci(N-1) depends on Fibonacci(N-2), the induced graph is not a tree. Therefore, there is a correspondence between implementation technique and data structure traversal :

The classical : Is a tree traversal applied to a directed acyclic graph for the worse. The optimized version : Uses a DAG traversal. Luckily, the frontier size is constant, so it's a tail recursive algorithm.
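The two implementations alluded to can be sketched as follows (our own rendering):

```haskell
-- the classical definition: a tree traversal of what is really a DAG,
-- so shared subproblems are recomputed exponentially often
fibNaive :: Int -> Integer
fibNaive 0 = 0
fibNaive 1 = 1
fibNaive n = fibNaive (n - 1) + fibNaive (n - 2)

-- the optimized version: a DAG traversal whose frontier is just the
-- pair (a, b), hence a tail recursive loop
fib :: Int -> Integer
fib n = go n 0 1
  where
    go 0 a _ = a
    go k a b = go (k - 1) b (a + b)

main :: IO ()
main = print (map fib [0 .. 9])  -- prints [0,1,1,2,3,5,8,13,21,34]
```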

NOTE: The chapter ../Strictness is intended to elaborate on the stuff here.

NOTE: The notion of strict function is to be introduced before this section.

Now's the time for the space-eating fold example:
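Presumably the example meant is the left fold of (+) over a long list; a sketch:

```haskell
-- foldl (+) 0 [1..n] first builds the whole nested thunk
-- (((0+1)+2)+...)+n in O(n) space before any addition is performed
sumLeaky :: Int -> Int
sumLeaky n = foldl (+) 0 [1 .. n]

main :: IO ()
main = print (sumLeaky 100000)  -- prints 5000050000
```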

Introduce seq and ($!), which can force an expression to WHNF.
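Assuming seq is one of the primitives meant, a strict left fold can be sketched as follows (essentially Data.List.foldl'):

```haskell
-- seq forces the accumulator to WHNF before each recursive call,
-- so no chain of thunks builds up
foldlStrict :: (b -> a -> b) -> b -> [a] -> b
foldlStrict _ acc []       = acc
foldlStrict f acc (x : xs) = let acc' = f acc x
                             in  acc' `seq` foldlStrict f acc' xs

main :: IO ()
main = print (foldlStrict (+) 0 [1 .. 100000 :: Int])  -- prints 5000050000
```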

Tricky space leak example:

Since order of evaluation in Haskell is defined only by data dependencies, and neither expression depends on the other, either may be performed first. This means that, depending on the compiler, one version runs in O(1) space while the other runs in O(n). (Perhaps a sufficiently smart compiler could optimize both versions to O(1), but GHC 9.8.2 doesn't do that on my machine; only the second one runs in O(1) space.)
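The example itself is elided above; a common instance of this kind of leak (our own stand-in) is the mean of a list, where the list is shared between two independent traversals:

```haskell
-- sum xs and length xs do not depend on each other; whichever runs
-- first forces all of xs, which must then be retained in memory for
-- the other traversal, so the program needs O(n) space
mean :: [Double] -> Double
mean xs = sum xs / fromIntegral (length xs)

main :: IO ()
main = print (mean [1 .. 1000])  -- prints 500.5
```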

Sharing and CSE
''NOTE: overlaps with section about time. Hm, make an extra memoization section?''

How to share: "lambda lifting", "full laziness". The compiler should not do full laziness.

A classic and important example of the trade-off between space and time: that's why the compiler should not do common subexpression elimination as an optimization. (Does GHC?)
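One standard illustration (our own, as the original example is not shown): if the compiler shared the two occurrences of [1 .. n] below as a common subexpression, the list would be built once but retained in full between the two traversals, turning O(1) space into O(n).

```haskell
-- without CSE, each [1 .. n] is produced lazily and garbage collected
-- as it streams through its consumer; with CSE, the shared list would
-- be kept alive until the second traversal finishes
f :: Int -> Int
f n = sum [1 .. n] + length [1 .. n]

main :: IO ()
main = print (f 100)  -- prints 5150
```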

Tail recursion
''NOTE: Does this belong to the space section? I think so, it's about stack space.''

Tail recursion in Haskell looks different.
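A sketch of the difference: the loop below is syntactically tail recursive, but in Haskell the accumulator must also be forced with seq, otherwise the "tail recursive" loop still builds an O(n) chain of thunks.

```haskell
-- accumulator-passing loop; acc' `seq` ... forces each partial sum
-- to WHNF so the loop runs in constant space
sumTo :: Int -> Int
sumTo n = go 0 1
  where
    go acc i
      | i > n     = acc
      | otherwise = let acc' = acc + i
                    in  acc' `seq` go acc' (i + 1)

main :: IO ()
main = print (sumTo 100)  -- prints 5050
```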

Reasoning about Time
Note: introducing strictness before the upper time bound saves some hassle with explanation?

Lazy eval < Eager eval
When reasoning about execution time, naively performing graph reduction by hand to get a clue on what's going on is most often infeasible. In fact, the order of evaluation taken by lazy evaluation is difficult for humans to predict; it is much easier to trace the path of eager evaluation, where arguments are reduced to normal form before being supplied to a function. But knowing that lazy evaluation never performs more reduction steps than eager evaluation (present the proof!), we can easily get an upper bound on the number of reductions by pretending that our function is evaluated eagerly.

Example:

=> eager evaluation always takes n steps, lazy won't take more than that. But it will actually take fewer.
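The example itself is not shown; one possible stand-in of our own: eager evaluation of the expression below would reduce all n list elements, while lazy evaluation stops after the head.

```haskell
-- under eager evaluation, map (2 *) would be applied to all n
-- elements; lazy evaluation performs only the one reduction that
-- head demands
firstDoubled :: Int -> Int
firstDoubled n = head (map (2 *) [1 .. n])

main :: IO ()
main = print (firstDoubled 1000000)  -- prints 2
```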

Throwing away arguments
The time bound is exact for functions that examine their argument to normal form anyway. The property that a function needs its argument can concisely be captured by denotational semantics:

f ⊥ = ⊥

This concerns the argument in WHNF only, though. Operationally: non-termination -> non-termination. (This is an approximation only, because f anything = ⊥ doesn't "need" its argument.) Non-strict functions don't need their argument, and for them the eager time bound is not sharp. But the information whether a function is strict or not can already be used to great benefit in the analysis.

It's enough to know.

Other examples:
 * vs.  with ⊥.
 * Can  be analyzed only with ⊥? In any case, this example is too involved and belongs to ../Laziness.

Persistence & Amortization
NOTE: this section is better left to a data structures chapter because the subsections above cover most of the cases a programmer not focusing on data structures / amortization will encounter.

Persistence = no updates in place, older versions are still there. Amortization = distribute unequal running times across a sequence of operations. Both don't go well together in a strict setting. Lazy evaluation can reconcile them. Debit invariants. Example: incrementing numbers in binary representation.
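The binary-increment example can be sketched like this (our own rendering of the classic presentation):

```haskell
-- a binary number as a list of bits, least significant bit first;
-- inc flips the leading run of ones, so its amortized cost is O(1)
data Bit = Zero | One deriving (Eq, Show)

inc :: [Bit] -> [Bit]
inc []          = [One]
inc (Zero : bs) = One : bs
inc (One  : bs) = Zero : inc bs

main :: IO ()
main = print (iterate inc [] !! 6)  -- 6 = binary 110, prints [Zero,One,One]
```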

Implementation of Graph reduction
Small talk about G-machines and such. Main definition:

closure = thunk = code/data pair on the heap. What do they do? Consider $$(\lambda x.\lambda y.x+y) 2$$. This is a function that returns a function, namely $$\lambda y.2+y$$ in this case. But when you want to compile code, it's prohibitive to actually perform the substitution in memory and replace all occurrences of $$x$$ by 2. So, you return a closure that consists of the function code $$\lambda y.x+y$$ and an environment $$\{x=2\}$$ that assigns values to the free variables appearing in there.
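A toy model of this (our own; real implementations use machine-level representations): a closure pairs the code, which expects the environment as an extra argument, with the captured environment.

```haskell
-- code takes the environment (the value of x) and the remaining
-- argument y; env stores the captured x = 2
data Closure = Closure
  { code :: Int -> Int -> Int  -- \x y -> x + y
  , env  :: Int                -- {x = 2}
  }

apply :: Closure -> Int -> Int
apply c y = code c (env c) y

main :: IO ()
main = print (apply (Closure (+) 2) 40)  -- prints 42
```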

GHC (and most Haskell implementations?) avoids free variables completely and uses supercombinators: free variables are turned into extra parameters of top-level functions. A related observation is that lambda-expressions with too few parameters don't need to be reduced, since their WHNF is not very different.

Note that these terms are technical terms for implementation stuff, lazy evaluation happily lives without them. Don't use them in any of the sections above.