Haskell/Libraries/Data structures primer

This chapter introduces some of the data structures available from the libraries. We will focus on common and particularly useful examples that every Haskeller should learn about.

Trade-offs
This chapter continually emphasizes the shortcomings of lists, but that does not mean you should quit using them! Lists are the default data structure in Haskell for good reasons: beyond their simplicity, lists have a pretty high power-to-weight ratio in a lazy, purely functional setting. Laziness makes it possible to use lists as streams, in which we sequentially process elements that are generated on demand. That allows the standard list-processing functions to work as effective replacements for many common uses of iterative control structures.

As powerful as they may be, lists are better suited to patterns like streaming and iteration control than to simple data storage and retrieval. Of course, switching to a different data structure involves trade-offs. Every data structure has advantages and disadvantages, and the right choice depends on the problem at hand.

Lookups: Data.Map and co.
First, we'll consider a common problem: performing lookups on a data structure. Given a collection of associations between keys and values, we may want to retrieve the value, if any, corresponding to some key. We could simply store the associations as a list of pairs, [(k, v)]. Indeed, Prelude contains lookup, which works on such lists. To find a value in a list, however, we have to go through the pairs one by one, testing the keys for equality until we either reach the one we are looking for or exhaust the list. A lookup in a plain association list is an O(n) operation, as the expected number of steps needed to perform it grows proportionally to the length of the list. It is easy to see how that can become a problem when there are a lot of associations.
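As a rough illustration of why association-list lookups are linear, the sketch below places Prelude's lookup next to a hand-written equivalent. The names phoneBook and lookupAssoc are made up for this example and are not part of any library.

```haskell
-- Hypothetical sample data: an association list of names and extensions.
phoneBook :: [(String, Int)]
phoneBook = [("alice", 1234), ("bob", 5678), ("carol", 9012)]

-- A hand-written equivalent of Prelude's lookup (illustrative only).
-- It walks the list pair by pair, so the cost grows with the list length.
lookupAssoc :: Eq k => k -> [(k, v)] -> Maybe v
lookupAssoc _ [] = Nothing
lookupAssoc key ((k, v) : rest)
  | key == k  = Just v
  | otherwise = lookupAssoc key rest

main :: IO ()
main = do
  print (lookup "bob" phoneBook)       -- Just 5678
  print (lookupAssoc "bob" phoneBook)  -- Just 5678
  print (lookup "dave" phoneBook)      -- Nothing
```

Note that only an Eq constraint on the keys is needed here, which is also why nothing better than a linear scan is possible.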

We can achieve better lookups by switching to a more appropriate data structure. The Map type provided by Data.Map from the containers package is a fine general-purpose choice. Note that Data.Map is generally imported qualified, to avert name clashes with Prelude functions.

GHCi> import qualified Data.Map as M
GHCi> :t M.empty
M.empty :: M.Map k a

In a Map, keys and values are arranged in a (size balanced, binary) tree. That tree form makes looking for a key work by simply going down a particular branch of the tree. Tree manipulation, however, happens entirely behind the scenes. Map, like many other data structures from the libraries, is used as an abstract type through an interface with no mention of the tree implementation backing it. In particular, constructors are not exported: a new Map is built, for example, either by inserting associations into an empty map or by using the utility function fromList:

GHCi> let foo = M.fromList [(1, "Robert"), (5, "Ian"), (6, "Bruce")]
GHCi> :t foo
foo :: M.Map Integer [Char]

The Map interface provides O(log n) lookups...

GHCi> :t M.lookup
M.lookup :: Ord k => k -> M.Map k a -> Maybe a
GHCi> M.lookup 5 foo
Just "Ian"
GHCi> M.lookup 7 foo
Nothing

...as well as scores of other useful operations: unions, intersections, deletions and so forth. Instances for a handful of important type classes such as Functor are available as well.

GHCi> M.size $ M.union foo $ M.fromList [(11, "Andrew"), (17, "Mike")]
5
GHCi> fmap reverse foo
fromList [(1,"treboR"),(5,"naI"),(6,"ecurB")]
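The same kind of map can also be built by inserting into an empty map, as mentioned earlier. The following is a small sketch of that approach along with a few more Data.Map operations; the name names is chosen just for this example.

```haskell
import qualified Data.Map as M

-- The associations from the GHCi session, built by repeated insertion
-- into the empty map instead of with fromList.
names :: M.Map Integer String
names = M.insert 6 "Bruce" (M.insert 5 "Ian" (M.insert 1 "Robert" M.empty))

main :: IO ()
main = do
  -- Both ways of building the map give equal results.
  print (names == M.fromList [(1, "Robert"), (5, "Ian"), (6, "Bruce")])  -- True
  print (M.member 5 names)                                      -- True
  print (M.findWithDefault "unknown" 7 names)                   -- "unknown"
  print (M.toList (M.delete 5 names))           -- [(1,"Robert"),(6,"Bruce")]
  print (M.keys (M.union names (M.fromList [(11, "Andrew")])))  -- [1,5,6,11]
```

Every operation returns a new map and leaves the original untouched, so names can be reused freely after the delete and union calls.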

Variations
Other modules providing map and map-like data structures worth knowing about include:


 * Data.IntMap, from containers, provides a more efficient map implementation limited to Int keys.
 * Data.Set, also in containers, provides a set implementation. Sets are appropriate when the interesting operation is simply finding whether a value is in a collection, rather than retrieving a value given a key. They are a lot like a map in which only the keys matter, and so many considerations about performance and implementations apply to both sets and maps.
 * The unordered-containers package provides hash maps and sets. They offer efficiency gains (such as almost constant-time lookups) without limiting the type of the key to Int, at the cost of a comparatively limited interface and the loss of the ordering guarantees of the containers tree-based maps.
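To give a feel for two of these variations, here is a brief sketch using Data.Set and Data.IntMap from containers. The values seen and counts are invented sample data.

```haskell
import qualified Data.IntMap as IM
import qualified Data.Set as Set

-- A set only records membership; the duplicate "map" is stored once.
seen :: Set.Set String
seen = Set.fromList ["map", "filter", "foldr", "map"]

-- An IntMap is a map specialised to Int keys. fromListWith combines the
-- values of repeated keys, here by adding them up.
counts :: IM.IntMap Int
counts = IM.fromListWith (+) [(3, 1), (7, 1), (3, 1)]

main :: IO ()
main = do
  print (Set.member "filter" seen)  -- True
  print (Set.size seen)             -- 3
  print (Set.toList seen)           -- ["filter","foldr","map"]
  print (IM.lookup 3 counts)        -- Just 2
  print (IM.toList counts)          -- [(3,2),(7,1)]
```

The interfaces deliberately mirror that of Data.Map, which is why these modules are also usually imported qualified.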

Peeking at both ends with Data.Sequence
One of the peculiarities of lists is that they are asymmetric. Given that we construct and deconstruct lists by their head with (:), operations at the head are more efficient than the corresponding operations at the tail. For instance, while prepending an element with (:) takes constant time, naïvely appending a single element with xs ++ [x] takes time proportional to the length of xs. That means building up a list by repeatedly appending takes time quadratic in the number of elements, which quickly becomes prohibitive as the list grows.
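The asymmetry can be sketched with two ways of building the same list. Both functions below are hypothetical examples written for this chapter: the first appends at the tail at every step, while the second prepends and reverses once at the end, a common workaround when only lists are available.

```haskell
-- Builds [1, 4, 9, ...] by appending at the tail. Each (++) has to walk
-- the whole accumulator, so the total work is quadratic in n.
squaresAppend :: Int -> [Int]
squaresAppend n = go [] 1
  where
    go acc i
      | i > n     = acc
      | otherwise = go (acc ++ [i * i]) (i + 1)

-- Builds the same list by prepending with (:), which takes constant time
-- per element, followed by a single linear-time reverse.
squaresPrepend :: Int -> [Int]
squaresPrepend n = reverse (go [] 1)
  where
    go acc i
      | i > n     = acc
      | otherwise = go (i * i : acc) (i + 1)

main :: IO ()
main = do
  print (squaresAppend 5)                          -- [1,4,9,16,25]
  print (squaresPrepend 5)                         -- [1,4,9,16,25]
  print (squaresAppend 100 == squaresPrepend 100)  -- True
```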

When lots of operations in the middle or at the tail have to be done, sequences are an excellent list-like alternative to lists. They are provided by the Data.Sequence module, which is also part of containers. Sequences and lists are quite different from one another, even though many of the familiar list functions reappear in some guise in Data.Sequence. While lists are lazy and can be infinite, sequences are finite and strict. The trade-off which makes sequences useful is that, at the cost of some overhead with respect to lists, many operations which were troublesome with lists perform much better. Remarkably, we get both appending and prepending in constant time, length in constant time and also concatenation and random access in logarithmic time. All of that is available through a pleasant, purely functional interface.

GHCi> import qualified Data.Sequence as S
GHCi> import Data.Sequence((<|), (|>), (><), ViewL(..), ViewR(..))
GHCi> let foo = S.fromList [1, 3, 5, 2, 9]
GHCi> :t foo
foo :: S.Seq Integer

(<|) prepends, (|>) appends, and (><) concatenates.

GHCi> 0 <| foo
fromList [0,1,3,5,2,9]
GHCi> foo |> 18
fromList [1,3,5,2,9,18]
GHCi> foo >< foo
fromList [1,3,5,2,9,1,3,5,2,9]

You can also pattern match at both ends. For that, you use either viewl or viewr to get the desired view of the sequence and then match using EmptyL and (:<), or EmptyR and (:>), respectively.

GHCi> S.viewl foo
1 :< fromList [3,5,2,9]
GHCi> S.viewr foo
fromList [1,3,5,2] :> 9
GHCi> let xs :> x = S.viewr foo
GHCi> xs
fromList [1,3,5,2]
GHCi> x
9
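Matching at both ends can be put to work in a complete function. The sketch below uses an invented isPalindrome function that peels one element off each end per step, and also shows the S.length and S.index functions for the constant-time length and logarithmic-time random access mentioned earlier.

```haskell
import qualified Data.Sequence as S
import Data.Sequence (ViewL(..), ViewR(..))

-- Hypothetical example: checks whether a sequence reads the same in both
-- directions by comparing the first and last elements, then recursing on
-- what is left in the middle.
isPalindrome :: Eq a => S.Seq a -> Bool
isPalindrome s = case S.viewl s of
  EmptyL    -> True                 -- no elements left
  x :< rest -> case S.viewr rest of
    EmptyR      -> True             -- a single element was left
    middle :> y -> x == y && isPalindrome middle

main :: IO ()
main = do
  let foo = S.fromList [1, 3, 5, 2, 9 :: Int]
  print (isPalindrome foo)                                   -- False
  print (isPalindrome (S.fromList [1, 2, 3, 2, 1 :: Int]))   -- True
  print (S.length foo)                                       -- 5
  print (S.index foo 2)                                      -- 5
```

Covering the EmptyL and EmptyR cases keeps the function total, unlike the let pattern in the GHCi session, which would fail on an empty sequence.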

Raw performance with arrays
Performance demands when dealing with large volumes of data can be quite stringent. For situations which require fast processing of bulk data, with laziness and streaming not being relevant concerns, Haskell offers true arrays in the vein of those found in C and elsewhere. Arrays are compact memory-wise and offer constant-time random access and many very fast operations (the main exceptions being those that require copying the arrays, such as immutable array concatenation). The cost is a certain unwieldiness arising from the deep differences in behaviour between arrays and the purely functional data structures we normally deal with. There are several array libraries available, each of them generally providing a number of different kinds of arrays, ranging from those whose usage does not feel very different from usual Haskell data structures to C-like mutable arrays of raw primitive values. Here we will mention three of the most popular array libraries.


 * vector is a good default choice if you are just starting with arrays, or if you do not have specialised needs covered by other libraries. It provides one-dimensional arrays with an interface reasonably similar to that found in other data structure libraries such as containers.


 * array, in contrast, has a more intimidating interface. As for features, it supports multi-dimensional arrays and custom indexing. Importantly, it is also part of the language standard, and is bundled with GHC, which makes it useful for library writers unwilling to incur additional dependencies. We provide an overview of standard arrays in a separate chapter, which can be used as an introduction to array-related terminology.


 * repa is a sophisticated library providing state-of-the-art multi-dimensional parallelisable arrays. It is well suited for tasks such as image processing.
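As a small taste of immutable arrays, here is a sketch using Data.Array from the array package, picked for this example only because it is bundled with GHC. The array squares is invented sample data.

```haskell
import Data.Array (Array, bounds, elems, listArray, (!), (//))

-- An immutable array with indices 0 to 4, holding the square of each index.
squares :: Array Int Int
squares = listArray (0, 4) [i * i | i <- [0 .. 4]]

main :: IO ()
main = do
  print (bounds squares)                 -- (0,4)
  print (squares ! 3)                    -- 9, constant-time random access
  -- (//) returns an updated copy and leaves the original array as it was.
  print (elems (squares // [(2, 100)]))  -- [0,1,100,9,16]
  print (elems squares)                  -- [0,1,4,9,16]
```

The update with (//) illustrates the copying cost mentioned above: changing a single element of an immutable array produces a whole new array.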

text, bytestring, and the problem with String
A quick glance at the base libraries might leave us with the impression that Strings are the preferred way of getting data into and out of a Haskell program. However, there are several issues with String which make it a poor fit for such a role.


 * The most obvious problem is performance. A String is just a list of Chars. In applications that have to process even modestly large amounts of text or binary data, the advantages of general-purpose linked lists are overshadowed by the large losses of efficiency relative to what a specialised implementation would make possible.


 * With binary data, a deeper issue is that a Char-based representation makes little sense, as we are actually dealing with raw bytes.


 * Finally, while Haskell Chars are Unicode characters, overall the support for different encodings and internationalisation in the base libraries is somewhat lacking.

Those shortcomings of String are addressed by the libraries text and bytestring. Both are de facto standards; pretty much all modern libraries whose functionality involves any significant volumes of data input or output use them. They have clearly separate use cases:


 * text is for efficient processing of Unicode text. It supports conversion between encodings and, with the companion text-icu library, a wide assortment of Unicode services.


 * bytestring is for efficient processing of binary data, in all of its guises: network packets, raw image data, serialization (through dedicated libraries), and so on.

The core types of both libraries, Text and ByteString, are implemented as specialised, monomorphic containers of Char and Word8 (i.e. raw bytes) respectively. The internal representation is array-based and very compact. In terms of interfaces, both libraries are quite straightforward. The main subtlety to be aware of is that in both cases there are strict and lazy variants of the types. The strict versions are well-suited for processing large volumes of small pieces of data, while the lazy ones are processed in chunks, and therefore allow for streaming and processing of large pieces of monolithic data without memory consumption woes.

A convenience feature worth being aware of when dealing with String replacements is that the OverloadedStrings GHC extension makes it possible to have automatic, type-directed conversion of string literals to Text or ByteString. This can be quite helpful.
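A minimal sketch of OverloadedStrings in use with the strict variants of both types follows. The values greeting and payload are made-up sample data, and the non-ASCII characters are written as numeric escapes so that the example does not depend on the encoding of the source file.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.ByteString as BS
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- The type signature makes this literal a Text, with no explicit T.pack.
-- The escapes \239 and \233 stand for i-diaeresis and e-acute.
greeting :: T.Text
greeting = "na\239ve caf\233"

-- The same literal syntax gives a ByteString here. Only use this with
-- ASCII literals: characters above 255 are silently truncated to a byte.
payload :: BS.ByteString
payload = "GET / HTTP/1.1"

main :: IO ()
main = do
  print (T.length greeting)                   -- 10 (characters)
  print (BS.length (TE.encodeUtf8 greeting))  -- 12 (bytes in UTF-8)
  print (T.toUpper "text")                    -- "TEXT"
  print (BS.length payload)                   -- 14
  print (BS.unpack (BS.take 3 payload))       -- [71,69,84]
```

The difference between the first two results shows why the two types are kept apart: Text counts Unicode characters, whereas ByteString counts raw bytes.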