Structured Query Language/Window functions

The window functions discussed on this page are a special and very powerful extension to 'traditional' functions. They compute their result not on a single row but on a set of rows (similar to aggregate functions acting in correlation with a GROUP BY clause). This set of rows - and this is the crucial point - 'moves' or 'slides' over all rows which are determined by the WHERE clause. This 'sliding window' is called a frame or - in terms of the official SQL standard - the 'window frame'.

Here are some examples:
 * A straightforward example is a 'sliding window' consisting of the previous, the current, and the next row.
 * One typical area for the use of window functions is the evaluation of arbitrary time series. If you have the time series of market prices of a share, you can easily compute the moving average of the last n days.
 * Window functions are often used in data warehouse and other OLAP applications. If you have data about sales of all products over many periods within many regions, you can compute statistical indicators about the revenues. These evaluations are more powerful than simple GROUP BY clauses.

In contrast to GROUP BY clauses, where only one output row per group exists, with window functions all rows of the result set retain their identity and are shown.

Syntax
Window functions are listed between the two keywords SELECT and FROM at the same place where usual functions and columns are listed. They contain the keyword OVER.
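As a minimal sketch of this placement, the following query lists a window function next to ordinary columns. The table name `employee`, its columns, the sample rows and the alias `dep_salary_sum` are assumptions made for illustration; the inline VALUES list follows PostgreSQL/SQLite-style syntax and may need adapting for other systems.

```sql
-- Hypothetical sample data (assumed for illustration)
WITH employee (id, emp_name, dep_name, salary, age) AS (
  VALUES
    (1,  'Matthew',  'Management', 4500, 55),
    (2,  'Olivia',   'Management', 4400, 42),
    (3,  'Grace',    'Management', 4000, 42),
    (4,  'Jim',      'Production', 3700, 35),
    (5,  'Alice',    'Production', 3500, 24),
    (6,  'Michael',  'Production', 3600, 22),
    (7,  'Tom',      'Production', 3800, 35),
    (8,  'Kevin',    'Production', 4000, 52),
    (9,  'Elvis',    'Service',    4100, 40),
    (10, 'Sophia',   'Sales',      4300, 36),
    (11, 'Samantha', 'Sales',      4100, 38)
)
SELECT id,
       emp_name,
       dep_name,
       salary,
       -- the window function sits between SELECT and FROM and contains OVER
       SUM(salary) OVER (PARTITION BY dep_name) AS dep_salary_sum
FROM   employee
ORDER BY id;
```

Every row of the result set is retained; the department sum is simply repeated on each row of the same department.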

Overall Description
Concerning window functions, there are some similar concepts. To distinguish the concepts from each other, it is necessary to use an exact terminology. This terminology is introduced in the next eight paragraphs, which also - roughly - reflect the order of execution. The goal of the first seven steps is the determination of the actual frame, and the eighth step acts on it.
 * 1) The WHERE clause returns a certain number of rows. They constitute the result set.
 * 2) The ORDER BY clause (syntactically behind the WHERE clause) re-orders the result set into a certain sequence.
 * 3) This sequence determines the order in which the rows are passed to the SELECT clause. The row which is actually given to the SELECT clause is called the current row.
 * 4) The WINDOW PARTITION clause divides the result set into window partitions (We will use the shorter term partition as in the context of our site there is no danger of confusion). If there is no WINDOW PARTITION clause, all rows of the result set constitute one partition. (These partitions are equivalent to groups created by the GROUP BY clause.) Partitions are distinct from each other: there is no overlapping as every row of the result set belongs to one and only one partition.
 * 5) The WINDOW ORDER clause orders the rows of each partition (which may differ from the ORDER BY clause).
 * 6) The WINDOW FRAME clause defines which rows of the actual partition belong to the actual window frame (We will use the shorter term frame). The clause defines one frame for every row of the result set. This is done by determining the lower and upper boundary of affected rows. In consequence, there are as many (mostly different) frames as there are rows in the result set. The upper and lower boundaries are newly determined with every row of the result set! Single rows may be part of more than one frame. The actual frame is the instantiation of the 'sliding window'. Its rows are ordered according to the WINDOW ORDER clause.
 * 7) If there is no WINDOW FRAME clause, the rows of the actual partition constitute frames with the following default boundaries: The first row of the actual partition is their lower boundary and the current row is their upper boundary. If there is no WINDOW FRAME clause and no WINDOW ORDER clause, the upper boundary switches to the last row of the actual partition. Below we will explain how to change this default behavior.
 * 8) The window functions act on the rows of the actual frame.
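The steps above can be mapped onto the parts of a single query. The following is only a sketch: the table `employee`, the sample rows, the filter value 3600 and the alias `sum_prev_and_current` are assumptions chosen for illustration, and the comments merely indicate which clause each step talks about.

```sql
-- Hypothetical sample data (assumed for illustration)
WITH employee (id, emp_name, dep_name, salary, age) AS (
  VALUES
    (1,  'Matthew',  'Management', 4500, 55),
    (2,  'Olivia',   'Management', 4400, 42),
    (3,  'Grace',    'Management', 4000, 42),
    (4,  'Jim',      'Production', 3700, 35),
    (5,  'Alice',    'Production', 3500, 24),
    (6,  'Michael',  'Production', 3600, 22),
    (7,  'Tom',      'Production', 3800, 35),
    (8,  'Kevin',    'Production', 4000, 52),
    (9,  'Elvis',    'Service',    4100, 40),
    (10, 'Sophia',   'Sales',      4300, 36),
    (11, 'Samantha', 'Sales',      4100, 38)
)
SELECT id, emp_name, dep_name, salary,
       SUM(salary) OVER (                          -- step 8: the window function
         PARTITION BY dep_name                     -- step 4: WINDOW PARTITION clause
         ORDER BY salary                           -- step 5: WINDOW ORDER clause
         ROWS BETWEEN 1 PRECEDING AND CURRENT ROW  -- step 6: WINDOW FRAME clause
       ) AS sum_prev_and_current
FROM   employee
WHERE  salary >= 3600                              -- step 1: WHERE clause builds the result set
ORDER BY dep_name, salary;                         -- step 2: ORDER BY clause of the query
```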

Example Table
We use the following table to demonstrate window functions.
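As a stand-in for the example data, the following sketch creates a table `employee` with hypothetical rows. The column types and all values are assumptions; they are only chosen to be consistent with the descriptions on this page (four departments, three rows in 'Management', a single row in 'Service').

```sql
-- Assumed table layout; adapt the types to your system if necessary
CREATE TABLE employee (
  id        INTEGER      PRIMARY KEY,
  emp_name  VARCHAR(20)  NOT NULL,
  dep_name  VARCHAR(20)  NOT NULL,
  salary    DECIMAL(7,2) NOT NULL,
  age       INTEGER      NOT NULL
);

-- Hypothetical sample rows
INSERT INTO employee (id, emp_name, dep_name, salary, age) VALUES
  (1,  'Matthew',  'Management', 4500, 55),
  (2,  'Olivia',   'Management', 4400, 42),
  (3,  'Grace',    'Management', 4000, 42),
  (4,  'Jim',      'Production', 3700, 35),
  (5,  'Alice',    'Production', 3500, 24),
  (6,  'Michael',  'Production', 3600, 22),
  (7,  'Tom',      'Production', 3800, 35),
  (8,  'Kevin',    'Production', 4000, 52),
  (9,  'Elvis',    'Service',    4100, 40),
  (10, 'Sophia',   'Sales',      4300, 36),
  (11, 'Samantha', 'Sales',      4100, 38);
```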

A First Query
The example demonstrates how the boundaries 'slide' over the result set. Doing so, they create one frame after the next, one per row of the result set. These frames are part of partitions, the partitions are part of the result set, and the result set is part of the table. Please notice how the lower boundary (FRAME_FIRST_ROW) and the upper boundary (FRAME_LAST_ROW) change from row to row.

The query has no WHERE clause. Therefore all rows of the table are part of the result set. According to the WINDOW PARTITION clause, which is 'PARTITION BY dep_name', the result set is divided into the 4 partitions 'Management', 'Production', 'Sales' and 'Service'. The frames run within these partitions. As there is no WINDOW FRAME clause, the frames start at the first row of the actual partition and run up to the current row.

You can see that the actual number of rows within a frame (column FRAME_COUNT) grows from 1 up to the total number of rows within the partition. When the partition switches to the next one, the number starts again with 1.

The columns PREV_ROW and NEXT_ROW show the ids of the previous and next row within the actual partition. As the first row has no predecessor, the NULL indicator is shown. This applies correspondingly to the last row and its successor.
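A query that produces the columns described above could look like the following sketch. It is a reconstruction under assumptions, not necessarily the original query: the table `employee` and its rows are hypothetical, and the ordering by `id` within each partition is assumed.

```sql
-- Hypothetical sample data (assumed for illustration)
WITH employee (id, emp_name, dep_name, salary, age) AS (
  VALUES
    (1,  'Matthew',  'Management', 4500, 55),
    (2,  'Olivia',   'Management', 4400, 42),
    (3,  'Grace',    'Management', 4000, 42),
    (4,  'Jim',      'Production', 3700, 35),
    (5,  'Alice',    'Production', 3500, 24),
    (6,  'Michael',  'Production', 3600, 22),
    (7,  'Tom',      'Production', 3800, 35),
    (8,  'Kevin',    'Production', 4000, 52),
    (9,  'Elvis',    'Service',    4100, 40),
    (10, 'Sophia',   'Sales',      4300, 36),
    (11, 'Samantha', 'Sales',      4100, 38)
)
SELECT id, emp_name, dep_name,
       -- no WINDOW FRAME clause: default frame from first row of partition to current row
       FIRST_VALUE(id) OVER (PARTITION BY dep_name ORDER BY id) AS frame_first_row,
       LAST_VALUE(id)  OVER (PARTITION BY dep_name ORDER BY id) AS frame_last_row,
       COUNT(*)        OVER (PARTITION BY dep_name ORDER BY id) AS frame_count,
       -- LAG and LEAD look at the neighbouring rows of the partition
       LAG(id)         OVER (PARTITION BY dep_name ORDER BY id) AS prev_row,
       LEAD(id)        OVER (PARTITION BY dep_name ORDER BY id) AS next_row
FROM   employee
ORDER BY dep_name, id;
```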

Basic Window Functions
We present some of the window functions and their meaning. The standard as well as most implementations include additional functions and overloaded variants.

Here are some examples:

The example shows:
 * The row number of the current row. (row_number counts within the actual partition; as long as the frame runs from the first row of the partition up to the current row, this equals the position within the actual frame.)
 * The employee name of the second row within the actual frame. This is not possible in all cases: a) Every first frame within the series of frames of a partition consists of only 1 row. b) The last partition and its one and only frame contain only one row.
 * The employee name of the row which is two rows 'ahead' of the current row. Similar to the previous column, this is not possible in all cases.
 * Please notice the difference in the last two columns of the first row. The SECOND_ROW_IN_FRAME column contains the NULL indicator. The frame which is associated with this row contains only 1 row (from the first to the current row) - and the scope of the nth_value function is 'frame'. In contrast, the TWO_ROW_AHEAD column contains the value 'Grace'. This value is evaluated by the lead function, whose scope is the partition! The partition contains 3 rows: all rows within the department 'Management'. Only with the second and third row does it become impossible to go 2 rows 'ahead'.
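The three columns discussed in the list can be produced with a query along the following lines. This is a sketch under assumptions: the table `employee` and its rows are hypothetical, the ordering by `id` is assumed, and the alias `row_number_in_frame` is made up (the other two aliases follow the column names used in the text).

```sql
-- Hypothetical sample data (assumed for illustration)
WITH employee (id, emp_name, dep_name, salary, age) AS (
  VALUES
    (1,  'Matthew',  'Management', 4500, 55),
    (2,  'Olivia',   'Management', 4400, 42),
    (3,  'Grace',    'Management', 4000, 42),
    (4,  'Jim',      'Production', 3700, 35),
    (5,  'Alice',    'Production', 3500, 24),
    (6,  'Michael',  'Production', 3600, 22),
    (7,  'Tom',      'Production', 3800, 35),
    (8,  'Kevin',    'Production', 4000, 52),
    (9,  'Elvis',    'Service',    4100, 40),
    (10, 'Sophia',   'Sales',      4300, 36),
    (11, 'Samantha', 'Sales',      4100, 38)
)
SELECT id, emp_name, dep_name,
       ROW_NUMBER()           OVER (PARTITION BY dep_name ORDER BY id) AS row_number_in_frame,
       -- NTH_VALUE acts on the frame: NULL while the frame holds fewer than 2 rows
       NTH_VALUE(emp_name, 2) OVER (PARTITION BY dep_name ORDER BY id) AS second_row_in_frame,
       -- LEAD acts on the partition: NULL only near the end of the partition
       LEAD(emp_name, 2)      OVER (PARTITION BY dep_name ORDER BY id) AS two_row_ahead
FROM   employee
ORDER BY dep_name, id;
```

With these sample rows, the first row of 'Management' shows NULL in second_row_in_frame and 'Grace' in two_row_ahead.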

Determine Partition and Sequence
As shown in the above examples, the WINDOW PARTITION clause defines the partitions by using the keywords PARTITION BY, and the WINDOW ORDER clause defines the sequence of rows within the partition by using the keywords ORDER BY.

Determine the Frame
The frames are defined by the WINDOW FRAME clause, which optionally follows the WINDOW PARTITION clause and the WINDOW ORDER clause.

With the exception of some functions such as lead, lag and row_number, whose scope is the actual partition, the window functions act on the actual frame. Therefore it is an elementary decision which rows shall constitute the frame. This is done by establishing the lower and upper boundary (in the sense of the WINDOW ORDER clause). All rows within these two bounds constitute the actual frame. Therefore the WINDOW FRAME clause consists mainly of the definition of the two boundaries - in one of four ways:
 * Define a certain number of rows before and after the current row. This leads to a constant number of rows within the series of frames - with some exceptions near the lower and upper boundary of the partition and the exception of the use of the 'UNBOUNDED' keyword.
 * Define a certain number of groups before and after the current row. Such groups are built by the unique values of the preceding and following rows - in the same way as a SELECT DISTINCT or a GROUP BY does. The resulting frame covers all rows whose values fall into one of the groups. As every group may be built out of multiple rows (with the same value), the number of rows per frame is not constant.
 * Define a range for the values of a certain column by denoting a fixed numerical value, e.g. 1000 (for a salary) or 30 days (for a time series). The defined range runs from the difference of the current value and the defined value up to the current value (the FOLLOWING case builds the sum, not the difference). All rows of the partition whose column values fall into this range constitute the frame. Accordingly, the number of rows within the frame may differ from frame to frame - in contrast to the ROWS technique.
 * Omit the clause and use default values.

In accordance with these different strategies, there are three keywords 'ROWS', 'GROUPS' and 'RANGE', which lead to the different behavior.

Terminology
The WINDOW FRAME clause uses some keywords to specify positions within the ordered rows of a partition. These rows and the related keywords can be visualized as follows.

Rows in a partition and the related keywords:
   -   <-- UNBOUNDED PRECEDING (first row)
   ...
   -   <-- 2 PRECEDING
   -   <-- 1 PRECEDING
   -   <-- CURRENT ROW
   -   <-- 1 FOLLOWING
   -   <-- 2 FOLLOWING
   ...
   -   <-- UNBOUNDED FOLLOWING (last row)

The term UNBOUNDED PRECEDING denotes the first row in a partition and UNBOUNDED FOLLOWING the last row. Counting from the CURRENT ROW there are n PRECEDING and n FOLLOWING rows. Obviously this PRECEDING/FOLLOWING terminology works only if there is a WINDOW ORDER clause, which creates an unambiguous sequence.

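The (simplified) syntax of the WINDOW FRAME clause is: one of the keywords ROWS, GROUPS or RANGE, followed by BETWEEN, the lower boundary, AND, and the upper boundary. Each boundary is one of UNBOUNDED PRECEDING, n PRECEDING, CURRENT ROW, n FOLLOWING or UNBOUNDED FOLLOWING.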

Consider a window function whose WINDOW FRAME clause reads 'ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW'. In this case the WINDOW FRAME clause starts with the keyword 'ROWS'. It sets the lower boundary to the very first row of the partition and the upper boundary to the current row. This means that the series of frames grows from frame to frame by one additional row until all rows of the partition are handled. Afterward, the next partition starts with a 1-row frame and repeats the growing.
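The growing frame just described can be sketched with a running sum. The choice of SUM(salary), the alias `growing_sum`, the table `employee` and its rows are assumptions made for illustration.

```sql
-- Hypothetical sample data (assumed for illustration)
WITH employee (id, emp_name, dep_name, salary, age) AS (
  VALUES
    (1,  'Matthew',  'Management', 4500, 55),
    (2,  'Olivia',   'Management', 4400, 42),
    (3,  'Grace',    'Management', 4000, 42),
    (4,  'Jim',      'Production', 3700, 35),
    (5,  'Alice',    'Production', 3500, 24),
    (6,  'Michael',  'Production', 3600, 22),
    (7,  'Tom',      'Production', 3800, 35),
    (8,  'Kevin',    'Production', 4000, 52),
    (9,  'Elvis',    'Service',    4100, 40),
    (10, 'Sophia',   'Sales',      4300, 36),
    (11, 'Samantha', 'Sales',      4100, 38)
)
SELECT id, emp_name, dep_name, salary,
       SUM(salary) OVER (
         PARTITION BY dep_name
         ORDER BY salary
         -- frame: from the first row of the partition up to the current row
         ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ) AS growing_sum
FROM   employee
ORDER BY dep_name, salary;
```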

ROWS
The ROWS syntax defines a certain number of rows to process.

An example with the boundaries '2 PRECEDING' and 'CURRENT ROW' acts on a certain number of rows, namely the two rows before the current row (if existing within the partition) and the current row. There is no situation where more than three rows exist in one of the frames. The window function computes the sum of the salary over these at most three rows.

The sum is reset to zero with every new partition, which is the department in this case. This also holds true for the GROUPS and RANGE syntax.

The ROWS syntax is often used when one is interested in the average over a certain number of rows or in the distance between two rows.
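A sum over at most three rows, as described in this section, might be written as in the following sketch. The table `employee`, its rows and the alias `sum_over_max_3_rows` are assumptions.

```sql
-- Hypothetical sample data (assumed for illustration)
WITH employee (id, emp_name, dep_name, salary, age) AS (
  VALUES
    (1,  'Matthew',  'Management', 4500, 55),
    (2,  'Olivia',   'Management', 4400, 42),
    (3,  'Grace',    'Management', 4000, 42),
    (4,  'Jim',      'Production', 3700, 35),
    (5,  'Alice',    'Production', 3500, 24),
    (6,  'Michael',  'Production', 3600, 22),
    (7,  'Tom',      'Production', 3800, 35),
    (8,  'Kevin',    'Production', 4000, 52),
    (9,  'Elvis',    'Service',    4100, 40),
    (10, 'Sophia',   'Sales',      4300, 36),
    (11, 'Samantha', 'Sales',      4100, 38)
)
SELECT id, emp_name, dep_name, salary,
       SUM(salary) OVER (
         PARTITION BY dep_name
         ORDER BY salary
         -- frame: the two rows before the current row (if any) plus the current row
         ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
       ) AS sum_over_max_3_rows
FROM   employee
ORDER BY dep_name, salary;
```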

GROUPS
The GROUPS syntax has semantics similar to the ROWS syntax - with one exception: rows with equal values within the column of the WINDOW ORDER clause count as 1 row. The GROUPS syntax counts the number of distinct values, not the number of rows.

The example starts with the keyword GROUPS and defines that it wants to work on three distinct values of the column 'salary'. Possibly more than three rows satisfy this criterion - in contrast to the equivalent ROWS strategy.

The GROUPS syntax is the appropriate strategy if one has a varying number of rows within the time period under review, e.g. one has a varying number of measurement values per day and is interested in the average of the variance over a week or month.
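The following sketch shows a frame spanning three distinct salary values. The table `employee`, its rows and the aliases are assumptions; the partitioning is left out here (also an assumption) so that equal salaries from different departments fall into the same group. Note that not every system implements GROUPS; PostgreSQL (version 11 and later) and SQLite (3.28 and later) do.

```sql
-- Hypothetical sample data (assumed for illustration)
WITH employee (id, emp_name, dep_name, salary, age) AS (
  VALUES
    (1,  'Matthew',  'Management', 4500, 55),
    (2,  'Olivia',   'Management', 4400, 42),
    (3,  'Grace',    'Management', 4000, 42),
    (4,  'Jim',      'Production', 3700, 35),
    (5,  'Alice',    'Production', 3500, 24),
    (6,  'Michael',  'Production', 3600, 22),
    (7,  'Tom',      'Production', 3800, 35),
    (8,  'Kevin',    'Production', 4000, 52),
    (9,  'Elvis',    'Service',    4100, 40),
    (10, 'Sophia',   'Sales',      4300, 36),
    (11, 'Samantha', 'Sales',      4100, 38)
)
SELECT id, emp_name, salary,
       -- frame: the two preceding distinct salary values plus the current value
       SUM(salary) OVER (ORDER BY salary
                         GROUPS BETWEEN 2 PRECEDING AND CURRENT ROW) AS sum_over_groups,
       COUNT(*)    OVER (ORDER BY salary
                         GROUPS BETWEEN 2 PRECEDING AND CURRENT ROW) AS rows_in_frame
FROM   employee
ORDER BY salary, id;
```

With these sample rows, the frame of a row with salary 4100 covers the values 3800, 4000 and 4100, which are five rows in total.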

RANGE
At first glance, the RANGE syntax is similar to the ROWS and GROUPS syntax. But the semantics are very different! Numbers given in this syntax do not specify a counter. They specify the distance from the value in the current row to the lower or upper boundary. Therefore the ORDER BY column shall be of type NUMERIC, DATE, or INTERVAL. A definition with the boundaries '100 PRECEDING' and '50 FOLLOWING' on the salary column leads to the sum over all rows which have a salary between 100 below and 50 above that of the current row. In our example table, this criterion applies in some rare cases to more than 1 row.

Typical use cases for the RANGE strategy are situations where someone analyzes a wide numeric range and expects to meet only a few rows within this range, e.g. a sparse matrix.
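A frame defined by value distance rather than by row count can be sketched as follows. The table `employee`, its rows and the aliases are assumptions; RANGE with a numeric offset is not available in every system.

```sql
-- Hypothetical sample data (assumed for illustration)
WITH employee (id, emp_name, dep_name, salary, age) AS (
  VALUES
    (1,  'Matthew',  'Management', 4500, 55),
    (2,  'Olivia',   'Management', 4400, 42),
    (3,  'Grace',    'Management', 4000, 42),
    (4,  'Jim',      'Production', 3700, 35),
    (5,  'Alice',    'Production', 3500, 24),
    (6,  'Michael',  'Production', 3600, 22),
    (7,  'Tom',      'Production', 3800, 35),
    (8,  'Kevin',    'Production', 4000, 52),
    (9,  'Elvis',    'Service',    4100, 40),
    (10, 'Sophia',   'Sales',      4300, 36),
    (11, 'Samantha', 'Sales',      4100, 38)
)
SELECT id, emp_name, dep_name, salary,
       -- frame: all rows of the partition with a salary in [salary - 100, salary + 50]
       SUM(salary) OVER (PARTITION BY dep_name ORDER BY salary
                         RANGE BETWEEN 100 PRECEDING AND 50 FOLLOWING) AS sum_over_range,
       COUNT(*)    OVER (PARTITION BY dep_name ORDER BY salary
                         RANGE BETWEEN 100 PRECEDING AND 50 FOLLOWING) AS rows_in_frame
FROM   employee
ORDER BY dep_name, salary;
```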

Defaults
If the WINDOW FRAME clause is omitted, its default value is: 'RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW'. This leads to a range from the very first row of the partition up to the current row plus all rows with the same value as the current row - because the RANGE syntax applies.

If the WINDOW ORDER clause is omitted, the WINDOW FRAME clause is not allowed and all rows of the partition constitute the frame.

If the WINDOW PARTITION clause is omitted, all rows of the result set constitute the one and only partition.
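The effect of these defaults becomes visible when the ordering column contains equal values. The following sketch compares the default frame with explicit frames; the table `employee`, its rows and all aliases are assumptions.

```sql
-- Hypothetical sample data (assumed for illustration)
WITH employee (id, emp_name, dep_name, salary, age) AS (
  VALUES
    (1,  'Matthew',  'Management', 4500, 55),
    (2,  'Olivia',   'Management', 4400, 42),
    (3,  'Grace',    'Management', 4000, 42),
    (4,  'Jim',      'Production', 3700, 35),
    (5,  'Alice',    'Production', 3500, 24),
    (6,  'Michael',  'Production', 3600, 22),
    (7,  'Tom',      'Production', 3800, 35),
    (8,  'Kevin',    'Production', 4000, 52),
    (9,  'Elvis',    'Service',    4100, 40),
    (10, 'Sophia',   'Sales',      4300, 36),
    (11, 'Samantha', 'Sales',      4100, 38)
)
SELECT id, emp_name, age,
       -- no frame clause: default frame, includes rows with the same age as the current row
       COUNT(*) OVER (ORDER BY age) AS cnt_default,
       -- the default written out explicitly
       COUNT(*) OVER (ORDER BY age
                      RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cnt_explicit_range,
       -- ROWS stops exactly at the current row
       COUNT(*) OVER (ORDER BY age
                      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cnt_rows,
       -- no ordering and no partitioning: the whole result set is the frame
       COUNT(*) OVER () AS cnt_no_order
FROM   employee
ORDER BY age, id;
```

With these sample rows, both persons of age 35 get the value 4 in cnt_default and cnt_explicit_range, whereas cnt_rows assigns 3 to one of them and 4 to the other.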

A Word of Caution
Although the SQL standard 2003 and its successors define very clear rules concerning window functions, several implementations do not follow them. Some vendors implement only parts of the standard - which is their own responsibility - but others seem to interpret the standard in a fanciful fashion.

As far as we know, the ROWS syntax conforms to the standard when it is implemented. But it seems that the RANGE syntax sometimes implements what the GROUPS syntax of the SQL standard requires. (Perhaps this is a misrepresentation, and only the publicly available descriptions of various implementations do not reflect the details.) So: be careful, test your system, and give us feedback on the discussion page.

Exercises
Show id, emp_name, dep_name, salary and the average salary within the department.
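One possible solution, given as a sketch: the table `employee`, its rows and the alias `avg_dep_salary` are assumptions.

```sql
-- Hypothetical sample data (assumed for illustration)
WITH employee (id, emp_name, dep_name, salary, age) AS (
  VALUES
    (1,  'Matthew',  'Management', 4500, 55),
    (2,  'Olivia',   'Management', 4400, 42),
    (3,  'Grace',    'Management', 4000, 42),
    (4,  'Jim',      'Production', 3700, 35),
    (5,  'Alice',    'Production', 3500, 24),
    (6,  'Michael',  'Production', 3600, 22),
    (7,  'Tom',      'Production', 3800, 35),
    (8,  'Kevin',    'Production', 4000, 52),
    (9,  'Elvis',    'Service',    4100, 40),
    (10, 'Sophia',   'Sales',      4300, 36),
    (11, 'Samantha', 'Sales',      4100, 38)
)
SELECT id, emp_name, dep_name, salary,
       -- no ORDER BY in the window: the frame is the whole department
       AVG(salary) OVER (PARTITION BY dep_name) AS avg_dep_salary
FROM   employee
ORDER BY dep_name, id;
```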

Do older persons earn more money than younger ones?

To give an answer, show id, emp_name, salary, age and the average salary of 3 (or 5) persons who are of a similar age.

Extend the above question and its solution to show the results within the four departments.
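One possible approach to the age question and its extension, given as a sketch: the averages are taken over the neighbouring rows in age order, once over all persons and once within each department. The table `employee`, its rows and all aliases are assumptions.

```sql
-- Hypothetical sample data (assumed for illustration)
WITH employee (id, emp_name, dep_name, salary, age) AS (
  VALUES
    (1,  'Matthew',  'Management', 4500, 55),
    (2,  'Olivia',   'Management', 4400, 42),
    (3,  'Grace',    'Management', 4000, 42),
    (4,  'Jim',      'Production', 3700, 35),
    (5,  'Alice',    'Production', 3500, 24),
    (6,  'Michael',  'Production', 3600, 22),
    (7,  'Tom',      'Production', 3800, 35),
    (8,  'Kevin',    'Production', 4000, 52),
    (9,  'Elvis',    'Service',    4100, 40),
    (10, 'Sophia',   'Sales',      4300, 36),
    (11, 'Samantha', 'Sales',      4100, 38)
)
SELECT id, emp_name, dep_name, salary, age,
       -- 3 persons of similar age: one before, the current one, one after
       AVG(salary) OVER (ORDER BY age
                         ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS mean_over_3,
       -- 5 persons of similar age
       AVG(salary) OVER (ORDER BY age
                         ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING) AS mean_over_5,
       -- the same idea, restricted to the person's own department
       AVG(salary) OVER (PARTITION BY dep_name ORDER BY age
                         ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS mean_over_3_in_dep
FROM   employee
ORDER BY age, id;
```

Near the youngest and oldest persons the frames contain fewer rows, so the averages there are built over fewer than 3 (or 5) values.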

Show id, emp_name, salary and the difference to the salary of the previous person (in ID-order).
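One possible solution, given as a sketch using the lag function: the table `employee`, its rows and the alias `diff_to_prev` are assumptions.

```sql
-- Hypothetical sample data (assumed for illustration)
WITH employee (id, emp_name, dep_name, salary, age) AS (
  VALUES
    (1,  'Matthew',  'Management', 4500, 55),
    (2,  'Olivia',   'Management', 4400, 42),
    (3,  'Grace',    'Management', 4000, 42),
    (4,  'Jim',      'Production', 3700, 35),
    (5,  'Alice',    'Production', 3500, 24),
    (6,  'Michael',  'Production', 3600, 22),
    (7,  'Tom',      'Production', 3800, 35),
    (8,  'Kevin',    'Production', 4000, 52),
    (9,  'Elvis',    'Service',    4100, 40),
    (10, 'Sophia',   'Sales',      4300, 36),
    (11, 'Samantha', 'Sales',      4100, 38)
)
SELECT id, emp_name, salary,
       -- LAG yields NULL for the first row, so the difference is NULL there as well
       salary - LAG(salary) OVER (ORDER BY id) AS diff_to_prev
FROM   employee
ORDER BY id;
```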

Show the 'surrounding' of a value: id and emp_name of all persons ordered by emp_name. Supplement each row with the two emp_names before and the two after the actual emp_name (in the usual alphabetical order).
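One possible solution, given as a sketch using the lag and lead functions with an offset: the table `employee`, its rows and the four aliases are assumptions.

```sql
-- Hypothetical sample data (assumed for illustration)
WITH employee (id, emp_name, dep_name, salary, age) AS (
  VALUES
    (1,  'Matthew',  'Management', 4500, 55),
    (2,  'Olivia',   'Management', 4400, 42),
    (3,  'Grace',    'Management', 4000, 42),
    (4,  'Jim',      'Production', 3700, 35),
    (5,  'Alice',    'Production', 3500, 24),
    (6,  'Michael',  'Production', 3600, 22),
    (7,  'Tom',      'Production', 3800, 35),
    (8,  'Kevin',    'Production', 4000, 52),
    (9,  'Elvis',    'Service',    4100, 40),
    (10, 'Sophia',   'Sales',      4300, 36),
    (11, 'Samantha', 'Sales',      4100, 38)
)
SELECT id,
       LAG(emp_name, 2)  OVER (ORDER BY emp_name) AS two_before,
       LAG(emp_name, 1)  OVER (ORDER BY emp_name) AS one_before,
       emp_name,
       LEAD(emp_name, 1) OVER (ORDER BY emp_name) AS one_after,
       LEAD(emp_name, 2) OVER (ORDER BY emp_name) AS two_after
FROM   employee
ORDER BY emp_name;
```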