Structured Query Language/Retrieve Top N Rows per Group

"Technology evolves from the primitive over the complex to the simple." (Antoine de Saint-Exupery)

The Challenge
Often there is the requirement to access the first or top n rows for every unique value of a given column: the cheapest product (= first row) within a product group (= unique values of the given column), the rows with the highest version number per entity within a historic table, the newest 10 log entries per user, ... . In the SQL world, this is a three-step-job: a) group the table over the given column b) order the rows within the created groups according to a criteria and c) access the first or the top n rows within the created, ordered groups.

In complex cases like this one SQL does not offer only a single solution. There are multiple formulations to get the expected result. At the logical level, they are equivalent, but it is likely that their performance differs strongly from each other. And it's likely that the performance of the same formulation differs strongly on different database systems. The deviation in performance results from the fact that SQL in general defines only WHAT the system shall do and not HOW it shall be done. It is the responsibility of the database system to find an optimal execution plan.

We offer some of the possible solutions - from primitive over complex to simple ones. They include subselects, joins, the  clause, the use of a predicate, and finally   as the method of choice.

Example Table and Data
We use the example table  with a small number of data rows to discuss diverse strategies.

With this structure and data, we will try to access the rows with the highest prize per product group.

Example 1
The first solution uses only the  clause and reduces the problem in two ways: a) it offers only the very first row per group (ignoring the second best, third best, etc. rows) by using the functions max or min and b) the solution has only access to the grouping criteria and the result of max / min. However, due to the nature of the  clause all remaining columns are not accessible - see here.

Example 2
We can extend this first solution to show more columns by combining it with a correlated or non-correlated subquery. This second solution offers access to all columns. Nevertheless, the result is not what we expect as the number of accessed rows is 4. The  criteria is not necessarily unique. Thus we receive 4 rows for the 3 groups out of our small example table. And - as mentioned above - we don't have access to the row with the second-highest prize.

There are problems with these methods. If one uses nothing but the  clause, the complete set columns and rows won't be displayed. If the  is put into a subquery, all columns are displayed, but multiple rows for the same column will be displayed if more than one row meets the criteria.

Example 3
The same holds true for the third solution. One can create a  over the product_group and reduce the resulting rows to those with the highest prize within the group by using the   clause. The result is the same as with the second solution.

Example 4
As the fourth solution we offer a last example how to express the same issue - with the same imperfect result. It uses a  predicate to search those rows, to which there is no higher prize within their group.

Complex Solutions
To overcome the above shortcomings we make 2 adjustments. First, the link between the two s (via join or subselect) must be changed to a column, with unique values. We will use the ID column. Second, we must use the  clause in combination with an   to count rows.

Example 5
First, we show a modification of the upper second solution. The  clause sorts the rows into the desired sequence. The  clause restricts the number of resulting rows to any desired quantity per group. The result of the subquery is a list of ids. Because the ids are a unique criterion within our example table, the outer  retrieves exactly the expected rows - with all their columns.

Example 6
Next, we use a  clause, which allows - similar to a correlated subquery - the use of previously named tables and their columns as a link to later named tables and their columns. In the example, every row of p1 is joined with the first row (FETCH FIRST) of p2 within the same group (p1.product_group = p2.product_group). The resulting columns of p2 are propagated to the outside parts of the query with the name p3. Finally, the join takes place over the id (ON p1.id = p3.id). The p2/p3 aliases retrieve only the rows with the highest prize per group, so they become the result.

Window Functions
Window functions offer a very flexible and rich set of features. They work on multiple rows of the (intermediate) result set by 'sliding' over them like a 'window' and produce their results from the rows actually seen in the window.

They are addressed by two parts: the name of the desired function plus a definition of the 'sliding window', eg:. In this case, the name of the function is 'row_number' and the window definition 'OVER ' remains empty, which leads to a window where all rows are seen. As its name suggests, the function counts the rows within the window.

In our case of 'n-rows-per-group' we must define windows which act on groups of rows (in the sense of the  clause). To do so we expand the window definition to  and get a counter per group:

Example 7
Now the row_number starts with the value '1' within each group respectively partition. We can take advantage of this behaviour by sorting the rows as desired and limit the resulting rows to any desired quantity by querying this row_number in an outer  clause.

Example 8
As a window function cannot be used in the  clause, we must use it in a SELECT witch is nested in another SELECT. The outer SELECT reduces the number of lastly retrieved rows to one per group, which is the one with the highest prize as the  clause contains a.

You can easily modify this solution to enlarge the number of retrieved rows or to integrate additional window functions - eg. if you use rank instead of row_number, you get the additional row with id=2 and prize=675.

Example 9
Lastly, we show a more complex query that retrieves additional statistical values per group. For details, please refer to the page Window functions.