Structured Query Language/Eliminate Duplicates

The Challenge
Sometimes people detect that corrupt or unwanted data exist in their database, e.g.: they forgot to create an Primary Key at CREATE TABLE time and identical values were inserted in the key column or they recognize that values in one or the combination of multiple columns are not unique - as a violation of the business rules. The situation is commonly detected when they issue an  or   command.

In such cases, data must be corrected, or some rows must be deleted. As the first case depends strongly on the individual situation, we focus on the latter action. In general, the SQL command will consist of two parts: The DELETE command and a second part, where the pending rows are identified. In complex situations, it may be necessary to use more than one SQL command (which always is declarative by definition) - maybe a CURSOR with a loop over the affected rows and additional actions depending on the values in different columns.

The solutions we discuss here are closely related to the explanations in Structured Query Language/Retrieve Top N Rows per Group. Over there, we locate certain rows within groups. The same must be done here as we want to delete only dedicated rows. At least one row must be kept in each affected group.

We use the same table product on this page. We will eliminate all but one row where the product prize is identical to any other prize within the same product group. The goal is that each row will have a unique combination of product_group and prize.

Identify affected Rows
A first approach to the situation may be a 'sniffing' in the data with a  clause for a listing of possibly affected rows.

But the  clause is not very helpful as it is not possible to show columns other than the grouping columns and the result of some system functions like   (in rare cases a sort over a timestamp together with   does help). The question is: how can we identify the 'right' and the 'wrong' rows? We need access to other columns of the rows to identify them. In the best case, we get access to the row's IDs.

To see such details we replace the  clause by a window function (this is not the only possible solution). The following SQL command uses the same grouping over the two columns product_group and prize. And it uses a similar way to count affected rows. The main difference is that we see and have access to all columns of all rows.

This  offers everything we need: The last column cnt counts the number of unique product_group/prize combinations. And the column id gives us access to every single row.

In the next step, we expand this query and shift it into a subselect (window functions cannot be used in a  clause, only their results). The rows with a counter of '1' are not of interest, we eliminate them from the further processing, order the remaining rows in a deterministic way, and compute an additional column for the position within each group.

Up to this point our algorithm to identify problematic rows is easy, clear and the same for all use cases: create groups over the columns of interest with the  clause, count the number of rows within the groups, and eliminate groups with a counter of '1'. But now we have to decide which of the rows shall survive and which ones shall be deleted (or modified)? The answer depends strongly on the business logic, the manner in which the data was added into the table, the expectations of your customers, and much more. So you have to make your own decision.

On this page, we choose a simple solution: The row with the smallest ID shall survive; all others will be deleted. For testing purposes, we retrieve the rows we intend to delete, namely those with a position greater 1.

Delete Rows
If this is what you expect, you can delete the rows in the final step. Reduce the above command to retrieve only the IDs, shift it into a subselect, and use its result as the input for a  command.