Structured Query Language/SELECT: Set Operations

Tables, views, and the results of SELECT commands are somewhat similar to the sets of set theory. In this comparison, the elements of a set correspond to the rows of tables, views, and SELECT results. The differences between set theory and these SQL constructs are:
 * Sets of set theory do not allow duplicates, whereas SQL does. (Even different rows of one table may be identical, as there is no obligation to define a primary key.) In the following, we use the term multiset when we speak about sets in SQL where duplicates are possible.
 * Sets of set theory and multisets are not ordered. But for the result of a SELECT command, we can enforce an ordering through the optional ORDER BY clause.

The comparison between set theory and SQL goes even further. SQL offers operations that act on multisets in the sense of set theory: UNION, INTERSECT, and EXCEPT (some implementations name it MINUS) process the intermediate multisets generated by different SELECT commands. The operations expect the multisets to be of the same type. This mainly means that they must have the same number of columns. The data types of corresponding columns should also match, but this is not mandatory: if they differ, the DBMS casts them to a common data type, if possible.

UNION
The UNION operation merges the results of several SELECT commands. The result of the UNION contains those values that are in the first or in the second intermediate result.
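
A minimal sketch of such a command, assuming the example table person offers the columns firstname and lastname (the column names are taken from the surrounding text, their exact definition is an assumption):

```sql
-- Two SELECTs whose results are merged by one UNION operation.
-- The single result column takes its name from the first SELECT.
SELECT firstname FROM person
UNION
SELECT lastname FROM person;
```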



A command that combines two SELECTs with one UNION operation is a single SQL command. The SELECTs are evaluated first. Afterward, their results are merged into one single result. Applied to the firstname and lastname columns of table person, the result contains all firstnames and lastnames in a single column (such an example is of limited practical use; it only demonstrates the UNION).

DISTINCT / ALL

If we examine the result carefully, we notice that it consists of only 17 values. The table person contains ten rows, so we would probably expect twenty values in the result. If we run 'SELECT firstname ...' and 'SELECT lastname ...' as separate commands without the UNION, each of them returns ten values. The explanation for the three missing values is the UNION operation: by default, UNION removes duplicates, so some of the intermediate values are dropped. To obtain the duplicate values as well, we have to modify the UNION operation. Its behavior is controlled by one of the two keywords DISTINCT or ALL. DISTINCT is the default and removes duplicate values, as we have seen. ALL retains all values, including duplicates.
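
The two variants might be written as follows. This is a sketch under the same assumption as before (table person with the columns firstname and lastname); the explicit DISTINCT keyword is not accepted by every DBMS.

```sql
-- ALL keeps every value of both intermediate results, including duplicates.
SELECT firstname FROM person
UNION ALL
SELECT lastname FROM person;

-- DISTINCT spells out the default behavior: duplicates are removed.
SELECT firstname FROM person
UNION DISTINCT
SELECT lastname FROM person;
```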

A hint for Oracle users: The use of the keyword DISTINCT, which is the default, is not accepted by Oracle. Omit it.

General hint

In most cases, the UNION combines SELECT commands on different tables or different columns of the same table. SELECT commands on the same column of a single table usually use the WHERE clause in combination with boolean logic.
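
To illustrate the hint, the following sketch shows both forms side by side. The column weight is mentioned in the exercises of this page; the threshold values 60 and 90 are hypothetical and chosen only for demonstration.

```sql
-- Two SELECTs on the same column of one table, combined with UNION ...
SELECT lastname FROM person WHERE weight < 60
UNION
SELECT lastname FROM person WHERE weight > 90;

-- ... are usually written as one SELECT with boolean logic in the WHERE clause.
-- DISTINCT mirrors the duplicate removal that UNION performs by default.
SELECT DISTINCT lastname
FROM   person
WHERE  weight < 60 OR weight > 90;
```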

INTERSECT
The INTERSECT operation evaluates to those values that are in both intermediate results, in the first as well as in the second.
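
A small sketch, again assuming the columns firstname and lastname of table person. Whether the result contains any rows depends on the data.

```sql
-- Values that occur as a firstname and also as a lastname.
SELECT firstname FROM person
INTERSECT
SELECT lastname FROM person;
```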

A hint to MySQL users: MySQL (5.5) does not support the INTERSECT operation; it was added in MySQL 8.0.31. But as it is not an elementary operation, there are workarounds for older versions.
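
One possible workaround, sketched for the assumed columns firstname and lastname of table person, replaces the INTERSECT with a subquery. It is not a full equivalent: INTERSECT treats two NULL values as equal, whereas the IN predicate never matches NULL.

```sql
-- Emulates: SELECT firstname FROM person INTERSECT SELECT lastname FROM person
-- DISTINCT reproduces the duplicate removal of INTERSECT.
SELECT DISTINCT firstname
FROM   person
WHERE  firstname IN (SELECT lastname FROM person);
```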

EXCEPT
The EXCEPT operation evaluates to those values that are in the first intermediate result but not in the second.
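
A small sketch with the assumed columns firstname and lastname of table person:

```sql
-- Firstnames that never occur as a lastname.
-- On a DBMS that uses the keyword MINUS, replace EXCEPT accordingly.
SELECT firstname FROM person
EXCEPT
SELECT lastname FROM person;
```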

A hint to MySQL users: MySQL (5.5) does not support the EXCEPT operation; it was added in MySQL 8.0.31. But as it is not an elementary operation, there are workarounds for older versions.
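
One possible workaround uses a correlated subquery with NOT EXISTS. The sketch assumes the columns firstname and lastname of table person and the hypothetical aliases p1 and p2. Rows with NULL values may be treated differently than by a real EXCEPT.

```sql
-- Emulates: SELECT firstname FROM person EXCEPT SELECT lastname FROM person
-- DISTINCT reproduces the duplicate removal of EXCEPT.
SELECT DISTINCT p1.firstname
FROM   person p1
WHERE  NOT EXISTS (SELECT 1
                   FROM   person p2
                   WHERE  p2.lastname = p1.firstname);
```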

A hint to Oracle users: Oracle uses the keyword MINUS instead of EXCEPT.

Order By
We can combine set operations with all other elements of the SELECT command, in particular with ORDER BY and GROUP BY. This raises a question: to which part of the command does the ORDER BY belong? To the first SELECT, the second SELECT, or the result of the UNION? The SQL rules determine that set operations are evaluated before ORDER BY clauses (as always, parentheses can change the order of evaluation). Therefore the ORDER BY sorts the final result and not any of the intermediate results.
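
A sketch of a UNION followed by an ORDER BY, assuming the columns firstname and lastname of table person. The result column carries the name of the first SELECT's column, so this name is used for sorting.

```sql
-- The ORDER BY is applied after the UNION ALL, to the combined result.
-- It does not sort only the lastnames of the second SELECT.
SELECT firstname FROM person
UNION ALL
SELECT lastname FROM person
ORDER BY firstname;
```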

We can rearrange such a command to make the order of evaluation explicit. First, the two SELECTs are evaluated, afterward the UNION. This intermediate result gets the name 't'. Finally, 't' is ordered.
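
The rearranged form could look like the following sketch, which assumes the columns firstname and lastname of table person:

```sql
-- The UNION is evaluated inside the parentheses and named 't'.
-- The outer SELECT then sorts 't' as a whole.
SELECT   *
FROM    (SELECT firstname FROM person
         UNION ALL
         SELECT lastname FROM person) t
ORDER BY t.firstname;
```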

Often one would like the rows from the first SELECT to be ordered independently of the rows from the second SELECT. We can do this by adding a virtual column to the result of each SELECT statement and using the virtual columns in the ORDER BY.
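
A sketch of this technique. The name sort_key for the virtual column is hypothetical; the columns firstname and lastname of table person are assumed as before.

```sql
-- The constant 1 marks rows of the first SELECT, 2 marks rows of the second.
-- Sorting by sort_key first keeps the two groups apart;
-- within each group the values are sorted alphabetically.
SELECT 1 AS sort_key, firstname FROM person
UNION ALL
SELECT 2, lastname FROM person
ORDER BY sort_key, firstname;
```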

Group By
With the GROUP BY clause, things are a little more complicated than with ORDER BY. The GROUP BY refers to the last SELECT or, to put it the other way round, to the SELECT of its direct level.
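
The difference can be sketched with two commands. The alias person_name and the column name occurrences are hypothetical; firstname and lastname of table person are assumed as before.

```sql
-- The GROUP BY belongs only to the second SELECT, not to the UNION.
SELECT firstname FROM person
UNION
SELECT lastname FROM person
GROUP BY lastname;

-- To group the combined result, the UNION is wrapped in a subquery 't'.
-- Now the GROUP BY is on the same level as the outer SELECT.
SELECT   t.person_name, COUNT(*) AS occurrences
FROM    (SELECT firstname AS person_name FROM person
         UNION ALL
         SELECT lastname FROM person) t
GROUP BY t.person_name;
```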

Exercises
Show the lowest, highest and mean weight as a) 3 values of 1 row and b) 1 value in 3 rows.
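
One possible solution, assuming that table person has a numeric column weight:

```sql
-- a) three values in one row
SELECT MIN(weight), MAX(weight), AVG(weight) FROM person;

-- b) one value in each of three rows
-- UNION ALL guarantees three rows even if two of the values are equal.
SELECT MIN(weight) FROM person
UNION ALL
SELECT MAX(weight) FROM person
UNION ALL
SELECT AVG(weight) FROM person;
```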

Extend the previous 3-rows-solution to meet two additional criteria: a) consider only persons born in San Francisco and b) add a virtual column that shows 'Min', 'Max' and 'Avg' to label the corresponding numeric values.

Extend the previous solution to order the result: the minimum value first, followed by the average and then the highest value.
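
One possible solution for the labeled and ordered three-row result. The column name place_of_birth and the alias description are assumptions; they are not given on this page.

```sql
-- Each SELECT delivers one label and one aggregated value.
-- The minimum is never greater than the average, and the average is never
-- greater than the maximum, so sorting by the value yields Min, Avg, Max.
SELECT 'Min' AS description, MIN(weight) AS weight
FROM   person WHERE place_of_birth = 'San Francisco'
UNION ALL
SELECT 'Max', MAX(weight)
FROM   person WHERE place_of_birth = 'San Francisco'
UNION ALL
SELECT 'Avg', AVG(weight)
FROM   person WHERE place_of_birth = 'San Francisco'
ORDER BY weight;
```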

Create a list of lastnames for persons with a weight greater than 70 kg together with all e-mail values (one value per row). There is no correlation between lastnames and e-mails.

(This example is not very helpful in practice, but it is instructive.)
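
One possible solution. It assumes that e-mail values are stored in a table contact with the columns contact_type and contact_value and that e-mails carry the type 'email'; none of these names is given on this page.

```sql
-- Lastnames and e-mail values end up in the same result column.
SELECT lastname FROM person WHERE weight > 70
UNION
SELECT contact_value FROM contact WHERE contact_type = 'email';
```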

In the previous example, the lastname 'de Winter' is shown only once. But more than one person of this family has a weight greater than 70 kg.

Why?

Extend the previous solution to show as many result rows as there are matches for the criteria.
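
A plain UNION removes duplicates, which is why a lastname that matches several times appears only once. One possible solution therefore uses UNION ALL. The table contact and its columns contact_type and contact_value are assumptions, as is the type value 'email'.

```sql
-- ALL retains every matching row, including repeated lastnames.
SELECT lastname FROM person WHERE weight > 70
UNION ALL
SELECT contact_value FROM contact WHERE contact_type = 'email';
```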