Structured Query Language/SELECT: Subquery

A subquery is a complete SELECT command which is used within another SELECT, UPDATE, INSERT or DELETE command. The only difference to a simple SELECT is, that it is enclosed in parenthesis.

Classification
Depending on the type of the created result there are three classes of subqueries:
 * Scalar Value Subquery: The subquery returns one single value, e.g:.
 * Row Subquery: The subquery returns one single row of one or more values, e.g:.
 * Table Subquery: The subquery returns a list of rows, which is a table, e.g: . For the classification, it makes no difference whether the resulting list contains zero, one, or more rows. The demarcation between a table subquery and a row subquery is that potentially more than one row may occur.

Every type can be used on all positions where the type it stands for may occur: the scalar value subquery where a single value may occur, the row subquery where a single row may occur and the table subquery where a table may occur. Additionally, table subqueries may occur as an argument of an EXISTS, IN, SOME, ANY or ALL predicate.

Independent from this classification subqueries may be correlated subqueries or non-correlated subqueries. Correlated subqueries have a correlation to the surrounding query by the fact that they use values from the surrounding query within the subquery. Non-correlated subqueries are independent of the surrounding query. This distinction is shown in detail in the next chapter but applies also to the other two subquery classes.

Because correlated subqueries use values, which are determined by the surrounding query and may change from row to row, the subquery is executed - conceptually - as often as resulting rows of the surrounding query exist. This might lead to performance problems. Nevertheless correlated subqueries are an often used construct. In many cases, there are equivalent constructs which use a JOIN. Which one shows the better performance depends highly on the DBMS, and the number of involved rows, the existence of indices, and a lot more variables.

Scalar Value Subquery
The first example creates a list of lastnames, weights and the average weight of all persons.

Because the subquery uses the  function, the SQL compiler knows that it will return exactly one single value. Therefore it's type is Scalar Value Subquery and can be used on positions where scalar values may occur, e.g. in the list between SELECT and FROM.

In the next example, the subquery is used as a deputy for a value within the WHERE clause.

Both examples use the table person twice. One can also use different tables. There is no dependency between the table name in the subquery and in the surrounding query. This applies to all classes of correlated and non-correlated subqueries. The subqueries may retrieve any value from any other table, e.g. the number of contacts.

These first two examples show non-correlated subqueries, which means, that the subqueries are independent from the queries in which they are embedded. They are executed only once.

But often an application faces a situation, where the subquery must use values from the outside query (similar to subroutines which uses parameters). This kind of subquery is called a correlated subquery. As an example, the next query lists persons together with the average weight of their family.

The subselect gets one row of the surrounding SELECT after the next as an parameter with the name 'p'. Within the subselect all columns of the row 'p' are known and may be used. Here the family name from the outside row is used in the subquery to find all persons within the family and the average weight of the family members.

Be careful: Correlated subqueries are executed once per row of the surrounding query. Therefore they are much more costly than non-correlated subqueries. There might exist an equivalent solution using JOIN or GROUP BY which works with better performance. The query optimizer of the DBMS internally might rearrange the given subquery into one of the equivalent forms. But this does not work in all cases.

The distinction between correlated and non-correlated subqueries is universal. It applies also to the other subquery classes.

Row Subquery
This example retrieves one or more persons, whose firstname is the lowest (in the sense of the lexical order) of all firstnames and whose lastname is the lowest of all lastnames. Because of the AND condition it might be the case that no person is found.

Within the subquery, the lowest first- and lastnames are retrieved. The use of the  function guarantees that not more than one row with two columns will arise - therefore it is a row subquery. In the surrounding query, this intermediate result is compared with each row of the complete table person or - if present - an index is used.

It is fortunate that the command retrieves a row. In most cases, the lowest first- and lastname results from different persons. But also in those cases, the command is syntactically correct and will not throw any exception.

In the next example, persons with the lowest first- and lastnames within every family are retrieved. To do so, it is necessary to use a correlated row subquery.

Again, there are the two incarnations of table person, one with the alias name 'p' in the surrounding query and one with the alias name 'sq' in the subquery. The subquery is called once per resulting row of the surrounding query, because the 'p.lastname' may change with every row of 'p'.

Within every family there is at least one person which achieves the condition - it is also conceivable that several persons achieve the condition.

Table Subquery
The next example retrieves persons who have a contact. The class of the subquery is: non-correlated table subquery (used as a condition in the IN predicate).

The subquery creates multiple rows with one column for each of them. This constitutes a new, intermediate table. Therefore this example is a table subquery.

The IN operator is able to act on this intermediate table. In contrast, it is not possible to use operators like '=' or '>' on this kind of intermediate result. In this case, the SQL compiler will recognize a syntax error.

The next example is an extension of the first one. It adds a correlation criterion between the query and the subquery by requesting the lastname within an email-address.

The last comparison after the AND is a little bit complex. It uses the functions CONCAT and UPPER as well as the predicate LIKE, but this is not of interest for the actual topic 'subquery'. The important part is that the subquery refers to 'p.lastname' of the surrounding query. Only Mr. Goldstein meets the criterion that his e-mail address contains his lastname when the two columns are compared case-insensitive.

Remark: CONCAT concatenates two strings. UPPER converts a string to the upper-case. LIKE in combination with the '%' sign looks for one string within another.

Next, there is an example where a non-correlated table subquery is object to a join operation.

Another Example
The example shows a solution to a common problem. Sometimes there are rows describing an outdated stage of entities. Those rows - for one logical entity - differ from each other in some columns and there is an additional column version to track the time flow.

Here is the example table booking and its data.

The problem is to retrieve all actual rows, which are those with the highest version number within each booking. Bookings are considered to be the same if they have the same booking_number.

The first solution uses a non-correlated table subquery.

The subquery creates a list of booking numbers together with their highest version. This list is used by the surrounding query to retrieve the required rows with all its columns.

The second solution uses a correlated scalar value subquery.

The surrounding query retrieves all rows of the table. For each of them, it calls the subquery, which retrieves the highest version within this booking_number. In most cases, this highest version differs from the version of the actual row and because of the '=' operator those rows are not part of the result. Only those, whose version is equal to the value determined in the subquery (and whose booking_number is the same as those used in the subquery) are part of the final result.

A variation of the introducing question may be to retrieve only historical rows (all versions except the highest one) for one special booking.

The surrounding query restricts the rows to those of one special booking. The subquery is called only for those rows.

It's easy to run into pitfalls:

The above query returns all versions of booking 4711 including the actual one! To get the expected result, it's necessary to 'link' the surrounding query and the subquery together.

Exercises
Find the booking with the most versions.

Find all bookings with are canceled (in the latest version).

Create a list of all persons. For each person include the number of persons born in the same city as the person.

Create a list of all persons together with the number of their contact information.

Create a list of all persons together with the number of their e-mail-addresses.

Create a list of all persons together with the number of their contact information. (Same question as above.) Replace the subquery by a JOIN construct.

For which persons there are NO contact information?