Structured Query Language/NULLs and the Three Valued Logic

The Problem
As mentioned in a previous chapter of this wikibook and in wikipedia sometimes there is no value in a column of a row, or - to say it the other way round - the column stores the NULL marker (a flag to indicate the absence of any data), or - to use the notion of the SQL standard - the column stores the NULL value. This NULL marker is very different from the numeric value zero or a string with a length of zero characters! Typically it occurs when an application yet hasn't stored anything in the column of this row.

(A hint to Oracle users: For Oracle, the NULL marker is identical to a string of zero characters.)

The existence of the NULL marker introduces a new fundamental problem. In the usual boolean logic, there are the two logical values TRUE and FALSE. Every comparison evaluates to one of the two - and the comparisons negation evaluates to the opposite one. If a comparison evaluates to TRUE, its negation evaluates to FALSE and vice versa. As an example, in the usual boolean logic, one of the following two comparisons is TRUE, and the other one is FALSE: 'x < 5', 'x >= 5'.

Imagine now the new situation that x holds the NULL marker. It is not feasible that 'NULL < 5' is true (1). But can we say 'NULL < 5' is false (2) and its negation 'NULL >= 5' is true (3)? Is (3) more feasible than (1)? Of course not. (1) and (3) have the same 'degree of truth', so they shall evaluate to the same value. And this value must be different from TRUE and FALSE.

Therefore the usual boolean logic is extended by a third logic value. It is named UNKNOWN. All comparisons to the NULL marker results per definition in this new value. And the well-known statement 'if a statement is true, its negation is false' gets lost because there is a third option.

SQL's logic is an implementation of this so-called trivalent, ternary or three-valued logic (3VL). The existence of the NULL marker in SQL is not without controversy. But if NULLs are accepted, the 3VL is a necessity.

This page proceeds in two stages: First, it explains the handling of NULLs concerning comparisons, grouping, etc. . Second, it explains the boolean logic for the cases where the new value UNKNOWN interacts with any other boolean value - including itself.

Example Table
To demonstrate NULL behaviors, we define an example tables: t1, and t2.

Comparison Predicates, IS NULL Predicate
SQL knows the six comparison predicates <, <=, =, >=, > and <> (unequal). Their main purpose is the arithmetic comparison of numeric values. Each of them needs two variables or constants (infix notation). This implies that it is possible that one or even both operands hold the NULL marker. As stated before, the common and very simple rule is: "All comparisons to the NULL marker results per definition in this new value (unknown).". Here are some examples:
 * NULL = 5 evaluates to UNKNOWN.
 * 5 = NULL evaluates to UNKNOWN.
 * NULL <= 5 evaluates to UNKNOWN.
 * col_1 = 5 evaluates to UNKNOWN for rows where col_1 holds the NULL marker.
 * col_1 = col_2 evaluates to UNKNOWN for rows where col_1 or col_2 holds the NULL marker.
 * NULL = NULL evaluates to UNKNOWN.
 * col_1 = col_2 evaluates to UNKNOWN for rows where col_1 and col_2 holds the NULL marker.

The WHERE clause returns such rows where it evaluates to TRUE. It does not return rows where it evaluates to FALSE or to UNKNOWN. In consequence, it is not guaranteed that the following SELECT will return the complete table t1:

Of course, there are use cases where rows with the NULL marker must be retrieved. Because the arithmetic comparisons are not able to do so, another language construct must do the job. It is the IS NULL predicate.

Other Predicates
For the other predicates, there is no simple rule of thumb. They must be explained one after the other.

The IN predicate is a shortcut for a sequence of OR operations:

Only the two comparisons 'col_1 = 3' and 'col_1 = 18' are able to retrieve rows (possibly many rows). The comparison 'col_1 = NULL' will never evaluate to TRUE. It's always UNKNOWN, even if col_1 holds the NULL marker. To retrieve those rows, it's necessary - as shown above - to use the 'IS NULL' predicate.

This is a little more complex. This will only return 1, 3, and 4, the items that don't have NULL in t2.col_x or t1.col_1.

The subselect of an EXISTS predicate evaluates to TRUE if the cardinality of the retrieved rows is greater than 0, and to FALSE if the cardinality is 0. It is not possible that the UNKNOWN value occurs.

The LIKE predicate compares a column with a regular expression. If the column contains the NULL marker, the LIKE predicate returns the UNKNOWN value, which means that the row is not retrieved.

Predefined Functions
The aggregate functions  and   ignores such rows where  contains the NULL marker. On the other hand  includes all rows.

If a parameter of one of the scalar functions like  contains the NULL marker the resulting value is - in the most cases - the NULL marker.

Grouping
There are some situations where column values are compared to each other to answer the question, whether they are distinct. For usual numbers and strings, the result of such decisions is obvious. But how shall the DBMS handle NULL markers? Are they distinct from each other, are they equal to each other or is there no answer to this question at all? To get results, which are expected by (nearly) every end-user, the standard defines "Two null values are not distinct.", they build a single group.

retrieves one and only row for all rows where col_1 holds the NULL marker.

builds one and only one group for all rows where col_1 holds the NULL marker.

Step 2: Boolean Operations within three-valued logic (3VL)
After we have seen how various comparisons and predicates on the NULL marker produce TRUE, FALSE, and UNKNOWN, it's necessary to explain the rules for the new logic value UNKNOWN.

Inspection
A first elementary operation is the inspection of a truth value: is it TRUE, FALSE or UNKNOWN? Analogous to the IS NULL predicate there are three additional predicates:
 * IS [NOT] TRUE
 * IS [NOT] FALSE
 * IS [NOT] UNKNOWN

In the abstract syntax of logical systems p shall represent any of its truth values. Here is the three-valued logic truth table:

All predicates lead to TRUE or FALSE and never to UNKNOWN.

NOT
The next operation is the negation of the new value. Which values evaluate to 'NOT UNKNOWN'? The UNKNOWN value represents the impossibility to decide between TRUE and FALSE. It is not feasible that the negation of this impossibility leads to TRUE or FALSE. Likewise, it is UNKNOWN.

The above SELECT (1) will retrieve no rows as 'NOT col_2 = NULL' evaluates to the same as 'col_2 = NULL', namely UNKNOWN. And the SELECT (2) will retrieve all rows, as the part after EXCEPT will retrieve no rows, hence only the part before EXCEPT is relevant.

In the abstract syntax of logical systems p shall represent any of its truth values and NOT p its negation. Herein the following table applies:

AND, OR
There are the two binary operations AND and OR. They evaluate as follows:

The precedence of the operations is defined as usual: IS predicate, NOT, AND, OR.