Structured Query Language/SELECT: Join Operation

Data should be stored in such a way that no redundant information exists in the database. For example, if our database includes groups of people who, in each case, all pursue the same hobby, then we would rather avoid repeatedly storing the same static details about a given hobby; namely in every record about one of the hobby's enthusiasts. Likewise, we would rather avoid repeatedly storing the same detailed information about an individual hobbyist, namely in every record about one of that person's hobbies. Instead, we create independent person and hobby tables and point from one to the other. This technique for grouping data in separate, redundancy-free tables is called database normalization. Such separation also tends to simplify the logic and enhance the flexibility of assembling precisely the items needed for a given purpose. This assembly is accomplished through the 'JOIN' operation.

The Idea
In our example database, there are two tables: person and contact. The contact table contains the column person_id, which correlates with the Primary-Key column id of the person table. By evaluating the column values, we can join contacts and persons together.



person table P



contact table C


 * }

Joined (virtual) table, created out of person and contact

So, Larry Goldstein, that exists only once in the stored person table, is now listed four times in the joined, virtual table – each time, in combination with one of his four contact items. The same applies to Kim Goldstein and his two contact items.

But what is going on with Tom Burton and Lisa Hamilton, whose contact information is not available? We may have some trouble attempting to join their person data with their non-existent contact information. For the moment, we have flagged the situation with question marks. A detailed explanation of how to transform the problem into a solution appears later on this page.

The Basic Syntax
Obviously it's necessary to specify two things with the JOIN operation
 * the names of the relevant tables
 * the names of the relevant columns

The basic syntax extends the SELECT command with these two elements

Let's make a first attempt.

One of the table names is referenced after the FROM keyword (as previously), and the other one after the new keyword, JOIN, which (no surprise here) instructs the DBMS to perform a join operation. Next, the ON keyword introduces the column names together with a comparison operator (or a general condition, as you will see later). The column names are prefixed with the respective aliases of the table names, p and c. This is necessary because columns with identical names (like id) may exist in multiple tables.

When the DBMS executes the command, it delivers 'something' that contains all the columns from both tables, including the two id columns from their respective (person and contact) tables. The result contains nine rows, one per existing combination of person and contact; viz., due to the 'ON' expression, person records without any corresponding contact records will not appear in the result.

The delivered 'something' looks like a new table; in fact, it has the same structure, behavior, and data as a table. If it is created from a view or as the result of a subselection, we can even perform new SELECTs on it. But there is an important difference between this and a table: Its assembled data is not stored in the DBMS as such; rather, the data is computed at run time from the values of real tables, and only held in temporary memory while the DBMS is running your program.

This key feature – assembling complex information from simple tables – is made possible by means of the two simple keywords, JOIN and ON. As you will see also, the syntax can be extended to build very complex queries, such that you can add many additional refinements to the specification of your join criteria.

It can sometimes be confusing when results don't match your intentions. If this happens, try to simplify your query, as shown here. Confusion often results from the fact that the JOIN syntax itself may become quite complicated. Moreover, joining can be combined with all of the other syntactic elements of the SELECT command, which also may lead to a lack of clarity.

The combination of the join syntax with other language elements is shown in the following examples.

Four Join Types
Earlier on this page, we saw an example of a join result wherein some rows contained person names, but no contact information – instead showing a question mark in that latter column. If the basic syntax of the JOIN operation had been used, those (question-mark) rows would have been filtered out. That (basic syntax with exclusive result) is known as an INNER join. There are also three different kinds of OUTER joins. The results of an OUTER join will contain not only all the full-data rows that an INNER join's results would, but also partial-data rows, i.e., those where no data was found in one or both of the two stored tables; thus, they're called LEFT OUTER, RIGHT OUTER and FULL OUTER joins.

So we can widen the basic JOIN syntax to the four options:
 * [INNER] JOIN
 * LEFT [OUTER] JOIN
 * RIGHT [OUTER] JOIN
 * FULL [OUTER] JOIN

Keywords surrounded by [ ] are optional. The parser infers OUTER from LEFT, RIGHT or FULL, and a plain (i.e., basic-syntax) JOIN defaults to INNER.

Inner Join
The inner join is probably the most commonly used of the four types. As we have seen, it results in precisely those rows that exactly match the criterion following the ON. Below is an example showing how to create a list of persons and their contacts.

What is most significant is that records for persons without any contact information are not part of the result.

Left (outer) Join
Sometimes we need a little more; for example, we might want a list of all person records, to include any contact-information records that may also be available for that person. Note how this differs from the example above: this time, the results will contain all person records, even those for persons who have no contact-information record(s).

In those cases where the contact information is unavailable, the DBMS will supplant it with the 'null value' or with the 'null special marker' (not to be confused with the string (-type) 'null value' or 'null' nor with binary 0. Nonetheless, implementation details aren't important here. The null special marker will be discussed in a later chapter).

In summary, the left (outer) join is an inner join, plus one row for each left-side match without a counterpart on the right side.

Consider the word 'left'. It refers to the left side of the formula, "FROM

What's the difference? We've changed the order of the table names. Note that we're still using a LEFT join, but because contact is now the "left" referent (the object in the FROM clause), contact data will now be considered as being of primary importance; therefore, all the contact rows will appear in the result - along with any corresponding information that may exist in the person table. As it happens, in the database we're using, every contact record corresponds to a person record so, in this case, it works out that the results are equivalent to what they'd have been if we'd used an inner join. Yet they're different from those of the previous left-join example.

Right (outer) Join
The right join obeys the same rules as the left join, but in reverse. Now, every record from the table referenced in the join clause will appear in the result, including those that have no corresponding record in the other table. Again, the DBMS supplies each empty right-column cell with the null special marker. The only difference is that the evaluation sequence of tables is carried out in reverse or, in other words, with the roles of the two tables swapped.

Full (outer) Join
A full join retrieves every row of both the left table and the right table, regardless of whether a corresponding record exists in the respective opposite table.

Given table_1 and table_2 below,



table_1

table_2


 * }

the full join:

will yield: These results contain the (single) matching row, plus a row each for all the other records of both of the original tables. As each of these other rows represent data found in only one of the tables, they are each missing some data, so the cells representative of that missing data contain the null special marker.

Note: The full join is not supported by all DBMS. Nevertheless, because it isn't an atomic operation, it is always possible to create the desired result by a combination of multiple SELECTs with SET operations.

Cartesian Product (Cross Join)
With inner joins it is possible to omit the ON. SQL interprets this as a - syntactically correct - request to combine every record of the left table with every record of the right table. It will return a large number of rows: the product of the row counts of the two tables.

This particular kind of an inner join is called a Cartesian product or CROSS JOIN. The Cartesian product is an elementary operation of relational algebra, which is the foundation for all rDBMS implementations.

Be careful then; if you unintentionally omit the ON term, the result will be much larger than expected. If, for example, the first table contains 10,000 records, and the second one 20,000 records, the output will contain 200 million rows.

The n:m Situation
How can we create a list of persons and their hobbies? Remember: one person may run many hobbies and several persons may run the same hobby. So there is no direct connection from persons to hobbies. Between the two tables, we have created a third one person_hobby. It holds the id of persons as well as the id of hobbies.

We have to 'walk' from person to person_hobby and from there to hobby.

Please note that no column of the table person_hobby goes to the result. This table acts only during intermediate execution steps. Even its column id is not of interest.

Some people do not perform a hobby. As we performed an INNER JOIN they are not part of the above list. If we want to see in the list also persons without hobbies, we must do what we have done before: use LEFT OUTER JOINs instead of INNER JOINs.

Hint: If necessary, we can combine every kind of join with every other kind of join in every desired sequence, e.g.: LEFT OUTER with FULL OUTER with INNER ... .

More Details
Criteria for join operations are not restricted to the usual formulation:

First, we can use any column, not only primary key and foreign key columns. In one of the above examples, we used the lastname for a join. Lastname is of type character and has no meaning of any key. To avoid poor performance, some DBMS restrict the use of columns to those having an index.

Second, the comparator is not restricted to the equal sign. We can use any meaningful operator, for example, the 'greater than' for numeric values.

Third, we can use an arbitrary function.

Exercises
Show first- and lastname plus icq number for persons having an icq number

Show first- and lastname plus ICQ number plus fixed line number for persons having an ICQ number AND a fixed line. You need to join the contact table twice.

Show first- and lastname plus (if present) the ICQ number for ALL persons

Create a list which contains ALL hobbies plus according persons (if present)

Is it possible that one of the three outer joins contains fewer rows than the corresponding inner join?