Data Management in Bioinformatics/Normalization

= Functional Dependency (FD) =

Introduction of FD
Functional Dependency: X -> Y holds if, whenever two tuples agree on X, they also agree on Y. For example, "gene_id -> name" and "gid -> annotation" are FDs, but "name -> gene_id" and "name -> annotation" are NOT FDs.
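The definition above can be tested mechanically against a table. The following is a minimal sketch, not part of the original notes: the helper name `holds` and the three gene rows are hypothetical and chosen only to mirror the example FDs. Note that data can refute an FD but can never prove that it holds in general.

```python
def holds(rows, lhs, rhs):
    """Return True if the FD lhs -> rhs is not violated by rows (a list of dicts)."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        # Two tuples that agree on lhs must also agree on rhs.
        if seen.setdefault(key, val) != val:
            return False
    return True


# Hypothetical data: two different genes happen to share a name.
genes = [
    {"gene_id": 1, "name": "abc1", "annotation": "kinase"},
    {"gene_id": 2, "name": "abc1", "annotation": "transporter"},
    {"gene_id": 3, "name": "xyz2", "annotation": "kinase"},
]

print(holds(genes, ["gene_id"], ["name"]))     # True
print(holds(genes, ["name"], ["gene_id"]))     # False
print(holds(genes, ["name"], ["annotation"]))  # False
```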

Relationship between key and FD:

A set of attributes {A1, A2, ..., At} is a key for relation R if

(1) {A1, A2, ..., At} functionally determines all other attributes of R; and

(2) no proper subset of {A1, A2, ..., At} functionally determines all other attributes of R.

A set that satisfies (1) but not necessarily (2) is called a superkey. A key in the sense above (a minimal superkey) is also called a candidate key.

Rules of FDs

Observation:

Correct:


 * X -> Y and A -> B imply XA -> BY
 * X -> AY implies X -> A and X -> Y
 * X -> Y and A -> Y imply XA -> Y

Incorrect:


 * XY -> A implies X -> A and Y -> A

Formal inference rules for functional dependencies (Armstrong’s axioms):

1. Reflexivity: If Y &sube; X then X -> Y (called a trivial FD)

2. Transitivity: If X -> Y and Y -> Z then X -> Z

3. Augmentation: If X -> Y then XZ -> YZ, where Z is a set of attributes
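One way to check whether a candidate rule follows from Armstrong's axioms is to compute an attribute closure, a standard technique that these notes do not introduce. The sketch below is illustrative only: the helper names `closure` and `implies` are hypothetical, and each attribute is assumed to be a single character so that a string such as "XA" stands for the set {X, A}.

```python
def closure(attrs, fds):
    """Return every attribute functionally determined by attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If we already have all of lhs, we may add all of rhs.
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def implies(fds, lhs, rhs):
    """Return True if lhs -> rhs follows from the given FDs."""
    return set(rhs) <= closure(lhs, fds)


# The "correct" observations:
print(implies([("X", "Y"), ("A", "B")], "XA", "BY"))  # True
print(implies([("X", "AY")], "X", "A"))               # True
print(implies([("X", "Y"), ("A", "Y")], "XA", "Y"))   # True

# The "incorrect" observation: XY -> A does not give X -> A.
print(implies([("XY", "A")], "X", "A"))               # False
```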

Where do FDs come from?

By looking at either (1) domain knowledge (the meaning of the attributes) or (2) the data.

b -> a is not an FD, but a -> b is a valid FD.

BCNF
Formal way to break tables:

&rArr; three relations, each with its own FD:


 * R1 (gid, name, annotation), with gid -> name, annotation
 * R2 (exp id, exp desc), with exp id -> exp desc
 * R3 (gid, exp id, exp level), with gid, exp id -> exp level

$$ \hbox{Original relation} \rightarrow \hbox{find an FD } X \rightarrow Y \left\{ \begin{array}{ll} \{X, Y\}: & \hbox{gid, name, annotation} \\ \{\hbox{everything except } Y\}: & \hbox{gid, exp id, exp desc, exp level} \end{array} \right. $$

$$ \hbox{gid, exp id, exp desc, exp level} \left\{ \begin{array}{l} \hbox{exp id, exp desc} \\ \hbox{gid, exp id, exp level} \end{array} \right. $$

A relation R is in BCNF if for all dependencies X -> Y, at least one of the following holds:

• X -> Y is a trivial FD (Y &sube; X)

• X is a superkey for R
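The BCNF test and a single decomposition step can be sketched in a few lines. This is an illustration under stated assumptions, not the course's reference implementation: the function names `closure`, `violates_bcnf` and `split` are hypothetical, and the attribute names and FDs are the ones used in the gene expression example above.

```python
def closure(attrs, fds):
    """Return every attribute functionally determined by attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def violates_bcnf(relation, fd, fds):
    """An FD X -> Y violates BCNF if it is non-trivial and X is not a superkey."""
    lhs, rhs = fd
    trivial = rhs <= lhs
    superkey = closure(lhs, fds) >= relation
    return not trivial and not superkey


def split(relation, fd):
    """Decompose along X -> Y into {X, Y} and {everything except Y}."""
    lhs, rhs = fd
    return lhs | rhs, relation - (rhs - lhs)


relation = frozenset({"gid", "name", "annotation", "exp id", "exp desc", "exp level"})
fds = [
    (frozenset({"gid"}), frozenset({"name", "annotation"})),
    (frozenset({"exp id"}), frozenset({"exp desc"})),
    (frozenset({"gid", "exp id"}), frozenset({"exp level"})),
]

print(violates_bcnf(relation, fds[0], fds))  # True: gid alone is not a superkey
print(violates_bcnf(relation, fds[2], fds))  # False: {gid, exp id} is a superkey

r1, rest = split(relation, fds[0])  # split along gid -> name, annotation
r2, r3 = split(rest, fds[1])        # split the remainder along exp id -> exp desc
print(sorted(r1), sorted(r2), sorted(r3))
```

The sketch applies the two splits by hand. A full BCNF decomposition would also have to project the FDs onto each sub-relation and repeat until no violation remains, which is omitted here.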

3NF
Formal Definition: a relation R is in 3NF if for all dependencies X -> Y in F+, at least one of the following holds:

• X -> Y is a trivial FD (Y &sube; X)

• X is a superkey for R

• Y is contained in a candidate key for R (more precisely, every attribute of Y that is not in X belongs to some candidate key)

Lossy Decomposition
Suppose we have a table such as the one below.

We decompose it into the following "normalized" tables:

This means we followed


 * b → a or
 * b → c

as our FD for the decomposition of the original table. (Remember, BCNF decomposes along X → Y into the sets {X, Y} and {everything except Y}.) However, notice that column b has non-unique values (the value 2 appears twice). When we recombine these two tables with a join, we wind up with a table in which the original relationships are lost:

We call this a lossy decomposition, because the FDs assumed by the decomposed tables do not hold (both b → a and b → c are false). Thus, BCNF cannot remove all forms of redundancy. We will need another kind of dependency, the Multivalued Dependency (MD), to help us decompose tables properly for normalization.
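The effect is easy to reproduce. The sketch below uses two hypothetical rows (not the table from the original notes) in which column b repeats the value 2, projects them onto (a, b) and (b, c), and joins the projections back on b.

```python
# Hypothetical relation over (a, b, c); b = 2 appears in both rows.
original = {(1, 2, 3), (4, 2, 5)}

# Decompose as if b -> a or b -> c were a valid FD.
ab = {(a, b) for a, b, c in original}
bc = {(b, c) for a, b, c in original}

# Natural join of the two projections on b.
joined = {(a, b, c) for a, b in ab for b2, c in bc if b == b2}

# Rows that the join produces but the original never contained.
spurious = joined - original

print(sorted(joined))    # [(1, 2, 3), (1, 2, 5), (4, 2, 3), (4, 2, 5)]
print(sorted(spurious))  # [(1, 2, 5), (4, 2, 3)]
```

The join returns every original row plus two spurious ones, so the original pairing of a with c can no longer be recovered.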

= Multivalued Dependency (MD) =

Let's explore a concrete scenario. We'd like to represent belongings of affluent students. We create a multivalued dependency (MD) of


 * SID →→ car

read as, "A given student ID can have many cars."

Compare this to an FD like


 * SID → name

which we read as, "A given SID can have at most one name."

Suppose we also had the following MD:


 * SID →→ clothes

We can get the following table:

Perhaps we want to decompose this table further. We decompose along the MD of


 * SID →→ car

This yields {SID, car} and leaves us with {SID, clothes}, giving the following tables:

We can't actually do this under the BCNF rules, because these are not FDs! In neither SID →→ car nor SID →→ clothes is the left-hand side a superkey. The key must be all three attributes. We need new rules for decomposing along MDs.

Spotting trivial MDs
We say that an MD $X \rightarrow\rightarrow Y$ is trivial in $R$ if $X \cup Y = \{\hbox{all attributes}\}$ (or if $Y \subseteq X$).

For example, consider the following relation $R_{1}$ where we have the MD $a \rightarrow\rightarrow b$:

Because we can have many $b$s for each $a$, we cannot generate data that violates the MD.

Now consider the following relation $R_{2}$:

In this case, $a \rightarrow\rightarrow b$ is non-trivial, because the relation has three attributes.

Non-trivial MDs come in pairs
In fact, $a \rightarrow\rightarrow b$ has a partner MD which is also non-trivial: $a \rightarrow\rightarrow c$

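Here is the listing of all non-trivial pairs of MDs over the three attributes $a$, $b$, $c$:

{| border="1" cellpadding="4" style="margin: 1em auto"
| $a \rightarrow\rightarrow b$ || $a \rightarrow\rightarrow c$
|-
| $b \rightarrow\rightarrow a$ || $b \rightarrow\rightarrow c$
|-
| $c \rightarrow\rightarrow a$ || $c \rightarrow\rightarrow b$
|}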

Compare this to examples of trivial MDs over the same three attributes, such as $a \rightarrow\rightarrow bc$ and $ab \rightarrow\rightarrow c$.

Implications of FDs do not hold for MDs
In FDs, we know $$X \rightarrow YZ \iff \begin{array}{c} X \rightarrow Y \\ \mbox{and} \\ X \rightarrow Z \end{array}$$

However, this same implication does not exist for MDs. That is, $$X \rightarrow\rightarrow YZ \nLeftrightarrow \begin{array}{c} X \rightarrow\rightarrow Y \\ \mbox{and} \\ X \rightarrow\rightarrow Z \end{array}$$

MDs are actually about independence
Given the MDs
 * $a \rightarrow\rightarrow b$ and
 * $a \rightarrow\rightarrow c$

We know that, for each $a$,
 * there can be many $b$s
 * there can be many $c$s
 * and the $b$s are independent of the $c$s with respect to $a$, which we represent symbolically as $$b \perp c \mid a$$ [NOTE: the actual first binary relation symbol should be <tt>\Perp</tt> instead of <tt>\perp</tt>, but this symbol is not supported by Mediawiki.]
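The independence reading gives a direct test for an MD on concrete data: within each group of rows sharing the same left-hand value, the observed pairs of the other two attributes must form a full cross product. The sketch below assumes a three-attribute relation. The function name `mvd_holds` and the student rows are hypothetical, reusing the SID, car and clothes attributes from the earlier scenario.

```python
from collections import defaultdict
from itertools import product


def mvd_holds(rows, x, y, z):
    """Check x ->> y (and its partner x ->> z) in a relation over x, y, z."""
    groups = defaultdict(set)
    for row in rows:
        groups[row[x]].add((row[y], row[z]))
    for pairs in groups.values():
        ys = {p[0] for p in pairs}
        zs = {p[1] for p in pairs}
        # Independence: every y must appear with every z for this x.
        if pairs != set(product(ys, zs)):
            return False
    return True


# Hypothetical data: student 1 owns two cars and two kinds of clothes.
rows = [
    {"SID": 1, "car": "BMW", "clothes": "suit"},
    {"SID": 1, "car": "BMW", "clothes": "jeans"},
    {"SID": 1, "car": "Audi", "clothes": "suit"},
    {"SID": 1, "car": "Audi", "clothes": "jeans"},
    {"SID": 2, "car": "Tesla", "clothes": "suit"},
]

print(mvd_holds(rows, "SID", "car", "clothes"))    # True

# Dropping one combination breaks the independence, and with it the MD.
broken = rows[:3] + rows[4:]
print(mvd_holds(broken, "SID", "car", "clothes"))  # False
```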

Every FD is an MD
If $a \rightarrow b$, then $a \rightarrow\rightarrow b$. If each $a$ has at most one $b$, then it also satisfies the weaker condition that each $a$ can have many $b$s.

For example, given the following relation

and an FD of $a \rightarrow b$, we also have a trivial MD of $a \rightarrow\rightarrow b$.

Now, consider the FD $a \rightarrow b$ with the following relation:

In this case, we actually have two non-trivial MDs: $a \rightarrow\rightarrow b$ and $a \rightarrow\rightarrow c$.

(Remember, non-trivial MDs come in pairs.)