Data Management in Bioinformatics/Data Querying

= Data Querying =

Query Languages

 * SQL (Structured Query Language ). SQL is a Data Definition and Manipulation Language (both DDL and DML)
 * RA (Relational Algebra)
 * Datalog

Creating Tables

 * Command : CREATE TABLE

For the relations:

Gene( gid, name, annotation)

Experiment( eid, name,date)

Microarray( gid, eid , exprlevel)

Tables can be created using the following set of commands in SQL:

A primary key is used to uniquely identify each row in a table. It can either be part of the actual record itself, or it can be an artificial field. SQL will automatically index the table based on the primary key. Indexing with other keys (other than the primary key) is also possible; it can be done by using the CREATE INDEX command.

Note that the primary key for the "Microarray" table is composed of keys from the other two tables i.e. Gene and Experiment. These fields are called Foreign Keys. It is important to reference where the table gets its primary key from. The syntax is as follows.

These types of constraints are often called Referential Integrity Constraints.

Storing Data

 * Command : INSERT INTO

Example:

Most database management systems have the capability to load bulk data at once.

Projection
  To display all the data stored in the Gene table

 SQL



RA $$Gene$$ 

Datalog

$$Gene(x,y,z)$$  



 To display the names of all genes in the Gene table

 SQL



RA

$$\pi_{name}(Gene)$$ </li>

<li>Datalog

$$Answer(x) \leftarrow Gene(y,x,z)$$ </li>

Note: to list out only distinct values use SELECT DISTINCT in SQL. RA and Datalog return distinct values by default. </ul>

</li>

<li> To display gene names that meet certain criterion

<ul> <li>SQL

</li>

<li>RA

$$ \sigma_{annotation='\%phosphate\%'}(Gene)$$ </li>

<li>Datalog

$$Gene(x,y,z), z='\%phosphates\%'$$ </li> </ul>

</li> </ol>

Cartesian Product
Definition : The Cartesian product of two tables of size nxa and mxb is a table having n*m rows and a+b columns

<ul> <li>SQL </li>

<li>RA

$$Gene \times Experiment$$ </li>

<li>Datalog

$$Answer(x,y,z, a, b,c) \leftarrow Gene (x,y,z), Experiment (a,b,c)$$ </li> </ul>

Example 1

 * Table


 * E-R Diagram

<ul> <li>SQL </li>

<li>RA Genes ?? Expression ?? Expts

(A ?? B) ?? C = A ?? (B ?? C) </li>

<li>Datalog Answer(x, y, t, a, b, w) ← Genes(x, y, z), Expression(x, t, w), Expts(t, a, b)

Genes' x match Expression's x Expts' t match Expression's t </li> </ul>

Example 2
Question: ...

<ul> <li>RA

CannotbeMax = ...

Allpairs = ...

Allpairs - CannotbeMax </li>

<li>SQL

- EXCEPT &cup; UNION &cap; INTERSECT </li>

<li>Datalog CannotbeMax(x, y) <- Expression(x, y, t), Expression(a, b, w), t < w.

Answer(x, y) <- Expression(x, y, a), NOT CannotbeMax(x, y). </li> </ul>

Interesting Query
Without MAX operator - ref. Join section Example 2
 * max
 * min
 * second largest
 * median

Relational Division
A X B = C C / B = ?

Through Relational Algebra <BR> 1. sid =$$\pi$$sidStudent&Courses<BR> 2. Ideal     = sid X R4G<BR> 3. Reality   = Student&Courses<BR> 4. Defaulters = Ideal - Reality<BR> 5. Answer    = sid   - $$\pi$$sidDefaulters <BR> Through SQL <BR> 1. 2.  3.  4.  5.

<BR> Exercise:
 * Find students who satisfied graduate requirements except for 1.
 * Find students who satisfied most of the graduate requirements.

<BR> A View is not really computed, an row most of times cannot be inserted because of ambiguities.<BR> SQL Injection, is a hacking technique where the input is formatted such that it executes a SQL query which is not expected.

<BR>Bells and Whistles <BR> 1. ORDER BY <BR> 2. MAX,AVG,SUM,MIN <BR> 3. GROUP BY, HAVING <BR> 4. SELECT has a functionality of Printing<BR>