Data Management in Bioinformatics/SQL Exercises

For these questions, we will consider the following tables about microarray gene expression data. Your task is to express each of the given queries in SQL. Example data are provided in these tables for your convenience but note that they are only example data. Your queries must work for all potential data residing in the given tables, not just those illustrated here.

Genes gid | name   | organism |         annotation -+-+--+ g1 | YLR180C | yeast    | hypothetical protein g2 | YLR181D | yeast    | response to desiccation g3 | sp15    | yeast    | drought stress responsive g4 | pdp77   | pine     | putative stress responsive g5 | hsp70   | pine     | heat shock protein g6 | hsp90   | pine     | heat shock protein

Expression gid | experimentid | level | significance -+--+---+-- g1 | exp12        |   3.5 |            1 g2 | exp23        |    -3 |            1 g3 | exp12        |     1 |            2 g3 | exp13        |  -1.5 |            2 g3 | exp23        |   1.7 |            4 g4 | exp12        |   1.5 |            2 g4 | exp13        |   1.5 |            2 g4 | exp23        |   1.5 |            2 g4 | exp6         |   1.5 |            2 g5 | exp6         |     2 |            1 g5 | exp13        |   2.5 |            2 g6 | exp6         | -3.86 |            3

Experiments experimentid |       name        | whoperformed  |    date --++---+ exp12       | Systematic Torture | Prof. Pain    | 2004-06-02 exp23       | Heaped Abuse       | Tommy Student | 2004-06-03 exp13       | Salt Stress        | Gasch         | 1998-07-04 exp6        | Sorbitol Exposure  | Gasch         | 1999-07-05

Membership gid |     category -+ g1 | glutathione g2 | antioxidant g3 | glycine binding g1 | amino acid binding g4 | amino acid binding g5 | amino acid binding g6 | binding

GOTree category     |  parent_category + antioxidant       | molecular function binding           | molecular function glutathione       | antioxidant glycine binding   | amino acid binding amino acid binding | binding

= Question 1 =

Find the names of experiments performed by Prof. Pain after Jan 1, 2004.

Answer
= Question 2 =

Find the names of genes that were either positively expressed twofold or more with a significance of at least 1.0, in some experiment, or negatively expressed twofold or less with a significance of at least 1.0, in some experiment. List them alongside their organisms in a two-column format.

Answer
= Question 3 =

What is the grandparent category of 'glycine binding'?

Answer
= Question 4 =

Find the names of experiments that were performed before some Gasch experiment.

Subquery Answer
= Question 5 =

Find the names of pine genes that were positively expressed more than 0.5-fold (with a significance of 1.0 or more) in at least two experiments.

Straightforward Answer
First, we must find the experiments where genes are upreglated and significant.

Next, we must determine the genes which were upregulated in at least two experiments. We do this by taking the product of the upregulated genes and selecting rows where the gene ID is the same but the experiment ID is different.

Finally, we determine which of these genes come from pine, and project their names.

We can alternatively do these steps all in one single query:

Subquery Answer
As another approach, we can make use of subqueries to find the answer. The key to this is to make a correlated subquery where the subquery depends on some property (in this case the gene ID) of the parent query. Note that we'll still need to make use of the Upregulated view created above to reduce code redundancy.

GROUP BY Answer
We can make use of advanced features for certain database systems such as PostgreSQL and MySQL to make these queries in slightly more convenient ways via the GROUP BY and COUNT.

= Question 6 =

Find the names of pine genes that were up-regulated 0.5-fold or more (with a significance of 1.0 or more) in at least three experiments.

Straightforward Answer
Similar to the answer for question 5. The caveat here is that while the equality evaluations are transitive, while inequality evaluations are not, and so every case must be covered.

Alternatively:

Subquery Answer
We need to build in another correlated subquery for our original correlated subquery to make this work.

GROUP BY Answer
Simply adjust the count evaluation.

= Question 7 =

Find the names of pine genes that were up-regulated 0.5-fold or more (with a significance of 1.0 or more) in at exactly two experiments.

Straightforward Answer
The key here is identifying that taking the set of genes upregulated in two or more experiments and subtracting the set of genes upregulated in three or mor experiments gives the set of genes upregulated in precisely two experiments. Thus, our answer is the answer to question 5 subtracted by the answer to question 6.

Or alternatively:

GROUP BY Answer
= Question 8 =

Find the genes that showed positive expression in every experiment recorded for it.

SELECT Expression.gid, level from Expression WHERE level>1.0;

= Question 9 =

What is the name of the gene that was most positively expressed in experiment exp23? Assume a minimum significance of 1.0.

Answer [suggested by ChatGPT3]
Here's one way you could write an SQL query to find the name of the gene that was most positively expressed in experiment exp23: SELECT Genes.name FROM Genes JOIN Expression ON Genes.gid = Expression.gid WHERE Expression.experimentid = 'exp23' AND Expression.level > 0 AND Expression.significance >= 1 ORDER BY Expression.level DESC LIMIT 1; This query selects the "name" column from the "Genes" table and join it with the "Expression" table on the "gid" column. It then filters the results based on the conditions provided in the "WHERE" clause. The conditions in the WHERE clause are checking for rows where the experiment id is 'exp23', level is greater than 0 and significance is greater than or equal to 1. This will return all the rows in the "Genes" table where the genes were positively expressed in experiment exp23 with a minimum significance of 1.0. Then the query orders the results by level in descending order and limits the results to 1, which will give the gene with the highest level of expression.

This query will return the name of the gene that was most positively expressed in experiment exp23 with a minimum significance of 1.0.

= Question 10 =

This question builds on question 8; what is the name of the gene that was "second most positively expressed"? Assume again a minimum significance of 1.0.

= Question 11 =

Which gene(s) were positively expressed in ALL the experiments listed in the Experiments table? No constraints on significance level.

= Question 12 =

Prepare a table of genes, their annotation, and any experiment in which they were either the highest or lowest expressed (of any significance level). Include a fourth column to say if they were the highest or lowest.