Metabolomics/Databases


=Overview=

The vast amount of metabolomic information harvested using high-throughput techniques requires effective storage to organize and disseminate the data and to facilitate its analysis and annotation. This need has driven the development of databases as repositories for the metabolomic data being produced. The data housed in these databases cover the wide spectrum of metabolomics research, from NMR spectra to the substrates and products of metabolic pathways.

Metabolomic databases serve the primary purpose of organizing information on the large catalog of metabolites encountered in metabolic pathways. Many such databases exist on the World Wide Web, housing a wide variety of information on a large range of organisms.

Biological Magnetic Resonance Data Bank
The Biological Magnetic Resonance Data Bank (BMRB) focuses on quantitative data generated by spectroscopic investigations of biological macromolecules. It links to search engines such as PubChem that connect to recent articles and new data, and to projects and other databases related to metabolomics and metabonomics. The database focuses on the NMR side of metabolite discovery and on the roles metabolites play in metabolism. BMRB offers a large list of known compounds and the information associated with each.

Terms:
 * Metabonomics (versus Metabolomics): The words 'Metabolomics' and 'Metabonomics' are often used interchangeably, though a consensus is beginning to develop as to the specific meaning of each. The goal of metabolomics is to catalog and quantify the myriad small molecules found in biological fluids under different conditions. Metabonomics is the study of how the metabolic profile of a complex biological system changes in response to stresses such as disease, toxic exposure, or dietary change.
 * Metabolites: low-molecular-weight molecules that are intermediates or products of metabolism.
 * Diamagnetism: A weak repulsion from a magnetic field. It is a form of magnetism that is only exhibited by a substance in the presence of an externally applied magnetic field. It results from changes in the orbital motion of electrons. Applying a magnetic field creates a magnetic force on a moving electron in the form of F = Qv × B, where Q is the electron's charge, v its velocity, and B the magnetic field. This force changes the centripetal force on the electron, causing it to either speed up or slow down in its orbital motion. This changed electron speed modifies the magnetic moment of the orbital in a direction opposing the external field.
 * Calmodulin: A calcium-binding regulatory protein in intracellular signaling pathways. It is highly conserved and abundant in all eukaryotic cells. As a signaling protein, calmodulin's function is to bind calcium ions and then bind a target protein, affecting its activity. It affects processes ranging from neurotransmitter release to membrane protein organization.
 * Heuristic: A method to help solve a problem, commonly informal. It is particularly used for a method that often rapidly leads to a solution that is usually reasonably close to the best possible answer.

Relevance: This information relates to what we have studied in class because we have been studying metabolism and the metabolites involved. This resource collects the currently available knowledge on those compounds. The field of metabolomics is growing, and with the help of NMR spectroscopy more compounds and metabolites will be discovered along with their functions. The information studied in class forms the foundation for this knowledge.

Metabolomics: Resources, Reagents, and Kits for Metabolomic Analysis
The Sigma-Aldrich database provides access to a number of metabolomic kits and reagents, as well as a number of resources, including information on cell signaling pathways, enzyme structures/functions/specificities, animations, charts, and an online library. This site also provides links to other resources.

Terms:
 * Cytokines: a group of proteins and peptides used in organisms as signaling compounds. They consist mainly of small water-soluble proteins and glycoproteins and play a central role in the immune system.
 * Metabolome: a collection of all the metabolic products and intermediates found in an organism.
 * Angiopoietin: protein growth factors that promote the formation of new blood vessels. Only four angiopoietins have been identified: Ang1, Ang2, Ang3, and Ang4.
 * Phosphoproteomics: a type of proteomics involved in identifying, cataloging, and characterizing proteins containing a phosphate group as a post-translational modification.

Relevance: This website shows cell signaling and other metabolic pathways (including glycolysis) in an animated, in-depth way. This site also provides a search feature to find pathways related to molecules of your choosing.

Madison Metabolomics Consortium Database
The Madison Metabolomics Consortium Database contains metabolites determined through NMR and MS. Its main focus is Arabidopsis thaliana, but it also covers many other species. The database also contains information on the presence of metabolites under several different physiological conditions, their structures in 2D and 3D, and links to related resources and other databases.

Terms:
 * Nuclear magnetic resonance spectroscopy (NMR): technique using nuclear magnetic resonance to determine structural information about a molecule.
 * Mass spectrometry (MS): technique that uses the mass-to-charge ratio of ions to determine the composition of a sample.
 * Arabidopsis thaliana: the thale cress, a species of plant with a small genome and rapid life cycle that is a model organism in the lab.
 * Chemoinformatics: the use of computer and informational techniques, applied to a range of problems in the field of chemistry.
 * Chemical shift: relevant to NMR, chemical shift describes the dependence of nuclear magnetic energy levels on the electronic environment of a molecule.

Relevance: Using this website, it is possible to enter a molecule of interest into the search engine and obtain links that lead to a list of pathways in which that molecule participates. Doing this for glucose displayed two pathways related to material covered in class: starch degradation (which feeds glucose into glycolysis) and glycogen degradation.

MetaCyc
The main focus of the MetaCyc Database is to collect and display information on experimentally studied pathways from a variety of organisms. Pathways are divided into five categories: biosynthesis, degradation/utilization/assimilation, detoxification, generation of precursor metabolites and energy, and Super-Pathways. Clicking on any of these opens, in outline format, more specific categories. This eventually leads to individual pathways that are described graphically. There are also descriptions with details about their history and connected pathways. The database can also be browsed by compounds and reactions, though these sections tend to be less detailed.

MetaCyc allows anyone to submit newly identified pathways, but its curators require detailed, experimentally proven data, which are closely examined before any addition is accepted.

Terms:
 * Superatom: A cluster of atoms that has the same behavior as elemental atoms.
 * Prostaglandins: A group of lipid compounds found in a wide variety of tissues that are synthesized from essential fatty acids. Cells have several receptors for prostaglandins that lead to actions ranging from smooth muscle constriction to increasing spinal neurons' sensitivity to pain.

Relevance: MetaCyc is closely related to the material that we have been learning about in class because it is a comprehensive database that covers many of the same pathways, such as glycolysis I (http://biocyc.org/META/NEW-IMAGE?type=PATHWAY&object=GLYCOLYSIS).

The Scripps Center for Mass Spectrometry: Metabolomics Science Webpage
The main focus of the Scripps Center for Mass Spectrometry is to provide a user-friendly website for scientists in the field of metabolomics. It provides general information on analytical tools, timelines of metabolomics history, metabolomics events held around the world, databases of metabolic systems, and bioinformatics software.

Terms:
 * Pathophysiology: the physiology of abnormal or diseased organisms or their parts; the functional changes associated with a disease or syndrome.
 * Lipidomics: the study of lipids, covering not only their structures but also their functions and the modifications that occur under physiological and pathological conditions.
 * Exogenous: of or noting the metabolic assimilation of proteins or other metabolites, the elimination of nitrogenous catabolites being in direct proportion to the amount of metabolites taken in.
 * Xenobiotic: a chemical or substance that is foreign to an organism or biological system.
 * Paracetamol: the generic name (acetaminophen in some countries) for a common nonprescription medication useful in the treatment of mild pain or fever.
 * GC-MS: gas chromatography mass spectrometry
 * CE-MS: capillary electrophoresis mass spectrometry
 * FT-IR: Fourier transform infrared spectroscopy

Relevance: This website relates to the information that we have been studying in class because it is full of information about pathways and links to numerous databases. One such database is the KEGG Pathway Database, which contains the pathways involved in metabolism. It shows such pathways as glycolysis, gluconeogenesis, the citrate cycle, the pentose phosphate pathway, galactose metabolism, pyruvate metabolism, and hundreds more. The glycolysis pathway can be viewed at http://www.genome.jp/kegg/pathway/map/map00010.html. This website does a good job of showing how all the pathways are interconnected.

The Human Metabolome Database
The Human Metabolome Database is an extremely comprehensive, free electronic database that gives a detailed overview of human metabolites divided into chemical, clinical, and molecular biology/biochemistry data.

Terms:
 * Human Metabolome Project: The HMP is an ambitious Canadian project begun over 3 years ago with the ultimate goal of “identifying, quantifying and cataloging” every metabolite detectable in human tissue at concentrations greater than 1 micromolar.
 * Biofluids: A biological fluid such as urine, blood or sweat. In this database, metabolites can be categorized by their biofluid localization.
 * Chemical Class: A broad term used to categorize organic and inorganic chemicals based on common characteristics into groups such as amines and carbohydrates. The database can be browsed by chemical class.
 * Metabocard: The individual datasheets for metabolites in this database are called metabocards. Each one contains a detailed description, over 90 categories of data, and cited sources. An example metabocard for citric acid can be found at: http://hmdb.ca/scripts/show_card.cgi?METABOCARD=HMDB00094.txt
 * TOCSY: Total correlation spectroscopy, an NMR experiment in which magnetization transfer through chemical bonds is visualized both between adjacent protons and between protons connected through a chain of adjacent protons.

Relevance: The Human Metabolome Database is connected to our coursework by its extremely thorough data on all of the metabolites that we've been studying. Reaction intermediates and products such as glucose, 3-phosphoglycerate, and citrate can all be looked up, and everything from the 3D structure to associated disorders is provided.

KNApSAcK
KNApSAcK is a Java application that presents an interactive display of biochemical information that can be searched by organism or metabolite name. KNApSAcK focuses primarily on the origin and mass spectra of particular metabolites.

Terms:
 * JRE: The Java Runtime Environment is a set of free software programs used to run Java programs on users' computers.
 * Mw +- margin: A search parameter of KNApSAcK that allows a user to search for metabolites whose molecular weight lies within a given margin of a set value. For example, searching for MW: 100 with a margin of 2 would return all metabolites with a molecular weight between 98 and 102.
 * Phylum: The fourth taxonomic rank for classifying organisms, between kingdom and class. Cyanobacteria is one phylum. The database allows searches specifying any taxonomic rank, although the higher ranks take significant time to load.
 * m/z: The mass-to-charge ratio, a physical quantity that is used in the detailed examination of charged particles. It is a key aspect of mass spectrometry studies and the database focuses heavily on this data.

Relevance: KNApSAcK connects to our coursework because it allows for comparison of metabolites important to different organisms. One example search that was attempted was to see the metabolites shared by cyanobacteria and plants for photosynthesis.
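The "Mw +- margin" search described in the terms above can be illustrated with a short sketch. This is only an illustration of the idea of a molecular weight window; the record list, the names in it, and the function `search_by_mw` are hypothetical and do not reflect KNApSAcK's actual data or code.

```python
# Hypothetical records for illustration only; these are not KNApSAcK entries,
# and KNApSAcK's real data model and search code are not described in this chapter.
RECORDS = [
    {"name": "metabolite_a", "mw": 97.9},
    {"name": "metabolite_b", "mw": 98.0},
    {"name": "metabolite_c", "mw": 100.4},
    {"name": "metabolite_d", "mw": 102.0},
    {"name": "metabolite_e", "mw": 102.1},
]


def search_by_mw(records, target, margin):
    """Return names of records whose molecular weight is within target +- margin.

    Both ends of the window are treated as inclusive (an assumption).
    """
    low, high = target - margin, target + margin
    return [r["name"] for r in records if low <= r["mw"] <= high]


if __name__ == "__main__":
    # MW: 100 with a margin of 2 selects weights from 98 to 102.
    print(search_by_mw(RECORDS, 100, 2))
```

Whether the real search treats the boundaries as inclusive is not stated in the source, so the sketch simply picks one convention and says so.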

BRENDA
The BRENDA developers describe it as the main internet repository of functional enzyme data for the scientific community. An extremely robust system, it allows for searching of more than 4000 enzymes and provides comprehensive information on each of them, including indispensable reaction diagrams.

Terms:
 * ECTree: A term for the outline organization BRENDA uses to characterize related enzymes. The user manual shows an example ECTree for oxidoreductases.
 * TaxTree: TaxTree is the interactive display used by BRENDA to search for organisms by taxonomy. Once an organism or taxonomy designation is chosen, all of the enzymes in the database linked to it are displayed.
 * Substructure Search: The substructure search function allows a user to actually draw part of the enzyme structure in skeletal formula. All enzymes containing the component drawn are returned.
 * EC Explorer: A search function that allows the user to access enzyme information by several criteria including common name, reaction, and even history.
 * Systematic Name: A style of naming enzymes governed by the Enzyme Commission. Each enzyme is also assigned an EC number made up of four numbers, separated by periods, that classify its main class, subclass, sub-subclass and serial number.
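The four-part enzyme classification number mentioned above can be split into its components with a few lines of code. The sketch below is a minimal illustration, not part of BRENDA; the names `ECNumber`, `EC_MAIN_CLASSES` and `parse_ec` are made up for this example. The seven main class names follow the Enzyme Commission scheme.

```python
from typing import NamedTuple


class ECNumber(NamedTuple):
    """The four components of an EC number such as 2.7.1.1."""
    main_class: int
    subclass: int
    sub_subclass: int
    serial: int


# Top-level classes of the Enzyme Commission scheme.
EC_MAIN_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
    7: "Translocases",
}


def parse_ec(text):
    """Parse strings like 'EC 1.1.1.1' or '2.7.1.1' into an ECNumber."""
    cleaned = text.strip()
    if cleaned.upper().startswith("EC"):
        cleaned = cleaned[2:].strip()
    parts = cleaned.split(".")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"not a complete EC number: {text!r}")
    return ECNumber(*(int(p) for p in parts))


if __name__ == "__main__":
    ec = parse_ec("EC 1.1.1.1")  # alcohol dehydrogenase
    print(ec, EC_MAIN_CLASSES[ec.main_class])
```

The sketch deliberately rejects partial numbers such as "1.1.1.-", which some databases use for incompletely classified enzymes; handling those would be a separate design decision.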

Relevance: Information in this database reinforces what was covered in class. Material covered in class is also the foundation for the material in this database.

Reactome
The Reactome is a collaboration between Cold Spring Harbor Laboratory, the European Bioinformatics Institute, and the Gene Ontology Consortium to provide a curated database that catalogs core pathways and reactions in human biology. The Reactome obtains information from researchers with expertise in their fields, and the information is cross-validated by a Reactome editorial team that references other databases such as the NCBI, Ensembl, and UniProt. Alongside the human pathways and reactions, the Reactome also contains inferred data from 22 non-human species including mouse, rat, chicken, puffer fish, worm, fly, yeast, two plants and E. coli.

Current versions of the Reactome allow searching by keyword and also offer a more visual approach: researchers can view a map of much of the data housed in the database, then select reactions and zoom in on them from the top level.

Terms
 * Skypainter tool: A tool provided by the Reactome which allows a list of proteins or gene identifiers to be uploaded to color one of the reaction or pathway maps generated by the database.
 * Morbid map: A diagram showing chromosome location of genes that are known to be associated with disease.
 * Reactome Author Tool: Desktop application written in Java that is used to enter new data into the Reactome. It uses a graphical interface to make it easy to expand or add reactions and pathways.
 * BioPAX: an attempt at a common exchange format for biological pathway data.
 * SBML: Systems Biology Markup Language; a computer-readable format representing biochemical reaction networks.
 * PSI-MI: Proteomics Standards Initiative - Molecular Interactions; a standardized format that describes molecular interactions.
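Because SBML is an XML-based format, a reaction network stored in it can be read with a standard XML parser. The sketch below is only an illustration: the embedded document is a hand-written, heavily abbreviated fragment (a complete SBML model would also declare compartments and species), the reaction and species identifiers are made up, and the helper names `local_name` and `list_reactions` are hypothetical. It is not how the Reactome itself exports or reads data.

```python
import xml.etree.ElementTree as ET

# Hand-written, abbreviated SBML-style fragment for illustration only.
SBML_TEXT = """
<sbml xmlns="http://www.sbml.org/sbml/level2/version4" level="2" version="4">
  <model id="toy_model">
    <listOfReactions>
      <reaction id="hexokinase">
        <listOfReactants>
          <speciesReference species="glucose"/>
          <speciesReference species="ATP"/>
        </listOfReactants>
        <listOfProducts>
          <speciesReference species="glucose_6_phosphate"/>
          <speciesReference species="ADP"/>
        </listOfProducts>
      </reaction>
    </listOfReactions>
  </model>
</sbml>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def list_reactions(sbml_text):
    """Map each reaction id to a (reactants, products) pair of species ids."""
    root = ET.fromstring(sbml_text)
    reactions = {}
    for element in root.iter():
        if local_name(element.tag) != "reaction":
            continue
        sides = {"listOfReactants": [], "listOfProducts": []}
        for child in element:
            name = local_name(child.tag)
            if name in sides:
                sides[name] = [ref.get("species") for ref in child]
        reactions[element.get("id")] = (
            sides["listOfReactants"],
            sides["listOfProducts"],
        )
    return reactions


if __name__ == "__main__":
    for rid, (reactants, products) in list_reactions(SBML_TEXT).items():
        print(rid, ":", " + ".join(reactants), "->", " + ".join(products))
```

Matching on the local tag name keeps the sketch independent of the exact SBML level and version declared in the namespace.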

Relevance: The data housed in the Reactome database cover many of the pathways and reactions we have covered in the course, such as intermediary metabolism and regulatory pathways. Like many of the other metabolomics databases, it can be thought of almost like a textbook containing thousands of entries on metabolism and its associated events.

KEGG Pathway DB
The KEGG Pathway Database is a major component of the collection of databases that make up the Kyoto Encyclopedia of Genes and Genomes. The Pathway database is known for its extensive collection of metabolic pathways and its handling of their interconnections, as well as other non-metabolic cellular interactions. The database does an excellent job of integrating genomic, chemical and systemic functional information into an easily readable format.

Instead of new terms, here is a list of the metabolism subsections of the database:
 * 1.1 Carbohydrate Metabolism
 * 1.2 Energy Metabolism
 * 1.3 Lipid Metabolism
 * 1.4 Nucleotide Metabolism
 * 1.5 Amino Acid Metabolism
 * 1.6 Metabolism of Other Amino Acids
 * 1.7 Glycan Biosynthesis and Metabolism
 * 1.8 Biosynthesis of Polyketides and Nonribosomal Peptides
 * 1.9 Metabolism of Cofactors and Vitamins
 * 1.10 Biosynthesis of Secondary Metabolites
 * 1.11 Xenobiotics Biodegradation and Metabolism

BMRB, MMCD and the Sesame laboratory module
Several databases have recently been developed as metabolomics resources, some of them intended to assist in MS and NMR analyses. Among these are the BioMagResBank (BMRB), the Madison Metabolomics Consortium Database (MMCD) and a module for the Sesame laboratory information management system.

The BMRB comprises experimental spectral data for over 270 pure compounds. Each molecule entry includes five or six one- and two-dimensional NMR data sets, as well as compound source information, solution conditions, data collection protocol and the NMR pulse sequences. Database entries can be accessed by name, monoisotopic mass and chemical shift. Currently in development is an open access feature that will allow users to contribute their own data and so expand the BMRB.

The MMCD holds information on over 10,000 metabolites, primarily data collected from Arabidopsis metabolites. Users may make queries using MS and/or NMR spectra.

The Sesame laboratory module collects all metabolomics-based experimental protocols, background information, and data for a particular study.

Link to article:

http://psb.stanford.edu/psb-online/proceedings/psb07/markley.pdf

CellCircuits: a database of protein network models
General Overview: This article provides a rationale for the development of CellCircuits, an open-access database that focuses on molecular network models. The database covers models that have been derived computationally and posted in published journal articles. The article explains that the ultimate goal of the project is to bridge the gap between molecular databases, even those with unconfirmed data, and strictly regulated pathway databases. The body of the article explores not only the rationale of CellCircuits, but the computational process that went into developing it and some example results of molecular network models.

Terms:
 * GO Annotation: GO refers to the Gene Ontology project, which is a system of universal descriptions of genes across a broad variety of databases. The developers of CellCircuits have used GO to score genes in comparisons across databases.
 * Data Processing Pipeline: A pipeline is a construct used to carry data through a chain of software elements such as threads, scripts, and processes. CellCircuits uses pipelines to draw text information from input models for processing.
 * MySQL: MySQL is a relational database management system that runs on many platforms and is very popular for internet applications, Wikipedia being one example. CellCircuits is built with MySQL.
 * Scoring models: A scoring model is a system for comparing two sets of data. In CellCircuits, scoring models are used in conjunction with the GO database to compare sets of genes from input models.
 * Perl: Perl is a popular procedural programming language that borrows much of its syntax from C. The primary graphical interface of CellCircuits is written in Perl.

Relevance: This article relates to our coursework because it shows some of the dizzying heights of complexity involved in trying to collate the growing body of metabolomics data into a usable form for the general science community.

ProMEX: a mass spectral reference database for proteins and protein phosphorylation sites
General Overview: This article explains the development and use of the ProMEX mass spectral library database. The goal of the expanding database is to allow users to compare an unknown sample to a body of confirmed mass spectra for known proteins. The article explores some of the theory and algorithms that go into making that possible.

Terms:
 * Threshold: In comparing the mass spectrum of an unknown, user-provided sample against those in the database, the threshold value is the point below which mass spectra hits are ignored because they are not considered matching.
 * Metadata: A common database term that refers to the overlying information on an object rather than discrete points. The ProMEX developers use it to refer to the consensus of experimental results and mass spectrometric parameters.
 * AGI Codes: A uniform system of nomenclature to classify genes that was developed in 1999. AGI codes reference the organism, chromosome number, gene, and gene ID. The announcement of the original decision to create the system for Arabidopsis genes can be found at: http://mips.gsf.de/proj/thal/db/about/agicodes.html
 * LC-MS: Liquid chromatography-mass spectrometry is a data gathering process that combines the two techniques to allow for highly sensitive detection of specific chemicals. The ProMEX developers use LC-MS to distinguish even closely related samples.
 * CLR: Common Language Runtime is a virtual machine developed by Microsoft. It provides an execution environment for software programs on a variety of platforms. ProMEX's algorithms for comparing spectra run within the CLR.
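The threshold idea from the terms above can be sketched in code. The example below is not the ProMEX algorithm, which is not described in this summary; it substitutes a plain cosine similarity between binned spectra as the matching score, and every name and spectrum in it (`cosine_similarity`, `matches_above_threshold`, the toy library) is hypothetical.

```python
import math


def cosine_similarity(spec_a, spec_b):
    """Cosine similarity of two spectra given as {m/z bin: intensity} dicts."""
    shared = set(spec_a) & set(spec_b)
    dot = sum(spec_a[k] * spec_b[k] for k in shared)
    norm_a = math.sqrt(sum(v * v for v in spec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in spec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def matches_above_threshold(query, library, threshold):
    """Score the query against every library spectrum and keep hits that
    reach the threshold, best score first. Lower-scoring hits are ignored."""
    scored = [(name, cosine_similarity(query, spec)) for name, spec in library.items()]
    kept = [hit for hit in scored if hit[1] >= threshold]
    return sorted(kept, key=lambda hit: hit[1], reverse=True)


if __name__ == "__main__":
    # Toy spectra with made-up peaks, for illustration only.
    library = {
        "reference_same": {100: 10.0, 150: 5.0, 200: 1.0},
        "reference_other": {120: 8.0, 180: 3.0},
    }
    query = {100: 10.0, 150: 5.0, 200: 1.0}
    for name, score in matches_above_threshold(query, library, threshold=0.8):
        print(name, round(score, 3))
```

Raising the threshold makes the search stricter and returns fewer candidate matches; lowering it admits weaker matches at the cost of more false hits.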

Relevance: ProMEX is a relevant resource to our coursework because it shows how quickly the field of metabolomics is advancing. Using the search algorithms described in this article, users can now identify unknown proteins from experimental data in a quick and highly automated process.

Correcting ligands, metabolites, and pathways
General Overview: The authors of this article explain that their database, BioMeta, demonstrates the need to correct inaccurate pathways and chemical structures. After originally developing this database, they built tools to validate the data it contained by stereochemistry and stoichiometric outcomes, only to find that it had a high error rate. The article explains the creation of the database and validation tools and the steps the authors took to make corrections.

Terms
 * Xenobiotics: Chemicals that can be experimentally or clinically detected in an organism that is not capable of normally producing them, either at all or at the concentration at which they appear. The authors of the article use xenobiotics as an example of the staggering intricacy of metabolomics.
 * Reactants and Products: Although these are not new terms, their use in BioMeta is significant because the authors deliberately eschewed use of “substrate,” feeling that it was inappropriate for their purposes on the grounds that it can refer to either reactant or product of an enzyme and they are only interested in catalyzed reactions.
 * “Fuzzy” Synonyms: To deal with a lack of uniform nomenclature, BioMeta contains synonym tables that recognize many common names for compounds or pathways. If an initial search can't locate a synonym, it is referred to a table of fuzzy synonyms that strips out non-alphanumeric characters and capitalizes all letters for a looser, but still automated, comparison.
 * ElemCount: ElemCount is one field used by the BioMeta Compounds data table that covers the raw quantity of each element in a compound. Searches can be made with it specifying a minimum or maximum number.
 * Molfiles: Molfiles are minute structure descriptions of small compounds that can be quickly analyzed and validated by the developers' chemical structure software.
 * Canonicalization: The concept of recognizing several synonymous data references as a single reference. The validation tools used by the developers make heavy use of canonicalization at most steps to reduce repetitive comparisons and false error reports.
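The "fuzzy" synonym lookup described above (strip non-alphanumeric characters, capitalize everything, then compare) can be sketched as follows. This is a minimal illustration under assumptions, not BioMeta's code: the synonym names, the compound identifiers, and the functions `fuzzy_key`, `build_tables` and `lookup` are all hypothetical.

```python
def fuzzy_key(name):
    """Drop every non-alphanumeric character and upper-case the rest."""
    return "".join(ch for ch in name if ch.isalnum()).upper()


def build_tables(synonyms):
    """Build an exact table and a fuzzy table from {synonym: compound id}."""
    exact = dict(synonyms)
    fuzzy = {fuzzy_key(name): cid for name, cid in synonyms.items()}
    return exact, fuzzy


def lookup(name, exact, fuzzy):
    """Try the exact synonym table first, then fall back to the fuzzy table.

    Returns None when neither table knows the name.
    """
    if name in exact:
        return exact[name]
    return fuzzy.get(fuzzy_key(name))


if __name__ == "__main__":
    # Hypothetical synonyms and identifiers, for illustration only.
    exact, fuzzy = build_tables({
        "D-Glucose": "CPD_0001",
        "3-Phosphoglycerate": "CPD_0002",
    })
    print(lookup("D-Glucose", exact, fuzzy))            # exact hit
    print(lookup("d glucose", exact, fuzzy))            # fuzzy hit
    print(lookup("3-phospho-glycerate", exact, fuzzy))  # fuzzy hit
```

Mapping several spellings onto one key in this way is also a small instance of canonicalization: different references to the same compound collapse to a single identifier before any comparison is made.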

Relevance: This article is relevant to our coursework because it explains that, given the vast amount of metabolomics data and the speed at which it is growing, errors are inevitable. The authors offer some insight into how this problem can be corrected and the necessity of doing so.

Figure: The compound query window from the BioMeta database.

HMDB: The Human Metabolome Database
The Human Metabolome Database (HMDB) was established in 2004 with the explicit aim of cataloging the whole human metabolome, just as the Human Genome Project unraveled the mysteries behind our genetic code. This paper covers the information contained in the database, which includes compound description, synonyms, physico-chemical structure, disease association, pathway information, and NMR and MS spectra, among other things; each entry in the database contains 90 data fields filled with relevant information. The paper also serves as design documentation for the database, detailing how it was built with care to allow for efficient searching and explaining the quality control and curation of the database.

The HMDB is built upon a MySQL database that serves as the backend to the graphical web-page interface. Raw text found in the database is translated to HTML via special Perl scripts that also generate links and graphics. The MySQL database is part of a generalized metabolomic LIMS system called MetaboLIMS that utilizes Java to handle input and queries.

The robustness of the database allows researchers to search from many different angles, including by chemical structure, BLAST (single and multiple sequences), MS and NMR spectra, and Boolean text searches via GLIMPSE.

Terms
 * Biomarker: A biochemical feature that can be used to detect or measure a disease or the effects of a treatment.
 * Medical Informatics: A field of information science that primarily deals with the analysis and distribution of medical data through the use of computers. This data can be applied to different areas of health care and medicine.
 * GLIMPSE: Global Implicit Search; an indexing and query scheme for searching file systems.
 * SimCell: A metabolic simulation software package which models complex metabolic pathways at the cellular level with real-time movies of the enzymatic process.  These movies can also be graphed by the package.
 * Nutrigenomics: The study of how foods interact with genes, including interactions that increase the risk of chronic disease.

Relevance: The information housed in the HMDB can be traced through all of the coursework that has been covered so far. Many of the metabolites housed in the HMDB were directly discussed in both the textbook and the lectures covered in class. Of course, this is only a surface connection between the text and this paper, as the HMDB and other metabolic databases really encompass the majority of the metabolism world: they serve as a repository in which all past and future research data can be stored.

Toward Pathway Engineering: A New Database of Genetic and Molecular Pathways
The KEGG database was created with the sole purpose of providing a diagram of molecular and genetic interactions to aid in the understanding of biological systems. Its creation was fueled in part by the completion of the Human Genome Project as a way to take this massive amount of information and place it in the proper locations in a system. KEGG is connected to DNA and Protein databases by integration with the tool DBGET, which acts to search across databases.

Terms
 * DBGET: An integrated database retrieval tool to search across databases.
 * GenomeNet: A network that establishes an informatics framework for genome research and related areas.
 * Φx174: Small virus genome consisting of 11 genes; one of the first viruses sequenced.
 * Superfamilies: A classification scheme to group proteins.
 * Boehringer wall chart: A classic biological pathway chart.

Relevance: The KEGG database is just another entry in the long line of databases which sum up much of the metabolic pathway information we have learned in class.

=Articles and Web Pages for Review and Inclusion=

Nutritional Metabolomics Database

A Liquid Chromatography-Mass Spectrometry-Based Metabolome Database for Tomato

Plant Physiology 141:1205-1218 (2006)