Structural Biochemistry/Bioinformatics/Homology

Homology is the concept that nucleic acid or protein sequences from different organisms resemble one another because they descend from a common ancestral sequence. The term was coined by Richard Owen in 1843. In practice, homology is inferred by aligning two amino acid sequences or two DNA sequences and assigning point values to the identical or similar residues that line up. This type of analysis is useful for determining relationships between species and helps to trace ancestral descent as well as the evolutionary changes that have occurred over time in a given set of species. Techniques for assessing the probability that two sequences are homologous have become a major area of focus for bioinformaticians around the world.

Homologous sequences are of two major types: orthologous and paralogous. Homologs are said to be orthologous if they were separated by a speciation event. Orthologous genes are found in different species but resemble each other because they originate from the same ancestral gene, and they often have the same function. Paralogs are genes that were separated by a gene duplication event; they often have related but distinct functions. The genes that encode hemoglobin and myoglobin are considered paralogs, and hemoglobins A, A2, B, and F are also paralogs of one another.

Misuse of the Term
The term “homology” is often misused when describing protein or nucleic acid sequences, because “homology is a concept of quality and cannot be ‘quantified’ ”. In a recent analysis, the term “homology” was searched in the 2007 PubMed database, and 1966 abstracts contained the word in either the title or the abstract, after discarding those that used it as part of the name of a protein or procedure. Of these abstracts, 57% (1128) used the term properly while 43% (828) used it incorrectly. Incorrect usage includes pairing the term with a percentage value or with qualifiers such as “high”, “low”, and “significant”. The same analysis of abstracts in the 1986 database shows that the frequency of misuse has decreased slightly since then.

The analysis was also performed across languages. In the 1986 search of articles containing the term, an overall lower percentage of articles misused it. However, as more countries have expanded their scientific research, more articles have been produced in those countries, and a greater percentage of them misuse the term homology. The article "When It Comes to Homology, Bad Habits Die Hard" advocates a solution: scientific journals should promote guidelines on the proper usage of common terminology, and new researchers in those countries should be educated on that terminology.

The misuse of the term homology is considered a problem because it confuses readers trying to understand the author's intention. For example, an author may state that two proteins are homologous while also stating that they do not share the same evolutionary origin, which contradicts the definition of homology. An author may also state that two peptide chains are homologous without discussing whether they share an evolutionary origin at all. Authors were also found using the term as evidence that proteins share an evolutionary origin (e.g. "The fact that these proteins are homologous proves that they are from the same evolutionary chain"), which is circular because homology already means common origin.

An example of the difference between homology and similarity is the comparison of human DNA with chimpanzee DNA and with mouse DNA. Mice are said to share about 97.5% of their DNA with humans, and chimpanzees more than 98.0%. These percentages describe similarity, which can be measured and ranked. Homology is the separate, yes-or-no conclusion that two sequences descend from a common ancestor, and a percentage alone does not establish it. It is therefore correct to say that human and chimpanzee DNA sequences are more than 98.0% similar and that they are homologous, but not that they are "98% homologous".

Orthologs
Orthologs are genes in different species that descend from a single gene in the last common ancestor of those species; they often retain the same function. The term ortholog stems from the Greek root "ortho", meaning "straight" or "exact", and was coined by Walter Fitch in 1970. When a speciation event splits one species into two, the copies of a single ancestral gene carried by each new species diverge, and the resulting sequences are orthologous.

An example of orthologous genes is the pair of genes that code for hemoglobin in cows and in humans. Mapping orthologs helps biologists construct more detailed and specific evolutionary trees, so taxonomy and phylogenetic studies benefit from orthologous sequences. Shared function alone does not establish orthology, however. A bird and a bat belong to different species and their wings have the same function, but wings arose independently in the two lineages, so the similarity in function is not by itself evidence of common descent.

Paralogs
Paralogs are homologous gene sequences found within the same species that typically exhibit different functions. Paralogs are usually the product of gene duplication, which can be caused by a number of mechanisms such as transposon activity or unequal crossing-over. The duplicated genes initially have similar functions, and one or both copies can then mutate further and take on other functions, which results in paralogs.

The number of differences, or substitutions, between two paralogs is roughly proportional to the time that has passed since the gene was duplicated, which sheds light on the way genomes evolve. Myoglobin and hemoglobin are considered ancient paralogs that evolved from a common ancestral gene.

The genes that encode hemoglobin and myoglobin are thought to be paralogs: the two proteins have similar structures but differ in their oxygen-carrying roles. There are four known classes of hemoglobins (hemoglobin A, hemoglobin A2, hemoglobin B, and hemoglobin F), which are all paralogs of one another. Other examples of paralogs are actin and Hsp-70. Their tertiary structures are similar but their functions are different; actin is part of the cytoskeleton, while Hsp-70 is a heat shock protein.

Sequence Alignments Detect Homologs
To test whether two molecules are homologous, their nucleic acid or protein sequences are examined for matches. Although both kinds of sequence can be used, protein sequences are usually preferable because they are built from 20 different building blocks (amino acids), while DNA and RNA are each composed of only four nucleotides. A significant number of matches between protein sequences is therefore much stronger evidence of common ancestry than the same number of matches between nucleic acid sequences. In addition, because of redundancy in the genetic code, different codons can encode the same amino acid (e.g. GCU, GCC, GCA, and GCG all code for alanine), which makes protein comparison more sensitive and more useful for detecting similarities in protein function than DNA or RNA comparison.
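The effect of codon redundancy can be illustrated with a short sketch. The two RNA strings below are hypothetical, the codon table is deliberately partial (only the four alanine codons named above), and the chance-match figures assume that all letters are equally frequent, which real sequences are not.

```python
# Partial codon table: only the four alanine codons named in the text.
CODON_TABLE = {"GCU": "A", "GCC": "A", "GCA": "A", "GCG": "A"}

def translate(rna):
    """Translate an RNA string codon by codon; unknown codons become 'X'."""
    usable = len(rna) - len(rna) % 3
    codons = [rna[i:i + 3] for i in range(0, usable, 3)]
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)

def fraction_identical(a, b):
    """Fraction of aligned positions that match (equal-length inputs)."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Hypothetical sequences: different codons, same encoded protein.
rna_1 = "GCUGCCGCA"
rna_2 = "GCGGCAGCC"

nucleotide_identity = fraction_identical(rna_1, rna_2)
protein_identity = fraction_identical(translate(rna_1), translate(rna_2))

# Chance of a match at one position under the equal-frequency assumption.
chance_match_nucleotide = 1 / 4
chance_match_amino_acid = 1 / 20

print(nucleotide_identity, protein_identity)
```

The nucleotide comparison differs at every third position, while the translated sequences are identical, which is the sense in which the protein comparison is the more sensitive one.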

Two protein sequences can be compared by counting how many of their amino acids match when the sequences are aligned directly above each other or when one sequence is slid past the other. For instance, amino acid 1 of the top sequence can sit directly above amino acid 1 of the bottom sequence, or the bottom sequence can be slid to the left or right so that different amino acids line up. The number of matches is then plotted against the offset to find the alignment at which the maximum number of matches occurs. It is important to understand that a large number of matches does not by itself mean that the two proteins are homologs.
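The sliding comparison can be sketched as follows. The function names and the two short peptide strings are hypothetical and serve only to show how matches are counted at each offset.

```python
def matches_at_offset(top, bottom, offset):
    """Count identical residues when `bottom` is shifted right by `offset`
    positions relative to `top` (negative offsets shift it left)."""
    count = 0
    for i, residue in enumerate(top):
        j = i - offset  # index in `bottom` that sits under top[i]
        if 0 <= j < len(bottom) and residue == bottom[j]:
            count += 1
    return count

def match_profile(top, bottom):
    """Map every offset with at least one overlapping position to its match count."""
    offsets = range(-(len(bottom) - 1), len(top))
    return {k: matches_at_offset(top, bottom, k) for k in offsets}

# Hypothetical sequences in one-letter amino acid code.
top = "MKVLAAGT"
bottom = "VLAAG"

profile = match_profile(top, bottom)
best_offset = max(profile, key=profile.get)

for offset in sorted(profile):
    print(offset, profile[offset])
```

Plotting `profile` against the offset gives the match plot described above; here the peak is at the offset where `bottom` lines up with the `VLAAG` stretch of `top`.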

To account for mutations such as insertions and deletions, gaps may be inserted to create a better match. If two different offsets each give a good match over part of the sequences, a gap can be inserted so that both matching regions are aligned at once. Scientists then score the alignment, for example +10 points for each match and -25 points for each gap regardless of its size. This score is compared against the distribution of scores obtained by randomly shuffling one of the protein sequences many times and scoring it against the other, to check that the matches are not simply due to chance. If the real score lies far outside the bulk of the shuffled scores, the two proteins are probably homologs. However, a low score does not rule out homology.
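A minimal sketch of this scoring scheme and of the shuffle test is given below. The aligned rows are hypothetical. For simplicity the shuffle test scores each shuffled sequence position by position without realigning or inserting gaps, which is an assumption of this sketch and not part of the procedure described above.

```python
import random

MATCH_SCORE = 10    # points per identical pair, as in the text
GAP_PENALTY = -25   # points per gap, regardless of its length

def score_alignment(row_a, row_b):
    """Score two aligned rows of equal length; '-' marks a gap position."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    score = 0
    previous_gap_row = None
    for x, y in zip(row_a, row_b):
        gap_row = "a" if x == "-" else "b" if y == "-" else None
        if gap_row is not None:
            if gap_row != previous_gap_row:  # a new gap starts here
                score += GAP_PENALTY
        elif x == y:
            score += MATCH_SCORE
        previous_gap_row = gap_row
    return score

def ungapped_score(seq_a, seq_b):
    """Simplified scorer for the shuffle test: matches at identical positions."""
    return MATCH_SCORE * sum(x == y for x, y in zip(seq_a, seq_b))

def shuffle_test(seq_a, seq_b, score_fn, trials=1000, seed=0):
    """Return the real score and the fraction of shuffles scoring at least as high."""
    rng = random.Random(seed)
    real = score_fn(seq_a, seq_b)
    residues = list(seq_b)
    at_least_as_high = 0
    for _ in range(trials):
        rng.shuffle(residues)
        if score_fn(seq_a, "".join(residues)) >= real:
            at_least_as_high += 1
    return real, at_least_as_high / trials

gapped = score_alignment("MKV--LAAGT", "MKVCCLAAGT")  # 8 matches, 1 gap
real_score, fraction = shuffle_test("MKVLAAGTWPHE", "MKVLAAGTWPHE", ungapped_score)
print(gapped, real_score, fraction)
```

A small `fraction` means that shuffled sequences rarely reach the real score, which is the situation the text describes as the score deviating from the majority of the scores.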

Homolog Sequencing Technology: Matrices


Scores may be calculated using identity or substitution matrices, and the comparison can be made more precise by combining the chosen matrix with gaps that further match the sequences. Examples of matrices include PAM and BLOSUM (substitution matrices), GONNET (a matrix based on evolutionary distance), the DNA identity matrix, and the DNA PUPY matrix. Overall, substitution matrices are the most sensitive for protein sequences, and they make it possible to detect distant evolutionary relationships.

If two sequences are at least 25% identical, the two proteins can generally be taken to be homologous. However, sequences with less than 25% identity are not necessarily non-homologous. For example, if protein A is homologous to protein B (based on their percent identity) and protein B is homologous to protein C, then A and C are likely to have similarities in function even if they are only 15% identical.

Identity matrices assign a value of one to matches between sequences and zero to non-matches. This method does not distinguish between likely and rare mutations and therefore does not give a clear answer about homology. Substitution matrices account for conservative mutations, which are less likely to be deleterious or to seriously change function, such as switching glycine and alanine, by scoring them more favorably than non-conservative mutations. In other words, substitution matrices give identical residues the highest scores, but unlike identity matrices they also assign values to pairs in which one amino acid is substituted by another with similar properties. The more similar the two amino acids, the larger the score the pair receives; the more different they are, or the rarer the substitution (for example A replaced by P), the more negative the score becomes. By distinguishing between the different types of mutations, better matches can be made and alignments based on random chance are avoided.

Identity Matrix : Identity matrices use scores of one and zero: a match between identical amino acids or nucleotides scores one, and any mismatch scores zero. This is not very meaningful, because the scores obtained after random shuffling may fall in the same range as the original score.

GONNET : Gonnet matrices use “exhaustive pair-wise alignments” of proteins and measure the distances between them to estimate alignments. This produces a new distance matrix, which is used to refine the alignment score. These matrices indicate whether proteins were derived from close or distant homologs. This type of matrix was developed in 1992 by Gonnet with the help of Cohen and Benner.

DNA PUPY : DNA PUPY matrices score purine-purine and pyrimidine-pyrimidine substitutions (transitions) separately from other mismatches. They are believed to be helpful in finding primers for PCR.
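The idea behind a purine/pyrimidine matrix can be sketched as below. The three score values are hypothetical, chosen only to show that a substitution within the same chemical class is penalized less than one between classes; they are not taken from any published PUPY matrix.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

# Hypothetical score values, for illustration only.
MATCH = 2
SAME_CLASS = 1        # purine-purine or pyrimidine-pyrimidine substitution
DIFFERENT_CLASS = -1  # purine-pyrimidine substitution

def pupy_score(x, y):
    """Score one pair of DNA bases by identity and chemical class."""
    for base in (x, y):
        if base not in PURINES and base not in PYRIMIDINES:
            raise ValueError(f"unexpected base: {base!r}")
    if x == y:
        return MATCH
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return SAME_CLASS if same_class else DIFFERENT_CLASS

def score_dna(seq_a, seq_b):
    """Sum pair scores over two equal-length, ungapped DNA sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length")
    return sum(pupy_score(x, y) for x, y in zip(seq_a, seq_b))

print(score_dna("ACGT", "GCTT"))
```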

PAM : Point accepted mutation (PAM) matrices are a set of matrices used for scoring sequence alignments. PAM was introduced in 1978 by Margaret Dayhoff, an American physical chemist and bioinformatics pioneer. PAM is used to build a scoring matrix for assessing whether two genes or proteins are homologous. The matrix is normalized so that PAM1 gives substitution probabilities for sequences that have undergone 1 point mutation per 100 amino acids. The most commonly used is PAM250, where the probabilities correspond to 250 point mutations per 100 amino acids.
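The relationship between PAM1 and PAM250 can be sketched with a toy example. PAM250 probabilities are conventionally obtained by multiplying the PAM1 probability matrix by itself 250 times. The matrix below is a made-up three-letter alphabet, not Dayhoff's data, and is normalized so that 1% of positions change per step.

```python
def mat_mul(p, q):
    """Multiply two square matrices given as lists of rows."""
    n = len(p)
    return [[sum(p[i][k] * q[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_power(p, steps):
    """Raise a square matrix to a non-negative integer power by repeated multiplication."""
    n = len(p)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(steps):
        result = mat_mul(result, p)
    return result

# Hypothetical "PAM1" for a three-letter alphabet: each row sums to 1,
# and each letter has a 1% chance of changing in one step.
TOY_PAM1 = [
    [0.990, 0.005, 0.005],
    [0.005, 0.990, 0.005],
    [0.005, 0.005, 0.990],
]

toy_pam250 = mat_power(TOY_PAM1, 250)
unchanged_after_250 = toy_pam250[0][0]
print(round(unchanged_after_250, 4))
```

Even after 250 steps the probability that a position shows its original letter stays above one third in this toy model, because repeated and reverse mutations at the same position are included in the count. This is why "250 point mutations per 100 amino acids" does not mean that every residue differs.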

BLOSUM 62 : BLOSUM 62 is the most commonly used substitution matrix. A program that performs sequence alignment with it was developed by the National Center for Biotechnology Information (NCBI) and is available online. This substitution matrix tallies points for different amino acid pairs and accounts not only for identity but also for conservation (how similar one amino acid is to another, so that a substitution does not induce a dramatic change in the function of the protein) and frequency (how often the amino acid appears in protein sequences). The matrix gives the highest scores to identical amino acids but also awards points based on similarity. For example, isoleucine paired with valine is given a relatively high score because, although the two amino acids are not identical, they are similar in that both are hydrophobic.
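The difference between identity scoring and substitution scoring can be sketched with a small excerpt of a substitution matrix. The entries below cover only four amino acids (A, I, V, P) and are intended to match the commonly distributed BLOSUM62 table; they should be checked against an authoritative copy before being relied on. The helper names are hypothetical.

```python
# Excerpt for alanine (A), isoleucine (I), valine (V), and proline (P).
# Each unordered pair is stored once, with the letters in alphabetical order.
SUBSTITUTION_EXCERPT = {
    ("A", "A"): 4, ("A", "I"): -1, ("A", "P"): -1, ("A", "V"): 0,
    ("I", "I"): 4, ("I", "P"): -3, ("I", "V"): 3,
    ("P", "P"): 7, ("P", "V"): -2,
    ("V", "V"): 4,
}

def identity_score(x, y):
    """Identity matrix: one for a match, zero for any mismatch."""
    return 1 if x == y else 0

def substitution_score(x, y):
    """Look up a pair in the excerpt, ignoring the order of the two residues."""
    return SUBSTITUTION_EXCERPT[tuple(sorted((x, y)))]

def score_pairs(seq_a, seq_b, pair_score):
    """Sum a pair-scoring function over two equal-length, ungapped sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length")
    return sum(pair_score(x, y) for x, y in zip(seq_a, seq_b))

# Hypothetical three-residue peptides.
by_identity = score_pairs("AIV", "AVP", identity_score)
by_substitution = score_pairs("AIV", "AVP", substitution_score)
print(by_identity, by_substitution)
```

Under identity scoring the isoleucine-valine pair and the valine-proline pair both count as plain mismatches, while the substitution excerpt rewards the conservative pair and penalizes the other.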

Homology Modeling
The primary goal of homology modeling is to predict the structure of a macromolecule. X-ray crystallography and NMR are the principal experimental ways to obtain detailed structural information; however, these techniques involve elaborate procedures, and many proteins fail to crystallize or cannot be obtained or dissolved in adequate quantities for NMR analysis. In such cases, building a model on the basis of the known three-dimensional structure of a homologous protein is the most reliable way to obtain structural information about the unknown protein. These are the main steps in homology modeling:

1.	Finding homologous proteins in protein database files (the template): Template selection is a critical step in homology modeling. Template identification can be aided by database search techniques.

2. Creation of the alignment, using single or multiple sequence alignments.

When more than one known sequence is involved, the known sequences are first aligned with each other, and the unknown sequence is then aligned with the group; this helps ensure better domain conservation. Furthermore, the alignment can be corrected by the insertion or deletion of gaps. Although introducing gaps complicates the alignment, methods have been developed that use scoring systems to compare different alignments and penalize gaps to prevent unreasonable insertions. Scoring an alignment involves the construction of identity matrices and substitution matrices. Substitution matrices are believed to be the best; these methods are based on analyzing how frequently a given amino acid is observed to be replaced by other amino acids among proteins whose sequences can be aligned.

3.	Model generation: The information contained in the template and alignment can be used to generate a three-dimensional structural model of the protein, which is represented as a set of Cartesian coordinates.

4.	Model Refinement: The major sources of error in homology modeling are poor template selection and inaccurate template-target sequence alignment. These errors can be reduced by using multiple sequence alignments and structural alignments.
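The role of gap penalties in the alignment step (step 2) can be sketched with a standard dynamic-programming global alignment (Needleman-Wunsch), which the text does not name and which is used here only as one common choice. The scoring values are hypothetical, each gap position is penalized separately, and a simple match/mismatch rule stands in for a substitution matrix.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of `a` and `b` with a per-position gap penalty."""
    rows, cols = len(a) + 1, len(b) + 1
    # dp[i][j] is the best score for aligning a[:i] with b[:j].
    dp = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dp[i][0] = i * gap  # a[:i] aligned against gaps only
    for j in range(1, cols):
        dp[0][j] = j * gap  # b[:j] aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            dp[i][j] = max(
                dp[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                dp[i - 1][j] + gap,       # gap in b
                dp[i][j - 1] + gap,       # gap in a
            )
    return dp[-1][-1]

# Hypothetical sequences: `shifted` is `original` moved by one position.
original = "ACGTACGT"
shifted = "CGTACGTA"

mild_penalty = global_alignment_score(original, shifted, gap=-2)
harsh_penalty = global_alignment_score(original, shifted, gap=-10)
print(mild_penalty, harsh_penalty)
```

With the mild penalty the best alignment opens two gaps to line up seven matching positions. With the harsh penalty the gapped alignment costs more than it gains, so the ungapped alignment with eight mismatches becomes the best available. This is how a penalty prevents unreasonable insertions.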