An Introduction to Molecular Biology/Genetic Code

After the structure of DNA was discovered by James Watson and Francis Crick, who used the experimental evidence of Maurice Wilkins and Rosalind Franklin (among others), serious efforts to understand how proteins are encoded began. In 1954, George Gamow postulated that a three-letter code must be employed to encode the 20 standard amino acids that living cells use to build proteins. Three is the smallest integer n such that 4^n is at least 20.

The fact that codons consist of three DNA bases was first demonstrated in the Crick, Brenner et al. experiment. The first elucidation of a codon was done by Marshall Nirenberg and Heinrich J. Matthaei in 1961 at the National Institutes of Health. They used a cell-free system to translate a poly-uracil RNA sequence (i.e., UUUUU...) and discovered that the polypeptide they had synthesized consisted only of the amino acid phenylalanine. They thereby deduced that the codon UUU specifies the amino acid phenylalanine. This was followed by experiments in the laboratory of Severo Ochoa demonstrating that the poly-adenine RNA sequence (AAAAA...) coded for the polypeptide poly-lysine, and that the poly-cytosine RNA sequence (CCCCC...) coded for the polypeptide poly-proline. Therefore the codon AAA specifies the amino acid lysine, and the codon CCC specifies the amino acid proline. Using different copolymers, most of the remaining codons were then determined. Extending this work, Nirenberg and Philip Leder revealed the triplet nature of the genetic code and made it possible to decipher the codons of the standard genetic code. In these experiments, various combinations of mRNA were passed through a filter that contained ribosomes, the components of cells that translate RNA into protein. Unique triplets promoted the binding of specific tRNAs to the ribosome. Leder and Nirenberg were able to determine the sequences of 54 out of 64 codons in their experiments.

Subsequent work by Har Gobind Khorana identified the rest of the genetic code. Shortly thereafter, Robert W. Holley determined the structure of transfer RNA (tRNA), the adapter molecule that facilitates the process of translating RNA into protein. This work was based upon earlier studies by Severo Ochoa, who received the Nobel Prize in 1959 for his work on the enzymology of RNA synthesis. In 1968, Khorana, Holley and Nirenberg received the Nobel Prize in Physiology or Medicine for their work.

Origin of genetic code
There are many theories about the origin of the genetic code. The genetic code used by all known forms of life is nearly universal. However, there is a huge number of possible genetic codes. If amino acids are randomly associated with triplet codons, there are about 1.5 × 10^84 possible genetic codes. Phylogenetic analysis of transfer RNA suggests that tRNA molecules evolved before the present set of aminoacyl-tRNA synthetases.
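One way to arrive at a figure of this size is to count the ways of assigning 64 codons to 21 meanings (20 amino acids plus a stop signal) so that every meaning receives at least one codon. That reading is an assumption, since the text does not say how the number was derived, and the function name count_onto_codes is purely illustrative. A minimal sketch using inclusion-exclusion:

```python
from math import comb


def count_onto_codes(n_codons: int = 64, n_meanings: int = 21) -> int:
    """Count assignments of codons to meanings in which every meaning is used.

    Assumption (not stated in the text): a "possible genetic code" is any
    mapping of the 64 codons onto 20 amino acids plus stop that leaves no
    meaning without a codon. Inclusion-exclusion subtracts the mappings
    that miss at least one meaning.
    """
    return sum(
        (-1) ** k * comb(n_meanings, k) * (n_meanings - k) ** n_codons
        for k in range(n_meanings + 1)
    )


if __name__ == "__main__":
    # Printed in scientific notation for comparison with the figure in the text.
    print(f"{count_onto_codes():.2e}")
```

Under that assumption the count is on the order of 10^84. Dropping the requirement that every meaning is used gives 21^64, which is somewhat larger.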

Theoretically the genetic code could be completely random (a "frozen accident"), completely non-random (optimal) or a combination of random and nonrandom. There are sufficient data to refute the first possibility. For a start, a quick look at the table of the genetic code already shows a clustering of amino acid assignments. Furthermore, amino acids that share the same biosynthetic pathway tend to have the same first base in their codons, and amino acids with similar physical properties tend to have similar codons.

There are four themes running through the many theories that seek to explain the evolution of the genetic code (and hence the origin of these patterns):

1. Chemical principles govern specific RNA interaction with amino acids. Aptamer experiments showed that some amino acids have a selective chemical affinity for the base triplets that code for them. Recent experiments show that of the 8 amino acids tested, 6 show some RNA triplet-amino acid association. This has been called the stereochemical code. The stereochemical code could have created an ancient core of assignments. The current complex translation mechanism involving tRNA and associated enzymes may be a later development; originally, protein sequences may have been directly templated on base sequences.

2. Biosynthetic expansion. The standard modern genetic code grew from a simpler earlier code through a process of "biosynthetic expansion". Here the idea is that primordial life "discovered" new amino acids (e.g., as by-products of metabolism) and later back-incorporated some of these into the machinery of genetic coding. Although much circumstantial evidence has been found to suggest that fewer different amino acids were used in the past than today, precise and detailed hypotheses about exactly which amino acids entered the code in exactly what order have proved far more controversial.

3. Natural selection has led to codon assignments of the genetic code that minimize the effects of mutations. A recent hypothesis suggests that the triplet code was derived from codes that used codons longer than triplets. Decoding with longer codons has a higher degree of codon redundancy and is more error resistant than triplet decoding. This feature could allow accurate decoding in the absence of highly complex translational machinery such as the ribosome.

4. Information channels: Information-theoretic approaches see the genetic code as an error-prone information channel. The inherent noise (i.e. errors) in the channel confronts the organism with a fundamental question: how to construct a genetic code that can withstand the impact of noise while accurately and efficiently translating information? These “rate-distortion” models suggest that the genetic code originated as a result of the interplay of three conflicting evolutionary forces: the need for diverse amino acids, for error tolerance and for minimal cost of resources. The code emerges at a coding transition when the mapping of codons to amino acids becomes nonrandom. The emergence of the code is governed by the topology defined by the probable errors and is related to the map coloring problem.

Summary of Khorana’s research
Ribonucleic acid (RNA) with two repeating units (UCUCUCU → UCU CUC UCU) produced two alternating amino acids. This, combined with the Nirenberg and Leder experiment, showed that UCU codes for serine and CUC codes for leucine. RNAs with three repeating units (UACUACUA → UAC UAC UAC, or ACU ACU ACU, or CUA CUA CUA) produced three different strings of amino acids, one for each reading frame. RNAs with four repeating units including UAG, UAA, or UGA produced only dipeptides and tripeptides, thus revealing that UAG, UAA and UGA are stop codons. To carry out this work, Khorana was also the first to synthesize oligonucleotides, that is, strings of nucleotides. With this, Khorana and his team established that the mother of all codes, the biological language common to all living organisms, is spelled out in three-letter words: each set of three nucleotides codes for a specific amino acid. The Nobel lecture was delivered on December 12, 1968.
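The reading-frame logic behind these repeating-copolymer experiments can be illustrated with a short sketch. The function name codons_in_frame and its parameters are hypothetical, chosen only for this example; the code simply splits a repeated sequence into triplets starting at a given offset.

```python
def codons_in_frame(unit: str, frame: int, n_codons: int = 6) -> list[str]:
    """Return the first n_codons triplets read from a repeating RNA polymer.

    unit     -- the repeating unit, e.g. "UC" or "UAC"
    frame    -- reading-frame offset (0, 1 or 2) at which reading starts
    """
    length = frame + 3 * n_codons
    # Repeat the unit often enough, then trim to the exact length needed.
    rna = (unit * (length // len(unit) + 1))[:length]
    return [rna[i:i + 3] for i in range(frame, length, 3)]


if __name__ == "__main__":
    # A two-base repeat alternates between two codons in every frame.
    print(codons_in_frame("UC", 0))
    # A three-base repeat gives a single repeated codon per frame,
    # and a different codon in each of the three frames.
    for frame in range(3):
        print(frame, set(codons_in_frame("UAC", frame)))
```

A two-base repeat therefore yields a polypeptide of two alternating amino acids, while a three-base repeat yields three different homopolymers depending on where reading begins.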

Degeneracy of the genetic code
Degeneracy is the redundancy of the genetic code. The genetic code has redundancy but no ambiguity (see the codon table for the full correlation). For example, although codons GAA and GAG both specify glutamic acid (redundancy), neither of them specifies any other amino acid (no ambiguity). The codons encoding one amino acid may differ in any of their three positions. For example, the amino acid glutamic acid is specified by GAA and GAG codons (difference in the third position), the amino acid leucine is specified by UUA, UUG, CUU, CUC, CUA, CUG codons (difference in the first or third position), while the amino acid serine is specified by UCA, UCG, UCC, UCU, AGU, AGC (difference in the first, second or third position).

A position of a codon is said to be a fourfold degenerate site if any nucleotide at this position specifies the same amino acid. For example, the third position of the glycine codons (GGA, GGG, GGC, GGU) is a fourfold degenerate site, because all nucleotide substitutions at this site are synonymous; i.e., they do not change the amino acid. Only the third positions of some codons may be fourfold degenerate. A position of a codon is said to be a twofold degenerate site if only two of four possible nucleotides at this position specify the same amino acid. For example, the third position of the glutamic acid codons (GAA, GAG) is a twofold degenerate site. In twofold degenerate sites, the equivalent nucleotides are always either two purines (A/G) or two pyrimidines (C/U), so only transversional substitutions (purine to pyrimidine or pyrimidine to purine) in twofold degenerate sites are nonsynonymous.

A position of a codon is said to be a non-degenerate site if any mutation at this position results in amino acid substitution. There is only one threefold degenerate site where changing to three of the four nucleotides may have no effect on the amino acid (depending on what it is changed to), while changing to the fourth possible nucleotide always results in an amino acid substitution. This is the third position of an isoleucine codon: AUU, AUC, or AUA all encode isoleucine, but AUG encodes methionine. In computation this position is often treated as a twofold degenerate site.
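The site classes described above can be checked directly against the standard codon table. The sketch below builds the table from a compact 64-letter string (bases ordered U, C, A, G at each position, one-letter amino acid codes, "*" for stop); the names CODON_TABLE and site_degeneracy are illustrative, not taken from any library.

```python
BASES = "UCAG"
# Standard genetic code; codons ordered UUU, UUC, UUA, UUG, UCU, ... GGG.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def site_degeneracy(codon: str, position: int) -> int:
    """Count the nucleotides at `position` (0, 1 or 2) that keep the meaning.

    4 -> fourfold degenerate, 2 -> twofold, 3 -> threefold,
    1 -> non-degenerate (every substitution changes the amino acid).
    """
    meaning = CODON_TABLE[codon]
    return sum(
        CODON_TABLE[codon[:position] + base + codon[position + 1:]] == meaning
        for base in BASES
    )


if __name__ == "__main__":
    for codon in ("GGA", "GAA", "AUU", "AUG"):
        print(codon, CODON_TABLE[codon], site_degeneracy(codon, 2))
```

Applied to the third position, this reproduces the examples in the text: glycine codons are fourfold degenerate, glutamic acid codons twofold, isoleucine codons threefold, and AUG non-degenerate.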

There are three amino acids encoded by six different codons: serine, leucine, and arginine. Only two amino acids are specified by a single codon. One of these is the amino acid methionine, specified by the codon AUG, which also specifies the start of translation; the other is tryptophan, specified by the codon UGG. The degeneracy of the genetic code is what accounts for the existence of synonymous mutations.
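These counts can be verified by tallying how many codons map to each meaning in the standard table. The following is a minimal sketch; the variable names are illustrative, and the table is written as a 64-letter string with bases ordered U, C, A, G at each position.

```python
from collections import Counter

BASES = "UCAG"
# Standard genetic code, one-letter amino acid codes, "*" for stop.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Number of codons assigned to each amino acid (and to stop).
codons_per_meaning = Counter(CODON_TABLE.values())

six_codon = sorted(aa for aa, n in codons_per_meaning.items() if n == 6)
single_codon = sorted(aa for aa, n in codons_per_meaning.items() if n == 1)

if __name__ == "__main__":
    print("six codons:", six_codon)      # L, R, S = leucine, arginine, serine
    print("one codon:", single_codon)    # M, W = methionine, tryptophan
```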

Degeneracy results because there are more codons than encodable amino acids. For example, if there were two bases per codon, then only 16 amino acids could be coded for (4²=16). Because at least 21 codes are required (20 amino acids plus stop), and the next largest number of bases is three, then 4³ gives 64 possible codons, meaning that some degeneracy must exist.
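The arithmetic in this argument can be written out as a small sketch. The function name min_codon_length is hypothetical; it finds the shortest codon length whose number of possible codons covers the required number of meanings.

```python
def min_codon_length(n_meanings: int, alphabet_size: int = 4) -> int:
    """Smallest n such that alphabet_size ** n >= n_meanings."""
    n = 1
    while alphabet_size ** n < n_meanings:
        n += 1
    return n


if __name__ == "__main__":
    n = min_codon_length(21)          # 20 amino acids plus stop
    spare = 4 ** n - 21               # codons left over, i.e. forced redundancy
    print(n, 4 ** n, spare)
```

With four bases and 21 meanings, two-base codons give only 16 combinations, so three bases are needed; the 64 resulting codons leave 43 more codons than meanings, which is why some degeneracy is unavoidable.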

These properties of the genetic code make it more fault-tolerant for point mutations. For example, in theory, fourfold degenerate codons can tolerate any point mutation at the third position, although codon usage bias restricts this in practice in many organisms; twofold degenerate codons can tolerate one out of the three possible point mutations at the third position. Since transition mutations (purine to purine or pyrimidine to pyrimidine mutations) are more likely than transversion (purine to pyrimidine or vice-versa) mutations, the equivalence of purines or that of pyrimidines at twofold degenerate sites adds a further fault-tolerance.
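To make the fault-tolerance argument concrete, the sketch below enumerates every single-base change at the third codon position in the standard code and reports what fraction is synonymous for transitions and for transversions. It is an illustration under simplifying assumptions: all 64 codons are weighted equally, stop is treated as one more meaning, and codon usage and real mutation rates are ignored. All names are illustrative.

```python
BASES = "UCAG"
PURINES = {"A", "G"}
# Standard genetic code, one-letter amino acid codes, "*" for stop.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def is_transition(old: str, new: str) -> bool:
    """Purine<->purine or pyrimidine<->pyrimidine change."""
    return (old in PURINES) == (new in PURINES)


def synonymous_fraction_third_position() -> dict[str, float]:
    """Fraction of third-position substitutions that keep the meaning."""
    tallies = {"transition": [0, 0], "transversion": [0, 0]}  # [synonymous, total]
    for codon, meaning in CODON_TABLE.items():
        for base in BASES:
            if base == codon[2]:
                continue
            kind = "transition" if is_transition(codon[2], base) else "transversion"
            tallies[kind][0] += CODON_TABLE[codon[:2] + base] == meaning
            tallies[kind][1] += 1
    return {kind: syn / total for kind, (syn, total) in tallies.items()}


if __name__ == "__main__":
    print(synonymous_fraction_third_position())
```

Under these assumptions, third-position transitions are synonymous far more often than third-position transversions, which is the pattern the paragraph above describes.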

Despite the redundancy of the genetic code, single point mutations can still cause dysfunctional proteins. For example, a mutated hemoglobin gene causes sickle-cell disease. In the mutant hemoglobin a hydrophilic glutamate (Glu) is substituted by the hydrophobic valine (Val), that is, GAA or GAG becomes GUA or GUG. The substitution of glutamate by valine reduces the solubility of β-globin, which causes hemoglobin to form linear polymers linked by the hydrophobic interaction between the valine groups, causing sickle-cell deformation of erythrocytes. Sickle-cell disease is generally not caused by a de novo mutation. Rather it is selected for in geographic regions where malaria is common (in a way similar to thalassemia), as heterozygous people have some resistance to the malarial Plasmodium parasite (heterozygote advantage).

This variability in the codons for a given amino acid is possible because of modified bases in the first position of the anticodon of the tRNA; the base pair formed is called a wobble base pair. Examples include pairs involving the modified base inosine and the non-Watson-Crick U-G base pair.

Initiation or Start Codon
The start codon is generally defined as the point in the sequence at which a ribosome begins to translate RNA into amino acids. When an RNA transcript is "read" by the ribosome from its 5' end to its 3' end, the start codon is the first codon at which the methionine-charged tRNA and the ribosomal subunits assemble. ATG and AUG denote the DNA and RNA sequences, respectively, of the start codon (or initiation codon), which encodes the amino acid methionine (Met) in eukaryotes and a modified Met (fMet) in prokaryotes. The central dogma of molecular biology describes the flow of information from a gene to a protein. Specific sequences of DNA act as a template to synthesize mRNA in a process termed "transcription", which in eukaryotes takes place in the nucleus. This mRNA is exported from the nucleus into the cytoplasm of the cell and acts as a template to synthesize protein in a process called "translation". Three nucleotide bases specify one amino acid in the genetic code, a mapping carried out by the tRNAs of the organism. The first three bases of the coding sequence (CDS) of an mRNA are called the start codon or initiation codon. The start codon is almost always preceded by a 5' untranslated region (5' UTR). The start codon is typically AUG (ATG in DNA). Non-AUG start codons are very rarely used in eukaryotes. In prokaryotes, alternative start codons, mainly GUG and UUG, are used in addition to AUG. For example, E. coli uses 83% ATG (AUG), 14% GTG (GUG), 3% TTG (UUG) and one or two others (e.g., ATT and CTG).

Termination or Stop codon


In the genetic code, a stop codon (also known as a termination codon) is a nucleotide triplet within messenger RNA that signals the termination of translation. Proteins are based upon polypeptides, which are unique sequences of amino acids, and most codons in messenger RNA correspond to the addition of an amino acid to a growing polypeptide chain, which may ultimately become a protein. Stop codons signal the termination of this process, releasing the amino acid chain.
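How start and stop codons delimit a coding sequence can be sketched in a few lines. The example is deliberately simplified and rests on stated assumptions: translation begins at the first AUG found in the string (real initiation also depends on sequence context), only the standard code is used, and the function name translate_orf is hypothetical.

```python
BASES = "UCAG"
# Standard genetic code, one-letter amino acid codes, "*" for stop.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate_orf(mrna: str) -> str:
    """Translate from the first AUG up to (not including) the first stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon, nothing is translated
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break  # stop codon: release the chain
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    # 5' UTR "GG", then AUG UUU AAA UAG, then trailing bases.
    print(translate_orf("GGAUGUUUAAAUAGCC"))
```

In the usage example the bases before AUG play the role of the 5' untranslated region, and the codons UUU and AAA are read as phenylalanine and lysine before UAG ends translation.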

Stop codons were historically given many different names, as they each corresponded to a distinct class of mutants that all behaved in a similar manner. These mutants were first isolated within bacteriophages (T4 and lambda), viruses that infect the bacterium Escherichia coli. Mutations in viral genes weakened their infectious ability, sometimes creating viruses that were able to infect and grow within only certain varieties of E. coli.

1. Amber mutations were the first set of nonsense mutations to be discovered. They were isolated by Richard Epstein and Charles Steinberg, but named after their friend Harris Bernstein (see Edgar, pp. 580-581, for the story behind this incident).

Viruses with amber mutations are characterized by their ability to infect only certain strains of bacteria, known as amber suppressors. These bacteria carry their own mutation that allows a recovery of function in the mutant viruses. For example, a mutation in the tRNA that recognizes the amber stop codon allows translation to "read through" the codon and produce full-length protein, thereby recovering the normal form of the protein and "suppressing" the amber mutation. Thus, amber mutants are an entire class of virus mutants that can grow in bacteria that contain amber suppressor mutations.

2. Ochre mutations were the second class of stop codon mutations to be discovered. Given a color name to match the name of amber mutants, ochre mutant viruses had a similar property in that they recovered infectious ability within certain suppressor strains of bacteria. The set of ochre suppressors was distinct from amber suppressors, so ochre mutants were inferred to correspond to a different nucleotide triplet. Through a series of mutation experiments comparing these mutants with each other and other known amino acid codons, Sydney Brenner concluded that the amber and ochre mutations corresponded to the nucleotide triplets "UAG" and "UAA", respectively.

3. Opal mutations or umber mutations. The third and last stop codon in the standard genetic code was discovered soon after, corresponding to the nucleotide triplet "UGA". Nonsense mutations that created this premature stop codon were later called opal mutations or umber mutations.

In RNA: UAG ("amber") UAA ("ochre") UGA ("opal")

In DNA: TAG ("amber") TAA ("ochre") TGA ("opal" or "umber").

Facts to be remembered
Exceptions to the genetic code: Although the vast majority of living organisms today use the standard genetic code, geneticists have discovered a few variations on this code. Moreover, these variants are found in different evolutionary lineages and consist of different translations of a few codons.

The CUG codon, usually translated as leucine, is translated as serine in many species of the fungal genus Candida.

Many species of green algae of the genus Acetabularia use the codons UAG and UAA, which are stop codons in the standard code, to encode glutamine.

Many ciliates, such as Paramecium tetraurelia, Tetrahymena thermophila or Stylonychia lemnae, use codons UAG and UAA to code for glutamine instead of stop. UGA is the only stop codon used by these cells.

The ciliate Euplotes octocarinatus uses the codon UGA to encode cysteine, leaving UAG and UAA as stop signals.

In all three domains of life, a twenty-first amino acid, selenocysteine, is sometimes found, encoded by the UGA codon (normally a stop codon).

In archaea and eubacteria, a twenty-second amino acid, pyrrolysine, is sometimes found, encoded by UAG (also usually a stop codon).

The first amino acid incorporated (determined by the start codon AUG) is a methionine in most eukaryotes, more rarely a valine (in some eukaryotes), and formyl-methionine in most prokaryotes. In addition, the start codon is sometimes GUG or UUG in some prokaryotes.

It is therefore thought that life originally used a smaller number of amino acids. These amino acids were later modified and their number increased (by a phenomenon similar to the formation of selenocysteine and pyrrolysine, which are derived from serine and lysine, respectively, and are modified while attached to their transfer RNA). These new amino acids then took over a subset of transfer RNAs and their associated codons. Signs of this phenomenon may still be visible in glutamine, which in some bacteria is derived from glutamate still attached to its tRNA.

Another exception: the code is sometimes ambiguous. For example, within the same organism (Escherichia coli, for example), the codon UGA can code either for the 21st amino acid mentioned above (selenocysteine) or for "stop".
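Several of the variant codes listed above differ from the standard code in only a few codons, so they can be represented as small sets of overrides applied to the standard table. The sketch below does this for three of the reassignments named in the text; the function and variable names are hypothetical, and context-dependent cases such as selenocysteine at UGA cannot be captured by a simple lookup table of this kind.

```python
BASES = "UCAG"
# Standard genetic code, one-letter amino acid codes, "*" for stop.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
STANDARD_CODE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def variant_code(overrides: dict[str, str]) -> dict[str, str]:
    """Return a copy of the standard code with some codons reassigned."""
    table = dict(STANDARD_CODE)
    table.update(overrides)
    return table


def stop_codons(table: dict[str, str]) -> list[str]:
    """List the codons that still mean stop in the given table."""
    return sorted(codon for codon, meaning in table.items() if meaning == "*")


# Reassignments taken from the list above.
CANDIDA_CODE = variant_code({"CUG": "S"})               # CUG: serine, not leucine
CILIATE_CODE = variant_code({"UAG": "Q", "UAA": "Q"})   # glutamine instead of stop
EUPLOTES_CODE = variant_code({"UGA": "C"})              # cysteine instead of stop

if __name__ == "__main__":
    for name, table in [("standard", STANDARD_CODE), ("Candida", CANDIDA_CODE),
                        ("ciliate", CILIATE_CODE), ("Euplotes", EUPLOTES_CODE)]:
        print(name, table["CUG"], stop_codons(table))
```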